Add Zipkin tracing parity#2217
Conversation
Greptile SummaryThis PR adds end-to-end OpenTelemetry/Zipkin tracing across the Helm chart and service execution paths: a new
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/service/services/proxy.py | Adds inject_trace_context to all outbound proxy calls; unconditionally strips existing W3C trace headers before injecting current span context, silently breaking propagation on sidecar/config-GET routes that have no active span. |
| nemo_retriever/src/nemo_retriever/service/tracing.py | New tracing helpers module: configure_tracing, start_span (_SafeSpanContext), inject/extract context, attribute sanitization. A partial-setup edge case can leave a shut-down provider as the global tracer. |
| nemo_retriever/src/nemo_retriever/service/routers/ingest.py | Wraps main ingest-document/page handlers in _start_accept_span; adds create_job span + trace_id surfacing; ingest tracing coverage is correct. |
| nemo_retriever/src/nemo_retriever/api/internal/primitives/nim/nim_client.py | Adds nim.infer spans per batch, propagates trace context into gRPC/HTTP headers; introduces _same_trace_context batching gate (dynamic-batching throughput impact noted in prior review). |
| nemo_retriever/src/nemo_retriever/service/services/pipeline_executor.py | Propagates trace context into child worker processes; calls configure_tracing and force_flush in each worker invocation. |
| nemo_retriever/src/nemo_retriever/service/services/pipeline_pool.py | Adds pool.{name}.process span per work item with queue-wait latency attribute; adds trace_context and enqueued_at_monotonic_s to WorkItem. |
| nemo_retriever/helm/templates/_helpers.tpl | Adds 279 lines of Helm helpers for Zipkin/OTel env generation with fail-fast field validation. |
| nemo_retriever/helm/templates/deployment-zipkin.yaml | New Zipkin Deployment template with resource requests/limits, liveness/readiness/startup probes all present in values.yaml defaults. |
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/service/services/proxy.py:127-128
**W3C trace headers unconditionally stripped on proxy routes without an active span**
`inject_trace_context(fwd_headers)` first deletes every existing `traceparent` / `tracestate` key from the carrier, then calls `_TRACE_CONTEXT_PROPAGATOR.inject` — which injects nothing when there is no valid active span. Proxy routes that are **not** wrapped by `_start_accept_span` (sidecar upload/delete at lines 1271–1273 and 1333, and pipeline-config GET at line 1399) fall into this path: any caller that propagates a W3C `traceparent` header will have it silently stripped from the forwarded request, breaking the distributed-trace chain for those endpoints. The ingest-document and ingest-page routes are safe because `_start_accept_span` ensures an active span before forwarding, but the sidecar and config-GET paths have no equivalent guard.
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/service/tracing.py:96-107
**Shut-down provider left as global tracer after partial-setup failure**
When `trace.set_tracer_provider(provider)` succeeds but the `is not provider` identity check immediately after raises a `RuntimeError`, `_cleanup_partial_tracing_setup` is called with `processor_added=True`. It calls `provider.shutdown()`, which deactivates the exporter — but the `trace` module still holds a reference to this now-shut-down `TracerProvider` as the global. `_CONFIGURED_PROVIDER` is never set. On the next call to `configure_tracing`, `_global_tracer_provider_is_already_configured()` returns `True` (the global is still a `TracerProvider` instance), so the function returns `True` without re-initialising, leaving tracing permanently non-functional for the lifetime of the process without any log message to diagnose it. The window is narrow (a true concurrent race on `set_tracer_provider`), but if it fires during startup the service silently loses all span export.
Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/tracing-z..." | Re-trigger Greptile
Description
This adds Zipkin parity and broader trace-context propagation across the Helm deployment and service execution paths.
Compatibility note:
topology.zipkin.enabledintentionally remainstrueby default to preserve the managed trace-backend behavior shipped by the legacy 26.1.2 Helm chart. Upgrades with default values may create a chart-owned Zipkin Deployment and Service; settopology.zipkin.enabled=falsebefore upgrading if your deployment uses an external backend or should not run chart-owned Zipkin. Context: #2217 (comment)Validation:
uv run pytest tests/test_helm_tracing_zipkin.py tests/test_nim_tracing.py tests/test_service_tracing.py tests/test_service_proxy_tracing.py tests/test_service_pool_tracing.py tests/test_service_pipeline_tracing.py tests/test_service_ingest_router.py tests/test_service_ingest_async.py tests/test_service_job_tracker.py tests/test_dashboard_static_trace_ui.py(152 passed).Checklist