Skip to content

Add Zipkin tracing parity#2217

Open
charlesbluca wants to merge 63 commits into
NVIDIA:mainfrom
charlesbluca:codex/tracing-zipkin-parity-spec
Open

Add Zipkin tracing parity#2217
charlesbluca wants to merge 63 commits into
NVIDIA:mainfrom
charlesbluca:codex/tracing-zipkin-parity-spec

Conversation

@charlesbluca

@charlesbluca charlesbluca commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Description

This adds Zipkin parity and broader trace-context propagation across the Helm deployment and service execution paths.

  • Adds Helm support for Zipkin collector deployment, service wiring, endpoint overrides, and chart-wide tracing env generation guards.
  • Keeps tracing observability-only while carrying trace context through gateway proxy requests, service work, pipeline pools/executors, and NIM HTTP/internal inference calls.
  • Surfaces trace IDs through job/service responses and dashboard views so users can correlate service work with collected traces.
  • Documents the Zipkin tracing workflow and adds focused tests for Helm rendering, tracing helpers, NIM tracing, proxy/pool/executor propagation, ingest routing, job tracking, and dashboard trace UI.

Compatibility note: topology.zipkin.enabled intentionally remains true by default to preserve the managed trace-backend behavior shipped by the legacy 26.1.2 Helm chart. Upgrades with default values may create a chart-owned Zipkin Deployment and Service; set topology.zipkin.enabled=false before upgrading if your deployment uses an external backend or should not run chart-owned Zipkin. Context: #2217 (comment)

Validation: uv run pytest tests/test_helm_tracing_zipkin.py tests/test_nim_tracing.py tests/test_service_tracing.py tests/test_service_proxy_tracing.py tests/test_service_pool_tracing.py tests/test_service_pipeline_tracing.py tests/test_service_ingest_router.py tests/test_service_ingest_async.py tests/test_service_job_tracker.py tests/test_dashboard_static_trace_ui.py (152 passed).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@charlesbluca charlesbluca changed the title [codex] Add Zipkin tracing parity Add Zipkin tracing parity Jun 8, 2026
@charlesbluca charlesbluca marked this pull request as ready for review June 8, 2026 17:09
@charlesbluca charlesbluca requested review from a team as code owners June 8, 2026 17:09
@charlesbluca charlesbluca requested a review from edknv June 8, 2026 17:09
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds end-to-end OpenTelemetry/Zipkin tracing across the Helm chart and service execution paths: a new tracing.py helper module, W3C trace context propagation through gateway proxy, pipeline pool/executor, and NIM HTTP/gRPC inference calls, plus trace-ID surfacing in job responses and the dashboard.

  • Helm: new Zipkin Deployment + Service templates gated on topology.zipkin.enabled (defaults true for 26.1.2 parity, documented); _helpers.tpl gains 279 lines of OTel/Zipkin env-generation helpers with fail-fast field validation; per-NIM otel blocks wired into NIMService env.
  • Service tracing core (tracing.py): process-wide OTLP setup via configure_tracing, _SafeSpanContext swallows all span failures so tracing stays observability-only, attribute sanitization redacts credentials/content keys, inject_trace_context / extract_trace_context for W3C carrier propagation.
  • Propagation: create_job captures and stores a trace_id returned to callers; _start_accept_span gates ingest-document/page handlers; pool worker spans carry queue-wait latency; pipeline executor propagates context into child processes and force-flushes before exit; NIM client wraps each batch and HTTP retry in a span.

Confidence Score: 4/5

Safe to merge with one fix: proxy routes for sidecar upload/delete and pipeline-config GET silently drop any caller-supplied W3C traceparent header because inject_trace_context strips existing headers before injecting the current span, and those routes have no active span to inject.

The ingest-document and ingest-page gateway paths are correctly wrapped in _start_accept_span before forwarding, so the main tracing flows work as intended. The issue is isolated to three secondary proxy routes where no span is created, causing inject_trace_context to strip the caller's traceparent without replacement. Everything else — the tracing module, Helm templates, NIM client spans, pool/executor propagation, job trace-ID surfacing, and the 152 passing tests — looks correct and well-structured.

nemo_retriever/src/nemo_retriever/service/services/proxy.py — the _inject_trace_context calls in forward() and forward_get() need a guard to preserve existing W3C headers when no span is active.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/service/services/proxy.py Adds inject_trace_context to all outbound proxy calls; unconditionally strips existing W3C trace headers before injecting current span context, silently breaking propagation on sidecar/config-GET routes that have no active span.
nemo_retriever/src/nemo_retriever/service/tracing.py New tracing helpers module: configure_tracing, start_span (_SafeSpanContext), inject/extract context, attribute sanitization. A partial-setup edge case can leave a shut-down provider as the global tracer.
nemo_retriever/src/nemo_retriever/service/routers/ingest.py Wraps main ingest-document/page handlers in _start_accept_span; adds create_job span + trace_id surfacing; ingest tracing coverage is correct.
nemo_retriever/src/nemo_retriever/api/internal/primitives/nim/nim_client.py Adds nim.infer spans per batch, propagates trace context into gRPC/HTTP headers; introduces _same_trace_context batching gate (dynamic-batching throughput impact noted in prior review).
nemo_retriever/src/nemo_retriever/service/services/pipeline_executor.py Propagates trace context into child worker processes; calls configure_tracing and force_flush in each worker invocation.
nemo_retriever/src/nemo_retriever/service/services/pipeline_pool.py Adds pool.{name}.process span per work item with queue-wait latency attribute; adds trace_context and enqueued_at_monotonic_s to WorkItem.
nemo_retriever/helm/templates/_helpers.tpl Adds 279 lines of Helm helpers for Zipkin/OTel env generation with fail-fast field validation.
nemo_retriever/helm/templates/deployment-zipkin.yaml New Zipkin Deployment template with resource requests/limits, liveness/readiness/startup probes all present in values.yaml defaults.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/service/services/proxy.py:127-128
**W3C trace headers unconditionally stripped on proxy routes without an active span**

`inject_trace_context(fwd_headers)` first deletes every existing `traceparent` / `tracestate` key from the carrier, then calls `_TRACE_CONTEXT_PROPAGATOR.inject` — which injects nothing when there is no valid active span. Proxy routes that are **not** wrapped by `_start_accept_span` (sidecar upload/delete at lines 1271–1273 and 1333, and pipeline-config GET at line 1399) fall into this path: any caller that propagates a W3C `traceparent` header will have it silently stripped from the forwarded request, breaking the distributed-trace chain for those endpoints. The ingest-document and ingest-page routes are safe because `_start_accept_span` ensures an active span before forwarding, but the sidecar and config-GET paths have no equivalent guard.

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/service/tracing.py:96-107
**Shut-down provider left as global tracer after partial-setup failure**

When `trace.set_tracer_provider(provider)` succeeds but the `is not provider` identity check immediately after raises a `RuntimeError`, `_cleanup_partial_tracing_setup` is called with `processor_added=True`. It calls `provider.shutdown()`, which deactivates the exporter — but the `trace` module still holds a reference to this now-shut-down `TracerProvider` as the global. `_CONFIGURED_PROVIDER` is never set. On the next call to `configure_tracing`, `_global_tracer_provider_is_already_configured()` returns `True` (the global is still a `TracerProvider` instance), so the function returns `True` without re-initialising, leaving tracing permanently non-functional for the lifetime of the process without any log message to diagnose it. The window is narrow (a true concurrent race on `set_tracer_provider`), but if it fires during startup the service silently loses all span export.

Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/tracing-z..." | Re-trigger Greptile

Comment thread nemo_retriever/tests/test_dashboard_static_trace_ui.py
Comment thread nemo_retriever/helm/values.yaml Outdated
Comment thread nemo_retriever/helm/values.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant