Align observability with OpenTelemetry at the edges: GenAI semconv naming + post-hoc OTLP export #40

Description

@samkeen

Context

Before investing further in the observability dashboard (live viewer metrics + timeline + filtered tail), we asked whether open standards — specifically OpenTelemetry — should shape the work so we don't reinvent wire formats and vocabularies.

Decision: OTel at the edges, not at the core. events.jsonl stays the canonical, self-contained source of truth; OTel alignment happens in naming and in a derived export channel.

Current state

The event log is already trace-shaped by design: every event carries a trace_id (32 hex characters) and a span_id (16 hex characters) in OTel shape, per session.py's module docstring ("Lets downstream tools (Phoenix, Langfuse, Braintrust) ingest events as a trace").
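As an illustration of what "OTel shape" means for these two fields, here is a minimal sketch that generates and checks ids of the right width. The helper names are hypothetical and are not taken from session.py; the lowercase-hex requirement and the rejection of all-zero ids follow the W3C Trace Context rules that OTel adopts.

```python
import re
import secrets

# 16 random bytes -> 32 hex chars (trace), 8 random bytes -> 16 hex chars (span).
TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
SPAN_ID_RE = re.compile(r"[0-9a-f]{16}")


def new_trace_id() -> str:
    """Return a random trace id as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random span id as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def is_otel_shaped(event: dict) -> bool:
    """Check that an event's trace_id / span_id have valid OTel widths."""
    trace_id = event.get("trace_id", "")
    span_id = event.get("span_id", "")
    if not (isinstance(trace_id, str) and isinstance(span_id, str)):
        return False
    return (
        TRACE_ID_RE.fullmatch(trace_id) is not None
        and SPAN_ID_RE.fullmatch(span_id) is not None
        # All-zero ids are reserved as "invalid" and must be rejected.
        and set(trace_id) != {"0"}
        and set(span_id) != {"0"}
    )
```

A check like this could run over events.jsonl before any export, so malformed ids are caught locally rather than by a collector.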

Standard status as of June 2026:

  • GenAI client spans are stable: gen_ai.request.model, gen_ai.usage.input_tokens / output_tokens, gen_ai.response.finish_reasons.
  • Agent spans are still experimental (the level where Tilth's tasks / iterations / verdicts live).
  • Overall GenAI semconv is still marked Development.

So: model-call vocabulary is safe to align with now; agent-level vocabulary is still moving and not worth chasing.
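The model-call alignment can be expressed as a small mapping table. The sketch below is only an assumption about how that might look: prompt_tokens and eval_tokens are the current names mentioned in this issue, while "model" and "finish_reason" are hypothetical payload field names chosen for illustration.

```python
# Hypothetical events.jsonl payload field -> gen_ai.* attribute.
# Only prompt_tokens / eval_tokens are names taken from this issue;
# "model" and "finish_reason" are assumed for the example.
FIELD_TO_SEMCONV = {
    "model": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "eval_tokens": "gen_ai.usage.output_tokens",
    "finish_reason": "gen_ai.response.finish_reasons",
}


def to_genai_attributes(payload: dict) -> dict:
    """Translate known payload fields to semconv attribute names.

    Fields without a mapping are dropped here; a real exporter would
    more likely keep them under a custom namespace.
    """
    attrs = {}
    for field, attr in FIELD_TO_SEMCONV.items():
        if field not in payload:
            continue
        value = payload[field]
        # finish_reasons is an array attribute in the semconv, so wrap scalars.
        if attr == "gen_ai.response.finish_reasons" and not isinstance(value, list):
            value = [value]
        attrs[attr] = value
    return attrs
```

For example, `to_genai_attributes({"prompt_tokens": 120, "eval_tokens": 45})` yields the two `gen_ai.usage.*` keys and nothing else.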

Why not rebuild the core on OTel

  • events.jsonl is the product, not plumbing: replay fidelity (live view byte-identical to replay), chat reconstruction including nudges/reasoning, zero-infra single-file inspection. OTLP export is async and lossy by design; replaying from a collector requires infrastructure.
  • The interesting semantics (case/verdict, rejection categories, ledger, iteration accounting) have no semconv home — they'd be custom attributes regardless.
  • Generic trace UIs won't render the worker↔eval dialogue as a conversation; that layer stays ours either way.
  • The OTel SDK is a real dependency tree against a stdlib-first repo — and it isn't needed: OTLP has a stable JSON encoding over HTTP, so a post-hoc converter is stdlib-only (urllib).
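To make the stdlib-only claim concrete, here is a minimal sketch of posting an OTLP/JSON payload with urllib. The endpoint is the conventional OTLP/HTTP default (port 4318, path /v1/traces); whether a given collector listens there is an assumption, and the function names are hypothetical.

```python
import json
import urllib.request

# Conventional OTLP/HTTP traces endpoint; adjust for the actual collector.
DEFAULT_ENDPOINT = "http://localhost:4318/v1/traces"


def build_otlp_request(
    payload: dict, endpoint: str = DEFAULT_ENDPOINT
) -> urllib.request.Request:
    """Build (but do not send) an OTLP/JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_otlp(
    payload: dict, endpoint: str = DEFAULT_ENDPOINT, timeout: float = 10.0
) -> int:
    """Send the payload and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a
    caller that wants to report failures should catch it.
    """
    request = build_otlp_request(payload, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the first half testable without a running collector.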

Plan

  1. Naming alignment, opportunistically. When a payload schema is next touched, prefer semconv names (input_tokens/output_tokens over prompt_tokens/eval_tokens). Until then, maintain a documented mapping table (events.jsonl field → gen_ai.* attribute). No big-bang rename — it would ripple through summary.py, the viewer, and SUMMARY_VERSION for zero user-visible gain.
  2. tilth export-otel <session_id> — a derived channel (same principle as summary.json: derived, never a second store) that converts a finished session's events.jsonl to OTLP/JSON and POSTs it to a collector endpoint. No SDK, no change to the loop. Validate against a local Jaeger all-in-one.
  3. Settle the trace hierarchy first. Today trace_id is per-task, so a session would export as N disconnected traces. OTel wants the session as the trace root with tasks as child spans. Decide deliberately (session-level trace id + parent links, or task traces with a session resource attribute) before the exporter lands.
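One of the two hierarchy options above (session-level trace id with parent links) could look roughly like the following sketch. It is not a decision, only an illustration: the task dict keys (name, span_id, trace_id, start_ns, end_ns), the "tilth.*" attribute names, the scope name, and the hash-derived ids are all assumptions. The envelope shape (resourceSpans, scopeSpans, hex-encoded ids, nanosecond timestamps as strings) follows the OTLP/JSON encoding.

```python
import hashlib


def _hex_id(seed: str, nbytes: int) -> str:
    """Derive a deterministic hex id so re-exports produce the same trace."""
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()[: nbytes * 2]


def _attr(key: str, value) -> dict:
    """Encode one attribute as an OTLP/JSON string key-value pair."""
    return {"key": key, "value": {"stringValue": str(value)}}


def session_to_otlp(session_id: str, tasks: list) -> dict:
    """Wrap per-task spans under one synthetic session root span."""
    if not tasks:
        raise ValueError("session has no tasks to export")
    trace_id = _hex_id("trace:" + session_id, 16)
    root_span_id = _hex_id("span:" + session_id, 8)
    root = {
        "traceId": trace_id,
        "spanId": root_span_id,
        "name": "session",
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(min(t["start_ns"] for t in tasks)),
        "endTimeUnixNano": str(max(t["end_ns"] for t in tasks)),
        "attributes": [_attr("session.id", session_id)],
    }
    children = [
        {
            "traceId": trace_id,  # shared session-level trace id
            "spanId": t["span_id"],
            "parentSpanId": root_span_id,
            "name": t["name"],
            "kind": 1,
            "startTimeUnixNano": str(t["start_ns"]),
            "endTimeUnixNano": str(t["end_ns"]),
            # Keep the original per-task trace id so nothing is lost.
            "attributes": [_attr("tilth.task.trace_id", t["trace_id"])],
        }
        for t in tasks
    ]
    return {
        "resourceSpans": [
            {
                "resource": {"attributes": [_attr("service.name", "tilth")]},
                "scopeSpans": [
                    {
                        "scope": {"name": "tilth.export_otel"},
                        "spans": [root, *children],
                    }
                ],
            }
        ]
    }
```

Deriving the session ids from a hash keeps the export a pure function of events.jsonl, in line with the "derived, never a second store" principle. The alternative option (task traces plus a session resource attribute) would skip the root span and leave each task's trace_id untouched.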

Non-goals

  • OTel SDK inside the harness loop.
  • Replacing events.jsonl or the built-in viewer with a collector/backend.
  • Chasing the experimental agent-span conventions while they churn.
