Context
Before investing further in the observability dashboard (live viewer metrics + timeline + filtered tail), we asked whether open standards — specifically OpenTelemetry — should shape the work so we don't reinvent wire formats and vocabularies.
Decision: OTel at the edges, not at the core. events.jsonl stays the canonical, self-contained source of truth; OTel alignment happens in naming and in a derived export channel.
Current state
The event log was already designed trace-shaped: every event carries trace_id (32-hex) and span_id (16-hex) in OTel shape, per session.py's module docstring ("Lets downstream tools (Phoenix, Langfuse, Braintrust) ingest events as a trace").
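As a minimal sketch of what "OTel shape" means for these two fields: a trace id is 16 bytes rendered as 32 lowercase hex characters, a span id is 8 bytes rendered as 16, and OTel treats all-zero ids as invalid. The helper names below are hypothetical and are not taken from session.py.

```python
import re
import secrets

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
SPAN_ID_RE = re.compile(r"[0-9a-f]{16}")


def new_trace_id() -> str:
    # 16 random bytes -> 32 lowercase hex characters.
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes -> 16 lowercase hex characters.
    return secrets.token_hex(8)


def is_otel_shaped(event: dict) -> bool:
    """Check that an event's ids have the OTel length and alphabet."""
    trace_id = event.get("trace_id", "")
    span_id = event.get("span_id", "")
    return (
        isinstance(trace_id, str)
        and isinstance(span_id, str)
        and TRACE_ID_RE.fullmatch(trace_id) is not None
        and SPAN_ID_RE.fullmatch(span_id) is not None
        # All-zero ids are reserved as "invalid" in OTel.
        and set(trace_id) != {"0"}
        and set(span_id) != {"0"}
    )
```

A check like this could run in a test over a sample events.jsonl to catch id drift before any exporter depends on it.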
Standard status as of June 2026:
- GenAI client spans are stable: gen_ai.request.model, gen_ai.usage.input_tokens / output_tokens, gen_ai.response.finish_reasons.
- Agent spans are still experimental (the level where Tilth's tasks / iterations / verdicts live).
- Overall GenAI semconv is still marked Development.
So: model-call vocabulary is safe to align with now; agent-level vocabulary is still moving and not worth chasing.
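Aligning with the model-call vocabulary can start as a pure lookup, without renaming anything in the log. The sketch below is an illustration only: prompt_tokens and eval_tokens are the current field names mentioned in this note, while the "model" and "finish_reason" keys and both helper names are assumptions.

```python
# Hypothetical mapping table: events.jsonl payload field -> gen_ai.* attribute.
# "model" and "finish_reason" are assumed field names, not confirmed ones.
FIELD_TO_SEMCONV = {
    "model": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "eval_tokens": "gen_ai.usage.output_tokens",
    "finish_reason": "gen_ai.response.finish_reasons",
}


def to_semconv(payload: dict) -> dict:
    """Translate the mapped fields of one payload; unmapped fields are skipped."""
    attrs = {}
    for field, value in payload.items():
        name = FIELD_TO_SEMCONV.get(field)
        if name is None:
            continue  # custom semantics (verdicts, ledger, ...) have no semconv home
        if name == "gen_ai.response.finish_reasons" and not isinstance(value, list):
            value = [value]  # the semconv attribute is an array of strings
        attrs[name] = value
    return attrs
```

Because the table is data rather than code, it can double as the documented mapping and be updated as payload schemas are touched.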
Why not rebuild the core on OTel
- events.jsonl is the product, not plumbing: replay fidelity (live view byte-identical to replay), chat reconstruction including nudges/reasoning, zero-infra single-file inspection. OTLP export is async and lossy by design; replaying from a collector requires infrastructure.
- The interesting semantics (case/verdict, rejection categories, ledger, iteration accounting) have no semconv home — they'd be custom attributes regardless.
- Generic trace UIs won't render the worker↔eval dialogue as a conversation; that layer stays ours either way.
- The OTel SDK is a real dependency tree against a stdlib-first repo, and it isn't needed: OTLP has a stable JSON encoding over HTTP, so a post-hoc converter is stdlib-only (urllib).
Plan
- Naming alignment, opportunistically. When a payload schema is next touched, prefer semconv names (input_tokens/output_tokens over prompt_tokens/eval_tokens). Until then, maintain a documented mapping table (events.jsonl field → gen_ai.* attribute). No big-bang rename: it would ripple through summary.py, the viewer, and SUMMARY_VERSION for zero user-visible gain.
- tilth export-otel <session_id>: a derived channel (same principle as summary.json: derived, never a second store) that converts a finished session's events.jsonl to OTLP/JSON and POSTs it to a collector endpoint. No SDK, no change to the loop. Validate against a local Jaeger all-in-one.
- Settle the trace hierarchy first. Today trace_id is per-task, so a session would export as N disconnected traces. OTel wants the session as the trace root with tasks as child spans. Decide deliberately (session-level trace id + parent links, or task traces with a session resource attribute) before the exporter lands.
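The two hierarchy options can be compared side by side in a small sketch. Everything here is illustrative: the span dict shape, the function names, and the "session.id" attribute key are assumptions, and the first option mints a fresh session trace id at export time rather than changing what the loop records.

```python
import secrets


def reparent_under_session(task_spans, session_id):
    """Option A: one trace per session, with each task as a child of a root span."""
    if not task_spans:
        raise ValueError("need at least one task span")
    trace_id = secrets.token_hex(16)
    root_id = secrets.token_hex(8)
    root = {
        "trace_id": trace_id,
        "span_id": root_id,
        "parent_span_id": None,
        "name": f"session {session_id}",
        # The root covers the whole session: earliest start to latest end.
        "start_ns": min(s["start_ns"] for s in task_spans),
        "end_ns": max(s["end_ns"] for s in task_spans),
    }
    # Copy each task span, overriding its trace id and linking it to the root.
    children = [
        {**s, "trace_id": trace_id, "parent_span_id": root_id}
        for s in task_spans
    ]
    return [root] + children


def session_resource(session_id, service_name="tilth"):
    """Option B: keep per-task traces and group them by a resource attribute."""
    return {"service.name": service_name, "session.id": session_id}
```

Option A gives a single waterfall per session in a generic trace UI at the cost of rewriting ids in the export; Option B leaves the recorded ids untouched but relies on the backend to group traces by attribute.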
Non-goals
- OTel SDK inside the harness loop.
- Replacing events.jsonl or the built-in viewer with a collector/backend.
- Chasing the experimental agent-span conventions while they churn.
References