feat(qa): latency-rollup token/cache ledger + slab A/B comparator (Phase 3 harness)#1014
Conversation
…mparator Extend qa/latency_rollup.py (additively) for the slab-tiering A/B (decision Phase 3): parse each beat's result `usage` block (cache_creation/cache_read/input/output tokens + ttft_ms) and aggregate cold-open vs routine means. The schema slab is a per-request PREFILL tax, so its effect lives in the token ledger, not duration_api_ms (generation-bound) — this is the missing measurement to gate the tiering PR. Add compare_arms(): the A/B gate. arm A = baseline (WORLDOS_ENGINE_ALWAYSLOAD=1), arm B = tiered (=0); verdict PASS/FAIL/INSUFFICIENT_DATA on cold-open-not-worse + routine-not-worse + cache-not- dented (routine cache_creation is the dent signal; compared RELATIVE to baseline + tolerance, since real runs show non-trivial routine cache_creation). input_mass_down is the expected WIN (informational). CLI: --compare baseline.json tiered.json (exits non-zero on FAIL). The chance-corrected tool-SELECTION check is gated separately. +7 tests; validated end-to-end on a real transcript (baseline-rc1). Existing rollup behavior + sidecar columns unchanged.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthrough
ChangesToken/TTFT ledger and A/B slab comparison
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
✨ Finishing Touches📝 Generate docstrings
Comment |
Measurement harness for the tool-schema-slab tiering A/B (decision Phase 3). The slab is a per-request prefill tax, so its effect lives in the token ledger, not duration_api_ms (generation-bound).
Additive to qa/latency_rollup.py: parse per-beat usage cache_creation/cache_read/input/output + ttft_ms (cold-open vs routine aggregates); compare_arms() = the A/B gate (arm A baseline WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm B tiered =0), verdict PASS/FAIL/INSUFFICIENT_DATA on cold-open-not-worse + routine-not-worse + cache-not-dented (routine cache_creation is the dent signal, compared relative to baseline + tolerance). CLI --compare a.json b.json exits non-zero on FAIL. Tool-selection check gated separately.
Split out from #1012 because it is independently-useful QA tooling and not A/B-gated. +7 tests; validated on a real transcript; 18 green. The heavy paired duo sweep runs on the support-VM lane (paused).
FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/.
Summary by CodeRabbit
Release Notes
--compareCLI flag to evaluate performance between two rollup configurations