feat(qa): latency-rollup token/cache ledger + slab A/B comparator (Phase 3 harness) #1014

Merged
100yenadmin merged 1 commit into main from feat/slab-ab-harness on Jun 18, 2026

100yenadmin commented Jun 18, 2026

Member

Measurement harness for the tool-schema-slab tiering A/B (decision Phase 3). The slab is a per-request prefill tax, so its effect lives in the token ledger, not duration_api_ms (generation-bound).

The changes are additive to qa/latency_rollup.py:

  • Parse each beat's usage block (cache_creation, cache_read, input, and output tokens) plus ttft_ms, and aggregate them separately for cold-open and routine beats.
  • Add compare_arms(), the A/B gate. Arm A is the baseline (WORLDOS_ENGINE_ALWAYSLOAD=1) and arm B is tiered (WORLDOS_ENGINE_ALWAYSLOAD=0). The verdict is PASS, FAIL, or INSUFFICIENT_DATA, based on three checks: cold-open not worse, routine not worse, and cache not dented. Routine cache_creation is the dent signal, and it is compared relative to the baseline plus a tolerance.
  • Add the CLI form --compare a.json b.json, which exits non-zero on FAIL.

The tool-selection check is gated separately.

Split out from #1012 because it is independently useful QA tooling and is not A/B-gated. +7 tests; validated on a real transcript; 18 green. The heavy paired duo sweep runs on the support-VM lane (paused).

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added A/B comparison mode with --compare CLI flag to evaluate performance between two rollup configurations
    • Now tracks token usage metrics (input, cache creation, cache read, output) and Time-To-First-Token measurements
    • Automated performance validation gates with pass/fail verdicts for cold-start latency, routine latency, and cache efficiency

…mparator

Extend qa/latency_rollup.py (additively) for the slab-tiering A/B (decision Phase 3): parse each
beat's result `usage` block (cache_creation/cache_read/input/output tokens + ttft_ms) and aggregate
cold-open vs routine means. The schema slab is a per-request PREFILL tax, so its effect lives in the
token ledger, not duration_api_ms (generation-bound) — this is the missing measurement to gate the
tiering PR.
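
A minimal sketch of the ledger step described above, not the code in qa/latency_rollup.py. The usage field names, the position of ttft_ms on the result event, the helper names, and the rule "first beat is the cold open" are all assumptions made for illustration; the commit message only names the quantities.

```python
from statistics import fmean
from typing import Optional

# Assumed usage field names; the commit message only says
# cache_creation/cache_read/input/output tokens.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def usage_row(result_event: dict) -> dict:
    """Pull token counts and ttft_ms out of one beat's result event.

    Missing fields become None. Reading ttft_ms from the top level of the
    event is an assumption.
    """
    usage = result_event.get("usage") or {}
    row = {field: usage.get(field) for field in USAGE_FIELDS}
    row["ttft_ms"] = result_event.get("ttft_ms")
    return row


def mean_or_none(values) -> Optional[float]:
    """Mean of the values that are present, or None if there are none."""
    present = [v for v in values if v is not None]
    return fmean(present) if present else None


def token_aggregates(rows: list) -> dict:
    """Split rows into cold-open vs routine and average each field.

    Assumption: the first beat is the cold open, the rest are routine.
    """
    cold, routine = rows[:1], rows[1:]
    keys = USAGE_FIELDS + ("ttft_ms",)
    return {
        "cold_open": {k: mean_or_none(r[k] for r in cold) for k in keys},
        "routine": {k: mean_or_none(r[k] for r in routine) for k in keys},
    }


# Made-up events, purely to exercise the helpers.
events = [
    {"usage": {"input_tokens": 10, "output_tokens": 50,
               "cache_creation_input_tokens": 9000,
               "cache_read_input_tokens": 0}, "ttft_ms": 1200},
    {"usage": {"input_tokens": 12, "output_tokens": 40,
               "cache_creation_input_tokens": 300,
               "cache_read_input_tokens": 9000}, "ttft_ms": 400},
    {"usage": {"input_tokens": 8, "output_tokens": 60,
               "cache_creation_input_tokens": 100,
               "cache_read_input_tokens": 9300}},  # no ttft_ms recorded
]
aggregates = token_aggregates([usage_row(e) for e in events])
```

Keeping missing values as None, rather than 0, means a beat without a ttft_ms reading does not drag the mean down.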

Add compare_arms(): the A/B gate. arm A = baseline (WORLDOS_ENGINE_ALWAYSLOAD=1), arm B = tiered
(=0); verdict PASS/FAIL/INSUFFICIENT_DATA on cold-open-not-worse + routine-not-worse + cache-not-
dented (routine cache_creation is the dent signal; compared RELATIVE to baseline + tolerance, since
real runs show non-trivial routine cache_creation). input_mass_down is the expected WIN
(informational). CLI: --compare baseline.json tiered.json (exits non-zero on FAIL). The
chance-corrected tool-SELECTION check is gated separately.
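
The gate described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged compare_arms(): the tolerance values, the flat metric keys, the function names, and the rule that any missing input yields INSUFFICIENT_DATA are all hypothetical.

```python
from typing import Optional

# Hypothetical tolerances; the PR mentions tolerance constants but not values.
LATENCY_TOLERANCE = 0.05  # tiered arm may be up to 5% slower
CACHE_TOLERANCE = 0.10    # tiered routine cache_creation may be up to 10% higher


def not_worse(baseline: Optional[float], tiered: Optional[float],
              tolerance: float) -> Optional[bool]:
    """True if tiered <= baseline * (1 + tolerance); None if a value is missing."""
    if baseline is None or tiered is None:
        return None
    return tiered <= baseline * (1.0 + tolerance)


def compare_arms_sketch(arm_a: dict, arm_b: dict) -> dict:
    """arm_a is the baseline, arm_b is the tiered arm. Keys are hypothetical."""
    checks = {
        "cold_open_not_worse": not_worse(
            arm_a.get("cold_open_ms"), arm_b.get("cold_open_ms"),
            LATENCY_TOLERANCE),
        "routine_not_worse": not_worse(
            arm_a.get("routine_ms"), arm_b.get("routine_ms"),
            LATENCY_TOLERANCE),
        # Relative comparison: baseline routine cache_creation is non-trivial,
        # so the gate allows baseline + tolerance rather than demanding zero.
        "cache_not_dented": not_worse(
            arm_a.get("routine_cache_creation"),
            arm_b.get("routine_cache_creation"),
            CACHE_TOLERANCE),
    }
    if any(result is None for result in checks.values()):
        verdict = "INSUFFICIENT_DATA"  # assumption: any missing input blocks a verdict
    elif all(checks.values()):
        verdict = "PASS"
    else:
        verdict = "FAIL"

    # Informational only: the expected win, never part of the verdict.
    a_in, b_in = arm_a.get("input_tokens"), arm_b.get("input_tokens")
    input_mass_down = None if a_in is None or b_in is None else b_in < a_in
    return {"verdict": verdict, "checks": checks,
            "input_mass_down": input_mass_down}


baseline = {"cold_open_ms": 1000, "routine_ms": 500,
            "routine_cache_creation": 200, "input_tokens": 9000}
tiered_ok = {"cold_open_ms": 980, "routine_ms": 510,
             "routine_cache_creation": 210, "input_tokens": 4000}
tiered_dented = dict(tiered_ok, routine_cache_creation=400)

ok_result = compare_arms_sketch(baseline, tiered_ok)
dented_result = compare_arms_sketch(baseline, tiered_dented)
missing_result = compare_arms_sketch(baseline, {"cold_open_ms": 980})
```

With these made-up numbers the first comparison passes, the second fails on the cache check alone, and the third cannot reach a verdict.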

+7 tests; validated end-to-end on a real transcript (baseline-rc1). Existing rollup behavior +
sidecar columns unchanged.
100yenadmin merged commit 4775105 into main on Jun 18, 2026
14 of 17 checks passed

coderabbitai Bot commented Jun 18, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 40a15924-7f1f-4a0b-9d5e-47b46aba161c

📥 Commits

Reviewing files that changed from the base of the PR and between 73a8db2 and 55539f8.

📒 Files selected for processing (2)
  • qa/latency_rollup.py
  • qa/test_latency_rollup.py

📝 Walkthrough

qa/latency_rollup.py gains helpers to extract per-beat token/cache/TTFT fields from result events, computes cold-open vs routine means, and emits them under a tokens key in rollup_files(). A new compare_arms() function applies hard-gate checks and returns PASS/FAIL/INSUFFICIENT_DATA. The CLI adds --compare. Tests extend the beat writer and add coverage for all new paths.
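
As a rough illustration of the --compare flow in the walkthrough, the sketch below loads two rollup JSON files, prints the comparison, and maps FAIL to a non-zero exit status. The stub comparator, its routine_ms key, and the main() signature are hypothetical stand-ins; only the flag name and the exit-on-FAIL behavior come from the PR.

```python
import argparse
import json


def stub_compare(baseline: dict, tiered: dict) -> dict:
    """Hypothetical stand-in for compare_arms(); checks a single made-up metric."""
    worse = tiered["routine_ms"] > baseline["routine_ms"]
    return {"verdict": "FAIL" if worse else "PASS"}


def main(argv: list, compare=stub_compare) -> int:
    """Return the process exit status: 1 on FAIL, 0 otherwise."""
    parser = argparse.ArgumentParser(description="latency rollup (sketch)")
    parser.add_argument("--compare", nargs=2, metavar=("BASELINE", "TIERED"),
                        help="compare two rollup JSON files (baseline first)")
    args = parser.parse_args(argv)
    if args.compare is None:
        return 0  # the normal rollup path is out of scope for this sketch

    arms = []
    for path in args.compare:
        with open(path, encoding="utf-8") as handle:
            arms.append(json.load(handle))

    result = compare(arms[0], arms[1])
    print(json.dumps(result, indent=2))
    return 1 if result["verdict"] == "FAIL" else 0
```

A script entry point would typically call sys.exit(main(sys.argv[1:])), so a CI job can gate on the status without parsing the printed JSON.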

Changes

Token/TTFT ledger and A/B slab comparison

| Layer / File(s) | Summary |
| --- | --- |
| Token/TTFT helpers and rollup_files() wiring (qa/latency_rollup.py) | Defines _USAGE_TOKEN_FIELDS, _usage(), _mean(), and _token_aggregates() to split beat rows into cold-open vs routine and compute per-field means; adds usage_rows accumulator and wires _token_aggregates() into the rollup_files() output dict under a new tokens key. |
| compare_arms() logic and --compare CLI flag (qa/latency_rollup.py) | Adds numeric tolerance constants and a None-safe _delta() helper; implements compare_arms() evaluating hard gates on cold-open latency, routine latency, and cache-creation dent, returning verdict plus checks/metrics; extends CLI argument parsing to accept --compare BASELINE TIERED, print the comparison JSON, and exit with non-zero status on FAIL. |
| Test helpers and new test cases (qa/test_latency_rollup.py) | Extends _write_beat() with usage and ttft_ms parameters and adds _usage() test helper; adds _arm() fixture builder and five new tests covering token aggregation, compare_arms() PASS/FAIL/INSUFFICIENT_DATA cases, and CLI --compare exit codes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop hop, the tokens are counted at last,
Cache-creation dents caught, cold opens contrast.
Hard gates now stand where the latency veers,
A/B arms compared — the verdict appears.
PASS or FAIL or NOT ENOUGH DATA yet,
This rabbit checks slabs so you won't forget! 🥕
