feat(qa): latency-rollup token/cache ledger + slab A/B comparator (Phase 3 harness) by 100yenadmin · Pull Request #1014 · electricsheephq/WorldOS

100yenadmin · 2026-06-18T07:38:48Z

Measurement harness for the tool-schema-slab tiering A/B (decision Phase 3). The slab is a per-request prefill tax, so its effect shows up in the token ledger rather than in duration_api_ms, which is generation-bound.

Additive to qa/latency_rollup.py:

- Parse per-beat usage (cache_creation, cache_read, input, output) plus ttft_ms, aggregated as cold-open vs routine.
- compare_arms() is the A/B gate: arm A is the baseline (WORLDOS_ENGINE_ALWAYSLOAD=1), arm B is tiered (WORLDOS_ENGINE_ALWAYSLOAD=0). The verdict is PASS, FAIL, or INSUFFICIENT_DATA, based on three checks: cold-open-not-worse, routine-not-worse, and cache-not-dented. Routine cache_creation is the dent signal and is compared relative to baseline plus a tolerance.
- CLI: --compare a.json b.json exits non-zero on FAIL.
- The tool-selection check is gated separately.
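The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in qa/latency_rollup.py: the names `compare_arms_sketch` and `not_worse`, the metric keys, and both tolerance values are hypothetical, and the choice to report INSUFFICIENT_DATA whenever any check lacks data is an assumption.

```python
from typing import Optional

# Hypothetical tolerances; the real constants are not shown in this PR text.
LATENCY_TOL = 0.10  # arm B may be up to 10% slower before a latency gate fails
CACHE_TOL = 0.10    # arm B routine cache_creation may exceed baseline by up to 10%


def not_worse(baseline: Optional[float], tiered: Optional[float], tol: float) -> Optional[bool]:
    """Return None when either side is missing, else tiered <= baseline * (1 + tol)."""
    if baseline is None or tiered is None:
        return None
    return tiered <= baseline * (1.0 + tol)


def compare_arms_sketch(arm_a: dict, arm_b: dict) -> dict:
    """Compare baseline arm A against tiered arm B on three hard gates."""
    checks = {
        "cold_open_not_worse": not_worse(
            arm_a.get("cold_open_ms"), arm_b.get("cold_open_ms"), LATENCY_TOL),
        "routine_not_worse": not_worse(
            arm_a.get("routine_ms"), arm_b.get("routine_ms"), LATENCY_TOL),
        # The dent signal is measured relative to baseline, not against zero.
        "cache_not_dented": not_worse(
            arm_a.get("routine_cache_creation"),
            arm_b.get("routine_cache_creation"), CACHE_TOL),
    }
    if any(value is None for value in checks.values()):
        verdict = "INSUFFICIENT_DATA"  # assumed precedence over FAIL
    elif all(checks.values()):
        verdict = "PASS"
    else:
        verdict = "FAIL"
    return {"verdict": verdict, "checks": checks}
```

Comparing relative to baseline matters because a baseline arm with non-zero routine cache_creation would fail any absolute "must be zero" check even when tiering changes nothing.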

Split out from #1012 because it is independently-useful QA tooling and not A/B-gated. +7 tests; validated on a real transcript; 18 green. The heavy paired duo sweep runs on the support-VM lane (paused).

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/.

Summary by CodeRabbit

Release Notes

New Features
- Added A/B comparison mode with --compare CLI flag to evaluate performance between two rollup configurations
- Now tracks token usage metrics (input, cache creation, cache read, output) and Time-To-First-Token measurements
- Automated performance validation gates with pass/fail verdicts for cold-start latency, routine latency, and cache efficiency

…mparator

Extend qa/latency_rollup.py (additively) for the slab-tiering A/B (decision Phase 3): parse each beat's result `usage` block (cache_creation/cache_read/input/output tokens + ttft_ms) and aggregate cold-open vs routine means. The schema slab is a per-request PREFILL tax, so its effect lives in the token ledger, not duration_api_ms (generation-bound); this is the missing measurement to gate the tiering PR.

Add compare_arms(): the A/B gate. arm A = baseline (WORLDOS_ENGINE_ALWAYSLOAD=1), arm B = tiered (=0); verdict PASS/FAIL/INSUFFICIENT_DATA on cold-open-not-worse + routine-not-worse + cache-not-dented (routine cache_creation is the dent signal; compared RELATIVE to baseline + tolerance, since real runs show non-trivial routine cache_creation). input_mass_down is the expected WIN (informational).

CLI: --compare baseline.json tiered.json (exits non-zero on FAIL). The chance-corrected tool-SELECTION check is gated separately.

+7 tests; validated end-to-end on a real transcript (baseline-rc1). Existing rollup behavior + sidecar columns unchanged.
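As a rough illustration of the per-beat parsing step, the sketch below pulls the ledger fields out of one result event. The key names are assumptions modeled on an Anthropic-style usage block, `extract_usage` is a hypothetical name, and placing ttft_ms beside the `usage` block rather than inside it is also an assumption.

```python
from typing import Optional

# Assumed key names; the real _USAGE_TOKEN_FIELDS tuple is not shown in this PR text.
USAGE_TOKEN_FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def extract_usage(event: dict) -> Optional[dict]:
    """Return the ledger row for one result event, or None if it has no usage block."""
    usage = event.get("usage")
    if not isinstance(usage, dict):
        return None
    # Missing fields stay None so later means can skip them instead of counting zeros.
    row = {field: usage.get(field) for field in USAGE_TOKEN_FIELDS}
    row["ttft_ms"] = event.get("ttft_ms")  # assumed location: top level of the event
    return row
```

Returning None for events without a usage block keeps older transcripts parseable without polluting the aggregates.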

coderabbitai · 2026-06-18T07:39:02Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 40a15924-7f1f-4a0b-9d5e-47b46aba161c

📥 Commits

Reviewing files that changed from the base of the PR and between 73a8db2 and 55539f8.

📒 Files selected for processing (2)

qa/latency_rollup.py
qa/test_latency_rollup.py

📝 Walkthrough

qa/latency_rollup.py gains helpers to extract per-beat token/cache/TTFT fields from result events, computes cold-open vs routine means, and emits them under a tokens key in rollup_files(). A new compare_arms() function applies hard-gate checks and returns PASS/FAIL/INSUFFICIENT_DATA. The CLI adds --compare. Tests extend the beat writer and add coverage for all new paths.
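The cold-open vs routine split can be pictured with the sketch below. It assumes the first beat of a run is the cold open and every later beat is routine; the PR text does not state how the real `_token_aggregates()` draws that line, and the function and field names here are hypothetical.

```python
from statistics import fmean
from typing import Iterable, Optional

# Assumed field names for one beat row.
FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
    "ttft_ms",
)


def mean_or_none(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean of the non-None values, or None when nothing was recorded."""
    present = [v for v in values if v is not None]
    return fmean(present) if present else None


def token_aggregates(rows: list[dict]) -> dict:
    """Split beat rows into cold-open and routine groups and average each field."""
    cold, routine = rows[:1], rows[1:]  # assumption: beat 0 is the cold open
    return {
        "cold_open": {f: mean_or_none(r.get(f) for r in cold) for f in FIELDS},
        "routine": {f: mean_or_none(r.get(f) for r in routine) for f in FIELDS},
    }
```

Keeping None distinct from zero lets a downstream comparator tell "no data" apart from "no tokens", which is what an INSUFFICIENT_DATA verdict depends on.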

Changes

Token/TTFT ledger and A/B slab comparison

| Layer / File(s) | Summary |
| --- | --- |
| Token/TTFT helpers and rollup_files() wiring (`qa/latency_rollup.py`) | Defines `_USAGE_TOKEN_FIELDS`, `_usage()`, `_mean()`, and `_token_aggregates()` to split beat rows into cold-open vs routine and compute per-field means; adds `usage_rows` accumulator and wires `_token_aggregates()` into the `rollup_files()` output dict under a new `tokens` key. |
| compare_arms() logic and --compare CLI flag (`qa/latency_rollup.py`) | Adds numeric tolerance constants and a None-safe `_delta()` helper; implements `compare_arms()` evaluating hard gates on cold-open latency, routine latency, and cache-creation dent, returning verdict plus `checks`/`metrics`; extends CLI argument parsing to accept `--compare BASELINE TIERED`, print the comparison JSON, and exit with non-zero status on `FAIL`. |
| Test helpers and new test cases (`qa/test_latency_rollup.py`) | Extends `_write_beat()` with `usage` and `ttft_ms` parameters and adds `_usage()` test helper; adds `_arm()` fixture builder and five new tests covering token aggregation, `compare_arms()` `PASS`/`FAIL`/`INSUFFICIENT_DATA` cases, and CLI `--compare` exit codes. |
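The `--compare BASELINE TIERED` wiring summarized above might look roughly like this. Only the flag name, the printed comparison JSON, and the non-zero exit on `FAIL` come from the PR; `main`, `load_json`, and `compare_stub` are hypothetical names, and the stub stands in for the real `compare_arms()`.

```python
import argparse
import json
from typing import Callable, Optional, Sequence


def compare_stub(baseline: dict, tiered: dict) -> dict:
    """Placeholder comparator; stands in for the real compare_arms()."""
    return {"verdict": "INSUFFICIENT_DATA", "checks": {}}


def load_json(path: str) -> dict:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def main(argv: Optional[Sequence[str]] = None,
         compare: Callable[[dict, dict], dict] = compare_stub) -> int:
    """Parse arguments, run the comparison, and return a process exit code."""
    parser = argparse.ArgumentParser(description="latency rollup (compare path only)")
    parser.add_argument("--compare", nargs=2, metavar=("BASELINE", "TIERED"))
    args = parser.parse_args(argv)
    if args.compare is None:
        return 0  # the normal rollup path is out of scope for this sketch
    baseline_path, tiered_path = args.compare
    result = compare(load_json(baseline_path), load_json(tiered_path))
    print(json.dumps(result, indent=2, sort_keys=True))
    # Only FAIL is fatal; PASS and INSUFFICIENT_DATA both exit 0 in this sketch.
    return 1 if result.get("verdict") == "FAIL" else 0
```

A script entry point would typically end with `sys.exit(main())` so that a `FAIL` verdict can break a CI step. Whether INSUFFICIENT_DATA should also exit non-zero is a policy choice the PR text leaves open.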

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop hop, the tokens are counted at last,
Cache-creation dents caught, cold opens contrast.
Hard gates now stand where the latency veers,
A/B arms compared — the verdict appears.
PASS or FAIL or NOT ENOUGH DATA yet,
This rabbit checks slabs so you won't forget! 🥕


100yenadmin merged commit 4775105 into main Jun 18, 2026
14 of 17 checks passed

100yenadmin merged 1 commit into main from feat/slab-ab-harness
