feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated] by 100yenadmin · Pull Request #1012 · electricsheephq/WorldOS

100yenadmin · 2026-06-18T07:26:51Z

DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).

What

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"].

Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (_meta["anthropic/alwaysLoad"]===true, binary grep) and FastMCP (mcp 1.27.1) propagates @mcp.tool(meta=...) → list_tools()._meta (runtime probe). So this is an in-place decorator-style annotation on the frozen worldos-engine server — no facade server, no engine split, no rename (R1's feared blocker was falsified).
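As a minimal sketch of the per-tool resolution described above, the snippet below models a `list_tools()` result as plain dictionaries and partitions tools by whether `_meta["anthropic/alwaysLoad"]` is strictly `True`. The tool names and the `is_pinned` / `partition_tools` helpers are hypothetical, chosen for illustration only; the real resolution happens inside the claude harness, not in Python.

```python
from typing import Any

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"  # meta key named in this PR


def is_pinned(tool: dict[str, Any]) -> bool:
    """A tool counts as pinned only when its _meta flag is exactly True."""
    meta = tool.get("_meta") or {}
    return meta.get(ALWAYS_LOAD_KEY) is True


def partition_tools(tools: list[dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Split tool names into (pinned, deferred), preserving listing order."""
    pinned = [t["name"] for t in tools if is_pinned(t)]
    deferred = [t["name"] for t in tools if not is_pinned(t)]
    return pinned, deferred


# Hypothetical listing entries; only the first carries a strict True flag.
tools = [
    {"name": "roll_dice", "_meta": {ALWAYS_LOAD_KEY: True}},
    {"name": "export_campaign", "_meta": {}},
    {"name": "rename_npc"},
    {"name": "legacy_flag", "_meta": {ALWAYS_LOAD_KEY: "true"}},  # string, not bool
]
pinned, deferred = partition_tools(tools)
print(pinned)    # ['roll_dice']
print(deferred)  # ['export_campaign', 'rename_npc', 'legacy_flag']
```

The strict identity check mirrors the `===true` comparison quoted above: a truthy string does not pin a tool.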

The win (measured)

| | bytes | tools |
|---|---|---|
| Full slab (baseline, today) | 118,739 B | 153 |
| Pinned core (tiered arm) | 63,862 B | 69 |
| Deferred (behind ToolSearch) | 57,500 B | 84 |

−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)
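The headline figures can be reproduced from the two measured byte counts. The sketch below assumes roughly 4 bytes per token, which is an assumption used here to recover the quoted token estimate, not a figure stated in this PR.

```python
FULL_SLAB_BYTES = 118_739    # baseline slab, from the table above
PINNED_CORE_BYTES = 63_862   # tiered-arm pinned core, from the table above
BYTES_PER_TOKEN = 4.0        # assumption: rough bytes-per-token ratio

saved_bytes = FULL_SLAB_BYTES - PINNED_CORE_BYTES
saved_fraction = saved_bytes / FULL_SLAB_BYTES
saved_tokens = saved_bytes / BYTES_PER_TOKEN

print(f"{saved_bytes} B saved, {saved_fraction:.1%} of the slab, "
      f"~{saved_tokens / 1000:.1f}k tokens")
# prints: 54877 B saved, 46.2% of the slab, ~13.7k tokens
```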

How it's safe

- _apply_tool_tiering() is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).
- PINNED_ALLOWLIST = hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.
- Cold tools stay findable: the engine names them in the obligations/director payloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).
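The env gate described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual `_apply_tool_tiering()`: the registry shape (a dict of tool name to entry), the two allowlist names, and the choice to treat any value other than `"0"` as baseline are all hypothetical.

```python
import os

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"
# Hypothetical two-name allowlist; the real PINNED_ALLOWLIST has 69 entries.
PINNED_ALLOWLIST = frozenset({"roll_dice", "start_combat"})


def apply_tool_tiering(registry: dict[str, dict], env=None) -> bool:
    """Annotate the pinned core in place; return True if tiering activated."""
    env = os.environ if env is None else env
    # Default "1" is the whole-server baseline: leave every entry untouched.
    if env.get("WORLDOS_ENGINE_ALWAYSLOAD", "1") != "0":
        return False
    # Fail loud if the allowlist names a tool that is not registered.
    missing = PINNED_ALLOWLIST - set(registry)
    if missing:
        raise ValueError(f"allowlist names not registered: {sorted(missing)}")
    for name in PINNED_ALLOWLIST:
        registry[name].setdefault("_meta", {})[ALWAYS_LOAD_KEY] = True
    return True


registry = {"roll_dice": {}, "start_combat": {}, "export_campaign": {}}

# Baseline arm: nothing changes.
print(apply_tool_tiering(registry, env={"WORLDOS_ENGINE_ALWAYSLOAD": "1"}))  # False
# Tiered arm: only allowlisted tools gain the flag; new tools stay deferred.
print(apply_tool_tiering(registry, env={"WORLDOS_ENGINE_ALWAYSLOAD": "0"}))  # True
print(registry["export_campaign"])  # {}
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment.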

Guard (`test_tool_schema_budget.py`, now tier-aware)

1. Ratchet on the pinned-core slab bytes.
2. Pinned set == PINNED_ALLOWLIST (growth forcing-function).
3. Per-tool _meta actually propagates to list_tools() (fail loud on a FastMCP/claude upgrade).
4. Full-slab secondary cap + the reach-for first-sentence guard.
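The guard's checks could look roughly like the sketch below. It is not the contents of `test_tool_schema_budget.py`: the `slab_bytes` measurement (UTF-8 length of each tool's JSON form), the `check_guard` helper, the tool entries, and the cap values are all assumptions, and the reach-for first-sentence guard is omitted.

```python
import json

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"


def slab_bytes(tools):
    """Approximate slab size as the UTF-8 length of each tool's JSON form."""
    return sum(len(json.dumps(t, sort_keys=True).encode("utf-8")) for t in tools)


def check_guard(tools, allowlist, pinned_cap, full_cap):
    """Return a list of failure messages; an empty list means the guard passes."""
    pinned = [t for t in tools
              if (t.get("_meta") or {}).get(ALWAYS_LOAD_KEY) is True]
    failures = []
    # Check 1: ratchet on the pinned-core slab.
    pinned_size = slab_bytes(pinned)
    if pinned_size > pinned_cap:
        failures.append(f"pinned core {pinned_size} B exceeds ratchet {pinned_cap} B")
    # Checks 2 and 3: the flags seen in the listing match the allowlist exactly.
    pinned_names = {t["name"] for t in pinned}
    if pinned_names != set(allowlist):
        drift = sorted(pinned_names ^ set(allowlist))
        failures.append(f"pinned set differs from allowlist: {drift}")
    # Check 4 (partial): secondary cap on the full slab.
    full_size = slab_bytes(tools)
    if full_size > full_cap:
        failures.append(f"full slab {full_size} B exceeds cap {full_cap} B")
    return failures


tools = [
    {"name": "roll_dice", "description": "Roll dice.",
     "_meta": {ALWAYS_LOAD_KEY: True}},
    {"name": "export_campaign", "description": "Export a campaign archive."},
]
print(check_guard(tools, {"roll_dice"}, pinned_cap=1_000, full_cap=2_000))  # []
print(check_guard(tools, {"roll_dice", "start_combat"}, pinned_cap=1_000, full_cap=2_000))
```

The set-equality check is what makes growth a deliberate act: pinning a new tool fails the guard until the allowlist is edited to match.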

Deviation from the approved Phase 1 (flagged)

The approved plan said "drop server-level alwaysLoad" in Phase 1. I kept it env-gated instead so production default stays baseline (this PR is the dormant mechanism). Dropping server-level alwaysLoad = flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today via WORLDOS_ENGINE_ALWAYSLOAD.

Tests

Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.

Phase 3 (the gate — next)

Same-SHA/same-seed duo A/B: arm1 WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm2 =0. Remaining harness work: extend qa/latency_rollup.py to parse cache_creation/cache_read + cold-open seconds from the *.dm.jsonl result events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.
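A rough sketch of the cache-ledger part of that harness work follows. It is not the code in `qa/latency_rollup.py`: the event shape (a `type` of `"result"` with a `usage` object) and the `cache_creation_input_tokens` / `cache_read_input_tokens` field names are assumptions about the `*.dm.jsonl` format, and cold-open timing and the chance-corrected selection check are not covered.

```python
import json
from collections import Counter


def rollup_cache_usage(lines):
    """Sum cache-creation and cache-read tokens over result events."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "result":  # assumed event discriminator
            continue
        usage = event.get("usage") or {}
        totals["cache_creation"] += usage.get("cache_creation_input_tokens", 0)
        totals["cache_read"] += usage.get("cache_read_input_tokens", 0)
        totals["beats"] += 1
    return dict(totals)


# Hypothetical transcript lines for one arm.
sample = [
    '{"type": "assistant", "text": "..."}',
    '{"type": "result", "usage": {"cache_creation_input_tokens": 100, "cache_read_input_tokens": 900}}',
    '',
    '{"type": "result", "usage": {"cache_creation_input_tokens": 50, "cache_read_input_tokens": 950}}',
]
print(rollup_cache_usage(sample))
# {'cache_creation': 150, 'cache_read': 1850, 'beats': 2}
```

Running the same rollup over each arm's transcripts would give the per-arm totals that the cache-not-dented criterion compares.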

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.

… (slab Phases 1-2)

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"], verified to resolve per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1. No facade server, no rename of the frozen worldos-engine id.

- PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.)
- _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover; activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud).
- test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap.

Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a deliberately generous core. Baseline byte-identical. Full engine suite 2992 green.

Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/).

DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping server-level alwaysLoad) is a separate gated flip.

coderabbitai · 2026-06-18T07:26:59Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 39b67ea7-5c32-441e-9804-503e7c098776

100yenadmin mentioned this pull request Jun 18, 2026

feat(qa): latency-rollup token/cache ledger + slab A/B comparator (Phase 3 harness) #1014

Merged

100yenadmin wants to merge 1 commit into main from feat/slab-phase1-tiering
