feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated]#1012

Draft
100yenadmin wants to merge 1 commit into main from feat/slab-phase1-tiering

@100yenadmin

DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).

What

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"].

Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (_meta["anthropic/alwaysLoad"] === true, confirmed by binary grep), and FastMCP (mcp 1.27.1) propagates @mcp.tool(meta=...) → list_tools()._meta (confirmed by runtime probe). So this is an in-place, decorator-style annotation on the frozen worldos-engine server: no facade server, no engine split, no rename (R1's feared blocker was falsified).
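As a rough illustration of the per-tool check described above, the sketch below partitions a list_tools()-style payload by the _meta["anthropic/alwaysLoad"] flag. Only the meta key comes from this PR; the payload shape (plain dicts), the helper names, and the tool names are assumptions made for the example, not the engine's real output or the harness's actual resolution code.

```python
from typing import Any

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"  # meta key named in this PR


def is_pinned(tool: dict[str, Any]) -> bool:
    """True only when the tool's _meta carries a literal True pin."""
    meta = tool.get("_meta") or {}
    # Strict identity check: a truthy string such as "true" does not count.
    return meta.get(ALWAYS_LOAD_KEY) is True


def partition(tools: list[dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Split tool names into (pinned, deferred), preserving input order."""
    pinned = [t["name"] for t in tools if is_pinned(t)]
    deferred = [t["name"] for t in tools if not is_pinned(t)]
    return pinned, deferred


# Hypothetical tool entries, for illustration only.
sample_tools = [
    {"name": "roll_dice", "_meta": {ALWAYS_LOAD_KEY: True}},
    {"name": "export_campaign", "_meta": {}},
    {"name": "rename_npc"},
    {"name": "legacy_tool", "_meta": {ALWAYS_LOAD_KEY: "true"}},
]
pinned, deferred = partition(sample_tools)
```

The strict `is True` comparison mirrors the `=== true` check noted above, so a missing, empty, or stringly-typed flag leaves the tool deferred.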

The win (measured)

| | bytes | tools |
|---|---|---|
| Full slab (baseline, today) | 118,739 B | 153 |
| Pinned core (tiered arm) | 63,862 B | 69 |
| Deferred (behind ToolSearch) | 57,500 B | 84 |

−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)
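The headline figures can be reproduced from the measured byte counts. The sketch below assumes roughly 4 bytes per token, which is a rule-of-thumb conversion and not a measured tokenizer ratio.

```python
FULL_SLAB_BYTES = 118_739    # measured baseline slab (from this PR)
PINNED_CORE_BYTES = 63_862   # measured pinned core (from this PR)
BYTES_PER_TOKEN = 4          # assumption: rough average, not a tokenizer measurement

saved_bytes = FULL_SLAB_BYTES - PINNED_CORE_BYTES
saved_fraction = saved_bytes / FULL_SLAB_BYTES
saved_tokens = saved_bytes / BYTES_PER_TOKEN

print(f"saved {saved_bytes:,} B = {saved_fraction:.1%} ~ {saved_tokens / 1000:.1f}k tokens")
# prints: saved 54,877 B = 46.2% ~ 13.7k tokens
```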

How it's safe

  • _apply_tool_tiering() is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).
  • PINNED_ALLOWLIST = hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.
  • Cold tools stay findable: the engine names them in the obligations/director payloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).
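The gating described in the bullets above could look roughly like the following. This is a minimal sketch, not the actual _apply_tool_tiering(): the registry is modeled as a plain dict of per-tool meta dicts, the allowlist entries are hypothetical names, and treating any value other than "0" as the baseline arm is an assumption.

```python
import os
from typing import Mapping, Optional

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"
# Hypothetical names; the real PINNED_ALLOWLIST has 69 entries.
PINNED_ALLOWLIST = frozenset({"roll_dice", "start_combat"})


def apply_tool_tiering(
    tools: dict[str, dict],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Annotate allowlisted tools in place; return True if tiering was applied.

    `tools` maps tool name -> meta dict (a stand-in for the server registry).
    """
    env = os.environ if env is None else env
    # Baseline arm (default "1"): the server-level pin covers every tool,
    # so leave per-tool meta untouched and stay byte-identical.
    if env.get("WORLDOS_ENGINE_ALWAYSLOAD", "1") != "0":
        return False
    # Fail loud if the allowlist names a tool that is not registered.
    missing = PINNED_ALLOWLIST - set(tools)
    if missing:
        raise RuntimeError(f"PINNED_ALLOWLIST names unknown tools: {sorted(missing)}")
    for name in PINNED_ALLOWLIST:
        tools[name][ALWAYS_LOAD_KEY] = True
    # Anything not in the allowlist keeps no pin, i.e. defaults to deferred.
    return True


# Tiered arm: only allowlisted tools receive the pin.
registry = {"roll_dice": {}, "start_combat": {}, "export_campaign": {}}
applied = apply_tool_tiering(registry, env={"WORLDOS_ENGINE_ALWAYSLOAD": "0"})
```

Because new tools are never added to the allowlist implicitly, they land in the deferred tier unless someone edits PINNED_ALLOWLIST on purpose.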

Guard (test_tool_schema_budget.py, now tier-aware)

  1. Ratchet on the pinned-core slab bytes.
  2. Pinned set == PINNED_ALLOWLIST (growth forcing-function).
  3. Per-tool _meta actually propagates to list_tools() (fail loud on a FastMCP/claude upgrade).
  4. Full-slab secondary cap + the reach-for first-sentence guard.
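A minimal sketch of the first three guard checks, written as pytest-style functions over a stand-in tool listing. The tool entries, the ceiling value, and the helper names are hypothetical; the real test_tool_schema_budget.py reads the live server's list_tools() output and may measure slab bytes differently.

```python
import json

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"
PINNED_ALLOWLIST = frozenset({"roll_dice", "start_combat"})  # hypothetical names
PINNED_SLAB_CEILING_BYTES = 400  # hypothetical ratchet; lower it as the core shrinks


def listed_tools() -> list[dict]:
    """Stand-in for the server's list_tools() output (assumed dict shape)."""
    return [
        {"name": "roll_dice", "description": "Roll dice.",
         "_meta": {ALWAYS_LOAD_KEY: True}},
        {"name": "start_combat", "description": "Begin an encounter.",
         "_meta": {ALWAYS_LOAD_KEY: True}},
        {"name": "export_campaign", "description": "Export the campaign."},
    ]


def pinned_tools(tools: list[dict]) -> list[dict]:
    """Tools whose _meta carries a literal True pin."""
    return [t for t in tools if (t.get("_meta") or {}).get(ALWAYS_LOAD_KEY) is True]


def slab_bytes(tools: list[dict]) -> int:
    """Size of the serialized schemas, a proxy for the injected slab."""
    return len(json.dumps(tools, sort_keys=True).encode("utf-8"))


def test_pinned_slab_ratchet():
    assert slab_bytes(pinned_tools(listed_tools())) <= PINNED_SLAB_CEILING_BYTES


def test_pinned_set_matches_allowlist():
    # Equality, not subset: pinning a new tool forces an allowlist edit.
    assert {t["name"] for t in pinned_tools(listed_tools())} == PINNED_ALLOWLIST


def test_meta_propagates():
    by_name = {t["name"]: t for t in listed_tools()}
    for name in PINNED_ALLOWLIST:
        assert by_name[name]["_meta"][ALWAYS_LOAD_KEY] is True
```

Checking set equality instead of a subset is what makes the second test a forcing function: growing the pinned core requires a deliberate allowlist change, which the ratchet then has to accommodate.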

Deviation from the approved Phase 1 (flagged)

The approved plan said "drop server-level alwaysLoad" in Phase 1. I kept it env-gated instead so production default stays baseline (this PR is the dormant mechanism). Dropping server-level alwaysLoad = flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today via WORLDOS_ENGINE_ALWAYSLOAD.

Tests

Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.

Phase 3 (the gate — next)

Same-SHA/same-seed duo A/B: arm1 WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm2 =0. Remaining harness work: extend qa/latency_rollup.py to parse cache_creation/cache_read + cold-open seconds from the *.dm.jsonl result events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.
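One possible shape for the cache rollup mentioned above is sketched here. The event schema is an assumption: it supposes each line of a *.dm.jsonl file is a JSON object, that result events carry "type": "result", and that token counts live under a "usage" object with cache_creation_input_tokens and cache_read_input_tokens fields. The real qa/latency_rollup.py and transcript format may differ.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CacheTotals:
    cache_creation: int = 0
    cache_read: int = 0
    results: int = 0


def rollup(lines: Iterable[str]) -> CacheTotals:
    """Sum cache token counts over result events in JSONL lines."""
    totals = CacheTotals()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        event = json.loads(raw)
        if event.get("type") != "result":
            continue  # only result events are assumed to carry usage
        usage = event.get("usage") or {}
        # Field names below are assumed, not confirmed by this PR.
        totals.cache_creation += int(usage.get("cache_creation_input_tokens", 0))
        totals.cache_read += int(usage.get("cache_read_input_tokens", 0))
        totals.results += 1
    return totals


# Hypothetical transcript lines for one arm.
sample = [
    '{"type": "assistant", "text": "..."}',
    '{"type": "result", "usage": {"cache_creation_input_tokens": 1200, "cache_read_input_tokens": 30000}}',
    '',
    '{"type": "result", "usage": {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 31200}}',
]
arm_totals = rollup(sample)
```

Running the same rollup over both arms' transcripts gives paired totals to compare against the "cache-not-dented" criterion.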

FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.

… (slab Phases 1-2)

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail
behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"] — verified to resolve
per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1.
No facade server, no rename of the frozen worldos-engine id.

- PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path
  + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.)
- _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline
  (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover;
  activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud).
- test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the
  pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually
  propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap.

Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a
deliberately generous core. Baseline byte-identical. Full engine suite 2992 green.

Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/).
DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected
selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping
server-level alwaysLoad) is a separate gated flip.
