feat(engine): per-tool tool-schema tiering + tier-aware guard (slab Phases 1-2) [A/B-gated] #1012
Draft
100yenadmin wants to merge 1 commit into
… (slab Phases 1-2)

Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"]. This is verified to resolve per-tool in claude 2.1.160 and to propagate from FastMCP @tool meta on the installed mcp 1.27.1. No facade server, no rename of the frozen worldos-engine id.

- PINNED_ALLOWLIST (69): the hot beat loop + the full active-combat verb set + the cold-open path + the 18 reach-for tools. New tools default DEFERRED. (285-transcript census: 92/153 never called.)
- _apply_tool_tiering(): annotates the core; INERT under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, default) so production is byte-identical until the post-A/B cutover; activates for the tiered A/B arm (=0). Validates the allowlist names exist (fail loud).
- test_tool_schema_budget.py is now tier-aware: a ratchet on the PINNED-core slab, assert the pinned set == PINNED_ALLOWLIST (the growth forcing-function), assert per-tool _meta actually propagates to list_tools(), keep the reach-for first-sentence guard + a full-slab secondary cap.

Measured: pinned core = 63,862 B vs 118,739 B full slab = -46% per beat (~13.7k tokens) with a deliberately generous core. Baseline byte-identical. Full engine suite 2992 green.

Phases 1-2 of the FPAD slab decision (worldos-session-notes/2026-06-18/tool-schema-slab-decision/). DO NOT MERGE until the Phase-3 duo A/B (cache_creation/read + cold-open + chance-corrected selection >= control). This PR is the dormant mechanism + guard; the production cutover (dropping server-level alwaysLoad) is a separate gated flip.
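The tiering step described in this commit message can be sketched roughly as follows. This is a minimal model, not the engine's code: tools are plain dicts rather than real FastMCP objects, and the function name apply_tool_tiering, the example tool names, and the rule that any value other than "0" counts as baseline are all assumptions made for illustration. Only the _meta key, the WORLDOS_ENGINE_ALWAYSLOAD variable, and the fail-loud allowlist check come from the text above.

```python
import os

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"

# Hypothetical allowlist; the real PINNED_ALLOWLIST has 69 entries.
PINNED_ALLOWLIST = frozenset({"roll_dice", "attack"})


def apply_tool_tiering(tools, allowlist=PINNED_ALLOWLIST, env=None):
    """Annotate pinned tools in place and return the set of names pinned.

    `tools` maps tool name -> schema dict (a stand-in for real tool objects).
    """
    env = os.environ if env is None else env
    # Baseline arm (default "1"): the server-level pin already covers every
    # tool, so the per-tool annotation is skipped and nothing changes.
    if env.get("WORLDOS_ENGINE_ALWAYSLOAD", "1") != "0":
        return set()
    # Fail loud if the allowlist names a tool that is not registered.
    unknown = set(allowlist) - set(tools)
    if unknown:
        raise ValueError(f"allowlist names not registered: {sorted(unknown)}")
    for name in allowlist:
        tools[name].setdefault("_meta", {})[ALWAYS_LOAD_KEY] = True
    return set(allowlist)
```

Tools outside the allowlist are left untouched, which in this model is what "default deferred" amounts to.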
DRAFT — do not merge until the Phase-3 duo A/B passes (per the approved FPAD plan: merge only if cache-not-dented + cold-open-not-worse + tool-selection ≥ control).
What
Pin only a census-backed core of engine MCP tools into every DM beat and defer the cold tail behind the harness ToolSearch, via per-tool _meta["anthropic/alwaysLoad"].
Verified the mechanism end-to-end before building: claude 2.1.160 resolves the pin per-tool (_meta["anthropic/alwaysLoad"] === true, binary grep) and FastMCP (mcp 1.27.1) propagates @mcp.tool(meta=...) → list_tools()._meta (runtime probe). So this is an in-place decorator-style annotation on the frozen worldos-engine server: no facade server, no engine split, no rename (R1's feared blocker was falsified).
The win (measured)
−46% of the per-beat injected slab (~13.7k tokens) — with a deliberately generous core. (285-transcript census: 92/153 tools never called in real play.)
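The headline figure can be re-derived from the byte counts quoted in the commit message. In the sketch below the byte counts are from the source; the ratio of about 4 bytes per token is an assumption (a common rule of thumb, not something the PR states), chosen because it is consistent with the ~13.7k token figure.

```python
PINNED_CORE_BYTES = 63_862   # pinned core, from the commit message
FULL_SLAB_BYTES = 118_739    # full slab, from the commit message

saved_bytes = FULL_SLAB_BYTES - PINNED_CORE_BYTES
reduction = saved_bytes / FULL_SLAB_BYTES

# Assumption: roughly 4 bytes per token; this is not a measured value.
BYTES_PER_TOKEN = 4.0
saved_tokens = saved_bytes / BYTES_PER_TOKEN

print(f"saved {saved_bytes:,} B = {reduction:.1%} of the slab, "
      f"~{saved_tokens / 1000:.1f}k tokens")
# prints: saved 54,877 B = 46.2% of the slab, ~13.7k tokens
```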
How it's safe
- _apply_tool_tiering() is inert under the whole-server baseline (WORLDOS_ENGINE_ALWAYSLOAD=1, the default): the harness ORs the server pin over every tool, so production is byte-identical until the post-A/B cutover. It only activates for the tiered A/B arm (=0).
- PINNED_ALLOWLIST = hot beat loop + full active-combat verb set (die-triggered, no payload hint) + cold-open path (no payload names them; the 22-turn give-up band) + the 18 reach-for tools. New tools default deferred.
- Deferred tools: … obligations/director payloads the DM already holds, or they're explicit-intent-gated (Step-1.7 reach-for validation found no selection regression).
Guard (test_tool_schema_budget.py, now tier-aware)
1. Ratchet on the pinned-core slab.
2. Pinned set == PINNED_ALLOWLIST (growth forcing-function).
3. Per-tool _meta actually propagates to list_tools() (fail loud on a FastMCP/claude upgrade).
4. Full-slab secondary cap + the reach-for first-sentence guard.
Deviation from the approved Phase 1 (flagged)
The approved plan said "drop server-level alwaysLoad" in Phase 1. I kept it env-gated instead so the production default stays baseline (this PR is the dormant mechanism). Dropping server-level alwaysLoad = flipping production to tiered = exactly the behavior change the A/B must gate, so it's deferred to a tiny post-A/B cutover flip. Both A/B arms run from this branch today via WORLDOS_ENGINE_ALWAYSLOAD.
Tests
Full engine suite 2992 passed (single-process). Baseline byte-clean asserted.
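For readers who have not opened test_tool_schema_budget.py, the first two guard checks (pinned set equals the allowlist, pinned-core size under a ratchet) might look something like the sketch below. It is an illustration only: the helper names slab_bytes and check_pinned_core, the dict-based tool model, and the use of sorted-key JSON as the size measure are assumptions, and the real test presumably inspects list_tools() output instead.

```python
import json

ALWAYS_LOAD_KEY = "anthropic/alwaysLoad"


def slab_bytes(tools, names):
    """Total UTF-8 size of the JSON-serialised schemas for the given names."""
    return sum(
        len(json.dumps(tools[name], sort_keys=True).encode("utf-8"))
        for name in sorted(names)
    )


def check_pinned_core(tools, allowlist, budget_bytes):
    """Assert the pinned set matches the allowlist and fits the byte budget."""
    pinned = {
        name
        for name, schema in tools.items()
        if schema.get("_meta", {}).get(ALWAYS_LOAD_KEY) is True
    }
    # Growth forcing-function: pinning a new tool requires editing the allowlist.
    drift = pinned ^ set(allowlist)
    assert not drift, f"pinned set drifted from allowlist: {sorted(drift)}"
    # Ratchet: the pinned core may shrink but must not exceed the budget.
    size = slab_bytes(tools, pinned)
    assert size <= budget_bytes, f"pinned core {size} B exceeds {budget_bytes} B"
    return size
```

Lowering budget_bytes whenever the measured size drops is what would turn the cap into a ratchet.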
Phase 3 (the gate — next)
Same-SHA/same-seed duo A/B: arm1 WORLDOS_ENGINE_ALWAYSLOAD=1 vs arm2 =0. Remaining harness work: extend qa/latency_rollup.py to parse cache_creation/cache_read + cold-open seconds from the *.dm.jsonl result events; add a chance-corrected tool-selection check vs the census. Heavy/paired playtests → support-VM lane.
FPAD record: worldos-session-notes/2026-06-18/tool-schema-slab-decision/decision-record.md.
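The Phase 3 plan does not say which chance-corrected statistic the tool-selection check will use. As one possible reading, the sketch below computes Cohen's kappa between the tool expected for each beat and the tool actually chosen; the choice of kappa, the function name, and the per-beat label lists are all assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(expected, chosen):
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(expected) != len(chosen) or not expected:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(expected)
    observed = sum(e == c for e, c in zip(expected, chosen)) / n
    exp_counts, cho_counts = Counter(expected), Counter(chosen)
    # Chance agreement: probability both sequences pick the same label
    # if each drew independently from its own label frequencies.
    chance = sum(exp_counts[label] * cho_counts[label] for label in exp_counts) / (n * n)
    if chance == 1.0:
        return 1.0  # degenerate case: both sequences use a single identical label
    return (observed - chance) / (1.0 - chance)
```

Under this reading, the gate "tool-selection ≥ control" would compare the kappa of the tiered arm against the kappa of the baseline arm on the same seeds.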