cyber: make generated worlds honest; harden evolution + the training substrate #351

Merged
larstalian merged 6 commits into main from cyber/honest-worldgen-evolution
Jun 25, 2026
@larstalian

Collaborator

What & why

This started as "make the cyber gym's generated worlds honest, and make sure world evolution actually makes sense" — and, through a couple of multi-agent audits that each caught real problems, grew into one coherent pass over worldgen, the curriculum, and the training substrate.

Every change keeps worlds solvable-by-construction, deterministic, and PROCESS↔CONTAINER parity-safe. Full suite green (955 passed); ruff / mypy / pack-core boundary / coverage all pass.

🔴 The headline: an honesty bug that was corrupting training reward

Under the PROCESS backing (what training and eval default to), the internal cloud-metadata endpoint was a directly-servable HTTP route — GET /svc/<host>/latest/meta-data/credential handed back the flag with zero exploitation, in 13 of 13 company worlds. An agent could score reward 1.0 without ever performing the SSRF pivot the world exists to teach.

Fix: internal /svc/<name> endpoints are now a dispatch-only table — reachable only by the in-process SSRF pivot, never a direct request (a direct GET now 404s) — exactly the network segmentation CONTAINER gets for free from real per-service hosts. Gated to networked worlds, so flat worlds (where internal vulns are solved by direct request, by design) are untouched.
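
A minimal sketch of the dispatch-only idea, not the actual implementation: the `World` class, its `add_internal` / `handle_request` / `ssrf_fetch` methods, and the `FLAG{example}` value are all hypothetical names chosen for illustration. It assumes internal handlers live in a table that only the SSRF pivot consults, while flat worlds keep registering them as ordinary routes.

```python
from dataclasses import dataclass, field
from typing import Callable
from urllib.parse import urlsplit

# Hypothetical handler type: takes a request path, returns (status, body).
Handler = Callable[[str], tuple[int, str]]


@dataclass
class World:
    networked: bool
    public_routes: dict[str, Handler] = field(default_factory=dict)
    internal_dispatch: dict[str, Handler] = field(default_factory=dict)

    def add_internal(self, name: str, handler: Handler) -> None:
        if self.networked:
            # Networked world: reachable only through the SSRF pivot.
            self.internal_dispatch[name] = handler
        else:
            # Flat world: internal vulns are solved by direct request.
            self.public_routes[f"/svc/{name}"] = handler

    def handle_request(self, path: str) -> tuple[int, str]:
        """Serve a direct request from the agent."""
        for prefix, handler in self.public_routes.items():
            if path == prefix or path.startswith(prefix + "/"):
                return handler(path[len(prefix):])
        return 404, "not found"

    def ssrf_fetch(self, url: str) -> tuple[int, str]:
        """What a vulnerable server-side fetch would do in-process."""
        parts = urlsplit(url)
        handler = self.internal_dispatch.get(parts.hostname or "")
        if handler is None:
            return 502, "unreachable"
        return handler(parts.path)


def metadata(path: str) -> tuple[int, str]:
    # Stand-in for the cloud-metadata service holding the flag.
    if path == "/latest/meta-data/credential":
        return 200, "FLAG{example}"
    return 404, "not found"


networked = World(networked=True)
networked.add_internal("metadata", metadata)
flat = World(networked=False)
flat.add_internal("metadata", metadata)
```

In this sketch a direct `networked.handle_request("/svc/metadata/latest/meta-data/credential")` misses the route table, while `networked.ssrf_fetch("http://metadata/latest/meta-data/credential")` reaches the handler. The flat world still answers the direct request.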

The rest of the pass

Difficulty metric. A 2-service toy SSRF world was rated ~40% harder than a full 8-service company breach. Added a capped internal-host fan-out term so a wide estate registers (company difficulty went from 5 → 15 distinct values) and removed the cost-neutral POST "body bump." The inversion is gone.
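
As a rough illustration of a capped fan-out term, the sketch below uses an invented scoring formula. `WorldShape`, `FANOUT_CAP`, `FANOUT_WEIGHT`, and the base term are assumptions for the example and do not reflect the real metric's weights.

```python
from dataclasses import dataclass

FANOUT_CAP = 6       # hypothetical cap on how many internal hosts count
FANOUT_WEIGHT = 0.5  # hypothetical weight per counted host


@dataclass(frozen=True)
class WorldShape:
    chain_hops: int
    internal_hosts: int
    uses_post_body: bool = False  # cost-neutral, so it is ignored below


def difficulty(shape: WorldShape) -> float:
    # Base cost grows with the length of the exploit chain.
    base = 1.0 + shape.chain_hops
    # A wide estate adds difficulty, but only up to the cap, so a huge
    # host count cannot dominate the score.
    fanout = FANOUT_WEIGHT * min(shape.internal_hosts, FANOUT_CAP)
    return base + fanout


toy = WorldShape(chain_hops=1, internal_hosts=1)
company = WorldShape(chain_hops=1, internal_hosts=7)
```

With equal chain length, the wider `company` shape now scores above `toy`, and flipping `uses_post_body` changes nothing.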

Realism. The recon page disclosed the exact internal inventory with zero decoys (a perfect oracle) — now padded with believable decoy hostnames the agent has to triage. Decoy files are per-world-unique (were 5 byte-identical literals). Grew the most-memorizable pools (flag key, SSRF targets, service names).
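
One way to pad a recon listing deterministically is sketched below. The `DECOY_POOL` names and the `recon_inventory` helper are made up for the example; the only property it tries to show is that the decoys are drawn from a per-world seed, so the same world always renders the same listing.

```python
import random

# Hypothetical pool of believable internal hostnames.
DECOY_POOL = (
    "billing-api", "ldap-sync", "metrics-gw", "hr-portal",
    "build-cache", "mail-relay", "vault-proxy", "cdn-origin",
)


def recon_inventory(real_hosts: list[str], world_seed: int, n_decoys: int = 3) -> list[str]:
    """Return the real hosts mixed with seeded decoys, in seeded order."""
    rng = random.Random(world_seed)  # per-world RNG keeps output deterministic
    candidates = [h for h in DECOY_POOL if h not in real_hosts]
    decoys = rng.sample(candidates, k=min(n_decoys, len(candidates)))
    listing = list(real_hosts) + decoys
    rng.shuffle(listing)  # position must not reveal which hosts are real
    return listing


listing = recon_inventory(["metadata", "db"], world_seed=11)
```

The agent sees five hostnames, two of which are live, and has to triage them instead of reading the inventory off the page.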

Curriculum / evolution.

  • RoundMetrics.difficulty_gain — the honest "is the frontier still moving" read, distinct from frontier_capped (a run can keep admitting children while only creeping on cosmetic decoys).
  • A consequence seed-gate so a structurally-admitted-but-unsolvable world can't silently seed the pool.
  • Soften can now shorten a chain — collapse the last credential hop to rescue an agent stuck on the chain (decoy-removal couldn't); validated through the real re-admit path (parity-safe, deterministic, difficulty strictly drops).
  • Diversify now rotates the on-path technique — swaps the flag-reading oracle's class within loot-compatible shapes (was: only off-path decoys), without stranding the flag.
  • Fixed a harden-path crash (a curriculum-introduced vuln rendered on undefined params for 6 of 9 classes).
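
The first two bullets can be sketched together. The field layout of `RoundMetrics`, the `Candidate` type, and the `gated_members` / `round_metrics` helpers are assumptions made for illustration; in particular, the `solvable` flag stands in for whatever consequence check the real gate runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Candidate:
    world_id: str
    difficulty: float
    solvable: bool  # stand-in for the result of a real consequence check


# Hypothetical gate type: decides whether a candidate may seed the pool.
SeedGate = Callable[[Candidate], bool]


def gated_members(candidates: Iterable[Candidate], gate: SeedGate) -> Iterator[Candidate]:
    """Yield only the candidates the gate accepts."""
    for candidate in candidates:
        if gate(candidate):
            yield candidate


@dataclass(frozen=True)
class RoundMetrics:
    admitted: int
    difficulty_gain: float


def round_metrics(prev_frontier: float, admitted: list[Candidate]) -> RoundMetrics:
    # Gain is how far the hardest admitted world moved past the old frontier.
    best = max((c.difficulty for c in admitted), default=prev_frontier)
    return RoundMetrics(admitted=len(admitted), difficulty_gain=max(0.0, best - prev_frontier))


children = [
    Candidate("a", 3.0, True),
    Candidate("b", 9.0, False),  # structurally fine, but unsolvable
    Candidate("c", 3.0, True),
]
admitted = list(gated_members(children, lambda c: c.solvable))
metrics = round_metrics(prev_frontier=3.0, admitted=admitted)
```

Here two children are admitted, yet `difficulty_gain` is zero: the round kept growing the pool without moving the frontier, which is the case an admission count alone would hide.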

Training substrate. The structural answer to "PROCESS feels toyish": CONTAINER is now the floor for the training/eval measurement. A file-read or sandboxed world escalates to the real container, where reward is genuine rather than silently 0; in-band worlds keep the cheap ~50 ms PROCESS path. If that escalation needs Docker and Docker is absent, the run fails loud; it never silently trains on an emulation the agent can't exploit. New Pack.minimum_backing SDK hook (default PROCESS, so the SWE/trading packs inherit it untouched).
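
A hedged sketch of how such a floor could be resolved. Only the hook name `minimum_backing` and the two backing names come from the description above; the `Backing` enum, the world dictionary keys, `resolve_backing`, and `BackingUnavailable` are hypothetical. The enum is an `IntEnum` so that `max()` has a well-defined order.

```python
from enum import IntEnum


class Backing(IntEnum):
    # Explicit integer ranks make max() pick the stronger backing.
    PROCESS = 0
    CONTAINER = 1


class BackingUnavailable(RuntimeError):
    """Raised when a world needs a backing the host cannot provide."""


class Pack:
    def minimum_backing(self, world: dict) -> Backing:
        # Default floor: packs that do not override this stay on PROCESS.
        return Backing.PROCESS


class CyberPack(Pack):
    def minimum_backing(self, world: dict) -> Backing:
        # Assumed world flags; these worlds cannot be scored honestly in-process.
        if world.get("file_read") or world.get("sandboxed"):
            return Backing.CONTAINER
        return Backing.PROCESS


def resolve_backing(pack: Pack, world: dict, requested: Backing, docker_available: bool) -> Backing:
    chosen = max(requested, pack.minimum_backing(world))
    if chosen is Backing.CONTAINER and not docker_available:
        # Fail loud instead of silently falling back to an emulation.
        raise BackingUnavailable("world requires CONTAINER but Docker is not available")
    return chosen
```

An in-band world resolves to `Backing.PROCESS` regardless of Docker, while a file-read world either gets `Backing.CONTAINER` or raises.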

Process note

Two multi-agent audits gated this work and earned their cost. The plan audit caught three flaws in my own proposals before I built them (a max() over an unordered enum that picked the wrong backing; a chain conversion that would have stranded the flag; a threshold that would have put domain knowledge in core). The implementation/gap audits caught a real regression in already-"done" work (a docker-gated test left red by the recon change) and missing test coverage, added here (the previously untested fail-loud safety path).

Deferred

Turning a recon→pivot company world into a deeper credential chain is the one large, higher-risk item left — tracked with its full implementation recipe in #343.

🤖 Generated with Claude Code

larstalian and others added 3 commits June 22, 2026 21:08
Drop the comments that just restate what the code/identifier already says
(BootEpisode's signature, the dial test's body, the boot helper's prose);
keep the WHY ones (the boundary-injection and fresh-runtime rationale).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…substrate

Headline fix: under the PROCESS backing (the training/eval default), the internal
cloud-metadata endpoint was a directly-servable route, leaking the flag with zero
exploitation in 13/13 company worlds. Internal /svc/<name> endpoints are now
dispatch-only -- reachable only via the in-process SSRF pivot, as CONTAINER's real
per-service network enforces -- gated to networked worlds so flat worlds are untouched.

Also in this pass (each verified live + tested; gated by two multi-agent audits):
- difficulty: the 2-service-toy > 8-service-company inversion is gone (a capped
  internal-host fan-out term; dropped the cost-neutral POST body bump)
- realism: recon no longer leaks the exact internal inventory (decoy-host chaff);
  per-world-unique decoy files; larger flag-key / host / service-name pools
- curriculum: a RoundMetrics.difficulty_gain "is the frontier moving" signal; a
  consequence seed-gate so an unsolvable world can't seed the pool; soften can now
  shorten a chain to rescue a chain-stuck agent; diversify rotates the on-path
  technique within loot-compatible classes; fixed a harden-path render crash
- training substrate: CONTAINER is now the floor for the training/eval measurement
  (file-read / sandboxed worlds escalate to the real container; in-band worlds keep
  the cheap PROCESS path; fails loud rather than silently 0-reward an emulation the
  agent can't exploit). New Pack.minimum_backing SDK hook, default PROCESS.

Full suite green (959); ruff / mypy / boundary / coverage pass. Deferred company
chain-deepening tracked in #343.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A discipline pass over the comments this branch added, per .rules
(default no comments; add one only when the WHY is non-obvious). Several
blocks were defending against a gotcha or restating a constraint in more
than one place, so the real fix was in the code, not the prose:

- diversify method-flip: guard it to sole-vuln endpoints so it can never
  disturb a co-located decoy, dropping the "this is safe because..."
  defense. The co-located case (db seed 11) stays solvable.
- pool seed: extract _gated_members so WorldPool.seed and EvalPool.seed
  share one loop, removing a comment that was duplicated at both sites
  (and that the SeedGate alias already documents).
- difficulty_gain: explain the metric once, on the public RoundMetrics
  field, not also (with a typo) inside _grow.
- loot-shape constraint: one source of truth (_ORACLE_SHAPES_FOR_LOOT),
  not restated across three comments.
- correct a false claim: a curriculum-added vuln draws from the same
  pools as a sampled one; it is not "byte-identical".
- drop history/version references and an underscore-helper docstring.

Behavior is unchanged except the method-flip guard. ruff, mypy, and the
impacted tests are green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian closed this Jun 24, 2026
@larstalian larstalian reopened this Jun 24, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 24, 2026

Second discipline pass after the diff still read as comment-heavy. Ran a
root-fix-or-delete sweep over every comment in the changed files (not just
the new ones), taking the strongest move per comment:

- root fix: removed a dead ``ONTOLOGY_ID`` keep-alive import in mutation.py;
  renamed difficulty's ``internal - chain_hops - 1`` to a self-documenting
  ``walk_and_target_hosts`` so the comment is unnecessary.
- delete: dropped section headers, pool-convention WHAT comments, restated
  step labels, and four docstrings that just restated name + signature.
- trim: shortened the rest to the one non-obvious WHY, dropping task/version
  refs and clauses already carried by names or a single source of truth.

~76 inline-comment lines removed (sampling.py 142 -> 94). Behavior is
unchanged apart from the inert rename and dead-import removal. ruff, mypy,
and the impacted tests (330) are green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian force-pushed the cyber/honest-worldgen-evolution branch from 7a68840 to e50e58d June 24, 2026 17:15
larstalian and others added 2 commits June 24, 2026 12:24
Integrate the commits main gained since this branch forked: #341
(dashboard), #342 (the comment trim that is also this branch's first
commit), #345/#348 (custom reward_fn reaches the pool + training-path
robustness), and #350 (the self-verifying LLM-Builder seam).

Two conflicts, both resolved toward main's newer code:
- openrange-trl _report_scalar: keep main's reward_fn(report) (#345),
  with this branch's trimmed comment.
- cyber llm_realize: #350 replaced the inline realize_world loop with the
  _HandlerAuthor / realize_verified seam; took main's refactor wholesale
  (this branch's only edit there was a now-moot comment deletion).

ruff, mypy, boundary, and 311 impacted tests are green on the merged tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian marked this pull request as ready for review June 25, 2026 04:53
@larstalian larstalian merged commit 4ce54b2 into main Jun 25, 2026
2 checks passed
@larstalian larstalian deleted the cyber/honest-worldgen-evolution branch June 25, 2026 05:28