Find the layer where a language model commits a decision — and steer it. Works on any open-weight HF model.
A small, dependency-light (torch + transformers only) tool that locates where inside a model a discrete
decision — call this tool vs that one, finish vs keep going, say yes vs no — is computed, and then
steers it. It is the method behind the WANDERING arc's first positive result
(paper #6, "The Lever Is Late"): for a long-horizon agent, the
control surface of an action decision is a late action-commitment block, ~30 layers downstream of the
mid-layer representation that predicts it — and steering that late block with a task-matched donor makes
the agent actually emit the action.
The field keeps finding that internal states which predict a behavior do not control it (the
knowledge–action gap). decision-locator turns that into something actionable: for a specific decision it
answers (1) where is it readable, (2) where is it writable, and (3) does writing it change the output —
not just a probability bump.
pip install git+https://github.com/OpenInterpretability/decision-locator
Requires torch and transformers. Verify it works on your machine in ~2s (tiny random model, no download):
decision-locator selftest # -> SELFTEST PASSED
decision-locator demo --model gpt2 runs the full pipeline on a sentiment decision (P(" great") vs " terrible") and prints, even on a 124M model:
1) LOCATE — logit-lens gap on the positive prompt, per layer:
   L 5: -1.22
   L 6: +0.36 #
   L 7: +1.31 #####
   L 8: +2.18 ########   <- decision emerges mid-late
2) SWEEP_PATCH — inject the positive donor into the negative prompt; ΔP(target) (control in parens):
   L 6: +0.152 (0.045)
   L 7: +0.502 (0.137)
   L 8: +0.643 (0.297)   <- most writable, donor-specific, at L7
3) STEER — patch the lever at the decision point:
decision P(' great'): 0.10 -> 0.60 (steered @L7)
The decision is invisible early, becomes readable mid-late, and is causally writable late and donor-specific (above a neutral-donor control) — the same shape as the headline Qwen3.6-27B result, on a model you can run on a laptop.
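The "donor-specific" reading of step 2 can be checked by hand. The sketch below subtracts the control ΔP from the donor ΔP at each layer, using the three gpt2 values printed in the demo output, and picks the layer with the largest margin. The helper name donor_margin is hypothetical and is not part of decision_locator; the package's own commitment_layer may apply a different rule.

```python
# ΔP(target) from the gpt2 demo output: layer -> (donor, neutral control)
demo_dp = {6: (0.152, 0.045), 7: (0.502, 0.137), 8: (0.643, 0.297)}


def donor_margin(dp_by_layer):
    """Hypothetical helper: donor ΔP minus control ΔP, per layer."""
    return {layer: donor - control for layer, (donor, control) in dp_by_layer.items()}


margins = donor_margin(demo_dp)
best_layer = max(margins, key=margins.get)

for layer, margin in sorted(margins.items()):
    print(f"L{layer}: margin {margin:+.3f}")
print("largest donor-specific margin at layer", best_layer)
```

Layer 8 has the larger raw ΔP (0.643), but its control also rises (0.297), so layer 7 ends up with the larger margin, which matches the "at L7" annotation in the demo output.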
A decision point is a prompt whose final token leaves the model poised to choose between a target token and alternatives.
from decision_locator import DecisionLocator, commitment_layer
loc = DecisionLocator(model, tokenizer) # auto-resolves decoder layers + final norm
# 1) LOCATE — where does the decision become readable? (logit-lens gap per layer)
gaps = loc.locate(ids, target_id=FINISH, alt_ids=[BASH, EDIT])
# 2) SWEEP_PATCH — where is it writable? (ΔP(target) injecting a donor state per layer)
donor_by_layer = {L: loc.donor_state([success_ids], L) for L in LATE_LAYERS}
dP = loc.sweep_patch(ids, donor_by_layer, LATE_LAYERS, FINISH, [BASH, EDIT])
lever = commitment_layer(dP) # the layer with the largest causal effect
# 3) STEER_GENERATE — does writing it change the output? (patch the decision position, decode freely)
donor = loc.donor_state([task_matched_success_ids], lever) # a task-matched donor beats a class mean
print(loc.steer_generate(ids, layer=lever, donor=donor)) # e.g. "=finish> <parameter=output> ..."
Key finding baked into the API: a task-matched single donor steers far better than a class mean (paper #6:
42% real-finish flip with a task-matched donor, p=0.031, vs 25% n.s. for the class mean). Prefer a
one-element donor_state([same_task_success_ids], L) over an averaged class.
On a real long-horizon coding agent (Qwen3.6-27B, SWE-bench Pro), the finish decision is flat through layer
31 and only emerges in the last ~12 of 64 layers. Patching the late block at the decision point makes a stuck
("WANDERING") agent emit a real finish call in 42% of cases (exact one-sided McNemar p = 0.031) — but
only with a task-matched donor. The knowledge–action gap on agents is a layer gap: the decision is known
mid-stream and only writable late. Full paper + figures: DOI 10.5281/zenodo.20534219.
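For readers unfamiliar with the test quoted here, this is a minimal sketch of an exact one-sided McNemar test. It uses only the discordant pairs: b cases that flipped toward the target behavior under the intervention and c cases that flipped away. The counts in the example are illustrative, chosen to show how a p-value of this size can arise; they are not taken from the paper.

```python
from math import comb


def mcnemar_exact_one_sided(b, c):
    """Exact one-sided McNemar p-value: P(X >= b) for X ~ Binomial(b + c, 0.5).

    b: discordant pairs that improved under the intervention
    c: discordant pairs that got worse
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


# Illustrative counts only (not from the paper): 5 flips toward the target, 0 away.
p = mcnemar_exact_one_sided(5, 0)
print(f"p = {p:.5f}")  # 0.5 ** 5 = 0.03125
```

With every discordant pair pointing the same way, the p-value is simply 0.5 raised to the number of pairs, so small samples can reach significance only when the flips are nearly all in one direction.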
- locate projects the residual at each layer through the final norm + unembedding (a logit lens) onto the
  target − mean(alts) direction. The layer where this jumps is where the decision becomes readable.
- sweep_patch replaces the last-position residual at each layer with a donor state and measures
  ΔP(target). A donor-specific jump (above a control donor) marks the commitment layer (the lever).
- steer_generate patches the commitment layer at the decision position only (prefill), then decodes freely.
  It does not patch at every decoding step, because that degenerates into repetition.
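The patching step can be illustrated with plain PyTorch forward hooks. This is a minimal sketch on a toy model, not decision_locator's implementation: ToyLM, Block, capture_last, and patched_last_probs are all hypothetical names, and the toy blocks have no attention. It captures a donor's last-position output at one layer, writes it into a recipient run at the same layer, and records ΔP(target).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class Block(nn.Module):
    """Toy per-position residual block standing in for a decoder layer."""

    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)

    def forward(self, h):
        return h + torch.tanh(self.ff(h))


class ToyLM(nn.Module):
    def __init__(self, vocab=16, d=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([Block(d) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d)
        self.unembed = nn.Linear(d, vocab, bias=False)

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = layer(h)
        return self.unembed(self.norm(h))


def last_probs(model, ids):
    """Next-token distribution at the final position."""
    with torch.no_grad():
        return torch.softmax(model(ids)[0, -1], dim=-1)


def capture_last(model, ids, layer_idx):
    """Record the last-position output of one layer (the donor state)."""
    store = {}

    def hook(module, inputs, output):
        store["h"] = output[:, -1, :].detach().clone()

    handle = model.layers[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(ids)
    finally:
        handle.remove()
    return store["h"]


def patched_last_probs(model, ids, layer_idx, donor):
    """Run the recipient with its last-position layer output replaced by the donor."""

    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, -1, :] = donor
        return patched  # a non-None return value replaces the layer output

    handle = model.layers[layer_idx].register_forward_hook(hook)
    try:
        return last_probs(model, ids)
    finally:
        handle.remove()


model = ToyLM().eval()
recipient = torch.tensor([[1, 2, 3]])
donor_ids = torch.tensor([[4, 5, 6]])
target_id = 7

base = last_probs(model, recipient)
delta_p = {}
for layer_idx in range(len(model.layers)):
    donor = capture_last(model, donor_ids, layer_idx)
    patched = patched_last_probs(model, recipient, layer_idx, donor)
    delta_p[layer_idx] = (patched[target_id] - base[target_id]).item()
print(delta_p)
```

Because the toy blocks never mix positions, patching any layer reproduces the donor run at the last position, so ΔP is the same at every layer here. In a real transformer the layers after the patch still attend to the recipient's context, which is why ΔP varies with depth and a sweep is informative.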
Auto-resolves decoder layers and the final norm for Qwen, Llama, Mistral, GPT-2, GPT-NeoX, and similar HF causal
LMs. For anything else: DecisionLocator(model, tok, layer_modules=list(model.<...>.layers)).
- The steered effect is partial and task-conditioned — there is no single generic "finish direction". Use a task-matched donor.
- A positive ΔP is necessary but not sufficient: confirm with a real generation. Small models often move the decision probability without flipping the greedy text.
- White-box only (needs the open weights).
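The gap between a positive ΔP and an actual change in output is easy to see with two tokens. The sketch below uses made-up logits (they come from no model) to show a steering step that raises P(target) substantially while the greedy choice stays on the alternative.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


TARGET, ALT = 0, 1
before = softmax([0.0, 3.0])  # made-up logits: [target, alternative]
after = softmax([2.0, 3.0])   # steering raises the target logit by 2

delta_p = after[TARGET] - before[TARGET]
greedy_before = max(range(2), key=lambda i: before[i])
greedy_after = max(range(2), key=lambda i: after[i])

print(f"ΔP(target) = {delta_p:+.3f}")
print(f"greedy token before: {greedy_before}, after: {greedy_after}")
```

The target probability rises by more than 0.2, yet greedy decoding still emits the alternative, which is why a free generation is the real check.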
@misc{vicentino2026leverislate,
title = {The Lever Is Late: Causal Control of Long-Horizon Agent Termination Lives in a Task-Matched, Late Action-Commitment Block},
author = {Vicentino, Caio},
year = {2026},
doi = {10.5281/zenodo.20534219},
note = {OpenInterpretability. WANDERING arc paper #6.}
}
Part of the WANDERING arc on long-horizon agent failure. Built by OpenInterpretability.

