decision-locator

Find the layer where a language model commits a decision — and steer it. Works on any open-weight HF model.

A small, dependency-light (torch + transformers only) tool that locates where inside a model a discrete decision (call this tool vs. that one, finish vs. keep going, say yes vs. no) is computed, and then steers it. It is the method behind the WANDERING arc's first positive result (paper #6, "The Lever Is Late"): for a long-horizon agent, the control surface of an action decision is a late action-commitment block, ~30 layers downstream of the mid-layer representation that predicts it. Steering that late block with a task-matched donor makes the agent actually emit the action.

The field keeps finding that internal states which predict a behavior do not control it (the knowledge–action gap). decision-locator turns that into something actionable: for a specific decision it answers (1) where is it readable, (2) where is it writable, and (3) does writing it change the generated output, rather than only bumping a probability.

Install

pip install git+https://github.com/OpenInterpretability/decision-locator

Requires torch and transformers. Verify it works on your machine in ~2s (tiny random model, no download):

decision-locator selftest        # -> SELFTEST PASSED

60-second demo (on a real model)

decision-locator demo --model gpt2

This runs the full pipeline on a sentiment decision (target " great" vs. alternative " terrible") and prints the following, even on a 124M-parameter model:

1) LOCATE — logit-lens gap on the positive prompt, per layer:
   L 5: -1.22     L 6: +0.36 #     L 7: +1.31 #####     L 8: +2.18 ########   <- decision emerges mid-late
2) SWEEP_PATCH — inject the positive donor into the negative prompt; ΔP(target) (control in parens):
   L 6: +0.152 (0.045)   L 7: +0.502 (0.137)   L 8: +0.643 (0.297)           <- most writable, donor-specific, at L7
3) STEER — patch the lever at the decision point:
   decision P(' great'): 0.10 -> 0.60 (steered @L7)

The decision is invisible early, becomes readable mid-late, and is causally writable late and donor-specific (above a neutral-donor control) — the same shape as the headline Qwen3.6-27B result, on a model you can run on a laptop.
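One way to read the SWEEP_PATCH numbers above: L8 has the largest raw ΔP, but L7 has the largest margin over the neutral-donor control, which is what "donor-specific" refers to. The sketch below recomputes that margin from the printed demo values. The selection rule in pick_lever is an assumption made for illustration; the package's own commitment_layer may score layers differently.

```python
# ΔP(target) per layer from the demo above: (task donor, neutral control donor).
sweep = {6: (0.152, 0.045), 7: (0.502, 0.137), 8: (0.643, 0.297)}

def donor_specific_margins(sweep):
    """Subtract the control-donor effect from the task-donor effect, per layer."""
    return {layer: donor - control for layer, (donor, control) in sweep.items()}

def pick_lever(sweep):
    """Hypothetical selection rule: the layer with the largest margin over control."""
    margins = donor_specific_margins(sweep)
    return max(margins, key=margins.get)

margins = donor_specific_margins(sweep)
lever = pick_lever(sweep)
largest_raw = max(sweep, key=lambda layer: sweep[layer][0])

for layer in sorted(sweep):
    print(f"L{layer}: raw {sweep[layer][0]:+.3f}  margin {margins[layer]:+.3f}")
print("largest raw ΔP at layer", largest_raw, "| largest margin at layer", lever)
```

Scoring by margin rather than raw ΔP guards against late layers where almost any donor moves the output.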

Python — the three primitives

A decision point is a prompt whose final token leaves the model poised to choose between a target token and alternatives.

from decision_locator import DecisionLocator, commitment_layer

loc = DecisionLocator(model, tokenizer)              # auto-resolves decoder layers + final norm

# 1) LOCATE — where does the decision become readable? (logit-lens gap per layer)
gaps = loc.locate(ids, target_id=FINISH, alt_ids=[BASH, EDIT])

# 2) SWEEP_PATCH — where is it writable? (ΔP(target) injecting a donor state per layer)
donor_by_layer = {L: loc.donor_state([success_ids], L) for L in LATE_LAYERS}
dP    = loc.sweep_patch(ids, donor_by_layer, LATE_LAYERS, FINISH, [BASH, EDIT])
lever = commitment_layer(dP)                          # the layer with the largest causal effect

# 3) STEER_GENERATE — does writing it change the output? (patch the decision position, decode freely)
donor = loc.donor_state([task_matched_success_ids], lever)   # a task-matched donor beats a class mean
print(loc.steer_generate(ids, layer=lever, donor=donor))     # e.g. "=finish> <parameter=output> ..."

Key finding baked into the API: a task-matched single donor steers far better than a class mean (paper #6: 42% real-finish flips with a task-matched donor, p=0.031, vs. 25%, not significant, for the class mean). Prefer a one-element donor_state([same_task_success_ids], L) over an averaged class.

The result that motivates it

On a real long-horizon coding agent (Qwen3.6-27B, SWE-bench Pro), the finish decision is flat through layer 31 and only emerges in the last ~12 of 64 layers. Patching the late block at the decision point makes a stuck ("WANDERING") agent emit a real finish call in 42% of cases (exact one-sided McNemar p = 0.031) — but only with a task-matched donor. The knowledge–action gap on agents is a layer gap: the decision is known mid-stream and only writable late. Full paper + figures: DOI 10.5281/zenodo.20534219.
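For readers unfamiliar with the test named above, an exact one-sided McNemar test reduces to a binomial tail over the discordant pairs: runs where steering changed the outcome in one direction or the other. The sketch below is a minimal standard-library version. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; the paper's contingency table is not reproduced here.

```python
from math import comb

def mcnemar_exact_one_sided(b, c):
    """Exact one-sided McNemar p-value.

    b: pairs that flipped from failure (baseline) to success (steered)
    c: pairs that flipped the other way
    Under the null, each discordant pair is a fair coin, so the p-value
    is P(X >= b) for X ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# Hypothetical counts, not taken from the paper.
p_all_one_way = mcnemar_exact_one_sided(5, 0)
p_mixed = mcnemar_exact_one_sided(6, 1)
print(p_all_one_way, p_mixed)
```

Concordant pairs (runs that succeed or fail under both conditions) do not enter the statistic, so a small number of one-directional flips can already reach significance.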

How it works

locate projects the residual at each layer through the final norm + unembedding (a logit lens) onto the target − mean(alts) direction. The layer where this jumps is where the decision becomes readable.
sweep_patch replaces the last-position residual at each layer with a donor state and measures ΔP(target). A donor-specific jump (above a control donor) marks the commitment layer — the lever.
steer_generate patches the commitment layer at the decision position only, during prefill, then decodes freely. Patching at every decoding step degenerates into repetition, so steer_generate patches once and leaves the generated tokens unpatched.
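The logit-lens gap used by locate can be illustrated with plain arithmetic. The sketch below is a dependency-free toy, not the package's implementation: rms_norm stands in for the model's final norm (without a learned scale), the unembedding is a 3x3 identity matrix, and the per-layer residual vectors are made-up numbers.

```python
import math

def rms_norm(h, eps=1e-6):
    """Stand-in for the model's final norm (RMSNorm, no learned scale)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]

def logit_lens_gap(h, unembed, target_id, alt_ids):
    """Project one residual vector to logits; return logit[target] - mean(logit[alts])."""
    normed = rms_norm(h)
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    return logits[target_id] - sum(logits[a] for a in alt_ids) / len(alt_ids)

# Toy last-position residuals, one per layer (hypothetical numbers).
residual_by_layer = [
    [0.1, 1.0, 0.9],  # early: alternatives dominate
    [1.0, 0.9, 0.7],  # middle: target slightly ahead
    [2.0, 0.5, 0.3],  # late: target clearly ahead
]
# Toy unembedding: vocabulary of 3 tokens, row i reads out hidden dimension i.
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

gaps = [logit_lens_gap(h, unembed, target_id=0, alt_ids=[1, 2])
        for h in residual_by_layer]
for layer, gap in enumerate(gaps):
    print(f"L{layer}: {gap:+.2f}")
```

In a real model the residuals would come from the hidden states at the decision position and the unembedding from the LM head; the layer where the gap changes sign or jumps is the "readable" point.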

Supported models

Auto-resolves decoder layers and the final norm for Qwen, Llama, Mistral, GPT-2, GPT-NeoX, and similar HF causal LMs. For anything else: DecisionLocator(model, tok, layer_modules=list(model.<...>.layers)).
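As a rough picture of what auto-resolution can look like, the sketch below probes a few attribute paths commonly used by HF causal LMs for the decoder stack and the final norm. It is an assumption-laden illustration, not the package's resolver: the helper names get_path and first_match are hypothetical, the path lists are not exhaustive, and the stand-in objects only mimic attribute layout.

```python
from types import SimpleNamespace

# Common layouts: Llama/Qwen/Mistral style, GPT-2 style, GPT-NeoX style.
LAYER_PATHS = ("model.layers", "transformer.h", "gpt_neox.layers")
NORM_PATHS = ("model.norm", "transformer.ln_f", "gpt_neox.final_layer_norm")

def get_path(obj, dotted):
    """Follow a dotted attribute path; return None if any hop is missing."""
    for name in dotted.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj

def first_match(model, paths):
    """Return (path, value) for the first dotted path that exists on the model."""
    for path in paths:
        value = get_path(model, path)
        if value is not None:
            return path, value
    raise AttributeError(f"none of {paths} found; pass the layers explicitly")

# Stand-in objects that only mimic the attribute layout (no real weights).
gpt2_like = SimpleNamespace(
    transformer=SimpleNamespace(h=["block0", "block1"], ln_f="ln_f"))
llama_like = SimpleNamespace(
    model=SimpleNamespace(layers=["block0"], norm="norm"))

print(first_match(gpt2_like, LAYER_PATHS)[0])   # transformer.h
print(first_match(llama_like, NORM_PATHS)[0])   # model.norm
```

When no known path matches, failing loudly and asking for an explicit layer list mirrors the layer_modules escape hatch described above.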

Caveats (from the paper)

The steered effect is partial and task-conditioned — there is no single generic "finish direction". Use a task-matched donor.
A positive ΔP is necessary but not sufficient — confirm with a real generation. Small models often move the decision probability without flipping the greedy text.
White-box only (needs the open weights).
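The second caveat can be made concrete with a three-token toy: steering raises P(target) substantially, yet the greedy token does not change because an alternative still has the highest logit. The logits below are made-up numbers for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

TARGET = 0  # index of the target token in this toy vocabulary

# Hypothetical next-token logits at the decision position.
before = [0.0, 2.0, 0.0]   # unsteered
after = [1.5, 2.0, 0.0]    # steered: target logit rises but stays below token 1

p_before = softmax(before)[TARGET]
p_after = softmax(after)[TARGET]
delta_p = p_after - p_before

greedy_before = max(range(len(before)), key=before.__getitem__)
greedy_after = max(range(len(after)), key=after.__getitem__)

print(f"P(target): {p_before:.2f} -> {p_after:.2f} (ΔP = {delta_p:+.2f})")
print("greedy token changed:", greedy_before != greedy_after)
```

A positive ΔP here coexists with an unchanged greedy choice, which is why a sweep result should be followed by an actual generation.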

Cite

@misc{vicentino2026leverislate,
  title  = {The Lever Is Late: Causal Control of Long-Horizon Agent Termination Lives in a Task-Matched, Late Action-Commitment Block},
  author = {Vicentino, Caio},
  year   = {2026},
  doi    = {10.5281/zenodo.20534219},
  note   = {OpenInterpretability. WANDERING arc paper #6.}
}

Part of the WANDERING arc on long-horizon agent failure. Built by OpenInterpretability.
