decision-locator

Find the layer where a language model commits a decision — and steer it. Works on any open-weight HF model.



A small, dependency-light (torch + transformers only) tool that locates where inside a model a discrete decision is computed (call this tool vs. that one, finish vs. keep going, say yes vs. no) and then steers it. It is the method behind the first positive result of the WANDERING arc (paper #6, "The Lever Is Late"): for a long-horizon agent, the control surface of an action decision is a late action-commitment block, ~30 layers downstream of the mid-layer representation that predicts it. Steering that late block with a task-matched donor makes the agent actually emit the action.

The field keeps finding that internal states which predict a behavior do not control it (the knowledge–action gap). decision-locator turns that into something actionable. For a specific decision it answers three questions: (1) where is it readable, (2) where is it writable, and (3) does writing it change the actual output, not just bump a probability?

Install

pip install git+https://github.com/OpenInterpretability/decision-locator

Requires torch and transformers. Verify it works on your machine in ~2s (tiny random model, no download):

decision-locator selftest        # -> SELFTEST PASSED

60-second demo (on a real model)

decision-locator demo --model gpt2

This runs the full pipeline on gpt2 (laptop, offline, CPU) for a sentiment decision, P(" great") vs. P(" terrible"), and prints the following, even on a 124M-parameter model:

1) LOCATE — logit-lens gap on the positive prompt, per layer:
   L 5: -1.22     L 6: +0.36 #     L 7: +1.31 #####     L 8: +2.18 ########   <- decision emerges mid-late
2) SWEEP_PATCH — inject the positive donor into the negative prompt; ΔP(target) (control in parens):
   L 6: +0.152 (0.045)   L 7: +0.502 (0.137)   L 8: +0.643 (0.297)           <- most writable, donor-specific, at L7
3) STEER — patch the lever at the decision point:
   decision P(' great'): 0.10 -> 0.60 (steered @L7)

The decision is invisible early, becomes readable mid-late, and is causally writable late in a donor-specific way (above a neutral-donor control). This is the same shape as the headline Qwen3.6-27B result, on a model you can run on a laptop.
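One way to read step 2 of the demo output above is to subtract the control ΔP from the donor ΔP at each layer and take the largest margin. The sketch below does that arithmetic with the printed numbers. Treating "donor-specific" as "ΔP minus control" is an assumption made here for illustration; the tool's own commitment_layer may apply a different rule.

```python
# layer -> (ΔP with the positive donor, ΔP with the neutral control donor),
# copied from the SWEEP_PATCH line of the demo output.
sweep = {6: (0.152, 0.045), 7: (0.502, 0.137), 8: (0.643, 0.297)}

# Assumed criterion: donor-specific margin = donor ΔP minus control ΔP.
margin = {layer: round(dp - ctrl, 3) for layer, (dp, ctrl) in sweep.items()}
lever = max(margin, key=margin.get)

print(margin)  # {6: 0.107, 7: 0.365, 8: 0.346}
print(lever)   # 7
```

Under this reading L8 has the largest raw ΔP, but L7 has the largest margin over the control, which matches the "donor-specific, at L7" annotation in the demo output.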

Python — the three primitives

A decision point is a prompt whose final token leaves the model poised to choose between a target token and alternatives.

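# Placeholders, not defined in this snippet (meanings assumed from their names):
#   model / tokenizer: a loaded HF causal LM and its tokenizer
#   ids: the tokenized decision-point prompt; FINISH / BASH / EDIT: candidate token ids
#   LATE_LAYERS: layer indices to sweep; success_ids / task_matched_success_ids: donor prompts
from decision_locator import DecisionLocator, commitment_layer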

loc = DecisionLocator(model, tokenizer)              # auto-resolves decoder layers + final norm

# 1) LOCATE — where does the decision become readable? (logit-lens gap per layer)
gaps = loc.locate(ids, target_id=FINISH, alt_ids=[BASH, EDIT])

# 2) SWEEP_PATCH — where is it writable? (ΔP(target) injecting a donor state per layer)
donor_by_layer = {L: loc.donor_state([success_ids], L) for L in LATE_LAYERS}
dP    = loc.sweep_patch(ids, donor_by_layer, LATE_LAYERS, FINISH, [BASH, EDIT])
lever = commitment_layer(dP)                          # the layer with the largest causal effect

# 3) STEER_GENERATE — does writing it change the output? (patch the decision position, decode freely)
donor = loc.donor_state([task_matched_success_ids], lever)   # a task-matched donor beats a class mean
print(loc.steer_generate(ids, layer=lever, donor=donor))     # e.g. "=finish> <parameter=output> ..."

Key finding baked into the API: a task-matched single donor steers far better than a class mean (paper #6: 42% real-finish flips with a task-matched donor, p = 0.031, vs. 25%, not significant, for the class mean). Prefer a one-element donor_state([same_task_success_ids], L) over an averaged class.

The result that motivates it

On a real long-horizon coding agent (Qwen3.6-27B, SWE-bench Pro), the finish decision is flat through layer 31 and only emerges in the last ~12 of 64 layers. Patching the late block at the decision point makes a stuck ("WANDERING") agent emit a real finish call in 42% of cases (exact one-sided McNemar p = 0.031) — but only with a task-matched donor. The knowledge–action gap on agents is a layer gap: the decision is known mid-stream and only writable late. Full paper + figures: DOI 10.5281/zenodo.20534219.
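For readers unfamiliar with the statistic quoted above, here is a minimal sketch of an exact one-sided McNemar test: a binomial tail probability over the discordant pairs. The function name and the example counts are illustrative assumptions; the paper's actual counts are not stated in this README.

```python
from math import comb

def mcnemar_exact_one_sided(b: int, c: int) -> float:
    """P(X >= b) for X ~ Binomial(b + c, 0.5).

    b = pairs that flipped toward the target only when steered,
    c = pairs that flipped the other way.
    """
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# Hypothetical counts, chosen only to show the arithmetic.
print(mcnemar_exact_one_sided(5, 0))  # 0.03125
print(mcnemar_exact_one_sided(3, 0))  # 0.125
```

With no flips in the reverse direction the p-value is simply 0.5 raised to the number of flips, which is why a handful of paired cases can reach or miss significance.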

How it works

  • locate projects the residual at each layer through the final norm + unembedding (a logit lens) onto the target − mean(alts) direction. The layer where this jumps is where the decision becomes readable.
  • sweep_patch replaces the last-position residual at each layer with a donor state and measures ΔP(target). A donor-specific jump (above a control donor) marks the commitment layer — the lever.
  • steer_generate patches the commitment layer at the decision position only (prefill), then decodes freely. Patching at every decoding step degenerates into repetition, so steer_generate deliberately does not do that.
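The locate and sweep_patch ideas in the list above can be mimicked on a toy model in pure Python. This is a sketch, not the library's implementation: W, UNEMBED, CONTEXT, run, and lens_gap are hypothetical names, there is no final norm, and CONTEXT stands in for what each layer reads from earlier positions, which patching the last position leaves untouched.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    top = max(xs)
    es = [math.exp(x - top) for x in xs]
    total = sum(es)
    return [e / total for e in es]

# Hypothetical toy model: 3 layers over a 2-d last-position residual state.
W = [[0.1, 0.0], [0.0, 0.1]]                      # shared per-layer update
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # 3-token vocabulary
CONTEXT = {                                       # per-layer input from earlier positions
    "pos": [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "neg": [[0.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
}
TARGET, ALTS = 0, [1, 2]

def run(prompt, patch_layer=None, donor=None):
    """Return (per-layer states, final probabilities); optionally patch one layer."""
    h, states = [0.0, 0.0], []
    for layer, ctx in enumerate(CONTEXT[prompt]):
        upd = matvec(W, h)
        h = [a + b + c for a, b, c in zip(h, upd, ctx)]
        if layer == patch_layer:
            h = list(donor)          # the patch: overwrite the residual state
        states.append(h)
    return states, softmax(matvec(UNEMBED, h))

def lens_gap(state):
    """Logit-lens gap: target logit minus the mean of the alternative logits."""
    logits = matvec(UNEMBED, state)
    return logits[TARGET] - sum(logits[a] for a in ALTS) / len(ALTS)

pos_states, _ = run("pos")
neg_states, base = run("neg")

gaps = [lens_gap(s) for s in pos_states]          # locate: readable where this rises

dP, control = [], []
for layer in range(len(pos_states)):
    _, p = run("neg", patch_layer=layer, donor=pos_states[layer])
    dP.append(p[TARGET] - base[TARGET])           # sweep_patch with the positive donor
    _, p = run("neg", patch_layer=layer, donor=neg_states[layer])
    control.append(p[TARGET] - base[TARGET])      # control: the prompt's own state

lever = max(range(len(dP)), key=dP.__getitem__)
print(gaps, dP, control, lever)
```

In this toy, an early patch is partly overwritten by the negative prompt's later context, so ΔP grows with depth and the last layer is the most writable. Patching a prompt with its own state changes nothing, which is what makes it a usable control.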

Supported models

Auto-resolves decoder layers and the final norm for Qwen, Llama, Mistral, GPT-2, GPT-NeoX, and similar HF causal LMs. For anything else: DecisionLocator(model, tok, layer_modules=list(model.<...>.layers)).

Caveats (from the paper)

  • The steered effect is partial and task-conditioned — there is no single generic "finish direction". Use a task-matched donor.
  • A positive ΔP is necessary but not sufficient — confirm with a real generation. Small models often move the decision probability without flipping the greedy text.
  • White-box only (needs the open weights).
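The second caveat can be made concrete with a small check: a positive ΔP only matters for greedy decoding if the target overtakes every alternative. The helper name and the distributions below are made-up illustrations, not output from the tool.

```python
def greedy_flips(p_before: dict, p_after: dict, target: str) -> bool:
    """True if `target` becomes the argmax token only after steering."""
    top_before = max(p_before, key=p_before.get)
    top_after = max(p_after, key=p_after.get)
    return top_before != target and top_after == target

# Hypothetical next-token distributions at the decision point.
before = {"finish": 0.10, "bash": 0.70, "edit": 0.20}
nudged = {"finish": 0.35, "bash": 0.45, "edit": 0.20}    # ΔP = +0.25, bash still wins
flipped = {"finish": 0.60, "bash": 0.25, "edit": 0.15}   # ΔP = +0.50, finish wins

print(greedy_flips(before, nudged, "finish"))   # False
print(greedy_flips(before, flipped, "finish"))  # True
```

Even a flipped first token is only a one-step check; a full steered generation remains the real confirmation.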

Cite

@misc{vicentino2026leverislate,
  title  = {The Lever Is Late: Causal Control of Long-Horizon Agent Termination Lives in a Task-Matched, Late Action-Commitment Block},
  author = {Vicentino, Caio},
  year   = {2026},
  doi    = {10.5281/zenodo.20534219},
  note   = {OpenInterpretability. WANDERING arc paper #6.}
}

Part of the WANDERING arc on long-horizon agent failure. Built by OpenInterpretability.
