OpenInterpretability

When should we believe a mechanistic interpretability claim — and where, inside a model, does a decision actually live?

Mechanistic interpretability of long-horizon LLM agents, built on Qwen3.6-27B since April 2026: a protocol, a benchmark, a registry, and the WANDERING arc, a six-paper study of why agents fail to stop that ends in the arc's first positive result.

openinterp.org · decision-locator · pip install openinterp · pip install openinterp-mcp · Apache-2.0


⭐ Featured — the WANDERING arc + decision-locator

Long-horizon coding agents fail by WANDERING: they stay internally sure the task is solved but never emit the finish action, burning the whole turn budget. Across six papers (Qwen3.6-27B, SWE-bench Pro, all CC-BY-4.0) we showed that the agent's "task-done" verdict is linearly decodable (AUROC 0.81–0.91) yet causally inert: no residual injection rescues the agent, and clamping the exact, named SAE "done" feature moves the probability of finishing by only −0.001. Then we found where control actually lives.
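For readers new to the metric, AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it by direct pair counting. The function name `auroc` and the score lists are hypothetical illustrations, not code or data from the papers.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical probe scores for "task done" vs. "task not done" steps.
done = [0.9, 0.3]
not_done = [0.5, 0.1]
print(auroc(done, not_done))  # 0.75: three of the four pairs are ordered correctly
```

A high AUROC only says the verdict can be read out. It says nothing about whether writing to the same direction changes behavior, which is the distinction the arc turns on.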

The law: the knowledge–action gap in agents is a layer gap. The decision is known mid-stream (the verdict, at layer 23) but only writable late (layers 51–63, ~30 layers downstream). Patching that late, task-matched block makes a stuck agent emit a real finish call 42% of the time (exact McNemar p = 0.031), from a 0% baseline.
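The exact McNemar test quoted above compares paired outcomes (the same stuck episode with and without the patch) and depends only on the discordant pairs. A minimal sketch of the two-sided exact version follows. The function name `mcnemar_exact` and the counts in the usage lines are hypothetical and are not the counts from paper #6.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs that finished only without the patch
    c: pairs that finished only with the patch
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Hypothetical counts, chosen only to show the arithmetic.
print(mcnemar_exact(b=1, c=9))  # 0.021484375
```

Concordant pairs (episodes that finish in both conditions or in neither) do not enter the statistic, which is why the test suits a before/after comparison on the same episodes.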

🛠 decision-locator packages the method — find & steer the commitment layer for any tool-calling decision on any open-weight model:

pip install git+https://github.com/OpenInterpretability/decision-locator
decision-locator demo --model gpt2     # locate → sweep → steer, on a laptop

📄 The arc, permanent DOIs: #1 Tool-Entropy · #2 Right Locus · #3 Multi-Channel · #4 Modality Matters · #5 Verdict Circuit · #6 The Lever Is Late · companion note — read them at openinterp.org/research.


What's here

Core protocol

| Repo | What |
| --- | --- |
| registry | Six Diagnostics schemas + reference implementation. JSON cards for probes, causal reports, intervention traces. Failed-Replication Registry data. |

Research artifacts

| Repo | What |
| --- | --- |
| openinterp-swebench-harness | Instrumented agent harness capturing SAE feature trajectories during agent reasoning on SWE-bench Pro. Substrate for the six-paper WANDERING arc. |
| decision-locator | pip install-able, model-agnostic tool: find the layer where a model commits a decision, and steer it. The method behind WANDERING arc paper #6. CLI + Colab + CI. |
| inspect-tool-entropy-collapse | The tool-entropy-collapse WANDERING detector as an Inspect eval (UK AISI inspect_evals submission). |
| mechreward | Mechanistic interpretability as reward signal for RL training. SAE features + GRPO + anti-Goodhart framework. |

Developer tools

| Repo | What |
| --- | --- |
| cli | pip install openinterp. FabricationGuard probe + ProbeBench leaderboard + Atlas search + Trace generation. |
| openinterp-mcp | MCP server + Colab backend. Bring-your-own-agent infrastructure for mech-interp research. Claude Code · Cursor · Cline compatible. |
| notebooks | Train your first SAE in 30 min → paper-grade at 27B. Free Colab + Kaggle + cloud ladders. |

Web

| Repo | What |
| --- | --- |
| web | openinterp.org: the protocol, the registry, the publications. |

Why this exists

Probes that hit AUROC 0.95 at N=50 collapse at N=500. SAE features that "explain" a concept fail under matched-norm random controls. Steering vectors that flip outputs turn out to be softmax temperature shifts. CoT-redirect interventions that clear sabotage end up causing it. And a "task-done" feature that predicts finishing at AUROC 0.91 doesn't cause it.
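One of the controls named above, the matched-norm random control, can be sketched in a few lines: before crediting a feature direction with an effect, apply a random direction of the same L2 norm and check whether it produces a comparable shift. The helper name `matched_norm_random` and the toy vector are hypothetical. This is not the registry's reference implementation.

```python
import math
import random


def matched_norm_random(direction, rng):
    """Sample an isotropic Gaussian vector and rescale it to the L2 norm of `direction`."""
    target = math.sqrt(sum(x * x for x in direction))
    noise = [rng.gauss(0.0, 1.0) for _ in direction]
    noise_norm = math.sqrt(sum(x * x for x in noise))
    return [x * target / noise_norm for x in noise]


rng = random.Random(0)  # fixed seed so the control is reproducible
feature_dir = [3.0, 4.0, 0.0, 0.0]  # hypothetical stand-in for an SAE decoder direction
control = matched_norm_random(feature_dir, rng)
control_norm = math.sqrt(sum(x * x for x in control))
print(round(control_norm, 6))  # 5.0, the same norm as feature_dir
```

If intervening with `control` moves the output about as much as intervening with the feature direction, the effect is attributable to perturbation size rather than to the feature.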

We caught these in our own work — with documented walk-backs and pre-registered nulls. The protocol, and the arc's first positive, are the result.

→ Read the Six Diagnostics: openinterp.org/research
→ Browse the Registry: openinterp.org/atlas
→ Eval Standard schemas: github.com/OpenInterpretability/registry


Maintainer: Caio Vicentino · caio@openinterp.org · Fortaleza, Brazil
License: Apache-2.0 (code) · CC-BY-4.0 (documentation)
Collaborate: caio@openinterp.org

Popular repositories

  1. mechreward (Public)

    Mechanistic interpretability as reward signal for RL training of LLMs — SAE features + GRPO + anti-Goodhart framework

    Jupyter Notebook 5

  2. notebooks (Public)

    Train your first SAE in 30 min → paper-grade at 27B. Free Colab · free Kaggle · cloud ladders. Every scale covered.

    Jupyter Notebook 3

  3. openinterp-swebench-harness (Public)

    Instrumented agent harness for capturing SAE feature trajectories during SWE-bench Pro traces on Qwen3.6-27B (mech anatomy of agent reasoning failure)

    Python 2

  4. web (Public)

    Next.js site for OpenInterpretability — the umbrella org for mechreward and public hybrid-architecture SAEs

    TypeScript 1 1

  5. decision-locator (Public)

    Find the layer where a language model commits a decision — and steer it. Any open-weight HF model. (WANDERING arc paper #6)

    Python 1

  6. cli (Public)

    openinterp — Python SDK + CLI. FabricationGuard hallucination probe + ProbeBench leaderboard + Atlas search + Trace generation. pip install openinterp

    Python 1
