openinterp — Python SDK + CLI for openinterp.org
Search the feature Atlas, generate Traces from your own SAE, rank against the public InterpScore leaderboard.
pip install openinterp # lite: Atlas + CLI (no torch, ~2 MB total)
pip install "openinterp[full]" # + torch/transformers/safetensors for trace generation

Requires Python ≥ 3.10.
| Repo | What's in it |
|---|---|
| .github | Org profile + shared CoC + SECURITY |
| web | Next.js site behind openinterp.org |
| notebooks | 23 training + interpretability notebooks |
| cli (you are here) | pip install openinterp (Python SDK) |
| mechreward | SAE features as dense RL reward |
$ openinterp atlas "overconfidence"
Atlas results: 'overconfidence'
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━
┃ ID ┃ Name ┃ Model ┃ AUROC ┃ Description
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━
│ f2503 │ overconfidence_pattern │ Qwen/Qwen3.6-27B │ 0.54 │ Definitive…
│ f1847 │ urgency_assessment │ Qwen/Qwen3.6-27B │ 0.68 │ Time-critic…
└─────────┴─────────────────────────┴───────────────────┴───────┴────────────
>>> from openinterp import search_features
>>> features = search_features("overconfidence", model="Qwen/Qwen3.6-27B")
>>> features[0].id
'f2503'

pip install "openinterp[full]"
openinterp trace \
--model google/gemma-2-2b \
--sae-repo YOUR_HF_USER/gemma2-2b-sae-first \
--prompt "The capital of France is" \
--layer 12 \
--d-model 2304 --d-sae 16384 --k 64 \
--out my_trace.json

This:
- Loads the base model in bf16 with SDPA (no flash-attn)
- Loads your SAE from HuggingFace (sae_lens safetensors format)
- Generates tokens, captures residuals at layer 12
- Applies the SAE, picks top-10 active features
- Writes a Trace JSON matching openinterp.org/observatory/trace byte-for-byte
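The "apply the SAE, pick the top active features" steps above can be sketched in plain Python. This is only an illustration, assuming a TopK SAE in the style of Gao et al. 2024 with an encoder matrix, an encoder bias, and a pre-encoder bias. The function names, argument names, and toy dimensions are hypothetical and are not the package's actual implementation, which runs on torch tensors.

```python
from typing import List, Tuple


def topk_sae_encode(
    residual: List[float],
    w_enc: List[List[float]],  # assumed shape: [d_sae][d_model]
    b_enc: List[float],        # assumed shape: [d_sae]
    b_dec: List[float],        # assumed shape: [d_model], subtracted before encoding
    k: int,
) -> List[Tuple[int, float]]:
    """Return up to k (feature_index, activation) pairs for one residual vector."""
    # Center the residual with the pre-encoder bias.
    centered = [x - b for x, b in zip(residual, b_dec)]
    # Pre-activations: one dot product per SAE feature.
    pre = [
        sum(w * x for w, x in zip(row, centered)) + bias
        for row, bias in zip(w_enc, b_enc)
    ]
    # TopK: keep the k largest pre-activations, then drop non-positive ones (ReLU).
    ranked = sorted(enumerate(pre), key=lambda pair: pair[1], reverse=True)[:k]
    return [(idx, act) for idx, act in ranked if act > 0.0]


def top_features(active: List[Tuple[int, float]], n: int = 10) -> List[Tuple[int, float]]:
    """Pick the n strongest features out of the active set."""
    return sorted(active, key=lambda pair: pair[1], reverse=True)[:n]


# Toy example: d_model=3, d_sae=4, k=2.
toy_w_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, 0.0]]
toy_active = topk_sae_encode(
    residual=[1.0, 2.0, 0.0],
    w_enc=toy_w_enc,
    b_enc=[0.0, 0.0, 0.0, 0.0],
    b_dec=[0.0, 0.0, 0.0],
    k=2,
)
print(top_features(toy_active, n=10))
```

With the CLI flags shown earlier (--d-sae 16384 --k 64), the same idea would keep 64 candidate features per token and report the 10 strongest.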
from openinterp import generate_trace
trace = generate_trace(
model_id="google/gemma-2-2b",
sae_repo="YOUR_HF_USER/gemma2-2b-sae-first",
prompt="The capital of France is",
layer=12,
d_model=2304,
d_sae=16384,
k=64,
)
print(trace.model_dump_json(indent=2)) # Trace Theater schema

# After running 04_discover_features.ipynb (emits feature_catalog.json):
openinterp trace ... --catalog feature_catalog.json

Trace features inherit names from your catalog.
FabricationGuard is a production hallucination probe for Qwen3.6-27B: AUROC 0.88 cross-task on SimpleQA, 88% fewer confident-wrong answers in mitigation mode, and roughly 1 ms scoring latency.
from openinterp import FabricationGuard
guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B")
output = guard.generate("Who won the 2003 Nobel Prize in Aerodynamics?", mode="abstain")
# → "I don't have reliable information to answer this confidently."

Methodology lineage: extends Anthropic's persona-vectors approach (Aug 2025, tested on 7-8B) to Qwen3.6-27B (3-4× larger) with formal cross-task AUROC + bootstrap CIs + mitigation-rate evaluation. Apache-2.0 production-grade implementation, not a proprietary platform. Probe artifact: caiovicentino1/FabricationGuard-linearprobe-qwen36-27b. Live demo: openinterp.org/products/fabricationguard.
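For readers unfamiliar with the evaluation named above, here is a minimal sketch of AUROC with a percentile bootstrap confidence interval. It uses only the standard library; the function names, the resample count, and the toy data are assumptions for illustration and do not reproduce the project's actual evaluation code or numbers.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC, resampling examples with replacement."""
    auroc(labels, scores)  # fail fast if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy data: label 1 = fabricated answer, score = probe output.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores), bootstrap_ci(toy_labels, toy_scores, n_resamples=200))
```

Reporting the interval alongside the point estimate makes it easier to tell whether a cross-task number is meaningfully different from an in-distribution one.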
ReasonGuard is a reasoning-faithfulness probe at L55 / mid_think on Qwen3.6-27B in thinking mode. It detects wrong-answer trajectories during the <think> block. Its scope is deliberately narrow: AUROC 0.888 within math reasoning (GSM8K) and 0.605 cross-domain on commonsense (StrategyQA), so it is domain-bound, not generalized.
Layer × position interaction (novel): shallow layers (L31) favor end_question; deep layers (L55) favor mid_think. Position-of-faithfulness migrates with depth.
from openinterp import probebench
probe = probebench.load("openinterp/reasonguard-qwen36-27b-l55-mid_think")
score = probe.score(activations) # P(wrong-answer trajectory)

Both numbers (within-domain and cross-domain) are registered honestly per ProbeBench's anti-Goodhart norms. Probe artifact: caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b. Live on openinterp.org/probebench.
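As a rough picture of what a linear probe score is, the sketch below maps one activation vector to a probability with a logistic readout. It assumes the probe is a plain weight vector plus bias, which the "linearprobe" artifact name suggests but does not guarantee; the function name and toy values are hypothetical, and the real probe.score() may normalize or pool activations differently.

```python
import math
from typing import Sequence


def linear_probe_score(activation: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic readout: sigmoid(w . x + b), interpreted here as P(wrong-answer trajectory)."""
    if len(activation) != len(weights):
        raise ValueError("activation and weights must have the same length")
    logit = sum(w * x for w, x in zip(weights, activation)) + bias
    # Numerically stable sigmoid that avoids overflow for large |logit|.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Toy example with a 2-dimensional "activation".
print(linear_probe_score([0.0, 0.0], [1.0, 1.0], 0.0))
```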
ProbeBench is the first categorical leaderboard for activation probes: 8 categories, a 7-axis ProbeScore, and anti-Goodhart by construction.
from openinterp import probebench
probes = probebench.list_probes(category="hallucination")
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
score = probe.score(activations)

openinterp probebench list # show all registered probes
openinterp probebench load <probe-id> # download + verify SHA-256
openinterp probebench validate ./my-probe/ # check artifact spec
openinterp probebench reproduce <probe-id> # download reproducer notebook

Browse the leaderboard: openinterp.org/probebench.
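The load command above is described as "download + verify SHA-256". A minimal sketch of that kind of integrity check, using only hashlib, is shown here. The helper names and the idea of passing the expected digest as a hex string are assumptions; how ProbeBench actually stores and compares digests is not specified in this README.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of_file(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"SHA-256 mismatch for {path}: expected {expected_hex}, got {actual}")
```

Calling verify_artifact() right after download, before any weights are deserialized, keeps a corrupted or tampered file from being loaded at all.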
examples/agent_probe_guard_env_coupling.md: detect env coupling with cosine + residual norms, refit with AgentProbeGuard.refit(), and choose between one refit helper vs per-env weight variants.
Encapsulates the Qwen3.6 PEFT-save .language_model. infix bug discovered during paper-2 grokking work (April 2026). Saved Qwen3.6 LoRA adapters carry an extra .language_model. infix in state-dict keys; PeftModel.from_pretrained() against a reloaded dense Qwen3.6 silently fails — adapter loaded, max logit-diff = 0.000, no error raised.
from openinterp import safe_load_qwen36_lora
model = safe_load_qwen36_lora(
base_model_id="Qwen/Qwen3.6-27B",
adapter_path="path/to/checkpoint-200",
) # auto strip .language_model. + auto verify logit-diff > 0.01

Also exposed: strip_language_model_infix(), verify_adapter_loaded(), LoRAVerificationError. This bug invalidated ~10 hours of prior eval work before being caught, so anyone working with Qwen3.6 LoRA save/reload pipelines should run the sanity check.
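To make the failure mode concrete, here is a minimal sketch of the two ideas described above: renaming state-dict keys that carry the extra infix, and refusing to continue when the adapter has no measurable effect on the logits. The helper names, the example key, and the exact rename (replacing ".language_model." with ".") are assumptions for illustration; they are not the code behind strip_language_model_infix() or verify_adapter_loaded().

```python
from typing import Dict, Sequence, TypeVar

T = TypeVar("T")
INFIX = ".language_model."  # the infix named in the bug description


def strip_infix_sketch(state_dict: Dict[str, T], infix: str = INFIX) -> Dict[str, T]:
    """Return a copy of state_dict with the first occurrence of infix removed from each key."""
    cleaned: Dict[str, T] = {}
    for key, value in state_dict.items():
        new_key = key.replace(infix, ".", 1)
        if new_key in cleaned:
            raise KeyError(f"key collision after stripping infix: {new_key}")
        cleaned[new_key] = value
    return cleaned


def check_adapter_effect(
    base_logits: Sequence[float],
    adapted_logits: Sequence[float],
    threshold: float = 0.01,  # matches the "logit-diff > 0.01" check quoted above
) -> float:
    """Return the max absolute logit difference, or raise if the adapter changed nothing."""
    diff = max(abs(a - b) for a, b in zip(base_logits, adapted_logits))
    if diff <= threshold:
        raise RuntimeError(f"adapter appears to be a no-op (max logit-diff = {diff:.3f})")
    return diff


# Hypothetical key layout, for illustration only.
toy_keys = {"model.language_model.layers.0.q_proj.lora_A.weight": 1}
print(strip_infix_sketch(toy_keys))
```

The second check is the important one: a silent key mismatch produces a model that loads cleanly and evaluates exactly like the base model, so comparing logits on a fixed prompt is the cheapest way to notice.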
| Command | Status | What it does |
|---|---|---|
| openinterp atlas <query> | live | Feature search with offline fallback to curated demo features |
| openinterp trace ... | live (needs [full]) | Real SAE trace generation, sae_lens format, any HF model |
| openinterp guard ... | live | FabricationGuard scoring + abstain mode on Qwen3.6-27B |
| openinterp probebench {list,load,score,validate,reproduce,submit} | live | ProbeBench v0.0.1 SDK |
| openinterp.lora.safe_load_qwen36_lora | live (v0.2.1) | Safe Qwen3.6 LoRA loader with strip + verify |
| openinterp info | live | Version + optional-dep status |
- openinterp upload-trace <trace.json> → shareable openinterp.org URL
- openinterp score --sae-repo X → compute InterpScore (wraps notebook 18)
- openinterp steer --sae-repo X --feature Y --alpha Z → intervention (wraps notebook 06)
- openinterp circuit --sae-repo X --prompt Y → attribution graph JSON (wraps notebook 14/15)
- openinterp publish <repo> → HuggingFace release with model card
- ReasonGuard v0.2: multi-bench training (math + commonsense) to fix cross-domain transfer
Open an issue on the tracker if you'd take one of these.
git clone https://github.com/OpenInterpretability/cli openinterp-cli
cd openinterp-cli
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,full]" # dev = pytest + ruff + build; full = torch + transformers
pytest -xvs # 5 tests, ~1s

openinterp-cli/
├── pyproject.toml # name='openinterp', hatchling build
├── openinterp/
│ ├── __init__.py # public exports + __version__
│ ├── models.py # pydantic types: AtlasFeature, Trace, TraceFeature
│ ├── atlas.py # search_features() — HF API + curated fallback
│ ├── trace.py # generate_trace() — real transformers-based impl
│ └── cli.py # click-based CLI: atlas / trace / info
├── tests/
│ ├── test_atlas.py
│ └── test_trace.py
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md
Full rules: CONTRIBUTING.md.
- Decide which notebook it wraps (score → 18, steer → 06, circuit → 14/15, publish → generic)
- Add a function to the matching file (openinterp/score.py, etc.). Keep it small: actual compute lives in the notebook.
- Expose it in __init__.py
- Add a @main.command() in cli.py with click decorators
- Add a smoke test in tests/test_<name>.py
- Update CHANGELOG.md under [Unreleased]
- PR title: Add openinterp <command>
Hard rules:
- Python ≥ 3.10 syntax (PEP 604 unions OK)
- dtype=torch.bfloat16, never torch_dtype= (deprecated in transformers 5.x)
- SDPA only, never flash-attn
- New heavy deps (torch tier) → add to [full] extra, not base
- Every new public function has type hints + docstring
# 1. Bump version in BOTH:
# pyproject.toml ([project] version = "X.Y.Z")
# openinterp/__init__.py (__version__ = "X.Y.Z")
# 2. Update CHANGELOG.md — move [Unreleased] → [X.Y.Z] — YYYY-MM-DD
source .venv/bin/activate
rm -rf dist/
python -m build
python -m twine check dist/*
python -m twine upload dist/* # needs PyPI token in ~/.pypirc
git tag vX.Y.Z
git push --tags

Every PR runs:
- pytest -xvs across Python 3.10, 3.11, 3.12 (see .github/workflows/ci.yml)
- ruff check . (warn-only for now)
- python -m build + twine check
Green required to merge.
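Because the release steps require the version to be bumped in both pyproject.toml and openinterp/__init__.py, a small consistency check could catch a missed bump before tagging. The sketch below is a hypothetical helper, not something the repository is known to ship. It parses with regular expressions rather than tomllib because tomllib is only available from Python 3.11 and the package supports 3.10.

```python
import re


def pyproject_version(text: str) -> str:
    """Return the version declared under the [project] table of a pyproject.toml."""
    in_project = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the [project] table.
            in_project = stripped == "[project]"
        elif in_project:
            match = re.match(r'version\s*=\s*"([^"]+)"', stripped)
            if match:
                return match.group(1)
    raise ValueError("no version found under [project]")


def init_version(text: str) -> str:
    """Return the __version__ string assigned in a package __init__.py."""
    match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', text, re.MULTILINE)
    if not match:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def versions_match(pyproject_text: str, init_text: str) -> bool:
    """True when both files declare the same version string."""
    return pyproject_version(pyproject_text) == init_version(init_text)
```

In practice the two arguments would be the contents of pyproject.toml and openinterp/__init__.py read from disk, and a False result would stop the release before python -m build runs.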
- Discussions — API proposals, "which repo should this live in"
- Good-first-issues
- PyPI release history
- hi@openinterp.org
Neuronpedia (SAE encyclopedia) · Gemma Scope (reference SAE suite) · Gao et al. 2024 (TopK + AuxK) · SAELens (safetensors format).
Apache-2.0 · openinterp.org

