openinterp — Python SDK + CLI for openinterp.org

Search the feature Atlas, generate Traces from your own SAE, rank against the public InterpScore leaderboard.

PyPI · Python ≥ 3.10 · Apache-2.0

Install

pip install openinterp              # lite: Atlas + CLI (no torch, ~2 MB total)
pip install "openinterp[full]"      # + torch/transformers/safetensors for trace generation

Requires Python ≥ 3.10.
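Before running trace generation it can help to confirm that the interpreter and the [full] extra are actually in place. The snippet below is a minimal sketch using only the standard library; the helper names missing_packages and python_ok are illustrative and not part of the openinterp API (the package's own status report is openinterp info).

```python
import importlib.util
import sys

# Packages the install section lists for the [full] extra.
FULL_EXTRA = ("torch", "transformers", "safetensors")


def missing_packages(names=FULL_EXTRA):
    """Return the top-level packages in `names` that are not importable."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_ok(minimum=(3, 10)):
    """True when the running interpreter meets the minimum (major, minor)."""
    return sys.version_info[:2] >= minimum


missing = missing_packages()
print("python >= 3.10:", python_ok())
print("full extra ready" if not missing else "missing: " + ", ".join(missing))
```

find_spec only looks a package up on the import path without importing it, so the check stays cheap even when torch is installed.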


Part of a 5-repo ecosystem

Repo | What's in it
.github | Org profile + shared CoC + SECURITY
web | Next.js site behind openinterp.org
notebooks | 23 training + interpretability notebooks
cli (you are here) | pip install openinterp (Python SDK)
mechreward | SAE features as dense RL reward

Quick start

Search the Atlas (offline, zero GPU)

$ openinterp atlas "overconfidence"
                    Atlas results: 'overconfidence'
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━
┃ ID      ┃ Name                    ┃ Model             ┃ AUROC ┃ Description
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━
│ f2503   │ overconfidence_pattern  │ Qwen/Qwen3.6-27B  │  0.54 │ Definitive…
│ f1847   │ urgency_assessment      │ Qwen/Qwen3.6-27B  │  0.68 │ Time-critic…
└─────────┴─────────────────────────┴───────────────────┴───────┴────────────

The same search from Python:

>>> from openinterp import search_features
>>> features = search_features("overconfidence", model="Qwen/Qwen3.6-27B")
>>> features[0].id
'f2503'

Generate a Trace from your own SAE

pip install "openinterp[full]"

openinterp trace \
    --model google/gemma-2-2b \
    --sae-repo YOUR_HF_USER/gemma2-2b-sae-first \
    --prompt "The capital of France is" \
    --layer 12 \
    --d-model 2304 --d-sae 16384 --k 64 \
    --out my_trace.json

This:

  1. Loads the base model in bf16 with SDPA (no flash-attn)
  2. Loads your SAE from HuggingFace (sae_lens safetensors format)
  3. Generates tokens, captures residuals at layer 12
  4. Applies the SAE, picks top-10 active features
  5. Writes a Trace JSON matching openinterp.org/observatory/trace byte-for-byte
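Step 4 can be illustrated with a toy TopK encoder in pure Python. This is a sketch of the general TopK SAE idea (Gao et al. 2024), not the code in openinterp/trace.py: the function names, the decoder-bias subtraction, and the ReLU on the survivors are assumptions, and real SAEs run as tensor operations over d_model = 2304 and d_sae = 16384 rather than nested lists.

```python
def topk_sae_encode(x, w_enc, b_enc, b_dec, k):
    """Toy TopK SAE encoder.

    x     : residual vector, length d_model
    w_enc : d_sae rows, each of length d_model
    b_enc : encoder bias, length d_sae
    b_dec : decoder bias, length d_model (assumed to be subtracted first)
    Returns {feature_index: activation} for the features that survive TopK.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    # Keep the k largest pre-activations, then drop non-positive survivors.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return {i: pre[i] for i in top if pre[i] > 0}


def top_active(acts, n=10):
    """Rank surviving features by activation, strongest first."""
    return sorted(acts.items(), key=lambda kv: kv[1], reverse=True)[:n]


residual = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
acts = topk_sae_encode(residual, w_enc, b_enc=[0.0] * 4, b_dec=[0.0, 0.0], k=2)
print(top_active(acts, n=10))  # [(3, 3.0), (1, 2.0)]
```

With k=64 and a top-10 cut, as in the command above, at most 64 features are non-zero per token and the ten strongest of those would be reported.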

Python API

from openinterp import generate_trace

trace = generate_trace(
    model_id="google/gemma-2-2b",
    sae_repo="YOUR_HF_USER/gemma2-2b-sae-first",
    prompt="The capital of France is",
    layer=12,
    d_model=2304,
    d_sae=16384,
    k=64,
)

print(trace.model_dump_json(indent=2))   # Trace Theater schema

With feature labels from notebook 04

# After running 04_discover_features.ipynb (emits feature_catalog.json):
openinterp trace ... --catalog feature_catalog.json

Trace features inherit names from your catalog.
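Conceptually, the catalog acts as a lookup from feature index to human-readable name. The sketch below assumes a hypothetical catalog layout (a JSON object keyed by feature index, each entry carrying a "name"); the real schema emitted by 04_discover_features.ipynb may differ, and apply_catalog is an illustrative helper, not an openinterp function.

```python
import json


def apply_catalog(features, catalog):
    """Attach a name to each feature dict, with a generic fallback label."""
    labeled = []
    for feat in features:
        entry = catalog.get(str(feat["id"]), {})
        name = entry.get("name", f"feature_{feat['id']}")
        labeled.append({**feat, "name": name})
    return labeled


# Hypothetical catalog contents and trace features, for illustration only.
catalog = json.loads('{"412": {"name": "capital_cities"}}')
features = [{"id": 412, "activation": 3.2}, {"id": 7, "activation": 1.1}]

for feat in apply_catalog(features, catalog):
    print(feat["id"], feat["name"])
```

Features missing from the catalog keep a placeholder name here, so a partial catalog still yields a complete trace.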


FabricationGuard (v0.2.0+)

Production hallucination probe for Qwen3.6-27B: cross-task AUROC of 0.88 on SimpleQA, an 88% reduction in confident-wrong answers in mitigation mode, and ~1 ms scoring latency.

from openinterp import FabricationGuard

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B")
output = guard.generate("Who won the 2003 Nobel Prize in Aerodynamics?", mode="abstain")
# → "I don't have reliable information to answer this confidently."

Methodology lineage: extends Anthropic's persona-vectors approach (Aug 2025, tested on 7-8B) to Qwen3.6-27B (3-4× larger) with formal cross-task AUROC + bootstrap CIs + mitigation-rate evaluation. Apache-2.0 production-grade implementation, not a proprietary platform. Probe artifact: caiovicentino1/FabricationGuard-linearprobe-qwen36-27b. Live demo: openinterp.org/products/fabricationguard.
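Since the probe artifact is a linear probe, scoring reduces to a dot product followed by a sigmoid, which is consistent with millisecond-scale latency. The sketch below shows that idea together with a threshold-based abstain step. The weights, the 0.5 threshold, and the function names are assumptions for illustration; the actual FabricationGuard internals are not shown in this README.

```python
import math

ABSTAIN_MESSAGE = "I don't have reliable information to answer this confidently."


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(activation, weights, bias):
    """Linear probe: sigmoid(w . x + b), read here as P(fabrication)."""
    logit = sum(w * a for w, a in zip(weights, activation)) + bias
    return sigmoid(logit)


def guarded_answer(answer, score, threshold=0.5):
    """Return the model's answer unless the probe score reaches the threshold."""
    return ABSTAIN_MESSAGE if score >= threshold else answer


# Toy 2-dimensional activation and probe weights (hypothetical values).
score = probe_score([2.0, 0.0], weights=[1.0, -1.0], bias=0.0)
print(round(score, 3), guarded_answer("Some confident answer", score))
```

Lowering the threshold trades more abstentions for fewer confident-wrong answers; the right operating point depends on the deployment.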

ReasonGuard v0.1 (in registry)

Reasoning-faithfulness probe at L55 / mid_think on Qwen3.6-27B in thinking mode. Detects wrong-answer trajectories during the <think> block. Honest narrow scope: AUROC 0.888 within math reasoning (GSM8K), 0.605 cross-domain to commonsense (StrategyQA) — domain-bound, not generalized.

Layer × position interaction (novel): shallow layers (L31) favor end_question; deep layers (L55) favor mid_think. Position-of-faithfulness migrates with depth.

from openinterp import probebench

probe = probebench.load("openinterp/reasonguard-qwen36-27b-l55-mid_think")
score = probe.score(activations)  # P(wrong-answer trajectory)

Both numbers (within + cross) registered honestly per ProbeBench's anti-Goodhart norms. Probe artifact: caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b. Live on openinterp.org/probebench.
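For readers who want to reproduce within-domain and cross-domain numbers on their own data, AUROC can be computed directly from probe scores and binary labels. The sketch below uses the pairwise (Mann-Whitney) definition in plain Python; the toy scores are made up and are unrelated to the GSM8K or StrategyQA results above.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (wrong-answer trajectory) or 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: a probe that separates one domain cleanly and another less so.
labels = [1, 1, 0, 0]
within_domain = auroc([0.9, 0.8, 0.3, 0.2], labels)
cross_domain = auroc([0.9, 0.4, 0.6, 0.2], labels)
print(within_domain, cross_domain)  # 1.0 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for small evaluation sets; a rank-based implementation would be preferable at scale.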

ProbeBench (v0.2.0+)

The first categorical leaderboard for activation probes — 8 categories, 7-axis ProbeScore, anti-Goodhart by construction.

from openinterp import probebench

probes = probebench.list_probes(category="hallucination")
probe  = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
score  = probe.score(activations)

From the command line:

openinterp probebench list                       # show all registered probes
openinterp probebench load <probe-id>            # download + verify SHA-256
openinterp probebench validate ./my-probe/       # check artifact spec
openinterp probebench reproduce <probe-id>       # download reproducer notebook

Browse the leaderboard: openinterp.org/probebench.
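The SHA-256 check performed on load can be reproduced by hand for any downloaded artifact. This is a generic standard-library sketch, not the ProbeBench implementation; verify_artifact and the file name probe.bin are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return the path if its SHA-256 matches, otherwise raise ValueError."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
    return Path(path)


# Demo on a throwaway file; a real check would use the published digest.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "probe.bin"
    artifact.write_bytes(b"abc")
    expected = hashlib.sha256(b"abc").hexdigest()
    checked_name = verify_artifact(artifact, expected).name
    print("verified", checked_name)
```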


AgentProbeGuard cookbooks


v0.2.1 — safe_load_qwen36_lora()

Works around the Qwen3.6 PEFT-save .language_model. infix bug, discovered during paper-2 grokking work (April 2026). Saved Qwen3.6 LoRA adapters carry an extra .language_model. infix in their state-dict keys, so PeftModel.from_pretrained() against a reloaded dense Qwen3.6 fails silently: the adapter appears to load and no error is raised, but the max logit-diff is 0.000, meaning the adapter has no effect.

from openinterp import safe_load_qwen36_lora

model = safe_load_qwen36_lora(
    base_model_id="Qwen/Qwen3.6-27B",
    adapter_path="path/to/checkpoint-200",
)  # auto strip .language_model. + auto verify logit-diff > 0.01

Also exposed: strip_language_model_infix(), verify_adapter_loaded(), LoRAVerificationError. This bug invalidated ~10 hours of prior eval work before it was caught, so anyone working with Qwen3.6 LoRA save/reload pipelines should run the sanity check.
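The two halves of the fix, key renaming and a logit-difference check, can be sketched on plain Python data. This is an illustration of the idea only, not the openinterp implementation: strip_infix and adapter_changed_logits are hypothetical helpers, the example key is made up, and real code would operate on tensors and model outputs.

```python
INFIX = ".language_model."


def strip_infix(state_dict, infix=INFIX):
    """Return a copy whose keys have the first `infix` collapsed to a dot."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(infix, ".", 1)
        if new_key in cleaned:
            raise KeyError(f"key collision after stripping: {new_key}")
        cleaned[new_key] = value
    return cleaned


def adapter_changed_logits(base_logits, adapted_logits, min_diff=0.01):
    """True when the adapter moved at least one logit by more than min_diff."""
    max_diff = max(abs(a - b) for a, b in zip(base_logits, adapted_logits))
    return max_diff > min_diff


# Hypothetical adapter key, shaped like a PEFT LoRA weight name.
saved = {"base_model.model.model.language_model.layers.0.q_proj.lora_A.weight": "tensor"}
print(list(strip_infix(saved)))

# Identical logits are the silent-failure signature described above.
print(adapter_changed_logits([1.0, 2.0], [1.0, 2.0]))  # False
print(adapter_changed_logits([1.0, 2.0], [1.0, 2.5]))  # True
```

Comparing logits on a fixed prompt before and after attaching the adapter is cheap, and it catches the case where loading succeeds but no weights were matched.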


What's in v0.2.x

Command | Status | What it does
openinterp atlas <query> | live | Feature search with offline fallback to curated demo features
openinterp trace ... | live (needs [full]) | Real SAE trace generation, sae_lens format, any HF model
openinterp guard ... | live | FabricationGuard scoring + abstain mode on Qwen3.6-27B
openinterp probebench {list,load,score,validate,reproduce,submit} | live | ProbeBench v0.0.1 SDK
openinterp.lora.safe_load_qwen36_lora | live (v0.2.1) | Safe Qwen3.6 LoRA loader with strip + verify
openinterp info | live | Version + optional-dep status

Planned v0.3.0

  • openinterp upload-trace <trace.json> → shareable openinterp.org URL
  • openinterp score --sae-repo X → compute InterpScore (wraps notebook 18)
  • openinterp steer --sae-repo X --feature Y --alpha Z → intervention (wraps notebook 06)
  • openinterp circuit --sae-repo X --prompt Y → attribution graph JSON (wraps notebook 14/15)
  • openinterp publish <repo> → HuggingFace release with model card
  • ReasonGuard v0.2 — multi-bench training (math + commonsense) to fix cross-domain transfer

Open an issue on the tracker if you'd like to take one of these on.


Development

git clone https://github.com/OpenInterpretability/cli openinterp-cli
cd openinterp-cli
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,full]"          # dev = pytest + ruff + build; full = torch + transformers
pytest -xvs                            # 5 tests, ~1s

Package layout

openinterp-cli/
├── pyproject.toml              # name='openinterp', hatchling build
├── openinterp/
│   ├── __init__.py             # public exports + __version__
│   ├── models.py               # pydantic types: AtlasFeature, Trace, TraceFeature
│   ├── atlas.py                # search_features() — HF API + curated fallback
│   ├── trace.py                # generate_trace() — real transformers-based impl
│   └── cli.py                  # click-based CLI: atlas / trace / info
├── tests/
│   ├── test_atlas.py
│   └── test_trace.py
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md

Contribution recipe — add a new command

Full rules: CONTRIBUTING.md.

  1. Decide which notebook it wraps (score → 18, steer → 06, circuit → 14/15, publish → generic)
  2. Add a function to the matching file (openinterp/score.py, etc.). Keep it small — actual compute lives in the notebook.
  3. Expose it in __init__.py
  4. Add a @main.command() in cli.py with click decorators
  5. Add a smoke test in tests/test_<name>.py
  6. Update CHANGELOG.md under [Unreleased]
  7. PR title: Add openinterp <command>

Hard rules:

  • Python ≥ 3.10 syntax (PEP 604 unions OK)
  • Pass dtype=torch.bfloat16, never torch_dtype= (deprecated in transformers 5.x)
  • SDPA only, never flash-attn
  • New heavy deps (torch tier) → add to [full] extra, not base
  • Every new public function has type hints + docstring
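The last rule is easy to check mechanically before opening a PR. The following is a small sketch built on the standard inspect module; missing_docs_or_hints is a hypothetical helper, not something shipped in this repository or enforced by its CI.

```python
import inspect


def missing_docs_or_hints(func):
    """List what a function lacks: docstring, parameter or return annotations."""
    problems = []
    if not inspect.getdoc(func):
        problems.append("docstring")
    sig = inspect.signature(func)
    for name, param in sig.parameters.items():
        if param.annotation is inspect.Parameter.empty:
            problems.append(f"annotation for '{name}'")
    if sig.return_annotation is inspect.Signature.empty:
        problems.append("return annotation")
    return problems


def good(x: int) -> int:
    """Double x."""
    return 2 * x


def bad(x):
    return x


print(missing_docs_or_hints(good))  # []
print(missing_docs_or_hints(bad))
```

A helper like this could sit in a smoke test that loops over the names exported from __init__.py.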

Release process (maintainer)

# 1. Bump version in BOTH:
#    pyproject.toml          ([project] version = "X.Y.Z")
#    openinterp/__init__.py  (__version__ = "X.Y.Z")
# 2. Update CHANGELOG.md — move [Unreleased] → [X.Y.Z] — YYYY-MM-DD

source .venv/bin/activate
rm -rf dist/
python -m build
python -m twine check dist/*
python -m twine upload dist/*     # needs PyPI token in ~/.pypirc

git tag vX.Y.Z
git push --tags
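Because the version lives in two files, a quick consistency check before python -m build can catch a missed bump. The sketch below compares the two strings with regular expressions; it is an assumption-laden helper (it takes the first top-level version = "..." line rather than parsing TOML sections) and is not part of the documented release process.

```python
import re


def pyproject_version(text):
    """First `version = "..."` line in pyproject.toml text, or None."""
    m = re.search(r'^version\s*=\s*"([^"]+)"', text, flags=re.MULTILINE)
    return m.group(1) if m else None


def init_version(text):
    """First `__version__ = "..."` line in __init__.py text, or None."""
    m = re.search(r'^__version__\s*=\s*"([^"]+)"', text, flags=re.MULTILINE)
    return m.group(1) if m else None


def versions_match(pyproject_text, init_text):
    """True only when both versions are present and identical."""
    a = pyproject_version(pyproject_text)
    b = init_version(init_text)
    return a is not None and a == b


# Inline stand-ins for the two files; real use would read them from disk.
pyproject_text = '[project]\nname = "openinterp"\nversion = "0.2.1"\n'
init_text = '"""openinterp."""\n__version__ = "0.2.1"\n'
print(versions_match(pyproject_text, init_text))  # True
```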

CI

Every PR runs:

  • pytest -xvs across Python 3.10, 3.11, 3.12 (see .github/workflows/ci.yml)
  • ruff check . (warn-only for now)
  • python -m build + twine check

Green required to merge.


Community

Built on

Neuronpedia (SAE encyclopedia) · Gemma Scope (reference SAE suite) · Gao et al. 2024 (TopK + AuxK) · SAELens (safetensors format).


Apache-2.0 · openinterp.org