OpenInterpretability — notebooks

Open notebooks for training SAEs and reproducing 2024–2026 mech-interp papers on Gemma, Qwen, and Llama. Apache-2.0.

License Apache 2.0 · openinterp.org/train

The 5-repo ecosystem

| Repo | What's in it |
|---|---|
| .github | Org profile + shared CoC + SECURITY |
| web | Next.js site behind openinterp.org |
| notebooks (you are here) | 31 training + interpretability + product-reproducer notebooks |
| cli | pip install openinterp — Python SDK |
| mechreward | SAE features as dense RL reward |

Core ladder — train your first SAE

| Tier | Notebook | Platform | VRAM | Cost | Model | Time |
|---|---|---|---|---|---|---|
| Hobbyist | 01_hobbyist_gemma2_2b_colab.ipynb | Colab Free T4 | 15 GB | $0 | Gemma-2-2B | 30–40 min |
| Explorer | 02_explorer_qwen35_4b_kaggle.ipynb | Kaggle 2× T4 | 32 GB | $0 | Qwen3.5-4B (hybrid GDN) | 4–5 h |
| Paper-grade | 03_papergrade_qwen36_27b_cloud.ipynb | Cloud RTX 6000 Pro | 96 GB | ~$30–60 | Qwen3.6-27B | 20–24 h |

After you train — close the loop

| Notebook | What it does |
|---|---|
| 04_discover_features.ipynb | Auto-label your SAE's features with Claude or GPT-4, emit feature_catalog.json |
| 05_build_shareable_trace.ipynb | Your SAE + your prompt → trace.json in the Trace Theater format |
| 06_steer_your_model.ipynb | Live feature intervention: baseline vs α ∈ {−3, 0, 1, 3}. Q1 preview of the Q2 Sandbox. |
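
The feature intervention in notebook 06 can be pictured as adding a scaled feature direction to the residual stream. The sketch below is a minimal illustration in plain Python, assuming the steering rule is h' = h + α·d with d a unit-normalized SAE decoder direction; the function name `steer` and the toy vectors are hypothetical and not taken from the notebook.

```python
import math

def steer(resid, direction, alpha):
    """Return resid + alpha * unit(direction); alpha = 0 reproduces the baseline."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(resid, direction)]

resid = [0.5, -1.0, 2.0]       # toy residual-stream vector (assumption)
direction = [3.0, 0.0, 4.0]    # toy decoder row for one feature, norm 5 (assumption)

# Sweep the same alpha grid the notebook description lists.
for alpha in (-3, 0, 1, 3):
    print(alpha, steer(resid, direction, alpha))
```

In a real model the same addition would typically happen inside a forward hook on the chosen layer; that wiring is left out here to keep the arithmetic visible.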

Before you train — reduce friction

| Notebook | What it does |
|---|---|
| 07_pick_your_tier.ipynb | VRAM calculator + layer recommender. Zero GPU needed. |
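
For a feel of what a VRAM calculator has to account for, here is a rough back-of-the-envelope sketch. It is not the notebook's implementation: it assumes a frozen bf16 model (2 bytes per parameter), an SAE trained in fp32 with Adam (weights, gradients, and two moment buffers, so 16 bytes per parameter), and it ignores activations and framework overhead. The `headroom` factor and the `d_model` / `d_sae` values in the usage lines are hypothetical.

```python
GIB = 1024 ** 3

def weights_gib(n_params, bytes_per_param=2):
    """Memory for the parameters alone, in GiB (bf16 = 2 bytes each)."""
    return n_params * bytes_per_param / GIB

def sae_params(d_model, d_sae):
    """Encoder and decoder matrices plus both bias vectors."""
    return 2 * d_model * d_sae + d_model + d_sae

def fits(n_model_params, d_model, d_sae, vram_gib, headroom=0.8):
    """True if model + SAE training state fit within headroom * vram_gib."""
    model = weights_gib(n_model_params)
    # fp32 weights + grads + Adam first/second moments = 4 * 4 bytes.
    sae = weights_gib(sae_params(d_model, d_sae), bytes_per_param=16)
    return model + sae <= headroom * vram_gib

# Hypothetical dimensions, for illustration only.
print(round(weights_gib(27e9), 2))        # bf16 weights of a 27B-parameter model
print(fits(27e9, 4096, 65536, vram_gib=96))
print(fits(27e9, 4096, 65536, vram_gib=15))
```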

More models — same recipe, different architectures

| Notebook | Model | Platform |
|---|---|---|
| 08_explorer_llama3_8b_kaggle.ipynb | Llama-3.1-8B (Meta license) | Kaggle 2× T4 |
| 09_explorer_mistral_7b_kaggle.ipynb | Mistral-7B-v0.3 | Kaggle 2× T4 |
| 10_hobbyist_phi3_mini_colab.ipynb | Phi-3-mini-4k (Microsoft) | Colab Free T4 |

Research-grade — replicate published results

| Notebook | Paper / protocol |
|---|---|
| 11_stage_gate_g1.ipynb | Stage Gate 1 correlation pre-test (mechreward protocol) — ρ ≥ 0.30 on held-out GSM8K |
| 12_batchtopk_vs_topk.ipynb | BatchTopK vs TopK (Bussmann et al., arxiv:2412.06410) |

Safety + production preview

| Notebook | What it does |
|---|---|
| 13_watchtower_preview.ipynb | Monitor input prompts for anomalous feature activations. Q1 preview of Q4 Watchtower Enterprise. Forward-only, no generation. |

Circuits — attribution graphs between SAE features

| Notebook | What it does |
|---|---|
| 14_attribution_patching.ipynb | AtP* (Kramár et al. 2024, arxiv:2403.00745) — QK-fix + GradDrop node attribution |
| 15_sparse_feature_circuits.ipynb | Marks et al. 2024 (arxiv:2403.19647) replication — node + edge + error-term DAG |
| 16_autocircuit_acdc.ipynb | ACDC slow-mode via AutoCircuit |
| 17_train_crosscoder.ipynb | Sparse Crosscoder (Lindsey et al. 2024) — shared dictionary across L11/L31/L55 |

All circuit notebooks emit JSON consumed directly by the Circuit Canvas on openinterp.org.

Leaderboard — InterpScore v0.0.1

| Notebook | What it does |
|---|---|
| 18_interpscore_eval.ipynb | Composite SAE ranking — loss_recovered + alive + L0 + sparse probing + TPP. Emits interpscore.json → PR to web/lib/leaderboard.ts. |

Lenses — classic layer-wise prediction tools

| Notebook | Method |
|---|---|
| 19_logit_lens.ipynb | Logit Lens (nostalgebraist 2020). 5 lines of PyTorch, ~5 min on T4. |
| 20_tuned_lens.ipynb | Tuned Lens (Belrose et al. 2023, arxiv:2303.08112). Pretrained or fresh-fit. |

Probing — the supervised baselines SAE features must beat

| Notebook | Method |
|---|---|
| 21_linear_probe.ipynb | sklearn LogisticRegression on residuals + diff-of-means baseline (Farquhar 2023 requires it) |
| 22_ccs_probe.ipynb | Contrast Consistent Search (Burns 2022) with honest critique baselines |
| 23_repe_reading_vector.ipynb | Representation Engineering LAT (Zou 2023) — extract + monitor + steer |
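
The diff-of-means baseline named for notebook 21 is simple enough to sketch in plain Python: take the mean residual of each class, use the difference as a probe direction, and score by projection. The version below is an illustration under stated assumptions, not the notebook's code; the helper names and the two-dimensional toy data are hypothetical, and in practice the direction would be fit on a train split and scored on a held-out split.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_direction(pos, neg):
    """Probe direction = mean(positive class) - mean(negative class)."""
    return [a - b for a, b in zip(mean_vec(pos), mean_vec(neg))]

def project(rows, direction):
    """Dot product of each row with the probe direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Toy "residuals" for two classes (assumption; real inputs are d_model-sized).
pos = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3]]
neg = [[0.2, 0.0], [0.5, 0.4], [1.6, -0.1]]

direction = diff_of_means_direction(pos, neg)
score = auroc(project(pos, direction), project(neg, direction))
print(direction, score)
```

Because the baseline has no trainable parameters beyond two means, it gives a floor that a logistic-regression probe or an SAE feature has to clear.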

Hallucination — detection & steering arc

The full research arc behind the 2026-04-25 blog post on hallucination in 27B reasoning models. Notebooks 24 through 28b shipped between 2026-04-25 and 2026-04-26.

| Notebook | What it does |
|---|---|
| 24_hallucination_entity_separation_qwen36_27b.ipynb | v0.0.1 — fake AUROC=1.0 from a 2× tokenization confound. The honest negative result. |
| 24b_hallucination_v002_ferrando_proper.ipynb | Ferrando 2024 replication on Qwen3.6-27B. AUROC 0.84 on 226 real Wikidata entities. |
| 25_steering_f61723_calibration.ipynb | Single-feature steering null result. Detection ≠ control. |
| 26_multi_feature_steering.ipynb | Multi-feature top-K (no controls). The version we almost shipped overclaimed. |
| 27_multi_feature_steering_with_controls.ipynb | The walk-back. 6 controls (random-K + Claude judge + permutation). It induces hallucination, not calibration. |
| 28_paper_baselines_qwen36_27b.ipynb | ICML MI Workshop 2026 paper-1 baselines. L31/f34957 0.81 vs LR ceiling 0.887 vs diff-of-means 0.859. Per-layer scan, bootstrap CI. |
| 28b_sensitivity_refusal_only.ipynb | Sensitivity ablation — same residual capture, two labelling rules. Reviewer-defence. |

Crosscoders — cross-model + cross-stage

The methodology behind paper-1's Pearson causal-equivalence (Pearson_CE) finding. First per-feature causal-equivalence test in the crosscoder literature.

| Notebook | What it does | Pair |
|---|---|---|
| 17_train_crosscoder.ipynb | Cross-LAYER crosscoder (Lindsey 2024). Single model, multi-layer. | Gemma-2-2B L6/L12/L18 |
| 17b_crosscoder_model_diff_papergrade.ipynb | Cross-MODEL crosscoder + Pearson_CE. Median cosine 0.965 vs CE 0.616 — 38% gap. | Gemma-2-2B base/IT |
| 17c_crosscoder_rl_diffing_papergrade.ipynb | Cross-STAGE crosscoder. LoRA toggle pattern (single base + PEFT.disable_adapter). | Qwen3.5-4B base vs mechreward-G3 |

Guards — product reproducers

Each notebook reproduces the exact metric behind a shipped openinterp Guard (SDK on PyPI, demo on HF, landing page at openinterp.org/products/X). Run pip install openinterp to get the same probes as drop-ins.

| Notebook | Product | Headline number | Reproducer |
|---|---|---|---|
| 30_hallucinationguard_proof_qwen36_27b.ipynb | FabricationGuard PoC v1 | Single-feature failed cross-bench (0.50–0.60) | Open in Colab |
| 31_hallucinationguard_v2_linear_probe.ipynb | FabricationGuard v2 (production) | AUROC 0.88 cross-task · −88% confident-wrong | Open in Colab |
| 32_reasoningguard_proof_qwen36_27b.ipynb | ReasoningGuard PoC | TBD — passes 3/3 ships v0.3 | Open in Colab |

Each reproducer ships:

  • probe.joblib + meta.json to HF dataset (drop-in for the SDK)
  • verdict.json with raw numbers
  • headline.png for landing pages / posts
  • All artifacts pushed to caiovicentino1/<ProductName>-linearprobe-qwen36-27b (HF dataset)

Shared recipe (every training tier)

All tiers use the same research-grade protocol; only the hyperparameters scale with the tier:

  • TopK activation (Gao et al. 2024) — hard top-k, no L1 penalty
  • AuxK auxiliary loss — dead-feature revival (α=1/32, k_aux=d/2, dead_threshold=10M tokens)
  • Geometric-median b_dec init (Weiszfeld) — robust to heavy-tailed residuals
  • Decoder column renorm every step — keeps features interpretable
  • Cosine LR + warmup — non-zero floor for continued dead-feature revival
  • HuggingFace streaming checkpoints — crash-safe, never lose more than 5–10 min
  • sae_lens-compatible export: safetensors + cfg.json
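
Three of the recipe ingredients above are compact enough to sketch without a deep-learning framework. The block below is a plain-Python illustration, not the notebooks' code: `topk_activation` keeps the k largest pre-activations (clamping the survivors at zero is an implementation choice assumed here), `lr_at` is a warmup-plus-cosine schedule with a non-zero floor (`floor_frac=0.1` is an assumed value), and `geometric_median` runs Weiszfeld iterations of the kind used to initialize b_dec. AuxK and decoder renormalization are left out.

```python
import math

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations, zero everything else."""
    keep = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    out = [0.0] * len(pre_acts)
    for i in keep:
        out[i] = max(pre_acts[i], 0.0)   # assumption: survivors are ReLU-clamped
    return out

def lr_at(step, total_steps, peak_lr, warmup_steps, floor_frac=0.1):
    """Linear warmup, then cosine decay down to floor_frac * peak_lr (never zero)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    floor = floor_frac * peak_lr
    return floor + (peak_lr - floor) * cosine

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld: repeatedly re-weight each point by 1 / distance to the estimate."""
    est = [sum(col) / len(points) for col in zip(*points)]   # start at the mean
    for _ in range(iters):
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        est = [sum(w * p[j] for w, p in zip(weights, points)) / total
               for j in range(len(est))]
    return est

print(topk_activation([0.1, 2.0, -1.0, 0.7], k=2))
print(lr_at(0, 1000, 1e-3, 100), lr_at(1000, 1000, 1e-3, 100))
# One far outlier drags the mean to (25.25, 25.25) but barely moves the median.
print(geometric_median([[0, 0], [1, 0], [0, 1], [100, 100]]))
```

The last line shows why a geometric median is preferred over a mean for heavy-tailed residuals: a single extreme point dominates the mean but contributes only a small weight to the Weiszfeld update.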

Notebook constraints

  • Use dtype=torch.bfloat16 (not the deprecated torch_dtype=) and attn_implementation='sdpa' (not flash-attn — reproducibility + install pain across Colab/Kaggle).
  • HF_TOKEN goes through Colab/Kaggle secrets, never hard-coded.
  • Stream checkpoints to HF every 5–10M tokens — Drive-only checkpoints die with the kernel.
  • Use the multimodal layer-access fallback (getattr(model.model, 'layers', None) or model.model.language_model.layers), not a hard-coded .layers[N].
  • Report honest var_expl, L0, and dead-feature percentage — not cherry-picked seeds.

CI checks all of these.
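
A sketch of how these constraints might look together in a setup cell. The helper names (`get_hf_token`, `get_layers`, `load_model`) are hypothetical; `load_model` assumes a recent transformers release that accepts `dtype=` and defers its imports so the rest of the block runs without torch installed. The stand-in objects at the bottom only exercise the layer-access fallback and are not real models.

```python
import os
from types import SimpleNamespace

def get_hf_token():
    """Read HF_TOKEN from Colab secrets, Kaggle secrets, or the environment."""
    try:
        from google.colab import userdata              # importable only on Colab
        return userdata.get("HF_TOKEN")
    except Exception:
        pass
    try:
        from kaggle_secrets import UserSecretsClient   # importable only on Kaggle
        return UserSecretsClient().get_secret("HF_TOKEN")
    except Exception:
        pass
    return os.environ.get("HF_TOKEN")

def get_layers(model):
    """Text-only models expose model.model.layers; multimodal wrappers nest the
    decoder stack under model.model.language_model.layers."""
    layers = getattr(model.model, "layers", None)
    if layers is None:
        layers = model.model.language_model.layers
    return layers

def load_model(model_id):
    """Load with bf16 weights and SDPA attention (needs torch + transformers)."""
    import torch
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype=torch.bfloat16,
        attn_implementation="sdpa",
        token=get_hf_token(),
    )

# Stand-in objects to exercise get_layers without downloading a model.
text_only = SimpleNamespace(model=SimpleNamespace(layers=["L0", "L1"]))
multimodal = SimpleNamespace(
    model=SimpleNamespace(language_model=SimpleNamespace(layers=["L0", "L1", "L2"])))
print(len(get_layers(text_only)), len(get_layers(multimodal)))
```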


Contributing

Three common PR patterns, full rules in CONTRIBUTING.md:

  1. Port a notebook to a new model — pick an existing notebook at your tier and swap MODEL_ID, LAYER, D_MODEL. Name it NN_<tier>_<model>_<platform>.ipynb.
  2. Replicate a 2024–2026 paper — title cell with arxiv link, pinned install, paper hyperparameters, inline implementation, validation cell that matches the paper's headline metric within tolerance.
  3. Add a platform (TPU/ROCm/MPS) — write a _platform_<name>.py helper with pick_device() / get_dtype(), patch one notebook as PoC, open a draft PR and tag @caiovicentino for design review.
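
For the first pattern, the naming rule can be checked mechanically before opening a PR. The sketch below is an assumption-laden illustration: the tier and platform vocabularies are inferred from the filenames already in this README rather than from CONTRIBUTING.md, and `parse_name` is a hypothetical helper, not part of the repo's CI.

```python
import re

# Inferred from existing notebook names; the authoritative list may differ.
TIERS = ("hobbyist", "explorer", "papergrade")
PLATFORMS = ("colab", "kaggle", "cloud")

NAME_RE = re.compile(
    r"^(?P<nn>\d{2})_(?P<tier>%s)_(?P<model>[a-z0-9]+(?:_[a-z0-9]+)*)_(?P<platform>%s)\.ipynb$"
    % ("|".join(TIERS), "|".join(PLATFORMS))
)

def parse_name(filename):
    """Return the name's parts as a dict, or None if it breaks the pattern."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None

print(parse_name("01_hobbyist_gemma2_2b_colab.ipynb"))
print(parse_name("8_explorer_llama3_8b_kaggle.ipynb"))   # one-digit prefix: rejected
```

Only ported training notebooks follow this pattern; names such as 04_discover_features.ipynb carry no tier or platform and would return None here.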

Before opening a PR, validate JSON: python3 -c "import json; json.load(open('notebooks/YOUR.ipynb'))". CI runs nbformat.validate. If you have a GPU, dry-run with jupyter nbconvert --to notebook --execute --ExecutePreprocessor.timeout=300 — expect heavy training cells to time out; you're just catching import + dtype bugs.
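
To run the same JSON check over every notebook at once, a small stdlib-only script along these lines may help. It is a sketch, not the project's CI: `check_notebook` and `check_all` are hypothetical names, the default `notebooks` directory is taken from the one-liner above, and the top-level key check is a lightweight stand-in for nbformat.validate, not a replacement for it.

```python
import json
from pathlib import Path

def check_notebook(path):
    """Return None if the file parses as notebook-shaped JSON, else an error string."""
    try:
        with open(path, encoding="utf-8") as fh:
            nb = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return f"{path}: {exc}"
    if not isinstance(nb, dict) or "cells" not in nb or "nbformat" not in nb:
        return f"{path}: missing top-level 'cells' or 'nbformat'"
    return None

def check_all(root="notebooks"):
    """Check every .ipynb directly under root and collect the error strings."""
    results = (check_notebook(p) for p in sorted(Path(root).glob("*.ipynb")))
    return [err for err in results if err]

# Usage: an empty list means every notebook under the directory parsed.
print(check_all("notebooks"))
```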


Output schemas other tools consume

If your notebook emits a JSON that the website consumes, match the schema:

| Tool | Schema (TypeScript source) |
|---|---|
| Trace Theater | web/lib/trace-data.ts · TraceScenario |
| Circuit Canvas | web/lib/circuit-data.ts · CircuitData |
| InterpScore leaderboard | web/lib/leaderboard.ts · LeaderboardEntry |

Where to go next

Your SAE is an asset. Put it to work:


Community


Built on

SAELens (checkpoint format) · Gemma Scope (reference at-scale SAE suite) · Gao et al. 2024 (TopK + AuxK) · Bussmann et al. 2024 (BatchTopK) · Neuronpedia.

Apache-2.0 · openinterp.org

About

Train your first SAE in 30 min → paper-grade at 27B. Free Colab · free Kaggle · cloud ladders. Every scale covered.
