Releases: OpenInterpretability/cli
v0.3.1 — agent-probe-guard refit() for cross-env transfer
Cross-environment probe transfer fix
End-to-end Colab eval revealed that probe weights from v0.3.0 are coupled to
the inference environment they were captured in. On a clean Colab session
(no fla, no flash-attn, sdpa attention), loading the original probe weights
collapses the L55 AUROC (0.848 to 0.559) because the L55 residual has a cosine
similarity of only 0.35 with the training-environment residual: same model,
same prompt, different forward-pass numerics.
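The cosine figure quoted above is the kind of number a comparison like the following produces. This is a minimal sketch, not part of the openinterp API: `cosine_similarity` is a hypothetical helper, and the two short vectors are toy stand-ins for the L55 residuals captured for the same prompt in each environment.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins (illustrative values, not real activations)
resid_train_env = [1.0, 2.0, 0.5, -1.0]
resid_colab_env = [0.9, -0.3, 1.5, -0.2]
cos = cosine_similarity(resid_train_env, resid_colab_env)
print(f"cosine(envs) = {cos:.2f}")
```

A value near 1.0 means the two environments produce nearly the same residual direction; a low value is a signal to refit before trusting loaded weights.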
What's new in v0.3.1
AgentProbeGuard.refit(prompts, labels) — captures fresh activations on
your attached model and refits both top-K probes in place. ~2 min for N=240.
Use this before assess() if your inference environment differs from
caiovicentino1/agent-probe-guard-qwen36-27b's training env.
Cross-env transfer measured
| Probe | nb47b env (with fla+flash) | Default Colab env (no fla, sdpa) |
|---|---|---|
| L55 thinking | AUROC 0.848 | 0.559 loaded / 0.791 refit |
| L43 capability | AUROC 0.830 | 0.759 loaded / 0.806 refit |
| L43 cosine(envs) | — | 0.79 |
| L55 cosine(envs) | — | 0.35 |
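Refit recovers most of the signal (L55: 0.559 → 0.791; L43: 0.759 → 0.806),
leaving a gap of roughly 2 to 6 pp against the training environment and
confirming the probe direction is transferable; only the coordinate-level
weights need recalibration.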
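For readers who want the gaps spelled out, the arithmetic on the table above can be sketched as follows. The AUROC values are copied from the table; `gap_pp` is a hypothetical helper introduced only for this illustration.

```python
# AUROC values copied from the table above: (training env, loaded, refit)
auroc = {
    "L55 thinking": (0.848, 0.559, 0.791),
    "L43 capability": (0.830, 0.759, 0.806),
}

def gap_pp(reference, observed):
    """Difference in percentage points, rounded to one decimal."""
    return round((observed - reference) * 100, 1)

gaps = {
    probe: {"loaded": gap_pp(train, loaded), "refit": gap_pp(train, refit)}
    for probe, (train, loaded, refit) in auroc.items()
}
for probe, g in gaps.items():
    print(f"{probe}: loaded {g['loaded']:+.1f} pp, refit {g['refit']:+.1f} pp")
```

With these inputs the L55 gap shrinks from -28.9 pp (loaded) to -5.7 pp (refit), and the L43 gap from -7.1 pp to -2.4 pp.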
Quick start
```python
from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

# If your env differs from nb47b training env (most users), refit first:
prompts = [...]  # 100-300 representative prompts
labels = [...]   # binary ground-truth labels (patch_generated, has_think_v1, etc.)
guard.refit(prompts, labels)  # CV AUROC printed; weights replaced in place

# Then use as normal:
decision = guard.assess(messages, partial_response=current_thought)
```
Paper appendix
Appendix C of paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
documents the cross-environment transfer matrix as a methodology contribution.
Links
- 🛡️ Landing: https://openinterp.org/products/agent-probe-guard
- 📦 PyPI v0.3.1: https://pypi.org/project/openinterp/0.3.1/
- 🤗 HF dataset: https://huggingface.co/datasets/caiovicentino1/agent-probe-guard-qwen36-27b
- 📜 Paper: https://github.com/OpenInterpretability/openinterp-swebench-harness/blob/main/paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
v0.3.0 — agent-probe-guard SDK
agent-probe-guard SDK v0.1 — mid-reasoning gate for code agents
Two-probe activation gate for LLM-based code agents on Qwen3.6-27B. Detect-only by design.
What's new
- `AgentProbeGuard`: two probes (L43 capability, K=10 + L55 thinking-intent, K=5), three modes (skip / escalate / proceed), ~50ms total latency on RTX 6000.
- `from_pretrained()` loads probe weights from `caiovicentino1/agent-probe-guard-qwen36-27b`.
- `assess(messages, partial_response)` → `Decision(action, scores, thresholds)`.
- Detect-only: confirmed across 3 intervention experiments (Phase 7 + Phase 8 + Phase 8 redux). No boost mode shipped.
Eval (sklearn-only, N=240)
| Metric | Value |
|---|---|
| Thinking AUROC | 0.855 in-sample · 0.848 4-fold CV |
| Capability AUROC | 0.863 in-sample · 0.830 4-fold CV |
| Sklearn forward latency p95 | 0.19 ms |
| Decision split (skip<0.20, escalate<0.50) | skip 21.2% / escalate 30.0% / proceed 48.8% |
| skip → true negative rate | 86.3% |
| proceed → true positive rate | 82.1% |
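The three-way split in the "Decision split" row can be read as a simple threshold mapping. The sketch below assumes a single scalar score in [0, 1] is compared against the two cut-offs; how the SDK combines the two probe scores is not stated here, and `decide` is a hypothetical function, not the library's implementation.

```python
SKIP_BELOW = 0.20      # from the "Decision split" row
ESCALATE_BELOW = 0.50  # from the "Decision split" row

def decide(score):
    """Map a probe score in [0, 1] to one of the three modes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < SKIP_BELOW:
        return "skip"
    if score < ESCALATE_BELOW:
        return "escalate"
    return "proceed"

for s in (0.05, 0.35, 0.80):
    print(s, decide(s))
```

Scores exactly on a boundary fall into the higher band in this sketch (0.20 escalates, 0.50 proceeds), which is an assumption about how ties are handled.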
Three sanity checks (paper §3)
- Random-feature baseline at small N (catches over-parameterization)
- Control-token normalization for steering (catches uniform softmax-temperature shifts)
- Structural-rigidity α-sweep diagnostic (catches template-locked decisions)
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from openinterp import AgentProbeGuard

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype="bfloat16",
    device_map="cuda", trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

decision = guard.assess(messages, partial_response=current_thought)
if decision.action == "skip":
    raise BudgetSkip(decision.reason)
elif decision.action == "escalate":
    return stronger_model.complete(messages)
```
Links
- 🛡️ Landing: https://openinterp.org/products/agent-probe-guard
- 📦 PyPI: https://pypi.org/project/openinterp/0.3.0/
- 🤗 HF dataset: https://huggingface.co/datasets/caiovicentino1/agent-probe-guard-qwen36-27b
- 📜 Paper draft: https://github.com/OpenInterpretability/openinterp-swebench-harness/blob/main/paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
- 🔬 Reproduction harness: https://github.com/OpenInterpretability/openinterp-swebench-harness
Apache-2.0 throughout. Patent grant included.
v0.2.2 — Fix verify_adapter_loaded false-positive
Bugfix
safe_load_qwen36_lora() with verify=True was raising LoRAVerificationError on adapters that actually loaded correctly.
Root cause
```python
loaded_model = PeftModel.from_pretrained(base_model, ...)
# ^ MUTATES base_model in place (injects LoRA layers).
# verify then compared base (now LoRA-applied) against loaded (same object) -> diff = 0.000
```
Fix
Capture base_logits BEFORE calling PeftModel.from_pretrained(). Honest reference for the diff comparison.
Discovered during the nb44 v2 paper-3 behavior eval: a manual post-load diff was 0.156 (adapter functional), yet verify=True still raised the silent-failure error.
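The ordering problem can be reproduced with a toy object. This is a sketch under stated assumptions: `ToyModel` and `attach_adapter` are hypothetical stand-ins that only mimic the in-place mutation described above, and they do not use PEFT or the openinterp code.

```python
class ToyModel:
    """Stand-in for a model: 'logits' are inputs scaled by a weight."""
    def __init__(self, weight):
        self.weight = weight

    def logits(self, xs):
        return [self.weight * x for x in xs]

def attach_adapter(model, delta):
    """Mimics the behaviour described above: mutate in place, return the same object."""
    model.weight += delta
    return model

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

xs = [1.0, 2.0, 3.0]

# Buggy order: the reference is computed after the in-place mutation.
base = ToyModel(1.0)
loaded = attach_adapter(base, 0.5)
buggy_diff = max_abs_diff(base.logits(xs), loaded.logits(xs))

# Fixed order: capture the base logits before attaching the adapter.
base = ToyModel(1.0)
base_logits = base.logits(xs)
loaded = attach_adapter(base, 0.5)
fixed_diff = max_abs_diff(base_logits, loaded.logits(xs))

print(buggy_diff, fixed_diff)
```

In the buggy order both names point at the same mutated object, so the diff is zero no matter what the adapter does; capturing the reference first yields a nonzero diff.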
Upgrade
```bash
pip install --upgrade openinterp # → v0.2.2
```
v0.2.1 — safe_load_qwen36_lora() utility
What's new
Adds the openinterp.lora module, which packages the fix for the Qwen3.6 PEFT-save .language_model. infix bug discovered across nb39 → nb40 → nb41 v2 (April 2026).
The bug it fixes
Saved Qwen3.6 LoRA adapters carry a .language_model. infix in their state-dict keys. PeftModel.from_pretrained() against a reloaded dense Qwen3.6 silently fails to apply the adapter — max logit-diff between base and "loaded" model is exactly 0.000. No error raised; the adapter loaded but produces zero functional change.
New API
```python
from openinterp import safe_load_qwen36_lora
model = safe_load_qwen36_lora(
    base_model_id="Qwen/Qwen3.6-27B",
    adapter_path="path/to/checkpoint-200",
    # auto strip .language_model. + auto verify logit-diff > 0.01
)
```
Plus exposed lower-level utilities:
- `strip_language_model_infix(state_dict)` — pure dict transform
- `verify_adapter_loaded(...)` — logit-diff sanity check
- `LoRAVerificationError` — custom exception raised when the adapter load silently failed
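As a rough illustration of what a pure dict transform of this kind might look like, the sketch below collapses the infix in every key. It assumes the fix amounts to replacing `.language_model.` with a single dot, which the notes do not confirm; `strip_infix_sketch` and the example key names are hypothetical and are not the library's `strip_language_model_infix`.

```python
INFIX = ".language_model."

def strip_infix_sketch(state_dict):
    """Return a new dict whose keys have the infix collapsed to a single dot."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(INFIX, ".")
        if new_key in cleaned:
            raise KeyError(f"key collision after stripping: {new_key}")
        cleaned[new_key] = value
    return cleaned

# Hypothetical key names, for illustration only
saved = {
    "base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": "tensor_a",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "tensor_b",
}
for key in strip_infix_sketch(saved):
    print(key)
```

The input dict is left untouched and a collision raises instead of silently overwriting a tensor, which keeps the transform safe to run before a logit-diff check.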
Why it matters
This bug invalidated ~10 hours of prior eval work on our paper-2 (probe-detected grokking in multi-probe DPO) before being caught. Anyone working with Qwen3.6 LoRA save/reload pipelines should run the sanity check — without it, the failure mode is silent.
Install
```bash
pip install --upgrade openinterp
```
v0.2.0 — FabricationGuard SDK
What's new
FabricationGuard — activation-probe hallucination detection for open-weights LLMs. First production probe-based guard from OpenInterp.
- AUROC 0.88 cross-task on SimpleQA (held-out, probe trained on TruthfulQA + HaluEval + MMLU)
- AUROC 0.90 within-bench on HaluEval-QA
- 88% reduction in confident-wrong rate in abstain mode on factual QA
- ~1 ms scoring latency (single matrix multiplication on captured residual)
- Apache-2.0 + patent grant
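The latency claim follows from how little work a linear probe does per query. The sketch below assumes a logistic form (sigmoid of a dot product plus bias) and an abstain threshold of 0.5; neither is confirmed by these notes, and `probe_score` and `answer_or_abstain` are hypothetical helpers rather than FabricationGuard internals.

```python
import math

def probe_score(residual, weights, bias=0.0):
    """Linear probe: sigmoid(w . x + b), a score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, residual)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def answer_or_abstain(text, score, threshold=0.5):
    """Return the model text unless the fabrication score reaches the threshold."""
    if score >= threshold:
        return "I'm not confident about this..."
    return text

# Toy values, not real probe weights or activations
weights = [0.8, -0.4, 1.2]
residual = [1.0, 0.5, 1.5]
score = probe_score(residual, weights, bias=-0.5)
print(round(score, 3), answer_or_abstain("draft answer", score))
```

Scoring is one dot product over an activation the forward pass has already produced, so the probe adds almost nothing on top of generation.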
Install
```bash
pip install --upgrade "openinterp[full]"
```
API
```python
from openinterp import FabricationGuard
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True,
).eval()

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B").attach(model, tok)
out = guard.generate("Who is Bambale Osby?", mode="abstain")
print(out["text"])   # "I'm not confident about this..."
print(out["score"])  # 0.93
```
CLI
```bash
openinterp guard -m Qwen/Qwen3.6-27B -p "Who is Bambale Osby?" --mode abstain
```
Source artifacts
- Probe + headline figure: https://huggingface.co/datasets/caiovicentino1/FabricationGuard-linearprobe-qwen36-27b
- Reproducer notebooks: OpenInterpretability/notebooks/30_hallucinationguard_proof_qwen36_27b.ipynb + 31_hallucinationguard_v2_linear_probe.ipynb
- Source SAE (Qwen3.6-27B paper-grade): https://huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade
Honest scope
| ✅ Works for | ❌ Out-of-scope |
|---|---|
| Generation-fabrication (HaluEval-style open QA) | Misconception resistance (TruthfulQA-style) |
| Entity recall failures (SimpleQA-style) | MC knowledge selection (MMLU-style) |
| Customer support / medical / legal / docs Q&A | Multi-step reasoning failures |
Comparison
| Tool | Hallucination AUROC | Latency | Open weights | Multi-model |
|---|---|---|---|---|
| Patronus Lynx-70B | 0.87 (HaluBench) | LLM-judge ~100ms+ | ✅ | ❌ Llama only |
| Vectara HHEM-2.1 | ~0.85 | 600 ms RTX 3090 | ✅ | ✅ generic |
| Goodfire Ember | proprietary, enterprise-only | unknown | ❌ | ❌ Llama only |
| OpenInterp FabricationGuard | 0.88 cross / 0.90 within | ~1 ms | ✅ Apache-2.0 | ✅ via Pearson_CE |
Roadmap
- v0.3 — Multi-model probes via Pearson_CE cross-model transfer (Llama-3.3, Gemma-2)
- v0.4 — vLLM + SGLang inference plugins, LangChain middleware
- v0.5 — Pro tier hosted API at $0.02/1M tokens
Full changelog: CHANGELOG.md