Releases: OpenInterpretability/cli

v0.3.1 — agent-probe-guard refit() for cross-env transfer

08 May 14:21

Cross-environment probe transfer fix

End-to-end Colab eval revealed that probe weights from v0.3.0 are coupled to
the inference environment they were captured in. On a clean Colab session
(no fla, no flash-attn, sdpa attention), loading the original probe weights
collapses AUROC because the residual at L55 has cosine 0.35 vs the
training-environment residual — same model, same prompt, different forward
pass numerics.
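A minimal sketch of the cosine check described above, assuming you have already captured the same-layer residual for the same prompt in both environments. The vectors here are toy stand-ins, not real activations, and `cosine` is an illustrative helper rather than part of the openinterp API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy stand-ins for one layer's residual captured in two environments.
resid_train_env = [1.0, 0.0, 2.0, -1.0]
resid_other_env = [0.9, 0.4, 1.5, -1.2]
similarity = cosine(resid_train_env, resid_other_env)
```

A value well below 1.0 for the same model and prompt is the symptom reported here: the probe weights were fit in a coordinate system the new environment no longer reproduces.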

What's new in v0.3.1

AgentProbeGuard.refit(prompts, labels) — captures fresh activations on
your attached model and refits both top-K probes in place. ~2 min for N=240.
Use this before assess() if your inference environment differs from
caiovicentino1/agent-probe-guard-qwen36-27b's training env.
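One rough way to decide whether a refit is likely needed is to check which optional kernel packages are importable. This is a sketch under assumptions: the import names `fla` and `flash_attn` are assumed to correspond to the packages mentioned above, and `kernel_env_report` / `should_refit` are hypothetical helpers, not part of openinterp.

```python
import importlib.util

def kernel_env_report(modules=("fla", "flash_attn")):
    """Map each optional kernel package to whether it is importable here."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

def should_refit(report):
    """Heuristic (an assumption): refit unless every training-env kernel is present."""
    return not all(report.values())

report = kernel_env_report()
needs_refit = should_refit(report)
```

Even when both packages are present, the attention implementation actually selected at load time may differ, so treat this as a first filter and compare residuals directly when in doubt.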

Cross-env transfer measured

| Probe | nb47b env (with fla+flash) | Default Colab env (no fla, sdpa) |
|---|---|---|
| L55 thinking AUROC | 0.848 | 0.559 loaded / 0.791 refit |
| L43 capability AUROC | 0.830 | 0.759 loaded / 0.806 refit |
| L43 cosine(envs) | 0.79 | |
| L55 cosine(envs) | 0.35 | |

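Refit recovers most of the signal: refit AUROC lands about 2 to 6 pp below the
training-environment value (0.791 vs 0.848 at L55, 0.806 vs 0.830 at L43),
confirming the probe direction is transferable; only the coordinate-level
weights need recalibration.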
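The arithmetic behind that summary, worked from the AUROC values in the cross-env table. `transfer_summary` is an illustrative helper; the numbers are copied from these notes.

```python
# AUROC values copied from the cross-env table in these notes.
results = {
    "L55 thinking": {"train_env": 0.848, "loaded": 0.559, "refit": 0.791},
    "L43 capability": {"train_env": 0.830, "loaded": 0.759, "refit": 0.806},
}

def transfer_summary(row):
    """Return (remaining drop in pp, fraction of the lost AUROC regained by refit)."""
    lost = row["train_env"] - row["loaded"]
    regained = row["refit"] - row["loaded"]
    drop_pp = (row["train_env"] - row["refit"]) * 100
    return round(drop_pp, 1), round(regained / lost, 2)

summary = {name: transfer_summary(row) for name, row in results.items()}
for name, (drop_pp, recovered) in summary.items():
    print(f"{name}: {drop_pp} pp below training env, {recovered:.0%} of the gap recovered")
```

On these figures the L55 probe ends 5.7 pp below its training-environment AUROC and the L43 probe 2.4 pp below.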

Quick start

from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

# If your env differs from nb47b training env (most users), refit first:
prompts = [...]   # 100-300 representative prompts
labels = [...]    # binary ground-truth labels (patch_generated, has_think_v1, etc.)
guard.refit(prompts, labels)  # CV AUROC printed; weights replaced in place

# Then use as normal:
decision = guard.assess(messages, partial_response=current_thought)

Paper appendix

Appendix C of paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
documents the cross-environment transfer matrix as a methodology contribution.

v0.3.0 — agent-probe-guard SDK

08 May 13:23

agent-probe-guard SDK v0.1 — mid-reasoning gate for code agents

Two-probe activation gate for LLM-based code agents on Qwen3.6-27B. Detect-only by design.

What's new

  • AgentProbeGuard: two probes (L43 capability, K=10 + L55 thinking-intent, K=5), three modes (skip / escalate / proceed), ~50ms total latency on RTX 6000.
  • from_pretrained() loads probe weights from caiovicentino1/agent-probe-guard-qwen36-27b.
  • assess(messages, partial_response) returns Decision(action, scores, thresholds).
  • Detect-only: confirmed across 3 intervention experiments (Phase 7 + Phase 8 + Phase 8 redux). No boost mode shipped.

Eval (sklearn-only, N=240)

| Metric | Value |
|---|---|
| Thinking AUROC | 0.855 in-sample · 0.848 4-fold CV |
| Capability AUROC | 0.863 in-sample · 0.830 4-fold CV |
| Sklearn forward latency p95 | 0.19 ms |
| Decision split (skip<0.20, escalate<0.50) | skip 21.2% / escalate 30.0% / proceed 48.8% |
| skip → true negative rate | 86.3% |
| proceed → true positive rate | 82.1% |
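The decision split above can be read as a two-threshold mapping from a probe score to an action. A minimal sketch, assuming a single score in [0, 1] is compared against both thresholds; how the SDK actually combines the two probe scores is not stated here, and `ToyDecision` is a hypothetical stand-in for the real Decision object.

```python
from dataclasses import dataclass

# Thresholds from the eval table; applying them to one score is an assumption.
SKIP_BELOW = 0.20
ESCALATE_BELOW = 0.50

@dataclass(frozen=True)
class ToyDecision:
    """Hypothetical stand-in for the SDK's Decision object."""
    action: str
    score: float

def decide(score: float) -> ToyDecision:
    """Map a probability-like score to skip / escalate / proceed."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score < SKIP_BELOW:
        return ToyDecision("skip", score)
    if score < ESCALATE_BELOW:
        return ToyDecision("escalate", score)
    return ToyDecision("proceed", score)

actions = [decide(s).action for s in (0.05, 0.20, 0.49, 0.50, 0.90)]
```

Both thresholds are strict upper bounds, so a score of exactly 0.20 escalates and exactly 0.50 proceeds.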

Three sanity checks (paper §3)

  1. Random-feature baseline at small N (catches over-parameterization)
  2. Control-token normalization for steering (catches uniform softmax-temperature shifts)
  3. Structural-rigidity α-sweep diagnostic (catches template-locked decisions)
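To illustrate the first check, here is one reading of a random-feature baseline: pick the best of many pure-noise features by in-sample AUROC and see how far above 0.5 it lands at small N. This is a sketch of the idea and not the procedure from the paper; the sample sizes, feature count, and helper names are illustrative assumptions.

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC: P(positive scores above negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_random_feature_auroc(n_samples, n_features, seed=0):
    """Best in-sample AUROC among pure-noise features (labels carry no signal)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # balanced, arbitrary labels
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        score = auroc(feature, labels)
        best = max(best, score, 1.0 - score)  # a probe may use either sign
    return best

small_n = best_random_feature_auroc(n_samples=40, n_features=200)
large_n = best_random_feature_auroc(n_samples=200, n_features=200)
```

If a probe's in-sample AUROC is not clearly above what the best noise feature achieves at the same N, the probe is probably fitting selection noise rather than signal.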

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
from openinterp import AgentProbeGuard

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype="bfloat16",
    device_map="cuda", trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

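def guarded_step(messages, current_thought):
    # BudgetSkip and stronger_model are caller-defined placeholders, not part of openinterp.
    decision = guard.assess(messages, partial_response=current_thought)
    if decision.action == "skip":
        raise BudgetSkip(decision.reason)
    elif decision.action == "escalate":
        return stronger_model.complete(messages)
    return None  # "proceed": keep generating with the attached model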

Links

Apache-2.0 throughout. Patent grant included.

v0.2.2 — Fix verify_adapter_loaded false-positive

02 May 01:37

Bugfix

safe_load_qwen36_lora() with verify=True was raising LoRAVerificationError on adapters that actually loaded correctly.

Root cause

```python
loaded_model = PeftModel.from_pretrained(base_model, ...)
# ↑ MUTATES base_model in place (injects LoRA layers)
# verify then compared base (now LoRA-applied) against loaded (same object) → diff=0.000
```

Fix

Capture base_logits before calling PeftModel.from_pretrained(), so the diff is computed against an unmodified reference.

Discovered during the nb44 v2 paper-3 behavior eval: a manual post-load diff was 0.156 (the adapter was functional) while verify=True still raised LoRAVerificationError.
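The aliasing behind this false positive can be reproduced with plain Python objects. The sketch below uses toy stand-ins (`ToyModel`, `apply_adapter_in_place`) rather than PEFT itself; it only illustrates why the reference logits have to be captured before the in-place load.

```python
class ToyModel:
    """Stand-in for a model whose output is a list of 'logits'."""
    def __init__(self, weight):
        self.weight = weight

    def logits(self, inputs):
        return [self.weight * x for x in inputs]

def apply_adapter_in_place(model, delta):
    """Mimics a loader that mutates its argument and returns the same object."""
    model.weight += delta
    return model

def max_logit_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

inputs = [1.0, 2.0, 3.0]
base = ToyModel(weight=1.0)

base_logits = base.logits(inputs)  # captured BEFORE the adapter is applied
loaded = apply_adapter_in_place(base, 0.5)

# Buggy check: base and loaded are the same mutated object, so the diff is zero.
buggy_diff = max_logit_diff(base.logits(inputs), loaded.logits(inputs))
# Fixed check: compare against the logits captured before mutation.
fixed_diff = max_logit_diff(base_logits, loaded.logits(inputs))
```

Here `buggy_diff` is zero even though the adapter changed the model, which is the false positive this release removes.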

Upgrade

```bash
pip install --upgrade openinterp # → v0.2.2
```

PyPI: https://pypi.org/project/openinterp/0.2.2/

v0.2.1 — safe_load_qwen36_lora() utility

01 May 19:21

What's new

Adds the openinterp.lora module, which encapsulates the fix for the Qwen3.6 PEFT-save .language_model. infix bug discovered across nb39 → nb40 → nb41 v2 (April 2026).

The bug it fixes

Saved Qwen3.6 LoRA adapters carry a .language_model. infix in their state-dict keys. PeftModel.from_pretrained() against a reloaded dense Qwen3.6 silently fails to apply the adapter — max logit-diff between base and "loaded" model is exactly 0.000. No error raised; the adapter loaded but produces zero functional change.

New API

```python
from openinterp import safe_load_qwen36_lora

model = safe_load_qwen36_lora(
    base_model_id="Qwen/Qwen3.6-27B",
    adapter_path="path/to/checkpoint-200",
    # auto strip .language_model. + auto verify logit-diff > 0.01
)
```

Plus exposed lower-level utilities:

  • `strip_language_model_infix(state_dict)` — pure dict transform
  • `verify_adapter_loaded(...)` — logit-diff sanity check
  • `LoRAVerificationError` — custom exception when adapter loaded silently failed
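A minimal sketch of what a pure dict transform and a logit-diff check of this kind might look like. This is not the library's implementation: `strip_infix` and `check_logit_diff` are illustrative names, the example key is hypothetical, and the assumption is that removing the infix means collapsing `.language_model.` to a single dot.

```python
INFIX = ".language_model."

def strip_infix(state_dict, infix=INFIX):
    """Return a new dict whose keys have every `infix` collapsed to a single dot."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(infix, ".")
        if new_key in cleaned:
            raise KeyError(f"key collision after stripping infix: {new_key}")
        cleaned[new_key] = value
    return cleaned

def check_logit_diff(base_logits, loaded_logits, min_diff=0.01):
    """Raise if the adapter changed no logit by more than `min_diff`."""
    diff = max(abs(b - l) for b, l in zip(base_logits, loaded_logits))
    if diff <= min_diff:
        raise RuntimeError(f"adapter appears inactive: max logit diff {diff:.3f}")
    return diff

# Hypothetical key, for illustration only.
saved = {"base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": 1}
cleaned = strip_infix(saved)
```

The transform returns a new dict and leaves the input untouched, so the original checkpoint keys remain available for debugging.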

Why it matters

This bug invalidated ~10 hours of prior eval work on our paper-2 (probe-detected grokking in multi-probe DPO) before being caught. Anyone working with Qwen3.6 LoRA save/reload pipelines should run the sanity check — without it, the failure mode is silent.

Install

```bash
pip install --upgrade openinterp
```

PyPI: https://pypi.org/project/openinterp/0.2.1/

v0.2.0 — FabricationGuard SDK

27 Apr 22:40

What's new

FabricationGuard — activation-probe hallucination detection for open-weights LLMs. First production probe-based guard from OpenInterp.

  • AUROC 0.88 cross-task on SimpleQA (held-out, probe trained on TruthfulQA + HaluEval + MMLU)
  • AUROC 0.90 within-bench on HaluEval-QA
  • 88% reduction in confident-wrong rate in abstain mode on factual QA
  • ~1 ms scoring latency (single matrix multiplication on captured residual)
  • Apache-2.0 + patent grant
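The ~1 ms figure follows from how little work a linear probe does per call. A minimal sketch, assuming a logistic probe of the form sigmoid(w · h + b) over one captured residual vector h; the logistic form, the weights, and the `probe_score` name are illustrative assumptions, not the shipped probe.

```python
import math

def probe_score(residual, weights, bias=0.0):
    """Logistic probe: sigmoid(w . h + b) for one captured residual vector h."""
    if len(residual) != len(weights):
        raise ValueError("residual and weights must have the same length")
    logit = sum(w * h for w, h in zip(weights, residual)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy numbers; real residuals have the model's hidden size.
weights = [0.8, -0.5, 0.3]
residual = [1.0, 0.0, 2.0]
score = probe_score(residual, weights, bias=-0.4)  # logit = 0.8 + 0.6 - 0.4 = 1.0
```

The cost is one dot product over the hidden dimension, which is why scoring is negligible next to the forward pass that produces the residual.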

Install

pip install --upgrade "openinterp[full]"

API

from openinterp import FabricationGuard
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True,
).eval()

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B").attach(model, tok)
out = guard.generate("Who is Bambale Osby?", mode="abstain")
print(out["text"])   # "I'm not confident about this..."
print(out["score"])  # 0.93

CLI

openinterp guard -m Qwen/Qwen3.6-27B -p "Who is Bambale Osby?" --mode abstain

Honest scope

| ✅ Works for | ❌ Out-of-scope |
|---|---|
| Generation-fabrication (HaluEval-style open QA) | Misconception resistance (TruthfulQA-style) |
| Entity recall failures (SimpleQA-style) | MC knowledge selection (MMLU-style) |
| Customer support / medical / legal / docs Q&A | Multi-step reasoning failures |

Comparison

Tool Hallucination AUROC Latency Open weights Multi-model
Patronus Lynx-70B 0.87 (HaluBench) LLM-judge ~100ms+ ❌ Llama only
Vectara HHEM-2.1 ~0.85 600 ms RTX 3090 ✅ generic
Goodfire Ember proprietary, enterprise-only unknown ❌ Llama only
OpenInterp FabricationGuard 0.88 cross / 0.90 within ~1 ms ✅ Apache-2.0 ✅ via Pearson_CE

Roadmap

  • v0.3 — Multi-model probes via Pearson_CE cross-model transfer (Llama-3.3, Gemma-2)
  • v0.4 — vLLM + SGLang inference plugins, LangChain middleware
  • v0.5 — Pro tier hosted API at $0.02/1M tokens

Full changelog: CHANGELOG.md