08 May 14:21

d2926c0

v0.3.1 — agent-probe-guard refit() for cross-env transfer

Cross-environment probe transfer fix

An end-to-end Colab eval revealed that the v0.3.0 probe weights are coupled to
the inference environment they were captured in. On a clean Colab session
(no fla, no flash-attn, sdpa attention), loading the original probe weights
collapses AUROC because the L55 residual has a cosine similarity of only 0.35
with the training-environment residual: same model, same prompt, different
forward-pass numerics.
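
The environment coupling described above can be checked by comparing the residual captured for one prompt in two environments. The following is a minimal sketch in plain Python. The vectors are toy values, and `cosine` is a hypothetical helper, not part of openinterp.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy stand-ins for the residual of one prompt captured in two environments.
resid_train_env = [1.0, 0.0, 2.0, -1.0]
resid_colab_env = [0.9, 0.4, 0.5, 0.8]

print(round(cosine(resid_train_env, resid_train_env), 3))
print(round(cosine(resid_train_env, resid_colab_env), 3))
```

A value near 1.0 means the two environments produce nearly the same residual direction. A low value, such as the 0.35 reported for L55, suggests that weights fitted in one environment may not transfer to the other.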

What's new in v0.3.1

AgentProbeGuard.refit(prompts, labels) — captures fresh activations on
your attached model and refits both top-K probes in place. ~2 min for N=240.
Use this before assess() if your inference environment differs from
caiovicentino1/agent-probe-guard-qwen36-27b's training env.
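
The internals of refit() are not documented in these notes. As a rough illustration of the idea (select the top-K dimensions on fresh activations, then refit the probe weights), here is a sketch that swaps in a difference-of-means probe on synthetic data. `fit_topk_probe`, `probe_score`, and the toy activations are all assumptions for illustration, not the SDK's implementation.

```python
import random

def fit_topk_probe(acts, labels, k):
    """Pick the k dimensions with the largest class-mean gap and return
    (indices, weights, bias) for a difference-of-means linear probe."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mean_pos = [sum(a[j] for a in pos) / len(pos) for j in range(dim)]
    mean_neg = [sum(a[j] for a in neg) / len(neg) for j in range(dim)]
    gap = [mp - mn for mp, mn in zip(mean_pos, mean_neg)]
    idx = sorted(range(dim), key=lambda j: abs(gap[j]), reverse=True)[:k]
    weights = [gap[j] for j in idx]
    # The bias places the decision boundary midway between the class means.
    bias = -sum(w * (mean_pos[j] + mean_neg[j]) / 2 for w, j in zip(weights, idx))
    return idx, weights, bias

def probe_score(probe, act):
    """Signed score: positive means the probe predicts label 1."""
    idx, weights, bias = probe
    return sum(w * act[j] for w, j in zip(weights, idx)) + bias

# Synthetic "activations": only dims 3 and 7 carry the label (toy data).
rng = random.Random(0)

def sample(y):
    a = [rng.gauss(0, 1) for _ in range(16)]
    a[3] += 2.0 * y
    a[7] -= 2.0 * y
    return a

labels = [i % 2 for i in range(240)]
acts = [sample(y) for y in labels]

probes = {}
probes["L55"] = fit_topk_probe(acts, labels, k=2)  # overwrites any stored weights
hits = sum((probe_score(probes["L55"], a) > 0) == (y == 1)
           for a, y in zip(acts, labels))
print(sorted(probes["L55"][0]), round(hits / len(labels), 2))
```

The accuracy printed here is in-sample. A real refit would report a cross-validated metric, as the CV AUROC mentioned above does.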

Cross-env transfer measured

| Probe | nb47b env (with fla+flash) | Default Colab env (no fla, sdpa) |
| --- | --- | --- |
| L55 thinking | AUROC 0.848 | 0.559 loaded / 0.791 refit |
| L43 capability | AUROC 0.830 | 0.759 loaded / 0.806 refit |
| L43 cosine(envs) | n/a | 0.79 |
| L55 cosine(envs) | n/a | 0.35 |

Refit recovers most of the signal (L55: 0.848 → 0.791, a 5.7 pp drop; L43:
0.830 → 0.806, a 2.4 pp drop), which confirms that the probe direction is
transferable; only the coordinate-level weights need recalibration.
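
The drops can be recomputed directly from the table. This short sketch copies the AUROC values from the table above; `drop_pp` is a hypothetical helper name.

```python
# AUROC values copied from the cross-env transfer table.
results = {
    "L55 thinking":   {"train_env": 0.848, "loaded": 0.559, "refit": 0.791},
    "L43 capability": {"train_env": 0.830, "loaded": 0.759, "refit": 0.806},
}

def drop_pp(reference, value):
    """Drop from reference to value, in percentage points."""
    return round((reference - value) * 100, 1)

for name, r in results.items():
    print(name,
          "| loaded drop:", drop_pp(r["train_env"], r["loaded"]), "pp",
          "| refit drop:", drop_pp(r["train_env"], r["refit"]), "pp")
```

For L55 the drop shrinks from 28.9 pp (loaded) to 5.7 pp (refit). For L43 it shrinks from 7.1 pp to 2.4 pp.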

Quick start

from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

# If your env differs from nb47b training env (most users), refit first:
prompts = [...]   # 100-300 representative prompts
labels = [...]    # binary ground-truth labels (patch_generated, has_think_v1, etc.)
guard.refit(prompts, labels)  # CV AUROC printed; weights replaced in place

# Then use as normal:
decision = guard.assess(messages, partial_response=current_thought)

Paper appendix

Appendix C of paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
documents the cross-environment transfer matrix as a methodology contribution.

v0.3.0 — agent-probe-guard SDK

agent-probe-guard SDK v0.1 — mid-reasoning gate for code agents

Two-probe activation gate for LLM-based code agents on Qwen3.6-27B. Detect-only by design.

What's new

AgentProbeGuard: two probes (L43 capability, K=10 + L55 thinking-intent, K=5), three modes (skip / escalate / proceed), ~50ms total latency on RTX 6000.
from_pretrained() loads probe weights from caiovicentino1/agent-probe-guard-qwen36-27b.
assess(messages, partial_response) → Decision(action, scores, thresholds).
Detect-only: confirmed across 3 intervention experiments (Phase 7 + Phase 8 + Phase 8 redux). No boost mode shipped.

Eval (sklearn-only, N=240)

| Metric | Value |
| --- | --- |
| Thinking AUROC | 0.855 in-sample · 0.848 4-fold CV |
| Capability AUROC | 0.863 in-sample · 0.830 4-fold CV |
| Sklearn forward latency p95 | 0.19 ms |
| Decision split (skip<0.20, escalate<0.50) | skip 21.2% / escalate 30.0% / proceed 48.8% |
| skip → true negative rate | 86.3% |
| proceed → true positive rate | 82.1% |
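
The decision split above implies a three-way threshold rule. The sketch below assumes a single probability-like score is compared against the two cut-offs. How the SDK combines its two probe scores is not stated here, so `decide` and `ToyDecision` are illustrative names only.

```python
from dataclasses import dataclass

SKIP_BELOW = 0.20      # from the eval table: skip < 0.20
ESCALATE_BELOW = 0.50  # from the eval table: escalate < 0.50

@dataclass
class ToyDecision:
    action: str
    score: float

def decide(score: float) -> ToyDecision:
    """Map a score in [0, 1] to skip / escalate / proceed."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score < SKIP_BELOW:
        return ToyDecision("skip", score)
    if score < ESCALATE_BELOW:
        return ToyDecision("escalate", score)
    return ToyDecision("proceed", score)

for s in (0.05, 0.35, 0.80):
    print(s, decide(s).action)
```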

Three sanity checks (paper §3)

Random-feature baseline at small N (catches over-parameterization)
Control-token normalization for steering (catches uniform softmax-temperature shifts)
Structural-rigidity α-sweep diagnostic (catches template-locked decisions)
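
To illustrate why the first check matters, the sketch below scores pure-noise features at small N and shows that the best of them can look informative in-sample. This is a generic demonstration in plain Python, not the procedure from the paper; the sample and feature counts are arbitrary assumptions.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
n_samples, n_features = 40, 200
labels = [i % 2 for i in range(n_samples)]

# Pure-noise features: none carries any information about the labels.
features = [[rng.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

in_sample = [auroc(f, labels) for f in features]
mean_auroc = sum(in_sample) / n_features
best_auroc = max(in_sample)

print(f"mean AUROC over noise features: {mean_auroc:.3f}")
print(f"best single noise feature:      {best_auroc:.3f}")
```

The mean stays near chance, but the best feature does not. A probe that selects features in-sample should therefore beat this selected-noise level, not just 0.5.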

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
from openinterp import AgentProbeGuard

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype="bfloat16",
    device_map="cuda", trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

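def gated_complete(messages, current_thought):
    # BudgetSkip and stronger_model are placeholders for your agent's own types.
    decision = guard.assess(messages, partial_response=current_thought)
    if decision.action == "skip":
        raise BudgetSkip(decision.reason)
    elif decision.action == "escalate":
        return stronger_model.complete(messages)
    return None  # "proceed": keep going with the current model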

v0.2.2 — Fix verify_adapter_loaded false-positive

Bugfix

safe_load_qwen36_lora() with verify=True was raising LoRAVerificationError on adapters that actually loaded correctly.

Root cause

```python
loaded_model = PeftModel.from_pretrained(base_model, ...)
# ^ MUTATES base_model in place (injects LoRA layers).
# verify then compared base (now LoRA-applied) against loaded (same object),
# so the diff was always 0.000.
```

Fix

Capture base_logits before calling PeftModel.from_pretrained(), so the diff comparison has an unmodified reference.
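
The ordering issue can be reproduced without any model. In this toy sketch, `ToyModel` and `toy_load_adapter` are hypothetical stand-ins for a base model and for a loader that patches it in place; they are not PEFT APIs.

```python
class ToyModel:
    def __init__(self, weight):
        self.weight = weight

    def logits(self, x):
        return [self.weight * v for v in x]

def toy_load_adapter(model, delta):
    """Stand-in for an adapter loader that patches the base model in place."""
    model.weight += delta
    return model  # same object as the input

def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

x = [1.0, -2.0, 0.5]

# Buggy order: reference logits are read after the in-place patch.
base = ToyModel(1.0)
loaded = toy_load_adapter(base, delta=0.25)
buggy_diff = max_abs_diff(base.logits(x), loaded.logits(x))

# Fixed order: capture the reference logits before loading the adapter.
base = ToyModel(1.0)
base_logits = base.logits(x)
loaded = toy_load_adapter(base, delta=0.25)
fixed_diff = max_abs_diff(base_logits, loaded.logits(x))

print(buggy_diff, fixed_diff)  # 0.0 0.5
```

In the buggy order both names refer to the same patched object, so the diff is zero no matter what the adapter does.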

Discovered during the nb44 v2 paper-3 behavior eval: a manual post-load diff was 0.156 (functional), while verify=True raised the silent-fail error.

Upgrade

```bash
pip install --upgrade openinterp # → v0.2.2
```

PyPI: https://pypi.org/project/openinterp/0.2.2/

01 May 19:21

caiovicentino

v0.2.1

2c988a9

v0.2.1 — safe_load_qwen36_lora() utility

What's new

Adds the openinterp.lora module, which encapsulates the fix for the Qwen3.6 PEFT-save .language_model. infix bug discovered across nb39 → nb40 → nb41 v2 (April 2026).

The bug it fixes

Saved Qwen3.6 LoRA adapters carry a .language_model. infix in their state-dict keys. PeftModel.from_pretrained() against a reloaded dense Qwen3.6 silently fails to apply the adapter — max logit-diff between base and "loaded" model is exactly 0.000. No error raised; the adapter loaded but produces zero functional change.

New API

```python
from openinterp import safe_load_qwen36_lora

model = safe_load_qwen36_lora(
    base_model_id="Qwen/Qwen3.6-27B",
    adapter_path="path/to/checkpoint-200",
    # auto-strips .language_model. and auto-verifies logit-diff > 0.01
)
```

Plus exposed lower-level utilities:

`strip_language_model_infix(state_dict)` — pure dict transform
`verify_adapter_loaded(...)` — logit-diff sanity check
`LoRAVerificationError` — custom exception raised when an adapter load silently fails
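
As an illustration of what a pure dict transform of this kind can look like, here is a minimal sketch. It is not the library's implementation: `strip_infix_sketch` is a hypothetical name, and the example keys are made up to show the shape of the problem.

```python
def strip_infix_sketch(state_dict, infix=".language_model."):
    """Return a new dict whose keys have `infix` collapsed to a single dot.
    The input dict is left untouched."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace(infix, ".")
        if new_key in cleaned:
            raise KeyError(f"key collision after stripping: {new_key}")
        cleaned[new_key] = value
    return cleaned

# Illustrative keys only; real adapter keys may differ.
saved = {
    "base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0",
}

for key in strip_infix_sketch(saved):
    print(key)
```

Raising on a key collision is a design choice in this sketch: silently overwriting a tensor would reintroduce the kind of quiet failure the utility is meant to prevent.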

Why it matters

This bug invalidated ~10 hours of prior eval work on our paper-2 (probe-detected grokking in multi-probe DPO) before being caught. Anyone working with Qwen3.6 LoRA save/reload pipelines should run the sanity check — without it, the failure mode is silent.

Install

```bash
pip install --upgrade openinterp
```

PyPI: https://pypi.org/project/openinterp/0.2.1/

27 Apr 22:40

caiovicentino

v0.2.0

29de952

v0.2.0 — FabricationGuard SDK

What's new

FabricationGuard — activation-probe hallucination detection for open-weights LLMs. First production probe-based guard from OpenInterp.

AUROC 0.88 cross-task on SimpleQA (held-out, probe trained on TruthfulQA + HaluEval + MMLU)
AUROC 0.90 within-bench on HaluEval-QA
88% reduction in confident-wrong rate in abstain mode on factual QA
~1 ms scoring latency (single matrix multiplication on captured residual)
Apache-2.0 + patent grant
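
The ~1 ms figure is plausible because scoring a linear probe is one dot product over an already-captured residual. The sketch below shows that computation together with a simple abstain rule. The weights, the residual, and the 0.5 threshold are toy assumptions, and `probe_score` and `guarded_answer` are hypothetical names, not the FabricationGuard API.

```python
import math

ABSTAIN_TEXT = "I'm not confident about this..."

def probe_score(residual, weights, bias):
    """Linear probe: sigmoid of one dot product plus a bias."""
    logit = sum(w * h for w, h in zip(weights, residual)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def guarded_answer(answer, score, threshold=0.5):
    """Abstain-mode sketch: replace the answer when the score is high."""
    text = ABSTAIN_TEXT if score >= threshold else answer
    return {"text": text, "score": score}

weights, bias = [0.8, -0.5, 1.2], -0.1  # toy probe parameters
residual = [1.5, -1.0, 1.0]             # toy residual vector
score = probe_score(residual, weights, bias)

print(guarded_answer("draft answer", score))
```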

Install

pip install --upgrade "openinterp[full]"

API

from openinterp import FabricationGuard
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True,
).eval()

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B").attach(model, tok)
out = guard.generate("Who is Bambale Osby?", mode="abstain")
print(out["text"])   # "I'm not confident about this..."
print(out["score"])  # 0.93

CLI

openinterp guard -m Qwen/Qwen3.6-27B -p "Who is Bambale Osby?" --mode abstain

Source artifacts

Probe + headline figure: https://huggingface.co/datasets/caiovicentino1/FabricationGuard-linearprobe-qwen36-27b
Reproducer notebooks: OpenInterpretability/notebooks/30_hallucinationguard_proof_qwen36_27b.ipynb + 31_hallucinationguard_v2_linear_probe.ipynb
Source SAE (Qwen3.6-27B paper-grade): https://huggingface.co/caiovicentino1/qwen36-27b-sae-papergrade

Honest scope

| ✅ Works for | ❌ Out-of-scope |
| --- | --- |
| Generation-fabrication (HaluEval-style open QA) | Misconception resistance (TruthfulQA-style) |
| Entity recall failures (SimpleQA-style) | MC knowledge selection (MMLU-style) |
| Customer support / medical / legal / docs Q&A | Multi-step reasoning failures |

Comparison

| Tool | Hallucination AUROC | Latency | Open weights | Multi-model |
| --- | --- | --- | --- | --- |
| Patronus Lynx-70B | 0.87 (HaluBench) | LLM-judge ~100ms+ | ✅ | ❌ Llama only |
| Vectara HHEM-2.1 | ~0.85 | 600 ms RTX 3090 | ✅ | ✅ generic |
| Goodfire Ember | proprietary, enterprise-only | unknown | ❌ | ❌ Llama only |
| OpenInterp FabricationGuard | 0.88 cross / 0.90 within | ~1 ms | ✅ Apache-2.0 | ✅ via Pearson_CE |

Roadmap

v0.3 — Multi-model probes via Pearson_CE cross-model transfer (Llama-3.3, Gemma-2)
v0.4 — vLLM + SGLang inference plugins, LangChain middleware
v0.5 — Pro tier hosted API at $0.02/1M tokens

Full changelog: CHANGELOG.md

Releases: OpenInterpretability/cli

v0.3.1 — agent-probe-guard refit() for cross-env transfer
v0.3.0 — agent-probe-guard SDK
v0.2.2 — Fix verify_adapter_loaded false-positive
v0.2.1 — safe_load_qwen36_lora() utility
v0.2.0 — FabricationGuard SDK