Cross-environment probe transfer fix
An end-to-end Colab evaluation revealed that the probe weights shipped in v0.3.0
are coupled to the inference environment they were captured in. On a clean
Colab session (no fla, no flash-attn, sdpa attention), loading the original
probe weights collapses AUROC: the layer-55 (L55) residual has a cosine
similarity of only 0.35 with the training-environment residual. The model and
the prompt are the same; only the forward-pass numerics differ.
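To make the drift measurement concrete, here is a minimal sketch of the cosine comparison between two residual vectors captured for the same prompt in two environments. The vectors are hypothetical 4-dimensional stand-ins, not real activations, and the helper name `cosine_similarity` is illustrative rather than part of openinterp.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length activation vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for the residual of one prompt in each environment.
resid_train_env = [1.0, 0.0, 2.0, -1.0]
resid_colab_env = [0.5, 1.5, 0.5, 0.5]

drift = cosine_similarity(resid_train_env, resid_colab_env)
print(f"cosine between environments: {drift:.3f}")
```

A value near 1.0 means the two environments produce nearly parallel residuals, so loaded probe weights should still apply; a low value, like the 0.35 reported for L55, signals that the coordinates the probe reads have shifted.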
What's new in v0.3.1
AgentProbeGuard.refit(prompts, labels) captures fresh activations on your
attached model and refits both top-K probes in place. It takes about 2 minutes
for N=240. Call it before assess() if your inference environment differs from
the training environment of caiovicentino1/agent-probe-guard-qwen36-27b.
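The internals of refit are not described in these notes. As a rough illustration of what refitting a linear probe on freshly captured activations involves, the sketch below fits a logistic-regression probe with plain gradient descent. The probe type, the function names (`fit_toy_probe`, `probe_score`), the hyperparameters, and the 2-dimensional toy features are all assumptions made for illustration; this is not the library's implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_toy_probe(features, labels, lr=0.1, epochs=500):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Probability the probe assigns to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical activations captured in the current environment, with binary labels.
features = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
            [-1.9, 0.2], [-1.6, -0.1], [-2.1, 0.0]]
labels = [1, 1, 1, 0, 0, 0]

w, b = fit_toy_probe(features, labels)
scores = [probe_score(w, b, x) for x in features]
```

The point of the sketch is that the activations are captured in the environment where the probe will run, so the fitted weights match that environment's coordinates rather than those of the training environment.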
Cross-env transfer measured
| Metric | nb47b env (fla + flash-attn) | Default Colab env (no fla, sdpa) |
|---|---|---|
| L55 thinking probe, AUROC | 0.848 | 0.559 loaded / 0.791 refit |
| L43 capability probe, AUROC | 0.830 | 0.759 loaded / 0.806 refit |
| L43 residual cosine between envs | n/a | 0.79 |
| L55 residual cosine between envs | n/a | 0.35 |
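Refitting recovers most of the signal. After refit, L55 sits 5.7 pp below its
training-environment AUROC (0.791 vs 0.848) and L43 sits 2.4 pp below (0.806 vs
0.830). This confirms that the probe direction is transferable; only the
coordinate-level weights need recalibration.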
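If you want to compare loaded and refit probes on your own held-out prompts, AUROC can be computed directly from probe scores and binary labels. The following is a minimal pairwise sketch; the scores and labels are hypothetical, and the function name `auroc` is illustrative rather than an openinterp API.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels and probe scores before and after a refit.
labels = [1, 1, 1, 0, 0, 0]
loaded_scores = [0.6, 0.4, 0.7, 0.5, 0.45, 0.65]
refit_scores = [0.9, 0.6, 0.8, 0.3, 0.7, 0.2]

loaded_auroc = auroc(loaded_scores, labels)
refit_auroc = auroc(refit_scores, labels)
print(f"loaded: {loaded_auroc:.3f}  refit: {refit_auroc:.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a few hundred prompts; a rank-based implementation would be the usual choice for larger sets.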
Quick start
```python
from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)  # model and tok: your already-loaded model and tokenizer

# If your env differs from the nb47b training env (true for most users), refit first:
prompts = [...]  # 100-300 representative prompts
labels = [...]   # binary ground-truth labels (patch_generated, has_think_v1, etc.)
guard.refit(prompts, labels)  # prints CV AUROC; weights are replaced in place

# Then use as normal:
decision = guard.assess(messages, partial_response=current_thought)
```
Paper appendix
Appendix C of paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
documents the cross-environment transfer matrix as a methodology contribution.
Links
- 🛡️ Landing: https://openinterp.org/products/agent-probe-guard
- 📦 PyPI v0.3.1: https://pypi.org/project/openinterp/0.3.1/
- 🤗 HF dataset: https://huggingface.co/datasets/caiovicentino1/agent-probe-guard-qwen36-27b
- 📜 Paper: https://github.com/OpenInterpretability/openinterp-swebench-harness/blob/main/paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md