Release v0.3.1 — agent-probe-guard refit() for cross-env transfer · OpenInterpretability/cli

Cross-environment probe transfer fix

An end-to-end Colab evaluation revealed that the probe weights shipped in
v0.3.0 are coupled to the inference environment they were captured in. On a
clean Colab session (no fla, no flash-attn, sdpa attention), loading the
original probe weights collapses AUROC because the residual at L55 has a
cosine similarity of only 0.35 with the training-environment residual. The
model and prompt are the same; only the forward-pass numerics differ.
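
As a rough illustration of the comparison described above, the following sketch computes the cosine similarity between two residual vectors captured for the same prompt in two environments. The vectors here are made-up placeholders, and `cosine_similarity` is a hypothetical helper, not part of the openinterp API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder residuals (assumed values, for illustration only).
# In practice these would be the layer-55 residual for one prompt,
# captured once in the training env and once in your own env.
resid_train_env = [0.8, -1.2, 0.5, 2.0]
resid_new_env = [0.7, -1.0, 0.9, 1.6]

sim = cosine_similarity(resid_train_env, resid_new_env)
print(f"cosine(envs) = {sim:.2f}")
```

A value near 1.0 suggests the loaded probe weights may still apply; a low value, such as the 0.35 reported for L55, suggests a refit is needed.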

What's new in v0.3.1

AgentProbeGuard.refit(prompts, labels) captures fresh activations on your
attached model and refits both top-K probes in place. It takes about 2 minutes
for N=240 prompts. Call it before assess() if your inference environment
differs from the training environment of
caiovicentino1/agent-probe-guard-qwen36-27b.
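
To make the idea of refitting concrete, here is a toy sketch in pure Python. It is not the library's implementation: it assumes a probe is a plain logistic regression over activation coordinates, uses synthetic 2-D "activations" in which the label signal sits in a different coordinate than the stale weights expect, and scores on a simple hold-out split rather than cross-validation. All names (`fit_linear_probe`, `auroc`, `score`) are hypothetical.

```python
import math
import random

def auroc(scores, labels):
    """Pairwise AUROC: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(X, w, b):
    """Linear probe logits for each activation vector in X."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]

def fit_linear_probe(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

rng = random.Random(0)
labels = [i % 2 for i in range(240)]  # N=240, balanced binary labels
# Synthetic "new environment": coordinate 1 carries the label, coordinate 0 is noise.
X_new = [[rng.gauss(0.0, 1.0), rng.gauss(3.0 * y, 1.0)] for y in labels]

X_fit, y_fit = X_new[:160], labels[:160]    # used for refitting
X_eval, y_eval = X_new[160:], labels[160:]  # held out for scoring

stale_w, stale_b = [1.0, 0.0], 0.0          # weights that read the wrong coordinate
refit_w, refit_b = fit_linear_probe(X_fit, y_fit)

stale_auroc = auroc(score(X_eval, stale_w, stale_b), y_eval)
refit_auroc = auroc(score(X_eval, refit_w, refit_b), y_eval)
print(f"stale weights AUROC: {stale_auroc:.3f}")
print(f"refit weights AUROC: {refit_auroc:.3f}")
```

The stale weights score near chance on this toy data while the refit weights separate the classes, which mirrors the loaded versus refit pattern in spirit only; real residual activations are high-dimensional and the actual refit() may differ in every detail.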

Cross-env transfer measured

Probe	nb47b env (with fla+flash)	Default Colab env (no fla, sdpa)
L55 thinking	AUROC 0.848	0.559 loaded / 0.791 refit
L43 capability	AUROC 0.830	0.759 loaded / 0.806 refit
L43 cosine(envs)	—	0.79
L55 cosine(envs)	—	0.35

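Refit recovers most of the signal: AUROC after refit is within roughly 2 to 6
points of the training-environment values (0.791 vs 0.848 at L55, 0.806 vs
0.830 at L43). This suggests the probe direction is transferable and only the
coordinate-level weights need recalibration.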
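
The size of the recovery can be checked directly from the table. The short sketch below recomputes the AUROC gap to the training environment, in percentage points, for loaded and refit weights; the numbers are copied from the table, and the dictionary layout and `drop_pp` helper are just illustrative.

```python
# AUROC values copied from the cross-env transfer table.
auroc_table = {
    "L55 thinking": {"train_env": 0.848, "loaded": 0.559, "refit": 0.791},
    "L43 capability": {"train_env": 0.830, "loaded": 0.759, "refit": 0.806},
}

def drop_pp(reference, value):
    """Gap between two AUROC values, in percentage points (1 decimal)."""
    return round((reference - value) * 100, 1)

gaps = {
    probe: {
        "loaded": drop_pp(v["train_env"], v["loaded"]),
        "refit": drop_pp(v["train_env"], v["refit"]),
    }
    for probe, v in auroc_table.items()
}

for probe, g in gaps.items():
    print(f"{probe}: loaded -{g['loaded']}pp, refit -{g['refit']}pp")
```

For L55 the gap shrinks from about 29 points with loaded weights to under 6 after refit; for L43 it shrinks from about 7 points to under 3.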

Quick start

from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
# model and tok are your already-loaded model and tokenizer
guard.attach(model, tok)

# If your env differs from nb47b training env (most users), refit first:
prompts = [...]   # 100-300 representative prompts
labels = [...]    # binary ground-truth labels (patch_generated, has_think_v1, etc.)
guard.refit(prompts, labels)  # CV AUROC printed; weights replaced in place

# Then use as normal:
decision = guard.assess(messages, partial_response=current_thought)

Paper appendix

Appendix C of paper/two_forms_epiphenomenal_probes_neurips_mi_2026.md
documents the cross-environment transfer matrix as a methodology contribution.
