RahulModugula · Jun 28, 2026 · Jun 27, 2026
diff --git a/.gitignore b/.gitignore
@@ -18,11 +18,12 @@ env/
 *.swp
 *.swo
 
-# Testing
+# Testing / tooling caches
 .pytest_cache/
 .coverage
 htmlcov/
 .mypy_cache/
+.ruff_cache/
 
 # Models (downloaded at runtime)
 models/
@@ -39,3 +40,10 @@ Thumbs.db
 
 # Environment
 .env
+
+# Local tool settings and scratch files (Claude Code and others)
+.claude/
+.odin/
+scratchpad/
+# Recording build artifact (GIF is committed; cast is regenerated)
+assets/*.cast
diff --git a/README.md b/README.md
@@ -24,6 +24,25 @@ No document ingestion. No chunking. No agents. No database. Works identically on
 ![License MIT](https://img.shields.io/badge/license-MIT-green)
 ![Version](https://img.shields.io/badge/version-0.1.0-orange)
 
+## Stop hallucinations before they cascade
+
+In a multi-step agent, each step's output feeds the next — a single fabricated
+figure propagates straight into the final answer. `verify_step()` is a circuit
+breaker that halts the chain the moment a claim stops being grounded in the
+evidence:
+
+![Agent circuit-breaker demo](assets/circuit_breaker.gif)
+
+```python
+from athena_verify import verify_step
+
+step = verify_step(claim=reasoning_step, evidence=retrieved_chunks, threshold=0.5)
+if step.action == "halt":
+    raise RuntimeError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")
+```
+
+Run it yourself: [`examples/agent_circuit_breaker.py`](examples/agent_circuit_breaker.py).
+
 ## How It Works
 
 ```
@@ -81,21 +100,46 @@ pip install "athena-verify[all]"
 
 Evaluated on 100 synthetic cases across 6 hallucination categories (legal, medical, technical, general). Real-world benchmarks against RAGTruth and HaluEval are in progress — download instructions are in [`benchmarks/RESULTS.md`](benchmarks/RESULTS.md).
 
-### Per-Category Performance (NLI-only, synthetic, nli-deberta-v3-base)
+### Hallucination Detection (NLI-only, synthetic, nli-deberta-v3-base)
+
+Each row is the per-category F1 for *catching hallucinations*. The faithful-text
+row is intentionally excluded here — it contains no hallucinations, so its F1 is
+undefined; we report its false-positive rate separately below, which is the
+number that actually matters for clean text.
 
 | Category | Precision | Recall | **F1** |
 |----------|-----------|--------|--------|
-| **Fabricated claims** | 100% | 97% | **98.6%** ✓ |
+| **Fabricated claims** | 100% | 96% | **97.9%** ✓ |
 | **Out-of-context** | 100% | 97% | **98.3%** ✓ |
 | **Subtle contradictions** | 100% | 97% | **98.3%** ✓ |
-| **Number substitutions** | 79% | 96% | **86.8%** |
-| **Partial support** | 78% | 95% | **85.7%** |
-| **Faithful statements** | 0% | 0% | **0.0%** ✗ |
-| **Overall** | 87% | 97% | **91.3%** (synthetic) |
-
-### Where We Lose
-
-Athena has a **high false positive rate on truly faithful statements** (31% of genuinely faithful sentences are incorrectly flagged). This is a known NLI-model limitation — conservative thresholds bias toward catching hallucinations at the cost of flagging clean sentences.
+| **Partial support** | 95% | 91% | **93.0%** |
+| **Number substitutions** | 82% | 96% | **88.5%** |
+| **Overall** | 95% | 96% | **95.0%** (synthetic) |
+
+**False-positive rate on faithful text: 4.6%** (4 of 87 genuinely-supported
+sentences flagged) on the base model, **3.4%** on the large model — down from 17%
+before calibration. Latency: **p50 22.5 ms, p95 34.5 ms** per verification on the
+base model. Numbers are reproducible with `python benchmarks/run_full_eval.py`.
+
+### How false positives are kept low
+
+Standalone NLI scores many faithful paraphrases as "neutral" (entailment ≈ 0)
+even when the claim is fully supported. Athena recovers these without letting
+hallucinations through, using three guarded signals:
+
+- **Anaphora windowing** — a sentence opening with a referent ("This cap…", "It
+  also…") is scored together with its predecessor, restoring the antecedent.
+- **Contradiction-aware rescue** — a not-entailed claim is only rescued when the
+  most on-topic context unit does *not* contradict it, so reversals and subtle
+  contradictions stay flagged.
+- **Numeric gate** — rescue requires every number in the claim to appear in the
+  context, so number-substitution hallucinations ("$5M" vs a "$2M" context) are
+  never rescued.
+
+The remaining false positives are heavily-paraphrased claims with little lexical
+overlap (e.g. "olive oil is drizzled on top"); enable the optional LLM-judge
+escalation (`use_llm_judge=True`) for those. Athena still biases toward catching
+hallucinations over passing every clean sentence — treat it as a guardrail.
 
 **LettuceDetect beats athena on span-level F1** on real-world benchmarks (LettuceDetect 79.2% F1 on annotated spans vs. athena's unvalidated real-world score). Athena wins on latency bounds, provider-neutrality, offline execution, and the spans-in-library integration story — not raw F1.
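The numeric gate described under "How false positives are kept low" can be sketched as a set-containment check. This is a minimal illustration under stated assumptions, not Athena's implementation: `extract_numbers` and `numeric_ok` are hypothetical helpers, and the regex-based number extraction is an assumption about how numbers might be matched.

```python
import re

# Hypothetical pattern: digits with optional thousands separators and decimals.
_NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return the numeric tokens in text, with thousands separators removed."""
    return {m.group().replace(",", "") for m in _NUMBER_RE.finditer(text)}


def numeric_ok(claim: str, context: str) -> bool:
    """True when every number in the claim also appears in the context."""
    return extract_numbers(claim) <= extract_numbers(context)


context = "The contract caps liability at $2M for the first 12 months."
print(numeric_ok("Liability is capped at $2M.", context))  # True
print(numeric_ok("Liability is capped at $5M.", context))  # False
```

A claim with no numbers passes the gate trivially, so in this sketch the gate only ever blocks a rescue and never triggers one by itself.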
 

diff --git a/assets/circuit_breaker.gif b/assets/circuit_breaker.gif
diff --git a/assets/circuit_breaker.tape b/assets/circuit_breaker.tape
@@ -0,0 +1,25 @@
+# VHS tape — renders the agent circuit-breaker demo to a GIF.
+# Regenerate with:  vhs assets/circuit_breaker.tape
+Output assets/circuit_breaker.gif
+
+Set Shell "bash"
+Set FontSize 18
+Set Width 1180
+Set Height 760
+Set Padding 24
+Set Theme "Catppuccin Mocha"
+
+# Activate the project venv off-screen so the visible command is just `python …`.
+Hide
+Type "source .venv/bin/activate" Enter
+Type "clear" Enter
+Show
+
+Type "python examples/agent_circuit_breaker.py"
+Sleep 600ms
+Enter
+
+# First call loads the NLI model (~3s), then one step streams every ~0.5s.
+Sleep 8s
+# Linger on the circuit-breaker result.
+Sleep 2s
diff --git a/athena_verify/__init__.py b/athena_verify/__init__.py
@@ -26,7 +26,14 @@
     verify_stream,
 )
 from athena_verify.llm_judge import LLMClient
-from athena_verify.models import Chunk, SentenceScore, StepResult, StreamingResult, SupportingSpan, VerificationResult
+from athena_verify.models import (
+    Chunk,
+    SentenceScore,
+    StepResult,
+    StreamingResult,
+    SupportingSpan,
+    VerificationResult,
+)
 
 __all__ = [
     "verify",

diff --git a/athena_verify/calibration.py b/athena_verify/calibration.py
@@ -20,6 +20,16 @@
 PARTIAL_THRESHOLD = 0.50
 UNSUPPORTED_THRESHOLD = 0.30
 
+# Grounding-rescue thresholds. Cross-encoder NLI frequently scores a faithful
+# paraphrase as "neutral" (entailment ~0) even though the claim is fully
+# supported. When the claim is *not* contradicted, is heavily lexically
+# grounded, and all its numbers appear in the context, we lift it out of the
+# unsupported band — recovering false positives without passing contradictions
+# or number swaps (which fail the contradiction / numeric guards).
+RESCUE_CONTRADICTION_CEILING = 0.45
+RESCUE_CONTAINMENT_FLOOR = 0.50
+RESCUE_TRUST = 0.55
+
 
 def compute_trust_score(
     nli_score: float,
@@ -57,6 +67,43 @@ def compute_trust_score(
     return min(1.0, max(0.0, trust))
 
 
+def apply_grounding_rescue(
+    trust: float,
+    *,
+    entailment: float,
+    contradiction: float,
+    containment: float,
+    numeric_ok: bool,
+) -> float:
+    """Lift trust for neutral-but-grounded paraphrases that NLI scores too low.
+
+    Only ever raises the score, and only when all guards pass:
+      - the claim is not contradicted by any context unit,
+      - it is not already strongly entailed (nothing to rescue),
+      - its content words are heavily present in the context, and
+      - every number in it appears in the context.
+
+    Args:
+        trust: The ensemble trust score before rescue.
+        entailment: Max NLI entailment probability for the sentence.
+        contradiction: Max NLI contradiction probability for the sentence.
+        containment: Fraction of content words found in the context.
+        numeric_ok: Whether all numbers in the sentence appear in the context.
+
+    Returns:
+        The (possibly raised) trust score.
+    """
+    if contradiction >= RESCUE_CONTRADICTION_CEILING:
+        return trust
+    if entailment >= SUPPORTED_THRESHOLD:
+        return trust
+    if not numeric_ok:
+        return trust
+    if containment >= RESCUE_CONTAINMENT_FLOOR:
+        return max(trust, RESCUE_TRUST)
+    return trust
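
A minimal standalone sketch of how the guards above interact, for trying the rescue logic outside the package. The three `RESCUE_*` constants are copied from this diff; `SUPPORTED_THRESHOLD = 0.70` is an assumed value, since its definition is not shown here, and the input scores are made up for illustration.

```python
# Constants copied from the diff above.
RESCUE_CONTRADICTION_CEILING = 0.45
RESCUE_CONTAINMENT_FLOOR = 0.50
RESCUE_TRUST = 0.55
# Assumption: the real value lives in calibration.py and is not shown in this diff.
SUPPORTED_THRESHOLD = 0.70


def apply_grounding_rescue(trust, *, entailment, contradiction, containment, numeric_ok):
    """Same guards as the function in the diff, condensed into one check."""
    rescuable = (
        contradiction < RESCUE_CONTRADICTION_CEILING
        and entailment < SUPPORTED_THRESHOLD
        and numeric_ok
        and containment >= RESCUE_CONTAINMENT_FLOOR
    )
    return max(trust, RESCUE_TRUST) if rescuable else trust


# Illustrative (made-up) scores for a faithful paraphrase that NLI rated neutral.
paraphrase = dict(entailment=0.05, contradiction=0.10, containment=0.80, numeric_ok=True)

rescued = apply_grounding_rescue(0.20, **paraphrase)
number_swap = apply_grounding_rescue(0.20, **{**paraphrase, "numeric_ok": False})
contradicted = apply_grounding_rescue(0.20, **{**paraphrase, "contradiction": 0.60})
already_high = apply_grounding_rescue(0.90, **paraphrase)

print(rescued, number_swap, contradicted, already_high)  # 0.55 0.2 0.2 0.9
```

The rescued score lands at 0.55, just above the 0.50 `PARTIAL_THRESHOLD` shown in this diff, so a rescued sentence presumably leaves the unsupported band without being promoted to fully supported. A score that is already higher is left untouched because the function only ever takes the maximum.
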
+
+
 def classify_support(trust_score: float) -> str:
     """Classify a sentence's support status based on trust score.
 

diff --git a/athena_verify/cli.py b/athena_verify/cli.py
@@ -8,6 +8,7 @@
 from pathlib import Path
 
 from athena_verify import verify
+from athena_verify.models import VerificationResult
 
 
 def color_score(score: float) -> str:
@@ -30,7 +31,7 @@ def format_trust_score(score: float, width: int = 6) -> str:
     return f"{color_score(score)}{score:.2f}{reset_color()}"
 
 
-def print_table(result) -> None:
+def print_table(result: VerificationResult) -> None:
     """Print colored sentence-by-sentence trust score table."""
     print()
     print("Verification Results")