10 changes: 9 additions & 1 deletion .gitignore
@@ -18,11 +18,12 @@ env/
*.swp
*.swo

# Testing
# Testing / tooling caches
.pytest_cache/
.coverage
htmlcov/
.mypy_cache/
.ruff_cache/

# Models (downloaded at runtime)
models/
@@ -39,3 +40,10 @@ Thumbs.db

# Environment
.env

# Claude Code local settings
.claude/
.odin/
scratchpad/
# Recording build artifact (GIF is committed; cast is regenerated)
assets/*.cast
64 changes: 54 additions & 10 deletions README.md
@@ -24,6 +24,25 @@ No document ingestion. No chunking. No agents. No database. Works identically on
![License MIT](https://img.shields.io/badge/license-MIT-green)
![Version](https://img.shields.io/badge/version-0.1.0-orange)

## Stop hallucinations before they cascade

In a multi-step agent, each step's output feeds the next — a single fabricated
figure propagates straight into the final answer. `verify_step()` is a circuit
breaker that halts the chain the moment a claim stops being grounded in the
evidence:

![Agent circuit-breaker demo](assets/circuit_breaker.gif)

```python
from athena_verify import verify_step

step = verify_step(claim=reasoning_step, evidence=retrieved_chunks, threshold=0.5)
if step.action == "halt":
    raise RuntimeError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")
```

Run it yourself: [`examples/agent_circuit_breaker.py`](examples/agent_circuit_breaker.py).
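
To make the circuit-breaker pattern concrete, the sketch below wires a verifier into a multi-step loop. It is an illustration only: `StubStep`, `stub_verify`, and `run_chain` are hypothetical names, the `"continue"` action value is an assumption, and the stub's word-overlap score merely stands in for the real `verify_step()` so the example runs without downloading a model. The only thing carried over from the snippet above is that a result exposes `action` and `trust_score`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StubStep:
    """Hypothetical stand-in for the object returned by verify_step()."""
    action: str          # "halt" as in the README; "continue" is an assumed value
    trust_score: float


def stub_verify(claim: str, evidence: List[str], threshold: float = 0.5) -> StubStep:
    """Toy verifier: trust is the fraction of claim words found in the evidence."""
    evidence_words = set(" ".join(evidence).lower().split())
    words = claim.lower().split()
    trust = sum(w in evidence_words for w in words) / len(words) if words else 0.0
    return StubStep("continue" if trust >= threshold else "halt", trust)


def run_chain(
    steps: List[str],
    evidence: List[str],
    verifier: Callable[..., StubStep] = stub_verify,
) -> List[str]:
    """Accept steps in order and stop at the first one the verifier halts."""
    accepted: List[str] = []
    for claim in steps:
        result = verifier(claim, evidence)
        if result.action == "halt":
            break  # circuit breaker: later steps never see the bad claim
        accepted.append(claim)
    return accepted
```

Swapping `stub_verify` for the real verifier should only require passing a callable with the same `(claim, evidence)` shape, assuming the real result object behaves as the README snippet suggests.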

## How It Works

```
@@ -81,21 +100,46 @@ pip install "athena-verify[all]"

Evaluated on 100 synthetic cases across 6 hallucination categories (legal, medical, technical, general). Real-world benchmarks against RAGTruth and HaluEval are in progress — download instructions are in [`benchmarks/RESULTS.md`](benchmarks/RESULTS.md).

### Per-Category Performance (NLI-only, synthetic, nli-deberta-v3-base)
### Hallucination Detection (NLI-only, synthetic, nli-deberta-v3-base)

Each row reports per-category precision, recall, and F1 for *catching
hallucinations*. The faithful-text row is intentionally excluded: it contains no
hallucinations, so recall (and therefore F1) is undefined. Its false-positive
rate is reported separately below, which is the number that actually matters for
clean text.

| Category | Precision | Recall | **F1** |
|----------|-----------|--------|--------|
| **Fabricated claims** | 100% | 97% | **98.6%** ✓ |
| **Fabricated claims** | 100% | 96% | **97.9%** ✓ |
| **Out-of-context** | 100% | 97% | **98.3%** ✓ |
| **Subtle contradictions** | 100% | 97% | **98.3%** ✓ |
| **Number substitutions** | 79% | 96% | **86.8%** |
| **Partial support** | 78% | 95% | **85.7%** |
| **Faithful statements** | 0% | 0% | **0.0%** ✗ |
| **Overall** | 87% | 97% | **91.3%** (synthetic) |

### Where We Lose

Athena has a **high false positive rate on truly faithful statements** (31% of genuinely faithful sentences are incorrectly flagged). This is a known NLI-model limitation — conservative thresholds bias toward catching hallucinations at the cost of flagging clean sentences.
| **Partial support** | 95% | 91% | **93.0%** |
| **Number substitutions** | 82% | 96% | **88.5%** |
| **Overall** | 95% | 96% | **95.0%** (synthetic) |

**False-positive rate on faithful text: 4.6%** (4 of 87 genuinely supported
sentences flagged) on the base model, **3.4%** on the large model — down from 17%
before calibration. Latency: **p50 22.5 ms, p95 34.5 ms** per verification on the
base model. Numbers are reproducible with `python benchmarks/run_full_eval.py`.
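
As a quick check on the arithmetic, the sketch below shows why F1 has no value on faithful-only text and how the 4-of-87 figure becomes 4.6%. The function names are illustrative and are not taken from `benchmarks/run_full_eval.py`.

```python
from typing import Optional


def f1_from_counts(tp: int, fp: int, fn: int) -> Optional[float]:
    """Return F1, or None when precision or recall would be 0/0."""
    if tp + fp == 0 or tp + fn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(flagged_faithful: int, total_faithful: int) -> float:
    """Share of genuinely supported sentences that were flagged anyway."""
    return flagged_faithful / total_faithful


# Faithful-only text has no hallucinations, so tp = fn = 0 and recall is 0/0.
print(f1_from_counts(tp=0, fp=4, fn=0))      # None
print(f"{false_positive_rate(4, 87):.1%}")   # 4.6%
```

Recomputing F1 from the rounded precision and recall shown in the table can differ from the reported F1 in the last digit, because the table is presumably derived from unrounded counts.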

### How false positives are kept low

Standalone NLI scores many faithful paraphrases as "neutral" (entailment ≈ 0)
even when the claim is fully supported. Athena recovers these without letting
hallucinations through, using three guarded signals:

- **Anaphora windowing** — a sentence opening with a referent ("This cap…", "It
also…") is scored together with its predecessor, restoring the antecedent.
- **Contradiction-aware rescue** — a not-entailed claim is only rescued when the
most on-topic context unit does *not* contradict it, so reversals and subtle
contradictions stay flagged.
- **Numeric gate** — rescue requires every number in the claim to appear in the
context, so number-substitution hallucinations ("$5M" vs a "$2M" context) are
never rescued.
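
The numeric gate in the last bullet can be approximated in a few lines. This is a minimal sketch under stated assumptions, not Athena's implementation: `numbers_grounded` is a hypothetical helper, and the regular expression only matches plain digit runs with optional `.` or `,` separators.

```python
import re

# Digit runs such as "5", "2.5" or "1,000" (assumed token shape).
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_grounded(claim: str, context: str) -> bool:
    """True when every number token in the claim also occurs in the context."""
    context_numbers = set(_NUMBER.findall(context))
    return all(num in context_numbers for num in _NUMBER.findall(claim))


context = "The agreement caps damages at $2M."
print(numbers_grounded("Damages are capped at $5M.", context))  # False
print(numbers_grounded("Damages are capped at $2M.", context))  # True
```

A claim with no numbers passes vacuously, so the gate only ever blocks a rescue. This sketch does no unit normalisation: "$2M" and "2,000,000" would be treated as different numbers.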

The remaining false positives are heavily paraphrased claims with little lexical
overlap (e.g. "olive oil is drizzled on top"); enable the optional LLM-judge
escalation (`use_llm_judge=True`) for those. Athena still biases toward catching
hallucinations over passing every clean sentence — treat it as a guardrail.
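
To show how the guards interact, here is a standalone copy of the rescue logic from `athena_verify/calibration.py` in this change, run against a few hand-picked inputs. The three `RESCUE_*` constants are the values added in this change. `SUPPORTED_THRESHOLD = 0.70` is an assumption, since its real value is not visible in the diff, and the input scores are made up for illustration.

```python
RESCUE_CONTRADICTION_CEILING = 0.45  # value from calibration.py in this change
RESCUE_CONTAINMENT_FLOOR = 0.50      # value from calibration.py in this change
RESCUE_TRUST = 0.55                  # value from calibration.py in this change
SUPPORTED_THRESHOLD = 0.70           # ASSUMPTION: real value not shown in the diff


def apply_grounding_rescue(trust, *, entailment, contradiction, containment, numeric_ok):
    """Raise trust to RESCUE_TRUST only when every guard passes."""
    if contradiction >= RESCUE_CONTRADICTION_CEILING:
        return trust  # contradicted claims are never rescued
    if entailment >= SUPPORTED_THRESHOLD:
        return trust  # already entailed, nothing to rescue
    if not numeric_ok:
        return trust  # a number in the claim is missing from the context
    if containment >= RESCUE_CONTAINMENT_FLOOR:
        return max(trust, RESCUE_TRUST)
    return trust


# Illustrative inputs: (trust, entailment, contradiction, containment, numeric_ok)
cases = {
    "faithful paraphrase": (0.10, 0.02, 0.05, 0.80, True),
    "contradiction":       (0.10, 0.02, 0.60, 0.80, True),
    "number swap":         (0.10, 0.02, 0.05, 0.80, False),
    "low lexical overlap": (0.10, 0.02, 0.05, 0.20, True),
}
for name, (trust, ent, con, cov, num_ok) in cases.items():
    rescued = apply_grounding_rescue(
        trust, entailment=ent, contradiction=con, containment=cov, numeric_ok=num_ok
    )
    print(f"{name}: {trust:.2f} -> {rescued:.2f}")
```

Only the first case is lifted to 0.55. The other three keep their original score, which matches the claim that contradictions and number swaps stay flagged.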

**LettuceDetect beats athena on span-level F1** on real-world benchmarks (LettuceDetect 79.2% F1 on annotated spans vs. athena's unvalidated real-world score). Athena wins on latency bounds, provider-neutrality, offline execution, and the spans-in-library integration story — not raw F1.

Binary file added assets/circuit_breaker.gif
25 changes: 25 additions & 0 deletions assets/circuit_breaker.tape
@@ -0,0 +1,25 @@
# VHS tape — renders the agent circuit-breaker demo to a GIF.
# Regenerate with: vhs assets/circuit_breaker.tape
Output assets/circuit_breaker.gif

Set Shell "bash"
Set FontSize 18
Set Width 1180
Set Height 760
Set Padding 24
Set Theme "Catppuccin Mocha"

# Activate the project venv off-screen so the visible command is just `python …`.
Hide
Type "source .venv/bin/activate" Enter
Type "clear" Enter
Show

Type "python examples/agent_circuit_breaker.py"
Sleep 600ms
Enter

# First call loads the NLI model (~3s), then one step streams every ~0.5s.
Sleep 8s
# Linger on the circuit-breaker result.
Sleep 2s
9 changes: 8 additions & 1 deletion athena_verify/__init__.py
@@ -26,7 +26,14 @@
    verify_stream,
)
from athena_verify.llm_judge import LLMClient
from athena_verify.models import Chunk, SentenceScore, StepResult, StreamingResult, SupportingSpan, VerificationResult
from athena_verify.models import (
    Chunk,
    SentenceScore,
    StepResult,
    StreamingResult,
    SupportingSpan,
    VerificationResult,
)

__all__ = [
    "verify",
47 changes: 47 additions & 0 deletions athena_verify/calibration.py
@@ -20,6 +20,16 @@
PARTIAL_THRESHOLD = 0.50
UNSUPPORTED_THRESHOLD = 0.30

# Grounding-rescue thresholds. Cross-encoder NLI frequently scores a faithful
# paraphrase as "neutral" (entailment ~0) even though the claim is fully
# supported. When the claim is *not* contradicted, is heavily lexically
# grounded, and all its numbers appear in the context, we lift it out of the
# unsupported band — recovering false positives without passing contradictions
# or number swaps (which fail the contradiction / numeric guards).
RESCUE_CONTRADICTION_CEILING = 0.45
RESCUE_CONTAINMENT_FLOOR = 0.50
RESCUE_TRUST = 0.55


def compute_trust_score(
    nli_score: float,
@@ -57,6 +67,43 @@ def compute_trust_score(
    return min(1.0, max(0.0, trust))


def apply_grounding_rescue(
    trust: float,
    *,
    entailment: float,
    contradiction: float,
    containment: float,
    numeric_ok: bool,
) -> float:
    """Lift trust for neutral-but-grounded paraphrases NLI scores too low.

    Only ever raises the score, and only when all guards pass:
    - the claim is not contradicted by any context unit,
    - it is not already strongly entailed (nothing to rescue),
    - its content words are heavily present in the context, and
    - every number in it appears in the context.

    Args:
        trust: The ensemble trust score before rescue.
        entailment: Max NLI entailment probability for the sentence.
        contradiction: Max NLI contradiction probability for the sentence.
        containment: Fraction of content words found in the context.
        numeric_ok: Whether all numbers in the sentence appear in the context.

    Returns:
        The (possibly raised) trust score.
    """
    if contradiction >= RESCUE_CONTRADICTION_CEILING:
        return trust
    if entailment >= SUPPORTED_THRESHOLD:
        return trust
    if not numeric_ok:
        return trust
    if containment >= RESCUE_CONTAINMENT_FLOOR:
        return max(trust, RESCUE_TRUST)
    return trust


def classify_support(trust_score: float) -> str:
    """Classify a sentence's support status based on trust score.
3 changes: 2 additions & 1 deletion athena_verify/cli.py
@@ -8,6 +8,7 @@
from pathlib import Path

from athena_verify import verify
from athena_verify.models import VerificationResult


def color_score(score: float) -> str:
@@ -30,7 +31,7 @@ def format_trust_score(score: float, width: int = 6) -> str:
    return f"{color_score(score)}{score:.2f}{reset_color()}"


def print_table(result) -> None:
def print_table(result: VerificationResult) -> None:
    """Print colored sentence-by-sentence trust score table."""
    print()
    print("Verification Results")