Fix/false positive calibration #5

Merged
RahulModugula merged 6 commits into main from fix/false-positive-calibration
Jun 28, 2026

Conversation

@RahulModugula (Owner)

No description provided.

Sentence-level NLI collapsed to ~0 on faithful answers in two common
cases: a sentence opening with an anaphor ("This cap applies...") lost
its antecedent once the answer was split, and facts spread across
several context sentences matched no single sentence-unit premise.

- Prepend the previous sentence when a hypothesis starts with a pronoun
  or discourse marker, restoring the referent before NLI scoring.
- Score each sentence against individual context sentences *and* the
  whole chunk, taking the max.
- Share one _ground_sentences helper across verify / verify_async /
  verify_batch / verify_batch_async / verify_stream, removing the old
  concatenate-all-chunks premise that silently truncated at the model's
  token limit. All entry points now return identical results and
  populate supporting_spans.
- Resolve the entailment class index from the model's id2label instead
  of hardcoding it, so non-default NLI checkpoints aren't scored on the
  wrong class.
- Make the regex sentence splitter abbreviation-aware (Dr., U.S., Inc.).

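The anaphora fix in the first bullet can be sketched as below. This is a minimal illustration, not the library's code: `ANAPHORIC_OPENERS` and `window_hypotheses` are hypothetical names, and the opener list is an assumption, since the PR does not show the set it actually uses.

```python
import re

# Assumed list of pronouns and discourse markers; the real list is not shown in this PR.
ANAPHORIC_OPENERS = {
    "this", "that", "these", "those", "it", "they", "he", "she", "its", "their",
    "however", "therefore", "moreover", "additionally", "also", "furthermore",
}


def window_hypotheses(sentences: list[str]) -> list[str]:
    """Prepend the previous sentence when a sentence opens with an anaphor."""
    windowed = []
    for i, sentence in enumerate(sentences):
        match = re.match(r"[A-Za-z']+", sentence.strip())
        first_word = match.group(0).lower() if match else ""
        if i > 0 and first_word in ANAPHORIC_OPENERS:
            # Restore the referent by scoring the two sentences together.
            windowed.append(f"{sentences[i - 1]} {sentence}")
        else:
            windowed.append(sentence)
    return windowed


answer = ["The plan has a $500 cap.", "This cap applies per claim."]
print(window_hypotheses(answer))
# ['The plan has a $500 cap.', 'The plan has a $500 cap. This cap applies per claim.']
```

The first sentence is never windowed because it has no predecessor, so an answer that opens with a pronoun is still scored on its own.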
Synthetic benchmark: faithful false-positive rate 16.9% -> 11.5%,
overall F1 91.3% -> 93.5%, recall unchanged at 96.7%.

- Annotate print_table and import VerificationResult in the CLI.
- Treat crewai as an optional dependency in the mypy config and drop a
  now-unused type-ignore.
- Sort imports (ruff I001) across the package and tests.

ruff, mypy --strict, and the full test suite (140 tests) all pass.

Regenerate the synthetic eval with the false-positive fixes and refresh
the README and RESULTS.md tables to match:

- Faithful false-positive rate: 17% -> 11.5% (base), 9.2% (large).
- Overall F1: 91.3% -> 93.6% (base), 93.8% (large).
- Replace the misleading "0% F1 on faithful" row (F1 is undefined with
  zero hallucinations) with a stated false-positive rate.

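The point about F1 on a faithful-only split can be checked with a few lines of arithmetic. The sketch below treats "hallucinated" as the positive class; the counts are made up for illustration and are not the benchmark's raw numbers.

```python
from typing import Optional


def f1_score(tp: int, fp: int, fn: int) -> Optional[float]:
    """F1 for the positive (hallucinated) class, or None when it is undefined."""
    if tp + fn == 0:
        # No positive examples exist, so recall is 0/0 and F1 has no value.
        return None
    return 2 * tp / (2 * tp + fp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of faithful items that were wrongly flagged."""
    if fp + tn == 0:
        raise ValueError("no negative examples to compute a rate over")
    return fp / (fp + tn)


# Illustrative counts only: 1000 faithful items, 115 of them flagged.
print(f1_score(tp=0, fp=115, fn=0))          # None
print(false_positive_rate(fp=115, tn=885))   # 0.115
```

Plugging the same counts into the closed form 2TP / (2TP + FP + FN) without the guard yields 0, which is how a "0% F1" row can appear even though the metric says nothing about a split with no hallucinations.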
Also tidy .gitignore (add .ruff_cache, scratchpad, local settings dirs).

Standalone NLI scores many faithful paraphrases as neutral (entailment
~0) even when fully supported, which drove the bulk of the remaining
false positives. Recover them without admitting hallucinations:

- Expose the contradiction class alongside entailment (batch_compute_nli),
  read from the model's label map so it is model-agnostic.
- For each sentence, take the contradiction of the most lexically on-topic
  context unit — not the global max — so an unrelated unit can't veto a
  faithful claim while genuine reversals still fire.
- Add lexical containment and numeric-consistency signals (overlap.py).
- apply_grounding_rescue lifts a not-entailed sentence to PARTIAL only
  when it is not contradicted, every number in it appears in the context,
  and most of its content words are grounded — gating out number swaps
  and contradictions.
- Share the logic across all five verify entry points.

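The "most lexically on-topic unit" rule from the list above can be sketched as follows. `on_topic_contradiction` and `tokens` are hypothetical helpers, and plain token overlap is an assumed stand-in for whatever lexical measure the library uses.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def on_topic_contradiction(
    hypothesis: str, units: list[str], contradiction_scores: list[float]
) -> float:
    """Return the contradiction score of the unit that overlaps the hypothesis most."""
    if not units:
        return 0.0
    hyp = tokens(hypothesis)

    def overlap(index: int) -> float:
        return len(hyp & tokens(units[index])) / max(len(hyp), 1)

    best = max(range(len(units)), key=overlap)
    return contradiction_scores[best]


units = [
    "The company reported a net margin of 22%.",
    "Weather in Paris was rainy.",
]
# Assumed NLI contradiction scores for each unit against the hypothesis.
scores = [0.02, 0.90]
print(on_topic_contradiction("The net margin was 22%.", units, scores))  # 0.02
```

Taking the global maximum would return 0.90 here and let the unrelated weather sentence veto a faithful claim; selecting by overlap first avoids that while still surfacing a high score when the on-topic unit itself disagrees.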
Synthetic benchmark (base model): faithful false-positive rate 17% ->
4.6% (3.4% on large), overall F1 91.3% -> 95.0%, recall ~96%. Adds
tests/test_rescue.py; README and RESULTS.md updated to match.

The NLI cross-encoder is the core of the library, but sentence-transformers
lived in the optional [nli] extra, so a fresh `pip install athena-verify`
followed by verify() raised ImportError — contradicting the README. Ship it
by default so the documented one-liner install works cold. The [nli] extra is
kept (now empty) for backwards compatibility.

Verified: clean build (twine check passes) and a from-scratch venv install +
examples/quickstart.py run end-to-end.

Rework examples/agent_circuit_breaker.py into a realistic 4-step financial
research agent: a hallucinated "35% net margin" (the filing says 22%) trips
the verify_step() circuit breaker at step 3, so the BUY recommendation built
on it is never produced. Silence the ML stack's load report / progress bars
and warm the model up front so the run is clean.

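The cascade-prevention idea behind the demo can be sketched without the library. Everything below is hypothetical: `run_steps`, `CircuitOpen`, and the number-matching check stand in for the real agent and for `verify_step()`, whose signature this PR does not show.

```python
import re
from collections.abc import Callable

NUMBER = re.compile(r"\d+(?:\.\d+)?")


class CircuitOpen(Exception):
    """Raised when a step's output fails verification."""

    def __init__(self, step: int, output: str) -> None:
        super().__init__(f"step {step} failed verification: {output!r}")
        self.step = step


def numbers_grounded(text: str, source: str) -> bool:
    """Toy check: every number in the text must also appear in the source."""
    return set(NUMBER.findall(text)) <= set(NUMBER.findall(source))


def run_steps(steps: list[Callable[[], str]], check: Callable[[str], bool]) -> list[str]:
    """Run steps in order and stop at the first output that fails the check."""
    outputs = []
    for number, step in enumerate(steps, start=1):
        output = step()
        if not check(output):
            raise CircuitOpen(number, output)
        outputs.append(output)
    return outputs


FILING = "Revenue was $10M. Net margin was 22%."
steps = [
    lambda: "Revenue was $10M.",
    lambda: "Net margin was 22%.",
    lambda: "Net margin was 35%, well above peers.",  # hallucinated figure
    lambda: "Recommendation: BUY.",
]
try:
    run_steps(steps, lambda out: numbers_grounded(out, FILING))
except CircuitOpen as exc:
    print(exc)  # step 3 failed verification: 'Net margin was 35%, well above peers.'
```

Because the loop raises before appending the failing output, the fourth step never runs and nothing downstream can build on the unsupported figure.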
Add assets/circuit_breaker.gif (rendered from the demo) and feature it near
the top of the README — the cascade-prevention story is the launch narrative.
Also sort imports in the LangChain example.

Copilot AI review requested due to automatic review settings June 28, 2026 03:04

Copilot AI left a comment


Pull request overview

This PR reduces false-positive “unsupported” flags by improving grounding signals beyond plain entailment: it adds contradiction-aware NLI outputs, a guarded lexical/numeric “rescue” path for neutral-but-grounded paraphrases, anaphora windowing, and more robust sentence splitting; it also updates tests, docs/benchmarks, and examples to match.

Changes:

  • Add 2D NLI scoring (entailment, contradiction) plus entailment/contradiction label-index resolution, and wire it through core verification paths.
  • Introduce grounding-rescue calibration using lexical containment + numeric consistency gates, and apply it consistently across verify entry points.
  • Update docs/benchmarks and add an “agent circuit breaker” example/demo assets; refresh tests to patch the new NLI API.

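Label-index resolution, as described in the first change, might look like the sketch below. Hugging Face model configs expose an `id2label` mapping; the helper name `resolve_label_index` and the prefix-matching rule are assumptions, and the two label maps are illustrative orderings rather than quotes from specific checkpoints.

```python
def resolve_label_index(id2label: dict[int, str], target: str) -> int:
    """Find the class index whose label starts with the target name, ignoring case."""
    wanted = target.lower()
    for index, label in id2label.items():
        if label.lower().startswith(wanted):
            return int(index)
    raise ValueError(f"no {target!r} label in {sorted(id2label.values())}")


# Two illustrative label orderings; a hardcoded index would be wrong for one of them.
map_a = {0: "contradiction", 1: "entailment", 2: "neutral"}
map_b = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

print(resolve_label_index(map_a, "entailment"))     # 1
print(resolve_label_index(map_b, "entailment"))     # 2
print(resolve_label_index(map_b, "contradiction"))  # 0
```

Raising on a missing label, rather than falling back to a default index, makes a checkpoint with generic labels such as `LABEL_0` fail loudly instead of being scored on the wrong class.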
Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `athena_verify/core.py` | Refactors grounding to use per-unit + whole-chunk premises, anaphora windowing, contradiction signal, supporting spans, and rescue-aware trust/status. |
| `athena_verify/nli.py` | Adds label-map–based entailment/contradiction index resolution and introduces `batch_compute_nli()` returning (entail, contra). |
| `athena_verify/calibration.py` | Adds rescue thresholds and `apply_grounding_rescue()` to lift neutral-but-grounded sentences. |
| `athena_verify/overlap.py` | Adds `containment_score()` and `numeric_consistency()` used by the rescue path. |
| `athena_verify/parser.py` | Improves regex fallback sentence splitter with abbreviation awareness. |
| `athena_verify/cli.py` | Adds type annotation for `print_table` argument. |
| `athena_verify/__init__.py` | Re-formats imports for readability/consistency. |
| `athena_verify/integrations/langgraph.py` | Switches `Callable` import to `collections.abc`. |
| `athena_verify/integrations/crewai.py` | Tweaks typing ignores / fallback behavior for optional dependency import. |
| `tests/test_verify.py` | Updates autouse NLI mocking and tightens the latency-budget test to avoid unpatched calls. |
| `tests/test_new_features.py` | Updates autouse NLI mocking for the new `batch_compute_nli` shape. |
| `tests/test_supporting_spans.py` | Updates span tests to patch `batch_compute_nli` with (entail, contra) tuples. |
| `tests/test_rescue.py` | Adds new unit tests for containment/numeric gate and rescue behavior. |
| `tests/test_nli.py` | Updates model-cache fixture to work with `@lru_cache`’d loaders/indexes. |
| `README.md` | Adds circuit-breaker section and updates performance/false-positive claims and explanation. |
| `benchmarks/RESULTS.md` | Updates benchmark date, metrics, and methodology description to match new grounding logic. |
| `examples/agent_circuit_breaker.py` | Expands the circuit-breaker demo and silences ML stack output for cleaner UX. |
| `examples/langchain_example.py` | Reorders imports. |
| `assets/circuit_breaker.tape` | Adds VHS tape script to regenerate the demo GIF. |
| `.gitignore` | Ignores additional tooling caches and demo recording artifacts/settings dirs. |
| `pyproject.toml` | Promotes sentence-transformers to a core dependency and keeps [nli] extra for backwards compatibility. |
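For the abbreviation-aware fallback splitter noted for `athena_verify/parser.py`, one possible approach is to split on terminal punctuation and then re-join pieces that end in a known abbreviation. The sketch below is an assumption about the technique, not the library's parser; `ABBREVIATIONS` and `split_sentences` are hypothetical names.

```python
import re

# Assumed abbreviation list; the PR only names Dr., U.S., and Inc. as examples.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "inc.", "ltd.", "u.s.", "e.g.", "i.e."}


def split_sentences(text: str) -> list[str]:
    """Split on ., !, or ? followed by whitespace, except after a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            # The split point was an abbreviation, so keep accumulating.
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "Dr. Smith works at Acme Inc. in the U.S. today. He founded it in 2010."
print(split_sentences(text))
# ['Dr. Smith works at Acme Inc. in the U.S. today.', 'He founded it in 2010.']
```

A known limitation of this approach is that a sentence which genuinely ends with an abbreviation is merged with the one that follows it.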


Comment thread athena_verify/core.py
Comment on lines +175 to +176
nli_pairs = [(unit, hyp) for hyp in hypotheses for unit in units]
flat = batch_compute_nli(nli_pairs, model_name=nli_model)
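The pair list in this excerpt is hypothesis-major: all units for the first hypothesis come first, then all units for the second, and so on. A sketch of regrouping the flat result and taking the per-hypothesis maximum follows. It assumes `flat` is a list of `(entail, contra)` tuples in pair order, which matches the review summary's description of `batch_compute_nli()` but is not confirmed by the excerpt; `max_entailment_per_hypothesis` is a hypothetical helper.

```python
def max_entailment_per_hypothesis(
    flat: list[tuple[float, float]], n_hypotheses: int, n_units: int
) -> list[float]:
    """Regroup hypothesis-major scores and keep the best entailment per hypothesis."""
    if len(flat) != n_hypotheses * n_units:
        raise ValueError("flat scores do not match the hypothesis x unit grid")
    best = []
    for h in range(n_hypotheses):
        row = flat[h * n_units:(h + 1) * n_units]
        best.append(max((entail for entail, _ in row), default=0.0))
    return best


# Two hypotheses scored against two units each (made-up scores).
flat = [(0.1, 0.0), (0.9, 0.0), (0.2, 0.5), (0.3, 0.1)]
print(max_entailment_per_hypothesis(flat, n_hypotheses=2, n_units=2))  # [0.9, 0.3]
```

If the whole chunk is included as one extra unit, the same maximum covers both the per-sentence and the whole-chunk premise.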
Comment on lines +80 to +84
Only ever raises the score, and only when all guards pass:
- the claim is not contradicted by any context unit,
- it is not already strongly entailed (nothing to rescue),
- its content words are heavily present in the context, and
- every number in it appears in the context.
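The four guards listed in this docstring can be sketched as a single predicate. The thresholds, the stopword list, and the names `containment`, `numbers_consistent`, and `should_rescue` are all assumptions for illustration; the library's `apply_grounding_rescue()` and its actual thresholds are not shown in this excerpt.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
# Small assumed stopword list, used only to pick out content words.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "was", "is", "and"}


def containment(hypothesis: str, context: str) -> float:
    """Fraction of the hypothesis's content words that appear in the context."""
    words = {w for w in re.findall(r"[a-z]+", hypothesis.lower()) if w not in STOPWORDS}
    if not words:
        return 0.0
    return len(words & set(re.findall(r"[a-z]+", context.lower()))) / len(words)


def numbers_consistent(hypothesis: str, context: str) -> bool:
    """True when every number in the hypothesis also appears in the context."""
    return set(NUMBER.findall(hypothesis)) <= set(NUMBER.findall(context))


def should_rescue(
    hypothesis: str,
    context: str,
    entailment: float,
    contradiction: float,
    *,
    entail_floor: float = 0.5,      # assumed threshold
    contra_ceiling: float = 0.3,    # assumed threshold
    min_containment: float = 0.7,   # assumed threshold
) -> bool:
    if entailment >= entail_floor:
        return False  # already entailed, nothing to rescue
    if contradiction >= contra_ceiling:
        return False  # contradicted claims are never lifted
    if not numbers_consistent(hypothesis, context):
        return False  # gates out number swaps
    return containment(hypothesis, context) >= min_containment


context = "The filing reports a net margin of 22% for fiscal 2023."
print(should_rescue("Net margin was 22% in fiscal 2023.", context, 0.05, 0.02))  # True
print(should_rescue("Net margin was 35% in fiscal 2023.", context, 0.05, 0.02))  # False
```

Each guard can only block a rescue, so the predicate never lowers a score that NLI already supports.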
Comment thread tests/test_nli.py
Comment on lines +75 to +80
get_nli_model and entailment_index are both @lru_cache'd, so clear them
around the patch to keep tests isolated.
"""
nli_module.get_nli_model.cache_clear()
nli_module.entailment_index.cache_clear()
models: dict[str, object] = {}
Comment thread tests/test_nli.py
Comment on lines +88 to +89
nli_module.get_nli_model.cache_clear()
nli_module.entailment_index.cache_clear()
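The clear-before and clear-after pattern in these fixture excerpts can be reproduced in isolation. In the sketch below, `load_model` is a stand-in for an `@lru_cache`'d loader and `isolated_cache` is a hypothetical helper; only `functools.lru_cache` and its `cache_clear()` / `cache_info()` methods are real APIs.

```python
from contextlib import contextmanager
from functools import lru_cache


@lru_cache(maxsize=None)
def load_model(name: str) -> object:
    """Stand-in for an expensive, cached model loader."""
    return object()


@contextmanager
def isolated_cache(*cached_functions):
    """Clear lru_cache state before and after the block so tests do not leak into each other."""
    for fn in cached_functions:
        fn.cache_clear()
    try:
        yield
    finally:
        for fn in cached_functions:
            fn.cache_clear()


stale = load_model("demo")            # cached by an earlier test
with isolated_cache(load_model):
    fresh = load_model("demo")        # a patched loader would be called here
    print(fresh is stale)             # False
print(load_model.cache_info().currsize)  # 0
```

Without the first clear, a value cached by an earlier test would be returned and the patched loader would never run; without the second, the patched value would leak into later tests.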
Comment thread tests/test_rescue.py
Comment on lines +1 to +3
"""Tests for the grounding-rescue path: containment, numeric gate, and the
contradiction-vetoed rescue that recovers faithful paraphrases NLI scores low.
"""
@RahulModugula merged commit a32f6f5 into main on Jun 28, 2026
4 checks passed