VibeCodingScientist/pCite_release

pCite

License: MIT · Python 3.12+

Physical measurement weighting for scientific claim retrieval in the age of AI-generated science.

AI systems generate scientific claims at near-zero cost. Existing citation metrics assign equal weight to instrument-measured results and text-derived assertions. pCite addresses this by weighting every citation by the validation class of its source claim, with a 1,000-fold gap between PhysicalMeasurement (10.0) and TextDerived (0.01).


Results

Both corpora are evaluated against the same ground truth (_VALIDATED = {PhysicalMeasurement, ClinicalObservation, Replicated}). The traditional baseline ranks claims by intra-corpus replication count.

Corpus 1 — MetaboLights metabolomics (8,761 claims from 1,994 papers, 62.4% Physical-tier)

Mann-Whitney test on base_weight: p < 2.2e-16 (validated median 10.0 vs 0.5).

Metric        pCite  Traditional  Lift
Precision@50  0.94   0.50         1.88x
NDCG@50       0.94   0.60
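
As a minimal sketch of how the two ranking metrics can be computed against the ground truth defined above, the snippet below scores a ranked list of validation classes. It assumes binary relevance (a claim is relevant if its class is in the validated set) and the standard log2 discount for NDCG. The function names and the example ranking are illustrative and not taken from evaluate.py.

```python
import math

# Ground-truth classes, as listed in the Results section.
VALIDATED = {"PhysicalMeasurement", "ClinicalObservation", "Replicated"}


def precision_at_k(ranked_classes, k):
    """Fraction of the top-k positions occupied by validated claims."""
    if k <= 0:
        return 0.0
    return sum(c in VALIDATED for c in ranked_classes[:k]) / k


def ndcg_at_k(ranked_classes, k):
    """Binary-relevance NDCG@k: DCG of this ranking over the ideal DCG."""
    gains = [1.0 if c in VALIDATED else 0.0 for c in ranked_classes]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking, highest score first.
ranking = ["PhysicalMeasurement", "TextDerived", "Replicated", "DatabaseReferenced"]
print(precision_at_k(ranking, 2))
print(round(ndcg_at_k(ranking, 4), 3))
```

The same two functions can score both the pCite ranking and the traditional ranking, so any lift comes from the ordering alone.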

Corpus 2 — PRIDE cancer proteomics (3,506 claims from 730 papers, 32% Physical-tier)

Four-way comparison with graded deposit quality and BioLORD-2023 embedding baseline:

Metric   pCite (graded)  pCite (flat)  Traditional  Embedding
P@50     1.00            1.00          0.98         0.46
NDCG@50  1.00            1.00          0.98         0.42

Coverage threshold analysis — when does pCite outperform traditional?

Subsampling MetaboLights at decreasing Physical-tier coverage (5 seeds, ±1 std):

Coverage  P@50 pCite  P@50 Traditional
5%        0.28        0.50
10%       0.42        0.50
15%       0.61        0.51
20%       0.67        0.52
30%       0.80        0.54
62.4%     0.94        0.60

Crossover threshold is ~12% Physical-tier coverage. Below that, pCite has insufficient physical signal and traditional wins. Script: boundary_investigation.py; figure: figures/fig_coverage_threshold.pdf.
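
The ~12% figure can be checked from the table alone. The sketch below linearly interpolates the P@50 difference between adjacent coverage levels and reports where pCite first overtakes the traditional baseline. Linear interpolation is an assumption made here for illustration; boundary_investigation.py may locate the threshold differently.

```python
# (coverage %, pCite P@50, traditional P@50), copied from the table above.
ROWS = [
    (5.0, 0.28, 0.50),
    (10.0, 0.42, 0.50),
    (15.0, 0.61, 0.51),
    (20.0, 0.67, 0.52),
    (30.0, 0.80, 0.54),
    (62.4, 0.94, 0.60),
]


def crossover(rows):
    """Coverage at which the pCite minus traditional gap changes sign."""
    for (c0, p0, t0), (c1, p1, t1) in zip(rows, rows[1:]):
        d0, d1 = p0 - t0, p1 - t1
        if d0 < 0 <= d1:
            # Linear interpolation of the gap between the two rows.
            return c0 + (c1 - c0) * (-d0) / (d1 - d0)
    return None  # no sign change in the sampled range


print(round(crossover(ROWS), 1))  # 12.2
```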

Negative control — corpus with 0 Physical-tier claims

Metric        pCite  Traditional
Precision@50  0.02   0.14
NDCG@50       0.27   0.97

pCite loses predictably when no physically grounded claims exist: the 1,000-fold weight gap has nothing to act on. The negative control thus confirms the mechanism. Raw data in data/negative-control/.

Sensitivity analysis — weight ratio robustness

Ratio   Precision@50
1:1     0.50
2:1     0.60
5:1     0.94
10:1    1.00
50:1    1.00
100:1   1.00
500:1   1.00
1000:1  1.00

Precision@50 lift saturates at ratio ≥ 10:1 and is insensitive to the exact production value (1000:1). Script: sensitivity_analysis.py; data: data/sensitivity/.

Robustness analyses

Bootstrap adversarial robustness — targeting only validated claims and reassigning exclusively to non-validated classes (worst-case attack):

Validated claims downgraded  Remaining  P@50
16%                          4,616      1.00
50%                          2,748      1.00
90%                          550        1.00
95%                          275        1.00
99%                          55         1.00
99.5%                        28         0.56
99.9%                        6          0.12

pCite survives adversarial misclassification of 99% of validated claims: the 10.0 vs 0.5 weight gap means any surviving Physical claims dominate the ranking. P@50 only drops once fewer than 50 validated claims remain, since 28 survivors cap it at 28/50 = 0.56 and 6 survivors at 6/50 = 0.12, so the breaking point requires destroying >99% of validated labels. Script: bootstrap_sweep.py; figure: figures/bootstrap_breaking_point.png.
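
The last two rows of the table sit exactly at the arithmetic ceiling for P@50. The sketch below recomputes the Remaining column and that ceiling from the 5,495 validated claims reported for the MetaboLights corpus. It assumes the number of downgraded claims is rounded down, which reproduces the table; the function name is illustrative and not part of bootstrap_sweep.py.

```python
def p_at_k_ceiling(validated_total, downgraded_fraction, k=50):
    """Return (remaining validated claims, best achievable P@k).

    Assumes the downgraded count is rounded down (an assumption that
    matches the Remaining column of the table above).
    """
    downgraded = int(validated_total * downgraded_fraction)
    remaining = validated_total - downgraded
    return remaining, min(remaining, k) / k


# 5,495 = validated claims in the MetaboLights corpus.
for fraction in (0.16, 0.50, 0.90, 0.95, 0.99, 0.995, 0.999):
    remaining, ceiling = p_at_k_ceiling(5495, fraction)
    print(f"{fraction:.1%} downgraded: {remaining} remaining, P@50 <= {ceiling:.2f}")
```

A ranking can only fill the top 50 with validated claims while at least 50 survive, which is why the reported P@50 stays at 1.00 through 99% and then falls.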

Study deduplication — after deduplicating by source DOI (keeping only the highest-scoring claim per paper):

Metric                      pCite  Traditional  Embedding
Deduplicated P@50           0.86   0.44         0.72
Unique DOIs in top-50       36     86           42
Max claims from single DOI  7      8            3

pCite's advantage persists after removing within-study clustering. Script: study_dedup_analysis.py; data: data/study_dedup.json.
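
The deduplication step described above (keep only the highest-scoring claim per source DOI, then re-rank) could look roughly like the sketch below. The field names doi and score and the example DOIs are hypothetical; study_dedup_analysis.py may use a different schema.

```python
def dedup_by_doi(claims):
    """Keep the highest-scoring claim per DOI, returned best-first."""
    best = {}
    for claim in claims:
        doi = claim["doi"]
        if doi not in best or claim["score"] > best[doi]["score"]:
            best[doi] = claim
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)


# Hypothetical scored claims; two come from the same paper.
claims = [
    {"doi": "10.1000/a", "score": 7.5},
    {"doi": "10.1000/b", "score": 8.0},
    {"doi": "10.1000/a", "score": 9.0},
]
deduped = dedup_by_doi(claims)
print([(c["doi"], c["score"]) for c in deduped])
```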

Contradicts weight ablation — varying the contradicts edge penalty:

Weight            P@50
-1.0 (amplified)  0.92
-0.5 (default)    0.94
0.0 (disabled)    0.94

P@50 varies by at most 0.02 (0.92 to 0.94) across all tested contradicts weights. Script: contradicts_ablation.py; data: data/contradicts_ablation.json.

Time-decay cold-start analysis — adding recency bonus base_weight × exp(-λ × age):

λ               P@50  NDCG@50
0.0 (baseline)  0.94  0.94
0.5             0.94  0.94
1.0             0.94  0.94
2.0             0.94  0.94

Time-decay bonus is negligible relative to pcite_score gaps at the top of the ranking. 8,724/8,761 claims resolved to publication year via CrossRef. Script: time_decay_analysis.py; data: data/time_decay.json.
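
A minimal sketch of the recency bonus, assuming age is measured in years since publication and that the bonus is added to pcite_score (both are assumptions; the source gives only the formula base_weight × exp(-λ × age)). The function names are illustrative.

```python
import math


def time_decay_bonus(base_weight, age_years, lam):
    """Recency bonus base_weight * exp(-lam * age), per the formula above."""
    return base_weight * math.exp(-lam * age_years)


def adjusted_score(pcite_score, base_weight, age_years, lam):
    """pcite_score plus the recency bonus (additive combination is assumed)."""
    return pcite_score + time_decay_bonus(base_weight, age_years, lam)


# A claim with base_weight 10.0, published 0, 1 and 5 years ago, lam = 1.0.
for age in (0, 1, 5):
    print(age, round(time_decay_bonus(10.0, age, 1.0), 3))
```

Because exp(-λ × age) never exceeds 1 for non-negative age, the bonus is bounded by the claim's own base_weight, which keeps it small next to the summed edge weights of highly cited claims.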

Strict physical verification — fuzzy-matching claim entities against deposited metabolites from MetaboLights FTP:

Metric                      Lenient  Strict
Validated claims            5,495    1,592
Physical claims downgraded           3,903 (71.4%)
Coverage                    62.4%    18.2%
P@50                        1.00     1.00

Even after downgrading 71.4% of Physical claims that could not be entity-matched to deposited metabolites, P@50 remains perfect. Script: strict_physical.py; data: data/strict_physical.json.
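
One way to fuzzy-match a claim entity against a study's deposited metabolite names is sketched below with the standard library's difflib. The normalisation step and the 0.85 threshold are hypothetical choices for illustration; strict_physical.py may use a different matcher and cutoff.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation and whitespace before comparing."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def matches_deposit(entity, deposited_names, threshold=0.85):
    """True if the entity is similar enough to any deposited metabolite name.

    threshold is a hypothetical cutoff on SequenceMatcher.ratio().
    """
    target = normalize(entity)
    return any(
        SequenceMatcher(None, target, normalize(name)).ratio() >= threshold
        for name in deposited_names
    )


# Hypothetical deposit list for one study.
deposited = ["l glutamine", "citrate"]
print(matches_deposit("L-Glutamine", deposited))
print(matches_deposit("tryptophan", deposited))
```

Under the strict setting, a Physical claim whose entity fails this kind of check would be downgraded rather than counted as validated.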

Cross-model extraction concordance

200 stratified abstracts (100 with MetaboLights deposit, 100 without). Same prompt template and predicate vocabulary for all models. v2 adds a tightened preamble (strict predicate rules, completeness, granularity).

Metric                        v1 (Claude vs GPT-4o)  v2 (Claude vs o4-mini)
Claude claims/abstract        6.99                   4.88
Other claims/abstract         2.86                   0.84
Claude→other concordance      11%                    7%
Other→Claude concordance      28%                    38%
Predicate agreement           67%                    67%
Spearman ρ (claim counts)     0.4633                 0.1923
Parse failures (other model)  42                     0

Predicate agreement is stable at 67% across both model pairs. The tightened prompt eliminated all parse failures in v2. Claude extracts more claims per abstract than both comparison models; when the other model does extract a claim, 28–38% match Claude's output. Scripts: cross_model_extraction.py (v1), cross_model_extraction_v2.py (v2); data: data/concordance_results*.csv.
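
The two concordance rows are directional, which is why they differ when one model extracts many more claims than the other. The sketch below illustrates this with a hypothetical matching rule (case-insensitive equality of the subject and object, ignoring the predicate). The actual matching criterion in the concordance scripts is not stated here and may differ, and the example triples are invented.

```python
def directional_concordance(source_claims, target_claims):
    """Share of source claims whose (subject, object) pair appears in target."""
    if not source_claims:
        return 0.0
    target_pairs = {(s.lower(), o.lower()) for s, _, o in target_claims}
    hits = sum((s.lower(), o.lower()) in target_pairs for s, _, o in source_claims)
    return hits / len(source_claims)


# Hypothetical extractions from one abstract: (subject, predicate, object).
model_a = [
    ("glucose", "increases", "insulin"),
    ("lactate", "correlates_with", "hypoxia"),
    ("citrate", "decreases", "tumour growth"),
    ("alanine", "is_biomarker_for", "sepsis"),
]
model_b = [("Glucose", "modulates", "Insulin")]

a_to_b = directional_concordance(model_a, model_b)  # 1 of 4 claims matched
b_to_a = directional_concordance(model_b, model_a)  # 1 of 1 claims matched
print(a_to_b, b_to_a)
```

The model that extracts more claims necessarily has the lower outgoing concordance, since most of its claims have no counterpart to match.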


How it works

Hypothesis: Weighting citations by the physical grounding of the source claim surfaces validated scientific claims more accurately than traditional citation count.

Validation classes and weights:

Class                Weight  Definition
PhysicalMeasurement  10.0    Raw instrument data in a public repository (MetaboLights, PDB, PRIDE)
ClinicalObservation  4.0     EHR-verified patient data, IRB-approved trial outcomes
Replicated           2.0     Same assertion confirmed in 3+ independent sources
DatabaseReferenced   0.5     Structured database deposit, no raw data
TextDerived          0.01    Synthesised from literature text
Hypothesis           0.0     Proposed, untested

Scoring formula:

base_weight  = ValidationWeight(class) × log₂(replication_count + 1)
edge_weight  = PCiteTypeWeight(type)   × source.base_weight
pcite_score  = Σ incoming edge weights

Citation edge types and multipliers: replicates 1.5 · extends 1.2 · supports 1.0 · contradicts -0.5 · applies 0.6.

A PhysicalMeasurement claim cited 35 times by other physical claims reaches a score of 364. A TextDerived claim cited 1,000 times peaks at 0.10. No manual scoring. Entirely from the data model.
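
The scoring formula can be sketched in a few lines. The weights below are copied from the tables above; the function signatures and the example edges are illustrative and not the API of models.py. The sketch assumes replication_count is at least 1 for an extracted claim, so that log₂(replication_count + 1) is 1 for an unreplicated claim, consistent with the median base_weight of 10.0 reported for validated claims.

```python
import math

# Validation-class weights and edge-type multipliers from the README.
VALIDATION_WEIGHT = {
    "PhysicalMeasurement": 10.0,
    "ClinicalObservation": 4.0,
    "Replicated": 2.0,
    "DatabaseReferenced": 0.5,
    "TextDerived": 0.01,
    "Hypothesis": 0.0,
}
EDGE_TYPE_WEIGHT = {
    "replicates": 1.5,
    "extends": 1.2,
    "supports": 1.0,
    "contradicts": -0.5,
    "applies": 0.6,
}


def base_weight(validation_class, replication_count):
    """ValidationWeight(class) * log2(replication_count + 1)."""
    return VALIDATION_WEIGHT[validation_class] * math.log2(replication_count + 1)


def pcite_score(incoming_edges):
    """Sum of edge weights over (edge_type, source_class, source_replication_count)."""
    return sum(
        EDGE_TYPE_WEIGHT[edge_type] * base_weight(source_class, count)
        for edge_type, source_class, count in incoming_edges
    )


# Hypothetical incoming citations for one claim.
edges = [
    ("replicates", "PhysicalMeasurement", 1),   # 1.5 * 10.0
    ("supports", "TextDerived", 1),             # 1.0 * 0.01
    ("contradicts", "DatabaseReferenced", 1),   # -0.5 * 0.5
]
print(round(pcite_score(edges), 2))  # 14.76
```

The score depends on the class of each citing claim, so a single replication by a physical measurement outweighs any realistic number of text-derived citations.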


Pipeline

MetaboLights corpus — deposit-first construction via EBI REST API:

MetaboLights API + PubMed eUtils
        |
corpus.py       →  data/papers.jsonl        (1,994 papers)
        |
extract.py      →  data/claims.jsonl        (8,761 claims, Claude tool_use)
        |
graph.py        →  data/graph.graphml       (30,759 edges, Gemini Flash classification)
                →  data/scores.jsonl
        |
evaluate.py     →  data/results.json
                →  figures/*.pdf

PRIDE corpus — deposit-first with second-degree citation neighbourhood:

PRIDE API + OpenAlex citation graph
        |
pride_corpus.py →  data/pride/papers.jsonl   (730 papers, 32% deposit coverage)
        |
extract.py      →  data/pride/claims.jsonl   (3,506 claims)
        |
graph.py        →  data/pride/scores.jsonl
        |
pride_graded_eval.py → data/pride/graded_table.json (four-way comparison)

Both corpora use deposit-first construction: papers with data deposits in public repositories (MetaboLights, PRIDE) are classified as PhysicalMeasurement-tier by construction, not by text inference. No classifier needed for ground truth.


Quick start

pip install -e .
python run_poc.py --dry-run     # evaluate cached data, no API calls required

To reproduce from scratch (requires API keys):

cp .env.example .env            # add ANTHROPIC_API_KEY + GEMINI_API_KEY
python -m pcite.corpus          # fetch papers
python -m pcite.extract         # extract claims via Claude
python -m pcite.graph           # build citation graph via Gemini + OpenAlex
python -m pcite.evaluate        # compute metrics and figures

Results are deterministic. Gemini classification responses are cached in data/classify_cache.json; rerunning without clearing the cache produces identical edge assignments.
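
The caching behaviour can be illustrated with a small JSON-backed lookup: a response is requested once, written to disk, and reused on every later run. This is a sketch under assumptions, not the code in graph.py; the key format, the classify callback and the helper name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def cached_classify(pair_key, classify, cache_path):
    """Return the cached edge type for pair_key, calling classify only on a miss."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if pair_key not in cache:
        cache[pair_key] = classify(pair_key)
        cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return cache[pair_key]


# Stand-in for the remote classifier; records how often it is called.
calls = []


def fake_classifier(pair_key):
    calls.append(pair_key)
    return "supports"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classify_cache.json"
    first = cached_classify("claimA->claimB", fake_classifier, path)
    second = cached_classify("claimA->claimB", fake_classifier, path)
    print(first, second, len(calls))
```

Deleting the cache file forces fresh classification calls, at which point identical edge assignments are no longer guaranteed.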


Repository layout

src/pcite/
  models.py           — ValidationClass, Claim, PCite, scoring formula
  corpus.py           — MetaboLights-first corpus construction
  extract.py          — Claude tool_use claim extraction
  graph.py            — OpenAlex citation graph + Gemini edge classification
  evaluate.py         — Mann-Whitney, Precision@k, NDCG@k
analyses/
  bootstrap_sweep.py          — Adversarial bootstrap breaking point
  bootstrap_noise.py          — Bootstrap noise analysis
  boundary_investigation.py   — Coverage threshold subsampling
  fig_coverage_threshold.py   — Figure: P@50 vs Physical-tier coverage
  sensitivity_analysis.py     — Weight ratio robustness
  embedding_baseline.py       — BioLORD-2023 embedding baseline
  pride_graded_eval.py        — Four-way PRIDE evaluation
  contradicts_ablation.py     — Contradicts weight ablation
  time_decay_analysis.py      — Time-decay cold-start analysis
  strict_physical.py          — Metabolite entity verification against deposits
  study_dedup_analysis.py     — Within-study deduplication
concordance/
  cross_model_extraction.py   — Claude vs GPT-4o concordance (v1)
  cross_model_extraction_v2.py — Claude vs o4-mini concordance (v2)
  cross_model_edges.py        — Edge classification concordance
  validate_components.py      — Component validation utilities
corpora/
  pride_corpus.py             — PRIDE deposit-first corpus
  pdb_corpus.py               — PDB structural biology corpus
  build_citation_cache.py     — Citation cache builder
run_poc.py            — MetaboLights orchestrator (--dry-run for cached eval)
run_pride_poc.py      — PRIDE orchestrator
run_pdb_poc.py        — PDB orchestrator
data/
  papers.jsonl, claims.jsonl, scores.jsonl, graph.graphml, results.json
  boundary_results.json, bootstrap_sweep.csv
  concordance_results*.csv, concordance_per_abstract*.csv
  concordance_cache/          — Per-DOI GPT-4o response cache
  concordance_cache_v2/       — Per-DOI Claude + o4-mini response caches
  negative-control/
  sensitivity/
  pride/              — PRIDE corpus (same structure)
figures/
  fig1_rank_comparison.pdf, fig2_score_dist.pdf, fig3_precision_at_k.pdf
  fig_sensitivity.pdf, fig_coverage_threshold.pdf
  bootstrap_breaking_point.png
tests/                — 46 tests, no API keys needed
docs/                 — static site (GitHub Pages)

Requirements

  • Python >= 3.11
  • ANTHROPIC_API_KEY — claim extraction (Claude Sonnet 4.6)
  • GEMINI_API_KEY — citation edge classification (Gemini 2.0 Flash)
  • OPENROUTER_API_KEY — cross-model concordance via OpenRouter (GPT-4o, o4-mini)
  • OpenAlex is used without authentication (mailto param recommended)
  • See .env.example for all configuration

Run the test suite (no API keys needed):

pytest tests/ -v

About

pCite: Validation-Weighted Citation Framework for Metabolomics — Preprint Release
