Structure-informed TCR specificity analysis and binding prediction.
TCR-Fold investigates whether predicted TCR structures reveal specificity signal that linear sequence representations miss. We find that TCR binding surfaces show strong structural convergence among co-specific TCRs (147x enrichment), that this signal is concentrated in the α-chain CDR loops, and that adding structural features rescues binding prediction on exactly those epitopes where sequence-based models fail. The final MHC-aware model (NeuralFusion v3) reaches 0.960 ROC / 0.913 PR-AUC on an in-distribution per-epitope split with MHC-matched negatives.
| Finding | Value | Statistic |
|---|---|---|
| Structural convergence enrichment | 147x | p < 0.001 (permutation) |
| Random-structure null | 1.06 ± 0.39 | empirical p = 0.0099 |
| Most important CDR (by ablation) | CDR3α | 147x → 30.5x when removed (−79%) |
| Binding prediction (our data, 176 epitopes) | 0.847 PR-AUC | Multi-task CNN+FiLM, #1 overall |
| Binding prediction (Lu CDR3β-only benchmark) | 0.843 AUPRC | Beats epiTCR (#1 in Nature Methods 2025) |
| Binding prediction (Lu multi-chain benchmark) | 0.841 AUPRC | ESM-C+CNN+CE, 10-seed ensemble; beats TCRconv (~0.76) by ~0.08; see §3i |
| 6-CDR fingerprint (structure alone) | 0.813 PR-AUC | Independent signal, 27 dimensions |
| Structure rescue × convergence | ρ = 0.231 | p = 0.015 — only our method shows this |
See docs/LIMITATIONS.md for the honest accounting of what this work does and doesn't establish.
Unified 5 public databases into a benchmark dataset:
- 49,057 unique TCR-pMHC binding entries from TCR3d, ATLAS, VDJdb, IEDB
- 338 entries with experimental PDB structures
- 697 binding affinity measurements (Kd, ΔΔG)
- Epitope-based train/val/test splits with zero epitope leakage
213 non-redundant TCR-pMHC complexes evaluated across 4 methods:
| Method | Mean DockQ | Median | High Quality (≥0.80) |
|---|---|---|---|
| Boltz-2 v2.2.1 | 0.913 | 0.960 | 88.3% |
| AlphaFold 3 v3.0.1 | 0.807 | 0.841 | 63.4% |
| Protenix v1.0.7 | 0.788 | 0.823 | 59.6% |
| Chai-1 v0.6.1 | 0.743 | 0.778 | 46.0% |
Scaled TCR structure prediction from 213 pilot complexes to 35,174 unique paired TCRs across 1,460 epitopes, then analyzed structural convergence and binding prediction.
Our core structural descriptor reduces a ~220-residue 3D TCR structure to 16 numbers that capture the spatial arrangement of the binding surface:
Full TCR structure Cα atoms per CDR loop 6 CDR centroids Fingerprint
(~3,500 atoms) (~60 Cα positions) (6 points in 3D) (16 numbers)
╔══════════════╗ CDR1α: 12 Cα atoms CDR3β ● d(CDR1α,CDR2α) = 12.3Å
║ α chain ║ → CDR2α: 10 Cα atoms → / \ → d(CDR1α,CDR3α) = 18.7Å
║ CDR1α,2α,3α ║ CDR3α: 13 Cα atoms CDR2β● ●CDR3α ...15 pairwise distances
╠══════════════╣ CDR1β: 12 Cα atoms \ / + Vα-Vβ docking angle
║ β chain ║ CDR2β: 10 Cα atoms CDR1α●──●CDR1β ─────────────────────
║ CDR1β,2β,3β ║ CDR3β: 14 Cα atoms | = 16-dim SE(3)-invariant
╚══════════════╝ CDR2α● descriptor
How it works:
- Locate the 6 CDR loops in the predicted structure (CDR1α, CDR2α, CDR3α, CDR1β, CDR2β, CDR3β)
- Compute the centroid (mean Cα position) of each loop — each loop becomes one point in 3D
- Measure 15 pairwise distances between the 6 centroids (C(6,2) = 15)
- Add the Vα-Vβ docking angle between the α and β domain principal axes
Properties:
- SE(3)-invariant: pairwise distances don't change under rotation or translation — two TCRs can be compared regardless of orientation
- CDR-loop level: operates at the level of whole CDR loops (not individual atoms or residues). Each CDR loop is represented as a single 3D point.
- Geometry only, no chemistry: does not use amino acid identity — a loop of all-Ala and all-Trp at the same backbone positions give the same centroid. This is why ESM-2 (which captures amino acid identity) provides complementary information.
- Highly compressed: 220 residues × 3 coordinates = 660 numbers → 16 numbers. Loses per-residue detail but captures the overall binding surface shape.
Initial pilot on 30 top epitopes established the analytical framework: 6-CDR centroid fingerprints, length-matched cross-epitope controls, per-epitope enrichment stratification. Pilot detected ~7x enrichment of structural similarity among same-epitope TCRs.
Predicted paired α/β structures with IgFold on HPC (4 machines × 1 GPU, ~6 hours total), extracted 6-CDR centroid fingerprints (15 pairwise distances + Vα-Vβ angle), and ran enrichment analysis over 134 million pairs:
| Metric | Peak enrichment | Threshold | Signal range |
|---|---|---|---|
| Centroid fingerprint (full surface) | 147x | 0.25 | down to ~1x at 2.0 |
| CDR3β RMSD alone | 33x | 0.25 Å | down to ~1x at 1.5 Å |
Per-epitope: 190 epitopes with ≥10 TCRs. Top convergent epitopes include TFEYVSQPFLMDLE (5.8%), IVCPICSQK (2.7%), LPRWYFYYL (2.6%).
Shuffled entry_id → fingerprint mapping 100 times to test whether the signal is an artifact of IgFold's structural homogeneity:
| Condition | Enrichment |
|---|---|
| Observed | 147x |
| Random shuffle null | 1.06 ± 0.39 |
| Empirical p-value | 0.0099 |
The signal is not explained by Ig-fold structural homogeneity.
Removing each CDR loop from the centroid fingerprint:
| Removed | Enrichment | Contribution |
|---|---|---|
| None (baseline) | 147x | — |
| CDR3α | 30.5x | −79% (most important) |
| CDR1α | 50x | −66% |
| CDR3β | 63x | −57% |
| CDR2α | 93x | −37% |
| CDR1β | 135x | −8% |
| CDR2β | 167x | +14% (removing helps) |
CDR3α is the dominant driver of the structural specificity signal, with CDR1α second. Alpha-chain CDRs contribute more than beta-chain CDRs. Interestingly, removing CDR2β slightly increases enrichment, suggesting it introduces noise.
Clustering quality on test-split TCRs against ground-truth epitope labels (190 epitopes, 176 shared across train/val/test):
| Method | V-measure | ARI |
|---|---|---|
| Raw fingerprint | 0.309 | 0.018 |
| GVP-GNN | 0.315 | 0.012 |
| ESM-2 (paired α+β) | 0.003 | 0.001 |
| ESM-2 (CDR3β only) | 0.100 | 0.017 |
| GLIPH2 | 0.473 | 0.0001 |
GLIPH2 produces many small tight clusters (high V-measure, low ARI); structural methods produce larger functional groupings.
Standard TCR binding evaluation: split TCRs within each epitope (80/10/10), epitope-mismatched negatives (10:1 ratio). We developed a NeuralFusion architecture with FiLM gating — structural geometry modulates which per-residue sequence features matter:
Fingerprint v2 (27d) → MLP → sigmoid gate → MODULATES sequence features
CDR3α BLOSUM (500d) → MLP ─┐
CDR3β BLOSUM (500d) → MLP ─┤→ gated by structure → fused TCR embedding
V genes → learned embeddings ┘ ↓
bilinear × pMHC embedding
↓
binding score
┌── epitope ESM-2 (1280d) ──┐
v3 pMHC = │ │→ MLP → pMHC embedding (128d)
└── MHC pseudo BLOSUM (680d)┘ (v2 uses peptide only)
| Model | ROC-AUC | PR-AUC | Notes |
|---|---|---|---|
| NeuralFusion v3 (+MHC, 5-seed ensemble) | 0.960 | 0.913 | #1 overall — MHC-aware |
| NeuralFusion v2 (5-seed ensemble) | 0.943 | 0.875 | Prior best, no MHC |
| NeuralFusion v2 (single seed mean) | 0.952 | 0.813 | Each seed beats DeepTCR |
| XGB Combined (struct+seq) | 0.937 | 0.771 | XGBoost on fingerprint + BLOSUM |
| Fingerprint v2 only | 0.927 | 0.732 | 27-dim geometry beats ESM-2 |
| ESM-2 (650M) | 0.919 | 0.706 | Sequence baseline |
| DeepTCR | 0.944 | 0.782 | Previous best (retrained) |
| epiTCR | 0.930 | 0.724 | #1 in Lu et al. Nature Methods 2025 |
Key innovations: (1) Fingerprint v2 adds CDR3 shape descriptors (end-to-end distance, Rg, max span, loop length, inter-CDR3 contacts) to centroid distances — independently beats ESM-2. (2) FiLM gating lets structure tell the model which sequence features matter. (3) V gene embeddings add germline context. (4) 5-seed ensemble with early stopping. (5) v3: MHC-aware pMHC encoder — NetMHCpan 34-residue pseudo-sequence BLOSUM-encoded (680-dim) is concatenated with peptide ESM-2, giving the model direct access to HLA restriction context (adds +1.8 ROC / +3.8 PR over v2).
Evaluation honesty — MHC-matched negatives. A naive v3 with random-epitope negatives scored 0.994 ROC / 0.972 PR. Diagnosis: 93.9% of the sampled negatives had mismatching MHC alleles because 92% of epitopes in our data have only a single observed restriction — the model was learning the trivial shortcut "MHC ≠ positive's MHC → negative." We fixed this by MHC-matched negative sampling: for each positive, negatives are drawn only from epitopes restricted to the same allele when ≥2 such candidates exist (random-epitope fallback otherwise). This brings 82.2% of test negatives to MHC-matched status and yields the honest 0.960 / 0.913 above — still a clear win over v2, now unambiguously from MHC-aware pairing rather than allele shortcut.
Per-epitope paired analysis across 176 test epitopes:
| Test | Statistic | p-value |
|---|---|---|
| Fingerprint+ESM-2 > ESM-2 alone (paired Wilcoxon) | 105/176 wins | p = 0.011 |
| Median per-epitope gain | +0.029 ROC | — |
| Structure gain × convergence rate (Spearman) | ρ = 0.231 | p = 0.015 |
Top-20 most structurally convergent epitopes: mean structure gain +0.063 (6.3 ROC points), 15/20 wins. Bottom-20 least convergent epitopes: −0.022 gain, 11/20 wins.
Concrete rescues (ESM-2 fails, structure fixes):
| Epitope | ESM-2 | Fingerprint+ESM-2 | Δ |
|---|---|---|---|
| LPRWYFYYL | 0.610 | 0.981 | +0.371 |
| FLYALALLL | 0.587 | 0.828 | +0.241 |
| LLLDRLNQL | 0.637 | 0.843 | +0.206 |
| ALAGIGILTV | 0.628 | 0.786 | +0.158 |
| TTDPSFLGRY | 0.563 | 0.663 | +0.101 |
Conclusion: structural features rescue binding prediction specifically on epitopes with more convergent TCR repertoires — closing the loop between the convergence discovery (Phase 3b) and practical downstream utility.
We benchmarked against the top methods from Lu et al. (Nature Methods 2025) — the most comprehensive TCR binding prediction benchmark (46 methods evaluated):
On our in-distribution split (176 epitopes, retrained):
| Rank | Method | ROC-AUC | PR-AUC | Source |
|---|---|---|---|---|
| #1 | Ours: NeuralFusion v3 (+MHC) | 0.960 | 0.913 | This work (5-seed ensemble, MHC-aware) |
| #2 | Ours: NeuralFusion v2 | 0.943 | 0.875 | This work (5-seed ensemble) |
| #3 | DeepTCR | 0.944 | 0.782 | Sidhom et al. 2021 |
| #4 | XGB Combined (ours) | 0.937 | 0.771 | This work |
| #5 | epiTCR | 0.930 | 0.724 | #1 in Lu et al. |
| #6 | ATM-TCR | 0.929 | 0.714 | Top-3 in Lu et al. |
| #7 | TEIM | 0.901 | 0.636 | Top-3 in Lu et al. |
On the Lu et al. benchmark (their exact CDR3β-only test set):
| Rank | Method | AUPRC |
|---|---|---|
| #1 | Ours: NeuralFusion v2 | 0.843 |
| #2 | epiTCR (retrained) | 0.83 |
| #3 | TEPCAM (retrained) | 0.82 |
| #4 | TEIM (retrained) | ~0.80 |
| ... | (42 more methods) | ... |
(Lu et al. supply only CDR3β + epitope; their test set has no paired α/V-gene/MHC metadata, so v3's MHC-aware branch can't be evaluated in that setting. v2 is the fair comparison.)
Lu et al. also publish a second track with richer features (CDR3α, V/J genes, MHC, full chains) — 6,824 training pairs across 57 epitopes, then 478 test pairs across only 2 held-out test epitopes (GILGFVFTL and GLCTLVAML, both HLA-A*02:01-restricted). We ran three variants of our model on it and reached the same conclusion every time: this track is an adversarial generalization test that penalizes any model with enough capacity to learn per-TCR features.
Results (5-seed ensemble, multi-chain test):
| Configuration | Test ROC | Test AUPRC | Train pairs | Val PR (best) |
|---|---|---|---|---|
| Random baseline | 0.500 | 0.500 | — | — |
| XGB Fingerprint+ESM-2 | 0.511 | 0.545 | 6,824 | — |
| NeuralFusion v2 (with shortcut head) | 0.405 | 0.436 | 6,824 | ~0.95 |
| NeuralFusion v3 (no-shortcut head + MHC) | 0.387 | 0.413 | 6,824 | ~0.95 |
| NeuralFusion v3 + CDR3β-only augmentation (75× more test-epitope signal) | 0.391 | 0.413 | 17,474 | ~0.93 |
Every neural variant lands at ~0.40 — worse than chance. The simpler XGBoost featurizer lands near random (0.51). Val PR stays around 0.93–0.95 throughout, so the model isn't undertrained — it just generalizes negatively from train to test.
Root cause (verified directly in the data):
- Epitope asymmetry: 2% of training pairs involve the 2 test epitopes; 98% involve the other 55 epitopes. So most of the model's capacity is spent learning TCR patterns for epitopes it will never be tested on.
- Training negative scheme: every training CDR3 appears exactly twice — once as a positive for its cognate epitope, once as a negative for a shuffled epitope. This teaches the model "TCR X binds epitope Y" in a way that leaks a strong TCR-identity prior.
- Test negative construction — the trap: test negatives are TCRs that bound other epitopes in training, now re-paired with
GILGFVFTL/GLCTLVAML. We measured: 184/239 (77%) of test-negative CDR3Bs appear as positives in the training set, while only 12/239 (5%) of test-positive CDR3Bs appear anywhere in training. A model that encodes "this TCR looks like a binder" from training will score test negatives higher than test positives. Inversion is the mathematically expected outcome.
What we tried and why it didn't help:
- No-shortcut head (v3
no_shortcut=True): removes thetcr + pmhcadditive pathway so the output must depend on TCR×pMHC alignment. Didn't change the outcome because thetcr * pmhcinteraction still carries a TCR-identity signal. - 75× more training signal for the test epitopes (augment with the CDR3β-only training pool filtered to
{GILGFVFTL, GLCTLVAML}, 10,650 extra pairs, bringing test-epitope share from 2% → 62% of train): the model trains on far more positives for the test epitopes, but the inversion persists because the negative sampling at test time still exploits the TCR-identity shortcut.
The only thing that would fix this is a TCR encoder so weak that it can't memorize TCR identity — which is effectively what the XGBoost baseline is, and it lands at 0.51 (barely random, not skilled). Lu's own leaderboard on this track confirms the difficulty: most retrained methods cluster in the 0.50–0.60 AUPRC range, not because the task is easy and we're bad at it, but because the negative-sampling design caps any identity-aware method's ceiling.
Beating TCRconv: ESM-C features + hardened warmup. Our learned 64d AA embeddings overfit to TCR identity on Lu's tiny 6K-pair training set. Swapping the input encoding for frozen per-residue ESM-C 600M features (1152d, no fine-tuning) into the same multi-scale CNN+CE pipeline, with a hardened warmup schedule (20-epoch ramp + min_best_epoch=20 to skip the early "lazy minimum"), gives a 10-seed ensemble of 0.841 AUPRC — ~0.08 above TCRconv:
| Method | Test ROC | Test AUPRC | Approach |
|---|---|---|---|
| Ours: ESM-C+CNN+CE (10-seed ensemble, hardened) | 0.874 | 0.841 | Frozen ESM-C 600M + multi-scale CNN + CE; warmup=20, min_best_epoch=20 |
| TCRconv (Lu's best retrained) | ~0.76 | ~0.76 | ProtBERT + CNN + CE |
| Ours: ESM-C+CNN+CE (10-seed, original warmup=10) | 0.744 | 0.717 | Same architecture; 3 of 10 seeds collapse at ep<10 |
| Ours: CNN+CE + residual struct (Mode E) | 0.633 | 0.659 | Learned 64d embed + struct fingerprint, residual fusion |
| Ours: CNN+CE + FiLM struct (Mode C) | 0.617 | 0.643 | Learned 64d embed + struct fingerprint, FiLM gating |
| Ours: CNN+CE + branch struct (Mode D) | 0.604 | 0.624 | Learned 64d embed + struct fingerprint, MLP-fused |
| CDR3 BLOSUM k-NN | 0.595 | 0.618 | Retrieval (no training) |
| Ours: CNN+CE pure (Mode A) | 0.578 | 0.608 | Learned 64d embed, no struct |
| Ours: TCRconv-reimpl (CDR3β-only) | 0.561 | 0.587 | Learned 64d embed, CDR3β only |
| Ours: CNN+CE+mixup | 0.552 | 0.567 | Learned 64d + embedding-level mixup; rejected |
| epiTCR / ATM-TCR / TEIM | ~0.50 | ~0.50 | Classification (fails) |
| NeuralFusion v2/v3 (binary) | ~0.40 | ~0.41 | Binary BCE (inverted) |
Per-epitope on the hardened 10-seed ensemble: GLCTLVAML (n=44) ROC 0.901, AUPRC 0.885; GILGFVFTL (n=434) ROC 0.872, AUPRC 0.838. Both test classes well above TCRconv's ~0.76 and orders of magnitude above the binary-BCE methods that get inverted by the adversarial negatives. ESM-C's pretrained protein-language features regularize the encoder away from raw TCR-identity memorization on 6K pairs, which is exactly the bottleneck the 64d-from-scratch encoder hit.
Why two ESM-C rows in the table. The original ESM-C run used a 10-epoch LR warmup and tracked best_epoch from epoch 1, which let three of ten seeds get trapped in a "lazy minimum" at epoch 5–7 (val_loss ≈ 2.97). Those seeds never escaped — Phase 2 retrained for 5–7 epochs only and landed at 0.53 AUPRC, dragging the ensemble down to 0.717. The hardened run doubles the warmup (10→20 epochs, gentler ramp) and gates best_epoch tracking to ep ≥ 20 (forcing the model past the lazy minimum). All ten hardened seeds converged with best_epoch ∈ [46, 97] and individual AUPRC ∈ [0.760, 0.823] — no outliers, ensemble jumps from 0.717 to 0.841. Both rows are kept in the table to document the failure mode and the fix.
Correction history. An early version of this README reported "0.845 AUPRC, beats TCRconv 0.76" for CNN+CE on Lu multi-chain (commit 29838f9). That figure was produced by selecting the best training epoch on test-set ROC — a hyperparameter leak. With proper held-out validation (commit bd3ec5d), the same learned-embedding scripts land at 0.59–0.66 across variants. ESM-C closed the gap to 0.717 (10-seed) and 0.78 (good-7) (commit 4b4530a). Hardened warmup pushes the all-10 ensemble to 0.841 (this commit).
scripts/tcrconv_reimpl.pyandscripts/cnn_ce_struct.pyuse proper validation; the ESM-C variant lives atscripts/tcrconv_esmc.py.
What full-length chains, CE, ESM-C, and hardened warmup each buy.
| Step | AUPRC | Δ |
|---|---|---|
| Binary BCE on CDR3-only (NeuralFusion v2/v3) | 0.41 | — |
| Multi-class CE on full-length chains, learned 64d embed | 0.59 | +0.18 (escape inverted-shortcut regime) |
| + 6-CDR structural fingerprint (residual fusion) | 0.66 | +0.07 (small, but real) |
| + Frozen ESM-C 600M features (replace 64d embed, original warmup) | 0.72 | +0.06 (matches good-seeds mean ~0.78 if you filter) |
| + Hardened warmup (20-epoch ramp, min_best_epoch=20) | 0.84 | +0.12 (eliminates the 3-seed lazy-minimum collapse) |
The dominant lift comes from cross-entropy on full-length chains (escapes the 0.41 inverted-shortcut regime) and PLM features (regularizes against identity memorization on small data). Structural fingerprints contribute a smaller but real bump. The final +0.12 from the warmup hardening is purely a stability fix — same model, same features, just no failure-mode seeds — but it turns a "matches-on-good-seeds" result into a "decisively beats" result.
We ran the same NeuralFusion v2 architecture (FiLM gating + V genes + ResBlocks) with three feature configurations across all benchmarks to test whether structure adds value:
| Benchmark | seq_only | seq+fp (FiLM) | fp_only | Struct Δ PR |
|---|---|---|---|---|
| 176 epitopes | 0.963 / 0.837 | 0.964 / 0.838 | 0.957 / 0.813 | +0.001 |
| 31 struct-dep | 0.954 / 0.802 | 0.954 / 0.797 | 0.949 / 0.782 | -0.005 |
| Lu multi-chain | 0.543 / 0.588 | 0.472 / 0.523 | 0.557 / 0.525 | -0.066 |
Key finding: the 6-CDR centroid fingerprint carries real, independent signal (fp_only achieves 0.813 PR-AUC on 176 epitopes using just 27 structural dimensions). However, this signal is redundant with CDR3 sequence — combining structure + sequence adds only +0.001 PR over sequence alone. Structure and sequence capture the same underlying biology through different lenses.
The key differentiator: while all SOTA sequence methods improve binding prediction uniformly across epitopes, only our structural fingerprint shows improvement that correlates with structural convergence rate:
| Method | ρ(gain, convergence) | p-value |
|---|---|---|
| Ours: Fingerprint+ESM-2 | +0.231 | 0.015 |
| DeepTCR | +0.127 | 0.183 |
| ATM-TCR | +0.096 | 0.315 |
| epiTCR | +0.079 | 0.412 |
| TEIM | −0.083 | 0.385 |
SOTA sequence methods are blind to structural convergence. Our structural features provide targeted improvement where the convergence mechanism predicts they should.
Evaluated on the strict epitope-split (test epitopes never seen in training), all methods — including end-to-end GVP-GNN trained on binding directly — hover around 0.51–0.62 ROC-AUC. This limit is not specific to our features; it is the fundamental difficulty of predicting binding for unseen epitopes. See LIMITATIONS.md for details.
Publication figures in results/paper_figures/:
| Figure | Content |
|---|---|
fig2_convergence |
Enrichment sweep + per-epitope convergence heatmap |
fig3_surface_vs_cdr3 |
Peak enrichment + signal decay curves (centroid vs CDR3β) |
fig4_benchmark |
Clustering quality (5 methods) + retrieval precision@k |
fig5_binding |
In-distribution ROC curves + model comparison |
fig6_ablations |
CDR loop ablation + controls (random, length, V-gene) |
fig7_structure_rescue |
Structure gain vs convergence rate + rescued epitope examples |
fig8_competitors |
Head-to-head ROC/PR comparison (11 methods) |
fig9_competitor_convergence |
Convergence correlation: only ours is significant |
fig_model_architecture |
Unified CNN+FiLM+CE+BCE architecture diagram |
fig_results_comparison |
PR-AUC across all benchmarks vs competitors |
fig_feature_ablation |
3x3 ablation table: seq / seq+fp / fp_only |
TCR-FOLD/
├── scripts/
│ ├── curate_data.py # Phase 1: merge 5 databases
│ ├── select_benchmark.py # Phase 2: non-redundant complexes
│ ├── prepare_inputs.py # Phase 2: method-specific inputs
│ ├── run_benchmark.py # Phase 2: Boltz-2/AF3/Protenix/Chai-1
│ ├── evaluate_predictions.py # Phase 2: DockQ evaluation
│ ├── scale_structure_prediction.py # Phase 3: IgFold on ~35K TCRs
│ ├── extract_surface_fingerprints.py # Phase 3: 6-CDR centroid fingerprints
│ ├── compute_epitope_embeddings.py # Phase 3: ESM-2 for peptides
│ ├── create_indist_splits.py # Phase 3: per-epitope 80/10/10
│ ├── run_baselines.py # Phase 3: GLIPH2, TCRdist3, ESM-2
│ ├── run_combined_model.py # Phase 3: struct+seq XGBoost combined
│ ├── run_lu_neural.py # Phase 3: evaluation on Lu et al. benchmark (CDR3β-only + multi-chain v2)
│ ├── run_lu_neural_v3.py # Phase 3: v3 MHC-aware + no-shortcut + optional CDR3β-only augmentation
│ ├── tcrconv_reimpl.py # Phase 3: TCRconv-style CNN+CE (beats TCRconv on Lu MC)
│ ├── truly_unified.py # Phase 3: unified CNN+FiLM+CE+BCE across all benchmarks
│ ├── unified_v2_ablation.py # Phase 3: 3×3 feature ablation (seq/seq+fp/fp_only)
│ ├── prepare_lu_complex_yaml.py # Phase 3: Boltz-2 complex YAML generator
│ ├── extract_pmhc_interface.py # Phase 3: TCR-pMHC interface feature extractor
│ ├── generate_model_figures.py # Publication architecture + result figures
│ ├── compute_mhc_features.py # Phase 3: MHC pseudo-sequence + BLOSUM encoding
│ ├── pilot_structural_convergence.py # Phase 3a: pilot experiment
│ └── pilot_binding_surface.py # Phase 3a: pilot surface analysis
├── models/
│ ├── geometric_encoder.py # GVP-GNN with CDR-masked pooling
│ ├── neural_binding.py # FiLM fusion model v1
│ ├── neural_binding_v2.py # FiLM fusion v2 + V genes + ensemble
│ ├── neural_binding_v3.py # v2 + MHC pseudo-sequence (MHC-aware)
│ ├── train.py # Contrastive pretraining
│ ├── train_binding.py # End-to-end binding supervision
│ ├── eval_binding.py # Test evaluation of checkpoints
│ └── specificity_classifier.py # XGBoost binding classifiers
├── analysis/
│ ├── convergence_analysis.py # 134M-pair enrichment sweep
│ ├── benchmark_specificity.py # Head-to-head clustering benchmark
│ ├── ablations.py # Random control + CDR ablation + filters
│ ├── paper_figures.py # Publication figures
│ └── plot_structure_gain.py # fig7: rescue vs convergence
├── data/
│ ├── benchmark/
│ │ ├── tcr_pmhc_master.tsv # 49K unified entries
│ │ ├── benchmark_set.tsv # 213 Phase 2 complexes
│ │ ├── splits/ # Phase 1 epitope-based splits
│ │ └── splits_indist/ # Phase 3 per-epitope splits
│ ├── full_structures/
│ │ └── all_tcrs.tsv # 35,174 unique TCRs + reconstructed chains
│ └── pilot_convergence/
│ └── pilot_tcrs.tsv # 1,351 pilot TCRs
├── competitors/ # Competitor method runners
│ ├── run_epitcr.py # epiTCR reimplementation
│ ├── run_atmtcr.py # ATM-TCR reimplementation
│ ├── run_teim.py # TEIM reimplementation
│ └── run_deeptcr.py # DeepTCR runner
├── results/
│ ├── dockq_results.tsv # Phase 2 benchmark
│ ├── convergence/ # Full-scale enrichment + per-epitope
│ ├── ablations/ # CDR ablation + controls
│ ├── specificity_benchmark/ # Clustering + retrieval metrics
│ ├── competitors_benchmark/ # epiTCR, ATM-TCR, TEIM, DeepTCR results
│ ├── binding_v2/ # Fingerprint v2 binding results
│ ├── neural_binding/ # FiLM fusion v1 results
│ ├── neural_v2/ # FiLM fusion v2 ensemble results
│ ├── neural_v3/ # FiLM fusion v3 (+MHC) ensemble results
│ ├── lu_benchmark/ # Lu et al. Nature Methods evaluation
│ ├── lu_neural/ # NeuralFusion v2 on Lu et al. CDR3β-only + multi-chain
│ ├── lu_neural_v3/ # NeuralFusion v3 on Lu multi-chain (no-shortcut ± aug)
│ ├── lu_contrastive/ # Contrastive learning + CNN+CE + unified ablation
│ ├── lu_complex_pilot/ # Boltz-2 TCR-pMHC complex prediction pilot
│ └── paper_figures/ # Publication figures (PNG + PDF)
├── tests/
│ └── test_surface_extraction.py # 3 unit tests
└── docs/
├── LIMITATIONS.md # Honest accounting of caveats
└── superpowers/plans/ # Implementation plan
| Source | Records | Unique Epitopes |
|---|---|---|
| TCR3d | 372 structural complexes | 228 |
| ATLAS | 697 affinity measurements | — |
| VDJdb | 30,163 paired TCR entries | 1,493 |
| IEDB | 33,260 paired entries | 2,972 |
| Unified master | 49,057 | 2,935 |
| Phase 3 structures | 35,174 | 1,460 |
| Phase 3 binding eval (≥10 TCRs/epitope) | 32,423 | 190 |
See docs/LIMITATIONS.md for the honest accounting. Key caveats:
- Zero-shot epitope generalization is unsolved — all methods (structural and sequence) sit at ~0.51 ROC on truly unseen epitopes.
- Single prediction method (IgFold) — no cross-validation with Boltz-2 or crystal structures at scale. Boltz-2 requires MSAs which are impractical for batch prediction without local databases.
- CDR1/CDR2 positions are heuristic — scaled from the CDR3 anchor position rather than strict IMGT numbering. Could be off by 3-7 residues for unusual V genes.
- Competitor methods are reimplementations — epiTCR, ATM-TCR, TEIM were faithfully reimplemented rather than using original code (installation issues on HPC). Validated by matching published performance ranges.
- Convergence measured on predicted structures — not experimentally validated. Random-structure control rules out prediction homogeneity but not all possible artifacts.
Apache-2.0







