deductive-eval

Deductive-ground-truth evaluation of forecasting methods: when an outcome is fixed by arithmetic over observable state, which learners recover that certainty, and which can only reach it by a rule? Foundation models included.

Forecast evaluation usually scores a prediction against a stochastic outcome, so a calibrated belief and a lucky one look alike. This repo builds an evaluation where the answer is known by arithmetic and asks who recovers it.

The test case is the clock-killing first down in NFL play-by-play: a conversion by the leading team late enough that it can kneel the clock to zero given the trailing team's timeouts. Once that condition holds, the game is decided by arithmetic, not by chance. The event is labeled from raw game state (clock, score, timeouts) with no expected-points or win-probability model, so the label is independent of what any forecaster predicts. Across 2010–2024 there are 504 such plays; on the 398 scored out of sample, the leading team won every one.
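The kneel-out arithmetic can be sketched roughly as below. This is a simplified illustration, not the labeler in clockkill.py: the function names, the 40-second runoff, the 2-second kneel, and the choice to ignore the two-minute warning and penalties are all assumptions made for the example.

```python
def can_kneel_out(seconds_left, trailing_timeouts, kneels=3,
                  runoff_per_kneel=40, seconds_per_kneel=2):
    """Return True if the leading team can run the clock to zero by kneeling.

    Hypothetical simplification: after a fresh first down the offense kneels
    on downs 1-3. Each kneel takes `seconds_per_kneel` off the clock, and the
    clock then runs `runoff_per_kneel` more seconds unless the trailing team
    stops it with a timeout. The two-minute warning and penalties are ignored.
    """
    stopped = min(trailing_timeouts, kneels)  # each timeout cancels one runoff
    burnable = kneels * seconds_per_kneel + (kneels - stopped) * runoff_per_kneel
    return seconds_left <= burnable


def is_clock_kill(score_diff, seconds_left, trailing_timeouts, first_down):
    """Label a play from raw state only: score, clock, timeouts, conversion."""
    return (
        first_down
        and score_diff > 0  # the offense must be leading
        and can_kneel_out(seconds_left, trailing_timeouts)
    )


if __name__ == "__main__":
    # Leading by 3, two minutes left, trailing team out of timeouts.
    print(is_clock_kill(score_diff=3, seconds_left=120,
                        trailing_timeouts=0, first_down=True))
```

The point of the sketch is that every input is observable game state, so the label never depends on a forecaster's output.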

Result

At the clock-kill, the production win-probability model sits near 0.97 where the answer is 1, a residual of about 0.03. Nothing data-driven closes it:

feature-free win probability, a random walk, and three zero-shot foundation models (Chronos-Bolt, TimesFM, TiRex) all sit near 0.97;
gradient boosting on the clock features leaves the residual in place, even when handed the exact kneel-out margin, and so does an MLP;
a foundation model fed the clock as an exogenous covariate (Chronos-2) leaves it in place too;
emphasizing the observable late-game region closes it only partway and plateaus short, mostly as blanket confidence across the whole region.

Only consuming the deductive label, or applying the rule, closes the residual. The obstacle is localization, not representation: the cell is a tenth of a percent of any rule-free region and contributes almost nothing to average loss, so no loss-minimizing learner carves it out, even when it holds the exact discriminating feature. Once a label points at the cell, the same model fits the boundary and generalizes it to held-out events.
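A back-of-the-envelope calculation shows how little average loss the cell is worth. The sketch below takes the two figures quoted above (a prediction near 0.97 and a cell that is about a tenth of a percent of the region) and assumes the outcome in the cell is always 1. The function name is hypothetical and the numbers are illustrative, not taken from RESULTS.md.

```python
import math


def average_loss_gain(cell_fraction, p_model):
    """Average-loss improvement from predicting 1.0 instead of p_model
    on a cell whose outcome is always 1.

    Returns (log_loss_gain, brier_gain), both averaged over the whole region.
    """
    # Per-event loss at p_model; a prediction of exactly 1.0 scores 0 on both.
    log_loss_per_event = -math.log(p_model)
    brier_per_event = (1.0 - p_model) ** 2
    return cell_fraction * log_loss_per_event, cell_fraction * brier_per_event


if __name__ == "__main__":
    log_gain, brier_gain = average_loss_gain(cell_fraction=0.001, p_model=0.97)
    print(f"log-loss gain: {log_gain:.2e}  Brier gain: {brier_gain:.2e}")
```

Under these assumptions, closing the residual completely improves average log loss by about 3e-5 and average Brier score by about 9e-7, which is consistent with a loss-minimizing learner having almost no incentive to isolate the cell.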

Full recovery ladder and the commands that produce it: RESULTS.md.

Reproduce

The evaluation is deterministic: a rule-based labeler, fixed seeds, game-clustered bootstraps. The backtest is walk-forward: each test season is scored by models trained only on earlier seasons.
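The walk-forward scheme can be illustrated with a small generator. This is a minimal sketch, not the repo's implementation: the function name and the `min_train` parameter are hypothetical, and the actual number of warm-up seasons is not stated here.

```python
def walk_forward_splits(seasons, min_train=1):
    """Yield (train_seasons, test_season) pairs in chronological order.

    Each test season is paired only with strictly earlier seasons, so no
    future information reaches the model that scores it.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


if __name__ == "__main__":
    for train, test in walk_forward_splits(range(2010, 2014)):
        print(f"train on {train} -> score {test}")
```

The earliest `min_train` seasons are used for training only, which is one reason fewer plays are scored out of sample than are labeled overall.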

pip install pandas numpy scipy pyarrow scikit-learn matplotlib
# fetch play-by-play into data/incoming/ — see DATA.md
python pilot/12_recoverability.py       # recovery ladder: feature-free, GBM, oracle, random walk
python pilot/14_covariate_recovery.py   # covariate ladder, incl. the MLP rung
python pilot/16_recovery_audit.py       # near-miss calibration + weight sweep

The foundation-model rungs (Chronos-Bolt, TimesFM, TiRex, Chronos-2) need a GPU environment — see DATA.md.
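For readers unfamiliar with game-clustered bootstraps, the idea is to resample whole games rather than individual plays, so that plays from the same game stay together. The sketch below is a generic percentile version using only the standard library; the function name, defaults, and interval method are assumptions and may differ from what metrics.py does.

```python
import random
from collections import defaultdict


def game_clustered_bootstrap(values, game_ids, n_boot=1000, seed=0, alpha=0.05):
    """Percentile confidence interval for the mean, resampling by game.

    values   : per-play quantities (for example, residuals 1 - p)
    game_ids : the game each play belongs to, same length as values
    """
    by_game = defaultdict(list)
    for value, game in zip(values, game_ids):
        by_game[game].append(value)
    games = sorted(by_game)

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = []
    for _ in range(n_boot):
        # Draw games with replacement; every play in a drawn game comes along.
        sample = rng.choices(games, k=len(games))
        total = sum(sum(by_game[g]) for g in sample)
        count = sum(len(by_game[g]) for g in sample)
        means.append(total / count)

    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    residuals = [0.02, 0.04, 0.03, 0.05, 0.01, 0.03]
    games = ["g1", "g1", "g2", "g3", "g3", "g4"]
    print(game_clustered_bootstrap(residuals, games))
```

Resampling at the game level keeps the interval from being artificially narrow when plays within a game are correlated.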

Layout

src/deductive_eval/
  clockkill.py    deterministic clock-kill labeler (kneel-out arithmetic, no EP/WP)
  leverage.py     |EPA| / |wpa| leverage pair, frame normalization
  data.py         season load, schema normalization, v0 filtering
  windowing.py    play-indexed windowing + matched controls
  forecasters.py  univariate / covariate forecaster wrappers
  metrics.py      coverage, calibration, game-clustered bootstrap
pilot/            numbered analysis scripts (01–16)
figures/          generated figures

Data & setup

Play-by-play and model weights are fetched from source (nflverse, Hugging Face), not committed. See DATA.md.

Paper

Working draft in paper/draft.md: Forecasting With a Known Answer: A Deductive-Ground-Truth Case Study on the Clock-Killing First Down.
