Deductive-ground-truth evaluation of forecasting methods: when an outcome is fixed by arithmetic over observable state, which learners recover that certainty, and which can only reach it by a rule? Foundation models included.
Forecast evaluation usually scores a prediction against a stochastic outcome, so a calibrated belief and a lucky one look alike. This repo builds an evaluation where the answer is known by arithmetic and asks who recovers it.
The test case is the clock-killing first down in NFL play-by-play: a conversion by the leading team late enough that it can kneel the clock to zero given the trailing team's timeouts. Once that condition holds, the game is decided by arithmetic, not by chance. The event is labeled from raw game state (clock, score, timeouts) with no expected-points or win-probability model, so the label is independent of what any forecaster predicts. Across 2010–2024 there are 504 such plays; on the 398 scored out of sample, the leading team won every one.
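The kneel-out arithmetic can be sketched as follows. This is a minimal illustration, not the labeler in `clockkill.py`: the `GameState` fields, the two-second kneel duration, and the decision to ignore the two-minute warning and the runoff before the first kneel are all assumptions made for the example.

```python
from dataclasses import dataclass

# All three constants are assumptions for this sketch.
KNEEL_SECONDS = 2     # assumed game-clock time consumed by one kneel-down
PLAY_CLOCK = 40       # seconds that can run off between snaps
KNEELS_AVAILABLE = 3  # first, second, and third down after the conversion


@dataclass(frozen=True)
class GameState:
    """Hypothetical post-play state; field names are illustrative only."""
    seconds_left: int         # game clock after the conversion
    score_margin: int         # offense score minus defense score
    gained_first_down: bool   # whether the play converted
    defense_timeouts: int     # timeouts held by the trailing team


def kneel_out_capacity(defense_timeouts: int) -> int:
    """Seconds the offense can burn with three kneels.

    Each timeout the defense holds cancels one play-clock runoff.
    """
    runoffs = max(0, KNEELS_AVAILABLE - defense_timeouts)
    return KNEELS_AVAILABLE * KNEEL_SECONDS + runoffs * PLAY_CLOCK


def is_clock_kill(state: GameState) -> bool:
    """True when a leading offense converts and can kneel the clock to zero."""
    return (
        state.gained_first_down
        and state.score_margin > 0
        and state.seconds_left <= kneel_out_capacity(state.defense_timeouts)
    )


if __name__ == "__main__":
    print(is_clock_kill(GameState(100, 3, True, 0)))  # capacity 126 s
    print(is_clock_kill(GameState(100, 3, True, 2)))  # capacity 46 s
```

The point of the sketch is that the label needs only clock, score, and timeouts as inputs; no model output enters the rule.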
At the clock-kill the production win-probability model sits near 0.97 where the answer is 1 — a residual of about 0.03. Nothing data-driven closes it:
- feature-free win probability, a random walk, and three zero-shot foundation models (Chronos-Bolt, TimesFM, TiRex) all sit near 0.97;
- gradient boosting on the clock features — and the same model handed the exact kneel-out margin — leaves it, as does an MLP;
- a foundation model fed the clock as an exogenous covariate (Chronos-2) leaves it;
- emphasizing the observable late-game region lifts the prediction only partway toward 1 and plateaus short, mostly by raising confidence across the whole region rather than at the clock-kill itself.
Only consuming the deductive label, or applying the rule, closes the residual. The obstacle is localization, not representation: the cell is a tenth of a percent of any rule-free region and pays no rent in average loss, so no loss-minimizing learner carves it out — even when it holds the exact discriminating feature. Once a label points at the cell, the same model fits the boundary and generalizes it to held-out events.
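A rough calculation shows why the cell "pays no rent". It takes the two figures quoted above (a forecast near 0.97 and a cell that is about a tenth of a percent of the region) and asks how much average loss a learner would save by moving the cell all the way to 1. Treating every event in the cell as resolving to 1 and treating the share as exactly 0.001 are simplifying assumptions for illustration.

```python
import math

# Figures taken from the text; treated as exact only for this illustration.
p_model = 0.97       # approximate forecast at the clock-kill
cell_share = 0.001   # "a tenth of a percent" of the surrounding region


def log_loss_when_true(p: float) -> float:
    """Log loss of forecast p on an event that resolves to 1."""
    return -math.log(p)


def brier_when_true(p: float) -> float:
    """Brier score of forecast p on an event that resolves to 1."""
    return (1.0 - p) ** 2


# Loss per event inside the cell. A forecast of exactly 1 would score 0,
# so this is also the most a learner could save per event.
per_event_log = log_loss_when_true(p_model)
per_event_brier = brier_when_true(p_model)

# Largest possible improvement in region-average loss from fixing the cell.
avg_log_gain = cell_share * per_event_log
avg_brier_gain = cell_share * per_event_brier

if __name__ == "__main__":
    print(f"per-event log loss:    {per_event_log:.4f}")
    print(f"average log-loss gain: {avg_log_gain:.2e}")
    print(f"average Brier gain:    {avg_brier_gain:.2e}")
```

Under these assumptions the available gain in average log loss is on the order of 1e-5, which is consistent with the claim that a loss-minimizing learner has little incentive to isolate the cell.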
Full recovery ladder and the commands that produce it: RESULTS.md.
The evaluation is deterministic: a rule-based labeler, fixed seeds, and game-clustered bootstraps. The backtest is walk-forward: each test season is scored by models trained only on earlier seasons.
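The two protocol pieces named above, walk-forward splits and a game-clustered bootstrap with a fixed seed, might look roughly like this. It is a standard-library sketch for illustration; the function names, the `(game_id, value)` row format, and the percentile interval are assumptions and are not the code in `metrics.py`.

```python
import random
from collections import defaultdict
from statistics import mean


def walk_forward_splits(seasons, min_train=1):
    """Yield (train_seasons, test_season); training is strictly earlier."""
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


def game_clustered_bootstrap(rows, n_boot=1000, seed=0):
    """Bootstrap the mean of `value`, resampling whole games.

    rows: iterable of (game_id, value) pairs (hypothetical format).
    Returns the bootstrap means, sorted ascending.
    """
    by_game = defaultdict(list)
    for game_id, value in rows:
        by_game[game_id].append(value)
    games = sorted(by_game)       # fixed order so the seed is reproducible
    rng = random.Random(seed)     # fixed seed makes the result deterministic
    means = []
    for _ in range(n_boot):
        sample = rng.choices(games, k=len(games))  # games, with replacement
        values = [v for g in sample for v in by_game[g]]
        means.append(mean(values))
    return sorted(means)


def percentile_interval(sorted_means, alpha=0.05):
    """Simple percentile interval from sorted bootstrap means."""
    n = len(sorted_means)
    lo = sorted_means[int((alpha / 2) * n)]
    hi = sorted_means[int((1 - alpha / 2) * n) - 1]
    return lo, hi


if __name__ == "__main__":
    for train, test in walk_forward_splits(range(2010, 2014)):
        print(train, "->", test)
    rows = [("g1", 0.97), ("g1", 0.96), ("g2", 0.98), ("g3", 0.95)]
    print(percentile_interval(game_clustered_bootstrap(rows, n_boot=200)))
```

Resampling games rather than plays keeps all plays from one game together, so within-game correlation does not shrink the interval artificially.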
pip install pandas numpy scipy pyarrow scikit-learn matplotlib
# fetch play-by-play into data/incoming/ — see DATA.md
python pilot/12_recoverability.py # recovery ladder: feature-free, GBM, oracle, random walk
python pilot/14_covariate_recovery.py # covariate ladder, incl. the MLP rung
python pilot/16_recovery_audit.py # near-miss calibration + weight sweep
The foundation-model rungs (Chronos-Bolt, TimesFM, TiRex, Chronos-2) need a GPU environment; see DATA.md.
src/deductive_eval/
  clockkill.py     deterministic clock-kill labeler (kneel-out arithmetic, no EP/WP)
  leverage.py      |EPA| / |wpa| leverage pair, frame normalization
  data.py          season load, schema normalization, v0 filtering
  windowing.py     play-indexed windowing + matched controls
  forecasters.py   univariate / covariate forecaster wrappers
  metrics.py       coverage, calibration, game-clustered bootstrap
pilot/             numbered analysis scripts (01–16)
figures/           generated figures
Play-by-play and model weights are fetched from source (nflverse, Hugging Face), not committed. See DATA.md.
Working draft in paper/draft.md: Forecasting With a Known Answer: A Deductive-Ground-Truth Case Study on the Clock-Killing First Down.
Apache License 2.0 — see LICENSE. Copyright 2026 Riel St. Amand.