CanadaApollo6/deductive-eval

deductive-eval

Deductive-ground-truth evaluation of forecasting methods: when an outcome is fixed by arithmetic over observable state, which learners recover that certainty, and which can only reach it by a rule? Foundation models included.

Forecast evaluation usually scores a prediction against a stochastic outcome, so a calibrated belief and a lucky one look alike. This repo builds an evaluation where the answer is known by arithmetic and asks who recovers it.

The test case is the clock-killing first down in NFL play-by-play: a conversion by the leading team late enough that it can kneel the clock to zero given the trailing team's timeouts. Once that condition holds, the game is decided by arithmetic, not by chance. The event is labeled from raw game state (clock, score, timeouts) with no expected-points or win-probability model, so the label is independent of what any forecaster predicts. Across 2010–2024 there are 504 such plays; on the 398 scored out of sample, the leading team won every one.
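As an illustration of what "kneel the clock to zero" means arithmetically, here is a minimal sketch. It is not the logic in `clockkill.py`: the function name, the parameters, and the timing constants (about 2 seconds per kneel, a 40-second runoff between snaps, three kneels after a fresh first down) are assumptions chosen for the example, and it ignores the two-minute warning and other clock stoppages.

```python
def can_kneel_out(seconds_left, trailing_timeouts, lead,
                  kneel_secs=2, play_clock=40, kneels=3):
    """Hypothetical kneel-out check from raw game state only.

    Assumes the leading team has just earned a first down and can take
    `kneels` kneel-downs. Each kneel burns `kneel_secs` of game clock and
    is followed by a `play_clock` runoff, unless the trailing team spends
    a timeout to stop that runoff.
    """
    if lead <= 0:
        return False  # only a team that is ahead can end the game by kneeling
    # Each remaining timeout cancels one between-snap runoff.
    runoffs = max(0, kneels - trailing_timeouts)
    burnable = kneels * kneel_secs + runoffs * play_clock
    return seconds_left <= burnable


# With no timeouts left, about two minutes can be burned under these constants.
print(can_kneel_out(seconds_left=120, trailing_timeouts=0, lead=3))
# Three timeouts cancel every runoff, so the same clock is not yet safe.
print(can_kneel_out(seconds_left=120, trailing_timeouts=3, lead=3))
```

The point of the sketch is that the label depends only on clock, score, and timeouts, so no forecaster output enters it.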

Result

At the clock-kill the production win-probability model sits near 0.97 where the answer is 1 — a residual of about 0.03. Nothing data-driven closes it:

  • feature-free win probability, a random walk, and three zero-shot foundation models (Chronos-Bolt, TimesFM, TiRex) all sit near 0.97;
  • gradient boosting on the clock features leaves the residual in place, even when the same model is handed the exact kneel-out margin, and so does an MLP;
  • a foundation model fed the clock as an exogenous covariate (Chronos-2) also leaves the residual in place;
  • emphasizing the observable late-game region lifts the prediction only partway before it plateaus short of 1, mostly as blanket region-confidence.

Only consuming the deductive label, or applying the rule, closes the residual. The obstacle is localization, not representation: the cell is a tenth of a percent of any rule-free region and pays no rent in average loss, so no loss-minimizing learner carves it out — even when it holds the exact discriminating feature. Once a label points at the cell, the same model fits the boundary and generalizes it to held-out events.
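The "no rent in average loss" argument can be checked with rough arithmetic. The sketch below assumes log loss and Brier score as the training or evaluation losses (the text above does not say which loss each learner minimizes) and uses the figures quoted above: a cell that is about 0.1% of the region, with a forecast of 0.97 where the answer is 1. The function name is hypothetical.

```python
import math


def cell_share_of_mean_loss(cell_fraction, p_at_cell):
    """Mean-loss contribution of forecasting p on a cell whose outcome is always 1.

    Returns the amount by which the region's average log loss and Brier
    score would fall if the forecast on the cell moved from p to exactly 1.
    """
    log_loss = -math.log(p_at_cell)      # per-event log loss when the outcome is 1
    brier = (1.0 - p_at_cell) ** 2       # per-event squared error when the outcome is 1
    return cell_fraction * log_loss, cell_fraction * brier


log_gain, brier_gain = cell_share_of_mean_loss(cell_fraction=0.001, p_at_cell=0.97)
print(f"log loss: {log_gain:.2e}, Brier: {brier_gain:.2e}")
```

Under these assumptions the available gain is on the order of 3e-5 in mean log loss and 1e-6 in mean Brier score, which is small enough to be lost in the noise of an average-loss objective.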

Full recovery ladder and the commands that produce it: RESULTS.md.

Reproduce

The evaluation is deterministic: a rule-based labeler, fixed seeds, game-clustered bootstraps. The backtest is walk-forward: each test season is scored by models trained only on earlier seasons.

pip install pandas numpy scipy pyarrow scikit-learn matplotlib
# fetch play-by-play into data/incoming/ — see DATA.md
python pilot/12_recoverability.py       # recovery ladder: feature-free, GBM, oracle, random walk
python pilot/14_covariate_recovery.py   # covariate ladder, incl. the MLP rung
python pilot/16_recovery_audit.py       # near-miss calibration + weight sweep

The foundation-model rungs (Chronos-Bolt, TimesFM, TiRex, Chronos-2) need a GPU environment — see DATA.md.
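For readers unfamiliar with the walk-forward scheme described in this section, a minimal sketch of the split logic follows. It is illustrative only: the function name and the `min_train` parameter are assumptions, and the pilot scripts may organize their splits differently.

```python
def walk_forward_splits(seasons, min_train=1):
    """Yield (train_seasons, test_season) pairs in chronological order.

    Each test season is paired only with strictly earlier seasons, so no
    information from the test season or later can leak into training.
    `min_train` (hypothetical) is the number of seasons held back purely
    for training before the first test season.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


for train, test in walk_forward_splits(range(2010, 2014)):
    print(f"train on {train[0]}-{train[-1]}, score {test}")
```

Because the earliest seasons serve only as training data, fewer events are scored out of sample than are labeled in total.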

Layout

src/deductive_eval/
  clockkill.py    deterministic clock-kill labeler (kneel-out arithmetic, no EP/WP)
  leverage.py     |EPA| / |WPA| leverage pair, frame normalization
  data.py         season load, schema normalization, v0 filtering
  windowing.py    play-indexed windowing + matched controls
  forecasters.py  univariate / covariate forecaster wrappers
  metrics.py      coverage, calibration, game-clustered bootstrap
pilot/            numbered analysis scripts (01–16)
figures/          generated figures
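The game-clustered bootstrap listed under `metrics.py` resamples whole games rather than individual plays, so plays from the same game stay together. The sketch below shows the general technique using only the standard library. It is not the repository's implementation: the function name, the `(game_id, value)` row format, and the percentile interval are assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def game_clustered_bootstrap(rows, n_boot=1000, seed=0, alpha=0.05):
    """Percentile interval for the mean of `value`, resampling by game.

    rows: iterable of (game_id, value) pairs, e.g. per-play residuals.
    Games are drawn with replacement; every play of a drawn game is kept.
    """
    by_game = defaultdict(list)
    for game_id, value in rows:
        by_game[game_id].append(value)
    games = sorted(by_game)          # fixed order keeps the draw reproducible
    rng = random.Random(seed)        # fixed seed makes the interval deterministic
    stats = []
    for _ in range(n_boot):
        drawn = rng.choices(games, k=len(games))
        stats.append(mean(v for g in drawn for v in by_game[g]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


rows = [("g1", 0.02), ("g1", 0.04), ("g2", 0.03), ("g3", 0.05)]
print(game_clustered_bootstrap(rows, n_boot=200, seed=1))
```

Resampling at the game level keeps within-game correlation from making the interval look narrower than it should be.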

Data & setup

Play-by-play and model weights are fetched from source (nflverse, Hugging Face), not committed. See DATA.md.

Paper

Working draft in paper/draft.md: Forecasting With a Known Answer: A Deductive-Ground-Truth Case Study on the Clock-Killing First Down.

License

Apache License 2.0 — see LICENSE. Copyright 2026 Riel St. Amand.
