Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models
Prakul Sunil Hiremath · Harshit R Hiremath (Aliens on Earth)
This repository contains the complete reproduction pipeline for the CDUR paper.
CDUR is the phenomenon whereby increasing the reasoning budget of a large language model (LLM) first improves and then worsens calibration, producing a non-monotone trajectory in the Expected Calibration Error (ECE) as a function of reasoning budget.
The paper introduces:
- A formal definition of CDUR via a U-shaped $\text{ECE}(B)$ function.
- The Hypothesis Lock-In Model: a mechanistic account of autoregressive reasoning under commitment.
- Empirical evidence on Llama-3.1-8B and Llama-3.3-70B across four reasoning budgets and 21 trap-question categories.
- CABStop: a calibration-aware optimal stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate.
cdur/
├── config/
│ └── default_config.yaml # Model, budget, elicitation, and CABStop parameters
├── src/
│ ├── __init__.py
│ ├── data_loader.py # Reasoning-trap dataset + response validity filter
│ ├── evaluators.py # Evaluation coordinator + calibrated LLM simulator
│ ├── metrics.py # ECE, overconfidence gap, wrong-and-confident count
│ └── cabstop.py # CABStop algorithm (Algorithm 1 from paper)
├── run_pipeline.py # Main entry point
├── requirements.txt
├── Experiments/
│ ├── v1.0.py
│ ├── v1.1.py
│ ├── v1.2.py
│ ├── v1.3.py
│ ├── v2.py
│ └── v3.py
├── Paper/
│ └── 2606.11211v1.pdf
└── README.md
Requires Python 3.10 or later.
git clone https://github.com/prakulhiremath/CDUR.git
cd CDUR
pip install -r requirements.txt
No GPU or API key is required to run the reproduction pipeline. The evaluator uses a deterministic simulator calibrated to the empirical 8B results reported in the paper.
Run the full pipeline with default settings (both models, all four budgets, seeds 1/2/3):
python run_pipeline.py
Run only the 8B model with the none, light, and heavy budgets:
python run_pipeline.py --models llama-3.1-8b --budgets none light heavy
Adjust the CABStop threshold and seeds:
python run_pipeline.py --delta 0.15 --seeds 1 2 3 4 5
Suppress the CABStop demo output:
python run_pipeline.py --no-cabstop
Increase logging verbosity:
python run_pipeline.py --log-level DEBUG
Running python run_pipeline.py prints:
- Results table — ECE (mean ± std across seeds), overconfidence gap, and accuracy per model per budget.
- Smoking gun examples — incorrect responses with confidence ≥ 0.90.
- CABStop demo — per-question stopping decisions for the first three dataset items.
Example results table (abbreviated):
CDUR Reproduction Results — Calibration Drift Under Reasoning
┌──────────────────────┬──────────┬────────────────┬────────────────┬────────────┐
│ Model │ Budget │ ECE (mean±std) │ OG (mean) │ Acc (mean) │
├──────────────────────┼──────────┼────────────────┼────────────────┼────────────┤
│ llama-3.1-8b │ none │ 0.0436 ± 0.015 │ +0.4930 │ 0.4610 │
│ │ light │ 0.1040 ± 0.034 │ +0.2490 │ 0.7320 │
│ │ medium │ 0.0496 ± 0.049 │ +0.3360 │ 0.6530 │
│ │ heavy │ 0.0145 ± 0.005 │ +0.2450 │ 0.7390 │
├──────────────────────┼──────────┼────────────────┼────────────────┼────────────┤
│ llama-3.3-70b │ none │ 0.0352 ± 0.026 │ +0.1550 │ 0.8250 │
│ │ light │ — │ — │ — │
...
The non-monotone ECE trajectory for llama-3.1-8b (none → light → medium → heavy) is the primary empirical signature of CDUR.
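As a quick sanity check, the non-monotonicity can be verified directly from the 8B ECE means in the table above. This is a minimal sketch: the helper `is_monotone` is a name introduced here for illustration and is not part of the repository.

```python
# ECE means for llama-3.1-8b, copied from the example results table.
ece_by_budget = {"none": 0.0436, "light": 0.1040, "medium": 0.0496, "heavy": 0.0145}


def is_monotone(values):
    """Return True if the sequence never changes direction."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


trajectory = list(ece_by_budget.values())
# Budget at which the ECE is largest.
peak_budget = max(ece_by_budget, key=ece_by_budget.get)

print(is_monotone(trajectory), peak_budget)  # False light
```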
All parameters are controlled via config/default_config.yaml.
| Key | Default | Description |
|---|---|---|
| `elicitation.temperature` | `0.7` | Sampling temperature |
| `elicitation.seeds` | `[1, 2, 3]` | Random seeds for variance estimation |
| `cabstop.delta` | `0.10` | Calibration gap threshold for halting |
| `cabstop.max_budget` | `2048` | Maximum token budget before forced stop |
| `cabstop.check_interval` | `128` | Token interval between CABStop checks |
| `cabstop.self_consistency_k` | `5` | Number of samples for auxiliary accuracy estimate |
| `metrics.ece_bins` | `10` | Number of equal-width bins for ECE |
| `metrics.overconfidence_threshold` | `0.90` | Confidence threshold for wrong-and-confident count |
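The dotted keys suggest a nested layout with one section each for elicitation, CABStop, and metrics. The sketch below mirrors the defaults as a plain Python dictionary and resolves dotted keys against it. The nesting is an assumption inferred from the key names rather than read from `default_config.yaml`, and `get_setting` is a hypothetical helper, not a function in the repository.

```python
from functools import reduce

# Defaults copied from the table above. The nesting is assumed from the
# dotted key names, not taken from config/default_config.yaml itself.
DEFAULTS = {
    "elicitation": {"temperature": 0.7, "seeds": [1, 2, 3]},
    "cabstop": {
        "delta": 0.10,
        "max_budget": 2048,
        "check_interval": 128,
        "self_consistency_k": 5,
    },
    "metrics": {"ece_bins": 10, "overconfidence_threshold": 0.90},
}


def get_setting(config, dotted_key):
    """Hypothetical helper: resolve 'section.name' against a nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), config)


# With these defaults, a run that is never halted early passes this many
# checkpoints (128, 256, ..., 2048) before the forced stop.
checkpoints = get_setting(DEFAULTS, "cabstop.max_budget") // get_setting(
    DEFAULTS, "cabstop.check_interval"
)
print(checkpoints)  # 16
```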
`src/data_loader.py` contains 25 hardcoded reasoning-trap questions across 19 categories (counting, set_theory, spatial, semantic, probability, syllogism, algebra, modular, operator_precedence, percentage, compound, contrapositive, anchor, combinatorics, relative_motion, conditional_prob, exponential, mixture, pattern). Each question is a `TrapQuestion` dataclass with `id`, `category`, `question`, and `expected_answer`. The module also includes a regex-based response validity filter.
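A minimal sketch of how such a dataclass and validity filter could look. The four field names come from the description above. Everything else is an assumption made for illustration: the field types, the two-line `Answer:` / `Confidence:` response format, the regular expressions, and the function name `parse_response`. The sample question is not taken from the dataset.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapQuestion:
    id: str  # field types are assumed; only the names come from the README
    category: str
    question: str
    expected_answer: str


# Assumed response format: one "Answer: ..." line and one "Confidence: ..." line.
_ANSWER_RE = re.compile(r"answer\s*:\s*(.+)", re.IGNORECASE)
_CONF_RE = re.compile(r"confidence\s*:\s*([01](?:\.\d+)?)\b", re.IGNORECASE)


def parse_response(text):
    """Return (answer, confidence), or None if the response is not valid."""
    answer = _ANSWER_RE.search(text)
    conf = _CONF_RE.search(text)
    if answer is None or conf is None:
        return None
    value = float(conf.group(1))
    if not 0.0 <= value <= 1.0:  # rejects values such as 1.5
        return None
    return answer.group(1).strip(), value


# Illustrative item only; not one of the 25 questions in the repository.
example = TrapQuestion("demo-1", "counting", "How many letters are in 'seven'?", "5")
parsed = parse_response("Answer: 5\nConfidence: 0.95")
print(parsed, parsed[0] == example.expected_answer)  # ('5', 0.95) True
```

Responses that omit either line, or that report a confidence outside [0, 1], are filtered out by returning `None`.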
`src/metrics.py` implements:
- `calculate_ece(confidences, accuracies, num_bins)`: equal-width binning, empty-bin safe.
- `calculate_overconfidence_gap(confidences, accuracies)`: mean confidence minus mean accuracy.
- `compute_metrics(...)`: returns a `CalibrationMetrics` dataclass.
- `aggregate_metrics_across_seeds(...)`: mean and std across seed runs.
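The functions listed above can be sketched with the standard library alone. This is an illustrative reimplementation under stated assumptions, not the repository's code: the bin edge convention (half-open bins, with confidence 1.0 clamped into the last bin), the use of the sample standard deviation, and the helper names `wrong_and_confident` and `mean_and_std` are all choices made here.

```python
from statistics import mean, stdev


def calculate_ece(confidences, accuracies, num_bins=10):
    """Equal-width ECE: sum over bins of (bin share) * |mean acc - mean conf|."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(num_bins)]
    for conf, acc in zip(confidences, accuracies):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, acc))
    ece = 0.0
    for members in bins:
        if not members:  # empty bins contribute nothing
            continue
        bin_conf = mean(c for c, _ in members)
        bin_acc = mean(a for _, a in members)
        ece += len(members) / n * abs(bin_acc - bin_conf)
    return ece


def calculate_overconfidence_gap(confidences, accuracies):
    """Mean confidence minus mean accuracy."""
    return mean(confidences) - mean(accuracies)


def wrong_and_confident(confidences, accuracies, threshold=0.90):
    """Count incorrect responses whose confidence is at or above the threshold."""
    return sum(1 for c, a in zip(confidences, accuracies) if a == 0 and c >= threshold)


def mean_and_std(values):
    """Aggregate one metric across seeds (sample std; 0.0 for a single seed)."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


confidences = [0.95, 0.95, 0.55, 0.55]
accuracies = [1, 0, 1, 0]
print(round(calculate_ece(confidences, accuracies), 4))  # 0.25
print(round(calculate_overconfidence_gap(confidences, accuracies), 4))  # 0.25
print(wrong_and_confident(confidences, accuracies))  # 1
```

In the toy data, the two responses at confidence 0.95 are right half the time (gap 0.45) and the two at 0.55 are also right half the time (gap 0.05). Each bin holds half the samples, so the ECE is 0.5 × 0.45 + 0.5 × 0.05 = 0.25.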
`src/evaluators.py` provides a deterministic simulator calibrated to match the empirical dynamics of Llama-3.1-8B:
- `none` budget: moderate accuracy (0.46), highly volatile confidence, many confident-wrong responses.
- `light` budget: higher accuracy (0.73) with the Hypothesis Lock-In signature, where confidence is inflated to ~1.0.
- `medium` budget: slightly lower accuracy (0.65) with instability.
- `heavy` budget: highest accuracy (0.74), near-maximum confidence, lowest ECE.
For the 70B model, the simulator applies a scale correction factor (+35% accuracy) relative to 8B.
`src/cabstop.py` implements Algorithm 1 from the paper. At each `check_interval` token checkpoint, it:
- Extracts the candidate answer and confidence via `inference_fn(t)`.
- Estimates auxiliary accuracy via simulated self-consistency over `k` samples.
- Halts and returns if `confidence - auxiliary_accuracy > delta`.
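The three steps above can be sketched as a short loop. This is a hedged illustration, not the paper's Algorithm 1 verbatim: `sample_fn` is a hypothetical callable that draws one resampled answer, and using the fraction of `k` samples that agree with the candidate as the auxiliary accuracy estimate is an assumption about how self-consistency is scored.

```python
from collections import Counter
from itertools import cycle


def cabstop(inference_fn, sample_fn, delta=0.10, max_budget=2048,
            check_interval=128, k=5):
    """Return (answer, confidence, tokens_used, halted_early)."""
    answer, confidence = None, 0.0
    for t in range(check_interval, max_budget + 1, check_interval):
        answer, confidence = inference_fn(t)
        # Auxiliary accuracy: share of k resampled answers agreeing with the candidate.
        samples = [sample_fn(t) for _ in range(k)]
        auxiliary_accuracy = Counter(samples)[answer] / k
        if confidence - auxiliary_accuracy > delta:
            return answer, confidence, t, True  # calibration gap exceeded
    return answer, confidence, max_budget, False  # forced stop


def toy_inference(t):
    # Toy model: confidence climbs with the budget while the answer stays fixed.
    return "42", min(1.0, 0.5 + t / 1024)


# Deterministic toy sampler: 3 of every 5 draws agree with "42".
_draws = cycle(["42", "42", "42", "41", "40"])


def toy_sample(t):
    return next(_draws)


print(cabstop(toy_inference, toy_sample))  # ('42', 0.75, 256, True)
```

In the toy run, the gap at 128 tokens is 0.625 - 0.6 = 0.025, which is below `delta`. At 256 tokens the gap is 0.75 - 0.6 = 0.15, so the rule halts there.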
`run_pipeline.py` is the CLI entry point. It parses arguments, loads the config, calls `run_evaluation`, computes aggregate metrics, and prints formatted ASCII tables.
The simulator in src/evaluators.py is parameterized to reproduce the key empirical observations from Table A.1 of the paper:
| Model | Budget | ECE (paper) | OG (paper) | Acc (paper) |
|---|---|---|---|---|
| 8B | none | 0.0436 ± 0.015 | +0.493 | 0.461 |
| 8B | light | 0.1040 ± 0.034 | +0.249 | 0.732 |
| 8B | medium | 0.0496 ± 0.049 | +0.336 | 0.653 |
| 8B | heavy | 0.0145 ± 0.005 | +0.245 | 0.739 |
| 70B | none | 0.0352 ± 0.026 | +0.155 | 0.825 |
To run against a real LLM API, replace the _simulate_response function in src/evaluators.py with a call to your preferred inference endpoint. The ModelResponse dataclass and parse_and_validate_response function in src/data_loader.py handle response validation and confidence extraction automatically.
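One way to structure that replacement is to inject the backend as a callable, so the pipeline code does not depend on any particular provider. The sketch below is an assumption-heavy illustration: the signature of `_simulate_response` is not shown in this README, so `query_model`, `PROMPT_TEMPLATE`, the `complete` callable, and its `(prompt, temperature, max_tokens)` parameters are all hypothetical names. The stub backend stands in for a real endpoint.

```python
# Hypothetical prompt asking for an answer and a verbalized confidence.
PROMPT_TEMPLATE = (
    "{question}\n\n"
    "Reply on two lines:\n"
    "Answer: <your answer>\n"
    "Confidence: <probability between 0 and 1>"
)


def query_model(question, complete, temperature=0.7, max_tokens=2048):
    """Build the prompt and return the raw completion text.

    `complete` is any callable taking (prompt, temperature, max_tokens)
    and returning a string; wrap your inference client to fit this shape.
    """
    prompt = PROMPT_TEMPLATE.format(question=question)
    return complete(prompt, temperature=temperature, max_tokens=max_tokens)


def stub_backend(prompt, temperature, max_tokens):
    # Stand-in for a real endpoint so the sketch runs without network access.
    return "Answer: 5\nConfidence: 0.80"


raw = query_model("How many letters are in 'seven'?", stub_backend)
print(raw.splitlines())  # ['Answer: 5', 'Confidence: 0.80']
```

The raw text returned here would then be passed to the repository's own validation step rather than parsed by hand.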
If you use this codebase or build on the CDUR framework, please cite:
@article{hiremath2025calibration,
author = {Hiremath, Prakul Sunil and Hiremath, Harshit R.},
title = {Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models},
journal = {arXiv preprint arXiv:2606.11211},
year = {2025},
doi = {10.5281/zenodo.19709379},
url = {https://arxiv.org/abs/2606.11211},
note = {Code available at \url{https://github.com/prakulhiremath/CDUR}. Post commentary available at Medium (\url{https://medium.com/@prakulhiremath/the-ai-that-thinks-too-hard-and-gets-dangerously-wrong-7f3c32e62864})}
}
MIT License. See LICENSE for details.