CDUR: Calibration Drift Under Reasoning

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

Prakul Sunil Hiremath · Harshit R Hiremath (Aliens on Earth)

Overview

This repository contains the complete reproduction pipeline for the CDUR paper.

CDUR is the phenomenon whereby increasing the reasoning budget of a large language model (LLM) first improves and then worsens calibration, producing a non-monotone trajectory in the Expected Calibration Error (ECE) as a function of reasoning budget $B$.

The paper introduces:

A formal definition of CDUR via a U-shaped $\text{ECE}(B)$ function.
The Hypothesis Lock-In Model: a mechanistic account of autoregressive reasoning under commitment.
Empirical evidence on Llama-3.1-8B and Llama-3.3-70B across four reasoning budgets and 21 trap-question categories.
CABStop: a calibration-aware optimal stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate.

Repository Structure

cdur/
├── config/
│   └── default_config.yaml     # Model, budget, elicitation, and CABStop parameters
├── src/
│   ├── __init__.py
│   ├── data_loader.py          # Reasoning-trap dataset + response validity filter
│   ├── evaluators.py           # Evaluation coordinator + calibrated LLM simulator
│   ├── metrics.py              # ECE, overconfidence gap, wrong-and-confident count
│   └── cabstop.py              # CABStop algorithm (Algorithm 1 from paper)
├── run_pipeline.py             # Main entry point
├── requirements.txt
├── Experiments/
│   ├── v1.0.py       
│   ├── v1.1.py         
│   ├── v1.2.py         
│   ├── v1.3.py          
│   ├── v2.py             
│   └── v3.py
├── Paper
│   ├── 2606.11211v1.pdf   
└── README.md

Installation

Requires Python 3.10 or later.

git clone https://github.com/prakulhiremath/CDUR.git
cd CDUR
pip install -r requirements.txt

No GPU or API key is required to run the reproduction pipeline. The evaluator uses a deterministic simulator calibrated to the empirical 8B results reported in the paper.

Quickstart

Run the full pipeline with default settings (both models, all four budgets, seeds 1/2/3):

python run_pipeline.py

Run only the 8B model with light and heavy budgets:

python run_pipeline.py --models llama-3.1-8b --budgets none light heavy

Adjust CABStop threshold and seeds:

python run_pipeline.py --delta 0.15 --seeds 1 2 3 4 5

Suppress CABStop demo output:

python run_pipeline.py --no-cabstop

Increase logging verbosity:

python run_pipeline.py --log-level DEBUG

Expected Output

Running python run_pipeline.py prints:

Results table — ECE (mean ± std across seeds), overconfidence gap, and accuracy per model per budget.
Smoking gun examples — incorrect responses with confidence ≥ 0.90.
CABStop demo — per-question stopping decisions for the first three dataset items.

Example results table (abbreviated):

  CDUR Reproduction Results — Calibration Drift Under Reasoning

  ┌──────────────────────┬──────────┬────────────────┬────────────────┬────────────┐
  │ Model                │ Budget   │ ECE (mean±std) │ OG (mean)      │ Acc (mean) │
  ├──────────────────────┼──────────┼────────────────┼────────────────┼────────────┤
  │ llama-3.1-8b         │ none     │ 0.0436 ± 0.015 │ +0.4930        │ 0.4610     │
  │                      │ light    │ 0.1040 ± 0.034 │ +0.2490        │ 0.7320     │
  │                      │ medium   │ 0.0496 ± 0.049 │ +0.3360        │ 0.6530     │
  │                      │ heavy    │ 0.0145 ± 0.005 │ +0.2450        │ 0.7390     │
  ├──────────────────────┼──────────┼────────────────┼────────────────┼────────────┤
  │ llama-3.3-70b        │ none     │ 0.0352 ± 0.026 │ +0.1550        │ 0.8250     │
  │                      │ light    │ —              │ —              │ —          │
  ...

The non-monotone ECE trajectory for llama-3.1-8b (none → light → medium → heavy) is the primary empirical signature of CDUR.

Configuration

All parameters are controlled via config/default_config.yaml.

Key	Default	Description
`elicitation.temperature`	`0.7`	Sampling temperature
`elicitation.seeds`	`[1, 2, 3]`	Random seeds for variance estimation
`cabstop.delta`	`0.10`	Calibration gap threshold for halting
`cabstop.max_budget`	`2048`	Maximum token budget before forced stop
`cabstop.check_interval`	`128`	Token interval between CABStop checks
`cabstop.self_consistency_k`	`5`	Number of samples for auxiliary accuracy estimate
`metrics.ece_bins`	`10`	Number of equal-width bins for ECE
`metrics.overconfidence_threshold`	`0.90`	Confidence threshold for wrong-and-confident count

Modules

`src/data_loader.py`

Contains 25 hardcoded reasoning-trap questions across 15 categories (counting, set_theory, spatial, semantic, probability, syllogism, algebra, modular, operator_precedence, percentage, compound, contrapositive, anchor, combinatorics, relative_motion, conditional_prob, exponential, mixture, pattern). Each question is a TrapQuestion dataclass with id, category, question, and expected_answer. Includes a regex-based response validity filter.

`src/metrics.py`

Implements:

calculate_ece(confidences, accuracies, num_bins) — equal-width binning, empty-bin safe.
calculate_overconfidence_gap(confidences, accuracies) — mean confidence minus mean accuracy.
compute_metrics(...) — returns a CalibrationMetrics dataclass.
aggregate_metrics_across_seeds(...) — mean and std across seed runs.

`src/evaluators.py`

Deterministic simulator calibrated to match the empirical dynamics of Llama-3.1-8B:

none budget: moderate accuracy (0.46), highly volatile confidence, many confident-wrong responses.
light budget: higher accuracy (0.73) with Hypothesis Lock-In signature — confidence inflated to ~1.0.
medium budget: slightly lower accuracy (0.65) with instability.
heavy budget: highest accuracy (0.74), near-maximum confidence, lowest ECE.

70B model applies a scale correction factor (+35% accuracy) relative to 8B.

`src/cabstop.py`

Implements Algorithm 1 from the paper. At each check_interval token checkpoint:

Extracts candidate answer and confidence via inference_fn(t).
Estimates auxiliary accuracy via simulated self-consistency over k samples.
Halts and returns if confidence − auxiliary_accuracy > delta.

`run_pipeline.py`

CLI entry point. Parses arguments, loads config, calls run_evaluation, computes aggregate metrics, and prints formatted ASCII tables.

Reproducing Paper Results

The simulator in src/evaluators.py is parameterized to reproduce the key empirical observations from Table A.1 of the paper:

Model	Budget	ECE (paper)	OG (paper)	Acc (paper)
8B	none	0.0436 ± 0.015	+0.493	0.461
8B	light	0.1040 ± 0.034	+0.249	0.732
8B	medium	0.0496 ± 0.049	+0.336	0.653
8B	heavy	0.0145 ± 0.005	+0.245	0.739
70B	none	0.0352 ± 0.026	+0.155	0.825

Live API Evaluation

To run against a real LLM API, replace the _simulate_response function in src/evaluators.py with a call to your preferred inference endpoint. The ModelResponse dataclass and parse_and_validate_response function in src/data_loader.py handle response validation and confidence extraction automatically.

Citation

If you use this codebase or build on the CDUR framework, please cite:

@article{hiremath2025calibration,
  author    = {Hiremath, Prakul Sunil and Hiremath, Harshit R.},
  title     = {Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models},
  journal   = {arXiv preprint arXiv:2606.11211},
  year      = {2025},
  doi       = {10.5281/zenodo.19709379},
  url       = {https://arxiv.org/abs/2606.11211},
  note      = {Code available at \url{https://github.com/prakulhiremath/CDUR}. Post commentary available at Medium (\url{https://medium.com/@prakulhiremath/the-ai-that-thinks-too-hard-and-gets-dangerously-wrong-7f3c32e62864})}
}

License

MIT License. See LICENSE for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CDUR: Calibration Drift Under Reasoning

Overview

Repository Structure

Installation

Quickstart

Expected Output

Configuration

Modules

`src/data_loader.py`

`src/metrics.py`

`src/evaluators.py`

`src/cabstop.py`

`run_pipeline.py`

Reproducing Paper Results

Live API Evaluation

Citation

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
Experiments		Experiments
Paper		Paper
config		config
src		src
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
citation.cff		citation.cff
index.html		index.html
requirements.txt		requirements.txt
run_pipeline.py		run_pipeline.py

Folders and files

Latest commit

History

Repository files navigation

CDUR: Calibration Drift Under Reasoning

Overview

Repository Structure

Installation

Quickstart

Expected Output

Configuration

Modules

src/data_loader.py

src/metrics.py

src/evaluators.py

src/cabstop.py

run_pipeline.py

Reproducing Paper Results

Live API Evaluation

Citation

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`src/data_loader.py`

`src/metrics.py`

`src/evaluators.py`

`src/cabstop.py`

`run_pipeline.py`

Packages