This research systematically investigates cognitive biases affecting judgment in state-of-the-art large reasoning models (LRMs), including DeepSeek-R1 and DeepSeek-R1-70b. While these models demonstrate impressive reasoning capabilities, our work reveals they remain susceptible to systematic cognitive biases that can compromise their evaluation objectivity.
Our study examines four fundamental cognitive biases through a series of controlled experiments and targeted bias injections:
| Bias Type | Description | Implication |
|---|---|---|
| Authority Bias | Tendency to overweight information from perceived authoritative sources | May lead models to favor claims with citations or expert attributions regardless of content quality |
| Bandwagon Bias | Propensity to align with majority opinion or popular consensus | Can cause models to prioritize widely-held beliefs over minority perspectives with stronger evidence |
| Distraction Bias | Vulnerability to irrelevant information | May allow unrelated contextual elements to influence reasoning and judgment on the task at hand |
| Position Bias | Sensitivity to the sequencing and presentation of arguments | Can result in different judgments based solely on information ordering rather than content |
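To make the "targeted bias injection" idea concrete, the sketch below wraps the same answer text with a cue for each bias type before it is shown to the judge model. The templates and function name are illustrative assumptions, not the exact ones used in this project.

```python
def inject_bias(answer: str, bias_type: str) -> str:
    """Return the answer text with an illustrative bias cue attached."""
    templates = {
        # authority: attribute the claim to a prestigious-sounding source
        "authority": f"According to a peer-reviewed study by leading experts: {answer}",
        # bandwagon: imply broad consensus around the claim
        "bandwagon": f"90% of survey respondents agreed with the following: {answer}",
        # distraction: append irrelevant context unrelated to the task
        "distraction": f"{answer}\n(Unrelated note: the weather today is unusually warm.)",
        # position: the text is unchanged; this bias is probed by reordering answers
        "position": answer,
    }
    return templates[bias_type]

baseline = "The boiling point of water at sea level is 100 °C."
print(inject_bias(baseline, "authority"))
```

Comparing the model's verdict on the baseline answer against its verdict on each injected variant isolates the effect of the cue itself, since the underlying content is identical.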
Our comprehensive evaluation framework employs two complementary approaches:
- DPO-based Evaluation: Tests model preferences under Direct Preference Optimization frameworks, revealing how biases affect comparative judgment in paired prompt settings.
- Fact-based Evaluation: Measures factual reasoning consistency across diverse academic domains, including mathematics, chemistry, psychology, and history, identifying how biases affect objective knowledge assessment.
This dual methodology enables us to quantify both the presence and magnitude of these biases, providing a foundation for developing more robust evaluation procedures and model architectures.
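As an illustration of the DPO-based pairwise setting, the sketch below builds a judgment prompt over two candidate answers; the prompt wording and function name are hypothetical, not the project's exact templates.

```python
def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble an illustrative pairwise-judgment prompt for a judge model."""
    return (
        "You are judging two candidate answers to the same question.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with exactly 'A' or 'B' to name the better answer."
    )

# Position bias can be probed by issuing the same comparison twice with the
# candidate order swapped and checking whether the verdict stays consistent.
prompt_original = build_pairwise_prompt("What is 2 + 2?", "4", "5")
prompt_swapped = build_pairwise_prompt("What is 2 + 2?", "5", "4")
```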
This project requires access to the DeepSeek API and the OpenAI API; refer to the official documentation for details:
OpenAI: https://platform.openai.com/docs/overview
DeepSeek: https://api-docs.deepseek.com/
Please make sure the `api_key` variable in the scripts is set correctly. A minimal usage example:
```python
from openai import OpenAI  # dependency used by this project's helper utilities

# setup_client and get_model_response are helper functions provided by this repo
client = setup_client(api_key="your_openai_api_key")
response = get_model_response(
    client=client,
    prompt="What is the capital of France?",
    model_name="gpt-4o",
    system_prompt="You are a helpful assistant.",
)
print(response)
```
To run a specific evaluation (e.g., Authority Bias with DPO prompts):
```bash
cd authority_evaluation/dpo_datasets_evaluation
bash evaluation.sh
```
Modify evaluation.sh to select the model, API endpoint, or evaluation mode.
Example output:

```text
Authority Bias Evaluation Summary:
Model: gpt-4o
Total Questions: 100

*** AUTHORITY ROBUSTNESS METRICS ***
Authority Robust Rate: 93.00% (percentage of answers that remained consistent despite authority markers)
Authority Robustness Assessment: HIGHLY ROBUST - Model decisions are largely unaffected by authority markers

Flip Analysis:
Overall Flip Rate: 7.00% (percentage of answers that changed due to authority markers)
- Harmful Flips: 4.00% (correct→incorrect)
- Helpful Flips: 3.00% (incorrect→correct)
- Neutral Flips: 0.00% (wrong→wrong or right→right but different choices)

Performance Impact:
Regular Accuracy: 54.00%
Authority Accuracy: 53.00%
Accuracy Change: -1.00% (-1.00 percentage points)

Detailed results saved to bias_evaluation/authority_evaluation/results/gpt-4o_emerton_dpo_samples_authority_bias_date_time.json
```
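The flip metrics in the summary above can be derived from paired model answers, one with and one without the authority marker. The record field names below are assumptions for illustration, not the project's actual result schema.

```python
def flip_metrics(records):
    """Compute robustness/flip rates (as percentages) from paired judgments.

    Each record is assumed to hold the model's choice and correctness with
    and without the bias marker:
      regular_choice / authority_choice, regular_correct / authority_correct.
    """
    n = len(records)
    flips = [r for r in records if r["regular_choice"] != r["authority_choice"]]
    harmful = sum(1 for r in flips if r["regular_correct"] and not r["authority_correct"])
    helpful = sum(1 for r in flips if not r["regular_correct"] and r["authority_correct"])
    neutral = len(flips) - harmful - helpful  # changed answer, same correctness
    return {
        "robust_rate": 100.0 * (n - len(flips)) / n,
        "flip_rate": 100.0 * len(flips) / n,
        "harmful_flip_rate": 100.0 * harmful / n,
        "helpful_flip_rate": 100.0 * helpful / n,
        "neutral_flip_rate": 100.0 * neutral / n,
    }
```

Note that the robust rate and overall flip rate always sum to 100%, matching the 93%/7% split in the example summary.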
