A replication of "LLM Evaluators Recognize and Favor Their Own Generations" (Panickssery et al., 2024) testing whether language models can identify and prefer their own generated text.
This repository tests whether OpenAI's GPT-4.1-Nano exhibits self-recognition and self-preference when evaluating text summaries. Specifically, we measure:
- Self-Recognition: Can the model identify which summaries it wrote?
- Self-Preference: Does the model prefer its own summaries over human-written ones?
- Recognition Bias: Is the model more likely to claim authorship of its own work?
- Quality Rating Bias: Does the model rate its own summaries higher?
Key Finding: GPT-4.1-Nano shows strong self-recognition with 92-96% detection accuracy and 99% self-preference.
- CNN Daily Mail (1,000 articles)
- XSum (1,000 articles)
Each article has three summaries:
- GPT-4.1-Nano generated
- Human written
- Claude 3 Haiku generated (not analyzed due to lack of logprobs)
Task: Given two summaries, which one did the model write?
- Forward comparison: GPT summary vs Human summary
- Backward comparison: Human summary vs GPT summary
- Uses log probabilities to measure confidence
Task: Which summary is better quality?
- Compares GPT's own summaries against human summaries
- Measures preference through logprob analysis
Task: Binary yes/no - "Did I write this summary?"
- Shows model its own summaries
- Shows model human summaries
- Compares "Yes" probabilities
Task: Rate each summary on a 1-5 scale
- Model rates its own summaries
- Model rates human summaries
- Compares expected scores from probability distributions
All experiments use log probabilities (logprobs) from the model's output distribution, not just the final answer. This captures confidence:
- "Yes" with 99% confidence vs "Yes" with 51% confidence are very different
- Enables precise measurement of self-recognition strength
Note: Anthropic Claude models cannot be properly evaluated because their API doesn't provide logprobs.
- Python 3.8+
- OpenAI API key
- uv package manager
# Clone the repository
git clone <repository-url>
cd self-recognition-replication
# Install dependencies using uv
uv pip install -e .
# Set API key
export OPENAI_API_KEY=your_key_hereself-recognition-replication/
βββ config/
β βββ settings.py # Configuration
βββ models/
β βββ openai_provider.py # OpenAI API integration
β βββ anthropic_provider.py # Anthropic API integration
βββ experiments/
β βββ experiemnts.py # Core experiment logic
βββ prompts/
β βββ detection.j2 # Detection prompt template
β βββ comparison.j2 # Comparison prompt template
β βββ recognition.j2 # Recognition prompt template
β βββ scoring.j2 # Scoring prompt template
βββ utils/
β βββ data_loader.py # Data loading utilities
βββ articles/ # Article datasets
βββ summaries/ # Generated summaries
βββ results/ # Experiment results
βββ main.py # Main experiment runner
βββ analyze_openai_vs_human.py # Statistical analysis
βββ visualize_openai_vs_human.py # Generate visualizations
Test with small sample:
python test_small_experiment.py claude-3-haiku-20240307 cnn 3Run full experiments:
python main.pyAnalyze results:
python analyze_openai_vs_human.py
python visualize_openai_vs_human.py| Metric | CNN | XSum | Interpretation |
|---|---|---|---|
| Detection Score | 92.05% | 96.24% | Can identify own summaries |
| Self-Preference | 98.92% | 99.12% | Strongly prefers own summaries |
| Recognition Bias | +24.83% | +31.19% | Higher recognition of own work |
| Quality Rating Bias | +0.56 pts | +1.26 pts | Rates own summaries higher |
The model can reliably distinguish its own writing from human writing.
| Dataset | Score | Interpretation |
|---|---|---|
| CNN | 92.05% | Strong self-recognition |
| XSum | 96.24% | Strong self-recognition |
The model exhibits overwhelming preference for its own summaries.
| Dataset | Score | Interpretation |
|---|---|---|
| CNN | 98.92% | Strongly prefers own work |
| XSum | 99.12% | Strongly prefers own work |
The model is 25-31% more likely to claim authorship of its own summaries.
| Dataset | Own Summaries | Human Summaries | Difference |
|---|---|---|---|
| CNN | 71.60% | 46.77% | +24.83% |
| XSum | 56.80% | 25.61% | +31.19% |
The model consistently rates its own summaries higher by 0.6-1.3 points.
| Dataset | Own Summaries | Human Summaries | Difference |
|---|---|---|---|
| CNN | 3.96/5 | 3.41/5 | +0.56 |
| XSum | 3.84/5 | 2.58/5 | +1.26 |
All visualizations are high-resolution PNG files saved in the results/ directory:
- Main Analysis Dashboard - 4-panel comparison showing all metrics
- Bias Analysis - Magnitude of self-preference bias
- CNN Distributions - Score distributions for CNN dataset
- XSum Distributions - Score distributions for XSum dataset
- Summary Statistics CSV - All metrics in table format
- Detailed Analysis JSON - Complete statistical analysis
- Raw experiment results in
results/cnn/andresults/xsum/
GPT-4.1-Nano demonstrates strong self-recognition capabilities:
- β Can detect own outputs with >92% accuracy
- β Prefers own outputs with >98% consistency
- β Recognizes own authorship with 25-31% higher probability
- β Rates own quality higher by 0.6-1.3 points
@article{panickssery2024llm,
title={LLM Evaluators Recognize and Favor Their Own Generations},
author={Panickssery, Arjun and Bowman, Samuel R and Feng, Shi},
journal={arXiv preprint arXiv:2404.13076},
year={2024}
}