Skip to content

ish-codes-magic/self-recognition-replication

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

LLM Self-Recognition Experiments

A replication of "LLM Evaluators Recognize and Favor Their Own Generations" (Panickssery et al., 2024) testing whether language models can identify and prefer their own generated text.


🎯 Goal

This repository tests whether OpenAI's GPT-4.1-Nano exhibits self-recognition and self-preference when evaluating text summaries. Specifically, we measure:

  1. Self-Recognition: Can the model identify which summaries it wrote?
  2. Self-Preference: Does the model prefer its own summaries over human-written ones?
  3. Recognition Bias: Is the model more likely to claim authorship of its own work?
  4. Quality Rating Bias: Does the model rate its own summaries higher?

Key Finding: GPT-4.1-Nano shows strong self-recognition with 92-96% detection accuracy and 99% self-preference.


πŸ”¬ Experiment Setup

Datasets

  • CNN Daily Mail (1,000 articles)
  • XSum (1,000 articles)

Each article has three summaries:

  • GPT-4.1-Nano generated
  • Human written
  • Claude 3 Haiku generated (not analyzed due to lack of logprobs)

Four Experiments

1. Detection Experiment

Task: Given two summaries, which one did the model write?

  • Forward comparison: GPT summary vs Human summary
  • Backward comparison: Human summary vs GPT summary
  • Uses log probabilities to measure confidence

2. Self-Preference Experiment

Task: Which summary is better quality?

  • Compares GPT's own summaries against human summaries
  • Measures preference through logprob analysis

3. Recognition Experiment

Task: Binary yes/no - "Did I write this summary?"

  • Shows model its own summaries
  • Shows model human summaries
  • Compares "Yes" probabilities

4. Quality Scoring Experiment

Task: Rate each summary on a 1-5 scale

  • Model rates its own summaries
  • Model rates human summaries
  • Compares expected scores from probability distributions

Why Log Probabilities?

All experiments use log probabilities (logprobs) from the model's output distribution, not just the final answer. This captures confidence:

  • "Yes" with 99% confidence vs "Yes" with 51% confidence are very different
  • Enables precise measurement of self-recognition strength

Note: Anthropic Claude models cannot be properly evaluated because their API doesn't provide logprobs.


πŸ› οΈ Setup

Prerequisites

  • Python 3.8+
  • OpenAI API key
  • uv package manager

Installation

# Clone the repository
git clone <repository-url>
cd self-recognition-replication

# Install dependencies using uv
uv pip install -e .

# Set API key
export OPENAI_API_KEY=your_key_here

Repository Structure

self-recognition-replication/
β”œβ”€β”€ config/
β”‚   └── settings.py           # Configuration
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ openai_provider.py    # OpenAI API integration
β”‚   └── anthropic_provider.py # Anthropic API integration
β”œβ”€β”€ experiments/
β”‚   └── experiemnts.py        # Core experiment logic
β”œβ”€β”€ prompts/
β”‚   β”œβ”€β”€ detection.j2          # Detection prompt template
β”‚   β”œβ”€β”€ comparison.j2         # Comparison prompt template
β”‚   β”œβ”€β”€ recognition.j2        # Recognition prompt template
β”‚   └── scoring.j2            # Scoring prompt template
β”œβ”€β”€ utils/
β”‚   └── data_loader.py        # Data loading utilities
β”œβ”€β”€ articles/                 # Article datasets
β”œβ”€β”€ summaries/                # Generated summaries
β”œβ”€β”€ results/                  # Experiment results
β”œβ”€β”€ main.py                   # Main experiment runner
β”œβ”€β”€ analyze_openai_vs_human.py    # Statistical analysis
└── visualize_openai_vs_human.py  # Generate visualizations

Running Experiments

Test with small sample:

python test_small_experiment.py claude-3-haiku-20240307 cnn 3

Run full experiments:

python main.py

Analyze results:

python analyze_openai_vs_human.py
python visualize_openai_vs_human.py

πŸ“Š Findings

Summary Results

Metric CNN XSum Interpretation
Detection Score 92.05% 96.24% Can identify own summaries
Self-Preference 98.92% 99.12% Strongly prefers own summaries
Recognition Bias +24.83% +31.19% Higher recognition of own work
Quality Rating Bias +0.56 pts +1.26 pts Rates own summaries higher

Detailed Results

1. Detection Experiment βœ…

The model can reliably distinguish its own writing from human writing.

Dataset Score Interpretation
CNN 92.05% Strong self-recognition
XSum 96.24% Strong self-recognition

2. Self-Preference Experiment βœ…

The model exhibits overwhelming preference for its own summaries.

Dataset Score Interpretation
CNN 98.92% Strongly prefers own work
XSum 99.12% Strongly prefers own work

3. Recognition Experiment βœ…

The model is 25-31% more likely to claim authorship of its own summaries.

Dataset Own Summaries Human Summaries Difference
CNN 71.60% 46.77% +24.83%
XSum 56.80% 25.61% +31.19%

4. Quality Rating Experiment βœ…

The model consistently rates its own summaries higher by 0.6-1.3 points.

Dataset Own Summaries Human Summaries Difference
CNN 3.96/5 3.41/5 +0.56
XSum 3.84/5 2.58/5 +1.26

Visualizations

All visualizations are high-resolution PNG files saved in the results/ directory:

Data Files


πŸ” Key Conclusions

GPT-4.1-Nano demonstrates strong self-recognition capabilities:

  1. βœ… Can detect own outputs with >92% accuracy
  2. βœ… Prefers own outputs with >98% consistency
  3. βœ… Recognizes own authorship with 25-31% higher probability
  4. βœ… Rates own quality higher by 0.6-1.3 points

πŸ“š Citation

@article{panickssery2024llm,
  title={LLM Evaluators Recognize and Favor Their Own Generations},
  author={Panickssery, Arjun and Bowman, Samuel R and Feng, Shi},
  journal={arXiv preprint arXiv:2404.13076},
  year={2024}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages