AI² — Artificial Intelligence Squared

LLM Debate Benchmark: The top 10 frontier AI models debate head-to-head in Intelligence Squared format, judged by AI jury panels.

🏗 Architecture

Models vs Agents

A model is a type of brain (e.g., Claude Opus 4.6). An agent is a brain instance with its own isolated context. The same model can power multiple agents simultaneously — what matters is that each agent has completely separate context.

In each debate:

2 debater-agents: argue FOR and AGAINST the motion
10 judge-agents: one per model (including the debaters), each with isolated evaluator context and a unique persona

Debate Format (Intelligence Squared)

Opening Statements — Each debater frames their position (600 tokens)
Rebuttals — Respond to opponent's opening, attack assumptions (500 tokens)
Cross-Examination — 3 rounds of Q&A between debaters (150/300 tokens)
Audience Questions — Judges generate questions, debaters answer (300 tokens)
Closing Statements — Compress strongest points, no new arguments (300 tokens)

Scoring

10 judges vote before and after the debate
Winner = side with the highest vote conversion (Δ = final% − initial%)
Confidence-weighted persuasion tracked for soft signal
ELO ratings updated after each matchup (K=32)

🏆 Top 10 Models

#	Model	Provider	Arena Score
1	Claude Opus 4.6 (Thinking)	Anthropic	1503
2	Claude Opus 4.6	Anthropic	1497
3	Gemini 3.1 Pro Preview	Google	1493
4	Grok 4.20	xAI	1490
5	Gemini 3 Pro	Google	1486
6	GPT-5.4 (High)	OpenAI	1484
7	Grok 4.20 (Reasoning)	xAI	1480
8	GPT-5.2 Chat	OpenAI	1477
9	Grok 4.20 Multi-Agent	xAI	1475
10	Gemini 3 Flash	Google	1474

📊 Judge Persona Map

Each model gets a fixed persona when acting as a judge:

Model	Judge Persona
Claude Opus 4.6 (Thinking)	Risk-averse economist
Claude Opus 4.6	Philosophy professor
Gemini 3.1 Pro Preview	Neutral academic
Grok 4.20	Contrarian thinker
Gemini 3 Pro	Environmental activist
GPT-5.4 (High)	Corporate executive
Grok 4.20 (Reasoning)	Skeptical policymaker
GPT-5.2 Chat	Optimistic technologist
Grok 4.20 Multi-Agent	Union labor representative
Gemini 3 Flash	Data scientist & AI researcher

🚀 Getting Started

Prerequisites

Node.js 22+
Python 3.12+
OpenRouter API key

Run the Web App

cd web
npm install
npm run dev

Open http://localhost:3000

Run the Benchmark

# Set your API key
export OPENROUTER_API_KEY=sk-or-v1-...

# Install dependencies
pip install aiohttp python-dotenv

# Run tests first
python3 -m benchmark.test_debate

# Run full benchmark (45 matchups)
python3 -m benchmark.runner

# Sync results to web app
./sync_results.sh

Live Arena

Visit the /arena page to:

Enter your OpenRouter API key
Select two models
Choose a topic
Watch the debate unfold in real-time with SSE streaming

📁 Project Structure

├── benchmark/               # Python benchmark engine
│   ├── models.py           # Model definitions & configs
│   ├── prompts.py          # All prompt templates
│   ├── api_client.py       # OpenRouter API wrapper
│   ├── audience.py         # Judge agent module
│   ├── debate.py           # Debate orchestration
│   ├── scoring.py          # ELO & scoring
│   ├── runner.py           # Full benchmark runner
│   └── test_debate.py      # Step-by-step tests
├── web/                    # Next.js web application
│   ├── src/app/            # App Router pages
│   ├── src/components/     # React components
│   ├── src/lib/            # Types, utils, data
│   └── public/images/      # Generated model icons
└── sync_results.sh         # Sync benchmark → web app

☁️ Deploy to Vercel

The recommended way to deploy AI² is on Vercel, which natively supports Next.js.

One-Click Deploy

Manual Deploy

Install the Vercel CLI:
```
npm i -g vercel
```
Deploy from the web/ directory:
```
cd web
vercel
```
Follow the prompts. Vercel auto-detects Next.js and uses npm run build.
Deploy to production:
```
vercel --prod
```

Configuration Details

Root Directory: Set to web in Vercel project settings (or pass --cwd web via CLI). The Next.js app lives in web/, not the repo root.
Framework Preset: Next.js (auto-detected).
Build Command: npm run build (default).
Output Directory: .next (default).
Node.js Version: 22.x (set in Vercel project settings under General > Node.js Version).

Live Arena API Endpoint

The /api/debate route is a serverless function that orchestrates live debates via SSE streaming. It makes many sequential calls to the OpenRouter API, so it requires an extended execution timeout:

maxDuration: 800 (~13 minutes) is configured in both vercel.json and as a route segment export. This requires a Vercel Pro plan or higher. On the free Hobby plan, the max is 60 seconds, which is too short for a full debate.
The route uses runtime: "nodejs" and dynamic: "force-dynamic" to ensure it always runs as a serverless function, never statically cached.
No environment variables are needed on Vercel — the user's OpenRouter API key is passed per-request from the browser and never stored server-side.

Vercel Plan Requirements

Feature	Hobby (Free)	Pro	Enterprise
Static pages	✅	✅	✅
Live Arena (short debates)	⚠️ 60s limit	✅	✅
Live Arena (full debates)	❌ Too slow	✅ 800s	✅ 900s

🔬 Unique Analytics

Self-judging bias: Does a model favor itself when it's both debater and judge?
Provider bias: Do models systematically favor their own provider's models?
Cross-model matrix: 10×10 grid showing every judge×debater relationship
Persuasion resistance: Which models are hardest to convince as judges?
Stance flip rate: How often each judge changes position after a debate

📄 License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
benchmark		benchmark
web-assets		web-assets
web		web
.gitignore		.gitignore
README.md		README.md
sync_results.sh		sync_results.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI² — Artificial Intelligence Squared

🏗 Architecture

Models vs Agents

Debate Format (Intelligence Squared)

Scoring

🏆 Top 10 Models

📊 Judge Persona Map

🚀 Getting Started

Prerequisites

Run the Web App

Run the Benchmark

Live Arena

📁 Project Structure

☁️ Deploy to Vercel

One-Click Deploy

Manual Deploy

Configuration Details

Live Arena API Endpoint

Vercel Plan Requirements

🔬 Unique Analytics

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI² — Artificial Intelligence Squared

🏗 Architecture

Models vs Agents

Debate Format (Intelligence Squared)

Scoring

🏆 Top 10 Models

📊 Judge Persona Map

🚀 Getting Started

Prerequisites

Run the Web App

Run the Benchmark

Live Arena

📁 Project Structure

☁️ Deploy to Vercel

One-Click Deploy

Manual Deploy

Configuration Details

Live Arena API Endpoint

Vercel Plan Requirements

🔬 Unique Analytics

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages