Compare Claude and OpenAI responses side-by-side with a built-in evaluation framework. Upload PDFs, generate summaries from both models, ask questions using RAG, and vote on which responses are better. Track win rates, latency, cost, and quality metrics in a real-time dashboard.
Live Demo: document-summarizer-dun.vercel.app
- Side-by-Side Comparison — Every summary and Q&A response generated by both Claude and OpenAI in parallel
- Evaluation Framework — Vote on responses, track win rates, thumbs up/down ratings, and agreement metrics
- RAG-Powered Q&A — Ask questions answered using retrieved document chunks with cosine similarity
- Cost & Latency Tracking — Monitor token usage, response times (P50/P95), and API costs per model
- Export Analytics — Download evaluation data as JSON for offline analysis
- Date Range Filtering — Analyze metrics over custom time periods
Screenshots:

- Summary View — Claude vs OpenAI with latency badges and voting controls
- Q&A — RAG-based answers from both models
- Eval Dashboard — Win rates, latency, cost tracking
- Voting — Thumbs up/down + head-to-head comparison
- Insights — Auto-generated performance analysis
Building AI applications isn't just about calling APIs — it's about understanding which model works best for your use case. This tool demonstrates:
- Model Evaluation → Systematic comparison of Claude vs OpenAI on real documents
- Quality Metrics → Beyond "does it work" to "which is better and why"
- Cost Awareness → Track actual API costs to inform model selection
- Production Patterns → RAG implementation, caching, observability
```mermaid
flowchart TB
    subgraph Client["Next.js Frontend"]
        UI["Upload / Summary / Q&A"]
        Dashboard["Eval Dashboard"]
    end
    subgraph API["API Routes (Vercel)"]
        Upload["/api/upload"]
        Embed["/api/embed"]
        Summarize["/api/summarize"]
        Query["/api/query"]
        Feedback["/api/feedback"]
        Compare["/api/compare"]
        Evals["/api/evals"]
    end
    subgraph External["External Services"]
        Claude["Claude API<br/>(claude-haiku-4-5-20251001)"]
        OpenAI["OpenAI API<br/>(gpt-4o-mini + embeddings)"]
        Neon[("Neon Postgres<br/>+ pgvector")]
    end
    UI --> Upload & Summarize & Query
    UI --> Feedback & Compare
    Dashboard --> Evals
    Upload --> Neon
    Embed --> OpenAI
    Embed --> Neon
    Summarize --> Claude & OpenAI
    Summarize --> Neon
    Query --> Neon
    Query --> Claude & OpenAI
    Feedback & Compare --> Neon
    Evals --> Neon
```
| Layer | Technology |
|---|---|
| Framework | Next.js 14 (App Router) |
| Language | TypeScript |
| Styling | Tailwind CSS |
| Database | Neon Postgres + Drizzle ORM |
| Embeddings | OpenAI text-embedding-3-small |
| LLMs | Claude Haiku, GPT-4o-mini |
| PDF Parsing | pdf-parse |
| Deployment | Vercel |
```bash
# Clone
git clone https://github.com/nickcarndt/document-summarizer.git
cd document-summarizer

# Install
npm install

# Configure
cp .env.example .env.local
# Add your API keys to .env.local

# Database
npx drizzle-kit push

# Run
npm run dev
```

Required environment variables in `.env.local`:

```env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...
```

Every document generates summaries from both Claude and OpenAI in parallel; a minimal sketch of the parallel calls follows the list below:
- Latency badges (green <5s, yellow 5-15s, red >15s)
- Character count comparison
- Independent thumbs up/down per response
- Head-to-head "which is better" voting
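Under the hood this is a straightforward fan-out to both providers. Below is a minimal sketch assuming the official `@anthropic-ai/sdk` and `openai` Node clients; the helper names (`timed`, `summarizeBoth`) and the prompt are illustrative, not the repo's actual code:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment

// Wrap a call so its latency can be recorded for the dashboard badges.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await fn();
  return { result, ms: Date.now() - start };
}

export async function summarizeBoth(documentText: string) {
  const prompt = `Summarize the following document:\n\n${documentText}`;

  // Fire both providers in parallel so neither blocks the other.
  const [claude, gpt] = await Promise.all([
    timed(() =>
      anthropic.messages.create({
        model: "claude-haiku-4-5-20251001",
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      })
    ),
    timed(() =>
      openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      })
    ),
  ]);

  const firstBlock = claude.result.content[0];
  const claudeText = firstBlock.type === "text" ? firstBlock.text : "";

  return {
    claude: { text: claudeText, latencyMs: claude.ms },
    openai: { text: gpt.result.choices[0].message.content ?? "", latencyMs: gpt.ms },
  };
}
```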
Documents are chunked (1500 chars, 200 overlap) and embedded; a minimal retrieval sketch follows the list below:
- Question is embedded using text-embedding-3-small
- Top 5 chunks retrieved via cosine similarity
- Both models generate answers using same context
- Side-by-side display with voting
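A minimal sketch of the chunking and retrieval step, assuming the `openai` Node client. In the deployed app the similarity search runs in Postgres via pgvector; the in-memory ranking here only illustrates the cosine-similarity math, and the helper names are illustrative:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Fixed-size character chunks with overlap (1500 chars, 200 overlap per the README).
export function chunkText(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed the question and return the k most similar stored chunks.
export async function retrieveTopChunks(
  question: string,
  stored: { text: string; embedding: number[] }[],
  k = 5
): Promise<string[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const queryEmbedding = res.data[0].embedding;

  return stored
    .map((c) => ({ text: c.text, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((c) => c.text);
}
```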
Real-time metrics:
- Win Rates — Percentage of comparisons won
- Thumbs Up Rate — Absolute quality rating
- Agreement Rate — How often winner also got thumbs up
- Latency Distribution — Min, P50, P95, Max
- Response Length — Average character count
- Cost Tracking — Actual API costs based on token usage (see the cost sketch after this list)
- Win Rate by Type — Summaries vs Q&A breakdown
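Cost tracking boils down to multiplying token counts by per-model prices. A minimal sketch follows; the prices below are placeholder assumptions rather than values taken from this repo, so check current provider pricing before relying on them:

```typescript
// Placeholder prices in USD per 1M tokens; assumptions for illustration only.
const PRICING_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-haiku-4-5-20251001": { input: 1.0, output: 5.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

export function estimateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const price = PRICING_PER_MTOK[model];
  if (!price) throw new Error(`No pricing configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```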
Auto-generated summary highlighting the following (a minimal sketch of how these insights could be derived follows the list):
- Overall model preference
- Response length differences
- Speed comparison
- Quality consensus
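As a rough illustration of how such insights could be derived from the dashboard metrics, here is a minimal sketch; the metric shape (`winRate`, `thumbsUpRate`, `avgLatencyMs`, `avgLength`) is an assumption for illustration, not the repo's actual schema:

```typescript
// Hypothetical per-model metric shape; field names are assumptions, not the repo's schema.
interface ModelMetrics {
  winRate: number;       // fraction of head-to-head comparisons won
  thumbsUpRate: number;  // fraction of responses rated thumbs up
  avgLatencyMs: number;  // average response time in milliseconds
  avgLength: number;     // average response length in characters
}

export function generateInsights(claude: ModelMetrics, openai: ModelMetrics): string[] {
  const insights: string[] = [];

  // Overall model preference from head-to-head votes.
  if (claude.winRate !== openai.winRate) {
    const leader = claude.winRate > openai.winRate ? "Claude" : "OpenAI";
    insights.push(`${leader} wins more head-to-head comparisons.`);
  } else {
    insights.push("Head-to-head comparisons are tied.");
  }

  // Speed comparison.
  const faster = claude.avgLatencyMs < openai.avgLatencyMs ? "Claude" : "OpenAI";
  insights.push(`${faster} responds faster on average.`);

  // Response length differences.
  const lengthDelta = Math.abs(claude.avgLength - openai.avgLength);
  insights.push(`Average response lengths differ by ${Math.round(lengthDelta)} characters.`);

  // Quality consensus: do thumbs-up ratings agree with the comparison winner?
  const agree =
    (claude.winRate > openai.winRate) === (claude.thumbsUpRate > openai.thumbsUpRate);
  insights.push(
    agree
      ? "Thumbs-up ratings agree with the head-to-head winner."
      : "Thumbs-up ratings and head-to-head votes disagree."
  );

  return insights;
}
```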
- Upload a PDF (financial report, technical doc, etc.)
- Compare summaries from both models
- Vote on which summary is better
- Ask questions about the document
- Rate Q&A responses
- View dashboard for aggregate metrics
- Export data as JSON for analysis
| Route | Method | Description |
|---|---|---|
| `/api/upload` | POST | Upload PDF, extract text |
| `/api/embed` | POST | Chunk and embed document |
| `/api/summarize` | POST | Generate summaries (both models) |
| `/api/summaries/[docId]` | GET | Get existing summaries |
| `/api/query` | POST | RAG Q&A (both models) |
| `/api/feedback` | POST | Record thumbs up/down |
| `/api/compare` | POST | Record comparison vote |
| `/api/evals` | GET | Dashboard metrics |
| `/api/evals/export` | GET | Export all data as JSON |
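For example, a Q&A request from the client is just a JSON POST to `/api/query`. The request and response field names below (`documentId`, `question`, `claude`, `openai`) are assumptions for illustration, not taken from the repo's route handlers:

```typescript
// Hypothetical client-side call; field names are illustrative assumptions.
async function askQuestion(documentId: string, question: string) {
  const res = await fetch("/api/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ documentId, question }),
  });
  if (!res.ok) throw new Error(`Query failed: ${res.status}`);

  // Expect one answer per model so the UI can render them side by side.
  const { claude, openai } = await res.json();
  return { claude, openai };
}
```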
```bash
npm run dev        # Development server
npm run build      # Production build
npm run typecheck  # TypeScript check
npm run lint       # ESLint
```

Deployment:

- Push to GitHub
- Import in Vercel
- Add environment variables
- Deploy
After testing on financial documents:
| Metric | Claude 3.5 Haiku | GPT-4o-mini |
|---|---|---|
| Win Rate | 50% | 0% |
| Avg Latency | 9.9s | 6.1s |
| Avg Length | 1,734 chars | 1,706 chars |
| Thumbs Up | 100% | 50% |
| Cost | $0.02 | $0.01 |
Insight: This is a fair comparison between fast-tier models; Claude 3.5 Haiku and GPT-4o-mini show competitive performance at similar cost.
- Multi-document batch analysis
- Statistical significance on win rates
- Additional model support (Gemini, Llama)
- Blind A/B testing mode
- User accounts for preference tracking
MIT — see LICENSE
Built by Nick Arndt — demonstrating applied AI engineering and LLM evaluation patterns.