
Document Summarizer

Compare Claude and OpenAI responses side-by-side with a built-in evaluation framework. Upload PDFs, generate summaries from both models, ask questions using RAG, and vote on which responses are better. Track win rates, latency, cost, and quality metrics in a real-time dashboard.

Live Demo: document-summarizer-dun.vercel.app


Highlights

  • Side-by-Side Comparison — Every summary and Q&A response generated by both Claude and OpenAI in parallel
  • Evaluation Framework — Vote on responses, track win rates, thumbs up/down ratings, and agreement metrics
  • RAG-Powered Q&A — Questions are answered from document chunks retrieved via cosine similarity
  • Cost & Latency Tracking — Monitor token usage, response times (P50/P95), and API costs per model
  • Export Analytics — Download evaluation data as JSON for offline analysis
  • Date Range Filtering — Analyze metrics over custom time periods

Screenshots

  • Summary View — Claude vs OpenAI with latency badges and voting controls
  • Q&A — RAG-based answers from both models
  • Eval Dashboard — Win rates, latency, cost tracking
  • Voting — Thumbs up/down + head-to-head comparison
  • Insights — Auto-generated performance analysis

Why This Exists

Building AI applications isn't just about calling APIs — it's about understanding which model works best for your use case. This tool demonstrates:

  • Model Evaluation → Systematic comparison of Claude vs OpenAI on real documents
  • Quality Metrics → Beyond "does it work" to "which is better and why"
  • Cost Awareness → Track actual API costs to inform model selection
  • Production Patterns → RAG implementation, caching, observability

Architecture

flowchart TB
    subgraph Client["Next.js Frontend"]
        UI["Upload / Summary / Q&A"]
        Dashboard["Eval Dashboard"]
    end

    subgraph API["API Routes (Vercel)"]
        Upload["/api/upload"]
        Embed["/api/embed"]
        Summarize["/api/summarize"]
        Query["/api/query"]
        Feedback["/api/feedback"]
        Compare["/api/compare"]
        Evals["/api/evals"]
    end

    subgraph External["External Services"]
        Claude["Claude API<br/>(claude-haiku-4-5-20251001)"]
        OpenAI["OpenAI API<br/>(gpt-4o-mini + embeddings)"]
        Neon[("Neon Postgres<br/>+ pgvector")]
    end

    UI --> Upload & Summarize & Query
    UI --> Feedback & Compare
    Dashboard --> Evals

    Upload --> Neon
    Embed --> OpenAI
    Embed --> Neon
    Summarize --> Claude & OpenAI
    Summarize --> Neon
    Query --> Neon
    Query --> Claude & OpenAI
    Feedback & Compare --> Neon
    Evals --> Neon

Tech Stack

| Layer | Technology |
| --- | --- |
| Framework | Next.js 14 (App Router) |
| Language | TypeScript |
| Styling | Tailwind CSS |
| Database | Neon Postgres + Drizzle ORM |
| Embeddings | OpenAI text-embedding-3-small |
| LLMs | Claude Haiku, GPT-4o-mini |
| PDF Parsing | pdf-parse |
| Deployment | Vercel |
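
Chunk embeddings are stored in Neon Postgres with pgvector. As a rough sketch (table and column names here are assumptions, not the repo's actual schema), a Drizzle table for embedded chunks could look like this:

// Hypothetical schema for embedded document chunks.
import { pgTable, serial, integer, text, vector } from "drizzle-orm/pg-core";

export const chunks = pgTable("chunks", {
  id: serial("id").primaryKey(),
  documentId: integer("document_id").notNull(),
  content: text("content").notNull(),
  // text-embedding-3-small produces 1536-dimensional vectors (pgvector column).
  embedding: vector("embedding", { dimensions: 1536 }),
});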

Quickstart

# Clone
git clone https://github.com/nickcarndt/document-summarizer.git
cd document-summarizer

# Install
npm install

# Configure
cp .env.example .env.local
# Add your API keys to .env.local

# Database
npx drizzle-kit push

# Run
npm run dev

Open http://localhost:3000

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...

Features

Side-by-Side Comparison

For every document, summaries are generated by both Claude and OpenAI in parallel (a minimal sketch follows the list below):

  • Latency badges (green <5s, yellow 5-15s, red >15s)
  • Character count comparison
  • Independent thumbs up/down per response
  • Head-to-head "which is better" voting
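
A minimal sketch of that parallel pattern (assumed, not the repo's actual route handler; the prompt and helper names are illustrative):

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
const openai = new OpenAI();       // reads OPENAI_API_KEY

// Wrap a request so each model's latency is measured independently.
async function timed<T>(fn: () => Promise<T>) {
  const start = Date.now();
  const result = await fn();
  return { result, latencyMs: Date.now() - start };
}

export async function summarizeBoth(documentText: string) {
  const prompt = `Summarize the following document:\n\n${documentText}`;

  // Both requests start at the same time and run concurrently.
  const [claude, gpt] = await Promise.all([
    timed(() =>
      anthropic.messages.create({
        model: "claude-haiku-4-5-20251001",
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      })
    ),
    timed(() =>
      openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      })
    ),
  ]);

  return { claude, gpt };
}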

RAG-Powered Q&A

Documents are chunked (1,500 characters with a 200-character overlap) and embedded; a chunking and retrieval sketch follows the steps below:

  1. Question is embedded using text-embedding-3-small
  2. Top 5 chunks retrieved via cosine similarity
  3. Both models generate answers using the same retrieved context
  4. Side-by-side display with voting
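
A sketch of the chunking and retrieval steps, under the same assumptions as the schema sketch above (names and driver choice are illustrative, not the repo's actual code):

import OpenAI from "openai";
import { neon } from "@neondatabase/serverless";
import { drizzle } from "drizzle-orm/neon-http";
import { cosineDistance, desc, eq, sql } from "drizzle-orm";
import { chunks } from "./schema"; // hypothetical module exporting the table sketched above

const openai = new OpenAI();
const db = drizzle(neon(process.env.DATABASE_URL!)); // driver choice is an assumption

// Chunking: fixed-size windows of 1,500 characters with a 200-character overlap.
export function chunkText(text: string, size = 1500, overlap = 200): string[] {
  const out: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    out.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return out;
}

// Retrieval: embed the question, then rank stored chunks by cosine similarity.
export async function retrieveContext(documentId: number, question: string) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const questionEmbedding = data[0].embedding;

  // Higher similarity = closer match; keep the top 5 chunks as shared context for both models.
  const similarity = sql<number>`1 - (${cosineDistance(chunks.embedding, questionEmbedding)})`;
  return db
    .select({ content: chunks.content, similarity })
    .from(chunks)
    .where(eq(chunks.documentId, documentId))
    .orderBy(desc(similarity))
    .limit(5);
}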

Evaluation Dashboard

Real-time metrics:

  • Win Rates — Percentage of comparisons won
  • Thumbs Up Rate — Absolute quality rating
  • Agreement Rate — How often the head-to-head winner also received a thumbs up
  • Latency Distribution — Min, P50, P95, Max (see the sketch after this list)
  • Response Length — Average character count
  • Cost Tracking — Actual API costs based on tokens
  • Win Rate by Type — Summaries vs Q&A breakdown
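
For instance, P50/P95 latency and per-response cost can be derived roughly as follows (a sketch; the percentile method and the per-token rates are assumptions, not the dashboard's exact implementation):

// Nearest-rank percentile over recorded response times (in milliseconds).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(index, sorted.length - 1)];
}

// Token-based cost estimate. Rates are placeholders (USD per million tokens);
// check current provider pricing before relying on these numbers.
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  ratesPerMTok: { input: number; output: number }
): number {
  return (inputTokens * ratesPerMTok.input + outputTokens * ratesPerMTok.output) / 1_000_000;
}

const latenciesMs = [4200, 5100, 6100, 9900, 15300];
console.log(percentile(latenciesMs, 50)); // P50
console.log(percentile(latenciesMs, 95)); // P95
console.log(estimateCost(12_000, 800, { input: 1.0, output: 5.0 })); // placeholder rates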

Key Insights

Auto-generated summary highlighting:

  • Overall model preference
  • Response length differences
  • Speed comparison
  • Quality consensus

Demo Walkthrough

  1. Upload a PDF (financial report, technical doc, etc.)
  2. Compare summaries from both models
  3. Vote on which summary is better
  4. Ask questions about the document
  5. Rate Q&A responses
  6. View dashboard for aggregate metrics
  7. Export data as JSON for analysis

API Routes

| Route | Method | Description |
| --- | --- | --- |
| /api/upload | POST | Upload PDF, extract text |
| /api/embed | POST | Chunk and embed document |
| /api/summarize | POST | Generate summaries (both models) |
| /api/summaries/[docId] | GET | Get existing summaries |
| /api/query | POST | RAG Q&A (both models) |
| /api/feedback | POST | Record thumbs up/down |
| /api/compare | POST | Record comparison vote |
| /api/evals | GET | Dashboard metrics |
| /api/evals/export | GET | Export all data as JSON |
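
As a usage illustration, a client call to /api/query might look like this (the request and response shapes are assumptions inferred from the feature description, not a documented contract):

// Hypothetical request body; field names are illustrative.
const res = await fetch("/api/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    documentId: 1,
    question: "What was the quarterly revenue?",
  }),
});

// Expected to include both models' answers plus latency/voting metadata.
const answers = await res.json();
console.log(answers);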

Development

npm run dev        # Development server
npm run build      # Production build
npm run typecheck  # TypeScript check
npm run lint       # ESLint

Deploy to Vercel

  1. Push to GitHub
  2. Import in Vercel
  3. Add environment variables
  4. Deploy

Sample Results

After testing on financial documents:

| Metric | Claude 3.5 Haiku | GPT-4o-mini |
| --- | --- | --- |
| Win Rate | 50% | 0% |
| Avg Latency | 9.9s | 6.1s |
| Avg Length | 1,734 chars | 1,706 chars |
| Thumbs Up | 100% | 50% |
| Cost | $0.02 | $0.01 |

Insight: This is a fair comparison between fast-tier models; Claude 3.5 Haiku and GPT-4o-mini show competitive performance at similar cost.

Future Improvements

  • Multi-document batch analysis
  • Statistical significance on win rates
  • Additional model support (Gemini, Llama)
  • Blind A/B testing mode
  • User accounts for preference tracking

License

MIT — see LICENSE

Author

Built by Nick Arndt — demonstrating applied AI engineering and LLM evaluation patterns.
