Route every AI call to the cheapest model that can do the job well.
48 MCP tools · 20+ LLM providers · intelligent routing · personal memory · budget tracking · decision analytics.
Result: 60–80% cost reduction vs running everything on Claude Opus.
- The Problem & Solution
- Real-World Savings
- Quick Start
- Key Features
- How It Works
- Supported Tools
- Configuration
- Monitoring & Optimization
- Combined Download Statistics
- MCP Tools Reference
- Architecture & Development
Traditional AI-assisted development routes every task to your most capable (and expensive) model. A simple file lookup costs the same as complex architecture redesign—burning through your quota and budget on low-value work.
llm-router analyzes each task and routes it to the cheapest model that can handle it well:
- Simple lookups → Ollama/Haiku (free/cheap)
- Moderate coding → Gemini Pro/GPT-4o (budget-friendly)
- Complex reasoning → Claude Opus/o3 (premium, only when needed)
The magic: You keep the same conversational experience. No manual routing, no model picking. It just works behind the scenes.
Real numbers: 51 releases, 22.6M tokens, $6.95 spent (vs $50–60 with traditional Opus-everywhere approach).
- Actual spend: $6.95 (22.6M development tokens)
- Opus baseline: $50–60 (traditional approach)
- Savings: $43–53 per 2 weeks (87% reduction)
- Annualized: ~$180/year vs $1,200–1,500 baseline
31% from free models → 7.0M tokens, $0 cost (Ollama + Codex)
38% from budget models → 8.6M tokens, $2.82 cost (Gemini Flash + GPT-4o-mini)
31% from premium models → 7.0M tokens, $4.13 cost (GPT-4o, Pro, Claude)
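The blended-cost arithmetic above can be checked directly. A minimal sketch (tier figures are taken from the breakdown above; the $50 and $60 baselines are the Opus-everywhere estimates quoted earlier):

```python
# Verify the blended-cost figures from the breakdown above.
tiers = {
    "free":    {"tokens_m": 7.0, "cost": 0.00},   # Ollama + Codex
    "budget":  {"tokens_m": 8.6, "cost": 2.82},   # Gemini Flash + GPT-4o-mini
    "premium": {"tokens_m": 7.0, "cost": 4.13},   # GPT-4o, Pro, Claude
}

total_tokens_m = sum(t["tokens_m"] for t in tiers.values())
total_cost = sum(t["cost"] for t in tiers.values())
print(f"{total_tokens_m:.1f}M tokens, ${total_cost:.2f}")  # → 22.6M tokens, $6.95

# Savings against the $50-60 Opus-everywhere baseline.
for baseline in (50, 60):
    print(f"vs ${baseline}: {100 * (1 - total_cost / baseline):.0f}% saved")
```

The 86–88% range this prints matches the ~87% reduction reported above.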
Free-first routing eliminated budget pressure over the sprint—enabling sustainable feature velocity without cost anxiety.
# One command to install and configure
pipx install llm-routing && llm-router install
# Available in .env or ~/.llm-router/config.yaml
export OPENAI_API_KEY="sk-..."   # For GPT-4o, o3 (optional)
export GEMINI_API_KEY="AIza..."  # For Gemini models (free tier available)
export OLLAMA_BASE_URL="..."     # For local Ollama (optional, auto-starts)
Start using Claude Code, Gemini CLI, Codex, VS Code, Cursor, or any MCP-compatible editor. llm-router handles everything automatically.
- Complexity Classification — Analyzes prompts to determine if task is simple/moderate/complex
- Provider Fallback Chains — Ollama → Codex → GPT-4o → Claude (free-first, always)
- Budget Pressure Awareness — Automatically downgrades model when quota is limited
- Quality Monitoring — Demotes models with degraded performance via judge scoring
- Decision Logging — Track which model was selected, why, and the cost impact
- Zero-Config — Works out-of-the-box with Claude subscription (no setup required)
- Free-First Chain — Prioritizes free/prepaid models (Ollama, Codex) before paid APIs
- Session Budgeting — Allocates smart budgets to Agent calls (prevents runaway costs)
- Real-Time Savings — Session-end hook shows exact $ saved vs baseline
- Usage Analytics — Detailed breakdowns by model, task type, complexity tier
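To make the complexity-classification feature concrete, here is a simplified sketch of what a keyword fast-path might look like. The hint lists and length cutoff are illustrative assumptions, not the shipped classifier:

```python
# Illustrative fast-path heuristic: map a prompt to a complexity tier.
SIMPLE_HINTS = ("what does", "what is", "explain this error", "where is")
COMPLEX_HINTS = ("design", "architect", "distributed", "migrate")

def classify(prompt: str) -> str:
    """Return 'simple', 'moderate', or 'complex' for routing purposes."""
    p = prompt.lower()
    if any(h in p for h in COMPLEX_HINTS):
        return "complex"
    if any(h in p for h in SIMPLE_HINTS) and len(p) < 120:
        return "simple"
    return "moderate"

print(classify("What does this error mean?"))           # → simple
print(classify("Implement user auth with OAuth"))       # → moderate
print(classify("Design a distributed tracing system"))  # → complex
```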
| Tool | Support | Auto-Routing |
|---|---|---|
| Claude Code | ✅ Full | Yes (hooks + quota display) |
| Gemini CLI | ✅ Full | Yes (hooks + quota display) |
| Codex CLI | ✅ Full | Yes (hooks + cost tracking) |
| VS Code + Copilot | ✅ MCP | No (tools available, model decides) |
| Cursor | ✅ MCP | No (tools available, model decides) |
| OpenCode, Windsurf | ✅ MCP | No (tools available, model decides) |
| Any MCP-compatible tool | ✅ MCP | Manual config |
Full = auto-routing hooks enforce your policy before the model responds. MCP = tools available but optional.
- Personal Routing Memory — Learns from your usage patterns over time
- Judge Scores — Real-time model quality monitoring (demotes if score drops below 0.6)
- Session Context — Understands whether you're in code mode, research mode, or Q&A
- Policy Enforcement — Prevents expensive violations (e.g., Claude answering when Haiku should route)
- Routing Decision Analytics — Every decision tracked: which model, which task, which complexity
- Savings Dashboard — See exactly how much $ you saved this session/week/month
- Quota Pressure Tracking — Real-time display of remaining budget
- Violation Detection — Logs when routing hints are ignored
- Performance Reports — Judge scores, quality trends, model health status
User Prompt
↓
[Heuristic Fast-Path] — Does it match known patterns? (instant)
↓
[Complexity Classifier] — Haiku/Sonnet/Opus tier? (local or API)
↓
[Free-First Router Chain] — Try Ollama → Codex → Gemini → OpenAI → Claude
↓
[Budget Pressure Adjustment] — Downshift if over 85% quota
↓
[Quality Guard Check] — Demote if model health score < 0.6
↓
[Provider Health Circuit] — Skip if provider is degraded/unavailable
↓
Execute on Selected Model → Track decision & cost
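The budget, quality, and health stages of this flow can be sketched as guard checks over a free-first chain. The 85% quota and 0.6 health thresholds come from the diagram; the chains are abridged and the function itself is an illustrative sketch, not the shipped router:

```python
# Illustrative sketch of the guard stages in the routing diagram above.
FREE_FIRST_CHAIN = {
    "simple":   ["ollama", "codex", "gemini-flash", "gpt-4o-mini"],
    "moderate": ["ollama", "codex", "gemini-pro", "gpt-4o", "claude-sonnet"],
    "complex":  ["ollama", "codex", "o3", "claude-opus"],
}

def route(tier, quota_used, health, available):
    """Pick a model: downshift under budget pressure, skip unhealthy providers."""
    if quota_used > 0.85:  # budget pressure adjustment: drop one tier
        tier = {"complex": "moderate", "moderate": "simple"}.get(tier, tier)
    for model in FREE_FIRST_CHAIN[tier]:
        if health.get(model, 1.0) < 0.6:  # quality guard: demote low scorers
            continue
        if model not in available:        # provider health circuit: skip outages
            continue
        return model
    return "claude-sonnet"  # last-resort fallback

# Over 85% quota with Ollama down: a complex task downshifts and lands on Codex.
print(route("complex", 0.90, {}, {"codex", "gpt-4o", "claude-sonnet"}))  # → codex
```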
Simple task ("What does this error mean?")
Ollama (free, local)
↓ (if unavailable)
Codex gpt-5.4 (prepaid; free if you have a subscription)
↓ (if unavailable/degraded)
Gemini Flash (budget, $0.0001 per 1M tokens)
↓ (if degraded)
Groq (free tier, rate-limited)
↓
GPT-4o-mini (fallback, $0.15/1M input tokens)
Moderate task ("Implement user auth with OAuth")
Ollama (free, local)
↓ (if unavailable)
Codex (prepaid)
↓ (if unavailable)
Gemini Pro (quality+cost sweet spot)
↓ (if degraded)
GPT-4o (feature-complete, moderate cost)
↓
Claude Sonnet (subscription, if available)
Complex task ("Design a distributed tracing system")
Ollama (free, local)
↓ (if unavailable)
Codex (prepaid)
↓ (if unavailable)
o3 (reasoning powerhouse, $$$)
↓ (if unavailable)
Claude Opus (max reasoning, $$$$)
pipx install llm-routing
llm-router install
Get: auto-routing hooks, session tracking, quota display, decision analytics
pipx install llm-routing
llm-router install --host gemini-cli
Get: auto-routing hooks, savings tracking, free-first chaining, quota display
pipx install llm-routing
llm-router install --host codex
Get: auto-routing hooks, cost tracking, Codex injected as free fallback in chains
pipx install llm-routing
llm-router install --host vscode  # or --host cursor
Get: MCP server with 48 tools available (routing is model-voluntary)
Add to your tool's .mcp.json or equivalent:
{
"tools": [
{
"type": "command",
"command": "llm-router",
"args": ["--mcp"]
}
]
}
If using Claude Code Pro/Max (subscription), everything works out-of-the-box. No API keys needed, Ollama auto-starts.
# Provider API Keys (only set what you have)
export OPENAI_API_KEY="sk-proj-..." # GPT-4o, o3
export GEMINI_API_KEY="AIza..." # Gemini models
export PERPLEXITY_API_KEY="pplx-..." # Web-grounded research
export ANTHROPIC_API_KEY="sk-ant-..." # For non-subscription Claude access
# Local Inference (Free)
export OLLAMA_BASE_URL="http://localhost:11434" # Ollama local server
export OLLAMA_BUDGET_MODELS="gemma4:latest,qwen3.5:latest" # Models to auto-load
# Routing Policy
export LLM_ROUTER_PROFILE="balanced" # budget|balanced|premium
export LLM_ROUTER_ENFORCE="smart" # smart|hard|soft|off
export LLM_ROUTER_MAX_AGENT_DEPTH=3 # Circuit breaker for nested agents
# Advanced
export LLM_ROUTER_DB_PATH="~/.llm-router/usage.db" # Where to store usage logs
export LLM_ROUTER_CAVEMAN_INTENSITY="full" # Compress output tokens (off|lite|full|ultra)
For teams with security policies blocking .env files:
# Create secure config file
llm-router init-config
# Edit ~/.llm-router/config.yaml
openai_api_key: "sk-proj-..."
gemini_api_key: "AIza..."
ollama_base_url: "http://localhost:11434"
llm_router_profile: "balanced"
Set permissions: chmod 600 ~/.llm-router/config.yaml
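Since the same settings can come from environment variables or the config file, a loader needs a precedence rule. A sketch, assuming environment variables win over config-file entries (the precedence order and the helper itself are assumptions based on the two configuration methods shown, not the shipped loader):

```python
import os

def get_setting(name: str, yaml_config: dict, default=None):
    """Hypothetical lookup: env var (uppercase) beats config.yaml key (lowercase)."""
    return os.environ.get(name.upper()) or yaml_config.get(name.lower()) or default

# As if parsed from ~/.llm-router/config.yaml
yaml_config = {"llm_router_profile": "balanced"}
os.environ.pop("LLM_ROUTER_PROFILE", None)  # no env override set

print(get_setting("llm_router_profile", yaml_config))  # → balanced
```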
For full setup guide, see docs/SETUP.md.
# Show savings from current session
llm-router usage today
# Show all usage data (weekly breakdown)
llm-router usage week
# Show cost efficiency over time
llm-router savings
# Check current quota pressure
llm-router budget
# Analyze routing decisions and find inefficiencies
python3 scripts/analyze-violations.py
# → Shows which sessions violated routing hints
# → Identifies patterns (e.g., Bash used when llm_query should route)
# Analyze hook health
llm-router health
# → Model quality scores, provider status, circuit breaker state
Violation: Ignoring a ⚡ MANDATORY ROUTE hint (costs extra tokens with zero savings).
Example violation:
⚡ MANDATORY ROUTE: query/simple → call llm_query
❌ User/Claude uses Bash instead → burns expensive Claude tokens
Enforcement modes:
LLM_ROUTER_ENFORCE=smart # (default) Hard block Q&A violations, soft block code
LLM_ROUTER_ENFORCE=hard # Block all violations (strictest, best savings)
LLM_ROUTER_ENFORCE=soft # Log violations, allow calls (permissive)
LLM_ROUTER_ENFORCE=off   # No enforcement (max flexibility)
Sessions with 3+ violations get a warning:
⚠️ ESCALATION: 5 routing violations this session.
Next prompt should call llm_query FIRST before any Bash/Read/Edit.
Set LLM_ROUTER_ENFORCE=hard to block violations automatically.
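The four enforcement modes above can be read as a small decision table. This mapping is an illustrative reading of the mode descriptions, not the shipped logic:

```python
# Illustrative decision: does this mode block a routing-hint violation?
def should_block(mode: str, task_kind: str) -> bool:
    """mode: smart|hard|soft|off; task_kind: 'qa' or 'code'."""
    if mode == "hard":
        return True                 # block all violations
    if mode == "smart":
        return task_kind == "qa"    # hard-block Q&A, soft-block code
    return False                    # soft and off only log, never block

assert should_block("smart", "qa") is True
assert should_block("smart", "code") is False
assert should_block("hard", "code") is True
assert should_block("soft", "qa") is False
```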
For detailed guidance, see Monitoring & Reducing Violations in CLAUDE.md.
Complete budget management for Agent calls with provisional tracking.
- Session Budget Allocation — Smart carving: 30% of remaining quota per session, $5–$50 range
- Provisional Spend Tracking — Real-time budget decrements prevent multiple agents from thinking budget is available
- Budget Reconciliation — On failure: refund 50% (only pay for delivered value)
- Hard Limits — $5/agent, $50/session (fallback safety valve)
Example:
Session → $5 budget allocated
Agent 1 (code) → -$1.00 → $4 remaining
Agent 2 (code) → -$1.00 → $3 remaining
Agent 1 succeeds → keep deduction
Agent 2 times out → refund $0.50 → $3.50 remaining
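The worked example above can be expressed as a small ledger. The 50% refund rule and dollar figures come from the text; the class itself is a sketch, not the shipped implementation:

```python
class SessionBudget:
    """Provisional spend tracking with partial refunds, per the rules above."""
    def __init__(self, allocated: float):
        self.remaining = allocated
        self.holds = {}

    def reserve(self, agent: str, amount: float):
        self.remaining -= amount          # provisional decrement up front
        self.holds[agent] = amount

    def settle(self, agent: str, succeeded: bool):
        hold = self.holds.pop(agent)
        if not succeeded:
            self.remaining += hold * 0.5  # refund 50% on failure

b = SessionBudget(5.00)
b.reserve("agent1", 1.00)
b.reserve("agent2", 1.00)
b.settle("agent1", succeeded=True)    # keep deduction
b.settle("agent2", succeeded=False)   # times out: refund $0.50
print(f"${b.remaining:.2f}")          # → $3.50
```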
See Agent Resource Budgeting for detailed setup.
Automatic detection of writing tasks with decomposition guidance.
- Smart Detection — Recognizes "write", "draft", "add card", "create spec" patterns
- Decomposition — Suggests: route generation → integrate locally (saves 90% on writing)
- Soft Nudges — Hook suggests routing without blocking
Example suggestion:
⚡ SUGGESTION: This looks like content generation.
Consider: llm_generate() first, then Edit/Write to integrate.
Saves ~$0.0005, follows best-practice decomposition.
Optimized routing chains with automatic local inference.
- Ollama Auto-Startup — Session-start hook launches Ollama + loads budget models if not running
- Free-First Chains — Ollama → Codex → Gemini → OpenAI → Claude (all complexity levels)
- Codex as Free Fallback — Injected before all paid models when subscription available
- Routing Analytics — Track which model selected, cost impact, complexity distribution
See CHANGELOG.md for complete v6.x history.
| Tool | Purpose |
|---|---|
| llm_route | Route task to optimal model by complexity/profile |
| llm_classify | Classify task complexity: simple/moderate/complex |
| llm_track_usage | Manually log token usage for budget tracking |
| Tool | Purpose |
|---|---|
| llm_query | Answer questions (Haiku-class models, fast) |
| llm_research | Research with web access (Perplexity) |
| llm_generate | Create content (Flash-class, cheap) |
| llm_analyze | Deep analysis (Sonnet-class reasoning) |
| llm_code | Code generation & refactoring (Sonnet → Opus) |
| llm_edit | Multi-file code edits with reasoning |
| Tool | Purpose |
|---|---|
| llm_image | Generate images (Gemini/DALL-E/Flux) |
| llm_video | Generate videos (Gemini Veo/Runway) |
| llm_audio | Generate speech (ElevenLabs/OpenAI TTS) |
| Tool | Purpose |
|---|---|
| llm_orchestrate | Multi-step pipelines (research → analysis → generation) |
| llm_pipeline_templates | List available pipeline templates |
| Tool | Purpose |
|---|---|
| llm_usage | Show cost breakdown (today/week/month/all) |
| llm_savings | Cost savings vs Opus baseline |
| llm_budget | Real-time budget pressure (0.0–1.0) |
| llm_health | Provider health status & circuit breaker state |
| llm_providers | List configured providers & API key status |
| llm_set_profile | Switch routing profile (budget/balanced/premium) |
| Tool | Purpose |
|---|---|
| llm_setup | Interactive provider setup guide |
| llm_policy | View/manage routing policies |
| llm_quality_report | Judge scores & quality trends |
| llm_save_session | Archive session for cross-session learning |
| llm_check_usage | Refresh Claude subscription quota data |
| llm_update_usage | Update usage cache from API response |
| llm_refresh_claude_usage | Auto-refresh Claude quota (OAuth) |
| Tool | Purpose |
|---|---|
| llm_codex | Route directly to Codex (prepaid OpenAI subscription) |
| llm_auto | Host-agnostic routing wrapper (savings tracking) |
| llm_gemini | Route directly to Gemini CLI |
| llm_fs_find | Find files by description |
| llm_fs_rename | Generate bulk rename commands |
| llm_fs_edit_many | Multi-file edits with cheap model reasoning |
| llm_fs_analyze_context | Build workspace context for routing |
Full Tool Reference — Complete documentation for all 48 tools with examples.
llm-router routes your prompts to multiple AI providers based on task analysis.
- Sanitize inputs — Protect against prompt injection
- Scrub secrets — Remove API keys from logs before persistence
- Verify hook safety — Ensure hooks won't create deadlocks or security issues
- Authenticate all calls — Use your API keys to authenticate with providers
- Local storage — All routing decisions stored locally in ~/.llm-router/
- Prompts are sent to providers — OpenAI, Gemini, Anthropic, etc. (based on your routing policy)
- API keys stored locally — .env or ~/.llm-router/config.yaml (encrypted if possible, local-only)
- Usage logs are unencrypted — ~/.llm-router/usage.db contains routing decisions and costs
- All providers share your content — Routed prompts go to your configured providers
See SECURITY.md for responsible disclosure and detailed security design at docs/SECURITY_DESIGN.md.
# Run tests
uv run pytest tests/ -q
# Lint code
uv run ruff check src/ tests/
# Run single test file
uv run pytest tests/test_classifier.py -x -q
# Run full test suite with slow tests
uv run pytest tests/ -m "" -q
# Build distribution
uv build
See CLAUDE.md for:
- System design decisions and trade-offs
- Module organization and dependencies
- Hook architecture and deadlock prevention
- Release process and version management
See docs/ARCHITECTURE.md for:
- Three-layer compression pipeline (token optimization)
- Judge scoring system (quality monitoring)
- Quality trend tracking (performance degradation detection)
- Budget pressure algorithm (quota-aware routing)
The llm-router stats command aggregates download statistics across the current package (llm-routing) and its legacy predecessor (claude-code-llm-router), providing unified adoption metrics.
# Display combined downloads for the last 6 weeks
llm-router stats
# View stats for a specific period
llm-router stats --period last_month
llm-router stats --period last_quarter
llm-router stats --period all_time
# Output as markdown (for README/docs)
llm-router stats --format markdown
# Output as JSON (for automation/dashboards)
llm-router stats --format json
Combined Download Statistics
========================================
Total Downloads: 2,450
Period: recent
Breakdown:
- llm-routing: 1,500 (61.2%)
New package (current)
- claude-code-llm-router: 950 (38.8%)
Legacy package (deprecated)
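The percentages in the sample output follow directly from the per-package counts; a quick check (counts taken from the sample above):

```python
# Recompute the shares shown in the sample output above.
downloads = {"llm-routing": 1500, "claude-code-llm-router": 950}
total = sum(downloads.values())

for pkg, n in downloads.items():
    print(f"{pkg}: {n:,} ({100 * n / total:.1f}%)")
print(f"Total Downloads: {total:,}")  # → Total Downloads: 2,450
```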
When the package was migrated from claude-code-llm-router to llm-routing, existing users with the legacy package continue to receive updates. The combined stats show:
- Total adoption across both package names
- Migration progress (new vs legacy usage ratio)
- Unified impact for documentation and marketing
Integrate into your release pipeline to auto-update badges:
# Fetch combined stats and update a badge or config
llm-router stats --format json | jq '.total_downloads'
# Output: 2450
| Aspect | llm-router | Manual Routing | Always Use Opus |
|---|---|---|---|
| Cost | $180–360/year | Depends on discipline | $1,200–1,500/year |
| Setup | One command (llm-router install) | Manual each time | No setup |
| Decision Quality | Learned from your usage | Manual + error-prone | Optimal (but expensive) |
| Budget Control | Real-time pressure awareness | No automation | Subscription limits |
| Monitoring | Full analytics dashboard | Manual tracking | Subscription usage page |
| Quota Pressure Handling | Auto-downgrades intelligently | Must manually switch | No switching |
| Learning | Adapts to your patterns over time | Static | No personalization |
| Provider Fallback | Automatic, always available | Manual chain | Single provider only |
- Issues & Bugs — GitHub Issues
- Feature Requests & Discussion — GitHub Discussions
- Security Concerns — SECURITY.md for responsible disclosure
- Latest Releases — PyPI Package
- Changelog & Version History — CHANGELOG.md
MIT — See LICENSE
Contributions welcome! See CONTRIBUTING.md for guidelines.
Made with ❤️ for AI developers who care about cost and quality.



