An Anthropic-API proxy with auto-failover. Per-slot pipe-separated fallback chains across NVIDIA NIM, OpenRouter, DeepSeek, Ollama, LM Studio, llama.cpp. Drop-in for Claude Code, Cursor, and any Anthropic SDK client. Configure once, and Quench cools failing providers and routes around them.
Quick Start Β· Providers Β· API Keys Β· Fallback Chains Β· Configuration Β· Compare Β· Contributing
Quench is a local proxy that speaks the Anthropic Messages API. It sits between your client (Claude Code, Cursor, any Anthropic SDK consumer) and one or more upstream LLM providers. Configure a pipe-separated chain per Claude model slot; when one provider fails, Quench quenches it for a TTL and routes to the next entry, transparently.
You hit Claude rate limits mid-session. Subscription credit runs out. A provider has a flaky weekend. Existing options: switch SDKs and break your workflow, run an OpenAI-format proxy and convert every tool, or hand-roll retry logic per provider per error code. Quench's answer: keep your client unmodified, configure once, let the proxy handle failover.
MODEL_SONNET="nvidia_nim/qwen/qwen3-coder-480b|nvidia_nim/openai/gpt-oss-120b|open_router/google/gemma-4-31b-it:free"On 401 (auth/quota), 429 (rate limit), or 5xx (overloaded), the active entry is held cold for a TTL (1 hour, 60 seconds, 30 seconds respectively) and the next entry takes over. Even mid-stream: a message_start SSE buffer ensures clients see one clean stream when failover happens before any content has been emitted.
This repo started as a fork of Alishahryar1/free-claude-code, rebuilt around the chain-fallback feature and rebranded as Quench.
git clone https://github.com/SwayamDash/quench.git
cd quench
./setup.sh # installs uv, syncs deps, prepares .env, prompts for keys
./quench start # starts the proxy, points Claude Code and VSCode at itThat's the whole install. ./quench stop reverts to the official Anthropic API. ./quench status shows current state. ./quench logs tails the proxy log.
- Claude Code installed.
- At least one provider key, or one local provider running.
bash,curl,jq(auto-installed by./setup.shwhere missing).
# install uv (Astral)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Quench requires Python 3.14
uv python install 3.14
# install deps
uv sync
# copy env template, fill in keys
cp .env.example .env
# start the proxy
./quench startWindows PowerShell:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
uv python install 3.14
uv sync
copy .env.example .env
.\quench startThe minimum config is a single line in .env mapping a Claude model slot to an upstream model:
# single-provider mapping
MODEL_SONNET="nvidia_nim/qwen/qwen3-coder-480b-a35b-instruct"
# chain: try first; on auth/rate-limit/overload, try second; then third
MODEL_OPUS="nvidia_nim/qwen/qwen3-coder-480b|nvidia_nim/openai/gpt-oss-120b|open_router/anthropic/claude-3.5-sonnet:beta"
# fully local, no API keys
MODEL_HAIKU="ollama/llama3.2|lmstudio/local-model"Format: provider_id/model_path[|next_provider_id/model_path|...]. Provider IDs are nvidia_nim, open_router, deepseek, ollama, lmstudio, llamacpp. The model path is whatever string the upstream uses for the model name.
Once .env is configured, ./quench start brings the proxy up on 127.0.0.1:8082 and points Claude Code at it. Use Claude Code normally; routing happens behind the scenes.
Quench supports six provider backends, grouped by what you need to use them:
Free-tier (with key):
| Provider | Free quota | Signup |
|---|---|---|
| NVIDIA NIM | 40 requests / minute | build.nvidia.com |
| OpenRouter | Free models on :free suffix; small monthly credits |
openrouter.ai/keys |
Paid (with key):
| Provider | Pricing | Signup |
|---|---|---|
| DeepSeek | Pay-as-you-go, ~$0.14/M tokens | platform.deepseek.com |
Local (no key required):
| Provider | Setup | Default URL |
|---|---|---|
| Ollama | curl -fsSL https://ollama.com/install.sh | sh && ollama serve |
http://localhost:11434 |
| LM Studio | Download β Local Server tab β Start | http://localhost:1234/v1 |
| llama.cpp | Build / install llama-server, run llama-server -m model.gguf |
http://localhost:8080/v1 |
Step-by-step for each provider. Skip the ones you don't plan to use.
- Sign up at build.nvidia.com (free, requires NVIDIA developer account).
- Visit the API keys page: build.nvidia.com/settings/api-keys.
- Click Generate API Key. Copy the key.
- In
.env, set:NVIDIA_NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
- Free tier: 40 requests per minute. Quench will hold a 60-second cooldown if you hit the rate limit and route to the next chain entry.
Recommended models on NVIDIA NIM: qwen/qwen3-coder-480b-a35b-instruct (coding), openai/gpt-oss-120b (general).
- Sign up at openrouter.ai (Google or GitHub OAuth).
- Get your key at openrouter.ai/keys. Click Create Key.
- In
.env, set:OPENROUTER_API_KEY="sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxx"
- Browse free models with the
:freesuffix at openrouter.ai/models?max_price=0. Examples:google/gemma-4-31b-it:free,mistralai/mistral-7b-instruct:free. - Free models are rate-limited; chain them with NVIDIA NIM for resilience.
- Sign up at platform.deepseek.com.
- Get a key at platform.deepseek.com/api_keys.
- Add credit (minimum $1).
- In
.env, set:DEEPSEEK_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
- Pricing is roughly an order of magnitude below GPT-4-class APIs. Useful as a paid backstop.
- Install:
curl -fsSL https://ollama.com/install.sh | sh(macOS/Linux) orwinget install Ollama.Ollama(Windows). - Pull a model:
ollama pull llama3.2(or whatever fits your hardware). - Start the server:
ollama serve(or justollama run <model>for one-shot). - No API key. Set in
.env:OLLAMA_BASE_URL="http://localhost:11434"
- Reference in chains as
ollama/<model_name>. Works offline.
- Download from lmstudio.ai.
- Browse and download a model in the GUI.
- Click the Local Server tab. Click Start Server.
- No API key. Set in
.env:LM_STUDIO_BASE_URL="http://localhost:1234/v1"
- Reference in chains as
lmstudio/<model_name>.
- Install or build
llama.cpp. See github.com/ggml-org/llama.cpp. - Run with the OpenAI-compatible server:
llama-server -m path/to/model.gguf --port 8080. - No API key. Set in
.env:LLAMACPP_BASE_URL="http://localhost:8080/v1"
- Reference in chains as
llamacpp/<model_name>.
A walkthrough GIF for the per-provider key setup is coming in v0.1.1.
This is the headline feature. Three layers of failover, all transparent to the client:
Inter-turn failover (per request). The quench registry persists in-process across requests. Once a provider has been quenched on turn N, turn N+1 skips it and starts from the next chain entry. Restoration is automatic when the TTL expires.
Pre-stream failover. Before opening the upstream HTTP stream, Quench runs a preflight check. If the provider returns a retryable error (auth, rate limit, overload), the chain advances and the client sees no failure.
In-stream-but-pre-content failover. Anthropic SSE streams begin with a message_start event. Most providers emit message_start before the upstream HTTP call completes, so an early 401 or 429 arrives after the first SSE chunk has hit the wire. Quench buffers message_start and only flushes it once the upstream commits to a real response. If the first provider dies after message_start but before any content, the buffer is dropped silently and the next chain entry takes over from a clean state.
TTL defaults, tunable per call:
| Error class | TTL | Why |
|---|---|---|
| Auth (401/403) | 1 hour | Quota or credit exhaustion. Long backoff. |
| Rate limit (429) | 60 seconds | Most rate-limit windows reset within a minute. |
| Overloaded (5xx, 529) | 30 seconds | Server-side recovery is usually fast. |
| Other API error | 15 seconds | Fallback default. |
The registry is in-memory only. Restart Quench and all entries clear. Single-entry chains skip quench entirely so legitimate client retries aren't refused.
True mid-stream failures (the upstream dies after content has been streamed to the client) cannot be recovered without buffering full responses. Quench gracefully ends the SSE in that case so the client doesn't hang.
.env covers everything. Highlights:
| Variable | Default | Purpose |
|---|---|---|
MODEL_OPUS, MODEL_SONNET, MODEL_HAIKU, MODEL |
empty / nvidia_nim/z-ai/glm4.7 |
Per-Claude-model chain. Empty means inherit MODEL. |
ENABLE_OPUS_THINKING, ENABLE_SONNET_THINKING, ENABLE_HAIKU_THINKING, ENABLE_MODEL_THINKING |
true | Pass thinking blocks to the upstream when supported. |
ANTHROPIC_AUTH_TOKEN |
empty | If set, Quench requires this bearer token from clients. Set this whenever the proxy is reachable from anywhere other than loopback. |
QUENCH_HOST |
127.0.0.1 |
Bind address. 0.0.0.0 exposes Quench on the LAN. |
QUENCH_PORT |
8082 |
Listen port. |
QUENCH_LOG |
/tmp/quench.log |
Log file path. |
QUENCH_START_TIMEOUT |
10 |
Seconds to wait for uvicorn to bind on quench start. |
LOG_RAW_API_PAYLOADS, LOG_RAW_SSE_EVENTS, LOG_API_ERROR_TRACEBACKS |
false | Verbose diagnostics. Off by default to avoid leaking request content into logs. |
WEB_FETCH_ALLOW_PRIVATE_NETWORKS |
false | If true, web-search/web-fetch tools may reach RFC1918 ranges. |
See .env.example for the full list with annotations.
| Free-tier helpers | Fallback chains | Anthropic-native SSE | Drop-in for Claude Code | |
|---|---|---|---|---|
| Quench | Yes, with chain auto-fallback | Yes, per-slot, TTL-quenched, mid-stream-safe | Yes | Yes |
| BerriAI/litellm | BYO keys | Request-level routing, no per-slot chains | OpenAI-format-first, Anthropic via adapter | Indirect |
| Alishahryar1/free-claude-code (upstream) | Yes | Single value per slot | Yes | Yes |
| musistudio/claude-code-router | BYO keys | Intent-based routing, no quench TTL | Translates to OpenAI chat | Yes |
| fuergaosi233/claude-code-proxy | BYO keys | No | Chat-only, no Anthropic SSE | Yes |
Pick Quench when: you're hitting free-tier rate limits and want a configure-once-then-forget setup that survives provider hiccups.
Pick LiteLLM when: you have many paid keys and want OpenAI-format unification across the entire stack, with cost tracking and gateway features.
Pick claude-code-router when: you want intent-based routing (different model per task category: reasoning vs. background vs. web search).
+----------------------+ +-------------------+ +--------------------+
| Claude Code / SDK | | Quench proxy | | Upstream provider |
| (Anthropic format) | ----> | :8082 loopback | ----> | (NIM / OR / etc.) |
+----------------------+ +---------+---------+ +--------------------+
|
v
+--------------+
| Quench |
| registry | (TTL cooldowns)
+--------------+
Quench inspects each /v1/messages request, resolves the model chain, walks chain entries skipping quenched ones, and forwards to the first healthy upstream. Errors during preflight or before the first content chunk trigger transparent failover. After the first chunk reaches the client, errors gracefully end the stream.
The chain pump (api/services.py::ClaudeProxyService._chain_pump) is the load-bearing function; the quench registry (core/quench.py) is a thread-safe TTL map. State is process-local and never persisted.
Quench inherits the optional Discord and Telegram remote-coding bots from upstream. They let you drive a Claude CLI subprocess from chat, with tree-based threading and session persistence.
Set MESSAGING_PLATFORM in .env to discord, telegram, or none. See .env.example for the full set of *_BOT_TOKEN and channel-allowlist variables. The bot stack is opt-in; default is discord but disabled until a token is provided.
Quench targets Python 3.14 and uses uv for everything.
# install dev deps
uv sync
# format + lint + type check + test (in this order)
uv run ruff format .
uv run ruff check .
uv run ty check
uv run pytest -vCI enforces all four checks on every push and PR (see .github/workflows/tests.yml). A second workflow (fresh-clone.yml) validates that a brand-new clone runs setup.sh cleanly and that quench parses without syntax errors.
Smoke tests live under smoke/ and are opt-in via QUENCH_LIVE_SMOKE=1. They touch real providers, real bot APIs, or real local model servers.
- v0.2: VSCode extension with live failover view and per-request provider attribution.
- v0.2: Persistent quench registry (SQLite) so cooldowns survive restarts.
- v0.2:
quench-skillscompanion repo with curated prompt-engineering and automation skills. - Future: Routing telemetry to learn fastest healthy provider per task type.
Issues and PRs welcome. See CONTRIBUTING.md.
Quench is MIT-licensed and accepts contributions of all sizes. New provider adapters, new fallback strategies, bug fixes, docs, examples β all welcome. See CONTRIBUTING.md for the dev workflow and SECURITY.md for the threat model and disclosure policy.
Want to add a provider? Check providers/ for the existing adapter pattern. The BaseProvider ABC is small and well-commented.
Quench is forked from Alishahryar1/free-claude-code, which built the original Anthropic-to-OpenAI translation layer, the Discord/Telegram bot stack, and the per-Claude-model routing primitives. Upstream development paused, so Quench picks up day-to-day fixes and the chain-fallback feature in this fork.
Comparison table differentiates Quench from musistudio/claude-code-router and fuergaosi233/claude-code-proxy, which solve adjacent problems with different trade-offs.
MIT. Copyright (c) 2026 Ali Khokhar (original work) and Swayam Dash (Quench fork). See LICENSE.
