An experimental Anthropic-compatible local API server for Claude Code, backed by
mlx-lm.
Claude Code expects an Anthropic-style HTTP API. MLX does not provide a native
Claude Code route, so this proxy exposes a small subset of the Anthropic API and
calls mlx_lm locally.
Claude Code
-> Anthropic-compatible /v1/messages
-> anthropic-api-to-mlx
-> mlx_lm
-> local MLX model
This project is intended for local coding-agent experiments and benchmarking, not as a complete Anthropic API implementation.
- Anthropic-compatible POST /v1/messages
- Non-streaming and SSE streaming responses
- POST /v1/messages/count_tokens
- GET /v1/models, GET /v1/models/{id}, and GET /health
- Claude Code-oriented tool call parsing for common XML/JSON-ish model output
- Qwen chat-template rendering via the model tokenizer
- Optional Qwen reasoning mode while stripping <think>...</think> before returning content to Claude Code
- In-memory token-prefix prompt cache using mlx_lm prompt cache objects
- Request, cache, prefill, decode, and memory logging for benchmarks
Install dependencies with uv:
uv sync
Run the proxy:
uv run anthropic-api-to-mlx
By default it listens on:
http://127.0.0.1:8080
The default model is:
~/llm/models/qwen3.6-27b-nvfp4
Override it with MLX_MODEL:
MLX_MODEL=/path/to/mlx/model uv run anthropic-api-to-mlx
Point Claude Code at the local Anthropic-compatible endpoint. One common setup is:
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_AUTH_TOKEN=sk-local
Then run Claude Code normally from the workspace you want it to edit.
The proxy accepts either:
- Authorization: Bearer <ANTHROPIC_AUTH_TOKEN>
- x-api-key: <ANTHROPIC_AUTH_TOKEN>
Set ANTHROPIC_AUTH_TOKEN to an empty value if you want to disable local auth.
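The header rules above can be sketched as a small check. This is an illustrative sketch, not the proxy's actual code: the function names `extract_token` and `is_authorized` are hypothetical, and the constant-time comparison is an added assumption rather than something the project documents.

```python
import hmac
from typing import Mapping, Optional


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from 'Authorization: Bearer ...' or 'x-api-key', if present."""
    lowered = {key.lower(): value for key, value in headers.items()}
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[len("bearer "):].strip()
    api_key = lowered.get("x-api-key")
    return api_key.strip() if api_key else None


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Accept either header form; an empty expected token disables the check."""
    if not expected_token:
        return True
    presented = extract_token(headers)
    if presented is None:
        return False
    # Constant-time comparison (an assumption; the proxy may compare differently).
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

With the default token, both `{"Authorization": "Bearer sk-local"}` and `{"x-api-key": "sk-local"}` pass, and any request passes when the expected token is empty.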
Configuration is environment-based.
| Variable | Default | Description |
|---|---|---|
| HOST | 127.0.0.1 | Listen host |
| PORT | 8080 | Listen port |
| MLX_MODEL | ~/llm/models/qwen3.6-27b-nvfp4 | MLX model path |
| ANTHROPIC_AUTH_TOKEN | sk-local | Local auth token |
| MLX_MAX_TOKENS | 8192 | Server-side max output token cap |
| MLX_TEMPERATURE | 0.3 | Default sampling temperature |
| MLX_TOP_P | 0.95 | Default top-p |
| MLX_MAX_KV_SIZE | 196608 | KV cache context size |
| MLX_PREFILL_STEP_SIZE | 2048 | MLX prefill step size |
| PROMPT_CACHE | 1 | Enable in-memory prompt cache |
| PROMPT_CACHE_BYTES | 0 | Cache byte cap; 0 means no byte cap |
| CONTEXT_TTL_MS | 43200000 | Session mapping TTL |
| CONTEXT_MAX_SESSIONS | 500 | Max retained session/cache entries |
| MLX_ENABLE_THINKING | 1 | Pass Qwen enable_thinking to chat template |
| EXPOSE_THINKING | 0 | Return thinking blocks to client; usually keep off for Claude Code |
| MLX_TRUST_REMOTE_CODE | 1 | Tokenizer trust-remote-code setting |
| ANTHROPIC_VERSION | 2023-06-01 | Response header version |
| DEBUG_REQUEST_DIR | unset | Optional request summary logging directory |
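As a rough illustration of environment-based configuration, the sketch below reads a subset of the variables above with the documented defaults. The `ProxyConfig` class, `load_config`, and `env_flag` are hypothetical names, and the rule for which strings count as "enabled" is an assumption; the proxy's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def env_flag(value: str) -> bool:
    """Treat 1/true/yes/on as enabled (an assumed convention, not the proxy's spec)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ProxyConfig:
    host: str
    port: int
    model: str
    auth_token: str
    max_tokens: int
    temperature: float
    top_p: float
    max_kv_size: int
    prompt_cache: bool
    enable_thinking: bool
    expose_thinking: bool


def load_config(env: Mapping[str, str] = os.environ) -> ProxyConfig:
    """Build a config from environment variables, falling back to the table defaults."""
    return ProxyConfig(
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8080")),
        model=os.path.expanduser(env.get("MLX_MODEL", "~/llm/models/qwen3.6-27b-nvfp4")),
        auth_token=env.get("ANTHROPIC_AUTH_TOKEN", "sk-local"),
        max_tokens=int(env.get("MLX_MAX_TOKENS", "8192")),
        temperature=float(env.get("MLX_TEMPERATURE", "0.3")),
        top_p=float(env.get("MLX_TOP_P", "0.95")),
        max_kv_size=int(env.get("MLX_MAX_KV_SIZE", "196608")),
        prompt_cache=env_flag(env.get("PROMPT_CACHE", "1")),
        enable_thinking=env_flag(env.get("MLX_ENABLE_THINKING", "1")),
        expose_thinking=env_flag(env.get("EXPOSE_THINKING", "0")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to inspect, for example `load_config({"PORT": "9000"}).port`.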
Example benchmark-ish run:
HOST=127.0.0.1 \
PORT=8080 \
MLX_MODEL=~/llm/models/qwen3.6-35b-a3b-nvfp4 \
MLX_ENABLE_THINKING=1 \
MLX_MAX_KV_SIZE=196608 \
MLX_MAX_TOKENS=2048 \
PROMPT_CACHE=1 \
uv run anthropic-api-to-mlx
GET /health returns model, context, thinking, and prompt-cache status.
curl -s http://127.0.0.1:8080/health | jq
GET /v1/models returns the configured MLX_MODEL as the advertised model.
POST /v1/messages accepts a Claude/Anthropic-style messages request and returns an Anthropic-style
message response. Streaming is enabled with "stream": true.
Minimal example:
curl -s http://127.0.0.1:8080/v1/messages \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer sk-local' \
  -d '{
    "model": "local",
    "max_tokens": 128,
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }' | jq
POST /v1/messages/count_tokens renders the same chat prompt path and counts tokens with the loaded tokenizer.
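The same requests can be built from Python with only the standard library. This is a minimal sketch assuming the default host, port, and token shown earlier; `build_messages_request` and `send` are hypothetical helper names, and the request body mirrors the curl example rather than any documented client.

```python
import json
import urllib.request


def build_messages_request(
    prompt: str,
    base_url: str = "http://127.0.0.1:8080",
    token: str = "sk-local",
    max_tokens: int = 128,
    count_only: bool = False,
) -> urllib.request.Request:
    """Build a POST for /v1/messages, or /v1/messages/count_tokens when count_only is set."""
    path = "/v1/messages/count_tokens" if count_only else "/v1/messages"
    body = {"model": "local", "messages": [{"role": "user", "content": prompt}]}
    if not count_only:
        body["max_tokens"] = max_tokens
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response (non-streaming only)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the proxy running, `send(build_messages_request("Say hello in one short sentence."))` should return the message response as a dict. The long default timeout is a guess that allows for slow prefill on large prompts.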
Prompt caching is deliberately conservative.
- The request body is still treated as the source of truth.
- The proxy keeps in-memory cache slots keyed by Claude/session identity.
- A cache hit is only allowed when the current chat-template-rendered token IDs start with the cached token IDs exactly.
- Any token-prefix mismatch falls back to a full prefill.
- Cache hit/miss reason, cached token count, rest token count, prefill t/s, decode t/s, and peak memory are logged to stderr.
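The exact-prefix rule in the list above can be illustrated with a short sketch. The names `PrefillPlan` and `plan_prefill` and the miss reason strings are hypothetical. Treating a cache that covers the entire prompt as a miss is also an assumption made here for simplicity; the proxy may handle that case differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PrefillPlan:
    hit: bool
    reason: str
    cached_tokens: int
    rest_tokens: List[int]


def plan_prefill(cached_ids: Sequence[int], prompt_ids: Sequence[int]) -> PrefillPlan:
    """Reuse the cache only when prompt_ids starts with cached_ids exactly."""
    prompt = list(prompt_ids)
    n = len(cached_ids)
    if n == 0:
        return PrefillPlan(False, "miss:empty_cache", 0, prompt)
    if n >= len(prompt):
        # Assumption: at least one uncached token must remain to prefill.
        return PrefillPlan(False, "miss:cache_not_shorter_than_prompt", 0, prompt)
    if prompt[:n] != list(cached_ids):
        # Any mismatch inside the prefix falls back to a full prefill.
        return PrefillPlan(False, "miss:prefix_mismatch", 0, prompt)
    return PrefillPlan(True, "hit:token_prefix", n, prompt[n:])
```

For example, a cached sequence `[1, 2, 3]` against a prompt `[1, 2, 3, 4, 5]` yields a hit with three cached tokens and `[4, 5]` left to prefill, while `[1, 9, 3]` against the same prompt yields a full prefill.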
Example log lines:
[cache] request=... session=session:... hit=true reason=hit:session_slot_tokenized_prefix cached_tokens=28248 rest_tokens=283
[perf] {"request":"...","input_tokens":28531,"output_tokens":1171,"cached_tokens":28248,"rest_tokens":283,"prompt_tps":326.22,"generation_tps":25.09,"peak_memory_gb":26.60}
The cache is memory-only and is lost when the process exits.
Qwen reasoning models may emit <think>...</think> blocks. Claude Code can be
confused by provider-specific thinking text in normal content, so the proxy
strips those blocks by default before returning the final Anthropic response.
Set EXPOSE_THINKING=1 only if your client explicitly expects Anthropic-style thinking blocks.
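A minimal sketch of the stripping step, assuming the model wraps its reasoning in literal `<think>...</think>` tags. The function name `strip_thinking` is hypothetical. The handling of a stray closing tag is also an assumption: it covers chat templates that open the tag in the prompt, so the generated text contains only `</think>`.

```python
import re

# Non-greedy match so several separate think blocks are each removed.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks and return the remaining visible content."""
    cleaned = _THINK_BLOCK.sub("", text)
    # If only a closing tag remains, treat everything before it as reasoning.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    return cleaned.strip()
```

Text without any think tags passes through unchanged apart from whitespace trimming.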
Implemented endpoints:
- GET /health
- GET /v1/models
- GET /v1/models/{id}
- POST /v1/messages
- POST /v1/messages/count_tokens
Supported input content blocks:
- text
- tool_use
- tool_result
Tool-use output parsing is best-effort. It is designed around the kinds of tool-call text Qwen models tend to emit under Claude Code prompts, not around a formal model-side tool calling protocol.
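To give a sense of what best-effort parsing can look like, the sketch below handles one common shape only: a JSON object with `name` and `arguments` inside `<tool_call>` tags. That shape is an assumption for illustration; the proxy's real parser covers other XML/JSON-ish variants, and `parse_tool_calls` is a hypothetical name. Anything that fails to parse is left in the text rather than dropped.

```python
import json
import re
import uuid
from typing import Dict, List, Tuple

_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> Tuple[str, List[Dict]]:
    """Split model output into remaining text and Anthropic-style tool_use blocks."""
    blocks: List[Dict] = []

    def _convert(match: "re.Match[str]") -> str:
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            return match.group(0)  # keep unparseable text visible
        name = payload.get("name")
        arguments = payload.get("arguments", {})
        if not isinstance(name, str) or not isinstance(arguments, dict):
            return match.group(0)
        blocks.append(
            {
                "type": "tool_use",
                "id": f"toolu_{uuid.uuid4().hex[:24]}",  # locally generated id
                "name": name,
                "input": arguments,
            }
        )
        return ""

    remaining = _TOOL_CALL.sub(_convert, text).strip()
    return remaining, blocks
```

The returned blocks follow the `tool_use` content block layout (`type`, `id`, `name`, `input`), so they can be appended after any remaining text block in the response.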
- This is not a full Anthropic API implementation.
- Tool-use parsing is best-effort and model-dependent.
- The server is intentionally single-process/single-worker. MLX prompt cache objects are not safe to share across arbitrary worker threads.
- Prompt cache is in-memory only.
- Long-context coding-agent prompts can use a lot of memory during prefill.
- Some model/quantization artifacts may be broken under mlx_lm; always sanity check a model directly with mlx_lm.generate.
- Tested primarily with Claude Code local endpoint workflows.
Run tests:
uv run python -m unittest discover -s tests
Run a syntax smoke check:
uv run python -m py_compile src/anthropic_api_to_mlx/*.py
For reproducible benchmark writeups, record:
- proxy commit hash
- model path and quantization
- MLX_MAX_KV_SIZE
- MLX_MAX_TOKENS
- thinking on/off
- prompt cache on/off
- wall time
- tool calls and turns
- cache hit/miss counts
- average prefill/decode throughput
- hidden-test pass/fail status
The proxy logs enough timing/cache data to support those fields.
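As one way to turn the stderr log into those fields, the sketch below counts cache hits and misses and averages throughput. It assumes the `[cache]` and `[perf]` line shapes shown in the example log lines earlier; `summarize_log` is a hypothetical helper, not part of the project.

```python
import json
from typing import Dict, Iterable, List


def _average(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def summarize_log(lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate cache hit/miss counts and average prefill/decode throughput."""
    hits = misses = 0
    prompt_tps: List[float] = []
    generation_tps: List[float] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("[cache] "):
            # Space-separated key=value pairs, e.g. hit=true cached_tokens=28248.
            pairs = (part.split("=", 1) for part in line[len("[cache] "):].split() if "=" in part)
            fields = dict(pairs)
            if fields.get("hit") == "true":
                hits += 1
            else:
                misses += 1
        elif line.startswith("[perf] "):
            record = json.loads(line[len("[perf] "):])
            prompt_tps.append(float(record["prompt_tps"]))
            generation_tps.append(float(record["generation_tps"]))
    return {
        "cache_hits": hits,
        "cache_misses": misses,
        "avg_prompt_tps": _average(prompt_tps),
        "avg_generation_tps": _average(generation_tps),
    }
```

Capturing stderr to a file during a run and passing its lines to `summarize_log` gives the cache and throughput fields; the remaining fields (commit hash, model path, wall time) still need to be recorded separately.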
MIT