anthropic-api-to-mlx

An experimental Anthropic-compatible local API server for Claude Code, backed by mlx-lm.

Claude Code expects an Anthropic-style HTTP API. mlx_lm does not expose an endpoint that Claude Code can talk to directly, so this proxy implements a small subset of the Anthropic API and calls mlx_lm locally.

Claude Code
  -> Anthropic-compatible /v1/messages
  -> anthropic-api-to-mlx
  -> mlx_lm
  -> local MLX model

This project is intended for local coding-agent experiments and benchmarking, not as a complete Anthropic API implementation.

Features

Anthropic-compatible POST /v1/messages
Non-streaming and SSE streaming responses
POST /v1/messages/count_tokens
GET /v1/models, GET /v1/models/{id}, and GET /health
Claude Code-oriented tool call parsing for common XML/JSON-ish model output
Qwen chat-template rendering via the model tokenizer
Optional Qwen reasoning mode, with <think>...</think> blocks stripped before content is returned to Claude Code
In-memory token-prefix prompt cache using mlx_lm prompt cache objects
Request, cache, prefill, decode, and memory logging for benchmarks

Quick Start

Install dependencies with uv:

uv sync

Run the proxy:

uv run anthropic-api-to-mlx

By default it listens on:

http://127.0.0.1:8080

The default model is:

~/llm/models/qwen3.6-27b-nvfp4

Override it with MLX_MODEL:

MLX_MODEL=/path/to/mlx/model uv run anthropic-api-to-mlx

Claude Code Setup

Point Claude Code at the local Anthropic-compatible endpoint. One common setup is:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_AUTH_TOKEN=sk-local

Then run Claude Code normally from the workspace you want it to edit.

The proxy accepts either:

Authorization: Bearer <ANTHROPIC_AUTH_TOKEN>
x-api-key: <ANTHROPIC_AUTH_TOKEN>

Set ANTHROPIC_AUTH_TOKEN to an empty value if you want to disable local auth.
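
For scripted checks outside Claude Code, the two accepted header forms can be exercised from Python. The following is a minimal sketch, assuming the default listen address and the default `sk-local` token; `build_messages_request` is a hypothetical helper name, not part of this project. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # default listen address (assumed unchanged)
TOKEN = "sk-local"                  # default ANTHROPIC_AUTH_TOKEN (assumed unchanged)


def build_messages_request(prompt, use_bearer=True, max_tokens=128):
    """Build (but do not send) a POST /v1/messages request.

    use_bearer=True  -> Authorization: Bearer <token>
    use_bearer=False -> x-api-key: <token>
    """
    body = {
        "model": "local",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"content-type": "application/json"}
    if use_bearer:
        headers["authorization"] = f"Bearer {TOKEN}"
    else:
        headers["x-api-key"] = TOKEN
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

With the proxy running, `urllib.request.urlopen(build_messages_request("hello"))` should return the JSON message response; either header form is expected to authenticate the same way.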

Configuration

Configuration is environment-based.

Variable	Default	Description
`HOST`	`127.0.0.1`	Listen host
`PORT`	`8080`	Listen port
`MLX_MODEL`	`~/llm/models/qwen3.6-27b-nvfp4`	MLX model path
`ANTHROPIC_AUTH_TOKEN`	`sk-local`	Local auth token
`MLX_MAX_TOKENS`	`8192`	Server-side max output token cap
`MLX_TEMPERATURE`	`0.3`	Default sampling temperature
`MLX_TOP_P`	`0.95`	Default top-p
`MLX_MAX_KV_SIZE`	`196608`	KV cache context size
`MLX_PREFILL_STEP_SIZE`	`2048`	MLX prefill step size
`PROMPT_CACHE`	`1`	Enable in-memory prompt cache
`PROMPT_CACHE_BYTES`	`0`	Cache byte cap; `0` means no byte cap
`CONTEXT_TTL_MS`	`43200000`	Session mapping TTL
`CONTEXT_MAX_SESSIONS`	`500`	Max retained session/cache entries
`MLX_ENABLE_THINKING`	`1`	Pass Qwen `enable_thinking` to chat template
`EXPOSE_THINKING`	`0`	Return thinking blocks to client; usually keep off for Claude Code
`MLX_TRUST_REMOTE_CODE`	`1`	Tokenizer trust-remote-code setting
`ANTHROPIC_VERSION`	`2023-06-01`	Response header version
`DEBUG_REQUEST_DIR`	unset	Optional request summary logging directory

Example benchmark-ish run:

HOST=127.0.0.1 \
PORT=8080 \
MLX_MODEL=~/llm/models/qwen3.6-35b-a3b-nvfp4 \
MLX_ENABLE_THINKING=1 \
MLX_MAX_KV_SIZE=196608 \
MLX_MAX_TOKENS=2048 \
PROMPT_CACHE=1 \
uv run anthropic-api-to-mlx
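
To make the environment-based approach concrete, here is a rough sketch of how a handful of the variables from the table could be read with their documented defaults. The `Settings` class, `load_settings`, and the exact rule for interpreting `0`/`1` flags are assumptions for illustration; the project's real loader may differ.

```python
import os
from dataclasses import dataclass


def _flag(env, name, default):
    """Read a 0/1-style switch. Assumption: "", "0", "false", "no", "off" mean off."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("", "0", "false", "no", "off")


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    model: str
    max_tokens: int
    temperature: float
    prompt_cache: bool
    enable_thinking: bool


def load_settings(env=None):
    """Build Settings from a mapping (os.environ by default), using the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8080")),
        model=os.path.expanduser(env.get("MLX_MODEL", "~/llm/models/qwen3.6-27b-nvfp4")),
        max_tokens=int(env.get("MLX_MAX_TOKENS", "8192")),
        temperature=float(env.get("MLX_TEMPERATURE", "0.3")),
        prompt_cache=_flag(env, "PROMPT_CACHE", True),
        enable_thinking=_flag(env, "MLX_ENABLE_THINKING", True),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_settings({"PORT": "9000"})`.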

Endpoints

`GET /health`

Returns model, context, thinking, and prompt-cache status.

curl -s http://127.0.0.1:8080/health | jq

`GET /v1/models`

Returns the configured MLX_MODEL as the advertised model.

`POST /v1/messages`

Accepts a Claude/Anthropic-style messages request and returns an Anthropic-style message response. Streaming is enabled with "stream": true.

Minimal example:

curl -s http://127.0.0.1:8080/v1/messages \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer sk-local' \
  -d '{
    "model": "local",
    "max_tokens": 128,
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }' | jq
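
When "stream": true is set, the response arrives as server-sent events rather than one JSON body. The sketch below shows one way a client could reassemble the text, assuming the proxy emits `content_block_delta` events with `text_delta` payloads as in Anthropic's public streaming format. `iter_sse_events` and `collect_text` are hypothetical helper names, and other event types are simply ignored.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data_dict) pairs from an iterable of SSE text lines."""
    event, data_parts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        # Anything else (for example ":" comment lines) is skipped.
    if data_parts:
        # Flush a final event that was not followed by a blank line.
        yield event, json.loads("\n".join(data_parts))


def collect_text(lines):
    """Concatenate the text of all text_delta chunks in the stream."""
    chunks = []
    for _event, data in iter_sse_events(lines):
        if data.get("type") == "content_block_delta":
            delta = data.get("delta", {})
            if delta.get("type") == "text_delta":
                chunks.append(delta.get("text", ""))
    return "".join(chunks)
```

The functions take any iterable of decoded lines, so they work equally on a captured `curl -N` transcript or on lines decoded from a live HTTP response.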

`POST /v1/messages/count_tokens`

Renders the request through the same chat-template prompt path as POST /v1/messages and counts the resulting tokens with the loaded tokenizer.

Prompt Cache

Prompt caching is deliberately conservative.

The request body is still treated as the source of truth.
The proxy keeps in-memory cache slots keyed by Claude/session identity.
A cache hit is only allowed when the current chat-template-rendered token IDs start with the cached token IDs exactly.
Any token-prefix mismatch falls back to a full prefill.
Cache hit/miss reason, cached token count, rest token count, prefill t/s, decode t/s, and peak memory are logged to stderr.
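
The exact-prefix rule above can be sketched in a few lines. This is an illustration of the rule as described, not the proxy's actual code; `match_token_prefix` is a hypothetical name, and a real implementation would likely also need to handle the case where the new prompt equals the cached one and nothing remains to prefill.

```python
def match_token_prefix(cached_ids, current_ids):
    """Return (hit, cached_tokens, rest_tokens) for a token-prefix cache check.

    A hit requires current_ids to start with cached_ids exactly.
    Any mismatch is reported as a miss, meaning a full prefill.
    """
    n = len(cached_ids)
    total = len(current_ids)
    if n == 0 or n > total:
        return False, 0, total
    if list(current_ids[:n]) != list(cached_ids):
        return False, 0, total
    return True, n, total - n
```

On a hit only the `rest_tokens` tail needs prefilling; on a miss the whole prompt does. Comparing token IDs rather than raw text means a changed chat-template rendering also invalidates the cache.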

Example log lines:

[cache] request=... session=session:... hit=true reason=hit:session_slot_tokenized_prefix cached_tokens=28248 rest_tokens=283
[perf] {"request":"...","input_tokens":28531,"output_tokens":1171,"cached_tokens":28248,"rest_tokens":283,"prompt_tps":326.22,"generation_tps":25.09,"peak_memory_gb":26.60}

The cache is memory-only and is lost when the process exits.
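
Since the [perf] lines carry a JSON payload, they are easy to post-process. A small sketch, assuming the line shape shown in the example above (a literal `[perf] ` prefix followed by one JSON object); `parse_perf_line` and `cached_fraction` are hypothetical helper names.

```python
import json

PERF_PREFIX = "[perf] "


def parse_perf_line(line):
    """Return the JSON payload of a [perf] log line as a dict, or None for other lines."""
    line = line.strip()
    if not line.startswith(PERF_PREFIX):
        return None
    return json.loads(line[len(PERF_PREFIX):])


def cached_fraction(record):
    """Share of input tokens served from the prompt cache (0.0 when input is empty)."""
    total = record.get("input_tokens", 0)
    return record.get("cached_tokens", 0) / total if total else 0.0
```

In the example line, `input_tokens` equals `cached_tokens` plus `rest_tokens` (28248 + 283 = 28531), which gives a quick consistency check when filtering captured stderr.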

Thinking Blocks

Qwen reasoning models may emit <think>...</think> blocks. Claude Code can be confused by provider-specific thinking text in normal content, so the proxy strips those blocks by default before returning the final Anthropic response.

Use:

EXPOSE_THINKING=1

only if your client explicitly expects Anthropic-style thinking blocks.
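
For reference, the default stripping behavior can be approximated with a single regular expression. This is a simplified sketch, not the proxy's implementation: `strip_thinking` is a hypothetical name, and the pattern only removes well-formed, closed <think>...</think> pairs.

```python
import re

# Non-greedy match so that several separate think blocks are each removed.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text):
    """Remove closed <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", text).strip()
```

An unclosed <think> tag (for example when generation hits the token cap mid-thought) is left untouched by this pattern, so a production version would need an extra rule for that case.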

Compatibility Scope

Implemented endpoints:

GET /health
GET /v1/models
GET /v1/models/{id}
POST /v1/messages
POST /v1/messages/count_tokens

Supported input content blocks:

text
tool_use
tool_result

Tool-use output parsing is best-effort. It is designed around the kinds of tool-call text Qwen models tend to emit under Claude Code prompts, not around a formal model-side tool calling protocol.
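
To give a feel for what best-effort parsing means here, the sketch below handles just one shape: a JSON object wrapped in <tool_call>...</tool_call> tags, a format Qwen-family chat templates commonly use. Treat all of it as an assumption: the proxy's real parser covers more variants, and `extract_tool_uses` and the `toolu_`-prefixed ID scheme are illustrative choices, not taken from the source.

```python
import json
import re
import uuid

_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_uses(text):
    """Split model output into (remaining_text, list of Anthropic-style tool_use blocks)."""
    blocks = []

    def _replace(match):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            return match.group(0)  # best-effort: leave unparseable text in place
        if not isinstance(call, dict) or "name" not in call:
            return match.group(0)
        blocks.append({
            "type": "tool_use",
            "id": "toolu_" + uuid.uuid4().hex[:24],  # illustrative ID format
            "name": call["name"],
            "input": call.get("arguments", {}),
        })
        return ""  # drop the parsed call from the visible text

    remaining = _TOOL_CALL_RE.sub(_replace, text).strip()
    return remaining, blocks
```

Leaving unparseable spans in the text, rather than dropping them, keeps malformed model output visible in the response so it can be debugged.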

Known Issues

This is not a full Anthropic API implementation.
Tool-use parsing is best-effort and model-dependent.
The server is intentionally single-process/single-worker. MLX prompt cache objects are not safe to share across arbitrary worker threads.
Prompt cache is in-memory only.
Long-context coding-agent prompts can use a lot of memory during prefill.
Some model/quantization artifacts may be broken under mlx_lm; always sanity check a model directly with mlx_lm.generate.
Tested primarily with Claude Code local endpoint workflows.

Development

Run tests:

uv run python -m unittest discover -s tests

Run a syntax smoke check:

uv run python -m py_compile src/anthropic_api_to_mlx/*.py

Benchmark Notes

For reproducible benchmark writeups, record:

proxy commit hash
model path and quantization
MLX_MAX_KV_SIZE
MLX_MAX_TOKENS
thinking on/off
prompt cache on/off
wall time
tool calls and turns
cache hit/miss counts
average prefill/decode throughput
hidden-test pass/fail status

The proxy logs enough timing/cache data to support those fields.
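
As one possible way to fill in the cache and throughput fields, the sketch below aggregates already-parsed [perf] records (dicts with the keys seen in the example log line). `summarize_perf` is a hypothetical helper, and counting a request as a cache hit whenever `cached_tokens` is greater than zero is an assumption; the [cache] lines carry the authoritative hit flag.

```python
from statistics import mean


def summarize_perf(records):
    """Aggregate per-request [perf] dicts into benchmark-level numbers."""
    if not records:
        return {"requests": 0}
    # Assumption: any request that reused cached tokens counts as a hit.
    hits = sum(1 for r in records if r.get("cached_tokens", 0) > 0)
    return {
        "requests": len(records),
        "cache_hits": hits,
        "cache_misses": len(records) - hits,
        # Unweighted per-request means; a token-weighted mean is another valid choice.
        "avg_prompt_tps": mean(r["prompt_tps"] for r in records),
        "avg_generation_tps": mean(r["generation_tps"] for r in records),
        "total_output_tokens": sum(r["output_tokens"] for r in records),
    }
```

Whichever averaging convention you pick, state it in the writeup, since short cached requests can skew an unweighted prefill average.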

License

MIT
