audexdev/anthropic-api-to-mlx

anthropic-api-to-mlx

An experimental Anthropic-compatible local API server for Claude Code, backed by mlx-lm.

Claude Code expects an Anthropic-style HTTP API. MLX does not provide a native Claude Code route, so this proxy exposes a small subset of the Anthropic API and calls mlx_lm locally.

Claude Code
  -> Anthropic-compatible /v1/messages
  -> anthropic-api-to-mlx
  -> mlx_lm
  -> local MLX model

This project is intended for local coding-agent experiments and benchmarking, not as a complete Anthropic API implementation.

Features

  • Anthropic-compatible POST /v1/messages
  • Non-streaming and SSE streaming responses
  • POST /v1/messages/count_tokens
  • GET /v1/models, GET /v1/models/{id}, and GET /health
  • Claude Code-oriented tool call parsing for common XML/JSON-ish model output
  • Qwen chat-template rendering via the model tokenizer
  • Optional Qwen reasoning mode, with <think>...</think> blocks stripped before content is returned to Claude Code
  • In-memory token-prefix prompt cache using mlx_lm prompt cache objects
  • Request, cache, prefill, decode, and memory logging for benchmarks

Quick Start

Install dependencies with uv:

uv sync

Run the proxy:

uv run anthropic-api-to-mlx

By default it listens on:

http://127.0.0.1:8080

The default model is:

~/llm/models/qwen3.6-27b-nvfp4

Override it with MLX_MODEL:

MLX_MODEL=/path/to/mlx/model uv run anthropic-api-to-mlx

Claude Code Setup

Point Claude Code at the local Anthropic-compatible endpoint. One common setup is:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_AUTH_TOKEN=sk-local

Then run Claude Code normally from the workspace you want it to edit.

The proxy accepts either:

  • Authorization: Bearer <ANTHROPIC_AUTH_TOKEN>
  • x-api-key: <ANTHROPIC_AUTH_TOKEN>

Set ANTHROPIC_AUTH_TOKEN to an empty value if you want to disable local auth.
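As an illustration of the two accepted header forms, here is a minimal sketch of a check that honors either one. The helper name `is_authorized` is hypothetical and is not taken from the project's source; the sketch also assumes that an empty expected token means auth is disabled, as described above.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Return True if either supported header carries the expected token."""
    if not expected_token:
        # Assumption: an empty token means local auth is disabled.
        return True
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    candidates = []
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        candidates.append(auth[len("bearer "):].strip())
    if "x-api-key" in lowered:
        candidates.append(lowered["x-api-key"].strip())
    # compare_digest avoids leaking token contents through timing differences.
    return any(
        hmac.compare_digest(c.encode("utf-8"), expected_token.encode("utf-8"))
        for c in candidates
    )
```

With the default token, both `{"Authorization": "Bearer sk-local"}` and `{"x-api-key": "sk-local"}` would pass this check.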

Configuration

Configuration is environment-based.

Variable               Default                         Description
HOST                   127.0.0.1                       Listen host
PORT                   8080                            Listen port
MLX_MODEL              ~/llm/models/qwen3.6-27b-nvfp4  MLX model path
ANTHROPIC_AUTH_TOKEN   sk-local                        Local auth token
MLX_MAX_TOKENS         8192                            Server-side max output token cap
MLX_TEMPERATURE        0.3                             Default sampling temperature
MLX_TOP_P              0.95                            Default top-p
MLX_MAX_KV_SIZE        196608                          KV cache context size, in tokens
MLX_PREFILL_STEP_SIZE  2048                            MLX prefill step size
PROMPT_CACHE           1                               Enable in-memory prompt cache
PROMPT_CACHE_BYTES     0                               Cache byte cap; 0 means no byte cap
CONTEXT_TTL_MS         43200000                        Session mapping TTL, in milliseconds (12 hours)
CONTEXT_MAX_SESSIONS   500                             Max retained session/cache entries
MLX_ENABLE_THINKING    1                               Pass Qwen enable_thinking to chat template
EXPOSE_THINKING        0                               Return thinking blocks to client; usually keep off for Claude Code
MLX_TRUST_REMOTE_CODE  1                               Tokenizer trust-remote-code setting
ANTHROPIC_VERSION      2023-06-01                      Response header version
DEBUG_REQUEST_DIR      unset                           Optional request summary logging directory
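To show how environment-based settings like these can be read with their defaults, here is a small sketch covering a subset of the variables. `ProxyConfig`, `load_config`, and `_flag` are illustrative names, and the rule used for boolean flags (empty, "0", or "false" means off) is an assumption rather than the proxy's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Assumption: "", "0", and "false" disable a flag; anything else enables it.
    return env.get(name, default).strip().lower() not in ("", "0", "false")


@dataclass(frozen=True)
class ProxyConfig:
    host: str
    port: int
    model: str
    max_tokens: int
    temperature: float
    prompt_cache: bool


def load_config(env: Mapping[str, str] = os.environ) -> ProxyConfig:
    """Read a subset of the documented variables, falling back to the defaults."""
    return ProxyConfig(
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8080")),
        # Expand "~" so the default model path resolves under the home directory.
        model=os.path.expanduser(env.get("MLX_MODEL", "~/llm/models/qwen3.6-27b-nvfp4")),
        max_tokens=int(env.get("MLX_MAX_TOKENS", "8192")),
        temperature=float(env.get("MLX_TEMPERATURE", "0.3")),
        prompt_cache=_flag(env, "PROMPT_CACHE", "1"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"PORT": "9000"})`.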

Example benchmark-style run:

HOST=127.0.0.1 \
PORT=8080 \
MLX_MODEL=~/llm/models/qwen3.6-35b-a3b-nvfp4 \
MLX_ENABLE_THINKING=1 \
MLX_MAX_KV_SIZE=196608 \
MLX_MAX_TOKENS=2048 \
PROMPT_CACHE=1 \
uv run anthropic-api-to-mlx

Endpoints

GET /health

Returns model, context, thinking, and prompt-cache status.

curl -s http://127.0.0.1:8080/health | jq

GET /v1/models

Returns the configured MLX_MODEL as the advertised model.

POST /v1/messages

Accepts a Claude/Anthropic-style messages request and returns an Anthropic-style message response. Streaming is enabled with "stream": true.

Minimal example:

curl -s http://127.0.0.1:8080/v1/messages \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer sk-local' \
  -d '{
    "model": "local",
    "max_tokens": 128,
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }' | jq
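When "stream": true is set, the response arrives as server-sent events. The sketch below collects text from such a stream, assuming the proxy emits the standard Anthropic event shape (content_block_delta events carrying a text_delta). `iter_text_deltas` is a hypothetical helper; it takes already-decoded lines, so it can be fed from any HTTP client.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from Anthropic-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        if event.get("type") != "content_block_delta":
            continue  # message_start, ping, message_stop, and so on
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hand-written sample lines, not captured from a real run.
sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
reply = "".join(iter_text_deltas(sample))
```

Tool-use deltas and other event types are ignored here; a fuller client would also need to handle them.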

POST /v1/messages/count_tokens

Renders the request through the same chat-template path as POST /v1/messages and counts the resulting tokens with the loaded tokenizer.
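A request to this endpoint can be built with the standard library alone. The sketch below assumes the body mirrors the Anthropic count_tokens shape (model plus messages); `build_count_tokens_request` is a hypothetical helper, and the exact fields the proxy reads are not confirmed here.

```python
import json
import urllib.request
from typing import List


def build_count_tokens_request(
    base_url: str, token: str, messages: List[dict], model: str = "local"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/messages/count_tokens."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages/count_tokens",
        data=body,
        method="POST",
        headers={"content-type": "application/json", "x-api-key": token},
    )
```

With the proxy running, `urllib.request.urlopen(req)` would send the request. If the response follows the Anthropic shape, the count should appear under an `input_tokens` key.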

Prompt Cache

Prompt caching is deliberately conservative.

  • The request body is still treated as the source of truth.
  • The proxy keeps in-memory cache slots keyed by Claude/session identity.
  • A cache hit is only allowed when the current chat-template-rendered token IDs start with the cached token IDs exactly.
  • Any token-prefix mismatch falls back to a full prefill.
  • Cache hit/miss reason, cached token count, rest token count, prefill t/s, decode t/s, and peak memory are logged to stderr.
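The exact-prefix rule above can be sketched in a few lines. This is an illustration of the idea and not the project's code; `plan_prefill` and its return shape are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def plan_prefill(
    cached_ids: Optional[Sequence[int]], prompt_ids: Sequence[int]
) -> Tuple[bool, int, int]:
    """Return (hit, cached_tokens, rest_tokens) for a rendered prompt."""
    if not cached_ids:
        return (False, 0, len(prompt_ids))
    n = len(cached_ids)
    # A hit requires the new prompt to start with the cached token IDs exactly.
    if n > len(prompt_ids) or list(prompt_ids[:n]) != list(cached_ids):
        return (False, 0, len(prompt_ids))  # any mismatch means a full prefill
    return (True, n, len(prompt_ids) - n)
```

On a hit, only the trailing `rest_tokens` need prefill. A real implementation may also need to handle the case where the cached IDs cover the whole prompt and leave nothing to process, which this sketch does not address.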

Example log lines:

[cache] request=... session=session:... hit=true reason=hit:session_slot_tokenized_prefix cached_tokens=28248 rest_tokens=283
[perf] {"request":"...","input_tokens":28531,"output_tokens":1171,"cached_tokens":28248,"rest_tokens":283,"prompt_tps":326.22,"generation_tps":25.09,"peak_memory_gb":26.60}

The cache is memory-only and is lost when the process exits.
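Because the [perf] lines shown above carry a JSON payload, they are straightforward to post-process. The following sketch parses one line and derives the fraction of input tokens served from cache; both helper names are hypothetical, and the field names are taken from the example log line.

```python
import json
from typing import Optional


def parse_perf_line(line: str) -> Optional[dict]:
    """Parse a '[perf] {...}' stderr line; return None for anything else."""
    prefix = "[perf] "
    if not line.startswith(prefix):
        return None
    try:
        record = json.loads(line[len(prefix):])
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def cached_fraction(record: dict) -> float:
    """Share of input tokens that came from the prompt cache."""
    total = record.get("input_tokens", 0)
    return record.get("cached_tokens", 0) / total if total else 0.0
```

For the example line above, `cached_tokens` plus `rest_tokens` equals `input_tokens` (28248 + 283 = 28531), so roughly 99% of the prompt was reused.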

Thinking Blocks

Qwen reasoning models may emit <think>...</think> blocks. Claude Code can be confused by provider-specific thinking text in normal content, so the proxy strips those blocks by default before returning the final Anthropic response.

Use:

EXPOSE_THINKING=1

only if your client explicitly expects Anthropic-style thinking blocks.
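A rough sketch of the stripping step, assuming the model wraps its reasoning in literal <think> and </think> tags. `strip_thinking` is a hypothetical helper, and the handling of an unterminated block is a design choice made for this example.

```python
import re

# DOTALL lets "." match newlines, since reasoning usually spans several lines.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from model output."""
    cleaned = _THINK_RE.sub("", text)
    # If generation was cut off inside a block, drop the dangling remainder.
    open_idx = cleaned.find("<think>")
    if open_idx != -1:
        cleaned = cleaned[:open_idx]
    return cleaned.strip()
```

Output that contains only a closing </think> tag, which some chat templates can produce, is not handled by this sketch.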

Compatibility Scope

Implemented endpoints:

  • GET /health
  • GET /v1/models
  • GET /v1/models/{id}
  • POST /v1/messages
  • POST /v1/messages/count_tokens

Supported input content blocks:

  • text
  • tool_use
  • tool_result

Tool-use output parsing is best-effort. It is designed around the kinds of tool-call text Qwen models tend to emit under Claude Code prompts, not around a formal model-side tool calling protocol.
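To make "best-effort" concrete, the sketch below handles one common shape: a JSON object with "name" and "arguments" keys wrapped in <tool_call> tags. That shape is an assumption about typical Qwen output; the formats the proxy actually accepts may be broader. `extract_tool_uses` is a hypothetical name, and the generated IDs are placeholders.

```python
import json
import re
import uuid
from typing import List

_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_uses(text: str) -> List[dict]:
    """Convert <tool_call> JSON payloads into Anthropic-style tool_use blocks."""
    blocks = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # best-effort: skip payloads that are not valid JSON
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            continue
        args = call.get("arguments", {})
        blocks.append({
            "type": "tool_use",
            "id": "toolu_" + uuid.uuid4().hex[:24],  # placeholder ID format
            "name": call["name"],
            "input": args if isinstance(args, dict) else {},
        })
    return blocks
```

Malformed payloads are dropped silently here. Whether to drop them, repair them, or return them as plain text is a policy choice, and the behavior depends on the model.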

Known Issues

  • This is not a full Anthropic API implementation.
  • Tool-use parsing is best-effort and model-dependent.
  • The server is intentionally single-process/single-worker. MLX prompt cache objects are not safe to share across arbitrary worker threads.
  • Prompt cache is in-memory only.
  • Long-context coding-agent prompts can use a lot of memory during prefill.
  • Some model/quantization artifacts may be broken under mlx_lm; always sanity-check a model directly with mlx_lm.generate before benchmarking it through the proxy.
  • Tested primarily with Claude Code local endpoint workflows.

Development

Run tests:

uv run python -m unittest discover -s tests

Run a syntax smoke check:

uv run python -m py_compile src/anthropic_api_to_mlx/*.py

Benchmark Notes

For reproducible benchmark writeups, record:

  • proxy commit hash
  • model path and quantization
  • MLX_MAX_KV_SIZE
  • MLX_MAX_TOKENS
  • thinking on/off
  • prompt cache on/off
  • wall time
  • tool calls and turns
  • cache hit/miss counts
  • average prefill/decode throughput
  • hidden-test pass/fail status

The proxy logs enough timing/cache data to support those fields.
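As one way to turn per-request records into the summary fields listed above, here is a small aggregation sketch. It takes dicts shaped like the [perf] JSON payload; `summarize_perf` is a hypothetical helper, and counting any request with cached_tokens > 0 as a cache hit is an assumption.

```python
from statistics import mean
from typing import Iterable


def summarize_perf(records: Iterable[dict]) -> dict:
    """Aggregate per-request perf records into benchmark summary fields."""
    rows = list(records)
    if not rows:
        return {"requests": 0}
    return {
        "requests": len(rows),
        # Assumption: a request counts as a hit if any tokens came from cache.
        "cache_hits": sum(1 for r in rows if r.get("cached_tokens", 0) > 0),
        "avg_prompt_tps": round(mean(r["prompt_tps"] for r in rows), 2),
        "avg_generation_tps": round(mean(r["generation_tps"] for r in rows), 2),
        "max_peak_memory_gb": max(r["peak_memory_gb"] for r in rows),
    }
```

Wall time, turn counts, and hidden-test results are not part of the perf payload and would need to be recorded separately.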

License

MIT
