DLRouter

DLRouter is an OpenAI-compatible inference gateway for large language model backends. It routes requests across LMDeploy, vLLM, and SGLang instances with pluggable routing strategies, runtime node management, health checks, and Prefill/Decode disaggregation support.

Use DLRouter when you want one API endpoint in front of multiple LLM serving nodes, while keeping backend-specific DistServe / PD orchestration out of your application code.

Highlights

OpenAI-compatible API: /v1/models, /v1/chat/completions, and /v1/completions.
Multiple routing policies: round-robin, weighted random, consistent hash, latency-aware routing, and prefix-cache-aware routing.
Multi-backend support: LMDeploy, vLLM, and SGLang through a pluggable backend adapter interface.
DistServe / PD disaggregation: backend-owned Prefill/Decode flows for LMDeploy, vLLM, and SGLang.
Dynamic node management: register, remove, inspect, and terminate backend nodes through REST APIs.
Health checking and lazy model discovery: unhealthy nodes are removed after consecutive failures, and model lists can be discovered after a backend becomes ready.
Optional authentication and TLS: Bearer-token API keys and SSL/TLS support are available through CLI and environment configuration.

Supported Backends

Backend	Hybrid forwarding	DistServe / PD	Discovery modes	Notes
LMDeploy	Yes	Yes	External node registration	Uses LMDeploy PD connection pool and RDMA migration when available.
vLLM	Yes	Yes	Static, heartbeat	Supports two-stage KV transfer and static NIXL DP-aware rank routing.
SGLang	Yes	Yes	Static	Uses bootstrap dual dispatch with aligned prefill bootstrap ports.

DLRouter is configured with one backend type per router process through --backend. Run multiple router processes if you need separate backend types at the same time.

Installation

pip install -e .

For development:

pip install -e ".[dev]"

Python 3.9 or newer is required.

Quick Start

This example starts DLRouter in vLLM hybrid mode, registers one vLLM server, and sends an OpenAI-compatible chat request through DLRouter.

Start a vLLM server:

vllm serve /path/to/model \
  --host 0.0.0.0 \
  --port 8100 \
  --served-model-name Qwen3-4B

# For single-node setups without Ray, add:
#   --distributed-executor-backend mp

Start DLRouter:

python -m dlrouter \
  --serving_strategy hybrid \
  --backend vllm

Register the backend node:

curl -X POST http://localhost:8000/nodes/add \
  -H "Content-Type: application/json" \
  -d '{"url": "http://127.0.0.1:8100"}'

Send a request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

DLRouter also installs a dlrouter console script, so dlrouter ... is equivalent to python -m dlrouter ... after installation.
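
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the router listens on localhost:8000 as in the curl example, and the helper names `build_chat_request` and `send` are illustrative rather than part of DLRouter. The `Authorization` header is only relevant when the router was started with `--api_keys`.

```python
import json
import urllib.request

# Assumption: DLRouter listens on localhost:8000, as in the curl example above.
ROUTER_URL = "http://localhost:8000"


def build_chat_request(model, prompt, api_key=None, base_url=ROUTER_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Only needed when the router was started with --api_keys.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a router and backend running, `send(build_chat_request("Qwen3-4B", "Hello!"))` should return the decoded JSON response body.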

Common Usage

Routing Strategies

python -m dlrouter \
  --backend vllm \
  --serving_strategy hybrid \
  --routing_strategy min_expected_latency

Available strategies:

Strategy	Description
`round_robin`	Sequentially cycle through nodes serving the requested model.
`random`	Weighted random selection. Nodes reporting higher speed receive more traffic.
`consistent_hash`	Route requests with the same key to the same node for affinity or cache locality.
`min_expected_latency`	Select the node with the lowest estimated latency: `unfinished_requests / speed`.
`min_observed_latency`	Select the node with the lowest recent average latency.
`prefix_cache`	Route by KV cache prefix locality to improve cache hit rate.
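
To make the `min_expected_latency` formula concrete, here is a small sketch of the selection rule `unfinished_requests / speed`. The `NodeStats` class, its field names, and the handling of a zero speed are assumptions made for illustration; DLRouter's actual data structures may differ.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    # Hypothetical per-node state; field names are illustrative.
    url: str
    unfinished_requests: int
    speed: float


def pick_min_expected_latency(nodes):
    """Return the node that minimizes unfinished_requests / speed."""
    if not nodes:
        raise ValueError("no nodes available for the requested model")

    def expected_latency(node):
        # Assumption: a zero or negative speed means "unknown", so rank it last.
        if node.speed <= 0:
            return float("inf")
        return node.unfinished_requests / node.speed

    return min(nodes, key=expected_latency)


nodes = [
    NodeStats("http://node-a:8100", unfinished_requests=8, speed=4.0),   # 2.0
    NodeStats("http://node-b:8100", unfinished_requests=3, speed=1.0),   # 3.0
    NodeStats("http://node-c:8100", unfinished_requests=5, speed=10.0),  # 0.5
]
print(pick_min_expected_latency(nodes).url)  # http://node-c:8100
```

Note that node-c wins even though it has more unfinished requests than node-b, because its reported speed is ten times higher.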

vLLM DistServe: Static P/D Lists

Use static mode when prefill and decode HTTP endpoints are known at router startup. DLRouter infers static discovery when both --prefill_urls and --decode_urls are provided.

python -m dlrouter \
  --serving_strategy distserve \
  --backend vllm \
  --prefill_urls "http://prefill-1:30000,http://prefill-2:30000" \
  --decode_urls "http://decode-1:30000,http://decode-2:30000" \
  --models "Qwen3-32B" \
  --disable_cache_status

For NIXL intra-node data parallel routing, set --intra_node_data_parallel_size to the local DP size. DLRouter expands each physical URL into url@rank logical nodes for routing state, strips the suffix before forwarding, and sends X-data-parallel-rank to vLLM.

python -m dlrouter \
  --serving_strategy distserve \
  --backend vllm \
  --prefill_urls "http://prefill-node:8001" \
  --decode_urls "http://decode-node:8002" \
  --models "Qwen3-32B" \
  --intra_node_data_parallel_size 8 \
  --disable_cache_status

If you change the DP size between router restarts, clear the persisted node cache or use --disable_cache_status to avoid stale url@rank entries.
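
The url@rank expansion described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and leaving URLs unexpanded when the DP size is 1 is an assumption, not confirmed behavior.

```python
def expand_dp_ranks(urls, dp_size):
    """Expand each physical URL into dp_size logical 'url@rank' nodes."""
    if dp_size <= 1:
        # Assumption: with the default size of 1, URLs are left unchanged.
        return list(urls)
    return [f"{url}@{rank}" for url in urls for rank in range(dp_size)]


def split_logical_node(logical):
    """Split 'url@rank' into (physical_url, rank); rank is None without a suffix."""
    url, sep, rank = logical.rpartition("@")
    if sep and rank.isdigit():
        return url, int(rank)
    return logical, None


def forwarding_target(logical):
    """Return the physical URL to forward to and the extra headers to send."""
    url, rank = split_logical_node(logical)
    headers = {} if rank is None else {"X-data-parallel-rank": str(rank)}
    return url, headers


logical_nodes = expand_dp_ranks(["http://prefill-node:8001"], 2)
print(logical_nodes)
# ['http://prefill-node:8001@0', 'http://prefill-node:8001@1']
print(forwarding_target(logical_nodes[1]))
# ('http://prefill-node:8001', {'X-data-parallel-rank': '1'})
```

Routing state is keyed by the logical name, which is why a changed DP size can leave stale url@rank entries in a persisted cache.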

vLLM DistServe: Heartbeat Discovery

When neither --prefill_urls nor --decode_urls is provided, vLLM DistServe uses heartbeat discovery. P/D instances register themselves with DLRouter by publishing HTTP and ZMQ addresses.

python -m dlrouter \
  --serving_strategy distserve \
  --backend vllm \
  --zmq_host 0.0.0.0 \
  --zmq_port 30001 \
  --models "Qwen3-32B" \
  --disable_cache_status

In heartbeat mode, a node enters the routable set only after DLRouter resolves its model information. If a restarted node sends heartbeats before its HTTP API is ready, registration is skipped temporarily and retried by later heartbeats.
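
The registration gating described above can be pictured with a toy registry. Everything here is hypothetical: `HeartbeatRegistry`, the `resolve_models` callable, and the injectable clock are illustrative names, and the 90-second default only mirrors the documented `DLROUTER_HEARTBEAT_EXPIRATION` default. It is a sketch of the behavior, not the router's implementation.

```python
import time


class HeartbeatRegistry:
    """Toy registry: a node becomes routable only once its models resolve."""

    def __init__(self, resolve_models, expiration=90.0, clock=time.monotonic):
        # resolve_models: callable taking an HTTP URL and returning a list of
        # model names, or None/[] while the node's HTTP API is not ready.
        self._resolve_models = resolve_models
        self._expiration = expiration
        self._clock = clock
        self._nodes = {}  # http_url -> {"models": [...], "last_seen": float}

    def on_heartbeat(self, http_url):
        """Return True if the node is routable after this heartbeat."""
        node = self._nodes.get(http_url)
        if node is None:
            models = self._resolve_models(http_url)
            if not models:
                # Not ready yet: skip now, retry on the next heartbeat.
                return False
            node = self._nodes[http_url] = {"models": models}
        node["last_seen"] = self._clock()
        return True

    def routable(self):
        """Drop nodes whose heartbeats expired and return the remaining URLs."""
        now = self._clock()
        expired = [
            url for url, node in self._nodes.items()
            if now - node["last_seen"] > self._expiration
        ]
        for url in expired:
            del self._nodes[url]
        return sorted(self._nodes)
```

A node that heartbeats before its HTTP API is up simply stays out of `routable()` until a later heartbeat succeeds in resolving its models.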

SGLang DistServe

SGLang currently uses static discovery in DLRouter, so both prefill and decode URL lists must be provided. --prefill_bootstrap_ports is matched to --prefill_urls by position (the first port belongs to the first prefill URL, and so on); if omitted, each prefill defaults to 8998.

python -m dlrouter \
  --serving_strategy distserve \
  --backend sglang \
  --prefill_urls "http://prefill-1:13700,http://prefill-2:13700" \
  --decode_urls "http://decode-1:13701,http://decode-2:13701" \
  --prefill_bootstrap_ports "8998,8998" \
  --models "Qwen3-32B" \
  --disable_cache_status

DLRouter injects bootstrap_host, bootstrap_port, and bootstrap_room into the request body, sends the decorated request to prefill and decode concurrently, and returns the decode response.
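
The dual dispatch above can be sketched with asyncio. This is a simplified illustration under stated assumptions: `post` is a hypothetical async callable standing in for the HTTP transport, `bootstrap_host` is assumed to be the prefill node's hostname, and `bootstrap_room` is assumed to be a random integer shared by both requests.

```python
import asyncio
import random
from urllib.parse import urlsplit


def decorate_request(body, prefill_url, bootstrap_port=8998):
    """Return a copy of the request body with bootstrap fields injected."""
    decorated = dict(body)
    # Assumption: the bootstrap host is the hostname of the chosen prefill node.
    decorated["bootstrap_host"] = urlsplit(prefill_url).hostname
    decorated["bootstrap_port"] = bootstrap_port
    # Assumption: the room is a random integer id shared by both requests.
    decorated["bootstrap_room"] = random.getrandbits(63)
    return decorated


async def dual_dispatch(post, prefill_url, decode_url, body, bootstrap_port=8998):
    """Send one decorated body to prefill and decode; return decode's reply."""
    decorated = decorate_request(body, prefill_url, bootstrap_port)
    _prefill_reply, decode_reply = await asyncio.gather(
        post(prefill_url, decorated),
        post(decode_url, decorated),
    )
    return decode_reply


async def fake_post(url, json_body):
    """Stand-in transport that echoes which node handled the request."""
    await asyncio.sleep(0)
    return {"served_by": url, "room": json_body["bootstrap_room"]}


reply = asyncio.run(
    dual_dispatch(
        fake_post,
        "http://prefill-1:13700",
        "http://decode-1:13701",
        {"model": "Qwen3-32B"},
    )
)
print(reply["served_by"])  # http://decode-1:13701
```

Both nodes receive the identical decorated body, so they share one `bootstrap_room` value, while the caller only ever sees the decode reply.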

LMDeploy DistServe

LMDeploy DistServe relies on externally registered prefill and decode nodes. DLRouter selects P/D nodes through NodeManager; no separate discovery component is created by the app factory.

python -m dlrouter \
  --serving_strategy distserve \
  --backend lmdeploy \
  --migration_protocol RDMA \
  --link_type RoCE

LMDeploy-specific PD features require the LMDeploy disaggregation dependencies to be installed in the runtime environment.

CLI Reference

Common Options

Option	Default	Description
`--server_name`	`0.0.0.0`	Bind address.
`--server_port`	`8000`	Listen port.
`--backend`	`lmdeploy`	Backend type: `lmdeploy`, `vllm`, or `sglang`.
`--routing_strategy`	`min_expected_latency`	Request routing strategy.
`--serving_strategy`	`hybrid`	Serving mode: `hybrid` or `distserve`.
`--api_keys`	`None`	Comma-separated Bearer tokens for API authentication.
`--ssl`	`False`	Enable SSL. Requires the `SSL_KEYFILE` and `SSL_CERTFILE` environment variables.
`--log_level`	`INFO`	DLRouter log level.
`--disable_cache_status`	`False`	Disable persisted node status.
`--config_path`	`None`	Custom node status persistence file.
`--workers`	`1`	Number of worker processes. Values greater than 1 use Gunicorn.

Backend Options

Backend-specific options are added dynamically and are visible with --help after selecting a backend.

Backend	Option	Default	Description
LMDeploy	`--migration_protocol`	`RDMA`	PD migration protocol.
LMDeploy	`--link_type`	`RoCE`	RDMA link type: `RoCE` or `IB`.
LMDeploy	`--with_gdr`	`True`	Enable GPU Direct RDMA.
LMDeploy	`--dummy_prefill`	`False`	Use dummy prefill for testing.
vLLM	`--zmq_host`	`0.0.0.0`	ZMQ discovery bind host.
vLLM	`--zmq_port`	`30001`	ZMQ discovery port.
vLLM	`--zmq_ping_timeout`	`5`	ZMQ instance ping timeout in seconds.
vLLM	`--prefill_urls`	`None`	Comma-separated prefill URLs for static mode.
vLLM	`--decode_urls`	`None`	Comma-separated decode URLs for static mode.
vLLM	`--models`	`None`	Comma-separated model names.
vLLM	`--intra_node_data_parallel_size`	`1`	Static NIXL DP-aware logical rank count per physical URL.
SGLang	`--prefill_urls`	`None`	Comma-separated SGLang prefill HTTP URLs.
SGLang	`--decode_urls`	`None`	Comma-separated SGLang decode HTTP URLs.
SGLang	`--prefill_bootstrap_ports`	`8998 per prefill`	Comma-separated bootstrap ports aligned with prefill URLs.
SGLang	`--models`	`None`	Comma-separated model names.

API Reference

Inference

Method	Path	Description
`GET`	`/health`	Router health check.
`GET`	`/v1/models`	List available models across registered nodes.
`POST`	`/v1/chat/completions`	OpenAI-compatible chat completion endpoint.
`POST`	`/v1/completions`	OpenAI-compatible text completion endpoint.

Node Management

Method	Path	Description
`GET`	`/nodes/status`	Show registered nodes and routing state.
`POST`	`/nodes/add`	Register a backend node.
`POST`	`/nodes/remove`	Remove a backend node.
`POST`	`/nodes/terminate`	Terminate and remove a backend node.
`POST`	`/nodes/terminate_all`	Terminate all registered nodes.

Node registration can provide only a URL, or a URL plus explicit status metadata. The minimal, URL-only form is:

curl -X POST http://localhost:8000/nodes/add \
  -H "Content-Type: application/json" \
  -d '{"url": "http://backend-host:8000"}'

Architecture

Client (OpenAI SDK / curl)
        |
        v
   FastAPI routes
        |
        v
   ProxyEngine
        |
        +--> Hybrid: NodeManager -> RoutingStrategy -> Backend HTTP forward
        |
        +--> DistServe: Backend-owned PD executor
                 |
                 +--> LMDeploy PD / vLLM two-stage KV transfer / SGLang bootstrap

Key modules:

Module	Responsibility
`dlrouter/api/`	FastAPI app, middleware, and OpenAI-compatible routes.
`dlrouter/core/proxy_engine.py`	Dispatches hybrid requests and delegates DistServe requests to backends.
`dlrouter/core/node_manager.py`	Maintains node state, model lists, request counters, and routing strategy instances.
`dlrouter/core/health_check.py`	Runs background health checks and lazy model discovery.
`dlrouter/routing/`	Pluggable routing strategy implementations.
`dlrouter/backends/`	Backend adapters, shared HTTP transport, and PD execution helpers.

Backend adapters share the async HTTP transport layer in dlrouter/backends/http.py for normal forwarding, streaming forwarding, health checks, session lifecycle, and backend-specific stream framing. Backend-specific logic remains in each backend package.

Discovery Semantics

Mode	Behavior
`HYBRID`	Backend instances are registered explicitly, usually through `/nodes/add`.
`DISTSERVE + vLLM + static`	Providing both `prefill_urls` and `decode_urls` selects static discovery.
`DISTSERVE + vLLM + heartbeat`	Providing neither URL list selects heartbeat discovery.
`DISTSERVE + SGLang`	Static P/D lists are required; heartbeat discovery is not used.
`DISTSERVE + LMDeploy`	P/D nodes are selected from `NodeManager`; no router-startup discovery object is created.

Providing only one of prefill_urls or decode_urls is treated as a configuration error.
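
The vLLM rules in the table reduce to a small decision function. The sketch below uses hypothetical function names and treats the CLI values as comma-separated strings; it is not the router's actual startup code.

```python
def parse_url_list(value):
    """Split a comma-separated CLI value into a list of non-empty URLs."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def select_vllm_discovery(prefill_urls=None, decode_urls=None):
    """Return 'static' or 'heartbeat', or raise on a half-specified config."""
    prefill = parse_url_list(prefill_urls)
    decode = parse_url_list(decode_urls)
    if prefill and decode:
        return "static"
    if not prefill and not decode:
        return "heartbeat"
    raise ValueError("provide both --prefill_urls and --decode_urls, or neither")


print(select_vllm_discovery("http://prefill-1:30000", "http://decode-1:30000"))  # static
print(select_vllm_discovery())  # heartbeat
```

Passing only one of the two lists raises `ValueError`, which corresponds to the configuration error described above.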

Environment Variables

Variable	Description
`DLROUTER_HEARTBEAT_EXPIRATION`	Heartbeat timeout in seconds. Default: `90`.
`DLROUTER_HEALTH_CHECK_TIMEOUT`	Per-node health-check HTTP timeout in seconds. Default: `30`.
`DLROUTER_HEALTH_CHECK_MAX_FAILURES`	Consecutive failures before removing a node. Default: `3`.
`DLROUTER_AIOHTTP_TIMEOUT`	HTTP request timeout to backends in seconds. Default: `1800`.
`UVICORN_LOG_LEVEL`	Uvicorn log level. Default: `info`.
`SSL_KEYFILE`	SSL key file path when `--ssl` is enabled.
`SSL_CERTFILE`	SSL certificate file path when `--ssl` is enabled.
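
As an illustration of how the numeric variables and the failure threshold fit together, the sketch below reads the documented defaults from the environment and counts consecutive health-check failures. The names `load_settings` and `FailureTracker` are hypothetical, and how DLRouter itself parses these variables is an assumption.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "DLROUTER_HEARTBEAT_EXPIRATION": 90.0,
    "DLROUTER_HEALTH_CHECK_TIMEOUT": 30.0,
    "DLROUTER_HEALTH_CHECK_MAX_FAILURES": 3,
    "DLROUTER_AIOHTTP_TIMEOUT": 1800.0,
}


def load_settings(environ=os.environ):
    """Read each variable, falling back to its documented default."""
    settings = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(name)
        # Convert the raw string to the type of the default (int or float).
        settings[name] = default if raw is None else type(default)(raw)
    return settings


class FailureTracker:
    """Count consecutive failed health checks per node."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self._failures = {}

    def record(self, url, healthy):
        """Record one check; return True when the node should be removed."""
        if healthy:
            # A successful check resets the consecutive-failure count.
            self._failures.pop(url, None)
            return False
        self._failures[url] = self._failures.get(url, 0) + 1
        return self._failures[url] >= self.max_failures


settings = load_settings({"DLROUTER_HEALTH_CHECK_MAX_FAILURES": "2"})
tracker = FailureTracker(settings["DLROUTER_HEALTH_CHECK_MAX_FAILURES"])
print(tracker.record("http://node-a:8100", healthy=False))  # False
print(tracker.record("http://node-a:8100", healthy=False))  # True
```

Because a healthy check clears the counter, only uninterrupted runs of failures reach the threshold.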

Development

# Install dev dependencies
pip install -e ".[dev]"

# Format code
make format

# Lint
make lint

# Auto-fix lint issues
make fix

# Type-check
make type-check

# Run tests
make test

# Run all checks
make all

The test suite lives under tests/ and covers backend contracts, routing strategies, service discovery, health checks, and PD executors.

Current Limitations

One DLRouter process is configured for one backend type at startup.
SGLang DistServe currently uses static discovery only.
LMDeploy PD features require LMDeploy disaggregation dependencies in the runtime environment.
fetch_models() is synchronous in the current backend contract because node registration and lazy health-check discovery call it synchronously.

Acknowledgements

DLRouter draws inspiration from these open-source projects:

LMDeploy, especially its proxy and PD disaggregation design.
vLLM, including router and cache-aware load-balancing ideas.
SGLang, especially router and mini load-balancer patterns for bootstrap-based PD proxying.

Thanks to the developers and contributors of these projects for their work in the LLM inference ecosystem.

License

Apache-2.0
