Skip to content

EnzymeAD/ReactantServer.jl

Repository files navigation

ReactantServer.jl

Documentation CI

A production inference server for XLA-accelerated models, compiled through Reactant.jl (StableHLO via XLA today). It speaks the KServe V2 inference API natively over gRPC, so standard Triton and KServe clients connect unchanged; it scales from a single GPU to many from one container; and it squeezes the most models out of each GPU by balancing model memory against compute. It is Julia-first throughout — custom pre/postprocessing is plain Julia, and every convention follows Julia's (column-major, batch-last axes).

It targets static-graph workloads — computer vision, scientific computing — where many models share a GPU and one model executes at a time.

Not sure which setup fits you? Common Use Cases walks through the deployment shapes (single GPU, multi-GPU distributed or replicated, multi-node) with an example configuration for each.

Highlights

  • XLA-accelerated, Reactant-compiled. Models are compiled ahead of time into device executables through Reactant's PJRT bindings. The runtime is device-agnostic (CUDA today, CPU for dev/fallback); supporting more accelerators is a goal, not a redesign. → Architecture, Philosophy
  • Julia-first pre/postprocessing. A bundle's model.jl registers preprocess/postprocess hooks in plain Julia; they run per request, in parallel and overlapped with GPU execution. → Bundles & model.jl
  • Julia-aligned conventions. Shapes are column-major with the batch axis last, the way Julia and Lux write them; the codec converts to KServe's row-major wire at the boundary, so Triton clients are unchanged and you never reason about row-major order. → Getting Started
  • Elegant configuration. One typed YAML node file (with environment-variable overrides) describes a machine; manifests declare tensors with an einsum-style named-axis notation. → Node Configuration, Bundles & model.jl
  • Standard inference protocol. KServe V2 over gRPC. Tensor data travels inline or through the Triton-compatible system-shared-memory extension for zero-copy local clients. → Client Usage
  • One container, single or multi-GPU. A node supervisor runs one worker per visible GPU: a single worker serves the public ports directly; two or more get an embedded gateway behind one endpoint. The external interface (:8001 gRPC, :8002 metrics/health) is identical either way. → Docker Deployment, Scaling to Multiple GPUs
  • Balances memory and compute. Every model's weights stay resident in host RAM and stream onto the GPU on demand, evicted LRU under a byte budget — so a card serves far more models than fit in VRAM, paying a single host-to-device transfer on a cold call. → On-demand Weights
  • Batch coalescing. Concurrent same-model requests are merged into one execution at a compiled batch size, amortizing per-launch overhead and the one-time weight transfer across the batch. → Architecture
  • Scheduling modes for single and multi-GPU. On a worker, fair (deficit-weighted, cost-aware) or fifo; across GPUs, the gateway offers round_robin or memory-aware lpt_packing that concentrates each model's traffic to fill batches. → Architecture, Multi-GPU Gateway
  • Fast iteration. In dynamic mode the server watches the model repository and hot-loads, unloads, and reloads bundles online — weights, MLIR, manifest, and model.jl alike — with no restart (static and explicit control modes are also available). → Node Configuration
  • Meta models. A kind: meta bundle chains several models with data-dependent Julia between stages: its model.jl registers a run hook that calls sub-models, runs off the GPU dispatch loop, and re-enters the scheduler for each sub-call. → Meta Models

Quick start

The image is built locally (it is not published to a registry), so build it once and then serve a directory of model bundles from the container (it scales to all visible GPUs):

git submodule update --init --recursive   # fetch the vendored lib/ forks the build needs
make image                                 # build reactantserver:latest (or: docker compose build)

docker run --gpus all --ipc=host -p 8001:8001 -p 8002:8002 \
  -v /path/to/bundles:/var/lib/reactantserver/models:ro reactantserver

The build is large and the first server startup is slow, since every model compiles before the gRPC plane accepts traffic. See Docker Deployment for the docker compose workflow and configuration.

Or from pure Julia:

using ReactantServerNode
ReactantServerNode.supervise("docker/node.yaml")   # one worker per GPU (+ gateway if >1)

Clients speak KServe V2 gRPC to :8001; health and metrics are on :8002. Walk through exporting a model, configuring a node, and querying it in Getting Started. ReactantServer is designed for a trusted network — read Security before exposing an endpoint.

Status

The full serving path is implemented end to end: export a bundle, compile it through Reactant/PJRT, schedule and coalesce requests, and serve over the KServe V2 gRPC control plane. The cost-aware scheduler, the on-demand GPU weight cache, dynamic model lifecycle, meta-model orchestration (with a worked object detection example), and the single- and multi-GPU deployment paths all work today on CUDA (with CPU for development and fallback); broader accelerator support is intended to follow. Deferred to later milestones: dynamic-batch export with server-side stablehlo-refine specialization, the compiled-executable disk cache, multi-model orchestrators, and full StableHLO-signature validation of manifests. See Architecture for the full picture.

Repository layout

A Julia 1.12+ workspace of five packages under packages/, split so that talking to a server never pulls in the heavy Reactant/XLA stack:

  • ReactantServerCore — shared, Reactant-free substrate: dtypes, the KServe V2 protobuf messages, boundary types, the manifest parser, node config, the wire codec, the shared-memory registry, and the staging BufferPool.
  • ReactantServer — the inference worker (registry, runtime, scheduler, KServe V2 gRPC server); the only package that depends on Reactant. Exports serve, stop!, register_model.
  • ReactantServerGateway — the multi-GPU KServe gRPC reverse proxy (serve_gateway). No Reactant.
  • ReactantServerClient — a Reactant-free inference client (KServeModel, infer_sync, infer_async, InferInput, InferOutput).
  • ReactantServerNode — the single-container node supervisor (supervise): detects GPUs, runs one worker per device, embeds the gateway when there is more than one, and multiplexes their logs.

Offline model export lives in packages/ReactantServerExport (a Reactant tracing frontend plus a PythonCall-triggered PyTorch extension); it is deliberately not a workspace member, so its Lux/PythonCall weakdeps stay out of the server images. The vendored forks/unregistered deps (Reactant, gRPCServer, gRPCClient, HTTP) are git submodules under lib/.

Acknowledgments

Development of ReactantServer.jl is sponsored by Medical Metrics, Inc.

About

Production inference server for Reactant-compiled models, serving KServe V2 over gRPC

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages