A High-Performance LLM Inference Engine with PagedAttention & Continuous Batching
⚠️ Development Status: This project is in early development (v0.1.0). It currently uses a Mock GPU executor for testing and demonstration purposes. Real CUDA kernel support is planned but not yet implemented.
English | 中文 | Documentation
Hetero-Paged-Infer is an inference engine for Large Language Models (LLMs) written in Rust. It implements cutting-edge techniques from vLLM in a modular, testable architecture designed for future production deployment.
| Feature | Description | Status |
|---|---|---|
| PagedAttention KV Cache | Block-based memory management, <5% waste | ✅ |
| Continuous Batching | Dynamic prefill/decode scheduling | ✅ |
| Memory Pressure Awareness | Configurable OOM prevention | ✅ |
| Modular Architecture | Trait-based abstractions | ✅ |
| Comprehensive Testing | 122 tests (unit, property, integration) | ✅ |
| OpenAI-Compatible Server | /v1/completions + /v1/chat/completions + SSE | ✅ |
| CUDA Kernels | Real GPU execution | 🚧 Planned |
┌──────────────────────────────────────────────────────────────────────┐
│ InferenceEngine (CPU) │
├──────────────────────────────────────────────────────────────────────┤
│ ┌────────────┐ ┌────────────┐ ┌────────────────────────────────┐ │
│ │ Tokenizer │ │ Scheduler │ │ KV Cache Manager │ │
│ │ │ │ │ │ BlockPool + PageTable │ │
│ └─────┬──────┘ └─────┬──────┘ └───────────────┬────────────────┘ │
│ │ │ │ │
├────────┼───────────────┼─────────────────────────────────────────────┤
│ │ ┌──────▼──────┐ │
│ │ │ GPU Executor│ (CUDA / Mock) │
│ │ └──────┬──────┘ │
│ │ ┌──────▼──────┐ │
│ └───────►│ KV Cache │ (GPU Memory) │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
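The CUDA / Mock split in the diagram relies on a trait-based executor abstraction: the scheduler and KV cache manager only depend on the trait, so the mock backend used today and the planned CUDA backend are interchangeable. A minimal sketch of the pattern, using hypothetical names (GpuExecutor, MockExecutor) rather than the crate's actual API:
// Illustrative executor abstraction (hypothetical names, not the crate's real API).
pub trait GpuExecutor {
    // Run one forward pass for a batch of token-ID sequences and return
    // next-token logits for each sequence in the batch.
    fn forward(&mut self, batch: &[Vec<u32>]) -> Vec<Vec<f32>>;
}
// Mock backend: returns dummy logits so scheduling and paging logic can be
// exercised without a GPU.
pub struct MockExecutor {
    vocab_size: usize,
}
impl GpuExecutor for MockExecutor {
    fn forward(&mut self, batch: &[Vec<u32>]) -> Vec<Vec<f32>> {
        batch.iter().map(|_| vec![0.0; self.vocab_size]).collect()
    }
}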
- Rust 1.70+ (2021 edition)
- Linux (Ubuntu 20.04+ recommended) or macOS
# Clone the repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer
# Build in release mode
cargo build --release
# Run the test suite (122 tests)
cargo test
# Basic usage
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50
# With custom parameters
./target/release/hetero-infer \
--input "Explain quantum computing" \
--max-tokens 100 \
--temperature 0.8 \
--top-p 0.95
# Start OpenAI-compatible HTTP server
./target/release/hetero-infer --serve
# Start server with default address 127.0.0.1:3000
cargo run -- --serve
# Health / readiness / metrics
curl http://127.0.0.1:3000/healthz
curl http://127.0.0.1:3000/readyz
curl http://127.0.0.1:3000/metrics
# Completions
curl http://127.0.0.1:3000/v1/completions \
-H "content-type: application/json" \
-d '{"model":"hetero-infer","prompt":"hello","max_tokens":8}'
# Chat completions
curl http://127.0.0.1:3000/v1/chat/completions \
-H "content-type: application/json" \
-d '{"model":"hetero-infer","messages":[{"role":"user","content":"say hi"}],"max_tokens":8}'use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};
// Create engine with default configuration
let mut engine = InferenceEngine::new(EngineConfig::default())?;
// Submit a generation request
let request_id = engine.submit_request(
"Hello, world!",
GenerationParams {
max_tokens: 100,
temperature: 0.8,
top_p: 0.95
}
)?;
// Run inference and collect results
let results = engine.run();
for result in results {
println!("Generated: {}", result.output_text);
}
| Parameter | Default | Description |
|---|---|---|
| --block-size | 16 | Tokens per physical block |
| --max-num-blocks | 1024 | Total physical blocks |
| --max-batch-size | 32 | Max sequences per batch |
| --max-num-seqs | 256 | Maximum number of sequences |
| --max-model-len | 2048 | Maximum model context length |
| --max-total-tokens | 4096 | Maximum tokens per batch |
| --memory-threshold | 0.9 | Memory pressure threshold (0.0-1.0) |
| --max-tokens | 100 | Maximum tokens to generate |
| --temperature | 1.0 | Sampling temperature |
| --top-p | 0.9 | Nucleus sampling threshold |
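For intuition on how these defaults interact, the KV cache holds block-size × max-num-blocks token slots, and the scheduler applies back-pressure once usage crosses the memory threshold. Illustrative arithmetic only (assumes the threshold is a fraction of total capacity, not crate code):
// Illustrative only: how the default knobs combine.
let block_size = 16usize;       // tokens per physical block
let max_num_blocks = 1024usize; // total physical blocks
let memory_threshold = 0.9f64;  // memory pressure threshold
let kv_capacity = block_size * max_num_blocks;                         // 16_384 token slots
let pressure_point = (kv_capacity as f64 * memory_threshold) as usize; // ~14_745 slots
println!("KV capacity: {kv_capacity} tokens; pressure above ~{pressure_point}");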
Config file (config.json):
{
"block_size": 16,
"max_num_blocks": 1024,
"max_batch_size": 32,
"max_num_seqs": 256,
"max_model_len": 2048,
"max_total_tokens": 4096,
"memory_threshold": 0.9,
"max_retry_attempts": 2,
"tokenizer": {
"kind": "simple",
"path": null
},
"serving": {
"host": "127.0.0.1",
"port": 3000,
"model_name": "hetero-infer",
"backend": {
"kind": "local_engine",
"command": null
}
}
}
Load: ./hetero-infer --config config.json
For a HuggingFace tokenizer file:
{
"tokenizer": {
"kind": "huggingface",
"path": "tokenizer.json"
}
}
For command bridge mode:
{
"serving": {
"backend": {
"kind": "command_bridge",
"command": {
"program": "/bin/sh",
"args": ["-c", "printf 'bridge:%s' \"$HETERO_PROMPT\""]
}
}
}
}
| Resource | Link |
|---|---|
| GitHub Pages | https://lessup.github.io/hetero-paged-infer/ |
| Architecture Guide | docs/en/architecture/overview.md |
| Contributing Guide | CONTRIBUTING.md |
| Changelog | CHANGELOG.md |
# Build and open API documentation
cargo doc --open
# Build documentation site locally
pip install mkdocs-material mkdocs-static-i18n
mkdocs serve -f mkdocs.yml
| Approach | Memory Waste | Throughput | Description |
|---|---|---|---|
| Static Allocation | ~40-60% | Baseline | Pre-allocate max context for each request |
| Dynamic Allocation | ~20-30% | +20% | Resize per request but still fragmented |
| PagedAttention | <5% | +50% | Block-based sharing with copy-on-write |
Note: Performance figures are theoretical estimates based on vLLM research papers.
Traditional LLM serving allocates a contiguous KV cache region per request, sized for the maximum context length, so a request that generates only a few hundred tokens against a 2048-token limit wastes most of its reservation and fragments memory. PagedAttention solves this (see the sketch below) by:
- Block-based allocation: Split KV cache into fixed-size blocks
- On-demand paging: Allocate blocks only when needed
- Copy-on-write: Share blocks across sequences for efficient beam search
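A minimal sketch of the idea, using hypothetical types rather than the crate's actual BlockPool / PageTable API: physical blocks come from a shared free list and each sequence maps logical blocks to physical ones, so memory is claimed one block at a time instead of one max-length region per request.
// Minimal sketch of block-based KV cache paging (illustrative, not the crate's API).
const BLOCK_SIZE: usize = 16; // tokens per physical block
struct BlockPool {
    free: Vec<usize>, // indices of free physical blocks
}
impl BlockPool {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).collect() }
    }
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}
// Per-sequence page table: logical block index -> physical block id.
struct Sequence {
    block_table: Vec<usize>,
    num_tokens: usize,
}
impl Sequence {
    // Append one token, grabbing a new physical block only when the current
    // one is full (on-demand paging). Copy-on-write sharing is omitted here.
    fn append_token(&mut self, pool: &mut BlockPool) -> Result<(), &'static str> {
        if self.num_tokens % BLOCK_SIZE == 0 {
            let block = pool.allocate().ok_or("out of KV cache blocks")?;
            self.block_table.push(block);
        }
        self.num_tokens += 1;
        Ok(())
    }
}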
# Run all tests
cargo test
# Run with coverage
cargo llvm-cov --html
# Run property-based tests
cargo test -- --test-threads=1
| Type | Count | Description |
|---|---|---|
| Unit Tests | 79 | Core functionality tests |
| Property Tests | 22 | Invariant verification with proptest |
| Integration Tests | 13 | End-to-end workflow tests |
| Doc Tests | 7 | Documentation examples |
| Total | 121 | |
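The property tests use proptest to verify invariants rather than fixed cases. A hypothetical example in that style, reusing the BlockPool sketch from the design section above (not a test from the actual suite):
// Illustrative property test (hypothetical; not the crate's real test code).
use proptest::prelude::*;
proptest! {
    #[test]
    fn block_pool_never_double_allocates(ops in prop::collection::vec(any::<bool>(), 1..64)) {
        let mut pool = BlockPool::new(8);
        let mut held = std::collections::HashSet::new();
        for alloc in ops {
            if alloc {
                if let Some(b) = pool.allocate() {
                    // Invariant: the pool must never hand the same block to two owners.
                    prop_assert!(held.insert(b), "block {} allocated twice", b);
                }
            } else {
                let next = held.iter().next().copied();
                if let Some(b) = next {
                    held.remove(&b);
                    pool.release(b);
                }
            }
        }
    }
}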
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
# Run all checks before submitting
cargo test && cargo fmt --check && cargo clippy- PagedAttention KV Cache
- Continuous Batching Scheduler
- Memory Pressure Awareness
- Property-Based Testing
- Real CUDA Kernels
- Real Tokenizer Integration
- Async CPU/GPU Overlap
MIT License - See LICENSE.
- vLLM - PagedAttention concept and inspiration
- Rust - Systems programming language
- Criterion - Statistical benchmarking
Made with ❤️ by LessUp