Hetero-Paged-Infer


A High-Performance LLM Inference Engine with PagedAttention & Continuous Batching

⚠️ Development Status: This project is in early development (v0.1.0). It currently uses a Mock GPU executor for testing and demonstration purposes. Real CUDA kernel support is planned but not yet implemented.

English | 中文 | Documentation


Overview

Hetero-Paged-Infer is an inference engine for Large Language Models (LLMs) written in Rust. It implements techniques pioneered by vLLM, notably PagedAttention and continuous batching, behind a modular, testable architecture aimed at future production deployment.

| Feature | Description | Status |
|---------|-------------|--------|
| PagedAttention KV Cache | Block-based memory management, <5% waste | ✅ |
| Continuous Batching | Dynamic prefill/decode scheduling | ✅ |
| Memory Pressure Awareness | Configurable OOM prevention | ✅ |
| Modular Architecture | Trait-based abstractions | ✅ |
| Comprehensive Testing | 121 tests (unit, property, integration) | ✅ |
| OpenAI-Compatible Server | /v1/completions + /v1/chat/completions + SSE | ✅ |
| CUDA Kernels | Real GPU execution | 🚧 Planned |

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        InferenceEngine (CPU)                          │
├──────────────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────────────┐  │
│  │ Tokenizer  │  │ Scheduler  │  │      KV Cache Manager          │  │
│  │            │  │            │  │   BlockPool + PageTable        │  │
│  └─────┬──────┘  └─────┬──────┘  └───────────────┬────────────────┘  │
│        │               │                         │                    │
├────────┼───────────────┼─────────────────────────────────────────────┤
│        │        ┌──────▼──────┐                                       │
│        │        │ GPU Executor│  (CUDA / Mock)                        │
│        │        └──────┬──────┘                                       │
│        │        ┌──────▼──────┐                                       │
│        └───────►│  KV Cache   │  (GPU Memory)                         │
│                 └─────────────┘                                       │
└──────────────────────────────────────────────────────────────────────┘
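
The GPU executor boundary in the diagram is a trait, which is what lets today's Mock executor stand in for the planned CUDA backend. The sketch below only illustrates the idea; the names (GpuExecutor, MockExecutor, TokenId) are hypothetical and not the crate's actual API.

type TokenId = u32;

/// Hypothetical executor trait: the engine schedules batches on the CPU and
/// hands them to whichever executor implementation is configured.
trait GpuExecutor {
    /// Run one decode step for a batch and return one new token per sequence.
    fn execute_step(&mut self, batch: &[Vec<TokenId>]) -> Vec<TokenId>;
}

/// Stand-in executor, analogous to the Mock GPU executor the project ships today.
struct MockExecutor;

impl GpuExecutor for MockExecutor {
    fn execute_step(&mut self, batch: &[Vec<TokenId>]) -> Vec<TokenId> {
        // Emit a dummy token per sequence; a CUDA-backed implementation would
        // launch attention/GEMM kernels here instead.
        batch.iter().map(|seq| seq.len() as TokenId).collect()
    }
}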

Quick Start

Prerequisites

  • Rust 1.70+ (2021 edition)
  • Linux (Ubuntu 20.04+ recommended) or macOS

Installation

# Clone the repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer

# Build in release mode
cargo build --release

# Run the test suite (121 tests)
cargo test

CLI Usage

# Basic usage
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50

# With custom parameters
./target/release/hetero-infer \
  --input "Explain quantum computing" \
  --max-tokens 100 \
  --temperature 0.8 \
  --top-p 0.95

# Start OpenAI-compatible HTTP server
./target/release/hetero-infer --serve

OpenAI-Compatible Server

# Start server with default address 127.0.0.1:3000
cargo run -- --serve

# Health / readiness / metrics
curl http://127.0.0.1:3000/healthz
curl http://127.0.0.1:3000/readyz
curl http://127.0.0.1:3000/metrics

# Completions
curl http://127.0.0.1:3000/v1/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","prompt":"hello","max_tokens":8}'

# Chat completions
curl http://127.0.0.1:3000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","messages":[{"role":"user","content":"say hi"}],"max_tokens":8}'

Library Usage

use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};

// Create engine with default configuration
let mut engine = InferenceEngine::new(EngineConfig::default())?;

// Submit a generation request
let request_id = engine.submit_request(
    "Hello, world!",
    GenerationParams { 
        max_tokens: 100, 
        temperature: 0.8, 
        top_p: 0.95 
    }
)?;

// Run inference and collect results
let results = engine.run();
for result in results {
    println!("Generated: {}", result.output_text);
}
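
If the defaults are not a good fit, a non-default configuration can be passed instead. The field names below are assumed to mirror the config.json keys shown in the next section; check the generated cargo doc output for the actual struct definition.

// Assumed field names (mirroring config.json); verify against the crate docs.
let config = EngineConfig {
    block_size: 16,
    max_num_blocks: 2048,
    memory_threshold: 0.85,
    ..EngineConfig::default()
};
let mut engine = InferenceEngine::new(config)?;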

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| --block-size | 16 | Tokens per physical block |
| --max-num-blocks | 1024 | Total physical blocks |
| --max-batch-size | 32 | Max sequences per batch |
| --max-num-seqs | 256 | Maximum number of sequences |
| --max-model-len | 2048 | Maximum model context length |
| --max-total-tokens | 4096 | Maximum tokens per batch |
| --memory-threshold | 0.9 | Memory pressure threshold (0.0-1.0) |
| --max-tokens | 100 | Maximum tokens to generate |
| --temperature | 1.0 | Sampling temperature |
| --top-p | 0.9 | Nucleus sampling threshold |
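
For instance, with the defaults above, memory pressure handling would presumably engage once roughly 90% of the 1024 physical blocks are in use (about 920 blocks, i.e. roughly 14,700 cached tokens at a block size of 16), at which point the scheduler can hold back new work instead of running out of memory.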

Config file (config.json):

{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_model_len": 2048,
  "max_total_tokens": 4096,
  "memory_threshold": 0.9,
  "max_retry_attempts": 2,
  "tokenizer": {
    "kind": "simple",
    "path": null
  },
  "serving": {
    "host": "127.0.0.1",
    "port": 3000,
    "model_name": "hetero-infer",
    "backend": {
      "kind": "local_engine",
      "command": null
    }
  }
}

Load: ./hetero-infer --config config.json

For a HuggingFace tokenizer file:

{
  "tokenizer": {
    "kind": "huggingface",
    "path": "tokenizer.json"
  }
}

For command bridge mode:

{
  "serving": {
    "backend": {
      "kind": "command_bridge",
      "command": {
        "program": "/bin/sh",
        "args": ["-c", "printf 'bridge:%s' \"$HETERO_PROMPT\""]
      }
    }
  }
}
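
In this mode the server presumably bypasses the local engine and forwards each request to the configured program, passing the prompt via the HETERO_PROMPT environment variable (as the example above suggests) and returning the command's output as the completion text.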

Documentation

| Resource | Link |
|----------|------|
| GitHub Pages | https://lessup.github.io/hetero-paged-infer/ |
| Architecture Guide | docs/en/architecture/overview.md |
| Contributing Guide | CONTRIBUTING.md |
| Changelog | CHANGELOG.md |

Local Documentation

# Build and open API documentation
cargo doc --open

# Build documentation site locally
pip install mkdocs-material mkdocs-static-i18n
mkdocs serve -f mkdocs.yml

Performance

| Approach | Memory Waste | Throughput | Description |
|----------|--------------|------------|-------------|
| Static Allocation | ~40-60% | Baseline | Pre-allocate max context for each request |
| Dynamic Allocation | ~20-30% | +20% | Resize per request but still fragmented |
| PagedAttention | <5% | +50% | Block-based sharing with copy-on-write |

Note: Performance figures are theoretical estimates based on vLLM research papers.
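
As a rough illustration of where the waste comes from: with a 2048-token maximum context, a request that ends up using only 200 tokens still reserves all 2048 slots under static allocation (about 90% waste), whereas block-based allocation with a block size of 16 holds only ceil(200 / 16) = 13 blocks, or 208 slots, wasting just 8 slots (under 4%).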

Why PagedAttention?

Traditional LLM serving allocates a contiguous memory region for each request's KV cache, leading to significant fragmentation and waste. PagedAttention solves this in three ways (a minimal Rust sketch follows the list):

  1. Block-based allocation: Split KV cache into fixed-size blocks
  2. On-demand paging: Allocate blocks only when needed
  3. Copy-on-write: Share blocks across sequences for efficient beam search
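
The sketch below shows the bookkeeping behind points 1 and 2: a pool of physical blocks plus a per-sequence page table, matching the BlockPool + PageTable split in the architecture diagram. Method names and fields are illustrative, not the crate's actual API.

/// Pool of physical KV-cache blocks (illustrative API).
struct BlockPool {
    free: Vec<usize>, // ids of currently free physical blocks
}

impl BlockPool {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).collect() }
    }
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Per-sequence page table: logical block index -> physical block id.
struct PageTable {
    blocks: Vec<usize>,
    block_size: usize,
    num_tokens: usize,
}

impl PageTable {
    /// Reserve space for one more token, grabbing a new physical block
    /// only when the last block is full (on-demand paging).
    fn append_token(&mut self, pool: &mut BlockPool) -> Result<(), &'static str> {
        if self.num_tokens % self.block_size == 0 {
            let block = pool.allocate().ok_or("KV cache exhausted")?;
            self.blocks.push(block);
        }
        self.num_tokens += 1;
        Ok(())
    }
}

Copy-on-write (point 3) extends this by letting two sequences reference the same physical block and only duplicating it when one of them writes.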

Testing

# Run all tests
cargo test

# Run with coverage
cargo llvm-cov --html

# Run property-based tests
cargo test -- --test-threads=1

| Type | Count | Description |
|------|-------|-------------|
| Unit Tests | 79 | Core functionality tests |
| Property Tests | 22 | Invariant verification with proptest |
| Integration Tests | 13 | End-to-end workflow tests |
| Doc Tests | 7 | Documentation examples |
| **Total** | **121** | |
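
The property tests use proptest to check invariants across randomly generated inputs. A toy example of the style (not taken from the repository's test suite): allocating and then releasing blocks should always restore a pool to its original size.

use proptest::prelude::*;

proptest! {
    #[test]
    fn releasing_restores_pool(num_blocks in 1usize..512, take in 0usize..512) {
        let mut free: Vec<usize> = (0..num_blocks).collect();
        let mut held = Vec::new();
        // Allocate up to `take` blocks...
        for _ in 0..take.min(num_blocks) {
            held.push(free.pop().unwrap());
        }
        // ...then release them all; the pool must be back to full capacity.
        free.append(&mut held);
        prop_assert_eq!(free.len(), num_blocks);
    }
}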

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

# Run all checks before submitting
cargo test && cargo fmt --check && cargo clippy

Roadmap

  • PagedAttention KV Cache
  • Continuous Batching Scheduler
  • Memory Pressure Awareness
  • Property-Based Testing
  • Real CUDA Kernels
  • Real Tokenizer Integration
  • Async CPU/GPU Overlap

License

MIT License - See LICENSE.

Acknowledgments

  • vLLM - PagedAttention concept and inspiration
  • Rust - Systems programming language
  • Criterion - Statistical benchmarking

Made with ❤️ by LessUp