CMLIS

CPU-native Modular Language Intelligence System: a CPU-first inference orchestration framework built to test whether NUMA-aware memory control and heuristic routing can deliver ≥25% throughput uplift on large models without retraining.

🔬 Research PoC Status: This is a prototype for validating architectural concepts, not production-ready inference middleware. See the Production Readiness section for the known gaps.

What It Does

CMLIS orchestrates inference workloads on multi-socket x86 systems by:

Discovering hardware topology — NUMA nodes, physical cores, L3 cache layouts
Classifying prompts into routing tiers (short, medium, long)
Binding processes to specific NUMA nodes with CPU affinity
Benchmarking configurations across workloads with statistical validation
Collecting telemetry on locality, throughput, and resource utilization
Simulating the entire pipeline when hardware isn't available
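
As a rough illustration of the topology-discovery step, the sketch below reads the per-node CPU lists that Linux exposes under /sys/devices/system/node/node<N>/cpulist. The function names `parse_cpulist` and `discover_numa_nodes` are hypothetical and are not taken from the CMLIS package; the real `cmlis topo` command may gather this information differently.

```python
import re
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as "0-3,8-11" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means the node has no CPUs
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def discover_numa_nodes(
    sysfs_root: str = "/sys/devices/system/node",
) -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids; returns {} where sysfs is unavailable."""
    nodes: dict[int, list[int]] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # e.g. macOS or Windows, where only simulation applies
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            nodes[int(match.group(1))] = parse_cpulist(cpulist.read_text())
    return nodes
```

On a machine without the sysfs tree, `discover_numa_nodes()` returns an empty mapping, which a caller could treat as the signal to fall back to simulation.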

Current Status

✅ What Works

Topology discovery — Automatic detection of sockets, NUMA nodes, physical cores, and L3 cache
Smart routing — Classifies prompts and applies NUMA-aware process binding via numactl + taskset
Benchmarking — Compares naive, NUMA-only, and full CMLIS configurations with Welch t-test validation
Cross-platform simulation — Run the full orchestration pipeline on any machine without hardware
Telemetry collection — Remote NUMA traffic fraction and per-CPU utilization tracking
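
The routing and binding steps listed above could look roughly like the following sketch. The tier boundaries (512 and 4096 input tokens) and the function names are assumptions made for illustration only; the README does not state the thresholds CMLIS uses. The `numactl` options `--cpunodebind` and `--membind` and the `taskset -c` option are standard, but the exact invocation CMLIS builds may differ.

```python
# Hypothetical tier boundaries in input tokens; CMLIS may use other values.
SHORT_MAX_TOKENS = 512
MEDIUM_MAX_TOKENS = 4096


def classify_prompt(input_tokens: int) -> str:
    """Assign a prompt to the short, medium, or long routing tier."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if input_tokens <= SHORT_MAX_TOKENS:
        return "short"
    if input_tokens <= MEDIUM_MAX_TOKENS:
        return "medium"
    return "long"


def bind_command(command: list[str], numa_node: int, cpus: str) -> list[str]:
    """Prefix a command so it runs on one NUMA node with a fixed CPU set."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # schedule only on this node's CPUs
        f"--membind={numa_node}",      # allocate memory only from this node
        "taskset",
        "-c",
        cpus,                          # CPU list such as "0-15"
        *command,
    ]
```

The returned list can be passed directly to `subprocess.run`, for example `bind_command(["llama-cli", "-m", "model.gguf"], numa_node=0, cpus="0-15")`. Keeping the command as a list avoids shell quoting issues.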

⚠️ Known Limitations

KV cache placement — kv_cache_chunks is currently routing metadata only; real KV placement not yet wired into llama.cpp
Long-context NUMA awareness — Not yet implemented in the runtime path
Runtime isolation — Benchmark runs rotate across NUMA nodes but don't yet launch isolated per-socket instances
Expert routing — Uses --override-kv but model-specific validation still pending

Quick Start

Installation

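cd poc

# Install the package in editable mode with development dependencies
pip install -e ".[dev]"

# Run the test suite
pytest

# Lint and auto-fix style issues
ruff check . --fix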

Run Locally

cd poc

# Discover your hardware topology
cmlis topo

# Get routing decision for a 2048-token prompt
cmlis plan --input-tokens 2048

# Simulate a single run (no hardware required)
cmlis run --simulate --input-tokens 2048 --output-tokens 128

# Simulate a full benchmark (10 repetitions)
cmlis bench --simulate --reps 10

# Run real benchmark (requires llama-cli + GGUF model)
cmlis bench --model /models/mixtral-8x7b.Q5_K_M.gguf --reps 30

Output

Benchmark reports are written to ./reports/cmlis-bench-YYYYMMDD-HHMMSS.json with:

Per-run throughput (tokens/sec)
Per-configuration statistics
Significance test results (uplift %, t-statistic, p-value, degrees of freedom)
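
To make the significance fields concrete, here is a minimal sketch of how uplift, the Welch t-statistic, and the Welch-Satterthwaite degrees of freedom can be computed from two lists of per-run throughput values. The function name `welch_summary` and the returned key names are hypothetical and do not describe the report schema. The sketch uses only the standard library, so it does not compute the p-value.

```python
import math
import statistics


def welch_summary(baseline: list[float], candidate: list[float]) -> dict[str, float]:
    """Compare two throughput samples (tokens/sec) with Welch's t-test."""
    n1, n2 = len(baseline), len(candidate)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two runs")
    m1, m2 = statistics.fmean(baseline), statistics.fmean(candidate)
    if m1 <= 0:
        raise ValueError("baseline mean throughput must be positive")
    # Squared standard error of each sample mean (sample variance / n).
    se1 = statistics.variance(baseline) / n1
    se2 = statistics.variance(candidate) / n2
    if se1 + se2 == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (m2 - m1) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se1 + se2) ** 2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return {
        "uplift_pct": 100.0 * (m2 - m1) / m1,
        "t_statistic": t_stat,
        "degrees_of_freedom": dof,
    }
```

If SciPy is available, `scipy.stats.ttest_ind(candidate, baseline, equal_var=False)` performs the same test and also returns the p-value.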

Project Goals

CMLIS aims to validate that memory-topology-aware orchestration can improve CPU inference without retraining or modifying the underlying model engine.

Primary Success Criteria (from SPEC.md)

Metric	Target
Throughput uplift	≥25% over naive baseline (statistically significant)
Mixtral 8x7B (medium-context)	≥3.5 tok/s
Remote NUMA traffic	<10% during optimized runs
Perplexity regression	No material coherence loss

⚠️ These are target validation criteria, not claims that the current PoC has proven them at production grade.
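
The numeric targets in the table can be expressed as a simple check, sketched below. The `RunSummary` fields and the significance level of 0.05 are assumptions for illustration; SPEC.md and METHODOLOGY.md remain the authority on how each criterion is measured. The perplexity criterion is left out because the table gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one optimized benchmark configuration."""
    uplift_pct: float            # throughput uplift over the naive baseline
    p_value: float               # from the Welch t-test
    tokens_per_sec: float        # Mixtral 8x7B, medium-context workload
    remote_numa_fraction: float  # 0.0 to 1.0


def check_targets(summary: RunSummary, alpha: float = 0.05) -> dict[str, bool]:
    """Return pass/fail for each numeric target in the criteria table."""
    return {
        # alpha is an assumed significance level, not taken from SPEC.md.
        "throughput_uplift": summary.uplift_pct >= 25.0 and summary.p_value < alpha,
        "mixtral_medium_context": summary.tokens_per_sec >= 3.5,
        "remote_numa_traffic": summary.remote_numa_fraction < 0.10,
    }
```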

Repository Contents

Directory/File	Purpose
`poc/`	Runnable Python PoC package and test suite
`SPEC.md`	System specification and success criteria
`INSTALL.md`	Hardware and environment setup guide
`METHODOLOGY.md`	Benchmark methodology and validation rules
`MODEL.md`	Model selection and constraints
`BRIEF.md`	Executive system overview
`PRODUCTION_TASKS.md`	Roadmap for production-readiness gaps

Platform Notes

Platform	Support Level	Notes
Linux	✅ Full	NUMA enforcement, CPU affinity, `numastat` telemetry
Windows	✅ Dev/Sim	Development and simulation mode only
macOS	✅ Dev/Sim	Development and simulation mode only

Reference configuration: poc/configs/mixtral-dual-socket.json
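
On Linux, one possible proxy for the remote NUMA traffic fraction is the share of page allocations served from another node, taken from the `local_node` and `other_node` counters in /sys/devices/system/node/node<N>/numastat. The sketch below assumes that definition, which may not match how CMLIS computes its telemetry, and the function names are hypothetical. Because the counters are cumulative, the fraction is taken over the difference between two snapshots.

```python
def parse_numastat(text: str) -> dict[str, int]:
    """Parse "name value" lines from a per-node numastat file."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def remote_fraction(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of allocations between two snapshots that were off-node."""
    local = after["local_node"] - before["local_node"]
    remote = after["other_node"] - before["other_node"]
    total = local + remote
    # No allocations in the interval: report 0.0 rather than dividing by zero.
    return remote / total if total > 0 else 0.0
```

Taking one snapshot just before a run and one just after limits the measurement to allocations made during that run, although other processes on the same node still contribute to the counters.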

Production Readiness

Several gaps remain before this PoC could be deployed in production. The full remediation plan is documented in PRODUCTION_TASKS.md.

Key Gaps

Real KV cache placement strategy integration with llama.cpp
Long-context NUMA-aware KV management
Per-socket runtime isolation for benchmarking
Production-grade error handling and observability

Learn More

PoC usage & API: poc/README.md
Installation & hardware setup: INSTALL.md
System architecture brief: BRIEF.md
Benchmark methodology: METHODOLOGY.md
Model selection: MODEL.md
Production roadmap: PRODUCTION_TASKS.md

License

This project is licensed under the GNU General Public License v3.0.
