A planner for running large language models locally under a fixed memory
budget. You describe a model and a piece of hardware; it returns a
concrete, runnable configuration — per-layer weight precision, KV-cache
precision, context length, and a GPU/CPU/disk layer-placement split — plus the
exact llama.cpp, HuggingFace Transformers, and Ollama commands to realise it.
The core library and CLI have no third-party dependencies (pure Python).
Flask is only needed for the optional web dashboard; numpy/safetensors or
torch/transformers are only needed if you want measured per-layer sensitivity
instead of the built-in heuristic.
No install is required — the CLI runs straight from the source tree:
python -m llm_optimizer.cli analyze --model llama3-8b --hardware rtx4090
Or install it (gives you a plain llm-optimize command on your PATH):
pip install -e . # core only, zero dependencies
pip install -e ".[webui]" # + Flask dashboard
pip install -e ".[measured]" # + numpy/safetensors for measured sensitivity
After installing, the two forms are interchangeable — llm-optimize ... is
the same as python -m llm_optimizer.cli .... The examples below use whichever
is shorter.
llm-optimize analyze --model llama3-8b --hardware rtx4090
llm-optimize optimize --model llama3-70b --hardware rtx4090 \
--context 8192 --target-tps 3 --commands --compare
python examples/demo.py # three end-to-end scenarios
python webui/app.py # optional dashboard at 127.0.0.1:5000
A real optimize run — fitting a 70B on a single 24 GB card at a ~3 tok/s
target. AHMP picks a per-layer precision mix, offloads the overflow to CPU, and
drops the KV cache to q4_0 so the resident layers keep more bits:
$ llm-optimize optimize --model llama3-70b --hardware rtx4090 --context 8192 --target-tps 3 --compare
PLAN Llama-3-70B (70.55B) on RTX 4090 + DDR5
feasible: True
context length: 8192
KV cache: q4_0
per-layer quant: {'q4_k_m': 20, 'q4_0': 60}
placement: gpu=51 cpu=29 disk=0 (n_gpu_layers=51)
est decode tok/s: 3.15
usability: slow - usable for batch / non-interactive work (~3.2 tok/s)
quality proxy (rel-PPL %): 2.830 (estimated degradation vs fp16; lower is better)
notes:
- Tuned for ~3 tok/s target (reached ~3.2).
--- uniform Q4_K_M baseline (same context) ---
feasible=True ngl=46/80 tok/s=2.6 quality_proxy=1.078%
The usability line is an honest practicality verdict. feasible only means
the model fits across VRAM+RAM+disk — which is almost always true once disk
spill is allowed — so it is a weak signal on its own. usability reads the
estimated decode speed and grades it usable (interactive), slow (fine for
batch), impractical (fits but too slow to recommend), or infeasible. Asking
to run a 70B on a 4 GB card, for example, reports impractical (~1 tok/s)
rather than a misleading green light. It is also exposed in --json.
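A minimal sketch of how such a verdict could be derived from the estimated decode speed. The function name and both thresholds are assumptions made for illustration; the project's actual cut-offs are not stated here.

```python
def grade_usability(feasible: bool, est_tps: float,
                    interactive_tps: float = 5.0,
                    batch_tps: float = 2.0) -> str:
    """Map an estimated decode speed to a coarse practicality verdict.

    The thresholds are illustrative assumptions, not the project's values.
    """
    if not feasible:
        return "infeasible"          # does not fit even with disk spill
    if est_tps >= interactive_tps:
        return "usable"              # fast enough for interactive chat
    if est_tps >= batch_tps:
        return "slow"                # fine for batch / non-interactive work
    return "impractical"             # fits, but too slow to recommend


print(grade_usability(True, 3.15))   # slow
print(grade_usability(True, 1.0))    # impractical
```

With these assumed thresholds the 70B example above (~3.2 tok/s) grades as slow, and a plan that only reaches ~1 tok/s grades as impractical even though it is feasible.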
Add --commands to print the matching llama.cpp / HuggingFace / Ollama
invocations, or --json for machine-readable output.
Every individual technique here already exists: mixed-precision quantization, KV-cache quantization, partial GPU offload, context adaptation, mmap. They are normally chosen separately, by hand. AHMP's contribution is to treat them as one constrained allocation problem and solve the hard part exactly.
The key sub-problem is: given a total VRAM (or VRAM+RAM) budget, which
precision should each transformer block use to maximise quality? Each block
has several candidate precisions, each with a memory cost and a
sensitivity-weighted quality penalty, and exactly one must be chosen per block.
That is precisely a Multiple-Choice Knapsack Problem, which AHMP solves
with an exact dynamic program (optimizer.mckp_mixed_precision) over a
bucketised memory axis — so the precision mix is optimal for the budget, not a
greedy guess. Around that core, plan_hybrid co-schedules the remaining
levers:
- Solve mixed precision to fit everything in VRAM (trying KV f16 → q8_0 → q4_0).
- If that fails, re-solve against VRAM+RAM and assign a greedy GPU→CPU→disk placement, sensitive layers preferred for the GPU.
- If it still won't fit, shrink context (binary search) and finally spill to disk as a last resort.
- If a throughput target is given, walk a documented precision/KV "speed ladder" and return the least-aggressive setting that meets it.
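The knapsack core described above can be sketched as a small dynamic program. This is an illustration of the technique under stated assumptions, not the signature or internals of optimizer.mckp_mixed_precision: the function name, the (name, cost, penalty) tuples, and the toy numbers are all hypothetical. Costs are rounded up to whole buckets, so the chosen mix can never exceed the budget.

```python
import math


def mckp_min_penalty(blocks, budget, bucket):
    """Pick exactly one option per block, minimising total penalty
    subject to total cost <= budget.

    blocks: list (one entry per block) of lists of (name, cost, penalty).
    Returns (total_penalty, [chosen names]) or None if nothing fits.
    """
    cap = int(budget // bucket)
    INF = float("inf")
    best = [0] * (cap + 1)           # best[c]: min penalty using <= c buckets
    picks = []                       # picks[i][c]: option index for block i
    for options in blocks:
        new = [INF] * (cap + 1)
        pick = [-1] * (cap + 1)
        for c in range(cap + 1):
            for j, (_name, cost, penalty) in enumerate(options):
                w = math.ceil(cost / bucket)     # round up: never overshoot
                cand = best[c - w] + penalty if w <= c else INF
                if cand < new[c]:
                    new[c], pick[c] = cand, j
        best = new
        picks.append(pick)
    if best[cap] == INF:
        return None
    # Walk back from the last block to recover the chosen options.
    chosen, c = [], cap
    for options, pick in zip(reversed(blocks), reversed(picks)):
        name, cost, _penalty = options[pick[c]]
        chosen.append(name)
        c -= math.ceil(cost / bucket)
    return best[cap], chosen[::-1]


# Toy data (hypothetical units): three blocks with sensitivities 3, 1, 2.
OPTIONS = [("q8_0", 8, 1), ("q4_k_m", 5, 10), ("q4_0", 4, 20)]
blocks = [[(n, cost, s * pen) for n, cost, pen in OPTIONS] for s in (3, 1, 2)]
print(mckp_min_penalty(blocks, budget=18, bucket=1))
# (33, ['q8_0', 'q4_k_m', 'q4_k_m'])
```

In the toy run the most sensitive block keeps q8_0 and the other two drop to q4_k_m, which is the cheapest total penalty that still fits in 18 units.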
Sensitivity (which blocks deserve the bits) defaults to a U-shaped heuristic —
first and last blocks are treated as more sensitive — and can be replaced by a
real logit-KL probe when torch/transformers are installed
(sensitivity.measure_sensitivity).
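One way to picture the U-shaped default is a quadratic curve over block position. The function name, the quadratic shape, and the edge_boost parameter are assumptions for illustration; the built-in heuristic may use a different curve.

```python
def u_shaped_sensitivity(n_layers: int, edge_boost: float = 1.0) -> list:
    """Per-block weights: 1.0 in the middle, 1.0 + edge_boost at both ends.

    Hypothetical curve; only the U shape itself comes from the README.
    """
    if n_layers == 1:
        return [1.0 + edge_boost]
    mid = (n_layers - 1) / 2
    return [1.0 + edge_boost * ((i - mid) / mid) ** 2 for i in range(n_layers)]


print(u_shaped_sensitivity(5))   # [2.0, 1.25, 1.0, 1.25, 2.0]
```

Multiplying each format's penalty by these weights is what steers the knapsack toward spending bits on the first and last blocks.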
Objective, stated honestly: with no throughput target, AHMP maximises a
quality proxy subject only to the capacity budget. On a model that "fits"
across VRAM+RAM it will therefore keep precision high and accept heavy CPU
offload (i.e. slow generation). Pass --target-tps to make it trade precision
for residency/speed. The demo's Scenario 2 shows both behaviours on a 70B.
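The --target-tps behaviour can be sketched as a walk over an ordered ladder that stops at the first rung fast enough. The ladder contents, the function name, and the fallback to the fastest rung when nothing meets the target are assumptions; the toy speeds are made up for the example.

```python
from typing import Callable, Sequence, Tuple

Rung = Tuple[str, str]   # (weight precision floor, KV-cache type)

# Hypothetical ladder, ordered from least to most aggressive.
LADDER: Sequence[Rung] = [
    ("q8_0", "f16"),
    ("q6_k", "q8_0"),
    ("q4_k_m", "q8_0"),
    ("q4_0", "q4_0"),
]


def first_rung_meeting(target_tps: float,
                       estimate_tps: Callable[[Rung], float],
                       ladder: Sequence[Rung] = LADDER):
    """Return (rung, tps, met) for the least-aggressive rung that reaches
    target_tps, or the fastest rung with met=False if none does."""
    fastest = None
    for rung in ladder:
        tps = estimate_tps(rung)
        if tps >= target_tps:
            return rung, tps, True
        if fastest is None or tps > fastest[1]:
            fastest = (rung, tps)
    return fastest[0], fastest[1], False


# Toy estimator: made-up speeds per rung, standing in for a real re-plan.
toy_speeds = dict(zip(LADDER, [1.2, 1.9, 2.6, 3.15]))
print(first_rung_meeting(3.0, toy_speeds.__getitem__))
# (('q4_0', 'q4_0'), 3.15, True)
```

Stopping at the first rung that meets the target is what keeps the result "least aggressive": precision is only given up when the speed goal requires it.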
All estimates come from explicit formulas in estimator.py; nothing is mocked.
- Weights = params × effective_bpw / 8, with per-format effective bits (incl. block overhead) calibrated so e.g. Llama-3-8B Q4_K_M ≈ 4.9 GB, matching real GGUF sizes.
- KV cache = 2 × layers × context × (n_kv_heads × head_dim) × bytes/elem, which folds in GQA and matches published llama.cpp figures (e.g. Llama-2-7B @ 4096, f16 ≈ 2 GiB).
- Parameter counts are derived from architecture (embeddings, attention with GQA, gated/SwiGLU FFN, norms, output head), reproducing the headline sizes of the bundled presets.
- Decode throughput uses a first-order roofline: batch-1 generation is memory-bandwidth bound, so tok/s ≈ 1 / Σ_tier (bytes_read_tier / (bandwidth_tier × efficiency)). Prefill is not modelled.
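The formulas above are short enough to check by hand. The sketch below restates them as plain functions; the function names are hypothetical (not the estimator.py API), the 4.9 bits-per-weight figure is an assumed round number chosen to land near the quoted size, a single efficiency factor is assumed for all tiers, and the bandwidths in the roofline example are made up.

```python
GIB = 1024 ** 3


def weight_bytes(params: float, effective_bpw: float) -> float:
    """Weights = params x effective bits-per-weight / 8."""
    return params * effective_bpw / 8


def kv_cache_bytes(layers: int, context: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """KV cache = 2 (K and V) x layers x context x kv_width x bytes/elem."""
    return 2 * layers * context * n_kv_heads * head_dim * bytes_per_elem


def decode_tps(tiers, efficiency: float = 1.0) -> float:
    """Roofline: each token reads every resident byte once per tier.

    tiers: iterable of (bytes_read, bandwidth_bytes_per_second).
    """
    seconds = sum(b / (bw * efficiency) for b, bw in tiers if b > 0)
    return 1.0 / seconds


# Llama-2-7B at 4096 context, f16 KV: 32 layers, 32 KV heads, head_dim 128.
print(kv_cache_bytes(32, 4096, 32, 128, 2) / GIB)      # 2.0

# ~8.03B params at an assumed ~4.9 effective bits per weight.
print(round(weight_bytes(8.03e9, 4.9) / 1e9, 2))       # 4.92

# Made-up split: 10 GB read at 1000 GB/s plus 5 GB read at 50 GB/s.
print(round(decode_tps([(10e9, 1000e9), (5e9, 50e9)]), 2))   # 9.09
```

The last example shows why CPU offload dominates decode time: the 5 GB on the slow tier costs ten times as long per token as the 10 GB on the fast one.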
These are first-order estimates, not measurements, and the code labels them as such. The "quality proxy" is a sensitivity-weighted relative-perplexity estimate built from per-format degradation priors — useful for ranking plans, not a substitute for a real perplexity/benchmark run.
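As a rough picture of what a sensitivity-weighted proxy looks like, the sketch below takes a weighted mean of per-format degradation priors. The normalisation, the function name, and every prior value are assumptions for illustration; they are not the numbers in quant.py.

```python
def quality_proxy(formats, sensitivity, ppl_prior) -> float:
    """Sensitivity-weighted mean of per-format relative-PPL priors (%).

    formats:     chosen format name per block
    sensitivity: weight per block (same length as formats)
    ppl_prior:   format name -> assumed relative-PPL degradation in percent
    """
    total = sum(sensitivity)
    return sum(w * ppl_prior[f] for f, w in zip(formats, sensitivity)) / total


# Hypothetical priors, chosen only to make the arithmetic easy to follow.
PRIORS = {"q8_0": 0.1, "q4_k_m": 1.0, "q4_0": 3.0}

protect_edges = quality_proxy(["q8_0", "q4_0", "q4_0", "q8_0"], [2, 1, 1, 2], PRIORS)
protect_middle = quality_proxy(["q4_0", "q8_0", "q8_0", "q4_0"], [2, 1, 1, 2], PRIORS)
print(protect_edges < protect_middle)   # True
```

Both plans use the same formats the same number of times, so they cost the same memory; the proxy only differs because of where the bits are spent, which is the ranking signal the planner needs.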
llm-optimizer/
├── README.md
├── LICENSE # Apache-2.0
├── pyproject.toml # install + `llm-optimize` console command
├── requirements.txt
├── .gitignore
├── .github/workflows/ # CI: runs the test suites on push
├── llm_optimizer/
│ ├── __init__.py # public API
│ ├── hardware.py # Hardware presets (VRAM/RAM/bandwidth), unit conventions
│ ├── models.py # ModelArch + presets, exact param-count math
│ ├── quant.py # weight & KV quant catalogue (bpw, ggml types, ppl priors)
│ ├── estimator.py # validated memory + roofline-throughput formulas
│ ├── sensitivity.py # per-layer importance (heuristic, or torch logit-KL probe)
│ ├── optimizer.py # AHMP: MCKP DP + plan_hybrid co-scheduler (novel method)
│ ├── frontier.py # Pareto frontier (quality vs speed menu), zero-dep
│ ├── analysis.py # Model Analysis Engine (memory/perf across quants)
│ ├── runtime.py # llama.cpp / HF / Ollama command generators
│ ├── cli.py # argparse entrypoint (analyze / optimize, --json)
│ └── extras/ # optional, dependency-bearing add-ons (never imported by core)
│ └── measured.py # measured per-layer sensitivity (numpy + safetensors)
├── examples/
│ └── demo.py # three runnable scenarios, prints real output
├── tests/
│ ├── test_core.py # pins validated numbers + planner invariants (no deps)
│ └── test_extras.py # measured-sensitivity tests (skip cleanly w/o deps)
└── webui/
└── app.py # optional Flask dashboard (single file)
Run the tests with python tests/test_core.py (zero dependencies) or pytest tests/. They pin the eight preset parameter counts against published values, the exact KV-cache size, the uniform-quant weight footprint, and the planner invariants (the knapsack never exceeds its memory budget and never assigns a precision below the quality floor).
Models: llama3-8b, llama3-70b, llama2-7b, llama2-13b, mistral-7b,
qwen2-7b, phi3-mini, gemma2-9b. To add another model, construct a
ModelArch with its real architecture hyperparameters (the estimator needs
layer count, hidden size, head counts, and FFN width — not just a parameter
count) or register it in MODEL_PRESETS.
Hardware: rtx4090, rtx3090, rtx3060-12g, rtx4060-8g, rtx3050-4g,
a100-40g, a100-80g, m2-max, m3-pro, cpu-64g, cpu-32g.
For any machine without a preset, use --hardware custom and pass its specs:
llm-optimize optimize --model phi3-mini --hardware custom --vram 4 --ram 32
# optional bandwidths (defaults shown): --gpu-bw 200 --ram-bw 50 --disk-bw 3.5
# --vram 0 models a CPU-only box.
Instead of one plan, get the whole menu of non-dominated options on your hardware. The frontier is traced by sweeping the knapsack's memory budget from "every block at the floor" up to "every block at fp16", so each point is the optimal precision mix at its memory level, not the result of a grid search.
llm-optimize optimize --model llama3-8b --hardware rtx3050-4g --pareto
Each row is a feasible plan; reading down the list trades estimated quality for decode speed. This is zero-dependency (pure orchestration over the core DP).
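To make "non-dominated" concrete, here is a minimal dominance filter over candidate plans. It only illustrates the definition; it is not how frontier.py traces the frontier (that is the budget sweep described above), and the tuple layout and sample numbers are hypothetical.

```python
def pareto_front(plans):
    """Keep plans that no other plan beats on both axes.

    plans: list of (name, quality_penalty, tok_per_s);
           lower penalty is better, higher tok/s is better.
    """
    front = []
    for p in plans:
        dominated = any(
            q[1] <= p[1] and q[2] >= p[2] and (q[1] < p[1] or q[2] > p[2])
            for q in plans
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p[1])   # best quality first


candidates = [("A", 0.5, 2.0), ("B", 1.0, 4.0), ("C", 1.5, 3.0), ("D", 3.0, 9.0)]
print([name for name, _, _ in pareto_front(candidates)])   # ['A', 'B', 'D']
```

Plan C drops out because B is both higher quality and faster; A, B, and D each win on one axis, so all three stay on the menu.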
The core uses a heuristic per-layer importance curve. If you have the model
weights, you can replace the guess with a measurement: for each block,
extras.measured fake-quantises its weight matrices and scores the
reconstruction error (mean abs error relative to the median weight magnitude,
a robust scale that outliers don't inflate). This catches genuinely fragile
layers the position-based heuristic misses.
pip install numpy safetensors
llm-optimize optimize --model llama3-8b --hardware rtx3050-4g \
--sensitivity measured --weights /path/to/hf-model-dir
This is data-free (weights only, no forward pass) and needs only
numpy + safetensors — the core stays dependency-free and never imports it. For
the stronger activation-aware signal (logit-KL under fake quantisation), see
llm_optimizer.sensitivity.measure_sensitivity, which needs torch.
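The weight-only score described above can be sketched in pure Python. This is an assumption-laden illustration, not the extras.measured code (which uses numpy + safetensors): the function names are hypothetical, and the symmetric round-to-nearest quantiser with 32-value blocks stands in for whatever scheme the project uses.

```python
import statistics


def fake_quantize(weights, bits: int, block: int = 32):
    """Quantise then dequantise: symmetric round-to-nearest, one scale
    per block of `block` values (an assumed, simplified scheme)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        scale = max(abs(w) for w in chunk) / qmax
        if scale == 0.0:
            out.extend(chunk)        # all-zero block: nothing to quantise
            continue
        out.extend(round(w / scale) * scale for w in chunk)
    return out


def reconstruction_score(weights, bits: int) -> float:
    """Mean abs error relative to the median weight magnitude."""
    deq = fake_quantize(weights, bits)
    mae = statistics.fmean(abs(a - b) for a, b in zip(weights, deq))
    scale = statistics.median(abs(w) for w in weights)
    return mae / scale if scale > 0 else 0.0


smooth = [i / 100 for i in range(-16, 16)]
outlier = smooth[:-1] + [5.0]        # one large weight stretches the scale

print(reconstruction_score(outlier, 4) > reconstruction_score(smooth, 4))  # True
print(reconstruction_score(smooth, 8) < reconstruction_score(smooth, 4))   # True
```

The outlier block scores far worse at 4 bits because one large weight sets the quantisation scale and flattens the rest toward zero, while the median in the denominator stays almost unchanged. That is the kind of fragile layer a position-based heuristic cannot see.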
Everything the CLI does is available as a plain library — the planner returns a
Plan object and the command generators take it directly:
import llm_optimizer as L
arch = L.get_model("llama2-13b")
hw = L.get_hardware("rtx3060-12g")
plan = L.plan_hybrid(arch, hw, context_length=16384) # AHMP
print(plan.quant_histogram(), plan.kv_cache_type, plan.est_decode_tps)
cmds = L.all_commands(arch, hw, plan) # runnable commands
print(cmds["llama_cpp"])
- Throughput is a roofline ceiling; real tok/s is lower (kernel overhead, PCIe transfers, prefill). Treat it as a comparative estimate.
- The per-layer llama-quantize --tensor-type recipe needs a recent llama.cpp build; bitsandbytes expresses per-layer precision only coarsely (a 4/8-bit base + llm_int8_skip_modules); Ollama is one-quant-per-model, so AHMP reports the nearest uniform quant for it. Each generator states its own limitation inline.
python webui/app.py starts a single-file Flask app at 127.0.0.1:5000 where
you pick a model and hardware preset from dropdowns and get the same plan, fit
breakdown, and generated commands rendered in the browser — handy for sweeping
presets without retyping CLI flags. It needs Flask (pip install -e ".[webui]");
the core library and CLI do not.
Apache-2.0 — see LICENSE. Copyright 2026 Faisal.