Fai305/llm-optimizer

llm-optimizer

A planner for running large language models locally under a fixed memory budget. You describe a model and a piece of hardware; it returns a concrete, runnable configuration — per-layer weight precision, KV-cache precision, context length, and a GPU/CPU/disk layer-placement split — plus the exact llama.cpp, HuggingFace Transformers, and Ollama commands to realise it.

The core library and CLI are pure Python with no third-party dependencies. Flask is needed only for the optional web dashboard; numpy/safetensors (or torch/transformers, for the logit-KL probe) are needed only if you want measured per-layer sensitivity instead of the built-in heuristic.

Install

No install is required — the CLI runs straight from the source tree:

python -m llm_optimizer.cli analyze --model llama3-8b --hardware rtx4090

Or install it (gives you a plain llm-optimize command on your PATH):

pip install -e .                 # core only, zero dependencies
pip install -e ".[webui]"        # + Flask dashboard
pip install -e ".[measured]"     # + numpy/safetensors for measured sensitivity

After installing, the two forms are interchangeable — llm-optimize ... is the same as python -m llm_optimizer.cli .... The examples below use whichever is shorter.

Quickstart

llm-optimize analyze  --model llama3-8b  --hardware rtx4090
llm-optimize optimize --model llama3-70b --hardware rtx4090 \
        --context 8192 --target-tps 3 --commands --compare
python examples/demo.py            # three end-to-end scenarios
python webui/app.py                # optional dashboard at 127.0.0.1:5000

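A real optimize run: fitting a 70B on a single 24 GB card at a ~3 tok/s target. AHMP picks a per-layer precision mix, offloads the overflow to CPU, and drops the KV cache to q4_0. Compared with the uniform Q4_K_M baseline printed at the end, the tuned plan keeps more layers on the GPU (51 vs 46) and decodes faster (3.15 vs 2.6 tok/s), at a higher estimated quality cost (2.83% vs 1.08%):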

$ llm-optimize optimize --model llama3-70b --hardware rtx4090 --context 8192 --target-tps 3 --compare

PLAN  Llama-3-70B (70.55B) on RTX 4090 + DDR5
  feasible:        True
  context length:  8192
  KV cache:        q4_0
  per-layer quant: {'q4_k_m': 20, 'q4_0': 60}
  placement:       gpu=51 cpu=29 disk=0 (n_gpu_layers=51)
  est decode tok/s:    3.15
  usability:           slow - usable for batch / non-interactive work (~3.2 tok/s)
  quality proxy (rel-PPL %): 2.830  (estimated degradation vs fp16; lower is better)
  notes:
      - Tuned for ~3 tok/s target (reached ~3.2).

--- uniform Q4_K_M baseline (same context) ---
  feasible=True  ngl=46/80  tok/s=2.6  quality_proxy=1.078%

The usability line is a practicality verdict. feasible only means the model fits across VRAM+RAM+disk, which is almost always true once disk spill is allowed, so it is a weak signal on its own. usability reads the estimated decode speed and grades the plan as usable (interactive), slow (fine for batch), impractical (fits but too slow to recommend), or infeasible. Asking to run a 70B on a 4 GB card, for example, reports impractical (~1 tok/s) rather than a misleading green light. The verdict is also exposed in --json.
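As a rough illustration of the grading described above, the sketch below maps a feasibility flag and an estimated decode speed to one of the four verdicts. The function name and both thresholds are hypothetical, chosen only to be consistent with the two examples in this README (~3.2 tok/s reads as slow, ~1 tok/s as impractical); the tool's real cut-offs may differ.

```python
def grade_usability(feasible: bool, est_tps: float,
                    interactive_tps: float = 8.0,
                    batch_tps: float = 2.0) -> str:
    """Map a plan to a usability verdict.

    interactive_tps and batch_tps are ASSUMED thresholds for illustration,
    not values taken from llm-optimizer.
    """
    if not feasible:
        return "infeasible"      # does not fit even with disk spill
    if est_tps >= interactive_tps:
        return "usable"          # fast enough for interactive chat
    if est_tps >= batch_tps:
        return "slow"            # fine for batch / non-interactive work
    return "impractical"         # fits, but too slow to recommend


print(grade_usability(True, 3.15))   # slow
print(grade_usability(True, 1.0))    # impractical
print(grade_usability(False, 0.0))   # infeasible
```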

Add --commands to print the matching llama.cpp / HuggingFace / Ollama invocations, or --json for machine-readable output.

The novel method: AHMP (Adaptive Hybrid Memory Planner)

Every individual technique here already exists: mixed-precision quantization, KV-cache quantization, partial GPU offload, context adaptation, mmap. They are normally chosen separately, by hand. AHMP's contribution is to treat them as one constrained allocation problem and solve the hard part exactly.

The key sub-problem is: given a VRAM (or VRAM+RAM) budget for the transformer blocks, which precision should each block use to maximise quality? Each block has several candidate precisions, each with a memory cost and a sensitivity-weighted quality penalty, and exactly one must be chosen per block. That is precisely a Multiple-Choice Knapsack Problem, which AHMP solves with an exact dynamic program (optimizer.mckp_mixed_precision) over a bucketised memory axis, so the precision mix is optimal for the budget (up to the bucket granularity) rather than a greedy guess. Around that core, plan_hybrid co-schedules the remaining levers:

  1. Solve mixed precision to fit everything in VRAM (trying KV f16 → q8_0 → q4_0).
  2. If that fails, re-solve against VRAM+RAM and assign a greedy GPU→CPU→disk placement, sensitive layers preferred for the GPU.
  3. If it still won't fit, shrink context (binary search) and finally spill to disk as a last resort.
  4. If a throughput target is given, walk a documented precision/KV "speed ladder" and return the least-aggressive setting that meets it.
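The knapsack core that these steps call into can be sketched in a few lines. This is a minimal, generic Multiple-Choice Knapsack dynamic program, not the code in optimizer.mckp_mixed_precision: the function name, the input layout, and the toy costs and penalties are all assumptions made for illustration. Costs are rounded up to whole buckets, so a returned mix never exceeds the budget.

```python
import math


def mckp_min_penalty(blocks, budget, bucket=1.0):
    """Choose exactly one (cost, penalty) option per block so that the
    total cost stays within budget and the total penalty is minimal.

    Hypothetical helper for illustration. Returns (penalty, picks) where
    picks[i] is the chosen option index for block i, or None if nothing fits.
    """
    cap = int(budget // bucket)
    inf = float("inf")
    best = [inf] * (cap + 1)     # best[b]: min penalty using exactly b buckets
    best[0] = 0.0
    history = []                 # one back-pointer table per block
    for options in blocks:
        nxt = [inf] * (cap + 1)
        back = [None] * (cap + 1)
        for b in range(cap + 1):
            if best[b] == inf:
                continue
            for k, (cost, penalty) in enumerate(options):
                nb = b + math.ceil(cost / bucket)   # round cost UP
                if nb <= cap and best[b] + penalty < nxt[nb]:
                    nxt[nb] = best[b] + penalty
                    back[nb] = (k, b)
        best = nxt
        history.append(back)
    end = min(range(cap + 1), key=lambda b: best[b])
    if best[end] == inf:
        return None
    picks, b = [], end
    for back in reversed(history):   # walk the back-pointers to recover picks
        k, b = back[b]
        picks.append(k)
    picks.reverse()
    return best[end], picks


# Toy data: options are (memory cost, sensitivity-weighted penalty) for
# "fp16", "8-bit", "4-bit"; block 0 is assumed to be the most sensitive.
blocks = [
    [(4, 0), (2, 3), (1, 15)],
    [(4, 0), (2, 1), (1, 5)],
    [(4, 0), (2, 2), (1, 10)],
]
print(mckp_min_penalty(blocks, budget=8))   # (3.0, [0, 1, 1])
```

With a budget of 8 units, the most sensitive block keeps full precision and the other two drop one level, which is the kind of mix a greedy "quantise everything equally" rule would miss.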

Sensitivity (which blocks deserve the bits) defaults to a U-shaped heuristic — first and last blocks are treated as more sensitive — and can be replaced by a real logit-KL probe when torch/transformers are installed (sensitivity.measure_sensitivity).
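One simple way to picture a U-shaped prior is a parabola over block index. The sketch below is an assumption about the shape only: the function name, the edge_boost parameter, and the quadratic form are hypothetical and are not taken from sensitivity.py.

```python
def u_shaped_sensitivity(n_layers: int, edge_boost: float = 1.0) -> list[float]:
    """Return one weight per block: 1.0 in the middle of the stack,
    rising quadratically to 1.0 + edge_boost at the first and last block.

    Illustrative heuristic only; the real curve may differ.
    """
    if n_layers == 1:
        return [1.0 + edge_boost]
    mid = (n_layers - 1) / 2
    return [1.0 + edge_boost * ((i - mid) / mid) ** 2 for i in range(n_layers)]


print(u_shaped_sensitivity(5))   # [2.0, 1.25, 1.0, 1.25, 2.0]
```

Multiplying each precision's base penalty by these weights yields the sensitivity-weighted penalties that the knapsack consumes.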

Objective, stated honestly: with no throughput target, AHMP maximises a quality proxy subject only to the capacity budget. On a model that "fits" across VRAM+RAM it will therefore keep precision high and accept heavy CPU offload (i.e. slow generation). Pass --target-tps to make it trade precision for residency/speed. The demo's Scenario 2 shows both behaviours on a 70B.

What it models, and how

All estimates come from explicit formulas in estimator.py; nothing is mocked.

  • Weights = params × effective_bpw / 8, with per-format effective bits (incl. block overhead) calibrated so e.g. Llama-3-8B Q4_K_M ≈ 4.9 GB, matching real GGUF sizes.
  • KV cache = 2 × layers × context × (n_kv_heads × head_dim) × bytes/elem, which folds in GQA and matches published llama.cpp figures (e.g. Llama-2-7B @ 4096, f16 ≈ 2 GiB).
  • Parameter counts are derived from architecture (embeddings, attention with GQA, gated/SwiGLU FFN, norms, output head), reproducing the headline sizes of the bundled presets.
  • Decode throughput uses a first-order roofline: batch-1 generation is memory-bandwidth bound, so tok/s ≈ 1 / Σ_tier (bytes_read_tier / (bandwidth_tier × efficiency)). Prefill is not modelled.
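The three formulas above are small enough to check by hand. The sketch below restates them as plain functions; the function names are illustrative rather than estimator.py's API, the 4.85 bits-per-weight figure for Q4_K_M and the 8.03B parameter count are assumed values, and the bandwidth split in the last example is made up.

```python
GIB = 1024 ** 3


def weight_bytes(params: float, effective_bpw: float) -> float:
    """Weights = params x effective bits-per-weight / 8."""
    return params * effective_bpw / 8


def kv_cache_bytes(layers: int, context: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """KV cache = 2 (K and V) x layers x context x KV width x bytes/elem."""
    return 2 * layers * context * n_kv_heads * head_dim * bytes_per_elem


def decode_tps(tiers, efficiency: float = 1.0) -> float:
    """First-order roofline: tiers is a list of
    (bytes read per token, bandwidth in bytes/s) pairs, one per memory tier."""
    seconds_per_token = sum(nbytes / (bw * efficiency) for nbytes, bw in tiers)
    return 1.0 / seconds_per_token


# Llama-2-7B: 32 layers, 32 KV heads of dim 128, f16 cache at 4096 tokens.
kv = kv_cache_bytes(layers=32, context=4096, n_kv_heads=32,
                    head_dim=128, bytes_per_elem=2)
print(kv / GIB)                                  # 2.0

# Llama-3-8B at an ASSUMED 4.85 effective bits per weight for Q4_K_M.
print(round(weight_bytes(8.03e9, 4.85) / 1e9, 2))   # about 4.87 (GB)

# Made-up split: 12 GB read from a 1000 GB/s GPU, 4 GB from 50 GB/s RAM.
print(decode_tps([(12e9, 1000e9), (4e9, 50e9)]))    # roughly 10.9 tok/s
```

In the last example the 4 GB held in system RAM accounts for most of the per-token time, which is why layer placement matters as much as precision.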

These are first-order estimates, not measurements, and the code labels them as such. The "quality proxy" is a sensitivity-weighted relative-perplexity estimate built from per-format degradation priors — useful for ranking plans, not a substitute for a real perplexity/benchmark run.

Project structure

llm-optimizer/
├── README.md
├── LICENSE                 # Apache-2.0
├── pyproject.toml          # install + `llm-optimize` console command
├── requirements.txt
├── .gitignore
├── .github/workflows/      # CI: runs the test suites on push
├── llm_optimizer/
│   ├── __init__.py        # public API
│   ├── hardware.py        # Hardware presets (VRAM/RAM/bandwidth), unit conventions
│   ├── models.py          # ModelArch + presets, exact param-count math
│   ├── quant.py           # weight & KV quant catalogue (bpw, ggml types, ppl priors)
│   ├── estimator.py       # validated memory + roofline-throughput formulas
│   ├── sensitivity.py     # per-layer importance (heuristic, or torch logit-KL probe)
│   ├── optimizer.py       # AHMP: MCKP DP + plan_hybrid co-scheduler  (novel method)
│   ├── frontier.py        # Pareto frontier (quality vs speed menu), zero-dep
│   ├── analysis.py        # Model Analysis Engine (memory/perf across quants)
│   ├── runtime.py         # llama.cpp / HF / Ollama command generators
│   ├── cli.py             # argparse entrypoint (analyze / optimize, --json)
│   └── extras/            # optional, dependency-bearing add-ons (never imported by core)
│       └── measured.py    # measured per-layer sensitivity (numpy + safetensors)
├── examples/
│   └── demo.py            # three runnable scenarios, prints real output
├── tests/
│   ├── test_core.py       # pins validated numbers + planner invariants (no deps)
│   └── test_extras.py     # measured-sensitivity tests (skip cleanly w/o deps)
└── webui/
    └── app.py             # optional Flask dashboard (single file)

Run the tests with python tests/test_core.py (zero dependencies) or pytest tests/. They pin the eight preset parameter counts against published values, the exact KV-cache size, the uniform-quant weight footprint, and the planner invariants (the knapsack never exceeds its memory budget and never assigns a precision below the quality floor).

Presets

Models: llama3-8b, llama3-70b, llama2-7b, llama2-13b, mistral-7b, qwen2-7b, phi3-mini, gemma2-9b. To add another model, construct a ModelArch with its real architecture hyperparameters (the estimator needs layer count, hidden size, head counts, and FFN width — not just a parameter count) or register it in MODEL_PRESETS.

Hardware: rtx4090, rtx3090, rtx3060-12g, rtx4060-8g, rtx3050-4g, a100-40g, a100-80g, m2-max, m3-pro, cpu-64g, cpu-32g.

For any machine without a preset, use --hardware custom and pass its specs:

llm-optimize optimize --model phi3-mini --hardware custom --vram 4 --ram 32
# optional bandwidths (defaults shown): --gpu-bw 200 --ram-bw 50 --disk-bw 3.5
# --vram 0 models a CPU-only box.

Pareto frontier (quality vs speed)

Instead of one plan, get the whole menu of non-dominated options on your hardware. The frontier is traced by sweeping the knapsack's memory budget from "every block at the floor" up to "every block at fp16", so each point is the optimal precision mix at its memory level — not a grid search.

llm-optimize optimize --model llama3-8b --hardware rtx3050-4g --pareto

Each row is a feasible plan; reading down the list trades estimated quality for decode speed. This is zero-dependency (pure orchestration over the core DP).
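"Non-dominated" here means no other plan is both at least as good on quality and at least as fast. The sketch below shows only that filtering step on made-up (quality penalty, tok/s) pairs; it is a generic illustration with a hypothetical function name, not the budget sweep in frontier.py.

```python
def pareto_front(plans):
    """Keep the non-dominated plans from a list of (penalty, tok_s) pairs.

    Lower penalty is better, higher tok/s is better. After sorting by
    penalty, a plan survives only if it is faster than every plan with a
    lower (or equal) penalty.
    """
    front = []
    best_tps = float("-inf")
    for penalty, tps in sorted(plans, key=lambda p: (p[0], -p[1])):
        if tps > best_tps:
            front.append((penalty, tps))
            best_tps = tps
    return front


# Made-up candidate plans: (estimated quality penalty %, decode tok/s).
candidates = [(0.0, 2.0), (0.5, 3.0), (0.6, 2.5),
              (1.1, 6.0), (2.8, 6.0), (2.0, 9.0)]
for penalty, tps in pareto_front(candidates):
    print(f"penalty={penalty:.1f}%  tok/s={tps:.1f}")
```

Here (0.6, 2.5) and (2.8, 6.0) drop out because another plan beats or matches them on both axes.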

Measured sensitivity (optional extra)

The core uses a heuristic per-layer importance curve. If you have the model weights, you can replace the guess with a measurement: for each block, extras.measured fake-quantises its weight matrices and scores the reconstruction error (mean abs error relative to the median weight magnitude, a robust scale that outliers don't inflate). This catches genuinely fragile layers the position-based heuristic misses.

pip install numpy safetensors
llm-optimize optimize --model llama3-8b --hardware rtx3050-4g \
    --sensitivity measured --weights /path/to/hf-model-dir

This is data-free (weights only, no forward pass) and needs only numpy + safetensors — the core stays dependency-free and never imports it. For the stronger activation-aware signal (logit-KL under fake quantisation), see llm_optimizer.sensitivity.measure_sensitivity, which needs torch.
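To make the weight-only score concrete, here is a dependency-free sketch on a flat list of weights. The quantiser (symmetric round-to-nearest with one absmax scale per group), the group size, and the function names are assumptions for illustration; extras.measured works on real tensors with numpy and may use a different scheme.

```python
from statistics import median


def fake_quantise(weights, bits=4, group=4):
    """Quantise then dequantise: symmetric round-to-nearest with one
    absmax-derived scale per group of weights (ASSUMED scheme)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(weights), group):
        chunk = weights[start:start + group]
        scale = max(abs(w) for w in chunk) / qmax
        if scale == 0.0:             # all-zero group: nothing to quantise
            out.extend(chunk)
            continue
        out.extend(round(w / scale) * scale for w in chunk)
    return out


def reconstruction_score(weights, bits=4, group=4):
    """Mean absolute error relative to the median weight magnitude.

    Assumes the median magnitude is non-zero.
    """
    restored = fake_quantise(weights, bits, group)
    mae = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
    return mae / median(abs(w) for w in weights)


smooth = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
spiky = [0.1, -0.1, 0.1, 10.0, 0.1, -0.1, 0.1, -0.1]   # one large outlier

print(reconstruction_score(smooth, bits=8))
print(reconstruction_score(smooth, bits=3))
print(reconstruction_score(spiky, bits=4))
```

The outlier stretches its group's scale so far that the small weights round to zero, which is the kind of fragility a position-based prior cannot see.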

Python API

Everything the CLI does is available as a plain library — the planner returns a Plan object and the command generators take it directly:

import llm_optimizer as L

arch = L.get_model("llama2-13b")
hw   = L.get_hardware("rtx3060-12g")

plan = L.plan_hybrid(arch, hw, context_length=16384)   # AHMP
print(plan.quant_histogram(), plan.kv_cache_type, plan.est_decode_tps)

cmds = L.all_commands(arch, hw, plan)                   # runnable commands
print(cmds["llama_cpp"])

Caveats

  • Throughput is a roofline ceiling; real tok/s is lower (kernel overhead, PCIe transfers, prefill). Treat it as a comparative estimate.
  • The per-layer llama-quantize --tensor-type recipe needs a recent llama.cpp build; bitsandbytes expresses per-layer precision only coarsely (a 4/8-bit base + llm_int8_skip_modules); Ollama is one-quant-per-model, so AHMP reports the nearest uniform quant for it. Each generator states its own limitation inline.

Web dashboard (optional)

python webui/app.py starts a single-file Flask app at 127.0.0.1:5000 where you pick a model and hardware preset from dropdowns and get the same plan, fit breakdown, and generated commands rendered in the browser — handy for sweeping presets without retyping CLI flags. It needs Flask (pip install -e ".[webui]"); the core library and CLI do not.

License

Apache-2.0 — see LICENSE. Copyright 2026 Faisal.
