Dynamic, hardware-aware recommender for the best quantized LLMs to run locally. Detects your CPU/GPU/RAM and your installed runtime (Ollama / LM Studio / MLX), then picks the GGUF or MLX models that actually fit and run well — with concrete install commands.
Running local LLMs is a moving target. New models drop weekly, every model has
five quantization variants, and "will it fit my GPU?" requires math you don't
want to do every time. llm-advisor answers that for you, dynamically:
- It reads your hardware (CPU, RAM, GPU, VRAM) at runtime.
- It pulls a live catalog from HuggingFace and the Ollama Library.
- It checks real GGUF / safetensors file sizes from the HF API — no hand-maintained list, no stale guesses.
- It picks the best variant per model for your hardware and shows you the exact `ollama pull` / `lms get` / `mlx_lm.generate` command.
- On Apple Silicon it prefers native MLX models over GGUF.
- It only suggests runtimes you actually have installed.
| Feature | Details |
|---|---|
| 🖥️ Hardware detection | CPU, RAM, GPU/VRAM — NVIDIA via nvidia-smi, AMD via rocm-smi/WMI, Apple Silicon via sysctl (unified memory) |
| ⚙️ Runtime detection | Ollama (binary + :11434/api/tags) and LM Studio (binary, native :1234/api/v0/models + OpenAI-compat fallback). Lists locally installed models and which are currently loaded |
| 🌐 Live catalog | HuggingFace API for GGUF & MLX repos, Ollama Library scraping by category (code, tools) — 10-min in-memory cache |
| 📏 Real file sizes | After ranking, the top candidates' actual GGUF / safetensors sizes are fetched from /api/models/{repo}/tree/main (split shards merged) — no more "estimated 4 GB, actually 18 GB" |
| 🎯 Smart variant picker | Largest-VRAM-fit if anything fits; else smallest decent-quality variant that fits memory (so a 30B model recommends Q3_K_M @ 17 GB, not Q5_K_M @ 53 GB onto your 16 GB GPU) |
| 🏷️ Use-case classification | coding, agentic, tool_use, general, and all_round (models strong in all three) |
| 📊 Quant-quality awareness | Every recommendation comes with a quality label — near-lossless, balanced, moderate, low — and a one-line explainer per quant format (Q4_K_M, IQ3_M, TQ1_0, MLX 4bit, …) |
| 🍎 Apple Silicon: MLX-first | When is_apple_silicon=True we also fetch MLX-tagged repos and rank them above GGUF (MLX is significantly faster on M-series chips) |
| 🧰 Runtime-aware commands | If only LM Studio is running, you get only lms get lines. Only Ollama? Only ollama pull. Both? Both panels. None? A friendly nudge to install one. |
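As a rough illustration of the runtime-aware behaviour in the last row, the sketch below builds install lines only for the runtimes that were detected. The function name `install_commands` and its parameters are hypothetical and not taken from the project; only the `ollama pull` and `lms get` commands come from this README.

```python
def install_commands(
    ollama_tag: str,
    hf_repo: str,
    ollama_found: bool,
    lmstudio_found: bool,
) -> list[str]:
    """Return install lines only for the runtimes that were detected."""
    lines: list[str] = []
    if ollama_found:
        lines.append(f"ollama pull {ollama_tag}")
    if lmstudio_found:
        lines.append(f"lms get {hf_repo}")
    if not lines:
        # Neither runtime present: nudge instead of printing unusable commands.
        lines.append("No runtime found: install Ollama or LM Studio first.")
    return lines


if __name__ == "__main__":
    for line in install_commands(
        "qwen2.5-coder:7b",
        "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
        ollama_found=False,
        lmstudio_found=True,
    ):
        print(line)
```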
─── LLM Advisor ───
Hardware
OS Windows 11
CPU AMD Ryzen 9 5900X
RAM 127.9 GB total · 80.7 GB free
GPU NVIDIA GeForce RTX 5060 Ti
VRAM 15.9 GB
Backend cuda
─── Runtimes ───
Runtime Status API Local models
Ollama not found - -
LM Studio running http://127.0.0.1:1234 12
─── All-round recommendations ───
# Model Params Quant Quality Size Tier
1 Qwen3-Coder-Next 7B UD-TQ1_0 experimental 17.6 GB ok (split)
2 Qwen2.5-Coder-7B-Instruct 7B Q6_K very high 11.6 GB fast (GPU)
3 Dolphin3.0-Llama3.1-8B 8B Q8_0 near-lossless 8.0 GB fast (GPU)
4 Qwen2.5-Coder-14B-Instruct 14B Q4_0 balanced 7.9 GB fast (GPU)
5 Qwen3.5-9B-Claude-4.6-OS 9B Q8_0 near-lossless 9.8 GB fast (GPU)
LM Studio (running) — download commands
1. lms get Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
→ pick file: qwen2.5-coder-7b-instruct-q6_k.gguf (11.6 GB)
...
Quantization legend
Q8_0 near-lossless · best quality
Q6_K very high quality · larger file
Q4_K_M ★ sweet spot — recommended default
IQ3_M imatrix Q3 · for tight VRAM
Q2_K / IQ2_* severe loss · last resort
git clone https://github.com/<you>/llm-advisor
cd llm-advisor
pip install -r requirements.txt
Requirements
- Python 3.10+
- Dependencies: `psutil`, `requests`, `rich`, `numpy` (see `requirements.txt`)
Optional system tools (for accurate GPU detection)
| Platform | Tool | Notes |
|---|---|---|
| NVIDIA | `nvidia-smi` | Ships with the driver; usually already on PATH |
| AMD on Linux | `rocm-smi` | From ROCm install |
| AMD/Intel on Windows | (none required) | Falls back to PowerShell Get-CimInstance Win32_VideoController. VRAM is best-effort due to a 32-bit Windows limit. |
| Apple Silicon | (none required) | Unified memory read via sysctl |
# One-shot — the recommended way to use this tool
python check_all.py # top 5 all-round recommendations
python check_all.py 8 # top 8
# or as a sub-command
python llm_advisor.py all
That single command does everything: it detects hardware, finds Ollama/LM Studio, lists locally installed models, fetches the live catalog from HuggingFace, verifies real file sizes, and prints only the models that are good across coding + agentic + tool use simultaneously, with concrete install commands matched to whichever runtime you have.
python check_all.py [TOP_N] # default TOP_N = 5
python llm_advisor.py all -n 8
python llm_advisor.py hardware # CPU / RAM / GPU only
python llm_advisor.py runtimes # Ollama + LM Studio + their installed models
python llm_advisor.py benchmark # FP32 matmul benchmark + tier label
python llm_advisor.py quants # full GGUF quantization reference table
python llm_advisor.py recommend all_round # ⭐ models that do everything
python llm_advisor.py recommend coding -n 8
python llm_advisor.py recommend agentic
python llm_advisor.py recommend tool_use
python llm_advisor.py recommend general
# Add an extra HF search keyword
python llm_advisor.py recommend coding --query "qwen3"
| Use case | What it captures |
|---|---|
| `coding` | Code generation, refactoring, test writing; coder fine-tunes |
| `agentic` | Multi-step reasoning, planning, autonomous agents; instruct LLMs |
| `tool_use` | Function calling, structured tool/API invocation |
| `general` | All-purpose chat / instruct |
| `all_round` | Models scoring well in coding and agentic and tool use |
CPU → wmic / sysctl / /proc/cpuinfo
RAM → psutil
GPU/VRAM → nvidia-smi (NVIDIA, all platforms)
→ sysctl (Apple Silicon unified memory ≈ 75% of RAM)
→ rocm-smi (AMD on Linux)
→ Get-CimInstance Win32_VideoController (AMD/Intel on Windows)
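For the NVIDIA branch of this chain, a minimal sketch could look like the following. It is not the project's `hardware.py`; the names `GpuInfo`, `parse_nvidia_smi`, and `detect_nvidia` are hypothetical. It relies on the `--query-gpu=name,memory.total --format=csv,noheader,nounits` options of `nvidia-smi`, which report total memory in MiB, and returns an empty list when the tool is missing so a caller can fall through to the next detector.

```python
import shutil
import subprocess
from dataclasses import dataclass


@dataclass
class GpuInfo:
    name: str
    vram_gb: float


def parse_nvidia_smi(csv_text: str) -> list[GpuInfo]:
    """Parse 'name, memory.total' CSV rows (memory in MiB, no header, no units)."""
    gpus: list[GpuInfo] = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue
        name, mem_mib = (part.strip() for part in line.rsplit(",", 1))
        try:
            gpus.append(GpuInfo(name=name, vram_gb=round(int(mem_mib) / 1024, 1)))
        except ValueError:
            continue  # e.g. "[N/A]" in the memory column
    return gpus


def detect_nvidia() -> list[GpuInfo]:
    """Return detected NVIDIA GPUs, or [] if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=5, check=True)
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi(result.stdout)
```

Keeping the parser separate from the subprocess call makes it testable without a GPU present.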
| Runtime | Binary on PATH | API endpoint | Local models source |
|---|---|---|---|
| Ollama | `ollama` | `http://127.0.0.1:11434/api/tags` | API + `~/.ollama/models` |
| LM Studio | `lms` (or app dir) | `http://127.0.0.1:1234/api/v0/models` (native, richer) → `/v1/models` (OpenAI-compat fallback) | API + `~/.lmstudio/models` |
We use 127.0.0.1 (not localhost) to avoid IPv6 ::1 resolution failures
when the runtime binds only to IPv4 — a common gotcha on Windows. From the
LM Studio native API we also surface state (loaded/not-loaded), arch,
quantization, and capabilities (e.g. tool_use).
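A small sketch of the Ollama side of this probe, using the explicit IPv4 address: the helper names `parse_ollama_tags` and `list_ollama_models` are hypothetical, and the sketch uses the standard library's `urllib` rather than `requests` only to stay dependency-free. It assumes the `/api/tags` response carries a `models` list whose entries have `name` and `size` (bytes) fields.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request

# 127.0.0.1 rather than "localhost", so the probe never resolves to IPv6 ::1.
OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"


def parse_ollama_tags(payload: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags payload into (model name, size in GB) pairs."""
    return [
        (model["name"], round(model.get("size", 0) / 1e9, 1))
        for model in payload.get("models", [])
    ]


def list_ollama_models(
    url: str = OLLAMA_TAGS_URL, timeout: float = 2.0
) -> list[tuple[str, float]] | None:
    """Return installed models, or None if the Ollama API is not reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_ollama_tags(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

Returning `None` for "not reachable" and `[]` for "reachable but no models" lets the caller tell a stopped runtime from an empty one.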
- HuggingFace: `GET /api/models?filter=gguf|mlx&sort=downloads&direction=-1`
- Ollama Library: HTML scrape of `https://ollama.com/search?c=tools|code`
- Real sizes for the top candidates: `GET /api/models/{repo}/tree/main?recursive=true`, with split-shard merging (`-00001-of-00003.gguf` → one logical entry)
- All responses cached in-memory for 10 min
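The split-shard merging step can be sketched as below. This is an illustration, not the code in `catalog.py`: `merge_shards` is a hypothetical name, and the input is assumed to be a list of tree entries with `type`, `path`, and `size` keys.

```python
import re
from collections import defaultdict

# Matches the "-00001-of-00003" part of a sharded GGUF file name.
SHARD_RE = re.compile(r"-\d{5}-of-\d{5}(?=\.gguf$)")


def merge_shards(entries: list[dict]) -> dict[str, int]:
    """Sum shard sizes so each split GGUF appears as one logical file."""
    merged: dict[str, int] = defaultdict(int)
    for entry in entries:
        if entry.get("type", "file") != "file":
            continue  # skip directories
        logical_name = SHARD_RE.sub("", entry["path"])
        merged[logical_name] += int(entry.get("size", 0))
    return dict(merged)


if __name__ == "__main__":
    demo = [
        {"type": "file", "path": "model-Q4_K_M-00001-of-00002.gguf", "size": 10},
        {"type": "file", "path": "model-Q4_K_M-00002-of-00002.gguf", "size": 6},
        {"type": "file", "path": "model-Q8_0.gguf", "size": 30},
    ]
    print(merge_shards(demo))
```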
Each candidate is scored:
score = use_case_match_bonus (0 / +50 / +100)
+ popularity (downloads^0.25, capped at 50)
+ likes (likes^0.5, capped at 20)
+ hardware_tier_bonus (-5 .. +30)
+ recency_bonus (5–15 for 2024+/2025+)
+ apple_silicon_mlx_bonus (+50 if MLX on Apple, -5 GGUF on Apple)
- junk_penalty (test/draft/experimental)
For all_round, the use-case bonus rewards breadth: +40 per matched skill
(coding, agentic, tool_use), with a +40 bonus for hitting all three.
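Read literally, the formula and the `all_round` rule above could be written as the sketch below. The function and parameter names are hypothetical, and the size of the junk penalty is not stated in this README, so it is left as an input.

```python
SKILLS = {"coding", "agentic", "tool_use"}


def all_round_bonus(matched_skills: set[str]) -> int:
    """+40 per matched skill, plus +40 when all three are matched."""
    hits = len(matched_skills & SKILLS)
    return 40 * hits + (40 if hits == len(SKILLS) else 0)


def score(
    *,
    use_case_bonus: float,
    downloads: int,
    likes: int,
    hardware_tier_bonus: float,
    recency_bonus: float,
    is_mlx: bool,
    on_apple_silicon: bool,
    junk_penalty: float = 0.0,
) -> float:
    """Combine the terms of the scoring formula, applying the stated caps."""
    popularity = min(downloads ** 0.25, 50.0)
    like_score = min(likes ** 0.5, 20.0)
    if on_apple_silicon:
        apple_bonus = 50.0 if is_mlx else -5.0
    else:
        apple_bonus = 0.0
    return (
        use_case_bonus
        + popularity
        + like_score
        + hardware_tier_bonus
        + recency_bonus
        + apple_bonus
        - junk_penalty
    )
```

The fourth root on downloads means a repo needs 10,000 downloads to earn 10 points, and the cap of 50 is only reached at 6.25 million.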
For each top candidate the algorithm asks: which GGUF variant is the best fit for this user's hardware?
1. Largest variant that fully fits in VRAM → "fast (GPU)"
2. Smallest decent-quality (≥ Q3) variant
that fits in (VRAM + 80% RAM) → "ok (split)" or "slow (CPU)"
3. Last-ditch: smallest variant that fits → "slow (CPU)"
4. Otherwise → drop the model
Why "smallest decent" in step 2? Because on a 16 GB VRAM GPU you don't want to be told "use Q5_K_M @ 53 GB on CPU" — you want a Q3_K_M @ 18 GB that mostly stays on GPU with a small spill into RAM.
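The four steps can be sketched as follows. This is an interpretation under stated assumptions, not the code in `recommender.py`: "≥ Q3" is approximated as at least 3.5 bits per weight, step 3 reuses the VRAM + 80% RAM budget, and the "ok (split)" label is used whenever a GPU is present. `Variant`, `pick_variant`, and `MIN_DECENT_BPW` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_DECENT_BPW = 3.5  # assumption: stands in for "decent quality (>= Q3)"


@dataclass(frozen=True)
class Variant:
    quant: str
    size_gb: float
    bits_per_weight: float


def pick_variant(
    variants: list[Variant], vram_gb: float, ram_gb: float
) -> tuple[Variant, str] | None:
    """Return (variant, tier) for this hardware, or None to drop the model."""

    def by_size(v: Variant) -> float:
        return v.size_gb

    # Step 1: largest variant that fits entirely in VRAM.
    in_vram = [v for v in variants if v.size_gb <= vram_gb]
    if in_vram:
        return max(in_vram, key=by_size), "fast (GPU)"

    # Step 2: smallest decent-quality variant within VRAM + 80% of RAM.
    budget = vram_gb + 0.8 * ram_gb
    spill_tier = "ok (split)" if vram_gb > 0 else "slow (CPU)"
    decent = [
        v for v in variants
        if v.bits_per_weight >= MIN_DECENT_BPW and v.size_gb <= budget
    ]
    if decent:
        return min(decent, key=by_size), spill_tier

    # Step 3: last-ditch, smallest variant of any quality that fits.
    fits = [v for v in variants if v.size_gb <= budget]
    if fits:
        return min(fits, key=by_size), "slow (CPU)"

    # Step 4: nothing fits.
    return None
```

With a 16 GB GPU and 64 GB of RAM, a model offered as Q3_K_M @ 18 GB and Q5_K_M @ 53 GB resolves to the Q3_K_M variant with the "ok (split)" tier, which matches the example in the paragraph above.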
When hw.is_apple_silicon is true:
- Both GGUF and MLX repos are fetched from HF (separate API calls).
- MLX models get +50 to their score (M-series runs MLX much faster than GGUF).
- GGUF models get -5 so MLX wins ties.
- MLX-format models on non-Apple platforms are excluded entirely (they wouldn't run anyway).
- The suggestion panel adds an MLX-LM section with `mlx_lm.generate` commands.
The advisor labels every recommendation with a quality bucket and explains
each format. Run python llm_advisor.py quants for the full table; here's the
short version:
| Quant | Bits/wt | Quality | Trade-off |
|---|---|---|---|
| `F16`/`BF16` | 16.0 | lossless | Reference quality, very large |
| `Q8_0` | 8.5 | near-lossless | ~50% of FP16, best practical quality |
| `Q6_K` | 6.6 | very high | Almost lossless, big |
| `Q5_K_M` | 5.7 | high | Great quality, moderate size |
| `Q4_K_M` | 4.9 | balanced | ★ Sweet spot; recommended default |
| `IQ4_XS` | 4.3 | balanced | Imatrix Q4: slightly smaller, similar quality |
| `Q3_K_M` | 3.8 | moderate | Noticeable quality loss; for tight VRAM |
| `IQ3_M` | 3.7 | moderate | Imatrix Q3: better than Q3_K_S at similar size |
| `Q2_K` | 3.0 | low | Severe quality loss; last resort |
| `IQ2_*` | 2.1–2.7 | very low | Heavy compression, for huge models on small VRAM |
| `IQ1_*` | 1.6–1.8 | extreme low | Experimental / research |
| `TQ1_0` | ~1.6 | experimental | Ternary quantization; newer, evolving |
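The bits/weight column supports a quick back-of-the-envelope size estimate: size in GB ≈ parameters (in billions) × bits per weight / 8. The sketch below applies that rule of thumb using values from the table. The names are hypothetical, and the result is only an approximation, which is why the advisor fetches real file sizes instead.

```python
# Bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "IQ4_XS": 4.3,
    "Q3_K_M": 3.8,
    "IQ3_M": 3.7,
    "Q2_K": 3.0,
}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Rough file size: billions of weights x bits per weight / 8 bits per byte."""
    return round(params_billion * BITS_PER_WEIGHT[quant] / 8, 1)


if __name__ == "__main__":
    for quant in ("Q8_0", "Q4_K_M", "Q3_K_M"):
        print(f"7B @ {quant}: ~{estimate_size_gb(7, quant)} GB")
```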
Apple-MLX repos use a different convention: 4bit, 8bit, bf16, fp16.
The advisor parses these from the repo name (mlx-community/...-4bit).
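Parsing the MLX precision from a repo name might look like the sketch below. The regex and the function name `parse_mlx_quant` are assumptions, and the repo name in the usage example is made up for illustration.

```python
from __future__ import annotations

import re

# The four suffixes listed above, expected at the end of the repo name.
MLX_QUANT_RE = re.compile(r"-(4bit|8bit|bf16|fp16)$", re.IGNORECASE)


def parse_mlx_quant(repo_id: str) -> str | None:
    """Return the MLX precision suffix of a repo id, or None if there is none."""
    match = MLX_QUANT_RE.search(repo_id)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    # Hypothetical repo name, used only to show the call.
    print(parse_mlx_quant("mlx-community/Example-7B-Instruct-4bit"))
```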
.
├── hardware.py # CPU / RAM / GPU detection (cross-platform)
├── runtimes.py # Ollama + LM Studio detection (binary, API, local models)
├── catalog.py # Live fetching: HF API + Ollama Library + real GGUF/safetensors sizes
├── quants.py # Quantization reference (bits/weight, quality bucket, trade-off)
├── recommender.py # Use-case classification, scoring, variant picker, MLX preference
├── llm_advisor.py # CLI entry point (argparse + rich)
├── check_all.py # One-shot wrapper for the `all` sub-command
├── requirements.txt
└── README.md
Everything is plain requests + psutil + rich (plus numpy, per requirements.txt): no heavy ML deps, no
HuggingFace SDK, no Ollama SDK. It installs into a clean Python environment
in a few seconds.
- AMD/Intel GPUs on Windows: VRAM detection is best-effort. The Win32 `AdapterRAM` value is capped at 4 GB by the 32-bit type. NVIDIA and Apple Silicon are reliable.
- Ollama Library scraping: no official API, so HTML scraping. Resilient but could break if the page is restructured. The 10-min cache softens this.
- CPU benchmark: currently an FP32 matmul, not a real tokens/s test. It produces a coarse tier (`mid`, `high`, `extreme`) only.
- Quant-quality labels are static: they reflect the typical bits/weight and llama.cpp release notes, not per-model evals. A `Q4_K_M` on a tiny model degrades more than on a huge model; the label doesn't capture that.
- Param count from name: when neither the repo name nor HF tags carry a size like `7b` or `30b-a3b`, the advisor falls back to 7B as a default. The real-size enrichment step still corrects the displayed size, but the Params column may be off until HF tags improve.
- MLX file sizing: assumes one quant variant per repo (the `mlx-community` convention). Repos with multiple variants in one folder will be summed.
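The param-count fallback in the list above can be illustrated with a short sketch. The regex and the names `PARAM_RE`, `DEFAULT_PARAMS_B`, and `params_from_name` are assumptions, not the project's code. It takes the first standalone `<number>b` token in the name and otherwise returns the 7B default, which is how a name without a size token (such as `Qwen3-Coder-Next` in the sample output) ends up labelled 7B.

```python
import re

DEFAULT_PARAMS_B = 7.0  # fallback when the name carries no size token

# A number followed by "b", not glued to other letters, digits, or a decimal
# point, so "7B" and "30b" match while the "3B" in "A3B" does not.
PARAM_RE = re.compile(r"(?<![a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])", re.IGNORECASE)


def params_from_name(name: str) -> float:
    """Return the parameter count in billions parsed from a model name."""
    match = PARAM_RE.search(name)
    return float(match.group(1)) if match else DEFAULT_PARAMS_B


if __name__ == "__main__":
    for name in ("Qwen2.5-Coder-7B-Instruct", "qwen3-30b-a3b", "Qwen3-Coder-Next"):
        print(name, "->", params_from_name(name))
```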
Things on the wishlist:
- Real tokens/s benchmark via the Ollama HTTP API when present
- Ollama tag picker that prefers `:Q4_K_M` over `:latest` deterministically
- Eval-leaderboard-driven scoring (EvalPlus, BFCL, MMLU-Pro, IFEval)
- LM Studio config-file scan to resolve the user's actual model dir
- Live token/s estimate per model based on bandwidth × layer count
- Optional JSON output for scripting
- `serve` mode that exposes recommendations via a tiny HTTP API
PRs welcome.
- Fork & clone
- `pip install -r requirements.txt`
- Make changes; keep modules focused (`hardware.py` only deals with hardware, etc.)
- Smoke-test with `python check_all.py`
- Open a PR
Style: type hints everywhere, no comments that just restate the code, dataclasses for structured data, rich for output.
MIT — see LICENSE (or treat as MIT until a file is added).