LLM Advisor

Dynamic, hardware-aware recommender for the best quantized LLMs to run locally. Detects your CPU/GPU/RAM and your installed runtime (Ollama / LM Studio / MLX), then picks the GGUF or MLX models that actually fit and run well — with concrete install commands.

Why

Running local LLMs is a moving target. New models drop weekly, every model has five quantization variants, and "will it fit my GPU?" requires math you don't want to do every time. llm-advisor answers that for you, dynamically:

It reads your hardware (CPU, RAM, GPU, VRAM) at runtime.
It pulls a live catalog from HuggingFace and the Ollama Library.
It checks real GGUF / safetensors file sizes from the HF API — no hand-maintained list, no stale guesses.
It picks the best variant per model for your hardware and shows you the exact ollama pull / lms get / mlx_lm.generate command.
On Apple Silicon it prefers native MLX models over GGUF.
It only suggests runtimes you actually have installed.

Features


| Feature | Details |
| --- | --- |
| 🖥️ Hardware detection | CPU, RAM, GPU/VRAM: NVIDIA via `nvidia-smi`, AMD via `rocm-smi`/WMI, Apple Silicon via `sysctl` (unified memory) |
| ⚙️ Runtime detection | Ollama (binary + `:11434/api/tags`) and LM Studio (binary, native `:1234/api/v0/models` + OpenAI-compat fallback). Lists locally installed models and which are currently loaded |
| 🌐 Live catalog | HuggingFace API for GGUF & MLX repos, Ollama Library scraping by category (`code`, `tools`), with a 10-min in-memory cache |
| 📏 Real file sizes | After ranking, the top candidates' actual GGUF / safetensors sizes are fetched from `/api/models/{repo}/tree/main` (split shards merged), so no more "estimated 4 GB, actually 18 GB" |
| 🎯 Smart variant picker | Largest-VRAM-fit if anything fits; else the smallest decent-quality variant that fits memory (so on a 16 GB GPU a 30B model is recommended as Q3_K_M @ 17 GB, not Q5_K_M @ 53 GB) |
| 🏷️ Use-case classification | `coding`, `agentic`, `tool_use`, `general`, and `all_round` (models strong in coding, agentic, and tool use) |
| 📊 Quant-quality awareness | Every recommendation comes with a quality label (such as `near-lossless`, `balanced`, `moderate`, `low`) and a one-line explainer per quant format (Q4_K_M, IQ3_M, TQ1_0, MLX 4bit, …) |
| 🍎 Apple Silicon: MLX-first | When `is_apple_silicon=True` we also fetch MLX-tagged repos and rank them above GGUF (MLX is significantly faster on M-series chips) |
| 🧰 Runtime-aware commands | If only LM Studio is running, you get only `lms get` lines. Only Ollama? Only `ollama pull`. Both? Both panels. None? A friendly nudge to install one. |

Demo

─── LLM Advisor ───
                      Hardware
 OS              Windows 11
 CPU             AMD Ryzen 9 5900X
 RAM             127.9 GB total · 80.7 GB free
 GPU             NVIDIA GeForce RTX 5060 Ti
 VRAM            15.9 GB
 Backend         cuda

─── Runtimes ───
 Runtime    Status     API                        Local models
 Ollama     not found  -                          -
 LM Studio  running    http://127.0.0.1:1234      12

─── All-round recommendations ───
 #  Model                       Params  Quant     Quality        Size     Tier
 1  Qwen3-Coder-Next               7B   UD-TQ1_0  experimental   17.6 GB  ok (split)
 2  Qwen2.5-Coder-7B-Instruct      7B   Q6_K      very high      11.6 GB  fast (GPU)
 3  Dolphin3.0-Llama3.1-8B         8B   Q8_0      near-lossless   8.0 GB  fast (GPU)
 4  Qwen2.5-Coder-14B-Instruct    14B   Q4_0      balanced        7.9 GB  fast (GPU)
 5  Qwen3.5-9B-Claude-4.6-OS       9B   Q8_0      near-lossless   9.8 GB  fast (GPU)

LM Studio (running) — download commands
  1. lms get Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
       → pick file: qwen2.5-coder-7b-instruct-q6_k.gguf (11.6 GB)
  ...

Quantization legend
  Q8_0   near-lossless · best quality
  Q6_K   very high quality · larger file
  Q4_K_M ★ sweet spot — recommended default
  IQ3_M  imatrix Q3 · for tight VRAM
  Q2_K / IQ2_*  severe loss · last resort

Installation

git clone https://github.com/<you>/llm-advisor
cd llm-advisor
pip install -r requirements.txt

Requirements

Python 3.10+
Dependencies: psutil, requests, rich, numpy (see requirements.txt)

Optional system tools (for accurate GPU detection)

| Platform | Tool | Notes |
| --- | --- | --- |
| NVIDIA | `nvidia-smi` | Ships with the driver, usually already on PATH |
| AMD on Linux | `rocm-smi` | From ROCm install |
| AMD/Intel on Windows | (none required) | Falls back to PowerShell `Get-CimInstance Win32_VideoController`. VRAM is best-effort because `AdapterRAM` is a 32-bit value capped at 4 GB. |
| Apple Silicon | (none required) | Unified memory read via `sysctl` |

Quick start

# One-shot — the recommended way to use this tool
python check_all.py            # top 5 all-round recommendations
python check_all.py 8          # top 8

# or as a sub-command
python llm_advisor.py all

That single command does everything: detects hardware, finds Ollama/LM Studio, lists locally installed models, fetches the live catalog from HuggingFace, verifies real file sizes, and prints only the models that are good across coding + agentic + tool use simultaneously, with concrete install commands matched to whichever runtime you have.

Usage

One-shot diagnostic

python check_all.py [TOP_N]      # default TOP_N = 5
python llm_advisor.py all -n 8

Individual sub-commands

python llm_advisor.py hardware    # CPU / RAM / GPU only
python llm_advisor.py runtimes    # Ollama + LM Studio + their installed models
python llm_advisor.py benchmark   # FP32 matmul benchmark + tier label
python llm_advisor.py quants      # full GGUF quantization reference table

Targeted recommendations

python llm_advisor.py recommend all_round         # ⭐ models that do everything
python llm_advisor.py recommend coding -n 8
python llm_advisor.py recommend agentic
python llm_advisor.py recommend tool_use
python llm_advisor.py recommend general

# Add an extra HF search keyword
python llm_advisor.py recommend coding --query "qwen3"

Use cases

| Use case | What it captures |
| --- | --- |
| `coding` | Code generation, refactoring, test writing (coder fine-tunes) |
| `agentic` | Multi-step reasoning, planning, autonomous agents (instruct LLMs) |
| `tool_use` | Function calling, structured tool/API invocation |
| `general` | All-purpose chat / instruct |
| `all_round` | Models scoring well in coding, agentic, and tool use |

How it works

1. Hardware detection (`hardware.py`)

CPU      → wmic / sysctl / /proc/cpuinfo
RAM      → psutil
GPU/VRAM → nvidia-smi  (NVIDIA, all platforms)
         → sysctl       (Apple Silicon unified memory ≈ 75% of RAM)
         → rocm-smi     (AMD on Linux)
         → Get-CimInstance Win32_VideoController  (AMD/Intel on Windows)
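
As an illustration of the NVIDIA path only, here is a minimal sketch of how VRAM could be read from `nvidia-smi`. The names `GpuInfo`, `parse_nvidia_smi_csv`, and `detect_nvidia_gpus` are hypothetical and not taken from `hardware.py`; the query flags are standard `nvidia-smi` options, and the parser assumes one `name, memory.total` row per GPU with memory in MiB.

```python
import shutil
import subprocess
from dataclasses import dataclass


@dataclass
class GpuInfo:
    name: str
    vram_gb: float


def parse_nvidia_smi_csv(output: str) -> list[GpuInfo]:
    """Parse `name, memory.total` CSV rows (MiB, no header, no units)."""
    gpus: list[GpuInfo] = []
    for line in output.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is tolerated.
        name, _, mem_mib = line.rpartition(",")
        try:
            vram_gb = round(int(mem_mib) / 1024, 1)
        except ValueError:
            continue  # skip rows such as "[N/A]"
        if name.strip():
            gpus.append(GpuInfo(name.strip(), vram_gb))
    return gpus


def detect_nvidia_gpus() -> list[GpuInfo]:
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi_csv(result.stdout)
```

Keeping the parser separate from the subprocess call makes it testable on machines without an NVIDIA GPU.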

2. Runtime detection (`runtimes.py`)

| Runtime | Binary on PATH | API endpoint | Local models source |
| --- | --- | --- | --- |
| Ollama | `ollama` | `http://127.0.0.1:11434/api/tags` | API + `~/.ollama/models` |
| LM Studio | `lms` (or app dir) | `http://127.0.0.1:1234/api/v0/models` (native, richer) → `/v1/models` (OpenAI-compat fallback) | API + `~/.lmstudio/models` |

We use 127.0.0.1 (not localhost) to avoid IPv6 ::1 resolution failures when the runtime binds only to IPv4 — a common gotcha on Windows. From the LM Studio native API we also surface state (loaded/not-loaded), arch, quantization, and capabilities (e.g. tool_use).
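
A minimal sketch of the Ollama probe described above, assuming the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field. It uses the standard library rather than `requests` so it runs without extra installs; `parse_ollama_tags` and `list_ollama_models` are hypothetical names, not the functions in `runtimes.py`.

```python
from __future__ import annotations

import json
import urllib.request

# 127.0.0.1 rather than localhost, for the IPv4/IPv6 reason given above.
OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"


def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    models = payload.get("models") or []
    return [m["name"] for m in models if "name" in m]


def list_ollama_models(url: str = OLLAMA_TAGS_URL,
                       timeout: float = 1.5) -> list[str] | None:
    """Return installed model names, or None if the API is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_ollama_tags(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection refused and timeouts; ValueError covers bad JSON.
        return None
```

Returning `None` for "not reachable" and an empty list for "reachable, no models" lets a caller tell a stopped runtime apart from an empty one.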

3. Catalog (`catalog.py`)

HuggingFace: GET /api/models?filter=gguf|mlx&sort=downloads&direction=-1
Ollama Library: HTML scrape of https://ollama.com/search?c=tools|code
Real sizes for the top candidates: GET /api/models/{repo}/tree/main?recursive=true, with split-shard merging (-00001-of-00003.gguf → one logical entry)
All responses cached in-memory for 10 min
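
The split-shard merging step can be sketched as follows. This is an illustration under assumptions, not the code in `catalog.py`: the input is taken to be `(path, size_in_bytes)` pairs already extracted from the tree listing, and `merge_split_shards` is a hypothetical name.

```python
import re
from collections import defaultdict

# Matches the "-00001-of-00003.gguf" suffix used by split GGUF files.
_SHARD_RE = re.compile(r"-\d{5}-of-\d{5}(\.gguf)$", re.IGNORECASE)


def merge_split_shards(files: list[tuple[str, int]]) -> dict[str, int]:
    """Map each logical file name to its total size, summing split shards."""
    totals: dict[str, int] = defaultdict(int)
    for path, size in files:
        # Drop the shard suffix so all parts share one logical name.
        logical = _SHARD_RE.sub(r"\1", path)
        totals[logical] += size
    return dict(totals)
```

With this, three shards of one quant variant count as a single entry whose size is the sum of the parts, which is the number that matters when checking whether the variant fits in memory.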

4. Use-case classification & scoring (`recommender.py`)

Each candidate is scored:

score = use_case_match_bonus      (0 / +50 / +100)
      + popularity                (downloads^0.25, capped at 50)
      + likes                     (likes^0.5, capped at 20)
      + hardware_tier_bonus       (-5 .. +30)
      + recency_bonus             (5–15 for 2024+/2025+)
      + apple_silicon_mlx_bonus   (+50 if MLX on Apple, -5 GGUF on Apple)
      - junk_penalty              (test/draft/experimental)

For all_round, the use-case bonus rewards breadth: +40 per matched skill (coding, agentic, tool_use), with a +40 bonus for hitting all three.
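
The formula above can be written out as a short sketch. The `Candidate` dataclass and the function names are hypothetical; the caps, exponents, and bonus values are the ones stated in this section, while the use-case, hardware-tier, recency, and junk terms are passed in as precomputed inputs because their exact rules are not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    downloads: int
    likes: int
    use_case_bonus: float        # 0, 50 or 100 (or the all_round value below)
    hardware_tier_bonus: float   # -5 .. +30
    recency_bonus: float         # 0, or 5-15 for recent models
    is_mlx: bool = False
    junk_penalty: float = 0.0


def all_round_bonus(matched_skills: set[str]) -> float:
    """+40 per matched skill, plus +40 when all three are matched."""
    skills = {"coding", "agentic", "tool_use"}
    hits = len(matched_skills & skills)
    return 40.0 * hits + (40.0 if hits == len(skills) else 0.0)


def score(c: Candidate, is_apple_silicon: bool = False) -> float:
    popularity = min(c.downloads ** 0.25, 50.0)
    likes = min(c.likes ** 0.5, 20.0)
    # The MLX adjustment only applies on Apple Silicon.
    if is_apple_silicon:
        mlx_bonus = 50.0 if c.is_mlx else -5.0
    else:
        mlx_bonus = 0.0
    return (c.use_case_bonus + popularity + likes + c.hardware_tier_bonus
            + c.recency_bonus + mlx_bonus - c.junk_penalty)
```

Note that the fourth-root on downloads compresses popularity heavily: a repo needs 6.25 million downloads to reach the cap of 50, so the use-case match usually dominates the ranking.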

5. Smart variant picker

For each top candidate the algorithm asks: which GGUF variant is the best fit for this user's hardware?

1. Largest variant that fully fits in VRAM    → "fast (GPU)"
2. Smallest decent-quality (≥ Q3) variant
   that fits in (VRAM + 80% RAM)              → "ok (split)" or "slow (CPU)"
3. Last-ditch: smallest variant that fits     → "slow (CPU)"
4. Otherwise                                  → drop the model

Why "smallest decent" in step 2? Because on a GPU with 16 GB of VRAM you don't want to be told "use Q5_K_M @ 53 GB on CPU"; you want a Q3_K_M @ 17 GB that mostly stays on the GPU with a small spill into RAM.
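
A minimal sketch of the four steps, with two labeled assumptions that the text above does not specify: "≥ Q3" is approximated as at least 3.5 bits per weight, and the step-2 tier is "ok (split)" when a GPU is present and "slow (CPU)" otherwise. `Variant` and `pick_variant` are hypothetical names, not the ones in `recommender.py`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    quant: str
    size_gb: float
    bits_per_weight: float


MIN_DECENT_BPW = 3.5  # assumption: stands in for "≥ Q3"


def pick_variant(variants: list[Variant], vram_gb: float,
                 ram_gb: float) -> tuple[Variant, str] | None:
    """Return (variant, tier), or None when the model should be dropped."""
    budget = vram_gb + 0.8 * ram_gb

    # Step 1: largest variant that fits entirely in VRAM.
    in_vram = [v for v in variants if v.size_gb <= vram_gb]
    if in_vram:
        return max(in_vram, key=lambda v: v.size_gb), "fast (GPU)"

    # Step 2: smallest decent-quality variant within VRAM + 80% of RAM.
    fits = [v for v in variants if v.size_gb <= budget]
    decent = [v for v in fits if v.bits_per_weight >= MIN_DECENT_BPW]
    if decent:
        tier = "ok (split)" if vram_gb > 0 else "slow (CPU)"  # assumption
        return min(decent, key=lambda v: v.size_gb), tier

    # Step 3: last-ditch, smallest variant that fits at all.
    if fits:
        return min(fits, key=lambda v: v.size_gb), "slow (CPU)"

    # Step 4: nothing fits.
    return None
```

With the variants from the example (Q3_K_M at 17 GB, Q5_K_M at 53 GB), 16 GB of VRAM, and 64 GB of RAM, step 1 finds nothing and step 2 selects Q3_K_M rather than the larger file.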

6. Apple Silicon: MLX preference

When hw.is_apple_silicon is true:

Both GGUF and MLX repos are fetched from HF (separate API calls).
MLX models get +50 to their score (M-series runs MLX much faster than GGUF).
GGUF models get -5 so MLX wins ties.
MLX-format models on non-Apple platforms are excluded entirely (they wouldn't run anyway).
The suggestion panel adds an MLX-LM section with mlx_lm.generate commands.

Quantization guide

The advisor labels every recommendation with a quality bucket and explains each format. Run python llm_advisor.py quants for the full table; here's the short version:

| Quant | Bits/wt | Quality | Trade-off |
| --- | --- | --- | --- |
| `F16/BF16` | 16.0 | lossless | Reference quality, very large |
| `Q8_0` | 8.5 | near-lossless | ~50% of FP16, best practical quality |
| `Q6_K` | 6.6 | very high | Almost lossless, big |
| `Q5_K_M` | 5.7 | high | Great quality, moderate size |
| `Q4_K_M` | 4.9 | balanced | ★ Sweet spot, recommended default |
| `IQ4_XS` | 4.3 | balanced | Imatrix Q4: slightly smaller, similar quality |
| `Q3_K_M` | 3.8 | moderate | Noticeable quality loss; for tight VRAM |
| `IQ3_M` | 3.7 | moderate | Imatrix Q3: better than Q3_K_S at similar size |
| `Q2_K` | 3.0 | low | Severe quality loss; last resort |
| `IQ2_*` | 2.1–2.7 | very low | Heavy compression, for huge models on small VRAM |
| `IQ1_*` | 1.6–1.8 | extreme low | Experimental / research |
| `TQ1_0` | ~1.6 | experimental | Ternary quantization; newer, evolving |
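
The bits-per-weight column gives a quick way to estimate file size: parameters × bits ÷ 8. The sketch below is only that rule of thumb, using a few values from the table; `estimate_size_gb` is a hypothetical helper, and the advisor itself relies on real file sizes because actual files include metadata and mixed-precision tensors.

```python
# Bits per weight for a few formats, copied from the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 3.8,
}


def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Rough weight-file size in GB (10^9 bytes): params * bits / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billions: float, quant: str, vram_gb: float) -> bool:
    """True if the estimated file size is within the given VRAM."""
    return estimate_size_gb(params_billions, quant) <= vram_gb
```

For a 7B model at Q4_K_M this works out to 7 × 4.9 ÷ 8 ≈ 4.3 GB. The estimate ignores the KV cache and runtime overhead, so it is a lower bound on the memory actually needed.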

Apple-MLX repos use a different convention: 4bit, 8bit, bf16, fp16. The advisor parses these from the repo name (mlx-community/...-4bit).
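
Parsing the MLX precision from a repo name could look like the sketch below. It handles only the four suffixes named above and assumes the suffix is the last dash- or underscore-separated token; `parse_mlx_quant` is a hypothetical name, and the repo IDs in the usage note are illustrative.

```python
from __future__ import annotations

import re

# The four MLX conventions listed above, anchored at the end of the name.
_MLX_QUANT_RE = re.compile(r"(?:^|[-_])(4bit|8bit|bf16|fp16)$", re.IGNORECASE)


def parse_mlx_quant(repo_id: str) -> str | None:
    """Return '4bit', '8bit', 'bf16' or 'fp16' from a repo ID, else None."""
    name = repo_id.rsplit("/", 1)[-1]  # drop the "mlx-community/" owner part
    match = _MLX_QUANT_RE.search(name)
    return match.group(1).lower() if match else None
```

For example, `parse_mlx_quant("mlx-community/Some-Model-7B-Instruct-4bit")` yields `"4bit"`, while a GGUF repo name with no such suffix yields `None`.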

Architecture

.
├── hardware.py        # CPU / RAM / GPU detection (cross-platform)
├── runtimes.py        # Ollama + LM Studio detection (binary, API, local models)
├── catalog.py         # Live fetching: HF API + Ollama Library + real GGUF/safetensors sizes
├── quants.py          # Quantization reference (bits/weight, quality bucket, trade-off)
├── recommender.py     # Use-case classification, scoring, variant picker, MLX preference
├── llm_advisor.py     # CLI entry point (argparse + rich)
├── check_all.py       # One-shot wrapper for the `all` sub-command
├── requirements.txt
└── README.md

Everything is plain requests + psutil + rich, plus numpy (see requirements.txt): no heavy ML deps, no HuggingFace SDK, no Ollama SDK. The whole thing works from a clean Python install in a few seconds.

Limitations

AMD/Intel GPUs on Windows — VRAM detection is best-effort. The Win32 AdapterRAM value is capped at 4 GB by the 32-bit type. NVIDIA and Apple Silicon are reliable.
Ollama Library scraping — no official API, so HTML scraping. Resilient but could break if the page is restructured. The 10-min cache softens this.
CPU benchmark — currently an FP32 matmul, not a real tokens/s test. It produces a coarse tier (mid, high, extreme) only.
Quant-quality labels are static — they reflect the typical bits/weight and llama.cpp release notes, not per-model evals. A Q4_K_M on a tiny model degrades more than on a huge model; the label doesn't capture that.
Param count from name — when neither the repo name nor HF tags carry a size like 7b, 30b-a3b, the advisor falls back to 7B as a default. The real-size enrichment step still corrects the displayed size, but the Params column may be off until HF tags improve.
MLX file sizing — assumes one quant variant per repo (the mlx-community convention). Repos with multiple variants in one folder will be summed.

Roadmap

Things on the wishlist:

Real tokens/s benchmark via the Ollama HTTP API when present
Ollama tag picker that prefers :Q4_K_M over :latest deterministically
Eval-leaderboard-driven scoring (EvalPlus, BFCL, MMLU-Pro, IFEval)
LM Studio config-file scan to resolve the user's actual model dir
Live tokens/s estimate per model based on bandwidth × layer count
Optional JSON output for scripting
serve mode that exposes recommendations via a tiny HTTP API

PRs welcome.

Contributing

Fork & clone
pip install -r requirements.txt
Make changes — keep modules focused (hardware.py only deals with hardware, etc.)
Smoke-test with python check_all.py
Open a PR

Style: type hints everywhere, no comments that just restate the code, dataclasses for structured data, rich for output.

License

MIT — see LICENSE (or treat as MIT until a file is added).

Credits

HuggingFace — model catalog API
Ollama — Library + local runtime
LM Studio — local runtime + native REST API
llama.cpp — GGUF format & quantization research
MLX — Apple's array framework for ML on Apple Silicon
Rich — the gorgeous terminal output
