zurd46/LLM-Advisor

LLM Advisor

Dynamic, hardware-aware recommender for the best quantized LLMs to run locally. Detects your CPU/GPU/RAM and your installed runtime (Ollama / LM Studio / MLX), then picks the GGUF or MLX models that actually fit and run well — with concrete install commands.


Why

Running local LLMs is a moving target. New models drop weekly, every model has five quantization variants, and "will it fit my GPU?" requires math you don't want to do every time. llm-advisor answers that for you, dynamically:

  • It reads your hardware (CPU, RAM, GPU, VRAM) at runtime.
  • It pulls a live catalog from HuggingFace and the Ollama Library.
  • It checks real GGUF / safetensors file sizes from the HF API — no hand-maintained list, no stale guesses.
  • It picks the best variant per model for your hardware and shows you the exact ollama pull / lms get / mlx_lm.generate command.
  • On Apple Silicon it prefers native MLX models over GGUF.
  • It only suggests runtimes you actually have installed.

Features

🖥️ Hardware detection: CPU, RAM, GPU/VRAM. NVIDIA via nvidia-smi, AMD via rocm-smi/WMI, Apple Silicon via sysctl (unified memory).
⚙️ Runtime detection: Ollama (binary + :11434/api/tags) and LM Studio (binary, native :1234/api/v0/models + OpenAI-compat fallback). Lists locally installed models and which are currently loaded.
🌐 Live catalog: HuggingFace API for GGUF & MLX repos, Ollama Library scraping by category (code, tools). Responses are cached in memory for 10 min.
📏 Real file sizes: After ranking, the top candidates' actual GGUF / safetensors sizes are fetched from /api/models/{repo}/tree/main (split shards merged), so no more "estimated 4 GB, actually 18 GB".
🎯 Smart variant picker: Picks the largest variant that fits in VRAM if any does; otherwise the smallest decent-quality variant that fits in memory (so on a 16 GB GPU a 30B model gets Q3_K_M @ 17 GB, not Q5_K_M @ 53 GB).
🏷️ Use-case classification: coding, agentic, tool_use, general, and all_round (models strong in all three of coding, agentic, and tool use).
📊 Quant-quality awareness: Every recommendation comes with a quality label (near-lossless, balanced, moderate, low) and a one-line explainer per quant format (Q4_K_M, IQ3_M, TQ1_0, MLX 4bit, …).
🍎 Apple Silicon, MLX-first: When is_apple_silicon=True we also fetch MLX-tagged repos and rank them above GGUF (MLX is significantly faster on M-series chips).
🧰 Runtime-aware commands: If only LM Studio is running, you get only lms get lines. Only Ollama? Only ollama pull. Both? Both panels. None? A friendly nudge to install one.

Demo

─── LLM Advisor ───
                      Hardware
 OS              Windows 11
 CPU             AMD Ryzen 9 5900X
 RAM             127.9 GB total · 80.7 GB free
 GPU             NVIDIA GeForce RTX 5060 Ti
 VRAM            15.9 GB
 Backend         cuda

─── Runtimes ───
 Runtime    Status     API                        Local models
 Ollama     not found  -                          -
 LM Studio  running    http://127.0.0.1:1234      12

─── All-round recommendations ───
 #  Model                       Params  Quant     Quality        Size     Tier
 1  Qwen3-Coder-Next               7B   UD-TQ1_0  experimental   17.6 GB  ok (split)
 2  Qwen2.5-Coder-7B-Instruct      7B   Q6_K      very high      11.6 GB  fast (GPU)
 3  Dolphin3.0-Llama3.1-8B         8B   Q8_0      near-lossless   8.0 GB  fast (GPU)
 4  Qwen2.5-Coder-14B-Instruct    14B   Q4_0      balanced        7.9 GB  fast (GPU)
 5  Qwen3.5-9B-Claude-4.6-OS       9B   Q8_0      near-lossless   9.8 GB  fast (GPU)

LM Studio (running) — download commands
  1. lms get Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
       → pick file: qwen2.5-coder-7b-instruct-q6_k.gguf (11.6 GB)
  ...

Quantization legend
  Q8_0   near-lossless · best quality
  Q6_K   very high quality · larger file
  Q4_K_M ★ sweet spot — recommended default
  IQ3_M  imatrix Q3 · for tight VRAM
  Q2_K / IQ2_*  severe loss · last resort

Installation

git clone https://github.com/<you>/llm-advisor
cd llm-advisor
pip install -r requirements.txt

Requirements

  • Python 3.10+
  • Dependencies: psutil, requests, rich, numpy (see requirements.txt)

Optional system tools (for accurate GPU detection)

Platform              Tool             Notes
NVIDIA                nvidia-smi       Ships with the driver; usually already on PATH
AMD on Linux          rocm-smi         Installed with ROCm
AMD/Intel on Windows  (none required)  Falls back to PowerShell Get-CimInstance Win32_VideoController. VRAM is best-effort because AdapterRAM is a 32-bit value capped at 4 GB.
Apple Silicon         (none required)  Unified memory read via sysctl

Quick start

# One-shot — the recommended way to use this tool
python check_all.py            # top 5 all-round recommendations
python check_all.py 8          # top 8

# or as a sub-command
python llm_advisor.py all

That single command does everything: detects hardware, finds Ollama/LM Studio, lists locally installed models, fetches the live catalog from HuggingFace, verifies real file sizes, and prints only the models that are good across coding + agentic + tool use simultaneously, with concrete install commands matched to whichever runtime you have.


Usage

One-shot diagnostic

python check_all.py [TOP_N]      # default TOP_N = 5
python llm_advisor.py all -n 8

Individual sub-commands

python llm_advisor.py hardware    # CPU / RAM / GPU only
python llm_advisor.py runtimes    # Ollama + LM Studio + their installed models
python llm_advisor.py benchmark   # FP32 matmul benchmark + tier label
python llm_advisor.py quants      # full GGUF quantization reference table

Targeted recommendations

python llm_advisor.py recommend all_round         # ⭐ models that do everything
python llm_advisor.py recommend coding -n 8
python llm_advisor.py recommend agentic
python llm_advisor.py recommend tool_use
python llm_advisor.py recommend general

# Add an extra HF search keyword
python llm_advisor.py recommend coding --query "qwen3"

Use cases

Use case   What it captures
coding     Code generation, refactoring, test writing (coder fine-tunes)
agentic    Multi-step reasoning, planning, autonomous agents (instruct LLMs)
tool_use   Function calling, structured tool/API invocation
general    All-purpose chat / instruct
all_round  Models scoring well in coding, agentic, and tool use

How it works

1. Hardware detection (hardware.py)

CPU      → wmic / sysctl / /proc/cpuinfo
RAM      → psutil
GPU/VRAM → nvidia-smi  (NVIDIA, all platforms)
         → sysctl       (Apple Silicon: usable GPU memory taken as ≈ 75% of unified RAM)
         → rocm-smi     (AMD on Linux)
         → Get-CimInstance Win32_VideoController  (AMD/Intel on Windows)
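
As an illustration of the NVIDIA branch above, here is a minimal sketch of querying nvidia-smi and parsing its CSV output. The function names are hypothetical and not taken from hardware.py; the `--query-gpu` and `--format` flags are standard nvidia-smi options, and memory.total is reported in MiB.

```python
import shutil
import subprocess


def parse_nvidia_smi(csv_text: str) -> list[tuple[str, float]]:
    """Parse 'name, memory.total' rows (MiB) into (name, vram_gib) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, _, mem_mib = line.rpartition(",")
        if not name:
            continue  # skip rows without a comma
        try:
            vram_gib = round(int(mem_mib.strip()) / 1024, 1)
        except ValueError:
            continue  # e.g. "[N/A]" when the driver cannot report memory
        gpus.append((name.strip(), vram_gib))
    return gpus


def detect_nvidia() -> list[tuple[str, float]]:
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi(out)
```

Keeping the parser separate from the subprocess call makes it testable on machines without an NVIDIA GPU.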

2. Runtime detection (runtimes.py)

Runtime    Binary on PATH    API endpoint                                            Local models source
Ollama     ollama            http://127.0.0.1:11434/api/tags                         API + ~/.ollama/models
LM Studio  lms (or app dir)  http://127.0.0.1:1234/api/v0/models (native, richer)    API + ~/.lmstudio/models
                             → /v1/models (OpenAI-compat fallback)

We use 127.0.0.1 (not localhost) to avoid IPv6 ::1 resolution failures when the runtime binds only to IPv4 — a common gotcha on Windows. From the LM Studio native API we also surface state (loaded/not-loaded), arch, quantization, and capabilities (e.g. tool_use).

3. Catalog (catalog.py)

  • HuggingFace: GET /api/models?filter=gguf|mlx&sort=downloads&direction=-1
  • Ollama Library: HTML scrape of https://ollama.com/search?c=tools|code
  • Real sizes for the top candidates: GET /api/models/{repo}/tree/main?recursive=true, with split-shard merging (-00001-of-00003.gguf → one logical entry)
  • All responses cached in-memory for 10 min
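
The split-shard merging mentioned in the list above could look roughly like this. It is a sketch only: it assumes the file tree has already been reduced to (path, size) pairs, and the regex and function name are illustrative rather than the ones used in catalog.py.

```python
import re
from collections import defaultdict

# Matches a "-00001-of-00003" shard suffix directly before ".gguf".
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})(?=\.gguf$)", re.IGNORECASE)


def merge_shards(files: list[tuple[str, int]]) -> dict[str, int]:
    """Sum shard sizes so each split GGUF appears once under its logical name."""
    totals: dict[str, int] = defaultdict(int)
    for path, size in files:
        logical = SHARD_RE.sub("", path)  # unsharded files pass through unchanged
        totals[logical] += size
    return dict(totals)
```

Without this step, a model split into three shards would show up as three small files, and the picker would underestimate its size.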

4. Use-case classification & scoring (recommender.py)

Each candidate is scored:

score = use_case_match_bonus      (0 / +50 / +100)
      + popularity                (downloads^0.25, capped at 50)
      + likes                     (likes^0.5, capped at 20)
      + hardware_tier_bonus       (-5 .. +30)
      + recency_bonus             (5–15 for 2024+/2025+)
      + apple_silicon_mlx_bonus   (+50 if MLX on Apple, -5 GGUF on Apple)
      - junk_penalty              (test/draft/experimental)

For all_round, the use-case bonus rewards breadth: +40 per matched skill (coding, agentic, tool_use), with a +40 bonus for hitting all three.
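
The formula can be sketched in a few lines. This is an illustration under assumptions, not the code in recommender.py: the Candidate fields and function names are hypothetical, and the tier, recency, and junk terms are passed in as plain numbers because the README gives only their ranges.

```python
from dataclasses import dataclass

SKILLS = ("coding", "agentic", "tool_use")


@dataclass
class Candidate:
    downloads: int
    likes: int
    skills: frozenset[str]  # subset of SKILLS the model was classified into
    is_mlx: bool = False


def all_round_bonus(skills: frozenset[str]) -> int:
    """+40 per matched skill, plus a further +40 when all three match."""
    matched = sum(1 for s in SKILLS if s in skills)
    return 40 * matched + (40 if matched == len(SKILLS) else 0)


def score(c: Candidate, *, use_case_bonus: float, tier_bonus: float = 0.0,
          recency_bonus: float = 0.0, junk_penalty: float = 0.0,
          on_apple_silicon: bool = False) -> float:
    """Combine the terms of the scoring formula for one candidate."""
    popularity = min(c.downloads ** 0.25, 50.0)  # fourth root, capped at 50
    likes = min(c.likes ** 0.5, 20.0)            # square root, capped at 20
    apple = (50.0 if c.is_mlx else -5.0) if on_apple_silicon else 0.0
    return (use_case_bonus + popularity + likes + tier_bonus
            + recency_bonus + apple - junk_penalty)
```

The root-based popularity terms flatten quickly, so a repo with a million downloads cannot outweigh a use-case match on download count alone.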

5. Smart variant picker

For each top candidate the algorithm asks: which GGUF variant is the best fit for this user's hardware?

1. Largest variant that fully fits in VRAM    → "fast (GPU)"
2. Smallest decent-quality (≥ Q3) variant
   that fits in (VRAM + 80% RAM)              → "ok (split)" or "slow (CPU)"
3. Last-ditch: smallest variant that fits     → "slow (CPU)"
4. Otherwise                                  → drop the model

Why "smallest decent" in step 2? Because on a 16 GB VRAM GPU you don't want to be told "use Q5_K_M @ 53 GB on CPU" — you want a Q3_K_M @ 18 GB that mostly stays on GPU with a small spill into RAM.

6. Apple Silicon: MLX preference

When hw.is_apple_silicon is true:

  • Both GGUF and MLX repos are fetched from HF (separate API calls).
  • MLX models get +50 to their score (M-series runs MLX much faster than GGUF).
  • GGUF models get -5 so MLX wins ties.
  • MLX-format models on non-Apple platforms are excluded entirely (they wouldn't run anyway).
  • The suggestion panel adds an MLX-LM section with mlx_lm.generate commands.

Quantization guide

The advisor labels every recommendation with a quality bucket and explains each format. Run python llm_advisor.py quants for the full table; here's the short version:

Quant     Bits/wt  Quality        Trade-off
F16/BF16  16.0     lossless       Reference quality, very large
Q8_0      8.5      near-lossless  ~50% of FP16, best practical quality
Q6_K      6.6      very high      Almost lossless, big
Q5_K_M    5.7      high           Great quality, moderate size
Q4_K_M    4.9      balanced       ★ Sweet spot; recommended default
IQ4_XS    4.3      balanced       Imatrix Q4; slightly smaller, similar quality
Q3_K_M    3.8      moderate       Noticeable quality loss; for tight VRAM
IQ3_M     3.7      moderate       Imatrix Q3; better than Q3_K_S at similar size
Q2_K      3.0      low            Severe quality loss; last resort
IQ2_*     2.1–2.7  very low       Heavy compression; for huge models on small VRAM
IQ1_*     1.6–1.8  extreme low    Experimental / research
TQ1_0     ~1.6     experimental   Ternary quantization; newer, evolving
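
The bits/weight column gives a quick back-of-the-envelope size: parameters × bits per weight ÷ 8. The helper below is a hypothetical illustration of that arithmetic only; it ignores KV cache and runtime overhead, and the advisor itself relies on real file sizes rather than this estimate.

```python
def estimate_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB (10^9 bytes): params x bits / 8."""
    return params_billion * bits_per_weight / 8


# Example: a 7B model at Q4_K_M (4.9 bits/wt) versus Q8_0 (8.5 bits/wt).
q4_gb = estimate_size_gb(7, 4.9)
q8_gb = estimate_size_gb(7, 8.5)
print(f"Q4_K_M ~ {q4_gb:.1f} GB, Q8_0 ~ {q8_gb:.1f} GB")
```

Real files often deviate from this figure because many quant formats keep some tensors at higher precision.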

Apple-MLX repos use a different convention: 4bit, 8bit, bf16, fp16. The advisor parses these from the repo name (mlx-community/...-4bit).


Architecture

.
├── hardware.py        # CPU / RAM / GPU detection (cross-platform)
├── runtimes.py        # Ollama + LM Studio detection (binary, API, local models)
├── catalog.py         # Live fetching: HF API + Ollama Library + real GGUF/safetensors sizes
├── quants.py          # Quantization reference (bits/weight, quality bucket, trade-off)
├── recommender.py     # Use-case classification, scoring, variant picker, MLX preference
├── llm_advisor.py     # CLI entry point (argparse + rich)
├── check_all.py       # One-shot wrapper for the `all` sub-command
├── requirements.txt
└── README.md

Everything is plain requests + psutil + rich, plus numpy (all listed in requirements.txt): no heavy ML frameworks, no HuggingFace SDK, no Ollama SDK. The whole thing works from a clean Python install in a few seconds.


Limitations

  • AMD/Intel GPUs on Windows — VRAM detection is best-effort. The Win32 AdapterRAM value is capped at 4 GB by the 32-bit type. NVIDIA and Apple Silicon are reliable.
  • Ollama Library scraping — no official API, so HTML scraping. Resilient but could break if the page is restructured. The 10-min cache softens this.
  • CPU benchmark — currently a FP32 matmul, not a real tokens/s test. It produces a coarse tier (mid, high, extreme) only.
  • Quant-quality labels are static — they reflect the typical bits/weight and llama.cpp release notes, not per-model evals. A Q4_K_M on a tiny model degrades more than on a huge model; the label doesn't capture that.
  • Param count from name — when neither the repo name nor HF tags carry a size like 7b, 30b-a3b, the advisor falls back to 7B as a default. The real-size enrichment step still corrects the displayed size, but the Params column may be off until HF tags improve.
  • MLX file sizing: assumes one quant variant per repo (the mlx-community convention). For repos that keep multiple variants in one folder, the file sizes of all variants are summed, so the reported size is too large.
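
The name-based parameter fallback described in the list above can be sketched as follows. The regex is an assumption about how sizes like 7b or 30b-a3b might be matched, not the advisor's actual pattern; the function also reports whether the value was parsed or defaulted.

```python
import re

# Matches sizes such as "7b" or "1.5B"; in MoE names like "30b-a3b" the
# first (total) size wins, and "a3b" is skipped because a letter precedes it.
PARAM_RE = re.compile(r"(?<![a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])", re.IGNORECASE)
DEFAULT_PARAMS_B = 7.0  # fallback named in the limitation above


def guess_params_b(repo_name: str) -> tuple[float, bool]:
    """Return (billions of parameters, whether it was parsed from the name)."""
    match = PARAM_RE.search(repo_name)
    if match:
        return float(match.group(1)), True
    return DEFAULT_PARAMS_B, False
```

Returning the "parsed" flag alongside the number would let the UI mark defaulted values, so a 7B shown for a name without a size is not mistaken for a real count.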

Roadmap

Things on the wishlist:

  • Real tokens/s benchmark via the Ollama HTTP API when present
  • Ollama tag picker that prefers :Q4_K_M over :latest deterministically
  • Eval-leaderboard-driven scoring (EvalPlus, BFCL, MMLU-Pro, IFEval)
  • LM Studio config-file scan to resolve the user's actual model dir
  • Live token/s estimate per model based on bandwidth × layer count
  • Optional JSON output for scripting
  • serve mode that exposes recommendations via a tiny HTTP API

PRs welcome.


Contributing

  1. Fork & clone
  2. pip install -r requirements.txt
  3. Make changes — keep modules focused (hardware.py only deals with hardware, etc.)
  4. Smoke-test with python check_all.py
  5. Open a PR

Style: type hints everywhere, no comments that just restate the code, dataclasses for structured data, rich for output.


License

MIT — see LICENSE (or treat as MIT until a file is added).


Credits

  • HuggingFace — model catalog API
  • Ollama — Library + local runtime
  • LM Studio — local runtime + native REST API
  • llama.cpp — GGUF format & quantization research
  • MLX — Apple's array framework for ML on Apple Silicon
  • Rich — the gorgeous terminal output
