glq-bench: generalized model-performance benchmarking toolkit #10

Open
cnygaard wants to merge 1 commit into main from feat/glq-bench

@cnygaard

Owner

What

Adds glq-bench, a model-agnostic benchmarking toolkit (glq/bench/ package + glq-bench console script + [bench] extra). It runs a benchmark on any model/quant method, captures full reproducibility provenance, appends a structured record to a centralized git results repo, and produces comparison tables, plots, and a weighted %-of-bf16 quality index.

Quant method is read from the checkpoint config, so GLQ / NVFP4 (modelopt) / GPTQ / AWQ / bf16 all compare against the shared base model.
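
As an illustration of reading the quant method from the checkpoint config, here is a minimal sketch. It assumes a Hugging Face style `config.json` that carries a `quantization_config.quant_method` field; the function name `detect_quant_method` and the fallback to `"bf16"` when no quantization block is present are assumptions made for this example, not the PR's actual hfmeta code.

```python
import json
from pathlib import Path


def detect_quant_method(checkpoint_dir: str) -> str:
    """Return the quant method declared in config.json.

    Hypothetical helper: falls back to "bf16" when the checkpoint
    declares no quantization_config (an assumption of this sketch).
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Missing or null quantization_config is treated as an unquantized baseline.
    quant_cfg = config.get("quantization_config") or {}
    return str(quant_cfg.get("quant_method", "bf16")).lower()
```

Some exporters may store quantization details in a separate file instead of `config.json`; a fuller implementation would need to check for that case as well.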

Layout (glq/bench/)

| module | role |
| --- | --- |
| record | schema-v1 BenchRecord dataclasses + JSONL (forward-compatible) |
| provenance | env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha |
| hfmeta | base/parent model, quant method, bpw min/max, HF links, disk size |
| runtime | vLLM load; captures the exact serving command + weights-at-load (parses "Model loading took X GiB" via VLLM_LOGGING_CONFIG_PATH) |
| tasks/ | mmlu_pro, aime_2024/25/26 (thinking, avg@k), wikitext2_ppl, throughput |
| runner | orchestrator; quality tasks share one engine, hf/throughput run separately |
| store | standalone git results repo (records/<base>/<model>.jsonl), idempotent |
| index / report / plot | %-of-bf16 index, markdown tables, matplotlib PNGs |
| cli | glq-bench run/index/report/compare/leaderboard/plot/pull/push |
| progress | timestamped heartbeat for long runs (visible even mid-generation) |
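
The weights-at-load capture described for the runtime module can be sketched as a small log parser. This is only an illustration: it assumes the captured vLLM log text contains a line of the form "Model loading took X GiB" as quoted above, and the name `parse_weights_gib` is hypothetical. How the log text is captured (for example through a logging config pointed to by VLLM_LOGGING_CONFIG_PATH) is left out.

```python
import re
from typing import Optional

# Matches the phrase quoted in the PR description, e.g. "Model loading took 0.67 GiB".
_LOAD_RE = re.compile(r"Model loading took\s+(\d+(?:\.\d+)?)\s*GiB")


def parse_weights_gib(log_text: str) -> Optional[float]:
    """Return the first reported weights-at-load size in GiB, or None if absent."""
    match = _LOAD_RE.search(log_text)
    return float(match.group(1)) if match else None
```

Returning `None` rather than raising lets a caller record the run even when the log line is missing or its wording changes between vLLM versions.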

Also: pyproject.toml [bench] extra + glq-bench script; scripts/seed_bench_records.py (seeds gemma-4 31B/12B/E4B numbers).
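
Seeding and re-running benchmarks both rely on the store being idempotent. One way to get that property for a `records/<base>/<model>.jsonl` layout is sketched below. The function name `append_record`, the use of the whole serialized record as the deduplication key, and the assumption that `base` and `model` are already filesystem-safe names are all choices made for this example, not the PR's actual store implementation.

```python
import json
from pathlib import Path
from typing import Any, Dict


def append_record(repo_root: str, base: str, model: str, record: Dict[str, Any]) -> bool:
    """Append one record as a JSONL line; return False if an identical line exists.

    Hypothetical helper: identity is the canonical (sorted-key) JSON string.
    """
    path = Path(repo_root) / "records" / base / f"{model}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps(record, sort_keys=True)
    existing = set(path.read_text(encoding="utf-8").splitlines()) if path.exists() else set()
    if line in existing:
        return False  # already stored, so a repeated call is a no-op
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return True
```

Calling it twice with the same record leaves a single line in the file, which keeps repeated seeding or repeated pushes from duplicating results.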

Index

index(M) = Σₜ wₜ·(scoreₜ(M) / scoreₜ(bf16 of same base)) / Σₜ wₜ, summed over the standardized quality tasks t. For perplexity, where lower is better, the ratio is inverted. bf16 is 100% by definition; tok/s and VRAM-at-load are shown beside the index, not folded into it. The n column shows how many tasks each index averages.
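
The index arithmetic can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's index module: the function name `quality_index`, the `LOWER_IS_BETTER` set, and the choice to skip tasks missing from either side are hypothetical, and "perplexity inverts" is read here as using bf16_ppl / model_ppl as the retention.

```python
from typing import Dict, Optional, Tuple

# Assumption: tasks where a lower score is better, so the ratio is flipped.
LOWER_IS_BETTER = {"wikitext2_ppl"}


def quality_index(
    model_scores: Dict[str, float],
    bf16_scores: Dict[str, float],
    weights: Dict[str, float],
) -> Tuple[Optional[float], int]:
    """Return (weighted %-of-bf16 index, number of tasks averaged)."""
    weighted_sum = 0.0
    weight_total = 0.0
    n = 0
    for task, w in weights.items():
        if task not in model_scores or task not in bf16_scores:
            continue  # only tasks measured on both the model and its bf16 base count
        m, b = model_scores[task], bf16_scores[task]
        if m <= 0 or b <= 0:
            continue  # a ratio is undefined for non-positive scores
        retention = b / m if task in LOWER_IS_BETTER else m / b
        weighted_sum += w * retention
        weight_total += w
        n += 1
    if weight_total == 0:
        return None, 0
    return 100.0 * weighted_sum / weight_total, n
```

For example, with equal weights, an accuracy of 0.45 against a bf16 accuracy of 0.50 (retention 0.9) and a perplexity of 12.5 against a bf16 perplexity of 10.0 (retention 0.8) give an index of about 85 with n = 2.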

Testing

  • 29 CPU-only unit tests (tests/test_bench_*.py) — record (de)serialize, index math, hfmeta derivation, report/store roundtrip, progress heartbeat. All green.
  • GPU-validated on RTX PRO 6000 (vLLM 0.23): SmolLM2-360M GLQ vs bf16 across mmlu_pro / wikitext2_ppl / throughput — real records with provenance + load-mem + tok/s; index/compare/report/plot produce correct output.

Results repo: cnygaard/glq-benchmarks.

…oolkit

Run benchmarks on any model + quant method, capture full reproducibility
provenance, append structured records to a centralized git results repo, and
produce comparison tables, plots, and a weighted %-of-bf16 quality index. Quant
method is read from the checkpoint config, so GLQ / NVFP4(modelopt) / GPTQ / AWQ /
bf16 all compare against the shared base model.

glq/bench/:
  record       schema-v1 BenchRecord dataclasses + JSONL (forward-compatible)
  provenance   env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha
  hfmeta       base/parent model, quant method, bpw min/max, HF links, disk size
  runtime      vLLM load; captures the serving command + weights-at-load (parses
               "Model loading took X GiB" via VLLM_LOGGING_CONFIG_PATH on stdout)
  tasks/       mmlu_pro, aime_2024/25/26 (thinking, avg@k), wikitext2_ppl, throughput
  runner       orchestrator; quality tasks share one engine, hf/throughput separate
  store        standalone git results repo (records/<base>/<model>.jsonl), idempotent
  index        Sigma w*retention vs same-base bf16; report (md tables); plot (PNGs)
  cli          glq-bench run/index/report/compare/leaderboard/plot/pull/push
  progress     timestamped heartbeat for long runs (visible even mid-generation)

pyproject: [bench] extra + glq-bench console script.
scripts/seed_bench_records.py: seeds this session's gemma-4 31B/12B/E4B numbers.

29 CPU unit tests (tests/test_bench_*.py); GPU-validated on RTX PRO 6000 (SmolLM2-360M
GLQ vs bf16 across mmlu_pro/wikitext2_ppl/throughput). Results repo: cnygaard/glq-benchmarks.