glq-bench: generalized model-performance benchmarking toolkit by cnygaard · Pull Request #10 · cnygaard/glq

cnygaard · 2026-06-20T14:17:27Z

What

Adds glq-bench, a model-agnostic benchmarking toolkit (glq/bench/ package + glq-bench console script + [bench] extra). It runs a benchmark on any model/quant method, captures full reproducibility provenance, appends a structured record to a centralized git results repo, and produces comparison tables, plots, and a weighted %-of-bf16 quality index.

Quant method is read from the checkpoint config, so GLQ / NVFP4 (modelopt) / GPTQ / AWQ / bf16 all compare against the shared base model.
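As a rough illustration of reading the quant method from a checkpoint config, the sketch below looks up `quantization_config.quant_method` in a Hugging Face style `config.json`. The function names are hypothetical, and both the exact key glq-bench reads and the fallback to `"bf16"` for unquantized checkpoints are assumptions, not taken from this PR.

```python
import json
from pathlib import Path


def detect_quant_method(config: dict) -> str:
    """Return the quant method named in a checkpoint config dict.

    Assumption: the method lives under quantization_config.quant_method,
    and a checkpoint without that key is treated as the bf16 baseline.
    """
    qcfg = config.get("quantization_config") or {}
    method = qcfg.get("quant_method")
    return str(method).lower() if method else "bf16"


def detect_from_dir(model_dir: str) -> str:
    """Load config.json from a local checkpoint directory and classify it."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return detect_quant_method(config)
```

Because the label comes from the checkpoint rather than from a command-line flag, a GPTQ and an AWQ build of the same base model would be grouped under that base without extra bookkeeping.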

Layout (`glq/bench/`)

| module | role |
| --- | --- |
| `record` | schema-v1 `BenchRecord` dataclasses + JSONL (forward-compatible) |
| `provenance` | env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha |
| `hfmeta` | base/parent model, quant method, bpw min/max, HF links, disk size |
| `runtime` | vLLM load; captures the exact serving command + weights-at-load (parses "Model loading took X GiB" via `VLLM_LOGGING_CONFIG_PATH`) |
| `tasks/` | `mmlu_pro`, `aime_2024/25/26` (thinking, avg@k), `wikitext2_ppl`, `throughput` |
| `runner` | orchestrator; quality tasks share one engine, hf/throughput run separately |
| `store` | standalone git results repo (`records/<base>/<model>.jsonl`), idempotent |
| `index` / `report` / `plot` | %-of-bf16 index, markdown tables, matplotlib PNGs |
| `cli` | `glq-bench run/index/report/compare/leaderboard/plot/pull/push` |
| `progress` | timestamped heartbeat for long runs (visible even mid-generation) |
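To make the `record` and `store` rows concrete, here is a minimal sketch of a JSONL-backed record with an idempotent append and a forward-compatible load. Everything in it is hypothetical: `BenchRecordSketch` and its fields are not the real `BenchRecord` schema, name sanitizing with `__` is invented, and "idempotent" is assumed to mean that an identical record is never written twice.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path


@dataclass
class BenchRecordSketch:
    # Hypothetical fields; the actual schema-v1 BenchRecord may differ.
    schema_version: int
    base_model: str
    model: str
    quant_method: str
    task: str
    score: float
    provenance: dict = field(default_factory=dict)


def record_path(root: Path, rec: BenchRecordSketch) -> Path:
    """Map a record to records/<base>/<model>.jsonl under the repo root."""
    def safe(name: str) -> str:
        return name.replace("/", "__")  # assumption: flatten org/name ids
    return root / "records" / safe(rec.base_model) / f"{safe(rec.model)}.jsonl"


def append_record(root: Path, rec: BenchRecordSketch) -> bool:
    """Append one JSON line; return False if the same line already exists."""
    path = record_path(root, rec)
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps(asdict(rec), sort_keys=True)
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    with path.open("a") as fh:
        fh.write(line + "\n")
    return True


def load_records(path: Path) -> list:
    """Read records, dropping unknown keys so newer files still load."""
    known = {f.name for f in fields(BenchRecordSketch)}
    records = []
    for line in path.read_text().splitlines():
        if line.strip():
            raw = json.loads(line)
            records.append(BenchRecordSketch(**{k: v for k, v in raw.items() if k in known}))
    return records
```

One JSON object per line keeps git diffs append-only, which suits a results repo that several machines push to.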

Also: pyproject.toml [bench] extra + glq-bench script; scripts/seed_bench_records.py (seeds gemma-4 31B/12B/E4B numbers).

Index

index(M) = Σ wₜ·(scoreₜ(M) / scoreₜ(bf16 of same base)) / Σ wₜ, summed over the standardized quality tasks. The perplexity ratio is inverted, since lower perplexity is better. bf16 = 100% by definition; tok/s and VRAM-at-load are shown beside the index, not folded into it. The n column shows how many tasks each index averages.
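The formula can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `glq/bench/index`: the function names are hypothetical, "perplexity inverts" is taken to mean using bf16/model instead of model/bf16, and tasks missing from either side are assumed to be skipped, with the weights renormalized over the tasks that remain.

```python
import math


def retention(model_score: float, bf16_score: float, lower_is_better: bool) -> float:
    """Fraction of bf16 quality retained on one task."""
    if lower_is_better:
        # Assumption: perplexity-style metrics use the inverted ratio.
        return bf16_score / model_score
    return model_score / bf16_score


def quality_index(model_scores, bf16_scores, weights,
                  lower_is_better=frozenset({"wikitext2_ppl"})):
    """Return (index in percent, n) where n counts the tasks averaged."""
    num = 0.0
    den = 0.0
    n = 0
    for task, w in weights.items():
        if task not in model_scores or task not in bf16_scores:
            continue  # assumption: skip tasks missing on either side
        r = retention(model_scores[task], bf16_scores[task], task in lower_is_better)
        num += w * r
        den += w
        n += 1
    if den == 0:
        return math.nan, 0
    return 100.0 * num / den, n
```

With equal weights, a model scoring 0.45 on `mmlu_pro` against a bf16 score of 0.50, and a perplexity of 12.5 against 10.0, retains 0.9 and 0.8 respectively, so its index is about 85% with n = 2. Feeding the bf16 scores in as the model gives 100%.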

Testing

29 CPU-only unit tests (tests/test_bench_*.py) — record (de)serialize, index math, hfmeta derivation, report/store roundtrip, progress heartbeat. All green.
GPU-validated on RTX PRO 6000 (vLLM 0.23): SmolLM2-360M GLQ vs bf16 across mmlu_pro / wikitext2_ppl / throughput — real records with provenance + load-mem + tok/s; index/compare/report/plot produce correct output.

Results repo: cnygaard/glq-benchmarks.

…oolkit

Run benchmarks on any model + quant method, capture full reproducibility provenance, append structured records to a centralized git results repo, and produce comparison tables, plots, and a weighted %-of-bf16 quality index.

Quant method is read from the checkpoint config, so GLQ / NVFP4(modelopt) / GPTQ / AWQ / bf16 all compare against the shared base model.

glq/bench/:

- record: schema-v1 BenchRecord dataclasses + JSONL (forward-compatible)
- provenance: env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha
- hfmeta: base/parent model, quant method, bpw min/max, HF links, disk size
- runtime: vLLM load; captures the serving command + weights-at-load (parses "Model loading took X GiB" via VLLM_LOGGING_CONFIG_PATH on stdout)
- tasks/: mmlu_pro, aime_2024/25/26 (thinking, avg@k), wikitext2_ppl, throughput
- runner: orchestrator; quality tasks share one engine, hf/throughput separate
- store: standalone git results repo (records/<base>/<model>.jsonl), idempotent
- index: Sigma w*retention vs same-base bf16; report (md tables); plot (PNGs)
- cli: glq-bench run/index/report/compare/leaderboard/plot/pull/push
- progress: timestamped heartbeat for long runs (visible even mid-generation)

pyproject: [bench] extra + glq-bench console script.
scripts/seed_bench_records.py: seeds this session's gemma-4 31B/12B/E4B numbers.

29 CPU unit tests (tests/test_bench_*.py); GPU-validated on RTX PRO 6000 (SmolLM2-360M GLQ vs bf16 across mmlu_pro/wikitext2_ppl/throughput).

Results repo: cnygaard/glq-benchmarks.

cnygaard wants to merge 1 commit into main from feat/glq-bench
