glq-bench: generalized model-performance benchmarking toolkit #10
Open
cnygaard wants to merge 1 commit into
…oolkit
Run benchmarks on any model + quant method, capture full reproducibility
provenance, append structured records to a centralized git results repo, and
produce comparison tables, plots, and a weighted %-of-bf16 quality index. Quant
method is read from the checkpoint config, so GLQ / NVFP4(modelopt) / GPTQ / AWQ /
bf16 all compare against the shared base model.
glq/bench/:
  record      schema-v1 BenchRecord dataclasses + JSONL (forward-compatible)
  provenance  env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha
  hfmeta      base/parent model, quant method, bpw min/max, HF links, disk size
  runtime     vLLM load; captures the serving command + weights-at-load (parses
              "Model loading took X GiB" via VLLM_LOGGING_CONFIG_PATH on stdout)
  tasks/      mmlu_pro, aime_2024/25/26 (thinking, avg@k), wikitext2_ppl, throughput
  runner      orchestrator; quality tasks share one engine, hf/throughput separate
  store       standalone git results repo (records/<base>/<model>.jsonl), idempotent
  index       Sigma w*retention vs same-base bf16; report (md tables); plot (PNGs)
  cli         glq-bench run/index/report/compare/leaderboard/plot/pull/push
  progress    timestamped heartbeat for long runs (visible even mid-generation)
pyproject: [bench] extra + glq-bench console script.
scripts/seed_bench_records.py: seeds this session's gemma-4 31B/12B/E4B numbers.
29 CPU unit tests (tests/test_bench_*.py); GPU-validated on RTX PRO 6000 (SmolLM2-360M
GLQ vs bf16 across mmlu_pro/wikitext2_ppl/throughput). Results repo: cnygaard/glq-benchmarks.
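The runtime entry above mentions parsing "Model loading took X GiB" from vLLM's log output to record weights-at-load. A minimal sketch of that parsing step follows, assuming the log line has the shape quoted in the commit message. The function name parse_load_gib and the regex are illustrative assumptions, not the code in this PR.

```python
import re
from typing import Optional

# Assumed shape of the vLLM log line, taken from the commit message:
#   "... Model loading took 0.68 GiB ..."
_LOAD_RE = re.compile(r"Model loading took\s+([0-9]+(?:\.[0-9]+)?)\s*GiB")


def parse_load_gib(log_text: str) -> Optional[float]:
    """Return the last reported weights-at-load size in GiB, or None if absent."""
    matches = _LOAD_RE.findall(log_text)
    # Use the last match so a reload later in the log overrides earlier values.
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    # Hypothetical log excerpt for illustration only.
    sample = "INFO loader: Model loading took 0.6748 GiB and 1.2 seconds\n"
    print(parse_load_gib(sample))
```

Returning None instead of raising lets a caller store the record with the field left empty when the line is missing.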
What
Adds glq-bench, a model-agnostic benchmarking toolkit (glq/bench/ package + glq-bench console script + [bench] extra). Run a benchmark on any model/quant method, capture full reproducibility provenance, append a structured record to a centralized git results repo, and produce comparison tables, plots, and a weighted %-of-bf16 quality index.
Quant method is read from the checkpoint config, so GLQ / NVFP4 (modelopt) / GPTQ / AWQ / bf16 all compare against the shared base model.
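To illustrate what reading the quant method from the checkpoint config can look like, here is a minimal sketch. It assumes a Hugging Face style config.json where quantized checkpoints carry a quantization_config block with a quant_method key, and it falls back to the declared torch_dtype otherwise. The function name detect_quant_method and the fallback rule are assumptions for illustration; the PR's hfmeta module may derive this differently.

```python
import json
import tempfile
from pathlib import Path


def detect_quant_method(checkpoint_dir: str) -> str:
    """Return quantization_config.quant_method from config.json, else the dtype."""
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config") or {}
    method = quant_cfg.get("quant_method")
    if method:
        return str(method).lower()
    # Assumption: an unquantized checkpoint is labeled by its declared dtype.
    return str(config.get("torch_dtype", "unknown"))


if __name__ == "__main__":
    # Hypothetical checkpoint directory built on the fly for the demo.
    with tempfile.TemporaryDirectory() as ckpt:
        cfg = {"torch_dtype": "bfloat16", "quantization_config": {"quant_method": "gptq"}}
        (Path(ckpt) / "config.json").write_text(json.dumps(cfg))
        print(detect_quant_method(ckpt))
```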
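Layout (glq/bench/)
  record             schema-v1 BenchRecord dataclasses + JSONL (forward-compatible)
  provenance         env (torch/vllm/transformers/sglang/triton/cuda) + GPU + glq sha
  hfmeta             base/parent model, quant method, bpw min/max, HF links, disk size
  runtime            vLLM load; captures the serving command + weights-at-load (parses "Model loading took X GiB" via VLLM_LOGGING_CONFIG_PATH)
  tasks/             mmlu_pro, aime_2024/25/26 (thinking, avg@k), wikitext2_ppl, throughput
  runner             orchestrator; quality tasks share one engine, hf/throughput separate
  store              standalone git results repo (records/<base>/<model>.jsonl), idempotent
  index/report/plot  Σ w·retention vs same-base bf16; md tables; PNGs
  cli                glq-bench run/index/report/compare/leaderboard/plot/pull/push
  progress           timestamped heartbeat for long runs (visible even mid-generation)
Also: pyproject.toml [bench] extra + glq-bench script; scripts/seed_bench_records.py (seeds gemma-4 31B/12B/E4B numbers).
Index
index(M) = Σ wₜ·(scoreₜ(M) / scoreₜ(bf16 of same base)) / Σ wₜ over standardized quality tasks (perplexity, where lower is better, uses the inverted ratio). bf16 = 100%; tok/s + VRAM-at-load are shown beside it, not folded in. The n column flags how many tasks each index averages.
Testing
29 CPU unit tests (tests/test_bench_*.py): record (de)serialize, index math, hfmeta derivation, report/store roundtrip, progress heartbeat. All green.
GPU-validated on RTX PRO 6000 (SmolLM2-360M GLQ vs bf16) across mmlu_pro/wikitext2_ppl/throughput: real records with provenance + load-mem + tok/s; index/compare/report/plot produce correct output.
Results repo: cnygaard/glq-benchmarks.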
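The index formula from the Index section can be sketched as a small function. This is an illustration under stated assumptions, not the PR's implementation: the function name quality_index, the rule of skipping tasks missing on either side, and the example weights and scores are all hypothetical. Only the weighted-retention formula and the inverted ratio for perplexity come from the description.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def quality_index(
    model_scores: Dict[str, float],
    bf16_scores: Dict[str, float],
    weights: Dict[str, float],
    lower_is_better: FrozenSet[str] = frozenset({"wikitext2_ppl"}),
) -> Tuple[Optional[float], int]:
    """Return (index in percent of bf16, n tasks averaged)."""
    weighted_sum = 0.0
    weight_total = 0.0
    n_tasks = 0
    for task, weight in weights.items():
        # Assumption: a task counts only if both the model and bf16 have a score.
        if task not in model_scores or task not in bf16_scores:
            continue
        model, base = model_scores[task], bf16_scores[task]
        if model <= 0 or base <= 0:
            continue
        # Perplexity inverts: a lower value than bf16 means retention above 1.
        retention = base / model if task in lower_is_better else model / base
        weighted_sum += weight * retention
        weight_total += weight
        n_tasks += 1
    if weight_total == 0:
        return None, 0
    return 100.0 * weighted_sum / weight_total, n_tasks


if __name__ == "__main__":
    # Made-up numbers for illustration; not results from this PR.
    weights = {"mmlu_pro": 1.0, "wikitext2_ppl": 1.0, "aime_2024": 1.0}
    bf16 = {"mmlu_pro": 0.50, "wikitext2_ppl": 8.0}
    quant = {"mmlu_pro": 0.45, "wikitext2_ppl": 10.0}
    index, n = quality_index(quant, bf16, weights)
    print(f"index={index:.1f}% over n={n} tasks")
```

Passing the bf16 scores as both arguments yields 100% by construction, which matches the "bf16 = 100%" convention, and the returned task count corresponds to the n column.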