Worker: E1 (ensemble: model frameworks; integral validator of the spine pattern)
Mnemonic IDs blocking this: D1 (parity harness — provides per-kernel baseline)
Mnemonic IDs blocked by this: E2 (candle-fork ONNX/ViT/Whisper parity follows the same shape)
Why
crates/burn/ in this repo vendors AdaWorldAPI/burn (pinned at rev 9b2b671, see crates/burn/Cargo.toml) and rewires the burn-ndarray backend through ndarray = { path = "../.." } — gemm, BF16 matmul, AMX dispatch all come from this repo's SIMD + HPC stack now. The standing claim is: burn-fork can reliably load and execute any GGUF, because everything beneath it is the spine.
That claim is currently asserted at the per-kernel level (D1 parity harness) but never integrated end-to-end. Per-kernel parity is necessary but not sufficient — kernel composition, KV-cache layout, RoPE numerics, attention masking, and quantization dequant paths all interact, and BF16 mantissa-exact at one kernel does not guarantee BF16-tolerant logits at token N=50.
This issue is the integral of D1: full-stack GGUF inference, run identically on burn-fork (using ndarray kernels) and on llama.cpp reference, top-N logits compared at every token position. If the parity holds, the spine pattern is validated for real production models. If not, the BF16 mantissa-exact claim has a hole and we know exactly where the model-level numerics diverge from per-kernel numerics.
Clinical relevance. GGUF generative models (Llama 3, Qwen 2.5/3.5) are the generative-reasoning side of the MedCare-rs stack: anamnesis summarization, patient-facing translation (incl. German), clinical reasoning chains. They must produce stable, reproducible outputs across SPR-AMX cloud and NEON-Pi edge. A logit drift at token 30 of a 50-token discharge summary is a clinical safety incident, not a numerics curiosity.
What
A nightly CI job that runs identical prompts through burn-fork-via-ndarray and llama.cpp-reference, and asserts top-5 logit parity within a documented BF16 tolerance band, on pinned weights, pinned reference, pinned ISA paths.
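A rough, non-authoritative sketch of what that job's outer loop might look like in Rust. `run_burn_fork`, `run_llamacpp_reference`, the `Top5` record, and the file names are all illustrative placeholders, not existing APIs in this repo; only the data flow (same prompt into both engines, top-5 logits out per position, compared token by token) is the point.

```rust
/// Top-5 (token id, logit) pairs for one generated token position.
#[derive(Debug, Clone)]
pub struct Top5 {
    pub token_ids: [u32; 5],
    pub logits: [f32; 5],
}

/// Greedy-decode `n_tokens` positions through burn-fork-via-ndarray.
fn run_burn_fork(_gguf: &str, _prompt: &str, _n_tokens: usize) -> Vec<Top5> {
    Vec::new() // placeholder: the real harness drives the vendored burn backend
}

/// Same prompt through the pinned llama.cpp reference build.
fn run_llamacpp_reference(_gguf: &str, _prompt: &str, _n_tokens: usize) -> Vec<Top5> {
    Vec::new() // placeholder: the real harness shells out to the pinned llama-cli
}

fn main() {
    // Placeholder file names; real paths come from the pinned-weights manifest.
    let ggufs = ["llama-3-8b.Q4_K_M.gguf", "llama-3-8b.Q8_0.gguf"];
    let prompts = ["What is the capital of France?"]; // the frozen 10-prompt suite in practice

    for gguf in ggufs {
        for prompt in prompts {
            let burn = run_burn_fork(gguf, prompt, 50);
            let reference = run_llamacpp_reference(gguf, prompt, 50);
            assert_eq!(burn.len(), reference.len(), "token count mismatch for {gguf}");
            // Per-position checks (top-1 exact, top-5 set, logit deltas within
            // the band) are sketched under "Test suite" below.
        }
    }
}
```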
Test models (CI smoke + production-representative)
| Model | Quantizations | Approx size (Q4_K_M / Q8_0) |
| --- | --- | --- |
| Llama 3 8B | Q4_K_M, Q8_0 | ~4.7 GB / ~8.5 GB |
| Qwen 2.5 7B | Q4_K_M, Q8_0 | ~4.4 GB / ~7.6 GB |
Two quantizations per model exercise different precision regimes — Q4_K_M (k-quant block-scaled, lossy, common in deployment) and Q8_0 (8-bit, near-lossless, the parity ceiling).
Pinning (must be in tests/fixtures/gguf_parity.toml or equivalent)
GGUF weights: SHA-256 of each .gguf file, source URL, license note
llama.cpp reference: exact upstream commit SHA (e.g. ggerganov/llama.cpp@), build flags, and the llama-cli invocation used to produce the reference logits
Tokenizer: SHA-256 of the tokenizer.json / merges baked into each GGUF's metadata
Prompts: the 10-prompt suite frozen in-repo (no live downloads)
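As a sketch only (field names and layout are not settled, and the llama-cli flags shown are illustrative), the fixture could be a small TOML file loaded through serde. The snippet below inlines a hypothetical gguf_parity.toml entry so it is self-contained; it needs `serde` (with the derive feature) and `toml` as dev-dependencies.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ParityFixture {
    llama_cpp_ref: String,        // exact upstream llama.cpp commit SHA
    llama_cli_invocation: String, // command used to produce the reference logits
    models: Vec<PinnedModel>,
}

#[derive(Debug, Deserialize)]
struct PinnedModel {
    name: String,
    quant: String,
    gguf_sha256: String,
    tokenizer_sha256: String,
    source_url: String,
    license: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical gguf_parity.toml content, inlined here so the sketch runs standalone.
    let raw = r#"
        llama_cpp_ref = "<pinned llama.cpp commit SHA>"
        llama_cli_invocation = "llama-cli -m <model.gguf> -p <prompt> --temp 0 -n 50"

        [[models]]
        name = "Llama 3 8B"
        quant = "Q4_K_M"
        gguf_sha256 = "<sha256 of the .gguf>"
        tokenizer_sha256 = "<sha256 of tokenizer metadata>"
        source_url = "<pinned download URL>"
        license = "<license note>"
    "#;
    let fixture: ParityFixture = toml::from_str(raw)?;
    println!("{} pinned model(s)", fixture.models.len());
    Ok(())
}
```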
Test suite (10 prompts × 50 generated tokens × top-5 logits per position)
| Bucket | Prompts | Why |
| --- | --- | --- |
| Short factual | 2 | "Capital of France?" — fast smoke, deterministic |
| Long generative | 2 | 200-token continuation — accumulates drift |
| Code | 2 | Rust + Python snippet completion — different token distribution |
| Multilingual | 2 | German + one non-Latin (e.g. Mandarin or Arabic) — clinical-translation surface |
| Reasoning | 2 | Short chain-of-thought — exposes attention-mask + KV-cache bugs |
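One way the frozen suite could live in-repo, as a hedged sketch: a bucket enum plus a const array, with placeholder prompt texts (the real prompts get frozen at design time).

```rust
/// Prompt buckets from the table above.
#[derive(Debug, Clone, Copy)]
pub enum Bucket {
    ShortFactual,
    LongGenerative,
    Code,
    Multilingual,
    Reasoning,
}

/// The 10-prompt suite, frozen in-repo. Texts here are placeholders only.
pub const PROMPT_SUITE: &[(Bucket, &str)] = &[
    (Bucket::ShortFactual, "What is the capital of France?"),
    (Bucket::ShortFactual, "<second short factual prompt>"),
    (Bucket::LongGenerative, "<200-token continuation seed>"),
    (Bucket::LongGenerative, "<second long generative seed>"),
    (Bucket::Code, "<Rust snippet completion>"),
    (Bucket::Code, "<Python snippet completion>"),
    (Bucket::Multilingual, "<German clinical-translation prompt>"),
    (Bucket::Multilingual, "<non-Latin-script prompt>"),
    (Bucket::Reasoning, "<short chain-of-thought prompt>"),
    (Bucket::Reasoning, "<second reasoning prompt>"),
];
```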
For each (model, quant, prompt, position), record the top-5 token IDs and their logit values from both engines, then assert:
top-1 token ID matches exactly (greedy stability)
top-5 token ID set matches (rank-stability allowed within set)
per-token logit deltas within BF16 tolerance band from D1, per quantization — Q4_K_M will be looser than Q8_0; tolerances are documented, not guessed
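A minimal sketch of that per-position check, assuming a `Top5` record like the one in the earlier outer-loop sketch and a per-quantization tolerance read from the fixture rather than hard-coded:

```rust
use std::collections::HashMap;

/// Top-5 (token id, logit) pairs for one generated position, from one engine.
pub struct Top5 {
    pub token_ids: [u32; 5],
    pub logits: [f32; 5],
}

/// Per-position parity check. `tol` is the per-quantization band from the
/// fixture (traceable to D1's per-kernel baseline), not a guessed constant.
pub fn assert_position_parity(burn: &Top5, reference: &Top5, tol: f32) -> Result<(), String> {
    // 1. Greedy stability: top-1 token id must match exactly.
    if burn.token_ids[0] != reference.token_ids[0] {
        return Err(format!(
            "top-1 mismatch: burn={} llama.cpp={}",
            burn.token_ids[0], reference.token_ids[0]
        ));
    }
    // 2. Rank stability within the set: same 5 token ids, order may differ.
    let mut a = burn.token_ids;
    let mut b = reference.token_ids;
    a.sort_unstable();
    b.sort_unstable();
    if a != b {
        return Err(format!("top-5 set mismatch: burn={:?} llama.cpp={:?}", a, b));
    }
    // 3. Logit delta per shared token id must sit inside the tolerance band.
    let ref_logits: HashMap<u32, f32> = reference
        .token_ids
        .iter()
        .copied()
        .zip(reference.logits.iter().copied())
        .collect();
    for (id, logit) in burn.token_ids.iter().zip(burn.logits.iter()) {
        let delta = (*logit - ref_logits[id]).abs();
        if delta > tol {
            return Err(format!("logit delta {delta} for token {id} exceeds band {tol}"));
        }
    }
    Ok(())
}
```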
ISA coverage (per D2's CI matrix)
SPR AMX path — explicit, not polyfill. The whole point of the spine is that AMX dispatch comes from ndarray; this is where it must be exercised under a real model.
AVX2 path — baseline x86, the floor every cloud runner hits.
NEON path (Pi 5 8GB if runner allows) — model size is the binding constraint here; 8B BF16 weights are ~16 GB which does not fit, so NEON coverage is realistically Q4_K_M only and may need a smaller model substituted (e.g. Llama 3.2 3B Q4_K_M) as a proxy. Decision needed at design pass.
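Whether the AMX tile path was actually taken can only be confirmed by ndarray's own dispatch layer (CPUID only proves availability), but the harness can at least refuse to run when the runner does not report the ISA its matrix entry claims. A hedged sketch, assuming a hypothetical REQUIRED_ISA environment variable set by the CI matrix; AMX-specific feature strings ("amx-tile"/"amx-bf16") may need a newer toolchain and are deliberately left out here.

```rust
/// Fail fast if the runner cannot possibly exercise the ISA path this CI
/// matrix entry claims to cover. REQUIRED_ISA is a hypothetical env var.
fn assert_runner_isa() {
    match std::env::var("REQUIRED_ISA").as_deref() {
        Ok("avx2") => {
            #[cfg(target_arch = "x86_64")]
            assert!(
                std::arch::is_x86_feature_detected!("avx2"),
                "matrix says AVX2 but the runner does not report it"
            );
        }
        Ok("amx") => {
            // Intent only: real AMX confirmation must come from ndarray's
            // dispatch layer; here we only rule out an AVX2-less machine.
            #[cfg(target_arch = "x86_64")]
            assert!(
                std::arch::is_x86_feature_detected!("avx2"),
                "AMX runner should at minimum report AVX2"
            );
        }
        Ok("neon") => {
            #[cfg(target_arch = "aarch64")]
            assert!(
                std::arch::is_aarch64_feature_detected!("neon"),
                "matrix says NEON but the runner does not report it"
            );
        }
        other => panic!("REQUIRED_ISA unset or unknown: {other:?}"),
    }
}
```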
Architecture
The point: burn-fork is the validation surface, ndarray is what's actually being validated. burn-fork being correct end-to-end is evidence the spine substitution worked.
Acceptance criteria
tests/gguf_parity/ (or crates/burn/tests/gguf_parity/) scaffolded with the 10-prompt fixture, pinned-weights manifest, pinned llama.cpp commit SHA recorded in a LLAMA_CPP_REF constant or fixture file.
Llama 3 8B Q4_K_M — top-5 logit parity vs llama.cpp on the 10-prompt suite × 50 tokens, within Q4_K_M BF16 tolerance band, on at least SPR AMX + AVX2 + (NEON or NEON-proxy) paths.
Per-quantization tolerance bands documented in fixture (Q4_K_M looser than Q8_0; numbers traceable to D1's per-kernel baseline).
Reproducible: runs in CI nightly (not per-PR — the weights are too heavy), pinned weights/llama.cpp/tokenizer SHAs, deterministic seeds, no live network dependency at test time (weights pre-staged on the runner).
On failure, the harness emits (model, quant, prompt_idx, token_idx, top5_burn, top5_llamacpp, delta) so divergence is greppable, not just a red X (one possible record shape is sketched after this list).
CI matrix entry for SPR AMX is explicit — not silently routed through AVX2 polyfill (per D2's ISA-coverage requirement).
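For that failure record, one possible shape; field names mirror the tuple in the acceptance criteria, and the exact line format is illustrative, not fixed.

```rust
use std::fmt;

/// One divergence, emitted per failing (model, quant, prompt_idx, token_idx)
/// so a nightly red X is immediately greppable.
pub struct Divergence {
    pub model: String,
    pub quant: String,
    pub prompt_idx: usize,
    pub token_idx: usize,
    pub top5_burn: [u32; 5],
    pub top5_llamacpp: [u32; 5],
    pub delta: f32,
}

impl fmt::Display for Divergence {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "PARITY_DIVERGENCE model={} quant={} prompt={} token={} burn={:?} llamacpp={:?} max_delta={:.6}",
            self.model, self.quant, self.prompt_idx, self.token_idx,
            self.top5_burn, self.top5_llamacpp, self.delta
        )
    }
}
```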
Out of scope
candle-fork (ONNX/ViT/Whisper) — that's E2; same shape of issue, different framework, will be filed separately.
Production deployment of GGUF models inside MedCare-rs services — downstream consumer concern.
Multi-batch / multi-stream inference — single-stream greedy decode only here; batching is a follow-up.
Sampling parity (top-p, temperature, mirostat) — out of scope; we compare logits, not sampled tokens, to keep the assertion deterministic. Sampling is a downstream concern.
Larger models (70B, MoE) — they don't fit in CI smoke; covered in a future "heavy parity" issue if needed.
Training/fine-tuning paths — inference only.
Dependencies
Blocks on D1 (parity harness) — D1 defines the per-kernel BF16 tolerance bands this issue integrates over. Without D1 we'd be guessing tolerance numbers.
Blocks E2 (candle-fork ONNX/ViT/Whisper parity) — E2 mirrors this issue's structure for the discriminative side of the stack; landing E1 first proves the pattern.
Preliminary exploration (already done)
crates/burn/Cargo.toml confirms burn-ndarray backend depends on ndarray = { path = "../.." } and pins upstream burn at AdaWorldAPI/burn rev 9b2b671. ndarray spine wiring is real.
No GGUF test scaffolding exists in crates/burn/tests/ or top-level tests/ yet — this issue is greenfield from the test-fixture standpoint.
No existing gguf / llama references in repo source tree — confirms this validation is genuinely new work, not a duplicate of an existing test.