You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Without a bit-exact parity harness against established references, our reverse-engineered BLAS-graph + GEMM, BF16 mantissa-aware add_mul (no FP32 round-trip), Intel AMX byte-call hacks (Sapphire Rapids), NEON (Pi 4+/Orange/Zero), AVX-512/AVX2, and the SIMD polyfill floor can each silently drift relative to the reference world. Downstream forks (burn-fork, candle-fork) inherit that drift invisibly.
Clinical-liability stake. A ViT classifier returning 0.748 on Sapphire Rapids vs 0.751 on a Pi-5 clinic-edge box is not numerical noise when the decision threshold is 0.75 -- it is a diagnosis flip on the same image. For Railway-hosted SPR production AND clinic-edge Pi-5 deployment to render the same diagnosis, kernel outputs must agree to within a documented, defended tolerance band -- not "close enough." Bit-exact where the math allows; tightly-bounded ULP-band where rounding-order is fundamental (gemm).
What
A new test crate crates/parity-harness/ (sibling to existing crates/numeric-tests, crates/blas-tests, crates/blas-mock-tests) that compares ndarray kernel outputs against multiple authoritative references and fails CI on drift.
Reference baselines (pinned, reproducible)
numpy -- driven by a Python script (crates/parity-harness/refs/numpy_ref.py) emitting reference inputs+outputs to deterministic JSON / little-endian binary blobs. Pinned numpy==X.Y.Z, python==3.11.Z.
OpenBLAS -- direct FFI call from a Rust harness binary, version pinned in CI image.
MKL -- direct FFI call when available (Linux x86_64 CI legs); skipped with explicit SKIP reason on aarch64 / no-MKL hosts.
upstream rust-ndarray (the public crate from crates.io) -- pulled as a dev-dependency under an alias; we run the same op through it and diff.
Reference fixtures generated once, checked in (or content-addressed and fetched), with the generator script + lockfile so CI can regenerate and confirm reproducibility.
Coverage matrix
Kernel
f32
f16
bf16
Mode
gemm
yes
yes
yes
tolerance band (ULP)
add (elemwise)
yes
yes
yes
bit-exact
mul (elemwise)
yes
yes
yes
bit-exact
sum reduction
yes
yes
yes
tolerance band (ULP)
mean
yes
yes
yes
tolerance band (ULP)
conv2d
yes
yes
yes
tolerance band (ULP)
BF16 contract (non-negotiable):add and mul go through the mantissa-aware path with no FP32 round-trip and no implicit promotion. The harness asserts byte-equality on the raw bf16 output; any path that silently widens to f32 to do the work fails the test. This is the whole point of the mantissa-bit-counting work -- link only formats with matching mantissa width, no hidden upcast.
Tolerance bands
Op class
Bound
elementwise add
0 ULP (bit-exact, all formats)
elementwise mul
0 ULP (bit-exact, all formats)
gemm (typical)
<= 1 ULP per output element
gemm (large K)
documented widened band (rounding-order dominates)
reductions
<= 1 ULP for tree-reduce; documented for naive
conv2d
derived from underlying gemm band
Each band is documented inline in crates/parity-harness/README.md with the math justifying it (rounding-order sensitivity, tree-reduce vs sequential, etc.) so reviewers can defend it to a clinical auditor.
CI gate
The harness runs on every PR. Any kernel exceeding its band fails the build. Drift caught at PR time, not in production.
crates/parity-harness/tests/ -- one integration test per (kernel x dtype) cell.
BF16 mantissa-aware add_mul path under test lives in the SIMD layer (likely src/simd/ or equivalent -- confirm during implementation); harness exercises it directly without going through any code path that would round-trip via f32.
This issue defines the contract. D2 (separate issue) builds the CI matrix infrastructure (per-arch runners: SPR-AMX, AVX-512, AVX2, NEON, polyfill) that runs this harness.
Acceptance criteria
crates/parity-harness/ exists as a workspace member.
All listed kernels (gemm, add, mul, sum, mean, conv2d) in all listed dtypes (f32, f16, bf16) pass the bit-exact / tolerance-band check on at least the SIMD-polyfill build.
BF16 add and mul are byte-equal to reference (no FP32 round-trip path is hit; verified by an explicit test that would fail if promotion occurred).
Per-arch paths (SPR AMX, AVX-512, AVX2, NEON, polyfill) each have a place in the matrix where they will be tested separately. (Actually wiring those runners is D2's responsibility; this harness must support them.)
Tolerance band per kernel documented in crates/parity-harness/README.md with mathematical justification.
Harness wired into CI as a required check that fails the PR on drift.
Out of scope
candle-fork integration end-to-end (covered by E2).
Per-arch CI runner provisioning / matrix infrastructure itself (covered by D2, runs in parallel).
Performance regression gating -- this issue is correctness-only.
burn-fork end-to-end (covered by E1).
Dependencies
Upstream: none. Independent of any other open work; can start immediately.
Blocks E1 (burn-fork e2e) -- burn-fork cannot be certified for clinical workloads until ndarray kernels have parity guarantees.
Blocks E2 (candle-fork) -- same reasoning.
Pairs with D2 (CI matrix infra) -- D2 satisfies the per-arch infra contract this harness defines. They land independently and meet at the CI config.
Why
Without a bit-exact parity harness against established references, our reverse-engineered BLAS-graph + GEMM, BF16 mantissa-aware
add_mul(no FP32 round-trip), Intel AMX byte-call hacks (Sapphire Rapids), NEON (Pi 4+/Orange/Zero), AVX-512/AVX2, and the SIMD polyfill floor can each silently drift relative to the reference world. Downstream forks (burn-fork,candle-fork) inherit that drift invisibly.Clinical-liability stake. A ViT classifier returning
0.748on Sapphire Rapids vs0.751on a Pi-5 clinic-edge box is not numerical noise when the decision threshold is0.75-- it is a diagnosis flip on the same image. For Railway-hosted SPR production AND clinic-edge Pi-5 deployment to render the same diagnosis, kernel outputs must agree to within a documented, defended tolerance band -- not "close enough." Bit-exact where the math allows; tightly-bounded ULP-band where rounding-order is fundamental (gemm).What
A new test crate
crates/parity-harness/(sibling to existingcrates/numeric-tests,crates/blas-tests,crates/blas-mock-tests) that compares ndarray kernel outputs against multiple authoritative references and fails CI on drift.Reference baselines (pinned, reproducible)
crates/parity-harness/refs/numpy_ref.py) emitting reference inputs+outputs to deterministic JSON / little-endian binary blobs. Pinnednumpy==X.Y.Z,python==3.11.Z.SKIPreason on aarch64 / no-MKL hosts.Reference fixtures generated once, checked in (or content-addressed and fetched), with the generator script + lockfile so CI can regenerate and confirm reproducibility.
Coverage matrix
BF16 contract (non-negotiable):
addandmulgo through the mantissa-aware path with no FP32 round-trip and no implicit promotion. The harness asserts byte-equality on the rawbf16output; any path that silently widens to f32 to do the work fails the test. This is the whole point of the mantissa-bit-counting work -- link only formats with matching mantissa width, no hidden upcast.Tolerance bands
Each band is documented inline in
crates/parity-harness/README.mdwith the math justifying it (rounding-order sensitivity, tree-reduce vs sequential, etc.) so reviewers can defend it to a clinical auditor.CI gate
The harness runs on every PR. Any kernel exceeding its band fails the build. Drift caught at PR time, not in production.
Architecture
crates/parity-harness/-- new workspace member.crates/parity-harness/refs/-- reference generator scripts (numpy, OpenBLAS/MKL FFI shims) + pinned lockfiles + checked-in or fetched fixtures.crates/parity-harness/src/-- comparator + ULP helpers + bf16 byte-equality asserter.crates/parity-harness/tests/-- one integration test per (kernel x dtype) cell.add_mulpath under test lives in the SIMD layer (likelysrc/simd/or equivalent -- confirm during implementation); harness exercises it directly without going through any code path that would round-trip via f32.Acceptance criteria
crates/parity-harness/exists as a workspace member.numpy/ MKL / OpenBLAS / upstream-ndarray versions; lockfiles committed.crates/parity-harness/README.mdwith mathematical justification.Out of scope
candle-forkintegration end-to-end (covered by E2).burn-forkend-to-end (covered by E1).Dependencies