Numerical parity harness: bit-exact f32/f16/bf16 vs numpy/MKL/OpenBLAS/upstream #137

@AdaWorldAPI

Description

Why

Without a bit-exact parity harness against established references, our reverse-engineered BLAS-graph + GEMM, BF16 mantissa-aware add_mul (no FP32 round-trip), Intel AMX byte-call hacks (Sapphire Rapids), NEON (Pi 4+/Orange/Zero), AVX-512/AVX2, and the SIMD polyfill floor can each silently drift relative to the reference world. Downstream forks (burn-fork, candle-fork) inherit that drift invisibly.

Clinical-liability stake. A ViT classifier returning 0.748 on Sapphire Rapids vs 0.751 on a Pi-5 clinic-edge box is not numerical noise when the decision threshold is 0.75 -- it is a diagnosis flip on the same image. For Railway-hosted SPR production AND clinic-edge Pi-5 deployment to render the same diagnosis, kernel outputs must agree to within a documented, defended tolerance band -- not "close enough." Bit-exact where the math allows; tightly-bounded ULP-band where rounding-order is fundamental (gemm).

What

A new test crate crates/parity-harness/ (sibling to existing crates/numeric-tests, crates/blas-tests, crates/blas-mock-tests) that compares ndarray kernel outputs against multiple authoritative references and fails CI on drift.

Reference baselines (pinned, reproducible)

  • numpy -- driven by a Python script (crates/parity-harness/refs/numpy_ref.py) emitting reference inputs+outputs to deterministic JSON / little-endian binary blobs. Pinned numpy==X.Y.Z, python==3.11.Z.
  • OpenBLAS -- direct FFI call from a Rust harness binary, version pinned in CI image.
  • MKL -- direct FFI call when available (Linux x86_64 CI legs); skipped with explicit SKIP reason on aarch64 / no-MKL hosts.
  • upstream rust-ndarray (the public crate from crates.io) -- pulled as a dev-dependency under an alias; we run the same op through it and diff.

Reference fixtures generated once, checked in (or content-addressed and fetched), with the generator script + lockfile so CI can regenerate and confirm reproducibility.
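As a sketch of what consuming those little-endian blobs could look like on the Rust side (the exact fixture layout, header, and shape metadata are defined by the generator script, so this reader is an assumption, not the final format):

```rust
/// Decode a raw little-endian f32 blob into a vector.
/// Sketch only: the real fixture format under refs/ may carry a header
/// and shape metadata in addition to the raw payload.
fn read_f32_le(bytes: &[u8]) -> Vec<f32> {
    assert!(bytes.len() % 4 == 0, "blob length must be a multiple of 4");
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Fixing the byte order in the blob (rather than using native endianness) is what lets the same checked-in fixture be replayed on x86_64 and aarch64 runners alike.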

Coverage matrix

| Kernel | f32 | f16 | bf16 | Mode |
| --- | --- | --- | --- | --- |
| gemm | yes | yes | yes | tolerance band (ULP) |
| add (elemwise) | yes | yes | yes | bit-exact |
| mul (elemwise) | yes | yes | yes | bit-exact |
| sum reduction | yes | yes | yes | tolerance band (ULP) |
| mean | yes | yes | yes | tolerance band (ULP) |
| conv2d | yes | yes | yes | tolerance band (ULP) |

BF16 contract (non-negotiable): add and mul go through the mantissa-aware path with no FP32 round-trip and no implicit promotion. The harness asserts byte-equality on the raw bf16 output; any path that silently widens to f32 to do the work fails the test. This is the whole point of the mantissa-bit-counting work -- link only formats with matching mantissa width, no hidden upcast.
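The byte-equality asserter could be as small as the following sketch (names are illustrative; bf16 values are carried as raw `u16` bit patterns, i.e. the top 16 bits of an f32, rather than via an external crate):

```rust
/// Hypothetical asserter: raw byte-equality on bf16 outputs.
/// No tolerance, no promotion -- any widened-to-f32 path that changes
/// even one bit pattern fails here.
fn assert_bf16_bytes_equal(ours: &[u16], reference: &[u16]) {
    assert_eq!(ours.len(), reference.len(), "length mismatch");
    for (i, (a, b)) in ours.iter().zip(reference).enumerate() {
        assert!(a == b, "bf16 mismatch at {i}: ours=0x{a:04x} ref=0x{b:04x}");
    }
}

/// Round-to-nearest-even f32 -> bf16, used only to build test vectors.
fn bf16_from_f32(x: f32) -> u16 {
    let bits = x.to_bits();
    let round = ((bits >> 16) & 1) + 0x7FFF;
    (bits.wrapping_add(round) >> 16) as u16
}
```

Comparing raw bit patterns (not decoded values) is what makes this test sensitive to a hidden f32 round-trip: a promoted path that produces a value differing in the last mantissa bit is a hard failure, not a "close enough" pass.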

Tolerance bands

| Op class | Bound |
| --- | --- |
| elementwise add | 0 ULP (bit-exact, all formats) |
| elementwise mul | 0 ULP (bit-exact, all formats) |
| gemm (typical) | <= 1 ULP per output element |
| gemm (large K) | documented widened band (rounding-order dominates) |
| reductions | <= 1 ULP for tree-reduce; documented for naive |
| conv2d | derived from underlying gemm band |

Each band is documented inline in crates/parity-harness/README.md with the math justifying it (rounding-order sensitivity, tree-reduce vs sequential, etc.) so reviewers can defend it to a clinical auditor.
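One standard way to compute the per-element ULP distance these bands are expressed in (a sketch using the usual sign-magnitude-to-monotonic-integer mapping; the function name is ours, not the harness's):

```rust
/// Distance in ULPs between two finite f32 values, via the classic
/// trick of mapping each bit pattern to a monotonically ordered integer.
fn ulp_distance_f32(a: f32, b: f32) -> u64 {
    fn ordered(x: f32) -> i64 {
        let bits = x.to_bits() as i64;
        // Negative floats sort in reverse bit order; reflect them below zero.
        if bits & 0x8000_0000i64 != 0 {
            0x8000_0000i64 - bits
        } else {
            bits
        }
    }
    (ordered(a) - ordered(b)).unsigned_abs()
}
```

Note that under this mapping +0.0 and -0.0 are 0 ULPs apart even though their bit patterns differ; the 0-ULP rows above are therefore enforced as raw bit-pattern equality, with the ULP metric reserved for the banded ops.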

CI gate

The harness runs on every PR. Any kernel exceeding its band fails the build, so drift is caught at PR time, not in production.

Architecture

  • crates/parity-harness/ -- new workspace member.
  • crates/parity-harness/refs/ -- reference generator scripts (numpy, OpenBLAS/MKL FFI shims) + pinned lockfiles + checked-in or fetched fixtures.
  • crates/parity-harness/src/ -- comparator + ULP helpers + bf16 byte-equality asserter.
  • crates/parity-harness/tests/ -- one integration test per (kernel x dtype) cell.
  • BF16 mantissa-aware add_mul path under test lives in the SIMD layer (likely src/simd/ or equivalent -- confirm during implementation); harness exercises it directly without going through any code path that would round-trip via f32.
  • This issue defines the contract. D2 (separate issue) builds the CI matrix infrastructure (per-arch runners: SPR-AMX, AVX-512, AVX2, NEON, polyfill) that runs this harness.
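The comparator core in `crates/parity-harness/src/` could take roughly this shape (the `Mode` enum and names are illustrative, not the final API; each integration test would make one such call per (kernel x dtype) cell):

```rust
/// Illustrative comparison mode: one per coverage-matrix cell.
#[derive(Clone, Copy)]
enum Mode {
    BitExact,
    UlpBand(u64),
}

/// Sketch of the comparator core for the f32 case.
fn compare_f32(ours: &[f32], reference: &[f32], mode: Mode) -> Result<(), String> {
    fn ulp(a: f32, b: f32) -> u64 {
        fn ordered(x: f32) -> i64 {
            let bits = x.to_bits() as i64;
            if bits & 0x8000_0000i64 != 0 { 0x8000_0000i64 - bits } else { bits }
        }
        (ordered(a) - ordered(b)).unsigned_abs()
    }
    if ours.len() != reference.len() {
        return Err("length mismatch".into());
    }
    for (i, (&a, &b)) in ours.iter().zip(reference).enumerate() {
        let ok = match mode {
            Mode::BitExact => a.to_bits() == b.to_bits(),
            Mode::UlpBand(n) => ulp(a, b) <= n,
        };
        if !ok {
            return Err(format!("element {i}: ours={a:e} ref={b:e} exceeds band"));
        }
    }
    Ok(())
}
```

Returning `Result` rather than panicking lets one test report every failing cell, which matters when diffing a whole matrix of kernels against four references at once.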

Acceptance criteria

  • crates/parity-harness/ exists as a workspace member.
  • Reference outputs generated reproducibly with pinned numpy / MKL / OpenBLAS / upstream-ndarray versions; lockfiles committed.
  • All listed kernels (gemm, add, mul, sum, mean, conv2d) in all listed dtypes (f32, f16, bf16) pass the bit-exact / tolerance-band check on at least the SIMD-polyfill build.
  • BF16 add and mul are byte-equal to reference (no FP32 round-trip path is hit; verified by an explicit test that would fail if promotion occurred).
  • Per-arch paths (SPR AMX, AVX-512, AVX2, NEON, polyfill) each have a place in the matrix where they will be tested separately. (Actually wiring those runners is D2's responsibility; this harness must support them.)
  • Tolerance band per kernel documented in crates/parity-harness/README.md with mathematical justification.
  • Harness wired into CI as a required check that fails the PR on drift.

Out of scope

  • candle-fork integration end-to-end (covered by E2).
  • Per-arch CI runner provisioning / matrix infrastructure itself (covered by D2, runs in parallel).
  • Performance regression gating -- this issue is correctness-only.
  • burn-fork end-to-end (covered by E1).

Dependencies

  • Upstream: none. Independent of any other open work; can start immediately.
  • Blocks E1 (burn-fork e2e) -- burn-fork cannot be certified for clinical workloads until ndarray kernels have parity guarantees.
  • Blocks E2 (candle-fork) -- same reasoning.
  • Pairs with D2 (CI matrix infra) -- D2 satisfies the per-arch infra contract this harness defines. They land independently and meet at the CI config.
