Numerical parity harness: bit-exact f32/f16/bf16 vs numpy/MKL/OpenBLAS/upstream #137

@AdaWorldAPI

Description

Why

Without a bit-exact parity harness against established references, our reverse-engineered BLAS-graph + GEMM, BF16 mantissa-aware add_mul (no FP32 round-trip), Intel AMX byte-call hacks (Sapphire Rapids), NEON (Pi 4+/Orange/Zero), AVX-512/AVX2, and the SIMD polyfill floor can each silently drift relative to the reference world. Downstream forks (burn-fork, candle-fork) inherit that drift invisibly.

Clinical-liability stake. A ViT classifier returning 0.748 on Sapphire Rapids vs 0.751 on a Pi-5 clinic-edge box is not numerical noise when the decision threshold is 0.75 -- it is a diagnosis flip on the same image. For Railway-hosted SPR production AND clinic-edge Pi-5 deployment to render the same diagnosis, kernel outputs must agree to within a documented, defended tolerance band -- not "close enough." Bit-exact where the math allows; tightly-bounded ULP-band where rounding-order is fundamental (gemm).

What

A new test crate crates/parity-harness/ (sibling to existing crates/numeric-tests, crates/blas-tests, crates/blas-mock-tests) that compares ndarray kernel outputs against multiple authoritative references and fails CI on drift.

Reference baselines (pinned, reproducible)

  • numpy -- driven by a Python script (crates/parity-harness/refs/numpy_ref.py) emitting reference inputs+outputs to deterministic JSON / little-endian binary blobs. Pinned numpy==X.Y.Z, python==3.11.Z.
  • OpenBLAS -- direct FFI call from a Rust harness binary, version pinned in CI image.
  • MKL -- direct FFI call when available (Linux x86_64 CI legs); skipped with explicit SKIP reason on aarch64 / no-MKL hosts.
  • upstream rust-ndarray (the public crate from crates.io) -- pulled as a dev-dependency under an alias; we run the same op through it and diff.

Reference fixtures generated once, checked in (or content-addressed and fetched), with the generator script + lockfile so CI can regenerate and confirm reproducibility.
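As a sketch of what consuming those little-endian blobs could look like on the Rust side (the exact fixture layout, header, and shape metadata are defined by the generator script, so this reader is an assumption, not the final format):

```rust
/// Decode a raw little-endian f32 blob into a vector.
/// Sketch only: the real fixture format under refs/ may carry a header
/// and shape metadata in addition to the raw payload.
fn read_f32_le(bytes: &[u8]) -> Vec<f32> {
    assert!(bytes.len() % 4 == 0, "blob length must be a multiple of 4");
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Fixing the byte order in the blob (rather than using native endianness) is what lets the same checked-in fixture be replayed on x86_64 and aarch64 runners alike.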

Coverage matrix

| Kernel | f32 | f16 | bf16 | Mode |
| --- | --- | --- | --- | --- |
| gemm | yes | yes | yes | tolerance band (ULP) |
| add (elemwise) | yes | yes | yes | bit-exact |
| mul (elemwise) | yes | yes | yes | bit-exact |
| sum reduction | yes | yes | yes | tolerance band (ULP) |
| mean | yes | yes | yes | tolerance band (ULP) |
| conv2d | yes | yes | yes | tolerance band (ULP) |

BF16 contract (non-negotiable): add and mul go through the mantissa-aware path with no FP32 round-trip and no implicit promotion. The harness asserts byte-equality on the raw bf16 output; any path that silently widens to f32 to do the work fails the test. This is the whole point of the mantissa-bit-counting work -- link only formats with matching mantissa width, no hidden upcast.
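The byte-equality asserter could be as small as the following sketch (names are illustrative; bf16 values are carried as raw `u16` bit patterns, i.e. the top 16 bits of an f32, rather than via an external crate):

```rust
/// Hypothetical asserter: raw byte-equality on bf16 outputs.
/// No tolerance, no promotion -- any widened-to-f32 path that changes
/// even one bit pattern fails here.
fn assert_bf16_bytes_equal(ours: &[u16], reference: &[u16]) {
    assert_eq!(ours.len(), reference.len(), "length mismatch");
    for (i, (a, b)) in ours.iter().zip(reference).enumerate() {
        assert!(a == b, "bf16 mismatch at {i}: ours=0x{a:04x} ref=0x{b:04x}");
    }
}

/// Round-to-nearest-even f32 -> bf16, used only to build test vectors.
fn bf16_from_f32(x: f32) -> u16 {
    let bits = x.to_bits();
    let round = ((bits >> 16) & 1) + 0x7FFF;
    (bits.wrapping_add(round) >> 16) as u16
}
```

Comparing raw bit patterns (not decoded values) is what makes this test sensitive to a hidden f32 round-trip: a promoted path that produces a value differing in the last mantissa bit is a hard failure, not a "close enough" pass.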

Tolerance bands

| Op class | Bound |
| --- | --- |
| elementwise add | 0 ULP (bit-exact, all formats) |
| elementwise mul | 0 ULP (bit-exact, all formats) |
| gemm (typical) | <= 1 ULP per output element |
| gemm (large K) | documented widened band (rounding-order dominates) |
| reductions | <= 1 ULP for tree-reduce; documented for naive |
| conv2d | derived from underlying gemm band |

Each band is documented inline in crates/parity-harness/README.md with the math justifying it (rounding-order sensitivity, tree-reduce vs sequential, etc.) so reviewers can defend it to a clinical auditor.
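One standard way to compute the per-element ULP distance these bands are expressed in (a sketch using the usual sign-magnitude-to-monotonic-integer mapping; the function name is ours, not the harness's):

```rust
/// Distance in ULPs between two finite f32 values, via the classic
/// trick of mapping each bit pattern to a monotonically ordered integer.
fn ulp_distance_f32(a: f32, b: f32) -> u64 {
    fn ordered(x: f32) -> i64 {
        let bits = x.to_bits() as i64;
        // Negative floats sort in reverse bit order; reflect them below zero.
        if bits & 0x8000_0000i64 != 0 {
            0x8000_0000i64 - bits
        } else {
            bits
        }
    }
    (ordered(a) - ordered(b)).unsigned_abs()
}
```

Note that under this mapping +0.0 and -0.0 are 0 ULPs apart even though their bit patterns differ; the 0-ULP rows above are therefore enforced as raw bit-pattern equality, with the ULP metric reserved for the banded ops.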

CI gate

The harness runs on every PR. Any kernel exceeding its band fails the build, so drift is caught at PR time, not in production.

Architecture

  • crates/parity-harness/ -- new workspace member.
  • crates/parity-harness/refs/ -- reference generator scripts (numpy, OpenBLAS/MKL FFI shims) + pinned lockfiles + checked-in or fetched fixtures.
  • crates/parity-harness/src/ -- comparator + ULP helpers + bf16 byte-equality asserter.
  • crates/parity-harness/tests/ -- one integration test per (kernel x dtype) cell.
  • BF16 mantissa-aware add_mul path under test lives in the SIMD layer (likely src/simd/ or equivalent -- confirm during implementation); harness exercises it directly without going through any code path that would round-trip via f32.
  • This issue defines the contract. D2 (separate issue) builds the CI matrix infrastructure (per-arch runners: SPR-AMX, AVX-512, AVX2, NEON, polyfill) that runs this harness.
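The comparator core in `crates/parity-harness/src/` could take roughly this shape (the `Mode` enum and names are illustrative, not the final API; each integration test would make one such call per (kernel x dtype) cell):

```rust
/// Illustrative comparison mode: one per coverage-matrix cell.
#[derive(Clone, Copy)]
enum Mode {
    BitExact,
    UlpBand(u64),
}

/// Sketch of the comparator core for the f32 case.
fn compare_f32(ours: &[f32], reference: &[f32], mode: Mode) -> Result<(), String> {
    fn ulp(a: f32, b: f32) -> u64 {
        fn ordered(x: f32) -> i64 {
            let bits = x.to_bits() as i64;
            if bits & 0x8000_0000i64 != 0 { 0x8000_0000i64 - bits } else { bits }
        }
        (ordered(a) - ordered(b)).unsigned_abs()
    }
    if ours.len() != reference.len() {
        return Err("length mismatch".into());
    }
    for (i, (&a, &b)) in ours.iter().zip(reference).enumerate() {
        let ok = match mode {
            Mode::BitExact => a.to_bits() == b.to_bits(),
            Mode::UlpBand(n) => ulp(a, b) <= n,
        };
        if !ok {
            return Err(format!("element {i}: ours={a:e} ref={b:e} exceeds band"));
        }
    }
    Ok(())
}
```

Returning `Result` rather than panicking lets one test report every failing cell, which matters when diffing a whole matrix of kernels against four references at once.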

Acceptance criteria

  • crates/parity-harness/ exists as a workspace member.
  • Reference outputs generated reproducibly with pinned numpy / MKL / OpenBLAS / upstream-ndarray versions; lockfiles committed.
  • All listed kernels (gemm, add, mul, sum, mean, conv2d) in all listed dtypes (f32, f16, bf16) pass the bit-exact / tolerance-band check on at least the SIMD-polyfill build.
  • BF16 add and mul are byte-equal to reference (no FP32 round-trip path is hit; verified by an explicit test that would fail if promotion occurred).
  • Per-arch paths (SPR AMX, AVX-512, AVX2, NEON, polyfill) each have a place in the matrix where they will be tested separately. (Actually wiring those runners is D2's responsibility; this harness must support them.)
  • Tolerance band per kernel documented in crates/parity-harness/README.md with mathematical justification.
  • Harness wired into CI as a required check that fails the PR on drift.

Out of scope

  • candle-fork integration end-to-end (covered by E2).
  • Per-arch CI runner provisioning / matrix infrastructure itself (covered by D2, runs in parallel).
  • Performance regression gating -- this issue is correctness-only.
  • burn-fork end-to-end (covered by E1).

Dependencies

  • Upstream: none. Independent of any other open work; can start immediately.
  • Blocks E1 (burn-fork e2e) -- burn-fork cannot be certified for clinical workloads until ndarray kernels have parity guarantees.
  • Blocks E2 (candle-fork) -- same reasoning.
  • Pairs with D2 (CI matrix infra) -- D2 satisfies the per-arch infra contract this harness defines. They land independently and meet at the CI config.
