
candle-fork plan: ONNX/ViT/Whisper via ndarray + Pi/NEON edge matrix #136

@AdaWorldAPI

Description


Worker: E2 (ensemble: model-frameworks track)
Surface: model frameworks
Filed in: AdaWorldAPI/ndarray because ndarray is the obligatory spine; AdaWorldAPI/candle does not exist yet — repo creation is part of execution under this plan, not a precondition.

Why

After E1 confirms the spine pattern works in burn-fork (GGUF end-to-end parity passes against the upstream reference), the same playbook applies to HuggingFace candle. Today HuggingFace candle links accelerate-src on macOS and cblas on Linux. A candle-fork that swaps both for AdaWorldAPI/ndarray unlocks four production surfaces we currently cannot serve coherently from one BLAS substrate:

  • ONNX runtime — clinical NLP is ONNX-first (German clinical BERT variants, med-de-identification, ICD-10 coding suggestion). Without an ONNX path on the spine, every clinical NLP integration is a one-off.
  • ViT (Vision Transformer) — medical imaging: dermatology screening and flagging, retinal/slit-lamp scans, dental panoramics.
  • Whisper — DACH medical dictation. Hausarzt practice is dictation-heavy; this is a real unmet need with no good local-first option today.
  • embedanything DTO — already Jina5 / Qwen 3.5 / OpenBERT compatible (transcoded; roughly a month of work already landed). Drives entity-embedding for ontology matching against the TripletGraph spine.

The strategic point: one ndarray, two deployment modes, hardware-accelerated on both. The same model code builds for Railway SPR-AMX (cohort/batch path) and for a Pi 5 NEON box at the practice (edge inference). For a German Hausarzt that cannot ship patient images to the cloud (GDPR + Krankenhaus-IT-Sicherheitsgesetz), local ViT screening on a Pi 5 is the difference between deployable and non-deployable. That asymmetry — the same model code, two accelerator backends, both first-class — is the moat.

What

Create AdaWorldAPI/candle as a fork of huggingface/candle, swap the BLAS dependency to AdaWorldAPI/ndarray, and validate four smoke surfaces plus an edge deployment matrix.

Concrete items

  • Fork creation — AdaWorldAPI/candle from huggingface/candle. Tag the upstream commit at fork time; the default branch tracks our integration branch, not upstream main.
  • Manifest swap — replace accelerate-src (macOS) and cblas (Linux) with AdaWorldAPI/ndarray as the BLAS substrate across the workspace Cargo.toml and per-crate manifests. Remove the platform cfg gates that pick between accelerate-src and cblas; the ndarray dependency is the single source.
  • Manual upstream merge cadence — candle is fast-moving. Document the merge process: weekly git fetch upstream + integration-branch rebase, conflict resolution log, regression-gate (smoke tests must pass before merging upstream into our default branch). This is tracked here but the recurring work happens in the candle-fork repo, not on this issue.
  • candle-onnx crate — wire ONNX gemm through ndarray (same gemm path the rest of candle now uses). No bypass back to cblas.
  • Smoke tests (4):
    1. ONNX — load + run a small German clinical encoder (e.g. medbert-de or equivalent), parity vs onnxruntime on 10 sample sentences. Top-K embeddings agree within the tolerance defined by D1's parity harness.
    2. ViT — load + run ViT-Base, parity vs reference on 10 images.
    3. Whisper — 10-second German clinical clip transcription, BLEU/CER vs whisper.cpp reference.
    4. embedanything — Jina5 retrieval test on a small corpus, top-K agreement vs upstream.
  • Edge deployment matrix doc — for each target, validate that build + ViT inference works:
    • Pi 5 (NEON, 8GB RAM) — primary edge target.
    • Pi Zero 2 W (NEON, 512MB) — minimum-viable edge, small models only.
    • Orange Pi (NEON, varies) — third-party validation.
    • x86_64 SPR (AMX) — Railway production.
    • Build flag combinations validated, e.g. --no-default-features --features simd-neon for Pi targets; AMX flags for SPR.
  • License audit gate — clinical deployment requires per-model auditing (some HuggingFace models are research-only / non-commercial). Document a whitelist policy: model weights, license SPDX, commercial-use status, in-scope-for-clinical decision, last-audited date.
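The parity gate shared by the four smoke tests above can be sketched as a plain top-K agreement check. This is a minimal sketch: `top_k` and `top_k_agreement` are hypothetical helper names, and the actual pass threshold comes from D1's parity harness, not from this code.

```rust
/// Return indices of the k largest scores, descending. Hypothetical helper
/// for the smoke-test parity gate; assumes scores contain no NaN.
fn top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx
}

/// Fraction of top-k indices shared between the fork's output and the
/// reference runtime's output (e.g. candle-fork vs onnxruntime).
fn top_k_agreement(fork: &[f32], reference: &[f32], k: usize) -> f32 {
    let a = top_k(fork, k);
    let b = top_k(reference, k);
    let hits = a.iter().filter(|i| b.contains(i)).count();
    hits as f32 / k as f32
}

fn main() {
    // Toy logits standing in for one sample sentence's embedding scores.
    let fork = [0.90, 0.10, 0.85, 0.02, 0.70];
    let reference = [0.88, 0.12, 0.86, 0.01, 0.69];
    let agreement = top_k_agreement(&fork, &reference, 3);
    println!("top-3 agreement: {agreement}");
}
```

In the real gate this check runs per sample (10 sentences, 10 images, etc.) and the required agreement level is whatever tolerance D1 defines.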

Architecture

                          AdaWorldAPI/ndarray  (spine)
                                   |
                                   |  one BLAS substrate
                                   |
                ┌──────────────────┼──────────────────┐
                |                  |                  |
          burn-fork (E1)     candle-fork (this)     ...future forks
                                   |
              ┌────────────────────┼────────────────────┐
              |                    |                    |
            ONNX                  ViT                Whisper
       (clinical NLP)      (medical imaging)    (DACH dictation)
              |                    |                    |
              └────────────────────┼────────────────────┘
                                   |
                          embedanything DTO
                  (Jina5 / Qwen 3.5 / OpenBERT)
                                   |
                          ── two deployment modes ──
                                   |
              ┌────────────────────┴────────────────────┐
              |                                         |
        Railway SPR-AMX                          Pi 5 NEON edge
       (cohort/batch path)                  (Hausarzt local inference,
                                             GDPR / KHZG compliant)

Spine pattern (repeated from E1). The fork's job is to retarget BLAS at the spine; everything above the BLAS line — model loading, kernels, control flow — is upstream code we keep merging in. The fork is small, structural, and audit-friendly; it does not own the model logic.

Edge moat. The same model code for ViT (or Whisper, or ONNX) builds with --features amx on SPR and --features simd-neon on Pi; within each build, ndarray decides which kernel to dispatch at runtime via the same gemm entry point. No model-side #[cfg], no separate Pi port of the model code. That property is what makes "ship the same model to practice + cohort engine" feasible.
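A minimal sketch of that runtime-dispatch idea, using the standard library's CPU feature-detection macros. Everything here is illustrative, not the actual AdaWorldAPI/ndarray API: `gemm_backend_name` and the kernel labels are invented, and AVX-512 stands in for AMX, which has no stable detection macro yet.

```rust
/// Pick a kernel by CPU feature detection at runtime: one entry point,
/// per-architecture branches compiled in by cfg, the choice made on the
/// running machine rather than in model code.
fn gemm_backend_name() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    if std::arch::is_x86_feature_detected!("avx512f") {
        return "x86_64 wide-SIMD kernel";
    }
    #[cfg(target_arch = "aarch64")]
    if std::arch::is_aarch64_feature_detected!("neon") {
        return "aarch64 NEON kernel";
    }
    // Portable scalar path: correct everywhere, accelerated nowhere.
    "portable fallback kernel"
}

fn main() {
    // Same call site on SPR and on the Pi; only the detected kernel differs.
    println!("dispatching to: {}", gemm_backend_name());
}
```

The point of the pattern is that the model code above this call never mentions NEON or AMX; retargeting hardware is entirely the spine's problem.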

Acceptance criteria

  • AdaWorldAPI/candle repo exists, forked from huggingface/candle, fork commit tagged.
  • Manifest swap done — accelerate-src and cblas removed across the workspace, AdaWorldAPI/ndarray is the only BLAS dependency.
  • Builds clean on x86_64-linux and aarch64-linux.
  • All 4 smoke tests pass: ONNX (clinical BERT), ViT (ViT-Base), Whisper (German clinical clip), embedanything (Jina5 retrieval).
  • Edge deployment matrix validated: each target (Pi 5, Pi Zero 2 W, Orange Pi, x86 SPR) builds and runs ViT inference.
  • Build-flag combinations documented (e.g. --no-default-features --features simd-neon for Pi).
  • Manual upstream merge cadence documented in the candle-fork repo (MERGE_CADENCE.md or equivalent).
  • License whitelist policy doc committed (SPDX + commercial-use status + clinical-in-scope flag per model).
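The whitelist policy row above could be sketched as a record carrying exactly the audited fields. This is a hypothetical schema: `ModelLicenseEntry`, its field names, and the example model id are all invented for illustration, and real license status must be verified per model.

```rust
/// One whitelist row per model: weights identifier, SPDX license,
/// commercial-use status, clinical scope, and last audit date.
struct ModelLicenseEntry {
    weights: &'static str,      // e.g. a model hub repo id (invented here)
    spdx: &'static str,         // SPDX license identifier
    commercial_use: bool,       // license permits commercial deployment
    clinical_in_scope: bool,    // policy decision, not a license fact
    last_audited: &'static str, // ISO 8601 date of the last audit
}

impl ModelLicenseEntry {
    /// A model may ship clinically only if every gate is green.
    fn deployable(&self) -> bool {
        self.commercial_use && self.clinical_in_scope
    }
}

fn main() {
    let entry = ModelLicenseEntry {
        weights: "example-org/clinical-encoder-de",
        spdx: "Apache-2.0",
        commercial_use: true,
        clinical_in_scope: true,
        last_audited: "2025-01-01",
    };
    println!("{} ({}): deployable = {}", entry.weights, entry.spdx, entry.deployable());
}
```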

Out of scope

  • Ongoing upstream merge work after the initial fork — tracked here but executed in the candle-fork repo on its own cadence, not on this issue.
  • MedCare-rs handler wiring (ONNX/ViT/Whisper plumbed into clinical request paths) — a separate item, downstream of this one.
  • Replacing onnxruntime for training use cases — this fork is inference-only.
  • Quantisation work (GGUF on candle, INT8 ONNX) — separate items, after the parity surface is green.
  • Model fine-tuning / training loops — candle-fork is for inference; training stays on the upstream BLAS path until separately scoped.

Dependencies

  • Blocks on D1 — parity harness must define the tolerance under which ONNX top-K, ViT logits, Whisper CER, and Jina5 retrieval agreements are evaluated. Without D1's harness this issue cannot define passing thresholds.
  • Blocks on E1 — burn-fork GGUF end-to-end parity must pass against the upstream reference first. E1 is the proof that the spine pattern (ndarray substituted for the cblas/accelerate-src BLAS path) works for a real model framework. Without that proof we should not fork a second framework.
  • Neither blocker is in this repo's scope; both are tracked in their own issues.
