
candle-fork plan: ONNX/ViT/Whisper via ndarray + Pi/NEON edge matrix #136

@AdaWorldAPI

Description


Worker: E2 (ensemble: model-frameworks track)
Surface: model frameworks
Filed in: AdaWorldAPI/ndarray because ndarray is the obligatory spine; AdaWorldAPI/candle does not exist yet — repo creation is part of execution under this plan, not a precondition.

Why

After E1 confirms the spine pattern works in burn-fork (GGUF end-to-end parity passes against the upstream reference), the same playbook applies to HuggingFace candle. Today HuggingFace candle links accelerate-src on macOS and cblas on Linux. A candle-fork that swaps both for AdaWorldAPI/ndarray unlocks four production surfaces we currently cannot serve coherently from one BLAS substrate:

  • ONNX runtime — clinical NLP is ONNX-first (German clinical BERT variants, med-de-identification, ICD-10 coding suggestion). Without an ONNX path on the spine, every clinical NLP integration is a one-off.
  • ViT (Vision Transformer) — medical imaging: dermatology screening and flagging, retinal/slit-lamp scans, dental panoramics.
  • Whisper — DACH medical dictation. Hausarzt practice is dictation-heavy; this is a real unmet need with no good local-first option today.
  • embedanything DTO — already Jina5 / Qwen 3.5 / OpenBERT compatible (transcoded; roughly a month of work already landed). Drives entity-embedding for ontology matching against the TripletGraph spine.

The strategic point: one ndarray, two deployment modes, hardware-accelerated on both. The same model code builds for Railway SPR-AMX (cohort/batch path) and for a Pi 5 NEON box at the practice (edge inference). For a German Hausarzt that cannot ship patient images to the cloud (GDPR + Krankenhaus-IT-Sicherheitsgesetz), local ViT screening on a Pi 5 is the difference between deployable and non-deployable. That asymmetry — the same model code, two accelerator backends, both first-class — is the moat.

What

Create AdaWorldAPI/candle as a fork of huggingface/candle, swap the BLAS dependency to AdaWorldAPI/ndarray, and validate four smoke surfaces plus an edge deployment matrix.

Concrete items

  • Fork creation — AdaWorldAPI/candle from huggingface/candle. Tag the upstream commit at fork time; the default branch tracks our integration branch, not upstream main.
  • Manifest swap — replace accelerate-src (macOS) and cblas (Linux) with AdaWorldAPI/ndarray as the BLAS substrate across the workspace Cargo.toml and per-crate manifests. Remove the platform cfg gates that pick between accelerate-src and cblas; the ndarray dependency is the single source.
  • Manual upstream merge cadence — candle is fast-moving. Document the merge process: weekly git fetch upstream + integration-branch rebase, conflict resolution log, regression-gate (smoke tests must pass before merging upstream into our default branch). This is tracked here but the recurring work happens in the candle-fork repo, not on this issue.
  • candle-onnx crate — wire ONNX gemm through ndarray (same gemm path the rest of candle now uses). No bypass back to cblas.
  • Smoke tests (4):
    1. ONNX — load + run a small German clinical encoder (e.g. medbert-de or equivalent), parity vs onnxruntime on 10 sample sentences. Top-K embeddings agree within the tolerance defined by D1's parity harness.
    2. ViT — load + run ViT-Base, parity vs reference on 10 images.
    3. Whisper — 10-second German clinical clip transcription, BLEU/CER vs whisper.cpp reference.
    4. embedanything — Jina5 retrieval test on a small corpus, top-K agreement vs upstream.
  • Edge deployment matrix doc — for each target, validate that build + ViT inference works:
    • Pi 5 (NEON, 8GB RAM) — primary edge target.
    • Pi Zero 2 W (NEON, 512MB) — minimum-viable edge, small models only.
    • Orange Pi (NEON, varies) — third-party validation.
    • x86_64 SPR (AMX) — Railway production.
    • Build flag combinations validated, e.g. --no-default-features --features simd-neon for Pi targets; AMX flags for SPR.
  • License audit gate — clinical deployment requires per-model auditing (some HuggingFace models are research-only / non-commercial). Document a whitelist policy: model weights, license SPDX, commercial-use status, in-scope-for-clinical decision, last-audited date.
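The parity gate shared by the four smoke tests above can be sketched as a plain top-K agreement check. This is a minimal sketch: `top_k` and `top_k_agreement` are hypothetical helper names, and the actual pass threshold comes from D1's parity harness, not from this code.

```rust
/// Return indices of the k largest scores, descending. Hypothetical helper
/// for the smoke-test parity gate; assumes scores contain no NaN.
fn top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx
}

/// Fraction of top-k indices shared between the fork's output and the
/// reference runtime's output (e.g. candle-fork vs onnxruntime).
fn top_k_agreement(fork: &[f32], reference: &[f32], k: usize) -> f32 {
    let a = top_k(fork, k);
    let b = top_k(reference, k);
    let hits = a.iter().filter(|i| b.contains(i)).count();
    hits as f32 / k as f32
}

fn main() {
    // Toy logits standing in for one sample sentence's embedding scores.
    let fork = [0.90, 0.10, 0.85, 0.02, 0.70];
    let reference = [0.88, 0.12, 0.86, 0.01, 0.69];
    let agreement = top_k_agreement(&fork, &reference, 3);
    println!("top-3 agreement: {agreement}");
}
```

In the real gate this check runs per sample (10 sentences, 10 images, etc.) and the required agreement level is whatever tolerance D1 defines.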

Architecture

                          AdaWorldAPI/ndarray  (spine)
                                   |
                                   |  one BLAS substrate
                                   |
                ┌──────────────────┼──────────────────┐
                |                  |                  |
          burn-fork (E1)     candle-fork (this)     ...future forks
                                   |
              ┌────────────────────┼────────────────────┐
              |                    |                    |
            ONNX                  ViT                Whisper
       (clinical NLP)      (medical imaging)    (DACH dictation)
              |                    |                    |
              └────────────────────┼────────────────────┘
                                   |
                          embedanything DTO
                  (Jina5 / Qwen 3.5 / OpenBERT)
                                   |
                          ── two deployment modes ──
                                   |
              ┌────────────────────┴────────────────────┐
              |                                         |
        Railway SPR-AMX                          Pi 5 NEON edge
       (cohort/batch path)                  (Hausarzt local inference,
                                             GDPR / KHZG compliant)

Spine pattern (repeated from E1). The fork's job is to retarget BLAS at the spine; everything above the BLAS line — model loading, kernels, control flow — is upstream code we keep merging in. The fork is small, structural, and audit-friendly; it does not own the model logic.

Edge moat. The same model code for ViT (or Whisper, or ONNX) builds with --features amx on SPR and --features simd-neon on Pi; within each build, ndarray decides which kernel to dispatch at runtime via the same gemm entry point. No model-side #[cfg], no separate Pi port of the model code. That property is what makes "ship the same model to practice + cohort engine" feasible.
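A minimal sketch of that runtime-dispatch idea, using the standard library's CPU feature-detection macros. Everything here is illustrative, not the actual AdaWorldAPI/ndarray API: `gemm_backend_name` and the kernel labels are invented, and AVX-512 stands in for AMX, which has no stable detection macro yet.

```rust
/// Pick a kernel by CPU feature detection at runtime: one entry point,
/// per-architecture branches compiled in by cfg, the choice made on the
/// running machine rather than in model code.
fn gemm_backend_name() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    if std::arch::is_x86_feature_detected!("avx512f") {
        return "x86_64 wide-SIMD kernel";
    }
    #[cfg(target_arch = "aarch64")]
    if std::arch::is_aarch64_feature_detected!("neon") {
        return "aarch64 NEON kernel";
    }
    // Portable scalar path: correct everywhere, accelerated nowhere.
    "portable fallback kernel"
}

fn main() {
    // Same call site on SPR and on the Pi; only the detected kernel differs.
    println!("dispatching to: {}", gemm_backend_name());
}
```

The point of the pattern is that the model code above this call never mentions NEON or AMX; retargeting hardware is entirely the spine's problem.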

Acceptance criteria

  • AdaWorldAPI/candle repo exists, forked from huggingface/candle, fork commit tagged.
  • Manifest swap done — accelerate-src and cblas removed across the workspace, AdaWorldAPI/ndarray is the only BLAS dependency.
  • Builds clean on x86_64-linux and aarch64-linux.
  • All 4 smoke tests pass: ONNX (clinical BERT), ViT (ViT-Base), Whisper (German clinical clip), embedanything (Jina5 retrieval).
  • Edge deployment matrix validated: each target (Pi 5, Pi Zero 2 W, Orange Pi, x86 SPR) builds and runs ViT inference.
  • Build-flag combinations documented (e.g. --no-default-features --features simd-neon for Pi).
  • Manual upstream merge cadence documented in the candle-fork repo (MERGE_CADENCE.md or equivalent).
  • License whitelist policy doc committed (SPDX + commercial-use status + clinical-in-scope flag per model).
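The whitelist policy row above could be sketched as a record carrying exactly the audited fields. This is a hypothetical schema: `ModelLicenseEntry`, its field names, and the example model id are all invented for illustration, and real license status must be verified per model.

```rust
/// One whitelist row per model: weights identifier, SPDX license,
/// commercial-use status, clinical scope, and last audit date.
struct ModelLicenseEntry {
    weights: &'static str,      // e.g. a model hub repo id (invented here)
    spdx: &'static str,         // SPDX license identifier
    commercial_use: bool,       // license permits commercial deployment
    clinical_in_scope: bool,    // policy decision, not a license fact
    last_audited: &'static str, // ISO 8601 date of the last audit
}

impl ModelLicenseEntry {
    /// A model may ship clinically only if every gate is green.
    fn deployable(&self) -> bool {
        self.commercial_use && self.clinical_in_scope
    }
}

fn main() {
    let entry = ModelLicenseEntry {
        weights: "example-org/clinical-encoder-de",
        spdx: "Apache-2.0",
        commercial_use: true,
        clinical_in_scope: true,
        last_audited: "2025-01-01",
    };
    println!("{} ({}): deployable = {}", entry.weights, entry.spdx, entry.deployable());
}
```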

Out of scope

  • Ongoing upstream merge work after the initial fork — tracked here but executed in the candle-fork repo on its own cadence, not on this issue.
  • MedCare-rs handler wiring (ONNX/ViT/Whisper plumbed into clinical request paths) — a separate item, downstream of this one.
  • Replacing onnxruntime for training use cases — this fork is inference-only.
  • Quantisation work (GGUF on candle, INT8 ONNX) — separate items, after the parity surface is green.
  • Model fine-tuning / training loops — candle-fork is for inference; training stays on the upstream BLAS path until separately scoped.

Dependencies

  • Blocks on D1 — parity harness must define the tolerance under which ONNX top-K, ViT logits, Whisper CER, and Jina5 retrieval agreements are evaluated. Without D1's harness this issue cannot define passing thresholds.
  • Blocks on E1 — burn-fork GGUF end-to-end parity must pass against the upstream reference first. E1 is the proof that the spine pattern (ndarray substituted for the cblas/accelerate-src BLAS path) works for a real model framework. Without that proof we should not fork a second framework.
  • Neither blocker is in this repo's scope; both are tracked in their own issues.
