docs(architecture): PERFORMANCE-HARNESS-FRAMEWORK — turn performance covenants into evidence (Joel's 'proof + harness' directive) by joelteply · Pull Request #1348 · CambrianTech/continuum

joelteply · 2026-05-16T22:50:33Z

What

Adds `docs/architecture/PERFORMANCE-HARNESS-FRAMEWORK.md` (393 lines). The framework that turns the architecture docs' performance covenants into measurable evidence.

Doc-only PR. Step two of Joel's two-step directive ("ask for proof of performance concerns and then design harnesses"). Step one — airc broadcast asking the room for concrete perf data — is in flight; the framework's "Pending Evidence-Driven Additions" section will absorb the responses.

Why

The architecture docs name performance covenants throughout: RAG composition < 500ms, vector search < 50ms, voice response < 3s, persona tick < 1ms, recall hot-path < 5ms on Air, working-set page-in < 1ms, governor `current_policy()` < 50ns, and roughly 30 more per-Part budgets in `GENOME-FOUNDRY-SENTINEL.md`. They are claims until they are measured. This document specifies the harnesses that turn the claims into evidence.

Three Principles

1. Harnesses produce VDD records, not prose reports. The substrate's Standard VDD Record (CBAR-SUBSTRATE §"Standard VDD Record") is the output of every harness. JSONL for machines, markdown for humans.
2. Per-anchor scoping. Every harness runs against Air (UMA-16) + 5090 (discrete-32+64) at minimum. Intermediate anchors are added as evidence accumulates.
3. Baseline-relative, not absolute. Pass/fail is relative to a committed baseline, not to wishful budgets. Budgets bound expectations; baselines are the regression line.

Harness Catalog (11)

Designed against substrate covenants:

| Harness | What it proves | Cadence |
|---|---|---|
| `cold-start-harness` | Time to first usable substrate < 30s ceiling | per-PR + nightly |
| `persona-tick-harness` | < 1ms tick claim (validates RTOS rule) | per-PR (runtime) / weekly |
| `rag-composition-harness` | < 500ms RAG composition + cache hit < 100ms | per-PR (cognition/genome) / weekly |
| `vector-search-harness` | < 50ms vector search on 10k engrams | per-PR (genome/recall) / weekly |
| `voice-response-harness` | < 3s voice end-to-end (VAD → STT → cog → composer → TTS) | weekly |
| `consolidation-phase-harness` | Sleep phase fits 30min on Air; cascade stays ≤ 2 | nightly |
| `multi-persona-contention-harness` | A1-A3 invariants under N=8 personas; prefix-share KV win | weekly |
| `federation-gossip-harness` | < 5s gossip round; conflict resolution correct | weekly |
| `speculation-hit-rate-harness` | Hit rate > 0.4 (Air) / 0.6 (5090); zero cascade oscillation | weekly |
| `reprojection-confidence-harness` | Correct transform variant + confidence ±0.05; no silent stale | per-PR (reprojection) / weekly |
| `governor-cascade-harness` | Hysteresis + restore-speculation-last; oscillation count == 0 | per-PR (governor) / weekly |
| `audit-recorder-roundtrip-harness` | 1200 entries land + sigs verify + mutations rejected (gates #1344) | per-PR |

Each catalog entry specifies: scenario, key VDD fields, pass thresholds per anchor, cadence, baseline location.
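As an illustration of what "pass thresholds per anchor" could look like in code, here is a minimal sketch for the `speculation-hit-rate-harness` row. The 0.4 / 0.6 thresholds and the zero-oscillation requirement come from the table; the `Anchor` enum, the function names, and the strict `>` comparison are assumptions made for this example, not the framework's actual API.

```rust
/// Hardware anchors named in the PR. The enum itself is a hypothetical
/// representation chosen for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Anchor {
    Air,
    Rtx5090,
}

/// Minimum speculation hit rate per anchor, taken from the catalog row:
/// > 0.4 on Air, > 0.6 on the 5090.
pub fn min_hit_rate(anchor: Anchor) -> f64 {
    match anchor {
        Anchor::Air => 0.4,
        Anchor::Rtx5090 => 0.6,
    }
}

/// A run passes only if the hit rate is strictly above the anchor's
/// threshold AND no cascade oscillation was observed.
pub fn speculation_passes(anchor: Anchor, hit_rate: f64, oscillations: u32) -> bool {
    hit_rate > min_hit_rate(anchor) && oscillations == 0
}
```

The same measured hit rate can pass on one anchor and fail on the other, which is why each anchor keeps its own threshold and baseline rather than sharing one.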

Regression Detection

Two layers:

1. Hard ceilings (substrate covenants): failure blocks the PR regardless of baseline.
2. Baseline delta: ≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail, ≥5% faster auto-suggests baseline update.

Baselines are committed JSON; updating is a separate reviewable action — never silent.
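The two layers can be sketched as a single classification function. This is a minimal illustration, assuming latency-style metrics where lower is better and a strictly positive baseline; the `Verdict` enum, the `classify` name, and the handling of exact band boundaries (an upper bound belongs to the milder band) are assumptions for this example, since the PR does not specify them.

```rust
/// Result of checking one measurement. Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    /// Layer 1: the covenant ceiling was reached or exceeded.
    CeilingViolation,
    /// Layer 2 bands, relative to the committed baseline.
    Fail,
    ReviewFlag,
    Warn,
    Pass,
    /// At least 5% faster than baseline: suggest (never apply) an update.
    SuggestBaselineUpdate,
}

/// Classify a measurement against its hard ceiling and its baseline.
/// Assumes `baseline_ms > 0.0`; a real harness would reject a zero baseline.
pub fn classify(measured_ms: f64, baseline_ms: f64, ceiling_ms: f64) -> Verdict {
    // Layer 1: covenants are written as "< ceiling", so reaching it fails.
    if measured_ms >= ceiling_ms {
        return Verdict::CeilingViolation;
    }
    // Layer 2: signed percentage change relative to the baseline.
    let delta_pct = (measured_ms - baseline_ms) / baseline_ms * 100.0;
    if delta_pct <= -5.0 {
        Verdict::SuggestBaselineUpdate
    } else if delta_pct <= 5.0 {
        Verdict::Pass
    } else if delta_pct <= 10.0 {
        Verdict::Warn
    } else if delta_pct <= 25.0 {
        Verdict::ReviewFlag
    } else {
        Verdict::Fail
    }
}
```

Checking the ceiling first means a run that sits comfortably within 5% of a bad baseline still blocks the PR once it crosses the covenant.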

CI Integration

`cargo continuum-vdd <harness>` runs any harness. CI tags: per-pr / weekly / nightly / release. Per-PR runs on Rust-touching changes hit a small focused subset; the weekly and nightly schedules cover the rest.

Output Bundle (Per Run)

Three artifacts under `~/.continuum/vdd/<sha>/<harness>/`:

- VDD record JSONL (for regression detection)
- Reproducibility manifest TOML (git_sha + policy_version + cascade_step + env + seeds)
- Human-readable summary markdown (reviewer-friendly)
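The bundle layout can be sketched as a small path helper. The directory shape `.continuum/vdd/<sha>/<harness>/` under the home directory comes from the description above; the three file names (`record.jsonl`, `manifest.toml`, `summary.md`) and the `BundlePaths` type are hypothetical, chosen only to make the example concrete.

```rust
use std::path::{Path, PathBuf};

/// Locations of the three per-run artifacts. File names are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub struct BundlePaths {
    pub record_jsonl: PathBuf,
    pub manifest_toml: PathBuf,
    pub summary_md: PathBuf,
}

/// Build the artifact paths for one harness run at one commit.
/// `home` is passed in explicitly so the function stays pure and testable.
pub fn bundle_paths(home: &Path, sha: &str, harness: &str) -> BundlePaths {
    let dir = home
        .join(".continuum")
        .join("vdd")
        .join(sha)
        .join(harness);
    BundlePaths {
        record_jsonl: dir.join("record.jsonl"),
        manifest_toml: dir.join("manifest.toml"),
        summary_md: dir.join("summary.md"),
    }
}
```

Keying the directory by commit SHA and harness name keeps every run's evidence addressable without a separate index.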

Pending Evidence-Driven Additions

The catalog above is the design floor. The airc broadcast asked the room for: slowest wall-clock paths in canary, regressions noticed, resource pressure incidents, what can't currently be measured, what's budgeted but unverified, hardware-class gaps. Each concrete data point becomes either (a) a new harness, (b) a sharpened pass-threshold on an existing one, or (c) a new VDD schema extension field.

Connection To Other Docs

- CBAR-SUBSTRATE-ARCHITECTURE.md §"Standard VDD Record" + §"One-Line Instrumentation API": the recordable substrate this framework sits on
- GENOME-FOUNDRY-SENTINEL.md Performance Budget tables per Part: the covenants harnessed
- PERSONA-COGNITION-CONTRACT.md §"Acceptance Criteria": the invariants verified
- MODULE-CATALOG.md: the modules these harnesses validate
- ALPHA-GAP-ANALYSIS.md: Lane C VDD telemetry substrate is the foundation this framework lives on

Acceptance Criteria For The Framework Itself

Six checkpoints, including the framework's own performance budget: harness overhead (setup + measurement + compare, excluding scenario) < 50ms.
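One way to measure "harness overhead excluding scenario" is to time the whole run and subtract the scenario's own wall-clock time. The sketch below assumes that interpretation; `run_timed`, `Timing`, and `within_overhead_budget` are hypothetical names, and only the 50ms figure comes from the PR.

```rust
use std::time::{Duration, Instant};

/// The framework's own budget: overhead outside the scenario must stay < 50ms.
pub const OVERHEAD_BUDGET: Duration = Duration::from_millis(50);

/// Wall-clock split for one run.
#[derive(Debug, Clone, Copy)]
pub struct Timing {
    pub scenario: Duration,
    pub overhead: Duration,
}

/// Run setup, scenario, and compare, timing the scenario separately so that
/// overhead = total elapsed - scenario elapsed.
pub fn run_timed<S, T, FS, FC, FK>(setup: FS, scenario: FC, compare: FK) -> (bool, Timing)
where
    FS: FnOnce() -> S,
    FC: FnOnce(&mut S) -> T,
    FK: FnOnce(&T) -> bool,
{
    let total_start = Instant::now();
    let mut state = setup();

    let scenario_start = Instant::now();
    let output = scenario(&mut state);
    let scenario_elapsed = scenario_start.elapsed();

    let passed = compare(&output);
    let total = total_start.elapsed();

    let timing = Timing {
        scenario: scenario_elapsed,
        // saturating_sub guards against any clock-granularity surprise.
        overhead: total.saturating_sub(scenario_elapsed),
    };
    (passed, timing)
}

/// True when the non-scenario cost stays strictly under the budget.
pub fn within_overhead_budget(timing: &Timing) -> bool {
    timing.overhead < OVERHEAD_BUDGET
}
```

A long scenario does not count against the budget: a 2s voice run with 10ms of setup and compare time is within budget, while a 1ms tick run with 50ms of harness bookkeeping is not.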

Open Questions (6)

Where harnesses live in workspace, hardware availability for CI, noisy harnesses (P50 vs P99.9), baseline update authority, cross-harness regression detection, per-persona-shape harnesses.

…s into evidence

Joel's directive: 'ask for proof of performance concerns and then design harnesses.' This is step two (step one: airc broadcast asking for evidence is in flight).

The architecture docs name performance covenants throughout: RAG composition < 500ms, vector search < 50ms, voice response < 3s, persona tick < 1ms, recall hot-path < 5ms on Air, working-set page-in < 1ms, governor `current_policy()` < 50ns, and ~30 more per-Part budgets in GENOME-FOUNDRY-SENTINEL. This document specifies the harnesses that turn the claims into evidence.

Three principles:

1. Harnesses produce VDD records, not prose reports. The substrate's Standard VDD Record format (CBAR-SUBSTRATE §'Standard VDD Record') is the output of every harness. Humans paste it into PR comments; machines consume the JSONL form for regression detection.
2. Per-anchor scoping. Every harness runs against Air (UMA-16) + 5090 (discrete-32+64) at minimum. Intermediate hardware classes interpolate; explicit entries added as evidence accumulates.
3. Baseline-relative, not absolute. Pass/fail is RELATIVE to a committed baseline, not to a hand-written budget. Budgets bound expectations; baselines are the regression line.

Sections:

- Harness Anatomy: four-part Rust template (setup / scenario / measure / compare) with `vdd_scope!` instrumentation. Each harness ≤ 200 lines + a baseline JSON per hardware anchor.
- Per-Anchor Scoping: concrete file paths for baselines per anchor. Missing baselines produce [Skipped: NoAirBaseline], never a silent pass.
- Harness Catalog: 11 harnesses designed against substrate covenants: cold-start, persona-tick (< 1ms tick claim), rag-composition (< 500ms claim), vector-search (< 50ms claim), voice-response (< 3s claim), consolidation-phase, multi-persona-contention (validates A1-A3 invariants under load + prefix-share KV win), federation-gossip, speculation-hit-rate (validates Part 9 oscillation-free behavior), reprojection-confidence (validates CBAR-SUBSTRATE reprojection toolkit), governor-cascade (validates Part 11 hysteresis + restore-speculation-last rule), audit-recorder-roundtrip (gates regression on the just-shipped #1344). Each harness entry has: scenario, key VDD fields, pass thresholds for Air + 5090, cadence, baseline location.
- Schema Extensions: typed extension structs per harness category (TickMetrics, CompositionMetrics, RecallMetrics, etc.). Base VDD Record stays uniform; extensions land alongside the harness that needs them.
- Regression Detection: two layers. Layer 1 hard ceilings (covenant violations fail PR regardless of baseline). Layer 2 baseline delta (≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail, ≥5% faster auto-suggests baseline update). Baselines are committed JSON; updating is a separate reviewable action.
- CI Integration: tagged per-pr / weekly / nightly / release. A `cargo continuum-vdd <harness>` invocation runs harnesses locally; CI uses the same binary.
- Harness Output Bundle: VDD record JSONL + reproducibility manifest TOML + human-readable summary markdown. All three under `~/.continuum/vdd/<sha>/<harness>/`.
- Pending Evidence-Driven Additions: placeholder section that fills in as the room responds to the perf evidence request. Each concrete data point becomes either a new harness, a sharpened pass-threshold, or a new VDD schema field.
- Acceptance Criteria For The Framework: six checkpoints including the framework's own performance budget (< 50ms harness overhead excluding scenario).
- 6 Open Questions including: where harnesses live in workspace, hardware availability for CI, handling noisy harnesses (P50/P99/P99.9), baseline update authority, cross-harness regression detection, per-persona-shape harnesses.

The framework lands with the airc broadcast in flight; specific harnesses will sharpen as evidence comes back from claude-tab-1 / codex / vhsm-d1f4 / the room.

Doc-only. No code. Implementation lands as ALPHA-GAP Lane C (VDD telemetry substrate); this doc is the spec.
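The "never silent pass" rule for missing baselines can be sketched as an outcome type that makes skipping explicit. This is an illustration only: the `[Skipped: NoAirBaseline]` marker is quoted from the commit message, while the `Anchor` and `Outcome` types, the `compare_to_baseline` function, the `[Compared: …]` rendering, and the `No5090Baseline` marker for the second anchor are assumptions.

```rust
use std::fmt;

/// Hardware anchors from the PR; the enum is a hypothetical representation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Anchor {
    Air,
    Rtx5090,
}

/// Result of the compare step for one anchor.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// A baseline existed; carry the signed percentage change.
    Compared { delta_pct: f64 },
    /// No committed baseline for this anchor: reported, never passed.
    Skipped(Anchor),
}

impl fmt::Display for Outcome {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Outcome::Compared { delta_pct } => write!(f, "[Compared: {:+.1}%]", delta_pct),
            Outcome::Skipped(Anchor::Air) => write!(f, "[Skipped: NoAirBaseline]"),
            // Marker name for the 5090 anchor is assumed by analogy.
            Outcome::Skipped(Anchor::Rtx5090) => write!(f, "[Skipped: No5090Baseline]"),
        }
    }
}

/// Compare a measurement to an optional baseline (assumed > 0 when present).
pub fn compare_to_baseline(anchor: Anchor, measured_ms: f64, baseline_ms: Option<f64>) -> Outcome {
    match baseline_ms {
        Some(b) => Outcome::Compared {
            delta_pct: (measured_ms - b) / b * 100.0,
        },
        None => Outcome::Skipped(anchor),
    }
}
```

Because `Skipped` is its own variant rather than a default pass, any caller that matches on `Outcome` has to decide what a missing baseline means.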

github-actions Bot added the size: L label May 16, 2026

joelteply merged commit cd19b81 into canary May 16, 2026
2 checks passed

joelteply deleted the joel/docs-performance-harnesses branch May 16, 2026 23:01

This was referenced May 16, 2026

feat(governor): Lane H PR-2 — TOML policy file loader + validator #1350

Merged

docs(architecture): PROD-COGNITION-REPLAY — 100% Rust cognition + proof from PROD not POC #1386

Merged
