docs(architecture): PERFORMANCE-HARNESS-FRAMEWORK — turn performance covenants into evidence (Joel's 'proof + harness' directive) #1348
Merged
…s into evidence

Joel's directive: 'ask for proof of performance concerns and then design harnesses.' This is step two (step one, an airc broadcast asking for evidence, is in flight).

The architecture docs name performance covenants throughout: RAG composition < 500ms, vector search < 50ms, voice response < 3s, persona tick < 1ms, recall hot-path < 5ms on Air, working-set page-in < 1ms, governor current_policy() < 50ns, and ~30 more per-Part budgets in GENOME-FOUNDRY-SENTINEL. This document specifies the harnesses that turn the claims into evidence.

Three principles:

1. Harnesses produce VDD records, not prose reports. The substrate's Standard VDD Record format (CBAR-SUBSTRATE §'Standard VDD Record') is the output of every harness. Humans paste it into PR comments; machines consume the JSONL form for regression detection.
2. Per-anchor scoping. Every harness runs against Air (UMA-16) + 5090 (discrete-32+64) at minimum. Intermediate hardware classes interpolate; explicit entries are added as evidence accumulates.
3. Baseline-relative, not absolute. Pass/fail is relative to a committed baseline, not to a hand-written budget. Budgets bound expectations; baselines are the regression line.

Sections:

- Harness Anatomy: four-part Rust template (setup / scenario / measure / compare) with vdd_scope! instrumentation. Each harness is ≤ 200 lines plus a baseline JSON per hardware anchor.
- Per-Anchor Scoping: concrete file paths for baselines per anchor. Missing baselines produce [Skipped: NoAirBaseline], never a silent pass.
- Harness Catalog: 12 harnesses designed against substrate covenants: cold-start, persona-tick (< 1ms tick claim), rag-composition (< 500ms claim), vector-search (< 50ms claim), voice-response (< 3s claim), consolidation-phase, multi-persona-contention (validates A1-A3 invariants under load + prefix-share KV win), federation-gossip, speculation-hit-rate (validates Part 9 oscillation-free behavior), reprojection-confidence (validates CBAR-SUBSTRATE reprojection toolkit), governor-cascade (validates Part 11 hysteresis + restore-speculation-last rule), audit-recorder-roundtrip (gates regression on the just-shipped #1344). Each harness entry has: scenario, key VDD fields, pass thresholds for Air + 5090, cadence, baseline location.
- Schema Extensions: typed extension structs per harness category (TickMetrics, CompositionMetrics, RecallMetrics, etc.). The base VDD Record stays uniform; extensions land alongside the harness that needs them.
- Regression Detection: two layers. Layer 1 is hard ceilings (covenant violations fail the PR regardless of baseline). Layer 2 is baseline delta (≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail; ≥5% faster auto-suggests a baseline update). Baselines are committed JSON; updating is a separate reviewable action.
- CI Integration: tagged per-pr / weekly / nightly / release. A cargo continuum-vdd <harness> invocation runs harnesses locally; CI uses the same binary.
- Harness Output Bundle: VDD record JSONL + reproducibility manifest TOML + human-readable summary markdown. All three live under ~/.continuum/vdd/<sha>/<harness>/.
- Pending Evidence-Driven Additions: placeholder section that fills in as the room responds to the perf evidence request. Each concrete data point becomes either a new harness, a sharpened pass-threshold, or a new VDD schema field.
- Acceptance Criteria For The Framework: six checkpoints, including the framework's own performance budget (< 50ms harness overhead excluding scenario).
- 6 Open Questions, including: where harnesses live in the workspace, hardware availability for CI, handling noisy harnesses (P50/P99/P99.9), baseline update authority, cross-harness regression detection, per-persona-shape harnesses.

The framework lands with the airc broadcast in flight; specific harnesses will sharpen as evidence comes back from claude-tab-1 / codex / vhsm-d1f4 / the room.

Doc-only. No code. Implementation lands as ALPHA-GAP Lane C (VDD telemetry substrate); this doc is the spec.
This was referenced May 16, 2026
What
Adds `docs/architecture/PERFORMANCE-HARNESS-FRAMEWORK.md` (393 lines). The framework that turns the architecture docs' performance covenants into measurable evidence.
Doc-only PR. Step two of Joel's two-step directive ("ask for proof of performance concerns and then design harnesses"). Step one — airc broadcast asking the room for concrete perf data — is in flight; the framework's "Pending Evidence-Driven Additions" section will absorb the responses.
Why
The architecture docs name performance covenants throughout: RAG composition < 500ms, vector search < 50ms, voice response < 3s, persona tick < 1ms, recall hot-path < 5ms on Air, working-set page-in < 1ms, governor `current_policy()` < 50ns, and roughly 30 more per-Part budgets in `GENOME-FOUNDRY-SENTINEL.md`. They are claims until they are measured. This document specifies the harnesses that turn the claims into evidence.
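One way to keep these covenants checkable is to hold them in a single typed table instead of scattering them through prose. The following is a minimal sketch under stated assumptions: the string keys, the `COVENANTS` table, and the `ceiling_for` and `violates` helpers are hypothetical names invented for illustration, not identifiers from the framework. Only the numeric budgets come from the list above.

```rust
use std::time::Duration;

/// Covenant ceilings quoted in the architecture docs.
/// The keys are hypothetical labels chosen for this sketch.
pub const COVENANTS: &[(&str, Duration)] = &[
    ("rag-composition", Duration::from_millis(500)),
    ("vector-search", Duration::from_millis(50)),
    ("voice-response", Duration::from_secs(3)),
    ("persona-tick", Duration::from_millis(1)),
    ("recall-hot-path-air", Duration::from_millis(5)),
    ("working-set-page-in", Duration::from_millis(1)),
    ("governor-current-policy", Duration::from_nanos(50)),
];

/// Look up the ceiling for a covenant, if one is registered.
pub fn ceiling_for(name: &str) -> Option<Duration> {
    COVENANTS.iter().find(|(n, _)| *n == name).map(|(_, d)| *d)
}

/// `Some(true)` when the measurement breaks the covenant.
/// The budgets are written as strict "<", so reaching the ceiling counts
/// as a violation. Returns `None` for an unknown covenant name.
pub fn violates(name: &str, measured: Duration) -> Option<bool> {
    ceiling_for(name).map(|ceiling| measured >= ceiling)
}
```

A table like this gives each harness one place to read its ceiling from, and an unknown name surfaces as `None` instead of silently passing.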
Three Principles
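1. Harnesses produce VDD records, not prose reports. The substrate's Standard VDD Record format (`CBAR-SUBSTRATE` §"Standard VDD Record") is the output of every harness. JSONL for machines, markdown for humans.
2. Per-anchor scoping. Every harness runs against Air (UMA-16) + 5090 (discrete-32+64) at minimum.
3. Baseline-relative, not absolute. Pass/fail is relative to a committed baseline, not to a hand-written budget.
Harness Catalog (12)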
Designed against substrate covenants:
- `cold-start-harness`
- `persona-tick-harness`
- `rag-composition-harness`
- `vector-search-harness`
- `voice-response-harness`
- `consolidation-phase-harness`
- `multi-persona-contention-harness`
- `federation-gossip-harness`
- `speculation-hit-rate-harness`
- `reprojection-confidence-harness`
- `governor-cascade-harness`
- `audit-recorder-roundtrip-harness`
Each catalog entry specifies: scenario, key VDD fields, pass thresholds per anchor, cadence, baseline location.
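The shape of a catalog entry can be sketched as a plain struct. This is an illustration under assumptions, not the framework's schema: every type, field, and method name here (`Anchor`, `Cadence`, `CatalogEntry`, `Skipped`, `baseline_for`) is hypothetical, and `No5090Baseline` is an assumed counterpart to the `NoAirBaseline` skip reason named in the commit message. The lookup only checks whether a baseline location is registered for an anchor; it does not touch the filesystem.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hardware anchors every harness is scoped to (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Anchor {
    Air,
    Rtx5090,
}

/// CI cadence tags from the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Cadence {
    PerPr,
    Weekly,
    Nightly,
    Release,
}

/// Reason a harness run is skipped instead of passed.
#[derive(Debug, PartialEq, Eq)]
pub enum Skipped {
    NoAirBaseline,
    No5090Baseline, // assumed counterpart, not named in the source
}

/// One catalog entry: scenario, key VDD fields, per-anchor thresholds,
/// cadence, and per-anchor baseline location.
#[derive(Debug, Clone)]
pub struct CatalogEntry {
    pub name: String,
    pub scenario: String,
    pub key_vdd_fields: Vec<String>,
    pub pass_threshold_ns: HashMap<Anchor, u64>,
    pub cadence: Cadence,
    pub baseline_location: HashMap<Anchor, PathBuf>,
}

impl CatalogEntry {
    /// Return the registered baseline path for an anchor, or an explicit
    /// skip reason so a missing baseline can never read as a pass.
    pub fn baseline_for(&self, anchor: Anchor) -> Result<&Path, Skipped> {
        match self.baseline_location.get(&anchor) {
            Some(path) => Ok(path.as_path()),
            None => Err(match anchor {
                Anchor::Air => Skipped::NoAirBaseline,
                Anchor::Rtx5090 => Skipped::No5090Baseline,
            }),
        }
    }
}
```

Returning `Result` instead of `Option` forces the caller to handle the skip case by name, which matches the "never a silent pass" rule.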
Regression Detection
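Two layers:
1. Hard ceilings: covenant violations fail the PR regardless of baseline.
2. Baseline delta: ≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail; ≥5% faster auto-suggests a baseline update.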
Baselines are committed JSON; updating is a separate reviewable action — never silent.
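The two-layer rule can be sketched as a single classification function. This is a minimal illustration, assuming nanosecond measurements and upper-inclusive bands (exactly 10% slower counts as warn, exactly 25% as review-flag), since the source does not say which side the boundaries fall on. The `Verdict` enum and `classify` function are hypothetical names, not the framework's API.

```rust
/// Outcome of the two-layer check for a single latency measurement.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    /// Layer 1: the covenant ceiling was reached or exceeded.
    CeilingViolation,
    /// Layer 2: at least 5% faster than baseline.
    SuggestBaselineUpdate,
    Pass,
    Warn,
    ReviewFlag,
    Fail,
}

/// Classify a measurement against a hard ceiling and a committed baseline.
/// Integer arithmetic avoids floating-point rounding at band edges.
pub fn classify(measured_ns: u64, baseline_ns: u64, ceiling_ns: u64) -> Verdict {
    // Layer 1: budgets are written as strict "<", so reaching the ceiling
    // fails regardless of how the run compares to the baseline.
    if measured_ns >= ceiling_ns {
        return Verdict::CeilingViolation;
    }

    // Layer 2: compare measured * 100 against baseline * (100 + band).
    // Widening to u128 rules out overflow for any u64 input.
    let scaled = measured_ns as u128 * 100;
    let base = baseline_ns as u128;

    if scaled <= base * 95 {
        Verdict::SuggestBaselineUpdate
    } else if scaled <= base * 105 {
        Verdict::Pass
    } else if scaled <= base * 110 {
        Verdict::Warn
    } else if scaled <= base * 125 {
        Verdict::ReviewFlag
    } else {
        Verdict::Fail
    }
}
```

Checking the ceiling first means a run that beats a stale baseline but still breaks the covenant cannot slip through as a suggested baseline update.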
CI Integration
`cargo continuum-vdd <harness>` runs any harness. CI tags: `per-pr` / `weekly` / `nightly` / `release`. Per-PR Rust-touching changes hit a small focused subset; the full weekly and nightly schedules cover the rest.
Output Bundle (Per Run)
Three artifacts under `~/.continuum/vdd/<sha>/<harness>/`:
- VDD record (JSONL)
- Reproducibility manifest (TOML)
- Human-readable summary (markdown)
Pending Evidence-Driven Additions
The catalog above is the design floor. The airc broadcast asked the room for: slowest wall-clock paths in canary, regressions noticed, resource pressure incidents, what can't currently be measured, what's budgeted but unverified, hardware-class gaps. Each concrete data point becomes either (a) a new harness, (b) a sharpened pass-threshold on an existing one, or (c) a new VDD schema extension field.
Connection To Other Docs
- `CBAR-SUBSTRATE-ARCHITECTURE.md` §"Standard VDD Record" + §"One-Line Instrumentation API": the recordable substrate this framework sits on
- `GENOME-FOUNDRY-SENTINEL.md` Performance Budget tables per Part: the covenants harnessed
- `PERSONA-COGNITION-CONTRACT.md` §"Acceptance Criteria": the invariants verified
- `MODULE-CATALOG.md`: the modules these harnesses validate
- `ALPHA-GAP-ANALYSIS.md`: Lane C VDD telemetry substrate is the foundation this framework lives on
Acceptance Criteria For The Framework Itself
Six checkpoints, including the framework's own performance budget: harness overhead (setup + measurement + compare, excluding scenario) < 50ms.
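The overhead budget can be checked by timing each phase separately and summing everything except the scenario. The sketch below assumes the four phases named in the commit message (setup / scenario / measure / compare); `PhaseTimings`, `OVERHEAD_BUDGET`, and `timed` are hypothetical names for illustration, not part of the framework.

```rust
use std::time::{Duration, Instant};

/// Framework overhead budget: setup + measurement + compare must stay
/// under 50ms, with the scenario itself excluded.
pub const OVERHEAD_BUDGET: Duration = Duration::from_millis(50);

/// Wall-clock time spent in each phase of one harness run.
#[derive(Debug, Clone, Copy)]
pub struct PhaseTimings {
    pub setup: Duration,
    pub scenario: Duration,
    pub measure: Duration,
    pub compare: Duration,
}

impl PhaseTimings {
    /// Harness overhead: every phase except the scenario under test.
    pub fn overhead(&self) -> Duration {
        self.setup + self.measure + self.compare
    }

    /// True when the overhead is strictly below the 50ms budget.
    pub fn within_budget(&self) -> bool {
        self.overhead() < OVERHEAD_BUDGET
    }
}

/// Run a closure and return its result together with the elapsed time.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

A harness runner would wrap each phase in `timed`, fill in `PhaseTimings`, and fail its own acceptance check when `within_budget()` returns false, independent of how long the scenario took.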
Open Questions (6)
Where harnesses live in workspace, hardware availability for CI, noisy harnesses (P50 vs P99.9), baseline update authority, cross-harness regression detection, per-persona-shape harnesses.