Skip to content

docs(architecture): PERFORMANCE-HARNESS-FRAMEWORK — turn performance covenants into evidence (Joel's 'proof + harness' directive)#1348

Merged
joelteply merged 1 commit into
canaryfrom
joel/docs-performance-harnesses
May 16, 2026
Merged

docs(architecture): PERFORMANCE-HARNESS-FRAMEWORK — turn performance covenants into evidence (Joel's 'proof + harness' directive)#1348
joelteply merged 1 commit into
canaryfrom
joel/docs-performance-harnesses

Conversation

@joelteply
Copy link
Copy Markdown
Contributor

What

Adds `docs/architecture/PERFORMANCE-HARNESS-FRAMEWORK.md` (393 lines). The framework that turns the architecture docs' performance covenants into measurable evidence.

Doc-only PR. Step two of Joel's two-step directive ("ask for proof of performance concerns and then design harnesses"). Step one — airc broadcast asking the room for concrete perf data — is in flight; the framework's "Pending Evidence-Driven Additions" section will absorb the responses.

Why

The architecture docs name performance covenants throughout: RAG composition < 500ms, vector search < 50ms, voice response < 3s, persona tick < 1ms, recall hot-path < 5ms on Air, working-set page-in < 1ms, governor `current_policy()` < 50ns, and roughly 30 more per-Part budgets in `GENOME-FOUNDRY-SENTINEL.md`. They are claims until they are measured. This document specifies the harnesses that turn the claims into evidence.

Three Principles

  1. Harnesses produce VDD records, not prose reports. The substrate's Standard VDD Record (CBAR-SUBSTRATE §"Standard VDD Record") is the output of every harness. JSONL for machines, markdown for humans.
  2. Per-anchor scoping. Every harness runs against Air (UMA-16) + 5090 (discrete-32+64) at minimum. Intermediate anchors added as evidence accumulates.
  3. Baseline-relative, not absolute. Pass/fail is relative to a committed baseline, not to wishful budgets. Budgets bound expectations; baselines are the regression line.

Harness Catalog (11)

Designed against substrate covenants:

Harness What it proves Cadence
cold-start-harness Time to first usable substrate < 30s ceiling per-PR + nightly
persona-tick-harness < 1ms tick claim (validates RTOS rule) per-PR (runtime) / weekly
rag-composition-harness < 500ms RAG composition + cache hit < 100ms per-PR (cognition/genome) / weekly
vector-search-harness < 50ms vector search on 10k engrams per-PR (genome/recall) / weekly
voice-response-harness < 3s voice end-to-end (VAD → STT → cog → composer → TTS) weekly
consolidation-phase-harness Sleep phase fits 30min on Air; cascade stays ≤ 2 nightly
multi-persona-contention-harness A1-A3 invariants under N=8 personas; prefix-share KV win weekly
federation-gossip-harness < 5s gossip round; conflict resolution correct weekly
speculation-hit-rate-harness Hit rate > 0.4 (Air) / 0.6 (5090); zero cascade oscillation weekly
reprojection-confidence-harness Correct transform variant + confidence ±0.05; no silent stale per-PR (reprojection) / weekly
governor-cascade-harness Hysteresis + restore-speculation-last; oscillation count == 0 per-PR (governor) / weekly
audit-recorder-roundtrip-harness 1200 entries land + sigs verify + mutations rejected (gates #1344) per-PR

Each catalog entry specifies: scenario, key VDD fields, pass thresholds per anchor, cadence, baseline location.

Regression Detection

Two layers:

  • Hard ceilings (substrate covenants) — failure blocks PR regardless of baseline.
  • Baseline delta — ≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail, ≥5% faster auto-suggests baseline update.

Baselines are committed JSON; updating is a separate reviewable action — never silent.

CI Integration

cargo continuum-vdd <harness> runs any harness. CI tags: per-pr / weekly / nightly / release. Per-PR Rust-touching changes hit a small focused subset; full weekly + nightly schedules cover the rest.

Output Bundle (Per Run)

Three artifacts under ~/.continuum/vdd/<sha>/<harness>/:

  1. VDD record JSONL (for regression detection)
  2. Reproducibility manifest TOML (git_sha + policy_version + cascade_step + env + seeds)
  3. Human-readable summary markdown (reviewer-friendly)

Pending Evidence-Driven Additions

The catalog above is the design floor. The airc broadcast asked the room for: slowest wall-clock paths in canary, regressions noticed, resource pressure incidents, what can't currently be measured, what's budgeted but unverified, hardware-class gaps. Each concrete data point becomes either (a) a new harness, (b) a sharpened pass-threshold on an existing one, or (c) a new VDD schema extension field.

Connection To Other Docs

  • CBAR-SUBSTRATE-ARCHITECTURE.md §"Standard VDD Record" + §"One-Line Instrumentation API" — the recordable substrate this framework sits on
  • GENOME-FOUNDRY-SENTINEL.md Performance Budget tables per Part — the covenants harnessed
  • PERSONA-COGNITION-CONTRACT.md §"Acceptance Criteria" — the invariants verified
  • MODULE-CATALOG.md — the modules these harnesses validate
  • ALPHA-GAP-ANALYSIS.md — Lane C VDD telemetry substrate is the foundation this framework lives on

Acceptance Criteria For The Framework Itself

Six checkpoints, including the framework's own performance budget: harness overhead (setup + measurement + compare, excluding scenario) < 50ms.

Open Questions (6)

Where harnesses live in workspace, hardware availability for CI, noisy harnesses (P50 vs P99.9), baseline update authority, cross-harness regression detection, per-persona-shape harnesses.

…s into evidence

Joel's directive: 'ask for proof of performance concerns and then
design harnesses.' This is step two (step one: airc broadcast asking
for evidence is in flight). The architecture docs name performance
covenants throughout — RAG composition < 500ms, vector search
< 50ms, voice response < 3s, persona tick < 1ms, recall hot-path
< 5ms on Air, working-set page-in < 1ms, governor current_policy()
< 50ns, and ~30 more per-Part budgets in GENOME-FOUNDRY-SENTINEL.
This document specifies the harnesses that turn the claims into
evidence.

Three principles:

1. Harnesses produce VDD records, not prose reports. The substrate's
   Standard VDD Record format (CBAR-SUBSTRATE §'Standard VDD Record')
   is the output of every harness. Humans paste it into PR comments;
   machines consume the JSONL form for regression detection.

2. Per-anchor scoping. Every harness runs against Air (UMA-16) +
   5090 (discrete-32+64) at minimum. Intermediate hardware classes
   interpolate; explicit entries added as evidence accumulates.

3. Baseline-relative, not absolute. Pass/fail is RELATIVE to a
   committed baseline, not to a hand-written budget. Budgets bound
   expectations; baselines are the regression line.

Sections:

- Harness Anatomy: four-part Rust template (setup / scenario /
  measure / compare) with vdd_scope! instrumentation. Each harness
  ≤ 200 lines + a baseline JSON per hardware anchor.

- Per-Anchor Scoping: concrete file paths for baselines per anchor.
  Missing baselines produce [Skipped: NoAirBaseline] — never silent
  pass.

- Harness Catalog: 11 harnesses designed against substrate covenants:
  cold-start, persona-tick (< 1ms tick claim), rag-composition
  (< 500ms claim), vector-search (< 50ms claim), voice-response
  (< 3s claim), consolidation-phase, multi-persona-contention
  (validates A1-A3 invariants under load + prefix-share KV win),
  federation-gossip, speculation-hit-rate (validates Part 9
  oscillation-free behavior), reprojection-confidence (validates
  CBAR-SUBSTRATE reprojection toolkit), governor-cascade (validates
  Part 11 hysteresis + restore-speculation-last rule),
  audit-recorder-roundtrip (gates regression on the just-shipped
  #1344).

  Each harness entry has: scenario, key VDD fields, pass thresholds
  for Air + 5090, cadence, baseline location.

- Schema Extensions: typed extension structs per harness category
  (TickMetrics, CompositionMetrics, RecallMetrics, etc.). Base VDD
  Record stays uniform; extensions land alongside the harness that
  needs them.

- Regression Detection: two layers. Layer 1 hard ceilings (covenant
  violations fail PR regardless of baseline). Layer 2 baseline delta
  (≤5% pass, 5-10% warn, 10-25% review-flag, >25% fail, ≥5% faster
  auto-suggests baseline update). Baselines are committed JSON;
  updating is a separate reviewable action.

- CI Integration: tagged per-pr / weekly / nightly / release. A
  cargo continuum-vdd <harness> invocation runs harnesses locally;
  CI uses the same binary.

- Harness Output Bundle: VDD record JSONL + reproducibility
  manifest TOML + human-readable summary markdown. All three under
  ~/.continuum/vdd/<sha>/<harness>/.

- Pending Evidence-Driven Additions: placeholder section that fills
  in as the room responds to the perf evidence request. Each
  concrete data point becomes either a new harness, a sharpened
  pass-threshold, or a new VDD schema field.

- Acceptance Criteria For The Framework: six checkpoints including
  the framework's own performance budget (< 50ms harness overhead
  excluding scenario).

- 6 Open Questions including: where harnesses live in workspace,
  hardware availability for CI, handling noisy harnesses (P50/P99/
  P99.9), baseline update authority, cross-harness regression
  detection, per-persona-shape harnesses.

The framework lands with the airc broadcast in flight; specific
harnesses will sharpen as evidence comes back from claude-tab-1 /
codex / vhsm-d1f4 / the room.

Doc-only. No code. Implementation lands as ALPHA-GAP Lane C
(VDD telemetry substrate) — this doc is the spec.
@joelteply joelteply merged commit cd19b81 into canary May 16, 2026
2 checks passed
@joelteply joelteply deleted the joel/docs-performance-harnesses branch May 16, 2026 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant