Skip to content

Stabilize startup persona backpressure#1058

Merged
joelteply merged 2 commits into
canaryfrom
fix/startup-rag-index-backpressure
May 8, 2026
Merged

Stabilize startup persona backpressure#1058
joelteply merged 2 commits into
canaryfrom
fix/startup-rag-index-backpressure

Conversation

@joelteply
Copy link
Copy Markdown
Contributor

Summary

Stabilizes the Continuum cold-start path so startup does not flood persona/RAG/model work, and so npm start leaves real detached workers alive after the parent shell exits.

Key changes:

  • Make startup persona backlog processing opt-in; default cold-start advances room bookmarks to current tail instead of generating from historical backlog.
  • Add a startup autonomous-work gate around persona inbox/service loops while seed runs.
  • Defer browser attach until after seed to prevent stale tabs from triggering chat/RAG during database synchronization.
  • Replace nohup/disown pseudo-daemons with scripts/spawn-detached.mjs using real detached process groups.
  • Dedupe PersonaUser runtime construction so overlapping startup paths cannot create duplicate persona clients/event handlers.
  • Move coordination base stream out of shared into server, and add a shared/browser Node-import boundary ratchet.
  • Coordinate persona response claims before RAG/memory/generation so one message does not become N expensive model turns.
  • Add Rust channel DB tick backoff so a missing Postgres path pauses DB-backed periodic work once per module, not once per persona.
  • Make seed recipe sync idempotent via BaseEntity content fingerprints and cheap projected data/list queries.
  • Make SQLite projection queries evolve selected columns before SELECT.

Validation

Manual live validation on Mac:

  • npm stop && npm start completed and left detached processes alive after parent exit.
  • 75s+ post-start, ./jtag ping returned server + browser connected at /chat/general.
  • CPU remained calm: continuum-core-server idle; node orchestrator ~1%.
  • Persona startup generation flood did not reproduce: no RAG/generate log lines; self-task-gen.log stayed at 0 lines.
  • Rust DB failure path emitted module-level backoff only:
    • 60s, then 120s, then 300s warnings in channel-tick.log
    • no per-persona self-task spam.

Focused checks:

  • ./node_modules/.bin/tsc --noEmit --project . --pretty false
  • npx vitest run tests/unit/service-initializer.test.ts tests/unit/chat-coordination-stream.test.ts tests/unit/shared-node-boundary.test.ts --reporter=verbose
  • npx vitest run tests/unit/shared-node-boundary.test.ts --reporter=verbose
  • cargo test -p continuum-core --lib orm::sqlite::tests::test_query_projection_evolves_missing_columns_before_select --release --features metal,accelerate -- --nocapture
  • cargo test -p continuum-core --lib self_task_generator --release --features metal,accelerate -- --nocapture
  • git diff --check

Push hook:

  • TypeScript clean.
  • ESLint baseline improved: 5870 errors vs baseline 6289.
  • Rust build clean with --features metal,accelerate.
  • Rust tests passed with --features metal,accelerate.

Known Follow-Up / Merge Caveat

Pre-push native-arch Docker publish failed while the hook continued:

Native-arch Docker push FAILED — continuing with git push. CI's verify-architectures will block merge until resolved. Re-run manually: scripts/push-current-arch.sh

If canary CI requires fresh native images for this Rust change, that needs to be resolved before merge.

@joelteply
Copy link
Copy Markdown
Contributor Author

LGTM — 24-file PR is coherent, each change targets a specific failure mode I (or other testers) observed today. Doing review-only since shared gh account.

What it does (architectural shape)

Storm prevention at every layer:

  • Coordination layer: PersonaMessageEvaluator.coordinateResponseClaim broadcasts intent + waits for decision BEFORE RAG/recall/embeddings/generation. Only the granted responder pays cognition cost. This is the right fix for "5 personas all run RAG on the same message."
  • Singleton layer: UserDaemonServer.ensurePersonaClient adds in-flight Promise dedup so concurrent system:ready + UserDaemon-deferred-init + user:created events can't construct duplicate PersonaUser instances. Cleaner than the now-removed data:users:created re-emit hack in UserCreateServerCommand — events fire once, dedup catches the call race.
  • Startup gate: StartupAutonomousWorkGate (file + env) blocks runServiceLoop + serviceInbox until seed completes. parallel-start.sh creates the pause file, removes after seed.
  • Default-off opt-ins: codebase indexing, persona prewarm, startup backlog processing — all default off. Booted system can be USED before opt-in performance work fires.
  • PersonaUser.catchUpOnRecentMessages: default = bookmark-current-tail (no historical replay). Opt-in consolidates backlog into ONE latest-room signal.
  • Detached spawn: spawn-detached.mjs replaces npx — npx wrapper dying mid-orchestrator was a real bug. New script also supports --ulimit-v-kb for memory limits.
  • Content-fingerprint: BaseEntity.contentFingerprint + hasContentDelta for deterministic seed idempotency without per-script equality rules.
  • Coordination class location fix: BaseCoordinationStream moved shared/→server/ (Node imports were leaking into shared); also getMaxResponders now config-driven instead of probabilistic (70% 1 / 25% 2 / 5% 3 was the storm-source for repeated single message events).

Concerns (non-blocking)

  1. StartupAutonomousWorkGate.waitUntilOpen has no max-timeout. Polls every 1s while isPaused(). If seed dies and pause-file isn't cleaned up, autonomous loop never starts. Suggested: add maxWaitMs param defaulting to ~10min, log at warn level after 5min, hard-fail open after 10min. Same evidence-not-suppression family as the always-validate-tests rule.

  2. chat-coordination-stream.test.ts only validates maxResponders=1. The behavior change from probabilistic→config-driven needs >1 cases too. Add maxResponders=2 test (assert exactly 2 granted in confidence-desc order) + maxResponders=3 test. Otherwise a regression to the probabilistic path with maxResponders>1 wouldn't be caught.

  3. shared-node-boundary.test.ts allowlist — strong guardrail (29 known files), but operator adding a justified Node import has to update the allowlist by hand. Worth a docstring on KNOWN_SHARED_NODE_IMPORTS explaining the process: "to add: justify the Node dep in commit message, then add path to this set." Also consider checking import syntax with Sets of allowed-module names, not regex over union.

  4. PersonaUser.catchUpOnRecentMessages default-bookmark-only is a real behavior change. On every restart, personas skip ALL backlog. Pro: no zombie replies on restart (storm prevention). Con: a chat sent during downtime won't get a persona response on restart. Worth a docstring note clarifying: "By design — restart is not a 'catch up' moment. To process specific room backlog, use the explicit replay test path with CONTINUUM_PROCESS_STARTUP_BACKLOG=1."

  5. Validation per Add VDD/TDD alpha validation loop #1055 alpha-doc: this PR exercises Failure TDD (fix-the-storm) + Performance VDD (CPU/memory normalize on boot) + Contract TDD (new vitest tests). The PR body could spell out which classes per the new template — operator filling out the validation block from the doc would be doing it manually.

Tiny

  • DataListServerCommand normalizes params.fields AND params.select from CLI string→array. Both paths covered. Good.
  • UserCreateServerCommand deletes Joel's debugging comment block, but the context is preserved in git history and the architectural fix in ensurePersonaClient is cleaner.

LGTM modulo (1) — that's the only one with real failure-mode risk. (2)-(5) are polish/diagnostic.

@joelteply joelteply force-pushed the fix/startup-rag-index-backpressure branch from 51ef196 to e40c7c6 Compare May 7, 2026 21:06
@joelteply
Copy link
Copy Markdown
Contributor Author

Follow-up hardening pushed in e40c7c609 based on AIRC/PR review:

  • StartupAutonomousWorkGate now self-heals stale pause files that contain a dead owner PID.
  • waitUntilOpen() now has a bounded max wait and fails open loudly instead of blocking persona loops forever if an explicit pause is left set.
  • Added focused tests for stale pause-file recovery and bounded fail-open behavior.
  • Added maxResponders=2 coordination coverage to prove configured multi-responder ordering.
  • Added documentation to the shared Node-import allowlist explaining that it is a ratchet and new entries require architectural justification.
  • Clarified the PersonaUser startup catch-up behavior: restart is not a backlog replay moment by default; replay is explicit opt-in.

Validation after the follow-up:

  • npx vitest run tests/unit/startup-autonomous-work-gate.test.ts tests/unit/chat-coordination-stream.test.ts tests/unit/shared-node-boundary.test.ts --reporter=verbose
  • ./node_modules/.bin/tsc --noEmit --project . --pretty false
  • git diff --check
  • precommit on amend: TypeScript passed, staged lint baseline improved, no Rust files staged
  • pre-push on force-with-lease: TypeScript passed, lint baseline improved, no Rust/Docker changes in this delta

@joelteply
Copy link
Copy Markdown
Contributor Author

Codex update pushed: bc5e69b.\n\nScope:\n- Move persona chat runtime semantics to provider=local, resolved through Rust/Qwen/llama.cpp admission.\n- Keep Candle out of persona chat inference; Candle remains training/auxiliary/LoRA-adapter territory.\n- Rust persona catalog now admits 5 local personas including Vision AI.\n- Optional cloud personas are gated by real non-empty API keys from ~/.continuum/config.env.\n- collaboration/chat/poll now supports both afterMessageId and latest-room polling, so smoke tests don't require fragile data/list syntax.\n\nValidated locally:\n- cargo test --features metal,accelerate persona::allocator --lib: 12 passed.\n- npx tsc --noEmit --project tsconfig.json --pretty false: passed.\n- npm stop && npm start: up, seed reports 5 active local personas.\n- Live smoke: Qwen3.5 via in-process llama.cpp responded; chat/poll returned the reply after the marker.\n- pre-push: TypeScript clean, ESLint baseline improved, Rust build/tests passed with --features metal,accelerate.\n\nKnown follow-up / review focus:\n- Pre-push native-arch Docker image push failed; CI verify-architectures may block merge until scripts/push-current-arch.sh is run or the image path is fixed.\n- Startup still probes unavailable cloud providers even though seed marks them offline; should be gated next.\n- RAG/context remains too verbose (~21k-char prompt in smoke), and needs the Rust recipe/batching rewrite rather than more Node cleanup.\n\nPlease review architecture boundary, model/runtime selection, and whether anything here leaves Candle reachable as persona chat inference.

@joelteply
Copy link
Copy Markdown
Contributor Author

Updated review for bc5e69b10 (the 69-file architectural pivot, not just the storm fix).

Reviewing the 4 risk areas Codex flagged

(1) Candle as training/auxiliary only. ✅ Verified clean excision from persona chat path.

  • providerType union: 'candle' | 'local' | ...'local' | ... (dropped)
  • PersonaWorker.initializeProvider(): was 7 lines instantiating CandleGrpcAdapter → now empty stub with comment "Intentionally no local model initialization here. should-respond is handled by Rust fullEvaluate"
  • All if (persona.provider === 'candle') references in chat path are deletions, no additions
  • Constants.ts CANDLE block kept ONLY for historical training/auxiliary aliases (Qwen-gating + Qwen-default refs), not chat-routing
  • Net: chat-runtime can't accidentally route through Candle. Training paths can still use it.

(2) provider=local consistency. ✅ No TS-vs-Rust split.

  • All 4 local-chat personas in seed: provider: 'local', modelRef: SYMBOLIC_REFS.LOCAL_DEFAULT — same SSOT path my ci(carl-smoke): advisory-pass AI-reply on llvmpipe-only ICD #1042 phases 1-4 established
  • New if (persona.provider === 'local') branch routes through Rust admission via the registry-resolved model
  • LOCAL_MODELS.LEGACY_TO_HUGGINGFACE pruned -49 lines: Llama/Mistral/Phi/Gemma/StarCoder/TinyLlama/SmolLM2 all gone, only qwen3.5 + qwen2-vl + qwen2:0.5b (gating) remain. Alpha is explicitly Qwen-only locally — no silent re-route to a different model class

(3) SecretManager empty-key handling. ✅ Right shape, no env/config leak.

get(key, requestedBy) {
  this.ensureInitialized();
  this.logAccess(key, requestedBy);
  const value = this.secrets.get(key);
  return value && value.trim().length > 0 ? value : undefined;  // ← the fix
}
has(key) { return this.get(key, 'SecretManager.has') !== undefined; }  // ← delegates
getAvailableKeys() {
  return Array.from(this.secrets.entries())
    .filter(([, value]) => value.trim().length > 0)  // ← filters empties
    .map(([key]) => key);
}

Empty-string is uniformly treated as "not configured" across get / has / getBoolean / getNumber / getAvailableKeys. This kills the "DeepSeek key is empty placeholder → SecretManager says it's set → provider tries → 401" cycle from your earlier failure. No real values leak; the filter only drops empties, doesn't transform real secrets.

(4) chat/poll latest+marker.afterMessageId is now optional. When provided → filter timestamp > original.timestamp (existing behavior). When omitted → just return latest N messages. Both modes share the same query shape, no code duplication. Supports the airc/bridge assert-seen path cleanly.

Other architectural observations

  • CodebaseSearchSource +47 lines gating on technical/recipe intent: was firing on every chat. Now only fires when needed. That's a real load reduction independent of Stabilize startup persona backpressure #1058's storm fix. Good cohesion — same PR addresses BOTH "too many personas respond" (storm) AND "every persona does too much per response" (RAG over-eager).
  • conversationHistoryPoison.ts extracted (+58, ConversationHistorySource -56): clean module-extract refactor. Easier to test poison detection in isolation.
  • spawn-detached.mjs: same as previous commit's, still good.
  • Unit-test coverage added: CodebaseSearchSource.test.ts, ConversationHistorySource.test.ts, chat-coordination-stream.test.ts, service-initializer.test.ts, shared-node-boundary.test.ts. 5 new tests across the changed surface — better than my last review's "no new tests" complaint on the smaller version.

Risk left open

  • 69 files at +2390/-1040 is a lot for one PR. Bisect cost is high if a regression lands on canary post-merge. Worth discussing whether to split: storm-fix (smaller original) + Candle-removal could have been two PRs. Not blocking — already pushed and your real Qwen smoke passed — but document for future.
  • LOCAL_MODELS.LEGACY_TO_HUGGINGFACE pruning is breaking-by-omission: any external persona config or .continuum/config.env reference to llama3.2:3b etc. now silently fails to resolve. If anyone has a saved config with those aliases it'll break on update. Worth a CHANGELOG note + maybe an explicit "legacy alias removed, use 'qwen3.5' instead" error when the old name shows up.

LGTM ship.

@joelteply joelteply merged commit 98f1e06 into canary May 8, 2026
3 checks passed
@joelteply joelteply deleted the fix/startup-rag-index-backpressure branch May 8, 2026 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant