ci(docker-images): kill auto-rebuild loophole, add label-gated CI build #1040
Closed
joelteply wants to merge 2 commits into
Three changes, all reliability-positive:
1. docker-images.yml: delete rebuild-stale-amd64 + rebuild-stale-arm64
jobs that auto-rebuilt with SKIP_PHASE_0=1 whenever an image drifted
— silently shipped bits that never ran cargo test, defeating the
dev-side Phase 0 + Phase 2 gate. Replaced with single matrixed
ci-build-on-label job that fires only when an explicit
ci-build:<slice> label is on the PR. Per-arch + per-slice labels
keep the escape hatch granular.
2. carl-install-smoke.yml: pin CONTINUUM_IMAGE_TAG to PR-head short
SHA. Mutable :canary / :latest tags went 9-14 days stale in April
2026, silently passing smoke against pre-fix bits. SHA-pinned can
never go stale by definition.
3. verify-image-revisions.sh: failure message names the simple dev fix
(`cd src && npm run docker:push`) AND the label escape hatch.
Recognized labels: ci-build:{amd64,arm64,vulkan,cuda,core,livekit-bridge,all}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
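The label gate described above can be illustrated with a small sketch. The real gate lives in docker-images.yml, which is not shown in this thread, so `slicesFromLabels` and the constants below are hypothetical names, and the expansion of `ci-build:all` to every listed slice is an assumption.

```typescript
// Hypothetical sketch: the actual gate is a workflow job, not this function.
const LABEL_PREFIX = "ci-build:";

// Slice names copied from the recognized-label list in the commit message.
const SLICES = ["amd64", "arm64", "vulkan", "cuda", "core", "livekit-bridge"] as const;
type Slice = (typeof SLICES)[number];

// Returns the slices a PR explicitly opted into; an empty result means "build nothing".
// Assumption: `ci-build:all` expands to every slice above.
function slicesFromLabels(labels: readonly string[]): Slice[] {
  const requested = new Set<Slice>();
  for (const label of labels) {
    if (!label.startsWith(LABEL_PREFIX)) continue; // unrelated label
    const name = label.slice(LABEL_PREFIX.length);
    if (name === "all") {
      SLICES.forEach((s) => requested.add(s));
    } else if ((SLICES as readonly string[]).indexOf(name) !== -1) {
      requested.add(name as Slice);
    }
    // Unknown ci-build:* names are ignored rather than guessed at.
  }
  // Keep declaration order so the resulting build matrix is stable across runs.
  return SLICES.filter((s) => requested.has(s));
}
```

With no `ci-build:` label on the PR the function returns an empty list, which matches the intent of the change: nothing is rebuilt in CI unless someone asks for it explicitly.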
continuum-core enforces "lack of GPU integration is forbidden" and panics at startup on any host where no Metal/CUDA/Vulkan device is reachable in the container. Mac Docker Desktop has no GPU passthrough → arm64 core boot-panics. Same for Linux arm64 (Pi/Jetson) without explicit ICD setup.
The variant is unshippable as-architected and is being deprecated in PR #1038 (drop core variant). Until #1038 lands and removes the variant entirely, push-current-arch.sh should not try to build/push it from any host. Otherwise every Mac/Pi push attempt eats Phase 0 cargo test cycles, builds the image, then fails Phase 2 slice tests at boot — wasting ~25 min for a guaranteed failure.
Repeatable for future Mac/Windows Claude sessions: `cd src && npm run docker:push` now succeeds with just the variants the host can actually ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply pushed a commit that referenced this pull request on May 4, 2026
…forget)
Two bugs in docker-entrypoint.ts caught by Carl-install-smoke on this PR:
1. Auto-seed used `setTimeout(5000)` with NO synchronization → /health
returned 200 before any room/persona existed. Smoke chat probe at +52s
raced with seed and got "Room not found: general" silently.
2. Seed errors were swallowed to console.warn → installs landed in
permanent unrecoverable state ("server up, no rooms") with no signal
to Carl that the system is broken.
Fix: seed now BLOCKS before the "Server ready" log line. Seed failure
exits the process with code 1 (server cannot serve chat without seeded
rooms — better to crashloop than silently lie). Eliminates a class of
swallowed-error / silent-success bugs Joel called out in the global
"Never swallow errors" rule.
Also pins carl-install-smoke.yml CONTINUUM_IMAGE_TAG to PR-head SHORT_SHA
so smoke pulls the image built from THIS PR's source (matches the
structural-fix change in PR #1040). Without the pin, smoke would pull
:latest (mutable, last week's bits) and never see this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
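The ordering this commit intends (seed first, "Server ready" second, non-zero exit on seed failure) can be sketched as below. docker-entrypoint.ts is not shown in this thread, so `runStartup`, `StartupDeps`, and the log wording for the failure case are hypothetical. The sketch only illustrates the ordering; a later commit in this thread notes that awaiting seed in the entrypoint did not by itself stop the WebSocket port from being bound first.

```typescript
// Hypothetical names throughout; this is not the code from docker-entrypoint.ts.
interface StartupDeps {
  seed: () => Promise<void>;
  log: (line: string) => void;
  exit: (code: number) => void;
}

// Seed is awaited as a startup milestone. "Server ready" is logged only after
// seed resolves; a seed failure exits non-zero instead of being downgraded to a warning.
async function runStartup({ seed, log, exit }: StartupDeps): Promise<void> {
  try {
    await seed();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`Seed failed: ${reason}`);
    exit(1); // crashloop rather than serve chat with no rooms
    return;
  }
  log("Server ready");
}
```

Passing `log` and `exit` in as dependencies is a choice made for the sketch so the ordering can be exercised without a real process; an entrypoint would more likely pass `console.log` and `process.exit`.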
joelteply added a commit that referenced this pull request on May 5, 2026
* fix(install): drop core variant, default to vulkan (Task #98) — closes Carl install on no-GPU Linux
  Vulkan + mesa llvmpipe ICD satisfies Joel's 'GPU integration is forbidden to fall back' rule. Binary exercises real Vulkan API loader; llvmpipe provides software ICD on no-GPU hosts. Smoke unblocked.
  - docker-compose.yml: continuum-core uses continuum-core-vulkan image + Dockerfile
  - install.sh: warn on Linux+noGPU when vulkaninfo missing or zero-devices
  - workflow: pre-install mesa-vulkan-drivers + vulkan-tools on ubuntu-latest
  b69f drives image build/push side (continuum-core-vulkan multi-arch + canary→latest).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(slices): add Vulkan runtime-use + IPC-reports-gpu probes (Joel: 'good integration tests for vulkan layers')
  The existing vulkan slice only proved (a) the loader enumerates a device and (b) the binary statically links libvulkan. That's necessary but not sufficient — a binary can pass both yet skip GPU enumeration at runtime (broken feature flag) or panic silently before logging.
  Two new probes close the loop:
  - vulkan-runtime-used-by-core: poll docker logs for 30s for the GpuMemoryManager 'GPU detected: <name> — <N>MB VRAM' line. Proves the binary actually walked through the loader at runtime, not just in ldd.
  - vulkan-ipc-reports-gpu: nc the unix socket and call gpu/stats over IPC. Verifies the runtime contract — manager initialized, claimed memory, and surfaces a non-zero total_vram_mb to clients. Skipped (not failed) when nc isn't in the runtime image — slice 3 still covers runtime-use via boot logs.
  Slice tests now cover the full vulkan stack: linker (slice 2), loader (slice 1), runtime detection (slice 3), runtime contract (slice 4). Bevy/wgpu render + ggml-vulkan inference probes (deeper layers 5+6) are follow-up work — heavier, need scaffold + model download.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(seed): make auto-seed a blocking startup milestone (was fire-and-forget)
  Two bugs in docker-entrypoint.ts caught by Carl-install-smoke on this PR:
  1. Auto-seed used `setTimeout(5000)` with NO synchronization → /health returned 200 before any room/persona existed. Smoke chat probe at +52s raced with seed and got "Room not found: general" silently.
  2. Seed errors were swallowed to console.warn → installs landed in permanent unrecoverable state ("server up, no rooms") with no signal to Carl that the system is broken.
  Fix: seed now BLOCKS before the "Server ready" log line. Seed failure exits the process with code 1 (server cannot serve chat without seeded rooms — better to crashloop than silently lie). Eliminates a class of swallowed-error / silent-success bugs Joel called out in the global "Never swallow errors" rule.
  Also pins carl-install-smoke.yml CONTINUUM_IMAGE_TAG to PR-head SHORT_SHA so smoke pulls the image built from THIS PR's source (matches the structural-fix change in PR #1040). Without the pin, smoke would pull :latest (mutable, last week's bits) and never see this fix.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(smoke): pin CONTINUUM_IMAGE_TAG to :pr-N (not SHA) for multi-slice coord
  SHA-pin in prior commit hit the multi-slice + multi-host coordination problem: dev on Mac arm64 can push node/widgets/model-init at HEAD SHA but vulkan/cuda need bigmama (linux/amd64). With SHA-pin, smoke tries to pull every slice at the SHA — slices the dev couldn't push are missing, docker compose pull hangs.
  :pr-N is PR-scoped mutable: refreshed by push-image.sh on every dev push, so always reflects this PR's latest source — but never collides with another PR or canary. For slices unchanged by the PR (e.g. vulkan when PR only touches install.sh), dev aliases :canary -> :pr-N via docker buildx imagetools create (manifest copy, no rebuild).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chat/send): fall back to seeded human owner when senderId doesn't resolve
  The CLI auto-injects a session-scoped UUID as params.userId. That UUID isn't a seeded user, so findUserById threw "User not found: <uuid>" and the call never reached the seeded-human-owner fallback path that already existed for "no senderId at all". Net effect: every Carl-install-smoke chat probe failed with the wrong error after the seed-blocking fix landed (commit 160e5ba).
  Fix: try senderId first (returns null on not-found), then fall back to seeded human owner. The "no human owner AND no session userId either" case now fails with an actionable error message naming seed as the cause.
  Caught by carl-install-smoke on PR #1038 run 25331526438.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(install): wait for seed to populate default room before declaring ready
  widget-server /health only proves that container is up. node-server runs auto-seed in docker-entrypoint.ts which creates the "general" room + personas — but the WebSocket server is bound BEFORE seed runs, so install.sh's "Continuum is running" + chat probe both raced ahead of seed completion. Smoke caught it: chat/send returned "Room not found: general" silently.
  The earlier docker-entrypoint.ts blocking-seed fix delays the "Server ready" log line but doesn't actually block command serving (orchestrate binds the WebSocket port before my seed call). Real fix is install.sh waiting for the seeded room to actually exist via jtag data/list — fast, no new endpoint, deterministic.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(seed): readiness-file + HEALTHCHECK gate so widget-server blocks on seed
  Replaces my earlier "blocking seed in entrypoint" fix that didn't actually block (orchestrate binds the WebSocket port BEFORE the entrypoint await).
  New pattern:
  - orchestrate('cli-command') runs seed INLINE as a milestone — not after
  - on success, entrypoint writes /root/.continuum/run/node-server.ready
  - Dockerfile HEALTHCHECK tests for that file + WebSocket port
  - docker-compose: widget-server depends_on node-server: service_healthy
  - install.sh waits for widget-server /health → cascades through node-server health → cascades through seed → cascades through orchestrate
  Net: install.sh's "Continuum is running" now genuinely means seed is done. Carl chat works on first attempt. Install.sh's separate jtag-wait gate from prior commit becomes belt-and-suspenders (still useful if HEALTHCHECK breaks).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(smoke): capture per-container docker logs on failure
  Existing artifact upload had install.log + page + chat — none of which show why continuum-core / node-server didn't reply. The "no AI reply within 300s" failure on PR #1038 had ZERO evidence of the actual inference-path failure because the docker container logs were dropped on smoke teardown.
  Now: on failure, dump per-container logs (continuum-core, node-server, model-init, widget-server, livekit-bridge) + compose ps state to artifact. Next failure surfaces the actual root cause instead of just the wrapper-script timeout.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(smoke): capture docker logs INSIDE teardown before compose down
  Workflow's if-failure docker-logs step fired AFTER smoke exit when containers were already gone (smoke trap → docker compose down → my step finds dead containers). Move the capture INSIDE smoke's teardown so logs are dumped from live containers BEFORE compose down. Without this the per-container log artifacts are empty even when the workflow step runs.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(smoke): headless screenshot of root page — Joel's question 'is the UI even loading'
  curl gives the server-rendered HTML shell (866 bytes valid HTML — fine). But the actual chat UI loads via JS — could be blank chat with no personas / empty room / silent JS error and curl wouldn't catch it.
  Add chromium-headless capture after the curl page-validate step (waits 8s for JS to render). Saves to /tmp/carl-smoke-*.page.png + uploaded in the failure artifact alongside docker logs.
  Non-fatal: if no chromium on PATH, just warns. ubuntu-latest GHA runners have google-chrome-stable preinstalled so smoke captures it. Local devs can install chromium for the same evidence.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(models): single source of truth — src/shared/models.json + registry-driven model-init
  Joel 2026-05-04: "all the models must download and run on GPU" + "we MUST have this work from ONE source of truth" + "update the existing seeded values so the personas PICK UP THE MODEL change and arent stuck in the past".
  This is the architectural fix for the fragmented model spec:
  - install.sh had hardcoded PERSONA_MODEL strings
  - download-voice-models.sh had hardcoded URLs
  - src/system/shared/Constants.ts had LOCAL_MODELS const
  - src/workers/continuum-core/.../model_registry.json was Rust-only
  - personas.ts had per-persona modelId baked in
  5 places, 5 sources of drift.
  Replaced by ONE file: src/shared/models.json
  - models{}: every model (chat / vision / embedding / STT / TTS / VAD) with kind, hf_repo, files[], size_gb, min_ram_gb, chat_template
  - tiers{}: mba/mid/full → default_chat (registry key)
  - symbolic_refs{}: 'local-default' (tier-resolved), 'vision-default', 'gating' — what personas store in DB
  - personas{}: displayName → symbolic ref
  - auto_download{}: always[] + by_tier[] — what model-init pulls
  - chat_templates{}: moved from Rust-only registry
  Added in this commit:
  src/shared/ModelRegistry.ts
  - load(), tierFromRamGB(), resolveModel(ref, tier), resolvePersonaModel(name, tier), downloadSetForTier(tier), allPersonaRefs(), symbolicRefForPersona(name).
  - Personas store SYMBOLIC refs in DB, not concrete IDs. Edit models.json → next inference call resolves to new model. No DB migration needed.
  src/scripts/download-models.sh
  - Walks registry via jq, downloads always[] + tier-set into /models.
  - Replaces hardcoded curl URLs in download-voice-models.sh.
  - Each model.files[] resolved to https://huggingface.co/<repo>/resolve/main/<file>.
  - candle-builtin format skipped (continuum-core loads in-process).
  docker/model-init.Dockerfile
  - Adds jq dependency.
  - Copies shared/models.json + scripts/download-models.sh.
  - CMD: download-models.sh + download-avatar-models.sh (avatars stay separate — distinct from ML models).
  - download-voice-models.sh COPY removed (superseded).
  NEXT COMMITS in this PR series:
  - install.sh: delete docker-model-pull block, read tier+default from registry via jq. Drops DMR dependency.
  - personas.ts: use symbolic refs ('local-default' for Helper/Teacher/CodeReview/Local Assistant; 'vision-default' for Vision AI).
  - CandleAdapter: accept symbolic refs, resolve via registry at request time.
  - continuum-core: read src/shared/models.json (replace inference/model_registry.json with thin pointer to shared file).
  - Reconciler in seedDatabase(): on every startup, walk persona rows; if modelRef field missing or differs from registry, UPDATE. Idempotent — no-op when already current.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(models): personas use symbolic refs; seed resolves via registry; constants not magic strings
  Phase 2 of single-source-of-truth model registry (Phase 1: 2adc3d5).
  src/shared/ModelRegistry.ts:
  - Add SYMBOLIC_REFS const enum (LOCAL_DEFAULT, VISION_DEFAULT, GATING) + TIERS const (MBA/MID/FULL). Joel rule 2026-05-04: "define constants not magic strings". Code uses these — never hardcode the bare strings.
  src/scripts/seed/personas.ts:
  - PersonaConfig adds modelRef?: string field (symbolic ref into src/shared/models.json).
  - Helper / Teacher / CodeReview / Local Assistant: switch from `modelId: LOCAL_MODELS.DEFAULT` to `modelRef: SYMBOLIC_REFS.LOCAL_DEFAULT`.
  - Vision AI: `modelRef: SYMBOLIC_REFS.VISION_DEFAULT`.
  - Old modelId field kept as legacy/cached. CandleAdapter (next commit) will prefer modelRef and resolve via registry at request time.
  src/server/seed-in-process.ts:
  - Resolves config.modelRef → concrete hf_repo via ModelRegistry at seed time. Stores resolved value in users.modelConfig.model so existing CandleAdapter unchanged. When src/shared/models.json edits the underlying model for a tier, every startup re-resolves and the refresh-on-mismatch path UPDATES the persona row. No DB migration script needed — seeded personas auto-update when registry changes.
  install.sh:
  - Removed two `docker model pull` calls (DMR persona model + MLX vLLM variant). Both are superseded by the model-init container reading src/shared/models.json. Per Joel 2026-05-04: "all the models must download and run on GPU" — no DMR dependency. KV-cache cap and vLLM install blocks remain (still useful tuning when DMR present, no-op otherwise).
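The symbolic-ref resolution these commits describe can be sketched as follows. The field names (`models`, `tiers`, `default_chat`, `symbolic_refs`, `personas`, `hf_repo`) come from the commit messages, but the exact schema of src/shared/models.json is not shown in this thread, so the shape of a `symbolic_refs` entry, every value in the sample registry, and the extra `reg` parameter are assumptions made for illustration.

```typescript
// Illustrative shapes only; the real models.json schema may differ.
type Tier = "mba" | "mid" | "full";

interface Registry {
  models: Record<string, { hf_repo: string }>;
  tiers: Record<Tier, { default_chat: string }>;
  // Assumption: a symbolic ref is either tier-resolved or points at one registry key.
  symbolic_refs: Record<string, { tier_resolved: true } | { model: string }>;
  personas: Record<string, string>; // displayName -> symbolic ref
}

// Placeholder data, not the contents of the real registry.
const registry: Registry = {
  models: {
    "small-chat": { hf_repo: "example-org/small-chat" },
    "large-chat": { hf_repo: "example-org/large-chat" },
    "vision": { hf_repo: "example-org/vision" },
  },
  tiers: {
    mba: { default_chat: "small-chat" },
    mid: { default_chat: "small-chat" },
    full: { default_chat: "large-chat" },
  },
  symbolic_refs: {
    "local-default": { tier_resolved: true },
    "vision-default": { model: "vision" },
  },
  personas: { Helper: "local-default", "Vision AI": "vision-default" },
};

// Resolve a symbolic ref or registry key to a concrete hf_repo.
// Anything the registry does not know is passed through unchanged.
function resolveModel(reg: Registry, ref: string, tier: Tier): string {
  const symbolic = reg.symbolic_refs[ref];
  let key = ref;
  if (symbolic) {
    key = "model" in symbolic ? symbolic.model : reg.tiers[tier].default_chat;
  }
  const entry = reg.models[key];
  return entry ? entry.hf_repo : ref;
}

function resolvePersonaModel(reg: Registry, name: string, tier: Tier): string {
  const ref = reg.personas[name];
  if (ref === undefined) throw new Error(`Unknown persona: ${name}`);
  return resolveModel(reg, ref, tier);
}
```

Because a persona row holds only the symbolic ref, editing `tiers.mba.default_chat` in the registry changes what `resolvePersonaModel(registry, "Helper", "mba")` returns on the next call, which is the "no DB migration" property the commits rely on.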
  Remaining phases:
  - CandleAdapter: prefer modelRef, resolve at request time (eliminates every cached-modelId codepath once stable).
  - Rust continuum-core: read src/shared/models.json instead of the Rust-only inference/model_registry.json.
  - download-voice-models.sh: delete (superseded by download-models.sh).
  - LOCAL_MODELS const in Constants.ts: reduce to thin re-export of SYMBOLIC_REFS.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(models): CandleAdapter resolves symbolic refs at request time
  Phase 3 of the SSoT model registry work. CandleAdapter now accepts:
  - symbolic refs ('local-default', 'vision-default', 'gating')
  - registry keys ('qwen3.5-4b-code-forged')
  - legacy short names ('llama3.2:3b')
  - raw HF IDs
  All resolved per-request through ModelRegistry.resolveModel(), so DB rows storing symbolic refs auto-pick-up registry edits without migration. Tier resolved once at construction from totalmem().
  Also: build-with-loud-failure copies shared/models.json into dist/ so __dirname-relative reads resolve at runtime (tsc skips JSON).
  Joel rule 2026-05-04: "we MUST have this work from ONE source of truth".
* feat(models): Rust reads same src/shared/models.json — one SSOT for both runtimes
  Phase 4 of the model-registry SSOT collapse (Joel 2026-05-04: "we MUST have this work from ONE source of truth").
  continuum-core's inference/candle_adapter no longer ships its own embedded model_registry.json. The same src/shared/models.json that TS, install.sh, and download-models.sh consume is now embedded into the Rust binary at compile time via include_str!. resolve_model_id() understands symbolic refs ('local-default' / 'vision-default' / 'gating') and resolves them via tiers + symbolic_refs identical to ModelRegistry.ts. Tier auto-detected from host RAM (Linux: /proc/meminfo, macOS: sysctl hw.memsize, fallback: mba).
  Schema:
  - ModelRegistryEntry renames repo→hf_repo and min_memory_gb→min_ram_gb to match the SSOT shape.
    Legacy field names accepted via #[serde(alias = ...)] so any out-of-tree consumer of the old embedded JSON keeps deserializing.
  - New fields kind / files / size_gb / auto_load reflect the SSOT, all optional.
  - Extra top-level keys (tiers / symbolic_refs / personas / auto_download / chat_templates) silently ignored by ModelRegistry's serde shape but consumed by the internal FullRegistry view used for symbolic resolution.
  Compatibility:
  - Added 'coder' and 'coder-bf16' entries to src/shared/models.json so live callers (LocalModelRouter via LOCAL_MODELS.CODING_AGENT) keep resolving.
  - Removed dead 'smollm2' / 'llama3.2:3b' assertions from test_resolve_chat_template (callers were docs-only).
  - Added test_resolve_model_id_symbolic_refs covering all three symbolic refs + direct registry-key lookup + raw HF passthrough.
  Build:
  - Deleted workers/continuum-core/src/inference/model_registry.json (dead).
  - TS bindings regenerated: ModelRegistryEntry.ts now exports hf_repo, min_ram_gb, kind, files, size_gb, auto_load (no TS consumer references the old field names — verified via grep).
  - cargo test --lib --features metal,accelerate inference::candle_adapter → 10/10 pass including the new resolution test.
  - npm run build:ts clean.
  Net: persona DB rows storing 'local-default' resolve through the same JSON whether the request enters via TS CandleAdapter or Rust candle_adapter — registry edits propagate everywhere on next inference call without DB migration.
* ci(carl-install-smoke): fix workflow_dispatch tag resolution + add image_tag input
  The bare interpolation `pr-${{ github.event.pull_request.number }}` resolved to `pr-` (empty after dash) on workflow_dispatch, since there's no PR context. install.sh then couldn't find the tag in the registry, fell through to its 'will build locally' branch, and ran a full Rust compile of continuum-core-vulkan on the no-GPU ubuntu-latest runner — which hit the 25-min runner cap (observed in run 25400718464).
  Resolution priority is now: PR# > input.image_tag > 'canary'. Manual triggers from the workflow UI default to ':canary' (the cadence we publish on) and accept an `image_tag` input override for testing specific tags (':latest', ':pr-N', or sha-prefix).
  Diagnosis + patch shape from continuum-8e97 on Windows after they hit the regression while running (c) carl-install-smoke from this PR's tip 342075a. YAML-only change, no behavior shift for PR-triggered runs.
  Co-Authored-By: continuum-8e97 <continuum-8e97@cambriantech.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: continuum-8e97 <continuum-8e97@cambriantech.com>
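The tag resolution priority from the last commit in the list above (PR# > input.image_tag > 'canary') can be sketched as a small function. The real logic is a YAML expression in carl-install-smoke.yml, which is not shown in this thread; `resolveImageTag` and `TriggerContext` are hypothetical names, and treating a blank input as "not provided" is an assumption.

```typescript
// Hypothetical helper mirroring the described priority; not the workflow's actual expression.
interface TriggerContext {
  prNumber?: number;      // present on pull_request events only
  imageTagInput?: string; // workflow_dispatch `image_tag` input, may be empty
}

function resolveImageTag(ctx: TriggerContext): string {
  // 1. PR-triggered runs always use the PR-scoped tag.
  if (ctx.prNumber !== undefined) return `pr-${ctx.prNumber}`;
  // 2. Manual runs may override the tag through the input.
  const input = (ctx.imageTagInput ?? "").trim();
  if (input !== "") return input;
  // 3. Otherwise fall back to the published cadence tag.
  return "canary";
}
```

The explicit `undefined` check is what avoids the bare `pr-` result described in the commit: with no PR context the function never builds a `pr-` string at all.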
Superseded by #1043 (kill auto-rebuild, merged 2026-05-06 00:08Z) + #1045 (verify-image-revisions DEFAULT_IMAGES drop continuum-core, merged 2026-05-06 03:26Z). Per #1043's own description: "clean replacement for stale PR #1040, which was branched before current canary and now drags in unrelated model/install diffs." The label-gated escape hatch could come back in a fresh PR off current canary if needed; keeping #1040 open with three conflicts isn't earning anything.
Summary
Replaces CI auto-rebuild (the optimization-defeating loophole Joel called out today) with an explicit-intent label gate. Also pins smoke to PR-head SHA so it can never silently pull stale bits.
Three changes
1. .github/workflows/docker-images.yml: delete `rebuild-stale-amd64` + `rebuild-stale-arm64` jobs. They auto-rebuilt with `SKIP_PHASE_0=1` whenever an image drifted, silently shipping bits that never ran cargo test. Replaced with a single matrixed `ci-build-on-label` job that fires only when an explicit `ci-build:<slice>` label is on the PR. Per-arch + per-slice labels keep the escape hatch granular.
2. .github/workflows/carl-install-smoke.yml: pin `CONTINUUM_IMAGE_TAG` to PR-head short SHA. Mutable `:canary` / `:latest` tags went 9-14 days stale in April 2026, silently passing smoke against pre-fix bits. SHA-pinned can never go stale by definition: the image either exists at this exact ref or smoke fails with `manifest unknown` and the dev gets the actionable error.
3. scripts/verify-image-revisions.sh: failure message names the simple dev fix (`cd src && npm run docker:push`) AND the label escape hatch.
Recognized labels
What this DOESN'T change
Test plan
🤖 Generated with Claude Code