docs: retarget deploy docs from Cloud Run to k3s + Argo CD #19

Merged: robbiebyrd merged 3 commits into main from docs/k3s-argocd-deploy on Jun 15, 2026
@robbiebyrd

Collaborator

Summary

Production moved off Google Cloud Run to a self-hosted 2-node k3s cluster (nodes joined over Tailscale/WireGuard). The build (GitHub Actions → GHCR) and rollout (Argo CD GitOps) layers are unchanged — only the runtime substrate changed. The deploy docs still described the old Cloud Run + Memorystore + VPC Connector + Cloud Build pipeline, so they're rewritten to match.

All details were verified against the live manifests in Keeping-History/infra (apps/time-machine/), not assumed.

Changes

  • README.md — replace the "Deployment (Google Cloud Run)" section with GHCR + Argo CD: single-replica Deployment, gcsfuse native sidecar, in-cluster Redis, Ingress-terminated TLS/WSS.
  • docs/deployment.md — full rewrite for k3s / Argo CD / GHCR:
    • gcsfuse native sidecar (initContainer w/ restartPolicy: Always) mounting tm-cache-723408812472
    • GCS key in its own gcs-sa-key Secret (not time-machine-secrets)
    • Redis on a local-path PVC with --appendonly
    • pod MTU = 1280 gotcha (WireGuard overhead black-holes large TLS egress packets → UND_ERR_CONNECT_TIMEOUT)
    • ProxyMesh section flagged currently disabled in prod (direct egress)
  • docs/post-deploy.md — translate gcloud checks to kubectl; replace the Cloud Run flags check with rollout / image / sidecar-health verification; rollback via kubectl rollout undo / argocd app rollback.
  • CLAUDE.md — note the cluster is self-hosted k3s (not GKE) + the MTU gotcha.
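
The MTU item above comes down to header arithmetic, which can be sketched as follows. This is an illustrative calculation only: the header sizes are standard protocol constants, while the 1500-byte underlay MTU and every function name are assumptions, not values taken from the manifests.

```typescript
// Header sizes in bytes. These are protocol constants, not values from the PR.
const IPV4_HEADER = 20;
const IPV6_HEADER = 40;
const UDP_HEADER = 8;
// WireGuard data message: 4 (type) + 4 (receiver index) + 8 (counter) + 16 (auth tag).
const WIREGUARD_HEADER = 32;
const TCP_HEADER = 20; // without TCP options

type OuterIp = "ipv4" | "ipv6";

// Bytes WireGuard adds around every tunneled packet.
function wireguardOverhead(outer: OuterIp): number {
  const ipHeader = outer === "ipv4" ? IPV4_HEADER : IPV6_HEADER;
  return ipHeader + UDP_HEADER + WIREGUARD_HEADER;
}

// Largest inner packet that fits in one underlay frame without fragmentation.
// underlayMtu is an assumption here (1500 is the common Ethernet value).
function largestInnerPacket(underlayMtu: number, outer: OuterIp): number {
  return underlayMtu - wireguardOverhead(outer);
}

// Largest TCP payload per segment for an IPv4 pod interface with the given MTU.
function maxTcpPayload(podMtu: number): number {
  return podMtu - IPV4_HEADER - TCP_HEADER;
}
```

Under these assumptions a pod interface left at 1500 emits packets larger than the tunnel can carry (1440 bytes over IPv4, 1420 over IPv6). If the "fragmentation needed" signal never reaches the sender, large TLS records are dropped silently, which is consistent with the timeout described above. A pod MTU of 1280, the minimum every IPv6 link must support, stays safely below both limits.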

Legacy Cloud Build / Cloud Run artifacts (cloudbuild.yaml, deploy.sh, .gcloudignore) are left in-tree, labeled reference-only.

Notes

  • Docs only — no code changed; nothing to build or test.
  • .env.prod was intentionally not included (project rules forbid committing it).

🤖 Generated with Claude Code

Robbie Byrd and others added 3 commits June 15, 2026 17:16
…can't kill the process

The recursive crawl exposed a process-fatal hazard. A 200 response body is
returned as a Node Readable carrying the request's armed AbortSignal.timeout.
If a consumer abandons it before reading (cache.writeStream throws at
mkdir/rename on a path collision), the timeout fires ~10s later, aborts the
dangling stream, and the unhandled 'error' becomes an uncaughtException. Since
the HTTP server and BullMQ workers share one process, the whole server dies
(CrashLoopBackOff).

Attach a bound no-op 'error' listener to the returned body so an orphaned
stream can never surface as an uncaughtException; real consumers use
stream.pipeline and still observe/propagate the error. Also cancel the body on
4xx/5xx paths so a non-200 response can't leave a stream dangling either.
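
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `guardBody` and `discardBody` are hypothetical names, and `destroy()` stands in for whatever cancellation call the real client uses.

```typescript
import { Readable } from "node:stream";

// Shared no-op listener. It exists only so that an 'error' event on an
// abandoned body always has at least one listener.
const ignoreOrphanError = (): void => {};

// Hypothetical helper: make a response body safe to abandon.
function guardBody<T extends Readable>(body: T): T {
  // An 'error' event with no listeners is rethrown by EventEmitter and
  // surfaces as an uncaughtException. Listeners added later (for example by
  // stream.pipeline) still run, so real consumers keep seeing the error.
  body.on("error", ignoreOrphanError);
  return body;
}

// Hypothetical helper for 4xx/5xx paths: release the body instead of leaving
// it dangling until the request timeout fires.
function discardBody(body: Readable): void {
  guardBody(body);
  body.destroy();
}
```

The no-op listener does not swallow anything: EventEmitter calls every registered 'error' listener, so a consumer that attaches its own handling still observes the failure.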

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re-child case

A URL namespace lets a node be both a leaf (the `/a` page) and an internal node
(parent of `/a/b`), but a filesystem name can't be both a file and a directory.
The recursive crawl hit this constantly: when a child URL is cached before its
parent page, the slot at `<root>/a` becomes a directory, and writing the `/a`
page then fails EISDIR on rename — and lookup would even return the directory as
a file body (EISDIR on read).

- writeStream: on EISDIR, store the page at `<dest>/index.html` — the
  directory-index form lookup already probes.
- lookup: when the primary path is a directory, skip it and fall through to the
  `<path>/index.html` probe instead of returning a directory as a file.

No cache migration: existing file-form entries are still found by the primary
probe; both forms coexist. The inverse direction (a page cached as a file, then
a child needs that name as a directory) still fails the child's write gracefully
(no crash, served live on demand) — a deeper canonicalization fix is tracked
separately.
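
The two changes above can be sketched as follows. This is a simplified illustration under stated assumptions: `commitToCache` and `lookup` are hypothetical stand-ins for the real `writeStream` and `lookup`, the page is assumed to be fully written to a temp file before the rename, and the EISDIR behavior is that of rename(2) on Linux.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Hypothetical final step of a cache write: move a finished temp file into
// its slot, falling back to the directory-index form on EISDIR.
async function commitToCache(tmpPath: string, dest: string): Promise<string> {
  try {
    await fs.rename(tmpPath, dest);
    return dest;
  } catch (err) {
    // Only the "slot is already a directory" case is handled here.
    if ((err as NodeJS.ErrnoException).code !== "EISDIR") throw err;
    const indexPath = path.join(dest, "index.html");
    await fs.rename(tmpPath, indexPath);
    return indexPath;
  }
}

// Hypothetical lookup: probe the primary path first, then the
// directory-index form, and never return a directory as a file body.
async function lookup(primary: string): Promise<string | null> {
  for (const candidate of [primary, path.join(primary, "index.html")]) {
    try {
      const stats = await fs.stat(candidate);
      if (stats.isFile()) return candidate;
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      // A missing entry just means "try the next form".
      if (code !== "ENOENT" && code !== "ENOTDIR") throw err;
    }
  }
  return null;
}
```

Because the primary path is probed first, existing file-form entries resolve exactly as before, which is why both forms can coexist without a migration.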

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Production moved off Cloud Run to a self-hosted 2-node k3s cluster
(joined over Tailscale/WireGuard). The build (GHCR) and rollout
(Argo CD GitOps) layers are unchanged; only the runtime substrate
changed. Rewrite the deploy docs to match, verified against the live
manifests in Keeping-History/infra (apps/time-machine/):

- README: replace the Cloud Run deploy section with GHCR + Argo CD;
  single-replica Deployment, gcsfuse native sidecar, in-cluster Redis.
- docs/deployment.md: full rewrite (k3s/Argo CD/GHCR). Native gcsfuse
  sidecar mounting tm-cache-723408812472; GCS key in its own gcs-sa-key
  Secret (not time-machine-secrets); Redis on a local-path PVC; pod
  MTU=1280 gotcha for WireGuard TLS egress; ProxyMesh currently disabled.
- docs/post-deploy.md: translate gcloud checks to kubectl; replace the
  Cloud Run flags check with rollout/image/sidecar-health verification.
- CLAUDE.md: note the cluster is self-hosted k3s (not GKE) + the MTU gotcha.

Legacy Cloud Build/Cloud Run artifacts left in-tree, labeled reference-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robbiebyrd robbiebyrd merged commit 3439f2b into main Jun 15, 2026
1 check passed