docs: add AWS deployment guide for hermes claw #71
Adds docs/deploying-to-aws.md alongside the existing fly.io and Kubernetes guides, covering two production-realistic AWS paths and a pointer to EKS:

- ECS on Fargate (recommended): EFS for persistent state, Secrets Manager for keys, `aws ecs execute-command` (SSM Session Manager) for shell-in. Maps 1:1 to the k8s manifest's Deployment, PVC, Secret, and `kubectl exec` semantics.
- EC2 + Docker + SSM (lightweight): single t4g.small with a systemd unit pulling secrets from Secrets Manager at boot. SSM Session Manager replaces SSH; no inbound 22, no keys.
- EKS: one-paragraph note that the existing k8s manifest deploys unmodified (EBS CSI driver is the only AWS-specific concern).

The doc explicitly preserves the project's "don't build a derived image" stance (linking the fly.io rationale) and provides the AWS-native equivalent injection points for both options.

Briefly explains why App Runner (no persistent volume), Elastic Beanstalk (wrong abstraction for a single-replica daemon), Amplify (wrong product entirely), and Lambda (15-min execution cap) were deliberately excluded.

Updates the README's "Deploying hermes as a claw" section to link the new guide as the third deployment target.

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end testing against a real AWS account (us-east-1) surfaced two bugs in the original guide that would break the documented deploy flow:

1. ARM64 Fargate AZ availability: the original `Subnets[0]` picker was non-deterministic and could grab us-east-1e or us-east-1f, where ARM64 Fargate is not supported. Task placement then fails with the cryptic "The required capabilities cannot be supported on requested platform" error. The subnet picker now filters to AZs a/b/c/d with an inline comment explaining why, plus a parenthetical added to the existing ARM64 recommendation callout under the task definition.
2. EC2 bind-mount ownership: `mkdir -p /var/lib/hermes-claw` left the directory owned by root. When Docker bind-mounts it into the container, the in-container harness user (uid 1000) cannot write to it, so entrypoint-hermes.sh's `cp -rn` first-boot seed crash-loops with "Permission denied". The user-data script now does `chown 1000:1000 /var/lib/hermes-claw` immediately after the mkdir, with an inline comment explaining the consequence of omitting it.

Co-authored-by: Cursor <cursoragent@cursor.com>
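The AZ filter from item 1 can be sketched as a small shell function. This is a minimal sketch, not the snippet from the guide: `pick_arm64_subnet` is a hypothetical helper, and the `a` to `d` suffix list reflects what this PR's testing found in one us-east-1 account (AZ letters can map to different physical zones in other accounts, so treat the list as an assumption to verify). It reads `SubnetId<TAB>AvailabilityZone` pairs on stdin and sorts before picking, so the result is deterministic.

```bash
# Hypothetical helper: print the first subnet whose AZ is <region>a..d.
# Input (stdin): one "SubnetId<TAB>AvailabilityZone" pair per line.
# Prints nothing if no subnet qualifies.
pick_arm64_subnet() {
  region="$1"
  awk -v region="$region" '
    # Keep only AZs a-d; e and f were the failing AZs in this PR
    # (assumption carried over from the testing account).
    $2 ~ ("^" region "[a-d]$") { print $2 "\t" $1 }
  ' | LC_ALL=C sort | head -n 1 | cut -f 2
}

# Usage against a real account (VPC_ID is a placeholder):
#   aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" \
#     --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text \
#     | pick_arm64_subnet "$AWS_REGION"
```

Sorting by AZ and then by subnet ID avoids depending on the order the API happens to return, which is what made `Subnets[0]` unreliable.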
Adds two callouts to docs/deploying-to-aws.md from observations during end-to-end testing that aren't covered by the existing text:

1. First-task timing: initial Fargate placement takes 2-3 min (most of it pulling the ~500 MB image from ghcr.io). Transient `CannotPullContainerError` events are routine; ECS automatically stops the failed task and starts a fresh one. Persistent failures usually mean a real misconfiguration. Worth saying so users don't think their deploy is broken when it's just slow.
2. Exec sessions run as root: `aws ecs execute-command` opens a root shell inside the container by default, even though the workload (PID 1) is uid 1000. This trips up "is my mount writable?" debugging because root can write anywhere. Doc now suggests `runuser -u harness --` for harness-user perspective and `stat -c %u /proc/1` for verifying the workload uid from outside the exec session.

Co-authored-by: Cursor <cursoragent@cursor.com>
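The `stat -c %u` check in item 2 works on any path, so the same idea also covers the bind-mount ownership bug from the previous commit. A minimal sketch, assuming GNU `stat` (the `-c` flag); `owner_uid` and `check_owner` are hypothetical helper names, not part of the guide:

```bash
# Hypothetical helpers for the uid checks described above.
# Requires GNU stat (the -c flag), as used in the commit message.

# Print the numeric owner uid of a path.
owner_uid() {
  stat -c %u -- "$1"
}

# Succeed only if the path is owned by the expected uid; explain on mismatch.
check_owner() {
  path="$1"
  expected="$2"
  actual="$(owner_uid "$path")" || return 2
  if [ "$actual" != "$expected" ]; then
    echo "$path is owned by uid $actual, expected $expected" >&2
    return 1
  fi
}

# Usage:
#   check_owner /proc/1 1000               # workload uid, from a root exec session
#   check_owner /var/lib/hermes-claw 1000  # bind-mount source on the EC2 host
```

Checking ownership is more reliable than a write test from a root exec session, since root can write regardless of who owns the directory.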
Adds `tmp/` to .gitignore for ad-hoc local files that shouldn't ship with the project (e.g. one-off validation scripts, scratch notes, generated debug output). Follows the common convention of treating tmp/ as a per-developer scratch directory rather than committed project state.

Co-authored-by: Cursor <cursoragent@cursor.com>
35458a4 to e837d96
hggz left a comment
Drive-by, not codeowner.
Pure docs PR + already approved by capotej, so just a flake note: the build (ubuntu-latest, linux/amd64) failure is unrelated to anything in this PR. It's a transient curl 502 from github.com while downloading the tirith release binary inside the Dockerfile (same flake family I've seen hit tini and cosign downloads on this repo).
A retrigger (empty commit or rerun) should clear it. Nothing to change in the doc itself.
Quick skim of the guide: solid structure, mirrors the fly/k8s pattern, and the AZ-filter + chown 1000:1000 callouts are the kind of footgun-prevention that's gold for an ops doc. The "Why not Beanstalk/Amplify/App Runner/Lambda" section is a nice touch — saves reviewers from asking.
capotej left a comment
need to rebase so you can merge
```bash
export AWS_REGION=us-east-1
export CLAW_NAME=hermes-claw
export HARNESS_IMAGE=ghcr.io/capotej/harness:hermes-1.6.4
```
You'll want to edit the release skill to update this every release (like it does for the fly guide and README right now).
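One way such a bump could look, as a rough sketch only: the release skill's actual mechanics are not shown in this thread, so `bump_harness_image` is a hypothetical helper and `hermes-1.6.5` in the usage line is a made-up next tag. It rewrites whatever tag follows `HARNESS_IMAGE=ghcr.io/capotej/harness:` in a given file.

```bash
# Hypothetical helper: replace the image tag on HARNESS_IMAGE lines in a file.
# Assumes the new tag contains no "|" or "&" characters (both are special
# in the sed replacement used here).
bump_harness_image() {
  file="$1"
  new_tag="$2"
  tmp="$(mktemp)" || return 1
  # Write to a temp file, then copy back so the file's mode is preserved.
  sed -E "s|(HARNESS_IMAGE=ghcr\.io/capotej/harness:)[^[:space:]\"']+|\1${new_tag}|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage (tag value is illustrative):
#   bump_harness_image docs/deploying-to-aws.md hermes-1.6.5
```

Going through a temp file instead of `sed -i` sidesteps the flag's differing syntax between GNU and BSD sed.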
Summary
Adds `docs/deploying-to-aws.md` alongside the existing fly.io and Kubernetes deploy guides, covering two production-realistic AWS paths for running `hermes` as a long-running claw, plus a short EKS pointer.

The new doc mirrors the structure and philosophy of the existing two: the same architecture-table → prereqs → deploy → manifest reference → monitoring → teardown → customization flow, and the same "use the upstream signed image as-is, don't build a derived image" stance, with the AWS-native equivalents of fly's `[[files]]` injection pattern called out.

Options covered
- ECS on Fargate (recommended): EFS for persistent state at `/home/harness/.hermes-openrouter` (uid:gid 1000:1000 to match the non-root `harness` user), Secrets Manager → task-def `secrets[]`, CloudWatch Logs, `--enable-execute-command` for shell-in via `aws ecs execute-command` (which uses SSM Session Manager under the hood). 1:1 mapping to the k8s manifest's `Deployment` / `PVC` / `Secret` / `kubectl exec` shape.
- EC2 + Docker + SSM (lightweight): single `t4g.small` (~$13/mo) running Amazon Linux 2023, IAM instance profile with `AmazonSSMManagedInstanceCore` + scoped Secrets Manager read, user-data pulls secrets and writes a systemd unit that runs the upstream image with `--restart=always`. SSM Session Manager replaces SSH entirely: no key management, no inbound port 22.
- EKS: […] `StorageClass`.

Why not Elastic Beanstalk / Amplify / App Runner / Lambda
The doc includes a short "Why not…" section at the end explaining the disqualifying factor for each: App Runner has no persistent volume (hermes' faster-whisper cache + sessions wouldn't survive restarts), Elastic Beanstalk's ELB/ASG/EC2 abstractions are wrong for a single-replica daemon, Amplify is for full-stack web apps, Lambda has a 15-min execution cap. This keeps the doc focused on the two paths that actually fit the existing deploy model.
Live validation
Both options were end-to-end validated against a real AWS account (us-east-1) before opening this PR. Results: ECS Fargate 8/8 verifications pass in ~6 min, EC2+SSM 11/11 in ~4 min, total cost < $0.10 with full auto-teardown.
Verifications covered: container PID 1 runs as uid 1000 (harness user mapping honored), volume mount visible inside container and writable as uid 1000, all four secrets (`OPENROUTER_API_KEY`, `TELEGRAM_BOT_TOKEN`, `TELEGRAM_ALLOWED_USERS`, `GH_TOKEN`) injected as env vars from Secrets Manager (presence checked by name, value never printed), upstream agent tooling (`hermes`, `gh`) present at expected paths, shell-in works via `aws ecs execute-command` / `aws ssm start-session`. Test override: container `command` was set to `sleep 3600` so the test exercised AWS plumbing (IAM, image pull, volume mount, secrets injection, exec access) without needing real OPENROUTER/TELEGRAM keys; production deployments use `hermes gateway` as documented.

The validation scripts themselves aren't part of this PR (they're local scratch tooling, not project-shipped tests). What ships are the doc fixes + operational notes the validation surfaced.
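The "presence checked by name, value never printed" verification can be sketched in shell. Since the validation scripts are not part of this PR, this is an illustrative stand-in rather than the author's script, and `check_secrets_present` is a hypothetical name. It reports each variable as present or missing and returns non-zero if any is unset or empty.

```bash
# Hypothetical helper: report whether each named environment variable is
# set to a non-empty value, without ever echoing the value itself.
check_secrets_present() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which matches secrets that
    # were injected into the container environment.
    if [ -n "$(printenv "$name")" ]; then
      echo "$name: present"
    else
      echo "$name: MISSING"
      missing=1
    fi
  done
  return "$missing"
}

# Usage inside the container:
#   check_secrets_present OPENROUTER_API_KEY TELEGRAM_BOT_TOKEN \
#     TELEGRAM_ALLOWED_USERS GH_TOKEN
```

Printing only the variable name keeps secret values out of terminal scrollback and CloudWatch Logs while still confirming the Secrets Manager wiring.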
Findings folded into the doc
- ARM64 Fargate AZ availability (98eebab): the original subnet picker (`Subnets[0]`) was non-deterministic and could grab `us-east-1e` or `us-east-1f`, where ARM64 Fargate is not supported, producing the cryptic `"The required capabilities cannot be supported on requested platform"` placement error. The picker now filters to AZs `a`/`b`/`c`/`d` with an inline comment + a parenthetical added to the ARM64 recommendation callout.
- EC2 bind-mount ownership (98eebab): `mkdir -p /var/lib/hermes-claw` left the directory owned by root, so `entrypoint-hermes.sh`'s `cp -rn` first-boot seed crash-looped with `Permission denied` when the in-container uid 1000 tried to write. User-data now does `chown 1000:1000 /var/lib/hermes-claw` immediately after the mkdir, with an inline comment.
- First-task timing (7a60e6b): added a `>` callout explaining that initial task placement takes 2–3 min (image pull dominates), and that transient `CannotPullContainerError` events are routine because ECS auto-retries. Saves users from chasing phantom failures.
- Exec sessions run as root (7a60e6b): added a `>` callout that `aws ecs execute-command` opens a root shell by default even though the workload is uid 1000, with `runuser -u harness --` for harness-perspective debugging and `stat -c %u /proc/1` for verifying the workload uid from outside the exec session.

Files
- `docs/deploying-to-aws.md`
- `README.md`
- `.gitignore`: adds `tmp/` for per-developer scratch files (one-line, sized to match the existing entries).

No changes to `src/harness.ts`, the Dockerfiles, agent entrypoints, or CI.

Test plan
- `npx markdownlint-cli2@0.17.2 "**/*.md" "#node_modules"` passes with 0 errors
- […] (`deploying-to-fly.md#customizing-the-claw--dont-extend-the-image`, `deploying-to-k8s.md#all-in-one-k8sclawyaml`)
- ECS Fargate […] `660493448574`: 8/8 verifications pass, full teardown clean
- EC2+SSM […] `660493448574`: 11/11 verifications pass, full teardown clean