docs: add AWS deployment guide for hermes claw by EnriqueCanals · Pull Request #71 · capotej/harness

EnriqueCanals · 2026-05-23T05:44:30Z

Summary

Adds docs/deploying-to-aws.md alongside the existing fly.io and Kubernetes deploy guides, covering two production-realistic AWS paths for running hermes as a long-running claw, plus a short EKS pointer.

The new doc mirrors the structure and philosophy of the existing two — same architecture-table → prereqs → deploy → manifest reference → monitoring → teardown → customization flow, and the same "use the upstream signed image as-is, don't build a derived image" stance, with the AWS-native equivalents of fly's [[files]] injection pattern called out.

Options covered

- ECS on Fargate (recommended): task definition with 1 vCPU / 2 GiB on ARM64, EFS access point at /home/harness/.hermes-openrouter (uid:gid 1000:1000 to match the non-root harness user), Secrets Manager → task-def secrets[], CloudWatch Logs, and --enable-execute-command for shell-in via aws ecs execute-command (which uses SSM Session Manager under the hood). Maps 1:1 to the k8s manifest's Deployment / PVC / Secret / kubectl exec shape.
- EC2 + Docker + SSM (lightweight): single t4g.small (~$13/mo) running Amazon Linux 2023, with an IAM instance profile carrying AmazonSSMManagedInstanceCore plus scoped Secrets Manager read. User-data pulls the secrets and writes a systemd unit that runs the upstream image with --restart=always. SSM Session Manager replaces SSH entirely: no key management, no inbound port 22.
- EKS: one-paragraph note that the existing k8s manifest deploys unmodified; the only AWS-specific concern is having a default StorageClass backed by the EBS CSI driver.

Why not Elastic Beanstalk / Amplify / App Runner / Lambda

The doc includes a short "Why not…" section at the end explaining the disqualifying factor for each: App Runner has no persistent volume (hermes' faster-whisper cache + sessions wouldn't survive restarts), Elastic Beanstalk's ELB/ASG/EC2 abstractions are wrong for a single-replica daemon, Amplify is for full-stack web apps, Lambda has a 15-min execution cap. This keeps the doc focused on the two paths that actually fit the existing deploy model.

Live validation

Both options were end-to-end validated against a real AWS account (us-east-1) before opening this PR. Results: ECS Fargate 8/8 verifications pass in ~6 min, EC2+SSM 11/11 in ~4 min, total cost < $0.10 with full auto-teardown.

Verifications covered:

- container PID 1 runs as uid 1000 (harness user mapping honored)
- volume mount is visible inside the container and writable as uid 1000
- all four secrets (OPENROUTER_API_KEY, TELEGRAM_BOT_TOKEN, TELEGRAM_ALLOWED_USERS, GH_TOKEN) are injected as env vars from Secrets Manager (presence checked by name, value never printed)
- upstream agent tooling (hermes, gh) is present at the expected paths
- shell-in works via aws ecs execute-command / aws ssm start-session

Test override: the container command was set to sleep 3600 so the test exercised the AWS plumbing (IAM, image pull, volume mount, secrets injection, exec access) without needing real OPENROUTER/TELEGRAM keys. Production deployments use hermes gateway as documented.
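For readers who want to repeat the "presence checked by name, value never printed" verification, here is a minimal sketch. It is not the validation script used for this PR (those scripts are not shipped); the function name `check_env_present` is hypothetical, and it assumes `printenv` is available inside the container.

```shell
#!/bin/sh
# check_env_present NAME...
# Prints "NAME: present" or "NAME: MISSING" for each variable name and
# returns non-zero if any of them is unset or empty.
# The value is only tested for non-emptiness; it is never written out.
check_env_present() {
  missing=0
  for name in "$@"; do
    if [ -n "$(printenv "$name")" ]; then
      printf '%s: present\n' "$name"
    else
      printf '%s: MISSING\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

Run from a shell inside the container, it would be called as `check_env_present OPENROUTER_API_KEY TELEGRAM_BOT_TOKEN TELEGRAM_ALLOWED_USERS GH_TOKEN`.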

The validation scripts themselves aren't part of this PR (they're local scratch tooling, not project-shipped tests). What ships are the doc fixes + operational notes the validation surfaced.

Findings folded into the doc

- ARM64 Fargate AZ availability (commit 98eebab): the original subnet picker (Subnets[0]) was non-deterministic and could grab us-east-1e or us-east-1f, where ARM64 Fargate is not supported, producing the cryptic "The required capabilities cannot be supported on requested platform" placement error. The picker now filters to AZs a/b/c/d, with an inline comment explaining why, and the ARM64 recommendation callout gained a parenthetical.
- EC2 bind-mount ownership (commit 98eebab): mkdir -p /var/lib/hermes-claw left the directory owned by root, so entrypoint-hermes.sh's cp -rn first-boot seed crash-looped with Permission denied when the in-container uid 1000 tried to write. User-data now runs chown 1000:1000 /var/lib/hermes-claw immediately after the mkdir, with an inline comment.
- First-task timing + pull retries (commit 7a60e6b): added a > callout explaining that initial task placement takes 2–3 min (image pull dominates), and that transient CannotPullContainerError events are routine because ECS auto-retries. Saves users from chasing phantom failures.
- Exec sessions run as root (commit 7a60e6b): added a > callout that aws ecs execute-command opens a root shell by default even though the workload is uid 1000, with runuser -u harness -- for harness-perspective debugging and stat -c %u /proc/1 for verifying the workload uid from outside the exec session.
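The AZ filter from the first finding can be sketched as follows. This is an illustration of the idea, not the snippet that ships in the guide: the function name `pick_arm64_subnet` is hypothetical, and it assumes its input is one `subnet-id<TAB>availability-zone` row per line, the shape produced by `aws ec2 describe-subnets --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text`.

```shell
#!/bin/sh
# pick_arm64_subnet
# Reads "subnet-id<TAB>availability-zone" rows on stdin and prints the first
# subnet, after sorting by AZ and then by subnet ID, whose AZ is one of
# us-east-1a/b/c/d. Sorting makes the choice deterministic.
# Exits non-zero when no row matches.
pick_arm64_subnet() {
  sort -k2,2 -k1,1 |
    awk '$2 ~ /^us-east-1[abcd]$/ { print $1; found = 1; exit }
         END { exit !found }'
}
```

Failing loudly when no subnet qualifies is a deliberate choice here: it surfaces the problem before task placement instead of through the placement error quoted above.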

Files

| Path | Change |
| --- | --- |
| `docs/deploying-to-aws.md` | New. The deploy guide. |
| `README.md` | Links the AWS guide as the third bullet under "Deploying hermes as a claw". |
| `.gitignore` | Adds `tmp/` for per-developer scratch files (one-line, sized to match the existing entries). |

No changes to src/harness.ts, the Dockerfiles, agent entrypoints, or CI.

Test plan

- npx markdownlint-cli2@0.17.2 "**/*.md" "#node_modules" passes with 0 errors
- Internal cross-references resolve (deploying-to-fly.md#customizing-the-claw--dont-extend-the-image, deploying-to-k8s.md#all-in-one-k8sclawyaml)
- End-to-end deploy of Option A (ECS Fargate) against AWS account 660493448574: 8/8 verifications pass, full teardown clean
- End-to-end deploy of Option B (EC2 + SSM) against AWS account 660493448574: 11/11 verifications pass, full teardown clean
- Post-test account sweep confirms zero residual ECS / EFS / EC2 / Secrets / SGs / IAM / log groups

Adds docs/deploying-to-aws.md alongside the existing fly.io and Kubernetes guides, covering two production-realistic AWS paths and a pointer to EKS:

- ECS on Fargate (recommended): EFS for persistent state, Secrets Manager for keys, `aws ecs execute-command` (SSM Session Manager) for shell-in. 1:1 mapping to the k8s manifest's Deployment, PVC, secrets, and exec semantics.
- EC2 + Docker + SSM (lightweight): single t4g.small with a systemd unit pulling secrets from Secrets Manager at boot. SSM Session Manager replaces SSH; no inbound 22, no keys.
- EKS: one-paragraph note that the existing k8s manifest deploys unmodified (EBS CSI driver is the only AWS-specific concern).

The doc explicitly preserves the project's "don't build a derived image" stance (linking the fly.io rationale) and provides the AWS-native equivalent injection points for both options.

Briefly explains why App Runner (no persistent volume), Elastic Beanstalk (wrong abstraction for a single-replica daemon), Amplify (wrong product entirely), and Lambda (15-min execution cap) were deliberately excluded.

Updates the README's "Deploying hermes as a claw" section to link the new guide as the third deployment target.

Co-authored-by: Cursor <cursoragent@cursor.com>

End-to-end testing against a real AWS account (us-east-1) surfaced two bugs in the original guide that would break the documented deploy flow:

1. ARM64 Fargate AZ availability: the original `Subnets[0]` picker was non-deterministic and could grab us-east-1e or us-east-1f, where ARM64 Fargate is not supported. Task placement then fails with the cryptic "The required capabilities cannot be supported on requested platform" error. The subnet picker now filters to AZs a/b/c/d with an inline comment explaining why, plus a parenthetical added to the existing ARM64 recommendation callout under the task definition.
2. EC2 bind-mount ownership: `mkdir -p /var/lib/hermes-claw` left the directory owned by root. When Docker bind-mounts it into the container, the in-container harness user (uid 1000) cannot write to it, so entrypoint-hermes.sh's `cp -rn` first-boot seed crash-loops with "Permission denied". The user-data script now does `chown 1000:1000 /var/lib/hermes-claw` immediately after the mkdir, with an inline comment explaining the consequence of omitting it.

Co-authored-by: Cursor <cursoragent@cursor.com>
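The bind-mount ownership fix described above amounts to two commands. A minimal sketch follows, assuming it runs as root in user-data; the helper name `prepare_state_dir` and its optional uid/gid arguments are hypothetical and exist only so the same function can be tried against a scratch directory.

```shell
#!/bin/sh
# prepare_state_dir DIR [UID] [GID]
# Creates DIR if needed and hands ownership to the in-container user
# (uid:gid 1000:1000 by default) so the container can write to the
# bind mount on first boot.
prepare_state_dir() {
  dir=$1
  uid=${2:-1000}
  gid=${3:-1000}
  mkdir -p "$dir"
  # Without this chown the directory stays owned by root and writes from
  # uid 1000 inside the container fail with "Permission denied".
  chown "$uid:$gid" "$dir"
}
```

In user-data this would be invoked as `prepare_state_dir /var/lib/hermes-claw` before the systemd unit starts the container.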

Adds two callouts to docs/deploying-to-aws.md from observations during end-to-end testing that aren't covered by the existing text:

1. First-task timing: initial Fargate placement takes 2-3 min (most of it pulling the ~500 MB image from ghcr.io). Transient `CannotPullContainerError` events are routine; ECS automatically stops the failed task and starts a fresh one. Persistent failures usually mean a real misconfiguration. Worth saying so users don't think their deploy is broken when it's just slow.
2. Exec sessions run as root: `aws ecs execute-command` opens a root shell inside the container by default, even though the workload (PID 1) is uid 1000. This trips up "is my mount writable?" debugging because root can write anywhere. Doc now suggests `runuser -u harness --` for harness-user perspective and `stat -c %u /proc/1` for verifying the workload uid from outside the exec session.

Co-authored-by: Cursor <cursoragent@cursor.com>
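The two debugging commands from the second callout can be wrapped as small helpers. This is a sketch under assumptions: the function names `proc_uid` and `writable_as` are hypothetical, `stat -c` is the GNU coreutils form, and `runuser` plus a standalone `test` binary are assumed to be present in the image.

```shell
#!/bin/sh
# proc_uid PID
# Prints the numeric uid that owns /proc/PID (Linux only). For PID 1 inside
# the container this is the uid the workload runs as, regardless of which
# user the exec session itself is running as.
proc_uid() {
  stat -c %u "/proc/$1"
}

# writable_as USER DIR
# Succeeds only if USER can write to DIR. Must be called as root (the
# default for an exec session) so that runuser can switch users.
writable_as() {
  runuser -u "$1" -- test -w "$2"
}
```

From a root exec session, `[ "$(proc_uid 1)" = 1000 ]` confirms the workload uid and `writable_as harness /home/harness/.hermes-openrouter` answers the "is my mount writable?" question from the harness user's point of view.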

Adds `tmp/` to .gitignore for ad-hoc local files that shouldn't ship with the project (e.g. one-off validation scripts, scratch notes, generated debug output). Follows the common convention of treating tmp/ as a per-developer scratch directory rather than committed project state.

Co-authored-by: Cursor <cursoragent@cursor.com>

hggz

Drive-by, not codeowner.

Pure docs PR + already approved by capotej, so just a flake note: the build (ubuntu-latest, linux/amd64) failure is unrelated to anything in this PR. It's a transient curl 502 from github.com while downloading the tirith release binary inside the Dockerfile (same flake family I've seen hit tini and cosign downloads on this repo).

A retrigger (empty commit or rerun) should clear it. Nothing to change in the doc itself.

Quick skim of the guide: solid structure, mirrors the fly/k8s pattern, and the AZ-filter + chown 1000:1000 callouts are the kind of footgun-prevention that's gold for an ops doc. The "Why not Beanstalk/Amplify/App Runner/Lambda" section is a nice touch — saves reviewers from asking.

capotej

need to rebase so you can merge

capotej · 2026-05-25T13:54:23Z

+```bash
+export AWS_REGION=us-east-1
+export CLAW_NAME=hermes-claw
+export HARNESS_IMAGE=ghcr.io/capotej/harness:hermes-1.6.4


You'll want to edit the release skill to update this every release (like it does for the fly guide and README right now).

EnriqueCanals requested a review from capotej as a code owner May 23, 2026 05:44

EnriqueCanals and others added 3 commits May 23, 2026 03:19

EnriqueCanals force-pushed the docs/deploy-to-aws branch from 35458a4 to e837d96 Compare May 23, 2026 07:30

capotej approved these changes May 24, 2026


Merge branch 'main' into docs/deploy-to-aws

0ad00ac

hggz reviewed May 24, 2026


capotej approved these changes May 25, 2026


capotej reviewed May 25, 2026

EnriqueCanals wants to merge 5 commits into capotej:main from EnriqueCanals:docs/deploy-to-aws


3 participants
