docs: add AWS deployment guide for hermes claw #71

Open: EnriqueCanals wants to merge 5 commits into capotej:main from EnriqueCanals:docs/deploy-to-aws

EnriqueCanals (Collaborator) commented May 23, 2026

Summary

Adds docs/deploying-to-aws.md alongside the existing fly.io and Kubernetes deploy guides, covering two production-realistic AWS paths for running hermes as a long-running claw, plus a short EKS pointer.

The new doc mirrors the structure and philosophy of the existing two — same architecture-table → prereqs → deploy → manifest reference → monitoring → teardown → customization flow, and the same "use the upstream signed image as-is, don't build a derived image" stance, with the AWS-native equivalents of fly's [[files]] injection pattern called out.

Options covered

  • ECS on Fargate (recommended) — task definition with 1 vCPU / 2 GiB ARM64, EFS access point at /home/harness/.hermes-openrouter (uid:gid 1000:1000 to match the non-root harness user), Secrets Manager → task-def secrets[], CloudWatch Logs, --enable-execute-command for shell-in via aws ecs execute-command (which uses SSM Session Manager under the hood). 1:1 mapping to the k8s manifest's Deployment / PVC / Secret / kubectl exec shape.
  • EC2 + Docker + SSM (lightweight) — single t4g.small (~$13/mo) running Amazon Linux 2023, IAM instance profile with AmazonSSMManagedInstanceCore + scoped Secrets Manager read, user-data pulls secrets and writes a systemd unit that runs the upstream image with --restart=always. SSM Session Manager replaces SSH entirely — no key management, no inbound port 22.
  • EKS — one-paragraph note that the existing k8s manifest deploys unmodified; only AWS-specific concern is using the EBS CSI driver as the default StorageClass.

Why not Elastic Beanstalk / Amplify / App Runner / Lambda

The doc includes a short "Why not…" section at the end explaining the disqualifying factor for each: App Runner has no persistent volume (hermes' faster-whisper cache + sessions wouldn't survive restarts), Elastic Beanstalk's ELB/ASG/EC2 abstractions are wrong for a single-replica daemon, Amplify is for full-stack web apps, Lambda has a 15-min execution cap. This keeps the doc focused on the two paths that actually fit the existing deploy model.

Live validation

Both options were end-to-end validated against a real AWS account (us-east-1) before opening this PR. Results: ECS Fargate 8/8 verifications pass in ~6 min, EC2+SSM 11/11 in ~4 min, total cost < $0.10 with full auto-teardown.

Verifications covered: container PID 1 runs as uid 1000 (harness user mapping honored), volume mount visible inside container and writable as uid 1000, all four secrets (OPENROUTER_API_KEY, TELEGRAM_BOT_TOKEN, TELEGRAM_ALLOWED_USERS, GH_TOKEN) injected as env vars from Secrets Manager (presence checked by name, value never printed), upstream agent tooling (hermes, gh) present at expected paths, shell-in works via aws ecs execute-command / aws ssm start-session. Test override: container command was set to sleep 3600 so the test exercised AWS plumbing (IAM, image pull, volume mount, secrets injection, exec access) without needing real OPENROUTER/TELEGRAM keys — production deployments use hermes gateway as documented.
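For anyone reproducing the secrets verification, the "presence checked by name, value never printed" step can be sketched roughly as below. This is a minimal illustration, not the local validation script (which is not part of this PR); the function name `check_env_present` is hypothetical, and it assumes it runs in bash inside the container.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether each named environment variable is set
# and non-empty, without ever echoing its value.
check_env_present() {
  local missing=0 name
  for name in "$@"; do
    # ${!name-} is bash indirect expansion: the value of the variable whose
    # name is stored in $name (empty if that variable is unset).
    if [ -n "${!name-}" ]; then
      printf '%s: present\n' "$name"
    else
      printf '%s: MISSING\n' "$name"
      missing=1
    fi
  done
  # Non-zero exit status if at least one variable was missing.
  return "$missing"
}

# Example call with the four secret names listed in this PR:
#   check_env_present OPENROUTER_API_KEY TELEGRAM_BOT_TOKEN \
#     TELEGRAM_ALLOWED_USERS GH_TOKEN
```

Only the variable names reach the output, so the result is safe to paste into a PR or a log.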

The validation scripts themselves aren't part of this PR (they're local scratch tooling, not project-shipped tests). What ships are the doc fixes + operational notes the validation surfaced.

Findings folded into the doc

  1. ARM64 Fargate AZ availability (commit 98eebab) — the original subnet picker (Subnets[0]) was non-deterministic and could grab us-east-1e or us-east-1f, where ARM64 Fargate is not supported, producing the cryptic "The required capabilities cannot be supported on requested platform" placement error. The picker now filters to AZs a/b/c/d with an inline comment + a parenthetical added to the ARM64 recommendation callout.
  2. EC2 bind-mount ownership (commit 98eebab) — mkdir -p /var/lib/hermes-claw left the directory owned by root, so entrypoint-hermes.sh's cp -rn first-boot seed crash-looped with Permission denied when the in-container uid 1000 tried to write. User-data now does chown 1000:1000 /var/lib/hermes-claw immediately after the mkdir, with an inline comment.
  3. First-task timing + pull retries (commit 7a60e6b) — added a > callout explaining that initial task placement takes 2–3 min (image pull dominates), and that transient CannotPullContainerError events are routine because ECS auto-retries. Saves users from chasing phantom failures.
  4. Exec sessions run as root (commit 7a60e6b) — added a > callout that aws ecs execute-command opens a root shell by default even though the workload is uid 1000, with runuser -u harness -- for harness-perspective debugging and stat -c %u /proc/1 for verifying the workload uid from outside the exec session.
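The AZ filter from finding 1 can be sketched as a small, deterministic picker. This is an illustration under assumptions, not the exact snippet in the guide: the function name `pick_arm64_subnet` is hypothetical, and the a/b/c/d suffix list reflects only what this validation run observed in us-east-1, so other regions may need a different filter.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "AZ<TAB>SubnetId" rows on stdin and print the
# first subnet (in sorted AZ order) whose AZ name ends in a, b, c, or d.
# Exits non-zero when no subnet qualifies.
pick_arm64_subnet() {
  # Sorting first makes the choice deterministic, unlike a bare Subnets[0].
  LC_ALL=C sort |
    awk '!found && $1 ~ /[a-d]$/ { print $2; found = 1 } END { exit !found }'
}

# Example wiring with the AWS CLI (text output is tab-separated rows):
#   aws ec2 describe-subnets \
#     --query 'Subnets[].[AvailabilityZone,SubnetId]' --output text \
#     | pick_arm64_subnet
```

Keeping the filter separate from the `aws` call makes it easy to check against canned input before touching a real account.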

Files

| Path | Change |
| --- | --- |
| docs/deploying-to-aws.md | New. The deploy guide. |
| README.md | Links the AWS guide as the third bullet under "Deploying hermes as a claw". |
| .gitignore | Adds tmp/ for per-developer scratch files (one-line, sized to match the existing entries). |

No changes to src/harness.ts, the Dockerfiles, agent entrypoints, or CI.

Test plan

  • npx markdownlint-cli2@0.17.2 "**/*.md" "#node_modules" passes with 0 errors
  • Internal cross-references resolve (deploying-to-fly.md#customizing-the-claw--dont-extend-the-image, deploying-to-k8s.md#all-in-one-k8sclawyaml)
  • End-to-end deploy of Option A (ECS Fargate) against AWS account 660493448574: 8/8 verifications pass, full teardown clean
  • End-to-end deploy of Option B (EC2 + SSM) against AWS account 660493448574: 11/11 verifications pass, full teardown clean
  • Post-test account sweep confirms zero residual ECS / EFS / EC2 / Secrets / SGs / IAM / log groups

Adds docs/deploying-to-aws.md alongside the existing fly.io and
Kubernetes guides, covering two production-realistic AWS paths and
a pointer to EKS:

- ECS on Fargate (recommended) — EFS for persistent state, Secrets
  Manager for keys, `aws ecs execute-command` (SSM Session Manager)
  for shell-in. 1:1 mapping to the k8s manifest's Deployment,
  PVC, Secret, and exec semantics.
- EC2 + Docker + SSM (lightweight) — single t4g.small with a systemd
  unit pulling secrets from Secrets Manager at boot. SSM Session
  Manager replaces SSH; no inbound 22, no keys.
- EKS — one-paragraph note that the existing k8s manifest deploys
  unmodified (EBS CSI driver is the only AWS-specific concern).

The doc explicitly preserves the project's "don't build a derived
image" stance (linking the fly.io rationale) and provides the
AWS-native equivalent injection points for both options.

Briefly explains why App Runner (no persistent volume), Elastic
Beanstalk (wrong abstraction for a single-replica daemon), Amplify
(wrong product entirely), and Lambda (15-min execution cap) were
deliberately excluded.

Updates the README's "Deploying hermes as a claw" section to link
the new guide as the third deployment target.

Co-authored-by: Cursor <cursoragent@cursor.com>
EnriqueCanals requested a review from capotej as a code owner May 23, 2026 05:44
EnriqueCanals and others added 3 commits May 23, 2026 03:19
End-to-end testing against a real AWS account (us-east-1) surfaced
two bugs in the original guide that would break the documented
deploy flow:

1. ARM64 Fargate AZ availability — the original `Subnets[0]` picker
   was non-deterministic and could grab us-east-1e or us-east-1f,
   where ARM64 Fargate is not supported. Task placement then fails
   with the cryptic "The required capabilities cannot be supported
   on requested platform" error. The subnet picker now filters to
   AZs a/b/c/d with an inline comment explaining why, plus a
   parenthetical added to the existing ARM64 recommendation
   callout under the task definition.

2. EC2 bind-mount ownership — `mkdir -p /var/lib/hermes-claw` left
   the directory owned by root. When Docker bind-mounts it into
   the container, the in-container harness user (uid 1000) cannot
   write to it, so entrypoint-hermes.sh's `cp -rn` first-boot seed
   crash-loops with "Permission denied". The user-data script now
   does `chown 1000:1000 /var/lib/hermes-claw` immediately after
   the mkdir, with an inline comment explaining the consequence
   of omitting it.
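
The ownership fix in item 2 can be sketched as a small helper. This is an illustration rather than the guide's actual user-data script: the function name `prepare_state_dir` is hypothetical, and only the path and the 1000:1000 owner come from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the host directory that Docker will bind-mount
# and hand it to the in-container user (uid:gid 1000:1000 by default).
prepare_state_dir() {
  local dir="$1" uid="${2:-1000}" gid="${3:-1000}"
  mkdir -p "$dir"
  # Without this chown the directory stays owned by whoever ran mkdir
  # (root in user-data), and the container's uid 1000 cannot write to it.
  chown "$uid:$gid" "$dir"
}

# Example call for the path named in this commit:
#   prepare_state_dir /var/lib/hermes-claw
```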

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds two callouts to docs/deploying-to-aws.md from observations
during end-to-end testing that aren't covered by the existing text:

1. First-task timing — initial Fargate placement takes 2-3 min
   (most of it pulling the ~500 MB image from ghcr.io). Transient
   `CannotPullContainerError` events are routine; ECS automatically
   stops the failed task and starts a fresh one. Persistent
   failures usually mean a real misconfiguration. Worth saying so
   users don't think their deploy is broken when it's just slow.

2. Exec sessions run as root — `aws ecs execute-command` opens a
   root shell inside the container by default, even though the
   workload (PID 1) is uid 1000. This trips up "is my mount
   writable?" debugging because root can write anywhere. Doc now
   suggests `runuser -u harness --` for harness-user perspective
   and `stat -c %u /proc/1` for verifying the workload uid from
   outside the exec session.
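
The uid check in item 2 can be wrapped in a helper along these lines. It is a sketch, not text from the guide: the function name `assert_workload_uid` and the `.write-test` file name are hypothetical, while `stat -c %u /proc/1`, `runuser -u harness --`, and the mount path come from this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the owner of PID 1's /proc entry (the uid the
# workload runs as) with an expected uid. Useful from a root exec shell,
# where `id -u` would misleadingly report 0.
assert_workload_uid() {
  local expected="$1" actual
  actual="$(stat -c %u /proc/1)"
  if [ "$actual" = "$expected" ]; then
    printf 'PID 1 runs as uid %s\n' "$actual"
  else
    printf 'PID 1 runs as uid %s, expected %s\n' "$actual" "$expected" >&2
    return 1
  fi
}

# Example calls inside an exec session:
#   assert_workload_uid 1000
#   # Test mount writability as the harness user rather than as root:
#   runuser -u harness -- touch /home/harness/.hermes-openrouter/.write-test
```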

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds `tmp/` to .gitignore for ad-hoc local files that shouldn't
ship with the project (e.g. one-off validation scripts, scratch
notes, generated debug output). Follows the common convention
of treating tmp/ as a per-developer scratch directory rather
than committed project state.

Co-authored-by: Cursor <cursoragent@cursor.com>
hggz (Collaborator) left a comment

Drive-by, not codeowner.

Pure docs PR + already approved by capotej, so just a flake note: the build (ubuntu-latest, linux/amd64) failure is unrelated to anything in this PR. It's a transient curl 502 from github.com while downloading the tirith release binary inside the Dockerfile (same flake family I've seen hit tini and cosign downloads on this repo).

A retrigger (empty commit or rerun) should clear it. Nothing to change in the doc itself.

Quick skim of the guide: solid structure, mirrors the fly/k8s pattern, and the AZ-filter + chown 1000:1000 callouts are the kind of footgun-prevention that's gold for an ops doc. The "Why not Beanstalk/Amplify/App Runner/Lambda" section is a nice touch — saves reviewers from asking.

capotej (Owner) left a comment

Needs a rebase before it can be merged.

Comment thread docs/deploying-to-aws.md
```bash
export AWS_REGION=us-east-1
export CLAW_NAME=hermes-claw
export HARNESS_IMAGE=ghcr.io/capotej/harness:hermes-1.6.4
```

Owner

You'll want to edit the release skill so it updates this on every release (as it already does for the fly guide and the README).
