How to design, build, and operate AI agents for infrastructure teams — safely.
AI agents can write IaC, fix compliance findings, detect drift, review PRs, and respond to incidents, all autonomously. But autonomy without guardrails is a liability. Agents that can run `terraform apply` can also run `terraform destroy`. Agents that read configs can leak secrets. Agents that loop can burn budgets.
This guide covers the main architectural decisions you face when building infrastructure agents, with real patterns, code snippets, multiple alternatives, and a risk framework for evaluating your choices.
- Platform engineers evaluating whether to build or buy agent capabilities
- SREs designing safe automation for incident response and remediation
- DevOps leads building self-service IaC platforms
- Engineering leaders who need a reviewed architecture for AI-driven infrastructure
| # | Chapter | What You'll Learn |
|---|---|---|
| 1 | Architecture Overview | The six planes of an infra-agent system |
| 2 | Agent Runtime & Orchestration | LLM runtimes (Claude Agent SDK, OpenAI, LangChain, custom), task queuing, worker isolation |
| 3 | Tools, CLIs & Skills | CLI tooling, skill systems, MCP, and capability management |
| 4 | Sandboxed Execution | Container isolation with Docker, Modal, Azure Container Apps |
| 5 | Credential Management | Short-lived tokens, vault patterns, blast radius control |
| 6 | The Data Plane | Infrastructure knowledge layer, resource graphs, context serialization |
| 7 | Change Control & GitOps | PR-based workflows, drift verification, validation loops |
| 8 | Policy & Guardrails | Tool restrictions, approval gates, autonomy tiers |
| 9 | Observability & Audit | OpenTelemetry, action trails, debugging agent failures |
| 10 | Autonomous Operations & Notifications | Scheduling, autonomous agents, notification routing, escalation chains |
| 11 | Testing & Hardening | Trajectory tests, prompt injection defense, security benchmarks |
| 12 | UX & Usability | Multi-tenancy, RBAC, onboarding, team collaboration, error prevention |
| 13 | Risk Framework & Checklists | Decision matrices, compliance mapping, go-live checklists |
- Agents Never Deploy Directly — Every infrastructure change flows through a pull request. The agent produces diffs, not deployments.
- Least Privilege by Default — Agents get the minimum credentials and tool access needed. Privileges are scoped, time-limited, and auditable.
- Observability Is Not Optional — Every tool call, credential request, and decision point is logged with correlation IDs.
- Fail Safe, Not Fail Open — When in doubt, the agent stops and asks a human. Timeouts and policy gates are structural — not suggestions.
- The Agent Is Not Special — Agent-initiated changes go through the same review, CI, and deployment pipelines as human-initiated changes.
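
The fail-safe and least-privilege principles above can be sketched as a default-deny tool gate. This is a minimal illustration under stated assumptions, not the guide's own implementation: `Decision`, `RULES`, and `authorize` are hypothetical names, and the rule entries are example choices rather than a recommended allowlist.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical rules: each entry maps a command prefix to a decision.
# Anything that matches no rule is denied, so `terraform apply` and
# `terraform destroy` are blocked simply by not being listed.
RULES = [
    (("terraform", "plan"), Decision.ALLOW),
    (("terraform", "validate"), Decision.ALLOW),
    (("gh", "pr", "create"), Decision.ALLOW),
    (("gh", "pr", "merge"), Decision.NEEDS_APPROVAL),
]


def authorize(command: str) -> Decision:
    """Return the decision for a command, failing closed on any doubt."""
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        # Unparseable input (for example an unclosed quote) is denied.
        return Decision.DENY
    for prefix, decision in RULES:
        if tokens[: len(prefix)] == prefix:
            return decision
    return Decision.DENY
```

A gate like this is only meaningful if the approved command is later executed as an argument list without a shell. Passed to a shell, a string such as `terraform plan && terraform destroy` would match the `terraform plan` prefix and still run the second command.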
┌───────────────────────────────────────────────────────────┐
│                    YOUR INFRASTRUCTURE                    │
│  AWS / Azure / GCP / OCI    Terraform / Bicep / Pulumi    │
│  GitHub / GitLab / ADO      Prowler / Checkov / Custom    │
└─────────────────────────────┬─────────────────────────────┘
                              │
                   ┌──────────▼──────────┐
                   │    POLICY PLANE     │  ← What agents CAN do
                   │ (rules, approvals)  │
                   └──────────┬──────────┘
                              │
             ┌────────────────▼────────────────┐
             │          AGENT RUNTIME          │
             │ Skills · Tool Access · Session  │  ← How agents DO it
             │    Credentials · Sandboxing     │
             └────────────────┬────────────────┘
                              │
                   ┌──────────▼──────────┐
                   │   CHANGE CONTROL    │  ← How changes LAND
                   │  (PRs, validation)  │
                   └──────────┬──────────┘
                              │
                   ┌──────────▼──────────┐
                   │    OBSERVABILITY    │  ← How you SEE it
                   │  (traces, alerts)   │
                   └─────────────────────┘
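
One way to read the diagram is as a pipeline in which a task passes the policy plane, the runtime produces a diff, change control turns the diff into a pull request, and every step is recorded under a single correlation ID. The sketch below is a toy model of that flow. All names (`Trace`, `policy_plane`, `agent_runtime`, `change_control`, `handle`) and the task shape are hypothetical, and no real Git provider or LLM is called.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects (correlation_id, plane, detail) events for one task."""
    correlation_id: str
    events: list = field(default_factory=list)

    def record(self, plane: str, detail: str) -> None:
        self.events.append((self.correlation_id, plane, detail))


def policy_plane(task: dict, trace: Trace) -> bool:
    # Assumption for this sketch: only proposing a change is permitted.
    allowed = task["action"] == "propose_change"
    trace.record("policy", f"{task['action']} allowed={allowed}")
    return allowed


def agent_runtime(task: dict, trace: Trace) -> str:
    # Stand-in for the agent's work: it yields a diff, never a deployment.
    diff = f"+ # proposed change for {task['target']}"
    trace.record("runtime", "diff produced")
    return diff


def change_control(diff: str, trace: Trace) -> dict:
    # Stand-in for opening a pull request with the diff attached.
    trace.record("change_control", "pull request opened")
    return {"status": "open", "diff": diff}


def handle(task: dict):
    """Run one task through the planes and return (pull_request, trace)."""
    trace = Trace(correlation_id=str(uuid.uuid4()))
    if not policy_plane(task, trace):
        return None, trace  # stop at the policy gate
    pr = change_control(agent_runtime(task, trace), trace)
    return pr, trace
```

Calling `handle({"action": "propose_change", "target": "vpc.tf"})` returns an open pull request plus three trace events that share one correlation ID. A task with any other action stops at the policy plane and leaves a single event behind.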
This guide doesn't prescribe a single stack. For each architectural layer, we cover multiple approaches:
| Layer | Options Covered |
|---|---|
| LLM Runtime | Claude Agent SDK, OpenAI Agents SDK / Codex CLI, LangChain/LangGraph, direct API |
| Task Queue | Redis Streams, BullMQ, AWS SQS, RabbitMQ, Temporal |
| Sandboxing | Docker, Modal, Azure Container Apps Jobs, AWS Lambda, Firecracker |
| Credential Store | HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, 1Password |
| Change Control | GitHub Actions, GitLab CI, Azure Pipelines, Atlantis, Spacelift |
| Observability | OpenTelemetry + Grafana, Datadog, Dash0, New Relic |
| Notifications | Slack, Microsoft Teams, PagerDuty, Opsgenie, email, webhooks |
| Scheduling | Cron (systemd/k8s), Temporal, AWS EventBridge, Azure Timer Triggers |
| State Storage | PostgreSQL, Redis, Azure Blob, S3, SQLite |
Found an error? Have a better pattern? Contributions are welcome.
- Issues — Open an issue for questions, suggestions, or corrections
- Pull Requests — Submit a PR for content improvements or new examples
- Discussions — Use GitHub Discussions for broader architectural questions
Please keep contributions focused on patterns and architecture — not vendor-specific marketing.
This guide is built by the team at Cloudgeni, where we design, build, and operate autonomous infrastructure agents in production across AWS, Azure, GCP, and OCI for enterprise teams.
Every pattern here comes from running these systems in production. We open-sourced it because we kept answering the same architectural questions — writing them down once seemed more useful.
The guide text is released under CC BY 4.0. Use it, adapt it, share it — just give credit.
Code snippets are released under MIT.