Infrastructure Agents Guide

How to design, build, and operate AI agents for infrastructure teams — safely.

AI agents can write IaC, fix compliance findings, detect drift, review PRs, and respond to incidents — all autonomously. But autonomy without guardrails is a liability. Agents that can terraform apply can also terraform destroy. Agents that read configs can leak secrets. Agents that loop can burn budgets.

This guide covers the architectural decisions you face when building infrastructure agents, with real patterns, code snippets, multiple alternatives, and a risk framework for evaluating your choices.

Who This Is For

Platform engineers evaluating whether to build or buy agent capabilities
SREs designing safe automation for incident response and remediation
DevOps leads building self-service IaC platforms
Engineering leaders who need a reviewed architecture for AI-driven infrastructure

Guide Structure

#	Chapter	What You'll Learn
1	Architecture Overview	The six planes of an infra-agent system
2	Agent Runtime & Orchestration	LLM runtimes (Claude Agent SDK, OpenAI, LangChain, custom), task queuing, worker isolation
3	Tools, CLIs & Skills	CLI tooling, skill systems, MCP, and capability management
4	Sandboxed Execution	Container isolation with Docker, Modal, Azure Container Apps
5	Credential Management	Short-lived tokens, vault patterns, blast radius control
6	The Data Plane	Infrastructure knowledge layer, resource graphs, context serialization
7	Change Control & GitOps	PR-based workflows, drift verification, validation loops
8	Policy & Guardrails	Tool restrictions, approval gates, autonomy tiers
9	Observability & Audit	OpenTelemetry, action trails, debugging agent failures
10	Autonomous Operations & Notifications	Scheduling, autonomous agents, notification routing, escalation chains
11	Testing & Hardening	Trajectory tests, prompt injection defense, security benchmarks
12	UX & Usability	Multi-tenancy, RBAC, onboarding, team collaboration, error prevention
13	Risk Framework & Checklists	Decision matrices, compliance mapping, go-live checklists

Core Principles

Agents Never Deploy Directly — Every infrastructure change flows through a pull request. The agent produces diffs, not deployments.
Least Privilege by Default — Agents get the minimum credentials and tool access needed. Privileges are scoped, time-limited, and auditable.
Observability Is Not Optional — Every tool call, credential request, and decision point is logged with correlation IDs.
Fail Safe, Not Fail Open — When in doubt, the agent stops and asks a human. Timeouts and policy gates are structural — not suggestions.
The Agent Is Not Special — Agent-initiated changes go through the same review, CI, and deployment pipelines as human-initiated changes.
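
As a rough illustration of how "Least Privilege by Default", "Observability Is Not Optional", and "Fail Safe, Not Fail Open" might combine, here is a minimal sketch of a command gate. The allowlist contents, the `Gate` class, and the log format are hypothetical and are not taken from the guide's chapters. The sketch also assumes commands are later executed as an argument list, not through a shell.

```python
import json
import shlex
import uuid
from dataclasses import dataclass, field

# Hypothetical allowlist of (binary, subcommand) pairs. Anything not listed
# here is refused, so a new or unknown command fails safe by default.
ALLOWED = frozenset({
    ("terraform", "plan"),
    ("terraform", "validate"),
    ("git", "diff"),
})


@dataclass
class Gate:
    allowed: frozenset = ALLOWED
    audit_log: list = field(default_factory=list)

    def check(self, command: str, correlation_id: str) -> bool:
        try:
            argv = shlex.split(command)
        except ValueError:
            argv = []  # unparseable input is treated as a denial
        decision = len(argv) >= 2 and (argv[0], argv[1]) in self.allowed
        # Record every decision, allowed or denied, with the task's
        # correlation ID so it can be traced later.
        self.audit_log.append(json.dumps({
            "correlation_id": correlation_id,
            "command": command,
            "decision": "allow" if decision else "deny",
        }))
        return decision


gate = Gate()
task_id = str(uuid.uuid4())
results = {
    cmd: gate.check(cmd, task_id)
    for cmd in (
        "terraform plan -out=tfplan",
        "terraform destroy",
        "kubectl delete ns prod",
    )
}
print(results)
```

Only the plan command passes. The destroy command and the unlisted `kubectl` call are both refused, and all three decisions land in `gate.audit_log` under the same correlation ID. A production gate would likely need argument-level checks as well, which this sketch leaves out.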

Quick Start: Mental Model

┌───────────────────────────────────────────────────────────┐
│                    YOUR INFRASTRUCTURE                    │
│  AWS / Azure / GCP / OCI    Terraform / Bicep / Pulumi    │
│  GitHub / GitLab / ADO      Prowler / Checkov / Custom    │
└────────────────────────────┬──────────────────────────────┘
                             │
                  ┌──────────▼──────────┐
                  │    POLICY PLANE     │  ← What agents CAN do
                  │  (rules, approvals) │
                  └──────────┬──────────┘
                             │
            ┌────────────────▼────────────────┐
            │          AGENT RUNTIME          │
            │  Skills · Tool Access · Session │  ← How agents DO it
            │  Credentials · Sandboxing       │
            └────────────────┬────────────────┘
                             │
                  ┌──────────▼──────────┐
                  │   CHANGE CONTROL    │  ← How changes LAND
                  │  (PRs, validation)  │
                  └──────────┬──────────┘
                             │
                  ┌──────────▼──────────┐
                  │    OBSERVABILITY    │  ← How you SEE it
                  │  (traces, alerts)   │
                  └─────────────────────┘
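
The flow in the diagram can be sketched in a few lines of code. This is only an illustration of the ordering of the planes, assuming a stubbed agent. `Pipeline`, `ChangeProposal`, `run_agent`, and the task names are hypothetical and do not correspond to any real SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChangeProposal:
    """What the runtime hands to change control: a diff, not a deployment."""
    title: str
    diff: str


def run_agent(task: str) -> str:
    # Stand-in for the agent runtime. A real runtime would call an LLM with
    # scoped tools and credentials inside a sandbox.
    return "--- a/main.tf\n+++ b/main.tf\n@@ placeholder diff for " + task + " @@\n"


@dataclass
class Pipeline:
    allowed_tasks: frozenset
    events: list = field(default_factory=list)  # stand-in for trace spans

    def _emit(self, stage: str, detail: str) -> None:
        self.events.append((stage, detail))

    def handle(self, task: str) -> Optional[ChangeProposal]:
        # Policy plane: decide what the agent may do before it does anything.
        if task not in self.allowed_tasks:
            self._emit("policy", "denied: " + task)
            return None
        self._emit("policy", "allowed: " + task)

        # Agent runtime: produce a diff.
        diff = run_agent(task)
        self._emit("runtime", "diff produced, %d bytes" % len(diff))

        # Change control: the diff becomes a proposal for review.
        proposal = ChangeProposal(title="agent: " + task, diff=diff)
        self._emit("change_control", "pull request proposed")
        return proposal


pipeline = Pipeline(allowed_tasks=frozenset({"fix-drift"}))
proposal = pipeline.handle("fix-drift")
denied = pipeline.handle("rotate-root-keys")
print([stage for stage, _ in pipeline.events])
```

The allowed task ends as a proposal that still has to pass review and CI, while the task outside the policy is stopped before the runtime runs. Every stage emits an event either way, which is the role the observability plane plays in the diagram.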

Alternatives Covered

This guide doesn't prescribe a single stack. For each architectural layer, we cover multiple approaches:

Layer	Options Covered
LLM Runtime	Claude Agent SDK, OpenAI Agents SDK / Codex CLI, LangChain/LangGraph, direct API
Task Queue	Redis Streams, BullMQ, AWS SQS, RabbitMQ, Temporal
Sandboxing	Docker, Modal, Azure Container Apps Jobs, AWS Lambda, Firecracker
Credential Store	HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, 1Password
Change Control	GitHub Actions, GitLab CI, Azure Pipelines, Atlantis, Spacelift
Observability	OpenTelemetry + Grafana, Datadog, Dash0, New Relic
Notifications	Slack, Microsoft Teams, PagerDuty, Opsgenie, email, webhooks
Scheduling	Cron (systemd/k8s), Temporal, AWS EventBridge, Azure Timer Triggers
State Storage	PostgreSQL, Redis, Azure Blob, S3, SQLite

Contributing

Found an error? Have a better pattern? Contributions are welcome.

Issues — Open an issue for questions, suggestions, or corrections
Pull Requests — Submit a PR for content improvements or new examples
Discussions — Use GitHub Discussions for broader architectural questions

Please keep contributions focused on patterns and architecture — not vendor-specific marketing.

About

This guide is built by the team at Cloudgeni, where we design, build, and operate autonomous infrastructure agents in production across AWS, Azure, GCP, and OCI for enterprise teams.

Every pattern here comes from running these systems in production. We open-sourced the guide because we kept answering the same architectural questions, and writing the answers down once seemed more useful.

License

The guide text is released under CC BY 4.0. Use it, adapt it, share it — just give credit.

Code snippets are released under MIT.
