From production incident to confidence-scored root cause in under five minutes.
Author: Reyas Khan
When production breaks, an on-call engineer manually pivots across metrics, logs, deployments, and incident history under pressure: a slow, tribal, error-prone scramble while every minute costs revenue and trust.
OpsPilot turns that scramble into an autonomous, evidence-backed investigation. A Commander agent dispatches a team of specialized AI agents that investigate in parallel, correlate their evidence into a causal timeline, determine a confidence-scored root cause, and propose ranked, executable remediations, all streamed live to an Azure-Portal-style command centre over Server-Sent Events.
| | Traditional on-call | OpsPilot |
|---|---|---|
| Time to root cause | 30–60+ min | < 5 min |
| Evidence gathering | Serial, manual | Parallel, autonomous |
| Root cause | Tribal guess | Confidence-scored hypothesis |
| Remediation | Ad hoc | Ranked, executable fixes |
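The live streaming mentioned above relies on Server-Sent Events, whose wire format is a few `field: value` lines followed by a blank line. The sketch below shows how one agent update could be encoded; the event name `agent_update` and the payload shape are hypothetical, not taken from the OpsPilot codebase.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Encode one message in the text/event-stream wire format."""
    # json.dumps without `indent` never emits newlines, so the payload
    # fits on a single `data:` line.
    data = json.dumps(payload, separators=(",", ":"))
    # A blank line terminates the event.
    return f"event: {event}\ndata: {data}\n\n"


# Hypothetical event name and payload, for illustration only.
message = format_sse("agent_update", {"agent": "metrics", "status": "done"})
```

A browser `EventSource` subscribed to the stream would dispatch this message to listeners registered for the `agent_update` event type.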
Product walkthrough: a full incident, from detection to resolution.
After root-cause analysis, OpsPilot computes a combined confidence score. Only when that score falls below a threshold (default 70) does it route the full context to the o4-mini reasoning model, which re-examines the evidence from first principles and returns a refined root cause plus a complete reasoning trace. Frontier reasoning stays off the hot path unless it is actually needed, which keeps cost and latency under control by design. This is intelligent model routing, not brute force.
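The escalation rule can be sketched as a small routing function. This is a minimal illustration, assuming a 0-100 confidence scale and hypothetical node names (`deep_reasoning`, `recommendation`); only the default threshold of 70 comes from the description above, and how the combined score is computed is not shown.

```python
from dataclasses import dataclass

# The default of 70 is from the text; the 0-100 scale is an assumption.
DEFAULT_CONFIDENCE_THRESHOLD = 70.0


@dataclass
class RootCauseResult:
    hypothesis: str
    confidence: float  # combined confidence score, assumed 0-100


def next_step(result: RootCauseResult,
              threshold: float = DEFAULT_CONFIDENCE_THRESHOLD) -> str:
    """Return the (hypothetical) name of the node to run after root-cause analysis."""
    # Escalate to the reasoning model only when confidence is below the threshold.
    if result.confidence < threshold:
        return "deep_reasoning"
    return "recommendation"
```

A function of this shape is what a graph framework's conditional edge would typically call: the return value names the branch to take, so the reasoning model is only invoked on low-confidence incidents.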
| Agent | Role | Model |
|---|---|---|
| Commander | Orchestration & synthesis | GPT-4o |
| Metrics / Logs / Deployment | Parallel evidence gathering | GPT-4o-mini |
| Time Machine | Causal event timeline | GPT-4o |
| Root Cause | Confidence-scored hypothesis | GPT-4o |
| Deep Reasoning | First-principles re-analysis (escalation) | o4-mini |
| Recommendation | Ranked, executable remediations | GPT-4o-mini |
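The parallel evidence-gathering row in the table can be illustrated with `asyncio.gather`. This is a sketch under stated assumptions: `run_specialist` is a hypothetical stand-in that returns an empty result instead of calling a model, and the result shape is invented for illustration.

```python
import asyncio


async def run_specialist(name: str, incident_id: str) -> dict:
    """Hypothetical stand-in for one specialist agent (no model call is made)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return {"agent": name, "incident": incident_id, "findings": []}


async def gather_evidence(incident_id: str) -> list:
    """Run the three evidence agents concurrently and collect their results."""
    specialists = ["metrics", "logs", "deployment"]
    # gather() schedules all coroutines at once and returns results
    # in the same order as its arguments.
    return await asyncio.gather(
        *(run_specialist(name, incident_id) for name in specialists)
    )


results = asyncio.run(gather_evidence("INC-001"))
```

Because the three agents do not depend on each other, total wall-clock time is bounded by the slowest agent rather than the sum of all three.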
Agents are routed by role to configurable Azure AI Foundry deployments and
orchestrated as a compiled LangGraph StateGraph with conditional escalation
edges. A provider abstraction runs the whole system in zero-credential mock
mode, then flips a single env var for live Foundry inference.
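One way to picture the provider abstraction is a small factory keyed on the `EXECUTION_MODE` variable. The class and function names below are hypothetical, the `"mock"` default value is an assumption (the text only confirms `foundry` as the live value), and the live client call is deliberately left out.

```python
import os
from typing import Mapping, Optional, Protocol


class ModelProvider(Protocol):
    def complete(self, deployment: str, prompt: str) -> str: ...


class MockProvider:
    """Returns canned text, so no credentials or network access are needed."""

    def complete(self, deployment: str, prompt: str) -> str:
        return f"[mock:{deployment}] canned response"


class FoundryProvider:
    """Placeholder for live inference; the real client call is not shown."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint

    def complete(self, deployment: str, prompt: str) -> str:
        raise NotImplementedError("live Foundry call omitted from this sketch")


def make_provider(env: Optional[Mapping[str, str]] = None) -> ModelProvider:
    """Pick a provider from EXECUTION_MODE; anything but 'foundry' means mock."""
    env = os.environ if env is None else env
    mode = env.get("EXECUTION_MODE", "mock")  # assumed default value
    if mode == "foundry":
        return FoundryProvider(env["FOUNDRY_ENDPOINT"])
    return MockProvider()
```

Agents that depend only on `ModelProvider.complete` stay unchanged when the environment variable is flipped, which is the property the paragraph above describes.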
OpsPilot runs in zero-credential mock mode by default, so no Azure account is needed to evaluate the full experience. Every path opens the app at http://localhost:3000.
First, from the repository root:
    cp .env.example .env

The defaults in .env.example select mock mode, so this just works.

To run the prebuilt images:

    docker compose -f docker-compose.ghcr.yml up

Or pull the images directly:

    docker pull ghcr.io/rewyekha/opspilot-backend:latest
    docker pull ghcr.io/rewyekha/opspilot-frontend:latest

To build the images locally instead:

    docker compose -f docker-compose.yml up --build

To run without Docker, start the backend and the frontend in separate terminals, each from the repository root:

    cd backend && python -m uvicorn app.main:app --reload --port 8000  # mock mode, no credentials
    cd frontend && npm install && npm run dev

To switch to live Foundry inference, edit .env:

    EXECUTION_MODE=foundry
    FOUNDRY_ENDPOINT=https://<your-resource>.openai.azure.com/
    FOUNDRY_API_KEY=<your-key>  # or leave empty to use managed identity
    COMMANDER_MODEL_DEPLOYMENT=gpt-4o
    SPECIALIST_MODEL_DEPLOYMENT=gpt-4o-mini
    REASONING_MODEL_DEPLOYMENT=o4-mini

Then re-run any option above. See backend/.env.example for the full variable list (all optional except FOUNDRY_ENDPOINT in live mode).
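The rules stated for these variables (endpoint required only in live mode, empty API key meaning managed identity) can be captured in a small loader. This is a sketch, not the project's actual settings code: the `Settings` class, the `load_settings` name, the `"mock"` default, and the fallback deployment names are all assumptions that mirror the example values above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    execution_mode: str
    foundry_endpoint: Optional[str]
    foundry_api_key: Optional[str]  # None means "use managed identity"
    commander_deployment: str
    specialist_deployment: str
    reasoning_deployment: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build settings from environment-style key/value pairs."""
    mode = env.get("EXECUTION_MODE", "mock")  # assumed default value
    endpoint = env.get("FOUNDRY_ENDPOINT") or None
    # The endpoint is the only variable required in live mode.
    if mode == "foundry" and endpoint is None:
        raise ValueError("FOUNDRY_ENDPOINT is required when EXECUTION_MODE=foundry")
    return Settings(
        execution_mode=mode,
        foundry_endpoint=endpoint,
        foundry_api_key=env.get("FOUNDRY_API_KEY") or None,  # empty string -> None
        # Fallback names below are assumptions copied from the example values.
        commander_deployment=env.get("COMMANDER_MODEL_DEPLOYMENT", "gpt-4o"),
        specialist_deployment=env.get("SPECIALIST_MODEL_DEPLOYMENT", "gpt-4o-mini"),
        reasoning_deployment=env.get("REASONING_MODEL_DEPLOYMENT", "o4-mini"),
    )
```

Calling `load_settings(os.environ)` at startup would surface a missing endpoint immediately instead of on the first model call.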
| Image | Registry path |
|---|---|
| Backend (FastAPI + LangGraph) | ghcr.io/rewyekha/opspilot-backend |
| Frontend (React + nginx) | ghcr.io/rewyekha/opspilot-frontend |
Built and published on every push to main by
.github/workflows/docker-ghcr.yml using the
built-in GITHUB_TOKEN, so no external secrets are needed. Tags: latest, the short commit
SHA, and semantic-version tags on v* releases.
    .
    ├── backend/              # FastAPI + LangGraph agent runtime (Python 3.12)
    ├── frontend/             # React + TypeScript + Fluent UI v9 (nginx in prod)
    ├── infra/                # Azure Bicep infrastructure-as-code
    ├── evaluation/           # Foundry evaluation datasets and runners
    ├── demo-workloads/       # Sample apps used to generate real incidents
    ├── docs/                 # Architecture diagram assets
    ├── docker-compose*.yml   # Local / prebuilt run configurations
    └── .github/workflows/    # CI/CD (Docker → GHCR)
React 18 · TypeScript · Fluent UI v9 · FastAPI · Python 3.12 ·
LangGraph · Azure AI Foundry · Azure OpenAI · Docker · GitHub Actions
Built for the Microsoft Agents League Hackathon · Reasoning Agents Track
