
rewyekha/OpsPilot


🛡️ OpsPilot

Autonomous Multi-Agent SRE Command Center

From production incident to confidence-scored root cause in under five minutes.

Microsoft Agents League Azure AI Foundry Reasoning Agents

Publish to GHCR

Python FastAPI LangGraph React TypeScript Docker


Author: Reyas Khan

GitHub Portfolio Microsoft Learn


The problem

When production breaks, an on-call engineer manually pivots across metrics, logs, deployments, and incident history under pressure: a slow, tribal, error-prone scramble while every minute costs revenue and trust.

What OpsPilot does

OpsPilot turns that scramble into an autonomous, evidence-backed investigation. A Commander agent dispatches a team of specialized AI agents that investigate in parallel, correlate their evidence into a causal timeline, determine a confidence-scored root cause, and propose ranked, executable remediations, all streamed live to an Azure-Portal-style command center over Server-Sent Events.

| | Traditional on-call | OpsPilot |
|---|---|---|
| Time to root cause | 30–60+ min | < 5 min |
| Evidence gathering | Serial, manual | Parallel, autonomous |
| Root cause | Tribal guess | Confidence-scored hypothesis |
| Remediation | Ad hoc | Ranked, executable fixes |
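The parallel fan-out and the Server-Sent Events stream can be pictured with a small, self-contained sketch. It is not the project's code: `run_specialist`, `to_sse`, `investigate`, the `agent_finding` event name, and the simulated delays are all hypothetical names and values chosen for illustration. It only shows the general pattern of running specialists concurrently with `asyncio.gather` and framing each result as an SSE message.

```python
import asyncio
import json


async def run_specialist(name: str, delay: float) -> dict:
    """Stand-in for one specialist agent; a real agent would call a model here."""
    await asyncio.sleep(delay)  # simulated investigation time
    return {"agent": name, "finding": f"{name} evidence collected"}


def to_sse(event: str, data: dict) -> str:
    """Frame one payload as a Server-Sent Events message (event + data + blank line)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def investigate() -> list[str]:
    """Run the specialists concurrently and return one SSE frame per finding."""
    specialists = {"metrics": 0.03, "logs": 0.01, "deployment": 0.02}  # hypothetical delays
    tasks = [run_specialist(name, delay) for name, delay in specialists.items()]
    findings = await asyncio.gather(*tasks)  # results keep the order of the inputs
    return [to_sse("agent_finding", finding) for finding in findings]


if __name__ == "__main__":
    for frame in asyncio.run(investigate()):
        print(frame, end="")
```

Because the three coroutines run concurrently, the total wait is roughly the slowest specialist rather than the sum of all three, which is the point of the "Parallel, autonomous" row above.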

Demo

📹 Product walkthrough: a full incident, from detection to resolution.


The differentiator: confidence-gated reasoning escalation

After root-cause analysis, OpsPilot computes a combined confidence score. Only when that score falls below a threshold (default 70) does it route the full context to the o4-mini reasoning model, which re-examines the evidence from first principles and returns a refined root cause plus a complete reasoning trace. Frontier reasoning stays off the hot path unless it is actually needed, which keeps cost and latency in check by design. This is intelligent model routing, not brute force.
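A minimal sketch of the gate described above, assuming the confidence score is on a 0-100 scale and that "falls below threshold" means strictly less than the threshold. The `Diagnosis` type, `resolve`, and `fake_reasoning_model` are hypothetical names for illustration; the stand-in escalator does no real inference.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Diagnosis:
    root_cause: str
    confidence: float                      # combined score, assumed to be 0-100
    reasoning_trace: Optional[str] = None  # filled in only after escalation
    escalated: bool = False


# Any callable that takes the diagnosis plus evidence and returns a refined one.
Escalator = Callable[[Diagnosis, list[str]], Diagnosis]


def resolve(diagnosis: Diagnosis, evidence: list[str],
            escalate: Escalator, threshold: float = 70.0) -> Diagnosis:
    """Return the diagnosis unchanged unless its confidence is below the threshold."""
    if diagnosis.confidence >= threshold:
        return diagnosis                    # fast path: no reasoning-model call
    return escalate(diagnosis, evidence)    # slow path: first-principles re-analysis


def fake_reasoning_model(diagnosis: Diagnosis, evidence: list[str]) -> Diagnosis:
    """Stand-in for the reasoning-model call; it only stitches the evidence together."""
    return Diagnosis(
        root_cause=f"refined: {diagnosis.root_cause}",
        confidence=diagnosis.confidence,
        reasoning_trace=" -> ".join(evidence),
        escalated=True,
    )


if __name__ == "__main__":
    evidence = ["deploy at 12:01", "error spike at 12:03"]
    print(resolve(Diagnosis("bad deploy", 85), evidence, fake_reasoning_model))
    print(resolve(Diagnosis("bad deploy", 55), evidence, fake_reasoning_model))
```

Passing the escalator in as a callable keeps the gate itself free of any model client, so the routing decision can be tested without credentials.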

| Agent | Role | Model |
|---|---|---|
| Commander | Orchestration & synthesis | GPT-4o |
| Metrics / Logs / Deployment | Parallel evidence gathering | GPT-4o-mini |
| Time Machine | Causal event timeline | GPT-4o |
| Root Cause | Confidence-scored hypothesis | GPT-4o |
| Deep Reasoning | First-principles re-analysis (escalation) | o4-mini |
| Recommendation | Ranked, executable remediations | GPT-4o-mini |

Agents are routed by role to configurable Azure AI Foundry deployments and orchestrated as a compiled LangGraph StateGraph with conditional escalation edges. A provider abstraction runs the whole system in zero-credential mock mode, then flips a single env var for live Foundry inference.
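One way to picture the provider abstraction is a small interface with a mock and a live implementation, selected by `EXECUTION_MODE`. The sketch below is an assumption-laden illustration, not the repository's code: the class names, the three role tiers, and the `mock` mode value are hypothetical, while the environment variable names are the ones listed in the live-mode section of this README. The live provider deliberately stops short of any real client call.

```python
import os
from typing import Mapping, Protocol

# Role tier -> (environment variable, fallback deployment name). The tiers are an assumption.
ROLE_ENV_VARS = {
    "commander": ("COMMANDER_MODEL_DEPLOYMENT", "gpt-4o"),
    "specialist": ("SPECIALIST_MODEL_DEPLOYMENT", "gpt-4o-mini"),
    "reasoning": ("REASONING_MODEL_DEPLOYMENT", "o4-mini"),
}


class Provider(Protocol):
    def complete(self, role: str, prompt: str) -> str: ...


class MockProvider:
    """Returns canned text; needs no credentials and no network."""

    def complete(self, role: str, prompt: str) -> str:
        return f"[mock:{role}] {prompt}"


class FoundryProvider:
    """Placeholder for live inference; the actual client call is intentionally omitted."""

    def __init__(self, env: Mapping[str, str]) -> None:
        self.endpoint = env["FOUNDRY_ENDPOINT"]  # required in live mode
        self._env = env

    def deployment_for(self, role: str) -> str:
        var, default = ROLE_ENV_VARS[role]
        return self._env.get(var, default)

    def complete(self, role: str, prompt: str) -> str:
        raise NotImplementedError("wire up an Azure AI Foundry client here")


def get_provider(env: Mapping[str, str] = os.environ) -> Provider:
    """Pick the provider from EXECUTION_MODE; anything other than 'foundry' means mock."""
    if env.get("EXECUTION_MODE", "mock") == "foundry":
        return FoundryProvider(env)
    return MockProvider()


if __name__ == "__main__":
    print(get_provider({}).complete("commander", "summarize the incident"))
```

Agents that depend only on `Provider.complete` never need to know which mode is active, which is what makes a single environment variable enough to switch.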


Run it (for judges)

OpsPilot runs in zero-credential mock mode by default, so no Azure account is needed to evaluate the full experience. Every path opens the app at http://localhost:3000.

First, from the repository root:

cp .env.example .env

The defaults in .env.example select mock mode, so this just works.

Option A: Prebuilt images from GHCR (fastest, no build)

cp .env.example .env
docker compose -f docker-compose.ghcr.yml up

Or pull the images directly:

docker pull ghcr.io/rewyekha/opspilot-backend:latest
docker pull ghcr.io/rewyekha/opspilot-frontend:latest

Option B: Build from source with Docker

cp .env.example .env
docker compose -f docker-compose.yml up --build

Option C: Local dev (hot reload)

# Terminal 1: backend (mock mode, no credentials)
cd backend && python -m uvicorn app.main:app --reload --port 8000
# Terminal 2: frontend, started from the repository root
cd frontend && npm install && npm run dev

Bring your own Azure AI Foundry credentials (live mode)

Edit .env:

EXECUTION_MODE=foundry
FOUNDRY_ENDPOINT=https://<your-resource>.openai.azure.com/
FOUNDRY_API_KEY=<your-key>            # or leave empty to use managed identity
COMMANDER_MODEL_DEPLOYMENT=gpt-4o
SPECIALIST_MODEL_DEPLOYMENT=gpt-4o-mini
REASONING_MODEL_DEPLOYMENT=o4-mini

Then re-run any option above. See backend/.env.example for the full variable list (all optional except FOUNDRY_ENDPOINT in live mode).
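The rules in this section (endpoint required in live mode, empty key meaning managed identity) can be expressed as a small validation step. This is a hedged sketch rather than the backend's actual settings loader: `Settings`, `load_settings`, the `auth` labels, and the `mock` default value are hypothetical, and only the environment variable names come from the block above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    execution_mode: str
    endpoint: Optional[str]
    api_key: Optional[str]

    @property
    def auth(self) -> str:
        """Describe which credential path would be used (labels are illustrative)."""
        if self.execution_mode != "foundry":
            return "none"  # mock mode needs no credentials
        return "api_key" if self.api_key else "managed_identity"


def load_settings(env: Mapping[str, str]) -> Settings:
    """Read the relevant variables and enforce the one hard requirement of live mode."""
    mode = env.get("EXECUTION_MODE", "mock").strip().lower()  # 'mock' default is assumed
    endpoint = env.get("FOUNDRY_ENDPOINT", "").strip() or None
    api_key = env.get("FOUNDRY_API_KEY", "").strip() or None  # empty means managed identity
    if mode == "foundry" and endpoint is None:
        raise ValueError("FOUNDRY_ENDPOINT is required when EXECUTION_MODE=foundry")
    return Settings(mode, endpoint, api_key)


if __name__ == "__main__":
    import os
    print(load_settings(os.environ))
```

Failing fast on a missing endpoint at startup surfaces a misconfigured .env before the first incident is investigated, rather than in the middle of one.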


Container images

| Image | Registry path |
|---|---|
| Backend (FastAPI + LangGraph) | ghcr.io/rewyekha/opspilot-backend |
| Frontend (React + nginx) | ghcr.io/rewyekha/opspilot-frontend |

Built and published on every push to main by .github/workflows/docker-ghcr.yml using the built-in GITHUB_TOKEN, with no external secrets. Tags: latest, the short commit SHA, and semantic-version tags on v* releases.


Repository layout

.
├── backend/            # FastAPI + LangGraph agent runtime (Python 3.12)
├── frontend/           # React + TypeScript + Fluent UI v9 (nginx in prod)
├── infra/              # Azure Bicep infrastructure-as-code
├── evaluation/         # Foundry evaluation datasets and runners
├── demo-workloads/     # Sample apps used to generate real incidents
├── docs/               # architecture diagram assets
├── docker-compose*.yml # local / prebuilt run configurations
└── .github/workflows/  # CI/CD (Docker → GHCR)

Tech stack

React 18 · TypeScript · Fluent UI v9 · FastAPI · Python 3.12 · LangGraph · Azure AI Foundry · Azure OpenAI · Docker · GitHub Actions


Built for the Microsoft Agents League Hackathon · Reasoning Agents Track
