Production-grade containerized infrastructure stack demonstrating SRE practices, monitoring, automation, and reliability engineering.
            ┌─────────────────────────────────────────────────────────┐
            │                    USERS / INTERNET                     │
            └─────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
            ┌─────────────────────────────────────────────────────────┐
            │             Nginx Reverse Proxy (:80/:443)              │
            │         rate limiting · gzip · security headers         │
            └───────────┬─────────────────────────────┬───────────────┘
                        │                             │
                 /api/* │                             │ /*
                        ▼                             ▼
              ┌───────────────────┐         ┌───────────────────┐
              │   Flask Backend   │         │  Frontend (HTML)  │
              │      (:5000)      │         │      (:3000)      │
              │ prometheus_client │         │   nginx-alpine    │
              └─────┬───────┬─────┘         └───────────────────┘
                    │       │
          ┌─────────┘       └──────────┐
          ▼                            ▼
┌───────────────────┐        ┌───────────────────┐
│   PostgreSQL 16   │        │      Redis 7      │
│      (:5432)      │        │      (:6379)      │
│   internal only   │        │   internal only   │
└───────────────────┘        └───────────────────┘

─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ MONITORING STACK ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│  Prometheus   │  │    Grafana    │  │ Node Exporter │  │   cAdvisor    │
│    (:9090)    │──│    (:3001)    │  │    (:9100)    │  │    (:8080)    │
└───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘

| Component | Technology | Purpose |
|---|---|---|
| Reverse Proxy | Nginx 1.27 | Traffic routing, rate limiting, security headers |
| Backend | Python Flask + Gunicorn | REST API with Prometheus metrics |
| Frontend | HTML/CSS/JS + Nginx | Infrastructure monitoring dashboard |
| Database | PostgreSQL 16 | Persistent data storage |
| Cache | Redis 7 | In-memory caching with LRU eviction |
| Metrics | Prometheus | Time-series metrics collection |
| Dashboards | Grafana | Visualization and alerting |
| Host Metrics | Node Exporter | System-level metrics |
| Container Metrics | cAdvisor | Docker container resource tracking |
| Automation | Ansible | Server provisioning and deployment |
| CI/CD | GitHub Actions | Automated testing and deployment |
┌─────────────────────────────────┐
│          frontend-net           │
│  nginx ── frontend ── backend   │
└─────────────────────────────────┘
┌─────────────────────────────────┐
│     backend-net (internal)      │
│  backend ── postgres ── redis   │
└─────────────────────────────────┘
┌────────────────────────────────────────────┐
│               monitoring-net               │
│      backend ── prometheus ── grafana      │
│         node-exporter ── cadvisor          │
└────────────────────────────────────────────┘

The backend-net network is marked as internal: true, which means PostgreSQL and Redis have zero exposure to the host or internet. The backend service bridges all three networks since it needs to serve API requests, connect to data stores, and expose metrics.
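
The segmentation described above can be checked with a small model. The following is only a sketch: the membership sets are copied from the diagrams, the `internal` flags for frontend-net and monitoring-net are assumed to be false because only backend-net is described as internal, and the helper names are hypothetical.

```python
# Network membership as drawn in the network diagrams (assumed, not read from
# docker-compose.yml). Only backend-net is described as internal.
NETWORKS = {
    "frontend-net": {"internal": False, "services": {"nginx", "frontend", "backend"}},
    "backend-net": {"internal": True, "services": {"backend", "postgres", "redis"}},
    "monitoring-net": {
        "internal": False,
        "services": {"backend", "prometheus", "grafana", "node-exporter", "cadvisor"},
    },
}


def networks_of(service: str) -> set[str]:
    """Return the names of every network the service is attached to."""
    return {name for name, net in NETWORKS.items() if service in net["services"]}


def isolated_services() -> set[str]:
    """Return services attached only to networks flagged as internal."""
    all_services = set().union(*(net["services"] for net in NETWORKS.values()))
    return {
        svc
        for svc in all_services
        if all(NETWORKS[name]["internal"] for name in networks_of(svc))
    }


def can_reach(a: str, b: str) -> bool:
    """Two services can talk directly only if they share at least one network."""
    return bool(networks_of(a) & networks_of(b))


if __name__ == "__main__":
    print(sorted(isolated_services()))
    print(can_reach("nginx", "postgres"))
    print(can_reach("backend", "redis"))
```

In this model only postgres and redis are isolated, nginx has no direct path to either data store, and backend is the single service attached to all three networks.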
- Docker Engine 24+
- Docker Compose v2
- Git
git clone https://github.com/yourusername/DockOps.git
cd DockOps

Edit .env with your credentials if needed, then deploy:

./scripts/deploy.sh

Alternatively, build and start the stack manually:

docker compose build
docker compose up -d

Verify the deployment:

./scripts/healthcheck.sh

| Service | URL |
|---|---|
| Dashboard | http://localhost |
| API Health | http://localhost/api/health |
| API Status | http://localhost/api/status |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3001 |
Default Grafana credentials: admin / graf_s3cur3_2024
DockOps/
├── ansible/
│ ├── inventory
│ ├── playbook.yml
│ ├── group_vars/all.yml
│ └── roles/
│ ├── docker/
│ │ ├── tasks/main.yml
│ │ └── handlers/main.yml
│ ├── deploy/
│ │ ├── tasks/main.yml
│ │ ├── handlers/main.yml
│ │ └── templates/env.j2
│ └── monitoring/
│ ├── tasks/main.yml
│ └── handlers/main.yml
├── backend/
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── gunicorn.conf.py
│ ├── wsgi.py
│ ├── app/
│ │ ├── __init__.py
│ │ ├── routes.py
│ │ ├── database.py
│ │ └── cache.py
│ └── tests/
│ └── test_api.py
├── frontend/
│ ├── Dockerfile
│ ├── nginx.conf
│ └── src/
│ ├── index.html
│ ├── style.css
│ └── app.js
├── nginx/
│ ├── nginx.conf
│ └── conf.d/default.conf
├── monitoring/
│ ├── prometheus/
│ │ ├── prometheus.yml
│ │ └── alert_rules.yml
│ └── grafana/
│ ├── provisioning/
│ │ ├── datasources/datasource.yml
│ │ └── dashboards/dashboard.yml
│ └── dashboards/infrastructure.json
├── scripts/
│ ├── deploy.sh
│ ├── healthcheck.sh
│ ├── cleanup.sh
│ ├── monitor.sh
│ └── chaos-test.sh
├── .github/workflows/deploy.yml
├── docker-compose.yml
├── .env
├── .gitignore
└── README.md
Prometheus scrapes four targets on a 15-second interval:
- backend — HTTP request counters, latency histograms, active connections
- node-exporter — Host CPU, memory, disk, network metrics
- cadvisor — Per-container CPU, memory, network, filesystem usage
- self — Prometheus internal metrics
A pre-provisioned dashboard (DockOps Infrastructure) includes:
- HTTP request rate by status code
- p50/p95 request latency
- Host CPU and memory gauges with threshold coloring
- Container CPU and memory time series
- Host network I/O
- Service uptime status
| Alert | Condition | Severity |
|---|---|---|
| BackendDown | Backend unreachable for 30s | Critical |
| HighRequestLatency | p95 > 2s for 1m | Warning |
| HighErrorRate | 5xx rate > 5% for 2m | Critical |
| HostHighCPU | CPU > 85% for 5m | Warning |
| HostHighMemory | Available < 10% for 5m | Critical |
| ContainerRestarting | > 3 restarts in 15m | Warning |
| DiskSpaceLow | Root FS < 15% free for 5m | Warning |
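
To make two of the thresholds concrete, here is a rough Python sketch of the HighErrorRate and HighRequestLatency conditions evaluated over raw samples. It is an illustration only: the real rules live in alert_rules.yml as PromQL, the "for" durations from the table are ignored here, and the function names are hypothetical.

```python
import math


def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses with a 5xx status code (0.0 for no traffic)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def firing_alerts(status_codes: list[int], latencies_s: list[float]) -> list[str]:
    """Return the names of the alerts whose threshold is exceeded."""
    alerts = []
    if error_rate(status_codes) > 0.05:  # HighErrorRate: 5xx rate > 5%
        alerts.append("HighErrorRate")
    if latencies_s and percentile(latencies_s, 95) > 2.0:  # p95 > 2s
        alerts.append("HighRequestLatency")
    return alerts
```

Prometheus estimates quantiles from histogram buckets rather than from raw samples, so its p95 can differ slightly from the nearest-rank value computed here.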
Every service has restart: always and a Docker health check. When a container's main process exits, Docker restarts it automatically. On plain Docker Engine, a failing health check by itself only marks the container as unhealthy; it does not trigger a restart. The backend's health check verifies database and Redis connectivity, so the container is marked healthy only when its dependencies are reachable.
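
A dependency-aware health check of this kind might be structured as follows. This is a minimal sketch, not the code in routes.py: the `health_report` name, the response shape, and the convention that a check signals failure by raising are all assumptions.

```python
from typing import Callable


def health_report(checks: dict[str, Callable[[], None]]) -> tuple[int, dict]:
    """Run each dependency check and return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            check()  # by convention here, a check signals failure by raising
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    # 503 makes the failure visible to anything that only looks at the status code
    return (200 if healthy else 503), body
```

A Flask route could call it with something like `health_report({"database": ping_db, "redis": ping_redis})`, where `ping_db` and `ping_redis` are hypothetical helpers that run `SELECT 1` and `PING` respectively.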
./scripts/chaos-test.sh

This script:
- Verifies the backend is healthy
- Kills the backend container with docker kill
- Confirms the service is unreachable
- Measures time until Docker auto-restarts the container
- Validates the restored health endpoint
Typical recovery time is under 30 seconds.
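
The measurement step can be expressed in a few lines of Python. This is a sketch of the idea rather than a port of chaos-test.sh: the function names, the 0.5-second polling interval, and the 60-second timeout are assumptions; only the health URL comes from the table of endpoints.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str = "http://localhost/api/health") -> bool:
    """Return True if the endpoint answers 200; connection errors count as down."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until is_healthy() is True; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while True:
        if is_healthy():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Calling `measure_recovery(http_probe)` right after the container is killed returns the recovery time in seconds, with a resolution no better than the polling interval.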
The backend uses STOPSIGNAL SIGTERM and Gunicorn's graceful_timeout of 30 seconds, allowing in-flight requests to complete before the worker process exits.
For deploying to a remote Ubuntu server:
cd ansible/
# Update inventory with your server IP
vim inventory
# Run the full playbook
ansible-playbook -i inventory playbook.yml

The playbook executes three roles in order:
- docker: Installs Docker CE and the Compose plugin, configures user permissions
- deploy: Clones the repo, generates .env, builds and starts the stack
- monitoring: Tunes kernel parameters, configures Docker log rotation, verifies Prometheus scraping

- Database and Redis are on an internal-only network with no port bindings
- Backend runs as a non-root user (dockops)
- Nginx adds X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Content-Security-Policy, and Referrer-Policy headers
- Prometheus /metrics endpoint is restricted to internal Docker CIDR ranges
- Rate limiting on both API (30 req/s) and general (60 req/s) traffic
- Secrets stored in .env (gitignored) and injected via environment variables
- server_tokens off hides the Nginx version
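
The CIDR restriction on /metrics is enforced in the Nginx configuration, but the check itself is easy to illustrate in Python. In this sketch the allowed ranges are an assumption (the private RFC 1918 blocks that Docker bridge networks typically draw from), and `metrics_allowed` is a hypothetical helper, not part of the project.

```python
import ipaddress

# Assumed allow-list: private ranges commonly used by Docker networks.
# The ranges actually configured in nginx/conf.d/default.conf may differ.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def metrics_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside an allowed CIDR range."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable addresses are rejected
    return any(addr in net for net in ALLOWED_NETWORKS)
```

With these ranges, a scrape from a container at 172.18.0.5 is accepted, while a request from a public address such as 8.8.8.8 is rejected.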
# docker-compose.yml
backend:
  deploy:
    replicas: 3

The Nginx upstream block already uses upstream backend_pool, so additional backend replicas are automatically load-balanced.

Gunicorn auto-calculates workers based on CPU cores (workers = cpu_count * 2 + 1). Redis is configured with a 128MB memory ceiling and LRU eviction.
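
A Gunicorn configuration file is plain Python, so the worker formula and the 30-second graceful timeout can be written directly. The sketch below shows what such a gunicorn.conf.py could look like; the bind address is an assumption based on the backend port in the architecture diagram, and the project's actual file may set more options.

```python
# gunicorn.conf.py (illustrative sketch, not the project's actual file)
import multiprocessing

# Assumed bind address: port 5000 is the backend port in the architecture diagram.
bind = "0.0.0.0:5000"

# Worker count derived from the number of CPU cores: cpu_count * 2 + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker gets to finish in-flight requests after a shutdown signal.
graceful_timeout = 30
```

On a 4-core host this yields 9 workers. Note that `cpu_count()` reports the host's cores even when the container has a CPU limit, so a fixed value may be preferable under tight limits.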
- Add Traefik or HAProxy for service discovery
- Move to Docker Swarm or Kubernetes for multi-node orchestration
- Add read replicas for PostgreSQL
- Implement Redis Sentinel for cache HA
The GitHub Actions workflow runs on every push to main:
- Lint: flake8 for Python, yamllint for YAML, docker compose config validation
- Test: pytest on backend unit tests
- Build: Builds both Docker images, verifies non-root user configuration
- Integration: Spins up core services, waits for healthy backend, validates API endpoints

docker compose logs backend
docker compose exec backend python -c "import psycopg2; print('pg ok')"

Check that PostgreSQL is healthy first:

docker compose exec postgres pg_isready

Verify the backend is on the monitoring network:

docker network inspect dockops_monitoring-net

Confirm the Prometheus datasource URL is http://prometheus:9090 (container name, not localhost).

If port 80, 3001, or 9090 is already in use:

# Edit .env
NGINX_PORT=8080

To purge everything and redeploy from scratch:

./scripts/cleanup.sh --full
./scripts/deploy.sh

| Script | Purpose |
|---|---|
| scripts/deploy.sh | Build and launch the entire stack |
| scripts/healthcheck.sh | Verify all containers and endpoints |
| scripts/cleanup.sh | Stop and optionally purge everything |
| scripts/monitor.sh | Live terminal dashboard of container stats |
| scripts/chaos-test.sh | Kill backend and measure auto-recovery |

MIT