DockOps

Production-grade containerized infrastructure stack demonstrating SRE practices, monitoring, automation, and reliability engineering.

Architecture

                    ┌─────────────────────────────────────────────────────────┐
                    │                     USERS / INTERNET                     │
                    └─────────────────────┬───────────────────────────────────┘
                                          │
                                          ▼
                    ┌─────────────────────────────────────────────────────────┐
                    │              Nginx Reverse Proxy (:80/:443)              │
                    │         rate limiting · gzip · security headers          │
                    └───────────┬─────────────────────────────┬───────────────┘
                                │                             │
                         /api/* │                             │ /*
                                ▼                             ▼
                    ┌───────────────────┐         ┌───────────────────┐
                    │   Flask Backend   │         │  Frontend (HTML)  │
                    │     (:5000)       │         │     (:3000)       │
                    │  prometheus_client│         │  nginx-alpine     │
                    └─────┬───────┬─────┘         └───────────────────┘
                          │       │
                ┌─────────┘       └──────────┐
                ▼                            ▼
    ┌───────────────────┐        ┌───────────────────┐
    │   PostgreSQL 16   │        │     Redis 7       │
    │      (:5432)      │        │     (:6379)       │
    │   internal only   │        │   internal only   │
    └───────────────────┘        └───────────────────┘

    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ MONITORING STACK ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

    ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    │  Prometheus   │  │   Grafana     │  │ Node Exporter │  │   cAdvisor    │
    │   (:9090)     │──│   (:3001)     │  │   (:9100)     │  │   (:8080)     │
    └───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘

Stack

Component	Technology	Purpose
Reverse Proxy	Nginx 1.27	Traffic routing, rate limiting, security headers
Backend	Python Flask + Gunicorn	REST API with Prometheus metrics
Frontend	HTML/CSS/JS + Nginx	Infrastructure monitoring dashboard
Database	PostgreSQL 16	Persistent data storage
Cache	Redis 7	In-memory caching with LRU eviction
Metrics	Prometheus	Time-series metrics collection
Dashboards	Grafana	Visualization and alerting
Host Metrics	Node Exporter	System-level metrics
Container Metrics	cAdvisor	Docker container resource tracking
Automation	Ansible	Server provisioning and deployment
CI/CD	GitHub Actions	Automated testing and deployment

Network Architecture

┌─────────────────────────────────┐
│         frontend-net            │
│  nginx ── frontend ── backend   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│    backend-net (internal)       │
│  backend ── postgres ── redis   │
└─────────────────────────────────┘

┌────────────────────────────────────────────┐
│             monitoring-net                  │
│  backend ── prometheus ── grafana           │
│  node-exporter ── cadvisor                  │
└────────────────────────────────────────────┘

The backend-net network is marked internal: true, so PostgreSQL and Redis publish no ports and have no route to or from external networks. The backend service joins all three networks because it serves API requests, connects to the data stores, and exposes metrics.
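One way to reason about this segmentation is to ask which services share at least one network. The sketch below encodes only the membership shown in the diagrams above; the `NETWORKS` mapping and the helper names are illustrative and not part of the repository.

```python
# Network membership copied from the diagrams above; all names are illustrative.
NETWORKS = {
    "frontend-net": {"nginx", "frontend", "backend"},
    "backend-net": {"backend", "postgres", "redis"},
    "monitoring-net": {"backend", "prometheus", "grafana", "node-exporter", "cadvisor"},
}


def shared_networks(a: str, b: str) -> set:
    """Return the names of the networks that both services are attached to."""
    return {name for name, members in NETWORKS.items() if a in members and b in members}


def can_reach(a: str, b: str) -> bool:
    """Containers can talk to each other directly only if they share a network."""
    return bool(shared_networks(a, b))


print(can_reach("nginx", "backend"))   # nginx and backend share frontend-net
print(can_reach("nginx", "postgres"))  # no shared network
```

Under this model only the backend can reach PostgreSQL and Redis, which matches the intent of keeping the data stores off the proxy-facing network.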

Quick Start

Prerequisites

Docker Engine 24+
Docker Compose v2
Git

Setup

git clone https://github.com/yourusername/DockOps.git
cd DockOps

Edit .env with your credentials if needed, then deploy:

./scripts/deploy.sh

Manual Start

docker compose build
docker compose up -d

Verify

./scripts/healthcheck.sh

Access Points

Service	URL
Dashboard	http://localhost
API Health	http://localhost/api/health
API Status	http://localhost/api/status
Prometheus	http://localhost:9090
Grafana	http://localhost:3001

Default Grafana credentials: admin / graf_s3cur3_2024

Project Structure

DockOps/
├── ansible/
│   ├── inventory
│   ├── playbook.yml
│   ├── group_vars/all.yml
│   └── roles/
│       ├── docker/
│       │   ├── tasks/main.yml
│       │   └── handlers/main.yml
│       ├── deploy/
│       │   ├── tasks/main.yml
│       │   ├── handlers/main.yml
│       │   └── templates/env.j2
│       └── monitoring/
│           ├── tasks/main.yml
│           └── handlers/main.yml
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── gunicorn.conf.py
│   ├── wsgi.py
│   ├── app/
│   │   ├── __init__.py
│   │   ├── routes.py
│   │   ├── database.py
│   │   └── cache.py
│   └── tests/
│       └── test_api.py
├── frontend/
│   ├── Dockerfile
│   ├── nginx.conf
│   └── src/
│       ├── index.html
│       ├── style.css
│       └── app.js
├── nginx/
│   ├── nginx.conf
│   └── conf.d/default.conf
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── alert_rules.yml
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/infrastructure.json
├── scripts/
│   ├── deploy.sh
│   ├── healthcheck.sh
│   ├── cleanup.sh
│   ├── monitor.sh
│   └── chaos-test.sh
├── .github/workflows/deploy.yml
├── docker-compose.yml
├── .env
├── .gitignore
└── README.md

Monitoring

Prometheus Targets

Prometheus scrapes four targets on a 15-second interval:

backend — HTTP request counters, latency histograms, active connections
node-exporter — Host CPU, memory, disk, network metrics
cadvisor — Per-container CPU, memory, network, filesystem usage
self — Prometheus internal metrics
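To make the scrape target concrete, here is a rough sketch of what a labelled request counter looks like in the Prometheus text exposition format. The backend itself uses prometheus_client; this hand-rolled `RequestCounter` class and the metric name `http_requests_total` are stand-ins chosen for illustration, not the project's actual metric definitions.

```python
from collections import Counter


class RequestCounter:
    """Hand-rolled stand-in for a labelled counter (illustration only)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = Counter()  # maps a sorted tuple of label pairs to a count

    def inc(self, **labels: str) -> None:
        """Increment the series identified by this label combination."""
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self) -> str:
        """Render all series in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{self.name}{{{label_text}}} {value}")
        return "\n".join(lines) + "\n"


requests_total = RequestCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="500")
print(requests_total.render())
```

Prometheus fetches text like this from each target on every scrape and stores one time series per distinct label combination.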

Grafana Dashboard

A pre-provisioned dashboard (DockOps Infrastructure) includes:

HTTP request rate by status code
p50/p95 request latency
Host CPU and memory gauges with threshold coloring
Container CPU and memory time series
Host network I/O
Service uptime status

Alert Rules

Alert	Condition	Severity
BackendDown	Backend unreachable for 30s	Critical
HighRequestLatency	p95 > 2s for 1m	Warning
HighErrorRate	5xx rate > 5% for 2m	Critical
HostHighCPU	CPU > 85% for 5m	Warning
HostHighMemory	Available < 10% for 5m	Critical
ContainerRestarting	> 3 restarts in 15m	Warning
DiskSpaceLow	Root FS < 15% free for 5m	Warning
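Each rule in the table pairs a condition with a hold duration ("for 30s", "for 5m"), so an alert fires only after the condition has held continuously for that long. The sketch below models that behaviour in plain Python; `PendingAlert` is a hypothetical helper, not Prometheus code, and the real rules live in alert_rules.yml.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Tracks one alert with a hold duration, similar in spirit to a `for:` clause."""

    name: str
    hold_seconds: float
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, condition: bool, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition:
            self.active_since = None  # a single false sample resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


alert = PendingAlert("BackendDown", hold_seconds=30)
for t in (0, 15, 30):
    print(t, alert.evaluate(condition=True, now=t))
```

With evaluations 15 seconds apart, the alert stays pending through the first two evaluations and fires on the third; a brief blip that recovers in between never reaches the firing state.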

Reliability

Self-Healing

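Every service has restart: always and a Docker health check. When a container's main process crashes or exits, Docker restarts it automatically. A failing health check marks the container unhealthy; under plain Docker Compose that status alone does not trigger a restart, so it needs Swarm mode or an external watcher to act on it. The backend's health check verifies database and Redis connectivity, so the container is marked healthy only when its dependencies are reachable.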
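A dependency-aware health check of the kind described above could be structured roughly as follows. This is a sketch under assumptions: `health_report` and the check names are hypothetical, and the real logic lives in the backend's routes, database, and cache modules.

```python
from typing import Callable, Dict, Tuple


def health_report(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each dependency check and return an HTTP status code plus a JSON-ready body.

    A check signals failure by raising any exception.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing the endpoint
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = 200 if healthy else 503
    return status, {"status": "healthy" if healthy else "unhealthy", "checks": results}


def ping_database() -> None:
    """Stand-in for a real query such as SELECT 1."""


def ping_cache() -> None:
    """Stand-in for a Redis PING that fails."""
    raise ConnectionError("redis unreachable")


print(health_report({"database": ping_database, "redis": ping_cache}))
```

Returning 503 when any dependency fails lets a Docker health check that requests the endpoint treat the container as unhealthy without parsing the body.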

Chaos Testing

./scripts/chaos-test.sh

This script:

Verifies the backend is healthy
Kills the backend container with docker kill
Confirms the service is unreachable
Measures time until Docker auto-restarts the container
Validates the restored health endpoint

Typical recovery time is under 30 seconds.
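The recovery measurement boils down to polling the health endpoint until it answers again. The script itself is shell; the Python sketch below only illustrates the idea, and `measure_recovery` and `http_probe` are hypothetical names. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 within two seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP error responses
        return False


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until is_healthy() returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A typical call would be `measure_recovery(lambda: http_probe("http://localhost/api/health"))`, started right after the container is killed.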

Graceful Shutdown

The backend uses STOPSIGNAL SIGTERM and Gunicorn's graceful_timeout of 30 seconds, allowing in-flight requests to complete before the worker process exits.

Ansible Deployment

For deploying to a remote Ubuntu server:

cd ansible/

# Update inventory with your server IP
vim inventory

# Run the full playbook
ansible-playbook -i inventory playbook.yml

The playbook executes three roles in order:

docker — Installs Docker CE, Compose plugin, configures user permissions
deploy — Clones the repo, generates .env, builds and starts the stack
monitoring — Tunes kernel parameters, configures Docker log rotation, verifies Prometheus scraping

Security

Database and Redis are on an internal-only network with no port bindings
Backend runs as a non-root user (dockops)
Nginx adds X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Content-Security-Policy, and Referrer-Policy headers
Prometheus /metrics endpoint is restricted to internal Docker CIDR ranges
Rate limiting on both API (30 req/s) and general (60 req/s) traffic
Secrets stored in .env (gitignored) and injected via environment variables
server_tokens off hides Nginx version
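For intuition about the rate limits listed above, the sketch below models a leaky-bucket limiter in Python. It assumes the limits are enforced with Nginx's limit_req module, whose documentation describes a leaky-bucket method; the `RateLimiter` class and the burst values are illustrative, since the actual Nginx configuration is not shown here.

```python
from typing import Optional


class RateLimiter:
    """Leaky-bucket limiter: a steady `rate` per second plus an optional burst allowance."""

    def __init__(self, rate: float, burst: int = 0) -> None:
        self.rate = rate
        self.burst = burst
        self.excess = 0.0                  # requests currently above the steady rate
        self.last: Optional[float] = None  # time of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True
        # The bucket drains at `rate` requests per second; this request adds one.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the bucket state
        self.excess = excess
        self.last = now
        return True


api_limiter = RateLimiter(rate=30)  # the API limit from the list above
print(api_limiter.allow(0.00))  # first request is accepted
print(api_limiter.allow(0.01))  # too soon after the previous one
print(api_limiter.allow(0.05))  # enough time has passed
```

With no burst allowance, 30 req/s effectively means one request every 33 ms or so per key; a burst value lets short spikes through while keeping the long-run average at the configured rate.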

Scaling

Horizontal Backend Scaling

# docker-compose.yml
services:
  backend:
    deploy:
      replicas: 3

Nginx already proxies API traffic through an upstream block named backend_pool, so additional backend replicas are load-balanced automatically.

Vertical Scaling

The Gunicorn configuration sizes the worker pool from the CPU core count (workers = cpu_count * 2 + 1). Redis is configured with a 128 MB memory ceiling and LRU eviction.
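A Gunicorn configuration file is plain Python, so the worker formula can be written directly in it. The snippet below is a sketch of what gunicorn.conf.py might contain, based only on the values mentioned in this README; the `worker_count` helper and the bind address are assumptions, not the actual file.

```python
# Sketch of a gunicorn.conf.py; values follow this README, not the actual file.
import multiprocessing


def worker_count(cpu_count: int) -> int:
    """Apply the (2 x cores) + 1 rule described above."""
    return cpu_count * 2 + 1


workers = worker_count(multiprocessing.cpu_count())
graceful_timeout = 30  # seconds a worker gets to finish in-flight requests
bind = "0.0.0.0:5000"  # assumption: matches the :5000 backend port in the diagram
```

On a 4-core host this yields 9 workers. Each worker is a separate process, so memory use grows with the core count when scaling vertically.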

Future Scaling

Add Traefik or HAProxy for service discovery
Move to Docker Swarm or Kubernetes for multi-node orchestration
Add read replicas for PostgreSQL
Implement Redis Sentinel for cache HA

CI/CD Pipeline

The GitHub Actions workflow runs on every push to main:

Lint — flake8 for Python, yamllint for YAML, docker compose config validation
Test — pytest on backend unit tests
Build — Builds both Docker images, verifies non-root user configuration
Integration — Spins up core services, waits for healthy backend, validates API endpoints

Troubleshooting

Backend won't start

docker compose logs backend
docker compose exec backend python -c "import psycopg2; print('pg ok')"

Check that PostgreSQL is healthy first:

docker compose exec postgres pg_isready

Prometheus shows targets as DOWN

Verify the backend is on the monitoring network:

docker network inspect dockops_monitoring-net

Grafana shows "No data"

Confirm the Prometheus datasource URL is http://prometheus:9090 (container name, not localhost).

Port conflicts

If port 80, 3001, or 9090 is already in use:

# Edit .env
NGINX_PORT=8080

Full reset

./scripts/cleanup.sh --full
./scripts/deploy.sh

Helper Scripts

Script	Purpose
`scripts/deploy.sh`	Build and launch the entire stack
`scripts/healthcheck.sh`	Verify all containers and endpoints
`scripts/cleanup.sh`	Stop and optionally purge everything
`scripts/monitor.sh`	Live terminal dashboard of container stats
`scripts/chaos-test.sh`	Kill backend and measure auto-recovery

License

MIT
