DockOps

Production-grade containerized infrastructure stack demonstrating SRE practices, monitoring, automation, and reliability engineering.

Architecture

                    ┌─────────────────────────────────────────────────────────┐
                    │                     USERS / INTERNET                     │
                    └─────────────────────┬───────────────────────────────────┘
                                          │
                                          ▼
                    ┌─────────────────────────────────────────────────────────┐
                    │              Nginx Reverse Proxy (:80/:443)              │
                    │         rate limiting · gzip · security headers          │
                    └───────────┬─────────────────────────────┬───────────────┘
                                │                             │
                         /api/* │                             │ /*
                                ▼                             ▼
                    ┌───────────────────┐         ┌───────────────────┐
                    │   Flask Backend   │         │  Frontend (HTML)  │
                    │     (:5000)       │         │     (:3000)       │
                    │  prometheus_client│         │  nginx-alpine     │
                    └─────┬───────┬─────┘         └───────────────────┘
                          │       │
                ┌─────────┘       └──────────┐
                ▼                            ▼
    ┌───────────────────┐        ┌───────────────────┐
    │    PostgreSQL 16   │        │     Redis 7       │
    │      (:5432)       │        │     (:6379)       │
    │   internal only    │        │   internal only   │
    └───────────────────┘        └───────────────────┘

    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ MONITORING STACK ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

    ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    │  Prometheus   │  │   Grafana     │  │ Node Exporter │  │   cAdvisor    │
    │   (:9090)     │──│   (:3001)     │  │   (:9100)     │  │   (:8080)     │
    └───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘

Stack

| Component         | Technology              | Purpose                                           |
|-------------------|-------------------------|---------------------------------------------------|
| Reverse Proxy     | Nginx 1.27              | Traffic routing, rate limiting, security headers  |
| Backend           | Python Flask + Gunicorn | REST API with Prometheus metrics                  |
| Frontend          | HTML/CSS/JS + Nginx     | Infrastructure monitoring dashboard               |
| Database          | PostgreSQL 16           | Persistent data storage                           |
| Cache             | Redis 7                 | In-memory caching with LRU eviction               |
| Metrics           | Prometheus              | Time-series metrics collection                    |
| Dashboards        | Grafana                 | Visualization and alerting                        |
| Host Metrics      | Node Exporter           | System-level metrics                              |
| Container Metrics | cAdvisor                | Docker container resource tracking                |
| Automation        | Ansible                 | Server provisioning and deployment                |
| CI/CD             | GitHub Actions          | Automated testing and deployment                  |

Network Architecture

┌─────────────────────────────────┐
│         frontend-net            │
│  nginx ── frontend ── backend   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│    backend-net (internal)       │
│  backend ── postgres ── redis   │
└─────────────────────────────────┘

┌────────────────────────────────────────────┐
│             monitoring-net                  │
│  backend ── prometheus ── grafana           │
│  node-exporter ── cadvisor                  │
└────────────────────────────────────────────┘

The backend-net network is declared with internal: true, so Docker gives it no route to external networks, and PostgreSQL and Redis publish no ports. Neither data store is reachable from the internet. The backend service joins all three networks because it serves API requests, connects to the data stores, and exposes metrics.
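The membership rules described above can be checked mechanically. The following is a minimal sketch, assuming the network-to-service mapping shown in the diagrams; the NETWORKS dictionary is transcribed by hand for illustration and is not parsed from docker-compose.yml.

```python
# Network membership transcribed from the diagrams above (an assumption:
# the real source of truth is docker-compose.yml).
NETWORKS = {
    "frontend-net": {"nginx", "frontend", "backend"},
    "backend-net": {"backend", "postgres", "redis"},
    "monitoring-net": {"backend", "prometheus", "grafana",
                       "node-exporter", "cadvisor"},
}
# Networks declared with internal: true.
INTERNAL = {"backend-net"}


def networks_of(service):
    """Return the sorted names of every network a service is attached to."""
    return sorted(name for name, members in NETWORKS.items()
                  if service in members)


def only_internal(service):
    """True if the service is attached to internal networks and nothing else."""
    attached = set(networks_of(service))
    return bool(attached) and attached <= INTERNAL


for svc in ("backend", "postgres", "redis"):
    print(svc, networks_of(svc), "internal-only:", only_internal(svc))
```

A check like this could run in CI against the parsed Compose file, so that a data store accidentally attached to a second network is caught before deployment.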

Quick Start

Prerequisites

  • Docker Engine 24+
  • Docker Compose v2
  • Git

Setup

git clone https://github.com/SonOfTroll/DockOps.git
cd DockOps

Edit .env with your credentials if needed, then deploy:

./scripts/deploy.sh

Manual Start

docker compose build
docker compose up -d

Verify

./scripts/healthcheck.sh

Access Points

| Service    | URL                         |
|------------|-----------------------------|
| Dashboard  | http://localhost            |
| API Health | http://localhost/api/health |
| API Status | http://localhost/api/status |
| Prometheus | http://localhost:9090       |
| Grafana    | http://localhost:3001       |

Default Grafana credentials: admin / graf_s3cur3_2024

Project Structure

DockOps/
├── ansible/
│   ├── inventory
│   ├── playbook.yml
│   ├── group_vars/all.yml
│   └── roles/
│       ├── docker/
│       │   ├── tasks/main.yml
│       │   └── handlers/main.yml
│       ├── deploy/
│       │   ├── tasks/main.yml
│       │   ├── handlers/main.yml
│       │   └── templates/env.j2
│       └── monitoring/
│           ├── tasks/main.yml
│           └── handlers/main.yml
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── gunicorn.conf.py
│   ├── wsgi.py
│   ├── app/
│   │   ├── __init__.py
│   │   ├── routes.py
│   │   ├── database.py
│   │   └── cache.py
│   └── tests/
│       └── test_api.py
├── frontend/
│   ├── Dockerfile
│   ├── nginx.conf
│   └── src/
│       ├── index.html
│       ├── style.css
│       └── app.js
├── nginx/
│   ├── nginx.conf
│   └── conf.d/default.conf
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── alert_rules.yml
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/infrastructure.json
├── scripts/
│   ├── deploy.sh
│   ├── healthcheck.sh
│   ├── cleanup.sh
│   ├── monitor.sh
│   └── chaos-test.sh
├── .github/workflows/deploy.yml
├── docker-compose.yml
├── .env
├── .gitignore
└── README.md

Monitoring

Prometheus Targets

Prometheus scrapes four targets on a 15-second interval:

  • backend — HTTP request counters, latency histograms, active connections
  • node-exporter — Host CPU, memory, disk, network metrics
  • cadvisor — Per-container CPU, memory, network, filesystem usage
  • self — Prometheus internal metrics
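To confirm from a script that all four targets are being scraped, Prometheus' HTTP API can be queried at /api/v1/targets. The sketch below assumes Prometheus is reachable at http://localhost:9090 (the URL listed under Access Points); the job names it prints are whatever prometheus.yml defines, which may differ from the bullet labels above.

```python
import json
import urllib.request

# Assumption: Prometheus is reachable at the URL from "Access Points".
PROMETHEUS_URL = "http://localhost:9090"


def summarize_targets(payload):
    """Map each scrape job to the health values of its targets.

    payload is the decoded JSON body of GET /api/v1/targets; health is
    one of "up", "down", or "unknown".
    """
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"]["job"]
        summary.setdefault(job, []).append(target["health"])
    return summary


def fetch_targets(base_url=PROMETHEUS_URL):
    """Fetch the active target list from Prometheus' HTTP API."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Usage, with the stack running:
#   for job, health in sorted(summarize_targets(fetch_targets()).items()):
#       print(f"{job}: {', '.join(health)}")
```

Any job whose list contains "down" is worth checking against the troubleshooting notes further down.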

Grafana Dashboard

A pre-provisioned dashboard (DockOps Infrastructure) includes:

  • HTTP request rate by status code
  • p50/p95 request latency
  • Host CPU and memory gauges with threshold coloring
  • Container CPU and memory time series
  • Host network I/O
  • Service uptime status

Alert Rules

| Alert               | Condition                   | Severity |
|---------------------|-----------------------------|----------|
| BackendDown         | Backend unreachable for 30s | Critical |
| HighRequestLatency  | p95 > 2s for 1m             | Warning  |
| HighErrorRate       | 5xx rate > 5% for 2m        | Critical |
| HostHighCPU         | CPU > 85% for 5m            | Warning  |
| HostHighMemory      | Available < 10% for 5m      | Critical |
| ContainerRestarting | > 3 restarts in 15m         | Warning  |
| DiskSpaceLow        | Root FS < 15% free for 5m   | Warning  |
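Each condition pairs a threshold with a hold duration: the threshold must be breached without interruption for the whole duration before the alert fires. The sketch below is a simplified model of that behavior, not Prometheus' actual rule evaluator; the function name and the sample format are hypothetical.

```python
def alert_state(samples, threshold, hold_seconds):
    """Return "inactive", "pending", or "firing" as of the newest sample.

    samples is a list of (unix_timestamp, value) pairs, oldest first.
    The condition is value > threshold, and it must hold without
    interruption for at least hold_seconds before the alert fires.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
        else:
            breach_start = None  # any dip below the threshold resets the timer
    if breach_start is None:
        return "inactive"
    held = samples[-1][0] - breach_start
    return "firing" if held >= hold_seconds else "pending"


# HighRequestLatency: p95 > 2s for 1m, sampled every 15 seconds.
latency = [(0, 1.0), (15, 2.5), (30, 2.6), (45, 2.4), (60, 2.2), (75, 2.1)]
print(alert_state(latency, threshold=2.0, hold_seconds=60))
```

A single sample back under the threshold resets the timer, which is why short spikes do not page anyone.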

Reliability

Self-Healing

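Every service has restart: always and a Docker health check. When a container's main process exits, Docker restarts it automatically. A failing health check on its own only marks the container as unhealthy; the restart policy does not act on health status, so a hung-but-running container stays up until something stops it. The backend's health check verifies database and Redis connectivity, so the container is marked healthy only once its dependencies are reachable.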
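A dependency-aware health check of this kind can be sketched as follows. This is an illustration rather than the project's routes.py: it uses plain TCP reachability from the standard library as a stand-in for real database and Redis queries, and the function names are hypothetical.

```python
import socket


def tcp_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(checks):
    """Run each named check and return (body, http_status).

    checks maps a dependency name to a zero-argument callable returning
    a bool. Any failing dependency turns the response into a 503.
    """
    results = {name: bool(check()) for name, check in checks.items()}
    healthy = all(results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return body, 200 if healthy else 503


# Usage from inside the Compose network, where the service names resolve
# (hostnames and ports follow the architecture diagram):
#   health_report({
#       "postgres": lambda: tcp_reachable("postgres", 5432),
#       "redis": lambda: tcp_reachable("redis", 6379),
#   })
```

Returning 503 when a dependency is down lets both the Docker health check and an upstream proxy distinguish "process alive" from "able to serve requests".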

Chaos Testing

./scripts/chaos-test.sh

This script:

  1. Verifies the backend is healthy
  2. Kills the backend container with docker kill
  3. Confirms the service is unreachable
  4. Measures time until Docker auto-restarts the container
  5. Validates the restored health endpoint

Typical recovery time is under 30 seconds.
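The timing in step 4 amounts to polling the health endpoint until it answers again. A minimal sketch of that loop follows, assuming the health URL from Access Points; the function names are hypothetical and this is not the contents of chaos-test.sh, which is a shell script.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeouts, HTTP error statuses
        return False


def measure_recovery(probe, max_wait=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if max_wait passes first.
    clock and sleep are injectable so the loop can be tested without
    real waiting.
    """
    start = clock()
    while clock() - start <= max_wait:
        if probe():
            return clock() - start
        sleep(interval)
    return None


# Usage, right after killing the backend container:
#   seconds = measure_recovery(lambda: http_ok("http://localhost/api/health"))
```

The measured value has a resolution of one polling interval, so a shorter interval gives a tighter estimate at the cost of more requests.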

Graceful Shutdown

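The backend image sets STOPSIGNAL SIGTERM, and Gunicorn is configured with a graceful_timeout of 30 seconds, so in-flight requests can complete before the workers exit. Docker sends SIGKILL 10 seconds after the stop signal by default, so the full 30 seconds is only available if the service's stop_grace_period (or docker stop -t) is at least as long.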

Ansible Deployment

For deploying to a remote Ubuntu server:

cd ansible/

# Update inventory with your server IP
vim inventory

# Run the full playbook
ansible-playbook -i inventory playbook.yml

The playbook executes three roles in order:

  1. docker — Installs Docker CE, Compose plugin, configures user permissions
  2. deploy — Clones the repo, generates .env, builds and starts the stack
  3. monitoring — Tunes kernel parameters, configures Docker log rotation, verifies Prometheus scraping

Security

  • Database and Redis are on an internal-only network with no port bindings
  • Backend runs as a non-root user (dockops)
  • Nginx adds X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Content-Security-Policy, and Referrer-Policy headers
  • Prometheus /metrics endpoint is restricted to internal Docker CIDR ranges
  • Rate limiting on both API (30 req/s) and general (60 req/s) traffic
  • Secrets stored in .env (gitignored) and injected via environment variables
  • server_tokens off hides Nginx version
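The /metrics restriction above is an allow-list of source networks. The check it performs can be sketched as follows; the three RFC 1918 private ranges used here are an assumption standing in for "internal Docker CIDR ranges", and the ranges actually allowed are whatever the Nginx configuration lists.

```python
import ipaddress

# Assumption: the RFC 1918 private ranges stand in for the Docker CIDR
# ranges that the Nginx configuration actually allows.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def may_scrape_metrics(client_ip):
    """Return True if client_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(may_scrape_metrics("172.18.0.5"))  # an address on a Docker bridge network
print(may_scrape_metrics("8.8.8.8"))     # a public address
```

Requests from Prometheus arrive from a container address inside one of these ranges, while requests relayed from the internet carry a public source address and are refused.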

Scaling

Horizontal Backend Scaling

# docker-compose.yml
services:
  backend:
    deploy:
      replicas: 3

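The Nginx configuration already proxies API traffic to an upstream named backend_pool, so requests are spread across the backend replicas. Nginx normally resolves upstream hostnames when it loads its configuration, so a reload may be needed after changing the replica count.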

Vertical Scaling

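The Gunicorn configuration derives the worker count from the number of CPU cores (workers = cpu_count * 2 + 1), so giving the host more cores raises backend concurrency without a config change. Redis is configured with a 128MB memory ceiling and LRU eviction.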
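Because Gunicorn config files are plain Python, the worker formula can be written directly in the config. The following is a sketch of how such a file might look, not a copy of the repository's gunicorn.conf.py; the bind address is an assumption based on the :5000 port in the architecture diagram, and worker_count is a hypothetical helper.

```python
import multiprocessing


def worker_count(cpu_count):
    """Apply the cpu_count * 2 + 1 rule quoted above."""
    return cpu_count * 2 + 1


# Settings as they might appear in gunicorn.conf.py.
bind = "0.0.0.0:5000"  # assumption: port taken from the architecture diagram
workers = worker_count(multiprocessing.cpu_count())
graceful_timeout = 30  # seconds, as described under "Graceful Shutdown"
```

On a 4-core host this yields 9 workers. Inside a container with a CPU limit, multiprocessing.cpu_count() still reports the host's cores, so the result can be higher than the limit justifies.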

Future Scaling

  • Add Traefik or HAProxy for service discovery
  • Move to Docker Swarm or Kubernetes for multi-node orchestration
  • Add read replicas for PostgreSQL
  • Implement Redis Sentinel for cache HA

CI/CD Pipeline

The GitHub Actions workflow runs on every push to main:

  1. Lint: flake8 for Python, yamllint for YAML, docker compose config validation
  2. Test: pytest on backend unit tests
  3. Build: Builds both Docker images, verifies non-root user configuration
  4. Integration: Spins up core services, waits for healthy backend, validates API endpoints

Troubleshooting

Backend won't start

docker compose logs backend
docker compose exec backend python -c "import psycopg2; print('pg ok')"

Check that PostgreSQL is healthy first:

docker compose exec postgres pg_isready

Prometheus shows targets as DOWN

Verify the backend is on the monitoring network:

docker network inspect dockops_monitoring-net

Grafana shows "No data"

Confirm the Prometheus datasource URL is http://prometheus:9090 (container name, not localhost).

Port conflicts

If port 80, 3001, or 9090 is already in use, remap it in .env. For example, to move Nginx off port 80:

# Edit .env
NGINX_PORT=8080

Full reset

./scripts/cleanup.sh --full
./scripts/deploy.sh

Helper Scripts

| Script                 | Purpose                                    |
|------------------------|--------------------------------------------|
| scripts/deploy.sh      | Build and launch the entire stack          |
| scripts/healthcheck.sh | Verify all containers and endpoints        |
| scripts/cleanup.sh     | Stop and optionally purge everything       |
| scripts/monitor.sh     | Live terminal dashboard of container stats |
| scripts/chaos-test.sh  | Kill backend and measure auto-recovery     |

License

MIT
