Infrastructure

Siyâvash edited this page Dec 16, 2025 · 1 revision

This page provides an overview of ExpoScholar's infrastructure components, operational procedures, and best practices for managing production deployments.

Overview

The infrastructure/ directory serves as the single source of truth for all operational concerns, enabling reproducible deployments, automated operations, security hardening, disaster recovery, and infrastructure-as-code practices.

Infrastructure components include:

  • Containerization: Docker packaging for consistent deployment across environments
  • Orchestration: Kubernetes for managing containers with automated scaling and health checks
  • Autoscaling: Automatic resource adjustment based on demand
  • Monitoring: System health, performance metrics, and error detection
  • Security: Network policies, bot blocking, honeypots, and access controls
  • Backups: Automated backup jobs with retention policies and restore procedures
  • Cloud Provisioning: Infrastructure-as-code using Terraform

Directory Structure

Path          Purpose
------------  -------------------------------------------------------------------------------------
docker/       Dockerfiles, docker-compose stacks, Nginx reverse proxy configuration
kubernetes/   Namespace, deployments, services, ingress, secrets, configmaps
autoscaling/  Horizontal/vertical pod autoscalers, cluster autoscaler configurations
maintenance/  Maintenance page template, whitelist examples, runbooks
monitoring/   Prometheus/Alertmanager/Grafana configurations and alert rules
security/     Bot blocker rules, honeypot, network policies, RBAC, Nginx security configurations
backup/       CronJobs for PostgreSQL/Redis/application backups and restore scripts
terraform/    AWS infrastructure-as-code (VPC, compute, storage, networking)

Each subdirectory contains a dedicated README with comprehensive deployment instructions, operational procedures, and technical details.

Docker Containerization

Docker provides consistent application packaging and deployment across development, staging, and production environments.

Docker Components

Server Dockerfile (infrastructure/docker/Dockerfile):

  • Base image: Python 3.11
  • Installs system dependencies and Python packages
  • Configures Gunicorn as the WSGI server
  • Sets up proper file permissions and security

Mobile Dockerfile (infrastructure/docker/Dockerfile.mobile):

  • Base image: Flutter SDK
  • Builds Flutter web application
  • Outputs static files for Nginx serving

Docker Compose (infrastructure/docker/docker-compose.yml):

  • Orchestrates PostgreSQL, Redis, Django server, and Nginx
  • Configures networking and volume mounts
  • Sets up development environment

Common Docker Operations

Build Images:

# Build server image (run from the repository root, which is the build context)
docker build -f infrastructure/docker/Dockerfile -t exposcholar-server:latest .

# Build mobile image
docker build -f infrastructure/docker/Dockerfile.mobile -t exposcholar-mobile:latest .

Run Development Stack:

cd infrastructure/docker
docker compose up -d

View Logs:

docker compose logs -f

Stop Services:

docker compose down
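
After docker compose up -d returns, the containers may still be starting. The following is a minimal sketch of a retry helper that waits until a command succeeds. The retry function and the health URL in the usage comment are hypothetical and not part of the repository.

```shell
# retry ATTEMPTS DELAY COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# The body runs in a subshell so its variables do not leak to the caller.
retry() (
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      exit 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
)

# Example (hypothetical health URL; adjust to the real endpoint):
#   docker compose up -d
#   retry 30 2 curl -fsS -o /dev/null http://localhost:8000/health/
```

The helper returns a non-zero status when it gives up, so a wrapper script can stop before running later steps.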

Kubernetes Orchestration

Kubernetes provides container orchestration with automated scaling, health checks, rolling updates, and service discovery.

Core Components

Namespace (infrastructure/kubernetes/namespace.yaml):

  • Isolates ExpoScholar resources from other applications
  • Applies resource quotas and limits

Deployment (infrastructure/kubernetes/deployment.yaml):

  • Defines pod replicas, container images, and resource requests/limits
  • Configures health checks (liveness and readiness probes)
  • Manages rolling updates

Service (infrastructure/kubernetes/service.yaml):

  • Exposes pods via stable network endpoints
  • Load balances traffic across pod replicas

Ingress (infrastructure/kubernetes/ingress.yaml):

  • Routes external traffic to services
  • Manages TLS/SSL termination
  • Configures domain-based routing

ConfigMap (infrastructure/kubernetes/configmap.yaml):

  • Stores non-sensitive configuration data
  • Supplies environment variables and application settings to pods

Secrets (infrastructure/kubernetes/secrets.yaml):

  • Stores sensitive data (passwords, API keys, certificates)
  • Values are Base64-encoded, which is an encoding rather than encryption
  • Must be created/updated manually with production credentials
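
To illustrate why Base64 offers no confidentiality, here is a small round trip. The value example-password is a placeholder chosen for this sketch.

```shell
# Encode a value the way a Secret manifest's `data` field expects it.
# printf avoids the trailing newline that a plain `echo` would add.
encoded=$(printf '%s' 'example-password' | base64)

# Anyone who can read the Secret can reverse the encoding just as easily.
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "encoded: $encoded"
echo "decoded: $decoded"
```

Because decoding needs no key, access to Secrets should be limited through RBAC rather than relying on the encoding.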

Deployment Workflow

1. Apply Core Resources:

kubectl apply -f infrastructure/kubernetes/namespace.yaml
kubectl apply -f infrastructure/kubernetes/configmap.yaml
kubectl create secret generic exposcholar-secrets \
  --from-env-file=server/.env.production \
  -n exposcholar

2. Deploy Application:

kubectl apply -f infrastructure/kubernetes/deployment.yaml
kubectl apply -f infrastructure/kubernetes/service.yaml
kubectl apply -f infrastructure/kubernetes/ingress.yaml

3. Verify Deployment:

kubectl get pods -n exposcholar
kubectl get services -n exposcholar
kubectl get ingress -n exposcholar

4. Check Logs:

kubectl logs -f deployment/exposcholar-server -n exposcholar
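
The workflow above can be made repeatable with two small wrappers. This is a sketch under assumptions: the function names apply_env_secret and wait_for_rollout are hypothetical, and the 120s default timeout is an arbitrary choice. A plain kubectl create secret fails if the Secret already exists, so the first wrapper renders the manifest client-side and pipes it to kubectl apply.

```shell
# Create or update the Secret from an env file without failing when it
# already exists: render the manifest client-side, then apply it.
apply_env_secret() (
  name=$1
  env_file=$2
  namespace=$3
  kubectl create secret generic "$name" \
    --from-env-file="$env_file" \
    --namespace="$namespace" \
    --dry-run=client -o yaml |
    kubectl apply -f -
)

# Block until the Deployment finishes rolling out, or fail after a timeout.
wait_for_rollout() (
  deployment=$1
  namespace=$2
  timeout=${3:-120s}
  kubectl rollout status "deployment/$deployment" \
    --namespace="$namespace" --timeout="$timeout"
)

# Example, using the names from the workflow above:
#   apply_env_secret exposcholar-secrets server/.env.production exposcholar
#   wait_for_rollout exposcholar-server exposcholar
```

kubectl rollout status exits non-zero when the timeout expires, which makes it suitable as a gate in a deploy script.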

Monitoring Stack

The monitoring stack provides comprehensive observability into system health, performance, and errors.

Components

Prometheus:

  • Time-series database for metrics collection
  • Scrapes metrics from all ExpoScholar components and Kubernetes infrastructure
  • Stores historical data for trend analysis

Alertmanager:

  • Routes alerts based on severity and service
  • Sends notifications via email, Slack, and other channels
  • Groups and deduplicates alerts

Grafana:

  • Metrics visualization and dashboard platform
  • Custom dashboards for application and infrastructure metrics
  • Real-time monitoring and historical analysis

Node Exporter:

  • System metrics exporter
  • Provides CPU, memory, disk, and network metrics from cluster nodes

Blackbox Exporter:

  • HTTP/HTTPS endpoint health checker
  • Probes ExpoScholar services for availability

Deployment

Deploy Monitoring Stack:

kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml

Verify Components:

kubectl get pods -n monitoring

Access Grafana:

kubectl port-forward svc/grafana -n monitoring 3000:3000
# Open http://localhost:3000

Configure Alerting:

  1. Update infrastructure/monitoring/alertmanager.yml with SMTP credentials and Slack webhook URLs
  2. Update the ConfigMap: kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml

Customize Alert Rules:

  • Modify infrastructure/monitoring/rules/exposcholar.yml to adjust alert thresholds
  • Add new alert conditions as needed

Security Hardening

Security configurations protect against attacks and unauthorized access.

Security Components

Pod Security Standards:

  • Enforces restricted security policy on namespaces
  • Prevents privilege escalation and host access
  • Configures read-only root filesystems in container security contexts (the restricted Pod Security profile does not itself require this)

Network Policies:

  • Restricts pod-to-pod communication
  • Implements network segmentation
  • Controls ingress and egress traffic

RBAC (Role-Based Access Control):

  • Defines service account permissions
  • Limits access to Kubernetes resources
  • Follows principle of least privilege

Bot Blocker:

  • Nginx-based bot detection and blocking
  • Blocks known malicious user agents and IPs
  • Reduces automated attack traffic

Honeypot:

  • Decoy endpoints to detect scanning and attacks
  • Logs suspicious traffic patterns
  • Provides early warning of security threats

Deployment

Apply Security Policies:

kubectl apply -f infrastructure/security/pod-security-standards.yaml
kubectl apply -f infrastructure/security/network-policies.yaml
kubectl apply -f infrastructure/security/rbac.yaml

Deploy Bot Blocker:

  • Mount security/bot-blocker/ directory into Nginx containers
  • Include configuration files in nginx.conf

Enable Honeypot:

  • Mount security/honeypot/honeypot.conf into Nginx
  • Monitor logs for suspicious traffic patterns
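
One simple way to monitor honeypot traffic is to count requests per client IP. The sketch below assumes the honeypot writes an Nginx access log in the default "combined" format, where the first field is the client address. The function name top_client_ips and the log path in the usage comment are hypothetical.

```shell
# Print the most frequent client IPs in an Nginx access log read from stdin.
# Assumes the default "combined" log format, where field 1 is the client IP.
# Output lines have the form "<count> <ip>", highest count first.
top_client_ips() (
  limit=${1:-10}
  awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "$limit"
)

# Example (hypothetical log path):
#   top_client_ips 5 < /var/log/nginx/honeypot.access.log
```

Addresses that appear repeatedly are candidates for the bot blocker lists, after checking that they are not legitimate scanners or monitoring probes.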

Backup and Disaster Recovery

Automated backups ensure data protection and enable disaster recovery.

Backup Components

PostgreSQL Backups:

  • Daily automated backups via CronJob
  • Compressed SQL dumps stored in S3
  • 30-day retention policy

Redis Backups:

  • Daily automated backups via CronJob
  • RDB snapshots stored in S3
  • 7-day retention policy

Application Backups:

  • Daily automated backups of media files
  • Compressed archives stored in S3
  • 30-day retention policy

Disaster Recovery Testing:

  • Weekly automated restore tests
  • Validates backup integrity
  • Ensures restore procedures are functional

Backup Operations

Deploy Backup CronJobs:

kubectl apply -f infrastructure/backup/backup-cronjob.yaml

Verify Backup Execution:

kubectl get cronjob -n exposcholar
kubectl get jobs -n exposcholar | grep backup

List Backups:

aws s3 ls s3://exposcholar-backups/postgresql/
aws s3 ls s3://exposcholar-backups/redis/
aws s3 ls s3://exposcholar-backups/application/

Restore from Backup:

./infrastructure/backup/restore-scripts/restore-postgresql.sh \
  s3://exposcholar-backups/postgresql/exposcholar-backup-20240115-020000.sql.gz

Autoscaling

Autoscaling automatically adjusts resources based on demand to optimize performance and cost.

Autoscaling Components

Horizontal Pod Autoscaler (HPA):

  • Scales pod replicas based on CPU and memory usage
  • Configurable thresholds (default: 70% CPU, 80% memory)
  • Minimum and maximum replica limits
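
The HPA derives its target replica count from the ratio of observed to target utilization: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the minimum and maximum replica limits. The sketch below reproduces that arithmetic for a single metric. It ignores the controller's tolerance band and stabilization behavior, and the function name hpa_desired_replicas is hypothetical.

```shell
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are whole percentages. Integer ceiling division:
# ceil(a / b) == (a + b - 1) / b for positive integers.
hpa_desired_replicas() (
  current=$1
  util=$2
  target=$3
  min=$4
  max=$5
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# 4 replicas at 90% CPU with the 70% target: ceil(4 * 90 / 70) = 6
hpa_desired_replicas 4 90 70 2 10   # prints 6
```

When both CPU and memory targets are configured, the HPA evaluates each metric separately and uses the largest resulting replica count.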

Vertical Pod Autoscaler (VPA):

  • Adjusts pod resource requests and limits
  • Recommends optimal resource allocation
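  • Can automatically apply recommendations; avoid this mode on workloads whose HPA scales on the same CPU or memory metrics, since the two controllers would work against each other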

Cluster Autoscaler:

  • Scales cluster nodes based on pod scheduling needs
  • Adds nodes when pods cannot be scheduled
  • Removes nodes when underutilized

Configuration

Apply Autoscalers:

kubectl apply -f infrastructure/autoscaling/hpa.yaml
kubectl apply -f infrastructure/autoscaling/vpa.yaml
kubectl apply -f infrastructure/autoscaling/cluster-autoscaler.yaml

Monitor Scaling Events:

kubectl get hpa -n exposcholar
kubectl describe hpa -n exposcholar

Maintenance Mode

Maintenance mode allows planned downtime for updates and maintenance.

Operations

Enable Maintenance Mode:

./scripts/utils/maintenance_mode_enable.sh --reload-nginx

Configure Whitelisting:

./scripts/utils/maintenance_mode_whitelist.sh add 192.168.1.100

Disable Maintenance Mode:

./scripts/utils/maintenance_mode_disable.sh --reload-nginx

Infrastructure as Code (Terraform)

Terraform enables version-controlled cloud infrastructure provisioning.

Terraform Components

AWS Infrastructure:

  • VPC and networking configuration
  • RDS PostgreSQL database
  • ElastiCache Redis cluster
  • S3 buckets for backups and static files
  • IAM roles and policies

Kubernetes Infrastructure:

  • EKS cluster provisioning (optional)
  • Node groups and autoscaling
  • Load balancers and ingress

Terraform Workflow

1. Configure AWS Credentials:

export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-west-2

2. Initialize Terraform:

cd infrastructure/terraform
terraform init

3. Review Planned Changes:

terraform plan -out=tfplan

4. Apply Infrastructure:

terraform apply tfplan

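5. Configure Remote State (for shared environments; add this block before running terraform init, or re-run terraform init -migrate-state after adding it):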

terraform {
  backend "s3" {
    bucket         = "exposcholar-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
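
In an automated pipeline, the plan step can be used as a gate. A minimal sketch, assuming it runs inside infrastructure/terraform after terraform init; the function name plan_and_report is hypothetical. With -detailed-exitcode, terraform plan exits 0 when there are no changes, 1 on error, and 2 when changes are present.

```shell
# Run `terraform plan` and print a one-word verdict on stdout.
# The plan itself is sent to stderr so the verdict can be captured cleanly.
plan_and_report() {
  terraform plan -detailed-exitcode -out=tfplan >&2
  case $? in
    0) echo "no-changes" ;;
    2) echo "changes" ;;
    *) echo "error"; return 1 ;;
  esac
}

# Example:
#   if [ "$(plan_and_report)" = "changes" ]; then
#     terraform apply tfplan
#   fi
```

Applying the saved tfplan file ensures that what is applied is exactly what was reviewed.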

Operational Best Practices

Secrets Management

  • Never commit secrets: All sensitive data must be stored in Kubernetes Secrets, not in version control
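  • Treat Base64 as encoding, not encryption: Values in the data field of a Secret manifest must be Base64-encoded (printf '%s' "your-password" | base64), while kubectl create secret encodes them for you. Base64 adds no confidentiality, so rely on RBAC and, where available, encryption at rest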
  • Rotate credentials regularly: Establish a schedule for rotating database passwords, API keys, and TLS certificates
  • Limit secret access: Use RBAC to restrict which service accounts and users can access Secrets

Configuration Management

  • Version control all configs: All configuration files should be tracked in Git, with the exception of secrets and environment-specific overrides
  • Use ConfigMaps for non-sensitive data: Application settings, feature flags, and non-secret environment variables belong in ConfigMaps
  • Document configuration changes: Maintain changelogs or commit messages that explain why configurations were modified

Monitoring and Observability

  • Set up alerting early: Configure Prometheus alerts and Alertmanager routing before deploying to production
  • Review dashboards regularly: Use Grafana dashboards to identify trends, capacity issues, and performance degradation
  • Log aggregation: Consider integrating with centralized logging solutions (ELK stack, Loki, CloudWatch) for comprehensive log analysis
  • Test alerting: Regularly verify that alerts fire correctly and that notification channels (email, Slack) are functional

Backup and Disaster Recovery

  • Test restore procedures: The disaster recovery test CronJob runs weekly, but also perform manual restore tests quarterly
  • Verify backup retention: Ensure backups older than the retention period (30 days for PostgreSQL/application, 7 days for Redis) are automatically deleted
  • Document restore procedures: Maintain runbooks that detail step-by-step restore processes for each component
  • Off-site storage: Backups are stored in AWS S3, separate from the production cluster. A bucket in a single region does not by itself protect against a regional failure; enable cross-region replication if that is a requirement

Security Hardening

  • Keep security configs updated: Regularly update bot blocker lists, honeypot rules, and network policies as new threats emerge
  • Review access controls: Periodically audit RBAC policies and network policies to ensure they follow the principle of least privilege
  • Monitor security logs: Review honeypot logs, Nginx access logs, and Kubernetes audit logs for suspicious activity
  • Apply security patches: Keep base images, Kubernetes versions, and dependencies up to date with security patches

Autoscaling Configuration

  • Set appropriate thresholds: HPA CPU/memory thresholds (70% CPU, 80% memory) should be tuned based on actual workload patterns
  • Monitor scaling events: Review HPA and VPA scaling decisions to ensure they align with application requirements
  • Configure cluster autoscaler: Ensure cluster autoscaler is properly configured for your cloud provider to handle node scaling
  • Test scaling behavior: Verify that applications handle pod scaling gracefully and that database connections are properly managed

Troubleshooting

Common Issues

Problem: Pods fail to start or crash loop.

Solutions:

  • Check pod logs: kubectl logs <pod-name> -n exposcholar (add --previous to see output from the last crashed container)
  • Verify ConfigMaps and Secrets exist and are correctly mounted
  • Review resource requests/limits in deployment manifests
  • Check for image pull errors: kubectl describe pod <pod-name> -n exposcholar

Problem: Services are not accessible.

Solutions:

  • Verify Service selectors match Deployment labels
  • Check Ingress configuration and TLS certificates
  • Review network policies that might block traffic
  • Test service endpoints directly: kubectl port-forward svc/<service-name> <local-port>:<service-port> -n exposcholar

Problem: Backups are not running.

Solutions:

  • Check CronJob status: kubectl get cronjob -n exposcholar
  • Review recent job executions: kubectl get jobs -n exposcholar
  • Verify AWS credentials in Secrets
  • Check S3 bucket permissions and connectivity

Problem: Monitoring stack is not collecting metrics.

Solutions:

  • Verify Prometheus can scrape targets: Check /targets in Prometheus UI
  • Review the Prometheus scrape configuration (or ServiceMonitor/PodMonitor resources, if the Prometheus Operator is in use)
  • Ensure services expose /metrics endpoints
  • Check RBAC permissions for Prometheus service account

Additional Resources

For detailed component-specific documentation, see the README files in each subdirectory of infrastructure/.


Last Updated: 2025-01-25
ExpoScholar Version: v0.9.3-beta+3