Infrastructure

This page provides an overview of ExpoScholar's infrastructure components, operational procedures, and best practices for managing production deployments.

Overview

The infrastructure/ directory serves as the single source of truth for all operational concerns, enabling reproducible deployments, automated operations, security hardening, disaster recovery, and infrastructure-as-code practices.

Infrastructure components include:

Containerization: Docker packaging for consistent deployment across environments
Orchestration: Kubernetes for managing containers with automated scaling and health checks
Autoscaling: Automatic resource adjustment based on demand
Monitoring: System health, performance metrics, and error detection
Security: Network policies, bot blocking, honeypots, and access controls
Backups: Automated backup jobs with retention policies and restore procedures
Cloud Provisioning: Infrastructure-as-code using Terraform

Directory Structure

Path	Purpose
`docker/`	Dockerfiles, docker-compose stacks, Nginx reverse proxy configuration
`kubernetes/`	Namespace, deployments, services, ingress, secrets, configmaps
`autoscaling/`	Horizontal/vertical pod autoscalers, cluster autoscaler configurations
`maintenance/`	Maintenance page template, whitelist examples, runbooks
`monitoring/`	Prometheus/Alertmanager/Grafana configurations and alert rules
`security/`	Bot blocker rules, honeypot, network policies, RBAC, Nginx security configurations
`backup/`	CronJobs for PostgreSQL/Redis/application backups and restore scripts
`terraform/`	AWS infrastructure-as-code (VPC, compute, storage, networking)

Each subdirectory contains a dedicated README with comprehensive deployment instructions, operational procedures, and technical details.

Docker Containerization

Docker provides consistent application packaging and deployment across development, staging, and production environments.

Docker Components

Server Dockerfile (infrastructure/docker/Dockerfile):

Base image: Python 3.11
Installs system dependencies and Python packages
Configures Gunicorn as the WSGI server
Sets up proper file permissions and security

Mobile Dockerfile (infrastructure/docker/Dockerfile.mobile):

Base image: Flutter SDK
Builds Flutter web application
Outputs static files for Nginx serving

Docker Compose (infrastructure/docker/docker-compose.yml):

Orchestrates PostgreSQL, Redis, Django server, and Nginx
Configures networking and volume mounts
Sets up development environment

Common Docker Operations

Build Images:

# Build server image
docker build -f infrastructure/docker/Dockerfile -t exposcholar-server:latest .

# Build mobile image
docker build -f infrastructure/docker/Dockerfile.mobile -t exposcholar-mobile:latest .

Run Development Stack:

cd infrastructure/docker
docker compose up -d

View Logs:

docker compose logs -f

Stop Services:

docker compose down

Kubernetes Orchestration

Kubernetes provides container orchestration with automated scaling, health checks, rolling updates, and service discovery.

Core Components

Namespace (infrastructure/kubernetes/namespace.yaml):

Isolates ExpoScholar resources from other applications
Applies resource quotas and limits

Deployment (infrastructure/kubernetes/deployment.yaml):

Defines pod replicas, container images, and resource requests/limits
Configures health checks (liveness and readiness probes)
Manages rolling updates

Service (infrastructure/kubernetes/service.yaml):

Exposes pods via stable network endpoints
Load balances traffic across pod replicas

Ingress (infrastructure/kubernetes/ingress.yaml):

Routes external traffic to services
Manages TLS/SSL termination
Configures domain-based routing

ConfigMap (infrastructure/kubernetes/configmap.yaml):

Stores non-sensitive configuration data
Supplies environment variables and application settings to the pods

Secrets (infrastructure/kubernetes/secrets.yaml):

Stores sensitive data (passwords, API keys, certificates)
Holds base64-encoded values (base64 is an encoding, not encryption)
Must be created/updated manually with production credentials

Deployment Workflow

1. Apply Core Resources:

kubectl apply -f infrastructure/kubernetes/namespace.yaml
kubectl apply -f infrastructure/kubernetes/configmap.yaml
kubectl create secret generic exposcholar-secrets \
  --from-env-file=server/.env.production \
  -n exposcholar

2. Deploy Application:

kubectl apply -f infrastructure/kubernetes/deployment.yaml
kubectl apply -f infrastructure/kubernetes/service.yaml
kubectl apply -f infrastructure/kubernetes/ingress.yaml

3. Verify Deployment:

kubectl get pods -n exposcholar
kubectl get services -n exposcholar
kubectl get ingress -n exposcholar

4. Check Logs:

kubectl logs -f deployment/exposcholar-server -n exposcholar
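
The steps above can be collected into one small wrapper. The following is a minimal sketch, not a script that ships with the repository: the function names run and deploy and the DRY_RUN variable are hypothetical. With DRY_RUN=1 each command is printed instead of executed, which makes the ordering easy to review before touching a cluster. The final kubectl rollout status call is an addition that waits for the rollout to complete.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the manifests in the order used by the deployment workflow.
# Stops at the first command that fails.
deploy() {
  ns=exposcholar
  dir=infrastructure/kubernetes
  run kubectl apply -f "$dir/namespace.yaml" || return 1
  run kubectl apply -f "$dir/configmap.yaml" || return 1
  run kubectl create secret generic exposcholar-secrets \
    --from-env-file=server/.env.production -n "$ns" || return 1
  for manifest in deployment service ingress; do
    run kubectl apply -f "$dir/$manifest.yaml" || return 1
  done
  # Wait for the rollout to finish (an addition to the documented steps).
  run kubectl rollout status deployment/exposcholar-server -n "$ns" --timeout=120s
}
```

Running DRY_RUN=1 deploy prints the seven commands in order. Note that kubectl create secret fails when the Secret already exists, so on a repeat run this sketch stops at that step.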

Monitoring Stack

The monitoring stack collects metrics, probes service endpoints, and routes alerts, giving visibility into system health, performance, and errors.

Components

Prometheus:

Time-series database for metrics collection
Scrapes metrics from all ExpoScholar components and Kubernetes infrastructure
Stores historical data for trend analysis

Alertmanager:

Routes alerts based on severity and service
Sends notifications via email, Slack, and other channels
Groups and deduplicates alerts

Grafana:

Metrics visualization and dashboard platform
Custom dashboards for application and infrastructure metrics
Real-time monitoring and historical analysis

Node Exporter:

System metrics exporter
Provides CPU, memory, disk, and network metrics from cluster nodes

Blackbox Exporter:

HTTP/HTTPS endpoint health checker
Probes ExpoScholar services for availability

Deployment

Deploy Monitoring Stack:

kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml

Verify Components:

kubectl get pods -n monitoring

Access Grafana:

kubectl port-forward svc/grafana -n monitoring 3000:3000
# Open http://localhost:3000

Configure Alerting:

Update infrastructure/monitoring/alertmanager.yml with SMTP credentials and Slack webhook URLs
Update the ConfigMap: kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml

Customize Alert Rules:

Modify infrastructure/monitoring/rules/exposcholar.yml to adjust alert thresholds
Add new alert conditions as needed

Security Hardening

Security configurations protect against attacks and unauthorized access.

Security Components

Pod Security Standards:

Enforces restricted security policy on namespaces
Prevents privilege escalation and host access
Configures read-only root filesystems

Network Policies:

Restricts pod-to-pod communication
Implements network segmentation
Controls ingress and egress traffic

RBAC (Role-Based Access Control):

Defines service account permissions
Limits access to Kubernetes resources
Follows principle of least privilege

Bot Blocker:

Nginx-based bot detection and blocking
Blocks known malicious user agents and IPs
Reduces automated attack traffic

Honeypot:

Decoy endpoints to detect scanning and attacks
Logs suspicious traffic patterns
Provides early warning of security threats

Deployment

Apply Security Policies:

kubectl apply -f infrastructure/security/pod-security-standards.yaml
kubectl apply -f infrastructure/security/network-policies.yaml
kubectl apply -f infrastructure/security/rbac.yaml

Deploy Bot Blocker:

Mount security/bot-blocker/ directory into Nginx containers
Include configuration files in nginx.conf

Enable Honeypot:

Mount security/honeypot/honeypot.conf into Nginx
Monitor logs for suspicious traffic patterns

Backup and Disaster Recovery

Automated backups ensure data protection and enable disaster recovery.

Backup Components

PostgreSQL Backups:

Daily automated backups via CronJob
Compressed SQL dumps stored in S3
30-day retention policy

Redis Backups:

Daily automated backups via CronJob
RDB snapshots stored in S3
7-day retention policy

Application Backups:

Daily automated backups of media files
Compressed archives stored in S3
30-day retention policy

Disaster Recovery Testing:

Weekly automated restore tests
Validates backup integrity
Ensures restore procedures are functional

Backup Operations

Deploy Backup CronJobs:

kubectl apply -f infrastructure/backup/backup-cronjob.yaml

Verify Backup Execution:

kubectl get cronjob -n exposcholar
kubectl get jobs -n exposcholar | grep backup

List Backups:

aws s3 ls s3://exposcholar-backups/postgresql/
aws s3 ls s3://exposcholar-backups/redis/
aws s3 ls s3://exposcholar-backups/application/

Restore from Backup:

./infrastructure/backup/restore-scripts/restore-postgresql.sh \
  s3://exposcholar-backups/postgresql/exposcholar-backup-20240115-020000.sql.gz
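
When restoring, the most recent backup is usually the one wanted. The sketch below picks it from a listing. It assumes that every backup follows the naming pattern of the example above (exposcholar-backup-YYYYMMDD-HHMMSS), so that a lexical sort is also a chronological sort. The function name latest_backup is hypothetical.

```shell
# latest_backup: read `aws s3 ls` output (or bare object keys) on stdin
# and print the key with the newest embedded timestamp.
latest_backup() {
  awk '{ print $NF }' |
    grep 'exposcholar-backup-[0-9]\{8\}-[0-9]\{6\}' |
    LC_ALL=C sort |
    tail -n 1
}
```

A possible use is aws s3 ls s3://exposcholar-backups/postgresql/ | latest_backup, with the printed key appended to the bucket path before it is passed to the restore script.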

Autoscaling

Autoscaling automatically adjusts resources based on demand to optimize performance and cost.

Autoscaling Components

Horizontal Pod Autoscaler (HPA):

Scales pod replicas based on CPU and memory usage
Configurable thresholds (default: 70% CPU, 80% memory)
Minimum and maximum replica limits
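
To see how a threshold translates into a replica count, the Kubernetes documentation gives the HPA's core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below reproduces that arithmetic with integers. The function name hpa_desired is hypothetical, and the sketch ignores the controller's tolerance, stabilization window, and minimum/maximum replica bounds.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTILIZATION TARGET_UTILIZATION
# Integer ceiling of current_replicas * current / target.
hpa_desired() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

hpa_desired 4 90 70   # 4 pods at 90% CPU against a 70% target: prints 6
```

With four pods averaging 90% CPU against the default 70% target, the rule asks for six replicas.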

Vertical Pod Autoscaler (VPA):

Adjusts pod resource requests and limits
Recommends optimal resource allocation
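Can automatically apply recommendations; avoid letting VPA and HPA act on the same CPU or memory metric for one workload, since the two controllers would work against each other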

Cluster Autoscaler:

Scales cluster nodes based on pod scheduling needs
Adds nodes when pods cannot be scheduled
Removes nodes when underutilized

Configuration

Apply Autoscalers:

kubectl apply -f infrastructure/autoscaling/hpa.yaml
kubectl apply -f infrastructure/autoscaling/vpa.yaml
kubectl apply -f infrastructure/autoscaling/cluster-autoscaler.yaml

Monitor Scaling Events:

kubectl get hpa -n exposcholar
kubectl describe hpa -n exposcholar

Maintenance Mode

Maintenance mode allows planned downtime for updates and maintenance.

Operations

Enable Maintenance Mode:

./scripts/utils/maintenance_mode_enable.sh --reload-nginx

Configure Whitelisting:

./scripts/utils/maintenance_mode_whitelist.sh add 192.168.1.100
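
A mistyped address in the whitelist can lock operators out during maintenance, so it may be worth validating the value first. The following is a sketch under assumptions: is_ipv4 is a hypothetical helper, not part of the maintenance scripts, and it only accepts plain dotted-quad IPv4 addresses (no CIDR ranges, no IPv6).

```shell
# is_ipv4 ADDRESS: succeed only for a dotted-quad IPv4 address
# whose four octets are each in the range 0-255.
is_ipv4() {
  ip=$1
  case $ip in
    "" | *[!0-9.]* | .* | *. | *..*) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=.
  set -- $ip          # split the address on dots
  IFS=$old_ifs
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}
```

It could guard the call like this: is_ipv4 192.168.1.100 && ./scripts/utils/maintenance_mode_whitelist.sh add 192.168.1.100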

Disable Maintenance Mode:

./scripts/utils/maintenance_mode_disable.sh --reload-nginx

Infrastructure as Code (Terraform)

Terraform enables version-controlled cloud infrastructure provisioning.

Terraform Components

AWS Infrastructure:

VPC and networking configuration
RDS PostgreSQL database
ElastiCache Redis cluster
S3 buckets for backups and static files
IAM roles and policies

Kubernetes Infrastructure:

EKS cluster provisioning (optional)
Node groups and autoscaling
Load balancers and ingress

Terraform Workflow

1. Configure AWS Credentials:

export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-west-2
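
Terraform's error messages for missing credentials can be indirect, so a quick pre-flight check helps. The sketch below is one way to do it; require_env is a hypothetical helper name. It relies on printenv, which only sees exported variables, the same ones a child process such as terraform would inherit.

```shell
# require_env NAME...: report every listed variable that is missing from
# (or empty in) the exported environment; return non-zero if any is.
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example: require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION && terraform plan -out=tfplan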

2. Initialize Terraform:

cd infrastructure/terraform
terraform init

3. Review Planned Changes:

terraform plan -out=tfplan

4. Apply Infrastructure:

terraform apply tfplan

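5. Configure Remote State (for shared environments). Add the block below to the Terraform configuration, then re-run terraform init -migrate-state so that any existing local state moves to the S3 backend. The S3 bucket and DynamoDB table must already exist: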

terraform {
  backend "s3" {
    bucket         = "exposcholar-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Operational Best Practices

Secrets Management

Never commit secrets: All sensitive data must be stored in Kubernetes Secrets, not in version control
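Use Base64 encoding: When creating Secrets manually, encode values: echo -n "your-password" | base64. Base64 is only an encoding and is trivially reversible, so an encoded value is still a secret and must not be committed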
Rotate credentials regularly: Establish a schedule for rotating database passwords, API keys, and TLS certificates
Limit secret access: Use RBAC to restrict which service accounts and users can access Secrets
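
The round trip below illustrates why base64 offers no protection on its own: anyone who can read the encoded value can recover the original. It is a minimal sketch using the placeholder password from the text. The --decode flag is accepted by both GNU coreutils and macOS base64.

```shell
# Encode a value the way a Secret manifest stores it, then decode it again.
encoded=$(printf '%s' 'your-password' | base64)
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "$encoded"   # eW91ci1wYXNzd29yZA==
echo "$decoded"   # your-password
```

The same decoding step works on a live Secret, for example kubectl get secret exposcholar-secrets -n exposcholar -o jsonpath='{.data.KEY}' | base64 --decode, where KEY stands for whichever key is being inspected.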

Configuration Management

Version control all configs: All configuration files should be tracked in Git, with the exception of secrets and environment-specific overrides
Use ConfigMaps for non-sensitive data: Application settings, feature flags, and non-secret environment variables belong in ConfigMaps
Document configuration changes: Maintain changelogs or commit messages that explain why configurations were modified

Monitoring and Observability

Set up alerting early: Configure Prometheus alerts and Alertmanager routing before deploying to production
Review dashboards regularly: Use Grafana dashboards to identify trends, capacity issues, and performance degradation
Log aggregation: Consider integrating with centralized logging solutions (ELK stack, Loki, CloudWatch) for comprehensive log analysis
Test alerting: Regularly verify that alerts fire correctly and that notification channels (email, Slack) are functional

Backup and Disaster Recovery

Test restore procedures: The disaster recovery test CronJob runs weekly, but also perform manual restore tests quarterly
Verify backup retention: Ensure backups older than the retention period (30 days for PostgreSQL/application, 7 days for Redis) are automatically deleted
Document restore procedures: Maintain runbooks that detail step-by-step restore processes for each component
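Off-site storage: All backups are stored in AWS S3, separate from the production cluster, which protects them against cluster-level failures; protection against a regional outage additionally requires replicating the bucket to a second region (for example with S3 Cross-Region Replication)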
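
Retention can be spot-checked from a bucket listing. The sketch below prints every backup whose embedded date is older than a given cutoff; an empty result means nothing has outlived the retention period. It assumes the exposcholar-backup-YYYYMMDD-HHMMSS naming shown in the restore example, and expired_backups is a hypothetical helper name.

```shell
# expired_backups CUTOFF: read `aws s3 ls` output (or bare object keys) on
# stdin and print the keys whose embedded date (YYYYMMDD) is before CUTOFF.
expired_backups() {
  awk -v cutoff="$1" '{
    key = $NF
    if (match(key, /-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-/)) {
      stamp = substr(key, RSTART + 1, 8)   # the YYYYMMDD part
      if (stamp < cutoff) print key
    }
  }'
}
```

With GNU date, the 30-day cutoff for PostgreSQL could be computed as date -d '30 days ago' +%Y%m%d and passed in: aws s3 ls s3://exposcholar-backups/postgresql/ | expired_backups "$(date -d '30 days ago' +%Y%m%d)". The -d option is specific to GNU date.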

Security Hardening

Keep security configs updated: Regularly update bot blocker lists, honeypot rules, and network policies as new threats emerge
Review access controls: Periodically audit RBAC policies and network policies to ensure they follow the principle of least privilege
Monitor security logs: Review honeypot logs, Nginx access logs, and Kubernetes audit logs for suspicious activity
Apply security patches: Keep base images, Kubernetes versions, and dependencies up to date with security patches

Autoscaling Configuration

Set appropriate thresholds: HPA CPU/memory thresholds (70% CPU, 80% memory) should be tuned based on actual workload patterns
Monitor scaling events: Review HPA and VPA scaling decisions to ensure they align with application requirements
Configure cluster autoscaler: Ensure cluster autoscaler is properly configured for your cloud provider to handle node scaling
Test scaling behavior: Verify that applications handle pod scaling gracefully and that database connections are properly managed

Troubleshooting

Common Issues

Problem: Pods fail to start or crash loop.

Solutions:

Check pod logs: kubectl logs <pod-name> -n exposcholar (for a crash-looping pod, add --previous to read the logs of the last terminated container)
Verify ConfigMaps and Secrets exist and are correctly mounted
Review resource requests/limits in deployment manifests
Check for image pull errors: kubectl describe pod <pod-name> -n exposcholar
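
To find which pods need attention in the first place, the listing can be filtered by status. This is a sketch: unhealthy_pods is a hypothetical helper that expects the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print the name and status of every pod that is not Running or Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Usage: kubectl get pods -n exposcholar --no-headers | unhealthy_pods. A pod can be Running while failing its readiness probe, so the READY column is still worth checking separately.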

Problem: Services are not accessible.

Solutions:

Verify Service selectors match Deployment labels
Check Ingress configuration and TLS certificates
Review network policies that might block traffic
Test service endpoints directly: kubectl port-forward svc/<service-name> <local-port>:<service-port> -n exposcholar

Problem: Backups are not running.

Solutions:

Check CronJob status: kubectl get cronjob -n exposcholar
Review recent job executions: kubectl get jobs -n exposcholar
Verify AWS credentials in Secrets
Check S3 bucket permissions and connectivity

Problem: Monitoring stack is not collecting metrics.

Solutions:

Verify Prometheus can scrape targets: Check /targets in Prometheus UI
Review the Prometheus scrape configuration (or the ServiceMonitor/PodMonitor resources, if the Prometheus Operator is in use)
Ensure services expose /metrics endpoints
Check RBAC permissions for Prometheus service account

Additional Resources

Kubernetes Documentation: https://kubernetes.io/docs/
Prometheus Documentation: https://prometheus.io/docs/
Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/
Docker Documentation: https://docs.docker.com/
Nginx Documentation: https://nginx.org/en/docs/

For detailed component-specific documentation, see the README files in each subdirectory of infrastructure/.

Last Updated: 2025-01-25
ExpoScholar Version: v0.9.3-beta+3
