Infrastructure
This page provides an overview of ExpoScholar's infrastructure components, operational procedures, and best practices for managing production deployments.
The infrastructure/ directory serves as the single source of truth for all operational concerns, enabling reproducible deployments, automated operations, security hardening, disaster recovery, and infrastructure-as-code practices.
Infrastructure components include:
- Containerization: Docker packaging for consistent deployment across environments
- Orchestration: Kubernetes for managing containers with automated scaling and health checks
- Autoscaling: Automatic resource adjustment based on demand
- Monitoring: System health, performance metrics, and error detection
- Security: Network policies, bot blocking, honeypots, and access controls
- Backups: Automated backup jobs with retention policies and restore procedures
- Cloud Provisioning: Infrastructure-as-code using Terraform
| Path | Purpose |
|---|---|
| docker/ | Dockerfiles, docker-compose stacks, Nginx reverse proxy configuration |
| kubernetes/ | Namespace, deployments, services, ingress, secrets, configmaps |
| autoscaling/ | Horizontal/vertical pod autoscalers, cluster autoscaler configurations |
| maintenance/ | Maintenance page template, whitelist examples, runbooks |
| monitoring/ | Prometheus/Alertmanager/Grafana configurations and alert rules |
| security/ | Bot blocker rules, honeypot, network policies, RBAC, Nginx security configurations |
| backup/ | CronJobs for PostgreSQL/Redis/application backups and restore scripts |
| terraform/ | AWS infrastructure-as-code (VPC, compute, storage, networking) |
Each subdirectory contains a dedicated README with comprehensive deployment instructions, operational procedures, and technical details.
Docker provides consistent application packaging and deployment across development, staging, and production environments.
Server Dockerfile (infrastructure/docker/Dockerfile):
- Base image: Python 3.11
- Installs system dependencies and Python packages
- Configures Gunicorn as the WSGI server
- Sets up proper file permissions and security
Mobile Dockerfile (infrastructure/docker/Dockerfile.mobile):
- Base image: Flutter SDK
- Builds Flutter web application
- Outputs static files for Nginx serving
Docker Compose (infrastructure/docker/docker-compose.yml):
- Orchestrates PostgreSQL, Redis, Django server, and Nginx
- Configures networking and volume mounts
- Sets up development environment
Build Images:
# Build server image
docker build -f infrastructure/docker/Dockerfile -t exposcholar-server:latest .
# Build mobile image
docker build -f infrastructure/docker/Dockerfile.mobile -t exposcholar-mobile:latest .
Run Development Stack:
cd infrastructure/docker
docker compose up -d
View Logs:
docker compose logs -f
Stop Services:
docker compose down
Kubernetes provides container orchestration with automated scaling, health checks, rolling updates, and service discovery.
Namespace (infrastructure/kubernetes/namespace.yaml):
- Isolates ExpoScholar resources from other applications
- Applies resource quotas and limits
Deployment (infrastructure/kubernetes/deployment.yaml):
- Defines pod replicas, container images, and resource requests/limits
- Configures health checks (liveness and readiness probes)
- Manages rolling updates
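To make the probe and rolling-update settings concrete, the following is a minimal sketch of what such a Deployment could look like. It is not a copy of infrastructure/kubernetes/deployment.yaml: the replica count, the port 8000, the /healthz path, the label names, and the resource figures are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: port 8000, the /healthz path, labels, and resource
# figures are assumptions, not values read from the repository manifest.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exposcholar-server
  namespace: exposcholar
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep full capacity while new pods start
      maxSurge: 1         # add at most one extra pod during a rollout
  selector:
    matchLabels:
      app: exposcholar-server
  template:
    metadata:
      labels:
        app: exposcholar-server
    spec:
      containers:
        - name: server
          image: exposcholar-server:latest
          ports:
            - containerPort: 8000
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits: { cpu: "1", memory: 512Mi }
          readinessProbe:   # gates traffic until the app answers
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:    # restarts the container if it stops answering
            httpGet: { path: /healthz, port: 8000 }
            initialDelaySeconds: 15
            periodSeconds: 20
EOF
echo "wrote $manifest"
# Validate without changing the cluster (needs kubectl and cluster access):
#   kubectl apply --dry-run=server -f "$manifest"
```

Setting maxUnavailable to 0 trades a slower rollout for no loss of serving capacity; whether that trade-off suits ExpoScholar depends on traffic and node headroom.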
Service (infrastructure/kubernetes/service.yaml):
- Exposes pods via stable network endpoints
- Load balances traffic across pod replicas
Ingress (infrastructure/kubernetes/ingress.yaml):
- Routes external traffic to services
- Manages TLS/SSL termination
- Configures domain-based routing
ConfigMap (infrastructure/kubernetes/configmap.yaml):
- Stores non-sensitive configuration data
- Environment variables and application settings
Secrets (infrastructure/kubernetes/secrets.yaml):
- Stores sensitive data (passwords, API keys, certificates)
- Base64-encoded values
- Must be created/updated manually with production credentials
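Base64 is an encoding, not encryption, which is why Secret manifests with real values should stay out of version control. The round trip below is a small illustration using a placeholder value; it needs nothing beyond the standard base64 utility.

```shell
#!/bin/sh
# "your-password" is a placeholder, not a real credential.
encoded="$(printf '%s' 'your-password' | base64)"
echo "$encoded"

# Decoding needs no key: anyone who can read the Secret can recover the value.
decoded="$(printf '%s' "$encoded" | base64 --decode)"
echo "$decoded"
```

Using printf '%s' instead of a plain echo avoids encoding a trailing newline into the stored value.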
1. Apply Core Resources:
kubectl apply -f infrastructure/kubernetes/namespace.yaml
kubectl apply -f infrastructure/kubernetes/configmap.yaml
kubectl create secret generic exposcholar-secrets \
  --from-env-file=server/.env.production \
  -n exposcholar
2. Deploy Application:
kubectl apply -f infrastructure/kubernetes/deployment.yaml
kubectl apply -f infrastructure/kubernetes/service.yaml
kubectl apply -f infrastructure/kubernetes/ingress.yaml
3. Verify Deployment:
kubectl get pods -n exposcholar
kubectl get services -n exposcholar
kubectl get ingress -n exposcholar
4. Check Logs:
kubectl logs -f deployment/exposcholar-server -n exposcholar
The monitoring stack provides comprehensive observability into system health, performance, and errors.
Prometheus:
- Time-series database for metrics collection
- Scrapes metrics from all ExpoScholar components and Kubernetes infrastructure
- Stores historical data for trend analysis
Alertmanager:
- Routes alerts based on severity and service
- Sends notifications via email, Slack, and other channels
- Groups and deduplicates alerts
Grafana:
- Metrics visualization and dashboard platform
- Custom dashboards for application and infrastructure metrics
- Real-time monitoring and historical analysis
Node Exporter:
- System metrics exporter
- Provides CPU, memory, disk, and network metrics from cluster nodes
Blackbox Exporter:
- HTTP/HTTPS endpoint health checker
- Probes ExpoScholar services for availability
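As an illustration of how Blackbox Exporter probes can feed alerting, here is a hedged sketch of a Prometheus rule file. The group name, alert name, severity label, and five-minute window are assumptions; the actual rules in infrastructure/monitoring/rules/exposcholar.yml may differ. The probe_success metric is the one Blackbox Exporter exposes for each probe.

```shell
#!/bin/sh
# Hypothetical rule file: names, labels, and the 5m window are assumptions.
rules="$(mktemp)"
cat > "$rules" <<'EOF'
groups:
  - name: exposcholar-availability
    rules:
      - alert: EndpointDown
        expr: probe_success == 0   # Blackbox Exporter reports 0 on a failed probe
        for: 5m                    # must stay failing for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is unreachable"
EOF
echo "wrote $rules"
# If promtool is installed, the file can be syntax-checked with:
#   promtool check rules "$rules"
```

The for clause keeps a single failed probe from paging anyone; shortening it gives faster detection at the cost of noisier alerts.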
Deploy Monitoring Stack:
kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml
Verify Components:
kubectl get pods -n monitoring
Access Grafana:
kubectl port-forward svc/grafana -n monitoring 3000:3000
# Open http://localhost:3000
Configure Alerting:
- Update infrastructure/monitoring/alertmanager.yml with SMTP credentials and Slack webhook URLs
- Update the ConfigMap:
kubectl apply -f infrastructure/monitoring/k8s-monitoring.yaml
Customize Alert Rules:
- Modify infrastructure/monitoring/rules/exposcholar.yml to adjust alert thresholds
- Add new alert conditions as needed
Security configurations protect against attacks and unauthorized access.
Pod Security Standards:
- Enforces restricted security policy on namespaces
- Prevents privilege escalation and host access
- Configures read-only root filesystems
Network Policies:
- Restricts pod-to-pod communication
- Implements network segmentation
- Controls ingress and egress traffic
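A common starting point for network segmentation is a default-deny ingress policy, with explicit allow rules layered on top. The sketch below shows that baseline for the exposcholar namespace; the policy name is an assumption, and the repository's network-policies.yaml may be organized differently.

```shell
#!/bin/sh
# Hypothetical baseline policy; the name "default-deny-ingress" is an assumption.
policy="$(mktemp)"
cat > "$policy" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: exposcholar
spec:
  podSelector: {}      # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all inbound traffic is denied
EOF
echo "wrote $policy"
# Apply only after the matching allow rules exist (needs kubectl):
#   kubectl apply -f "$policy"
```

Applied on its own, this policy blocks all inbound traffic to the namespace, so allow rules for the ingress controller and for pod-to-pod paths need to be in place first. It also takes effect only when the cluster's network plugin enforces NetworkPolicy.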
RBAC (Role-Based Access Control):
- Defines service account permissions
- Limits access to Kubernetes resources
- Follows principle of least privilege
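Least privilege can be sketched as a Role that grants read access to one named Secret, bound to a single service account. The Role, RoleBinding, and service account names below are assumptions for illustration; only the exposcholar-secrets name and the exposcholar namespace come from this page.

```shell
#!/bin/sh
# Hypothetical RBAC objects; the Role, RoleBinding, and ServiceAccount names
# are assumptions. Only "exposcholar-secrets" appears elsewhere on this page.
rbac="$(mktemp)"
cat > "$rbac" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-reader
  namespace: exposcholar
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["exposcholar-secrets"]   # one Secret, not all of them
    verbs: ["get"]                           # read only; no list, watch, or update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: server-reads-app-secret
  namespace: exposcholar
subjects:
  - kind: ServiceAccount
    name: exposcholar-server
    namespace: exposcholar
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: secret-reader
EOF
echo "wrote $rbac"
# Check the resulting permission (needs kubectl and cluster access):
#   kubectl auth can-i get secret/exposcholar-secrets \
#     --as=system:serviceaccount:exposcholar:exposcholar-server -n exposcholar
```

Restricting by resourceNames works for get but not for list, which is one reason the sketch grants get only.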
Bot Blocker:
- Nginx-based bot detection and blocking
- Blocks known malicious user agents and IPs
- Reduces automated attack traffic
Honeypot:
- Decoy endpoints to detect scanning and attacks
- Logs suspicious traffic patterns
- Provides early warning of security threats
Apply Security Policies:
kubectl apply -f infrastructure/security/pod-security-standards.yaml
kubectl apply -f infrastructure/security/network-policies.yaml
kubectl apply -f infrastructure/security/rbac.yaml
Deploy Bot Blocker:
- Mount the security/bot-blocker/ directory into Nginx containers
- Include configuration files in nginx.conf
Enable Honeypot:
- Mount security/honeypot/honeypot.conf into Nginx
- Monitor logs for suspicious traffic patterns
Automated backups ensure data protection and enable disaster recovery.
PostgreSQL Backups:
- Daily automated backups via CronJob
- Compressed SQL dumps stored in S3
- 30-day retention policy
Redis Backups:
- Daily automated backups via CronJob
- RDB snapshots stored in S3
- 7-day retention policy
Application Backups:
- Daily automated backups of media files
- Compressed archives stored in S3
- 30-day retention policy
Disaster Recovery Testing:
- Weekly automated restore tests
- Validates backup integrity
- Ensures restore procedures are functional
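Two small checks that a restore test might include are picking the most recent dump and confirming that it is a readable gzip stream. The helpers below are a sketch under the assumption that backup names embed a sortable YYYYMMDD-HHMMSS timestamp, as in the restore example on this page; they are not taken from the repository's restore scripts.

```shell
#!/bin/sh
# Read one backup name per line on stdin and print the newest. Names such as
# exposcholar-backup-20240115-020000.sql.gz embed a sortable timestamp, so
# the lexicographically last name is also the most recent.
latest_backup() {
  LC_ALL=C sort | tail -n 1
}

# Return 0 if the file is a readable gzip stream, non-zero otherwise.
# This checks the compression layer only, not the SQL inside it.
verify_gzip() {
  gzip -t < "$1" 2>/dev/null
}

# Usage sketch (needs the AWS CLI and bucket access):
#   aws s3 ls s3://exposcholar-backups/postgresql/ | awk '{print $4}' | latest_backup
```

A passing gzip check does not prove the dump restores cleanly, so a test that loads the dump into a scratch database remains the stronger validation.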
Deploy Backup CronJobs:
kubectl apply -f infrastructure/backup/backup-cronjob.yaml
Verify Backup Execution:
kubectl get cronjob -n exposcholar
kubectl get jobs -n exposcholar | grep backup
List Backups:
aws s3 ls s3://exposcholar-backups/postgresql/
aws s3 ls s3://exposcholar-backups/redis/
aws s3 ls s3://exposcholar-backups/application/
Restore from Backup:
./infrastructure/backup/restore-scripts/restore-postgresql.sh \
  s3://exposcholar-backups/postgresql/exposcholar-backup-20240115-020000.sql.gz
Autoscaling automatically adjusts resources based on demand to optimize performance and cost.
Horizontal Pod Autoscaler (HPA):
- Scales pod replicas based on CPU and memory usage
- Configurable thresholds (default: 70% CPU, 80% memory)
- Minimum and maximum replica limits
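The core HPA calculation documented by Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured minimum and maximum. The function below is a simplified sketch of that arithmetic using integer percentages. It ignores the controller's tolerance band and stabilization window, and the replica bounds in the example call are assumptions.

```shell
#!/bin/sh
# desired = ceil(current_replicas * current_metric / target_metric),
# then clamped to [min, max]. All arguments are integers.
desired_replicas() {
  current=$1 metric=$2 target=$3 min=$4 max=$5
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 4 pods averaging 90% CPU against a 70% target, bounds 2..10 (assumed)
desired_replicas 4 90 70 2 10   # prints 6
```

Working through a few values this way helps when tuning the 70% CPU and 80% memory targets against observed load.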
Vertical Pod Autoscaler (VPA):
- Adjusts pod resource requests and limits
- Recommends optimal resource allocation
- Can automatically apply recommendations
Cluster Autoscaler:
- Scales cluster nodes based on pod scheduling needs
- Adds nodes when pods cannot be scheduled
- Removes nodes when underutilized
Apply Autoscalers:
kubectl apply -f infrastructure/autoscaling/hpa.yaml
kubectl apply -f infrastructure/autoscaling/vpa.yaml
kubectl apply -f infrastructure/autoscaling/cluster-autoscaler.yaml
Monitor Scaling Events:
kubectl get hpa -n exposcholar
kubectl describe hpa -n exposcholar
Maintenance mode allows planned downtime for updates and maintenance.
Enable Maintenance Mode:
./scripts/utils/maintenance_mode_enable.sh --reload-nginx
Configure Whitelisting:
./scripts/utils/maintenance_mode_whitelist.sh add 192.168.1.100
Disable Maintenance Mode:
./scripts/utils/maintenance_mode_disable.sh --reload-nginx
Terraform enables version-controlled cloud infrastructure provisioning.
AWS Infrastructure:
- VPC and networking configuration
- RDS PostgreSQL database
- ElastiCache Redis cluster
- S3 buckets for backups and static files
- IAM roles and policies
Kubernetes Infrastructure:
- EKS cluster provisioning (optional)
- Node groups and autoscaling
- Load balancers and ingress
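Before running the Terraform steps, it can help to fail fast when the AWS credential variables are not set. The helper below is a hypothetical preflight check, not part of the repository; the function name require_env is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: report any unset or empty variables and return non-zero.
# Arguments must be plain variable names, since they are expanded with eval.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch (terraform is not invoked here):
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION \
#     && terraform init
```

The check only confirms the variables are non-empty; it does not verify that the credentials are valid or carry the needed IAM permissions.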
1. Configure AWS Credentials:
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-west-2
2. Initialize Terraform:
cd infrastructure/terraform
terraform init
3. Review Planned Changes:
terraform plan -out=tfplan
4. Apply Infrastructure:
terraform apply tfplan
5. Configure Remote State (for shared environments):
terraform {
  backend "s3" {
    bucket         = "exposcholar-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
- Never commit secrets: All sensitive data must be stored in Kubernetes Secrets, not in version control
- Use Base64 encoding: When creating Secrets manually, encode values:
echo -n "your-password" | base64
- Rotate credentials regularly: Establish a schedule for rotating database passwords, API keys, and TLS certificates
- Limit secret access: Use RBAC to restrict which service accounts and users can access Secrets
- Version control all configs: All configuration files should be tracked in Git, with the exception of secrets and environment-specific overrides
- Use ConfigMaps for non-sensitive data: Application settings, feature flags, and non-secret environment variables belong in ConfigMaps
- Document configuration changes: Maintain changelogs or commit messages that explain why configurations were modified
- Set up alerting early: Configure Prometheus alerts and Alertmanager routing before deploying to production
- Review dashboards regularly: Use Grafana dashboards to identify trends, capacity issues, and performance degradation
- Log aggregation: Consider integrating with centralized logging solutions (ELK stack, Loki, CloudWatch) for comprehensive log analysis
- Test alerting: Regularly verify that alerts fire correctly and that notification channels (email, Slack) are functional
- Test restore procedures: The disaster recovery test CronJob runs weekly, but also perform manual restore tests quarterly
- Verify backup retention: Ensure backups older than the retention period (30 days for PostgreSQL/application, 7 days for Redis) are automatically deleted
- Document restore procedures: Maintain runbooks that detail step-by-step restore processes for each component
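- Off-site storage: All backups are stored in AWS S3, separate from the production cluster. S3 protects against a regional failure only if the bucket is in a different region from production or has cross-region replication enabled, so confirm which applies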
- Keep security configs updated: Regularly update bot blocker lists, honeypot rules, and network policies as new threats emerge
- Review access controls: Periodically audit RBAC policies and network policies to ensure they follow the principle of least privilege
- Monitor security logs: Review honeypot logs, Nginx access logs, and Kubernetes audit logs for suspicious activity
- Apply security patches: Keep base images, Kubernetes versions, and dependencies up to date with security patches
- Set appropriate thresholds: HPA CPU/memory thresholds (70% CPU, 80% memory) should be tuned based on actual workload patterns
- Monitor scaling events: Review HPA and VPA scaling decisions to ensure they align with application requirements
- Configure cluster autoscaler: Ensure cluster autoscaler is properly configured for your cloud provider to handle node scaling
- Test scaling behavior: Verify that applications handle pod scaling gracefully and that database connections are properly managed
Problem: Pods fail to start or crash loop.
Solutions:
- Check pod logs:
kubectl logs <pod-name> -n exposcholar
- Verify ConfigMaps and Secrets exist and are correctly mounted
- Review resource requests/limits in deployment manifests
- Check for image pull errors:
kubectl describe pod <pod-name> -n exposcholar
Problem: Services are not accessible.
Solutions:
- Verify Service selectors match Deployment labels
- Check Ingress configuration and TLS certificates
- Review network policies that might block traffic
- Test service endpoints directly:
kubectl port-forward svc/<service-name> <local-port>:<service-port> -n exposcholar
Problem: Backups are not running.
Solutions:
- Check CronJob status:
kubectl get cronjob -n exposcholar
- Review recent job executions:
kubectl get jobs -n exposcholar
- Verify AWS credentials in Secrets
- Check S3 bucket permissions and connectivity
Problem: Monitoring stack is not collecting metrics.
Solutions:
- Verify Prometheus can scrape targets: Check /targets in Prometheus UI
- Review ServiceMonitor or PodMonitor configurations
- Ensure services expose /metrics endpoints
- Check RBAC permissions for Prometheus service account
- Kubernetes Documentation: https://kubernetes.io/docs/
- Prometheus Documentation: https://prometheus.io/docs/
- Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/
- Docker Documentation: https://docs.docker.com/
- Nginx Documentation: https://nginx.org/en/docs/
For detailed component-specific documentation, see the README files in each subdirectory of infrastructure/.
Last Updated: 2025-01-25
ExpoScholar Version: v0.9.3-beta+3