A Kubernetes operator for automated Kafka backup and disaster recovery. Built with Rust using kube-rs for high performance and reliability.
- Scheduled Backups - Cron-based automatic backups with configurable retention
- Point-in-Time Recovery (PITR) - Restore data to any specific timestamp
- Multi-Cloud Storage - Support for PVC, S3, Azure Blob Storage, and GCS
- Azure Workload Identity - Secure, secretless authentication for Azure
- Compression - LZ4 and Zstd compression support with configurable levels
- Checkpointing - Resumable backups that survive pod restarts
- Rate Limiting - Control backup/restore throughput to minimize cluster impact
- Circuit Breaker - Automatic failure detection and recovery
- Topic Mapping - Restore to different topic names or partitions
- Consumer Offset Management - Reset and rollback consumer group offsets
- Validation Evidence - Validate backups and generate compliance evidence reports
- Prometheus Metrics - Full observability with built-in metrics endpoint

Prerequisites:

- Kubernetes cluster (1.26+)
- Helm 3.x
- A running Kafka cluster
```bash
# Add the OSO DevOps Helm repository
helm repo add oso https://osodevops.github.io/helm-charts/
helm repo update

# Install the operator
helm install kafka-backup-operator oso/kafka-backup-operator \
  --namespace kafka-backup-system \
  --create-namespace
```

```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
  name: my-backup
  namespace: kafka
spec:
  kafkaCluster:
    bootstrapServers:
      - kafka-bootstrap:9092
  topics:
    - orders
    - events
  storage:
    storageType: pvc
    pvc:
      claimName: kafka-backups
  schedule: "0 0 */6 * * * *"  # Every 6 hours
  stopAtCurrentOffsets: true
  compression: zstd
```

```bash
kubectl apply -f backup.yaml
```

The operator provides five CRDs for managing Kafka backup and restore operations:
| CRD | Short Name | Description |
|---|---|---|
| `KafkaBackup` | `kb` | Define backup schedules and configurations |
| `KafkaRestore` | `kr` | Trigger restore operations from backups |
| `KafkaOffsetReset` | `kor` | Reset consumer group offsets |
| `KafkaOffsetRollback` | `korb` | Rollback offsets after failed restores |
| `KafkaBackupValidation` | `kbv` | Validate backups and produce evidence reports |
This release aligns the operator CRDs and adapters with kafka-backup-core v0.12.0.
- `checkpoint.enabled` no longer makes a `KafkaBackup` run continuously. Set `continuous: true` explicitly for streaming backups.
- Scheduled point-in-time backups should set `stopAtCurrentOffsets: true` so the backup exits after it reaches the high watermarks captured at start.
- `KafkaBackup` can now snapshot consumer group offsets with `consumerGroupSnapshot: true`; `KafkaRestore` can load those snapshots with `autoConsumerGroups: true`.
- `KafkaRestore` now exposes `repartitioning`, `produceAcks`, `produceTimeoutMs`, and `purgeTopics`.
- S3-compatible endpoints can use `storage.s3.pathStyle` and `storage.s3.allowHttp`; `allowHttp` logs a warning because it may enable plaintext object storage traffic.
- `kafkaCluster.connection.connectionsPerBroker` can be tuned on backup, restore, validation, offset reset, and offset rollback resources.
- Azure storage authentication validation now accepts the adapter-supported methods: workload identity, service principal, SAS token, account key, or default credential fallback.
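
To illustrate the first note, the sketch below shows a streaming backup that sets `continuous: true` explicitly while keeping checkpointing enabled. The field names come from the notes and examples in this README. The placement of `continuous` at the top level of `spec`, the resource name `streaming-backup`, and the value `4` for `connectionsPerBroker` are assumptions made for illustration, so check the generated CRD schema before relying on them.

```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
  name: streaming-backup          # hypothetical name
spec:
  kafkaCluster:
    bootstrapServers:
      - kafka-bootstrap:9092
    connection:
      connectionsPerBroker: 4     # illustrative value, not a recommendation
  topics:
    - orders
  storage:
    storageType: pvc
    pvc:
      claimName: kafka-backups
  # Assumed to sit at spec level. Checkpointing alone no longer implies
  # a continuous run, so streaming must be requested explicitly.
  continuous: true
  checkpoint:
    enabled: true
    intervalSecs: 30
```

A scheduled point-in-time backup would instead omit `continuous`, add a `schedule`, and set `stopAtCurrentOffsets: true`.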
```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
  name: production-backup
spec:
  kafkaCluster:
    bootstrapServers:
      - kafka-bootstrap:9092
    securityProtocol: SASL_SSL
    tlsSecret:
      name: kafka-tls
    saslSecret:
      name: kafka-sasl
      mechanism: SCRAM-SHA-512
  topics:
    - orders
    - inventory
    - events
  storage:
    storageType: azure
    azure:
      container: kafka-backups
      accountName: mystorageaccount
      prefix: production
      useWorkloadIdentity: true
  schedule: "0 0 2 * * * *"  # Daily at 2 AM
  compression: zstd
  compressionLevel: 3
  stopAtCurrentOffsets: true
  includeOffsetHeaders: true
  sourceClusterId: production
  consumerGroupSnapshot: true
  checkpoint:
    enabled: true
    intervalSecs: 30
```

```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
  name: s3-backup
spec:
  kafkaCluster:
    bootstrapServers:
      - kafka:9092
  topics:
    - my-topic
  storage:
    storageType: s3
    s3:
      bucket: my-kafka-backups
      region: eu-west-1
      endpoint: http://minio.storage.svc.cluster.local:9000
      pathStyle: true
      allowHttp: true
      prefix: backups
      credentialsSecret:
        name: aws-credentials
        accessKeyIdKey: AWS_ACCESS_KEY_ID
        secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
  schedule: "0 0 */4 * * * *"
```

```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaRestore
metadata:
  name: restore-orders
spec:
  backupRef:
    name: production-backup
    backupId: "production-backup-20251210-020000"  # Optional: specific backup
  kafkaCluster:
    bootstrapServers:
      - kafka-bootstrap:9092
  topics:
    - orders
  # Optional: Point-in-time recovery
  pitr:
    endTime: "2025-12-10T12:00:00Z"
  # Optional: Restore to different topic
  topicMapping:
    orders: orders-restored
  repartitioning:
    orders-restored:
      strategy: murmur2
      targetPartitions: 6
  createTopics: true
  produceAcks: -1
  produceTimeoutMs: 30000
  autoConsumerGroups: true
  purgeTopics: false
  # Safety: Create snapshot before restore
  rollback:
    snapshotBeforeRestore: true
    autoRollbackOnFailure: true
```

```yaml
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaOffsetReset
metadata:
  name: reset-consumer
spec:
  kafkaCluster:
    bootstrapServers:
      - kafka-bootstrap:9092
  consumerGroups:
    - my-consumer-group
  topics:
    - orders
  resetStrategy: to-earliest
```

Key configuration options for the Helm chart:
```yaml
# values.yaml
replicaCount: 1

image:
  repository: ghcr.io/osodevops/kafka-backup-operator
  tag: ""  # Defaults to appVersion

serviceAccount:
  create: true
  annotations: {}

# Azure Workload Identity
azureWorkloadIdentity:
  enabled: false
  clientId: ""

# Logging
logging:
  level: "info,kafka_backup_operator=debug"

# Metrics
metrics:
  enabled: true
  serviceMonitor:
    enabled: false
    interval: 30s

# Resources
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

# Extra pod volumes and mounts
extraVolumes:
  - name: kafka-tls
    secret:
      secretName: kafka-tls-certs
  - name: custom-config
    configMap:
      name: kafka-backup-config
extraVolumeMounts:
  - name: kafka-tls
    mountPath: /certs/kafka
    readOnly: true
  - name: custom-config
    mountPath: /config
    readOnly: true

# Extra env vars for custom CA trust / endpoint behavior
extraEnv:
  - name: SSL_CERT_FILE
    value: /etc/internal-certs/ca.crt
```

For secure, secretless authentication to Azure Blob Storage:
```bash
# Enable Workload Identity on AKS
az aks update --resource-group myRG --name myAKS \
  --enable-oidc-issuer --enable-workload-identity

# Create managed identity
az identity create --resource-group myRG --name kafka-backup-identity

# Assign Storage Blob Data Contributor role
az role assignment create \
  --assignee-object-id $(az identity show -g myRG -n kafka-backup-identity --query principalId -o tsv) \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/.../storageAccounts/mystorageaccount

# Create federated credential
az identity federated-credential create \
  --resource-group myRG \
  --identity-name kafka-backup-identity \
  --name kafka-backup-fedcred \
  --issuer $(az aks show -g myRG -n myAKS --query oidcIssuerProfile.issuerUrl -o tsv) \
  --subject system:serviceaccount:kafka-backup-system:kafka-backup-operator \
  --audience api://AzureADTokenExchange

# Install with Workload Identity enabled
helm install kafka-backup-operator oso/kafka-backup-operator \
  --namespace kafka-backup-system \
  --set azureWorkloadIdentity.enabled=true \
  --set azureWorkloadIdentity.clientId=$(az identity show -g myRG -n kafka-backup-identity --query clientId -o tsv)
```

See docs/azure-workload-identity.md for detailed setup instructions.
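
For orientation, Azure Workload Identity generally relies on a service account annotated with the managed identity's client ID, plus pods labelled `azure.workload.identity/use: "true"`. The manifest below is a sketch of what the resulting service account might look like. The annotation key is the standard Azure Workload Identity one, but whether the chart renders exactly this object is an assumption, and the client ID shown is a placeholder.

```yaml
# Sketch only: the chart is assumed to render something equivalent
# when azureWorkloadIdentity.enabled=true.
apiVersion: v1
kind: ServiceAccount
metadata:
  # Name and namespace must match the --subject of the federated credential:
  # system:serviceaccount:kafka-backup-system:kafka-backup-operator
  name: kafka-backup-operator
  namespace: kafka-backup-system
  annotations:
    # Placeholder value; use the clientId of kafka-backup-identity.
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
```

If authentication fails, comparing the live service account's name, namespace, and annotation against the federated credential's subject is a reasonable first check.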
The operator exposes Prometheus metrics on port 8080:
| Metric | Description |
|---|---|
| `kafka_backup_reconciliations_total` | Total reconciliation attempts |
| `kafka_backup_reconcile_duration_seconds` | Reconciliation duration histogram |
| `kafka_backup_backups_total` | Total backups by status |
| `kafka_backup_backup_size_bytes` | Backup size in bytes |
| `kafka_backup_backup_records` | Records processed |
| `kafka_backup_restores_total` | Total restores by status |
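
As an example of putting these metrics to use, the rule below sketches an alert on failed backups for clusters running the Prometheus Operator. The `status="failed"` label name and value are assumptions inferred from the "by status" description, and the rule name, namespace, and thresholds are hypothetical. Inspect the operator's metrics output for the actual label set before adopting it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-backup-alerts        # hypothetical name
  namespace: kafka-backup-system
  labels:
    release: prometheus            # should match your Prometheus rule selector
spec:
  groups:
    - name: kafka-backup
      rules:
        - alert: KafkaBackupFailed
          # Assumes a `status` label with a `failed` value; verify before use.
          expr: increase(kafka_backup_backups_total{status="failed"}[1h]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "At least one Kafka backup failed in the last hour"
```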
```yaml
# Enable in Helm values
metrics:
  serviceMonitor:
    enabled: true
    interval: 30s
    labels:
      release: prometheus
```

- Normal Operation: `KafkaBackup` runs on schedule, storing backups to cloud storage
- Disaster Occurs: Kafka cluster fails or data is corrupted
- Recovery:
  - Identify the backup to restore from: `kubectl get kafkabackup my-backup -o yaml`
  - Create a `KafkaRestore` resource pointing to the backup
  - Monitor progress: `kubectl get kafkarestore -w`
- Post-Recovery: Optionally reset consumer offsets with `KafkaOffsetReset`
```
┌─────────────────────────────────────────────────────────────────┐
│                       Kubernetes Cluster                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   kafka-backup-operator                   │  │
│  │  ┌─────────────┐ ┌─────────────┐  ┌─────────────────────┐ │  │
│  │  │   Backup    │ │   Restore   │  │    Offset Reset/    │ │  │
│  │  │ Controller  │ │ Controller  │  │    Rollback Ctrl    │ │  │
│  │  └──────┬──────┘ └──────┬──────┘  └──────────┬──────────┘ │  │
│  │         │               │                    │            │  │
│  │         └───────────────┼────────────────────┘            │  │
│  │                         │                                 │  │
│  │                         ▼                                 │  │
│  │              ┌─────────────────────┐                      │  │
│  │              │  kafka-backup-core  │                      │  │
│  │              │     (Rust lib)      │                      │  │
│  │              └─────────────────────┘                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                            │                                    │
│                            ▼                                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Kafka Cluster                       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Cloud Storage                          │
│          ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│          │    S3    │    │  Azure   │    │   GCS    │           │
│          │          │    │   Blob   │    │          │           │
│          └──────────┘    └──────────┘    └──────────┘           │
└─────────────────────────────────────────────────────────────────┘
```
```bash
# Clone the repository
git clone https://github.com/osodevops/kafka-backup-operator.git
cd kafka-backup-operator

# Build
cargo build --release

# Generate CRDs
cargo run --bin crdgen > deploy/crds/all.yaml

# Run tests
cargo test
```

See minikube/README.md for local development setup with Confluent for Kubernetes.
Contributions are welcome! Please read our Contributing Guide for details on the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- kafka-backup-core - The core backup library
- OSO DevOps Helm Charts - Helm chart repository