
feat(controlplane): refactor object storage #2613

Open
jirevwe wants to merge 101 commits into main from raymond/feat/refactor-object-storage

Conversation

@jirevwe
Collaborator

@jirevwe jirevwe commented Mar 30, 2026

Summary

Overhauls the backup/export infrastructure to support streaming uploads, multiple storage backends, CDC-based backup via PostgreSQL logical replication, incremental time-windowed exports, and on-demand
manual backup via CLI and API.

Two Backup Modes

| Mode | How It Works | Best For |
| --- | --- | --- |
| CDC (recommended) | Streams WAL INSERTs via pglogrepl, buffers in memory, flushes to blob storage on an interval. Zero DB load at export time; one persistent replication connection. | Production |
| Cron (default) | Periodically queries records created within the last backup interval using REPEATABLE READ transactions. No replication setup needed. | OSS / simpler deployments |

Toggle via CONVOY_CDC_BACKUP_ENABLED=true. Both produce gzip-compressed JSONL at backup/{date}/{table}/{timestamp}.jsonl.gz.

Three Storage Backends

  • S3 (+ MinIO-compatible) — streaming multipart upload via s3manager
  • Azure Blob Storage — streaming upload via azblob.UploadStream, auto-creates container
  • On-Prem — context-aware writes to local filesystem with path traversal protection

New BlobStore interface (Upload(ctx, key, io.Reader)) replaces the old file-based ObjectStore.Save(filename) pattern, eliminating /tmp disk usage during S3 uploads.

Manual Backup (CLI + API)

Operators can trigger one-time backups on demand — always uses the cron-based exporter, never CDC, regardless of config:

# CLI
convoy backup                                                          # last interval
convoy backup --start 2026-04-01T00:00:00Z --end 2026-04-02T00:00:00Z  # custom window

# API
POST /ui/backups/trigger
{"start": "2026-04-01T00:00:00Z", "end": "2026-04-02T00:00:00Z"}
# Returns 202 Accepted with job_id; runs asynchronously in worker

Key Changes

  • internal/pkg/backup_collector/ — New CDC-based backup via pglogrepl. Creates a permanent replication slot, streams WAL INSERTs for events/event_deliveries/delivery_attempts, buffers by table, flushes gzip
    JSONL to any BlobStore on a configurable interval. Atomic flushedLSN, at-least-once flush semantics.
  • internal/pkg/blob-store/ — New package replacing object-store/. Streaming BlobStore interface with S3, Azure, and OnPrem implementations.
  • internal/pkg/exporter/ — Refactored to use BlobStore.Upload via io.Pipe (streaming, no temp files). Export is now global (not per-project), time-windowed (only records in last interval, not full dump).
    Added NewExporterWithWindow for manual backups with explicit start/end bounds.
  • internal/configuration/ — Azure Blob Storage config support (account name, key, container, endpoint, prefix). DB columns + API models + env sync on server startup.
  • worker/task/backup_jobs.go — Claim-table based exactly-once execution via SELECT FOR UPDATE SKIP LOCKED. EnqueueBackupJobIfIdle prevents duplicate jobs. Added ManualBackup handler for on-demand exports.
  • worker/task/retention_policies.go — Backup fully decoupled from retention. Mutex expiry fixed (1s → 30min with renewal goroutine).
  • cmd/backup/backup.go — New CLI command for on-demand backup with --start/--end flags.
  • api/handlers/backup.go — New POST /ui/backups/trigger endpoint, enqueues ManualBackupJob to worker queue.
  • datastore/cached/ — Refactored all cached repositories to use generic cachedrepo utilities, removing duplicate boilerplate.
  • testenv/ — Added Azurite testcontainer with --skipApiVersionCheck, wal_level=logical for CDC tests, unique ULID-based bucket/container names per test for isolation.
  • e2e/backup/ — 6-combination test matrix (CDC/Export × OnPrem/S3/Azure) with content verification.
  • docs/backup-configuration.md — Comprehensive 13-section production guide covering both modes, all backends, interval estimation, monitoring, troubleshooting, manual backup, and recovery.

Configuration

# Common
CONVOY_RETENTION_POLICY_ENABLED=true
CONVOY_BACKUP_INTERVAL=1h               # flush/export frequency (drives cron + CDC)

# CDC mode
CONVOY_CDC_BACKUP_ENABLED=true
CONVOY_REPLICATION_DSN=postgres://...   # direct PG connection (bypasses pgbouncer)

# Storage (pick one)
CONVOY_STORAGE_POLICY_TYPE=s3           # or on_prem, azure_blob
CONVOY_STORAGE_AWS_BUCKET=...
CONVOY_STORAGE_AZURE_ACCOUNT_NAME=...
CONVOY_STORAGE_AZURE_ACCOUNT_KEY=...
CONVOY_STORAGE_AZURE_CONTAINER_NAME=...

Testing

  • E2E backup tests pass: CDC and Export modes across OnPrem, S3 (MinIO), and Azure (Azurite)
  • Unit tests pass for backup collector buffer, flush, BlobStore clients
  • Config tests updated with BackupInterval defaults
  • Manual test: convoy backup exports records and prints counts
  • Manual test: POST /ui/backups/trigger returns 202, worker completes backup
  • Stress tested at 10rps and 100rps with 5-minute intervals

jirevwe added 30 commits March 30, 2026 22:58
jirevwe added 8 commits April 9, 2026 15:26
@jirevwe jirevwe requested a review from mekilis April 10, 2026 09:25
@convoy-engineering
Contributor

@cursoragent bugbot review

@cursor

cursor Bot commented Apr 10, 2026

The GH token still lacks comment permissions. Here is the full third review inline.

Third Review — 19 New Commits Since Last Review

Major new work: time-window exporter refactor, cachedrepo utilities, manual backup API, removal of per-repo cached implementations. CodeQL security findings also flagged. All previously-reported critical/high fixes remain in place.

Critical — New

1. TriggerBackup endpoint lacks authorization: POST /ui/backups/trigger only requires RequireAuth. No project/org/role checks. Any authenticated user can trigger a tenant-wide data export with arbitrary time windows.

2. EnqueueBackupJobIfIdle writes identical hour_start/hour_end (impl.go:73-82) — SQL uses $1, $1 with the same now timestamp, creating a degenerate [now, now) window. Meanwhile, ProcessBackupJob uses NewExporter (config-derived interval), completely ignoring the job's stored window. The hour_start/hour_end columns are unused and misleading.

High — New

  • No rate limiting on /ui/backups/trigger — queue flooding risk
  • Unbounded time window — only validates start < end, no max duration or "not in the future"
  • The r.ContentLength > 0 gate (backup.go:43) skips chunked bodies: Go sets ContentLength to -1 for chunked requests, so their bodies are silently ignored
  • Events export missing deleted_at IS NULL (events/queries.sql:413-421) — soft-deleted events included in backups, unlike event_deliveries and delivery_attempts which filter correctly

Medium — New

  • CodeQL: APIKey sensitive data in cacheHash/Salt fields flow through cachedrepo.FetchOne to the cache backend. Consider stripping secrets from the cached representation.
  • FetchWithNotFound returns (nil, nil) for cached not-found instead of the original sentinel error — changes the error contract for callers expecting ErrAPIKeyNotFound
  • Manual backup silently no-ops when retention is disabled — returns 202 but worker does nothing
  • No durable audit trail for manual backups — no backup_jobs row written
  • Retry produces duplicate blob objects — keys include time.Now(), so retries create new objects

Low — New

  • NewExporterWithWindow doesn't validate start < end
  • NewExporter uses local time while blob keys use UTC
  • DeleteFilter doesn't invalidate cache
  • BackupProjectData swallowed-error issue is now gone (retention is just ret.Perform)

Previously Resolved (still good)

All fixes from reviews 1 and 2 remain intact: ClaimBackupJob ErrNoRows, flushedLSN atomic, shutdown flush context, slot lookup, streamLoop death signaling, OnPrem path traversal + context, nil check, CREATE PUBLICATION IF NOT EXISTS.



postgres:
image: postgres:15.2-alpine
image: postgres:18-alpine
Collaborator


@jirevwe, was this absolutely necessary? I don't have an issue with it. I am only worried about self-hosted upgrades and if there are any incompatibilities users might run into. If we can't do without it, then we'll need to have upgrade notes.

Also, if this is a prerequisite, I'm curious about its availability across cloud providers.

Comment thread sql/1774901839.sql
hour_start TIMESTAMPTZ NOT NULL,
hour_end TIMESTAMPTZ NOT NULL,
status VARCHAR NOT NULL DEFAULT 'pending',
worker_id VARCHAR,
Collaborator


nit: agent_id.
