
feat(controlplane): refactor object storage #2613

Open
jirevwe wants to merge 101 commits into main from raymond/feat/refactor-object-storage

Conversation

@jirevwe
Collaborator

@jirevwe jirevwe commented Mar 30, 2026

Summary

Overhauls the backup/export infrastructure to support streaming uploads, multiple storage backends, CDC-based backup via PostgreSQL logical replication, incremental time-windowed exports, and on-demand
manual backup via CLI and API.

Two Backup Modes

| Mode | How It Works | Best For |
| --- | --- | --- |
| CDC (recommended) | Streams WAL INSERTs via pglogrepl, buffers in memory, flushes to blob storage on an interval. Zero DB load at export time; one persistent replication connection. | Production |
| Cron (default) | Periodically queries records created within the last backup interval using REPEATABLE READ transactions. No replication setup needed. | OSS / simpler deployments |

Toggle via CONVOY_CDC_BACKUP_ENABLED=true. Both produce gzip-compressed JSONL at backup/{date}/{table}/{timestamp}.jsonl.gz.

Three Storage Backends

  • S3 (+ MinIO-compatible) — streaming multipart upload via s3manager
  • Azure Blob Storage — streaming upload via azblob.UploadStream, auto-creates container
  • On-Prem — context-aware writes to local filesystem with path traversal protection

New BlobStore interface (Upload(ctx, key, io.Reader)) replaces the old file-based ObjectStore.Save(filename) pattern, eliminating /tmp disk usage during S3 uploads.

Manual Backup (CLI + API)

Operators can trigger one-time backups on demand — always uses the cron-based exporter, never CDC, regardless of config:

# CLI
convoy backup                                                          # last interval
convoy backup --start 2026-04-01T00:00:00Z --end 2026-04-02T00:00:00Z  # custom window

# API
POST /ui/backups/trigger
{"start": "2026-04-01T00:00:00Z", "end": "2026-04-02T00:00:00Z"}
# Returns 202 Accepted with job_id; runs asynchronously in worker

Key Changes

  • internal/pkg/backup_collector/ — New CDC-based backup via pglogrepl. Creates a permanent replication slot, streams WAL INSERTs for events/event_deliveries/delivery_attempts, buffers by table, flushes gzip
    JSONL to any BlobStore on a configurable interval. Atomic flushedLSN, at-least-once flush semantics.
  • internal/pkg/blob-store/ — New package replacing object-store/. Streaming BlobStore interface with S3, Azure, and OnPrem implementations.
  • internal/pkg/exporter/ — Refactored to use BlobStore.Upload via io.Pipe (streaming, no temp files). Export is now global (not per-project), time-windowed (only records in last interval, not full dump).
    Added NewExporterWithWindow for manual backups with explicit start/end bounds.
  • internal/configuration/ — Azure Blob Storage config support (account name, key, container, endpoint, prefix). DB columns + API models + env sync on server startup.
  • worker/task/backup_jobs.go — Claim-table based exactly-once execution via SELECT FOR UPDATE SKIP LOCKED. EnqueueBackupJobIfIdle prevents duplicate jobs. Added ManualBackup handler for on-demand exports.
  • worker/task/retention_policies.go — Backup fully decoupled from retention. Mutex expiry fixed (1s → 30min with renewal goroutine).
  • cmd/backup/backup.go — New CLI command for on-demand backup with --start/--end flags.
  • api/handlers/backup.go — New POST /ui/backups/trigger endpoint, enqueues ManualBackupJob to worker queue.
  • datastore/cached/ — Refactored all cached repositories to use generic cachedrepo utilities, removing duplicate boilerplate.
  • testenv/ — Added Azurite testcontainer with --skipApiVersionCheck, wal_level=logical for CDC tests, unique ULID-based bucket/container names per test for isolation.
  • e2e/backup/ — 6-combination test matrix (CDC/Export × OnPrem/S3/Azure) with content verification.
  • docs/backup-configuration.md — Comprehensive 13-section production guide covering both modes, all backends, interval estimation, monitoring, troubleshooting, manual backup, and recovery.

Configuration

# Common
CONVOY_RETENTION_POLICY_ENABLED=true
CONVOY_BACKUP_INTERVAL=1h               # flush/export frequency (drives cron + CDC)

# CDC mode
CONVOY_CDC_BACKUP_ENABLED=true
CONVOY_REPLICATION_DSN=postgres://...   # direct PG connection (bypasses pgbouncer)

# Storage (pick one)
CONVOY_STORAGE_POLICY_TYPE=s3           # or on_prem, azure_blob
CONVOY_STORAGE_AWS_BUCKET=...
CONVOY_STORAGE_AZURE_ACCOUNT_NAME=...
CONVOY_STORAGE_AZURE_ACCOUNT_KEY=...
CONVOY_STORAGE_AZURE_CONTAINER_NAME=...

Testing

  • E2E backup tests pass: CDC and Export modes across OnPrem, S3 (MinIO), and Azure (Azurite)
  • Unit tests pass for backup collector buffer, flush, BlobStore clients
  • Config tests updated with BackupInterval defaults
  • Manual test: convoy backup exports records and prints counts
  • Manual test: POST /ui/backups/trigger returns 202, worker completes backup
  • Stress tested at 10rps and 100rps with 5-minute intervals

jirevwe added 30 commits March 30, 2026 22:58
jirevwe added 8 commits April 9, 2026 15:26
@jirevwe jirevwe requested a review from mekilis April 10, 2026 09:25
@convoy-engineering
Contributor

@cursoragent bugbot review

@cursor

cursor Bot commented Apr 10, 2026

The GH token still lacks comment permissions. Here is the full third review inline.

Third Review — 19 New Commits Since Last Review

Major new work: time-window exporter refactor, cachedrepo utilities, manual backup API, removal of per-repo cached implementations. CodeQL security findings also flagged. All previously-reported critical/high fixes remain in place.

Critical — New

1. TriggerBackup endpoint lacks authorization: POST /ui/backups/trigger only requires RequireAuth. No project/org/role checks. Any authenticated user can trigger a tenant-wide data export with arbitrary time windows.

2. EnqueueBackupJobIfIdle writes identical hour_start/hour_end (impl.go:73-82) — SQL uses $1, $1 with the same now timestamp, creating a degenerate [now, now) window. Meanwhile, ProcessBackupJob uses NewExporter (config-derived interval), completely ignoring the job's stored window. The hour_start/hour_end columns are unused and misleading.

High — New

  • No rate limiting on /ui/backups/trigger — queue flooding risk
  • Unbounded time window — only validates start < end, no max duration or "not in the future"
  • The r.ContentLength > 0 gate (backup.go:43) skips chunked bodies: Go sets ContentLength to -1 for chunked requests, so their bodies are silently ignored
  • Events export missing deleted_at IS NULL (events/queries.sql:413-421) — soft-deleted events included in backups, unlike event_deliveries and delivery_attempts which filter correctly

Medium — New

  • CodeQL: APIKey sensitive data in cacheHash/Salt fields flow through cachedrepo.FetchOne to the cache backend. Consider stripping secrets from the cached representation.
  • FetchWithNotFound returns (nil, nil) for cached not-found instead of the original sentinel error — changes the error contract for callers expecting ErrAPIKeyNotFound
  • Manual backup silently no-ops when retention is disabled — returns 202 but worker does nothing
  • No durable audit trail for manual backups — no backup_jobs row written
  • Retry produces duplicate blob objects — keys include time.Now(), so retries create new objects

Low — New

  • NewExporterWithWindow doesn't validate start < end
  • NewExporter uses local time while blob keys use UTC
  • DeleteFilter doesn't invalidate cache
  • BackupProjectData swallowed-error issue is now gone (retention is just ret.Perform)

Previously Resolved (still good)

All fixes from reviews 1 and 2 remain intact: ClaimBackupJob ErrNoRows, flushedLSN atomic, shutdown flush context, slot lookup, streamLoop death signaling, OnPrem path traversal + context, nil check, CREATE PUBLICATION IF NOT EXISTS.



postgres:
image: postgres:15.2-alpine
image: postgres:18-alpine
Collaborator


@jirevwe, was this absolutely necessary? I don't have an issue with it. I am only worried about self-hosted upgrades and if there are any incompatibilities users might run into. If we can't do without it, then we'll need to have upgrade notes.

Also, if this is a prerequisite, I'm curious about its availability across cloud providers.

Comment thread sql/1774901839.sql
hour_start TIMESTAMPTZ NOT NULL,
hour_end TIMESTAMPTZ NOT NULL,
status VARCHAR NOT NULL DEFAULT 'pending',
worker_id VARCHAR,
Collaborator


nit: agent_id.
