
feat(openaudio): auto-tune Postgres memory and WAL defaults at container start #220

Open
RolfAris wants to merge 2 commits into OpenAudio:main from RolfAris:feat/audiusd-postgres-auto-tune-defaults

RolfAris commented May 2, 2026

Summary

The audiusd container ships stock Debian Postgres 15 defaults (shared_buffers = 128MB, work_mem = 4MB, effective_cache_size = 4GB), sized for a tiny dev VM rather than a validator host. This adds an entrypoint shim that picks a memory and WAL tier from detected host RAM and writes one drop-in conf file at container start.

Suggesting, not requiring. Happy to scope down or close. We're running an equivalent tuning on our 20-node fleet and it's helped meaningfully, so wanted to put it in front of the team.

Tier table

| Host RAM | shared_buffers | work_mem | maintenance_work_mem | effective_cache_size | wal_buffers | max_wal_size | min_wal_size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| < 2 GB | (skip, stock defaults) | | | | | | |
| 2 to 4 GB | 256 MB | 4 MB | 128 MB | 1 GB | 8 MB | 1 GB | 256 MB |
| 4 to 8 GB | 1 GB | 8 MB | 256 MB | 3 GB | 16 MB | 2 GB | 512 MB |
| 8 to 16 GB | 2 GB | 16 MB | 512 MB | 6 GB | 16 MB | 2 GB | 1 GB |
| 16 to 32 GB | 4 GB | 32 MB | 1 GB | 12 GB | 16 MB | 4 GB | 1 GB |
| 32 to 64 GB | 8 GB | 32 MB | 2 GB | 24 GB | 16 MB | 8 GB | 2 GB |
| 64 GB and up | 16 GB | 32 MB | 2 GB | 48 GB | 16 MB | 16 GB | 2 GB |

Sizing rules: shared_buffers near 25% of RAM (capped to leave headroom for the audiusd Go process, observed at roughly 7 GB RSS on a busy validator). effective_cache_size near 50% (more conservative than pgtune's 75% since Postgres is not the only tenant in this container). wal_buffers capped at 16 MB per the Postgres docs. work_mem modest because audiusd's observed concurrency is roughly 8 connections, not 100.
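
For illustration, the tier table can be expressed as a small lookup. This is a minimal sketch, not the shim's actual code: the function names pick_tier and render_dropin are hypothetical, and the lower-inclusive boundaries (4096 MB lands in the 4 to 8 GB tier) are an assumption inferred from the boundary values listed in the test description.

```shell
# Hypothetical helper: map detected RAM (in MB) to a row of the tier table.
# Prints "skip" below the 2 GB floor, otherwise seven values in the order:
# shared_buffers work_mem maintenance_work_mem effective_cache_size
# wal_buffers max_wal_size min_wal_size
pick_tier() {
    ram_mb=$1
    if   [ "$ram_mb" -lt 2048 ];  then echo "skip"
    elif [ "$ram_mb" -lt 4096 ];  then echo "256MB 4MB 128MB 1GB 8MB 1GB 256MB"
    elif [ "$ram_mb" -lt 8192 ];  then echo "1GB 8MB 256MB 3GB 16MB 2GB 512MB"
    elif [ "$ram_mb" -lt 16384 ]; then echo "2GB 16MB 512MB 6GB 16MB 2GB 1GB"
    elif [ "$ram_mb" -lt 32768 ]; then echo "4GB 32MB 1GB 12GB 16MB 4GB 1GB"
    elif [ "$ram_mb" -lt 65536 ]; then echo "8GB 32MB 2GB 24GB 16MB 8GB 2GB"
    else                               echo "16GB 32MB 2GB 48GB 16MB 16GB 2GB"
    fi
}

# Hypothetical helper: render the drop-in conf body for a given RAM size.
# Returns 1 and prints nothing when the host is below the floor.
render_dropin() {
    tier=$(pick_tier "$1")
    [ "$tier" = "skip" ] && return 1
    # Word-split the tier string into positional parameters on purpose.
    # shellcheck disable=SC2086
    set -- $tier
    printf 'shared_buffers = %s\nwork_mem = %s\nmaintenance_work_mem = %s\neffective_cache_size = %s\nwal_buffers = %s\nmax_wal_size = %s\nmin_wal_size = %s\n' "$@"
}

# A 24 GB host falls in the 16 to 32 GB tier.
render_dropin 24576
```

Keeping the table as one literal string per tier makes it easy to diff against the PR description when a value changes.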

Conservative-by-default behavior

The shim skips with stock defaults whenever it cannot prove safety:

  1. Operator already tuned postgresql.conf. If shared_buffers, work_mem, maintenance_work_mem, effective_cache_size, wal_buffers, max_wal_size, or min_wal_size is set uncommented in postgresql.conf, the shim skips with a log line. Operator wins.
  2. Operator already has an include_dir directive. Any include_dir line other than the shim's own include_dir = 'conf.d' (whether active, commented out, pointing at a different directory, or quoted differently) makes the shim skip rather than risk last-occurrence-wins overriding the operator's directory.
  3. Non-root execution context. If the shim runs as a uid that is not root and not the postgres user, it skips. The chown step would otherwise silently fail and leave a conf file postgres cannot read.
  4. Postgres rejects the rendered conf. A postgres -C shared_buffers --config-file=... preflight runs after rendering. If Postgres refuses to parse the conf, the shim removes the rendered file and exits.
  5. Any I/O failure. Every error path exits 0 and leaves stock defaults in place.
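
Checks 1 and 2 come down to pattern matching on postgresql.conf. The sketch below shows one way that could look; it is an assumption about the approach, not the shim's code. The names TUNED_PARAMS, operator_has_tuned, and foreign_include_dir are hypothetical, and the regexes only model the simple "name = value" and "name value" forms.

```shell
# Hypothetical list of the parameters the shim manages.
TUNED_PARAMS='shared_buffers|work_mem|maintenance_work_mem|effective_cache_size|wal_buffers|max_wal_size|min_wal_size'

# Succeeds if any managed parameter is set on an uncommented line.
# Commented lines ("# shared_buffers = ...") do not match because the
# parameter name must be the first non-blank token on the line.
operator_has_tuned() {
    grep -Eiq "^[[:space:]]*(${TUNED_PARAMS})([[:space:]]*=|[[:space:]]+)" "$1"
}

# Succeeds if the file mentions include_dir (active or commented out) in any
# form other than the exact active line: include_dir = 'conf.d'
foreign_include_dir() {
    grep -E "^[[:space:]]*#?[[:space:]]*include_dir([[:space:]]|=)" "$1" \
        | grep -Evq "^[[:space:]]*include_dir[[:space:]]*=[[:space:]]*'conf\.d'[[:space:]]*$"
}
```

A caller would skip tuning when either function succeeds, for example `operator_has_tuned "$conf" || foreign_include_dir "$conf"`, and log which check fired.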

Override knobs

# Disable entirely
docker run -e AUDIUSD_DISABLE_AUTO_TUNE=1 ...

# Per-setting override (later wins by include order within conf.d)
echo "shared_buffers = 8GB" > "$DATA/conf.d/99-operator.conf"

# Or via SQL after connect (postgresql.auto.conf is processed last)
ALTER SYSTEM SET shared_buffers = '8GB';

Precedence (later wins):

  1. postgresql.conf top-of-file
  2. conf.d/00-audiusd-defaults.conf (this shim, written conditional on the conservative checks above)
  3. conf.d/99-*.conf (operator override slot)
  4. postgresql.auto.conf (ALTER SYSTEM, processed last by Postgres regardless of position)
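
The precedence list can be modeled as "concatenate the sources in read order and take the last assignment". The sketch below is a simplified model for reasoning about overrides, not Postgres's real parser: effective_setting is a hypothetical name, it assumes include_dir = 'conf.d' is the last line of postgresql.conf (as it is when appended), and it ignores quoting edge cases. On a live server, the pg_file_settings view is the authoritative source.

```shell
# effective_setting NAME DATADIR
# Print the value that wins for NAME under the precedence list above.
effective_setting() {
    name=$1 datadir=$2
    {
        cat "$datadir/postgresql.conf"
        # include_dir files are read in file-name order; only *.conf counts.
        for f in "$datadir"/conf.d/*.conf; do
            [ -f "$f" ] && cat "$f"
        done
        # postgresql.auto.conf (ALTER SYSTEM) is read last.
        [ -f "$datadir/postgresql.auto.conf" ] && cat "$datadir/postgresql.auto.conf"
    } 2>/dev/null |
        # Keep the value of each uncommented assignment, dropping trailing comments.
        sed -n "s/^[[:space:]]*${name}[[:space:]]*=[[:space:]]*\([^#]*\).*/\1/p" |
        # Trim trailing blanks and one layer of single quotes.
        sed "s/[[:space:]]*\$//; s/^'\(.*\)'\$/\1/" |
        # Last occurrence wins.
        tail -n 1
}
```

For example, `effective_setting shared_buffers "$DATA"` would print 8GB once conf.d/99-operator.conf from the override example exists, regardless of what 00-audiusd-defaults.conf contains.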

Evidence: controlled before/after on one of our nodes

Same node (24 GB host, 144 GB DB, 38 GB indexes), 20-min steady-state windows on each side, reset stats between. Only shared_buffers, wal_buffers, max_wal_size, and min_wal_size were changed (the restart-required group). The other tier values were already in place via ALTER SYSTEM on that node.

| Metric | Before: stock 128 MB shared_buffers | After: 4 GB shared_buffers (16-32 GB tier) |
| --- | --- | --- |
| pg_stat_bgwriter.buffers_alloc rate | 23,068 / sec | 65 / sec (-99.7%) |
| pg_stat_bgwriter.buffers_backend (window) | 1,247,356 | 11,591 (-99.1%) |
| pg_stat_bgwriter.buffers_checkpoint (window) | 91 | 35,169 (planned writes replace emergency backend writes) |
| Buffer hit ratio | 10.97% (depressed by cold EXPLAINs in the window) | 83.05% |

The shape of buffer accounting changed. Before: backends doing emergency dirty-page writes (1.25M of them) because shared_buffers was exhausted. After: the checkpointer does planned batched writes on schedule (35k). The 99% drop in buffers_backend is the strongest signal that shared_buffers was undersized.

Representative heavy query, SELECT count(*) FROM ops WHERE "table" = 'uploads' (full table scan over a 50 GB table):

| Metric | Before | After |
| --- | --- | --- |
| Execution time | 52,481 ms | 41,034 ms (-22%) |
| Buffers dirtied during the query | 1,180,495 | 5,623 (-99.5%) |
| Buffers written during the query | 1,180,370 | 5,474 (-99.5%) |
| Buffers read from disk | 6,300,680 | 6,304,652 (unchanged; table doesn't fit in 4 GB either) |
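
The percentages and the table size quoted here can be re-derived from the raw counters. The helpers below are a small sketch for checking the arithmetic; pct_change and buffers_to_gib are hypothetical names, and the conversion assumes the default 8 kB Postgres block size.

```shell
# pct_change BEFORE AFTER: signed percentage change, one decimal place.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# buffers_to_gib COUNT: size of COUNT buffers in GiB,
# assuming the default 8 kB (8192-byte) block size.
buffers_to_gib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 8192 / (1024 * 1024 * 1024) }'
}

pct_change 52481 41034      # execution time:  -21.8
pct_change 1180495 5623     # buffers dirtied: -99.5
buffers_to_gib 6300680      # buffers read:    48.1
```

The last figure, about 48 GiB read from disk, is consistent with a full scan of a table of roughly 50 GB.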

Light queries (e.g. SELECT * FROM core_blocks ORDER BY created_at DESC LIMIT 100) dropped from between 9 and 14 disk reads to 0, because index pages stay resident in the bigger pool. Latency is sub-millisecond either way, but the disk-read delta is the durable signal.

Restart cost: roughly 3 seconds Postgres unavailability. The audiusd Go process kept running and reconnected; no application errors observed.

Reproduce on any node:

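-- Reset counters while connected to the openaudio database.
SELECT pg_stat_reset_shared('bgwriter');
SELECT pg_stat_reset();
-- Wait 20 min steady-state, then capture (NULLIF avoids division by zero on an idle database):
SELECT round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS hit_ratio_pct
  FROM pg_stat_database WHERE datname = 'openaudio';
-- buffers_checkpoint is included so the output lines up with the before/after table.
SELECT buffers_alloc, buffers_backend, buffers_checkpoint, checkpoints_req
  FROM pg_stat_bgwriter;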

Determinism and consensus safety

A tuning that changes plan choice (work_mem, effective_cache_size) could in principle affect consensus state if any state-applied query relied on default plan ordering. We audited the currently visible ORDER-sensitive paths:

  • All :many ... LIMIT queries in pkg/core/db/sql/reads.sql have explicit ORDER BY.
  • All :one ... LIMIT 1 queries filter (WHERE) on a unique column, so at most one row can match.
  • Unordered queries (GetAllRegisteredNodes, GetAllEthAddressesOfRegisteredNodes, GetActiveStorageNodeEndpoints) feed only common.GetAttestorRendezvous, which sorts internally by hash. The output is order-independent.
  • The CRDT ops sweep is Order("ulid asc") server-side (pkg/mediorum/server/serve_crud.go:33).

We did not find a path where a plan flip could change consensus state. This is an audit, not an executable guard. Adding a deterministic-order assertion test would harden this further; happy to do that if it would help review.

Caveats for operators

  1. shared_buffers is restart-required. On image upgrade, docker compose up -d recreates the container and Postgres starts with the new value. On hosts with very tight free RAM at upgrade time, the larger allocation may cause Postgres start to fail. Workaround: set AUDIUSD_DISABLE_AUTO_TUNE=1 before upgrading, or override via conf.d/99-*.conf.
  2. /dev/shm size on big tiers. Postgres 15's parallel workers use /dev/shm for dynamic shared memory, and Docker defaults /dev/shm to 64 MB. On the 64 GB and up tier (16 GB shared_buffers), parallel queries with many workers can fail with "could not resize shared memory segment". Operators on big hosts should pass --shm-size=2g or larger.
  3. Non-root container runtimes. k8s securityContext.runAsUser, rootless docker, or podman with --user make the in-container chown postgres:postgres no-op. The shim detects this and skips. Operators in those modes will see stock defaults, which is the existing behavior.
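
For caveat 2, an operator can check the /dev/shm size inside the container before relying on a large tier. This is a rough sketch and not part of the shim: shm_size_mb and check_shm are hypothetical names, and the 2048 MB default threshold simply mirrors the --shm-size=2g suggestion above.

```shell
# shm_size_mb [PATH]: size in MB of the filesystem backing PATH
# (defaults to /dev/shm). Uses POSIX df output in 1024-byte blocks.
shm_size_mb() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
}

# check_shm SIZE_MB [MIN_MB]: warn and return 1 when SIZE_MB is below MIN_MB.
check_shm() {
    if [ "$1" -lt "${2:-2048}" ]; then
        echo "warning: /dev/shm is ${1} MB; consider --shm-size=2g or larger"
        return 1
    fi
    return 0
}
```

Inside a container started with Docker's default settings, `check_shm "$(shm_size_mb)"` would be expected to print the warning, since the default /dev/shm is 64 MB.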

Out of scope

random_page_cost, effective_io_concurrency (assume SSD), synchronous_commit, wal_compression, checkpoint_*, max_connections. Memory and WAL sizing only.

Test plan

  • bash cmd/openaudio/postgres-auto-tune_test.sh, 161 assertions, lint-clean (shellcheck), covering:
      • every tier at midpoint and at both boundary edges (one inside the tier, one just below it)
      • sub-floor
      • idempotency across re-runs
      • AUDIUSD_DISABLE_AUTO_TUNE=1 short-circuit
      • =true and =0 correctly NOT honored (canonical form is =1)
      • operator-tuned postgresql.conf skip
      • commented-tuning does NOT trigger skip
      • foreign include_dir skip (alternate dir, double-quoted, commented)
      • existing include_dir = 'conf.d' recognized as ours
      • well-formed postgresql.conf after atomic append
      • orphan tmp file cleanup
      • tier log line
  • CI builds the image
  • Fresh-init container, Postgres starts with shim-applied defaults
  • Existing-data-dir, include_dir = 'conf.d' appended once via atomic temp+rename, drop-in renders, restart picks up new shared_buffers
  • AUDIUSD_DISABLE_AUTO_TUNE=1, no conf.d directory, stock defaults
  • docker run -m 1G (cgroup-limited container), shim detects via cgroup, sub-2GB skip, stock defaults
  • Pre-existing postgresql.conf with hand-tuned shared_buffers, shim detects and skips with log line
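
The "appended once via atomic temp+rename" behavior in the test plan can be sketched as follows. This is an illustration of the technique under stated assumptions, not the shim's code: append_include_dir is a hypothetical name, and a real implementation would also need to restore the original file's owner and mode, since mktemp creates the temp file with restrictive permissions.

```shell
# append_include_dir CONF
# Append include_dir = 'conf.d' to CONF exactly once. The new content is
# written to a temp file and renamed over CONF, so a reader never sees a
# half-written file.
append_include_dir() {
    conf=$1
    line="include_dir = 'conf.d'"
    # Already present as an exact line: nothing to do (idempotent re-runs).
    grep -Fxq "$line" "$conf" && return 0
    # Temp file lives in the same directory so mv is a rename on one
    # filesystem rather than a copy.
    tmp=$(mktemp "$(dirname "$conf")/.postgresql.conf.XXXXXX") || return 1
    if {
        cat "$conf" &&
        # Add a newline first if the last line of CONF lacks one.
        { [ -z "$(tail -c 1 "$conf")" ] || printf '\n'; } &&
        printf '%s\n' "$line"
    } > "$tmp" && mv "$tmp" "$conf"; then
        return 0
    fi
    # Any failure: remove the orphan temp file and leave CONF untouched.
    rm -f "$tmp"
    return 1
}
```

Calling the function twice on the same file leaves a single include_dir line, which matches the idempotency case in the test list.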

RolfAris added 2 commits May 2, 2026 20:34
…ner start

The audiusd container ships stock Debian Postgres 15 defaults
(shared_buffers=128MB, work_mem=4MB, effective_cache_size=4GB) which
are sized for a tiny dev VM rather than a validator host. Adds an
entrypoint shim that picks a memory and WAL tier from detected host
RAM and writes a single drop-in conf at $POSTGRES_DATA_DIR/conf.d/.

Conservative-by-default: skips with stock defaults when postgresql.conf
already has any of the tuned parameters set, when any include_dir
directive is already present (active, commented, or pointing at a
different dir), when running as a non-root non-postgres uid, when
postgres -C preflight rejects the rendered conf, or on any I/O failure.

Disable with AUDIUSD_DISABLE_AUTO_TUNE=1 or override via conf.d/99-*.conf
or ALTER SYSTEM. Atomic writes (mktemp + rename) on both the tune file
and postgresql.conf. Cgroup-aware memory detection (v2 then v1 then
/proc/meminfo).

Tested: 161 assertions covering every tier midpoint, every boundary
value (2048, 4095, 4096, 8191, 8192, 16383, 16384, 32767, 32768, 65535,
65536), sub-floor cases, disable variants, operator-tuning detection,
foreign include_dir detection, atomic append well-formedness, orphan
tmp cleanup, and the tier log line. shellcheck clean.
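
The cgroup-aware detection order described in this commit message (v2, then v1, then /proc/meminfo) could look roughly like the sketch below. It is an assumption about the approach rather than the committed code: detect_ram_mb is a hypothetical name, the optional ROOT argument exists only so the function can be pointed at a fake directory tree, and the file paths are the conventional cgroup mount points, which may differ on some hosts.

```shell
# detect_ram_mb [ROOT]
# Print usable RAM in MB: the cgroup memory limit when one is set and is
# lower than host RAM, otherwise MemTotal from /proc/meminfo.
detect_ram_mb() {
    root=${1:-}
    v2="$root/sys/fs/cgroup/memory.max"
    v1="$root/sys/fs/cgroup/memory/memory.limit_in_bytes"
    host_kb=$(awk '/^MemTotal:/ { print $2 }' "$root/proc/meminfo")
    limit=""
    if [ -r "$v2" ]; then
        limit=$(cat "$v2")       # cgroup v2: a byte count or the word "max"
    elif [ -r "$v1" ]; then
        limit=$(cat "$v1")       # cgroup v1: a byte count, huge when unlimited
    fi
    # Do the comparison in awk so very large v1 values cannot overflow
    # shell integer arithmetic.
    awk -v lim="$limit" -v kb="$host_kb" 'BEGIN {
        host = kb * 1024
        # Non-numeric ("max", empty) or above host RAM means no effective limit.
        if (lim !~ /^[0-9]+$/ || lim + 0 > host) lim = host
        printf "%d\n", lim / (1024 * 1024)
    }'
}
```

With `docker run -m 1G`, a function like this would be expected to report 1024, which falls below the 2 GB floor and matches the sub-2GB skip case in the test plan.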