fix(wal): fix concurrent corruption and torn-tail boot failure #58

Draft
pocky wants to merge 1 commit into main from fix/wal-concurrent-corruption

@pocky pocky commented Jun 13, 2026

Contributor

Summary

  • Fix WAL concurrent corruption: replace cached file descriptor with per-append O_APPEND re-opens guarded by flock(LOCK_EX), so concurrent writers from overlapping MCP sessions can no longer clobber each other's records
  • Fix torn-tail boot failure: instead of propagating an error on corrupt/partial WAL entries, replay now truncates the journal at the last good record and returns success, keeping the server bootable after a crash mid-write
  • Fix WAL archive reaping: cleanArchivedWals was matching journal.*.wal (which also matched journal.wal itself and journal.lock); update to match only journal-*.wal rotation archives
  • Bootstrap degrades gracefully on restore errors (except OOM) instead of bricking the boot path; CLAUDE.md content migrated to AGENTS.md for multi-agent compatibility
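
The per-append locking pattern from the first bullet can be sketched as follows. This is an illustrative Python sketch, not the Zig code in this PR: the function name `append_record` and the newline-terminated record format are assumptions, and the sketch opens `journal.lock` on each call for self-containment, whereas the PR keeps a stable `lock_file` handle.

```python
import fcntl
import os


def append_record(data_dir: str, record: bytes) -> None:
    """Append one newline-terminated record under an exclusive advisory lock."""
    lock_path = os.path.join(data_dir, "journal.lock")
    wal_path = os.path.join(data_dir, "journal.wal")

    # The lock lives on a separate, stable file so that rotating or replacing
    # journal.wal never changes what writers synchronize on.
    lock_fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)  # blocks until other writers finish

        # Re-open by path on every append: no cached descriptor can point at a
        # journal that another process has already rotated away.
        wal_fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            data = record + b"\n"
            while data:  # os.write may write fewer bytes than requested
                written = os.write(wal_fd, data)
                data = data[written:]
        finally:
            os.close(wal_fd)
    finally:
        os.close(lock_fd)  # closing the descriptor also releases the flock
```

Because `flock` is advisory, this only serializes writers that all take the same lock; a process that writes to `journal.wal` without locking is not stopped.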

Changes

Persistence — WAL

  • src/persistence/wal.zig: Replace cached file fd with a stable lock_file (advisory flock target); every append/replay/rotate re-opens journal.wal by path under flock(LOCK_EX). Add torn-tail recovery in replay: track good_bytes, truncate on first unparseable/unknown-op line. Use nanosecond-precision archive names to avoid same-second collisions. Update tests to match new semantics (concurrent writers, torn-tail truncation, rotate isolation)
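
The torn-tail recovery described above can be sketched roughly as below. This is an assumption-laden Python illustration rather than the Zig implementation: it assumes one JSON object per line with an `op` field, the op names in `KNOWN_OPS` are hypothetical, and treating an unterminated final line as torn is a choice made for this sketch. Only the `good_bytes` bookkeeping and the truncate-on-first-bad-line behavior come from the PR description.

```python
import json

KNOWN_OPS = {"remember", "forget", "update"}  # hypothetical op names


def replay(wal_path: str) -> list[dict]:
    """Return the good records and truncate the journal after the last one."""
    records = []
    good_bytes = 0  # byte offset just past the last record that parsed cleanly
    with open(wal_path, "r+b") as wal:
        for line in wal:
            if not line.endswith(b"\n"):
                break  # unterminated final line: writer crashed mid-append
            try:
                entry = json.loads(line)
            except ValueError:  # covers invalid JSON and invalid UTF-8
                break
            if not isinstance(entry, dict) or entry.get("op") not in KNOWN_OPS:
                break  # parseable but not a record we understand
            records.append(entry)
            good_bytes += len(line)
        # Drop everything after the last good record so the next append
        # starts on a clean line boundary.
        wal.truncate(good_bytes)
    return records
```

Replaying a second time after truncation should return the same records and leave the file unchanged, which is what keeps the boot path idempotent after a crash.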

Persistence — Manager

  • src/persistence/manager.zig: Fix cleanArchivedWals prefix from journal. to journal- and drop the redundant journal.wal guard (the prefix change already excludes it). Add test asserting active journal.wal and journal.lock survive a snapshot's reap pass
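
To illustrate why the `journal-` prefix is enough to protect the live files, here is a small Python sketch of a prefix-plus-suffix matcher. The helper names `is_rotation_archive` and `reapable` and the sample timestamp are hypothetical, and the real `cleanArchivedWals` may be structured differently.

```python
ARCHIVE_PREFIX = "journal-"
ARCHIVE_SUFFIX = ".wal"


def is_rotation_archive(name: str) -> bool:
    """True only for rotated archives such as journal-<ts>.wal."""
    return (
        name.startswith(ARCHIVE_PREFIX)
        and name.endswith(ARCHIVE_SUFFIX)
        # Require a non-empty timestamp part between prefix and suffix.
        and len(name) > len(ARCHIVE_PREFIX) + len(ARCHIVE_SUFFIX)
    )


def reapable(names: list[str]) -> list[str]:
    """Filter a directory listing down to the files a reap pass may delete."""
    return sorted(n for n in names if is_rotation_archive(n))


listing = ["journal.wal", "journal.lock", "journal-1718236800000000000.wal"]
print(reapable(listing))  # only the rotated archive is selected
```

Neither `journal.wal` nor `journal.lock` starts with `journal-`, so no separate guard for the active journal is needed.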

Bootstrap

  • src/cli/bootstrap.zig: Catch non-OOM errors from pm.restore() and log a warning instead of propagating, matching the degraded-boot contract of mount-manifest failures

Tool tests

  • src/tools/forget_fact.zig: Replace /dev/null fd-swap trick with directory-in-place-of-file approach to force O_WRONLY open failure, compatible with the per-append re-open design
  • src/tools/remember_fact.zig: Same fix as forget_fact.zig
  • src/tools/update_fact.zig: Same fix as forget_fact.zig
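
The directory-collision trick used by these tests can be sketched in Python as follows; the function name is hypothetical and the sketch is not the Zig test code. With a per-append re-open there is no long-lived descriptor to swap out, so the failure has to be provoked at the path: opening a directory for writing fails with `EISDIR` on Linux and macOS.

```python
import os
import tempfile


def append_fails_when_journal_is_directory() -> bool:
    """Place a directory at journal.wal and report whether a write-open fails."""
    with tempfile.TemporaryDirectory() as data_dir:
        wal_path = os.path.join(data_dir, "journal.wal")
        os.mkdir(wal_path)  # a directory now sits where the journal file belongs
        try:
            fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        except OSError:
            return True  # the open failed, so the tool's error path is exercised
        os.close(fd)
        return False
```

Unlike a permissions-based approach, this fails the same way when the test runs as root.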

Docs / Config

  • AGENTS.md: New file — project instructions extracted from CLAUDE.md for multi-agent runner compatibility
  • CLAUDE.md: Reduced to a single @AGENTS.md reference
  • README.md: Bump minimum Zig requirement to 0.16.0; simplify binary path in MCP config example; update docs URL

Test plan

  • make test passes with all inline WAL tests green, including the new concurrent-writer, torn-tail, and rotate-isolation tests
  • make functional-test completes without errors against the rebuilt binary
  • Manually corrupt journal.wal (append a truncated JSON fragment), restart the server, and verify it boots with the good records loaded and the torn tail removed
  • Run two zpm serve processes against the same data directory, issue concurrent remember_fact calls, then replay the WAL and confirm every record is present and parseable

Generated with awf commit workflow

- `AGENTS.md`: Add project instructions (extracted from CLAUDE.md)
- `CLAUDE.md`: Replace inline content with @AGENTS.md reference
- `README.md`: Update Zig requirement to 0.16.0 and docs URL
- `src/cli/bootstrap.zig`: Degrade on restore error instead of failing boot
- `src/persistence/manager.zig`: Fix archive reaping to match journal-<ts>.wal naming; add test
- `src/persistence/wal.zig`: Replace cached fd with per-op re-open + flock serialization; add torn-tail truncate recovery; add concurrent-writer and rotate-strand tests
- `src/tools/forget_fact.zig`: Update journal-write failure test to use directory collision
- `src/tools/remember_fact.zig`: Update journal-write failure test to use directory collision
- `src/tools/update_fact.zig`: Update journal-write failure test to use directory collision