gpt: write backup-first and fall back on Read by eriknordmark · Pull Request #401 · diskfs/go-diskfs

eriknordmark · 2026-05-27T20:16:55Z

Motivation

Table.Write() writes the primary GPT before the backup, with no Sync() between writes. A power loss mid-write can leave the disk in a state where the primary header on LBA 1 references partition entries that don't yet match its CRC, while the backup at end-of-disk is still the old layout (or the inverse). The Linux/sgdisk convention is the opposite — write the backup first so that on a crash, the primary remains in its old-and-consistent state and bootloaders that prefer the primary continue to work.

Table.Read() validates primary header CRC and partition-entries CRC; on either failure it returns an error without consulting the backup. So a half-written primary makes the disk unreadable through go-diskfs even when the backup is intact.

This PR addresses both.

Changes

Reorder Write(): protective MBR → backup entries → sync → backup header → sync → primary entries → sync → primary header → sync. All header/array bytes are serialized up front, so any encoding error aborts cleanly before touching the disk. The per-side intra-syncs hold the invariant "the primary on disk is either fully-old-and-consistent or fully-new-and-consistent at every crash point."
syncWritable helper: best-effort Sync() via runtime type assertion (f.(interface{ Sync() error })). *os.File satisfies it; in-memory test backends don't and silently no-op. The backend.WritableFile interface contract is unchanged.
Read() falls back to backup GPT on a primary content failure (bad header signature, header CRC, or partition-entries CRC). I/O errors propagate unchanged — the backup is read through the same I/O path and would fail the same way, so falling back there would just bury the original error. Disk size is discovered via Seek(0, io.SeekEnd) (works uniformly on image files and Linux block devices). The recovered Table has a new RecoveredFromBackup bool field set so the caller knows to rewrite the primary by calling Write() (or use an external tool like sgdisk --repair).
Test update: the existing TestRead case that asserted "corrupted primary returns nil + error" now asserts "corrupted primary recovers from backup with RecoveredFromBackup=true". A new case asserts that primary AND backup both corrupt still returns an error.

Crash-safety analysis

After this change, an interrupted Write() leaves the disk in one of:

both copies still old (operation not yet observably started),
backup new, primary still old (boot uses primary, sees old layout — re-running converges),
backup new, primary inconsistent (Read CRC fails, falls back to backup — sees new layout),
both copies fully new (success).

In every case, a consumer that validates CRCs and falls back to the backup on primary failure will see a consistent partition layout. With the order before this PR, a crash between writes could leave the primary advertising a new layout while the backup was still old, or vice versa, with no guarantee of consistency between them.

Compatibility

Table struct gains one exported field (RecoveredFromBackup); existing fields and order unchanged.
backend.WritableFile interface unchanged.
Write() and Read() signatures unchanged.
All existing tests pass (verified go test ./... across the repo).

Signed-off-by: eriknordmark erik@zededa.com

deitch · 2026-05-28T11:41:53Z

OK, this is pretty good (and always happy to get a PR from Erik).

Should Read() auto-rewrite the primary when it recovers from backup? Current behavior leaves it to the caller (since Read() takes a backend.File, not a backend.WritableFile — silently mutating the disk on what looks like a read call seemed surprising).

Definitely should not. This is a library. If your Read() call finds something amiss, a separate call should fix it.

*os.File.Sync() on a Linux block device issues fsync(2) which the kernel translates to a SCSI/SATA cache-flush. On devices with broken or disabled write caches this may be a no-op; for stricter durability a caller would follow with ioctl(BLKFLSBUF). That's out of scope here but worth noting.

I like the idea, but then you run into issues of running this on Linux vs Windows vs macOS vs BSD vs Solaris vs VAX (I might be exaggerating a bit). We have a few places in this library where we needed to do things cross-platform like ioctls, so if you want it, there might be some reference points. It also isn't bad if it does a "better" sync on one platform, and nothing on the other, because the other does not provide the right calls for it, as long as it is documented. is out of scope, in any case, so subject to a different discussion.

deitch

I have a request for a comment change, and some sanitation around extraneous function call. Other than that, looks pretty good.

deitch · 2026-05-28T11:49:40Z

+// directly without a backup attempt, because the backup is read through
+// the same I/O path and would fail the same way. Callers that want to
+// retry with a different reader can detect this case by inspecting the
+// returned error.


The new comment is helpful, but it deleted some of the useful lines of the old comment, like what has to be passed and what it returns.

Restored (with some reformatting).

deitch · 2026-05-28T11:58:40Z

+	primaryHeaderOff := sectorBytes
+	secondaryHeaderOff := int64(t.secondaryHeader) * sectorBytes
+
+	writeAt := func(buf []byte, off int64, what string) error {


This writeAt() func is used 5 times, 4 of which immediately call syncAfter(), and the last one (protective MBR) can write it and it doesn't hurt. Is there any reason not to just do writeAtWithSync(), combine the two, and always call it? That seems cleaner (and conciser) to me.

OK - fixed.

deitch · 2026-05-28T12:05:32Z

I reviewed and added comments, and CI went through cleanly as well.

One thing I don't get. Why does it matter if we write primary and then backup, or backup and then primary? I get the "write-and-sync" pattern. But if we write the primary first and it fails, then backup is valid old and can be read; if we write the backup first and it fails, then primary still is valid old and can be read. Why does the write order matter?

eriknordmark · 2026-05-28T15:53:57Z

I reviewed and added comments, and CI went through cleanly as well.

One thing I don't get. Why does it matter if we write primary and then backup, or backup and then primary? I get the "write-and-sync" pattern. But if we write the primary first and it fails, then backup is valid old and can be read; if we write the backup first and it fails, then primary still is valid old and can be read. Why does the write order matter?

The issue is that if you write the primary and then the power goes off, you'll be in a permanent state where the backup is stale/incorrect. (It might have incorrect CRC, or it could be correct but stale.) Nothing would indicate that you need to update the backup to match.

When you do it in the other order so that the primary is old (and with correct CRCs) and the backup is new/correct, then you'll notice in the indepotent retry logic that the change has not yet been done, and proceed to check and notice that the primary and backup to not match and ensure that the primary gets updated.

FWIW I'm told that sgdisk does it in the order proposed in this PR with the syncs inbetween to make sure that some storage controller doesn't reorder the writes.

deitch · 2026-05-28T18:08:47Z

The issue is that if you write the primary and then the power goes off, you'll be in a permanent state where the backup is stale/incorrect. (It might have incorrect CRC, or it could be correct but stale.) Nothing would indicate that you need to update the backup to match.

If primary is bad and backup is good, relies on backup (and maybe should fix primary)
If primary is good and backup is bad, relies on primary (and should fix backup)

Why does the order matter? Bad primary + stale backup is no worse or better than stale primary + bad backup, as far as I can tell.

If sgdisk does it this way, then I am fine putting it in, but would like to understand why.

eriknordmark · 2026-05-29T15:22:25Z

If primary is bad and backup is good, relies on backup (and maybe should fix primary)

If primary is good and backup is bad, relies on primary (and should fix backup)

What sgdisk's and this PRs ordering is trying to avoid is the second case where everything looks fine, but you still need to check and fix the backup.

If "bad" means "bad CRC", then you are also forgetting the two stale cases:

If primary is stale and backup is new, then X
If primary is new and backup is stale, then relies on primary.

In the X case one has to depend on the application to check that it didn't complete its work and retry after the power failure.
But the ordering in sgdisk and this PR avoids the second stale case.

Of course, there are also possible combinations of stale and bad (CRC errors) but that doesn't change the basic shape,

Make GPT writes survive power loss by writing the backup GPT at end-of-disk before the primary at LBA 1 — and within each side, writing the partition-entries array before the header sector that references its CRC. fsync between every write so the kernel cannot reorder writes across the consistency boundaries. A best-effort syncWritable helper uses a runtime type assertion to call Sync() on backends that support it (*os.File does; in-memory test backends silently no-op), so the backend.WritableFile interface contract is unchanged. Make Read fall back to the backup GPT on a primary content failure (bad header signature, header CRC, or partition-entries CRC). I/O errors propagate unchanged, because the backup is read through the same I/O path and would fail the same way. The recovered Table has a new RecoveredFromBackup field set so the caller knows to rewrite the primary. The existing TestRead case that asserted "corrupted primary returns nil + error" becomes "corrupted primary recovers from backup with RecoveredFromBackup=true"; a new case asserts that primary AND backup both corrupt still returns an error. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eriknordmark mentioned this pull request May 28, 2026

squashfs: honor partition offset in Read and Finalize #404

Open

deitch requested changes May 28, 2026

View reviewed changes

eriknordmark force-pushed the gpt-crash-safety branch from 23885fd to 0d92178 Compare May 29, 2026 15:48

eriknordmark requested a review from deitch May 29, 2026 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpt: write backup-first and fall back on Read#401

gpt: write backup-first and fall back on Read#401
eriknordmark wants to merge 1 commit into
diskfs:masterfrom
eriknordmark:gpt-crash-safety

eriknordmark commented May 27, 2026 •

edited

Loading

Uh oh!

deitch commented May 28, 2026

Uh oh!

deitch left a comment

Uh oh!

deitch May 28, 2026

Uh oh!

eriknordmark May 29, 2026

Uh oh!

deitch May 28, 2026

Uh oh!

eriknordmark May 29, 2026 •

edited

Loading

Uh oh!

deitch commented May 28, 2026

Uh oh!

eriknordmark commented May 28, 2026

Uh oh!

deitch commented May 28, 2026

Uh oh!

eriknordmark commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eriknordmark commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Crash-safety analysis

Compatibility

Uh oh!

deitch commented May 28, 2026

Uh oh!

deitch left a comment

Choose a reason for hiding this comment

Uh oh!

deitch May 28, 2026

Choose a reason for hiding this comment

Uh oh!

eriknordmark May 29, 2026

Choose a reason for hiding this comment

Uh oh!

deitch May 28, 2026

Choose a reason for hiding this comment

Uh oh!

eriknordmark May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deitch commented May 28, 2026

Uh oh!

eriknordmark commented May 28, 2026

Uh oh!

deitch commented May 28, 2026

Uh oh!

eriknordmark commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eriknordmark commented May 27, 2026 •

edited

Loading

eriknordmark May 29, 2026 •

edited

Loading