Skip to content

gpt: write backup-first and fall back on Read#401

Open
eriknordmark wants to merge 1 commit into
diskfs:masterfrom
eriknordmark:gpt-crash-safety
Open

gpt: write backup-first and fall back on Read#401
eriknordmark wants to merge 1 commit into
diskfs:masterfrom
eriknordmark:gpt-crash-safety

Conversation

@eriknordmark
Copy link
Copy Markdown

@eriknordmark eriknordmark commented May 27, 2026

Motivation

Table.Write() writes the primary GPT before the backup, with no Sync() between writes. A power loss mid-write can leave the disk in a state where the primary header on LBA 1 references partition entries that don't yet match its CRC, while the backup at end-of-disk is still the old layout (or the inverse). The Linux/sgdisk convention is the opposite — write the backup first so that on a crash, the primary remains in its old-and-consistent state and bootloaders that prefer the primary continue to work.

Table.Read() validates primary header CRC and partition-entries CRC; on either failure it returns an error without consulting the backup. So a half-written primary makes the disk unreadable through go-diskfs even when the backup is intact.

This PR addresses both.

Changes

  1. Reorder Write(): protective MBR → backup entries → sync → backup header → sync → primary entries → sync → primary header → sync. All header/array bytes are serialized up front, so any encoding error aborts cleanly before touching the disk. The per-side intra-syncs hold the invariant "the primary on disk is either fully-old-and-consistent or fully-new-and-consistent at every crash point."

  2. syncWritable helper: best-effort Sync() via runtime type assertion (f.(interface{ Sync() error })). *os.File satisfies it; in-memory test backends don't and silently no-op. The backend.WritableFile interface contract is unchanged.

  3. Read() falls back to backup GPT on a primary content failure (bad header signature, header CRC, or partition-entries CRC). I/O errors propagate unchanged — the backup is read through the same I/O path and would fail the same way, so falling back there would just bury the original error. Disk size is discovered via Seek(0, io.SeekEnd) (works uniformly on image files and Linux block devices). The recovered Table has a new RecoveredFromBackup bool field set so the caller knows to rewrite the primary by calling Write() (or use an external tool like sgdisk --repair).

  4. Test update: the existing TestRead case that asserted "corrupted primary returns nil + error" now asserts "corrupted primary recovers from backup with RecoveredFromBackup=true". A new case asserts that primary AND backup both corrupt still returns an error.

Crash-safety analysis

After this change, an interrupted Write() leaves the disk in one of:

  • both copies still old (operation not yet observably started),
  • backup new, primary still old (boot uses primary, sees old layout — re-running converges),
  • backup new, primary inconsistent (Read CRC fails, falls back to backup — sees new layout),
  • both copies fully new (success).

In every case, a consumer that validates CRCs and falls back to the backup on primary failure will see a consistent partition layout. With the order before this PR, a crash between writes could leave the primary advertising a new layout while the backup was still old, or vice versa, with no guarantee of consistency between them.

Compatibility

  • Table struct gains one exported field (RecoveredFromBackup); existing fields and order unchanged.
  • backend.WritableFile interface unchanged.
  • Write() and Read() signatures unchanged.
  • All existing tests pass (verified go test ./... across the repo).

Signed-off-by: eriknordmark erik@zededa.com

@deitch
Copy link
Copy Markdown
Collaborator

deitch commented May 28, 2026

OK, this is pretty good (and always happy to get a PR from Erik).

Should Read() auto-rewrite the primary when it recovers from backup? Current behavior leaves it to the caller (since Read() takes a backend.File, not a backend.WritableFile — silently mutating the disk on what looks like a read call seemed surprising).

Definitely should not. This is a library. If your Read() call finds something amiss, a separate call should fix it.

*os.File.Sync() on a Linux block device issues fsync(2) which the kernel translates to a SCSI/SATA cache-flush. On devices with broken or disabled write caches this may be a no-op; for stricter durability a caller would follow with ioctl(BLKFLSBUF). That's out of scope here but worth noting.

I like the idea, but then you run into issues of running this on Linux vs Windows vs macOS vs BSD vs Solaris vs VAX (I might be exaggerating a bit). We have a few places in this library where we needed to do things cross-platform like ioctls, so if you want it, there might be some reference points. It also isn't bad if it does a "better" sync on one platform, and nothing on the other, because the other does not provide the right calls for it, as long as it is documented. is out of scope, in any case, so subject to a different discussion.

Copy link
Copy Markdown
Collaborator

@deitch deitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a request for a comment change, and some sanitation around extraneous function call. Other than that, looks pretty good.

Comment thread partition/gpt/table.go
// directly without a backup attempt, because the backup is read through
// the same I/O path and would fail the same way. Callers that want to
// retry with a different reader can detect this case by inspecting the
// returned error.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new comment is helpful, but it deleted some of the useful lines of the old comment, like what has to be passed and what it returns.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored (with some reformatting).

Comment thread partition/gpt/table.go Outdated
primaryHeaderOff := sectorBytes
secondaryHeaderOff := int64(t.secondaryHeader) * sectorBytes

writeAt := func(buf []byte, off int64, what string) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This writeAt() func is used 5 times, 4 of which immediately call syncAfter(), and the last one (protective MBR) can write it and it doesn't hurt. Is there any reason not to just do writeAtWithSync(), combine the two, and always call it? That seems cleaner (and conciser) to me.

Copy link
Copy Markdown
Author

@eriknordmark eriknordmark May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - fixed.

@deitch
Copy link
Copy Markdown
Collaborator

deitch commented May 28, 2026

I reviewed and added comments, and CI went through cleanly as well.

One thing I don't get. Why does it matter if we write primary and then backup, or backup and then primary? I get the "write-and-sync" pattern. But if we write the primary first and it fails, then backup is valid old and can be read; if we write the backup first and it fails, then primary still is valid old and can be read. Why does the write order matter?

@eriknordmark
Copy link
Copy Markdown
Author

I reviewed and added comments, and CI went through cleanly as well.

One thing I don't get. Why does it matter if we write primary and then backup, or backup and then primary? I get the "write-and-sync" pattern. But if we write the primary first and it fails, then backup is valid old and can be read; if we write the backup first and it fails, then primary still is valid old and can be read. Why does the write order matter?

The issue is that if you write the primary and then the power goes off, you'll be in a permanent state where the backup is stale/incorrect. (It might have incorrect CRC, or it could be correct but stale.) Nothing would indicate that you need to update the backup to match.

When you do it in the other order so that the primary is old (and with correct CRCs) and the backup is new/correct, then you'll notice in the indepotent retry logic that the change has not yet been done, and proceed to check and notice that the primary and backup to not match and ensure that the primary gets updated.

FWIW I'm told that sgdisk does it in the order proposed in this PR with the syncs inbetween to make sure that some storage controller doesn't reorder the writes.

@deitch
Copy link
Copy Markdown
Collaborator

deitch commented May 28, 2026

The issue is that if you write the primary and then the power goes off, you'll be in a permanent state where the backup is stale/incorrect. (It might have incorrect CRC, or it could be correct but stale.) Nothing would indicate that you need to update the backup to match.

  • If primary is bad and backup is good, relies on backup (and maybe should fix primary)
  • If primary is good and backup is bad, relies on primary (and should fix backup)

Why does the order matter? Bad primary + stale backup is no worse or better than stale primary + bad backup, as far as I can tell.

If sgdisk does it this way, then I am fine putting it in, but would like to understand why.

@eriknordmark
Copy link
Copy Markdown
Author

  • If primary is bad and backup is good, relies on backup (and maybe should fix primary)
  • If primary is good and backup is bad, relies on primary (and should fix backup)

What sgdisk's and this PRs ordering is trying to avoid is the second case where everything looks fine, but you still need to check and fix the backup.

If "bad" means "bad CRC", then you are also forgetting the two stale cases:

  • If primary is stale and backup is new, then X
  • If primary is new and backup is stale, then relies on primary.

In the X case one has to depend on the application to check that it didn't complete its work and retry after the power failure.
But the ordering in sgdisk and this PR avoids the second stale case.

Of course, there are also possible combinations of stale and bad (CRC errors) but that doesn't change the basic shape,

Make GPT writes survive power loss by writing the backup GPT at
end-of-disk before the primary at LBA 1 — and within each side,
writing the partition-entries array before the header sector that
references its CRC. fsync between every write so the kernel cannot
reorder writes across the consistency boundaries. A best-effort
syncWritable helper uses a runtime type assertion to call Sync()
on backends that support it (*os.File does; in-memory test backends
silently no-op), so the backend.WritableFile interface contract is
unchanged.

Make Read fall back to the backup GPT on a primary content failure
(bad header signature, header CRC, or partition-entries CRC). I/O
errors propagate unchanged, because the backup is read through the
same I/O path and would fail the same way. The recovered Table has
a new RecoveredFromBackup field set so the caller knows to rewrite
the primary.

The existing TestRead case that asserted "corrupted primary returns
nil + error" becomes "corrupted primary recovers from backup with
RecoveredFromBackup=true"; a new case asserts that primary AND
backup both corrupt still returns an error.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eriknordmark eriknordmark requested a review from deitch May 29, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants