gpt: write backup-first and fall back on Read#401
Conversation
|
OK, this is pretty good (and always happy to get a PR from Erik).
Definitely should not. This is a library. If your
I like the idea, but then you run into issues of running this on Linux vs Windows vs macOS vs BSD vs Solaris vs VAX (I might be exaggerating a bit). We have a few places in this library where we needed to do things cross-platform like ioctls, so if you want it, there might be some reference points. It also isn't bad if it does a "better" sync on one platform, and nothing on the other, because the other does not provide the right calls for it, as long as it is documented. is out of scope, in any case, so subject to a different discussion. |
deitch
left a comment
There was a problem hiding this comment.
I have a request for a comment change, and some sanitation around extraneous function call. Other than that, looks pretty good.
| // directly without a backup attempt, because the backup is read through | ||
| // the same I/O path and would fail the same way. Callers that want to | ||
| // retry with a different reader can detect this case by inspecting the | ||
| // returned error. |
There was a problem hiding this comment.
The new comment is helpful, but it deleted some of the useful lines of the old comment, like what has to be passed and what it returns.
There was a problem hiding this comment.
Restored (with some reformatting).
| primaryHeaderOff := sectorBytes | ||
| secondaryHeaderOff := int64(t.secondaryHeader) * sectorBytes | ||
|
|
||
| writeAt := func(buf []byte, off int64, what string) error { |
There was a problem hiding this comment.
This writeAt() func is used 5 times, 4 of which immediately call syncAfter(), and the last one (protective MBR) can write it and it doesn't hurt. Is there any reason not to just do writeAtWithSync(), combine the two, and always call it? That seems cleaner (and conciser) to me.
|
I reviewed and added comments, and CI went through cleanly as well. One thing I don't get. Why does it matter if we write primary and then backup, or backup and then primary? I get the "write-and-sync" pattern. But if we write the primary first and it fails, then backup is valid old and can be read; if we write the backup first and it fails, then primary still is valid old and can be read. Why does the write order matter? |
The issue is that if you write the primary and then the power goes off, you'll be in a permanent state where the backup is stale/incorrect. (It might have incorrect CRC, or it could be correct but stale.) Nothing would indicate that you need to update the backup to match. When you do it in the other order so that the primary is old (and with correct CRCs) and the backup is new/correct, then you'll notice in the indepotent retry logic that the change has not yet been done, and proceed to check and notice that the primary and backup to not match and ensure that the primary gets updated. FWIW I'm told that sgdisk does it in the order proposed in this PR with the syncs inbetween to make sure that some storage controller doesn't reorder the writes. |
Why does the order matter? Bad primary + stale backup is no worse or better than stale primary + bad backup, as far as I can tell. If sgdisk does it this way, then I am fine putting it in, but would like to understand why. |
What sgdisk's and this PRs ordering is trying to avoid is the second case where everything looks fine, but you still need to check and fix the backup. If "bad" means "bad CRC", then you are also forgetting the two stale cases:
In the X case one has to depend on the application to check that it didn't complete its work and retry after the power failure. Of course, there are also possible combinations of stale and bad (CRC errors) but that doesn't change the basic shape, |
Make GPT writes survive power loss by writing the backup GPT at end-of-disk before the primary at LBA 1 — and within each side, writing the partition-entries array before the header sector that references its CRC. fsync between every write so the kernel cannot reorder writes across the consistency boundaries. A best-effort syncWritable helper uses a runtime type assertion to call Sync() on backends that support it (*os.File does; in-memory test backends silently no-op), so the backend.WritableFile interface contract is unchanged. Make Read fall back to the backup GPT on a primary content failure (bad header signature, header CRC, or partition-entries CRC). I/O errors propagate unchanged, because the backup is read through the same I/O path and would fail the same way. The recovered Table has a new RecoveredFromBackup field set so the caller knows to rewrite the primary. The existing TestRead case that asserted "corrupted primary returns nil + error" becomes "corrupted primary recovers from backup with RecoveredFromBackup=true"; a new case asserts that primary AND backup both corrupt still returns an error. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
23885fd to
0d92178
Compare
Motivation
Table.Write()writes the primary GPT before the backup, with noSync()between writes. A power loss mid-write can leave the disk in a state where the primary header on LBA 1 references partition entries that don't yet match its CRC, while the backup at end-of-disk is still the old layout (or the inverse). The Linux/sgdiskconvention is the opposite — write the backup first so that on a crash, the primary remains in its old-and-consistent state and bootloaders that prefer the primary continue to work.Table.Read()validates primary header CRC and partition-entries CRC; on either failure it returns an error without consulting the backup. So a half-written primary makes the disk unreadable through go-diskfs even when the backup is intact.This PR addresses both.
Changes
Reorder
Write(): protective MBR → backup entries → sync → backup header → sync → primary entries → sync → primary header → sync. All header/array bytes are serialized up front, so any encoding error aborts cleanly before touching the disk. The per-side intra-syncs hold the invariant "the primary on disk is either fully-old-and-consistent or fully-new-and-consistent at every crash point."syncWritablehelper: best-effortSync()via runtime type assertion (f.(interface{ Sync() error })).*os.Filesatisfies it; in-memory test backends don't and silently no-op. Thebackend.WritableFileinterface contract is unchanged.Read()falls back to backup GPT on a primary content failure (bad header signature, header CRC, or partition-entries CRC). I/O errors propagate unchanged — the backup is read through the same I/O path and would fail the same way, so falling back there would just bury the original error. Disk size is discovered viaSeek(0, io.SeekEnd)(works uniformly on image files and Linux block devices). The recoveredTablehas a newRecoveredFromBackup boolfield set so the caller knows to rewrite the primary by callingWrite()(or use an external tool likesgdisk --repair).Test update: the existing
TestReadcase that asserted "corrupted primary returnsnil + error" now asserts "corrupted primary recovers from backup withRecoveredFromBackup=true". A new case asserts that primary AND backup both corrupt still returns an error.Crash-safety analysis
After this change, an interrupted
Write()leaves the disk in one of:In every case, a consumer that validates CRCs and falls back to the backup on primary failure will see a consistent partition layout. With the order before this PR, a crash between writes could leave the primary advertising a new layout while the backup was still old, or vice versa, with no guarantee of consistency between them.
Compatibility
Tablestruct gains one exported field (RecoveredFromBackup); existing fields and order unchanged.backend.WritableFileinterface unchanged.Write()andRead()signatures unchanged.go test ./...across the repo).Signed-off-by: eriknordmark erik@zededa.com