Add chunk-integrity self-verify with distinct exit code (43)#5
Add chunk-integrity self-verify with distinct exit code (43)#5JackackMach9 wants to merge 1 commit into
Conversation
…lure Every chunk is round-tripped through lazperf::reader::chunk_decompressor immediately after compressor.done(). If any chunk fails to decode all view->size() points, throw ChunkIntegrityError, which main() translates to exit code 43. This addresses a rare non-deterministic bug where a single COPC chunk's declared point_count exceeds what its compressed bytes can decode to. Two production runs on the same EC2 host with identical inputs produced different outputs (one corrupt, one clean), ruling out a deterministic input-dependent cause. 10000+ local reproductions with this branch produced zero corruption. The bug appears to require specific cross-voxel thread-interleaving that's below detection threshold on most hardware. The self-verify cost is one extra decompression pass per chunk (~5-10% overhead). A distinct exit code lets orchestrators (e.g. the ingest pipeline) retry specifically on chunk corruption instead of treating it as a hard failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe changes introduce a dedicated error-handling path for chunk integrity failures. A new Changes
Sequence DiagramsequenceDiagram
participant Processor
participant PyramidManager
participant Lazperf
participant Main as Untwine.cpp
Processor->>Processor: createChunk()
Processor->>Processor: Compress chunk to LAZ/Lazperf
Processor->>Lazperf: Round-trip: decompress freshly compressed chunk
alt Decompression fails
Lazperf-->>Processor: Exception
Processor->>Processor: Create ChunkIntegrityError
Processor->>PyramidManager: queueWithChunkIntegrityError()
PyramidManager->>PyramidManager: Set m_errorIsChunkIntegrity = true
PyramidManager->>PyramidManager: Wake worker loop
else Decompression succeeds
Lazperf-->>Processor: OK
Processor->>PyramidManager: Continue normal processing
end
PyramidManager->>PyramidManager: run() detects error + flag
alt m_errorIsChunkIntegrity is true
PyramidManager-->>Main: Throw ChunkIntegrityError
else
PyramidManager-->>Main: Throw FatalError
end
Main->>Main: catch(ChunkIntegrityError)
Main->>Main: Return ChunkIntegrityExitCode (43)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bu/PyramidManager.cpp`:
- Around line 69-80: queueWithChunkIntegrityError sets m_errorIsChunkIntegrity =
true but the generic error path (queueWithError) never resets it, causing later
errors to be misclassified; update the generic error path (queueWithError) to
explicitly set m_errorIsChunkIntegrity = false (and set m_error to the passed
error and push the OctantInfo under the same mutex) so stale flags are cleared,
or alternatively ensure any non-chunk path always clears m_errorIsChunkIntegrity
before notifying (refer to PyramidManager::queueWithChunkIntegrityError and
PyramidManager::queueWithError).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 58f8e9ee-aa61-483a-aee8-2c6efd5d6735
📒 Files selected for processing (5)
bu/Processor.cppbu/PyramidManager.cppbu/PyramidManager.hppuntwine/FatalError.hppuntwine/Untwine.cpp
| void PyramidManager::queueWithChunkIntegrityError(const OctantInfo& o, | ||
| const std::string& error) | ||
| { | ||
| { | ||
| std::lock_guard<std::mutex> lock(m_mutex); | ||
|
|
||
| m_queue.push(o); | ||
| m_error = error; | ||
| m_errorIsChunkIntegrity = true; | ||
| } | ||
| m_cv.notify_one(); | ||
| } |
There was a problem hiding this comment.
Reset the integrity flag on generic error queueing to avoid misclassification.
m_errorIsChunkIntegrity is set to true in the new path but never explicitly reset in the generic path. A later queueWithError(...) can inherit the stale flag and be rethrown as ChunkIntegrityError, producing the wrong exit code.
🛠️ Proposed fix
void PyramidManager::queueWithError(const OctantInfo& o, const std::string& error)
{
{
std::lock_guard<std::mutex> lock(m_mutex);
m_queue.push(o);
m_error = error;
+ m_errorIsChunkIntegrity = false;
}
m_cv.notify_one();
}Also applies to: 99-104
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bu/PyramidManager.cpp` around lines 69 - 80, queueWithChunkIntegrityError
sets m_errorIsChunkIntegrity = true but the generic error path (queueWithError)
never resets it, causing later errors to be misclassified; update the generic
error path (queueWithError) to explicitly set m_errorIsChunkIntegrity = false
(and set m_error to the passed error and push the OctantInfo under the same
mutex) so stale flags are cleared, or alternatively ensure any non-chunk path
always clears m_errorIsChunkIntegrity before notifying (refer to
PyramidManager::queueWithChunkIntegrityError and
PyramidManager::queueWithError).
Summary
Adds per-chunk round-trip verification in
Processor::createChunkand introduces a distinct exit code (43) when a corrupt chunk is detected. Lets orchestrators retry on chunk corruption specifically rather than treating it as a hard failure.Context
Production pcdnet ingest of dataset
9f81b064-b41b-4b8a-ab8b-98c7a4aa4d43produced a COPC with one undecodable chunk atVoxelKey(10, 117, 280, 6)— declared 26,534 points but the compressed bytes decode to only 24,102. Downstreampdil.Copc.load_boxraisedlazrs.LazrsError, which pcdnet silently treated as "empty tile" and dropped an entire ~30×30m spatial region of points (visible geometric hole in the final output).Root cause was investigated but not pinpointed:
std::random_deviceseeds produced zero corruption (bug doesn't trigger under isolated single-threaded reprocessing of the target voxel)Leading hypothesis: rare race/memory ordering interaction in the concurrent BU write path (up to 10 parallel
Processor::createChunkcalls writing to the shared output file via independentstd::ofstreamhandles). The bug surfaces about 1 in 50k chunks per run on production's Intel EC2 environment; below detection threshold on my AMD Ryzen local.Since we can't catch the race, this change makes the corruption self-detecting and retryable instead of silent.
What this does
bu/Processor.cpp: aftercompressor.done(), build alazperf::reader::chunk_decompressoron the chunk bytes and decodeview->size()points. On exception, throwChunkIntegrityErrorwith the voxel key and declared point_count/byte_size in the message.untwine/FatalError.hpp: newChunkIntegrityErrorexception (subclass ofFatalError) andChunkIntegrityExitCode = 43constant.bu/PyramidManager.{hpp,cpp}: newqueueWithChunkIntegrityError+ internal flag so the main thread rethrows the correct exception type (the originalqueueWithErrorpath discards the type).untwine/Untwine.cpp: catchChunkIntegrityErrorbeforeFatalErrorand setstatus = ChunkIntegrityExitCode(43).Testing
41st and Yale Arterial Rehab). Output: 4.3 GB COPC, 0 corrupt chunks across 62,903 non-empty nodes, clean exit (0).Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes