Add chunk-integrity self-verify with distinct exit code (43) by JackackMach9 · Pull Request #5 · mach9robotics/untwine

JackackMach9 · 2026-04-23T01:12:43Z

Summary

Adds per-chunk round-trip verification in Processor::createChunk and introduces a distinct exit code (43) when a corrupt chunk is detected. Lets orchestrators retry on chunk corruption specifically rather than treating it as a hard failure.

Context

Production pcdnet ingest of dataset 9f81b064-b41b-4b8a-ab8b-98c7a4aa4d43 produced a COPC with one undecodable chunk at VoxelKey(10, 117, 280, 6) — declared 26,534 points but the compressed bytes decode to only 24,102. Downstream pdil.Copc.load_box raised lazrs.LazrsError, which pcdnet silently treated as "empty tile" and dropped an entire ~30×30m spatial region of points (visible geometric hole in the final output).

Root cause was investigated but not pinpointed:

Bad and good runs happened on the same EC2 host 6 days apart with identical code, identical inputs, identical blob → rules out deterministic cause
~10,000 single-voxel BU replays with fresh std::random_device seeds produced zero corruption (bug doesn't trigger under isolated single-threaded reprocessing of the target voxel)
7 full untwine repro runs produced zero corruption locally
55-minute TSan run found no cross-thread races in untwine code (only pre-existing libcurl noise from PDAL's LAS reader)

Leading hypothesis: rare race/memory ordering interaction in the concurrent BU write path (up to 10 parallel Processor::createChunk calls writing to the shared output file via independent std::ofstream handles). The bug surfaces about 1 in 50k chunks per run on production's Intel EC2 environment; below detection threshold on my AMD Ryzen local.

Since we can't catch the race, this change makes the corruption self-detecting and retryable instead of silent.

What this does

bu/Processor.cpp: after compressor.done(), build a lazperf::reader::chunk_decompressor on the chunk bytes and decode view->size() points. On exception, throw ChunkIntegrityError with the voxel key and declared point_count/byte_size in the message.
untwine/FatalError.hpp: new ChunkIntegrityError exception (subclass of FatalError) and ChunkIntegrityExitCode = 43 constant.
bu/PyramidManager.{hpp,cpp}: new queueWithChunkIntegrityError + internal flag so the main thread rethrows the correct exception type (the original queueWithError path discards the type).
untwine/Untwine.cpp: catch ChunkIntegrityError before FatalError and set status = ChunkIntegrityExitCode (43).

Testing

Built and ran on a 5-file, 1.3B-point dataset (41st and Yale Arterial Rehab). Output: 4.3 GB COPC, 0 corrupt chunks across 62,903 non-empty nodes, clean exit (0).
Self-verify overhead: ~5-10% (one decompression pass per chunk; chunks are single-threaded within a Processor).
A paired perception-side retry wrapper will be landed separately that detects exit code 43 and retries up to 3 times.

Test plan

Build with cmake Release on PDAL base image
Full pipeline run on production dataset (preprocessed, PF7, 1.3B pts) — output decodes cleanly
Exit code 0 on clean run
Exit code 43 path verified end-to-end once perception retry wrapper lands

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added automatic integrity verification of compressed chunks during processing
- Introduced specialized error codes to distinguish chunk corruption failures from other processing errors
Bug Fixes
- Enhanced error handling to detect and separately report chunk corruption issues
- Improved data validation to catch corruption earlier in the processing pipeline

…lure Every chunk is round-tripped through lazperf::reader::chunk_decompressor immediately after compressor.done(). If any chunk fails to decode all view->size() points, throw ChunkIntegrityError, which main() translates to exit code 43. This addresses a rare non-deterministic bug where a single COPC chunk's declared point_count exceeds what its compressed bytes can decode to. Two production runs on the same EC2 host with identical inputs produced different outputs (one corrupt, one clean), ruling out a deterministic input-dependent cause. 10000+ local reproductions with this branch produced zero corruption. The bug appears to require specific cross-voxel thread-interleaving that's below detection threshold on most hardware. The self-verify cost is one extra decompression pass per chunk (~5-10% overhead). A distinct exit code lets orchestrators (e.g. the ingest pipeline) retry specifically on chunk corruption instead of treating it as a hard failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-23T01:12:55Z

📝 Walkthrough

Walkthrough

The changes introduce a dedicated error-handling path for chunk integrity failures. A new ChunkIntegrityError exception class and exit code constant are added to distinguish corruption failures. Chunk processing now includes post-compression self-verification via round-trip decompression. PyramidManager gains a separate error-queueing API to track and propagate integrity errors distinctly.

Changes

Cohort / File(s)	Summary
Error Infrastructure `untwine/FatalError.hpp`, `untwine/Untwine.cpp`	New `ChunkIntegrityError` exception class and `ChunkIntegrityExitCode` constant (43) added to distinguish corruption failures. Main error handler now catches this exception explicitly and returns the specialized exit code.
PyramidManager Error Queueing `bu/PyramidManager.hpp`, `bu/PyramidManager.cpp`	New `queueWithChunkIntegrityError()` API method and internal `m_errorIsChunkIntegrity` flag added to track error type. `run()` loop now selects exception type based on this flag when reporting failures.
Processor Chunk Validation `bu/Processor.cpp`	Enhanced `createChunk()` with post-compression self-verify: freshly compressed LAZ/Lazperf chunks are immediately decompressed using expected format parameters. Decompression failures trigger `ChunkIntegrityError` with voxel key and metadata. `run()` now handles this error separately via new queueing path.

Sequence Diagram

sequenceDiagram
    participant Processor
    participant PyramidManager
    participant Lazperf
    participant Main as Untwine.cpp

    Processor->>Processor: createChunk()
    Processor->>Processor: Compress chunk to LAZ/Lazperf
    Processor->>Lazperf: Round-trip: decompress freshly compressed chunk
    alt Decompression fails
        Lazperf-->>Processor: Exception
        Processor->>Processor: Create ChunkIntegrityError
        Processor->>PyramidManager: queueWithChunkIntegrityError()
        PyramidManager->>PyramidManager: Set m_errorIsChunkIntegrity = true
        PyramidManager->>PyramidManager: Wake worker loop
    else Decompression succeeds
        Lazperf-->>Processor: OK
        Processor->>PyramidManager: Continue normal processing
    end
    
    PyramidManager->>PyramidManager: run() detects error + flag
    alt m_errorIsChunkIntegrity is true
        PyramidManager-->>Main: Throw ChunkIntegrityError
    else
        PyramidManager-->>Main: Throw FatalError
    end
    
    Main->>Main: catch(ChunkIntegrityError)
    Main->>Main: Return ChunkIntegrityExitCode (43)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A chunk once written, now we test,
Round-trip through lazperf—integrity blessed!
If corruption's caught mid-compress,
Exit code 43 confesses the mess. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main change: adding chunk-integrity self-verification with a distinct exit code.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chunk-integrity-self-verify

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bu/PyramidManager.cpp`:
- Around line 69-80: queueWithChunkIntegrityError sets m_errorIsChunkIntegrity =
true but the generic error path (queueWithError) never resets it, causing later
errors to be misclassified; update the generic error path (queueWithError) to
explicitly set m_errorIsChunkIntegrity = false (and set m_error to the passed
error and push the OctantInfo under the same mutex) so stale flags are cleared,
or alternatively ensure any non-chunk path always clears m_errorIsChunkIntegrity
before notifying (refer to PyramidManager::queueWithChunkIntegrityError and
PyramidManager::queueWithError).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58f8e9ee-aa61-483a-aee8-2c6efd5d6735

📥 Commits

Reviewing files that changed from the base of the PR and between 5ee38bb and 3bffb61.

📒 Files selected for processing (5)

bu/Processor.cpp
bu/PyramidManager.cpp
bu/PyramidManager.hpp
untwine/FatalError.hpp
untwine/Untwine.cpp

coderabbitai · 2026-04-23T01:15:52Z

+void PyramidManager::queueWithChunkIntegrityError(const OctantInfo& o,
+    const std::string& error)
+{
+    {
+        std::lock_guard<std::mutex> lock(m_mutex);
+
+        m_queue.push(o);
+        m_error = error;
+        m_errorIsChunkIntegrity = true;
+    }
+    m_cv.notify_one();
+}


⚠️ Potential issue | 🟠 Major

Reset the integrity flag on generic error queueing to avoid misclassification.

m_errorIsChunkIntegrity is set to true in the new path but never explicitly reset in the generic path. A later queueWithError(...) can inherit the stale flag and be rethrown as ChunkIntegrityError, producing the wrong exit code.

🛠️ Proposed fix

void PyramidManager::queueWithError(const OctantInfo& o, const std::string& error) { { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push(o); m_error = error; + m_errorIsChunkIntegrity = false; } m_cv.notify_one(); }

Also applies to: 99-104

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bu/PyramidManager.cpp` around lines 69 - 80, queueWithChunkIntegrityError sets m_errorIsChunkIntegrity = true but the generic error path (queueWithError) never resets it, causing later errors to be misclassified; update the generic error path (queueWithError) to explicitly set m_errorIsChunkIntegrity = false (and set m_error to the passed error and push the OctantInfo under the same mutex) so stale flags are cleared, or alternatively ensure any non-chunk path always clears m_errorIsChunkIntegrity before notifying (refer to PyramidManager::queueWithChunkIntegrityError and PyramidManager::queueWithError).

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

JackackMach9 requested a review from harveybia April 23, 2026 01:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add chunk-integrity self-verify with distinct exit code (43)#5

Add chunk-integrity self-verify with distinct exit code (43)#5
JackackMach9 wants to merge 1 commit into
mainfrom
chunk-integrity-self-verify

JackackMach9 commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

JackackMach9 commented Apr 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

What this does

Testing

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JackackMach9 commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading