Skip to content

Add chunk-integrity self-verify with distinct exit code (43)#5

Open
JackackMach9 wants to merge 1 commit into
mainfrom
chunk-integrity-self-verify
Open

Add chunk-integrity self-verify with distinct exit code (43)#5
JackackMach9 wants to merge 1 commit into
mainfrom
chunk-integrity-self-verify

Conversation

@JackackMach9

@JackackMach9 JackackMach9 commented Apr 23, 2026

Copy link
Copy Markdown

Summary

Adds per-chunk round-trip verification in Processor::createChunk and introduces a distinct exit code (43) when a corrupt chunk is detected. Lets orchestrators retry on chunk corruption specifically rather than treating it as a hard failure.

Context

Production pcdnet ingest of dataset 9f81b064-b41b-4b8a-ab8b-98c7a4aa4d43 produced a COPC with one undecodable chunk at VoxelKey(10, 117, 280, 6) — declared 26,534 points but the compressed bytes decode to only 24,102. Downstream pdil.Copc.load_box raised lazrs.LazrsError, which pcdnet silently treated as "empty tile" and dropped an entire ~30×30m spatial region of points (visible geometric hole in the final output).

Root cause was investigated but not pinpointed:

  • Bad and good runs happened on the same EC2 host 6 days apart with identical code, identical inputs, identical blob → rules out deterministic cause
  • ~10,000 single-voxel BU replays with fresh std::random_device seeds produced zero corruption (bug doesn't trigger under isolated single-threaded reprocessing of the target voxel)
  • 7 full untwine repro runs produced zero corruption locally
  • 55-minute TSan run found no cross-thread races in untwine code (only pre-existing libcurl noise from PDAL's LAS reader)

Leading hypothesis: rare race/memory ordering interaction in the concurrent BU write path (up to 10 parallel Processor::createChunk calls writing to the shared output file via independent std::ofstream handles). The bug surfaces about 1 in 50k chunks per run on production's Intel EC2 environment; below detection threshold on my AMD Ryzen local.

Since we can't catch the race, this change makes the corruption self-detecting and retryable instead of silent.

What this does

  • bu/Processor.cpp: after compressor.done(), build a lazperf::reader::chunk_decompressor on the chunk bytes and decode view->size() points. On exception, throw ChunkIntegrityError with the voxel key and declared point_count/byte_size in the message.
  • untwine/FatalError.hpp: new ChunkIntegrityError exception (subclass of FatalError) and ChunkIntegrityExitCode = 43 constant.
  • bu/PyramidManager.{hpp,cpp}: new queueWithChunkIntegrityError + internal flag so the main thread rethrows the correct exception type (the original queueWithError path discards the type).
  • untwine/Untwine.cpp: catch ChunkIntegrityError before FatalError and set status = ChunkIntegrityExitCode (43).

Testing

  • Built and ran on a 5-file, 1.3B-point dataset (41st and Yale Arterial Rehab). Output: 4.3 GB COPC, 0 corrupt chunks across 62,903 non-empty nodes, clean exit (0).
  • Self-verify overhead: ~5-10% (one decompression pass per chunk; chunks are single-threaded within a Processor).
  • A paired perception-side retry wrapper will be landed separately that detects exit code 43 and retries up to 3 times.

Test plan

  • Build with cmake Release on PDAL base image
  • Full pipeline run on production dataset (preprocessed, PF7, 1.3B pts) — output decodes cleanly
  • Exit code 0 on clean run
  • Exit code 43 path verified end-to-end once perception retry wrapper lands

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic integrity verification of compressed chunks during processing
    • Introduced specialized error codes to distinguish chunk corruption failures from other processing errors
  • Bug Fixes

    • Enhanced error handling to detect and separately report chunk corruption issues
    • Improved data validation to catch corruption earlier in the processing pipeline

…lure

Every chunk is round-tripped through lazperf::reader::chunk_decompressor
immediately after compressor.done(). If any chunk fails to decode all
view->size() points, throw ChunkIntegrityError, which main() translates
to exit code 43.

This addresses a rare non-deterministic bug where a single COPC chunk's
declared point_count exceeds what its compressed bytes can decode to.
Two production runs on the same EC2 host with identical inputs produced
different outputs (one corrupt, one clean), ruling out a deterministic
input-dependent cause. 10000+ local reproductions with this branch
produced zero corruption. The bug appears to require specific cross-voxel
thread-interleaving that's below detection threshold on most hardware.

The self-verify cost is one extra decompression pass per chunk (~5-10%
overhead). A distinct exit code lets orchestrators (e.g. the ingest
pipeline) retry specifically on chunk corruption instead of treating it
as a hard failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The changes introduce a dedicated error-handling path for chunk integrity failures. A new ChunkIntegrityError exception class and exit code constant are added to distinguish corruption failures. Chunk processing now includes post-compression self-verification via round-trip decompression. PyramidManager gains a separate error-queueing API to track and propagate integrity errors distinctly.

Changes

Cohort / File(s) Summary
Error Infrastructure
untwine/FatalError.hpp, untwine/Untwine.cpp
New ChunkIntegrityError exception class and ChunkIntegrityExitCode constant (43) added to distinguish corruption failures. Main error handler now catches this exception explicitly and returns the specialized exit code.
PyramidManager Error Queueing
bu/PyramidManager.hpp, bu/PyramidManager.cpp
New queueWithChunkIntegrityError() API method and internal m_errorIsChunkIntegrity flag added to track error type. run() loop now selects exception type based on this flag when reporting failures.
Processor Chunk Validation
bu/Processor.cpp
Enhanced createChunk() with post-compression self-verify: freshly compressed LAZ/Lazperf chunks are immediately decompressed using expected format parameters. Decompression failures trigger ChunkIntegrityError with voxel key and metadata. run() now handles this error separately via new queueing path.

Sequence Diagram

sequenceDiagram
    participant Processor
    participant PyramidManager
    participant Lazperf
    participant Main as Untwine.cpp

    Processor->>Processor: createChunk()
    Processor->>Processor: Compress chunk to LAZ/Lazperf
    Processor->>Lazperf: Round-trip: decompress freshly compressed chunk
    alt Decompression fails
        Lazperf-->>Processor: Exception
        Processor->>Processor: Create ChunkIntegrityError
        Processor->>PyramidManager: queueWithChunkIntegrityError()
        PyramidManager->>PyramidManager: Set m_errorIsChunkIntegrity = true
        PyramidManager->>PyramidManager: Wake worker loop
    else Decompression succeeds
        Lazperf-->>Processor: OK
        Processor->>PyramidManager: Continue normal processing
    end
    
    PyramidManager->>PyramidManager: run() detects error + flag
    alt m_errorIsChunkIntegrity is true
        PyramidManager-->>Main: Throw ChunkIntegrityError
    else
        PyramidManager-->>Main: Throw FatalError
    end
    
    Main->>Main: catch(ChunkIntegrityError)
    Main->>Main: Return ChunkIntegrityExitCode (43)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A chunk once written, now we test,
Round-trip through lazperf—integrity blessed!
If corruption's caught mid-compress,
Exit code 43 confesses the mess.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: adding chunk-integrity self-verification with a distinct exit code.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chunk-integrity-self-verify

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bu/PyramidManager.cpp`:
- Around line 69-80: queueWithChunkIntegrityError sets m_errorIsChunkIntegrity =
true but the generic error path (queueWithError) never resets it, causing later
errors to be misclassified; update the generic error path (queueWithError) to
explicitly set m_errorIsChunkIntegrity = false (and set m_error to the passed
error and push the OctantInfo under the same mutex) so stale flags are cleared,
or alternatively ensure any non-chunk path always clears m_errorIsChunkIntegrity
before notifying (refer to PyramidManager::queueWithChunkIntegrityError and
PyramidManager::queueWithError).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58f8e9ee-aa61-483a-aee8-2c6efd5d6735

📥 Commits

Reviewing files that changed from the base of the PR and between 5ee38bb and 3bffb61.

📒 Files selected for processing (5)
  • bu/Processor.cpp
  • bu/PyramidManager.cpp
  • bu/PyramidManager.hpp
  • untwine/FatalError.hpp
  • untwine/Untwine.cpp

Comment thread bu/PyramidManager.cpp
Comment on lines +69 to +80
void PyramidManager::queueWithChunkIntegrityError(const OctantInfo& o,
const std::string& error)
{
{
std::lock_guard<std::mutex> lock(m_mutex);

m_queue.push(o);
m_error = error;
m_errorIsChunkIntegrity = true;
}
m_cv.notify_one();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reset the integrity flag on generic error queueing to avoid misclassification.

m_errorIsChunkIntegrity is set to true in the new path but never explicitly reset in the generic path. A later queueWithError(...) can inherit the stale flag and be rethrown as ChunkIntegrityError, producing the wrong exit code.

🛠️ Proposed fix
 void PyramidManager::queueWithError(const OctantInfo& o, const std::string& error)
 {
     {
         std::lock_guard<std::mutex> lock(m_mutex);

         m_queue.push(o);
         m_error = error;
+        m_errorIsChunkIntegrity = false;
     }
     m_cv.notify_one();
 }

Also applies to: 99-104

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bu/PyramidManager.cpp` around lines 69 - 80, queueWithChunkIntegrityError
sets m_errorIsChunkIntegrity = true but the generic error path (queueWithError)
never resets it, causing later errors to be misclassified; update the generic
error path (queueWithError) to explicitly set m_errorIsChunkIntegrity = false
(and set m_error to the passed error and push the OctantInfo under the same
mutex) so stale flags are cleared, or alternatively ensure any non-chunk path
always clears m_errorIsChunkIntegrity before notifying (refer to
PyramidManager::queueWithChunkIntegrityError and
PyramidManager::queueWithError).

@JackackMach9 JackackMach9 requested a review from harveybia April 23, 2026 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant