Fix branch merge with blob columns#78
Conversation
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
cubic analysis
No issues found across 3 files
Linked issue analysis
Linked issue: MR-901: Validate and fix native branch merge with Blob columns
| Status | Acceptance criteria | Notes |
|---|---|---|
| ✅ | Add a regression test that creates a repo with a Blob property, creates a branch, writes Blob-backed rows, and merges into main. | Added test branch_merge_with_blob_columns_preserves_blob_data |
| ✅ | Confirm branch merge handles both old and current Lance Blob physical schemas, or fails with a clear migration/remediation message. | Blob description parsing accepts legacy/current kinds and errors on misalignment |
| ✅ | Remove the need for export/delta-load workarounds for Blob-backed tables. | Merge materializes blobs and test verifies blob bytes preserved after merge |
| ✅ | Materialize Lance Blob columns from merge cursors before staging merged rows, preserving blob bytes across branch merge. | row_batch calls TableStore::materialize_blob_batch for blob columns |
| Reuse the staged merge-insert publish path for merge deltas so row version metadata remains correct for unchanged rows. | scan_batches_for_rewrite used, but explicit merge-insert publish reuse not fully obvious | |
| ✅ | Add a regression test that verifies blob reads after merge (i.e., bytes preserved for branch and main rows). | Test asserts read_blob bytes for inserted/updated rows after merge |
| ✅ | Handle legacy/current Lance blob physical schema compatibility and row-id alignment during materialization. | Materialization reads _rowid, takes blobs by row ids and checks alignment |
| ✅ | Support merges where both main and branch have edits (concurrent edits scenario covered by test). | Test updates main and branch then merges, asserting expected outcomes |
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/omnigraph/src/table_store.rs">
<violation number="1" location="crates/omnigraph/src/table_store.rs:382">
P2: Always pushing `blob.read()` here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with `push_uri` and only read bytes when no URI is present.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| builder | ||
| .push_bytes( | ||
| blob.read() | ||
| .await | ||
| .map_err(|e| OmniError::Lance(e.to_string()))?, | ||
| ) | ||
| .map_err(|e| OmniError::Lance(e.to_string()))?; |
There was a problem hiding this comment.
P2: Always pushing blob.read() here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with push_uri and only read bytes when no URI is present.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/omnigraph/src/table_store.rs, line 382:
<comment>Always pushing `blob.read()` here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with `push_uri` and only read bytes when no URI is present.</comment>
<file context>
@@ -379,19 +379,13 @@ impl TableStore {
- )
- .map_err(|e| OmniError::Lance(e.to_string()))?;
- }
+ builder
+ .push_bytes(
+ blob.read()
</file context>
| builder | |
| .push_bytes( | |
| blob.read() | |
| .await | |
| .map_err(|e| OmniError::Lance(e.to_string()))?, | |
| ) | |
| .map_err(|e| OmniError::Lance(e.to_string()))?; | |
| if let Some(uri) = blob.uri() { | |
| builder | |
| .push_uri(uri) | |
| .map_err(|e| OmniError::Lance(e.to_string()))?; | |
| } else { | |
| builder | |
| .push_bytes( | |
| blob.read() | |
| .await | |
| .map_err(|e| OmniError::Lance(e.to_string()))?, | |
| ) | |
| .map_err(|e| OmniError::Lance(e.to_string()))?; | |
| } |
There was a problem hiding this comment.
This P2 is a valid tradeoff concern but not actionable for MR-901 as-is. Preserving blob.uri() reintroduces the earlier P1 failure for external URIs because Lance 4.0.0 MergeInsertBuilder does not expose WriteParams, so branch merge cannot set allow_external_blob_outside_bases; fca2b74 intentionally materializes URI-backed blobs into self-contained payloads and adds branch_merge_with_external_blob_uri_materializes_payload to cover that path. We can preserve URI round-trip metadata later if Lance exposes WriteParams on merge-insert or a public API to pre-resolve which URIs are safe to preserve.
Summary
_rowid, resolve payloads withDataset::take_blobs, and rebuild Blob v2 input arrays withBlobArrayBuilder.urireferences, avoiding Lancemerge_insert's missingallow_external_blob_outside_baseshook.Review & Testing Checklist for Human
omnigraph branch mergeagainst a copy of a Blob-backed repo before applying to important local data.Notes
Latest PR comment assessment:
position/size) and Lance v2 descriptions (kind/position/size/blob_uri), and returns explicitunrecognized blob description schema ...errors for unexpected descriptor layouts.MergeInsertBuilderin Lance 4.0.0 does not exposeWriteParams, so rewritten branch-merge batches now materialize blob bytes instead of carrying external URI references intomerge_insert.Lance docs checked from
docs/lance.md:Struct<data: LargeBinary, uri: Utf8>via blob builders; reads should usetake_blobswith exactly one selector, preferring row IDs for stable logical identity when available._rowidwithDataset::take_blobs.Local checks:
rustfmt --edition 2024 crates/omnigraph/src/table_store.rs crates/omnigraph/tests/branching.rs --checkgit diff --checkcargo check -p omnigraph-engine(passes with existing warnings)cargo test -p omnigraph-engine --test branching branch_merge_with_blob_columns_preserves_blob_data -- --exactcargo test -p omnigraph-engine --test branching branch_merge_with_external_blob_uri_materializes_payload -- --exactPreviously on this branch:
cargo test -p omnigraph-engine --test branchingcargo test --workspace --lockedcargo clippy -p omnigraph-engine --all-targets -- -D warningsstill fails on pre-existing compiler-crate lints outside this PR's files.Link to Devin session: https://app.devin.ai/sessions/6065ebd76d1c4d3881294a223e1dfd0e
Requested by: @ragnorc