Skip to content

Fix branch merge with blob columns#78

Open
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin/1778363660-mr-901-blob-branch-merge
Open

Fix branch merge with blob columns#78
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin/1778363660-mr-901-blob-branch-merge

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented May 9, 2026

Summary

  • Materialize Lance Blob columns from merge cursors before staging merged rows, preserving blob payloads across branch merge.
  • Use Lance-recommended Blob APIs for rewrite paths: scan blob descriptions with _rowid, resolve payloads with Dataset::take_blobs, and rebuild Blob v2 input arrays with BlobArrayBuilder.
  • Materialize rewritten blobs as bytes instead of preserving external uri references, avoiding Lance merge_insert's missing allow_external_blob_outside_bases hook.
  • Reuse the staged merge-insert publish path for merge deltas so row version metadata remains correct for unchanged rows.
  • Add regression tests for inline Blob-backed branch merge and external-file Blob URI branch merge.

Review & Testing Checklist for Human

  • Verify the blob merge regressions cover the production shape from MR-901, including concurrent main and branch edits.
  • Review the blob materialization path for legacy/current Lance blob physical schema compatibility and row-id alignment.
  • Run an end-to-end omnigraph branch merge against a copy of a Blob-backed repo before applying to important local data.

Notes

Latest PR comment assessment:

  • Legacy blob schema comment: not actionable as written. The current code handles Lance v1 descriptions (position/size) and Lance v2 descriptions (kind/position/size/blob_uri), and returns explicit unrecognized blob description schema ... errors for unexpected descriptor layouts.
  • External URI comment: valid. MergeInsertBuilder in Lance 4.0.0 does not expose WriteParams, so rewritten branch-merge batches now materialize blob bytes instead of carrying external URI references into merge_insert.

Lance docs checked from docs/lance.md:

  • Blob guide: Blob v2 writes should use Struct<data: LargeBinary, uri: Utf8> via blob builders; reads should use take_blobs with exactly one selector, preferring row IDs for stable logical identity when available.
  • Read/write guide: merge-insert with update-all / insert-all is the recommended upsert primitive.
  • Row ID lineage spec: stable row IDs preserve logical identity across updates, which matches using _rowid with Dataset::take_blobs.

Local checks:

  • rustfmt --edition 2024 crates/omnigraph/src/table_store.rs crates/omnigraph/tests/branching.rs --check
  • git diff --check
  • cargo check -p omnigraph-engine (passes with existing warnings)
  • cargo test -p omnigraph-engine --test branching branch_merge_with_blob_columns_preserves_blob_data -- --exact
  • cargo test -p omnigraph-engine --test branching branch_merge_with_external_blob_uri_materializes_payload -- --exact

Previously on this branch:

  • cargo test -p omnigraph-engine --test branching
  • cargo test --workspace --locked

cargo clippy -p omnigraph-engine --all-targets -- -D warnings still fails on pre-existing compiler-crate lints outside this PR's files.

Link to Devin session: https://app.devin.ai/sessions/6065ebd76d1c4d3881294a223e1dfd0e
Requested by: @ragnorc


Open in Devin Review

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Copy Markdown
Contributor Author

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cubic analysis

No issues found across 3 files

Linked issue analysis

Linked issue: MR-901: Validate and fix native branch merge with Blob columns

Status Acceptance criteria Notes
Add a regression test that creates a repo with a Blob property, creates a branch, writes Blob-backed rows, and merges into main. Added test branch_merge_with_blob_columns_preserves_blob_data
Confirm branch merge handles both old and current Lance Blob physical schemas, or fails with a clear migration/remediation message. Blob description parsing accepts legacy/current kinds and errors on misalignment
Remove the need for export/delta-load workarounds for Blob-backed tables. Merge materializes blobs and test verifies blob bytes preserved after merge
Materialize Lance Blob columns from merge cursors before staging merged rows, preserving blob bytes across branch merge. row_batch calls TableStore::materialize_blob_batch for blob columns
⚠️ Reuse the staged merge-insert publish path for merge deltas so row version metadata remains correct for unchanged rows. scan_batches_for_rewrite used, but explicit merge-insert publish reuse not fully obvious
Add a regression test that verifies blob reads after merge (i.e., bytes preserved for branch and main rows). Test asserts read_blob bytes for inserted/updated rows after merge
Handle legacy/current Lance blob physical schema compatibility and row-id alignment during materialization. Materialization reads _rowid, takes blobs by row ids and checks alignment
Support merges where both main and branch have edits (concurrent edits scenario covered by test). Test updates main and branch then merges, asserting expected outcomes

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
cubic-dev-ai[bot]

This comment was marked as resolved.

devin-ai-integration Bot and others added 2 commits May 10, 2026 21:55
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/omnigraph/src/table_store.rs">

<violation number="1" location="crates/omnigraph/src/table_store.rs:382">
P2: Always pushing `blob.read()` here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with `push_uri` and only read bytes when no URI is present.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +382 to +388
builder
.push_bytes(
blob.read()
.await
.map_err(|e| OmniError::Lance(e.to_string()))?,
)
.map_err(|e| OmniError::Lance(e.to_string()))?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Always pushing blob.read() here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with push_uri and only read bytes when no URI is present.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/omnigraph/src/table_store.rs, line 382:

<comment>Always pushing `blob.read()` here rewrites URI-backed blobs to inline bytes, which loses URI round-trip metadata and forces full blob reads during merge. Preserve URI blobs with `push_uri` and only read bytes when no URI is present.</comment>

<file context>
@@ -379,19 +379,13 @@ impl TableStore {
-                    )
-                    .map_err(|e| OmniError::Lance(e.to_string()))?;
-            }
+            builder
+                .push_bytes(
+                    blob.read()
</file context>
Suggested change
builder
.push_bytes(
blob.read()
.await
.map_err(|e| OmniError::Lance(e.to_string()))?,
)
.map_err(|e| OmniError::Lance(e.to_string()))?;
if let Some(uri) = blob.uri() {
builder
.push_uri(uri)
.map_err(|e| OmniError::Lance(e.to_string()))?;
} else {
builder
.push_bytes(
blob.read()
.await
.map_err(|e| OmniError::Lance(e.to_string()))?,
)
.map_err(|e| OmniError::Lance(e.to_string()))?;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This P2 is a valid tradeoff concern but not actionable for MR-901 as-is. Preserving blob.uri() reintroduces the earlier P1 failure for external URIs because Lance 4.0.0 MergeInsertBuilder does not expose WriteParams, so branch merge cannot set allow_external_blob_outside_bases; fca2b74 intentionally materializes URI-backed blobs into self-contained payloads and adds branch_merge_with_external_blob_uri_materializes_payload to cover that path. We can preserve URI round-trip metadata later if Lance exposes WriteParams on merge-insert or a public API to pre-resolve which URIs are safe to preserve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant