Skip to content

refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229

Merged
wjones127 merged 1 commit into
lance-format:mainfrom
wjones127:feat/custom-deepsize-trait
Jun 8, 2026
Merged

refactor: replace deepsize crate with custom DeepSizeOf trait for Arrow-aware memory accounting#6229
wjones127 merged 1 commit into
lance-format:mainfrom
wjones127:feat/custom-deepsize-trait

Conversation

@wjones127

Copy link
Copy Markdown
Contributor

The external deepsize crate does not correctly account for Arrow buffers
that are shared across Arc references, causing double-counting in cache
size calculations.

This PR introduces a custom DeepSizeOf trait in lance-core::deepsize
with a Context that tracks both Arc and raw buffer pointers in a
unified HashSet. It also adds a lance-derive proc-macro crate for
#[derive(DeepSizeOf)] and removes the dependency on the external
deepsize crate.

Changes:

  • Add lance-core::deepsize module with custom DeepSizeOf trait and pointer-tracking Context
  • Add lance-derive proc-macro crate with #[derive(DeepSizeOf)]
  • Remove external deepsize crate from all crates; update all impls to use lance_core::deepsize
  • Fix all deep_size_of_children implementations to thread context through (previously some ignored the parameter)
  • Add DeepSizeOf for arrow_buffer::ScalarBuffer<T> with proper pointer tracking
  • Fix PlainPostingList and Positions to use deep_size_of_children with context

@github-actions github-actions Bot added the A-python Python bindings label Mar 19, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Review

Good motivation — the external deepsize crate doesn't track Arc/buffer pointer identity, so shared Arrow buffers get double-counted in cache size calculations. The custom trait with a pointer-tracking Context is the right approach.

Issues to address

P0: dyn Array::deep_size_of_children calls self.to_data() which clones all buffers

Array::to_data() can be expensive — it materializes an ArrayData by cloning Arc<Buffer>s and for some array types does non-trivial work. This is called every time you measure a column's size in a RecordBatch, which happens on every cache insertion/eviction. Consider using Array::get_array_memory_size() for the total and only using the Context for dedup at the Arc<dyn Array> / Arc<RecordBatch> level. Alternatively, use array.to_data() only once and cache the result, or see if ArrayData can be obtained without cloning (e.g. array.into_data() if you own it, or accessing internal data references).

P1: CompressedPostingList still uses get_buffer_memory_size() without context tracking

CompressedPostingList::deep_size_of_children (rust/lance-index/src/scalar/inverted/index.rs) was only half-migrated — it still calls self.blocks.get_buffer_memory_size() and self.positions.as_ref().map(Array::get_buffer_memory_size) directly, bypassing the Context dedup. This means shared buffers in compressed posting lists can still be double-counted.

P1: ScalarBuffer reports len * size_of::<T>() instead of capacity

In deepsize.rs, ScalarBuffer::deep_size_of_children returns self.len() * size_of::<T>(), but the underlying Buffer may have been allocated with more capacity (e.g. after slicing). Other impls in the same file (like Vec) correctly use capacity. For consistency and accuracy, consider using the underlying buffer's capacity.

Minor

  • The HashMap size estimate uses capacity * (size_of::<K>() + size_of::<V>() + 1), which is a reasonable approximation for hashbrown/Swiss tables but the actual overhead per bucket is closer to 1 byte of control metadata per slot (not per occupied entry). This is fine as an estimate since you're using capacity() which already accounts for the total slot count — just noting it's approximate.

  • No tests for the derive macro on enums or generics. Consider adding at least one test struct/enum in deepsize.rs tests that uses #[derive(DeepSizeOf)] to verify the proc macro generates correct code.

@wjones127 wjones127 marked this pull request as ready for review March 19, 2026 17:44
@wjones127 wjones127 force-pushed the feat/custom-deepsize-trait branch from 8566f00 to 1c70985 Compare April 1, 2026 01:31
@github-actions github-actions Bot added the A-java Java bindings + JNI label Apr 1, 2026
@wjones127 wjones127 force-pushed the feat/custom-deepsize-trait branch from 1c70985 to 23df3cb Compare April 1, 2026 02:26

@westonpace westonpace left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, I didn't look too closely at all the changes as I think most were find/replace.

Should we do some blanket impl like?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

Although maybe better to keep the dependency away and let callers do that themselves if they are using deepsize I suppose.

///
/// Generates an implementation that sums the `deep_size_of_children` of all
/// fields (for structs) or the active variant's fields (for enums).
#[proc_macro_derive(DeepSizeOf)]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for this macro somewhere? I guess we get some testing via usage? You can, in theory, do integration tests with proc-macros if I recall (unit tests don't make a ton of sense)

Maybe not important for an internal utility.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I should add some more tests.

@wjones127

Copy link
Copy Markdown
Contributor Author

This is nice, I didn't look too closely at all the changes as I think most were find/replace.

The critical change from deepsize crate is that they track just Arc and RC pointers in Context: 1:

pub struct Context {
    /// A set of all [`Arc`](std::sync::Arc)s that have already been counted
    arcs: GenericSet<usize>,
    /// A set of all [`Rc`](std::sync::Arc)s that have already been counted
    rcs: GenericSet<usize>,
}

We instead just track one list of pointers:

pub struct Context {
    seen: HashSet<usize>,
}

This is important because Arrow doesn't give you access to the Arc, just the underlying pointers. So if you want to mark Arrow buffers as seen, you can't use the Context from deepsize. (I initially tried to implement deepsize::DeepSizeOf for a new type wrapper around Arrow types, but ran into this. That's what motivated me to make this change.)

Should we do some blanket impl like?

impl<T: lance_core::deepsize::DeepSizeOf> deepsize::DeepSizeOf for T {
  ...
}

I think this might be difficult, given I would need to create some sort of adapter between our Context and their Context. Plus, I don't get the sense that deepsize crate is actively maintained. All the PRs are left as stale. It's our only dependency that still uses syn 1.x. So I lean towards dropping it.

Footnotes

  1. https://github.com/Aeledfyr/deepsize/blob/5deebe687695610e6038dc70753defbe7634d549/src/lib.rs#L151-L156

…ow-aware memory accounting

The external `deepsize` crate does not correctly account for Arrow buffers
that are shared across `Arc` references, causing double-counting in cache
size calculations.

This PR introduces a custom `DeepSizeOf` trait in `lance-core::deepsize`
with a `Context` that tracks both `Arc` and raw buffer pointers in a
unified `HashSet`. It also adds a `lance-derive` proc-macro crate for
`#[derive(DeepSizeOf)]` and removes the dependency on the external
`deepsize` crate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/custom-deepsize-trait branch from 56744bc to 0445d04 Compare June 8, 2026 17:13
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer A-deps Dependency updates A-encoding Encoding, IO, file reader/writer labels Jun 8, 2026

@westonpace westonpace left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is one way to work around the orphan rule 😆

@wjones127

Copy link
Copy Markdown
Contributor Author

I suppose this is one way to work around the orphan rule 😆

DeepSizeOf is basically unmaintained these days. My PR to upgrade dependencies has been sitting for months with no response. Aeledfyr/deepsize#42

@wjones127 wjones127 merged commit 8151ec4 into lance-format:main Jun 8, 2026
30 checks passed
LuQQiu added a commit that referenced this pull request Jun 9, 2026
## Summary
- Adds a `[[tool.bumpversion.files]]` entry for `lance-derive` in
`.bumpversion.toml` so its workspace dependency line in `Cargo.toml` is
bumped alongside the other lance crates during beta releases.

## Why
The `lance-derive` crate (added in #6229) was missing from
`.bumpversion.toml`. As a result, during the most recent `publish-beta`
run, `[workspace.package].version` was bumped from `8.0.0-beta.7` →
`8.0.0-beta.8` (which bumped `rust/lance-derive/Cargo.toml`), but the
workspace dependency line in the root `Cargo.toml` stayed at
`lance-derive = { version = "=8.0.0-beta.7", ... }`. Cargo then failed:

```
error: failed to select a version for the requirement `lance-derive = "=8.0.0-beta.7"`
```

Failing run:
https://github.com/lance-format/lance/actions/runs/27225140622/job/80390190228

Same fix pattern as #6526 (lance-tokenizer) and #4572 (fsst,
bitpacking).

## Test plan
- [ ] Next `publish-beta` run succeeds and bumps `lance-derive` in
lockstep with the other crates.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-deps Dependency updates A-encoding Encoding, IO, file reader/writer A-index Vector index, linalg, tokenizer A-java Java bindings + JNI A-python Python bindings

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants