Skip to content

feat: Implement icebug-disk format#429

Open
aheev wants to merge 15 commits intoLadybugDB:mainfrom
aheev:icedisk-impl
Open

feat: Implement icebug-disk format#429
aheev wants to merge 15 commits intoLadybugDB:mainfrom
aheev:icedisk-impl

Conversation

@aheev
Copy link
Copy Markdown
Contributor

@aheev aheev commented Apr 28, 2026

  • Added IceDisk storage tables: Added IceDiskNodeTable and IceDiskRelTable
  • IceDisk test suite: Added demo_db_ice_disk.test covering node/rel table scans against IceDisk storage.
  • IceDisk utilities: Added ice_disk_utils.h with shared helpers for path resolution and Parquet-backed row group access.
  • added icebug-disk spec implementation

Closes #469

@aheev aheev force-pushed the icedisk-impl branch 3 times, most recently from 7f26fa3 to 2c9f811 Compare May 7, 2026 05:22
@aheev aheev changed the title [WIP] feat: add icebug-disk tables feat: add icebug-disk tables May 8, 2026
@aheev aheev changed the title feat: add icebug-disk tables feat: Implement icebug-disk format May 8, 2026
@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

How's this different from:

src/include/storage/table/parquet_rel_table.h
src/include/storage/table/parquet_node_table.h

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

How's this different from:

src/include/storage/table/parquet_rel_table.h
src/include/storage/table/parquet_node_table.h

updating description. Give me a minute

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

@adsharma

Improvements:

  • Batch scanning in node table
  • revamped rel_table scanning. Previously, we used to scan every row group for each boundNodeOffset. Changed it to boundNodeOffset driven scan. Based on fwd or bwd scan, we'll fetch the rowGroups
  • Other improvements like removing unnecessary data from scanStates, removing repeated state data population, caching
  • blocking user from creating mixed tables/rels

However, the main motivation behind adding new classes is having separate entities for icebug-format because it may diverge from existing columnar implementations. As discussed earlier, we are planning to use/replace exisiting parquet or arrow classes for normal tables or delegate to duckDB anyway

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

@adsharma

With this implementation, we can have a working db as an output from non-icedisk-to-icedisk-converter or user can pass the schema.cypher or run the cmds themselves to create icebug-disk tables. If you want to attach from existing db, we already support attaching external ladybug dbs

The main issue is path though. If the user moves the db or path, he/she might need to trigger create table queries again. DuckDB provides override option. We may have to provide similar option in the future

Couple of clarifications:

  • Where to add version? I am thinking metadata in each file
  • Should we keep the current column(0) for nbr_offset in in indptr? or allow users to specify

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.

DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

uvx icebug-format --help

It already creates a metadata table thusly:

        # Create global metadata
        con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
        """)

Yes, this is a good place to add version.

Should we keep the current column(0) for nbr_offset in in indptr? or allow users to specify

nbr_offset is kuzu/ladybug specific terminology. I'd avoid it in the code/docs. CSR format is broadly defined. Include a pointer to existing definition. My recollection is that indptr always has only one column specifying the offset into indices. However, indices can have many columns if there are properties on the edge.

We can specify that target is always col0 and additional columns appear in an order as specified in the schema.

The distinction between schema.cypher and catalog entry are mechanical. I don't think we should spend a lot of time specifying that or we'll be duplicating a significant chunk of docs.ladybugdb.com.

However, getting other Graph databases and analytical packages to adopt icebug is an explicit goal. So I'd try to strike a balance between over-specifying (as opposed to just refer to reference implementation) and adding ladybug specific assumptions that would rub the non-ladybug people the wrong way.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

dataset addition PR: LadybugDB/dataset#1

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.

DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

I will run a benchmark tmrw and attach the results

But, what about support for normal parquet tables? We can just remove them from docs until we change the existing classes to normal tables

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

uvx icebug-format --help

It already creates a metadata table thusly:

        # Create global metadata
        con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
        """)

I will update the tool and add it in this repo tmrw

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

I will update the tool and add it in this repo tmrw

Ladybug-Memory/icebug-format#2 (comment)
Ladybug-Memory/icebug-format#2 (comment)

If the issue is icebug-format is in a for profit company namespace, we can move it some place else. But ladybugdb core repo is the wrong place for the spec and the script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add Icebug-disk spec impl

2 participants