feat: Support fixed-length nested Array type for NeuG#507
Conversation
0866c3b to
d21c33c
Compare
| auto rightBracketPos = trimmedStr.find_last_of(']'); | ||
| auto childType = | ||
| convertFromString(trimmedStr.substr(0, leftBracketPos), context); | ||
| // Parse left-to-right so that INT32[2][3] → Array(Array(INT32, 3), 2) |
There was a problem hiding this comment.
@shirly121 Any comments? We need to support nested array type.
| auto sourceExpr = | ||
| castExpr.getChild(0)->constPtrCast<binder::LiteralExpression>(); | ||
| auto& targetType = castExpr.getDataType(); | ||
| const auto& targetType = scalarExpr.getBindData() |
| input.arguments[0]->getDataType(), targetType) | ||
| ->execFunc; | ||
| } catch (...) {} | ||
| if (targetType.id() != DataTypeId::kArray && |
There was a problem hiding this comment.
Maybe we should revert back
| * stores elements flat: row i element j is at datas_[i * array_size_ + j]. | ||
| * No items_ overhead — simpler and more cache-friendly. | ||
| */ | ||
| class ArrayColumn : public IContextColumn { |
There was a problem hiding this comment.
union_col and optional_shuffle are not implemented for ArrayColumn in the execution layer because the underlying execution ListColumn also lacks these implementations. @luoxiaojian @liulx20
3b8196c to
ca55a94
Compare
|
TODO: Refer to DuckDB to check whether it is possible to avoid copying |
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
This PR adds native fixed-length ARRAY<T, N> (including nested arrays) to NeuG end-to-end, spanning storage, execution, type conversion, serialization, and Python bindings.
Changes:
- Introduces
DataTypeId::kArray+ helpers, and wires LIST↔ARRAY conversion through DDL/DML planning and execution. - Adds storage/execution column implementations for arrays (flat child-column layout) and updates checkpoint/module serialization to support referenced sub-modules.
- Extends YAML/protobuf/Arrow serialization and adds Python + integration tests and documentation updates for arrays and known limitations.
Reviewed changes
Copilot reviewed 72 out of 74 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/python_bind/tests/test_tp_service.py | Adds session-level CREATE/QUERY/UPDATE tests for array properties over Bolt/SQL. |
| tools/python_bind/tests/test_db_array.py | Adds comprehensive Python integration tests for ARRAY values (vertex/edge/nested/DDL/CAST/UNWIND). |
| tools/python_bind/tests/test_data_io_docs.py | Adds test coverage for documented CSV ingestion limitations with ARRAY. |
| tools/python_bind/neug/database.py | Makes Database.__del__ safer during interpreter shutdown; adds optional logging to close(). |
| tests/utils/test_table.cc | Updates dumping to new manifest-based Dump API; adds ModuleBroker tests for referenced-module skipping. |
| tests/unittest/utils.h | Adds helper dump_module_descriptor() and updates legacy dump helpers to manifest-based Dump. |
| tests/unittest/test_column.cc | Adds ArrayColumn unit test and updates typed column dump/open tests to new Dump API. |
| tests/storage/test_mutable_csr.cc | Updates CSR dump/open tests to new manifest-based Dump API. |
| tests/storage/test_immutable_csr.cc | Updates immutable CSR dump/open tests to new manifest-based Dump API. |
| tests/storage/test_csr_batch_ops.cc | Updates CSR batch tests to new manifest-based Dump API. |
| tests/execution/test_value_column.cc | Adds ArrayColumnBuilder validation test. |
| tests/execution/test_value.cc | Adds LIST↔ARRAY conversion behavior tests (length checks, casting, rejection cases). |
| src/utils/yaml_utils.cc | Adds YAML encoding for ARRAY type. |
| src/utils/property/default_value.cc | Adds default value construction for ARRAY types. |
| src/utils/property/column.cc | Wires ArrayColumn / ArrayRefColumn into property column factory. |
| src/utils/property/array_column.cc | Adds storage-layer ArrayColumn backed by a flat child column; manifest refs for child modules. |
| src/utils/pb_utils.cc | Adds protobuf type/value support for ARRAY (and LIST fallback), with element-count validation. |
| src/utils/arrow_utils.cc | Adds Arrow fixed_size_list mapping for ARRAY with bounds checks. |
| src/transaction/update_transaction.cc | Strengthens type matching to full DataType equality (not just id). |
| src/transaction/insert_transaction.cc | Strengthens type matching to full DataType equality (not just id). |
| src/storages/module/module_broker.cc | Skips referenced modules on Open; uses manifest-aware Open; updates Dump API usage. |
| src/storages/graph/vertex_timestamp.cc | Migrates VertexTimestamp to new manifest-based Dump API. |
| src/storages/graph/vertex_table.cc | Migrates vertex property dumps to new manifest-based Dump API. |
| src/storages/graph/schema.cc | Updates bundled-edge logic for ARRAY/LIST; removes DataType archive ops (moved to common/types.cc). |
| src/storages/graph/module_descriptor.cc | Adds JSON serialization for module refs and referenced-module flag. |
| src/storages/graph/edge_table.cc | Migrates edge property dumps to new manifest-based Dump API. |
| src/storages/csr/mutable_csr.cc | Migrates mutable CSR dumps to new manifest-based Dump API. |
| src/storages/csr/immutable_csr.cc | Migrates immutable CSR dumps to new manifest-based Dump API. |
| src/storages/container/CMakeLists.txt | Adds build dependency on neug_proto for container objects. |
| src/main/query_result.cc | Improves error message formatting for cursor bounds checks. |
| src/execution/expression/exprs/variable.cc | Parses protobuf ARRAY into fixed-length ARRAY when max_length is set, else LIST. |
| src/execution/execute/ops/insert/merge_vertex.cc | Allows implicit LIST↔ARRAY conversion for property assignment; rejects other mismatches. |
| src/execution/execute/ops/insert/merge_edge.cc | Allows implicit LIST↔ARRAY conversion for edge property assignment; rejects other mismatches. |
| src/execution/execute/ops/batch/batch_update_vertex.cc | Allows implicit LIST↔ARRAY conversion in batch updates; simplifies mismatch errors. |
| src/execution/execute/ops/batch/batch_update_edge.cc | Allows implicit LIST↔ARRAY conversion in batch updates; removes overly restrictive scalar-only filter. |
| src/execution/common/types/value.cc | Adds ARRAY value construction, JSON/archive encoding, comparisons, and LIST↔ARRAY conversion helper. |
| src/execution/common/operators/retrieve/unfold.cc | Extends UNFOLD/UNWIND to support ARRAY in addition to LIST. |
| src/execution/common/operators/retrieve/sink.cc | Serializes ARRAY columns as list_array with fixed offsets for wire compatibility. |
| src/execution/common/operators/insert/create_vertex.cc | Adds implicit LIST↔ARRAY conversion when inserting properties. |
| src/execution/common/operators/insert/create_edge.cc | Adds implicit LIST↔ARRAY conversion when inserting edge properties. |
| src/execution/common/columns/list_columns.cc | Extends ListColumn::unfold to build ArrayColumn when list elements are ARRAY. |
| src/execution/common/columns/columns_utils.cc | Adds builder creation support for ARRAY. |
| src/execution/common/columns/array_columns.cc | Adds execution-layer ArrayColumn shuffle/unfold helpers. |
| src/compiler/gopt/g_type_converter.cpp | Emits max_length for fixed-size ARRAY; supports LIST/ARRAY child type conversion. |
| src/compiler/gopt/g_expr_converter.cpp | Adjusts CAST/default handling so ARRAY/LIST conversions use extension func paths; blocks nested array defaults. |
| src/compiler/gopt/g_ddl_converter.cpp | Refactors default-value assignment to skip unsupported defaults (e.g., nested array-like). |
| src/compiler/function/vector_cast_functions.cpp | Switches CAST target parsing to convertFromString; supports LIST↔ARRAY casting. |
| src/compiler/common/types/types.cpp | Fixes parsing semantics for nested array type strings (left-to-right, C/C++ semantics). |
| src/common/types.cc | Implements DataType archive (de)serialization here, including ARRAY type info. |
| include/neug/utils/property/types.h | Adds YAML encode/decode for ARRAY DataType. |
| include/neug/utils/property/column.h | Migrates Column Dump API to manifest-based; adjusts string/empty column dumping. |
| include/neug/utils/property/array_column.h | Declares storage-layer ArrayColumn and ArrayRefColumn. |
| include/neug/storages/module_descriptor.h | Adds named refs + referenced-module marker to ModuleDescriptor. |
| include/neug/storages/module/module.h | Updates Module interface to manifest-aware Open and manifest-based Dump. |
| include/neug/storages/loader/loader_utils.h | Adds include needed for column-related types after changes. |
| include/neug/storages/graph/vertex_timestamp.h | Updates Dump signature to new manifest-based API. |
| include/neug/storages/graph/schema.h | Removes DataType archive ops declarations (moved to common/types). |
| include/neug/storages/csr/mutable_csr.h | Updates Dump signature to new manifest-based API (mutable + single + empty). |
| include/neug/storages/csr/immutable_csr.h | Updates Dump signature to new manifest-based API (immutable + single). |
| include/neug/storages/csr/csr_view_utils.h | Treats LIST/ARRAY as non-inline edge properties for search payload selection. |
| include/neug/storages/checkpoint.h | Includes CheckpointManifest directly (removes forward decl). |
| include/neug/execution/common/types/value.h | Adds ARRAY API and LIST↔ARRAY conversion helper declaration; adds array comparisons. |
| include/neug/execution/common/columns/list_columns.h | Namespace closing formatting update. |
| include/neug/execution/common/columns/array_columns.h | Declares execution-layer ArrayColumn/ArrayColumnBuilder. |
| include/neug/common/types.h | Declares DataType archive operators. |
| doc/source/data_io/load_data.md | Documents CSV LOAD array parsing limitation. |
| doc/source/data_io/import_data.md | Documents CSV COPY FROM array materialization limitation. |
| doc/source/cypher_manual/expression/list_op.md | Documents list-like ops and ARRAY support (UNWIND; array indexing unsupported). |
| doc/source/cypher_manual/expression/cast_func.md | Documents LIST↔ARRAY casting behavior and rules. |
| doc/source/cypher_manual/dml_clause.md | Documents CREATE/SET examples for array-valued properties. |
| doc/source/cypher_manual/ddl_clause.md | Documents ARRAY DDL syntax, defaults behavior, and nested arrays. |
| doc/source/cypher_manual/data_types.md | Adds ARRAY to data type docs with semantics and examples. |
| .gitignore | Ignores .codex/. |
| .github/workflows/neug-test.yml | Runs new Python array test suite in CI. |
Comments suppressed due to low confidence (10)
src/execution/common/types/value.cc:1
- ARRAY JSON deserialization currently (a) returns an empty non-null ARRAY when the JSON value is not an array, and (b) does not validate fixed length against
ArrayType::GetNumElements(type). This can create arrays with incorrect cardinality and violate fixed-size invariants downstream. Consider either throwing on non-array/non-matching lengths or returning a typed NULLValue(type); and enforcearr.Size() == expected_sizefor fixed-size arrays.
src/execution/common/types/value.cc:1 - This throws
std::runtime_errordirectly while most of the codebase appears to use the project exception macros/types (e.g.,THROW_RUNTIME_ERROR,InvalidArgumentException). For consistency (and for tests that assert specific exception types), it would be better to throw via the project exception mechanism here as well.
src/utils/property/array_column.cc:1 ArrayType::GetNumElementsreturnsuint64_tbutarray_size_isuint32_t, so this assignment can silently truncate for large arrays. Other layers added explicit bounds checks (e.g., Arrow/GOpt). Consider adding a similar bounds check here (and throwing NotSupported/InvalidArgument), or wideningarray_size_touint64_tto avoid narrowing.
tools/python_bind/tests/test_tp_service.py:1- These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
tools/python_bind/tests/test_tp_service.py:1 - These tests use fixed
/tmp/...directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider usingtmp_pathfor directories and selecting an ephemeral free port (or lettingserve()choose) to make the suite robust in CI and local parallel runs.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
It turns out that DuckDB also copy |
|
|
||
| ## Array Properties | ||
|
|
||
| Use `T[N]` to declare a fixed-size array property, where `T` is the child type and `N` is a positive fixed length. `T[]` remains a variable-length list type. |
There was a problem hiding this comment.
What is the behavior when N is 0?
There was a problem hiding this comment.
N must be greater than 0, doc and test are added
There was a problem hiding this comment.
Currently we will report error is N is 0.
| if (col->elem_type().id() != DataTypeId::kList) { | ||
| LOG(ERROR) << "Unfold column type is not list"; | ||
| RETURN_INVALID_ARGUMENT_ERROR("Unfold column type is not list"); | ||
| if (!isListLikeType(col->elem_type().id())) { |
There was a problem hiding this comment.
It seems like there might not be a need to flatten arrays.
There was a problem hiding this comment.
You mean there is no need to implement unfold for array column? After discuess with @shirly121 , we believe it is neccessary to support unfold for ArrayColumn, the UNWIND is not limited to List type.
Reverts parseArrayType to find_last_of('[') and ToString to simple
recursive form so that INT[2][3] maps to Array(Array(INT,2),3) —
3 elements each being a 2-element array. Updates nested array tests,
docs, and adds test_array_zero_size_rejected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
93cb9f7 to
fd62456
Compare
| auto sourceExpr = | ||
| castExpr.getChild(0)->constPtrCast<binder::LiteralExpression>(); | ||
| auto& targetType = castExpr.getDataType(); | ||
| const auto& targetType = scalarExpr.getBindData() |
|
|
||
| ## Array Properties | ||
|
|
||
| Use `T[N]` to declare a fixed-size array property, where `T` is the child type and `N` is a positive fixed length. `T[]` remains a variable-length list type. |
There was a problem hiding this comment.
N must be greater than 0, doc and test are added
Summary
INT32[3],DOUBLE[128], etc.) for vertices and edgesArrayColumnstorage backed by a flat child column for efficient columnar accessCloses #506
Changes
DataTypeId::kArray,DataType::Array(),ArrayTypehelpers,ArrayExtraTypeInfoArrayColumnclass — row i, element j stored atchild[i*N + j]Value::ARRAY,ArrayValue, comparison operators for arrays