Skip to content

feat: Support fixed-length nested Array type for NeuG#507

Merged
shirly121 merged 27 commits into
alibaba:mainfrom
zhanglei1949:zl/feat-array
Jun 30, 2026
Merged

feat: Support fixed-length nested Array type for NeuG#507
shirly121 merged 27 commits into
alibaba:mainfrom
zhanglei1949:zl/feat-array

Conversation

@zhanglei1949

@zhanglei1949 zhanglei1949 commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

  • Add native fixed-length array property type (INT32[3], DOUBLE[128], etc.) for vertices and edges
  • Implement ArrayColumn storage backed by a flat child column for efficient columnar access
  • Wire through the full query pipeline: type system, schema, compiler, execution values, serialization, and Python bindings

Closes #506

Changes

  • Type system: DataTypeId::kArray, DataType::Array(), ArrayType helpers, ArrayExtraTypeInfo
  • Storage: New ArrayColumn class — row i, element j stored at child[i*N + j]
  • Execution: Value::ARRAY, ArrayValue, comparison operators for arrays
  • Compiler: gopt type conversion, vector cast functions updated
  • Serialization: protobuf and YAML schema support for array types
  • Python: Array values returned as Python lists via existing bindings
  • Tests: Integration tests covering INT32, INT64, FLOAT, DOUBLE,STRING arrays; create, query, update operations

@zhanglei1949 zhanglei1949 marked this pull request as draft June 9, 2026 03:16
@zhanglei1949 zhanglei1949 force-pushed the zl/feat-array branch 3 times, most recently from 0866c3b to d21c33c Compare June 23, 2026 03:29
@zhanglei1949 zhanglei1949 marked this pull request as ready for review June 23, 2026 03:30
Comment thread src/common/types.cc
Comment thread src/compiler/common/types/types.cpp Outdated
auto rightBracketPos = trimmedStr.find_last_of(']');
auto childType =
convertFromString(trimmedStr.substr(0, leftBracketPos), context);
// Parse left-to-right so that INT32[2][3] → Array(Array(INT32, 3), 2)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shirly121 Any comments? We need to support nested array type.

Comment thread src/compiler/gopt/g_expr_converter.cpp Outdated
auto sourceExpr =
castExpr.getChild(0)->constPtrCast<binder::LiteralExpression>();
auto& targetType = castExpr.getDataType();
const auto& targetType = scalarExpr.getBindData()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里这么改是因为有测试过不了吗?

Comment thread src/execution/common/operators/insert/create_edge.cc Outdated
Comment thread src/execution/common/operators/insert/create_vertex.cc Outdated
input.arguments[0]->getDataType(), targetType)
->execFunc;
} catch (...) {}
if (targetType.id() != DataTypeId::kArray &&

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should revert back

@zhanglei1949 zhanglei1949 changed the title feat: Support fixed-length Array<T, N> property type feat: Support fixed-length nested Array type for NeuG Jun 23, 2026
Comment thread include/neug/execution/common/types/value.h Outdated
Comment thread src/execution/common/operators/insert/create_edge.cc Outdated
* stores elements flat: row i element j is at datas_[i * array_size_ + j].
* No items_ overhead — simpler and more cache-friendly.
*/
class ArrayColumn : public IContextColumn {

@zhanglei1949 zhanglei1949 Jun 23, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

union_col and optional_shuffle are not implemented for ArrayColumn in the execution layer because the underlying execution ListColumn also lacks these implementations. @luoxiaojian @liulx20

@zhanglei1949 zhanglei1949 requested a review from luoxiaojian June 23, 2026 07:58
Comment thread include/neug/execution/common/columns/array_columns.h Outdated
Comment thread include/neug/execution/common/types/value.h Outdated
Comment thread src/common/types.cc
@zhanglei1949

Copy link
Copy Markdown
Member Author

TODO: Refer to DuckDB to check whether it is possible to avoid copying Value from storage to execution.

@zhanglei1949 zhanglei1949 requested a review from Copilot June 24, 2026 02:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

This PR adds native fixed-length ARRAY<T, N> (including nested arrays) to NeuG end-to-end, spanning storage, execution, type conversion, serialization, and Python bindings.

Changes:

  • Introduces DataTypeId::kArray + helpers, and wires LIST↔ARRAY conversion through DDL/DML planning and execution.
  • Adds storage/execution column implementations for arrays (flat child-column layout) and updates checkpoint/module serialization to support referenced sub-modules.
  • Extends YAML/protobuf/Arrow serialization and adds Python + integration tests and documentation updates for arrays and known limitations.

Reviewed changes

Copilot reviewed 72 out of 74 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/python_bind/tests/test_tp_service.py Adds session-level CREATE/QUERY/UPDATE tests for array properties over Bolt/SQL.
tools/python_bind/tests/test_db_array.py Adds comprehensive Python integration tests for ARRAY values (vertex/edge/nested/DDL/CAST/UNWIND).
tools/python_bind/tests/test_data_io_docs.py Adds test coverage for documented CSV ingestion limitations with ARRAY.
tools/python_bind/neug/database.py Makes Database.__del__ safer during interpreter shutdown; adds optional logging to close().
tests/utils/test_table.cc Updates dumping to new manifest-based Dump API; adds ModuleBroker tests for referenced-module skipping.
tests/unittest/utils.h Adds helper dump_module_descriptor() and updates legacy dump helpers to manifest-based Dump.
tests/unittest/test_column.cc Adds ArrayColumn unit test and updates typed column dump/open tests to new Dump API.
tests/storage/test_mutable_csr.cc Updates CSR dump/open tests to new manifest-based Dump API.
tests/storage/test_immutable_csr.cc Updates immutable CSR dump/open tests to new manifest-based Dump API.
tests/storage/test_csr_batch_ops.cc Updates CSR batch tests to new manifest-based Dump API.
tests/execution/test_value_column.cc Adds ArrayColumnBuilder validation test.
tests/execution/test_value.cc Adds LIST↔ARRAY conversion behavior tests (length checks, casting, rejection cases).
src/utils/yaml_utils.cc Adds YAML encoding for ARRAY type.
src/utils/property/default_value.cc Adds default value construction for ARRAY types.
src/utils/property/column.cc Wires ArrayColumn / ArrayRefColumn into property column factory.
src/utils/property/array_column.cc Adds storage-layer ArrayColumn backed by a flat child column; manifest refs for child modules.
src/utils/pb_utils.cc Adds protobuf type/value support for ARRAY (and LIST fallback), with element-count validation.
src/utils/arrow_utils.cc Adds Arrow fixed_size_list mapping for ARRAY with bounds checks.
src/transaction/update_transaction.cc Strengthens type matching to full DataType equality (not just id).
src/transaction/insert_transaction.cc Strengthens type matching to full DataType equality (not just id).
src/storages/module/module_broker.cc Skips referenced modules on Open; uses manifest-aware Open; updates Dump API usage.
src/storages/graph/vertex_timestamp.cc Migrates VertexTimestamp to new manifest-based Dump API.
src/storages/graph/vertex_table.cc Migrates vertex property dumps to new manifest-based Dump API.
src/storages/graph/schema.cc Updates bundled-edge logic for ARRAY/LIST; removes DataType archive ops (moved to common/types.cc).
src/storages/graph/module_descriptor.cc Adds JSON serialization for module refs and referenced-module flag.
src/storages/graph/edge_table.cc Migrates edge property dumps to new manifest-based Dump API.
src/storages/csr/mutable_csr.cc Migrates mutable CSR dumps to new manifest-based Dump API.
src/storages/csr/immutable_csr.cc Migrates immutable CSR dumps to new manifest-based Dump API.
src/storages/container/CMakeLists.txt Adds build dependency on neug_proto for container objects.
src/main/query_result.cc Improves error message formatting for cursor bounds checks.
src/execution/expression/exprs/variable.cc Parses protobuf ARRAY into fixed-length ARRAY when max_length is set, else LIST.
src/execution/execute/ops/insert/merge_vertex.cc Allows implicit LIST↔ARRAY conversion for property assignment; rejects other mismatches.
src/execution/execute/ops/insert/merge_edge.cc Allows implicit LIST↔ARRAY conversion for edge property assignment; rejects other mismatches.
src/execution/execute/ops/batch/batch_update_vertex.cc Allows implicit LIST↔ARRAY conversion in batch updates; simplifies mismatch errors.
src/execution/execute/ops/batch/batch_update_edge.cc Allows implicit LIST↔ARRAY conversion in batch updates; removes overly restrictive scalar-only filter.
src/execution/common/types/value.cc Adds ARRAY value construction, JSON/archive encoding, comparisons, and LIST↔ARRAY conversion helper.
src/execution/common/operators/retrieve/unfold.cc Extends UNFOLD/UNWIND to support ARRAY in addition to LIST.
src/execution/common/operators/retrieve/sink.cc Serializes ARRAY columns as list_array with fixed offsets for wire compatibility.
src/execution/common/operators/insert/create_vertex.cc Adds implicit LIST↔ARRAY conversion when inserting properties.
src/execution/common/operators/insert/create_edge.cc Adds implicit LIST↔ARRAY conversion when inserting edge properties.
src/execution/common/columns/list_columns.cc Extends ListColumn::unfold to build ArrayColumn when list elements are ARRAY.
src/execution/common/columns/columns_utils.cc Adds builder creation support for ARRAY.
src/execution/common/columns/array_columns.cc Adds execution-layer ArrayColumn shuffle/unfold helpers.
src/compiler/gopt/g_type_converter.cpp Emits max_length for fixed-size ARRAY; supports LIST/ARRAY child type conversion.
src/compiler/gopt/g_expr_converter.cpp Adjusts CAST/default handling so ARRAY/LIST conversions use extension func paths; blocks nested array defaults.
src/compiler/gopt/g_ddl_converter.cpp Refactors default-value assignment to skip unsupported defaults (e.g., nested array-like).
src/compiler/function/vector_cast_functions.cpp Switches CAST target parsing to convertFromString; supports LIST↔ARRAY casting.
src/compiler/common/types/types.cpp Fixes parsing semantics for nested array type strings (left-to-right, C/C++ semantics).
src/common/types.cc Implements DataType archive (de)serialization here, including ARRAY type info.
include/neug/utils/property/types.h Adds YAML encode/decode for ARRAY DataType.
include/neug/utils/property/column.h Migrates Column Dump API to manifest-based; adjusts string/empty column dumping.
include/neug/utils/property/array_column.h Declares storage-layer ArrayColumn and ArrayRefColumn.
include/neug/storages/module_descriptor.h Adds named refs + referenced-module marker to ModuleDescriptor.
include/neug/storages/module/module.h Updates Module interface to manifest-aware Open and manifest-based Dump.
include/neug/storages/loader/loader_utils.h Adds include needed for column-related types after changes.
include/neug/storages/graph/vertex_timestamp.h Updates Dump signature to new manifest-based API.
include/neug/storages/graph/schema.h Removes DataType archive ops declarations (moved to common/types).
include/neug/storages/csr/mutable_csr.h Updates Dump signature to new manifest-based API (mutable + single + empty).
include/neug/storages/csr/immutable_csr.h Updates Dump signature to new manifest-based API (immutable + single).
include/neug/storages/csr/csr_view_utils.h Treats LIST/ARRAY as non-inline edge properties for search payload selection.
include/neug/storages/checkpoint.h Includes CheckpointManifest directly (removes forward decl).
include/neug/execution/common/types/value.h Adds ARRAY API and LIST↔ARRAY conversion helper declaration; adds array comparisons.
include/neug/execution/common/columns/list_columns.h Namespace closing formatting update.
include/neug/execution/common/columns/array_columns.h Declares execution-layer ArrayColumn/ArrayColumnBuilder.
include/neug/common/types.h Declares DataType archive operators.
doc/source/data_io/load_data.md Documents CSV LOAD array parsing limitation.
doc/source/data_io/import_data.md Documents CSV COPY FROM array materialization limitation.
doc/source/cypher_manual/expression/list_op.md Documents list-like ops and ARRAY support (UNWIND; array indexing unsupported).
doc/source/cypher_manual/expression/cast_func.md Documents LIST↔ARRAY casting behavior and rules.
doc/source/cypher_manual/dml_clause.md Documents CREATE/SET examples for array-valued properties.
doc/source/cypher_manual/ddl_clause.md Documents ARRAY DDL syntax, defaults behavior, and nested arrays.
doc/source/cypher_manual/data_types.md Adds ARRAY to data type docs with semantics and examples.
.gitignore Ignores .codex/.
.github/workflows/neug-test.yml Runs new Python array test suite in CI.
Comments suppressed due to low confidence (10)

src/execution/common/types/value.cc:1

  • ARRAY JSON deserialization currently (a) returns an empty non-null ARRAY when the JSON value is not an array, and (b) does not validate fixed length against ArrayType::GetNumElements(type). This can create arrays with incorrect cardinality and violate fixed-size invariants downstream. Consider either throwing on non-array/non-matching lengths or returning a typed NULL Value(type); and enforce arr.Size() == expected_size for fixed-size arrays.
    src/execution/common/types/value.cc:1
  • This throws std::runtime_error directly while most of the codebase appears to use the project exception macros/types (e.g., THROW_RUNTIME_ERROR, InvalidArgumentException). For consistency (and for tests that assert specific exception types), it would be better to throw via the project exception mechanism here as well.
    src/utils/property/array_column.cc:1
  • ArrayType::GetNumElements returns uint64_t but array_size_ is uint32_t, so this assignment can silently truncate for large arrays. Other layers added explicit bounds checks (e.g., Arrow/GOpt). Consider adding a similar bounds check here (and throwing NotSupported/InvalidArgument), or widening array_size_ to uint64_t to avoid narrowing.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.
    tools/python_bind/tests/test_tp_service.py:1
  • These tests use fixed /tmp/... directories and fixed ports (10015–10017). This can cause flaky failures under parallel test execution (port already in use) or when previous runs left processes/data behind. Consider using tmp_path for directories and selecting an ephemeral free port (or letting serve() choose) to make the suite robust in CI and local parallel runs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/neug/utils/property/column.h Outdated
Comment thread include/neug/utils/property/types.h Outdated
@zhanglei1949

Copy link
Copy Markdown
Member Author

TODO: Refer to DuckDB to check whether it is possible to avoid copying Value from storage to execution.

It turns out that DuckDB also copy Value from storage to execution.

Comment thread doc/source/cypher_manual/ddl_clause.md Outdated

## Array Properties

Use `T[N]` to declare a fixed-size array property, where `T` is the child type and `N` is a positive fixed length. `T[]` remains a variable-length list type.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior when N is 0?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N must be greater than 0, doc and test are added

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we will report error is N is 0.

if (col->elem_type().id() != DataTypeId::kList) {
LOG(ERROR) << "Unfold column type is not list";
RETURN_INVALID_ARGUMENT_ERROR("Unfold column type is not list");
if (!isListLikeType(col->elem_type().id())) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there might not be a need to flatten arrays.

@zhanglei1949 zhanglei1949 Jun 29, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean there is no need to implement unfold for array column? After discuess with @shirly121 , we believe it is neccessary to support unfold for ArrayColumn, the UNWIND is not limited to List type.

Comment thread src/compiler/gopt/g_expr_converter.cpp Outdated
auto sourceExpr =
castExpr.getChild(0)->constPtrCast<binder::LiteralExpression>();
auto& targetType = castExpr.getDataType();
const auto& targetType = scalarExpr.getBindData()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里这么改是因为有测试过不了吗?

Comment thread doc/source/cypher_manual/ddl_clause.md Outdated

## Array Properties

Use `T[N]` to declare a fixed-size array property, where `T` is the child type and `N` is a positive fixed length. `T[]` remains a variable-length list type.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N must be greater than 0, doc and test are added

@shirly121 shirly121 merged commit a5670f8 into alibaba:main Jun 30, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Support fixed-length Array<T, N> property type

4 participants