Skip to content

GH-45284: [Parquet][C++] Proposed RowRanges API#48635

Open
poshul wants to merge 16 commits intoapache:mainfrom
poshul:row_selection
Open

GH-45284: [Parquet][C++] Proposed RowRanges API#48635
poshul wants to merge 16 commits intoapache:mainfrom
poshul:row_selection

Conversation

@poshul
Copy link
Copy Markdown

@poshul poshul commented Dec 23, 2025

Rationale for this change

Based on @wgtmac 's #45234 This attempts to address the comments in that PR

What changes are included in this PR?

add an experimental row_selection API

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@poshul poshul requested a review from wgtmac as a code owner December 23, 2025 14:41
@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@poshul poshul changed the title Row selection GH-45284: [Parquet][C++] Proposed RowRanges API Jan 5, 2026
@poshul
Copy link
Copy Markdown
Author

poshul commented Jan 5, 2026

@emkornfield any thoughts on this as a proposed first step?

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jan 5, 2026

⚠️ GitHub issue #45284 has been automatically assigned in GitHub to PR creator.

Comment thread cpp/src/parquet/row_selection.h Outdated
Comment thread cpp/src/parquet/row_selection.h Outdated
/// \brief EXPERIMENTAL: Validate the row ranges.
/// \throws ParquetException if the row ranges are not in ascending order or
/// overlapped.
void Validate() const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like validation should happen on construction?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes sense. I've moved it so that it gets called before any of the functions the create RowSelections return.

@emkornfield
Copy link
Copy Markdown
Contributor

Took a look at the top-level public API, looks reasonable to me. If you need a full review please ping me.

Apologies for the delay.

@poshul
Copy link
Copy Markdown
Author

poshul commented Feb 2, 2026

Took a look at the top-level public API, looks reasonable to me. If you need a full review please ping me.

Apologies for the delay.

No worries for the delay, January is just a difficult month. I would appreciate a full review if you have the time for it.

static RowSelection Union(const RowSelection& lhs, const RowSelection& rhs);

/// \brief EXPERIMENTAL: Make a single row range of [start, end].
static RowSelection MakeSingle(int64_t start, int64_t end);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use-case for this. It is a little big awkward that this uses a differnt convention then IntervalRange. Also, I assume inclusive here makes something easier (typically half-open ranges are more natural in APIs)

ASSERT_FALSE(batch.empty());
auto interval = batch[0];
EXPECT_EQ(interval.start, 0);
EXPECT_EQ((interval.start + interval.length - 1), 10);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward, could we just make an equality checking function for Interval Range and use that instead of repeating the code here.

If end row is an important value we could maybe add it as a function to IntervalRange.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, it would probably be useful to setup test infrastructure to make these parameterized so it is there is less boiler plate, and the test cases are more apparent.

static RowSelection FromIntervals(::arrow::util::span<const IntervalRange> intervals);

/// \brief EXPERIMENTAL: Make a row range from a vector of intervals.
static RowSelection FromIntervals(const std::vector<IntervalRange>& intervals);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this when take span?

virtual ~Iterator() = default;
/// \brief Get the next batch of ranges.
/// Returns an empty span when exhausted.
virtual ::arrow::util::span<const IntervalRange> NextRange() = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

span here is only valid until the next call to NextRange? What is the use-case for batching ranges?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like batch size isn't exposed here in general, maybe this should return optional? or a pointer with and return null at the end?


// Start with whichever range has the smaller start
IntervalRange current;
if (lhs_batch[0].start <= rhs_batch[0].start) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we could simplify to some extent if batch size is always assumed to be 1?

Copy link
Copy Markdown
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long delay, re-reviewing the API I had some questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants