GH-45284: [Parquet][C++] Proposed RowRanges API#48635
GH-45284: [Parquet][C++] Proposed RowRanges API#48635poshul wants to merge 16 commits intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
@emkornfield any thoughts on this as a proposed first step? |
|
|
| /// \brief EXPERIMENTAL: Validate the row ranges. | ||
| /// \throws ParquetException if the row ranges are not in ascending order or | ||
| /// overlapped. | ||
| void Validate() const; |
There was a problem hiding this comment.
It seems like validation should happen on construction?
There was a problem hiding this comment.
That probably makes sense. I've moved it so that it gets called before any of the functions the create RowSelections return.
|
Took a look at the top-level public API, looks reasonable to me. If you need a full review please ping me. Apologies for the delay. |
Co-authored-by: emkornfield <emkornfield@gmail.com>
…tions that create a RowSelection.
No worries for the delay, January is just a difficult month. I would appreciate a full review if you have the time for it. |
| static RowSelection Union(const RowSelection& lhs, const RowSelection& rhs); | ||
|
|
||
| /// \brief EXPERIMENTAL: Make a single row range of [start, end]. | ||
| static RowSelection MakeSingle(int64_t start, int64_t end); |
There was a problem hiding this comment.
What is the use-case for this. It is a little big awkward that this uses a differnt convention then IntervalRange. Also, I assume inclusive here makes something easier (typically half-open ranges are more natural in APIs)
| ASSERT_FALSE(batch.empty()); | ||
| auto interval = batch[0]; | ||
| EXPECT_EQ(interval.start, 0); | ||
| EXPECT_EQ((interval.start + interval.length - 1), 10); |
There was a problem hiding this comment.
This is a bit awkward, could we just make an equality checking function for Interval Range and use that instead of repeating the code here.
If end row is an important value we could maybe add it as a function to IntervalRange.
There was a problem hiding this comment.
in general, it would probably be useful to setup test infrastructure to make these parameterized so it is there is less boiler plate, and the test cases are more apparent.
| static RowSelection FromIntervals(::arrow::util::span<const IntervalRange> intervals); | ||
|
|
||
| /// \brief EXPERIMENTAL: Make a row range from a vector of intervals. | ||
| static RowSelection FromIntervals(const std::vector<IntervalRange>& intervals); |
There was a problem hiding this comment.
do we need this when take span?
| virtual ~Iterator() = default; | ||
| /// \brief Get the next batch of ranges. | ||
| /// Returns an empty span when exhausted. | ||
| virtual ::arrow::util::span<const IntervalRange> NextRange() = 0; |
There was a problem hiding this comment.
span here is only valid until the next call to NextRange? What is the use-case for batching ranges?
There was a problem hiding this comment.
seems like batch size isn't exposed here in general, maybe this should return optional? or a pointer with and return null at the end?
|
|
||
| // Start with whichever range has the smaller start | ||
| IntervalRange current; | ||
| if (lhs_batch[0].start <= rhs_batch[0].start) { |
There was a problem hiding this comment.
seems we could simplify to some extent if batch size is always assumed to be 1?
emkornfield
left a comment
There was a problem hiding this comment.
Apologies for the long delay, re-reviewing the API I had some questions.
Rationale for this change
Based on @wgtmac 's #45234 This attempts to address the comments in that PR
What changes are included in this PR?
add an experimental row_selection API
Are these changes tested?
Yes
Are there any user-facing changes?
Yes