ADD: ByteChunker for streaming decode-to-bytes#129
Closed

wtn wants to merge 1 commit into
Conversation
Contributor
I'm hesitant to merge it as it seems pretty specific, though it has some similarities to #117. Is this something you could implement in your own crate/pyo3 Python extension library? There doesn't seem to be a trait interaction requiring it to be in this crate. You could also look at using
Contributor
Author
Thank you!!
Contributor
awesome, glad that worked for you
Adds `decode::ByteChunker` and `decode::RecordFilter`: composable wrappers around any `DecodeRecordRef`. `ByteChunker` aggregates raw record bytes into bounded chunks; `RecordFilter` (itself a `DecodeRecordRef`) drops records by `instrument_id`, `publisher_id`, or a half-open time window `[start_ts, end_ts)`. Compose as `ByteChunker::new(RecordFilter::builder(decoder).…build()?)` or use either alone.

I realize this is a large patch; I understand if you need to reject it on architectural grounds or defer it to later.
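For concreteness, a rough sketch of how the composition might be used. `RecordFilter` and `ByteChunker` are the wrappers proposed in this patch (not part of the published crate), and the specific setter and iteration methods here (`instrument_ids`, `start_ts`, `end_ts`, `next_chunk`) are illustrative assumptions rather than the exact API:

```rust
use dbn::decode::DecodeRecordRef;

// Sketch only: the wrapper types are the ones proposed in this PR, and the
// builder/iterator method names are assumptions made for illustration.
fn forward_filtered_bytes<D: DecodeRecordRef>(decoder: D) -> dbn::Result<()> {
    // Hypothetical builder calls: keep two instrument ids inside [start_ts, end_ts).
    let filter = RecordFilter::builder(decoder)
        .instrument_ids([12345, 67890])
        .start_ts(1_700_000_000_000_000_000)
        .end_ts(1_700_086_400_000_000_000)
        .build()?;

    // Wrap the filter so raw record bytes come back as bounded, contiguous chunks.
    let mut chunker = ByteChunker::new(filter);
    while let Some(chunk) = chunker.next_chunk()? {
        // `chunk` contains whole records; a binding can forward, recompress, or
        // batch-deserialize it without per-record FFI allocation.
        handle_bytes(chunk); // hypothetical downstream consumer
    }
    Ok(())
}
```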
Motivation
My use case requires much faster header-based filtering from interpreted languages, which is possible by skipping per-record FFI allocation entirely. Inspecting a header field currently requires the full decode path, so for low-selectivity filters (only a small proportion of records kept), most of that work is wasted. Pairing a pre-decode header filter with raw-byte output lets the binding hand back contiguous bytes the caller can forward, recompress, or batch-deserialize in one shot.
Local prototype Python binding (~75 lines of pyo3) over an earlier version, filtering an uncompressed XNAS MBP-1 file to 1 symbol (1.421% selectivity):
[Benchmark results table not recoverable here; the baseline path is `DBNDecoder` + a Python-side filter.]

Benchmarked on aarch64-darwin against a 6.7 GB MBP-1 file (decompressed from 2.2 GB zst), 83.4M input records; both paths emit the same 1.18M filtered records, across 3 trials. Additive to databento-python's `NDArrayStreamIterator` for header-filtered, heterogeneous, or live streams and raw-byte forwarding.

API
Both wrappers pass through `DbnMetadata`; `RecordFilter` also implements `DecodeRecord` and `DecodeStream`. Filters are ANDed; time filters key on `Record::raw_index_ts`. `end_ts` terminates iteration at the first record at or past the bound; that record is consumed and stashed as a `RecordBuf` so callers can prepend it on resume. Output bytes follow the wrapped decoder's `VersionUpgradePolicy`, so prepending its metadata yields a valid stream.
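A sketch of the stash-and-resume behaviour described above; the `take_stashed` accessor and the `process`/`resume_with` helpers are hypothetical names used only to illustrate the semantics:

```rust
use dbn::decode::DecodeRecordRef;

// Hypothetical sketch of the end_ts semantics described above: iteration stops at
// the first record with raw_index_ts >= end_ts; since that record has already been
// read from the underlying decoder, the filter stashes it as a RecordBuf.
fn drain_window<D: DecodeRecordRef>(mut filter: RecordFilter<D>) -> dbn::Result<()> {
    // RecordFilter itself implements DecodeRecordRef, so it decodes like any decoder.
    while let Some(rec) = filter.decode_record_ref()? {
        process(rec); // hypothetical per-record handler
    }
    // `take_stashed` is an assumed accessor returning the boundary record, so a
    // caller can prepend it when resuming with the next [start_ts, end_ts) window.
    if let Some(boundary) = filter.take_stashed() {
        resume_with(boundary); // hypothetical
    }
    Ok(())
}
```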
Design notes

- `Vec::contains`: faster than `HashSet` for the small id lists this targets (tens of ids); a minimal illustration follows these notes.
- `end_ts` requires monotonically non-decreasing primary timestamps. DBN files from the Databento API satisfy this; this is documented on `end_ts`.
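A minimal, standalone illustration of the membership-check choice (plain std Rust, not code from the patch):

```rust
// For a filter list of tens of ids, a linear scan over a slice/Vec is typically
// cheaper than building and probing a HashSet, and it avoids the hashing overhead.
fn keep(instrument_id: u32, allowed: &[u32]) -> bool {
    allowed.contains(&instrument_id) // same check Vec::contains performs
}

fn main() {
    let allowed = vec![12345_u32, 67890, 24680];
    assert!(keep(67890, &allowed));
    assert!(!keep(11111, &allowed));
}
```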
Benchmark script

Type of change
Checklist
- Build script run (`scripts/build.sh`)
- Lint and format scripts run (`scripts/lint.sh` and `scripts/format.sh`)
- Test script run (`scripts/test.sh`)

Declaration
I confirm this contribution is made under an Apache 2.0 license and that I have the authority
necessary to make this contribution on behalf of its copyright owner.