Releases: Blosc/python-blosc2
Release 4.5.1
Changes from 4.5.0 to 4.5.1
This follow-up release builds the b2view terminal viewer into a richer
data-exploration tool — a scatter plot, a searchable column picker, a
one-shot demo download, refreshed chrome, and several interaction fixes — and
upgrades the bundled C-Blosc2 to 3.1.4. WASM/Pyodide is now a fully
supported platform, and CTable.info reports per-column compressed sizes.
b2view: richer exploration
- Scatter plots: from a column plot, press `s` to scatter the current column
  (X) against another column (Y) chosen from a list, over the current (zoomed)
  row range; `h` then opens a high-resolution `matplotlib` scatter.
- High-res for 1-D series is now an envelope plot (matching the in-terminal
  view), and a new `r` key toggles between the min/max envelope and the raw
  values (strided-sampled when the range is wide).
- Searchable column picker: the `c` go-to-column key now opens a searchable,
  selectable list (type to filter, ↑/↓, Enter) for CTables, instead of a text
  field; N-D arrays still go by numeric index.
- Show/hide columns: `/` opens a searchable multi-select to pick which CTable
  columns are displayed.
- Demo download: `b2view --download` fetches a demo bundle
  (`chicago-taxi-flat.b2z` by default) into the current directory if it is not
  already there, then opens it.
- Refreshed chrome: a branded header, a left-docked filename label in the
  title, and clearer status chips.
b2view: interaction fixes
- Go-to-row/column pre-fill is now pre-selected, so the first keystroke
  replaces the current index instead of appending to it (typing a column name
  no longer produces e.g. `0payment.fare`).
- Escape keeps its layered exit while a panel is maximized: with the data
  panel maximized, escape now unlocks a plot's locked row window (and clears
  filters) as documented, instead of being hijacked into restoring the panel;
  use `r` to restore (`ESCAPE_TO_MINIMIZE = False`).
- Test-suite robustness fixes (a timing flake and a Windows rendering glitch).
Other
- C-Blosc2 upgraded to 3.1.4.
- WASM/Pyodide is now a fully supported platform, with more frequent CI runs.
- `CTable.info` shows per-column compressed sizes (`cbytes` and `cratio`),
  and `print_versions()` uses clearer `Python-Blosc2`/`C-Blosc2` labels.
Release 4.5.0
Changes from 4.4.5 to 4.5.0
This release teaches the b2view terminal viewer to plot — peak-preserving
envelope line plots of any series, with zoom, a row-window lock, and an optional
high-resolution matplotlib view — and gives CTable a pandas-like display and
CSV experience. It also publishes WASM/Pyodide wheels to PyPI and adds
faster strided reads for NDArray and Column.
b2view: plotting and data inspection
- In-terminal plots: press `p` on a numeric series (a CTable column or an
  array row) to draw a braille line plot. Plots are peak-preserving min/max
  envelopes by default, so no spike or trough is hidden however large the
  series is; large local series stream their envelope exactly in bounded
  spans (only remote c2arrays fall back to a labeled strided sample).
- Zoom and row-window lock: zoom the plot into a row range and pan it; press
  `v` to lock the data grid to the plotted range so paging stays inside it
  (escape unlocks). The plot and high-res views honor the locked window.
- High-resolution view: `h` opens a high-res `matplotlib` image of the
  plotted range (new optional `hires` extra: `matplotlib` + `textual-image`).
- On-demand cell decode: `enter` decodes a single skipped/expensive CTable
  cell, and SChunk nodes now preview as a paged hex dump.
- Fixes and polish: row paging re-aligns to the page grid after dim-mode
  single-row scrolls; the data panel now focuses correctly with
  `--path ... --panel data`; status chips are branded yellow.
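To illustrate why a min/max envelope keeps peaks that a strided sample can miss, here is a minimal pure-Python sketch. It is not b2view's implementation; the function names `minmax_envelope` and `strided_sample` and the fixed-span binning are assumptions made for the example.

```python
def minmax_envelope(values, n_bins):
    """Reduce `values` to at most `n_bins` (min, max) pairs, one per span."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    n = len(values)
    if n == 0:
        return []
    # Ceiling division, so at most n_bins spans cover every element.
    span = -(-n // n_bins)
    return [
        (min(values[i:i + span]), max(values[i:i + span]))
        for i in range(0, n, span)
    ]


def strided_sample(values, n_points):
    """Keep every k-th value; cheap, but it can step over narrow spikes."""
    stride = max(1, len(values) // n_points)
    return values[::stride]


series = [0.0] * 1000
series[537] = 9.0  # a single-sample spike

envelope = minmax_envelope(series, 10)  # the span holding row 537 reports max 9.0
sample = strided_sample(series, 10)     # rows 0, 100, ..., 900: the spike is skipped
```

Each span is reduced independently, so a long series can be processed one bounded span at a time without changing the result.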
CTable display
- `CTable.to_string()` now renders the whole table by default (every row and
  every column), like pandas' `DataFrame.to_string()`. New `max_rows` and
  `max_width` parameters truncate on demand. Behaviour change: previously
  `to_string()` returned the truncated view; code that relied on that should
  pass `max_rows=`/`max_width=` (or use `str()`).
- The `[N rows x M columns]` dimensions footer now follows pandas: omitted by
  `to_string()` (pass `show_dimensions=True` to force it), and shown by
  `str`/`repr`/`print` only when the view is actually truncated. Previously it
  was always appended.
- `repr(ctable)` now shows the same truncated table as `str(ctable)`
  (pandas/polars convention), instead of the one-line `CTable<…>` summary. The
  compact summary remains available via `ctable.info`.
- New display options in `set_printoptions`: `display_width` controls the
  column-fitting width budget (`None` = auto-detect terminal, `-1` = show all
  columns, positive int = fixed budget), and `display_rows` now accepts `-1` to
  show all rows (`0` still shows none).
- New `blosc2.printoptions(...)` context manager temporarily sets the display
  options and restores them on exit, e.g.
  `with blosc2.printoptions(display_rows=-1, display_width=-1): print(t)`.
CTable I/O
- `CTable.to_csv()` now accepts no path, returning the CSV as a string like
  pandas' `DataFrame.to_csv()`. Passing a path still writes the file (and
  returns `None`); the returned string is byte-for-byte the same as the file.
Performance
- Faster strided reads: `NDArray.__getitem__` gains a sparse-gather fast
  path for large strides, and `Column.__getitem__` short-circuits when the
  logical positions equal the physical ones.
- Fix: a negative `step` in `Column` getitem could return `[]`; it now
  returns the reversed selection.
Indexing
- Fix: a sidecar-handle cache collision could return the wrong SUMMARY
  index for a compact-store column.
- Cross-column index pruning is now enabled for compact CTable queries, so
  more predicates prune blocks before any data is materialized. The docs also
  note when summary indexes are not created automatically.
Packaging
- WASM/Pyodide wheels on PyPI: the main wheel build now also produces
  `pyemscripten` wheels for CPython 3.13 (2025 ABI) and 3.14 (2026 ABI) and
  uploads them to PyPI, so `blosc2` is `micropip`-installable in Pyodide, and
  `b2view` prints a clear message instead of crashing when run under WASM.
  Known limitation: slicing an in-memory `SChunk` loaded from a frame fails on
  the Pyodide 0.29.x Emscripten toolchain (cp313); it works on Pyodide 314
  (cp314) and natively. See issue #664.
- cibuildwheel updated to 4.1.
Release 4.4.5
Changes from 4.4.3 to 4.4.5
Note: 4.4.4 was skipped due to a failure during the release process.
This release promotes the b2view terminal viewer to a core feature —
installed by default, with new interactive row and column filtering — and
makes BatchArray block layouts (and hence compression ratios) reproducible
across CPUs.
b2view, the terminal data viewer
- Installed by default: `textual` and `rich` are now regular
  dependencies, so the `b2view` CLI works out of the box (the `[tui]` extra
  is gone). A getting-started walkthrough was added to the docs, and the
  README now lists the CLI tools.
- Row filtering: pressing `f` on a CTable node opens a modal that takes
  the same string expressions as `CTable.where()` (dotted nested names,
  `and`/`or`) and pages through the matching view. Filters are remembered
  per node for the session, the data header shows the active filter plus
  the unfiltered total, and escape (or an empty expression) clears it.
- Column filtering: `/` narrows the visible columns by case-insensitive
  substring; column paging and the `c` goto-column modal then operate on
  that subset. Combines freely with the row filter; escape clears one
  layer per press (rows first, then columns).
- Mouse handling: the terminal owns the mouse by default, so native
  text selection/copy works like in any CLI program; `--mouse` lets b2view
  capture it instead (click-to-focus, wheel scrolling by half a page,
  paging at the edges).
- Navigation: `?` opens a help screen listing all keys; `c` jumps to a
  column by index, exact name or unique name prefix; `s`/`e` jump to the
  first/last column window; row paging and jumps keep the cursor on its
  column; dim-mode index/viewport movements clamp at the boundaries instead
  of wrapping around.
- Rendering: column windows are fitted from measured rendered widths
  (and re-fitted on terminal resize and panel maximize/restore), and float
  columns use a uniform number of decimals so decimal points align down
  the column.
- Test suite: first automated tests for the TUI, namely Pilot-driven keyboard
  journeys against a deterministic generated store (marker `tui`), plus
  render unit tests. Skipped on wasm, where Textual apps cannot start
  (no termios).
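The "index, exact name or unique name prefix" lookup used by the `c` key could be resolved along the following lines. This is a sketch under assumptions: the helper name `resolve_column`, the precedence order (index first, then exact name, then prefix) and the error type are illustrative choices, not b2view's code.

```python
def resolve_column(query, names):
    """Return the position of the column that `query` designates.

    Tries, in order: a numeric index, an exact name, a unique name prefix.
    """
    query = query.strip()
    if query.isdigit():
        idx = int(query)
        if idx < len(names):
            return idx
        raise LookupError(f"column index {idx} out of range")
    if query in names:
        return names.index(query)
    matches = [i for i, name in enumerate(names) if name.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no column matches {query!r}")
    raise LookupError(f"{query!r} is ambiguous: {[names[i] for i in matches]}")


columns = ["payment.fare", "payment.tip", "trip.sec", "trip"]
by_index = resolve_column("2", columns)           # numeric index
by_name = resolve_column("trip", columns)         # exact name beats prefix
by_prefix = resolve_column("payment.f", columns)  # unique prefix
```

Checking the exact name before prefixes matters when one column name is a prefix of another, as with `trip` and `trip.sec` here.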
BatchArray
- Reproducible block layouts: automatic variable-length block sizing
now uses fixed byte budgets (1 MiB for clevel 1-3, 8 MiB for 4-6, 16 MiB
for 7-8) instead of the CPU cache sizes, so the layout — and hence the
compression ratio — no longer depends on the machine that created the
array.
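The budget table above can be written as a small lookup. The sketch below only encodes the three ranges quoted in these notes; the function name `block_byte_budget` is hypothetical, and compression levels outside 1-8 raise because the notes do not say how they are handled.

```python
MiB = 1024 * 1024

# Budgets quoted in the notes; levels outside 1-8 are not covered there.
_BUDGETS = (
    (range(1, 4), 1 * MiB),
    (range(4, 7), 8 * MiB),
    (range(7, 9), 16 * MiB),
)


def block_byte_budget(clevel):
    """Return the fixed byte budget for a compression level."""
    for levels, budget in _BUDGETS:
        if clevel in levels:
            return budget
    raise ValueError(f"no budget documented for clevel={clevel}")
```

Because the result depends only on `clevel`, two machines with different cache sizes arrive at the same budget, which is the reproducibility property the release describes.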
Build and docs
- Installing test dependencies: the docs now use
  `pip install . --group test` (a PEP 735 dependency group); the stale
  `[test]` extra syntax was removed.
- cibuildwheel updated to 4.0.
Release 4.4.3
Changes from 4.4.2 to 4.4.3
This is a maintenance release focused on faster CTable cold-start, printing
and groupby performance, a lighter `import blosc2`, new raw-storage access
for columns, and support for the new J2K/HTJ2K codec plugins.
CTable performance
- Lazy column opening in views: `select()` (and other view-producing
  operations) no longer open every projected column up front. A column is
  only opened from storage when the view actually reads it, so selecting and
  then touching a subset of columns (or aggregating a single one) skips
  the cold-start cost of the rest.
- Lazy index opening in queries: query planning no longer opens every
  SUMMARY-indexed column on a wide persistent table; only indexes for
  columns actually referenced by the predicate are loaded.
- Faster table printing: `repr()`/`to_string()` now memoise per-column
  sparse gathers for the duration of a render and combine the head and tail
  rows into a single sparse read per column. Each column is read from
  storage once instead of ~6 times (precision detection, width sizing and
  row rendering all hit the cache).
- Groupby with integral float keys: float key columns whose values are
  integral and fit a compact non-negative range (e.g. float32 id/second
  columns) now take the dense single-key fast path instead of the markedly
  slower generic float-hash path. Fractional or non-finite keys fall back
  automatically.
- No tempdir in read mode: opening a `.b2z`/`.b2d` store in `'r'` mode
  no longer creates a temporary working directory, since nothing is ever
  written.
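The integral-float-key idea can be shown in plain Python: if every key is a finite, non-negative whole number within a small range, keys become array offsets and no hashing of floats is needed. This is a conceptual sketch, not the library's Cython path; `dense_key_codes`, `group_sum` and the `max_range` cut-off are assumptions.

```python
import math


def dense_key_codes(keys, max_range=1 << 16):
    """Return (offset, codes) if `keys` qualify for dense counting, else None."""
    if not keys:
        return None
    for k in keys:
        # Non-finite, fractional or negative keys disqualify the dense path.
        if not math.isfinite(k) or k != int(k) or k < 0:
            return None
    lo, hi = int(min(keys)), int(max(keys))
    if hi - lo >= max_range:
        return None
    return lo, [int(k) - lo for k in keys]


def group_sum(keys, values):
    """Sum `values` per key, using dense slots when the keys allow it."""
    dense = dense_key_codes(keys)
    if dense is None:
        # Generic fallback: hash every float key.
        out = {}
        for k, v in zip(keys, values):
            out[k] = out.get(k, 0) + v
        return out
    lo, codes = dense
    sums = [0] * (max(codes) + 1)
    seen = [False] * len(sums)
    for c, v in zip(codes, values):
        sums[c] += v
        seen[c] = True
    return {float(lo + c): s for c, s in enumerate(sums) if seen[c]}
```

For example, `group_sum([3.0, 5.0, 3.0], [1, 2, 3])` goes through the dense slots, while `group_sum([1.5, 1.5], [1, 1])` falls back to the hashed path.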
Lighter imports and prefetcher rework
- asyncio dependency dropped: the on-disk chunk prefetcher used by the
  UDF and numexpr fallback engines now uses plain `concurrent.futures`
  instead of an asyncio event loop. `import blosc2` no longer pulls in
  ~30 asyncio modules, saving ~3 MB of memory footprint at import time.
- Prefetcher deadlock fixed: an exception during evaluation could leave
  the generator finalizer blocked forever in `thread.join()` while the
  reader thread was stuck on a full prefetch queue. A stop event now makes
  the producer bail out when its consumer goes away.
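The deadlock pattern and its stop-event remedy can be reproduced with the standard library. The sketch below is a generic bounded-queue producer/consumer, not the blosc2 prefetcher; `producer`, `consume_first` and the timeout values are assumptions chosen for the example.

```python
import queue
import threading


def producer(items, out, stop):
    """Feed `items` into `out`, giving up as soon as `stop` is set."""
    for item in items:
        # Never block forever on a full queue: retry with a timeout and
        # check whether the consumer has asked us to stop.
        while not stop.is_set():
            try:
                out.put(item, timeout=0.05)
                break
            except queue.Full:
                continue
        if stop.is_set():
            return


def consume_first(items, n_wanted, maxsize=2):
    """Read `n_wanted` items, then abandon the producer cleanly."""
    out = queue.Queue(maxsize=maxsize)
    stop = threading.Event()
    worker = threading.Thread(target=producer, args=(items, out, stop))
    worker.start()
    got = []
    try:
        for _ in range(n_wanted):
            got.append(out.get(timeout=1.0))
    finally:
        # The consumer is going away: signal first, then join.
        stop.set()
        worker.join(timeout=2.0)
    return got, worker.is_alive()
```

With a plain blocking `out.put(item)` and no stop event, the `join()` in the `finally` block would wait on a producer that is itself waiting for queue space nobody will free.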
New features
- `Column.raw` accessor: returns the underlying storage container of a
  column (`NDArray`, `ListArray`, `DictionaryColumn`, …) directly. Unlike
  `Column.__getitem__`, which always materializes NumPy arrays, this is the
  column as a blosc2-native compressed object, usable as a lazy-expression
  operand without decompressing, and exposing storage details like `schunk`,
  `chunks` or `cparams`. Note that this is a physical view: fixed-width
  containers are over-allocated to chunk capacity, so slice to `len(table)`
  to get just the live rows, and no validity-mask or null-sentinel
  processing is applied. Raises `AttributeError` for computed columns,
  which have no backing storage.
- J2K and HTJ2K codec IDs: `blosc2.Codec.J2K` and `blosc2.Codec.HTJ2K`
  expose the IDs for the new JPEG 2000 codec plugins (installable with
  `pip install blosc2-j2k` and `pip install blosc2-htj2k`).
Fixes
- `--float-trunc-prec` and nested columns: the precision-truncation
  filter of the `parquet_to_blosc2` CLI now propagates to float fields
  inside nested (struct/list) columns too.
- Guard for unsupported computed-column expressions: expressions that
  would serialize to an empty or non-round-trippable string are now rejected
  with an early, actionable `ValueError` at `add_computed_column()` time,
  instead of silently breaking on reload.
Build
- C-Blosc2 updated to 3.1.3.
Release 4.4.2
Changes from 4.4.1 to 4.4.2
This is a feature and maintenance release that promotes DSL kernels to
first-class CTable computed columns, adds a new CTable.__setitem__
assignment idiom, optimises bulk NDArray writes, and fixes several
correctness issues.
DSL kernels as first-class CTable columns
- `add_computed_column()` accepts DSL kernels: `@blosc2.dsl_kernel`-decorated
  functions can now back virtual computed columns directly, in addition to
  the existing string-expression form. The column survives save/open
  round-trips via persisted `dsl_source`.
- `add_generated_column()` accepts DSL kernels: stored generated columns
  (written during `append`/`extend` and on `refresh_generated_column()`)
  now support DSL kernels as their transformer.
- `CTable.where()` accepts UDF/DSL kernels: filter predicates are no
  longer limited to expression strings; any DSL kernel can be passed directly.
- `dtype` inference for DSL kernels: when `dtype` is omitted,
  `lazyudf()` infers the output dtype via NumPy type promotion of the input
  column dtypes. Pass `dtype` explicitly for type-changing kernels
  (comparisons, casts).
- `kernel_from_source()` utility: new `dsl_kernel.kernel_from_source()`
  reconstructs a `DSLKernel` from its stored source text, shared by the
  CTable DSL-column loaders and the persisted `LazyUDF` decoder.
- Security note: `.b2d` files from untrusted sources that contain DSL
  computed columns execute stored Python source on open. A warning is now
  included in the documentation.
New CTable.__setitem__ column-assignment API
- `t["col"] = arr`: new shorthand equivalent to `t["col"][:] = arr`.
  Accepts any array-like including `blosc2.NDArray`. Raises `KeyError` for
  unknown columns and `ValueError` for views or read-only tables.
Chunked NDArray writes in extend() and Column.__setitem__
- `extend({"col": ndarray})` decompresses chunk-by-chunk: when a
  `blosc2.NDArray` is passed as a column value to `extend()`, it is now
  written in chunks instead of being fully decompressed upfront. Pass
  `validate=False` to avoid a transient full decompression during constraint
  checking.
- `col[:] = blosc2_ndarray` fast path: a new no-holes fast path in
  `Column.__setitem__` skips the O(n) validity-mask gather and writes the
  NDArray one chunk at a time using contiguous slice writes. Works for both
  scalar and fixed-shape ndarray columns. Falls back to a chunked fancy-index
  path when deleted rows are present.
BLOSC_ME_JIT environment variable override
- Full CLI override: `BLOSC_ME_JIT` now takes unconditional priority over
  both the `jit=` and `jit_backend=` keyword arguments, making it easy to
  switch JIT backends from the command line without modifying code.
Correctness fixes
- View corruption in `Column.__setitem__`: a `None == None` guard
  evaluation on view-backed columns could fire the NDArray fast path,
  bypassing physical-position remapping and silently corrupting rows. Fixed
  by explicitly checking `base is None` before activating the fast path.
- `CTable.__setitem__` view guard: the new `t["col"] = arr` API now
  raises `ValueError` on views, matching the contract of all other mutating
  CTable methods.
- Fast path enabled for disk-opened tables: the fast path previously
  remained dormant for tables opened from disk because `_last_pos` starts as
  `None`. The guard now calls `_resolve_last_pos()` to lazily initialise it.
- DSL column `jit_backend` preserved in `_empty_copy`: the `jit_backend`
  setting was silently dropped during internal table copies; it is now
  retained.
- `lazyexpr` Column unwrapping: `convert_inputs()` now automatically
  unwraps `CTable.Column` objects to their backing NDArray so that shape and
  identity checks work correctly.
Documentation and examples
- Parquet-to-blosc2 walkthrough: new step-by-step tutorial added to the
  getting-started section. Thanks to @SyedIshmumAhnaf.
- CTable performance tips: new section in the overview covering when to
  prefer computed vs. generated columns, chunk sizing, and query optimisation.
- Simplified docstring examples: examples throughout `ndarray.py` and
  `ctable.py` now use `blosc2.array()`, `blosc2.arange()`, and
  `blosc2.linspace()` directly instead of two-step numpy-then-asarray
  patterns.
- `udf-computed-col.py` example: new end-to-end example demonstrating DSL
  kernel computed and generated columns.
Release 4.4.1
Changes from 4.3.3 to 4.4.1
This is a feature release focused on a new interactive data viewer, automatic
SUMMARY indexes for fast WHERE queries, chunk-aligned Arrow/Parquet imports,
expanded where() acceleration via miniexpr, and a range of CTable ergonomics
and performance improvements. Python 3.10 support has been dropped; Python
3.11 is now the minimum.
b2view: interactive Text User Interface data viewer
- New `b2view` command: a terminal-based interactive viewer for all
  blosc2 containers: `NDArray`, `CTable`, `SChunk`, `BatchArray`, and more.
  Launch it with `b2view <file>` or as `blosc2.b2view()` from Python.
- Full 1-D and 2-D browsing: arrays with more than two dimensions are
  sliceable along any axis; 1-D arrays are shown as a single-column table.
- CTable navigation: scroll through rows with keyboard shortcuts; `t`/`b`
  jump to the top/bottom; `--panel` jumps straight to a named panel on launch.
- `CTable.vlmeta` panel: variable-length metadata is exposed in a dedicated
  panel.
- New dim mode: navigate along all dimensions freely for N-D arrays.
SUMMARY indexes for fast WHERE queries
- Automatic SUMMARY index creation: when a `CTable` is closed after a
  write session, SUMMARY indexes (per-block min/max) are built by default for
  all eligible scalar columns with no extra configuration needed.
- Incremental build during writes: indexes are accumulated block-by-block
  during `extend()` and Arrow import, so closing the table costs almost nothing
  beyond the write already done.
- Block-skip prefilter: the miniexpr prefilter uses SUMMARY bitmaps to skip
  entire blocks whose min/max range cannot satisfy the WHERE predicate, reducing
  decompression work for selective queries.
- Conjunction support: per-column SUMMARY block masks are combined with
  bitwise AND so multi-column conjunctions prune blocks efficiently.
- Cost gate: a cost model guards index use; the SUMMARY path is skipped when
  block skipping is unlikely to help (e.g. very low selectivity).
- `--no-summary-index`: new CLI flag for `parquet-to-blosc2` to disable
  automatic index creation on import.
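The block-skipping idea behind a per-block min/max index can be sketched in a few lines of plain Python. This is a conceptual illustration only; `build_summary`, `blocks_for_range` and `and_masks` are hypothetical helpers, and the real indexes live in compressed sidecars rather than Python lists.

```python
def build_summary(values, block_len):
    """Per-block (min, max) pairs, the shape of index the notes describe."""
    return [
        (min(values[i:i + block_len]), max(values[i:i + block_len]))
        for i in range(0, len(values), block_len)
    ]


def blocks_for_range(summary, lo, hi):
    """Mask of blocks that may hold a value v with lo <= v <= hi."""
    return [bmin <= hi and bmax >= lo for bmin, bmax in summary]


def and_masks(*masks):
    """Combine per-column masks for a conjunction (a AND b AND ...)."""
    return [all(flags) for flags in zip(*masks)]


fare = [1, 2, 3, 4, 50, 60, 70, 80, 5, 6, 7, 8]
secs = [10, 20, 30, 40, 10, 20, 30, 40, 900, 950, 990, 999]

fare_mask = blocks_for_range(build_summary(fare, 4), 0, 10)      # fare <= 10
secs_mask = blocks_for_range(build_summary(secs, 4), 500, 1000)  # secs >= 500
to_scan = and_masks(fare_mask, secs_mask)  # only the last block survives
```

A surviving block is only a candidate: its rows still have to be checked against the predicate, since min/max ranges can overlap the query without any row matching.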
CTable column grid alignment
- Shared chunk/block grid for scalar columns: fixed-size columns are now
  written on a shared chunk/block grid derived from the numeric column widths,
  so all columns have identical chunk boundaries. This makes multi-column
  SUMMARY scans and chunk-parallel reads significantly faster.
- Chunk-aligned Arrow import: incoming Arrow/Parquet batches are buffered
  and flushed in exact chunk-sized blocks, so each chunk is compressed exactly
  once instead of being split across batch boundaries.
- Vectorized dictionary-column import: dictionary codes are now written in
  bulk at full chunk capacity rather than element by element.
- Small fixed strings on the grid: fixed-length string columns narrow enough
  to share the numeric grid are admitted to it, reducing the number of distinct
  chunk sizes.
- `--reduce-mem`: new CLI option for `parquet-to-blosc2` to cap the Arrow
  read-batch size on nested `list<struct>` imports, keeping peak RSS low at a
  modest speed cost.
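Chunk-aligned buffering amounts to re-slicing a stream of arbitrarily sized batches into fixed-size pieces. A minimal sketch, using plain lists in place of Arrow batches and a hypothetical `rechunk` generator:

```python
def rechunk(batches, chunk_len):
    """Yield lists of exactly `chunk_len` rows, plus one final partial chunk."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        # Flush every full chunk that the buffer now holds.
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
    if buffer:
        yield buffer


batches = [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9]]
chunks = list(rechunk(batches, 4))
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each emitted piece is handed to the compressor once, whereas writing the batches as they arrive would touch the chunk covering rows 0-3 twice (rows 0-2, then row 3).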
CTable.copy() enhancements
- C-level bulk copy for `ListArray` and `BatchArray`: a new `chunk_copy()`
  method transfers pre-compressed chunks directly at the C level, bypassing
  Python-level serialization and recompression. `CTable.copy()` uses this path
  automatically.
- `chunks=`/`blocks=` overrides in `CTable.copy()`: callers can now
  specify target chunk and block sizes for the output copy.
- `cparams` and `blocks` overrides: `CTable.copy()` accepts `cparams` and
  `blocks` to recompress the copy with different settings.
- `--chunks`/`--blocks` added to the `parquet-to-blosc2` CLI.
Take/gather APIs
- Added `NDArray.take()` following Array API `take` shape semantics, including
  `axis=None` flattening and N-dimensional integer indices. One-dimensional
  gathers use a new sparse C-level path (`b2nd_get_sparse_cbuffer`) internally.
- Extended top-level `blosc2.take()` to dispatch to `NDArray.take()`,
  `CTable.take()`, and `Column.take()` while preserving the input container
  type.
- Added `CTable.take()` and `Column.take()` for logical row/value gathers that
  preserve order and duplicate indices, unlike mask-based views.
- For `ndim > 1` axis-based take, orthogonal selection is used internally for
  better performance.
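The difference between a take-style gather and a mask-based selection is easiest to see on a tiny list. The two helpers below are illustrative stand-ins written for this example, not the blosc2 methods; restricting indices to `0 <= i < n` is a simplifying assumption.

```python
def take(values, indices):
    """Gather by position, keeping the caller's order and any repeats."""
    n = len(values)
    out = []
    for i in indices:
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for length {n}")
        out.append(values[i])
    return out


def mask_select(values, indices):
    """Mask-style selection: storage order, each row at most once."""
    wanted = set(indices)
    return [v for pos, v in enumerate(values) if pos in wanted]


rows = ["a", "b", "c", "d"]
gathered = take(rows, [3, 0, 3])        # order and the duplicate are kept
masked = mask_select(rows, [3, 0, 3])   # collapses to storage order, no repeats
```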
where() and miniexpr acceleration
- `where(cond, x)` via miniexpr: the single-argument `where` (fill-with-zero
  variant) is now handled directly by the miniexpr engine when the condition is
  a boolean array, avoiding a numexpr round-trip.
- `where(cond, x, y)` via miniexpr: the two-argument flavor is likewise
  dispatched to miniexpr for element-wise conditional selection.
- Sparse boolean mask fast path: when a boolean indexing result is very
  sparse (high selectivity), auto-detection switches to a fast gather path
  instead of a full-array scan.
- Early boolean key check: `NDArray.__getitem__` with a boolean array key
  now detects it before the general `process_key`/`nonzero` path, avoiding
  wasted work.
- Compressed transient masks: temporary boolean masks created during
  queries are now stored as LZ4-compressed blosc2 arrays, reducing memory
  pressure without measurable speed regression.
- `BLOSC_ME_JIT`/`BLOSC_ME_JIT_TRACE`: new environment variables to
  control and trace the miniexpr JIT backend at runtime.
CTable views and lazy sorting
- `sort_by()` on a view is now lazy: calling `sort_by()` on a filtered view
  returns a position-reordered view without materializing data; the sort
  positions are cached and used directly on column access.
- Lazy column materialization in filtered views: `select()` on a view no
  longer materializes unneeded columns eagerly; columns are resolved only when
  accessed.
NestedColumn and .info improvements
- `NestedColumn` public class: the previously internal
  `_NestedColumnNamespace` has been renamed and promoted to `NestedColumn`,
  providing aggregate metadata (`col_names`, `nrows`, `nbytes`, `cbytes`,
  `cratio`) and a structured `.info` report over a group of dotted columns.
- Uniform `.info` across containers: `Column.info`, `CTable.info`,
  `NestedColumn.info`, and related classes now follow a consistent field order
  (identity → shape/grid → sizes → content → compression params).
Context manager support for blosc2.open()
- All objects returned by `blosc2.open()` (`NDArray`, `SChunk`, `CTable`,
  `BatchArray`, `ListArray`, and stores) now support the `with` statement.
  The `__exit__` method flushes and closes the underlying storage.
Performance improvements and fixes
- CTable.nrows stored persistently: row counts are written to metadata on
  close and read back on open, avoiding a full column scan at startup.
- Index sidecar loading from .b2z: SUMMARY/BUCKET sidecars inside `.b2z`
  archives are now read in-place rather than extracted to a temporary directory,
  cutting open latency for indexed tables.
- Compressed query cache: the hot query-result cache is now stored
  LZ4-compressed, reducing its memory footprint with negligible overhead.
- Query cache consistency fixes: on-disk query cache side effects and a
  miniexpr chunk-cache race condition on Apple Silicon have been resolved.
- macOS L2 floor for chunk sizing: on macOS the full L2 cache is used as a
  floor for automatic chunk sizing, giving better compression/speed trade-offs.
- Better Apple Silicon L3 handling: missing L3 cache on Apple Silicon is
  handled more gracefully in the cache-size heuristic.
- Table capacity management: large CTables grow more conservatively, and
  capacity is trimmed on close and after Arrow import to reclaim over-allocated
  space.
- Faster iteration with `iterchunks_info()`: several hot loops switched to
  `iterchunks_info()` for lower overhead per chunk.
- Cost-model index refinement threshold: the previously hardcoded threshold
  for switching between index and scan has been replaced with a data-driven cost
  model.
- Index prefetch reuse: data already prefetched during an index lookup is
  reused in the refinement phase, avoiding redundant I/O.
- Simplified index sidecar filenames in `_indexes/{col}/` directories.
- `DictStore` embed disabled by default: embedding a store inside a dict
  store is now opt-in (it was error-prone as the default).
- Fixed wasm32 issue: a 32-bit platform arithmetic fix for reduce operations.
- Chunks never exceed array dimension: `compute_chunks_blocks` now
  guarantees chunk dimensions are capped at the array shape dimension.
- `max_rows` robust to older PyArrow: truncation logic no longer depends on
  PyArrow APIs that are absent in older releases.
- `cratio` display: compression ratio is now shown with an explicit `x`
  suffix (e.g. `2.47x`) throughout `.info` output.
- Updated bundled C-Blosc2 to the latest release.
Dropped Python 3.10 support
- Python 3.11 is now the minimum supported version.
Release 4.3.3
Changes from 4.3.1 to 4.3.3
Note: 4.3.2 was an internal pre-release that was not published to PyPI.
This is a maintenance release focused on CTable display ergonomics, indexed-query
correctness, and query-planner performance.
CTable display and print options
- Pandas-like CTable display by default: `str(table)`/`print(table)` now use
  a compact, pandas/DuckDB-style table representation, including a displayed
  logical row index, numeric alignment, compact spacing, and a trailing footer
  such as `[726017 rows x 5 columns]`.
- Configurable display options: added `blosc2.set_printoptions()` and
  `blosc2.get_printoptions()` for CTable rendering. The supported options are
  `display_index`, `display_rows`, `display_precision`, and `fancy`.
- `CTable.to_string()`: added a one-off formatting API for producing CTable
  string representations without changing global print options.
- Compact truncation for large tables: when a table exceeds the configured
  `display_rows` threshold, only the first five and last five rows are shown,
  with an ellipsis row in between.
- Float display refinements: compact mode uses pandas-like fixed precision
  for floating-point columns, and integer-valued float columns are displayed
  with a single decimal place.
- Fancy display preserved: `set_printoptions(fancy=True)` restores the more
  decorated display with dtype rows, separator rules, and hidden row/column
  counts.
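The head/tail truncation rule can be expressed as a small row-selection function. This is a sketch of the behaviour described above, not the renderer itself; `visible_rows` is a hypothetical helper, and returning every row when the table has no more than `2 * edge` rows is an assumption added so that the ellipsis never hides zero rows.

```python
def visible_rows(n_rows, display_rows, edge=5):
    """Row positions to print; None marks the ellipsis row."""
    if n_rows <= max(display_rows, 2 * edge):
        return list(range(n_rows))
    head = list(range(edge))
    tail = list(range(n_rows - edge, n_rows))
    return head + [None] + tail


short = visible_rows(8, 60)    # fits: all eight rows
long = visible_rows(100, 60)   # first five, ellipsis, last five
```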
Indexed queries and sorting
- Cross-column exact index refinement: multi-column conjunctions can now use
  exact positions from a selective indexed column (`FULL`, `PARTIAL`, or `OPSI`)
  as a compact pre-filter, then refine the remaining predicates on those
  positions instead of scanning the full table.
- NaN-safe index boundary navigation: fixed sorted-boundary navigation for
  floating-point indexes containing `NaN` values, so indexed results match scan
  results for bucket/full index lookups.
- Better index-planner heuristics: the planner now avoids low-value indexed
  paths when segment pruning is unlikely to help, and avoids expensive scalar
  specialization for non-scalar arrays.
- Faster filtered sorting: small filtered views can be materialized and
  sorted directly, avoiding an extra gather of sort keys.
Performance and fixes
- Avoid full materialization of `valid_rows` in several CTable code paths.
- Keep row counts lazy for views and avoid unnecessary `nrows` calls in the
  query planner.
- Reduced overhead in root-column iteration and small query-planner operations.
- Fixed dictionary-column capacity handling during Arrow import and a regression
  affecting dictionary columns.
- Marked additional long-running tests as `heavy` to reduce default test-suite
  runtime.
Documentation
- Updated the containers tutorial with dedicated `ListArray` and `CTable`
  sections, including CTable's columnar storage model and support for columns
  backed by `NDArray`, `BatchArray`, `ObjectArray`, `ListArray`, and related
  containers.
Release 4.3.1
Changes from 4.3.0 to 4.3.1
This is a maintenance release focused on CTable nested-column ergonomics,
grouped reductions, and API/documentation polish.
CTable nested columns and grouped reductions
- Nested column names in `group_by()` results: grouped output columns can now
  preserve dotted/nested names such as `trip.sec` instead of requiring valid
  Python identifiers.
- Column-object selectors: `CTable.group_by()` and `CTable.sort_by()` now
  accept `Column` objects as well as string names, enabling idioms such as
  `t.group_by(t.trip.sec)` and `t.sort_by(t.trip.sec)`.
- Grouped arg reductions: `CTableGroupBy` now supports `argmin()` and
  `argmax()`, plus `agg({"col": "argmin"})`/`agg({"col": "argmax"})`.
  Results are logical row positions in the grouped table or view; groups with no
  non-null values return `-1`.
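The contract for grouped arg reductions (a row position per group, `-1` when a group has no non-null value) can be sketched in plain Python. Here nulls are modelled as `None` and ties go to the first occurrence; both are assumptions of this example, and `group_argmin` is a hypothetical helper rather than the library's code.

```python
def group_argmin(keys, values):
    """Map each key to the row position of its smallest non-null value."""
    best = {}
    for pos, (key, value) in enumerate(zip(keys, values)):
        best.setdefault(key, -1)  # -1 until a non-null value shows up
        if value is None:
            continue
        current = best[key]
        if current == -1 or value < values[current]:
            best[key] = pos
    return best


cities = ["NYC", "CHI", "NYC", "SF", "CHI"]
secs = [30.0, None, 12.0, None, 45.0]
positions = group_argmin(cities, secs)
```

The returned positions index the original rows, so `secs[positions["NYC"]]` recovers the minimum itself, while the `SF` group, which only holds a null, maps to `-1`.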
NDArray constructor ergonomics
- `blosc2.array()`: added a NumPy-like constructor for NDArrays. It mirrors
  `blosc2.asarray()` but defaults to `copy=True`, so passing an existing
  `NDArray` creates a copy unless `copy=False` or `copy=None` is requested.
Documentation
- Expanded the CTable reference with `RowTransformer`, `Column.row_transformer`,
  and `CTableGroupBy.argmin`/`argmax` documentation.
- Added `blosc2.ndarray()`, `blosc2.dictionary()`, and related public schema
  factory functions to the Schema Specs reference.
- Moved `blosc2.group_reduce()` into the Reduction Functions reference and
  updated its example to use Blosc2 NDArrays.
Release 4.3.0
Changes from 4.2.0 to 4.3.0
CTable: N-dimensional (ndarray) columns
- Multidimensional columns: CTable columns can now hold NDArray-backed cells, allowing
  each row of a column to contain a full n-dimensional compressed array. This enables
  use cases such as embedding vectors, image patches, time-series windows, or any other
  multidimensional per-row payload.
- CSV and DataFrame import/export: Multidimensional column data can be imported and
  exported via CSV and pandas DataFrames, with automatic detection of array-valued cells.
- Nullable ndarray columns: Multidimensional columns fully support the nullable
  semantics (`null_count`, sentinel handling, `null_policy`) already available for scalar
  columns.
- `from_pandas()` improvements: `CTable.from_pandas()` now creates the correct
  specialized backing storage for `DictionarySpec`, `ListSpec`, `VLStringSpec`,
  `VLBytesSpec`, and other variable-length scalar specifications.
- Improved schema coverage: New CTable timestamp schema type and extended
  `Column.info` output with `shape`, `chunks`, and `blocks` descriptors.
- Arg reductions: Added `argmin()` and `argmax()` for scalar and ndarray
  CTable columns, plus row-transformer support for generated columns such as
  per-row peak-hour or dominant-embedding-dimension features.
CTable: Group-by and filtered aggregation
- `CTable.group_by()`: The primary group-by interface. Call
  `t.group_by("city", sort=True).agg({"qty": "mean"})` to produce a new
  `CTable` with aggregated results. Single-key and multi-key groupings are
  supported, along with convenience methods such as `.size()`, `.count()`,
  `.sum()`, `.mean()`, `.min()` and `.max()`:

  .. code-block:: python

      by_city = t.group_by("city", sort=True)
      by_city.size()                           # COUNT(*)
      by_city.sum("sales")                     # SUM(sales) per city
      by_city.agg({"sales": ["sum", "mean"]})  # SUM(sales), AVG(sales) per city

- Performance accelerators: Dedicated Cython fast paths deliver significant speedups:
  ~25× for float32/64 group-by keys, ~8× for integer and dictionary-code keys, and a
  general-purpose hash table for arbitrary float keys.
- Filtered aggregate pushdown: The `where=` parameter is now accepted in aggregation
  methods, pushing the filter into the compute engine so that only matching rows are
  read and reduced.
- Persistent grouped output: Group-by results can be saved directly to persistent
  storage via the `urlpath=` parameter.
- `blosc2.group_reduce()`: New public function that performs group-by reduction over
  NDArray instances and CTable columns, with Cython-accelerated backends for common
  key/reduction combinations.
CTable: Dictionary / categorical columns
- `DictionarySpec` column type: Introduced a new dictionary-encoded (categorical)
column type that stores string or integer codes mapped to a shared dictionary, providing
compact storage and accelerated equality and membership queries.
- Dictionary types in `where` clauses: Dictionary columns can be queried with the same
`where=` expression syntax as other column types, including nested dotted-name access.
- Improved display: `CTable` printing now adapts to the terminal width, and dictionary
values are shown in their decoded form. `Column.info` has been extended with type
details, shape, chunks, and blocks.
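The general idea behind dictionary encoding can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the `DictionarySpec` implementation; `dictionary_encode` and `equals_mask` are hypothetical names.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code (first-seen order)."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def equals_mask(dictionary, codes, target):
    """Evaluate `column == target` by comparing integer codes only."""
    if target not in dictionary:
        return [False] * len(codes)
    target_code = dictionary.index(target)
    return [c == target_code for c in codes]


payment = ["cash", "card", "cash", "cash", "card"]
dictionary, codes = dictionary_encode(payment)
mask = equals_mask(dictionary, codes, "card")
decoded = [dictionary[c] for c in codes]  # decoded form, as used for display
```

An equality query needs a single dictionary lookup followed by integer comparisons, which is one reason such queries can be faster on encoded columns than on raw strings.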
CTable: Nested columns and field-name escaping
- Dotted nested column access: Columns whose names contain a literal `.`
(e.g., `"root.nested"`) are now fully addressable via the dotted accessor syntax in
`where` expressions, `__getitem__`, and the public API.
- Hierarchical `_cols` storage paths: The internal column storage layout now preserves
a hierarchical structure that mirrors the logical nesting, improving introspection
and interop.
- Nested-field pipeline: A new flattened-storage pipeline with logical mapping
preserves nested schema structure (field names, types, and hierarchy) through
Arrow and Parquet import/export. For unnamed top-level `list<struct<...>>` Parquet
files, the logical schema round-trips faithfully, though the original physical row
grouping is intentionally not preserved.
- Field-name escaping: Special characters (`.` and `/`) in column names are
automatically escaped during schema construction and metadata round-trips.
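To show why escaping matters when `.` also separates nesting levels, here is a minimal sketch of a reversible escaping scheme. The percent-style encoding below is an assumption made purely for illustration; the release notes do not state which scheme python-blosc2 uses, and all function names are hypothetical.

```python
def escape_field(name):
    """Escape '.' and '/' so they cannot be confused with path separators.

    '%' is escaped first so that pre-existing percent signs stay unambiguous.
    The scheme is hypothetical; it is not taken from python-blosc2.
    """
    return name.replace("%", "%25").replace(".", "%2E").replace("/", "%2F")


def unescape_field(escaped):
    """Invert escape_field(); '%25' is restored last."""
    return escaped.replace("%2F", "/").replace("%2E", ".").replace("%25", "%")


def join_path(parts):
    """Build a dotted path from field names that may themselves contain dots."""
    return ".".join(escape_field(p) for p in parts)


def split_path(path):
    """Recover the original field names from a dotted path."""
    return [unescape_field(p) for p in path.split(".")]


parts = ["root.nested", "a/b"]
path = join_path(parts)       # "root%2Enested.a%2Fb"
restored = split_path(path)   # ["root.nested", "a/b"]
```

Whatever the concrete encoding, the property that matters is the round trip: splitting an escaped path must give back exactly the original field names.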
Parquet import/export improvements
- Arrow serializer by default: `CTable.from_parquet()` now defaults to the Arrow
serializer, providing better schema fidelity and nested-type support.
- Progress reporting: A `--progress` flag and an ETA estimator have been added to
the `parquet-to-blosc2` CLI for long-running imports.
- `--max-rows` parameter: `CTable.from_parquet()` and the CLI now accept `max_rows`
to limit the number of imported rows.
- `--timestamp-unit`: New CLI option to control timestamp unit conversion on import.
- `--float-trunc-prec`: New CLI option to truncate floating-point precision on import.
- Separated nested columns enabled by default: The `separate_nested_cols` flag is now
`True` by default for both the Python API and the CLI, ensuring nested Arrow structs
are always expanded into flat columns.
- `list_serializer` parameter: New option to control how list-type columns are
serialized, with sensible defaults for different list layouts.
- Validation optimizations: Arrow datetime values are validated only during import,
reducing runtime overhead on subsequent operations.
TreeStore: Inline CTable support
- CTables inside TreeStore: `CTable` objects can now be stored inline as items
inside a `TreeStore`, enabling hierarchical storage that mixes arrays and tables in a
single persistent container.
- Cache hardening: TreeStore cache assignments now use defensive copies and cache
effective object roots to avoid aliasing and stale-cache errors.
- Examples and tutorials: New tutorials and docstring examples demonstrate how to
store, retrieve, and query CTables within a TreeStore.
Performance and usability enhancements
- Faster open and import: `blosc2.open()` and store constructors now assume valid
file extensions and defer column metainfo loading, making `CTable.open()` and
package import noticeably faster.
- `CTable.nrows` is now lazy: The row count is computed on demand rather than eagerly,
speeding up open and schema-inspection workflows.
- Accelerated scalar and small-slice access: The batch/list path for reading scalar
values or small column slices has been overhauled, eliminating internal placeholder
materialization and yielding lower latency.
- Late-import optimizations: Heavy optional dependencies are imported lazily at the
blosc2 package level, reducing the baseline `import blosc2` overhead.
- `iter_arrow_batches()` optimization: Avoids full Python object materialization of
batches during iteration, reducing memory pressure.
- `NDArray`-to-list conversion: Small optimization when converting NDArray objects
to Python lists.
- `_last_pos` invalidation skipped: Mid-table deletes no longer eagerly invalidate
cached positional state, improving delete latency.
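The on-demand row count mentioned above follows a common lazy-attribute pattern, sketched here with the standard library. This is only an illustration of the pattern: `LazyTable` is a hypothetical class, and caching the result after the first access is an assumption of this sketch rather than something the release notes state.

```python
from functools import cached_property


class LazyTable:
    """Toy table whose row count is computed on first access, not at open."""

    def __init__(self, valid_rows):
        # Opening only stores a reference; nothing is scanned here.
        self._valid_rows = valid_rows
        self.count_calls = 0  # instrumentation for the example

    @cached_property
    def nrows(self):
        # Runs once; later accesses reuse the stored value (assumed behavior).
        self.count_calls += 1
        return sum(1 for ok in self._valid_rows if ok)


table = LazyTable([True, False, True, True])
calls_after_open = table.count_calls  # 0: opening did no counting
first = table.nrows                   # triggers the scan
second = table.nrows                  # served from the cached value
```

Code that opens a table only to inspect its schema never pays for the scan.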
Documentation, examples and benchmarks
- API reference expanded: `blosc2.group_reduce()` has been added to the Sphinx
reference, along with updated CTable, Column, and TreeStore pages.
- New tutorials and examples: Added sections on CTable–TreeStore integration,
nested fields, dictionary columns, aggregates, grouping and querying with `where=`.
- New benchmarks: Graph benchmarks for CTable insert time, column count, memory usage,
and `where=` queries, plus dedicated group-by, nested-filter, and Parquet round-trip
benchmarks.
Fixes and compatibility
- Null and NaN handling: NumPy scalar null sentinels are now normalized to plain Python
scalars, and floating-point NaN sentinels are treated consistently with Python
`float('nan')`.
- Empty aggregate results: Filtered aggregations that produce no rows now handle the
empty result gracefully.
- Generated column safety: Accessing a stalled (unfillable) generated column now raises
a clear exception instead of producing undefined results.
- Miniexpr bundling: Miniexpr’s bundled `libtcc` and related runtime files are now
kept inside the `blosc2` package, avoiding conflicts with other TCC installations.
- Test improvements: Torch-dependent tests are marked as `heavy`, PyArrow-optional
tests are skipped when the library is absent, and parametrization matrices have been
trimmed to reduce CI time.
- Missing Cython validation: Added validation guards for several Cython extension
functions that previously lacked explicit error checking.
- C-Blosc2 update: Bundled C-Blosc2 has been updated to the latest version (3.0.3).
- `blosc2.open()` default mode changed from 'a' to 'r': Removed the FutureWarning that
was added to prepare for this transition.
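NaN sentinels need special care because NaN never compares equal to itself, so a plain `==` check against the sentinel always fails. A minimal sketch of a sentinel test that handles this case follows; `is_null` is a hypothetical helper and not the library's code.

```python
import math


def is_null(value, sentinel):
    """Return True if `value` matches the null `sentinel`.

    NaN != NaN under IEEE 754, so NaN sentinels are matched with math.isnan()
    instead of equality. Hypothetical helper for illustration only.
    """
    if isinstance(sentinel, float) and math.isnan(sentinel):
        return isinstance(value, float) and math.isnan(value)
    return value == sentinel


nan = float("nan")
naive_match = (nan == nan)      # False: equality cannot detect NaN
nan_match = is_null(nan, nan)   # True
int_match = is_null(-1, -1)     # True: ordinary sentinels use equality
not_null = is_null(1.5, nan)    # False
```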
Release 4.2.0
Changes from 4.1.2 to 4.2.0
CTable: columnar compressed tables
- Introduced `blosc2.CTable`, a new columnar table container for compressed, typed columns. CTables support dataclass- and schema-based construction, row iteration, column access, table views, `head()`/`tail()`/`sample()`, sorting, selection and compact `where` expressions.
- Added persistent CTables backed by `TreeStore`, with support for `blosc2.open()`, `CTable.open()`, `CTable.load()`, `CTable.save()`, `CTable.to_b2d()` and `CTable.to_b2z()`. CTable views can be saved too, and `.b2z`/`.b2d` path handling has been tightened.
- Added mutation operations for CTables, including `append()`, `extend()`, `delete()`, `compact()`, `add_column()`, `drop_column()`, `rename_column()` and related schema validation.
- Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
- Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean `LazyExpr`/`NDArray` masks in `CTable.__getitem__`, `iter_sorted()` and indexing support for `.b2z` tables.
- Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
- Added variable-length CTable column support via `ListArray`/`ObjectArray`, including `vlstring` and `vlbytes` schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
- Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, `CTable.from_arrow_batches()` improvements and a new `parquet-to-blosc2` CLI utility.
- Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.
Indexing and ordering
- Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
- Added `blosc2.Index` as the unified public index handle, plus APIs such as `create_index()`, `compact_index()`, `iter_sorted()`, `will_use_index()` and related query explanation support.
- Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
- Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
- Added `blosc2.argsort()` and refactored indexing APIs around explicit index enums and sorting helpers.
- Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
- Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding `tmpdir` support for full out-of-core indexes.
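The combination of FIFO pruning and cache accounting mentioned for query-result caching can be illustrated with a small in-memory sketch. It is a generic example of the technique, assuming a byte budget as the accounting unit; `FifoResultCache` is a hypothetical class, and the real cache is persistent and may differ in every detail.

```python
from collections import OrderedDict


class FifoResultCache:
    """Byte-budgeted cache that evicts the oldest inserted entry first."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0            # cache accounting
        self._entries = OrderedDict()  # insertion order == eviction order

    def put(self, query, result):
        # Replacing an entry counts as a fresh insertion.
        if query in self._entries:
            self.used_bytes -= len(self._entries.pop(query))
        self._entries[query] = result
        self.used_bytes += len(result)
        # Prune from the oldest end until the budget is respected.
        while self.used_bytes > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)

    def get(self, query):
        # Lookups do not reorder entries; that is what makes this FIFO, not LRU.
        return self._entries.get(query)


cache = FifoResultCache(max_bytes=10)
cache.put("a > 1", b"12345")   # 5 bytes used
cache.put("b < 2", b"1234")    # 9 bytes used
cache.get("a > 1")             # a hit, but it does not protect the entry
cache.put("c == 3", b"123")    # 12 bytes: the oldest entry ("a > 1") is pruned
```

Compared with LRU, FIFO needs no bookkeeping on reads, which keeps lookups cheap at the cost of sometimes evicting a frequently used entry.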
Persistence, stores and serialization
- Added structured Blosc2 serialization based on b2object carriers, including persisted `C2Array`, `LazyExpr` and DSL `LazyUDF` objects.
- Added `blosc2.Ref` for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
- Added `blosc2.load()` as a convenience loader.
- Added `vlmeta` support to `LazyArray` objects.
- Improved store handling by preserving lazy b2object carriers in `DictStore`, allowing reopened proxies to refill caches after read-only opens, relaxing `DictStore`/`TreeStore` suffix requirements and adding `DictStore.to_b2d()`.
- Accelerated `blosc2.open()` by trying standard opens first and warning on implicit append mode.
Arrays, computation and containers
- Added `ObjectArray` for fully general object data and renamed the earlier `VLArray` work accordingly; added `ListArray` docstrings and Arrow integration improvements.
- Added schema helpers including numeric specs, `blosc2.struct()` and `blosc2.object()` for nested/fully general column declarations.
- Improved `fromiter()` with direct chunked construction and substantially lower peak memory use.
- Improved `asarray()` behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
- Added `SChunk.reorder_offsets()`.
- Improved `BatchArray` defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
- Continued matmul/linalg optimization work and shared-thread-pool integration.
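The memory benefit of chunked construction from an iterator comes from never holding more than one chunk of Python objects at a time. The sketch below shows only that consumption pattern with the standard library; `iter_chunks` is a hypothetical helper, and how `fromiter()` actually writes chunks is not described in these notes.

```python
from itertools import islice


def iter_chunks(iterable, chunk_len):
    """Yield successive lists of at most `chunk_len` items from `iterable`.

    Only one chunk is materialized at a time, so peak memory depends on
    `chunk_len` rather than on the total number of items.
    """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_len))
        if not chunk:
            return
        yield chunk


squares = (i * i for i in range(10))     # a generator: no list of 10 items exists
chunks = list(iter_chunks(squares, 4))   # collected here only to inspect them
```

A real consumer would compress and store each chunk as it arrives instead of collecting them in a list.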
CLI, docs and examples
- Added the `parquet-to-blosc2` command with options such as `--max-rows`, `--parquet-batch-size`, `--blosc2-items-per-block` and `--use-dict`.
- Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
- Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
- Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.
Fixes and compatibility
- Updated bundled C-Blosc2 to v3.0.2 and require C-Blosc2 >= 3.0.0 when building against a system library.
- Updated bundled C-Blosc2 and miniexpr sources multiple times.
- Restored compatibility with NumPy < 2.
- Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
- Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small `/tmp`.
- Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
- Fixed `.b2z` double-open corruption caused by GC-triggered repacking and made temporary `.b2z` unpacking default to the source file directory.
- Fixed a regression when reopening persisted proxies in read-only mode.
- Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
- Fixed lazy-chunk source-size handling in decode/getitem callers.
- Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
- Fixed `arange()` regressions and several pre-existing `set_slice` error-handling issues.
- Clamped indexing/thread defaults for wasm32.