Releases: Blosc/python-blosc2

Release 4.5.1

17 Jun 10:42

Changes from 4.5.0 to 4.5.1

This follow-up release builds the b2view terminal viewer into a richer
data-exploration tool — a scatter plot, a searchable column picker, a
one-shot demo download, refreshed chrome, and several interaction fixes — and
upgrades the bundled C-Blosc2 to 3.1.4. WASM/Pyodide is now a fully
supported platform, and CTable.info reports per-column compressed sizes.

b2view: richer exploration

  • Scatter plots: from a column plot, press s to scatter the current column
    (X) against another column (Y) chosen from a list, over the current (zoomed)
    row range; h then opens a high-resolution matplotlib scatter.
  • High-res for 1-D series is now an envelope plot (matching the in-terminal
    view), and a new r key toggles between the min/max envelope and the raw
    values (strided-sampled when the range is wide).
  • Searchable column picker: the c go-to-column key now opens a searchable,
    selectable list (type to filter, ↑/↓, Enter) for CTables, instead of a text
    field; N-D arrays still go by numeric index.
  • Show/hide columns: / opens a searchable multi-select to pick which CTable
    columns are displayed.
  • Demo download: b2view --download fetches a demo bundle
    (chicago-taxi-flat.b2z by default) into the current directory if it is not
    already there, then opens it.
  • Refreshed chrome: a branded header, a left-docked filename label in the
    title, and clearer status chips.

b2view: interaction fixes

  • Go-to-row/column pre-fill is now pre-selected, so the first keystroke
    replaces the current index instead of appending to it (typing a column name
    no longer produces, e.g., 0payment.fare).
  • Escape keeps its layered exit while a panel is maximized: with the data
    panel maximized, escape now unlocks a plot's locked row window (and clears
    filters) as documented, instead of being hijacked into restoring the panel —
    use r to restore (ESCAPE_TO_MINIMIZE = False).
  • Test-suite robustness fixes (a timing flake and a Windows rendering glitch).

Other

  • C-Blosc2 upgraded to 3.1.4.
  • WASM/Pyodide is now a fully supported platform, with more frequent CI runs.
  • CTable.info shows per-column compressed sizes (cbytes and cratio),
    and print_versions() uses clearer Python-Blosc2 / C-Blosc2 labels.

Release 4.5.0

15 Jun 12:03

Changes from 4.4.5 to 4.5.0

This release teaches the b2view terminal viewer to plot — peak-preserving
envelope line plots of any series, with zoom, a row-window lock, and an optional
high-resolution matplotlib view — and gives CTable a pandas-like display and
CSV experience. It also publishes WASM/Pyodide wheels to PyPI and adds
faster strided reads for NDArray and Column.

b2view: plotting and data inspection

  • In-terminal plots: press p on a numeric series (a CTable column or an
    array row) to draw a braille line plot. Plots are peak-preserving min/max
    envelopes by default, so no spike or trough is hidden however large the
    series is; large local series stream their envelope exactly in bounded
    spans (only remote c2arrays fall back to a labeled strided sample).
  • Zoom and row-window lock: zoom the plot into a row range and pan it; press
    v to lock the data grid to the plotted range so paging stays inside it
    (escape unlocks). The plot and high-res views honor the locked window.
  • High-resolution view: h opens a high-res matplotlib image of the
    plotted range (new optional hires extra: matplotlib + textual-image).
  • On-demand cell decode: enter decodes a single skipped/expensive CTable
    cell, and SChunk nodes now preview as a paged hex dump.
  • Fixes and polish: row paging re-aligns to the page grid after dim-mode
    single-row scrolls; the data panel now focuses correctly with
    --path ... --panel data; status chips are branded yellow.
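
To illustrate why a min/max envelope is called peak-preserving, here is a minimal pure-Python sketch, not the b2view implementation. The function names and the bin count are hypothetical. It compares an envelope against plain strided sampling on a flat series with a single spike.

```python
def minmax_envelope(values, n_bins):
    """Split values into at most n_bins contiguous spans; return (lo, hi) per span."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    n = len(values)
    if n == 0:
        return []
    n_bins = min(n_bins, n)  # guarantees every span holds at least one value
    envelope = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        span = values[start:stop]
        envelope.append((min(span), max(span)))
    return envelope


def strided_sample(values, n_points):
    """Keep every k-th value; cheap, but it can step over a narrow spike."""
    step = max(1, len(values) // n_points)
    return values[::step]


# A flat series with a single one-sample spike at index 537.
series = [0.0] * 1000
series[537] = 9.0

envelope = minmax_envelope(series, 10)
sample = strided_sample(series, 10)
```

The envelope keeps the spike in the (lo, hi) pair of its span, while the strided sample happens to miss it entirely.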

CTable display

  • CTable.to_string() now renders the whole table by default (every row and
    every column), like pandas' DataFrame.to_string(). New max_rows and
    max_width parameters truncate on demand. Behaviour change: previously
    to_string() returned the truncated view; code that relied on that should
    pass max_rows=/max_width= (or use str()).
  • The [N rows x M columns] dimensions footer now follows pandas: omitted by
    to_string() (pass show_dimensions=True to force it), and shown by
    str/repr/print only when the view is actually truncated. Previously it
    was always appended.
  • repr(ctable) now shows the same truncated table as str(ctable)
    (pandas/polars convention), instead of the one-line CTable<…> summary. The
    compact summary remains available via ctable.info.
  • New display options in set_printoptions: display_width controls the
    column-fitting width budget (None = auto-detect terminal, -1 = show all
    columns, positive int = fixed budget), and display_rows now accepts -1 to
    show all rows (0 still shows none).
  • New blosc2.printoptions(...) context manager temporarily sets the display
    options and restores them on exit, e.g.
    with blosc2.printoptions(display_rows=-1, display_width=-1): print(t).

CTable I/O

  • CTable.to_csv() now accepts no path, returning the CSV as a string like
    pandas' DataFrame.to_csv(). Passing a path still writes the file (and
    returns None); the returned string is byte-for-byte the same as the file.
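
The "path or string" contract can be sketched with the standard csv module. This is an illustration of the pattern only: the to_csv function below is a hypothetical free function over plain rows, not the CTable method.

```python
import csv
import io
import os
import tempfile


def to_csv(header, rows, path=None):
    """Write CSV to path and return None, or return the CSV text when path is None."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    text = buffer.getvalue()
    if path is None:
        return text
    # newline="" stops Python from translating the csv module's line endings,
    # so the file holds exactly the same characters as the returned string.
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(text)
    return None


header = ["city", "qty"]
rows = [("Oslo", 3), ("Lima", 5)]
as_text = to_csv(header, rows)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.csv")
    result = to_csv(header, rows, target)
    with open(target, "rb") as f:
        on_disk = f.read()
```

Building the text once and reusing it for both branches is one simple way to guarantee the string and the file cannot drift apart.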

Performance

  • Faster strided reads: NDArray.__getitem__ gains a sparse-gather fast
    path for large strides, and Column.__getitem__ short-circuits when the
    logical positions equal the physical ones.
  • Fix: a negative step in Column getitem could return []; it now
    returns the reversed selection.
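
As a rough picture of logical versus physical positions, assume a column keeps a validity flag per physical row (an assumption made for this sketch; the helper names are hypothetical). A logical slice, including one with a negative step, then maps to physical rows like this:

```python
def physical_positions(valid, key):
    """Map a logical slice over live rows to physical row positions."""
    # Logical row i is the i-th physical row whose validity flag is True.
    live = [pos for pos, ok in enumerate(valid) if ok]
    # Slicing the list of live positions handles negative steps as well.
    return live[key]


def is_identity(valid):
    """True when no row is deleted, so logical and physical positions coincide."""
    return all(valid)


valid = [True, False, True, True, False, True]  # rows 1 and 4 are deleted
data = ["a", "-", "c", "d", "-", "f"]

reversed_rows = [data[p] for p in physical_positions(valid, slice(None, None, -1))]
every_other = [data[p] for p in physical_positions(valid, slice(None, None, 2))]
```

When is_identity() holds, the mapping step can be skipped and the key applied to storage directly, which is the kind of short-circuit the Column fast path describes.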

Indexing

  • Fix: a sidecar-handle cache collision could return the wrong SUMMARY
    index for a compact-store column.
  • Cross-column index pruning is now enabled for compact CTable queries, so
    more predicates prune blocks before any data is materialized. The docs also
    note when summary indexes are not created automatically.

Packaging

  • WASM/Pyodide wheels on PyPI: the main wheel build now also produces
    pyemscripten wheels for CPython 3.13 (2025 ABI) and 3.14 (2026 ABI) and
    uploads them to PyPI, so blosc2 is micropip-installable in Pyodide, and
    b2view prints a clear message instead of crashing when run under WASM.
    Known limitation: slicing an in-memory SChunk loaded from a frame fails on
    the Pyodide 0.29.x Emscripten toolchain (cp313); it works on Pyodide 314
    (cp314) and natively. See issue #664.
  • cibuildwheel updated to 4.1.

Release 4.4.5

12 Jun 15:25

Changes from 4.4.3 to 4.4.5

Note: 4.4.4 was skipped due to a failure during the release process.

This release promotes the b2view terminal viewer to a core feature —
installed by default, with new interactive row and column filtering — and
makes BatchArray block layouts (and hence compression ratios) reproducible
across CPUs.

b2view, the terminal data viewer

  • Installed by default: textual and rich are now regular
    dependencies, so the b2view CLI works out of the box (the [tui] extra
    is gone). A getting-started walkthrough was added to the docs, and the
    README now lists the CLI tools.
  • Row filtering: pressing f on a CTable node opens a modal that takes
    the same string expressions as CTable.where() (dotted nested names,
    and/or) and pages through the matching view. Filters are remembered
    per node for the session, the data header shows the active filter plus
    the unfiltered total, and escape (or an empty expression) clears it.
  • Column filtering: / narrows the visible columns by case-insensitive
    substring; column paging and the c goto-column modal then operate on
    that subset. Combines freely with the row filter; escape clears one
    layer per press (rows first, then columns).
  • Mouse handling: the terminal owns the mouse by default, so native
    text selection/copy works like in any CLI program; --mouse lets b2view
    capture it instead (click-to-focus, wheel scrolling by half a page,
    paging at the edges).
  • Navigation: ? opens a help screen listing all keys; c jumps to a
    column by index, exact name or unique name prefix; s/e jump to the
    first/last column window; row paging and jumps keep the cursor on its
    column; dim-mode index/viewport movements clamp at the boundaries instead
    of wrapping around.
  • Rendering: column windows are fitted from measured rendered widths
    (and re-fitted on terminal resize and panel maximize/restore), and float
    columns use a uniform number of decimals so decimal points align down
    the column.
  • Test suite: first automated tests for the TUI — Pilot-driven keyboard
    journeys against a deterministic generated store (marker tui), plus
    render unit tests. Skipped on wasm, where Textual apps cannot start
    (no termios).
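
The column-filter and go-to-column rules above can be sketched in a few lines. This is not b2view's code: the function names are hypothetical, and trying the index before the name is an assumption about precedence.

```python
def filter_columns(names, needle):
    """Keep columns whose name contains needle, ignoring case."""
    needle = needle.lower()
    return [n for n in names if needle in n.lower()]


def resolve_column(names, query):
    """Resolve a go-to-column query: index, then exact name, then unique prefix."""
    if query.isdigit():
        index = int(query)
        if index < len(names):
            return index
        raise LookupError(f"column index {index} out of range")
    if query in names:
        return names.index(query)
    matches = [i for i, n in enumerate(names) if n.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"{query!r} matches {len(matches)} columns")


names = ["trip.sec", "trip.km", "payment.fare", "payment.tips"]
visible = filter_columns(names, "PAY")
```

Resolving against the filtered list (visible) rather than the full list mirrors the note that go-to-column operates on the filtered subset.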

BatchArray

  • Reproducible block layouts: automatic variable-length block sizing
    now uses fixed byte budgets (1 MiB for clevel 1-3, 8 MiB for 4-6, 16 MiB
    for 7-8) instead of the CPU cache sizes, so the layout — and hence the
    compression ratio — no longer depends on the machine that created the
    array.
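
The budget tiers can be written down as a small lookup. The greedy packing that follows is only an illustration of why a fixed budget makes layouts machine-independent; it is an assumption, not the BatchArray algorithm, and both function names are hypothetical.

```python
MIB = 1024 * 1024


def block_byte_budget(clevel):
    """Byte budget for automatic block sizing, per the tiers in these notes."""
    if 1 <= clevel <= 3:
        return 1 * MIB
    if 4 <= clevel <= 6:
        return 8 * MIB
    if 7 <= clevel <= 8:
        return 16 * MIB
    # The notes list no budget for other levels; this sketch refuses to guess.
    raise ValueError(f"no budget listed for clevel={clevel}")


def plan_blocks(item_sizes, clevel):
    """Greedily pack variable-length items into blocks under the byte budget."""
    budget = block_byte_budget(clevel)
    blocks, current, used = [], [], 0
    for index, size in enumerate(item_sizes):
        if current and used + size > budget:
            blocks.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        blocks.append(current)
    return blocks


sizes = [600_000, 600_000, 300_000]
layout_fast = plan_blocks(sizes, clevel=1)
layout_mid = plan_blocks(sizes, clevel=5)
```

Because the inputs are only the item sizes and the compression level, the same data yields the same layout on any CPU.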

Build and docs

  • Installing test dependencies: the docs now use
    pip install . --group test (a PEP 735 dependency group); the stale
    [test] extra syntax was removed.
  • cibuildwheel updated to 4.0.

Release 4.4.3

10 Jun 17:44

Changes from 4.4.2 to 4.4.3

This is a maintenance release focused on faster CTable cold-start, printing
and groupby performance, a lighter import blosc2, new raw-storage access
for columns, and support for the new J2K/HTJ2K codec plugins.

CTable performance

  • Lazy column opening in views: select() (and other view-producing
    operations) no longer open every projected column up front. A column is
    only opened from storage when the view actually reads it, so selecting and
    then touching a subset of columns — or aggregating a single one — skips
    the cold-start cost of the rest.
  • Lazy index opening in queries: query planning no longer opens every
    SUMMARY-indexed column on a wide persistent table; only indexes for
    columns actually referenced by the predicate are loaded.
  • Faster table printing: repr()/to_string() now memoise per-column
    sparse gathers for the duration of a render and combine the head and tail
    rows into a single sparse read per column. Each column is read from
    storage once instead of ~6 times (precision detection, width sizing and
    row rendering all hit the cache).
  • Groupby with integral float keys: float key columns whose values are
    integral and fit a compact non-negative range (e.g. float32 id/second
    columns) now take the dense single-key fast path instead of the markedly
    slower generic float-hash path. Fractional or non-finite keys fall back
    automatically.
  • No tempdir in read mode: opening a .b2z/.b2d store in 'r' mode
    no longer creates a temporary working directory, since nothing is ever
    written.
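
The eligibility test for integral float keys might look roughly like the sketch below. It is pure Python for readability; the real fast path is compiled, and the max_range cut-off is a hypothetical parameter, since the notes only say the range must be compact.

```python
import math


def dense_key_range(keys, max_range=1 << 16):
    """Return (lo, hi) if float keys qualify for a dense integer path, else None."""
    lo = hi = None
    for k in keys:
        # Fractional, NaN, infinite, or negative keys disqualify the column.
        if not math.isfinite(k) or not float(k).is_integer() or k < 0:
            return None
        lo = k if lo is None else min(lo, k)
        hi = k if hi is None else max(hi, k)
    if lo is None or hi - lo >= max_range:
        return None
    return int(lo), int(hi)


def group_count(keys):
    """Count rows per key, using a dense array when the keys allow it."""
    bounds = dense_key_range(keys)
    if bounds is None:
        # Generic fallback: hash every float key.
        counts = {}
        for k in keys:
            counts[k] = counts.get(k, 0) + 1
        return counts
    lo, hi = bounds
    dense = [0] * (hi - lo + 1)
    for k in keys:
        dense[int(k) - lo] += 1
    return {float(lo + i): c for i, c in enumerate(dense) if c}


dense_result = group_count([3.0, 1.0, 3.0, 2.0])
fallback_result = group_count([0.5, 0.5, 1.0])
```

Both branches return the same kind of mapping, so callers do not need to know which path was taken.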

Lighter imports and prefetcher rework

  • asyncio dependency dropped: the on-disk chunk prefetcher used by the
    UDF and numexpr fallback engines now uses plain concurrent.futures
    instead of an asyncio event loop. import blosc2 no longer pulls in
    ~30 asyncio modules, saving ~3 MB of memory footprint at import time.
  • Prefetcher deadlock fixed: an exception during evaluation could leave
    the generator finalizer blocked forever in thread.join() while the
    reader thread was stuck on a full prefetch queue. A stop event now makes
    the producer bail out when its consumer goes away.
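
The stop-event idea can be shown with a bounded queue and a reader thread. This sketch uses threading and queue directly rather than concurrent.futures, and every name in it is hypothetical; it illustrates the deadlock-avoidance pattern, not the library's prefetcher.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(chunks, depth=2):
    """Yield items from chunks while a background thread reads ahead."""
    q = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def put(item):
        # Retry with a timeout so a full queue cannot block the producer forever.
        while not stop.is_set():
            try:
                q.put(item, timeout=0.05)
                return True
            except queue.Full:
                continue
        return False

    def producer():
        try:
            for chunk in chunks:
                if not put(chunk):
                    return  # the consumer went away
        finally:
            put(_DONE)  # a no-op once stop is set

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()
    try:
        while True:
            item = q.get()
            if item is _DONE:
                return
            yield item
    finally:
        # Runs on exhaustion, on an error in the consumer, and on close().
        stop.set()
        thread.join()


gen = prefetch(range(100), depth=2)
first = next(gen)
gen.close()  # abandon early; the join above must not hang
everything = list(prefetch(range(5)))
```

Without the stop event, the join in the finally block would wait on a producer that is itself waiting for queue space that will never be freed.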

New features

  • Column.raw accessor: returns the underlying storage container of a
    column (NDArray, ListArray, DictionaryColumn, …) directly. Unlike
    Column.__getitem__, which always materializes NumPy arrays, this is the
    column as a blosc2-native compressed object — usable as a lazy-expression
    operand without decompressing, and exposing storage details like schunk,
    chunks or cparams. Note that this is a physical view: fixed-width
    containers are over-allocated to chunk capacity, so slice to len(table)
    to get just the live rows, and no validity-mask or null-sentinel
    processing is applied. Raises AttributeError for computed columns,
    which have no backing storage.
  • J2K and HTJ2K codec IDs: blosc2.Codec.J2K and blosc2.Codec.HTJ2K
    expose the IDs for the new JPEG 2000 codec plugins (installable with
    pip install blosc2-j2k and pip install blosc2-htj2k).

Fixes

  • --float-trunc-prec and nested columns: the precision-truncation
    filter of the parquet_to_blosc2 CLI now propagates to float fields
    inside nested (struct/list) columns too.
  • Guard for unsupported computed-column expressions: expressions that
    would serialize to an empty or non-round-trippable string are now rejected
    with an early, actionable ValueError at add_computed_column() time,
    instead of silently breaking on reload.

Build

  • C-Blosc2 updated to 3.1.3.

Release 4.4.2

04 Jun 16:10

Changes from 4.4.1 to 4.4.2

This is a feature and maintenance release that promotes DSL kernels to
first-class CTable computed columns, adds a new CTable.__setitem__
assignment idiom, optimises bulk NDArray writes, and fixes several
correctness issues.

DSL kernels as first-class CTable columns

  • add_computed_column() accepts DSL kernels: @blosc2.dsl_kernel-decorated
    functions can now back virtual computed columns directly, in addition to
    the existing string-expression form. The column survives save/open
    round-trips via persisted dsl_source.
  • add_generated_column() accepts DSL kernels: stored generated columns
    (written during append/extend and on refresh_generated_column())
    now support DSL kernels as their transformer.
  • CTable.where() accepts UDF/DSL kernels: filter predicates are no
    longer limited to expression strings — any DSL kernel can be passed directly.
  • dtype inference for DSL kernels: when dtype is omitted,
    lazyudf() infers the output dtype via NumPy type promotion of the input
    column dtypes. Pass dtype explicitly for type-changing kernels
    (comparisons, casts).
  • kernel_from_source() utility: new dsl_kernel.kernel_from_source()
    reconstructs a DSLKernel from its stored source text, shared by the
    CTable DSL-column loaders and the persisted LazyUDF decoder.
  • Security note: .b2d files from untrusted sources that contain DSL
    computed columns execute stored Python source on open. A warning is now
    included in the documentation.

New CTable.__setitem__ column-assignment API

  • t["col"] = arr: new shorthand equivalent to t["col"][:] = arr.
    Accepts any array-like including blosc2.NDArray. Raises KeyError for
    unknown columns and ValueError for views or read-only tables.
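
The contract described above (delegate to the column's slice assignment, KeyError for unknown names, ValueError for views or read-only tables) can be mimicked with two toy classes. These are stand-ins written for this sketch, not the CTable or Column implementations.

```python
class Column:
    """Toy column backed by a Python list."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, new_values):
        self._values[key] = list(new_values)


class Table:
    """Toy stand-in for a table with named columns."""

    def __init__(self, columns, read_only=False, is_view=False):
        self._columns = {name: Column(vals) for name, vals in columns.items()}
        self._read_only = read_only
        self._is_view = is_view

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, values):
        if name not in self._columns:
            raise KeyError(name)
        if self._read_only or self._is_view:
            raise ValueError("cannot assign to a view or read-only table")
        # Same effect as t[name][:] = values.
        self._columns[name][:] = values


t = Table({"a": [1, 2, 3]})
t["a"] = [4, 5, 6]
view = Table({"a": [1, 2, 3]}, is_view=True)
```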

Chunked NDArray writes in extend() and Column.__setitem__

  • extend({"col": ndarray}) decompresses chunk-by-chunk: when a
    blosc2.NDArray is passed as a column value to extend(), it is now
    written in chunks instead of being fully decompressed upfront. Pass
    validate=False to avoid a transient full decompression during constraint
    checking.
  • col[:] = blosc2_ndarray fast path: a new no-holes fast path in
    Column.__setitem__ skips the O(n) validity-mask gather and writes the
    NDArray one chunk at a time using contiguous slice writes. Works for both
    scalar and fixed-shape ndarray columns. Falls back to a chunked fancy-index
    path when deleted rows are present.
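
A chunk-at-a-time copy with contiguous slice writes can be sketched with plain lists standing in for the compressed source and the column. The helper names are hypothetical; the point is that only one chunk-sized buffer is alive at a time.

```python
def chunk_slices(nrows, chunk_len):
    """Yield contiguous slices that cover range(nrows) one chunk at a time."""
    for start in range(0, nrows, chunk_len):
        yield slice(start, min(start + chunk_len, nrows))


def copy_chunked(source, dest, chunk_len):
    """Copy source into dest chunk by chunk; return the largest buffer length used."""
    peak = 0
    for sl in chunk_slices(len(source), chunk_len):
        # With a compressed source, only this chunk would be decompressed.
        buffer = source[sl]
        dest[sl] = buffer
        peak = max(peak, len(buffer))
    return peak


source = list(range(10))
dest = [None] * 10
peak = copy_chunked(source, dest, chunk_len=4)
```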

BLOSC_ME_JIT environment variable override

  • Full CLI override: BLOSC_ME_JIT now takes unconditional priority over
    both the jit= and jit_backend= keyword arguments, making it easy to
    switch JIT backends from the command line without modifying code.

Correctness fixes

  • View corruption in Column.__setitem__: a None == None guard
    evaluation on view-backed columns could fire the NDArray fast path,
    bypassing physical-position remapping and silently corrupting rows. Fixed
    by explicitly checking base is None before activating the fast path.
  • CTable.__setitem__ view guard: the new t["col"] = arr API now
    raises ValueError on views, matching the contract of all other mutating
    CTable methods.
  • Fast path enabled for disk-opened tables: the fast path previously
    remained dormant for tables opened from disk because _last_pos starts as
    None. The guard now calls _resolve_last_pos() to lazily initialise it.
  • DSL column jit_backend preserved in _empty_copy: the jit_backend
    setting was silently dropped during internal table copies; it is now
    retained.
  • lazyexpr Column unwrapping: convert_inputs() now automatically
    unwraps CTable.Column objects to their backing NDArray so that shape and
    identity checks work correctly.

Documentation and examples

  • Parquet-to-blosc2 walkthrough: new step-by-step tutorial added to the
    getting-started section. Thanks to @SyedIshmumAhnaf.
  • CTable performance tips: new section in the overview covering when to
    prefer computed vs. generated columns, chunk sizing, and query optimisation.
  • Simplified docstring examples: examples throughout ndarray.py and
    ctable.py now use blosc2.array(), blosc2.arange(), and
    blosc2.linspace() directly instead of two-step numpy-then-asarray
    patterns.
  • udf-computed-col.py example: new end-to-end example demonstrating DSL
    kernel computed and generated columns.

Release 4.4.1

03 Jun 05:12

Changes from 4.3.3 to 4.4.1

This is a feature release focused on a new interactive data viewer, automatic
SUMMARY indexes for fast WHERE queries, chunk-aligned Arrow/Parquet imports,
expanded where() acceleration via miniexpr, and a range of CTable ergonomics
and performance improvements. Python 3.10 support has been dropped; Python
3.11 is now the minimum.

b2view: interactive Text User Interface data viewer

  • New b2view command: a terminal-based interactive viewer for all
    blosc2 containers — NDArray, CTable, SChunk, BatchArray, and more.
    Launch it with b2view <file> or as blosc2.b2view() from Python.
  • Full 1-D and 2-D browsing: arrays with more than two dimensions are
    sliceable along any axis; 1-D arrays are shown as a single-column table.
  • CTable navigation: scroll through rows with keyboard shortcuts; t/b
    jump to the top/bottom; --panel jumps straight to a named panel on launch.
  • CTable.vlmeta panel: variable-length metadata is exposed in a dedicated
    panel.
  • New dim mode: navigate along all dimensions freely for N-D arrays.

SUMMARY indexes for fast WHERE queries

  • Automatic SUMMARY index creation: when a CTable is closed after a
    write session, SUMMARY indexes (per-block min/max) are built by default for
    all eligible scalar columns with no extra configuration needed.
  • Incremental build during writes: indexes are accumulated block-by-block
    during extend() and Arrow import, so closing the table costs almost nothing
    beyond the write already done.
  • Block-skip prefilter: the miniexpr prefilter uses SUMMARY bitmaps to skip
    entire blocks whose min/max range cannot satisfy the WHERE predicate, reducing
    decompression work for selective queries.
  • Conjunction support: per-column SUMMARY block masks are combined with
    bitwise AND so multi-column conjunctions prune blocks efficiently.
  • Cost gate: a cost model guards index use; the SUMMARY path is skipped when
    block skipping is unlikely to help (e.g. very low selectivity).
  • --no-summary-index: new CLI flag for parquet-to-blosc2 to disable
    automatic index creation on import.
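
How per-block min/max summaries prune a conjunction can be shown on two small columns. This is a conceptual sketch in pure Python with hypothetical helper names; it leaves out the cost gate and the bitmap encoding.

```python
def build_summary(values, block_len):
    """Per-block (min, max) pairs for one column."""
    return [
        (min(values[i:i + block_len]), max(values[i:i + block_len]))
        for i in range(0, len(values), block_len)
    ]


def blocks_maybe_gt(summary, threshold):
    """Mask of blocks that may contain a value greater than threshold."""
    return [hi > threshold for _, hi in summary]


def blocks_maybe_lt(summary, threshold):
    """Mask of blocks that may contain a value less than threshold."""
    return [lo < threshold for lo, _ in summary]


def scan(columns, predicate, mask, block_len):
    """Evaluate predicate only on rows of blocks whose mask entry is True."""
    nrows = len(next(iter(columns.values())))
    hits = []
    for b, keep in enumerate(mask):
        if not keep:
            continue  # whole block skipped: nothing is read or decompressed
        for row in range(b * block_len, min((b + 1) * block_len, nrows)):
            if predicate({name: col[row] for name, col in columns.items()}):
                hits.append(row)
    return hits


fare = [1, 2, 3, 4, 50, 60, 7, 8, 9, 10, 11, 12]
km = [9, 9, 9, 9, 1, 2, 9, 9, 1, 1, 1, 1]
block_len = 4

# WHERE fare > 20 AND km < 5: AND the per-column block masks together.
mask = [
    a and b
    for a, b in zip(
        blocks_maybe_gt(build_summary(fare, block_len), 20),
        blocks_maybe_lt(build_summary(km, block_len), 5),
    )
]
rows = scan(
    {"fare": fare, "km": km},
    lambda r: r["fare"] > 20 and r["km"] < 5,
    mask,
    block_len,
)
```

Only one of the three blocks survives the combined mask here, so two thirds of the rows are never touched.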

CTable column grid alignment

  • Shared chunk/block grid for scalar columns: fixed-size columns are now
    written on a shared chunk/block grid derived from the numeric column widths,
    so all columns have identical chunk boundaries. This makes multi-column
    SUMMARY scans and chunk-parallel reads significantly faster.
  • Chunk-aligned Arrow import: incoming Arrow/Parquet batches are buffered
    and flushed in exact chunk-sized blocks, so each chunk is compressed exactly
    once instead of being split across batch boundaries.
  • Vectorized dictionary-column import: dictionary codes are now written in
    bulk at full chunk capacity rather than element by element.
  • Small fixed strings on the grid: fixed-length string columns narrow enough
    to share the numeric grid are admitted to it, reducing the number of distinct
    chunk sizes.
  • --reduce-mem: new CLI option for parquet-to-blosc2 to cap the Arrow
    read-batch size on nested list<struct> imports, keeping peak RSS low at a
    modest speed cost.
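
The buffering behind a chunk-aligned import can be sketched as a generator that re-slices incoming batches into blocks of exactly one chunk. Plain lists stand in for Arrow batches, and rechunk is a hypothetical name.

```python
def rechunk(batches, chunk_len):
    """Re-slice arbitrary-length batches into blocks of exactly chunk_len rows."""
    pending = []
    for batch in batches:
        pending.extend(batch)
        while len(pending) >= chunk_len:
            yield pending[:chunk_len]
            pending = pending[chunk_len:]
    if pending:
        yield pending  # final, possibly short, block


batches = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
chunks = list(rechunk(batches, chunk_len=4))
```

Every full block can be compressed once and written as a whole chunk; only the last block may be short.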

CTable.copy() enhancements

  • C-level bulk copy for ListArray and BatchArray: a new chunk_copy()
    method transfers pre-compressed chunks directly at the C level, bypassing
    Python-level serialization and recompression. CTable.copy() uses this path
    automatically.
  • chunks= / blocks= overrides in CTable.copy(): callers can now
    specify target chunk and block sizes for the output copy.
  • cparams and blocks overrides: CTable.copy() accepts cparams and
    blocks to recompress the copy with different settings.
  • --chunks / --blocks added to the parquet-to-blosc2 CLI.

Take/gather APIs

  • Added NDArray.take() following Array API take shape semantics, including
    axis=None flattening and N-dimensional integer indices. One-dimensional
    gathers use a new sparse C-level path (b2nd_get_sparse_cbuffer) internally.
  • Extended top-level blosc2.take() to dispatch to NDArray.take(),
    CTable.take(), and Column.take() while preserving the input container
    type.
  • Added CTable.take() and Column.take() for logical row/value gathers that
    preserve order and duplicate indices, unlike mask-based views.
  • For ndim > 1 axis-based take, orthogonal selection is used internally for
    better performance.
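
The difference between a gather and a mask-based selection is easy to see on a small list. The two helpers below are hypothetical and work on plain Python lists; they only illustrate the ordering and duplicate semantics mentioned above.

```python
def take(values, indices):
    """Gather values at the given positions, in the order given."""
    return [values[i] for i in indices]


def mask_select(values, indices):
    """Select through a boolean mask built from the same indices."""
    mask = [False] * len(values)
    for i in indices:
        mask[i] = True
    return [v for v, keep in zip(values, mask) if keep]


values = ["a", "b", "c", "d"]
gathered = take(values, [3, 0, 3])
masked = mask_select(values, [3, 0, 3])
```

The gather keeps both the requested order and the repeated index, whereas the mask collapses duplicates and returns rows in storage order.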

where() and miniexpr acceleration

  • where(cond, x) via miniexpr: the one-value where (fill-with-zero
    variant) is now handled directly by the miniexpr engine when the condition is
    a boolean array, avoiding a numexpr round-trip.
  • where(cond, x, y) via miniexpr: the two-value form is likewise
    dispatched to miniexpr for element-wise conditional selection.
  • Sparse boolean mask fast path: when a boolean indexing result is very
    sparse (high selectivity), auto-detection switches to a fast gather path
    instead of a full-array scan.
  • Early boolean key check: NDArray.__getitem__ with a boolean array key
    now detects it before the general process_key / nonzero path, avoiding
    wasted work.
  • Compressed transient masks: temporary boolean masks created during
    queries are now stored as LZ4-compressed blosc2 arrays, reducing memory
    pressure without measurable speed regression.
  • BLOSC_ME_JIT / BLOSC_ME_JIT_TRACE: new environment variables to
    control and trace the miniexpr JIT backend at runtime.

CTable views and lazy sorting

  • sort_by() on a view is now lazy: calling sort_by() on a filtered view
    returns a position-reordered view without materializing data; the sort
    positions are cached and used directly on column access.
  • Lazy column materialization in filtered views: select() on a view no
    longer materializes unneeded columns eagerly; columns are resolved only when
    accessed.

NestedColumn and .info improvements

  • NestedColumn public class: the previously internal
    _NestedColumnNamespace has been renamed and promoted to NestedColumn,
    providing aggregate metadata (col_names, nrows, nbytes, cbytes,
    cratio) and a structured .info report over a group of dotted columns.
  • Uniform .info across containers: Column.info, CTable.info,
    NestedColumn.info, and related classes now follow a consistent field order
    (identity → shape/grid → sizes → content → compression params).

Context manager support for blosc2.open()

  • All objects returned by blosc2.open() (NDArray, SChunk, CTable,
    BatchArray, ListArray, and stores) now support the with statement.
    The __exit__ method flushes and closes the underlying storage.

Performance improvements and fixes

  • CTable.nrows stored persistently: row counts are written to metadata on
    close and read back on open, avoiding a full column scan at startup.
  • Index sidecar loading from .b2z: SUMMARY/BUCKET sidecars inside .b2z
    archives are now read in-place rather than extracted to a temporary directory,
    cutting open latency for indexed tables.
  • Compressed query cache: the hot query-result cache is now stored
    LZ4-compressed, reducing its memory footprint with negligible overhead.
  • Query cache consistency fixes: on-disk query cache side effects and a
    miniexpr chunk-cache race condition on Apple Silicon have been resolved.
  • macOS L2 floor for chunk sizing: on macOS the full L2 cache is used as a
    floor for automatic chunk sizing, giving better compression/speed trade-offs.
  • Better Apple Silicon L3 handling: missing L3 cache on Apple Silicon is
    handled more gracefully in the cache-size heuristic.
  • Table capacity management: large CTables grow more conservatively, and
    capacity is trimmed on close and after Arrow import to reclaim over-allocated
    space.
  • Faster iteration with iterchunks_info(): several hot loops switched to
    iterchunks_info() for lower overhead per chunk.
  • Cost-model index refinement threshold: the previously hardcoded threshold
    for switching between index and scan has been replaced with a data-driven cost
    model.
  • Index prefetch reuse: data already prefetched during an index lookup is
    reused in the refinement phase, avoiding redundant I/O.
  • Simplify index sidecar filenames in _indexes/{col}/ directories.
  • DictStore embed disabled by default: embedding a store inside a dict
    store is now opt-in (it was error-prone as the default).
  • Fixed wasm32 issue: a 32-bit platform arithmetic fix for reduce operations.
  • Chunks never exceed array dimension: compute_chunks_blocks now
    guarantees chunk dimensions are capped at the array shape dimension.
  • max_rows robust to older PyArrow: truncation logic no longer depends on
    PyArrow APIs that are absent in older releases.
  • cratio display: compression ratio is now shown with an explicit x
    suffix (e.g. 2.47x) throughout .info output.
  • Updated bundled C-Blosc2 to the latest release.

Dropped Python 3.10 support

  • Python 3.11 is now the minimum supported version.

Release 4.3.3

21 May 12:10

Changes from 4.3.1 to 4.3.3

Note: 4.3.2 was an internal pre-release that was not published to PyPI.

This is a maintenance release focused on CTable display ergonomics, indexed-query
correctness, and query-planner performance.

CTable display and print options

  • Pandas-like CTable display by default: str(table) / print(table) now use
    a compact, pandas/DuckDB-style table representation, including a displayed
    logical row index, numeric alignment, compact spacing, and a trailing footer
    such as [726017 rows x 5 columns].
  • Configurable display options: added blosc2.set_printoptions() and
    blosc2.get_printoptions() for CTable rendering. The supported options are
    display_index, display_rows, display_precision, and fancy.
  • CTable.to_string(): added a one-off formatting API for producing CTable
    string representations without changing global print options.
  • Compact truncation for large tables: when a table exceeds the configured
    display_rows threshold, only the first five and last five rows are shown,
    with an ellipsis row in between.
  • Float display refinements: compact mode uses pandas-like fixed precision
    for floating-point columns, and integer-valued float columns are displayed
    with a single decimal place.
  • Fancy display preserved: set_printoptions(fancy=True) restores the more
    decorated display with dtype rows, separator rules, and hidden row/column
    counts.
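
The head/tail truncation rule can be sketched as a function that returns which row indices to print. The names are hypothetical, and showing every row when the table has at most ten rows is an assumption added so that head and tail never overlap.

```python
def visible_rows(nrows, display_rows, edge=5):
    """Row indices to print; None marks the ellipsis row."""
    if nrows <= display_rows or nrows <= 2 * edge:
        return list(range(nrows))
    head = list(range(edge))
    tail = list(range(nrows - edge, nrows))
    return head + [None] + tail


def dimensions_footer(nrows, ncols):
    """Trailing footer in the style shown in these notes."""
    return f"[{nrows} rows x {ncols} columns]"


shown = visible_rows(726017, display_rows=20)
footer = dimensions_footer(726017, 5)
```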

Indexed queries and sorting

  • Cross-column exact index refinement: multi-column conjunctions can now use
    exact positions from a selective indexed column (FULL, PARTIAL, or OPSI)
    as a compact pre-filter, then refine the remaining predicates on those
    positions instead of scanning the full table.
  • NaN-safe index boundary navigation: fixed sorted-boundary navigation for
    floating-point indexes containing NaN values, so indexed results match scan
    results for bucket/full index lookups.
  • Better index-planner heuristics: the planner now avoids low-value indexed
    paths when segment pruning is unlikely to help, and avoids expensive scalar
    specialization for non-scalar arrays.
  • Faster filtered sorting: small filtered views can be materialized and
    sorted directly, avoiding an extra gather of sort keys.

Performance and fixes

  • Avoid full materialization of valid_rows in several CTable code paths.
  • Keep row counts lazy for views and avoid unnecessary nrows calls in the
    query planner.
  • Reduced overhead in root-column iteration and small query-planner operations.
  • Fixed dictionary-column capacity handling during Arrow import and a regression
    affecting dictionary columns.
  • Marked additional long-running tests as heavy to reduce default test-suite
    runtime.

Documentation

  • Updated the containers tutorial with dedicated ListArray and CTable
    sections, including CTable's columnar storage model and support for columns
    backed by NDArray, BatchArray, ObjectArray, ListArray, and related
    containers.

Release 4.3.1

19 May 17:38

Changes from 4.3.0 to 4.3.1

This is a maintenance release focused on CTable nested-column ergonomics,
grouped reductions, and API/documentation polish.

CTable nested columns and grouped reductions

  • Nested column names in group_by() results: grouped output columns can now
    preserve dotted/nested names such as trip.sec instead of requiring valid
    Python identifiers.
  • Column-object selectors: CTable.group_by() and CTable.sort_by() now
    accept Column objects as well as string names, enabling idioms such as
    t.group_by(t.trip.sec) and t.sort_by(t.trip.sec).
  • Grouped arg reductions: CTableGroupBy now supports argmin() and
    argmax(), plus agg({"col": "argmin"}) / agg({"col": "argmax"}).
    Results are logical row positions in the grouped table or view; groups with no
    non-null values return -1.
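
The grouped argmin rule, including the -1 result for groups without non-null values, can be sketched in pure Python. Representing nulls as None and keeping the first row on ties are assumptions of this sketch, and grouped_argmin is a hypothetical name.

```python
def grouped_argmin(keys, values):
    """Per group, the row position of the smallest non-null value, or -1."""
    best = {}
    for row, (key, value) in enumerate(zip(keys, values)):
        if key not in best:
            best[key] = -1  # group seen, no non-null value yet
        if value is None:
            continue
        current = best[key]
        if current == -1 or value < values[current]:
            best[key] = row
    return best


keys = ["a", "b", "a", "c", "b"]
values = [3.0, None, 1.0, None, 2.0]
positions = grouped_argmin(keys, values)
```

The returned positions index rows of the table being grouped, so they can be fed to a gather to fetch the full winning rows.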

NDArray constructor ergonomics

  • blosc2.array(): added a NumPy-like constructor for NDArrays. It mirrors
    blosc2.asarray() but defaults to copy=True, so passing an existing
    NDArray creates a copy unless copy=False or copy=None is requested.

Documentation

  • Expanded the CTable reference with RowTransformer, Column.row_transformer,
    and CTableGroupBy.argmin / argmax documentation.
  • Added blosc2.ndarray(), blosc2.dictionary(), and related public schema
    factory functions to the Schema Specs reference.
  • Moved blosc2.group_reduce() into the Reduction Functions reference and
    updated its example to use Blosc2 NDArrays.

Release 4.3.0

18 May 16:55

Changes from 4.2.0 to 4.3.0

CTable: N-dimensional (ndarray) columns

  • Multidimensional columns: CTable columns can now hold NDArray-backed cells, allowing
    each row of a column to contain a full n-dimensional compressed array. This enables
    use cases such as embedding vectors, image patches, time-series windows, or any other
    multidimensional per-row payload.
  • CSV and DataFrame import/export: Multidimensional column data can be imported and
    exported via CSV and pandas DataFrames, with automatic detection of array-valued cells.
  • Nullable ndarray columns: Multidimensional columns fully support the nullable
    semantics (null_count, sentinel handling, null_policy) already available for scalar
    columns.
  • from_pandas() improvements: CTable.from_pandas() now creates the correct
    specialized backing storage for DictionarySpec, ListSpec, VLStringSpec,
    VLBytesSpec, and other variable-length scalar specifications.
  • Improved schema coverage: New CTable timestamp schema type and extended
    Column.info output with shape, chunks, and blocks descriptors.
  • Arg reductions: Added argmin() and argmax() for scalar and ndarray
    CTable columns, plus row-transformer support for generated columns such as
    per-row peak-hour or dominant-embedding-dimension features.

CTable: Group-by and filtered aggregation

  • CTable.group_by(): The primary group-by interface. Call
    t.group_by("city", sort=True).agg({"qty": "mean"}) to produce a new
    CTable with aggregated results. Single-key and multi-key groupings are
    supported, along with convenience methods such as .size(), .count(),
    .sum(), .mean(), .min() and .max():

    by_city = t.group_by("city", sort=True)
    by_city.size()  # COUNT(*)
    by_city.sum("sales")  # SUM(sales) per city
    by_city.agg({"sales": ["sum", "mean"]})  # SUM(sales), AVG(sales) per city
    
  • Performance accelerators: Dedicated Cython fast paths deliver significant speedups:
    ~25× for float32/64 group-by keys, ~8× for integer and dictionary-code keys, and a
    general-purpose hash table for arbitrary float keys.

  • Filtered aggregate pushdown: The where= parameter is now accepted in aggregation
    methods, pushing the filter into the compute engine so that only matching rows are
    read and reduced.

  • Persistent grouped output: Group-by results can be saved directly to persistent
    storage via the urlpath= parameter.

  • blosc2.group_reduce(): New public function that performs group-by reduction over
    NDArray instances and CTable columns, with Cython-accelerated backends for common
    key/reduction combinations.
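
As a rough illustration of what a filtered group-by reduction computes, here is a pure-Python sketch. It is not the library implementation: `toy_group_reduce` and its boolean-mask `where` argument are hypothetical stand-ins, and the real where= parameter takes the library's own expression syntax.

```python
from collections import defaultdict


def toy_group_reduce(keys, values, op="sum", where=None):
    """Reduce values per key, skipping rows whose mask entry is False."""
    groups = defaultdict(list)
    for pos, (key, value) in enumerate(zip(keys, values)):
        # The filter is applied while scanning, so rejected rows never
        # reach the reduction and no filtered copy is built first.
        if where is not None and not where[pos]:
            continue
        groups[key].append(value)
    reducers = {
        "sum": sum,
        "min": min,
        "max": max,
        "count": len,
        "mean": lambda vals: sum(vals) / len(vals),
    }
    reduce_fn = reducers[op]
    # Sorting the keys mimics a sorted group-by result.
    return {key: reduce_fn(vals) for key, vals in sorted(groups.items())}


city = ["Oslo", "Rome", "Oslo", "Rome", "Oslo"]
sales = [10, 20, 30, 40, 50]
mask = [s >= 20 for s in sales]

print(toy_group_reduce(city, sales, "sum"))              # {'Oslo': 90, 'Rome': 60}
print(toy_group_reduce(city, sales, "sum", where=mask))  # {'Oslo': 80, 'Rome': 60}
```

In this sketch, a group with no matching rows simply does not appear in the result.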

CTable: Dictionary / categorical columns

  • DictionarySpec column type: Introduced a new dictionary-encoded (categorical)
    column type that stores string or integer codes mapped to a shared dictionary, providing
    compact storage and accelerated equality and membership queries.
  • Dictionary types in where clauses: Dictionary columns can be queried with the same
    where= expression syntax as other column types, including nested dotted-name access.
  • Improved display: CTable printing now adapts to the terminal width, and dictionary
    values are shown in their decoded form. Column.info has been extended with type
    details, shape, chunks, and blocks.
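
The general idea behind dictionary encoding, and why equality queries get cheaper, can be sketched in plain Python. This is a conceptual illustration only; the helper names are hypothetical and the sketch says nothing about how DictionarySpec lays out its data.

```python
def dictionary_encode(values):
    """Replace each value with an integer code into a shared dictionary."""
    dictionary = []  # code -> value, in first-seen order
    index = {}       # value -> code
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return codes, dictionary


def rows_equal_to(codes, dictionary, target):
    """Find matching rows by comparing integer codes, not the values."""
    try:
        code = dictionary.index(target)  # one lookup in the small dictionary
    except ValueError:
        return []  # value never stored, so no row can match
    return [pos for pos, c in enumerate(codes) if c == code]


payment = ["cash", "card", "cash", "app", "card"]
codes, dictionary = dictionary_encode(payment)
print(codes)                                     # [0, 1, 0, 2, 1]
print(dictionary)                                # ['cash', 'card', 'app']
print(rows_equal_to(codes, dictionary, "card"))  # [1, 4]
```

Decoding for display is then a lookup per code, for example `[dictionary[c] for c in codes]`.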

CTable: Nested columns and field-name escaping

  • Dotted nested column access: Columns whose names contain literal .
    (e.g., "root.nested") are now fully addressable via the dotted accessor syntax in
    where expressions, __getitem__, and the public API.
  • Hierarchical _cols storage paths: The internal column storage layout now preserves
    a hierarchical structure that mirrors the logical nesting, improving introspection
    and interop.
  • Nested-field pipeline: A new flattened-storage pipeline with logical mapping
    preserves nested schema structure (field names, types, and hierarchy) through
    Arrow and Parquet import/export. For unnamed top-level list<struct<...>> Parquet
    files, the logical schema round-trips faithfully, though the original physical row
    grouping is intentionally not preserved.
  • Field-name escaping: Special characters (. and /) in column names are
    automatically escaped during schema construction and metadata round-trips.
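
To see why escaping is needed when "." and "/" double as hierarchy separators, consider the sketch below. The percent-style scheme and both function names are hypothetical, chosen only for illustration; the notes above do not state which escape scheme the library uses.

```python
# Hypothetical escape table: "%" is escaped too so the mapping is reversible.
_ESCAPES = {"%": "%25", ".": "%2E", "/": "%2F"}
_UNESCAPES = {token: char for char, token in _ESCAPES.items()}


def escape_field_name(name):
    """Escape separator characters so a name is safe as one path component."""
    return "".join(_ESCAPES.get(char, char) for char in name)


def unescape_field_name(escaped):
    """Invert escape_field_name by scanning for three-character tokens."""
    out = []
    i = 0
    while i < len(escaped):
        token = escaped[i:i + 3]
        if token in _UNESCAPES:
            out.append(_UNESCAPES[token])
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return "".join(out)


# A field literally named "a.b" nested under "trip" stays one path component.
path = "/".join(escape_field_name(part) for part in ["trip", "a.b"])
print(path)  # trip/a%2Eb
```

Without escaping, the same field would be indistinguishable from a field "b" nested under "trip" and "a".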

Parquet import/export improvements

  • Arrow serializer by default: CTable.from_parquet() now defaults to the Arrow
    serializer, providing better schema fidelity and nested-type support.
  • Progress reporting: A --progress flag and an ETA estimator have been added to
    the parquet-to-blosc2 CLI for long-running imports.
  • --max-rows parameter: CTable.from_parquet() and the CLI now accept max_rows
    to limit the number of imported rows.
  • --timestamp-unit: New CLI option to control timestamp unit conversion on import.
  • --float-trunc-prec: New CLI option to truncate floating-point precision on import.
  • Separated nested columns enabled by default: The separate_nested_cols flag is now
    True by default for both the Python API and the CLI, ensuring nested Arrow structs
    are always expanded into flat columns.
  • list_serializer parameter: New option to control how list-type columns are
    serialized, with sensible defaults for different list layouts.
  • Validation optimizations: Arrow datetime values are validated only during import,
    reducing runtime overhead on subsequent operations.
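
Precision truncation usually means zeroing the low mantissa bits of each float so the data compresses better. The sketch below shows that technique for float32 with the standard library only. `trunc_float32` and its `keep_bits` argument are hypothetical, and how --float-trunc-prec interprets its value is not specified above.

```python
import struct


def trunc_float32(x, keep_bits):
    """Zero all but the top `keep_bits` of the 23 float32 mantissa bits."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the sign, the exponent and the leading mantissa bits.
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return out


print(trunc_float32(1.5, 1))  # 1.5 (the single mantissa bit survives)
print(trunc_float32(1.5, 0))  # 1.0 (the whole mantissa is dropped)
```

Truncated values share long runs of zero bits, which is what makes them easier to compress; the cost is a relative error below 2**-keep_bits.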

TreeStore: Inline CTable support

  • CTables inside TreeStore: CTable objects can now be stored inline as items
    inside a TreeStore, enabling hierarchical storage that mixes arrays and tables in a
    single persistent container.
  • Cache hardening: TreeStore cache assignments now use defensive copies and cache
    effective object roots to avoid aliasing and stale-cache errors.
  • Examples and tutorials: New tutorials and docstring examples demonstrate how to
    store, retrieve, and query CTables within a TreeStore.

Performance and usability enhancements

  • Faster open and import: blosc2.open() and store constructors now assume valid
    file extensions and defer column metainfo loading, making CTable.open() and
    package import noticeably faster.
  • CTable.nrows is now lazy: The row count is computed on demand rather than eagerly,
    speeding up open and schema-inspection workflows.
  • Accelerated scalar and small-slice access: The batch/list path for reading scalar
    values or small column slices has been overhauled, eliminating internal placeholder
    materialization and yielding lower latency.
  • Late-import optimizations: Heavy optional dependencies are imported lazily at the
    blosc2 package level, reducing the baseline import blosc2 overhead.
  • iter_arrow_batches() optimization: Avoids full Python object materialization of
    batches during iteration, reducing memory pressure.
  • NDArray-to-list conversion: Small optimization when converting NDArray objects
    to Python lists.
  • _last_pos invalidation skipped: Mid-table deletes no longer eagerly invalidate
    cached positional state, improving delete latency.

Documentation, examples and benchmarks

  • API reference expanded: blosc2.group_reduce() has been added to the Sphinx
    reference, along with updated CTable, Column, and TreeStore pages.
  • New tutorials and examples: Added sections on CTable–TreeStore integration,
    nested fields, dictionary columns, aggregates, grouping and querying with where=.
  • New benchmarks: Graph benchmarks for CTable insert time, column count, memory usage,
    and where= queries, plus dedicated group-by, nested-filter, and Parquet round-trip
    benchmarks.

Fixes and compatibility

  • Null and NaN handling: NumPy scalar null sentinels are now normalized to plain Python
    scalars, and floating-point NaN sentinels are treated consistently with Python
    float('nan').
  • Empty aggregate results: Filtered aggregations that produce no rows now handle the
    empty result gracefully.
  • Generated column safety: Accessing a stalled (unfillable) generated column now raises
    a clear exception instead of producing undefined results.
  • Miniexpr bundling: Miniexpr’s bundled libtcc and related runtime files are now
    kept inside the blosc2 package, avoiding conflicts with other TCC installations.
  • Test improvements: Torch-dependent tests are marked as heavy, PyArrow-optional
    tests are skipped when the library is absent, and parametrization matrices have been
    trimmed to reduce CI time.
  • Missing Cython validation: Added validation guards for several Cython extension
    functions that previously lacked explicit error checking.
  • C-Blosc2 update: Bundled C-Blosc2 has been updated to the latest version (3.0.3).
  • blosc2.open() default mode changed from 'a' to 'r': Removed the FutureWarning that
    was added to prepare for this transition.

Release 4.2.0

07 May 11:38

Changes from 4.1.2 to 4.2.0

CTable: columnar compressed tables

  • Introduced blosc2.CTable, a new columnar table container for compressed, typed columns. CTables support dataclass- and schema-based construction, row iteration, column access, table views, head() / tail() / sample(), sorting, selection and compact where expressions.
  • Added persistent CTables backed by TreeStore, with support for blosc2.open(), CTable.open(), CTable.load(), CTable.save(), CTable.to_b2d() and CTable.to_b2z(). CTable views can be saved too, and .b2z/.b2d path handling has been tightened.
  • Added mutation operations for CTables, including append(), extend(), delete(), compact(), add_column(), drop_column(), rename_column() and related schema validation.
  • Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
  • Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean LazyExpr/NDArray masks in CTable.__getitem__, iter_sorted() and indexing support for .b2z tables.
  • Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
  • Added variable-length CTable column support via ListArray / ObjectArray, including vlstring and vlbytes schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
  • Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, CTable.from_arrow_batches() improvements and a new parquet-to-blosc2 CLI utility.
  • Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.

Indexing and ordering

  • Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
  • Added blosc2.Index as the unified public index handle, plus APIs such as create_index(), compact_index(), iter_sorted(), will_use_index() and related query explanation support.
  • Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
  • Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
  • Added blosc2.argsort() and refactored indexing APIs around explicit index enums and sorting helpers.
  • Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
  • Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding tmpdir support for full out-of-core indexes.
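
FIFO pruning with size accounting, as mentioned for the query-result cache, can be sketched as follows. This is an in-memory toy, not the persistent cache itself: `FifoResultCache` and its byte-based budget are assumptions made for illustration.

```python
from collections import OrderedDict


class FifoResultCache:
    """Keep query results within a byte budget, evicting oldest entries first."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0  # running total of stored payload sizes
        self._entries = OrderedDict()  # insertion order doubles as eviction order

    def put(self, query, payload):
        # Re-inserting a query drops the old payload and counts as a new entry.
        if query in self._entries:
            self.used_bytes -= len(self._entries.pop(query))
        if len(payload) > self.max_bytes:
            return False  # a payload larger than the budget is never stored
        self._entries[query] = payload
        self.used_bytes += len(payload)
        while self.used_bytes > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)  # oldest first
            self.used_bytes -= len(evicted)
        return True

    def get(self, query):
        # Reads do not refresh an entry's age: this is FIFO, not LRU.
        return self._entries.get(query)


cache = FifoResultCache(max_bytes=10)
cache.put("q1", b"aaaa")
cache.put("q2", b"bbbb")
cache.get("q1")           # reading q1 does not protect it from eviction
cache.put("q3", b"cccc")  # 12 bytes would exceed the budget, so q1 is pruned
print(cache.get("q1"), cache.used_bytes)  # None 8
```

FIFO keeps the bookkeeping trivial because eviction order never changes after insertion, at the price of sometimes pruning entries that are still in frequent use.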

Persistence, stores and serialization

  • Added structured Blosc2 serialization based on b2object carriers, including persisted C2Array, LazyExpr and DSL LazyUDF objects.
  • Added blosc2.Ref for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
  • Added blosc2.load() as a convenience loader.
  • Added vlmeta support to LazyArray objects.
  • Improved store handling by preserving lazy b2object carriers in DictStore, allowing reopened proxies to refill caches after read-only opens, relaxing DictStore/TreeStore suffix requirements and adding DictStore.to_b2d().
  • Accelerated blosc2.open() by trying standard opens first and warning on implicit append mode.

Arrays, computation and containers

  • Added ObjectArray for fully general object data and renamed the earlier VLArray work accordingly; added ListArray docstrings and Arrow integration improvements.
  • Added schema helpers including numeric specs, blosc2.struct() and blosc2.object() for nested/fully general column declarations.
  • Improved fromiter() with direct chunked construction and substantially lower peak memory use.
  • Improved asarray() behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
  • Added SChunk.reorder_offsets().
  • Improved BatchArray defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
  • Continued matmul/linalg optimization work and shared-thread-pool integration.

CLI, docs and examples

  • Added the parquet-to-blosc2 command with options such as --max-rows, --parquet-batch-size, --blosc2-items-per-block and --use-dict.
  • Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
  • Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
  • Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.

Fixes and compatibility

  • Updated bundled C-Blosc2 to v3.0.2 and require C-Blosc2 >= 3.0.0 when building against a system library.
  • Updated bundled C-Blosc2 and miniexpr sources multiple times.
  • Restored compatibility with NumPy < 2.
  • Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
  • Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small /tmp.
  • Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
  • Fixed .b2z double-open corruption caused by GC-triggered repacking and made temporary .b2z unpacking default to the source file directory.
  • Fixed a regression when reopening persisted proxies in read-only mode.
  • Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
  • Fixed lazy-chunk source-size handling in decode/getitem callers.
  • Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
  • Fixed arange() regressions and several pre-existing set_slice error-handling issues.
  • Clamped indexing/thread defaults for wasm32.