
feat: read and write zarr methods to replace read_pickle #1908

Open
gampnico wants to merge 5 commits into OGGM:master from gampnico:fix-pickle-doomsday

@gampnico
Contributor

@gampnico gampnico commented Apr 30, 2026

This PR currently adds:

  • read_store, read_zarr and write_zarr methods which fall back to pickle if no zarr store is available
  • Warnings that read_pickle methods will be deprecated for future GlacierDirectories.
  • Zarr as a core dependency.
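
For illustration, a deprecation notice of the kind listed above could be emitted with Python's standard `warnings` module. This is only a minimal sketch, not the code in this PR: the class name `GlacierDirectorySketch`, the `.pkl` file layout, the warning category and the message wording are all assumptions.

```python
import pickle
import warnings
from pathlib import Path


class GlacierDirectorySketch:
    """Hypothetical stand-in for a glacier directory; not OGGM's class."""

    def __init__(self, directory):
        self.dir = Path(directory)

    def read_pickle(self, filename):
        # stacklevel=2 attributes the warning to the caller, so users see
        # which line of their own code still relies on pickles.
        warnings.warn(
            "read_pickle will be deprecated for future glacier directories; "
            "use read_store instead.",
            FutureWarning,
            stacklevel=2,
        )
        # Assumed layout: one uncompressed pickle per filename.
        with open(self.dir / f"{filename}.pkl", "rb") as f:
            return pickle.load(f)
```

`FutureWarning` is shown by default to end users, whereas `DeprecationWarning` is hidden outside of `__main__` and test runners, so the choice depends on who should see the message.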

Refs: #1903

Points for discussion

All pickles are now stored as xr.DataTrees in a single zarr file, which I'm currently naming "data_store". The filename in read_zarr(filename) therefore selects a group within that store, rather than the entire zarr file. This avoids having multiple small zarr files, and we can maintain cross-compatibility with DTCG.

The default replacement for read_pickle is now read_store, which handles both pickle (read_pickle) and zarr (read_zarr). read_store falls back to read_pickle if no zarr store exists. This preserves backwards compatibility with existing gdirs if a user selects an older URL.
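
The fallback described above might look roughly like the following. It is a sketch under stated assumptions rather than the PR's implementation: `read_store_sketch`, the `read_group` callable and the `.pkl` file layout are hypothetical, and only the store name "data_store" comes from the description.

```python
import pickle
import tempfile
from pathlib import Path

STORE_NAME = "data_store"  # store name taken from the PR description


def read_store_sketch(gdir_path, filename, read_group):
    """Read `filename` from the zarr store if present, else from a pickle.

    `read_group(store_path, group)` is a hypothetical callable that opens
    one group of the store; the `.pkl` file layout is also an assumption.
    """
    gdir_path = Path(gdir_path)
    store_path = gdir_path / STORE_NAME
    if store_path.exists():
        return read_group(store_path, filename)
    # No store: this is an older glacier directory, so use the pickle.
    with open(gdir_path / f"{filename}.pkl", "rb") as f:
        return pickle.load(f)


# Usage: an "old" directory that only contains a pickle.
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "model_flowlines.pkl", "wb") as f:
        pickle.dump([1.0, 2.0], f)
    legacy = read_store_sketch(tmp, "model_flowlines", read_group=None)
```

Checking for the store first means callers never need to know which format a given directory uses.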

Zarr does not support all the data structures used by certain pickles (e.g. shapely objects). oggm-zarr converts these to a compatible type. When using read_store, these objects are converted back into the type expected by OGGM. This should minimise rewrites across the rest of the codebase.
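
As a rough illustration of that round trip, the sketch below converts an object to plain, store-friendly types on write and rebuilds it on read. How oggm-zarr does this is not shown in this thread: the `Line` class is a hypothetical stand-in for a geometry object such as a shapely line, and `to_store_types`, `from_store_types` and the `"__type__"` tag are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Hypothetical stand-in for a geometry object such as a shapely line."""

    coords: tuple


def to_store_types(obj):
    """Convert an object into plain lists and dicts before writing."""
    if isinstance(obj, Line):
        # Tag the payload so the reader knows which type to rebuild.
        return {"__type__": "Line", "coords": [list(c) for c in obj.coords]}
    return obj


def from_store_types(data):
    """Rebuild the original object from its tagged, store-friendly form."""
    if isinstance(data, dict) and data.get("__type__") == "Line":
        return Line(coords=tuple(tuple(c) for c in data["coords"]))
    return data


line = Line(coords=((0.0, 0.0), (1.0, 2.0)))
encoded = to_store_types(line)        # only dicts, lists, floats and strings
restored = from_store_types(encoded)  # compares equal to the original Line
```

Keeping both directions in one place is what lets the rest of the codebase continue to receive the types it already expects.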

I'd like to upload a small sample zarr to oggm-sample-data, so this can be included in init_hef, and I can then add tests for reading a zarr from a gdir directly.

I'm currently keeping the code for converting from pickle to zarr in a separate dual-licensed package, as some of it includes code from DTCG which is not compatible with OGGM's license. I'll see if I can rework this, but it may be simpler to handle conversion via dtcg, since this is already set up and we can build directly on the existing GeoZarrHandler.

You can view a conversion workflow here.

Closes #1903

  • Tests added/passed
  • Fully documented
  • Entry in whats-new.rst

Adds:
  - `read_zarr` and `write_zarr` methods which fall back to pickle if no zarr store is available.
  - Warnings that read_pickle methods will be deprecated for future GlacierDirectories.
  - Zarr as a core dependency.

Refs: OGGM#1903
@fmaussion
Member

Thanks! Looking promising

A few quick thoughts:

  • I don't understand the PR yet - it does not touch the code that needs to be changed, i.e. the places where the pickles are actually written?
  • the read_zarr, read_store and _validate_store are written with the assumption that they are just replacing read_pickle - I think these may become more generic in the future. Maybe we need to rename them according to what they do. Alternatively they could be external functions (not methods of gdir) because they are dealing only with some specific data formats within the oggm workflow - to discuss.
  • I do not want a new oggm-zarr package unless strictly necessary / there are good arguments for it - we can deal with the license like usual, by including the license in OGGM. This could be a module in OGGM if needed
  • It should not be necessary to resort to oggm-sample-data for tests, as OGGM writes and reads pickles (there are no pickles in oggm-sample-data either). In theory, since the change will be mostly internal, the existing tests should "just work" after the conversion to zarr (except those where read_pickle is used), and we need to add a test for backwards compatibility, although I think test_start_from_level_3 will cover a little bit of this

@gampnico
Contributor Author

gampnico commented May 1, 2026

Notes based on internal meeting:

> I don't understand the PR yet - it does not touch the code that needs to be changed, i.e. the places where the pickles are actually written?

Next on agenda.

> the read_zarr, read_store and _validate_store are written with the assumption that they are just replacing read_pickle - I think these may become more generic in the future. Maybe we need to rename them according to what they do. Alternatively they could be external functions (not methods of gdir) because they are dealing only with some specific data formats within the oggm workflow - to discuss.

Agreed. _validate_store converts data to and from zarr so that it is structured in a way that OGGM recognises. As discussed, we may need separate helper functions for writing specific files (e.g. write_model_flowlines). In pseudo-code:

model_flowlines: Centerline = Centerline(...)
write_store(model_flowlines, "model_flowlines")
model_flowlines: Centerline = read_store("model_flowlines")

where internally read_store and write_store both do something like:

def write_store(self, obj, filename, filesuffix, **kwargs):
    # converts shapely objects etc. into zarr-compatible types
    data: xr.DataTree = _validate_store(obj)
    # the filename is used as the group name within the store
    write_zarr(data, group=filename, **kwargs)

> I do not want a new oggm-zarr package unless strictly necessary / there are good arguments for it - we can deal with the license like usual, by including the license in OGGM. This could be a module in OGGM if needed

Zarr-related code will be placed under utils as a public module. I'd prefer to keep read_store etc. under _workflow, as this is where read_pickle is located.

Development

Successfully merging this pull request may close these issues.

Get rid of pickles, one pickle at a time