Skip to content

Add opera_subset: spatial subsetting tool for OPERA product files#1

Closed
Copilot wants to merge 3 commits into
mainfrom
copilot/tool-for-subsetting-opera-files
Closed

Add opera_subset: spatial subsetting tool for OPERA product files#1
Copilot wants to merge 3 commits into
mainfrom
copilot/tool-for-subsetting-opera-files

Conversation

Copy link
Copy Markdown

Copilot AI commented Apr 10, 2026

New Python package (opera_tools) providing a tool to clip OPERA SAR/InSAR product files to a geographic bounding box.

Package structure

  • src/opera_tools/subset.py — core subsetting logic + CLI (opera_subset)
  • setup.py / requirements.txt — packaging with opera_subset console entry point

Supported formats

  • GeoTIFF (.tif, .tiff) via rasterio — handles CRS reprojection automatically
  • HDF5 (.h5, .hdf5, .nc) via h5py — discovers lat/lon coordinate datasets and slices all spatial dimensions

API

from opera_tools import subset_opera_file, subset_opera_files

# Single file
subset_opera_file("RTC_product.tif", "out.tif", bbox=(-118.5, 33.5, -117.5, 34.5))

# Batch
subset_opera_files(["a.tif", "b.h5"], output_dir="./subset/", bbox=(-118.5, 33.5, -117.5, 34.5))

CLI

opera_subset --bbox -118.5 33.5 -117.5 34.5 -o ./subset/ *.tif *.h5

Returns non-zero exit code and warns when a file's extent does not overlap the requested bbox.

Original prompt

Tool for subsetting opera files

Summary by Sourcery

Add a new opera_tools package providing a CLI and Python API for spatially subsetting OPERA GeoTIFF and HDF5 product files to a geographic bounding box.

New Features:

  • Expose subset_opera_file and subset_opera_files functions in a new opera_tools package for programmatic subsetting of OPERA GeoTIFF and HDF5 products.
  • Provide an opera_subset command-line tool to subset one or more OPERA product files to a specified WGS84 bounding box with configurable output directory and filename suffix.

Enhancements:

  • Document installation, CLI usage, and Python API examples for the new OPERA subsetting tooling in the README.

Build:

  • Introduce basic packaging and dependency configuration via setup.py and requirements.txt.

Tests:

  • Add a comprehensive test suite for GeoTIFF and HDF5 subsetting, dispatcher behavior, batch processing, and CLI exit codes using synthetic test data.

Chores:

  • Add a .gitignore file to exclude common build artifacts and compiled files from version control.

Copilot AI and others added 2 commits April 10, 2026 15:54
Agent-Logs-Url: https://github.com/geodesymiami/Opera_tools/sessions/6bc9539d-5fbf-42db-8800-c24731ce628e

Co-authored-by: disilvestro <142009043+disilvestro@users.noreply.github.com>
@disilvestro disilvestro marked this pull request as ready for review April 10, 2026 15:55
Copilot AI changed the title [WIP] Add tool for subsetting opera files Add opera_subset: spatial subsetting tool for OPERA product files Apr 10, 2026
Copilot AI requested a review from disilvestro April 10, 2026 15:56
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Apr 10, 2026

Reviewer's Guide

Introduces a new opera_tools Python package providing a CLI and Python API for spatially subsetting OPERA GeoTIFF and HDF5 products to a geographic bounding box, including format-specific subsetting logic, HDF5 helpers, argument parsing, and a comprehensive test suite plus documentation and packaging files.

Sequence diagram for opera_subset CLI subsetting flow

sequenceDiagram
    actor User
    participant CLI as opera_subset
    participant Parser as _build_parser
    participant API as subset_opera_files
    participant Single as subset_opera_file
    participant GTiff as subset_geotiff
    participant HDF5 as subset_hdf5
    participant HLatLon as _find_latlon_datasets
    participant HCopy as _copy_hdf5_subset

    User->>CLI: invoke with args (files, --bbox, --outdir, --suffix)
    CLI->>Parser: build ArgumentParser
    Parser-->>CLI: parser instance
    CLI->>CLI: parse_args(argv)
    CLI->>API: subset_opera_files(files, outdir, bbox, suffix)

    loop for each input_file
        API->>Single: subset_opera_file(input_file, output_file, bbox)
        alt GeoTIFF extension
            Single->>GTiff: subset_geotiff(input_file, output_file, bbox)
            GTiff->>GTiff: validate bbox and file
            GTiff->>GTiff: reproject bbox if file CRS != WGS84
            GTiff->>GTiff: compute window and read data
            GTiff->>GTiff: write subset GeoTIFF
            GTiff-->>Single: True or False
        else HDF5 extension
            Single->>HDF5: subset_hdf5(input_file, output_file, bbox)
            HDF5->>HLatLon: _find_latlon_datasets(h5file)
            HLatLon-->>HDF5: lat_dataset, lon_dataset
            HDF5->>HDF5: compute row and column indices in bbox
            HDF5->>HCopy: _copy_hdf5_subset(src, dst, row_start, row_stop, col_start, col_stop)
            HCopy-->>HDF5: subset written
            HDF5-->>Single: True or False
        end
        Single-->>API: success flag
        alt success is False
            API->>CLI: print warning to stderr for this file
        else success is True
            API->>API: append output_file to written list
        end
    end

    API-->>CLI: list of written output files
    alt written is not empty
        CLI->>User: print count and paths of written files
    else written is empty
        CLI->>User: print no output files message
        CLI->>CLI: exit with status 1
    end
Loading

Class diagram for opera_tools package and subset module API

classDiagram
    class opera_tools {
        <<module>>
        +subset_opera_file(input_file, output_file, bbox) bool
        +subset_opera_files(input_files, output_dir, bbox, suffix) list
    }

    class subset {
        <<module>>
        +subset_geotiff(input_file, output_file, bbox) bool
        +subset_hdf5(input_file, output_file, bbox) bool
        +subset_opera_file(input_file, output_file, bbox) bool
        +subset_opera_files(input_files, output_dir, bbox, suffix) list
        +main(argv) void
        +_build_parser() ArgumentParser
        +_check_bbox(bbox) void
        +_find_latlon_datasets(h5file) tuple
        +_copy_hdf5_subset(src, dst, row_start, row_stop, col_start, col_stop) void
    }

    opera_tools ..> subset : reexports

    class rasterio {
        <<external library>>
    }

    class h5py {
        <<external library>>
    }

    class numpy {
        <<external library>>
    }

    subset ..> rasterio : used by subset_geotiff
    subset ..> h5py : used by subset_hdf5
    subset ..> numpy : used for array operations
Loading

File-Level Changes

Change Details Files
Implement core GeoTIFF and HDF5 subsetting logic and public API for OPERA products.
  • Add bbox validation helper to enforce coordinate ordering and ranges.
  • Implement GeoTIFF subsetting using rasterio windows, optional CRS reprojection of bbox, and metadata-preserving writes.
  • Implement HDF5 subsetting that locates lat/lon datasets, derives row/column index ranges, subsets data/coordinate arrays, and preserves attributes.
  • Introduce dispatcher function to route files by extension to the appropriate subsetting implementation.
  • Add batch subsetting helper that builds output filenames with suffixes, writes to an output directory, and logs skipped files to stderr.
src/opera_tools/subset.py
Provide a command-line interface for batch subsetting of OPERA files.
  • Add argparse-based parser builder describing supported formats and arguments, with examples in the epilog.
  • Implement main() entry point that parses bbox and file arguments, calls batch subsetting, prints written outputs, and exits non-zero when nothing is written.
src/opera_tools/subset.py
Expose a minimal, user-friendly Python package API and declare runtime dependencies.
  • Create package init that re-exports subset_opera_file and subset_opera_files as the public API.
  • Add requirements.txt specifying numpy, rasterio, and h5py as dependencies.
  • Add (stubbed) setup.py for package installation/entry points (content not fully shown in diff).
src/opera_tools/__init__.py
requirements.txt
setup.py
Document installation, CLI usage, and Python API for the new subsetting tool.
  • Expand README with project description, supported formats, installation instructions, CLI usage examples, argument documentation, Python API examples, dependencies, and test-running instructions.
README.md
Add comprehensive tests covering bbox validation, GeoTIFF/HDF5 subsetting behavior, dispatcher, batch processing, and CLI.
  • Add helpers to create synthetic GeoTIFF and HDF5 files used across tests.
  • Test bbox validation edge cases and error messages.
  • Test GeoTIFF subsetting success, non-overlap behavior, error handling, and directory creation.
  • Test HDF5 subsetting including spatial filtering, non-overlap behavior, and attribute preservation.
  • Test dispatcher logic for supported extensions and error on unsupported ones.
  • Test batch subsetting behavior, custom suffixes, and skipping non-overlapping files.
  • Test CLI integration paths for success, non-overlap (non-zero exit), and custom suffix handling.
tests/test_subset.py
Add basic repo hygiene files.
  • Add .gitignore to exclude build artifacts and compiled files (contents not shown).
.gitignore

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • The HDF5 subsetting path (_copy_hdf5_subset) reads entire datasets into memory via obj[()] even when only a spatial slice is needed; consider using hyperslab slicing on the source dataset and writing chunks to the destination to avoid excessive memory use for large products.
  • Lat/lon detection in _find_latlon_datasets is driven by a small fixed set of name variants; exposing an API to pass explicit dataset paths or detection hints (e.g., via attributes) would make the tool more robust across different OPERA product variants.
  • The batch helper subset_opera_files prints warnings directly to stderr from library code; consider using the logging module or a user-provided callback so callers can better control how non-overlap cases are reported.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The HDF5 subsetting path (`_copy_hdf5_subset`) reads entire datasets into memory via `obj[()]` even when only a spatial slice is needed; consider using hyperslab slicing on the source dataset and writing chunks to the destination to avoid excessive memory use for large products.
- Lat/lon detection in `_find_latlon_datasets` is driven by a small fixed set of name variants; exposing an API to pass explicit dataset paths or detection hints (e.g., via attributes) would make the tool more robust across different OPERA product variants.
- The batch helper `subset_opera_files` prints warnings directly to `stderr` from library code; consider using the `logging` module or a user-provided callback so callers can better control how non-overlap cases are reported.

## Individual Comments

### Comment 1
<location path="src/opera_tools/subset.py" line_range="280" />
<code_context>
+                except Exception:
+                    pass
+        elif isinstance(obj, h5py.Dataset):
+            data = obj[()]
+            lower = name.lower().split("/")[-1]
+
</code_context>
<issue_to_address>
**suggestion (performance):** Copying each HDF5 dataset by loading it fully into memory is risky for large multidimensional datasets.

In `_copy_hdf5_subset`, `data = obj[()]` forces the full dataset into RAM before subsetting. Instead, slice directly from `obj` (e.g. `obj[row_start:row_stop, col_start:col_stop]` or `obj[:, row_start:row_stop, col_start:col_stop]`) so only the required hyperslab is read, avoiding the temporary full-materialization in memory.

Suggested implementation:

```python
        if isinstance(obj, h5py.Group):
            grp = dst.require_group(name)
            for key, val in obj.attrs.items():
                try:
                    grp.attrs[key] = val
                except Exception:
                    pass
        elif isinstance(obj, h5py.Dataset):
            src_dset = obj
            lower = name.lower().split("/")[-1]

            # Determine the slice to apply for this dataset.
            # `dataset_slices` should be provided by the outer scope and map
            # dataset names (or lower-cased names) to index tuples/slices,
            # e.g. (row_start:row_stop, col_start:col_stop) or
            # (slice(None), row_start:row_stop, col_start:col_stop).
            dset_slice = dataset_slices.get(lower) if "dataset_slices" in locals() else None

            if dset_slice is None:
                # Fall back to HDF5-level copy to avoid forcing the dataset
                # through Python memory when possible.
                dst.copy(src_dset, name)
            else:
                # Read only the requested hyperslab into memory
                sliced = src_dset[dset_slice]

                # Mirror the source dataset's creation properties if a helper
                # exists; otherwise let h5py choose sane defaults.
                if "_get_create_kwargs" in globals():
                    create_kwargs = _get_create_kwargs(src_dset)
                    dst_dset = dst.create_dataset(name, data=sliced, **create_kwargs)
                else:
                    dst_dset = dst.create_dataset(name, data=sliced)

                # Copy dataset attributes
                for key, val in src_dset.attrs.items():
                    try:
                        dst_dset.attrs[key] = val
                    except Exception:
                        pass

Supports GeoTIFF (.tif) and HDF5 (.h5) OPERA product files.

```

To make this work end-to-end and fully avoid `obj[()]` materializing whole datasets, you’ll need to:

1. Ensure `_copy_item` has access to a `dataset_slices` mapping:
   - If `_copy_hdf5_subset` constructs `_copy_item` as an inner function, add a `dataset_slices` dict (or similar) in the outer scope and close over it.
   - Example pattern (pseudocode):
     ```python
     dataset_slices = {
         "amplitude": (row_start, row_stop, col_start, col_stop),
         "intensity": (slice(None), row_start, row_stop, col_start, col_stop),
     }

     def _copy_item(name, obj):
         ...
     src.visititems(_copy_item)
     ```
     Adjust keys (`lower`, full `name`, etc.) to match your dataset naming conventions.

2. Compute the row/column slices from the bounding box:
   - Wherever you currently compute the spatial subset indices from the geo/lat/lon grids, convert them into Python `slice` objects (and, for 3D/4D datasets, prepend `slice(None)` for dimensions you are not subsetting).
   - Store these slices in `dataset_slices` keyed by the dataset name you use in `_copy_item` (`lower` or `name`).

3. If you do not have a `_get_create_kwargs` helper:
   - Either implement it to copy dtype, compression, chunking, etc. from `src_dset`, or remove that branch and always call:
     ```python
     dst_dset = dst.create_dataset(name, data=sliced)
     ```
   - A minimal helper could look like:
     ```python
     def _get_create_kwargs(src_dset):
         kwargs = {"dtype": src_dset.dtype}
         if src_dset.chunks is not None:
             kwargs["chunks"] = src_dset.chunks
         if "compression" in src_dset.compression_opts:
             kwargs["compression"] = src_dset.compression
             kwargs["compression_opts"] = src_dset.compression_opts
         return kwargs
     ```
     (Adjust to your existing conventions if a similar utility already exists.)

4. If some datasets should still be copied whole:
   - Do not add them to `dataset_slices`; they will follow the `dst.copy(src_dset, name)` path which avoids Python-level full materialization and lets HDF5 handle the copy efficiently.
</issue_to_address>

### Comment 2
<location path="src/opera_tools/subset.py" line_range="458-461" />
<code_context>
+    parser = _build_parser()
+    args = parser.parse_args(argv)
+
+    bbox = tuple(args.bbox)
+    written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
+
+    if written:
</code_context>
<issue_to_address>
**suggestion:** CLI error handling for invalid bounding boxes propagates raw tracebacks to the user.

If `_check_bbox` raises a `ValueError` (e.g., swapped min/max or out-of-range coordinates), the CLI currently prints a full traceback. Consider catching `ValueError` around `subset_opera_files`, writing a clear error message to stderr, and exiting with a non-zero status instead.

```suggestion
    bbox = tuple(args.bbox)
    try:
        written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
    except ValueError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        sys.exit(1)

    if written:
```
</issue_to_address>

### Comment 3
<location path="tests/test_subset.py" line_range="94" />
<code_context>
+# Tests for subset_geotiff
+# ---------------------------------------------------------------------------
+
+class TestSubsetGeotiff:
+    def test_basic_subset(self, tmp_path):
+        rasterio = pytest.importorskip("rasterio")
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that exercises the GeoTIFF reprojection path (non-WGS84 CRS) to ensure `transform_bounds` and window computation behave as expected.

Right now `_make_geotiff` always writes EPSG:4326, so `subset_geotiff` never exercises the reprojection branch using `transform_bounds`. Please add a test that uses a non-WGS84 CRS (e.g., EPSG:3857) or a small in-memory/mocked rasterio dataset so that `file_crs != wgs84`, then call `subset_geotiff` and assert that: (1) an overlapping bbox returns `True`, and (2) the output dimensions and resulting transform are sensible. This will improve coverage of the reprojection path and guard against regressions in `transform_bounds` error handling.

Suggested implementation:

```python
# ---------------------------------------------------------------------------
# Tests for subset_geotiff
# ---------------------------------------------------------------------------

    def test_subset_geotiff_reprojection(self, tmp_path):
        import numpy as np
        import pytest

        rasterio = pytest.importorskip("rasterio")
        from rasterio.crs import CRS
        from rasterio.warp import transform_bounds
        from opera_tools.subset import subset_geotiff

        # Define a geographic (WGS84) bbox that we will subset with
        lon_min, lat_min, lon_max, lat_max = -10.0, -10.0, 10.0, 10.0
        wgs84 = CRS.from_epsg(4326)
        mercator = CRS.from_epsg(3857)

        # Compute the dataset bounds in EPSG:3857 so that the file CRS != WGS84
        file_left, file_bottom, file_right, file_top = transform_bounds(
            wgs84,
            mercator,
            lon_min,
            lat_min,
            lon_max,
            lat_max,
        )

        width, height = 64, 64
        input_path = tmp_path / "reproj_input.tif"

        # Create a simple GeoTIFF in EPSG:3857 that fully covers the WGS84 bbox
        transform = rasterio.transform.from_bounds(
            file_left,
            file_bottom,
            file_right,
            file_top,
            width,
            height,
        )
        data = np.arange(width * height, dtype=np.float32).reshape(1, height, width)

        with rasterio.open(
            input_path,
            "w",
            driver="GTiff",
            width=width,
            height=height,
            count=1,
            dtype="float32",
            crs=mercator,
            transform=transform,
        ) as ds:
            ds.write(data)

        output_path = tmp_path / "subset_reproj.tif"

        # Exercise the reprojection path (file_crs != WGS84)
        ok = subset_geotiff(
            input_path,
            (lon_min, lat_min, lon_max, lat_max),
            output_path,
        )

        # (1) Overlapping bbox should be detected
        assert ok is True

        # (2) Output dimensions and transform should be sensible
        with rasterio.open(output_path) as out_ds:
            # Subset should be non-empty and no larger than the source
            assert 0 < out_ds.width <= width
            assert 0 < out_ds.height <= height

            # Output bounds should lie within the original dataset bounds
            out_left, out_bottom, out_right, out_top = out_ds.bounds
            assert file_left <= out_left < file_right
            assert file_bottom <= out_bottom < file_top
            assert file_left < out_right <= file_right
            assert file_bottom < out_top <= file_top

    from rasterio.transform import from_bounds
    from rasterio.crs import CRS

```

The `subset_geotiff` call in this test assumes a signature of:
`subset_geotiff(input_path, bbox, output_path)`. If the actual function in `opera_tools.subset` uses a different argument order or keyword-only parameters, adjust the call accordingly so that:
1. `input_path` is the path to the EPSG:3857 GeoTIFF,
2. `bbox` is the WGS84 (lon_min, lat_min, lon_max, lat_max) tuple,
3. `output_path` is where the subset GeoTIFF is written.

If `subset_geotiff` returns something other than a boolean indicating whether an overlap was found (e.g., raises on no-overlap or returns the output path), update the `ok` assertion to match the real API while keeping the intent: verify that an overlapping bbox is treated as a successful subset operation, and the resulting dataset has consistent bounds and shape.
</issue_to_address>

### Comment 4
<location path="tests/test_subset.py" line_range="95-104" />
<code_context>
+    def test_basic_subset(self, tmp_path):
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests to exercise `_copy_hdf5_subset` for 2-D/3-D datasets and 1-D non-coordinate datasets to ensure slicing logic is correct.

`_copy_hdf5_subset` is currently covered only via a 2-D `data` array plus 1-D `latitude`/`longitude`. The code also supports 3-D datasets (subset last two dims) and 1-D non-coordinate arrays (copied as-is), but those paths aren't tested. Please extend `_make_hdf5` or add a helper to include: (1) a 3-D dataset (e.g., `(time, y, x)`) and verify that only the spatial dimensions are subset; (2) an extra 1-D non-coordinate dataset (e.g., `time`) and verify its length is unchanged after subsetting.

Suggested implementation:

```python
class TestSubsetGeotiff:
    def test_basic_subset(self, tmp_path):
        rasterio = pytest.importorskip("rasterio")
        from opera_tools.subset import subset_geotiff

        src = str(tmp_path / "input.tif")
        dst = str(tmp_path / "output.tif")
        _make_geotiff(src)

        # Subset to a smaller bbox inside the file extent
        result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2))
        assert result is True


def _make_hdf5_with_3d(path):
    """Create an HDF5 file with:
    - 1-D latitude/longitude coordinate datasets
    - 3-D data dataset with shape (time, y, x)
    - 1-D non-coordinate dataset 'time' that should not be subset in length
    """
    import numpy as np
    import h5py

    ny, nx = 8, 10
    nt = 3

    lats = np.linspace(33.0, 35.0, ny, dtype=np.float32)
    lons = np.linspace(-119.0, -117.0, nx, dtype=np.float32)
    data3d = np.arange(nt * ny * nx, dtype=np.float32).reshape(nt, ny, nx)
    time = np.arange(nt, dtype=np.int32)

    with h5py.File(path, "w") as f:
        # Coordinate variables
        f.create_dataset("latitude", data=lats)
        f.create_dataset("longitude", data=lons)

        # Non-coordinate 1-D variable that should be copied as-is
        f.create_dataset("time", data=time)

        # 3-D data variable with (time, y, x) dimensions
        f.create_dataset("data3d", data=data3d)


class TestCopyHdf5Subset:
    def test_3d_and_1d_non_coordinate_subset(self, tmp_path):
        """Ensure _copy_hdf5_subset correctly:
        - Subsets only the spatial dimensions of 3-D datasets
        - Copies 1-D non-coordinate datasets without changing their length
        """
        pytest.importorskip("h5py")
        import numpy as np
        import h5py
        # Use the public API, which internally exercises _copy_hdf5_subset
        from opera_tools.subset import subset_hdf5

        src = str(tmp_path / "input.h5")
        dst = str(tmp_path / "output.h5")
        _make_hdf5_with_3d(src)

        # Bbox that selects an interior spatial window
        # (lat_min, lon_min, lat_max, lon_max)
        bbox = (-118.5, 33.5, -117.5, 34.5)

        result = subset_hdf5(src, dst, bbox)
        # subset_hdf5 is expected to succeed; adjust if it returns None/other
        assert result is True or result is None

        with h5py.File(src, "r") as src_f, h5py.File(dst, "r") as dst_f:
            src_data = src_f["data3d"][...]
            dst_data = dst_f["data3d"][...]

            # Time dimension (axis 0) should be unchanged
            assert dst_data.shape[0] == src_data.shape[0]

            # Spatial dimensions should be reduced
            assert dst_data.shape[1] < src_data.shape[1]
            assert dst_data.shape[2] < src_data.shape[2]

            # 1-D non-coordinate dataset 'time' should be copied unchanged
            assert np.array_equal(dst_f["time"][...], src_f["time"][...])

```

Depending on the existing HDF5 API and test conventions in this repository, you may need to:

1. **Adjust the HDF5 subset function name/signature**:
   - If the public API is named differently (e.g., `subset_hdf5_file` or requires keyword arguments like `xvar="longitude", yvar="latitude"`), update the `from opera_tools.subset import subset_hdf5` import and the `subset_hdf5(src, dst, bbox)` call accordingly.

2. **Align dataset names/dimensions with existing helpers**:
   - If `_copy_hdf5_subset` expects specific dataset names (e.g., `"lat"`, `"lon"` instead of `"latitude"`, `"longitude"`), adjust `_make_hdf5_with_3d` to match those names.
   - If the code uses dimensions via attributes (e.g., `dimensions=("time", "y", "x")`), you may want to set those attributes on the created datasets to mirror the production files more closely.

3. **Return value expectations**:
   - If `subset_hdf5` has a well-defined return value (always `True`, or returns the output path, etc.), tighten the assertion:
     - Replace `assert result is True or result is None` with whatever is appropriate for your implementation.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/opera_tools/subset.py
except Exception:
pass
elif isinstance(obj, h5py.Dataset):
data = obj[()]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (performance): Copying each HDF5 dataset by loading it fully into memory is risky for large multidimensional datasets.

In _copy_hdf5_subset, data = obj[()] forces the full dataset into RAM before subsetting. Instead, slice directly from obj (e.g. obj[row_start:row_stop, col_start:col_stop] or obj[:, row_start:row_stop, col_start:col_stop]) so only the required hyperslab is read, avoiding the temporary full-materialization in memory.

Suggested implementation:

        if isinstance(obj, h5py.Group):
            grp = dst.require_group(name)
            for key, val in obj.attrs.items():
                try:
                    grp.attrs[key] = val
                except Exception:
                    pass
        elif isinstance(obj, h5py.Dataset):
            src_dset = obj
            lower = name.lower().split("/")[-1]

            # Determine the slice to apply for this dataset.
            # `dataset_slices` should be provided by the outer scope and map
            # dataset names (or lower-cased names) to index tuples/slices,
            # e.g. (row_start:row_stop, col_start:col_stop) or
            # (slice(None), row_start:row_stop, col_start:col_stop).
            dset_slice = dataset_slices.get(lower) if "dataset_slices" in locals() else None

            if dset_slice is None:
                # Fall back to HDF5-level copy to avoid forcing the dataset
                # through Python memory when possible.
                dst.copy(src_dset, name)
            else:
                # Read only the requested hyperslab into memory
                sliced = src_dset[dset_slice]

                # Mirror the source dataset's creation properties if a helper
                # exists; otherwise let h5py choose sane defaults.
                if "_get_create_kwargs" in globals():
                    create_kwargs = _get_create_kwargs(src_dset)
                    dst_dset = dst.create_dataset(name, data=sliced, **create_kwargs)
                else:
                    dst_dset = dst.create_dataset(name, data=sliced)

                # Copy dataset attributes
                for key, val in src_dset.attrs.items():
                    try:
                        dst_dset.attrs[key] = val
                    except Exception:
                        pass

Supports GeoTIFF (.tif) and HDF5 (.h5) OPERA product files.

To make this work end-to-end and fully avoid obj[()] materializing whole datasets, you’ll need to:

  1. Ensure _copy_item has access to a dataset_slices mapping:

    • If _copy_hdf5_subset constructs _copy_item as an inner function, add a dataset_slices dict (or similar) in the outer scope and close over it.
    • Example pattern (pseudocode):
      dataset_slices = {
          "amplitude": (row_start, row_stop, col_start, col_stop),
          "intensity": (slice(None), row_start, row_stop, col_start, col_stop),
      }
      
      def _copy_item(name, obj):
          ...
      src.visititems(_copy_item)
      Adjust keys (lower, full name, etc.) to match your dataset naming conventions.
  2. Compute the row/column slices from the bounding box:

    • Wherever you currently compute the spatial subset indices from the geo/lat/lon grids, convert them into Python slice objects (and, for 3D/4D datasets, prepend slice(None) for dimensions you are not subsetting).
    • Store these slices in dataset_slices keyed by the dataset name you use in _copy_item (lower or name).
  3. If you do not have a _get_create_kwargs helper:

    • Either implement it to copy dtype, compression, chunking, etc. from src_dset, or remove that branch and always call:
      dst_dset = dst.create_dataset(name, data=sliced)
    • A minimal helper could look like:
      def _get_create_kwargs(src_dset):
          kwargs = {"dtype": src_dset.dtype}
          if src_dset.chunks is not None:
              kwargs["chunks"] = src_dset.chunks
          if "compression" in src_dset.compression_opts:
              kwargs["compression"] = src_dset.compression
              kwargs["compression_opts"] = src_dset.compression_opts
          return kwargs
      (Adjust to your existing conventions if a similar utility already exists.)
  4. If some datasets should still be copied whole:

    • Do not add them to dataset_slices; they will follow the dst.copy(src_dset, name) path which avoids Python-level full materialization and lets HDF5 handle the copy efficiently.

Comment thread src/opera_tools/subset.py
Comment on lines +458 to +461
bbox = tuple(args.bbox)
written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)

if written:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: CLI error handling for invalid bounding boxes propagates raw tracebacks to the user.

If _check_bbox raises a ValueError (e.g., swapped min/max or out-of-range coordinates), the CLI currently prints a full traceback. Consider catching ValueError around subset_opera_files, writing a clear error message to stderr, and exiting with a non-zero status instead.

Suggested change
bbox = tuple(args.bbox)
written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
if written:
bbox = tuple(args.bbox)
try:
written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
except ValueError as exc:
print(f"Error: {exc}", file=sys.stderr)
sys.exit(1)
if written:

Comment thread tests/test_subset.py
# Tests for subset_geotiff
# ---------------------------------------------------------------------------

class TestSubsetGeotiff:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add a test that exercises the GeoTIFF reprojection path (non-WGS84 CRS) to ensure transform_bounds and window computation behave as expected.

Right now _make_geotiff always writes EPSG:4326, so subset_geotiff never exercises the reprojection branch using transform_bounds. Please add a test that uses a non-WGS84 CRS (e.g., EPSG:3857) or a small in-memory/mocked rasterio dataset so that file_crs != wgs84, then call subset_geotiff and assert that: (1) an overlapping bbox returns True, and (2) the output dimensions and resulting transform are sensible. This will improve coverage of the reprojection path and guard against regressions in transform_bounds error handling.

Suggested implementation:

# ---------------------------------------------------------------------------
# Tests for subset_geotiff
# ---------------------------------------------------------------------------

    def test_subset_geotiff_reprojection(self, tmp_path):
        import numpy as np
        import pytest

        rasterio = pytest.importorskip("rasterio")
        from rasterio.crs import CRS
        from rasterio.warp import transform_bounds
        from opera_tools.subset import subset_geotiff

        # Define a geographic (WGS84) bbox that we will subset with
        lon_min, lat_min, lon_max, lat_max = -10.0, -10.0, 10.0, 10.0
        wgs84 = CRS.from_epsg(4326)
        mercator = CRS.from_epsg(3857)

        # Compute the dataset bounds in EPSG:3857 so that the file CRS != WGS84
        file_left, file_bottom, file_right, file_top = transform_bounds(
            wgs84,
            mercator,
            lon_min,
            lat_min,
            lon_max,
            lat_max,
        )

        width, height = 64, 64
        input_path = tmp_path / "reproj_input.tif"

        # Create a simple GeoTIFF in EPSG:3857 that fully covers the WGS84 bbox
        transform = rasterio.transform.from_bounds(
            file_left,
            file_bottom,
            file_right,
            file_top,
            width,
            height,
        )
        data = np.arange(width * height, dtype=np.float32).reshape(1, height, width)

        with rasterio.open(
            input_path,
            "w",
            driver="GTiff",
            width=width,
            height=height,
            count=1,
            dtype="float32",
            crs=mercator,
            transform=transform,
        ) as ds:
            ds.write(data)

        output_path = tmp_path / "subset_reproj.tif"

        # Exercise the reprojection path (file_crs != WGS84)
        ok = subset_geotiff(
            input_path,
            (lon_min, lat_min, lon_max, lat_max),
            output_path,
        )

        # (1) Overlapping bbox should be detected
        assert ok is True

        # (2) Output dimensions and transform should be sensible
        with rasterio.open(output_path) as out_ds:
            # Subset should be non-empty and no larger than the source
            assert 0 < out_ds.width <= width
            assert 0 < out_ds.height <= height

            # Output bounds should lie within the original dataset bounds
            out_left, out_bottom, out_right, out_top = out_ds.bounds
            assert file_left <= out_left < file_right
            assert file_bottom <= out_bottom < file_top
            assert file_left < out_right <= file_right
            assert file_bottom < out_top <= file_top

    from rasterio.transform import from_bounds
    from rasterio.crs import CRS

The subset_geotiff call in this test assumes a signature of:
subset_geotiff(input_path, bbox, output_path). If the actual function in opera_tools.subset uses a different argument order or keyword-only parameters, adjust the call accordingly so that:

  1. input_path is the path to the EPSG:3857 GeoTIFF,
  2. bbox is the WGS84 (lon_min, lat_min, lon_max, lat_max) tuple,
  3. output_path is where the subset GeoTIFF is written.

If subset_geotiff returns something other than a boolean indicating whether an overlap was found (e.g., raises on no-overlap or returns the output path), update the ok assertion to match the real API while keeping the intent: verify that an overlapping bbox is treated as a successful subset operation, and the resulting dataset has consistent bounds and shape.

Comment thread tests/test_subset.py
Comment on lines +95 to +104
def test_basic_subset(self, tmp_path):
rasterio = pytest.importorskip("rasterio")
from opera_tools.subset import subset_geotiff

src = str(tmp_path / "input.tif")
dst = str(tmp_path / "output.tif")
_make_geotiff(src)

# Subset to a smaller bbox inside the file extent
result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add tests to exercise _copy_hdf5_subset for 2-D/3-D datasets and 1-D non-coordinate datasets to ensure slicing logic is correct.

_copy_hdf5_subset is currently covered only via a 2-D data array plus 1-D latitude/longitude. The code also supports 3-D datasets (subset last two dims) and 1-D non-coordinate arrays (copied as-is), but those paths aren't tested. Please extend _make_hdf5 or add a helper to include: (1) a 3-D dataset (e.g., (time, y, x)) and verify that only the spatial dimensions are subset; (2) an extra 1-D non-coordinate dataset (e.g., time) and verify its length is unchanged after subsetting.

Suggested implementation:

class TestSubsetGeotiff:
    def test_basic_subset(self, tmp_path):
        rasterio = pytest.importorskip("rasterio")
        from opera_tools.subset import subset_geotiff

        src = str(tmp_path / "input.tif")
        dst = str(tmp_path / "output.tif")
        _make_geotiff(src)

        # Subset to a smaller bbox inside the file extent
        result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2))
        assert result is True


def _make_hdf5_with_3d(path):
    """Create an HDF5 file with:
    - 1-D latitude/longitude coordinate datasets
    - 3-D data dataset with shape (time, y, x)
    - 1-D non-coordinate dataset 'time' that should not be subset in length
    """
    import numpy as np
    import h5py

    ny, nx = 8, 10
    nt = 3

    lats = np.linspace(33.0, 35.0, ny, dtype=np.float32)
    lons = np.linspace(-119.0, -117.0, nx, dtype=np.float32)
    data3d = np.arange(nt * ny * nx, dtype=np.float32).reshape(nt, ny, nx)
    time = np.arange(nt, dtype=np.int32)

    with h5py.File(path, "w") as f:
        # Coordinate variables
        f.create_dataset("latitude", data=lats)
        f.create_dataset("longitude", data=lons)

        # Non-coordinate 1-D variable that should be copied as-is
        f.create_dataset("time", data=time)

        # 3-D data variable with (time, y, x) dimensions
        f.create_dataset("data3d", data=data3d)


class TestCopyHdf5Subset:
    def test_3d_and_1d_non_coordinate_subset(self, tmp_path):
        """Ensure _copy_hdf5_subset correctly:
        - Subsets only the spatial dimensions of 3-D datasets
        - Copies 1-D non-coordinate datasets without changing their length
        """
        pytest.importorskip("h5py")
        import numpy as np
        import h5py
        # Use the public API, which internally exercises _copy_hdf5_subset
        from opera_tools.subset import subset_hdf5

        src = str(tmp_path / "input.h5")
        dst = str(tmp_path / "output.h5")
        _make_hdf5_with_3d(src)

        # Bbox that selects an interior spatial window
        # (lat_min, lon_min, lat_max, lon_max)
        bbox = (-118.5, 33.5, -117.5, 34.5)

        result = subset_hdf5(src, dst, bbox)
        # subset_hdf5 is expected to succeed; adjust if it returns None/other
        assert result is True or result is None

        with h5py.File(src, "r") as src_f, h5py.File(dst, "r") as dst_f:
            src_data = src_f["data3d"][...]
            dst_data = dst_f["data3d"][...]

            # Time dimension (axis 0) should be unchanged
            assert dst_data.shape[0] == src_data.shape[0]

            # Spatial dimensions should be reduced
            assert dst_data.shape[1] < src_data.shape[1]
            assert dst_data.shape[2] < src_data.shape[2]

            # 1-D non-coordinate dataset 'time' should be copied unchanged
            assert np.array_equal(dst_f["time"][...], src_f["time"][...])

Depending on the existing HDF5 API and test conventions in this repository, you may need to:

  1. Adjust the HDF5 subset function name/signature:

    • If the public API is named differently (e.g., subset_hdf5_file or requires keyword arguments like xvar="longitude", yvar="latitude"), update the from opera_tools.subset import subset_hdf5 import and the subset_hdf5(src, dst, bbox) call accordingly.
  2. Align dataset names/dimensions with existing helpers:

    • If _copy_hdf5_subset expects specific dataset names (e.g., "lat", "lon" instead of "latitude", "longitude"), adjust _make_hdf5_with_3d to match those names.
    • If the code uses dimensions via attributes (e.g., dimensions=("time", "y", "x")), you may want to set those attributes on the created datasets to mirror the production files more closely.
  3. Return value expectations:

    • If subset_hdf5 has a well-defined return value (always True, or returns the output path, etc.), tighten the assertion:
      • Replace assert result is True or result is None with whatever is appropriate for your implementation.

Copy link
Copy Markdown
Member

@disilvestro disilvestro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only asked about README file DIO CANE

Copilot AI requested a review from disilvestro April 10, 2026 16:01
Copilot stopped work on behalf of disilvestro due to an error April 10, 2026 16:01
@disilvestro disilvestro deleted the copilot/tool-for-subsetting-opera-files branch April 10, 2026 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants