Add opera_subset: spatial subsetting tool for OPERA product files#1
Add opera_subset: spatial subsetting tool for OPERA product files#1Copilot wants to merge 3 commits into
Conversation
Agent-Logs-Url: https://github.com/geodesymiami/Opera_tools/sessions/6bc9539d-5fbf-42db-8800-c24731ce628e Co-authored-by: disilvestro <142009043+disilvestro@users.noreply.github.com>
Agent-Logs-Url: https://github.com/geodesymiami/Opera_tools/sessions/6bc9539d-5fbf-42db-8800-c24731ce628e Co-authored-by: disilvestro <142009043+disilvestro@users.noreply.github.com>
Reviewer's GuideIntroduces a new opera_tools Python package providing a CLI and Python API for spatially subsetting OPERA GeoTIFF and HDF5 products to a geographic bounding box, including format-specific subsetting logic, HDF5 helpers, argument parsing, and a comprehensive test suite plus documentation and packaging files. Sequence diagram for opera_subset CLI subsetting flowsequenceDiagram
actor User
participant CLI as opera_subset
participant Parser as _build_parser
participant API as subset_opera_files
participant Single as subset_opera_file
participant GTiff as subset_geotiff
participant HDF5 as subset_hdf5
participant HLatLon as _find_latlon_datasets
participant HCopy as _copy_hdf5_subset
User->>CLI: invoke with args (files, --bbox, --outdir, --suffix)
CLI->>Parser: build ArgumentParser
Parser-->>CLI: parser instance
CLI->>CLI: parse_args(argv)
CLI->>API: subset_opera_files(files, outdir, bbox, suffix)
loop for each input_file
API->>Single: subset_opera_file(input_file, output_file, bbox)
alt GeoTIFF extension
Single->>GTiff: subset_geotiff(input_file, output_file, bbox)
GTiff->>GTiff: validate bbox and file
GTiff->>GTiff: reproject bbox if file CRS != WGS84
GTiff->>GTiff: compute window and read data
GTiff->>GTiff: write subset GeoTIFF
GTiff-->>Single: True or False
else HDF5 extension
Single->>HDF5: subset_hdf5(input_file, output_file, bbox)
HDF5->>HLatLon: _find_latlon_datasets(h5file)
HLatLon-->>HDF5: lat_dataset, lon_dataset
HDF5->>HDF5: compute row and column indices in bbox
HDF5->>HCopy: _copy_hdf5_subset(src, dst, row_start, row_stop, col_start, col_stop)
HCopy-->>HDF5: subset written
HDF5-->>Single: True or False
end
Single-->>API: success flag
alt success is False
API->>CLI: print warning to stderr for this file
else success is True
API->>API: append output_file to written list
end
end
API-->>CLI: list of written output files
alt written is not empty
CLI->>User: print count and paths of written files
else written is empty
CLI->>User: print no output files message
CLI->>CLI: exit with status 1
end
Class diagram for opera_tools package and subset module APIclassDiagram
class opera_tools {
<<module>>
+subset_opera_file(input_file, output_file, bbox) bool
+subset_opera_files(input_files, output_dir, bbox, suffix) list
}
class subset {
<<module>>
+subset_geotiff(input_file, output_file, bbox) bool
+subset_hdf5(input_file, output_file, bbox) bool
+subset_opera_file(input_file, output_file, bbox) bool
+subset_opera_files(input_files, output_dir, bbox, suffix) list
+main(argv) void
+_build_parser() ArgumentParser
+_check_bbox(bbox) void
+_find_latlon_datasets(h5file) tuple
+_copy_hdf5_subset(src, dst, row_start, row_stop, col_start, col_stop) void
}
opera_tools ..> subset : reexports
class rasterio {
<<external library>>
}
class h5py {
<<external library>>
}
class numpy {
<<external library>>
}
subset ..> rasterio : used by subset_geotiff
subset ..> h5py : used by subset_hdf5
subset ..> numpy : used for array operations
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The HDF5 subsetting path (
_copy_hdf5_subset) reads entire datasets into memory viaobj[()]even when only a spatial slice is needed; consider using hyperslab slicing on the source dataset and writing chunks to the destination to avoid excessive memory use for large products. - Lat/lon detection in
_find_latlon_datasetsis driven by a small fixed set of name variants; exposing an API to pass explicit dataset paths or detection hints (e.g., via attributes) would make the tool more robust across different OPERA product variants. - The batch helper
subset_opera_filesprints warnings directly tostderrfrom library code; consider using theloggingmodule or a user-provided callback so callers can better control how non-overlap cases are reported.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The HDF5 subsetting path (`_copy_hdf5_subset`) reads entire datasets into memory via `obj[()]` even when only a spatial slice is needed; consider using hyperslab slicing on the source dataset and writing chunks to the destination to avoid excessive memory use for large products.
- Lat/lon detection in `_find_latlon_datasets` is driven by a small fixed set of name variants; exposing an API to pass explicit dataset paths or detection hints (e.g., via attributes) would make the tool more robust across different OPERA product variants.
- The batch helper `subset_opera_files` prints warnings directly to `stderr` from library code; consider using the `logging` module or a user-provided callback so callers can better control how non-overlap cases are reported.
## Individual Comments
### Comment 1
<location path="src/opera_tools/subset.py" line_range="280" />
<code_context>
+ except Exception:
+ pass
+ elif isinstance(obj, h5py.Dataset):
+ data = obj[()]
+ lower = name.lower().split("/")[-1]
+
</code_context>
<issue_to_address>
**suggestion (performance):** Copying each HDF5 dataset by loading it fully into memory is risky for large multidimensional datasets.
In `_copy_hdf5_subset`, `data = obj[()]` forces the full dataset into RAM before subsetting. Instead, slice directly from `obj` (e.g. `obj[row_start:row_stop, col_start:col_stop]` or `obj[:, row_start:row_stop, col_start:col_stop]`) so only the required hyperslab is read, avoiding the temporary full-materialization in memory.
Suggested implementation:
```python
if isinstance(obj, h5py.Group):
grp = dst.require_group(name)
for key, val in obj.attrs.items():
try:
grp.attrs[key] = val
except Exception:
pass
elif isinstance(obj, h5py.Dataset):
src_dset = obj
lower = name.lower().split("/")[-1]
# Determine the slice to apply for this dataset.
# `dataset_slices` should be provided by the outer scope and map
# dataset names (or lower-cased names) to index tuples/slices,
# e.g. (row_start:row_stop, col_start:col_stop) or
# (slice(None), row_start:row_stop, col_start:col_stop).
dset_slice = dataset_slices.get(lower) if "dataset_slices" in locals() else None
if dset_slice is None:
# Fall back to HDF5-level copy to avoid forcing the dataset
# through Python memory when possible.
dst.copy(src_dset, name)
else:
# Read only the requested hyperslab into memory
sliced = src_dset[dset_slice]
# Mirror the source dataset's creation properties if a helper
# exists; otherwise let h5py choose sane defaults.
if "_get_create_kwargs" in globals():
create_kwargs = _get_create_kwargs(src_dset)
dst_dset = dst.create_dataset(name, data=sliced, **create_kwargs)
else:
dst_dset = dst.create_dataset(name, data=sliced)
# Copy dataset attributes
for key, val in src_dset.attrs.items():
try:
dst_dset.attrs[key] = val
except Exception:
pass
Supports GeoTIFF (.tif) and HDF5 (.h5) OPERA product files.
```
To make this work end-to-end and fully avoid `obj[()]` materializing whole datasets, you’ll need to:
1. Ensure `_copy_item` has access to a `dataset_slices` mapping:
- If `_copy_hdf5_subset` constructs `_copy_item` as an inner function, add a `dataset_slices` dict (or similar) in the outer scope and close over it.
- Example pattern (pseudocode):
```python
dataset_slices = {
"amplitude": (row_start, row_stop, col_start, col_stop),
"intensity": (slice(None), row_start, row_stop, col_start, col_stop),
}
def _copy_item(name, obj):
...
src.visititems(_copy_item)
```
Adjust keys (`lower`, full `name`, etc.) to match your dataset naming conventions.
2. Compute the row/column slices from the bounding box:
- Wherever you currently compute the spatial subset indices from the geo/lat/lon grids, convert them into Python `slice` objects (and, for 3D/4D datasets, prepend `slice(None)` for dimensions you are not subsetting).
- Store these slices in `dataset_slices` keyed by the dataset name you use in `_copy_item` (`lower` or `name`).
3. If you do not have a `_get_create_kwargs` helper:
- Either implement it to copy dtype, compression, chunking, etc. from `src_dset`, or remove that branch and always call:
```python
dst_dset = dst.create_dataset(name, data=sliced)
```
- A minimal helper could look like:
```python
def _get_create_kwargs(src_dset):
kwargs = {"dtype": src_dset.dtype}
if src_dset.chunks is not None:
kwargs["chunks"] = src_dset.chunks
if "compression" in src_dset.compression_opts:
kwargs["compression"] = src_dset.compression
kwargs["compression_opts"] = src_dset.compression_opts
return kwargs
```
(Adjust to your existing conventions if a similar utility already exists.)
4. If some datasets should still be copied whole:
- Do not add them to `dataset_slices`; they will follow the `dst.copy(src_dset, name)` path which avoids Python-level full materialization and lets HDF5 handle the copy efficiently.
</issue_to_address>
### Comment 2
<location path="src/opera_tools/subset.py" line_range="458-461" />
<code_context>
+ parser = _build_parser()
+ args = parser.parse_args(argv)
+
+ bbox = tuple(args.bbox)
+ written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
+
+ if written:
</code_context>
<issue_to_address>
**suggestion:** CLI error handling for invalid bounding boxes propagates raw tracebacks to the user.
If `_check_bbox` raises a `ValueError` (e.g., swapped min/max or out-of-range coordinates), the CLI currently prints a full traceback. Consider catching `ValueError` around `subset_opera_files`, writing a clear error message to stderr, and exiting with a non-zero status instead.
```suggestion
bbox = tuple(args.bbox)
try:
written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix)
except ValueError as exc:
print(f"Error: {exc}", file=sys.stderr)
sys.exit(1)
if written:
```
</issue_to_address>
### Comment 3
<location path="tests/test_subset.py" line_range="94" />
<code_context>
+# Tests for subset_geotiff
+# ---------------------------------------------------------------------------
+
+class TestSubsetGeotiff:
+ def test_basic_subset(self, tmp_path):
+ rasterio = pytest.importorskip("rasterio")
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that exercises the GeoTIFF reprojection path (non-WGS84 CRS) to ensure `transform_bounds` and window computation behave as expected.
Right now `_make_geotiff` always writes EPSG:4326, so `subset_geotiff` never exercises the reprojection branch using `transform_bounds`. Please add a test that uses a non-WGS84 CRS (e.g., EPSG:3857) or a small in-memory/mocked rasterio dataset so that `file_crs != wgs84`, then call `subset_geotiff` and assert that: (1) an overlapping bbox returns `True`, and (2) the output dimensions and resulting transform are sensible. This will improve coverage of the reprojection path and guard against regressions in `transform_bounds` error handling.
Suggested implementation:
```python
# ---------------------------------------------------------------------------
# Tests for subset_geotiff
# ---------------------------------------------------------------------------
def test_subset_geotiff_reprojection(self, tmp_path):
import numpy as np
import pytest
rasterio = pytest.importorskip("rasterio")
from rasterio.crs import CRS
from rasterio.warp import transform_bounds
from opera_tools.subset import subset_geotiff
# Define a geographic (WGS84) bbox that we will subset with
lon_min, lat_min, lon_max, lat_max = -10.0, -10.0, 10.0, 10.0
wgs84 = CRS.from_epsg(4326)
mercator = CRS.from_epsg(3857)
# Compute the dataset bounds in EPSG:3857 so that the file CRS != WGS84
file_left, file_bottom, file_right, file_top = transform_bounds(
wgs84,
mercator,
lon_min,
lat_min,
lon_max,
lat_max,
)
width, height = 64, 64
input_path = tmp_path / "reproj_input.tif"
# Create a simple GeoTIFF in EPSG:3857 that fully covers the WGS84 bbox
transform = rasterio.transform.from_bounds(
file_left,
file_bottom,
file_right,
file_top,
width,
height,
)
data = np.arange(width * height, dtype=np.float32).reshape(1, height, width)
with rasterio.open(
input_path,
"w",
driver="GTiff",
width=width,
height=height,
count=1,
dtype="float32",
crs=mercator,
transform=transform,
) as ds:
ds.write(data)
output_path = tmp_path / "subset_reproj.tif"
# Exercise the reprojection path (file_crs != WGS84)
ok = subset_geotiff(
input_path,
(lon_min, lat_min, lon_max, lat_max),
output_path,
)
# (1) Overlapping bbox should be detected
assert ok is True
# (2) Output dimensions and transform should be sensible
with rasterio.open(output_path) as out_ds:
# Subset should be non-empty and no larger than the source
assert 0 < out_ds.width <= width
assert 0 < out_ds.height <= height
# Output bounds should lie within the original dataset bounds
out_left, out_bottom, out_right, out_top = out_ds.bounds
assert file_left <= out_left < file_right
assert file_bottom <= out_bottom < file_top
assert file_left < out_right <= file_right
assert file_bottom < out_top <= file_top
from rasterio.transform import from_bounds
from rasterio.crs import CRS
```
The `subset_geotiff` call in this test assumes a signature of:
`subset_geotiff(input_path, bbox, output_path)`. If the actual function in `opera_tools.subset` uses a different argument order or keyword-only parameters, adjust the call accordingly so that:
1. `input_path` is the path to the EPSG:3857 GeoTIFF,
2. `bbox` is the WGS84 (lon_min, lat_min, lon_max, lat_max) tuple,
3. `output_path` is where the subset GeoTIFF is written.
If `subset_geotiff` returns something other than a boolean indicating whether an overlap was found (e.g., raises on no-overlap or returns the output path), update the `ok` assertion to match the real API while keeping the intent: verify that an overlapping bbox is treated as a successful subset operation, and the resulting dataset has consistent bounds and shape.
</issue_to_address>
### Comment 4
<location path="tests/test_subset.py" line_range="95-104" />
<code_context>
+ def test_basic_subset(self, tmp_path):
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests to exercise `_copy_hdf5_subset` for 2-D/3-D datasets and 1-D non-coordinate datasets to ensure slicing logic is correct.
`_copy_hdf5_subset` is currently covered only via a 2-D `data` array plus 1-D `latitude`/`longitude`. The code also supports 3-D datasets (subset last two dims) and 1-D non-coordinate arrays (copied as-is), but those paths aren't tested. Please extend `_make_hdf5` or add a helper to include: (1) a 3-D dataset (e.g., `(time, y, x)`) and verify that only the spatial dimensions are subset; (2) an extra 1-D non-coordinate dataset (e.g., `time`) and verify its length is unchanged after subsetting.
Suggested implementation:
```python
class TestSubsetGeotiff:
def test_basic_subset(self, tmp_path):
rasterio = pytest.importorskip("rasterio")
from opera_tools.subset import subset_geotiff
src = str(tmp_path / "input.tif")
dst = str(tmp_path / "output.tif")
_make_geotiff(src)
# Subset to a smaller bbox inside the file extent
result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2))
assert result is True
def _make_hdf5_with_3d(path):
"""Create an HDF5 file with:
- 1-D latitude/longitude coordinate datasets
- 3-D data dataset with shape (time, y, x)
- 1-D non-coordinate dataset 'time' that should not be subset in length
"""
import numpy as np
import h5py
ny, nx = 8, 10
nt = 3
lats = np.linspace(33.0, 35.0, ny, dtype=np.float32)
lons = np.linspace(-119.0, -117.0, nx, dtype=np.float32)
data3d = np.arange(nt * ny * nx, dtype=np.float32).reshape(nt, ny, nx)
time = np.arange(nt, dtype=np.int32)
with h5py.File(path, "w") as f:
# Coordinate variables
f.create_dataset("latitude", data=lats)
f.create_dataset("longitude", data=lons)
# Non-coordinate 1-D variable that should be copied as-is
f.create_dataset("time", data=time)
# 3-D data variable with (time, y, x) dimensions
f.create_dataset("data3d", data=data3d)
class TestCopyHdf5Subset:
def test_3d_and_1d_non_coordinate_subset(self, tmp_path):
"""Ensure _copy_hdf5_subset correctly:
- Subsets only the spatial dimensions of 3-D datasets
- Copies 1-D non-coordinate datasets without changing their length
"""
pytest.importorskip("h5py")
import numpy as np
import h5py
# Use the public API, which internally exercises _copy_hdf5_subset
from opera_tools.subset import subset_hdf5
src = str(tmp_path / "input.h5")
dst = str(tmp_path / "output.h5")
_make_hdf5_with_3d(src)
# Bbox that selects an interior spatial window
# (lat_min, lon_min, lat_max, lon_max)
bbox = (-118.5, 33.5, -117.5, 34.5)
result = subset_hdf5(src, dst, bbox)
# subset_hdf5 is expected to succeed; adjust if it returns None/other
assert result is True or result is None
with h5py.File(src, "r") as src_f, h5py.File(dst, "r") as dst_f:
src_data = src_f["data3d"][...]
dst_data = dst_f["data3d"][...]
# Time dimension (axis 0) should be unchanged
assert dst_data.shape[0] == src_data.shape[0]
# Spatial dimensions should be reduced
assert dst_data.shape[1] < src_data.shape[1]
assert dst_data.shape[2] < src_data.shape[2]
# 1-D non-coordinate dataset 'time' should be copied unchanged
assert np.array_equal(dst_f["time"][...], src_f["time"][...])
```
Depending on the existing HDF5 API and test conventions in this repository, you may need to:
1. **Adjust the HDF5 subset function name/signature**:
- If the public API is named differently (e.g., `subset_hdf5_file` or requires keyword arguments like `xvar="longitude", yvar="latitude"`), update the `from opera_tools.subset import subset_hdf5` import and the `subset_hdf5(src, dst, bbox)` call accordingly.
2. **Align dataset names/dimensions with existing helpers**:
- If `_copy_hdf5_subset` expects specific dataset names (e.g., `"lat"`, `"lon"` instead of `"latitude"`, `"longitude"`), adjust `_make_hdf5_with_3d` to match those names.
- If the code uses dimensions via attributes (e.g., `dimensions=("time", "y", "x")`), you may want to set those attributes on the created datasets to mirror the production files more closely.
3. **Return value expectations**:
- If `subset_hdf5` has a well-defined return value (always `True`, or returns the output path, etc.), tighten the assertion:
- Replace `assert result is True or result is None` with whatever is appropriate for your implementation.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| except Exception: | ||
| pass | ||
| elif isinstance(obj, h5py.Dataset): | ||
| data = obj[()] |
There was a problem hiding this comment.
suggestion (performance): Copying each HDF5 dataset by loading it fully into memory is risky for large multidimensional datasets.
In _copy_hdf5_subset, data = obj[()] forces the full dataset into RAM before subsetting. Instead, slice directly from obj (e.g. obj[row_start:row_stop, col_start:col_stop] or obj[:, row_start:row_stop, col_start:col_stop]) so only the required hyperslab is read, avoiding the temporary full-materialization in memory.
Suggested implementation:
if isinstance(obj, h5py.Group):
grp = dst.require_group(name)
for key, val in obj.attrs.items():
try:
grp.attrs[key] = val
except Exception:
pass
elif isinstance(obj, h5py.Dataset):
src_dset = obj
lower = name.lower().split("/")[-1]
# Determine the slice to apply for this dataset.
# `dataset_slices` should be provided by the outer scope and map
# dataset names (or lower-cased names) to index tuples/slices,
# e.g. (row_start:row_stop, col_start:col_stop) or
# (slice(None), row_start:row_stop, col_start:col_stop).
dset_slice = dataset_slices.get(lower) if "dataset_slices" in locals() else None
if dset_slice is None:
# Fall back to HDF5-level copy to avoid forcing the dataset
# through Python memory when possible.
dst.copy(src_dset, name)
else:
# Read only the requested hyperslab into memory
sliced = src_dset[dset_slice]
# Mirror the source dataset's creation properties if a helper
# exists; otherwise let h5py choose sane defaults.
if "_get_create_kwargs" in globals():
create_kwargs = _get_create_kwargs(src_dset)
dst_dset = dst.create_dataset(name, data=sliced, **create_kwargs)
else:
dst_dset = dst.create_dataset(name, data=sliced)
# Copy dataset attributes
for key, val in src_dset.attrs.items():
try:
dst_dset.attrs[key] = val
except Exception:
pass
Supports GeoTIFF (.tif) and HDF5 (.h5) OPERA product files.To make this work end-to-end and fully avoid obj[()] materializing whole datasets, you’ll need to:
-
Ensure
_copy_itemhas access to adataset_slicesmapping:- If
_copy_hdf5_subsetconstructs_copy_itemas an inner function, add adataset_slicesdict (or similar) in the outer scope and close over it. - Example pattern (pseudocode):
Adjust keys (
dataset_slices = { "amplitude": (row_start, row_stop, col_start, col_stop), "intensity": (slice(None), row_start, row_stop, col_start, col_stop), } def _copy_item(name, obj): ... src.visititems(_copy_item)
lower, fullname, etc.) to match your dataset naming conventions.
- If
-
Compute the row/column slices from the bounding box:
- Wherever you currently compute the spatial subset indices from the geo/lat/lon grids, convert them into Python
sliceobjects (and, for 3D/4D datasets, prependslice(None)for dimensions you are not subsetting). - Store these slices in
dataset_sliceskeyed by the dataset name you use in_copy_item(lowerorname).
- Wherever you currently compute the spatial subset indices from the geo/lat/lon grids, convert them into Python
-
If you do not have a
_get_create_kwargshelper:- Either implement it to copy dtype, compression, chunking, etc. from
src_dset, or remove that branch and always call:dst_dset = dst.create_dataset(name, data=sliced)
- A minimal helper could look like:
(Adjust to your existing conventions if a similar utility already exists.)
def _get_create_kwargs(src_dset): kwargs = {"dtype": src_dset.dtype} if src_dset.chunks is not None: kwargs["chunks"] = src_dset.chunks if "compression" in src_dset.compression_opts: kwargs["compression"] = src_dset.compression kwargs["compression_opts"] = src_dset.compression_opts return kwargs
- Either implement it to copy dtype, compression, chunking, etc. from
-
If some datasets should still be copied whole:
- Do not add them to
dataset_slices; they will follow thedst.copy(src_dset, name)path which avoids Python-level full materialization and lets HDF5 handle the copy efficiently.
- Do not add them to
| bbox = tuple(args.bbox) | ||
| written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix) | ||
|
|
||
| if written: |
There was a problem hiding this comment.
suggestion: CLI error handling for invalid bounding boxes propagates raw tracebacks to the user.
If _check_bbox raises a ValueError (e.g., swapped min/max or out-of-range coordinates), the CLI currently prints a full traceback. Consider catching ValueError around subset_opera_files, writing a clear error message to stderr, and exiting with a non-zero status instead.
| bbox = tuple(args.bbox) | |
| written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix) | |
| if written: | |
| bbox = tuple(args.bbox) | |
| try: | |
| written = subset_opera_files(args.files, args.outdir, bbox, suffix=args.suffix) | |
| except ValueError as exc: | |
| print(f"Error: {exc}", file=sys.stderr) | |
| sys.exit(1) | |
| if written: |
| # Tests for subset_geotiff | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
| class TestSubsetGeotiff: |
There was a problem hiding this comment.
suggestion (testing): Add a test that exercises the GeoTIFF reprojection path (non-WGS84 CRS) to ensure transform_bounds and window computation behave as expected.
Right now _make_geotiff always writes EPSG:4326, so subset_geotiff never exercises the reprojection branch using transform_bounds. Please add a test that uses a non-WGS84 CRS (e.g., EPSG:3857) or a small in-memory/mocked rasterio dataset so that file_crs != wgs84, then call subset_geotiff and assert that: (1) an overlapping bbox returns True, and (2) the output dimensions and resulting transform are sensible. This will improve coverage of the reprojection path and guard against regressions in transform_bounds error handling.
Suggested implementation:
# ---------------------------------------------------------------------------
# Tests for subset_geotiff
# ---------------------------------------------------------------------------
def test_subset_geotiff_reprojection(self, tmp_path):
import numpy as np
import pytest
rasterio = pytest.importorskip("rasterio")
from rasterio.crs import CRS
from rasterio.warp import transform_bounds
from opera_tools.subset import subset_geotiff
# Define a geographic (WGS84) bbox that we will subset with
lon_min, lat_min, lon_max, lat_max = -10.0, -10.0, 10.0, 10.0
wgs84 = CRS.from_epsg(4326)
mercator = CRS.from_epsg(3857)
# Compute the dataset bounds in EPSG:3857 so that the file CRS != WGS84
file_left, file_bottom, file_right, file_top = transform_bounds(
wgs84,
mercator,
lon_min,
lat_min,
lon_max,
lat_max,
)
width, height = 64, 64
input_path = tmp_path / "reproj_input.tif"
# Create a simple GeoTIFF in EPSG:3857 that fully covers the WGS84 bbox
transform = rasterio.transform.from_bounds(
file_left,
file_bottom,
file_right,
file_top,
width,
height,
)
data = np.arange(width * height, dtype=np.float32).reshape(1, height, width)
with rasterio.open(
input_path,
"w",
driver="GTiff",
width=width,
height=height,
count=1,
dtype="float32",
crs=mercator,
transform=transform,
) as ds:
ds.write(data)
output_path = tmp_path / "subset_reproj.tif"
# Exercise the reprojection path (file_crs != WGS84)
ok = subset_geotiff(
input_path,
(lon_min, lat_min, lon_max, lat_max),
output_path,
)
# (1) Overlapping bbox should be detected
assert ok is True
# (2) Output dimensions and transform should be sensible
with rasterio.open(output_path) as out_ds:
# Subset should be non-empty and no larger than the source
assert 0 < out_ds.width <= width
assert 0 < out_ds.height <= height
# Output bounds should lie within the original dataset bounds
out_left, out_bottom, out_right, out_top = out_ds.bounds
assert file_left <= out_left < file_right
assert file_bottom <= out_bottom < file_top
assert file_left < out_right <= file_right
assert file_bottom < out_top <= file_top
from rasterio.transform import from_bounds
from rasterio.crs import CRSThe subset_geotiff call in this test assumes a signature of:
subset_geotiff(input_path, bbox, output_path). If the actual function in opera_tools.subset uses a different argument order or keyword-only parameters, adjust the call accordingly so that:
input_pathis the path to the EPSG:3857 GeoTIFF,bboxis the WGS84 (lon_min, lat_min, lon_max, lat_max) tuple,output_pathis where the subset GeoTIFF is written.
If subset_geotiff returns something other than a boolean indicating whether an overlap was found (e.g., raises on no-overlap or returns the output path), update the ok assertion to match the real API while keeping the intent: verify that an overlapping bbox is treated as a successful subset operation, and the resulting dataset has consistent bounds and shape.
| def test_basic_subset(self, tmp_path): | ||
| rasterio = pytest.importorskip("rasterio") | ||
| from opera_tools.subset import subset_geotiff | ||
|
|
||
| src = str(tmp_path / "input.tif") | ||
| dst = str(tmp_path / "output.tif") | ||
| _make_geotiff(src) | ||
|
|
||
| # Subset to a smaller bbox inside the file extent | ||
| result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2)) |
There was a problem hiding this comment.
suggestion (testing): Add tests to exercise _copy_hdf5_subset for 2-D/3-D datasets and 1-D non-coordinate datasets to ensure slicing logic is correct.
_copy_hdf5_subset is currently covered only via a 2-D data array plus 1-D latitude/longitude. The code also supports 3-D datasets (subset last two dims) and 1-D non-coordinate arrays (copied as-is), but those paths aren't tested. Please extend _make_hdf5 or add a helper to include: (1) a 3-D dataset (e.g., (time, y, x)) and verify that only the spatial dimensions are subset; (2) an extra 1-D non-coordinate dataset (e.g., time) and verify its length is unchanged after subsetting.
Suggested implementation:
class TestSubsetGeotiff:
def test_basic_subset(self, tmp_path):
rasterio = pytest.importorskip("rasterio")
from opera_tools.subset import subset_geotiff
src = str(tmp_path / "input.tif")
dst = str(tmp_path / "output.tif")
_make_geotiff(src)
# Subset to a smaller bbox inside the file extent
result = subset_geotiff(src, dst, (-118.2, 33.7, -117.8, 34.2))
assert result is True
def _make_hdf5_with_3d(path):
"""Create an HDF5 file with:
- 1-D latitude/longitude coordinate datasets
- 3-D data dataset with shape (time, y, x)
- 1-D non-coordinate dataset 'time' that should not be subset in length
"""
import numpy as np
import h5py
ny, nx = 8, 10
nt = 3
lats = np.linspace(33.0, 35.0, ny, dtype=np.float32)
lons = np.linspace(-119.0, -117.0, nx, dtype=np.float32)
data3d = np.arange(nt * ny * nx, dtype=np.float32).reshape(nt, ny, nx)
time = np.arange(nt, dtype=np.int32)
with h5py.File(path, "w") as f:
# Coordinate variables
f.create_dataset("latitude", data=lats)
f.create_dataset("longitude", data=lons)
# Non-coordinate 1-D variable that should be copied as-is
f.create_dataset("time", data=time)
# 3-D data variable with (time, y, x) dimensions
f.create_dataset("data3d", data=data3d)
class TestCopyHdf5Subset:
def test_3d_and_1d_non_coordinate_subset(self, tmp_path):
"""Ensure _copy_hdf5_subset correctly:
- Subsets only the spatial dimensions of 3-D datasets
- Copies 1-D non-coordinate datasets without changing their length
"""
pytest.importorskip("h5py")
import numpy as np
import h5py
# Use the public API, which internally exercises _copy_hdf5_subset
from opera_tools.subset import subset_hdf5
src = str(tmp_path / "input.h5")
dst = str(tmp_path / "output.h5")
_make_hdf5_with_3d(src)
# Bbox that selects an interior spatial window
# (lat_min, lon_min, lat_max, lon_max)
bbox = (-118.5, 33.5, -117.5, 34.5)
result = subset_hdf5(src, dst, bbox)
# subset_hdf5 is expected to succeed; adjust if it returns None/other
assert result is True or result is None
with h5py.File(src, "r") as src_f, h5py.File(dst, "r") as dst_f:
src_data = src_f["data3d"][...]
dst_data = dst_f["data3d"][...]
# Time dimension (axis 0) should be unchanged
assert dst_data.shape[0] == src_data.shape[0]
# Spatial dimensions should be reduced
assert dst_data.shape[1] < src_data.shape[1]
assert dst_data.shape[2] < src_data.shape[2]
# 1-D non-coordinate dataset 'time' should be copied unchanged
assert np.array_equal(dst_f["time"][...], src_f["time"][...])Depending on the existing HDF5 API and test conventions in this repository, you may need to:
-
Adjust the HDF5 subset function name/signature:
- If the public API is named differently (e.g.,
subset_hdf5_fileor requires keyword arguments likexvar="longitude", yvar="latitude"), update thefrom opera_tools.subset import subset_hdf5import and thesubset_hdf5(src, dst, bbox)call accordingly.
- If the public API is named differently (e.g.,
-
Align dataset names/dimensions with existing helpers:
- If
_copy_hdf5_subsetexpects specific dataset names (e.g.,"lat","lon"instead of"latitude","longitude"), adjust_make_hdf5_with_3dto match those names. - If the code uses dimensions via attributes (e.g.,
dimensions=("time", "y", "x")), you may want to set those attributes on the created datasets to mirror the production files more closely.
- If
-
Return value expectations:
- If
subset_hdf5has a well-defined return value (alwaysTrue, or returns the output path, etc.), tighten the assertion:- Replace
assert result is True or result is Nonewith whatever is appropriate for your implementation.
- Replace
- If
disilvestro
left a comment
There was a problem hiding this comment.
I only asked about README file DIO CANE
New Python package (
opera_tools) providing a tool to clip OPERA SAR/InSAR product files to a geographic bounding box.Package structure
src/opera_tools/subset.py— core subsetting logic + CLI (opera_subset)setup.py/requirements.txt— packaging withopera_subsetconsole entry pointSupported formats
.tif,.tiff) viarasterio— handles CRS reprojection automatically.h5,.hdf5,.nc) viah5py— discovers lat/lon coordinate datasets and slices all spatial dimensionsAPI
CLI
Returns non-zero exit code and warns when a file's extent does not overlap the requested bbox.
Original prompt
Summary by Sourcery
Add a new opera_tools package providing a CLI and Python API for spatially subsetting OPERA GeoTIFF and HDF5 product files to a geographic bounding box.
New Features:
Enhancements:
Build:
Tests:
Chores: