fix: PDB archive shards and output layout #9

Merged
youngdashu merged 3 commits into main from fix/archive-pdb-zips
Apr 9, 2026

@youngdashu
Collaborator

Summary

Fixes create_archive in three ways: multi-batch H5 layouts (.../<batch>/pdbs.h5) no longer overwrite the same pdbs.zip; the redundant .tgz wrapper around zip shards is removed; and shard zips plus the merged archive are now written under ${data_path}/archives/<dataset_dir_name>/ instead of datasets/.../archives and cwd().
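
To illustrate the collision fix described in the summary, here is a minimal sketch of path-based shard naming. The function `shard_zip_name`, the `__` separator, and the example paths are assumptions for illustration only; the repository's actual `_shard_zip_name` may differ (for example in how it handles long paths).

```python
from pathlib import Path


def shard_zip_name(h5_path: str, data_path: str) -> str:
    """Hypothetical helper: name a shard zip after the H5 path relative to data_path."""
    # relative_to raises ValueError if h5_path is not located under data_path
    rel = Path(h5_path).relative_to(Path(data_path))
    # Drop the .h5 suffix and flatten the directory components into one file name
    parts = rel.with_suffix("").parts
    return "__".join(parts) + ".zip"


# Two batches whose H5 files share the basename pdbs.h5 (illustrative paths)
a = shard_zip_name("/data/structures/run/0/pdbs.h5", "/data")
b = shard_zip_name("/data/structures/run/1/pdbs.h5", "/data")
print(a)
print(b)
```

Because the batch directory becomes part of the name, the two shards no longer map to the same pdbs.zip.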

Tests

  • tests/test_archive.py covers distinct shard zip names for H5 files that share a basename, and for long paths.

Made with Cursor

- Name per-shard zips from path relative to data_path to avoid pdbs.h5 collisions
- Remove redundant tar.gz wrapper around zip shards
- Write shard and final merged zips to {data_path}/archives/{dataset}/
- Open H5 paths from index keys directly; pass data_path for naming
- Add unit tests for _shard_zip_name
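
The output layout from the list above can be sketched as follows. The helpers `archive_dir` and `write_shard_zip`, the dataset name, and the file contents are hypothetical and not taken from the repository; the sketch only shows a shard written as a plain zip (no tar.gz wrapper) under {data_path}/archives/{dataset}/.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_dir(data_path: Path, dataset_dir_name: str) -> Path:
    """Hypothetical helper: return (and create) {data_path}/archives/{dataset}/."""
    out = data_path / "archives" / dataset_dir_name
    out.mkdir(parents=True, exist_ok=True)
    return out


def write_shard_zip(out_dir: Path, zip_name: str, files: dict) -> Path:
    """Hypothetical helper: write one shard directly as a zip file."""
    zip_path = out_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for member_name, text in files.items():
            zf.writestr(member_name, text)
    return zip_path


# A temporary directory stands in for data_path in this demo
with tempfile.TemporaryDirectory() as tmp:
    out = archive_dir(Path(tmp), "my_dataset")
    shard = write_shard_zip(out, "structures__0__pdbs.zip", {"1abc.pdb": "HEADER demo\n"})
    shard_rel = shard.relative_to(tmp).as_posix()
    with zipfile.ZipFile(shard) as zf:
        shard_members = zf.namelist()
```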

Made-with: Cursor
@PawelSzczerbiak
Collaborator

Works fine but we don't need the full archive_pdb_(...).zip. The structures__(...)__N__pdbs.zip files are enough.

@youngdashu youngdashu merged commit 66dff1b into main Apr 9, 2026
1 check failed