Refactor featurization logic and cache seed-independent inputs #679

Open
y1zhou wants to merge 4 commits into google-deepmind:main from y1zhou:data-pipeline/cache-process-struct
@y1zhou y1zhou commented May 22, 2026

This PR addresses #675 by refactoring WholePdbPipeline.process_structure: features that do not depend on the input random seed are moved into a separate @lru_cache-decorated function, _process_structure_seed_independent(). A simple test that validates cache hits is also included.
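For readers unfamiliar with the pattern, here is a minimal sketch of splitting seed-independent work into a cached helper. It is not the code in this PR: seed_independent_features and featurize are hypothetical stand-ins, the "features" are toy values, and only the standard library is used. It also shows one way a cache hit could be checked via cache_info().

```python
import random
from functools import lru_cache


@lru_cache(maxsize=8)
def seed_independent_features(sequence: str) -> tuple[int, ...]:
    # Hypothetical stand-in for the expensive work that ignores the seed.
    # A tuple is returned so callers cannot mutate the cached value.
    return tuple(ord(residue) for residue in sequence)


def featurize(sequence: str, seed: int) -> list[float]:
    # The seed-independent part is computed once per distinct input.
    base = seed_independent_features(sequence)
    # Only this part depends on the seed and is redone on every call.
    rng = random.Random(seed)
    return [value + rng.random() for value in base]


seed_independent_features.cache_clear()
first = featurize("MKTAYIAK", seed=1)   # cache miss
second = featurize("MKTAYIAK", seed=2)  # cache hit, new seed
info = seed_independent_features.cache_info()
print(info.hits, info.misses)
```

Two general caveats apply to this pattern: functools.lru_cache requires hashable arguments, and a cached return value is shared between callers, so it should be immutable or copied before being modified.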

We tested the code on the same 872-token test case from #675. The generated structures have ~0.0 RMSD before and after this patch. Featurization for the first seed is about 1s slower (10s -> 11s), but for subsequent seeds the time drops from ~11s to 3-4s.

AI usage disclosure

The initial refactoring was discussed and partly implemented by Codex. I have manually reviewed the code, made edits to reduce the verbosity, and tested for correctness as mentioned above.
