feat: add CPU cluster-pair spatial search with Highway SIMD optimization #129
Closed
HaoZeke wants to merge 2 commits into
GROMACS-inspired nbnxm-style algorithm that groups atoms into 4-atom clusters and uses AABB bounding-box rejection before expanding to atom pairs. Auto-dispatched for systems with N >= 64 atoms on CPU. New files: cluster.hpp (Cluster/ClusterGrid structs, BB distance), cluster_pair_search.cpp (build_cluster_grid + cluster_pair_neighbors), tests/cluster_pair.cpp (correctness vs cell-list on cubic, triclinic, periodic, non-periodic systems).
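The bounding-box rejection step can be illustrated with a minimal sketch. This is not the code from cluster.hpp; the `BoundingBox` struct and the function names `make_bounding_box`, `bb_distance2`, and `clusters_may_interact` are hypothetical, and the inclusive cutoff comparison is an assumption. It only shows the general idea: if the squared gap between two cluster boxes exceeds the squared cutoff, no atom pair between the clusters can be within the cutoff, so the pair expansion can be skipped.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned bounding box of one cluster (hypothetical layout).
struct BoundingBox {
    std::array<double, 3> lower;
    std::array<double, 3> upper;
};

// Build the box that encloses all given points.
inline BoundingBox make_bounding_box(const std::vector<std::array<double, 3>>& points) {
    assert(!points.empty());
    BoundingBox bb;
    for (std::size_t d = 0; d < 3; ++d) {
        bb.lower[d] = points[0][d];
        bb.upper[d] = points[0][d];
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        for (std::size_t d = 0; d < 3; ++d) {
            bb.lower[d] = std::min(bb.lower[d], points[i][d]);
            bb.upper[d] = std::max(bb.upper[d], points[i][d]);
        }
    }
    return bb;
}

// Squared minimum distance between two boxes (0 if they overlap).
inline double bb_distance2(const BoundingBox& a, const BoundingBox& b) {
    double d2 = 0.0;
    for (std::size_t d = 0; d < 3; ++d) {
        // Gap along this axis; zero when the two intervals overlap.
        const double gap =
            std::max({0.0, a.lower[d] - b.upper[d], b.lower[d] - a.upper[d]});
        d2 += gap * gap;
    }
    return d2;
}

// Cheap rejection test: only cluster pairs that pass need atom-pair expansion.
inline bool clusters_may_interact(const BoundingBox& a, const BoundingBox& b,
                                  double cutoff) {
    return bb_distance2(a, b) <= cutoff * cutoff;
}
```

The test is conservative: it never rejects a cluster pair that contains an atom pair within the cutoff, but it may accept pairs whose atoms all turn out to be too far apart.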
Use Google Highway for portable SIMD vectorization of the distance calculation inner loop. Key changes:
- Cluster size raised from 4 to 8 atoms (matches AVX2 with 2-iteration loop, degrades gracefully to SSE or AVX-512)
- SoA (Structure of Arrays) position data in Cluster struct for aligned SIMD loads (pos_x/pos_y/pos_z arrays, 64-byte aligned)
- Precomputed wrapped positions avoid per-pair matrix multiply (the cell_shift.cartesian(cell_matrix) call moves from O(N_pairs) to O(N_cell_pairs))
- SIMD inner loop: broadcasts atom i position, loads 8 atom j positions, computes 8 distances in parallel using MulAdd
- Dispatch threshold raised from 64 to 256 (small systems do not amortize the cluster grid overhead)

Highway fetched via CMake FetchContent (v1.2.0, tests/examples disabled). Runtime dispatch selects best available ISA (SSE4, AVX2, AVX-512, NEON). New test cases with 7^3=343 atoms exercise the SIMD cluster-pair path (above the 256 threshold) and verify identical results to cell-list.
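As a rough illustration of the SoA layout and the inner loop described above, here is a scalar stand-in that does not use Highway. The struct `ClusterSoA`, its `count` field, and the function `scalar_check_distances` are hypothetical names, not the PR's code; `std::fma` plays the role of the MulAdd mentioned in the commit message, and the inclusive cutoff comparison is an assumption. The fixed-length loop over contiguous, aligned arrays is the shape that a SIMD library (or an auto-vectorizing compiler) can map onto vector lanes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kClusterSize = 8;

// Structure-of-Arrays cluster: one contiguous, 64-byte aligned array per axis.
// Padding slots (index >= count) must still hold finite values.
struct alignas(64) ClusterSoA {
    double pos_x[kClusterSize];
    double pos_y[kClusterSize];
    double pos_z[kClusterSize];
    std::size_t count;  // number of real atoms in this cluster
};

// Compare one atom i against all slots of cluster j.
// Returns a bitmask: bit j is set when atom j is real and within the cutoff.
inline std::uint32_t scalar_check_distances(double xi, double yi, double zi,
                                            const ClusterSoA& cj, double cutoff2) {
    assert(cj.count <= kClusterSize);
    std::uint32_t mask = 0;
    for (std::size_t j = 0; j < kClusterSize; ++j) {
        const double dx = cj.pos_x[j] - xi;
        const double dy = cj.pos_y[j] - yi;
        const double dz = cj.pos_z[j] - zi;
        // Fused multiply-adds accumulate dx^2 + dy^2 + dz^2.
        const double d2 = std::fma(dz, dz, std::fma(dy, dy, dx * dx));
        const bool hit = (j < cj.count) && (d2 <= cutoff2);
        mask |= static_cast<std::uint32_t>(hit) << j;
    }
    return mask;
}
```

Returning a bitmask rather than appending pairs inside the loop keeps the loop body branch-free; the caller can expand set bits into neighbor pairs afterwards.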
Contributor
Can you check the speedup for partial-pbc and non-pbc as well? There is currently a bug in how the cell list handles the non-pbc case (#126), which leads to significant performance degradation. If possible, I think we need to check that the related logic works well in these cases.
Owner
Now in #161
GROMACS nbnxm-inspired spatial search algorithm grouping atoms into 8-atom clusters
with AABB bounding-box rejection and SIMD distance calculations via Google Highway.
Selected automatically for N >= 64 atoms on CPU, providing 1.3-1.6x speedup over
cell-list for typical cutoffs.
- SoA position data in the Cluster struct (pos_x[8], pos_y[8], pos_z[8]) for efficient SIMD vectorization. Precomputed wrapped positions eliminate per-pair matrix multiplication.
- AABB bounding-box rejection before the distance calculation, reducing unnecessary SIMD operations.
- simd_check_distances() uses hwy::ScalableTag<double> to compute 8 distances in parallel. Highway v1.2.0 fetched via CMake FetchContent.
- Small systems use brute-force, very large systems may use cell-list depending on density.
Performance results (cosmolab, RTX 4070 Ti SUPER, FCC cutoff=5.0)
Initial implementation (4-atom clusters, no SIMD) was 17% slower than cell-list.
Root causes: per-pair matrix multiply, poor BB rejection ratio, grid building
overhead not amortized. Fixed by:
- simd_check_distances() with Highway for 8 distances in parallel
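The per-pair matrix multiply named among the root causes can be hoisted out of the atom-pair loop. The sketch below is an assumption-laden illustration, not the PR's implementation: `cartesian_shift`, `count_pairs_within`, and the `Vec3`/`Mat3` aliases are hypothetical, and the cell matrix is assumed to store the cell vectors as rows. The integer cell shift is converted to a Cartesian vector once, and every atom pair then needs only additions and subtractions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // assumption: rows are the cell vectors

// Cartesian shift = sum over k of cell_shift[k] * cell[k].
inline Vec3 cartesian_shift(const std::array<int, 3>& cell_shift, const Mat3& cell) {
    Vec3 out{0.0, 0.0, 0.0};
    for (std::size_t k = 0; k < 3; ++k) {
        for (std::size_t d = 0; d < 3; ++d) {
            out[d] += static_cast<double>(cell_shift[k]) * cell[k][d];
        }
    }
    return out;
}

// Count pairs (i, j) with |r_j + shift - r_i| <= cutoff.
// The matrix product runs once per call instead of once per atom pair.
inline std::size_t count_pairs_within(const std::vector<Vec3>& pos_i,
                                      const std::vector<Vec3>& pos_j,
                                      const std::array<int, 3>& cell_shift,
                                      const Mat3& cell, double cutoff) {
    assert(cutoff >= 0.0);
    const Vec3 shift = cartesian_shift(cell_shift, cell);  // hoisted
    const double cutoff2 = cutoff * cutoff;
    std::size_t n_pairs = 0;
    for (const Vec3& ri : pos_i) {
        for (const Vec3& rj : pos_j) {
            double d2 = 0.0;
            for (std::size_t d = 0; d < 3; ++d) {
                const double delta = rj[d] + shift[d] - ri[d];
                d2 += delta * delta;
            }
            if (d2 <= cutoff2) {
                ++n_pairs;
            }
        }
    }
    return n_pairs;
}
```

In a cluster-pair setting the same idea can go one step further by storing already-shifted positions per cluster, which is what "precomputed wrapped positions" suggests.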