feat: add CPU cluster-pair spatial search with Highway SIMD optimization #129

Closed
HaoZeke wants to merge 2 commits into Luthaf:main from HaoZeke:feat/cluster-pair-search

HaoZeke commented Mar 23, 2026

Contributor

This PR adds a spatial search algorithm inspired by GROMACS nbnxm. Atoms are grouped
into 8-atom clusters, cluster pairs are filtered with an axis-aligned bounding-box
(AABB) test, and the remaining distances are computed with SIMD via Google Highway.
It is selected automatically for N >= 256 atoms on CPU and is 1.3-1.6x faster than
the cell list in the benchmarks below (FCC, cutoff=5.0).

  • Cluster data structures: 8-atom clusters with SoA layout (pos_x[8], pos_y[8], pos_z[8])
    for efficient SIMD vectorization. Precomputed wrapped positions eliminate per-pair
    matrix multiplication.
  • AABB rejection: Axis-aligned bounding box test filters cluster pairs before
    distance calculation, reducing unnecessary SIMD operations.
  • Highway SIMD: simd_check_distances() uses hwy::ScalableTag<double> to compute
    8 distances in parallel. Highway v1.2.0 fetched via CMake FetchContent.
  • Auto-dispatch: Cluster-pair search is selected for N >= 256 atoms on CPU. Smaller
    systems use brute force; very large systems may use the cell list, depending on density.

Performance results (cosmolab, RTX 4070 Ti SUPER, FCC cutoff=5.0)

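| N atoms | Cell-list (ms) | SIMD cluster (ms) | Speedup |
|--------:|---------------:|------------------:|--------:|
|     256 |          0.452 |             0.283 |   1.60x |
|    1024 |          1.221 |             0.943 |   1.29x |
|    4096 |          5.165 |             3.652 |   1.41x |
|   16384 |         19.447 |            14.650 |   1.33x |
|   32768 |         39.601 |            30.050 |   1.32x |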

The initial implementation (4-atom clusters, no SIMD) was 17% slower than the cell list.
Root causes: a per-pair matrix multiply, a poor bounding-box rejection ratio, and
grid-building overhead that was not amortized. Fixed by:

  1. Cluster size 4 -> 8 (two AVX2 double-precision vectors, or one AVX-512 vector)
  2. SoA position arrays in Cluster struct
  3. Precomputed wrapped positions (eliminates per-pair matrix multiply)
  4. simd_check_distances() with Highway for 8 distances in parallel
  5. Dispatch threshold raised from 64 to 256
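
The inner kernel described in items 2 and 4 can be sketched as follows. This is a scalar stand-in, not the Highway-based `simd_check_distances()` from this PR: it shows the same data flow (one atom i tested against the 8 SoA positions of cluster j) in plain C++ so that it compiles without Highway. The name `check_distances_scalar`, the bitmask return value, and the inclusive `<=` comparison are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CLUSTER_SIZE = 8;

// SoA positions of one cluster (hypothetical name; see the Cluster struct
// in this PR for the real layout).
struct ClusterPositions {
    alignas(64) std::array<double, CLUSTER_SIZE> x;
    alignas(64) std::array<double, CLUSTER_SIZE> y;
    alignas(64) std::array<double, CLUSTER_SIZE> z;
};

// Test atom i, at (xi, yi, zi), against all 8 atoms of cluster j.
// Bit j of the result is set when atom j lies within the cutoff.
// A SIMD version would broadcast (xi, yi, zi) and process several j per
// instruction; this loop does the same work one lane at a time.
std::uint8_t check_distances_scalar(double xi, double yi, double zi,
                                    const ClusterPositions& cj,
                                    double cutoff2) {
    unsigned mask = 0;
    for (std::size_t j = 0; j < CLUSTER_SIZE; ++j) {
        const double dx = cj.x[j] - xi;
        const double dy = cj.y[j] - yi;
        const double dz = cj.z[j] - zi;
        const double d2 = dx * dx + dy * dy + dz * dz;
        if (d2 <= cutoff2) {
            mask |= 1u << j;
        }
    }
    return static_cast<std::uint8_t>(mask);
}
```

Because each coordinate is stored contiguously, the loop body reads `x[j]`, `y[j]` and `z[j]` with unit stride, which is what lets a vector unit load several atoms at once. The caller still has to exclude self-pairs (an atom tested against itself has distance zero and sets its own bit).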

HaoZeke added 2 commits March 23, 2026 14:08
GROMACS-inspired nbnxm-style algorithm that groups atoms into 4-atom
clusters and uses AABB bounding-box rejection before expanding to
atom pairs. Auto-dispatched for systems with N >= 64 atoms on CPU.

New files: cluster.hpp (Cluster/ClusterGrid structs, BB distance),
cluster_pair_search.cpp (build_cluster_grid + cluster_pair_neighbors),
tests/cluster_pair.cpp (correctness vs cell-list on cubic, triclinic,
periodic, non-periodic systems).

Use Google Highway for portable SIMD vectorization of the distance
calculation inner loop. Key changes:

- Cluster size raised from 4 to 8 atoms (two loop iterations with AVX2;
  adapts to narrower SSE or wider AVX-512 vectors)
- SoA (Structure of Arrays) position data in Cluster struct for aligned
  SIMD loads (pos_x/pos_y/pos_z arrays, 64-byte aligned)
- Precomputed wrapped positions avoid per-pair matrix multiply (the
  cell_shift.cartesian(cell_matrix) call moves from O(N_pairs) to
  O(N_cell_pairs))
- SIMD inner loop: broadcasts atom i position, loads 8 atom j positions,
  computes 8 distances in parallel using MulAdd
- Dispatch threshold raised from 64 to 256 (small systems do not amortize
  the cluster grid overhead)

Highway fetched via CMake FetchContent (v1.2.0, tests/examples disabled).
Runtime dispatch selects best available ISA (SSE4, AVX2, AVX-512, NEON).

New test cases with 7^3=343 atoms exercise the SIMD cluster-pair path
(above the 256 threshold) and verify identical results to cell-list.
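
A check of this kind needs a comparison that ignores pair order and pair orientation. The sketch below shows one way to do that; it is not the test code in tests/cluster_pair.cpp, and `PairEntry`, `canonical`, `same_pairs` and `cubic_lattice` are hypothetical names. It assumes both searches return each pair once, together with its integer periodic shift.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// One neighbor pair: atom indices plus the periodic shift applied to j.
struct PairEntry {
    std::size_t i;
    std::size_t j;
    std::array<int, 3> shift;
};

// Put a pair into a unique orientation: (i, j, S) and (j, i, -S) describe
// the same pair, so keep the one with i < j (or the smaller shift if i == j).
PairEntry canonical(PairEntry p) {
    const std::array<int, 3> neg = {-p.shift[0], -p.shift[1], -p.shift[2]};
    if (p.i > p.j || (p.i == p.j && neg < p.shift)) {
        std::swap(p.i, p.j);
        p.shift = neg;
    }
    return p;
}

// True when both lists contain the same pairs, in any order or orientation.
bool same_pairs(std::vector<PairEntry> a, std::vector<PairEntry> b) {
    const auto key = [](const PairEntry& p) {
        return std::tie(p.i, p.j, p.shift);
    };
    const auto prepare = [&key](std::vector<PairEntry>& v) {
        for (PairEntry& p : v) {
            p = canonical(p);
        }
        std::sort(v.begin(), v.end(),
                  [&key](const PairEntry& l, const PairEntry& r) {
                      return key(l) < key(r);
                  });
    };
    prepare(a);
    prepare(b);
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [&key](const PairEntry& l, const PairEntry& r) {
                          return key(l) == key(r);
                      });
}

// n^3 points on a simple cubic lattice, e.g. n = 7 gives 343 atoms.
std::vector<std::array<double, 3>> cubic_lattice(std::size_t n, double spacing) {
    std::vector<std::array<double, 3>> points;
    points.reserve(n * n * n);
    for (std::size_t ix = 0; ix < n; ++ix) {
        for (std::size_t iy = 0; iy < n; ++iy) {
            for (std::size_t iz = 0; iz < n; ++iz) {
                points.push_back({spacing * static_cast<double>(ix),
                                  spacing * static_cast<double>(iy),
                                  spacing * static_cast<double>(iz)});
            }
        }
    }
    return points;
}
```

In a real test, the two lists passed to `same_pairs` would come from the cluster-pair search and the cell-list search run on the same lattice and cutoff.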

HaoZeke marked this pull request as draft March 23, 2026 13:14
@GardevoirX

Contributor

Can you check the speedup for partial-PBC and non-PBC systems as well? The cell list currently has a bug in how it handles the non-PBC case (#126), which leads to a significant performance degradation. If possible, I think we should check that the related logic works well in these cases.

Luthaf commented May 21, 2026

Owner

Now in #161

Luthaf closed this May 21, 2026