Skip to content

feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161

Open
HaoZeke wants to merge 8 commits into
Luthaf:mainfrom
HaoZeke:feat/cuda-verlet-list
Open

feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161
HaoZeke wants to merge 8 commits into
Luthaf:mainfrom
HaoZeke:feat/cuda-verlet-list

Conversation

@HaoZeke

@HaoZeke HaoZeke commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

This stacks SIMD-accelerated CPU spatial search and Verlet recompute on top of the existing CPU Verlet cache (PR #148):

  • GROMACS nbnxm-style 8-atom cluster-pair spatial search with AABB rejection and Highway SIMD distance calculations (hwy::ScalableTag<double>, ~8 lanes on AVX2 / SVE). Auto-dispatched at N >= 256 for both stateless neighbor lists and the cell-list rebuild step of the Verlet cache.
  • Cluster-backed Verlet cache: retain cluster pair lists across recomputes, cache cartesian shifts per candidate, and prune entries that drift out of the skin radius. Recompute only re-evaluates the filtered distances.
  • SIMD-packed candidate blocks: lay out cached candidates as 8-wide blocks of (i, j, shift_idx) aligned to the Highway scalable-tag lane width, with sentinel j-padding for the tail.
  • Highway-vectorized recompute: hwy::GatherIndexN against the packed candidate vector, SIMD-batched sqrt of kept distances, and batched ensure_capacity + unchecked set_* writes so each block pays one capacity check instead of N.

The benchmark driver that produced the numbers below lives in our internal pixi_envs repo (vesin_bench/bench_verlet_simd.py) and is not vendored into this PR; CSV/PNG artifacts attached below are the evidence.

Design

CLUSTER_PAIR_THRESHOLD = 256 selects cluster-pair on CPU. Below that, the existing cell-list path wins because cluster grid construction overhead dominates. The cluster-pair search produces the same neighbor-pair output as cpu_cell_list (verified by vesin/tests/cluster_pair.cpp, 621 lines, orthogonal/triclinic, periodic/non-periodic, half/full list).

When algorithm = Auto and N >= threshold, VerletState::rebuild routes the candidate search through cluster_pair_neighbors instead of stateless_neighbors. The cluster pair list is retained alongside the candidate cache and Cartesian shifts. Recompute reuses both: the SIMD candidate block filter (filter_simd_candidate_blocks) gathers dx, dy, dz through hwy::GatherIndexN, computes d2 in 8-lane lanes, reduces to a per-block survivor count, and only then evaluates sqrt on the survivors (also lane-batched).

Cache invalidation extends VerletState::set_options to clear candidates whenever the algorithm option changes (since cluster-pair and cell-list candidates have different ordering and shift conventions).

Performance results (CPU only)

Measured on rg.cosmolab (Threadripper PRO 3955WX host CPU), density 0.05, cutoff 5.0, skin 1.0, 80 measured MD-like steps, 3 repeats; this branch at c849c87, baseline at upstream origin/main 8e86d9b; 2026-05-18.

CPU Verlet: upstream main vs feat/simd-verlet-prototype

vs upstream main, post-Verlet recompute (per-step hot path)

This is the steady-state cost during MD: cache built, then --steps calls reuse it. Median ms/step:

atoms upstream main (cell-list + scalar recompute) this PR, cell-list rebuild + SIMD recompute this PR, cluster-pair rebuild + SIMD recompute speedup vs main
1024 0.91 0.64 0.57 1.42x → 1.58x
2048 1.94 1.33 1.19 1.46x → 1.62x
4096 3.76 2.58 2.46 1.46x → 1.53x
8192 7.61 5.61 5.32 1.36x → 1.43x
16384 14.75 10.46 10.35 1.41x → 1.43x
32768 31.86 23.76 24.05 1.34x → 1.32x

The Highway-vectorized recompute (hwy::GatherIndexN over packed candidate blocks + lane-batched sqrt + batched ensure_capacity + unchecked set_*) is 1.3-1.5x faster than main's scalar recompute across all sizes, on both the cell-list and cluster-pair rebuild paths. The cluster-pair recompute is marginally faster at small N (extra pre-Verlet headroom from the faster rebuild leaves the recompute cache slightly tighter) and converges with cell-list at large N (recompute is sqrt/memory-bound and the candidate cache shape stops mattering).

Stateless / rebuild path

This is the cost paid on the first call (or whenever atoms drift > skin/2). Median ms/step:

atoms upstream main (cell-list) this PR, cell-list this PR, SIMD cluster-pair rebuild speedup vs main
1024 2.66 2.65 1.64 1.62x
2048 5.87 5.87 3.43 1.71x
4096 10.26 10.25 6.49 1.58x
8192 20.88 20.88 13.03 1.60x
16384 39.25 39.34 25.26 1.55x
32768 73.12 73.59 49.58 1.47x

This PR's cell-list rebuild matches main exactly (no algorithm change on that path). The new SIMD cluster-pair rebuild beats cell-list by 1.47-1.71x and is selected automatically when algorithm = Auto and N >= CLUSTER_PAIR_THRESHOLD = 256.

image

HaoZeke added 5 commits May 18, 2026 07:10
GROMACS nbnxm-inspired spatial search grouping atoms into 8-atom clusters
with AABB bounding-box rejection and SIMD distance calculations via Google
Highway (v1.2.0, fetched via CMake FetchContent). Auto-dispatched for
N >= CLUSTER_PAIR_THRESHOLD (256) on CPU and for the cell-list rebuild
inside the CPU Verlet cache.

- vesin/src/cluster.hpp: 8-atom cluster structs (SoA pos_x/y/z[8])
  + CLUSTER_PAIR_THRESHOLD constant
- vesin/src/cluster_pair_search.cpp: SIMD-ready cluster-pair search,
  AABB rejection, simd_check_distances via hwy::ScalableTag<double>
- vesin/src/vesin.cpp: auto-dispatch to cluster_pair_neighbors at the
  CPU entry point
- vesin/src/verlet.cpp: invalidate cache when algorithm option changes
  and route the rebuild candidate search through cluster_pair_neighbors
  when N >= threshold
- vesin/tests/cluster_pair.cpp: correctness vs cell-list, orthogonal/
  triclinic boxes, periodic/non-periodic, half/full list, large systems
- vesin/CMakeLists.txt: register cluster_pair_search.cpp, fetch Highway
  v1.2.0
Cache cluster pair lists from the cluster-pair search so subsequent Verlet
recomputes reuse the topology, and cache cartesian shifts so the recompute
path only re-evaluates the filtered distances.

- vesin/src/verlet.cpp/.hpp: retain cluster pair index/shift cache, store
  cartesian shifts per candidate, prune entries that drift out of skin
  radius
- vesin/src/cluster.hpp, cluster_pair_search.cpp: expose the cluster pair
  cache helpers needed by VerletState
- vesin/tests/verlet.cpp: cache invalidation on algorithm change,
  cluster-backed recompute, retained-cache reuse, parity with stateless
  cluster-pair output
Lay out the cached candidate list as 8-wide (Highway scalable-tag aligned)
blocks of (i, j, shift_idx) so the recompute filter can stream contiguous
double lanes through hwy::ScalableTag<double> without per-pair gathers
against the candidate vector. Block headers are aligned to the SIMD lane
width and the tail is padded with sentinel j indices that filter cleanly.

- vesin/src/verlet.{cpp,hpp}: packed candidate block layout + alignment
- vesin/tests/verlet.cpp: flat candidate layout assertions for the
  packed-block invariant
Push Highway SIMD into the recompute hot path:

- vesin/src/verlet.cpp: replace per-pair scalar gather of candidate
  deltas with hwy::GatherIndexN against the packed candidate vector;
  batch the sqrt of kept distances through a SIMD lane reduction in
  filter_simd_candidate_blocks
- vesin/src/cpu_cell_list.{cpp,hpp}: batched ensure_capacity + unchecked
  set_* writes in the verlet recompute fast path so each candidate block
  pays one capacity check and one bounds check instead of N
- python/vesin/tests/test_verlet.py: cover auto Verlet recompute on
  large candidate caches and use the installed vesin to catch a previous
  in-tree-import crash regression
@HaoZeke HaoZeke force-pushed the feat/cuda-verlet-list branch from bca1260 to c849c87 Compare May 18, 2026 12:39
HaoZeke added 3 commits May 18, 2026 07:59
CI failures on PR Luthaf#161:

- 'Create the single file build' tox env: create-single-cpp.py concatenates
  vesin's .cpp files and tries to resolve every #include against
  vesin/src, vesin/include, and _deps/gpulite-src. The script could not
  find hwy/highway.h (Highway is a third-party FetchContent dep with no
  source in the repo) and threw at script time before any compile.

- M1 macOS / x86_64 Linux (torch v2.6) wheel build: delocate-wheel and
  auditwheel both failed because libvesin.{so,dylib} dynamically linked
  libhwy.{so.1,1.dylib} and the dependency was not on the wheel bundling
  path. Highway lived in build-shared/_deps/highway-build/, not in any
  system or rpath location.

- ubuntu-24.04 / Clang fortran-tests: the deterministic CPU sort fix
  (47714b1) changed the within-pair[0] sort order. The fortran test
  encoded the previous implementation-defined order; the new sort is
  by (pair[0], pair[1], shift_x, shift_y, shift_z).

Fixes:

1. vesin/CMakeLists.txt: temporarily set BUILD_SHARED_LIBS=OFF around
   FetchContent_MakeAvailable(highway) so libhwy compiles as a static
   library and libvesin.so contains its objects (no runtime hwy dep).
   Define VESIN_HAVE_HIGHWAY so the source can gate Highway-using code
   for builds that do not have it (the single-file dist build).

2. vesin/src/{verlet.cpp,cluster_pair_search.cpp}: gate the Highway
   include + Highway-using bodies (simd_filter_deltas, gather_pair_deltas,
   the inline SIMD sqrt in filter_simd_candidate_blocks, simd_check_distances)
   with #ifdef VESIN_HAVE_HIGHWAY and provide scalar fallbacks. Same
   output, slower; required by the single-file build path.

3. vesin/scripts/create-single-cpp.py: include cluster_pair_search.cpp
   in the concatenated source (linker would otherwise miss
   build_cluster_grid / build_cluster_pair_candidates / cluster_pair_neighbors).

4. vesin/src/cluster.hpp: move shared inline helpers (is_zero_shift,
   shift_cartesian) into the public header so the single-build does not
   end up with two anonymous-namespace static copies redefined in the
   concatenated TU.

5. Drop the now-duplicate definitions of those helpers from verlet.cpp
   and cluster_pair_search.cpp.

6. fortran/tests/tests.f90: update expected_pairs / expected_distances
   / expected_shifts / expected_vectors to the new deterministic order
   (same content, reordered). Add a comment recording the sort key.
The ubuntu-24.04 / GCC CI image ships GCC 9 / libstdc++ 9, where
Catch2 (>=3) hits an enable_if<false>::type instantiation error when
trying to stringify std::set<std::tuple<size_t, size_t, int32_t,
int32_t, int32_t>> for the failure-diff path of CHECK(cl_pairs ==
auto_pairs) in vesin/tests/cluster_pair.cpp. The default StringMaker
SFINAE chain does not detect the tuple as range-printable on that
libstdc++.

Provide an explicit StringMaker specialisation for the (i, j, sx,
sy, sz) tuple. Short-circuits the SFINAE chain; same diagnostic
output everywhere else. Tested locally (22036 assertions in 13
cases) with GCC 13 / libstdc++ 13.
CI lint job used clang-format 20.1.8 and rejected the multi-line
string concatenation I wrote in the previous commit. Re-format with
the locked clang-format version.
@Luthaf

Luthaf commented May 19, 2026

Copy link
Copy Markdown
Owner

CLUSTER_PAIR_THRESHOLD = 256 selects cluster-pair on CPU.

I don't get this part. What does the threshold apply to?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants