feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks) by HaoZeke · Pull Request #161 · Luthaf/vesin

HaoZeke · 2026-05-18T12:09:17Z

Summary

This stacks SIMD-accelerated CPU spatial search and Verlet recompute on top of the existing CPU Verlet cache (PR #148):

GROMACS nbnxm-style 8-atom cluster-pair spatial search with AABB rejection and Highway SIMD distance calculations (hwy::ScalableTag<double>, ~8 lanes on AVX2 / SVE). Auto-dispatched at N >= 256 for both stateless neighbor lists and the cell-list rebuild step of the Verlet cache.
Cluster-backed Verlet cache: retain cluster pair lists across recomputes, cache cartesian shifts per candidate, and prune entries that drift out of the skin radius. Recompute only re-evaluates the filtered distances.
SIMD-packed candidate blocks: lay out cached candidates as 8-wide blocks of (i, j, shift_idx) aligned to the Highway scalable-tag lane width, with sentinel j-padding for the tail.
Highway-vectorized recompute: hwy::GatherIndexN against the packed candidate vector, SIMD-batched sqrt of kept distances, and batched ensure_capacity + unchecked set_* writes so each block pays one capacity check instead of N.

The benchmark driver that produced the numbers below lives in our internal pixi_envs repo (vesin_bench/bench_verlet_simd.py) and is not vendored into this PR; CSV/PNG artifacts attached below are the evidence.

Design

CLUSTER_PAIR_THRESHOLD = 256 selects cluster-pair on CPU. Below that, the existing cell-list path wins because cluster grid construction overhead dominates. The cluster-pair search produces the same neighbor-pair output as cpu_cell_list (verified by vesin/tests/cluster_pair.cpp, 621 lines, orthogonal/triclinic, periodic/non-periodic, half/full list).

When algorithm = Auto and N >= threshold, VerletState::rebuild routes the candidate search through cluster_pair_neighbors instead of stateless_neighbors. The cluster pair list is retained alongside the candidate cache and Cartesian shifts. Recompute reuses both: the SIMD candidate block filter (filter_simd_candidate_blocks) gathers dx, dy, dz through hwy::GatherIndexN, computes d2 in 8-lane lanes, reduces to a per-block survivor count, and only then evaluates sqrt on the survivors (also lane-batched).

Cache invalidation extends VerletState::set_options to clear candidates whenever the algorithm option changes (since cluster-pair and cell-list candidates have different ordering and shift conventions).

Performance results (CPU only)

Measured on rg.cosmolab (Threadripper PRO 3955WX host CPU), density 0.05, cutoff 5.0, skin 1.0, 80 measured MD-like steps, 3 repeats; this branch at c849c87, baseline at upstream origin/main 8e86d9b; 2026-05-18.

vs upstream main, post-Verlet recompute (per-step hot path)

This is the steady-state cost during MD: cache built, then --steps calls reuse it. Median ms/step:

atoms	upstream main (cell-list + scalar recompute)	this PR, cell-list rebuild + SIMD recompute	this PR, cluster-pair rebuild + SIMD recompute	speedup vs main
1024	0.91	0.64	0.57	1.42x → 1.58x
2048	1.94	1.33	1.19	1.46x → 1.62x
4096	3.76	2.58	2.46	1.46x → 1.53x
8192	7.61	5.61	5.32	1.36x → 1.43x
16384	14.75	10.46	10.35	1.41x → 1.43x
32768	31.86	23.76	24.05	1.34x → 1.32x

The Highway-vectorized recompute (hwy::GatherIndexN over packed candidate blocks + lane-batched sqrt + batched ensure_capacity + unchecked set_*) is 1.3-1.5x faster than main's scalar recompute across all sizes, on both the cell-list and cluster-pair rebuild paths. The cluster-pair recompute is marginally faster at small N (extra pre-Verlet headroom from the faster rebuild leaves the recompute cache slightly tighter) and converges with cell-list at large N (recompute is sqrt/memory-bound and the candidate cache shape stops mattering).

Stateless / rebuild path

This is the cost paid on the first call (or whenever atoms drift > skin/2). Median ms/step:

atoms	upstream main (cell-list)	this PR, cell-list	this PR, SIMD cluster-pair	rebuild speedup vs main
1024	2.66	2.65	1.64	1.62x
2048	5.87	5.87	3.43	1.71x
4096	10.26	10.25	6.49	1.58x
8192	20.88	20.88	13.03	1.60x
16384	39.25	39.34	25.26	1.55x
32768	73.12	73.59	49.58	1.47x

This PR's cell-list rebuild matches main exactly (no algorithm change on that path). The new SIMD cluster-pair rebuild beats cell-list by 1.47-1.71x and is selected automatically when algorithm = Auto and N >= CLUSTER_PAIR_THRESHOLD = 256.

GROMACS nbnxm-inspired spatial search grouping atoms into 8-atom clusters with AABB bounding-box rejection and SIMD distance calculations via Google Highway (v1.2.0, fetched via CMake FetchContent). Auto-dispatched for N >= CLUSTER_PAIR_THRESHOLD (256) on CPU and for the cell-list rebuild inside the CPU Verlet cache. - vesin/src/cluster.hpp: 8-atom cluster structs (SoA pos_x/y/z[8]) + CLUSTER_PAIR_THRESHOLD constant - vesin/src/cluster_pair_search.cpp: SIMD-ready cluster-pair search, AABB rejection, simd_check_distances via hwy::ScalableTag<double> - vesin/src/vesin.cpp: auto-dispatch to cluster_pair_neighbors at the CPU entry point - vesin/src/verlet.cpp: invalidate cache when algorithm option changes and route the rebuild candidate search through cluster_pair_neighbors when N >= threshold - vesin/tests/cluster_pair.cpp: correctness vs cell-list, orthogonal/ triclinic boxes, periodic/non-periodic, half/full list, large systems - vesin/CMakeLists.txt: register cluster_pair_search.cpp, fetch Highway v1.2.0

Cache cluster pair lists from the cluster-pair search so subsequent Verlet recomputes reuse the topology, and cache cartesian shifts so the recompute path only re-evaluates the filtered distances. - vesin/src/verlet.cpp/.hpp: retain cluster pair index/shift cache, store cartesian shifts per candidate, prune entries that drift out of skin radius - vesin/src/cluster.hpp, cluster_pair_search.cpp: expose the cluster pair cache helpers needed by VerletState - vesin/tests/verlet.cpp: cache invalidation on algorithm change, cluster-backed recompute, retained-cache reuse, parity with stateless cluster-pair output

Lay out the cached candidate list as 8-wide (Highway scalable-tag aligned) blocks of (i, j, shift_idx) so the recompute filter can stream contiguous double lanes through hwy::ScalableTag<double> without per-pair gathers against the candidate vector. Block headers are aligned to the SIMD lane width and the tail is padded with sentinel j indices that filter cleanly. - vesin/src/verlet.{cpp,hpp}: packed candidate block layout + alignment - vesin/tests/verlet.cpp: flat candidate layout assertions for the packed-block invariant

Push Highway SIMD into the recompute hot path: - vesin/src/verlet.cpp: replace per-pair scalar gather of candidate deltas with hwy::GatherIndexN against the packed candidate vector; batch the sqrt of kept distances through a SIMD lane reduction in filter_simd_candidate_blocks - vesin/src/cpu_cell_list.{cpp,hpp}: batched ensure_capacity + unchecked set_* writes in the verlet recompute fast path so each candidate block pays one capacity check and one bounds check instead of N - python/vesin/tests/test_verlet.py: cover auto Verlet recompute on large candidate caches and use the installed vesin to catch a previous in-tree-import crash regression

CI failures on PR Luthaf#161: - 'Create the single file build' tox env: create-single-cpp.py concatenates vesin's .cpp files and tries to resolve every #include against vesin/src, vesin/include, and _deps/gpulite-src. The script could not find hwy/highway.h (Highway is a third-party FetchContent dep with no source in the repo) and threw at script time before any compile. - M1 macOS / x86_64 Linux (torch v2.6) wheel build: delocate-wheel and auditwheel both failed because libvesin.{so,dylib} dynamically linked libhwy.{so.1,1.dylib} and the dependency was not on the wheel bundling path. Highway lived in build-shared/_deps/highway-build/, not in any system or rpath location. - ubuntu-24.04 / Clang fortran-tests: the deterministic CPU sort fix (47714b1) changed the within-pair[0] sort order. The fortran test encoded the previous implementation-defined order; the new sort is by (pair[0], pair[1], shift_x, shift_y, shift_z). Fixes: 1. vesin/CMakeLists.txt: temporarily set BUILD_SHARED_LIBS=OFF around FetchContent_MakeAvailable(highway) so libhwy compiles as a static library and libvesin.so contains its objects (no runtime hwy dep). Define VESIN_HAVE_HIGHWAY so the source can gate Highway-using code for builds that do not have it (the single-file dist build). 2. vesin/src/{verlet.cpp,cluster_pair_search.cpp}: gate the Highway include + Highway-using bodies (simd_filter_deltas, gather_pair_deltas, the inline SIMD sqrt in filter_simd_candidate_blocks, simd_check_distances) with #ifdef VESIN_HAVE_HIGHWAY and provide scalar fallbacks. Same output, slower; required by the single-file build path. 3. vesin/scripts/create-single-cpp.py: include cluster_pair_search.cpp in the concatenated source (linker would otherwise miss build_cluster_grid / build_cluster_pair_candidates / cluster_pair_neighbors). 4. vesin/src/cluster.hpp: move shared inline helpers (is_zero_shift, shift_cartesian) into the public header so the single-build does not end up with two anonymous-namespace static copies redefined in the concatenated TU. 5. Drop the now-duplicate definitions of those helpers from verlet.cpp and cluster_pair_search.cpp. 6. fortran/tests/tests.f90: update expected_pairs / expected_distances / expected_shifts / expected_vectors to the new deterministic order (same content, reordered). Add a comment recording the sort key.

The ubuntu-24.04 / GCC CI image ships GCC 9 / libstdc++ 9, where Catch2 (>=3) hits an enable_if<false>::type instantiation error when trying to stringify std::set<std::tuple<size_t, size_t, int32_t, int32_t, int32_t>> for the failure-diff path of CHECK(cl_pairs == auto_pairs) in vesin/tests/cluster_pair.cpp. The default StringMaker SFINAE chain does not detect the tuple as range-printable on that libstdc++. Provide an explicit StringMaker specialisation for the (i, j, sx, sy, sz) tuple. Short-circuits the SFINAE chain; same diagnostic output everywhere else. Tested locally (22036 assertions in 13 cases) with GCC 13 / libstdc++ 13.

CI lint job used clang-format 20.1.8 and rejected the multi-line string concatenation I wrote in the previous commit. Re-format with the locked clang-format version.

Luthaf · 2026-05-19T10:01:39Z

CLUSTER_PAIR_THRESHOLD = 256 selects cluster-pair on CPU.

I don't get this part. What does the threshold apply to?

HaoZeke added 5 commits May 18, 2026 07:10

fix: make CPU neighbor sorting deterministic

47714b1

HaoZeke force-pushed the feat/cuda-verlet-list branch from bca1260 to c849c87 Compare May 18, 2026 12:39

HaoZeke added 3 commits May 18, 2026 07:59

fix(test): re-format StringMaker per clang-format-20.1.8

e078106

CI lint job used clang-format 20.1.8 and rejected the multi-line string concatenation I wrote in the previous commit. Re-format with the locked clang-format version.

Luthaf mentioned this pull request May 21, 2026

feat: add CPU cluster-pair spatial search with Highway SIMD optimization #129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161

feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161
HaoZeke wants to merge 8 commits into
Luthaf:mainfrom
HaoZeke:feat/cuda-verlet-list

HaoZeke commented May 18, 2026 •

edited

Loading

Uh oh!

Luthaf commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

HaoZeke commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Performance results (CPU only)

vs upstream main, post-Verlet recompute (per-step hot path)

Stateless / rebuild path

Uh oh!

Luthaf commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HaoZeke commented May 18, 2026 •

edited

Loading