feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161
Open
HaoZeke wants to merge 8 commits into
Open
feat: SIMD-accelerated CPU Verlet (cluster-pair search + packed candidate blocks)#161HaoZeke wants to merge 8 commits into
HaoZeke wants to merge 8 commits into
Conversation
GROMACS nbnxm-inspired spatial search grouping atoms into 8-atom clusters with AABB bounding-box rejection and SIMD distance calculations via Google Highway (v1.2.0, fetched via CMake FetchContent). Auto-dispatched for N >= CLUSTER_PAIR_THRESHOLD (256) on CPU and for the cell-list rebuild inside the CPU Verlet cache. - vesin/src/cluster.hpp: 8-atom cluster structs (SoA pos_x/y/z[8]) + CLUSTER_PAIR_THRESHOLD constant - vesin/src/cluster_pair_search.cpp: SIMD-ready cluster-pair search, AABB rejection, simd_check_distances via hwy::ScalableTag<double> - vesin/src/vesin.cpp: auto-dispatch to cluster_pair_neighbors at the CPU entry point - vesin/src/verlet.cpp: invalidate cache when algorithm option changes and route the rebuild candidate search through cluster_pair_neighbors when N >= threshold - vesin/tests/cluster_pair.cpp: correctness vs cell-list, orthogonal/ triclinic boxes, periodic/non-periodic, half/full list, large systems - vesin/CMakeLists.txt: register cluster_pair_search.cpp, fetch Highway v1.2.0
Cache cluster pair lists from the cluster-pair search so subsequent Verlet recomputes reuse the topology, and cache cartesian shifts so the recompute path only re-evaluates the filtered distances. - vesin/src/verlet.cpp/.hpp: retain cluster pair index/shift cache, store cartesian shifts per candidate, prune entries that drift out of skin radius - vesin/src/cluster.hpp, cluster_pair_search.cpp: expose the cluster pair cache helpers needed by VerletState - vesin/tests/verlet.cpp: cache invalidation on algorithm change, cluster-backed recompute, retained-cache reuse, parity with stateless cluster-pair output
Lay out the cached candidate list as 8-wide (Highway scalable-tag aligned)
blocks of (i, j, shift_idx) so the recompute filter can stream contiguous
double lanes through hwy::ScalableTag<double> without per-pair gathers
against the candidate vector. Block headers are aligned to the SIMD lane
width and the tail is padded with sentinel j indices that filter cleanly.
- vesin/src/verlet.{cpp,hpp}: packed candidate block layout + alignment
- vesin/tests/verlet.cpp: flat candidate layout assertions for the
packed-block invariant
Push Highway SIMD into the recompute hot path:
- vesin/src/verlet.cpp: replace per-pair scalar gather of candidate
deltas with hwy::GatherIndexN against the packed candidate vector;
batch the sqrt of kept distances through a SIMD lane reduction in
filter_simd_candidate_blocks
- vesin/src/cpu_cell_list.{cpp,hpp}: batched ensure_capacity + unchecked
set_* writes in the verlet recompute fast path so each candidate block
pays one capacity check and one bounds check instead of N
- python/vesin/tests/test_verlet.py: cover auto Verlet recompute on
large candidate caches and use the installed vesin to catch a previous
in-tree-import crash regression
bca1260 to
c849c87
Compare
CI failures on PR Luthaf#161: - 'Create the single file build' tox env: create-single-cpp.py concatenates vesin's .cpp files and tries to resolve every #include against vesin/src, vesin/include, and _deps/gpulite-src. The script could not find hwy/highway.h (Highway is a third-party FetchContent dep with no source in the repo) and threw at script time before any compile. - M1 macOS / x86_64 Linux (torch v2.6) wheel build: delocate-wheel and auditwheel both failed because libvesin.{so,dylib} dynamically linked libhwy.{so.1,1.dylib} and the dependency was not on the wheel bundling path. Highway lived in build-shared/_deps/highway-build/, not in any system or rpath location. - ubuntu-24.04 / Clang fortran-tests: the deterministic CPU sort fix (47714b1) changed the within-pair[0] sort order. The fortran test encoded the previous implementation-defined order; the new sort is by (pair[0], pair[1], shift_x, shift_y, shift_z). Fixes: 1. vesin/CMakeLists.txt: temporarily set BUILD_SHARED_LIBS=OFF around FetchContent_MakeAvailable(highway) so libhwy compiles as a static library and libvesin.so contains its objects (no runtime hwy dep). Define VESIN_HAVE_HIGHWAY so the source can gate Highway-using code for builds that do not have it (the single-file dist build). 2. vesin/src/{verlet.cpp,cluster_pair_search.cpp}: gate the Highway include + Highway-using bodies (simd_filter_deltas, gather_pair_deltas, the inline SIMD sqrt in filter_simd_candidate_blocks, simd_check_distances) with #ifdef VESIN_HAVE_HIGHWAY and provide scalar fallbacks. Same output, slower; required by the single-file build path. 3. vesin/scripts/create-single-cpp.py: include cluster_pair_search.cpp in the concatenated source (linker would otherwise miss build_cluster_grid / build_cluster_pair_candidates / cluster_pair_neighbors). 4. vesin/src/cluster.hpp: move shared inline helpers (is_zero_shift, shift_cartesian) into the public header so the single-build does not end up with two anonymous-namespace static copies redefined in the concatenated TU. 5. Drop the now-duplicate definitions of those helpers from verlet.cpp and cluster_pair_search.cpp. 6. fortran/tests/tests.f90: update expected_pairs / expected_distances / expected_shifts / expected_vectors to the new deterministic order (same content, reordered). Add a comment recording the sort key.
The ubuntu-24.04 / GCC CI image ships GCC 9 / libstdc++ 9, where Catch2 (>=3) hits an enable_if<false>::type instantiation error when trying to stringify std::set<std::tuple<size_t, size_t, int32_t, int32_t, int32_t>> for the failure-diff path of CHECK(cl_pairs == auto_pairs) in vesin/tests/cluster_pair.cpp. The default StringMaker SFINAE chain does not detect the tuple as range-printable on that libstdc++. Provide an explicit StringMaker specialisation for the (i, j, sx, sy, sz) tuple. Short-circuits the SFINAE chain; same diagnostic output everywhere else. Tested locally (22036 assertions in 13 cases) with GCC 13 / libstdc++ 13.
CI lint job used clang-format 20.1.8 and rejected the multi-line string concatenation I wrote in the previous commit. Re-format with the locked clang-format version.
Owner
I don't get this part. What does the threshold apply to? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This stacks SIMD-accelerated CPU spatial search and Verlet recompute on top of the existing CPU Verlet cache (PR #148):
hwy::ScalableTag<double>, ~8 lanes on AVX2 / SVE). Auto-dispatched at N >= 256 for both stateless neighbor lists and the cell-list rebuild step of the Verlet cache.(i, j, shift_idx)aligned to the Highway scalable-tag lane width, with sentinel j-padding for the tail.hwy::GatherIndexNagainst the packed candidate vector, SIMD-batched sqrt of kept distances, and batchedensure_capacity+ uncheckedset_*writes so each block pays one capacity check instead of N.The benchmark driver that produced the numbers below lives in our internal
pixi_envsrepo (vesin_bench/bench_verlet_simd.py) and is not vendored into this PR; CSV/PNG artifacts attached below are the evidence.Design
CLUSTER_PAIR_THRESHOLD = 256selects cluster-pair on CPU. Below that, the existing cell-list path wins because cluster grid construction overhead dominates. The cluster-pair search produces the same neighbor-pair output ascpu_cell_list(verified byvesin/tests/cluster_pair.cpp, 621 lines, orthogonal/triclinic, periodic/non-periodic, half/full list).When
algorithm = Autoand N >= threshold,VerletState::rebuildroutes the candidate search throughcluster_pair_neighborsinstead ofstateless_neighbors. The cluster pair list is retained alongside the candidate cache and Cartesian shifts. Recompute reuses both: the SIMD candidate block filter (filter_simd_candidate_blocks) gathersdx, dy, dzthroughhwy::GatherIndexN, computesd2in 8-lane lanes, reduces to a per-block survivor count, and only then evaluatessqrton the survivors (also lane-batched).Cache invalidation extends
VerletState::set_optionsto clear candidates whenever the algorithm option changes (since cluster-pair and cell-list candidates have different ordering and shift conventions).Performance results (CPU only)
Measured on
rg.cosmolab(Threadripper PRO 3955WX host CPU), density 0.05, cutoff 5.0, skin 1.0, 80 measured MD-like steps, 3 repeats; this branch atc849c87, baseline at upstreamorigin/main8e86d9b; 2026-05-18.vs upstream main, post-Verlet recompute (per-step hot path)
This is the steady-state cost during MD: cache built, then
--stepscalls reuse it. Median ms/step:The Highway-vectorized recompute (
hwy::GatherIndexNover packed candidate blocks + lane-batchedsqrt+ batchedensure_capacity+ uncheckedset_*) is 1.3-1.5x faster than main's scalar recompute across all sizes, on both the cell-list and cluster-pair rebuild paths. The cluster-pair recompute is marginally faster at small N (extra pre-Verlet headroom from the faster rebuild leaves the recompute cache slightly tighter) and converges with cell-list at large N (recompute is sqrt/memory-bound and the candidate cache shape stops mattering).Stateless / rebuild path
This is the cost paid on the first call (or whenever atoms drift > skin/2). Median ms/step:
This PR's cell-list rebuild matches main exactly (no algorithm change on that path). The new SIMD cluster-pair rebuild beats cell-list by 1.47-1.71x and is selected automatically when
algorithm = AutoandN >= CLUSTER_PAIR_THRESHOLD = 256.