nbody_streams

Direct N-body simulator and utilities for collisionless multi-species systems. Designed around a minimal, Pythonic API — while keeping every computationally expensive routine in compiled, parallelized code with no pure-Python loops in the hot path.

Multi-species simulations with any number of particle types (dark matter, stars, gas tracers, black holes, …).
Single entry point: run_simulation with architecture='cpu'|'gpu' and method='direct'|'tree' flags.
GPU direct: hand-written CUDA kernels compiled at runtime via CuPy + nvcc — float4 128-bit vectorized loads, optional Kahan compensated summation, --use_fast_math, and architecture-tuned PTX.
GPU tree: Barnes-Hut GPU tree code (C++ / CUDA shared library) — O(N log N) scaling, per-particle softening, watchdog-guarded KDK integrator.
CPU direct: Numba JIT-compiled, auto-parallelized force kernels (@njit(parallel=True)) using all available CPU cores.
CPU tree / FMM: falcON fast multipole method via pyfalcon — true O(N) scaling for large particle counts.
Optional external potentials via Agama (evaluated in C++ per timestep, added to N-body accelerations).
Intended for collisionless systems with up to ~2M particles (benchmarks depend on hardware).
Fast stream-generation methods (particle spray, restricted N-body) via AGAMA.

Quick start

pip install -r requirements.txt
pip install -e .          # editable (development) install

# Optional extras (agama and pyfalcon require --no-build-isolation)
pip install -e ".[agama]"  --no-build-isolation   # AGAMA for fast_sims & external potentials
pip install -e ".[falcon]" --no-build-isolation   # pyfalcon tree/FMM backend
pip install -e ".[healpy]"                        # HEALPix for mollweide projections
pip install -e ".[all]"    --no-build-isolation   # all of the above

# GPU support — install CuPy matching your CUDA version
pip install cupy-cuda12x   # CUDA 12.x
pip install cupy-cuda13x   # CUDA 13.x

# GPU tree code — build the shared library (requires nvcc in PATH)
cd nbody_streams/tree_gpu
make -j$(nproc)            # produces libtreeGPU.so alongside the source files
cd ../..

Package overview

Subpackage	Description
`nbody_streams.species`	`Species` dataclass, `PerformanceWarning`, and array-building helpers
`nbody_streams.sim`	`run_simulation` — unified multi-species entry point (all backends)
`nbody_streams.fields`	Force and potential kernels: dispatches to GPU (CUDA) or CPU (Numba) backends
`nbody_streams.cuda_kernels`	Hand-written CUDA kernel source strings (float32 `float4`, float64); compiled at runtime via nvcc
`nbody_streams.run`	KDK leapfrog integrators (`run_nbody_gpu`, `run_nbody_cpu`); `make_plummer_sphere` IC generator
`nbody_streams.tree_gpu`	GPU Barnes-Hut tree code: `tree_gravity_gpu`, `TreeGPU`, `run_nbody_gpu_tree`, watchdog; requires `make`
`nbody_streams.io`	`ParticleReader` for HDF5 snapshots, save/load helpers
`nbody_streams.utils`	Profile fitting (Dehnen, Plummer, double-power-law), iterative shape measurement, energy-based unbinding
`nbody_streams.fast_sims`	Fast stream generation: particle spray and restricted N-body (requires AGAMA)
`nbody_streams.agama_helper`	Fit, store, modify, and load Agama Multipole/CylSpline BFE potentials; HDF5 coefficient archives; time-evolving potentials; FIRE helpers
`nbody_streams.coords`	Coordinate transforms, vector field transforms, stream coordinate generation
`nbody_streams.viz`	SPH/histogram surface density (`plot_density`), Mollweide projections, stream sky and evolution plots; `render_surface_density`, `get_smoothing_lengths`

Under the hood

The Python layer handles only orchestration: argument validation, backend dispatch, HDF5 I/O, and snapshot management. Every force evaluation is delegated to compiled, parallelized code:

Path	Implementation	Key details
GPU direct	CUDA kernels compiled at runtime (CuPy + nvcc)	`float4` 128-bit vectorized loads; optional Kahan compensated summation; `--use_fast_math`; arch-tuned PTX via `compute_capability` auto-detection
GPU tree	C++/CUDA shared library (`libtreeGPU.so`); ctypes interface	Barnes-Hut with monopole + quadrupole; per-particle softening (max convention); watchdog-guarded KDK; float32 throughout
CPU direct	Numba `@njit(parallel=True)`	Prange-parallelized inner loop; JIT-compiled on first call, cached thereafter
CPU tree / FMM	falcON via pyfalcon (C++)	True O(N) fast multipole method; independent of the direct-force code path
Integration	KDK symplectic leapfrog	GPU direct: float64 pos/vel; GPU tree: float32 throughout; only force calls use float32/float64 kernels
External potentials	Agama C++ library	`agama.Potential.__call__` invoked once per timestep; result added directly to N-body accelerations

API

Multi-species simulation (`run_simulation`, `Species`)

The recommended entry point for all N-body runs. It handles single- and multi-species setups, dispatches to the correct backend, and returns per-species final phase-space arrays.

`Species` dataclass

from nbody_streams import Species

# Named constructors for common types
dm    = Species.dark(N=10_000, mass=1e6, softening=0.1)   # dark matter
stars = Species.star(N=2_000,  mass=5e4, softening=0.03)  # stellar particles

# Arbitrary species (e.g. a single central black hole)
bh = Species(name='bh', N=1, mass=1e9, softening=0.001)

# Per-particle mass / softening are also accepted
import numpy as np
m_arr = np.linspace(1e5, 2e5, 500)
gas   = Species(name='gas', N=500, mass=m_arr, softening=0.05)

Species parameters:

Field	Type	Description
`name`	`str`	Unique species identifier
`N`	`int`	Number of particles
`mass`	`float` or `ndarray (N,)`	Particle mass(es)
`softening`	`float` or `ndarray (N,)`	Gravitational softening length(s)

`run_simulation`

from nbody_streams import run_simulation, Species, make_plummer_sphere
import numpy as np

result = run_simulation(
    phase_space,          # (N_total, 6) — all species concatenated
    species,              # list[Species] in the same order as phase_space
    time_start=0.0,
    time_end=1.0,
    dt=0.001,
    architecture='gpu',   # 'cpu' or 'gpu'
    method='direct',      # 'direct' (O(N^2)) or 'tree' (GPU: Barnes-Hut; CPU: falcON O(N))
    external_potential=pot,        # optional agama.Potential
    output_dir='./output',
    save_snapshots=True,
    snapshots=100,
    verbose=True,
    debug_energy=False,   # True -> print Q=KE/|PE| and dE/E at each output interval
    # Backend-specific kwargs (passed through **kwargs):
    #   theta=0.6                  tree opening angle           (tree backends)
    #   nthreads=None              CPU thread count             (CPU direct)
    #   precision='float32_kahan'  force computation precision  (GPU direct)
    #   external_update_interval=1 ext. force caching interval  (GPU backends)
    #   nleaf=64, ncrit=64         tree node size params        (GPU tree only)
)
# result is dict[str, ndarray]  keyed by species name
# result['dark'].shape  -> (N_dark,  6)
# result['star'].shape  -> (N_star,  6)

Example — single species: Globular Cluster

from nbody_streams import run_simulation, Species, make_plummer_sphere

# ~10 000 equal-mass stars in a Plummer model
# M_total = 5e5 Msun, scale radius a = 5 pc = 0.005 kpc
N = 10_000
xv, _ = make_plummer_sphere(N, M_total=5e5, a=0.005)

gc = Species.star(N=N, mass=5e5 / N, softening=0.0005)  # 0.5 pc softening

result = run_simulation(
    xv, [gc],
    time_start=0.0,
    time_end=0.1,          # 0.1 kpc/(km/s) ~ 100 Myr
    dt=1e-4,
    architecture='gpu',    # or 'cpu' for quick tests
    method='direct',
    output_dir='./output/gc',
    snapshots=50,
)
# result['star'] -> (10000, 6) final phase-space

Example — multi-species: Dwarf Galaxy (dark matter + stars + gas tracers)

from nbody_streams import run_simulation, Species, make_plummer_sphere
import numpy as np

# --- Dark matter halo: 5 000 particles, extended (a = 1 kpc) ---
N_dm = 5_000
M_dm = 1e9          # Msun
xv_dm, _ = make_plummer_sphere(N_dm, M_total=M_dm, a=1.0)
dm = Species.dark(N=N_dm, mass=M_dm / N_dm, softening=0.1)  # 100 pc softening

# --- Stellar disc: 1 000 particles, more concentrated (a = 0.3 kpc) ---
N_star = 1_000
M_star = 5e7        # Msun
xv_star, _ = make_plummer_sphere(N_star, M_total=M_star, a=0.3)
star = Species.star(N=N_star, mass=M_star / N_star, softening=0.03)

# --- Gas tracers: 500 collisionless test particles (a = 0.5 kpc) ---
N_gas = 500
M_gas = 2e7         # Msun
xv_gas, _ = make_plummer_sphere(N_gas, M_total=M_gas, a=0.5)
gas = Species(name='gas', N=N_gas, mass=M_gas / N_gas, softening=0.05)

# --- Combine all particles ---
xv_all = np.vstack([xv_dm, xv_star, xv_gas])

result = run_simulation(
    xv_all,
    [dm, star, gas],
    time_start=0.0,
    time_end=1.0,          # 1 kpc/(km/s) ~ 1 Gyr
    dt=1e-3,
    architecture='gpu',
    method='direct',
    output_dir='./output/dwarf',
    snapshots=100,
)

# Per-species final coordinates
dm_final   = result['dark']   # (5000, 6)
star_final = result['star']   # (1000, 6)
gas_final  = result['gas']    # (500,  6)

Reading snapshots from a multi-species run:

from nbody_streams import ParticleReader

r    = ParticleReader('./output/dwarf/snapshot.h5')
snap = r.read_snapshot(50)

# New primary API: per-species dict
snap.species['dark']['posvel']   # (5000, 6)
snap.species['star']['posvel']   # (1000, 6)
snap.species['gas']['posvel']    # (500,  6)

# Backward-compat aliases still work
snap.dark['posvel']
snap.star['posvel']

Initial conditions (`make_plummer_sphere`)

from nbody_streams import make_plummer_sphere

xv, mass = make_plummer_sphere(
    N         = 10_000,       # number of particles
    M_total   = 1e5,          # total mass [Msun]
    a         = 0.01,         # Plummer scale radius [kpc]
    G         = 4.300917e-6,  # gravitational constant (kpc, km/s, Msun units)
    seed      = 42,
)
# xv   : (N, 6) float64  — [x, y, z, vx, vy, vz]
# mass : (N,)  float64  — equal-mass particles summing to M_total

Positions and velocities are sampled from the exact Plummer distribution using rejection sampling (Aarseth 1974). The centre of mass and bulk velocity are corrected to zero before returning.

GPU tree-code (`tree_gpu`)

The GPU tree backend must be compiled once before use:

cd nbody_streams/tree_gpu
make -j$(nproc)    # auto-detects GPU architecture via nvidia-smi / CuPy

After building, it is available as a subpackage:

from nbody_streams.tree_gpu import tree_gravity_gpu, TreeGPU, run_nbody_gpu_tree
import cupy as cp

# One-shot: compute forces + potential for N particles
pos  = cp.asarray(xv[:, :3], dtype=cp.float32)
mass = cp.asarray(masses,    dtype=cp.float32)
eps  = cp.asarray(softening, dtype=cp.float32)   # per-particle or scalar

acc, phi = tree_gravity_gpu(pos, mass, eps, G=4.3009e-6, theta=0.6)

# Pre-allocated handle for time-stepping loops (saves ~27 ms of GPU malloc per step)
with TreeGPU(N, eps=0.05, theta=0.6) as tree:
    for step in range(n_steps):
        acc, phi = tree_gravity_gpu(pos, mass, eps, tree=tree)
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc, phi = tree_gravity_gpu(pos, mass, eps, tree=tree)
        vel += 0.5 * dt * acc

# Full integration — same interface as run_simulation with method='tree'
final_xv = run_nbody_gpu_tree(
    phase_space, masses, time_start=0.0, time_end=5.0, dt=1e-4,
    softening=eps_arr, theta=0.6, step_timeout_s=60.0,
    output_dir="./output", snapshots=500,
)

# Or via the unified entry point:
result = run_simulation(
    phase_space, species,
    time_start=0.0, time_end=5.0, dt=1e-4,
    architecture='gpu', method='tree', theta=0.6,
)

The _StepWatchdog inside run_nbody_gpu_tree fires a KeyboardInterrupt in the main thread if any single integration step exceeds step_timeout_s seconds, protecting against deadlocked CUDA kernels.

Low-level API — Direct N-body (`fields`, `run`)

run_simulation covers the common case. For full control — custom softening kernels, alternative float precision, per-particle softening, or external_update_interval on the CPU — call the backend functions directly:

from nbody_streams import (
    compute_nbody_forces_gpu,
    compute_nbody_forces_cpu,
    compute_nbody_potential_gpu,
    run_nbody_gpu,
    run_nbody_cpu,
    make_plummer_sphere,
)

# Compute accelerations (GPU, float32 with Kahan correction)
acc = compute_nbody_forces_gpu(
    pos, mass, softening=0.01,
    precision='float32_kahan', kernel='spline',
)

# Compute potential energy per particle
phi = compute_nbody_potential_gpu(
    pos, mass, softening=0.01,
    precision='float32_kahan', kernel='spline',
)

# Full N-body integration (GPU direct) — full kernel / precision control
run_nbody_gpu(
    phase_space,             # (N, 6) array [x, y, z, vx, vy, vz]
    masses,
    time_start=0.0,
    time_end=1.0,
    dt=0.001,
    softening=0.01,
    precision='float64',     # 'float32' | 'float64' | 'float32_kahan'
    kernel='dehnen_k2',      # 'newtonian' | 'plummer' | 'dehnen_k1' | 'dehnen_k2' | 'spline'
    external_potential=pot,
    external_update_interval=5,  # cache ext. forces for 5 steps
    force_extra=None,            # optional: callable(pos, vel, masses, t) -> (N,3)
                                 # e.g. a Chandrasekhar DF closure — user handles GPU/CPU
    debug_energy=True,       # print Q and dE/E at each output interval
    snapshots=10,
    output_dir='./output',
)

# Full N-body integration (CPU) — tree with custom kernel / theta
run_nbody_cpu(
    phase_space, masses,
    time_start=0.0, time_end=1.0, dt=0.001,
    softening=0.02,
    method='tree',
    kernel='dehnen_k2',      # or integer 2; pyfalcon: 0=plummer 1=dehnen_k1 2=dehnen_k2
    theta=0.5,
    nthreads=32,
    debug_energy=True,       # phi returned for free by falcON; zero overhead
    output_dir='./output',
)

Snapshots (`io`)

from nbody_streams import ParticleReader

r = ParticleReader("path/to/sim.*.h5", verbose=True)
snap = r.read_snapshot(10)  # returns dict with pos, vel, mass, time

Fast stream generation (`fast_sims`)

All methods require AGAMA (pip install agama).

Particle spray

import agama
agama.setUnits(mass=1, length=1, velocity=1)

from nbody_streams.fast_sims import create_particle_spray_stream

pot = agama.Potential(str(path_to_potential_ini))

result = create_particle_spray_stream(
    pot_host=pot,
    initmass=20_000,                    # Msun
    sat_cen_present=[-10.9, -3.4, -22.2, 70.4, 188.6, 95.8],  # [kpc, km/s]
    scaleradius=0.01,                   # kpc
    num_particles=10_000,
    time_total=3.0,                     # Gyr lookback
    time_end=13.78,                     # Gyr present epoch
    save_rate=5,                        # number of output snapshots
    prog_pot_kind='Plummer',            # or 'King', 'Plummer_withRcut'
)
# result['times']   — (5,) save times
# result['prog_xv'] — (5, 6) progenitor trajectory
# result['part_xv'] — (N, 5, 6) stream particles at each snapshot

Custom stripping times (e.g. pericenter-weighted / episodic release):

N = num_particles // 2 + 1
t_strip = my_pericenter_times(N)  # length N, within [time_end - time_total, time_end]

result = create_particle_spray_stream(
    ...,
    time_stripping=t_strip,  # duplicates are handled automatically
)

Fardal+2015 IC method:

from nbody_streams.fast_sims import (
    create_particle_spray_stream,
    create_ic_particle_spray_fardal2015,
)

result = create_particle_spray_stream(
    ...,
    create_ic_method=create_ic_particle_spray_fardal2015,
)

Adding a subhalo perturber:

result = create_particle_spray_stream(
    ...,
    add_perturber={
        'mass': 1e8,              # Msun
        'scaleRadius': 0.5,       # kpc (NFW)
        'w_subhalo_impact': [x, y, z, vx, vy, vz],  # impact phase-space
        'time_impact': 1.5,       # Gyr ago
    },
)

Restricted N-body

from nbody_streams.fast_sims import run_restricted_nbody

result = run_restricted_nbody(
    pot_host=pot,
    initmass=20_000,
    sat_cen_present=[-10.9, -3.4, -22.2, 70.4, 188.6, 95.8],
    scaleradius=0.01,
    num_particles=10_000,
    time_total=3.0,
    time_end=13.78,
    step_size=50,           # ODE steps per potential update
    save_rate=5,
    trajsize_each_step=5,
    prog_pot_kind='Plummer',
    dynFric=True,           # optional dynamical friction
)
# result['times']      — save times
# result['prog_xv']    — progenitor trajectory
# result['part_xv']    — particle phase-space
# result['bound_mass'] — bound mass at each save time

Pre-existing particles (skip rewinding/sampling):

result = run_restricted_nbody(
    pot_host=pot,
    initmass=20_000,
    sat_cen_present=com_xv,
    xv_init=my_particles,   # shape (N, 6)
    time_total=0.5,
    time_end=13.78,
    step_size=50,
    save_rate=3,
    trajsize_each_step=5,
)

AGAMA potential helper (`agama_helper`)

Requires AGAMA. See docs/agama_helper.md for the full reference.

from nbody_streams import agama_helper as ah

# --- Read coefficient dataclasses (file, HDF5, or raw string — same call) ---
mc = ah.read_coefs("potential/090.dark.none_8.coef_mult")    # Multipole
cc = ah.read_coefs("MW_cylsp.h5", group_name="snap_090")     # CylSpline from HDF5

# Inspect
print(mc.lmax, mc.l_values)        # e.g.  8  [0, 2, 4, 6, 8]
print(mc.total_power(2))           # quadrupole power

# Filter harmonics (int = all m for that l; tuple = specific (l,m))
mc_axi = mc.zeroed([0, 2, 4])             # keep l=0,2,4; all m
mc_sel = mc.zeroed([(0,0), (2,0)])        # keep specific (l,m) pairs

# --- Load Agama potential (from any source) ---
pot = ah.load_agama_potential("potential/090.dark.none_8.coef_mult")
pot = ah.load_agama_potential("MW_mult.h5", group_name="snap_090")
pot = ah.load_agama_potential(mc_axi)                  # directly from dataclass
pot = ah.load_agama_potential("MW_mult.h5",
                               group_name="snap_090",
                               keep_lm_mult=[0, 2])    # in-memory filtering

# --- Time-evolving potential ---
# From HDF5 (times embedded at write time)
pot_ev = ah.load_agama_evolving_potential("MW_mult.h5")

# From a native Agama .ini file
pot_ev = ah.load_agama_evolving_potential("potential/MW_mult.ini")

# With per-snapshot filtering
pot_ev = ah.load_agama_evolving_potential("MW_mult.h5", keep_lm_mult=[0, 2])

# --- Pack text coefficient files into HDF5 ---
import numpy as np
ah.write_snapshot_coefs_to_h5(
    snapshot_ids=range(90, 101),
    coef_file_patterns=["potential/{snap:03d}.dark.none_8.coef_mult"],
    h5_output_paths=["MW_mult.h5"],
    times=np.linspace(6.0, 14.0, 11),    # embed times for load_agama_evolving_potential
)

Analysis utilities (`utils`)

from nbody_streams.utils import (
    empirical_density_profile,
    empirical_circular_velocity_profile,
    empirical_velocity_dispersion_profile,
    empirical_velocity_rms_profile,
    empirical_velocity_anisotropy_profile,
    fit_dehnen_profile,
    fit_plummer_profile,
    fit_double_spheroid_profile,
    fit_iterative_ellipsoid,
    find_center,
    iterative_unbinding,
)

# Radial profiles
r_bins, rho = empirical_density_profile(pos, mass, bins=50)
r_bins, v_circ = empirical_circular_velocity_profile(pos, mass)
r_bins, sigma = empirical_velocity_dispersion_profile(pos, vel)

# Profile fitting
gamma, a, M, r_fit, rho_fit = fit_dehnen_profile(pos, mass)
a, M, r_fit, rho_fit = fit_plummer_profile(pos, mass)

# Iterative shape measurement
axes, axis_ratios, eigvecs = fit_iterative_ellipsoid(pos, mass)

# Centre finding (default method="density_peak" via gravitational potential minimum)
cen = find_center(pos, mass)

# Centre finding with velocity centering
cen_pos, cen_vel = find_center(pos, mass, vel=vel, return_velocity=True, vel_aperture=5.0)

# Energy-based iterative unbinding
bound_mask = iterative_unbinding(
    pos, vel, mass,
    center_position=cen,
)

Note: find_center_position is a deprecated alias for find_center. compute_iterative_boundness is also deprecated -- use iterative_unbinding instead.

Coordinates (`coords`)

from nbody_streams.coords import (
    convert_coords,
    convert_vectors,
    convert_to_vel_los,
    generate_stream_coords,
    get_observed_stream_coords,
)

# Coordinate transforms: 'cart' <-> 'sph' <-> 'cyl'
sph = convert_coords(pos, 'cart', 'sph')                   # (N, 3) -> (r, theta, phi)
cyl = convert_coords(pos, 'cart', 'cyl')                   # (N, 3) -> (R, phi, z)

# Vector field transforms (position + velocity together)
pos_sph, vel_sph = convert_vectors(pos, vel, 'cart', 'sph')

# Line-of-sight velocity
v_los = convert_to_vel_los(xv)                              # from galactocentric phase-space

# Stream coordinates (phi1, phi2) from galactocentric phase-space
phi1, phi2 = generate_stream_coords(xv_stream, xv_prog=xv_prog)

# Full observables: stream coords + distance, PM, v_los
phi1, phi2, dist, pm, v_los = get_observed_stream_coords(
    xv_stream, xv_prog=xv_prog,
)

Visualization (`viz`)

from nbody_streams.viz import (
    plot_density,
    plot_mollweide,
    plot_stream_sky,
    plot_stream_evolution,
    render_surface_density,   # low-level SPH renderer
    get_smoothing_lengths,    # per-particle SPH smoothing lengths
)

# --- Projected density map ---

# From raw arrays — SPH (default, GPU-accelerated when CuPy is available)
plot_density(pos=pos, mass=mass, gridsize=100.0)

# From a ParticleReader snapshot — extract dark matter automatically
snap = reader.read_snapshot(50)
plot_density(snap=snap, spec='dark', gridsize=100.0, vmin=4.0, vmax=9.0)

# Choose rendering method
plot_density(pos=pos, mass=mass, method='sph')          # SPH kernel (default)
plot_density(pos=pos, mass=mass, method='gauss_smooth', smooth_sigma=1.5)
plot_density(pos=pos, mass=mass, method='histogram')

# Return the raw (pre-log10) density array for further analysis
im, dens = plot_density(pos=pos, mass=mass, return_dens=True)

# Slice along Z and render X-Y surface density
plot_density(pos=pos, mass=mass, xval='x', yval='y',
             slice_width=2.0, density_kind='surface')

# --- Low-level SPH API ---

# Full SPH surface-density grid (GPU -> CPU fallback)
grid, bounds = render_surface_density(x, y, mass, resolution=512, gridsize=120.0)

# Smoothing lengths only (useful for custom renderers)
h = get_smoothing_lengths(np.column_stack([x, y]), k_neighbors=32)

# --- Other viz ---

# Mollweide sky projection (requires healpy)
plot_mollweide(pos, weights=masses, initial_nside=60)

# Stream in sky coordinates (alpha/delta and phi1/phi2)
plot_stream_sky(xv_stream, xv_prog=xv_prog)

# Stream evolution over time (from run_simulation output)
plot_stream_evolution(result['prog_xv'], times=result['times'], part_xv=result['part_xv'])

`plot_density` method options

`method`	Description	Best for
`'sph'`	SPH cubic-spline kernel splatting (GPU-accelerated)	Physics-accurate density maps (default)
`'gauss_smooth'`	2-D histogram + Gaussian filter	Quick smooth previews
`'histogram'`	Raw 2-D histogram / pixel area	Counting statistics, debugging

Caveats and known limitations

Dynamical friction with external potentials

When external_potential is supplied to any run_* function or run_simulation, the host is modelled as a smooth, fixed background field. There is no back-reaction on the host and no granularity-driven scattering. For low-mass satellites this is an excellent approximation; for LMC-class objects it is not.

Chandrasekhar DF is now implemented via the dynamical_friction=True flag in run_simulation.

import agama
agama.setUnits(mass=1, length=1, velocity=1)
pot = agama.Potential(type='NFW', mass=1e12, scaleRadius=15.0,
                      outerCutoffRadius=300.0)

result = run_simulation(
    xv, [dm],
    time_start=0.0, time_end=5.0, dt=1e-3,
    architecture='gpu', method='direct',
    external_potential=pot,
    dynamical_friction=True,
    df_M_sat=5e9,
    df_coulomb_mode='variable',
)

See docs/dynamical_friction.md for the full reference including all df_* kwargs, the Coulomb logarithm modes, and core-stalling suppression.

For advanced users the internal closure is available directly:

from nbody_streams._chandrasekhar import make_df_force_extra

df_fn = make_df_force_extra(pot, M_sat=5e9, t_start=0.0, t_end=5.0)
final = run_nbody_gpu(xv, masses, ..., force_extra=df_fn)

The agama_helper subpackage is accessible as nb.agama_helper (or from nbody_streams import agama_helper). See docs/agama_helper.md for the full reference.

When DF is safe to ignore:

The Chandrasekhar inspiral timescale at orbital radius r scales as

t_df  ~  1.17 * (M_host / M_sat) * (r / V_c) / ln(Lambda)

M_sat (Msun)	Example	t_df at 50 kpc in MW halo	Safe to ignore?
< 1e8	Globular cluster	> 2000 Gyr	Yes, always
1e8 – 1e9	Ultra-faint dwarf	200 – 2000 Gyr	Yes
1e9 – 1e10	SMC-class	20 – 200 Gyr	Marginal over few-Gyr runs
> 1e10	LMC-class	< 20 Gyr	No — DF is important

For stream progenitors with M_tot < ~1e9 Msun, or for simulations shorter than a few Gyr, neglecting DF introduces negligible orbit evolution error. run_simulation emits a PerformanceWarning when the total satellite mass exceeds 1e10 M_sun and external_potential is set but dynamical_friction is False.

Performance

GPU benchmarks (N = 10,240 particles, RTX 3080, --use_fast_math + arch-tuned):

Kernel	Time/Step	Throughput	Energy Conservation
Float32 + float4	1.5 ms	~100 Gint/s	< 0.001%
Float32_kahan + float4	1.7 ms	~100 Gint/s	< 0.001%
Float64	~20 ms	~4.5 Gint/s	< 0.001%

Approaching the theoretical FLOP ceiling (~25 FLOPs per interaction).

Which solver should I use?

The plot below compares time-per-particle for direct (CPU & GPU) and FMM/tree solvers across particle counts.

Recommendations:

Particle count	Recommended method	Why
N < 500	CPU direct (Numba)	Lowest latency — no GPU kernel-launch overhead.
1K – 100K	GPU direct (consumer GPU)	~10x faster than CPU direct; exact O(N²) forces with float32 Kahan precision.
100K – 500K	GPU direct (consumer GPU) ≈ FMM/tree	A consumer GPU (e.g. RTX 3070) matches falcON/FMM throughput up to ~500K particles.
500K – 2M	GPU direct (datacenter GPU) ≈ FMM/tree	An H200-class GPU keeps direct-force time competitive with tree codes up to ~2M particles.
N > 2M	Tree / FMM (falcON)	O(N) FMM scaling wins; direct O(N²) becomes prohibitive on any current GPU.

Float32 precision analysis (Newton's 3rd Law at small scales)

Summary

Float32 kernels are mathematically correct but suffer from precision loss when particle positions are at small scales (< 0.1 length units). This manifests as violation of Newton's 3rd Law at the ~1% level for scales around 0.01.

Important: Even without scaling, the "asymmetric" float32 forces do not cause energy drift over long integrations because:

Integration is performed in float64 (positions, velocities)
Force errors are small and largely random (not systematic)
Symplectic integrators are robust to small force errors

Scale-dependent error

Scale Factor	Typical \|r\|	Float32 Net Force	Status
0.1	10	6.4e-9	Excellent
1	1.0	1.4e-6	Good
10	0.1	1.2e-4	Marginal
100	0.01	1.0e-2	Broken

Solutions

Option 1 — Unit scaling (recommended): Rescale positions to O(1) before computing forces. Gives perfect Newton's 3rd Law + full float32 speed.

scale = prog_scaleradius  # e.g. 0.01 kpc
pos_scaled = pos / scale
acc_scaled = compute_nbody_forces_gpu(pos_scaled.astype(np.float32), ...)
acc_physical = acc_scaled / (scale**2)

Option 2 — Float32 as-is: Simplest code, energy still conserved to < 0.001%. Newton's 3rd Law violated by ~1% (cosmetic only).

Option 3 — Float64: Accurate at all scales, ~10x slower. Best for small N (< 5000).

Comparison: direct N-body vs tree methods

Method	Force Accuracy	Energy Conservation	Speed
Direct N^2 (float32)	~1% asymmetry	< 0.001% drift	1.5 ms/step
Tree (float32, theta=0.5)	1-5% force errors	~0.01-0.1% drift	5 ms/step
Tree (float64, theta=0.5)	1-5% force errors	~0.001% drift	10 ms/step

Direct N-body with float32 gives better energy conservation than tree methods because tree approximation errors are systematic (monopole bias), while float32 rounding errors are random and cancel statistically.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
docs		docs
examples		examples
legacy		legacy
nbody_streams		nbody_streams
plots		plots
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg

Folders and files

Latest commit

History

Repository files navigation

nbody_streams

Quick start

Package overview

Under the hood

API

Multi-species simulation (run_simulation, Species)

Species dataclass

run_simulation

Initial conditions (make_plummer_sphere)

GPU tree-code (tree_gpu)

Low-level API — Direct N-body (fields, run)

Snapshots (io)

Fast stream generation (fast_sims)

Particle spray

Restricted N-body

AGAMA potential helper (agama_helper)

Analysis utilities (utils)

Coordinates (coords)

Visualization (viz)

plot_density method options

Caveats and known limitations

Dynamical friction with external potentials

Performance

Which solver should I use?

Summary

Scale-dependent error

Solutions

Comparison: direct N-body vs tree methods

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Multi-species simulation (`run_simulation`, `Species`)

`Species` dataclass

`run_simulation`

Initial conditions (`make_plummer_sphere`)

GPU tree-code (`tree_gpu`)

Low-level API — Direct N-body (`fields`, `run`)

Snapshots (`io`)

Fast stream generation (`fast_sims`)

AGAMA potential helper (`agama_helper`)

Analysis utilities (`utils`)

Coordinates (`coords`)

Visualization (`viz`)

`plot_density` method options

Packages