Write once, run anywhere.
ezpz makes distributed PyTorch launches portable across any supported
hardware {NVIDIA, AMD, Intel, MPS, CPU} with zero code changes.
- Multi-hardware β automatic device detection and backend selection (CUDA/NCCL, XPU/CCL, MPS, CPU/Gloo)
- Zero code changes β same script runs on a laptop, a single GPU, or a thousand-node supercomputer
- HPC integration β native PBS and SLURM support with automatic hostfile discovery and rank assignment
- Metric tracking β built-in
Historyclass for recording, plotting, and saving training metrics - CLI tools β
ezpz launch,ezpz test,ezpz doctorfor launching jobs, smoke-testing, and diagnostics
uv pip install git+https://github.com/saforem2/ezpzimport torch
import ezpz
rank = ezpz.setup_torch() # auto-detects device + backend
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 10).to(device)
model = ezpz.wrap_model(model) # FSDP (default)
# Multi-dim parallelism (TP/PP/CP) on XPU? Use ezpz.init_device_mesh_safe
# instead of torch's init_device_mesh β works around xccl's missing
# split_group on Aurora/Sunspot. See https://ezpz.cool/troubleshooting/.# Same command everywhere -- Mac laptop, NVIDIA cluster, Intel Aurora:
ezpz launch python3 train.pyFor a side-by-side diff against the equivalent raw-torch boilerplate, see the API Cheat Sheet.
ezpz launch python3 train.py # launch distributed training
ezpz submit -N 2 -q debug -- python3 train.py # submit batch job to PBS/SLURM
ezpz test # smoke-test your setup
ezpz benchmark # run + compare example benchmarks
ezpz doctor # diagnose environment issuesCompared to the alternatives:
- vs raw
torchrun/mpirun/srun: one launcher that detects your scheduler, builds the right command, and works on a laptop too. - vs
accelerate: lower surface area, no config files, designed for HPC schedulers from the ground up rather than retrofitted. - vs
DeepSpeed: not an alternative βezpzwraps your distributed init so you can still use DeepSpeed (or anything else) underneath.
See the full comparison for details.
Full documentation is available at ezpz.cool.
Useful entry points:
- πββοΈ Quickstart β install β script β launch in 5 minutes
- π Distributed Training Tutorial β progressive hello-world β FSDP+TP
- π³ Recipes β copy-pasteable patterns (checkpointing, gradient accumulation, MFU tracking)
- π§ Troubleshooting β XPU FSDP2 hangs, NCCL/CCL errors, scheduler issues
- π Examples β runnable end-to-end (FSDP, ViT, Diffusion, HF Trainer)