Skip to content

saforem2/ezpz

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1,941 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ‹ ezpz

Write once, run anywhere.

ezpz makes distributed PyTorch launches portable across any supported hardware {NVIDIA, AMD, Intel, MPS, CPU} with zero code changes.

Features

  • Multi-hardware β€” automatic device detection and backend selection (CUDA/NCCL, XPU/CCL, MPS, CPU/Gloo)
  • Zero code changes β€” same script runs on a laptop, a single GPU, or a thousand-node supercomputer
  • HPC integration β€” native PBS and SLURM support with automatic hostfile discovery and rank assignment
  • Metric tracking β€” built-in History class for recording, plotting, and saving training metrics
  • CLI tools β€” ezpz launch, ezpz test, ezpz doctor for launching jobs, smoke-testing, and diagnostics

Quick Install

uv pip install git+https://github.com/saforem2/ezpz

Quick Start

import torch
import ezpz

rank = ezpz.setup_torch()           # auto-detects device + backend
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 10).to(device)
model = ezpz.wrap_model(model)       # FSDP (default)

# Multi-dim parallelism (TP/PP/CP) on XPU? Use ezpz.init_device_mesh_safe
# instead of torch's init_device_mesh β€” works around xccl's missing
# split_group on Aurora/Sunspot. See https://ezpz.cool/troubleshooting/.
# Same command everywhere -- Mac laptop, NVIDIA cluster, Intel Aurora:
ezpz launch python3 train.py

For a side-by-side diff against the equivalent raw-torch boilerplate, see the API Cheat Sheet.

CLI

ezpz launch python3 train.py    # launch distributed training
ezpz submit -N 2 -q debug -- python3 train.py   # submit batch job to PBS/SLURM
ezpz test                       # smoke-test your setup
ezpz benchmark                  # run + compare example benchmarks
ezpz doctor                     # diagnose environment issues

Why ezpz?

Compared to the alternatives:

  • vs raw torchrun / mpirun / srun: one launcher that detects your scheduler, builds the right command, and works on a laptop too.
  • vs accelerate: lower surface area, no config files, designed for HPC schedulers from the ground up rather than retrofitted.
  • vs DeepSpeed: not an alternative β€” ezpz wraps your distributed init so you can still use DeepSpeed (or anything else) underneath.

See the full comparison for details.

Documentation

Full documentation is available at ezpz.cool.

Useful entry points:

  • πŸƒβ€β™‚οΈ Quickstart β€” install β†’ script β†’ launch in 5 minutes
  • πŸŽ“ Distributed Training Tutorial β€” progressive hello-world β†’ FSDP+TP
  • 🍳 Recipes β€” copy-pasteable patterns (checkpointing, gradient accumulation, MFU tracking)
  • πŸ”§ Troubleshooting β€” XPU FSDP2 hangs, NCCL/CCL errors, scheduler issues
  • πŸ“ Examples β€” runnable end-to-end (FSDP, ViT, Diffusion, HF Trainer)