Projected Polystochastic Diffusion for
Self-Supervised Multi-View 3D Human Pose Estimation
Tony Danjun Wang · Tolga Birdal · Nassir Navab · Lennart Bastian
Technical University of Munich · Munich Center for Machine Learning · Imperial College London
Diffusion over the space of polystochastic tensors progressively resolves noisy detections into clean, consistent multi-view 3D associations.
DisPOSE recasts the discrete cross-view person-assignment problem as a generative diffusion process on the space of polystochastic tensors, with differentiable Sinkhorn projections at every reverse step keeping each iterate on the feasible set. A hypergraph convolutional decoder then regresses full 3D skeletons across joints and camera views. Training needs no 3D ground truth, only 2D pseudo-labels from an off-the-shelf detector.
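To make the projection step concrete, here is a minimal pure-Python sketch of Sinkhorn-style normalization on a small order-3 tensor. It is an illustration under assumptions, not the DisPOSE implementation: it takes "polystochastic" to mean that every axis-aligned fiber sums to 1, uses a dense cubic tensor with strictly positive entries, and ignores differentiability, batching, and people who are missing from some views. The names `normalize_fibers`, `sinkhorn_project`, and `max_fiber_error` are hypothetical.

```python
import random


def normalize_fibers(t, axis):
    """Rescale t in place so every 1D fiber along `axis` sums to 1."""
    n = len(t)
    for a in range(n):
        for b in range(n):
            # A fiber fixes two coordinates (a, b) and varies the third.
            if axis == 0:
                cells = [(k, a, b) for k in range(n)]
            elif axis == 1:
                cells = [(a, k, b) for k in range(n)]
            else:
                cells = [(a, b, k) for k in range(n)]
            total = sum(t[i][j][k] for i, j, k in cells)
            for i, j, k in cells:
                t[i][j][k] /= total


def sinkhorn_project(t, n_iters=200):
    """Cycle through the three axes, normalizing the fibers of each in turn."""
    for _ in range(n_iters):
        for axis in range(3):
            normalize_fibers(t, axis)
    return t


def max_fiber_error(t):
    """Largest deviation of any fiber sum from 1."""
    n = len(t)
    worst = 0.0
    for a in range(n):
        for b in range(n):
            sums = (
                sum(t[k][a][b] for k in range(n)),
                sum(t[a][k][b] for k in range(n)),
                sum(t[a][b][k] for k in range(n)),
            )
            worst = max(worst, *(abs(s - 1.0) for s in sums))
    return worst


rng = random.Random(0)
n = 3
# Strictly positive "noisy" tensor standing in for an unprojected iterate.
noisy = [[[rng.uniform(0.5, 1.5) for _ in range(n)] for _ in range(n)] for _ in range(n)]
projected = sinkhorn_project(noisy)
print(f"max fiber-sum error: {max_fiber_error(projected):.2e}")
```

Each pass only rescales entries, so positive inputs stay positive. Repeating the passes drives all three families of fiber sums towards 1 at once.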
Important
Tested with Python 3.11 + CUDA 12.8 + PyTorch 2.7.1.
# Package manager: uv (https://docs.astral.sh/uv/)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv lock && uv sync
# Point the build system at the right CUDA toolkit
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:$PATH
unset TORCH_CUDA_ARCH_LIST
# Compile the deformable-attention CUDA op used by the hypergraph convolutional decoder
env -u TORCH_CUDA_ARCH_LIST .venv/bin/python src/external/ops/setup.py clean --all build install
# Install torch-scatter (scatter ops used by the hypergraph convolutions).
# Build for the GPU on this machine. If you build on a node without a visible
# GPU, set TORCH_CUDA_ARCH_LIST by hand instead (e.g. "8.0+PTX" for A100,
# "9.0+PTX" for H100, "12.0+PTX" for Blackwell).
ARCH=$(.venv/bin/python -c "import torch; print('{}.{}'.format(*torch.cuda.get_device_capability()))")
TORCH_CUDA_ARCH_LIST="${ARCH}+PTX" uv pip install --python .venv/bin/python --no-cache-dir --no-build-isolation --force-reinstall --no-binary=:all: torch-scatter==2.1.2
Download the pretrained weights (four models, ~630 MB) and unzip them.
# Pretrained weights -> checkpoints/pose/<dataset>.pt
uvx gdown 1KjDpvcUIbWdfx1sC6jrzLf2IbtXsv2mk -O dispose_pose_weights.zip
unzip dispose_pose_weights.zip -d checkpoints/
# Evaluate (the checkpoint already contains the backbone, so backbone loading is disabled)
python src/eval.py dataset=shelf task=pose \
ckpt_path=checkpoints/pose/shelf.pt \
model.backbone.ckpt_path=null
dataset may be panoptic, shelf, campus, or mm_or; point ckpt_path at the matching checkpoints/pose/<dataset>.pt.
Evaluation needs the corresponding dataset on disk (see Data) and a CUDA GPU.
Since publication, a fix to multi-GPU training raises Panoptic AP25 by ~10%.
Note
Training needs 2D pseudo-labels; see Training.
DisPOSE is evaluated on CMU Panoptic, Shelf, Campus, and MM-OR Pose, a benchmark we introduce that adds 3D pose annotations to the operating-room scenes of MM-OR. Download CMU Panoptic with the script below (the other datasets come from their respective hosts):
bash preparation/data/panoptic/get_data.sh
Note
MM-OR Pose 3D annotations. Our 3D-pose ground truth for the MM-OR evaluation
sequences (004_PKA, 011_TKA, 036_PKA) is released separately. Get the base
MM-OR dataset from here, then overlay our annotations onto it:
uvx gdown 1s28d2BwULZHxlCVbEoF_H_q1i7sgizAL -O dispose_mm_or_annotations.zip
unzip dispose_mm_or_annotations.zip -d data/ # -> data/mm_or/<sequence>/poses/<frame>.json
Each poses/<frame>.json holds a list of per-person entries, [{"id", "keypoints_xyzs" (15×4), "label"}, ...]; the MM-OR pose loader reads them from data/mm_or/<sequence>/poses/.
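As a quick sanity check on the annotation format described above, the following sketch loads one frame file and verifies that every entry has the three keys and a 15×4 keypoint array. It is not the repo's loader: `load_poses` is a hypothetical helper, the demo frame is synthetic, and the meaning of the fourth keypoint column is left open because it is not specified here.

```python
import json
import tempfile
from pathlib import Path

NUM_JOINTS = 15       # rows per skeleton, as stated above
VALUES_PER_JOINT = 4  # columns per joint; the 4th value's meaning is not specified here


def load_poses(path):
    """Load one poses/<frame>.json file and check each entry's shape."""
    people = json.loads(Path(path).read_text())
    for person in people:
        missing = {"id", "keypoints_xyzs", "label"} - person.keys()
        if missing:
            raise ValueError(f"{path}: entry is missing {sorted(missing)}")
        kps = person["keypoints_xyzs"]
        if len(kps) != NUM_JOINTS or any(len(row) != VALUES_PER_JOINT for row in kps):
            raise ValueError(f"{path}: keypoints_xyzs is not {NUM_JOINTS}x{VALUES_PER_JOINT}")
    return people


# Round-trip a synthetic frame (made-up values and file name, not real annotations).
example = [{"id": 0, "keypoints_xyzs": [[0.0, 0.0, 0.0, 1.0]] * NUM_JOINTS, "label": "person"}]
with tempfile.TemporaryDirectory() as tmp:
    frame = Path(tmp) / "000000.json"
    frame.write_text(json.dumps(example))
    people = load_poses(frame)
print(len(people), "person(s) loaded")
```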
Expected directory layout
data/
├── panoptic/
│ ├── <sequence>/
│ │ ├── hdImgs/
│ │ │ ├── 00_03/
│ │ │ ├── 00_06/
│ │ │ ├── 00_12/
│ │ │ ├── 00_13/
│ │ │ └── 00_23/
│ │ ├── hdPose3d_stage1_coco19/
│ │ └── calibration_<sequence>.json
│ └── ...
├── shelf/Camera{0..4}/
├── campus/Camera{0..2}/
└── mm_or/<take>/
├── colorimage/ # camera{01..05}_colorimage-*.jpg
├── camera{01..05}.json # calibration
└── poses/ # our 3D-pose annotations (eval sequences)
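Before launching evaluation, it can help to confirm that a sequence matches the layout above. The sketch below checks only the Panoptic branch of the tree. `missing_panoptic_entries` is a hypothetical helper, not part of the repo, and the demo runs against a throwaway directory with a placeholder sequence name.

```python
import tempfile
from pathlib import Path

PANOPTIC_CAMERAS = ["00_03", "00_06", "00_12", "00_13", "00_23"]


def missing_panoptic_entries(data_root, sequence):
    """Return the expected Panoptic paths for `sequence` that do not exist."""
    seq_dir = Path(data_root) / "panoptic" / sequence
    expected = [seq_dir / "hdImgs" / cam for cam in PANOPTIC_CAMERAS]
    expected.append(seq_dir / "hdPose3d_stage1_coco19")
    expected.append(seq_dir / f"calibration_{sequence}.json")
    return [p for p in expected if not p.exists()]


# Demo on a temporary directory; "demo_sequence" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    seq = Path(tmp) / "panoptic" / "demo_sequence"
    for cam in PANOPTIC_CAMERAS[:4]:  # deliberately leave out 00_23
        (seq / "hdImgs" / cam).mkdir(parents=True)
    (seq / "hdPose3d_stage1_coco19").mkdir()
    (seq / "calibration_demo_sequence.json").write_text("{}")
    missing = [p.relative_to(tmp).as_posix() for p in missing_panoptic_entries(tmp, "demo_sequence")]
print(missing)
```

On a real checkout you would call `missing_panoptic_entries("data", "<sequence>")` and expect an empty list.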
Runs are configured by two axes, dataset and task, composed by Hydra:
python src/train.py dataset=<dataset> task=<task> task_name=<run_name>
dataset ∈ {panoptic, shelf, campus, mm_or} · task ∈ {backbone, pose}.
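If you script several runs, a small wrapper can catch a mistyped axis value before Hydra does. The sketch below only assembles the command line shown above. `train_command` is a hypothetical helper, and the accepted values are copied from the two sets listed here rather than read from the repo's configs.

```python
import shlex

DATASETS = ("panoptic", "shelf", "campus", "mm_or")
TASKS = ("backbone", "pose")


def train_command(dataset, task, run_name, *overrides):
    """Build the argv list for one training run, rejecting unknown axis values."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return ["python", "src/train.py", f"dataset={dataset}", f"task={task}",
            f"task_name={run_name}", *overrides]


cmd = train_command("shelf", "pose", "shelf_pose", "trainer.devices=2")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.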
Training runs on weak 2D pseudo-labels (not needed for evaluation). Download our precomputed labels, or regenerate them yourself:
# Option A: download precomputed pseudo-labels (~140 MB) -> data/preparation/<dataset>/<timestamp>_train/
uvx gdown 1O8Un36ROUUGnLhYwbZ0OZ0TySj0YHaFR -O dispose_pseudo_labels.zip
unzip dispose_pseudo_labels.zip -d data/
# Option B: regenerate from scratch (RT-DETR + ViTPose -> cross-view ILP association -> anatomical rectification).
# One-time setup: install the COMPOSE association library (the `prep` extra).
GIT_LFS_SKIP_SMUDGE=1 uv pip install --no-deps "compose @ git+https://github.com/wngTn/COMPOSE.git"
# --dataset: panoptic|shelf|campus|mm_or. The frame interval defaults per dataset
# (panoptic and mm_or: 3; shelf and campus: 1); override it with --interval N.
python preparation/generate.py --dataset panoptic
Regeneration writes the pickle to data/preparation/<dataset>/<timestamp>_train/; point the dataset config's label_file: at the new pickle file to use it.
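After regenerating, you need the path of the new pickle for label_file:. The sketch below picks the most recently modified file under data/preparation/<dataset>/*_train/. `newest_label_file` is a hypothetical helper, and the `*.pkl` pattern is an assumption about the file suffix, so adjust it to whatever generate.py actually writes.

```python
import os
import tempfile
from pathlib import Path


def newest_label_file(data_root, dataset, pattern="*.pkl"):
    """Return the most recently modified label file under <dataset>/*_train/, or None."""
    prep_dir = Path(data_root) / "preparation" / dataset
    candidates = list(prep_dir.glob(f"*_train/{pattern}"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Demo with fake run directories; the names and the .pkl suffix are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for i, run in enumerate(["run_a_train", "run_b_train"]):
        run_dir = Path(tmp) / "preparation" / "panoptic" / run
        run_dir.mkdir(parents=True)
        label = run_dir / "labels.pkl"
        label.write_bytes(b"")
        os.utime(label, (1_000 + i, 1_000 + i))  # make run_b_train the newer one
    newest = newest_label_file(tmp, "panoptic")
    newest_run = newest.parent.name
print(newest_run)
```

It selects by modification time rather than by parsing the directory name, since the timestamp format is not documented here.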
Tip
Pose training reads a frozen heatmap backbone from checkpoints/backbone/<dataset>/pose_resnet50.pt. You can fine-tune it (see Heatmap-backbone fine-tuning below) or, since the released pose checkpoints embed their backbone, extract one without retraining:
mkdir -p checkpoints/backbone/panoptic
python -c "import torch; sd=torch.load('checkpoints/pose/panoptic.pt', map_location='cpu', weights_only=False)['state_dict']; torch.save({k[len('backbone.'):]: v for k, v in sd.items() if k.startswith('backbone.')}, 'checkpoints/backbone/panoptic/pose_resnet50.pt')"
Example pose run (Panoptic, 2 GPUs):
python src/train.py dataset=panoptic task=pose task_name=panoptic_pose trainer.devices=2 data.train_cfg.dataset.interval=3
Per-dataset pose commands & key arguments
Stage I (projected polystochastic diffusion for cross-view assignment and 3D root regression) and Stage II (hypergraph convolutional decoder for full-body pose) are optimized jointly on a frozen heatmap backbone.
| argument | meaning |
|---|---|
| dataset=<name> | dataset to train on |
| task=pose | the joint Stage I + Stage II training stage |
| task_name=<name> | directory name under logs/Pose/ for this run |
| trainer.devices=2 | DDP across 2 GPUs |
| data.train_cfg.dataset.interval=<N> | sample every Nth frame from each sequence |
| trainer.max_steps=<N> | total optimizer steps |
| model.scheduler.T_max=<N> | cosine-anneal horizon (keep in sync with max_steps) |
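The reason to keep model.scheduler.T_max in sync with trainer.max_steps is easiest to see from the cosine schedule itself. The sketch below uses the textbook closed form for cosine annealing. It assumes the repo's scheduler follows that form, which is not confirmed here, and the base learning rate is a placeholder rather than a value from the configs.

```python
import math


def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing from base_lr (step 0) to eta_min (step t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


base_lr = 1e-4  # placeholder value, not taken from the repo configs
for max_steps, t_max in [(15000, 15000), (15000, 7500)]:
    final = cosine_lr(max_steps, base_lr, t_max)
    print(f"max_steps={max_steps} T_max={t_max} -> final lr {final:.2e}")
```

With matching values the run ends at the minimum learning rate. With T_max at half of max_steps, the closed form reaches the minimum midway and climbs back to the base rate by the last step, which is rarely what you want.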
# Panoptic (from scratch)
python src/train.py dataset=panoptic task=pose task_name=panoptic_pose trainer.devices=2 data.train_cfg.dataset.interval=3
# Shelf / Campus / MM-OR warm-start from the panoptic pose weights with the backbone
# stripped out. Create that file once from the panoptic pose checkpoint:
mkdir -p checkpoints/warmstart
python -c "import torch; c=torch.load('checkpoints/pose/panoptic.pt', map_location='cpu', weights_only=False); torch.save({'state_dict': {k: v for k, v in c['state_dict'].items() if not k.startswith('backbone.')}}, 'checkpoints/warmstart/panoptic_pose_no_backbone.pt')"
# Shelf
python src/train.py dataset=shelf task=pose task_name=shelf_pose trainer.devices=2 data.train_cfg.dataset.interval=1 trainer.max_steps=15000 model.scheduler.T_max=15000
# Campus
python src/train.py dataset=campus task=pose task_name=campus_pose trainer.devices=2 data.train_cfg.dataset.interval=1 trainer.max_steps=30000 model.scheduler.T_max=30000
# MM-OR Pose
python src/train.py dataset=mm_or task=pose task_name=mm_or_pose trainer.devices=2
Tip
Common overrides: trainer.devices=1 trainer.strategy=auto for a single GPU, trainer=debug for a short dev run, and ckpt_path=<path>.ckpt to resume from a checkpoint.
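The two checkpoint one-liners above (extracting the backbone, and stripping it for the warm-start file) are the same operation seen from both sides: filtering a flat state dict by the `backbone.` key prefix. The sketch below shows that logic on a toy dict without PyTorch. `split_by_prefix` is a hypothetical helper, and the keys and values are placeholders, not real parameter names.

```python
def split_by_prefix(state_dict, prefix="backbone."):
    """Split a flat state dict into (prefixed entries with the prefix removed, the rest)."""
    inside = {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
    outside = {k: v for k, v in state_dict.items() if not k.startswith(prefix)}
    return inside, outside


# Toy state dict; keys and values are placeholders, not real parameter names.
toy = {"backbone.conv1.weight": 1, "backbone.bn1.bias": 2, "decoder.fc.weight": 3}
backbone_only, no_backbone = split_by_prefix(toy)
print(sorted(backbone_only), sorted(no_backbone))
```

The first result corresponds to what is saved as pose_resnet50.pt (prefix removed), the second to the warm-start weights (backbone entries dropped, other keys unchanged).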
Heatmap-backbone fine-tuning (optional)
Trains the per-view feature backbone that produces the 2D joint heatmaps used by both stages of the pose model. Run this before pose training.
# Panoptic, from scratch
python src/train.py dataset=panoptic task=backbone task_name=panoptic_backbone model.backbone.ckpt_path=null
# Shelf / Campus / MM-OR Pose: fine-tune the panoptic backbone
python src/train.py dataset=shelf task=backbone task_name=shelf_backbone
python src/train.py dataset=campus task=backbone task_name=campus_backbone
python src/train.py dataset=mm_or task=backbone task_name=mm_or_backbone
When each run finishes, save the extracted heatmap weights to checkpoints/backbone/<dataset>/pose_resnet50.pt, which is the path pose training reads.
@inproceedings{wang2026dispose,
title = {DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation},
author = {Wang, Tony Danjun and Birdal, Tolga and Navab, Nassir and Bastian, Lennart},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
year = {2026},
}
Released under the MIT License.
