Add distributed intra-graph parallelism for GNN from unstructured mesh (graph splitting)#55
Add distributed intra-graph parallelism for GNN from unstructured mesh (graph splitting)#55pzhanggit wants to merge 2 commits into
Conversation
TsChala
left a comment
There was a problem hiding this comment.
This is interesting! Overall it looks good, I found a few small typos/bugs. I also got a question regarding the norms, see the comments.
I think we should update the inference script accordingly as well, to have ghost info in the opts etc.
| if self.use_MPI: | ||
| if self.group_rank == 0: | ||
| self._run_partitioning() | ||
| _MPI.COMM_WORLD.Barrier() |
There was a problem hiding this comment.
This looks like a bug, remove _ before MPI?
| from torch.utils.data import Dataset | ||
| from torch_geometric.utils import coalesce | ||
| import torch.nn.functional as F | ||
| from collections import defaultdict | ||
| from typing import Optional, List, Dict, Tuple, NamedTuple | ||
| import warnings |
There was a problem hiding this comment.
Many of these are not used: Dataset, coalesce, Optional, Tuple, warnings
| shard_path = rank_shard | ||
| else: | ||
| # fallback: full graph (single-rank mode) | ||
| shard_path = full_path |
There was a problem hiding this comment.
Maybe raise a warning/print out something when this happens for the user to be aware that it's falling back to single-rank mode.
| maxdiff_ghost = 0.0 | ||
|
|
||
| assert maxdiff_all<1e-6 and maxdiff_own<1e-6 and maxdiff_ghost<1e-6, ( | ||
| f"Pei debugging ghost0 [rank {dist.get_rank(sequence_parallel_group)}] first halo maxdiff_all={maxdiff_all:.6e}, " |
There was a problem hiding this comment.
sequence_parallel_group is not defined here. sequence_parallel_group --> comm
| h = HaloExchange_sync(h, ghost_info, comm) | ||
| h_in = h | ||
| h = conv(h, edge_index) | ||
| h = norm(h, batch) |
There was a problem hiding this comment.
If I understand things correctly, if the HaloExchange_sync happens, then we will normalize based on the owned nodes and the ghost nodes as well. Is this the intended behavior? Wouldn't it make more sense to ignore the ghost nodes for normalization?
This also happens in the GraphhMLP_stem.
There was a problem hiding this comment.
Yes, it's intended. Strictly speaking, we probably should apply all the local operations on the owned nodes ONLY. But the implementation did not separate owned nodes and ghost nodes across all the operations, except for the sync here, for simplicity. I was expecting this would not cause much performance difference. What do you think?
There was a problem hiding this comment.
Okay. I think it's fine as well. I just wanted to make sure it's intended and things are not falling through cracks.
| leadtime = self.leadtime_max #max(self.leadtime_max//2, 1)]) | ||
| #else: | ||
| # raise ValueError(f"Fix leadtime for now but got {self.leadtime_fixed}") |
There was a problem hiding this comment.
The autoregressive model forward is still not in main branch for graphs, so I'm not sure if this part make sense to be added now?
Splits a single large graph across multiple GPUs so that graphs too large to fit on one device can be trained at scale. Each rank owns a contiguous subgraph (METIS or random partition) and communicates boundary node embeddings with neighbouring ranks via halo exchange (set to be optional) before each GNN message-passing layer.
Tested on MeshGraphNets airfoil dataset (about 5000 nodes) with 4-split on Frontier. Checked the ghost sync function works as expected. The slightly slower convergence is probably due to the per-sample based operations. In the small test (toy test that does not really require graph partition and random split with 60~70% ghost node ratio) the runtime also goes up about 6x when 4-split and another 4x when 4-split + ghostsync.
Baseline
4-split without ghostsync
4-split with ghossync