[SymmMem] Tiled reduce (#162243)
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162243
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures: As of commit a622842 with merge base a707042.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Can we have some benchmarks for inter- and intra-node? E.g., compared to copy + NCCL?

@ngimel I need to add some util code to create multiple teams to boost the bandwidth. Stacked PR coming :)
ngimel left a comment:
Can you give a tl;dr how you expect multiple teams to improve perf?
```
// src_tensor and dst_tensor are already the tiles to operate on, thus we set
// the start_coord to 0
auto start_coord = nvshmemx::make_shape(0, 0);
nvshmemx::tile_sum_reduce<decltype(src_tensor), decltype(dst_tensor), Shape2D,
                          nvshmemx::tile_coll_algo_t::NVLS_ONE_SHOT_PULL_NBI>(
```
One-shot algorithms are OK only for small sizes; for larger sizes they result in 4x more network traffic at world size 8.
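(A back-of-envelope model of where the 4x comes from, using the standard per-rank traffic counts for allreduce; illustrative numbers, not measurements from this PR:)

```
# Per-rank traffic for an N-byte allreduce over w ranks (standard cost model).
w, N = 8, 1.0
one_shot = (w - 1) * N           # every rank pulls each peer's full buffer
two_shot = 2 * (w - 1) * N / w   # reduce-scatter + all-gather
print(one_shot / two_shot)       # -> 4.0 at w = 8
```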
The docs say, "The users are expected to use the tile_collective_wait routine to ensure completion of the non-blocking collectives." I don't see it here.
That's a limitation of NVSHMEM today; only three algorithms are available:
- tile_coll_algo_t::NVLS_ONE_SHOT_PUSH_NBI
- tile_coll_algo_t::NVLS_ONE_SHOT_PULL_NBI
- tile_coll_algo_t::NVLS_TWO_SHOT_PUSH_NBI

And I don't think TWO_SHOT would work for reduce.

One-shot reduce (not all-reduce) will not create extra traffic, but it would indeed create a hot spot at the root GPU. So this relates to whether the collective has access to intermediate buffers or not:
- if not, reduce can only do one-shot, thus hot-spotting the root and never bandwidth-optimal;
- if yes, then algorithms like ring are possible, and thus bandwidth-optimal (see the rough count below).
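(To make the hot-spot vs. bandwidth-optimal distinction concrete, a rough per-link count using the usual textbook numbers; illustrative, not measured on this PR:)

```
# One-shot reduce vs. ring reduce of an N-byte tile over w ranks (illustrative).
w, N = 8, 1.0
root_ingress_one_shot = (w - 1) * N  # all (w-1) peer tiles converge on the root's links
per_link_ring = (w - 1) * N / w      # a ring spreads ~N bytes evenly across links
# Total bytes moved is comparable; one-shot just concentrates them at the root.
```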
If (for some not-very-small sizes) TWO_SHOT allreduce is faster than one-shot reduce, shouldn't we be using it? At a higher level, what are you trying to achieve? Is it inter-node or intra-node?
TWO_SHOT allreduce will modify non-root ranks' buffers, so it is not quite a 1:1 match in terms of semantics.

@ngimel Today the PR launches only 1 CUDA block to work on the tile. If we want to scale to multiple blocks, e.g. 1 block per k rows, we'd need 1 team per block, because that's the semantics of NVSHMEM teams: each concurrently running collective needs its own team.
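(A hypothetical caller-side sketch of that scaling: split the tile into k row-chunks and issue one reduce per chunk, each of which would need its own team/block underneath. The chunking loop and the `in_tile`/`out_tile`/`root`/`group_name` names are illustrative assumptions, not this PR's code; the `torch.ops.symm_mem` namespace is assumed from where symmetric-memory ops are registered.)

```
# Hypothetical: k concurrent row-chunk reductions, one team/block each underneath.
k = 4
for src, dst in zip(in_tile.chunk(k, dim=0), out_tile.chunk(k, dim=0)):
    torch.ops.symm_mem.tile_reduce(src, dst, root, group_name)
```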
```
 * receiving the reduced tensor. */
TORCH_CHECK(reduce_op == "sum", "tile_reduce: only sum is supported for now");
TORCH_CHECK(in_tile.dim() == 2 && out_tile.dim() == 2, "Only 2D tensors are supported");
TORCH_CHECK_EQ(in_tile.dtype(), out_tile.dtype());
```
Add user-friendly error messages here.
I had a look at the macro expansion of TORCH_CHECK_EQ; looks friendly enough?
(See pytorch/c10/util/logging_is_not_google_glog.h, lines 139 to 141 at commit 6861a27.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Starting merge as part of PR stack under #164757.
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.
- The number of concurrent reductions can be smaller than the world size, i.e. roots can be a subset of all ranks, but all ranks are still required to call into this API.
- Currently supports NVLink SHARP scope only.

Pull Request resolved: #164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: #162243
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`
For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.
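A minimal usage sketch (assumptions: the op lives under `torch.ops.symm_mem`, the `torch.distributed._symmetric_memory` helpers shown are the setup path, and passing the same tile as input and out is allowed; exact calls may differ by PyTorch version):

```
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
group_name = dist.group.WORLD.group_name

symm_mem.set_backend("NVSHMEM")  # NVSHMEM-backed allocation (assumed call)
t = symm_mem.empty((8, 8), dtype=torch.float, device="cuda")
symm_mem.rendezvous(t, group=group_name)
t.fill_(1.0)

# Sum-reduce every rank's bottom-right 4x4 tile into rank 0's tile
# (same tile used as input and out here; an assumption for brevity).
tile = t[4:, 4:]
torch.ops.symm_mem.tile_reduce(tile, tile, 0, group_name)
```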
Testing on right-bottom quadrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```
Pull Request resolved: pytorch#162243
Approved by: https://github.com/ngimel