Implement DistributedReadingService #727
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Just to be clear, does the first version assume 1 process per rank?
NivekT left a comment
Question about the distributed design in general:
- Are we expecting users to always use something like `torch.multiprocessing.spawn` to start distributed training? And that will properly start/clean up all the processes?
- To what extent is the optimization from #555 compatible with this? Maybe it can work for every node?
Yeah. I think our next step is to support a mixed reading service (distributed reading service + multiprocessing reading service).
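For readers unfamiliar with the one-process-per-rank model discussed above, here is a minimal sketch of the idea: each rank reads only a disjoint shard of the dataset. The `shard_by_rank` helper and its round-robin policy are illustrative assumptions, not the actual internals of `DistributedReadingService`.

```python
# Hypothetical sketch: round-robin sharding so each rank (one process
# per rank) reads a disjoint slice of the dataset. This mirrors the
# idea behind a distributed reading service, not its real internals.

def shard_by_rank(items, rank, world_size):
    """Return the subset of `items` assigned to `rank` (round-robin)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return items[rank::world_size]

if __name__ == "__main__":
    data = list(range(10))
    # With world_size=4, every element is read by exactly one rank.
    shards = [shard_by_rank(data, r, 4) for r in range(4)]
    print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Together the per-rank shards cover the dataset exactly once, which is the invariant a distributed reading service has to preserve.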
Nope. I can add a different test for elastic training.
This PR should be ready for review. cc: @NivekT @VitalyFedyunin
Actually, there is one thing left: docs and a tutorial for DistributedReadingService. I will do a separate PR to document DataLoader2 and DistributedReadingService.
I will wait until pytorch/pytorch#83741 is landed and released into nightly, because it will also use the updated API.
Force-pushed from e4cd5fe to 43b06eb
The DataPipe tests are failing because the nightly binaries for macOS haven't been updated yet.
Force-pushed from bb5de07 to 8ec5764
NivekT left a comment
LGTM!
Nit comment:
IIUC, the difference between this and _test_distributed_rs is DL1 vs DL2?
- If so, we can potentially remove the duplicate code (not urgent).
- Alternatively, we can just label them _test_distributed_dl and _test_distributed_dataloader (and elastic_dl/elastic_training) so it will be obvious at first glance that those two are mostly the same but test different versions of the DataLoader.
Force-pushed from 8ec5764 to b1e8f35
Fixed
Force-pushed from b2ad947 to 8cad29d
Force-pushed from 77ad0ef to 4aa1272
Force-pushed from 4aa1272 to 8900311
Force-pushed from 149ce37 to e4bdcd4
I will land this PR once PyTorch nightly is updated.
Add DistributedReadingService
Add tests for both DataLoader2 and DataLoader.