Conversation
| return self.error is not None |
| class _PrefetchExecutor: |
This executor is based on the internal implementation: https://fburl.com/code/7dk6mvs4
On top of that implementation, I added `prefetch_size` and attached an index to the `Expected` object to make sure it can work with `Prefetch` in the future.
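For context, the pattern described here can be sketched in plain Python. The names `Expected` and `prefetch_size` follow the comment above; everything else (the semaphore-based buffer bound, the worker loop) is a hypothetical illustration, not torchdata's actual code:

```python
# Hypothetical sketch of a prefetch executor: a worker thread pulls items
# from the source iterator ahead of consumption, tagging each result with
# an index so downstream code can reorder or resume later.
import threading
from collections import deque
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Expected:
    index: int
    data: Any = None
    error: Optional[Exception] = None

    def has_error(self) -> bool:
        return self.error is not None


class PrefetchExecutor:
    def __init__(self, source, prefetch_size: int = 2):
        self._source = iter(source)
        self._buffer = deque()
        self._slots = threading.Semaphore(prefetch_size)  # bounds buffered items
        self._ready = threading.Semaphore(0)              # counts filled slots
        self._index = 0
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while True:
            self._slots.acquire()
            try:
                item = Expected(self._index, data=next(self._source))
            except Exception as exc:  # includes StopIteration as end marker
                self._buffer.append(Expected(self._index, error=exc))
                self._ready.release()
                return
            self._index += 1
            self._buffer.append(item)
            self._ready.release()

    def __iter__(self):
        return self

    def __next__(self):
        self._ready.acquire()
        item = self._buffer.popleft()
        self._slots.release()
        if isinstance(item.error, StopIteration):
            raise StopIteration
        if item.error is not None:
            raise item.error
        return item.data
```

The semaphore pair keeps at most `prefetch_size` items in flight, and errors from the source are delivered to the consumer at the position where they occurred rather than crashing the worker thread silently.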
Adding test now.
Force-pushed ffe3a1e to fd98a28
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
test/test_distributed.py
Outdated
| data_length = 23 |
| dp = IterableWrapper(list(range(data_length))).sharding_filter().fullsync() |
Without fullsync, this pipeline would hang forever.
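The hang comes from uneven shards: 23 elements distributed round-robin leave some ranks with one more element than others, so the short rank finishes early while the longer ranks block in a collective waiting for it. The shard sizes are easy to verify in plain Python (the world size of 4 is a hypothetical choice for illustration):

```python
# Round-robin sharding of 23 elements across 4 ranks: ranks 0-2 get
# 6 elements each, rank 3 gets only 5, which is why a full-sync step
# is needed to stop all ranks at the same iteration.
data_length = 23
world_size = 4
shards = [list(range(data_length))[rank::world_size] for rank in range(world_size)]
sizes = [len(s) for s in shards]
print(sizes)  # [6, 6, 6, 5]
```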
| # LICENSE file in the root directory of this source tree. |
| # Use the same timeout as PyTorch Distributed |
| default_timeout_in_s = 30 * 60 |
Just a comment (no change needed): we should also put other defaults here, such as the default buffer size.
| which is caused by uneven sharded data (functional name: ``fullsync``). It should |
| be appended at the end of the graph of ``DataPipe`` by ``DistributedReadingService`` |
| automatically. |
Question: do we recommend against usage of this DataPipe outside of a ReadingService? If not, can we potentially include an example?
Makes sense. I will add one, even though we should always recommend that users rely on the ReadingService.
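The core idea behind ``fullsync`` can be illustrated without an actual process group: every rank learns (via a MIN collective over remaining element counts) how many steps all ranks can safely take, and stops there so nobody blocks on an exhausted peer. A single-process simulation of that handshake (hypothetical helper name, not the real implementation, which pads or truncates per batch rather than up front):

```python
# Simulate fullsync's stopping rule: all "ranks" iterate for exactly
# min(shard lengths) steps, standing in for an all_reduce(MIN) over
# each rank's remaining-element count.
def fullsync_truncate(shards):
    agreed_steps = min(len(shard) for shard in shards)
    return [shard[:agreed_steps] for shard in shards]

# 23 elements sharded round-robin over 4 ranks -> lengths [6, 6, 6, 5]
shards = [list(range(23))[rank::4] for rank in range(4)]
synced = fullsync_truncate(shards)
print([len(s) for s in synced])  # [5, 5, 5, 5]
```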
| def __getstate__(self): |
|     if IterDataPipe.getstate_hook is not None: |
|         return IterDataPipe.getstate_hook(self) |
|     state = ( |
|         self.datapipe, |
|         self.timeout, |
|     ) |
|     return state |
IMHO, checkpointing for fullsync or prefetch is a little tricky.
Let's confirm the expected behavior. When we take a checkpoint, we should pause any further prefetching and save all prefetched data into a buffer. Then we serialize the buffer and the inner datapipe (because we have to serialize the datapipe after prefetching is done). Only when we start iteration again would we resume prefetching.
WDYT: @VitalyFedyunin @NivekT
Then the whole logic of fullsync would need to change. This gets even more complicated when the data ends while the prefetched items are being moved into the buffer. I might open a new PR to implement serialization.
Yeah, I think we should stop the prefetch and capture the current data. This can be similar to the internal client snapshot, so https://fburl.com/code/6hrjawgh may be a helpful reference.
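The pause-then-serialize behavior discussed here can be sketched with a toy prefetcher whose `__getstate__` persists the buffer alongside the inner datapipe, so iteration resumes where it left off after unpickling. This is a hypothetical illustration (a real version would also need to stop the worker thread safely before capturing state):

```python
import pickle


class Prefetcher:
    """Toy prefetcher whose checkpoint saves the prefetched buffer plus
    the inner datapipe; prefetching restarts lazily on next iteration."""

    def __init__(self, datapipe, prefetch_size=3):
        self.datapipe = list(datapipe)
        self.prefetch_size = prefetch_size
        self._buffer = []   # items fetched ahead of consumption
        self._cursor = 0    # position of the underlying "worker"

    def _prefetch(self):
        while len(self._buffer) < self.prefetch_size and self._cursor < len(self.datapipe):
            self._buffer.append(self.datapipe[self._cursor])
            self._cursor += 1

    def __iter__(self):
        while True:
            self._prefetch()
            if not self._buffer:
                return
            yield self._buffer.pop(0)

    def __getstate__(self):
        # Checkpoint: stop prefetching, persist buffer + inner datapipe.
        return {
            "datapipe": self.datapipe,
            "prefetch_size": self.prefetch_size,
            "_buffer": list(self._buffer),
            "_cursor": self._cursor,
        }

    def __setstate__(self, state):
        self.__dict__.update(state)


pf = Prefetcher(range(6))
it = iter(pf)
first = [next(it) for _ in range(2)]       # consume two items
restored = pickle.loads(pickle.dumps(pf))  # checkpoint mid-stream
rest = list(restored)
print(first + rest)  # [0, 1, 2, 3, 4, 5]
```

Because the buffered-but-unconsumed items travel with the checkpoint, no element is dropped or duplicated across the save/restore boundary, which is exactly the invariant the comments above are after.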
I will land this PR for now. Listing two follow-up works:

Changes:
- `_PrefetchExecutor` to run prefetching in multi-threading
- `PrefetchIterDataPipe`
- `FullSyncIterDataPipe`

(`FullSyncIterDataPipe` is unclear to me)