Guarantee the DataPipe iterator is reset once all loops have received the reset request in the dispatching process #994
ejguan wants to merge 6 commits into meta-pytorch:main
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
test/test_remote_io.py (outdated)

```python
[["s3://ai2-public-datasets/charades/"], 18],  # folder without '/'
[["s3://ai2-public-datasets/charad"], 18],  # prefix
[
    (
```
```python
def __del__(self):
    try:
        self.finalize()
    except AttributeError:
        pass
```
NivekT: Just a comment - a little bit surprised we need this, given that ProtoMultiRS's finalize seems to be catching every possible AttributeError? Maybe the issue is Distributed? Thoughts?
ejguan: ProtoMPRS doesn't handle every AttributeError - for example, self._worker_processes from
https://github.com/pytorch/data/blob/98222ad72ee7a29e676646e6b3f9173576410320/torchdata/dataloader2/reading_service.py#L345
Technically speaking, I should remove those try-except clauses in finalize to simplify the codebase.
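To illustrate the failure mode this guard covers, here is a minimal sketch (the class and helper names are hypothetical; only `_worker_processes` mirrors the attribute from the linked line): if `__init__` raises before an attribute is assigned, Python still runs `__del__` on the partially constructed object, so `finalize()` can hit an AttributeError.

```python
# Minimal sketch of why __del__ guards finalize() with AttributeError.
# FakeReadingService and _spawn_workers are made-up names for illustration.
class FakeReadingService:
    def __init__(self):
        # If this raises, _worker_processes is never assigned, but
        # __del__ still runs on the partially constructed object.
        self._worker_processes = self._spawn_workers()

    def _spawn_workers(self):
        raise RuntimeError("failed to spawn workers")

    def finalize(self):
        for proc in self._worker_processes:  # AttributeError if __init__ failed
            proc.join()

    def __del__(self):
        try:
            self.finalize()
        except AttributeError:
            pass  # partially constructed; nothing to clean up
```

Constructing `FakeReadingService()` raises RuntimeError; without the guard, `__del__` would additionally emit an "Exception ignored in" traceback for the AttributeError during cleanup.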
```python
# Ensure only reset iterator once for the dispatching process
if reset_iterator_counter is not None:
    reset_iterator_counter.increment()
    while not reset_iterator_counter.is_reached():
        yield True
    # Sync between loops within the dispatching process
    source_datapipe.reset_iterator()
    yield True
    reset_iterator_counter.reset()
```
ejguan: You might need to do something similar to this for resuming the dispatching process. cc: @NivekT
NivekT: Just to confirm I understand - this is to handle the situation where some workers are handling GetNextRequest while some are trying to reset? You want all GetNext to be done before the dispatching process executes the reset?
ejguan: Not workers - it only happens in the dispatching process, when multiple leaf DataPipes share the same data source (round-robin demux on the same DataPipe) in a single process.
It handles the case where some loops have received reset while the others haven't: we want to hold off on GetNext requests until all loops have received reset. Otherwise, the data source could be reset in the middle of iteration for the other loops. (A sketch of the counter is below.)
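For readers following along, a minimal sketch of how such a counter can act as a barrier across the loops (the class name, field names, and loop count here are assumptions; only the increment/is_reached/reset API comes from the snippet above):

```python
# Hypothetical barrier counter: each loop in the dispatching process
# increments on reset; is_reached() becomes True only once every loop
# has checked in, so no loop proceeds while another is still mid-epoch.
class ResetCounter:
    def __init__(self, expected_loops: int) -> None:
        self.exp_cnt = expected_loops
        self.cnt = 0

    def increment(self) -> None:
        self.cnt += 1
        assert self.cnt <= self.exp_cnt

    def is_reached(self) -> bool:
        return self.cnt == self.exp_cnt

    def reset(self) -> None:
        self.cnt = 0


# Three loops sharing one source via round-robin demux:
counter = ResetCounter(expected_loops=3)
counter.increment()
assert not counter.is_reached()  # 1 of 3 loops has seen the reset request
counter.increment()
counter.increment()
assert counter.is_reached()      # all loops checked in; safe to reset the source
counter.reset()                  # ready for the next epoch
```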
I will rerun all tests tomorrow once the PyTorch nightly has been updated.
NivekT left a comment

LGTM! Hope the nightly gets uploaded and all the CIs get fixed
```python
if self._reached:
    return self._reached
```
NivekT: nit: These two lines can be removed? But I guess it is slightly faster, so I'm indifferent.
ejguan: Lol, you are right. I will remove them tomorrow.
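For context, a sketch of the simplification under discussion (class and field names are assumed from the two quoted lines, not the actual torchdata code):

```python
# Once _reached latches to True it stays True until reset, so the early
# return is redundant - the computation below returns the same value.
class LatchedCounter:
    def __init__(self, exp_cnt: int) -> None:
        self.cnt = 0
        self.exp_cnt = exp_cnt
        self._reached = False

    def is_reached(self) -> bool:
        # Removed per the nit:
        # if self._reached:
        #     return self._reached
        if self.cnt == self.exp_cnt:
            self._reached = True
        return self._reached
```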
The test takes way more time to finish. However, I can't really reproduce it either on Linux or Mac.
Edit: Found the culprit test.
Also, I am going to remove all S3-related commits. To fix the S3 tests, I plan to rely on #997.
Changes
- Fix S3 tests in "Fix test_remote_io.py due to mutating public s3 bucket" #997
- Delete `thread` from `Prefetcher` when gc gets involved. This would prevent a racing condition when both the `finally` clause in the generator and the `reset` function are accessing the same `thread`.
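A minimal sketch of that race and the shape of the fix (the class skeleton and method names are assumptions, not the actual torchdata Prefetcher): the generator's `finally` block can run from garbage collection at the same time `reset()` is tearing down the prefetching thread, so the teardown takes ownership of the reference and removes it from the instance before joining.

```python
import threading

# Hypothetical sketch: both the generator's finally block (which may run
# when gc collects the generator) and reset() used to touch the same
# self.thread. Swapping the reference out before joining means each
# teardown path sees the thread at most once.
class SketchPrefetcher:
    def __init__(self) -> None:
        self.thread = None

    def _start(self) -> None:
        self.thread = threading.Thread(target=lambda: None, daemon=True)
        self.thread.start()

    def _stop(self) -> None:
        # Take ownership of the reference and clear it on the instance,
        # so a second caller finds nothing left to tear down.
        thread = self.thread
        self.thread = None
        if thread is not None:
            thread.join()

    def __iter__(self):
        self._start()
        try:
            yield from ()
        finally:
            self._stop()  # may run from gc when the generator is collected

    def reset(self) -> None:
        self._stop()
```

Whichever of the `finally` block or `reset()` runs second finds `self.thread` already cleared and returns without touching a half-joined thread - the same effect the PR achieves by deleting `thread` from the `Prefetcher`.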