[DataLoader] Deep copy ReadingService during DL2 initialization#746
[DataLoader] Deep copy ReadingService during DL2 initialization#746NivekT wants to merge 6 commits intogh/NivekT/89/basefrom
Conversation
[ghstack-poisoned]
torchdata/dataloader2/dataloader2.py
Outdated
| else: | ||
| self.datapipe_adapter_fns = [datapipe_adapter_fn] | ||
| self.reading_service = reading_service | ||
| self.reading_service = deepcopy(reading_service) |
There was a problem hiding this comment.
Let's try to import your PR to internal to validate if this is working properly with internal RS.
There was a problem hiding this comment.
+1 on checking existing implementations.
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
|
We should put this information into the ReadingService interface file too. To make sure that RS developers know about potential deep copy. |
|
One more though, we deepcopy RS, but we pickle-unpickle DP. I think we need to pick consistent approach. |
Totally agree. And, I support |
Can you elaborate on this? What make a consistent approach matter here? |
|
Also, it seems like some internal ReadingService cannot be deep-copied. One alternative is for users to pass in a constructor instead of the object itself, but that will change the interface quite a bit. |
It seems the Error comes from construction of process group in |
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService` Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService` Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…itialization" Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
…itialization" Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Summary: Pull Request resolved: #786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Summary: Pull Request resolved: meta-pytorch#786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after meta-pytorch#746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Summary: Pull Request resolved: #786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Pull Request resolved: meta-pytorch#786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after meta-pytorch#746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Stack from ghstack:
By creating a deep copy of the input
ReadingServiceduring the initialization ofDataLoader2, we allow the instance ofReadingServiceto store state without worrying about the users passing the sameReadingServiceto initialize additional DataLoaders.For instance,
PrototypeMultiProcessingReadingServicestoresprocessesanddatapipesas instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.Relevant test are added in subsequent PR related to lengths and eventually
PrototypeMultiprocessingReadingService.Differential Revision: D38868731
BC Breaking Note:
At the time when this PR landed,
MultiProcessingReadingServicewas known asPrototypeMultiProcessingReadingService.Previously, if a ReadingService is being passed to multiple DataLoaders, the ReadingService's state can be altered by any of those DataLoaders. This use case is not intended and can lead to silent incorrectness.
For instance,
MultiProcessingReadingServicestoresprocessesanddatapipesas instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states.Consider the code snippet below:
Prior to this change, since the reading service objects are identical within each DataLoader2, initializing
dl2will change the state of the reading service that exists indl1.After this change, the reading service that exist in each DataLoader2 is independent and will not impact each other.