🐛 Describe the bug
Current state of determinism
Using DataLoader2 + PrototypeMultiProcessingReadingService as an example:
- Before each iteration starts, a distributed shared seed will be generated (link)
- With multiprocessing, each subprocess resets all shuffle operations to the same random seed at the beginning of each iteration, based on the distributed shared seed from step 1. (link)
- torch, numpy and python.random each get a different process-local seed in every subprocess (link)
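The split between a shared shuffle seed and per-process local seeds can be sketched as follows. This is an illustrative helper, not the actual torchdata implementation, and the derivation formula is an assumption:

```python
# Sketch of the seeding scheme above (hypothetical helper, not torchdata code).
def worker_seeds(shared_seed: int, worker_id: int) -> tuple[int, int]:
    # Shuffle ops reuse the shared seed directly, so every worker
    # shuffles the source data identically before sharding.
    shuffle_seed = shared_seed
    # torch / numpy / python.random get a worker-specific seed
    # (assumed derivation, deterministic given shared_seed and worker_id).
    local_seed = (shared_seed * 1_000_003 + worker_id) % 2**32
    return shuffle_seed, local_seed

s0 = worker_seeds(42, worker_id=0)
s1 = worker_seeds(42, worker_id=1)
assert s0[0] == s1[0]  # same shuffle seed across workers
assert s0[1] != s1[1]  # different process-local seeds
```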
Additional feature
For step 2 in the previous section, we set the same shuffle seed across distributed/mp workers because we want to make sure the shuffled data can be sharded in a mutually exclusive and collectively exhaustive manner.
An additional feature is needed to make sure all random operations after sharding_filter have different seeds across workers, to preserve full data randomization.
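A toy demonstration (plain Python, not torchdata code) of why sharing the pre-sharding shuffle seed keeps the shards mutually exclusive and collectively exhaustive:

```python
import random

def shard_after_shuffle(data, shared_seed, num_workers):
    """Simulate each worker shuffling with the same seed, then sharding."""
    shards = []
    for worker_id in range(num_workers):
        rng = random.Random(shared_seed)   # same seed in every worker
        items = list(data)
        rng.shuffle(items)                 # identical shuffle in every worker
        shards.append(items[worker_id::num_workers])  # round-robin sharding
    return shards

shards = shard_after_shuffle(range(10), shared_seed=42, num_workers=2)
assert sorted(shards[0] + shards[1]) == list(range(10))  # collectively exhaustive
assert not set(shards[0]) & set(shards[1])               # mutually exclusive
```

If the workers used different shuffle seeds here, the round-robin slices would be taken from differently ordered lists, so elements could be duplicated or dropped.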
Let's say we have a pipeline as:
```
data_source.shuffle().sharding_filter().map(fn).batch(8).shuffle()
```
We will have the random state shared for the first shuffle but different states for the second shuffle. And those states should be generated in a deterministic manner so that we are able to reproduce them.
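The requested behavior for the second shuffle can be sketched as below. The seed-derivation formula is a placeholder assumption; the point is only that the post-sharding seed is distinct per worker yet reproducible from the shared seed:

```python
import random

# Hypothetical derivation of a per-worker seed for random ops that run
# after sharding_filter (not the actual torchdata implementation).
def post_sharding_seed(shared_seed: int, worker_id: int) -> int:
    return (shared_seed * 1_000_003 + worker_id + 1) % 2**64

batches = list(range(4))
orders = []
for worker_id in range(2):
    items = list(batches)
    # Second shuffle: seeded differently in each worker.
    random.Random(post_sharding_seed(42, worker_id)).shuffle(items)
    orders.append(items)

# Re-deriving the seed from (shared_seed, worker_id) reproduces the
# same order, so the run stays deterministic.
items = list(batches)
random.Random(post_sharding_seed(42, 0)).shuffle(items)
assert items == orders[0]
```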
Versions
main branch
cc: @msaroufim @VitalyFedyunin