
Determinism of local shuffle/random_op after sharding_filter #885

@ejguan


🐛 Describe the bug

Current state of determinism

Using DataLoader2 + PrototypeMultiProcessingReadingService as an example:

  1. Before each iteration starts, a distributed shared seed is generated (link)
  2. With multiprocessing, each subprocess resets all shuffle operations to the same random seed at the beginning of each iteration, based on the distributed shared seed from step 1 (link)
  3. torch, numpy and python.random each get a different process-local seed in every subprocess (link)
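The three steps above can be sketched in plain Python. This is not the actual torchdata implementation; `derive_worker_seed` is a hypothetical helper illustrating one way to derive distinct, reproducible process-local seeds from the distributed shared seed:

```python
import hashlib
import random

def derive_worker_seed(shared_seed: int, worker_id: int) -> int:
    # Hypothetical helper: deterministically derive a process-local seed
    # from the distributed shared seed and the worker id.
    digest = hashlib.sha256(f"{shared_seed}-{worker_id}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

shared_seed = 12345  # step 1: same value in every worker

# Step 2: every worker seeds its shuffle ops with the *shared* seed,
# so all workers produce the same shuffled order before sharding.
shuffle_rng = random.Random(shared_seed)

# Step 3: torch, numpy and python.random would each be seeded with a
# *different* process-local seed per worker, derived deterministically.
worker_seeds = [derive_worker_seed(shared_seed, wid) for wid in range(3)]
```

Because the derivation is a pure function of `(shared_seed, worker_id)`, re-running an iteration with the same shared seed reproduces every worker's local seeds.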

Additional feature

For step 2 in the last section, we set the same shuffle seed across distributed/mp workers because we want to make sure the shuffled data can be sharded in a mutually exclusive and collectively exhaustive manner.
An additional feature is needed to make sure all random operations after sharding_filter have different seeds across workers, to preserve full data randomization.

Let's say we have a pipeline as:

data_source.shuffle().sharding_filter().map(fn).batch(8).shuffle()

The random state will be shared across workers for the first shuffle, but different for the second shuffle. And those states should be generated in a deterministic manner so that we are able to reproduce them.
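A minimal sketch of this pipeline for two workers, using plain Python rather than torchdata (the `worker_pipeline` function and the `shared_seed * 1000003 + worker_id` seed derivation are illustrative assumptions, not the real mechanism):

```python
import random

def worker_pipeline(data, num_workers, worker_id, shared_seed):
    # First shuffle: seeded with the *shared* seed, so every worker
    # sees the identical shuffled order, making the round-robin
    # sharding below mutually exclusive and collectively exhaustive.
    d = list(data)
    random.Random(shared_seed).shuffle(d)

    shard = d[worker_id::num_workers]  # sharding_filter
    # map(fn) and batch(8) omitted for brevity.

    # Second shuffle: seeded per worker, deterministically derived from
    # the shared seed and worker id (hypothetical derivation), so each
    # worker randomizes its own shard differently yet reproducibly.
    random.Random(shared_seed * 1000003 + worker_id).shuffle(shard)
    return shard

shard_0 = worker_pipeline(range(8), num_workers=2, worker_id=0, shared_seed=42)
shard_1 = worker_pipeline(range(8), num_workers=2, worker_id=1, shared_seed=42)
```

The two shards partition the data exactly once between workers, and re-running with the same `shared_seed` reproduces both shards element for element.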

Versions

main branch

cc: @msaroufim @VitalyFedyunin
