Recommended practice to shuffle data with datapipes differently every epoch #718

@BarclayII

📚 The doc issue

I was trying torchdata 0.4.0 and found that shuffling with datapipes always yields the same order in every epoch, unless I call shuffle() again at the beginning of each epoch.

# same_result.py
import torch
import torchdata.datapipes as dp
X = torch.randn(200, 5)
dpX = dp.map.SequenceWrapper(X)
dpXS = dpX.shuffle()
for _ in range(5):
    for i in dpXS:
        print(i)   # always prints the same value
        break

# different_result.py
import torch
import torchdata.datapipes as dp
X = torch.randn(200, 5)
dpX = dp.map.SequenceWrapper(X)
for _ in range(5):
    dpXS = dpX.shuffle()
    for i in dpXS:
        print(i)   # prints different values
        break

What is the recommended practice for reshuffling the data at the beginning of every epoch? Neither the documentation nor the examples seem to answer this question.
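To make the desired behavior concrete, here is a minimal pure-Python sketch, independent of torchdata, in which the permutation is drawn each time iteration starts rather than once at construction. The class name `ReshuffleEachEpoch` and its `seed` parameter are hypothetical and not part of any torchdata API; this only illustrates the per-epoch semantics I am asking about, not how the library does or should implement it.

```python
import random
from typing import Any, Iterator, Optional, Sequence


class ReshuffleEachEpoch:
    """Hypothetical wrapper (not a torchdata API): new order on every pass."""

    def __init__(self, data: Sequence[Any], seed: Optional[int] = None) -> None:
        self.data = data
        # One RNG shared across epochs, so each pass continues the random stream.
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[Any]:
        # Draw a fresh permutation every time a new pass (epoch) starts.
        indices = list(range(len(self.data)))
        self.rng.shuffle(indices)
        for i in indices:
            yield self.data[i]


if __name__ == "__main__":
    ds = ReshuffleEachEpoch(list(range(200)), seed=0)
    for epoch in range(5):
        for item in ds:
            print(epoch, item)  # first element of this epoch's order
            break
```

With this pattern the loop body never has to rebuild the shuffled object, which is the ergonomics I would expect from calling shuffle() once on a datapipe.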

Suggest a potential alternative/fix

No response
