
[DataPipe] shard expander#405

Closed
tmbdev wants to merge 8 commits into meta-pytorch:main from tmbdev:wdsshardexpander

Conversation


@tmbdev (Contributor) commented May 13, 2022

This PR adds a ShardExpander filter that takes shard specs of the form "prefix-{000..999}.tar" and expands them into the 1000 corresponding file names, analogous to shell brace expansion. Only numerical range expansion is implemented (not full shell-style brace expansion).

Specifying collections of files in terms of numerical shard specs (instead of using classes like FileLister) has a number of advantages: (1) it does not require the ability to list files on the target, which is not possible in general (for example, over HTTP); (2) it documents the collection of files and makes it reproducible, independent of the state of the storage system; (3) it makes it easy to choose different dataset sizes by specifying different numerical subranges.

ShardExpander is frequently used with WebDataset to specify collections of training shards.
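The expansion described above can be sketched as follows. This is a hypothetical standalone helper for illustration, not the PR's actual DataPipe code; the function name `expand_shard_spec` is an assumption.

```python
import re

def expand_shard_spec(spec):
    """Expand a numeric range spec like "prefix-{000..999}.tar" into the
    corresponding file names (illustrative sketch, not the torchdata code)."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", spec)
    if m is None:
        return [spec]  # no range present: pass the name through unchanged
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # leading zeros in the lower bound set the padding width
    return [
        spec[:m.start()] + str(i).zfill(width) + spec[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]

print(expand_shard_spec("ds-{000..002}.tar"))
# ['ds-000.tar', 'ds-001.tar', 'ds-002.tar']
```

In a DataPipe pipeline, a filter like this would be applied to each string yielded by the upstream pipe, flattening each spec into its expanded shard names.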

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 13, 2022
@VitalyFedyunin changed the title from "shard expander" to "[DataPipe] shard expander" on May 19, 2022
@VitalyFedyunin (Contributor) left a comment

Can you please support other types of expansions:
- `imagenet-{000000..001199}.tar` -> as designed
- `imagenet-{0..1199}.tar` -> expands to imagenet-0.tar, imagenet-1.tar, ..., imagenet-10.tar, imagenet-11.tar, ..., imagenet-1199.tar
- `imagenet-{182..1199}.tar` -> expands to imagenet-182.tar, imagenet-183.tar, ..., imagenet-210.tar, imagenet-211.tar, ..., imagenet-1199.tar
- `imagenet-{0182..1199}.tar` -> works
- `imagenet-{182..01199}.tar` -> throws an error, as it is unclear where to start adding leading zeros
- `imagenet-{082..1199}.tar` -> throws an error (inconsistent padding)
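The padding rules requested above could be validated with logic along these lines. This is a sketch of one possible implementation; the function name `parse_range` and the exact error messages are assumptions, not the PR's code.

```python
import re

def parse_range(spec):
    """Parse and validate a {lo..hi} numeric range (illustrative sketch).

    A bound with a leading zero opts into fixed-width padding; in that
    case both bounds must have the same width, otherwise the intended
    padding is ambiguous and we raise an error.
    """
    m = re.search(r"\{(\d+)\.\.(\d+)\}", spec)
    if m is None:
        raise ValueError(f"no {{lo..hi}} range found in {spec!r}")
    lo, hi = m.group(1), m.group(2)
    padded = (len(lo) > 1 and lo[0] == "0") or (len(hi) > 1 and hi[0] == "0")
    if padded and len(lo) != len(hi):
        raise ValueError(f"ambiguous zero padding in {spec!r}")
    width = len(lo) if padded else 0  # 0 means: no padding applied
    return int(lo), int(hi), width

print(parse_range("imagenet-{0182..1199}.tar"))  # (182, 1199, 4)
```

Under these rules, `{0..1199}` expands unpadded, `{0182..1199}` pads to width 4, and `{082..1199}` or `{182..01199}` raise an error, matching the cases listed in the review comment.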


# Functional Test: ensure expansion generates the right number of shards
stage1 = IterableWrapper(["ds-{000000..000009}.tar"])
print(list(iter(stage1)))

Please drop print

print(list(iter(stage1)))
stage2 = ShardExpander(stage1)
output = list(iter(stage2))
assert len(output) == 10

Please use self.assertEqual instead

print(list(iter(stage1)))
stage2 = ShardExpander(stage1)
output = list(iter(stage2))
assert len(output) == 10

This test will pass even if ShardExpander returns garbage for each line.
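The reviewer's point can be addressed by comparing the full expanded list rather than only its length. Below is a self-contained sketch using a trivial stand-in for the expansion (the real test would exercise the ShardExpander datapipe inside a `TestCase`):

```python
import re
import unittest

def expand(spec):
    """Trivial stand-in for ShardExpander's numeric expansion, kept
    self-contained for illustration (not the actual datapipe)."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", spec)
    lo, hi = m.group(1), m.group(2)
    return [spec[:m.start()] + str(i).zfill(len(lo)) + spec[m.end():]
            for i in range(int(lo), int(hi) + 1)]

class TestShardExpansion(unittest.TestCase):
    def test_expansion_content(self):
        output = expand("ds-{000000..000009}.tar")
        # Compare the complete expanded list, not just len(output) == 10,
        # so that garbage output cannot pass the test.
        expected = [f"ds-{i:06d}.tar" for i in range(10)]
        self.assertEqual(output, expected)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestShardExpansion)
unittest.TextTestRunner(verbosity=0).run(suite)
```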


tadejsv commented Aug 20, 2022

@tmbdev Do you still plan to work on this (and related) PRs?


tmbdev commented Aug 20, 2022

Yes, I'm planning on working on the handful of PRs we have been discussing and addressing the issues raised.


tmbdev commented Aug 31, 2022

I've updated the PR and I believe I have addressed all the comments.


tadejsv commented Sep 12, 2022

@VitalyFedyunin Mind taking a look at this PR?


NivekT commented Sep 13, 2022

> @VitalyFedyunin Mind taking a look at this PR?

We'll have a look tomorrow or Wednesday. Thanks!

@NivekT (Contributor) left a comment

Overall, it looks good to me! Just small changes (and rebase); after that I think we can merge.

Thank you so much for your contribution!

cc: @VitalyFedyunin

@VitalyFedyunin

LGTM with @NivekT's suggestions applied.

@facebook-github-bot

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT (Contributor) left a comment

@tmbdev LGTM! Thank you so much! I will handle landing this.

Let me know if you have the chance to revisit #402 and #406.

