shuffle data from hdf5 datasets #1347
Other than that it looks good to me! Cool!
I changed some parts.
Could you also run a speed benchmark to see how shuffling affects typical read speed? Random reads used to cause a lot of trouble with leveldb. Large-scale datasets usually don't need shuffling that much, so if speed is a concern, it might be better to keep sequential reads. (Since shuffling is turned off by default, I think having the capability is good.)
@jeffdonahue can you review and merge if this looks good to you?
Please clarify this comment to explain that the HDF5 files themselves are shuffled but the order within any given file is fixed.
Everything will be shuffled: hdf5 files and entries in these hdf5 files.
Thanks, I see that now, my bad. I think it should still be clarified though -- it's not actually a full shuffle of the dataset (i.e., some orderings of the dataset are impossible to obtain) unless you only have a single HDF5 file (or each HDF5 file only has a single entry).
Replaced by #2118.
When the shuffle flag is set in the hdf5data layer, both the order in which the HDF5 files are read and the order of the entries within each HDF5 file are shuffled.
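
To illustrate the point raised in review (shuffling the files and then the entries within each file is not a full shuffle of the dataset), here is a minimal Python sketch. It is not the Caffe C++ implementation: the function names are hypothetical, and each "file" is modelled as a plain list of entries rather than a real HDF5 file.

```python
import random


def two_level_shuffle(files, rng):
    """Shuffle the file order, then shuffle the entries inside each file.

    `files` is a list of lists; each inner list stands in for the entries
    of one HDF5 file (an assumption made for illustration only).
    """
    file_order = list(range(len(files)))
    rng.shuffle(file_order)
    result = []
    for idx in file_order:
        entries = list(files[idx])  # copy so the input is left untouched
        rng.shuffle(entries)
        result.extend(entries)
    return result


def is_contiguous_by_file(order, files):
    """Return True if the entries of every file form one unbroken run."""
    owner = {entry: i for i, f in enumerate(files) for entry in f}
    runs = []
    for entry in order:
        if not runs or runs[-1] != owner[entry]:
            runs.append(owner[entry])
    # A file index appearing twice in `runs` means its entries were split.
    return len(runs) == len(set(runs))


files = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
rng = random.Random(0)
orders = {tuple(two_level_shuffle(files, rng)) for _ in range(2000)}

# Reachable orderings are bounded by 2! * 3! * 3! = 72,
# whereas a full shuffle of 6 entries has 6! = 720 orderings.
print(len(orders))
print(all(is_contiguous_by_file(o, files) for o in orders))
```

Entries from the same file always stay adjacent, so an interleaved ordering such as a0, b0, a1, b1, a2, b2 can never be produced. The two schemes coincide only when there is a single file or when every file holds a single entry.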