Support creating a DatasetPipeline windowed by bytes #22577
Merged
ericl merged 7 commits into ray-project:master on Feb 26, 2022
jjyao reviewed on Feb 25, 2022
increases the latency to initial output, since it decreases the
length of the pipeline. Setting this to infinity effectively
disables pipelining.
bytes_per_window: Specify the window size in bytes instead of blocks.
Would it be too much to split a single block so that each window matches bytes_per_window more closely?
Author
I think it's a little complex (and maybe not needed once we have block splitting).
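The trade-off discussed in this thread can be illustrated with a minimal sketch. It assumes block sizes are already known and packs consecutive blocks into windows without ever splitting a block, so a window can fall short of the byte budget and a single oversized block still gets a window of its own. The function name `group_blocks_by_bytes` and its signature are hypothetical, written for illustration only; this is not the code from the PR.

```python
from typing import List


def group_blocks_by_bytes(block_sizes: List[int], bytes_per_window: int) -> List[List[int]]:
    """Greedily pack consecutive block indices into windows.

    Hypothetical helper for illustration; not Ray's implementation.
    A window is closed as soon as adding the next block would exceed
    bytes_per_window. Blocks are never split, so a block larger than
    the budget ends up alone in its own window.
    """
    if bytes_per_window <= 0:
        raise ValueError("bytes_per_window must be positive")

    windows: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(block_sizes):
        # Close the current window before it would overshoot the budget.
        if current and current_bytes + size > bytes_per_window:
            windows.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        windows.append(current)
    return windows


if __name__ == "__main__":
    # Block 3 (150 bytes) exceeds the 100-byte budget and is kept whole.
    print(group_blocks_by_bytes([40, 40, 40, 150, 10], 100))
```

Because whole blocks are the unit of packing, window sizes only approximate the requested byte count; splitting blocks would tighten that at the cost of the extra complexity mentioned in the reply above.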
python/ray/data/dataset.py
Outdated
self._splits = blocks.split(split_size=blocks_per_window)
sizes = [s.size_bytes() for s in self._splits]

def fmt(size_bytes):
Should this handle the unknown-size case, or add an assert that the size is known? Also, when can the size actually be unknown?
Author
Done. Currently it should always be known.
ericl (Author) commented on Feb 26, 2022
Comments addressed.
jjyao approved these changes on Feb 26, 2022
else:
    self._splits = blocks.split(split_size=blocks_per_window)
    try:
        sizes = [s.size_bytes() for s in self._splits]
Doesn't size_bytes() return -1 when the size is unknown, rather than raising an exception? Maybe just assert min(sizes) >= 0?
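A minimal sketch of the check suggested here, assuming (as the comment does) that an unknown size is reported as the sentinel -1 rather than through an exception. The helper name `check_sizes_known` is hypothetical and only illustrates the assertion; an empty list is handled separately because `min()` raises on an empty sequence.

```python
from typing import Iterable, List


def check_sizes_known(sizes: Iterable[int]) -> List[int]:
    """Return the sizes as a list, asserting that none are unknown.

    Hypothetical helper for illustration. Assumes an unknown size is
    encoded as a negative sentinel such as -1.
    """
    sizes = list(sizes)
    # min() fails on an empty list, so only check when there is data.
    if sizes:
        assert min(sizes) >= 0, "at least one split has an unknown size"
    return sizes


if __name__ == "__main__":
    print(check_sizes_known([1024, 0, 2048]))
```

A try/except around the size computation would not catch a sentinel value, which is why an explicit check on the computed sizes is the more direct guard under this assumption.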
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request on Feb 27, 2022
rkooo567 added a commit to rkooo567/ray that referenced this pull request on Feb 28, 2022
…ject#22577)" This reverts commit b5b4460.
ericl pushed a commit that referenced this pull request on Feb 28, 2022
ericl added a commit to ericl/ray that referenced this pull request on Feb 28, 2022
…ay-project#22577)" (ray-project#22695)" This reverts commit ba4f142.
Why are these changes needed?
This adds the ability to create a pipeline windowed by bytes, which simplifies many user calculations compared to creating it by blocks.
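As an illustration of the calculation this change spares users, here is a rough sketch of what it takes to hit a target window size when only a block count can be specified. The function `blocks_per_window_for_target` and the example numbers are hypothetical; it assumes blocks are roughly equal in size, which real datasets may not satisfy.

```python
def blocks_per_window_for_target(total_bytes: int, num_blocks: int, target_window_bytes: int) -> int:
    """Estimate a block count that approximates a target window size in bytes.

    Hypothetical helper for illustration. Assumes blocks are roughly
    uniform, so average block size = total_bytes / num_blocks.
    """
    if total_bytes <= 0 or num_blocks <= 0:
        raise ValueError("total_bytes and num_blocks must be positive")
    # target / (total / num_blocks), kept in integer arithmetic to avoid
    # floating-point rounding; always use at least one block per window.
    return max(1, target_window_bytes * num_blocks // total_bytes)


if __name__ == "__main__":
    GiB = 1024 ** 3
    # A 10 GiB dataset in 200 blocks, aiming for windows of about 1 GiB.
    print(blocks_per_window_for_target(10 * GiB, 200, 1 * GiB))
```

With a byte-based window size, the target (1 GiB in this example) is given directly and no estimate of the average block size is needed.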
Related issue number
Closes #18100