
Support creating a DatasetPipeline windowed by bytes #22577

Merged
ericl merged 7 commits into ray-project:master from ericl:window-bytes
Feb 26, 2022

Conversation


@ericl ericl commented Feb 23, 2022

Why are these changes needed?

This adds the ability to create a DatasetPipeline windowed by bytes, which simplifies sizing calculations for users compared to windowing by a number of blocks.

Related issue number

Closes #18100
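At its core, windowing by bytes means grouping blocks so that each window's total size approaches the byte budget. A minimal sketch of that greedy grouping over a list of block sizes (the function name and shape are hypothetical stand-ins, not Ray's internal code):

```python
def split_by_bytes(block_sizes, bytes_per_window):
    """Greedily group blocks into windows whose total size stays
    near bytes_per_window (hypothetical sketch, not Ray's code).

    A block is never split: an oversized block becomes its own window.
    """
    windows, current, current_bytes = [], [], 0
    for size in block_sizes:
        # Close the current window if adding this block would overflow it.
        if current and current_bytes + size > bytes_per_window:
            windows.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        windows.append(current)
    return windows

# Blocks of 40 MiB each with a 100 MiB budget -> windows of 2, 2, and 1 blocks.
mb = 1024 * 1024
print(split_by_bytes([40 * mb] * 5, 100 * mb))
```

Note that, per the review discussion below, a single block larger than the budget is not split; it simply forms a window on its own.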

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 23, 2022
@ericl ericl changed the title from "[WIP] Support windowing by bytes" to "Support creating a DatasetPipeline windowed by bytes" Feb 25, 2022
@ericl ericl added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Feb 25, 2022
increases the latency to initial output, since it decreases the
length of the pipeline. Setting this to infinity effectively
disables pipelining.
bytes_per_window: Specify the window size in bytes instead of blocks.
Contributor

Would it be too much to split a single block so that bytes_per_window is honored more precisely?

Contributor Author

I think it's a little complex (and maybe not needed once we have block splitting).

self._splits = blocks.split(split_size=blocks_per_window)
sizes = [s.size_bytes() for s in self._splits]

def fmt(size_bytes):
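The hunk above shows only the signature of `fmt`. A plausible human-readable byte formatter (a hypothetical sketch; the PR's actual body may differ):

```python
def fmt(size_bytes):
    # Render a byte count with a binary-unit suffix (hypothetical sketch,
    # not necessarily the PR's implementation).
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if abs(size_bytes) < 1024 or unit == "TiB":
            return f"{size_bytes:.1f}{unit}"
        size_bytes /= 1024

print(fmt(1536))  # -> "1.5KiB"
```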
Contributor

Should we handle the size-unknown case, or add an assert to check that the size must be known? Actually, when might it be unknown?

Contributor Author

Done. Currently it should always be known.

@jjyao jjyao added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 25, 2022
@ericl ericl removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 26, 2022
Contributor Author

@ericl ericl left a comment

Comments addressed.

@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 26, 2022
else:
self._splits = blocks.split(split_size=blocks_per_window)
try:
sizes = [s.size_bytes() for s in self._splits]
Contributor

Doesn't size_bytes() return -1 if the size is unknown, rather than raising an exception? Maybe just assert min(sizes) >= 0?
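Under the assumption that size_bytes() reports -1 for an unknown size (rather than raising), the suggested guard could be sketched like this; `FakeSplit` is a hypothetical stand-in for Ray's split handles:

```python
def window_sizes(splits):
    """Collect per-split sizes, asserting every size is known.

    Sketch of the check suggested above: size_bytes() == -1 is
    assumed to mean "unknown", so any negative size is an error.
    """
    sizes = [s.size_bytes() for s in splits]
    assert min(sizes) >= 0, f"unexpected unknown block size: {sizes}"
    return sizes

class FakeSplit:
    """Minimal stand-in exposing only size_bytes()."""
    def __init__(self, n):
        self._n = n
    def size_bytes(self):
        return self._n

print(window_sizes([FakeSplit(10), FakeSplit(20)]))  # -> [10, 20]
```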

@ericl ericl merged commit b5b4460 into ray-project:master Feb 26, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
rkooo567 added a commit to rkooo567/ray that referenced this pull request Feb 28, 2022
ericl pushed a commit that referenced this pull request Feb 28, 2022
ericl added a commit to ericl/ray that referenced this pull request Feb 28, 2022


Successfully merging this pull request may close these issues.

[datasets] Support advanced windowing (e.g., by bytes)

4 participants