[Datasets] Support out-of-band serialization. #22616
Closed
clarkzinzow wants to merge 1 commit into ray-project:master from
clarkzinzow
commented
Feb 24, 2022
a76612e to
8387d7d
Compare
ericl
requested changes
Feb 24, 2022
jjyao
reviewed
Feb 24, 2022
40284c7 to
d801b00
Compare
ericl
requested changes
Mar 1, 2022
Contributor
ericl
left a comment
A few things to simplify the logic here:
- Can we avoid the "shoe-horning" of read tasks into a block list? The plan object can have an explicit list of read tasks we extract from the LazyBlockList. This avoids needing to change the block list classes.
- Can we avoid having an "index" of completed vs not? It would be clearer to instead split the stages into "prev_stages" and "stages".
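One way to picture the two suggestions above is the minimal sketch below. It is an illustration only, not the Ray implementation: `PlanSketch`, `with_stage`, `execute`, and the `Block`/`ReadTask`/`Stage` aliases are hypothetical names; only `prev_stages` and `stages` come from the review comment. The plan keeps its own explicit list of read tasks, and already-executed stages move from `stages` to `prev_stages` instead of being tracked by an index.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-ins for illustration; none of these are Ray types.
Block = List[int]
ReadTask = Callable[[], Block]
Stage = Callable[[Block], Block]


@dataclass
class PlanSketch:
    # Explicit list of read tasks owned by the plan (no block-list changes).
    read_tasks: List[ReadTask]
    # Stages that have already run, and stages still pending.
    prev_stages: List[Stage] = field(default_factory=list)
    stages: List[Stage] = field(default_factory=list)
    # Blocks produced by the read tasks plus every stage in prev_stages.
    snapshot: Optional[List[Block]] = None

    def with_stage(self, stage: Stage) -> "PlanSketch":
        """Return a new plan with one more pending stage."""
        return PlanSketch(
            self.read_tasks,
            list(self.prev_stages),
            list(self.stages) + [stage],
            self.snapshot,
        )

    def execute(self) -> List[Block]:
        """Run the read tasks if needed, then only the pending stages."""
        if self.snapshot is None:
            blocks = [task() for task in self.read_tasks]
        else:
            blocks = self.snapshot
        for stage in self.stages:
            blocks = [stage(block) for block in blocks]
        # Everything pending has now run: move it to prev_stages.
        self.prev_stages = self.prev_stages + self.stages
        self.stages = []
        self.snapshot = blocks
        return blocks


plan = PlanSketch(read_tasks=[lambda: [1, 2], lambda: [3]])
plan = plan.with_stage(lambda block: [x * 2 for x in block])
first_result = plan.execute()

# A later stage reuses the snapshot and runs only the new stage.
plan2 = plan.with_stage(lambda block: [x + 1 for x in block])
second_result = plan2.execute()
```

Because the full history is `read_tasks + prev_stages + stages`, the lineage stays available for rewriting or serialization even after execution.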
Contributor
Ping @clarkzinzow, any update on this PR?
Contributor
Author
@scv119 I'm actively working on it and on the integration with Xiaowei. I ran into some complications with the suggested refactor and am working on a solution that doesn't increase the scope of the PR. We should merge this by EOD Monday to make sure AIR is unblocked.
2b1774f to
dfa88f1
Compare
bad315d to
6b9398f
Compare
813d7c5 to
7368d13
Compare
clarkzinzow
commented
Mar 18, 2022
7368d13 to
2a68795
Compare
ericl
requested changes
Mar 19, 2022
Contributor
ericl
left a comment
Can you split this up into smaller PRs?
2a68795 to
2071303
Compare
2071303 to
7576de4
Compare
5d58b71 to
cbeea16
Compare
ericl
pushed a commit
that referenced
this pull request
Apr 14, 2022
…#23821)

This PR refactors `LazyBlockList` in service of out-of-band serialization (see [mono-PR](#22616)) and is a precursor to an execution plan refactor (PR #2) and adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:

1. `ReadTask`s are now a first-class concept, replacing calls;
2. read stage progress tracking is consolidated into `LazyBlockList._get_blocks_with_metadata()` and more of the read task complexity, e.g. the read remote function, was pushed into `LazyBlockList` to make `ray.data.read_datasource()` simpler;
3. we are a bit smarter with how we progressively launch tasks and fetch and cache metadata, including fetching the metadata for read tasks in `.iter_blocks_with_metadata()` instead of relying on the pre-read task metadata (which will be less accurate), and we also fix some small bugs in the lazy ramp-up around progressive metadata fetching.

(1) is the most important item for supporting out-of-band serialization and fundamentally changes the `LazyBlockList` data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the [mono-PR](#22616) for more context.

Other changes:

1. Changed stats actor to a global named actor singleton in order to obviate the need for serializing the actor handle with the Dataset stats; without this, we were encountering serialization failures.
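As a rough illustration of items (1) and (3) in the commit message above, the sketch below keeps the read tasks themselves in the list, runs them only on demand, and replaces the pre-read metadata estimate with metadata taken from the block once it has been read. This is a plain-Python toy under stated assumptions, not Ray code: `LazyBlockListSketch`, `num_executed`, `get_metadata`, `make_range_task`, and the `num_rows` metadata key are hypothetical, and real read tasks would run as remote tasks rather than local calls.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-ins for illustration; none of these are Ray types.
Block = List[int]
Metadata = Dict[str, int]
ReadTask = Callable[[], Block]


class LazyBlockListSketch:
    """Holds read tasks as first-class objects and executes them lazily."""

    def __init__(self, read_tasks: List[ReadTask], estimated: List[Metadata]):
        assert len(read_tasks) == len(estimated)
        self._read_tasks = list(read_tasks)
        self._estimated = list(estimated)  # pre-read (less accurate) metadata
        self._cache: Dict[int, Tuple[Block, Metadata]] = {}

    def num_executed(self) -> int:
        return len(self._cache)

    def _get_or_compute(self, i: int) -> Tuple[Block, Metadata]:
        # Run the read task at most once and cache block plus real metadata.
        if i not in self._cache:
            block = self._read_tasks[i]()
            self._cache[i] = (block, {"num_rows": len(block)})
        return self._cache[i]

    def iter_blocks_with_metadata(self) -> Iterator[Tuple[Block, Metadata]]:
        # Tasks are launched progressively as the consumer advances.
        for i in range(len(self._read_tasks)):
            yield self._get_or_compute(i)

    def get_metadata(self) -> List[Metadata]:
        # Prefer post-read metadata; fall back to the pre-read estimate.
        return [
            self._cache[i][1] if i in self._cache else self._estimated[i]
            for i in range(len(self._read_tasks))
        ]


calls: List[Tuple[int, int]] = []


def make_range_task(lo: int, hi: int) -> ReadTask:
    def task() -> Block:
        calls.append((lo, hi))  # record each actual read
        return list(range(lo, hi))
    return task


lazy = LazyBlockListSketch(
    [make_range_task(0, 3), make_range_task(3, 5)],
    [{"num_rows": 2}, {"num_rows": 2}],
)
first_block, first_meta = next(lazy.iter_blocks_with_metadata())
metadata_after_one = lazy.get_metadata()
all_blocks = [block for block, _ in lazy.iter_blocks_with_metadata()]
```

Since the tasks are never discarded, anything that needs the lineage (stage rewriting, serialization) can still reach them after the blocks have been computed.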
cbeea16 to
fe61ff6
Compare
Contributor
Author
Superseded by stacked PRs; support was added in #23932. Closing!
This PR adds support for out-of-band serialization of datasets, which is required for tuning a training dataset hyperparameter with cross-cluster stopping and resuming of experiments.
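To make the idea concrete, here is a minimal sketch of serializing a dataset's lineage (how to recreate it) rather than its blocks, so that the payload can be stored outside the cluster and replayed elsewhere. It is an assumption-laden toy, not the API added by this work: `serialize_lineage`, `deserialize_and_execute`, `STAGE_REGISTRY`, and the JSON layout are hypothetical, and reads are modeled as integer ranges.

```python
import json
from typing import Callable, Dict, List

Block = List[int]

# Hypothetical registry: stages are referenced by name so that the
# serialized payload contains no object references or cluster state.
STAGE_REGISTRY: Dict[str, Callable[[Block], Block]] = {
    "double": lambda block: [x * 2 for x in block],
    "drop_odd": lambda block: [x for x in block if x % 2 == 0],
}


def serialize_lineage(read_ranges: List[List[int]], stage_names: List[str]) -> bytes:
    """Serialize the recipe (read descriptions plus stage names), not the data."""
    for name in stage_names:
        if name not in STAGE_REGISTRY:
            raise KeyError(f"unknown stage: {name}")
    lineage = {"reads": read_ranges, "stages": stage_names}
    return json.dumps(lineage).encode("utf-8")


def deserialize_and_execute(payload: bytes) -> List[Block]:
    """Rebuild the blocks from scratch, e.g. on a different cluster."""
    lineage = json.loads(payload.decode("utf-8"))
    blocks = [list(range(lo, hi)) for lo, hi in lineage["reads"]]
    for name in lineage["stages"]:
        blocks = [STAGE_REGISTRY[name](block) for block in blocks]
    return blocks


payload = serialize_lineage([[0, 3], [3, 6]], ["double", "drop_odd"])
restored = deserialize_and_execute(payload)
```

The payload stays small and self-describing because it never embeds computed blocks, which is why the read tasks must remain reachable from the dataset.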
In the process of adding this feature, a refactor of the execution plan and `LazyBlockList` seemed prudent to meet the following set of requirements:
while adhering to the following constraints:
`ray.put()`) read tasks into a `BlockList` is untenable.
Solution
In addition to adding out-of-band serialization support, this PR:
`ReadTask`s a first-class concept in `LazyBlockList`.
`LazyBlockList` ramp-up, including around progressive schema/metadata fetching.
TODO
`ExecutionPlan` and `LazyBlockList`.
Closes #22778
Checks
I've run `scripts/format.sh` to lint the changes in this PR.