[Data] Support subprogress bars on `AllToAllOperator`s with optimizer enabled by scottjlee · Pull Request #34997 · ray-project/ray

scottjlee · 2023-05-03T07:06:59Z

Why are these changes needed?

Currently, subprogress bars are not correctly rendered and updated with AllToAllOperators when the optimizer is enabled. This PR adds the subprogress bars for the different AllToAll LogicalOperators, such as RandomShuffle, Sort, and Repartition.

The original intent of this PR was to separate out the sub_progress_bars_dict from TaskContext, but we found that this was difficult and would require significant reworking, because the sub-progress bars need to be initialized prior to being passed to the scheduler for execution.
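The ordering constraint described above can be sketched as follows. This is only an illustration under stated assumptions, not Ray's implementation: `SimpleBar`, `init_sub_bars`, and `run_shuffle` are hypothetical names, and the `TaskContext` here is a toy dataclass that merely mimics the idea of carrying a dictionary of pre-built bars into the scheduler.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class SimpleBar:
    """Hypothetical stand-in for a progress bar; it only counts updates."""

    def __init__(self, name: str, total: int) -> None:
        self.name = name
        self.total = total
        self.done = 0

    def update(self, n: int = 1) -> None:
        # Clamp so the bar never reports more than its total.
        self.done = min(self.total, self.done + n)

    def render(self) -> str:
        pct = 100 * self.done // self.total if self.total else 100
        return f"  *- {self.name}: {pct}% | {self.done}/{self.total}"


@dataclass
class TaskContext:
    """Toy context (assumption): carries the pre-built bars to the scheduler."""

    task_idx: int
    sub_progress_bar_dict: Optional[Dict[str, SimpleBar]] = None


def init_sub_bars(names: List[str], total: int) -> Dict[str, SimpleBar]:
    # The bars are created up front, before any scheduling happens.
    return {name: SimpleBar(name, total) for name in names}


def run_shuffle(ctx: TaskContext, num_blocks: int) -> None:
    # The scheduler only looks bars up by name; it does not create them.
    bars = ctx.sub_progress_bar_dict or {}
    for stage in ("Shuffle Map", "Shuffle Reduce"):
        bar = bars.get(stage)
        for _ in range(num_blocks):
            if bar is not None:
                bar.update()


ctx = TaskContext(
    task_idx=0,
    sub_progress_bar_dict=init_sub_bars(["Shuffle Map", "Shuffle Reduce"], total=4),
)
run_shuffle(ctx, num_blocks=4)
lines = [bar.render() for bar in ctx.sub_progress_bar_dict.values()]
```

Because the owner of the context creates the bars, it can also decide their display order and when to close them, while the scheduler only advances them.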

Tested with the following code to observe the output subprogress bars (with all combinations of push/pull based shuffle, and no-shuffle for sort):

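import ray
import time


def sleep(x):
    # Slow each batch down so the progress bars stay visible.
    time.sleep(0.1)
    return x


ctx = ray.data.DataContext.get_current()
ctx.optimizer_enabled = False
# Toggle between runs to cover push-based and pull-based shuffle.
ctx.use_push_based_shuffle = True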
for _ in (
    ray.data.range(1000 * 1000, parallelism=200)
    .map_batches(sleep, num_cpus=2)
    #.map_batches(sleep, compute=ray.data.ActorPoolStrategy(min_size=2, max_size=4))
    .random_shuffle() # -> tested pushbased=False; pushbased=True has duplicated issue
    #.sort("id") # -> tested pushbased=True+False
    #.repartition(400, shuffle=True) # -> tested shuffle=False, pushbased=True+False
    #.map_batches(sleep, num_cpus=2)
    .iter_batches()
):
    pass

random_shuffle():

Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 7.75 MiB/512.0 MiB object_store_memory:  32%|████████████████████▍                                            | 63/200 [00:24<00:37,  3.65it/s]
- RandomShuffle: 0 active, 0 queued, 0.0 MiB objects, 0 output: 100%|████████████████████████████████████████████████████████████████████████████| 200/200 [00:24<00:00, 24.36s/it]
  *- Shuffle Map:  54%|█████████████████████████████████████████████████████████████████▉                                                        | 108/200 [00:23<00:14,  6.55it/s]
  *- Shuffle Reduce:  98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████   | 195/200 [00:24<00:00, 11.48it/s]

sort():

Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 7.75 MiB/512.0 MiB object_store_memory:   0%|▎                                                               | 1/200 [00:24<1:20:59, 24.42s/it]
- Sort: 0 active, 0 queued, 0.0 MiB objects, 0 output: 100%|█████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:24<00:00, 24.41s/it]
  *- Sort Sample: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:23<00:00,  8.61it/s]
  *- Shuffle Map:  98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 197/200 [00:23<00:00, 11.76it/s]
  *- Shuffle Reduce: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:24<00:00, 11.67it/s]

repartition():

Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.04 MiB/512.0 MiB object_store_memory:   0%|                                                                          | 0/400 [00:23<?, ?it/s]
- Repartition: 0 active, 0 queued, 0.0 MiB objects, 0 output:   0%|                                                                                        | 0/400 [00:00<?, ?it/s]
  *- Shuffle Map:  100%|████████████████████████████████████████████████████████████████████████████                                                                                      | 400/400 [00:23<00:40,  7.04it/s]
  *- 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                      | 392/400 [00:24<2:45:10, 24.84s/it]

Related issue number

Closes #33374

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen · 2023-05-31T18:49:11Z

python/ray/data/_internal/planner/exchange/pull_based_shuffle_task_scheduler.py

+            map_bar = sub_progress_bar_dict[bar_name]
+            should_close_bar = False
+        else:
+            map_bar = ProgressBar(bar_name, position=0, total=input_num_blocks)


Is it possible to get rid of these if-else branches? They are a little bit ugly. In which cases is sub_progress_bar_dict None? It would be great to unify the code logic.
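One way to fold the branch into a single code path is a small lookup helper. The sketch below is an assumption for illustration only: `get_or_create_bar` is a hypothetical function, the `ProgressBar` class is a stub rather than Ray's class, and the condition guarding the lookup is guessed, since the diff excerpt does not show it.

```python
from typing import Dict, Optional, Tuple


class ProgressBar:
    """Stub (assumption) that mirrors only the constructor call in the diff."""

    def __init__(self, name: str, position: int, total: int) -> None:
        self.name = name
        self.position = position
        self.total = total
        self.closed = False

    def close(self) -> None:
        self.closed = True


def get_or_create_bar(
    sub_progress_bar_dict: Optional[Dict[str, ProgressBar]],
    bar_name: str,
    total: int,
) -> Tuple[ProgressBar, bool]:
    """Return (bar, should_close_bar).

    A bar taken from the dict is owned by the caller, so it is not closed
    here; a bar created on the spot is owned by this scheduler, so it is.
    """
    if sub_progress_bar_dict is not None and bar_name in sub_progress_bar_dict:
        return sub_progress_bar_dict[bar_name], False
    return ProgressBar(bar_name, position=0, total=total), True


shared = ProgressBar("Shuffle Map", position=1, total=10)
map_bar, should_close_bar = get_or_create_bar({"Shuffle Map": shared}, "Shuffle Map", 10)
fallback_bar, fallback_close = get_or_create_bar(None, "Shuffle Map", 10)
if fallback_close:
    fallback_bar.close()
```

With a helper like this, each scheduler would call one function per stage instead of repeating the branch for the map and reduce bars.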

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee · 2023-06-02T02:02:16Z

The other failing tests are unrelated to this PR; according to flakey-tests, they were already failing on previous commits to master.

… enabled (ray-project#34997)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Scott Lee added 7 commits May 3, 2023 00:05

move subprogress bar from TaskContext to OpState

c2d0e7d

Signed-off-by: Scott Lee <sjl@anyscale.com>

wip with named subprogress bars but disappears after closing

2437a77

Signed-off-by: Scott Lee <sjl@anyscale.com>

wip testing

ebf918b

Signed-off-by: Scott Lee <sjl@anyscale.com>

progress

ba4b5c1

Signed-off-by: Scott Lee <sjl@anyscale.com>

progress

01b623d

Signed-off-by: Scott Lee <sjl@anyscale.com>

update all alltoall ops

caffeaf

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into a2a-subprog

9330d5b

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review May 13, 2023 01:03

scottjlee requested review from amogkam, bveeramani, c21, ericl, raulchen and scv119 as code owners May 13, 2023 01:03

Scott Lee added 2 commits May 12, 2023 18:38

Merge branch 'master' into a2a-subprog

f277e3b

Signed-off-by: Scott Lee <sjl@anyscale.com>

lint

599fce0

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee assigned raulchen and c21 May 13, 2023

raulchen reviewed May 31, 2023

Scott Lee added 5 commits May 31, 2023 18:56

Merge branch 'master' into a2a-subprog

c0f771c

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into a2a-subprog

67bfa49

Signed-off-by: Scott Lee <sjl@anyscale.com>

simplify case where progress bars are not initialized

dcf25bb

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into a2a-subprog

49e643d

Signed-off-by: Scott Lee <sjl@anyscale.com>

clean up

40c475f

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen approved these changes Jun 2, 2023

scottjlee added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Jun 2, 2023

scottjlee assigned amogkam and unassigned amogkam Jun 2, 2023

raulchen merged commit 3d1f6a9 into ray-project:master Jun 2, 2023

kyuds mentioned this pull request Sep 29, 2025

[data] Improve execution progress rendering #56992

Merged

8 tasks

raulchen merged 14 commits into ray-project:master from scottjlee:a2a-subprog

4 participants
