
Enable P2P shuffling by default#9991

Merged
fjetter merged 9 commits into dask:main from fjetter:p2p_by_default
Feb 24, 2023

Conversation

@fjetter (Member) commented Feb 22, 2023

This would enable P2P shuffling by default for most shuffle-based dataframe workloads (set_index, groupby with split_out, etc.), if and only if pyarrow is installed.

There are still a couple of cases that are hard-coded to tasks because they went through an elaborate evaluation, and I don't feel entirely comfortable toggling them right now without testing at least once. See #9826 for some discussion.

We haven't received a lot of feedback yet (see also dask/distributed#7509), but we also haven't encountered any critical issues during internal validation, so I suggest we flip the switch unless there are major objections (e.g. CI complains very hard).
The discussion issue also outlines a couple of pros and cons of moving to this new algorithm, but I believe the benefits outweigh the costs for almost all users.

dask/utils.py Outdated
Comment on lines +2095 to +2099
try:
    import pyarrow  # noqa

    return "p2p"
except ImportError:
    return "tasks"
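For context, the fallback logic in this excerpt can be sketched as a standalone helper (the function name here is illustrative, not dask's actual API):

```python
import importlib


def default_shuffle_method() -> str:
    # Prefer the P2P shuffle when pyarrow is importable, since P2P
    # relies on pyarrow; otherwise fall back to the task-based shuffle.
    try:
        importlib.import_module("pyarrow")
    except ImportError:
        return "tasks"
    return "p2p"
```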
Member Author

We could add a warning/info log suggesting to install pyarrow if we hit this exception.
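A minimal sketch of that suggestion (the logger name and message wording are hypothetical, not what dask actually does):

```python
import importlib
import logging

logger = logging.getLogger("dask.shuffle")  # hypothetical logger name


def default_shuffle_method() -> str:
    try:
        importlib.import_module("pyarrow")
        return "p2p"
    except ImportError:
        # Hint to the user that installing pyarrow unlocks P2P
        logger.info(
            "Falling back to the 'tasks' shuffle; install pyarrow "
            "to enable the 'p2p' shuffle."
        )
        return "tasks"
```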

Member

Should we add a p2p extras target that requires distributed and a suitable pyarrow version?
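Such an extras target might look roughly like this in setup.py (a sketch; the pyarrow version pin is a placeholder, not something settled in this thread):

```python
# Hypothetical extras_require entry; the minimum pyarrow version
# here is a placeholder, not a decided pin.
extras_require = {
    "p2p": ["distributed", "pyarrow>=7.0"],
}
```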

Member Author

A p2p extra feels way too specific. I'd rather add this to the dataframe target, but that'd obviously be a bigger change.

Member

@jrbourbeau left a comment

Thanks @fjetter! I've not thought too deeply about this yet, but will devote some cycles to it.

FWIW it looks like the dask/tests/test_distributed.py::test_fused_blockwise_dataframe_merge failure is genuinely related to the change in default

Also cc @quasiben who asked me about this a few days ago

@jrbourbeau
Member

We haven't received a lot of feedback (see also dask/distributed#7509)

@rjzamora I think you mentioned wanting to try P2P shuffling out at some point. If you have any feedback, it would certainly be welcome. @quasiben are there folks you're aware of that would be interested in trying P2P and sharing feedback on their experience?

@quasiben
Member

Yes, @wence- met up with @fjetter and others and is going to start experimenting shortly

@fjetter
Member Author

fjetter commented Feb 23, 2023

It seems we're installing a couple of rather old pyarrow versions on some jobs, e.g. python3.8 jobs are installing pyarrow 4. Not sure why that is since we're not pinning anything.

@fjetter fjetter marked this pull request as ready for review February 23, 2023 14:51
@fjetter
Member Author

fjetter commented Feb 23, 2023

Ok, looks like it's just the python 3.8 builds that are using this ancient pyarrow version. I haven't checked the entire matrix but py3.9+ seems to be using pyarrow 11.X so the tests are actually covering the p2p shuffle ;)

@hendrikmakait
Member

This failure is caused by shuffling and task fusion not playing together nicely at the moment, due to our reliance on the task name being unchanged. @fjetter: Could you re-run CI after the latest changes to dask/distributed#7578? The error should then pop up again.

@fjetter
Member Author

fjetter commented Feb 23, 2023

Since this affects merges, maybe this goes away with #9900 + dask/distributed#7514 ?

dask/utils.py Outdated
Comment on lines +2095 to +2096
# We might lose annotations if low level fusion is active
if not dask.config.get("optimization.fuse.active"):
Member Author

I added this guard to only toggle to p2p if low level fusion is disabled. We are relying on annotations and low level fusion apparently strips them sometimes.
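Combined with the pyarrow check, the guard can be sketched as follows (illustrative names, not dask's internal API):

```python
import importlib.util


def pick_shuffle_method(fuse_active: bool) -> str:
    # Low-level fusion may rename keys and strip the annotations that
    # the P2P shuffle relies on, so only pick "p2p" when fusion is off.
    if fuse_active:
        return "tasks"
    if importlib.util.find_spec("pyarrow") is not None:
        return "p2p"
    return "tasks"
```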

@fjetter
Member Author

fjetter commented Feb 24, 2023

Tests are looking good. Anybody brave enough to ✅ ? @mrocklin maybe? :)

@mrocklin
Member

I don't have enough hands-on experience with this to approve, I think, especially given that we're up against a release day. Unfortunately I'm going to take the cowardly way here and defer to you and @hendrikmakait, who have the appropriate context to make this decision.

I will say though that, if you both feel good about this, then I encourage you to move forward with it boldly.

Member

@hendrikmakait left a comment

Thanks @fjetter! The guards seem reasonable to me and should avoid user pains. Given the lack of negative feedback and the strictly positive results we've seen in testing, I'd say we should go ahead and roll this out. @wence-: Any chance you managed to break things in the meantime and would want to veto this?

@hendrikmakait
Member

As discussed offline, we should document the config value users can set to fall back globally, and rename it from a top-level shuffle to dataframe.shuffle (with a deprecation cycle).
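A deprecation cycle for such a rename could look roughly like this (a plain-dict sketch; dask's real config machinery handles renamed keys differently):

```python
import warnings


def get_shuffle_algorithm(config: dict):
    # Prefer the new nested key, falling back to the deprecated
    # top-level "shuffle" key with a warning.
    new = config.get("dataframe", {}).get("shuffle", {}).get("algorithm")
    if new is not None:
        return new
    old = config.get("shuffle")
    if old is not None:
        warnings.warn(
            "'shuffle' is deprecated; use 'dataframe.shuffle.algorithm'",
            FutureWarning,
        )
    return old
```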

@wence-
Contributor

wence- commented Feb 24, 2023

@wence-: Any chance you managed to break things in the meantime and would want to veto this?

I haven't had a chance yet (and for the cases I was going to try "tasks" is already broken, or at least doesn't complete). In dask-cudf we explicitly override the shuffle default, so I think I don't have any veto reasons.

shuffle-compression: null # compression for on-disk shuffling. Partd supports ZLib, BZ2, SNAPPY
shuffle:
  algorithm: null
  compression: null # compression for on-disk shuffling. Partd supports ZLib, BZ2, SNAPPY
Member

If it's only about disk-based shuffling, what about dataframe.shuffle.disk.compression?

Member Author

I could see this being used elsewhere, and I think it's fine for a specific algorithm to ignore such a value, so I'd suggest keeping it as is.

fjetter and others added 2 commits February 24, 2023 18:38
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
@fjetter fjetter mentioned this pull request Feb 24, 2023
@github-actions github-actions bot added the documentation Improve or add to documentation label Feb 24, 2023
Comment on lines +2095 to +2096
# We might lose annotations if low level fusion is active
if not dask.config.get("optimization.fuse.active"):
Member

I'm a bit confused about why this is needed (not necessarily saying it's wrong, just that I'm lacking context). @hendrikmakait you said

This failure is caused by shuffling and task fusion not playing together nicely at the moment due to our reliance on the task name being unchanged

Can you point me to where I can read more about this?

Member

Possibly related, are we accounting for dask.dataframe and dask.array treating this config option slightly differently? dask.array will perform low-level task fusion by default, while dask.dataframe won't.

dask.array

# Perform low-level fusion unless the user has
# specified False explicitly.
if config.get("optimization.fuse.active") is False:
    return dsk

dask.dataframe

# Do not perform low-level fusion unless the user has
# specified True explicitly. The configuration will
# be None by default.
if not config.get("optimization.fuse.active"):
    return dsk
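The asymmetry pointed out above becomes concrete with the default value of None: the two checks disagree. A small illustration:

```python
def array_fuses(active) -> bool:
    # dask.array: fuse unless the user set False explicitly,
    # so the default None still fuses.
    return active is not False


def dataframe_fuses(active) -> bool:
    # dask.dataframe: fuse only if the user set a truthy value
    # explicitly, so the default None does not fuse.
    return bool(active)
```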

Sorry for being late to the party, I'm still catching up from some recent PTO

Member Author

I'm a bit confused about why this is needed (not necessarily saying it's wrong, just that I'm lacking context).

Task fusion both changes the key name and drops annotations; we currently rely on both. If anybody toggles fusion on, we cannot use P2P.

Possibly related, are we accounting for dask.dataframe and dask.array treating this config option slightly differently?

This code is only relevant for dask.dataframe

Labels: dataframe, documentation