Begin experimenting with parallel prefix scan for cumsum and cumprod #6675
TomAugspurger merged 4 commits into dask:master
Conversation
…in dask.array

This is a WIP and needs to be benchmarked. I think it's interesting, though, and want to share. It's been a while since I've worked on dask.array, so feedback is most welcome.

This is a work-efficient parallel prefix scan. It uses a Brent-Kung construction and is known as the Blelloch algorithm. We adapt it to work on chunks. Previously, a cumsum across N chunks required N levels of dependencies. This PR takes approximately 2 * lg(N) levels of dependencies. It exposes parallelism. It is work-efficient and only requires about a third more tasks than the previous method. Scans on floating point values should also be more accurate.

A parallel cumsum works by first taking the sum of each block, then doing a binary tree merge followed by a fan-out (i.e., the Brent-Kung pattern). We then take the cumsum of each block and add the sum of the previous blocks.

NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod significantly faster. This is why I think this approach will be faster. Exposing parallelism and an efficient communication pattern is another reason I think this should be faster (especially when communication costs are significant). I also think this will be an interesting test for `dask.order` and the scheduler.

Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)? Does the answer to this depend on benchmarks?

Benchmarks and graph diagrams are forthcoming :)
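The block-level approach described above can be sketched in plain NumPy. This is a serial illustration of the three phases, not the PR's actual task-graph code; in dask each step would become an independent task, and phase 2 would be the Brent-Kung tree merge plus fan-out rather than a serial scan:

```python
import numpy as np

def chunked_cumsum(chunks):
    """Cumulative sum across a list of 1-D chunks, in three phases:
    per-chunk sums, an exclusive scan of those sums, then a per-chunk
    cumsum shifted by the running offset."""
    # Phase 1: reduce each chunk to its sum (cheap; parallel across chunks).
    block_sums = [np.sum(c) for c in chunks]
    # Phase 2: exclusive scan of the block sums. In the PR this is the
    # Brent-Kung tree merge + fan-out; a serial scan gives the same values.
    offsets = np.concatenate([[0], np.cumsum(block_sums)[:-1]])
    # Phase 3: cumsum within each chunk, plus the total of preceding chunks.
    return [np.cumsum(c) + off for c, off in zip(chunks, offsets)]

chunks = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
out = np.concatenate(chunked_cumsum(chunks))
# matches np.cumsum on the concatenated array: [1, 3, 6, 10, 15, 21]
```

The key observation from the PR description applies here: phase 1 uses the fast `sum` per block, and the relatively expensive `cumsum` only runs once per block in phase 3.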
This is the failing test. It assumes the old cumsum behavior and checks the ordering:
https://github.com/eriknw/dask/blob/prefix_scan/dask/tests/test_order.py#L252
Perhaps we should allow the user to choose which method to use regardless of benchmark results. This will let the above test pass, and it provides a fail-safe in case I introduce a bug.
If benchmarks show that this is reliably faster (and also not significantly
more memory hungry) then I would be comfortable making it default.
Hi @eriknw, this is very interesting work. I haven't had time to understand this PR completely, but it seems like a practical application of what I learned in my GPU Computing class. One thing I want to ask, though: although I understand why the work-efficient Blelloch scan was chosen, wouldn't it be good to also include the Hillis-Steele scan? From a user's endpoint, we could probably supply an argument that helps us choose between the two implementations. PS -- Just a curious passer-by here; I found this PR interesting, so I thought I should comment.
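For contrast with the Blelloch approach, a Hillis-Steele inclusive scan can be sketched in a few lines of pure Python. This is a generic illustration of the algorithm being suggested, not code from this PR: it takes fewer dependency levels (lg(n) sweeps) but does O(n log n) work overall, versus O(n) for Blelloch:

```python
import operator

def hillis_steele_scan(values, binop):
    """Inclusive Hillis-Steele scan: lg(n) sweeps, each combining
    element i with the element 2**step positions to its left."""
    out = list(values)
    step = 1
    while step < len(out):
        # One sweep: every position far enough right pulls from `step` back.
        out = [
            binop(out[i - step], out[i]) if i >= step else out[i]
            for i in range(len(out))
        ]
        step *= 2
    return out

# hillis_steele_scan([1, 2, 3, 4, 5], operator.add) -> [1, 3, 6, 10, 15]
```

Each sweep is fully parallel across positions, which is why this variant is popular on GPUs despite the extra work.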
Current choices are "sequential", "blelloch", and "blelloch-split". Default is "sequential". I need to document these.
Thanks for stopping by to leave an encouraging message @madhur-tandon! I'm glad you find it interesting too. I think it would be straightforward to add the Hillis-Steele method. Are you interested in helping out?

With the Blelloch method, I began by writing a pure Python version that worked on lists, which I was able to easily test and iterate on to make sure all the edge cases are handled correctly. I did this in my own time for my own curiosity. Then, since I had it, I thought it would be straightforward to adapt for dask.array.

My curiosity is more-or-less satisfied. Meaningful benchmarks are hard, and I would rather spend my time getting this PR ready to merge than perform more benchmarks. So, I propose we add a `method=` keyword to let users choose which implementation to use.

My preliminary findings from benchmarking are that more benchmarking is needed. Here is a cherry-picked example. This result shouldn't be considered typical, but it does show a nice performance gain and lower memory use, so I think it is worthwhile to continue investigating.

I'm more interested in investigating the behavior of `dask.order` and the scheduler with these graphs. I would feel more comfortable making this the default after more benchmarking.
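A pure-Python Blelloch scan on lists, of the sort described above, might look like the following. This is a hedged sketch of the standard up-sweep/down-sweep construction, not the PR's actual implementation, and it assumes a power-of-two input length for simplicity:

```python
import operator

def blelloch_exclusive_scan(values, binop, identity):
    """Work-efficient Blelloch scan on a power-of-two-length list:
    an up-sweep (tree reduce) builds partial results, then a down-sweep
    distributes them, yielding the exclusive prefix scan."""
    out = list(values)
    n = len(out)
    assert n & (n - 1) == 0, "sketch assumes power-of-two length"
    # Up-sweep: combine pairs at increasing stride; out[-1] ends up
    # holding the total reduction.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            out[i] = binop(out[i - d], out[i])
        d *= 2
    # Down-sweep: clear the root, then push partial results back down,
    # swapping and combining at each level.
    out[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            out[i - d], out[i] = out[i], binop(out[i - d], out[i])
        d //= 2
    return out

# blelloch_exclusive_scan([1, 2, 3, 4], operator.add, 0) -> [0, 1, 3, 6]
```

Converting the exclusive result to an inclusive scan (as cumsum needs) is then a single elementwise `binop` with the original values.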
@eriknw Sure, I can probably try to add the Hillis-Steele method. PS -- These graphs look beautiful :)
Very neat @eriknw. I'm fine with adding a keyword for users to experiment. There is added value in picking the default that we want people to use. Could you document the `method=` keyword?
dask/array/reductions.py (outdated)

```python
if method == "blelloch":
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            dsk[base_key + index] = (
                _compute_prefixscan,
                func,
                binop,
                val,
                (x.name,) + index,
                axis,
                dtype,
            )
else:  # method == "blelloch-split"
    # e.g., compute `cumsum` before adding it to the sum of previous values
    level_key = (level,)
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            final_key = base_key + index
            func_key = final_key + level_key
            dsk[func_key] = (
                _compute_prefixscan0,
                func,
                (x.name,) + index,
                axis,
                dtype,
            )
            dsk[final_key] = (binop, val, func_key)
```
Note to self: this implements two variants of the "blelloch" method. For example, the "blelloch-split" method computes the cumsum in a separate task before combining it to the sum of previous blocks. The "blelloch" method performs these in one step.
What little benchmarking I've done suggests "blelloch" is better than "blelloch-split" for cumsum and cumprod. This may not hold true under scrutiny, and "blelloch-split" may be useful for slower functions.
I'm deleting "blelloch-split" for now, and leaving this note for posterity.
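The fused-vs-split distinction can be illustrated with two small helpers. These names and signatures are hypothetical stand-ins for the PR's `_compute_prefixscan` / `_compute_prefixscan0`, simplified to show the trade-off rather than reproduce the real code:

```python
import numpy as np
from operator import add

def scan_fused(func, binop, prior, block, axis, dtype):
    """'blelloch' variant: a single task scans the block and
    immediately combines it with the prefix carried in from
    earlier blocks (one graph node per output block)."""
    return binop(prior, func(block, axis=axis, dtype=dtype))

def scan_split_part(func, block, axis, dtype):
    """'blelloch-split' variant, first task: scan the block alone.
    It has no dependency on the prefix value, so it can run before
    the prefix arrives; a second task then applies the binop."""
    return func(block, axis=axis, dtype=dtype)

prior = np.array(10)           # total of all preceding blocks
block = np.array([1, 2, 3])
fused = scan_fused(np.cumsum, add, prior, block, 0, None)
split = add(prior, scan_split_part(np.cumsum, block, 0, None))
# both give [11, 13, 16]
```

The split variant doubles the tasks per block in exchange for overlapping the scan with upstream work, which is presumably why it only pays off when `func` is slow relative to scheduling overhead.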
Docs added. I did not add a config option. @madhur-tandon excellent! I'm looking forward to seeing the Hillis-Steele method. Don't be afraid to ask for help.
Apologies for the delay in following up here @eriknw. Merging. Thanks for working on this!
…ask#6675)
* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array
* Choose cumsum/cumprod with `method=` keyword argument. Current choices are "sequential", "blelloch", and "blelloch-split". Default is "sequential".
* black
* Add docstrings for "blelloch" method for cumsum/cumprod





See: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda