
Begin experimenting with parallel prefix scan for cumsum and cumprod#6675

Merged
TomAugspurger merged 4 commits into dask:master from eriknw:prefix_scan
Oct 26, 2020

Conversation

@eriknw
Member

eriknw commented Sep 25, 2020

This is a WIP and needs to be benchmarked. I think it's interesting, though, and want to share it. It's been a while since I've worked on dask.array, so feedback is most welcome.

This is a work-efficient parallel prefix scan. It uses a Brent-Kung construction and is known as the Blelloch algorithm. We adapt it to work on chunks.

Previously, a cumsum across N chunks required N levels of dependencies. This PR takes approximately 2 * lg(N) levels of dependencies, which exposes parallelism. It is work-efficient and only requires about a third more tasks than the previous method. Scans on floating-point values should also be more accurate.

Our parallel cumsum works by first taking the sum of each block, then doing a binary-tree merge followed by a fan-out (i.e., the Brent-Kung pattern). We then take the cumsum of each block and add the sum of the previous blocks.
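The chunked scheme just described can be sketched in plain Python/NumPy (a minimal eager illustration with hypothetical function names, not the task-graph code in this PR):

```python
import numpy as np

def blelloch_exclusive_scan(values):
    """Exclusive prefix sum via the Blelloch up-sweep/down-sweep pattern.

    Pads to a power of two; uses ~2 * lg(N) levels, matching the
    dependency depth discussed above.
    """
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    a = list(values) + [0] * (size - n)
    # Up-sweep (reduce): build partial sums in a binary tree.
    step = 1
    while step < size:
        for i in range(2 * step - 1, size, 2 * step):
            a[i] = a[i - step] + a[i]
        step *= 2
    # Down-sweep: convert the reduction tree into an exclusive scan.
    a[size - 1] = 0
    step = size // 2
    while step >= 1:
        for i in range(2 * step - 1, size, 2 * step):
            a[i - step], a[i] = a[i], a[i - step] + a[i]
        step //= 2
    return a[:n]

def chunked_cumsum(blocks):
    sums = [b.sum() for b in blocks]          # one sum per block
    offsets = blelloch_exclusive_scan(sums)   # sum of all previous blocks
    # Per-block cumsum plus the offset from earlier blocks.
    return np.concatenate([b.cumsum() + off for b, off in zip(blocks, offsets)])
```

In the real PR these steps become tasks in a Dask graph rather than eager loops, but the data flow is the same: block sums feed the up-sweep, the down-sweep fans offsets back out, and each block combines its local cumsum with its offset.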

NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod significantly faster. This is why I think this approach will be faster. Exposing parallelism and an efficient communication pattern is another reason I think this should be faster (especially when communication costs are significant).
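For what it's worth, the sum-vs-cumsum gap is easy to observe with a quick micro-benchmark (illustrative only; timings are machine-dependent):

```python
# Illustrative micro-benchmark: reducing (sum) is typically faster than
# scanning (cumsum), since cumsum must also write N outputs.
import timeit
import numpy as np

x = np.random.default_rng(0).random(1_000_000)
t_sum = timeit.timeit(x.sum, number=100)
t_cumsum = timeit.timeit(x.cumsum, number=100)
print(f"sum: {t_sum:.4f}s   cumsum: {t_cumsum:.4f}s")
```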

I also think this will be an interesting test for dask.order and the scheduler.

Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)? Does the answer to this depend on benchmarks?

Benchmarks and graph diagrams are forthcoming :)

  • Tests added / passed
  • Passes black dask / flake8 dask

See: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda

@eriknw
Member Author

eriknw commented Sep 25, 2020

This is the failing test. It assumes the old cumsum behavior and checks the ordering:
https://github.com/eriknw/dask/blob/prefix_scan/dask/tests/test_order.py#L252

Perhaps we should allow the user to choose which method to use regardless of benchmark results. This will let the above test pass, and it provides a fail-safe in case I introduce a bug.

@mrocklin
Member

mrocklin commented Sep 25, 2020 via email

@madhur-tandon
Contributor

Hi @eriknw, this is very interesting work. I haven't had time to understand this PR completely, but it seems like a practical application of what I learned in my GPU computing class.

One question, though: I understand why the Blelloch algorithm is preferred over Hillis-Steele here, since the most likely case is that we are limited by the number of processors (and have huge data compared to the number of processors).

But wouldn't it be good to include a Hillis-Steele implementation as well (in case we have more processors than work) and care about step-efficiency in that case?
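For contrast, a Hillis-Steele inclusive scan on a plain list can be sketched as follows (a toy illustration, not part of this PR):

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum in lg(N) steps, at the cost of N*lg(N) total work."""
    a = list(values)
    n = len(a)
    step = 1
    while step < n:
        # Each element adds in the value `step` positions to its left.
        a = [a[i] + a[i - step] if i >= step else a[i] for i in range(n)]
        step *= 2
    return a
```

This is step-efficient (fewer levels) but not work-efficient, which is exactly the trade-off against Blelloch described above.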

From a user's standpoint -- we could probably supply an argument that lets them choose between the two implementations?

Also, I wonder if Dask supports the numpy.polyval function already (which helps us to evaluate a polynomial given the input and the coefficients)? Regardless, I think this PR could perhaps also be useful / pave the way for implementing the polyval function that leverages this scan approach.

PS -- Just a curious passer-by here, I found this PR interesting so I thought I should comment.
Thanks!

Current choices are "sequential", "blelloch", and "blelloch-split".
Default is "sequential".  I need to document these.
@eriknw
Member Author

eriknw commented Sep 28, 2020

Thanks for stopping by to leave an encouraging message, @madhur-tandon! I'm glad you find it interesting too. I think it would be straightforward to add the Hillis-Steele method. Are you interested in helping out? With the Blelloch method, I began by writing a pure Python version that worked on lists, which I was able to easily test and iterate on to make sure all the edge cases were handled correctly. I did this in my own time for my own curiosity. Then, since I had it, I thought it would be straightforward to adapt for dask.array, and indeed it was (with a little prior knowledge of programming for dask)!

My curiosity is more-or-less satisfied. Meaningful benchmarks are hard, and I would rather spend my time getting this PR ready to merge than to perform more benchmarks. So, I propose we add a method= keyword to let users opt-in to using "Blelloch". For now, I think the current method should remain the default, which I call "sequential".
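For reference, the "sequential" strategy amounts to the following (a toy eager sketch with hypothetical names, not Dask's actual graph-building code):

```python
import numpy as np

def sequential_chunked_cumsum(blocks):
    # "sequential": block k needs the running total of blocks 0..k-1,
    # giving N levels of dependencies across N blocks.
    out, total = [], 0.0
    for b in blocks:
        out.append(b.cumsum() + total)
        total += b.sum()
    return np.concatenate(out)
```

The chain through `total` is what limits parallelism and motivates the tree-structured alternative.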

My preliminary finding from benchmarking is that more benchmarking is needed. "blelloch" seems to be at least as fast as "sequential". It is often comparable or only a little faster; sometimes it is significantly faster. I found memory usage to be unpredictable. There are many dimensions to explore that I have not yet covered.

Here is a cherry-picked example on an array of size 2**16 x 2**16 with chunks=2**11.

Sequential:
[screenshot, Sep 26, 2020]

Blelloch:
[screenshot, Sep 26, 2020]

This result shouldn't be considered typical, but it does show a nice performance gain and lower memory use, so I think it is worthwhile to continue investigating "blelloch".

I'm more interested in investigating the behavior of dask.order on graphs from "blelloch". Here's a graph colored by order for 33 blocks:

[graph image: prefixscan33]

Here's the same graph as above but with output collapsed (sometimes it's nice to see it this way):

[graph image: prefixscan33, output collapsed]

Observe that the first half (red on the bottom) is ordered more-or-less flawlessly. But then order jumps to the top and works its way back down in a non-ideal way. I think I know why. This becomes clearer if we look at a larger graph.

Details

Here is one with 129 blocks:

[graph image: prefixscan129]

I would feel more comfortable with "blelloch" being the default method if the ordering were better, although the distributed scheduler should be fine to use. The patterns in these graphs are the best examples I have seen of dask.order performing poorly due to it "switching" to better targets. There may be an easy tweak to fix this.

@madhur-tandon
Contributor

@eriknw Sure, I can probably try to add the Hillis-Steele method during the weekend. It might take some time since I need to brush up on the theory again, but I will be up for it for sure. Once I get something out, I would really appreciate some feedback on it as well 👍

PS -- These graphs look beautiful :)

@TomAugspurger
Member

Very neat, @eriknw. I'm fine with adding a keyword for users to experiment with. There is added value in picking the default that we want people to use, since things like np.cumsum(dask_array) don't have the option to provide a keyword (we could also pick the default from the dask config).

Could you document the method argument for these functions? I think that if you just include a regular docstring, it will end up somewhere between the docstring opening and the Parameters section.

Comment on lines +1180 to +1206
```python
if method == "blelloch":
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            dsk[base_key + index] = (
                _compute_prefixscan,
                func,
                binop,
                val,
                (x.name,) + index,
                axis,
                dtype,
            )
else:  # method == "blelloch-split"
    # e.g., compute `cumsum` before adding it to the sum of previous values
    level_key = (level,)
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            final_key = base_key + index
            func_key = final_key + level_key
            dsk[func_key] = (
                _compute_prefixscan0,
                func,
                (x.name,) + index,
                axis,
                dtype,
            )
            dsk[final_key] = (binop, val, func_key)
```
Member Author

Note to self: this implements two variants of the "blelloch" method. The "blelloch-split" method computes the cumsum in a separate task before combining it with the sum of previous blocks; the "blelloch" method performs these in one step.

What little benchmarking I've done suggests "blelloch" is better than "blelloch-split" for cumsum and cumprod. This may not hold true under scrutiny, and "blelloch-split" may be useful for slower functions.

I'm deleting "blelloch-split" for now, and leaving this note for posterity.

@eriknw
Member Author

eriknw commented Oct 3, 2020

Docs added. I did not add a config option.

@madhur-tandon excellent! I'm looking forward to seeing Hillis Steele's method. Don't be afraid to ask for help.

@TomAugspurger
Member

Apologies for the delay in following up here @eriknw. Merging. Thanks for working on this!

@TomAugspurger TomAugspurger merged commit a8329c2 into dask:master Oct 26, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
Begin experimenting with parallel prefix scan for cumsum and cumprod (dask#6675)

* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array

* Choose cumsum/cumprod with `method=` keyword argument.

* black

* Add docstrings for "blelloch" method for cumsum/cumprod