Begin experimenting with parallel prefix scan for cumsum and cumprod #6675
TomAugspurger merged 4 commits into dask:master
Conversation
…in dask.array

This is a WIP and needs to be benchmarked. I think it's interesting, though, and want to share. It's been a while since I've worked on dask.array, so feedback is most welcome.

This is a work-efficient parallel prefix scan. It uses a Brent-Kung construction and is known as the Blelloch algorithm. We adapt it to work on chunks. Previously, a cumsum across N chunks required N levels of dependencies. This PR takes approximately 2 * lg(N) levels of dependencies. It exposes parallelism. It is work-efficient and only requires about a third more tasks than the previous method. Scans on floating point values should also be more accurate.

A parallel cumsum works by first taking the sum of each block, then doing a binary tree merge followed by a fan-out (i.e., the Brent-Kung pattern). We then take the cumsum of each block and add the sum of the previous blocks.

NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod significantly faster. This is why I think this approach will be faster. Exposing parallelism and an efficient communication pattern is another reason I think this should be faster (especially when communication costs are significant). I also think this will be an interesting test for `dask.order` and the scheduler.

Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)? Does the answer to this depend on benchmarks?

Benchmarks and graph diagrams are forthcoming :)
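The block-level approach described above can be sketched in plain NumPy. This is a serial illustration of the three phases, not the PR's actual task-graph code; in dask each step would become an independent task, and phase 2 would be the Brent-Kung tree merge plus fan-out rather than a serial scan:

```python
import numpy as np

def chunked_cumsum(chunks):
    """Cumulative sum across a list of 1-D chunks, in three phases:
    per-chunk sums, an exclusive scan of those sums, then a per-chunk
    cumsum shifted by the running offset."""
    # Phase 1: reduce each chunk to its sum (cheap; parallel across chunks).
    block_sums = [np.sum(c) for c in chunks]
    # Phase 2: exclusive scan of the block sums. In the PR this is the
    # Brent-Kung tree merge + fan-out; a serial scan gives the same values.
    offsets = np.concatenate([[0], np.cumsum(block_sums)[:-1]])
    # Phase 3: cumsum within each chunk, plus the total of preceding chunks.
    return [np.cumsum(c) + off for c, off in zip(chunks, offsets)]

chunks = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
out = np.concatenate(chunked_cumsum(chunks))
# matches np.cumsum on the concatenated array: [1, 3, 6, 10, 15, 21]
```

The key observation from the PR description applies here: phase 1 uses the fast `sum` per block, and the relatively expensive `cumsum` only runs once per block in phase 3.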
This is the failing test. It assumes the old cumsum behavior and checks the ordering:
https://github.com/eriknw/dask/blob/prefix_scan/dask/tests/test_order.py#L252
Perhaps we should allow the user to choose which method to use regardless of benchmark results. This will let the above test pass, and it provides a fail-safe in case I introduce a bug.
If benchmarks show that this is reliably faster (and also not significantly
more memory hungry) then I would be comfortable making it default.
Hi @eriknw, this is very interesting work. I haven't had time to understand this PR completely, but it seems like a practical application of what I learned in my GPU Computing class. One thing I want to ask, though: although I understand why the work-efficient Blelloch scan was chosen, wouldn't it be good to also include the Hillis-Steele scan? From a user's endpoint, we could probably supply an argument that helps us choose between the two implementations. PS -- Just a curious passer-by here; I found this PR interesting, so I thought I should comment.
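For contrast with the Blelloch approach, a Hillis-Steele inclusive scan can be sketched in a few lines of pure Python. This is a generic illustration of the algorithm being suggested, not code from this PR: it takes fewer dependency levels (lg(n) sweeps) but does O(n log n) work overall, versus O(n) for Blelloch:

```python
import operator

def hillis_steele_scan(values, binop):
    """Inclusive Hillis-Steele scan: lg(n) sweeps, each combining
    element i with the element 2**step positions to its left."""
    out = list(values)
    step = 1
    while step < len(out):
        # One sweep: every position far enough right pulls from `step` back.
        out = [
            binop(out[i - step], out[i]) if i >= step else out[i]
            for i in range(len(out))
        ]
        step *= 2
    return out

# hillis_steele_scan([1, 2, 3, 4, 5], operator.add) -> [1, 3, 6, 10, 15]
```

Each sweep is fully parallel across positions, which is why this variant is popular on GPUs despite the extra work.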
Current choices are "sequential", "blelloch", and "blelloch-split". Default is "sequential". I need to document these.
Thanks for stopping by to leave an encouraging message @madhur-tandon! I'm glad you find it interesting too. I think it would be straightforward to add the Hillis-Steele method. Are you interested in helping out?

With the Blelloch method, I began by writing a pure Python version that worked on lists, which I was able to easily test and iterate on to make sure all the edge cases are handled correctly. I did this in my own time for my own curiosity. Then, since I had it, I thought it would be straightforward to adapt for dask.array.

My curiosity is more-or-less satisfied. Meaningful benchmarks are hard, and I would rather spend my time getting this PR ready to merge than perform more benchmarks. So, I propose we add a `method=` keyword to let users choose which implementation to use.

My preliminary findings from benchmarking are that more benchmarking is needed. Here is a cherry-picked example. This result shouldn't be considered typical, but it does show a nice performance gain and lower memory use, so I think it is worthwhile to continue investigating.

I'm more interested in investigating the behavior of `dask.order` and the scheduler with these graphs. I would feel more comfortable making this the default after more benchmarking.
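A pure-Python Blelloch scan on lists, of the sort described above, might look like the following. This is a hedged sketch of the standard up-sweep/down-sweep construction, not the PR's actual implementation, and it assumes a power-of-two input length for simplicity:

```python
import operator

def blelloch_exclusive_scan(values, binop, identity):
    """Work-efficient Blelloch scan on a power-of-two-length list:
    an up-sweep (tree reduce) builds partial results, then a down-sweep
    distributes them, yielding the exclusive prefix scan."""
    out = list(values)
    n = len(out)
    assert n & (n - 1) == 0, "sketch assumes power-of-two length"
    # Up-sweep: combine pairs at increasing stride; out[-1] ends up
    # holding the total reduction.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            out[i] = binop(out[i - d], out[i])
        d *= 2
    # Down-sweep: clear the root, then push partial results back down,
    # swapping and combining at each level.
    out[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            out[i - d], out[i] = out[i], binop(out[i - d], out[i])
        d //= 2
    return out

# blelloch_exclusive_scan([1, 2, 3, 4], operator.add, 0) -> [0, 1, 3, 6]
```

Converting the exclusive result to an inclusive scan (as cumsum needs) is then a single elementwise `binop` with the original values.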
@eriknw Sure, I can probably try to add the Hillis-Steele method. PS -- These graphs look beautiful :)
Very neat @eriknw. I'm fine with adding a keyword for users to experiment. There is added value in picking the default that we want people to use. Could you document the `method=` keyword?
dask/array/reductions.py (outdated)

```python
if method == "blelloch":
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            dsk[base_key + index] = (
                _compute_prefixscan,
                func,
                binop,
                val,
                (x.name,) + index,
                axis,
                dtype,
            )
else:  # method == "blelloch-split"
    # e.g., compute `cumsum` before adding it to the sum of previous values
    level_key = (level,)
    for indexes, vals in zip(drop(1, full_indices), prefix_vals):
        for index, val in zip(indexes, vals):
            final_key = base_key + index
            func_key = final_key + level_key
            dsk[func_key] = (
                _compute_prefixscan0,
                func,
                (x.name,) + index,
                axis,
                dtype,
            )
            dsk[final_key] = (binop, val, func_key)
```
Note to self: this implements two variants of the "blelloch" method. For example, the "blelloch-split" method computes the cumsum in a separate task before combining it to the sum of previous blocks. The "blelloch" method performs these in one step.
What little benchmarking I've done suggests "blelloch" is better than "blelloch-split" for cumsum and cumprod. This may not hold true under scrutiny, and "blelloch-split" may be useful for slower functions.
I'm deleting "blelloch-split" for now, and leaving this note for posterity.
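The fused-vs-split distinction can be illustrated with two small helpers. These names and signatures are hypothetical stand-ins for the PR's `_compute_prefixscan` / `_compute_prefixscan0`, simplified to show the trade-off rather than reproduce the real code:

```python
import numpy as np
from operator import add

def scan_fused(func, binop, prior, block, axis, dtype):
    """'blelloch' variant: a single task scans the block and
    immediately combines it with the prefix carried in from
    earlier blocks (one graph node per output block)."""
    return binop(prior, func(block, axis=axis, dtype=dtype))

def scan_split_part(func, block, axis, dtype):
    """'blelloch-split' variant, first task: scan the block alone.
    It has no dependency on the prefix value, so it can run before
    the prefix arrives; a second task then applies the binop."""
    return func(block, axis=axis, dtype=dtype)

prior = np.array(10)           # total of all preceding blocks
block = np.array([1, 2, 3])
fused = scan_fused(np.cumsum, add, prior, block, 0, None)
split = add(prior, scan_split_part(np.cumsum, block, 0, None))
# both give [11, 13, 16]
```

The split variant doubles the tasks per block in exchange for overlapping the scan with upstream work, which is presumably why it only pays off when `func` is slow relative to scheduling overhead.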
Docs added. I did not add a config option. @madhur-tandon excellent! I'm looking forward to seeing the Hillis-Steele method. Don't be afraid to ask for help.
Apologies for the delay in following up here @eriknw. Merging. Thanks for working on this!
…ask#6675)
* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array
* Choose cumsum/cumprod with `method=` keyword argument. Current choices are "sequential", "blelloch", and "blelloch-split". Default is "sequential".
* black
* Add docstrings for "blelloch" method for cumsum/cumprod





See: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda