Rewrite matmul as blockwise without concatenated contraction #7000
TomAugspurger merged 1 commit into dask:master from
Conversation
Force-pushed from eb0ebff to eb6d82b
Nice @ravwojdyla! A couple quick questions:
Chunks of the inputs are passed into the function at Line 288 in eb6d82b. When the last dimension of A or the last two dimensions of B are chunked, we need to sum the partial results (the sum axis depends on the chunking scheme); otherwise we have full results.
Afaiu this is handled by the
For the 2D case there isn't much difference, but this also handles matrices with more than 2 dimensions.
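A minimal NumPy sketch (the sizes and the chunk width of 3 are hypothetical) of why partial results must be summed when the contraction dimension is chunked:

```python
import numpy as np

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)

# Split the contraction dimension (size 6) into two chunks of width 3.
# Each block-level matmul yields only a partial result.
partials = [A[:, k:k + 3] @ B[k:k + 3, :] for k in (0, 3)]

# Summing the partial results recovers the full product.
assert np.allclose(sum(partials), A @ B)
```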
Hey @mrocklin / @TomAugspurger do you think you could help us direct a review for this one since it's important for unblocking some of our scalability testing (cf. sgkit#390)? FYI this work supersedes what we proposed in #6924 -- I believe @tomwhite was going to close it.
I'm not sure what bandwidth he has available, but cc @gforsyth just in case he has a chance to take a look.
gforsyth left a comment
Hey @ravwojdyla -- thanks for putting this in and for the very thorough notebook running through the performance changes.
The changes you've made look good to me and I've tried to break this locally and haven't been able to. 🎉
Regarding the slight performance regression vs. the original implementation when the arrays fit cleanly in memory: the chunking used in those (in-memory) examples defaults to (1000, 250), which puts it at roughly 2 MB per chunk. Bumping that up (as you do in the following example) shows that your implementation is much more performant overall.
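A quick back-of-the-envelope check of that per-chunk size, assuming float64 elements:

```python
# A (1000, 250) chunk of float64 values, at 8 bytes per element.
chunk_nbytes = 1000 * 250 * 8
assert chunk_nbytes == 2_000_000  # ~2 MB per chunk
```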
My only request here is that we add a few more tests along the lines of @eric-czech's questions -- all the current tests seem to have symmetric chunk sizes and it would be good to have a few where the chunk sizes are a bit more disparate, e.g.
X = da.random.random(size=(3, 3, 50, 100), chunks=(1, 3, 10, 25))
Y = da.random.random(size=(3, 3, 100, 50), chunks=(1, 3, 20, 5))
I'd like to see that and maybe two other random disparate chunk size examples in the tests just to cover our bases and then this is good to go in.
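A sketch of the kind of test being requested, assuming the usual pattern of checking da.matmul against np.matmul on the computed inputs:

```python
import numpy as np
import dask.array as da

# Deliberately disparate chunk sizes, per the examples above.
x = da.random.random(size=(3, 3, 50, 100), chunks=(1, 3, 10, 25))
y = da.random.random(size=(3, 3, 100, 50), chunks=(1, 3, 20, 5))

# The chunked result should agree with the in-memory NumPy result.
expected = np.matmul(x.compute(), y.compute())
assert np.allclose(da.matmul(x, y).compute(), expected)
```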
Hi @gforsyth, thanks for the review!
Certainly, will add.
Force-pushed from eb6d82b to ada36fa
@gforsyth tests added: dask/dask/array/tests/test_routines.py Lines 252 to 259 in ada36fa
gforsyth left a comment
Thanks for putting this in @ravwojdyla -- this looks great! @jrbourbeau this is ready to go in.
Thanks all!
Thanks @gforsyth and @TomAugspurger! I'll go ahead and create an issue to validate that the new implementation works and scales well on cupy and sparse. The (unit) tests included in dask pass (at least the sparse ones; cupy is not tested as part of the CI, right?), but I would also like to validate the performance for those two array types before release. Are there any other array types we should look into?
# Since we have performed the contraction via matmul
# but blockwise expects all dimensions back, we need
# to add one dummy dimension back
return chunk[..., np.newaxis]
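As a small illustration of what this comment describes (block shapes here are hypothetical): matmul removes the contraction dimension from each block, and the size-1 dummy axis restores the dimensionality that blockwise expects:

```python
import numpy as np

a_blk = np.random.rand(4, 3)   # block of A with dims (i, k)
b_blk = np.random.rand(3, 5)   # block of B with dims (k, j)

# The contraction over k removes a dimension: result is (4, 5).
chunk = a_blk @ b_blk

# blockwise still expects the contracted dimension in the output index,
# so a size-1 dummy axis is appended.
out = chunk[..., np.newaxis]
assert out.shape == (4, 5, 1)
```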
Your PR is an impressive piece of code @ravwojdyla !
I know this PR is closed. But I'd appreciate it if you would answer one last question from me.
This is the only line that puzzles me — while I understand why the number of the dimensions of the output chunk should match what blockwise expects, I can't figure out why the new axis is inserted in the last position instead of the second to last position. Perhaps it doesn't matter in the end, but I'd like to know why.
I mean, why not the following:
# to add one dummy dimension back
return chunk[..., np.newaxis, :]

which more closely matches the expected output of the blockwise call?
Thanks!
👋 @ParticularMiner my initial reaction would be that it probably doesn't matter (though it might require some changes in the downstream logic after blockwise). That extra dummy dimension is "squeezed" out later AFAIR, and as you have pointed out, it's there to make blockwise happy at the metadata level. That said, it's been a while, and this is just my initial reaction.
Thank you very much for your reply @ravwojdyla despite this conversation being almost a year old!
Indeed, it turns out through testing that it doesn’t matter, which puzzled me.
Also the downstream logic after blockwise operates on the assumption that the contraction-axis is the second-to-last axis of the output of blockwise.
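For illustration (with hypothetical data), the two placements of the dummy axis carry exactly the same values; they differ only in where the size-1 axis sits, which is why the end result is the same once that axis is reduced or squeezed away:

```python
import numpy as np

chunk = np.arange(20.0).reshape(4, 5)

last = chunk[..., np.newaxis]            # dummy axis last: (4, 5, 1)
second_last = chunk[..., np.newaxis, :]  # dummy axis second-to-last: (4, 1, 5)

assert last.shape == (4, 5, 1)
assert second_last.shape == (4, 1, 5)
# Same data either way; only the position of the size-1 axis differs.
assert np.array_equal(last.squeeze(), second_last.squeeze())
```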
The following is not vital (performance-wise). But just to make you aware:
While mirroring your code elsewhere, it turned out that by using

# to add one dummy dimension back
return chunk[..., np.newaxis, :]

instead of

# to add one dummy dimension back
return chunk[..., np.newaxis]

one can avoid the extra conditional branches in the downstream logic after blockwise by replacing them with a single sum. That is, this line:

out = out.sum(axis=-2)

can replace the following lines:
# When we perform reduction, we need to worry about the last 2 dimensions
# which hold the matrices, some care is required to handle chunking in
# that space.
contraction_dimension_is_chunked = (
    max(min(a.chunks[-1], b.chunks[-2])) < a.shape[-1]
)
b_last_dim_max_chunk = max(b.chunks[-1])
if contraction_dimension_is_chunked or b_last_dim_max_chunk < b.shape[-1]:
    if b_last_dim_max_chunk > 1:
        # This is the case when both contraction and last dimension axes
        # are chunked
        out = out.reshape(out.shape[:-1] + (1, -1))
        out = out.sum(axis=-3)
        out = out.reshape(out.shape[:-2] + (b.shape[-1],))
    else:
        # Contraction axis is chunked
        out = out.sum(axis=-2)
else:
    # Neither contraction nor last dimension axes are chunked, we
    # remove the dummy dimension without reduction
    out = out.reshape(out.shape[:-2] + (b.shape[-1],))

I haven't tested this yet for dask itself, but it works in my particular application. If you feel this is worth exploring for dask, then just drop me a line and I'll create a PR with extensive testing. But I doubt it will significantly improve current performance; rather, it will merely create more readable code, which is often useful. 😉
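A NumPy sketch (hypothetical block sizes) of the proposed simplification: with the dummy axis placed second-to-last, the partial products stacked along that axis collapse with a single sum, regardless of how the contraction axis was chunked:

```python
import numpy as np

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)

# Partial products from two chunks of the contraction axis, each with
# the dummy axis inserted second-to-last, giving shape (4, 1, 5).
partials = [(A[:, k:k + 3] @ B[k:k + 3, :])[..., np.newaxis, :]
            for k in (0, 3)]

# Stacking along the dummy axis and summing it out recovers A @ B
# with one reduction instead of the conditional branches above.
out = np.concatenate(partials, axis=-2).sum(axis=-2)
assert np.allclose(out, A @ B)
```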
@ParticularMiner that's great! Given that those reshape ops should be largely metadata operations, I agree that I wouldn't expect a performance impact, but I do love the simpler code. Definitely +1 to a PR, if you test that it works well for different chunking schemes (as mentioned in the comment above).
Many thanks for your interest!
I've made the suggested changes on my local git and the relevant existing unit tests (in test_matmul()) have all passed.
The only thing left now is to run your benchmarking Jupyter notebook above. (IMO the results there will be similar; still, it is always safer to check.) But to do that, I need access to a computing cluster, which I currently do not have. Any pointers on where I can access one (preferably for free)? 😉
Sorry. I misunderstood. It seems you ran your notebook on a standard laptop, not a multi-node cluster. I can probably do that too.
Hi @ravwojdyla
You’ll find my suggested PR at #8423.
Currently the CI tests are broken for other reasons. So it is not clear that the PR checks out.
Re: #6874

- black dask / flake8 dask
- Performance stats in a notebook

There are more optimizations I have in mind, but first I would like to get some feedback about the overall direction. This PR implements matmul as blockwise without concatenated contraction, which reduces the memory footprint and allows for better control over execution time and memory usage. Please see the notebook above for the impact on performance.
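A small end-to-end sketch of the idea (array sizes here are illustrative), checking the blockwise matmul against NumPy:

```python
import numpy as np
import dask.array as da

x = da.random.random(size=(200, 300), chunks=(50, 100))
y = da.random.random(size=(300, 120), chunks=(100, 40))

# Each task multiplies one pair of blocks and partial results are summed,
# so no single task ever materializes a concatenated contraction axis.
z = da.matmul(x, y)
assert np.allclose(z.compute(), np.matmul(x.compute(), y.compute()))
```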