
Fix: dask.array.cumprod does not deal with dtype #12097

Merged
jacobtomlinson merged 4 commits into dask:main from tonyyuyiding:fix-cumprod-dtype
Oct 13, 2025

Conversation

@tonyyuyiding
Contributor

The problem is that dask.array.cumprod does not perform type conversion using dtype. It seems that NumPy converts the data to dtype first and then performs the reduction.

I chose a simple method to fix the issue, namely adding type conversion in cumreduction. However, I noticed that map_blocks also accepts a dtype parameter, but it does not actually convert data to that type, which is quite puzzling. Therefore, I wonder whether we should add type conversion in map_blocks instead. For example, the result of

import dask.array as da
print(da.array([1.5, 2.5]).map_blocks(lambda x: x, dtype=int).compute())

is [1.5 2.5], even though dtype has been explicitly set to int.

The current update has fixed the issue, but further suggestions are welcome :)
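To illustrate the NumPy semantics the description refers to, a minimal NumPy-only sketch (the input values here are illustrative, not taken from the PR's test suite):

```python
import numpy as np

x = np.array([1.5, 2.5])

# NumPy casts to the accumulator dtype *before* reducing ...
with_dtype = np.cumprod(x, dtype=int)    # cumprod of [1, 2] -> [1, 2]

# ... which is not the same as reducing first and casting afterwards
cast_after = np.cumprod(x).astype(int)   # [1.5, 3.75] cast to int -> [1, 3]
```

Before the fix, dask.array.cumprod effectively produced the second result rather than the first.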

@github-actions
Contributor

github-actions bot commented Oct 9, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±  0        9 suites  ±0   3h 10m 40s ⏱️ - 1m 40s
 18 126 tests + 24   16 909 ✅ + 24   1 217 💤 ±0  0 ❌ ±0 
162 208 runs  +216  150 092 ✅ +215  12 116 💤 +1  0 ❌ ±0 

Results for commit 09d4126. ± Comparison against base commit c60725b.

♻️ This comment has been updated with latest results.

@jacobtomlinson
Copy link
Copy Markdown
Member

@mmccarty would you have some time to review this?

@TomAugspurger
Copy link
Copy Markdown
Member

map_blocks also accepts a dtype parameter, but it does not actually convert data to that type, which is quite puzzling. Therefore, I wonder whether we should add type conversion in map_blocks instead

The purpose of that dtype parameter is to let dask know the dtype of the output ndarray before executing the dask graph. I don't think we should overload it to also mean "cast the output to this dtype".
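A small sketch of that distinction (assumes dask is installed; the values are illustrative): dtype= fills in the graph's metadata up front, while the computed chunks keep whatever type the function actually returns.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([1.5, 2.5]), chunks=2)
y = x.map_blocks(lambda block: block, dtype=int)

print(y.dtype)      # metadata reports the declared int dtype, before any computation
print(y.compute())  # but the identity function still returns the original floats
```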

@mmccarty
Member

mmccarty commented Oct 9, 2025

It would also be good to test that the result dtype matches the input array's dtype when the parameter isn't specified in cumprod.
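A sketch of such a test (hypothetical test name; assumes dask is installed), using NumPy as the reference for the default result dtype:

```python
import numpy as np
import dask.array as da

def test_cumprod_default_dtype_matches_numpy():
    # hypothetical test: with no dtype= given, dask should produce
    # the same result dtype (and values) as NumPy does
    x = np.array([1.5, 2.5, 3.5], dtype=np.float32)
    d = da.from_array(x, chunks=2)
    result = da.cumprod(d)
    assert result.dtype == np.cumprod(x).dtype
    np.testing.assert_array_equal(result.compute(), np.cumprod(x))
```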

@mmccarty
Member

Thanks, @tonyyuyiding!
@jacobtomlinson - Nothing further from me.

@jacobtomlinson
Member

@TomAugspurger I take your comment to be blocking on the implementation here, is that right?

@TomAugspurger
Member

I hadn't looked at the diff yet, just the description :)

So IIUC, the root cause here is a clash between a func (like np.cumprod) which takes a dtype param and map_blocks's (slightly different) usage of dtype. Does that sound right @tonyyuyiding?

I'd be curious whether we could do `func = partial(func, dtype=dtype)` followed by `map_blocks(func, axis=axis, dtype=dtype)`? In this case, we happen to know that e.g. np.cumprod(data, dtype=dtype) means the result type will be dtype (and cumprod takes care of the casting to ensure that's true), so the meaning of dtype is the same across the two functions.
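The suggestion can be sketched like this (NumPy only, outside of dask's machinery; the variable names are illustrative):

```python
from functools import partial

import numpy as np

dtype = np.int64
func = partial(np.cumprod, dtype=dtype)  # bind dtype into the reduction itself

# np.cumprod now does the casting internally, so the result type is dtype
out = func(np.array([1.5, 2.5]))
print(out, out.dtype)
```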

@tonyyuyiding
Contributor Author

So IIUC, the root cause here is a clash between a func (like np.cumprod) which takes a dtype param and map_blocks's (slightly different) usage of dtype. Does that sound right?

Yes, that sounds right to me.

I'd be curious whether we could do `func = partial(func, dtype=dtype)` followed by `map_blocks(func, axis=axis, dtype=dtype)`? In this case, we happen to know that e.g. np.cumprod(data, dtype=dtype) means the result type will be dtype (and cumprod takes care of the casting to ensure that's true), so the meaning of dtype is the same across the two functions.

Thanks for the suggestion! I agree that using partial is better, in case some other func does not do type conversion.

@TomAugspurger
Member

Great, thanks.

My only remaining concern is whether we can assume that func always takes a dtype= argument. The docs are not clear on this:

Cumulative function like np.cumsum or np.cumprod

We can maybe treat "like" as meaning "has the same function signature" and call this good enough, assuming the tests pass.

@tonyyuyiding
Contributor Author

My only remaining concern is whether we can assume that func always takes a dtype= argument.

I found that some funcs do not take dtype. Perhaps we can use inspect to check whether func accepts dtype, but I'm not sure whether that is the best practice.
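One way to sketch that check (bind_dtype is a hypothetical helper name, not necessarily what the PR uses):

```python
import inspect
from functools import partial

import numpy as np

def bind_dtype(func, dtype):
    """Return func with dtype= bound, but only if func accepts a dtype parameter."""
    if dtype is None:
        return func
    params = inspect.signature(func).parameters
    accepts_dtype = "dtype" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return partial(func, dtype=dtype) if accepts_dtype else func

cumprod = bind_dtype(np.cumprod, np.int64)    # np.cumprod takes dtype=, so it is bound
identity = bind_dtype(lambda x: x, np.int64)  # this lambda does not, so it is left alone
```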

@TomAugspurger
Member

TomAugspurger commented Oct 10, 2025

Perhaps we can use inspect to check whether func accepts dtype,

Yep. That's a bit unfortunate, but I think that's our best option.

I didn't mention it earlier, but I'd like to leave as much up to NumPy as possible. It's not 100% clear to me that np.cumprod(x.astype(dtype)) is the same in all cases as np.cumprod(x, dtype=dtype), and I'd like to not have to worry about that.

Member

@TomAugspurger left a comment


If you're able to, it'd be nice to have a test that covers the case where func doesn't accept a dtype parameter.

@tonyyuyiding
Contributor Author

If you're able to, it'd be nice to have a test that covers the case where func doesn't accept a dtype parameter.

I checked the places where cumreduction is used. It seems that the only case where func does not accept dtype is inside the push function, which is the Dask version of bottleneck.push. That push function is quite different from cumsum and cumprod, and it has its own tests in another file. I saw those tests fail when I used partial without checking whether func accepts dtype. Therefore, I guess we don't need to add additional tests here?

Member

@jacobtomlinson left a comment


I saw those tests fail when I used partial without checking whether func accepts dtype.

Yep that seems to satisfy the request here.

Thanks for iterating on this!

@jacobtomlinson jacobtomlinson merged commit d79d453 into dask:main Oct 13, 2025
23 of 24 checks passed
@tonyyuyiding tonyyuyiding deleted the fix-cumprod-dtype branch October 13, 2025 11:10


Development

Successfully merging this pull request may close these issues.

dask.cumprod brings different results with numpy.cumprod

4 participants