Add cuda.compute implementation to CCCL study #3748
Conversation
@nvtx.annotate("empty_like")
def empty_like(array, kind="empty"):
Warning: this function, and the below helpers awkward_to_iterator and reconstruct_with_offsets are largely vibe-coded :)
I found that writing these to be as efficient as possible requires significant interaction with Awkward internals.
In principle, there shouldn't ever be a need for empty_like for awkward arrays because they are immutable (there is also an issue saying the same thing and explaining why only zeros/ones/full_like exist). Apart from that, I think you should be able to achieve this (if it is really needed for some reason that I haven't checked here) using ak.transform: https://awkward-array.org/doc/main/reference/generated/ak.transform.html
with a transformation like
def transformation(layout, **kwargs):
    if layout.is_numpy:
        return ak.contents.NumpyArray(
            xp.empty_like(layout.data)
        )

which will just do the recursion that you implemented here down the whole layout for you. The limitation is that I think this will share buffers between the arrays for the ak.index.Index classes, because the transformation doesn't touch them here (you may need to handle layout classes that have ak.index.Index nodes inside the transformation).
It's also possible to do the same thing using to/from_buffers tricks. You can disassemble an array into buffers and recreate it using ak.from_buffers(*ak.to_buffers(array)). For empty_like, you could maybe just call to_buffers first, get all the buffers, make an empty_like of each of them, and reassemble the array with the new buffers.
Probably something like
form, length, container = ak.to_buffers(array)
new_container = {k: v.copy() for k, v in container.items()}
ak.from_buffers(form, length, new_container)
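The buffer-level idea can be illustrated without Awkward itself: ak.to_buffers returns a flat dict mapping buffer names to ndarrays, so an empty_like of the whole array reduces to an empty_like of each buffer. A minimal NumPy sketch of just that step (the container keys below are made up for illustration, not the names Awkward actually generates):

```python
import numpy as np

# Hypothetical stand-in for the container returned by ak.to_buffers:
# a flat mapping from buffer names to ndarrays.
container = {
    "node0-offsets": np.array([0, 2, 5], dtype=np.int64),
    "node1-data": np.arange(5, dtype=np.float64),
}

# empty_like every buffer: same dtypes and shapes, uninitialized
# contents, and no sharing with the original buffers.
new_container = {k: np.empty_like(v) for k, v in container.items()}

for k, v in container.items():
    assert new_container[k].dtype == v.dtype
    assert new_container[k].shape == v.shape
    assert new_container[k] is not v
```

Feeding new_container back through ak.from_buffers (as in the snippet above) would then rebuild the array around the fresh buffers.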
Regarding the other two helpers that you mentioned, I have only looked at them for 2 seconds, but in general, things that require layout recursions can probably be implemented most of the time in terms of either ak.transform or ak._do.recursively_apply:
Thank you for the suggestions! I'll look into these.
The reason empty_like exists is because cuda.compute algorithms require the user to provide output buffers into which the results are written.
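As an analogy for that caller-provided output-buffer pattern (this is NumPy, not the actual cuda.compute API), ufunc reductions accept a preallocated out array that the result is written into, which is exactly the situation where an empty_like-style allocation is needed first:

```python
import numpy as np

data = np.arange(10, dtype=np.float64).reshape(2, 5)

# Caller allocates the output buffer up front; its initial contents
# don't matter, which is what empty_like provides cheaply.
out = np.empty(5, dtype=np.float64)

# The algorithm writes its result into the preallocated buffer in place.
np.add.reduce(data, axis=0, out=out)
print(out)  # column sums: [ 5.  7.  9. 11. 13.]
```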
Initially, I was indeed using .from_buffers and .to_buffers. While the implementation was quite simple, I found that it was quite expensive. IIRC the profiles showed a lot of time being spent in
Yeah, the awkward array is immutable, but the buffers are not (they are just ndarrays); no user ever interacts with them directly. Indeed, in your case I understand why you want to copy all the buffers to write the result into.
Regarding the to/from_buffers performance, that would make sense, because the reconstitution recursion is complicated; I merely suggested it for simplicity. ak.transform or recursively_apply may give you the same order of performance without repeating code, but for the maximum performance possible it may make sense to write the simplest thing possible, like you have.
@shwina - I think, cupy structured arrays would be a good fit here - see @leofang's comment on #3517 (comment)
I think, it would be suitable for getting the results since awkward allows setting the fields:
ak_zeros_fromnumpy = ak.Array(np.zeros(shape=5, dtype=np.dtype([('index', '<i8'), ('tag', '<i4')])))
ak_zeros_fromnumpy["index"] = [1, 2, 3, 4, 5]
Oh @shwina if the actual values in the buffer don't matter (i.e. you don't really care about doing xp.empty/empty_like in the buffer and you can just have a copy of the existing buffer since you intend to modify all of it), you could also just do an ak.copy which just copies the whole array including the buffers too.
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3748
| """ | ||
| Convert segment offsets to segment IDs (indicators). | ||
|
|
||
| Given offsets [0, 2, 5, 8, 10], produces [0, 0, 1, 1, 1, 2, 2, 2, 3, 3] |
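This offsets-to-parents conversion can be sketched in a few lines of NumPy (the helper name here is illustrative, not necessarily the PR's actual implementation):

```python
import numpy as np

def offsets_to_segment_ids(offsets):
    # Segment i spans offsets[i]..offsets[i+1], so it contains
    # offsets[i+1] - offsets[i] elements; repeat the segment index
    # once per element it contains.
    counts = np.diff(offsets)
    return np.repeat(np.arange(len(counts)), counts)

print(offsets_to_segment_ids(np.array([0, 2, 5, 8, 10])))
# [0 0 1 1 1 2 2 2 3 3]
```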
Ah, this is what we call parents in our kernels :-)
That's useful to know when reading Awkward code - thank you!
* first commit
* some changes
* Working CCCL implementation
* Added a benchmark
* Excluding warmup run
* Don't include the time it takes to move data to the GPU. Also add type annotations to aid transformiterator
* More fixes
* Improve dispatch overhead in helpers.py
* More elimination of overheads in helpers.py
* More optimizations in helpers.py
* Eliminate get overhead.
* Fix a bug where we forgot to copy offsets.
* Add some profiling annotations
* Don't use segmented_reduce
* Eliminate segmented_reduce
* Add profiling script
* Merge nonzero with select
* Improve select_segments performance
* Remove an unnecessary synchronize
* Try stateful op
* Revert "Try stateful op" (reverts commit f0ef2db)
* A few minor updates

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
This PR adds an implementation using the CCCL cuda.compute to the study introduced in #3734.

The goals of this PR are to show what cuda.compute enables.

This PR includes a benchmark.py that runs the simplified analysis on large artificial datasets. Here are results of the script for different data sizes on my workstation with an NVIDIA RTX 6000 Ada GPU and an AMD Ryzen Threadripper PRO 7975WX 32-core CPU. Speedups over the baseline (CPU) implementation are shown in parentheses.