Add cuda.compute implementation to CCCL study #3748

Merged
ianna merged 22 commits into scikit-hep:main from shwina:cccl-study-add-cuda-compute-implementation
Nov 26, 2025

@shwina (Contributor) commented Nov 26, 2025

This PR adds an implementation using the CCCL cuda.compute to the study introduced in #3734 .

The goals of this PR are to show that cuda.compute enables:

  • even better performance on GPUs compared to the existing Awkward CUDA backend
  • implementing GPU-accelerated operations with no custom CUDA kernels (everything is pure Python)

This PR includes a benchmark.py that runs the simplified analysis on large artificial datasets.

Here are the results of the script for different data sizes on my workstation with an NVIDIA RTX 6000 Ada GPU and an AMD Ryzen Threadripper PRO 7975WX 32-Cores CPU. Speedups over the baseline (CPU) implementation are shown in parentheses.

| Events | CPU (s) | Awkward CUDA backend (s) | cuda.compute (s) |
|---|---|---|---|
| 10,000 | 0.0094 | 0.0293 (0.32x) | 0.0071 (1.31x) |
| 100,000 | 0.0234 | 0.0297 (0.79x) | 0.0060 (3.89x) |
| 1,000,000 | 0.2400 | 0.0298 (8.06x) | 0.0071 (33.96x) |
| 10,000,000 | 2.4663 | 0.0700 (35.25x) | 0.0243 (101.62x) |
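
For readers who want to reproduce this kind of measurement, here is a minimal timing sketch. It is not the PR's benchmark.py: the names `time_call`, `speedup`, and the `sync` parameter are hypothetical, and taking the best of several runs is an assumption about methodology. It shows the two ideas mentioned in this PR: excluding warm-up runs and waiting for asynchronous GPU work before stopping the clock.

```python
import time


def time_call(fn, *, warmup=1, repeats=5, sync=None):
    """Return the best wall-clock time in seconds of fn() over `repeats` runs.

    `sync` is an optional, caller-supplied callable that blocks until queued
    GPU work has finished (hypothetical parameter; with CuPy one might pass
    cupy.cuda.Device().synchronize). Without it, asynchronous kernel launches
    could make GPU timings look shorter than they really are.
    """

    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs (compilation, caching) and are not timed.
    for _ in range(warmup):
        run_once()

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Speedup of a candidate over a baseline: values above 1 mean faster."""
    return baseline_seconds / candidate_seconds
```

Speedups recomputed from the rounded timings in the table may differ slightly from the figures in parentheses, which were presumably computed from unrounded measurements.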

@shwina shwina changed the title Cccl study add cuda compute implementation Add cuda.compute implementation to CCCL study Nov 26, 2025


@nvtx.annotate("empty_like")
def empty_like(array, kind="empty"):

Contributor Author (@shwina) commented:

Warning: this function and the helpers below, awkward_to_iterator and reconstruct_with_offsets, are largely vibe-coded :)

I found that making these as efficient as possible requires significant interaction with Awkward internals.

@ikrommyd (Collaborator) commented Nov 26, 2025

In principle, there should never be a need for empty_like for Awkward Arrays because they are immutable (there is also an issue that says the same thing and explains why only zeros_like/ones_like/full_like exist). Apart from that, I think you should be able to achieve this (if it is really needed for some reason that I haven't checked here) using ak.transform: https://awkward-array.org/doc/main/reference/generated/ak.transform.html
with a transformation like

def transformation(layout, **kwargs):
    if layout.is_numpy:
        return ak.contents.NumpyArray(
            xp.empty_like(layout.data)
        )

which will just do the recursion that you implemented here down the whole layout for you. The limitation is that I think this will share buffers between the arrays for the ak.index.Index classes, because the transformation doesn't touch them (you may need to handle layout classes that have ak.index.Index nodes inside the transformation).

It's also possible to do the same thing using to/from_buffers tricks. You can disassemble an array into buffers and recreate it using ak.from_buffers(*ak.to_buffers(array)). For empty_like, you could maybe just call to_buffers first, get all the buffers, make an empty_like of each, and reassemble the array with the new buffers.
Probably something like

form, length, container = ak.to_buffers(array)
new_container = {k: v.copy() for k, v in container.items()}
ak.from_buffers(form, length, new_container)
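
To make the buffer-sharing concern concrete, here is a toy model in plain Python. It deliberately does not use Awkward: `ToyJagged` and `toy_empty_like` are hypothetical names, and a real implementation would work on device buffers rather than lists. The sketch only illustrates that an empty_like for a jagged array needs a fresh data buffer plus its own copy of the offsets, so the result does not alias the original's structure.

```python
from dataclasses import dataclass


@dataclass
class ToyJagged:
    """Toy stand-in for a list-of-numbers array (NOT an Awkward class)."""

    offsets: list  # length is number of sublists + 1
    data: list     # flattened values of all sublists


def toy_empty_like(array):
    """Return a new ToyJagged with the same structure and placeholder values."""
    # Copy the structure buffer so the new array never aliases the original's offsets.
    new_offsets = list(array.offsets)
    # Placeholder values; a real buffer would typically be left uninitialized.
    new_data = [0.0] * len(array.data)
    return ToyJagged(new_offsets, new_data)


source = ToyJagged(offsets=[0, 2, 5], data=[1.0, 2.0, 3.0, 4.0, 5.0])
result = toy_empty_like(source)
result.data[0] = 9.0  # writing into the output leaves `source` untouched
```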

@ikrommyd (Collaborator) commented Nov 26, 2025

Regarding the other two helpers that you mentioned, I have only looked at them for 2 seconds, but in general, things that require layout recursion can probably be implemented, most of the time, in terms of either ak.transform or ak._do.recursively_apply:

def recursively_apply(
There may be cases that I'm not considering here though, but I'm just adding general context here which may even help your LLM 😃.

Contributor Author (@shwina) commented:

Thank you for the suggestions! I'll look into these.

The reason empty_like exists is that cuda.compute algorithms require the user to provide output buffers into which the results are written.
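
The calling convention described here can be sketched in plain Python. This is not the cuda.compute API: `segmented_sum_into` is a hypothetical function that only mimics the "caller allocates the output, the algorithm fills it" pattern, which is why an empty_like-style allocation is needed before each call.

```python
def segmented_sum_into(values, offsets, out):
    """Write the sum of each segment of `values` into the caller-provided `out`.

    Segment i covers values[offsets[i]:offsets[i + 1]]. Nothing is returned:
    the result lives in `out`, which the caller must allocate beforehand.
    """
    num_segments = len(offsets) - 1
    if len(out) != num_segments:
        raise ValueError("output buffer must have one slot per segment")
    for i in range(num_segments):
        out[i] = sum(values[offsets[i]:offsets[i + 1]])


values = [1, 2, 3, 4, 5]
offsets = [0, 2, 5]

# The caller allocates the output first (the role empty_like plays in the PR).
out = [None] * (len(offsets) - 1)
segmented_sum_into(values, offsets, out)
# out is now [3, 12]
```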

Initially, I was indeed using .from_buffers and .to_buffers. While the implementation was quite simple, I found that it was quite expensive. IIRC the profiles showed a lot of time being spent in

.

@ikrommyd (Collaborator) commented Nov 26, 2025

Yeah, the Awkward Array is immutable but the buffers are not (they are just ndarrays), although no user ever interacts with them. In your case I understand why you just want to copy all the buffers to write the result.
Regarding the to/from_buffers performance, that makes sense because the reconstitute recursion is complicated; I merely suggested it for simplicity. ak.transform or recursively_apply may give you the same order of performance without repeating code, but for the maximum possible performance it may indeed make sense to write the simplest thing possible, as you have.

Member commented:

@shwina - I think CuPy structured arrays would be a good fit here; see @leofang's comment on #3517 (comment)

I think they would be suitable for getting the results, since Awkward allows setting fields:

ak_zeros_fromnumpy = ak.Array(np.zeros(shape=5, dtype=np.dtype([('index', '<i8'), ('tag', '<i4')])))
ak_zeros_fromnumpy["index"]= [1,2,3,4,5]

Collaborator commented:

Oh @shwina if the actual values in the buffer don't matter (i.e. you don't really care about doing xp.empty/empty_like in the buffer and you can just have a copy of the existing buffer since you intend to modify all of it), you could also just do an ak.copy which just copies the whole array including the buffers too.

@ianna (Member) left a comment:

@shwina - This looks really great! Thank you! The CI will not run on the studies, so I will merge it as is. Thanks again.

@ianna ianna merged commit e547a34 into scikit-hep:main Nov 26, 2025
13 of 15 checks passed
@github-actions commented:

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3748

"""
Convert segment offsets to segment IDs (indicators).

Given offsets [0, 2, 5, 8, 10], produces [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]

Member commented:

Ah, this is what we call parents in our kernels :-)
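
As a reference for the conversion described in the docstring, here is a minimal pure-Python sketch. The name `offsets_to_parents` is hypothetical and this is a sequential illustration, not the PR's GPU implementation.

```python
def offsets_to_parents(offsets):
    """Convert segment offsets to per-element segment IDs ("parents").

    Segment i owns the elements in the half-open range
    [offsets[i], offsets[i + 1]), so its ID is repeated once per element.
    Empty segments contribute no entries.
    """
    parents = []
    for segment_id in range(len(offsets) - 1):
        count = offsets[segment_id + 1] - offsets[segment_id]
        parents.extend([segment_id] * count)
    return parents


parents = offsets_to_parents([0, 2, 5, 8, 10])
# parents == [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
```

With NumPy, the same result can be obtained with `np.repeat(np.arange(len(offsets) - 1), np.diff(offsets))`.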

Contributor Author (@shwina) commented:

That's useful to know when reading Awkward code - thank you!

ikrommyd pushed a commit to ikrommyd/awkward that referenced this pull request Dec 9, 2025
* first commit

* some changes

* Working CCCL implementation

* Added a benchmark

* Excluding warmup run

* Don't include the time it takes to move data to the GPU.

Also add type annotations to aid transformiterator

* More fixes

* Improve dispatch overhead in helpers.py

* More elimination of overheads in helpers.py

* More optimizations in helpers.py

* Eliminate get overhead.

* Fix a bug where we forgot to copy offsets.

* Add some profiling annotations

* Don't use segmented_reduce

* Eliminate segmented_reduce

* Add profiling script

* Merge nonzero with select

* Improve select_segments performance

* Remove an unnecessary synchronize

* Try stateful op

* Revert "Try stateful op"

This reverts commit f0ef2db.

* A few minor updates

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>