Add cuda.compute implementation to CCCL study #3748
Conversation
@nvtx.annotate("empty_like")
def empty_like(array, kind="empty"):
Warning: this function, and the below helpers awkward_to_iterator and reconstruct_with_offsets are largely vibe-coded :)
I found that writing these to be as efficient as possible requires significant interaction with Awkward internals.
In principle, there shouldn't ever be a need for empty_like for awkward arrays because they are immutable (there is also an issue saying the same thing and explaining why only zeros/ones/full_like exist). Apart from that, I think you should be able to achieve this (if it is really needed for some reason that I haven't checked here) using ak.transform: https://awkward-array.org/doc/main/reference/generated/ak.transform.html
with a transformation like
def transformation(layout, **kwargs):
    if layout.is_numpy:
        return ak.contents.NumpyArray(
            xp.empty_like(layout.data)
        )

which will just do the recursion that you implemented here down the whole layout for you. The limitation is that I think this will share buffers between the arrays for the ak.index.Index classes, because the transformation doesn't touch them here (you may need to handle layout classes that have ak.index.Index nodes inside the transformation).
It's also possible to do the same thing using to/from_buffers tricks. You can disassemble an array into buffers and recreate it using ak.from_buffers(*ak.to_buffers(array)). For empty_like, you could maybe just call to_buffers first, get all the buffers, make an empty_like of each of them, and reassemble the array with the new buffers.
Probably something like
form, length, container = ak.to_buffers(array)
new_container = {k: v.copy() for k, v in container.items()}
ak.from_buffers(form, length, new_container)
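The buffer-level idea can be illustrated without Awkward itself: ak.to_buffers returns a flat dict mapping buffer names to ndarrays, so an empty_like of the whole array reduces to an empty_like of each buffer. A minimal NumPy sketch of just that step (the container keys below are made up for illustration, not the names Awkward actually generates):

```python
import numpy as np

# Hypothetical stand-in for the container returned by ak.to_buffers:
# a flat mapping from buffer names to ndarrays.
container = {
    "node0-offsets": np.array([0, 2, 5], dtype=np.int64),
    "node1-data": np.arange(5, dtype=np.float64),
}

# empty_like every buffer: same dtypes and shapes, uninitialized
# contents, and no sharing with the original buffers.
new_container = {k: np.empty_like(v) for k, v in container.items()}

for k, v in container.items():
    assert new_container[k].dtype == v.dtype
    assert new_container[k].shape == v.shape
    assert new_container[k] is not v
```

Feeding new_container back through ak.from_buffers (as in the snippet above) would then rebuild the array around the fresh buffers.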
Regarding the other two helpers that you mentioned, I have only looked at them for 2 seconds, but in general, things that require layout recursions can probably be implemented most of the time in terms of either ak.transform or ak._do.recursively_apply:
Thank you for the suggestions! I'll look into these.
The reason empty_like exists is because cuda.compute algorithms require the user to provide output buffers into which the results are written.
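As an analogy for that caller-provided output-buffer pattern (this is NumPy, not the actual cuda.compute API), ufunc reductions accept a preallocated out array that the result is written into, which is exactly the situation where an empty_like-style allocation is needed first:

```python
import numpy as np

data = np.arange(10, dtype=np.float64).reshape(2, 5)

# Caller allocates the output buffer up front; its initial contents
# don't matter, which is what empty_like provides cheaply.
out = np.empty(5, dtype=np.float64)

# The algorithm writes its result into the preallocated buffer in place.
np.add.reduce(data, axis=0, out=out)
print(out)  # column sums: [ 5.  7.  9. 11. 13.]
```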
Initially, I was indeed using .from_buffers and .to_buffers. While the implementation was quite simple, I found that it was quite expensive. IIRC the profiles showed a lot of time being spent in
Yeah, the awkward array is immutable, but the buffers are not (they are just ndarrays); no user ever interacts with them directly. Indeed, in your case I understand why you want to copy all the buffers to write the result into.
Regarding the to/from_buffers performance, that would make sense, because the reconstitution recursion is complicated; I merely suggested it for simplicity. ak.transform or recursively_apply may give you the same order of performance without repeating code, but for the maximum performance possible it may make sense to write the simplest thing possible, like you have.
@shwina - I think, cupy structured arrays would be a good fit here - see @leofang's comment on #3517 (comment)
I think, it would be suitable for getting the results since awkward allows setting the fields:
ak_zeros_fromnumpy = ak.Array(np.zeros(shape=5, dtype=np.dtype([('index', '<i8'), ('tag', '<i4')])))
ak_zeros_fromnumpy["index"] = [1, 2, 3, 4, 5]
Oh @shwina if the actual values in the buffer don't matter (i.e. you don't really care about doing xp.empty/empty_like in the buffer and you can just have a copy of the existing buffer since you intend to modify all of it), you could also just do an ak.copy which just copies the whole array including the buffers too.
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3748
| """ | ||
| Convert segment offsets to segment IDs (indicators). | ||
|
|
||
| Given offsets [0, 2, 5, 8, 10], produces [0, 0, 1, 1, 1, 2, 2, 2, 3, 3] |
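This offsets-to-parents conversion can be sketched in a few lines of NumPy (the helper name here is illustrative, not necessarily the PR's actual implementation):

```python
import numpy as np

def offsets_to_segment_ids(offsets):
    # Segment i spans offsets[i]..offsets[i+1], so it contains
    # offsets[i+1] - offsets[i] elements; repeat the segment index
    # once per element it contains.
    counts = np.diff(offsets)
    return np.repeat(np.arange(len(counts)), counts)

print(offsets_to_segment_ids(np.array([0, 2, 5, 8, 10])))
# [0 0 1 1 1 2 2 2 3 3]
```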
Ah, this is what we call parents in our kernels :-)
That's useful to know when reading Awkward code - thank you!
* first commit
* some changes
* Working CCCL implementation
* Added a benchmark
* Excluding warmup run
* Don't include the time it takes to move data to the GPU. Also add type annotations to aid transformiterator
* More fixes
* Improve dispatch overhead in helpers.py
* More elimination of overheads in helpers.py
* More optimizations in helpers.py
* Eliminate get overhead.
* Fix a bug where we forgot to copy offsets.
* Add some profiling annotations
* Don't use segmented_reduce
* Eliminate segmented_reduce
* Add profiling script
* Merge nonzero with select
* Improve select_segments performance
* Remove an unnecessary synchronize
* Try stateful op
* Revert "Try stateful op" (reverts commit f0ef2db)
* A few minor updates

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
This PR adds an implementation using the CCCL cuda.compute to the study introduced in #3734.

The goals of this PR are to show what cuda.compute enables.

This PR includes a benchmark.py that runs the simplified analysis on large artificial datasets. Here are results of the script for different data sizes on my workstation with an NVIDIA RTX 6000 Ada GPU and an AMD Ryzen Threadripper PRO 7975WX 32-core CPU. Speedups over the baseline (CPU) implementation are shown in parentheses.