With the great effort by @anaruse in #2090, I've seen encouraging performance boosts. Below is a list of possible improvements I can think of, either for broadening the support or for gaining further speedups. I'd like to know what I've missed or misunderstood.
If users set up a context manager like this:

    with stream:
        arr.sum()
        # do other stuff
The non-default stream should be honored. All of the CUB functions introduced in #2090 support an optional stream argument. We just need to pick up the current stream pointer during setup and modify the wrappers.
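As a rough illustration of "pick up the current stream pointer and pass it to the wrapper", here is a minimal pure-Python toy model. It is not CuPy's implementation: `Stream`, `get_current_stream`, and `cub_sum` below are hypothetical stand-ins defined only for this sketch, and the "reduction" is just Python's `sum`. In CuPy itself the analogous lookup would presumably be `cupy.cuda.get_current_stream().ptr`.

```python
import threading

_state = threading.local()


def _stack():
    # One stream stack per thread, created lazily.
    if not hasattr(_state, "stack"):
        _state.stack = []
    return _state.stack


class Stream:
    """Toy stand-in for a CUDA stream; `ptr` mimics the raw stream handle."""

    def __init__(self, ptr):
        self.ptr = ptr

    def __enter__(self):
        _stack().append(self)  # becomes the current stream
        return self

    def __exit__(self, *exc):
        _stack().pop()  # restore the previous stream
        return False


NULL_STREAM = Stream(ptr=0)  # plays the role of the default stream


def get_current_stream():
    stack = _stack()
    return stack[-1] if stack else NULL_STREAM


def cub_sum(values, stream_ptr=None):
    """Hypothetical wrapper: resolve the stream at call time, then 'launch'."""
    if stream_ptr is None:
        stream_ptr = get_current_stream().ptr
    # A real wrapper would forward stream_ptr to the CUB call here.
    return sum(values), stream_ptr


before = cub_sum([1, 2, 3])      # runs on the default stream
with Stream(ptr=42):
    inside = cub_sum([1, 2, 3])  # picks up the non-default stream
after = cub_sum([1, 2, 3])       # back on the default stream
```

The point of the sketch is only that the wrapper looks up the stream when it is called rather than hard-coding the default one, so code inside the `with` block is honored without any change on the user's side.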
Change the CUB wrappers from def to cdef or cpdef: currently they are all Python def functions. The change could be beneficial for performance. In particular, if we don't want to expose those wrappers to end users, cdef would be a nice choice.
Currently only a full reduction is supported. However, if a reduction over the last axes of a C-contiguous array of shape, say, (X, Y, Z) is needed, this seems possible with a naive loop over the remaining axes. In other words, in this case we can use CUB to do arr.sum(axis=2) or arr.sum(axis=(1,2)), assuming arr is C contiguous. This resembles the current treatment of PlanNd in the FFT module.
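The naive loop could look roughly like the following pure-Python sketch. It assumes the array is given as a flat, C-contiguous buffer plus its shape. `full_sum` is a hypothetical stand-in for the existing CUB full reduction, and `sum_last_axes` is a name made up for this illustration.

```python
from math import prod


def full_sum(segment):
    """Stand-in for a CUB full reduction over one contiguous segment."""
    return sum(segment)


def sum_last_axes(flat, shape, axis):
    """Reduce a C-contiguous buffer over its trailing axes only."""
    ndim = len(shape)
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    axes = tuple(sorted(a % ndim for a in axes))
    n_keep = ndim - len(axes)
    # Only trailing axes keep each reduced segment contiguous in memory.
    if axes != tuple(range(n_keep, ndim)):
        raise ValueError("only reductions over the last axes are supported")
    if len(flat) != prod(shape):
        raise ValueError("buffer size does not match shape")
    outer = prod(shape[:n_keep])  # number of independent reductions
    inner = prod(shape[n_keep:])  # length of each contiguous segment
    # Naive loop over the remaining (kept) axes, one full reduction each.
    return [full_sum(flat[i * inner:(i + 1) * inner]) for i in range(outer)]


shape = (2, 3, 4)
flat = list(range(24))
last = sum_last_axes(flat, shape, axis=2)           # like arr.sum(axis=2), flattened
last_two = sum_last_axes(flat, shape, axis=(1, 2))  # like arr.sum(axis=(1, 2))
```

Because the array is C contiguous, each slice over the trailing axes is one contiguous run, so every iteration can reuse the full-reduction path unchanged. Launching one reduction per outer index is the "naive" part.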
Question: (from #2508 (comment)): is Jenkins configured to test CUB functionalities? UPDATE: No, see #2538 (comment).
Related discussions and pull requests:
- Change the CUB wrappers from def to cdef or cpdef: UPDATE: see "Discussion for possible enhancements of the new CUB support" #2519 (comment).
- axis argument: "Fix alignments for Thrust's complex types" #2562.
- CUB_PATH and CUB_DISABLED: could be avoided if the CUB source code is bundled ("Build the cupy.cuda.cub module by default" #2584).
- argmin and argmax: "Add CUB support for argmax() and argmin()" #2596 enables a global (no axis) search.
- keepdims argument: "keepdims should always preserve all dimensions in CUB-based reductions" #2725.