Description
- Provide an interface similar to cudaOccupancyMaxActiveBlocksPerMultiprocessor to facilitate max_blocks estimation
- Add an option to disable the grid-stride loop
Context
I noticed that the grid-stride loop is used by default in cuda_kernel_template_forward. I have a basic understanding of its benefits, but I've encountered some issues:
- The grid-stride loop is always enabled. In cases where I'm certain it's unnecessary, disabling it could provide a slight performance improvement (possibly due to better register usage).
- It's not trivial to estimate an appropriate max_blocks value: I typically need to use tools like Nsight Compute to determine how many blocks can run on an SM before arriving at the final max_blocks setting.
- The grid-stride loop uses size_t (64-bit integer), which may introduce unnecessary 64-bit arithmetic in some cases.
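
To make the three points above concrete, here is a minimal host-side C++ sketch. It is not the generated kernel code; it only emulates the indexing on the CPU. The helper names (estimate_max_blocks, grid_dim_for, emulate_grid_stride, emulate_direct) are hypothetical, and the blocks-per-SM and SM-count values are assumed to come from cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaDeviceProp::multiProcessorCount on a real device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: cap the grid at the number of blocks that can be
// resident on the whole device at once. On a real device, blocks_per_sm would
// be queried with cudaOccupancyMaxActiveBlocksPerMultiprocessor and num_sms
// read from cudaDeviceProp::multiProcessorCount.
inline std::uint32_t estimate_max_blocks(std::uint32_t blocks_per_sm,
                                         std::uint32_t num_sms) {
    return blocks_per_sm * num_sms;
}

// Blocks to launch for n elements: enough to cover n, never more than
// max_blocks, and at least one.
inline std::uint32_t grid_dim_for(std::uint32_t n, std::uint32_t block_dim,
                                  std::uint32_t max_blocks) {
    const std::uint32_t needed = n / block_dim + (n % block_dim != 0 ? 1u : 0u);
    return std::max<std::uint32_t>(1u, std::min(needed, max_blocks));
}

// Host emulation of a grid-stride loop using 32-bit indices.
// Assumption: n + block_dim * grid_dim fits in 32 bits, otherwise i += stride
// could wrap around. Returns how many times each element was visited.
inline std::vector<int> emulate_grid_stride(std::uint32_t n,
                                            std::uint32_t block_dim,
                                            std::uint32_t grid_dim) {
    std::vector<int> visits(n, 0);
    const std::uint32_t stride = block_dim * grid_dim;
    for (std::uint32_t block = 0; block < grid_dim; ++block)
        for (std::uint32_t thread = 0; thread < block_dim; ++thread)
            for (std::uint32_t i = block * block_dim + thread; i < n; i += stride)
                ++visits[i];
    return visits;
}

// Host emulation of the loop-free variant: one element per thread.
// Only covers all of n when block_dim * grid_dim >= n.
inline std::vector<int> emulate_direct(std::uint32_t n,
                                       std::uint32_t block_dim,
                                       std::uint32_t grid_dim) {
    std::vector<int> visits(n, 0);
    for (std::uint32_t block = 0; block < grid_dim; ++block)
        for (std::uint32_t thread = 0; thread < block_dim; ++thread) {
            const std::uint32_t i = block * block_dim + thread;
            if (i < n) ++visits[i];
        }
    return visits;
}
```

With an uncapped grid (grid_dim_for returns the full block count), the loop-free variant already visits every element once, which is the case where an option to disable the grid-stride loop would apply. With a capped grid, only the grid-stride variant covers all elements.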