[REQ] Enhance ease of use for grid-stride loop #1270

### Description
* Provide an interface similar to `cudaOccupancyMaxActiveBlocksPerMultiprocessor` to facilitate `max_blocks` estimation
* Add an option to disable the grid-stride loop
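
To make the first request concrete, here is a minimal sketch of what such a helper could look like. `estimate_max_blocks` and `clamp_grid` are hypothetical names that do not exist in the library today. The sketch assumes a 1D launch on the current device and only wraps the CUDA runtime calls `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and `cudaDeviceGetAttribute`.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical helper (host-only arithmetic): cap the grid at the number of
// blocks that can be resident at once, but never launch more blocks than the
// problem needs, and never fewer than one.
inline int clamp_grid(std::size_t n, int block_size, int blocks_per_sm, int num_sms) {
    const std::size_t bs = static_cast<std::size_t>(block_size);
    const std::size_t needed = (n + bs - 1) / bs;  // ceil(n / block_size)
    const std::size_t resident =
        static_cast<std::size_t>(blocks_per_sm) * static_cast<std::size_t>(num_sms);
    const std::size_t blocks = std::max<std::size_t>(1, std::min(needed, resident));
    return static_cast<int>(blocks);
}

// Hypothetical helper: query occupancy for `kernel` on the current device and
// turn it into a max_blocks value for a grid-stride launch over n elements.
template <typename Kernel>
cudaError_t estimate_max_blocks(Kernel kernel, int block_size, std::size_t dyn_smem_bytes,
                                std::size_t n, int* max_blocks) {
    int device = 0, num_sms = 0, blocks_per_sm = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;
    // Resident blocks per SM for this kernel at this block size and
    // dynamic shared memory size.
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                        block_size, dyn_smem_bytes);
    if (err != cudaSuccess) return err;
    *max_blocks = clamp_grid(n, block_size, blocks_per_sm, num_sms);
    return cudaSuccess;
}
```

With something like this, the caller would pass the kernel instantiation and block size once and get a `max_blocks` value back, without having to read the blocks-per-SM figure out of a profiler first.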

### Context
I noticed that `cuda_kernel_template_forward` uses a grid-stride loop by default. I have a basic understanding of its benefits, but I've run into some issues:
* There is no way to turn the loop off. In cases where I'm certain it's unnecessary, disabling it could give a slight performance improvement (possibly due to better register usage).
* Estimating an appropriate `max_blocks` value is not trivial. I typically need a tool like Nsight Compute to find out how many blocks can run on an SM before I can settle on the final `max_blocks` setting.
* The grid-stride loop indexes with `size_t` (a 64-bit integer), which may introduce unnecessary 64-bit arithmetic in some cases.
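
The first and third points could be addressed with compile-time switches. The sketch below is illustrative only and is not the actual `cuda_kernel_template_forward`: `for_each_index`, `UseGridStride`, `Index`, and `can_use_32bit_index` are hypothetical names. It assumes a 1D launch, C++17 (for `if constexpr`), and a functor `f` whose call operator is callable on the device.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical kernel template. Index selects the index width (for example
// std::int32_t or std::size_t); UseGridStride selects at compile time between
// a grid-stride loop and a single bounds-checked element per thread.
template <typename Index, bool UseGridStride, typename F>
__global__ void for_each_index(Index n, F f) {
    Index i = static_cast<Index>(blockIdx.x) * static_cast<Index>(blockDim.x) +
              static_cast<Index>(threadIdx.x);
    if constexpr (UseGridStride) {
        // Each thread walks the range in steps of the total thread count.
        const Index stride =
            static_cast<Index>(gridDim.x) * static_cast<Index>(blockDim.x);
        for (; i < n; i += stride) f(i);
    } else {
        // No loop: the caller must launch at least n threads in total.
        if (i < n) f(i);
    }
}

// Hypothetical host-side check: 32-bit signed indexing is safe only if
// n plus the total number of launched threads fits in int32, so that neither
// the initial index nor `i += stride` can overflow.
inline bool can_use_32bit_index(std::size_t n, std::size_t total_threads) {
    const std::size_t limit =
        static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
    return n <= limit && total_threads <= limit - n;
}
```

A dispatcher could then pick `Index = std::int32_t` when `can_use_32bit_index` returns true and fall back to `std::size_t` otherwise, while `UseGridStride` stays an explicit opt-out for callers who know the grid covers the whole range.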


