[REQ] Enhance ease of use for grid-stride loop #1270

@lmtss

Description

  • Provide an interface similar to cudaOccupancyMaxActiveBlocksPerMultiprocessor to facilitate max_blocks estimation
  • Add an option to disable the grid-stride loop
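For illustration, the estimate requested in the first item might reduce to the arithmetic below. This is a minimal host-side sketch, not an existing Warp interface: `estimate_max_blocks` and `launch_grid_dim` are hypothetical names, and it assumes `max_blocks` simply caps the number of blocks launched. In a CUDA program the two inputs could be obtained from `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and from `cudaDeviceGetAttribute` with `cudaDevAttrMultiProcessorCount`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Warp or the CUDA runtime).
// blocks_per_sm: resident blocks per SM for this kernel and block size, e.g. the
//                value written by cudaOccupancyMaxActiveBlocksPerMultiprocessor.
// sm_count:      number of SMs, e.g. from cudaDeviceGetAttribute with
//                cudaDevAttrMultiProcessorCount.
// Returns a max_blocks value that fills every SM exactly once.
inline int estimate_max_blocks(int blocks_per_sm, int sm_count) {
    assert(blocks_per_sm > 0 && sm_count > 0);
    return blocks_per_sm * sm_count;
}

// Hypothetical helper: blocks actually launched for `dim` threads, i.e. enough
// blocks to cover `dim`, capped at max_blocks (assumed semantics of max_blocks).
inline int launch_grid_dim(std::int64_t dim, int block_dim, int max_blocks) {
    assert(dim >= 0 && block_dim > 0 && max_blocks > 0);
    const std::int64_t needed = (dim + block_dim - 1) / block_dim;  // ceil(dim / block_dim)
    return static_cast<int>(needed < max_blocks ? needed : max_blocks);
}
```

With 8 resident blocks per SM on a device with 108 SMs, for example, the estimate is 864 blocks; a launch of 1,000,000 threads at 256 threads per block would then be capped at 864 blocks and rely on the grid-stride loop for the remainder.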

Context

I noticed that the grid-stride loop is used by default in cuda_kernel_template_forward. I have a basic understanding of its benefits, but I've encountered some issues:

  • The grid-stride loop is enabled by default, but in cases where I'm certain it's unnecessary, disabling it could provide a slight performance improvement (possibly due to better register usage).
  • It's not trivial to estimate an appropriate max_blocks value: I typically need to use tools like Nsight Compute to determine how many blocks can run on an SM before arriving at the final max_blocks setting.
  • The grid-stride loop uses size_t (64-bit integer), which may introduce unnecessary 64-bit arithmetic in some cases.
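To make the first and third points concrete, here is a small host-side emulation of the index arithmetic in a grid-stride loop. It is a sketch under assumptions, not the code in cuda_kernel_template_forward: the function names are hypothetical and the index type is a template parameter so that 32-bit and 64-bit variants can be compared. When the grid already covers every element, each thread runs the loop body at most once, so the loop could in principle be replaced by a single bounds check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical emulation: how many elements a single thread visits in a
// grid-stride loop over `dim` elements. Index is the loop counter type
// (e.g. std::uint32_t or std::uint64_t).
// Note: a narrow Index is only safe when dim + grid_dim * block_dim fits in it;
// otherwise `i += stride` can wrap around.
template <typename Index>
Index grid_stride_iterations(Index dim, Index grid_dim, Index block_dim,
                             Index block_idx, Index thread_idx) {
    assert(grid_dim > 0 && block_dim > 0);
    const Index stride = grid_dim * block_dim;  // total threads in the grid
    Index count = 0;
    for (Index i = block_idx * block_dim + thread_idx; i < dim; i += stride) {
        ++count;  // a kernel body would process element i here
    }
    return count;
}

// True when the grid has at least one thread per element, so every thread
// executes the loop body at most once and `if (i < dim)` would suffice.
template <typename Index>
bool covers_in_one_pass(Index dim, Index grid_dim, Index block_dim) {
    return grid_dim * block_dim >= dim;
}
```

For 1000 elements with 4 blocks of 256 threads, every thread iterates at most once; with only 2 blocks, thread 0 of block 0 visits elements 0 and 512.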

Metadata

Labels

feature request
