Adding support for larger SharedArray #6385

@Linyou

Description

Concisely describe the proposed feature

SharedArray lets us store intermediate results without writing them back to global memory, which significantly improves performance.

However, as quoted from the CUDA documentation: link

Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Right now, we can only allocate 48 KB of shared memory (SharedArray), even on new high-end GPUs that already offer 99 KB or more of shared memory. Would it be possible for Taichi to support dynamic shared memory in CUDA, so that we can allocate more shared memory?
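For reference, this is what the opt-in described in the quoted CUDA documentation looks like in plain CUDA C++. This is a minimal sketch, not anything Taichi currently exposes; the kernel name, block size, and 64 KB figure are illustrative, and whether the request succeeds depends on the device's compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the shared buffer is declared extern, so its size is
// chosen at launch time rather than statically, and may exceed the 48 KB
// static limit on devices that support it.
__global__ void fill_and_sum(float* out) {
    extern __shared__ float tile[];  // sized by the launch configuration
    int tid = threadIdx.x;
    tile[tid] = 1.0f;
    __syncthreads();
    if (tid == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

int main() {
    const int smem_bytes = 64 * 1024;  // 64 KB, above the 48 KB static cap

    // Explicit opt-in required for >48 KB of dynamic shared memory per block.
    cudaError_t err = cudaFuncSetAttribute(
        fill_and_sum,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        smem_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "opt-in failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    // The third launch parameter is the dynamic shared-memory size in bytes.
    fill_and_sum<<<1, 256, smem_bytes>>>(d_out);
    cudaDeviceSynchronize();

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f\n", h_out);
    cudaFree(d_out);
    return 0;
}
```

The key point is the pairing: `extern __shared__` makes the allocation dynamic, and `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` raises the per-kernel limit before launch. Supporting larger SharedArray in Taichi's CUDA backend would presumably need both steps under the hood.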

Metadata

Assignees: none
Labels: feature request (Suggest an idea on this project)
Type: none
Projects status: Done
Milestone: none
