Adding support for larger SharedArray #6385

@Linyou

Description

Concisely describe the proposed feature

SharedArray lets us store intermediate results without writing them back to global memory, which significantly improves performance.

However, as quoted from the CUDA documentation: link

Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Right now, we can only allocate 48 KB of shared memory (SharedArray), even on new high-end GPUs that already offer 99 KB or more of shared memory. Would it be possible for Taichi to support dynamic shared memory in CUDA, so that we can allocate more shared memory?
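For reference, this is what the opt-in described in the quoted CUDA documentation looks like in plain CUDA C++. This is a minimal sketch, not anything Taichi currently exposes; the kernel name, block size, and 64 KB figure are illustrative, and whether the request succeeds depends on the device's compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the shared buffer is declared extern, so its size is
// chosen at launch time rather than statically, and may exceed the 48 KB
// static limit on devices that support it.
__global__ void fill_and_sum(float* out) {
    extern __shared__ float tile[];  // sized by the launch configuration
    int tid = threadIdx.x;
    tile[tid] = 1.0f;
    __syncthreads();
    if (tid == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

int main() {
    const int smem_bytes = 64 * 1024;  // 64 KB, above the 48 KB static cap

    // Explicit opt-in required for >48 KB of dynamic shared memory per block.
    cudaError_t err = cudaFuncSetAttribute(
        fill_and_sum,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        smem_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "opt-in failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    // The third launch parameter is the dynamic shared-memory size in bytes.
    fill_and_sum<<<1, 256, smem_bytes>>>(d_out);
    cudaDeviceSynchronize();

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f\n", h_out);
    cudaFree(d_out);
    return 0;
}
```

The key point is the pairing: `extern __shared__` makes the allocation dynamic, and `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` raises the per-kernel limit before launch. Supporting larger SharedArray in Taichi's CUDA backend would presumably need both steps under the hood.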

Metadata

Assignees: none
Labels: feature request (Suggest an idea on this project)
Type: none
Projects status: Done
Milestone: none
