Description
in dev, in common.hpp, there are tests on __CUDA_ARCH__.
it seems host code should not rely on this macro. while it all seems a bit complex/convoluted to me, i think this is not a pedantic/trivial issue but a sensible one: the host code is compiled once, with __CUDA_ARCH__ undefined (or at least not guaranteed to have any particular value), while the device code is compiled (potentially) multiple times with different values of __CUDA_ARCH__. mind you, i'm a bit fuzzy on the exact way the source code is split between host and device ...
AFAIK, at run time, depending on the GPU's CM, one of these per-CM versions of each device function will actually be used for kernel launches -- maybe multiple ones for the same process when multiple GPUs are in play. so it's not really possible for the host code to have any correct single or static value of the arch.
for caffe, the net result seems to be that the block size is effectively chosen as a static 512, as opposed to 1024 as seems to be the intent (not that the intent is a good idea, mind you). this happens because the code that uses the value (indirectly) is host code (i.e. in a host function), despite being in a .cu file.
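a minimal sketch of the effect, assuming the check in common.hpp is shaped roughly like the #if below. the names kNumThreads and host_block_size and the 200 threshold are placeholders for illustration, not copied from caffe. when this is compiled as plain host code, __CUDA_ARCH__ is undefined, the preprocessor evaluates it as 0, and the 512 branch is taken:

```cpp
// Hypothetical stand-in for the arch check in common.hpp; the constant name
// and the 200 threshold are assumptions made for illustration only.
#if __CUDA_ARCH__ >= 200
const int kNumThreads = 1024;  // only reachable in a device compilation pass
#else
const int kNumThreads = 512;   // host pass: the undefined macro evaluates to 0
#endif

// Host function: whatever it returns was fixed when the host pass was
// preprocessed, regardless of which GPU is present at run time.
inline int host_block_size() {
  return kNumThreads;
}
```

any launch configuration computed from host_block_size() in a host function would use 512, whichever device the kernel ends up running on.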
i'd assume that if one wanted to make the block size depend on the arch, you'd need to use cudaGetDeviceProperties() and case-split at least per cuda device (there are various ways to do something like this).
sticking this cpp code prior to the #if that uses __CUDA_ARCH__ in common.hpp was illustrative for me. note that the warnings get printed many times per compilation of a single .cu file:
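one possible shape for the run-time approach, as a sketch only. choose_block_size is a hypothetical helper, not something in caffe, and the 1024 default just mirrors the apparent intent mentioned above. to keep the snippet compilable without the CUDA toolkit, the per-device limit is passed in as a plain int; in real host code it would be the maxThreadsPerBlock field of the cudaDeviceProp struct filled in by cudaGetDeviceProperties():

```cpp
#include <algorithm>

// Hypothetical helper (not part of caffe): choose a block size from the
// per-device limit instead of from a compile-time macro. In real CUDA host
// code, max_threads_per_block would be cudaDeviceProp::maxThreadsPerBlock,
// obtained via cudaGetDeviceProperties(&prop, device) for each device in use.
inline int choose_block_size(int max_threads_per_block, int preferred = 1024) {
  // Never request more threads per block than the device reports it supports.
  return std::min(preferred, max_threads_per_block);
}
```

the result would have to be computed (and probably cached) per device, since one process can drive several GPUs with different limits.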
#ifndef __CUDA_ARCH__
#warning( "CA undef" )
#else
#warning( "CA def" )
#endif