Description
Hi, I've noticed that cuBLAS incurs a penalty when it performs the transposes needed for row-major input and output, compared with pure column-major. The CUTLASS complex tensor core examples use planar input and output, so we already have to convert to that format from the native cuComplex. Given that we are free to transform into any format we like on input and output, is there a performance difference between row-major and column-major with CUTLASS complex on tensor cores? Should we still try to keep the computation column-major? Note that I'm only asking about the NN case.
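To make the conversion step concrete, here is a minimal host-side sketch of what I mean by folding the layout choice into the interleaved-to-planar conversion. It assumes the source is interleaved and column-major, and uses `std::complex<float>` as a stand-in for cuComplex. `PlanarMatrix` and `to_planar` are hypothetical names for illustration, not CUTLASS API, and the sketch says nothing about which layout is faster on tensor cores.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical container for a planar complex matrix: real and imaginary
// parts live in two separate arrays that share the same layout.
struct PlanarMatrix {
    std::size_t rows, cols;
    bool row_major;
    std::vector<float> real, imag;
};

// Convert an interleaved, column-major rows x cols matrix to planar storage.
// The output layout is chosen in the same pass, so no separate transpose
// is needed when row-major output is requested.
PlanarMatrix to_planar(const std::vector<std::complex<float>>& src,
                       std::size_t rows, std::size_t cols, bool row_major) {
    assert(src.size() == rows * cols);
    PlanarMatrix dst{rows, cols, row_major,
                     std::vector<float>(rows * cols),
                     std::vector<float>(rows * cols)};
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            // Column-major source index: element (r, c) sits at c * rows + r.
            const std::complex<float> v = src[c * rows + r];
            // Destination index depends on the requested layout.
            const std::size_t out = row_major ? r * cols + c : c * rows + r;
            dst.real[out] = v.real();
            dst.imag[out] = v.imag();
        }
    }
    return dst;
}
```

Since the conversion already touches every element once, picking row-major or column-major here costs the same number of reads and writes on the host; the open question is only which layout the tensor core kernels prefer.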
I should also mention, as a separate issue, that I cannot get the profiler to output anything, no matter what I try. I compiled only for complex tensor core support, and the profiler neither runs any tests nor reports any errors:
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop --m 8 --n 8 --k 8
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop --m 8 --n 8 --k 8 --providers=cutlass
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75 --batch_count=1000