Best Order For Performance #131

Hi, I've noticed that cuBLAS pays a penalty for the transposes needed to handle row-major input and output, relative to pure column-major. With the CUTLASS planar complex tensor core examples, the planar input and output already force us to convert to that format from the native cuComplex, so we are free to transform into whatever layout we like on input and output. Given that freedom, is there a performance difference between row-major and column-major for CUTLASS complex GEMM on tensor cores? Should we also try to keep the computation column-major? Note that I'm only asking about the NN case.
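To make the conversion step concrete, here is a minimal host-side sketch of what I mean by choosing the layout during the interleaved-to-planar conversion. The `Planar` struct and `to_planar` helper are hypothetical names for illustration, not part of CUTLASS or cuBLAS. It uses `std::complex<float>` in place of cuComplex and keeps the planes in `float`, whereas the real tensor core path would run on the device and likely use half precision.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical container for planar complex data: separate real and imaginary planes.
struct Planar {
    std::vector<float> real;
    std::vector<float> imag;
};

// Split an interleaved, column-major rows x cols matrix into planar form.
// row_major_out selects the layout of both output planes, so the layout
// change costs nothing beyond the copy we already have to make.
Planar to_planar(const std::vector<std::complex<float>>& in,
                 std::size_t rows, std::size_t cols, bool row_major_out) {
    Planar out{std::vector<float>(rows * cols), std::vector<float>(rows * cols)};
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            const std::complex<float> v = in[c * rows + r];  // column-major read
            const std::size_t dst = row_major_out ? r * cols + c : c * rows + r;
            out.real[dst] = v.real();
            out.imag[dst] = v.imag();
        }
    }
    return out;
}
```

Since this copy is needed either way, the only open question is which output layout the GEMM itself prefers.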

Separately (this is a different issue), I cannot get the profiler to output anything, no matter what I try. I compiled only for complex tensor core support, and it won't run any of the profiler tests, yet it reports no errors:
```
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --providers=cutlass
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75 --batch_count=1000
```
