Best Order For Performance #131

@cliffburdick

Hi, I've noticed that cuBLAS pays a penalty for the transposes needed to handle row-major input and output, relative to pure column-major. For the CUTLASS complex tensor core examples, we have an extra luxury: the planar input and output already force us to convert to that format from the native cuComplex. Since we are free to transform into any layout we like on input and output, is there a performance difference between row-major and column-major with CUTLASS complex on tensor cores? Should we also try to keep the computations column-major? Note that I'm only asking about the NN case (neither operand transposed).
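
To make the question concrete, here is a minimal host-side sketch of the conversion step described above. It is an illustration under stated assumptions, not CUTLASS or cuBLAS code: `std::complex<float>` stands in for cuComplex, and `Layout`, `PlanarMatrix`, `offset`, and `to_planar` are hypothetical names. The point is that the layout can be changed during the same pass that splits the interleaved data into real and imaginary planes, so there is no separate transpose.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not part of CUTLASS or cuBLAS.
enum class Layout { RowMajor, ColumnMajor };

struct PlanarMatrix {
    std::size_t rows;
    std::size_t cols;
    Layout layout;
    std::vector<float> real;  // plane holding the real parts
    std::vector<float> imag;  // plane holding the imaginary parts
};

// Linear offset of element (r, c) in a densely packed rows x cols matrix.
inline std::size_t offset(std::size_t r, std::size_t c,
                          std::size_t rows, std::size_t cols, Layout layout) {
    return layout == Layout::RowMajor ? r * cols + c : c * rows + r;
}

// Split an interleaved complex matrix stored in src_layout into two planes
// stored in dst_layout. The layout change happens in the same pass as the
// interleaved-to-planar split.
inline PlanarMatrix to_planar(const std::vector<std::complex<float>>& src,
                              std::size_t rows, std::size_t cols,
                              Layout src_layout, Layout dst_layout) {
    assert(src.size() == rows * cols);
    PlanarMatrix out{rows, cols, dst_layout,
                     std::vector<float>(rows * cols),
                     std::vector<float>(rows * cols)};
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const std::complex<float> v = src[offset(r, c, rows, cols, src_layout)];
            const std::size_t dst = offset(r, c, rows, cols, dst_layout);
            out.real[dst] = v.real();
            out.imag[dst] = v.imag();
        }
    }
    return out;
}
```

For example, `to_planar(a, m, k, Layout::RowMajor, Layout::ColumnMajor)` would produce column-major planes from row-major interleaved input. A real pipeline would do this on the device, most likely with a half-precision element type; both details are omitted here.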

I should also mention (this is a separate issue) that I cannot get the profiler to output anything, no matter what I try. I compiled only for complex tensor core support; the profiler doesn't run any of its tests and reports no errors:

 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --providers=cutlass
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75
 ./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=planar_complex --op_class=tensorop  --m 8 --n 8 --k 8 --min_cc=75 --max_cc=75 --batch_count=1000
