Add couple configs into generator.py for mixed input MM#1350
Conversation
What do you mean by symmetric (same?). Tensor Core math_instruction shape for both upcast_a and upcast_b is 16816. The supported tile_description (more precisely tile shape) may need to be different for upcast_a vs upcast_b. |
|
By symmetry, I meant on As far as |
|
For For |
|
Thanks for the clarification. I've updated that matching produces at least one line of profiling output, which should mean that the kernel compiled and ran successfully. As a matter of fact, in these tests of mine |
|
Asking again: how to properly run verification after my changes? |
cmake --no-warn-unused-cli -DCMAKE_BUILD_TYPE:STRING=Release -DCUTLASS_NVCC_ARCHS:STRING=80 -DCUTLASS_NVCC_KEEP:STRING=OFF -DCUTLASS_ENABLE_F16C:STRING=ON -DCUTLASS_LIBRARY_KERNELS:STRING=f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8 -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ -S/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/cutlass -B/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/build -G NinjaThe cmake flags to play with are
Apologies for the delayed response. I have been OOO for last few weeks. |
|
Thanks for the clarifications. PR is updated with the changes suggested: Added number of tests, so that it should be all consistent now between tests, Script used to validate that
|
|
This PR has been labeled |
|
@hwu36: Thanks for the test fix! The problem with the configurations added in your commit is that they won't work - one could try to change for example exactly the test you fixed, to one of these configurations (I'm attaching an example in the patch below), and then to do (The reason I removed some configurations from the generator in my commit is exactly that I tried them all, and removed those that won't compile.) |
|
I did not change unit test. The reason that profiler cannot do 128x32 is due to epilogue alignment. I fixed that. So 128x32 is fine now. |
I meant on fixing the test name :-)
Ah, I see - indeed it works this way. So, do you think this PR could be merged now? And, unrelated: any chance to have PR 1190 reviewed by someone from the team - it's stalled for very long time, and this functionality would be still very useful (at least for PyTorch)? |
|
i will merge this pr after 3.5.1 pr is merged. @manishucsd is reviewing pr1190. that one changed mainloop, it will take a while. |
|
Thank you for the change. Overall looks good. Can do the following?
Dump the output put CSV here and make sure all the mixed input variants
|
|
@manishucsd , do you mean this pr or 1190? i did all the testing for this pr myself. |
I meant this one. However, this one is adding only unit tests and more instances to the It will be good know the following two items:
|
I've built and ran the profiler according to instructions you provided above, and the |
@manishucsd: Maybe you could review PR 1413 before 1190? The 1413 is third, and last, of my PRs awaiting for the review, the changes closely follow int8/float16 functionality you implemented initially, and the only thing that may need changes is the conversion. So it should be possible to review/update/merge this one relatively quickly, and this particular mixed data-types combination is also highly requested (well, at least for PyTorch). The 1190 will need more time. It's not ready for merging, just has couple commits providing proof of the concept in slightly different ways, each one with its disadvantages. This one may need input from other members of the team, and will certainly take more time to review, and additional work by me to complete. |
Here is my run |
There was a problem hiding this comment.
Thank you for sharing this. There is nothing in the instance that I could say causing the issue. Can you try smaller range for initializing operandB of BF16. The current range is from -5 to 5, can you do try -3 to 3?
|
Here is mine: Edit: I re-built, without rebasing the branch on latest main, and all rows in the .csv file still have |
|
LGTM. @hwu36 and CUTLASS team can you please merge this? cc: @alexsamardzic |
|
Checking the status on this reviewed PR. If this is already merged? |
It doesn't seem to be merged yet. |
* Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
I'm adding (PR here) CUTLASS kernels as an auto-tune option for PyTorch compiler, and it would be nice to have these additional configurations available. This is not urgent, and more of alike changes may be further desired, so if it's OK to make changes like this then this PR could be kept open for while, and I'll make further additions, as needed, to it.
@manishucsd : Would it make sense for
GenerateSM80_TensorOp_16816_mixed_input_upcast_aandGenerateSM80_TensorOp_16816_mixed_input_upcast_bto be symmetric w.r.t.math_instructionsandtile_descriptions? I can change it through this PR too.