Add couple configs into generator.py for mixed input MM by alexsamardzic · Pull Request #1350 · NVIDIA/cutlass

alexsamardzic · 2024-02-16T18:55:31Z

I'm adding (PR here) CUTLASS kernels as an auto-tune option for PyTorch compiler, and it would be nice to have these additional configurations available. This is not urgent, and more of alike changes may be further desired, so if it's OK to make changes like this then this PR could be kept open for while, and I'll make further additions, as needed, to it.

@manishucsd : Would it make sense for GenerateSM80_TensorOp_16816_mixed_input_upcast_a and GenerateSM80_TensorOp_16816_mixed_input_upcast_b to be symmetric w.r.t. math_instructions and tile_descriptions? I can change it through this PR too.

manishucsd · 2024-02-19T23:24:41Z

to be symmetric w.r.t. math_instructions and tile_descriptions.

What do you mean by symmetric (same?). Tensor Core math_instruction shape for both upcast_a and upcast_b is 16816. The supported tile_description (more precisely tile shape) may need to be different for upcast_a vs upcast_b.

alexsamardzic · 2024-02-20T12:56:30Z

By symmetry, I meant on math_instructions list within given generator methods: I was thinking that, if GenerateSM80_SparseTensorOp_16832 method has for example DataType.f16, DataType.f16, DataType.f32 combination listed there, then upcast_a method should have DataType.s8, DataType.f16, DataType.f32 and DataType.u8, DataType.f16, DataType.f32, and upcast_b method should have DataType.f16, DataType.s8, DataType.f32 and DataType.f16, DataType.u8, DataType.f32; and alike for other elements of this list in GenerateSM80_SparseTensorOp_16832 method. I've update the PR with all the changes I think should be made in that regard.

As far as tile_descriptions lists concerned, I thought that most of them should be the same between GenerateSM80_SparseTensorOp_16832, and upcast_a and upcast_b methods - my reasoning was that the multiplication itself is 16-bit in each case. On the other side, less shared memory is used for mixed data-types case, so some configurations may be different, but at least I'd expect some kind of "symmetry" between upcast_a and upcast_b, in the sense that I don't get why the first one has only 9 elements in tile_descriptions list, while the other one has 16.

manishucsd · 2024-02-21T06:39:21Z

For math_instructions makes sense. Yes, we should have the support for combinations you listed. Once you add those, please ensure the references for the same are also in place, run the verification to ensure the kernel runs and verifies.

For tile_descriptions, the 8-bit operand needs to be loaded from GMEM to SMEM and this puts some restrictions on what tile_descriptions (shapes) are currently supported. These may not be same for upcast_a and upcast_b.

alexsamardzic · 2024-02-21T20:15:15Z

Thanks for the clarification. I've updated gemm_fp_mixed_input.cu in my PR. W.r.t. verification - is there an "official" way to do it? I've checked that, on A100, whenever for example there is an item in gemm_fp_mixed_input.cu like:

make_gemm_real_canonical_layouts<
    uint8_t,
    half_t,
    half_t,
    half_t,
    half_t
  >(manifest);

that matching cutlass_profiler run, in this case:

cutlass_profiler --A=u8 --B=f16 --C=f16 --accum=f16

produces at least one line of profiling output, which should mean that the kernel compiled and ran successfully.

As a matter of fact, in these tests of mine cutlass_profiler always produces exactly one line of output, and if I try to force (using cta_m/cta_n/cta_k etc. command-line arguments) different tile description (but still one listed in generator.py, for upcast_a in this particular case) from one that cutlass_profiler printed out, it won't print anything. The cutlass_profiler is not giving any information about tile descriptions that it tried, but that didn't work; on the other side, in the PyTorch mixed data-types GEMM auto-tuning context mentioned in my first comment, more information about the compilation is printed, and I noticed that kernels generated by cutlass_library would fail to compile for some tile descriptions. This is another reason that I'm interested in proper verification.

alexsamardzic · 2024-03-04T12:41:12Z

Asking again: how to properly run verification after my changes?

manishucsd · 2024-03-05T03:37:54Z

For your mixed-input case, add a device-level unit test. Track similar unit test from here.
You should also test if the profiler is working with verification for your mixed-input case. Tips on achiving that:

Use cmake flags to compile only the kernels you are interested in. Use the below cmake command as an example to create your own

cmake --no-warn-unused-cli -DCMAKE_BUILD_TYPE:STRING=Release -DCUTLASS_NVCC_ARCHS:STRING=80 -DCUTLASS_NVCC_KEEP:STRING=OFF -DCUTLASS_ENABLE_F16C:STRING=ON -DCUTLASS_LIBRARY_KERNELS:STRING=f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8 -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ -S/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/cutlass -B/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/build -G Ninja

The cmake flags to play with are

        // "CUTLASS_LIBRARY_KERNELS": "tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8,s16816gemm_bf16_128x128_64x3_tn_align8,s16816gemm_f16_128x128_64x3_tn_align8,h16816gemm_128x128_64x3_tn_align8",
        // Upcast on OperandA and OperandB
        "CUTLASS_LIBRARY_KERNELS": "f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8",
        // Upcast on OperandB only        
        // "CUTLASS_LIBRARY_KERNELS": "s16816gemm_f16_s8_*,s16816gemm_bf16_s8_*,s16816gemm_bf16_128x128_64x*_tn_align8,s16816gemm_f16_128x128_64x*_tn_align8",
        "CUTLASS_LIBRARY_IGNORE_KERNELS": "gemm_grouped*,gemm_planar*"

compile cutlass_profiler, make sure the kernel you are interested is generated and compiled.
Use ./cutlass_profiler --kernels="kernel_name" to run the kernel you are interested.

Apologies for the delayed response. I have been OOO for last few weeks.

alexsamardzic · 2024-03-07T22:05:29Z

Thanks for the clarifications.

PR is updated with the changes suggested: Added number of tests, so that it should be all consistent now between tests, generator.py and gemm_fp_mixed_input.cu. Also fixed several unrelated typos in the generator and tests.

Script used to validate that cutlass_profiler generates kernels for mixed data-types

#! /bin/bash

IFS=","

for cfg in \
    s8,f16,f32,f32 \
    u8,f16,f32,f32 \
    s8,bf16,f32,f32 \
    u8,bf16,f32,f32 \
    s8,f16,f16,f32 \
    u8,f16,f16,f32 \
    s8,bf16,bf16,f32 \
    u8,bf16,bf16,f32 \
    s8,f16,f16,f16 \
    u8,f16,f16,f16
do
    set -- $cfg
    ./tools/profiler/cutlass_profiler \
        --operation=gemm --op_class=tensorop \
        --A=$1 --B=$2 --C=$3 --accum=$4
    read -n1 -s -r -p $"A=$1 B=$2 C=$3 accum=$4 done - Press any key to continue..." key
    ./tools/profiler/cutlass_profiler \
        --operation=gemm --op_class=tensorop \
        --A=$2 --B=$1 --C=$3 --accum=$4
    read -n1 -s -r -p $"A=$2 B=$1 C=$3 accum=$4 done - Press any key to continue..." key
done

github-actions · 2024-04-22T22:05:49Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

alexsamardzic · 2024-07-11T11:47:32Z

@hwu36: Thanks for the test fix! The problem with the configurations added in your commit is that they won't work - one could try to change for example exactly the test you fixed, to one of these configurations (I'm attaching an example in the patch below), and then to do make cutlass_test_unit_gemm_device_mixed_input_tensorop_sm80 - it won't build, reporting first error:

include/cutlass/epilogue/threadblock/output_tile_thread_map.h(262): error: static assertion failed with "Iteration Count Row must be > 0"

(The reason I removed some configurations from the generator in my commit is exactly that I tried them all, and removed those that won't compile.)

patch.txt

hwu36 · 2024-07-11T12:27:41Z

I did not change unit test.

The reason that profiler cannot do 128x32 is due to epilogue alignment. I fixed that. So 128x32 is fine now.

alexsamardzic · 2024-07-11T12:52:50Z

I did not change unit test.

I meant on fixing the test name :-)

The reason that profiler cannot do 128x32 is due to epilogue alignment. I fixed that. So 128x32 is fine now.

Ah, I see - indeed it works this way.

So, do you think this PR could be merged now? And, unrelated: any chance to have PR 1190 reviewed by someone from the team - it's stalled for very long time, and this functionality would be still very useful (at least for PyTorch)?

hwu36 · 2024-07-11T18:19:59Z

i will merge this pr after 3.5.1 pr is merged.

@manishucsd is reviewing pr1190. that one changed mainloop, it will take a while.

manishucsd · 2024-07-11T18:20:27Z

Thank you for the change. Overall looks good. Can do the following?

CUTLASS Profiler Output for All the mixed input GEMMs

build $ cmake ../cutlass/ -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_F16C=ON -DCMAKE_BUILD_TYPE=Release -DCUTLASS_LIBRARY_KERNELS="u8_f16,s8_f16,f16_u8,f16_s8,u8_bf16,s8_bf16,bf16_u8,bf16_s8,"

build $ ./tools/profiler/cutlass_profiler --output=mixed_input.csv

Dump the output put CSV here and make sure all the mixed input variants Disposition column has Passed .

I am hoping you have ran the newer unit tests cutlass_test_unit_gemm_device_mixed_input_tensorop_sm80 and everything passes here.
Can you also run the full test on Ampere atleast? This should help @hwu36 in merging it.

cmake ../cutlass/ -DCUTLASS_NVCC_ARCHS='80'
make all -j32
make test ARGS='-j32 --output-on-error'

hwu36 · 2024-07-11T18:22:04Z

@manishucsd , do you mean this pr or 1190? i did all the testing for this pr myself.

manishucsd · 2024-07-11T18:32:55Z

@manishucsd , do you mean this pr or 1190? i did all the testing for this pr myself.

I meant this one. However, this one is adding only unit tests and more instances to the cutlass_library, and if you already tested it, then this is good to go.

It will be good know the following two items:

Reason behind the failing unit test here.
I do see some unverified rows in the output of cutlass profiler for mixed-input runs.

alexsamardzic · 2024-07-11T20:24:25Z

2. I do see some unverified rows in the output of cutlass profiler for mixed-input runs.

I've built and ran the profiler according to instructions you provided above, and the Disposition is passed for each row in the .csv file produced. Can you clarify about "unverified rows" you found?

alexsamardzic · 2024-07-11T20:34:50Z

i will merge this pr after 3.5.1 pr is merged.

@manishucsd is reviewing pr1190. that one changed mainloop, it will take a while.

@manishucsd: Maybe you could review PR 1413 before 1190? The 1413 is third, and last, of my PRs awaiting for the review, the changes closely follow int8/float16 functionality you implemented initially, and the only thing that may need changes is the conversion. So it should be possible to review/update/merge this one relatively quickly, and this particular mixed data-types combination is also highly requested (well, at least for PyTorch).

The 1190 will need more time. It's not ready for merging, just has couple commits providing proof of the concept in slightly different ways, each one with its disadvantages. This one may need input from other members of the team, and will certainly take more time to review, and additional work by me to complete.

manishucsd · 2024-07-11T21:02:50Z

2. I do see some unverified rows in the output of cutlass profiler for mixed-input runs.
I've built and ran the profiler according to instructions you provided above, and the Disposition is passed for each row in the .csv file produced. Can you clarify about "unverified rows" you found?

Here is my run
mixed_input.gemm.csv

manishucsd · 2024-07-11T21:18:21Z

Thank you for sharing this. There is nothing in the instance that I could say causing the issue. Can you try smaller range for initializing operandB of BF16. The current range is from -5 to 5, can you do try -3 to 3?

alexsamardzic · 2024-07-11T21:24:03Z

Here is mine:
mixed_input.gemm.csv - all fine here. Not sure what could be causing the difference... I tested on A100, the only difference may be that I re-based my branch on CUTLASS main.

Edit: I re-built, without rebasing the branch on latest main, and all rows in the .csv file still have passed as Disposition.

manishucsd · 2024-07-16T18:57:13Z

LGTM. @hwu36 and CUTLASS team can you please merge this?

cc: @alexsamardzic

manishucsd · 2024-08-14T05:48:34Z

Checking the status on this reviewed PR. If this is already merged?

alexsamardzic · 2024-08-14T10:27:03Z

Checking the status on this reviewed PR. If this is already merged?

It doesn't seem to be merged yet.

* Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>

alexsamardzic mentioned this pull request Feb 29, 2024

Add CUTLASS kernel as choice for _int_mm() Inductor autotuning pytorch/pytorch#119685

Closed

alexsamardzic mentioned this pull request Mar 22, 2024

Add support for mixed 4-bit/8-bit data types GEMM #1413

Merged

github-actions Bot added the inactive-30d label Apr 22, 2024

github-actions Bot removed the inactive-30d label Jul 11, 2024

This comment was marked as off-topic.

Sign in to view

manishucsd approved these changes Jul 11, 2024

View reviewed changes

alexsamardzic and others added 3 commits July 12, 2024 14:47

Add couple configs into generator.py for mixed input MM

88ed5f0

change one unit test name; reenable 128x32 in the profiler

8446d90

Added U8/BF16 tests.

834076b

hwu36 approved these changes Aug 16, 2024

View reviewed changes

Merge branch 'main' into add-mixed-input-configs

3ec724e

hwu36 merged commit 3f084f7 into NVIDIA:main Aug 16, 2024

alexsamardzic deleted the add-mixed-input-configs branch August 30, 2024 12:05

Conversation

alexsamardzic commented Feb 16, 2024

Uh oh!

manishucsd commented Feb 19, 2024

Uh oh!

alexsamardzic commented Feb 20, 2024

Uh oh!

manishucsd commented Feb 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexsamardzic commented Feb 21, 2024

Uh oh!

alexsamardzic commented Mar 4, 2024

Uh oh!

manishucsd commented Mar 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexsamardzic commented Mar 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 22, 2024

Uh oh!

alexsamardzic commented Jul 11, 2024

Uh oh!

hwu36 commented Jul 11, 2024

Uh oh!

alexsamardzic commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hwu36 commented Jul 11, 2024

Uh oh!

manishucsd commented Jul 11, 2024

Uh oh!

hwu36 commented Jul 11, 2024

Uh oh!

This comment was marked as off-topic.

Uh oh!

manishucsd commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexsamardzic commented Jul 11, 2024

Uh oh!

alexsamardzic commented Jul 11, 2024

Uh oh!

manishucsd commented Jul 11, 2024

Uh oh!

manishucsd Jul 11, 2024

Choose a reason for hiding this comment

Uh oh!

alexsamardzic commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

manishucsd commented Jul 16, 2024

Uh oh!

manishucsd commented Aug 14, 2024

Uh oh!

alexsamardzic commented Aug 14, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

manishucsd commented Feb 21, 2024 •

edited

Loading

manishucsd commented Mar 5, 2024 •

edited

Loading

alexsamardzic commented Mar 7, 2024 •

edited

Loading

alexsamardzic commented Jul 11, 2024 •

edited

Loading

manishucsd commented Jul 11, 2024 •

edited

Loading

alexsamardzic commented Jul 11, 2024 •

edited

Loading