Add support for mixed 4-bit/8-bit data types GEMM by alexsamardzic · Pull Request #1413 · NVIDIA/cutlass

alexsamardzic · 2024-03-19T14:49:39Z

No description provided.

alexsamardzic · 2024-03-19T14:55:39Z

More to come here: support for U4, support for generator in the CUTLASS library, etc. Still, opening PR to solicit feedback for S8/S4 and S4/S8 GEMMs that are now available; in particular, I'm interested in eventual suggestions for a faster approach to S4->S8 conversion.

alexsamardzic · 2024-03-20T10:56:48Z

Added more tests.

lezcano · 2024-03-21T09:33:30Z

leftover? (just lurking around)

Yep. But it's there in pretty much all the tests in test/unit/gemm/device; it seems we've just been copying it around... It would be best to remove all of them in a separate PR.

alexsamardzic · 2024-03-22T15:40:23Z

Added generator support for S8/S4 and S4/S8.

AFAIK, implementing generator support for a given operation is not specifically documented, so I want to clarify the steps I've taken here. Basically, I copied the code from the GenerateSM80_TensorOp_16832_TN method into GenerateSM80_TensorOp_16832_TN_mixed_input_upcast_(a|b), and then made some changes:

Obviously, I changed the math_instructions assignments to match the data types actually used for the mixed-input operands.
I'm not sure where smem_usage = 164 in GenerateSM80_TensorOp_16832_TN comes from, and this variable is not used any further anyway, so I skipped it in the new methods.
I've used two sets of alignment constraints. The alignments for operands A and B are the same, but operand C (and thus the result too) could be either 32-bit or 8-bit. The code at the end of the mixed-input methods, within the last if statement, handles the latter case, and the alignments are changed there accordingly. (Note that for GenerateSM80_TensorOp_16816_mixed_input_upcast_(a|b) there are snippets of code at the end of the methods doing a similar thing, but they're slightly different from each other, and also from what I did here.)
The tile_descriptions were initially copied from GenerateSM80_TensorOp_16832_TN; I then made sure that all the relevant kernels were compiled (by adding CUTLASS_LIBRARY_KERNELS="*i16832gemm*" to the CMake command line), and removed the tiles that failed to compile.

I did the verification as @manishucsd suggested here: As mentioned above, I did the build with all the relevant kernels included, and then I verified that cutlass_profiler would run all the tile variations that are specified in GenerateSM80_TensorOp_16832_TN_mixed_input_upcast_(a|b). Note that the profiler would produce Disposition: Incorrect for all the kernels with 8-bit output; I suppose it's related to saturation - I'm not sure if I should actually come up with applying saturation somehow for this combination of input data types?

Overall, this PR now contains everything that I intended to do for S4/S8 and S8/S4 GEMM, and it's ready for review. It has grown somewhat large, so I'd suggest to have it reviewed and eventually merged, and then I can add U4/U8 and U8/U4, and maybe U4/S8 and S8/U4, support in follow-up PR(s).

andrewor14 · 2024-03-22T19:32:14Z

Hi @alexsamardzic, thanks for working on this. Just wanted to clarify, will this kernel support int4 grouped per channel weight quantization + int8 per token dynamic activation quantization?

alexsamardzic · 2024-03-22T19:48:30Z

> Hi @alexsamardzic, thanks for working on this. Just wanted to clarify, will this kernel support int4 grouped per channel weight quantization + int8 per token dynamic activation quantization?

This kernel is just an int4/int8 GEMM, producing an int32 (or int8) result. Quantization is not to be supported by CUTLASS directly, but could be implemented using an EVT epilogue. In particular, I'm trying to get this feature into CUTLASS mainly in order to have this particular operation supported in PyTorch, where using it together with quantization is the primary motivation.

alexsamardzic · 2024-04-02T06:49:24Z

@manishucsd, @hwu36: Would it be possible for someone to review this PR (and eventually #1350 too)? These should not be controversial, are needed by PyTorch, and for this one I'd like to proceed with another PR to add other 4-bit/8-bit integer combinations that make sense.

hwu36 · 2024-04-18T18:35:15Z

working on it now.

Hongbosherlock · 2024-05-06T09:28:54Z

> > Hi @alexsamardzic, thanks for working on this. Just wanted to clarify, will this kernel support int4 grouped per channel weight quantization + int8 per token dynamic activation quantization?

> This kernel is just an int4/int8 GEMM, producing an int32 (or int8) result. Quantization is not to be supported by CUTLASS directly, but could be implemented using an EVT epilogue. In particular, I'm trying to get this feature into CUTLASS mainly in order to have this particular operation supported in PyTorch, where using it together with quantization is the primary motivation.

Great job! How can I integrate this PR with PyTorch? Is there any example code available? @alexsamardzic

alexsamardzic · 2024-05-06T15:18:41Z

> How can I integrate this PR with PyTorch? Is there any example code available? @alexsamardzic

The primary motivation for this PR is to have this combination of operands supported by PyTorch, so the integration should be coming soon.

Hongbosherlock · 2024-05-08T09:11:03Z

> > How can I integrate this PR with PyTorch? Is there any example code available? @alexsamardzic

> The primary motivation for this PR is to have this combination of operands supported by PyTorch, so the integration should be coming soon.

I'm a beginner with CUTLASS, and I have no idea how to use my own s4/s8 data to run this GEMM.
Could you please provide example code for testing this s4/s8 GEMM, like the official example here:
https://github.com/NVIDIA/cutlass/blob/main/examples/55_hopper_mixed_dtype_gemm/README.md

alexsamardzic · 2024-05-08T13:11:32Z

> I'm a beginner with CUTLASS, and I have no idea how to use my own s4/s8 data to run this GEMM. Could you please provide example code for testing this s4/s8 GEMM, like the official example here: https://github.com/NVIDIA/cutlass/blob/main/examples/55_hopper_mixed_dtype_gemm/README.md

These changes are not for Hopper, but for the Ampere architecture. The code to run an s4/s8 GEMM would be the same as for any other GEMM, for example s8/s8, except that when the GEMM template is instantiated, the data types and other arguments should be specified accordingly. For some examples of this, see the using Gemm = cutlass::gemm::device::GemmUniversal... template instantiations in the test cases added by this PR to the test/unit/gemm/device directory. As far as your data is concerned, s4 data should be provided as two successive values packed into a single byte, and that's all.
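
To illustrate the packing described above, here is a minimal host-side sketch in plain C++ (not part of this PR or of CUTLASS). The helper name pack_s4 is hypothetical, and placing the first value of each pair in the low nibble is an assumption that should be checked against the layout the kernel actually expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: packs signed 4-bit values (each in [-8, 7]) two per byte.
// Assumption: the first value of each pair occupies the low nibble.
inline std::vector<std::uint8_t> pack_s4(const std::vector<int>& values) {
  if (values.size() % 2 != 0) {
    throw std::invalid_argument("pack_s4: need an even number of values");
  }
  std::vector<std::uint8_t> packed(values.size() / 2);
  for (std::size_t i = 0; i < values.size(); i += 2) {
    const int lo = values[i];
    const int hi = values[i + 1];
    if (lo < -8 || lo > 7 || hi < -8 || hi > 7) {
      throw std::out_of_range("pack_s4: value does not fit into 4 bits");
    }
    // Keep the two's complement low 4 bits of each value.
    packed[i / 2] = static_cast<std::uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
  }
  return packed;
}
```

With this convention, the pair (1, -1) becomes the byte 0xF1, and the resulting buffer is what would be handed to the GEMM as the s4 operand.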

alexsamardzic · 2024-05-13T10:00:21Z

On a quick look, your strides may be wrong.

zkf331 · 2024-05-13T12:04:08Z

> On a quick look, your strides may be wrong.

Thank you for your prompt reply. I don't know much about this parameter, and I can't find many references. Could you give me some more details? Thank you very much.

Hongbosherlock · 2024-05-16T03:46:16Z

> > I'm a beginner with CUTLASS, and I have no idea how to use my own s4/s8 data to run this GEMM. Could you please provide example code for testing this s4/s8 GEMM, like the official example here: https://github.com/NVIDIA/cutlass/blob/main/examples/55_hopper_mixed_dtype_gemm/README.md

> These changes are not for Hopper, but for the Ampere architecture. The code to run an s4/s8 GEMM would be the same as for any other GEMM, for example s8/s8, except that when the GEMM template is instantiated, the data types and other arguments should be specified accordingly. For some examples of this, see the using Gemm = cutlass::gemm::device::GemmUniversal... template instantiations in the test cases added by this PR to the test/unit/gemm/device directory. As far as your data is concerned, s4 data should be provided as two successive values packed into a single byte, and that's all.

I have two s4 values packed in a single byte (uint8). Do I need to manually unpack the uint8 data to get s4 data before the GEMM?

alexsamardzic · 2024-05-16T10:37:53Z

> I have two s4 values packed in a single byte (uint8). Do I need to manually unpack the uint8 data to get s4 data before the GEMM?

No, s4 values should be packed, two values per byte.
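
The kernel consumes the packed bytes directly, so unpacking is only useful on the host, for example when computing a reference result to compare against. A minimal sketch of that, assuming the low nibble holds the first value of each pair (the helper name unpack_s4 is hypothetical, not a CUTLASS function):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: expands packed signed 4-bit values into int8_t values.
// Assumption: the low nibble holds the first value of each pair.
inline std::vector<std::int8_t> unpack_s4(const std::vector<std::uint8_t>& packed) {
  std::vector<std::int8_t> values;
  values.reserve(packed.size() * 2);
  for (std::uint8_t byte : packed) {
    const int nibbles[2] = {byte & 0xF, byte >> 4};
    for (int nibble : nibbles) {
      // Sign-extend: bit patterns 8..15 represent the values -8..-1.
      values.push_back(static_cast<std::int8_t>((nibble ^ 8) - 8));
    }
  }
  return values;
}
```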

Hongbosherlock · 2024-05-23T09:23:36Z

> > I have two s4 values packed in a single byte (uint8). Do I need to manually unpack the uint8 data to get s4 data before the GEMM?

> No, s4 values should be packed, two values per byte.

Thanks for your help! I can get the correct result now, but I have another question:
Assume that A is int8 with shape (M, K) and B is int4 with shape (K, N); after the GEMM C = A·B, C will be (M, N). Now, I have another matrix E, which is fp32 and also (M, N). I want to perform the element-wise multiplication E * C. Can I do this element-wise multiplication within this s4/s8 GEMM operation, for example by passing matrix E to Arguments? I am not sure how to do this.
Maybe this is an example of what I want to do: https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h#L131

alexsamardzic · 2024-05-23T10:25:50Z

> Assume that A is int8 with shape (M, K) and B is int4 with shape (K, N); after the GEMM C = A·B, C will be (M, N). Now, I have another matrix E, which is fp32 and also (M, N). I want to perform the element-wise multiplication E * C. Can I do this element-wise multiplication within this s4/s8 GEMM operation, for example by passing matrix E to Arguments?

If matrix E is really MxN (i.e. not broadcasted), it doesn't seem that the code you linked is doing this exact operation. I'd say the simplest way to achieve this would be through EVT epilogues, these are exactly for the purpose of fusing matrix multiplications with arbitrary operations. For Ampere, there is examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu example demonstrating how to use EVT epilogues, you'd have to remove everything related to Bias/C1 matrices in this example, to use C2 as your matrix E, and then to replace cutlass::plus with cutlass::multiplies in using Compute2 = ... (also, you should take care that all of the data types in the template instantiations are correctly specified).

Hongbosherlock · 2024-05-28T07:29:36Z

> > Assume that A is int8 with shape (M, K) and B is int4 with shape (K, N); after the GEMM C = A·B, C will be (M, N). Now, I have another matrix E, which is fp32 and also (M, N). I want to perform the element-wise multiplication E * C. Can I do this element-wise multiplication within this s4/s8 GEMM operation, for example by passing matrix E to Arguments?

> If matrix E is really MxN (i.e. not broadcasted), it doesn't seem that the code you linked is doing this exact operation. I'd say the simplest way to achieve this would be through EVT epilogues, these are exactly for the purpose of fusing matrix multiplications with arbitrary operations. For Ampere, there is examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu example demonstrating how to use EVT epilogues, you'd have to remove everything related to Bias/C1 matrices in this example, to use C2 as your matrix E, and then to replace cutlass::plus with cutlass::multiplies in using Compute2 = ... (also, you should take care that all of the data types in the template instantiations are correctly specified).

Thanks, I’m trying this, but it’s not going well currently.
To make it clearer, what I want to do is exactly the following:

    // inputs
    //     A           [M, K]    int8
    //     B           [N, K]    int4
    //     alphaCol    [M, 1]    fp32
    //     alphaRow    [1, N]    fp32
    // outputs
    //     mat [M, N]            fp32

That is: (alphaCol x alphaRow) * (A x B)
I think here is an s8/s8 example (A and B are both int8): https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h#L131, which also uses EVT, and the inputs are passed from here
I wonder if I could use the same EVT code and using Gemm = cutlass::gemm::device::GemmUniversalBaseCompat<GemmKernel> with this s4/s8 GEMM.

alexsamardzic · 2024-05-28T13:48:53Z

> Thanks, I’m trying this, but it’s not going well currently. To make it clearer, what I want to do is exactly the following:
>
>     // inputs
>     //     A           [M, K]    int8
>     //     B           [N, K]    int4
>     //     alphaCol    [M, 1]    fp32
>     //     alphaRow    [1, N]    fp32
>     // outputs
>     //     mat [M, N]            fp32

Well, that's not element-wise multiplication with MxN tensor, as stated initially... The example from TensorRT-LLM that you're linking to is not using EVT, but some kind of their CUTLASS extension instead (probably because CUTLASS had no EVT support at that time). I don't have time to look into these, so I can only point you again to examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu example, but now you should take a look into how Bias is applied, and ignore everything C1/C2 related. With cutlass::plus replaced with cutlass::multiplies, your alphaCol will be applied the same way, and once you understand how this works, it should be easy to interchange offsets and apply alphaRow accordingly too.
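
Before wiring the scaling into an epilogue, it can help to have a host reference for the intended result. The sketch below computes (alphaCol x alphaRow) * (A x B) with the shapes listed above; the function name scaled_gemm_reference is hypothetical, and for simplicity it assumes the int4 values of B have already been expanded to int8 on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host reference for mat[m][n] = alphaCol[m] * alphaRow[n] * sum_k A[m][k] * B[n][k].
// A is [M, K] row-major int8. B is stored as [N, K], i.e. a column-major K x N matrix.
// Assumption: the int4 values of B were expanded to int8 beforehand.
inline std::vector<float> scaled_gemm_reference(
    const std::vector<std::int8_t>& A, const std::vector<std::int8_t>& B,
    const std::vector<float>& alphaCol, const std::vector<float>& alphaRow,
    std::size_t M, std::size_t N, std::size_t K) {
  std::vector<float> mat(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      std::int32_t acc = 0;  // int32 accumulation, as in the kernel
      for (std::size_t k = 0; k < K; ++k) {
        acc += static_cast<std::int32_t>(A[m * K + k]) *
               static_cast<std::int32_t>(B[n * K + k]);
      }
      mat[m * N + n] = alphaCol[m] * alphaRow[n] * static_cast<float>(acc);
    }
  }
  return mat;
}
```

Comparing the kernel output against such a reference after each epilogue change makes it easier to tell whether the row or the column scaling is the one that is off.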

Hongbosherlock · 2024-06-03T13:33:47Z

> > Thanks, I’m trying this, but it’s not going well currently. To make it clearer, what I want to do is exactly the following:
> >
> >     // inputs
> >     //     A           [M, K]    int8
> >     //     B           [N, K]    int4
> >     //     alphaCol    [M, 1]    fp32
> >     //     alphaRow    [1, N]    fp32
> >     // outputs
> >     //     mat [M, N]            fp32

> Well, that's not element-wise multiplication with MxN tensor, as stated initially... The example from TensorRT-LLM that you're linking to is not using EVT, but some kind of their CUTLASS extension instead (probably because CUTLASS had no EVT support at that time). I don't have time to look into these, so I can only point you again to examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu example, but now you should take a look into how Bias is applied, and ignore everything C1/C2 related. With cutlass::plus replaced with cutlass::multiplies, your alphaCol will be applied the same way, and once you understand how this works, it should be easy to interchange offsets and apply alphaRow accordingly too.

I got errors with ElementB = cutlass::int4b_t when I tried to follow this example:

    using EVTKernelStreamK =
        typename cutlass::gemm::kernel::DefaultGemmWithVisitor<
        ElementA, LayoutA, cutlass::ComplexTransform::kNone, AlignmentA,
        ElementB, LayoutB, cutlass::ComplexTransform::kNone, AlignmentB,
        ElementC, LayoutC, AlignmentC,
        ElementAccumulator,
        ElementCompute,
        cutlass::arch::OpClassTensorOp,
        cutlass::arch::Sm80,
        ThreadblockShape,
        WarpShape,
        InstructionShape,
        EVTD,
        cutlass::gemm::threadblock::ThreadblockSwizzleStreamK,
        NumStages,
        cutlass::arch::OpMultiplyAdd,
        EVTEpilogueStages
    >::GemmKernel;

error message:

cutlass/include/cutlass/gemm/warp/mma_tensor_op_policy.h(58): error: incomplete type is not allowed
detected during:
instantiation of class "cutlass::gemm::warp::MmaTensorOpPolicy<Operator_, OpDelta_> [with Operator_=cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, int8_t, cutlass::layout::RowMajor, cutlass::int4b_t, cutlass::layout::ColumnMajor, int32_t, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, OpDelta_=cutlass::MatrixShape<1, 1>]"

The complete log is here.
When I set ElementB = int8_t, it's OK. Is it because DefaultGemmWithVisitor doesn't support s8/s4?
If possible, pointers in the right direction would be greatly appreciated, thanks!

alexsamardzic · 2024-06-03T13:41:05Z

> I got errors with ElementB = cutlass::int4b_t when I tried to follow this example:

Are you using CUTLASS main, or the branch from this PR?

Hongbosherlock · 2024-06-04T02:27:18Z

> > I got errors with ElementB = cutlass::int4b_t when I tried to follow this example:

> Are you using CUTLASS main, or the branch from this PR?

I'm using this PR branch: alexsamardzic:add-mixed-4bit-8bit-gemm, I can get the right s4/s8 GEMM result with using Gemm = cutlass::gemm::device::GemmUniversal as added in the test/unit/gemm/device.

// ok
using Gemm = cutlass::gemm::device::GemmUniversal<
  ElementA,                // ElementA
  cutlass::layout::RowMajor,       // LayoutA
  ElementB,                // ElementB
  cutlass::layout::ColumnMajor,    // LayoutB
  ElementOutput,                         // ElementOutput
  cutlass::layout::RowMajor,       // LayoutOutput
  ElementAccumulator,                         // ElementAccumulator
  cutlass::arch::OpClassTensorOp,  // tag indicating Tensor Cores
  cutlass::arch::Sm80,  // tag indicating target GPU compute architecture
  cutlass::gemm::GemmShape<128, 128, 64>,
  cutlass::gemm::GemmShape<64, 64, 64>,
  cutlass::gemm::GemmShape<16, 8, 32>,
  cutlass::epilogue::thread::LinearCombination<
  ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
  ElementAccumulator, ElementAccumulator>,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
  4,   // Stages
  16,  // AlignmentA
  32,  // AlignmentB
  cutlass::arch::OpMultiplyAddMixedInputUpcast,
  cutlass::ComplexTransform::kNone,
  cutlass::ComplexTransform::kNone
   >;

But when I try to use GemmUniversalBase or GemmUniversalAdapter, which need a GemmKernel to be specified, it doesn't work with cutlass::int4b_t, while it does with int8_t.

// get errors
    using EVTKernelStreamK =
        typename cutlass::gemm::kernel::DefaultGemmWithVisitor<
        ElementA, LayoutA, cutlass::ComplexTransform::kNone, AlignmentA,
        ElementB, LayoutB, cutlass::ComplexTransform::kNone, AlignmentB,
        ElementC, LayoutC, AlignmentC,
        ElementAccumulator,
        ElementCompute,
        cutlass::arch::OpClassTensorOp,
        cutlass::arch::Sm80,
        ThreadblockShape,
        WarpShape,
        InstructionShape,
        EVTD,
        cutlass::gemm::threadblock::ThreadblockSwizzleStreamK,
        NumStages,
        cutlass::arch::OpMultiplyAdd,
        EVTEpilogueStages
    >::GemmKernel;   //  where is the key I think

    using DeviceGemmStreamK = cutlass::gemm::device::GemmUniversalAdapter<EVTKernelStreamK>;

I don’t know much about warp-level computation. This PR modifies the file include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h, but the errors are related to include/cutlass/gemm/warp/mma_tensor_op_policy.h(58) and include/cutlass/gemm/warp/mma_tensor_op.h(108), as you can see in the log.
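
As a side note on the working GemmUniversal instantiation above: its AlignmentA = 16 and AlignmentB = 32 are consistent with 128-bit accesses for 8-bit and 4-bit elements, which is the same arithmetic as the 128 / cutlass::sizeof_bits<ElementOutput>::value expression in its epilogue. A tiny sketch of that arithmetic, with a hypothetical helper name and under the assumption that 128-bit accesses are what is intended:

```cpp
// Hypothetical helper: number of elements covered by one 128-bit access.
constexpr int elements_per_128_bits(int element_bits) {
  return 128 / element_bits;
}

static_assert(elements_per_128_bits(8) == 16, "int8 operand: alignment 16");
static_assert(elements_per_128_bits(4) == 32, "int4 operand: alignment 32");
static_assert(elements_per_128_bits(32) == 4, "int32 output: alignment 4");
```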

alexsamardzic · 2024-06-04T13:20:28Z

> But when I try to use GemmUniversalBase or GemmUniversalAdapter, which need a GemmKernel to be specified, it doesn't work with cutlass::int4b_t, while it does with int8_t.

Can you post your full code here?

alexsamardzic · 2024-06-04T14:59:45Z

This code uses PyTorch, can you post a reproducible example that uses CUTLASS only?

Hongbosherlock · 2024-06-05T09:14:18Z

> This code uses PyTorch, can you post a reproducible example that uses CUTLASS only?

Hi @alexsamardzic , I have pushed my code here: https://github.com/Hongbosherlock/cutlass/blob/add-mixed-4bit-8bit-gemm/examples/61_s4s8_gemm/s4s8_gemm.cu#L114

You can add this example, then compile and run it:

# cutlass/build$ cmake .. -DCUTLASS_NVCC_ARCHS=80

# cutlass/build$ make 61_s4s8_gemm 

#  cutlass/build$ ./examples/61_s4s8_gemm/61_s4s8_gemm

when ElementB = int8_t, it seems ok. you can get the result:

But when ElementB = cutlass::int4b_t, lots of compilation errors occur.

I am really at a loss and would greatly appreciate any guidance or help you can provide. Thank you very much in advance for your time and assistance!

alexsamardzic · 2024-06-06T11:00:17Z

> I have pushed my code here: https://github.com/Hongbosherlock/cutlass/blob/add-mixed-4bit-8bit-gemm/examples/61_s4s8_gemm/s4s8_gemm.cu

Replace cutlass::arch::OpMultiplyAddSaturate with cutlass::arch::OpMultiplyAddMixedInputUpcast.

Hongbosherlock · 2024-06-06T14:08:33Z

> > I have pushed my code here: https://github.com/Hongbosherlock/cutlass/blob/add-mixed-4bit-8bit-gemm/examples/61_s4s8_gemm/s4s8_gemm.cu

> Replace cutlass::arch::OpMultiplyAddSaturate with cutlass::arch::OpMultiplyAddMixedInputUpcast.

Works for me. Thanks!

alexsamardzic · 2024-06-06T14:27:52Z

> Works for me. Thanks!

Good. Remember that CUTLASS is a heavily templated library, but only a small number of all the possible template argument combinations actually work together, so one cannot just paste pieces of code from different sources and expect them to work.

Hongbosherlock · 2024-06-07T03:07:43Z

> > Works for me. Thanks!

> Good. Remember that CUTLASS is a heavily templated library, but only a small number of all the possible template argument combinations actually work together, so one cannot just paste pieces of code from different sources and expect them to work.

Yea, that was a mistake. OpMultiplyAddMixedInputUpcast did appear in the test folder. I think I was a bit disoriented. There are too many arguments.
I think CUTLASS is somewhat challenging for beginners. Do you have any recommended learning paths?

alexsamardzic · 2024-06-07T17:01:52Z

> I think CUTLASS is somewhat challenging for beginners. Do you have any recommended learning paths?

Going through relevant examples, as well as unit tests, in the CUTLASS source tree is probably still the best way to start.

Hongbosherlock · 2024-06-19T09:39:16Z

Hi @alexsamardzic, in fact, I am working on a GEMM+de-quantization fusion kernel for W4A8 based on your PR, similar to the W8A8 kernel for PyTorch here (GEMM+de-quantization), which also uses EVT.

I have completed most of the work and used EVT to finish the de-quantization.
For simplicity, I started with W8A8 to verify that my EVT is correct, and then I will modify it for W4A8.
Here is the W8A8 GEMM+DQ code. The code can be compiled, but the results are incorrect compared to the Python simulation results.
I tried printing some elements, and they are very large values like 18766625, which might indicate an address access error.

Can you please have a look at what the possible issues might be?
Thanks!

alexsamardzic · 2024-06-19T12:21:15Z

> Can you please have a look at what the possible issues might be? Thanks!

I'm sorry, but this has nothing to do with this particular PR, and unfortunately I don't have cycles to help you with this. You need to be sure that you understand building EVT epilogues, as well as specifying corresponding arguments. Then, in your position I would start with a simple epilogue that is just storing values from the accumulator into the output tensor. If results match expected ones, then I would add nodes into the EVT epilogue that do the multiplication, one by one, and would keep comparing results with the expected ones. When there is mismatch, you should know where to look for the fix.
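
The incremental procedure described above relies on comparing the kernel output with expected values after each added epilogue node. A minimal sketch of such a comparison on host buffers (the function name first_mismatch and the choice of an absolute tolerance are assumptions for illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first element where the buffers differ by more than
// tol, the common length if only the sizes differ, or -1 if the buffers match.
inline std::ptrdiff_t first_mismatch(const std::vector<float>& expected,
                                     const std::vector<float>& actual,
                                     float tol) {
  const std::size_t n = std::min(expected.size(), actual.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (std::fabs(expected[i] - actual[i]) > tol) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  if (expected.size() != actual.size()) {
    return static_cast<std::ptrdiff_t>(n);
  }
  return -1;
}
```

Knowing the index of the first mismatch (and hence its row and column) often hints at whether a stride, an offset, or a broadcast direction is wrong.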

Hongbosherlock · 2024-07-05T08:47:50Z

Hi @alexsamardzic, thanks for your help, I have got it working.

When profiling a single GEMM, do you think the performance of s4/s8 will be better than that of s8/s8?
In my test, s8/s8 (int8 GEMM) is faster.

alexsamardzic · 2024-07-05T09:20:16Z

> When profiling a single GEMM, do you think the performance of s4/s8 will be better than that of s8/s8? In my test, s8/s8 (int8 GEMM) is faster.

The GEMM actually performed is the same: S8/S8 in both cases. The bandwidth used in transfers between global and shared memory, and then between shared memory and registers, is smaller in the S4/S8 case, but on the other hand there are additional calculations for the S4->S8 conversion in this case. Thus, very broadly: S4/S8 and S8/S8 should be in the same ballpark regarding performance, but it is not unusual for one to be faster than the other for specific input shapes and selected tile sizes.
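
To put a rough number on the bandwidth side of this trade-off, the sketch below computes operand sizes only, using the problem size profiled in this PR's review (m=3456, n=4096, k=2048) as an example. The helper name operand_bytes is hypothetical, and operand size is only a proxy: actual memory traffic depends on tiling and caching.

```cpp
#include <cstdint>

// Hypothetical helper: bytes occupied by a rows x cols operand whose elements are
// element_bits wide (sub-byte elements are assumed to be densely packed).
constexpr std::int64_t operand_bytes(std::int64_t rows, std::int64_t cols,
                                     int element_bits) {
  return rows * cols * element_bits / 8;
}

// A is m x k in s8 in both cases; only the k x n operand B changes width.
static_assert(operand_bytes(3456, 2048, 8) == 7077888, "A as s8");
static_assert(operand_bytes(2048, 4096, 8) == 8388608, "B as s8");
static_assert(operand_bytes(2048, 4096, 4) == 4194304, "B as s4: half the bytes");
```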

manishucsd

Thank you for working on this. Apologies for a delayed review. LGTM.
Over to NVIDIA/CUTLASS (cc: @hwu36 ) for merging this.

manishucsd · 2024-08-14T06:45:29Z

How did you come up with this TileDescription list for S8 x S4? I guess you carried these over from S8 x S8. Please make sure all of these pass verification. You can follow steps similar to here to instantiate all the tile shapes listed here by using -DCUTLASS_LIBRARY_KERNELS="s8_s4,s4_s8". By default, the build process only instantiates the 128x128 tile shape.

Can you please run the verification and profiling on --m=3456 --n=4096 --k=2048 on an A100?

Please compile using -DCUTLASS_LIBRARY_KERNELS="s8_s4,s4_s8,s4,s8" to also have s4 x s4 and s8 x s8 kernels in the runs.

The tile selection is described in a comment above; also, as mentioned in this comment, I did the verification. I will repeat the verification procedure, together with profiling, and report the outcome here.

There was a build issue after rebasing on the latest main: basically, OpMultiplyAddSaturate for MmaTensorOpPolicy in the specialization of struct DefaultMmaTensorOp (in include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h) seems to be obligatory now, as the build fails if OpMultiplyAdd is used. The branch is updated accordingly.

I've configured the build using -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS="s8_s4,s4_s8,s4,s8" CMake options, and then verified that cutlass_test_unit_gemm_device_mixed_input_tensorop_sm80 unit test passes. Then, I did profiler runs as follows:

    ./build/tools/profiler/cutlass_profiler --operation=gemm --m=3456 --n=4096 --k=2048 --A=s8:row --B=s8:column >& s8_s8.txt
    ./build/tools/profiler/cutlass_profiler --operation=gemm --m=3456 --n=4096 --k=2048 --A=s4:row --B=s8:column >& s4_s8.txt
    ./build/tools/profiler/cutlass_profiler --operation=gemm --m=3456 --n=4096 --k=2048 --A=s8:row --B=s4:column >& s8_s4.txt

The corresponding profiler outputs are here:
s8_s8.txt
s4_s8.txt
s8_s4.txt

The disposition values for mixed data types cases with s8 accumulator are still incorrect. Also, the timings are somewhat slower than for corresponding s8xs8 cases (with the same configurations: tile sizes etc.).

Thank you for running and sharing these results.

The accumulator for all of these runs should be S32, as shown at the bottom of the output in csv format with accum type = S32. The Incorrect disposition with mixed input happens only for S8 output, i.e., when the accumulators are S32 but the output is downcast to S8.

We do not see incorrect results for S8xS8 with S32 accumulators and S8 output, can you pick one row of incorrect run from

(elementD/elementC type) <= (elementA type) x (elementB type) + (accum type)
S8 <= S8 x S4 + S32

and compare the same kernel configuration against

S8 <= S8 x S8 + S32

to find where is the difference?

I believe it has to do with the initialization of the operands during profiling, or with the S32-to-S8 quantization inside the kernel epilogue.

Also, you can just upload the csv that can be produced by adding --output=filename.csv to the profiler runs
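
For reference when comparing S8 outputs, the sketch below shows what a saturating S32-to-S8 downcast looks like on the host. It is only an illustration of the concept: whether a given kernel epilogue saturates depends on the epilogue functor that was chosen, and the helper name saturate_to_s8 is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: clamps a 32-bit accumulator value into the int8 range
// [-128, 127] before narrowing, instead of discarding the upper bits.
inline std::int8_t saturate_to_s8(std::int32_t acc) {
  const std::int32_t clamped =
      std::max<std::int32_t>(-128, std::min<std::int32_t>(127, acc));
  return static_cast<std::int8_t>(clamped);
}
```

If the kernel and the reference disagree only where the accumulator leaves the int8 range, a difference in saturation behavior between the two is a likely candidate.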

Here is what I found so far regarding incorrect cases:

First, I made following change in the code generating inputs, in order to generate the same inputs for profiler for S4xS8 and S8xS8 cases:

    diff --git a/tools/profiler/src/device_context.cu b/tools/profiler/src/device_context.cu
    index 2cbfa5d2..7b488fe8 100644
    --- a/tools/profiler/src/device_context.cu
    +++ b/tools/profiler/src/device_context.cu
    @@ -105,7 +105,7 @@ DeviceAllocation *DeviceContext::allocate_tensor(
               data_distribution.set_uniform(-1, 1, 0);
               break;
             case library::NumericTypeID::kS4:
    -          data_distribution.set_uniform(-2, 2, 0);
    +          data_distribution.set_uniform(-3, 3, 0);
               break;
             case library::NumericTypeID::kU2:
               data_distribution.set_uniform(0, 2, 0);

I used the following profiler runs to make a comparison between the S4xS8 and S8xS8 cases (BTW, I found that a smaller input shape that still allows reproducing the problem is --m=32 --n=64 --k=512):

    cutlass_profiler --operation=gemm --gemm_kind=universal --m=3456 --n=4096 --k=2048 --A=s8:row --B=s8:column --C=s8:column --D=s8:column --accum=s32 --cta_m=256 --cta_n=128 --cta_k=64 --stages=3 --save-workspace=always
    cutlass_profiler --operation=gemm --gemm_kind=universal --m=3456 --n=4096 --k=2048 --A=s4:row --B=s8:column --C=s8:column --D=s8:column --accum=s32 --cta_m=256 --cta_n=128 --cta_k=64 --stages=3 --save-workspace=always

By comparing saved .mat files, I verified that input matrices A, B and C are the same, but also that output matrices D are the same. What differs are actually Reference matrices, which means that reference results calculated for S4xS8 case are wrong. If I understood it correctly, cuBLAS is used for reference calculations, so I'll check what's going on there...

I am not sure if cuBLAS is called for the reference check here. The output should show which references are called. You can use --verification-providers=cublas,host,device to run them all. Is there a host reference for this? You may want to check in /tools/library/src/reference.

Indeed - the device provider is actually used for the reference check here. I posted an update with the fix in the reference calculations, so for most of the cases with S8 output, cutlass_profiler reports success now. However, there are still a couple of cases where incorrect is reported, I'm looking into this...

Pushed another update - the problem with remaining incorrect cases was that I haven't copied C operand alignment update from S8xS8 case, in the generator code. Everything is reported as passed now by profiler, the output files are attached below. I believe this one should be ready for merging now.

s8_s8.gemm.csv
s4_s8.gemm.csv
s8_s4.gemm.csv

@alexsamardzic thank you for digging it through. LGTM!

@hwu36, @thakkarV, @IonThruster, can you please help merge it?

alexsamardzic

Thanks for the fix!

hwu36 · 2024-08-30T03:19:59Z

while we are at this, i think we can improve the int4->int8 upcasting. now we use 11 instructions to upcast 8 elements, which is quite a lot. we used a look-up-table method to do int4->fp8 upcasting (https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/numeric_conversion.h#L2983-L3027), and I think we may be able to use the same here.

@alexsamardzic , do you want to give it a try? i am setting up now so it won't take me months to merge your code.

cc @rhenry-nv

alexsamardzic · 2024-08-30T16:52:29Z

> @alexsamardzic , do you want to give it a try? i am setting up now so it won't take me months to merge your code.

Sure. Below is a patch to implement the look-up table method for int4->int8 (pretty much the same as existing int4->fp8 code), and also the profiler outputs for original and patched version. It seems that the look-up table method is slower.

I ran the profiler in both cases as follows:

cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=1024 --A=s4:row --B=s8:column

and here are mentioned files:
patch.txt
original.csv
patched.csv

I was the least happy about the conversion code in this PR, but this is the best I was able to come up with...
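
For readers unfamiliar with the look-up table idea discussed here, below is a scalar host-side illustration of it. This is not the patch attached above and not the CUTLASS conversion code, which operates on registers inside the kernel; the helper name upcast_s4x8_lut and the nibble ordering are assumptions, and the sketch says nothing about the instruction count or performance of either approach.

```cpp
#include <array>
#include <cstdint>

// Table indexed by the 4-bit pattern; entries 8..15 are the negative values -8..-1.
constexpr std::array<std::int8_t, 16> kS4ToS8 = {
    0, 1, 2, 3, 4, 5, 6, 7, -8, -7, -6, -5, -4, -3, -2, -1};

// Hypothetical helper: converts eight packed 4-bit values to int8 via table look-up.
// Assumption: value i lives in bits [4*i, 4*i + 3] of the 32-bit word.
inline std::array<std::int8_t, 8> upcast_s4x8_lut(std::uint32_t packed) {
  std::array<std::int8_t, 8> out{};
  for (int i = 0; i < 8; ++i) {
    out[i] = kS4ToS8[(packed >> (4 * i)) & 0xF];
  }
  return out;
}
```

The table replaces the sign-extension arithmetic with an indexed read; on a GPU the analogous trick has to be expressed with register-level byte selection, which is where the real cost comparison is decided.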

* Add support for mixed 4-bit/8-bit data types GEMM
* fix ( and )

---------

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>

This was referenced Mar 19, 2024

[QST] how can i do w4a8 (int4 * int8) using cutlass? #1370

Closed

[FEA] W4A8 gemm surpport. #1316

Closed

[New Feature] CUTLASS kernels for w4a8 quantization pytorch/ao#64

Closed

lezcano reviewed Mar 21, 2024

yyfcc17 mentioned this pull request Mar 25, 2024

does trt-llm support w4a8 (int4 * int8)? NVIDIA/TensorRT-LLM#1189

Closed

andrewor14 mentioned this pull request Mar 26, 2024

[RFC] Plans for LLM QAT pytorch/ao#86

Closed

alexsamardzic mentioned this pull request Jul 11, 2024

Add couple configs into generator.py for mixed input MM #1350

Merged

manishucsd approved these changes Aug 14, 2024

msaroufim mentioned this pull request Aug 17, 2024

[RFC] Which low bit CUDA kernels should we merge or write? pytorch/ao#697

Open

Aleksandar Samardžić and others added 2 commits August 20, 2024 19:41

Add support for mixed 4-bit/8-bit data types GEMM

a30806a

fix ( and )

16ce3d7

alexsamardzic commented Aug 29, 2024

hwu36 approved these changes Aug 30, 2024

hwu36 merged commit e1976da into NVIDIA:main Aug 30, 2024

alexsamardzic deleted the add-mixed-4bit-8bit-gemm branch August 30, 2024 12:05

alexsamardzic mentioned this pull request Oct 1, 2024

W4A8 based on CUTLASS pytorch/ao#880

Merged
