[Operator] Add hidet.ops.matmul_cublas operator #405

Merged
yaoyaoding merged 1 commit into hidet-org:main from yaoyaoding:cublas-ops
Jan 4, 2024
Conversation

@yaoyaoding
Member

Add the hidet.ops.matmul_cublas operator, which uses the cuBLAS library.

```python
import hidet

m, n, k = 1024, 1024, 1024
a = hidet.randn([m, k], dtype='float16', device='cuda') / 32.0
b = hidet.randn([k, n], dtype='float16', device='cuda') / 32.0
c = hidet.ops.matmul_cublas(a, b)
d = hidet.ops.matmul(a, b)

hidet.utils.assert_close(actual=c, expected=d, rtol=1e-2, atol=1e-2)
```
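The 1e-2 tolerances above are loose because the inputs and outputs are fp16. A standalone NumPy sketch (illustrative only, not hidet's code path) of how much error fp16 storage of the result alone introduces at this scale:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = k = 256
# same recipe as the snippet above: normal inputs scaled down by 32
a = (rng.standard_normal((m, k)) / 32.0).astype(np.float16)
b = (rng.standard_normal((k, n)) / 32.0).astype(np.float16)

ref = a.astype(np.float32) @ b.astype(np.float32)  # fp32 reference product
out = ref.astype(np.float16).astype(np.float32)    # emulate fp16 output storage
max_abs_err = float(np.abs(out - ref).max())
print(max_abs_err)  # well below the 1e-2 atol used in the PR
```

With fp16's ~11 bits of mantissa and entries of magnitude ~1e-2, the rounding error stays orders of magnitude under the 1e-2 tolerance, so the check is robust to differing accumulation orders between the two kernels.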

The generated code in `source.cu` looks like:

```cuda
#include <hidet/runtime/cuda/cublas.h>

// ...

DLL void hidet_launch_0(half * __restrict__ a, half * __restrict__ b, half * __restrict__ c) {
  hidet_cublas_strided_gemm(1, 1024, 1024, 1024, 2, 2, 2, a, b, c, 0, 1048576, 1048576, false, false, 64);
}
```

@yaoyaoding yaoyaoding merged commit f4ab1e4 into hidet-org:main Jan 4, 2024
@yaoyaoding yaoyaoding deleted the cublas-ops branch January 4, 2024 01:01
vadiklyutiy pushed a commit that referenced this pull request Dec 19, 2024
…st int div/mod (#405)

For the conv2d operator, replace the type of dynamic symbols with `int_fastdiv`. This type converts division into a multiply and a shift, which speeds up both division and modulo computation when the divisor is unknown at compile time (dynamic shapes).

Initial performance comparison
[here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing):
this speeds up the conv2d/reduce operators, but we found a performance decrease in the `hidet_fused_conv2d_channel_last_mul_mul_add_clip_out` operator, so it actually decreases the performance of the whole model (mobilenet_v2). The root cause is still to be located.

Related kernel CUDA code generated after the `int_fastdiv` replacement:

[clip_out_kernel.txt](https://github.com/user-attachments/files/16532988/clip_out_kernel.txt)

[conv2dkernel.txt](https://github.com/user-attachments/files/16532989/conv2dkernel.txt)

[reducekernel.txt](https://github.com/user-attachments/files/16532990/reducekernel.txt)

UPDATE:
This PR now only converts `Div` ops with a symbolic divisor to the fast-division version and `Mod` ops with a symbolic divisor to the fast-mod version:

1. On the host side, compute the magic numbers for fast div/mod before calling the kernel function, and pass them to the kernel as new parameters:
```
// d is the original integer divisor
m, s, n_addsign = calculate_magic_numbers(d)
// m, s, n_addsign are all plain ints, so no new int_fastdiv type is needed
kernel_func<<<...>>>(..., d, m, s, n_addsign)
```
2. In the kernel function, convert `/` and `%` operations with runtime divisors into the new primitives:
```
// originally block.x / s0 and block.x % s0
int_fastdiv(block.x, s0, m, s, n_addsign)
int_fastmod(block.x, s0, m, s, n_addsign)
```
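The multiply-and-shift trick behind the two steps above can be sketched in plain Python. This is illustrative only: `magic`, `fast_div`, and `fast_mod` are hypothetical names, not hidet's actual helpers, and the sketch covers unsigned operands only, so the `n_addsign` correction for signed values is omitted.

```python
def magic(d, bits=32):
    """Return (m, s) such that n // d == (n * m) >> s for all 0 <= n < 2**bits.

    Uses the round-up multiplier: m = ceil(2**s / d) with s = bits + ceil(log2(d)).
    """
    assert d >= 2
    log2_d = (d - 1).bit_length()  # ceil(log2(d))
    s = bits + log2_d
    m = (1 << s) // d + 1          # round-up multiplier
    return m, s

def fast_div(n, d, m, s):
    # replaces n // d: one multiply and one shift, no hardware divide
    return (n * m) >> s

def fast_mod(n, d, m, s):
    # replaces n % d via the identity n % d == n - d * (n // d)
    return n - d * fast_div(n, d, m, s)

# host side: precompute once per divisor, then reuse in the "kernel"
m, s = magic(7, bits=16)
print(fast_div(1000, 7, m, s), fast_mod(1000, 7, m, s))  # prints "142 6"
```

Precomputing `(m, s)` on the host once per kernel launch amortizes the cost, which is why the divisor's magic numbers are passed as extra kernel parameters rather than recomputed per thread.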

The performance comparison with the main branch is
[here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing):

model | dynamic shape index | profiling input shape | latency (main) | latency (fast_div) | ratio (fast_div/main)
-- | -- | -- | -- | -- | --
bert-base-uncased | 0 | 16x256 | 10.37882291 | 10.22750485 | 98.54%
densenet121 | 0 | 128x3x224x224 | 17.60995965 | 16.57427695 | 94.12%
efficientnet_b0 | 0 | 128x3x224x224 | 10.48826607 | 10.16526555 | 96.92%
mobilenet_v2 | 0 | 128x3x224x224 | 6.34721314 | 6.16601998 | 97.15%
resnet50 | 0 | 128x3x224x224 | 12.59996163 | 12.02326098 | 95.42%
vit_b_16 | 0 | 128x3x224x224 | 74.35063778 | 70.71424213 | 95.11%
  |   |   |   | GMEAN | 96.20%

GPU | Frequency locked at | CPU
-- | -- | --
RTX4090 | 2400 MHz | i9-13900K 32 core
vadiklyutiy pushed a commit that referenced this pull request Dec 20, 2024
…st int div/mod (#405)
vadiklyutiy pushed a commit that referenced this pull request Dec 26, 2024
…st int div/mod (#405)