[Operator] Add hidet.ops.matmul_cublas operator #405

Merged
yaoyaoding merged 1 commit into hidet-org:main from yaoyaoding:cublas-ops
Jan 4, 2024
Conversation

@yaoyaoding
Member

Add the hidet.ops.matmul_cublas operator, which uses the cuBLAS library.

```python
import hidet

m, n, k = 1024, 1024, 1024
a = hidet.randn([m, k], dtype='float16', device='cuda') / 32.0
b = hidet.randn([k, n], dtype='float16', device='cuda') / 32.0
c = hidet.ops.matmul_cublas(a, b)
d = hidet.ops.matmul(a, b)

hidet.utils.assert_close(actual=c, expected=d, rtol=1e-2, atol=1e-2)
```
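The 1e-2 tolerances above are loose because the inputs and outputs are fp16. A standalone NumPy sketch (illustrative only, not hidet's code path) of how much error fp16 storage of the result alone introduces at this scale:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = k = 256
# same recipe as the snippet above: normal inputs scaled down by 32
a = (rng.standard_normal((m, k)) / 32.0).astype(np.float16)
b = (rng.standard_normal((k, n)) / 32.0).astype(np.float16)

ref = a.astype(np.float32) @ b.astype(np.float32)  # fp32 reference product
out = ref.astype(np.float16).astype(np.float32)    # emulate fp16 output storage
max_abs_err = float(np.abs(out - ref).max())
print(max_abs_err)  # well below the 1e-2 atol used in the PR
```

With fp16's ~11 bits of mantissa and entries of magnitude ~1e-2, the rounding error stays orders of magnitude under the 1e-2 tolerance, so the check is robust to differing accumulation orders between the two kernels.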

The generated code in `source.cu` looks like:

```cuda
#include <hidet/runtime/cuda/cublas.h>

// ...

DLL void hidet_launch_0(half * __restrict__ a, half * __restrict__ b, half * __restrict__ c) {
  hidet_cublas_strided_gemm(1, 1024, 1024, 1024, 2, 2, 2, a, b, c, 0, 1048576, 1048576, false, false, 64);
}
```

@yaoyaoding yaoyaoding merged commit f4ab1e4 into hidet-org:main Jan 4, 2024
@yaoyaoding yaoyaoding deleted the cublas-ops branch January 4, 2024 01:01
vadiklyutiy pushed a commit that referenced this pull request Dec 19, 2024
…st int div/mod (#405)

For the conv2d operator, replace the type of dynamic symbols with `int_fastdiv`. This type converts division into a multiply and a shift, which speeds up both division and modulo computation when the divisor is unknown at compile time (dynamic shapes).

Initial performance comparison
[here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing):
this speeds up the conv2d/reduce operators, but we found a performance decrease in the `hidet_fused_conv2d_channel_last_mul_mul_add_clip_out` operator, so it actually decreases the performance of the whole model (mobilenet_v2). The root cause is still to be located.

Related kernel CUDA code generated after the `int_fastdiv` replacement:

[clip_out_kernel.txt](https://github.com/user-attachments/files/16532988/clip_out_kernel.txt)

[conv2dkernel.txt](https://github.com/user-attachments/files/16532989/conv2dkernel.txt)

[reducekernel.txt](https://github.com/user-attachments/files/16532990/reducekernel.txt)

UPDATE:
This PR now only converts `Div` ops with a symbolic divisor to the fast-division version and `Mod` ops with a symbolic divisor to the fast-mod version:

1. On the host side, compute the magic numbers for fast div/mod before calling the kernel function, and pass them to the kernel as new parameters:
```
// d is the original integer divisor
m, s, n_addsign = calculate_magic_numbers(d)
// m, s, n_addsign are all plain ints, so no new int_fastdiv type is needed
kernel_func<<<...>>>(..., d, m, s, n_addsign)
```
2. In the kernel function, convert `/` and `%` operations with runtime divisors into the new primitives:
```
// originally block.x / s0 and block.x % s0
int_fastdiv(block.x, s0, m, s, n_addsign)
int_fastmod(block.x, s0, m, s, n_addsign)
```
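The multiply-and-shift trick behind the two steps above can be sketched in plain Python. This is illustrative only: `magic`, `fast_div`, and `fast_mod` are hypothetical names, not hidet's actual helpers, and the sketch covers unsigned operands only, so the `n_addsign` correction for signed values is omitted.

```python
def magic(d, bits=32):
    """Return (m, s) such that n // d == (n * m) >> s for all 0 <= n < 2**bits.

    Uses the round-up multiplier: m = ceil(2**s / d) with s = bits + ceil(log2(d)).
    """
    assert d >= 2
    log2_d = (d - 1).bit_length()  # ceil(log2(d))
    s = bits + log2_d
    m = (1 << s) // d + 1          # round-up multiplier
    return m, s

def fast_div(n, d, m, s):
    # replaces n // d: one multiply and one shift, no hardware divide
    return (n * m) >> s

def fast_mod(n, d, m, s):
    # replaces n % d via the identity n % d == n - d * (n // d)
    return n - d * fast_div(n, d, m, s)

# host side: precompute once per divisor, then reuse in the "kernel"
m, s = magic(7, bits=16)
print(fast_div(1000, 7, m, s), fast_mod(1000, 7, m, s))  # prints "142 6"
```

Precomputing `(m, s)` on the host once per kernel launch amortizes the cost, which is why the divisor's magic numbers are passed as extra kernel parameters rather than recomputed per thread.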

The performance comparison with the main branch is
[here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing):

model | dynamic shape index | profiling input shape | latency (main) | latency (fast_div) | ratio (fast_div/main)
-- | -- | -- | -- | -- | --
bert-base-uncased | 0 | 16x256 | 10.37882291 | 10.22750485 | 98.54%
densenet121 | 0 | 128x3x224x224 | 17.60995965 | 16.57427695 | 94.12%
efficientnet_b0 | 0 | 128x3x224x224 | 10.48826607 | 10.16526555 | 96.92%
mobilenet_v2 | 0 | 128x3x224x224 | 6.34721314 | 6.16601998 | 97.15%
resnet50 | 0 | 128x3x224x224 | 12.59996163 | 12.02326098 | 95.42%
vit_b_16 | 0 | 128x3x224x224 | 74.35063778 | 70.71424213 | 95.11%
  |   |   |   | GMEAN | 96.20%

GPU | Frequency locked at | CPU
-- | -- | --
RTX4090 | 2400 MHz | i9-13900K 32 core
vadiklyutiy pushed a commit that referenced this pull request Dec 20, 2024
…st int div/mod (#405)
vadiklyutiy pushed a commit that referenced this pull request Dec 26, 2024
…st int div/mod (#405)