[Operator] Add hidet.ops.matmul_cublas operator #405
Merged
yaoyaoding merged 1 commit into hidet-org:main on Jan 4, 2024
Conversation
vadiklyutiy pushed a commit that referenced this pull request on Dec 19, 2024
…st int div/mod (#405)

For the conv2d operator, replace the type of dynamic symbols with `int_fastdiv`. This type converts division into a multiply and a shift, which speeds up both division and modulo computation when the divisor is unknown at compile time (dynamic shape). Initial performance comparison [here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing): it speeds up the conv2d/reduce operators, but a performance decrease was found in the `hidet_fused_conv2d_channel_last_mul_mul_add_clip_out` operators, so it actually decreases the performance of the whole model (mobilenet_v2). The root cause is still to be located. Related kernel CUDA code generated after the int_fastdiv replacement: [clip_out_kernel.txt](https://github.com/user-attachments/files/16532988/clip_out_kernel.txt), [conv2dkernel.txt](https://github.com/user-attachments/files/16532989/conv2dkernel.txt), [reducekernel.txt](https://github.com/user-attachments/files/16532990/reducekernel.txt)

UPDATE: this PR is updated to only convert `Div` ops with a symbolic divisor to the fast division version and `Mod` ops with a symbolic divisor to the fast mod version:

1. On the host side, compute the magic numbers for fast div/mod before calling the kernel function, and pass the calculated magic numbers as new parameters:

   ```
   // d is the original integer
   m, s, n_addsign = calculate_magic_numbers(d)
   // m, s, n_addsign are all int type, so we don't need to add a new int_fastdiv type
   kernel_func<<<...>>>(..., d, m, s, n_addsign)
   ```

2. In the kernel function, convert `/` and `%` operations with runtime divisors into new primitives:

   ```
   // originally block.x / s0 and block.x % s0
   int_fastdiv(block.x, s0, m, s, n_addsign)
   int_fastmod(block.x, s0, m, s, n_addsign)
   ```

The performance comparison with the main branch is [here](https://docs.google.com/spreadsheets/d/1mQEAmXAiqgOcBiHwgsANSW7_d-nx6xocmjab8NOfqGU/edit?usp=sharing):

| model | dynamic shape index | profiling input shape | latency (main) | latency (fast_div) | ratio (fast_div/main) |
| -- | -- | -- | -- | -- | -- |
| bert-base-uncased | 0 | 16x256 | 10.37882291 | 10.22750485 | 98.54% |
| densenet121 | 0 | 128x3x224x224 | 17.60995965 | 16.57427695 | 94.12% |
| efficientnet_b0 | 0 | 128x3x224x224 | 10.48826607 | 10.16526555 | 96.92% |
| mobilenet_v2 | 0 | 128x3x224x224 | 6.34721314 | 6.16601998 | 97.15% |
| resnet50 | 0 | 128x3x224x224 | 12.59996163 | 12.02326098 | 95.42% |
| vit_b_16 | 0 | 128x3x224x224 | 74.35063778 | 70.71424213 | 95.11% |
| | | | | GMEAN | 96.20% |

Test setup: RTX4090 GPU (frequency locked at 2400 MHz), i9-13900K CPU (32 core).
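The magic-number scheme above can be sketched in a few lines. This is an illustrative simplification, not the PR's `int_fastdiv` implementation: Python's arbitrary-precision integers stand in for the 32-bit multiply-high plus the `s` shift and `n_addsign` adjustment that the real primitive needs on hardware.

```python
# Fast division by a runtime divisor d: precompute a "magic" multiplier once,
# then replace every n // d with a multiply and a shift.

def magic(d):
    # smallest m with m >= 2^64 / d; exact for every 32-bit n because n * d < 2^64
    return -(-(1 << 64) // d)   # ceil(2^64 / d)

def fastdiv(n, m):
    # n // d computed as a multiply-high: (n * m) >> 64
    return (n * m) >> 64

def fastmod(n, d, m):
    # modulo falls out of the same quotient
    return n - d * fastdiv(n, m)

d = 224                         # e.g. a dynamic spatial extent in conv2d
m = magic(d)                    # host side, computed once before the kernel launch
for n in (0, 1, 223, 224, 1_000_003, (1 << 32) - 1):
    assert fastdiv(n, m) == n // d
    assert fastmod(n, d, m) == n % d
```

The precompute-once, divide-many-times split is exactly why this pays off for dynamic shapes: the expensive part runs once on the host, and every kernel thread only multiplies and shifts.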
vadiklyutiy pushed a commit that referenced this pull request on Dec 20, 2024
vadiklyutiy pushed a commit that referenced this pull request on Dec 26, 2024
Add the `hidet.ops.matmul_cublas` operator, which uses the cuBLAS library. The generated code in `source.cu` looks like
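The generated `source.cu` itself is not captured in this excerpt. As a rough sketch of the layout issue any cuBLAS-backed matmul must handle (an assumption for illustration, not the PR's actual generated code): cuBLAS gemm routines are column-major, so a row-major `C = A @ B` is commonly issued as `gemm(B, A)`. This pure-Python model shows why swapping the operands gives the right answer.

```python
def matmul(a, b):
    # naive row-major matmul: a is m x k, b is k x n
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def transpose(a):
    return [list(row) for row in zip(*a)]

# A row-major m x n buffer reinterpreted as column-major is the n x m transpose.
# So a column-major gemm over (B, A) computes B^T @ A^T = (A @ B)^T, and the
# result buffer, read back as row-major, is exactly A @ B -- no copies needed.
A = [[1, 2, 3], [4, 5, 6]]          # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]     # 3 x 2
assert transpose(matmul(transpose(B), transpose(A))) == matmul(A, B)
```

This operand-swap trick is why such wrappers can pass row-major tensors straight to cuBLAS without an explicit transpose kernel.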