
[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs #7135

Merged
HaiShaw merged 29 commits into sgl-project:main from hubertlu-tw:silu_and_mul
Jul 25, 2025

Conversation

@hubertlu-tw (Collaborator) commented Jun 12, 2025

Motivation

This PR builds on the changes introduced in #4432, which leverage the more optimal 128-bit vectorized kernels in sgl-kernel. It is also part of the ongoing effort to decouple the vLLM dependency from SGLang.
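For context on the vectorization this builds on, here is a hedged CUDA sketch of the 128-bit idea (illustrative only; half8 and copy_vec128 are hypothetical names, not code from #4432): each thread moves 16 bytes, i.e. eight fp16 values, per load/store instead of one scalar.

```cuda
#include <cuda_fp16.h>

// A 16-byte (128-bit) aligned pack of eight half values.
struct __align__(16) half8 {
  __half2 h[4];
};

__global__ void copy_vec128(half8* __restrict__ out,
                            const half8* __restrict__ in, int n_packs) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_packs) out[i] = in[i];  // one 128-bit load + one 128-bit store
}
```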

Modifications

  • Added silu_and_mul, gelu_and_mul, and gelu_tanh_and_mul for AMD GPUs (#4432)
  • Added gelu_quick for AMD GPUs
  • Added unit tests for QuickGELU
  • Updated the benchmark script introduced in #4432 to cover the gelu_quick kernel
  • Call at::cuda::getCurrentCUDAStream() in the CUDA source instead of torch.cuda.current_stream() in the Python wrapper, avoiding a large per-call overhead (around 30 us on MI300X); a minimal sketch follows this list. The previous version, which passed torch.cuda.current_stream() from Python, performs well in the microbenchmark, but it caused an e2e performance regression due to the CPU-side overhead.
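A minimal sketch of the stream-handling change, assuming a simplified op signature (the real kernels also take the 128-bit vectorized launch path):

```cuda
// Hedged sketch: resolve the current stream inside the C++ op via
// at::cuda::getCurrentCUDAStream() instead of receiving it from Python.
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

void silu_and_mul(at::Tensor& out, const at::Tensor& input) {
  // Fetching the stream here removes the torch.cuda.current_stream()
  // call from the Python wrapper, and with it the roughly 30 us of CPU
  // activity per launch observed on MI300X.
  const cudaStream_t stream =
      at::cuda::getCurrentCUDAStream(input.device().index()).stream();
  // ... launch the activation kernel on `stream` ...
  (void)out;
}
```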

With torch.cuda.current_stream() in the Python wrapper: [profiler trace screenshot]

With at::cuda::getCurrentCUDAStream() in the CUDA source: [profiler trace screenshot]

Checklist

TODO:

CC: @HaiShaw

@gemini-code-assist (Bot, Contributor) left a comment


Summary of Changes

Hello @hubertlu-tw, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends SGLang's kernel library to support AMD (ROCm) GPUs for key activation functions, including a new gelu_quick kernel and adaptations for existing fused activation kernels. A significant refactor was also implemented in the C++ kernel interface to improve performance by reducing CPU-GPU synchronization overhead.

Highlights

  • ROCm (AMD GPU) Support: Added support for gelu_quick activation and adapted existing fused activation kernels (silu_and_mul, gelu_and_mul, gelu_tanh_and_mul) for AMD GPUs using the ROCm platform (reference semantics sketched after this list).
  • Performance Optimization: Refactored the C++ kernels to obtain the CUDA stream internally using at::cuda::getCurrentCUDAStream(), eliminating the need to pass it from Python. This significantly reduces CPU overhead and improves end-to-end performance, particularly on MI300X, as shown by included benchmark results.
  • Unit Tests: Added unit tests for the QuickGELU implementation to verify correctness across various shapes and data types.
  • Benchmarking: Included a new benchmark script to compare the performance of the implemented activation kernels against vLLM's equivalents.
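For reference, these fused kernels follow the convention shared with vLLM: the input's last dimension holds 2·d values, the activation is applied to the first half, and the result is multiplied elementwise by the second half; QuickGELU is the sigmoid approximation x·sigmoid(1.702x). A scalar reference sketch, not the PR's 128-bit vectorized implementation:

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));            // SiLU(x) = x * sigmoid(x)
}

__device__ __forceinline__ float gelu_quick(float x) {
  return x / (1.0f + expf(-1.702f * x));   // QuickGELU: x * sigmoid(1.702 * x)
}

// out[token, i] = silu(in[token, i]) * in[token, d + i], with in shaped [tokens, 2*d].
__global__ void silu_and_mul_ref(float* __restrict__ out,
                                 const float* __restrict__ in, int d) {
  const int token = blockIdx.x;  // one block per token
  for (int i = threadIdx.x; i < d; i += blockDim.x) {
    const float gate = in[token * 2 * d + i];      // activation half
    const float up   = in[token * 2 * d + d + i];  // multiplier half
    out[token * d + i] = silu(gate) * up;
  }
}
```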

@hubertlu-tw changed the title from "Silu and mul" to "[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels" on Jun 12, 2025
@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request introduces significant enhancements for AMD GPU support, adding new activation kernels (silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, gelu_quick) and optimizing CUDA stream management. The changes include new C++ kernels, Python bindings, tests, and benchmarks. A key performance improvement comes from switching to at::cuda::getCurrentCUDAStream() in C++ for stream retrieval, which reduces CPU overhead.

The changes are well-structured and generally well-implemented. Key areas for attention include a potential typo in a pragma directive for ROCm vector types, a potentially incorrect cast in a utility function for CUDA, and a point of clarification regarding the CUDA implementation of QuickGELU. Overall, this is a valuable contribution that expands hardware compatibility and improves performance.

Outdated comment threads:
  • sgl-kernel/include/utils.h
  • sgl-kernel/include/impl/hip_vec_bf16_impl.h (two threads)
  • python/sglang/srt/layers/activation.py
@hubertlu-tw (Collaborator, Author) commented Jun 12, 2025

Due to GitHub markdown constraints, I had to split the following perf numbers into separate sections.

Micro-benchmark results (the vLLM and SGL Kernel columns are kernel latencies, presumably in µs; lower is better):
                kernel           dtype  batch_size  seq_len    dim       vLLM  SGL Kernel  Speed-up (×)
0         silu_and_mul   torch.float16           1        1    128   8.620000    6.494000      1.347637
1         silu_and_mul   torch.float16           1        1    256   8.540000    6.455000      1.322850
2         silu_and_mul   torch.float16           1        1    512   8.659000    6.494000      1.308747
3         silu_and_mul   torch.float16           1        1   1024   8.821000    6.575000      1.349656
4         silu_and_mul   torch.float16           1        1   2048   9.301000    6.575000      1.417138
5         silu_and_mul   torch.float16           1        1   4096  10.864000    6.615000      1.593953
6         silu_and_mul   torch.float16           1        1   8192  12.829000    6.696000      1.800151
7         silu_and_mul   torch.float16           1        1  16384  18.361000    8.419000      2.119017
8         silu_and_mul   torch.float16           1        4    128   9.060000    6.495000      1.388761
9         silu_and_mul   torch.float16           1        4    256   9.020000    6.535000      1.349656
10        silu_and_mul   torch.float16           1        4    512   9.100000    6.535000      1.374292
11        silu_and_mul   torch.float16           1        4   1024   9.261000    6.575000      1.387906
12        silu_and_mul   torch.float16           1        4   2048   9.662000    6.495000      1.442053
13        silu_and_mul   torch.float16           1        4   4096  11.105000    6.535000      1.631829
14        silu_and_mul   torch.float16           1        4   8192  13.230000    6.695000      1.994025
15        silu_and_mul   torch.float16           1        4  16384  18.200999    8.580000      2.000117
16        silu_and_mul   torch.float16           1       16    128   9.021000    6.575000      1.377947
17        silu_and_mul   torch.float16           1       16    256   9.141000    6.615000      1.384182
18        silu_and_mul   torch.float16           1       16    512   9.181000    6.615000      1.369615
19        silu_and_mul   torch.float16           1       16   1024   9.342000    6.495000      1.403718
20        silu_and_mul   torch.float16           1       16   2048   9.983000    6.495000      1.518399
21        silu_and_mul   torch.float16           1       16   4096  11.346000    6.615000      1.678760
22        silu_and_mul   torch.float16           1       16   8192  14.393000    6.656000      1.964152
23        silu_and_mul   torch.float16           1       16  16384  18.722000    9.061000      1.986517
24        silu_and_mul   torch.float16           1       64    128   9.141000    6.615000      1.369766
25        silu_and_mul   torch.float16           1       64    256   9.261000    6.615000      1.377296
26        silu_and_mul   torch.float16           1       64    512   9.502000    6.655000      1.406047
27        silu_and_mul   torch.float16           1       64   1024   9.622000    6.615000      1.463422
28        silu_and_mul   torch.float16           1       64   2048  10.263000    6.655000      1.557521
29        silu_and_mul   torch.float16           1       64   4096  11.787000    6.695000      1.730695
30        silu_and_mul   torch.float16           1       64   8192  15.074000    7.136000      2.118799
31        silu_and_mul   torch.float16           1       64  16384  19.885000    9.862000      1.987783
32        silu_and_mul   torch.float16           4        1    128   9.061000    6.535000      1.392502
33        silu_and_mul   torch.float16           4        1    256   9.101000    6.615000      1.393953
34        silu_and_mul   torch.float16           4        1    512   8.980000    6.495000      1.370285
35        silu_and_mul   torch.float16           4        1   1024   9.100000    6.535000      1.386534
36        silu_and_mul   torch.float16           4        1   2048   9.862000    6.575000      1.466259
37        silu_and_mul   torch.float16           4        1   4096  11.065000    6.656000      1.599474
38        silu_and_mul   torch.float16           4        1   8192  13.991000    6.695000      1.892158
39        silu_and_mul   torch.float16           4        1  16384  19.123999    8.619000      1.977002
40        silu_and_mul   torch.float16           4        4    128   8.901000    6.495000      1.370285
41        silu_and_mul   torch.float16           4        4    256   9.060000    6.575000      1.392655
42        silu_and_mul   torch.float16           4        4    512   9.181000    6.575000      1.396410
43        silu_and_mul   torch.float16           4        4   1024   9.301000    6.615000      1.420684
44        silu_and_mul   torch.float16           4        4   2048  10.103000    6.615000      1.478760
45        silu_and_mul   torch.float16           4        4   4096  11.667000    6.736000      1.678693
46        silu_and_mul   torch.float16           4        4   8192  14.192000    6.695000      1.987903
47        silu_and_mul   torch.float16           4        4  16384  19.043000    8.901000      1.904864
48        silu_and_mul   torch.float16           4       16    128   9.141000    6.615000      1.387906
49        silu_and_mul   torch.float16           4       16    256   9.261000    6.655000      1.385575
50        silu_and_mul   torch.float16           4       16    512   9.502000    6.655000      1.403606
51        silu_and_mul   torch.float16           4       16   1024   9.622000    6.655000      1.445830
52        silu_and_mul   torch.float16           4       16   2048  10.263000    6.655000      1.542149
53        silu_and_mul   torch.float16           4       16   4096  11.786000    6.816000      1.715909
54        silu_and_mul   torch.float16           4       16   8192  15.154000    7.096000      2.117993
55        silu_and_mul   torch.float16           4       16  16384  19.926000    9.902000      1.975464
56        silu_and_mul   torch.float16           4       64    128   9.422000    6.695000      1.407170
57        silu_and_mul   torch.float16           4       64    256   9.542000    6.696000      1.413082
58        silu_and_mul   torch.float16           4       64    512   9.622000    6.615500      1.451841
59        silu_and_mul   torch.float16           4       64   1024   9.983000    6.775000      1.464440
60        silu_and_mul   torch.float16           4       64   2048  10.905000    6.856000      1.582306
61        silu_and_mul   torch.float16           4       64   4096  12.789000    7.056000      1.795493
62        silu_and_mul   torch.float16           4       64   8192  16.638000    8.419000      1.799141
63        silu_and_mul   torch.float16           4       64  16384  23.974000   12.348000      1.817806
64        silu_and_mul   torch.float16          16        1    128   9.061000    6.534000      1.376443
65        silu_and_mul   torch.float16          16        1    256   9.060000    6.535000      1.372015
66        silu_and_mul   torch.float16          16        1    512   9.502000    6.575000      1.400000
67        silu_and_mul   torch.float16          16        1   1024   9.301000    6.575000      1.400000
68        silu_and_mul   torch.float16          16        1   2048  10.183000    6.655000      1.478760
69        silu_and_mul   torch.float16          16        1   4096  11.507000    6.696000      1.682748
70        silu_and_mul   torch.float16          16        1   8192  14.513000    6.735000      2.011878
71        silu_and_mul   torch.float16          16        1  16384  19.204000    9.181000      1.991203
72        silu_and_mul   torch.float16          16        4    128   9.140000    6.615000      1.373554
73        silu_and_mul   torch.float16          16        4    256   9.261000    6.616000      1.379564
74        silu_and_mul   torch.float16          16        4    512   9.542000    6.655000      1.403606
75        silu_and_mul   torch.float16          16        4   1024   9.662000    6.655000      1.448526
76        silu_and_mul   torch.float16          16        4   2048  10.224000    6.655000      1.556983
77        silu_and_mul   torch.float16          16        4   4096  11.827000    6.776000      1.745425
78        silu_and_mul   torch.float16          16        4   8192  15.154000    7.136000      2.107384
79        silu_and_mul   torch.float16          16        4  16384  19.966001   10.063000      1.983710
80        silu_and_mul   torch.float16          16       16    128   9.422000    6.735000      1.407109
81        silu_and_mul   torch.float16          16       16    256   9.462000    6.695000      1.406959
82        silu_and_mul   torch.float16          16       16    512   9.502000    6.695000      1.455431
83        silu_and_mul   torch.float16          16       16   1024   9.982000    6.736000      1.479191
84        silu_and_mul   torch.float16          16       16   2048  11.105000    6.855500      1.590664
85        silu_and_mul   torch.float16          16       16   4096  12.789000    7.056000      1.761480
86        silu_and_mul   torch.float16          16       16   8192  16.518001    8.419000      1.826958
87        silu_and_mul   torch.float16          16       16  16384  24.134001   12.308000      1.773119
88        silu_and_mul   torch.float16          16       64    128   9.742000    6.735000      1.434596
89        silu_and_mul   torch.float16          16       64    256   9.943000    6.775000      1.470012
90        silu_and_mul   torch.float16          16       64    512  10.504000    6.856000      1.546948
91        silu_and_mul   torch.float16          16       64   1024  12.348000    7.016000      1.725770
92        silu_and_mul   torch.float16          16       64   2048  15.395000    8.058000      1.859363
93        silu_and_mul   torch.float16          16       64   4096  21.529000   11.827000      1.705082
94        silu_and_mul   torch.float16          16       64   8192  35.962000   20.206001      1.765826
95        silu_and_mul   torch.float16          16       64  16384  68.997003   45.743000      1.575057
96        silu_and_mul  torch.bfloat16           1        1    128   8.780000    6.535000      1.329278
97        silu_and_mul  torch.bfloat16           1        1    256   8.860000    6.575000      1.323042
98        silu_and_mul  torch.bfloat16           1        1    512   8.821000    6.535000      1.331293
99        silu_and_mul  torch.bfloat16           1        1   1024   9.100000    6.615000      1.375661
100       silu_and_mul  torch.bfloat16           1        1   2048   9.422000    6.495000      1.432025
101       silu_and_mul  torch.bfloat16           1        1   4096  10.664000    6.575000      1.601224
102       silu_and_mul  torch.bfloat16           1        1   8192  13.391000    6.976000      1.804616
103       silu_and_mul  torch.bfloat16           1        1  16384  18.081000    9.902000      1.685636
104       silu_and_mul  torch.bfloat16           1        4    128   8.941000    6.535000      1.368018
105       silu_and_mul  torch.bfloat16           1        4    256   9.061000    6.615000      1.363568
106       silu_and_mul  torch.bfloat16           1        4    512   8.980000    6.534000      1.374292
107       silu_and_mul  torch.bfloat16           1        4   1024   9.341000    6.615000      1.412094
108       silu_and_mul  torch.bfloat16           1        4   2048  10.223000    6.575000      1.487909
109       silu_and_mul  torch.bfloat16           1        4   4096  11.426000    6.655000      1.620586
110       silu_and_mul  torch.bfloat16           1        4   8192  14.513000    7.056000      1.959949
111       silu_and_mul  torch.bfloat16           1        4  16384  19.966001   10.223000      1.743074
112       silu_and_mul  torch.bfloat16           1       16    128   9.020000    6.534000      1.380624
113       silu_and_mul  torch.bfloat16           1       16    256   9.101000    6.575000      1.384030
114       silu_and_mul  torch.bfloat16           1       16    512   9.301000    6.575000      1.402433
115       silu_and_mul  torch.bfloat16           1       16   1024   9.382000    6.655000      1.424339
116       silu_and_mul  torch.bfloat16           1       16   2048  10.223000    6.615000      1.500075
117       silu_and_mul  torch.bfloat16           1       16   4096  11.948000    6.695000      1.706647
118       silu_and_mul  torch.bfloat16           1       16   8192  14.793000    7.337000      1.891284
119       silu_and_mul  torch.bfloat16           1       16  16384  19.845000   10.584000      1.842890
120       silu_and_mul  torch.bfloat16           1       64    128   9.141000    6.655000      1.373554
121       silu_and_mul  torch.bfloat16           1       64    256   9.301000    6.695000      1.415778
122       silu_and_mul  torch.bfloat16           1       64    512   9.502000    6.655000      1.409767
123       silu_and_mul  torch.bfloat16           1       64   1024   9.742000    6.655000      1.454898
124       silu_and_mul  torch.bfloat16           1       64   2048  10.464000    6.695000      1.578212
125       silu_and_mul  torch.bfloat16           1       64   4096  12.028000    6.776000      1.776669
126       silu_and_mul  torch.bfloat16           1       64   8192  15.915999    7.858000      1.994783
127       silu_and_mul  torch.bfloat16           1       64  16384  20.807000   11.346000      1.795240
128       silu_and_mul  torch.bfloat16           4        1    128   9.301000    6.615000      1.378099
129       silu_and_mul  torch.bfloat16           4        1    256   9.101000    6.615000      1.363568
130       silu_and_mul  torch.bfloat16           4        1    512   8.980000    6.535000      1.357968
131       silu_and_mul  torch.bfloat16           4        1   1024   9.181000    6.575000      1.390266
132       silu_and_mul  torch.bfloat16           4        1   2048  10.022000    6.575000      1.493840
133       silu_and_mul  torch.bfloat16           4        1   4096  11.346000    6.655000      1.638618
134       silu_and_mul  torch.bfloat16           4        1   8192  14.714000    7.056000      1.845639
135       silu_and_mul  torch.bfloat16           4        1  16384  19.122999   10.063000      1.824187
136       silu_and_mul  torch.bfloat16           4        4    128   9.060000    6.535000      1.398990
137       silu_and_mul  torch.bfloat16           4        4    256   9.061000    6.574000      1.384030
138       silu_and_mul  torch.bfloat16           4        4    512   9.261000    6.575000      1.418141
139       silu_and_mul  torch.bfloat16           4        4   1024   9.381000    6.615000      1.406047
140       silu_and_mul  torch.bfloat16           4        4   2048  10.183000    6.655000      1.497052
141       silu_and_mul  torch.bfloat16           4        4   4096  11.787000    6.695000      1.706392
142       silu_and_mul  torch.bfloat16           4        4   8192  14.633000    7.176000      1.932971
143       silu_and_mul  torch.bfloat16           4        4  16384  19.605000   10.383000      1.743279
144       silu_and_mul  torch.bfloat16           4       16    128   9.341000    6.735000      1.391451
145       silu_and_mul  torch.bfloat16           4       16    256   9.381000    6.695000      1.386934
146       silu_and_mul  torch.bfloat16           4       16    512   9.462000    6.656000      1.398605
147       silu_and_mul  torch.bfloat16           4       16   1024  10.022000    6.816000      1.479014
148       silu_and_mul  torch.bfloat16           4       16   2048  10.424000    6.695000      1.566191
149       silu_and_mul  torch.bfloat16           4       16   4096  11.987000    6.776000      1.757231
150       silu_and_mul  torch.bfloat16           4       16   8192  15.596000    7.818000      1.969174
151       silu_and_mul  torch.bfloat16           4       16  16384  20.687001   11.426000      1.802045
152       silu_and_mul  torch.bfloat16           4       64    128   9.462000    6.735000      1.410481
153       silu_and_mul  torch.bfloat16           4       64    256   9.582000    6.735000      1.410839
154       silu_and_mul  torch.bfloat16           4       64    512   9.703000    6.655000      1.455116
155       silu_and_mul  torch.bfloat16           4       64   1024   9.902000    6.736000      1.476099
156       silu_and_mul  torch.bfloat16           4       64   2048  11.025000    6.855000      1.608315
157       silu_and_mul  torch.bfloat16           4       64   4096  13.030000    7.337000      1.777608
158       silu_and_mul  torch.bfloat16           4       64   8192  17.240001    9.221000      1.716153
159       silu_and_mul  torch.bfloat16           4       64  16384  24.656000   13.872000      1.650087
160       silu_and_mul  torch.bfloat16          16        1    128   9.101000    6.535000      1.382602
161       silu_and_mul  torch.bfloat16          16        1    256   9.101000    6.574000      1.378099
162       silu_and_mul  torch.bfloat16          16        1    512   9.261000    6.575000      1.396350
163       silu_and_mul  torch.bfloat16          16        1   1024   9.342000    6.615000      1.406198
164       silu_and_mul  torch.bfloat16          16        1   2048  10.344000    6.615000      1.496901
165       silu_and_mul  torch.bfloat16          16        1   4096  11.987000    6.695000      1.714328
166       silu_and_mul  torch.bfloat16          16        1   8192  14.754000    7.337000      1.883836
167       silu_and_mul  torch.bfloat16          16        1  16384  19.925000   10.584000      1.797696
168       silu_and_mul  torch.bfloat16          16        4    128   9.181000    6.655000      1.404751
169       silu_and_mul  torch.bfloat16          16        4    256   9.461000    6.655500      1.397386
170       silu_and_mul  torch.bfloat16          16        4    512   9.582000    6.655000      1.409767
171       silu_and_mul  torch.bfloat16          16        4   1024   9.742000    6.655000      1.469872
172       silu_and_mul  torch.bfloat16          16        4   2048  10.424000    6.895000      1.572352
173       silu_and_mul  torch.bfloat16          16        4   4096  12.108000    6.775000      1.757231
174       silu_and_mul  torch.bfloat16          16        4   8192  15.596000    7.858000      1.963929
175       silu_and_mul  torch.bfloat16          16        4  16384  20.687001   11.426000      1.812119
176       silu_and_mul  torch.bfloat16          16       16    128   9.462000    6.735000      1.404751
177       silu_and_mul  torch.bfloat16          16       16    256   9.581000    6.735000      1.404900
178       silu_and_mul  torch.bfloat16          16       16    512   9.702000    6.655000      1.427799
179       silu_and_mul  torch.bfloat16          16       16   1024   9.902000    6.775000      1.511728
180       silu_and_mul  torch.bfloat16          16       16   2048  11.025000    6.816000      1.599765
181       silu_and_mul  torch.bfloat16          16       16   4096  13.030000    7.297000      1.797786
182       silu_and_mul  torch.bfloat16          16       16   8192  17.199000    9.181000      1.719287
183       silu_and_mul  torch.bfloat16          16       16  16384  24.777001   13.912000      1.640471
184       silu_and_mul  torch.bfloat16          16       64    128   9.782000    6.735000      1.414936
185       silu_and_mul  torch.bfloat16          16       64    256   9.982000    6.775000      1.470160
186       silu_and_mul  torch.bfloat16          16       64    512  10.584000    6.816000      1.553045
187       silu_and_mul  torch.bfloat16          16       64   1024  12.709000    7.538000      1.673736
188       silu_and_mul  torch.bfloat16          16       64   2048  16.116001    9.020000      1.734270
189       silu_and_mul  torch.bfloat16          16       64   4096  22.892000   13.069000      1.634889
190       silu_and_mul  torch.bfloat16          16       64   8192  39.168999   22.330999      1.694866
191       silu_and_mul  torch.bfloat16          16       64  16384  72.966002   46.746001      1.583490
192       gelu_and_mul   torch.float16           1        1    128   8.700000    6.535000      1.331293
193       gelu_and_mul   torch.float16           1        1    256   8.740000    6.535000      1.331293
194       gelu_and_mul   torch.float16           1        1    512   8.860000    6.535000      1.343535
195       gelu_and_mul   torch.float16           1        1   1024   9.061000    6.575000      1.357521
196       gelu_and_mul   torch.float16           1        1   2048   9.341000    6.495000      1.450500
197       gelu_and_mul   torch.float16           1        1   4096  10.544000    6.535000      1.564346
198       gelu_and_mul   torch.float16           1        1   8192  12.910000    6.615000      1.860620
199       gelu_and_mul   torch.float16           1        1  16384  17.440001    8.499000      1.919112
200       gelu_and_mul   torch.float16           1        4    128   8.981000    6.534000      1.374292
201       gelu_and_mul   torch.float16           1        4    256   9.140000    6.615000      1.371863
202       gelu_and_mul   torch.float16           1        4    512   9.020000    6.495000      1.376443
203       gelu_and_mul   torch.float16           1        4   1024   9.180000    6.574000      1.392655
204       gelu_and_mul   torch.float16           1        4   2048  10.023000    6.574000      1.530494
205       gelu_and_mul   torch.float16           1        4   4096  11.546000    6.696000      1.700672
206       gelu_and_mul   torch.float16           1        4   8192  14.594000    6.815000      1.982331
207       gelu_and_mul   torch.float16           1        4  16384  20.206001    8.740000      2.112005
208       gelu_and_mul   torch.float16           1       16    128   9.021000    6.534000      1.394919
209       gelu_and_mul   torch.float16           1       16    256   9.181000    6.535000      1.392655
210       gelu_and_mul   torch.float16           1       16    512   9.301000    6.575000      1.396350
211       gelu_and_mul   torch.float16           1       16   1024   9.381000    6.575000      1.424188
212       gelu_and_mul   torch.float16           1       16   2048  10.304000    6.615000      1.514813
213       gelu_and_mul   torch.float16           1       16   4096  11.787000    6.695000      1.748469
214       gelu_and_mul   torch.float16           1       16   8192  14.633000    6.695000      2.077658
215       gelu_and_mul   torch.float16           1       16  16384  19.645000    9.060000      2.026371
216       gelu_and_mul   torch.float16           1       64    128   9.261000    6.615000      1.391585
217       gelu_and_mul   torch.float16           1       64    256   9.501000    6.695000      1.403606
218       gelu_and_mul   torch.float16           1       64    512   9.622000    6.655000      1.415778
219       gelu_and_mul   torch.float16           1       64   1024   9.783000    6.655000      1.457851
220       gelu_and_mul   torch.float16           1       64   2048  10.344000    6.655000      1.560331
221       gelu_and_mul   torch.float16           1       64   4096  11.948000    6.775000      1.739779
222       gelu_and_mul   torch.float16           1       64   8192  15.516000    7.176000      2.145047
223       gelu_and_mul   torch.float16           1       64  16384  20.646501    9.942000      2.056857
224       gelu_and_mul   torch.float16           4        1    128   9.261000    6.575000      1.432852
225       gelu_and_mul   torch.float16           4        1    256   9.342000    6.615000      1.384182
226       gelu_and_mul   torch.float16           4        1    512   8.981000    6.495000      1.374139
227       gelu_and_mul   torch.float16           4        1   1024   9.181000    6.535000      1.392655
228       gelu_and_mul   torch.float16           4        1   2048   9.983000    6.575000      1.487757
229       gelu_and_mul   torch.float16           4        1   4096  11.506000    6.695000      1.654573
230       gelu_and_mul   torch.float16           4        1   8192  14.513000    6.695000      1.916356
231       gelu_and_mul   torch.float16           4        1  16384  19.564001    8.620000      1.976682
232       gelu_and_mul   torch.float16           4        4    128   9.021000    6.495000      1.382756
233       gelu_and_mul   torch.float16           4        4    256   9.101000    6.575000      1.429380
234       gelu_and_mul   torch.float16           4        4    512   9.301000    6.535000      1.408517
235       gelu_and_mul   torch.float16           4        4   1024   9.381000    6.575000      1.420684
236       gelu_and_mul   torch.float16           4        4   2048  10.183000    6.615000      1.491005
237       gelu_and_mul   torch.float16           4        4   4096  11.546000    6.695000      1.696362
238       gelu_and_mul   torch.float16           4        4   8192  14.593000    6.695000      2.047946
239       gelu_and_mul   torch.float16           4        4  16384  19.645000    8.980000      1.928722
240       gelu_and_mul   torch.float16           4       16    128   9.221000    6.615000      1.387906
241       gelu_and_mul   torch.float16           4       16    256   9.342000    6.655000      1.397596
242       gelu_and_mul   torch.float16           4       16    512   9.582000    6.655000      1.409767
243       gelu_and_mul   torch.float16           4       16   1024   9.782000    6.655000      1.439820
244       gelu_and_mul   torch.float16           4       16   2048  10.424000    6.696000      1.587906
245       gelu_and_mul   torch.float16           4       16   4096  11.947000    6.775000      1.745683
246       gelu_and_mul   torch.float16           4       16   8192  15.515000    7.216000      2.139473
247       gelu_and_mul   torch.float16           4       16  16384  20.525999    9.903000      2.073297
248       gelu_and_mul   torch.float16           4       64    128   9.622000    6.695000      1.419268
249       gelu_and_mul   torch.float16           4       64    256   9.862000    6.936000      1.436977
250       gelu_and_mul   torch.float16           4       64    512   9.542000    6.655000      1.451841
251       gelu_and_mul   torch.float16           4       64   1024   9.983000    6.735000      1.476318
252       gelu_and_mul   torch.float16           4       64   2048  10.985000    6.815000      1.594277
253       gelu_and_mul   torch.float16           4       64   4096  12.949000    7.016000      1.811431
254       gelu_and_mul   torch.float16           4       64   8192  16.718000    8.379000      1.859863
255       gelu_and_mul   torch.float16           4       64  16384  24.495000   12.148000      1.836701
256       gelu_and_mul   torch.float16          16        1    128   9.141000    6.495000      1.388761
257       gelu_and_mul   torch.float16          16        1    256   9.101000    6.535000      1.392655
258       gelu_and_mul   torch.float16          16        1    512   9.301000    6.535000      1.408517
259       gelu_and_mul   torch.float16          16        1   1024   9.341000    6.575000      1.412245
260       gelu_and_mul   torch.float16          16        1   2048  10.143000    6.615000      1.524117
261       gelu_and_mul   torch.float16          16        1   4096  11.786000    6.695000      1.700672
262       gelu_and_mul   torch.float16          16        1   8192  14.593000    6.695000      2.053921
263       gelu_and_mul   torch.float16          16        1  16384  19.804999    9.181000      2.062306
264       gelu_and_mul   torch.float16          16        4    128   9.221000    6.615000      1.393953
265       gelu_and_mul   torch.float16          16        4    256   9.381000    6.695000      1.392666
266       gelu_and_mul   torch.float16          16        4    512   9.742000    6.656000      1.421788
267       gelu_and_mul   torch.float16          16        4   1024   9.742000    6.615000      1.446474
268       gelu_and_mul   torch.float16          16        4   2048  10.664000    6.695000      1.598805
269       gelu_and_mul   torch.float16          16        4   4096  12.147000    6.775000      1.751587
270       gelu_and_mul   torch.float16          16        4   8192  15.475000    7.216000      2.140255
271       gelu_and_mul   torch.float16          16        4  16384  20.567000    9.862000      2.053153
272       gelu_and_mul   torch.float16          16       16    128   9.542000    6.735000      1.410839
273       gelu_and_mul   torch.float16          16       16    256   9.582000    6.695000      1.419268
274       gelu_and_mul   torch.float16          16       16    512   9.662000    6.655000      1.463862
275       gelu_and_mul   torch.float16          16       16   1024   9.982000    6.776000      1.488048
276       gelu_and_mul   torch.float16          16       16   2048  11.145000    6.856000      1.637252
277       gelu_and_mul   torch.float16          16       16   4096  13.030000    7.056000      1.817132
278       gelu_and_mul   torch.float16          16       16   8192  16.558001    8.379000      1.893207
279       gelu_and_mul   torch.float16          16       16  16384  24.414999   12.228000      1.823397
280       gelu_and_mul   torch.float16          16       64    128   9.823000    6.696000      1.461090
281       gelu_and_mul   torch.float16          16       64    256   9.983000    6.735000      1.482108
282       gelu_and_mul   torch.float16          16       64    512  10.544000    6.776000      1.541306
283       gelu_and_mul   torch.float16          16       64   1024  12.388000    6.976000      1.757065
284       gelu_and_mul   torch.float16          16       64   2048  15.515000    8.219000      1.847156
285       gelu_and_mul   torch.float16          16       64   4096  21.248000   11.506000      1.724752
286       gelu_and_mul   torch.float16          16       64   8192  35.801001   20.647001      1.792773
287       gelu_and_mul   torch.float16          16       64  16384  68.475999   41.774999      1.568949
288       gelu_and_mul  torch.bfloat16           1        1    128   8.900000    6.535000      1.347529
289       gelu_and_mul  torch.bfloat16           1        1    256   9.061000    6.575000      1.362050
290       gelu_and_mul  torch.bfloat16           1        1    512   8.860000    6.535000      1.343535
291       gelu_and_mul  torch.bfloat16           1        1   1024   9.221000    6.655000      1.390266
292       gelu_and_mul  torch.bfloat16           1        1   2048   9.501000    6.495000      1.432852
293       gelu_and_mul  torch.bfloat16           1        1   4096  10.784000    6.535000      1.619740
294       gelu_and_mul  torch.bfloat16           1        1   8192  13.310000    7.177000      1.754181
295       gelu_and_mul  torch.bfloat16           1        1  16384  18.121000   10.103000      1.636200
296       gelu_and_mul  torch.bfloat16           1        4    128   9.020000    6.534000      1.372015
297       gelu_and_mul  torch.bfloat16           1        4    256   9.141000    6.615000      1.373554
298       gelu_and_mul  torch.bfloat16           1        4    512   9.141000    6.495000      1.388761
299       gelu_and_mul  torch.bfloat16           1        4   1024   9.221000    6.535000      1.390266
300       gelu_and_mul  torch.bfloat16           1        4   2048  10.103000    6.575000      1.493840
301       gelu_and_mul  torch.bfloat16           1        4   4096  11.667000    6.655000      1.650639
302       gelu_and_mul  torch.bfloat16           1        4   8192  14.914000    7.096000      1.886979
303       gelu_and_mul  torch.bfloat16           1        4  16384  20.446001   10.504000      1.742108
304       gelu_and_mul  torch.bfloat16           1       16    128   9.021000    6.495000      1.394919
305       gelu_and_mul  torch.bfloat16           1       16    256   9.181000    6.535000      1.398776
306       gelu_and_mul  torch.bfloat16           1       16    512   9.342000    6.575000      1.414753
307       gelu_and_mul  torch.bfloat16           1       16   1024   9.422000    6.615000      1.428658
308       gelu_and_mul  torch.bfloat16           1       16   2048  10.464000    6.615000      1.502917
309       gelu_and_mul  torch.bfloat16           1       16   4096  12.107000    6.696000      1.748469
310       gelu_and_mul  torch.bfloat16           1       16   8192  14.874000    7.457000      1.914297
311       gelu_and_mul  torch.bfloat16           1       16  16384  20.327000   10.664000      1.830196
312       gelu_and_mul  torch.bfloat16           1       64    128   9.221000    6.615000      1.393953
313       gelu_and_mul  torch.bfloat16           1       64    256   9.381000    6.695000      1.407170
314       gelu_and_mul  torch.bfloat16           1       64    512  10.022000    6.695000      1.431068
315       gelu_and_mul  torch.bfloat16           1       64   1024   9.862000    6.655000      1.463862
316       gelu_and_mul  torch.bfloat16           1       64   2048  10.464000    6.655000      1.590383
317       gelu_and_mul  torch.bfloat16           1       64   4096  12.147000    6.775000      1.781107
318       gelu_and_mul  torch.bfloat16           1       64   8192  15.796000    8.059000      1.935344
319       gelu_and_mul  torch.bfloat16           1       64  16384  21.088000   11.466000      1.827420
320       gelu_and_mul  torch.bfloat16           4        1    128   9.221000    6.535000      1.404897
321       gelu_and_mul  torch.bfloat16           4        1    256   9.180000    6.615000      1.381859
322       gelu_and_mul  torch.bfloat16           4        1    512   9.021000    6.495000      1.380260
323       gelu_and_mul  torch.bfloat16           4        1   1024   9.261000    6.575000      1.396350
324       gelu_and_mul  torch.bfloat16           4        1   2048  10.023000    6.575000      1.530494
325       gelu_and_mul  torch.bfloat16           4        1   4096  11.507000    6.616000      1.723616
326       gelu_and_mul  torch.bfloat16           4        1   8192  14.674000    7.176000      1.915779
327       gelu_and_mul  torch.bfloat16           4        1  16384  20.326000   10.143000      1.758947
328       gelu_and_mul  torch.bfloat16           4        4    128   9.301000    6.535000      1.386746
329       gelu_and_mul  torch.bfloat16           4        4    256   9.261000    6.535000      1.403606
330       gelu_and_mul  torch.bfloat16           4        4    512   9.421000    6.615000      1.414601
331       gelu_and_mul  torch.bfloat16           4        4   1024   9.422000    6.575000      1.418141
332       gelu_and_mul  torch.bfloat16           4        4   2048  10.183000    6.615000      1.499925
333       gelu_and_mul  torch.bfloat16           4        4   4096  11.867000    6.695000      1.694698
334       gelu_and_mul  torch.bfloat16           4        4   8192  14.954000    7.457000      1.891870
335       gelu_and_mul  torch.bfloat16           4        4  16384  20.005999   10.344000      1.790603
336       gelu_and_mul  torch.bfloat16           4       16    128   9.221000    6.615000      1.393953
337       gelu_and_mul  torch.bfloat16           4       16    256   9.381000    6.655000      1.398812
338       gelu_and_mul  torch.bfloat16           4       16    512   9.783000    6.656000      1.421788
339       gelu_and_mul  torch.bfloat16           4       16   1024   9.863000    6.655000      1.469872
340       gelu_and_mul  torch.bfloat16           4       16   2048  10.424000    6.655000      1.551008
341       gelu_and_mul  torch.bfloat16           4       16   4096  12.228000    6.776000      1.733874
342       gelu_and_mul  torch.bfloat16           4       16   8192  15.756000    8.019000      1.944999
343       gelu_and_mul  torch.bfloat16           4       16  16384  21.128001   11.386000      1.826199
344       gelu_and_mul  torch.bfloat16           4       64    128   9.542000    6.735000      1.416630
345       gelu_and_mul  torch.bfloat16           4       64    256   9.582000    6.775000      1.425030
346       gelu_and_mul  torch.bfloat16           4       64    512   9.622000    6.735000      1.469872
347       gelu_and_mul  torch.bfloat16           4       64   1024  10.063000    6.735000      1.487975
348       gelu_and_mul  torch.bfloat16           4       64   2048  11.105000    6.896000      1.641138
349       gelu_and_mul  torch.bfloat16           4       64   4096  13.069000    7.417000      1.780047
350       gelu_and_mul  torch.bfloat16           4       64   8192  16.958000    9.341000      1.705647
351       gelu_and_mul  torch.bfloat16           4       64  16384  24.897000   14.112000      1.626434
352       gelu_and_mul  torch.bfloat16          16        1    128   9.462000    6.615000      1.423259
353       gelu_and_mul  torch.bfloat16          16        1    256   9.221000    6.615000      1.402433
354       gelu_and_mul  torch.bfloat16          16        1    512   9.301000    6.575000      1.439087
355       gelu_and_mul  torch.bfloat16          16        1   1024   9.421000    6.615000      1.418141
356       gelu_and_mul  torch.bfloat16          16        1   2048  10.183000    6.655000      1.521010
357       gelu_and_mul  torch.bfloat16          16        1   4096  11.947000    6.695000      1.724571
358       gelu_and_mul  torch.bfloat16          16        1   8192  14.914000    7.457000      1.880949
359       gelu_and_mul  torch.bfloat16          16        1  16384  20.326499   10.624000      1.759520
360       gelu_and_mul  torch.bfloat16          16        4    128   9.301000    6.615000      1.393953
361       gelu_and_mul  torch.bfloat16          16        4    256   9.382000    6.655000      1.403546
362       gelu_and_mul  torch.bfloat16          16        4    512   9.622000    6.655000      1.410839
363       gelu_and_mul  torch.bfloat16          16        4   1024   9.943000    6.655000      1.475883
364       gelu_and_mul  torch.bfloat16          16        4   2048  10.464000    6.655000      1.590383
365       gelu_and_mul  torch.bfloat16          16        4   4096  12.108000    6.775000      1.758914
366       gelu_and_mul  torch.bfloat16          16        4   8192  15.756000    8.098000      1.939893
367       gelu_and_mul  torch.bfloat16          16        4  16384  21.248000   11.506000      1.814021
368       gelu_and_mul  torch.bfloat16          16       16    128   9.702000    6.735000      1.425030
369       gelu_and_mul  torch.bfloat16          16       16    256   9.621000    6.735000      1.436977
370       gelu_and_mul  torch.bfloat16          16       16    512   9.782000    6.736000      1.463862
371       gelu_and_mul  torch.bfloat16          16       16   1024  10.063000    6.736000      1.482257
372       gelu_and_mul  torch.bfloat16          16       16   2048  11.105500    6.856000      1.619895
373       gelu_and_mul  torch.bfloat16          16       16   4096  13.230000    7.658000      1.718190
374       gelu_and_mul  torch.bfloat16          16       16   8192  16.999001    9.381000      1.703993
375       gelu_and_mul  torch.bfloat16          16       16  16384  24.977000   14.152000      1.620011
376       gelu_and_mul  torch.bfloat16          16       64    128   9.823000    6.736000      1.440535
377       gelu_and_mul  torch.bfloat16          16       64    256  10.023000    6.775000      1.473506
378       gelu_and_mul  torch.bfloat16          16       64    512  10.624000    6.896000      1.572929
379       gelu_and_mul  torch.bfloat16          16       64   1024  12.668000    7.777000      1.643902
380       gelu_and_mul  torch.bfloat16          16       64   2048  15.876001    9.261000      1.675309
381       gelu_and_mul  torch.bfloat16          16       64   4096  22.371000   13.270000      1.586581
382       gelu_and_mul  torch.bfloat16          16       64   8192  37.165001   22.691000      1.643449
383       gelu_and_mul  torch.bfloat16          16       64  16384  71.202002   46.425000      1.535871
384  gelu_tanh_and_mul   torch.float16           1        1    128   8.700000    6.615000      1.316958
385  gelu_tanh_and_mul   torch.float16           1        1    256   8.659000    6.535000      1.298707
386  gelu_tanh_and_mul   torch.float16           1        1    512   8.740000    6.574000      1.312777
387  gelu_tanh_and_mul   torch.float16           1        1   1024   8.900000    6.615000      1.351270
388  gelu_tanh_and_mul   torch.float16           1        1   2048   9.301000    6.495000      1.419707
389  gelu_tanh_and_mul   torch.float16           1        1   4096  10.304000    6.615000      1.542662
390  gelu_tanh_and_mul   torch.float16           1        1   8192  12.869000    6.695000      1.843426
391  gelu_tanh_and_mul   torch.float16           1        1  16384  17.680001    8.900000      1.805214
392  gelu_tanh_and_mul   torch.float16           1        4    128   8.900000    6.535000      1.355777
393  gelu_tanh_and_mul   torch.float16           1        4    256   8.980000    6.615000      1.351474
394  gelu_tanh_and_mul   torch.float16           1        4    512   8.860000    6.534000      1.349656
395  gelu_tanh_and_mul   torch.float16           1        4   1024   9.261000    6.615000      1.371863
396  gelu_tanh_and_mul   torch.float16           1        4   2048   9.943000    6.615000      1.478760
397  gelu_tanh_and_mul   torch.float16           1        4   4096  11.506000    6.735000      1.670650
398  gelu_tanh_and_mul   torch.float16           1        4   8192  14.473000    6.816000      2.035783
399  gelu_tanh_and_mul   torch.float16           1        4  16384  19.885000    9.221000      1.991319
400  gelu_tanh_and_mul   torch.float16           1       16    128   8.861000    6.535000      1.355777
401  gelu_tanh_and_mul   torch.float16           1       16    256   8.980000    6.575000      1.365780
402  gelu_tanh_and_mul   torch.float16           1       16    512   9.141000    6.615000      1.381859
403  gelu_tanh_and_mul   torch.float16           1       16   1024   9.221000    6.615000      1.377296
404  gelu_tanh_and_mul   torch.float16           1       16   2048  10.384000    6.695000      1.497087
405  gelu_tanh_and_mul   torch.float16           1       16   4096  11.747000    6.776000      1.670415
406  gelu_tanh_and_mul   torch.float16           1       16   8192  14.834000    6.976000      2.046973
407  gelu_tanh_and_mul   torch.float16           1       16  16384  19.685000    9.502000      2.000106
408  gelu_tanh_and_mul   torch.float16           1       64    128   9.100500    6.695000      1.373554
409  gelu_tanh_and_mul   torch.float16           1       64    256   9.422000    6.696000      1.398961
410  gelu_tanh_and_mul   torch.float16           1       64    512   9.341000    6.695000      1.395295
411  gelu_tanh_and_mul   torch.float16           1       64   1024   9.622000    6.655000      1.413293
412  gelu_tanh_and_mul   torch.float16           1       64   2048  10.224000    6.695000      1.551008
413  gelu_tanh_and_mul   torch.float16           1       64   4096  11.867000    6.815000      1.694204
414  gelu_tanh_and_mul   torch.float16           1       64   8192  15.435000    7.457000      2.043047
415  gelu_tanh_and_mul   torch.float16           1       64  16384  20.687001   10.384000      1.937935
416  gelu_tanh_and_mul   torch.float16           4        1    128   9.140000    6.655000      1.373554
417  gelu_tanh_and_mul   torch.float16           4        1    256   9.061000    6.615000      1.337340
418  gelu_tanh_and_mul   torch.float16           4        1    512   8.860000    6.535000      1.355777
419  gelu_tanh_and_mul   torch.float16           4        1   1024   9.021000    6.575000      1.371863
420  gelu_tanh_and_mul   torch.float16           4        1   2048  10.063000    6.856000      1.481893
421  gelu_tanh_and_mul   torch.float16           4        1   4096  11.306000    6.656000      1.634802
422  gelu_tanh_and_mul   torch.float16           4        1   8192  14.072000    6.735000      1.886728
423  gelu_tanh_and_mul   torch.float16           4        1  16384  20.166000    9.061000      1.928833
424  gelu_tanh_and_mul   torch.float16           4        4    128   8.860000    6.535000      1.355777
425  gelu_tanh_and_mul   torch.float16           4        4    256   8.981000    6.615000      1.365932
426  gelu_tanh_and_mul   torch.float16           4        4    512   9.141000    6.615000      1.367543
427  gelu_tanh_and_mul   torch.float16           4        4   1024   9.462000    6.655000      1.379564
428  gelu_tanh_and_mul   torch.float16           4        4   2048  10.144000    6.655000      1.470022
429  gelu_tanh_and_mul   torch.float16           4        4   4096  11.666000    6.775000      1.666815
430  gelu_tanh_and_mul   torch.float16           4        4   8192  14.513000    6.815000      2.023764
431  gelu_tanh_and_mul   torch.float16           4        4  16384  19.603999    9.422000      1.897452
432  gelu_tanh_and_mul   torch.float16           4       16    128   9.061000    6.655000      1.367543
433  gelu_tanh_and_mul   torch.float16           4       16    256   9.261000    6.695000      1.371117
434  gelu_tanh_and_mul   torch.float16           4       16    512   9.381000    6.656000      1.389246
435  gelu_tanh_and_mul   torch.float16           4       16   1024   9.542000    6.695000      1.400986
436  gelu_tanh_and_mul   torch.float16           4       16   2048  10.224000    6.695000      1.527110
437  gelu_tanh_and_mul   torch.float16           4       16   4096  11.907000    6.816000      1.707410
438  gelu_tanh_and_mul   torch.float16           4       16   8192  15.475500    7.417000      2.048537
439  gelu_tanh_and_mul   torch.float16           4       16  16384  20.527000   10.384000      1.945647
440  gelu_tanh_and_mul   torch.float16           4       64    128   9.381000    6.735000      1.393022
441  gelu_tanh_and_mul   torch.float16           4       64    256   9.421000    6.735000      1.398605
442  gelu_tanh_and_mul   torch.float16           4       64    512   9.541000    6.695000      1.407170
443  gelu_tanh_and_mul   torch.float16           4       64   1024   9.822000    6.775000      1.443838
444  gelu_tanh_and_mul   torch.float16           4       64   2048  10.865000    6.856000      1.558376
445  gelu_tanh_and_mul   torch.float16           4       64   4096  12.869000    7.136000      1.779735
446  gelu_tanh_and_mul   torch.float16           4       64   8192  16.678000    8.700000      1.804850
447  gelu_tanh_and_mul   torch.float16           4       64  16384  24.536001   13.069000      1.753875
448  gelu_tanh_and_mul   torch.float16          16        1    128   8.981000    6.575000      1.377947
449  gelu_tanh_and_mul   torch.float16          16        1    256   9.221000    6.695000      1.375813
450  gelu_tanh_and_mul   torch.float16          16        1    512   9.342000    6.655000      1.387755
451  gelu_tanh_and_mul   torch.float16          16        1   1024   9.341000    6.615000      1.387545
452  gelu_tanh_and_mul   torch.float16          16        1   2048  10.143000    6.695000      1.487831
453  gelu_tanh_and_mul   torch.float16          16        1   4096  11.546000    6.735000      1.642910
454  gelu_tanh_and_mul   torch.float16          16        1   8192  14.633000    6.896000      2.058303
455  gelu_tanh_and_mul   torch.float16          16        1  16384  19.685000    9.542000      1.970529
456  gelu_tanh_and_mul   torch.float16          16        4    128   9.061000    6.655000      1.353398
457  gelu_tanh_and_mul   torch.float16          16        4    256   9.261000    6.736000      1.377296
458  gelu_tanh_and_mul   torch.float16          16        4    512   9.341000    6.695000      1.378893
459  gelu_tanh_and_mul   torch.float16          16        4   1024   9.823000    6.776000      1.426126
460  gelu_tanh_and_mul   torch.float16          16        4   2048  10.503000    6.695000      1.535908
461  gelu_tanh_and_mul   torch.float16          16        4   4096  11.947000    6.776000      1.745425
462  gelu_tanh_and_mul   torch.float16          16        4   8192  15.435000    7.457000      2.048672
463  gelu_tanh_and_mul   torch.float16          16        4  16384  20.647001   10.464000      1.945458
464  gelu_tanh_and_mul   torch.float16          16       16    128   9.421000    6.696000      1.392873
465  gelu_tanh_and_mul   torch.float16          16       16    256   9.421000    6.736000      1.392873
466  gelu_tanh_and_mul   torch.float16          16       16    512   9.461000    6.696000      1.436828
467  gelu_tanh_and_mul   torch.float16          16       16   1024   9.863000    6.775000      1.443772
468  gelu_tanh_and_mul   torch.float16          16       16   2048  10.985000    6.896000      1.578909
469  gelu_tanh_and_mul   torch.float16          16       16   4096  12.869000    7.136000      1.777608
470  gelu_tanh_and_mul   torch.float16          16       16   8192  16.837999    8.780000      1.790719
471  gelu_tanh_and_mul   torch.float16          16       16  16384  24.536001   12.910000      1.749941
472  gelu_tanh_and_mul   torch.float16          16       64    128   9.661000    6.735000      1.410481
473  gelu_tanh_and_mul   torch.float16          16       64    256   9.782000    6.815000      1.443772
474  gelu_tanh_and_mul   torch.float16          16       64    512  10.464000    6.856000      1.520642
475  gelu_tanh_and_mul   torch.float16          16       64   1024  11.907000    7.096000      1.668615
476  gelu_tanh_and_mul   torch.float16          16       64   2048  15.154000    8.460000      1.700083
477  gelu_tanh_and_mul   torch.float16          16       64   4096  21.289000   12.308000      1.600016
478  gelu_tanh_and_mul   torch.float16          16       64   8192  36.082000   20.327000      1.691830
479  gelu_tanh_and_mul   torch.float16          16       64  16384  69.438003   43.779001      1.637413
480  gelu_tanh_and_mul  torch.bfloat16           1        1    128   8.740000    6.655000      1.325319
481  gelu_tanh_and_mul  torch.bfloat16           1        1    256   8.700000    6.575000      1.304943
482  gelu_tanh_and_mul  torch.bfloat16           1        1    512   8.700000    6.535000      1.304943
483  gelu_tanh_and_mul  torch.bfloat16           1        1   1024   9.021000    6.615000      1.343351
484  gelu_tanh_and_mul  torch.bfloat16           1        1   2048   9.301000    6.495000      1.417138
485  gelu_tanh_and_mul  torch.bfloat16           1        1   4096  10.504000    6.575000      1.542662
486  gelu_tanh_and_mul  torch.bfloat16           1        1   8192  12.950000    7.296000      1.729089
487  gelu_tanh_and_mul  torch.bfloat16           1        1  16384  18.282000   10.263000      1.628958
488  gelu_tanh_and_mul  torch.bfloat16           1        4    128   8.900000    6.575000      1.347529
489  gelu_tanh_and_mul  torch.bfloat16           1        4    256   9.020000    6.655000      1.371322
490  gelu_tanh_and_mul  torch.bfloat16           1        4    512   8.940000    6.575000      1.347886
491  gelu_tanh_and_mul  torch.bfloat16           1        4   1024   9.382000    6.776000      1.406047
492  gelu_tanh_and_mul  torch.bfloat16           1        4   2048   9.902000    6.575000      1.484807
493  gelu_tanh_and_mul  torch.bfloat16           1        4   4096  11.466000    6.656000      1.676624
494  gelu_tanh_and_mul  torch.bfloat16           1        4   8192  14.713000    7.337000      1.858097
495  gelu_tanh_and_mul  torch.bfloat16           1        4  16384  20.285999   10.624000      1.703924
496  gelu_tanh_and_mul  torch.bfloat16           1       16    128   9.061000    6.574000      1.377947
497  gelu_tanh_and_mul  torch.bfloat16           1       16    256   9.060000    6.575000      1.365932
498  gelu_tanh_and_mul  torch.bfloat16           1       16    512   9.181000    6.575000      1.387906
499  gelu_tanh_and_mul  torch.bfloat16           1       16   1024   9.261000    6.615000      1.385575
500  gelu_tanh_and_mul  torch.bfloat16           1       16   2048  10.223000    6.695000      1.488054
501  gelu_tanh_and_mul  torch.bfloat16           1       16   4096  11.746000    6.696000      1.690321
502  gelu_tanh_and_mul  torch.bfloat16           1       16   8192  15.034000    7.617000      1.827109
503  gelu_tanh_and_mul  torch.bfloat16           1       16  16384  20.285999   10.905000      1.763737
504  gelu_tanh_and_mul  torch.bfloat16           1       64    128   9.101000    6.655000      1.367393
505  gelu_tanh_and_mul  torch.bfloat16           1       64    256   9.422000    6.735000      1.365347
506  gelu_tanh_and_mul  torch.bfloat16           1       64    512   9.341000    6.695000      1.395220
507  gelu_tanh_and_mul  torch.bfloat16           1       64   1024   9.742000    6.695000      1.464074
508  gelu_tanh_and_mul  torch.bfloat16           1       64   2048  10.304000    6.695000      1.532935
509  gelu_tanh_and_mul  torch.bfloat16           1       64   4096  12.027000    6.776000      1.770393
510  gelu_tanh_and_mul  torch.bfloat16           1       64   8192  15.876001    8.258000      1.878194
511  gelu_tanh_and_mul  torch.bfloat16           1       64  16384  21.088000   11.867000      1.760229
512  gelu_tanh_and_mul  torch.bfloat16           4        1    128   9.061000    6.575000      1.369615
513  gelu_tanh_and_mul  torch.bfloat16           4        1    256   9.301000    6.655000      1.331330
514  gelu_tanh_and_mul  torch.bfloat16           4        1    512   8.900000    6.575000      1.355777
515  gelu_tanh_and_mul  torch.bfloat16           4        1   1024   9.061000    6.575000      1.369615
516  gelu_tanh_and_mul  torch.bfloat16           4        1   2048  10.023000    6.575000      1.454422
517  gelu_tanh_and_mul  torch.bfloat16           4        1   4096  11.466000    6.856000      1.638618
518  gelu_tanh_and_mul  torch.bfloat16           4        1   8192  14.674000    7.336000      1.813074
519  gelu_tanh_and_mul  torch.bfloat16           4        1  16384  20.326000   10.624000      1.701634
520  gelu_tanh_and_mul  torch.bfloat16           4        4    128   9.061000    6.575000      1.365780
521  gelu_tanh_and_mul  torch.bfloat16           4        4    256   9.021000    6.575000      1.365932
522  gelu_tanh_and_mul  torch.bfloat16           4        4    512   9.181000    6.615000      1.369766
523  gelu_tanh_and_mul  torch.bfloat16           4        4   1024   9.221000    6.615000      1.391736
524  gelu_tanh_and_mul  torch.bfloat16           4        4   2048  10.304000    6.655000      1.487904
525  gelu_tanh_and_mul  torch.bfloat16           4        4   4096  11.666000    6.696000      1.664427
526  gelu_tanh_and_mul  torch.bfloat16           4        4   8192  14.873000    7.577000      1.798997
527  gelu_tanh_and_mul  torch.bfloat16           4        4  16384  20.085000   10.785000      1.686552
528  gelu_tanh_and_mul  torch.bfloat16           4       16    128   9.100000    6.655000      1.359373
529  gelu_tanh_and_mul  torch.bfloat16           4       16    256   9.342000    6.695000      1.363178
530  gelu_tanh_and_mul  torch.bfloat16           4       16    512   9.421000    6.656000      1.419119
531  gelu_tanh_and_mul  torch.bfloat16           4       16   1024   9.662000    6.695000      1.436977
532  gelu_tanh_and_mul  torch.bfloat16           4       16   2048  10.343000    6.735000      1.578212
533  gelu_tanh_and_mul  torch.bfloat16           4       16   4096  11.947000    6.815000      1.753045
534  gelu_tanh_and_mul  torch.bfloat16           4       16   8192  15.836000    8.219000      1.892688
535  gelu_tanh_and_mul  torch.bfloat16           4       16  16384  21.088000   11.787000      1.747489
536  gelu_tanh_and_mul  torch.bfloat16           4       64    128   9.382000    6.735000      1.393022
537  gelu_tanh_and_mul  torch.bfloat16           4       64    256   9.462000    6.736000      1.398961
538  gelu_tanh_and_mul  torch.bfloat16           4       64    512   9.422000    6.695000      1.425243
539  gelu_tanh_and_mul  torch.bfloat16           4       64   1024   9.862000    6.775000      1.449889
540  gelu_tanh_and_mul  torch.bfloat16           4       64   2048  10.865000    6.856000      1.598753
541  gelu_tanh_and_mul  torch.bfloat16           4       64   4096  12.909000    7.578000      1.707968
542  gelu_tanh_and_mul  torch.bfloat16           4       64   8192  16.918000    9.582000      1.645616
543  gelu_tanh_and_mul  torch.bfloat16           4       64  16384  25.056999   14.473000      1.590495
544  gelu_tanh_and_mul  torch.bfloat16          16        1    128   8.981000    6.534000      1.359696
545  gelu_tanh_and_mul  torch.bfloat16          16        1    256   9.221000    6.575000      1.365932
546  gelu_tanh_and_mul  torch.bfloat16          16        1    512   9.180500    6.615000      1.375813
547  gelu_tanh_and_mul  torch.bfloat16          16        1   1024   9.301000    6.735000      1.414601
548  gelu_tanh_and_mul  torch.bfloat16          16        1   2048  10.183000    6.655000      1.512096
549  gelu_tanh_and_mul  torch.bfloat16          16        1   4096  11.747000    6.735000      1.664675
550  gelu_tanh_and_mul  torch.bfloat16          16        1   8192  14.793000    7.617000      1.838675
551  gelu_tanh_and_mul  torch.bfloat16          16        1  16384  20.447001   10.905000      1.820466
552  gelu_tanh_and_mul  torch.bfloat16          16        4    128   9.301000    6.695000      1.367393
553  gelu_tanh_and_mul  torch.bfloat16          16        4    256   9.261000    6.695000      1.365347
554  gelu_tanh_and_mul  torch.bfloat16          16        4    512   9.361500    6.656000      1.391585
555  gelu_tanh_and_mul  torch.bfloat16          16        4   1024   9.662000    6.656000      1.434596
556  gelu_tanh_and_mul  torch.bfloat16          16        4   2048  10.383000    6.655000      1.537563
557  gelu_tanh_and_mul  torch.bfloat16          16        4   4096  12.068000    6.815000      1.752788
558  gelu_tanh_and_mul  torch.bfloat16          16        4   8192  15.716000    8.259000      1.883400
559  gelu_tanh_and_mul  torch.bfloat16          16        4  16384  21.048000   11.827000      1.761856
560  gelu_tanh_and_mul  torch.bfloat16          16       16    128   9.501000    6.775000      1.398961
561  gelu_tanh_and_mul  torch.bfloat16          16       16    256   9.502000    6.736000      1.398812
562  gelu_tanh_and_mul  torch.bfloat16          16       16    512   9.621000    6.655000      1.433809
563  gelu_tanh_and_mul  torch.bfloat16          16       16   1024   9.782000    6.736000      1.455646
564  gelu_tanh_and_mul  torch.bfloat16          16       16   2048  10.905000    6.856000      1.584743
565  gelu_tanh_and_mul  torch.bfloat16          16       16   4096  12.950000    7.657000      1.696604
566  gelu_tanh_and_mul  torch.bfloat16          16       16   8192  17.078999    9.542000      1.680746
567  gelu_tanh_and_mul  torch.bfloat16          16       16  16384  25.096999   14.713000      1.571832
568  gelu_tanh_and_mul  torch.bfloat16          16       64    128   9.862000    6.775000      1.387082
569  gelu_tanh_and_mul  torch.bfloat16          16       64    256   9.782000    6.776000      1.449889
570  gelu_tanh_and_mul  torch.bfloat16          16       64    512  10.464000    7.016000      1.500000
571  gelu_tanh_and_mul  torch.bfloat16          16       64   1024  12.549000    7.857000      1.554608
572  gelu_tanh_and_mul  torch.bfloat16          16       64   2048  15.515000    9.461000      1.557804
573  gelu_tanh_and_mul  torch.bfloat16          16       64   4096  21.888999   13.711000      1.504378
574  gelu_tanh_and_mul  torch.bfloat16          16       64   8192  37.404999   23.373000      1.574066
575  gelu_tanh_and_mul  torch.bfloat16          16       64  16384  71.722999   48.029002      1.549197
576         gelu_quick   torch.float16           1        1    128   8.780000    6.575000      1.325172
577         gelu_quick   torch.float16           1        1    256   8.820000    6.615000      1.323395
578         gelu_quick   torch.float16           1        1    512   8.820000    6.575000      1.325019
579         gelu_quick   torch.float16           1        1   1024   9.020000    6.615000      1.351625
580         gelu_quick   torch.float16           1        1   2048   9.702000    6.575000      1.390266
581         gelu_quick   torch.float16           1        1   4096  10.584000    6.695000      1.530278
582         gelu_quick   torch.float16           1        1   8192  12.348000    6.615000      1.786616
583         gelu_quick   torch.float16           1        1  16384  16.958000    8.499000      1.861320
584         gelu_quick   torch.float16           1        4    128   8.900000    6.535000      1.361897
585         gelu_quick   torch.float16           1        4    256   9.100000    6.575000      1.369766
586         gelu_quick   torch.float16           1        4    512   8.981000    6.575000      1.374139
587         gelu_quick   torch.float16           1        4   1024   9.181000    6.575000      1.371117
588         gelu_quick   torch.float16           1        4   2048   9.983000    6.535000      1.441775
589         gelu_quick   torch.float16           1        4   4096  10.905000    6.655000      1.624944
590         gelu_quick   torch.float16           1        4   8192  14.433000    6.856000      1.884371
591         gelu_quick   torch.float16           1        4  16384  19.123999    8.700000      1.986129
592         gelu_quick   torch.float16           1       16    128   9.180500    6.575000      1.384393
593         gelu_quick   torch.float16           1       16    256   9.141000    6.655000      1.367543
594         gelu_quick   torch.float16           1       16    512   9.221000    6.574000      1.408517
595         gelu_quick   torch.float16           1       16   1024   9.301000    6.575000      1.406198
596         gelu_quick   torch.float16           1       16   2048  10.183000    6.575000      1.497052
597         gelu_quick   torch.float16           1       16   4096  11.627000    6.696000      1.676624
598         gelu_quick   torch.float16           1       16   8192  14.433000    6.816000      1.988476
599         gelu_quick   torch.float16           1       16  16384  19.644000    9.020000      1.942605
600         gelu_quick   torch.float16           1       64    128   9.140000    6.575000      1.378099
601         gelu_quick   torch.float16           1       64    256   9.422000    6.695000      1.397596
602         gelu_quick   torch.float16           1       64    512   9.381000    6.615000      1.418141
603         gelu_quick   torch.float16           1       64   1024   9.622000    6.615000      1.436281
604         gelu_quick   torch.float16           1       64   2048  10.384000    6.655000      1.554170
605         gelu_quick   torch.float16           1       64   4096  11.746000    6.776000      1.716162
606         gelu_quick   torch.float16           1       64   8192  14.392000    6.776000      1.816411
607         gelu_quick   torch.float16           1       64  16384  19.804999    9.582000      2.004106
608         gelu_quick   torch.float16           4        1    128   9.020000    6.575000      1.355777
609         gelu_quick   torch.float16           4        1    256   8.900000    6.535000      1.364126
610         gelu_quick   torch.float16           4        1    512   9.020000    6.575000      1.369766
611         gelu_quick   torch.float16           4        1   1024   9.181000    6.575000      1.378099
612         gelu_quick   torch.float16           4        1   2048   9.742000    6.574000      1.475589
613         gelu_quick   torch.float16           4        1   4096  11.226000    6.615000      1.749962
614         gelu_quick   torch.float16           4        1   8192  13.311000    6.695000      1.838387
615         gelu_quick   torch.float16           4        1  16384  18.722000    8.700000      1.914360
616         gelu_quick   torch.float16           4        4    128   8.941000    6.535000      1.359848
617         gelu_quick   torch.float16           4        4    256   9.181000    6.575000      1.384393
618         gelu_quick   torch.float16           4        4    512   9.181000    6.535000      1.384182
619         gelu_quick   torch.float16           4        4   1024   9.261000    6.575000      1.400000
620         gelu_quick   torch.float16           4        4   2048  10.264000    6.615000      1.506160
621         gelu_quick   torch.float16           4        4   4096  11.506000    6.615000      1.654573
622         gelu_quick   torch.float16           4        4   8192  13.992000    6.695000      1.976244
623         gelu_quick   torch.float16           4        4  16384  19.323001    9.060000      1.909070
624         gelu_quick   torch.float16           4       16    128   9.060000    6.575000      1.363719
625         gelu_quick   torch.float16           4       16    256   9.261000    6.655000      1.395220
626         gelu_quick   torch.float16           4       16    512   9.461000    6.655000      1.403606
627         gelu_quick   torch.float16           4       16   1024   9.782000    6.575000      1.408669
628         gelu_quick   torch.float16           4       16   2048  10.263000    6.655000      1.548159
629         gelu_quick   torch.float16           4       16   4096  11.626000    6.735000      1.690572
630         gelu_quick   torch.float16           4       16   8192  14.312000    6.815000      1.964848
631         gelu_quick   torch.float16           4       16  16384  20.005001    9.582000      2.062461
632         gelu_quick   torch.float16           4       64    128   9.341500    6.695000      1.395220
633         gelu_quick   torch.float16           4       64    256   9.381000    6.695000      1.407170
634         gelu_quick   torch.float16           4       64    512   9.541000    6.656000      1.427648
635         gelu_quick   torch.float16           4       64   1024   9.541000    6.575000      1.469729
636         gelu_quick   torch.float16           4       64   2048  10.584000    6.775000      1.541419
637         gelu_quick   torch.float16           4       64   4096  12.588000    7.056000      1.784155
638         gelu_quick   torch.float16           4       64   8192  16.195999    8.179000      1.863979
639         gelu_quick   torch.float16           4       64  16384  22.852000   11.386000      1.893552
640         gelu_quick   torch.float16          16        1    128   9.140000    6.575000      1.390114
641         gelu_quick   torch.float16          16        1    256   9.422000    6.735000      1.420684
642         gelu_quick   torch.float16          16        1    512   9.141000    6.655000      1.386534
643         gelu_quick   torch.float16          16        1   1024   9.221000    6.615000      1.387906
644         gelu_quick   torch.float16          16        1   2048  10.143000    6.615000      1.496901
645         gelu_quick   torch.float16          16        1   4096  11.386000    6.615000      1.618292
646         gelu_quick   torch.float16          16        1   8192  13.711000    6.735000      1.881069
647         gelu_quick   torch.float16          16        1  16384  18.963000    8.980000      1.918623
648         gelu_quick   torch.float16          16        4    128   9.141000    6.655000      1.381859
649         gelu_quick   torch.float16          16        4    256   9.342000    6.696000      1.391585
650         gelu_quick   torch.float16          16        4    512   9.381000    6.615000      1.403606
651         gelu_quick   torch.float16          16        4   1024   9.501000    6.615000      1.451623
652         gelu_quick   torch.float16          16        4   2048  10.464000    6.695000      1.542149
653         gelu_quick   torch.float16          16        4   4096  11.626000    6.735000      1.690572
654         gelu_quick   torch.float16          16        4   8192  14.353000    6.816000      1.982539
655         gelu_quick   torch.float16          16        4  16384  19.765001    9.622000      2.050198
656         gelu_quick   torch.float16          16       16    128   9.342000    6.695000      1.395220
657         gelu_quick   torch.float16          16       16    256   9.582000    6.735000      1.389246
658         gelu_quick   torch.float16          16       16    512   9.542000    6.695000      1.407170
659         gelu_quick   torch.float16          16       16   1024   9.662000    6.575000      1.460620
660         gelu_quick   torch.float16          16       16   2048  10.504000    6.776000      1.532320
661         gelu_quick   torch.float16          16       16   4096  12.509000    7.056000      1.795351
662         gelu_quick   torch.float16          16       16   8192  16.197000    8.099000      1.889446
663         gelu_quick   torch.float16          16       16  16384  22.812000   11.386000      1.925173
664         gelu_quick   torch.float16          16       64    128   9.542000    6.775000      1.432029
665         gelu_quick   torch.float16          16       64    256   9.702000    6.775000      1.437934
666         gelu_quick   torch.float16          16       64    512  10.424000    6.775000      1.520738
667         gelu_quick   torch.float16          16       64   1024  12.468000    7.096000      1.757046
668         gelu_quick   torch.float16          16       64   2048  14.753000    7.898000      1.836600
669         gelu_quick   torch.float16          16       64   4096  21.048000   11.066000      1.960000
670         gelu_quick   torch.float16          16       64   8192  34.479000   18.642001      1.660829
671         gelu_quick   torch.float16          16       64  16384  61.097998   31.311002      1.972928
672         gelu_quick  torch.bfloat16           1        1    128   8.780000    6.575000      1.315193
673         gelu_quick  torch.bfloat16           1        1    256   8.981000    6.575000      1.333333
674         gelu_quick  torch.bfloat16           1        1    512   8.740000    6.535000      1.329480
675         gelu_quick  torch.bfloat16           1        1   1024   8.780000    6.535000      1.331293
676         gelu_quick  torch.bfloat16           1        1   2048   9.302000    6.575000      1.408517
677         gelu_quick  torch.bfloat16           1        1   4096  10.023000    6.574000      1.518327
678         gelu_quick  torch.bfloat16           1        1   8192  12.869000    6.655000      1.854573
679         gelu_quick  torch.bfloat16           1        1  16384  17.640000    8.860000      1.787359
680         gelu_quick  torch.bfloat16           1        4    128   8.740000    6.535000      1.329278
681         gelu_quick  torch.bfloat16           1        4    256   8.940000    6.535000      1.349656
682         gelu_quick  torch.bfloat16           1        4    512   9.021000    6.575000      1.365932
683         gelu_quick  torch.bfloat16           1        4   1024   9.141000    6.575000      1.384182
684         gelu_quick  torch.bfloat16           1        4   2048   9.782000    6.575000      1.481673
685         gelu_quick  torch.bfloat16           1        4   4096  11.145000    6.615000      1.648375
686         gelu_quick  torch.bfloat16           1        4   8192  14.232000    6.696000      1.880209
687         gelu_quick  torch.bfloat16           1        4  16384  19.083001    9.060000      1.904535
688         gelu_quick  torch.bfloat16           1       16    128   8.980000    6.575000      1.365780
689         gelu_quick  torch.bfloat16           1       16    256   9.221000    6.575000      1.390266
690         gelu_quick  torch.bfloat16           1       16    512   9.221000    6.615000      1.381859
691         gelu_quick  torch.bfloat16           1       16   1024   9.541000    6.615000      1.400000
692         gelu_quick  torch.bfloat16           1       16   2048  10.183000    6.655000      1.493914
693         gelu_quick  torch.bfloat16           1       16   4096  11.627000    6.695000      1.696901
694         gelu_quick  torch.bfloat16           1       16   8192  14.112000    6.855000      1.917343
695         gelu_quick  torch.bfloat16           1       16  16384  19.484000    9.421000      1.877605
696         gelu_quick  torch.bfloat16           1       64    128   9.060000    6.655000      1.369615
697         gelu_quick  torch.bfloat16           1       64    256   9.301000    6.735000      1.401135
698         gelu_quick  torch.bfloat16           1       64    512   9.381000    6.655000      1.403395
699         gelu_quick  torch.bfloat16           1       64   1024   9.542000    6.615000      1.442479
700         gelu_quick  torch.bfloat16           1       64   2048  10.303000    6.695000      1.526732
701         gelu_quick  torch.bfloat16           1       64   4096  11.787000    6.775000      1.726099
702         gelu_quick  torch.bfloat16           1       64   8192  14.593000    7.016000      2.011756
703         gelu_quick  torch.bfloat16           1       64  16384  19.966001   10.023000      1.964321
704         gelu_quick  torch.bfloat16           4        1    128   8.941000    6.575000      1.335361
705         gelu_quick  torch.bfloat16           4        1    256   8.901000    6.535000      1.343535
706         gelu_quick  torch.bfloat16           4        1    512   9.061000    6.575000      1.347681
707         gelu_quick  torch.bfloat16           4        1   1024   9.180000    6.575000      1.384182
708         gelu_quick  torch.bfloat16           4        1   2048   9.822000    6.575000      1.466667
709         gelu_quick  torch.bfloat16           4        1   4096  11.185000    6.615000      1.605484
710         gelu_quick  torch.bfloat16           4        1   8192  13.590500    6.735000      1.809651
711         gelu_quick  torch.bfloat16           4        1  16384  18.322000    8.820000      1.858428
712         gelu_quick  torch.bfloat16           4        4    128   9.181000    6.655000      1.369766
713         gelu_quick  torch.bfloat16           4        4    256   9.261000    6.615000      1.400000
714         gelu_quick  torch.bfloat16           4        4    512   9.221000    6.575000      1.381859
715         gelu_quick  torch.bfloat16           4        4   1024   9.261000    6.655000      1.400000
716         gelu_quick  torch.bfloat16           4        4   2048  10.143000    6.655000      1.488054
717         gelu_quick  torch.bfloat16           4        4   4096  11.346000    6.655000      1.668670
718         gelu_quick  torch.bfloat16           4        4   8192  14.152000    6.815000      1.923415
719         gelu_quick  torch.bfloat16           4        4  16384  19.363999    9.301000      1.865199
720         gelu_quick  torch.bfloat16           4       16    128   9.180000    6.655000      1.361533
721         gelu_quick  torch.bfloat16           4       16    256   9.301000    6.695000      1.377296
722         gelu_quick  torch.bfloat16           4       16    512   9.381000    6.695000      1.395220
723         gelu_quick  torch.bfloat16           4       16   1024   9.502000    6.575000      1.436281
724         gelu_quick  torch.bfloat16           4       16   2048  10.383000    6.656000      1.523888
725         gelu_quick  torch.bfloat16           4       16   4096  11.667000    6.735000      1.702450
726         gelu_quick  torch.bfloat16           4       16   8192  14.452500    6.976000      1.965453
727         gelu_quick  torch.bfloat16           4       16  16384  19.965000   10.023000      1.964268
728         gelu_quick  torch.bfloat16           4       64    128   9.342000    6.735000      1.392873
729         gelu_quick  torch.bfloat16           4       64    256   9.381000    6.735000      1.408413
730         gelu_quick  torch.bfloat16           4       64    512   9.542000    6.696000      1.413082
731         gelu_quick  torch.bfloat16           4       64   1024   9.622000    6.615000      1.491005
732         gelu_quick  torch.bfloat16           4       64   2048  10.665000    6.815000      1.535065
733         gelu_quick  torch.bfloat16           4       64   4096  12.829000    7.136000      1.796646
734         gelu_quick  torch.bfloat16           4       64   8192  16.277000    8.179000      1.848026
735         gelu_quick  torch.bfloat16           4       64  16384  22.812000   11.907000      1.820326
736         gelu_quick  torch.bfloat16          16        1    128   9.141000    6.575000      1.375813
737         gelu_quick  torch.bfloat16          16        1    256   9.221000    6.575000      1.384182
738         gelu_quick  torch.bfloat16          16        1    512   9.181000    6.575000      1.390266
739         gelu_quick  torch.bfloat16          16        1   1024   9.261000    6.615000      1.393953
740         gelu_quick  torch.bfloat16          16        1   2048  10.063000    6.655000      1.496938
741         gelu_quick  torch.bfloat16          16        1   4096  11.466000    6.655000      1.660620
742         gelu_quick  torch.bfloat16          16        1   8192  13.831000    6.816000      1.946576
743         gelu_quick  torch.bfloat16          16        1  16384  19.363999    9.581000      1.797441
744         gelu_quick  torch.bfloat16          16        4    128   9.180000    6.656000      1.375661
745         gelu_quick  torch.bfloat16          16        4    256   9.301000    6.775000      1.377296
746         gelu_quick  torch.bfloat16          16        4    512   9.342000    6.695000      1.397386
747         gelu_quick  torch.bfloat16          16        4   1024   9.582000    6.575000      1.424188
748         gelu_quick  torch.bfloat16          16        4   2048  10.303000    6.655000      1.520986
749         gelu_quick  torch.bfloat16          16        4   4096  11.706000    6.736000      1.708389
750         gelu_quick  torch.bfloat16          16        4   8192  14.433000    6.936000      1.959719
751         gelu_quick  torch.bfloat16          16        4  16384  20.206001   10.063000      1.976051
752         gelu_quick  torch.bfloat16          16       16    128   9.461000    6.775000      1.398961
753         gelu_quick  torch.bfloat16          16       16    256   9.421000    6.736000      1.392873
754         gelu_quick  torch.bfloat16          16       16    512   9.502000    6.696000      1.415565
755         gelu_quick  torch.bfloat16          16       16   1024   9.662000    6.615000      1.460620
756         gelu_quick  torch.bfloat16          16       16   2048  10.584000    6.815000      1.550406
757         gelu_quick  torch.bfloat16          16       16   4096  12.628499    7.096000      1.779594
758         gelu_quick  torch.bfloat16          16       16   8192  16.277000    8.299000      1.829399
759         gelu_quick  torch.bfloat16          16       16  16384  23.011999   11.907000      1.846865
760         gelu_quick  torch.bfloat16          16       64    128   9.542000    6.736000      1.420012
761         gelu_quick  torch.bfloat16          16       64    256   9.742000    6.775000      1.449889
762         gelu_quick  torch.bfloat16          16       64    512  10.584000    6.816000      1.555799
763         gelu_quick  torch.bfloat16          16       64   1024  12.629000    7.256000      1.759755
764         gelu_quick  torch.bfloat16          16       64   2048  14.994000    8.138000      1.810995
765         gelu_quick  torch.bfloat16          16       64   4096  21.288000   11.346000      1.925218
766         gelu_quick  torch.bfloat16          16       64   8192  35.080001   18.722000      1.700371
767         gelu_quick  torch.bfloat16          16       64  16384  62.302001   32.113001      1.938223
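
For readers skimming the table: `gelu_quick` computes `x * sigmoid(1.702 * x)`, while the fused `*_and_mul` kernels split the last dimension in half, apply the activation to the first half, and multiply it elementwise by the second half (the gate). A minimal correctness sketch against a pure-PyTorch reference could look like the snippet below; it assumes the `sgl_kernel` Python package exports `gelu_quick` and `gelu_tanh_and_mul` as functions that return the output tensor, which may differ from the exact signatures in this PR.

```python
import torch
import torch.nn.functional as F

from sgl_kernel import gelu_quick, gelu_tanh_and_mul  # assumed import path


def ref_gelu_quick(x: torch.Tensor) -> torch.Tensor:
    # QuickGELU: x * sigmoid(1.702 * x)
    return x * torch.sigmoid(1.702 * x)


def ref_gelu_tanh_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half, apply tanh-approximated GELU to the
    # first half, and multiply elementwise by the second half (the gate).
    d = x.shape[-1] // 2
    return F.gelu(x[..., :d], approximate="tanh") * x[..., d:]


for dtype in (torch.float16, torch.bfloat16):
    x = torch.randn(16, 64, 2 * 4096, dtype=dtype, device="cuda")
    torch.testing.assert_close(
        gelu_tanh_and_mul(x),  # assumed API: tensor in, tensor out
        ref_gelu_tanh_and_mul(x),
        rtol=2e-2,
        atol=2e-2,
    )

    y = torch.randn(16, 64, 4096, dtype=dtype, device="cuda")
    torch.testing.assert_close(
        gelu_quick(y),  # assumed API: tensor in, tensor out
        ref_gelu_quick(y),
        rtol=2e-2,
        atol=2e-2,
    )

print("reference check passed")
```

Tolerances are deliberately loose: the fused kernel and the PyTorch reference evaluate tanh/sigmoid with different intrinsics, so fp16/bf16 outputs only agree to a few decimal digits.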

@hubertlu-tw hubertlu-tw changed the title from "[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels" to "[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs" on Jun 12, 2025
@HaiShaw HaiShaw self-assigned this Jun 13, 2025
Comment thread sgl-kernel/python/sgl_kernel/elementwise.py Outdated
Comment thread sgl-kernel/csrc/elementwise/activation.cu Outdated
@zhyncs
Collaborator

zhyncs commented Jul 6, 2025

pre-commit run --all-files

Collaborator

@HaiShaw HaiShaw left a comment

@hubertlu-tw still - some naming and files discussion needed.

@hubertlu-tw hubertlu-tw changed the title from "[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs" to "[DON'T MERGE][AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs" on Jul 8, 2025
@hubertlu-tw hubertlu-tw changed the title from "[DON'T MERGE][AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs" to "[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs" on Jul 9, 2025
@zhyncs zhyncs self-assigned this Jul 20, 2025
@zhyncs
Collaborator

zhyncs commented Jul 20, 2025

@HaiShaw

@hubertlu-tw hubertlu-tw force-pushed the silu_and_mul branch 2 times, most recently from c36e4f5 to 03715dc on July 23, 2025 19:56
@hubertlu-tw hubertlu-tw requested a review from ByronHsu as a code owner July 23, 2025 19:56
Collaborator

@HaiShaw HaiShaw left a comment

Few comments

Comment thread sgl-kernel/benchmark/bench_activation.py Outdated
Comment thread sgl-kernel/include/hip_math_def.h Outdated
Comment thread sgl-kernel/include/hip_math_def.h Outdated
Comment thread sgl-kernel/include/hip_vec_dtypes.h Outdated
Comment thread sgl-kernel/include/hip_vec_dtypes.h Outdated
Collaborator

@HaiShaw HaiShaw left a comment

LGTM, vllm import to address separately.

@HaiShaw HaiShaw merged commit af4b9ba into sgl-project:main Jul 25, 2025
90 of 93 checks passed
@yiakwy-xpu-ml-framework-team
Contributor

@hubertlu-tw really nice benchmarking!

yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker' of git@code.alipay.com:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <zhangkaihong.zkh@antgroup.com>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete uselese code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch cause 250us slower (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add verify logit_bias to avoid crash because of IndexError  (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compability (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simplify MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enhance green context stream creation robustness with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." into the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgrade router version to 0.1.6 (sgl-project#8209)
* Remove router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Update Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformer to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." into the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config form env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." into the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transformers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS sgl-project#8598
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugfix] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since its no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC]Update sgl-kernel README (sgl-project#8665)
* fix per-token cuda kernel when the hidden dim is not divisible by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compiling deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code clean up and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf] Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)
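
For context on the GC item above (sgl-project#8577): the trick is to pause the Python garbage collector around CUDA graph capture so collector pauses cannot interleave with the capture sequence. A minimal sketch in plain PyTorch, assuming a toy model as the captured workload (the names here are illustrative, not SGLang's actual capture path):

```python
import gc
import torch

# Toy stand-ins; SGLang captures its real model runner here.
model = torch.nn.Linear(256, 256).cuda().half()
static_input = torch.randn(8, 256, device="cuda", dtype=torch.float16)

# CUDA graphs require a warm-up run on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    model(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
was_enabled = gc.isenabled()
gc.disable()  # keep Python GC pauses out of the capture window
try:
    with torch.cuda.graph(g):
        static_out = model(static_input)
finally:
    if was_enabled:
        gc.enable()

g.replay()  # re-launches the captured kernels against the static buffers
```
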
@b8zhong
Collaborator

b8zhong commented Oct 1, 2025

Hi @hubertlu-tw, great and comprehensive kernel benchmarks! I am wondering if the silu_and_mul kernel in your PR is actually different from the AITER one? Apologies if I misunderstood; I am not too familiar with HIP.
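
For readers following the comparison: both kernels implement the same fused op, differing only in their device-side implementation. A minimal PyTorch-level reference of the silu_and_mul semantics, for illustration only:

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    """Reference semantics: split the last dim in half, apply SiLU to the
    first half (the gate), and multiply elementwise by the second half."""
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

# Example: a (batch, 2*d) activation as produced by a fused gate/up projection.
x = torch.randn(4, 2048, dtype=torch.float16)
out = silu_and_mul_ref(x)  # shape: (4, 1024)
```
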
