
[1/2] Optimizations and refactors about quant kernel#9534

Merged
ispobock merged 594 commits into sgl-project:main from fzyzcjy:feat/opt_quant_extracted_kernel
Sep 5, 2025

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Aug 23, 2025

Motivation

To review this, just look at #7601.

Remarks about CI

We need to check that:

  1. srt CI in 9534 should pass
  2. sgl-kernel CI in 7601 should pass

Then it is safe to merge.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@ispobock
Collaborator

ispobock commented Sep 5, 2025

@fzyzcjy Can you share the kernel-level benchmark results (e.g., memory bandwidth) for this optimized kernel compared with the previous one?

@fzyzcjy
Collaborator Author

fzyzcjy commented Sep 5, 2025

@ispobock I did this two months ago, so I do not remember clearly... Below is the e2e speedup. Also, IIRC, the profile showed the kernel to be much faster than before.

[image: e2e speedup results]

@ispobock
Collaborator

ispobock commented Sep 5, 2025

@fzyzcjy I see. The E2E test and profile results are also good. For kernel optimizations, I just think it's better to have some kernel-level benchmark results (TFLOPS for compute-bound kernels, memory bandwidth for memory-bound kernels), especially across different problem sizes. That helps us understand how close we are to the hardware limit and makes sure the optimization benefits both small and large batch sizes.
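As a rough illustration of the memory-bandwidth figure requested above, here is a minimal sketch that turns a problem size and a measured latency into achieved bandwidth. It is not taken from the PR: the function names are hypothetical, and the byte sizes (2-byte source elements such as bf16, 1-byte 8-bit output, one 4-byte scale per group) are assumptions that would need adjusting for other scale formats. The elapsed time in the example is a placeholder, not a measurement.

```python
def quant_bytes_moved(num_tokens, hidden_dim, group_size,
                      src_bytes=2, dst_bytes=1, scale_bytes=4):
    """Bytes read plus bytes written by one per-token-group quantization call.

    Assumed (not from the PR): 2-byte source elements, 1-byte quantized
    output, and one 4-byte scale per group.
    """
    num_elements = num_tokens * hidden_dim
    num_groups = num_elements // group_size
    read_bytes = num_elements * src_bytes
    write_bytes = num_elements * dst_bytes + num_groups * scale_bytes
    return read_bytes + write_bytes


def achieved_bandwidth_gbps(bytes_moved, elapsed_seconds):
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_moved / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Placeholder latency purely for illustration; substitute a measured value.
    placeholder_seconds = 10e-6
    for num_tokens in (1, 64, 768, 16384):
        moved = quant_bytes_moved(num_tokens, hidden_dim=7168, group_size=128)
        gbps = achieved_bandwidth_gbps(moved, placeholder_seconds)
        print(f"num_tokens={num_tokens:6d} bytes={moved:12d} GB/s={gbps:10.1f}")
```

Comparing the resulting GB/s against the device's peak memory bandwidth for each problem size shows how close a memory-bound kernel is to the hardware limit.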

@fzyzcjy
Collaborator Author

fzyzcjy commented Sep 5, 2025

Definitely. Let me find some (old) logs.

@fzyzcjy
Collaborator Author

fzyzcjy commented Sep 5, 2025

Here are the baseline vs. SGL kernel results. I cannot find the previous result logs, though... IIRC, when I started optimizing, the benchmark logic was wrong (data was wrongly kept in the L2 cache) and some cases could not even run.

IIRC, months ago, after my optimization, I asked @Alcanderian. We discussed it and concluded that the code is near the limit and can hardly be pushed further.
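On the L2 cache pitfall mentioned above: one common way to keep a benchmark from re-reading inputs that are still cached is to rotate through enough distinct input buffers that their combined size exceeds the L2 cache. The sketch below only computes how many buffers that takes. It is an illustration under stated assumptions, not the benchmark logic used in this PR: the function name, the safety factor, and the 50 MiB L2 size are all hypothetical.

```python
def rotating_buffer_count(buffer_bytes, l2_cache_bytes, safety_factor=2):
    """Number of distinct input buffers to cycle through so that the
    combined working set is at least safety_factor times the L2 size.

    All parameters are illustrative; query the real L2 size from the device.
    """
    if buffer_bytes <= 0 or l2_cache_bytes <= 0 or safety_factor < 1:
        raise ValueError("sizes must be positive and safety_factor >= 1")
    target_bytes = l2_cache_bytes * safety_factor
    # Ceiling division without floating point.
    return max(1, -(-target_bytes // buffer_bytes))


if __name__ == "__main__":
    assumed_l2_bytes = 50 * 1024 * 1024  # assumption, not a measured value
    input_bytes = 768 * 1536 * 2         # 768 x 1536 input, 2 bytes per element
    print(rotating_buffer_count(input_bytes, assumed_l2_bytes))
```

A benchmark loop would then pick buffer `i % count` on iteration `i`, so each launch reads data that has most likely been evicted since its last use.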

old log 1 (I think it is the latest)

per-token-group-quant-8bit-performance:
   num_tokens  hidden_dim  group_size num_ranks            dst_dtype                                                                                                                                         flags  Triton (Inaccurate)  SGL Kernel
0         768        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.954       3.366
1         768        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               24.688       5.745
2         768       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.786       9.302
3        6144        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               38.041      13.467
4        6144        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               29.115      13.757
5        6144        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               54.143      14.079
6        6144        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               69.499      14.322
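To read the log above as speedups, a small sketch like the following divides the baseline column by the SGL kernel column. The numbers are copied from rows 0 to 2 of old log 1; the helper name is hypothetical. Since the baseline column is labeled "Triton (Inaccurate)", the ratios should be treated as rough indications only.

```python
# (num_tokens, hidden_dim, triton_time, sgl_kernel_time), rows 0-2 of old log 1.
# The log does not state the time unit; the ratio is unit-independent.
ROWS = [
    (768, 1536, 6.954, 3.366),
    (768, 7168, 24.688, 5.745),
    (768, 16384, 53.786, 9.302),
]


def speedup(baseline_time, optimized_time):
    """Ratio of baseline time to optimized time (greater than 1 means faster)."""
    if optimized_time <= 0:
        raise ValueError("optimized_time must be positive")
    return baseline_time / optimized_time


if __name__ == "__main__":
    for num_tokens, hidden_dim, triton_time, sgl_time in ROWS:
        ratio = speedup(triton_time, sgl_time)
        print(f"num_tokens={num_tokens} hidden_dim={hidden_dim} speedup={ratio:.2f}x")
```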

old log 2 (may be near the latest)

per-token-group-quant-8bit-performance:
     num_tokens  hidden_dim  group_size num_ranks            dst_dtype                                                                                                                                         flags  Triton (Inaccurate)  SGL Kernel
0             1        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.339       2.394
1             1        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.355       2.365
2             1        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.349       2.364
3             1        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.417       2.366
4             1        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.404       2.376
5             1        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.423       2.395
6             1        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.412       2.364
7             1        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.485       2.377
8             1       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.444       2.418
9             1       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.439       2.397
10            1       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.439       2.424
11            1       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.521       2.399
12            4        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.397       2.388
13            4        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.387       2.401
14            4        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.386       2.365
15            4        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.469       2.385
16            4        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.440       2.421
17            4        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.456       2.433
18            4        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.431       2.423
19            4        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.529       2.406
20            4       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.503       2.427
21            4       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.503       2.435
22            4       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.506       2.422
23            4       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.577       2.427
24           16        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.408       2.402
25           16        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.423       2.403
26           16        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.445       2.409
27           16        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.496       2.421
28           16        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.555       2.439
29           16        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.559       2.468
30           16        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.556       2.484
31           16        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.638       2.460
32           16       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.031       2.558
33           16       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.047       2.544
34           16       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.049       2.575
35           16       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.110       2.583
36           64        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.543       2.471
37           64        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.531       2.472
38           64        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.524       2.515
39           64        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.591       2.466
40           64        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.873       2.690
41           64        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.899       2.713
42           64        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.898       2.710
43           64        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.974       2.698
44           64       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.319       3.031
45           64       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.329       3.053
46           64       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.371       3.051
47           64       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.386       3.010
48          256        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.587       2.614
49          256        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.615       2.651
50          256        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.612       2.672
51          256        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.687       2.653
52          256        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                9.588       3.640
53          256        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                9.594       3.719
54          256        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                9.579       3.693
55          256        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                9.640       3.638
56          256       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               19.319       4.969
57          256       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               19.333       5.010
58          256       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               19.329       4.998
59          256       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               19.408       4.930
60          768        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.882       3.182
61          768        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.858       3.238
62          768        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.877       3.210
63          768        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.916       3.183
64          768        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               24.685       5.530
65          768        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               24.718       5.629
66          768        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               24.711       5.624
67          768        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               24.779       5.536
68          768       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.712       8.970
69          768       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.779       9.098
70          768       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.754       9.099
71          768       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.851       8.744
72         2048        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               15.006       4.363
73         2048        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               15.016       4.445
74         2048        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               15.017       4.470
75         2048        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               15.068       4.390
76         2048        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               62.332       9.914
77         2048        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               62.358      10.286
78         2048        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               62.340      10.344
79         2048        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               62.465       9.868
80         2048       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              139.732      17.963
81         2048       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              139.750      18.473
82         2048       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              139.744      18.534
83         2048       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              139.870      17.688
84         8192        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.721       8.676
85         8192        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.743       8.995
86         8192        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.747       8.930
87         8192        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               53.857       8.634
88         8192        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              242.947      28.597
89         8192        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              242.945      29.073
90         8192        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              242.962      29.082
91         8192        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              243.071      28.405
92         8192       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              552.556      59.808
93         8192       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              552.587      60.685
94         8192       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              552.571      60.662
95         8192       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              552.699      59.276
96        16384        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              105.339      14.481
97        16384        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              105.351      14.863
98        16384        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              105.353      14.785
99        16384        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              105.471      14.297
100       16384        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              483.767      53.512
101       16384        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              483.774      53.880
102       16384        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              483.769      53.905
103       16384        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              483.899      52.747
104       16384       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1103.000     115.501
105       16384       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1103.000     116.086
106       16384       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1103.000     116.087
107       16384       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1103.000     114.554
108           8        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.885       2.577
109           8        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.724      21.245
110           8        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               11.692      21.228
111           8        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               11.661      21.248
112           8        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.865       2.559
113           8        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                6.821      11.545
114           8        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.810      11.553
115           8        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                6.824      11.540
116           8        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.843       2.585
117           8        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                4.362       6.707
118           8        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                4.371       6.965
119           8        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                4.348       6.685
120           8        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.831       2.599
121           8        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                3.538       5.143
122           8        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                3.550       5.130
123           8        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                3.521       5.078
124          32        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.899       2.614
125          32        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.654      21.266
126          32        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               11.676      21.342
127          32        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               11.801      21.226
128          32        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.955       2.612
129          32        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                6.810      11.672
130          32        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.810      11.546
131          32        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                6.893      11.550
132          32        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.909       2.590
133          32        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                4.369       6.976
134          32        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                4.365       6.697
135          32        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                4.461       6.683
136          32        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.915       2.626
137          32        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                3.581       5.224
138          32        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                3.542       5.075
139          32        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                3.638       5.063
140         512        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                5.990       3.732
141         512        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.834      21.520
142         512        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               11.796      21.741
143         512        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               11.873      21.287
144         512        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                5.959       3.740
145         512        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                7.625      11.928
146         512        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.986      11.767
147         512        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                9.187      11.538
148         512        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                5.949       3.725
149         512        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                5.641       7.292
150         512        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.186       7.192
151         512        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                8.648       6.649
152         512        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                5.945       3.737
153         512        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                5.251       5.663
154         512        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.282       5.523
155         512        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                8.445       5.058
156        2048        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               14.427       6.536
157        2048        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               13.045      21.777
158        2048        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               14.063      21.366
159        2048        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               27.500      21.565
160        2048        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               14.483       6.552
161        2048        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.398      12.245
162        2048        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               15.196      12.351
163        2048        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               25.998      11.812
164        2048        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               14.421       6.558
165        2048        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.509       7.871
166        2048        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               15.396       8.025
167        2048        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               25.469       7.291
168        2048        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               14.458       6.544
169        2048        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               11.615       7.381
170        2048        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               21.658       7.304
171        2048        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               25.273       7.328
172        6144        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               36.902      12.961
173        6144        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               25.466      22.291
174        6144        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               29.396      23.389
175        6144        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               73.339      21.950
176        6144        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               36.957      12.921
177        6144        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               24.528      15.244
178        6144        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               31.688      17.964
179        6144        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               70.776      15.976
180        6144        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               36.944      13.014
181        6144        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               25.628      14.589
182        6144        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               50.286      16.555
183        6144        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               70.196      16.014
184        6144        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               36.848      12.953
185        6144        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               27.804      14.352
186        6144        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               53.068      15.131
187        6144        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               69.951      16.031
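
To make the two timing columns easier to compare, here is a minimal sketch that computes a per-row ratio for a few rows copied from the table above. It assumes the last two columns are timings in the same unit, with the baseline first and the sgl kernel second; the column headers are not shown in this part of the log, so treat that ordering as an assumption. The helper name `ratio` is only for illustration.

```python
# Rows copied from the table above: (row index, second-to-last column, last column).
# Assumption: both columns are timings in the same unit, baseline first.
rows = [
    (104, 1103.000, 115.501),
    (109, 11.724, 21.245),
    (156, 14.427, 6.536),
    (187, 69.951, 16.031),
]


def ratio(first: float, second: float) -> float:
    """Return first / second; a value above 1 means the last column is smaller."""
    return first / second


# Map each row index to its ratio so individual rows can be looked up later.
ratios = {idx: ratio(a, b) for idx, a, b in rows}

for idx, r in ratios.items():
    print(f"row {idx}: {r:.2f}x")
```

Under that assumption, a ratio below 1 (as in row 109, one of the small masked-layout cases) would mean the last column is the slower one for that configuration.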

@ispobock ispobock merged commit 339f8ee into sgl-project:main Sep 5, 2025
256 of 277 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
@zhyncs
Collaborator

zhyncs commented Sep 11, 2025

@fzyzcjy
Collaborator Author

fzyzcjy commented Sep 11, 2025

Oops, let me take a look.

@zhyncs
Collaborator

zhyncs commented Sep 11, 2025

[2025-09-10 21:34:40 DP1 TP2 EP2] Prefill batch. #new-seq: 1, #new-token: 666, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, 
Process Process-2:
Traceback (most recent call last):
  File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 298, in run_detokenizer_process
    manager.event_loop()
  File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 118, in event_loop
    output = self._request_dispatcher(recv_obj)
  File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/utils.py", line 483, in __call__
    return fn(obj)
  File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 186, in handle_batch_token_id_out
    read_texts = self.tokenizer.batch_decode(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3857, in batch_decode
    return [
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3858, in <listcomp>
    self.decode(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3897, in decode
    return self._decode(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 682, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 300, in run_detokenizer_process
    manager.socket_mapping.clear_all_sockets()
AttributeError: 'DetokenizerManager' object has no attribute 'socket_mapping'
All deep_gemm operations loaded successfully!
Error: The action 'Run test' has timed out after 20 minutes.
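
The log above contains two tracebacks: the original `OverflowError` from `batch_decode`, and a second `AttributeError` raised while handling it, because the cleanup path touches `socket_mapping` on an object that does not have it. The sketch below reproduces that pattern with a hypothetical `Manager` class (not the real `DetokenizerManager`) and shows one way to keep cleanup from raising a second error. It is an illustration under those assumptions, not the fix applied in sglang.

```python
class Manager:
    """Hypothetical stand-in; it deliberately has no `socket_mapping` attribute."""

    def event_loop(self):
        # Simulate the first failure seen in the log.
        raise OverflowError("out of range integral type conversion attempted")


def run(manager):
    try:
        manager.event_loop()
    except Exception:
        # Look the attribute up defensively so cleanup cannot raise
        # AttributeError and add a second traceback on top of the first.
        mapping = getattr(manager, "socket_mapping", None)
        if mapping is not None:
            mapping.clear_all_sockets()
        raise  # re-raise the original exception unchanged


def first_error_name(manager) -> str:
    """Return the name of the exception type that escapes `run`."""
    try:
        run(manager)
    except Exception as exc:
        return type(exc).__name__
    return "none"


print(first_error_name(Manager()))
```

With the defensive lookup, the exception that escapes is the original decode error, which is the one worth debugging first.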

@zhyncs
Collaborator

zhyncs commented Sep 11, 2025

zhyncs added a commit that referenced this pull request Sep 11, 2025
fzyzcjy pushed a commit to fzyzcjy/sglang that referenced this pull request Sep 11, 2025
…rnel (sgl-project#9534)" (sgl-project#10292)

(cherry picked from commit 6d55f60)

# Conflicts:
#	python/sglang/srt/layers/quantization/fp8_kernel.py
#	sgl-kernel/tests/test_per_token_group_quant_8bit.py
fzyzcjy added a commit to fzyzcjy/sglang that referenced this pull request Sep 11, 2025
fzyzcjy added a commit to fzyzcjy/sglang that referenced this pull request Sep 11, 2025