cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization #19624

Merged: am17an merged 3 commits into ggml-org:master from dfriehs:iq2xxs-cuda on Feb 15, 2026
Conversation

dfriehs (Contributor) commented on Feb 14, 2026

While looking over the quantizations, I believe I found a few optimizations for iq2xxs/iq2xs/iq3xxs. With these changes I see a 5-10% throughput increase in test-backend-ops for small n, and a smaller gain otherwise:

  • load all 8 int8 for a grid position in one load
  • calculate signs via popcnt instead of fetching from ksigns table
  • broadcast signs to drop individual shift/mask

test-backend-ops test -b CUDA0 -p iq2_xxs
test-backend-ops test -b CUDA0 -p iq2_xs
test-backend-ops test -b CUDA0 -p iq3_xxs
all pass for me.

test-backend-ops perf -b CUDA0 on 01d8eaa:

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24123 MB (22796 MB free)

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   238560 runs -    42.05 us/run - 117.44 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   221946 runs -    45.12 us/run - 234.88 MFLOP/run -   5.21 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   191416 runs -    52.27 us/run - 352.32 MFLOP/run -   6.74 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   165927 runs -    60.33 us/run - 469.76 MFLOP/run -   7.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   142614 runs -    70.19 us/run - 587.20 MFLOP/run -   8.37 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   104325 runs -    95.94 us/run - 939.52 MFLOP/run -   9.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14356 runs -   696.66 us/run -  60.13 GFLOP/run -  86.31 TFLOPS

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    245376 runs -    40.78 us/run - 117.44 MFLOP/run -   2.88 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    230040 runs -    43.50 us/run - 234.88 MFLOP/run -   5.40 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    193120 runs -    51.85 us/run - 352.32 MFLOP/run -   6.79 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    168909 runs -    59.23 us/run - 469.76 MFLOP/run -   7.93 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    140049 runs -    71.42 us/run - 587.20 MFLOP/run -   8.22 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    103790 runs -    96.37 us/run - 939.52 MFLOP/run -   9.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   11972 runs -   835.39 us/run -  60.13 GFLOP/run -  71.98 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048):                     731216 runs -    13.74 us/run -  25.17 MFLOP/run -   1.83 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1,mul=0):    731216 runs -    13.72 us/run -  25.17 MFLOP/run -   1.83 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048):                     214704 runs -    46.60 us/run - 100.66 MFLOP/run -   2.16 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1,mul=0):    216692 runs -    46.21 us/run - 100.66 MFLOP/run -   2.18 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048):                      66598 runs -   150.60 us/run - 201.33 MFLOP/run -   1.34 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1,mul=0):     63119 runs -   159.07 us/run - 201.33 MFLOP/run -   1.27 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048):                     31125 runs -   321.59 us/run - 805.31 MFLOP/run -   2.50 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1,mul=0):    30875 runs -   324.08 us/run - 805.31 MFLOP/run -   2.48 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048):                     23121 runs -   432.86 us/run -   1.61 GFLOP/run -   3.72 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1,mul=0):    23121 runs -   432.97 us/run -   1.61 GFLOP/run -   3.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048):                    14688 runs -   680.84 us/run -   3.22 GFLOP/run -   4.73 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1,mul=0):   14656 runs -   682.48 us/run -   3.22 GFLOP/run -   4.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048):                    13600 runs -   735.82 us/run -   6.44 GFLOP/run -   8.76 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1,mul=0):   13616 runs -   734.61 us/run -   6.44 GFLOP/run -   8.77 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048):                    12688 runs -   788.64 us/run -  12.88 GFLOP/run -  16.34 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1,mul=0):   12664 runs -   789.89 us/run -  12.88 GFLOP/run -  16.31 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048):                     664170 runs -    15.09 us/run -  29.36 MFLOP/run -   1.95 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1,mul=0):    664170 runs -    15.10 us/run -  29.36 MFLOP/run -   1.94 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048):                     186588 runs -    53.80 us/run - 117.44 MFLOP/run -   2.18 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1,mul=0):    186588 runs -    53.66 us/run - 117.44 MFLOP/run -   2.19 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048):                      71994 runs -   138.97 us/run - 234.88 MFLOP/run -   1.69 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1,mul=0):     71994 runs -   139.12 us/run - 234.88 MFLOP/run -   1.69 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048):                     50183 runs -   199.66 us/run - 939.52 MFLOP/run -   4.71 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1,mul=0):    50290 runs -   199.09 us/run - 939.52 MFLOP/run -   4.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048):                     38340 runs -   260.92 us/run -   1.88 GFLOP/run -   7.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1,mul=0):    38394 runs -   260.48 us/run -   1.88 GFLOP/run -   7.21 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048):                    24489 runs -   408.67 us/run -   3.76 GFLOP/run -   9.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1,mul=0):   24489 runs -   408.58 us/run -   3.76 GFLOP/run -   9.20 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048):                    22204 runs -   450.41 us/run -   7.52 GFLOP/run -  16.69 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1,mul=0):   22176 runs -   451.13 us/run -   7.52 GFLOP/run -  16.66 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048):                    20608 runs -   485.36 us/run -  15.03 GFLOP/run -  30.97 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1,mul=0):   20559 runs -   486.42 us/run -  15.03 GFLOP/run -  30.90 TFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   252192 runs -    39.76 us/run - 117.44 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   235152 runs -    42.55 us/run - 234.88 MFLOP/run -   5.52 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   179772 runs -    55.66 us/run - 352.32 MFLOP/run -   6.33 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   164223 runs -    60.90 us/run - 469.76 MFLOP/run -   7.71 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   144666 runs -    69.17 us/run - 587.20 MFLOP/run -   8.49 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   105502 runs -    94.83 us/run - 939.52 MFLOP/run -   9.91 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14566 runs -   686.61 us/run -  60.13 GFLOP/run -  87.57 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

on be3a90c (this branch):

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24123 MB (22680 MB free)

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   256452 runs -    39.02 us/run - 117.44 MFLOP/run -   3.01 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   235578 runs -    42.52 us/run - 234.88 MFLOP/run -   5.52 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   183180 runs -    54.60 us/run - 352.32 MFLOP/run -   6.45 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   165075 runs -    60.60 us/run - 469.76 MFLOP/run -   7.75 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   150993 runs -    66.30 us/run - 587.20 MFLOP/run -   8.86 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   107856 runs -    92.75 us/run - 939.52 MFLOP/run -  10.13 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14702 runs -   680.24 us/run -  60.13 GFLOP/run -  88.39 TFLOPS

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    269232 runs -    37.23 us/run - 117.44 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    242820 runs -    41.25 us/run - 234.88 MFLOP/run -   5.69 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    195676 runs -    51.13 us/run - 352.32 MFLOP/run -   6.89 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    168270 runs -    59.44 us/run - 469.76 MFLOP/run -   7.90 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    152703 runs -    65.50 us/run - 587.20 MFLOP/run -   8.96 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    109782 runs -    91.11 us/run - 939.52 MFLOP/run -  10.31 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   12088 runs -   827.40 us/run -  60.13 GFLOP/run -  72.67 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048):                     755060 runs -    13.31 us/run -  25.17 MFLOP/run -   1.89 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1,mul=0):    751086 runs -    13.32 us/run -  25.17 MFLOP/run -   1.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048):                     213710 runs -    46.83 us/run - 100.66 MFLOP/run -   2.15 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1,mul=0):    213710 runs -    46.89 us/run - 100.66 MFLOP/run -   2.15 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048):                      56161 runs -   178.76 us/run - 201.33 MFLOP/run -   1.13 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1,mul=0):     64113 runs -   156.79 us/run - 201.33 MFLOP/run -   1.28 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048):                     32750 runs -   305.36 us/run - 805.31 MFLOP/run -   2.64 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1,mul=0):    32625 runs -   306.89 us/run - 805.31 MFLOP/run -   2.62 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048):                     23877 runs -   419.24 us/run -   1.61 GFLOP/run -   3.84 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1,mul=0):    23814 runs -   420.34 us/run -   1.61 GFLOP/run -   3.83 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048):                    14784 runs -   676.45 us/run -   3.22 GFLOP/run -   4.76 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1,mul=0):   14816 runs -   675.55 us/run -   3.22 GFLOP/run -   4.77 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048):                    13792 runs -   725.28 us/run -   6.44 GFLOP/run -   8.88 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1,mul=0):   13808 runs -   724.79 us/run -   6.44 GFLOP/run -   8.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048):                    12872 runs -   776.98 us/run -  12.88 GFLOP/run -  16.58 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1,mul=0):   12880 runs -   776.71 us/run -  12.88 GFLOP/run -  16.59 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048):                     677794 runs -    14.81 us/run -  29.36 MFLOP/run -   1.98 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1,mul=0):    674388 runs -    14.83 us/run -  29.36 MFLOP/run -   1.98 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048):                     187440 runs -    53.40 us/run - 117.44 MFLOP/run -   2.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1,mul=0):    191700 runs -    52.20 us/run - 117.44 MFLOP/run -   2.25 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048):                      85200 runs -   117.83 us/run - 234.88 MFLOP/run -   1.99 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1,mul=0):     78384 runs -   127.81 us/run - 234.88 MFLOP/run -   1.84 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048):                     52430 runs -   191.01 us/run - 939.52 MFLOP/run -   4.92 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1,mul=0):    52323 runs -   191.32 us/run - 939.52 MFLOP/run -   4.91 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048):                     39420 runs -   253.88 us/run -   1.88 GFLOP/run -   7.40 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1,mul=0):    39366 runs -   254.29 us/run -   1.88 GFLOP/run -   7.39 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048):                    24705 runs -   405.12 us/run -   3.76 GFLOP/run -   9.28 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1,mul=0):   24678 runs -   405.26 us/run -   3.76 GFLOP/run -   9.27 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048):                    22456 runs -   445.49 us/run -   7.52 GFLOP/run -  16.87 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1,mul=0):   22484 runs -   445.01 us/run -   7.52 GFLOP/run -  16.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048):                    20930 runs -   477.92 us/run -  15.03 GFLOP/run -  31.45 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1,mul=0):   20881 runs -   479.05 us/run -  15.03 GFLOP/run -  31.38 TFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   293940 runs -    34.07 us/run - 117.44 MFLOP/run -   3.45 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   265398 runs -    37.72 us/run - 234.88 MFLOP/run -   6.23 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   215272 runs -    46.51 us/run - 352.32 MFLOP/run -   7.58 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   181050 runs -    55.29 us/run - 469.76 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   163476 runs -    61.20 us/run - 587.20 MFLOP/run -   9.60 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   114597 runs -    87.30 us/run - 939.52 MFLOP/run -  10.76 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  15206 runs -   657.70 us/run -  60.13 GFLOP/run -  91.42 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Feb 14, 2026
Commit: cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

@dfriehs changed the title from "cuda: optimize iq2xxs dequantization" to "cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization" on Feb 14, 2026

Commit: cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
@BrickBee commented:

These optimizations might mitigate #8760 a little bit.

Commit: uint -> uint32_t (fixes `error: identifier "uint" is undefined`)
@am17an am17an merged commit 27b93cb into ggml-org:master Feb 15, 2026
72 of 73 checks passed
@dfriehs dfriehs deleted the iq2xxs-cuda branch February 15, 2026 23:05
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026