cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization #19624

Merged: am17an merged 3 commits into ggml-org:master from dfriehs:iq2xxs-cuda on Feb 15, 2026
Conversation

dfriehs (Contributor) commented on Feb 14, 2026

While looking over the quantizations, I believe I found a few optimizations for iq2xxs/iq2xs/iq3xxs. With these changes I see a 5-10% throughput increase in test-backend-ops for small n, and a smaller gain otherwise:

  • load all 8 int8 for a grid position in one load
  • calculate signs via popcnt instead of fetching from ksigns table
  • broadcast signs to drop individual shift/mask

test-backend-ops test -b CUDA0 -p iq2_xxs
test-backend-ops test -b CUDA0 -p iq2_xs
test-backend-ops test -b CUDA0 -p iq3_xxs
all pass for me.

test-backend-ops perf -b CUDA0 on 01d8eaa:

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24123 MB (22796 MB free)

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   238560 runs -    42.05 us/run - 117.44 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   221946 runs -    45.12 us/run - 234.88 MFLOP/run -   5.21 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   191416 runs -    52.27 us/run - 352.32 MFLOP/run -   6.74 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   165927 runs -    60.33 us/run - 469.76 MFLOP/run -   7.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   142614 runs -    70.19 us/run - 587.20 MFLOP/run -   8.37 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   104325 runs -    95.94 us/run - 939.52 MFLOP/run -   9.79 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14356 runs -   696.66 us/run -  60.13 GFLOP/run -  86.31 TFLOPS

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    245376 runs -    40.78 us/run - 117.44 MFLOP/run -   2.88 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    230040 runs -    43.50 us/run - 234.88 MFLOP/run -   5.40 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    193120 runs -    51.85 us/run - 352.32 MFLOP/run -   6.79 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    168909 runs -    59.23 us/run - 469.76 MFLOP/run -   7.93 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    140049 runs -    71.42 us/run - 587.20 MFLOP/run -   8.22 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    103790 runs -    96.37 us/run - 939.52 MFLOP/run -   9.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   11972 runs -   835.39 us/run -  60.13 GFLOP/run -  71.98 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048):                     731216 runs -    13.74 us/run -  25.17 MFLOP/run -   1.83 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1,mul=0):    731216 runs -    13.72 us/run -  25.17 MFLOP/run -   1.83 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048):                     214704 runs -    46.60 us/run - 100.66 MFLOP/run -   2.16 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1,mul=0):    216692 runs -    46.21 us/run - 100.66 MFLOP/run -   2.18 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048):                      66598 runs -   150.60 us/run - 201.33 MFLOP/run -   1.34 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1,mul=0):     63119 runs -   159.07 us/run - 201.33 MFLOP/run -   1.27 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048):                     31125 runs -   321.59 us/run - 805.31 MFLOP/run -   2.50 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1,mul=0):    30875 runs -   324.08 us/run - 805.31 MFLOP/run -   2.48 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048):                     23121 runs -   432.86 us/run -   1.61 GFLOP/run -   3.72 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1,mul=0):    23121 runs -   432.97 us/run -   1.61 GFLOP/run -   3.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048):                    14688 runs -   680.84 us/run -   3.22 GFLOP/run -   4.73 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1,mul=0):   14656 runs -   682.48 us/run -   3.22 GFLOP/run -   4.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048):                    13600 runs -   735.82 us/run -   6.44 GFLOP/run -   8.76 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1,mul=0):   13616 runs -   734.61 us/run -   6.44 GFLOP/run -   8.77 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048):                    12688 runs -   788.64 us/run -  12.88 GFLOP/run -  16.34 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1,mul=0):   12664 runs -   789.89 us/run -  12.88 GFLOP/run -  16.31 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048):                     664170 runs -    15.09 us/run -  29.36 MFLOP/run -   1.95 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1,mul=0):    664170 runs -    15.10 us/run -  29.36 MFLOP/run -   1.94 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048):                     186588 runs -    53.80 us/run - 117.44 MFLOP/run -   2.18 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1,mul=0):    186588 runs -    53.66 us/run - 117.44 MFLOP/run -   2.19 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048):                      71994 runs -   138.97 us/run - 234.88 MFLOP/run -   1.69 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1,mul=0):     71994 runs -   139.12 us/run - 234.88 MFLOP/run -   1.69 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048):                     50183 runs -   199.66 us/run - 939.52 MFLOP/run -   4.71 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1,mul=0):    50290 runs -   199.09 us/run - 939.52 MFLOP/run -   4.72 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048):                     38340 runs -   260.92 us/run -   1.88 GFLOP/run -   7.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1,mul=0):    38394 runs -   260.48 us/run -   1.88 GFLOP/run -   7.21 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048):                    24489 runs -   408.67 us/run -   3.76 GFLOP/run -   9.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1,mul=0):   24489 runs -   408.58 us/run -   3.76 GFLOP/run -   9.20 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048):                    22204 runs -   450.41 us/run -   7.52 GFLOP/run -  16.69 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1,mul=0):   22176 runs -   451.13 us/run -   7.52 GFLOP/run -  16.66 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048):                    20608 runs -   485.36 us/run -  15.03 GFLOP/run -  30.97 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1,mul=0):   20559 runs -   486.42 us/run -  15.03 GFLOP/run -  30.90 TFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   252192 runs -    39.76 us/run - 117.44 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   235152 runs -    42.55 us/run - 234.88 MFLOP/run -   5.52 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   179772 runs -    55.66 us/run - 352.32 MFLOP/run -   6.33 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   164223 runs -    60.90 us/run - 469.76 MFLOP/run -   7.71 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   144666 runs -    69.17 us/run - 587.20 MFLOP/run -   8.49 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   105502 runs -    94.83 us/run - 939.52 MFLOP/run -   9.91 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14566 runs -   686.61 us/run -  60.13 GFLOP/run -  87.57 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

on be3a90c (this branch):

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24123 MB (22680 MB free)

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   256452 runs -    39.02 us/run - 117.44 MFLOP/run -   3.01 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   235578 runs -    42.52 us/run - 234.88 MFLOP/run -   5.52 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   183180 runs -    54.60 us/run - 352.32 MFLOP/run -   6.45 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   165075 runs -    60.60 us/run - 469.76 MFLOP/run -   7.75 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   150993 runs -    66.30 us/run - 587.20 MFLOP/run -   8.86 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   107856 runs -    92.75 us/run - 939.52 MFLOP/run -  10.13 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  14702 runs -   680.24 us/run -  60.13 GFLOP/run -  88.39 TFLOPS

  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    269232 runs -    37.23 us/run - 117.44 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    242820 runs -    41.25 us/run - 234.88 MFLOP/run -   5.69 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    195676 runs -    51.13 us/run - 352.32 MFLOP/run -   6.89 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    168270 runs -    59.44 us/run - 469.76 MFLOP/run -   7.90 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    152703 runs -    65.50 us/run - 587.20 MFLOP/run -   8.96 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    109782 runs -    91.11 us/run - 939.52 MFLOP/run -  10.31 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   12088 runs -   827.40 us/run -  60.13 GFLOP/run -  72.67 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048):                     755060 runs -    13.31 us/run -  25.17 MFLOP/run -   1.89 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1,mul=0):    751086 runs -    13.32 us/run -  25.17 MFLOP/run -   1.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048):                     213710 runs -    46.83 us/run - 100.66 MFLOP/run -   2.15 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1,mul=0):    213710 runs -    46.89 us/run - 100.66 MFLOP/run -   2.15 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048):                      56161 runs -   178.76 us/run - 201.33 MFLOP/run -   1.13 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1,mul=0):     64113 runs -   156.79 us/run - 201.33 MFLOP/run -   1.28 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048):                     32750 runs -   305.36 us/run - 805.31 MFLOP/run -   2.64 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1,mul=0):    32625 runs -   306.89 us/run - 805.31 MFLOP/run -   2.62 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048):                     23877 runs -   419.24 us/run -   1.61 GFLOP/run -   3.84 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1,mul=0):    23814 runs -   420.34 us/run -   1.61 GFLOP/run -   3.83 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048):                    14784 runs -   676.45 us/run -   3.22 GFLOP/run -   4.76 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1,mul=0):   14816 runs -   675.55 us/run -   3.22 GFLOP/run -   4.77 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048):                    13792 runs -   725.28 us/run -   6.44 GFLOP/run -   8.88 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1,mul=0):   13808 runs -   724.79 us/run -   6.44 GFLOP/run -   8.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048):                    12872 runs -   776.98 us/run -  12.88 GFLOP/run -  16.58 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1,mul=0):   12880 runs -   776.71 us/run -  12.88 GFLOP/run -  16.59 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048):                     677794 runs -    14.81 us/run -  29.36 MFLOP/run -   1.98 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1,mul=0):    674388 runs -    14.83 us/run -  29.36 MFLOP/run -   1.98 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048):                     187440 runs -    53.40 us/run - 117.44 MFLOP/run -   2.20 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1,mul=0):    191700 runs -    52.20 us/run - 117.44 MFLOP/run -   2.25 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048):                      85200 runs -   117.83 us/run - 234.88 MFLOP/run -   1.99 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1,mul=0):     78384 runs -   127.81 us/run - 234.88 MFLOP/run -   1.84 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048):                     52430 runs -   191.01 us/run - 939.52 MFLOP/run -   4.92 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1,mul=0):    52323 runs -   191.32 us/run - 939.52 MFLOP/run -   4.91 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048):                     39420 runs -   253.88 us/run -   1.88 GFLOP/run -   7.40 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1,mul=0):    39366 runs -   254.29 us/run -   1.88 GFLOP/run -   7.39 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048):                    24705 runs -   405.12 us/run -   3.76 GFLOP/run -   9.28 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1,mul=0):   24678 runs -   405.26 us/run -   3.76 GFLOP/run -   9.27 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048):                    22456 runs -   445.49 us/run -   7.52 GFLOP/run -  16.87 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1,mul=0):   22484 runs -   445.01 us/run -   7.52 GFLOP/run -  16.89 TFLOPS
  MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048):                    20930 runs -   477.92 us/run -  15.03 GFLOP/run -  31.45 TFLOPS
  MUL_MAT_ID_FUSION(type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1,mul=0):   20881 runs -   479.05 us/run -  15.03 GFLOP/run -  31.38 TFLOPS

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   293940 runs -    34.07 us/run - 117.44 MFLOP/run -   3.45 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   265398 runs -    37.72 us/run - 234.88 MFLOP/run -   6.23 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   215272 runs -    46.51 us/run - 352.32 MFLOP/run -   7.58 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   181050 runs -    55.29 us/run - 469.76 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   163476 runs -    61.20 us/run - 587.20 MFLOP/run -   9.60 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):   114597 runs -    87.30 us/run - 939.52 MFLOP/run -  10.76 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):  15206 runs -   657.70 us/run -  60.13 GFLOP/run -  91.42 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Feb 14, 2026
Commit: cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

@dfriehs changed the title from "cuda: optimize iq2xxs dequantization" to "cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization" on Feb 14, 2026

Commit: cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
@BrickBee commented:

These optimizations might mitigate #8760 a little bit.

Commit: uint -> uint32_t (fixes `error: identifier "uint" is undefined`)
@am17an am17an merged commit 27b93cb into ggml-org:master Feb 15, 2026
72 of 73 checks passed
@dfriehs dfriehs deleted the iq2xxs-cuda branch February 15, 2026 23:05
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026