
CUDA: use mmvq for mul-mat-id for small batch sizes #18958

Merged
am17an merged 4 commits into ggml-org:master from am17an:mmid-vec on Feb 3, 2026
Conversation

@am17an
Contributor

am17an commented Jan 20, 2026

Currently, for batch sizes > 1 we immediately move to mmq, which is suboptimal for small batch sizes. This brings the performance of the batched bench in line (previously there was a dip at n_tokens = 2).
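
The change is essentially a dispatch tweak. As a minimal sketch of the idea (the function, constant, and branch comments below are invented for illustration, not the actual llama.cpp identifiers, and the real cutoff logic in ggml's CUDA backend is more involved):

```cpp
// Sketch only: keep MUL_MAT_ID on the matrix-vector kernel (mmvq) while the
// batch stays small, instead of jumping straight to the tiled mmq kernel as
// master does for any batch size > 1.
constexpr int MMVQ_MAX_BATCH_SIZE = 8; // hypothetical cutoff

void mul_mat_id_dispatch(int n_tokens, bool src0_is_quantized) {
    if (src0_is_quantized && n_tokens <= MMVQ_MAX_BATCH_SIZE) {
        // mmvq: one thread block per (expert, dst column); with MoE routing
        // each expert only sees a handful of tokens, so the vector kernel
        // stays well occupied while mmq's tile setup cost would dominate.
    } else {
        // mmq: tiled kernel whose setup cost amortizes over large batches
        // (prompt processing).
    }
}
```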

Micro-benchmark for test-backend-ops

| Backend | GGML op | Op parameters | TFLOPS master | TFLOPS mmid-vec | Speedup |
| --- | --- | --- | --- | --- | --- |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 | 4.61 | 4.62 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 | 2.34 | 6.13 | 2.62 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 | 4.27 | 6.83 | 1.60 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 | 5.49 | 5.49 | 1.00 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 | 3.37 | 6.37 | 1.89 |
| CUDA0 | MUL_MAT_ID | type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 | 6.57 | 7.23 | 1.10 |

@github-actions bot added the Nvidia GPU and ggml labels on Jan 20, 2026
@JohannesGaessler
Contributor

Performance changes

| GPU | Model | Microbatch size | Test | t/s b7779 | t/s b5aa3ab | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| MI60 / MI50 | granitemoe 3B BF16 | 1 | pp512 | 134.96 | 143.24 | 1.06 |
| MI60 / MI50 | granitemoe 3B BF16 | 2 | pp512 | 6.94 | 213.29 | 30.76 |
| MI60 / MI50 | granitemoe 3B BF16 | 3 | pp512 | 8.62 | 250.88 | 29.11 |
| MI60 / MI50 | granitemoe 3B BF16 | 4 | pp512 | 10.28 | 270.96 | 26.36 |
| MI60 / MI50 | granitemoe 3B BF16 | 5 | pp512 | 11.54 | 269.68 | 23.37 |
| MI60 / MI50 | granitemoe 3B BF16 | 6 | pp512 | 13.17 | 280.09 | 21.26 |
| MI60 / MI50 | granitemoe 3B BF16 | 7 | pp512 | 14.54 | 289.48 | 19.91 |
| MI60 / MI50 | granitemoe 3B BF16 | 8 | pp512 | 15.87 | 296.46 | 18.68 |
| MI60 / MI50 | granitemoe 3B F16 | 1 | pp512 | 136.10 | 143.15 | 1.05 |
| MI60 / MI50 | granitemoe 3B F16 | 2 | pp512 | 35.62 | 212.24 | 5.96 |
| MI60 / MI50 | granitemoe 3B F16 | 3 | pp512 | 45.50 | 248.99 | 5.47 |
| MI60 / MI50 | granitemoe 3B F16 | 4 | pp512 | 54.87 | 269.38 | 4.91 |
| MI60 / MI50 | granitemoe 3B F16 | 5 | pp512 | 61.50 | 270.27 | 4.39 |
| MI60 / MI50 | granitemoe 3B F16 | 6 | pp512 | 70.65 | 279.77 | 3.96 |
| MI60 / MI50 | granitemoe 3B F16 | 7 | pp512 | 77.91 | 287.59 | 3.69 |
| MI60 / MI50 | granitemoe 3B F16 | 8 | pp512 | 85.06 | 293.95 | 3.46 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 1 | pp512 | 164.53 | 165.68 | 1.01 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 2 | pp512 | 231.26 | 244.19 | 1.06 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 3 | pp512 | 302.05 | 283.37 | 0.94 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 4 | pp512 | 398.88 | 332.24 | 0.83 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 5 | pp512 | 472.45 | 358.40 | 0.76 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 6 | pp512 | 509.58 | 363.92 | 0.71 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 7 | pp512 | 578.75 | 384.72 | 0.66 |
| MI60 / MI50 | granitemoe 3B Q4_0 | 8 | pp512 | 680.55 | 412.97 | 0.61 |
| MI60 / MI50 | granitemoe 3B all F32 | 1 | pp512 | 114.31 | 119.75 | 1.05 |
| MI60 / MI50 | granitemoe 3B all F32 | 2 | pp512 | 92.72 | 170.34 | 1.84 |
| MI60 / MI50 | granitemoe 3B all F32 | 3 | pp512 | 118.62 | 195.39 | 1.65 |
| MI60 / MI50 | granitemoe 3B all F32 | 4 | pp512 | 136.61 | 209.57 | 1.53 |
| MI60 / MI50 | granitemoe 3B all F32 | 5 | pp512 | 141.34 | 208.63 | 1.48 |
| MI60 / MI50 | granitemoe 3B all F32 | 6 | pp512 | 152.85 | 215.89 | 1.41 |
| MI60 / MI50 | granitemoe 3B all F32 | 7 | pp512 | 160.48 | 220.84 | 1.38 |
| MI60 / MI50 | granitemoe 3B all F32 | 8 | pp512 | 169.94 | 225.70 | 1.33 |
| MI100 | granitemoe 3B BF16 | 1 | pp512 | 144.94 | 152.45 | 1.05 |
| MI100 | granitemoe 3B BF16 | 2 | pp512 | 69.46 | 238.99 | 3.44 |
| MI100 | granitemoe 3B BF16 | 3 | pp512 | 90.32 | 292.94 | 3.24 |
| MI100 | granitemoe 3B BF16 | 4 | pp512 | 91.70 | 202.66 | 2.21 |
| MI100 | granitemoe 3B BF16 | 5 | pp512 | 105.72 | 231.73 | 2.19 |
| MI100 | granitemoe 3B BF16 | 6 | pp512 | 123.99 | 258.94 | 2.09 |
| MI100 | granitemoe 3B BF16 | 7 | pp512 | 139.64 | 282.60 | 2.02 |
| MI100 | granitemoe 3B BF16 | 8 | pp512 | 154.74 | 304.34 | 1.97 |
| MI100 | granitemoe 3B F16 | 1 | pp512 | 144.26 | 154.29 | 1.07 |
| MI100 | granitemoe 3B F16 | 2 | pp512 | 101.08 | 240.63 | 2.38 |
| MI100 | granitemoe 3B F16 | 3 | pp512 | 132.90 | 281.87 | 2.12 |
| MI100 | granitemoe 3B F16 | 4 | pp512 | 134.96 | 225.44 | 1.67 |
| MI100 | granitemoe 3B F16 | 5 | pp512 | 159.19 | 258.03 | 1.62 |
| MI100 | granitemoe 3B F16 | 6 | pp512 | 185.34 | 286.71 | 1.55 |
| MI100 | granitemoe 3B F16 | 7 | pp512 | 210.03 | 311.45 | 1.48 |
| MI100 | granitemoe 3B F16 | 8 | pp512 | 234.24 | 333.32 | 1.42 |
| MI100 | granitemoe 3B Q4_0 | 1 | pp512 | 165.46 | 168.97 | 1.02 |
| MI100 | granitemoe 3B Q4_0 | 2 | pp512 | 188.11 | 261.95 | 1.39 |
| MI100 | granitemoe 3B Q4_0 | 3 | pp512 | 269.29 | 343.39 | 1.28 |
| MI100 | granitemoe 3B Q4_0 | 4 | pp512 | 232.55 | 258.33 | 1.11 |
| MI100 | granitemoe 3B Q4_0 | 5 | pp512 | 278.31 | 298.64 | 1.07 |
| MI100 | granitemoe 3B Q4_0 | 6 | pp512 | 327.68 | 333.06 | 1.02 |
| MI100 | granitemoe 3B Q4_0 | 7 | pp512 | 376.23 | 365.02 | 0.97 |
| MI100 | granitemoe 3B Q4_0 | 8 | pp512 | 423.60 | 396.70 | 0.94 |
| MI100 | granitemoe 3B all F32 | 1 | pp512 | 126.25 | 134.82 | 1.07 |
| MI100 | granitemoe 3B all F32 | 2 | pp512 | 99.91 | 203.91 | 2.04 |
| MI100 | granitemoe 3B all F32 | 3 | pp512 | 128.77 | 239.54 | 1.86 |
| MI100 | granitemoe 3B all F32 | 4 | pp512 | 70.31 | 141.95 | 2.02 |
| MI100 | granitemoe 3B all F32 | 5 | pp512 | 73.12 | 162.49 | 2.22 |
| MI100 | granitemoe 3B all F32 | 6 | pp512 | 77.69 | 181.22 | 2.33 |
| MI100 | granitemoe 3B all F32 | 7 | pp512 | 82.58 | 197.99 | 2.40 |
| MI100 | granitemoe 3B all F32 | 8 | pp512 | 88.01 | 213.12 | 2.42 |
| P40 | granitemoe 3B BF16 | 1 | pp512 | 84.42 | 87.05 | 1.03 |
| P40 | granitemoe 3B BF16 | 2 | pp512 | 23.74 | 104.28 | 4.39 |
| P40 | granitemoe 3B BF16 | 3 | pp512 | 26.11 | 111.97 | 4.29 |
| P40 | granitemoe 3B BF16 | 4 | pp512 | 28.59 | 117.20 | 4.10 |
| P40 | granitemoe 3B BF16 | 5 | pp512 | 30.94 | 113.09 | 3.65 |
| P40 | granitemoe 3B BF16 | 6 | pp512 | 33.38 | 119.97 | 3.59 |
| P40 | granitemoe 3B BF16 | 7 | pp512 | 36.16 | 120.69 | 3.34 |
| P40 | granitemoe 3B BF16 | 8 | pp512 | 38.80 | 123.66 | 3.19 |
| P40 | granitemoe 3B F16 | 1 | pp512 | 84.13 | 87.41 | 1.04 |
| P40 | granitemoe 3B F16 | 2 | pp512 | 89.15 | 105.53 | 1.18 |
| P40 | granitemoe 3B F16 | 3 | pp512 | 110.00 | 113.06 | 1.03 |
| P40 | granitemoe 3B F16 | 4 | pp512 | 120.27 | 116.00 | 0.96 |
| P40 | granitemoe 3B F16 | 5 | pp512 | 124.36 | 121.37 | 0.98 |
| P40 | granitemoe 3B F16 | 6 | pp512 | 128.24 | 120.98 | 0.94 |
| P40 | granitemoe 3B F16 | 7 | pp512 | 137.79 | 121.79 | 0.88 |
| P40 | granitemoe 3B F16 | 8 | pp512 | 137.40 | 125.83 | 0.92 |
| P40 | granitemoe 3B Q4_0 | 1 | pp512 | 145.25 | 144.19 | 0.99 |
| P40 | granitemoe 3B Q4_0 | 2 | pp512 | 237.24 | 196.94 | 0.83 |
| P40 | granitemoe 3B Q4_0 | 3 | pp512 | 330.54 | 236.02 | 0.71 |
| P40 | granitemoe 3B Q4_0 | 4 | pp512 | 390.47 | 249.16 | 0.64 |
| P40 | granitemoe 3B Q4_0 | 5 | pp512 | 430.85 | 254.75 | 0.59 |
| P40 | granitemoe 3B Q4_0 | 6 | pp512 | 497.00 | 267.08 | 0.54 |
| P40 | granitemoe 3B Q4_0 | 7 | pp512 | 556.73 | 274.36 | 0.49 |
| P40 | granitemoe 3B Q4_0 | 8 | pp512 | 603.32 | 282.71 | 0.47 |
| P40 | granitemoe 3B all F32 | 1 | pp512 | 66.99 | 68.02 | 1.02 |
| P40 | granitemoe 3B all F32 | 2 | pp512 | 76.67 | 84.01 | 1.10 |
| P40 | granitemoe 3B all F32 | 3 | pp512 | 96.89 | 92.11 | 0.95 |
| P40 | granitemoe 3B all F32 | 4 | pp512 | 119.47 | 98.52 | 0.82 |
| P40 | granitemoe 3B all F32 | 5 | pp512 | 117.97 | 96.43 | 0.82 |
| P40 | granitemoe 3B all F32 | 6 | pp512 | 131.47 | 99.41 | 0.76 |
| P40 | granitemoe 3B all F32 | 7 | pp512 | 143.92 | 101.80 | 0.71 |
| P40 | granitemoe 3B all F32 | 8 | pp512 | 157.09 | 103.78 | 0.66 |
| RTX 3090 | granitemoe 3B BF16 | 1 | pp512 | 244.17 | 245.97 | 1.01 |
| RTX 3090 | granitemoe 3B BF16 | 2 | pp512 | 372.66 | 283.12 | 0.76 |
| RTX 3090 | granitemoe 3B BF16 | 3 | pp512 | 515.53 | 329.23 | 0.64 |
| RTX 3090 | granitemoe 3B BF16 | 4 | pp512 | 637.38 | 352.27 | 0.55 |
| RTX 3090 | granitemoe 3B BF16 | 5 | pp512 | 748.07 | 369.60 | 0.49 |
| RTX 3090 | granitemoe 3B BF16 | 6 | pp512 | 869.35 | 382.79 | 0.44 |
| RTX 3090 | granitemoe 3B BF16 | 7 | pp512 | 963.74 | 390.90 | 0.41 |
| RTX 3090 | granitemoe 3B BF16 | 8 | pp512 | 1079.59 | 399.63 | 0.37 |
| RTX 3090 | granitemoe 3B F16 | 1 | pp512 | 243.83 | 244.87 | 1.00 |
| RTX 3090 | granitemoe 3B F16 | 2 | pp512 | 372.10 | 282.17 | 0.76 |
| RTX 3090 | granitemoe 3B F16 | 3 | pp512 | 515.08 | 325.96 | 0.63 |
| RTX 3090 | granitemoe 3B F16 | 4 | pp512 | 638.78 | 349.84 | 0.55 |
| RTX 3090 | granitemoe 3B F16 | 5 | pp512 | 748.33 | 367.71 | 0.49 |
| RTX 3090 | granitemoe 3B F16 | 6 | pp512 | 865.57 | 379.64 | 0.44 |
| RTX 3090 | granitemoe 3B F16 | 7 | pp512 | 963.40 | 390.02 | 0.40 |
| RTX 3090 | granitemoe 3B F16 | 8 | pp512 | 1084.06 | 399.93 | 0.37 |
| RTX 3090 | granitemoe 3B Q4_0 | 1 | pp512 | 375.22 | 373.35 | 1.00 |
| RTX 3090 | granitemoe 3B Q4_0 | 2 | pp512 | 341.56 | 450.10 | 1.32 |
| RTX 3090 | granitemoe 3B Q4_0 | 3 | pp512 | 485.98 | 570.16 | 1.17 |
| RTX 3090 | granitemoe 3B Q4_0 | 4 | pp512 | 623.42 | 651.51 | 1.05 |
| RTX 3090 | granitemoe 3B Q4_0 | 5 | pp512 | 750.58 | 713.64 | 0.95 |
| RTX 3090 | granitemoe 3B Q4_0 | 6 | pp512 | 884.41 | 759.23 | 0.86 |
| RTX 3090 | granitemoe 3B Q4_0 | 7 | pp512 | 999.83 | 799.83 | 0.80 |
| RTX 3090 | granitemoe 3B Q4_0 | 8 | pp512 | 1119.81 | 827.16 | 0.74 |
| RTX 3090 | granitemoe 3B all F32 | 1 | pp512 | 181.56 | 186.22 | 1.03 |
| RTX 3090 | granitemoe 3B all F32 | 2 | pp512 | 261.19 | 218.01 | 0.83 |
| RTX 3090 | granitemoe 3B all F32 | 3 | pp512 | 343.24 | 246.67 | 0.72 |
| RTX 3090 | granitemoe 3B all F32 | 4 | pp512 | 418.12 | 262.21 | 0.63 |
| RTX 3090 | granitemoe 3B all F32 | 5 | pp512 | 487.32 | 275.33 | 0.56 |
| RTX 3090 | granitemoe 3B all F32 | 6 | pp512 | 566.42 | 285.78 | 0.50 |
| RTX 3090 | granitemoe 3B all F32 | 7 | pp512 | 631.67 | 293.10 | 0.46 |
| RTX 3090 | granitemoe 3B all F32 | 8 | pp512 | 705.24 | 299.30 | 0.42 |
| RTX 4090 | granitemoe 3B BF16 | 1 | pp512 | 328.20 | 329.76 | 1.00 |
| RTX 4090 | granitemoe 3B BF16 | 2 | pp512 | 446.61 | 423.29 | 0.95 |
| RTX 4090 | granitemoe 3B BF16 | 3 | pp512 | 607.94 | 533.94 | 0.88 |
| RTX 4090 | granitemoe 3B BF16 | 4 | pp512 | 751.40 | 611.98 | 0.81 |
| RTX 4090 | granitemoe 3B BF16 | 5 | pp512 | 881.06 | 676.83 | 0.77 |
| RTX 4090 | granitemoe 3B BF16 | 6 | pp512 | 1026.95 | 733.53 | 0.71 |
| RTX 4090 | granitemoe 3B BF16 | 7 | pp512 | 1143.58 | 777.49 | 0.68 |
| RTX 4090 | granitemoe 3B BF16 | 8 | pp512 | 1285.60 | 822.06 | 0.64 |
| RTX 4090 | granitemoe 3B F16 | 1 | pp512 | 328.24 | 329.84 | 1.00 |
| RTX 4090 | granitemoe 3B F16 | 2 | pp512 | 448.52 | 422.58 | 0.94 |
| RTX 4090 | granitemoe 3B F16 | 3 | pp512 | 610.42 | 533.40 | 0.87 |
| RTX 4090 | granitemoe 3B F16 | 4 | pp512 | 755.08 | 612.63 | 0.81 |
| RTX 4090 | granitemoe 3B F16 | 5 | pp512 | 884.99 | 677.42 | 0.77 |
| RTX 4090 | granitemoe 3B F16 | 6 | pp512 | 1030.24 | 735.53 | 0.71 |
| RTX 4090 | granitemoe 3B F16 | 7 | pp512 | 1146.70 | 777.86 | 0.68 |
| RTX 4090 | granitemoe 3B F16 | 8 | pp512 | 1290.59 | 823.54 | 0.64 |
| RTX 4090 | granitemoe 3B Q4_0 | 1 | pp512 | 486.05 | 483.75 | 1.00 |
| RTX 4090 | granitemoe 3B Q4_0 | 2 | pp512 | 424.63 | 501.16 | 1.18 |
| RTX 4090 | granitemoe 3B Q4_0 | 3 | pp512 | 636.29 | 748.53 | 1.18 |
| RTX 4090 | granitemoe 3B Q4_0 | 4 | pp512 | 783.78 | 913.47 | 1.17 |
| RTX 4090 | granitemoe 3B Q4_0 | 5 | pp512 | 977.31 | 1095.77 | 1.12 |
| RTX 4090 | granitemoe 3B Q4_0 | 6 | pp512 | 1169.63 | 1221.42 | 1.04 |
| RTX 4090 | granitemoe 3B Q4_0 | 7 | pp512 | 1358.48 | 1314.91 | 0.97 |
| RTX 4090 | granitemoe 3B Q4_0 | 8 | pp512 | 1558.82 | 1417.97 | 0.91 |
| RTX 4090 | granitemoe 3B all F32 | 1 | pp512 | 215.50 | 215.42 | 1.00 |
| RTX 4090 | granitemoe 3B all F32 | 2 | pp512 | 303.17 | 296.95 | 0.98 |
| RTX 4090 | granitemoe 3B all F32 | 3 | pp512 | 401.96 | 377.43 | 0.94 |
| RTX 4090 | granitemoe 3B all F32 | 4 | pp512 | 485.87 | 433.06 | 0.89 |
| RTX 4090 | granitemoe 3B all F32 | 5 | pp512 | 566.45 | 482.90 | 0.85 |
| RTX 4090 | granitemoe 3B all F32 | 6 | pp512 | 656.69 | 530.09 | 0.81 |
| RTX 4090 | granitemoe 3B all F32 | 7 | pp512 | 731.40 | 566.45 | 0.77 |
| RTX 4090 | granitemoe 3B all F32 | 8 | pp512 | 815.15 | 603.73 | 0.74 |
| RTX 5090 | granitemoe 3B BF16 | 1 | pp512 | 419.26 | 423.81 | 1.01 |
| RTX 5090 | granitemoe 3B BF16 | 2 | pp512 | 470.06 | 440.56 | 0.94 |
| RTX 5090 | granitemoe 3B BF16 | 3 | pp512 | 665.04 | 575.22 | 0.86 |
| RTX 5090 | granitemoe 3B BF16 | 4 | pp512 | 829.97 | 663.65 | 0.80 |
| RTX 5090 | granitemoe 3B BF16 | 5 | pp512 | 987.48 | 742.96 | 0.75 |
| RTX 5090 | granitemoe 3B BF16 | 6 | pp512 | 1154.25 | 814.24 | 0.71 |
| RTX 5090 | granitemoe 3B BF16 | 7 | pp512 | 1285.64 | 863.00 | 0.67 |
| RTX 5090 | granitemoe 3B BF16 | 8 | pp512 | 1448.14 | 924.20 | 0.64 |
| RTX 5090 | granitemoe 3B F16 | 1 | pp512 | 413.35 | 416.66 | 1.01 |
| RTX 5090 | granitemoe 3B F16 | 2 | pp512 | 469.90 | 431.22 | 0.92 |
| RTX 5090 | granitemoe 3B F16 | 3 | pp512 | 664.68 | 557.24 | 0.84 |
| RTX 5090 | granitemoe 3B F16 | 4 | pp512 | 829.88 | 641.19 | 0.77 |
| RTX 5090 | granitemoe 3B F16 | 5 | pp512 | 988.43 | 715.45 | 0.72 |
| RTX 5090 | granitemoe 3B F16 | 6 | pp512 | 1154.50 | 780.47 | 0.68 |
| RTX 5090 | granitemoe 3B F16 | 7 | pp512 | 1285.19 | 824.85 | 0.64 |
| RTX 5090 | granitemoe 3B F16 | 8 | pp512 | 1450.36 | 880.35 | 0.61 |
| RTX 5090 | granitemoe 3B Q4_0 | 1 | pp512 | 556.70 | 556.34 | 1.00 |
| RTX 5090 | granitemoe 3B Q4_0 | 2 | pp512 | 407.71 | 468.81 | 1.15 |
| RTX 5090 | granitemoe 3B Q4_0 | 3 | pp512 | 609.92 | 709.56 | 1.16 |
| RTX 5090 | granitemoe 3B Q4_0 | 4 | pp512 | 752.45 | 861.40 | 1.14 |
| RTX 5090 | granitemoe 3B Q4_0 | 5 | pp512 | 935.04 | 1018.52 | 1.09 |
| RTX 5090 | granitemoe 3B Q4_0 | 6 | pp512 | 1119.42 | 1138.42 | 1.02 |
| RTX 5090 | granitemoe 3B Q4_0 | 7 | pp512 | 1300.09 | 1229.16 | 0.95 |
| RTX 5090 | granitemoe 3B Q4_0 | 8 | pp512 | 1501.00 | 1347.51 | 0.90 |
| RTX 5090 | granitemoe 3B all F32 | 1 | pp512 | 317.18 | 318.17 | 1.00 |
| RTX 5090 | granitemoe 3B all F32 | 2 | pp512 | 379.19 | 370.54 | 0.98 |
| RTX 5090 | granitemoe 3B all F32 | 3 | pp512 | 495.65 | 455.88 | 0.92 |
| RTX 5090 | granitemoe 3B all F32 | 4 | pp512 | 598.07 | 518.37 | 0.87 |
| RTX 5090 | granitemoe 3B all F32 | 5 | pp512 | 701.20 | 583.92 | 0.83 |
| RTX 5090 | granitemoe 3B all F32 | 6 | pp512 | 816.95 | 646.94 | 0.79 |
| RTX 5090 | granitemoe 3B all F32 | 7 | pp512 | 907.56 | 687.55 | 0.76 |
| RTX 5090 | granitemoe 3B all F32 | 8 | pp512 | 1019.58 | 738.31 | 0.72 |
| RX 6800 | granitemoe 3B BF16 | 1 | pp512 | 74.73 | 76.42 | 1.02 |
| RX 6800 | granitemoe 3B BF16 | 2 | pp512 | 4.99 | 99.98 | 20.04 |
| RX 6800 | granitemoe 3B BF16 | 3 | pp512 | 6.20 | 114.43 | 18.45 |
| RX 6800 | granitemoe 3B BF16 | 4 | pp512 | 7.39 | 124.28 | 16.82 |
| RX 6800 | granitemoe 3B BF16 | 5 | pp512 | 8.32 | 126.49 | 15.21 |
| RX 6800 | granitemoe 3B BF16 | 6 | pp512 | 9.52 | 132.27 | 13.90 |
| RX 6800 | granitemoe 3B BF16 | 7 | pp512 | 10.50 | 136.14 | 12.97 |
| RX 6800 | granitemoe 3B BF16 | 8 | pp512 | 11.44 | 139.26 | 12.17 |
| RX 6800 | granitemoe 3B F16 | 1 | pp512 | 74.34 | 76.53 | 1.03 |
| RX 6800 | granitemoe 3B F16 | 2 | pp512 | 25.61 | 100.06 | 3.91 |
| RX 6800 | granitemoe 3B F16 | 3 | pp512 | 32.88 | 114.07 | 3.47 |
| RX 6800 | granitemoe 3B F16 | 4 | pp512 | 39.52 | 124.13 | 3.14 |
| RX 6800 | granitemoe 3B F16 | 5 | pp512 | 44.81 | 126.39 | 2.82 |
| RX 6800 | granitemoe 3B F16 | 6 | pp512 | 51.80 | 132.06 | 2.55 |
| RX 6800 | granitemoe 3B F16 | 7 | pp512 | 56.77 | 135.95 | 2.39 |
| RX 6800 | granitemoe 3B F16 | 8 | pp512 | 62.33 | 139.33 | 2.24 |
| RX 6800 | granitemoe 3B Q4_0 | 1 | pp512 | 129.16 | 132.50 | 1.03 |
| RX 6800 | granitemoe 3B Q4_0 | 2 | pp512 | 158.99 | 219.97 | 1.38 |
| RX 6800 | granitemoe 3B Q4_0 | 3 | pp512 | 228.05 | 301.62 | 1.32 |
| RX 6800 | granitemoe 3B Q4_0 | 4 | pp512 | 296.05 | 373.80 | 1.26 |
| RX 6800 | granitemoe 3B Q4_0 | 5 | pp512 | 342.11 | 406.04 | 1.19 |
| RX 6800 | granitemoe 3B Q4_0 | 6 | pp512 | 401.56 | 454.70 | 1.13 |
| RX 6800 | granitemoe 3B Q4_0 | 7 | pp512 | 459.62 | 500.37 | 1.09 |
| RX 6800 | granitemoe 3B Q4_0 | 8 | pp512 | 519.00 | 542.98 | 1.05 |
| RX 6800 | granitemoe 3B all F32 | 1 | pp512 | 68.91 | 70.81 | 1.03 |
| RX 6800 | granitemoe 3B all F32 | 2 | pp512 | 57.24 | 95.16 | 1.66 |
| RX 6800 | granitemoe 3B all F32 | 3 | pp512 | 72.11 | 109.77 | 1.52 |
| RX 6800 | granitemoe 3B all F32 | 4 | pp512 | 84.42 | 119.40 | 1.41 |
| RX 6800 | granitemoe 3B all F32 | 5 | pp512 | 92.97 | 121.90 | 1.31 |
| RX 6800 | granitemoe 3B all F32 | 6 | pp512 | 103.74 | 128.02 | 1.23 |
| RX 6800 | granitemoe 3B all F32 | 7 | pp512 | 111.66 | 132.05 | 1.18 |
| RX 6800 | granitemoe 3B all F32 | 8 | pp512 | 118.87 | 135.47 | 1.14 |
| RX 9060 XT | granitemoe 3B BF16 | 1 | pp512 | 78.91 | 79.34 | 1.01 |
| RX 9060 XT | granitemoe 3B BF16 | 2 | pp512 | 132.92 | 111.79 | 0.84 |
| RX 9060 XT | granitemoe 3B BF16 | 3 | pp512 | 176.86 | 131.26 | 0.74 |
| RX 9060 XT | granitemoe 3B BF16 | 4 | pp512 | 223.96 | 146.28 | 0.65 |
| RX 9060 XT | granitemoe 3B BF16 | 5 | pp512 | 257.54 | 157.11 | 0.61 |
| RX 9060 XT | granitemoe 3B BF16 | 6 | pp512 | 299.75 | 165.37 | 0.55 |
| RX 9060 XT | granitemoe 3B BF16 | 7 | pp512 | 333.61 | 172.61 | 0.52 |
| RX 9060 XT | granitemoe 3B BF16 | 8 | pp512 | 352.73 | 174.43 | 0.49 |
| RX 9060 XT | granitemoe 3B F16 | 1 | pp512 | 79.28 | 79.98 | 1.01 |
| RX 9060 XT | granitemoe 3B F16 | 2 | pp512 | 135.00 | 112.68 | 0.83 |
| RX 9060 XT | granitemoe 3B F16 | 3 | pp512 | 183.01 | 134.28 | 0.73 |
| RX 9060 XT | granitemoe 3B F16 | 4 | pp512 | 225.71 | 148.10 | 0.66 |
| RX 9060 XT | granitemoe 3B F16 | 5 | pp512 | 255.35 | 157.53 | 0.62 |
| RX 9060 XT | granitemoe 3B F16 | 6 | pp512 | 303.51 | 167.23 | 0.55 |
| RX 9060 XT | granitemoe 3B F16 | 7 | pp512 | 337.65 | 174.86 | 0.52 |
| RX 9060 XT | granitemoe 3B F16 | 8 | pp512 | 358.41 | 176.62 | 0.49 |
| RX 9060 XT | granitemoe 3B Q4_0 | 1 | pp512 | 134.05 | 133.43 | 1.00 |
| RX 9060 XT | granitemoe 3B Q4_0 | 2 | pp512 | 228.09 | 254.33 | 1.12 |
| RX 9060 XT | granitemoe 3B Q4_0 | 3 | pp512 | 318.55 | 350.82 | 1.10 |
| RX 9060 XT | granitemoe 3B Q4_0 | 4 | pp512 | 408.48 | 431.06 | 1.06 |
| RX 9060 XT | granitemoe 3B Q4_0 | 5 | pp512 | 473.50 | 487.79 | 1.03 |
| RX 9060 XT | granitemoe 3B Q4_0 | 6 | pp512 | 554.07 | 549.99 | 0.99 |
| RX 9060 XT | granitemoe 3B Q4_0 | 7 | pp512 | 636.28 | 600.81 | 0.94 |
| RX 9060 XT | granitemoe 3B Q4_0 | 8 | pp512 | 651.52 | 562.73 | 0.86 |
| RX 9060 XT | granitemoe 3B all F32 | 1 | pp512 | 69.74 | 69.75 | 1.00 |
| RX 9060 XT | granitemoe 3B all F32 | 2 | pp512 | 54.27 | 95.12 | 1.75 |
| RX 9060 XT | granitemoe 3B all F32 | 3 | pp512 | 72.82 | 113.13 | 1.55 |
| RX 9060 XT | granitemoe 3B all F32 | 4 | pp512 | 88.60 | 125.81 | 1.42 |
| RX 9060 XT | granitemoe 3B all F32 | 5 | pp512 | 93.66 | 131.80 | 1.41 |
| RX 9060 XT | granitemoe 3B all F32 | 6 | pp512 | 104.51 | 139.97 | 1.34 |
| RX 9060 XT | granitemoe 3B all F32 | 7 | pp512 | 106.92 | 137.02 | 1.28 |
| RX 9060 XT | granitemoe 3B all F32 | 8 | pp512 | 107.60 | 139.23 | 1.29 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 1 | pp512 | 168.58 | 170.46 | 1.01 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 2 | pp512 | 19.11 | 218.00 | 11.41 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 3 | pp512 | 20.53 | 252.12 | 12.28 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 4 | pp512 | 22.31 | 265.31 | 11.89 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 5 | pp512 | 24.26 | 284.10 | 11.71 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 6 | pp512 | 26.24 | 295.77 | 11.27 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 7 | pp512 | 28.30 | 304.54 | 10.76 |
| V100-PCIE-32GB | granitemoe 3B BF16 | 8 | pp512 | 30.32 | 310.83 | 10.25 |
| V100-PCIE-32GB | granitemoe 3B F16 | 1 | pp512 | 161.98 | 168.54 | 1.04 |
| V100-PCIE-32GB | granitemoe 3B F16 | 2 | pp512 | 306.04 | 216.05 | 0.71 |
| V100-PCIE-32GB | granitemoe 3B F16 | 3 | pp512 | 405.23 | 246.33 | 0.61 |
| V100-PCIE-32GB | granitemoe 3B F16 | 4 | pp512 | 540.16 | 279.90 | 0.52 |
| V100-PCIE-32GB | granitemoe 3B F16 | 5 | pp512 | 636.53 | 295.51 | 0.46 |
| V100-PCIE-32GB | granitemoe 3B F16 | 6 | pp512 | 733.81 | 308.19 | 0.42 |
| V100-PCIE-32GB | granitemoe 3B F16 | 7 | pp512 | 819.60 | 317.86 | 0.39 |
| V100-PCIE-32GB | granitemoe 3B F16 | 8 | pp512 | 913.24 | 325.79 | 0.36 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 1 | pp512 | 224.84 | 223.01 | 0.99 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 2 | pp512 | 247.50 | 338.39 | 1.37 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 3 | pp512 | 349.73 | 421.07 | 1.20 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 4 | pp512 | 439.22 | 474.12 | 1.08 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 5 | pp512 | 538.29 | 523.33 | 0.97 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 6 | pp512 | 629.88 | 562.51 | 0.89 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 7 | pp512 | 719.83 | 590.56 | 0.82 |
| V100-PCIE-32GB | granitemoe 3B Q4_0 | 8 | pp512 | 807.63 | 616.28 | 0.76 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 1 | pp512 | 138.54 | 141.46 | 1.02 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 2 | pp512 | 128.84 | 185.18 | 1.44 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 3 | pp512 | 160.33 | 212.91 | 1.33 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 4 | pp512 | 173.94 | 221.40 | 1.27 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 5 | pp512 | 193.03 | 231.50 | 1.20 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 6 | pp512 | 217.12 | 242.16 | 1.12 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 7 | pp512 | 238.77 | 250.86 | 1.05 |
| V100-PCIE-32GB | granitemoe 3B all F32 | 8 | pp512 | 259.83 | 257.44 | 0.99 |

This PR does not provide a universal speedup for <= 8 tokens; please adjust the kernel selection logic to reflect this. Distinguishing FP16 vs. FP32 vs. BF16 vs. quantized should be enough.
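
A hedged sketch of the requested gating (the type names and cutoffs below are illustrative; the only vendor-specific decision confirmed by the commit log is that the mmvf path was later restricted to non-NVIDIA GPUs):

```cpp
// Illustrative only: pick the vector fast path per weight type, since the
// tables above show the float paths regressing on recent NVIDIA GPUs while
// helping older/AMD cards, and quantized weights behaving differently again.
enum class weight_type { f32, f16, bf16, quantized };

bool use_vec_mul_mat_id(weight_type t, int n_tokens, bool is_nvidia) {
    if (n_tokens == 1) {
        return true;  // decode: the vector kernel is the established choice
    }
    if (n_tokens > 8) {
        return false; // larger batches: tiled kernels amortize better
    }
    switch (t) {
        case weight_type::quantized:
            return true;       // mmvq (hypothetical; real cutoffs may be finer)
        default:
            return !is_nvidia; // mmvf only where the tables show a win
    }
}
```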

@JohannesGaessler
Contributor

GPT-OSS

| GPU | Model | Microbatch size | Test | t/s b7779 | t/s b5aa3ab | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 149.41 | 151.62 | 1.01 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 75.09 | 173.13 | 2.31 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 90.72 | 180.21 | 1.99 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 110.20 | 197.62 | 1.79 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 128.02 | 204.53 | 1.60 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 141.84 | 204.71 | 1.44 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 157.25 | 210.66 | 1.34 |
| MI60 / MI50 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 172.01 | 220.44 | 1.28 |
| MI100 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 168.27 | 177.59 | 1.06 |
| MI100 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 121.45 | 202.48 | 1.67 |
| MI100 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 158.65 | 233.83 | 1.47 |
| MI100 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 171.38 | 213.36 | 1.24 |
| MI100 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 199.90 | 232.49 | 1.16 |
| MI100 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 229.06 | 249.74 | 1.09 |
| MI100 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 259.17 | 265.55 | 1.02 |
| MI100 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 283.00 | 278.24 | 0.98 |
| P40 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 79.10 | 78.60 | 0.99 |
| P40 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 111.11 | 120.93 | 1.09 |
| P40 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 146.86 | 136.03 | 0.93 |
| P40 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 171.33 | 140.69 | 0.82 |
| P40 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 199.64 | 144.30 | 0.72 |
| P40 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 223.87 | 147.51 | 0.66 |
| P40 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 248.94 | 150.61 | 0.61 |
| P40 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 267.05 | 153.24 | 0.57 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 260.73 | 247.86 | 0.95 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 204.00 | 291.01 | 1.43 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 278.67 | 335.85 | 1.21 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 341.95 | 357.86 | 1.05 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 401.57 | 369.94 | 0.92 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 455.02 | 379.64 | 0.83 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 520.76 | 393.36 | 0.76 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 552.54 | 394.60 | 0.71 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 331.03 | 329.12 | 0.99 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 289.30 | 403.72 | 1.40 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 400.77 | 514.59 | 1.28 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 498.75 | 588.40 | 1.18 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 588.35 | 644.05 | 1.09 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 667.23 | 690.82 | 1.04 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 774.54 | 749.92 | 0.97 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 831.51 | 776.84 | 0.93 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 466.26 | 463.94 | 1.00 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 339.21 | 458.85 | 1.35 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 473.02 | 592.16 | 1.25 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 587.80 | 677.92 | 1.15 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 696.66 | 755.19 | 1.08 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 803.14 | 826.81 | 1.03 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 926.15 | 894.90 | 0.97 |
| RTX 5090 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 1003.47 | 940.28 | 0.94 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 109.41 | 111.44 | 1.02 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 88.63 | 150.23 | 1.70 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 117.63 | 186.32 | 1.58 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 146.18 | 215.22 | 1.47 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 167.02 | 220.52 | 1.32 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 191.79 | 237.09 | 1.24 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 212.20 | 249.44 | 1.18 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 228.05 | 259.28 | 1.14 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 94.76 | 94.39 | 1.00 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 119.06 | 126.00 | 1.06 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 169.98 | 166.31 | 0.98 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 212.90 | 192.73 | 0.91 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 253.35 | 214.19 | 0.85 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 292.94 | 232.26 | 0.79 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 329.01 | 246.66 | 0.75 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 348.80 | 252.58 | 0.72 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 1 | pp512 | 180.54 | 179.49 | 0.99 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 2 | pp512 | 109.32 | 245.14 | 2.24 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 3 | pp512 | 143.50 | 276.99 | 1.93 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 4 | pp512 | 172.57 | 293.25 | 1.70 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 5 | pp512 | 205.60 | 314.73 | 1.53 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 6 | pp512 | 231.26 | 323.52 | 1.40 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 7 | pp512 | 263.08 | 336.41 | 1.28 |
| V100-PCIE-32GB | gpt-oss 20B MXFP4 MoE | 8 | pp512 | 283.95 | 344.14 | 1.21 |

@am17an
Contributor Author

am17an commented Jan 23, 2026

There was a performance regression for the n=1 (decode) kernel. It's fixed now.

Comment on lines +173 to +175:

```cpp
if constexpr (ncols_dst == 1) {
    sample_dst *= !ids_stride; // sample_dst for ids is 0
}
```

I think we could slightly simplify the code if instead we set parameters like stride_sample_x to 0 in host code if ids != nullptr. This should be explained by a comment in device code and done consistently with MMVF.
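
As a minimal sketch of that suggestion (the struct and parameter names below are invented; the point is only that the branch moves from per-thread device code to one-time host setup):

```cpp
#include <cstdint>

struct launch_params {
    const int32_t * ids;      // expert ids; nullptr for plain MUL_MAT
    int64_t stride_sample_x;  // 0 when ids != nullptr: the sample dimension
                              // is resolved through the id lookup instead
};

// Host side: choose the stride once, instead of multiplying by !ids_stride
// in every thread, and mirror whatever convention MMVF uses.
void init_launch_params(launch_params & p, int64_t nb_sample_x) {
    p.stride_sample_x = (p.ids != nullptr) ? 0 : nb_sample_x;
}
```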

@ggerganov
Member

@am17an
Contributor Author

am17an commented Jan 29, 2026

@ggerganov I'm aware; there is currently still a perf regression for the n=1 case. I'm looking into it and will push once it's fixed.

@am17an
Contributor Author

am17an commented Jan 31, 2026

Should be fixed now.

On a 5090:

```sh
llama-batched-bench -m %models/gpt_oss-20b-mxfp4.gguf -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4
```

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  1024 |     32 |    1 |   1056 |    0.165 |  6212.39 |    0.097 |   329.50 |    0.262 |  4031.30 |
|  1024 |     32 |    2 |   2112 |    0.143 | 14361.85 |    0.184 |   348.44 |    0.326 |  6473.09 |
|  1024 |     32 |    3 |   3168 |    0.211 | 14526.61 |    0.200 |   480.65 |    0.411 |  7704.19 |
|  1024 |     32 |    4 |   4224 |    0.280 | 14644.94 |    0.229 |   558.01 |    0.509 |  8297.43 |

vs master

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  1024 |     32 |    1 |   1056 |    0.165 |  6195.55 |    0.097 |   329.23 |    0.262 |  4023.23 |
|  1024 |     32 |    2 |   2112 |    0.143 | 14352.09 |    0.237 |   269.56 |    0.380 |  5556.08 |
|  1024 |     32 |    3 |   3168 |    0.212 | 14517.55 |    0.242 |   396.96 |    0.453 |  6986.58 |
|  1024 |     32 |    4 |   4224 |    0.280 | 14627.68 |    0.258 |   495.96 |    0.538 |  7849.81 |

@ggerganov I remember you saying that these small batch sizes are useful for agent use cases, but I'm not able to understand why.

@ggerganov
Member

> @ggerganov I remember you saying that these small batch sizes are useful for agent use cases, but I'm not able to understand why.

I think at some point I noticed Claude Code sending requests in parallel. OpenCode does not do it, but I think it's just a matter of time before it starts launching tasks in parallel. You can also run multiple agent sessions at the same time, and this will help in such cases. And, not directly related to agentic coding, this is also useful for tasks with n_cmpl > 1 (e.g. FIM).

@jacekpoplawski
Contributor

jacekpoplawski commented Jan 31, 2026


Not sure if I'm testing correctly:

```sh
./build_2026.01.31_mmid/bin/llama-bench -m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf -fa 1 -p 1000 -n 50 -d 0,10000,20000,30000,40000,50000,60000
```

@am17an
Contributor Author

am17an commented Feb 1, 2026

@jacekpoplawski thanks for reporting. You're not doing anything wrong; this branch just didn't have #19126, which provides the speed-up on master. Rebased now.

@JohannesGaessler
Contributor

JohannesGaessler left a comment

If we're going to add a template specialization anyway, I think it makes more sense to just add a template specialization for MUL_MAT_ID in general. Add a boolean like use_ids and use that instead of the ids pointer in the program logic, as well as to determine how blockIdx should be interpreted.

@am17an
Contributor Author

am17an commented Feb 3, 2026

The problem is that the n_tokens=1 path with ids sees a slowdown, so we would have to add something like this anyway: a template that disambiguates between n = 1 and n > 1.
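
Combining the two points, the specialization under discussion could take roughly this shape (the kernel name and parameters are invented for illustration; the merged "templatize multi_token_path" commit is the authoritative version):

```cuda
#include <cstdint>

// has_ids replaces runtime `ids != nullptr` checks and decides how blockIdx
// is interpreted; ncols_dst keeps a dedicated n=1 decode path so the
// single-token case pays none of the multi-token bookkeeping.
template <int ncols_dst, bool has_ids>
__global__ void mul_mat_vec_id_sketch(const void * x, const float * y,
                                      float * dst, const int32_t * ids) {
    if constexpr (has_ids) {
        // blockIdx.y enumerates (token, used expert) pairs -> expert lookup
    } else {
        // blockIdx.y enumerates plain dst columns
    }
    if constexpr (ncols_dst == 1) {
        // lean decode path: no column loop, fewer registers
    } else {
        // multi-token path: iterate over up to ncols_dst columns
    }
}
```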

@JohannesGaessler
Contributor

JohannesGaessler left a comment
Sorry, I misremembered which model had a performance regression on an RTX 3090. For now I think it's fine to just merge it like this.

am17an merged commit 8bece2e into ggml-org:master on Feb 3, 2026
68 of 78 checks passed
am17an deleted the mmid-vec branch on February 3, 2026 at 15:31
agent-enemy-2 pushed a commit to agent-enemy-2/llama.cpp that referenced this pull request Feb 4, 2026
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

* templatize multi_token_path
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026