Limit kv split when prefill tokens <= 1536 by Edenzzzz · Pull Request #1 · AKKamath/flashinfer

Edenzzzz · 2025-11-12T03:16:06Z

When prefill len <= 1536 in flashinfer-ai#2079, the kv split that is used to saturate SM blocks now takes colocated decode CTAs into account. I benchmarked this on H200: when prefill len == 1536, enabling and disabling the limit give the same performance, but when prefill len < 1536, not limiting lags significantly behind.
cc @AKKamath
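
For readers who have not seen the planner code, here is a rough sketch of the idea, not the actual FlashInfer implementation. The function name `max_kv_splits`, its parameters, and the slot-budget formula are all assumptions made for illustration; only the 1536 threshold and the notion of `num_colocated_ctas` come from this PR.

```python
def max_kv_splits(num_sms, ctas_per_sm, num_prefill_tiles,
                  num_colocated_ctas, prefill_len, threshold=1536):
    """Hypothetical upper bound on the kv split factor for prefill.

    Assumption: the split factor is chosen so that prefill CTAs fill the
    CTA slots on the GPU. For short prefills (<= threshold) the slots
    taken by colocated decode CTAs are subtracted first.
    """
    if num_prefill_tiles < 1:
        raise ValueError("num_prefill_tiles must be >= 1")
    total_slots = num_sms * ctas_per_sm
    if prefill_len <= threshold:
        # Leave room for decode CTAs, but never go below one slot per tile.
        budget = max(total_slots - num_colocated_ctas, num_prefill_tiles)
    else:
        # Longer prefills behave as if num_colocated_ctas were 0.
        budget = total_slots
    return max(1, budget // num_prefill_tiles)


# Example values are illustrative only (132 SMs, one CTA per SM).
print(max_kv_splits(132, 1, 4, 64, prefill_len=1024))  # limited
print(max_kv_splits(132, 1, 4, 0, prefill_len=1024))   # not limited
```

With these made-up numbers the limited case allows fewer splits per prefill tile, which is the effect being benchmarked below.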

Limiting split

===== Benchmark 1: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.33 ms
Elapsed time (Batched POD Attention): 0.32 ms
Elapsed time (POD Attention): 0.33 ms
Elapsed time (Sequential two kernels): 0.34 ms
Elapsed time (Persistent BatchAttention): 0.31 ms
Batch POD speedup over Persistent BatchAttention: 0.97x
Loading memory size (MB): 1031.00 MB
Memory bandwidth (Batched Prefill): 3032.64 GB/s
Memory bandwidth (Batched POD Attention): 3131.02 GB/s
Memory bandwidth (POD Attention): 3072.02 GB/s
Memory bandwidth (Sequential two kernels): 2973.32 GB/s
Memory bandwidth (Persistent BatchAttention): 3236.97 GB/s
===== Benchmark 2: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.38 ms
Elapsed time (Batched POD Attention): 0.34 ms
Elapsed time (POD Attention): 0.35 ms
Elapsed time (Sequential two kernels): 0.41 ms
Elapsed time (Persistent BatchAttention): 0.36 ms
Batch POD speedup over Persistent BatchAttention: 1.05x
Loading memory size (MB): 1043.00 MB
Memory bandwidth (Batched Prefill): 2694.59 GB/s
Memory bandwidth (Batched POD Attention): 2985.91 GB/s
Memory bandwidth (POD Attention): 2941.49 GB/s
Memory bandwidth (Sequential two kernels): 2498.61 GB/s
Memory bandwidth (Persistent BatchAttention): 2833.03 GB/s
===== Benchmark 3: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.44 ms
Elapsed time (Persistent BatchAttention): 0.50 ms
Batch POD speedup over Persistent BatchAttention: 1.15x
Loading memory size (MB): 1073.00 MB
Memory bandwidth (Batched Prefill): 1620.06 GB/s
Memory bandwidth (Batched POD Attention): 2389.65 GB/s
Memory bandwidth (Persistent BatchAttention): 2076.08 GB/s
===== Benchmark 4: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.55 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.36 ms
Elapsed time (Sequential two kernels): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.39 ms
Batch POD speedup over Persistent BatchAttention: 1.11x
Loading memory size (MB): 1049.00 MB
Memory bandwidth (Batched Prefill): 1849.18 GB/s
Memory bandwidth (Batched POD Attention): 2879.38 GB/s
Memory bandwidth (POD Attention): 2843.83 GB/s
Memory bandwidth (Sequential two kernels): 2244.63 GB/s
Memory bandwidth (Persistent BatchAttention): 2600.14 GB/s
===== Benchmark 5: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 1.23 ms
Elapsed time (Batched POD Attention): 0.80 ms
Elapsed time (POD Attention): 0.75 ms
Elapsed time (Sequential two kernels): 1.03 ms
Elapsed time (Persistent BatchAttention): 0.96 ms
Batch POD speedup over Persistent BatchAttention: 1.20x
Loading memory size (MB): 2097.00 MB
Memory bandwidth (Batched Prefill): 1671.60 GB/s
Memory bandwidth (Batched POD Attention): 2558.89 GB/s
Memory bandwidth (POD Attention): 2743.64 GB/s
Memory bandwidth (Sequential two kernels): 1984.41 GB/s
Memory bandwidth (Persistent BatchAttention): 2133.04 GB/s
===== Benchmark 6: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.07 ms
Elapsed time (Batched POD Attention): 1.29 ms
Elapsed time (POD Attention): 1.32 ms
Elapsed time (Sequential two kernels): 1.59 ms
Elapsed time (Persistent BatchAttention): 1.53 ms
Batch POD speedup over Persistent BatchAttention: 1.19x
Loading memory size (MB): 4145.00 MB
Memory bandwidth (Batched Prefill): 1957.42 GB/s
Memory bandwidth (Batched POD Attention): 3146.18 GB/s
Memory bandwidth (POD Attention): 3074.90 GB/s
Memory bandwidth (Sequential two kernels): 2545.10 GB/s
Memory bandwidth (Persistent BatchAttention): 2643.79 GB/s
===== Benchmark 7: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.71 ms
Elapsed time (Batched POD Attention): 1.91 ms
Elapsed time (POD Attention): 1.78 ms
Elapsed time (Sequential two kernels): 2.28 ms
Elapsed time (Persistent BatchAttention): 2.22 ms
Batch POD speedup over Persistent BatchAttention: 1.16x
Loading memory size (MB): 4171.22 MB
Memory bandwidth (Batched Prefill): 1504.75 GB/s
Memory bandwidth (Batched POD Attention): 2133.02 GB/s
Memory bandwidth (POD Attention): 2284.27 GB/s
Memory bandwidth (Sequential two kernels): 1787.11 GB/s
Memory bandwidth (Persistent BatchAttention): 1833.26 GB/s

Not limiting (always set num_colocated_ctas to 0)

===== Benchmark 1: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.33 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.33 ms
Elapsed time (Sequential two kernels): 0.34 ms
Elapsed time (Persistent BatchAttention): 0.31 ms
Batch POD speedup over Persistent BatchAttention: 0.85x
Loading memory size (MB): 1031.00 MB
Memory bandwidth (Batched Prefill): 3036.88 GB/s
Memory bandwidth (Batched POD Attention): 2762.63 GB/s
Memory bandwidth (POD Attention): 3065.43 GB/s
Memory bandwidth (Sequential two kernels): 2958.22 GB/s
Memory bandwidth (Persistent BatchAttention): 3240.90 GB/s
===== Benchmark 2: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.38 ms
Elapsed time (Batched POD Attention): 0.34 ms
Elapsed time (POD Attention): 0.35 ms
Elapsed time (Sequential two kernels): 0.41 ms
Elapsed time (Persistent BatchAttention): 0.36 ms
Batch POD speedup over Persistent BatchAttention: 1.05x
Loading memory size (MB): 1043.00 MB
Memory bandwidth (Batched Prefill): 2680.18 GB/s
Memory bandwidth (Batched POD Attention): 2976.14 GB/s
Memory bandwidth (POD Attention): 2936.33 GB/s
Memory bandwidth (Sequential two kernels): 2495.48 GB/s
Memory bandwidth (Persistent BatchAttention): 2829.31 GB/s
===== Benchmark 3: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.44 ms
Elapsed time (Persistent BatchAttention): 0.51 ms
Batch POD speedup over Persistent BatchAttention: 1.15x
Loading memory size (MB): 1073.00 MB
Memory bandwidth (Batched Prefill): 1611.40 GB/s
Memory bandwidth (Batched POD Attention): 2384.69 GB/s
Memory bandwidth (Persistent BatchAttention): 2074.32 GB/s
===== Benchmark 4: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.54 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.36 ms
Elapsed time (Sequential two kernels): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.40 ms
Batch POD speedup over Persistent BatchAttention: 1.12x
Loading memory size (MB): 1049.00 MB
Memory bandwidth (Batched Prefill): 1891.91 GB/s
Memory bandwidth (Batched POD Attention): 2885.48 GB/s
Memory bandwidth (POD Attention): 2845.09 GB/s
Memory bandwidth (Sequential two kernels): 2237.73 GB/s
Memory bandwidth (Persistent BatchAttention): 2579.54 GB/s
===== Benchmark 5: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 1.25 ms
Elapsed time (Batched POD Attention): 0.81 ms
Elapsed time (POD Attention): 0.74 ms
Elapsed time (Sequential two kernels): 1.03 ms
Elapsed time (Persistent BatchAttention): 0.96 ms
Batch POD speedup over Persistent BatchAttention: 1.18x
Loading memory size (MB): 2097.00 MB
Memory bandwidth (Batched Prefill): 1643.79 GB/s
Memory bandwidth (Batched POD Attention): 2534.72 GB/s
Memory bandwidth (POD Attention): 2749.41 GB/s
Memory bandwidth (Sequential two kernels): 1980.12 GB/s
Memory bandwidth (Persistent BatchAttention): 2139.90 GB/s
===== Benchmark 6: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.05 ms
Elapsed time (Batched POD Attention): 1.29 ms
Elapsed time (POD Attention): 1.31 ms
Elapsed time (Sequential two kernels): 1.59 ms
Elapsed time (Persistent BatchAttention): 1.53 ms
Batch POD speedup over Persistent BatchAttention: 1.19x
Loading memory size (MB): 4145.00 MB
Memory bandwidth (Batched Prefill): 1976.00 GB/s
Memory bandwidth (Batched POD Attention): 3145.56 GB/s
Memory bandwidth (POD Attention): 3080.07 GB/s
Memory bandwidth (Sequential two kernels): 2543.44 GB/s
Memory bandwidth (Persistent BatchAttention): 2642.19 GB/s
===== Benchmark 7: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.70 ms
Elapsed time (Batched POD Attention): 1.91 ms
Elapsed time (POD Attention): 1.78 ms
Elapsed time (Sequential two kernels): 2.28 ms
Elapsed time (Persistent BatchAttention): 2.21 ms
Batch POD speedup over Persistent BatchAttention: 1.16x
Loading memory size (MB): 4171.22 MB
Memory bandwidth (Batched Prefill): 1510.07 GB/s
Memory bandwidth (Batched POD Attention): 2135.78 GB/s
Memory bandwidth (POD Attention): 2282.71 GB/s
Memory bandwidth (Sequential two kernels): 1786.45 GB/s
Memory bandwidth (Persistent BatchAttention): 1839.28 GB/s
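
To make the comparison easier to read, the snippet below computes the relative slowdown of Batched POD Attention when the split is not limited. The timings are copied from the two runs above (rounded to 0.01 ms as printed, so small differences are within rounding noise); the helper name `relative_slowdown` is just for illustration.

```python
# Batched POD Attention elapsed time in ms, keyed by benchmark number.
limited_ms = {1: 0.32, 2: 0.34, 3: 0.44, 4: 0.36, 5: 0.80, 6: 1.29, 7: 1.91}
unlimited_ms = {1: 0.36, 2: 0.34, 3: 0.44, 4: 0.36, 5: 0.81, 6: 1.29, 7: 1.91}


def relative_slowdown(limited, unlimited):
    """Return (unlimited - limited) / limited for each benchmark."""
    return {k: (unlimited[k] - limited[k]) / limited[k] for k in limited}


slowdown = relative_slowdown(limited_ms, unlimited_ms)
for bench, value in sorted(slowdown.items()):
    print(f"Benchmark {bench}: {value * 100:+.1f}%")
```

Only Benchmark 1 shows a difference larger than the printed rounding step; the other benchmarks are unchanged or differ by 0.01 ms.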

add threshold

b7b4d4c

AKKamath merged commit 77d20fd into AKKamath:pod_batched_new Nov 12, 2025

Edenzzzz deleted the pod_batched_new branch November 14, 2025 22:13

AKKamath merged 1 commit into AKKamath:pod_batched_new from Edenzzzz:pod_batched_new
