Limit kv split when prefill tokens <= 1536 #1

Merged
AKKamath merged 1 commit into AKKamath:pod_batched_new from Edenzzzz:pod_batched_new
Nov 12, 2025

@Edenzzzz Edenzzzz commented Nov 12, 2025

In flashinfer-ai#2079, when prefill len <= 1536, the split-kv used to saturate the SMs now takes the colocated decode CTAs into account. I benchmarked this on an H200: when prefill len == 1536, enabling and disabling the limit give the same performance, but when prefill len < 1536, not limiting lags significantly behind.
cc @AKKamath
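
For readers who want the idea in code, here is a rough sketch of the heuristic described above. It is not the actual planner code from the patch: the function name, its parameters, and the slot arithmetic are hypothetical, and only the 1536 threshold and the `num_colocated_ctas` name come from this PR.

```python
# Hypothetical sketch of the split-kv limit; not the real planner code.
PREFILL_LIMIT_THRESHOLD = 1536  # threshold stated in the PR description


def kv_split_budget(total_cta_slots, num_prefill_tiles, num_decode_ctas, prefill_len):
    """Return how many KV splits each prefill tile may use (assumed model).

    total_cta_slots   -- CTAs the GPU can run concurrently (assumption)
    num_prefill_tiles -- prefill work items before any KV splitting (assumption)
    num_decode_ctas   -- decode CTAs launched alongside the prefill work
    prefill_len       -- number of prefill tokens
    """
    if num_prefill_tiles <= 0:
        raise ValueError("num_prefill_tiles must be positive")

    # Short prefill: reserve room for the colocated decode CTAs.
    # Long prefill: behave as before, i.e. num_colocated_ctas = 0.
    if prefill_len <= PREFILL_LIMIT_THRESHOLD:
        num_colocated_ctas = num_decode_ctas
    else:
        num_colocated_ctas = 0

    free_slots = max(total_cta_slots - num_colocated_ctas, 0)
    # Always allow at least one split so the prefill work can still run.
    return max(1, free_slots // num_prefill_tiles)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the benchmarks.
    print(kv_split_budget(132, 8, 68, prefill_len=1024))
    print(kv_split_budget(132, 8, 68, prefill_len=2048))
```

The "Not limiting" run below corresponds to forcing `num_colocated_ctas` to 0 regardless of `prefill_len`.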

Limiting split

===== Benchmark 1: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.33 ms
Elapsed time (Batched POD Attention): 0.32 ms
Elapsed time (POD Attention): 0.33 ms
Elapsed time (Sequential two kernels): 0.34 ms
Elapsed time (Persistent BatchAttention): 0.31 ms
Batch POD speedup over Persistent BatchAttention: 0.97x
Loading memory size (MB): 1031.00 MB
Memory bandwidth (Batched Prefill): 3032.64 GB/s
Memory bandwidth (Batched POD Attention): 3131.02 GB/s
Memory bandwidth (POD Attention): 3072.02 GB/s
Memory bandwidth (Sequential two kernels): 2973.32 GB/s
Memory bandwidth (Persistent BatchAttention): 3236.97 GB/s
===== Benchmark 2: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.38 ms
Elapsed time (Batched POD Attention): 0.34 ms
Elapsed time (POD Attention): 0.35 ms
Elapsed time (Sequential two kernels): 0.41 ms
Elapsed time (Persistent BatchAttention): 0.36 ms
Batch POD speedup over Persistent BatchAttention: 1.05x
Loading memory size (MB): 1043.00 MB
Memory bandwidth (Batched Prefill): 2694.59 GB/s
Memory bandwidth (Batched POD Attention): 2985.91 GB/s
Memory bandwidth (POD Attention): 2941.49 GB/s
Memory bandwidth (Sequential two kernels): 2498.61 GB/s
Memory bandwidth (Persistent BatchAttention): 2833.03 GB/s
===== Benchmark 3: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.44 ms
Elapsed time (Persistent BatchAttention): 0.50 ms
Batch POD speedup over Persistent BatchAttention: 1.15x
Loading memory size (MB): 1073.00 MB
Memory bandwidth (Batched Prefill): 1620.06 GB/s
Memory bandwidth (Batched POD Attention): 2389.65 GB/s
Memory bandwidth (Persistent BatchAttention): 2076.08 GB/s
===== Benchmark 4: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.55 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.36 ms
Elapsed time (Sequential two kernels): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.39 ms
Batch POD speedup over Persistent BatchAttention: 1.11x
Loading memory size (MB): 1049.00 MB
Memory bandwidth (Batched Prefill): 1849.18 GB/s
Memory bandwidth (Batched POD Attention): 2879.38 GB/s
Memory bandwidth (POD Attention): 2843.83 GB/s
Memory bandwidth (Sequential two kernels): 2244.63 GB/s
Memory bandwidth (Persistent BatchAttention): 2600.14 GB/s
===== Benchmark 5: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 1.23 ms
Elapsed time (Batched POD Attention): 0.80 ms
Elapsed time (POD Attention): 0.75 ms
Elapsed time (Sequential two kernels): 1.03 ms
Elapsed time (Persistent BatchAttention): 0.96 ms
Batch POD speedup over Persistent BatchAttention: 1.20x
Loading memory size (MB): 2097.00 MB
Memory bandwidth (Batched Prefill): 1671.60 GB/s
Memory bandwidth (Batched POD Attention): 2558.89 GB/s
Memory bandwidth (POD Attention): 2743.64 GB/s
Memory bandwidth (Sequential two kernels): 1984.41 GB/s
Memory bandwidth (Persistent BatchAttention): 2133.04 GB/s
===== Benchmark 6: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.07 ms
Elapsed time (Batched POD Attention): 1.29 ms
Elapsed time (POD Attention): 1.32 ms
Elapsed time (Sequential two kernels): 1.59 ms
Elapsed time (Persistent BatchAttention): 1.53 ms
Batch POD speedup over Persistent BatchAttention: 1.19x
Loading memory size (MB): 4145.00 MB
Memory bandwidth (Batched Prefill): 1957.42 GB/s
Memory bandwidth (Batched POD Attention): 3146.18 GB/s
Memory bandwidth (POD Attention): 3074.90 GB/s
Memory bandwidth (Sequential two kernels): 2545.10 GB/s
Memory bandwidth (Persistent BatchAttention): 2643.79 GB/s
===== Benchmark 7: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.71 ms
Elapsed time (Batched POD Attention): 1.91 ms
Elapsed time (POD Attention): 1.78 ms
Elapsed time (Sequential two kernels): 2.28 ms
Elapsed time (Persistent BatchAttention): 2.22 ms
Batch POD speedup over Persistent BatchAttention: 1.16x
Loading memory size (MB): 4171.22 MB
Memory bandwidth (Batched Prefill): 1504.75 GB/s
Memory bandwidth (Batched POD Attention): 2133.02 GB/s
Memory bandwidth (POD Attention): 2284.27 GB/s
Memory bandwidth (Sequential two kernels): 1787.11 GB/s
Memory bandwidth (Persistent BatchAttention): 1833.26 GB/s

Not limiting (always set num_colocated_ctas to 0)

===== Benchmark 1: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.33 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.33 ms
Elapsed time (Sequential two kernels): 0.34 ms
Elapsed time (Persistent BatchAttention): 0.31 ms
Batch POD speedup over Persistent BatchAttention: 0.85x
Loading memory size (MB): 1031.00 MB
Memory bandwidth (Batched Prefill): 3036.88 GB/s
Memory bandwidth (Batched POD Attention): 2762.63 GB/s
Memory bandwidth (POD Attention): 3065.43 GB/s
Memory bandwidth (Sequential two kernels): 2958.22 GB/s
Memory bandwidth (Persistent BatchAttention): 3240.90 GB/s
===== Benchmark 2: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.38 ms
Elapsed time (Batched POD Attention): 0.34 ms
Elapsed time (POD Attention): 0.35 ms
Elapsed time (Sequential two kernels): 0.41 ms
Elapsed time (Persistent BatchAttention): 0.36 ms
Batch POD speedup over Persistent BatchAttention: 1.05x
Loading memory size (MB): 1043.00 MB
Memory bandwidth (Batched Prefill): 2680.18 GB/s
Memory bandwidth (Batched POD Attention): 2976.14 GB/s
Memory bandwidth (POD Attention): 2936.33 GB/s
Memory bandwidth (Sequential two kernels): 2495.48 GB/s
Memory bandwidth (Persistent BatchAttention): 2829.31 GB/s
===== Benchmark 3: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.44 ms
Elapsed time (Persistent BatchAttention): 0.51 ms
Batch POD speedup over Persistent BatchAttention: 1.15x
Loading memory size (MB): 1073.00 MB
Memory bandwidth (Batched Prefill): 1611.40 GB/s
Memory bandwidth (Batched POD Attention): 2384.69 GB/s
Memory bandwidth (Persistent BatchAttention): 2074.32 GB/s
===== Benchmark 4: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 0.54 ms
Elapsed time (Batched POD Attention): 0.36 ms
Elapsed time (POD Attention): 0.36 ms
Elapsed time (Sequential two kernels): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.40 ms
Batch POD speedup over Persistent BatchAttention: 1.12x
Loading memory size (MB): 1049.00 MB
Memory bandwidth (Batched Prefill): 1891.91 GB/s
Memory bandwidth (Batched POD Attention): 2885.48 GB/s
Memory bandwidth (POD Attention): 2845.09 GB/s
Memory bandwidth (Sequential two kernels): 2237.73 GB/s
Memory bandwidth (Persistent BatchAttention): 2579.54 GB/s
===== Benchmark 5: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 1.25 ms
Elapsed time (Batched POD Attention): 0.81 ms
Elapsed time (POD Attention): 0.74 ms
Elapsed time (Sequential two kernels): 1.03 ms
Elapsed time (Persistent BatchAttention): 0.96 ms
Batch POD speedup over Persistent BatchAttention: 1.18x
Loading memory size (MB): 2097.00 MB
Memory bandwidth (Batched Prefill): 1643.79 GB/s
Memory bandwidth (Batched POD Attention): 2534.72 GB/s
Memory bandwidth (POD Attention): 2749.41 GB/s
Memory bandwidth (Sequential two kernels): 1980.12 GB/s
Memory bandwidth (Persistent BatchAttention): 2139.90 GB/s
===== Benchmark 6: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.05 ms
Elapsed time (Batched POD Attention): 1.29 ms
Elapsed time (POD Attention): 1.31 ms
Elapsed time (Sequential two kernels): 1.59 ms
Elapsed time (Persistent BatchAttention): 1.53 ms
Batch POD speedup over Persistent BatchAttention: 1.19x
Loading memory size (MB): 4145.00 MB
Memory bandwidth (Batched Prefill): 1976.00 GB/s
Memory bandwidth (Batched POD Attention): 3145.56 GB/s
Memory bandwidth (POD Attention): 3080.07 GB/s
Memory bandwidth (Sequential two kernels): 2543.44 GB/s
Memory bandwidth (Persistent BatchAttention): 2642.19 GB/s
===== Benchmark 7: (kv_len, qo_len) set =====
Elapsed time (Batched Prefill): 2.70 ms
Elapsed time (Batched POD Attention): 1.91 ms
Elapsed time (POD Attention): 1.78 ms
Elapsed time (Sequential two kernels): 2.28 ms
Elapsed time (Persistent BatchAttention): 2.21 ms
Batch POD speedup over Persistent BatchAttention: 1.16x
Loading memory size (MB): 4171.22 MB
Memory bandwidth (Batched Prefill): 1510.07 GB/s
Memory bandwidth (Batched POD Attention): 2135.78 GB/s
Memory bandwidth (POD Attention): 2282.71 GB/s
Memory bandwidth (Sequential two kernels): 1786.45 GB/s
Memory bandwidth (Persistent BatchAttention): 1839.28 GB/s
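
To make the two logs above easier to compare, the following sketch takes the "Batched POD Attention" memory bandwidth from each run and reports which benchmarks differ by more than a tolerance. The numbers are copied from the logs; the 2% tolerance and the helper name are my own assumptions for separating real differences from run-to-run noise.

```python
# Batched POD Attention memory bandwidth (GB/s), copied from the logs above.
LIMITED = [3131.02, 2985.91, 2389.65, 2879.38, 2558.89, 3146.18, 2133.02]
UNLIMITED = [2762.63, 2976.14, 2384.69, 2885.48, 2534.72, 3145.56, 2135.78]


def changed_benchmarks(limited, unlimited, tolerance=0.02):
    """Return 1-based benchmark indices whose bandwidth ratio deviates
    from 1.0 by more than `tolerance` (an assumed noise threshold)."""
    return [
        i
        for i, (a, b) in enumerate(zip(limited, unlimited), start=1)
        if abs(a / b - 1.0) > tolerance
    ]


if __name__ == "__main__":
    for i, (a, b) in enumerate(zip(LIMITED, UNLIMITED), start=1):
        print(f"Benchmark {i}: limited / unlimited = {a / b:.3f}x")
    print("Beyond tolerance:", changed_benchmarks(LIMITED, UNLIMITED))
```

With these numbers only Benchmark 1 exceeds the tolerance (about 1.13x in favor of limiting); the other six are within roughly 1% of each other.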

@AKKamath AKKamath merged commit 77d20fd into AKKamath:pod_batched_new Nov 12, 2025
@Edenzzzz Edenzzzz deleted the pod_batched_new branch November 14, 2025 22:13