perf(mamba): use Triton conv1d for non-contiguous input to avoid .contiguous() copy #20469
Conversation
Force-pushed from a28e87d to 1d3048d
/tag-and-rerun-ci
Qiaolin-Yu left a comment:
btw, have you also tested decoding performance? Just to verify that decoding will not regress when using this kernel.
Force-pushed from 1d3048d to fd5ecad
Good catch! The concern is valid for callers that don't pre-compute seq_lens_cpu. Updated in fceca7a: the Triton fallback now only activates when seq_lens_cpu is already available in kwargs. Callers that do pass seq_lens_cpu get the copy-free Triton path; all others keep the original .contiguous() + CUDA path, so no GPU-CPU sync is introduced.
Qiaolin-Yu left a comment:
I see a slight regression in decode performance; could you test a longer output len for decoding? e.g., bs 16 with input len 10000, output len 30000. btw, I'm a bit concerned that the decoding performance of other models might be affected. If this slight regression is consistently reproduced, could we use this kernel only for prefill?
I have cleaned up the code a bit; I think now it only impacts prefill.
Force-pushed from 876e2e4 to 2536dfd
/tag-and-rerun-ci
Force-pushed from 6c11188 to 6d4589f
/tag-and-rerun-ci
…tiguous() copy

On large prefill batches, causal_conv1d_fn receives a non-contiguous input tensor because the GEMM output [seq, features] is transposed to [features, seq] before being passed to the CUDA conv1d kernel. The CUDA kernel requires stride(-1) == 1, which forces a full tensor copy via .contiguous(), costing >0.6 ms per layer. The existing Triton conv1d kernel already accepts arbitrary strides by passing stride values directly. This change falls back to the Triton path whenever the input is non-contiguous, eliminating the copy entirely. Tested on both embedding (Qwen3.5-0.8B) and generation workloads.

fix: avoid GPU-CPU sync when seq_lens_cpu not pre-computed

Only fall back to Triton conv1d when seq_lens_cpu is already available in kwargs (pre-computed on CPU by the caller). When absent, keep the original .contiguous() path to avoid introducing a GPU-CPU sync via query_start_loc.cpu().tolist().

fix: correct fallback logic — use .contiguous() when seq_lens_cpu unavailable

The previous fix had dead code. The dispatch is now:
- no sgl_kernel: always use Triton (compute seq_lens_cpu if needed)
- sgl_kernel + contiguous: use CUDA kernel (fast path, unchanged)
- sgl_kernel + non-contiguous + seq_lens_cpu available: use Triton (no copy)
- sgl_kernel + non-contiguous + seq_lens_cpu absent: .contiguous() + CUDA (avoids GPU-CPU sync)

style: improve comments and docstring clarity

feat(eval): support --dataset-path for GPQA eval in run_eval.py

Allow users to pass a local CSV file path via --dataset-path instead of downloading from the hardcoded OpenAI blob URL. Falls back to the original URL when --dataset-path is not provided.

refactor: use use_triton variable for dispatch clarity

Address reviewer nit: unwrap the dispatch condition into a named use_triton variable in both causal_conv1d_fn and causal_conv1d_update for readability. Also revert the docstring format to match the original inline style, adding a brief dispatch note at the end rather than a full rewrite.

fix: restore original docstrings, keep use_triton variable

Revert docstring changes in causal_conv1d_fn and causal_conv1d_update to match upstream exactly. The dispatch logic explanation now lives only in inline comments, not in the docstring.

fix: remove kwargs from causal_conv1d_update, restore clean state

- Remove **kwargs from the causal_conv1d_update signature and Triton call (no callers pass extra kwargs to the decode update function)
- Keep **kwargs only in causal_conv1d_fn, where seq_lens_cpu is passed
- Docstrings now match upstream exactly
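The feat(eval) commit above amounts to an optional source override. A minimal sketch of the intended fallback, assuming an argparse-based script; the --dataset-path flag comes from the commit, everything else (variable names, URL elision) is illustrative:

```python
import argparse

# Hardcoded OpenAI blob URL used upstream (elided here).
GPQA_URL = "https://openaipublic.blob.core.windows.net/..."

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset-path",
    type=str,
    default=None,
    help="Local CSV file for GPQA; falls back to the download URL when omitted.",
)
args = parser.parse_args()

# Prefer the local CSV when provided, otherwise keep the original URL behavior.
dataset_source = args.dataset_path or GPQA_URL
```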
Force-pushed from 6d4589f to 2337425
/tag-and-rerun-ci
/rerun-failed-ci
Summary
causal_conv1d_fn in sgl_kernel requires the input tensor to have stride(-1) == 1 (a contiguous last dimension). When called during prefill, the input x is a transposed view of the GEMM output, so the CUDA path must first make it contiguous. The copy costs >0.6 ms per layer on large prefill batches (e.g. 16K tokens × 6144 features = 188 MB). The existing Triton conv1d kernel already accepts arbitrary strides by passing stride values directly to the kernel — no copy needed.
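A minimal reproduction of the layout problem, with shapes from this description (variable names are illustrative, not from the patch):

```python
import torch

seq_len, features = 16384, 6144  # ~16K tokens x 6144 features (~188 MB at bf16)
gemm_out = torch.randn(seq_len, features, device="cuda", dtype=torch.bfloat16)

# The conv1d kernel consumes [features, seq], so the GEMM output is transposed.
x = gemm_out.transpose(0, 1)  # a view: x.stride(-1) == features, not 1
assert x.stride(-1) != 1      # the CUDA causal_conv1d_fn rejects this layout...
x_cuda = x.contiguous()       # ...forcing a full copy (>0.6 ms per layer)
```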
This PR adds a fallback: when x.stride(-1) != 1 and seq_lens_cpu is already pre-computed by the caller (to avoid introducing a GPU-CPU sync), dispatch to the Triton kernel instead of copying.
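A sketch of the resulting dispatch in causal_conv1d_fn; the use_triton condition follows the commit messages, while HAS_SGL_KERNEL, _conv1d_triton, and _conv1d_cuda are stand-in names rather than the actual symbols in the patch:

```python
def causal_conv1d_fn(x, weight, bias=None, activation=None, **kwargs):
    # seq_lens_cpu is optionally pre-computed on CPU by the caller.
    seq_lens_cpu = kwargs.get("seq_lens_cpu")

    # Triton handles arbitrary strides; use it when there is no CUDA kernel,
    # or when the input is non-contiguous and no GPU-CPU sync would be needed.
    use_triton = not HAS_SGL_KERNEL or (
        x.stride(-1) != 1 and seq_lens_cpu is not None
    )
    if use_triton:
        # (_conv1d_triton is assumed to compute seq_lens_cpu itself when None.)
        return _conv1d_triton(x, weight, bias, activation, seq_lens_cpu=seq_lens_cpu)

    if x.stride(-1) != 1:
        # seq_lens_cpu absent: copying is cheaper than syncing via
        # query_start_loc.cpu().tolist().
        x = x.contiguous()
    return _conv1d_cuda(x, weight, bias, activation)
```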
Also updates run_eval.py to support --dataset-path for GPQA, allowing local CSV files instead of the hardcoded URL.

Test
Embedding server with Qwen3.5-0.8B on H200, 16K token inputs.

Prefill throughput (embedding, product distribution):
Decode throughput (generation, 64 requests, concurrency=16, input=10K, output=30K):
Using bench_serving.py --dataset-name random-ids --random-input-len 10000 --random-output-len 30000 --num-prompts 64 --max-concurrency 16:

No decode regression — causal_conv1d_update is unaffected by this patch (the decode path always uses the CUDA kernel when sgl_kernel is available).

Quality (GPQA Diamond, 198 questions):
Quality (GPQA Main, 448 questions):
Cosine similarity between before/after embeddings: 0.99993 (numerical parity).
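The parity check is straightforward to reproduce; a sketch, assuming the two embedding sets were saved as equal-shaped tensors (the file names are illustrative):

```python
import torch
import torch.nn.functional as F

# [num_texts, hidden_dim] embeddings captured before and after the patch.
before = torch.load("embeddings_before.pt")
after = torch.load("embeddings_after.pt")

sim = F.cosine_similarity(before.float(), after.float(), dim=-1).mean()
print(f"mean cosine similarity: {sim.item():.5f}")  # PR reports 0.99993
```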