CUDA: Enable CUDA graphs for qwen3 next-style architectures (#19521)
ORippler wants to merge 3 commits into ggml-org:master
Conversation
@ORippler which one is node_143? Can you paste the operations line (op and source nodes)?
Also maybe try with …
Though I'm still confused about why we rebuild the graph twice before being able to reuse it (basically …)
I just saw there is ongoing work for a better model graph for qwen3-next style architectures in llama.cpp (#19504 and #19375). I still feel we could merge this PR regardless so people get good perf on the CUDA backend in the meantime (happy to update the heuristics to the new node names should they change in the new model graph).
@ORippler let's do a separate PR for that
Force-pushed c1090ee to 28177e3
Done. EDIT: the change is still in this branch; I'll do an interactive rebase once #19566 has been merged.
BTW, just for completeness, can you run a larger correctness test for TG? Perhaps …
Force-pushed 28177e3 to 89decb7
I rebased the branch now that #19566 was merged.
Here are numbers for wiki.test.raw that show equivalent PPL:
- CUDA Graphs on (confirmed manually by debugging, also see faster throughput/ETA)
- CUDA Graphs off
This is less strict, but it should still do the job and be faster. It also matches how we compare the other prefixes.
```diff
 if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1 &&
     (node->src[0] ? node->src[0]->name != gemma3n_per_layer_proj_src0_name : true) &&
     (node->src[1] ? node->src[1]->name != gemma3n_per_layer_proj_src1_name : true) &&
     strncmp(node->name, ffn_moe_gate_bias_prefix.c_str(), ffn_moe_gate_bias_prefix.size()) != 0 &&
     strncmp(node->name, ffn_moe_up_bias_prefix.c_str(), ffn_moe_up_bias_prefix.size()) != 0 &&
     strncmp(node->name, ffn_moe_down_bias_prefix.c_str(), ffn_moe_down_bias_prefix.size()) != 0 &&
     strncmp(node->name, nemotron_h_block_out_prefix.c_str(), nemotron_h_block_out_prefix.size()) != 0 &&
-    strncmp(node->name, mamba2_y_add_d_prefix.c_str(), mamba2_y_add_d_prefix.size()) != 0) {
+    strncmp(node->name, mamba2_y_add_d_prefix.c_str(), mamba2_y_add_d_prefix.size()) != 0 &&
+    strncmp(node->name, qwen3next_diag_mask.c_str(), qwen3next_diag_mask.size()) != 0 &&
+    strncmp(node->name, delta_net_linear_attn_prefix.c_str(), delta_net_linear_attn_prefix.size()) != 0) {
     // disable CUDA graphs for batch size > 1 for now while excluding the matrix-matrix addition as part of Gemma3n's `project_per_layer_input` operation
     // by means of matching node names. See
     // https://github.com/ggml-org/llama.cpp/blob/f9a31eea06a859e34cecb88b4d020c7f03d86cc4/src/llama-model.cpp#L10199-L10241 and
     // https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/models/gemma3n/modeling_gemma3n.py#L1773.
     // Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
     use_cuda_graph = false;
 #ifndef NDEBUG
     GGML_LOG_DEBUG("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
 #endif
 }
```
// disable CUDA graphs for batch size > 1 for now while excluding the matrix-matrix addition as part of Gemma3n's `project_per_layer_input` operation
Can we confirm that the only reason we have this entire logic is because of Gemma3n - as explained in the comment?
// Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
Haven't looked at the source of this comment, but changes in the batch size / context size should already be detected by the node properties comparison, correct?
The reason we added this was historical: users raised segfaults in multi-GPU settings in the PR that added CUDA Graphs support, likely caused by us failing to detect the need to re-capture the CUDA Graph. Instead of root-causing and fixing these issues, band-aids have been put in place and updated ever since. I've been meaning to remove the update-counter + batch-size heuristics once we validate:
This is achieved via:
While we already knew 1. as tech debt in the CUDA backend from other models, 2. was new to me. I did try to investigate what happens with the frequent graph updates and the hybrid model, but could not figure it out fully:
1. `conv_states_updated-0` changes two times at ~128-token intervals, from 0 -> 24576 -> 0. This seems to come from `llm_graph_input_rs`?. My understanding is that this `conv_states_updated-0` somewhat reflects a per-sequence mapping of our conv states, whereas we will only have a single sequence in llama-bench's token-generation phase (so it should always be 0 in that case? At least in the reused `ggml_cgraph` case it ends up being 0). This change in `ggml_cgraph` topology is however seemingly captured by `llm_graph_result::can_reuse`.
2. `node_143` (this is somewhere in the first `build_layer_attn_linear` of qwen3next) seems to have additional data added to it despite the graph being denoted as reusable by `llm_graph_result::can_reuse`. Is this potentially a bug? I was under the impression that graph topology is not allowed to change when it is reused in `llama_context::process_ubatch`.

Relevant sections from log output and diff patch for the above two points:
Applied diff to get that log
Maybe folks from #16490 and #16095 (@ggerganov @pwilkin @gabe-l-hart ) could chime in here.
I'll try to collect numbers to see if we can remove that consecutive-update counter in general. This should be feasible if capture + launch of a `cudaGraph` is always faster than a naive launch on our workloads under both Linux and Windows.

In the meantime, this closes #19345.
I did verify perf and functional correctness.