CUDA: use CUB for arbitrary size argsort by am17an · Pull Request #16754 · ggml-org/llama.cpp

am17an · 2025-10-24T10:26:18Z

Use CUB for argsort when ncols > 1024 or when there is not enough shared memory available.
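The following CPU-side C++ sketch illustrates the dispatch rule and shows how a segmented sort produces an argsort. It is not the code in `argsort.cu`. `next_pow2`, `use_cub_argsort`, `argsort_rows` and `SortOrder` are hypothetical names. The shared-memory estimate assumes the existing kernel pads each row to the next power of two and stores one `int` index per padded column. `std::stable_sort` stands in for CUB's device-wide segmented sort: each row is one segment, the row's values are the keys, and the column indices 0..ncols-1 are the paired values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an ascending/descending sort-order flag.
enum class SortOrder { Asc, Desc };

// Smallest power of two >= n (assumed padding used by the shared-memory kernel).
inline int next_pow2(int n) {
    int p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Assumed dispatch rule, following the PR description: use CUB when a row has
// more than 1024 columns, or when the padded per-row index buffer would not fit
// in the shared memory available to one block.
inline bool use_cub_argsort(int ncols, std::size_t smem_per_block) {
    const std::size_t smem_needed =
        static_cast<std::size_t>(next_pow2(ncols)) * sizeof(std::int32_t);
    return ncols > 1024 || smem_needed > smem_per_block;
}

// CPU reference for the segmented-sort formulation of argsort on row-major data:
// row r is the segment [r * ncols, (r + 1) * ncols), the keys are the row's
// values and the paired values are the column indices 0..ncols-1.
std::vector<std::int32_t> argsort_rows(const std::vector<float>& x, int nrows, int ncols,
                                       SortOrder order) {
    std::vector<std::int32_t> idx(static_cast<std::size_t>(nrows) *
                                  static_cast<std::size_t>(ncols));
    assert(x.size() == idx.size());
    for (int r = 0; r < nrows; ++r) {
        const std::ptrdiff_t begin = static_cast<std::ptrdiff_t>(r) * ncols;
        const auto first = idx.begin() + begin;
        const auto last = first + ncols;
        std::iota(first, last, 0);  // paired values: column indices
        const float* row = x.data() + begin;
        // A stable sort keeps equal keys in column order; whether the GPU path
        // guarantees the same depends on which CUB routine is used.
        std::stable_sort(first, last, [row, order](std::int32_t a, std::int32_t b) {
            return order == SortOrder::Asc ? row[a] < row[b] : row[a] > row[b];
        });
    }
    return idx;
}
```

Under these assumptions, `use_cub_argsort(1025, 48 * 1024)` is true because of the column limit alone. `use_cub_argsort(512, 1024)` is true because 512 padded `int` indices (2048 bytes) exceed the 1024-byte budget.

On the GPU, CUB's segmented sort APIs (for example `cub::DeviceSegmentedRadixSort::SortPairs`) take begin and end offsets for each segment. For row-major data these would be `r * ncols` and `(r + 1) * ncols`. Whether the PR uses that exact entry point is not stated here.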

ggml/src/ggml-cuda/argsort.cu

am17an · 2025-10-24T10:52:24Z

cc: @CISC, I did try BailingMoeV2 with this PR, but since it doesn't use argsort at the moment it didn't show any speed-up, might be worth trying again

CISC · 2025-10-25T08:04:18Z

> I did try BailingMoeV2 with this PR, but since it doesn't use argsort at the moment it didn't show any speed-up, might be worth trying again

It does use argsort (through top_k); however, the first implementation was incorrect. The implementation that was merged uses much smaller tensors that fit well within the 1024 limit.
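To make the top_k remark concrete: a top-k selection can be expressed as a descending argsort of each row followed by keeping the first k indices. This means argsort sees the full row length, not k. The C++ sketch below assumes that formulation. We understand ggml's `ggml_top_k` to be built this way, but treat that as an assumption. `top_k_via_argsort` is a hypothetical name, and `std::stable_sort` stands in for the GPU sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: top-k as "argsort the row in descending order,
// then keep the first k indices". The sort runs over the whole row.
std::vector<std::int32_t> top_k_via_argsort(const std::vector<float>& row, int k) {
    assert(k >= 0);
    std::vector<std::int32_t> idx(row.size());
    std::iota(idx.begin(), idx.end(), 0);  // column indices 0..n-1
    // Stable descending sort of indices by value (ties keep their original order).
    std::stable_sort(idx.begin(), idx.end(), [&row](std::int32_t a, std::int32_t b) {
        return row[static_cast<std::size_t>(a)] > row[static_cast<std::size_t>(b)];
    });
    // Keep at most k indices; asking for more than the row length returns all of them.
    idx.resize(std::min(idx.size(), static_cast<std::size_t>(k)));
    return idx;
}
```

For the row `{0.1f, 0.7f, 0.3f, 0.9f}` and `k = 2`, the function returns the indices `{3, 1}`. CISC notes that the merged BailingMoeV2 implementation sorts rows of at most 1024 entries. Such rows stay on the shared-memory kernel, so the CUB path is not exercised.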

Edit: see ggml-org/ggml#1367, though; this will be useful for others.

* model-conversion : add trust_remote_code for orig model run [no ci] (ggml-org#16751)

  This commit add the trust_remote_code=True argument when loading models using AutoConfig, AutoTokenizer, and AutoModelForCausalLM for the run original model script. The motivation for this is that some models require custom code to be loaded properly, and setting trust_remote_code=True avoids a prompt asking for user confirmation:

  ```console
  (venv) $ make causal-run-original-model
  The repository /path/to/model contains custom code which must be executed to correctly load the model. You can inspect the repository content at /path/to/model.
  Do you wish to run the custom code? [y/N] N
  ```

  Having this as the default seems like a safe choice as we have to clone or download the models we convert and would be expecting to run any custom code they have.

* webui: support q URL parameter (ggml-org#16728)
  * webui: support q URL parameter
    Fixes ggml-org#16722
    I’ve checked that it works with Firefox’s AI tools
  * webui: apply suggestions from code review
    Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
  * chore: update webui static build
  ---------
  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* CUDA: use CUB for arbitary size argsort (ggml-org#16754)
* ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (ggml-org#16742)
  * Fix CUDA grid launch condition for large block_nums.y
  * add backend ops test
  * reduce test repetitions
* convert : avoid dequantizing mxfp4 for GPT-OSS (ggml-org#16756)
* vulkan: Optimize SSM_SCAN (ggml-org#16645)
* vulkan: delete dead code (ggml-org#16732)
  ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc.
  Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* model : set res->t_embd in PLaMo2 models (ggml-org#16766)

---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Florian Badie <florianbadie@odrling.xyz>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Shunta Saito <shunta.saito@gmail.com>

DL-TODO: support argsort_f32_i32_cuda_cub fp16 impl.

CUDA: use CUB for arbitary size argsort

fa43f8a

am17an requested a review from slaren as a code owner October 24, 2025 10:26

slaren reviewed Oct 24, 2025

ggml/src/ggml-cuda/argsort.cu (review thread resolved)

github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 24, 2025

slaren approved these changes Oct 24, 2025

am17an merged commit 0bcb40b into ggml-org:master Oct 24, 2025
72 checks passed

am17an deleted the cuda_cub_argsort branch October 24, 2025 12:46

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

CUDA: use CUB for arbitary size argsort (ggml-org#16754)

da52ddf

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

CUDA: use CUB for arbitary size argsort (#16754)

78e0ac6

am17an merged 1 commit into ggml-org:master from am17an:cuda_cub_argsort

3 participants