Name and Version
❯ llama-cli --version
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 64215 MiB):
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32106 MiB
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
version: 8740 (e34f042)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE --split-mode tensor --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
Problem description & steps to reproduce
Testing the new tensor parallelism (--split-mode tensor) on my two RTX 5090s.
llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE --split-mode tensor --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
Prompt: "Write me a python snake game"
The output falls into a repetition loop. Without --split-mode tensor, the same prompt completes without issue.
Example:
food.penup() # to prevent showing lines when turtle not using tracers
food.shapesize(0.5) # size of the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big) and by the snake (0.5 is big!0.5 is big) and by the snake (0.5 is big!0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0.5 is big?0
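To put a rough number on the looping, a small helper like the one below could be run on saved output from each configuration. It is a hypothetical sketch (not part of llama.cpp). It assumes the generated text has been saved to a file. It splits the text into words and reports the most frequently repeated word n-gram and its count. A much higher count with --split-mode tensor than without would quantify the difference.

```shell
#!/bin/sh
# Hypothetical helper, not part of llama.cpp.
# Prints "<count>\t<n-gram>" for the most-repeated word n-gram read from stdin.
# Usage: max_ngram_repeat N < output.txt   (N defaults to 6)
max_ngram_repeat() {
    n="${1:-6}"
    # One word per line; squeeze runs of whitespace.
    tr -s '[:space:]' '\n' | awk -v n="$n" '
        NF { w[++c] = $0 }              # collect non-empty words
        END {
            best = 0; gram = ""
            for (i = 1; i + n - 1 <= c; i++) {
                g = w[i]                 # build the n-gram starting at word i
                for (j = 1; j < n; j++) g = g " " w[i + j]
                cnt[g]++
                if (cnt[g] > best) { best = cnt[g]; gram = g }
            }
            printf "%d\t%s\n", best, gram
        }'
}
```

For example, with the two runs saved to hypothetical files tensor.txt and layer.txt, compare max_ngram_repeat 6 < tensor.txt against max_ngram_repeat 6 < layer.txt.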
Running CUDA 13.2, with llama.cpp built from source using this script:
#!/bin/sh
rm -rf build
git pull
cmake -B build -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DLLAMA_BUILD_BORINGSSL=ON -DLLAMA_BUILD_LIBRESSL=ON #-DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)
First Bad Commit
Unknown; possibly one of the commits in #19378.
Relevant log output
No logs.