Eval bug: Segfault with Gemma 4 31B at ~5,500+ prompt tokens (CUDA, multi-GPU) #21401

### Name and Version

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 96496 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
version: 8660 (d00685831)
built with GNU 13.3.0 for Linux x86_64

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

- AMD Ryzen 5 9600X 6-Core Processor, 64GB system RAM
- 2x NVIDIA GeForce RTX 3090 (24GB each), connected via NVLink (NV4)

### Models

ggml-org/gemma-4-31B-it-GGUF (Q8_0, 32.6 GB)

### Problem description & steps to reproduce

llama-server segfaults when processing prompts exceeding approximately 5,500 tokens with the Gemma 4 31B Q8_0 model on a dual-GPU CUDA setup with layer splitting.

**Crash boundary:** Prompts up to ~5,400 tokens work reliably. Prompts at ~5,600+ tokens cause a segfault. The crash occurs after prompt processing completes: the server logs a successful 200 response, then segfaults.

**Minimal reproduction:**
```
CUDA_VISIBLE_DEVICES=0,2 llama-server \
  -m gemma-4-31B-it-Q8_0.gguf \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 8080 \
  --cache-ram 0 \
  -c 32768
```

```
python3 -c "
import requests
filler = 'The quick brown fox jumps over the lazy dog near the river bank. ' * 400
payload = {
    'model': 'gemma-4-31B-it',
    'messages': [{'role': 'user', 'content': f'Summarize:\n\n{filler}\n\nOne paragraph.'}],
    'max_tokens': 50,
    'temperature': 1.0,
    'top_p': 0.95
}
r = requests.post('http://localhost:8080/v1/chat/completions', json=payload, timeout=300)
print(r.json())
"
```

**What works (390 repeats, ~5,482 prompt tokens):** Completes normally, returns 200.

**What crashes (400 repeats, ~5,600+ prompt tokens):** Server logs the 200 response, then segfaults.

**Workarounds attempted (none helped):**
- --cache-ram 0 (disable prompt cache)
- -b 512 -ub 256 (reduce batch sizes)
- --flash-attn off (disable flash attention)
- -np 1 (single concurrent slot)
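
To pick repeat counts between the working and crashing cases, the prompt size can be estimated from the one measured data point in the reproduction (390 repeats, ~5,482 prompt tokens). This is a rough sketch that assumes the token count grows linearly with the repeat count. `estimate_prompt_tokens` and `build_payload` are hypothetical helper names, and the chat-template overhead is folded into the ratio rather than modeled separately.

```python
# Measured data point from the report: 390 repeats -> ~5,482 prompt tokens.
KNOWN_REPEATS = 390
KNOWN_TOKENS = 5482

FILLER = 'The quick brown fox jumps over the lazy dog near the river bank. '


def estimate_prompt_tokens(repeats: int) -> int:
    """Linear estimate (assumption): tokens scale with the repeat count."""
    return round(repeats * KNOWN_TOKENS / KNOWN_REPEATS)


def build_payload(repeats: int) -> dict:
    """Same request body as the reproduction, with a variable repeat count."""
    filler = FILLER * repeats
    return {
        'model': 'gemma-4-31B-it',
        'messages': [{'role': 'user',
                      'content': f'Summarize:\n\n{filler}\n\nOne paragraph.'}],
        'max_tokens': 50,
        'temperature': 1.0,
        'top_p': 0.95,
    }


if __name__ == '__main__':
    # Print estimated prompt sizes for a few repeat counts near the boundary.
    for repeats in (390, 395, 400):
        print(repeats, estimate_prompt_tokens(repeats))
```

The payload from `build_payload(n)` can be posted to `/v1/chat/completions` exactly as in the reproduction. The token count the server actually logs should be preferred over this estimate.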

## Server Startup Configuration
```
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48248 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB

load_tensors:   CPU_Mapped model buffer size =  1428.00 MiB
load_tensors:        CUDA0 model buffer size = 15325.78 MiB
load_tensors:        CUDA1 model buffer size = 15783.04 MiB
load_tensors: offloaded 61/61 layers to GPU

llama_context: pipeline parallelism enabled
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: graph splits = 3
```

## Additional Context
- Short prompts (<5,400 tokens) work perfectly, including multi-turn conversations
- Decode speed is consistently ~23-24 tok/s
- Prefill speed is ~800-900 tok/s for working prompt sizes
- The segfault boundary appears to be prompt token count, not accumulated state — a single cold request at 5,600 tokens crashes immediately
- Possibly related to the hybrid SWA/global attention KV cache management at certain sequence lengths
- The crash reproduces identically on single-GPU with partial offload, ruling out pipeline parallelism as the cause
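
Since the boundary appears to depend only on prompt size, the exact failing repeat count could be narrowed by bisecting between the known-good (390) and known-bad (400) values, using a freshly started server for each probe. The following is only a sketch. `first_failing` is a hypothetical helper, and the `probe` callback is left to the reader: it would start the server, send one cold request, and report whether the process survived. The sketch also assumes the failure is monotonic in prompt length.

```python
from typing import Callable


def first_failing(good: int, bad: int, probe: Callable[[int], bool]) -> int:
    """Return the smallest repeat count in (good, bad] for which probe() is False.

    probe(n) must return True if the server survives a cold request with n
    repeats. Assumes probe(good) is True, probe(bad) is False, and that the
    failure is monotonic in n.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if probe(mid):
            good = mid  # still works: the boundary is above mid
        else:
            bad = mid   # crashes: the boundary is at or below mid
    return bad


if __name__ == '__main__':
    # Stand-in probe for demonstration only; the threshold 397 is made up.
    def fake_probe(n: int) -> bool:
        return n < 397

    print(first_failing(390, 400, fake_probe))
```

With the two data points from this report, the search needs at most four probes to pin down the first crashing repeat count.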

### First Bad Commit

_No response_

### Relevant log output

Last lines from server log before crash (from a successful request followed by crash on next request at the boundary):

slot update_slots: id  2 | task 205 | prompt processing done, n_tokens = 3554, batch.n_tokens = 4
slot update_slots: id  2 | task 205 | created context checkpoint 3 of 32 (pos_min = 0, pos_max = 3549, n_tokens = 3550, size = 2773.479 MiB)
slot print_timing: id  2 | task 205 |
prompt eval time =    5691.32 ms /  1769 tokens (    3.22 ms per token,   310.82 tokens per second)
       eval time =    2791.91 ms /    60 tokens (   46.53 ms per token,    21.49 tokens per second)
      total time =    8483.23 ms /  1829 tokens
slot      release: id  2 | task 205 | stop processing: n_tokens = 3613, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Segmentation fault (core dumped)

When sending a large prompt as the very first (cold) request, the crash happens immediately after warmup:

main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
Segmentation fault (core dumped)
