
Bug: llama-server crashes (segfault) when processing prompts with repeated identical characters #17636

@allenzz-dev

Issue Summary

llama-server crashes with a segmentation fault when processing prompts containing a large number of repeated identical characters (e.g., 'A' * 10000), but works perfectly fine with real natural language text of the same or larger size.

Environment

  • llama.cpp version: b7148-0543f928a
  • Backend: ROCm (also reproduced on Vulkan with AMDVLK)
  • GPU: AMD (96GB VRAM)
  • OS: Ubuntu Linux
  • Model: unsloth/gpt-oss-120b-GGUF (Q8_0)

Command to Start Server

LLAMA_CHAT_TEMPLATE_KWARGS='{"reasoning_effort": "medium"}' ./llama-server \
  -m ~/model/unsloth/gpt-oss-120b-GGUF/new/gpt-oss-120b-Q8_0-00001-of-00002.gguf \
  --host 0.0.0.0 --port 11435 -ngl 99 --ctx-size 32768 \
  -b 256 -ub 128 --no-warmup --n-predict 8192 \
  --top-k 60 --top-p 0.9 --repeat-penalty 1.1 \
  --jinja --chat-template-kwargs '{"reasoning_effort": "medium"}' \
  --no-mmap --mlock -np 1 -sps 0.0 --cache-reuse 0

Steps to Reproduce

  1. Start llama-server with the above command
  2. Send a request with repeated characters:
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"'"$(python3 -c "print('A'*10000)")"' Say OK"}],"stream":true}'
  3. Server crashes with: Segmentation fault (core dumped)
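
For scripted testing, the same request body can be built in Python (a sketch mirroring the curl call above; the actual send is left commented out, since the server segfaults before a response arrives):

```python
import json
import urllib.request

# Build the same body the curl command sends.
payload = {
    "messages": [{"role": "user", "content": "A" * 10000 + " Say OK"}],
    "stream": True,
}
body = json.dumps(payload).encode()

# Sending the request (uncomment with a running server; the server
# crashes before responding):
# req = urllib.request.Request(
#     "http://localhost:11435/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```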

Expected Behavior

The server should either process the request (even if slowly) or return an error; it should not crash.

Actual Behavior

Server immediately crashes with segfault. The crash occurs:

  • At a threshold of ~10,000 repeated characters
  • Regardless of backend (ROCm or Vulkan)
  • Even when called from localhost

Key Observation: Real Text Works Fine

The same server handles real natural language text of much larger sizes without any issues:

# This works perfectly - 182,000 characters of real text
real_text = "The quick brown fox..." * 500
# Request succeeds; server processes 27,341 tokens without crashing

# This crashes - only 10,000 repeated characters
repeated_text = "A" * 10000
# Server crashes with segfault

Server Logs Before Crash

main: model loaded
main: server is listening on http://0.0.0.0:11435
main: starting the main loop...
srv  update_slots: all slots are idle
./runllama.sh: line 21: 4471 Segmentation fault (core dumped) ...

The crash happens immediately upon receiving the request; no prompt-processing logs appear.

Possible Causes

Based on investigation, this could be related to:

  1. Tokenizer handling of repeated tokens - The BPE tokenizer may have edge cases with highly repetitive input
  2. DRY sampler - The sampler chain includes dry, which detects repeated sequences
  3. Repeat penalty calculation - Edge case when all tokens are identical

Additional Context

  • The WebUI on the same server can handle large prompts (27,341 tokens) of real text
  • The crash threshold is consistent (~10,000 repeated characters)
  • This was tested with multiple configurations (with/without -fa, different batch sizes, etc.)

Workaround

Use real/varied text content instead of repeated characters. For actual use cases with natural language, the server works correctly.
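
Where repeated-character input cannot be avoided upstream, a hypothetical client-side guard can cap character runs before sending the request. This is a sketch: `collapse_runs` and its `max_run` cap of 64 are invented here (chosen well below the ~10,000-character crash threshold), and it mitigates rather than fixes the underlying segfault.

```python
import re

def collapse_runs(text, max_run=64):
    """Cap any run of a single repeated character at max_run
    occurrences (hypothetical client-side mitigation, not a fix
    for the server-side crash)."""
    pattern = r"(.)\1{%d,}" % (max_run - 1)
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)

print(len(collapse_runs("A" * 10000)))  # -> 64
```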

Metadata

Labels

  • bug: Something isn't working
  • chat parser: Issues related to the chat parser and chat templates
  • medium severity: Used to report medium severity bugs in llama.cpp (e.g. malfunctioning features but still usable)
  • server
