Labels: bug (Something isn't working), chat parser (Issues related to the chat parser and chat templates), medium severity (Used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable), server
Description
Bug: llama-server crashes (segfault) when processing prompts with repeated identical characters
Issue Summary
llama-server crashes with a segmentation fault when processing prompts containing a large number of repeated identical characters (e.g., 'A' * 10000), but works perfectly fine with real natural language text of the same or larger size.
Environment
- llama.cpp version: b7148-0543f928a
- Backend: ROCm (also reproduced on Vulkan with AMDVLK)
- GPU: AMD (96GB VRAM)
- OS: Ubuntu Linux
- Model: unsloth/gpt-oss-120b-GGUF (Q8_0)
Command to Start Server
LLAMA_CHAT_TEMPLATE_KWARGS='{"reasoning_effort": "medium"}' ./llama-server \
-m ~/model/unsloth/gpt-oss-120b-GGUF/new/gpt-oss-120b-Q8_0-00001-of-00002.gguf \
--host 0.0.0.0 --port 11435 -ngl 99 --ctx-size 32768 \
-b 256 -ub 128 --no-warmup --n-predict 8192 \
--top-k 60 --top-p 0.9 --repeat-penalty 1.1 \
--jinja --chat-template-kwargs '{"reasoning_effort": "medium"}' \
--no-mmap --mlock -np 1 -sps 0.0 --cache-reuse 0
Steps to Reproduce
- Start llama-server with the above command
- Send a request with repeated characters:
curl -X POST http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"'"$(python3 -c "print('A'*10000)")"' Say OK"}],"stream":true}'
- Server crashes with:
Segmentation fault (core dumped)
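The same request can also be driven from Python. This is a minimal sketch; build_payload is a hypothetical helper written for this report, not part of llama.cpp, and the endpoint/port match the server command above:

```python
import json

# Hypothetical helper: builds the same chat-completions payload as the
# curl command above ('A' * n_chars followed by " Say OK").
def build_payload(n_chars: int, suffix: str = " Say OK") -> bytes:
    body = {
        "messages": [{"role": "user", "content": "A" * n_chars + suffix}],
        "stream": True,
    }
    return json.dumps(body).encode("utf-8")

# POST this to http://localhost:11435/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request) to reproduce.
payload = build_payload(10000)
print(len(payload))
```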
Expected Behavior
Server should process the request (even if slowly) or return an error, not crash.
Actual Behavior
Server immediately crashes with segfault. The crash occurs:
- At ~10,000 repeated characters threshold
- Regardless of backend (ROCm or Vulkan)
- Even when called from localhost
Key Observation: Real Text Works Fine
The same server handles real natural language text of much larger sizes without any issues:
# This works perfectly - 182,000 characters of real text
real_text = "The quick brown fox..." * 500
# Request succeeds, server processes 27,341 tokens without crash

# This crashes - only 10,000 repeated characters
repeated_text = "A" * 10000
# Server crashes with segfault
Server Logs Before Crash
main: model loaded
main: server is listening on http://0.0.0.0:11435
main: starting the main loop...
srv update_slots: all slots are idle
./runllama.sh: line 21:  4471 Segmentation fault (core dumped) ...
The crash happens immediately when processing the request - no prompt processing logs appear.
Possible Causes
Based on investigation, this could be related to:
- Tokenizer handling of repeated tokens - The BPE tokenizer may have edge cases with highly repetitive input
- DRY sampler - the sampler chain includes dry, which detects repeated sequences
- Repeat penalty calculation - edge case when all tokens are identical
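If a DRY-style repetition scan is involved, the failure mode can be illustrated with a sketch. This is purely illustrative and not llama.cpp's actual implementation: such a scan starts by finding earlier context positions that match the newest token, and with a run of identical tokens every position is a candidate, so per-token work jumps from near zero to O(n):

```python
# Illustrative only - NOT llama.cpp's DRY code. Counts how many earlier
# context positions match the newest token, the first step of a
# DRY-style repeated-sequence scan.
def count_match_candidates(tokens: list[int]) -> int:
    last = tokens[-1]
    return sum(1 for t in tokens[:-1] if t == last)

varied = list(range(10000))      # stand-in for natural text
repeated = [65] * 10000          # stand-in for 'A' * 10000

print(count_match_candidates(varied))    # 0
print(count_match_candidates(repeated))  # 9999
```

With varied tokens the scan finds nothing to extend; with 10,000 identical tokens every position matches, which is consistent with the observed ~10,000-character crash threshold being specific to repeated input.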
Additional Context
- The WebUI on the same server can handle large prompts (27,341 tokens) of real text
- The crash threshold is consistent (~10,000 repeated characters)
- This was tested with multiple configurations (with/without -fa, different batch sizes, etc.)
Workaround
Use real/varied text content instead of repeated characters. For actual use cases with natural language, the server works correctly.
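Until the crash is fixed, a hypothetical client-side guard is to cap pathological runs of a single character before the prompt is sent; collapse_runs and the 256-character cap below are this report's suggestion, not anything provided by llama.cpp:

```python
import re

# Hypothetical client-side workaround: collapse any run of one repeated
# character longer than `max_run` down to `max_run` characters.
def collapse_runs(text: str, max_run: int = 256) -> str:
    pattern = r"(.)\1{%d,}" % max_run  # a char followed by >= max_run repeats
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)

print(len(collapse_runs("A" * 10000)))  # 256
print(collapse_runs("normal text"))     # unchanged
```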