Name and Version
$ ./bin/llama-cli --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
version: 8233 (c5a7788)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
$ ./bin/llama-server -m ~/ggufs/Qwen3.5-4B-Q4_K_M.gguf -ub 2048 -c 65536 --host 192.168.18.6 -ngl 99 -np 1
$ curl http://192.168.18.6:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: your-api-key" \
-d '{
"model": "qwen-3.5-4b",
"max_tokens": 16,
"system": "You are a helpful assistant.",
"messages": [
{"role": "user", "content": "2+2=?"}
]
}'
Problem description & steps to reproduce
Instead of generating up to max_tokens tokens and responding with stop_reason max_tokens, llama-server returns error 500 with the message:
{"error":{"code":500,"message":"Failed to parse input at pos 59: ","type":"server_error"}}
If I increase max_tokens so that it's greater than the length of the generated response, everything works OK.
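For convenience, here is the same reproduction as a small Python sketch using only the standard library (the host, port, model name, and API key mirror my setup and should be adjusted; the send helper is just for illustration):

```python
import json
import urllib.error
import urllib.request

# Assumed server address from my setup; change to wherever llama-server listens.
BASE_URL = "http://192.168.18.6:8080"

# Anthropic-style /v1/messages payload. The bug triggers when max_tokens is
# small enough that generation is cut off before the model finishes its reply.
payload = {
    "model": "qwen-3.5-4b",
    "max_tokens": 16,
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "2+2=?"}],
}


def send(body: dict) -> int:
    """POST the payload to /v1/messages and return the HTTP status code.

    With the bug present this returns 500 instead of 200.
    """
    req = urllib.request.Request(
        BASE_URL + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": "your-api-key"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Raising `payload["max_tokens"]` well above the expected response length makes the same call succeed.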
First Bad Commit
Currently no time to investigate.
Relevant log output
Logs
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 545 | processing task, is_child = 0
slot update_slots: id 0 | task 545 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 25
slot update_slots: id 0 | task 545 | n_past = 25, slot.prompt.tokens.size() = 568, seq_id = 0, pos_min = 567, n_swa = 1
slot update_slots: id 0 | task 545 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 545 | n_tokens = 0, memory_seq_rm [0, end)
slot init_sampler: id 0 | task 545 | init sampler, took 0.01 ms, tokens: text = 25, total = 25
slot update_slots: id 0 | task 545 | prompt processing done, n_tokens = 25, batch.n_tokens = 25
slot print_timing: id 0 | task 545 |
prompt eval time = 18.49 ms / 25 tokens ( 0.74 ms per token, 1351.86 tokens per second)
eval time = 67.62 ms / 16 tokens ( 4.23 ms per token, 236.60 tokens per second)
total time = 86.12 ms / 41 tokens
slot release: id 0 | task 545 | stop processing: n_tokens = 40, truncated = 0
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 545
srv update_slots: all slots are idle
srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 59: ","type":"server_error"}}
srv log_server_r: done request: POST /v1/messages 192.168.18.11 500