
Misc. bug: llama-server responds with error code 500 and "Failed to parse input at pos ..." message when max_tokens is reached #20193

@fairydreaming

Description


Name and Version

$ ./bin/llama-cli --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
version: 8233 (c5a7788)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

$ ./bin/llama-server -m ~/ggufs/Qwen3.5-4B-Q4_K_M.gguf -ub 2048 -c 65536 --host 192.168.18.6 -ngl 99 -np 1

$ curl http://192.168.18.6:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-api-key" \
  -d '{
    "model": "qwen-3.5-4b",
    "max_tokens": 16,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "2+2=?"}
    ]
  }'

Problem description & steps to reproduce

Instead of generating up to max_tokens tokens and responding with stop_reason max_tokens, llama-server returns error 500 with the message:

{"error":{"code":500,"message":"Failed to parse input at pos 59: ","type":"server_error"}}

If I increase max_tokens so that it's greater than the length of the generated response, everything works fine.
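Until the root cause is fixed server-side, a client can work around this by detecting this specific error shape and retrying with a larger max_tokens. A minimal sketch (the helper names and the retry policy are my own, not part of llama.cpp; only the error JSON shape is taken from the response above):

```python
import json

def is_truncation_parse_error(status, body):
    """Heuristic check for the bug described above: an HTTP 500 whose
    error message starts with 'Failed to parse input at pos'."""
    if status != 500:
        return False
    try:
        err = json.loads(body).get("error", {})
    except json.JSONDecodeError:
        return False
    return (err.get("type") == "server_error"
            and err.get("message", "").startswith("Failed to parse input at pos"))

def post_with_retry(send, payload, max_tokens_cap=4096):
    """`send` is any callable(payload) -> (status, body), e.g. a wrapper
    around an HTTP POST to /v1/messages. On the parse error, double
    max_tokens and retry until it succeeds or the cap is reached."""
    while True:
        status, body = send(payload)
        if not is_truncation_parse_error(status, body):
            return status, body
        if payload["max_tokens"] * 2 > max_tokens_cap:
            return status, body  # give up; surface the error to the caller
        payload = {**payload, "max_tokens": payload["max_tokens"] * 2}
```

This only papers over the symptom (the response then stops for a different reason than max_tokens), but it keeps clients usable until a fix lands.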

First Bad Commit

I currently have no time to investigate.

Relevant log output

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 545 | processing task, is_child = 0
slot update_slots: id  0 | task 545 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 25
slot update_slots: id  0 | task 545 | n_past = 25, slot.prompt.tokens.size() = 568, seq_id = 0, pos_min = 567, n_swa = 1
slot update_slots: id  0 | task 545 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 545 | n_tokens = 0, memory_seq_rm [0, end)
slot init_sampler: id  0 | task 545 | init sampler, took 0.01 ms, tokens: text = 25, total = 25
slot update_slots: id  0 | task 545 | prompt processing done, n_tokens = 25, batch.n_tokens = 25
slot print_timing: id  0 | task 545 | 
prompt eval time =      18.49 ms /    25 tokens (    0.74 ms per token,  1351.86 tokens per second)
       eval time =      67.62 ms /    16 tokens (    4.23 ms per token,   236.60 tokens per second)
      total time =      86.12 ms /    41 tokens
slot      release: id  0 | task 545 | stop processing: n_tokens = 40, truncated = 0
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 545
srv  update_slots: all slots are idle
srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 59: ","type":"server_error"}}
srv  log_server_r: done request: POST /v1/messages 192.168.18.11 500
