Skip to content

Misc. bug: Speculative decoding only works once with /v1/chat/completions #19231

@easyfab

Description

@easyfab

Name and Version

version : latest commit 89f10ba

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

llama-server -m translategemma-27b-it.i1-IQ4_XS.gguf  --port 1234 --host 0.0.0.0 -c 2560  --jinja  -fit on  --temp 0.05 --top_p 1.0   --chat-template-kwargs '{"source_lang_code": "en","target_lang_code": "fr"}' --spec-type ngram-simple --draft-max 64 --draft-min 24  --spec-ngram-size-n 12 -ctk q8_0 --no-cache-prompt -cram 0

Problem description & steps to reproduce

Speculative Decoding only works once with the first request after it doesn't seems to work ( no more draft acceptance rate = ... ) and slower speed ?

srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://0.0.0.0:1234
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Generic
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 2304, n_keep = 0, task.n_tokens = 474
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 410, batch.n_tokens = 410, progress = 0.864979
slot update_slots: id  3 | task 0 | n_tokens = 410, memory_seq_rm [410, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 474, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 474, batch.n_tokens = 64
slot init_sampler: id  3 | task 0 | init sampler, took 0.06 ms, tokens: text = 474, total = 474
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 409, size = 127.530 MiB)
slot print_timing: id  3 | task 0 |
prompt eval time =     283.75 ms /   474 tokens (    0.60 ms per token,  1670.50 tokens per second)
       eval time =    4724.44 ms /   420 tokens (   11.25 ms per token,    88.90 tokens per second)
      total time =    5008.19 ms /   894 tokens
draft acceptance rate = 0.39054 (  223 accepted /   571 generated)
statistics ngram_simple: #calls = 196, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.118 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 893, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.44 200
srv  params_from_: Chat format: Generic
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.225 (> 0.100 thold), f_keep = 0.113
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 199 | processing task, is_child = 0
slot update_slots: id  3 | task 199 | new prompt, n_ctx_slot = 2304, n_keep = 0, task.n_tokens = 449
slot update_slots: id  3 | task 199 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 199 | prompt processing progress, n_tokens = 385, batch.n_tokens = 385, progress = 0.857461
slot update_slots: id  3 | task 199 | n_tokens = 385, memory_seq_rm [385, end)
slot update_slots: id  3 | task 199 | prompt processing progress, n_tokens = 449, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 199 | prompt done, n_tokens = 449, batch.n_tokens = 64
slot init_sampler: id  3 | task 199 | init sampler, took 0.05 ms, tokens: text = 449, total = 449
slot print_timing: id  3 | task 199 |
prompt eval time =     253.81 ms /   449 tokens (    0.57 ms per token,  1769.07 tokens per second)
       eval time =    7626.25 ms /   355 tokens (   21.48 ms per token,    46.55 tokens per second)
      total time =    7880.05 ms /   804 tokens
statistics ngram_simple: #calls = 550, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.149 ms
slot      release: id  3 | task 199 | stop processing: n_tokens = 803, truncated = 0
srv  update_slots: all slots are idle

First Bad Commit

No response

Relevant log output

Logs

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions