Closed
Description
Name and Version
version : latest commit 89f10ba
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
llama-server -m translategemma-27b-it.i1-IQ4_XS.gguf --port 1234 --host 0.0.0.0 -c 2560 --jinja -fit on --temp 0.05 --top_p 1.0 --chat-template-kwargs '{"source_lang_code": "en","target_lang_code": "fr"}' --spec-type ngram-simple --draft-max 64 --draft-min 24 --spec-ngram-size-n 12 -ctk q8_0 --no-cache-prompt -cram 0
Problem description & steps to reproduce
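Speculative decoding only works for the first request. On the second request it no longer seems to work: the log no longer prints a "draft acceptance rate = ..." line, the ngram_simple draft counters stay unchanged, and generation is slower (46.55 tokens per second instead of 88.90).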
srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://0.0.0.0:1234
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: Generic
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 2304, n_keep = 0, task.n_tokens = 474
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 410, batch.n_tokens = 410, progress = 0.864979
slot update_slots: id 3 | task 0 | n_tokens = 410, memory_seq_rm [410, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 474, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 474, batch.n_tokens = 64
slot init_sampler: id 3 | task 0 | init sampler, took 0.06 ms, tokens: text = 474, total = 474
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 409, size = 127.530 MiB)
slot print_timing: id 3 | task 0 |
prompt eval time = 283.75 ms / 474 tokens ( 0.60 ms per token, 1670.50 tokens per second)
eval time = 4724.44 ms / 420 tokens ( 11.25 ms per token, 88.90 tokens per second)
total time = 5008.19 ms / 894 tokens
draft acceptance rate = 0.39054 ( 223 accepted / 571 generated)
statistics ngram_simple: #calls = 196, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.118 ms
slot release: id 3 | task 0 | stop processing: n_tokens = 893, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 192.168.1.44 200
srv params_from_: Chat format: Generic
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.225 (> 0.100 thold), f_keep = 0.113
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 199 | processing task, is_child = 0
slot update_slots: id 3 | task 199 | new prompt, n_ctx_slot = 2304, n_keep = 0, task.n_tokens = 449
slot update_slots: id 3 | task 199 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 199 | prompt processing progress, n_tokens = 385, batch.n_tokens = 385, progress = 0.857461
slot update_slots: id 3 | task 199 | n_tokens = 385, memory_seq_rm [385, end)
slot update_slots: id 3 | task 199 | prompt processing progress, n_tokens = 449, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 3 | task 199 | prompt done, n_tokens = 449, batch.n_tokens = 64
slot init_sampler: id 3 | task 199 | init sampler, took 0.05 ms, tokens: text = 449, total = 449
slot print_timing: id 3 | task 199 |
prompt eval time = 253.81 ms / 449 tokens ( 0.57 ms per token, 1769.07 tokens per second)
eval time = 7626.25 ms / 355 tokens ( 21.48 ms per token, 46.55 tokens per second)
total time = 7880.05 ms / 804 tokens
statistics ngram_simple: #calls = 550, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.149 ms
slot release: id 3 | task 199 | stop processing: n_tokens = 803, truncated = 0
srv update_slots: all slots are idle
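To make the comparison between the two requests easier to read, here is a minimal sketch that parses the two "statistics ngram_simple" lines from the log above and prints the per-request difference. It assumes the counters are cumulative across requests, which is an inference from the log and not something confirmed by the code. The helper names (parse_stats, deltas) are hypothetical and not part of llama.cpp.

```python
import re

# Matches the "statistics <name>: ..." lines exactly as they appear in the log above.
STATS_RE = re.compile(
    r"statistics (?P<name>\w+): #calls = (?P<calls>\d+), "
    r"#gen drafts = (?P<gen_drafts>\d+), #acc drafts = (?P<acc_drafts>\d+), "
    r"#gen tokens = (?P<gen_tokens>\d+), #acc tokens = (?P<acc_tokens>\d+)"
)


def parse_stats(log_text):
    """Return one dict of integer counters per statistics line, in log order."""
    rows = []
    for m in STATS_RE.finditer(log_text):
        rows.append({k: int(v) for k, v in m.groupdict().items() if k != "name"})
    return rows


def deltas(rows):
    """Difference between consecutive snapshots.

    Assumption: the counters are cumulative, so the first snapshot is
    compared against zero.
    """
    out = []
    prev = {k: 0 for k in rows[0]} if rows else {}
    for row in rows:
        out.append({k: row[k] - prev[k] for k in row})
        prev = row
    return out


# The two statistics lines copied verbatim from the log (task 0 and task 199).
LOG = """\
statistics ngram_simple: #calls = 196, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.118 ms
statistics ngram_simple: #calls = 550, #gen drafts = 12, #acc drafts = 8, #gen tokens = 571, #acc tokens = 223, dur = 0.149 ms
"""

if __name__ == "__main__":
    for i, d in enumerate(deltas(parse_stats(LOG)), start=1):
        print(f"request {i}: {d}")
```

Under that assumption, the second request adds 354 calls to the drafter but 0 generated drafts and 0 accepted tokens, which matches the missing "draft acceptance rate" line and the lower eval speed.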
First Bad Commit
No response
Relevant log output
Logs