- fabricio@fabricio-x99:~/Documentos/gemma4MTP/llama-gemma4-mtp$ ./build-cuda/bin/llama-server -m "/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/Gemma4-31B-Q8_0.gguf" --host 0.0.0.0 --port 9090 -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --no-warmup
- 0.00.092.654 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
- 0.00.092.658 I device_info:
- 0.00.232.050 I - CUDA0 : NVIDIA GeForce RTX 3080 (20053 MiB, 19802 MiB free)
- 0.00.366.868 I - CUDA1 : NVIDIA GeForce RTX 3080 (20054 MiB, 19817 MiB free)
- 0.00.366.896 I - CPU : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (64143 MiB, 64143 MiB free)
- 0.00.367.029 I system_info: n_threads = 14 (n_threads_batch = 14) / 28 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
- 0.00.367.036 I srv main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
- 0.00.367.075 I srv init: running without SSL
- 0.00.367.138 I srv init: using 27 threads for HTTP server
- 0.00.367.297 I srv start: binding port with default address family
- 0.00.368.542 I srv main: loading model
- 0.00.368.551 I srv load_model: loading model '/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/Gemma4-31B-Q8_0.gguf'
- 0.00.368.606 I common_init_result: fitting params to device memory ...
- 0.00.368.606 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
- 0.01.698.387 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.01.698.897 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.01.720.469 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
- 0.07.029.012 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
- 0.07.162.438 I srv load_model: initializing slots, n_slots = 4
- 0.07.343.638 W common_speculative_init: no implementations specified for speculative decoding
- 0.07.343.643 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
- 0.07.343.648 I slot load_model: id 1 | task -1 | new slot, n_ctx = 32768
- 0.07.343.649 I slot load_model: id 2 | task -1 | new slot, n_ctx = 32768
- 0.07.343.649 I slot load_model: id 3 | task -1 | new slot, n_ctx = 32768
- 0.07.343.783 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
- 0.07.343.784 I srv load_model: use `--cache-ram 0` to disable the prompt cache
- 0.07.343.785 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
- 0.07.343.822 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
- 0.07.354.119 I init: chat template, example_format: '<|turn>system
- <|think|>
- You are a helpful assistant<turn|>
- <|turn>user
- Hello<turn|>
- <|turn>model
- Hi there<turn|>
- <|turn>user
- How are you?<turn|>
- <|turn>model
- '
- 0.07.355.342 I srv init: init: chat template, thinking = 1
- 0.07.355.370 I srv main: model loaded
- 0.07.355.374 I srv main: server is listening on http://0.0.0.0:9090
- 0.07.355.379 I srv update_slots: all slots are idle
- 0.11.928.624 I srv params_from_: Chat format: peg-gemma4
- 0.11.928.989 I slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
- 0.11.928.995 I srv get_availabl: updating prompt cache
- 0.11.929.001 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
- 0.11.929.007 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
- 0.11.929.009 I srv get_availabl: prompt cache update took 0.01 ms
- 0.11.929.093 I slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
- 0.17.271.755 I slot print_timing: id 3 | task 0 | n_decoded = 100, tg = 19.28 t/s
- 0.20.289.173 I slot print_timing: id 3 | task 0 | prompt eval time = 155.71 ms / 18 tokens ( 8.65 ms per token, 115.60 tokens per second)
- 0.20.289.176 I slot print_timing: id 3 | task 0 | eval time = 8204.35 ms / 158 tokens ( 51.93 ms per token, 19.26 tokens per second)
- 0.20.289.177 I slot print_timing: id 3 | task 0 | total time = 8360.06 ms / 176 tokens
- 0.20.289.181 I slot print_timing: id 3 | task 0 | graphs reused = 156
- 0.20.289.215 I slot release: id 3 | task 0 | stop processing: n_tokens = 175, truncated = 0
- 0.20.289.219 I srv update_slots: all slots are idle
- ^C0.28.426.038 I srv operator(): operator(): cleaning up before exit...
- fabricio@fabricio-x99:~/Documentos/gemma4MTP/llama-gemma4-mtp$ ./build-cuda/bin/llama-server -m "/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/Gemma4-31B-Q8_0.gguf" --model-draft "/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/mtp-gemma-4-31B-it.gguf" --host 0.0.0.0 --port 9090 -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --no-warmup
- 0.00.096.769 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
- 0.00.096.773 I device_info:
- 0.00.235.710 I - CUDA0 : NVIDIA GeForce RTX 3080 (20053 MiB, 19802 MiB free)
- 0.00.380.035 I - CUDA1 : NVIDIA GeForce RTX 3080 (20054 MiB, 19817 MiB free)
- 0.00.380.063 I - CPU : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (64143 MiB, 64143 MiB free)
- 0.00.380.177 I system_info: n_threads = 14 (n_threads_batch = 14) / 28 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
- 0.00.380.184 I srv main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
- 0.00.380.215 I srv init: running without SSL
- 0.00.380.278 I srv init: using 27 threads for HTTP server
- 0.00.380.424 I srv start: binding port with default address family
- 0.00.381.668 I srv main: loading model
- 0.00.381.671 I srv load_model: loading model '/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/Gemma4-31B-Q8_0.gguf'
- 0.00.381.749 I common_init_result: fitting params to device memory ...
- 0.00.381.750 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
- 0.01.709.180 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.01.709.694 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.01.731.546 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
- 0.07.057.952 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
- 0.07.192.015 I srv load_model: loading draft model '/mnt/MODELOS TEXTUAIS/models/gemma4-mtp/mtp-gemma-4-31B-it.gguf'
- 0.07.752.235 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.07.752.705 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
- 0.07.774.400 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
- 0.07.889.563 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
- 0.08.089.008 I srv load_model: initializing slots, n_slots = 4
- 0.08.225.333 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
- 0.08.225.341 I common_speculative_impl_draft_mtp: - n_max=2, n_min=0, p_min=0.00, n_embd=5376
- 0.08.225.343 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
- 0.08.225.666 I srv load_model: speculative decoding context initialized
- 0.08.225.669 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
- 0.08.225.674 I slot load_model: id 1 | task -1 | new slot, n_ctx = 32768
- 0.08.225.675 I slot load_model: id 2 | task -1 | new slot, n_ctx = 32768
- 0.08.225.675 I slot load_model: id 3 | task -1 | new slot, n_ctx = 32768
- 0.08.225.779 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
- 0.08.225.781 I srv load_model: use `--cache-ram 0` to disable the prompt cache
- 0.08.225.781 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
- 0.08.225.820 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
- 0.08.236.282 I init: chat template, example_format: '<|turn>system
- <|think|>
- You are a helpful assistant<turn|>
- <|turn>user
- Hello<turn|>
- <|turn>model
- Hi there<turn|>
- <|turn>user
- How are you?<turn|>
- <|turn>model
- '
- 0.08.237.455 I srv init: init: chat template, thinking = 1
- 0.08.237.483 I srv main: model loaded
- 0.08.237.488 I srv main: server is listening on http://0.0.0.0:9090
- 0.08.237.498 I srv update_slots: all slots are idle
- 0.13.005.287 I srv params_from_: Chat format: peg-gemma4
- 0.13.005.649 I slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
- 0.13.005.655 I srv get_availabl: updating prompt cache
- 0.13.005.661 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
- 0.13.005.667 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
- 0.13.005.669 I srv get_availabl: prompt cache update took 0.01 ms
- 0.13.005.751 I slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
- 0.23.879.699 I slot print_timing: id 3 | task 0 | n_decoded = 100, tg = 9.38 t/s
- 0.26.984.755 I slot print_timing: id 3 | task 0 | n_decoded = 129, tg = 9.37 t/s
- 0.30.089.399 I slot print_timing: id 3 | task 0 | n_decoded = 158, tg = 9.37 t/s
- 0.33.089.489 I slot print_timing: id 3 | task 0 | n_decoded = 186, tg = 9.36 t/s
- 0.36.102.083 I slot print_timing: id 3 | task 0 | n_decoded = 214, tg = 9.35 t/s
- 0.39.129.962 I slot print_timing: id 3 | task 0 | n_decoded = 242, tg = 9.34 t/s
- 0.42.158.554 I slot print_timing: id 3 | task 0 | n_decoded = 270, tg = 9.33 t/s
- 0.45.192.414 I slot print_timing: id 3 | task 0 | n_decoded = 298, tg = 9.32 t/s
- 0.48.232.309 I slot print_timing: id 3 | task 0 | n_decoded = 326, tg = 9.31 t/s
- 0.51.271.496 I slot print_timing: id 3 | task 0 | n_decoded = 354, tg = 9.30 t/s
- 0.54.310.629 I slot print_timing: id 3 | task 0 | n_decoded = 382, tg = 9.30 t/s
- 0.57.350.243 I slot print_timing: id 3 | task 0 | n_decoded = 410, tg = 9.29 t/s
- 1.00.390.116 I slot print_timing: id 3 | task 0 | n_decoded = 438, tg = 9.29 t/s
- 1.03.430.175 I slot print_timing: id 3 | task 0 | n_decoded = 466, tg = 9.28 t/s
- 1.06.492.514 I slot print_timing: id 3 | task 0 | n_decoded = 494, tg = 9.27 t/s
- 1.09.555.953 I slot print_timing: id 3 | task 0 | n_decoded = 522, tg = 9.27 t/s
- 1.12.179.038 I slot print_timing: id 3 | task 0 | prompt eval time = 216.02 ms / 43 tokens ( 5.02 ms per token, 199.06 tokens per second)
- 1.12.179.042 I slot print_timing: id 3 | task 0 | eval time = 58957.18 ms / 546 tokens ( 107.98 ms per token, 9.26 tokens per second)
- 1.12.179.043 I slot print_timing: id 3 | task 0 | total time = 59173.20 ms / 589 tokens
- 1.12.179.048 I slot print_timing: id 3 | task 0 | graphs reused = 542
- 1.12.179.049 I slot print_timing: id 3 | task 0 | draft acceptance = 0.00000 ( 0 accepted / 1090 generated)
- 1.12.179.084 I statistics draft-mtp: #calls(b,g,a) = 1 545 545, #gen drafts = 545, #acc drafts = 0, #gen tokens = 1090, #acc tokens = 0, dur(b,g,a) = 0.003, 28369.116, 0.809 ms
- 1.12.179.116 I slot release: id 3 | task 0 | stop processing: n_tokens = 588, truncated = 0
- 1.12.179.125 I srv update_slots: all slots are idle
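
The two runs above can be compared directly from their `print_timing` lines. The following is a minimal sketch, assuming the log line formats shown in this paste; the function names `tokens_per_second` and `acceptance` are hypothetical helpers written for illustration and are not part of llama.cpp. It recomputes the decode throughput of each run from the reported eval time and token count, then reads the draft acceptance counters from the second run.

```python
import re

# Lines copied verbatim from the two runs in this paste
# (baseline first, then the --spec-type draft-mtp run).
BASELINE_EVAL = "eval time = 8204.35 ms / 158 tokens ( 51.93 ms per token, 19.26 tokens per second)"
MTP_EVAL = "eval time = 58957.18 ms / 546 tokens ( 107.98 ms per token, 9.26 tokens per second)"
MTP_ACCEPT = "draft acceptance = 0.00000 ( 0 accepted / 1090 generated)"

# The lookbehind skips "prompt eval time" lines, which share the same suffix.
EVAL_RE = re.compile(r"(?<!prompt )eval time =\s*([\d.]+) ms /\s*(\d+) tokens")
ACCEPT_RE = re.compile(
    r"draft acceptance =\s*[\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)"
)


def tokens_per_second(line):
    """Recompute decode throughput from an 'eval time' log line."""
    match = EVAL_RE.search(line)
    if match is None:
        raise ValueError("not an eval time line: " + line)
    elapsed_ms = float(match.group(1))
    n_tokens = int(match.group(2))
    return n_tokens / (elapsed_ms / 1000.0)


def acceptance(line):
    """Return (accepted, generated, rate) from a 'draft acceptance' log line."""
    match = ACCEPT_RE.search(line)
    if match is None:
        raise ValueError("not a draft acceptance line: " + line)
    accepted = int(match.group(1))
    generated = int(match.group(2))
    rate = accepted / generated if generated else 0.0
    return accepted, generated, rate


if __name__ == "__main__":
    base = tokens_per_second(BASELINE_EVAL)
    mtp = tokens_per_second(MTP_EVAL)
    accepted, generated, rate = acceptance(MTP_ACCEPT)
    print(f"baseline:  {base:.2f} t/s")
    print(f"draft-mtp: {mtp:.2f} t/s ({mtp / base:.2f}x baseline)")
    print(f"drafts:    {accepted}/{generated} accepted ({rate:.1%})")
```

Recomputing from the raw time and token count, rather than trusting the printed rate, keeps the comparison valid even if only part of a line is pasted. On the numbers in this log, the draft-mtp run decodes at less than half the baseline rate while none of its 1090 drafted tokens were accepted.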