
kv-cache : add SWA support #13194

Merged
ggerganov merged 16 commits into master from gg/swa on May 20, 2025

Conversation

@ggerganov
Member

ggerganov commented Apr 29, 2025

Overview

Add class llama_kv_cache_unified_iswa for interleaved SWA attention support.

The implementation internally uses 2 instances of the existing llama_kv_cache_unified - one for the non-SWA and one for the SWA layers of the model. To achieve that, the llama_kv_cache_unified implementation is updated to be able to cache a subset of the model's layers (instead of always caching all layers, as it does on master). The 2 internal caches behave almost identically, with 2 main differences:

  • The SWA cache is much smaller
  • The SWA cache automatically "forgets/prunes" old tokens upon successful commit (i.e. successful batch decode)

The size of the SWA cache is computed as:

PAD(n_swa*n_seq_max + n_batch)

This way we can store the cache data for the last n_swa tokens for all sequences and we also have room to evaluate a new batch of tokens with size up to n_batch.
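
As a rough sketch (the padding helper, its granularity of 256, and the function names here are illustrative assumptions, not the actual llama.cpp code), the computation amounts to:

```cpp
#include <cassert>
#include <cstdint>

// Round n up to a multiple of pad (GGML_PAD-style; the granularity is
// an assumption here - the real value depends on the backend).
static uint32_t pad_to(uint32_t n, uint32_t pad) {
    return ((n + pad - 1) / pad) * pad;
}

// Size of the SWA cache: room for the last n_swa tokens of every
// sequence, plus space to evaluate one new batch of tokens.
static uint32_t swa_cache_size(uint32_t n_swa, uint32_t n_seq_max,
                               uint32_t n_batch, uint32_t pad = 256) {
    return pad_to(n_swa*n_seq_max + n_batch, pad);
}
```

For example, with n_swa = 1024, a single sequence, and n_batch = 2048, the SWA cache needs only 3072 padded cells, independent of the full context size.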

Note that advanced cache operations such as removing tokens or shifting their positions are not possible when using the SWA cache, because token information is lost when the window slides. For such cases, we can fall back to the old implementation by expanding the SWA cache size to the full context and disabling the SWA token pruning. This of course leads to more memory usage. See the swa_full flag for more info.

The new llama_kv_cache_unified_iswa can be used for non-SWA models with n_swa = n_ctx_train.


Main changes

  • Move KV cache store and view logic from llama-graph to llama-kv-cache
  • Move KV cache mask creation logic from llama-graph to llama-kv-cache
  • The inputs to build_attn_mha() are no longer permuted
  • The QKV self-attention code is now more harmonious:
      const llama_kv_cache_unified * kv_self = static_cast<const llama_kv_cache_unified *>(memory);
    
      // store to KV cache
      {
          ggml_build_forward_expand(gf, kv_self->cpy_k(ctx0, k_cur, il));
          ggml_build_forward_expand(gf, kv_self->cpy_v(ctx0, v_cur, il));
      }
    
      const auto & kq_mask = inp->get_kq_mask();
    
      ggml_tensor * q = q_cur;
      ggml_tensor * k = kv_self->get_k(ctx0, il);
      ggml_tensor * v = kv_self->get_v(ctx0, il);
    
      ggml_tensor * cur = build_attn_mha(gf, q, k, v, kq_b, kq_mask, v_mla, kq_scale);
      cb(cur, "kqv_out", il);
  • Add enum hparams.swa_type to support chunked and non-chunked SWA (remove hparams.n_attn_chunk)
  • Add class llama_kv_cache_unified_iswa - new iSWA cache that internally utilizes 2 standard llama_kv_cache_unified instances
  • Make the llama_kv_cache_unified implementation more private and polish the interface
  • Move the Llama 4 build function to a new llm_build_llama_iswa()
  • llama-server now respects llama_kv_self_can_shift(ctx)
  • llama_decode now attempts a defrag if it fails to fit the input batch in the cache
  • llama_decode now correctly restores the cache state in all cases
  • Examples can fall back to a full-size SWA cache with --swa-full

API changes

  • Update llama_context_params - add bool swa_full

TODO

  • Cut-off old SWA tokens in llama_kv_cache_unified_iswa::commit()
  • Pass n_seq_max and n_batch to the KV cache and utilize it to determine SWA cache size
  • Allow KV shift when SWA window size is big enough
  • Add limits to batch size based on SWA window
  • llama-server check for llama_kv_self_can_shift
  • Add context parameter for switching between small and large SWA cache (kv-cache : add SWA support #13194 (comment))

Testing

Any help with testing the following scenarios and reporting the results is highly appreciated:

  • Llama 4
  • Phi 3
  • Gemma 2
  • Gemma 3
  • Cohere 2
  • Multi-user
  • Context shift
  • Context reuse
  • Speculative decoding?

Next PRs

  • Split KV cache implementations in separate source files
  • Remove llama_kv_cache_view API (not useful, can be replaced with internal debugging functions)
  • Add struct kv_cells and simplify logic with modifying the cells
  • Refactor the llama_kv_cache logic to allow SWA cache with size n_swa + n_ubatch
  • Set defrag threshold to 0.0 by default
  • Make llama_decode distinguish the return code when we are sure that even after defrag there is no space available
  • Update experimental status of llama_context_params
  • Avoid llama_kv_cache::set_full()
  • Rework llama_kv_cache to not maintain the batching state (kv-cache : add SWA support #13194 (review))
  • Consider template <bool SWA> llm_build_llama()

Outdated (original description)

This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

The reason we cannot do context caching with SWA enabled is because when the window slides, we "forget" the old KV stuff and there is no way to recover it without recomputing it. This means, no prefix cache in llama-server (ok, just last-prefix caching works), no context shift, no context reuse, etc. So I am having some doubts if this is really worth supporting.

Any thoughts?

@slaren
Member

slaren commented Apr 29, 2025

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

@ngxson
Collaborator

ngxson commented Apr 29, 2025

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

Yes, this is what I have been thinking about for months now. There is no better solution than to disable context caching in this case.

An alternative solution is to let the user choose one of the two: either a proper SWA cache (good for memory) or a full-size allocation (good for reusing the cache).

So I am having some doubts if this is really worth supporting.

I'm feeling 50/50 here. One of the biggest use cases would be processing a large and diverse set of documents locally. In that case, the user may never reuse the cache because each new request is a new document.

@ggerganov
Member Author

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those that have position pos < pos_max(seq_id) - n_swa. Note that such tokens are only pruned from the SWA cells, while they remain in the non-SWA cells. When constructing the KQ mask for the graph, we use the non-SWA cells to construct the kq_mask and the SWA cells to construct the kq_mask_swa.

The rest of the logic is the same - it just operates on both sets of cells. For example, find_slot searches in both the non-SWA and SWA cells.
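
A minimal standalone sketch of that pruning pass (the cell record and helper name are hypothetical; the real llama_kv_cache_unified cells hold more state than this):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

// Hypothetical cell record: one cached token position per sequence.
struct kv_cell {
    llama_pos    pos;
    llama_seq_id seq_id;
};

// On a successful commit, drop SWA cells whose position slid out of the
// window, i.e. pos < pos_max(seq_id) - n_swa. The non-SWA cells are left
// untouched, so they still cover the full history.
static void prune_swa(std::vector<kv_cell> & cells, llama_pos n_swa) {
    std::vector<llama_pos> pos_max; // per-sequence maximum position
    for (const auto & c : cells) {
        if ((size_t) c.seq_id >= pos_max.size()) {
            pos_max.resize(c.seq_id + 1, -1);
        }
        pos_max[c.seq_id] = std::max(pos_max[c.seq_id], c.pos);
    }
    cells.erase(std::remove_if(cells.begin(), cells.end(),
        [&](const kv_cell & c) { return c.pos < pos_max[c.seq_id] - n_swa; }),
        cells.end());
}
```

E.g. with positions 0..3 cached for one sequence and n_swa = 2, the pass removes position 0 and keeps 1..3.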

@JohannesGaessler
Collaborator

My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput.

@ggerganov
Copy link
Member Author

Continuing to think about the logic for when to discard tokens from the cache - it's indeed tricky and not very clear how to do. For example, when doing speculative decoding, we can submit a draft batch with D tokens to the target model. If we apply the pruning logic from my previous comment strictly, this would "forget" the D-1 oldest tokens in the SWA layers, which would be problematic if the draft gets rejected. This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.
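
A toy numeric check of the two rules (the function names and the concrete numbers are illustrative, not from the PR): with n_swa = 1024 and a draft of 8 tokens taking pos_max to 2000, the strict rule prunes the token at position 969 (969 < 2000 - 1024 = 976); if the draft is then rejected and pos_max rolls back to 1992, that token is back inside the window (969 >= 1992 - 1024 = 968) but has already been lost. The relaxed rule keeps it.

```cpp
#include <cassert>
#include <cstdint>

using llama_pos = int32_t;

// Strict rule from the earlier comment: prune pos < pos_max - n_swa.
static bool pruned_strict(llama_pos pos, llama_pos pos_max, llama_pos n_swa) {
    return pos < pos_max - n_swa;
}

// Relaxed rule with extra room: keep up to n_batch additional old tokens,
// so rolling back a rejected draft batch cannot leave a hole in the window.
static bool pruned_relaxed(llama_pos pos, llama_pos pos_max,
                           llama_pos n_swa, llama_pos n_batch) {
    return pos < pos_max - n_swa - n_batch;
}
```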

ggerganov force-pushed the gg/llama-kv-cache-v6 branch from e37f112 to 7e4b545 on April 30, 2025 07:22
@ymcki
Contributor

ymcki commented Apr 30, 2025

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

I second slaren's opinion. As far as I know, vLLM also doesn't support iSWA, while HF transformers and Ollama do. vLLM is geared toward the multi-user server use case; I suppose that's why they don't support it.

Ideally, it should be implemented as a switch to let user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.

@ngxson
Collaborator

ngxson commented Apr 30, 2025

This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.

Yes, I was thinking about this too. I think it can be a bit complicated to manage this case, but it's totally possible.

We can let the user specify how many tokens are allocated in the sliding layers. For example, given n_swa=512, if llama_context is created with n_ctx=4096 and n_ctx_swa=1024, this will allow the user to roll back as far as n_past - (1024 - 512)

We can further set n_ctx_swa = n_ctx * scale by default to make it transparent to the end user, with e.g. scale=0.5 by default. If scale=-1, then n_ctx_swa=n_swa

And finally, we may need to add an API to return the furthest n_past that the user can roll back to, maybe something like llama_kv_self_get_minimum_pos?
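
To make the proposal concrete, here is a sketch (the function name and the API itself are hypothetical; nothing was merged in this form):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using llama_pos = int32_t;

// Hypothetical: with n_ctx_swa slots allocated for a window of n_swa,
// the extra (n_ctx_swa - n_swa) slots retain tokens that already slid
// out of the window, so rollback is possible down to this position.
static llama_pos min_rollback_pos(llama_pos n_past, llama_pos n_ctx_swa,
                                  llama_pos n_swa) {
    return std::max<llama_pos>(0, n_past - (n_ctx_swa - n_swa));
}
```

With n_swa=512, n_ctx_swa=1024, and n_past=4096, rollback would be possible down to position 3584.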

@isaac-mcfadyen
Contributor

isaac-mcfadyen commented Apr 30, 2025

I'd +1 the ability to allow the user to switch.

Some use-cases benefit greatly from the prefix caching (example: on Metal systems with 48GB of RAM/VRAM, where pp is much slower than non-Metal pp and we have plenty of VRAM anyway) so allowing the user to choose would be optimal.

@ExtReMLapin
Contributor

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

Is llama.cpp's single-user mode the most used case because that's what the user base prefers, or is it like that because the server performance goes down a lot with more than 3 users? (#10860)

We are really thankful for all the work you main contributors do on this project, but please do not fall into this "self-fulfilling prophecy" trap.

@aviallon
Contributor

aviallon commented May 1, 2025

I personally use llama.cpp for server use (with multiple users).
I wonder if we could do something hybrid between iSWA and what is currently done.
I wonder if partial KV cache offload could work, with iSWA on the accelerator and a slower full cache in RAM.

ggerganov force-pushed the gg/llama-kv-cache-v6 branch 2 times, most recently from 58115a2 to 7e79a42 on May 2, 2025 13:02
Base automatically changed from gg/llama-kv-cache-v6 to master May 2, 2025 14:48
@Dampfinchen

According to the Gemma 3 paper, interleaved Sliding Window Attention reduces KV cache memory usage by 1/5, so it would be much easier to run, as right now the KV cache size is much heavier than that of comparable models.

If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per-use-case basis. I think for cases where you use a RAG/vector DB it would prove very useful, as prompt caching does not work when the beginning of the context changes anyway. I would personally agree with Johannes here: faster token generation thanks to SWA would be more useful for me as well, since I'm using a vector DB.

So for the use cases short prompts/RAG it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably make it faster overall compared to SWA and no prompt cache. Overall, I think having the option would be a great addition to llama.cpp.

If it helps, Ollama implemented iSWA support for Gemma 3. Since the project is pretty similar to llama.cpp, perhaps it's useful for getting a rough idea of how to implement it (although Ollama is written in a different language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190

I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it?

ggerganov force-pushed the gg/swa branch 3 times, most recently from 1c69466 to 1e10743 on May 9, 2025 12:15
@LostRuins
Collaborator

Some people recently mentioned concerns with this PR - I think caching is quite important for the subset of users who don't have GPUs and run purely on the CPU.

They are fine spending initial minutes or more ingesting a large initial prompt which they then reuse for many future turns - the generation speed itself is usable, but the inability to cache would be crippling for such users.

@ggerganov
Member Author

Both the old cache (i.e. more memory usage, but with advanced caching supported) and the new cache (less memory, with just last-prefix caching) will be supported. Still figuring out the implementation details - it will likely be controlled via a flag or a parameter.

@ggerganov
Member Author

Thanks for all the feedback in this discussion. This branch should be ready for testing - I've listed some important use cases that need to be exercised. If something does not work, please let me know - at the moment I've done very little testing, so there could be some issues remaining.

I will soon write up a detailed summary of the changes and the approach taken. And after that will add some comments to the code and open the PR for review.

Regarding the parameter for controlling the size of the SWA cache - for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache - likely thanks to the data in the non-SWA cache. So I am thinking about giving this approach a try because it keeps the UX simple (i.e. we won't have to add new parameter and handle the use cases where context editing is not possible). If we determine that this breaks some important use cases, we can add the parameter - the libllama change is simple and the behavior would basically fallback to what currently happens on master.

@ExtReMLapin
Contributor

ExtReMLapin commented May 11, 2025

To people who have the bandwidth to test models, FYI Cohere 2 arch includes R7B which is much smaller than Command-A

@andportnoy

for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache

Does this mean in the current implementation the model isn't executed correctly?

@andportnoy

FWIW, Gemma 3 worked better for me on main with Q8 cache quantization than on this branch + unquantized kv cache.

@ggerganov
Member Author

ggerganov commented May 11, 2025

@andportnoy It's evaluated correctly, as long as you don't use context shift, cache reuse or branching from old states. Do you do any of that in your tests? Can you provide a repro?

Edit: Also don't change 2 things at the same time when testing. Use the same KV cache type, so we can rule out differences that are not relevant to the changes in this branch.

@luiznpi

luiznpi commented Sep 1, 2025

Can't get an efficient K cache type for medgemma-4b-it-Q4_K_M; Gemma 3 Q4_K_M works fine with type 1.

@kj-c0d3s

kj-c0d3s commented Feb 8, 2026

Using Qwen 3 Coder Next: with any quant I use, I am getting this error: "forcing full prompt re-processing due to lack of cache data", as shown below. Reloading 50/60/70/150k of context over and over is quite miserable. I'm on the latest build with a freshly downloaded quant just in case, as of today (2/8/26). Any guidance or insight would be appreciated. The error links to this thread, but I have not found specific guidance.

slot print_timing: id 3 | task 6556 | prompt eval time = 95123.58 ms / 54248 tokens ( 1.75 ms per token, 570.29 tokens per second)
          eval time = 15815.35 ms / 666 tokens ( 23.75 ms per token, 42.11 tokens per second)
         total time = 110938.94 ms / 54914 tokens
slot release: id 3 | task 6556 | stop processing: n_tokens = 54913, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = 99553296262
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 53502, total state size = 1329.942 MiB
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.001
srv update: - cache size limit reached, removing oldest entry (size = 285.055 MiB)
srv update: - cache size limit reached, removing oldest entry (size = 840.596 MiB)
srv update: - cache size limit reached, removing oldest entry (size = 1335.463 MiB)
srv update: - cache state: 5 prompts, 6876.842 MiB (limits: 8192.000 MiB, 262144 tokens, 311062 est)
srv update: - prompt 0x5b0377f1a0f0: 50773 tokens, checkpoints: 1, 1341.325 MiB
srv update: - prompt 0x5b036f03bf40: 51802 tokens, checkpoints: 1, 1365.454 MiB
srv update: - prompt 0x5b0375648cd0: 52203 tokens, checkpoints: 1, 1374.857 MiB
srv update: - prompt 0x5b0376a89180: 52844 tokens, checkpoints: 1, 1389.888 MiB
srv update: - prompt 0x5b0378b94380: 53502 tokens, checkpoints: 1, 1405.317 MiB
srv get_availabl: prompt cache update took 1473.25 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 2 | task 7250 | processing task, is_child = 0
slot update_slots: id 2 | task 7250 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 54953
slot update_slots: id 2 | task 7250 | n_past = 36, slot.prompt.tokens.size() = 53502, seq_id = 2, pos_min = 53501, n_swa = 1
slot update_slots: id 2 | task 7250 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 2 | task 7250 | erased invalidated context checkpoint (pos_min = 52818, pos_max = 52818, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 2 | task 7250 | n_tokens = 0, memory_seq_rm [0, end)

@ddh0
Contributor

ddh0 commented Feb 8, 2026

@kj-c0d3s please try #19408

Edit: Nevermind, if you are on the latest master then you already have that fix. Maybe open a new issue?

@johnlovesgoats

Using Qwen 3 Coder Next: Any quant I use, I am getting this error: forcing full prompt re-processing due to lack of cache data [...]

I'm getting this error as well. Did you get a fix?

@windswand

windswand commented Feb 18, 2026

Running Unsloth's Qwen3.5-397B-A17B Q6_K_XL GGUF on the latest llama.cpp (with pwilkin/autoparser tool-calling fixes), and it throws errors directing me to this issue. I'm running the Claude CLI via LiteLLM into llama.cpp.

This is the error: forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

Full log:

> build/bin/llama-server --host 0.0.0.0 --port 8080 -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/59c9e42d5a7a6edc5cc7fb35cb8237b4a7cc86cd/UD-Q6_K_XL/Qwen3.5-397B-A17B-UD-Q6_K_XL-00001-of-00008.gguf -ngl 999 --threads 127 -fa on -c 196608 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00 --chat-template-kwargs "{\"enable_thinking\": false}"
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 2: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8101 (519e1f1f3) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 127, n_threads_batch = 127, total_threads = 256

system_info: n_threads = 127 (n_threads_batch = 127) / 256 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 255 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/womble/.cache/huggingface/hub/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/59c9e42d5a7a6edc5cc7fb35cb8237b4a7cc86cd/UD-Q6_K_XL/Qwen3.5-397B-A17B-UD-Q6_K_XL-00001-of-00008.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97249 total,  87086 used,   9403 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97249 total,  82788 used,  13701 free vs. target of   1024
llama_params_fit_impl:   - CUDA2 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97249 total,  79982 used,  16523 free vs. target of   1024
llama_params_fit_impl:   - CUDA3 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97249 total,  80341 used,  16181 free vs. target of   1024
llama_params_fit_impl: projected to use 330198 MiB of device memory vs. 386009 MiB of free device memory
llama_params_fit_impl: targets for free memory can be met on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.70 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition) (0000:01:00.0) - 96689 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition) (0000:21:00.0) - 96689 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition) (0000:41:00.0) - 96689 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition) (0000:c1:00.0) - 96689 MiB free
llama_model_loader: additional 7 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1098 tensors from /home/womble/.cache/huggingface/hub/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/59c9e42d5a7a6edc5cc7fb35cb8237b4a7cc86cd/UD-Q6_K_XL/Qwen3.5-397B-A17B-UD-Q6_K_XL-00001-of-00008.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3.5-397B-A17B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3.5-397B-A17B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 397B-A17B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen3.5 397B A17B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  14:                      qwen35moe.block_count u32              = 60
llama_model_loader: - kv  15:                   qwen35moe.context_length u32              = 262144
llama_model_loader: - kv  16:                 qwen35moe.embedding_length u32              = 4096
llama_model_loader: - kv  17:             qwen35moe.attention.head_count u32              = 32
llama_model_loader: - kv  18:          qwen35moe.attention.head_count_kv u32              = 2
llama_model_loader: - kv  19:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  20:                   qwen35moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  21: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                     qwen35moe.expert_count u32              = 512
llama_model_loader: - kv  23:                qwen35moe.expert_used_count u32              = 10
llama_model_loader: - kv  24:             qwen35moe.attention.key_length u32              = 256
llama_model_loader: - kv  25:           qwen35moe.attention.value_length u32              = 256
llama_model_loader: - kv  26:       qwen35moe.expert_feed_forward_length u32              = 1024
llama_model_loader: - kv  27: qwen35moe.expert_shared_feed_forward_length u32              = 1024
llama_model_loader: - kv  28:                  qwen35moe.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  29:                   qwen35moe.ssm.state_size u32              = 128
llama_model_loader: - kv  30:                  qwen35moe.ssm.group_count u32              = 16
llama_model_loader: - kv  31:               qwen35moe.ssm.time_step_rank u32              = 64
llama_model_loader: - kv  32:                   qwen35moe.ssm.inner_size u32              = 8192
llama_model_loader: - kv  33:          qwen35moe.full_attention_interval u32              = 4
llama_model_loader: - kv  34:             qwen35moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 18
llama_model_loader: - kv  45:                                   split.no u16              = 0
llama_model_loader: - kv  46:                        split.tensors.count i32              = 1098
llama_model_loader: - kv  47:                                split.count u16              = 8
llama_model_loader: - type  f32:  451 tensors
llama_model_loader: - type q8_0:  105 tensors
llama_model_loader: - type q6_K:  542 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 305.62 GiB (6.62 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35moe
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 16
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 0
print_info: n_expert              = 512
print_info: n_expert_used         = 10
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 8192
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 64
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = ?B
print_info: model params          = 396.35 B
print_info: general.name          = Qwen3.5-397B-A17B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1030.62 MiB
load_tensors:        CUDA0 model buffer size = 83382.71 MiB
load_tensors:        CUDA1 model buffer size = 77225.60 MiB
load_tensors:        CUDA2 model buffer size = 77215.03 MiB
load_tensors:        CUDA3 model buffer size = 74104.63 MiB
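A quick sanity check on the buffer sizes above: the per-device model buffers should sum to roughly the reported file size (305.62 GiB), and dividing those bytes by the 396.35 B parameter count should reproduce the 6.62 BPW figure. All values are copied verbatim from the log lines above; nothing here comes from the server source.

```python
# Per-device model buffer sizes (MiB), copied from the log above.
buffers_mib = [1030.62, 83382.71, 77225.60, 77215.03, 74104.63]

total_gib = sum(buffers_mib) / 1024
bpw = sum(buffers_mib) * 1024**2 * 8 / 396.35e9  # bits per weight

print(f"{total_gib:.2f} GiB, {bpw:.2f} BPW")  # ~305.62 GiB, ~6.62 BPW
```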
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 196608
llama_context: n_ctx_seq     = 196608
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (196608) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     3.79 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1536.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  1152.00 MiB
llama_kv_cache:      CUDA2 KV buffer size =  1536.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =  1536.00 MiB
llama_kv_cache: size = 5760.00 MiB (196608 cells,  15 layers,  4/1 seqs), K (f16): 2880.00 MiB, V (f16): 2880.00 MiB
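The reported KV cache size can be reconstructed from the print_info values above: with full_attention_interval = 4, only 15 of the 60 layers use full attention and get KV cells, each token storing n_embd_k_gqa = n_head_kv * n_embd_head_k = 2 * 256 = 512 values for K (and the same for V) in f16. A minimal check:

```python
# Reconstruct the "K (f16): 2880.00 MiB, V (f16): 2880.00 MiB" figures.
n_cells   = 196608      # n_ctx (unified cache shared by 4 seqs)
n_layers  = 15          # full-attention layers only (60 / interval 4)
n_embd_kv = 2 * 256     # n_head_kv * n_embd_head_k
bytes_f16 = 2

k_mib = n_cells * n_layers * n_embd_kv * bytes_f16 / 1024**2
print(k_mib, 2 * k_mib)  # 2880.0 MiB for K, 5760.0 MiB for K+V
```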
llama_memory_recurrent:      CUDA0 RS buffer size =   198.75 MiB
llama_memory_recurrent:      CUDA1 RS buffer size =   198.75 MiB
llama_memory_recurrent:      CUDA2 RS buffer size =   182.19 MiB
llama_memory_recurrent:      CUDA3 RS buffer size =   165.62 MiB
llama_memory_recurrent: size =  745.31 MiB (     4 cells,  60 layers,  4 seqs), R (f32):   25.31 MiB, S (f32):  720.00 MiB
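The recurrent buffer breaks down the same way: the 45 linear-attention layers (60 minus the 15 full-attention ones) each keep an SSM state S and a conv state R per sequence in f32. The S term follows directly from ssm_d_inner * ssm_d_state; the R layout below, (d_conv - 1) * (d_inner + 2 * n_group * d_state), is an assumption chosen because it reproduces the logged 25.31 MiB, not a formula taken from the source.

```python
# Reconstruct "R (f32): 25.31 MiB, S (f32): 720.00 MiB" from print_info.
n_layers_rec = 45       # 60 layers - 15 full-attention layers
n_seqs       = 4
d_inner, d_state, n_group, d_conv = 8192, 128, 16, 4
bytes_f32    = 4

s_mib = n_layers_rec * n_seqs * d_inner * d_state * bytes_f32 / 1024**2
r_mib = (n_layers_rec * n_seqs * (d_conv - 1)
         * (d_inner + 2 * n_group * d_state) * bytes_f32 / 1024**2)
print(s_mib, r_mib)     # 720.0 and ~25.31 MiB, total ~745.31 MiB
```

A single-sequence snapshot is then (720.00 + 25.31) / 4 ≈ 186.33 MiB, which lines up with the `created context checkpoint ... size = 186.329 MiB` entries later in the log.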
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =  1968.67 MiB
sched_reserve:      CUDA1 compute buffer size =  4179.75 MiB
sched_reserve:      CUDA2 compute buffer size =  1017.38 MiB
sched_reserve:      CUDA3 compute buffer size =  4503.63 MiB
sched_reserve:  CUDA_Host compute buffer size =  1552.08 MiB
sched_reserve: graph nodes  = 11279 (with bs=512), 7094 (with bs=1)
sched_reserve: graph splits = 9
sched_reserve: reserve took 459.56 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context
slot   load_model: id  0 | task -1 | new slot, n_ctx = 196608
slot   load_model: id  1 | task -1 | new slot, n_ctx = 196608
slot   load_model: id  2 | task -1 | new slot, n_ctx = 196608
slot   load_model: id  3 | task -1 | new slot, n_ctx = 196608
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 428
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 428, batch.n_tokens = 428, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 428, batch.n_tokens = 428
slot init_sampler: id  3 | task 0 | init sampler, took 0.12 ms, tokens: text = 428, total = 428
slot print_timing: id  3 | task 0 |
prompt eval time =     595.81 ms /   428 tokens (    1.39 ms per token,   718.35 tokens per second)
       eval time =     242.43 ms /    15 tokens (   16.16 ms per token,    61.87 tokens per second)
      total time =     838.23 ms /   443 tokens
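The per-token and tokens-per-second figures in each print_timing block are two views of the same ratio; reproducing the first block above as a check:

```python
# Values copied from the "task 0" timing block above.
prompt_ms, prompt_tokens = 595.81, 428
eval_ms,   eval_tokens   = 242.43, 15

print(prompt_ms / prompt_tokens, prompt_tokens / prompt_ms * 1000)  # ~1.39 ms/token, ~718.35 t/s
print(eval_ms / eval_tokens,     eval_tokens / eval_ms * 1000)      # ~16.16 ms/token, ~61.87 t/s
```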
slot      release: id  3 | task 0 | stop processing: n_tokens = 442, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.169 (> 0.100 thold), f_keep = 0.048
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 442, total state size = 199.284 MiB
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.048, sim = 0.169
srv        update:  - cache state: 1 prompts, 199.284 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv  get_availabl: prompt cache update took 71.59 ms
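The `sim_best` and `f_keep` numbers in the slot-selection lines can be reconstructed from the common-prefix length (logged as n_past = 21 shortly below). The formulas here are inferred from the logged values, not copied from the server source: similarity is measured against the new prompt, while f_keep is the reusable fraction of the cached prompt.

```python
# Assumed reconstruction of "sim_best = 0.169 ... f_keep = 0.048":
#   sim    = common_prefix / task.n_tokens
#   f_keep = common_prefix / cached_prompt_tokens
n_lcp, n_task, n_cached = 21, 124, 442

sim    = n_lcp / n_task
f_keep = n_lcp / n_cached
print(f"{sim:.3f} {f_keep:.3f}")  # 0.169 0.048
```

The later line `sim_best = 0.696 ... f_keep = 0.707` is consistent with the same formulas (a 94-token prefix against a 135-token task and a 133-token cache).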
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 16 | processing task, is_child = 0
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 17 | processing task, is_child = 0
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 18 | processing task, is_child = 0
slot update_slots: id  1 | task 18 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 17067
slot update_slots: id  1 | task 18 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.119998
slot update_slots: id  1 | task 18 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.239995
slot update_slots: id  1 | task 18 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.359993
slot update_slots: id  1 | task 18 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.479991
slot update_slots: id  1 | task 18 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.599988
slot update_slots: id  1 | task 18 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.719986
slot update_slots: id  1 | task 18 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.839984
slot update_slots: id  1 | task 18 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.959981
slot update_slots: id  1 | task 18 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 16555, batch.n_tokens = 171, progress = 0.970001
slot update_slots: id  2 | task 17 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 128
slot update_slots: id  2 | task 17 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 17 | prompt processing progress, n_tokens = 128, batch.n_tokens = 299, progress = 1.000000
slot update_slots: id  2 | task 17 | prompt done, n_tokens = 128, batch.n_tokens = 299
slot init_sampler: id  2 | task 17 | init sampler, took 0.01 ms, tokens: text = 128, total = 128
slot update_slots: id  3 | task 16 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 124
slot update_slots: id  3 | task 16 | n_past = 21, slot.prompt.tokens.size() = 442, seq_id = 3, pos_min = 441, n_swa = 1
slot update_slots: id  3 | task 16 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 16 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 16 | prompt processing progress, n_tokens = 124, batch.n_tokens = 423, progress = 1.000000
slot update_slots: id  3 | task 16 | prompt done, n_tokens = 124, batch.n_tokens = 423
slot init_sampler: id  3 | task 16 | init sampler, took 0.01 ms, tokens: text = 124, total = 124
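The fallback above can be sketched as follows. This is a hypothetical illustration of the decision, not the exact llama.cpp condition: with recurrent/hybrid memory the cache retains state only for the final position (here pos_min = 441), so a reuse attempt whose common prefix ends at n_past = 21 has no cached state to resume from and the prompt is reprocessed in full.

```python
# Hypothetical sketch (assumed logic, inferred from the log message):
# resuming at n_past needs cached state for position n_past - 1, but
# positions below pos_min have been pruned from the SWA/recurrent cache.
def must_reprocess(n_past: int, pos_min: int, n_swa: int) -> bool:
    return pos_min > max(0, n_past - n_swa)

print(must_reprocess(n_past=21, pos_min=441, n_swa=1))  # True
```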
slot update_slots: id  1 | task 18 | n_tokens = 16555, memory_seq_rm [16555, end)
slot update_slots: id  1 | task 18 | prompt processing progress, n_tokens = 17067, batch.n_tokens = 514, progress = 1.000000
slot update_slots: id  1 | task 18 | prompt done, n_tokens = 17067, batch.n_tokens = 514
slot init_sampler: id  1 | task 18 | init sampler, took 1.75 ms, tokens: text = 17067, total = 17067
slot update_slots: id  1 | task 18 | created context checkpoint 1 of 8 (pos_min = 16554, pos_max = 16554, size = 186.329 MiB)
slot print_timing: id  2 | task 17 |
prompt eval time =     838.38 ms /   128 tokens (    6.55 ms per token,   152.68 tokens per second)
       eval time =     880.24 ms /     6 tokens (  146.71 ms per token,     6.82 tokens per second)
      total time =    1718.61 ms /   134 tokens
slot      release: id  2 | task 17 | stop processing: n_tokens = 133, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.696 (> 0.100 thold), f_keep = 0.707
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 35 | processing task, is_child = 0
slot update_slots: id  2 | task 35 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 135
slot update_slots: id  2 | task 35 | n_past = 94, slot.prompt.tokens.size() = 133, seq_id = 2, pos_min = 132, n_swa = 1
slot update_slots: id  2 | task 35 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 35 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 35 | prompt processing progress, n_tokens = 135, batch.n_tokens = 137, progress = 1.000000
slot update_slots: id  2 | task 35 | prompt done, n_tokens = 135, batch.n_tokens = 137
slot init_sampler: id  2 | task 35 | init sampler, took 0.01 ms, tokens: text = 135, total = 135
slot print_timing: id  2 | task 35 |
prompt eval time =     341.04 ms /   135 tokens (    2.53 ms per token,   395.85 tokens per second)
       eval time =     399.30 ms /    12 tokens (   33.28 ms per token,    30.05 tokens per second)
      total time =     740.34 ms /   147 tokens
slot      release: id  2 | task 35 | stop processing: n_tokens = 146, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
slot print_timing: id  3 | task 16 |
prompt eval time =     839.00 ms /   124 tokens (    6.77 ms per token,   147.80 tokens per second)
       eval time =    1749.74 ms /    21 tokens (   83.32 ms per token,    12.00 tokens per second)
      total time =    2588.74 ms /   145 tokens
slot      release: id  3 | task 16 | stop processing: n_tokens = 144, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
slot print_timing: id  1 | task 18 |
prompt eval time =   22080.48 ms / 17067 tokens (    1.29 ms per token,   772.95 tokens per second)
       eval time =    3477.63 ms /   136 tokens (   25.57 ms per token,    39.11 tokens per second)
      total time =   25558.11 ms / 17203 tokens
slot      release: id  1 | task 18 | stop processing: n_tokens = 17202, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 165 | processing task, is_child = 0
slot update_slots: id  0 | task 165 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 10037
slot update_slots: id  0 | task 165 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.204045
slot update_slots: id  0 | task 165 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.408090
slot update_slots: id  0 | task 165 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.612135
slot update_slots: id  0 | task 165 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.816180
slot update_slots: id  0 | task 165 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 9525, batch.n_tokens = 1333, progress = 0.948989
slot update_slots: id  0 | task 165 | n_tokens = 9525, memory_seq_rm [9525, end)
slot update_slots: id  0 | task 165 | prompt processing progress, n_tokens = 10037, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 165 | prompt done, n_tokens = 10037, batch.n_tokens = 512
slot init_sampler: id  0 | task 165 | init sampler, took 1.03 ms, tokens: text = 10037, total = 10037
slot update_slots: id  0 | task 165 | created context checkpoint 1 of 8 (pos_min = 9524, pos_max = 9524, size = 186.329 MiB)
slot print_timing: id  0 | task 165 |
prompt eval time =   12959.68 ms / 10037 tokens (    1.29 ms per token,   774.48 tokens per second)
       eval time =    1112.77 ms /    52 tokens (   21.40 ms per token,    46.73 tokens per second)
      total time =   14072.44 ms / 10089 tokens
slot      release: id  0 | task 165 | stop processing: n_tokens = 10088, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 7566795298
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 146, total state size = 190.609 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.137, sim = 0.009
srv        update:  - cache state: 2 prompts, 389.892 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv  get_availabl: prompt cache update took 77.33 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 223 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.845 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 224 | processing task, is_child = 0
slot update_slots: id  0 | task 224 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 11938
slot update_slots: id  0 | task 224 | n_tokens = 10088, memory_seq_rm [10088, end)
slot update_slots: id  0 | task 224 | prompt processing progress, n_tokens = 11426, batch.n_tokens = 1338, progress = 0.957112
slot update_slots: id  2 | task 223 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 2130
slot update_slots: id  2 | task 223 | n_past = 20, slot.prompt.tokens.size() = 146, seq_id = 2, pos_min = 145, n_swa = 1
slot update_slots: id  2 | task 223 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 223 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 223 | prompt processing progress, n_tokens = 710, batch.n_tokens = 2048, progress = 0.333333
slot update_slots: id  0 | task 224 | n_tokens = 11426, memory_seq_rm [11426, end)
slot update_slots: id  0 | task 224 | prompt processing progress, n_tokens = 11938, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 224 | prompt done, n_tokens = 11938, batch.n_tokens = 512
slot init_sampler: id  0 | task 224 | init sampler, took 1.22 ms, tokens: text = 11938, total = 11938
slot update_slots: id  0 | task 224 | created context checkpoint 2 of 8 (pos_min = 11425, pos_max = 11425, size = 186.329 MiB)
slot update_slots: id  2 | task 223 | n_tokens = 710, memory_seq_rm [710, end)
slot update_slots: id  2 | task 223 | prompt processing progress, n_tokens = 1618, batch.n_tokens = 1420, progress = 0.759624
slot update_slots: id  2 | task 223 | n_tokens = 1618, memory_seq_rm [1618, end)
slot update_slots: id  2 | task 223 | prompt processing progress, n_tokens = 2130, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  2 | task 223 | prompt done, n_tokens = 2130, batch.n_tokens = 513
slot init_sampler: id  2 | task 223 | init sampler, took 0.32 ms, tokens: text = 2130, total = 2130
slot update_slots: id  2 | task 223 | created context checkpoint 1 of 8 (pos_min = 1617, pos_max = 1617, size = 186.329 MiB)
slot print_timing: id  2 | task 223 |
prompt eval time =    5272.41 ms /  2130 tokens (    2.48 ms per token,   403.99 tokens per second)
       eval time =    1055.37 ms /    26 tokens (   40.59 ms per token,    24.64 tokens per second)
      total time =    6327.78 ms /  2156 tokens
slot      release: id  2 | task 223 | stop processing: n_tokens = 2155, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
slot print_timing: id  0 | task 224 |
prompt eval time =    4543.85 ms /  1850 tokens (    2.46 ms per token,   407.14 tokens per second)
       eval time =    1993.56 ms /    36 tokens (   55.38 ms per token,    18.06 tokens per second)
      total time =    6537.42 ms /  1886 tokens
slot      release: id  0 | task 224 | stop processing: n_tokens = 11973, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.977 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 262 | processing task, is_child = 0
slot update_slots: id  0 | task 262 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 12256
slot update_slots: id  0 | task 262 | n_tokens = 11973, memory_seq_rm [11973, end)
slot update_slots: id  0 | task 262 | prompt processing progress, n_tokens = 12256, batch.n_tokens = 283, progress = 1.000000
slot update_slots: id  0 | task 262 | prompt done, n_tokens = 12256, batch.n_tokens = 283
slot init_sampler: id  0 | task 262 | init sampler, took 1.26 ms, tokens: text = 12256, total = 12256
slot update_slots: id  0 | task 262 | created context checkpoint 3 of 8 (pos_min = 11972, pos_max = 11972, size = 186.329 MiB)
slot print_timing: id  0 | task 262 |
prompt eval time =     509.41 ms /   283 tokens (    1.80 ms per token,   555.54 tokens per second)
       eval time =     767.81 ms /    36 tokens (   21.33 ms per token,    46.89 tokens per second)
      total time =    1277.23 ms /   319 tokens
slot      release: id  0 | task 262 | stop processing: n_tokens = 12291, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.995 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 299 | processing task, is_child = 0
slot update_slots: id  0 | task 299 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 12358
slot update_slots: id  0 | task 299 | n_tokens = 12291, memory_seq_rm [12291, end)
slot update_slots: id  0 | task 299 | prompt processing progress, n_tokens = 12358, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  0 | task 299 | prompt done, n_tokens = 12358, batch.n_tokens = 67
slot init_sampler: id  0 | task 299 | init sampler, took 1.27 ms, tokens: text = 12358, total = 12358
slot update_slots: id  0 | task 299 | created context checkpoint 4 of 8 (pos_min = 12290, pos_max = 12290, size = 186.329 MiB)
slot print_timing: id  0 | task 299 |
prompt eval time =     291.83 ms /    67 tokens (    4.36 ms per token,   229.58 tokens per second)
       eval time =     789.43 ms /    37 tokens (   21.34 ms per token,    46.87 tokens per second)
      total time =    1081.26 ms /   104 tokens
slot      release: id  0 | task 299 | stop processing: n_tokens = 12394, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  log_server_r: done request: POST /v1/messages/count_tokens 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.454 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 337 | processing task, is_child = 0
slot update_slots: id  0 | task 337 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 27316
slot update_slots: id  0 | task 337 | n_tokens = 12394, memory_seq_rm [12394, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 14442, batch.n_tokens = 2048, progress = 0.528701
slot update_slots: id  0 | task 337 | n_tokens = 14442, memory_seq_rm [14442, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 16490, batch.n_tokens = 2048, progress = 0.603675
slot update_slots: id  0 | task 337 | n_tokens = 16490, memory_seq_rm [16490, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 18538, batch.n_tokens = 2048, progress = 0.678650
slot update_slots: id  0 | task 337 | n_tokens = 18538, memory_seq_rm [18538, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 20586, batch.n_tokens = 2048, progress = 0.753624
slot update_slots: id  0 | task 337 | n_tokens = 20586, memory_seq_rm [20586, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 22634, batch.n_tokens = 2048, progress = 0.828599
slot update_slots: id  0 | task 337 | n_tokens = 22634, memory_seq_rm [22634, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 24682, batch.n_tokens = 2048, progress = 0.903573
slot update_slots: id  0 | task 337 | n_tokens = 24682, memory_seq_rm [24682, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 26730, batch.n_tokens = 2048, progress = 0.978547
slot update_slots: id  0 | task 337 | n_tokens = 26730, memory_seq_rm [26730, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 26804, batch.n_tokens = 74, progress = 0.981256
slot update_slots: id  0 | task 337 | n_tokens = 26804, memory_seq_rm [26804, end)
slot update_slots: id  0 | task 337 | prompt processing progress, n_tokens = 27316, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 337 | prompt done, n_tokens = 27316, batch.n_tokens = 512
slot init_sampler: id  0 | task 337 | init sampler, took 2.79 ms, tokens: text = 27316, total = 27316
slot update_slots: id  0 | task 337 | created context checkpoint 5 of 8 (pos_min = 26803, pos_max = 26803, size = 186.329 MiB)
slot print_timing: id  0 | task 337 |
prompt eval time =   19357.22 ms / 14922 tokens (    1.30 ms per token,   770.88 tokens per second)
       eval time =    1050.59 ms /    49 tokens (   21.44 ms per token,    46.64 tokens per second)
      total time =   20407.80 ms / 14971 tokens
slot      release: id  0 | task 337 | stop processing: n_tokens = 27364, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.951 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 395 | processing task, is_child = 0
slot update_slots: id  0 | task 395 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 28771
slot update_slots: id  0 | task 395 | n_tokens = 27364, memory_seq_rm [27364, end)
slot update_slots: id  0 | task 395 | prompt processing progress, n_tokens = 28259, batch.n_tokens = 895, progress = 0.982204
slot update_slots: id  0 | task 395 | n_tokens = 28259, memory_seq_rm [28259, end)
slot update_slots: id  0 | task 395 | prompt processing progress, n_tokens = 28771, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 395 | prompt done, n_tokens = 28771, batch.n_tokens = 512
slot init_sampler: id  0 | task 395 | init sampler, took 2.94 ms, tokens: text = 28771, total = 28771
slot update_slots: id  0 | task 395 | created context checkpoint 6 of 8 (pos_min = 28258, pos_max = 28258, size = 186.329 MiB)
slot print_timing: id  0 | task 395 |
prompt eval time =    1946.21 ms /  1407 tokens (    1.38 ms per token,   722.94 tokens per second)
       eval time =    1195.83 ms /    55 tokens (   21.74 ms per token,    45.99 tokens per second)
      total time =    3142.04 ms /  1462 tokens
slot      release: id  0 | task 395 | stop processing: n_tokens = 28825, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.467 (> 0.100 thold), f_keep = 0.149
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 2155, total state size = 249.489 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.149, sim = 0.467
srv        update:  - cache state: 3 prompts, 825.710 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv  get_availabl: prompt cache update took 113.13 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 452 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 453 | processing task, is_child = 0
slot update_slots: id  0 | task 453 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 29229
slot update_slots: id  0 | task 453 | n_tokens = 28825, memory_seq_rm [28825, end)
slot update_slots: id  0 | task 453 | prompt processing progress, n_tokens = 29229, batch.n_tokens = 404, progress = 1.000000
slot update_slots: id  0 | task 453 | prompt done, n_tokens = 29229, batch.n_tokens = 404
slot init_sampler: id  0 | task 453 | init sampler, took 2.99 ms, tokens: text = 29229, total = 29229
slot update_slots: id  0 | task 453 | created context checkpoint 7 of 8 (pos_min = 28824, pos_max = 28824, size = 186.329 MiB)
slot update_slots: id  2 | task 452 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 687
slot update_slots: id  2 | task 452 | n_past = 321, slot.prompt.tokens.size() = 2155, seq_id = 2, pos_min = 2154, n_swa = 1
slot update_slots: id  2 | task 452 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 452 | erased invalidated context checkpoint (pos_min = 1617, pos_max = 1617, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  2 | task 452 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 452 | prompt processing progress, n_tokens = 175, batch.n_tokens = 579, progress = 0.254731
slot update_slots: id  2 | task 452 | n_tokens = 175, memory_seq_rm [175, end)
slot update_slots: id  2 | task 452 | prompt processing progress, n_tokens = 687, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  2 | task 452 | prompt done, n_tokens = 687, batch.n_tokens = 513
slot init_sampler: id  2 | task 452 | init sampler, took 0.07 ms, tokens: text = 687, total = 687
slot update_slots: id  2 | task 452 | created context checkpoint 1 of 8 (pos_min = 174, pos_max = 174, size = 186.329 MiB)
slot print_timing: id  2 | task 452 |
prompt eval time =    1663.80 ms /   687 tokens (    2.42 ms per token,   412.91 tokens per second)
       eval time =    1072.68 ms /    26 tokens (   41.26 ms per token,    24.24 tokens per second)
      total time =    2736.48 ms /   713 tokens
slot      release: id  2 | task 452 | stop processing: n_tokens = 712, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
slot print_timing: id  0 | task 453 |
prompt eval time =     981.88 ms /   404 tokens (    2.43 ms per token,   411.46 tokens per second)
       eval time =    2938.16 ms /    79 tokens (   37.19 ms per token,    26.89 tokens per second)
      total time =    3920.04 ms /   483 tokens
slot      release: id  0 | task 453 | stop processing: n_tokens = 29307, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.411 (> 0.100 thold), f_keep = 0.451
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 712, total state size = 207.197 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.451, sim = 0.411
srv        update:  - cache state: 4 prompts, 1219.237 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv  get_availabl: prompt cache update took 102.52 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 533 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.984 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 534 | processing task, is_child = 0
slot update_slots: id  0 | task 534 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 29783
slot update_slots: id  0 | task 534 | n_tokens = 29307, memory_seq_rm [29307, end)
slot update_slots: id  0 | task 534 | prompt processing progress, n_tokens = 29783, batch.n_tokens = 476, progress = 1.000000
slot update_slots: id  0 | task 534 | prompt done, n_tokens = 29783, batch.n_tokens = 476
slot init_sampler: id  0 | task 534 | init sampler, took 3.05 ms, tokens: text = 29783, total = 29783
slot update_slots: id  0 | task 534 | created context checkpoint 8 of 8 (pos_min = 29306, pos_max = 29306, size = 186.329 MiB)
slot update_slots: id  2 | task 533 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 781
slot update_slots: id  2 | task 533 | n_past = 321, slot.prompt.tokens.size() = 712, seq_id = 2, pos_min = 711, n_swa = 1
slot update_slots: id  2 | task 533 | restored context checkpoint (pos_min = 174, pos_max = 174, size = 186.329 MiB)
slot update_slots: id  2 | task 533 | n_tokens = 175, memory_seq_rm [175, end)
slot update_slots: id  2 | task 533 | prompt processing progress, n_tokens = 269, batch.n_tokens = 570, progress = 0.344430
slot update_slots: id  2 | task 533 | n_tokens = 269, memory_seq_rm [269, end)
slot update_slots: id  2 | task 533 | prompt processing progress, n_tokens = 781, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  2 | task 533 | prompt done, n_tokens = 781, batch.n_tokens = 513
slot init_sampler: id  2 | task 533 | init sampler, took 0.09 ms, tokens: text = 781, total = 781
slot update_slots: id  2 | task 533 | created context checkpoint 2 of 8 (pos_min = 268, pos_max = 268, size = 186.329 MiB)
slot print_timing: id  2 | task 533 |
prompt eval time =    1675.54 ms /   606 tokens (    2.76 ms per token,   361.67 tokens per second)
       eval time =    1074.06 ms /    26 tokens (   41.31 ms per token,    24.21 tokens per second)
      total time =    2749.60 ms /   632 tokens
slot      release: id  2 | task 533 | stop processing: n_tokens = 806, truncated = 0
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
slot print_timing: id  0 | task 534 |
prompt eval time =     991.75 ms /   476 tokens (    2.08 ms per token,   479.96 tokens per second)
       eval time =    2286.48 ms /    49 tokens (   46.66 ms per token,    21.43 tokens per second)
      total time =    3278.23 ms /   525 tokens
slot      release: id  0 | task 534 | stop processing: n_tokens = 29831, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.949 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 584 | processing task, is_child = 0
slot update_slots: id  0 | task 584 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 31434
slot update_slots: id  0 | task 584 | n_tokens = 29831, memory_seq_rm [29831, end)
slot update_slots: id  0 | task 584 | prompt processing progress, n_tokens = 30922, batch.n_tokens = 1091, progress = 0.983712
slot update_slots: id  0 | task 584 | n_tokens = 30922, memory_seq_rm [30922, end)
slot update_slots: id  0 | task 584 | prompt processing progress, n_tokens = 31434, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 584 | prompt done, n_tokens = 31434, batch.n_tokens = 512
slot init_sampler: id  0 | task 584 | init sampler, took 3.21 ms, tokens: text = 31434, total = 31434
slot update_slots: id  0 | task 584 | erasing old context checkpoint (pos_min = 9524, pos_max = 9524, size = 186.329 MiB)
slot update_slots: id  0 | task 584 | created context checkpoint 8 of 8 (pos_min = 30921, pos_max = 30921, size = 186.329 MiB)
slot print_timing: id  0 | task 584 |
prompt eval time =    2286.57 ms /  1603 tokens (    1.43 ms per token,   701.05 tokens per second)
       eval time =    1073.47 ms /    50 tokens (   21.47 ms per token,    46.58 tokens per second)
      total time =    3360.04 ms /  1653 tokens
slot      release: id  0 | task 584 | stop processing: n_tokens = 31483, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.966 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 636 | processing task, is_child = 0
slot update_slots: id  0 | task 636 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 32601
slot update_slots: id  0 | task 636 | n_tokens = 31483, memory_seq_rm [31483, end)
slot update_slots: id  0 | task 636 | prompt processing progress, n_tokens = 32089, batch.n_tokens = 606, progress = 0.984295
slot update_slots: id  0 | task 636 | n_tokens = 32089, memory_seq_rm [32089, end)
slot update_slots: id  0 | task 636 | prompt processing progress, n_tokens = 32601, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 636 | prompt done, n_tokens = 32601, batch.n_tokens = 512
slot init_sampler: id  0 | task 636 | init sampler, took 3.33 ms, tokens: text = 32601, total = 32601
slot update_slots: id  0 | task 636 | erasing old context checkpoint (pos_min = 11425, pos_max = 11425, size = 186.329 MiB)
slot update_slots: id  0 | task 636 | created context checkpoint 8 of 8 (pos_min = 32088, pos_max = 32088, size = 186.329 MiB)
slot print_timing: id  0 | task 636 |
prompt eval time =    1680.46 ms /  1118 tokens (    1.50 ms per token,   665.29 tokens per second)
       eval time =   12059.16 ms /   552 tokens (   21.85 ms per token,    45.77 tokens per second)
      total time =   13739.62 ms /  1670 tokens
slot      release: id  0 | task 636 | stop processing: n_tokens = 33152, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.967 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 1190 | processing task, is_child = 0
slot update_slots: id  1 | task 1190 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 17794
slot update_slots: id  1 | task 1190 | n_tokens = 17202, memory_seq_rm [17202, end)
slot update_slots: id  1 | task 1190 | prompt processing progress, n_tokens = 17282, batch.n_tokens = 80, progress = 0.971226
slot update_slots: id  1 | task 1190 | n_tokens = 17282, memory_seq_rm [17282, end)
slot update_slots: id  1 | task 1190 | prompt processing progress, n_tokens = 17794, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  1 | task 1190 | prompt done, n_tokens = 17794, batch.n_tokens = 512
slot init_sampler: id  1 | task 1190 | init sampler, took 1.82 ms, tokens: text = 17794, total = 17794
slot update_slots: id  1 | task 1190 | created context checkpoint 2 of 8 (pos_min = 17281, pos_max = 17281, size = 186.329 MiB)
slot print_timing: id  1 | task 1190 |
prompt eval time =     927.61 ms /   592 tokens (    1.57 ms per token,   638.20 tokens per second)
       eval time =    5455.44 ms /   250 tokens (   21.82 ms per token,    45.83 tokens per second)
      total time =    6383.04 ms /   842 tokens
slot      release: id  1 | task 1190 | stop processing: n_tokens = 18043, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.931 (> 0.100 thold), f_keep = 0.946
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 1442 | processing task, is_child = 0
slot update_slots: id  1 | task 1442 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 18331
slot update_slots: id  1 | task 1442 | n_past = 17063, slot.prompt.tokens.size() = 18043, seq_id = 1, pos_min = 18042, n_swa = 1
slot update_slots: id  1 | task 1442 | restored context checkpoint (pos_min = 16554, pos_max = 16554, size = 186.329 MiB)
slot update_slots: id  1 | task 1442 | erased invalidated context checkpoint (pos_min = 17281, pos_max = 17281, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 1442 | n_tokens = 16555, memory_seq_rm [16555, end)
slot update_slots: id  1 | task 1442 | prompt processing progress, n_tokens = 17819, batch.n_tokens = 1264, progress = 0.972069
slot update_slots: id  1 | task 1442 | n_tokens = 17819, memory_seq_rm [17819, end)
slot update_slots: id  1 | task 1442 | prompt processing progress, n_tokens = 18331, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  1 | task 1442 | prompt done, n_tokens = 18331, batch.n_tokens = 512
slot init_sampler: id  1 | task 1442 | init sampler, took 1.88 ms, tokens: text = 18331, total = 18331
slot update_slots: id  1 | task 1442 | created context checkpoint 2 of 8 (pos_min = 17818, pos_max = 17818, size = 186.329 MiB)
slot print_timing: id  1 | task 1442 |
prompt eval time =    2517.59 ms /  1776 tokens (    1.42 ms per token,   705.44 tokens per second)
       eval time =     155.89 ms /     8 tokens (   19.49 ms per token,    51.32 tokens per second)
      total time =    2673.48 ms /  1784 tokens
slot      release: id  1 | task 1442 | stop processing: n_tokens = 18338, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = 7566837554
srv  get_availabl: updating prompt cache
srv  params_from_: Chat format: peg-native
srv   prompt_save:  - saving prompt with length 144, total state size = 190.550 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.139, sim = 0.032
srv        update:  - cache state: 5 prompts, 1409.787 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv  get_availabl: prompt cache update took 105.71 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 1452 | processing task, is_child = 0
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 7623858093
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 806, total state size = 209.952 MiB
srv          load:  - looking for better prompt, base f_keep = 0.025, sim = 0.014
srv        update:  - cache state: 6 prompts, 1992.397 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv  get_availabl: prompt cache update took 152.89 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 1453 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 7641730655
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 33152, total state size = 1157.959 MiB
srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.002
srv        update:  - cache state: 7 prompts, 4640.989 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv  get_availabl: prompt cache update took 640.78 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 1454 | processing task, is_child = 0
slot update_slots: id  0 | task 1454 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 19037
slot update_slots: id  0 | task 1454 | n_past = 32, slot.prompt.tokens.size() = 33152, seq_id = 0, pos_min = 33151, n_swa = 1
slot update_slots: id  0 | task 1454 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 11972, pos_max = 11972, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 12290, pos_max = 12290, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 26803, pos_max = 26803, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 28258, pos_max = 28258, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 28824, pos_max = 28824, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 29306, pos_max = 29306, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 30921, pos_max = 30921, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | erased invalidated context checkpoint (pos_min = 32088, pos_max = 32088, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1454 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.107580
slot update_slots: id  0 | task 1454 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.215160
slot update_slots: id  0 | task 1454 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.322740
slot update_slots: id  0 | task 1454 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.430320
slot update_slots: id  0 | task 1454 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.537900
slot update_slots: id  0 | task 1454 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.645480
slot update_slots: id  0 | task 1454 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.753060
slot update_slots: id  0 | task 1454 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.860640
slot update_slots: id  0 | task 1454 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 2048, progress = 0.968220
slot update_slots: id  0 | task 1454 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 18525, batch.n_tokens = 93, progress = 0.973105
slot update_slots: id  2 | task 1453 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 1464
slot update_slots: id  2 | task 1453 | n_past = 20, slot.prompt.tokens.size() = 806, seq_id = 2, pos_min = 805, n_swa = 1
slot update_slots: id  2 | task 1453 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 1453 | erased invalidated context checkpoint (pos_min = 174, pos_max = 174, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  2 | task 1453 | erased invalidated context checkpoint (pos_min = 268, pos_max = 268, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  2 | task 1453 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 1453 | prompt processing progress, n_tokens = 952, batch.n_tokens = 1045, progress = 0.650273
slot update_slots: id  3 | task 1452 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 630
slot update_slots: id  3 | task 1452 | n_past = 20, slot.prompt.tokens.size() = 144, seq_id = 3, pos_min = 143, n_swa = 1
slot update_slots: id  3 | task 1452 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 1452 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 1452 | prompt processing progress, n_tokens = 118, batch.n_tokens = 1163, progress = 0.187302
slot update_slots: id  0 | task 1454 | n_tokens = 18525, memory_seq_rm [18525, end)
slot update_slots: id  0 | task 1454 | prompt processing progress, n_tokens = 19037, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 1454 | prompt done, n_tokens = 19037, batch.n_tokens = 512
slot init_sampler: id  0 | task 1454 | init sampler, took 1.95 ms, tokens: text = 19037, total = 19037
slot update_slots: id  0 | task 1454 | created context checkpoint 1 of 8 (pos_min = 18524, pos_max = 18524, size = 186.329 MiB)
slot update_slots: id  2 | task 1453 | n_tokens = 952, memory_seq_rm [952, end)
slot update_slots: id  2 | task 1453 | prompt processing progress, n_tokens = 1464, batch.n_tokens = 1024, progress = 1.000000
slot update_slots: id  2 | task 1453 | prompt done, n_tokens = 1464, batch.n_tokens = 1024
slot init_sampler: id  2 | task 1453 | init sampler, took 0.15 ms, tokens: text = 1464, total = 1464
slot update_slots: id  2 | task 1453 | created context checkpoint 1 of 8 (pos_min = 951, pos_max = 951, size = 186.329 MiB)
slot update_slots: id  3 | task 1452 | n_tokens = 118, memory_seq_rm [118, end)
slot update_slots: id  3 | task 1452 | prompt processing progress, n_tokens = 630, batch.n_tokens = 1536, progress = 1.000000
slot update_slots: id  3 | task 1452 | prompt done, n_tokens = 630, batch.n_tokens = 1536
slot init_sampler: id  3 | task 1452 | init sampler, took 0.07 ms, tokens: text = 630, total = 630
slot update_slots: id  3 | task 1452 | created context checkpoint 1 of 8 (pos_min = 117, pos_max = 117, size = 186.329 MiB)
slot print_timing: id  3 | task 1452 |
prompt eval time =    3863.20 ms /   630 tokens (    6.13 ms per token,   163.08 tokens per second)
       eval time =     573.94 ms /    12 tokens (   47.83 ms per token,    20.91 tokens per second)
      total time =    4437.14 ms /   642 tokens
slot      release: id  3 | task 1452 | stop processing: n_tokens = 641, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  2 | task 1453 |
prompt eval time =    3873.91 ms /  1464 tokens (    2.65 ms per token,   377.91 tokens per second)
       eval time =    1154.96 ms /    25 tokens (   46.20 ms per token,    21.65 tokens per second)
      total time =    5028.87 ms /  1489 tokens
slot      release: id  2 | task 1453 | stop processing: n_tokens = 1488, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  0 | task 1454 |
prompt eval time =   27496.23 ms / 19037 tokens (    1.44 ms per token,   692.35 tokens per second)
       eval time =    5972.05 ms /   244 tokens (   24.48 ms per token,    40.86 tokens per second)
      total time =   33468.29 ms / 19281 tokens
slot      release: id  0 | task 1454 | stop processing: n_tokens = 19280, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = 7651046148
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 18338, total state size = 723.785 MiB
srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.003
srv        update:  - cache state: 8 prompts, 5737.433 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv  get_availabl: prompt cache update took 264.57 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 1709 | processing task, is_child = 0
slot update_slots: id  1 | task 1709 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 9878
slot update_slots: id  1 | task 1709 | n_past = 32, slot.prompt.tokens.size() = 18338, seq_id = 1, pos_min = 18337, n_swa = 1
slot update_slots: id  1 | task 1709 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 1709 | erased invalidated context checkpoint (pos_min = 16554, pos_max = 16554, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 1709 | erased invalidated context checkpoint (pos_min = 17818, pos_max = 17818, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 1709 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.207329
slot update_slots: id  1 | task 1709 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.414659
slot update_slots: id  1 | task 1709 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.621988
slot update_slots: id  1 | task 1709 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.829318
slot update_slots: id  1 | task 1709 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 9366, batch.n_tokens = 1174, progress = 0.948168
slot update_slots: id  1 | task 1709 | n_tokens = 9366, memory_seq_rm [9366, end)
slot update_slots: id  1 | task 1709 | prompt processing progress, n_tokens = 9878, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  1 | task 1709 | prompt done, n_tokens = 9878, batch.n_tokens = 512
slot init_sampler: id  1 | task 1709 | init sampler, took 1.01 ms, tokens: text = 9878, total = 9878
slot update_slots: id  1 | task 1709 | created context checkpoint 1 of 8 (pos_min = 9365, pos_max = 9365, size = 186.329 MiB)
slot print_timing: id  1 | task 1709 |
prompt eval time =   12658.29 ms /  9878 tokens (    1.28 ms per token,   780.36 tokens per second)
       eval time =    1809.34 ms /    84 tokens (   21.54 ms per token,    46.43 tokens per second)
      total time =   14467.63 ms /  9962 tokens
slot      release: id  1 | task 1709 | stop processing: n_tokens = 9961, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = 12110702889
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 641, total state size = 205.116 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.031, sim = 0.005
srv        update:  - cache state: 9 prompts, 6128.879 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv  get_availabl: prompt cache update took 98.89 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 1799 | processing task, is_child = 0
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.720 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 1800 | processing task, is_child = 0
slot update_slots: id  1 | task 1800 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 13831
slot update_slots: id  1 | task 1800 | n_tokens = 9961, memory_seq_rm [9961, end)
slot update_slots: id  1 | task 1800 | prompt processing progress, n_tokens = 12009, batch.n_tokens = 2048, progress = 0.868267
slot update_slots: id  1 | task 1800 | n_tokens = 12009, memory_seq_rm [12009, end)
slot update_slots: id  1 | task 1800 | prompt processing progress, n_tokens = 13319, batch.n_tokens = 1310, progress = 0.962982
slot update_slots: id  3 | task 1799 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 3922
slot update_slots: id  3 | task 1799 | n_past = 20, slot.prompt.tokens.size() = 641, seq_id = 3, pos_min = 640, n_swa = 1
slot update_slots: id  3 | task 1799 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 1799 | erased invalidated context checkpoint (pos_min = 117, pos_max = 117, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  3 | task 1799 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 1799 | prompt processing progress, n_tokens = 738, batch.n_tokens = 2048, progress = 0.188169
slot update_slots: id  1 | task 1800 | n_tokens = 13319, memory_seq_rm [13319, end)
slot update_slots: id  1 | task 1800 | prompt processing progress, n_tokens = 13831, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  1 | task 1800 | prompt done, n_tokens = 13831, batch.n_tokens = 512
slot init_sampler: id  1 | task 1800 | init sampler, took 1.41 ms, tokens: text = 13831, total = 13831
slot update_slots: id  1 | task 1800 | created context checkpoint 2 of 8 (pos_min = 13318, pos_max = 13318, size = 186.329 MiB)
slot update_slots: id  3 | task 1799 | n_tokens = 738, memory_seq_rm [738, end)
slot update_slots: id  3 | task 1799 | prompt processing progress, n_tokens = 2274, batch.n_tokens = 2048, progress = 0.579806
slot update_slots: id  3 | task 1799 | n_tokens = 2274, memory_seq_rm [2274, end)
slot update_slots: id  3 | task 1799 | prompt processing progress, n_tokens = 3410, batch.n_tokens = 1137, progress = 0.869454
slot update_slots: id  3 | task 1799 | n_tokens = 3410, memory_seq_rm [3410, end)
slot update_slots: id  3 | task 1799 | prompt processing progress, n_tokens = 3922, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  3 | task 1799 | prompt done, n_tokens = 3922, batch.n_tokens = 513
slot init_sampler: id  3 | task 1799 | init sampler, took 0.40 ms, tokens: text = 3922, total = 3922
slot update_slots: id  3 | task 1799 | created context checkpoint 1 of 8 (pos_min = 3409, pos_max = 3409, size = 186.329 MiB)
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 12111283665
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1488, total state size = 229.940 MiB
srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.000
srv        update:  - cache state: 10 prompts, 6545.148 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv  get_availabl: prompt cache update took 118.08 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  2 | task 1814 | processing task, is_child = 0
slot update_slots: id  2 | task 1814 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 13831
slot update_slots: id  2 | task 1814 | n_past = 3, slot.prompt.tokens.size() = 1488, seq_id = 2, pos_min = 1487, n_swa = 1
slot update_slots: id  2 | task 1814 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 1814 | erased invalidated context checkpoint (pos_min = 951, pos_max = 951, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  2 | task 1814 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 2046, batch.n_tokens = 2048, progress = 0.147929
slot update_slots: id  2 | task 1814 | n_tokens = 2046, memory_seq_rm [2046, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 4092, batch.n_tokens = 2048, progress = 0.295857
slot update_slots: id  2 | task 1814 | n_tokens = 4092, memory_seq_rm [4092, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 6138, batch.n_tokens = 2048, progress = 0.443786
slot update_slots: id  2 | task 1814 | n_tokens = 6138, memory_seq_rm [6138, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 8184, batch.n_tokens = 2048, progress = 0.591714
slot update_slots: id  2 | task 1814 | n_tokens = 8184, memory_seq_rm [8184, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 10230, batch.n_tokens = 2048, progress = 0.739643
slot update_slots: id  2 | task 1814 | n_tokens = 10230, memory_seq_rm [10230, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 12276, batch.n_tokens = 2048, progress = 0.887571
slot update_slots: id  2 | task 1814 | n_tokens = 12276, memory_seq_rm [12276, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 13319, batch.n_tokens = 1045, progress = 0.962982
slot update_slots: id  2 | task 1814 | n_tokens = 13319, memory_seq_rm [13319, end)
slot update_slots: id  2 | task 1814 | prompt processing progress, n_tokens = 13831, batch.n_tokens = 514, progress = 1.000000
slot update_slots: id  2 | task 1814 | prompt done, n_tokens = 13831, batch.n_tokens = 514
slot init_sampler: id  2 | task 1814 | init sampler, took 1.42 ms, tokens: text = 13831, total = 13831
slot update_slots: id  2 | task 1814 | created context checkpoint 1 of 8 (pos_min = 13318, pos_max = 13318, size = 186.329 MiB)
slot print_timing: id  3 | task 1799 |
prompt eval time =    7584.64 ms /  3922 tokens (    1.93 ms per token,   517.10 tokens per second)
       eval time =   19322.29 ms /    26 tokens (  743.16 ms per token,     1.35 tokens per second)
      total time =   26906.93 ms /  3948 tokens
slot      release: id  3 | task 1799 | stop processing: n_tokens = 3947, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 1800 |
prompt eval time =    7703.96 ms /  3870 tokens (    1.99 ms per token,   502.34 tokens per second)
       eval time =   22713.41 ms /    63 tokens (  360.53 ms per token,     2.77 tokens per second)
      total time =   30417.37 ms /  3933 tokens
slot      release: id  1 | task 1800 | stop processing: n_tokens = 13893, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  2 | task 1814 |
prompt eval time =   18524.01 ms / 13831 tokens (    1.34 ms per token,   746.65 tokens per second)
       eval time =    1703.08 ms /    57 tokens (   29.88 ms per token,    33.47 tokens per second)
      total time =   20227.09 ms / 13888 tokens
slot      release: id  2 | task 1814 | stop processing: n_tokens = 13887, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 12116100970
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 19280, total state size = 751.394 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.000
srv        update:  - cache state: 11 prompts, 7482.871 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv  get_availabl: prompt cache update took 242.92 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 1879 | processing task, is_child = 0
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.619 (> 0.100 thold), f_keep = 0.996
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 1880 | processing task, is_child = 0
slot update_slots: id  0 | task 1879 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 8746
slot update_slots: id  0 | task 1879 | n_past = 3, slot.prompt.tokens.size() = 19280, seq_id = 0, pos_min = 19279, n_swa = 1
slot update_slots: id  0 | task 1879 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 1879 | erased invalidated context checkpoint (pos_min = 18524, pos_max = 18524, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 1879 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.234164
slot update_slots: id  0 | task 1879 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.468328
slot update_slots: id  0 | task 1879 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.702493
slot update_slots: id  0 | task 1879 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.936657
slot update_slots: id  0 | task 1879 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 8234, batch.n_tokens = 42, progress = 0.941459
slot update_slots: id  1 | task 1880 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 22350
slot update_slots: id  1 | task 1880 | n_past = 13831, slot.prompt.tokens.size() = 13893, seq_id = 1, pos_min = 13892, n_swa = 1
slot update_slots: id  1 | task 1880 | restored context checkpoint (pos_min = 13318, pos_max = 13318, size = 186.329 MiB)
slot update_slots: id  1 | task 1880 | n_tokens = 13319, memory_seq_rm [13319, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 15325, batch.n_tokens = 2048, progress = 0.685682
slot update_slots: id  0 | task 1879 | n_tokens = 8234, memory_seq_rm [8234, end)
slot update_slots: id  0 | task 1879 | prompt processing progress, n_tokens = 8746, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 1879 | prompt done, n_tokens = 8746, batch.n_tokens = 512
slot init_sampler: id  0 | task 1879 | init sampler, took 0.90 ms, tokens: text = 8746, total = 8746
slot update_slots: id  0 | task 1879 | created context checkpoint 1 of 8 (pos_min = 8233, pos_max = 8233, size = 186.329 MiB)
slot update_slots: id  1 | task 1880 | n_tokens = 15325, memory_seq_rm [15325, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 16861, batch.n_tokens = 2048, progress = 0.754407
slot update_slots: id  1 | task 1880 | n_tokens = 16861, memory_seq_rm [16861, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 18908, batch.n_tokens = 2048, progress = 0.845996
slot update_slots: id  1 | task 1880 | n_tokens = 18908, memory_seq_rm [18908, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 20955, batch.n_tokens = 2048, progress = 0.937584
slot update_slots: id  1 | task 1880 | n_tokens = 20955, memory_seq_rm [20955, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 21838, batch.n_tokens = 884, progress = 0.977092
slot update_slots: id  1 | task 1880 | n_tokens = 21838, memory_seq_rm [21838, end)
slot update_slots: id  1 | task 1880 | prompt processing progress, n_tokens = 22350, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  1 | task 1880 | prompt done, n_tokens = 22350, batch.n_tokens = 513
slot init_sampler: id  1 | task 1880 | init sampler, took 2.29 ms, tokens: text = 22350, total = 22350
slot update_slots: id  1 | task 1880 | created context checkpoint 3 of 8 (pos_min = 21837, pos_max = 21837, size = 186.329 MiB)
slot print_timing: id  0 | task 1879 |
prompt eval time =   15400.64 ms /  8746 tokens (    1.76 ms per token,   567.90 tokens per second)
       eval time =    7818.93 ms /    26 tokens (  300.73 ms per token,     3.33 tokens per second)
      total time =   23219.58 ms /  8772 tokens
slot      release: id  0 | task 1879 | stop processing: n_tokens = 8771, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 1880 |
prompt eval time =   12604.93 ms /  9031 tokens (    1.40 ms per token,   716.47 tokens per second)
       eval time =    2697.02 ms /   115 tokens (   23.45 ms per token,    42.64 tokens per second)
      total time =   15301.95 ms /  9146 tokens
slot      release: id  1 | task 1880 | stop processing: n_tokens = 22464, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = 12161486685
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 3947, total state size = 302.009 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.005, sim = 0.053
srv        update:  - cache state: 12 prompts, 7971.210 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x5a84ff17a710:     442 tokens, checkpoints:  0,   199.284 MiB
srv        update:    - prompt 0x5a850c9e7d10:     146 tokens, checkpoints:  0,   190.609 MiB
srv        update:    - prompt 0x7f39440502b0:    2155 tokens, checkpoints:  1,   435.818 MiB
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv  get_availabl: prompt cache update took 175.17 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2005 | processing task, is_child = 0
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2006 | processing task, is_child = 0
slot update_slots: id  1 | task 2006 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 22520
slot update_slots: id  1 | task 2006 | n_tokens = 22464, memory_seq_rm [22464, end)
slot update_slots: id  1 | task 2006 | prompt processing progress, n_tokens = 22520, batch.n_tokens = 56, progress = 1.000000
slot update_slots: id  1 | task 2006 | prompt done, n_tokens = 22520, batch.n_tokens = 56
slot init_sampler: id  1 | task 2006 | init sampler, took 2.30 ms, tokens: text = 22520, total = 22520
slot update_slots: id  1 | task 2006 | created context checkpoint 4 of 8 (pos_min = 22463, pos_max = 22463, size = 186.329 MiB)
slot update_slots: id  3 | task 2005 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 375
slot update_slots: id  3 | task 2005 | n_past = 20, slot.prompt.tokens.size() = 3947, seq_id = 3, pos_min = 3946, n_swa = 1
slot update_slots: id  3 | task 2005 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2005 | erased invalidated context checkpoint (pos_min = 3409, pos_max = 3409, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  3 | task 2005 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2005 | prompt processing progress, n_tokens = 375, batch.n_tokens = 431, progress = 1.000000
slot update_slots: id  3 | task 2005 | prompt done, n_tokens = 375, batch.n_tokens = 431
slot init_sampler: id  3 | task 2005 | init sampler, took 0.04 ms, tokens: text = 375, total = 375
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 12162851942
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 13887, total state size = 593.334 MiB
srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.008
srv        update:  - cache size limit reached, removing oldest entry (size = 199.284 MiB)
srv        update:  - cache size limit reached, removing oldest entry (size = 190.609 MiB)
srv        update:  - cache size limit reached, removing oldest entry (size = 435.818 MiB)
srv        update:  - cache state: 10 prompts, 7925.163 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f3934020330:     712 tokens, checkpoints:  1,   393.526 MiB
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv  get_availabl: prompt cache update took 237.41 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  2 | task 2021 | processing task, is_child = 0
slot update_slots: id  2 | task 2021 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 375
slot update_slots: id  2 | task 2021 | n_past = 3, slot.prompt.tokens.size() = 13887, seq_id = 2, pos_min = 13886, n_swa = 1
slot update_slots: id  2 | task 2021 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2021 | erased invalidated context checkpoint (pos_min = 13318, pos_max = 13318, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  2 | task 2021 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2021 | prompt processing progress, n_tokens = 375, batch.n_tokens = 377, progress = 1.000000
slot update_slots: id  2 | task 2021 | prompt done, n_tokens = 375, batch.n_tokens = 377
slot init_sampler: id  2 | task 2021 | init sampler, took 0.04 ms, tokens: text = 375, total = 375
slot print_timing: id  3 | task 2005 |
prompt eval time =     787.12 ms /   375 tokens (    2.10 ms per token,   476.42 tokens per second)
       eval time =    1830.61 ms /    26 tokens (   70.41 ms per token,    14.20 tokens per second)
      total time =    2617.73 ms /   401 tokens
slot      release: id  3 | task 2005 | stop processing: n_tokens = 400, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  2 | task 2021 |
prompt eval time =     626.68 ms /   375 tokens (    1.67 ms per token,   598.39 tokens per second)
       eval time =     831.91 ms /    26 tokens (   32.00 ms per token,    31.25 tokens per second)
      total time =    1458.59 ms /   401 tokens
slot      release: id  2 | task 2021 | stop processing: n_tokens = 400, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 2006 |
prompt eval time =     833.63 ms /    56 tokens (   14.89 ms per token,    67.18 tokens per second)
       eval time =    2926.62 ms /    71 tokens (   41.22 ms per token,    24.26 tokens per second)
      total time =    3760.25 ms /   127 tokens
slot      release: id  1 | task 2006 | stop processing: n_tokens = 22590, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 0.997
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2079 | processing task, is_child = 0
slot update_slots: id  1 | task 2079 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 22852
slot update_slots: id  1 | task 2079 | n_past = 22532, slot.prompt.tokens.size() = 22590, seq_id = 1, pos_min = 22589, n_swa = 1
slot update_slots: id  1 | task 2079 | restored context checkpoint (pos_min = 22463, pos_max = 22463, size = 186.329 MiB)
slot update_slots: id  1 | task 2079 | n_tokens = 22464, memory_seq_rm [22464, end)
slot update_slots: id  1 | task 2079 | prompt processing progress, n_tokens = 22852, batch.n_tokens = 388, progress = 1.000000
slot update_slots: id  1 | task 2079 | prompt done, n_tokens = 22852, batch.n_tokens = 388
slot init_sampler: id  1 | task 2079 | init sampler, took 2.34 ms, tokens: text = 22852, total = 22852
slot print_timing: id  1 | task 2079 |
prompt eval time =     582.00 ms /   388 tokens (    1.50 ms per token,   666.66 tokens per second)
       eval time =    2031.82 ms /    95 tokens (   21.39 ms per token,    46.76 tokens per second)
      total time =    2613.82 ms /   483 tokens
slot      release: id  1 | task 2079 | stop processing: n_tokens = 22946, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 12186613870
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 8771, total state size = 443.393 MiB
srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.024
srv        update:  - cache size limit reached, removing oldest entry (size = 393.526 MiB)
srv        update:  - cache state: 10 prompts, 8161.358 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38b02195c0:     144 tokens, checkpoints:  0,   190.550 MiB
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv  get_availabl: prompt cache update took 171.32 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2176 | processing task, is_child = 0
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = 12209295966
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 400, total state size = 198.053 MiB
srv          load:  - looking for better prompt, base f_keep = 0.052, sim = 0.025
srv        update:  - cache size limit reached, removing oldest entry (size = 190.550 MiB)
srv        update:  - cache state: 10 prompts, 8168.861 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x7f3934020330:     400 tokens, checkpoints:  0,   198.053 MiB
srv  get_availabl: prompt cache update took 56.08 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2175 | processing task, is_child = 0
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 12209721070
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 400, total state size = 198.053 MiB
srv         alloc:  - prompt is already in the cache, skipping
srv          load:  - looking for better prompt, base f_keep = 0.052, sim = 0.025
srv        update:  - cache state: 10 prompts, 8168.861 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38b02179e0:     806 tokens, checkpoints:  2,   582.610 MiB
srv        update:    - prompt 0x7f38b02147d0:   33152 tokens, checkpoints:  8,  2648.592 MiB
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x7f3934020330:     400 tokens, checkpoints:  0,   198.053 MiB
srv  get_availabl: prompt cache update took 0.13 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 2177 | processing task, is_child = 0
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = 12213148494
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 22946, total state size = 858.838 MiB
srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.004
srv        update:  - cache size limit reached, removing oldest entry (size = 582.610 MiB)
srv        update:  - cache size limit reached, removing oldest entry (size = 2648.592 MiB)
srv        update:  - cache state: 9 prompts, 6541.813 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x7f3934020330:     400 tokens, checkpoints:  0,   198.053 MiB
srv        update:    - prompt 0x7f38b02195c0:   22946 tokens, checkpoints:  4,  1604.155 MiB
srv  get_availabl: prompt cache update took 488.10 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2178 | processing task, is_child = 0
slot update_slots: id  0 | task 2176 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 841
slot update_slots: id  0 | task 2176 | n_past = 20, slot.prompt.tokens.size() = 8771, seq_id = 0, pos_min = 8770, n_swa = 1
slot update_slots: id  0 | task 2176 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 2176 | erased invalidated context checkpoint (pos_min = 8233, pos_max = 8233, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  0 | task 2176 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 2176 | prompt processing progress, n_tokens = 329, batch.n_tokens = 329, progress = 0.391201
slot update_slots: id  1 | task 2178 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 839
slot update_slots: id  1 | task 2178 | n_past = 3, slot.prompt.tokens.size() = 22946, seq_id = 1, pos_min = 22945, n_swa = 1
slot update_slots: id  1 | task 2178 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2178 | erased invalidated context checkpoint (pos_min = 9365, pos_max = 9365, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 2178 | erased invalidated context checkpoint (pos_min = 13318, pos_max = 13318, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 2178 | erased invalidated context checkpoint (pos_min = 21837, pos_max = 21837, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 2178 | erased invalidated context checkpoint (pos_min = 22463, pos_max = 22463, n_swa = 1, size = 186.329 MiB)
slot update_slots: id  1 | task 2178 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2178 | prompt processing progress, n_tokens = 327, batch.n_tokens = 656, progress = 0.389750
slot update_slots: id  2 | task 2177 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 837
slot update_slots: id  2 | task 2177 | n_past = 21, slot.prompt.tokens.size() = 400, seq_id = 2, pos_min = 399, n_swa = 1
slot update_slots: id  2 | task 2177 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2177 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2177 | prompt processing progress, n_tokens = 325, batch.n_tokens = 981, progress = 0.388292
slot update_slots: id  3 | task 2175 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 847
slot update_slots: id  3 | task 2175 | n_past = 21, slot.prompt.tokens.size() = 400, seq_id = 3, pos_min = 399, n_swa = 1
slot update_slots: id  3 | task 2175 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2175 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2175 | prompt processing progress, n_tokens = 335, batch.n_tokens = 1316, progress = 0.395514
slot update_slots: id  0 | task 2176 | n_tokens = 329, memory_seq_rm [329, end)
slot update_slots: id  0 | task 2176 | prompt processing progress, n_tokens = 841, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 2176 | prompt done, n_tokens = 841, batch.n_tokens = 512
slot init_sampler: id  0 | task 2176 | init sampler, took 0.09 ms, tokens: text = 841, total = 841
slot update_slots: id  0 | task 2176 | created context checkpoint 1 of 8 (pos_min = 328, pos_max = 328, size = 186.329 MiB)
slot update_slots: id  1 | task 2178 | n_tokens = 327, memory_seq_rm [327, end)
slot update_slots: id  1 | task 2178 | prompt processing progress, n_tokens = 839, batch.n_tokens = 1024, progress = 1.000000
slot update_slots: id  1 | task 2178 | prompt done, n_tokens = 839, batch.n_tokens = 1024
slot init_sampler: id  1 | task 2178 | init sampler, took 0.09 ms, tokens: text = 839, total = 839
slot update_slots: id  1 | task 2178 | created context checkpoint 1 of 8 (pos_min = 326, pos_max = 326, size = 186.329 MiB)
slot update_slots: id  2 | task 2177 | n_tokens = 325, memory_seq_rm [325, end)
slot update_slots: id  2 | task 2177 | prompt processing progress, n_tokens = 837, batch.n_tokens = 1536, progress = 1.000000
slot update_slots: id  2 | task 2177 | prompt done, n_tokens = 837, batch.n_tokens = 1536
slot init_sampler: id  2 | task 2177 | init sampler, took 0.09 ms, tokens: text = 837, total = 837
slot update_slots: id  2 | task 2177 | created context checkpoint 1 of 8 (pos_min = 324, pos_max = 324, size = 186.329 MiB)
slot update_slots: id  3 | task 2175 | n_tokens = 335, memory_seq_rm [335, end)
slot update_slots: id  3 | task 2175 | prompt processing progress, n_tokens = 847, batch.n_tokens = 2048, progress = 1.000000
slot update_slots: id  3 | task 2175 | prompt done, n_tokens = 847, batch.n_tokens = 2048
slot init_sampler: id  3 | task 2175 | init sampler, took 0.09 ms, tokens: text = 847, total = 847
slot update_slots: id  3 | task 2175 | created context checkpoint 1 of 8 (pos_min = 334, pos_max = 334, size = 186.329 MiB)
slot print_timing: id  0 | task 2176 |
prompt eval time =    3745.74 ms /   841 tokens (    4.45 ms per token,   224.52 tokens per second)
       eval time =      44.87 ms /     2 tokens (   22.44 ms per token,    44.57 tokens per second)
      total time =    3790.61 ms /   843 tokens
slot      release: id  0 | task 2176 | stop processing: n_tokens = 842, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 2178 |
prompt eval time =    3740.69 ms /   839 tokens (    4.46 ms per token,   224.29 tokens per second)
       eval time =      44.36 ms /     2 tokens (   22.18 ms per token,    45.08 tokens per second)
      total time =    3785.06 ms /   841 tokens
slot      release: id  1 | task 2178 | stop processing: n_tokens = 840, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  2 | task 2177 |
prompt eval time =    3718.23 ms /   837 tokens (    4.44 ms per token,   225.11 tokens per second)
       eval time =      43.94 ms /     2 tokens (   21.97 ms per token,    45.52 tokens per second)
      total time =    3762.16 ms /   839 tokens
slot      release: id  2 | task 2177 | stop processing: n_tokens = 838, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.964 (> 0.100 thold), f_keep = 0.988
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2180 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.988 (> 0.100 thold), f_keep = 0.982
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2179 | processing task, is_child = 0
slot update_slots: id  0 | task 2179 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 837
slot update_slots: id  0 | task 2179 | n_past = 827, slot.prompt.tokens.size() = 842, seq_id = 0, pos_min = 841, n_swa = 1
slot update_slots: id  0 | task 2179 | restored context checkpoint (pos_min = 328, pos_max = 328, size = 186.329 MiB)
slot update_slots: id  0 | task 2179 | n_tokens = 329, memory_seq_rm [329, end)
slot update_slots: id  0 | task 2179 | prompt processing progress, n_tokens = 837, batch.n_tokens = 509, progress = 1.000000
slot update_slots: id  0 | task 2179 | prompt done, n_tokens = 837, batch.n_tokens = 509
slot init_sampler: id  0 | task 2179 | init sampler, took 0.09 ms, tokens: text = 837, total = 837
slot update_slots: id  1 | task 2180 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 861
slot update_slots: id  1 | task 2180 | n_past = 830, slot.prompt.tokens.size() = 840, seq_id = 1, pos_min = 839, n_swa = 1
slot update_slots: id  1 | task 2180 | restored context checkpoint (pos_min = 326, pos_max = 326, size = 186.329 MiB)
slot update_slots: id  1 | task 2180 | n_tokens = 327, memory_seq_rm [327, end)
slot update_slots: id  1 | task 2180 | prompt processing progress, n_tokens = 349, batch.n_tokens = 531, progress = 0.405343
slot update_slots: id  1 | task 2180 | n_tokens = 349, memory_seq_rm [349, end)
slot update_slots: id  1 | task 2180 | prompt processing progress, n_tokens = 861, batch.n_tokens = 514, progress = 1.000000
slot update_slots: id  1 | task 2180 | prompt done, n_tokens = 861, batch.n_tokens = 514
slot init_sampler: id  1 | task 2180 | init sampler, took 0.09 ms, tokens: text = 861, total = 861
slot print_timing: id  0 | task 2179 |
prompt eval time =     782.91 ms /   508 tokens (    1.54 ms per token,   648.86 tokens per second)
       eval time =     666.49 ms /     2 tokens (  333.25 ms per token,     3.00 tokens per second)
      total time =    1449.40 ms /   510 tokens
slot      release: id  0 | task 2179 | stop processing: n_tokens = 838, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 2180 |
prompt eval time =    1440.66 ms /   534 tokens (    2.70 ms per token,   370.66 tokens per second)
       eval time =      41.96 ms /     2 tokens (   20.98 ms per token,    47.66 tokens per second)
      total time =    1482.63 ms /   536 tokens
slot      release: id  1 | task 2180 | stop processing: n_tokens = 862, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  3 | task 2175 |
prompt eval time =    3718.86 ms /   847 tokens (    4.39 ms per token,   227.76 tokens per second)
       eval time =    1535.47 ms /     5 tokens (  307.09 ms per token,     3.26 tokens per second)
      total time =    5254.34 ms /   852 tokens
slot      release: id  3 | task 2175 | stop processing: n_tokens = 851, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 12217776144
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 838, total state size = 210.890 MiB
srv  params_from_: Chat format: peg-native
srv          load:  - looking for better prompt, base f_keep = 0.025, sim = 0.056
srv          load:  - found better prompt with f_keep = 0.815, sim = 0.865
srv        update:  - cache state: 9 prompts, 6740.980 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x5a8510fa1180:   13887 tokens, checkpoints:  1,   779.663 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x7f38b02195c0:   22946 tokens, checkpoints:  4,  1604.155 MiB
srv        update:    - prompt 0x5a8507c04550:     838 tokens, checkpoints:  1,   397.219 MiB
srv  get_availabl: prompt cache update took 233.82 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 2187 | processing task, is_child = 0
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 12219225902
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 838, total state size = 210.890 MiB
srv          load:  - looking for better prompt, base f_keep = 0.004, sim = 0.000
srv          load:  - found better prompt with f_keep = 0.996, sim = 0.601
srv        update:  - cache state: 9 prompts, 6358.535 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x7f38b02195c0:   22946 tokens, checkpoints:  4,  1604.155 MiB
srv        update:    - prompt 0x5a8507c04550:     838 tokens, checkpoints:  1,   397.219 MiB
srv        update:    - prompt 0x7f3934020330:     838 tokens, checkpoints:  1,   397.219 MiB
srv  get_availabl: prompt cache update took 2097.51 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2188 | processing task, is_child = 0
slot update_slots: id  0 | task 2188 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 23002
slot update_slots: id  0 | task 2188 | n_past = 13831, slot.prompt.tokens.size() = 13887, seq_id = 0, pos_min = 13886, n_swa = 1
slot update_slots: id  0 | task 2188 | restored context checkpoint (pos_min = 13318, pos_max = 13318, size = 186.329 MiB)
slot update_slots: id  0 | task 2188 | n_tokens = 13319, memory_seq_rm [13319, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 15367, batch.n_tokens = 2048, progress = 0.668072
slot update_slots: id  0 | task 2188 | n_tokens = 15367, memory_seq_rm [15367, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 17415, batch.n_tokens = 2048, progress = 0.757108
slot update_slots: id  0 | task 2188 | n_tokens = 17415, memory_seq_rm [17415, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 19463, batch.n_tokens = 2048, progress = 0.846144
slot update_slots: id  0 | task 2188 | n_tokens = 19463, memory_seq_rm [19463, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 21511, batch.n_tokens = 2048, progress = 0.935180
slot update_slots: id  0 | task 2188 | n_tokens = 21511, memory_seq_rm [21511, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 22490, batch.n_tokens = 979, progress = 0.977741
slot update_slots: id  2 | task 2187 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 377
slot update_slots: id  2 | task 2187 | n_past = 326, slot.prompt.tokens.size() = 400, seq_id = 2, pos_min = 399, n_swa = 1
slot update_slots: id  2 | task 2187 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2187 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2187 | prompt processing progress, n_tokens = 377, batch.n_tokens = 1356, progress = 1.000000
slot update_slots: id  2 | task 2187 | prompt done, n_tokens = 377, batch.n_tokens = 1356
slot init_sampler: id  2 | task 2187 | init sampler, took 0.04 ms, tokens: text = 377, total = 377
slot update_slots: id  0 | task 2188 | n_tokens = 22490, memory_seq_rm [22490, end)
slot update_slots: id  0 | task 2188 | prompt processing progress, n_tokens = 23002, batch.n_tokens = 513, progress = 1.000000
slot update_slots: id  0 | task 2188 | prompt done, n_tokens = 23002, batch.n_tokens = 513
slot init_sampler: id  0 | task 2188 | init sampler, took 2.35 ms, tokens: text = 23002, total = 23002
slot update_slots: id  0 | task 2188 | created context checkpoint 2 of 8 (pos_min = 22489, pos_max = 22489, size = 186.329 MiB)
srv  params_from_: Chat format: peg-native
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = 12219268240
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 862, total state size = 211.593 MiB
srv          load:  - looking for better prompt, base f_keep = 0.003, sim = 0.000
srv          load:  - found better prompt with f_keep = 0.996, sim = 0.993
srv        update:  - cache state: 9 prompts, 5152.303 MiB (limits: 8192.000 MiB, 196608 tokens, 196608 est)
srv        update:    - prompt 0x7f38bc1581a0:   18338 tokens, checkpoints:  2,  1096.444 MiB
srv        update:    - prompt 0x7f38b4055830:     641 tokens, checkpoints:  1,   391.445 MiB
srv        update:    - prompt 0x7f38b404e9f0:    1488 tokens, checkpoints:  1,   416.269 MiB
srv        update:    - prompt 0x5a850ce55c60:   19280 tokens, checkpoints:  1,   937.723 MiB
srv        update:    - prompt 0x7f392c044100:    3947 tokens, checkpoints:  1,   488.339 MiB
srv        update:    - prompt 0x7f372804cb90:    8771 tokens, checkpoints:  1,   629.722 MiB
srv        update:    - prompt 0x5a8507c04550:     838 tokens, checkpoints:  1,   397.219 MiB
srv        update:    - prompt 0x7f3934020330:     838 tokens, checkpoints:  1,   397.219 MiB
srv        update:    - prompt 0x5a8510fa1180:     862 tokens, checkpoints:  1,   397.922 MiB
srv  get_availabl: prompt cache update took 3462.08 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  1 | task 2209 | processing task, is_child = 0
slot update_slots: id  1 | task 2209 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 23002
slot update_slots: id  1 | task 2209 | n_past = 22852, slot.prompt.tokens.size() = 22946, seq_id = 1, pos_min = 22945, n_swa = 1
slot update_slots: id  1 | task 2209 | restored context checkpoint (pos_min = 22463, pos_max = 22463, size = 186.329 MiB)
slot update_slots: id  1 | task 2209 | n_tokens = 22464, memory_seq_rm [22464, end)
slot update_slots: id  1 | task 2209 | prompt processing progress, n_tokens = 22490, batch.n_tokens = 28, progress = 0.977741
slot update_slots: id  1 | task 2209 | n_tokens = 22490, memory_seq_rm [22490, end)
slot update_slots: id  1 | task 2209 | prompt processing progress, n_tokens = 23002, batch.n_tokens = 514, progress = 1.000000
slot update_slots: id  1 | task 2209 | prompt done, n_tokens = 23002, batch.n_tokens = 514
slot init_sampler: id  1 | task 2209 | init sampler, took 2.36 ms, tokens: text = 23002, total = 23002
slot print_timing: id  2 | task 2187 |
prompt eval time =    1810.15 ms /   377 tokens (    4.80 ms per token,   208.27 tokens per second)
       eval time =    5987.66 ms /    26 tokens (  230.29 ms per token,     4.34 tokens per second)
      total time =    7797.82 ms /   403 tokens
slot      release: id  2 | task 2187 | stop processing: n_tokens = 402, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  1 | task 2209 |
prompt eval time =     912.36 ms /   538 tokens (    1.70 ms per token,   589.68 tokens per second)
       eval time =    2517.84 ms /    82 tokens (   30.71 ms per token,    32.57 tokens per second)
      total time =    3430.19 ms /   620 tokens
slot      release: id  1 | task 2209 | stop processing: n_tokens = 23083, truncated = 0
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  0 | task 2188 |
prompt eval time =   12454.58 ms /  9683 tokens (    1.29 ms per token,   777.47 tokens per second)
       eval time =    7658.14 ms /   106 tokens (   72.25 ms per token,    13.84 tokens per second)
      total time =   20112.72 ms /  9789 tokens
slot      release: id  0 | task 2188 | stop processing: n_tokens = 23107, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

@Mikec78660

I built llama.cpp about 2 days ago and I still see this issue with both Qwen3-Coder-Next-Q6_K-00001-of-00004.gguf and Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf. It works for a few checkpoints, then a checkpoint invalidates and the server has to reprocess everything; then it works for a few more checkpoints and the cycle repeats.

Could any of these switches be causing an issue?

./llama-server --device CUDA0,CUDA1,CUDA2,CUDA3,CUDA4 --host 0.0.0.0 -m /mnt/llama.cpp/models/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf  -ngl 999 -np 1 --tensor-split 2.8,3,3,3,3 --cache-type-k q8_0 --cache-type-v q8_0 --threads 12 --batch-size 4096 --ubatch-size 512 --threads-batch 10 --no-mmap  --mlock  --cache-ram 16384 --keep 8192 --ctx-checkpoints 128  --kv-unified --main-gpu 0 --no-host --reasoning-budget 0 --perf --backend-sampling --direct-io -lcd /tmp/cache_file --ctx-size 262144 --cache-reuse 64 --swa-full --perf

@jmander11

jmander11 commented Feb 22, 2026

Edit: After reading through this PR, --swa-full worked! The log message should be updated to mention this option, along with its higher memory usage and any other effects. That fixed one of my two cache-miss issues. The second was Open WebUI: I had set RAG_SYSTEM_CONTEXT=True, which causes the system prompt to change with web_search (not fetch_url). Setting it to True is ideal for cache hits with RAG, but for web_search (which isn't RAG) it needs to be False to get cache hits. When I use both RAG and web_search I'm not sure how I'll solve it, but that's not a llama.cpp issue.

I'm also hitting this with llama-server and gpt-oss-20b:
slot update_slots: id 2 | task 319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see #13194 (comment))

It's making web-search calls, and the whole prefill starts over from scratch every time. The agentic behaviour slows to a crawl because of all the full prompt reprocessing. Is there a workaround, or a fix being worked on in another PR?

Is there an issue with my settings?
I am using an amd r9700 with docker image docker.io/kyuz0/amd-r9700-toolboxes:rocm-7.9 and the following llama-server args:
--model /models/gpt-oss-20b-mxfp4-gguf/gpt-oss-20b-mxfp4.gguf -c 393000 --alias "GPT" -np 3 -fa on --cache-prompt --cache-type-k q8_0 --cache-type-v q8_0 -ngl 999 -t 8 -tb 8 -b 8192 -ub 4096 --no-mmap --host 0.0.0.0 --port 8080 --temp 0.5 --top-p 1.0 --min-p 0.02 --repeat-penalty 1.05 --keep -1 --ctx-checkpoints 128 --swa-checkpoints 0 --slot-prompt-similarity 0.85 --chat-template-kwargs '{"reasoning_effort":"high"}'

I have been playing around with --swa-checkpoints being 0 and a high value and neither helped.
Why does the llama-server log point at this PR as if it should solve the issue? Was this a missed case, and should the log message point to a newer PR, which may or may not exist? This merged quite a while ago, so I assume everyone hitting this has it in their builds; I don't understand what is going on with the log message and behaviour :P
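For anyone puzzled by why SWA breaks cache reuse: per the PR description, the SWA cache only holds roughly the last n_swa tokens per sequence, sized as PAD(n_swa*n_seq_max + n_batch), and older tokens are pruned as the window slides. The sketch below is illustrative only (not llama.cpp source; the `pad` granularity and function names are assumptions) and just shows why restoring a position outside the window is impossible, forcing full re-processing:

```python
# Illustrative sketch (not llama.cpp source): why an SWA cache cannot restore
# arbitrary positions. The padding granularity here is an assumption.

def pad(n, align=32):
    # Hypothetical padding to a fixed granularity; the real scheme may differ.
    return ((n + align - 1) // align) * align

def swa_cache_size(n_swa, n_seq_max, n_batch):
    # PAD(n_swa * n_seq_max + n_batch), as described in the PR overview.
    return pad(n_swa * n_seq_max + n_batch)

def can_restore(pos, pos_max, n_swa):
    # A position is reusable only while it is still inside the sliding window;
    # anything older has been pruned, so the prompt must be fully re-processed.
    return pos > pos_max - n_swa

print(swa_cache_size(n_swa=128, n_seq_max=4, n_batch=512))  # 1024
print(can_restore(pos=20, pos_max=8770, n_swa=128))         # False -> reprocess
```

With --swa-full, the SWA cache is instead expanded to the full context and pruning is disabled, which is why it avoids these misses at the cost of memory.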

apmantza pushed a commit to apmantza/helots that referenced this pull request Mar 6, 2026
…vailabl: id ...

What's happening now? I'm getting this on the server side: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.887
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 74 | processing task, is_child = 0
slot update_slots: id  0 | task 74 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 557
slot update_slots: id  0 | task 74 | n_past = 557, slot.prompt.tokens.size() = 628, seq_id = 0, pos_min = 627, n_swa = 1
slot update_slots: id  0 | task 74 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see ggml-org/llama.cpp#13194 (comment))
slot update_slots: id  0 | task 74 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 74 | prompt processing progress, n_tokens = 45, batch.n_tokens = 45, progress = 0.080790
slot update_slots: id  0 | task 74 | n_tokens = 45, memory_seq_rm [45, end)
slot init_sampler: id  0 | task 74 | init sampler, took 0.05 ms, tokens: text = 557, total = 557
slot update_slots: id  0 | task 74 | prompt processing done, n_tokens = 557, batch.n_tokens = 512
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id  0 | task 74 |
prompt eval time =     314.07 ms /   557 tokens (    0.56 ms per token,  1773.47 tokens per second)
       eval time =     857.92 ms /    93 tokens (    9.22 ms per token,   108.40 tokens per second)
      total time =    1171.99 ms /   650 tokens
slot      release: id  0 | task 74 | stop processing: n_tokens = 649, truncated = 0
srv  update_slots: all slots are idle
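The "selected slot by LCP similarity" lines above pick a slot by comparing the new prompt against each slot's cached prompt by longest common prefix. A rough sketch of such a score, purely illustrative (function names and the 0.100 threshold are taken from the log, not from llama.cpp's actual implementation):

```python
# Illustrative sketch (not llama.cpp source): scoring a cached prompt against
# a new prompt by longest-common-prefix (LCP) similarity.

def lcp_len(a, b):
    # Length of the shared token prefix of sequences a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lcp_similarity(cached, new):
    # Fraction of the new prompt covered by the shared prefix.
    return lcp_len(cached, new) / max(1, len(new))

cached = [1, 2, 3, 4, 5, 6]
new    = [1, 2, 3, 9, 9]
sim = lcp_similarity(cached, new)
print(sim)  # 0.6
reuse = sim > 0.1  # analogous to the 0.100 threshold in the logs above
```

Note that even with sim_best = 1.000, SWA can still force full re-processing: the matching prefix is only useful if the needed positions are still inside the sliding window.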