Name and Version
llama-server version: 8234 (213c4a0b8)
- Platform: NVIDIA Orin (CUDA)
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA Jetson AGX Orin 64GB
Models
qwen3.5-35b-a3b
Problem description & steps to reproduce
llama-server \
  --ctx-size 32768 \
  --gpu-layers 999 \
  --batch-size 3072 \
  --ubatch-size 256 \
  --threads $(nproc) \
  --flash-attn on \
  --override-kv general.name=str:RedQueen \
  --kv-unified \
  -np 1 \
  -nocb \
  --no-slots \
  --cache-ram 0
First Bad Commit
No response
Relevant log output
First request:
full prompt: 3047 tokens
created checkpoint: n_tokens = 2535
Second request with the exact same prompt:
restored checkpoint: n_tokens = 2535
prompt eval still runs for 512 tokens (3047 - 2535), even though the prompt is identical
Later requests with appended user input:
prompt grows to 3077, then 3090 tokens
restored checkpoint is always still at n_tokens = 2535
prompt eval grows to 542, then 555 tokens (everything after the checkpoint is re-evaluated each time)
created context checkpoint 1 of 32 (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)
restored context checkpoint (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)
prompt eval time = 1055.79 ms / 512 tokens
prompt eval time = 1251.87 ms / 542 tokens
prompt eval time = 1298.17 ms / 555 tokens
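To make the pattern explicit, here is a small Python sketch (not part of llama.cpp) that checks the arithmetic in the log lines above. The regular expressions assume the exact log format quoted in this report, and the prompt sizes are copied by hand from the summary; the variable names are only illustrative.

```python
import re

# Patterns assume the log format quoted in this report.
CHECKPOINT_RE = re.compile(r"n_tokens = (\d+)")
EVAL_RE = re.compile(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens")

restore_line = (
    "restored context checkpoint (pos_min = 2534, pos_max = 2534, "
    "n_tokens = 2535, size = 62.813 MiB)"
)
eval_lines = [
    "prompt eval time = 1055.79 ms / 512 tokens",
    "prompt eval time = 1251.87 ms / 542 tokens",
    "prompt eval time = 1298.17 ms / 555 tokens",
]
# Prompt sizes taken from the request summary in this report.
prompt_sizes = [3047, 3077, 3090]

checkpoint_tokens = int(CHECKPOINT_RE.search(restore_line).group(1))

rows = []
for prompt_tokens, line in zip(prompt_sizes, eval_lines):
    ms_text, evaluated_text = EVAL_RE.search(line).groups()
    # Tokens that lie after the restored checkpoint position.
    expected = prompt_tokens - checkpoint_tokens
    rows.append((prompt_tokens, expected, int(evaluated_text), float(ms_text)))

for prompt_tokens, expected, evaluated, ms in rows:
    tokens_per_second = evaluated / (ms / 1000.0)
    print(
        f"prompt={prompt_tokens} expected_reeval={expected} "
        f"logged_reeval={evaluated} match={expected == evaluated} "
        f"speed={tokens_per_second:.1f} tok/s"
    )
```

In all three requests the logged prompt eval count equals the prompt length minus 2535, so the re-evaluated span grows with every appended turn while the checkpoint position never advances.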
Is this expected for Qwen3.5 hybrid/recurrent models?
Is this a known checkpoint restore/creation issue?
Is there a newer commit or recommended flag to fix this?