Qwen3.5 35B in llama-server keeps re-evaluating ~512 tail tokens on every turn #20239

### Name and Version

- `llama-server` version: `8234 (213c4a0b8)`
- Platform: NVIDIA Orin (CUDA)

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

Jetson AGX Orin 64GB

### Models

qwen3.5-35b-a3b

### Problem description & steps to reproduce

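With the server started as shown below, every request re-evaluates the tokens after the last context checkpoint, both when the exact same prompt is resent and when user input is appended to it.

    llama-server \
      --ctx-size 32768 \
      --gpu-layers 999 \
      --batch-size 3072 \
      --ubatch-size 256 \
      --threads $(nproc) \
      --flash-attn on \
      --override-kv general.name=str:RedQueen \
      --kv-unified \
      -np 1 \
      -nocb \
      --no-slots \
      --cache-ram 0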
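
One way to reproduce this is to send an identical prompt twice and compare how many prompt tokens the server reports evaluating. The following is a minimal sketch under stated assumptions, not the client that produced the logs in this report: it assumes the `/completion` endpoint with `cache_prompt` enabled and the `timings.prompt_n` field in the JSON response. `LLAMA_URL`, `build_payload`, and `extract_prompt_n` are hypothetical names introduced for illustration, and the placeholder prompt must be replaced with the real long prompt.

```shell
#!/usr/bin/env bash
# Sketch: send the same prompt twice and print the number of prompt tokens
# the server evaluated each time. Assumes llama-server's /completion endpoint
# and its "timings.prompt_n" response field. LLAMA_URL is a hypothetical
# variable; nothing is sent unless it is set (e.g. http://localhost:8080).

# Build a request body. The prompt must not contain characters that need
# JSON escaping (quotes, backslashes, newlines); this sketch does not escape.
build_payload() {
  printf '{"prompt": "%s", "n_predict": 8, "cache_prompt": true}' "$1"
}

# Read a JSON response on stdin and print the integer value of "prompt_n".
extract_prompt_n() {
  grep -o '"prompt_n": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

if [ -n "${LLAMA_URL:-}" ]; then
  prompt="Replace this with the long prompt under test"
  for attempt in 1 2; do
    n=$(curl -s "$LLAMA_URL/completion" \
          -H 'Content-Type: application/json' \
          -d "$(build_payload "$prompt")" | extract_prompt_n)
    echo "request $attempt: prompt tokens evaluated = $n"
  done
fi
```

Run it as `LLAMA_URL=http://localhost:8080 bash repro.sh` (the file name is arbitrary). If the prompt cache were fully reused, the second request would be expected to report far fewer evaluated tokens than the first.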

### First Bad Commit

_No response_

### Relevant log output

First request:

- full prompt: 3047 tokens
- created checkpoint: `n_tokens = 2535`

Second request with the exact same prompt:

- restored checkpoint: `n_tokens = 2535`
- prompt eval still runs for 512 tokens (3047 - 2535)

Later requests with appended user input:

- prompt grows to 3077, then 3090 tokens
- the restored checkpoint is always `n_tokens = 2535`
- prompt eval becomes 542 and 555 tokens (3077 - 2535 and 3090 - 2535)


Raw log lines:

    created context checkpoint 1 of 32 (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)
    restored context checkpoint (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)
    prompt eval time = 1055.79 ms / 512 tokens
    prompt eval time = 1251.87 ms / 542 tokens
    prompt eval time = 1298.17 ms / 555 tokens
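
The re-evaluated counts in the log are consistent with a simple difference: prompt length minus the `n_tokens` of the restored checkpoint. A small sketch of that arithmetic, using only the numbers reported above (`reeval_tokens` is a hypothetical helper name):

```shell
# Tokens re-evaluated = prompt length - tokens covered by the restored checkpoint.
reeval_tokens() {
  echo $(( $1 - $2 ))
}

checkpoint=2535   # n_tokens of the restored checkpoint in the log
for prompt_len in 3047 3077 3090; do
  echo "prompt=$prompt_len re-evaluated=$(reeval_tokens "$prompt_len" "$checkpoint")"
done
```

This prints 512, 542, and 555, matching the three `prompt eval time` lines. The server therefore appears to resume from the same checkpoint at 2535 tokens on every request rather than from a later position.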



Is this expected for Qwen3.5 hybrid/recurrent models?

Is this a known checkpoint restore/creation issue?

Is there a newer commit or recommended flag to fix this?
