Qwen3.5 35B in llama-server keeps re-evaluating ~512 tail tokens on every turn #20239

@alex2robotic

Name and Version

  • llama-server version: 8234 (213c4a0b8)
  • Platform: NVIDIA Orin (CUDA)

Operating systems

Linux

GGML backends

CUDA

Hardware

Jetson AGX Orin 64GB

Models

qwen3.5-35b-a3b

Problem description & steps to reproduce

llama-server \
  --ctx-size 32768 \
  --gpu-layers 999 \
  --batch-size 3072 \
  --ubatch-size 256 \
  --threads $(nproc) \
  --flash-attn on \
  --override-kv general.name=str:RedQueen \
  --kv-unified \
  -np 1 \
  -nocb \
  --no-slots \
  --cache-ram 0
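
To reproduce the comparison without reading server logs, something like the following could send the identical prompt twice and report how many prompt tokens were processed each time. This is only a sketch: it assumes the server listens on 127.0.0.1:8080 and that requests go through the `/completion` endpoint with `cache_prompt` enabled, neither of which is stated in this report. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address; adjust to wherever llama-server is actually listening.
SERVER_URL = "http://127.0.0.1:8080/completion"

def build_payload(prompt, n_predict=16):
    """Build a /completion request body that asks the server to reuse its prompt cache."""
    return {"prompt": prompt, "n_predict": n_predict, "cache_prompt": True}

def prompt_tokens_evaluated(response):
    """Read how many prompt tokens were processed from the response's 'timings' object."""
    return response["timings"]["prompt_n"]

def send(prompt, url=SERVER_URL):
    """POST one completion request and return the decoded JSON response."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def compare_two_runs(prompt):
    """Send the identical prompt twice; return the evaluated-token count of each run."""
    return [prompt_tokens_evaluated(send(prompt)) for _ in range(2)]
```

Calling `compare_two_runs(long_prompt)` against a running server should return two counts. If the behaviour described here reproduces, the second count would stay near 512 for a 3047-token prompt instead of dropping to a handful of tokens.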

First Bad Commit

No response

Relevant log output

First request:

full prompt: 3047 tokens

created checkpoint: n_tokens = 2535

Second request with the exact same prompt:

restored checkpoint: n_tokens = 2535

prompt eval still runs for 512 tokens (3047 - 2535, i.e. everything after the checkpoint)

Later requests with appended user input:

prompt grows to 3077, 3090

restored checkpoint is always n_tokens = 2535

prompt eval becomes 542 and 555 tokens (again the prompt length minus 2535)
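
The re-evaluated counts line up exactly with the gap between each prompt length and the restored checkpoint. A quick check of that arithmetic, using only the numbers reported above (the variable and function names are mine, not llama-server's):

```python
# Numbers taken from the log excerpts in this report.
CHECKPOINT_N_TOKENS = 2535
prompt_lengths = [3047, 3077, 3090]

def tail_tokens(prompt_len, checkpoint_n_tokens=CHECKPOINT_N_TOKENS):
    """Tokens left to evaluate after restoring a checkpoint at checkpoint_n_tokens."""
    return prompt_len - checkpoint_n_tokens

reevaluated = [tail_tokens(n) for n in prompt_lengths]
print(reevaluated)  # [512, 542, 555]
```

So the fixed 2535-token checkpoint accounts for the whole effect: no newer checkpoint is created closer to the end of the prompt, and the tail grows with every appended turn.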

created context checkpoint 1 of 32 (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)

restored context checkpoint (pos_min = 2534, pos_max = 2534, n_tokens = 2535, size = 62.813 MiB)

prompt eval time = 1055.79 ms / 512 tokens
prompt eval time = 1251.87 ms / 542 tokens
prompt eval time = 1298.17 ms / 555 tokens
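
For a sense of what the re-evaluation costs per turn, the three timing lines can be converted to throughput. A minimal sketch, assuming the lines have the shortened form quoted above (real llama-server log lines carry extra fields after the token count, which this pattern simply ignores):

```python
import re

# The three timing lines quoted in this report.
LOG_LINES = """\
prompt eval time = 1055.79 ms / 512 tokens
prompt eval time = 1251.87 ms / 542 tokens
prompt eval time = 1298.17 ms / 555 tokens
"""

PATTERN = re.compile(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")

def parse_prompt_eval(text):
    """Return (milliseconds, tokens) pairs, one per 'prompt eval time' line."""
    return [(float(ms), int(n)) for ms, n in PATTERN.findall(text)]

rows = parse_prompt_eval(LOG_LINES)
for ms, n in rows:
    # Throughput in tokens per second for this prompt evaluation.
    print(f"{n} tokens in {ms:.2f} ms -> {n / (ms / 1000):.1f} tokens/s")
```

On these numbers the prompt evaluation runs at roughly 425 to 485 tokens/s, so each turn pays a bit over one second for tokens that were already processed on the previous request.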

Is this expected for Qwen3.5 hybrid/recurrent models?

Is this a known checkpoint restore/creation issue?

Is there a newer commit or recommended flag to fix this?
