
Eval bug: Qwen3-Coder-Next Poor Outputs #19305

@will-lms

Name and Version

$ ./build/bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 7925 (8bece2e)
built with GNU 13.3.0 for Linux aarch64

Operating systems

Linux

GGML backends

CUDA

Hardware

DGX Spark

Models

Same behavior observed on all quants I tested:

  1. lmstudio-community/Qwen3-Coder-Next-GGUF @ Q4_K_M and Q8_0
  2. Qwen/Qwen3-Coder-Next-GGUF @ Q4_K_M

Problem description & steps to reproduce

Qwen3-Coder-Next gets tripped up on "syntax issues" when run in llama-server. To reproduce, I run the attached prompt.txt, which asks whether server.cpp has any syntax errors.

This specific prompt is not the most "realistic," but it clearly demonstrates an issue I see whenever I ask the model to "review the changes in file X." It smells to me like an implementation bug, given that the output is noticeably worse in llama-server at both 4-bit and 8-bit quants than in vLLM running FP8 or MLX running int4.
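For anyone who wants to replay this against their own build, here is a minimal sketch of how the prompt could be sent. It assumes prompt.txt is passed as a single user message to the `/v1/chat/completions` endpoint on `http://127.0.0.1:8080` (both appear in the log below). The helper names `build_payload` and `send_prompt` are hypothetical, and the optional `temperature` override is my assumption, not something the original runs set.

```python
import json
import urllib.request


def build_payload(prompt_text, temperature=None):
    """Build a chat-completions request body with one user message.

    Sampling is left to the server defaults unless `temperature` is given
    (an assumption for illustration; the original runs did not set it).
    """
    payload = {"messages": [{"role": "user", "content": prompt_text}]}
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


def send_prompt(payload, base_url="http://127.0.0.1:8080"):
    """POST the payload to an OpenAI-compatible server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running llama-server and the attached prompt.txt):
#   with open("prompt.txt", encoding="utf-8") as f:
#       print(send_prompt(build_payload(f.read())))
```

Passing `temperature=0.0` may help separate sampling noise from a genuine implementation difference, since repeated runs should then be much closer to each other.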

llama-server Outputs:

  • llama-server-out.txt: uses lmstudio-community/Qwen3-Coder-Next-GGUF @ Q4_K_M. Shows poor output with misidentified syntax errors, seemingly stumbling over itself.
  • llama-server-out-q8.txt: uses lmstudio-community/Qwen3-Coder-Next-GGUF @ Q8_0. Shows similarly poor output to Q4.
  • llama-server-out-qwen-quant.txt: uses Qwen/Qwen3-Coder-Next-GGUF @ Q4_K_M. Shows similar output to the lmstudio-community quants.

Outputs from other engines:

  • vllm-fp8.txt: vLLM running the Qwen/Qwen3-Coder-Next-FP8 model. Does not report any incorrect errors.
  • lms-mlx-engine-out.txt: mlx-engine running the lmstudio-community/Qwen3-Coder-Next-MLX-4bit model in LM Studio. Does not report any incorrect errors.

First Bad Commit

I did not observe a commit where this was previously working.

Relevant log output

Logs
./build/bin/llama-server -m ~/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -c 65536
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7925 (8bece2eb2) with GNU 13.3.0 for Linux aarch64
system info: n_threads = 20, n_threads_batch = 20, total_threads = 20

system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/lms/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 82704 MiB of device memory vs. 114941 MiB of free device memory
llama_params_fit_impl: will leave 32236 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.19 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GB10) (000f:01:00.0) - 115242 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 42 key-value pairs and 843 tensors from /home/lms/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3next
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 40
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen_Qwen3 Coder Next
llama_model_loader: - kv   6:                         general.size_label str              = 80B
llama_model_loader: - kv   7:                      qwen3next.block_count u32              = 48
llama_model_loader: - kv   8:                   qwen3next.context_length u32              = 262144
llama_model_loader: - kv   9:                 qwen3next.embedding_length u32              = 2048
llama_model_loader: - kv  10:              qwen3next.feed_forward_length u32              = 5120
llama_model_loader: - kv  11:             qwen3next.attention.head_count u32              = 16
llama_model_loader: - kv  12:          qwen3next.attention.head_count_kv u32              = 2
llama_model_loader: - kv  13:                   qwen3next.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  14: qwen3next.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                qwen3next.expert_used_count u32              = 10
llama_model_loader: - kv  16:             qwen3next.attention.key_length u32              = 256
llama_model_loader: - kv  17:           qwen3next.attention.value_length u32              = 256
llama_model_loader: - kv  18:                     qwen3next.expert_count u32              = 512
llama_model_loader: - kv  19:       qwen3next.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  20: qwen3next.expert_shared_feed_forward_length u32              = 512
llama_model_loader: - kv  21:                  qwen3next.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  22:                   qwen3next.ssm.state_size u32              = 128
llama_model_loader: - kv  23:                  qwen3next.ssm.group_count u32              = 16
llama_model_loader: - kv  24:               qwen3next.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  25:                   qwen3next.ssm.inner_size u32              = 4096
llama_model_loader: - kv  26:             qwen3next.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% macro render_extra_keys(json_dict,...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - kv  38:                          general.file_type u32              = 7
llama_model_loader: - kv  39:                                   split.no u16              = 0
llama_model_loader: - kv  40:                        split.tensors.count i32              = 843
llama_model_loader: - kv  41:                                split.count u16              = 3
llama_model_loader: - type  f32:  313 tensors
llama_model_loader: - type q8_0:  482 tensors
llama_model_loader: - type bf16:   48 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 78.98 GiB (8.52 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch                  = qwen3next
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2048
print_info: n_embd_inp            = 2048
print_info: n_layer               = 48
print_info: n_head                = 16
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 8
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 5120
print_info: n_expert              = 512
print_info: n_expert_used         = 10
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 5000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 4096
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 32
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 80B.A3B
print_info: model params          = 79.67 B
print_info: general.name          = Qwen_Qwen3 Coder Next
print_info: vocab type            = BPE
print_info: n_vocab               = 151936
print_info: n_merges              = 151387
print_info: BOS token             = 151643 '<|endoftext|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151643 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
load_tensors:        CUDA0 model buffer size = 80562.07 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 65536
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (65536) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.32 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1536.00 MiB
llama_kv_cache: size = 1536.00 MiB ( 65536 cells,  12 layers,  4/1 seqs), K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =   301.50 MiB
llama_memory_recurrent: size =  301.50 MiB (     4 cells,  48 layers,  4 seqs), R (f32):   13.50 MiB, S (f32):  288.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA0 compute buffer size =   304.75 MiB
sched_reserve:  CUDA_Host compute buffer size =   136.01 MiB
sched_reserve: graph nodes  = 9374 (with bs=512), 5918 (with bs=1)
sched_reserve: graph splits = 2
sched_reserve: reserve took 168.35 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 4
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 65536
no implementations specified for speculative decoding
slot   load_model: id  1 | task -1 | speculative decoding context not initialized
slot   load_model: id  1 | task -1 | new slot, n_ctx = 65536
no implementations specified for speculative decoding
slot   load_model: id  2 | task -1 | speculative decoding context not initialized
slot   load_model: id  2 | task -1 | new slot, n_ctx = 65536
no implementations specified for speculative decoding
slot   load_model: id  3 | task -1 | speculative decoding context not initialized
slot   load_model: id  3 | task -1 | new slot, n_ctx = 65536
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv          init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Qwen3 Coder
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 3051
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.671255
slot update_slots: id  3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2987, batch.n_tokens = 939, progress = 0.979023
slot update_slots: id  3 | task 0 | n_tokens = 2987, memory_seq_rm [2987, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 3051, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 3051, batch.n_tokens = 64
slot init_sampler: id  3 | task 0 | init sampler, took 0.33 ms, tokens: text = 3051, total = 3051
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 2986, pos_max = 2986, size = 75.376 MiB)
slot print_timing: id  3 | task 0 |
prompt eval time =    4135.85 ms /  3051 tokens (    1.36 ms per token,   737.70 tokens per second)
       eval time =  112321.23 ms /  3501 tokens (   32.08 ms per token,    31.17 tokens per second)
      total time =  116457.08 ms /  6552 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 6551, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122548 = 30640 + (82704 = 80562 +    1837 +     304) +        9203 |
llama_memory_breakdown_print: |   - Host               |                     451 =   315 +       0 +     136                |
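
As a sanity check on the log above, the reported KV cache size can be reproduced from the logged dimensions. This is a sketch under the assumption that the buffer size is simply cells × layers × per-layer K (or V) width × bytes per element. Every number is copied from the log; the formula itself is my assumption and is not taken from the llama.cpp source.

```python
# Values copied from the log above.
n_cells = 65536        # llama_kv_cache: 65536 cells (n_ctx)
n_kv_layers = 12       # llama_kv_cache: 12 layers use the KV cache
n_embd_k_gqa = 512     # print_info: n_embd_k_gqa
n_embd_v_gqa = 512     # print_info: n_embd_v_gqa
bytes_per_f16 = 2      # K (f16) and V (f16)

MIB = 1024 * 1024


def cache_mib(n_embd_gqa):
    """Size in MiB of one cache (K or V) under the assumed formula."""
    return n_cells * n_kv_layers * n_embd_gqa * bytes_per_f16 / MIB


k_mib = cache_mib(n_embd_k_gqa)
v_mib = cache_mib(n_embd_v_gqa)
total_mib = k_mib + v_mib

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
# prints "K = 768.00 MiB, V = 768.00 MiB, total = 1536.00 MiB"
```

The result matches the `llama_kv_cache: size = 1536.00 MiB` line, so the cache allocation at least looks consistent with the model metadata for the 12 layers that use it.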

Metadata

Labels

  • bug: Something isn't working
  • generation quality: Quality of model output
  • hot: Something that is hot
  • model: Model specific
