
Support Step3.5-Flash #19283

Merged
CISC merged 18 commits into ggml-org:master from forforever73:pr/step3.5-flash on Feb 6, 2026

Conversation


@forforever73
Contributor Author

Adding supplemental evaluation results for reference.

Performance

https://github.com/stepfun-ai/Step-3.5-Flash/blob/main/llama.cpp/docs/step3.5-flash.md

Accuracy

Accuracy was evaluated against a BF16 vLLM baseline.

Tested at the maximum 256k context on 8 × H200 devices.

| Dataset | vLLM BF16 Baseline | step3.5_flash_fp16.gguf |
|---|---|---|
| IFEVAL (keywords / existence) | 98.08% (±2.13) | 98.33% (±2.89) |
| HMMT25 | 98.44% (±1.86) | 97.50% |

Tested at the maximum 256k context on a Mac Studio.
Repeated 64 times and averaged.

| Model | Device | Repeats | Average |
|---|---|---|---|
| vLLM BF16 baseline | H200 | 64 | 84.38% |
| step3.5_flash_Q4_K_S.gguf | Mac Studio | 64 | 82.89% |

@IIIIIllllIIIIIlllll

IIIIIllllIIIIIlllll commented Feb 3, 2026

great work! thank you!
It works fine on my 395, about 22 token/s.
image

@gopinath87607

Is this exactly the same modification as the one made in the forked Step llama.cpp, or is it a new one?

@forforever73
Contributor Author

@gopinath87607 The register name (step3p5) was modified in the convert_hf_to_gguf part. Everything else is exactly the same.
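For context, a rough sketch of the registration pattern being described; the registry and model class below are illustrative stand-ins, not the real `convert_hf_to_gguf.py` code, and only the registered architecture name is the point:

```python
# Minimal stand-in for convert_hf_to_gguf's registration pattern.
# The registry/class names here are illustrative, not the actual code.

class ModelRegistry:
    _models: dict = {}

    @classmethod
    def register(cls, *names):
        def wrap(model_cls):
            for name in names:
                cls._models[name] = model_cls  # map HF arch -> converter class
            return model_cls
        return wrap

@ModelRegistry.register("Step3p5ForCausalLM")  # the renamed register entry
class Step3p5Model:
    model_arch = "step35"  # GGUF architecture string seen in the logs below
```

Under this pattern, renaming the register entry is the only conversion-side change; everything downstream keys off the GGUF architecture string.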

@tarruda

tarruda commented Feb 3, 2026

I tried running this branch with Codex. While it works, I see some tool-call tokens leaked into the UI:

image

Additionally, I see some warnings in llama-server

slot init_sampler: id  1 | task 3684 | init sampler, took 4.65 ms, tokens: text = 47025, total = 47025
slot update_slots: id  1 | task 3684 | erasing old context checkpoint (pos_min = 33429, pos_max = 35988, size = 330.030 MiB)
slot update_slots: id  1 | task 3684 | created context checkpoint 8 of 8 (pos_min = 44401, pos_max = 46960, size = 330.030 MiB)
slot print_timing: id  1 | task 3684 | 
prompt eval time =   10377.77 ms /  2080 tokens (    4.99 ms per token,   200.43 tokens per second)
       eval time =    6575.80 ms /   169 tokens (   38.91 ms per token,    25.70 tokens per second)
      total time =   16953.57 ms /  2249 tokens
slot      release: id  1 | task 3684 | stop processing: n_tokens = 47193, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/responses 192.168.10.78 200
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv  params_from_: Chat format: Hermes 2 Pro

@AesSedai
Contributor

AesSedai commented Feb 3, 2026

I pulled and compiled with this commit, then produced a BF16 with convert_hf_to_gguf, then attempted to run imatrix on it, and the results looked very suspect:

llama-imatrix output on commit `2f0f12e70`
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7907 (2f0f12e70) with GNU 14.2.1 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24135 total,  11876 used,  11995 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24135 total,   7252 used,  16619 free vs. target of   1024
llama_params_fit_impl: projected to use 19129 MiB of device memory vs. 47743 MiB of free device memory
llama_params_fit_impl: targets for free memory can be met on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 15.59 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:06:10.0) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:06:11.0) - 23871 MiB free
llama_model_loader: loaded meta data with 49 key-value pairs and 754 tensors from /mnt/srv/snowdrift/ggml/Step-3.5-Flash/Step-3.5-Flash-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = step35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Step 3.5 Flash
llama_model_loader: - kv   3:                         general.size_label str              = 288x7.4B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                  general.base_model.0.name str              = Step 3.5 Flash
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Stepfun Ai
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/stepfun-ai/ste...
llama_model_loader: - kv   9:                         step35.block_count u32              = 45
llama_model_loader: - kv  10:                      step35.context_length u32              = 262144
llama_model_loader: - kv  11:                    step35.embedding_length u32              = 4096
llama_model_loader: - kv  12:                 step35.feed_forward_length u32              = 11264
llama_model_loader: - kv  13:                step35.attention.head_count arr[i32,45]      = [64, 96, 96, 96, 64, 96, 96, 96, 64, ...
llama_model_loader: - kv  14:                      step35.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  15:                step35.attention.key_length u32              = 128
llama_model_loader: - kv  16:              step35.attention.value_length u32              = 128
llama_model_loader: - kv  17:                          general.file_type u32              = 32
llama_model_loader: - kv  18:             step35.attention.head_count_kv arr[i32,45]      = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
llama_model_loader: - kv  19:            step35.attention.sliding_window u32              = 512
llama_model_loader: - kv  20:    step35.attention.sliding_window_pattern arr[i32,45]      = [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, ...
llama_model_loader: - kv  21:             step35.rope.scaling.apply_mask u32              = 1
llama_model_loader: - kv  22:                        step35.expert_count u32              = 288
llama_model_loader: - kv  23:                   step35.expert_used_count u32              = 8
llama_model_loader: - kv  24:          step35.expert_feed_forward_length u32              = 1280
llama_model_loader: - kv  25:   step35.expert_shared_feed_forward_length u32              = 1280
llama_model_loader: - kv  26:                  step35.expert_gating_func u32              = 2
llama_model_loader: - kv  27:                step35.expert_weights_scale f32              = 3.000000
llama_model_loader: - kv  28:                 step35.expert_weights_norm bool             = true
llama_model_loader: - kv  29:           step35.leading_dense_block_count u32              = 3
llama_model_loader: - kv  30:                  step35.moe_every_n_layers u32              = 1
llama_model_loader: - kv  31:      step35.rope.dimension_count_per_layer arr[i32,45]      = [64, 128, 128, 128, 64, 128, 128, 128...
llama_model_loader: - kv  32:    step35.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  33:            step35.rope.freq_base_per_layer arr[f32,45]      = [5000000.000000, 10000.000000, 10000....
llama_model_loader: - kv  34:                       step35.swiglu_limits arr[f32,45]      = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  35:                step35.swiglu_limits_shared arr[f32,45]      = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  38:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  39:                      tokenizer.ggml.tokens arr[str,128896]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  40:                  tokenizer.ggml.token_type arr[i32,128896]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  41:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  42:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 128007
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  47:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  48:                    tokenizer.chat_template str              = {% macro render_content(content) %}{%...
llama_model_loader: - type  f32:  266 tensors
llama_model_loader: - type bf16:  488 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 366.95 GiB (16.00 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 128007 ('<|im_end|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8220 MB
print_info: arch                  = step35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 45
print_info: n_head                = [64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64]
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 512
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = [8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8]
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 11264
print_info: n_expert              = 288
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 5000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = ?B
print_info: model params          = 196.96 B
print_info: general.name          = Step 3.5 Flash
print_info: vocab type            = BPE
print_info: n_vocab               = 128896
print_info: n_merges              = 127741
print_info: BOS token             = 0 '<|begin▁of▁sentence|>'
print_info: EOS token             = 128007 '<|im_end|>'
print_info: EOT token             = 128007 '<|im_end|>'
print_info: PAD token             = 1 '<|end▁of▁sentence|>'
print_info: LF token              = 201 'Ċ'
print_info: FIM PRE token         = 128801 '<|fim▁begin|>'
print_info: FIM SUF token         = 128800 '<|fim▁hole|>'
print_info: FIM MID token         = 128802 '<|fim▁end|>'
print_info: EOG token             = 128007 '<|im_end|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 44 repeating layers to GPU
load_tensors: offloaded 46/46 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 375759.26 MiB
load_tensors:        CUDA0 model buffer size =  5898.51 MiB
load_tensors:        CUDA1 model buffer size =  5973.75 MiB
....................................................................................................
common_init_result: added <|im_end|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_seq     = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache:      CUDA0 KV buffer size =    48.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    48.00 MiB
llama_kv_cache: size =   96.00 MiB (  2048 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 2048 cells
llama_kv_cache:      CUDA0 KV buffer size =   136.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   128.00 MiB
llama_kv_cache: size =  264.00 MiB (  2048 cells,  33 layers,  1/1 seqs), K (f16):  132.00 MiB, V (f16):  132.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA0 compute buffer size =  5794.25 MiB
sched_reserve:      CUDA1 compute buffer size =  1103.00 MiB
sched_reserve:  CUDA_Host compute buffer size =    96.09 MiB
sched_reserve: graph nodes  = 3422
sched_reserve: graph splits = 151 (with bs=2048), 87 (with bs=1)
sched_reserve: reserve took 21.70 ms, sched copies = 1

system_info: n_threads = 56 (n_threads_batch = 56) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 585.593 ms
compute_imatrix: computing over 200 chunks, n_ctx=2048, batch_size=2048, n_seq=1
compute_imatrix: 12.43 seconds per pass - ETA 41.43 minutes
[1]86644.2818,[2]87846.2570,[3]85126.1948,[4]85482.5234,[5]86821.5460,[6]86843.7771,[7]85988.4366,[8]87247.1141,[9]88148.0087,
save_imatrix: entry '               blk.43.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.42.ffn_down_exps.weight' has partial data (3.12%)
save_imatrix: entry '             blk.39.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.38.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.39.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.37.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.36.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.40.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.35.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.35.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '               blk.34.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.34.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.33.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.33.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '               blk.39.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.32.ffn_down_exps.weight' has partial data (3.12%)
save_imatrix: entry '               blk.32.ffn_up_exps.weight' has partial data (3.12%)
save_imatrix: entry '             blk.34.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.31.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.31.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.40.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.43.ffn_gate_exps.weight' has partial data (2.78%)

I canceled it because the partial data for the experts and the 80,000+ PPL make it seem like something has gone wrong in the conversion or inference process somewhere.
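As a side note, the partial-data percentages by themselves are consistent with this model's routing configuration (8 of 288 experts active per token), so sparse coverage after only nine chunks is expected; the 80,000+ PPL is the stronger signal of a real problem. A quick check of the arithmetic:

```python
# The "partial data" percentages in the log match expert-activation
# coverage for this model: 288 routed experts, 8 used per token.
n_expert = 288
print(f"{8 / n_expert:.2%}")  # matches the 2.78% entries
print(f"{9 / n_expert:.2%}")  # matches the 3.12% entries
```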

@IIIIIllllIIIIIlllll

IIIIIllllIIIIIlllll commented Feb 3, 2026


The same issue, about 'tool_call'.

Edited: However, the result is correct; it indeed helped me write the HTML game I wanted. @forforever73

@drrros
Contributor

drrros commented Feb 3, 2026

same in cline
image

running with LLAMA_SET_ROWS=1 ./build/bin/llama-server --model /mnt/ds1nfs/codellamaweights/stepfun/step3p5_flash_Q4_K_S.gguf --port 30000 --host 192.168.0.60 -c $((256*1024)) -fa on --reasoning-format auto --no-mmap --jinja --temp 1.0

Speed I'm getting:

prompt eval time =  266478.36 ms / 47459 tokens (    5.61 ms per token,   178.10 tokens per second)
       eval time =    3651.26 ms /   141 tokens (   25.90 ms per token,    38.62 tokens per second)
      total time =  270129.61 ms / 47600 tokens

This is on Epyc 9274f \ 12*32Gb 4800 MT/s \ dual Nvidia A5000

@forforever73
Contributor Author

@AesSedai Sorry about that. For now, please use the pre-quantized GGUF model: https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4
This is because an offline +1 adjustment was applied to the weights before conversion. I’ll move this part into convert_hf_to_gguf as soon as possible.
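Folding such an adjustment into conversion could look roughly like this; a minimal sketch, assuming the offset targets a bias-like tensor (the comment does not say which tensor, so the name pattern here is hypothetical):

```python
# Illustrative sketch only: fold the offline "+1" offset into the
# conversion step so the original HF checkpoint can be converted directly.
# The "router.bias" target is an assumption, not named in the PR.

def adjust_tensor(name: str, data: list) -> list:
    """Apply the offline '+1' offset during conversion (hypothetical target)."""
    if name.endswith("router.bias"):
        return [x + 1.0 for x in data]
    return data
```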

@eauchs

eauchs commented Feb 3, 2026

I get around 23 tokens/s with an M3 Max 128 GB, this is really great!
image

@forforever73
Contributor Author

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

@IIIIIllllIIIIIlllll

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

After testing, I found that this bug occurs when multiple MCP tools are provided.

If there is only one MCP tool, this issue (perhaps) does not occur.

@tarruda

tarruda commented Feb 3, 2026

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

Looking forward to it!

This is the best LLM I could run locally so far, thank you for it!

@joonanykanen

@tarruda I do share your thoughts. This model seems extremely intelligent. Running ~16tok/s with 2xRTX3090 and 128GB DDR4. Makes me want to invest in Pro 6000 Blackwells lmao!

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

If someone wants a version with fully working reasoning + tool calling, I've added a cherry-picked version of my autoparser branch. Already tested with OpenCode and works great so far.

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

@tarruda

tarruda commented Feb 3, 2026

If someone wants a version with fully working reasoning + tool calling, I've added a cherry-picked version of my autoparser branch. Already tested with OpenCode and works great so far.

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

Thank you @pwilkin, will use that branch for now!

@drrros
Contributor

drrros commented Feb 3, 2026

@pwilkin

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

This doesn't compile for me:

git status
On branch autoparser-stepfun
Your branch is up to date with 'origin/autoparser-stepfun'.

nothing to commit, working tree clean
...
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON && cmake --build build --config Release -j 24
...
/bin/ld: ../../bin/libllama.so.0.0.7931: undefined reference to `bool llama_model_loader::get_key_or_arr<float, 512ul>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::array<float, 512ul>&, unsigned int, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [examples/simple/CMakeFiles/llama-simple.dir/build.make:102: bin/llama-simple] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:3749: examples/simple/CMakeFiles/llama-simple.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
[ 65%] Building CXX object common/CMakeFiles/common.dir/json-partial.cpp.o
[ 65%] Linking CXX executable ../../bin/llama-simple-chat
/bin/ld: ../../bin/libllama.so.0.0.7931: undefined reference to `bool llama_model_loader::get_key_or_arr<float, 512ul>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::array<float, 512ul>&, unsigned int, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [examples/simple-chat/CMakeFiles/llama-simple-chat.dir/build.make:102: bin/llama-simple-chat] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:3779: examples/simple-chat/CMakeFiles/llama-simple-chat.dir/all] Error 2

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

@drrros sorry, forgot to commit that fix, try now.

@ngladitz

ngladitz commented Feb 3, 2026

@pwilkin

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

This doesn't compile for me:

I ran into the same issue but taking https://github.com/pwilkin/llama.cpp/tree/autoparser and then cherry-picking this MR's commit on top worked for me.

I do occasionally see "Invalid diff:" exceptions. A tool "string" parameter (which happens to consist only of digits, and is thus incidentally also a legal integer) is shown once with and once without quotes.

@Edgar-I

Edgar-I commented Feb 3, 2026

@pwilkin Compiling now, thanks
image

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

I do occasionally see "Invalid diff:" exceptions. A tool "string" parameter (which happens to consist of only digits; is incidentally also a legal integer) is shown once with and once without quotes.

That's a good debug case, could you possibly paste it here?
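The likely mechanism can be sketched independently of the exact repro: a digits-only string argument is itself valid JSON as an integer, so if a parser ever re-reads the bare token instead of the quoted form, the type (and the quotes) change on re-serialization. A minimal illustration, not the actual parser code:

```python
import json

# A tool argument that is a *string* of digits is also valid JSON as an
# integer. Re-parsing the bare token instead of the quoted form changes
# the type, and the quotes disappear when serialized again.
raw_quoted = '"123"'
raw_bare = '123'

print(json.dumps(json.loads(raw_quoted)))  # "123"  (string preserved)
print(json.dumps(json.loads(raw_bare)))    # 123    (now an int, no quotes)
```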

@forforever73
Contributor Author

@pwilkin I tried your branch and it does fix the tool call issue — thanks!
Is this a general issue, or something tied to this pr or the step3.5 model?

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

@pwilkin I tried your branch and it does fix the tool call issue — thanks! Is this a general issue, or something tied to this pr or the step3.5 model?

I'm refactoring the parser in general so that it handles new typical templates automatically (and I tackle a few edge cases that are annoying during agentic coding). It's just that the model doesn't have a dedicated parser in master yet (which is how things were done till now).

@tarruda

tarruda commented Feb 8, 2026

So far in my testing, the ubergarm IQ4_XS seems to be the best quant for 128GB devices.

Not only does it use less memory (I can run 2 parallel 105k-context streams on an M1 Ultra Mac Studio), but I haven't been able to reproduce the infinite reasoning loops, which happened once for me with both Q4_K (ggml-org release) and Q4_K_S (stepfun-ai release).

Update: Scratch the part about no reasoning loops; I got one that generated 95k reasoning tokens before hitting the context limit. Not sure if this is a problem with the model itself or just quants that break the model in edge cases.

@exxocism

exxocism commented Feb 8, 2026

This is the job of the tool call parser, not the template, try the autoparser PR.

The autoparser branch works for Step 3.5 👍
It looks like it lacks Minimax <think> support, though.
I think I can adapt a multi-binary strategy with llama-swap. Thanks for the info!

@pwilkin
Collaborator

pwilkin commented Feb 8, 2026

@exxocism it should work for Minimax as well. If it doesn't, please give feedback there.

@armand1m
Copy link

armand1m commented Feb 8, 2026

I'm getting some good results now using the latest ggml-org/llama.cpp with the latest Int4 GGUF push.

Thinking tokens are being properly parsed, but for tool calls the Jinja template still gives me these spilled tool calls:

Now I'll make the edits step by step. First, add the snapshot import:<tool_call>
<function=edit

but the tool call still works in OpenCode. Sometimes it spits out something like

Now I need to add the StateSnapshot import. Let me add it to the snapshot re-export:<tool_call>
<function=edit
<function=edit

which sometimes breaks the call, but the model recovers well.

I'm getting a constant 14 tps generation, which for a model like this is pretty cool.

Update: eventually it struggled with tool calling:

Thinking: I keep messing up the tool calls. I need to use the correct format. Let me just use the read tool correctly.

...


I need to add enable_snapshots to MultiAgentProfile as well:<tool_call>
<function=read
<function=read
⚙ invalid [tool=read
<function=read, error=Model tried to call unavailable tool 'read
<function=read'. Available tools: invalid, question, bash, read, glob, grep, edit, write, task, webfetch, todowrite

...

Thinking: I'm really struggling with the tool usage. I think I'm in a loop of errors. Let me take a different approach: I'll write the code changes to files using the write tool to directly replace the entire file content. That might be simpler. But I need to be careful not to lose existing code.
Actually, maybe I should just stop and provide a summary of what I've accomplished and what remains to be done. The user might be okay with that.

I also hit this from time to time:

srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.653 (> 0.100 thold), f_keep = 0.615
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 26804 | processing task, is_child = 0
slot update_slots: id  0 | task 26804 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 129280
slot update_slots: id  0 | task 26804 | n_past = 84450, slot.prompt.tokens.size() = 137210, seq_id = 0, pos_min = 134650, n_swa = 512
slot update_slots: id  0 | task 26804 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 128963, pos_max = 131522, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 129574, pos_max = 132077, n_swa = 512, size = 90.812 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 130331, pos_max = 132890, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 131131, pos_max = 133690, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 131743, pos_max = 134246, n_swa = 512, size = 90.812 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 132304, pos_max = 134863, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 133078, pos_max = 135596, n_swa = 512, size = 91.356 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 133717, pos_max = 136276, n_swa = 512, size = 92.843 MiB)

@ggerganov
Member

@forforever73 I am testing the model on the official website at https://stepfun.ai/ and it is very easy to make it loop infinitely. For example, here it is writing a basic quick sort:

stepfun-loop-0.mp4

This is very easy to reproduce. Is this expected?

@tarruda

tarruda commented Feb 8, 2026

This is very easy to reproduce. Is this expected?

Interesting, so it is not an issue with the quants.

I wonder if tweaking presence/repetition penalty can help with this.
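llama-server exposes these as sampling flags; a hypothetical starting point for experimentation, not a tuned recommendation (model path is illustrative):

```shell
# Illustrative values only -- not tuned recommendations for this model.
./build/bin/llama-server \
  --model step3p5_flash_Q4_K_S.gguf \
  -c $((256*1024)) -fa on --jinja \
  --repeat-penalty 1.05 \
  --presence-penalty 0.5
```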

@gelim
Contributor

gelim commented Feb 8, 2026

There still seems to be an issue with tool calls and the current chat template:

srv  log_server_r: response: {"error":{"code":500,"message":"\n------------\n
While executing FilterExpression at line 55, column 63 in source:\n
...- for args_name, args_value in arguments|items %}↵                        {{- '<...\n 
                                          ^\n
Error: Unknown (built-in) filter 'items' for type String","type":"server_error"}}

I saw the earlier patch from @exxocism, but I don't see a fix for the root cause, which is that `arguments` is a JSON string and not a dict (hence `|items` breaking). I'm using the @ubergarm IQ4_XS (no loops, otherwise working great), but I see it's the same template in the @ggerganov ggml-org Q4_K_M upload.

What's send over the wire (as an example):

[...]
"tool_calls":[{"id":"blabla","type":"function","function":{"name":"bash","arguments":"{\"command\":\"ls\",\"description\":\"Lists files in current directory\"}"}}]},
[...]
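The mismatch described above can be illustrated in isolation (a sketch, not the actual server fix): the wire format carries `arguments` as a JSON-encoded string, so anything that wants key/value pairs, like the template's `|items` filter, must decode it first:

```python
import json

# `arguments` arrives as a JSON *string* (see the wire example above),
# so it must be decoded before iterating key/value pairs.
tool_call = {
    "function": {
        "name": "bash",
        "arguments": '{"command": "ls", "description": "Lists files in current directory"}',
    }
}

args = tool_call["function"]["arguments"]
if isinstance(args, str):   # the decode step the failing `|items` path lacks
    args = json.loads(args)

for key, value in args.items():  # now iterable as the template expects
    print(key, value)
```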

@CISC
Collaborator

CISC commented Feb 8, 2026

@gelim #19283 (comment)

@Hanqer

Hanqer commented Feb 9, 2026

@ggerganov Can you share some prompts to reproduce this looping generation? I will help check this model behavior. Btw, you can also email me at hanqer@stepfun.com.

@forforever73
Contributor Author

@forforever73 AesSedai/Step-3.5-Flash-GGUF/Step-3.5-Flash-IQ3_XXS and Ex0bit/Step-3.5-Flash-PRISM/Step-3.5-Flash-PRISM-LITE-IQ2_M.gguf: these two models, after running for a while, keep repeating or producing meaningless output.

Q2/Q3 usually have relatively large precision loss. You might want to try the official Q4_K_S build first. Also, could you share the prompt you used? I can try to reproduce it on my side.


@xldistance

@forforever73 Any question eventually devolves into repetitive responses

@tarruda

tarruda commented Feb 9, 2026

@forforever73 @Hanqer infinite reasoning loops can be reproduced in the official https://stepfun.ai website, so it doesn't seem like an issue caused by quantization. Here's a prompt similar to @ggerganov's example that sometimes makes it get stuck: Quicksort In C. No comments or explanations. Just code. Read input from stdin. (might need to retry a few times):

image

@tarruda

tarruda commented Feb 9, 2026

After a while it exhausted the token window and just cut the response:
image

https://stepfun.ai/share/209712602887770112?shareto_way=link

@forforever73
Copy link
Contributor Author

The model gives nonsensical responses:


Step-3.5-Flash
Today 11:21
Thought for 35 seconds
This is a very profound international-relations question. The Korea standoff (1953 armistice, no peace treaty) and the Russia-Ukraine conflict do have surface similarities, but their root causes, nature, modes of external intervention, and geopolitical configuration differ fundamentally, so a "permanent standoff" is possible but would likely take a different form, and the process would be extremely turbulent. The Russia-Ukraine conflict is more likely to become a "prolonged, low-intensity, high-risk war of attrition" or a "frozen but fragile ceasefire" than the Korean-style "low-intensity, relatively stable, treaty-less frozen war". Key factors:

I. "Similarities" with the Korea standoff
1. No peace treaty; legally still a "state of war": the Korean armistice is only a ceasefire, and Russia and Ukraine are unlikely to reach a comprehensive peace treaty any time soon, more likely remaining deadlocked along a ceasefire line.
2. Great-power involvement, but in different ways: Korea was a "proxy conflict" of US-Soviet/US-China rivalry, while Russia-Ukraine mixes direct great-power military intervention with a proxy model. NATO (especially US) aid to Ukraine far exceeds Cold-War-era military aid to South Korea, and Russia is a direct combatant rather than merely backing insurgent forces.
3. Ideological/systemic differences: Korea is an opposition of systems and blocs; the core of the Russia-Ukraine conflict is sovereignty and territorial integrity vs. military action and annexation, which more easily fuels sustained confrontation.
4. Demilitarized zone (DMZ): Russia and Ukraine may form a de facto "line of contact" with long-term confrontation on both sides, similar to the Donbas "Minsk line" stalemate.
Core contradictions: Korea vs. post-2014/2022 Russia-Ukraine
| Dimension | Korea standoff | Russia-Ukraine conflict |
|冲突性质** | 冷战延伸,核不扩散、军事边界 | 生存与领土完整:俄承认克里米亚、顿涅茨克、卢甘“独立”并部分吞并,乌要求恢复1991年边界。领土问题无妥协空间。
| 核心矛盾 | 体制、半岛统一方式 | 主权 vs 势力范围:俄乌双方无法接受对方对核心诉求(乌加入北约/俄“去军事化”)。
| 外部干预 | 中美俄在幕后,直接军事介入少 | 北约(美欧) vs 俄:直接军援、制裁、政治捆绑,无“停火后撤军”的明确机制。
| 核威慑 | 朝鲜拥核,但朝韩不直接冲突** | 俄乌冲突中,俄核威慑,但俄乌双方直接交火,核风险更接近,可能升级。
| 经济整合 | 朝韩基本** | 俄乌**,制裁与反制裁,冲突**,但更易被“冻结” | 俄乌**:冲突**,但更易被“冻结”:朝鲜半岛**:经济上**,但更易被冻结。 |
| 长期化 | 朝韩**:经济上**,但更易被冻结 | :经济上,但更“冻结”:朝鲜半岛**:经济上**,俄乌**:经济上**,俄乌**:经济上**,,但更易被“冻结” | :经济上,但更“冻结”::经济上**,,但 “冻结” :经济上,俄乌**:经济上**,,但 “冻结” ::经济上**,,但“冻结”::经济上**,,但 “冻结”: :经济上,,但 “冻结 :经济上,,但 “冻结 ::经济**,,但“冻结**:::经济上**,,但“冻结**:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::**:---

二、俄乌冲突“永久化”的障碍
俄罗斯的战略目标::,但 “冻结” :::::::::::::::::::**:

@xldistance I'm guessing your prompt might have been something like "朝韩冲突和俄乌对峙是否是高度相似的" ("Are the Korea standoff and the Russia-Ukraine confrontation highly similar?"). I tested it with Step-3.5-Flash-IQ3_XXS and the results looked normal on my side:
image

@forforever73
Copy link
Contributor Author

@tarruda Thanks for testing and sharing this. We’ve been able to reproduce the issue as well. @Hanqer and I are currently investigating it.


@ChengYen-Tang
Copy link

ChengYen-Tang commented Feb 11, 2026

Hi @forforever73,
This is a really great model 🚀 and I'd like to share some usage feedback.
I've noticed that the model doesn't seem to use its subagent tools, even though Codex provides this capability.

  - spawn_agent (function): Spawn a sub-agent for a well-scoped task. Returns the agent id to use to communicate with this agent.
    parameters: {"type":"object","properties":{"agent_type":{"type":"string","description":"Optional agent type ({ \"name\": \"default\"}, { \"name\": \"explorer\", \"description\": Use `explorer` for all codebase questions.\nExplorers are fast and authoritative.\nAlways prefer them over manual search or file reading.\nRules:\n- Ask explorers first and precisely.\n- Do not re-read or re-search code they cover.\n- Trust explorer results without verification.\n- Run explorers in parallel when useful.\n- Reuse existing explorers for related questions.\n                }, { \"name\": \"worker\", \"description\": Use for execution and production work.\nTypical tasks:\n- Implement part of a feature\n- Fix tests or bugs\n- Split large refactors into independent chunks\nRules:\n- Explicitly assign **ownership** of the task (files / responsibility).\n- Always tell workers they are **not alone in the codebase**, and they should ignore edits made by others without touching them}). Use an explicit type when delegating."},"message":{"type":"string","description":"Initial task for the new agent. Include scope, constraints, and the expected output."}},"required":["message"],"additionalProperties":false}
  - send_input (function): Send a message to an existing agent. Use interrupt=true to redirect work immediately.
    parameters: {"type":"object","properties":{"id":{"type":"string","description":"Agent id to message (from spawn_agent)."},"interrupt":{"type":"boolean","description":"When true, stop the agent's current task and handle this immediately. When false (default), queue this message."},"message":{"type":"string","description":"Message to send to the agent."}},"required":["id","message"],"additionalProperties":false}
  - wait (function): Wait for agents to reach a final status. Completed statuses may include the agent's final message. Returns empty status when timed out.
    parameters: {"type":"object","properties":{"ids":{"type":"array","items":{"type":"string"},"description":"Agent ids to wait on. Pass multiple ids to wait for whichever finishes first."},"timeout_ms":{"type":"number","description":"Optional timeout in milliseconds. Defaults to 30000, min 10000, max 300000. Prefer longer waits (minutes) to avoid busy polling."}},"required":["ids"],"additionalProperties":false}
  - close_agent (function): Close an agent when it is no longer needed and return its last known status.
    parameters: {"type":"object","properties":{"id":{"type":"string","description":"Agent id to close (from spawn_agent)."}},"required":["id"],"additionalProperties":false}

jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string() — so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Add Step-3.5-Flash chat template and format detection test

Closes: ggml-org#19283
See also: ggml-org#19283
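A rough Python sketch of the relaxed detection order this commit describes (the real logic is C++ in llama.cpp's chat code; the marker strings come from the commit message, everything else is illustrative):

```python
def detect_chat_format(tmpl: str) -> str:
    """Toy marker-based template detection mirroring the commit's description."""
    shared = ("<tool_call>", "<function=", "<parameter=")
    if all(m in tmpl for m in shared):
        # Nemotron v3 additionally requires the bare <function> and the
        # plural <parameters> markers, so the more specific check runs first.
        if "<function>" in tmpl and "<parameters>" in tmpl:
            return "nemotron-v3"
        # Step-3.5-Flash's template has only the three shared markers.
        return "qwen3-coder-xml"
    return "hermes-2-pro"

# A Step-3.5-Flash-like template: shared markers only, plus <think>.
step_tmpl = "...<think>...<tool_call><function=f><parameter=p>..."
assert detect_chat_format(step_tmpl) == "qwen3-coder-xml"
```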
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 16, 2026
pwilkin pushed a commit that referenced this pull request Feb 19, 2026
…9635)

* common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: #19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
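A minimal sketch of what thinking_forced_open means for parsing (illustrative Python, not the actual PEG parser): because the generation prompt already ends with `<think>`, the model's output begins inside the thinking block, and everything before the first `</think>` must be routed to reasoning_content.

```python
def split_forced_thinking(output: str) -> tuple[str, str]:
    """Split output that starts inside an already-open <think> block."""
    marker = "</think>"
    end = output.find(marker)
    if end == -1:
        # Still thinking: all of it is reasoning, no visible content yet.
        return output, ""
    return output[:end], output[end + len(marker):].lstrip()

reasoning, content = split_forced_thinking(
    "The user wants quicksort...</think>Here is the code."
)
assert reasoning == "The user wants quicksort..."
assert content == "Here is the code."
```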
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors:

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026