
Support Step3.5-Flash #19283

Merged
CISC merged 18 commits into ggml-org:master from forforever73:pr/step3.5-flash on Feb 6, 2026

Conversation


@forforever73
Contributor Author

Adding supplemental evaluation results for reference.

Performance

https://github.com/stepfun-ai/Step-3.5-Flash/blob/main/llama.cpp/docs/step3.5-flash.md

Accuracy

Accuracy was evaluated against a BF16 vLLM baseline.

Tested at the maximum 256k context on 8 × H200 devices.

| Dataset | vLLM BF16 Baseline | step3.5_flash_fp16.gguf |
|---|---|---|
| IFEVAL (keywords / existence) | 98.08% (±2.13) | 98.33% (±2.89) |
| HMMT25 | 98.44% (±1.86) | 97.50% |

Tested at the maximum 256k context on a Mac Studio.
Repeated 64 times and averaged.

| Model | Device | Repeats | Average |
|---|---|---|---|
| vLLM BF16 baseline | H200 | 64 | 84.38% |
| step3.5_flash_Q4_K_S.gguf | Mac Studio | 64 | 82.89% |

@IIIIIllllIIIIIlllll

IIIIIllllIIIIIlllll commented Feb 3, 2026

great work! thank you!
It works fine on my 395, about 22 token/s.
image

@gopinath87607

Is this exactly the same modification as the one made in the forked Step llama.cpp, or is it a new one?

@forforever73
Contributor Author

@gopinath87607 The register name (step3p5) was modified in the convert_hf_to_gguf part. Everything else is exactly the same.
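For context, a rough sketch of the registration pattern being described; the registry and model class below are illustrative stand-ins, not the real `convert_hf_to_gguf.py` code, and only the registered architecture name is the point:

```python
# Minimal stand-in for convert_hf_to_gguf's registration pattern.
# The registry/class names here are illustrative, not the actual code.

class ModelRegistry:
    _models: dict = {}

    @classmethod
    def register(cls, *names):
        def wrap(model_cls):
            for name in names:
                cls._models[name] = model_cls  # map HF arch -> converter class
            return model_cls
        return wrap

@ModelRegistry.register("Step3p5ForCausalLM")  # the renamed register entry
class Step3p5Model:
    model_arch = "step35"  # GGUF architecture string seen in the logs below
```

Under this pattern, renaming the register entry is the only conversion-side change; everything downstream keys off the GGUF architecture string.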

@tarruda

tarruda commented Feb 3, 2026

I tried running this branch with Codex. While it works, I see some tool-call tokens leaked into the UI:

image

Additionally, I see some warnings in llama-server

slot init_sampler: id  1 | task 3684 | init sampler, took 4.65 ms, tokens: text = 47025, total = 47025
slot update_slots: id  1 | task 3684 | erasing old context checkpoint (pos_min = 33429, pos_max = 35988, size = 330.030 MiB)
slot update_slots: id  1 | task 3684 | created context checkpoint 8 of 8 (pos_min = 44401, pos_max = 46960, size = 330.030 MiB)
slot print_timing: id  1 | task 3684 | 
prompt eval time =   10377.77 ms /  2080 tokens (    4.99 ms per token,   200.43 tokens per second)
       eval time =    6575.80 ms /   169 tokens (   38.91 ms per token,    25.70 tokens per second)
      total time =   16953.57 ms /  2249 tokens
slot      release: id  1 | task 3684 | stop processing: n_tokens = 47193, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/responses 192.168.10.78 200
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv  params_from_: Chat format: Hermes 2 Pro

@AesSedai
Contributor

AesSedai commented Feb 3, 2026

I pulled and compiled with this commit, then produced a BF16 with convert_hf_to_gguf, then attempted to run imatrix on it, and the results looked very suspect:

llama-imatrix output on commit `2f0f12e70`
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7907 (2f0f12e70) with GNU 14.2.1 for Linux x86_64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24135 total,  11876 used,  11995 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24135 total,   7252 used,  16619 free vs. target of   1024
llama_params_fit_impl: projected to use 19129 MiB of device memory vs. 47743 MiB of free device memory
llama_params_fit_impl: targets for free memory can be met on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 15.59 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:06:10.0) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:06:11.0) - 23871 MiB free
llama_model_loader: loaded meta data with 49 key-value pairs and 754 tensors from /mnt/srv/snowdrift/ggml/Step-3.5-Flash/Step-3.5-Flash-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = step35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Step 3.5 Flash
llama_model_loader: - kv   3:                         general.size_label str              = 288x7.4B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                  general.base_model.0.name str              = Step 3.5 Flash
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Stepfun Ai
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/stepfun-ai/ste...
llama_model_loader: - kv   9:                         step35.block_count u32              = 45
llama_model_loader: - kv  10:                      step35.context_length u32              = 262144
llama_model_loader: - kv  11:                    step35.embedding_length u32              = 4096
llama_model_loader: - kv  12:                 step35.feed_forward_length u32              = 11264
llama_model_loader: - kv  13:                step35.attention.head_count arr[i32,45]      = [64, 96, 96, 96, 64, 96, 96, 96, 64, ...
llama_model_loader: - kv  14:                      step35.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  15:                step35.attention.key_length u32              = 128
llama_model_loader: - kv  16:              step35.attention.value_length u32              = 128
llama_model_loader: - kv  17:                          general.file_type u32              = 32
llama_model_loader: - kv  18:             step35.attention.head_count_kv arr[i32,45]      = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
llama_model_loader: - kv  19:            step35.attention.sliding_window u32              = 512
llama_model_loader: - kv  20:    step35.attention.sliding_window_pattern arr[i32,45]      = [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, ...
llama_model_loader: - kv  21:             step35.rope.scaling.apply_mask u32              = 1
llama_model_loader: - kv  22:                        step35.expert_count u32              = 288
llama_model_loader: - kv  23:                   step35.expert_used_count u32              = 8
llama_model_loader: - kv  24:          step35.expert_feed_forward_length u32              = 1280
llama_model_loader: - kv  25:   step35.expert_shared_feed_forward_length u32              = 1280
llama_model_loader: - kv  26:                  step35.expert_gating_func u32              = 2
llama_model_loader: - kv  27:                step35.expert_weights_scale f32              = 3.000000
llama_model_loader: - kv  28:                 step35.expert_weights_norm bool             = true
llama_model_loader: - kv  29:           step35.leading_dense_block_count u32              = 3
llama_model_loader: - kv  30:                  step35.moe_every_n_layers u32              = 1
llama_model_loader: - kv  31:      step35.rope.dimension_count_per_layer arr[i32,45]      = [64, 128, 128, 128, 64, 128, 128, 128...
llama_model_loader: - kv  32:    step35.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  33:            step35.rope.freq_base_per_layer arr[f32,45]      = [5000000.000000, 10000.000000, 10000....
llama_model_loader: - kv  34:                       step35.swiglu_limits arr[f32,45]      = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  35:                step35.swiglu_limits_shared arr[f32,45]      = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  38:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  39:                      tokenizer.ggml.tokens arr[str,128896]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  40:                  tokenizer.ggml.token_type arr[i32,128896]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  41:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  42:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 128007
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  47:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  48:                    tokenizer.chat_template str              = {% macro render_content(content) %}{%...
llama_model_loader: - type  f32:  266 tensors
llama_model_loader: - type bf16:  488 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 366.95 GiB (16.00 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 128007 ('<|im_end|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8220 MB
print_info: arch                  = step35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 45
print_info: n_head                = [64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64, 96, 96, 96, 64]
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 512
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = [8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8, 12, 12, 12, 8]
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 11264
print_info: n_expert              = 288
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 5000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = ?B
print_info: model params          = 196.96 B
print_info: general.name          = Step 3.5 Flash
print_info: vocab type            = BPE
print_info: n_vocab               = 128896
print_info: n_merges              = 127741
print_info: BOS token             = 0 '<|begin▁of▁sentence|>'
print_info: EOS token             = 128007 '<|im_end|>'
print_info: EOT token             = 128007 '<|im_end|>'
print_info: PAD token             = 1 '<|end▁of▁sentence|>'
print_info: LF token              = 201 'Ċ'
print_info: FIM PRE token         = 128801 '<|fim▁begin|>'
print_info: FIM SUF token         = 128800 '<|fim▁hole|>'
print_info: FIM MID token         = 128802 '<|fim▁end|>'
print_info: EOG token             = 128007 '<|im_end|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 44 repeating layers to GPU
load_tensors: offloaded 46/46 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 375759.26 MiB
load_tensors:        CUDA0 model buffer size =  5898.51 MiB
load_tensors:        CUDA1 model buffer size =  5973.75 MiB
....................................................................................................
common_init_result: added <|im_end|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_seq     = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache:      CUDA0 KV buffer size =    48.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    48.00 MiB
llama_kv_cache: size =   96.00 MiB (  2048 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 2048 cells
llama_kv_cache:      CUDA0 KV buffer size =   136.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   128.00 MiB
llama_kv_cache: size =  264.00 MiB (  2048 cells,  33 layers,  1/1 seqs), K (f16):  132.00 MiB, V (f16):  132.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA0 compute buffer size =  5794.25 MiB
sched_reserve:      CUDA1 compute buffer size =  1103.00 MiB
sched_reserve:  CUDA_Host compute buffer size =    96.09 MiB
sched_reserve: graph nodes  = 3422
sched_reserve: graph splits = 151 (with bs=2048), 87 (with bs=1)
sched_reserve: reserve took 21.70 ms, sched copies = 1

system_info: n_threads = 56 (n_threads_batch = 56) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 585.593 ms
compute_imatrix: computing over 200 chunks, n_ctx=2048, batch_size=2048, n_seq=1
compute_imatrix: 12.43 seconds per pass - ETA 41.43 minutes
[1]86644.2818,[2]87846.2570,[3]85126.1948,[4]85482.5234,[5]86821.5460,[6]86843.7771,[7]85988.4366,[8]87247.1141,[9]88148.0087,
save_imatrix: entry '               blk.43.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.42.ffn_down_exps.weight' has partial data (3.12%)
save_imatrix: entry '             blk.39.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.38.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.39.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.37.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.36.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.40.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.35.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.35.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '               blk.34.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.34.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.33.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.33.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '               blk.39.ffn_up_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.32.ffn_down_exps.weight' has partial data (3.12%)
save_imatrix: entry '               blk.32.ffn_up_exps.weight' has partial data (3.12%)
save_imatrix: entry '             blk.34.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.31.ffn_down_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.31.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.40.ffn_gate_exps.weight' has partial data (2.78%)
save_imatrix: entry '             blk.43.ffn_gate_exps.weight' has partial data (2.78%)

I canceled it because the partial data for the experts and the 80,000+ PPL make it seem like something has gone wrong in the conversion or inference process somewhere.
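As a side note, the partial-data percentages by themselves are consistent with this model's routing configuration (8 of 288 experts active per token), so sparse coverage after only nine chunks is expected; the 80,000+ PPL is the stronger signal of a real problem. A quick check of the arithmetic:

```python
# The "partial data" percentages in the log match expert-activation
# coverage for this model: 288 routed experts, 8 used per token.
n_expert = 288
print(f"{8 / n_expert:.2%}")  # matches the 2.78% entries
print(f"{9 / n_expert:.2%}")  # matches the 3.12% entries
```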

@IIIIIllllIIIIIlllll

IIIIIllllIIIIIlllll commented Feb 3, 2026


The same issue, about 'tool_call'.

Edited: However, the result is correct; it indeed helped me write the HTML game I wanted. @forforever73

@drrros
Contributor

drrros commented Feb 3, 2026

same in cline
image

running with LLAMA_SET_ROWS=1 ./build/bin/llama-server --model /mnt/ds1nfs/codellamaweights/stepfun/step3p5_flash_Q4_K_S.gguf --port 30000 --host 192.168.0.60 -c $((256*1024)) -fa on --reasoning-format auto --no-mmap --jinja --temp 1.0

Speed I'm getting:

prompt eval time =  266478.36 ms / 47459 tokens (    5.61 ms per token,   178.10 tokens per second)
       eval time =    3651.26 ms /   141 tokens (   25.90 ms per token,    38.62 tokens per second)
      total time =  270129.61 ms / 47600 tokens

This is on Epyc 9274f \ 12*32Gb 4800 MT/s \ dual Nvidia A5000

@forforever73
Contributor Author

@AesSedai Sorry about that. For now, please use the pre-quantized GGUF model: https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4
This is because an offline +1 adjustment was applied to the weights before conversion. I’ll move this part into convert_hf_to_gguf as soon as possible.
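Folding such an adjustment into conversion could look roughly like this; a minimal sketch, assuming the offset targets a bias-like tensor (the comment does not say which tensor, so the name pattern here is hypothetical):

```python
# Illustrative sketch only: fold the offline "+1" offset into the
# conversion step so the original HF checkpoint can be converted directly.
# The "router.bias" target is an assumption, not named in the PR.

def adjust_tensor(name: str, data: list) -> list:
    """Apply the offline '+1' offset during conversion (hypothetical target)."""
    if name.endswith("router.bias"):
        return [x + 1.0 for x in data]
    return data
```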

@eauchs

eauchs commented Feb 3, 2026

I get around 23 tokens/s with an M3 Max 128 GB, this is really great!
image

@forforever73
Contributor Author

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

@IIIIIllllIIIIIlllll

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

After testing, I found that this bug occurs when multiple MCP tools are provided.

If there is only one MCP tool, this issue (perhaps) does not occur.

@tarruda

tarruda commented Feb 3, 2026

Tool calling is still missing some support in llama.cpp at the moment. I’ll submit the next PR to address this as soon as possible 💪🙂

Looking forward to it!

This is the best LLM I could run locally so far, thank you for it!

@joonanykanen

@tarruda I do share your thoughts. This model seems extremely intelligent. Running ~16tok/s with 2xRTX3090 and 128GB DDR4. Makes me want to invest in Pro 6000 Blackwells lmao!

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

If someone wants a version with fully working reasoning + tool calling, I've added a cherry-picked version of my autoparser branch. Already tested with OpenCode and works great so far.

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

@tarruda

tarruda commented Feb 3, 2026

If someone wants a version with fully working reasoning + tool calling, I've added a cherry-picked version of my autoparser branch. Already tested with OpenCode and works great so far.

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

Thank you @pwilkin, will use that branch for now!

@drrros
Contributor

drrros commented Feb 3, 2026

@pwilkin

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

This doesn't compile for me:

git status
On branch autoparser-stepfun
Your branch is up to date with 'origin/autoparser-stepfun'.

nothing to commit, working tree clean
...
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON && cmake --build build --config Release -j 24
...
/bin/ld: ../../bin/libllama.so.0.0.7931: undefined reference to `bool llama_model_loader::get_key_or_arr<float, 512ul>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::array<float, 512ul>&, unsigned int, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [examples/simple/CMakeFiles/llama-simple.dir/build.make:102: bin/llama-simple] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:3749: examples/simple/CMakeFiles/llama-simple.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
[ 65%] Building CXX object common/CMakeFiles/common.dir/json-partial.cpp.o
[ 65%] Linking CXX executable ../../bin/llama-simple-chat
/bin/ld: ../../bin/libllama.so.0.0.7931: undefined reference to `bool llama_model_loader::get_key_or_arr<float, 512ul>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::array<float, 512ul>&, unsigned int, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [examples/simple-chat/CMakeFiles/llama-simple-chat.dir/build.make:102: bin/llama-simple-chat] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:3779: examples/simple-chat/CMakeFiles/llama-simple-chat.dir/all] Error 2

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

@drrros sorry, forgot to commit that fix, try now.

@ngladitz

ngladitz commented Feb 3, 2026

@pwilkin

https://github.com/pwilkin/llama.cpp/tree/autoparser-stepfun

This doesn't compile for me:

I ran into the same issue but taking https://github.com/pwilkin/llama.cpp/tree/autoparser and then cherry-picking this MR's commit on top worked for me.

I do occasionally see "Invalid diff:" exceptions. A tool "string" parameter (which happens to consist only of digits, and is thus incidentally also a legal integer) is shown once with and once without quotes.

@Edgar-I

Edgar-I commented Feb 3, 2026

@pwilkin Compiling now, thanks
image

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

I do occasionally see "Invalid diff:" exceptions. A tool "string" parameter (which happens to consist of only digits; is incidentally also a legal integer) is shown once with and once without quotes.

That's a good debug case, could you possibly paste it here?
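The likely mechanism can be sketched independently of the exact repro: a digits-only string argument is itself valid JSON as an integer, so if a parser ever re-reads the bare token instead of the quoted form, the type (and the quotes) change on re-serialization. A minimal illustration, not the actual parser code:

```python
import json

# A tool argument that is a *string* of digits is also valid JSON as an
# integer. Re-parsing the bare token instead of the quoted form changes
# the type, and the quotes disappear when serialized again.
raw_quoted = '"123"'
raw_bare = '123'

print(json.dumps(json.loads(raw_quoted)))  # "123"  (string preserved)
print(json.dumps(json.loads(raw_bare)))    # 123    (now an int, no quotes)
```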

@forforever73
Contributor Author

@pwilkin I tried your branch and it does fix the tool call issue — thanks!
Is this a general issue, or something tied to this pr or the step3.5 model?

@pwilkin
Collaborator

pwilkin commented Feb 3, 2026

@pwilkin I tried your branch and it does fix the tool call issue — thanks! Is this a general issue, or something tied to this pr or the step3.5 model?

I'm refactoring the parser in general so that it handles new typical templates automatically (and I tackle a few edge cases that are annoying during agentic coding). It's just that the model doesn't have a dedicated parser in master yet (which is how things were done till now).

@tarruda

tarruda commented Feb 8, 2026

So far in my testing, the ubergarm IQ4_XS seems to be the best quant for 128GB devices.

Not only does it use less memory (I can run 2 parallel 105k-context streams on an M1 Ultra Mac Studio), but I haven't been able to reproduce the infinite reasoning loops, which happened once for me with both Q4_K (ggml-org release) and Q4_K_S (stepfun-ai release).

Update: Scratch the part about no reasoning loops; I got one that generated 95k reasoning tokens before hitting the context limit. Not sure if this is a problem with the model itself or just quants that break the model in edge cases.

@exxocism

exxocism commented Feb 8, 2026

This is the job of the tool call parser, not the template, try the autoparser PR.

The autoparser branch works for Step 3.5 👍
It looks like it lacks Minimax <think> support, though.
I think I can adapt a multi-binary strategy with llama-swap. Thanks for the info!

@pwilkin
Collaborator

pwilkin commented Feb 8, 2026

@exxocism it should work for Minimax as well. If it doesn't, please give feedback there.

@armand1m
Copy link

armand1m commented Feb 8, 2026

I'm getting some good results now using the latest ggml-org/llama.cpp with the latest Int4 GGUF push.

Thinking tokens are being properly parsed, but for tool calls the Jinja template still gives me these spilled tool calls:

Now I'll make the edits step by step. First, add the snapshot import:<tool_call>
<function=edit

but the tool call still works in OpenCode. Sometimes it spits out something like

Now I need to add the StateSnapshot import. Let me add it to the snapshot re-export:<tool_call>
<function=edit
<function=edit

which sometimes breaks the call, but the model recovers well.

I'm getting a constant 14 tps generation, which for a model like this is pretty cool.

Update: eventually it struggled with tool calling:

Thinking: I keep messing up the tool calls. I need to use the correct format. Let me just use the read tool correctly.

...


I need to add enable_snapshots to MultiAgentProfile as well:<tool_call>
<function=read
<function=read
⚙ invalid [tool=read
<function=read, error=Model tried to call unavailable tool 'read
<function=read'. Available tools: invalid, question, bash, read, glob, grep, edit, write, task, webfetch, todowrite

...

Thinking: I'm really struggling with the tool usage. I think I'm in a loop of errors. Let me take a different approach: I'll write the code changes to files using the write tool to directly replace the entire file content. That might be simpler. But I need to be careful not to lose existing code.
Actually, maybe I should just stop and provide a summary of what I've accomplished and what remains to be done. The user might be okay with that.

I also hit this from time to time:

srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.653 (> 0.100 thold), f_keep = 0.615
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 26804 | processing task, is_child = 0
slot update_slots: id  0 | task 26804 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 129280
slot update_slots: id  0 | task 26804 | n_past = 84450, slot.prompt.tokens.size() = 137210, seq_id = 0, pos_min = 134650, n_swa = 512
slot update_slots: id  0 | task 26804 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 128963, pos_max = 131522, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 129574, pos_max = 132077, n_swa = 512, size = 90.812 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 130331, pos_max = 132890, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 131131, pos_max = 133690, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 131743, pos_max = 134246, n_swa = 512, size = 90.812 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 132304, pos_max = 134863, n_swa = 512, size = 92.843 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 133078, pos_max = 135596, n_swa = 512, size = 91.356 MiB)
slot update_slots: id  0 | task 26804 | erased invalidated context checkpoint (pos_min = 133717, pos_max = 136276, n_swa = 512, size = 92.843 MiB)

@ggerganov
Member

@forforever73 I am testing the model on the official website at https://stepfun.ai/ and it is very easy to make it loop infinitely. For example, here it is writing a basic quick sort:

stepfun-loop-0.mp4

This is very easy to reproduce. Is this expected?

@tarruda

tarruda commented Feb 8, 2026

This is very easy to reproduce. Is this expected?

Interesting, so it is not an issue with the quants.

I wonder if tweaking presence/repetition penalty can help with this.
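llama-server exposes these as sampling flags; a hypothetical starting point for experimentation, not a tuned recommendation (model path is illustrative):

```shell
# Illustrative values only -- not tuned recommendations for this model.
./build/bin/llama-server \
  --model step3p5_flash_Q4_K_S.gguf \
  -c $((256*1024)) -fa on --jinja \
  --repeat-penalty 1.05 \
  --presence-penalty 0.5
```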

@gelim
Contributor

gelim commented Feb 8, 2026

There still seems to be an issue with tool calls and the current chat template:

srv  log_server_r: response: {"error":{"code":500,"message":"\n------------\n
While executing FilterExpression at line 55, column 63 in source:\n
...- for args_name, args_value in arguments|items %}↵                        {{- '<...\n 
                                          ^\n
Error: Unknown (built-in) filter 'items' for type String","type":"server_error"}}

I saw the earlier patch from @exxocism, but I don't see a fix for the root cause, which is that `arguments` is a JSON string and not a dict (hence `|items` breaking). I'm using the @ubergarm IQ4_XS (no loops, otherwise working great), but I see it's the same template in the @ggerganov ggml-org Q4_K_M upload.

What's send over the wire (as an example):

[...]
"tool_calls":[{"id":"blabla","type":"function","function":{"name":"bash","arguments":"{\"command\":\"ls\",\"description\":\"Lists files in current directory\"}"}}]},
[...]
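The mismatch described above can be illustrated in isolation (a sketch, not the actual server fix): the wire format carries `arguments` as a JSON-encoded string, so anything that wants key/value pairs, like the template's `|items` filter, must decode it first:

```python
import json

# `arguments` arrives as a JSON *string* (see the wire example above),
# so it must be decoded before iterating key/value pairs.
tool_call = {
    "function": {
        "name": "bash",
        "arguments": '{"command": "ls", "description": "Lists files in current directory"}',
    }
}

args = tool_call["function"]["arguments"]
if isinstance(args, str):   # the decode step the failing `|items` path lacks
    args = json.loads(args)

for key, value in args.items():  # now iterable as the template expects
    print(key, value)
```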

@CISC
Collaborator

CISC commented Feb 8, 2026

@gelim #19283 (comment)

@Hanqer

Hanqer commented Feb 9, 2026

@ggerganov Can you share some prompts to reproduce this looping generation? I will help check this model behavior. Btw, you can also email me at hanqer@stepfun.com.

@forforever73
Contributor Author

@forforever73 AesSedai/Step-3.5-Flash-GGUF/Step-3.5-Flash-IQ3_XXS and Ex0bit/Step-3.5-Flash-PRISM/Step-3.5-Flash-PRISM-LITE-IQ2_M.gguf: these two models, after running for a while, keep repeating or producing meaningless output.

Q2/Q3 usually have relatively large precision loss. You might want to try the official Q4_K_S build first. Also, could you share the prompt you used? I can try to reproduce it on my side.


@xldistance

@forforever73 Any question eventually devolves into repetitive responses

@tarruda

tarruda commented Feb 9, 2026

@forforever73 @Hanqer infinite reasoning loops can be reproduced in the official https://stepfun.ai website, so it doesn't seem like an issue caused by quantization. Here's a prompt similar to @ggerganov's example that sometimes makes it get stuck: Quicksort In C. No comments or explanations. Just code. Read input from stdin. (might need to retry a few times):

image

@tarruda

tarruda commented Feb 9, 2026

After a while it exhausted the token window and just cut the response:
image

https://stepfun.ai/share/209712602887770112?shareto_way=link

@forforever73
Copy link
Contributor Author

The model gives nonsensical responses:


Step-3.5-Flash
Today 11:21
Thought for 35 seconds
This is a very profound international-relations question. The Korea standoff (1953 armistice, no peace treaty) and the Russia-Ukraine conflict do have surface similarities, but their root causes, nature, modes of external intervention, and geopolitical configuration differ fundamentally, so a "permanent standoff" is possible but would likely take a different form, and the process would be extremely turbulent. The Russia-Ukraine conflict is more likely to become a "prolonged, low-intensity, high-risk war of attrition" or a "frozen but fragile ceasefire" than the Korean-style "low-intensity, relatively stable, treaty-less frozen war". Key factors:

I. "Similarities" with the Korea standoff
1. No peace treaty; legally still a "state of war": the Korean armistice is only a ceasefire, and Russia and Ukraine are unlikely to reach a comprehensive peace treaty any time soon, more likely remaining deadlocked along a ceasefire line.
2. Great-power involvement, but in different ways: Korea was a "proxy conflict" of US-Soviet/US-China rivalry, while Russia-Ukraine mixes direct great-power military intervention with a proxy model. NATO (especially US) aid to Ukraine far exceeds Cold-War-era military aid to South Korea, and Russia is a direct combatant rather than merely backing insurgent forces.
3. Ideological/systemic differences: Korea is an opposition of systems and blocs; the core of the Russia-Ukraine conflict is sovereignty and territorial integrity vs. military action and annexation, which more easily fuels sustained confrontation.
4. Demilitarized zone (DMZ): Russia and Ukraine may form a de facto "line of contact" with long-term confrontation on both sides, similar to the Donbas "Minsk line" stalemate.
Core contradictions: Korea vs. post-2014/2022 Russia-Ukraine
| Dimension | Korea standoff | Russia-Ukraine conflict |
|冲突性质** | 冷战延伸,核不扩散、军事边界 | 生存与领土完整:俄承认克里米亚、顿涅茨克、卢甘“独立”并部分吞并,乌要求恢复1991年边界。领土问题无妥协空间。
| 核心矛盾 | 体制、半岛统一方式 | 主权 vs 势力范围:俄乌双方无法接受对方对核心诉求(乌加入北约/俄“去军事化”)。
| 外部干预 | 中美俄在幕后,直接军事介入少 | 北约(美欧) vs 俄:直接军援、制裁、政治捆绑,无“停火后撤军”的明确机制。
| 核威慑 | 朝鲜拥核,但朝韩不直接冲突** | 俄乌冲突中,俄核威慑,但俄乌双方直接交火,核风险更接近,可能升级。
| 经济整合 | 朝韩基本** | 俄乌**,制裁与反制裁,冲突**,但更易被“冻结” | 俄乌**:冲突**,但更易被“冻结”:朝鲜半岛**:经济上**,但更易被冻结。 |
| 长期化 | 朝韩**:经济上**,但更易被冻结 | :经济上,但更“冻结”:朝鲜半岛**:经济上**,俄乌**:经济上**,俄乌**:经济上**,,但更易被“冻结” | :经济上,但更“冻结”::经济上**,,但 “冻结” :经济上,俄乌**:经济上**,,但 “冻结” ::经济上**,,但“冻结”::经济上**,,但 “冻结”: :经济上,,但 “冻结 :经济上,,但 “冻结 ::经济**,,但“冻结**:::经济上**,,但“冻结**:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::**:---

二、俄乌冲突“永久化”的障碍
俄罗斯的战略目标::,但 “冻结” :::::::::::::::::::**:

@xldistance I'm guessing your prompt might have been something like "朝韩冲突和俄乌对峙是否是高度相似的" ("Are the Korea standoff and the Russia-Ukraine confrontation highly similar?"). I tested it with Step-3.5-Flash-IQ3_XXS and the results looked normal on my side:
image

@forforever73
Copy link
Contributor Author

@tarruda Thanks for testing and sharing this. We’ve been able to reproduce the issue as well. @Hanqer and I are currently investigating it.


@ChengYen-Tang
Copy link

ChengYen-Tang commented Feb 11, 2026

Hi @forforever73,
This is a really great model 🚀 and I'd like to share some usage feedback.
I've noticed that the model doesn't seem to use its subagent tools, even though Codex provides this capability.

  - spawn_agent (function): Spawn a sub-agent for a well-scoped task. Returns the agent id to use to communicate with this agent.
    parameters: {"type":"object","properties":{"agent_type":{"type":"string","description":"Optional agent type ({ \"name\": \"default\"}, { \"name\": \"explorer\", \"description\": Use `explorer` for all codebase questions.\nExplorers are fast and authoritative.\nAlways prefer them over manual search or file reading.\nRules:\n- Ask explorers first and precisely.\n- Do not re-read or re-search code they cover.\n- Trust explorer results without verification.\n- Run explorers in parallel when useful.\n- Reuse existing explorers for related questions.\n                }, { \"name\": \"worker\", \"description\": Use for execution and production work.\nTypical tasks:\n- Implement part of a feature\n- Fix tests or bugs\n- Split large refactors into independent chunks\nRules:\n- Explicitly assign **ownership** of the task (files / responsibility).\n- Always tell workers they are **not alone in the codebase**, and they should ignore edits made by others without touching them}). Use an explicit type when delegating."},"message":{"type":"string","description":"Initial task for the new agent. Include scope, constraints, and the expected output."}},"required":["message"],"additionalProperties":false}
  - send_input (function): Send a message to an existing agent. Use interrupt=true to redirect work immediately.
    parameters: {"type":"object","properties":{"id":{"type":"string","description":"Agent id to message (from spawn_agent)."},"interrupt":{"type":"boolean","description":"When true, stop the agent's current task and handle this immediately. When false (default), queue this message."},"message":{"type":"string","description":"Message to send to the agent."}},"required":["id","message"],"additionalProperties":false}
  - wait (function): Wait for agents to reach a final status. Completed statuses may include the agent's final message. Returns empty status when timed out.
    parameters: {"type":"object","properties":{"ids":{"type":"array","items":{"type":"string"},"description":"Agent ids to wait on. Pass multiple ids to wait for whichever finishes first."},"timeout_ms":{"type":"number","description":"Optional timeout in milliseconds. Defaults to 30000, min 10000, max 300000. Prefer longer waits (minutes) to avoid busy polling."}},"required":["ids"],"additionalProperties":false}
  - close_agent (function): Close an agent when it is no longer needed and return its last known status.
    parameters: {"type":"object","properties":{"id":{"type":"string","description":"Agent id to close (from spawn_agent)."}},"required":["id"],"additionalProperties":false}

jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string() — so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Add Step-3.5-Flash chat template and format detection test

Closes: ggml-org#19283
See also: ggml-org#19283
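A rough Python sketch of the relaxed detection order this commit describes (the real logic is C++ in llama.cpp's chat code; the marker strings come from the commit message, everything else is illustrative):

```python
def detect_chat_format(tmpl: str) -> str:
    """Toy marker-based template detection mirroring the commit's description."""
    shared = ("<tool_call>", "<function=", "<parameter=")
    if all(m in tmpl for m in shared):
        # Nemotron v3 additionally requires the bare <function> and the
        # plural <parameters> markers, so the more specific check runs first.
        if "<function>" in tmpl and "<parameters>" in tmpl:
            return "nemotron-v3"
        # Step-3.5-Flash's template has only the three shared markers.
        return "qwen3-coder-xml"
    return "hermes-2-pro"

# A Step-3.5-Flash-like template: shared markers only, plus <think>.
step_tmpl = "...<think>...<tool_call><function=f><parameter=p>..."
assert detect_chat_format(step_tmpl) == "qwen3-coder-xml"
```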
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 15, 2026
jesseposner added a commit to jesseposner/llama.cpp that referenced this pull request Feb 16, 2026
pwilkin pushed a commit that referenced this pull request Feb 19, 2026
…9635)

* common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: #19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
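A minimal sketch of what thinking_forced_open means for parsing (illustrative Python, not the actual PEG parser): because the generation prompt already ends with `<think>`, the model's output begins inside the thinking block, and everything before the first `</think>` must be routed to reasoning_content.

```python
def split_forced_thinking(output: str) -> tuple[str, str]:
    """Split output that starts inside an already-open <think> block."""
    marker = "</think>"
    end = output.find(marker)
    if end == -1:
        # Still thinking: all of it is reasoning, no visible content yet.
        return output, ""
    return output[:end], output[end + len(marker):].lstrip()

reasoning, content = split_forced_thinking(
    "The user wants quicksort...</think>Here is the code."
)
assert reasoning == "The user wants quicksort..."
assert content == "Here is the code."
```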
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors:

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026