
Add missing headers for memcpy and assert #3

Merged
ggerganov merged 1 commit into ggml-org:master from jcelerier:patch-1 on Mar 10, 2023

Conversation

@jcelerier
Contributor

No description provided.
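Though no description was provided, the title tells the whole story: code in the tree called memcpy() and assert() without including the headers that declare them, and only built when another header happened to pull the declarations in transitively. A minimal sketch of the fix pattern (this page does not show the merged diff, so the file contents and header spellings below are illustrative, not the actual patch):

    #include <string.h>  // declares memcpy
    #include <assert.h>  // declares assert

    // Without the two includes above, a call site like this compiles only
    // by accident, via transitive includes that differ between toolchains.
    static void copy_logits(float * dst, const float * src, size_t n, size_t cap) {
        assert(n <= cap);                     // fails fast in debug builds
        memcpy(dst, src, n * sizeof(float));  // raw byte copy
    }

In C++ translation units the equivalent spellings are <cstring> and <cassert>.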

ggerganov merged commit 9dcf4db into ggml-org:master Mar 10, 2023
Hades32 referenced this pull request in Hades32/llama.cpp Mar 17, 2023
process the scanf() output so Ubuntu 22 compiler doesn't error due to…
SlyEcho referenced this pull request in SlyEcho/llama.cpp May 31, 2023
Add streaming via server-sent events.
Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"
@ghost mentioned this pull request Jul 9, 2023
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
@ghost mentioned this pull request Aug 8, 2023
chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023
@Dyke-F mentioned this pull request Dec 21, 2023
@java63940 mentioned this pull request Jan 16, 2024
@m828 mentioned this pull request Jul 16, 2024
@fan-chao mentioned this pull request Aug 13, 2024
@slaren mentioned this pull request Aug 15, 2024
TheTom referenced this pull request in TheTom/llama-cpp-turboquant Mar 26, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
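For readers skimming the timeline: finding 1 in the commit above is a classic shared-constant bug, where one block-size macro is silently reused by two quantization types. A hypothetical reconstruction (TURBO_D, QK_TURBO3, and QK_TURBO4 exist only in that fork, so the names and values here are invented for illustration):

    #include <stdint.h>

    #define QK_TURBO3 32
    #define QK_TURBO4 64
    #define TURBO_D   QK_TURBO3     // bug: one macro shared by both quant types

    typedef struct {
        float   d;                  // per-block scale
        uint8_t qs[TURBO_D / 2];    // sized from the turbo3 constant,
                                    // too small once QK_TURBO4 != QK_TURBO3
    } block_turbo4;

Giving each quant type its own block constant, or deriving the size per type, removes the aliasing.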
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request Mar 27, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aguspiza pushed a commit to aguspiza/llama.cpp that referenced this pull request Mar 28, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
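The log above is a dequantization shoot-out: every branch- or ALU-based decoder lost to a single read from a tiny constant table. A plain C++ stand-in for the comparison (the real kernels are Metal shaders, and the magnitude values are illustrative):

    #include <cstdint>

    // Variant #1 style: one read from a 4-entry constant table.
    constexpr float kMag[4] = { 1.0f, 2.0f, 4.0f, 8.0f };  // illustrative: 2^i

    inline float dequant_lut(uint8_t q2) {
        return kMag[q2 & 3];   // one constant read, divergent or not
    }

    // Variant #7/#8 style: branchless ALU reconstruction of the same values.
    // Correct, but per the log the extra ALU work loses on Apple8 GPUs.
    inline float dequant_alu(uint8_t q2) {
        return static_cast<float>(1u << (q2 & 3));
    }

The "4 constant addresses" observation is the key constraint: the table has to live at a handful of constant addresses rather than in a thread-local array, which per the log would spill on Metal.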
iosub pushed a commit to iosub/IA-MODELOS-llama.cpp that referenced this pull request Apr 1, 2026
Resolved conflict in ggml-turbo-quant.c (kept both 4-bit centroids and CPU WHT).
Updated ISWA build_attn to use new ggml_turbo_wht 5-arg signature.
Removed redundant V inverse WHT from ISWA overload (now handled in build_attn_mha).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request Apr 4, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kainlan added a commit to kainlan/llama.cpp-intel-optimizations that referenced this pull request Apr 11, 2026
…s (dleex)

- l5ct0: Add pre_allocate_runtime_chunks() to pinned_chunk_pool and host_cache
  to prevent lazy runtime pool growth during inference. Called after zone
  configuration with onednn_scratchpad + dma_staging_pool bytes.
- 4f4o3: GGML_SYCL_HOST_ALLOC_PHASE_GATE default changed from 0 to 1
  (now unblocked by l5ct0 pre-allocation)
- dleex #1: Document name shadowing in binbcast.cpp:594 as intentional
  (required by GGML_TENSOR_BINARY_OP_LOCALS macro)
- dleex #4: Add GGML_ASSERT bounds checking to sycl_tensor::ne()/nb()
- dleex #5: Add null assertion in sycl_tensor::resolve_as<T>() to catch
  unresolved tensor data early
- dleex #7: Replace silent catch in fattn.cpp resolve_host_seq_ids with
  GGML_LOG_WARN fallback message
- dleex #2,3,6,8: Deferred — #2 is consistent naming, #3 addressed by
  accessor migration (1vy5r), #6 is design tension with const_cast,
  #8 is a review note on past commits
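Of the dleex items above, #4 is the easiest to picture. A sketch of the bounds-checked accessors (sycl_tensor is specific to that branch, so the bodies below are an assumption; GGML_ASSERT and GGML_MAX_DIMS are ordinary ggml):

    #include "ggml.h"   // GGML_ASSERT, GGML_MAX_DIMS, ggml_tensor

    struct sycl_tensor {
        const ggml_tensor * t;

        int64_t ne(int i) const {
            GGML_ASSERT(i >= 0 && i < GGML_MAX_DIMS);  // reject a bad dim index early
            return t->ne[i];
        }
        size_t nb(int i) const {
            GGML_ASSERT(i >= 0 && i < GGML_MAX_DIMS);
            return t->nb[i];
        }
    };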
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 11, 2026
Closes the "5 t/s gap" from phase25 baseline by eliminating ~96% of
the graph splits on Qwen3.5-9B-mtp. The three fusions introduced in

  ff64be2 ggml: add GGML_OP_FUSED framework + GATE_PREP CPU kernel
  20fecf6 ggml: add GGML_FUSION_SILU_MUL — fused SiLU gate multiply
  c23cd08 ggml: add GGML_FUSION_SIGMOID_MUL — fused sigmoid gate mul

shipped with CPU kernels only. Every fused op on the Qwen3.5 hybrid
SSM/attention path therefore fell back to CPU, which forced the ggml
scheduler to emit ~118 graph splits per forward pass — four per SSM
block (GATE_PREP + SILU_MUL or SIGMOID_MUL + transfer boundaries) plus
one per attention block (SIGMOID_MUL).

This change adds three new Vulkan shaders:
- fused_silu_mul.comp    — silu(a) * b,           2 src + 1 dst, F32
- fused_sigmoid_mul.comp — sigmoid(a) * b,        2 src + 1 dst, F32
- fused_gate_prep.comp   — softplus(alpha+dt_bias[h]) * ssm_a[h],
                           3 src + 1 dst, F32, modular broadcast over
                           num_v_heads (passed via push-const KY)

wired through:
- pipeline fields in vk_device (pipeline_fused_silu_mul_f32, etc.)
- pipeline creation alongside silu_back
- GGML_OP_FUSED case in ggml_vk_op_get_pipeline (dispatches by
  fusion_id stored at op_params[0])
- GGML_OP_FUSED added to the element-count case list so 512x512xZ
  workgroup splitting applies to the new ops
- ggml_vk_fused() helper that reads op_params[0] and hands off to
  ggml_vk_op_f32<vk_op_push_constants>
- GGML_OP_FUSED case in ggml_vk_compute_forward dispatch
- GGML_OP_FUSED case in ggml_backend_vk_device_supports_op, mirroring
  the F32-only constraints of the CPU reference and checking shape /
  contiguity / num_v_heads

Measured on Vega 64 (RADV VEGA10), Qwen3.5-9B-mtp-q4km, c=4096, -ngl 99:

                         before           after       delta
  graph splits           118              4           -96.6%
  llama-server spec      27.07 t/s        37.86 t/s   +39.8%
  llama-server nospec    27.65 t/s        37.92 t/s   +37.1%
  batched-bench B=1 TG    4.96 t/s        36.31 t/s    x7.3
  batched-bench B=4 TG   22.77 t/s        67.93 t/s    x2.98
  batched-bench B=8 TG   19.84 t/s        69.96 t/s    x3.53
  batched-bench B=1 PP   11.29 t/s        75.74 t/s    x6.7

Token output remains byte-identical to the pre-fix baseline at the
deterministic seed used in the §9 equivalence check (prompt "Once
upon a time", n_predict=64, temperature=0, seed=42), and the 77.78%
MTP acceptance rate is unchanged.

Four remaining splits after this change:
  SPLIT #0 (CPU)    — empty graph-entry sync
  SPLIT #1 (Vulkan) — whole-model compute
  SPLIT #2 (CPU)    — MTP greedy-token pick on 1 element
  SPLIT #3 (Vulkan) — mtp_token_embd after greedy pick

The two remaining CPU splits are the MTP-specific greedy-token hop
at the tail, not the token_embd issue that --no-mmap solves. Getting
to zero would need either moving greedy selection to GPU or using
VK_EXT_external_memory_host for the mmap'd weight buffer (orthogonal
to this change).
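The three shaders added here are easiest to pin down by their per-element math, which the message above fully specifies. A scalar C++ reference (the actual implementations are Vulkan compute shaders, not this code):

    #include <cmath>

    // fused_silu_mul: silu(a) * b
    inline float silu_mul(float a, float b) {
        return (a / (1.0f + std::exp(-a))) * b;
    }

    // fused_sigmoid_mul: sigmoid(a) * b
    inline float sigmoid_mul(float a, float b) {
        return (1.0f / (1.0f + std::exp(-a))) * b;
    }

    // fused_gate_prep: softplus(alpha + dt_bias[h]) * ssm_a[h], where h is the
    // head this element belongs to (broadcast over num_v_heads).
    inline float gate_prep(float alpha, float dt_bias_h, float ssm_a_h) {
        return std::log1p(std::exp(alpha + dt_bias_h)) * ssm_a_h;  // naive softplus
    }

Fusing each pair (or triple) into one GPU kernel is what removes the CPU fallback and, with it, the per-block graph splits counted above.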
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Mastadon85 pushed a commit to Mastadon85/llama.cpp-attention-matching that referenced this pull request Apr 18, 2026
…copy fallback

Two synchronization optimizations:

1. Selective backend synchronization (#2):
   Track which backends actually ran work during compute_splits() via a
   backends_used[] array. In ggml_backend_sched_synchronize(), only sync
   backends that were active. On multi-backend systems (CPU + GPU), this
   avoids the API call overhead and driver round-trip for idle backends.

2. Eliminate redundant split_backend sync on copy fallback (#3):
   In compute_splits(), when async tensor copy fails and falls back to
   blocking copy, the split_backend was synchronized twice: once at the
   outer scope (event_wait or synchronize) and again in the fallback path.
   Remove the redundant second synchronize when events are not available,
   since the outer scope already ensured the backend completed.

Also skip redundant backend_dst sync in tensor_copy_async when src and
dst are the same backend.

https://claude.ai/code/session_01RLqwwCXX36T9YWTzRKq4G9
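A sketch of the selective synchronization in item 1 (struct and field names are illustrative rather than the exact ggml scheduler internals; ggml_backend_synchronize() is the real API):

    #include "ggml-backend.h"

    #define SCHED_MAX_BACKENDS 16

    struct sched_state {
        ggml_backend_t backends[SCHED_MAX_BACKENDS];
        bool           backends_used[SCHED_MAX_BACKENDS];  // marked during compute_splits()
        int            n_backends;
    };

    static void sched_synchronize(sched_state * s) {
        for (int i = 0; i < s->n_backends; i++) {
            if (!s->backends_used[i]) {
                continue;  // idle backend: skip the API call and driver round-trip
            }
            ggml_backend_synchronize(s->backends[i]);
            s->backends_used[i] = false;  // reset for the next graph compute
        }
    }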
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
rocktw added a commit to rocktw/llama.cpp that referenced this pull request Apr 20, 2026
…replay matrix (215/215)

Ships three of the four Gate 5 top-tier follow-ups from GATE5_TODO.md:

  D1 — multi-step replay with OP_SAMPLER_ACCEPT (item #1)
  D2 — long-form 32-step replay (item #3)
  D2 — additional real models: Gemma3-1B, Llama3.2-1B (item #4)

Grammar-lazy triggers (item #2) is the remaining top-tier and will
land as Phase D3.

### Phase D1 — multi-step grammar replay with accept

Current replay was stateless per-step: each OP_SAMPLE constructed a
fresh llm_sampler_stream, ran once, and discarded. Real inference
calls llm_sampler_stream_accept() between samples to advance grammar
PDA state (and, for lazy grammars, check for trigger patterns against
the accept stream). Without multi-step testing, any MCU-side drift
in accept-state handling was invisible.

New opcode OP_SAMPLER_ACCEPT (op = 5):
  payload = int32 token_id (4 bytes)
  action  = llm_grammar_accept_token(&g_grammar, token_id,
                                     g_grammar_token_texts,
                                     g_grammar_n_vocab)
            when g_grammar_active, else no-op
  reply   = 16-byte ack header (status only)

Both firmware and host-ref call llm_grammar_accept_token identically.
Per-step byte divergence would indicate accept-state regression.

5 new multi-step grammar cases (MULTI_STEP_GRAMMAR_CASES):
  greedy 'abc' 3-step
  greedy 'hello' 5-step
  alt+seq ("a"|"b")("c"|"d") 2-step at temp=0.7
  char-class sequence [a-d][0-9] 2-step greedy
  kleene-plus (x|y)+ 4-step at temp=0.9 top_p=0.9

All steps, all cases byte-identical MCU vs host-ref.

### Phase D2 — extended replay matrix

Three new Makefile targets broadening Phase B3's scope:

  test-mcu-e2e-replay-gemma3 — Gemma3-1B (SPM, 262K vocab)   16/16 PASS
  test-mcu-e2e-replay-llama3 — Llama3.2-1B (BPE, 128K vocab) 16/16 PASS
  test-mcu-e2e-replay-long   — Qwen3-0.6B × 32 steps        128/128 PASS

Bundler test-mcu-e2e-replay-all runs all four (original -replay,
plus the three new targets) for a 176-case aggregate.

The long-form 32-step run is the numerical-drift canary:
Kahan-compensated softmax + Welford online stats must stay bit-exact
across many samples; a slow divergence would appear at step ~20-30
if either accumulator had a bug.

### Full Gate 5 scoreboard after Phase D

  test-mcu-e2e-tokenizer       50/50   (BPE + SPM × 25 prompts)
  test-mcu-e2e-sampler         39/39   (v1 + v3 + grammar + multi-step)
  test-mcu-e2e-replay          16/16   (Qwen3 4-step)
  test-mcu-e2e-replay-mtmd     16/16   (SmolVLM2 + image)
  test-mcu-e2e-replay-gemma3   16/16   (new)
  test-mcu-e2e-replay-llama3   16/16   (new)
  test-mcu-e2e-replay-long    128/128  (new, 32 steps)
                              ─────
                               291/291 total

Firmware size delta from C3: +720 B text (OP_SAMPLER_ACCEPT dispatch
+ multi-step handlers); .bss unchanged.

Regression check — existing suites untouched:
  make test-mcu-e2e-tokenizer    → 50/50 PASS
  make test-phase4               → PASS

Docs: TEST_REPORT.md rows 17+20-22, Phase D1/D2 detail sections,
summary count 19→22; GATE5_TODO.md ticks off items #1, #3, #4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
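The OP_SAMPLER_ACCEPT wire format is fully specified above, so the dispatch is short. A sketch (send_ack() and the g_grammar* declarations stand in for that firmware's helpers; the opcode value, payload layout, and no-op rule come from the commit message):

    #include <stdint.h>
    #include <string.h>

    enum { OP_SAMPLER_ACCEPT = 5 };

    struct llm_grammar;                        // opaque; defined in the firmware
    extern bool          g_grammar_active;
    extern llm_grammar   g_grammar;
    extern const char ** g_grammar_token_texts;
    extern int32_t       g_grammar_n_vocab;
    void llm_grammar_accept_token(llm_grammar * g, int32_t token,
                                  const char ** token_texts, int32_t n_vocab);
    void send_ack(int status);                 // emits the 16-byte ack header

    static void handle_sampler_accept(const uint8_t * payload, uint32_t len) {
        if (len != sizeof(int32_t)) {          // payload = int32 token_id (4 bytes)
            send_ack(/*status=*/-1);
            return;
        }
        int32_t token_id;
        memcpy(&token_id, payload, sizeof(token_id));
        if (g_grammar_active) {                // else: no-op, per the spec above
            llm_grammar_accept_token(&g_grammar, token_id,
                                     g_grammar_token_texts, g_grammar_n_vocab);
        }
        send_ack(/*status=*/0);
    }

Running the identical accept call on both firmware and host-ref is what makes per-step byte comparison a meaningful regression signal.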
YuruDeveloper referenced this pull request in YuruDeveloper/llama.cpp-quant Apr 21, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YuruDeveloper referenced this pull request in YuruDeveloper/llama.cpp-quant Apr 21, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…gml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not
thread-safe and leads to corrupted memory when used from multiple
threads, as can be seen in the ASAN trace below. This fixes the
initialization to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
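The patch itself isn't quoted here, but the standard C++ remedy for this class of racy lazy initialization is a function-local static, which the language initializes exactly once with a compiler-emitted thread-safe guard (C++11 "magic statics"). A sketch with illustrative set contents:

    #include <string>
    #include <unordered_set>

    static bool is_reserved_name(const std::string & name) {
        // Initialized exactly once; concurrent first callers block until the
        // initialization completes, so no thread sees a half-built table.
        static const std::unordered_set<std::string> RESERVED_NAMES = {
            "root", "space",   // illustrative entries
        };
        return RESERVED_NAMES.find(name) != RESERVED_NAMES.end();
    }

std::call_once would work as well; the local static is simply the least code.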
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often the
case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
ggml-org#958)

* port upstream ggml-org#16932

* Add fixed chat templates.

* fix grammar when a tool has no arguments

* Insert additional stops for Kimi-K2

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* chat: Fix streaming parser for granite models (ggml-org#15682)

* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser

* common : Fix corrupted memory error on json grammar initialization (ggml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not
thread-safe and leads to corrupted memory when used from multiple
threads, as can be seen in the ASAN trace below. This fixes the
initialization to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING

* common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984)

---------

Co-authored-by: Alde Rojas <hello@alde.dev>

* Try fix Jinja template for GLM

* Improve Kimi-K2 chat template

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com>
Co-authored-by: David Ribeiro Alves <davidralves@gmail.com>
Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com>
Co-authored-by: Alde Rojas <hello@alde.dev>