Add missing headers for memcpy and assert by jcelerier · Pull Request #3 · ggml-org/llama.cpp

jcelerier · 2023-03-10T22:51:59Z

No description provided.

process the scanf() output so Ubuntu 22 compiler doesn't error due to…

Add streaming via server-sent events. Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"

added requirement.txt

support powerinfer without GPU

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed ggml-org#3 (TURBO_D). ggml-org#1 and ggml-org#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: ggml-org#1 4-mag LUT: 15.1 at 8K (BEST, +38%) ggml-org#2 Batched extract: 13.7 (+25%) ggml-org#3 Inline FA block: 13.5 (I-cache pressure) ggml-org#4 Deferred norm: 12.9 (loses ILP) ggml-org#5 2-pair half2: 12.0 (ternary overhead) ggml-org#6 Select chain: 11.9 (branches kill) ggml-org#7 Bit-arithmetic: 11.6 (ALU too heavy) ggml-org#8 FMA branchless: 11.4 (ALU still too heavy) ggml-org#9 Named-reg ternary: 10.3 (branches worst) ggml-org#10 Main (8-LUT): 10.95 (baseline) ggml-org#11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

Resolved conflict in ggml-turbo-quant.c (kept both 4-bit centroids and CPU WHT). Updated ISWA build_attn to use new ggml_turbo_wht 5-arg signature. Removed redundant V inverse WHT from ISWA overload (now handled in build_attn_mha). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed ggml-org#3 (TURBO_D). ggml-org#1 and ggml-org#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…s (dleex) - l5ct0: Add pre_allocate_runtime_chunks() to pinned_chunk_pool and host_cache to prevent lazy runtime pool growth during inference. Called after zone configuration with onednn_scratchpad + dma_staging_pool bytes. - 4f4o3: GGML_SYCL_HOST_ALLOC_PHASE_GATE default changed from 0 to 1 (now unblocked by l5ct0 pre-allocation) - dleex ggml-org#1: Document name shadowing in binbcast.cpp:594 as intentional (required by GGML_TENSOR_BINARY_OP_LOCALS macro) - dleex ggml-org#4: Add GGML_ASSERT bounds checking to sycl_tensor::ne()/nb() - dleex ggml-org#5: Add null assertion in sycl_tensor::resolve_as<T>() to catch unresolved tensor data early - dleex ggml-org#7: Replace silent catch in fattn.cpp resolve_host_seq_ids with GGML_LOG_WARN fallback message - dleex ggml-org#2,3,6,8: Deferred — ggml-org#2 is consistent naming, ggml-org#3 addressed by accessor migration (1vy5r), ggml-org#6 is design tension with const_cast, ggml-org#8 is a review note on past commits

Closes the "5 t/s gap" from phase25 baseline by eliminating ~96% of the graph splits on Qwen3.5-9B-mtp. The three fusions introduced in ff64be2 ggml: add GGML_OP_FUSED framework + GATE_PREP CPU kernel 20fecf6 ggml: add GGML_FUSION_SILU_MUL — fused SiLU gate multiply c23cd08 ggml: add GGML_FUSION_SIGMOID_MUL — fused sigmoid gate mul shipped with CPU kernels only. Every fused op on the Qwen3.5 hybrid SSM/attention path therefore fell back to CPU, which forced the ggml scheduler to emit ~118 graph splits per forward pass — four per SSM block (GATE_PREP + SILU_MUL or SIGMOID_MUL + transfer boundaries) plus one per attention block (SIGMOID_MUL). This change adds three new Vulkan shaders: - fused_silu_mul.comp — silu(a) * b, 2 src + 1 dst, F32 - fused_sigmoid_mul.comp — sigmoid(a) * b, 2 src + 1 dst, F32 - fused_gate_prep.comp — softplus(alpha+dt_bias[h]) * ssm_a[h], 3 src + 1 dst, F32, modular broadcast over num_v_heads (passed via push-const KY) wired through: - pipeline fields in vk_device (pipeline_fused_silu_mul_f32, etc.) - pipeline creation alongside silu_back - GGML_OP_FUSED case in ggml_vk_op_get_pipeline (dispatches by fusion_id stored at op_params[0]) - GGML_OP_FUSED added to the element-count case list so 512x512xZ workgroup splitting applies to the new ops - ggml_vk_fused() helper that reads op_params[0] and hands off to ggml_vk_op_f32<vk_op_push_constants> - GGML_OP_FUSED case in ggml_vk_compute_forward dispatch - GGML_OP_FUSED case in ggml_backend_vk_device_supports_op, mirroring the F32-only constraints of the CPU reference and checking shape / contiguity / num_v_heads Measured on Vega 64 (RADV VEGA10), Qwen3.5-9B-mtp-q4km, c=4096, -ngl 99: before after delta graph splits 118 4 -96.6% llama-server spec 27.07 t/s 37.86 t/s +39.8% llama-server nospec 27.65 t/s 37.92 t/s +37.1% batched-bench B=1 TG 4.96 t/s 36.31 t/s x7.3 batched-bench B=4 TG 22.77 t/s 67.93 t/s x2.98 batched-bench B=8 TG 19.84 t/s 69.96 t/s x3.53 batched-bench B=1 PP 11.29 t/s 75.74 t/s x6.7 Token output remains byte-identical to the pre-fix baseline at the deterministic seed used in the §9 equivalence check (prompt "Once upon a time", n_predict=64, temperature=0, seed=42), and the 77.78% MTP acceptance rate is unchanged. Four remaining splits after this change: SPLIT #0 (CPU) — empty graph-entry sync SPLIT ggml-org#1 (Vulkan) — whole-model compute SPLIT ggml-org#2 (CPU) — MTP greedy-token pick on 1 element SPLIT ggml-org#3 (Vulkan) — mtp_token_embd after greedy pick The two remaining CPU splits are the MTP-specific greedy-token hop at the tail, not the token_embd issue that --no-mmap solves. Getting to zero would need either moving greedy selection to GPU or using VK_EXT_external_memory_host for the mmap'd weight buffer (orthogonal to this change).

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed ggml-org#3 (TURBO_D). ggml-org#1 and ggml-org#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: ggml-org#1 4-mag LUT: 15.1 at 8K (BEST, +38%) ggml-org#2 Batched extract: 13.7 (+25%) ggml-org#3 Inline FA block: 13.5 (I-cache pressure) ggml-org#4 Deferred norm: 12.9 (loses ILP) ggml-org#5 2-pair half2: 12.0 (ternary overhead) ggml-org#6 Select chain: 11.9 (branches kill) ggml-org#7 Bit-arithmetic: 11.6 (ALU too heavy) ggml-org#8 FMA branchless: 11.4 (ALU still too heavy) ggml-org#9 Named-reg ternary: 10.3 (branches worst) ggml-org#10 Main (8-LUT): 10.95 (baseline) ggml-org#11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

…copy fallback Two synchronization optimizations: 1. Selective backend synchronization (ggml-org#2): Track which backends actually ran work during compute_splits() via a backends_used[] array. In ggml_backend_sched_synchronize(), only sync backends that were active. On multi-backend systems (CPU + GPU), this avoids the API call overhead and driver round-trip for idle backends. 2. Eliminate redundant split_backend sync on copy fallback (ggml-org#3): In compute_splits(), when async tensor copy fails and falls back to blocking copy, the split_backend was synchronized twice: once at the outer scope (event_wait or synchronize) and again in the fallback path. Remove the redundant second synchronize when events are not available, since the outer scope already ensured the backend completed. Also skip redundant backend_dst sync in tensor_copy_async when src and dst are the same backend. https://claude.ai/code/session_01RLqwwCXX36T9YWTzRKq4G9

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed ggml-org#3 (TURBO_D). ggml-org#1 and ggml-org#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: ggml-org#1 4-mag LUT: 15.1 at 8K (BEST, +38%) ggml-org#2 Batched extract: 13.7 (+25%) ggml-org#3 Inline FA block: 13.5 (I-cache pressure) ggml-org#4 Deferred norm: 12.9 (loses ILP) ggml-org#5 2-pair half2: 12.0 (ternary overhead) ggml-org#6 Select chain: 11.9 (branches kill) ggml-org#7 Bit-arithmetic: 11.6 (ALU too heavy) ggml-org#8 FMA branchless: 11.4 (ALU still too heavy) ggml-org#9 Named-reg ternary: 10.3 (branches worst) ggml-org#10 Main (8-LUT): 10.95 (baseline) ggml-org#11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

…replay matrix (215/215) Ships three of the four Gate 5 top-tier follow-ups from GATE5_TODO.md: D1 — multi-step replay with OP_SAMPLER_ACCEPT (item ggml-org#1) D2 — long-form 32-step replay (item ggml-org#3) D2 — additional real models: Gemma3-1B, Llama3.2-1B (item ggml-org#4) Grammar-lazy triggers (item ggml-org#2) is the remaining top-tier and will land as Phase D3. ### Phase D1 — multi-step grammar replay with accept Current replay was stateless per-step: each OP_SAMPLE constructed a fresh llm_sampler_stream, ran once, and discarded. Real inference calls llm_sampler_stream_accept() between samples to advance grammar PDA state (and, for lazy grammars, check for trigger patterns against the accept stream). Without multi-step testing, any MCU-side drift in accept-state handling was invisible. New opcode OP_SAMPLER_ACCEPT (op = 5): payload = int32 token_id (4 bytes) action = llm_grammar_accept_token(&g_grammar, token_id, g_grammar_token_texts, g_grammar_n_vocab) when g_grammar_active, else no-op reply = 16-byte ack header (status only) Both firmware and host-ref call llm_grammar_accept_token identically. Per-step byte divergence would indicate accept-state regression. 5 new multi-step grammar cases (MULTI_STEP_GRAMMAR_CASES): greedy 'abc' 3-step greedy 'hello' 5-step alt+seq ("a"|"b")("c"|"d") 2-step at temp=0.7 char-class sequence [a-d][0-9] 2-step greedy kleene-plus (x|y)+ 4-step at temp=0.9 top_p=0.9 All steps, all cases byte-identical MCU vs host-ref. ### Phase D2 — extended replay matrix Three new Makefile targets broadening Phase B3's scope: test-mcu-e2e-replay-gemma3 — Gemma3-1B (SPM, 262K vocab) 16/16 PASS test-mcu-e2e-replay-llama3 — Llama3.2-1B (BPE, 128K vocab) 16/16 PASS test-mcu-e2e-replay-long — Qwen3-0.6B × 32 steps 128/128 PASS Bundler test-mcu-e2e-replay-all runs all four (original -replay, plus the three new targets) for a 176-case aggregate. The long-form 32-step run is the numerical-drift canary: Kahan-compensated softmax + Welford online stats must stay bit-exact across many samples; a slow divergence would appear at step ~20-30 if either accumulator had a bug. ### Full Gate 5 scoreboard after Phase D test-mcu-e2e-tokenizer 50/50 (BPE + SPM × 25 prompts) test-mcu-e2e-sampler 39/39 (v1 + v3 + grammar + multi-step) test-mcu-e2e-replay 16/16 (Qwen3 4-step) test-mcu-e2e-replay-mtmd 16/16 (SmolVLM2 + image) test-mcu-e2e-replay-gemma3 16/16 (new) test-mcu-e2e-replay-llama3 16/16 (new) test-mcu-e2e-replay-long 128/128 (new, 32 steps) ───── 291/291 total Firmware size delta from C3: +720 B text (OP_SAMPLER_ACCEPT dispatch + multi-step handlers); .bss unchanged. Regression check — existing suites untouched: make test-mcu-e2e-tokenizer → 50/50 PASS make test-phase4 → PASS Docs: TEST_REPORT.md rows 17+20-22, Phase D1/D2 detail sections, summary count 19→22; GATE5_TODO.md ticks off items ggml-org#1, ggml-org#3, ggml-org#4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) ggml-org#4 Deferred norm: 12.9 (loses ILP) ggml-org#5 2-pair half2: 12.0 (ternary overhead) ggml-org#6 Select chain: 11.9 (branches kill) ggml-org#7 Bit-arithmetic: 11.6 (ALU too heavy) ggml-org#8 FMA branchless: 11.4 (ALU still too heavy) ggml-org#9 Named-reg ternary: 10.3 (branches worst) ggml-org#10 Main (8-LUT): 10.95 (baseline) ggml-org#11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

…gml-org#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 ggml-org#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 ggml-org#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 ggml-org#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 ggml-org#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 ggml-org#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 ggml-org#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 ggml-org#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110 ggml-org#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992 ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074 ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) ggml-org#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) ggml-org#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) ggml-org#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) ggml-org#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) ggml-org#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... ==45482==ABORTING

* Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

ggml-org#958) * port upstream ggml-org#16932 * Add fixed chat templates. * fix grammar when tool have no argument * Insert additional stops for Kimi-K2 * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6 * update chat.cpp * fix grammar for GLM 4.5/4.6 * chat: Fix streaming parser for granite models (ggml-org#15682) * fix(chat): fix streaming parser for granite models * tests: add test cases for Granite models chat parser * common : Fix corrupted memory error on json grammar initialization (ggml-org#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 ggml-org#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 ggml-org#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 ggml-org#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 ggml-org#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 ggml-org#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 ggml-org#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 ggml-org#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110 ggml-org#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992 ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074 ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) ggml-org#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) ggml-org#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) ggml-org#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) ggml-org#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) ggml-org#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... ==45482==ABORTING * common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264) * common : fix reasoning before forced tool call via tool_choice = required * common : improve reasoning and commentary handling when tool_choice is required (cherry picked from commit c746984) --------- Co-authored-by: Alde Rojas <hello@alde.dev> * Try fix Jinja template for GLM * Improve Kimi-K2 chat template * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: shun095 <8069181+shun095@users.noreply.github.com> Co-authored-by: David Ribeiro Alves <davidralves@gmail.com> Co-authored-by: crat0z <11581854+crat0z@users.noreply.github.com> Co-authored-by: Alde Rojas <hello@alde.dev>

Add missing headers for memcpy and assert

df8abc8

ggerganov merged commit 9dcf4db into ggml-org:master Mar 10, 2023

Hades32 referenced this pull request in Hades32/llama.cpp Mar 17, 2023

Merge pull request #3 from bigattichouse/master

2af2331

process the scanf() output so Ubuntu 22 compiler doesn't error due to…

SavageShrimp mentioned this pull request Mar 20, 2023

segmentation fault Alpaca #317

Closed

alankila mentioned this pull request Mar 22, 2023

Compute perplexity over prompt #270

Merged

sha0coder mentioned this pull request Apr 5, 2023

[Bug] dequantize_row_q4_0 segfaults #791

Closed

SlyEcho referenced this pull request in SlyEcho/llama.cpp May 31, 2023

Merge pull request #3 from anon998/sse

e6de69a

Add streaming via server-sent events. Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"

windmaple mentioned this pull request Jul 4, 2023

crash when opening the app shixiangcap/llama-jni#1

Open

ghost mentioned this pull request Jul 9, 2023

Implement classifier-free guidance #2135

Merged

rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023

Merge pull request ggml-org#3 from vovw/master

353266a

added requirement.txt

ghost mentioned this pull request Aug 8, 2023

Regression in interactive mode #2507

Closed

atopheim mentioned this pull request Sep 7, 2023

Segfault when compiling with make LLAMA_CUBLAS=1 #3054

Closed

4 tasks

dataf3l mentioned this pull request Oct 18, 2023

whe I give non-existant file names, it segfaults #3 #3663

Closed

chsasank pushed a commit to chsasank/llama.cpp that referenced this pull request Dec 20, 2023

Merge pull request ggml-org#3 from hodlen/fix/gpu-dependency

e44f640

support powerinfer without GPU

Dyke-F mentioned this pull request Dec 21, 2023

CUDA error 719 #4563

Closed

3 tasks

nasawyer7 mentioned this pull request Jan 3, 2024

CUDA error: invalid device function when compiling and running for amd gfx 1032 #4762

Closed

shahizat mentioned this pull request Jan 6, 2024

MPI issue on the Nvidia Jetson Cluster #4792

Closed

segmond mentioned this pull request Jan 14, 2024

train-text-from-scratch oom (in tokenizer?) #4300

Closed

4 tasks

java63940 mentioned this pull request Jan 16, 2024

Adreno gpu run crash #4973

Closed

enn-nafnlaus mentioned this pull request Feb 26, 2024

Smarter slot handling #5737

Closed

phymbert mentioned this pull request Mar 3, 2024

server: main loop blocked, server stuck #5851

Closed

This was referenced Apr 7, 2024

GGML_ASSERT: llama.cpp/ggml-cuda/argsort.cu:48: (ncols & (ncols - 1)) == 0 #6527

Closed

Segmentation fault during IQ3_XS generation. #6597

Closed

steampunque mentioned this pull request May 21, 2024

b2950 broke RPC mode #7427

Closed

micsthepick mentioned this pull request Jul 1, 2024

Bug: GGML assert with bf16, RTX3090 #8234

Closed

ko-alex mentioned this pull request Jul 4, 2024

Bug: gemma 2 27B GGML_ASSERT n_dims <= ne0 #8246

Closed

m828 mentioned this pull request Jul 16, 2024

Bug: ROCm CUDA error #8504

Closed

fan-chao mentioned this pull request Aug 13, 2024

[CANN] Support Q4_0 for Ascend NPU #8822

Merged

4 tasks

slaren mentioned this pull request Aug 15, 2024

Threadpool: take 2 #8672

Merged

4 tasks

znzjugod mentioned this pull request Aug 30, 2024

Bug: A crash occurs when llama-bench is running on multiple cann devices. #9250

Closed

stew675 mentioned this pull request Mar 25, 2026

Eval bug: PR20908 breaks rpc-server functionality when balancing split a model across multiple machines. #21006

Closed

rubin55 mentioned this pull request Mar 26, 2026

Eval bug: Unresolved Symbol <__memcpy_chk> when running (any?) model #21041

Closed

uaruss mentioned this pull request Mar 29, 2026

Multi-GPU ROCm illegal memory access (device -1) in recurrent state restore #21140

Closed

NickM-27 mentioned this pull request Apr 3, 2026

Eval bug: Infinite repetition loop in llama-server with peg-gemma4 parser during tool calls #21375

Open

mina-ai-io mentioned this pull request Apr 3, 2026

SYCL: flash attention tile kernel crash on 2nd prompt with Qwen3.5 #21396

Open

mzsergiu mentioned this pull request Apr 4, 2026

Eval bug: gemma-4-26B-A4B crashing (openweb-ui -> litellm -> llama.cpp version: 8661 (b7ad48ebd) #21420

Closed

greyhound3 mentioned this pull request Apr 6, 2026

Misc. bug: ggml-cuda\ggml-cuda.cu:98: CUDA error #21289

Open

Neko-Box-Coder mentioned this pull request Apr 11, 2026

Eval bug: Cuda error on split mode row after tensor parallelism changes #21773

Open

icoicqico mentioned this pull request Apr 17, 2026

llama-finetune bug: #22040

Open

jduerstock mentioned this pull request Apr 24, 2026

Misc. bug: llama-server aborts when loading model (OpenVINO) #22333

Open

IndigoFloyd mentioned this pull request Apr 26, 2026

Misc. bug: Meta backend ggml_context pool exhaustion with --split-mode tensor #22404

Open

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Add missing headers for memcpy and assert (ggml-org#3)

12131a7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add missing headers for memcpy and assert#3

Add missing headers for memcpy and assert#3
ggerganov merged 1 commit intoggml-org:masterfrom
jcelerier:patch-1

jcelerier commented Mar 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jcelerier commented Mar 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants