Commit 9ff11f5

Merge branch 'master' of github.com:ggml-org/llama.cpp
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  - convert : better mtp check and fix return [no ci] (ggml-org#20419)
  - vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  - New conversations now auto-select the first loaded model (ggml-org#20403)
  - ggml-virtgpu: Fix some build commands (ggml-org#20341)
  - metal : avoid divisions in bin kernel (ggml-org#20426)
  - ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  - vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  - vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  - vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  - opencl: use larger workgroup size for get_rows (ggml-org#20316)
  - opencl: add cumsum op (ggml-org#18981)
  - hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  - common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  - model : add support for Phi4ForCausalLMV (ggml-org#20168)
  - graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  - common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  - ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  - llama : enable chunked fused GDN path (ggml-org#20340)
  - llama : whitespace cleanup (ggml-org#20422)
  - ggml : add NVFP4 quantization type support (ggml-org#19769)
  - ...
2 parents 47df07f + c3e3f9e commit 9ff11f5

File tree: 102 files changed, +3980 −467 lines

.github/workflows/build.yml

Lines changed: 17 additions & 0 deletions

```diff
@@ -469,6 +469,7 @@ jobs:
           cd build
           export GGML_VK_VISIBLE_DEVICES=0
           export GGML_VK_DISABLE_F16=1
+          export GGML_VK_DISABLE_COOPMAT=1
           # This is using llvmpipe and runs slower than other backends
           ctest -L main --verbose --timeout 4800
@@ -1726,6 +1727,22 @@ jobs:
         vulkaninfo --summary
         GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+
+  ggml-ci-x64-linux-intel-vulkan:
+    runs-on: [self-hosted, Linux, X64, Intel]
+
+    steps:
+      - name: Clone
+        id: checkout
+        uses: actions/checkout@v6
+        with:
+          persist-credentials: false
+
+      - name: Test
+        id: ggml-ci
+        run: |
+          vulkaninfo --summary
+          GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+
   ggml-ci-arm64-cpu-kleidiai:
     runs-on: ubuntu-22.04-arm
```

Lines changed: 72 additions & 0 deletions (new file)

# NVIDIA DGX Spark

## System info

```bash
uname --all
Linux spark-17ed 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

g++ --version
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

nvidia-smi
Fri Mar  6 11:39:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   52C    P0             13W /  N/A  |      Not Supported     |     0%       Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

## ggml-org/nemotron-3-super-120b-GGUF

Model: https://huggingface.co/ggml-org/nemotron-3-super-120b-GGUF

- `llama-batched-bench`

main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 99, n_threads = 20, n_threads_batch = 20

| PP    | TG     | B    | N_KV   | T_PP s   | S_PP t/s | T_TG s   | S_TG t/s | T s      | S t/s    |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    1 |    544 |    1.094 |   468.05 |    1.621 |    19.74 |    2.715 |   200.37 |
|   512 |     32 |    2 |   1088 |    1.463 |   700.16 |    2.437 |    26.26 |    3.900 |   279.01 |
|   512 |     32 |    4 |   2176 |    2.647 |   773.76 |    4.043 |    31.66 |    6.689 |   325.29 |
|   512 |     32 |    8 |   4352 |    5.291 |   774.14 |    6.151 |    41.62 |   11.442 |   380.37 |
|   512 |     32 |   16 |   8704 |   10.603 |   772.62 |   10.385 |    49.30 |   20.987 |   414.72 |
|   512 |     32 |   32 |  17408 |   21.231 |   771.69 |   18.235 |    56.16 |   39.466 |   441.09 |
|  4096 |     32 |    1 |   4128 |    5.340 |   767.05 |    1.616 |    19.81 |    6.956 |   593.47 |
|  4096 |     32 |    2 |   8256 |   10.673 |   767.55 |    2.454 |    26.08 |   13.127 |   628.94 |
|  4096 |     32 |    4 |  16512 |   21.348 |   767.46 |    4.072 |    31.44 |   25.420 |   649.57 |
|  4096 |     32 |    8 |  33024 |   42.714 |   767.15 |    6.277 |    40.78 |   48.991 |   674.08 |
|  4096 |     32 |   16 |  66048 |   85.385 |   767.54 |   10.596 |    48.32 |   95.981 |   688.14 |
|  4096 |     32 |   32 | 132096 |  170.819 |   767.32 |   18.619 |    55.00 |  189.437 |   697.31 |
|  8192 |     32 |    1 |   8224 |   10.690 |   766.32 |    1.619 |    19.76 |   12.310 |   668.10 |
|  8192 |     32 |    2 |  16448 |   21.382 |   766.24 |    2.467 |    25.94 |   23.850 |   689.65 |
|  8192 |     32 |    4 |  32896 |   42.782 |   765.92 |    4.098 |    31.23 |   46.881 |   701.69 |
|  8192 |     32 |    8 |  65792 |   85.582 |   765.77 |    6.368 |    40.20 |   91.951 |   715.52 |
|  8192 |     32 |   16 | 131584 |  171.066 |   766.21 |   10.774 |    47.52 |  181.840 |   723.62 |
|  8192 |     32 |   32 | 263168 |  342.140 |   766.19 |   18.969 |    53.98 |  361.109 |   728.78 |

- `llama-bench`

| model                   |       size |     params | backend    | n_ubatch | fa | test            |                  t/s |
| ----------------------- | ---------: | ---------: | ---------- | -------: | -: | --------------: | -------------------: |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | pp2048          |        768.84 ± 0.90 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | tg32            |         19.94 ± 0.16 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | pp2048 @ d4096  |        764.51 ± 0.50 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | tg32 @ d4096    |         19.95 ± 0.18 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | pp2048 @ d8192  |        759.53 ± 0.71 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | tg32 @ d8192    |         19.83 ± 0.18 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | pp2048 @ d16384 |        747.98 ± 1.58 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | tg32 @ d16384   |         19.84 ± 0.18 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | pp2048 @ d32768 |        724.40 ± 2.70 |
| nemotron 120B.A12B Q4_K |  65.10 GiB |   120.67 B | CUDA       |     2048 |  1 | tg32 @ d32768   |         19.45 ± 0.18 |

build: 04a65daab (8268)

common/CMakeLists.txt

Lines changed: 2 additions & 0 deletions

```diff
@@ -81,6 +81,8 @@ add_library(${TARGET} STATIC
     preset.cpp
     preset.h
     regex-partial.cpp
+    reasoning-budget.cpp
+    reasoning-budget.h
     regex-partial.h
     sampling.cpp
     sampling.h
```

common/arg.cpp

Lines changed: 31 additions & 2 deletions

```diff
@@ -2913,6 +2913,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             auto parsed = json::parse(value);
             for (const auto & item : parsed.items()) {
+                if (item.key() == "enable_thinking") {
+                    LOG_WRN("Setting 'enable_thinking' via --chat-template-kwargs is deprecated. "
+                            "Use --reasoning on / --reasoning off instead.\n");
+                }
                 params.default_template_kwargs[item.key()] = item.value().dump();
             }
         }
@@ -3048,14 +3052,39 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.reasoning_format = common_reasoning_format_from_name(value);
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_THINK"));
+    add_opt(common_arg(
+        {"-rea", "--reasoning"}, "[on|off|auto]",
+        "Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))",
+        [](common_params & params, const std::string & value) {
+            if (is_truthy(value)) {
+                params.enable_reasoning = 1;
+                params.default_template_kwargs["enable_thinking"] = "true";
+            } else if (is_falsey(value)) {
+                params.enable_reasoning = 0;
+                params.default_template_kwargs["enable_thinking"] = "false";
+            } else if (is_autoy(value)) {
+                params.enable_reasoning = -1;
+            } else {
+                throw std::invalid_argument(
+                    string_format("error: unknown value for --reasoning: '%s'\n", value.c_str()));
+            }
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_REASONING"));
     add_opt(common_arg(
         {"--reasoning-budget"}, "N",
-        "controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)",
+        "token budget for thinking: -1 for unrestricted, 0 for immediate end, N>0 for token budget (default: -1)",
         [](common_params & params, int value) {
-            if (value != 0 && value != -1) { throw std::invalid_argument("invalid value"); }
+            if (value < -1) { throw std::invalid_argument("invalid value"); }
             params.reasoning_budget = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_THINK_BUDGET"));
+    add_opt(common_arg(
+        {"--reasoning-budget-message"}, "MESSAGE",
+        "message injected before the end-of-thinking tag when reasoning budget is exhausted (default: none)",
+        [](common_params & params, const std::string & value) {
+            params.reasoning_budget_message = value;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_THINK_BUDGET_MESSAGE"));
     add_opt(common_arg(
         {"--chat-template"}, "JINJA_TEMPLATE",
         string_format(
```

common/chat-auto-parser-generator.cpp

Lines changed: 3 additions & 1 deletion

```diff
@@ -135,7 +135,9 @@ common_peg_parser analyze_reasoning::build_parser(parser_build_context & ctx) co
     if (thinking_forced_open || thinking_forced_closed) {
         // Thinking is forced open OR forced closed with enable_thinking=true
         // In both cases, expect only the closing tag (opening was in template)
-        return p.reasoning(p.until(end)) + end;
+        // However, since we might have incorrectly detected the open/close pattern,
+        // we admit an optional starting marker
+        return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
     }
     if (mode == reasoning_mode::TAG_BASED || mode == reasoning_mode::TOOLS_ONLY) {
         // Standard tag-based reasoning OR tools-only mode (reasoning appears with tools)
```

common/chat-peg-parser.cpp

Lines changed: 24 additions & 24 deletions

```diff
@@ -6,7 +6,7 @@

 #include <nlohmann/json.hpp>

-using json = nlohmann::ordered_json;
+using ordered_json = nlohmann::ordered_json;

 static std::string_view trim_trailing_space(std::string_view sv, int max = -1) {
     int count = 0;
@@ -68,7 +68,7 @@ static int json_brace_depth(const std::string & s) {

 // JSON-escape a string and return the inner content (without surrounding quotes).
 static std::string escape_json_string_inner(const std::string & s) {
-    std::string escaped = json(s).dump();
+    std::string escaped = ordered_json(s).dump();
     if (escaped.size() >= 2 && escaped.front() == '"' && escaped.back() == '"') {
         return escaped.substr(1, escaped.size() - 2);
     }
@@ -309,7 +309,7 @@ void common_chat_peg_mapper::map(const common_peg_ast_node & node) {
         if (arg_count > 0) {
             arg_entry = ",";
         }
-        arg_entry += json(trim(node.text)).dump() + ":";
+        arg_entry += ordered_json(trim(node.text)).dump() + ":";
         ++arg_count;

         auto & target = args_target();
@@ -343,7 +343,7 @@ void common_chat_peg_mapper::map(const common_peg_ast_node & node) {

         // Try to parse as JSON value (number, bool, null, object, array)
         try {
-            json parsed = json::parse(value_content);
+            ordered_json parsed = ordered_json::parse(value_content);
             if (parsed.is_string()) {
                 // Don't add closing quote yet (added by arg_close) for monotonic streaming
                 std::string escaped = parsed.dump();
@@ -408,7 +408,7 @@ void common_chat_peg_mapper::map(const common_peg_ast_node & node) {

 common_peg_parser common_chat_peg_builder::standard_constructed_tools(
     const std::map<std::string, std::string> & markers,
-    const nlohmann::json & tools,
+    const ordered_json & tools,
     bool parallel_tool_calls,
     bool force_tool_calls) {
     if (!tools.is_array() || tools.empty()) {
@@ -439,7 +439,7 @@ common_peg_parser common_chat_peg_builder::standard_constructed_tools(
         }
         const auto & function = tool_def.at("function");
         std::string name = function.at("name");
-        nlohmann::json params = function.contains("parameters") ? function.at("parameters") : nlohmann::json::object();
+        ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();

         // Build argument parsers
         auto args = eps();
@@ -479,8 +479,8 @@ common_peg_parser common_chat_peg_builder::standard_constructed_tools(
 // Python-style tool calls: name(arg1="value1", arg2=123)
 // Used only by LFM2 for now, so we don't merge it into autoparser
 common_peg_parser common_chat_peg_builder::python_style_tool_calls(
-    const nlohmann::json & tools,
-    bool parallel_tool_calls) {
+    const ordered_json & tools,
+    bool parallel_tool_calls) {
     if (!tools.is_array() || tools.empty()) {
         return eps();
     }
@@ -493,7 +493,7 @@ common_peg_parser common_chat_peg_builder::python_style_tool_calls(
         }
         const auto & function = tool_def.at("function");
         std::string name = function.at("name");
-        nlohmann::json params = function.contains("parameters") ? function.at("parameters") : nlohmann::json::object();
+        ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();

         auto args = eps();
         if (params.contains("properties") && !params["properties"].empty()) {
@@ -555,11 +555,11 @@ static std::pair<std::string, std::string> parse_key_spec(const std::string & ke

 // Mode 1: function_is_key — parse {"function_name": {...}}
 common_peg_parser common_chat_peg_builder::build_json_tools_function_is_key(
-    const nlohmann::json & tools,
-    const std::string & args_key,
-    const std::string & effective_args_key,
-    const std::string & call_id_key,
-    const std::string & gen_call_id_key) {
+    const ordered_json & tools,
+    const std::string & args_key,
+    const std::string & effective_args_key,
+    const std::string & call_id_key,
+    const std::string & gen_call_id_key) {

     auto tool_choices = choice();

@@ -569,7 +569,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_function_is_key(
         }
         const auto & function = tool_def.at("function");
         std::string name = function.at("name");
-        nlohmann::json params = function.contains("parameters") ? function.at("parameters") : nlohmann::json::object();
+        ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();

         // Build inner object fields
         std::vector<common_peg_parser> inner_fields;
@@ -634,11 +634,11 @@ common_peg_parser common_chat_peg_builder::build_json_tools_function_is_key(

 // Mode 2: Nested keys (dot notation like "function.name")
 common_peg_parser common_chat_peg_builder::build_json_tools_nested_keys(
-    const nlohmann::json & tools,
-    const std::string & effective_name_key,
-    const std::string & effective_args_key,
-    const std::string & call_id_key,
-    const std::string & gen_call_id_key) {
+    const ordered_json & tools,
+    const std::string & effective_name_key,
+    const std::string & effective_args_key,
+    const std::string & call_id_key,
+    const std::string & gen_call_id_key) {

     auto tool_choices = choice();

@@ -655,7 +655,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_nested_keys(
         }
         const auto & function = tool_def.at("function");
         std::string name = function.at("name");
-        nlohmann::json params = function.contains("parameters") ? function.at("parameters") : nlohmann::json::object();
+        ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();

         auto nested_name = literal("\"" + nested_name_field + "\"") + space() + literal(":") + space() +
                            literal("\"") + tool_name(literal(name)) + literal("\"");
@@ -706,7 +706,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_nested_keys(

 // Mode 3: Flat keys with optional ID fields and parameter ordering
 common_peg_parser common_chat_peg_builder::build_json_tools_flat_keys(
-    const nlohmann::json & tools,
+    const ordered_json & tools,
     const std::string & effective_name_key,
     const std::string & effective_args_key,
     const std::string & call_id_key,
@@ -723,7 +723,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_flat_keys(
         }
         const auto & function = tool_def.at("function");
         std::string name = function.at("name");
-        nlohmann::json params = function.contains("parameters") ? function.at("parameters") : nlohmann::json::object();
+        ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();

         auto tool_name_ = name_key_parser + space() + literal(":") + space() +
                           literal("\"") + tool_name(literal(name)) + literal("\"");
@@ -791,7 +791,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_flat_keys(
 common_peg_parser common_chat_peg_builder::standard_json_tools(
     const std::string & section_start,
     const std::string & section_end,
-    const nlohmann::json & tools,
+    const ordered_json & tools,
     bool parallel_tool_calls,
     bool force_tool_calls,
     const std::string & name_key,
```
