llama: fix magic number of 999 for GPU layers#18266
Merged
JohannesGaessler merged 3 commits into ggml-org:master on Dec 27, 2025
Conversation
Force-pushed from 87b383d to 6c0ed15
Contributor
Interesting... I had been using 999 layers to mean "all" layers in my launch configs, and didn't realize that only that specific value would trigger fit to be enabled. I thought fit was just a new default feature, and went ahead and added -fit off to all of my launch configs to work around it 😂.
am17an reviewed on Dec 22, 2025
common/arg.cpp (outdated)

     add_opt(common_arg(
         {"-ngl", "--gpu-layers", "--n-gpu-layers"}, "N",
-        string_format("max. number of layers to store in VRAM (default: %d)", params.n_gpu_layers),
+        string_format("max. number of layers to store in VRAM, -1 is auto, <= -2 is all (default: %d)", params.n_gpu_layers),
Contributor
For UX purposes a value of -2 is quite non-intuitive; it might make sense to accept auto, all, or a number for this, similar to what we have for FA (0/1 and on/off/auto all work).
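A rough sketch of how such string values could map onto the existing integer field (hypothetical helper; the name parse_gpu_layers and the -1/-2 sentinels are assumptions taken from the help text above, not the code that was actually merged):

#include <cstdint>
#include <string>

// Hypothetical helper: accept "auto", "all", or a plain number for -ngl and
// map it onto the integer convention from the help text (-1 = auto/fit, <= -2 = all).
static int32_t parse_gpu_layers(const std::string & value) {
    if (value == "auto") { return -1; }
    if (value == "all")  { return -2; }
    return std::stoi(value); // throws std::invalid_argument for non-numeric input
}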
ggerganov reviewed on Dec 23, 2025
src/llama-context.cpp (outdated)

     bool pipeline_parallel =
         model.n_devices() > 1 &&
-        model.params.n_gpu_layers > (int) model.hparams.n_layer &&
+        uint32_t(model.params.n_gpu_layers) > model.hparams.n_layer &&
Member
This handling of negative gpu layers feels a bit error-prone; it is easy to forget to handle this case. Something like this should improve safety a bit:
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 6ead26820..ef0032f01 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -294,8 +294,8 @@ llama_context::llama_context(
// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
model.n_devices() > 1 &&
- uint32_t(model.params.n_gpu_layers) > model.hparams.n_layer &&
- model.params.split_mode == LLAMA_SPLIT_MODE_LAYER &&
+ model.n_gpu_layers() > model.hparams.n_layer &&
+ model.split_mode() == LLAMA_SPLIT_MODE_LAYER &&
cparams.offload_kqv &&
!model.has_tensor_overrides();
@@ -1571,7 +1571,7 @@ llm_graph_cb llama_context::graph_get_cb() const {
// norm may be automatically assigned to the backend of the previous layer, increasing data transfer between backends
// FIXME: fix in ggml_backend_sched
- const bool full_offload = uint32_t(model.params.n_gpu_layers) > model.hparams.n_layer;
+ const bool full_offload = model.n_gpu_layers() > model.hparams.n_layer;
if (ubatch.n_tokens < 32 || full_offload) {
if (il != -1 && strcmp(name, "norm") == 0) {
const auto & dev_layer = model.dev_layer(il);
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index e10309439..3c861a9e4 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2333,7 +2333,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
const auto & tensor_split = params.tensor_split;
const int n_layer = hparams.n_layer;
- const int n_gpu_layers = params.n_gpu_layers >= 0 ? params.n_gpu_layers : n_layer + 1;
+ const int n_gpu_layers = this->n_gpu_layers();
const bool use_mmap_buffer = true;
@@ -6765,6 +6765,14 @@ size_t llama_model::n_devices() const {
return devices.size();
}
+uint32_t llama_model::n_gpu_layers() const {
+ return params.n_gpu_layers >= 0 ? params.n_gpu_layers : hparams.n_layer + 1;
+}
+
+llama_split_mode llama_model::split_mode() const {
+ return params.split_mode;
+}
+
std::map<ggml_backend_buffer_type_t, size_t> llama_model::memory_breakdown() const {
std::map<ggml_backend_buffer_type_t, size_t> ret;
for (const auto & [ctx, bufs] : pimpl->ctxs_bufs) {
diff --git a/src/llama-model.h b/src/llama-model.h
index c6eb95318..b6c3e006d 100644
--- a/src/llama-model.h
+++ b/src/llama-model.h
@@ -462,8 +462,6 @@ struct llama_model {
struct ggml_tensor * dense_2_out_layers = nullptr;
struct ggml_tensor * dense_3_out_layers = nullptr;
- llama_model_params params;
-
// gguf metadata
std::unordered_map<std::string, std::string> gguf_kv;
@@ -494,6 +492,9 @@ struct llama_model {
size_t n_tensors() const;
size_t n_devices() const;
+ uint32_t n_gpu_layers() const;
+ llama_split_mode split_mode() const;
+
std::map<ggml_backend_buffer_type_t, size_t> memory_breakdown() const;
// total number of parameters in the model
@@ -522,6 +523,8 @@ struct llama_model {
ggml_cgraph * build_graph(const llm_graph_params & params) const;
private:
+ llama_model_params params;
+
struct impl;
std::unique_ptr<impl> pimpl;
};
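A note on the suggestion above: besides adding the n_gpu_layers() and split_mode() accessors, it moves llama_model_params params into the private section of llama_model, so code outside the class can no longer read params.n_gpu_layers directly and is forced through the accessor, which already normalizes negative values to n_layer + 1.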
ggerganov approved these changes on Dec 27, 2025
Force-pushed from 5678fc0 to 5491591
blime4 referenced this pull request in blime4/llama.cpp on Feb 5, 2026:
* llama: fix magic number of 999 for GPU layers
* use strings for -ngl, -ngld
* encapsulate n_gpu_layers, split_mode
Fixes #18258.

As of right now, llama_params_fit is disabled if the user sets any value other than the defaults. However, because the default value for the number of GPU layers is 999, this is a value that a user could feasibly set manually without the code recognizing it as the default. This PR makes the llama API explicitly recognize negative values for n_gpu_layers to mean that all layers should be put on the GPU, and the default is changed to -1. For the CLI arguments, a value of -1 means llama_params_fit is used, while other negative values mean all layers on the GPU.

This PR also fixes a minor bug where a print in llama_params_fit reported the sum of total memory rather than the sum of free memory.
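As a minimal illustration of the new convention (a sketch, assuming the current llama.h model-loading API; model.gguf is a placeholder path, not part of this PR):

#include "llama.h"
#include <cstdio>

int main() {
    llama_model_params mparams = llama_model_default_params();
    // New default: -1. In the llama API any negative value means "all layers on the GPU";
    // the magic 999 is no longer special.
    mparams.n_gpu_layers = -1;

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }
    llama_model_free(model);
    return 0;
}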