C:\llama.cpp\llama-b8241-bin-win-cuda-13.1-x64>.\llama-server.exe -m C:/Users/vladn/.lmstudio/models/lmstudio-community/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf --host 0.0.0.0 -c 262144 --log-timestamps --log-prefix --rpc gx10-bb36:50052 --chat-template-kwargs "{\"enable_thinking\":false}" --no-mmap --fit-target 4095,4096,4097
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\llama.cpp\llama-b8241-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\llama.cpp\llama-b8241-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llama.cpp\llama-b8241-bin-win-cuda-13.1-x64\ggml-cpu-alderlake.dll
0.02.831.294 I main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.02.831.309 I build: 8241 (62b8143ad) with Clang 19.1.5 for Windows x86_64
0.02.831.319 I system info: n_threads = 24, n_threads_batch = 24, total_threads = 32
0.02.831.320 I
0.02.831.369 I system_info: n_threads = 24 (n_threads_batch = 24) / 32 | CUDA : ARCHS = 750,800,860,890,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.02.831.370 I
0.02.831.373 I Running without SSL
0.02.831.384 I init: using 31 threads for HTTP server
0.02.831.458 I start: binding port with default address family
0.02.832.318 I main: loading model
0.02.832.319 I srv load_model: loading model 'C:/Users/vladn/.lmstudio/models/lmstudio-community/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf'
0.02.832.376 I common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
0.03.911.357 I llama_params_fit_impl: projected memory use with initial parameters [MiB]:
0.03.911.365 I llama_params_fit_impl: - RPC0 (gx10-bb36:50052) : 122502 total, 164575 used, -46590 free vs. target of 4095
0.03.911.366 I llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5090): 32606 total, 43674 used, -12881 free vs. target of 0
0.03.911.366 I llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 4090): 24563 total, 30862 used, -7958 free vs. target of 1
0.03.911.367 I llama_params_fit_impl: projected to use 239111 MiB of device memory vs. 171682 MiB of free device memory
0.03.911.367 I llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 71525 MiB less in total
0.03.911.367 I llama_params_fit_impl: context size set by user to 262144 -> no change
0.04.322.483 I llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 148469 MiB
0.04.804.938 I llama_params_fit_impl: filling dense-only layers back-to-front:
0.05.264.416 I llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 4090): 61 layers, 18938 MiB used, 3965 MiB free
0.05.264.424 I llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5090): 0 layers, 2840 MiB used, 27952 MiB free
0.05.264.425 I llama_params_fit_impl: - RPC0 (gx10-bb36:50052) : 0 layers, 0 MiB used, 117985 MiB free
0.05.264.426 I llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
0.09.141.830 I llama_params_fit_impl: - RPC0 (gx10-bb36:50052) : 29 layers ( 1 overflowing), 113619 MiB used, 4366 MiB free
0.15.511.768 I llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5090): 7 layers ( 1 overflowing), 29761 MiB used, 1031 MiB free
0.18.824.381 I llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 4090): 25 layers (21 overflowing), 22644 MiB used, 259 MiB free
0.18.824.391 I llama_params_fit: successfully fit params to free device memory
0.18.824.394 I llama_params_fit: fitting params to free memory took 0.73 seconds
0.18.827.980 I llama_model_load_from_file_impl: using device RPC0 (gx10-bb36:50052) (unknown id) - 118526 MiB free
0.18.828.015 I llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:08:00.0) - 30927 MiB free
0.18.828.027 I llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 22988 MiB free
0.18.939.729 I llama_model_loader: additional 6 GGUFs metadata loaded.
0.18.939.739 I llama_model_loader: loaded meta data with 43 key-value pairs and 1098 tensors from C:/Users/vladn/.lmstudio/models/lmstudio-community/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf (version GGUF V3 (latest))
0.18.939.796 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.18.939.798 I llama_model_loader: - kv 0: general.architecture str = qwen35moe
0.18.939.799 I llama_model_loader: - kv 1: general.type str = model
0.18.939.800 I llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
0.18.939.804 I llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
0.18.939.820 I llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
0.18.939.821 I llama_model_loader: - kv 5: general.name str = Qwen_Qwen3.5 397B A17B
0.18.939.822 I llama_model_loader: - kv 6: general.basename str = Qwen_Qwen3.5
0.18.939.822 I llama_model_loader: - kv 7: general.size_label str = 397B
0.18.939.823 I llama_model_loader: - kv 8: qwen35moe.block_count u32 = 60
0.18.939.824 I llama_model_loader: - kv 9: qwen35moe.context_length u32 = 262144
0.18.939.824 I llama_model_loader: - kv 10: qwen35moe.embedding_length u32 = 4096
0.18.939.825 I llama_model_loader: - kv 11: qwen35moe.attention.head_count u32 = 32
0.18.939.825 I llama_model_loader: - kv 12: qwen35moe.attention.head_count_kv u32 = 2
0.18.939.834 I llama_model_loader: - kv 13: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
0.18.939.836 I llama_model_loader: - kv 14: qwen35moe.rope.freq_base f32 = 10000000.000000
0.18.939.837 I llama_model_loader: - kv 15: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
0.18.939.838 I llama_model_loader: - kv 16: qwen35moe.expert_count u32 = 512
0.18.939.838 I llama_model_loader: - kv 17: qwen35moe.expert_used_count u32 = 10
0.18.939.839 I llama_model_loader: - kv 18: qwen35moe.attention.key_length u32 = 256
0.18.939.839 I llama_model_loader: - kv 19: qwen35moe.attention.value_length u32 = 256
0.18.939.840 I llama_model_loader: - kv 20: qwen35moe.expert_feed_forward_length u32 = 1024
0.18.939.840 I llama_model_loader: - kv 21: qwen35moe.expert_shared_feed_forward_length u32 = 1024
0.18.939.841 I llama_model_loader: - kv 22: qwen35moe.ssm.conv_kernel u32 = 4
0.18.939.841 I llama_model_loader: - kv 23: qwen35moe.ssm.state_size u32 = 128
0.18.939.842 I llama_model_loader: - kv 24: qwen35moe.ssm.group_count u32 = 16
0.18.939.842 I llama_model_loader: - kv 25: qwen35moe.ssm.time_step_rank u32 = 64
0.18.939.843 I llama_model_loader: - kv 26: qwen35moe.ssm.inner_size u32 = 8192
0.18.939.843 I llama_model_loader: - kv 27: qwen35moe.full_attention_interval u32 = 4
0.18.939.844 I llama_model_loader: - kv 28: qwen35moe.rope.dimension_count u32 = 64
0.18.939.844 I llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
0.18.939.845 I llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen35
0.18.992.422 I llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
0.19.014.442 I llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.19.087.568 I llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,247587] = ["─á ─á", "─á─á ─á─á", "i n", "─á t",...
0.19.087.580 I llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 248046
0.19.087.581 I llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 248044
0.19.087.584 I llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
0.19.087.591 I llama_model_loader: - kv 37: tokenizer.chat_template str = {%- set image_count = namespace(value...
0.19.087.592 I llama_model_loader: - kv 38: general.quantization_version u32 = 2
0.19.087.593 I llama_model_loader: - kv 39: general.file_type u32 = 15
0.19.087.594 I llama_model_loader: - kv 40: split.no u16 = 0
0.19.087.595 I llama_model_loader: - kv 41: split.tensors.count i32 = 1098
0.19.087.595 I llama_model_loader: - kv 42: split.count u16 = 7
0.19.087.596 I llama_model_loader: - type f32: 451 tensors
0.19.087.597 I llama_model_loader: - type q4_K: 532 tensors
0.19.087.598 I llama_model_loader: - type q5_K: 45 tensors
0.19.087.598 I llama_model_loader: - type q6_K: 70 tensors
0.19.087.599 I print_info: file format = GGUF V3 (latest)
0.19.087.600 I print_info: file type = Q4_K - Medium
0.19.087.604 I print_info: file size = 224.06 GiB (4.86 BPW)
0.19.281.738 I load: 0 unused tokens
0.19.336.179 I load: printing all EOG tokens:
0.19.336.187 I load: - 248044 ('<|endoftext|>')
0.19.336.187 I load: - 248046 ('<|im_end|>')
0.19.336.188 I load: - 248063 ('<|fim_pad|>')
0.19.336.188 I load: - 248064 ('<|repo_name|>')
0.19.336.189 I load: - 248065 ('<|file_sep|>')
0.19.342.104 I load: special tokens cache size = 33
0.19.408.217 I load: token to piece cache size = 1.7581 MB
0.19.408.234 I print_info: arch = qwen35moe
0.19.408.235 I print_info: vocab_only = 0
0.19.408.235 I print_info: no_alloc = 0
0.19.408.235 I print_info: n_ctx_train = 262144
0.19.408.236 I print_info: n_embd = 4096
0.19.408.236 I print_info: n_embd_inp = 4096
0.19.408.237 I print_info: n_layer = 60
0.19.408.247 I print_info: n_head = 32
0.19.408.249 I print_info: n_head_kv = 2
0.19.408.250 I print_info: n_rot = 64
0.19.408.250 I print_info: n_swa = 0
0.19.408.251 I print_info: is_swa_any = 0
0.19.408.251 I print_info: n_embd_head_k = 256
0.19.408.251 I print_info: n_embd_head_v = 256
0.19.408.254 I print_info: n_gqa = 16
0.19.408.256 I print_info: n_embd_k_gqa = 512
0.19.408.258 I print_info: n_embd_v_gqa = 512
0.19.408.259 I print_info: f_norm_eps = 0.0e+00
0.19.408.261 I print_info: f_norm_rms_eps = 1.0e-06
0.19.408.261 I print_info: f_clamp_kqv = 0.0e+00
0.19.408.261 I print_info: f_max_alibi_bias = 0.0e+00
0.19.408.262 I print_info: f_logit_scale = 0.0e+00
0.19.408.262 I print_info: f_attn_scale = 0.0e+00
0.19.408.264 I print_info: n_ff = 0
0.19.408.265 I print_info: n_expert = 512
0.19.408.265 I print_info: n_expert_used = 10
0.19.408.265 I print_info: n_expert_groups = 0
0.19.408.266 I print_info: n_group_used = 0
0.19.408.266 I print_info: causal attn = 1
0.19.408.266 I print_info: pooling type = 0
0.19.408.267 I print_info: rope type = 40
0.19.408.267 I print_info: rope scaling = linear
0.19.408.268 I print_info: freq_base_train = 10000000.0
0.19.408.269 I print_info: freq_scale_train = 1
0.19.408.270 I print_info: n_ctx_orig_yarn = 262144
0.19.408.270 I print_info: rope_yarn_log_mul = 0.0000
0.19.408.271 I print_info: rope_finetuned = unknown
0.19.408.271 I print_info: mrope sections = [11, 11, 10, 0]
0.19.408.271 I print_info: ssm_d_conv = 4
0.19.408.272 I print_info: ssm_d_inner = 8192
0.19.408.272 I print_info: ssm_d_state = 128
0.19.408.272 I print_info: ssm_dt_rank = 64
0.19.408.273 I print_info: ssm_n_group = 16
0.19.408.273 I print_info: ssm_dt_b_c_rms = 0
0.19.408.273 I print_info: model type = 397B.A17B
0.19.408.274 I print_info: model params = 396.35 B
0.19.408.275 I print_info: general.name = Qwen_Qwen3.5 397B A17B
0.19.408.276 I print_info: vocab type = BPE
0.19.408.276 I print_info: n_vocab = 248320
0.19.408.276 I print_info: n_merges = 247587
0.19.408.277 I print_info: BOS token = 11 ','
0.19.408.277 I print_info: EOS token = 248046 '<|im_end|>'
0.19.408.278 I print_info: EOT token = 248046 '<|im_end|>'
0.19.408.278 I print_info: PAD token = 248044 '<|endoftext|>'
0.19.408.279 I print_info: LF token = 198 '─è'
0.19.408.279 I print_info: FIM PRE token = 248060 '<|fim_prefix|>'
0.19.408.280 I print_info: FIM SUF token = 248062 '<|fim_suffix|>'
0.19.408.280 I print_info: FIM MID token = 248061 '<|fim_middle|>'
0.19.408.280 I print_info: FIM PAD token = 248063 '<|fim_pad|>'
0.19.408.281 I print_info: FIM REP token = 248064 '<|repo_name|>'
0.19.408.281 I print_info: FIM SEP token = 248065 '<|file_sep|>'
0.19.408.282 I print_info: EOG token = 248044 '<|endoftext|>'
0.19.408.282 I print_info: EOG token = 248046 '<|im_end|>'
0.19.408.282 I print_info: EOG token = 248063 '<|fim_pad|>'
0.19.408.283 I print_info: EOG token = 248064 '<|repo_name|>'
0.19.408.283 I print_info: EOG token = 248065 '<|file_sep|>'
0.19.408.283 I print_info: max token length = 256
0.19.408.285 I load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
0.22.996.045 I load_tensors: offloading output layer to GPU
0.22.996.049 I load_tensors: offloading 59 repeating layers to GPU
0.22.996.049 I load_tensors: offloaded 61/61 layers to GPU
0.22.996.053 I load_tensors: CPU model buffer size = 545.62 MiB
0.22.996.057 I load_tensors: CUDA0 model buffer size = 25813.49 MiB
0.22.996.058 I load_tensors: CUDA1 model buffer size = 18773.76 MiB
0.22.996.058 I load_tensors: CPU model buffer size = 75470.77 MiB
0.22.996.059 I load_tensors: RPC0[gx10-bb36:50052] model buffer size = 108831.00 MiB
....................................................................................................
8.06.815.206 I common_init_result: added <|endoftext|> logit bias = -inf
8.06.815.212 I common_init_result: added <|im_end|> logit bias = -inf
8.06.815.213 I common_init_result: added <|fim_pad|> logit bias = -inf
8.06.815.214 I common_init_result: added <|repo_name|> logit bias = -inf
8.06.815.214 I common_init_result: added <|file_sep|> logit bias = -inf
8.06.822.523 I llama_context: constructing llama_context
8.06.822.544 I llama_context: n_seq_max = 4
8.06.822.545 I llama_context: n_ctx = 262144
8.06.822.545 I llama_context: n_ctx_seq = 262144
8.06.822.666 I llama_context: n_batch = 2048
8.06.822.718 I llama_context: n_ubatch = 512
8.06.822.719 I llama_context: causal_attn = 1
8.06.822.720 I llama_context: flash_attn = auto
8.06.822.720 I llama_context: kv_unified = true
8.06.822.724 I llama_context: freq_base = 10000000.0
8.06.822.725 I llama_context: freq_scale = 1
8.06.824.140 I llama_context: CUDA_Host output buffer size = 3.79 MiB
8.06.830.117 I llama_kv_cache: CUDA0 KV buffer size = 1024.00 MiB
8.06.902.458 I llama_kv_cache: CUDA1 KV buffer size = 3072.00 MiB
8.08.963.930 I llama_kv_cache: RPC0[gx10-bb36:50052] KV buffer size = 3584.00 MiB
8.08.983.661 I llama_kv_cache: size = 7680.00 MiB (262144 cells, 15 layers, 4/1 seqs), K (f16): 3840.00 MiB, V (f16): 3840.00 MiB
8.08.991.739 I llama_memory_recurrent: CUDA0 RS buffer size = 82.81 MiB
8.11.224.257 I llama_memory_recurrent: CUDA1 RS buffer size = 298.12 MiB
8.11.263.752 I llama_memory_recurrent: RPC0[gx10-bb36:50052] RS buffer size = 364.38 MiB
8.11.263.763 I llama_memory_recurrent: size = 745.31 MiB ( 4 cells, 60 layers, 4 seqs), R (f32): 25.31 MiB, S (f32): 720.00 MiB
8.11.263.774 I sched_reserve: reserving ...
8.11.298.538 I sched_reserve: Flash Attention was auto, set to enabled
8.12.209.851 I sched_reserve: RPC0[gx10-bb36:50052] compute buffer size = 840.02 MiB
8.12.209.860 I sched_reserve: CUDA0 compute buffer size = 2841.00 MiB
8.12.209.861 I sched_reserve: CUDA1 compute buffer size = 509.00 MiB
8.12.209.862 I sched_reserve: CPU compute buffer size = 528.02 MiB
8.12.209.864 I sched_reserve: graph nodes = 10284 (with bs=512), 5919 (with bs=1)
8.12.209.865 I sched_reserve: graph splits = 97 (with bs=512), 52 (with bs=1)
8.12.209.866 I sched_reserve: reserve took 946.09 ms, sched copies = 1
8.12.211.447 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
8.16.302.338 I srv load_model: initializing slots, n_slots = 4
8.19.105.359 W common_speculative_is_compat: the target context does not support partial sequence removal
8.19.137.062 W srv load_model: speculative decoding not supported by this context
8.19.137.071 I slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
8.19.137.734 I slot load_model: id 1 | task -1 | new slot, n_ctx = 262144
8.19.137.743 I slot load_model: id 2 | task -1 | new slot, n_ctx = 262144
8.19.137.747 I slot load_model: id 3 | task -1 | new slot, n_ctx = 262144
8.19.137.844 W srv load_model: prompt cache is enabled, size limit: 8192 MiB
8.19.137.844 W srv load_model: use `--cache-ram 0` to disable the prompt cache
8.19.137.845 W srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
8.19.175.519 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
'
8.19.194.944 I srv init: init: chat template, thinking = 1
8.19.194.972 I main: model loaded
8.19.194.974 I main: server is listening on http://0.0.0.0:8080
8.19.194.975 I main: starting the main loop...
8.19.194.987 I srv update_slots: all slots are idle
8.21.980.896 I srv operator(): operator(): cleaning up before exit...
8.21.984.092 I llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
8.21.984.095 I llama_memory_breakdown_print: | - RPC0 (gx10-bb36:50052) | 122502 = 3130 + (113619 = 108831 + 3948 + 840) + 5752 |
8.21.984.096 I llama_memory_breakdown_print: | - CUDA0 (RTX 5090) | 32606 = 795 + ( 29761 = 25813 + 1106 + 2840) + 2050 |
8.21.984.096 I llama_memory_breakdown_print: | - CUDA1 (RTX 4090) | 24563 = 0 + ( 22652 = 18773 + 3370 + 509) + 1910 |
8.21.984.097 I llama_memory_breakdown_print: | - Host | 76544 = 76016 + 0 + 528 |
Name and Version
version: 8241 (62b8143)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 5090 + RTX 4090 + RPC on an NVIDIA GB10-based machine (Asus GX10, similar to a DGX Spark)
The issue is not specific to this hardware or to the CUDA / RPC backends; it can be reproduced with any GPU.
Models
lmstudio-community/Qwen3.5-397B-A17B-GGUF:Q4_K_M
The issue is not model-related; it can be reproduced with any model.
Problem description & steps to reproduce
When a value higher than 4095 MiB is passed to --fit-target on Windows, an integer overflow occurs. As a result, you can't set a target higher than 4095 MiB on Windows.
Since the issue is fit related, it can't be reproduced with -fit off.
For example, running with --fit-target 4095,4096,4097 for my 3 backends produces the llama_params_fit_impl "free vs. target" lines shown in the log output during initialization.
As you can see, 4096 results in an actual target of 0 and 4097 results in an actual target of 1.
Interrupting the process confirms that it was actually not using the correct targets for those two devices: the memory breakdown printed on exit reports 795 MiB free on CUDA0 and 0 MiB free on CUDA1.
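The observed targets (4095, 0 and 1) match what a 32-bit unsigned multiplication would produce. Below is a minimal sketch that emulates this on any platform by using uint32_t in place of the 32-bit unsigned long found on Windows. The helper name effective_target_mib is hypothetical and does not exist in llama.cpp; it only illustrates the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a llama.cpp function): emulates converting a MiB
// value to bytes in 32-bit unsigned arithmetic and reports the target that
// actually survives the conversion, expressed in MiB again.
constexpr uint32_t effective_target_mib(uint32_t requested_mib) {
    // Unsigned arithmetic wraps modulo 2^32, so 4096 * 2^20 becomes 0.
    const uint32_t bytes = requested_mib * 1024u * 1024u;
    return bytes / (1024u * 1024u);
}

// Checks the three values passed via --fit-target 4095,4096,4097
// against the targets reported in the log.
inline void check_observed_targets() {
    assert(effective_target_mib(4095u) == 4095u); // still fits into 32 bits
    assert(effective_target_mib(4096u) == 0u);    // exactly 2^32 wraps to 0
    assert(effective_target_mib(4097u) == 1u);    // 2^32 + 2^20 wraps to 2^20
}
```

Calling check_observed_targets() passes without tripping an assertion, which is consistent with the "target of 4095", "target of 0" and "target of 1" lines in the log.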
As far as I can tell, this overflow is not a security issue.
I asked Claude Opus 4.6 high reasoning to look into it, and this is what it came up with:
--fit-target values >= 4096 silently overflow to 0 on Windows
Summary
The --fit-target (aka -fitt) parameter silently wraps around for values >= 4096 MiB on Windows (4096 becomes 0, 4097 becomes 1) due to a 32-bit integer overflow in the argument parsing code.
Reproduction
On Windows (any 64-bit version):
The margin is silently set to 0 instead of 4096 MiB. Values below 4096 work correctly.
Root Cause
In common/arg.cpp, the parsing code converts MiB to bytes using std::stoul(), which returns unsigned long. That type is 32-bit on Windows, even on 64-bit systems (this is a well-known MSVC/Windows ABI property). The multiplication 4096 * 1024 * 1024 = 2^32 overflows a 32-bit unsigned integer to exactly 0. Values below 4096 produce results < 2^32 and work fine.
The storage type std::vector<size_t> (in common/common.h:389) is 64-bit and can hold the value; the overflow happens before the assignment.
Suggested Fix
Replace std::stoul with std::stoull (returns unsigned long long, which is guaranteed to be at least 64 bits wide on all platforms).
Environment
unsigned long is 32-bit on Windows per the MSVC ABI
On platforms where unsigned long is 64-bit, the overflow does not occur
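The suggested fix could look roughly like the sketch below. It is not the actual code from common/arg.cpp: the function name fit_target_mib_to_bytes and the extra range check are assumptions made for illustration. The only part taken from the analysis above is parsing with std::stoull and doing the multiplication in 64-bit arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the actual llama.cpp code): converts a MiB value
// given as text into bytes using 64-bit arithmetic on every platform.
inline uint64_t fit_target_mib_to_bytes(const std::string & mib_text) {
    // std::stoull returns unsigned long long, which is at least 64 bits wide,
    // unlike unsigned long, which is 32 bits wide on Windows.
    const unsigned long long mib = std::stoull(mib_text);
    constexpr unsigned long long bytes_per_mib = 1024ULL * 1024ULL;

    // Assumption: reject values whose byte count would not fit into 64 bits
    // instead of letting them wrap silently.
    if (mib > UINT64_MAX / bytes_per_mib) {
        throw std::out_of_range("fit target too large: " + mib_text);
    }
    return mib * bytes_per_mib;
}
```

With this helper, "4096" yields 4294967296 bytes (2^32) instead of 0, so the value survives until it is stored in the 64-bit size_t vector.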
No response
Relevant log output
Logs