kusa@framework:~/llama-server$ 0.00.165.383 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.165.456 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.165.487 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.165.533 I srv init: running without SSL
0.00.165.574 I srv init: using 31 threads for HTTP server
0.00.166.565 I srv load_models: Loaded 0 cached model presets
0.00.166.975 I srv load_models: Loaded 5 custom model presets from /home/kusa/llama-server/config.ini
0.00.167.097 I srv operator(): Available models (5) (*: custom preset)
0.00.167.098 I srv operator(): * GLM-5.1-UD-IQ2_M
0.00.167.099 I srv operator(): * MiMo-V2.5-UD-Q5_K_XL
0.00.167.099 I srv operator(): * Qwen3.5-397B-A17B-UD-IQ4_XS
0.00.167.099 I srv operator(): * Qwen3.5-397B-A17B-UD-Q4_K_XL
0.00.167.099 I srv operator(): * default
0.00.167.241 W srv llama_server: -----------------
0.00.167.241 W srv llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.167.242 W srv llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.167.242 W srv llama_server: -----------------
0.00.167.245 I srv llama_server: starting router server, no model will be loaded in this process
0.00.167.246 I srv start: binding port with default address family
0.00.168.438 I srv llama_server: router server is listening on http://0.0.0.0:8080
0.00.168.442 W srv llama_server: NOTE: router mode is experimental
0.00.168.442 W srv llama_server: it is not recommended to use this mode in untrusted environments
0.04.667.821 I srv load: spawning server instance with name=Qwen3.5-397B-A17B-UD-IQ4_XS on port 59145
0.04.667.860 I srv load: spawning server instance with args:
0.04.667.860 I srv load: /home/kusa/llama.cpp/build/bin/llama-server
0.04.667.861 I srv load: --host
0.04.667.861 I srv load: 127.0.0.1
0.04.667.861 I srv load: --jinja
0.04.667.861 I srv load: --metrics
0.04.667.861 I srv load: --no-mmap
0.04.667.862 I srv load: --no-mmproj-auto
0.04.667.862 I srv load: --port
0.04.667.862 I srv load: 59145
0.04.667.862 I srv load: --rpc
0.04.667.862 I srv load: 10.0.69.2:50052
0.04.667.863 I srv load: --spec-draft-n-max
0.04.667.863 I srv load: 3
0.04.667.863 I srv load: --spec-ngram-mod-n-match
0.04.667.863 I srv load: 24
0.04.667.863 I srv load: --spec-ngram-mod-n-max
0.04.667.864 I srv load: 64
0.04.667.864 I srv load: --spec-ngram-mod-n-min
0.04.667.864 I srv load: 48
0.04.667.864 I srv load: --spec-type
0.04.667.864 I srv load: draft-mtp
0.04.667.865 I srv load: --webui-config-file
0.04.667.865 I srv load: /home/kusa/llama-server/webui-config.json
0.04.667.865 I srv load: --webui-mcp-proxy
0.04.667.865 I srv load: --alias
0.04.667.865 I srv load: Qwen3.5-397B-A17B-UD-IQ4_XS
0.04.667.866 I srv load: --batch-size
0.04.667.866 I srv load: 2048
0.04.667.866 I srv load: --ctx-size
0.04.667.866 I srv load: 262144
0.04.667.866 I srv load: --cache-ram
0.04.667.867 I srv load: 2048
0.04.667.867 I srv load: --cache-type-k
0.04.667.867 I srv load: q8_0
0.04.667.867 I srv load: --cache-type-v
0.04.667.867 I srv load: q8_0
0.04.667.868 I srv load: --swa-checkpoints
0.04.667.868 I srv load: 100
0.04.667.868 I srv load: --flash-attn
0.04.667.868 I srv load: 1
0.04.667.868 I srv load: --log-verbosity
0.04.667.869 I srv load: 4
0.04.667.869 I srv load: --model
0.04.667.869 I srv load: /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf
0.04.667.869 I srv load: --n-gpu-layers
0.04.667.869 I srv load: all
0.04.667.870 I srv load: --parallel
0.04.667.870 I srv load: 1
0.04.667.870 I srv load: --ubatch-size
0.04.667.870 I srv load: 512
[59145] 0.00.035.963 I common_params_print_info: build 9388 (d7be46189) with GNU 16.1.1 for Linux x86_64
[59145] 0.00.035.966 I log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
[59145] 0.00.035.966 I device_info:
[59145] 0.00.036.025 I - ROCm0 : AMD Radeon 8060S Graphics (122880 MiB, 125717 MiB free)
[59145] 0.00.036.029 I - CPU : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (128077 MiB, 128077 MiB free)
[59145] 0.00.036.783 I - RPC0 : 10.0.69.2:50052 (122880 MiB, 125458 MiB free)
[59145] 0.00.036.829 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[59145] 0.00.036.894 I srv init: running without SSL
[59145] 0.00.036.925 I srv init: using 31 threads for HTTP server
[59145] 0.00.037.009 W srv llama_server: -----------------
[59145] 0.00.037.011 W srv llama_server: CORS proxy is enabled, do not expose server to untrusted environments
[59145] 0.00.037.011 W srv llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
[59145] 0.00.037.011 W srv llama_server: -----------------
[59145] 0.00.037.017 I srv start: binding port with default address family
[59145] 0.00.038.160 I srv llama_server: loading model
[59145] 0.00.038.168 I srv load_model: loading model '/home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf'
[59145] 0.00.433.088 I common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
[59145] 0.00.433.093 I common_memory_breakdown_print: | - RPC0 (10.0.69.2:50052) | 122880 = 125464 + (92584 = 92584 + 0 + 0) + -95169 |
[59145] 0.00.433.094 I common_memory_breakdown_print: | - ROCm0 (8060S Graphics) | 122880 = 125408 + (93398 = 92054 + 512 + 832) + -95927 |
[59145] 0.00.433.094 I common_memory_breakdown_print: | - Host | 1558 = 1030 + 0 + 528 |
[59145] 0.00.472.923 I srv load_model: [spec] estimated memory usage of MTP context is 1344.02 MiB
[59145] 0.00.472.944 I common_init_result: fitting params to device memory ...
[59145] 0.00.472.944 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[59145] 0.00.472.950 I common_params_fit_impl: getting device memory data for initial parameters:
[59145] 0.00.907.869 I common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
[59145] 0.00.907.874 I common_memory_breakdown_print: | - RPC0 (10.0.69.2:50052) | 122880 = 124899 + (98989 = 95560 + 2573 + 856) + -101009 |
[59145] 0.00.907.875 I common_memory_breakdown_print: | - ROCm0 (8060S Graphics) | 122880 = 125013 + (91840 = 89079 + 2251 + 509) + -93974 |
[59145] 0.00.907.875 I common_memory_breakdown_print: | - Host | 1558 = 1030 + 0 + 528 |
[59145] 0.00.944.258 I common_params_fit_impl: projected memory use with initial parameters [MiB]:
[59145] 0.00.944.266 I common_params_fit_impl: - RPC0 (10.0.69.2:50052) : 122880 total, 98989 used, 25909 free vs. target of 2368
[59145] 0.00.944.266 I common_params_fit_impl: - ROCm0 (AMD Radeon 8060S Graphics): 122880 total, 91840 used, 33173 free vs. target of 1024
[59145] 0.00.944.266 I common_params_fit_impl: projected to use 190830 MiB of device memory vs. 249913 MiB of free device memory
[59145] 0.00.944.266 I common_params_fit_impl: targets for free memory can be met on all devices, no changes needed
[59145] 0.00.944.268 I common_fit_params: successfully fit params to free device memory
[59145] 0.00.944.273 I common_fit_params: fitting params to free memory took 0.47 seconds
[59145] 0.00.973.803 I llama_model_loader: additional 4 GGUFs metadata loaded.
[59145] 0.00.973.809 I llama_model_loader: loaded meta data with 56 key-value pairs and 1118 tensors from /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf (version GGUF V3 (latest))
[59145] 0.00.973.825 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[59145] 0.00.973.828 I llama_model_loader: - kv 0: general.architecture str = qwen35moe
[59145] 0.00.973.828 I llama_model_loader: - kv 1: general.type str = model
[59145] 0.00.973.829 I llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
[59145] 0.00.973.834 I llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
[59145] 0.00.973.835 I llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
[59145] 0.00.973.835 I llama_model_loader: - kv 5: general.name str = Qwen3.5-397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv 6: general.basename str = Qwen3.5-397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv 7: general.quantized_by str = Unsloth
[59145] 0.00.973.836 I llama_model_loader: - kv 8: general.size_label str = 397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv 9: general.license str = apache-2.0
[59145] 0.00.973.837 I llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
[59145] 0.00.973.838 I llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
[59145] 0.00.973.838 I llama_model_loader: - kv 12: general.base_model.count u32 = 1
[59145] 0.00.973.839 I llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 397B A17B
[59145] 0.00.973.839 I llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
[59145] 0.00.973.839 I llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
[59145] 0.00.973.852 I llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
[59145] 0.00.973.852 I llama_model_loader: - kv 17: qwen35moe.block_count u32 = 61
[59145] 0.00.973.853 I llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
[59145] 0.00.973.853 I llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 4096
[59145] 0.00.973.854 I llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 32
[59145] 0.00.973.854 I llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
[59145] 0.00.973.855 I llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
[59145] 0.00.973.857 I llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
[59145] 0.00.973.858 I llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
[59145] 0.00.973.858 I llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 512
[59145] 0.00.973.859 I llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 10
[59145] 0.00.973.859 I llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
[59145] 0.00.973.859 I llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
[59145] 0.00.973.860 I llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 1024
[59145] 0.00.973.860 I llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 1024
[59145] 0.00.973.860 I llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
[59145] 0.00.973.860 I llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
[59145] 0.00.973.860 I llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
[59145] 0.00.973.861 I llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 64
[59145] 0.00.973.861 I llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 8192
[59145] 0.00.973.861 I llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
[59145] 0.00.973.862 I llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
[59145] 0.00.973.862 I llama_model_loader: - kv 38: qwen35moe.nextn_predict_layers u32 = 1
[59145] 0.00.973.862 I llama_model_loader: - kv 39: tokenizer.ggml.model str = gpt2
[59145] 0.00.973.862 I llama_model_loader: - kv 40: tokenizer.ggml.pre str = qwen35
[59145] 0.00.984.710 I llama_model_loader: - kv 41: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
[59145] 0.00.987.765 I llama_model_loader: - kv 42: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[59145] 0.00.998.388 I llama_model_loader: - kv 43: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[59145] 0.00.998.390 I llama_model_loader: - kv 44: tokenizer.ggml.eos_token_id u32 = 248046
[59145] 0.00.998.390 I llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 248055
[59145] 0.00.998.390 I llama_model_loader: - kv 46: general.quantization_version u32 = 2
[59145] 0.00.998.391 I llama_model_loader: - kv 47: general.file_type u32 = 30
[59145] 0.00.998.392 I llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-397B-A17B-GGUF/imatrix_unslot...
[59145] 0.00.998.392 I llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-397B-A17B...
[59145] 0.00.998.392 I llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 765
[59145] 0.00.998.392 I llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 76
[59145] 0.00.998.393 I llama_model_loader: - kv 52: split.no u16 = 0
[59145] 0.00.998.393 I llama_model_loader: - kv 53: split.tensors.count i32 = 1118
[59145] 0.00.998.393 I llama_model_loader: - kv 54: split.count u16 = 5
[59145] 0.00.998.395 I llama_model_loader: - kv 55: tokenizer.chat_template str = {%- set image_count = namespace(value...
[59145] 0.00.998.396 I llama_model_loader: - type f32: 548 tensors
[59145] 0.00.998.396 I llama_model_loader: - type q8_0: 384 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q3_K: 2 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q4_K: 1 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q6_K: 3 tensors
[59145] 0.00.998.397 I llama_model_loader: - type iq3_s: 118 tensors
[59145] 0.00.998.398 I llama_model_loader: - type iq4_xs: 60 tensors
[59145] 0.00.998.398 I llama_model_loader: - type bf16: 2 tensors
[59145] 0.00.998.399 I print_info: file format = GGUF V3 (latest)
[59145] 0.00.998.400 I print_info: file type = IQ4_XS - 4.25 bpw
[59145] 0.00.998.403 I print_info: file size = 181.32 GiB (3.87 BPW)
[59145] 0.01.000.369 I llama_prepare_model_devices: using device RPC0 (10.0.69.2:50052) (unknown id) - 125295 MiB free
[59145] 0.01.000.383 I llama_prepare_model_devices: using device ROCm0 (AMD Radeon 8060S Graphics) (0000:c2:00.0) - 125361 MiB free
[59145] 0.01.067.835 I load: 0 unused tokens
[59145] 0.01.091.159 I load: printing all EOG tokens:
[59145] 0.01.091.161 I load: - 248044 ('<|endoftext|>')
[59145] 0.01.091.162 I load: - 248046 ('<|im_end|>')
[59145] 0.01.091.162 I load: - 248063 ('<|fim_pad|>')
[59145] 0.01.091.162 I load: - 248064 ('<|repo_name|>')
[59145] 0.01.091.162 I load: - 248065 ('<|file_sep|>')
[59145] 0.01.091.344 I load: special tokens cache size = 33
[59145] 0.01.134.425 I load: token to piece cache size = 1.7581 MB
[59145] 0.01.134.441 I print_info: arch = qwen35moe
[59145] 0.01.134.442 I print_info: vocab_only = 0
[59145] 0.01.134.442 I print_info: no_alloc = 0
[59145] 0.01.134.442 I print_info: n_ctx_train = 262144
[59145] 0.01.134.442 I print_info: n_embd = 4096
[59145] 0.01.134.443 I print_info: n_embd_inp = 4096
[59145] 0.01.134.443 I print_info: n_layer = 61
[59145] 0.01.134.452 I print_info: n_head = 32
[59145] 0.01.134.453 I print_info: n_head_kv = 2
[59145] 0.01.134.453 I print_info: n_rot = 64
[59145] 0.01.134.453 I print_info: n_swa = 0
[59145] 0.01.134.454 I print_info: is_swa_any = 0
[59145] 0.01.134.454 I print_info: n_embd_head_k = 256
[59145] 0.01.134.454 I print_info: n_embd_head_v = 256
[59145] 0.01.134.455 I print_info: n_gqa = 16
[59145] 0.01.134.456 I print_info: n_embd_k_gqa = 512
[59145] 0.01.134.458 I print_info: n_embd_v_gqa = 512
[59145] 0.01.134.458 I print_info: f_norm_eps = 0.0e+00
[59145] 0.01.134.460 I print_info: f_norm_rms_eps = 1.0e-06
[59145] 0.01.134.460 I print_info: f_clamp_kqv = 0.0e+00
[59145] 0.01.134.460 I print_info: f_max_alibi_bias = 0.0e+00
[59145] 0.01.134.460 I print_info: f_logit_scale = 0.0e+00
[59145] 0.01.134.460 I print_info: f_attn_scale = 0.0e+00
[59145] 0.01.134.461 I print_info: f_attn_value_scale = 0.0000
[59145] 0.01.134.462 I print_info: n_ff = 0
[59145] 0.01.134.462 I print_info: n_expert = 512
[59145] 0.01.134.462 I print_info: n_expert_used = 10
[59145] 0.01.134.462 I print_info: n_expert_groups = 0
[59145] 0.01.134.462 I print_info: n_group_used = 0
[59145] 0.01.134.462 I print_info: causal attn = 1
[59145] 0.01.134.462 I print_info: pooling type = -1
[59145] 0.01.134.463 I print_info: rope type = 40
[59145] 0.01.134.463 I print_info: rope scaling = linear
[59145] 0.01.134.464 I print_info: freq_base_train = 10000000.0
[59145] 0.01.134.464 I print_info: freq_scale_train = 1
[59145] 0.01.134.464 I print_info: n_ctx_orig_yarn = 262144
[59145] 0.01.134.465 I print_info: rope_yarn_log_mul = 0.0000
[59145] 0.01.134.465 I print_info: rope_finetuned = unknown
[59145] 0.01.134.465 I print_info: mrope sections = [11, 11, 10, 0]
[59145] 0.01.134.465 I print_info: ssm_d_conv = 4
[59145] 0.01.134.465 I print_info: ssm_d_inner = 8192
[59145] 0.01.134.465 I print_info: ssm_d_state = 128
[59145] 0.01.134.466 I print_info: ssm_dt_rank = 64
[59145] 0.01.134.466 I print_info: ssm_n_group = 16
[59145] 0.01.134.466 I print_info: ssm_dt_b_c_rms = 0
[59145] 0.01.134.466 I print_info: model type = 397B.A17B
[59145] 0.01.134.467 I print_info: model params = 402.94 B
[59145] 0.01.134.467 I print_info: general.name = Qwen3.5-397B-A17B
[59145] 0.01.134.469 I print_info: vocab type = BPE
[59145] 0.01.134.469 I print_info: n_vocab = 248320
[59145] 0.01.134.469 I print_info: n_merges = 247587
[59145] 0.01.134.469 I print_info: BOS token = 11 ','
[59145] 0.01.134.470 I print_info: EOS token = 248046 '<|im_end|>'
[59145] 0.01.134.470 I print_info: EOT token = 248046 '<|im_end|>'
[59145] 0.01.134.470 I print_info: PAD token = 248055 '<|vision_pad|>'
[59145] 0.01.134.470 I print_info: LF token = 198 'Ċ'
[59145] 0.01.134.470 I print_info: FIM PRE token = 248060 '<|fim_prefix|>'
[59145] 0.01.134.470 I print_info: FIM SUF token = 248062 '<|fim_suffix|>'
[59145] 0.01.134.471 I print_info: FIM MID token = 248061 '<|fim_middle|>'
[59145] 0.01.134.471 I print_info: FIM PAD token = 248063 '<|fim_pad|>'
[59145] 0.01.134.471 I print_info: FIM REP token = 248064 '<|repo_name|>'
[59145] 0.01.134.471 I print_info: FIM SEP token = 248065 '<|file_sep|>'
[59145] 0.01.134.471 I print_info: EOG token = 248044 '<|endoftext|>'
[59145] 0.01.134.471 I print_info: EOG token = 248046 '<|im_end|>'
[59145] 0.01.134.472 I print_info: EOG token = 248063 '<|fim_pad|>'
[59145] 0.01.134.472 I print_info: EOG token = 248064 '<|repo_name|>'
[59145] 0.01.134.472 I print_info: EOG token = 248065 '<|file_sep|>'
[59145] 0.01.134.472 I print_info: max token length = 256
[59145] 0.01.134.473 I load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[59145] 0.09.008.161 I load_tensors: offloading output layer to GPU
[59145] 0.09.008.174 I load_tensors: offloading 60 repeating layers to GPU
[59145] 0.09.008.175 I load_tensors: offloaded 62/62 layers to GPU
[59145] 0.09.008.186 I load_tensors: ROCm0 model buffer size = 92054.89 MiB
[59145] 0.09.008.188 I load_tensors: ROCm_Host model buffer size = 1030.62 MiB
[59145] 0.09.008.189 I load_tensors: RPC0[10.0.69.2:50052] model buffer size = 92584.99 MiB
Name and Version
kusa@framework:~/llama-server$ llama-server --version
version: 9389 (30af6e2)
built with GNU 16.1.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
I have two AMD Ryzen AI Max+ 395 128 GB devices (Framework Desktop + Asus Flow Z13) networked together with Ethernet over Thunderbolt, using RPC mode for inference. In newer builds, the host device's memory isn't taken into account when allocating tensors; only the RPC device's memory is.
The issue is first observed with commit 30af6e2 (PR #23007).
It is not observed with the immediately preceding commit, d7be461.
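To compare how the two builds plan device memory, the per-device numbers can be pulled out of the `common_params_fit_impl` lines of each log. The script below is only a rough sketch: the regex is written against the line format in the log output of this report, the name `projected_usage` and the `SAMPLE` string are made up for illustration, and none of it is part of llama.cpp.

```python
import re

# Assumed line format (copied from the log in this report):
#   ... common_params_fit_impl: - RPC0 (10.0.69.2:50052) : 122880 total, 98989 used, 25909 free vs. target of 2368
FIT_LINE = re.compile(
    r"common_params_fit_impl:\s+-\s+(?P<device>.+?)\s*:\s+"
    r"(?P<total>\d+) total,\s+(?P<used>\d+) used,\s+(?P<free>-?\d+) free"
    r" vs\. target of\s+(?P<target>\d+)"
)


def projected_usage(log_text):
    """Return {device name: {"total", "used", "free", "target"}} with values in MiB."""
    usage = {}
    for m in FIT_LINE.finditer(log_text):
        usage[m["device"]] = {
            key: int(m[key]) for key in ("total", "used", "free", "target")
        }
    return usage


# Two lines taken verbatim from the log in this report.
SAMPLE = """\
[59145] 0.00.944.266 I common_params_fit_impl: - RPC0 (10.0.69.2:50052) : 122880 total, 98989 used, 25909 free vs. target of 2368
[59145] 0.00.944.266 I common_params_fit_impl: - ROCm0 (AMD Radeon 8060S Graphics): 122880 total, 91840 used, 33173 free vs. target of 1024
"""

for device, numbers in projected_usage(SAMPLE).items():
    print(f"{device}: {numbers['used']} MiB used of {numbers['total']} MiB")
```

Running the same function over a log from each commit and comparing the resulting dictionaries should show whether the local ROCm device is still listed, and how much of the model is projected onto each device.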
First Bad Commit
30af6e2
Relevant log output
30af6e2
d7be461