
Misc. bug: Model load not allocating tensors to Strix Halo host memory when using RPC with another Strix Halo device #23858

@Illuminati-CRAZ

Name and Version

kusa@framework:~/llama-server$ llama-server --version
version: 9389 (30af6e2)
built with GNU 16.1.1 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

kusa@framework:~/llama.cpp$ cat build-rocm.sh
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU-TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_RPC=ON \
    -DGGML_HIP_RCCL=ON \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    && cmake --build build --config Release -- -j 32

kusa@framework:~/llama-server$ cat start-disown.sh
pkill -f llama-server && sleep 1
llama-server --host 0.0.0.0 --port 8080 --models-preset /home/kusa/llama-server/config.ini --webui-config-file /home/kusa/llama-server/webui-config.json --models-max 1 --webui-mcp-proxy --rpc 10.0.69.2:50052 & disown

kusa@flow:~/rpc-server$ cat start-disown.sh
pkill -f rpc-server && sleep 1
rpc-server --host 10.0.69.2 --cache --threads 32 & disown

Problem description & steps to reproduce

I have two AMD Ryzen AI Max+ 395 devices with 128 GB each (a Framework Desktop and an Asus Flow Z13), connected by Ethernet over Thunderbolt, and I run inference across them in RPC mode. In newer builds, the memory of the host's local GPU (ROCm0) is no longer taken into account when tensors are allocated. Only the RPC device (RPC0) is used, so the whole model is assigned to it and the load fails because the buffer does not fit.
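A quick way to see the difference in the logs below is to list which devices appear in the `common_memory_breakdown_print` rows. The following is only a rough sketch: the helper name `devices_in_breakdown` and the regular expression are assumptions made for illustration, not part of llama.cpp, and the sample rows are copied from the two logs in this report.

```python
import re

# Matches breakdown rows such as
# "... common_memory_breakdown_print: |   - ROCm0 (8060S Graphics) | 122880 = ..."
# The header row has no "- " after the first "|", so it is skipped.
ROW = re.compile(r"common_memory_breakdown_print: \|\s+- (\S+)")


def devices_in_breakdown(log_text):
    """Return the device labels seen in breakdown rows, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Rows copied from the 30af6e2 log (ROCm0 is absent).
log_30af6e2 = """\
[41303] 0.00.467.712 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 125402 + (185983 = 184639 +     512 +     832) +     -188506 |
[41303] 0.00.467.712 I common_memory_breakdown_print: |   - Host                   |                      1558 =   1030 +       0 +     528                |
"""

# Rows copied from the d7be461 log (ROCm0 is present).
log_d7be461 = """\
[59145] 0.00.433.093 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 125464 + (92584 = 92584 +       0 +       0) +      -95169 |
[59145] 0.00.433.094 I common_memory_breakdown_print: |   - ROCm0 (8060S Graphics) | 122880 = 125408 + (93398 = 92054 +     512 +     832) +      -95927 |
[59145] 0.00.433.094 I common_memory_breakdown_print: |   - Host                   |                     1558 =  1030 +       0 +     528                |
"""

print(devices_in_breakdown(log_30af6e2))  # ['RPC0', 'Host']
print(devices_in_breakdown(log_d7be461))  # ['RPC0', 'ROCm0', 'Host']
```

Running the same helper over a full captured log should make it easy to confirm whether ROCm0 is being picked up on a given build.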

The issue first appears with commit 30af6e2 (PR #23007).

It does not occur with the immediately preceding commit, d7be461.
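The numbers reported by `common_params_fit_impl` in the logs below are consistent with ROCm0 simply being left out of the fit. As a sanity check, here is a small sketch of the arithmetic. It sums free memory and targets across devices, which is a simplifying assumption (the real fit logic works per device and may differ), and `deficit_mib` is a hypothetical helper, not a llama.cpp function. All figures are in MiB and copied from the logs.

```python
def deficit_mib(projected, devices):
    """MiB that would have to be shed for the projected use to fit.

    devices: list of (free_mib, target_free_mib) pairs, one per device.
    Assumption: free memory and targets are pooled across devices, which is
    a simplification of whatever the real per-device fit does.
    """
    free = sum(free_mib for free_mib, _ in devices)
    target = sum(target_mib for _, target_mib in devices)
    return max(0, projected + target - free)


# 30af6e2: only RPC0 is considered (124460 free, target 1024).
bad = deficit_mib(190321, [(124460, 1024)])

# d7be461: RPC0 (124899 free, target 2368) and ROCm0 (125013 free, target 1024).
good = deficit_mib(190830, [(124899, 2368), (125013, 1024)])

print(bad)   # 66885, matching "need to reduce device memory by 66885 MiB"
print(good)  # 0, matching "targets for free memory can be met on all devices"
```

So the projected model size is almost unchanged between the two builds (190321 vs. 190830 MiB); what changes is that about half of the available device memory is no longer counted.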

First Bad Commit

30af6e2

Relevant log output

Log with 30af6e2:
kusa@framework:~/llama-server$ 0.00.140.593 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.140.642 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.140.646 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.140.711 I srv          init: running without SSL
0.00.140.738 I srv          init: using 31 threads for HTTP server
0.00.141.518 I srv   load_models: Loaded 0 cached model presets
0.00.141.929 I srv   load_models: Loaded 5 custom model presets from /home/kusa/llama-server/config.ini
0.00.142.058 I srv    operator(): Available models (5) (*: custom preset)
0.00.142.059 I srv    operator():   * GLM-5.1-UD-IQ2_M
0.00.142.059 I srv    operator():   * MiMo-V2.5-UD-Q5_K_XL
0.00.142.060 I srv    operator():   * Qwen3.5-397B-A17B-UD-IQ4_XS
0.00.142.060 I srv    operator():   * Qwen3.5-397B-A17B-UD-Q4_K_XL
0.00.142.060 I srv    operator():   * default
0.00.142.205 W srv  llama_server: -----------------
0.00.142.206 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.142.206 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.142.206 W srv  llama_server: -----------------
0.00.142.209 I srv  llama_server: starting router server, no model will be loaded in this process
0.00.142.211 I srv         start: binding port with default address family
0.00.143.461 I srv  llama_server: router server is listening on http://0.0.0.0:8080
0.00.143.466 W srv  llama_server: NOTE: router mode is experimental
0.00.143.466 W srv  llama_server:       it is not recommended to use this mode in untrusted environments
0.11.036.201 I srv          load: spawning server instance with name=Qwen3.5-397B-A17B-UD-IQ4_XS on port 41303
0.11.036.244 I srv          load: spawning server instance with args:
0.11.036.244 I srv          load:   /home/kusa/llama.cpp/build/bin/llama-server
0.11.036.245 I srv          load:   --host
0.11.036.247 I srv          load:   127.0.0.1
0.11.036.247 I srv          load:   --jinja
0.11.036.247 I srv          load:   --metrics
0.11.036.247 I srv          load:   --no-mmap
0.11.036.248 I srv          load:   --no-mmproj-auto
0.11.036.248 I srv          load:   --port
0.11.036.248 I srv          load:   41303
0.11.036.248 I srv          load:   --rpc
0.11.036.248 I srv          load:   10.0.69.2:50052
0.11.036.249 I srv          load:   --spec-draft-n-max
0.11.036.249 I srv          load:   3
0.11.036.249 I srv          load:   --spec-ngram-mod-n-match
0.11.036.249 I srv          load:   24
0.11.036.249 I srv          load:   --spec-ngram-mod-n-max
0.11.036.250 I srv          load:   64
0.11.036.250 I srv          load:   --spec-ngram-mod-n-min
0.11.036.250 I srv          load:   48
0.11.036.250 I srv          load:   --spec-type
0.11.036.250 I srv          load:   draft-mtp
0.11.036.251 I srv          load:   --webui-config-file
0.11.036.251 I srv          load:   /home/kusa/llama-server/webui-config.json
0.11.036.251 I srv          load:   --webui-mcp-proxy
0.11.036.251 I srv          load:   --alias
0.11.036.252 I srv          load:   Qwen3.5-397B-A17B-UD-IQ4_XS
0.11.036.252 I srv          load:   --batch-size
0.11.036.252 I srv          load:   2048
0.11.036.252 I srv          load:   --ctx-size
0.11.036.252 I srv          load:   262144
0.11.036.253 I srv          load:   --cache-ram
0.11.036.253 I srv          load:   2048
0.11.036.253 I srv          load:   --cache-type-k
0.11.036.253 I srv          load:   q8_0
0.11.036.254 I srv          load:   --cache-type-v
0.11.036.254 I srv          load:   q8_0
0.11.036.254 I srv          load:   --swa-checkpoints
0.11.036.254 I srv          load:   100
0.11.036.254 I srv          load:   --flash-attn
0.11.036.255 I srv          load:   1
0.11.036.255 I srv          load:   --log-verbosity
0.11.036.255 I srv          load:   4
0.11.036.256 I srv          load:   --model
0.11.036.256 I srv          load:   /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf
0.11.036.256 I srv          load:   --n-gpu-layers
0.11.036.256 I srv          load:   all
0.11.036.257 I srv          load:   --parallel
0.11.036.257 I srv          load:   1
0.11.036.257 I srv          load:   --ubatch-size
0.11.036.257 I srv          load:   512
[41303] 0.00.153.386 I common_params_print_info: build 9389 (30af6e2b9) with GNU 16.1.1 for Linux x86_64
[41303] 0.00.153.389 I log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
[41303] 0.00.153.390 I device_info:
[41303] 0.00.153.447 I   - ROCm0   : AMD Radeon 8060S Graphics (122880 MiB, 125693 MiB free)
[41303] 0.00.153.449 I   - CPU     : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (128077 MiB, 128077 MiB free)
[41303] 0.00.154.397 I   - RPC0    : 10.0.69.2:50052 (122880 MiB, 125385 MiB free)
[41303] 0.00.154.446 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[41303] 0.00.154.517 I srv          init: running without SSL
[41303] 0.00.154.547 I srv          init: using 31 threads for HTTP server
[41303] 0.00.154.635 W srv  llama_server: -----------------
[41303] 0.00.154.636 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
[41303] 0.00.154.636 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
[41303] 0.00.154.636 W srv  llama_server: -----------------
[41303] 0.00.154.641 I srv         start: binding port with default address family
[41303] 0.00.155.778 I srv  llama_server: loading model
[41303] 0.00.155.787 I srv    load_model: loading model '/home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf'
[41303] 0.00.467.696 I common_memory_breakdown_print: | memory breakdown [MiB]     |  total     free      self    model   context   compute    unaccounted |
[41303] 0.00.467.712 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 125402 + (185983 = 184639 +     512 +     832) +     -188506 |
[41303] 0.00.467.712 I common_memory_breakdown_print: |   - Host                   |                      1558 =   1030 +       0 +     528                |
[41303] 0.00.503.929 I srv    load_model: [spec] estimated memory usage of MTP context is 1344.02 MiB
[41303] 0.00.503.950 I common_init_result: fitting params to device memory ...
[41303] 0.00.503.951 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[41303] 0.00.503.957 I common_params_fit_impl: getting device memory data for initial parameters:
[41303] 0.01.033.801 I common_memory_breakdown_print: | memory breakdown [MiB]     |  total     free      self    model   context   compute    unaccounted |
[41303] 0.01.033.806 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 124460 + (190321 = 184639 +    4825 +     856) +     -191901 |
[41303] 0.01.033.806 I common_memory_breakdown_print: |   - Host                   |                      1558 =   1030 +       0 +     528                |
[41303] 0.01.073.330 I common_params_fit_impl: projected to use 190321 MiB of device memory vs. 124460 MiB of free device memory
[41303] 0.01.073.336 I common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 66885 MiB
[41303] 0.01.073.337 I common_params_fit_impl: context size set by user to 262144 -> no change
[41303] 0.01.073.429 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to -2, abort
[41303] 0.01.073.443 I common_fit_params: fitting params to free memory took 0.57 seconds
[41303] 0.01.114.934 I llama_model_loader: additional 4 GGUFs metadata loaded.
[41303] 0.01.114.938 I llama_model_loader: loaded meta data with 56 key-value pairs and 1118 tensors from /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf (version GGUF V3 (latest))
[41303] 0.01.114.960 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[41303] 0.01.114.966 I llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
[41303] 0.01.114.967 I llama_model_loader: - kv   1:                               general.type str              = model
[41303] 0.01.114.968 I llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
[41303] 0.01.114.972 I llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
[41303] 0.01.114.972 I llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
[41303] 0.01.114.973 I llama_model_loader: - kv   5:                               general.name str              = Qwen3.5-397B-A17B
[41303] 0.01.114.973 I llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5-397B-A17B
[41303] 0.01.114.974 I llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
[41303] 0.01.114.974 I llama_model_loader: - kv   8:                         general.size_label str              = 397B-A17B
[41303] 0.01.114.974 I llama_model_loader: - kv   9:                            general.license str              = apache-2.0
[41303] 0.01.114.975 I llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
[41303] 0.01.114.976 I llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
[41303] 0.01.114.977 I llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
[41303] 0.01.114.978 I llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.5 397B A17B
[41303] 0.01.114.978 I llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
[41303] 0.01.114.979 I llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-3...
[41303] 0.01.114.991 I llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
[41303] 0.01.114.992 I llama_model_loader: - kv  17:                      qwen35moe.block_count u32              = 61
[41303] 0.01.114.992 I llama_model_loader: - kv  18:                   qwen35moe.context_length u32              = 262144
[41303] 0.01.114.993 I llama_model_loader: - kv  19:                 qwen35moe.embedding_length u32              = 4096
[41303] 0.01.114.993 I llama_model_loader: - kv  20:             qwen35moe.attention.head_count u32              = 32
[41303] 0.01.114.994 I llama_model_loader: - kv  21:          qwen35moe.attention.head_count_kv u32              = 2
[41303] 0.01.114.995 I llama_model_loader: - kv  22:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
[41303] 0.01.114.997 I llama_model_loader: - kv  23:                   qwen35moe.rope.freq_base f32              = 10000000.000000
[41303] 0.01.114.998 I llama_model_loader: - kv  24: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
[41303] 0.01.114.999 I llama_model_loader: - kv  25:                     qwen35moe.expert_count u32              = 512
[41303] 0.01.114.999 I llama_model_loader: - kv  26:                qwen35moe.expert_used_count u32              = 10
[41303] 0.01.114.999 I llama_model_loader: - kv  27:             qwen35moe.attention.key_length u32              = 256
[41303] 0.01.114.999 I llama_model_loader: - kv  28:           qwen35moe.attention.value_length u32              = 256
[41303] 0.01.115.000 I llama_model_loader: - kv  29:       qwen35moe.expert_feed_forward_length u32              = 1024
[41303] 0.01.115.000 I llama_model_loader: - kv  30: qwen35moe.expert_shared_feed_forward_length u32              = 1024
[41303] 0.01.115.000 I llama_model_loader: - kv  31:                  qwen35moe.ssm.conv_kernel u32              = 4
[41303] 0.01.115.001 I llama_model_loader: - kv  32:                   qwen35moe.ssm.state_size u32              = 128
[41303] 0.01.115.001 I llama_model_loader: - kv  33:                  qwen35moe.ssm.group_count u32              = 16
[41303] 0.01.115.001 I llama_model_loader: - kv  34:               qwen35moe.ssm.time_step_rank u32              = 64
[41303] 0.01.115.001 I llama_model_loader: - kv  35:                   qwen35moe.ssm.inner_size u32              = 8192
[41303] 0.01.115.002 I llama_model_loader: - kv  36:          qwen35moe.full_attention_interval u32              = 4
[41303] 0.01.115.002 I llama_model_loader: - kv  37:             qwen35moe.rope.dimension_count u32              = 64
[41303] 0.01.115.002 I llama_model_loader: - kv  38:             qwen35moe.nextn_predict_layers u32              = 1
[41303] 0.01.115.003 I llama_model_loader: - kv  39:                       tokenizer.ggml.model str              = gpt2
[41303] 0.01.115.003 I llama_model_loader: - kv  40:                         tokenizer.ggml.pre str              = qwen35
[41303] 0.01.129.919 I llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[41303] 0.01.133.610 I llama_model_loader: - kv  42:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[41303] 0.01.145.959 I llama_model_loader: - kv  43:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[41303] 0.01.145.960 I llama_model_loader: - kv  44:                tokenizer.ggml.eos_token_id u32              = 248046
[41303] 0.01.145.960 I llama_model_loader: - kv  45:            tokenizer.ggml.padding_token_id u32              = 248055
[41303] 0.01.145.961 I llama_model_loader: - kv  46:               general.quantization_version u32              = 2
[41303] 0.01.145.964 I llama_model_loader: - kv  47:                          general.file_type u32              = 30
[41303] 0.01.145.964 I llama_model_loader: - kv  48:                      quantize.imatrix.file str              = Qwen3.5-397B-A17B-GGUF/imatrix_unslot...
[41303] 0.01.145.965 I llama_model_loader: - kv  49:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.5-397B-A17B...
[41303] 0.01.145.965 I llama_model_loader: - kv  50:             quantize.imatrix.entries_count u32              = 765
[41303] 0.01.145.965 I llama_model_loader: - kv  51:              quantize.imatrix.chunks_count u32              = 76
[41303] 0.01.145.966 I llama_model_loader: - kv  52:                                   split.no u16              = 0
[41303] 0.01.145.966 I llama_model_loader: - kv  53:                        split.tensors.count i32              = 1118
[41303] 0.01.145.966 I llama_model_loader: - kv  54:                                split.count u16              = 5
[41303] 0.01.145.967 I llama_model_loader: - kv  55:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
[41303] 0.01.145.968 I llama_model_loader: - type  f32:  548 tensors
[41303] 0.01.145.969 I llama_model_loader: - type q8_0:  384 tensors
[41303] 0.01.145.969 I llama_model_loader: - type q3_K:    2 tensors
[41303] 0.01.145.969 I llama_model_loader: - type q4_K:    1 tensors
[41303] 0.01.145.970 I llama_model_loader: - type q6_K:    3 tensors
[41303] 0.01.145.970 I llama_model_loader: - type iq3_s:  118 tensors
[41303] 0.01.145.970 I llama_model_loader: - type iq4_xs:   60 tensors
[41303] 0.01.145.970 I llama_model_loader: - type bf16:    2 tensors
[41303] 0.01.145.972 I print_info: file format = GGUF V3 (latest)
[41303] 0.01.145.972 I print_info: file type   = IQ4_XS - 4.25 bpw
[41303] 0.01.145.974 I print_info: file size   = 181.32 GiB (3.87 BPW)
[41303] 0.01.147.798 I llama_prepare_model_devices: using device RPC0 (10.0.69.2:50052) (unknown id) - 125209 MiB free
[41303] 0.01.215.702 I load: 0 unused tokens
[41303] 0.01.238.645 I load: printing all EOG tokens:
[41303] 0.01.238.647 I load:   - 248044 ('<|endoftext|>')
[41303] 0.01.238.648 I load:   - 248046 ('<|im_end|>')
[41303] 0.01.238.648 I load:   - 248063 ('<|fim_pad|>')
[41303] 0.01.238.648 I load:   - 248064 ('<|repo_name|>')
[41303] 0.01.238.648 I load:   - 248065 ('<|file_sep|>')
[41303] 0.01.238.829 I load: special tokens cache size = 33
[41303] 0.01.281.772 I load: token to piece cache size = 1.7581 MB
[41303] 0.01.281.785 I print_info: arch                  = qwen35moe
[41303] 0.01.281.785 I print_info: vocab_only            = 0
[41303] 0.01.281.785 I print_info: no_alloc              = 0
[41303] 0.01.281.786 I print_info: n_ctx_train           = 262144
[41303] 0.01.281.786 I print_info: n_embd                = 4096
[41303] 0.01.281.786 I print_info: n_embd_inp            = 4096
[41303] 0.01.281.786 I print_info: n_layer               = 61
[41303] 0.01.281.794 I print_info: n_head                = 32
[41303] 0.01.281.795 I print_info: n_head_kv             = 2
[41303] 0.01.281.795 I print_info: n_rot                 = 64
[41303] 0.01.281.795 I print_info: n_swa                 = 0
[41303] 0.01.281.796 I print_info: is_swa_any            = 0
[41303] 0.01.281.796 I print_info: n_embd_head_k         = 256
[41303] 0.01.281.796 I print_info: n_embd_head_v         = 256
[41303] 0.01.281.797 I print_info: n_gqa                 = 16
[41303] 0.01.281.799 I print_info: n_embd_k_gqa          = 512
[41303] 0.01.281.800 I print_info: n_embd_v_gqa          = 512
[41303] 0.01.281.800 I print_info: f_norm_eps            = 0.0e+00
[41303] 0.01.281.802 I print_info: f_norm_rms_eps        = 1.0e-06
[41303] 0.01.281.803 I print_info: f_clamp_kqv           = 0.0e+00
[41303] 0.01.281.803 I print_info: f_max_alibi_bias      = 0.0e+00
[41303] 0.01.281.803 I print_info: f_logit_scale         = 0.0e+00
[41303] 0.01.281.803 I print_info: f_attn_scale          = 0.0e+00
[41303] 0.01.281.803 I print_info: f_attn_value_scale    = 0.0000
[41303] 0.01.281.805 I print_info: n_ff                  = 0
[41303] 0.01.281.805 I print_info: n_expert              = 512
[41303] 0.01.281.805 I print_info: n_expert_used         = 10
[41303] 0.01.281.805 I print_info: n_expert_groups       = 0
[41303] 0.01.281.805 I print_info: n_group_used          = 0
[41303] 0.01.281.805 I print_info: causal attn           = 1
[41303] 0.01.281.805 I print_info: pooling type          = -1
[41303] 0.01.281.806 I print_info: rope type             = 40
[41303] 0.01.281.806 I print_info: rope scaling          = linear
[41303] 0.01.281.810 I print_info: freq_base_train       = 10000000.0
[41303] 0.01.281.811 I print_info: freq_scale_train      = 1
[41303] 0.01.281.811 I print_info: n_ctx_orig_yarn       = 262144
[41303] 0.01.281.811 I print_info: rope_yarn_log_mul     = 0.0000
[41303] 0.01.281.811 I print_info: rope_finetuned        = unknown
[41303] 0.01.281.817 I print_info: mrope sections        = [11, 11, 10, 0]
[41303] 0.01.281.817 I print_info: ssm_d_conv            = 4
[41303] 0.01.281.817 I print_info: ssm_d_inner           = 8192
[41303] 0.01.281.818 I print_info: ssm_d_state           = 128
[41303] 0.01.281.841 I print_info: ssm_dt_rank           = 64
[41303] 0.01.281.843 I print_info: ssm_n_group           = 16
[41303] 0.01.281.843 I print_info: ssm_dt_b_c_rms        = 0
[41303] 0.01.281.844 I print_info: model type            = 397B.A17B
[41303] 0.01.281.845 I print_info: model params          = 402.94 B
[41303] 0.01.281.845 I print_info: general.name          = Qwen3.5-397B-A17B
[41303] 0.01.281.847 I print_info: vocab type            = BPE
[41303] 0.01.281.847 I print_info: n_vocab               = 248320
[41303] 0.01.281.847 I print_info: n_merges              = 247587
[41303] 0.01.281.848 I print_info: BOS token             = 11 ','
[41303] 0.01.281.848 I print_info: EOS token             = 248046 '<|im_end|>'
[41303] 0.01.281.848 I print_info: EOT token             = 248046 '<|im_end|>'
[41303] 0.01.281.848 I print_info: PAD token             = 248055 '<|vision_pad|>'
[41303] 0.01.281.848 I print_info: LF token              = 198 'Ċ'
[41303] 0.01.281.848 I print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
[41303] 0.01.281.848 I print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
[41303] 0.01.281.849 I print_info: FIM MID token         = 248061 '<|fim_middle|>'
[41303] 0.01.281.849 I print_info: FIM PAD token         = 248063 '<|fim_pad|>'
[41303] 0.01.281.849 I print_info: FIM REP token         = 248064 '<|repo_name|>'
[41303] 0.01.281.849 I print_info: FIM SEP token         = 248065 '<|file_sep|>'
[41303] 0.01.281.849 I print_info: EOG token             = 248044 '<|endoftext|>'
[41303] 0.01.281.849 I print_info: EOG token             = 248046 '<|im_end|>'
[41303] 0.01.281.850 I print_info: EOG token             = 248063 '<|fim_pad|>'
[41303] 0.01.281.850 I print_info: EOG token             = 248064 '<|repo_name|>'
[41303] 0.01.281.850 I print_info: EOG token             = 248065 '<|file_sep|>'
[41303] 0.01.281.850 I print_info: max token length      = 256
[41303] 0.01.281.851 I load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[41303] 0.01.289.827 E alloc_tensor_range: failed to allocate RPC0[10.0.69.2:50052] buffer of size 193608946688
[41303] 0.01.322.919 E llama_model_load: error loading model: unable to allocate RPC0[10.0.69.2:50052] buffer
[41303] 0.01.322.926 E llama_model_load_from_file_impl: failed to load model
[41303] 0.01.322.932 E common_init_from_params: failed to load model '/home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf'
[41303] 0.01.322.939 E srv    load_model: failed to load model, '/home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf'
[41303] 0.01.322.940 I srv    operator(): operator(): cleaning up before exit...
[41303] 0.01.324.633 E srv  llama_server: exiting due to model loading error

Log with d7be461:
kusa@framework:~/llama-server$ 0.00.165.383 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.165.456 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.165.487 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.165.533 I srv          init: running without SSL
0.00.165.574 I srv          init: using 31 threads for HTTP server
0.00.166.565 I srv   load_models: Loaded 0 cached model presets
0.00.166.975 I srv   load_models: Loaded 5 custom model presets from /home/kusa/llama-server/config.ini
0.00.167.097 I srv    operator(): Available models (5) (*: custom preset)
0.00.167.098 I srv    operator():   * GLM-5.1-UD-IQ2_M
0.00.167.099 I srv    operator():   * MiMo-V2.5-UD-Q5_K_XL
0.00.167.099 I srv    operator():   * Qwen3.5-397B-A17B-UD-IQ4_XS
0.00.167.099 I srv    operator():   * Qwen3.5-397B-A17B-UD-Q4_K_XL
0.00.167.099 I srv    operator():   * default
0.00.167.241 W srv  llama_server: -----------------
0.00.167.241 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.167.242 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.167.242 W srv  llama_server: -----------------
0.00.167.245 I srv  llama_server: starting router server, no model will be loaded in this process
0.00.167.246 I srv         start: binding port with default address family
0.00.168.438 I srv  llama_server: router server is listening on http://0.0.0.0:8080
0.00.168.442 W srv  llama_server: NOTE: router mode is experimental
0.00.168.442 W srv  llama_server:       it is not recommended to use this mode in untrusted environments
0.04.667.821 I srv          load: spawning server instance with name=Qwen3.5-397B-A17B-UD-IQ4_XS on port 59145
0.04.667.860 I srv          load: spawning server instance with args:
0.04.667.860 I srv          load:   /home/kusa/llama.cpp/build/bin/llama-server
0.04.667.861 I srv          load:   --host
0.04.667.861 I srv          load:   127.0.0.1
0.04.667.861 I srv          load:   --jinja
0.04.667.861 I srv          load:   --metrics
0.04.667.861 I srv          load:   --no-mmap
0.04.667.862 I srv          load:   --no-mmproj-auto
0.04.667.862 I srv          load:   --port
0.04.667.862 I srv          load:   59145
0.04.667.862 I srv          load:   --rpc
0.04.667.862 I srv          load:   10.0.69.2:50052
0.04.667.863 I srv          load:   --spec-draft-n-max
0.04.667.863 I srv          load:   3
0.04.667.863 I srv          load:   --spec-ngram-mod-n-match
0.04.667.863 I srv          load:   24
0.04.667.863 I srv          load:   --spec-ngram-mod-n-max
0.04.667.864 I srv          load:   64
0.04.667.864 I srv          load:   --spec-ngram-mod-n-min
0.04.667.864 I srv          load:   48
0.04.667.864 I srv          load:   --spec-type
0.04.667.864 I srv          load:   draft-mtp
0.04.667.865 I srv          load:   --webui-config-file
0.04.667.865 I srv          load:   /home/kusa/llama-server/webui-config.json
0.04.667.865 I srv          load:   --webui-mcp-proxy
0.04.667.865 I srv          load:   --alias
0.04.667.865 I srv          load:   Qwen3.5-397B-A17B-UD-IQ4_XS
0.04.667.866 I srv          load:   --batch-size
0.04.667.866 I srv          load:   2048
0.04.667.866 I srv          load:   --ctx-size
0.04.667.866 I srv          load:   262144
0.04.667.866 I srv          load:   --cache-ram
0.04.667.867 I srv          load:   2048
0.04.667.867 I srv          load:   --cache-type-k
0.04.667.867 I srv          load:   q8_0
0.04.667.867 I srv          load:   --cache-type-v
0.04.667.867 I srv          load:   q8_0
0.04.667.868 I srv          load:   --swa-checkpoints
0.04.667.868 I srv          load:   100
0.04.667.868 I srv          load:   --flash-attn
0.04.667.868 I srv          load:   1
0.04.667.868 I srv          load:   --log-verbosity
0.04.667.869 I srv          load:   4
0.04.667.869 I srv          load:   --model
0.04.667.869 I srv          load:   /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf
0.04.667.869 I srv          load:   --n-gpu-layers
0.04.667.869 I srv          load:   all
0.04.667.870 I srv          load:   --parallel
0.04.667.870 I srv          load:   1
0.04.667.870 I srv          load:   --ubatch-size
0.04.667.870 I srv          load:   512
[59145] 0.00.035.963 I common_params_print_info: build 9388 (d7be46189) with GNU 16.1.1 for Linux x86_64
[59145] 0.00.035.966 I log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
[59145] 0.00.035.966 I device_info:
[59145] 0.00.036.025 I   - ROCm0   : AMD Radeon 8060S Graphics (122880 MiB, 125717 MiB free)
[59145] 0.00.036.029 I   - CPU     : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (128077 MiB, 128077 MiB free)
[59145] 0.00.036.783 I   - RPC0    : 10.0.69.2:50052 (122880 MiB, 125458 MiB free)
[59145] 0.00.036.829 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[59145] 0.00.036.894 I srv          init: running without SSL
[59145] 0.00.036.925 I srv          init: using 31 threads for HTTP server
[59145] 0.00.037.009 W srv  llama_server: -----------------
[59145] 0.00.037.011 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
[59145] 0.00.037.011 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
[59145] 0.00.037.011 W srv  llama_server: -----------------
[59145] 0.00.037.017 I srv         start: binding port with default address family
[59145] 0.00.038.160 I srv  llama_server: loading model
[59145] 0.00.038.168 I srv    load_model: loading model '/home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf'
[59145] 0.00.433.088 I common_memory_breakdown_print: | memory breakdown [MiB]     |  total     free     self   model   context   compute    unaccounted |
[59145] 0.00.433.093 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 125464 + (92584 = 92584 +       0 +       0) +      -95169 |
[59145] 0.00.433.094 I common_memory_breakdown_print: |   - ROCm0 (8060S Graphics) | 122880 = 125408 + (93398 = 92054 +     512 +     832) +      -95927 |
[59145] 0.00.433.094 I common_memory_breakdown_print: |   - Host                   |                     1558 =  1030 +       0 +     528                |
[59145] 0.00.472.923 I srv    load_model: [spec] estimated memory usage of MTP context is 1344.02 MiB
[59145] 0.00.472.944 I common_init_result: fitting params to device memory ...
[59145] 0.00.472.944 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[59145] 0.00.472.950 I common_params_fit_impl: getting device memory data for initial parameters:
[59145] 0.00.907.869 I common_memory_breakdown_print: | memory breakdown [MiB]     |  total     free     self   model   context   compute    unaccounted |
[59145] 0.00.907.874 I common_memory_breakdown_print: |   - RPC0 (10.0.69.2:50052) | 122880 = 124899 + (98989 = 95560 +    2573 +     856) +     -101009 |
[59145] 0.00.907.875 I common_memory_breakdown_print: |   - ROCm0 (8060S Graphics) | 122880 = 125013 + (91840 = 89079 +    2251 +     509) +      -93974 |
[59145] 0.00.907.875 I common_memory_breakdown_print: |   - Host                   |                     1558 =  1030 +       0 +     528                |
[59145] 0.00.944.258 I common_params_fit_impl: projected memory use with initial parameters [MiB]:
[59145] 0.00.944.266 I common_params_fit_impl:   - RPC0 (10.0.69.2:50052)           : 122880 total,  98989 used,  25909 free vs. target of   2368
[59145] 0.00.944.266 I common_params_fit_impl:   - ROCm0 (AMD Radeon 8060S Graphics): 122880 total,  91840 used,  33173 free vs. target of   1024
[59145] 0.00.944.266 I common_params_fit_impl: projected to use 190830 MiB of device memory vs. 249913 MiB of free device memory
[59145] 0.00.944.266 I common_params_fit_impl: targets for free memory can be met on all devices, no changes needed
[59145] 0.00.944.268 I common_fit_params: successfully fit params to free device memory
[59145] 0.00.944.273 I common_fit_params: fitting params to free memory took 0.47 seconds
[59145] 0.00.973.803 I llama_model_loader: additional 4 GGUFs metadata loaded.
[59145] 0.00.973.809 I llama_model_loader: loaded meta data with 56 key-value pairs and 1118 tensors from /home/kusa/llama-server/models/Qwen3.5-397B-A17B-UD-IQ4_XS/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf (version GGUF V3 (latest))
[59145] 0.00.973.825 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[59145] 0.00.973.828 I llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
[59145] 0.00.973.828 I llama_model_loader: - kv   1:                               general.type str              = model
[59145] 0.00.973.829 I llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
[59145] 0.00.973.834 I llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
[59145] 0.00.973.835 I llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
[59145] 0.00.973.835 I llama_model_loader: - kv   5:                               general.name str              = Qwen3.5-397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5-397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
[59145] 0.00.973.836 I llama_model_loader: - kv   8:                         general.size_label str              = 397B-A17B
[59145] 0.00.973.836 I llama_model_loader: - kv   9:                            general.license str              = apache-2.0
[59145] 0.00.973.837 I llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
[59145] 0.00.973.838 I llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
[59145] 0.00.973.838 I llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
[59145] 0.00.973.839 I llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.5 397B A17B
[59145] 0.00.973.839 I llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
[59145] 0.00.973.839 I llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-3...
[59145] 0.00.973.852 I llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
[59145] 0.00.973.852 I llama_model_loader: - kv  17:                      qwen35moe.block_count u32              = 61
[59145] 0.00.973.853 I llama_model_loader: - kv  18:                   qwen35moe.context_length u32              = 262144
[59145] 0.00.973.853 I llama_model_loader: - kv  19:                 qwen35moe.embedding_length u32              = 4096
[59145] 0.00.973.854 I llama_model_loader: - kv  20:             qwen35moe.attention.head_count u32              = 32
[59145] 0.00.973.854 I llama_model_loader: - kv  21:          qwen35moe.attention.head_count_kv u32              = 2
[59145] 0.00.973.855 I llama_model_loader: - kv  22:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
[59145] 0.00.973.857 I llama_model_loader: - kv  23:                   qwen35moe.rope.freq_base f32              = 10000000.000000
[59145] 0.00.973.858 I llama_model_loader: - kv  24: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
[59145] 0.00.973.858 I llama_model_loader: - kv  25:                     qwen35moe.expert_count u32              = 512
[59145] 0.00.973.859 I llama_model_loader: - kv  26:                qwen35moe.expert_used_count u32              = 10
[59145] 0.00.973.859 I llama_model_loader: - kv  27:             qwen35moe.attention.key_length u32              = 256
[59145] 0.00.973.859 I llama_model_loader: - kv  28:           qwen35moe.attention.value_length u32              = 256
[59145] 0.00.973.860 I llama_model_loader: - kv  29:       qwen35moe.expert_feed_forward_length u32              = 1024
[59145] 0.00.973.860 I llama_model_loader: - kv  30: qwen35moe.expert_shared_feed_forward_length u32              = 1024
[59145] 0.00.973.860 I llama_model_loader: - kv  31:                  qwen35moe.ssm.conv_kernel u32              = 4
[59145] 0.00.973.860 I llama_model_loader: - kv  32:                   qwen35moe.ssm.state_size u32              = 128
[59145] 0.00.973.860 I llama_model_loader: - kv  33:                  qwen35moe.ssm.group_count u32              = 16
[59145] 0.00.973.861 I llama_model_loader: - kv  34:               qwen35moe.ssm.time_step_rank u32              = 64
[59145] 0.00.973.861 I llama_model_loader: - kv  35:                   qwen35moe.ssm.inner_size u32              = 8192
[59145] 0.00.973.861 I llama_model_loader: - kv  36:          qwen35moe.full_attention_interval u32              = 4
[59145] 0.00.973.862 I llama_model_loader: - kv  37:             qwen35moe.rope.dimension_count u32              = 64
[59145] 0.00.973.862 I llama_model_loader: - kv  38:             qwen35moe.nextn_predict_layers u32              = 1
[59145] 0.00.973.862 I llama_model_loader: - kv  39:                       tokenizer.ggml.model str              = gpt2
[59145] 0.00.973.862 I llama_model_loader: - kv  40:                         tokenizer.ggml.pre str              = qwen35
[59145] 0.00.984.710 I llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[59145] 0.00.987.765 I llama_model_loader: - kv  42:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[59145] 0.00.998.388 I llama_model_loader: - kv  43:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[59145] 0.00.998.390 I llama_model_loader: - kv  44:                tokenizer.ggml.eos_token_id u32              = 248046
[59145] 0.00.998.390 I llama_model_loader: - kv  45:            tokenizer.ggml.padding_token_id u32              = 248055
[59145] 0.00.998.390 I llama_model_loader: - kv  46:               general.quantization_version u32              = 2
[59145] 0.00.998.391 I llama_model_loader: - kv  47:                          general.file_type u32              = 30
[59145] 0.00.998.392 I llama_model_loader: - kv  48:                      quantize.imatrix.file str              = Qwen3.5-397B-A17B-GGUF/imatrix_unslot...
[59145] 0.00.998.392 I llama_model_loader: - kv  49:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.5-397B-A17B...
[59145] 0.00.998.392 I llama_model_loader: - kv  50:             quantize.imatrix.entries_count u32              = 765
[59145] 0.00.998.392 I llama_model_loader: - kv  51:              quantize.imatrix.chunks_count u32              = 76
[59145] 0.00.998.393 I llama_model_loader: - kv  52:                                   split.no u16              = 0
[59145] 0.00.998.393 I llama_model_loader: - kv  53:                        split.tensors.count i32              = 1118
[59145] 0.00.998.393 I llama_model_loader: - kv  54:                                split.count u16              = 5
[59145] 0.00.998.395 I llama_model_loader: - kv  55:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
[59145] 0.00.998.396 I llama_model_loader: - type  f32:  548 tensors
[59145] 0.00.998.396 I llama_model_loader: - type q8_0:  384 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q3_K:    2 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q4_K:    1 tensors
[59145] 0.00.998.397 I llama_model_loader: - type q6_K:    3 tensors
[59145] 0.00.998.397 I llama_model_loader: - type iq3_s:  118 tensors
[59145] 0.00.998.398 I llama_model_loader: - type iq4_xs:   60 tensors
[59145] 0.00.998.398 I llama_model_loader: - type bf16:    2 tensors
[59145] 0.00.998.399 I print_info: file format = GGUF V3 (latest)
[59145] 0.00.998.400 I print_info: file type   = IQ4_XS - 4.25 bpw
[59145] 0.00.998.403 I print_info: file size   = 181.32 GiB (3.87 BPW)
[59145] 0.01.000.369 I llama_prepare_model_devices: using device RPC0 (10.0.69.2:50052) (unknown id) - 125295 MiB free
[59145] 0.01.000.383 I llama_prepare_model_devices: using device ROCm0 (AMD Radeon 8060S Graphics) (0000:c2:00.0) - 125361 MiB free
[59145] 0.01.067.835 I load: 0 unused tokens
[59145] 0.01.091.159 I load: printing all EOG tokens:
[59145] 0.01.091.161 I load:   - 248044 ('<|endoftext|>')
[59145] 0.01.091.162 I load:   - 248046 ('<|im_end|>')
[59145] 0.01.091.162 I load:   - 248063 ('<|fim_pad|>')
[59145] 0.01.091.162 I load:   - 248064 ('<|repo_name|>')
[59145] 0.01.091.162 I load:   - 248065 ('<|file_sep|>')
[59145] 0.01.091.344 I load: special tokens cache size = 33
[59145] 0.01.134.425 I load: token to piece cache size = 1.7581 MB
[59145] 0.01.134.441 I print_info: arch                  = qwen35moe
[59145] 0.01.134.442 I print_info: vocab_only            = 0
[59145] 0.01.134.442 I print_info: no_alloc              = 0
[59145] 0.01.134.442 I print_info: n_ctx_train           = 262144
[59145] 0.01.134.442 I print_info: n_embd                = 4096
[59145] 0.01.134.443 I print_info: n_embd_inp            = 4096
[59145] 0.01.134.443 I print_info: n_layer               = 61
[59145] 0.01.134.452 I print_info: n_head                = 32
[59145] 0.01.134.453 I print_info: n_head_kv             = 2
[59145] 0.01.134.453 I print_info: n_rot                 = 64
[59145] 0.01.134.453 I print_info: n_swa                 = 0
[59145] 0.01.134.454 I print_info: is_swa_any            = 0
[59145] 0.01.134.454 I print_info: n_embd_head_k         = 256
[59145] 0.01.134.454 I print_info: n_embd_head_v         = 256
[59145] 0.01.134.455 I print_info: n_gqa                 = 16
[59145] 0.01.134.456 I print_info: n_embd_k_gqa          = 512
[59145] 0.01.134.458 I print_info: n_embd_v_gqa          = 512
[59145] 0.01.134.458 I print_info: f_norm_eps            = 0.0e+00
[59145] 0.01.134.460 I print_info: f_norm_rms_eps        = 1.0e-06
[59145] 0.01.134.460 I print_info: f_clamp_kqv           = 0.0e+00
[59145] 0.01.134.460 I print_info: f_max_alibi_bias      = 0.0e+00
[59145] 0.01.134.460 I print_info: f_logit_scale         = 0.0e+00
[59145] 0.01.134.460 I print_info: f_attn_scale          = 0.0e+00
[59145] 0.01.134.461 I print_info: f_attn_value_scale    = 0.0000
[59145] 0.01.134.462 I print_info: n_ff                  = 0
[59145] 0.01.134.462 I print_info: n_expert              = 512
[59145] 0.01.134.462 I print_info: n_expert_used         = 10
[59145] 0.01.134.462 I print_info: n_expert_groups       = 0
[59145] 0.01.134.462 I print_info: n_group_used          = 0
[59145] 0.01.134.462 I print_info: causal attn           = 1
[59145] 0.01.134.462 I print_info: pooling type          = -1
[59145] 0.01.134.463 I print_info: rope type             = 40
[59145] 0.01.134.463 I print_info: rope scaling          = linear
[59145] 0.01.134.464 I print_info: freq_base_train       = 10000000.0
[59145] 0.01.134.464 I print_info: freq_scale_train      = 1
[59145] 0.01.134.464 I print_info: n_ctx_orig_yarn       = 262144
[59145] 0.01.134.465 I print_info: rope_yarn_log_mul     = 0.0000
[59145] 0.01.134.465 I print_info: rope_finetuned        = unknown
[59145] 0.01.134.465 I print_info: mrope sections        = [11, 11, 10, 0]
[59145] 0.01.134.465 I print_info: ssm_d_conv            = 4
[59145] 0.01.134.465 I print_info: ssm_d_inner           = 8192
[59145] 0.01.134.465 I print_info: ssm_d_state           = 128
[59145] 0.01.134.466 I print_info: ssm_dt_rank           = 64
[59145] 0.01.134.466 I print_info: ssm_n_group           = 16
[59145] 0.01.134.466 I print_info: ssm_dt_b_c_rms        = 0
[59145] 0.01.134.466 I print_info: model type            = 397B.A17B
[59145] 0.01.134.467 I print_info: model params          = 402.94 B
[59145] 0.01.134.467 I print_info: general.name          = Qwen3.5-397B-A17B
[59145] 0.01.134.469 I print_info: vocab type            = BPE
[59145] 0.01.134.469 I print_info: n_vocab               = 248320
[59145] 0.01.134.469 I print_info: n_merges              = 247587
[59145] 0.01.134.469 I print_info: BOS token             = 11 ','
[59145] 0.01.134.470 I print_info: EOS token             = 248046 '<|im_end|>'
[59145] 0.01.134.470 I print_info: EOT token             = 248046 '<|im_end|>'
[59145] 0.01.134.470 I print_info: PAD token             = 248055 '<|vision_pad|>'
[59145] 0.01.134.470 I print_info: LF token              = 198 'Ċ'
[59145] 0.01.134.470 I print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
[59145] 0.01.134.470 I print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
[59145] 0.01.134.471 I print_info: FIM MID token         = 248061 '<|fim_middle|>'
[59145] 0.01.134.471 I print_info: FIM PAD token         = 248063 '<|fim_pad|>'
[59145] 0.01.134.471 I print_info: FIM REP token         = 248064 '<|repo_name|>'
[59145] 0.01.134.471 I print_info: FIM SEP token         = 248065 '<|file_sep|>'
[59145] 0.01.134.471 I print_info: EOG token             = 248044 '<|endoftext|>'
[59145] 0.01.134.471 I print_info: EOG token             = 248046 '<|im_end|>'
[59145] 0.01.134.472 I print_info: EOG token             = 248063 '<|fim_pad|>'
[59145] 0.01.134.472 I print_info: EOG token             = 248064 '<|repo_name|>'
[59145] 0.01.134.472 I print_info: EOG token             = 248065 '<|file_sep|>'
[59145] 0.01.134.472 I print_info: max token length      = 256
[59145] 0.01.134.473 I load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[59145] 0.09.008.161 I load_tensors: offloading output layer to GPU
[59145] 0.09.008.174 I load_tensors: offloading 60 repeating layers to GPU
[59145] 0.09.008.175 I load_tensors: offloaded 62/62 layers to GPU
[59145] 0.09.008.186 I load_tensors:        ROCm0 model buffer size = 92054.89 MiB
[59145] 0.09.008.188 I load_tensors:    ROCm_Host model buffer size =  1030.62 MiB
[59145] 0.09.008.189 I load_tensors: RPC0[10.0.69.2:50052] model buffer size = 92584.99 MiB
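
To read the split in the three `model buffer size` lines above at a glance, the sizes can be summed and compared against the `file size` line. This is only a minimal sketch for reading this log, not part of llama.cpp: the helper name `model_buffers` and the regular expression are assumptions for illustration, matched to the line format shown in this log.

```python
import re

# Assumed pattern, derived from the log lines in this report:
# "... load_tensors: <buffer name> model buffer size = <number> MiB"
BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size\s*=\s*([\d.]+) MiB")


def model_buffers(log_text):
    """Return {buffer name: size in MiB} for every 'model buffer size' line."""
    return {m.group(1): float(m.group(2)) for m in BUFFER_RE.finditer(log_text)}


# The three lines copied verbatim from the log above.
LOG = """\
[59145] 0.09.008.186 I load_tensors:        ROCm0 model buffer size = 92054.89 MiB
[59145] 0.09.008.188 I load_tensors:    ROCm_Host model buffer size =  1030.62 MiB
[59145] 0.09.008.189 I load_tensors: RPC0[10.0.69.2:50052] model buffer size = 92584.99 MiB
"""

buffers = model_buffers(LOG)
total_mib = sum(buffers.values())

# Print each buffer with its share of the total, then the total in GiB.
for name, mib in buffers.items():
    print(f"{name:>24}: {mib:10.2f} MiB ({100 * mib / total_mib:5.1f}%)")
print(f"{'total':>24}: {total_mib:10.2f} MiB = {total_mib / 1024:.2f} GiB")
```

For this log the three buffers sum to 185670.50 MiB, which is 181.32 GiB and matches the reported `file size`, so every tensor is accounted for across ROCm0, ROCm_Host, and RPC0 in this run. Running the same check on the log from the other commit should make any difference in the split easy to spot.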

Labels

bug (Something isn't working)
