Description
What happened?
CPU: Ryzen 9 7950X3D
OS: Windows 11

Mistral-Large-Instruct-2407.IQ3_XS.gguf (CPU 90 °C)

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf (CPU 66 °C)

The temperature is higher than in the CPU torture test from CPU-Z, where the maximum I reach is 83 °C. This happens ONLY with Mistral-Large-Instruct-2407.IQ3_XS.gguf. Even when I set --threads 1, the CPU heats up to 90 °C, although Task Manager shows only one thread in use by llama.cpp.
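To cross-check what Task Manager reports, a small script can count the OS threads of the llama-cli process and how many of them have actually accumulated CPU time. This is a minimal sketch assuming the third-party `psutil` package is installed; the process name `llama-cli.exe` and the 1-second busy threshold are assumptions you may need to adjust:

```python
# Sketch: report how many threads llama-cli really has and how many are busy.
# Assumes `pip install psutil`; process name is Windows-specific.
import psutil

def thread_report(proc_name="llama-cli.exe"):
    """Return (pid, total_threads, busy_threads) for the first matching process,
    or None if no process with that name is running."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == proc_name:
            threads = proc.threads()
            # A thread counts as "busy" once it has accumulated >1 s of user CPU time.
            busy = sum(1 for t in threads if t.user_time > 1.0)
            return proc.pid, len(threads), busy
    return None

if __name__ == "__main__":
    report = thread_report()
    if report is None:
        print("llama-cli.exe not found")
    else:
        pid, total, busy = report
        print(f"pid={pid} total_threads={total} busy_threads={busy}")
```

If `busy_threads` stays at 1 while the package still draws high power, the heat would come from that single thread's workload (e.g. heavy dequantization of the IQ3_XS format) rather than from hidden extra threads.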
Mistral-Large-Instruct-2407.IQ3_XS.gguf
llama-cli.exe --model models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf --color --threads 1 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 39 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template chatml
llama-cli.exe --model models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 39 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template chatml
Log start
main: build = 3488 (75af08c4)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1722289609
llama_model_loader: loaded meta data with 41 key-value pairs and 795 tensors from models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral Large Instruct 2407
llama_model_loader: - kv 3: general.version str = 2407
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral
llama_model_loader: - kv 6: general.size_label str = Large
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = mrl
llama_model_loader: - kv 9: general.license.link str = https://mistral.ai/licenses/MRL-0.1.md
llama_model_loader: - kv 10: general.languages arr[str,10] = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv 11: llama.block_count u32 = 88
llama_model_loader: - kv 12: llama.context_length u32 = 32768
llama_model_loader: - kv 13: llama.embedding_length u32 = 12288
llama_model_loader: - kv 14: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 15: llama.attention.head_count u32 = 96
llama_model_loader: - kv 16: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: general.file_type u32 = 22
llama_model_loader: - kv 20: llama.vocab_size u32 = 32768
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = Mistral-Large-Instruct-2407-IMat-GGUF...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = Mistral-Large-Instruct-2407-IMat-GGUF...
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 616
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 148
llama_model_loader: - kv 38: split.no u16 = 0
llama_model_loader: - kv 39: split.count u16 = 0
llama_model_loader: - kv 40: split.tensors.count i32 = 795
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q4_K: 88 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq3_xxs: 308 tensors
llama_model_loader: - type iq3_s: 221 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_layer = 88
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ3_XS - 3.3 bpw
llm_load_print_meta: model params = 122.61 B
llm_load_print_meta: model size = 46.70 GiB (3.27 BPW)
llm_load_print_meta: general.name = Mistral Large Instruct 2407
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 39 repeating layers to GPU
llm_load_tensors: offloaded 39/89 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 26799.61 MiB
llm_load_tensors: CUDA0 buffer size = 21018.94 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1574.12 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1252.88 MiB
llama_new_context_with_model: KV self size = 2827.00 MiB, K (f16): 1413.50 MiB, V (f16): 1413.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1669.13 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.07 MiB
llama_new_context_with_model: graph nodes = 2822
llama_new_context_with_model: graph splits = 543
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
> hello /
Hello! How can I assist you today? Let's have a friendly and respectful conversation. 😊
> tell me a stry /
I'd be happy to share a short story with you! Here we go:
Once upon a time in a small town nestled between rolling hills and a sparkling river, there lived a little girl named Lily. Lily was known for her vibrant imagination and her love for drawing. She could spend hours by the river, sketching the ducks, the flowers, and the clouds above.
> llama_print_timings: load time = 18929.92 ms
llama_print_timings: sample time = 3.32 ms / 107 runs ( 0.03 ms per token, 32248.34 tokens per second)
llama_print_timings: prompt eval time = 18667.41 ms / 64 tokens ( 291.68 ms per token, 3.43 tokens per second)
llama_print_timings: eval time = 56435.98 ms / 105 runs ( 537.49 ms per token, 1.86 tokens per second)
llama_print_timings: total time = 123532.93 ms / 169 tokens
Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
llama-cli.exe --model models/new3/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 42 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3
Log start
main: build = 3488 (75af08c4)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1722289776
llama_model_loader: loaded meta data with 33 key-value pairs and 724 tensors from models/new3/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 80
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-70B-Instru...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.68 MiB
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloaded 42/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 19985.43 MiB
llm_load_tensors: CUDA0 buffer size = 20557.70 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1220.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1349.25 MiB
llama_new_context_with_model: KV self size = 2570.00 MiB, K (f16): 1285.00 MiB, V (f16): 1285.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1140.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 32.07 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 498
main: chat template example: <|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
> tell me a story /
Once upon a time, in a small village nestled in the rolling hills of Tuscany, there was a tiny shop called "Mirabel's Marvels." The shop was run by a kind and gentle woman named Mirabel, who was known throughout the village for her extraordinary talent: she could hear the whispers of inanimate objects.
llama_print_timings: load time = 15492.02 ms
llama_print_timings: sample time = 13.44 ms / 72 runs ( 0.19 ms per token, 5358.34 tokens per second)
llama_print_timings: prompt eval time = 9574.75 ms / 14 tokens ( 683.91 ms per token, 1.46 tokens per second)
llama_print_timings: eval time = 30604.81 ms / 72 runs ( 425.07 ms per token, 2.35 tokens per second)
Any idea what is happening?
Name and Version
llama-cli --version
version: 3488 (75af08c)
built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
No response