Environment:
unsloth==2025.6.12 (latest)
unsloth-zoo==2025.6.12
transformers==4.52.4
torch==2.7.0+cu126
CUDA 12.6
Ubuntu 22.04
RTX 4060 Ti (16GB VRAM)
Critical Observation:
The issue persists in the newest Unsloth version even though #2098 was closed as fixed.
Technical Details:
Quantization Artifacts:
NaN values appear only during GGUF conversion (not in original HF model)
Error occurs at layer blk.3.attn_k.weight (BF16→q4_k_m)
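To help narrow this down, the hf-to-gguf lines in the log further down can be tallied by source dtype. In that log the blk.3 FFN weights arrive as torch.bfloat16 while blk.3.attn_k/q/v/output arrive as torch.float32. Whether this mix is related to the NaN values is not established; the sketch below only makes the pattern easy to see. The helper name `tally_source_dtypes` is made up for illustration, and the regular expression assumes the log line format shown in this report.

```python
import re
from collections import Counter, defaultdict

# Matches converter lines such as:
# INFO:hf-to-gguf:blk.3.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
LINE_RE = re.compile(
    r"INFO:hf-to-gguf:(?P<name>\S+),\s+torch\.(?P<src>\w+)\s+-->\s+(?P<dst>\w+),"
)


def tally_source_dtypes(log_text: str) -> dict:
    """Map each block prefix (e.g. 'blk.3') to a Counter of source dtypes."""
    per_block = defaultdict(Counter)
    for match in LINE_RE.finditer(log_text):
        name = match.group("name")
        # Group per transformer block; non-block tensors keep their full name.
        block = ".".join(name.split(".")[:2]) if name.startswith("blk.") else name
        per_block[block][match.group("src")] += 1
    return dict(per_block)


def mixed_blocks(per_block: dict) -> list:
    """Return the blocks whose tensors were saved with more than one source dtype."""
    return sorted(block for block, counts in per_block.items() if len(counts) > 1)
```

Usage: save the console output to a file, then call `mixed_blocks(tally_source_dtypes(open("convert.log").read()))` (the file name is a placeholder) to list the blocks that contain both dtypes.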
Memory Analysis (during conversion):
GPU Memory: 14.2/16.0 GB utilized
System RAM: 32GB (70% free)
Reproduction Script (Python):
from unsloth import FastLanguageModel
model, _ = FastLanguageModel.from_pretrained("unsloth/Qwen3-4B-unsloth-bnb-4bit")
model.save_pretrained_gguf("test", quantization_method="q4_k_m") # Fails
Request:
Please provide:
Recommended workaround for Qwen3 4-bit
Expected timeline for hotfix
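Full script (z.py):
from unsloth import FastLanguageModel
from peft import PeftModel
import torch
import os
import gc


def clean_memory():
    """Clear GPU cache and free memory"""
    # Collect Python garbage first so the freed tensors can be released from the CUDA cache.
    gc.collect()
    torch.cuda.empty_cache()

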
# 1. Load base model in original format
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    load_in_4bit=True,
    device_map="auto",
)

# 2. Load adapter with file validation
adapter_path = "Qwen3-4B-unsloth-bnb-4bit+dataset_PB_eng"
try:
    # Check for required adapter files
    required_files = ['adapter_config.json', 'adapter_model.safetensors']
    for file in required_files:
        if not os.path.exists(os.path.join(adapter_path, file)):
            raise FileNotFoundError(f"Missing adapter file: {file}")
    model = PeftModel.from_pretrained(model, adapter_path)
    print(f"✅ Adapter loaded. Model type: {type(model)}")
except Exception as e:
    print(f"❌ Adapter loading error: {e}")
    raise SystemExit(1)

# 3. Merge adapter with NaN check
try:
    print("Checking for NaN values...")
    nan_found = False
    # Note: bitsandbytes 4-bit weights are stored as packed integers, so this
    # check only sees floating-point parameters (norms, LoRA matrices, etc.).
    with torch.no_grad():  # in-place edits on parameters must not be tracked by autograd
        for name, param in model.named_parameters():
            nan_mask = torch.isnan(param)
            if nan_mask.any():
                print(f"⚠️ NaN detected in {name}, replacing with zeros")
                param[nan_mask] = 0
                nan_found = True
    if nan_found:
        print("⚠️ Warning: NaN values were found and replaced")
    with torch.inference_mode():
        model = model.merge_and_unload()
    model.config.use_cache = False
    print("✅ Model merged successfully")
    clean_memory()
except Exception as e:
    print(f"❌ Merge error: {e}")
    raise SystemExit(1)

# 4. GGUF conversion with fallback
output_dir = "PB_eng"
os.makedirs(output_dir, exist_ok=True)
try:
    print("Starting GGUF conversion (Q4_K_M)...")
    model.save_pretrained_gguf(
        output_dir,
        tokenizer=tokenizer,
        quantization_method="q4_k_m",
        maximum_memory_usage=0.7,
    )
    # Verify output file
    gguf_file = os.path.join(output_dir, "unsloth.Q4_K_M.gguf")
    if os.path.exists(gguf_file):
        size_gb = os.path.getsize(gguf_file) / (1024 ** 3)
        print(f"✅ Conversion successful! File size: {size_gb:.2f}GB")
    else:
        print("❌ GGUF file not created!")
except Exception as e:
    print(f"❌ Q4_K_M conversion failed: {e}")
    # Fallback to F16
    try:
        print("Attempting F16 conversion as fallback...")
        model.save_pretrained_gguf(
            output_dir,
            tokenizer=tokenizer,
            quantization_method="f16",
        )
        print("✅ F16 conversion completed")
    except Exception as e:
        print(f"❌ F16 conversion failed: {e}")
        raise SystemExit(1)
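
Since the in-memory check in step 3 only covers floating-point parameters, it may also help to inspect the merged 16-bit shards that Unsloth writes to disk before llama.cpp reads them. Below is a minimal standard-library sketch that reads the safetensors layout directly (8-byte little-endian header length, JSON header, raw tensor bytes) and counts NaN/Inf bit patterns. The function names and the `name_filter` parameter are hypothetical, and the pure-Python loop is slow, so it is meant for a single suspect tensor rather than a whole 4B model.

```python
import json
import struct

# Per-dtype layout: (struct format, exponent mask, mantissa mask)
_FLOAT_LAYOUTS = {
    "BF16": ("<H", 0x7F80, 0x007F),
    "F16": ("<H", 0x7C00, 0x03FF),
    "F32": ("<I", 0x7F800000, 0x007FFFFF),
}


def count_non_finite(raw: bytes, dtype: str) -> tuple:
    """Return (nan_count, inf_count) for a raw little-endian tensor buffer."""
    fmt, exp_mask, man_mask = _FLOAT_LAYOUTS[dtype]
    nans = infs = 0
    for (bits,) in struct.iter_unpack(fmt, raw):
        if (bits & exp_mask) == exp_mask:  # exponent all ones: NaN or Inf
            if bits & man_mask:
                nans += 1
            else:
                infs += 1
    return nans, infs


def scan_safetensors(path: str, name_filter: str = "") -> dict:
    """Scan float tensors whose name contains name_filter; return only the bad ones."""
    bad = {}
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len  # data_offsets are relative to this position
        for name, info in header.items():
            if name == "__metadata__" or name_filter not in name:
                continue
            if info["dtype"] not in _FLOAT_LAYOUTS:
                continue  # integer or other non-float tensors cannot hold NaN
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            nans, infs = count_non_finite(f.read(end - begin), info["dtype"])
            if nans or infs:
                bad[name] = (nans, infs)
    return bad
```

For example, `scan_safetensors("PB_eng/model-00001-of-00003.safetensors", "layers.3.self_attn.k_proj")` would check the tensor that the converter reports as blk.3.attn_k.weight, assuming the usual Hugging Face parameter naming for this architecture. An empty result means no NaN or Inf bit patterns were found in the matching tensors.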
===================================================================================
Log:
(u_env) oleg@oleg-MS-7B86:~$ python z.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.6.12: Fast Qwen3 patching. Transformers: 4.52.4.
\ /| NVIDIA GeForce RTX 4060 Ti. Num GPUs = 1. Max memory: 15.576 GB. Platform: Linux.
O^O/ _/ \ Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
"--" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ Adapter loaded. Model type: <class 'peft.peft_model.PeftModelForCausalLM'>
Checking for NaN values...
/home/oleg/miniconda3/envs/u_env/lib/python3.10/site-packages/peft/tuners/lora/bnb.py:351: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
warnings.warn(
✅ Model merged successfully
Starting GGUF conversion (Q4_K_M)...
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 37.11 out of 62.72 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
0%| | 0/36 [00:00<?, ?it/s]
We will save to Disk and not RAM now.
100%|███████████████████████████████████████████| 36/36 [00:07<00:00, 5.08it/s]
Unsloth: Saving tokenizer... Done.
Done.
Unsloth: Converting qwen3 model. Can use fast conversion = False.
==((====))== Unsloth: Conversion from QLoRA to GGUF information
\ /| [0] Installing llama.cpp might take 3 minutes.
O^O/ _/ \ [1] Converting HF to GGUF 16bits might take 3 minutes.
\ / [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
"--" In total, you will have to wait at least 16 minutes.
Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at PB_eng into bf16 GGUF format.
The output location will be /home/oleg/PB_eng/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: PB_eng
INFO:hf-to-gguf:Model architecture: Qwen3ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {2560, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.0.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.0.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.1.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.1.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.10.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.10.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.10.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.10.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.10.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.10.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.11.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.11.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.11.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.11.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.11.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.11.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.11.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.11.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.11.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.12.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.12.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.12.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.12.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.2.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.2.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.2.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.2.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.3.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.3.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.3.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.3.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.3.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.3.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.4.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.4.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.4.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.4.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.4.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.4.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.5.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.5.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.5.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.5.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.5.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.5.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.6.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.6.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.6.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.6.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.6.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.6.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.7.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.7.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.7.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.7.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.7.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.7.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.8.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.8.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.8.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.8.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.8.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.8.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.8.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.8.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.8.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.8.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.8.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.9.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.9.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.9.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.9.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.9.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.9.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.9.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.9.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.9.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.9.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.9.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-00003.safetensors'
INFO:hf-to-gguf:blk.12.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.12.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.12.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.12.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.12.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.13.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.13.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.13.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.13.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.13.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.13.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.14.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.14.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.14.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.14.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.14.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.14.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.14.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.14.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.15.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.15.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.15.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.15.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.15.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.15.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.15.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.15.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.16.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.16.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.16.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.16.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.16.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.16.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.16.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.16.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.17.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.17.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.17.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.17.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.17.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.17.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.17.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.17.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.18.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.18.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.18.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.18.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.18.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.18.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.18.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.18.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.18.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.19.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.19.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.19.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.19.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.19.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.19.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.19.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.19.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.20.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.20.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.20.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.20.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.20.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.20.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.20.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.20.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.20.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.21.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.21.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.21.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.21.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.21.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.21.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.21.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.21.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.21.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.22.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.22.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.22.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.22.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.22.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.22.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.22.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.22.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.23.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.23.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.23.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.23.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.23.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.23.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.23.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.23.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.24.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.24.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.24.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.24.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.24.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.24.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.24.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.24.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.24.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.24.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.24.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.25.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.25.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-00003.safetensors'
INFO:hf-to-gguf:blk.25.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.25.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.25.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.25.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.25.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.25.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.25.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.25.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.25.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.26.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.26.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.26.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.26.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.26.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.26.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.26.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.26.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.26.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.26.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.26.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.27.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.27.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.27.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.27.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.27.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.27.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.27.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.27.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.27.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.27.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.27.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.28.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.28.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.28.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.28.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.28.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.28.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.28.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.29.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.29.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.29.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.29.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.29.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.29.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.29.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.29.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.29.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.29.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.29.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.30.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.30.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.30.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.30.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.30.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.30.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.30.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.30.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.30.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.30.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.30.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.31.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.31.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.31.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.31.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.31.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.31.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.31.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.31.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.31.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.31.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.31.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.32.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.32.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.32.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.32.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.32.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.32.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.32.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.32.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.32.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.32.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.32.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.33.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.33.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.33.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.33.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.33.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.33.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.33.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.33.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.33.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.33.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.33.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.34.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.34.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.34.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.34.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.34.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.34.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.34.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.34.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.34.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.34.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.34.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.35.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.35.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.35.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.35.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.35.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.35.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.35.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.35.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.35.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.35.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.35.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 40960
INFO:hf-to-gguf:gguf: embedding length = 2560
INFO:hf-to-gguf:gguf: feed forward length = 9728
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 1000000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151645
INFO:gguf.vocab:Setting special token type pad to 151654
INFO:gguf.vocab:Setting add_bos_token to False
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/oleg/PB_eng/unsloth.BF16.gguf: n_tensors = 398, total_size = 8.0G
Writing: 100%|██████████| 8.05G/8.05G [00:32<00:00, 245Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/oleg/PB_eng/unsloth.BF16.gguf
Unsloth: Conversion completed! Output location: /home/oleg/PB_eng/unsloth.BF16.gguf
Unsloth: [2] Converting GGUF 16bit into q4_k_m. This might take 20 minutes...
main: build = 1 (bb16041)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/home/oleg/PB_eng/unsloth.BF16.gguf' to '/home/oleg/PB_eng/unsloth.Q4_K_M.gguf' as Q4_K_M using 24 threads
llama_model_loader: loaded meta data with 24 key-value pairs and 398 tensors from /home/oleg/PB_eng/unsloth.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = PB_eng
llama_model_loader: - kv 3: general.size_label str = 4.0B
llama_model_loader: - kv 4: qwen3.block_count u32 = 36
llama_model_loader: - kv 5: qwen3.context_length u32 = 40960
llama_model_loader: - kv 6: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 7: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 8: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 13: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 14: general.file_type u32 = 32
llama_model_loader: - kv 15: general.quantization_version u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type bf16: 253 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
[ 1/ 398] output_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 2/ 398] token_embd.weight - [ 2560, 151936, 1, 1], type = bf16, converting to q6_K .. size = 741.88 MiB -> 304.28 MiB
[ 3/ 398] blk.0.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 4/ 398] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 5/ 398] blk.0.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 6/ 398] blk.0.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 7/ 398] blk.0.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 8/ 398] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 9/ 398] blk.0.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 10/ 398] blk.0.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 11/ 398] blk.0.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 12/ 398] blk.0.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 13/ 398] blk.0.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 14/ 398] blk.1.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 15/ 398] blk.1.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 16/ 398] blk.1.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 17/ 398] blk.1.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 18/ 398] blk.1.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 19/ 398] blk.1.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 20/ 398] blk.1.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 21/ 398] blk.1.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 22/ 398] blk.1.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 23/ 398] blk.1.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 24/ 398] blk.1.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 25/ 398] blk.2.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 26/ 398] blk.2.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 27/ 398] blk.2.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 28/ 398] blk.2.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 29/ 398] blk.2.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 30/ 398] blk.2.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 31/ 398] blk.2.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 32/ 398] blk.2.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 33/ 398] blk.2.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 34/ 398] blk.2.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 35/ 398] blk.2.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
ggml_validate_row_data: found 1139 NaNs in row of 2621440 BF16 values
llama_model_quantize: failed to quantize: tensor 'blk.3.attn_k.weight' has invalid data
main: failed to quantize model from '/home/oleg/PB_eng/unsloth.BF16.gguf'
Unsloth: Conversion completed! Output location: /home/oleg/PB_eng/unsloth.Q4_K_M.gguf
✅ Conversion successful! File size: 0.48GB
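The quantizer rejects `blk.3.attn_k.weight` because the BF16 row contains NaNs. As a way to check the merged weights before the llama.cpp step, here is a minimal stdlib-only sketch of the bit test for a BF16 NaN (exponent bits all ones, mantissa nonzero). The function names `count_bf16_nans` and `float_to_bf16_bytes` are hypothetical helpers written for this illustration, not part of Unsloth or llama.cpp, and the sketch assumes the tensor bytes are little-endian BF16. Reading the raw bytes of a specific tensor out of the safetensors shards is not shown.

```python
import struct


def count_bf16_nans(raw: bytes) -> int:
    """Count NaN values in a buffer of little-endian BF16 numbers."""
    if len(raw) % 2:
        raise ValueError("BF16 buffer length must be a multiple of 2 bytes")
    count = 0
    for (bits,) in struct.iter_unpack("<H", raw):
        exponent_all_ones = (bits & 0x7F80) == 0x7F80  # 8 exponent bits set
        mantissa_nonzero = (bits & 0x007F) != 0        # 7 mantissa bits
        if exponent_all_ones and mantissa_nonzero:
            count += 1
    return count


def float_to_bf16_bytes(value: float) -> bytes:
    """Demo helper: truncate a float32 to BF16 by keeping its upper 16 bits."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.pack("<H", bits32 >> 16)


# Small synthetic buffer: one NaN among ordinary values and an infinity.
sample = b"".join(
    float_to_bf16_bytes(v) for v in (1.0, float("nan"), -2.5, float("inf"))
)
nan_count = count_bf16_nans(sample)
print(nan_count)
```

Infinities are deliberately not counted, since the error above reports NaNs only. A nonzero count on the merged 16-bit weights would suggest the NaNs come from the merge step rather than from the quantizer.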
/home/oleg/PB_eng/config.json
{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 9728,
  "max_position_embeddings": 40960,
  "max_window_layers": 36,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 8,
  "pad_token_id": 151654,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.4",
  "unsloth_fixed": true,
  "unsloth_version": "2025.6.12",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
/home/oleg/PB_eng/generation_config.json
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "max_length": 40960,
  "pad_token_id": 151654,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95,
  "transformers_version": "4.52.4"
}
Log:
(u_env) oleg@oleg-MS-7B86:~$ llama-cli -m /home/oleg/PB_eng/unsloth.BF16.gguf -ngl 99 -p "Hi"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 1 (bb16041) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15081 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 398 tensors from /home/oleg/PB_eng/unsloth.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = PB_eng
llama_model_loader: - kv 3: general.size_label str = 4.0B
llama_model_loader: - kv 4: qwen3.block_count u32 = 36
llama_model_loader: - kv 5: qwen3.context_length u32 = 40960
llama_model_loader: - kv 6: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 7: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 8: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: qwen3.rope.freq_base f32 = 1000000,000000
llama_model_loader: - kv 11: qwen3.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 12: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 13: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 14: general.file_type u32 = 32
llama_model_loader: - kv 15: general.quantization_version u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type bf16: 253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 7,49 GiB (16,00 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0,9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-06
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 4,02 B
print_info: general.name = PB_eng
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 7672,62 MiB
load_tensors: CPU_Mapped model buffer size = 741,88 MiB
.....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000,0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0,58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 576,00 MiB
llama_kv_cache_unified: size = 576,00 MiB ( 4096 cells, 36 layers, 1 seqs), K (f16): 288,00 MiB, V (f16): 288,00 MiB
llama_context: CUDA0 compute buffer size = 301,75 MiB
llama_context: CUDA_Host compute buffer size = 13,01 MiB
llama_context: graph nodes = 1446
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3425608011
sampler params:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
HiGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
llama_perf_sampler_print: sampling time = 37,21 ms / 459 runs ( 0,08 ms per token, 12335,06 tokens per second)
llama_perf_context_print: load time = 2908,46 ms
llama_perf_context_print: prompt eval time = 0,00 ms / 1 tokens ( 0,00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15375,31 ms / 458 runs ( 33,57 ms per token, 29,79 tokens per second)
llama_perf_context_print: total time = 15516,83 ms / 459 tokens
Interrupted by user
(u_env) oleg@oleg-MS-7B86:~$
Log:
(u_env) oleg@oleg-MS-7B86:~$ llama-cli -m /home/oleg/PB_eng/unsloth.Q4_K_M.gguf -ngl 99 -p "Hi"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 1 (bb16041) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15069 MiB free
gguf_init_from_file_impl: invalid magic characters: 'llama_model_load: error loading model: llama_model_loader: failed to load model from /home/oleg/PB_eng/unsloth.Q4_K_M.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/oleg/PB_eng/unsloth.Q4_K_M.gguf'
main: error: unable to load model
(u_env) oleg@oleg-MS-7B86:~$
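The wrapper printed "Conversion successful" even though the quantizer had failed, and the resulting file is then rejected for invalid magic characters. A minimal sketch of a post-conversion sanity check follows: it only verifies that the file starts with the 4-byte `GGUF` magic. The function name `looks_like_gguf` is hypothetical and the check says nothing about whether the tensors inside are valid.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and begins with the GGUF magic bytes."""
    try:
        with open(path, "rb") as fh:
            return fh.read(4) == GGUF_MAGIC
    except OSError:
        # Missing or unreadable file: treat as not a valid GGUF.
        return False


# Demo on throwaway files; the names below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as fh:
        fh.write(GGUF_MAGIC + b"\x03\x00\x00\x00")  # magic + version 3
    with open(bad, "wb") as fh:
        fh.write(b"\x00" * 8)  # wrong header bytes
    good_ok = looks_like_gguf(good)
    bad_ok = looks_like_gguf(bad)
    missing_ok = looks_like_gguf(os.path.join(tmp, "missing.gguf"))

print(good_ok, bad_ok, missing_ok)
```

Checking the file size alone would not have caught this case, since the broken Q4_K_M output was reported as 0.48GB.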
===================================================================================
Log:
(u_env) oleg@oleg-MS-7B86:~$ python z.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.12: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA GeForce RTX 4060 Ti. Num GPUs = 1. Max memory: 15.576 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ Adapter loaded. Model type: <class 'peft.peft_model.PeftModelForCausalLM'>
Checking for NaN values...
/home/oleg/miniconda3/envs/u_env/lib/python3.10/site-packages/peft/tuners/lora/bnb.py:351: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
warnings.warn(
✅ Model merged successfully
Starting GGUF conversion (Q4_K_M)...
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 37.11 out of 62.72 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
0%| | 0/36 [00:00<?, ?it/s]
We will save to Disk and not RAM now.
100%|███████████████████████████████████████████| 36/36 [00:07<00:00, 5.08it/s]
Unsloth: Saving tokenizer... Done.
Done.
Unsloth: Converting qwen3 model. Can use fast conversion = False.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.
Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at PB_eng into bf16 GGUF format.
The output location will be /home/oleg/PB_eng/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: PB_eng
INFO:hf-to-gguf:Model architecture: Qwen3ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {2560, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.0.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.0.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.1.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.1.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.10.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.10.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.10.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.10.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.10.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.10.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.11.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.11.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.11.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.11.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.11.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.11.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.11.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.11.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.11.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.12.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.12.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.12.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.12.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.2.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.2.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.2.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.2.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.3.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.3.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.3.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.3.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.3.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.3.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.4.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.4.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.4.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.4.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.4.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.4.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.5.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.5.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.5.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.5.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.5.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.5.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.6.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.6.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.6.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.6.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.6.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.6.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.7.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.7.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.7.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.7.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.7.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.7.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.8.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.8.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.8.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.8.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.8.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.8.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.8.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.8.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.8.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.8.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.8.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.9.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.9.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.9.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.9.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.9.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.9.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.9.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.9.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.9.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.9.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.9.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-00003.safetensors'
INFO:hf-to-gguf:blk.12.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.12.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.12.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.12.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.12.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.13.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.13.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.13.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.13.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.13.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.13.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.13.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.14.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.14.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.14.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.14.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.14.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.14.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.14.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.14.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.15.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.15.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.15.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.15.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.15.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.15.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.15.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.15.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.16.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.16.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.16.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.16.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.16.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.16.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.16.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.16.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.17.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.17.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.17.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.17.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.17.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.17.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.17.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.17.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.18.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.18.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.18.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.18.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.18.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.18.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.18.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.18.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.18.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.19.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.19.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.19.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.19.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.19.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.19.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.19.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.19.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.20.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.20.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.20.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.20.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.20.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.20.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.20.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.20.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.20.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.21.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.21.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.21.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.21.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.21.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.21.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.21.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.21.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.21.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.22.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.22.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.22.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.22.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.22.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.22.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.22.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.22.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.23.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.23.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.23.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.23.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.23.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.23.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.23.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.23.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.24.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.24.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.24.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.24.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.24.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.24.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.24.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.24.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.24.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.24.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.24.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.25.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.25.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-00003.safetensors'
INFO:hf-to-gguf:blk.25.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.25.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.25.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.25.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.25.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.25.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.25.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.25.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.25.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.26.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.26.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.26.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.26.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.26.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.26.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.26.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.26.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.26.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.26.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.26.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.27.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.27.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.27.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.27.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.27.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.27.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.27.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.27.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.27.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.27.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.27.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.28.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.28.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.28.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.28.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.28.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.28.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.28.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.29.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.29.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.29.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.29.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.29.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.29.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.29.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.29.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.29.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.29.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.29.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.30.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.30.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.30.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.30.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.30.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.30.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.30.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.30.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.30.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.30.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.30.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.31.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.31.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.31.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.31.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.31.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.31.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.31.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.31.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.31.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.31.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.31.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.32.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.32.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.32.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.32.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.32.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.32.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.32.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.32.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.32.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.32.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.32.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.33.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.33.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.33.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.33.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.33.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.33.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.33.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.33.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.33.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.33.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.33.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.34.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.34.ffn_down.weight, torch.bfloat16 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.34.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.34.ffn_up.weight, torch.bfloat16 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.34.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.34.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.34.attn_k.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.34.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.34.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.34.attn_q.weight, torch.bfloat16 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.34.attn_v.weight, torch.bfloat16 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.35.attn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.35.ffn_down.weight, torch.float32 --> BF16, shape = {9728, 2560}
INFO:hf-to-gguf:blk.35.ffn_gate.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.35.ffn_up.weight, torch.float32 --> BF16, shape = {2560, 9728}
INFO:hf-to-gguf:blk.35.ffn_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:blk.35.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.35.attn_k.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:blk.35.attn_output.weight, torch.float32 --> BF16, shape = {4096, 2560}
INFO:hf-to-gguf:blk.35.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
INFO:hf-to-gguf:blk.35.attn_q.weight, torch.float32 --> BF16, shape = {2560, 4096}
INFO:hf-to-gguf:blk.35.attn_v.weight, torch.float32 --> BF16, shape = {2560, 1024}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {2560}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 40960
INFO:hf-to-gguf:gguf: embedding length = 2560
INFO:hf-to-gguf:gguf: feed forward length = 9728
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 1000000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151645
INFO:gguf.vocab:Setting special token type pad to 151654
INFO:gguf.vocab:Setting add_bos_token to False
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/oleg/PB_eng/unsloth.BF16.gguf: n_tensors = 398, total_size = 8.0G
Writing: 100%|██████████| 8.05G/8.05G [00:32<00:00, 245Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/oleg/PB_eng/unsloth.BF16.gguf
Unsloth: Conversion completed! Output location: /home/oleg/PB_eng/unsloth.BF16.gguf
Unsloth: [2] Converting GGUF 16bit into q4_k_m. This might take 20 minutes...
main: build = 1 (bb16041)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/home/oleg/PB_eng/unsloth.BF16.gguf' to '/home/oleg/PB_eng/unsloth.Q4_K_M.gguf' as Q4_K_M using 24 threads
llama_model_loader: loaded meta data with 24 key-value pairs and 398 tensors from /home/oleg/PB_eng/unsloth.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = PB_eng
llama_model_loader: - kv 3: general.size_label str = 4.0B
llama_model_loader: - kv 4: qwen3.block_count u32 = 36
llama_model_loader: - kv 5: qwen3.context_length u32 = 40960
llama_model_loader: - kv 6: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 7: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 8: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 13: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 14: general.file_type u32 = 32
llama_model_loader: - kv 15: general.quantization_version u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type bf16: 253 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
[ 1/ 398] output_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 2/ 398] token_embd.weight - [ 2560, 151936, 1, 1], type = bf16, converting to q6_K .. size = 741.88 MiB -> 304.28 MiB
[ 3/ 398] blk.0.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 4/ 398] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 5/ 398] blk.0.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 6/ 398] blk.0.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 7/ 398] blk.0.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 8/ 398] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 9/ 398] blk.0.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 10/ 398] blk.0.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 11/ 398] blk.0.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 12/ 398] blk.0.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 13/ 398] blk.0.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 14/ 398] blk.1.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 15/ 398] blk.1.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 16/ 398] blk.1.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 17/ 398] blk.1.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 18/ 398] blk.1.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 19/ 398] blk.1.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 20/ 398] blk.1.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 21/ 398] blk.1.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 22/ 398] blk.1.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 23/ 398] blk.1.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 24/ 398] blk.1.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 25/ 398] blk.2.attn_k.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q4_K .. size = 5.00 MiB -> 1.41 MiB
[ 26/ 398] blk.2.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 27/ 398] blk.2.attn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 28/ 398] blk.2.attn_output.weight - [ 4096, 2560, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 29/ 398] blk.2.attn_q.weight - [ 2560, 4096, 1, 1], type = bf16, converting to q4_K .. size = 20.00 MiB -> 5.62 MiB
[ 30/ 398] blk.2.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 31/ 398] blk.2.attn_v.weight - [ 2560, 1024, 1, 1], type = bf16, converting to q6_K .. size = 5.00 MiB -> 2.05 MiB
[ 32/ 398] blk.2.ffn_down.weight - [ 9728, 2560, 1, 1], type = bf16, converting to q6_K .. size = 47.50 MiB -> 19.48 MiB
[ 33/ 398] blk.2.ffn_gate.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
[ 34/ 398] blk.2.ffn_norm.weight - [ 2560, 1, 1, 1], type = f32, size = 0.010 MB
[ 35/ 398] blk.2.ffn_up.weight - [ 2560, 9728, 1, 1], type = bf16, converting to q4_K .. size = 47.50 MiB -> 13.36 MiB
ggml_validate_row_data: found 1139 NaNs in row of 2621440 BF16 values
llama_model_quantize: failed to quantize: tensor 'blk.3.attn_k.weight' has invalid data
main: failed to quantize model from '/home/oleg/PB_eng/unsloth.BF16.gguf'
Unsloth: Conversion completed! Output location: /home/oleg/PB_eng/unsloth.Q4_K_M.gguf
✅ Conversion successful! File size: 0.48GB
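The 2621440 values in the failing row match the 2560 × 1024 shape of an attn_k.weight tensor, so the NaNs appear to be already present in the intermediate BF16 GGUF before q4_K quantization starts. A minimal sketch for counting NaNs in a raw BF16 buffer is below. It assumes little-endian data, and `count_bf16_nans` is a hypothetical helper name, not part of Unsloth or llama.cpp. Extracting the tensor bytes from the GGUF or safetensors file is left out.

```python
import struct


def count_bf16_nans(raw: bytes) -> int:
    """Count NaN values in a buffer of little-endian BF16 numbers."""
    if len(raw) % 2:
        raise ValueError("BF16 buffer length must be a multiple of 2 bytes")
    count = 0
    for (bits,) in struct.iter_unpack("<H", raw):
        # Drop the sign bit. A value is NaN when the 8 exponent bits are all
        # ones (0x7F80) and the 7 mantissa bits are not all zero.
        if (bits & 0x7FFF) > 0x7F80:
            count += 1
    return count


# Sanity check of the element count reported in the log (hypothetical usage).
assert 2560 * 1024 == 2621440
```

The pure-Python loop is slow for a full model; the same bit test can be vectorized with NumPy over a `uint16` view if that is available.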
/home/oleg/PB_eng/config.json
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2560,
"initializer_range": 0.02,
"intermediate_size": 9728,
"max_position_embeddings": 40960,
"max_window_layers": 36,
"model_type": "qwen3",
"num_attention_heads": 32,
"num_hidden_layers": 36,
"num_key_value_heads": 8,
"pad_token_id": 151654,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.4",
"unsloth_fixed": true,
"unsloth_version": "2025.6.12",
"use_cache": false,
"use_sliding_window": false,
"vocab_size": 151936
}
/home/oleg/PB_eng/generation_config.json
{
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"max_length": 40960,
"pad_token_id": 151654,
"temperature": 0.6,
"top_k": 20,
"top_p": 0.95,
"transformers_version": "4.52.4"
}
Log:
(u_env) oleg@oleg-MS-7B86:~$ llama-cli -m /home/oleg/PB_eng/unsloth.BF16.gguf -ngl 99 -p "Hi"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 1 (bb16041) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15081 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 398 tensors from /home/oleg/PB_eng/unsloth.BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = PB_eng
llama_model_loader: - kv 3: general.size_label str = 4.0B
llama_model_loader: - kv 4: qwen3.block_count u32 = 36
llama_model_loader: - kv 5: qwen3.context_length u32 = 40960
llama_model_loader: - kv 6: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 7: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 8: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: qwen3.rope.freq_base f32 = 1000000,000000
llama_model_loader: - kv 11: qwen3.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 12: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 13: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 14: general.file_type u32 = 32
llama_model_loader: - kv 15: general.quantization_version u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type bf16: 253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 7,49 GiB (16,00 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0,9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-06
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 4,02 B
print_info: general.name = PB_eng
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 7672,62 MiB
load_tensors: CPU_Mapped model buffer size = 741,88 MiB
.....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000,0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0,58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 576,00 MiB
llama_kv_cache_unified: size = 576,00 MiB ( 4096 cells, 36 layers, 1 seqs), K (f16): 288,00 MiB, V (f16): 288,00 MiB
llama_context: CUDA0 compute buffer size = 301,75 MiB
llama_context: CUDA_Host compute buffer size = 13,01 MiB
llama_context: graph nodes = 1446
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3425608011
sampler params:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
HiGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
llama_perf_sampler_print: sampling time = 37,21 ms / 459 runs ( 0,08 ms per token, 12335,06 tokens per second)
llama_perf_context_print: load time = 2908,46 ms
llama_perf_context_print: prompt eval time = 0,00 ms / 1 tokens ( 0,00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15375,31 ms / 458 runs ( 33,57 ms per token, 29,79 tokens per second)
llama_perf_context_print: total time = 15516,83 ms / 459 tokens
Interrupted by user
(u_env) oleg@oleg-MS-7B86:~$
Log:
(u_env) oleg@oleg-MS-7B86:~$ llama-cli -m /home/oleg/PB_eng/unsloth.Q4_K_M.gguf -ngl 99 -p "Hi"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 1 (bb16041) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15069 MiB free
gguf_init_from_file_impl: invalid magic characters: '
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/oleg/PB_eng/unsloth.Q4_K_M.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/oleg/PB_eng/unsloth.Q4_K_M.gguf'
main: error: unable to load model
(u_env) oleg@oleg-MS-7B86:~$
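The conversion step reported success for a 0.48GB file that llama-cli then rejects with "invalid magic characters", so a header check before printing a success message would have caught this. A minimal sketch, assuming a little-endian GGUF file whose first 8 bytes are the `GGUF` magic followed by a uint32 version; `check_gguf_header` is a hypothetical helper, not an Unsloth or llama.cpp function:

```python
import struct

GGUF_MAGIC = b"GGUF"


def check_gguf_header(path) -> int:
    """Return the GGUF version if the file starts with a valid header.

    Raises ValueError when the magic bytes are missing or the file is
    too short to contain a header.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path}: not a GGUF file (magic={header[:4]!r})")
    # The version is stored as an unsigned 32-bit integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling this on the output path right after quantization (for example on `/home/oleg/PB_eng/unsloth.Q4_K_M.gguf`) and treating a `ValueError` as a failed conversion avoids reporting a truncated or partially written file as successful.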