
[Bug]Error: llama runner process has terminated: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file llama_load_model_from_file: failed to load model #2667

@alperen21

  1. Did you update? pip install --upgrade unsloth unsloth_zoo
  2. Colab or Kaggle or local / cloud
  3. Number GPUs used, use nvidia-smi
  4. Which notebook?
  5. Paste Unsloth printout with 🦥 sloth emoji
  6. Which trainer? SFTTrainer, GRPOTrainer etc
  7. Minimal code to reproduce error Remove Hugging Face token!

You can also join our Discord: https://discord.com/invite/unsloth
Have you tried visiting our Docs? https://docs.unsloth.ai/basics/errors-troubleshooting

I trained a llama model using the following script:

`python
#!/usr/bin/env python

coding: utf-8

To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github

To install Unsloth on your own computer, follow the installation instructions on our Github page here.

You will learn how to do data prep, how to train, how to run the model, & how to save it

### News

Unsloth now supports Text-to-Speech (TTS) models. Read our guide here.

Read our Qwen3 Guide and check out our new Dynamic 2.0 quants, which outperform other quantization methods!

Visit our docs for all our model uploads and notebooks.

### Installation

In[1]:

### Unsloth

In[2]:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

4bit pre quantized models we support for 4x faster downloading + no OOMs.

fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = MODEL,
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In[3]:

model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)

### Data Prep

We now use the Alpaca dataset from vicgalle, a 52K-example version of the original Alpaca dataset generated with GPT-4. You can replace this code section with your own data prep.

In[4]:

from huggingface_hub import login

Paste your Hugging Face token when prompted

login()
from datasets import load_dataset

dataset = load_dataset("alperenyildiz/r4vd_sft_dataset", split="train")
dataset = dataset.select(range(3))
print(dataset.column_names)

One issue is this dataset has multiple columns. For Ollama and llama.cpp to function like a custom ChatGPT Chatbot, we must only have 2 columns - an instruction and an output column.

In[5]:

print(dataset.column_names)

To solve this, we shall do the following:

* Merge all columns into 1 instruction prompt.

* Remember LLMs are text predictors, so we can customize the instruction to anything we like!

* Use the to_sharegpt function to do this column merging process!

For example below in our Titanic CSV finetuning notebook, we merged multiple columns in 1 prompt:

To merge multiple columns into 1, use merged_prompt.

* Enclose all columns in curly braces {}.

* Optional text must be enclosed in [[]]. For example, if the column "Pclass" is empty, the merging function will not show the text and will skip it. This is useful for datasets with missing values.

* You can select every column, or a few!

* Select the output or target / prediction column in output_column_name. For the Alpaca dataset, this will be output.

To make the finetune handle multiple turns (like in ChatGPT), we have to create a "fake" dataset with multiple turns - we use conversation_extension to randomly select some conversations from the dataset, and pack them together into 1 conversation.

In[6]:

from unsloth import to_sharegpt

dataset = to_sharegpt(
dataset,
merged_prompt="{instruction}[[\nYour input is:\n{input}]]",
output_column_name="output",
conversation_extension=3, # Select more to handle longer conversations
)

Finally use standardize_sharegpt to fix up the dataset!

In[7]:

from unsloth import standardize_sharegpt

dataset = standardize_sharegpt(dataset)

### Customizable Chat Templates

You also need to specify a chat template. Previously, you could use the Alpaca format as shown below.

In[8]:

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:

{}

Input:

{}

Response:

{}"""

Now, you have to use {INPUT} for the instruction and {OUTPUT} for the response.

We also allow you to use an optional {SYSTEM} field. This is useful for Ollama when you want to use a custom system prompt (also like in ChatGPT).

You can also not put a {SYSTEM} field, and just put plain text.

```python

chat_template = """{SYSTEM}

USER: {INPUT}

ASSISTANT: {OUTPUT}"""

```

Use below if you want to use the Llama-3 prompt format. You must use the instruct and not the base model if you use this!

```python

chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>"""

```

For the ChatML format:

```python

chat_template = """<|im_start|>system

{SYSTEM}<|im_end|>

<|im_start|>user

{INPUT}<|im_end|>

<|im_start|>assistant

{OUTPUT}<|im_end|>"""

```

The issue is the Alpaca format has 3 fields, whilst OpenAI style chatbots must only use 2 fields (instruction and response). That's why we used the to_sharegpt function to merge these columns into 1.

In[9]:

chat_template = """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.

Instruction:

{INPUT}

Response:

{OUTPUT}"""

from unsloth import apply_chat_template

dataset = apply_chat_template(
dataset,
tokenizer=tokenizer,
chat_template=chat_template,
# default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

### Train the model

Now let's use Huggingface TRL's SFTTrainer! More docs here: TRL SFT docs. We do 60 steps to speed things up, but you can set num_train_epochs=1 for a full run, and turn off max_steps=None. We also support TRL's DPOTrainer!

In[10]:

import wandb
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

Initialize WandB

wandb.init(project="r4vd_training", name="sft_train")

trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
num_train_epochs = 1,
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
save_strategy = 'no',
output_dir = "outputs",
report_to = "wandb", # Now logs to WandB
),
)

In[11]:

@title Show current memory stats

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In[12]:

trainer_stats = trainer.train()

In[13]:

@title Show final memory and time stats

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

### Inference

Let's run the model! Unsloth makes inference natively 2x faster as well! You should use prompts which are similar to the ones you had finetuned on, otherwise you might get bad results!

In[14]:

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [ # Change below!
{"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

Since we created an actual chatbot, you can also do longer conversations by manually adding alternating conversations between the user and assistant!

In[15]:

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [ # Change below!
{"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8"},
{"role": "assistant", "content": "The fibonacci sequence continues as 13, 21, 34, 55 and 89."},
{"role": "user", "content": "What is France's tallest tower called?"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

### Saving, loading finetuned models

To save the final model as LoRA adapters, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In[16]:

model.save_pretrained("model") # Local saving
tokenizer.save_pretrained("model")

model.push_to_hub("your_name/lora_model", token = "...") # Online saving

tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, set False to True:

In[17]:

if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass

messages = [ # Change below!
{"role": "user", "content": "Describe anything special about a sequence. Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

You can also use Hugging Face's AutoPeftModelForCausalLM. Only use this if you do not have Unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.

In[18]:

if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Ollama Support

Unsloth now allows you to automatically finetune and create a Modelfile, and export to Ollama! This makes finetuning much easier and provides a seamless workflow from Unsloth to Ollama!

Let's first install Ollama!

In[ ]:

Next, we shall save the model to GGUF / llama.cpp

We clone llama.cpp and we default save it to q8_0. We allow all methods like q4_k_m. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our Wiki page):

* q8_0 - Fast conversion. High resource use, but generally acceptable.

* q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.

* q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

We also support saving to multiple GGUF options in a list fashion! This can speed things up by 10 minutes or more if you want multiple export formats!

In[ ]:

from transformers import AutoTokenizer
base_tok = AutoTokenizer.from_pretrained(MODEL) # use the exact base model name

Save to 8bit Q8_0

if False: model.save_pretrained_gguf("model", tokenizer,)

Remember to go to https://huggingface.co/settings/tokens for a token!

And change hf to your username!

if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

Save to 16bit GGUF

if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

Save to q4_k_m GGUF

if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Save to multiple GGUF options - much faster if you want multiple!

if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

We use subprocess to start Ollama up in a non blocking fashion! In your own desktop, you can simply open up a new terminal and type ollama serve, but in Colab, we have to use this hack!

In[ ]:

import subprocess
import time

subprocess.Popen(["ollama", "serve"])
time.sleep(3) # Wait for a few seconds for Ollama to load!

Ollama needs a Modelfile, which specifies the model's prompt format. Let's print Unsloth's auto generated one:

In[ ]:

print(tokenizer._ollama_modelfile)

We now will create an Ollama model called unsloth_model using the Modelfile which we auto generated!

In[ ]:

get_ipython().system('~/ollama/bin/ollama create unsloth_model -f ./model/Modelfile')

And now we can do inference on it via Ollama!

You can also upload to Ollama and try the Ollama Desktop app by heading to https://www.ollama.com/

In[ ]:

get_ipython().system("""curl http://localhost:11434/api/chat -d '{ "model": "unsloth_model", "messages": [ { "role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8," } ] }'""")

# ChatGPT interactive mode

### ⭐ To run the finetuned model like in a ChatGPT style interface, first click the | >_ | button.

---

---

---

### ⭐ Then, type ollama run unsloth_model

---

---

---

### ⭐ And you have a ChatGPT style assistant!

### Type any question you like and press ENTER. If you want to exit, hit CTRL + D

You can also use the model-unsloth.gguf file or model-unsloth-Q4_K_M.gguf file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan here and Open WebUI here

And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:

1. Train your own reasoning model - Llama GRPO notebook Free Colab

2. Saving finetunes to Ollama. Free notebook

3. Llama 3.2 Vision finetuning - Radiography use case. Free Colab

4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!

Join Discord if you need help + ⭐️ Star us on Github ⭐️

`

Based on one of the official notebooks.

However, when I generate the GGUF file and try to run it with Ollama, the model is created, but running it fails with the following error:

Error: llama runner process has terminated: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file

llama_load_model_from_file: failed to load model
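
One way to confirm what the error is complaining about is to list the metadata keys stored in the exported GGUF file and see whether a merges entry is present. The sketch below is not part of Unsloth, llama.cpp or Ollama; it is a minimal standalone reader written against the published GGUF layout (version 2 or later, little-endian), and the helper name `gguf_metadata_keys` is made up for illustration. It assumes the BPE merges are stored under the key `tokenizer.ggml.merges`, which is the name llama.cpp conventionally uses.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Fixed-width scalar value types in GGUF metadata: type id -> struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _skip_value(f, vtype):
    """Advance the file position past one metadata value without decoding it."""
    if vtype in _SCALARS:
        f.seek(struct.calcsize(_SCALARS[vtype]), 1)
    elif vtype == _STRING:
        f.seek(_read(f, "<Q"), 1)
    elif vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        if item_type in _SCALARS:
            f.seek(count * struct.calcsize(_SCALARS[item_type]), 1)
        else:
            for _ in range(count):
                _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def gguf_metadata_keys(path):
    """Return the metadata key names of a GGUF file, in file order."""
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses a different header layout")
        _read(f, "<Q")             # tensor count, not needed here
        kv_count = _read(f, "<Q")  # number of metadata key/value pairs
        keys = []
        for _ in range(kv_count):
            keys.append(_read_string(f))
            _skip_value(f, _read(f, "<I"))
        return keys
```

Calling `gguf_metadata_keys("model/unsloth.Q8_0.gguf")` (the path is a guess based on the Modelfile) and checking whether `"tokenizer.ggml.merges"` is in the result would show whether the merges were written during conversion or dropped before the file reached Ollama.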

This is the Modelfile:
`FROM ./unsloth.Q8_0.gguf

TEMPLATE """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.{{ if .Prompt }}

Instruction:

{{ .Prompt }}{{ end }}

Response:

{{ .Response }}<|end_of_text|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_"
PARAMETER temperature 1.5
PARAMETER min_p 0.1
`
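
The Modelfile above points at `./unsloth.Q8_0.gguf`, while the only export line enabled in the script uses `quantization_method = "f16"`, so it may be worth confirming which file Ollama is actually loading. The following is a small sketch for that check; the function names are hypothetical, and it assumes the `FROM` argument is a path relative to the Modelfile's directory (Ollama also accepts a model name there, in which case the file check simply reports that no such file exists).

```python
from pathlib import Path


def modelfile_from_target(modelfile_text):
    """Return the argument of the first FROM instruction, or None if absent."""
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 1)
        # Modelfile instructions are not case sensitive, so compare upper-cased.
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1].strip()
    return None


def check_modelfile(modelfile_path):
    """Return (FROM target, True if it resolves to an existing file)."""
    path = Path(modelfile_path)
    target = modelfile_from_target(path.read_text(encoding="utf-8"))
    if target is None:
        return None, False
    return target, (path.parent / target).is_file()
```

For example, `check_modelfile("model/Modelfile")` returns the `FROM` target together with a flag saying whether that file exists next to the Modelfile, which makes it easy to spot a stale GGUF left over from an earlier export.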
