- Did you update? `pip install --upgrade unsloth unsloth_zoo`
- `Colab` or `Kaggle` or local / cloud
- Number GPUs used, use `nvidia-smi`
- Which notebook?
- Paste `Unsloth` printout with 🦥 sloth emoji
- Which trainer? `SFTTrainer`, `GRPOTrainer` etc
- Minimal code to reproduce error. Remove Hugging Face token!
You can also join our Discord: https://discord.com/invite/unsloth
Have you tried visiting our Docs? https://docs.unsloth.ai/basics/errors-troubleshooting
I trained a llama model using the following script:
`python
#!/usr/bin/env python
# coding: utf-8
To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!


Join Discord if you need help + ⭐ Star us on Github ⭐
To install Unsloth on your own computer, follow the installation instructions on our Github page here.
### News
Unsloth now supports Text-to-Speech (TTS) models. Read our guide here.
Read our Qwen3 Guide and check out our new Dynamic 2.0 quants which outperform other quantization methods!
### Installation
In[1]:
### Unsloth
In[2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = MODEL,
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
In[3]:
model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
### Data Prep
We now use the Alpaca dataset from vicgalle, which is a version of 52K of the original Alpaca dataset generated from GPT4. You can replace this code section with your own data prep.
In[4]:
from huggingface_hub import login
# Paste your Hugging Face token when prompted
login()
from datasets import load_dataset
dataset = load_dataset("alperenyildiz/r4vd_sft_dataset", split="train")
dataset = dataset.select(range(3))
print(dataset.column_names)
One issue is this dataset has multiple columns. For Ollama and llama.cpp to function like a custom ChatGPT Chatbot, we must only have 2 columns - an instruction and an output column.
In[5]:
print(dataset.column_names)
To solve this, we shall do the following:
* Merge all columns into 1 instruction prompt.
* Remember LLMs are text predictors, so we can customize the instruction to anything we like!
* Use the to_sharegpt function to do this column merging process!
For example below in our Titanic CSV finetuning notebook, we merged multiple columns in 1 prompt:

To merge multiple columns into 1, use merged_prompt.
* Enclose all columns in curly braces {}.
* Optional text must be enclosed in [[]]. For example, if the column "Pclass" is empty, the merging function will not show the text and will skip it. This is useful for datasets with missing values.
* You can select every column, or a few!
* Select the output or target / prediction column in output_column_name. For the Alpaca dataset, this will be output.
To make the finetune handle multiple turns (like in ChatGPT), we have to create a "fake" dataset with multiple turns - we use conversation_extension to randomly select some conversations from the dataset, and pack them together into 1 conversation.
In[6]:
from unsloth import to_sharegpt
dataset = to_sharegpt(
dataset,
merged_prompt="{instruction}[[\nYour input is:\n{input}]]",
output_column_name="output",
conversation_extension=3, # Select more to handle longer conversations
)
Finally use standardize_sharegpt to fix up the dataset!
In[7]:
from unsloth import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
### Customizable Chat Templates
You also need to specify a chat template. Previously, you could use the Alpaca format as shown below.
In[8]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction:
{}
Input:
{}
Response:
{}"""
Now, you have to use {INPUT} for the instruction and {OUTPUT} for the response.
We also allow you to use an optional {SYSTEM} field. This is useful for Ollama when you want to use a custom system prompt (also like in ChatGPT).
You can also not put a {SYSTEM} field, and just put plain text.
```python
chat_template = """{SYSTEM}
USER: {INPUT}
ASSISTANT: {OUTPUT}"""
```
Use below if you want to use the Llama-3 prompt format. You must use the instruct and not the base model if you use this!
```python
chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>
{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{OUTPUT}<|eot_id|>"""
```
For the ChatML format:
```python
chat_template = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""
```
The issue is the Alpaca format has 3 fields, whilst OpenAI style chatbots must only use 2 fields (instruction and response). That's why we used the to_sharegpt function to merge these columns into 1.
In[9]:
chat_template = """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.
Instruction:
{INPUT}
Response:
{OUTPUT}"""
from unsloth import apply_chat_template
dataset = apply_chat_template(
dataset,
tokenizer=tokenizer,
chat_template=chat_template,
# default_system_message = "You are a helpful assistant", << [OPTIONAL]
)
### Train the model
Now let's use Huggingface TRL's SFTTrainer! More docs here: TRL SFT docs. We do 60 steps to speed things up, but you can set num_train_epochs=1 for a full run, and turn off max_steps=None. We also support TRL's DPOTrainer!
In[10]:
import wandb
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
# Initialize WandB
wandb.init(project="r4vd_training", name="sft_train")
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
num_train_epochs = 1,
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
save_strategy = 'no',
output_dir = "outputs",
report_to = "wandb", # Now logs to WandB
),
)
In[11]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
In[12]:
trainer_stats = trainer.train()
In[13]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
### Inference
Let's run the model! Unsloth makes inference natively 2x faster as well! You should use prompts which are similar to the ones you had finetuned on, otherwise you might get bad results!
In[14]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [ # Change below!
{"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
Since we created an actual chatbot, you can also do longer conversations by manually adding alternating conversations between the user and assistant!
In[15]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [ # Change below!
{"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8"},
{"role": "assistant", "content": "The fibonacci sequence continues as 13, 21, 34, 55 and 89."},
{"role": "user", "content": "What is France's tallest tower called?"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.
[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
In[16]:
model.save_pretrained("model") # Local saving
tokenizer.save_pretrained("model")
model.push_to_hub("your_name/lora_model", token = "...") # Online saving
tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
Now if you want to load the LoRA adapters we just saved for inference, set False to True:
In[17]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass
messages = [ # Change below!
{"role": "user", "content": "Describe anything special about a sequence. Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
You can also use Hugging Face's AutoPeftModelForCausalLM. Only use this if you do not have unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.
In[18]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")
### Ollama Support
Unsloth now allows you to automatically finetune and create a Modelfile, and export to Ollama! This makes finetuning much easier and provides a seamless workflow from Unsloth to Ollama!
Let's first install Ollama!
In[ ]:
Next, we shall save the model to GGUF / llama.cpp
We clone llama.cpp and we default save it to q8_0. We allow all methods like q4_k_m. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our Wiki page):
* q8_0 - Fast conversion. High resource use, but generally acceptable.
* q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
We also support saving to multiple GGUF options in a list fashion! This can speed things up by 10 minutes or more if you want multiple export formats!
In[ ]:
from transformers import AutoTokenizer
base_tok = AutoTokenizer.from_pretrained(MODEL) # use the exact base model name
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
We use subprocess to start Ollama up in a non blocking fashion! In your own desktop, you can simply open up a new terminal and type ollama serve, but in Colab, we have to use this hack!
In[ ]:
import subprocess
import time
subprocess.Popen(["ollama", "serve"])
time.sleep(3) # Wait for a few seconds for Ollama to load!
Ollama needs a Modelfile, which specifies the model's prompt format. Let's print Unsloth's auto generated one:
In[ ]:
print(tokenizer._ollama_modelfile)
We now will create an Ollama model called unsloth_model using the Modelfile which we auto generated!
In[ ]:
get_ipython().system('~/ollama/bin/ollama create unsloth_model -f ./model/Modelfile')
And now we can do inference on it via Ollama!
You can also upload to Ollama and try the Ollama Desktop app by heading to https://www.ollama.com/
In[ ]:
get_ipython().system("""curl http://localhost:11434/api/chat -d '{ "model": "unsloth_model", "messages": [ { "role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8," } ] }'""")
# ChatGPT interactive mode
### ⭐ To run the finetuned model like in a ChatGPT style interface, first click the | >_ | button.

### ⭐ Then, type ollama run unsloth_model

### ⭐ And you have a ChatGPT style assistant!
### Type any question you like and press ENTER. If you want to exit, hit CTRL + D
You can also use the model-unsloth.gguf file or model-unsloth-Q4_K_M.gguf file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan here and Open WebUI here
And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
Some other links:
1. Train your own reasoning model - Llama GRPO notebook Free Colab
2. Saving finetunes to Ollama. Free notebook
3. Llama 3.2 Vision finetuning - Radiography use case. Free Colab
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!



Join Discord if you need help + ⭐️ Star us on Github ⭐️
`
Based on one of the official notebooks.
However when I generate the gguf file and try to run it with Ollama, the model does get created but I cannot run it and I am met with the following error:
Error: llama runner process has terminated: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file
llama_load_model_from_file: failed to load model
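The error points at the tokenizer metadata stored inside the GGUF file, so it may help to look at which metadata keys the exported file actually contains. Below is a minimal diagnostic sketch, assuming the GGUF v2/v3 header layout (little-endian, 64-bit counts) and that BPE merges are stored under the `tokenizer.ggml.merges` key. The helper names `gguf_metadata_keys` and `has_tokenizer_merges` are hypothetical and are not part of Unsloth, llama.cpp, or Ollama.

```python
import struct

# GGUF scalar value types mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from the file."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _skip_value(f, vtype):
    """Advance the file position past one metadata value."""
    if vtype in _SCALARS:
        f.seek(struct.calcsize(_SCALARS[vtype]), 1)
    elif vtype == _STRING:
        f.seek(_read(f, "<Q"), 1)  # uint64 length, then the bytes
    elif vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        if elem_type in _SCALARS:
            f.seek(count * struct.calcsize(_SCALARS[elem_type]), 1)
        else:
            for _ in range(count):
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def gguf_metadata_keys(path):
    """Return the metadata key names of a GGUF file, in file order."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _read(f, "<I")  # format version (assumed 2 or 3)
        _read(f, "<Q")  # tensor count, not needed here
        kv_count = _read(f, "<Q")
        keys = []
        for _ in range(kv_count):
            key_len = _read(f, "<Q")
            keys.append(f.read(key_len).decode("utf-8"))
            _skip_value(f, _read(f, "<I"))
        return keys


def has_tokenizer_merges(path):
    """True if the file carries the key assumed to hold the BPE merges."""
    return "tokenizer.ggml.merges" in gguf_metadata_keys(path)
```

Calling `has_tokenizer_merges("model/unsloth.Q8_0.gguf")` (the path is an assumption based on the Modelfile location and its `FROM` line) would show whether the merges were written during export. If it returns `False`, the problem is more likely in the GGUF conversion step than in Ollama itself.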
This is the Modelfile
`FROM ./unsloth.Q8_0.gguf
TEMPLATE """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.{{ if .Prompt }}
Instruction:
{{ .Prompt }}{{ end }}
Response:
{{ .Response }}<|end_of_text|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_"
PARAMETER temperature 1.5
PARAMETER min_p 0.1
`
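One detail that may or may not matter: the script above saves with `quantization_method = "f16"`, while this Modelfile points at `./unsloth.Q8_0.gguf`, so the referenced file could be left over from an earlier run. Here is a small sketch for checking which file the `FROM` line resolves to, assuming relative paths are taken relative to the Modelfile's own directory. `modelfile_from_path` and `resolve_gguf` are hypothetical helper names, not Ollama or Unsloth APIs.

```python
from pathlib import Path


def modelfile_from_path(modelfile_text):
    """Return the argument of the first FROM instruction, or None."""
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1].strip()
    return None


def resolve_gguf(modelfile):
    """Resolve the FROM target relative to the Modelfile's directory."""
    modelfile = Path(modelfile)
    target = modelfile_from_path(modelfile.read_text(encoding="utf-8"))
    if target is None:
        return None
    return (modelfile.parent / target).resolve()
```

For example, `resolve_gguf("./model/Modelfile")` returns the path Ollama would be pointed at. Checking `.exists()` and the modification time on that path shows whether it is the file produced by the latest export.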