{"id":39861,"date":"2025-02-10T06:00:00","date_gmt":"2025-02-10T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=39861"},"modified":"2025-06-02T19:48:20","modified_gmt":"2025-06-02T14:18:20","slug":"unsloth-getting-started","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/unsloth-getting-started\/","title":{"rendered":"Unsloth &#8211; Getting Started"},"content":{"rendered":"\n<p><strong>Unsloth<\/strong> has become synonymous with easy fine-tuning and faster inference of LLMs with fewer hardware requirements. From training LLMs to converting them into various formats, Unsloth offers a host of functionalities.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth_getting_started_demo.gif\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"575\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth_getting_started_demo.gif\" alt=\"Unsloth text streaming demo.\" class=\"wp-image-39958\"\/><\/a><figcaption class=\"wp-element-caption\">Figure 1. 
Unsloth text streaming demo.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>This article will cover some of the most important aspects of starting with Unsloth.<\/p>\n\n\n\n<p><strong><em>What will we cover while getting started with Unsloth?<\/em><\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>What is Unsloth?<\/em><\/li>\n\n\n\n<li><em>Why do we need Unsloth?<\/em><\/li>\n\n\n\n<li><em>How to install Unsloth on Ubuntu?<\/em><\/li>\n\n\n\n<li><em>How do we run inference with various LLMs?<\/em><\/li>\n\n\n\n<li><em>What is the ShareGPT chat template format and why do we need it?<\/em><\/li>\n\n\n\n<li><em>How to figure out chat template values for different models?<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Unsloth?<\/h2>\n\n\n\n<p>Unsloth is a deep learning library that provides optimized inference and fine-tuning of LLMs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-logo-1.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"898\" height=\"315\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-logo-1.png\" alt=\"Unsloth logo.\" class=\"wp-image-39963\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-logo-1.png 898w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-logo-1-300x105.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-logo-1-768x269.png 768w\" sizes=\"auto, (max-width: 898px) 100vw, 898px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. 
Unsloth logo (source: https:\/\/github.com\/unslothai\/unsloth?tab=readme-ov-file)<\/figcaption><\/figure>\n\n\n\n<p>The project was started by Daniel Han and Michael Han and now has several contributors on its GitHub repository.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why Do We Need Unsloth?<\/h2>\n\n\n\n<p>Unsloth allows <strong>inference and<\/strong> <strong>fine-tuning of LLMs with low GPU VRAM<\/strong> requirements, making them extremely accessible. We can <strong>train an 8B parameter LLM on a 10GB VRAM<\/strong> system. Additionally, it provides various model weight export options for Ollama, vLLM, and GGUF. Furthermore, it is highly compatible with the Hugging Face Transformers library, making it versatile.<\/p>\n\n\n\n<p>Naive fine-tuning of LLMs is expensive and compute-intensive. We need tens of gigabytes of VRAM (often surpassing hundreds of gigabytes) for full fine-tuning of LLMs. Although <a href=\"https:\/\/debuggercafe.com\/fine-tuning-phi-1-5-using-qlora\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>fine-tuning with QLoRA<\/strong><\/a> and LoRA mitigates some of the issues, it is only practical for models up to about 3B parameters on a system with 10GB or 16GB of VRAM. Larger models often require more VRAM even with QLoRA training.<\/p>\n\n\n\n<p>This is where Unsloth shines. It makes training models up to 8B parameters (like the recent Llama 3.1 and its variants) a breeze even with 10GB VRAM. Unsloth can reduce memory requirements by up to 70% while providing 2x faster inference with its optimized pipeline. 
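<\/p>\n\n\n\n<p>As a back-of-the-envelope check on weight storage alone (rough arithmetic that ignores activations, the KV cache, and quantization overhead), moving from 16-bit to 4-bit weights shrinks an 8B parameter model by roughly 4x:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Approximate on-disk\/in-memory size of the weights alone.\nparams = 8e9  # 8B parameters\nfp16_gb = params * 2 \/ 1e9    # 2 bytes per weight in 16-bit\nint4_gb = params * 0.5 \/ 1e9  # 0.5 bytes per weight in 4-bit\nprint(fp16_gb, int4_gb)  # 16.0 4.0<\/pre>\n\n\n\n<p>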
They also provide 4-bit quantized models, so the files we download are a fraction of the size of the full-precision weights.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-llama2-benchmark.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"846\" height=\"283\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-llama2-benchmark.png\" alt=\"Unsloth Llama-2 benchmark and comparison with Hugging Face and Flash Attention 2.\" class=\"wp-image-39967\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-llama2-benchmark.png 846w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-llama2-benchmark-300x100.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-llama2-benchmark-768x257.png 768w\" sizes=\"auto, (max-width: 846px) 100vw, 846px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. Unsloth Llama-2 benchmark and comparison with Hugging Face and Flash Attention 2.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Figure 3 shows the speedup and GPU memory saving percentage when using Unsloth compared to naive Hugging Face and Flash Attention 2 implementations. <\/p>\n\n\n\n<p>Unsloth lets developers get started with the least friction possible. 
Their support for the Tesla T4 GPU ensures wide use across Colab and Kaggle notebooks.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-tesla-t4-benchmark.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"846\" height=\"1116\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-tesla-t4-benchmark.png\" alt=\"Unsloth training benchmark and comparison across 1 and 2 Tesla T4 GPUs.\" class=\"wp-image-39975\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-tesla-t4-benchmark.png 846w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-tesla-t4-benchmark-227x300.png 227w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-tesla-t4-benchmark-768x1013.png 768w\" sizes=\"auto, (max-width: 846px) 100vw, 846px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. 
Unsloth training benchmark and comparison across 1 and 2 Tesla T4 GPUs.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Figure 4 shows the number of hours and peak GPU memory required when training with Unsloth on various datasets with 1 and 2 Tesla T4 GPUs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What are the Key Features of Unsloth?<\/h3>\n\n\n\n<p>The following are some of the key features of Unsloth:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Unsloth&#8217;s kernels are written in OpenAI&#8217;s Triton language with a manual backpropagation engine.<\/li>\n\n\n\n<li>It supports all modern NVIDIA GPUs from 2018 onward, including the Tesla T4, which makes it easy to fine-tune models on Colab.<\/li>\n\n\n\n<li>Linux setup is easy.<\/li>\n\n\n\n<li>Supports 4-bit and 16-bit fine-tuning of LLMs via QLoRA and LoRA.<\/li>\n\n\n\n<li>It is up to 5x faster than the naive Hugging Face + Flash Attention implementation.<\/li>\n\n\n\n<li>There is no loss in accuracy when fine-tuning LLMs with all the optimizations that Unsloth provides.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Unsloth supports all the Llama models, their derivatives, and other major model families as well. You can check the <strong><a href=\"https:\/\/docs.unsloth.ai\/get-started\/all-our-models\" target=\"_blank\" rel=\"noreferrer noopener\">supported models here<\/a>.<\/strong><\/p>\n\n\n\n<p>Additionally, they provide <strong><a href=\"https:\/\/huggingface.co\/unsloth\" target=\"_blank\" rel=\"noreferrer noopener\">4-bit quantized models<\/a><\/strong> directly for download via Hugging Face, saving both time and storage.<\/p>\n\n\n\n<p>Apart from the open-source library, they also provide paid services. 
These include even faster training and inference, more GPU memory saving, multi-GPU support, and full fine-tuning of LLMs.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-pro-enterprise.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"1466\" height=\"807\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-pro-enterprise.png\" alt=\"Unsloth Pro and Enterprise features.\" class=\"wp-image-39982\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-pro-enterprise.png 1466w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-pro-enterprise-300x165.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/unsloth-pro-enterprise-768x423.png 768w\" sizes=\"auto, (max-width: 1466px) 100vw, 1466px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 5. Unsloth Pro and Enterprise features.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">How to Install Unsloth on Ubuntu?<\/h2>\n\n\n\n<p>We will cover the installation on Ubuntu which is quite straightforward. 
It is highly recommended that you use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">conda<\/code> package manager which makes the installation even easier.<\/p>\n\n\n\n<p>The first step is to create a new Anaconda environment with PyTorch and xformers, and activate the environment.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda create --name unsloth_env \\\n    python=3.11 \\\n    pytorch-cuda=12.1 \\\n    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \\\n    -y<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda activate unsloth_env<\/pre>\n\n\n\n<p>Then install Unsloth and the rest of the requirements including the Hugging Face libraries.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install \"unsloth[colab-new] @ git+https:\/\/github.com\/unslothai\/unsloth.git\"\npip install --no-deps trl peft accelerate bitsandbytes<\/pre>\n\n\n\n<p>If you wish to install it on Windows or via pip, please follow the <strong><a href=\"https:\/\/github.com\/unslothai\/unsloth?tab=readme-ov-file#-installation-instructions\" target=\"_blank\" rel=\"noreferrer noopener\">instructions here<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Directory Structure<\/h2>\n\n\n\n<p>Let&#8217;s take a look at the directory structure for all the code that we will cover here.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 unsloth_gemma2_2b.ipynb\n\u251c\u2500\u2500 unsloth_llama_3_1_8b.ipynb\n\u2514\u2500\u2500 unsloth_mistralv03.ipynb<\/pre>\n\n\n\n<p>The directory contains three Jupyter Notebooks covering the inference of various models.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>All the Jupyter Notebooks are available for download via the Download Code section.<\/em><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button is-style-outline center\"><a data-sumome-listbuilder-id=\"e6b9333c-0942-41e4-8967-34d243b9ff85\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">How to Run Inference Using Unsloth?<\/h2>\n\n\n\n<p>In this section, first, we will cover inference using LLMs with some of the important aspects of Chat Templates.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Inference Using Llama 3.1 8B<\/h3>\n\n\n\n<p>We will start with the inference using the Llama 3.1 8B model. The code for this resides in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">unsloth_llama_3_1_8b.ipynb<\/code> file. 
All notebooks contain setup steps for easily installing the libraries and running the code on Colab.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Importing the Necessary Libraries<\/h4>\n\n\n\n<p>Let&#8217;s start with the import statements.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_1\">from unsloth import FastLanguageModel\nfrom transformers import TextStreamer\nfrom unsloth.chat_templates import get_chat_template<\/pre>\n\n\n\n<p>Following are the functions and classes that we import:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">FastLanguageModel<\/code> from <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">unsloth<\/code>: We will use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">FastLanguageModel<\/code> class to load the pretrained Llama 3.1 model.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">TextStreamer<\/code> from <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code>: The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">TextStreamer<\/code> class allows us to stream the results as the tokens are generated by the LLM instead of waiting for the entire result. 
<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">get_chat_template<\/code>: This function allows us to load the chat template for a particular model and also map the special chat template tokens to another format (more on this later).<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Load the Model<\/h4>\n\n\n\n<p>Now, let&#8217;s load the Llama 3.1 8B model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"4\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_2\">model, tokenizer = FastLanguageModel.from_pretrained(\n    model_name = 'unsloth\/Meta-Llama-3.1-8B-Instruct-bnb-4bit',\n    max_seq_length=8192,\n    # load_in_4bit=True # Not needed when loading 4-bit models directly.\n)<\/pre>\n\n\n\n<p>Just like the Transformers library, we use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">from_pretrained<\/code> method above to load the model. We can also specify a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">max_seq_length<\/code> although that may not be necessary during inference.<\/p>\n\n\n\n<p>One important aspect is the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">load_in_4bit<\/code> argument. We have commented that out as we are specifically loading the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">unsloth\/Meta-Llama-3.1-8B-Instruct-bnb-4bit<\/code> which has already been quantized to 4-bit. We can also download the full precision model from Unsloth and pass <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">load_in_4bit=True<\/code>. 
However, the current method allows the download of smaller files as the weights have been reduced by almost 4x.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Tokenizer and Template Mapping<\/h4>\n\n\n\n<p>In the next code block, we load the chat template, which also returns the updated tokenizer.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"9\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_3\">tokenizer = get_chat_template(\n    tokenizer,\n    chat_template = 'llama-3.1',\n    mapping = {'role' : 'from', 'content' : 'value', 'user' : 'human', 'assistant' : 'gpt'}, # ShareGPT style\n)<\/pre>\n\n\n\n<p>A few important things are happening in the above code block.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The function takes the existing tokenizer and returns it updated with the chat template.<\/li>\n\n\n\n<li>The&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_template<\/code>&nbsp;argument accepts the model version\/name.<\/li>\n\n\n\n<li>We map the default Llama 3.1 message keys and role names to ShareGPT-style ones.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>The mapping does the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">role<\/code>,&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">content<\/code>,&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">user<\/code>, and&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">assistant<\/code>&nbsp;keys and role names are the Llama 3.1 defaults. 
Check the&nbsp;<a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B-Instruct\/blob\/0e9e39f249a16976918f6564b8830bc894c89659\/tokenizer_config.json#L2053\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>tokenizer_config.json<\/strong><\/a>&nbsp;of Llama 3.1.<\/li>\n\n\n\n<li>We replace them with&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">from<\/code>,&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">value<\/code>,&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">human<\/code>, and&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">gpt<\/code> respectively.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>The above mapping allows us to switch between models without worrying about managing history with different template tokens. We just need to take a look at the chat template of each model once to modify the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">mapping<\/code> argument, then the rest of the code (as we will see later) remains the same.<\/p>\n\n\n\n<p>For example, this is how a conversation thread in ShareGPT style might look:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[\n    [{'from': 'human', 'value': 'Hi there!'},\n     {'from': 'gpt', 'value': 'Hi how can I help?'},\n     {'from': 'human', 'value': 'What is 2+2?'}],\n    [{'from': 'human', 'value': \"What's your name?\"},\n     {'from': 'gpt', 'value': \"I'm Daniel!\"},\n     {'from': 'human', 'value': 'Ok! Nice!'},\n     {'from': 'gpt', 'value': 'What can I do for you?'},\n     {'from': 'human', 'value': 'Oh nothing :)'}],\n]<\/pre>\n\n\n\n<p>Before carrying out inference, we need to do one more optimization. 
We can apply the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">for_inference<\/code> method to enable 2x faster inference using Unsloth.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"14\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_4\"># Enable native 2x faster inference\nFastLanguageModel.for_inference(model)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Carrying Out Inference<\/h4>\n\n\n\n<p>In the next code block, we use the ShareGPT style template to apply the chat template and get the input tokens. Because we have mapped the tokens above, the following will remain the same no matter the model we choose from Unsloth.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"16\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_5\">messages = [\n    {'from': 'human', 'value': 'Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,'},\n]\ninputs = tokenizer.apply_chat_template(\n    messages, tokenize=True, add_generation_prompt=True, return_tensors='pt'\n).to('cuda')<\/pre>\n\n\n\n<p>Next, we initialize the text streamer and pass the input tokens through the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"22\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_6\">text_streamer = TextStreamer(tokenizer)\n_ = model.generate(\n    input_ids=inputs, streamer=text_streamer, max_new_tokens=1024, use_cache=True\n)<\/pre>\n\n\n\n<p>Following is the output that we get.<\/p>\n\n\n\n<pre 
class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;|begin_of_text|>&lt;|start_header_id|>system&lt;|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n&lt;|eot_id|>&lt;|start_header_id|>human&lt;|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,&lt;|eot_id|>&lt;|start_header_id|>assistant&lt;|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers:\n\n1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,...&lt;|eot_id|><\/pre>\n\n\n\n<p>The interesting point to observe here is that Llama 3.1 still works with its own chat template tokens internally because that&#8217;s what it has been trained on. The mapping only helps to maintain a consistent chat template on the user level, and internally, these tokens are mapped to the original tokenizer.<\/p>\n\n\n\n<p>Of course, we can use the original chat template tokens according to the model. But think about the case when we try to create a chat application with options for multiple models. It will be challenging to manage history when switching between different models because the&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">messages<\/code>&nbsp;list will have to contain the original chat template tokens.<\/p>\n\n\n\n<p>Instead, just knowing and replacing the proper chat template tokens once when calling&nbsp;<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">get_chat_template<\/code>&nbsp;makes the rest of the pipeline streamlined.<\/p>\n\n\n\n<p>To get an idea of what a multi-turn chat with a system prompt might look like, we have the next chat example. 
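<\/p>\n\n\n\n<p>To build intuition for what this key mapping amounts to, here is a plain-Python sketch (for illustration only; this is not Unsloth&#8217;s internal code) that converts a ShareGPT-style thread into the standard role\/content message format:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Toy sketch, not Unsloth's internal implementation.\n# Convert ShareGPT-style messages to the standard role\/content format.\nSHAREGPT_TO_STANDARD = {'system': 'system', 'human': 'user', 'gpt': 'assistant'}\n\ndef to_standard(messages):\n    return [\n        {'role': SHAREGPT_TO_STANDARD[m['from']], 'content': m['value']}\n        for m in messages\n    ]\n\nthread = [\n    {'from': 'human', 'value': 'Hi there!'},\n    {'from': 'gpt', 'value': 'Hi, how can I help?'},\n]\nprint(to_standard(thread))\n# [{'role': 'user', 'content': 'Hi there!'}, {'role': 'assistant', 'content': 'Hi, how can I help?'}]<\/pre>\n\n\n\n<p>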
For the system prompt, we can set the value of the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">from<\/code> key as <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">system<\/code>. <\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"26\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_7\"># Let's try with some system prompt.\nmessages = [\n    {\"from\": \"system\", \"value\": \"You are a pro gamer. You answer everything concisely touching upon the most important points. You mostly play sim racing.\"},\n    {\"from\": \"human\", \"value\": \"Who are you?\"},\n]\n\ninputs = tokenizer.apply_chat_template(\n    messages, tokenize=True, add_generation_prompt=True, return_tensors='pt'\n).to('cuda')\n\n_ = model.generate(\n    input_ids=inputs, streamer=text_streamer, max_new_tokens=1024, use_cache=True\n)<\/pre>\n\n\n\n<p>We get the following output.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;|begin_of_text|>&lt;|start_header_id|>system&lt;|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a pro gamer. You answer everything concisely touching upon the most important points. You mostly play sim racing.&lt;|eot_id|>&lt;|start_header_id|>human&lt;|end_header_id|>\n\nWho are you?&lt;|eot_id|>&lt;|start_header_id|>assistant&lt;|end_header_id|>\n\nI'm Vroom, a pro sim racer. 
I compete in iRacing and Assetto Corsa Competizione.&lt;|eot_id|><\/pre>\n\n\n\n<p>As expected, the output of the model aligns with the system prompt.<\/p>\n\n\n\n<p>For a final example, let&#8217;s try a multi-turn template.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"39\" data-enlighter-title=\"unsloth_llama_3_1_8b.ipynb\" data-enlighter-group=\"unsloth_llama_3_1_8b_8\"># Let's try with some continuous chat.\nmessages = [\n    {\"from\": \"system\", \"value\": \"You are a pro gamer. You answer everything concisely touching upon the most important points. You mostly play sim racing.\"},\n    {\"from\": \"human\", \"value\": \"Who are you?\"},\n    {\"from\": \"gpt\", \"value\": \"I'm TurboToni. I'm a sim racing pro, specializing in iRacing and Project Cars.\"},\n    {\"from\": \"human\", \"value\": \"Right. I am having difficulty in playing F1 22 with a PS5 controller. What can I do? Also what about Elden Ring? (Back to sim racing) Have you tried the new Monaco track in F1 22?\"},\n]\n\ninputs = tokenizer.apply_chat_template(\n    messages, tokenize=True, add_generation_prompt=True, return_tensors='pt'\n).to('cuda')\n\n_ = model.generate(\n    input_ids=inputs, streamer=text_streamer, max_new_tokens=1024, use_cache=True\n)<\/pre>\n\n\n\n<p>The output aligns with the system prompt.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;|begin_of_text|>&lt;|start_header_id|>system&lt;|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a pro gamer. You answer everything concisely touching upon the most important points. 
You mostly play sim racing.&lt;|eot_id|>&lt;|start_header_id|>human&lt;|end_header_id|>\n\nWho are you?&lt;|eot_id|>&lt;|start_header_id|>gpt&lt;|end_header_id|>\n\nI'm TurboToni. I'm a sim racing pro, specializing in iRacing and Project Cars.&lt;|eot_id|>&lt;|start_header_id|>human&lt;|end_header_id|>\n\nRight. I am having difficulty in playing F1 22 with a PS5 controller. What can I do? Also what about Elden Ring? (Back to sim racing) Have you tried the new Monaco track in F1 22?&lt;|eot_id|>&lt;|start_header_id|>assistant&lt;|end_header_id|>\n\nFor F1 22 on PS5 controller, try adjusting the sensitivity and dead zone in the game's settings. You can also use the PS5's built-in controller settings to customize the stick and button responsiveness.\n\nAs for Elden Ring, I'm not a fan of action games, but I've heard it's a great experience. I'll stick to sim racing, though.\n\nThe new Monaco track in F1 22 is a great addition! It's a challenging circuit, and the new layout adds some excitement. 
I've spent a few hours racing there, and it's become one of my favorite tracks.&lt;|eot_id|><\/pre>\n\n\n\n<p>With the above examples, we get a clear idea of two tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loading a 4-bit quantized model with Unsloth.<\/li>\n\n\n\n<li>Running inference with the ShareGPT chat template for a more streamlined developer experience.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Loading Different Chat Models with ShareGPT Chat Template<\/h3>\n\n\n\n<p>Earlier, we discussed how we can load any model while our chat conversation template remains the same.<\/p>\n\n\n\n<p>For a hands-on example, the following code block shows loading the Mistral v0.3 chat model and its chat template.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">model, tokenizer = FastLanguageModel.from_pretrained(\n    model_name = 'unsloth\/mistral-7b-instruct-v0.3-bnb-4bit',\n    max_seq_length=8192,\n    # load_in_4bit=True # Not needed when loading 4-bit models directly.\n)\n\ntokenizer = get_chat_template(\n    tokenizer,\n    chat_template='mistral',\n    mapping={'role' : 'from', 'content' : 'value', 'user' : 'human', 'assistant' : 'gpt'}, # ShareGPT style\n)<\/pre>\n\n\n\n<p>When calling the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">get_chat_template<\/code> function, we pass the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_template<\/code> value as <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">mistral<\/code>. The mapping stays the same as for Llama 3.1, so we do not need to make any other changes. 
After the above, the rest of the inference code remains the same.<\/p>\n\n\n\n<p>Similarly, we can load the Gemma2-2B instruction-tuned model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">model, tokenizer = FastLanguageModel.from_pretrained(\n    model_name = 'unsloth\/gemma-2-2b-it-bnb-4bit',\n    max_seq_length=8192,\n    # load_in_4bit=True # Not needed when loading 4-bit models directly.\n)\n\ntokenizer = get_chat_template(\n    tokenizer,\n    chat_template='gemma2',\n    mapping={'role' : 'from', 'content' : 'value', 'user' : 'human', 'assistant' : 'gpt'}, # ShareGPT style\n)<\/pre>\n\n\n\n<p>The only change here is that the value of the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_template<\/code> argument is <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">gemma2<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Figuring Out chat_template Argument Value According to the Model<\/h3>\n\n\n\n<p>Unsloth provides several pretrained and chat models. 
It might be difficult to figure out all the possible values for the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_template<\/code> argument.<\/p>\n\n\n\n<p>For example, in the code above:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For Llama 3.1, the value was <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">llama-3.1<\/code>.<\/li>\n\n\n\n<li>For Gemma2, the value was <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">gemma2<\/code>.<\/li>\n\n\n\n<li>However, for Mistral v-0.3, the value was <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">mistral<\/code> without any version information.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>One solution is visiting the <strong><a href=\"https:\/\/docs.unsloth.ai\/basics\/chat-templates\" target=\"_blank\" rel=\"noreferrer noopener\">chat template documentation<\/a><\/strong> of Unsloth. It lists the key argument values for the supported models. However, it might not be updated in time for new models.<\/p>\n\n\n\n<p><strong><em>The best solution to find the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_template<\/code> argument values<\/em><\/strong> is to visit the <strong><a href=\"https:\/\/github.com\/unslothai\/unsloth\/blob\/main\/unsloth\/chat_templates.py\" target=\"_blank\" rel=\"noreferrer noopener\">source code file<\/a><\/strong>.<\/p>\n\n\n\n<p>Here, searching for <strong>CHAT_TEMPLATES[&#8220;<\/strong> in the browser will give you all the possible values for each model. For instance, Llama 3.1 accepts both <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">llama-3.1<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">llama-31<\/code> as the chat template values. 
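<\/p>\n\n\n\n<p>The same lookup can also be scripted. Below is a minimal sketch that scans the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">chat_templates.py<\/code> source with a regular expression; the sample string stands in for the real file contents, so point it at a downloaded copy to list every registered key:<\/p>

```python
import re

# Sketch: extract registered chat-template names from Unsloth's
# chat_templates.py. The sample string below stands in for the real file;
# replace it with open('chat_templates.py').read() on a local copy.
source = '''
CHAT_TEMPLATES["llama-3.1"] = (llama31_template, ...)
CHAT_TEMPLATES["llama-31"] = (llama31_template, ...)
CHAT_TEMPLATES["qwen-2.5"] = (qwen25_template, ...)
'''

# Collect the unique keys registered in CHAT_TEMPLATES["..."] assignments.
keys = sorted(set(re.findall(r'CHAT_TEMPLATES\["([^"]+)"\]', source)))
print(keys)
# ['llama-3.1', 'llama-31', 'qwen-2.5']
```

<p>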
Similarly, Qwen-2.5 accepts <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">qwen-2.5<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">qwen-25<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">qwen25<\/code>, and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">qwen2.5<\/code> as the values.<\/p>\n\n\n\n<p>The above shows that the authors of Unsloth favor developer experience and provide flexibility in getting a solution up and running in minimal time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary and Conclusion<\/h2>\n\n\n\n<p>In this article, we covered how to get started with LLM inference using Unsloth. We started with the discussion of the need for Unsloth, how to set it up, the steps for carrying out inference, and finally the different chat templates for the models. I hope this article was worth your time.<\/p>\n\n\n\n<p>If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong>, and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article covers an introduction to the Unsloth LLM library. 
It covers the need for Unsloth, the steps to install it, running inference using various language models like  Llama 3.1, Gemma2, and Mistral v-0.3, and also understanding the chat templates.<\/p>\n","protected":false},"author":1,"featured_media":39988,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[59,819,409,1339],"tags":[1137,1138,1141,1139,1140,1135,1210,1136],"class_list":["post-39861","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deep-learning","category-llms","category-nlp","category-unsloth","tag-unsloth-chat-template","tag-unsloth-chat_template","tag-unsloth-colab","tag-unsloth-fastlanguagemodel","tag-unsloth-get_chat_template","tag-unsloth-inference","tag-unsloth-inference-only","tag-unsloth-llms"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Unsloth - Getting Started<\/title>\n<meta name=\"description\" content=\"Unsloth provides memory efficient and fast inference &amp; training of LLMs with support for several models like Meta Llama, Google Gemma, &amp; Phi.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Unsloth - Getting Started\" \/>\n<meta property=\"og:description\" content=\"Unsloth provides memory efficient and fast inference &amp; training of LLMs with support for several models like Meta Llama, Google Gemma, &amp; Phi.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\" \/>\n<meta property=\"og:site_name\" content=\"DebuggerCafe\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/profile.php?id=100013731104496\" \/>\n<meta property=\"article:published_time\" content=\"2025-02-10T00:30:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-02T14:18:20+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sovit Ranjan Rath\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"Unsloth &#8211; Getting 
Started\",\"datePublished\":\"2025-02-10T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:18:20+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\"},\"wordCount\":1806,\"commentCount\":3,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png\",\"keywords\":[\"Unsloth Chat Template\",\"Unsloth chat_template\",\"Unsloth Colab\",\"Unsloth FastLanguageModel\",\"Unsloth get_chat_template\",\"Unsloth Inference\",\"Unsloth Inference Only\",\"Unsloth LLMs\"],\"articleSection\":[\"Deep Learning\",\"LLMs\",\"NLP\",\"Unsloth\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\",\"url\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\",\"name\":\"Unsloth - Getting Started\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png\",\"datePublished\":\"2025-02-10T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:18:20+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"Unsloth provides memory efficient and fast inference & training of LLMs with support for several models like Meta Llama, Google Gemma, & 
Phi.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/unsloth-getting-started\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png\",\"width\":1000,\"height\":563,\"caption\":\"Unsloth \u2013 Getting Started\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/unsloth-getting-started\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Unsloth &#8211; Getting Started\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan 
Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Unsloth - Getting Started","description":"Unsloth provides memory efficient and fast inference & training of LLMs with support for several models like Meta Llama, Google Gemma, & Phi.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/unsloth-getting-started\/","og_locale":"en_US","og_type":"article","og_title":"Unsloth - Getting Started","og_description":"Unsloth provides memory efficient and fast inference & training of LLMs with support for several models like Meta Llama, Google Gemma, & Phi.","og_url":"https:\/\/debuggercafe.com\/unsloth-getting-started\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2025-02-10T00:30:00+00:00","article_modified_time":"2025-06-02T14:18:20+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png","type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. 
reading time":"18 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"Unsloth &#8211; Getting Started","datePublished":"2025-02-10T00:30:00+00:00","dateModified":"2025-06-02T14:18:20+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/"},"wordCount":1806,"commentCount":3,"image":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png","keywords":["Unsloth Chat Template","Unsloth chat_template","Unsloth Colab","Unsloth FastLanguageModel","Unsloth get_chat_template","Unsloth Inference","Unsloth Inference Only","Unsloth LLMs"],"articleSection":["Deep Learning","LLMs","NLP","Unsloth"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/unsloth-getting-started\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/","url":"https:\/\/debuggercafe.com\/unsloth-getting-started\/","name":"Unsloth - Getting Started","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png","datePublished":"2025-02-10T00:30:00+00:00","dateModified":"2025-06-02T14:18:20+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"Unsloth provides memory 
efficient and fast inference & training of LLMs with support for several models like Meta Llama, Google Gemma, & Phi.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/unsloth-getting-started\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/12\/Unsloth-Getting-Started-e1735089890608.png","width":1000,"height":563,"caption":"Unsloth \u2013 Getting Started"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/unsloth-getting-started\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"Unsloth &#8211; Getting Started"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan 
Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/39861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=39861"}],"version-history":[{"count":138,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/39861\/revisions"}],"predecessor-version":[{"id":40007,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/39861\/revisions\/40007"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/39988"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=39861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=39861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=39861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}