Describe the bug
I'm following the steps in the https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb notebook to fine-tune the Llama 3.2 1B model to generate custom Python functions from my codebase. During inference (both in Colab and after converting to an Ollama GGUF), the generated output always ends with a few reserved tokens. For example: <|reserved_special_token_193|><|reserved_special_token_87|>.
Earlier, when I simply followed the steps in the example Colab notebook, the model kept generating without stopping. After I added an explicit step to append EOS_TOKEN, the continuous generation went away, but the output still ends with random reserved tokens. This is the only step that differs from the example Colab notebook:
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
print(f"Using EOS token: {tokenizer.eos_token} with id: {tokenizer.eos_token_id}")
print(f"Using PAD token: {tokenizer.pad_token} with id: {tokenizer.pad_token_id}")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    for convo in convos:
        text = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        # Manually ensure EOS token is present
        if not text.endswith(tokenizer.eos_token):
            text += tokenizer.eos_token
        texts.append(text)
    return {"text": texts}

dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
I have also confirmed that EOS_TOKEN and PAD_TOKEN are not the same.
Here's what I get when I print the dataset after applying the formatting_prompts_func to the conversations (which appears to be in line with the Colab example):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
--- Prompt example instruction ---
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```python
--- expected sample python code ---
```
<|eot_id|><|end_of_text|>
Any ideas what might be going on? As a last resort, I had to write a clean-up function that removes these tokens from the generated response.
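For reference, the clean-up workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact function I use: the name `strip_reserved_tokens` and the regular expression are assumptions, based only on the `<|reserved_special_token_N|>` pattern seen in the output.

```python
import re

# Assumed pattern: matches markers such as <|reserved_special_token_193|>.
RESERVED_TOKEN_RE = re.compile(r"<\|reserved_special_token_\d+\|>")


def strip_reserved_tokens(text: str) -> str:
    """Remove reserved special token markers from decoded model output.

    Hypothetical helper: it only post-processes the decoded string and does
    not address why the model emits these tokens in the first place.
    """
    cleaned = RESERVED_TOKEN_RE.sub("", text)
    # Drop trailing whitespace left behind once the markers are removed.
    return cleaned.rstrip()


if __name__ == "__main__":
    raw = "def add(a, b):\n    return a + b\n<|reserved_special_token_193|><|reserved_special_token_87|>"
    print(strip_reserved_tokens(raw))
```

This only hides the symptom after decoding; the fine-tuned model still produces the tokens.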
Environment Setup:
Dataset Details:
Custom private dataset that follows this pattern:
[{'content': '---sample prompt to generate a custom python function given some inputs---', 'role': 'user'}, {'content': '```python---sample code snippet for training---```\n', 'role': 'assistant'}]
Model Details:
unsloth/Llama-3.2-1B
Training Configuration:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Start training
trainer_stats = trainer.train()
Expected Behavior:
I would expect inference to generate just the desired Python output, without any trailing reserved tokens.
Actual Behavior:
Unwanted trailing reserved tokens, leading to junk output after the actual code snippet.
Additional notes:
None.