[Bug] Llama_3.2_1B_Conversational: Seeing multiple trailing <|reserved_special_token_xxx|> at inference time #2360

@smalhotra-spirent

Describe the bug
I'm following the steps in the https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb notebook to fine-tune the Llama 3.2 1B model to generate custom Python functions from my codebase. During inference (in Colab itself, and also after converting to GGUF for Ollama), the generated output always ends with a few reserved tokens. For example: <|reserved_special_token_193|><|reserved_special_token_87|>.

Earlier, when I followed the example Colab notebook exactly, generation never stopped. After I added an explicit step to append EOS_TOKEN, generation now terminates, but the output still ends with random reserved tokens. This is the only step that differs from the example Colab notebook:

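# Imports as used in the example notebook
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

tokenizer = get_chat_template(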
    tokenizer,
    chat_template = "llama-3.1",
)

print(f"Using EOS token: {tokenizer.eos_token} with id: {tokenizer.eos_token_id}")
print(f"Using PAD token: {tokenizer.pad_token} with id: {tokenizer.pad_token_id}")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    for convo in convos:
        text = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        # Manually ensure EOS token is present
        if not text.endswith(tokenizer.eos_token):
            text += tokenizer.eos_token
        texts.append(text)
    return {"text": texts}

dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

I have also confirmed that EOS_TOKEN and PAD_TOKEN are not the same.

Here's what I get when I print the dataset after applying the formatting_prompts_func to the conversations (which appears to be in line with the Colab example):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

--- Prompt example instruction ---

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```python
--- expected sample python code ---
```
<|eot_id|><|end_of_text|>
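
One way to double-check how a formatted example like the one above ends is to list its special tokens in order. This is only a minimal sketch, assuming the rendered text uses the `<|...|>` token syntax shown; `special_tokens_in` is a hypothetical helper and not part of Unsloth or transformers.

```python
import re

# Matches Llama-3-style special tokens such as <|eot_id|> or <|end_of_text|>.
SPECIAL_TOKEN_RE = re.compile(r"<\|[^|<>]+\|>")

def special_tokens_in(text: str) -> list[str]:
    """Return every <|...|> marker in `text`, in order of appearance."""
    return SPECIAL_TOKEN_RE.findall(text)

# Shortened stand-in for one formatted training example.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "prompt<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "code<|eot_id|><|end_of_text|>"
)

# The last two markers show what each training example terminates with.
tail = special_tokens_in(sample)[-2:]
print(tail)
```

For the sample above, the tail is `<|eot_id|>` followed by `<|end_of_text|>`, which matches the printed dataset entry.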

Any ideas what might be going on? As a last resort, I had to write a clean-up function that removes these tokens from the generated response.
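
For reference, the clean-up workaround looks roughly like the following. It is a sketch, assuming every unwanted token has the form `<|reserved_special_token_N|>`; the name `strip_reserved_tokens` is illustrative. It only hides the symptom and does not address why the model emits these tokens.

```python
import re

# Assumption: the unwanted tokens all look like <|reserved_special_token_N|>.
RESERVED_TOKEN_RE = re.compile(r"<\|reserved_special_token_\d+\|>")

def strip_reserved_tokens(text: str) -> str:
    """Remove reserved special tokens and any trailing whitespace left behind."""
    return RESERVED_TOKEN_RE.sub("", text).rstrip()

raw = (
    "def add(a, b):\n    return a + b\n"
    "<|reserved_special_token_193|><|reserved_special_token_87|>"
)
print(strip_reserved_tokens(raw))
```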

  1. Environment Setup:

  2. Dataset Details:
    Custom private dataset with the following pattern:
    [{'content': '---sample prompt to generate a custom python function given some inputs---', 'role': 'user'}, {'content': '```python---sample code snippet for training---```\n', 'role': 'assistant'}]

  3. Model Details:
    unsloth/Llama-3.2-1B

  4. Training Configuration:

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Start training
trainer_stats = trainer.train()
  5. Expected Behavior:
    I would expect inference to generate just the desired Python output, without any trailing reserved tokens.

  6. Actual Behavior:
    Unwanted trailing reserved tokens, leading to junk output after the actual code snippet.

  7. Additional notes:
    None.
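
If the junk appears only because decoding continues past the end-of-turn marker, another possible workaround is to cut the generated token IDs at the first terminator before decoding. The sketch below works under that assumption. The IDs 128009 and 128001 are assumed values for `<|eot_id|>` and `<|end_of_text|>`; they should be looked up from the actual tokenizer (for example with `tokenizer.convert_tokens_to_ids`) rather than taken from here. `truncate_at_stop` is a hypothetical helper.

```python
from typing import Iterable

def truncate_at_stop(token_ids: Iterable[int], stop_ids: set[int]) -> list[int]:
    """Keep token IDs up to, but not including, the first stop ID."""
    kept = []
    for tid in token_ids:
        if tid in stop_ids:
            break
        kept.append(tid)
    return kept

# Assumed IDs for <|eot_id|> and <|end_of_text|>; verify against the tokenizer.
STOP_IDS = {128009, 128001}

# Made-up generation: two ordinary tokens, a terminator, then stray special tokens.
generated = [791, 4320, 128009, 128198, 128092]
print(truncate_at_stop(generated, STOP_IDS))
```

Everything after the first terminator is dropped, so the stray tokens never reach the decoded string. Whether this applies depends on where the reserved tokens actually appear relative to `<|eot_id|>` in the raw output.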
