[Question] High Gradient Norm and Loss During Initial Training with Gemma-4-E2B #4982

@hungphongtrn

Description

I'm seeing unusually high gradient norms and loss values during the first steps of full fine-tuning Gemma-4-E2B-it. The logged gradient norms reach 6880-8128 even though max_grad_norm: 1.0 is set, and the initial loss is around 185.9, which seems excessively high.

Training Logs

Unsloth: Will smartly offload gradients to save VRAM!
{'loss': '185.9', 'grad_norm': '6880', 'learning_rate': '1.263e-06', 'epoch': '0.002296'}  
{'loss': '142.5', 'grad_norm': '4512', 'learning_rate': '2.538e-06', 'epoch': '0.004592'}
{'loss': '117.6', 'grad_norm': '8128', 'learning_rate': '3.814e-06', 'epoch': '0.006888'}
{'loss': '92.86', 'grad_norm': '1344', 'learning_rate': '5.089e-06', 'epoch': '0.009185'}

Questions:

  1. Is it normal to have such high gradient norms (6880-8128) during initial training? The max_grad_norm is set to 1.0 in my config, but the reported values are much higher.
  2. Is an initial loss of 185.9 expected for Gemma-4-E2B-it? The README mentions Gemma-4-E4B loss should be 13-15, but I'm using E2B.
  3. Could the gradient offloading message be related? "Unsloth: Will smartly offload gradients to save VRAM!"

Environment Information

Note: Please do not remove the questions. Answer beside them.

  1. Did you update? pip install --upgrade unsloth unsloth_zoo.

    • Yes, using latest versions via uv pip install -e .
  2. Colab or Kaggle or local / cloud.

    • Local server with H100 GPUs
  3. Number of GPUs used (use nvidia-smi)

    • 1x H100 80GB
  4. Which notebook? Please link!

    • Custom training script: train_ift.py (see below)
  5. Which Unsloth version, TRL version, transformers version, PyTorch version?

    • Unsloth: latest (installed via uv pip install unsloth)
    • TRL: latest
    • Transformers: latest
    • PyTorch: latest (with CUDA support)
  6. Which trainer? SFTTrainer, GRPOTrainer etc

    • SFTTrainer from TRL

Minimal Code to Reproduce

# train_ift.py - Simplified relevant parts
import torch
from unsloth import FastModel, get_chat_template, train_on_responses_only
from transformers import TrainingArguments  # required by the TrainingArguments(...) call below
from trl import SFTTrainer

# Load model
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    max_seq_length=8192,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=True,
    trust_remote_code=True,
)

# Apply chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-4",
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
)

# Dataset: 10 Vietnamese datasets (~650K samples)
# Format: Conversational with reasoning content

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
    train_on_responses_only=True,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)

# Apply train_on_responses_only
train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)

# Train
trainer.train()

Configuration Details

Model: unsloth/gemma-4-E2B-it

  • Full fine-tuning (no LoRA)
  • Max sequence length: 8192
  • Chat template: "gemma-4"
  • Thinking enabled: Yes

Training Config:

  • Batch size: 1 per device
  • Gradient accumulation: 16 (effective batch = 16)
  • Learning rate: 5e-5
  • Warmup ratio: 3%
  • Max grad norm: 1.0 (but seeing 6880-8128!)
  • Optimizer: AdamW (adamw_torch)
  • Precision: bf16
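
One thing worth checking, stated as an assumption rather than a confirmed diagnosis: if the logged loss is summed over the 16 gradient-accumulation micro-batches instead of averaged, dividing by gradient_accumulation_steps gives the per-micro-batch value. Whether the installed Trainer version reports a sum or a mean is not established here. The sketch below only does the arithmetic on the logged numbers.

```python
# Logged values copied from the training log in this issue.
logged_losses = [185.9, 142.5, 117.6, 92.86]
gradient_accumulation_steps = 16  # from the training config

# ASSUMPTION: the logged loss is a sum over micro-batches, not a mean.
per_micro_batch = [loss / gradient_accumulation_steps for loss in logged_losses]

for logged, scaled in zip(logged_losses, per_micro_batch):
    print(f"logged={logged:>7}  per-micro-batch={scaled:.2f}")
```

If the scaled values land near the range quoted in the README for Gemma-4, the high number may be a reporting artifact rather than a real divergence. If they do not, the data or masking is the next place to look.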

Datasets:

  • 10 Vietnamese datasets (~650K samples total)
  • Mix: Education, Math, Legal, History, NLI, Summarization, etc.
  • Format: ShareGPT/Alpaca with reasoning content

Expected vs Actual Behavior

Expected:

  • Gradient norm should be clipped to max_grad_norm (1.0)
  • Loss should start at a reasonable value (not 185.9)
  • Training should be stable from the start

Actual:

  • Gradient norms: 6880, 4512, 8128, 1344 (much higher than max_grad_norm)
  • Initial loss: 185.9 (extremely high)
  • Loss is decreasing but starting from a very high point
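
On the first expectation: `torch.nn.utils.clip_grad_norm_` returns the total norm measured before clipping, and the Hugging Face Trainer logs that return value as grad_norm. A logged value above max_grad_norm therefore does not by itself show that clipping was skipped. The following is a pure-Python sketch of that behavior (the helper `clip_by_global_norm` is hypothetical, written for illustration, and is not the PyTorch or Unsloth implementation).

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads in place so their global L2 norm is at most max_norm.

    Returns the norm measured BEFORE clipping, mirroring what gets logged.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    for tensor in grads:
        for i in range(len(tensor)):
            tensor[i] *= scale
    return total_norm

# Two toy "parameter gradients" with a large global norm.
grads = [[3000.0, 4000.0], [0.0, 0.0]]
logged_norm = clip_by_global_norm(grads, max_norm=1.0)
applied_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))

print("logged grad_norm:", logged_norm)    # pre-clip norm
print("norm actually applied:", applied_norm)  # at most max_norm
```

The logged number is large while the gradient that reaches the optimizer has norm at most 1.0. Very large pre-clip norms are still a useful signal about the data or loss scale, so they are worth investigating even when clipping works.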

Sample Training Data (with Chat Template)

Here are examples of the actual training data with the Gemma-4 chat template applied:

Example 1 (Vietnamese Math MCQ):

<|turn>system
<|think|>
<turn|>
<|turn>user
Câu 1: 
 
 Độ dài của chiếc bút xoá là:

Các đáp án:
A. 10 cm
B. 8 cm
C. 9 cm
D. 7 cm<turn|>
<|turn>model
<|channel>thought
Hướng dẫn giải 
 Đáp án đúng là: C 
 Quan sát hình vẽ ta thấy độ dài của chiếc bút xoá là 9 cm .
<channel|>
Đáp án: C<turn|>

Example 2 (English Math with Reasoning):

<|turn>system
<|think|>
<turn|>
<|turn>user
[MATH - medium] 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?<turn|>
<|turn>model
<|channel>thought
Let me work through this problem step by step.

First, I need to understand what's being asked: 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?

Key values given: 252, 8, 41, 300,000, 7,500, ,

My approach:
1. Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- Total people: 252 + 8 = 260 people
2. Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the
3. whole number.

Buses needed: 7 buses

Verification: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓
4. Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$
5. Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$
6. Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

Final Answer

| Item | Calculation | Cost |
 Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway tol

This gives us the final answer: s
<channel|>
# Solution: Calculating Bus Rental and Toll Costs

## Step 1: Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- **Total people**: 252 + 8 = **260 people**

## Step 2: Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the next whole number.

**Buses needed**: 7 buses

**Verification**: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓

## Step 3: Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$

## Step 4: Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$

## Step 5: Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

---

## Final Answer

| Item | Calculation | Cost |
|------|-------------|------|
| Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway toll | 7 × 7,500 won | 52,500 won |
| **Total** | | **2,152,500 won** |

**The total cost to rent the buses and pay the toll is 2,152,500 won.**<turn|>

Notes:

  • Using train_on_responses_only=True with instruction_part="<|turn>user\n" and response_part="<|turn>model\n"
  • The <|channel>thought blocks contain reasoning content that is also being trained on
  • Mix of Vietnamese and English content
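
To sanity-check response-only training, it can help to confirm which tokens actually receive labels. The sketch below illustrates the general masking idea on plain lists of token ids: everything outside a response span gets the label -100, which PyTorch's cross-entropy ignores by default. The helpers and the marker ids are hypothetical and are not Unsloth's implementation of train_on_responses_only.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy

def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Hypothetical ids: [1, 2] stands in for the user marker, [1, 3] for the model marker.
ids = [1, 2, 10, 1, 3, 20, 1, 2, 11, 1, 3, 21]
labels = mask_non_response(ids, instruction_ids=[1, 2], response_ids=[1, 3])
trained = sum(label != IGNORE_INDEX for label in labels)
print(labels, f"trained tokens: {trained}/{len(ids)}")
```

If every label in a real batch comes back as -100, the marker strings did not match the tokenized template. If nearly every label is kept, the prompt is not being masked. Either case would be worth ruling out before interpreting the loss.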

Additional Context

  • The "Unsloth: Will smartly offload gradients to save VRAM!" message appears at the start
  • Using train_on_responses_only=True with Gemma-4 chat template
  • Training on Vietnamese instruction datasets with reasoning content
  • H100 80GB GPU, bf16 precision
  • This is continued fine-tuning of an already instruction-tuned model (Gemma-4-E2B-it)

Is this behavior normal for fine-tuning an already instruction-tuned model on a different language/domain? Or is there something wrong with my configuration?

Thank you for your help!
