"Unsloth: Failed to make input require gradients!" when vision fine-tuning Gemma3 #2131

@johnray22

I'm trying to vision fine-tune Gemma3 by following this tutorial: https://colab.research.google.com/drive/1j0N4XTY1zXXy7mPAhOC1_gMYZ2F2EBlk?usp=sharing#scrollTo=QmUBVEnvCDJv

I constructed my dataset the same way the tutorial does.

Here is my code:

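# Import unsloth first so its patches are applied before trl/transformers load.
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator

import os

from datasets import load_dataset
from PIL import Image
from trl import SFTConfig, SFTTrainer

def load_my_flickr_dataset(json_path: str, split: str="train"):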
    raw_dset = load_dataset("json", data_files=json_path)
    dset = raw_dset["train"]
    if split in ["train","val","test"]:
        dset = dset.filter(lambda x: x["split"] == split)
    return dset

def convert_to_conversation(sample, image_root):
    image_path = os.path.join(image_root, sample["messages"][1]["content"][1]["image"])
    image = Image.open(image_path).convert("RGB")
    conversation = [
        {"role": "user",
         "content": [{"type": "text", "text": sample["messages"][1]["content"][0]["text"]},
                     {"type": "image", "image": image}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": sample["messages"][2]["content"][0]["text"]}]}
    ]
    return {"messages": conversation}

def main():
    data_path = "my_flickr_full_chat.json"
    image_root = "/data/rzr/flickr30k/flickr30k-images"

    train_dataset_raw = load_my_flickr_dataset(data_path, split="train")
    converted_dataset = [convert_to_conversation(sample, image_root) for sample in train_dataset_raw]

    model, tokenizer = FastVisionModel.from_pretrained(
        model_name="/data/rzr/gemma3-4b",
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )

    FastVisionModel.for_training(model)

    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,
        finetune_language_layers=True,
        finetune_attention_modules=True,
        finetune_mlp_modules=True,
        r=16,
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        random_state=3407,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=UnslothVisionDataCollator(model, tokenizer),
        train_dataset=converted_dataset,
        args=SFTConfig(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            num_train_epochs=1,
            learning_rate=2e-4,
            fp16=not is_bf16_supported(),
            bf16=is_bf16_supported(),
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            output_dir="unsloth_out",
            report_to="none",
            remove_unused_columns=False,
            dataset_text_field="",
            dataset_kwargs={"skip_prepare_dataset": True},
            dataset_num_proc=4,
            max_seq_length=2048,
        ),
    )

    trainer.train()

if __name__ == "__main__":
    main()

This is what converted_dataset looks like:

[Image]

Here are the details of converted_dataset[0]:

{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Please briefly describe this image, then list identifiable objects and their bounding boxes.'}, {'type': 'image', 'image': <PIL.Image.Image image mode=RGB size=333x500 at 0x7F5EEA3220D0>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': 'Here is a high-level description:\n - Two young guys with shaggy hair look at their hands while hanging out in the yard.\n - Two young, White males are outside near many bushes.\n - Two men in green shirts are standing in a yard.\n - A man in a blue shirt standing in a garden.\n - Two friends enjoy time spent together.\n\nIdentified objects and bounding boxes:\n * Two young guys: [196, 109, 260, 372], [158, 124, 218, 334]\n * shaggy hair: [179, 124, 205, 155], [197, 113, 239, 145]\n * their hands: [157, 197, 190, 224], [172, 183, 197, 202]\n * Two young , White males: [196, 109, 260, 372], [158, 124, 218, 334]\n * many bushes: [275, 214, 331, 336], [0, 219, 210, 472]\n * Two men: [196, 109, 260, 372], [158, 124, 218, 334]\n * green shirts: [172, 155, 216, 235], [206, 143, 256, 243]\n * A man: [196, 109, 260, 372]\n * a blue shirt: [206, 143, 256, 243]\n * Two friends: [196, 109, 260, 372], [158, 124, 218, 334]'}]}]}

So the data has the same format as in the tutorial.
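A quick way to double-check that claim is a structural check over every converted sample. The following is only a minimal sketch: `check_conversation` is a hypothetical helper (it is not part of Unsloth or TRL), and it verifies nothing beyond the layout visible in converted_dataset[0] above, which may not be everything the data collator requires.

```python
def check_conversation(sample):
    """Return a list of problems found in one converted sample (empty if none).

    Hypothetical helper: it only checks the layout shown in converted_dataset[0],
    i.e. a "messages" list of turns, each with a role and a list of content parts.
    """
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    image_count = 0
    for i, turn in enumerate(messages):
        role = turn.get("role")
        if role not in ("user", "assistant"):
            problems.append(f"turn {i}: unexpected role {role!r}")

        content = turn.get("content")
        if not isinstance(content, list):
            problems.append(f"turn {i}: 'content' must be a list of parts")
            continue

        for j, part in enumerate(content):
            kind = part.get("type")
            if kind == "text":
                if not isinstance(part.get("text"), str):
                    problems.append(f"turn {i}, part {j}: 'text' must be a string")
            elif kind == "image":
                if part.get("image") is None:
                    problems.append(f"turn {i}, part {j}: 'image' is missing")
                else:
                    image_count += 1
            else:
                problems.append(f"turn {i}, part {j}: unexpected type {kind!r}")

    if image_count == 0:
        problems.append("no image part found")
    return problems
```

Running it over the whole list, for example `[p for s in converted_dataset for p in check_conversation(s)]`, should give an empty list if every sample matches the layout of converted_dataset[0].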

Then I started training, but an error occurred. Here is my log:

To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2025-03-21 09:17:47,889] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
==((====))==  Unsloth 2025.3.15: Fast Siglip patching. Transformers: 4.50.0.dev0.
   \\   /|    NVIDIA GeForce RTX 4090 D. Num GPUs = 4. Max memory: 23.542 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Unsloth: Making `model.base_model.model.vision_tower.vision_model.encoder` require gradients
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,123 | Num Epochs = 1 | Total steps = 570
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 38,497,792/4,000,000,000 (0.96% trained)
  0%|                                                   | 0/570 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/user/zero_nlp/train_llava/train_3.py", line 89, in <module>
    main()
  File "/home/user/zero_nlp/train_llava/train_3.py", line 86, in main
    trainer.train()
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/trainer.py", line 2250, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 311, in _fast_inner_training_loop
  File "<string>", line 31, in _unsloth_training_step
  File "/home/user/zero_nlp/train_llava/unsloth_compiled_cache/UnslothSFTTrainer.py", line 750, in compute_loss
    outputs = super().compute_loss(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth/models/_utils.py", line 1028, in _unsloth_pre_compute_loss
    outputs = self._old_compute_loss(model, inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/trainer.py", line 3772, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/peft/peft_model.py", line 1719, in forward
    return self.base_model(
           ^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth_zoo/temporary_patches.py", line 217, in forward
    image_features = self.get_image_features(pixel_values)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zero_nlp/train_llava/unsloth_compiled_cache/unsloth_compiled_module_gemma3.py", line 1138, in get_image_features
    vision_outputs = self.vision_tower(pixel_values=pixel_values).last_hidden_state
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1191, in forward
    return self.vision_model(
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1092, in forward
    encoder_outputs = self.encoder(
                      ^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1782, in inner
    args_result = hook(self, args)
                  ^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth_zoo/peft_utils.py", line 208, in requires_grad_pre_hook
    raise RuntimeError("Unsloth: Failed to make input require gradients!")
RuntimeError: Unsloth: Failed to make input require gradients!
  0%|                                                   | 0/570 [00:04<?, ?it/s]
ERROR conda.cli.main_run:execute(49): `conda run python /home/user/zero_nlp/train_llava/train_3.py` failed. (See above for error)

However, if I train LLaVA 1.6 with the same code, it works:

[Image]

So I think it's a Gemma3 adaptation problem.
