I'm trying to vision fine-tune Gemma3 by following this tutorial: https://colab.research.google.com/drive/1j0N4XTY1zXXy7mPAhOC1_gMYZ2F2EBlk?usp=sharing#scrollTo=QmUBVEnvCDJv
I constructed my dataset the same way the tutorial does.
Here is my code:
import os

from PIL import Image
from datasets import load_dataset
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig


def load_my_flickr_dataset(json_path: str, split: str = "train"):
    # The JSON file is loaded as a single "train" split; the real split
    # name is stored in each record's "split" field.
    raw_dset = load_dataset("json", data_files=json_path)
    dset = raw_dset["train"]
    if split in ["train", "val", "test"]:
        dset = dset.filter(lambda x: x["split"] == split)
    return dset


def convert_to_conversation(sample, image_root):
    # messages[1] is the user turn (text + image file name),
    # messages[2] is the assistant turn.
    image_path = os.path.join(image_root, sample["messages"][1]["content"][1]["image"])
    image = Image.open(image_path).convert("RGB")
    conversation = [
        {"role": "user",
         "content": [{"type": "text", "text": sample["messages"][1]["content"][0]["text"]},
                     {"type": "image", "image": image}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": sample["messages"][2]["content"][0]["text"]}]}
    ]
    return {"messages": conversation}


def main():
    data_path = "my_flickr_full_chat.json"
    image_root = "/data/rzr/flickr30k/flickr30k-images"
    train_dataset_raw = load_my_flickr_dataset(data_path, split="train")
    converted_dataset = [convert_to_conversation(sample, image_root) for sample in train_dataset_raw]

    model, tokenizer = FastVisionModel.from_pretrained(
        model_name="/data/rzr/gemma3-4b",
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )
    FastVisionModel.for_training(model)
    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,
        finetune_language_layers=True,
        finetune_attention_modules=True,
        finetune_mlp_modules=True,
        r=16,
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        random_state=3407,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=UnslothVisionDataCollator(model, tokenizer),
        train_dataset=converted_dataset,
        args=SFTConfig(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            num_train_epochs=1,
            learning_rate=2e-4,
            fp16=not is_bf16_supported(),
            bf16=is_bf16_supported(),
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            output_dir="unsloth_out",
            report_to="none",
            # The next four settings are needed for vision fine-tuning.
            remove_unused_columns=False,
            dataset_text_field="",
            dataset_kwargs={"skip_prepare_dataset": True},
            dataset_num_proc=4,
            max_seq_length=2048,
        ),
    )
    trainer.train()


if __name__ == "__main__":
    main()
And the converted_dataset is:

The details of converted_dataset[0]:
{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Please briefly describe this image, then list identifiable objects and their bounding boxes.'}, {'type': 'image', 'image': <PIL.Image.Image image mode=RGB size=333x500 at 0x7F5EEA3220D0>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': 'Here is a high-level description:\n - Two young guys with shaggy hair look at their hands while hanging out in the yard.\n - Two young, White males are outside near many bushes.\n - Two men in green shirts are standing in a yard.\n - A man in a blue shirt standing in a garden.\n - Two friends enjoy time spent together.\n\nIdentified objects and bounding boxes:\n * Two young guys: [196, 109, 260, 372], [158, 124, 218, 334]\n * shaggy hair: [179, 124, 205, 155], [197, 113, 239, 145]\n * their hands: [157, 197, 190, 224], [172, 183, 197, 202]\n * Two young , White males: [196, 109, 260, 372], [158, 124, 218, 334]\n * many bushes: [275, 214, 331, 336], [0, 219, 210, 472]\n * Two men: [196, 109, 260, 372], [158, 124, 218, 334]\n * green shirts: [172, 155, 216, 235], [206, 143, 256, 243]\n * A man: [196, 109, 260, 372]\n * a blue shirt: [206, 143, 256, 243]\n * Two friends: [196, 109, 260, 372], [158, 124, 218, 334]'}]}]}
So the data has the same format as in the tutorial.
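To back up that claim for the whole dataset rather than only the first sample, a small structural check could be run before training. This is only a sketch: `check_conversation` is a hypothetical helper, not part of Unsloth or TRL, and the layout it enforces (one user turn with text and image parts, one assistant turn) is assumed from the `converted_dataset[0]` printed above. It checks the shape of the data only and does not address the gradient error.

```python
def check_conversation(sample):
    """Return a list of problems found in one converted sample (empty if none).

    Hypothetical helper; the expected layout is assumed from the sample
    shown in this post, not taken from any library specification.
    """
    messages = sample.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return ["'messages' must be a list with one user and one assistant turn"]

    problems = []
    for turn, role in zip(messages, ("user", "assistant")):
        if turn.get("role") != role:
            problems.append(f"expected role {role!r}, got {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, list) or not content:
            problems.append(f"{role}: 'content' must be a non-empty list")
            continue
        for part in content:
            kind = part.get("type")
            if kind not in ("text", "image"):
                problems.append(f"{role}: unknown part type {kind!r}")
            elif part.get(kind) is None:
                # A part of type "text" must carry a "text" value, and so on.
                problems.append(f"{role}: part of type {kind!r} has no {kind!r} value")

    # The user turn is expected to contain at least one image part.
    user_content = messages[0].get("content")
    if isinstance(user_content, list):
        if not any(part.get("type") == "image" for part in user_content):
            problems.append("user turn has no image part")
    return problems
```

For example, `[i for i, s in enumerate(converted_dataset) if check_conversation(s)]` would list the indices of any samples that deviate from this layout.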
Then I started training, but an error occurred. Here is my log:
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2025-03-21 09:17:47,889] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
==((====))== Unsloth 2025.3.15: Fast Siglip patching. Transformers: 4.50.0.dev0.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 4. Max memory: 23.542 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Unsloth: Making `model.base_model.model.vision_tower.vision_model.encoder` require gradients
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 9,123 | Num Epochs = 1 | Total steps = 570
O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
"-____-" Trainable parameters = 38,497,792/4,000,000,000 (0.96% trained)
0%| | 0/570 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/user/zero_nlp/train_llava/train_3.py", line 89, in <module>
main()
File "/home/user/zero_nlp/train_llava/train_3.py", line 86, in main
trainer.train()
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/trainer.py", line 2250, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "<string>", line 311, in _fast_inner_training_loop
File "<string>", line 31, in _unsloth_training_step
File "/home/user/zero_nlp/train_llava/unsloth_compiled_cache/UnslothSFTTrainer.py", line 750, in compute_loss
outputs = super().compute_loss(
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth/models/_utils.py", line 1028, in _unsloth_pre_compute_loss
outputs = self._old_compute_loss(model, inputs, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/trainer.py", line 3772, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/accelerate/utils/operations.py", line 819, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/accelerate/utils/operations.py", line 807, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/peft/peft_model.py", line 1719, in forward
return self.base_model(
^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
return self.model.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth_zoo/temporary_patches.py", line 217, in forward
image_features = self.get_image_features(pixel_values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zero_nlp/train_llava/unsloth_compiled_cache/unsloth_compiled_module_gemma3.py", line 1138, in get_image_features
vision_outputs = self.vision_tower(pixel_values=pixel_values).last_hidden_state
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1191, in forward
return self.vision_model(
^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1092, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
return inner()
^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1782, in inner
args_result = hook(self, args)
^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/unsloth_zoo/peft_utils.py", line 208, in requires_grad_pre_hook
raise RuntimeError("Unsloth: Failed to make input require gradients!")
RuntimeError: Unsloth: Failed to make input require gradients!
0%| | 0/570 [00:04<?, ?it/s]
ERROR conda.cli.main_run:execute(49): `conda run python /home/user/zero_nlp/train_llava/train_3.py` failed. (See above for error)
But if I train LLaVA 1.6 using the same code, it works:

So I think it's a Gemma3 adaptation problem.