CI fails: AssertionError: Param model.visual.blocks.0.norm1.weight is not updated #5768

@albertvillanova

CI fails: https://github.com/huggingface/trl/actions/runs/25825653691/job/75877994889

AssertionError: Param model.visual.blocks.0.norm1.weight is not updated

  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_chunked_nll_loss_vlm[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_chunked_nll_loss_vlm[trl-internal-testing/tiny-Qwen3VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f0b6a8f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f0b6a8f6b20> = torch.equal
  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen3VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f0b6a8f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f0b6a8f6b20> = torch.equal
  FAILED tests/test_rloo_trainer.py::TestRLOOTrainer::test_train_vlm_multi_image[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Parameter model.visual.blocks.0.norm1.weight has not changed.
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2cda4f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2cda4f6b20> = torch.equal
  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_vlm_multi_image[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f0b6a8f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f0b6a8f6b20> = torch.equal
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  FAILED tests/test_grpo_trainer.py::TestGRPOTrainer::test_train_vlm_multi_image[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Parameter model.visual.blocks.0.norm1.weight has not changed.
  assert not True
   +  where True = <built-in method equal of type object at 0x7fb7186f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7fb7186f6b20> = torch.equal
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen3VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm_multi_image[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm_prompt_completion[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  assert not True
   +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
   +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  = 11 failed
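
To see the scope of the failures at a glance, the `FAILED` lines can be grouped by model and by test file. The snippet below is a minimal sketch using only the standard library; the `summarize` helper and the regular expression are illustrative assumptions, not part of TRL or pytest, and the sample contains just three lines copied from the summary above.

```python
import re
from collections import Counter

# Hypothetical pattern for one pytest short-summary line:
# FAILED <file>::<class>::<test>[<model id>] - <message>
FAILED_RE = re.compile(
    r"FAILED (?P<file>\S+?)::(?P<cls>\w+)::(?P<test>\w+)\[(?P<model>[^\]]+)\]"
)


def summarize(log_text):
    """Count failures per model id and per test file."""
    by_model = Counter()
    by_file = Counter()
    for match in FAILED_RE.finditer(log_text):
        by_model[match["model"]] += 1
        by_file[match["file"]] += 1
    return by_model, by_file


# Three lines copied from the summary in this issue.
sample = """\
FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen3VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_vlm[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] - AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
"""

by_model, by_file = summarize(sample)
for model_id, count in by_model.most_common():
    print(f"{count:2d}  {model_id}")
```

Run against the full summary, this kind of grouping makes it easy to check which model families and trainers are affected.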

Stacktrace:

  _ TestSFTTrainer.test_train_vlm_prompt_completion[trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration] _
  [gw3] linux -- Python 3.10.20 /__w/trl/trl/.venv/bin/python3
  
  self = <tests.test_sft_trainer.TestSFTTrainer object at 0x7f2c8ae53610>
  model_id = 'trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration'
  
      @pytest.mark.parametrize(
          "model_id",
          [
              "trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration",
              # Special case for Gemma, as it uses token_type_ids, and we need to ensure they are properly in the collator:
              "trl-internal-testing/tiny-Gemma3ForConditionalGeneration",
              pytest.param(
                  "trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
                  marks=pytest.mark.skipif(
                      Version(transformers.__version__) < Version("5.5.0"),
                      reason="Gemma4 models were introduced in transformers-5.5.0",
                  ),
              ),
          ],
      )
      @require_vision
      def test_train_vlm_prompt_completion(self, model_id):
          dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_completion", split="train")
      
          training_args = SFTConfig(
              output_dir=self.tmp_dir,
              per_device_train_batch_size=1,  # VLM training is memory intensive, reduce batch size to avoid OOM
              learning_rate=0.1,  # use higher lr because gradients are tiny and default lr can stall updates
              max_length=None,  # for VLMs, truncating can remove image tokens, leading to errors
              report_to="none",
          )
          trainer = SFTTrainer(
              model=model_id,
              args=training_args,
              train_dataset=dataset,
          )
      
          previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}
      
          trainer.train()
      
          assert trainer.state.log_history[-1]["train_loss"] is not None
      
          # Check that the params have changed
          for n, param in previous_trainable_params.items():
              new_param = trainer.model.get_parameter(n)
  >           assert not torch.equal(param, new_param), f"Param {n} is not updated"
  E           AssertionError: Param model.visual.blocks.0.norm1.weight is not updated
  E           assert not True
  E            +  where True = <built-in method equal of type object at 0x7f2d5d2f6b20>(tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', grad_fn=<CloneBackward0>), Parameter containing:\ntensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],\n       device='cuda:0', requires_grad=True))
  E            +    where <built-in method equal of type object at 0x7f2d5d2f6b20> = torch.equal
  
  tests/test_sft_trainer.py:1755: AssertionError
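
The check in the test stops at the first parameter that did not change, so the traceback only names `model.visual.blocks.0.norm1.weight`. When debugging locally, it may help to collect every unchanged parameter instead. The sketch below shows the idea on a toy module; `unchanged_parameters` and `TinyModel` are hypothetical names introduced for illustration, and the unused layer is only there to produce some unchanged parameters. It is not a claim about why the parameter in the CI run was not updated.

```python
import torch
from torch import nn


def unchanged_parameters(snapshot, model):
    """Return the names of parameters that are identical to their snapshot."""
    return [
        name
        for name, old in snapshot.items()
        if torch.equal(old, model.get_parameter(name))
    ]


class TinyModel(nn.Module):
    # Hypothetical toy model: `unused` never takes part in forward(),
    # so it receives no gradient and the optimizer leaves it untouched.
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.LayerNorm(4)

    def forward(self, x):
        return self.used(x)


torch.manual_seed(0)
model = TinyModel()

# Snapshot the parameters before the update, as the test does.
snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}

# One plain SGD step stands in for trainer.train().
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

stale = unchanged_parameters(snapshot, model)
print(stale)
```

In a real run, the same helper could be called with the `previous_trainable_params` dictionary and `trainer.model` after `trainer.train()` to list all affected parameters at once.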
