Support max_length in DPO VLM training #5284

Merged: albertvillanova merged 10 commits into huggingface:main from albertvillanova:fix-5283 on Mar 17, 2026

@albertvillanova commented on Mar 13, 2026

Member

Support max_length in DPO VLM training.

  • Truncate sequence-aligned side-inputs (token_type_ids, mm_token_type_ids) with input_ids in DPO VLM training

Fix #5283.

This PR addresses a regression affecting vision-language model (VLM) training when using sequence truncation. The main fix ensures that auxiliary token fields (mm_token_type_ids and token_type_ids) are truncated in sync with input_ids, preventing shape mismatches and crashes during the model's forward pass. Additionally, a regression test is added to verify this behavior.

Changes

Bug fix for sequence truncation in VLMs:

  • Ensured that token_type_ids and mm_token_type_ids are truncated to match the length of input_ids in both the compute_ref_log_probs and _compute_loss methods of DPOTrainer, preventing shape-mismatch errors during training.
    • Note that pixel_values, image_grid_thw, image_sizes, and pixel_attention_mask are patch-level or image-level tensors and should not be truncated.
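
To illustrate the distinction between sequence-aligned and image-level inputs, here is a minimal sketch in plain Python, with nested lists standing in for tensors. The helper name `truncate_sequence_aligned` and the `SEQUENCE_ALIGNED_KEYS` set are hypothetical and only mirror the fields named above; this is not the DPOTrainer implementation, and it only models cutting from the right.

```python
# Hypothetical illustration: nested lists stand in for (batch, seq_len) tensors.
SEQUENCE_ALIGNED_KEYS = {
    "input_ids",
    "attention_mask",
    "token_type_ids",
    "mm_token_type_ids",
}


def truncate_sequence_aligned(batch, max_length):
    """Cut every sequence-aligned field to max_length; pass other fields through."""
    out = {}
    for key, value in batch.items():
        if key in SEQUENCE_ALIGNED_KEYS:
            # One entry per token, so it must shrink together with input_ids.
            out[key] = [row[:max_length] for row in value]
        else:
            # Patch-level or image-level data (e.g. pixel_values) is left untouched.
            out[key] = value
    return out


batch = {
    "input_ids": [[101, 7, 7, 7, 42, 43]],
    "attention_mask": [[1, 1, 1, 1, 1, 1]],
    "mm_token_type_ids": [[0, 1, 1, 1, 0, 0]],
    "pixel_values": [[0.1, 0.2, 0.3, 0.4]],
}
truncated = truncate_sequence_aligned(batch, max_length=4)
```

After the call, every per-token field has four entries per row, while `pixel_values` keeps its original contents.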

Testing improvements:

  • Added a regression test test_train_vlm_with_max_length in tests/test_dpo_trainer.py to verify that truncation with max_length does not crash the model and that image tokens are handled correctly.

Follow-up

If this approach is approved, I will implement a similar fix for other trainers.

Related

See related discussion in: #5279 (comment)


Note

Medium Risk
Changes DPOTrainer truncation behavior for sequence-aligned auxiliary tensors in VLM training; incorrect handling could affect training correctness or cause new shape/attention issues when max_length/truncation_mode are used.

Overview
Fixes a VLM regression when training DPOTrainer with max_length by truncating sequence-aligned side inputs (currently token_type_ids and mm_token_type_ids) in lockstep with input_ids/masks.

Extends _truncate_inputs to accept and return additional tensors, and wires this through both compute_ref_log_probs and _compute_loss while leaving image/patch-level inputs (e.g. pixel_values, image_grid_thw) untouched.
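
The idea of accepting and returning additional tensors can be sketched as a variadic function. The signature below is an assumption made for illustration and is not the actual `_truncate_inputs` signature; rows are plain lists, and "keep_end" is modeled as a simple slice with no flushing or re-padding.

```python
def truncate_inputs(input_ids, attention_mask, completion_mask, *extra,
                    max_length=None, truncation_mode="keep_start"):
    """Truncate the three core fields plus any extra sequence-aligned fields.

    Returns a tuple in the same order as the positional arguments, so callers
    can unpack the extras right after the core fields.
    """
    fields = (input_ids, attention_mask, completion_mask, *extra)
    if max_length is None:
        return fields
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    if truncation_mode == "keep_start":
        window = slice(None, max_length)   # keep the first max_length tokens
    elif truncation_mode == "keep_end":
        window = slice(-max_length, None)  # keep the last max_length tokens
    else:
        raise ValueError(f"Unsupported truncation_mode: {truncation_mode!r}")
    # Apply the same window to every field so all of them stay the same length.
    return tuple([row[window] for row in field] for field in fields)


ids, mask, comp, mm = truncate_inputs(
    [[1, 2, 3, 4, 5]], [[1, 1, 1, 1, 1]], [[0, 0, 1, 1, 1]], [[0, 1, 1, 0, 0]],
    max_length=3, truncation_mode="keep_end",
)
```

Because the same slice is applied to every field, the extras cannot drift out of sync with `input_ids`, which is the property the fix relies on.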

Adds a require_vision regression test test_train_vlm_with_max_length to ensure truncated VLM batches no longer crash due to shape mismatches.

Written by Cursor Bugbot for commit 38b4bc6.

@albertvillanova changed the title from "Truncate token_type_ids and mm_token_type_ids with input_ids in DPO VLM training" to "Support max_length in DPO VLM training" on Mar 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread trl/trainer/dpo_trainer.py Outdated
input_ids, attention_mask, completion_mask = self._truncate_inputs(input_ids, attention_mask, completion_mask)
# token_type_ids is sequence-length-aligned: truncate to match input_ids
# in keep_end mode, token_type_ids participates in flush_right/flush_left
extra = (inputs["token_type_ids"],) if "token_type_ids" in inputs else ()

Member

Why not include mm_token_type_ids in extra?

Member Author

The tokens in extra can be truncated with either "keep_start" or "keep_end". I think using "keep_end" on the VLM mm_token_type_ids is semantically wrong, but I'm addressing that in a follow-up PR.

So let's treat mm_token_type_ids and token_type_ids symmetrically to stay internally consistent, and leave the semantic correction to the other PR.

Member

OK, I think you should align compute_ref_log_probs with compute_loss, i.e., include mm_token_type_ids in extra in both cases.

Member Author

Sure, thanks! 😅

Member Author

Done!
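
One way to keep both code paths aligned, as requested in the thread above, is to build `extra` in a single shared helper. The sketch below is hypothetical: `collect_extra` and `EXTRA_KEYS` are illustrative names, not part of TRL, and a plain dict stands in for the batch.

```python
# Hypothetical helper: both call sites would use it, so they cannot diverge.
EXTRA_KEYS = ("token_type_ids", "mm_token_type_ids")


def collect_extra(inputs):
    """Return (keys, values) for the sequence-aligned side inputs that are present."""
    keys = tuple(key for key in EXTRA_KEYS if key in inputs)
    values = tuple(inputs[key] for key in keys)
    return keys, values


inputs = {
    "input_ids": [[1, 2, 3]],
    "mm_token_type_ids": [[0, 1, 1]],
    "pixel_values": [[0.5, 0.5]],
}
keys, extra = collect_extra(inputs)
```

Returning the keys alongside the values lets the caller write the truncated tensors back under the right names, whichever subset of side inputs the model's processor produced.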

@albertvillanova merged commit 3acb8e8 into huggingface:main on Mar 17, 2026
12 checks passed
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 52cd0cc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:31:26 2026 +0100

    Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295)

commit 7b42fc4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:29:11 2026 +0100

    Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286)

commit 3acb8e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:27:10 2026 +0100

    Support max_length in DPO VLM training (#5284)

commit ee339a0
Author: Carlos Miguel Patiño <carlos.patino@huggingface.co>
Date:   Tue Mar 17 14:01:44 2026 +0100

    [GKD] Buffer Implementation for Distillation Trainer (#5137)

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

commit d46131f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:27:19 2026 +0100

    Remove custom get_train/eval_dataloader from OnlineDPO (#5291)

commit 85cf8f4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:24:24 2026 +0100

    Remove TrainingArguments import from experimental trainers (#5290)

commit 91e3da0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Mar 16 07:19:51 2026 -0600

    Fix `accuracy_reward` crash when called from non-main thread (#5281)

commit 4996631
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:44:28 2026 +0100

    Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274)

commit 5fceaa7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:43:34 2026 +0100

    Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273)

commit 406d406
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Sat Mar 14 04:12:49 2026 +0100

    feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199)

commit d0ac7ef
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Sat Mar 14 02:53:33 2026 +0100

    Allow nullable logprobs in vLLM serve responses  (#5203)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit c0eabc4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Mar 13 18:19:15 2026 -0600

    Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255)

    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

commit 6c0fccd
Author: Mario Šaško <mariosasko777@gmail.com>
Date:   Sat Mar 14 00:19:38 2026 +0100

    35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 3972d66
Author: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date:   Wed Mar 18 22:26:44 2026 +0100

    Suggest the `Json()` type for tool calling dataset format (#5307)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 5c6e915
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 14:55:19 2026 -0600

    Update `RewardFunc` type annotation to allow `None`values in reward list (#5297)

commit ee96845
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Mar 18 17:03:54 2026 +0100

    Fix DPOTrainer collators to truncate sequences before padding (#5305)

commit 435c2ae
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 08:09:42 2026 -0600

    Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` (#5294)

commit 26ce6a3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 00:44:12 2026 -0600

    Apply docstyle (#5296)

songhappy pushed a commit to songhappy/trl that referenced this pull request Apr 20, 2026
Development

Successfully merging this pull request may close these issues.

DPOTrainer crashes when max_length is set with VLMs: IndexError

3 participants