[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242
Conversation
… left-padding for per-token fields
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3375aeac6c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… and multimodal_fields parameters
…r and RLOOTrainer
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 367a79ebc6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
kashif
left a comment
There was a problem hiding this comment.
The changes are correct and the approach is sound:
- The concatenation approach preserves exact token IDs (the bug fix)
- Images/multimodal_fields indexing is correct: filtered by
idxs_with_toolafter overlong removal, maintaining alignment - Overlong truncation logic remains intact
- Prefix-preserving check removal is justified since prefix is now preserved by construction
…token IDs (#5242) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).
closes #5224
closes #5144
When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via
apply_chat_template, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call — never decoding and re-tokenizing.promptsin vLLM client and server #5225rollout_funcfrom_generate_single_turnto_generate#5232_generate_single_turn#5239_generate_single_turn#5240This is the final PR in the series. It eliminates the re-tokenization in the tool-calling loop — the actual source of the bug.
Changes
_get_tool_suffix_ids(tool_messages)method: Tokenizes only the tool result portion by diffing a minimal dummy conversation (2 messages vs 3 messages). This avoids re-tokenizing the full conversation history._tool_call_loop: Instead of re-tokenizingprompt + completion + tool_resultsviaapply_chat_template, builds the token sequence by concatenation:prompt_ids + completion_ids + tool_suffix_ids. The original prompt and completion token IDs are preserved exactly as they were — only the new tool result tokens are freshly tokenized._tokenize_promptscall in the tool loop.The bug and the fix
Previously, after a tool call:
prompt + assistant + tool_resultswas re-tokenized viaapply_chat_templateNow:
prompt_idsandcompletion_idsare kept as-is (never decoded and re-tokenized)prompt_ids + completion_ids + suffix_idsBackward compatibility
No user-facing API changes.
_get_tool_suffix_idsand_tool_call_loopare internal methods.Note
Medium Risk
Generation/tokenization flow for GRPO tool-calling is changed to operate on raw token IDs, which can affect multi-turn tool execution, truncation behavior, and multimodal batching. Risk is mitigated by being internal-only but impacts a core training loop.
Overview
Fixes a GRPO multi-turn tool-calling bug caused by decoding and re-tokenizing completions (BPE ambiguity) by switching the loop to a token-in/token-out pipeline.
Adds
_get_tool_suffix_ids()and updates_tool_call_loop()to build follow-up prompts by concatenatingprompt_ids + completion_ids + tool_suffix_ids(and passing through the correctimages/multimodal_fieldssubset), eliminatingapply_chat_templatere-tokenization and related prefix-preservation checks.Adjusts
_generate_single_turn()/callers ingrpo_trainer.pyandrloo_trainer.pyto stop returning/expecting prompt IDs from generation backends and simplifies prompt-length handling in the transformers path.Written by Cursor Bugbot for commit f74b5d1. This will update automatically on new commits. Configure here.