Skip to content

[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242

Merged
qgallouedec merged 103 commits into
mainfrom
fix-retokenization-tool-loop
Mar 19, 2026
Merged

[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242
qgallouedec merged 103 commits into
mainfrom
fix-retokenization-tool-loop

Conversation

@qgallouedec

@qgallouedec qgallouedec commented Mar 7, 2026

Copy link
Copy Markdown
Member

Context

Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).

closes #5224
closes #5144

When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via apply_chat_template, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call — never decoding and re-tokenizing.

This is the final PR in the series. It eliminates the re-tokenization in the tool-calling loop — the actual source of the bug.

Changes

  • New _get_tool_suffix_ids(tool_messages) method: Tokenizes only the tool result portion by diffing a minimal dummy conversation (2 messages vs 3 messages). This avoids re-tokenizing the full conversation history.
  • _tool_call_loop: Instead of re-tokenizing prompt + completion + tool_results via apply_chat_template, builds the token sequence by concatenation: prompt_ids + completion_ids + tool_suffix_ids. The original prompt and completion token IDs are preserved exactly as they were — only the new tool result tokens are freshly tokenized.
  • Removed the prefix-preserving sanity check (no longer needed since the prefix is preserved by construction).
  • Removed the _tokenize_prompts call in the tool loop.

The bug and the fix

Previously, after a tool call:

  1. The completion was decoded to text and appended as an assistant message
  2. The full prompt + assistant + tool_results was re-tokenized via apply_chat_template
  3. Due to BPE merge ambiguity, step 2 could produce different token IDs for the completion part

Now:

  1. The original prompt_ids and completion_ids are kept as-is (never decoded and re-tokenized)
  2. Only the tool result suffix is tokenized, using a minimal dummy conversation to extract just the template formatting
  3. The full prompt is built by concatenation: prompt_ids + completion_ids + suffix_ids

Backward compatibility

No user-facing API changes. _get_tool_suffix_ids and _tool_call_loop are internal methods.


Note

Medium Risk
Generation/tokenization flow for GRPO tool-calling is changed to operate on raw token IDs, which can affect multi-turn tool execution, truncation behavior, and multimodal batching. Risk is mitigated by being internal-only but impacts a core training loop.

Overview
Fixes a GRPO multi-turn tool-calling bug caused by decoding and re-tokenizing completions (BPE ambiguity) by switching the loop to a token-in/token-out pipeline.

Adds _get_tool_suffix_ids() and updates _tool_call_loop() to build follow-up prompts by concatenating prompt_ids + completion_ids + tool_suffix_ids (and passing through the correct images/multimodal_fields subset), eliminating apply_chat_template re-tokenization and related prefix-preservation checks.

Adjusts _generate_single_turn()/callers in grpo_trainer.py and rloo_trainer.py to stop returning/expecting prompt IDs from generation backends and simplifies prompt-length handling in the transformers path.

Written by Cursor Bugbot for commit f74b5d1. This will update automatically on new commits. Configure here.

qgallouedec and others added 27 commits March 5, 2026 19:10

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3375aeac6c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread trl/trainer/grpo_trainer.py Outdated
Base automatically changed from extract-tokenize-prompts to main March 10, 2026 21:35

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread trl/trainer/rloo_trainer.py Outdated
@qgallouedec

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 367a79ebc6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread trl/trainer/grpo_trainer.py
Comment thread trl/trainer/grpo_trainer.py Outdated

@kashif kashif left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are correct and the approach is sound:

  • The concatenation approach preserves exact token IDs (the bug fix)
  • Images/multimodal_fields indexing is correct: filtered by idxs_with_tool after overlong removal, maintaining alignment
  • Overlong truncation logic remains intact
  • Prefix-preserving check removal is justified since prefix is now preserved by construction

@qgallouedec qgallouedec merged commit ebdfe82 into main Mar 19, 2026
14 checks passed
@qgallouedec qgallouedec deleted the fix-retokenization-tool-loop branch March 19, 2026 18:05
qgallouedec added a commit that referenced this pull request Mar 20, 2026
…token IDs (#5242)

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-tokenization bug in GRPO multi-turn tool calling GRPOTrainer tool_mask can become longer than completion_ids after tool-call retokenization

3 participants