Re-tokenization bug in GRPO multi-turn tool calling #5224

@qgallouedec

The bug

When GRPOTrainer runs multi-turn tool calling, each iteration of _tool_call_loop does this:

  1. The model generates a completion (producing token IDs)
  2. The token IDs are decoded to text
  3. The text is appended as an assistant message to the conversation
  4. The full conversation (prompt + completion + tool result) is re-tokenized via apply_chat_template
  5. The re-tokenized IDs are passed to the next generation call

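The problem is step 4: BPE tokenization is not a bijection. Several token sequences can decode to the same text, but re-tokenizing that text yields only one of them, so decoding token IDs to text and re-tokenizing can produce different token IDs. For example, the token sequence ["he", "llo"] decodes to "hello", which re-tokenizes to ["hello"]: different IDs, same text. The next generation call is then conditioned on tokens the model never sampled, and the log-probabilities used for the policy gradient no longer correspond to the tokens that were actually sampled.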
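The round-trip mismatch can be reproduced without any real tokenizer. The sketch below is a toy, not BPE: the vocabulary, the IDs, and the greedy longest-match `encode` are all assumptions made for illustration. It only shows that two different ID sequences can decode to the same text, so decode-then-encode need not return the sampled IDs.

```python
# Toy vocabulary (hypothetical; real tokenizers have tens of thousands of entries).
VOCAB = {"he": 0, "llo": 1, "hello": 2, " ": 3, "world": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def decode(ids):
    """Join the token strings for the given IDs."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, standing in for a real tokenizer.

    Assumption: real BPE applies learned merges rather than longest match,
    but both map each string to a single canonical ID sequence.
    """
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


sampled_ids = [VOCAB["he"], VOCAB["llo"]]  # what the model actually sampled
text = decode(sampled_ids)                 # same text as a single "hello" token
retokenized_ids = encode(text)             # canonical encoding of that text

print(sampled_ids, retokenized_ids)
```

In this toy setup the sampled sequence has two IDs and the re-tokenized one has a single ID, although both decode to the same string.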

The fix

Tokenize once, never re-tokenize. Build a token-in / token-out pipeline where raw token IDs flow through the entire generation loop without ever being decoded and re-tokenized.

In _tool_call_loop, instead of re-tokenizing the full conversation, build the next prompt by concatenation:

next_prompt_ids = prompt_ids + completion_ids + tool_suffix_ids

The original prompt_ids and completion_ids are preserved exactly. Only the tool result portion (template formatting + tool output) is freshly tokenized.
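As a minimal sketch of the concatenation step, assuming all three inputs are plain Python lists of token IDs: the helper name `build_next_prompt` and the ID values are hypothetical and are not taken from the trainer code. The point is that the sampled IDs reappear unchanged at a known offset in the next prompt.

```python
def build_next_prompt(prompt_ids, completion_ids, tool_suffix_ids):
    """Concatenate ID lists; nothing already sampled is decoded or re-encoded."""
    return list(prompt_ids) + list(completion_ids) + list(tool_suffix_ids)


# Hypothetical token IDs, for illustration only.
prompt_ids = [101, 7, 8]         # tokenized once, before the first generation
completion_ids = [0, 1]          # IDs exactly as sampled by the model
tool_suffix_ids = [55, 56, 57]   # freshly tokenized template formatting + tool output

next_prompt_ids = build_next_prompt(prompt_ids, completion_ids, tool_suffix_ids)

# The sampled completion sits unchanged right after the original prompt.
start = len(prompt_ids)
end = start + len(completion_ids)
print(next_prompt_ids[start:end] == completion_ids)
```

Because the prefix is preserved ID for ID, log-probabilities computed over it refer to the same tokens the model sampled.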

PRs

The fix is split into 7 incremental PRs, each building on the previous:

  1. Add support for raw ids in prompts in vLLM client and server #5225 — Add prompt_token_ids support to vLLM client/server, so vLLM can receive pre-tokenized IDs
  2. Add VLM support when passing raw token IDs to vLLM client #5227 — Extend the token ID path to support VLMs (images alongside token IDs)
  3. Move rollout_func from _generate_single_turn to _generate #5232 — Move rollout_func out of _generate_single_turn into _generate (prep refactor)
  4. [GRPO/RLOO] Tokenize before vLLM generation call #5238 — Move tokenization from VLLMGeneration.generate into _generate_single_turn, so vLLM always receives raw token IDs
  5. [GRPO/RLOO] Unify tokenization across all generation backends in _generate_single_turn #5239 — Unify tokenization across all 3 backends (vLLM, paged, regular) at the top of _generate_single_turn
  6. [GRPO/RLOO] Extract tokenize prompts from _generate_single_turn #5240 — Extract _tokenize_prompts() method, make _generate_single_turn accept pre-tokenized inputs
  7. [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs #5242 — The actual fix: replace re-tokenization in _tool_call_loop with token ID concatenation
