Re-tokenization bug in GRPO multi-turn tool calling
The bug
When GRPOTrainer runs multi-turn tool calling, each iteration of _tool_call_loop does this:
1. The model generates a completion (producing token IDs)
2. The token IDs are decoded to text
3. The text is appended as an assistant message to the conversation
4. The full conversation (prompt + completion + tool result) is re-tokenized via apply_chat_template
5. The re-tokenized IDs are passed to the next generation call
The problem is step 4: BPE is not a bijection. Decoding token IDs to text and re-tokenizing can produce different token IDs. For example, the token sequence ["he", "llo"] decodes to "hello", which re-tokenizes to ["hello"]: different IDs, same text. This means the log-probabilities used for the policy gradient no longer correspond to the tokens that were actually sampled.
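The round-trip failure can be reproduced without any real tokenizer. The sketch below uses a made-up five-entry vocabulary and a greedy longest-match encoder as a stand-in for BPE merges; the vocabulary, the IDs, and the `encode`/`decode` helpers are all hypothetical and only illustrate the "he" + "llo" example above.

```python
# Toy vocabulary (hypothetical): the merged token and its pieces both exist.
VOCAB = {"he": 0, "llo": 1, "hello": 2, " ": 3, "world": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def decode(ids):
    """Concatenate the token strings for a list of IDs."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization, a simplified stand-in for BPE."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token matches at position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


sampled_ids = [0, 1]             # the model sampled "he" then "llo"
text = decode(sampled_ids)       # both tokens collapse into one string
retokenized_ids = encode(text)   # the encoder prefers the single merged token

print(text, sampled_ids, retokenized_ids)
```

Both ID sequences decode to the same string, so nothing looks wrong at the text level; the mismatch only shows up when per-token log-probabilities are aligned against the IDs.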
The fix
Tokenize once, never re-tokenize. Build a token-in / token-out pipeline where raw token IDs flow through the entire generation loop without ever being decoded and re-tokenized.
In _tool_call_loop, instead of re-tokenizing the full conversation, build the next prompt by concatenation:
The original prompt_ids and completion_ids are preserved exactly. Only the tool result portion (template formatting + tool output) is freshly tokenized.
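A minimal sketch of the concatenation idea, assuming the tool result suffix can be tokenized on its own. `toy_tokenize` and `build_next_prompt_ids` are hypothetical names introduced for illustration, not functions from the trainer, and the one-ID-per-character tokenizer stands in for whatever tokenizes the template formatting and tool output in practice.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one ID per character."""
    return [ord(ch) for ch in text]


def build_next_prompt_ids(prompt_ids, completion_ids, tool_result_text):
    """Build the next-turn prompt without re-tokenizing earlier turns."""
    # Only the tool result portion is tokenized here. In a real pipeline this
    # suffix would also carry the chat-template formatting around the output.
    tool_suffix_ids = toy_tokenize(tool_result_text)
    # The prompt and the sampled completion are reused verbatim as IDs;
    # they are never decoded to text and encoded again.
    return list(prompt_ids) + list(completion_ids) + tool_suffix_ids


prompt_ids = [10, 11, 12]
completion_ids = [0, 1]  # e.g. "he" + "llo", exactly as sampled
next_prompt_ids = build_next_prompt_ids(prompt_ids, completion_ids, "42")

# The earlier turns form an untouched prefix of the next prompt.
prefix_len = len(prompt_ids) + len(completion_ids)
print(next_prompt_ids[:prefix_len] == prompt_ids + completion_ids)
```

Because the sampled IDs form an exact prefix of the next prompt, the log-probabilities recorded at sampling time still line up with the tokens the model is conditioned on in later turns.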
PRs
The fix is split into 7 incremental PRs, each building on the previous:
prompts in vLLM client and server #5225: Add prompt_token_ids support to vLLM client/server, so vLLM can receive pre-tokenized IDs
rollout_func from _generate_single_turn to _generate #5232: Move rollout_func out of _generate_single_turn into _generate (prep refactor)
VLLMGeneration.generate into _generate_single_turn, so vLLM always receives raw token IDs
_generate_single_turn #5239: Unify tokenization across all 3 backends (vLLM, paged, regular) at the top of _generate_single_turn
_generate_single_turn #5240: Extract _tokenize_prompts() method, make _generate_single_turn accept pre-tokenized inputs
_tool_call_loop with token ID concatenation