Re-tokenization bug in GRPO multi-turn tool calling

# Re-tokenization bug in GRPO multi-turn tool calling

## The bug

When `GRPOTrainer` runs multi-turn tool calling, each iteration of `_tool_call_loop` does this:

1. The model generates a completion (producing token IDs)
2. The token IDs are **decoded** to text
3. The text is appended as an assistant message to the conversation
4. The full conversation (prompt + completion + tool result) is re-tokenized via `apply_chat_template`
5. The re-tokenized IDs are passed to the next generation call

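The problem is the round trip through text that ends in step 4: BPE tokenization is not a bijection. Several token sequences can decode to the same text, but encoding that text returns only one of them, so decoding token IDs and re-tokenizing can produce different token IDs. For example, the token sequence `["he", "llo"]` decodes to `"hello"`, which re-tokenizes to `["hello"]`: same text, different IDs. When this happens, the sequence fed to the next turn differs from what the model sampled, and the log-probabilities used for the policy gradient no longer correspond to the tokens that were actually sampled.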
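The `"hello"` example can be reproduced with a toy tokenizer. The sketch below is only an illustration: the three-entry vocabulary, the ID values, and the greedy longest-match `encode` are assumptions standing in for a real BPE tokenizer, not code from `GRPOTrainer`.

```python
# Toy vocabulary; the IDs are arbitrary and chosen only for illustration.
VOCAB = {"he": 0, "llo": 1, "hello": 2}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def decode(token_ids):
    """Concatenate token strings, as a tokenizer's decode step does."""
    return "".join(ID_TO_TOKEN[i] for i in token_ids)


def encode(text):
    """Greedy longest-match encoding (a simplified stand-in for BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


sampled_ids = [VOCAB["he"], VOCAB["llo"]]  # what the model actually sampled
text = decode(sampled_ids)                 # "hello"
retokenized_ids = encode(text)             # canonical encoding of "hello"

print(sampled_ids, text, retokenized_ids)  # [0, 1] hello [2]
```

Both ID sequences decode to the same string, so the mismatch is invisible at the text level and only shows up when comparing IDs.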

## The fix

**Tokenize once, never re-tokenize.** Build a token-in / token-out pipeline where raw token IDs flow through the entire generation loop without ever being decoded and re-tokenized.

In `_tool_call_loop`, instead of re-tokenizing the full conversation, build the next prompt by concatenation:

```
next_prompt_ids = prompt_ids + completion_ids + tool_suffix_ids
```

The original `prompt_ids` and `completion_ids` are preserved exactly. Only the tool result portion (template formatting + tool output) is freshly tokenized.
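As a minimal sketch of the concatenation, assuming plain Python lists of token IDs: the helper name `build_next_prompt` and the ID values are hypothetical and chosen for illustration, not taken from the actual `_tool_call_loop` code.

```python
def build_next_prompt(prompt_ids, completion_ids, tool_suffix_ids):
    """Concatenate token IDs; prompt and completion IDs are never re-encoded."""
    return list(prompt_ids) + list(completion_ids) + list(tool_suffix_ids)


# Hypothetical token IDs standing in for real tokenizer output.
prompt_ids = [101, 7, 8, 9]     # tokenized once, before the first generation
completion_ids = [0, 1]         # sampled by the model, kept as-is
tool_suffix_ids = [55, 56, 57]  # freshly tokenized template formatting + tool output

next_prompt_ids = build_next_prompt(prompt_ids, completion_ids, tool_suffix_ids)

# The sampled tokens keep the same IDs at the same positions in the next prompt.
start = len(prompt_ids)
end = start + len(completion_ids)
assert next_prompt_ids[:start] == prompt_ids
assert next_prompt_ids[start:end] == completion_ids
```

Because the prefix is copied rather than re-encoded, the IDs the model is conditioned on in the next turn are the same IDs whose log-probabilities were recorded.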

## PRs

The fix is split into 7 incremental PRs, each building on the previous:

1. **#5225** — Add `prompt_token_ids` support to vLLM client/server, so vLLM can receive pre-tokenized IDs
2. **#5227** — Extend the token ID path to support VLMs (images alongside token IDs)
3. **#5232** — Move `rollout_func` out of `_generate_single_turn` into `_generate` (prep refactor)
4. **#5238** — Move tokenization from `VLLMGeneration.generate` into `_generate_single_turn`, so vLLM always receives raw token IDs
5. **#5239** — Unify tokenization across all 3 backends (vLLM, paged, regular) at the top of `_generate_single_turn`
6. **#5240** — Extract `_tokenize_prompts()` method, make `_generate_single_turn` accept pre-tokenized inputs
7. **#5242** — The actual fix: replace re-tokenization in `_tool_call_loop` with token ID concatenation


Re-tokenization bug in GRPO multi-turn tool calling #5224
