[GRPO] Sequence-level TIS + MIS #4493

@LeonEricsson

Feature request

By default, when vLLM is used for rollouts, we apply token-level Truncated Importance Sampling (TIS) to mitigate the training–inference mismatch. I propose we make sequence-level importance sampling the default instead, paired with either truncation (TIS) or masking (MIS).

Motivation

This is motivated by the following studies:

  1. Masked Importance Sampling (MIS), introduced by the Ling Team in IcePop, takes a more conservative approach by completely masking noisy gradient updates. Empirically, they found MIS to be more stable than TIS.

  2. The Qwen Team’s analysis of the training–inference mismatch provides deeper theoretical insight. In Section 4.2.1, they show that while token-level importance sampling has lower variance than sequence-level methods, it is a biased and theoretically unsound estimator. They instead propose sequence-level importance sampling. They also experimentally compare truncated (TIS) and masked (MIS) variants at both the token and sequence levels, finding Seq-MIS to be the most stable and effective approach.

  3. The strength of sequence-level masking (Seq-MIS) is further supported by recent work in Defeating the Training–Inference Mismatch via FP16 (Sections 4.1–4.2).
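
To make the three variants concrete, here is a minimal, framework-free sketch of how the weights differ for a single completion. It is an illustration, not the TRL implementation: the function names, the `cap` parameter and its default of 2.0, and the upper-only threshold in the masked variant are all assumptions made for this example, and the cited works may use different (for instance two-sided) bounds. Inputs are the per-token log-probabilities of the sampled tokens under the training policy and under the vLLM rollout policy.

```python
import math

def token_ratios(train_logps, rollout_logps):
    """Per-token importance ratios pi_train / pi_rollout."""
    return [math.exp(t - r) for t, r in zip(train_logps, rollout_logps)]

def sequence_ratio(train_logps, rollout_logps):
    """Sequence-level ratio: product of token ratios, computed in log space."""
    return math.exp(sum(t - r for t, r in zip(train_logps, rollout_logps)))

def token_tis(train_logps, rollout_logps, cap=2.0):
    """Token-level TIS: each token's ratio is truncated at `cap` independently."""
    return [min(w, cap) for w in token_ratios(train_logps, rollout_logps)]

def seq_tis(train_logps, rollout_logps, cap=2.0):
    """Sequence-level TIS: one truncated ratio shared by every token."""
    w = min(sequence_ratio(train_logps, rollout_logps), cap)
    return [w] * len(train_logps)

def seq_mis(train_logps, rollout_logps, cap=2.0):
    """Sequence-level MIS: the whole sequence is dropped (weight 0) when its
    ratio exceeds `cap`; otherwise the untruncated ratio is kept.
    Assumption: upper-only threshold, chosen for this sketch."""
    rho = sequence_ratio(train_logps, rollout_logps)
    w = rho if rho <= cap else 0.0
    return [w] * len(train_logps)

# Hypothetical log-probs for a 3-token completion with one badly mismatched token.
train = [-1.0, -0.5, -2.0]
rollout = [-1.2, -0.5, -3.5]

print("token-TIS:", token_tis(train, rollout))
print("seq-TIS:  ", seq_tis(train, rollout))
print("seq-MIS:  ", seq_mis(train, rollout))
```

In this example the single mismatched token pushes the sequence ratio above the cap, so token-level TIS clips only that token, Seq-TIS clips the weight for the whole sequence, and Seq-MIS removes the sequence from the gradient entirely. In a real trainer these weights would multiply the per-token loss and would be computed on batched tensors with a completion mask.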
