Feature request
By default, when vLLM is used for rollouts, we apply Token-level Truncated Importance Sampling (TIS) to mitigate the training–inference mismatch. I propose we implement sequence-level importance sampling as the default behavior, paired with either truncation (TIS) or masking (MIS), instead of the current token-level approach.
Motivation
This is motivated by the following studies:
- Masked Importance Sampling (MIS), introduced by the Ling Team in IcePop, takes a more conservative approach by completely masking noisy gradient updates. Empirically, they found MIS to be more stable than TIS.
- The Qwen Team’s analysis of the training–inference mismatch provides deeper theoretical insight. In Section 4.2.1, they show that while token-level importance sampling has lower variance than sequence-level methods, it is a biased and theoretically unsound estimator. They propose instead using sequence-level importance sampling and experimentally compare truncated (TIS) and masked (MIS) variants at both token and sequence levels, finding Seq-MIS to be the most stable and effective approach.
- The strength of sequence-level masking (Seq-MIS) is further supported by recent work in Defeating the Training–Inference Mismatch via FP16 (Sections 4.1–4.2).
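
To make the four variants concrete, here is a minimal pure-Python sketch of how the importance weights could differ between token-level and sequence-level correction, with either truncation or masking. It is not the library's actual implementation: the function names, the `level`/`mode` arguments, and the `cap` value of 2.0 are all hypothetical, and masking is simplified to an upper bound only (the cited works may use different or two-sided bounds). Inputs are per-token log-probabilities of the sampled tokens under the training policy and under the rollout (vLLM) policy.

```python
import math


def token_ratios(train_logps, rollout_logps):
    """Per-token importance ratios pi_train(y_t) / pi_rollout(y_t)."""
    return [math.exp(t - r) for t, r in zip(train_logps, rollout_logps)]


def sequence_ratio(train_logps, rollout_logps):
    """Sequence-level ratio: the product of the per-token ratios,
    computed in log space for numerical stability."""
    return math.exp(sum(train_logps) - sum(rollout_logps))


def truncate(ratio, cap):
    """TIS: clip the ratio from above at `cap`."""
    return min(ratio, cap)


def mask(ratio, cap):
    """MIS (upper-bound-only simplification): drop the update entirely
    when the ratio exceeds `cap`, otherwise keep the ratio unchanged."""
    return ratio if ratio <= cap else 0.0


def importance_weights(train_logps, rollout_logps,
                       level="sequence", mode="mask", cap=2.0):
    """Return one weight per token (hypothetical helper).

    level: "token" corrects each token independently;
           "sequence" applies one shared weight to every token.
    mode:  "truncate" (TIS) or "mask" (MIS).
    """
    if level not in ("token", "sequence"):
        raise ValueError(f"unknown level: {level!r}")
    if mode not in ("truncate", "mask"):
        raise ValueError(f"unknown mode: {mode!r}")
    if len(train_logps) != len(rollout_logps):
        raise ValueError("log-prob lists must have the same length")

    correct = truncate if mode == "truncate" else mask
    if level == "token":
        return [correct(r, cap)
                for r in token_ratios(train_logps, rollout_logps)]
    weight = correct(sequence_ratio(train_logps, rollout_logps), cap)
    return [weight] * len(train_logps)


if __name__ == "__main__":
    # Toy sequence in which the last token shows a large mismatch.
    train = [-1.0, -0.5, -2.0]
    rollout = [-1.2, -0.5, -3.0]
    for level in ("token", "sequence"):
        for mode in ("truncate", "mask"):
            print(level, mode,
                  importance_weights(train, rollout, level, mode))
```

In this toy example a single badly mismatched token only affects its own weight under token-level correction, whereas under sequence-level masking it pushes the whole sequence over the cap and removes it from the update. That is the more conservative behavior the studies above argue for.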