35% faster packing + rename `bfd-requeue` to `bfd_split` by mariosasko · Pull Request #5189 · huggingface/trl

mariosasko · 2026-02-26T15:50:42Z

What does this PR do?

This PR improves the packing logic to make it faster, less error-prone, and easier to read.

The main changes are:

Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from Preserve truncated tokens in BFD packing #4632
with a vectorized implementation that is significantly shorter and 30% faster.
Updating pack_examples to restore the input dataset’s format and perform proper input validation.
Applying a minor optimization to the wrapped packing implementation by reusing the offsets across all packed columns.
Aligning the naming with the literature (e.g., requeue → split) while preserving backward compatibility.

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a nice addition to the docs. ~~However, the report is not yet available on arXiv, so it cannot be cited as an HF paper.~~ Done 😄

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Note

Medium Risk
Touches core SFT data preprocessing (packing/truncation) and changes strategy semantics/validation, which can alter token ordering/retention and training results if misconfigured; backward compatibility is handled via a deprecation warning.

Overview
Improves SFT dataset packing and clarifies packing strategies. The BFD packer is rewritten with a more vectorized PyArrow implementation, adds shared column validation, filters empty sequences, and introduces an explicit overflow behavior switch (truncate vs split) used by the new "bfd_split" strategy.

API/UX updates around packing. pack_dataset now validates strategies, preserves/restores the original dataset format (including IterableDatasets) and ensures seq_lengths is included in formatted columns for BFD-based strategies; truncate_dataset is unified to the same Arrow-based map + format restore path.

Naming + docs/tests alignment. SFTConfig deprecates bfd-requeue by warning and mapping it to bfd_split, SFTTrainer treats bfd_split like bfd for padding-free/FlashAttention checks, tests are updated to assert format preservation and bfd_split behavior (including empty sequences), and docs add/expand guidance and citations for bfd_split vs wrapped.

^{Written by Cursor Bugbot for commit b265bbf. This will update automatically on new commits. Configure here.}

HuggingFaceDocBuilderDev · 2026-02-26T16:30:06Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…-chunking

qgallouedec

thanks, just

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

…nto vectorized-bfd-chunking

qgallouedec · 2026-03-13T17:42:44Z

Thanks @mariosasko reviewing it today

…nto vectorized-bfd-chunking

… ValError in `pack_dataset`; align doc with help

qgallouedec · 2026-03-13T22:28:40Z

import gc
import time

import numpy as np
from datasets import Dataset

from trl import pack_dataset


def generate_dataset(num_examples, mean_length=256, std_length=128, seed=42):
    """Generate a synthetic dataset with variable-length sequences."""
    rng = np.random.default_rng(seed)
    lengths = rng.normal(mean_length, std_length, num_examples).astype(int).clip(min=1)
    input_ids = [rng.integers(0, 32000, size=length).tolist() for length in lengths]
    attention_mask = [[1] * length for length in lengths]
    return Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})


def benchmark(num_examples, seq_length, strategy, min_runs=5, min_time=2.0, **kwargs):
    dataset = generate_dataset(num_examples, **kwargs)
    # Warmup (3 runs to stabilize caches)
    for _ in range(3):
        pack_dataset(dataset, seq_length=seq_length, strategy=strategy)

    # Run until we have enough samples and enough wall time
    times = []
    total_elapsed = 0.0
    while len(times) < min_runs or total_elapsed < min_time:
        gc.collect()
        gc.disable()
        start = time.perf_counter()
        packed = pack_dataset(dataset, seq_length=seq_length, strategy=strategy)
        elapsed = time.perf_counter() - start
        gc.enable()
        times.append(elapsed)
        total_elapsed += elapsed

    # Use median instead of mean to reduce sensitivity to outliers
    median_time = np.median(times)
    iqr = np.percentile(times, 75) - np.percentile(times, 25)
    total_tokens = sum(len(ids) for ids in dataset["input_ids"])
    packed_examples = len(packed)

    return {
        "num_examples": num_examples,
        "seq_length": seq_length,
        "strategy": strategy,
        "total_tokens": total_tokens,
        "packed_examples": packed_examples,
        "median_time": median_time,
        "iqr": iqr,
        "num_runs": len(times),
        "throughput_examples_per_sec": num_examples / median_time,
        "throughput_tokens_per_sec": total_tokens / median_time,
    }


if __name__ == "__main__":
    configs = [
        # Vary dataset size
        {"num_examples": 1_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 100_000, "seq_length": 512, "strategy": "bfd"},
        # Vary seq_length
        {"num_examples": 10_000, "seq_length": 256, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 1024, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 4096, "strategy": "bfd"},
        # Compare strategies
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd_split"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "wrapped"},
    ]

    header = f"{'strategy':<12} {'examples':>10} {'seq_len':>8} {'packed':>8} {'runs':>5} {'median (s)':>14} {'tok/s':>14}"
    print(header)
    print("-" * len(header))

    for cfg in configs:
        result = benchmark(**cfg)
        print(
            f"{result['strategy']:<12} "
            f"{result['num_examples']:>10,} "
            f"{result['seq_length']:>8} "
            f"{result['packed_examples']:>8,} "
            f"{result['num_runs']:>5} "
            f"{result['median_time']:>8.3f} ± {result['iqr']:.3f} "
            f"{result['throughput_tokens_per_sec']:>14,.0f}"
        )

Nice!!

TBH, I didn't look deeply into the new logic. My understanding is pretty basic, I get the general idea of how it works, I don't see any obvious mistakes, the tests are passing, and the agents are happy. So I'm going to approve and merge this. Albert, feel free to take a look too later; you have more experience with datasets than I do.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

commit 52cd0cc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:31:26 2026 +0100 Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295) commit 7b42fc4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:29:11 2026 +0100 Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286) commit 3acb8e8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:27:10 2026 +0100 Support max_length in DPO VLM training (#5284) commit ee339a0 Author: Carlos Miguel Patiño <carlos.patino@huggingface.co> Date: Tue Mar 17 14:01:44 2026 +0100 [GKD] Buffer Implementation for Distillation Trainer (#5137) Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit d46131f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 15:27:19 2026 +0100 Remove custom get_train/eval_dataloader from OnlineDPO (#5291) commit 85cf8f4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 15:24:24 2026 +0100 Remove TrainingArguments import from experimental trainers (#5290) commit 91e3da0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Mar 16 07:19:51 2026 -0600 Fix `accuracy_reward` crash when called from non-main thread (#5281) commit 4996631 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 07:44:28 2026 +0100 Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274) commit 5fceaa7 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 07:43:34 2026 +0100 Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273) commit 406d406 Author: casinca <47400729+casinca@users.noreply.github.com> Date: Sat Mar 14 04:12:49 2026 +0100 feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199) commit d0ac7ef Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Sat Mar 14 02:53:33 2026 +0100 Allow nullable logprobs in vLLM serve responses (#5203) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit c0eabc4 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Mar 13 18:19:15 2026 -0600 Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> commit 6c0fccd Author: Mario Šaško <mariosasko777@gmail.com> Date: Sat Mar 14 00:19:38 2026 +0100 35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 3972d66 Author: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Wed Mar 18 22:26:44 2026 +0100 Suggest the `Json()` type for tool calling dataset format (#5307) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 5c6e915 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Mar 18 14:55:19 2026 -0600 Update `RewardFunc` type annotation to allow `None`values in reward list (#5297) commit ee96845 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Mar 18 17:03:54 2026 +0100 Fix DPOTrainer collators to truncate sequences before padding (#5305) commit 435c2ae Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Mar 18 08:09:42 2026 -0600 Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` (#5294) commit 26ce6a3 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Mar 18 00:44:12 2026 -0600 Apply docstyle (#5296) commit 52cd0cc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:31:26 2026 +0100 Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295) commit 7b42fc4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:29:11 2026 +0100 Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286) commit 3acb8e8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 17 15:27:10 2026 +0100 Support max_length in DPO VLM training (#5284) commit ee339a0 Author: Carlos Miguel Patiño <carlos.patino@huggingface.co> Date: Tue Mar 17 14:01:44 2026 +0100 [GKD] Buffer Implementation for Distillation Trainer (#5137) Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit d46131f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 15:27:19 2026 +0100 Remove custom get_train/eval_dataloader from OnlineDPO (#5291) commit 85cf8f4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 15:24:24 2026 +0100 Remove TrainingArguments import from experimental trainers (#5290) commit 91e3da0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Mar 16 07:19:51 2026 -0600 Fix `accuracy_reward` crash when called from non-main thread (#5281) commit 4996631 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 07:44:28 2026 +0100 Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274) commit 5fceaa7 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 16 07:43:34 2026 +0100 Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273) commit 406d406 Author: casinca <47400729+casinca@users.noreply.github.com> Date: Sat Mar 14 04:12:49 2026 +0100 feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199) commit d0ac7ef Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Sat Mar 14 02:53:33 2026 +0100 Allow nullable logprobs in vLLM serve responses (#5203) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit c0eabc4 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Mar 13 18:19:15 2026 -0600 Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> commit 6c0fccd Author: Mario Šaško <mariosasko777@gmail.com> Date: Sat Mar 14 00:19:38 2026 +0100 35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

…#5189) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

mariosasko added 5 commits February 26, 2026 11:01

Vectorize splitting of sequences longer than seq_length in BFD packing

ae3f3d9

Minor improvements

ce371f0

Merge with upstream

d5b5ec3

Nit

b393b4e

More nits

a32b5a5

qgallouedec requested a review from albertvillanova February 26, 2026 16:24

mariosasko and others added 2 commits March 4, 2026 14:06

Merge branch 'main' into vectorized-bfd-chunking

c216184

Merge branch 'main' into vectorized-bfd-chunking

819c2f8

cursor Bot reviewed Mar 7, 2026

View reviewed changes

Comment thread trl/data_utils.py Outdated

Comment thread trl/data_utils.py

mariosasko added 3 commits March 9, 2026 19:54

Improvements and fixes to make the bot happy

25b0502

Tests

fb6ffc9

Merge branch 'main' of github.com:huggingface/trl into vectorized-bfd…

dc23b74

…-chunking

cursor Bot reviewed Mar 9, 2026

View reviewed changes

Comment thread trl/data_utils.py

Nit

5caa917

qgallouedec reviewed Mar 10, 2026

View reviewed changes

Comment thread docs/source/data_utils.md Outdated

Comment thread trl/data_utils.py Outdated

Comment thread trl/trainer/sft_config.py

mariosasko and others added 5 commits March 11, 2026 14:06

Account for possible empty sequences

e08d8fe

Resolve conflict

86e89a2

Update docs/source/data_utils.md

a766059

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

Fix BFD formatting

15a8367

Merge branch 'vectorized-bfd-chunking' of github.com:mariosasko/trl i…

ccc3e8e

…nto vectorized-bfd-chunking

cursor Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread trl/data_utils.py Outdated

mariosasko requested a review from qgallouedec March 13, 2026 17:25

Merge branch 'main' into vectorized-bfd-chunking

9ae9266

cursor Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread trl/data_utils.py Outdated

mariosasko and others added 4 commits March 13, 2026 21:17

Don't mutate in-place

39ba691

Add paper reference

7508775

Merge branch 'vectorized-bfd-chunking' of github.com:mariosasko/trl i…

f47060b

…nto vectorized-bfd-chunking

drop enum; deprecate bfd-requeue; bfd_split -> bfd-split`, re-add…

4356e20

… ValError in `pack_dataset`; align doc with help

qgallouedec added 3 commits March 13, 2026 21:57

paper index and revisit doc

4737aee

- -> _

42d4be7

fix word

37b168d

cursor Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread trl/trainer/sft_config.py

qgallouedec changed the title ~~Misc packing improvements~~ 35% faster packing + rename bfd-requeue to bfd_split Mar 13, 2026

fix backward compt

5007bec

qgallouedec approved these changes Mar 13, 2026

View reviewed changes

better doc rendering

b265bbf

cursor Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread trl/data_utils.py

qgallouedec merged commit 6c0fccd into huggingface:main Mar 13, 2026
1 check passed

mariosasko deleted the vectorized-bfd-chunking branch March 13, 2026 23:27

albertvillanova mentioned this pull request Apr 7, 2026

Fix SFT deprecation warning #5466

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

35% faster packing + rename `bfd-requeue` to `bfd_split`#5189

35% faster packing + rename `bfd-requeue` to `bfd_split`#5189
qgallouedec merged 26 commits into
huggingface:mainfrom
mariosasko:vectorized-bfd-chunking

mariosasko commented Feb 26, 2026 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Feb 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qgallouedec left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Mar 13, 2026

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Mar 13, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mariosasko commented Feb 26, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

Who can review?

Uh oh!

HuggingFaceDocBuilderDev commented Feb 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qgallouedec left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Mar 13, 2026

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Mar 13, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mariosasko commented Feb 26, 2026 •

edited by cursor Bot

Loading