Skip to content

35% faster packing + rename bfd-requeue to bfd_split#5189

Merged
qgallouedec merged 26 commits into
huggingface:mainfrom
mariosasko:vectorized-bfd-chunking
Mar 13, 2026
Merged

35% faster packing + rename bfd-requeue to bfd_split#5189
qgallouedec merged 26 commits into
huggingface:mainfrom
mariosasko:vectorized-bfd-chunking

Conversation

@mariosasko

@mariosasko mariosasko commented Feb 26, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR improves the packing logic to make it faster, less error-prone, and easier to read.

The main changes are:

  • Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from Preserve truncated tokens in BFD packing #4632
    with a vectorized implementation that is significantly shorter and 30% faster.

  • Updating pack_examples to restore the input dataset’s format and perform proper input validation.

  • Applying a minor optimization to the wrapped packing implementation by reusing the offsets across all packed columns.

  • Aligning the naming with the literature (e.g., requeuesplit) while preserving backward compatibility.

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a nice addition to the docs. However, the report is not yet available on arXiv, so it cannot be cited as an HF paper. Done 😄

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core SFT data preprocessing (packing/truncation) and changes strategy semantics/validation, which can alter token ordering/retention and training results if misconfigured; backward compatibility is handled via a deprecation warning.

Overview
Improves SFT dataset packing and clarifies packing strategies. The BFD packer is rewritten with a more vectorized PyArrow implementation, adds shared column validation, filters empty sequences, and introduces an explicit overflow behavior switch (truncate vs split) used by the new "bfd_split" strategy.

API/UX updates around packing. pack_dataset now validates strategies, preserves/restores the original dataset format (including IterableDatasets) and ensures seq_lengths is included in formatted columns for BFD-based strategies; truncate_dataset is unified to the same Arrow-based map + format restore path.

Naming + docs/tests alignment. SFTConfig deprecates bfd-requeue by warning and mapping it to bfd_split, SFTTrainer treats bfd_split like bfd for padding-free/FlashAttention checks, tests are updated to assert format preservation and bfd_split behavior (including empty sequences), and docs add/expand guidance and citations for bfd_split vs wrapped.

Written by Cursor Bugbot for commit b265bbf. This will update automatically on new commits. Configure here.

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread trl/data_utils.py Outdated
Comment thread trl/data_utils.py
Comment thread trl/data_utils.py

@qgallouedec qgallouedec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, just

Comment thread docs/source/data_utils.md Outdated
Comment thread trl/data_utils.py Outdated
Comment thread trl/trainer/sft_config.py
Comment thread trl/data_utils.py Outdated
@mariosasko mariosasko requested a review from qgallouedec March 13, 2026 17:25
@qgallouedec

Copy link
Copy Markdown
Member

Thanks @mariosasko reviewing it today

Comment thread trl/data_utils.py Outdated
Comment thread trl/trainer/sft_config.py
@qgallouedec qgallouedec changed the title Misc packing improvements 35% faster packing + rename bfd-requeue to bfd_split Mar 13, 2026
@qgallouedec

Copy link
Copy Markdown
Member
import gc
import time

import numpy as np
from datasets import Dataset

from trl import pack_dataset


def generate_dataset(num_examples, mean_length=256, std_length=128, seed=42):
    """Generate a synthetic dataset with variable-length sequences."""
    rng = np.random.default_rng(seed)
    lengths = rng.normal(mean_length, std_length, num_examples).astype(int).clip(min=1)
    input_ids = [rng.integers(0, 32000, size=length).tolist() for length in lengths]
    attention_mask = [[1] * length for length in lengths]
    return Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})


def benchmark(num_examples, seq_length, strategy, min_runs=5, min_time=2.0, **kwargs):
    dataset = generate_dataset(num_examples, **kwargs)
    # Warmup (3 runs to stabilize caches)
    for _ in range(3):
        pack_dataset(dataset, seq_length=seq_length, strategy=strategy)

    # Run until we have enough samples and enough wall time
    times = []
    total_elapsed = 0.0
    while len(times) < min_runs or total_elapsed < min_time:
        gc.collect()
        gc.disable()
        start = time.perf_counter()
        packed = pack_dataset(dataset, seq_length=seq_length, strategy=strategy)
        elapsed = time.perf_counter() - start
        gc.enable()
        times.append(elapsed)
        total_elapsed += elapsed

    # Use median instead of mean to reduce sensitivity to outliers
    median_time = np.median(times)
    iqr = np.percentile(times, 75) - np.percentile(times, 25)
    total_tokens = sum(len(ids) for ids in dataset["input_ids"])
    packed_examples = len(packed)

    return {
        "num_examples": num_examples,
        "seq_length": seq_length,
        "strategy": strategy,
        "total_tokens": total_tokens,
        "packed_examples": packed_examples,
        "median_time": median_time,
        "iqr": iqr,
        "num_runs": len(times),
        "throughput_examples_per_sec": num_examples / median_time,
        "throughput_tokens_per_sec": total_tokens / median_time,
    }


if __name__ == "__main__":
    configs = [
        # Vary dataset size
        {"num_examples": 1_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd"},
        {"num_examples": 100_000, "seq_length": 512, "strategy": "bfd"},
        # Vary seq_length
        {"num_examples": 10_000, "seq_length": 256, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 1024, "strategy": "bfd"},
        {"num_examples": 10_000, "seq_length": 4096, "strategy": "bfd"},
        # Compare strategies
        {"num_examples": 10_000, "seq_length": 512, "strategy": "bfd_split"},
        {"num_examples": 10_000, "seq_length": 512, "strategy": "wrapped"},
    ]

    header = f"{'strategy':<12} {'examples':>10} {'seq_len':>8} {'packed':>8} {'runs':>5} {'median (s)':>14} {'tok/s':>14}"
    print(header)
    print("-" * len(header))

    for cfg in configs:
        result = benchmark(**cfg)
        print(
            f"{result['strategy']:<12} "
            f"{result['num_examples']:>10,} "
            f"{result['seq_length']:>8} "
            f"{result['packed_examples']:>8,} "
            f"{result['num_runs']:>5} "
            f"{result['median_time']:>8.3f} ± {result['iqr']:.3f} "
            f"{result['throughput_tokens_per_sec']:>14,.0f}"
        )
benchmark_results

Nice!!

TBH, I didn't look deeply into the new logic. My understanding is pretty basic, I get the general idea of how it works, I don't see any obvious mistakes, the tests are passing, and the agents are happy. So I'm going to approve and merge this. Albert, feel free to take a look too later; you have more experience with datasets than I do.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread trl/data_utils.py
@qgallouedec qgallouedec merged commit 6c0fccd into huggingface:main Mar 13, 2026
1 check passed
@mariosasko mariosasko deleted the vectorized-bfd-chunking branch March 13, 2026 23:27
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 52cd0cc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:31:26 2026 +0100

    Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295)

commit 7b42fc4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:29:11 2026 +0100

    Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286)

commit 3acb8e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:27:10 2026 +0100

    Support max_length in DPO VLM training (#5284)

commit ee339a0
Author: Carlos Miguel Patiño <carlos.patino@huggingface.co>
Date:   Tue Mar 17 14:01:44 2026 +0100

    [GKD] Buffer Implementation for Distillation Trainer (#5137)

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

commit d46131f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:27:19 2026 +0100

    Remove custom get_train/eval_dataloader from OnlineDPO (#5291)

commit 85cf8f4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:24:24 2026 +0100

    Remove TrainingArguments import from experimental trainers (#5290)

commit 91e3da0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Mar 16 07:19:51 2026 -0600

    Fix `accuracy_reward` crash when called from non-main thread (#5281)

commit 4996631
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:44:28 2026 +0100

    Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274)

commit 5fceaa7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:43:34 2026 +0100

    Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273)

commit 406d406
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Sat Mar 14 04:12:49 2026 +0100

    feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199)

commit d0ac7ef
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Sat Mar 14 02:53:33 2026 +0100

    Allow nullable logprobs in vLLM serve responses  (#5203)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit c0eabc4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Mar 13 18:19:15 2026 -0600

    Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255)

    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

commit 6c0fccd
Author: Mario Šaško <mariosasko777@gmail.com>
Date:   Sat Mar 14 00:19:38 2026 +0100

    35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
qgallouedec added a commit that referenced this pull request Mar 18, 2026
commit 3972d66
Author: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date:   Wed Mar 18 22:26:44 2026 +0100

    Suggest the `Json()` type for tool calling dataset format (#5307)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 5c6e915
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 14:55:19 2026 -0600

    Update `RewardFunc` type annotation to allow `None`values in reward list (#5297)

commit ee96845
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Mar 18 17:03:54 2026 +0100

    Fix DPOTrainer collators to truncate sequences before padding (#5305)

commit 435c2ae
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 08:09:42 2026 -0600

    Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` (#5294)

commit 26ce6a3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Mar 18 00:44:12 2026 -0600

    Apply docstyle (#5296)

commit 52cd0cc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:31:26 2026 +0100

    Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model (#5295)

commit 7b42fc4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:29:11 2026 +0100

    Prevent corruption of DPO VLM training if "keep_end" truncation_mode (#5286)

commit 3acb8e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 17 15:27:10 2026 +0100

    Support max_length in DPO VLM training (#5284)

commit ee339a0
Author: Carlos Miguel Patiño <carlos.patino@huggingface.co>
Date:   Tue Mar 17 14:01:44 2026 +0100

    [GKD] Buffer Implementation for Distillation Trainer (#5137)

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

commit d46131f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:27:19 2026 +0100

    Remove custom get_train/eval_dataloader from OnlineDPO (#5291)

commit 85cf8f4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 15:24:24 2026 +0100

    Remove TrainingArguments import from experimental trainers (#5290)

commit 91e3da0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Mar 16 07:19:51 2026 -0600

    Fix `accuracy_reward` crash when called from non-main thread (#5281)

commit 4996631
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:44:28 2026 +0100

    Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string (#5274)

commit 5fceaa7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 16 07:43:34 2026 +0100

    Simplify structured outputs logic across vLLM versions in scripts/vllm_serve (#5273)

commit 406d406
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Sat Mar 14 04:12:49 2026 +0100

    feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) (#5199)

commit d0ac7ef
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Sat Mar 14 02:53:33 2026 +0100

    Allow nullable logprobs in vLLM serve responses  (#5203)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit c0eabc4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Mar 13 18:19:15 2026 -0600

    Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide (#5255)

    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

commit 6c0fccd
Author: Mario Šaško <mariosasko777@gmail.com>
Date:   Sat Mar 14 00:19:38 2026 +0100

    35% faster packing + rename `bfd-requeue` to `bfd_split` (#5189)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
songhappy pushed a commit to songhappy/trl that referenced this pull request Apr 20, 2026
…#5189)

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants