Refactor KTO [3/N]: Extract dataset processing to _prepare_dataset method by albertvillanova · Pull Request #4788 · huggingface/trl

albertvillanova · 2026-01-08T09:21:21Z

Refactor KTO [3/N]: Extract dataset processing to _prepare_dataset method.

This PR extracts inline dataset preprocessing logic from KTOTrainer.__init__() into a dedicated _prepare_dataset() method, aligning with SFTTrainer's architecture and improving code organization.

Architectural Pattern: Follows SFTTrainer._prepare_dataset pattern

Part of:

KTO refactoring #4786

Problem

Before:

75 lines of inline dataset processing in __init__()
Duplicated logic for train and eval datasets
Cannot test dataset preparation independently
Cannot reuse preprocessing for custom datasets
No format detection (always processes, even if already tokenized)
Complex __init__() with mixed concerns

After:

Clean, reusable _prepare_dataset() method
Single method for train/eval/custom datasets
Format detection (skips if already processed)
~55 lines removed from __init__()
Testable dataset preparation logic
Aligns with SFTTrainer architecture
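The extraction described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's actual code: `ToyTrainer` and `toy_tokenize` are hypothetical names, plain lists of dicts stand in for a real dataset, and the tokenizer is a toy. Only the `_prepare_dataset` name and the `completion_input_ids` check are taken from the PR.

```python
import logging

logger = logging.getLogger(__name__)


def toy_tokenize(example):
    # Hypothetical stand-in for real tokenization: one code point per character.
    return {
        **example,
        "prompt_input_ids": [ord(c) for c in example["prompt"]],
        "completion_input_ids": [ord(c) for c in example["completion"]],
    }


class ToyTrainer:
    """Hypothetical trainer used for illustration; not the actual KTOTrainer."""

    def __init__(self, train_dataset, eval_dataset=None):
        # __init__ only delegates; the same method serves both splits.
        self.train_dataset = self._prepare_dataset(train_dataset, "train")
        self.eval_dataset = (
            self._prepare_dataset(eval_dataset, "eval") if eval_dataset is not None else None
        )

    def _prepare_dataset(self, dataset, dataset_name):
        # Format detection: rows that already carry token IDs are returned untouched.
        is_processed = len(dataset) > 0 and "completion_input_ids" in dataset[0]
        if is_processed:
            logger.info("%s dataset is already processed, skipping preprocessing.", dataset_name)
            return dataset
        return [toy_tokenize(example) for example in dataset]
```

Because the preparation step is now a method of its own, it can be called on a small list in a unit test without constructing a model or running training.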

HuggingFaceDocBuilderDev · 2026-01-08T09:24:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2026-01-08T13:49:54Z

+            if is_processed:
+                logger.info(
+                    f"{dataset_name} dataset is already processed (contains 'completion_input_ids'), skipping preprocessing."
+                )
+                return dataset


I think we shouldn't allow pre-tokenized datasets. We do support them for SFT, but IMO it should remain the only trainer that does.

qgallouedec · 2026-01-08T13:50:52Z

+
+            # Tokenize the dataset
+            dataset = dataset.map(
+                _tokenize,


The function _tokenize could also be defined inside this method, like in SFTTrainer.
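For illustration only, nesting the helper might look like the sketch below. The names `prepare_dataset` and `tokenize`, the `max_length` parameter, and the toy character-level tokenizer are all assumptions made for this example; none of them come from the PR or from SFTTrainer.

```python
def prepare_dataset(dataset, max_length=None):
    # Nested helper: defined inside the preparation function so it is scoped
    # to dataset preparation and can close over its arguments (max_length).
    def tokenize(example):
        ids = [ord(c) for c in example["completion"]]  # toy tokenizer (assumption)
        return {**example, "completion_input_ids": ids[:max_length]}

    return [tokenize(example) for example in dataset]
```

The trade-off is that a nested function cannot be imported or tested on its own, but it keeps all preprocessing logic in one place.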

Extract dataset processing to _prepare_dataset method

3c19ebd

qgallouedec reviewed Jan 8, 2026

albertvillanova mentioned this pull request Jan 8, 2026

KTO refactoring #4786

Open

6 tasks

albertvillanova wants to merge 1 commit into huggingface:main from albertvillanova:refactor-kto-2a