Skip to content

Refactor KTO [3/N]: Extract dataset processing to _prepare_dataset method#4788

Open
albertvillanova wants to merge 1 commit into
huggingface:mainfrom
albertvillanova:refactor-kto-2a
Open

Refactor KTO [3/N]: Extract dataset processing to _prepare_dataset method#4788
albertvillanova wants to merge 1 commit into
huggingface:mainfrom
albertvillanova:refactor-kto-2a

Conversation

@albertvillanova

Copy link
Copy Markdown
Member

Refactor KTO [3/N]: Extract dataset processing to _prepare_dataset method.

This PR extracts inline dataset preprocessing logic from KTOTrainer.__init__() into a dedicated _prepare_dataset() method, aligning with SFTTrainer's architecture and improving code organization.

  • Architectural Pattern: Follows SFTTrainer._prepare_dataset pattern

Part of:

Problem

Before:

  • 75 lines of inline dataset processing in __init__()
  • Duplicated logic for train and eval datasets
  • Cannot test dataset preparation independently
  • Cannot reuse preprocessing for custom datasets
  • No format detection (always processes, even if already tokenized)
  • Complex __init__() with mixed concerns

After:

  • Clean, reusable _prepare_dataset() method
  • Single method for train/eval/custom datasets
  • Format detection (skips if already processed)
  • ~55 lines removed from __init__()
  • Testable dataset preparation logic
  • Aligns with SFTTrainer architecture

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +784 to +788
if is_processed:
logger.info(
f"{dataset_name} dataset is already processed (contains 'completion_input_ids'), skipping preprocessing."
)
return dataset

@qgallouedec qgallouedec Jan 8, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't allow pre-tokenized dataset. We do support it for SFT, but IMO it should remain the only trainer to support this


# Tokenize the dataset
dataset = dataset.map(
_tokenize,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function _tokenize could also be included inside this method, like in SFTTrainer

@albertvillanova albertvillanova mentioned this pull request Jan 8, 2026
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants