Support composed splits in streaming datasets #8220

Merged
lhoestq merged 1 commit into huggingface:main from lanarkite99:resume/datasets-streaming-split-composition
Jun 5, 2026

@lanarkite99

Contributor

Fixes #2699
Fixes #4804

This PR adds support for unsliced split composition when loading datasets in streaming mode, e.g. split="train+validation".

Previously, DatasetBuilder.as_streaming_dataset() only accepted a single split name or returned all splits as an IterableDatasetDict, so composed split strings raised ValueError: Bad split.

The change resolves composed split instructions by building each requested streaming split and concatenating the resulting IterableDataset objects in the order requested. It also supports the "all" split sentinel in streaming mode.

This intentionally does not add support for sliced streaming expressions such as train[:10%], which require separate handling.
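To illustrate the idea described above, here is a minimal, library-free sketch of how a composed split string could be resolved and streamed. It is not the code from this PR and does not use the datasets API: `resolve_split_names`, `stream_composed_split`, and the `sources` mapping are hypothetical names chosen for the example, and plain Python generators stand in for streaming splits.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for per-split streaming sources; not the datasets API.
SplitSources = Dict[str, Callable[[], Iterable[dict]]]


def resolve_split_names(split: str, available: List[str]) -> List[str]:
    """Expand a composed split string into an ordered list of split names."""
    if split == "all":
        # The "all" sentinel selects every available split.
        return list(available)
    names = [part.strip() for part in split.split("+")]
    for name in names:
        if "[" in name:
            # Sliced expressions such as train[:10%] are out of scope here.
            raise ValueError(f"Sliced splits are not supported in this sketch: {name}")
        if name not in available:
            raise ValueError(f"Bad split: {name}. Available splits: {available}")
    return names


def stream_composed_split(split: str, sources: SplitSources) -> Iterator[dict]:
    """Validate the split string eagerly, then lazily chain the requested splits."""
    names = resolve_split_names(split, list(sources))
    # Each source is only opened when iteration reaches it.
    return chain.from_iterable(sources[name]() for name in names)


# Toy sources standing in for streaming splits (assumed data, for illustration).
sources: SplitSources = {
    "train": lambda: ({"id": i, "split": "train"} for i in range(3)),
    "validation": lambda: ({"id": i, "split": "validation"} for i in range(2)),
    "test": lambda: ({"id": i, "split": "test"} for i in range(1)),
}

rows = list(stream_composed_split("train+validation", sources))
```

In this sketch the split string is validated up front, so a bad or sliced split fails at call time, while the examples themselves are only produced as the caller iterates.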

Tests added for:

  • string split composition: "train+test"
  • object split composition: Split.TRAIN + Split.TEST
  • "all"
  • Split.ALL

Validation:

  • python -m pytest tests/test_builder.py -q

@lanarkite99

Contributor Author

@lhoestq could you please review this PR when you get a chance?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq left a comment

Member

lgtm !

@lhoestq merged commit cfe4492 into huggingface:main on Jun 5, 2026
3 of 14 checks passed

Development

Successfully merging this pull request may close these issues.

  • streaming dataset with concatenating splits raises an error
  • cannot combine splits merging and streaming?

3 participants