Skip to content

Fix iterable skip over full Arrow blocks#8236

Merged
lhoestq merged 2 commits into
huggingface:mainfrom
my17th2:fix-streaming-skip-arrow-block
Jun 5, 2026
Merged

Fix iterable skip over full Arrow blocks#8236
lhoestq merged 2 commits into
huggingface:mainfrom
my17th2:fix-streaming-skip-arrow-block

Conversation

@my17th2

@my17th2 my17th2 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes IterableDataset.skip(n) for streaming datasets when the underlying iterable uses Arrow batches and n skips one or more complete Arrow blocks.

Previously, after a full Arrow block was counted as skipped, _iter_arrow() continued into the partial-slice branch and yielded rows from a block that should have been fully skipped.

What was the issue?

SkipExamplesIterable._iter_arrow() handles skipping in two cases:

  1. the current Arrow table is fully skipped
  2. only the beginning of the current Arrow table is skipped, and the rest is yielded

The bug was that case 1 did not stop after marking the table as skipped. So the same table then fell through into case 2.

In other words, a table could first be counted as "already skipped", but then still be sliced and yielded.

For example, if Arrow tables have 4 rows each and we call skip(6):

  • table 1 has rows [0, 1, 2, 3] and should be fully skipped
  • table 2 has rows [4, 5, 6, 7], so only [4, 5] should be skipped and [6, 7] should be yielded

Before this PR, after table 1 was counted as skipped, the code kept processing table 1 and yielded part of it. This is why skipped rows could appear in the output.

This PR adds continue after a table is fully skipped, so the code moves directly to the next Arrow table.

Tests

  • PYTHONPATH=src pytest tests/test_iterable_dataset.py::test_skip_arrow_examples_iterable -q

The regression test covers skipping within a block, exactly one block, across blocks, and beyond the dataset length.

@lhoestq lhoestq left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ! applying a minor change for consistency with take()

Comment thread src/datasets/iterable_dataset.py Outdated
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 10cdc81 into huggingface:main Jun 5, 2026
4 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants