Fix iterable skip over full Arrow blocks by my17th2 · Pull Request #8236 · huggingface/datasets

my17th2 · 2026-06-04T09:37:26Z

What does this PR do?

Fixes IterableDataset.skip(n) for streaming datasets when the underlying iterable uses Arrow batches and n skips one or more complete Arrow blocks.

Previously, after a full Arrow block was counted as skipped, _iter_arrow() continued into the partial-slice branch and yielded rows from a block that should have been fully skipped.

What was the issue?

SkipExamplesIterable._iter_arrow() handles skipping in two cases:

1. the current Arrow table is fully skipped
2. only the beginning of the current Arrow table is skipped, and the rest is yielded

The bug was that case 1 did not stop after counting the table as skipped, so the same table fell through into case 2.

In other words, a table could first be counted as "already skipped", but then still be sliced and yielded.

For example, if Arrow tables have 4 rows each and we call skip(6):

table 1 has rows [0, 1, 2, 3] and should be fully skipped
table 2 has rows [4, 5, 6, 7], so only [4, 5] should be skipped and [6, 7] should be yielded

Before this PR, after table 1 was counted as skipped, the code kept processing table 1 and yielded part of it. This is why skipped rows could appear in the output.

This PR adds continue after a table is fully skipped, so the code moves directly to the next Arrow table.
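The two cases and the added continue can be sketched in isolation. This is a minimal illustration, not the code in src/datasets/iterable_dataset.py: plain Python lists stand in for Arrow tables, and skip_rows is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List


def skip_rows(blocks: Iterable[List[int]], n: int) -> Iterator[List[int]]:
    """Skip the first n rows across blocks (lists stand in for Arrow tables)."""
    skipped = 0
    for block in blocks:
        remaining = n - skipped  # rows that still have to be skipped
        if remaining > 0 and len(block) <= remaining:
            # Case 1: the whole block is skipped.
            skipped += len(block)
            continue  # move straight to the next block instead of slicing this one
        if remaining > 0:
            # Case 2: skip the start of the block and yield the rest.
            skipped = n
            yield block[remaining:]
        else:
            # Nothing left to skip: pass the block through unchanged.
            yield block


# The example from the description: 4-row tables and skip(6).
blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]
result = [row for block in skip_rows(blocks, 6) for row in block]
print(result)  # [6, 7]
```

Without the continue, the first block would reach the slicing branch after already being counted, which is the fall-through described above.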

Tests

PYTHONPATH=src pytest tests/test_iterable_dataset.py::test_skip_arrow_examples_iterable -q

The regression test covers skipping within a block, exactly one block, across blocks, and beyond the dataset length.
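The four scenarios can be checked against plain list slicing. The sketch below is not the test in tests/test_iterable_dataset.py: it uses lists instead of Arrow tables, and both skip_rows and the continue_after_full_skip flag are hypothetical. With the flag off, the helper models one plausible form of the fall-through; the rows the real pre-fix code yielded may have differed.

```python
def skip_rows(blocks, n, continue_after_full_skip=True):
    """Block-wise skip; the flag toggles a hypothetical model of the fall-through."""
    skipped = 0
    for block in blocks:
        if skipped < n and len(block) <= n - skipped:
            skipped += len(block)  # block counted as fully skipped
            if continue_after_full_skip:
                continue
        if skipped < n:
            offset = n - skipped  # rows to drop from the start of this block
            skipped = n
            yield block[offset:]
        else:
            yield block


def flatten(blocks):
    """Concatenate the yielded blocks into one flat list of rows."""
    return [row for block in blocks for row in block]


ROWS = list(range(8))
BLOCKS = [ROWS[:4], ROWS[4:]]  # two blocks of 4 rows each
CASES = {
    "within a block": 2,
    "exactly one block": 4,
    "across blocks": 6,
    "beyond the dataset length": 10,
}

for name, n in CASES.items():
    expected = ROWS[n:]  # reference result from plain slicing
    fixed = flatten(skip_rows(BLOCKS, n))
    modelled_bug = flatten(skip_rows(BLOCKS, n, continue_after_full_skip=False))
    assert fixed == expected, name
    print(f"{name}: n={n}, fixed={fixed}, fall-through matches: {modelled_bug == expected}")
```

Under this model, only the within-a-block case passes without the continue, so the other three scenarios are the ones that guard against the regression.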

lhoestq

LGTM! Applying a minor change for consistency with take().

HuggingFaceDocBuilderDev · 2026-06-05T12:26:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

fix: skip full arrow blocks in iterable skip

7f3c979

lhoestq approved these changes Jun 5, 2026

Comment thread src/datasets/iterable_dataset.py Outdated

Update src/datasets/iterable_dataset.py

ff09d2f

lhoestq merged commit 10cdc81 into huggingface:main Jun 5, 2026
4 of 14 checks passed

lhoestq merged 2 commits into huggingface:main from my17th2:fix-streaming-skip-arrow-block

3 participants
