Conversation

@shchur (Contributor) commented Jan 8, 2026

Issue #, if available:

Currently, operations based on Dataset.filter and Dataset.map are quite slow: just running the following code takes ~20 minutes and generates 10+ GB of intermediate files in ~/.cache/huggingface/datasets.

import fev

bench = fev.Benchmark.from_yaml(
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/fev_bench/tasks.yaml"
)
for task in bench.tasks:
    for window in task.iter_windows():
        window.get_input_data()

These are not the only bottlenecks; there are also slow map-based operations in the metrics, which I will address in a separate PR.

Description of changes:

  • Perform length-based filtering and past/future splits entirely in memory using pyarrow operations, without writing any intermediate results to disk. This yields a large speedup: iterating over all windows in fev-bench now takes ~4 minutes (down from 20+).
  • The main logic is inspired by the efficient slicing algorithm from TimeSeriesDataFrame in AutoGluon, which essentially performs df.groupby("item_id").nth(slice(start, end)) on flat numpy arrays (see the sketch after this list).
  • I validated that the values in the datasets are identical (np.allclose) by sampling 1/7th of all evaluation windows in fev-bench and comparing the values between the main and PR branches.
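
For illustration, here is a minimal sketch of that slicing idea (a hypothetical helper, not the PR's actual code): with per-item offsets into a flat values array, the groupby/nth operation reduces to pure index arithmetic.

import numpy as np

def slice_per_item(values, offsets, start, end):
    """Take rows [start:end) of every item from a flat values array.

    `offsets` has length n_items + 1; item i occupies values[offsets[i]:offsets[i+1]].
    Mimics df.groupby("item_id").nth(slice(start, end)) without pandas.
    """
    lengths = np.diff(offsets)
    # Clip the requested slice to each item's actual length
    new_lengths = np.minimum(end, lengths) - np.minimum(start, lengths)
    starts = offsets[:-1] + np.minimum(start, lengths)
    # Position of each item's block in the output array
    out_starts = np.concatenate([[0], np.cumsum(new_lengths)[:-1]])
    # One fancy-index array selecting the kept rows of all items at once
    idx = np.repeat(starts - out_starts, new_lengths) + np.arange(new_lengths.sum())
    return values[idx], np.concatenate([[0], np.cumsum(new_lengths)])

# Two items of lengths 3 and 2; keep the first two rows of each
values = np.array([10, 11, 12, 20, 21])
offsets = np.array([0, 3, 5])
print(slice_per_item(values, offsets, 0, 2))  # (array([10, 11, 20, 21]), array([0, 2, 4]))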

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur requested a review from abdulfatir on January 8, 2026 at 17:29
@shchur force-pushed the fast-slicing-and-filtering branch from 2e7ae75 to 854fe10 on January 8, 2026 at 17:30
"""
# Flatten indices if dataset has been sorted/filtered, so row order in dataset
# matches the physical order in the underlying Arrow table
if getattr(dataset, "_indices", None) is not None:
Collaborator:
Has this property been fairly standard for a while?
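
For context, a minimal sketch of the behavior this snippet relies on (toy data, not from the PR): after filter/sort, a datasets.Dataset keeps an _indices mapping instead of rewriting the underlying Arrow table, and flatten_indices() materializes the rows back into physical order.

from datasets import Dataset

ds = Dataset.from_dict({"item_id": [2, 0, 1], "value": [20.0, 0.0, 10.0]})
ds = ds.sort("item_id")    # sets ds._indices; the underlying Arrow table is untouched
assert ds._indices is not None
ds = ds.flatten_indices()  # rewrites the rows so logical and physical order agree
assert ds._indices is None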

# pa/pc/np refer to pyarrow, pyarrow.compute, and numpy in the surrounding module
cutoff_scalar = pc.cast(pa.scalar(cutoff), timestamps_flat.type)
mask = pc.less_equal(timestamps_flat, cutoff_scalar)
# Prefix sums over the boolean mask count the matching rows before each offset,
# so the per-item counts reduce to a single vectorized difference
cumsum = np.concatenate([[0], np.cumsum(mask.to_numpy(zero_copy_only=False))])
return cumsum[offsets[1:]] - cumsum[offsets[:-1]]
Collaborator:
Just thinking out loud: will this also work when all timestamps are less than the cutoff?
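
For what it's worth, a quick toy check (hypothetical example, not from the PR) suggests it does: with every timestamp <= cutoff the mask is all True, and the prefix-sum difference degenerates to each item's full length; the opposite extreme (no timestamps <= cutoff) yields zeros.

from datetime import datetime

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

timestamps_flat = pa.array(
    np.array(["2020-01-01", "2020-01-02", "2020-01-01"], dtype="datetime64[ns]")
)
offsets = np.array([0, 2, 3])   # two items: rows [0, 2) and [2, 3)
cutoff = datetime(2021, 1, 1)   # later than every timestamp

cutoff_scalar = pc.cast(pa.scalar(cutoff), timestamps_flat.type)
mask = pc.less_equal(timestamps_flat, cutoff_scalar)
cumsum = np.concatenate([[0], np.cumsum(mask.to_numpy(zero_copy_only=False))])
print(cumsum[offsets[1:]] - cumsum[offsets[:-1]])  # [2 1] -> each item's full length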

@shchur merged commit 546f5cd into main on Jan 12, 2026
5 checks passed
@shchur deleted the fast-slicing-and-filtering branch on January 12, 2026 at 13:52