Avoid unconditional pyarrow dependency in dataframe.backends by TomAugspurger · Pull Request #12075 · dask/dask

TomAugspurger · 2025-09-15T12:08:56Z

This avoids importing pyarrow unconditionally in dataframe.backends, which is reachable in some dask.array workloads. Notably, workloads using xarray use pandas.Index classes, and tokenizing those requires importing this module. However, that class of users doesn't depend on dask[dataframe] and so may not have pyarrow installed.

To accomplish this, we need to reorganize the root of the dask.dataframe package a bit. I've moved some uses of pandas in dask.dataframe._compat to dask._pandas_compat, which is outside of the dask.dataframe subpackage.

This lets code that uses only pandas, not dask.dataframe (such as the tokenization registration for pandas objects), import from dask._pandas_compat while avoiding all of dask.dataframe.
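For readers less familiar with the general pattern, here is a minimal sketch of guarding an optional dependency so that importing a module never fails just because the extra package is missing. It is not the code from this PR; `import_optional` is a hypothetical helper name used only for illustration.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it cannot be imported.

    Hypothetical helper: callers check for None before touching the
    optional dependency, instead of importing it at module import time.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Assumption for illustration: pyarrow is the optional dependency.
pa = import_optional("pyarrow")
if pa is None:
    print("pyarrow is not installed; skipping pyarrow-specific setup")
else:
    print("pyarrow is available")
```

The point of the design is that the cost of a missing package is paid only by the code paths that actually need it, which is the behaviour the description above asks for.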

Closes #12072

At the moment, I've only tested this manually using the reproducer from #12072. I don't have a unit test yet. I started with

import sys

import numpy as np
import pytest
import xarray as xr

import dask
import dask.array as da


@pytest.mark.filterwarnings("ignore:Passing an object to dask.array")
def test_tokenize_without_pyarrow(monkeypatch: pytest.MonkeyPatch) -> None:
    # https://github.com/dask/dask/issues/12072
    # Drop cached pyarrow and dask modules so later imports load them afresh.
    packages = list(sys.modules)
    for package in packages:
        if package.startswith(("pyarrow", "dask")):
            monkeypatch.delitem(sys.modules, package)

    # A None entry makes any later `import pyarrow` raise an ImportError.
    monkeypatch.setitem(sys.modules, "pyarrow", None)

    y = xr.DataArray(
        data=da.zeros((1,)),
        dims=("y",),
        coords={"y": np.arange(1)},
        name="foo",
    ).to_dataset()
    y["foo"] = y.foo.dims, y.foo.data + 1
    dask.optimize(y)

which reproduces the issue. However, the trick of manipulating sys.modules to force modules to be re-imported (or to raise import errors) doesn't play nicely with dask's import-time tokenization registration, and so we end up in a weird state.
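For reference, the sys.modules trick on its own can be sketched with only the standard library. Placing `None` under a module name makes any later import of that name raise an ImportError. The `blocked_import` context manager below is a hypothetical name, and `json` merely stands in for pyarrow so the sketch runs anywhere.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def blocked_import(name: str):
    """Temporarily make `import <name>` fail, then restore the old state."""
    missing = object()
    previous = sys.modules.get(name, missing)
    # A None entry tells the import system to refuse to import this name.
    sys.modules[name] = None
    try:
        yield
    finally:
        if previous is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = previous


with blocked_import("json"):
    try:
        importlib.import_module("json")
    except ImportError as exc:
        print(f"import blocked: {type(exc).__name__}")

# Outside the block the import works again.
importlib.import_module("json")
```

This only affects imports that happen while the block is active. Anything a library registered at import time before the block was entered stays registered, which is one reason in-process manipulation can leave things in an inconsistent state.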

jacobtomlinson

Thanks for jumping on this quickly!

github-actions · 2025-09-15T12:43:22Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0 · 9 suites ±0 · 3h 21m 45s ⏱️ +7m 41s
18 060 tests +1: 16 843 ✅ +1 · 1 217 💤 ±0 · 0 ❌ ±0
161 614 runs +9: 149 495 ✅ +9 · 12 119 💤 ±0 · 0 ❌ ±0

Results for commit c1d97bf. ± Comparison against base commit 8468786.

♻️ This comment has been updated with latest results.

TomAugspurger · 2025-09-15T13:26:30Z

c1d97bf runs the test snippet in a subprocess. It fails on main with a ModuleNotFoundError and passes on this branch.
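A rough sketch of the subprocess approach, using only the standard library, might look like the following. It is not the test added in c1d97bf; `run_without_module` is a hypothetical helper, and `json` stands in for pyarrow so the example runs without extra packages.

```python
import subprocess
import sys
import textwrap


def run_without_module(blocked: str, body: str) -> subprocess.CompletedProcess:
    """Run `body` in a fresh interpreter where importing `blocked` fails."""
    # The prelude runs first in the child, before any code in `body`.
    prelude = f"import sys\nsys.modules[{blocked!r}] = None\n"
    script = prelude + textwrap.dedent(body)
    return subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )


# A snippet that needs the blocked module fails in the child process...
failed = run_without_module("json", "import json")
print(failed.returncode != 0, "ModuleNotFoundError" in failed.stderr)

# ...while a snippet that never imports it succeeds.
ok = run_without_module("json", "import math")
print(ok.returncode == 0)
```

Because the child interpreter starts with a clean sys.modules, nothing from the parent's import-time registration leaks in, and the parent test process is left untouched. In a real test the body would be the xarray snippet from the description, with "pyarrow" as the blocked name.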

jacobtomlinson

Sweet thanks @TomAugspurger

hmaarrfk · 2025-09-15T22:55:24Z

Thank you!

TomAugspurger mentioned this pull request Sep 15, 2025

Did you mean to make pyarrow a hard dependency of dask-array #12072

Closed

jacobtomlinson reviewed Sep 15, 2025

TomAugspurger added 2 commits September 15, 2025 08:05

Merge branch 'main' into tom/pyarrow-import-error

c92bd4f

add subprocess test

c1d97bf

jacobtomlinson approved these changes Sep 15, 2025

jacobtomlinson merged commit 51f00e3 into dask:main Sep 15, 2025
22 of 24 checks passed

jacobtomlinson mentioned this pull request Sep 15, 2025

Release 2025.9.1 dask/community#426

Closed

4 tasks

TomAugspurger deleted the tom/pyarrow-import-error branch September 15, 2025 14:14

brendan-m-murphy mentioned this pull request Oct 6, 2025

deps(deps): update dask requirement from <2025.9 to <2025.10 openghg/openghg#1492

Merged

jacobtomlinson merged 3 commits into dask:main from TomAugspurger:tom/pyarrow-import-error

3 participants
