Avoid unconditional pyarrow dependency in dataframe.backends #12075

Merged
jacobtomlinson merged 3 commits into dask:main from TomAugspurger:tom/pyarrow-import-error on Sep 15, 2025

@TomAugspurger
Member

@TomAugspurger TomAugspurger commented Sep 15, 2025

This avoids importing pyarrow unconditionally in dataframe.backends, which is reachable in some dask.array workloads. Notably, workloads using xarray rely on pandas.Index classes, and tokenizing those requires importing this module. However, that class of users doesn't depend on dask[dataframe] and so may not have pyarrow installed.

To accomplish this, we need to reorganize the root of the dask.dataframe package a bit. I've moved some uses of pandas in dask.dataframe._compat to dask._pandas_compat, which is outside of the dask.dataframe subpackage.

This lets code that uses only pandas, and not dask.dataframe (such as the tokenization registration for pandas objects), import from dask._pandas_compat while avoiding all of dask.dataframe.
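As a minimal sketch of the general pattern (not the actual dask code), an optional dependency can be guarded so that importing a module never fails just because the dependency is absent. The helper names `import_optional` and `has_pyarrow` are hypothetical and exist only for illustration.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Hypothetical module-level guard: loading this file never requires pyarrow.
pa = import_optional("pyarrow")


def has_pyarrow() -> bool:
    # Code paths that truly need pyarrow check this before touching `pa`.
    return pa is not None
```

With a guard like this, only the code paths that actually use pyarrow need it installed; everything else can import the module freely.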

Closes #12072


At the moment, I've only tested this manually using the reproducer from #12072. I don't have a unit test yet. I started with

import sys

import dask
import dask.array as da
import numpy as np
import pytest
import xarray as xr


@pytest.mark.filterwarnings("ignore:Passing an object to dask.array")
def test_tokenize_without_pyarrow(monkeypatch: pytest.MonkeyPatch) -> None:
    # https://github.com/dask/dask/issues/12072
    # Drop cached pyarrow and dask modules so later imports run from scratch.
    packages = list(sys.modules)
    for package in packages:
        if package.startswith(("pyarrow", "dask")):
            monkeypatch.delitem(sys.modules, package)

    # A None entry in sys.modules makes `import pyarrow` raise ImportError.
    monkeypatch.setitem(sys.modules, "pyarrow", None)

    y = xr.DataArray(
        data=da.zeros((1,)),
        dims=('y',),
        coords={'y': np.arange(1)},
        name='foo'
    ).to_dataset()
    y['foo'] = y.foo.dims, y.foo.data + 1
    dask.optimize(y)

which reproduces the issue. However, the trick of manipulating sys.modules to force modules to be re-imported or to raise import errors doesn't play nicely with dask's import-time tokenization registration, so we end up in a weird state.
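To illustrate the sys.modules trick in isolation, here is a small sketch that uses the standard-library module `colorsys` as a stand-in for pyarrow. The `block_module` helper is hypothetical. It shows that a `None` entry stops fresh imports, while names bound before the block still point at the original module object, which is one way in-process patching can leave things in an inconsistent state.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def block_module(name):
    """Temporarily make `import name` fail by placing None in sys.modules."""
    missing = object()
    saved = sys.modules.get(name, missing)
    sys.modules[name] = None  # a None entry halts the import machinery
    try:
        yield
    finally:
        # Restore whatever was there before the block.
        if saved is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = saved


import colorsys  # stand-in for pyarrow, imported before blocking

with block_module("colorsys"):
    try:
        importlib.import_module("colorsys")
        fresh_import_failed = False
    except ModuleNotFoundError:
        fresh_import_failed = True
    # The name bound earlier is unaffected by the sys.modules entry.
    old_reference_works = callable(colorsys.rgb_to_hsv)
```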

Member

@jacobtomlinson jacobtomlinson left a comment

Thanks for jumping on this quickly!

@github-actions
Contributor

github-actions bot commented Sep 15, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±0        9 suites  ±0   3h 21m 45s ⏱️ + 7m 41s
 18 060 tests +1   16 843 ✅ +1   1 217 💤 ±0  0 ❌ ±0 
161 614 runs  +9  149 495 ✅ +9  12 119 💤 ±0  0 ❌ ±0 

Results for commit c1d97bf. ± Comparison against base commit 8468786.

♻️ This comment has been updated with latest results.

@TomAugspurger
Member Author

c1d97bf runs the test snippet in a subprocess. It fails on main with a ModuleNotFoundError and passes on this branch.
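A rough sketch of the subprocess approach follows; it is not the test added in c1d97bf. The child script blocks a module in a fresh interpreter and reports what happened. `colorsys` again stands in for pyarrow so the sketch runs without extra packages, and `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys
import textwrap

# Hypothetical child script: block a module, then try to import it.
CHILD = textwrap.dedent(
    """
    import sys
    sys.modules["colorsys"] = None  # must happen before anything imports it
    try:
        import colorsys
    except ModuleNotFoundError:
        print("blocked")
    else:
        print("imported")
    """
)


def run_isolated(code: str) -> subprocess.CompletedProcess:
    # A fresh interpreter starts with a clean sys.modules, so the parent
    # process keeps its own imports and nothing needs to be undone.
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=False,
    )


result = run_isolated(CHILD)
```

Because the patching happens in the child process, any import-time registration in the parent is left untouched.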

Member

@jacobtomlinson jacobtomlinson left a comment

Sweet thanks @TomAugspurger

@jacobtomlinson jacobtomlinson merged commit 51f00e3 into dask:main Sep 15, 2025
22 of 24 checks passed
@TomAugspurger TomAugspurger deleted the tom/pyarrow-import-error branch September 15, 2025 14:14
@hmaarrfk
Contributor

Thank you!


Development

Successfully merging this pull request may close these issues.

Did you mean to make pyarrow a hard dependency of dask-array
