Avoid unconditional pyarrow dependency in dataframe.backends #12075
Merged
jacobtomlinson merged 3 commits into dask:main on Sep 15, 2025
Member
jacobtomlinson
left a comment
Thanks for jumping on this quickly!
Contributor
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
9 files ±0, 9 suites ±0, 3h 21m 45s ⏱️ +7m 41s
Results for commit c1d97bf. ± Comparison against base commit 8468786.
♻️ This comment has been updated with latest results.
Member
Author
c1d97bf runs the test snippet in a subprocess. It fails on
jacobtomlinson
approved these changes
Sep 15, 2025
Member
jacobtomlinson
left a comment
Sweet thanks @TomAugspurger
Contributor
Thank you!
This avoids importing pyarrow unconditionally in dataframe.backends, which is reachable in some dask.array workloads (notably those using xarray, which uses pandas.Index classes, tokenization of which requires importing this module). However, that class of users doesn't depend on dask[dataframe] and so may not have pyarrow installed.
To accomplish this, we need to reorganize the root of the dask.dataframe package a bit. I've moved some uses of pandas in dask.dataframe._compat to dask._pandas_compat, which is outside of the dask.dataframe subpackage.
This lets code that needs only pandas, and not dask.dataframe (like the tokenization registration for pandas objects), import from dask._pandas_compat while avoiding all of dask.dataframe.
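For readers unfamiliar with the pattern, here is a minimal sketch of what "avoiding an unconditional import" can look like in general. It is not the code from this PR: `optional_import` and `describe_arrow_support` are hypothetical names made up for illustration, and the only assumption is that the optional dependency is looked up when a function runs rather than at module import time.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def describe_arrow_support():
    # Hypothetical helper: pyarrow is looked up only when this function
    # runs, so importing the enclosing module never requires pyarrow.
    pa = optional_import("pyarrow")
    if pa is None:
        return "pyarrow not installed; skipping Arrow-specific handling"
    return f"pyarrow {pa.__version__} available"
```

With this shape, a module that only needs pandas can be imported in an environment without pyarrow, and the Arrow-specific path is taken only when pyarrow is actually present.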
Closes #12072
At the moment, I've only tested this manually using the reproducer from #12072. I don't have a unit test yet. I started with
which reproduces the issues. However, the trick of messing with sys.modules to force re-importing modules / import errors doesn't play nicely with dask's import-time tokenization registration, and so we end up in a weird state.
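One way to sidestep the shared-state problem is to run the check in a fresh interpreter, so any changes to sys.modules die with the child process. The sketch below is an assumption about how such a test could be written, not the test that was committed; `run_with_blocked_module` is a hypothetical helper. It relies on the standard behavior that setting `sys.modules[name] = None` makes a later `import name` fail.

```python
import subprocess
import sys
import textwrap


def run_with_blocked_module(blocked, snippet):
    """Run `snippet` in a fresh interpreter where importing `blocked` fails.

    Hypothetical helper for illustration. The child process gets its own
    sys.modules, so nothing leaks back into the test session.
    """
    # A None entry in sys.modules makes `import <blocked>` raise ImportError.
    prelude = f"import sys\nsys.modules[{blocked!r}] = None\n"
    return subprocess.run(
        [sys.executable, "-c", prelude + textwrap.dedent(snippet)],
        capture_output=True,
        text=True,
    )
```

For the case discussed here, `blocked` would be `"pyarrow"` and the snippet would exercise the code path from #12072 (for example, tokenizing a pandas.Index); a zero return code would then indicate that the path no longer needs pyarrow.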