ARROW-4629: [Python] Pandas arrow conversion slowed down by imports#3706
fjetter wants to merge 3 commits into apache:master
python/pyarrow/array.pxi
# specific language governing permissions and limitations
# under the License.

import pyarrow.pandas_compat as pdcompat
Pandas is an optional dependency of pyarrow; that's why pandas is not imported here.
I changed it to an import iff pandas is available
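For readers unfamiliar with the pattern, here is a minimal sketch of importing a module only if it is available. It is not the code from this PR: the helper name `import_or_none` and the use of the standard-library `json` module as a stand-in for an optional dependency such as pandas are assumptions made for illustration.

```python
import importlib


def import_or_none(module_name):
    """Return the imported module, or None if it cannot be imported.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Resolved once at module load, so callers never repeat the import.
# `json` stands in for an optional dependency that happens to be installed.
_compat = import_or_none("json")


def convert(obj):
    """Use the optional module if present, otherwise fail with a clear error."""
    if _compat is None:
        raise ImportError("the optional dependency is required for this conversion")
    return _compat.dumps(obj)
```

With this shape, code paths that do not need the optional module keep working when it is missing, and only the paths that need it raise an error.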
        from_pandas=True, safe=safe,
        memory_pool=memory_pool)
else:
    import pyarrow.pandas_compat as pdcompat
@wesm I'm just realising that this is one of the potential sources of the pdcompat import problems users may have reported. This path should be supported without pandas but currently isn't.
Yes, I agree with you. Can you create a follow-up issue about this? We should set up a docker-compose "no pandas" build to make sure that the project is usable without pandas.
I went ahead and opened https://issues.apache.org/jira/browse/ARROW-4640
Co-Authored-By: fjetter <fjetter@users.noreply.github.com>
Codecov Report
@@             Coverage Diff             @@
##           master    #3706       +/-   ##
===========================================
- Coverage   87.79%   66.48%   -21.32%
===========================================
  Files         688      323      -365
  Lines       84280    48123    -36157
  Branches     1081        0     -1081
===========================================
- Hits        73994    31993    -42001
- Misses      10175    16130     +5955
+ Partials      111        0      -111
I'm surprised we're importing Pandas unconditionally. We probably shouldn't do that, as Pandas is quite slow to import. Here is a comparison of PyArrow import time with and without Pandas: => more than twice as fast without.
@pitrou Any ideas on how to avoid these local imports and also have the benefit of only loading pandas when needed?
Imports are reasonably cheap once the module is already loaded, but it's probably better to avoid doing them in a tight loop. So hoisting the import outside of critical loops should be sufficient.
>>> %timeit import pandas
102 ns ± 0.497 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit import pyarrow.pandas_compat as pdcompat
253 ns ± 7.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is in pure Python, though. Cython does not seem to implement the same optimizations as CPython does:
>>> %timeit lib._noop_bench()
66.3 ns ± 0.379 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit lib._import_bench()
928 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
(edit: issue opened for Cython at cython/cython#2854)
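The hoisting advice can be sketched in plain Python as follows. This is an illustration only, not pyarrow code: the function names are hypothetical and the standard-library `math` module stands in for `pyarrow.pandas_compat`. It makes no claim about how large the difference is on any given machine, and it does not reproduce the Cython behaviour measured above.

```python
import math
import timeit


def roots_local_import(values):
    """Re-run the import statement on every iteration."""
    out = []
    for v in values:
        # The module is already loaded, but each pass still executes the
        # import statement and its lookup in the module cache.
        import math as m
        out.append(m.sqrt(v))
    return out


def roots_hoisted_import(values):
    """Rely on the module imported once at the top of the file."""
    out = []
    for v in values:
        out.append(math.sqrt(v))
    return out


if __name__ == "__main__":
    data = list(range(1000))
    for fn in (roots_local_import, roots_hoisted_import):
        seconds = timeit.timeit(lambda: fn(data), number=200)
        print(f"{fn.__name__}: {seconds:.4f} s")
```

Both functions return the same result; only the placement of the import differs, which is the part the timing loop isolates.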
@pitrou Are you happy with the change here, so that we deal with the Pandas import issue separately, or should this patch be adapted before merging?
Yes, I've created ARROW-4637 for the pandas import.
The local imports slow down the conversion from pandas to arrow significantly (see [here](https://issues.apache.org/jira/browse/ARROW-4629))

Author: fjetter <fjetter@users.noreply.github.com>
Author: Uwe L. Korn <xhochy@users.noreply.github.com>

Closes apache#3706 from fjetter/local_imports and squashes the following commits:

eb5c8ba <Uwe L. Korn> Apply suggestions from code review
b4604be <fjetter> Only import pandas_compat if pandas is available
f1c8b40 <fjetter> Don't use local imports