WIP: Initial split for libarrow components #1175
raulcd wants to merge 28 commits into conda-forge:main
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR.
h-vetinari left a comment
First look - thanks for tackling this!
recipe/meta.yaml (outdated)
```yaml
    - {{ pin_subpackage('libarrow', exact=True) }}
    - {{ pin_subpackage('libarrow-dataset', exact=True) }}
    - {{ pin_subpackage('libarrow-flight', exact=True) }}
    - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
    - {{ pin_subpackage('libarrow-substrait', exact=True) }}
```
I guess there's a question hiding in there: do we want to provide a libarrow-all or something like that, i.e. just a metapackage that pulls together all the lib* components, with pyarrow then host-depending on that?
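For illustration, a sketch of what such a metapackage output could look like (a hypothetical fragment, not the final recipe; the output name and component list are assumptions):

```yaml
outputs:
  # hypothetical metapackage: ships no files itself, it only pins the
  # component libraries from the same build via exact run dependencies
  - name: libarrow-all
    requirements:
      run:
        - {{ pin_subpackage('libarrow', exact=True) }}
        - {{ pin_subpackage('libarrow-dataset', exact=True) }}
        - {{ pin_subpackage('libarrow-flight', exact=True) }}
        - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
        - {{ pin_subpackage('libarrow-substrait', exact=True) }}
```

pyarrow could then list a single `libarrow-all` host dependency instead of enumerating every component.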
@kou @jorisvandenbossche do you have any idea why only some of the registered compute functions are available? In case you are curious, the CI jobs I am trying to fix first are the Linux 64 ones. edit: added link to CI
The issue seems to be due to basic kernels where we do not require |
h-vetinari left a comment
Another quick look; I don't think we need the full current test suite (for libarrow) duplicated to all outputs, can probably cut that down a tad.
```yaml
{% set libs = (cuda_compiler_version != "None") * ["arrow_cuda"] + [
    "arrow", "arrow_acero"
] %}
```
What's the situation around libarrow_cuda.so? It's currently listed as an expected library in a bunch of outputs; if indeed those bits get mashed together in a single library, then these outputs would clobber each other...
Isn't this just because I am using libarrow as a run dependency? I can remove libarrow from the run dependencies, but since libarrow also contains all the headers at the moment, I should probably stop testing for headers in that case.
Potentially dumb question: could we "just" build Arrow C++ once with all components enabled, and then have a way for each actual output package to ship one of the shared libraries created by that initial build?
Yes, that works from a package-building POV, and is actually what we should do, because we don't have the (CI) time to rebuild everything X times.
I was planning on investigating why / how to reuse the previously built components so we don't rebuild from scratch again, but I can investigate how this "selection" of which shared libraries to ship per package can be done in the conda recipe. @h-vetinari do you know of any other project that does something similar where I could take inspiration? Otherwise I'll investigate how I can just pick the specific .so + headers for the specific package. It's my first time working on a conda recipe :)
I think the best pattern for this (or at least: my personal favourite) is the one of installing into a temporary prefix and then copying the required bits per output. See for example llvmdev. The structure would then be:
If copying the headers becomes a hassle, we can also consider having a |
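As an illustration of the llvmdev-style pattern described above, the outputs section could be laid out roughly like this (script and directory names are assumptions for the sketch):

```yaml
# build.sh compiles Arrow C++ once with all components enabled and
# installs into a temporary prefix (e.g. $SRC_DIR/arrow-dist) rather
# than directly into $PREFIX.
outputs:
  - name: libarrow
    script: install-libarrow.sh          # copies lib/libarrow.so* (+ headers)
  - name: libarrow-dataset
    script: install-libarrow-dataset.sh  # copies lib/libarrow_dataset.so*
  - name: libarrow-flight
    script: install-libarrow-flight.sh   # copies lib/libarrow_flight.so*
```

Each `install-*.sh` would then only `cp` the relevant files from the temporary prefix into `$PREFIX`, so nothing gets rebuilt per output.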
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR. I do have some suggestions for making it better though... For recipe:
Force-pushed from 4987581 to 88fb08d
h-vetinari left a comment
Please reduce the host dependencies per output to the necessary minimum. The link check helps you with this. For example:
```
WARNING (libarrow-dataset): run-exports library package conda-forge::aws-crt-cpp-0.23.1-hf7d0843_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::aws-sdk-cpp-1.11.156-he6c2984_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::bzip2-1.0.8-h7f98852_4 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::glog-0.6.0-h6f12383_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libabseil-20230802.1-cxx17_h59595ed_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlidec-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlienc-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libgoogle-cloud-2.12.0-h8d7e28b_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libthrift-0.19.0-h8fd135c_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libutf8proc-2.8.0-h166bdaf_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libzlib-1.2.13-hd590300_5 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::lz4-c-1.9.3-h9c3ff4c_1 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::orc-1.9.0-h52d3b3c_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::re2-2023.03.02-h8c504da_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::snappy-1.1.10-h9fff704_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::ucx-1.14.0-h3484d09_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::zstd-1.5.5-hfc55251_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
```
These warnings mean that libarrow-dataset carries a large number of unnecessary host dependencies.
This cuts both ways though: for example, libparquet also depends on openssl (presumably for the encryption facilities):
```
WARNING (libarrow,lib/libparquet.so.1300.0.0): Needed DSO lib/libcrypto.so.3 found in ['openssl']
WARNING (libarrow,lib/libparquet.so.1300.0.0): .. but ['openssl'] not in reqs/run, (i.e. it is overlinking) (likely) or a missing dependency (less likely)
```
(note that - despite the error message talking about reqs/run - this is still a question about host-dependencies, because if openssl is a host-dependency, the respective run-export will ensure it becomes a run-dependency as well).
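Put together, the two directions the warnings point at could look roughly like this in meta.yaml (a hedged sketch; the exact dependency lists are for the link check to confirm):

```yaml
outputs:
  - name: libarrow-dataset
    requirements:
      host:
        # trimmed to what libarrow_dataset.so actually links against;
        # the aws-*, glog, orc, ... entries flagged above are dropped
        - {{ pin_subpackage('libarrow', exact=True) }}

  # conversely, the output shipping libparquet.so gains openssl in
  # host; openssl's run-export then adds the matching run dependency,
  # resolving the overlinking warning
  - name: libarrow
    requirements:
      host:
        - openssl
```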
```yaml
    - xsimd
    - zlib
    - zstd
    - __cuda >=11.2  # [cuda_compiler_version != "None"]
```
@h-vetinari this was raising an error locally when I used `__cuda >={{ cuda_compiler_version_min }}  # [cuda_compiler_version != "None"]` here. The error when running `python build-locally.py linux_64_cuda_compiler_versionNone`:

```
RuntimeError: Received dictionary as spec. Note that pip requirements are not supported in conda-build meta.yaml. Error message: Invalid version '>=': invalid operator
```

Do you know if there is any reason the global host dependency would behave differently with this templating syntax?
edit: syntax
Force-pushed from 0ecc9ee to d2f91ef
The dependencies are beginning to look better and better 👍 substrait is still missing a libprotobuf dependency, and pyarrow needs more (all?) of the libarrow subpackages. For finding these, I recommend opening the "raw log" in azure and then searching for
@h-vetinari I think I've fixed the different ones, but I am not sure why pyarrow fails to find the different libarrow-related DSOs; shouldn't pyarrow, having both dependencies, contain the DSOs?
It's enough to have the DSOs (and so the package will be functional), but the metadata is off: pyarrow does not run-depend on the outputs that contain the DSOs, so the metadata is incomplete. In this case it wouldn't be an actual issue, because pyarrow depends exactly on libarrow-all, which depends exactly on the right libs for the same build, but conda cannot assume that such transitive dependencies are actually ABI-safe. In any case, the solution is simple: we need to add run-exports for all the sub-libraries to In turn, you'll be able to remove the run-dependence on libarrow-all.
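A sketch of what those run-exports might look like (assuming they live on a `libarrow-all` output; the exact placement and component list are assumptions):

```yaml
outputs:
  - name: libarrow-all
    build:
      # run_exports: any package that host-depends on libarrow-all
      # (e.g. pyarrow) automatically gains run dependencies on the
      # outputs that actually ship the DSOs
      run_exports:
        - {{ pin_subpackage('libarrow', exact=True) }}
        - {{ pin_subpackage('libarrow-dataset', exact=True) }}
        - {{ pin_subpackage('libarrow-flight', exact=True) }}
        - {{ pin_subpackage('libarrow-substrait', exact=True) }}
```

With this in place, pyarrow's explicit run-dependence on libarrow-all can be dropped, since its host dependency on libarrow-all pulls in the same pins.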
@raulcd, I took your changes here and rebased them on main, plus slimmed down the refactor where possible: #1201 This should more-or-less be ready to go for arrow 14 (which should be in the final rounds of polishing). However, I noticed one crucial thing w.r.t. #1035: the new
Thanks @h-vetinari! And many thanks for your support while I was working on this!! I know there are some conversations about creating a libarrow-fs in order to split the filesystems away from the Arrow core, which would allow us to remove
We have indeed chatted about that privately (and exactly with the motivation of being able to remove the large aws and libgoogle-cloud dependencies from the core libarrow package here). I opened apache/arrow#38309 to have a public record of this idea.
Initial split for libarrow, libarrow-acero, libarrow-dataset, libarrow-flight, libarrow-flight-sql, libarrow-gandiva, libarrow-substrait and use them on pyarrow.
Checklist

- [ ] Build number reset to 0 (if the version changed)
- [ ] Re-rendered with the latest conda-smithy (use the phrase `@conda-forge-admin, please rerender` in a comment in this PR for automated rerendering)

This is still under development; pending tasks are:
Closes #1035