RFC: Split off pyarrow-* builds #1381

@h-vetinari

Description

The more things change, the more they stay the same...

Almost 4 years ago, the https://github.com/conda-forge/pyarrow-feedstock feedstock was archived and the builds were moved here in #146. The package split alluded to in #93 and clarified in #862 took a bit longer to materialize (in #875). With the impending #1376, we're now getting a very hefty 30(!) artefacts per CI job:

List of artefacts as of v16 + pyarrow{,-core,-all}
anaconda upload \
    /home/conda/feedstock_root/build_artifacts/linux-64/apache-arrow-proc-5.0.0-cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-16.0.0-hefa796f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-acero-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-16.0.0-hc4f8a93_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-gandiva-16.0.0-hc1954e9_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libparquet-16.0.0-hacf5a1f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-dataset-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-sql-16.0.0-he4f5ca8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-substrait-16.0.0-h8508dc1_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-all-16.0.0-ha770c72_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py311hd5e4297_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py311hd5e4297_0_cpu.conda

...which is also pushing conda-smithy and the rerender bots to their limits (in terms of rendering time), cf. conda-forge/conda-forge-pinning-feedstock#5815. This also spurred some performance improvements in conda-build, but fundamentally the issue remains that this feedstock is getting very large. Quoting @beckermr from the pinning issue:

Splitting that feedstock up into more than one would potentially have other benefits in addition to easing the burden on our already overstressed tools.

At first I thought this was not going to work, but after a closer look, and especially with the split after #875, there's actually a pretty clean separation between the C++ libarrow* side and the Python pyarrow* bits. In short, I think there's no technical barrier to doing this.

Here are the pros and cons as I see them.

Cons:

  • We lose the 1:1 correspondence between pyarrow and libarrow, which made it easy to do invasive changes on the libarrow side.
    • For example, we've recently had issues with specific GCC versions that manifested as segfaults in the pyarrow test suite1; such issues are much easier to debug when we can see the effect of a changed compiler version directly in the PR, without having to first publish a libarrow build and then test pyarrow against it.
    • For example, if we were to enable orc-for-pyarrow on windows (which will be possible as of orc 2.0.1), we'd have to ensure on the pyarrow side that we don't pull in a libarrow that's too old to have that support. In this particular case it works out, because orc support has been enabled in libarrow for a long time.
  • More maintenance effort from keeping two recipes times ~4 maintenance branches running.
  • Upstream arrow runs the conda recipes in their CI; this would need some work to follow suit (CC @kou @pitrou @assignUser).
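To make the version coupling in the second point concrete: a split-off pyarrow recipe would need an explicit lower bound on libarrow whenever it starts relying on a newly-enabled feature, since the two would no longer be built from the same feedstock. A minimal sketch of what that could look like in the split pyarrow recipe's meta.yaml (the section layout and bounds are illustrative, not taken from any actual recipe):

```yaml
# Hypothetical excerpt of a split pyarrow recipe's meta.yaml.
# The lower bound on libarrow would need to be bumped manually
# whenever pyarrow starts depending on a newly-enabled libarrow
# feature (e.g. orc support on windows).
requirements:
  host:
    - libarrow-all {{ libarrow_version }}    # illustrative variable
  run:
    - libarrow-all >=16.0.0,<17              # illustrative floor/cap
```

Without such a constraint, the solver could pair a new pyarrow with an older libarrow that lacks the required feature, which is exactly the failure mode the current single-feedstock setup rules out by construction.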

Pros:

  • Less burden for our infrastructure (e.g. fewer migrator timeouts)
  • More reliable bot PRs --> less maintenance effort
  • Much faster rerendering times
  • Less artefact churn for pyarrow (the libarrow bits are migrated very often without actually requiring a rebuild of pyarrow; in the current recipe, every such migration pushes new pyarrow builds as well)
  • Shorter build times on both sides
  • More reliable builds for pyarrow (each Python version gets its own run, instead of a single job having 5x the opportunity to hit a flaky error in the test suite)

Assuming we want to do this, we could unarchive the pyarrow-feedstock and update it to v16. In that case, it would make sense to split the pyarrow bits off from #1376. I'm not sure we'd want to touch any of the older still-supported versions for this, but that would be possible as well (maybe from v13 onwards, as we're about to drop v12 once we migrate v16).

Thoughts @conda-forge/arrow-cpp?

CC @conda-forge/core

Footnotes

  1. arguably also related to the fact that we're not running the test suite for libarrow, which is hard because it unconditionally depends on the testbench, which is very hard to package because it's incompatible with our pinnings
