RFC: Split off pyarrow-* builds #1381

@h-vetinari

Description

The more things change, the more they stay the same...

Almost 4 years ago, the https://github.com/conda-forge/pyarrow-feedstock feedstock was archived and the builds were moved here in #146. The package split alluded to in #93 and clarified in #862 took a bit longer to materialize (in #875). With the impending #1376, we're now getting a very hefty 30(!) artefacts per CI job:

List of artefacts as of v16 + pyarrow{,-core,-all}
anaconda upload \
    /home/conda/feedstock_root/build_artifacts/linux-64/apache-arrow-proc-5.0.0-cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-16.0.0-hefa796f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-acero-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-16.0.0-hc4f8a93_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-gandiva-16.0.0-hc1954e9_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libparquet-16.0.0-hacf5a1f_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-dataset-16.0.0-hbabe93e_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-flight-sql-16.0.0-he4f5ca8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-substrait-16.0.0-h8508dc1_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/libarrow-all-16.0.0-ha770c72_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-core-16.0.0-py311hd5e4297_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py38hb563948_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py39h8003fee_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py312h8da182e_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py310h17c5347_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-all-16.0.0-py311h781c19f_0.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py38hc396e17_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py39h38d04b8_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py312h3f82784_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py310hd207890_0_cpu.conda \
    /home/conda/feedstock_root/build_artifacts/linux-64/pyarrow-tests-16.0.0-py311hd5e4297_0_cpu.conda

...which is also pushing conda-smithy and the rerender bots to their limits (in terms of rendering time), cf. conda-forge/conda-forge-pinning-feedstock#5815. This also spurred some performance improvements in conda-build, but fundamentally the issue remains that this feedstock is getting very large. Quoting @beckermr from the pinning issue:

Splitting that feedstock up into more than one would potentially have other benefits in addition to easing the burden on our already overstressed tools.

At first I thought this was not going to work, but after a closer look, and especially with the split after #875, there's actually a pretty clean separation between the C++ libarrow* side and the Python pyarrow* bits. In short, I think there's no technical barrier to doing this.

Here are the pros and cons as I see them.

Cons:

  • We lose the 1:1 correspondence between pyarrow and libarrow, which made it easy to do invasive changes on the libarrow side.
    • For example, we've recently had issues with specific GCC versions that manifested as segfaults in the pyarrow test suite1; such issues are much easier to debug when we can see the effect of a changed compiler version directly in the PR, without having to first publish a libarrow build and then test pyarrow against it.
    • For example, if we were to enable orc-for-pyarrow on windows (which will be possible as of orc 2.0.1), we'd have to ensure on the pyarrow side that we don't pull in a libarrow that's too old to have that support. In this particular case it works out, because orc support has been enabled in libarrow for a long time.
  • More maintenance effort from keeping two recipes times ~4 maintenance branches running.
  • Upstream arrow runs the conda recipes in their CI; this would need some work to follow suit (CC @kou @pitrou @assignUser).
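To make the version coupling in the second point concrete: a split-off pyarrow recipe would need an explicit lower bound on libarrow whenever it starts relying on a newly-enabled feature, since the two would no longer be built from the same feedstock. A minimal sketch of what that could look like in the split pyarrow recipe's meta.yaml (the section layout and bounds are illustrative, not taken from any actual recipe):

```yaml
# Hypothetical excerpt of a split pyarrow recipe's meta.yaml.
# The lower bound on libarrow would need to be bumped manually
# whenever pyarrow starts depending on a newly-enabled libarrow
# feature (e.g. orc support on windows).
requirements:
  host:
    - libarrow-all {{ libarrow_version }}    # illustrative variable
  run:
    - libarrow-all >=16.0.0,<17              # illustrative floor/cap
```

Without such a constraint, the solver could pair a new pyarrow with an older libarrow that lacks the required feature, which is exactly the failure mode the current single-feedstock setup rules out by construction.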

Pros:

  • Less burden for our infrastructure (e.g. fewer migrator timeouts)
  • More reliable bot PRs --> less maintenance effort
  • Much faster rerendering times
  • Less artefact churn for pyarrow (the libarrow bits are migrated very often without actually requiring a rebuild of pyarrow; in the current recipe, every such migration pushes new pyarrow builds as well)
  • Shorter build times on both sides
  • More reliable builds for pyarrow (each Python version gets its own run, instead of a single job having 5x the opportunity to hit a flaky error in the test suite)

Assuming we want to do this, we could unarchive the pyarrow-feedstock and update it to v16. In that case, it would make sense to split the pyarrow bits off from #1376. I'm not sure we'd want to touch any of the older still-supported versions for this, but that would be possible as well (maybe from v13 onwards, as we're about to drop v12 once we migrate v16).

Thoughts @conda-forge/arrow-cpp?

CC @conda-forge/core

Footnotes

  1. arguably also related to the fact that we're not running the test suite for libarrow, which is hard because it unconditionally depends on the testbench, which is very hard to package because it's incompatible with our pinnings
