
Conversation

@h-vetinari (Member) commented Sep 23, 2022

PPC builds in apache/arrow#14102 are failing.

Edit: a new version dropped, so I'm repurposing this PR, as it already had almost all the required bits.

@conda-forge-linter (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari (Member Author)

@conda-forge-admin, please rerender

@h-vetinari (Member Author) commented Oct 22, 2022

Looks like the PPC failures are gone! 🥳

Windows has some issues with a symlink, but 2ce51ad can be removed as soon as there's a release dist.

@h-vetinari force-pushed the ppc_head branch 3 times, most recently from 84f98d9 to 81a5e9f on October 23, 2022 at 07:36
@h-vetinari (Member Author)

@conda-forge/arrow-cpp @jakirkham @jaimergp @isuruf @kkraus14

TL;DR: Due to recent changes, we need to do something about the emulation builds here.

Proposal: for the aarch/ppc pyarrow builds, either revive the pyarrow-feedstock or use a separate branch here.

First off, some amazing news: arrow-cpp no longer depends on python! 🥳

  • Positives:
    • Far fewer CI jobs (one per arch, covering all Python versions)
    • Far fewer build artefacts for arrow-cpp
  • Negatives:
    • Consolidated builds are impossible in emulation

The aarch64/ppc64le builds are already timing out about 50% of the time in emulation, so adding more outputs to the same job is completely impossible. AFAICT there are four possible ways to fix this (sorted by preference, ascending):

  1. Artificially keep arrow-cpp depending on python for aarch/ppc (still running into ~50% timeouts; 16 CI jobs here instead of 2, more once PyPy arrives). This is the status quo, but it's really ugly and work-intensive, and it unnecessarily multiplies artefacts/traffic.
  2. Revive the pyarrow-feedstock, only for the aarch/ppc pyarrow builds
    • Keep the arrow-cpp build here in emulation (with the happy side effect of reduced timeout risk)
    • Synchronization through a submodule or manually
    • Faster iteration on this feedstock, and - once it's passing - doing the aarch/ppc side separately doesn't sound too bad as a workflow IMO (the aarch/ppc builds - especially for pyarrow only - rarely cause issues, but they blow up CI time).
  3. Do the same as 2., but without touching the old pyarrow-feedstock
    • have a branch (per version >=10...) in this feedstock that just has one commit...
    • ...which removes the skips for pyarrow on aarch/ppc that would have to be on main, and adds skips everywhere else
    • rebase that branch whenever a PR is merged to main (the aarch/ppc pyarrow builds get published from that branch)
    • same benefits as 2., but fewer synchronization hassles
    • as a downside, this would require force-pushing to a productive branch (after each rebase)
  4. Enable cross-compilation with CUDA (open question whether/how that's possible; see some notes from the core sync, though it wasn't discussed there)

Obviously 4. would be the nicest, but since I have no idea how long it will take for that to become possible, I'd like to proceed with 3. or 2.

PS: If someone has a better way than 1f17385 to keep conda-smithy from generating jobs for different numpy versions, I'm all ears!

@jaimergp (Member) commented Oct 23, 2022

Hm, and what about some conda_build_config.yaml magic plus Jinja logic to have one output per job in those troublesome archs? The idea is to have two jobs for ppc/aarch so they don't time out, instead of using two branches or feedstocks.
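Something along these lines, just as a sketch - arrow_variant is a made-up key, and the exact spelling of the Jinja guards would need checking:

```yaml
# recipe/conda_build_config.yaml (sketch) -- fan out into one job per output,
# but only on the emulated architectures
arrow_variant:
  - arrow-cpp   # [aarch64 or ppc64le]
  - pyarrow     # [aarch64 or ppc64le]

# meta.yaml (sketch) -- only render the matching output in each emulated job;
# on all other platforms the key is unset and both outputs are built as before
outputs:
  {% if arrow_variant is not defined or arrow_variant == "arrow-cpp" %}
  - name: arrow-cpp
    # ...
  {% endif %}
  {% if arrow_variant is not defined or arrow_variant == "pyarrow" %}
  - name: pyarrow
    # ...
  {% endif %}
```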

@h-vetinari (Member Author)

Hm, and what about some conda_build_config.yaml magic plus Jinja logic to have one output per job in those troublesome archs?

That would be a similar situation to what we have now - 16 jobs (once we include PPC+CUDA) for aarch/ppc, building pyarrow and arrow-cpp (the latter redundantly in 3 of 4 jobs per arch), and those jobs currently time out 50% of the time.

That means it would take 5-6(!) restarts of 6h jobs to eventually get a passing run, which is just a spectacular hassle that I'd like to avoid going forward.

@h-vetinari (Member Author)

[1. implies] 16 jobs (once we include PPC+CUDA) for aarch/ppc, building pyarrow and arrow-cpp

Actually, now that OpenSSL 3 has been unblocked, we'd have 32 jobs just for aarch/ppc (and that's not even counting another 16 if we get PyPy). As the person doing most of the maintenance here at the moment, that makes me object strongly to "1." above. Even though I try to do the restarts, sometimes the GH-Azure interaction won't let me restart anymore after 3-4 attempts, and since we'd realistically need 6-8 restarts to finish all 32-48 timeout-prone jobs, this would mean missing aarch/ppc pyarrow builds for every PR (with the attendant resolution problems). Comparatively, option 2. or 3. is way less hassle.

@h-vetinari (Member Author)

Hm, and what about some conda_build_config.yaml magic plus Jinja logic to have one output per job in those troublesome archs?

That would be a similar situation to what we have now - 16 jobs (once we include PPC+CUDA) for aarch/ppc, building pyarrow and arrow-cpp (the latter redundantly in 3 of 4 jobs per arch), and those jobs currently time out 50% of the time.

That means it would take 5-6(!) restarts of 6h jobs to eventually get a passing run, which is just a spectacular hassle that I'd like to avoid going forward.

AFAICT there are four possible ways to fix this (sorted by preference, ascending):

Actually, since upstream arrow runs the recipe as part of its CI (cf. apache/arrow#14102), splitting things up into separate feedstocks (option 2.) no longer sounds like a good idea to me. So my preference would now be 3.

Would be happy to have your input @xhochy @pitrou @kou

@jakirkham mentioned this pull request on Oct 27, 2022
@h-vetinari (Member Author)

@jakirkham @kkraus14 @isuruf @jaimergp
Any thoughts about the feasibility of cross-compiling cuda for aarch64/ppc64le (i.e. 4. above)? That would be by far the most elegant solution. If it'd be possible to get there within (say) a few months, it might even make sense to wait for that.

@kkraus14 (Contributor)

@jakirkham @kkraus14 @isuruf @jaimergp
Any thoughts about the feasibility of cross-compiling cuda for aarch64/ppc64le (i.e. 4. above)? That would be by far the most elegant solution. If it'd be possible to get there within (say) a few months, it might even make sense to wait for that.

My understanding is there's currently no way to get a libcuda.so stub library into a conda package or a docker image that we can distribute that is compliant with the CUDA EULA.

@h-vetinari (Member Author)

Thanks for the quick response @kkraus14!

CC @conda-forge/core: with upstream changes (plus openssl & pypy in the pipeline), and the inability to cross-compile for cuda, we need to split the builds here. More details further up.

Recapping our options, and adding two more:

| Option | Benefits | Downsides | Comment |
|--------|----------|-----------|---------|
| 1. Keep status quo | everything builds from one PR | build explosion; infeasible # of restarts; infeasible total CI runtime; resolver pitfalls | Infeasible (details) |
| 2. Reactivate pyarrow-feedstock | no time outs here (or there) | separate feedstock; harder to sync upstream | [see variants] |
| -> 2a. ... just for aarch/ppc | | | feasible, but meh |
| -> 2b. ... for everything | | build explosion there (python × openssl × arch × cuda) | feasible, but ugh |
| 3. Build pyarrow aarch/ppc in separate branch | no time outs here (or there); everything in one place | double the productive branches | [see variants] |
| -> 3a. ... carrying unskip-commit | | needs force-push to prod. branch after rebase | meh |
| -> 3b. ... carrying env var | | | Least bad option? 🤩 |
| 4. cross-compile cuda | everything builds from one PR | | best solution, but infeasible 🥲 |

To detail option 3b, I'm envisioning a setup that doesn't need rebases, as follows (see the sketch after this list):

  • create a branch <ver>.0.x_aarch_ppc, and add a single commit that sets an (environment) variable only for aarch/ppc in conda_build_config.yaml (or somewhere else), say CF_BUILD_ONLY_PYARROW=1
  • use selectors like # [CF_BUILD_ONLY_PYARROW == 1] to switch off the pyarrow aarch/ppc builds on the main branches and enable them on the _aarch_ppc branches, respectively
  • merge <ver>.0.x (rather than rebase) into <ver>.0.x_aarch_ppc after every PR to <ver>.0.x.
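Roughly, the kind of setup I have in mind looks like the sketch below - names as above, but whether the flag is best consumed via per-output skips or via a Jinja guard around the outputs still needs to be worked out:

```yaml
# recipe/conda_build_config.yaml (sketch):
# main branches carry "0"; the single commit on <ver>.0.x_aarch_ppc flips it to "1"
CF_BUILD_ONLY_PYARROW:
  - "0"

# meta.yaml (sketch) -- variant values from conda_build_config.yaml can be used
# in selectors, just like cuda_compiler_version already is today
outputs:
  - name: arrow-cpp
    build:
      skip: true    # [CF_BUILD_ONLY_PYARROW == "1"]
    # ...
  - name: pyarrow
    build:
      skip: true    # [(aarch64 or ppc64le) and CF_BUILD_ONLY_PYARROW == "0"]
    # ...
```

If per-output skips turn out not to be honoured by conda-build, wrapping the respective output in a Jinja {% if %} on the same flag would achieve the same effect.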

@kkraus14 (Contributor)

I'd propose a 5th option here:

  • Move the aarch64 / ppc64le non-CUDA builds to cross compilation (see the sketch after this list)
  • Disable the aarch64 / ppc64le CUDA builds until there are either native runners that we can get CUDA on, or a way to ship the necessary libraries to support cross compilation, at which point we can revisit producing CUDA builds
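For the non-CUDA part, that should essentially be a conda-forge.yml change along these lines (a sketch; dropping the CUDA variants for those platforms would be a separate change in the variant config and isn't shown here):

```yaml
# conda-forge.yml (sketch): build the emulated architectures on x86_64 instead
build_platform:
  linux_aarch64: linux_64
  linux_ppc64le: linux_64
```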

@h-vetinari (Member Author)

I'd propose a 5th option here:

Sure, if people are fine with not producing CUDA builds on aarch/ppc, that's even easier. I had just assumed that it would not be acceptable to remove a feature like that (which people spent a bunch of time on, judging by the old PRs that enabled it).

@xhochy (Member) commented Oct 27, 2022

Actually, since upstream arrow runs the recipe as part of its CI (cf. apache/arrow#14102), splitting things up into separate feedstocks (option 2.) no longer sounds like a good idea to me. So my preference would now be 3.

From my experience maintaining this feedstock over the past years, I would propose going with option 2. Especially since we test the conda recipes as part of the Arrow CI, we can be pretty sure that things work together. We are doing a similar thing with boost(-cpp), and it works nicely there, too.

@h-vetinari Can you explain how you came to the above conclusion?

@h-vetinari (Member Author)

@h-vetinari Can you explain how you came to the above conclusion?

What I meant was that the recipe will have to continue being synced back to arrow (assuming it should stay part of the CI there), and if we split the recipe into two feedstocks, I imagine it will be harder to sync that back in a way that upstream CI can do an integrated build & check of both arrow-cpp & pyarrow.

That wouldn't be the case with option 3., because then the recipe would stay contained in one feedstock, and syncing back would just mean undoing certain skips (and there are already a few manual adjustments to do anyway, as I learned in apache/arrow#14102).

Is that reasoning understandable? FWIW, I can also live very well with option 2a, if that's what you prefer. Option 1. is the only one I'm strongly against; the rest (2.-5.) I can deal with.

@xhochy (Member) commented Oct 27, 2022

What I meant was that the recipe will have to continue being synced back to arrow (assuming it should stay part of the CI there), and if we split the recipe into two feedstocks, I imagine it will be harder to sync that back in a way that upstream CI can do an integrated build & check of both arrow-cpp & pyarrow.

As we had this situation before: with the current manual sync, it didn't feel any different whether you had one recipe or two. For the syncing, I would propose adding the changes we need in the recipes to support building nightly versions to the feedstock here. This probably means we'll need some if version == statements in the recipe, but it should help a lot with syncing the pinning files. I will think a bit about this and write about a better syncing approach to the Arrow mailing list.

@h-vetinari (Member Author)

If upstream arrow is fine with synchronizing from two feedstocks, I don't mind. One problem that comes to mind (with any of the splits, actually) is that we could no longer easily do something like {{ pin_subpackage('arrow-cpp', exact=True) }}, which we currently do.
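For context, this is the kind of pin we'd lose - a sketch only; the looser pin in the comment is just one possible replacement, not something that exists yet:

```yaml
# meta.yaml, pyarrow output, as it works today within a single feedstock
requirements:
  run:
    - {{ pin_subpackage('arrow-cpp', exact=True) }}   # exact version + build-string pin

# in a split pyarrow-feedstock, arrow-cpp would no longer be a subpackage of the
# same recipe, so the closest equivalent would be an explicit version pin
# (necessarily looser than exact=True), e.g.:
#   - arrow-cpp =={{ version }}
```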

If we do push some stuff back to the pyarrow-feedstock, I'd still propose to only do the aarch/ppc builds there, because for all other combinations it actually works beautifully here (and we'd avoid a build matrix of soon-to-be 116¹ (!!) builds there).

Also, keeping the build scripts for pyarrow here up-to-date (due to building for non-aarch/ppc) will make it easier to switch back to doing everything here if/once we can cross-compile cuda.

I will think a bit about this and write about a better syncing approach to the Arrow mailing list.

Could you please let us know here when you've done so?

Footnotes

  1. 10 OS+GPU combinations (linux-{64, aarch64, ppc64le}-{cpu, cuda}, win-64-{cpu, cuda}, osx-{64, arm64}) × 6 python versions (CPython 3.8-3.11, PyPy 3.8 & 3.9) × 2 OpenSSL versions {1.1.1, 3} = 120, minus 4 builds for PyPy not being available on osx-arm64 = 116.

@isuruf (Member) commented Oct 31, 2022

My understanding is there's currently no way to get a libcuda.so stub library into a conda package or a docker image that we can distribute that is compliant with the CUDA EULA.

We can install the libcuda.so stub library from the NVIDIA RPM package at CI run time and not redistribute it.
You'll also need to install the cross-linux-sbsa compiler, which is the cross-compiler for aarch64.

However, NVIDIA doesn't provide a cross-linux-ppc64le. Not sure why.

@h-vetinari (Member Author)

We can install the libcuda.so stub library from the NVIDIA RPM package at CI run time and not redistribute it.

Awesome, thanks a lot! That's what I had pictured, but I thought I'd leave the EULA judgment to those who know Nvidia much better than I do.

Assuming this is possible, what would be a good way to experiment with it? Iterate in the build scripts here and then move it to the ci-setup once it's working?

However, NVIDIA doesn't provide a cross-linux-ppc64le. Not sure why.

I think having this even just for aarch would be a big win. Also, we're not yet building PPC+CUDA, so it wouldn't be a "regression" not to publish those.

@xhochy (Member) commented Oct 31, 2022

If we do push some stuff back to the pyarrow-feedstock, I'd still propose to only do the aarch/ppc builds there, because for all other combinations it actually works beautifully here (and we'd avoid a build matrix of soon-to-be 116¹ (!!) builds there).

Wouldn't that mean that we need to keep the build scripts for pyarrow in sync between the two repositories?

@h-vetinari (Member Author)

Wouldn't that mean that we need to keep the build scripts for pyarrow in sync between the two repositories?

Yes, but I'm willing to do that. It's far less of a hassle than restarting failing 6h jobs several times, or drip-feeding PRs because we shouldn't be blocking 100+ CI agents at once.

@h-vetinari (Member Author)

4. cross-compile cuda [...] best solution, but infeasible 🥲

For those subscribed: it looks like option 4 is back in play: conda-forge/conda-forge-ci-setup-feedstock#209 🥳

This would be amazing (and it seems it even supports cross-compiling PPC after all). 🤩

@h-vetinari (Member Author)

Cross-compilation support for CUDA is still baking, but in the meantime I'm continuing this PR in #875.
