cudaPackages: use the same libstdc++ as the rest of nixpkgs#223664
cudaPackages: use the same libstdc++ as the rest of nixpkgs#223664samuela merged 4 commits intoNixOS:masterfrom
Conversation
| overrideCC stdenv (wrapCCWith rec { | ||
| cc = compatibleStdenv.cc.cc; | ||
| libcxx = stdenv.cc.cc.lib; | ||
| }) |
There was a problem hiding this comment.
I've never used wrapCCWith and overrideCC before, so I've no idea if I'm doing this correctly. I'm also probably abusing the libcxx argument, I guess it's a hatch for clang, but it did the job for me somehow
I'd really appreciate a comment from @FRidh who had suggested overrideCC at some point
|
Interesting! Curious to see if such an approach could work! |
|
Result of 18 packages failed to build:
94 packages built:
|
Failed derivationsDetails
|
| # one that is compatible with the rest of nixpkgs, even when | ||
| # nvcc forces us to use an older gcc | ||
| # NB: We don't actually know if this is the right thing to do | ||
| backendStdenv.cc.cc.lib |
There was a problem hiding this comment.
I probably should leave a warning somewhere, but if one accidentally uses backendStdenv.cc.cc.lib they'll end up breaking things by linking against the old libstdc++: .lib is just an output that cc happens to have, and wrapCCWith doesn't expose any passthru for libcxx
|
Result of 293 packages failed to build:
1290 packages built:
|
Failed derivationsDetails
|
|
Well, I'm sorry, it's been emotional |
|
|
|
So, IIIUC, with this PR we would only need to use |
Yes. We'll build with (at the time of writing) gcc11, but we'll link to the same libstdc++ we expect other nixpkgs citizens to link to. This way we should be able to co-exist within the same process's namespace I'll repeat once again that |
Could we drop/override it somehow to prevent the footgun? Something like backendStdenv = backendStdenv.override { cc.cc.lib = throw "deprecated! use X instead!"; } |
|
As such, |
Indeed, I have previously been a follower of such advice :) I like the idea of preemptively avoiding footguns entirely, but perhaps doing so is excessively onerous
I love the idea of this change, but I don't think I can support merging something that breaks JAX since it is so central to the CUDA/ML package ecosystem. What do you think would be required to unf#ck jax? Also, perhaps we should target staging for a change this far-reaching? Wdyt? |
JAX has been failing on master for quite some time now, last time I checked was 2 days ago: https://hercules-ci.com/github/SomeoneSerge/nixpkgs-unfree/jobs/3696 It's just that I was hoping this PR would fix jax as well. My hypothesis is that it didn't because we use
Actually, unless one sets
...but maybe I just got too accustomed to cuda changes triggering rebuilds of the entire universe?
I'm absolutely with you on this one. Only that in this case I feel we'd have to touch some internals of Nix, which would likely mean creating a new footgun
Again, I apologize for the language (although I had rather got it out and apologized, than kept it). |
ah yes, those failures are a PITA! However, they should be fixed by #219778 which i'm hoping to get merged tomorrow. Maybe that will fix JAX 🤞
Mmm gotcha, I didn't realize this would require Nix-internals meddling...
Oh no apologies necessary -- i thought it was funny! Yeah, this would be an ideal scenario... Perhaps one day we can get there haha |
As far as I can see, it's a different failure. The one observed in hercules logs is a libstdc++ mismatch on |
Yes, that's correct. Perhaps I should be more specific: As far as I'm aware, what has happened recently is
Anyhow, I was assuming that you were commenting build failures with JAX as opposed to runtime libstdc++ failures! |
Gotcha! I was referring to |
|
In the hindsight, the way we (I) handled libstdc++ in #218265 was very much a mistake and we should remember it, because the failure is rather nasty: I just unet-corresp/notebooks on master [$!?] via 🐍 v3.10.10 via ❄️ impure (nix-shell-env) took 2s
❯ python 2023-03-29.py
Traceback (most recent call last):
File "/home/ss/Sources/unet-corresp/notebooks/2023-03-29.py", line 21, in <module>
import zmq
File "/nix/store/rjdykq4lhq711y7gbidap60754ma5wmc-python3-3.10.10-env/lib/python3.10/site-packages/zmq/__init__.py", line 103, in <module>
from zmq import backend
File "/nix/store/rjdykq4lhq711y7gbidap60754ma5wmc-python3-3.10.10-env/lib/python3.10/site-packages/zmq/backend/__init__.py", line 31, in <module>
raise original_error from None
File "/nix/store/rjdykq4lhq711y7gbidap60754ma5wmc-python3-3.10.10-env/lib/python3.10/site-packages/zmq/backend/__init__.py", line 26, in <module>
_ns = select_backend(first)
File "/nix/store/rjdykq4lhq711y7gbidap60754ma5wmc-python3-3.10.10-env/lib/python3.10/site-packages/zmq/backend/select.py", line 31, in select_backend
mod = import_module(name)
File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/nix/store/rjdykq4lhq711y7gbidap60754ma5wmc-python3-3.10.10-env/lib/python3.10/site-packages/zmq/backend/cython/__init__.py", line 6, in <module>
from . import (
ImportError: /nix/store/205vsmbfhq1q2vhgskpqyymqvba4mscp-gcc-11.3.0-lib/lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /nix/store/j4jznffk0l87frb77llh0ap3j0zahl5b-zeromq-4.3.4/lib/libzmq.so.5)Moving
|
|
|
Result of 14 packages failed to build:
99 packages built:
|
Failed derivationsDetails
|
Good idea! I think that makes thing much simpler, which is always a good thing I guess my only remaining question (#223664 (comment)) is why don't we remove |
Oh, we still are using it, in downstream packages 🙃 E.g. in tensorflow we're currently swapping the stdenv for backendStdenv, so that tf does use gcc11. With this PR though, tensorflow will not only consume gcc11, it'll consume gcc11 with gcc12's libstdc++. The footgun is at |
|
FWIW, I feel like these ad hoc issues are worth a separate PR. I'm willing to bet this must be caused by some hacks applied upstream |
There was a problem hiding this comment.
@samuela look, this worked 😆 😅
import faiss
import zmqworks now.
So yeah, further on the roadmap is to add hooks for nvcc and FindCUDAToolkit.cmake, and rename this into cudaStdenv
There was a problem hiding this comment.
| stdenv = if cudaSupport then backendStdenv else inputs.stdenv; | |
| stdenv = if cudaSupport then backendStdenv else stdenv; |
|
@samuela I cherry-picked the This one, however, seems to be ready to go. And I already have ideas for a |
samuela
left a comment
There was a problem hiding this comment.
Ok, I think I'm starting to wrap my head around what's going on here. So IIUC
- Compilers, like gcc, are derivations just like all derivations, usually named
cc. They also havecc.liboutputs where their correspondinglibcxxLibraries live. A Compiler derivation,cc, outputs a naive executable that does not have any knowledge of the Libraries incc.lib. - A Compiler,
cc, is wrapped with wrapCCWith or wrapCC, producing a Wrapped Compiler that is automatically configured to find certain libraries, possibly (but not necessarily) includingcc.lib. Confusingly, the Wrapped Compilers are also often namedccin which casecc.ccis the underlying Compiler andcc.cc.libis the Compiler's correspondinglibcxx. - Wrapped Compilers and Libraries (including
libcxx?) are bundled together into Standard Environments, usually namedstdenv. What wrapping Standard Environments do in addition to Wrapped Compilers is not entirely clear to me. But in this case,stdenv.ccis the Wrapped Compiler,stdenv.cc.ccis the underlying Compiler, andstdenv.cc.cc.libis the Library corresponding to the underlying Compiler.
The goal of this PR is to build CUDA-related packages with gcc11's Compiler cc (as required by CUDA) in tandem with gcc12's cc.lib Library (as is used throughout the rest of Nixpkgs). After this PR, backendStdenv will be a Standard Environment with a compiler toolchain including gcc11's Compiler cc, but building against gcc12's libcxx Library.
Is this understanding correct?
Btw, given a Standard Environment, say backendStdenv, how does one access the appropriate libcxx Library? I understand that backendStdenv.cc.cc.lib is Not The Way...
There was a problem hiding this comment.
stdenv = if cudaSupport then backendStdenv else stdenv;
This causes infinite recursion in (whatever version of Nix I've tried this with a month ago)
There was a problem hiding this comment.
What about something like
effectiveStdenv = if cudaSupport then backendStdenv else stdenv;?
There was a problem hiding this comment.
I tried that earlier in tensorflow, I think. Sandro pointed out that this way it's really easy for someone to accidentally refer to the old stdenv since it's still in the scope
There was a problem hiding this comment.
ok that's a fair point. I'm happy to go either way
There was a problem hiding this comment.
| stdenv = if cudaSupport then backendStdenv else inputs.stdenv; | |
| stdenv = if cudaSupport then backendStdenv else stdenv; |
There was a problem hiding this comment.
can we simply set this to null a la
on that note, could we just use
overrideCC stdenv compatibleStdenv.cc?
There was a problem hiding this comment.
To clarify: would it be possible to build against gcc11's libcxx using overrideCC stdenv compatibleStdenv.cc and then link/patchelf with gcc12's libcxx in buildInputs? Here we would be taking advantage of backwards compatibility between libcxx versions.
There was a problem hiding this comment.
can we simply set this to null
We need exactly the opposite: we want this to be gcc12's libstdc++. If we use libstdc++ from gcc11 we get the mistmatch as soon as we import any C++ library from the normal nixpkgs
There was a problem hiding this comment.
Could we not build with the standard gcc11 toolchain and then link/patchelf against gcc12's libcxx?
There was a problem hiding this comment.
Yes, theoretically, we could. In fact, that was pennae's original suggestion. I think this would be more error-prone: we can forget to remove a reference to gcc11 in one place, we can forget to add a reference to gcc12 in another place, and we'd also have to do that in some hooks. I absolutely despise hooks, though I usually have to submit to them.
The approach in this PR, however, means that gcc11's libstdc++ never enters our build environment, unless a user accidentally adds it. The latter is a possibility, and I myself have made this mistake several times already. I think we want to run sanity checks, not patchelf
There was a problem hiding this comment.
Ahh ok, I gotcha! Thanks for explaining!
There was a problem hiding this comment.
not a blocker for this PR, but would it be possible to include a simple check for gcc11's libstdc++ in preBuild or something like that in backendStdenv?
There was a problem hiding this comment.
There's a number arguments that pkgs/build-support/cc-wrapper/default.nix accepts. I would guess that both extraBuildCommands and {extra,native}{Tools,Packages} must allow setting up with a bit of bash hackery. I have not yet explored them
There was a problem hiding this comment.
As I mentioned, I'd also like cudaStdenv to include a working nvcc, so there's another reason to learn more about these inputs
There was a problem hiding this comment.
"This option is for clang" typo?
There was a problem hiding this comment.
No, I meant it this way, but maybe "for clang" is clearer?
There was a problem hiding this comment.
oh I didn't realize that libcxx was a clang-specific library. I assumed that gcc also called it libcxx. Perhaps "for clang's libcxx" would be clearest?
There was a problem hiding this comment.
(it also "rhymes" with the end of the sentence: "gcc's libstdc++")
|
Yes, you understand the purpose of this PR exactly right!1
Ideally, we don't have to access it. I still do not know of a good way to access it through Nix. Mind that some stdenv's do not come with a c++ standard library at all. We could try and provide a passthru, but we'd be diverging from the rest of nixpkgs Footnotes
|
d5fffbe to
3af299f
Compare
samuela
left a comment
There was a problem hiding this comment.
LGTM any blockers to merging?
|
@samuela I think it's ready to go. All the extra work I can think of goes outside the current scope |
|
Woo! Thanks for all your hard work and patient explanations @SomeoneSerge ! |
Description of changes
A potential solution to #220341, work in progress.
TLDR: we're dealing with libstdc++ mismatches when using cuda-enabled c++ packages (built using gcc11, because nvcc requires so) with any other c++ packages built via gcc12.
@pennae suggested we could try and rely on libstdc++ backward compatibility, specifically we could build with gcc11 and patchelf the resulting artifacts to consume libstdc++ from gcc12
In a similar spirit, this PR explores the possibility of directly passing the new libstdc++ to the old gcc11, avoiding the need to patchelf (and maintain the patchelfing logic)
Things done
easyocrfails without this patch, but succeeds with itsandbox = trueset innix.conf? (See Nix manual)nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/)CC @NixOS/cuda-maintainers