Skip to content

Prevent eval breakage in CUDA extensions#192958

Merged
samuela merged 2 commits intoNixOS:masterfrom
aidalgol:cuda-exts-prevent-eval-break
Sep 26, 2022
Merged

Prevent eval breakage in CUDA extensions#192958
samuela merged 2 commits intoNixOS:masterfrom
aidalgol:cuda-exts-prevent-eval-break

Conversation

@aidalgol
Copy link
Copy Markdown
Contributor

@aidalgol aidalgol commented Sep 25, 2022

Description of changes

For both cuDNN and TensorRT, add a fallback mechanism when passed a CUDA version not in the table of default package versions for CUDA versions. Previously, adding a new CUDA version would break these extensions.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.11 Release Notes (or backporting 22.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

See also #185557

Copy link
Copy Markdown
Member

@samuela samuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a nix mechanism to raise a warning to the user when we are unable to find a compatible version?

@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

I'm getting the following evaluation error when attempting to run nixpkgs-review:

skainswo in doodoo in nixpkgs on  master [$] took 23s
❯ nixpkgs-review pr --post-result --build-args "--max-jobs 192" 192958
$ git -c fetch.prune=false fetch --no-tags --force https://github.com/NixOS/nixpkgs master:refs/nixpkgs-review/0 pull/192958/head:refs/nixpkgs-review/1
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 18 (delta 6), reused 14 (delta 5), pack-reused 0
Unpacking objects: 100% (18/18), 56.35 KiB | 801.00 KiB/s, done.
From https://github.com/NixOS/nixpkgs
   252244a9fd7..70168c4fde0  master                -> refs/nixpkgs-review/0
   54fd3e5b7f3..356a60d23d9  refs/pull/192958/head -> refs/nixpkgs-review/1
$ git worktree add /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs 70168c4fde028ec5f7876da7c662cd9a7854c267
Preparing worktree (detached HEAD 70168c4fde0)
Updating files: 100% (31952/31952), done.
HEAD is now at 70168c4fde0 crun: drop criu workaround
$ nix-env --option system x86_64-linux -f /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs -qaP --xml --out-path --show-trace
$ git merge --no-commit --no-ff 356a60d23d97bc678efdbf24ddbe681c80191126
Automatic merge went well; stopped before committing as requested
$ nix-env --option system x86_64-linux -f /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs -qaP --xml --out-path --show-trace --meta
error: attribute 'cudnn_8_3_2' missing

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/development/libraries/science/math/cudnn/extension.nix:30:32:

           29|     # Set the default attributes, e.g. cudnn = cudnn_8_3_1;
           30|     defaultBuild = { "cudnn" = allBuilds.${computeName cuDnnDefaultVersion}; };
             |                                ^
           31|   in allBuilds // defaultBuild;

       … while evaluating the attribute 'cudnn'

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/development/libraries/science/math/cudnn/extension.nix:30:22:

           29|     # Set the default attributes, e.g. cudnn = cudnn_8_3_1;
           30|     defaultBuild = { "cudnn" = allBuilds.${computeName cuDnnDefaultVersion}; };
             |                      ^
           31|   in allBuilds // defaultBuild;

       … while evaluating 'isDerivation'

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/lib/attrsets.nix:427:18:

          426|   */
          427|   isDerivation = x: x.type or null == "derivation";
             |                  ^
          428|

       … from call site

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:193:8:

          192|   checkDependencyList' = positions: name: deps: lib.flip lib.imap1 deps (index: dep:
          193|     if lib.isDerivation dep || isNull dep || builtins.typeOf dep == "string" || builtins.typeOf dep == "path" then dep
             |        ^
          194|     else if lib.isList dep then checkDependencyList' ([index] ++ positions) name dep

       … while evaluating anonymous lambda

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:192:81:

          191|   checkDependencyList = checkDependencyList' [];
          192|   checkDependencyList' = positions: name: deps: lib.flip lib.imap1 deps (index: dep:
             |                                                                                 ^
          193|     if lib.isDerivation dep || isNull dep || builtins.typeOf dep == "string" || builtins.typeOf dep == "path" then dep

       … from call site

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/lib/lists.nix:117:32:

          116|   */
          117|   imap1 = f: list: genList (n: f (n + 1) (elemAt list n)) (length list);
             |                                ^
          118|

       … while evaluating anonymous lambda

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/lib/lists.nix:117:29:

          116|   */
          117|   imap1 = f: list: genList (n: f (n + 1) (elemAt list n)) (length list);
             |                             ^
          118|

       … from call site

       … while evaluating anonymous lambda

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:221:13:

          220|       (map (drv: drv.__spliced.hostHost or drv) (checkDependencyList "depsHostHost" depsHostHost))
          221|       (map (drv: drv.crossDrv or drv) (checkDependencyList "buildInputs" buildInputs))
             |             ^
          222|     ]

       … from call site

       … while evaluating 'getOutput'

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/lib/attrsets.nix:598:23:

          597|   */
          598|   getOutput = output: pkg:
             |                       ^
          599|     if ! pkg ? outputSpecified || ! pkg.outputSpecified

       … from call site

       … while evaluating the attribute 'buildInputs' of the derivation 'caffe-1.0'

       at /home/skainswo/.cache/nixpkgs-review/pr-192958/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:270:7:

          269|     // (lib.optionalAttrs (attrs ? name || (attrs ? pname && attrs ? version)) {
          270|       name =
             |       ^
          271|         let

       … while querying the derivation named 'caffe-1.0'
https://github.com/NixOS/nixpkgs/pull/192958 failed to build
$ git worktree prune

@aidalgol
Copy link
Copy Markdown
Contributor Author

Is there a nix mechanism to raise a warning to the user when we are unable to find a compatible version?

As far as I know, you can only do that for errors. I hope I'm wrong.

@aidalgol
Copy link
Copy Markdown
Contributor Author

I also got that error from nixpkgs-review, and it looks like ofborg has thrown up the same error. It looks like it's happening when building the package caffe, but when I run,

NIXPKGS_ALLOW_UNFREE=1 nix-build -A caffe

From my nixpkgs checkout, it builds just fine. How do we find out what nix expression is being evaluated that hits this error?

@aidalgol
Copy link
Copy Markdown
Contributor Author

I still don't know why caffe builds without error for me, but from looking over the trace a bit more, I can see that it's specifically cudaPackages_10_1.cudnn that is broken with these changes.

NIXPKGS_ALLOW_UNFREE=1 nix repl .
Welcome to Nix 2.8.1. Type :? for help.

Loading '.'...
Added 17005 variables.

nix-repl> :b cudaPackages_10_1.cudnn
error: attribute 'cudnn_8_3_2' missing

       at /home/aidan/src/nixpkgs/pkgs/development/libraries/science/math/cudnn/extension.nix:30:32:

           29|     # Set the default attributes, e.g. cudnn = cudnn_8_3_1;
           30|     defaultBuild = { "cudnn" = allBuilds.${computeName cuDnnDefaultVersion}; };
             |                                ^
           31|   in allBuilds // defaultBuild;

@aidalgol
Copy link
Copy Markdown
Contributor Author

It turns out that the ? operator does not do what I thought it does, so the expression defaultVersions ? cudaVersion is checking whether defaultVersions has cudaVersion as an attribute, instead of using the value of cudaVersion. I should have been using builtins.hasAttr.

@aidalgol aidalgol requested a review from samuela September 26, 2022 03:01
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux. labels Sep 26, 2022
@LunNova
Copy link
Copy Markdown
Member

LunNova commented Sep 26, 2022

a.attr or 1 defaults to 1 if attr doesn't exist, probably wanted that?

@aidalgol
Copy link
Copy Markdown
Contributor Author

Thanks! I didn't even think to try that because I thought a.attr when attr does not exist was always an error, no matter what operator you put after that.

When passed a CUDA version not in the table of default cuDNN versions for CUDA
versions, fall back to a default package version instead of breaking evaluation
of the package.
When passed a CUDA version not in the table of default TensorRT versions for
CUDA versions, fall back to a default package version instead of breaking
evaluation of the package.
@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

Thanks! I didn't even think to try that because I thought a.attr when attr does not exist was always an error, no matter what operator you put after that.

Huh, yeah I expected the same. Seems though if a sub-expression errors, any parent expression ought to error as well... TIL!

@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

Result of nixpkgs-review pr 192958 run on x86_64-linux 1

@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

Ok, well Nothing to be built. as far as nixpkgs-review is concerned. I'm guessing this is bc I have CAS enabled. So everything builds fine for me.

Only remaining issue is if we can get a warning when the version is not found. Does anyone know if such a thing is possible in Nix?

@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

Result of nixpkgs-review pr 192958 run on x86_64-linux 1

@aidalgol
Copy link
Copy Markdown
Contributor Author

Only remaining issue is if we can get a warning when the version is not found. Does anyone know if such a thing is possible in Nix?

I think the meta.broken predicate serves the same purpose. If a user tries to build any version of cudnn against an unsupported CUDA version, they will get the "this package is marked as broken" error, along with instructions on how to proceed anyway. This could be made clearer if there was some kind of "reason" field to go along with the meta.broken attribute, instead of requiring the user to find the derivation's source file in nixpkgs and hope there is a comment there.

@samuela
Copy link
Copy Markdown
Member

samuela commented Sep 26, 2022

Ah yeah, that's a fair point. I think as long as they're marked as meta.broken then we're all good.

@samuela samuela merged commit f2a116a into NixOS:master Sep 26, 2022
@aidalgol aidalgol deleted the cuda-exts-prevent-eval-break branch September 26, 2022 05:38
@dguibert dguibert mentioned this pull request Oct 6, 2022
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants