[CI][NIXL] Fix PD CI breakage: pin nixl-cu{12,13} versions #39851
Code Review
This pull request updates the requirements/kv_connectors.txt file to include explicit dependencies for nixl-cu12 and nixl-cu13. A review comment points out that adding nixl-cu12 as a global requirement is problematic as it forces unnecessary installation and bloat on CUDA 13 systems; it is suggested to move this dependency to a CI-specific configuration instead.
nixl-cu12 >= 0.7.1, < 0.10.0
nixl-cu13 >= 0.7.1, < 0.10.0
Adding nixl-cu12 as a direct requirement forces its installation on all systems using this file, including CUDA 13 environments where it is unnecessary and adds significant bloat (100MB+). If the crash on CUDA 13 CI is caused by a pre-installed version of nixl-cu12 in the environment, this pin should be moved to a CI-specific constraints file or the package should be uninstalled during CI setup. Forcing a backend for a different CUDA version on all users is a significant regression in environment hygiene.
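One way to avoid installing both backends is to pick the requirement line from the CUDA major version of the target environment. The sketch below is only an illustration: `nixl_backend_requirement` is a hypothetical helper, not part of vLLM or NIXL, and it assumes the package names and version bounds shown in this PR.

```python
# Hypothetical helper: choose a single NIXL backend pin for a given CUDA major
# version, instead of listing both backends in one requirements file.
# The bounds mirror the pin proposed in this PR (>= 0.7.1, < 0.10.0).

SUPPORTED_CUDA_MAJORS = (12, 13)
VERSION_BOUNDS = ">=0.7.1,<0.10.0"


def nixl_backend_requirement(cuda_major: int) -> str:
    """Return the requirement string for the backend matching cuda_major."""
    if cuda_major not in SUPPORTED_CUDA_MAJORS:
        raise ValueError(f"no nixl backend known for CUDA {cuda_major}")
    return f"nixl-cu{cuda_major}{VERSION_BOUNDS}"


if __name__ == "__main__":
    for major in SUPPORTED_CUDA_MAJORS:
        print(nixl_backend_requirement(major))
```

A CI setup step could write the returned string into a CI-specific constraints file, which keeps the shared requirements file free of a backend built for a different CUDA version.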
nixl-cu13 >= 0.7.1, < 0.10.0
I'm also not super happy with having to install both like this. Do you see any other option with this requirements.txt installation method, @cjackal?
One (quick but dirty) idea is to install just the nixl_cu1x variant with an exact version number and then force-install the nixl metapackage with the --no-deps option. Like:

# requirements/kv_connectors.txt
# put the `nixl_cu1x` variant
...
nixl_cu12==0.9.0

# In container build stage
...
RUN uv pip install -r requirements/kv_connectors.txt && uv pip install nixl==0.9.0 --no-deps
...

nixl-cu12==1.0.1 published today ships nixl_ep compiled against libcudart.so.12, crashing on CUDA 13 CI runners. The existing < 0.10.0 constraint only pins the meta-package, not the backends. Signed-off-by: ZhanqiuHu <zhu@redhat.com>
ed2b8ba to 5b47c4e
This PR also fixes #36676
Fixes #39872
Can we rebuild the nightly?
Signed-off-by: ZhanqiuHu <zhu@redhat.com> (cherry picked from commit 799973a) Signed-off-by: khluu <khluu000@gmail.com>
…ect#39851) Signed-off-by: ZhanqiuHu <zhu@redhat.com>
…ect#39851) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
…ect#39851) Signed-off-by: ZhanqiuHu <zhu@redhat.com> (cherry picked from commit 4809252) Signed-off-by: khluu <khluu000@gmail.com>
…ect#39851) Signed-off-by: ZhanqiuHu <zhu@redhat.com> (cherry picked from commit 4ff86d1) Signed-off-by: khluu <khluu000@gmail.com>
…ect#39851) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
nixl-cu12==1.0.1 dropped on PyPI today (19:38 UTC) and ships nixl_ep compiled against libcudart.so.12, which crashes on CUDA 13 CI runners. Our < 0.10.0 constraint only pins the meta-package, not the backends:
(nixl-cu12==1.0.0, no nixl_ep)
(nixl-cu12==1.0.1, nixl_ep import crashes)
Temp fix: pin nixl-cu12 and nixl-cu13 to < 0.10.0. @NickLucche is working on the proper version bump in #39797 (tracking #39521).
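To make the effect of the pin concrete, here is a minimal sketch that checks which backend versions fall inside the `>= 0.7.1, < 0.10.0` range. It is a simplified stand-in for a real resolver: `parse_version` and `satisfies_pin` are hypothetical names, and the parser assumes plain dotted numeric versions only (no pre-release or local tags).

```python
# Simplified range check for plain "X.Y.Z" versions. Comparing integer tuples
# avoids the string-comparison trap where "0.10.0" sorts before "0.9.0".

def parse_version(text: str) -> tuple[int, ...]:
    """Split a dotted numeric version into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


LOWER = parse_version("0.7.1")   # inclusive lower bound of the pin
UPPER = parse_version("0.10.0")  # exclusive upper bound of the pin


def satisfies_pin(version: str) -> bool:
    """Return True if version lies in the range >= 0.7.1, < 0.10.0."""
    return LOWER <= parse_version(version) < UPPER


if __name__ == "__main__":
    for candidate in ("0.9.0", "1.0.0", "1.0.1"):
        print(candidate, satisfies_pin(candidate))
```

Under these assumptions, 0.9.0 is accepted while 1.0.0 and 1.0.1 are rejected, so applying the same bound to the backend packages keeps the 1.0.1 build with nixl_ep out of the environment.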