Skip to content

[Bugfix] Fix _has_module to verify native deps via trial import#39873

Closed
esmeetu wants to merge 2 commits into
mainfrom
fix/has-module-trial-import
Closed

[Bugfix] Fix _has_module to verify native deps via trial import#39873
esmeetu wants to merge 2 commits into
mainfrom
fix/has-module-trial-import

Conversation

@esmeetu

@esmeetu esmeetu commented Apr 15, 2026

Copy link
Copy Markdown
Member

Summary

  • _has_module previously used importlib.util.find_spec() which only checks whether a module spec exists on disk, but does not verify that the module can actually be imported
  • For native-extension packages whose shared libraries are missing (e.g. nixl_ep installed without libcudart.so.12), find_spec returns a valid spec but the real import crashes with ImportError
  • This caused has_nixl_ep() (and potentially other has_* guards) to return True when the module was not actually usable, leading to import-time crashes like:
    ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
    
  • The fix adds a trial importlib.import_module() after the find_spec pre-check, catching ImportError/OSError and returning False when native dependencies are unsatisfied. The @cache decorator ensures this only runs once per module name.

Not a duplicate

No existing open PRs address this issue. Searched for _has_module, find_spec, and has_module ImportError.

Test plan

AI disclosure

This PR was authored with AI assistance (Claude). The human submitter has reviewed all changed lines.

🤖 Generated with Claude Code

`_has_module` used `importlib.util.find_spec` which only checks whether a
module spec exists on disk.  For native-extension packages whose shared
libraries are missing (e.g. `nixl_ep` without `libcudart.so.12`),
`find_spec` succeeds but the real import crashes.

Add a trial `importlib.import_module` after the `find_spec` pre-check so
that `has_nixl_ep()` (and all other `has_*` guards) correctly return
`False` when a package is installed but its native dependencies are
unsatisfied.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: esmeetu <jasonailu87@gmail.com>
@mergify mergify Bot added the bug Something isn't working label Apr 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the _has_module utility in vllm/utils/import_utils.py to perform a trial import of the specified module instead of just checking for its existence via find_spec. This change ensures that native dependencies and shared libraries are correctly satisfied before reporting the module as available. I have no feedback to provide.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a77e42a67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

try:
importlib.import_module(module_name)
return True
except (ImportError, OSError) as exc:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Catch non-ImportError failures in availability probe

The new trial import in _has_module only handles ImportError/OSError, so optional modules that fail import with other exception types now crash capability checks instead of returning False. This is a real path in-tree: vllm/utils/deep_gemm.py explicitly documents/catches RuntimeError from vendored deep_gemm import (_C.init() failure), but has_deep_gemm() now routes through _has_module first and would raise before that fallback logic runs. This turns a best-effort feature probe into a startup/runtime hard failure for broken optional installs.

Useful? React with 👍 / 👎.

Tests cover:
- Returns True for importable stdlib modules
- Returns False for nonexistent modules
- Returns False when find_spec succeeds but import raises ImportError
  (e.g. missing libcudart.so)
- Returns False when import raises OSError
- Result is cached across repeated calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: esmeetu <jasonailu87@gmail.com>
@jhsmith409

Copy link
Copy Markdown
Contributor

Real-world repro that this PR would fix: #39923.

Summary: vllm/vllm-openai:cu130-nightly (digest sha256:9f815b8c…, vLLM 0.19.1rc1.dev296+gbcc2306ce) ships a nixl_ep/ package whose compiled .so needs libcudart.so.12 — not present anywhere in the cu13 image. has_nixl_ep() currently returns True via find_spec, so all2all_utils.py then eagerly imports the broken module and the server crash-loops at startup:

File ".../nixl_ep/__init__.py", line 23, in <module>
    from . import nixl_ep_cpp as _nixl_ep_cpp
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

With this PR's trial-import change, _has_module("nixl_ep") would catch the ImportError, has_nixl_ep() would return False, and the server would start normally — the nixl_ep code path isn't needed for single-GPU inference anyway.

Happy to test a nightly once this merges and report back. Thanks for the fix.

@benchislett

Copy link
Copy Markdown
Member

Duplicate #39851

@esmeetu

esmeetu commented Apr 18, 2026

Copy link
Copy Markdown
Member Author

@benchislett yeah, it does solve that issue. But i think this PR introduce a robust way to avoid similar events happen. :)

@esmeetu esmeetu deleted the fix/has-module-trial-import branch May 25, 2026 05:47
@jeffreywang88

jeffreywang88 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Hey @esmeetu, this PR would've fixed nixl_ep import issues on cuda13 images (thread) that #39851 didn't fully resolve. I'll open a PR with the fix you have here. Let me know if you want to re-open this PR instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants