Conversation
    tensorrt_fused_nccl_all_gather_op,
    tensorrt_fused_nccl_reduce_scatter_op,
)
if load_tensorrt_llm_for_nccl():
I would like to use the enabled features system for this rather than a standalone function.
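To make the suggestion concrete, here is a hypothetical sketch (not the actual torch_tensorrt._features API; all names are illustrative) of hanging the TRT-LLM NCCL availability check off a queryable feature-flag object instead of a standalone loader function:

```python
# Hypothetical sketch: a feature-flag object evaluated once at import time.
# FeatureSet, _check_trtllm_for_nccl, and ENABLED_FEATURES are illustrative
# names, not the real torch_tensorrt._features implementation.
from dataclasses import dataclass


def _check_trtllm_for_nccl() -> bool:
    # Placeholder for the real platform/CUDA probing logic from the PR.
    return False


@dataclass(frozen=True)
class FeatureSet:
    trtllm_for_nccl: bool


ENABLED_FEATURES = FeatureSet(trtllm_for_nccl=_check_trtllm_for_nccl())

# Other parts of the library query the flag rather than re-running the probe:
if ENABLED_FEATURES.trtllm_for_nccl:
    print("TRT-LLM NCCL plugins available")
else:
    print("TRT-LLM NCCL plugins unavailable; converter will be unsupported")
```

Registering the result once on a shared feature object means converters and decorators can check the same flag without repeating the platform probing.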
py/torch_tensorrt/dynamo/utils.py
Outdated
Unsupported:
- Windows platforms
- Jetson/Orin/Xavier (aarch64 architecture + 'tegra' in platform release)
Thor is also not supported by TRT-LLM, right?
Yeah, Thor and SBSA should support NCCL, but I am not sure about TRT-LLM. Will include Thor in the list of unsupported platforms.
A follow-up question: what about SBSA? I see on the TRT-LLM page that Blackwell is supported, but that does not imply SBSA support, right? (It can be supported on B200, which is non-SBSA, vs. GB200, which is SBSA.)
py/torch_tensorrt/dynamo/utils.py
Outdated
if machine == "aarch64" and "tegra" in release:
    logger.info(
        "TensorRT-LLM plugins for NCCL backend are not supported on Jetson/Orin/Xavier (Tegra) devices."
Edit the error message here to include Thor.
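The requested change could look like the following sketch. The function name mirrors `is_platform_supported_for_trtllm` from the PR, but the body here is a simplified assumption for illustration, not the merged code:

```python
# Illustrative platform gate with the log message extended to cover Thor.
# The body is an assumption based on the snippet quoted in this thread.
import logging
import platform

logger = logging.getLogger(__name__)


def is_platform_supported_for_trtllm() -> bool:
    machine = platform.machine()
    release = platform.release()
    if platform.system() == "Windows":
        logger.info("TRT-LLM plugins for NCCL backend are not supported on Windows.")
        return False
    if machine == "aarch64" and "tegra" in release:
        logger.info(
            "TensorRT-LLM plugins for NCCL backend are not supported on "
            "Jetson/Orin/Xavier/Thor (Tegra) devices."
        )
        return False
    return True
```

On Tegra-based boards (including Thor, whose kernel release also reports "tegra"), the aarch64 + "tegra" probe would cover all of Jetson/Orin/Xavier/Thor with a single check.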
py/torch_tensorrt/dynamo/utils.py
Outdated
try:
    cuda_version = torch.version.cuda  # e.g., "12.4" or "13.0"
    if cuda_version is None:
        logger.warning("No CUDA runtime detected — TRT-LLM plugins unavailable.")
This is somewhat misleading, because the actual error is that the PyTorch install does not support CUDA.
Also, if that is the case, would this be an error? What invokes this function? Should the user continue to be able to run? Would they be under the assumption that TRT-LLM plugins would be available?
Yes, I will change the error message.
In that case the CUDA runtime is not available, but I assume we would hit an error before reaching this point. Within this function we won't be able to verify whether CUDA is 12.x or 13.x. Should I remove this check altogether?
It's fine to have redundant checks as long as they are clear.
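A redundant-but-clear version of the probe could read as follows. Note that `torch.version.cuda` is `None` on CPU-only PyTorch builds, so the warning should blame the PyTorch install rather than a missing "CUDA runtime". The function name is an assumption for illustration:

```python
# Sketch: report the CUDA version PyTorch was built against, with a warning
# that correctly describes a CPU-only build. torch_cuda_version is an
# illustrative name, not the PR's actual helper.
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def torch_cuda_version() -> str | None:
    try:
        import torch
    except ImportError:
        logger.warning("PyTorch is not installed; cannot determine its CUDA version.")
        return None
    cuda_version = torch.version.cuda  # e.g. "12.4"; None on CPU-only builds
    if cuda_version is None:
        logger.warning(
            "This PyTorch build was compiled without CUDA support; "
            "TRT-LLM plugins are unavailable."
        )
    return cuda_version
```

The caller can treat `None` the same as an unsupported version and simply return `False`, so the process keeps running with the converter marked unsupported.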
py/torch_tensorrt/dynamo/utils.py
Outdated
major, minor = map(int, cuda_version.split("."))
if major != 12:
    logger.warning("CUDA 13 is not supported for TRT-LLM plugins.")
"not currently supported". Same comment as above, though: this seems to me to be at least a logged error, but the question is whether we should kill the process. If the program will not run as intended, we should; otherwise it is still an error, but we can continue.
Will change the error message to add "currently".
It's more that this function will then return False: load_tensorrt_llm_for_nccl() calls is_platform_supported_for_trtllm(), which will return False, and the converter will be unsupported.
py/torch_tensorrt/dynamo/utils.py
Outdated
    return False


def load_tensorrt_llm_for_nccl() -> bool:
This function should be in the enabled features system, and should register the feature for other parts of the library to query against.
Will make this change
Yes, you should be able to run it on GB200; I think there is just no Thor distribution of TRT-LLM for now.
…ing. Pending- check support on Thor and sbsa
Force-pushed 8dd657c to cee5c7a
Force-pushed 6dfb740 to f4bbba4
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/_features.py 2025-10-02 20:53:24.201799+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/_features.py 2025-10-02 20:53:59.151475+00:00
@@ -165,10 +165,11 @@
def needs_trtllm_for_nccl(f: Callable[..., Any]) -> Callable[..., Any]:
def wrapper(*args: List[Any], **kwargs: Dict[str, Any]) -> Any:
if ENABLED_FEATURES.trtllm_for_nccl:
return f(*args, **kwargs)
else:
+
def not_implemented(*args: List[Any], **kwargs: Dict[str, Any]) -> Any:
raise NotImplementedError(
"Refit feature is currently not available in Python 3.13 or higher"
)
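Applying the formatter's suggestion (the added blank line before the nested def), the decorator could look like the sketch below. Note the `NotImplementedError` message in the diff appears to be copied from the refit feature; here it is replaced with one that matches this feature. `ENABLED_FEATURES` is stubbed so the sketch runs standalone:

```python
# Sketch of needs_trtllm_for_nccl with the lint fix applied. The stubbed
# ENABLED_FEATURES and the reworded error message are assumptions for
# illustration, not the merged torch_tensorrt._features code.
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

ENABLED_FEATURES = SimpleNamespace(trtllm_for_nccl=False)  # stub for illustration


def needs_trtllm_for_nccl(f: Callable[..., Any]) -> Callable[..., Any]:
    def wrapper(*args: List[Any], **kwargs: Dict[str, Any]) -> Any:
        if ENABLED_FEATURES.trtllm_for_nccl:
            return f(*args, **kwargs)
        else:

            def not_implemented(*args: List[Any], **kwargs: Dict[str, Any]) -> Any:
                raise NotImplementedError(
                    "TRT-LLM plugins for NCCL are not available on this platform"
                )

            return not_implemented(*args, **kwargs)

    return wrapper
```

With the feature disabled, any decorated function raises `NotImplementedError` when called, which matches the "converter will be unsupported" behavior discussed earlier in the thread.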
Force-pushed f4bbba4 to 7046c6d
Force-pushed 045722a to a028601
Force-pushed fb2e683 to b96b9ee
Force-pushed b96b9ee to 2f2cd31
python -m pytest -ra --junitxml=${RUNNER_TEST_RESULTS_DIR}/l1_ts_models_tests_results.xml -n auto models/
popd

L1-dynamo-distributed-tests:
Can we make this L2 for now?
Force-pushed 44d9b60 to 24264e5
Across runs the wheel is removed, while the .so file is retained.