[TRTLLM-9493][feat] Custom AllToAll for helix parallelism #9986
Conversation
📝 Walkthrough
This pull request introduces a FIFO-based all-to-all communication path for Helix parallelism, selected via an environment variable. Changes include new CUDA kernels with workspace management, Python bindings, MNNVL memory utilities for inter-rank memory sharing, and conditional integration into the attention module alongside existing NCCL support.
Sequence Diagram
sequenceDiagram
actor User as Attention<br/>Module
participant Env as Environment<br/>Variable
participant PyBind as PyTorch<br/>Binding
participant Launcher as Kernel<br/>Launcher
participant Kernel as CUDA<br/>Kernel
participant MNNVL as MNNVL<br/>Memory
participant FIFO as Workspace<br/>FIFO
User->>Env: Check TRTLLM_USE_NCCL_FOR_HELIX
alt Use FIFO Path
User->>MNNVL: Initialize workspace (per-rank)
MNNVL-->>User: Workspace memory with FIFO buffers
User->>PyBind: helixAllToAllNative(field0, field1, workspace)
PyBind->>Launcher: launchHelixAllToAll(params, stream)
Launcher->>Kernel: Launch kernel on GPU
Kernel->>FIFO: Read from send FIFO buffer
FIFO-->>Kernel: Send data
Kernel->>Kernel: Pack/unpack fields (G2S, S2G)
Kernel->>FIFO: Write to recv FIFO buffer
Kernel-->>Launcher: Kernel complete
Launcher-->>PyBind: Return
PyBind-->>User: (field0_out, field1_out)
else Use NCCL Path
User->>PyBind: alltoall_helix (existing NCCL path)
PyBind-->>User: (field0_out, field1_out)
end
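For concreteness, here is a minimal Python sketch of the env-var dispatch described in the walkthrough. The helper names, the default value of the variable, and the truthiness parsing are assumptions; only `TRTLLM_USE_NCCL_FOR_HELIX` and the `(field0, field1, workspace)` interface come from this PR.

```python
import os
from typing import Any, Optional, Tuple

import torch


def nccl_alltoall_helix(field0: torch.Tensor, field1: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Placeholder for the existing NCCL-based alltoall_helix path."""
    return field0, field1


def helix_alltoall_native_fifo(field0: torch.Tensor, field1: torch.Tensor,
                               workspace: Any) -> Tuple[torch.Tensor, torch.Tensor]:
    """Placeholder for the new FIFO-based helixAllToAllNative path."""
    return field0, field1


def helix_alltoall(field0: torch.Tensor, field1: torch.Tensor,
                   workspace: Optional[Any] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed parsing/default: "1" forces the NCCL path, anything else takes the FIFO path.
    if os.environ.get("TRTLLM_USE_NCCL_FOR_HELIX", "0") == "1":
        return nccl_alltoall_helix(field0, field1)
    assert workspace is not None, "the FIFO path needs the per-rank MNNVL workspace"
    return helix_alltoall_native_fifo(field0, field1, workspace)
```

In the real integration the switch sits in the attention module and the workspace comes from the per-rank MNNVL/FIFO initialization shown in the diagram above.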
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_mnnvl_utils.py (1)
1-14: Update copyright year to 2025, add ClassVar annotations for mutable class attributes, and guard cached communicator with mapping validation.
The copyright header (line 1) ends at 2024 but should include 2025 per NVIDIA guidelines. Class attributes comm (line 69), allocated_map (line 73), and address_refcnt (line 74) lack ClassVar type annotations and violate Ruff's RUF012 rule; ClassVar must be imported from typing (which currently only imports List, Optional, Union at line 20).
Most critically, get_comm() returns a cached communicator (lines 107–108) without verifying the Mapping parameters match. If called with different mapping configurations, the function incorrectly returns a communicator built for an earlier mapping, causing silent correctness failures. Cache the mapping signature as (pp_rank, pp_size, cp_rank, cp_size, tp_rank, tp_size, moe_tp_rank, moe_tp_size) and assert consistency on cache hits.
The same copyright, annotation, and caching issues exist in HelixCpMnnvlMemory (lines 349–385).
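A minimal sketch of the suggested fix, not the actual _mnnvl_utils.py code: the Mapping stand-in, the placeholder communicator object, and the cache-key attribute name are assumptions; the ClassVar annotations and the mapping-signature check follow the comment above.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Dict, Optional, Tuple


@dataclass
class Mapping:
    """Stand-in for the real Mapping; only the fields used for the cache key."""
    pp_rank: int = 0
    pp_size: int = 1
    cp_rank: int = 0
    cp_size: int = 1
    tp_rank: int = 0
    tp_size: int = 1
    moe_tp_rank: int = 0
    moe_tp_size: int = 1


class MnnvlMemory:
    # ClassVar annotations satisfy Ruff RUF012 for mutable class attributes.
    comm: ClassVar[Optional[Any]] = None
    allocated_map: ClassVar[Dict[int, Any]] = {}
    address_refcnt: ClassVar[Dict[int, int]] = {}
    _comm_mapping_key: ClassVar[Optional[Tuple[int, ...]]] = None

    @classmethod
    def get_comm(cls, mapping: Mapping) -> Any:
        key = (mapping.pp_rank, mapping.pp_size, mapping.cp_rank, mapping.cp_size,
               mapping.tp_rank, mapping.tp_size, mapping.moe_tp_rank, mapping.moe_tp_size)
        if cls.comm is not None:
            # Guard against silently reusing a communicator built for a different mapping.
            assert cls._comm_mapping_key == key, "cached communicator was built for a different Mapping"
            return cls.comm
        cls.comm = object()  # placeholder for the real communicator construction
        cls._comm_mapping_key = key
        return cls.comm
```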
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/attention.py (2)
2-12: Import style: prefer module-level import to keep namespace (per repo Python guidelines).
E.g., import the module and reference mnnvl_utils.HelixCpMnnvlMemory / mnnvl_utils.MnnvlMemory.
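A minimal example of the suggested import style (the alias is illustrative; the module path follows the file under review):

```python
# Import the module and keep the namespace visible at call sites.
from tensorrt_llm import _mnnvl_utils as mnnvl_utils

memory_cls = mnnvl_utils.MnnvlMemory
helix_memory_cls = mnnvl_utils.HelixCpMnnvlMemory
```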
1192-1243: Helix NCCL-vs-FIFO switch looks structurally correct; add a small safety check around workspace init.
Consider asserting that HelixAllToAllNative.mapping matches self.mapping (or re-init) before calling alltoall_native, so a prior initialize from a different model/config doesn't get reused accidentally.
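A small self-contained sketch of that guard, assuming a class-level cache along the lines used in this PR (the stub class and helper name are hypothetical):

```python
from typing import Any, Optional, Tuple


class HelixAllToAllNative:
    """Stand-in for the real helper; only the pieces needed for the mapping check."""
    mapping: Optional[Any] = None
    workspace: Optional[Any] = None

    @classmethod
    def initialize(cls, mapping: Any) -> None:
        cls.mapping = mapping
        cls.workspace = object()  # placeholder for the per-rank FIFO workspace


def alltoall_native_checked(module_mapping: Any, field0: Any, field1: Any) -> Tuple[Any, Any]:
    # Refuse to reuse a workspace that was initialized for a different model/config.
    if HelixAllToAllNative.mapping is None:
        HelixAllToAllNative.initialize(module_mapping)
    else:
        assert HelixAllToAllNative.mapping == module_mapping, (
            "Helix all-to-all workspace was initialized for a different Mapping")
    # ... call the native all-to-all here ...
    return field0, field1
```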
tests/unittest/_torch/modules/test_mla_helix.py (1)
857-862: Ruff B905: avoid zip() here or suppress it in a Py3.8-compatible way.
Since the repo supports Python 3.8+, zip(strict=...) isn't available. Consider building the executor.map args without zip(*...) (e.g., pre-construct an iterable of argument-tuples) or add a targeted # noqa: B905 with a short justification.
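Two Py3.8-compatible ways to address the finding; the executor/worker names are illustrative, not the actual test code:

```python
from concurrent.futures import ThreadPoolExecutor

ranks = [0, 1, 2, 3]
configs = ["a", "b", "c", "d"]


def worker(rank, config):
    return f"{rank}:{config}"


with ThreadPoolExecutor() as executor:
    # Option 1: pass the iterables to executor.map directly, no zip(*...) needed.
    results = list(executor.map(worker, ranks, configs))

    # Option 2: keep zip() with a targeted suppression and a short justification.
    pairs = list(zip(ranks, configs))  # noqa: B905  # strict= requires Python >= 3.10
```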
cpp/tensorrt_llm/kernels/helixAllToAll.cu (1)
339-397: Reduce redundant volatile FIFO head/tail stores to lane 0 only.
Currently many senderFifo->head / receiverFifo->tail / senderFifo->tail writes happen from all 32 lanes (same value), creating unnecessary traffic on a hot synchronization path. Prefer if (laneId == 0) { ... } plus __syncwarp() where needed.
Also applies to: 400-469
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
cpp/tensorrt_llm/common/envUtils.h (1 hunks)
cpp/tensorrt_llm/kernels/helixAllToAll.cu (1 hunks)
cpp/tensorrt_llm/kernels/helixAllToAll.h (1 hunks)
cpp/tensorrt_llm/thop/alltoallOp.cpp (2 hunks)
cpp/tensorrt_llm/thop/alltoallOp.h (1 hunks)
tensorrt_llm/_mnnvl_utils.py (4 hunks)
tensorrt_llm/_torch/modules/attention.py (3 hunks)
tests/unittest/_torch/modules/test_mla_helix.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...
Files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/kernels/helixAllToAll.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/thop/alltoallOp.h, tests/unittest/_torch/modules/test_mla_helix.py, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/kernels/helixAllToAll.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, tensorrt_llm/_torch/modules/attention.py, cpp/tensorrt_llm/thop/alltoallOp.cpp, tensorrt_llm/_mnnvl_utils.py
**/*.h
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.h: Use a preprocessor guard in C++ header files with the guard name format TRTLLM_ followed by the filename in all caps (e.g., TRTLLM_FOO_BAR_HELLO_H for file FooBarHello.h); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in C++ standard for compilers or implementation
Files:
cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/kernels/helixAllToAll.h
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""")
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tests/unittest/_torch/modules/test_mla_helix.py, tensorrt_llm/_torch/modules/attention.py, tensorrt_llm/_mnnvl_utils.py
**/*.cu
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
CUDA code must be compiled with a CUDA compiler and includes declarations/definitions with CUDA keywords (__device__, __managed__, __constant__, __global__), device functions, and kernel launching with <<<...>>> syntax
Files:
cpp/tensorrt_llm/kernels/helixAllToAll.cu
🧠 Learnings (16)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-22T19:25:45.607Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp:170-179
Timestamp: 2025-09-22T19:25:45.607Z
Learning: In NCCLUserBufferAllocator::getNCCLDevComm(), multimem support is hard-coded to true because multimem is required for this function. The caller is responsible for ensuring multimem is available before calling this function - it should not be called if multimem is not supported.
Applied to files:
cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/kernels/helixAllToAll.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/modules/test_mla_helix.py
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Applied to files:
cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/helixAllToAll.h, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp, tensorrt_llm/_mnnvl_utils.py
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.
Applied to files:
cpp/tensorrt_llm/kernels/helixAllToAll.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.
Applied to files:
cpp/tensorrt_llm/kernels/helixAllToAll.cu, cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
cpp/tensorrt_llm/thop/alltoallOp.cpp, tensorrt_llm/_mnnvl_utils.py
🧬 Code graph analysis (4)
tests/unittest/_torch/modules/test_mla_helix.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
get (129-147)
tensorrt_llm/_torch/distributed/communicator.py (1)
world_size (45-46)
cpp/tensorrt_llm/common/envUtils.h (1)
cpp/tensorrt_llm/common/envUtils.cpp (2)
getEnvUseNcclForHelix (534-538), getEnvUseNcclForHelix (534-534)
cpp/tensorrt_llm/kernels/helixAllToAll.h (2)
cpp/tensorrt_llm/thop/alltoallOp.h (5)
tensorrt_llm (26-186), kernels (28-142), FifoInfo (67-71), ceil_div (103-106), computeHelixWorkspaceSizePerRank (150-166)
cpp/tensorrt_llm/kernels/helixAllToAll.cu (4)
initializeHelixWorkspace (591-605), initializeHelixWorkspace (591-591), launchHelixAllToAll (575-585), launchHelixAllToAll (575-575)
cpp/tensorrt_llm/thop/alltoallOp.cpp (3)
cpp/tensorrt_llm/kernels/helixAllToAll.cu (2)
initializeHelixWorkspace (591-605), initializeHelixWorkspace (591-591)
cpp/tensorrt_llm/kernels/helixAllToAll.h (1)
computeHelixWorkspaceSizePerRank (127-143)
cpp/tensorrt_llm/thop/alltoallOp.h (1)
computeHelixWorkspaceSizePerRank (150-166)
🪛 Ruff (0.14.8)
tests/unittest/_torch/modules/test_mla_helix.py
860-860: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
tensorrt_llm/_torch/modules/attention.py
177-177: Unused static method argument: is_fused
(ARG004)
tensorrt_llm/_mnnvl_utils.py
354-354: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
355-355: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
456-456: Loop control variable i not used within loop body
(B007)
460-462: Avoid specifying long messages outside the exception class
(TRY003)
466-466: Loop control variable i not used within loop body
(B007)
466-466: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
482-482: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/common/envUtils.h (1)
159-161: Env accessor addition looks consistent with existing envUtils API surface.
cpp/tensorrt_llm/common/envUtils.cpp (1)
534-538: Confirm “read-once” caching semantics are intended for TRTLLM_USE_NCCL_FOR_HELIX.
If tests/tools expect flipping the env var within the same long-lived process, this accessor won’t reflect changes after first call.
tests/unittest/_torch/modules/test_mla_helix.py (1)
841-873: Env-var set/restore is robust (try/finally) and mode coverage is a nice improvement.
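For reference, a minimal sketch of that set/restore pattern; the helper and parametrization are hypothetical, and because the C++ accessor caches the value on first read, each mode is best exercised in a fresh (spawned) process rather than by flipping the variable mid-process.

```python
import os

import pytest


def run_helix_alltoall_case():
    """Placeholder for the real multi-GPU test body."""


@pytest.mark.parametrize("use_nccl", ["0", "1"])
def test_helix_alltoall_both_paths(use_nccl):
    previous = os.environ.get("TRTLLM_USE_NCCL_FOR_HELIX")
    os.environ["TRTLLM_USE_NCCL_FOR_HELIX"] = use_nccl
    try:
        run_helix_alltoall_case()
    finally:
        # Restore the environment so later tests see the original setting.
        if previous is None:
            os.environ.pop("TRTLLM_USE_NCCL_FOR_HELIX", None)
        else:
            os.environ["TRTLLM_USE_NCCL_FOR_HELIX"] = previous
```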
/bot run --disable-fail-fast
PR_Github #28272 [ run ] triggered by Bot. Commit:
a877088 to 1954e0e
/bot run --disable-fail-fast
PR_Github #28276 [ run ] triggered by Bot. Commit:
PR_Github #28276 [ run ] completed with state
1954e0e to f01c674
/bot run --disable-fail-fast
PR_Github #29256 [ run ] triggered by Bot. Commit:
10bc185 to 9e2c6ab
PR_Github #29482 [ run ] triggered by Bot. Commit:
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #29491 [ run ] triggered by Bot. Commit:
PR_Github #29491 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29544 [ run ] triggered by Bot. Commit:
PR_Github #29544 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29558 [ run ] triggered by Bot. Commit:
PR_Github #29558 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29643 [ run ] triggered by Bot. Commit:
PR_Github #29643 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29652 [ run ] triggered by Bot. Commit:
PR_Github #29652 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29657 [ run ] triggered by Bot. Commit:
PR_Github #29657 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29669 [ run ] triggered by Bot. Commit:
PR_Github #29669 [ run ] completed with state
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Description
This MR adds a new implementation of alltoall for use with helix parallelism.
It follows fusedMoeCommKernels.cu closely, using LL128Proto and MnnvlMemory.
The following MRs were merged in preparation for this:
#9922
#9924
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.