
Conversation

brb-nv (Collaborator) commented Dec 14, 2025

Description

This PR adds a new implementation of all-to-all for use with helix parallelism.

  • The current implementation at top of tree (ToT) is based on NCCL. The incoming change closely follows fusedMoeCommKernels.cu, using LL128Proto and MnnvlMemory.
  • The old implementation is kept around as well; it is still needed in cases where MNNVL isn't supported.

The following PRs were merged in preparation for this change:
#9922
#9924

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix -s -v
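
For local runs, one way to exercise both all-to-all paths is to launch the unit test once per backend with the environment variable set up front. This is a sketch that assumes TRTLLM_USE_NCCL_FOR_HELIX is read once per process and that "1"/"0" select the NCCL and FIFO paths; the accepted values are an assumption, and the test's own __main__ already covers both modes.

```python
# Sketch: run the helix MLA unit test once per all-to-all backend.
# Assumes the env var is read once per process, so each mode gets its own
# pytest subprocess; the "1"/"0" values are an assumption, not documented here.
import os
import subprocess

for label, value in (("NCCL", "1"), ("FIFO", "0")):
    env = dict(os.environ, TRTLLM_USE_NCCL_FOR_HELIX=value)
    print(f"--- helix all-to-all unit test, {label} mode ---")
    subprocess.run(
        ["pytest", "tests/unittest/_torch/modules/test_mla_helix.py", "-s", "-v"],
        env=env,
        check=True,  # stop immediately if either mode regresses
    )
```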

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

New Features

  • Added Helix all-to-all communication with configurable backends (NCCL or FIFO-based workspace) for distributed operations, selectable via environment variable.

Tests

  • Extended test coverage to validate both communication path options for Helix operations.


@brb-nv requested a review from a team as a code owner December 14, 2025 22:16
@brb-nv requested review from MatthiasKohl and yuxianq and removed request for yuxianq December 14, 2025 22:16
coderabbitai bot (Contributor) commented Dec 14, 2025

📝 Walkthrough

This pull request introduces a FIFO-based all-to-all communication path for Helix parallelism, selected via environment variable. Changes include new CUDA kernels with workspace management, Python bindings, MNNVL memory utilities for inter-rank memory sharing, and conditional integration into the attention module alongside existing NCCL support.

Changes

  • Environment variable support (cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/common/envUtils.cpp): Added getEnvUseNcclForHelix() function to read and cache the TRTLLM_USE_NCCL_FOR_HELIX environment variable for runtime selection between NCCL and FIFO code paths.
  • Helix all-to-all CUDA kernel (cpp/tensorrt_llm/kernels/helixAllToAll.h, cpp/tensorrt_llm/kernels/helixAllToAll.cu): Implemented comprehensive Helix all-to-all kernel with FIFO-based workspace management, data structures (HelixAllToAllParams, HelixFieldInfo, FifoInfo), alignment constants, channel/workspace calculation utilities, device-side helpers for FIFO operations and LL128 proto transfers, and host launcher functions.
  • Python/C++ kernel bindings (cpp/tensorrt_llm/thop/alltoallOp.h, cpp/tensorrt_llm/thop/alltoallOp.cpp): Added PyTorch extension bindings for helixAllToAllNative(), workspace size query getHelixWorkspaceSizePerRank(), and workspace initialization initializeHelixWorkspace() with Torch library registrations.
  • MNNVL memory management (tensorrt_llm/_mnnvl_utils.py): Introduced HelixCpMnnvlMemory class for managing shareable MNNVL memory across Helix CP ranks with memory allocation, per-rank address mapping, handle sharing via pidfd, and reference counting. Updated MnnvlMemory.get_comm() for TP-oriented grouping.
  • Attention module integration (tensorrt_llm/_torch/modules/attention.py): Added HelixAllToAllNative helper class for workspace lifecycle management. Integrated runtime selection via the TRTLLM_USE_NCCL_FOR_HELIX environment variable to choose between NCCL-based and FIFO-based all-to-all paths in Helix attention computation.
  • Test updates (tests/unittest/_torch/modules/test_mla_helix.py): Added use_nccl_for_helix parameter to the test function for mode selection and environment variable control, and expanded __main__ execution to test both NCCL and FIFO modes with labeled output.

Sequence Diagram

sequenceDiagram
    actor User as Attention<br/>Module
    participant Env as Environment<br/>Variable
    participant PyBind as PyTorch<br/>Binding
    participant Launcher as Kernel<br/>Launcher
    participant Kernel as CUDA<br/>Kernel
    participant MNNVL as MNNVL<br/>Memory
    participant FIFO as Workspace<br/>FIFO

    User->>Env: Check TRTLLM_USE_NCCL_FOR_HELIX
    alt Use FIFO Path
        User->>MNNVL: Initialize workspace (per-rank)
        MNNVL-->>User: Workspace memory with FIFO buffers
        User->>PyBind: helixAllToAllNative(field0, field1, workspace)
        PyBind->>Launcher: launchHelixAllToAll(params, stream)
        Launcher->>Kernel: Launch kernel on GPU
        Kernel->>FIFO: Read from send FIFO buffer
        FIFO-->>Kernel: Send data
        Kernel->>Kernel: Pack/unpack fields (G2S, S2G)
        Kernel->>FIFO: Write to recv FIFO buffer
        Kernel-->>Launcher: Kernel complete
        Launcher-->>PyBind: Return
        PyBind-->>User: (field0_out, field1_out)
    else Use NCCL Path
        User->>PyBind: alltoall_helix (existing NCCL path)
        PyBind-->>User: (field0_out, field1_out)
    end
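
Below is a toy Python model of the FIFO handshake the diagram depicts: a sender advances a head index after filling a slot, and a receiver advances a tail index after draining one. Slot counts, payloads, and the actual synchronization in the CUDA kernel are different; this is only an illustration of the general pattern.

```python
# Toy model of the FIFO handshake above: the sender advances `head` after
# writing a slot, the receiver advances `tail` after draining one. Slot count
# and payloads are arbitrary; the real CUDA kernel's layout and sync differ.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Fifo:
    slots: List[Optional[str]] = field(default_factory=lambda: [None] * 4)
    head: int = 0  # next slot the sender will write
    tail: int = 0  # next slot the receiver will read

    def push(self, item: str) -> bool:
        if self.head - self.tail == len(self.slots):
            return False  # full: sender must wait for the receiver to drain
        self.slots[self.head % len(self.slots)] = item
        self.head += 1
        return True

    def pop(self) -> Optional[str]:
        if self.tail == self.head:
            return None  # empty: receiver must wait for the sender
        item = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return item

fifo = Fifo()
for chunk in ("field0_part", "field1_part"):
    assert fifo.push(chunk)
print(fifo.pop(), fifo.pop())  # field0_part field1_part
```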

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • helixAllToAll.cu: Large CUDA kernel file with complex device-side logic, shared memory management, LL128 proto-based packing/unpacking, and FIFO state machine—requires deep understanding of workspace layout and synchronization patterns.
  • _mnnvl_utils.py: Reference-counted memory management with cross-rank handle sharing, pidfd-based inter-process communication, and global address space coordination—verify correctness of allocation, mapping, and cleanup flows.
  • attention.py: Conditional execution path based on environment variable with workspace lifecycle management—ensure proper initialization, error handling, and tensor reshaping for both NCCL and FIFO modes.
  • Data structure alignment: Verify ALIGN_256 and HELIX_FIFO_* constants are consistently applied across C++ and Python layers for correct memory layout.
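
For the alignment point above, a minimal sketch of the arithmetic worth cross-checking; ALIGN_256 and ceil_div are named in this PR's summaries, but align_up and the checks below are illustrative rather than the actual code.

```python
# Illustrative alignment helpers; ALIGN_256 and ceil_div are named in this PR,
# but align_up and the checks below are not from the actual sources.
ALIGN_256 = 256

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def align_up(size_in_bytes: int, alignment: int = ALIGN_256) -> int:
    """Round size_in_bytes up to the next multiple of `alignment`."""
    return ceil_div(size_in_bytes, alignment) * alignment

assert align_up(1) == 256 and align_up(256) == 256 and align_up(257) == 512
```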

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 35.59%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title '[TRTLLM-9493][feat] Custom AllToAll for helix parallelism' accurately and concisely summarizes the primary change, adding a custom AllToAll implementation for helix parallelism, and follows the repository's naming convention.
  • Description check (✅ Passed): PR description follows the template structure with Description, Test Coverage, and PR Checklist sections, clearly explaining the changes and providing test instructions.

coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_mnnvl_utils.py (1)

1-14: Update copyright year to 2025, add ClassVar annotations for mutable class attributes, and guard cached communicator with mapping validation.

The copyright header (line 1) ends at 2024 but should include 2025 per NVIDIA guidelines. Class attributes comm (line 69), allocated_map (line 73), and address_refcnt (line 74) lack ClassVar type annotations and violate Ruff's RUF012 rule; ClassVar must be imported from typing (currently only imports List, Optional, Union at line 20).

Most critically, get_comm() returns a cached communicator (lines 107–108) without verifying the Mapping parameters match. If called with different mapping configurations, the function incorrectly returns a communicator built for an earlier mapping, causing silent correctness failures. Cache the mapping signature as (pp_rank, pp_size, cp_rank, cp_size, tp_rank, tp_size, moe_tp_rank, moe_tp_size) and assert consistency on cache hits.

The same copyright, annotation, and caching issues exist in HelixCpMnnvlMemory (lines 349–385).
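
A sketch of what the suggested fix could look like; the attribute names and mapping fields mirror the comment above, while the class body and construction helper are illustrative rather than the actual file contents.

```python
# Illustrative sketch of the suggested ClassVar annotations plus a
# mapping-signature guard for the cached communicator. Attribute and field
# names follow the review comment; the construction helper is a placeholder.
from types import SimpleNamespace
from typing import Any, ClassVar, Dict, Optional, Tuple

def _build_comm(mapping) -> object:
    """Placeholder for the real MNNVL communicator construction."""
    return object()

class MnnvlMemory:
    comm: ClassVar[Optional[object]] = None
    comm_signature: ClassVar[Optional[Tuple[Any, ...]]] = None
    allocated_map: ClassVar[Dict[int, object]] = {}
    address_refcnt: ClassVar[Dict[int, int]] = {}

    @classmethod
    def get_comm(cls, mapping):
        signature = (mapping.pp_rank, mapping.pp_size, mapping.cp_rank,
                     mapping.cp_size, mapping.tp_rank, mapping.tp_size,
                     mapping.moe_tp_rank, mapping.moe_tp_size)
        if cls.comm is not None:
            # Never silently reuse a communicator built for a different mapping.
            assert cls.comm_signature == signature, "Mapping changed between get_comm() calls"
            return cls.comm
        cls.comm = _build_comm(mapping)
        cls.comm_signature = signature
        return cls.comm

mapping = SimpleNamespace(pp_rank=0, pp_size=1, cp_rank=0, cp_size=2,
                          tp_rank=0, tp_size=4, moe_tp_rank=0, moe_tp_size=4)
assert MnnvlMemory.get_comm(mapping) is MnnvlMemory.get_comm(mapping)
```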

🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/attention.py (2)

2-12: Import style: prefer module-level import to keep namespace (per repo Python guidelines).
E.g., import the module and reference mnnvl_utils.HelixCpMnnvlMemory / mnnvl_utils.MnnvlMemory.


1192-1243: Helix NCCL-vs-FIFO switch looks structurally correct; add a small safety check around workspace init.
Consider asserting that HelixAllToAllNative.mapping matches self.mapping (or re-init) before calling alltoall_native, so a prior initialize from a different model/config doesn’t get reused accidentally.
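
One possible shape for that guard, as a sketch; HelixAllToAllNative and its mapping/workspace attributes come from the change summary, while the method and stand-in workspace below are assumptions.

```python
# Illustrative sketch of the suggested guard; the class name and its
# mapping/workspace attributes come from the change summary, the rest is assumed.
class HelixAllToAllNative:
    mapping = None
    workspace = None

    @classmethod
    def ensure_workspace(cls, mapping):
        """Initialize the workspace once; fail loudly if the mapping changed."""
        if cls.workspace is None:
            cls.mapping = mapping
            cls.workspace = object()  # stand-in for the real MNNVL workspace
        elif cls.mapping != mapping:
            raise RuntimeError(
                "HelixAllToAllNative workspace was initialized for a different "
                "mapping; re-initialize it before calling alltoall_native.")
        return cls.workspace
```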

tests/unittest/_torch/modules/test_mla_helix.py (1)

857-862: Ruff B905: avoid zip() here or suppress in a Py3.8-compatible way.
Since repo supports Python 3.8+, zip(strict=...) isn’t available. Consider building the executor.map args without zip(*...) (e.g., pre-construct an iterable of argument-tuples) or add a targeted # noqa: B905 with a short justification.
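
One Python 3.8-compatible way to sidestep B905 is to build the argument tuples explicitly instead of calling zip(strict=...); a generic sketch, not the test's actual code:

```python
# Sketch: construct executor.map arguments without zip(strict=...), which
# only exists on Python 3.10+. The variable names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def run_rank(rank, seed):
    return (rank, seed)

ranks = [0, 1, 2, 3]
seeds = [11, 22, 33, 44]
assert len(ranks) == len(seeds)  # explicit length check replaces strict=True

arg_tuples = [(ranks[i], seeds[i]) for i in range(len(ranks))]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda args: run_rank(*args), arg_tuples))
print(results)
```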

cpp/tensorrt_llm/kernels/helixAllToAll.cu (1)

339-397: Reduce redundant volatile FIFO head/tail stores to lane 0 only.
Currently many senderFifo->head / receiverFifo->tail / senderFifo->tail writes happen from all 32 lanes (same value), creating unnecessary traffic on a hot synchronization path. Prefer if (laneId == 0) { ... } plus __syncwarp() where needed.

Also applies to: 400-469

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a1750c and a877088.

📒 Files selected for processing (9)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/helixAllToAll.h (1 hunks)
  • cpp/tensorrt_llm/thop/alltoallOp.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/alltoallOp.h (1 hunks)
  • tensorrt_llm/_mnnvl_utils.py (4 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/unittest/_torch/modules/test_mla_helix.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/thop/alltoallOp.h
  • tests/unittest/_torch/modules/test_mla_helix.py
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • tensorrt_llm/_torch/modules/attention.py
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
  • tensorrt_llm/_mnnvl_utils.py
**/*.h

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.h: Use a preprocessor guard in C++ header files with the guard name format TRTLLM_ followed by the filename in all caps (e.g., TRTLLM_FOO_BAR_HELLO_H for file FooBarHello.h); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in C++ standard for compilers or implementation

Files:

  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.h
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tests/unittest/_torch/modules/test_mla_helix.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_mnnvl_utils.py
**/*.cu

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

CUDA code must be compiled with a CUDA compiler and includes declarations/definitions with CUDA keywords (__device__, __managed__, __constant__, __global__), device functions, and kernel launching with <<<...>>> syntax

Files:

  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
🧠 Learnings (16)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-22T19:25:45.607Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp:170-179
Timestamp: 2025-09-22T19:25:45.607Z
Learning: In NCCLUserBufferAllocator::getNCCLDevComm(), multimem support is hard-coded to true because multimem is required for this function. The caller is responsible for ensuring multimem is available before calling this function - it should not be called if multimem is not supported.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.

Applied to files:

  • cpp/tensorrt_llm/thop/alltoallOp.h
  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/modules/test_mla_helix.py
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/helixAllToAll.h
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
  • tensorrt_llm/_mnnvl_utils.py
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.

Applied to files:

  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.

Applied to files:

  • cpp/tensorrt_llm/kernels/helixAllToAll.cu
  • cpp/tensorrt_llm/thop/alltoallOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • cpp/tensorrt_llm/thop/alltoallOp.cpp
  • tensorrt_llm/_mnnvl_utils.py
🧬 Code graph analysis (4)
tests/unittest/_torch/modules/test_mla_helix.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • get (129-147)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • world_size (45-46)
cpp/tensorrt_llm/common/envUtils.h (1)
cpp/tensorrt_llm/common/envUtils.cpp (2)
  • getEnvUseNcclForHelix (534-538)
  • getEnvUseNcclForHelix (534-534)
cpp/tensorrt_llm/kernels/helixAllToAll.h (2)
cpp/tensorrt_llm/thop/alltoallOp.h (5)
  • tensorrt_llm (26-186)
  • kernels (28-142)
  • FifoInfo (67-71)
  • ceil_div (103-106)
  • computeHelixWorkspaceSizePerRank (150-166)
cpp/tensorrt_llm/kernels/helixAllToAll.cu (4)
  • initializeHelixWorkspace (591-605)
  • initializeHelixWorkspace (591-591)
  • launchHelixAllToAll (575-585)
  • launchHelixAllToAll (575-575)
cpp/tensorrt_llm/thop/alltoallOp.cpp (3)
cpp/tensorrt_llm/kernels/helixAllToAll.cu (2)
  • initializeHelixWorkspace (591-605)
  • initializeHelixWorkspace (591-591)
cpp/tensorrt_llm/kernels/helixAllToAll.h (1)
  • computeHelixWorkspaceSizePerRank (127-143)
cpp/tensorrt_llm/thop/alltoallOp.h (1)
  • computeHelixWorkspaceSizePerRank (150-166)
🪛 Ruff (0.14.8)
tests/unittest/_torch/modules/test_mla_helix.py

860-860: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/_torch/modules/attention.py

177-177: Unused static method argument: is_fused

(ARG004)

tensorrt_llm/_mnnvl_utils.py

354-354: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


355-355: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


456-456: Loop control variable i not used within loop body

(B007)


460-462: Avoid specifying long messages outside the exception class

(TRY003)


466-466: Loop control variable i not used within loop body

(B007)


466-466: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


482-482: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/common/envUtils.h (1)

159-161: Env accessor addition looks consistent with existing envUtils API surface.

cpp/tensorrt_llm/common/envUtils.cpp (1)

534-538: Confirm “read-once” caching semantics are intended for TRTLLM_USE_NCCL_FOR_HELIX.
If tests/tools expect flipping the env var within the same long-lived process, this accessor won’t reflect changes after first call.
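
The concern, illustrated with a Python analogue of a read-once accessor (the real accessor lives in envUtils.cpp; this only models the caching behavior, and treating "1" as true is an assumption):

```python
# Python model of "read once and cache": later changes to the env var in the
# same process are not observed. This mirrors the concern above; it is not
# the actual C++ implementation.
import os
from functools import lru_cache

@lru_cache(maxsize=1)
def use_nccl_for_helix() -> bool:
    return os.environ.get("TRTLLM_USE_NCCL_FOR_HELIX", "0") == "1"

os.environ["TRTLLM_USE_NCCL_FOR_HELIX"] = "1"
print(use_nccl_for_helix())   # True: the first call caches the value
os.environ["TRTLLM_USE_NCCL_FOR_HELIX"] = "0"
print(use_nccl_for_helix())   # still True: the flip is not observed
```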

tests/unittest/_torch/modules/test_mla_helix.py (1)

841-873: Env-var set/restore is robust (try/finally) and mode coverage is a nice improvement.
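
For reference, the set/restore pattern being praised looks roughly like the following generic sketch (not the test's exact code):

```python
# Generic sketch of an env-var set/restore pattern; the test's real code
# may differ in details.
import os

def run_with_env(name: str, value: str, fn):
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        return fn()
    finally:
        # Restore the prior state even if fn() raises.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous

for label, value in (("NCCL", "1"), ("FIFO", "0")):
    run_with_env("TRTLLM_USE_NCCL_FOR_HELIX", value,
                 lambda: print(f"running helix test in {label} mode"))
```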


brb-nv commented Dec 14, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #28272 [ run ] triggered by Bot. Commit: a877088

@brb-nv force-pushed the user/brb/native-alltoall-redo branch from a877088 to 1954e0e on December 15, 2025 00:13

brb-nv commented Dec 15, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #28276 [ run ] triggered by Bot. Commit: 1954e0e

@tensorrt-cicd

PR_Github #28276 [ run ] completed with state FAILURE. Commit: 1954e0e
/LLM/main/L0_MergeRequest_PR pipeline #21627 completed with status: 'FAILURE'

@brb-nv force-pushed the user/brb/native-alltoall-redo branch from 1954e0e to f01c674 on December 21, 2025 00:05

brb-nv commented Dec 21, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29256 [ run ] triggered by Bot. Commit: 812770c

@brb-nv force-pushed the user/brb/native-alltoall-redo branch from 10bc185 to 9e2c6ab on December 21, 2025 00:57
@tensorrt-cicd

PR_Github #29482 [ run ] triggered by Bot. Commit: e0c8fb5

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29491 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29491 [ run ] completed with state SUCCESS. Commit: 6cfcae5
/LLM/main/L0_MergeRequest_PR pipeline #22672 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29544 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29544 [ run ] completed with state FAILURE. Commit: 6cfcae5
/LLM/main/L0_MergeRequest_PR pipeline #22718 completed with status: 'ABORTED'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29558 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29558 [ run ] completed with state SUCCESS. Commit: 6cfcae5
/LLM/main/L0_MergeRequest_PR pipeline #22729 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@brb-nv enabled auto-merge (squash) December 23, 2025 16:47
@tensorrt-cicd

PR_Github #29643 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29643 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 PM PST on 12/23.


brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29652 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29652 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 PM PST on 12/23.


brb-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29657 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29657 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 PM PST on 12/23.


brb-nv commented Dec 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #29669 [ run ] triggered by Bot. Commit: 6cfcae5

@tensorrt-cicd

PR_Github #29669 [ run ] completed with state SUCCESS. Commit: 6cfcae5
/LLM/main/L0_MergeRequest_PR pipeline #22787 completed with status: 'SUCCESS'

@brb-nv merged commit 8c1cfc8 into NVIDIA:main Dec 24, 2025
5 checks passed
lucifer1004 pushed a commit to lucifer1004/TensorRT-LLM that referenced this pull request Dec 25, 2025
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
JunyiXu-nv pushed a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Dec 30, 2025
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>