ci: Specify MPI implementation to mpich #2182
Conversation
> Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

**Walkthrough**

Across all CUDA-version Dockerfiles (cu126, cu128, cu129, cu130), mpich is now installed alongside mpi4py in the py312 conda environment. The change is applied consistently to both the standard and development-variant Dockerfiles, expanding MPI backend availability.

**Estimated code review effort:** 🎯 1 (Trivial) | ⏱️ ~3 minutes

This is a repetitive configuration change applied uniformly across 8 Dockerfiles with identical patterns. Each modification simply adds mpich to an existing conda install line, with no changes to logic, build flow, or behavior.

**Pre-merge checks:** ✅ 3 passed
**Summary of Changes**

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request standardizes the MPI environment within the project's Docker images by explicitly specifying `mpich` as the MPI implementation when installing `mpi4py`.
Code Review
This pull request consistently specifies mpich as the MPI implementation when installing mpi4py across all Dockerfiles. This is a good practice for ensuring reproducible environments. My main feedback is to also clean the conda cache after installation to optimize the Docker image sizes. I've added specific suggestions for this on each of the changed lines.
Additionally, I've noticed significant duplication across the various Dockerfiles (cu126, cu128, etc., and their .dev variants). While a full refactor is outside the scope of this PR, you might consider consolidating them in the future using a single base Dockerfile with build arguments to handle the CUDA version differences. This would greatly improve maintainability.
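The cache-cleanup suggestion might look roughly like the following sketch (the exact `conda clean` invocation and layer layout are assumptions, not taken from the PR diff):

```dockerfile
# Pin the MPI implementation to MPICH and clean the conda package cache
# in the same layer, so the cached tarballs never land in the image.
RUN conda install -n py312 -y mpi4py mpich && \
    conda clean --all --yes
```

Combining the install and the clean in one `RUN` matters: a cleanup in a later layer would not shrink the image, since the cache would already be baked into the earlier layer.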
yzh119
left a comment
LGTM overall.
In the long term, I wonder whether we should consider installing MPI via apt-get, as sglang does: https://github.com/sgl-project/sglang/blob/cee93a6f26023d978b5187725bcb3c15ba604343/docker/Dockerfile#L474
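For reference, the apt-get route could look roughly like the sketch below. The package names are assumptions based on Debian/Ubuntu's MPICH packaging, not taken from the sglang Dockerfile:

```dockerfile
# Hypothetical alternative: install MPICH from the distro instead of conda,
# then build mpi4py against it with pip.
RUN apt-get update && \
    apt-get install -y --no-install-recommends mpich libmpich-dev && \
    rm -rf /var/lib/apt/lists/*
RUN pip install mpi4py
```

A distro-managed MPI avoids the conda solver silently switching implementations between builds, at the cost of tying the MPI version to the base image.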
## 📌 Description

We currently have unit tests failing as:

```
==========================================
Running: pytest --continue-on-collection-errors -s --junitxml=/junit/tests/comm/test_trtllm_mnnvl_allreduce.py.xml "tests/comm/test_trtllm_mnnvl_allreduce.py"
==========================================
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090447
: system msg for write_line failure : Bad file descriptor
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
...
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, markupsafe._speedups, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, mpi4py.MPI (total: 22)
!!!!!!! Segfault encountered !!!!!!!
...
❌ FAILED: tests/comm/test_trtllm_mnnvl_allreduce.py
```

These tests should be skipped in a single-GPU environment but are instead failing, which indicates that they fail at MPI module load time.

The current `dockerfile.cuXXX` installs MPI via `RUN conda install -n py312 -y mpi4py`. Inspecting the Docker build logs, [a month ago (Nov. 4)](https://github.com/flashinfer-ai/flashinfer/actions/runs/19084098717/job/54520197904#step:6:802) the following was being installed:

```
flashinfer-ai#17 13.68   mpi-1.0.1    |             mpich           6 KB  conda-forge
flashinfer-ai#17 13.68   mpi4py-4.1.1 | py312hd0af0b3_100         866 KB  conda-forge
flashinfer-ai#17 13.68   mpich-4.3.2  |      h79b1c89_100         5.4 MB  conda-forge
```

[but yesterday](https://github.com/flashinfer-ai/flashinfer/actions/runs/19960576464/job/57239792717#step:6:673):

```
flashinfer-ai#17 13.59   impi_rt-2021.13.1 |    ha770c72_769      41.7 MB  conda-forge
flashinfer-ai#17 13.59   mpi-1.0           |            impi          6 KB  conda-forge
flashinfer-ai#17 13.59   mpi4py-4.1.1      | py312h18f78f0_102      864 KB  conda-forge
```

is being installed. `mpich` and `impi` are two implementations of MPI: MPICH and Intel MPI, respectively. This switch is the suspected cause of the MPI load failures.

This PR pins the implementation via `RUN conda install -n py312 -y mpi4py mpich`. The resulting build installs ([build log](https://github.com/flashinfer-ai/flashinfer/actions/runs/19976372640/job/57293423165?pr=2182#step:6:436)):

```
flashinfer-ai#15 14.63   mpi-1.0.1    |             mpich           6 KB  conda-forge
flashinfer-ai#15 14.63   mpi4py-4.1.1 | py312hd0af0b3_102         865 KB  conda-forge
flashinfer-ai#15 14.63   mpich-4.3.2  |      h79b1c89_100         5.4 MB  conda-forge
```

which matches what we had before.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes
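A quick way to confirm which implementation an environment actually picked up is to inspect `MPI.Get_library_version()` from mpi4py. The helper below is hypothetical (not part of this PR) and just classifies that version string:

```python
def mpi_backend_name(version_string: str) -> str:
    """Classify an MPI library version string as MPICH, Intel MPI, or unknown."""
    s = version_string.lower()
    # Intel MPI version strings mention "Intel(R) MPI Library".
    if "intel" in s:
        return "Intel MPI"
    # MPICH version strings begin with "MPICH Version:".
    if "mpich" in s:
        return "MPICH"
    return "unknown"

# Inside the built image one would run:
#   from mpi4py import MPI
#   print(mpi_backend_name(MPI.Get_library_version()))
```

This gives a one-line smoke test for the Dockerfile change without needing a multi-GPU setup.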