[CI] Add third shard to pull/linux-jammy-py3.10-gcc11 distributed CPU tests#177211

Closed
huydhn wants to merge 1 commit intopytorch:mainfrom
huydhn:add-distributed-shard-pull-ci

Conversation

@huydhn
Contributor

@huydhn huydhn commented Mar 11, 2026

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
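As an illustration of why adding a shard lowers per-shard time (a sketch, not PyTorch's actual CI sharding code), a common approach is a greedy longest-processing-time partition of tests by recorded duration, assigning each test to the currently lightest shard:

```python
import heapq

def partition_tests(durations, num_shards):
    """Greedily assign tests to the currently lightest shard
    (longest-processing-time heuristic), so the maximum per-shard
    time drops as num_shards grows."""
    # Heap of (total_time, shard_index); all shards start empty.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    # Place the longest tests first for a tighter balance.
    for name, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + dur, idx))
    return shards

# Hypothetical test names and durations (minutes), for illustration only.
durations = {"test_ddp": 90, "test_fsdp": 80, "test_rpc": 60,
             "test_pipe": 50, "test_store": 40, "test_pg": 30}
print([sorted(s) for s in partition_tests(durations, 3)])
```

With 2 shards the heaviest shard here would carry 180 minutes of tests; with 3 shards no shard exceeds 120, which is the kind of reduction the PR is after.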

… tests

Increase distributed test shards from 2 to 3 to reduce per-shard test
time and improve CI latency.

Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn huydhn requested a review from malfet March 11, 2026 22:40
@huydhn huydhn requested a review from a team as a code owner March 11, 2026 22:40
@huydhn huydhn added the no-runner-experiments Bypass Meta/LF runner determinator label Mar 11, 2026
@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177211

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0e08161 with merge base ad67e7a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 11, 2026
@malfet
Contributor

malfet commented Mar 11, 2026

@pytorchbot merge -f "Would not hurt"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures; this allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused a regression in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links, and read the errors, and then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude bot commented Mar 12, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 1m 50s (View job)


Diagnostics: vllm-test Failure Attribution for PR #177211

  • Read revert comment and identify failing CI job
  • Fetch CI failure logs and diagnose root cause
  • Determine if revert is legitimate and provide guidance

Failing Job

Job: vllm-x-pytorch-test / test (vllm_multi_model_test_28_failure_test, 1, 1, linux.g6.4xlarge.experimental.nvidia.gpu) (View job)

Failed Tests (2 failures, 44 passed, 8 skipped)

  1. test_reranker_text[half-nvidia/llama-nemotron-rerank-vl-1b-v2]
  2. test_reranker_image_doc[half-nvidia/llama-nemotron-rerank-vl-1b-v2]

Root Cause

Both failures have the same root cause — a vLLM model weight loading error for the nvidia/llama-nemotron-rerank-vl-1b-v2 model:

RuntimeError: Cannot find any model weights with
`/mnt/hf_cache/hub/models--nvidia--llama-nemotron-rerank-vl-1b-v2/snapshots/0dff00ba958ae46b0fb9c931fa2a4e3bf308404b`

The vLLM EngineCore fails to start because it cannot locate the cached model weights on disk. This happens in vllm/model_executor/model_loader/default_loader.py:179 (_prepare_weights), deep inside the vLLM model loading stack — not in any PyTorch code.

Is This a Legitimate Revert?

No — this is almost certainly a false attribution. Here's why:

  1. PR [CI] Add third shard to pull/linux-jammy-py3.10-gcc11 distributed CPU tests #177211 only changed .github/workflows/pull.yml (adding a third shard to distributed CPU tests). It modified zero PyTorch source code, zero test files, and nothing related to vLLM, model loading, or GPU tests.

  2. The failure is in a vllm_multi_model_test_28_failure_test job — note the 28_failure_test suffix, which indicates this is a test suite tracking known failures from a specific vLLM release (2.8). These tests are expected to be flaky or broken.

  3. The error is an infrastructure/cache issue: the HuggingFace model cache at /mnt/hf_cache/ doesn't have the expected model snapshot. This is a CI environment caching problem, not a code regression.

  4. The test was retried 3 times (via pytest-rerunfailures with --reruns 2) and failed consistently, pointing to a persistent cache/infra issue rather than flakiness introduced by code changes.
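The retry semantics in point 4 can be mimicked in a short sketch (not pytest-rerunfailures itself): a test is only reported as failed once it has failed the initial attempt plus every rerun, which is why consistent failure across all three attempts points to a persistent environment problem rather than flakiness:

```python
def run_with_reruns(test_fn, reruns=2):
    """Mimic pytest-rerunfailures' --reruns N: retry a failing test
    up to N extra times; report failure only if every attempt fails."""
    for attempt in range(reruns + 1):
        try:
            test_fn()
            return ("passed", attempt)  # attempt index of first success
        except AssertionError:
            continue
    return ("failed", reruns)

def missing_cache_test():
    # Persistent infra failure: fails identically on every attempt.
    raise AssertionError("Cannot find any model weights")

print(run_with_reruns(missing_cache_test))  # prints ('failed', 2)
```

A genuinely flaky test would pass on one of the reruns; these two tests did not.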

Guidance for the Author

@huydhn No code fix is needed. The revert was a false positive by the autorevert bot. The vLLM nvidia/llama-nemotron-rerank-vl-1b-v2 model weights were not available in the HF cache on the CI machine. This is an infrastructure issue unrelated to adding a third distributed test shard.

To reland the PR:

  • Simply re-submit the same change. Consider adding the autorevert: disable tag if this test suite continues to produce false attributions.
  • The vLLM cache issue may resolve on its own or may need to be investigated by the CI infra team.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@huydhn your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Mar 12, 2026
…uted CPU tests (#177211)"

This reverts commit 1c18262.

Reverted #177211 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#177211 (comment)))
@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Mar 12, 2026
@malfet
Contributor

malfet commented Mar 12, 2026

@pytorchbot merge -f "Cause and effect is hard for autorevert bot"

@malfet malfet added the autorevert: disable Disable autorevert for a specific PR label Mar 12, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures; this allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
… tests (pytorch#177211)

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
Pull Request resolved: pytorch#177211
Approved by: https://github.com/malfet
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…uted CPU tests (pytorch#177211)"

This reverts commit 1c18262.

Reverted pytorch#177211 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#177211 (comment)))

Labels

- autorevert: disable (Disable autorevert for a specific PR)
- ci-no-td (Do not run TD on this PR)
- Merged
- no-runner-experiments (Bypass Meta/LF runner determinator)
- Reverted
- topic: not user facing (topic category)
