[CI] Add third shard to pull/linux-jammy-py3.10-gcc11 distributed CPU tests#177211

Closed
huydhn wants to merge 1 commit intopytorch:mainfrom
huydhn:add-distributed-shard-pull-ci

Conversation

@huydhn
Contributor

@huydhn huydhn commented Mar 11, 2026

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
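As an illustration of why adding a shard lowers per-shard time (a sketch, not PyTorch's actual CI sharding code), a common approach is a greedy longest-processing-time partition of tests by recorded duration, assigning each test to the currently lightest shard:

```python
import heapq

def partition_tests(durations, num_shards):
    """Greedily assign tests to the currently lightest shard
    (longest-processing-time heuristic), so the maximum per-shard
    time drops as num_shards grows."""
    # Heap of (total_time, shard_index); all shards start empty.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    # Place the longest tests first for a tighter balance.
    for name, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + dur, idx))
    return shards

# Hypothetical test names and durations (minutes), for illustration only.
durations = {"test_ddp": 90, "test_fsdp": 80, "test_rpc": 60,
             "test_pipe": 50, "test_store": 40, "test_pg": 30}
print([sorted(s) for s in partition_tests(durations, 3)])
```

With 2 shards the heaviest shard here would carry 180 minutes of tests; with 3 shards no shard exceeds 120, which is the kind of reduction the PR is after.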

… tests

Increase distributed test shards from 2 to 3 to reduce per-shard test
time and improve CI latency.

Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn huydhn requested a review from malfet March 11, 2026 22:40
@huydhn huydhn requested a review from a team as a code owner March 11, 2026 22:40
@huydhn huydhn added the no-runner-experiments Bypass Meta/LF runner determinator label Mar 11, 2026
@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177211

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0e08161 with merge base ad67e7a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 11, 2026
@malfet
Contributor

malfet commented Mar 11, 2026

@pytorchbot merge -f "Would not hurt"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures; this allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused a regression in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links, and read the errors, and then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude bot commented Mar 12, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 1m 50s (View job)


Diagnostics: vllm-test Failure Attribution for PR #177211

  • Read revert comment and identify failing CI job
  • Fetch CI failure logs and diagnose root cause
  • Determine if revert is legitimate and provide guidance

Failing Job

Job: vllm-x-pytorch-test / test (vllm_multi_model_test_28_failure_test, 1, 1, linux.g6.4xlarge.experimental.nvidia.gpu) (View job)

Failed Tests (2 failures, 44 passed, 8 skipped)

  1. test_reranker_text[half-nvidia/llama-nemotron-rerank-vl-1b-v2]
  2. test_reranker_image_doc[half-nvidia/llama-nemotron-rerank-vl-1b-v2]

Root Cause

Both failures have the same root cause — a vLLM model weight loading error for the nvidia/llama-nemotron-rerank-vl-1b-v2 model:

RuntimeError: Cannot find any model weights with
`/mnt/hf_cache/hub/models--nvidia--llama-nemotron-rerank-vl-1b-v2/snapshots/0dff00ba958ae46b0fb9c931fa2a4e3bf308404b`

The vLLM EngineCore fails to start because it cannot locate the cached model weights on disk. This happens in vllm/model_executor/model_loader/default_loader.py:179 (_prepare_weights), deep inside the vLLM model loading stack — not in any PyTorch code.

Is This a Legitimate Revert?

No — this is almost certainly a false attribution. Here's why:

  1. PR [CI] Add third shard to pull/linux-jammy-py3.10-gcc11 distributed CPU tests #177211 only changed .github/workflows/pull.yml (adding a third shard to distributed CPU tests). It modified zero PyTorch source code, zero test files, and nothing related to vLLM, model loading, or GPU tests.

  2. The failure is in a vllm_multi_model_test_28_failure_test job — note the 28_failure_test suffix, which indicates this is a test suite tracking known failures from a specific vLLM release (2.8). These tests are expected to be flaky or broken.

  3. The error is an infrastructure/cache issue: the HuggingFace model cache at /mnt/hf_cache/ doesn't have the expected model snapshot. This is a CI environment caching problem, not a code regression.

  4. The test was retried 3 times (via pytest-rerunfailures with --reruns 2) and failed consistently, pointing to a persistent cache/infra issue rather than flakiness introduced by code changes.
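The retry semantics in point 4 can be mimicked in a short sketch (not pytest-rerunfailures itself): a test is only reported as failed once it has failed the initial attempt plus every rerun, which is why consistent failure across all three attempts points to a persistent environment problem rather than flakiness:

```python
def run_with_reruns(test_fn, reruns=2):
    """Mimic pytest-rerunfailures' --reruns N: retry a failing test
    up to N extra times; report failure only if every attempt fails."""
    for attempt in range(reruns + 1):
        try:
            test_fn()
            return ("passed", attempt)  # attempt index of first success
        except AssertionError:
            continue
    return ("failed", reruns)

def missing_cache_test():
    # Persistent infra failure: fails identically on every attempt.
    raise AssertionError("Cannot find any model weights")

print(run_with_reruns(missing_cache_test))  # prints ('failed', 2)
```

A genuinely flaky test would pass on one of the reruns; these two tests did not.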

Guidance for the Author

@huydhn No code fix is needed. The revert was a false positive by the autorevert bot. The vLLM nvidia/llama-nemotron-rerank-vl-1b-v2 model weights were not available in the HF cache on the CI machine. This is an infrastructure issue unrelated to adding a third distributed test shard.

To reland the PR:

  • Simply re-submit the same change. Consider adding the autorevert: disable tag if this test suite continues to produce false attributions.
  • The vLLM cache issue may resolve on its own or may need to be investigated by the CI infra team.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@huydhn your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Mar 12, 2026
…uted CPU tests (#177211)"

This reverts commit 1c18262.

Reverted #177211 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#177211 (comment)))
@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Mar 12, 2026
@malfet
Contributor

malfet commented Mar 12, 2026

@pytorchbot merge -f "Cause and effect is hard for autorevert bot"

@malfet malfet added the autorevert: disable Disable autorevert for a specific PR label Mar 12, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures; this allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
… tests (pytorch#177211)

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
Pull Request resolved: pytorch#177211
Approved by: https://github.com/malfet
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…uted CPU tests (pytorch#177211)"

This reverts commit 1c18262.

Reverted pytorch#177211 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#177211 (comment)))

Labels

- autorevert: disable (Disable autorevert for a specific PR)
- ci-no-td (Do not run TD on this PR)
- Merged
- no-runner-experiments (Bypass Meta/LF runner determinator)
- Reverted
- topic: not user facing (topic category)
