[xpu][ut] Fix XPU CI failures #176057

Closed
guangyey wants to merge 8 commits into gh/guangyey/293/base from gh/guangyey/293/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Feb 28, 2026

Stack from ghstack (oldest at bottom):

Motivation

This PR aims to fix the CI failure on XPU.

is_parallel = self.use_process_pool()

Additional Context

fix #173473
fix #173344
fix #173916
fix #110040

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

@pytorch-bot

pytorch-bot Bot commented Feb 28, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176057

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 1ede193 with merge base 5a6d6b3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

guangyey added a commit that referenced this pull request Feb 28, 2026
ghstack-source-id: 33a37c8
Pull Request resolved: #176057
guangyey added a commit that referenced this pull request Feb 28, 2026
ghstack-source-id: d47d41a
Pull Request resolved: #176057
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Feb 28, 2026
@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 28, 2026
[ghstack-poisoned]
[ghstack-poisoned]
Contributor

@jansel jansel left a comment

Failing tests?

@guangyey
Collaborator Author

guangyey commented Mar 2, 2026

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot pushed a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: d47d41a
Pull Request resolved: #176057
@pytorchmergebot
Collaborator

Tried to rebase and push PR #176057, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@guangyey
Collaborator Author

guangyey commented Mar 2, 2026

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/main gh/guangyey/293/orig returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/inductor/test_select_algorithm.py
CONFLICT (content): Merge conflict in test/inductor/test_select_algorithm.py
error: could not apply bd0cb0b6210... [xpu][ut] Fix XPU CI failures
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply bd0cb0b6210... # [xpu][ut] Fix XPU CI failures

Raised by https://github.com/pytorch/pytorch/actions/runs/22558967383

guangyey added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: cef8192
Pull Request resolved: #176057
guangyey added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: 9a84ee4
Pull Request resolved: #176057
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

guangyey commented Mar 2, 2026

The failure is unrelated to this PR. According to the log at https://github.com/pytorch/pytorch/actions/runs/22565777807/job/65406923173?pr=176057#step:14:1414, the fixed UT test_triton_extension_backend::test_codegen_with_custom_heuristics_module passed.

@guangyey
Collaborator Author

guangyey commented Mar 2, 2026

@pytorchbot drci

@guangyey guangyey requested a review from jansel March 2, 2026 15:35
# can be resolved when the compiled kernel is unpickled from the
# compile subprocess back into the parent process.
if path_to_ext_heuristics not in sys.path:
    sys.path.append(path_to_ext_heuristics)
Contributor

If we mutate sys.path, we need to clean up the change after the test finishes. It also seems like the "custom imports" code below is now redundant.

Collaborator Author

Done.
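
One way to satisfy the cleanup request above is to register the removal with `unittest`'s `addCleanup`, so that `sys.path` is restored even when the test body fails. This is only a minimal sketch, not the change that landed in this PR: the class name and the directory value are hypothetical stand-ins, and the real test computes `path_to_ext_heuristics` itself.

```python
import sys
import unittest


class ExtHeuristicsPathTest(unittest.TestCase):
    # Hypothetical directory used only for illustration.
    path_to_ext_heuristics = "/tmp/hypothetical_ext_heuristics_dir"

    def setUp(self):
        if self.path_to_ext_heuristics not in sys.path:
            sys.path.append(self.path_to_ext_heuristics)
            # Cleanups registered here run after the test, pass or fail.
            self.addCleanup(sys.path.remove, self.path_to_ext_heuristics)

    def test_path_is_visible(self):
        # While the test runs, the directory is on sys.path.
        self.assertIn(self.path_to_ext_heuristics, sys.path)
```

Registering the cleanup only when the path was actually appended avoids removing an entry that some other code added earlier.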

guangyey added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: ced7317
Pull Request resolved: #176057
@guangyey guangyey requested a review from jansel March 4, 2026 01:59
Contributor

@jansel jansel left a comment

Lint and test failures?

guangyey added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: b55f949
Pull Request resolved: #176057
guangyey added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: f19d699
Pull Request resolved: #176057
guangyey added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: 45e05c8
Pull Request resolved: #176057
[ghstack-poisoned]
@guangyey guangyey requested a review from jansel March 4, 2026 11:12
[ghstack-poisoned]
@guangyey
Collaborator Author

guangyey commented Mar 4, 2026

Hi @jansel, all XPU UTs passed. The current failure, "Build left local git repository checkout dirty", is unrelated to XPU functionality. From the CI log:

  • All tests on this node have passed successfully, https://hud.pytorch.org/pr/pytorch/pytorch/176057#65683576662
  • The job failed during the workflow check assert_git_not_dirty
    function assert_git_not_dirty() {
        # TODO: we should add an option to `build_amd.py` that reverts the repo to
        # an unmodified state.
        if [[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *xla* ]] ; then
            git_status=$(git status --porcelain | grep -v '?? third_party' || true)
            if [[ $git_status ]]; then
                echo "Build left local git repository checkout dirty"
                echo "git status --porcelain:"
                echo "${git_status}"
                exit 1
            fi
        fi
    }

The failure occurs because when assert_git_not_dirty runs, it detects a newly created file in the working directory.
It is likely that this file is generated by two tests that set the environment variable TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1, which enables dumping Inductor launch parameters to a file:

We are currently investigating why this issue started occurring recently, as it did not fail previously. Once we narrow down the root cause, we will file a separate PR to address it.
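
To make the check above easier to reason about, here is a rough Python equivalent of its filtering step. It is a sketch for illustration only: the function name `dirty_entries` and the sample file name are hypothetical, and it operates on text that would come from `git status --porcelain` rather than invoking git.

```python
def dirty_entries(porcelain_output: str) -> list[str]:
    """Return the status lines that would make the check fail.

    Mirrors `grep -v '?? third_party'`: every line containing that
    literal text is ignored, and everything else counts as dirty.
    """
    return [
        line
        for line in porcelain_output.splitlines()
        if line and "?? third_party" not in line
    ]


# Hypothetical porcelain output: one ignored entry and one untracked
# file left behind by a test run.
sample = "?? third_party/some_dep\n?? hypothetical_launch_params.json\n"
leftovers = dirty_entries(sample)
```

Any non-empty result corresponds to the `exit 1` branch, so a single untracked file written into the checkout by a test is enough to fail the job even when every test passed.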

guangyey added 2 commits March 4, 2026 15:27
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

guangyey commented Mar 6, 2026

Hi @jansel, just wanted to check if you have any additional comments or concerns.

@guangyey
Collaborator Author

guangyey commented Mar 6, 2026

Thanks!
@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
# Motivation
This PR aims to fix the CI failure on XPU.
- `test_mm_plus_mm3` seems to only fail on CUDA. Already fixed in pytorch#175569
- `test_codegen_with_custom_heuristics_module` fails on XPU with `ModuleNotFoundError: No module named 'extension_triton_heuristics'`. On CUDA CI, `is_parallel` is `False`, while on XPU CI it is `True` due to a race condition. Adding the path to `sys.path` in the parent process lets the `ExtensionCachingAutotuner` class be resolved there, which makes the UT robust regardless of the value of `is_parallel`.
- Skip `test_circular_dependencies` because it is flaky on XPU, see pytorch#110040

https://github.com/pytorch/pytorch/blob/a88bb129e9d9e7572bc3a830ad5d148d74a63c48/torch/_inductor/async_compile.py#L385
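
The `ModuleNotFoundError` described above can be reproduced without Inductor. The sketch below simulates both sides in a single process, under stated assumptions: `hypothetical_ext_heuristics` and `HypotheticalAutotuner` are stand-ins for the real module and class, and removing the module from `sys.modules` plays the role of a parent process that never imported it.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "hypothetical_ext_heuristics"  # stand-in, not the real module


def roundtrip_demo() -> tuple[str, str]:
    with tempfile.TemporaryDirectory() as ext_dir:
        Path(ext_dir, f"{MODULE_NAME}.py").write_text(
            "class HypotheticalAutotuner:\n    pass\n"
        )

        # "Subprocess" side: the module is importable, so pickling works.
        sys.path.append(ext_dir)
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.HypotheticalAutotuner())

        # "Parent" side without the fix: the directory is not on sys.path
        # and the module was never imported there.
        sys.path.remove(ext_dir)
        del sys.modules[MODULE_NAME]
        try:
            pickle.loads(payload)
            without_fix = "ok"
        except ModuleNotFoundError:
            without_fix = "ModuleNotFoundError"

        # "Parent" side with the fix: add the directory, then clean up.
        sys.path.append(ext_dir)
        try:
            importlib.invalidate_caches()
            with_fix = type(pickle.loads(payload)).__name__
        finally:
            sys.path.remove(ext_dir)
            sys.modules.pop(MODULE_NAME, None)
    return without_fix, with_fix
```

Pickle stores only the module and class names, so the unpickling side has to import the module by name. That is why the path has to be visible in the parent process whenever compilation runs in a subprocess.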

# Additional Context
fix pytorch#173473
fix pytorch#173344
fix pytorch#173916
fix pytorch#110040

Pull Request resolved: pytorch#176057
Approved by: https://github.com/jansel
@github-actions github-actions Bot deleted the gh/guangyey/293/head branch April 6, 2026 02:23

4 participants