[AMD][gfx1100] test_decompose_mem_bound_mm.py tolerance increase#165625

Closed
k-artem wants to merge 5 commits into pytorch:main from k-artem:gfx1100_fixes

Conversation

@k-artem
Contributor

@k-artem k-artem commented Oct 16, 2025

test_decompose_mem_bound_mm.py tolerance increase for navi3x(gfx11x)
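
For context, a minimal sketch of what raising a tolerance means in practice. It assumes the usual closeness rule documented for `torch.testing.assert_close` and `numpy.isclose`, namely `|actual - expected| <= atol + rtol * |expected|`. The helper name `is_close` and all numbers below are illustrative assumptions, not the values changed by this PR.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Return True when actual is within atol + rtol * |expected| of expected."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative numbers only; they are not taken from the PR.
expected = 100.0
actual = 100.02  # absolute error of roughly 0.02

# Allowed error is about 0.01001, so the comparison fails.
strict = is_close(actual, expected, rtol=1e-4, atol=1e-5)

# Allowed error is about 0.10001, so the same pair now passes.
relaxed = is_close(actual, expected, rtol=1e-3, atol=1e-5)
```

Because `rtol` scales with the magnitude of the expected value while `atol` does not, a test comparing large matmul outputs is typically more sensitive to `rtol`.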

(cherry picked from commit 03c7da0) from

Fixes for CI HUD for gfx1100

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@k-artem k-artem requested a review from a team as a code owner October 16, 2025 09:30
@pytorch-bot

pytorch-bot Bot commented Oct 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165625

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit ccba83d with merge base 1009790 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@k-artem
Contributor Author

k-artem commented Oct 16, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot Bot added the topic: not user facing topic category label Oct 16, 2025
@k-artem
Contributor Author

k-artem commented Oct 16, 2025

@pytorchbot label "release notes: rocm"

@pytorch-bot pytorch-bot Bot added the release notes: rocm mandatorylabel label Oct 16, 2025
@pytorch-bot

pytorch-bot Bot commented Oct 16, 2025

Didn't find following labels among repository labels: label:ciflow/rocm

@k-artem
Contributor Author

k-artem commented Oct 16, 2025

@pytorchbot label "ciflow/rocm"

@pytorch-bot

pytorch-bot Bot commented Oct 16, 2025

To add these label(s) (ciflow/rocm) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

Comment on lines +81 to +85
def setup_tolerance(self, rtol=None, atol=None):
    if rtol is None:
        rtol = self.rtol
    if atol is None:
        atol = self.rtol
Collaborator

How does this work? rtol and atol are local to the function. Modifying them here should not affect call sites like line 88 below, right?
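
The concern can be reproduced with a small sketch. The class `ToleranceHolder` and its default values are hypothetical stand-ins for the test class; only the shape of `setup_tolerance` follows the snippet under review.

```python
class ToleranceHolder:
    """Hypothetical stand-in for the test class; defaults are made up."""

    rtol = 1e-3
    atol = 1e-3

    def setup_tolerance(self, rtol=None, atol=None):
        # Assigning to a parameter rebinds a name local to this call only.
        if rtol is None:
            rtol = self.rtol
        if atol is None:
            atol = self.atol
        # Nothing is returned and nothing is stored on self,
        # so the caller cannot observe the resolved values.


holder = ToleranceHolder()
rtol, atol = None, None
holder.setup_tolerance(rtol, atol)

# The caller's variables are unchanged by the call.
caller_sees = (rtol, atol)
```

For the resolved values to reach a call site, the method would have to return them or write them to `self`.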

Contributor Author

@k-artem k-artem Oct 17, 2025

  1. It works because setattr(self, member, value) inside the decorator sets the correct values.
  2. One more issue found: atol = self.rtol should be atol = self.atol.

I am keeping setup_tolerance (which could actually be removed) in case we need to update the tolerance values at calls to the compare_* functions.
Please let me know whether I should cherry-pick this back to the ROCm fork, or whether we will wait and get it via the sync with upstream.
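
A rough sketch of the mechanism described in this reply, under stated assumptions: the decorator name `with_tolerance`, the class `FakeTest`, and every numeric value are hypothetical, and `setup_tolerance` returns its result here only so the effect can be observed. The actual decorator in the test file may differ.

```python
import functools


def with_tolerance(**overrides):
    """Hypothetical decorator: set attributes on the test instance before the test runs."""

    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(self, *args, **kwargs):
            # setattr writes to the instance, so later reads of
            # self.rtol / self.atol see the overridden values.
            for member, value in overrides.items():
                setattr(self, member, value)
            return test_func(self, *args, **kwargs)

        return wrapper

    return decorator


class FakeTest:
    """Hypothetical stand-in for the test class; defaults are made up."""

    rtol = 1e-3
    atol = 1e-3

    def setup_tolerance(self, rtol=None, atol=None):
        if rtol is None:
            rtol = self.rtol
        if atol is None:
            atol = self.atol  # corrected fallback (was self.rtol)
        return rtol, atol  # returned only for this illustration

    @with_tolerance(rtol=1e-2, atol=5e-2)
    def test_something(self):
        return self.setup_tolerance()


resolved = FakeTest().test_something()
```

With the original `atol = self.rtol` fallback, `resolved` would carry the `rtol` override in both positions, which is the bug noted in item 2.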

@linux-foundation-easycla

linux-foundation-easycla Bot commented Oct 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dnikolaev-amd / name: Dmitry Nikolaev (60a0174)
  • ✅ login: iupaikov-amd / name: iupaikov-amd (e7f2289)
  • ✅ login: jeffdaily / name: Jeff Daily (37c138f, ccba83d)
  • ✅ login: k-artem / name: Artem Kuzmitckii (861ca48)

@k-artem k-artem requested a review from jeffdaily October 17, 2025 12:40
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 17, 2025
@jithunnair-amd
Collaborator

/easycla

@jithunnair-amd
Collaborator

\easycla

pytorchmergebot pushed a commit that referenced this pull request Oct 20, 2025
This should allow us to move gfx1100 workflow to a lower frequency and also allow it to be triggered on PRs via a dedicated label, for any PRs that target Navi fixes such as [this](#165630) or [this](#165625).

Pull Request resolved: #165699
Approved by: https://github.com/jeffdaily
@jithunnair-amd
Collaborator

@pytorchbot merge -f "lint failure unrelated; CLA signed but not reflecting on PR; trying to see if it is updated internally"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@jithunnair-amd
Collaborator

\easycla

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
This should allow us to move gfx1100 workflow to a lower frequency and also allow it to be triggered on PRs via a dedicated label, for any PRs that target Navi fixes such as [this](pytorch#165630) or [this](pytorch#165625).

Pull Request resolved: pytorch#165699
Approved by: https://github.com/jeffdaily
@k-artem
Contributor Author

k-artem commented Oct 21, 2025

\easycla

@molssongroup

/easycla

@jeffdaily
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

iupaikov-amd and others added 3 commits October 21, 2025 18:32
…navi3x(gfx11x)

(cherry picked from commit 03c7da0)
Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
Bug introduced by
ROCm@03c7da0

(cherry picked from commit bbd0112)
Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
Bug introduced by
ROCm@03c7da0

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
@pytorchmergebot
Collaborator

Successfully rebased gfx1100_fixes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout gfx1100_fixes && git pull --rebase)

@jeffdaily jeffdaily added keep-going Don't stop on first failure, keep running tests until the end ciflow/rocm-navi31 Trigger "default" config CI on ROCm Navi31 labels Oct 21, 2025
@pytorch-bot pytorch-bot Bot removed the ciflow/rocm-navi31 Trigger "default" config CI on ROCm Navi31 label Oct 21, 2025
@jeffdaily jeffdaily added the ciflow/rocm-navi31 Trigger "default" config CI on ROCm Navi31 label Oct 21, 2025
@jeffdaily
Collaborator

@pytorchbot merge -f "navi3 test changes only, lint is passing. too many failures still to make sense of the signal"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
This should allow us to move gfx1100 workflow to a lower frequency and also allow it to be triggered on PRs via a dedicated label, for any PRs that target Navi fixes such as [this](pytorch#165630) or [this](pytorch#165625).

Pull Request resolved: pytorch#165699
Approved by: https://github.com/jeffdaily
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
…orch#165625)

test_decompose_mem_bound_mm.py tolerance increase for navi3x(gfx11x)

(cherry picked from commit 03c7da0) from

Fixes for CI HUD for gfx1100

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>

Pull Request resolved: pytorch#165625
Approved by: https://github.com/jeffdaily

Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
@k-artem k-artem deleted the gfx1100_fixes branch January 20, 2026 11:56

Labels

  • ciflow/rocm-navi31: Trigger "default" config CI on ROCm Navi31
  • keep-going: Don't stop on first failure, keep running tests until the end
  • Merged
  • module: inductor
  • open source
  • release notes: rocm (mandatorylabel)
  • topic: not user facing (topic category)
  • triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module

9 participants