CUDACachingHostAllocatorImpl skip event query during capture#164001

Closed
jeffdaily wants to merge 2 commits into pytorch:main from ROCm:CachingHostAllocator_graph_safe

@jeffdaily
Collaborator

The CUDACachingAllocator already does this, so there is precedent.
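To illustrate the idea in the title, here is a toy model in plain Python, not the PyTorch implementation. It assumes the allocator keeps a queue of (event, block) pairs and normally polls the events to decide which host blocks can be reused; `ToyEvent`, `ToyHostAllocator`, and the `capturing` flag are hypothetical names invented for this sketch.

```python
from collections import deque


class ToyEvent:
    """Hypothetical stand-in for a GPU event; counts how often it is queried."""

    def __init__(self, done):
        self.done = done
        self.query_calls = 0

    def query(self):
        self.query_calls += 1
        return self.done


class ToyHostAllocator:
    """Hypothetical model of a caching host allocator's event bookkeeping."""

    def __init__(self):
        self.pending = deque()   # (event, block) pairs, oldest first
        self.free_blocks = []    # blocks considered safe to reuse
        self.capturing = False   # stands in for "a graph capture is underway"

    def record(self, event, block):
        self.pending.append((event, block))

    def process_events(self):
        # During capture, leave every pending event untouched: no queries.
        if self.capturing:
            return 0
        released = 0
        # Outside capture, release blocks in order until an event is not done.
        while self.pending and self.pending[0][0].query():
            _, block = self.pending.popleft()
            self.free_blocks.append(block)
            released += 1
        return released


alloc = ToyHostAllocator()
events = [ToyEvent(done=True), ToyEvent(done=True), ToyEvent(done=False)]
for i, ev in enumerate(events):
    alloc.record(ev, f"block{i}")

alloc.capturing = True
during_capture = alloc.process_events()   # skipped entirely, nothing queried

alloc.capturing = False
after_capture = alloc.process_events()    # the first two events are done
```

In this sketch, blocks whose events completed during capture simply stay pending until the next call made outside capture.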

@pytorch-bot

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164001

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit de3ff2b with merge base 2f85de0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily added the release notes: rocm (mandatorylabel) and release notes: cuda (release notes category) labels on Sep 26, 2025
@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Sep 29, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@yangw-dev
Contributor

@pytorchbot revert -m "failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1
Expected value: [[0.225953385233879]..." -c ghfirst

@pytorch-bot

pytorch-bot bot commented Oct 1, 2025

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string
Try `@pytorchbot --help` for more info.

@yangw-dev
Contributor

@pytorchbot revert -m "failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1.." -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 1, 2025
…164001)"

This reverts commit 4cf2900.

Reverted #164001 on behalf of https://github.com/yangw-dev due to failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1.. ([comment](#164001 (comment)))
@pytorchmergebot
Collaborator

@jeffdaily your PR has been successfully reverted.

@pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels on Oct 1, 2025
@jeffdaily
Collaborator Author

@yangw-dev Can I get any more information than that? How am I supposed to fix this?

@yangw-dev
Contributor

> @yangw-dev Can I get any more information than that? How am I supposed to fix this?

It seems there is an internal test, run_inference_model_predictions, that fails:

Not equal to tolerance rtol=0.1, atol=0.1
mtml_ctr_inline_cvr_mbl_feed_model_standalone_rc3_baseline_360470807/prediction_outbound_click_imp/0
Expected value: [[0.225953385233879]
[0.236223086714745]
[0.227420285344124]
[0.228283658623695]
[0.233226433396339]
[0.228179335594177]
[0.220103234052658]
[0.242186114192009]
[0.230371057987213]
[0.239499777555466]] vs actual value: [[0.311966806650162]
[0.308063328266144]
[0.326430112123489]
[0.307156801223755]
[0.331012278795242]
[0.329995095729828]
[0.343286603689194]
[0.325854539871216]
[0.331635475158691]
[0.324597477912903]].
Mismatched elements: 1 / 10 (10%)

Please reach out to PyTorch folks who have internal access for more details.
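The reported numbers can be checked by hand. This is a minimal sketch, assuming the internal test applies the NumPy-style closeness rule `|actual - expected| <= atol + rtol * |expected|`; the thread does not say which comparison the test actually uses, and the helper name `mismatched` is hypothetical. The values are copied from the failure message above.

```python
expected = [
    0.225953385233879, 0.236223086714745, 0.227420285344124,
    0.228283658623695, 0.233226433396339, 0.228179335594177,
    0.220103234052658, 0.242186114192009, 0.230371057987213,
    0.239499777555466,
]
actual = [
    0.311966806650162, 0.308063328266144, 0.326430112123489,
    0.307156801223755, 0.331012278795242, 0.329995095729828,
    0.343286603689194, 0.325854539871216, 0.331635475158691,
    0.324597477912903,
]


def mismatched(expected, actual, rtol, atol):
    """Return indices where |actual - expected| > atol + rtol * |expected|.

    Assumed rule (NumPy-style); not confirmed by the thread.
    """
    return [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if abs(a - e) > atol + rtol * abs(e)
    ]


bad = mismatched(expected, actual, rtol=0.1, atol=0.1)
print(bad)  # [6]
```

Under that assumption only index 6 exceeds the tolerance, which is consistent with "Mismatched elements: 1 / 10 (10%)", even though every prediction shifted upward by roughly 0.07 to 0.12.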

@jeffdaily
Collaborator Author

@atalman @yangw-dev Was the error transient, or was it definitively root-caused to this PR's changes?

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…ytorch#164001)"

This reverts commit 4cf2900.

Reverted pytorch#164001 on behalf of https://github.com/yangw-dev due to failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1.. ([comment](pytorch#164001 (comment)))
@jeffdaily
Collaborator Author

ping @atalman @yangw-dev

@jeffdaily
Collaborator Author

I suspect this PR will be replaced by #167507.

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label on Feb 3, 2026
@jeffdaily closed this on Mar 4, 2026
Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, release notes: cuda (release notes category), release notes: rocm (mandatorylabel), Reverted, Stale

6 participants