[AOTI] Fix a GPU memory leak caused by reference circle #168063
desertfire wants to merge 3 commits into gh/desertfire/608/base
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168063
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit ca9aa6a with merge base ded9bcd.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
ghstack-source-id: e50a1ec
Pull-Request: #168063
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "Internal test breaks, contacted author to revert it and fix it test_codegen_int_array_var_fix_memory_leak, self.assertTrue(allocated_memory[1] == allocated_memory[2]) AssertionError: False is not true" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit cdca10b. Reverted #168063 on behalf of https://github.com/yangw-dev due to: Internal test breaks, contacted author to revert it and fix it. test_codegen_int_array_var_fix_memory_leak, self.assertTrue(allocated_memory[1] == allocated_memory[2]) AssertionError: False is not true ([comment](#168063 (comment)))
@desertfire your PR has been successfully reverted.
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben
[ghstack-poisoned]
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
ghstack-source-id: a00a8f2
Pull-Request: #168063
@pytorchbot merge
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Successfully rebased
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
ghstack-source-id: d6b1de0
Pull-Request: #168063
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1). Details for Dev Infra team: raised by workflow job
@pytorchbot merge -i
Merge started: Your change will be merged while ignoring the following 3 checks: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
Pull Request resolved: #168063
Approved by: https://github.com/yushangdi
Stack from ghstack (oldest at bottom):
Summary: Fix #167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which makes GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
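For readers unfamiliar with this failure mode, here is a minimal, torch-free sketch of how a long-lived cache that uses an object as part of its key keeps that object alive, and how keying on `id(obj)` avoids the strong reference. It is not the code from this PR: `Graph`, `codegen_keyed_on_object`, `codegen_keyed_on_id`, and `graph_survives` are hypothetical names made up for illustration, and a `bytearray` stands in for the constant tensors.

```python
import gc
import weakref


class Graph:
    """Hypothetical stand-in for a graph object that owns large constants."""

    def __init__(self, name):
        self.name = name
        self.constants = bytearray(1024)  # placeholder for constant tensors


# Long-lived caches, similar in lifetime to a module- or class-level memo.
_cache_by_object = {}
_cache_by_id = {}


def codegen_keyed_on_object(graph, values):
    # The graph itself is part of the key, so the cache holds a strong
    # reference to it for as long as the entry exists.
    key = (graph, values)
    if key not in _cache_by_object:
        _cache_by_object[key] = f"int_array_{len(_cache_by_object)}"
    return _cache_by_object[key]


def codegen_keyed_on_id(graph, values):
    # Only an integer derived from the graph is stored in the key, so the
    # cache does not keep the graph alive.
    key = (id(graph), values)
    if key not in _cache_by_id:
        _cache_by_id[key] = f"int_array_{len(_cache_by_id)}"
    return _cache_by_id[key]


def graph_survives(codegen):
    """Return True if the graph is still alive after its last user reference is dropped."""
    graph = Graph("g")
    probe = weakref.ref(graph)  # lets us observe the graph without owning it
    codegen(graph, (1, 2, 3))
    del graph
    gc.collect()
    return probe() is not None


print(graph_survives(codegen_keyed_on_object))
print(graph_survives(codegen_keyed_on_id))
```

One trade-off of the id-based key, stated as a general Python fact rather than a claim about this PR: `id()` values can be reused once an object is freed, so an id-keyed cache needs to be cleared or otherwise scoped to the object's lifetime to avoid returning a stale entry for a different object.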
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78