Generate unique id for tensor storage object by observing the weak pointer of tensor storage object #154859
shengfukevin wants to merge 1 commit into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154859
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 1 Pending, 1 Unrelated Failure as of commit c3d445f with merge base 2908c10. UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D75749065
@pytorchbot label "topic: not user facing"
@eellison To restore the deleter function of at::DataPtr when ET is disabled, I have to save a pointer to the at::DataPtr. Since it is a unique pointer, holding a raw pointer to it does not seem safe to me. What is your suggestion?
eellison left a comment:
After looking around the codebase a bit - you should use getWeakStorageImpl instead.
Force-pushed: 0b1e16b to 48dffe6
Force-pushed: 48dffe6 to fac836d
Force-pushed: fac836d to 31fb0a7
Force-pushed: 31fb0a7 to 84ce32d
Force-pushed: 84ce32d to 5a4bd72
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2). Details for Dev Infra team: raised by workflow job.
@pytorchmergebot merge main
❌ 🤖 pytorchbot command failed: Try
@pytorchbot rebase main
❌ 🤖 pytorchbot command failed: Try
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…inter of tensor storage object (pytorch#154859)

Summary: PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate and free the tensors.

However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of a tensor storage object to a unique id. When a new memory address is found, it is added to that hash table and associated with a new id. This implementation does not guarantee that the storage object id is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.

This diff observes the weak pointer of the tensor storage object in ET. When a tensor storage object is deleted, its weak pointer expires. ET saves a map from the raw data pointer to the weak pointer of the tensor storage object. If a memory address is reused, the weak pointer in the map will have expired, so a new id is generated for it and the map is updated with the weak pointer to the new tensor storage object.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Reviewed By: eellison, ngimel

Differential Revision: D75749065
Successfully rebased
Force-pushed: fb07d41 to c3d445f
@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit 6707dc8. Reverted #168297 on behalf of https://github.com/yangw-dev because this seems to break the trunk: ##[error]Process completed with exit code 2. ([comment](#168297 (comment)))
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate and free the tensors.
However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of a tensor storage object to a unique id. When a new memory address is found, it is added to that hash table and associated with a new id.
This implementation does not guarantee that the storage object id is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Differential Revision: D75749065