Skip to content

Generate unique id for tensor storage object by observing the week pointer of tensor storage object#154859

Closed
shengfukevin wants to merge 1 commit intopytorch:mainfrom
shengfukevin:export-D75749065
Closed

Generate unique id for tensor storage object by observing the week pointer of tensor storage object#154859
shengfukevin wants to merge 1 commit intopytorch:mainfrom
shengfukevin:export-D75749065

Conversation

@shengfukevin
Copy link
Contributor

@shengfukevin shengfukevin commented Jun 2, 2025

Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes storage id, offset, number of elements, and number of byte for each element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating tensor storage id does not guarantee it is unique. ExecutionTraceObserver maintains a lookup table to map the memory address of the tensor storage object to an unique id. If a new memory address is found, it will be put into that hash table and associate it to a new id.
This implementation does not guarantee the storage object is unique since the memory that the address points to may be released and then re-allocated to a different tensor storage object.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D75749065

@shengfukevin shengfukevin requested a review from sraikund16 as a code owner June 2, 2025 17:50
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154859

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit c3d445f with merge base 2908c10 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75749065

@shengfukevin
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 2, 2025
@shengfukevin shengfukevin requested review from ezyang and ngimel and removed request for sraikund16 June 2, 2025 17:54
@shengfukevin
Copy link
Contributor Author

@eellison To restore the deleter function of at::DataPtr when ET is disabled, I have to save a pointer to at::DataPtr. Since it is a unique pointer, it does not seem safe to me. What is your suggestion?

Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after doing a bit of looking around codebase - you should use getWeakStorageImpl instead.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75749065

Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 6, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75749065

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75749065

@shengfukevin shengfukevin changed the title Provide tensor storage delete function in ET to generate unique id for tensor storage object Generate unique id for tensor storage object by observing the week pointer of tensor storage object Jun 6, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75749065

@shengfukevin
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job

@shengfukevin
Copy link
Contributor Author

@pytorchmergebot merge main

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 9, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: main

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@shengfukevin
Copy link
Contributor Author

@pytorchbot rebase main

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 9, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: main

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@shengfukevin
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

…inter of tensor storage object (pytorch#154859)

Summary:

PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes storage id, offset, number of elements, and number of byte for each element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating tensor storage id does not guarantee it is unique. ExecutionTraceObserver maintains a lookup table to map the memory address of the tensor storage object to an unique id. If a new memory address is found, it will be put into that hash table and associate it to a new id.
This implementation does not guarantee the storage object id is unique since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
This DIFF is to observe the week pointer of tensor storage object in ET. When a tensor storage is deleted, its week pointer will expire. ET saves a map between raw data pointer to the week pointer of the tensor storage object. If the memory address got reused, the week pointer in the map will expire, then the new ID will be generated for it, and the map will be updated with the week pointer to the new tensor storage object.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Reviewed By: eellison, ngimel

Differential Revision: D75749065
@pytorchmergebot
Copy link
Collaborator

Successfully rebased export-D75749065 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout export-D75749065 && git pull --rebase)

@facebook-github-bot
Copy link
Contributor

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@shengfukevin
Copy link
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025
We suspect it's causing intermittent segfaults

Pull Request resolved: #168297
Approved by: https://github.com/malfet
pytorchmergebot added a commit that referenced this pull request Nov 21, 2025
This reverts commit 6707dc8.

Reverted #168297 on behalf of https://github.com/yangw-dev due to this seems breaks the trunk ##[error]Process completed with exit code 2. ([comment](#168297 (comment)))
pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025
We suspect it's causing intermittent segfaults

Pull Request resolved: #168297
Approved by: https://github.com/malfet
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
We suspect it's causing intermittent segfaults

Pull Request resolved: #168297
Approved by: https://github.com/malfet
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This reverts commit 6707dc8.

Reverted #168297 on behalf of https://github.com/yangw-dev due to this seems breaks the trunk ##[error]Process completed with exit code 2. ([comment](#168297 (comment)))
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
We suspect it's causing intermittent segfaults

Pull Request resolved: #168297
Approved by: https://github.com/malfet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants