Generate unique id for tensor storage object by observing the week pointer of tensor storage object by shengfukevin · Pull Request #154859 · pytorch/pytorch

shengfukevin · 2025-06-02T17:50:33Z

Summary:
PyTorch execution trace (ET) records tensor storage data in the trace. The tensor storage data includes the storage ID, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate and free tensors.
However, the current implementation does not guarantee that a generated tensor storage ID is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of each tensor storage object to a unique ID. When a new memory address is seen, it is added to that hash table and associated with a new ID.
This does not guarantee that the storage object ID is unique, because the memory at that address may be released and then re-allocated to a different tensor storage object, which would then inherit the old ID.
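
The collision described in the summary can be reproduced in a few lines. This is only a minimal Python sketch, not the C++ code in execution_trace_observer.cpp: the class name `AddressKeyedIds`, the method `get_id`, and the address value are all hypothetical, and address reuse is simulated by passing the same integer twice.

```python
import itertools


class AddressKeyedIds:
    """Hypothetical stand-in for an address -> ID lookup table."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._ids = {}  # memory address -> storage ID

    def get_id(self, address):
        # A previously seen address always returns its old ID, even if the
        # storage that lived there has since been freed.
        if address not in self._ids:
            self._ids[address] = next(self._next_id)
        return self._ids[address]


table = AddressKeyedIds()
ADDR = 0x1000  # made-up address, for illustration only

id_a = table.get_id(ADDR)  # storage A is allocated at ADDR
# ... storage A is freed and the allocator hands ADDR to a new storage B ...
id_b = table.get_id(ADDR)  # storage B is looked up by the same address

same_id = id_a == id_b  # two distinct storages end up sharing one ID
```

Because the table has no way to learn that storage A was freed, a replay tool reading the trace would treat A and B as the same storage.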

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D75749065

pytorch-bot · 2025-06-02T17:50:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154859

⏳ 1 Pending, 1 Unrelated Failure

As of commit c3d445f with merge base 2908c10:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge, unstable) (gh)
exir/backend/test/test_to_backend_multi_method.py::TestToBackendMultiMethod::test_multi_method_end_to_end

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-06-02T17:50:42Z

This pull request was exported from Phabricator. Differential Revision: D75749065

shengfukevin · 2025-06-02T17:54:16Z

@pytorchbot label "topic: not user facing"

shengfukevin · 2025-06-02T18:27:48Z

@eellison To restore the deleter function of at::DataPtr when ET is disabled, I have to save a pointer to at::DataPtr. Since it is a unique pointer, it does not seem safe to me. What is your suggestion?

eellison

after doing a bit of looking around codebase - you should use getWeakStorageImpl instead.

torch/csrc/profiler/standalone/execution_trace_observer.cpp

facebook-github-bot · 2025-06-06T19:23:46Z

This pull request was exported from Phabricator. Differential Revision: D75749065

eellison

nice!

facebook-github-bot · 2025-06-06T21:18:14Z

This pull request was exported from Phabricator. Differential Revision: D75749065

torch/csrc/profiler/standalone/execution_trace_observer.cpp

facebook-github-bot · 2025-06-06T21:32:30Z

This pull request was exported from Phabricator. Differential Revision: D75749065

facebook-github-bot · 2025-06-06T21:50:27Z

This pull request was exported from Phabricator. Differential Revision: D75749065

shengfukevin · 2025-06-07T13:06:02Z

@pytorchbot merge

pytorchmergebot · 2025-06-07T13:07:53Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2025-06-07T13:08:09Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

shengfukevin · 2025-06-09T05:34:43Z

@pytorchmergebot merge main

pytorch-bot · 2025-06-09T05:34:46Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: main

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

shengfukevin · 2025-06-09T05:35:14Z

@pytorchbot rebase main

pytorch-bot · 2025-06-09T05:35:16Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: main

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

shengfukevin · 2025-06-09T05:35:32Z

@pytorchbot rebase

pytorchmergebot · 2025-06-09T05:37:11Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

…inter of tensor storage object (pytorch#154859)

Summary: PyTorch execution trace (ET) records tensor storage data in the trace. The tensor storage data includes the storage ID, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate and free tensors.

However, the current implementation does not guarantee that a generated tensor storage ID is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of each tensor storage object to a unique ID. When a new memory address is seen, it is added to that hash table and associated with a new ID. This does not guarantee that the storage object ID is unique, because the memory at that address may be released and then re-allocated to a different tensor storage object.

This diff observes the weak pointer of the tensor storage object in ET. When a tensor storage is deleted, its weak pointer expires. ET keeps a map from the raw data pointer to the weak pointer of the tensor storage object. If the memory address is reused, the weak pointer stored in the map will have expired, so a new ID is generated and the map is updated with the weak pointer to the new tensor storage object.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Reviewed By: eellison, ngimel

Differential Revision: D75749065
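
The weak-pointer scheme in this commit message can be sketched as follows. This is an illustrative Python model under stated assumptions, not the merged C++ change: `FakeStorage`, `WeakRefKeyedIds`, and `get_id` are hypothetical names, Python's `weakref.ref` stands in for the C++ weak pointer to the storage object, and address reuse is simulated by giving two objects the same `data_ptr` value.

```python
import gc
import itertools
import weakref


class FakeStorage:
    """Hypothetical stand-in for a tensor storage object."""

    def __init__(self, data_ptr):
        self.data_ptr = data_ptr  # simulated raw data address


class WeakRefKeyedIds:
    """Hypothetical table: raw data pointer -> (weak ref to storage, ID)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._entries = {}

    def get_id(self, storage):
        entry = self._entries.get(storage.data_ptr)
        if entry is not None:
            ref, storage_id = entry
            # A live weak reference means the original storage still owns
            # this address, so its ID remains valid.
            if ref() is not None:
                return storage_id
        # Unseen address, or the weak reference expired (address reused):
        # issue a fresh ID and remember the new storage.
        storage_id = next(self._next_id)
        self._entries[storage.data_ptr] = (weakref.ref(storage), storage_id)
        return storage_id


table = WeakRefKeyedIds()
ADDR = 0x1000  # made-up address, for illustration only

a = FakeStorage(ADDR)
id_a = table.get_id(a)
id_a_again = table.get_id(a)  # same live storage, same ID

del a         # drop the only strong reference so the weak reference expires
gc.collect()  # not needed under CPython refcounting, harmless elsewhere

b = FakeStorage(ADDR)  # a new storage reuses the same address
id_b = table.get_id(b)  # expired entry is replaced and a new ID is issued
```

The table never holds a strong reference to a storage, so observing it does not extend the storage's lifetime; it only learns, at lookup time, whether the previous owner of an address is still alive.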

pytorchmergebot · 2025-06-09T05:37:15Z

Successfully rebased export-D75749065 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout export-D75749065 && git pull --rebase)

facebook-github-bot · 2025-06-09T05:40:19Z

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

shengfukevin · 2025-06-09T15:38:57Z

@pytorchmergebot merge

pytorchmergebot · 2025-06-09T15:40:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

We suspect it's causing intermittent segfaults Pull Request resolved: #168297 Approved by: https://github.com/malfet

This reverts commit 6707dc8. Reverted #168297 on behalf of https://github.com/yangw-dev due to this seems breaks the trunk ##[error]Process completed with exit code 2. ([comment](#168297 (comment)))

shengfukevin requested a review from sraikund16 as a code owner June 2, 2025 17:50

facebook-github-bot added the fb-exported label Jun 2, 2025

shengfukevin requested a review from eellison June 2, 2025 17:52

pytorch-bot bot added the topic: not user facing topic category label Jun 2, 2025

shengfukevin requested review from ezyang and ngimel and removed request for sraikund16 June 2, 2025 17:54

shengfukevin mentioned this pull request Jun 2, 2025

Generate unique id for tensor storage object #153921

Closed

eellison reviewed Jun 4, 2025

torch/csrc/profiler/standalone/execution_trace_observer.cpp (outdated)

shengfukevin force-pushed the export-D75749065 branch from 0b1e16b to 48dffe6 Compare June 6, 2025 19:18

shengfukevin force-pushed the export-D75749065 branch from 48dffe6 to fac836d Compare June 6, 2025 19:23

eellison approved these changes Jun 6, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 6, 2025

shengfukevin force-pushed the export-D75749065 branch from fac836d to 31fb0a7 Compare June 6, 2025 21:14

shengfukevin force-pushed the export-D75749065 branch from 31fb0a7 to 84ce32d Compare June 6, 2025 21:18

ngimel reviewed Jun 6, 2025

torch/csrc/profiler/standalone/execution_trace_observer.cpp (outdated)

shengfukevin force-pushed the export-D75749065 branch from 84ce32d to 5a4bd72 Compare June 6, 2025 21:32

shengfukevin changed the title ~~Provide tensor storage delete function in ET to generate unique id for tensor storage object~~ Generate unique id for tensor storage object by observing the week pointer of tensor storage object Jun 6, 2025

pytorchmergebot added the merging label Jun 7, 2025

pytorchmergebot removed the merging label Jun 7, 2025

pytorchmergebot force-pushed the export-D75749065 branch from fb07d41 to c3d445f Compare June 9, 2025 05:37

pytorchmergebot added the merging label Jun 9, 2025

pytorchmergebot added the Merged label Jun 9, 2025

pytorchmergebot closed this in b9b84d8 Jun 9, 2025

pytorchmergebot removed the merging label Jun 9, 2025

pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025

Revert #154859 (#168297)

6707dc8

We suspect it's causing intermittent segfaults Pull Request resolved: #168297 Approved by: https://github.com/malfet

pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025

Revert #154859 (#168297)

b1cd563

We suspect it's causing intermittent segfaults Pull Request resolved: #168297 Approved by: https://github.com/malfet

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025

Revert #154859 (#168297)

895021a

We suspect it's causing intermittent segfaults Pull Request resolved: #168297 Approved by: https://github.com/malfet

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025

Revert #154859 (#168297)

2ce05a4

We suspect it's causing intermittent segfaults Pull Request resolved: #168297 Approved by: https://github.com/malfet
