Skip to content

Use ctypes to serialize raw content for tensors.#108287

Closed
muchulee8 wants to merge 2 commits intomainfrom
mlee8/aot_large_weight
Closed

Use ctypes to serialize raw content for tensors.#108287
muchulee8 wants to merge 2 commits intomainfrom
mlee8/aot_large_weight

Conversation

@muchulee8
Copy link
Contributor

@muchulee8 muchulee8 commented Aug 30, 2023

Summary:
There's a deadlock in current storage's implementation if the size of tensor is too large. Use ctypes to do serialization.

Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only MT5ForConditionalGeneration

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @aakhundov

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108287

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d93c77d with merge base b535ed2 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor

@desertfire desertfire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve to unblock, but not sure what the underlying root cause was, cc @ezyang @eellison

actual = AOTInductorModelRunner.run(model, example_inputs, expected)
self.assertTrue(same(actual, expected))

@requires_cpp_extension()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently changed this file. You will need to remove this line.

@desertfire
Copy link
Contributor

cc @Chillee as this week's oncall. Last nightly run has shown timeout because of this issue, https://github.com/pytorch/pytorch/actions/runs/6021619982/job/16335510610. I prefer to merge this PR as a forward fix.

@muchulee8 muchulee8 requested a review from Chillee August 31, 2023 01:04
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Aug 31, 2023
@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Aug 31, 2023
Summary:
There's a deadlock in current storage's implementation if the size of
tensor is too large. Use ctypes to do serialization.

Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy
--inference --device cuda --export-aot-inductor --only
MT5ForConditionalGeneration

Reviewers:

Subscribers:

Tasks:

Tags:
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not feel like a deadlock, but rather than bytes() will create a copy of a tensor, while ctypes creates an alias to the same region. Also, wouldn't it be better, if untyped_storage were serializable on its own?

@muchulee8
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 31, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@desertfire
Copy link
Contributor

@pytorchbot revert -m "Internal test failure from #107718. Revert this one first and then revert 107718." -c nosignal

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@muchulee8 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 31, 2023
This reverts commit 43f28be.

Reverted #108287 on behalf of https://github.com/desertfire due to Internal test failure from #107718. Revert this one first and then revert 107718. ([comment](#108287 (comment)))
@desertfire
Copy link
Contributor

@malfet @huydhn , what is the right way to revert the base PR (#107718) from both fbcode and github? There is this auto-generated D48866377 for fbcode reverting, but should I just revert the github PR and wait for it to sync? Is there a faster way?

@malfet
Copy link
Contributor

malfet commented Aug 31, 2023

@desertfire unland internal diff, it should auto revert PR, doing it now...

@github-actions github-actions bot deleted the mlee8/aot_large_weight branch March 1, 2025 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor release notes: releng release notes category Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants