
Enable AcceleratorAllocatorConfig key check #157908

Closed
guangyey wants to merge 11 commits into gh/guangyey/165/base from gh/guangyey/165/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Jul 9, 2025

@pytorch-bot

pytorch-bot bot commented Jul 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157908

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 062449b with merge base 6de2413:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

guangyey added 4 commits July 9, 2025 19:06
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

@albanD May I know if this PR looks reasonable to you for checking the valid keys, as you mentioned in #150312 (comment)?

Collaborator

@albanD albanD left a comment


Thanks!

@guangyey
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jul 11, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 11, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
// If a device-specific configuration parser hook is registered, it will
// check if the key is unrecognized.
if (device_config_parser_hook_) {
TORCH_CHECK(
Contributor


Since this AllocatorConfig is called at static init time, it's not safe to use logging libraries. This can cause SIOF: https://en.cppreference.com/w/cpp/language/siof.html

Collaborator Author


TIL. Thanks @ericcfu
I'd like to understand something about the TORCH_CHECK macro. It doesn't seem to depend on third-party libraries or other static variables. It's just like a function, right?
I see a similar pattern used in static initialization here:

auto val = c10::utils::get_env("PYTORCH_CUDA_ALLOC_CONF");

where c10::utils::get_env is a function defined in another translation unit.
Is this implementation actually safe, or does it just happen to work by chance?

Collaborator Author

@guangyey guangyey Jul 16, 2025


And in this PR, no instance is created at static init time, so I don't understand the failure. Anyway, I have changed AllocatorConfig to be loaded at runtime.
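As an aside, a minimal sketch of what "loaded at runtime" can look like, assuming a hypothetical env-var name and accessor (this is an illustration, not the actual PyTorch change):

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: read the configuration string on first use instead of
// during static initialization, so no cross-TU static ordering is involved.
const std::string& allocConfDemo() {
  // Function-local static: initialized on first call, thread-safe since C++11.
  static const std::string conf = [] {
    const char* v = std::getenv("ALLOC_CONF_DEMO");  // hypothetical variable
    return std::string(v ? v : "");
  }();
  return conf;
}
```

The function-local static sidesteps the cross-translation-unit initialization-order question entirely, because construction happens at the first call rather than before main().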

Contributor


@guangyey I think I may be wrong here. I believe the real culprit is this change: #149601 (comment)

Upon further look, I agree, TORCH_CHECK doesn't look like it depends on static init.

Collaborator Author


Thanks for your confirmation.

Contributor


Sorry for the churn. Thanks for the followup on the other PR.
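For context, the unrecognized-key check under discussion can be sketched roughly as follows. The key set mirrors the keys_ set quoted later in this review, but checkKeyValid and the plain std::runtime_error are illustrative stand-ins for the TORCH_CHECK-based implementation:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Illustrative stand-in for the TORCH_CHECK-based validation in this PR.
void checkKeyValid(const std::string& key) {
  // Function-local static, built on first use (see the static-init
  // discussion above); contents mirror the keys_ set from this PR.
  static const std::unordered_set<std::string> validKeys{
      "max_split_size_mb",
      "max_non_split_rounding_mb",
      "garbage_collection_threshold",
      "roundup_power2_divisions",
      "expandable_segments",
      "pinned_use_background_threads"};
  if (validKeys.count(key) == 0) {
    throw std::runtime_error("Unrecognized key in allocator config: " + key);
  }
}
```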

@huydhn
Contributor

huydhn commented Jul 15, 2025

@pytorchbot revert -m 'Sorry for reverting your change but it is failing internally per #157908 (comment)' -c ghfirst

@huydhn
Contributor

huydhn commented Jul 15, 2025

cc @wdvr Once this is reverted, could you help facilitate the import, and we can jedi-land the revert if needed?

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 15, 2025
This reverts commit 65fcca4.

Reverted #157908 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internally per #157908 (comment) ([comment](#157908 (comment)))
guangyey added 5 commits July 16, 2025 11:27
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

@wdvr May I know if you could help import all the PRs in this stack, to check whether the issue has been resolved?

[ghstack-poisoned]
@wdvr
Contributor

wdvr commented Jul 18, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wdvr
Contributor

wdvr commented Jul 18, 2025

@guangyey just imported #157908 and #149601, will let you know in a few hours what the result is

@guangyey
Collaborator Author

@wdvr Thanks very much. BTW, I think it is better to import #150312 as well, because CUDAAllocatorConfig will be created at static init time, which may raise the same issue.

@guangyey
Collaborator Author

@wdvr, may I know if the result is good?

@guangyey
Collaborator Author

@huydhn @Camyll Wouter seems to be on PTO, would you help to check if the result is good?

@wdvr
Contributor

wdvr commented Jul 25, 2025

Sorry for the delay. I imported #157908 and #149601; mostly fine except for:

a linter that asks for std::fill( to be std::ranges::fill(

and the same OSS failures you see in the PRs.

I didn't import #150312 yet, will do now and let you know after the weekend what the result is

@guangyey
Collaborator Author

@wdvr Thanks for your help! Please let me know if we could land this series of PRs.

@wdvr
Contributor

wdvr commented Jul 29, 2025

All three seem good to merge (#150312, #157908, #149601)

@guangyey
Collaborator Author

@wdvr I really appreciate your help!

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
[ghstack-poisoned]
Comment on lines +330 to 340
// A set of valid configuration keys, including both common and
// device-specific options. This set is used to validate the presence and
// correctness of keys during parsing.
inline static std::unordered_set<std::string> keys_{
"max_split_size_mb",
"max_non_split_rounding_mb",
"garbage_collection_threshold",
"roundup_power2_divisions",
"expandable_segments",
"pinned_use_background_threads"};
};
Contributor


I've also been commenting on #150312, but this static initialization at class scope using a data structure from the STL is not safe in Windows DLLs. This could be moved into function scope or it could be initialized lazily using std::once_flag (c10::once_flag). This case is a bit trickier than the one in that other PR since keys_ is mutated by registerDeviceConfigParserHook()
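One way to realize the suggestion above is to hide the set behind an accessor that builds it lazily with std::call_once; a rough sketch with hypothetical names (not the actual fix), returning a mutable reference so a registration hook can still add device-specific keys later:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Illustrative sketch: the key set is built lazily via std::call_once, so
// no STL object is constructed at static init time. std::once_flag is
// constexpr-constructible, which makes it safe as a static itself.
std::unordered_set<std::string>& allocatorKeys() {
  static std::once_flag once;
  static std::unordered_set<std::string>* keys = nullptr;
  std::call_once(once, [] {
    // Leaked deliberately to avoid static destruction-order issues.
    keys = new std::unordered_set<std::string>{
        "max_split_size_mb", "expandable_segments"};
  });
  return *keys;
}

// Hypothetical registration helper mutating the lazily-built set,
// analogous to what a device-config parser hook might do.
void registerDemoKey(const std::string& k) { allocatorKeys().insert(k); }
```

A plain function-local static set would also work here; std::call_once is shown because it is the mechanism the comment names.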

yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
# Motivation
Add a mechanism to raise an error if a key in the allocator config is unrecognized.

Pull Request resolved: #157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
@github-actions github-actions bot deleted the gh/guangyey/165/head branch August 31, 2025 02:15

Labels

ci-no-td (Do not run TD on this PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: accelerator (Issues related to the shared accelerator API)
open source
Reverted
topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants