
Enable AcceleratorAllocatorConfig key check #157908

Closed
guangyey wants to merge 11 commits into gh/guangyey/165/base from gh/guangyey/165/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Jul 9, 2025

@pytorch-bot

pytorch-bot bot commented Jul 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157908

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 062449b with merge base 6de2413:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

guangyey added 4 commits July 9, 2025 19:06
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

@albanD May I know if this PR looks reasonable to you for checking the valid keys, as you mentioned in #150312 (comment)?

Collaborator

@albanD albanD left a comment


Thanks!

@guangyey
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jul 11, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 11, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
// If a device-specific configuration parser hook is registered, it will
// check if the key is unrecognized.
if (device_config_parser_hook_) {
TORCH_CHECK(
Contributor


Since this AllocatorConfig is called at static init time, it's not safe to use logging libraries. This can cause SIOF: https://en.cppreference.com/w/cpp/language/siof.html

Collaborator Author


TIL. Thanks @ericcfu
I'd like to understand something about the TORCH_CHECK macro. It doesn't seem to depend on third-party libraries or other static variables. It's just like a function, right?
I see a similar pattern used in static initialization here:

auto val = c10::utils::get_env("PYTORCH_CUDA_ALLOC_CONF");

where c10::utils::get_env is a function defined in another translation unit.
Is this implementation actually safe, or does it just happen to work by chance?

Collaborator Author

@guangyey guangyey Jul 16, 2025


And in this PR, no instance is created at static init time, so I don't understand the failure. Anyway, I have changed AllocatorConfig to be loaded at runtime.
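As an aside, a minimal sketch of what "loaded at runtime" can look like, assuming a hypothetical env-var name and accessor (this is an illustration, not the actual PyTorch change):

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: read the configuration string on first use instead of
// during static initialization, so no cross-TU static ordering is involved.
const std::string& allocConfDemo() {
  // Function-local static: initialized on first call, thread-safe since C++11.
  static const std::string conf = [] {
    const char* v = std::getenv("ALLOC_CONF_DEMO");  // hypothetical variable
    return std::string(v ? v : "");
  }();
  return conf;
}
```

The function-local static sidesteps the cross-translation-unit initialization-order question entirely, because construction happens at the first call rather than before main().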

Contributor


@guangyey I think I may be wrong here. I believe the real culprit is this change: #149601 (comment)

Upon further look, I agree, TORCH_CHECK doesn't look like it depends on static init.

Collaborator Author


Thanks for your confirmation.

Contributor


Sorry for the churn. Thanks for the followup on the other PR.
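For context, the unrecognized-key check under discussion can be sketched roughly as follows. The key set mirrors the keys_ set quoted later in this review, but checkKeyValid and the plain std::runtime_error are illustrative stand-ins for the TORCH_CHECK-based implementation:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Illustrative stand-in for the TORCH_CHECK-based validation in this PR.
void checkKeyValid(const std::string& key) {
  // Function-local static, built on first use (see the static-init
  // discussion above); contents mirror the keys_ set from this PR.
  static const std::unordered_set<std::string> validKeys{
      "max_split_size_mb",
      "max_non_split_rounding_mb",
      "garbage_collection_threshold",
      "roundup_power2_divisions",
      "expandable_segments",
      "pinned_use_background_threads"};
  if (validKeys.count(key) == 0) {
    throw std::runtime_error("Unrecognized key in allocator config: " + key);
  }
}
```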

@huydhn
Contributor

huydhn commented Jul 15, 2025

@pytorchbot revert -m 'Sorry for reverting your change but it is failing internally per #157908 (comment)' -c ghfirst

@huydhn
Contributor

huydhn commented Jul 15, 2025

cc @wdvr Once this is reverted, could you help facilitate the import, and we can jedi-land the revert if needed?

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 15, 2025
This reverts commit 65fcca4.

Reverted #157908 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internally per #157908 (comment) ([comment](#157908 (comment)))
guangyey added 5 commits July 16, 2025 11:27
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Collaborator Author

@wdvr May I know if you could help import all the PRs in this stack, to check whether the issue has been resolved?

[ghstack-poisoned]
@wdvr
Contributor

wdvr commented Jul 18, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wdvr
Contributor

wdvr commented Jul 18, 2025

@guangyey just imported #157908 and #149601, will let you know in a few hours what the result is

@guangyey
Collaborator Author

@wdvr Thanks very much. BTW, I think it is better to import #150312 as well, because CUDAAllocatorConfig will be created at static init time, which may raise the same issue.

@guangyey
Collaborator Author

@wdvr, may I know if the result is good?

@guangyey
Collaborator Author

@huydhn @Camyll Wouter seems to be on PTO, would you help to check if the result is good?

@wdvr
Contributor

wdvr commented Jul 25, 2025

Sorry for the delay. I imported #157908 and #149601; mostly fine except for:

a linter that asks for std::fill( to be std::ranges::fill(

and the same OSS failures you see in the PRs.

I didn't import #150312 yet, will do now and let you know after the weekend what the result is

@guangyey
Collaborator Author

@wdvr Thanks for your help! Please let me know if we could land this series of PRs.

@wdvr
Contributor

wdvr commented Jul 29, 2025

All three seem good to merge (#150312, #157908, #149601)

@guangyey
Collaborator Author

@wdvr I really appreciate your help!

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
[ghstack-poisoned]
Comment on lines +330 to 340
// A set of valid configuration keys, including both common and
// device-specific options. This set is used to validate the presence and
// correctness of keys during parsing.
inline static std::unordered_set<std::string> keys_{
"max_split_size_mb",
"max_non_split_rounding_mb",
"garbage_collection_threshold",
"roundup_power2_divisions",
"expandable_segments",
"pinned_use_background_threads"};
};
Contributor


I've also been commenting on #150312, but this static initialization at class scope using a data structure from the STL is not safe in Windows DLLs. This could be moved into function scope or it could be initialized lazily using std::once_flag (c10::once_flag). This case is a bit trickier than the one in that other PR since keys_ is mutated by registerDeviceConfigParserHook()
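One way to realize the suggestion above is to hide the set behind an accessor that builds it lazily with std::call_once; a rough sketch with hypothetical names (not the actual fix), returning a mutable reference so a registration hook can still add device-specific keys later:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Illustrative sketch: the key set is built lazily via std::call_once, so
// no STL object is constructed at static init time. std::once_flag is
// constexpr-constructible, which makes it safe as a static itself.
std::unordered_set<std::string>& allocatorKeys() {
  static std::once_flag once;
  static std::unordered_set<std::string>* keys = nullptr;
  std::call_once(once, [] {
    // Leaked deliberately to avoid static destruction-order issues.
    keys = new std::unordered_set<std::string>{
        "max_split_size_mb", "expandable_segments"};
  });
  return *keys;
}

// Hypothetical registration helper mutating the lazily-built set,
// analogous to what a device-config parser hook might do.
void registerDemoKey(const std::string& k) { allocatorKeys().insert(k); }
```

A plain function-local static set would also work here; std::call_once is shown because it is the mechanism the comment names.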

yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
# Motivation
Add a mechanism to raise an error if a key in the allocator config is unrecognized.

Pull Request resolved: #157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
@github-actions github-actions bot deleted the gh/guangyey/165/head branch August 31, 2025 02:15

Labels

ci-no-td (Do not run TD on this PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: accelerator (Issues related to the shared accelerator API)
open source
Reverted
topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants