Enable AcceleratorAllocatorConfig key check #157908
guangyey wants to merge 11 commits into gh/guangyey/165/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157908
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit 062449b with merge base 6de2413:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@albanD May I know if this PR looks reasonable to you for checking the valid keys, as you mentioned in #150312 (comment)?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…0312) # Motivation Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility. Pull Request resolved: #150312 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908
```cpp
// If a device-specific configuration parser hook is registered, it will
// check if the key is unrecognized.
if (device_config_parser_hook_) {
  TORCH_CHECK(
```
Since this AllocatorConfig is called at static init time, it's not safe to use logging libraries. This can cause SIOF: https://en.cppreference.com/w/cpp/language/siof.html
TIL. Thanks @ericcfu
I'd like to understand something about the TORCH_CHECK macro. It doesn't seem to depend on third-party libraries or other static variables; it's essentially just a function, right?
I see another similar situation being used in static initialization here:
pytorch/c10/cuda/CUDACachingAllocator.cpp, line 4139 in 0cb36e2
`c10::utils::get_env` is a function defined in another translation unit. Is this implementation actually safe, or does it just happen to work by chance?
And in this PR, no instance is created at static init time, so I don't understand the failure. Anyway, I changed AllocatorConfig to be loaded at runtime.
@guangyey I think I may be wrong here. I believe the real culprit is this change: #149601 (comment)
Upon further look, I agree, TORCH_CHECK doesn't look like it depends on static init.
Thanks for your confirmation.
Sorry for the churn. Thanks for the followup on the other PR.
@pytorchbot revert -m 'Sorry for reverting your change but it is failing internally per #157908 (comment)' -c ghfirst
cc @wdvr Once this is reverted, could you help facilitate the import, and we can jedi land the revert if needed?
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 65fcca4. Reverted #157908 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internally per #157908 (comment) ([comment](#157908 (comment)))
@wdvr May I know if you could help import all the PRs in this stack to check whether the issue has been resolved?
@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@wdvr, may I know if the result looks good?
@wdvr Thanks for your help! Please let me know if we can land this series of PRs.
@wdvr I really appreciate your help!
Starting merge as part of PR stack under #156175
# Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: #156175 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312, #156165
```cpp
// A set of valid configuration keys, including both common and
// device-specific options. This set is used to validate the presence and
// correctness of keys during parsing.
inline static std::unordered_set<std::string> keys_{
    "max_split_size_mb",
    "max_non_split_rounding_mb",
    "garbage_collection_threshold",
    "roundup_power2_divisions",
    "expandable_segments",
    "pinned_use_background_threads"};
};
```
I've also been commenting on #150312, but this static initialization at class scope using an STL data structure is not safe in Windows DLLs. It could be moved into function scope, or it could be initialized lazily using std::once_flag (c10::once_flag). This case is a bit trickier than the one in that other PR, since keys_ is mutated by registerDeviceConfigParserHook().
# Motivation Add a mechanism to raise an error if an unrecognized key appears in the allocator config. Pull Request resolved: #157908 Approved by: https://github.com/albanD ghstack dependencies: #149601
Stack from ghstack (oldest at bottom):
Motivation
Add a mechanism to raise an error if an unrecognized key appears in the allocator config.
cc @albanD @EikanWang