Add configurable threshold to avoid power-of-two rounding for large pinned memory allocations#171662
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171662
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit a00c135 with merge base 08268aa. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
```cpp
pool.free_list_[index].list_.push_back(block);
// Check if block is too large to cache
// See https://github.com/pytorch/pytorch/issues/150517
size_t maxCachedSize = pinned_max_cached_size();
```
Must max_round_size and max_cache_size be equal? If not, there might be a problem.
For example, suppose max_round_size is 128 MB and max_cache_size is 256 MB. When a 129 MB block is freed, it will be cached in the free list for 256 MB blocks. The next time a 200 MB block is allocated, it seems the previous 129 MB block would be returned.
Please correct me if I am wrong.
Sounds like you're right. I consolidated the configs into one and added some tests.
Force-pushed from ac90d12 to 76bc9e9
Force-pushed from 0013bc0 to bee1cc6
AI assistance on PRs must be disclosed. Claude or codex?
```cpp
// See https://github.com/pytorch/pytorch/issues/150517
size_t maxPower2Size = pinned_max_power2_size();
if (maxPower2Size > 0 && block->size_ > maxPower2Size) {
  // Block too large to cache, free it immediately
```
Conflating "I don't want to round the allocation up" and "I don't want to cache the allocation" is confusing.
```cpp
// See https://github.com/pytorch/pytorch/issues/150517
size_t roundSize = size;
size_t maxPower2Size = pinned_max_power2_size();
if (maxPower2Size == 0 || size <= maxPower2Size) {
```
The naming of the variable is confusing, and returning 0 (instead of, say, `numeric_limits::max`) to indicate that it's disabled is also confusing.
Made a change so that the default value is -1 to indicate the config isn't set.
Claude Opus 4.5
Force-pushed from bee1cc6 to e2a7cd7
ngimel left a comment:
This still doesn't disentangle max size for caching and max size for rounding up.
```cpp
// -1 means disabled (all allocations use power-of-two and caching).
// 0 means no caching (all allocations use exact size).
// Positive values set the threshold in MB.
m_pinned_max_cachesize_mb = val;
```
You can directly set it to `numeric_limits::max` here to avoid the -1 logic later.
```cpp
// See https://github.com/pytorch/pytorch/issues/150517
size_t maxCachesize = pinned_max_cachesize();
if (block->size_ > maxCachesize) {
  // Block too large to cache, free it immediately
```
You have fairly large code blocks that delete blocks in two places now; factor them out into a function.
Force-pushed from e2a7cd7 to 3abdf3a
Honestly, I wish we would stop adding new capabilities to CachingHostAllocator and instead allow for a different implementation (in this case, one that just cuts up segments into blocks like CUDACachingAllocator), but I understand that that is an unreasonable ask given all of the interfaces involved at this point.
What is the status of this PR? The problem this solves leads to OOM errors for vLLM with UMA (360 GB of pinned unified memory OOMs because much more RAM is spent due to this issue).
The default value of these two configs is `std::numeric_limits<size_t>::max()`. `pinned_max_round_threshold_mb` sets the maximum allocation size that will be rounded up to the nearest power of two; the exact requested size is used if the allocation is larger than this threshold. `pinned_max_cached_size_mb` sets the maximum block size that will be cached; blocks larger than this threshold will be freed immediately when no longer in use, rather than being cached.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Force-pushed from 3abdf3a to 2533df0
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
See under "Correction" for the up-to-date description.
This pull request introduces a configurable threshold for the maximum allocation size that will be cached by the CUDA pinned memory (host) allocator. Allocations larger than this threshold will no longer be rounded up to the next power of two or cached, which helps avoid memory waste for large allocations that are just above a power-of-two boundary. The threshold is controlled via a new `pinned_max_cachesize_mb` option in `PYTORCH_CUDA_ALLOC_CONF`. The default behavior remains unchanged unless this option is set. Documentation and configuration parsing have been updated accordingly.

Pinned memory allocator improvements:

- Updated `CachingHostAllocatorImpl` and `CUDACachingHostAllocatorImpl` to skip power-of-two rounding and caching for allocations larger than a configurable threshold, freeing them immediately instead. The threshold is set by the new `pinned_max_cachesize_mb` option. [1] [2] [3] [4]

Configuration and API changes:

- Added the `pinned_max_cachesize_mb` option to `CUDAAllocatorConfig`, including parsing, storage, and an API for retrieving the value. This option is now recognized and handled in the allocator configuration. [1] [2] [3] [4] [5] [6] [7] [8]

Documentation:

- Documented the `pinned_max_cachesize_mb` option, its usage, and its default behavior.

Miscellaneous:

- Included `<limits>` where needed to support the new logic. [1] [2]

Rel: #150517
Used Claude Opus 4.5
Correction
This PR adds two new `PYTORCH_CUDA_ALLOC_CONF` options for the pinned memory (host) caching allocator to address memory waste from power-of-2 rounding and caching of large allocations:

- `pinned_max_round_threshold_mb`: allocations larger than this threshold keep their exact requested size instead of being rounded up to the next power of two.
- `pinned_max_cached_size_mb`: blocks larger than this threshold are freed immediately when no longer in use instead of being cached. Blocks that exceed this limit are also never rounded up, since rounding is only useful for cached blocks.

Both options default to unlimited (disabled), preserving existing behavior. A warning is emitted if `pinned_max_round_threshold_mb` is explicitly set larger than `pinned_max_cached_size_mb`.
The block caching/freeing logic is refactored into a shared maybe_cache_block method used by both the direct free path and the
event-processing path. The new config values are also exposed in the memory snapshot allocator settings.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo