
Add configurable threshold to avoid power-of-two rounding for large pinned memory allocations #171662

Closed

crcrpar wants to merge 3 commits into main from add-pinned-max-round-size-config

Conversation

crcrpar (Collaborator) commented Jan 4, 2026

See "Correction" below for the up-to-date description.
This pull request introduces a configurable threshold for the maximum allocation size that will be cached by the CUDA pinned memory (host) allocator. Allocations larger than this threshold will no longer be rounded up to the next power of two or cached, which helps avoid memory waste for large allocations that are just above a power-of-two boundary. The threshold is controlled via a new pinned_max_cachesize_mb option in PYTORCH_CUDA_ALLOC_CONF. The default behavior remains unchanged unless this option is set. Documentation and configuration parsing have been updated accordingly.

Pinned memory allocator improvements:

  • Added logic in CachingHostAllocatorImpl and CUDACachingHostAllocatorImpl to skip power-of-two rounding and caching for allocations larger than a configurable threshold, freeing them immediately instead. The threshold is set by the new pinned_max_cachesize_mb option.

Configuration and API changes:

  • Introduced the pinned_max_cachesize_mb option to CUDAAllocatorConfig, including parsing, storage, and API for retrieving the value. This option is now recognized and handled in the allocator configuration.

Documentation:

  • Updated the CUDA notes documentation to describe the new pinned_max_cachesize_mb option, its usage, and its default behavior.

Miscellaneous:

  • Added missing includes for <limits> where needed to support the new logic.

Rel: #150517

Used Claude Opus 4.5

Correction

This PR adds two new PYTORCH_CUDA_ALLOC_CONF options for the pinned memory (host) caching allocator to address memory waste from power-of-2
rounding and caching of large allocations:

  • pinned_max_round_threshold_mb: allocations above this threshold skip power-of-2 rounding.
  • pinned_max_cached_size_mb: allocations above this threshold are freed immediately instead of cached in the free list. Allocations that
    exceed this limit are also never rounded up, since rounding is only useful for cached blocks.

Both options default to unlimited (disabled), preserving existing behavior. A warning is emitted if pinned_max_round_threshold_mb is
explicitly set larger than pinned_max_cached_size_mb.
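
For illustration, here is a minimal sketch of how the two thresholds could act on the allocation and free paths. The getter names, byte units, and max()-as-disabled convention are assumptions modeled on this description, not the exact implementation; the options themselves would be set in the usual option:value,option:value syntax, e.g. PYTORCH_CUDA_ALLOC_CONF=pinned_max_round_threshold_mb:128,pinned_max_cached_size_mb:512.

```cpp
#include <cstddef>
#include <limits>

// Both thresholds are in bytes; numeric_limits<size_t>::max() means "unset".
size_t pinned_max_round_threshold() {
  return std::numeric_limits<size_t>::max();
}
size_t pinned_max_cached_size() {
  return std::numeric_limits<size_t>::max();
}

// Allocation path: round up to the next power of two only for sizes at or
// below the rounding threshold; larger requests keep their exact size.
size_t round_allocation_size(size_t size) {
  if (size > pinned_max_round_threshold()) {
    return size;
  }
  size_t rounded = 1;
  while (rounded < size) {
    rounded <<= 1;
  }
  return rounded;
}

// Free path: only blocks at or below the cache threshold go back on the
// free list; larger blocks are released (cudaFreeHost) immediately.
bool should_cache_block(size_t block_size) {
  return block_size <= pinned_max_cached_size();
}
```

With both getters left at their max() defaults, every request takes the rounding-and-caching path, matching the existing behavior. The warning mentioned above exists because rounding past the cache limit only inflates blocks that will be freed immediately anyway.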

The block caching/freeing logic is refactored into a shared maybe_cache_block method used by both the direct free path and the
event-processing path. The new config values are also exposed in the memory snapshot allocator settings.
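
As a sketch of that refactor (Block, FreeList, and the helpers below are self-contained stand-ins, not the allocator's real types):

```cpp
#include <cstddef>
#include <deque>
#include <limits>
#include <mutex>

struct Block {
  size_t size_;
  void* ptr_ = nullptr;
};

struct FreeList {
  std::mutex mutex_;
  std::deque<Block*> list_;
};

size_t pinned_max_cached_size() {  // bytes; max() when the option is unset
  return std::numeric_limits<size_t>::max();
}

size_t size_index(size_t size) {   // next-power-of-two bucket (illustrative)
  size_t idx = 0;
  while ((size_t{1} << idx) < size) ++idx;
  return idx;
}

void free_block(Block* block) {    // real code would cudaFreeHost first
  delete block;
}

// Called from both the direct free path and the event-processing path.
void maybe_cache_block(Block* block, FreeList* free_lists) {
  // See https://github.com/pytorch/pytorch/issues/150517
  if (block->size_ > pinned_max_cached_size()) {
    // Too large to cache: release the pinned host memory immediately.
    free_block(block);
    return;
  }
  FreeList& fl = free_lists[size_index(block->size_)];
  std::lock_guard<std::mutex> guard(fl.mutex_);
  fl.list_.push_back(block);
}
```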

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

pytorch-bot (Bot) commented Jan 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171662

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a00c135 with merge base 08268aa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pool.free_list_[index].list_.push_back(block);
// Check if block is too large to cache
// See https://github.com/pytorch/pytorch/issues/150517
size_t maxCachedSize = pinned_max_cached_size();

Must max_round_size and max_cache_size be equal? If not, there might be a problem.

For example, say max_round_size is 128M and max_cache_size is 256M. When a 129M block is freed, it will be cached in the free list for 256M blocks. Next time, when a 200M block is allocated, it seems the previous 129M block would be returned.

Please correct me if I am wrong.
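
A small standalone sketch of that scenario, assuming free lists are bucketed by the next power of two of the block size (the bucket math is illustrative, not the allocator's actual indexing):

```cpp
#include <cstddef>
#include <cstdio>

size_t bucket_index(size_t size) {
  size_t idx = 0;
  while ((size_t{1} << idx) < size) ++idx;
  return idx;
}

int main() {
  const size_t MiB = 1024 * 1024;
  // With max_round_size = 128 MiB, a 129 MiB request is NOT rounded,
  // but with max_cache_size = 256 MiB it IS cached on free.
  size_t freed = 129 * MiB;
  // A later 200 MiB request searches the same bucket...
  size_t wanted = 200 * MiB;
  std::printf("freed bucket=%zu wanted bucket=%zu\n",
              bucket_index(freed), bucket_index(wanted));
  // Both print 28: the allocator could hand back the 129 MiB block
  // for a 200 MiB request, which is too small.
  return 0;
}
```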

crcrpar (Collaborator, Author) replied:

Sounds like you're right. I consolidated the configs into one and added some tests.

@crcrpar crcrpar force-pushed the add-pinned-max-round-size-config branch from ac90d12 to 76bc9e9 on January 7, 2026 07:11
@crcrpar crcrpar force-pushed the add-pinned-max-round-size-config branch 2 times, most recently from 0013bc0 to bee1cc6 on January 7, 2026 07:50
@ezyang ezyang requested review from colesbury and ngimel on January 7, 2026 13:34
ngimel (Collaborator) commented Jan 7, 2026

AI assistance on PRs must be disclosed. Claude or codex?

// See https://github.com/pytorch/pytorch/issues/150517
size_t maxPower2Size = pinned_max_power2_size();
if (maxPower2Size > 0 && block->size_ > maxPower2Size) {
// Block too large to cache, free it immediately
Collaborator:

Conflating "I don't want to round the allocation up" and "I don't want to cache the allocation" is confusing.

// See https://github.com/pytorch/pytorch/issues/150517
size_t roundSize = size;
size_t maxPower2Size = pinned_max_power2_size();
if (maxPower2Size == 0 || size <= maxPower2Size) {
Collaborator:

The naming of the variable is confusing, and returning 0 (instead of, say, numeric_limits::max) to indicate that it's disabled is also confusing.

crcrpar (Collaborator, Author) replied:

Made a change so that the default value is -1 to indicate the config isn't set.

crcrpar (Collaborator, Author) commented Jan 8, 2026

AI assistance on PRs must be disclosed. Claude or codex?

Claude Opus 4.5

@jbschlosser jbschlosser added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jan 8, 2026
@crcrpar crcrpar force-pushed the add-pinned-max-round-size-config branch from bee1cc6 to e2a7cd7 on January 12, 2026 03:45
ngimel (Collaborator) left a review:

This still doesn't disentangle max size for caching and max size for rounding up.

Comment thread on c10/cuda/CUDAAllocatorConfig.cpp (outdated):
// -1 means disabled (all allocations use power-of-two and caching).
// 0 means no caching (all allocations use exact size).
// Positive values set the threshold in MB.
m_pinned_max_cachesize_mb = val;
ngimel (Collaborator):

You can directly set it to numeric_limits::max here to avoid the -1 logic later.
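
A minimal sketch of that suggestion (the function name and MB-to-bytes handling are hypothetical):

```cpp
#include <cstddef>
#include <limits>

size_t parse_pinned_max_cached_size_mb(long long val_mb) {
  if (val_mb < 0) {
    // Unset/disabled: nothing is "too large", so no special cases later.
    return std::numeric_limits<size_t>::max();
  }
  // 0 means cache nothing; positive values are a threshold in MB.
  return static_cast<size_t>(val_mb) * 1024 * 1024;
}
```

Normalizing at parse time means callers can always compare block->size_ > pinned_max_cached_size() unconditionally, with no sentinel checks.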

// See https://github.com/pytorch/pytorch/issues/150517
size_t maxCachesize = pinned_max_cachesize();
if (block->size_ > maxCachesize) {
// Block too large to cache, free it immediately
ngimel (Collaborator):

You now have fairly large code blocks for deleting blocks in two places; factor them out into a function.

@crcrpar crcrpar force-pushed the add-pinned-max-round-size-config branch from e2a7cd7 to 3abdf3a on January 21, 2026 14:35
galv (Collaborator) commented Jan 30, 2026

Honestly, I wish we would stop adding new capabilities to CachingHostAllocator and instead allow for a different implementation (in this case, one that just cuts up segments into blocks like CUDACachingAllocator), but I understand that that is an unreasonable ask given all of the interfaces involved at this point.

ehfd commented Mar 23, 2026

What is the status of this PR? The problem this solves leads to OOM errors for vLLM with UMA (360GB of pinned unified memory OOMs because this issue causes much more RAM to be consumed).

@jeffdaily jeffdaily removed their request for review April 28, 2026 20:08
crcrpar and others added 2 commits April 29, 2026 18:56
The default value of these two configs is `std::numeric_limits<size_t>::max()`.
`pinned_max_round_threshold_mb` sets the maximum allocation size which
will be rounded up to the nearest power-of-2. Exact requested sizes are
used if their allocation size is greater than this threshold.

`pinned_max_cached_size_mb` sets the maximum block size that will be
cached. Blocks larger than this threshold will be freed immediately when
no longer in use, rather than being cached.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@ngimel ngimel force-pushed the add-pinned-max-round-size-config branch from 3abdf3a to 2533df0 on April 30, 2026 03:32
@ngimel ngimel added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Apr 30, 2026
ngimel (Collaborator) commented Apr 30, 2026

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team



Labels

  • ciflow/inductor
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: inductor
  • open source
  • release notes: releng (release notes category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

8 participants