
Uses memory pools for mixing CUDA allocators #125722

Closed

syed-ahmed wants to merge 12 commits into pytorch:main from syed-ahmed:torch-mempool-upstream

Conversation

@syed-ahmed (Collaborator) commented May 7, 2024

pytorch-bot (Bot) commented May 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125722

Note: Links to docs will display an error until the docs builds have been completed.

❌ 66 New Failures, 5 Unrelated Failures

As of commit 29d15bd with merge base e8e327b:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the "oncall: distributed" (add this issue/PR to distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) labels on May 7, 2024
@syed-ahmed changed the title from "Uses memory pools for mixing CUDA Allocators" to "Uses memory pools for mixing CUDA allocators" on May 7, 2024
@syed-ahmed force-pushed the torch-mempool-upstream branch 2 times, most recently from b3bf94b to b0dc669 on May 15, 2024 03:11
@syed-ahmed force-pushed the torch-mempool-upstream branch from 4688287 to 29d15bd on May 29, 2024 17:16
@Aidyn-A marked this pull request as ready for review on June 4, 2024 17:52
@Aidyn-A requested a review from eqy as a code owner on June 4, 2024 17:52
@Aidyn-A marked this pull request as draft on June 4, 2024 17:52
pytorchmergebot pushed a commit that referenced this pull request Jul 18, 2024
…cator usage (#130472)

We should be able to create multiple CUDAPluggableAllocators in the same PyTorch program (see #124807, #125722 for context). When mixing CUDAPluggableAllocators in the same PyTorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory).

Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; for the multiple-allocator use case, however, the DataPtr would use the deleter of whatever is in `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 with `ncclMemAlloc`, and `current_custom_allocator` currently points to the CUDAPluggableAllocator with `ncclMemAlloc`, then cleaning up allocation 1 would call `ncclMemFree` instead of `cudaFree`.

In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context, so there is no need to go through an allocator object to find the deleter.
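Conceptually, the fix replaces a free-time lookup through a mutable global with a free function captured per allocation. A minimal Python sketch of the idea (the actual change is in PyTorch's C++ CUDAPluggableAllocator; every name below is hypothetical):

```python
# Illustrative sketch only: the real change is in C++ inside
# CUDAPluggableAllocator; all names here are hypothetical.

current_custom_allocator = None  # mutable global, as in the old design


def make_data_ptr_broken(allocator):
    """Old pattern: the deleter consults the global at free time."""
    ptr = allocator.alloc_fn()

    def deleter(p):
        # Bug: by the time this runs, the global may point at a
        # different allocator, so `p` is freed with the wrong free_fn.
        current_custom_allocator.free_fn(p)

    return ptr, deleter


def make_data_ptr_fixed(allocator):
    """New pattern: the matching free_fn is remembered at allocation
    time (the "deleter context"), so the right free is always used."""
    ptr = allocator.alloc_fn()
    free_fn = allocator.free_fn  # captured per allocation

    def deleter(p):
        free_fn(p)  # always pairs this allocation with its own free

    return ptr, deleter
```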

CC: @zdevito @ptrblck @eqy
Pull Request resolved: #130472
Approved by: https://github.com/eqy, https://github.com/ezyang
DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024, and xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024; both carry the same pytorch#130472 commit message quoted above.
pytorchmergebot pushed a commit that referenced this pull request Aug 1, 2024
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. The `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator, as proposed in #124807 (comment). Tests are added to show usage with CUDAPluggableAllocator.
- The MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in the CUDACachingAllocator to route allocations to a user-provided allocator; see the draft in #125722. A minimal usage sketch follows the approval note below.

Pull Request resolved: #131152
Approved by: https://github.com/eqy, https://github.com/ezyang
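A minimal sketch of how these pieces are meant to compose, based only on the description above; the shared-library path, exported symbol names, and the exact `MemPool`/`MemPoolContext` signatures are assumptions, not the confirmed final API:

```python
import torch
from torch.cuda.memory import CUDAPluggableAllocator

# Hypothetical shared library exporting a malloc/free pair (for example,
# wrappers around ncclMemAlloc/ncclMemFree); path and symbol names are
# placeholders for this sketch.
custom = CUDAPluggableAllocator("libmy_alloc.so", "my_malloc", "my_free")

# A MemPool gets a unique id (as graph_pool_handle() now does internally)
# and can hold a pointer to a CUDAAllocator.
pool = torch.cuda.MemPool(custom.allocator())

# MemPoolContext makes the pool active; while it is active, the
# CUDACachingAllocator would route allocations to `custom`.
ctx = torch.cuda.MemPoolContext(pool)
x = torch.randn(1024, device="cuda")  # intended to be served from `pool`
del ctx  # deactivate; later allocations use the default allocator again
```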
github-actions (Bot) commented Aug 3, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions added the Stale label on Aug 3, 2024
@syed-ahmed (Collaborator, Author)

Stacked PRs with tests have been posted. Closing this.

@syed-ahmed closed this on Aug 14, 2024

Labels

oncall: distributed, open source, release notes: distributed (c10d), Stale
