[REVIEW] Out-of-memory callback resource adaptor#892
rapids-bot[bot] merged 20 commits into rapidsai:branch-21.12
Conversation
Wow, that was fast.
jrhemstad
left a comment
Can we also get a test written in C++?
Yep @rongou, we do it here: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L139. We hook in at the memory resource level and call a JNI function in our case. Here's where we handle an OOM: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L212. Our resource also had handling for threshold-based OOM (i.e. not a real OOM from RMM, but a low/high watermark for some preemptive spilling). We are not using the low/high watermark at the moment; I am sure part of that could be refactored into its own memory resource.
@jrhemstad thanks for the review, I have added a C++ test, renamed the
harrism
left a comment
I think the naming should be more general and not use an acronym. Other than that and a few doc improvements, looks like a great contribution. Thanks!
Co-authored-by: Mark Harris <mharris@nvidia.com>
Thanks for the review @harrism, I think I have addressed all of your suggestions.
@gpucibot merge
Use rapidsai/rmm#892 to implement spilling on demand. Requires use of [RMM](https://github.com/rapidsai/rmm) and JIT-unspill enabled. The `device_memory_limit` still works as usual -- when known allocations reach `device_memory_limit`, Dask-CUDA starts spilling preemptively. However, with this PR it should be possible to increase `device_memory_limit` significantly, since memory spikes will be handled by spilling on demand. Closes #755 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #756
…or (#898) #892 added `failure_callback_resource_adaptor`, which provides the ability to respond to memory allocation failures. However, it was hard-coded to catch (and rethrow) `std::bad_alloc` exceptions. This PR makes the type of exception the adaptor catches a template parameter, providing greater flexibility. The default exception type is now `rmm::out_of_memory`, since we expect this to be the common use case. This PR also includes a few changes to fix clang-tidy warnings. Authors: - Mark Harris (https://github.com/harrism) Approvers: - Rong Ou (https://github.com/rongou) - Mads R. B. Kristensen (https://github.com/madsbk) - Jake Hemstad (https://github.com/jrhemstad) URL: #898
This PR implements a new resource adaptor that calls a callback function when an allocation fails. The idea is that the callback function can free up memory (e.g. by spilling) and ask RMM to retry the allocation.
This is motivated by the fairly primitive spilling in Dask. Currently, Dask and Dask-CUDA have no way of handling OOM errors other than restarting tasks or workers. Instead, they spill preemptively based on very conservative memory thresholds. For instance, most workflows in Dask-CUDA start spilling when half the GPU memory is in use.
This PR makes it possible for projects like Dask and Dask-CUDA to trigger spilling on demand instead of preemptively.
cc. @jrhemstad @shwina