Fix memory issues in cuIO due to removal of memory padding #13586
rapids-bot[bot] merged 19 commits into rapidsai:branch-23.08 from
Conversation
@vuule Please also check the entire codebase more carefully. I tried to find as many related buffers as possible, but I may still have missed some.
I just realized that there is a duplicate fix for Parquet (#13585), which implements a different approach than this: adding
 * @return The aligned value
 */
inline std::size_t align_up(std::size_t value, std::size_t alignment)
{
nit:
If alignment is going to be a compile-time constant, it could be a template parameter, and a static_assert could enforce that it is a power of 2 at compile time.
If not, an assert statement checking for a power of 2 would be helpful here for debugging purposes.
harrism left a comment
Most of the use cases here are constructing a device_buffer with padding (not alignment, as the comments say). I suggest creating a factory function make_padded_device_buffer() and replace calls to
rmm::device_buffer buf(cudf::detail::align_up(size, 8 /* alignment */), stream);
with
auto buf = make_padded_device_buffer(size, padding, stream);
You may alternatively want to create a globally available memory_resource adaptor padding_adaptor which wraps the default memory resource, and that make_padded_device_buffer uses by default (and hence the padding argument can be optional).
cpp/src/io/avro/reader_impl.cu
Outdated
rmm::device_buffer decomp_block_data(uncomp_size, stream);
// Buffer needs to be padded as the compute kernels require aligned data pointers.
rmm::device_buffer decomp_block_data(cudf::detail::align_up(uncomp_size, 8 /*alignment*/),
This and all the other device_buffer use cases are padding out the allocation, not changing its alignment. This might lead to confusion. See my main review comment.
How did we know that 8 is the right amount of padding? Also, can we use a constant for this rather than hardcoding the value 8 many times?
I believe that @vuule has some idea about why cuIO kernels require 8-byte aligned pointers?
No idea. For me there's a gap between the reported issue and this solution, where I don't see why these buffers need the padding, and why this much padding. I don't see this information in the PR.
@harrism I finished implementing it. I think the better solution for this is to explicitly say "padded" in the buffer type. That means we should have a new buffer type like
I like this idea, because it allows users of buffers to advertise the padding (and/or alignment) that they require. (Aside: this "newtype" idiom is very common in the Haskell and Rust worlds.) I imagine that a nice way to do it is to have a
 * @param alignment The amount of bytes to align, must be a power of 2
 * @return The aligned value
 */
inline std::size_t align_up(std::size_t value, std::size_t alignment)
Is this the same as round_up_safe?
I think this function could be useful if it took the number of alignment bits. Removes the power of 2 requirement and differentiates it from round_up_safe.
Sorry I'm confused. What is the conclusion here? Should round_up_safe be used to fix the bugs instead of adding align_up?
I would prefer just using round_up_safe here and using a constexpr named variable for the round-up value instead of the hardcoded 8 value.
Indeed, round_up_safe in this use case produces exactly the same output as align_up. So I'm going to switch to using it, as suggested by @davidwendt, to get the bugs fixed ASAP. padded_device_buffer will be the follow-up work.
Could have also used RMM's function?
Finished adoption of round_up_safe but still struggling to find a good place to put the constexpr that @davidwendt recommended.
The changes seem to correspond to temporary compression/decompression buffers. Perhaps gpuinflate.hpp?
I'm ok defining it in each .cu file too if there is a chance the padding may be different for the different implementations.
Could have also used RMM's function?
I think we can. However, that is a detail function (rmm::detail::align_up), so I was reluctant to use it.
I would like to consider the device-buffer-padding work in a follow-on PR.
bdice left a comment
A few small comments. I am happy with this design overall, feel free to address as you see fit.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
vuule left a comment
Left some notes to help track where the padding requirements originate.
Looks like there are two sources: inflate_kernel and nvcomp snappy.
The nvcomp one is surprising; we should follow up on it with the nvcomp team.
cpp/src/io/avro/reader_impl.cu
Outdated
// Buffer needs to be padded.
rmm::device_buffer decomp_block_data(
  cudf::util::round_up_safe(uncomp_size, BUFFER_PADDING_MULTIPLE), stream);
padding because of inflate_kernel (this is the output of this kernel)
cpp/src/io/avro/reader_impl.cu
Outdated
// Buffer needs to be padded.
rmm::device_buffer scratch(cudf::util::round_up_safe(temp_size, BUFFER_PADDING_MULTIPLE),
                           stream);
rmm::device_buffer decomp_block_data(
  cudf::util::round_up_safe(uncompressed_data_size, BUFFER_PADDING_MULTIPLE), stream);
this one is odd, there should be no padding requirement from nvcomp. Have you verified that padding is required in the snappy case?
Why not? It is the input to nvcompBatchedSnappyDecompressAsync.
Indeed, we don't have any unit tests for Avro in libcudf, so I can't verify it.
Just checked, there are no alignment/padding requirements from nvcomp here. IMO this part of the change can be reverted.
Are cuDF tests useless in this context?
Then I'll revert this.
Are cuDF tests useless in this context?
There are zero C++ tests for Avro.
I was referring to Python tests, we have some limited coverage there.
We have avro_reader_test in Python but I don't know how to check it with compute-sanitizer.
Update: I ran compute-sanitizer on test_avro_reader_fastavro_integration.py and didn't see any issue 👍
cpp/src/io/avro/reader_impl.cu
Outdated
// Buffer needs to be padded.
block_data.resize(cudf::util::round_up_safe(block_data.size(), BUFFER_PADDING_MULTIPLE),
                  stream);
padding because of inflate_kernel? (this is an input of this kernel)
stripe_data.emplace_back(
  cudf::util::round_up_safe(total_data_size, BUFFER_PADDING_MULTIPLE), stream);
based on the failing test, this buffer is the input to the nvcomp snappy decompression.
Assuming the issue has nothing to do with snappy, maybe this padding is needed in the uncompressed case. @ttnghia do you know which test triggered on this one?
Reverting this will cause a memory access issue at:
at 0x430 in ...cpp/src/io/comp/gpuinflate.cu:1197:cudf::io::copy_uncompressed_kernel
Based on the repro (without combing through copy_uncompressed_kernel), my understanding is that it requires the input buffers to be padded. I would suggest adding this to the places where buffers are padded for this reason, and also adding a note to the gpu_copy_uncompressed_blocks docs. I really want changes in this PR to be traceable to the root(ish) cause.
// Buffer needs to be padded.
rmm::device_buffer decomp_data(
  cudf::util::round_up_safe(total_decomp_size, BUFFER_PADDING_MULTIPLE), stream);
output of (presumed) nvcomp snappy decompression.
Issue from gpuDecodeOrcColumnData.
this one is about inflate_kernel
vyasr left a comment
Fixes look correct to me. Would also like to discuss the adaptor/device_buffer changes in a separate follow-up PR.
rmm::device_buffer decomp_pages(total_decomp_size, stream);
// Dispatch batches of pages to decompress for each codec.
// Buffer needs to be padded.
rmm::device_buffer decomp_pages(
All changes for buffers in this file are due to gpuDecodePageData.
/merge
…kernel (#13643)

Fixes memcheck regression in ORC reader uncompressed logic. The regression was introduced in #13396. The original fix is copied from #13586. The error was caught by the nightly build here: https://github.com/rapidsai/cudf/actions/runs/5409203449/jobs/9829059618

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #13643
After rmm removed memory padding (rapidsai/rmm#1278), some cuIO code started to have out-of-bounds access issues because many of its compute kernels shift input pointers back and forth to satisfy alignment requirements. This adds back padding to various memory buffers so the buffers now have enough extra space for such shifting.
With this fix, the reported issues (#13567, #13571, #13570) no longer show up.
Closes: