Fix memory issues in cuIO due to removal of memory padding#13586

Merged
rapids-bot[bot] merged 19 commits into rapidsai:branch-23.08 from ttnghia:fix_io
Jun 23, 2023

Conversation

@ttnghia
Contributor

@ttnghia ttnghia commented Jun 16, 2023

After rmm removed memory padding (rapidsai/rmm#1278), some cuIO code started to exhibit out-of-bounds access issues, because many of its compute kernels shift input pointers back and forth to satisfy alignment requirements.

This PR adds padding back to various memory buffers so that they have enough extra space for such shifting.

With this fix, the reported issues (#13567, #13571, #13570) no longer show up.

Closes:

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Jun 16, 2023
@ttnghia ttnghia requested a review from a team as a code owner June 16, 2023 00:51
@ttnghia ttnghia self-assigned this Jun 16, 2023
@ttnghia ttnghia requested review from bdice, harrism and vuule June 16, 2023 00:51
@ttnghia
Contributor Author

ttnghia commented Jun 16, 2023

@vuule Please also check the entire codebase more carefully. I tried to find as many related buffers as possible, but I may still have missed some.

@ttnghia
Contributor Author

ttnghia commented Jun 16, 2023

I just realized that there is a duplicate fix for Parquet (#13585), which takes a different approach: always adding 8 bytes of padding to the buffers, instead of rounding the size up (align_up) to the nearest multiple of 8.

@etseidl
Contributor

etseidl commented Jun 16, 2023

I just realized that there is a duplicate fix for Parquet (#13585), which takes a different approach: always adding 8 bytes of padding to the buffers, instead of rounding the size up (align_up) to the nearest multiple of 8.

The approach here is superior, so I closed #13585.

* @return The aligned value
*/
inline std::size_t align_up(std::size_t value, std::size_t alignment)
{
Contributor

@karthikeyann karthikeyann Jun 19, 2023

nit:
If alignment is going to be a compile-time constant, it could be a template parameter, and a static_assert could enforce that it is a power of 2 at compile time.
If not, an assert checking for a power of 2 would be helpful here for debugging purposes.

Member

@harrism harrism left a comment

Most of the use cases here construct a device_buffer with padding (not alignment, as the comments say). I suggest creating a factory function make_padded_device_buffer() and replacing calls to

rmm::device_buffer buf(cudf::detail::align_up(size, 8 /* alignment */), stream);

with

auto buf = make_padded_device_buffer(size, padding, stream);

You may alternatively want to create a globally available memory_resource adaptor, padding_adaptor, which wraps the default memory resource and which make_padded_device_buffer uses by default (so that the padding argument can be optional).


rmm::device_buffer decomp_block_data(uncomp_size, stream);
// Buffer needs to be padded as the compute kernels require aligned data pointers.
rmm::device_buffer decomp_block_data(cudf::detail::align_up(uncomp_size, 8 /*alignment*/),
Member

This and all the other device_buffer use cases are padding out the allocation, not changing its alignment. This might lead to confusion. See my main review comment.

Contributor

How did we know that 8 is the right amount of padding? Also, can we use a constant for this rather than hardcoding the value 8 many times?

Contributor Author

I believe @vuule has some idea of why cuIO kernels require 8-byte-aligned pointers?

Contributor

No idea. For me there's a gap between the reported issue and this solution: I don't see why these buffers need the padding, or why this much padding. I don't see this information in the PR.

@ttnghia
Contributor Author

ttnghia commented Jun 21, 2023

@harrism I finished implementing padding_memory_resource_adaptor and the make_padded_device_buffer factory function returning rmm::device_buffer. However, something came to mind: the output rmm::device_buffer can be returned from some utility function and then resized by the caller. In such cases, we have little idea whether that buffer is already padded, and it won't automatically have its size padded during resize.

I think the better solution is to say "padded" explicitly in the buffer type. That is, we should have a new buffer type like cudf::padded_device_buffer that can be constructed similarly to rmm::device_buffer but has its own private padding mechanism. When using it, we can directly infer from its type that any of its memory allocations will involve padding.

@wence-
Contributor

wence- commented Jun 21, 2023

I think the better solution is to say "padded" explicitly in the buffer type. That is, we should have a new buffer type like cudf::padded_device_buffer that can be constructed similarly to rmm::device_buffer but has its own private padding mechanism. When using it, we can directly infer from its type that any of its memory allocations will involve padding.

I like this idea, because it allows users of buffers to advertise the padding (and/or alignment) that they require. (Aside: this "newtype" idiom is very common in the Haskell and Rust worlds).

I imagine that a nice way to do it is to have a padded_device_buffer, templated on the required padding, that inherits from rmm::device_buffer and just wraps the provided memory resource in one that applies padding. The only thing templating would prevent, I think, is passing something with padding of 8 (say) to a function that expects padding of 2; that would be safe, but the types would disallow it.

* @param alignment The amount of bytes to align, must be a power of 2
* @return The aligned value
*/
inline std::size_t align_up(std::size_t value, std::size_t alignment)
Contributor

Is this the same as round_up_safe?

constexpr S round_up_safe(S number_to_round, S modulus)

Contributor

I think this function could be more useful if it took the number of alignment bits. That removes the power-of-2 requirement and differentiates it from round_up_safe.

Contributor Author

@ttnghia ttnghia Jun 21, 2023

Sorry, I'm confused. What is the conclusion here? Should round_up_safe be used to fix the bugs instead of adding align_up?

Contributor

I would prefer just using round_up_safe here, with a constexpr named variable for the round-up value instead of the hardcoded 8.

Contributor Author

Indeed, round_up_safe in this use case produces exactly the same output as align_up. So I'm going to switch to using it, as suggested by @davidwendt, to get the bugs fixed ASAP. padded_device_buffer will be the follow-up work.

Member

Could we have also used RMM's function?

Contributor Author

Finished adopting round_up_safe, but I'm still struggling to find a good place to put the constexpr that @davidwendt recommended.

Contributor

The changes seem to correspond to temporary compression/decompression buffers. Perhaps gpuinflate.hpp?
I'm OK with defining it in each .cu file too if there is a chance the padding may differ between the implementations.

Contributor Author

Could we have also used RMM's function?

I think we can. However, that is a detail function (rmm::detail::align_up), so I was reluctant to use it.

@davidwendt
Contributor

I would like to consider the device-buffer-padding work in a follow-on PR.
I think the current implementation is fine, along with Bradley's comments, to get the nightly builds working again.

@ttnghia ttnghia requested review from bdice and davidwendt June 22, 2023 02:24
Contributor

@bdice bdice left a comment

A few small comments. I am happy with this design overall; feel free to address them as you see fit.

ttnghia and others added 2 commits June 22, 2023 08:51
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Contributor

@vuule vuule left a comment

Left some notes to help track where the padding requirements originate. It looks like there are two sources: inflate_kernel and nvcomp snappy.
The nvcomp one is surprising; we should follow up on it with the nvcomp team.

Comment on lines +208 to +210
// Buffer needs to be padded.
rmm::device_buffer decomp_block_data(
cudf::util::round_up_safe(uncomp_size, BUFFER_PADDING_MULTIPLE), stream);
Contributor

padding because of inflate_kernel (this is the output of this kernel)

Comment on lines +313 to +317
// Buffer needs to be padded.
rmm::device_buffer scratch(cudf::util::round_up_safe(temp_size, BUFFER_PADDING_MULTIPLE),
stream);
rmm::device_buffer decomp_block_data(
cudf::util::round_up_safe(uncompressed_data_size, BUFFER_PADDING_MULTIPLE), stream);
Contributor

This one is odd; there should be no padding requirement from nvcomp. Have you verified that padding is required in the snappy case?

Contributor Author

Why not? It is the input to nvcompBatchedSnappyDecompressAsync.
That said, we don't have any unit tests for Avro in libcudf, so I can't verify it.

Contributor

Just checked: there are no alignment/padding requirements from nvcomp here. IMO this part of the change can be reverted.

Are cuDF tests useless in this context?

Contributor Author

@ttnghia ttnghia Jun 22, 2023

Then I'll revert this.

Are cuDF tests useless in this context?

There are zero C++ tests for Avro.

Contributor

I was referring to Python tests; we have some limited coverage there.

Contributor Author

We have avro_reader_test in Python, but I don't know how to check it with compute-sanitizer.

Contributor Author

Update: I ran compute-sanitizer on test_avro_reader_fastavro_integration.py and didn't see any issue 👍

Comment on lines +531 to +534
// Buffer needs to be padded.
block_data.resize(cudf::util::round_up_safe(block_data.size(), BUFFER_PADDING_MULTIPLE),
stream);

Contributor

padding because of inflate_kernel? (this is an input of this kernel)

Comment on lines +1073 to +1074
stripe_data.emplace_back(
cudf::util::round_up_safe(total_data_size, BUFFER_PADDING_MULTIPLE), stream);
Contributor

Based on the failing test, this buffer is the input to the nvcomp snappy decompression.

Contributor

Assuming the issue has nothing to do with snappy, maybe this padding is needed in the uncompressed case. @ttnghia do you know which test triggered this one?

Contributor Author

Reverting this will cause a memory access issue at:

 at 0x430 in ...cpp/src/io/comp/gpuinflate.cu:1197:cudf::io::copy_uncompressed_kernel

Contributor

Based on the repro (without combing through copy_uncompressed_kernel), my understanding is that it requires the input buffers to be padded. I would suggest adding this to the places where buffers are padded for this reason, and also adding a note to the gpu_copy_uncompressed_blocks docs. I really want the changes in this PR to be traceable to the root(ish) cause.

Contributor Author

Done.

Comment on lines +322 to +324
// Buffer needs to be padded.
rmm::device_buffer decomp_data(
cudf::util::round_up_safe(total_decomp_size, BUFFER_PADDING_MULTIPLE), stream);
Contributor

Output of the (presumed) nvcomp snappy decompression.

Contributor Author

Issue from gpuDecodeOrcColumnData.

Contributor

This one is about inflate_kernel.

Contributor

@vyasr vyasr left a comment

The fixes look correct to me. I would also like to discuss the adaptor/device_buffer changes in a separate follow-up PR.

rmm::device_buffer decomp_pages(total_decomp_size, stream);
// Dispatch batches of pages to decompress for each codec.
// Buffer needs to be padded.
rmm::device_buffer decomp_pages(
Contributor Author

@ttnghia ttnghia Jun 23, 2023

All changes for buffers in this file are due to gpuDecodePageData.

@ttnghia
Contributor Author

ttnghia commented Jun 23, 2023

/merge

@rapids-bot rapids-bot bot merged commit 0b4e354 into rapidsai:branch-23.08 Jun 23, 2023
@ttnghia ttnghia deleted the fix_io branch June 23, 2023 22:25
rapids-bot bot pushed a commit that referenced this pull request Jun 29, 2023
…kernel (#13643)

Fixes a memcheck regression in the ORC reader's uncompressed logic. The regression was introduced in #13396.
The original fix is copied from #13586 

The error was caught by the nightly build here: https://github.com/rapidsai/cudf/actions/runs/5409203449/jobs/9829059618

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #13643
