GH-39858: [C++][Device] Add Copy/View slice functions to a CPU pointer#41477
GH-39858: [C++][Device] Add Copy/View slice functions to a CPU pointer#41477pitrou merged 9 commits intoapache:mainfrom
Conversation
|
Hi @zeroshade I've had a go at adding these MemoryManager methods, just a few q's:
Thanks! |
|
When I wrote
I was wishing for an API that didn't always have to allocate A combinatorial explosion of functions to support on every We just need copying as it's supposed to be used for small slices. Most often operations should be based on whole buffers because processing item by item in a columnar/vectorized system is not the way to go. The motivation for this API was the loading of a single offset (the last one) from a list to determine its inner values array length. |
|
@alanstoate what do you think of @felipecrv's feedback and refactoring a bit based on his suggestions? |
|
@zeroshade @felipecrv thanks for the feedback and happy to refactor a bit So if I'm understanding correctly we want a |
|
That's my understanding. @felipecrv that sound about right to you? |
|
Yes. You would be giving the length and CPU memory pointer to the It is a virtual function so it can have the code that @zeroshade wrote to copy those 4 or 8 bytes as the default implementation, but the CPU memory manager can have a much simpler one using |
|
@felipecrv I've had another go with just the Not too sure about the interface especially copying into a |
cpp/src/arrow/c/bridge.cc
Outdated
| OffsetType last_offset = 0; | ||
| RETURN_NOT_OK( | ||
| data_->buffers[offsets_buffer_id]->memory_manager()->CopyBufferSliceFrom( | ||
| data_->buffers[offsets_buffer_id], &last_offset, | ||
| c_struct_->length * sizeof(OffsetType), sizeof(OffsetType))); |
There was a problem hiding this comment.
You shouldn't need the if (device_type_ == DeviceAllocationType::kCPU) { case in the beginning of this function if CopyBufferSliceFrom is as efficient as a memcpy.
cpp/src/arrow/device.cc
Outdated
| buf->memory_manager()->device()->ToString(), | ||
| " to memory not supported"); | ||
| } | ||
|
|
There was a problem hiding this comment.
This class is complicated, so I had to try implementing this to understand the code.
One of the things I've learned: the -From suffix means your function takes from MemoryManager parameter that matches buf->memory_manager(). That doesn't apply here.
I also realized that the static function is enough, no need to have the virtual version — the fact that we are always copying to CPU memory means we don't have to virtual-dispatch another call and we only have to virtual-dispatch to buf->memory_manager().
So here is what I drafted (I didn't test it):
Status MemoryManager::CopyBufferSlice(const std::shared_ptr<Buffer>& buf, int64_t offset,
int64_t length, uint8_t* out_data) {
if (ARROW_PREDICT_TRUE(buf->is_cpu())) {
memcpy(out_data, buf->data() + offset, static_cast<size_t>(length));
return Status::OK();
}
auto& from = buf->memory_manager();
auto cpu_mm = default_cpu_memory_manager();
// Try a view first
auto maybe_buffer_result = from->ViewBufferTo(buf, cpu_mm);
if (!COPY_BUFFER_SUCCESS(maybe_buffer_result)) {
// View failed, try a copy instead
maybe_buffer_result = from->CopyBufferTo(buf, cpu_mm);
}
ARROW_ASSIGN_OR_RAISE(auto maybe_buffer, std::move(maybe_buffer_result));
if (maybe_buffer != nullptr) {
memcpy(out_data, maybe_buffer->data() + offset, static_cast<size_t>(length));
return Status::OK();
}
return Status::NotImplemented("Copying buffer slice from ", from->device()->ToString(),
" to CPU not supported");
}Please define and declare it after ViewBuffer to not split that related group of functions.
|
I don't really understand what the purpose of this PR is. It claims that slicing a Buffer is overkill, but I'm a bit skeptical: are there really situations where the overhead would be unacceptable? Especially, the |
|
The goal is to have a definitive function that can copy bytes from any device buffer to CPU by delegating to if (device_type_ == DeviceAllocationType::kCPU) {
auto offsets = data_->GetValues<OffsetType>(offsets_buffer_id);
// Compute visible size of buffer
int64_t buffer_size =
(c_struct_->length > 0) ? byte_width * offsets[c_struct_->length] : 0;
return ImportBuffer(buffer_id, buffer_size);
}
// we only need the value of the last offset so let's just copy that
// one value from device to host.
auto single_value_buf =
SliceBuffer(data_->buffers[offsets_buffer_id],
c_struct_->length * sizeof(OffsetType), sizeof(OffsetType));
ARROW_ASSIGN_OR_RAISE(
auto cpubuf, Buffer::ViewOrCopy(single_value_buf, default_cpu_memory_manager()));
auto offsets = cpubuf->data_as<OffsetType>();
// Compute visible size of buffer
int64_t buffer_size = (c_struct_->length > 0) ? byte_width * offsets[0] : 0; |
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
|
@alanstoate don't forget to update the PR description to reflect the final state of the code. |
cpp/src/arrow/device.h
Outdated
| const std::shared_ptr<Buffer>& source, const std::shared_ptr<MemoryManager>& to); | ||
|
|
||
| /// \brief Copy a slice of a buffer into a CPU pointer | ||
| static Status CopyBufferSlice(const std::shared_ptr<Buffer>& buf, int64_t offset, |
There was a problem hiding this comment.
Can we name this CopyBufferSliceToCPU or similar? To make it clear that, unlike other methods here, this isn't taking a target memory manager.
|
@github-actions crossbow submit test-cuda* |
|
Revision: 8293e79 Submitted crossbow builds: ursacomputing/crossbow @ actions-aae53c830d
|
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ecd769c. There were 15 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 202 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…pointer (apache#41477) ### Rationale for this change Currently ```MemoryManager``` objects define functionality to Copy or View entire buffers. Occasionally there is the need to only copy a single value or slice from a buffer to a piece of CPU memory (see apache#39770 (comment)). It's overkill to do a bunch of whole Buffer operations and manually slicing just to copy 4 or 8 bytes. ### What changes are included in this PR? Add the ```MemoryManager::CopyBufferSliceToCPU``` function, which initially attempts to use memcpy for the specified slice. If this is not possible, it defaults to copying the entire buffer and then viewing/copying the slice. Update ```ArrayImporter::ImportStringValuesBuffer``` to use this function. ### Are these changes tested? ```ArrayImporter::ImportStringValuesBuffer``` is tested as a part of ```arrow-c-bridge-test``` * GitHub Issue: apache#39858 Lead-authored-by: Alan Stoate <alan.stoate@gmail.com> Co-authored-by: Mac Lilly <maclilly45@gmail.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Currently
MemoryManagerobjects define functionality to Copy or View entire buffers. Occasionally there is the need to only copy a single value or slice from a buffer to a piece of CPU memory (see #39770 (comment)). It's overkill to do a bunch of whole Buffer operations and manually slicing just to copy 4 or 8 bytes.What changes are included in this PR?
Add the
MemoryManager::CopyBufferSliceToCPUfunction, which initially attempts to use memcpy for the specified slice. If this is not possible, it defaults to copying the entire buffer and then viewing/copying the slice.Update
ArrayImporter::ImportStringValuesBufferto use this function.Are these changes tested?
ArrayImporter::ImportStringValuesBufferis tested as a part ofarrow-c-bridge-test