Skip to content

GH-39858: [C++][Device] Add Copy/View slice functions to a CPU pointer#41477

Merged
pitrou merged 9 commits intoapache:mainfrom
alanstoate:add-memory-manager-slice-fns
May 23, 2024
Merged

GH-39858: [C++][Device] Add Copy/View slice functions to a CPU pointer#41477
pitrou merged 9 commits intoapache:mainfrom
alanstoate:add-memory-manager-slice-fns

Conversation

@alanstoate
Copy link
Copy Markdown
Contributor

@alanstoate alanstoate commented Apr 30, 2024

Rationale for this change

Currently MemoryManager objects define functionality to Copy or View entire buffers. Occasionally there is the need to only copy a single value or slice from a buffer to a piece of CPU memory (see #39770 (comment)). It's overkill to do a bunch of whole Buffer operations and manually slicing just to copy 4 or 8 bytes.

What changes are included in this PR?

Add the MemoryManager::CopyBufferSliceToCPU function, which initially attempts to use memcpy for the specified slice. If this is not possible, it defaults to copying the entire buffer and then viewing/copying the slice.

Update ArrayImporter::ImportStringValuesBuffer to use this function.

Are these changes tested?

ArrayImporter::ImportStringValuesBuffer is tested as a part of arrow-c-bridge-test

@alanstoate
Copy link
Copy Markdown
Contributor Author

alanstoate commented Apr 30, 2024

Hi @zeroshade I've had a go at adding these MemoryManager methods, just a few q's:

  • I've added all the copy/view methods in MemoryManager as slice versions I'm not sure if this is more than requested?
  • Should the existing copy/view methods in CPUMemoryManager call these slice versions with offset = 0 & length = buf.size() to reuse some code?
  • I should probably add a test for all of these, just wondering where that should live?

Thanks!

@felipecrv
Copy link
Copy Markdown
Contributor

felipecrv commented May 2, 2024

When I wrote

Perhaps copying a slice of an address() should be a MemoryManager operation.

I was wishing for an API that didn't always have to allocate unique_ptr or shared_ptrs of anything to copy 4 or 8 bytes from a CPU/non-CPU Buffer to a regular CPU memory pointer. It would lower to a memcpy if both source and destination are CPU memory managers.

A combinatorial explosion of functions to support on every MemoryManager is not desirable.

CopyBufferSliceFrom
CopyBufferSliceTo
CopyNonOwnedSliceFrom
CopyNonOwnedSliceTo
ViewBufferSliceFrom
ViewBufferSliceTo

We just need copying as it's supposed to be used for small slices. Most often operations should be based on whole buffers because processing item by item in a columnar/vectorized system is not the way to go.

The motivation for this API was the loading of a single offset (the last one) from a list to determine its inner values array length.

@zeroshade
Copy link
Copy Markdown
Member

@alanstoate what do you think of @felipecrv's feedback and refactoring a bit based on his suggestions?

@alanstoate
Copy link
Copy Markdown
Contributor Author

@zeroshade @felipecrv thanks for the feedback and happy to refactor a bit

So if I'm understanding correctly we want a CopyBufferSlice that copies from a buffer using the MemoryManager into a raw pointer?

@zeroshade
Copy link
Copy Markdown
Member

That's my understanding. @felipecrv that sound about right to you?

@felipecrv
Copy link
Copy Markdown
Contributor

Yes. You would be giving the length and CPU memory pointer to the MemoryManager and it would decide the best way to copy (always copy) the data.

It is a virtual function so it can have the code that @zeroshade wrote to copy those 4 or 8 bytes as the default implementation, but the CPU memory manager can have a much simpler one using memcpy.

@alanstoate
Copy link
Copy Markdown
Contributor Author

alanstoate commented May 9, 2024

@felipecrv I've had another go with just the CopyBufferSliceFrom function

Not too sure about the interface especially copying into a void*, but I'm not really sure how else to copy 1 or more OffsetType (or another buffer item type)

Comment on lines +1879 to +1883
OffsetType last_offset = 0;
RETURN_NOT_OK(
data_->buffers[offsets_buffer_id]->memory_manager()->CopyBufferSliceFrom(
data_->buffers[offsets_buffer_id], &last_offset,
c_struct_->length * sizeof(OffsetType), sizeof(OffsetType)));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need the if (device_type_ == DeviceAllocationType::kCPU) { case in the beginning of this function if CopyBufferSliceFrom is as efficient as a memcpy.

buf->memory_manager()->device()->ToString(),
" to memory not supported");
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is complicated, so I had to try implementing this to understand the code.

One of the things I've learned: the -From suffix means your function takes from MemoryManager parameter that matches buf->memory_manager(). That doesn't apply here.

I also realized that the static function is enough, no need to have the virtual version — the fact that we are always copying to CPU memory means we don't have to virtual-dispatch another call and we only have to virtual-dispatch to buf->memory_manager().

So here is what I drafted (I didn't test it):

Status MemoryManager::CopyBufferSlice(const std::shared_ptr<Buffer>& buf, int64_t offset,
                                      int64_t length, uint8_t* out_data) {
  if (ARROW_PREDICT_TRUE(buf->is_cpu())) {
    memcpy(out_data, buf->data() + offset, static_cast<size_t>(length));
    return Status::OK();
  }

  auto& from = buf->memory_manager();
  auto cpu_mm = default_cpu_memory_manager();
  // Try a view first
  auto maybe_buffer_result = from->ViewBufferTo(buf, cpu_mm);
  if (!COPY_BUFFER_SUCCESS(maybe_buffer_result)) {
    // View failed, try a copy instead
    maybe_buffer_result = from->CopyBufferTo(buf, cpu_mm);
  }
  ARROW_ASSIGN_OR_RAISE(auto maybe_buffer, std::move(maybe_buffer_result));
  if (maybe_buffer != nullptr) {
    memcpy(out_data, maybe_buffer->data() + offset, static_cast<size_t>(length));
    return Status::OK();
  }

  return Status::NotImplemented("Copying buffer slice from ", from->device()->ToString(),
                                " to CPU not supported");
}

Please define and declare it after ViewBuffer to not split that related group of functions.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 10, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 10, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting change review Awaiting change review labels May 14, 2024
@github-actions github-actions bot removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 15, 2024
@github-actions github-actions bot added the awaiting review Awaiting review label May 15, 2024
@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 16, 2024

I don't really understand what the purpose of this PR is. It claims that slicing a Buffer is overkill, but I'm a bit skeptical: are there really situations where the overhead would be unacceptable?

Especially, the CopyBufferSlice implementation is always first calling ViewBufferTo, then falls back to CopyBufferTo (both of which instantiate a std::shared_ptr<Buffer>), then does a memcpy always. If the data length is small, this is still considerable overhead. If the data length is large, then we're doing a spurious copy compared to ViewOrCopy...

@felipecrv
Copy link
Copy Markdown
Contributor

felipecrv commented May 17, 2024

The goal is to have a definitive function that can copy bytes from any device buffer to CPU by delegating to MemoryManager who should know the best way to do it. Having to write that code that @zeroshade wrote in bridge.cc to copy 4 or 8 bytes from a Buffer was very annoying. Without getting into the details of what is fast and what is slow, I think there should be a helper for doing this:

    if (device_type_ == DeviceAllocationType::kCPU) {
      auto offsets = data_->GetValues<OffsetType>(offsets_buffer_id);
      // Compute visible size of buffer
      int64_t buffer_size =
          (c_struct_->length > 0) ? byte_width * offsets[c_struct_->length] : 0;
      return ImportBuffer(buffer_id, buffer_size);
    }

    // we only need the value of the last offset so let's just copy that
    // one value from device to host.
    auto single_value_buf =
        SliceBuffer(data_->buffers[offsets_buffer_id],
                    c_struct_->length * sizeof(OffsetType), sizeof(OffsetType));
    ARROW_ASSIGN_OR_RAISE(
        auto cpubuf, Buffer::ViewOrCopy(single_value_buf, default_cpu_memory_manager()));
    auto offsets = cpubuf->data_as<OffsetType>();
    // Compute visible size of buffer
    int64_t buffer_size = (c_struct_->length > 0) ? byte_width * offsets[0] : 0;

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 17, 2024
alanstoate and others added 2 commits May 17, 2024 20:11
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 17, 2024
Copy link
Copy Markdown
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I will let @pitrou give his final approval and merge.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 17, 2024
@felipecrv
Copy link
Copy Markdown
Contributor

@alanstoate don't forget to update the PR description to reflect the final state of the code.

const std::shared_ptr<Buffer>& source, const std::shared_ptr<MemoryManager>& to);

/// \brief Copy a slice of a buffer into a CPU pointer
static Status CopyBufferSlice(const std::shared_ptr<Buffer>& buf, int64_t offset,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name this CopyBufferSliceToCPU or similar? To make it clear that, unlike other methods here, this isn't taking a target memory manager.

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @alanstoate

@pitrou pitrou changed the title GH-39858: [C++][Device] Add Copy/View slice functions to MemoryManager GH-39858: [C++][Device] Add Copy/View slice functions to a CPU pointer May 23, 2024
@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 23, 2024

@github-actions crossbow submit test-cuda*

@github-actions
Copy link
Copy Markdown

Revision: 8293e79

Submitted crossbow builds: ursacomputing/crossbow @ actions-aae53c830d

Task Status
test-cuda-cpp GitHub Actions
test-cuda-python GitHub Actions

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ecd769c.

There were 15 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 202 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…pointer (apache#41477)

### Rationale for this change
Currently ```MemoryManager``` objects define functionality to Copy or View entire buffers. Occasionally there is the need to only copy a single value or slice from a buffer to a piece of CPU memory (see apache#39770 (comment)). It's overkill to do a bunch of whole Buffer operations and manually slicing just to copy 4 or 8 bytes.

### What changes are included in this PR?
Add the ```MemoryManager::CopyBufferSliceToCPU``` function, which initially attempts to use memcpy for the specified slice. If this is not possible, it defaults to copying the entire buffer and then viewing/copying the slice.

Update ```ArrayImporter::ImportStringValuesBuffer``` to use this function.

### Are these changes tested?

```ArrayImporter::ImportStringValuesBuffer```  is tested as a part of  ```arrow-c-bridge-test```
* GitHub Issue: apache#39858

Lead-authored-by: Alan Stoate <alan.stoate@gmail.com>
Co-authored-by: Mac Lilly <maclilly45@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants