
Are pitched allocations supported?

Open bernhardmgruber opened this issue 4 years ago • 6 comments

I spent some time playing with mdspan as an abstraction to access multidimensional CPU/GPU buffers in alpaka. You can think of these buffers as similar to Kokkos::Views. Now, buffers of more than one dimension may employ padding between rows to align the start of each row to a hardware-specific multiple. This is analogous to e.g. cudaMallocPitch in CUDA. I encountered several problems with mdspan in this regard:

The protocol between layout and access policy seems to deal in 1D element indices, not byte offsets, which are, however, needed when mapping an ND index into a 1D memory location with arbitrary gaps. The layout returns a 1D element index, and the access policy accepts a 1D element index. P0009r14 seems a bit vague on that in the definition of layout mappings, but m.required_span_size(), which is used to define the valid range of values returned from operator(), specifies: "Returns: If Extents::rank() > 0 is true, the product of extents().extent(r) for all r in the range [0, Extents::rank() ). Otherwise, 1." To me that sounds like you are not allowed to have empty space via unused indices in the span of memory underlying an mdspan. In the access policy, however, it reads very clearly that the 1D index is an element index and not a byte index: "access, a method to access the i-th element in the range of elements represented by a pointer;".

I could hack around that by defining a custom layout policy which stores the element size and row pitch and returns a 1D byte index instead of an element index. That in turn required a custom accessor that takes a byte offset in its access and offset methods. This worked in principle, but I am sure that it is neither intended that way nor would it work with submdspans/slicing. Furthermore, layout and access policy are now tightly coupled and cannot be combined with other accessor or layout policies. A different strategy could be to define an accessor that divides the 1D index back into a row/col index and multiplies by the row pitch again. However, that would cost a division on each mdspan access. I have not tried this yet, but it seems suboptimal to me as well.

This exploration raised a few questions:

  • Does mdspan support pitched allocations at all? Maybe there is a way that I have not found yet.
  • Is the design of mdspan per P0009r14 incomplete and missing an important use case?
  • If not: why does mdspan not support pitched allocations? They obviously exist, notably in CUDA, and I am certain that NVIDIA had a look at mdspan's design. Pitched allocations also exist for CPUs (e.g. in-memory image or matrix formats). Is it because:
    • pitched allocations are a narrow and complicated use case and not worth the effort?
    • mdspan is already extensible to support them in the future?
    • pitched allocations are no longer relevant on today's and future hardware? (I tried to find benchmarks on this and was surprised that I could not find any concrete numbers, e.g. for CUDA's cudaMalloc vs. cudaMallocPitch. People claim the latter is faster, but never say by how much.)

I am very much looking forward to your response! Thank you!

bernhardmgruber avatar Dec 25 '21 16:12 bernhardmgruber

The protocol between layout and access policy seems to deal in 1D element indices, not byte offsets,

That's exactly right. The layout mapping maps a multidimensional index to a 1-D element offset. The accessor, in turn, maps the offset to a reference. default_accessor does this just as std::span would. However, a generic accessor and pointer need not even point to memory. There might not be any "bytes" there.

... m.required_span_size(), which is used to define the valid range of values returned from operator(), specifies: "Returns: If Extents::rank() > 0 is true, the product of extents().extent(r) for all r in the range [0, Extents::rank() ). Otherwise, 1.", which to me sounds like you are not allowed to have empty space via unused indices in the span of memory underlying an mdspan.

It looks like you're quoting the definition of required_span_size() for layout_left and layout_right. Those layouts are always contiguous. However, you'll notice that layout_stride has a different definition of required_span_size(). That definition certainly permits what you are calling "pitched allocations."

Furthermore, generic layouts need not be contiguous (i.e., can be "pitched"). The "Layout mapping policy and layout mapping requirements" table in P0009R14 defines m.required_span_size() as follows: "If the multidimensional index space that e (the extents object) defines is empty, then zero, else 1 plus the maximum value of m(i...) for all i... in e."

mhoemmen avatar Dec 27 '21 05:12 mhoemmen

One more thing: there are very likely more layouts we want to eventually propose for the standard. The BLAS proposal P1673, for example, proposes the common layouts used in BLAS (which have a single free stride parameter, but are not arbitrarily strided in both dimensions of a matrix). However, for the initial proposal we need to keep it to a manageable subset of what we need. Hence we proposed the C layout (layout_right), the Fortran layout (layout_left), and the generic strided layout. The latter isn't fully optimal for simple pitched allocations (since you would still want a compile-time stride-one index in one dimension), but it is required for things like submdspan anyway, and it is the most general expression of regularly strided arrays.

There are other things we should consider proposing, such as a "broadcast" layout, which is useful for implementing certain tensor operations, for example. But all of those can be done later in separate papers.

crtrott avatar Jan 04 '22 19:01 crtrott

It looks like you're quoting the definition of required_span_size() for layout_left and layout_right. Those layouts are always contiguous. However, you'll notice that layout_stride has a different definition of required_span_size(). That definition certainly permits what you are calling "pitched allocations."

Thanks for pointing this out! layout_left and layout_right are indeed contiguous by definition in P0009R14. Looking at the more general requirements for m(i...) and m.required_span_size(), I am slightly confused. m(i...) claims to return "A value in the range of [0, required_span_size() )", while m.required_span_size() claims to return "... 1 plus the maximum value of m(i...) for all i... in e.". That is a circular definition in my opinion, and it looks like there are no requirements on what either function can return in general; each just needs to be compatible with the other. There is also no mention of the semantic meaning of the return value of m(i...); it is not stated that these are 1D element indices. They could be hashes that are resolved through a hash table in the accessor.

So far so good. That means there is nothing forbidding the layout mapping from returning byte-wise indices, but the accessor needs to be aware of it. So essentially there would be two worlds of mappings/accessors, element-wise and byte-wise ones, and they cannot be mixed.

Thank you also for the further pointers, @crtrott. P1673 proposes layout_blas_general and layout_blas_packed, which I was unaware of. Still, it seems like these layouts also deal in 1D element indices. The LDA etc. parameters in BLAS also seem to be in element counts.

To give a concrete example, let's say my element type is this struct of 12 bytes:

struct RGB {
    float r, g, b;
};

And now I want to create a 5x5 image where each row is aligned to 64 bytes (diagram omitted: each row holds five 12-byte elements, i.e. 60 bytes, followed by 4 bytes of padding):

The layout mappings I have seen so far can only produce indices from 0 to 5x5-1. There is no way for the layout mapping to describe a gap that is not a multiple of the element size, so the layout mapping needs to step down to something smaller, like byte offsets, and the accessor needs to recognize that. To me this seems conformant with P0009R14, but it is still some kind of hidden assumption. Unfortunately, the return type of m(i...) is fixed to extents::size_type, which is size_t, so having m(i...) return a "special type" à la std::align_val_t to mark the return value as a byte offset rather than an element offset is not a possible solution.

Anyway, are there any plans to propose/support such kind of layouts and e.g. error out if such a layout is combined with an accessor dealing in a different quantity? Or allow to define an accessor that can accept 1D element indices as well as byte offsets? Thanks a lot!

bernhardmgruber avatar Jan 10 '22 18:01 bernhardmgruber

I see what you are talking about now. In some sense the issue here is not the layout but the accessor. No matter whether this is 1D, 2D, or 5D, I think it looks like you want to have 5 consecutive 12-byte elements aligned to 64 bytes, which has nothing to do with the mapping of logical indices to a 1D offset. In fact the problem is that you can't represent this as indexing into an RGB*.

So what I would do to support that is have an accessor which is parametrized on how many consecutive elements to align to how many bytes, which is a fully generic thing independent of the rank of the mdspan.

Thus your accessor could do something like this:

template<class T, int N, int ALIGN>
struct batched_aligned_accessor {
  using pointer = char*;
  using reference = T&;
  reference access(pointer p, size_t idx) const {
     size_t batch_id = idx / N;  // which batch of N elements
     size_t id = idx % N;        // position within the batch
     pointer p_element = p + batch_id * ALIGN + id * sizeof(T);
     return *reinterpret_cast<T*>(p_element);
  }
  ...
};

Does that make sense? Note that the amount of memory you need to allocate is now not just required_span_size() * sizeof(T)!

crtrott avatar Jan 11 '22 20:01 crtrott

In some sense the issue here is not the layout but the accessor. No matter whether this is 1D, 2D, or 5D, I think it looks like you want to have 5 consecutive 12-byte elements aligned to 64 bytes, which has nothing to do with the mapping of logical indices to a 1D offset. In fact the problem is that you can't represent this as indexing into an RGB*.

Hmm, you are right here. The reason why I would want to put this logic into the layout mapping is that the division/modulo and re-multiplication in the accessor are more expensive than multiplying by the right values in the layout mapping in the first place.

And you are correct that the behavior I am requesting would not work on a T* but only on byte pointers like char* or std::byte*.

So what I would do to support that is say have an accessor which is parametrized on how many consecutive elements to align to how much bytes, which is a fully generic thing independent of the rank of the mdspan.

Maybe this is really the direction I have to go, and measure what the impact of the division is.

Does that make sense? Note that how much memory you need to allocate is now not just required_span_size() * sizeof(T) !

Thanks for pointing that out! It seems then I cannot implement my feature inside the accessor in a conforming way, since required_span_size is defined via the layout mapping. This in turn means I cannot use std::mdspan in conjunction with e.g. cudaMallocPitch.

Thank you for your time!

bernhardmgruber avatar Jan 17 '22 14:01 bernhardmgruber

This in turn means I cannot use std::mdspan in conjunction with e.g. cudaMallocPitch.

@crtrott can you confirm whether this is indeed impossible?

fwyzard avatar Feb 22 '22 13:02 fwyzard

@fwyzard In general, allocations from cudaMallocPitch should work just fine with mdspan. You can use layout_stride or a custom layout (such as layout_right_padded in this PR) to represent the extra row padding that cudaMallocPitch might return.

The issue here is that the original poster wants five consecutive 12-byte structs to align to 64 bytes. An mdspan layout alone cannot represent this data structure. This is because a layout maps a multidimensional element index to a one-dimensional element (not byte) offset. However, as the above discussion points out, a custom accessor could give the desired behavior. While the required memory would no longer be required_span_size() * sizeof(element_type) as with default_accessor<element_type>, this is fine, because custom accessors can be so general that they might not even require a memory allocation.

mhoemmen avatar Sep 06 '22 21:09 mhoemmen

hi @mhoemmen,

In general, allocations from cudaMallocPitch should work just fine with mdspan. ... The issue here is that the original poster wants five consecutive 12-byte structs to align to 64 bytes

I think cudaMallocPitch actually has the exact same issue: rows are aligned to a value (e.g. 512 bytes) that may not be a multiple of the element size (e.g. 12 bytes).

Here is an example that uses cudaMallocPitch to allocate pitched memory for a 2D array of structs. Each struct holds 3 floats, so it has sizeof(T) = 12 and alignof(T) = 4. The array is 5x4, so each row is 5 x 12 = 60 bytes wide. At least on my system, the pitch returned by cudaMallocPitch is 512 bytes, which is not a multiple of the row width or even of sizeof(T):

#include <iostream>

#include <cuda_runtime.h>

int main(void) {

  struct T {
    float x;
    float y;
    float z;
  };

  void* ptr = nullptr;
  size_t pitch = 0;
  size_t width = 5;   // columns
  size_t height = 4;  // rows

  cudaMallocPitch(&ptr, &pitch, sizeof(T) * width, height);

  std::cout << "ptr: " << ptr << std::endl;
  std::cout << "pitch: " << pitch << std::endl;

  for (size_t j = 0; j < height; ++j) {
    for (size_t i = 0; i < width; ++i) {
      std::cout << "ptr(" << i << ", " << j << "): " << (T*) ((char*)ptr + i * pitch) + j << '\t';
    }
    std::cout << std::endl;
  }

  return 0;
}

where (T*) ((char*) ptr + i * pitch) + j is taken from the latest documentation of cudaMallocPitch.

On my system this prints

ptr: 0x7fb233000000
pitch: 512
ptr(0, 0): 0x7fb233000000       ptr(1, 0): 0x7fb233000200       ptr(2, 0): 0x7fb233000400       ptr(3, 0): 0x7fb233000600       ptr(4, 0): 0x7fb233000800
ptr(0, 1): 0x7fb23300000c       ptr(1, 1): 0x7fb23300020c       ptr(2, 1): 0x7fb23300040c       ptr(3, 1): 0x7fb23300060c       ptr(4, 1): 0x7fb23300080c
ptr(0, 2): 0x7fb233000018       ptr(1, 2): 0x7fb233000218       ptr(2, 2): 0x7fb233000418       ptr(3, 2): 0x7fb233000618       ptr(4, 2): 0x7fb233000818
ptr(0, 3): 0x7fb233000024       ptr(1, 3): 0x7fb233000224       ptr(2, 3): 0x7fb233000424       ptr(3, 3): 0x7fb233000624       ptr(4, 3): 0x7fb233000824

So, the columns have a pitch of 512 bytes, which cannot be represented as a multiple of the element size (12 bytes).

However, as the above discussion points out, a custom accessor could give the desired behavior. While the required memory would no longer be required_span_size() * sizeof(element_type) as with default_accessor<element_type>, this is fine, because custom accessors can be so general that they might not even require a memory allocation.

Thanks. I will look into how that can be implemented!

fwyzard avatar Sep 07 '22 16:09 fwyzard

@fwyzard In general, allocations from cudaMallocPitch should work just fine with mdspan. You can use layout_stride or a custom layout (such as layout_right_padded in this PR) to represent the extra row padding that cudaMallocPitch might return.

I had a look at layout_right_padded and it would indeed work nicely if the row pitch is a multiple of the element size. Unfortunately, as @fwyzard demonstrated, we also have counter examples. Even worse, we can only discover these cases at runtime, because there is no guarantee which pitch is returned from cudaMallocPitch either.

Anyway, thank you for the input so far! I will see how far we need to bend mdspan to support our use case and whether that still makes sense in the end.

bernhardmgruber avatar Sep 16 '22 16:09 bernhardmgruber

@bernhardmgruber Thanks for taking a look!

... rows are aligned to a value (e.g. 512 bytes) that may not be a multiple of the element size (e.g. 12 bytes).

I think the key is to make the mdspan store bytes, but reference elements. I haven't thought through the details quite yet.

mhoemmen avatar Sep 16 '22 18:09 mhoemmen

@bernhardmgruber I think I've got it. The cudaMallocPitch documentation specifies the access formula as

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

Given that sizeof(T) need not evenly divide pitch, the key is to use bytes (char) as the underlying data type. This means that the accessor's data_handle_type would be char* (or const char*). I'll start backwards from the accessor and use it to figure out the layout mapping.

The accessor's access(p, k) member function would need to take p = BaseAddress (as char* or const char*) and k = Row * pitch + Column * sizeof(T). If the resulting p + k is correctly aligned, then access(p, k) could just return *reinterpret_cast<T*>(p + k). Otherwise, it would have to do memcpy tricks and use proxy references. (Here is an example of accessors that pack or unpack structs in possibly nonaligned byte storage: https://godbolt.org/z/d4j4MYaEz .)

template<class ElementType>
struct aligned_byte_accessor {
  using offset_policy = aligned_byte_accessor;

  using data_handle_type = std::conditional_t<std::is_const_v<ElementType>, const char*, char*>;
  using element_type = ElementType;
  // We can only use ElementType& if the alignment is correct.  See above.
  using reference = ElementType&;

  constexpr aligned_byte_accessor() noexcept = default;

  constexpr reference
  access(data_handle_type p, size_t k) const noexcept {
    // Again, this assumes that p + k is correctly aligned for ElementType.
    // Otherwise, we need proxy references.
    return *reinterpret_cast<ElementType*>(p + k);
  }

  constexpr typename offset_policy::data_handle_type
  offset(data_handle_type p, size_t k) const noexcept {
    // Again, this assumes that p + k is correctly aligned for ElementType.
    // Otherwise, we may need a different offset_policy type.
    return p + k;
  }
};

The offset k given to access comes from the layout mapping. Thus, the layout mapping for this allocation would need an operator()(r, c) that returns r * pitch + c * sizeof(T). If the layout mapping describes the layout of bytes, not of T, then this is just layout_stride with strides pitch, sizeof(T). Here's how you create the layout mapping and use it to make the mdspan. (This assumes you are using C++17's CTAD feature.)

void* ptr = nullptr;
size_t pitch = 0;
// I'm assuming you know these at compile time.
// If not, use auto exts = dextents<int, 2>{num_rows, num_cols}
// to make exts instead.
constexpr size_t num_cols = 5;
constexpr size_t num_rows = 4;

cudaMallocPitch(&ptr, &pitch, sizeof(T) * num_cols, num_rows);
extents<int, num_rows, num_cols> exts{};
layout_stride::mapping mapping{exts, std::array{pitch, sizeof(T)}};
mdspan m{reinterpret_cast<char*>(ptr), mapping, aligned_byte_accessor<T>{}};

mhoemmen avatar Sep 18 '22 04:09 mhoemmen

@mhoemmen thanks for the implementation of the aligned_byte_accessor - I can confirm that it works as intended !

#include <iostream>
#include <experimental/mdspan>

#include <cuda_runtime.h>

template<class ElementType>
struct aligned_byte_accessor {
  using offset_policy = aligned_byte_accessor;

  using data_handle_type = std::conditional_t<std::is_const_v<ElementType>, const char*, char*>;
  using element_type = ElementType;
  // We can only use ElementType& if the alignment is correct.  See above.
  using reference = ElementType&;

  constexpr aligned_byte_accessor() noexcept = default;

  constexpr reference
  access(data_handle_type p, size_t k) const noexcept {
    // Again, this assumes that p + k is correctly aligned for ElementType.
    // Otherwise, we need proxy references.
    return *reinterpret_cast<ElementType*>(p + k);
  }

  constexpr typename offset_policy::data_handle_type
  offset(data_handle_type p, size_t k) const noexcept {
    // Again, this assumes that p + k is correctly aligned for ElementType.
    // Otherwise, we may need a different offset_policy type.
    return p + k;
  }
};


int main(void) {

  struct T {
    float x;
    float y;
    float z;
  };

  void* ptr = nullptr;
  size_t pitch = 0;
  // I'm assuming you know these at compile time.
  // If not, use auto exts = dextents<int, 2>{height, width}
  // to make exts instead.
  constexpr size_t width = 5;   // columns
  constexpr size_t height = 4;  // rows

  cudaMallocPitch(&ptr, &pitch, sizeof(T) * width, height);

  std::cout << "ptr: " << ptr << std::endl;
  std::cout << "pitch: " << pitch << std::endl;

  std::cout << std::endl;
  std::cout << "cudaMallocPitch access:" << std::endl;
  for (size_t j = 0; j < height; ++j) {
    for (size_t i = 0; i < width; ++i) {
      std::cout << "ptr(" << i << ", " << j << "): " << (T*) ((char*)ptr + i * pitch) + j << '\t';
    }
    std::cout << std::endl;
  }

  std::experimental::extents<int, height, width> exts{};
  std::experimental::layout_stride::mapping mapping{exts, std::array{pitch, sizeof(T)}};
  std::experimental::mdspan m{reinterpret_cast<char*>(ptr), mapping, aligned_byte_accessor<T>{}};

  std::cout << std::endl;
  std::cout << "mdspan<aligned_byte_accessor> access:" << std::endl;
  for (size_t j = 0; j < height; ++j) {
    for (size_t i = 0; i < width; ++i) {
      std::cout << "ptr(" << i << ", " << j << "): " << (T*) &(m(i, j)) << '\t';
    }
    std::cout << std::endl;
  }

  return 0;
}

gives

ptr: 0x7fd9cd000000
pitch: 512

cudaMallocPitch access:
ptr(0, 0): 0x7fd9cd000000       ptr(1, 0): 0x7fd9cd000200       ptr(2, 0): 0x7fd9cd000400       ptr(3, 0): 0x7fd9cd000600       ptr(4, 0): 0x7fd9cd000800
ptr(0, 1): 0x7fd9cd00000c       ptr(1, 1): 0x7fd9cd00020c       ptr(2, 1): 0x7fd9cd00040c       ptr(3, 1): 0x7fd9cd00060c       ptr(4, 1): 0x7fd9cd00080c
ptr(0, 2): 0x7fd9cd000018       ptr(1, 2): 0x7fd9cd000218       ptr(2, 2): 0x7fd9cd000418       ptr(3, 2): 0x7fd9cd000618       ptr(4, 2): 0x7fd9cd000818
ptr(0, 3): 0x7fd9cd000024       ptr(1, 3): 0x7fd9cd000224       ptr(2, 3): 0x7fd9cd000424       ptr(3, 3): 0x7fd9cd000624       ptr(4, 3): 0x7fd9cd000824

mdspan<aligned_byte_accessor> access:
ptr(0, 0): 0x7fd9cd000000       ptr(1, 0): 0x7fd9cd000200       ptr(2, 0): 0x7fd9cd000400       ptr(3, 0): 0x7fd9cd000600       ptr(4, 0): 0x7fd9cd000800
ptr(0, 1): 0x7fd9cd00000c       ptr(1, 1): 0x7fd9cd00020c       ptr(2, 1): 0x7fd9cd00040c       ptr(3, 1): 0x7fd9cd00060c       ptr(4, 1): 0x7fd9cd00080c
ptr(0, 2): 0x7fd9cd000018       ptr(1, 2): 0x7fd9cd000218       ptr(2, 2): 0x7fd9cd000418       ptr(3, 2): 0x7fd9cd000618       ptr(4, 2): 0x7fd9cd000818
ptr(0, 3): 0x7fd9cd000024       ptr(1, 3): 0x7fd9cd000224       ptr(2, 3): 0x7fd9cd000424       ptr(3, 3): 0x7fd9cd000624       ptr(4, 3): 0x7fd9cd000824

fwyzard avatar Sep 20 '22 14:09 fwyzard

@bernhardmgruber Thanks for reporting success! :-D

std::cout << "ptr(" << i << ", " << j << "): " << (T*) &(m(i, j)) << '\t';

You shouldn't need to take the address of m(i, j) or cast to T*. The return type of m(i, j) is T&. (aligned_byte_accessor<ElementType>::reference names the type ElementType&.) The following should work just fine.

std::cout << "ptr(" << i << ", " << j << "): " << m(i, j) << '\t';

mhoemmen avatar Sep 20 '22 17:09 mhoemmen

@bernhardmgruber It looks like we've answered the question, so I'm closing this issue. Please feel free to reopen if that's not the case. Thanks for your helpful comments and suggestions!

mhoemmen avatar Sep 20 '22 17:09 mhoemmen

Hi @mhoemmen! Thank you so much for the suggestion with the layout_stride, using sizeof(T) as the innermost stride and a char* as the data handle. It works like a charm! I can confirm that submdspan also works correctly this way.

Again, thank you so much!

bernhardmgruber avatar Oct 04 '22 22:10 bernhardmgruber