
Conversation

@jl-wynen
Member

Fixes #3224

Changes

  • Supports any number of dimensions without stamping out more templates.
  • Parallelised for any number of dimensions.

Unfortunately, py::array has no interface for indexing with a run-time-determined number of indices, so I went through the underlying buffer instead.
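For illustration, here is a minimal sketch of what "going through the underlying buffer" can look like for a run-time number of dimensions, using py::buffer_info's ptr, shape, and strides. The helper name copy_nd and the output-iterator parameter are hypothetical; this is not the PR's actual implementation (which is also parallelised), just the basic idea:

#include <pybind11/numpy.h>

#include <vector>

namespace py = pybind11;

// Hypothetical helper: copy every element of an n-d buffer in row-major index
// order by computing byte offsets from the strides.
template <class T, class OutputIt>
void copy_nd(const py::buffer_info &src, OutputIt out) {
  const py::ssize_t ndim = src.ndim;
  std::vector<py::ssize_t> index(ndim, 0); // current multi-index
  const auto *base = static_cast<const char *>(src.ptr);
  py::ssize_t volume = 1;
  for (py::ssize_t d = 0; d < ndim; ++d)
    volume *= src.shape[d];
  for (py::ssize_t i = 0; i < volume; ++i) {
    // Byte offset of the current element.
    py::ssize_t offset = 0;
    for (py::ssize_t d = 0; d < ndim; ++d)
      offset += index[d] * src.strides[d];
    *out++ = *reinterpret_cast<const T *>(base + offset);
    // Advance the multi-index, innermost dimension first.
    for (py::ssize_t d = ndim - 1; d >= 0; --d) {
      if (++index[d] < src.shape[d])
        break;
      index[d] = 0;
    }
  }
}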

Performance

I did some benchmarks. For low dimensions (1, 2) the new code performs worse than the old code. E.g., using

from timeit import timeit
import scipp as sc
import numpy as np

times = {}
for ndim in range(1, 4):
    a = np.zeros([1000]*ndim)
    dims = [f'dim{d}' for d in range(ndim)]
    times[ndim] = timeit(lambda: sc.array(dims=dims, values=a), number=10)

print(times)

With the old code I get:

{1: 0.00020559499898809008, 2: 0.006096607001381926, 3: 11.701265682000667}

and with the new code:

{1: 0.0010022530004789587, 2: 0.004268341999704717, 3: 4.414793544001441}

For ndim=1 this is quite a significant slowdown.

But for ndim=3, 4 I found speedups.

Is this bad enough to warrant keeping the special cases for low dimensions?

@jl-wynen force-pushed the high-dim-variable-init branch from 09b00e9 to 950f356 on October 10, 2023 at 14:38
@jl-wynen
Member Author

The test failure only happens with parallelisation, and only with non-c-contiguous buffers. I don't know yet what is wrong.

@jl-wynen marked this pull request as draft on October 10, 2023 at 15:34
@jl-wynen removed the request for review from SimonHeybrock on October 10, 2023 at 15:34
@SimonHeybrock
Member

> [...] Is this bad enough to warrant keeping the special cases for low dimensions?

  • Can you clarify if this is proportional to the number of elements, or a constant overhead?
  • In either case (but certainly in the first), I think it does warrant keeping the special cases.

@jl-wynen force-pushed the high-dim-variable-init branch from 950f356 to 8721b2d on October 11, 2023 at 14:24
@jl-wynen
Member Author

I did a more thorough benchmark using

import json
from itertools import product
from rich import print
from timeit import repeat
import scipp as sc
import numpy as np
import math
import sys

def volume(s):
    return math.prod(map(int, s))

def shapes(ndim):
    sizes = np.logspace(0, 6, 6).astype(int)[::-1]
    yield from (s for s in product(*(sizes for _ in range(ndim)))
                if volume(s) < 10**8)

number = 30
times = {}

for ndim in range(1, 5):
    print(f'{ndim=}')
    dims = [f'dim{d}' for d in range(ndim)]

    times[ndim] = {}
    for shape in shapes(ndim):
        a = np.random.random(shape)
        vol = volume(shape)
        if vol in times[ndim]:
            continue
        times[ndim][vol] = min(repeat(lambda: sc.array(dims=dims, values=a), number=number, repeat=5)) / number

    times[ndim] = {vol: times[ndim][vol] for vol in sorted(times[ndim].keys())}

with open(f'{sys.argv[1]}.json', 'w') as f:
    json.dump(times, f)

which gives
[figure: benchmark — log-log plot of variable-creation time vs. total volume, old vs. new, for ndim = 1–4]

The x-axes show the total volume of the created variable. There are kinks and jumps because many different shapes are involved.

Overall, there is a significant pessimisation for small arrays but a gain for large ones. Interestingly, for ndim=2, the two approaches are essentially the same. This is because this case was already parallelised in the old implementation. With a serial build, the new implementation produces these times:
[figure: benchmark-ser — the same comparison with a serial (non-parallel) build of the new implementation]
Still not as good as the old one but a lot closer.

What do you think?

@jl-wynen marked this pull request as ready for review on October 11, 2023 at 14:38
@SimonHeybrock
Member

SimonHeybrock commented Oct 12, 2023

> What do you think?

I think log-log scale is evil and hides the true difference. Can you make the Y-axis linear, and maybe plot the time per element, or the memory bandwidth (sum of reads + writes in GByte/second)?

Can you also include the full size range in the ndim=1 case? Somehow it is cut off at volume=1e5.

    elif dim == 5:
        return array[:, :, :, :, :, ::2]
    elif dim == 6:
        return array[:, :, :, :, :, :, ::2]
Member


Add an else and raise? Who knows if we otherwise get a silently passing test.

Comment on lines 353 to 358
    elif dim == 2:
        return array[:, :, ::2]
    elif dim == 3:
        return array[:, :, :, ::2]
    elif dim == 4:
        return array[:, :, :, :, ::2]
Member


Can you add a test with multiple such slices, as well as negative strides?

const auto src_stride = src.stride(0);
const auto dst_stride = inner_volume(src);
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src.shape[0]), [&](const auto &range) {
Member


It seems this approach will be very suboptimal if we either have shape[0] == 1 (or very small) or shape[-1] == 1 (or very small). Can you benchmark those two cases as well? In particular the 2-d cases, i.e., (N, 1) and (1, N). The old implementation likely had a similar problem. I wonder if one could squeeze such dims?

Or maybe we could move the parallel_for call into copy_flattened_middle_dims and make it conditional on the current dim's size and on whether any outer dim was already processed in parallel?
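For the squeeze idea, a minimal sketch under the assumption that shape and strides are carried as plain vectors (as in py::buffer_info); squeeze_layout is a hypothetical name and this is not code from the PR:

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: drop length-1 dimensions from a shape/strides pair so
// that layouts like (N, 1) and (1, N) both reduce to a plain 1-d copy that can
// be parallelised over N.
std::pair<std::vector<std::ptrdiff_t>, std::vector<std::ptrdiff_t>>
squeeze_layout(const std::vector<std::ptrdiff_t> &shape,
               const std::vector<std::ptrdiff_t> &strides) {
  std::vector<std::ptrdiff_t> out_shape;
  std::vector<std::ptrdiff_t> out_strides;
  for (std::size_t d = 0; d < shape.size(); ++d) {
    if (shape[d] != 1) { // a length-1 dim contributes nothing to iteration
      out_shape.push_back(shape[d]);
      out_strides.push_back(strides[d]);
    }
  }
  if (out_shape.empty()) { // all dims had length 1: keep a single element
    out_shape.push_back(1);
    out_strides.push_back(0);
  }
  return {out_shape, out_strides};
}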

Comment on lines 38 to 39
* It is now possible to construct Scipp variables from Numpy arrays with up to 6 dimensions for arbitrary memory layouts and any number of dimensions for c-contiguous memory layouts.
The limit used to be ``ndim <= 4`` `#3284 <https://github.com/scipp/scipp/pull/3284>`_.
Member


Mention improved performance?

Comment on lines 287 to 288
throw std::runtime_error("Numpy array has more dimensions than supported "
                         "in the current implementation.");
Member


Do you want the error message to suggest copying to a c-contiguous array as a workaround?
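As a hedged sketch of one possible wording (the helper name throw_too_many_dims and the exact text are illustrative, not what the PR ended up using):

#include <stdexcept>
#include <string>

// Hypothetical helper: report the limit and point at numpy.ascontiguousarray
// as a workaround.
[[noreturn]] void throw_too_many_dims(int ndim, int max_ndim) {
  throw std::runtime_error(
      "NumPy array has " + std::to_string(ndim) + " dimensions, but at most " +
      std::to_string(max_ndim) +
      " are supported for non-c-contiguous memory layouts. As a workaround, "
      "copy the array to a c-contiguous layout first, e.g. with "
      "numpy.ascontiguousarray(a).");
}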

auto src = reinterpret_cast<const T *>(src_buffer.ptr);
const auto begin = dst.begin();
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src_buffer.size, 10000),
Member


The 10000 shows up multiple times. Can you give it a name?
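For example, a sketch with a named constant (the name and placement are illustrative, not what the PR uses):

#include <cstddef>

namespace {
// Grain size for the parallel copy loops; keeps the magic number in one place.
// The value matches the 10000 used in the snippets above.
constexpr std::ptrdiff_t copy_grainsize = 10000;
} // namespace

// Usage, adapted from the quoted snippet:
//   core::parallel::parallel_for(
//       core::parallel::blocked_range(0, src_buffer.size, copy_grainsize),
//       ...);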

Member Author


Can do. But the optimum would probably be different for each function.

Comment on lines 125 to 126
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src.shape(0), 10000),
Member


Isn't this disabling multi-threading if shape[0] < 10000? Shouldn't we use the default setup (which, iirc, is shape[0] / 24, i.e., threaded for shape[0] > 1)? Or is this harmful in more important cases?

Same comment applies to 3d+ cases.

Member Author


I did some benchmarks of assignment to .values from sliced arrays, with and without the grain-size setting, and there seems to be no difference. I removed it again from all but the 1-d loops.

@jl-wynen force-pushed the high-dim-variable-init branch from c08831e to fb240d4 on October 19, 2023 at 08:28
@jl-wynen merged commit 6292582 into main on October 19, 2023
@jl-wynen deleted the high-dim-variable-init branch on October 19, 2023 at 09:48


Development

Successfully merging this pull request may close these issues.

Support init from NumPy arrays with more than 4 dimensions?
