
Conversation

@jl-wynen
Member

Fixes #3224

Changes

  • Supports any number of dimensions without stamping out more templates.
  • Parallelised for any number of dimensions.

Unfortunately, py::array has no interface for indexing with a run-time-determined number of indices, so I went through the underlying buffer instead.
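For illustration, here is a minimal sketch of what "going through the underlying buffer" can look like for a run-time number of dimensions, using py::buffer_info's ptr, shape, and strides. The helper name copy_nd and the output-iterator parameter are hypothetical; this is not the PR's actual implementation (which is also parallelised), just the basic idea:

#include <pybind11/numpy.h>

#include <vector>

namespace py = pybind11;

// Hypothetical helper: copy every element of an n-d buffer in row-major index
// order by computing byte offsets from the strides.
template <class T, class OutputIt>
void copy_nd(const py::buffer_info &src, OutputIt out) {
  const py::ssize_t ndim = src.ndim;
  std::vector<py::ssize_t> index(ndim, 0); // current multi-index
  const auto *base = static_cast<const char *>(src.ptr);
  py::ssize_t volume = 1;
  for (py::ssize_t d = 0; d < ndim; ++d)
    volume *= src.shape[d];
  for (py::ssize_t i = 0; i < volume; ++i) {
    // Byte offset of the current element.
    py::ssize_t offset = 0;
    for (py::ssize_t d = 0; d < ndim; ++d)
      offset += index[d] * src.strides[d];
    *out++ = *reinterpret_cast<const T *>(base + offset);
    // Advance the multi-index, innermost dimension first.
    for (py::ssize_t d = ndim - 1; d >= 0; --d) {
      if (++index[d] < src.shape[d])
        break;
      index[d] = 0;
    }
  }
}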

Performance

I did some benchmarks. For low dimensions (1, 2) the new code performs worse than the old code. E.g., using

from timeit import timeit
import scipp as sc
import numpy as np

times = {}
for ndim in range(1, 4):
    a = np.zeros([1000]*ndim)
    dims = [f'dim{d}' for d in range(ndim)]
    times[ndim] = timeit(lambda: sc.array(dims=dims, values=a), number=10)

print(times)

With the old code I get:

{1: 0.00020559499898809008, 2: 0.006096607001381926, 3: 11.701265682000667}

and with the new code:

{1: 0.0010022530004789587, 2: 0.004268341999704717, 3: 4.414793544001441}

For ndim=1 this is quite a significant slowdown.

But for ndim=3, 4 I found speedups.

Is this bad enough to warrant keeping the special cases for low dimensions?

@jl-wynen force-pushed the high-dim-variable-init branch from 09b00e9 to 950f356 on October 10, 2023 at 14:38
@jl-wynen
Member Author

The test failure only happens with parallelisation, and only with non-c-contiguous buffers. I don't know yet what is wrong.

@jl-wynen marked this pull request as draft on October 10, 2023 at 15:34
@jl-wynen removed the request for review from SimonHeybrock on October 10, 2023 at 15:34
@SimonHeybrock
Member

> [...] Is this bad enough to warrant keeping the special cases for low dimensions?

  • Can you clarify if this is proportional to the number of elements, or a constant overhead?
  • In either case (but certainly in the first), I think it does warrant keeping the special cases.

@jl-wynen force-pushed the high-dim-variable-init branch from 950f356 to 8721b2d on October 11, 2023 at 14:24
@jl-wynen
Member Author

I did a more thorough benchmark using

import json
from itertools import product
from rich import print
from timeit import repeat
import scipp as sc
import numpy as np
import math
import sys

def volume(s):
    return math.prod(map(int, s))

def shapes(ndim):
    sizes = np.logspace(0, 6, 6).astype(int)[::-1]
    yield from (s for s in product(*(sizes for _ in range(ndim)))
                if volume(s) < 10**8)

number = 30
times = {}

for ndim in range(1, 5):
    print(f'{ndim=}')
    dims = [f'dim{d}' for d in range(ndim)]

    times[ndim] = {}
    for shape in shapes(ndim):
        a = np.random.random(shape)
        vol = volume(shape)
        if vol in times[ndim]:
            continue
        times[ndim][vol] = min(repeat(lambda: sc.array(dims=dims, values=a), number=number, repeat=5)) / number

    times[ndim] = {vol: times[ndim][vol] for vol in sorted(times[ndim].keys())}

with open(f'{sys.argv[1]}.json', 'w') as f:
    json.dump(times, f)

which gives
[figure: benchmark — log-log plot of variable-creation time vs. total volume, old vs. new, for ndim = 1–4]

The x-axes show the total volume of the created variable. There are kinks and jumps because many different shapes are involved.

Overall, there is a significant pessimisation for small arrays but a gain for large ones. Interestingly, for ndim=2, the two approaches are essentially the same. This is because this case was already parallelised in the old implementation. With a serial build, the new implementation produces these times:
[figure: benchmark-ser — the same comparison with a serial (non-parallel) build of the new implementation]
Still not as good as the old one but a lot closer.

What do you think?

@jl-wynen marked this pull request as ready for review on October 11, 2023 at 14:38
@SimonHeybrock
Member

SimonHeybrock commented Oct 12, 2023

> What do you think?

I think log-log scale is evil and hides the true difference. Can you make the Y-axis linear, and maybe plot the time per element, or the memory bandwidth (sum of reads + writes in GByte/second)?

Can you also include the full size range in the ndim=1 case? Somehow it is cut off at volume=1e5.

    elif dim == 5:
        return array[:, :, :, :, :, ::2]
    elif dim == 6:
        return array[:, :, :, :, :, :, ::2]
Member


Add an else and raise? Who knows if we otherwise get a silently passing test.

Comment on lines 353 to 358
    elif dim == 2:
        return array[:, :, ::2]
    elif dim == 3:
        return array[:, :, :, ::2]
    elif dim == 4:
        return array[:, :, :, :, ::2]
Member


Can you add a test with multiple such slices, as well as negative strides?

const auto src_stride = src.stride(0);
const auto dst_stride = inner_volume(src);
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src.shape[0]), [&](const auto &range) {
Member


It seems this approach will be very suboptimal if we either have shape[0] == 1 (or very small) or shape[-1] == 1 (or very small). Can you benchmark those two cases as well? In particular the 2-d cases, i.e., (N, 1) and (1, N). The old implementation likely had a similar problem. I wonder if one could squeeze such dims?

Or maybe we could move the parallel_for call into copy_flattened_middle_dims and make it conditional on the current dim's size and on whether any outer dim was already processed in parallel?
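For the squeeze idea, a minimal sketch under the assumption that shape and strides are carried as plain vectors (as in py::buffer_info); squeeze_layout is a hypothetical name and this is not code from the PR:

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: drop length-1 dimensions from a shape/strides pair so
// that layouts like (N, 1) and (1, N) both reduce to a plain 1-d copy that can
// be parallelised over N.
std::pair<std::vector<std::ptrdiff_t>, std::vector<std::ptrdiff_t>>
squeeze_layout(const std::vector<std::ptrdiff_t> &shape,
               const std::vector<std::ptrdiff_t> &strides) {
  std::vector<std::ptrdiff_t> out_shape;
  std::vector<std::ptrdiff_t> out_strides;
  for (std::size_t d = 0; d < shape.size(); ++d) {
    if (shape[d] != 1) { // a length-1 dim contributes nothing to iteration
      out_shape.push_back(shape[d]);
      out_strides.push_back(strides[d]);
    }
  }
  if (out_shape.empty()) { // all dims had length 1: keep a single element
    out_shape.push_back(1);
    out_strides.push_back(0);
  }
  return {out_shape, out_strides};
}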

Comment on lines 38 to 39
* It is now possible to construct Scipp variables from Numpy arrays with up to 6 dimensions for arbitrary memory layouts and any number of dimensions for c-contiguous memory layouts.
The limit used to be ``ndim <= 4`` `#3284 <https://github.com/scipp/scipp/pull/3284>`_.
Member


Mention improved performance?

Comment on lines 287 to 288
throw std::runtime_error("Numpy array has more dimensions than supported "
                         "in the current implementation.");
Member


Do you want the error message to suggest copying to a c-contiguous array as a workaround?
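As a hedged sketch of one possible wording (the helper name throw_too_many_dims and the exact text are illustrative, not what the PR ended up using):

#include <stdexcept>
#include <string>

// Hypothetical helper: report the limit and point at numpy.ascontiguousarray
// as a workaround.
[[noreturn]] void throw_too_many_dims(int ndim, int max_ndim) {
  throw std::runtime_error(
      "NumPy array has " + std::to_string(ndim) + " dimensions, but at most " +
      std::to_string(max_ndim) +
      " are supported for non-c-contiguous memory layouts. As a workaround, "
      "copy the array to a c-contiguous layout first, e.g. with "
      "numpy.ascontiguousarray(a).");
}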

auto src = reinterpret_cast<const T *>(src_buffer.ptr);
const auto begin = dst.begin();
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src_buffer.size, 10000),
Member


The 10000 shows up multiple times. Can you give it a name?
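For example, a sketch with a named constant (the name and placement are illustrative, not what the PR uses):

#include <cstddef>

namespace {
// Grain size for the parallel copy loops; keeps the magic number in one place.
// The value matches the 10000 used in the snippets above.
constexpr std::ptrdiff_t copy_grainsize = 10000;
} // namespace

// Usage, adapted from the quoted snippet:
//   core::parallel::parallel_for(
//       core::parallel::blocked_range(0, src_buffer.size, copy_grainsize),
//       ...);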

Member Author


Can do. But the optimum would probably be different for each function.

Comment on lines 125 to 126
core::parallel::parallel_for(
    core::parallel::blocked_range(0, src.shape(0), 10000),
Member


Isn't this disabling multi-threading if shape[0] < 10000? Shouldn't we use the default setup (which, iirc, is shape[0] / 24, i.e., threaded for shape[0] > 1)? Or is this harmful in more important cases?

Same comment applies to 3d+ cases.

Member Author


I did some benchmarks of assignment to .values from sliced arrays, with and without the grain-size setting, and there seems to be no difference. I removed it again from all but the 1-d loops.

@jl-wynen force-pushed the high-dim-variable-init branch from c08831e to fb240d4 on October 19, 2023 at 08:28
@jl-wynen merged commit 6292582 into main on October 19, 2023
@jl-wynen deleted the high-dim-variable-init branch on October 19, 2023 at 09:48


Development

Successfully merging this pull request may close these issues.

Support init from NumPy arrays with more than 4 dimensions?
