Support init from numpy with any number of dimensions #3284
Conversation
Force-pushed from 09b00e9 to 950f356
The test failure only happens with parallelisation, and only with non-c-contiguous buffers. I don't know yet what is wrong.
Force-pushed from 950f356 to 8721b2d
I think log-log scale is evil and hides the true difference. Can you make the Y-axis linear, and maybe plot the time per element, or the memory bandwidth (sum of reads + writes in GByte/second)? Can you also include the full size range in the ndim=1 case? Somehow it is cut off at volume=1e5.
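For reference, the bandwidth figure could be computed from a measured time roughly like this (a sketch; `arr` and `elapsed` are placeholders, not values from the actual benchmark):

```python
import numpy as np

arr = np.random.rand(1000, 1000)   # placeholder input array
elapsed = 1.0e-3                   # placeholder: measured copy time in seconds

# A copy reads the source once and writes the destination once.
bytes_moved = 2 * arr.nbytes
bandwidth_gb_per_s = bytes_moved / elapsed / 1e9
time_per_element_ns = elapsed / arr.size * 1e9
print(f"{bandwidth_gb_per_s:.1f} GB/s, {time_per_element_ns:.2f} ns/element")
```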
tests/variable_creation_test.py (outdated)

    elif dim == 5:
        return array[:, :, :, :, :, ::2]
    elif dim == 6:
        return array[:, :, :, :, :, :, ::2]
Add an else and raise? Otherwise we might end up with a silently passing test.
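For instance (a sketch; `strided_view` is a hypothetical stand-in for the test helper in the quoted hunk):

```python
import numpy as np

def strided_view(array: np.ndarray, dim: int) -> np.ndarray:
    if dim == 1:
        return array[:, ::2]
    elif dim == 2:
        return array[:, :, ::2]
    else:
        # Fail loudly for unhandled dims instead of silently returning None.
        raise ValueError(f"unsupported dim: {dim}")
```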
tests/variable_creation_test.py (outdated)

    elif dim == 2:
        return array[:, :, ::2]
    elif dim == 3:
        return array[:, :, :, ::2]
    elif dim == 4:
        return array[:, :, :, :, ::2]
Can you add a test with multiple such slices, as well as negative strides?
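Something along these lines could cover both (a sketch; the `sc.array` call and dim labels are assumptions about the surrounding test setup):

```python
import numpy as np
import scipp as sc

def test_create_from_multi_sliced_reversed_array():
    base = np.arange(3 * 4 * 5 * 6, dtype=np.float64).reshape(3, 4, 5, 6)
    # Several strided slices plus a negative stride in a single view.
    view = base[::2, :, 1::2, ::-1]
    var = sc.array(dims=['a', 'b', 'c', 'd'], values=view)
    np.testing.assert_array_equal(var.values, view)
```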
lib/python/numpy.h (outdated)

    const auto src_stride = src.stride(0);
    const auto dst_stride = inner_volume(src);
    core::parallel::parallel_for(
        core::parallel::blocked_range(0, src.shape[0]), [&](const auto &range) {
It seems this approach will be very suboptimal if we either have shape[0]==1 (or very small), or shape[-1]==1 (or very small). Can you benchmark those two cases as well? In particular the 2-D cases, i.e., (N, 1) and (1, N). The old implementation likely had a similar problem. I wonder if one could squeeze such dims?
Or maybe we can move the parallel_for call into copy_flattened_middle_dims, and make it conditional on the current dim's size and on whether any outer dim was already processed in parallel?
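To illustrate the squeezing idea in numpy terms (a conceptual sketch only, not the C++ change under discussion):

```python
import numpy as np

src = np.random.rand(1, 1_000_000)  # shape (1, N): no parallelism available over dim 0
squeezed = np.squeeze(src)          # drop length-1 dims -> shape (N,)

# Copy the squeezed data (this is the part one would parallelise),
# then restore the original shape on the destination.
dst = np.empty_like(squeezed)
np.copyto(dst, squeezed)
dst = dst.reshape(src.shape)

assert dst.shape == src.shape
```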
docs/about/release-notes.rst (outdated)

    * It is now possible to construct Scipp variables from Numpy arrays with up to 6 dimensions for arbitrary memory layouts and any number of dimensions for c-contiguous memory layouts.
      The limit used to be ``ndim <= 4`` `#3284 <https://github.com/scipp/scipp/pull/3284>`_.
Mention improved performance?
lib/python/numpy.h (outdated)

    throw std::runtime_error("Numpy array has more dimensions than supported "
                             "in the current implementation.");
Do you want to suggest in the error message to make a copy to a c-contiguous array as a workaround?
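From the user's side the workaround would look something like this (a sketch, assuming the ndim limit for strided inputs is 6 as in the release-note hunk above):

```python
import numpy as np
import scipp as sc

a = np.random.rand(2, 3, 4, 5, 6, 7, 8)  # 7-d array
view = a[..., ::2]                       # non-c-contiguous view with ndim > 6

# Copying to a c-contiguous array sidesteps the ndim limit for strided inputs.
var = sc.array(dims=[f'dim_{i}' for i in range(view.ndim)],
               values=np.ascontiguousarray(view))
```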
lib/python/numpy.h (outdated)

    auto src = reinterpret_cast<const T *>(src_buffer.ptr);
    const auto begin = dst.begin();
    core::parallel::parallel_for(
        core::parallel::blocked_range(0, src_buffer.size, 10000),
The 10000 shows up multiple times. Can you give it a name?
Can do. But the optimum would probably be different for each function.
lib/python/numpy.h (outdated)

    core::parallel::parallel_for(
        core::parallel::blocked_range(0, src.shape(0), 10000),
Isn't this disabling multi-threading if shape[0] < 10000? Shouldn't we use the default setup (which, iirc, is shape[0] / 24, i.e., threaded for shape[0] > 1)? Or is this harmful in more important cases?
Same comment applies to 3d+ cases.
I did some benchmarks with assignment to .values from sliced arrays, with and without the grainsize setting, and there seems to be no difference. I removed it again from all but the 1d loops.
Force-pushed from c08831e to fb240d4


Fixes #3224
Changes
py::array unfortunately has no interface for indexing with a run-time-determined number of indices, so I went through the underlying buffer.
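For context, a sliced or transposed numpy array exposes an arbitrary strided layout, which is why the copy has to work on the raw buffer pointer and strides (numpy illustration only):

```python
import numpy as np

a = np.zeros((4, 5, 6))
view = a.transpose(2, 0, 1)[:, ::2, ::-1]  # arbitrary, non-c-contiguous layout

print(view.shape)                   # (6, 2, 5)
print(view.strides)                 # byte strides per dimension, one of them negative
print(view.flags['C_CONTIGUOUS'])   # False
```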
Performance

I did some benchmarks. For low dimensions (1, 2) the new code performs worse than the old code. For example, I get

old: {1: 0.00020559499898809008, 2: 0.006096607001381926, 3: 11.701265682000667}
new: {1: 0.0010022530004789587, 2: 0.004268341999704717, 3: 4.414793544001441}

This is quite a significant slowdown.
But for ndim=3,4 I found speedups.
Is this bad enough to warrant keeping the special cases for low dimensions?
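For reference, a benchmark along these lines reproduces the shape of the comparison (an illustrative sketch, not the exact script behind the numbers above; the sizes and the use of sc.array are assumptions):

```python
import timeit
import numpy as np
import scipp as sc

def make_input(ndim, volume=1_000_000):
    # Hypothetical helper: build a non-c-contiguous array of roughly `volume` elements.
    n = max(2, round(volume ** (1 / ndim)))
    a = np.random.rand(*([n] * ndim))
    return a[..., ::2]  # strided view

results = {}
for ndim in (1, 2, 3):
    values = make_input(ndim)
    dims = [f'dim_{i}' for i in range(ndim)]
    results[ndim] = min(timeit.repeat(
        lambda: sc.array(dims=dims, values=values), number=10, repeat=5))
print(results)
```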