Skip to content

ENH: performance improvement to ediff1d#1

Closed
mattharrigan wants to merge 2 commits intomasterfrom
ediff1d-performance
Closed

ENH: performance improvement to ediff1d#1
mattharrigan wants to merge 2 commits intomasterfrom
ediff1d-performance

Conversation

@mattharrigan
Copy link
Copy Markdown
Owner

@mattharrigan mattharrigan commented Oct 19, 2016

Eliminate a copy operation when to_begin or to_end is given. Also
use ravel instead of flatiter which is much faster.

Closes numpy#8175

Benchmark:
python -m timeit --setup="import numpy as np;x=np.arange(1e7)" "np.ediff1d(x, 0)"

new version is about 5x faster on my machine

Eliminate a copy operation when to_begin or to_end is given.  Also
use ravel instead of flatiter which is much faster.
@shoyer
Copy link
Copy Markdown

shoyer commented Oct 19, 2016

I've actually never used this function before. What are the use cases for np.ediff1d versus np.diff?

An alternative approach might be to squeeze its functionality (adding a start or end value) into np.diff

@mattharrigan
Copy link
Copy Markdown
Owner Author

Buried in some performance critical code I needed to compute an elementwise difference of a 1d array and prepend the result with a value, basically exactly what ediff1d does. Adding beginning and ending values might make sense for diff, but I have not needed that specific functionality. Diff seems far more general, including multidimensional arrays and higher derivatives.

While I have the performance itch, diff could be modified to a single pass algorithm instead of recursive with one pass for each n. Adding beginning and ending arrays could be done in conjunction. But that's a much bigger effort.

Comment thread numpy/lib/arraysetops.py
to_end = np.asanyarray(to_end).ravel()

# do the calculation in place and copy to_begin and to_end
result = np.empty(l + len(to_begin) + len(to_end), dtype=ary.dtype)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this could be a performance regression due to the extra copy if neither to_begin nor to_end is used.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they aren't used then there is nothing to copy and they are basically no ops. Correct?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, you need to allocate the result array anyways.

Comment thread numpy/lib/arraysetops.py
l = len(ary) - 1
if l < 0:
# force length to be non negative, match previous API
# should this be an warning or deprecated?
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone cares, we could deprecate this. But I think it's probably not worth the trouble -- I would sooner deprecate the entire function.

@shoyer
Copy link
Copy Markdown

shoyer commented Oct 19, 2016

OK, this seems reasonable enough to me.

The function is certainly a very weird fit for arraysetops, though.

@mattharrigan
Copy link
Copy Markdown
Owner Author

originally arraysetops used ediff1d, see https://github.com/numpy/numpy/blob/669969980843dc207db170d99fa0884594c6bc7e/numpy/lib/arraysetops.py#L70

Now it looks like diff is used in its place.

@mattharrigan
Copy link
Copy Markdown
Owner Author

can/should I merge? Not sure of the process.

@shoyer
Copy link
Copy Markdown

shoyer commented Oct 19, 2016

My only concern here is that the existing tests for ediff1d feel a little sparse -- they don't even check to_begin or to_end with non-empty input. So maybe add a few test cases, just so we can be confident in the refactor.

can/should I merge? Not sure of the process.

You actually opened a pull request against the wrong repository :). I only found this because I followed the link from your issue. Please reopen against the master branch of numpy/numpy by clicking "New pull request" on the main numpy repo. (Technically, you can do whatever you want in your own fork, but nobody else is going to see it.)

@mattharrigan
Copy link
Copy Markdown
Owner Author

Oops, sorry about pulling to the wrong repo. I'll add some more tests and then reopen

mattharrigan pushed a commit that referenced this pull request Nov 8, 2016
Adds a regression test that demonstrates the issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants