ENH Support Haversine distance in pairwise_distances#12568
ENH Support Haversine distance in pairwise_distances#12568rth merged 29 commits intoscikit-learn:masterfrom
Conversation
jnothman
left a comment
There was a problem hiding this comment.
Please add a test of pairwise_distances(..., metric='haversine')
jnothman
left a comment
There was a problem hiding this comment.
thanks. why do you regard this as WIP. is there a set of tasks you still hope to complete?
modify X = rng.rand(12, 2) delete unnecessary code lines
|
@jnothman I don't understand my CI error =( |
|
Could you please benchmark how this Haversine approach compares to #4458 ? |
|
@rth yeah. I have to see the test. Because I have some problems with it |
|
@jnothman I think that is ready. |
|
Are you going to report the benchmarks requested above by Roman? |
|
@eamanu Minor. An entry is needed here for completion ? scikit-learn/sklearn/metrics/pairwise.py Lines 1073 to 1091 in bbbefca |
|
According to #12552 (comment) and #12552 (comment) I was working on this PR. Using the #4458 (comment) I added the I make little code to test the performance Using this PR I have the next measure: and using #4458 we have: in the same conditions. IMO could be interest use cython to calculate this distances or use code that exist on DistanceMetric. Regards! |
|
oops I closed the PR. Sorry! |
|
@whiletruelearn done you comment thanks! |
|
So this version appears a little faster. Continue using ipython's timeit
magic for quick benchmarks
|
sklearn/metrics/pairwise.py
Outdated
| >>> X = np.ones((5,2)) | ||
| >>> Y = np.full((5,2), 2.) | ||
| >>> haversine_distances(X, Y) | ||
| array([[0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123], |
There was a problem hiding this comment.
I don't think this is a very helpful example. Better to illustrate usage without Y and with two distinct samples, for instance. It might be informative to use real geographical coordinates, although the dishwasher on earth is then only approximate.
There was a problem hiding this comment.
In any case we don't need three examples of usage here. One will suffice
sklearn/metrics/pairwise.py
Outdated
| [0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123], | ||
| [0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123]]) | ||
| """ | ||
|
|
| def slow_haversine_distances(x, y): | ||
| diff_lat = y[0] - x[0] | ||
| diff_lon = y[1] - x[1] | ||
| a = np.sin(diff_lat / 2) ** 2 + \ |
There was a problem hiding this comment.
Please use parentheses rather than \ to continue lines
|
ping @jnothman :-) |
|
@eamanu If you cannot finish this PR I can take it over during the sklearn sprint that takes place this week. |
|
Hi @zanospi It's ok for me. I will a little busy in the next weeks. Thanks! |
|
Got a green tick. Reviews? |
rth
left a comment
There was a problem hiding this comment.
Looks good, a few minor comments are below.
sklearn/metrics/pairwise.py
Outdated
| The Haversine distance is the angular distance between two points on | ||
| the surface of a sphere. The first distance of each point is assumed | ||
| to be the latitude, the second is the longitude, given in radians. | ||
| The dimension of the points must be 2. |
There was a problem hiding this comment.
Maybe "dimension of the data"? A point doesn't have any dimensions by definition (it can be in an ND space though)...
sklearn/metrics/pairwise.py
Outdated
| def haversine_distances(X, Y=None): | ||
| """Compute the Haversine distance between samples in X and Y | ||
|
|
||
| The Haversine distance is the angular distance between two points on |
There was a problem hiding this comment.
The Haversine (or great-circle) distance is ...
| if metric != 'haversine': | ||
| nn = neighbors.NearestNeighbors(n_neighbors=3, | ||
| algorithm='auto', | ||
| metric=metric).fit(X) |
There was a problem hiding this comment.
Nit: maybe factorize the estimator definition outside of the if?
| .. math:: | ||
| D(x, y) = 2\arcsin[\\sqrt{\\sin^2((x1 - y1) / 2) | ||
| + cos(x1)cos(y1)sin^2((x2 - y2) / 2)}] | ||
|
|
There was a problem hiding this comment.
Maybe add
As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average.
to indicate that is not the exact distance between lat/long coordinates (but likely good enough for ML applications). The reference might be
"Sinnott, Roger W. (August 1984). "Virtues of the Haversine". Sky and Telescope. 68 (2): 159." but I was not able to find a PDF for it, and that sentence is adapted from https://en.wikipedia.org/wiki/Great-circle_distance
Yeah, sorry found more comments to add :) |
|
Okay. I might as well wrap it up. |
|
Thanks ! Also needs a "what's new" entry.. |
rth
left a comment
There was a problem hiding this comment.
Thanks!
(The one CI failure is due to a conda timeout and can probably be ignored).
Do we need another review or should we merge?
|
I don't think I've done anything to the implementation, only made
tests work, so I'm happy to not have another review.
|
…learn#12568)" This reverts commit e0033d9.
…learn#12568)" This reverts commit e0033d9.
Reference Issues/PRs
Fixes #12552. Resolves #4453. Resolves #4458.
What does this implement/fix? Explain your changes.
This PR add Haversine distance to NearestNeighbors using DistanceMetric.get_metric('haversine').pairwise()
This PR can be use to doing this for all metrics not directly implemented in metrics.pairwise (See: #12552 (comment))
Any other comments?