[MRG] Modified sklearn.metrics to enable euclidean distance calculation with NaN #9348
ashimb9 wants to merge 19 commits into scikit-learn:master from
Conversation
You might want to check why the CI is not happy.
sklearn/metrics/pairwise.py
Outdated
the distance matrix returned by this function may not be exactly
symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

Additionally, euclidean_distances() can also compute pairwise euclidean
There was a problem hiding this comment.
I suspect we should pull it out as a separate function, e.g. masked_euclidean_distances
sklearn/metrics/pairwise.py
Outdated
# NOTE: force_all_finite=False allows not only NaN but also inf/-inf
X, Y = check_pairwise_arrays(X, Y,
                             force_all_finite=kill_missing, copy=copy)
if kill_missing is False and \
There was a problem hiding this comment.
Do you mean `not kill_missing` rather than `kill_missing is False`?
sklearn/metrics/pairwise.py
Outdated
else:
    YY = row_norms(Y, squared=True)[np.newaxis, :]

distances = safe_sparse_dot(X, Y.T, dense_output=True)
There was a problem hiding this comment.
I don't get how this works if there are NaNs in X and Y still.
(Please see my previous comment.)
sklearn/metrics/pairwise.py
Outdated
    raise ValueError(
        "Incompatible dimensions for X and X_norm_squared")
else:
    XX = row_norms(X, squared=True)[:, np.newaxis]
There was a problem hiding this comment.
I don't get how this works if there are NaNs in X still
There was a problem hiding this comment.
Did you mean in case kill_missing=True but there are NaN values? I'll add a check there to avoid that scenario.
There was a problem hiding this comment.
Maybe I've just misunderstood what's going on here. Can you please make a separate function rather than having a kill_missing setting?
sklearn/metrics/pairwise.py
Outdated
kill_missing : boolean, optional
    Allow missing values (e.g., NaN)

missing_values : String, optional
There was a problem hiding this comment.
Why a string? Surely we only want this configurable in the case of integer data...?
There was a problem hiding this comment.
I'd be inclined to just assume missing_values is NaN. If we really want it configurable, it will be a number.
There was a problem hiding this comment.
If I restrict it to integers only, I'm wondering how the user can conveniently pass NaN as missing_values to the function. Should the options instead be either "NaN" or a number? Asking users to pass np.nan might not be the best option. Or is there a better way I am not aware of here?
There was a problem hiding this comment.
Yeah imputer has integer or “NaN” -- that seems like a good choice. Thanks!
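The Imputer-style convention agreed on here (the string "NaN" or a number) can be sketched as a small mask helper. This is a hypothetical sketch modeled on the `_get_mask` helper referred to later in this thread, not the PR's actual code:

```python
import numpy as np

def get_mask(X, missing_values="NaN"):
    """Boolean mask of missing entries in X.

    Hypothetical sketch: accepts either the string "NaN" or a number,
    mirroring the Imputer convention discussed above. NaN cannot be
    located with ``==`` (NaN != NaN), hence the special case.
    """
    if isinstance(missing_values, str) and missing_values.lower() == "nan":
        return np.isnan(X)
    return X == missing_values
```

For example, `get_mask(np.array([1., np.nan]), "NaN")` marks only the second entry, while `get_mask(np.array([1, -1]), -1)` works for integer sentinels.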
sklearn/neighbors/base.py
Outdated
def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
               kill_missing=True, missing_values="NaN", copy=None):
There was a problem hiding this comment.
I think we should just use a variant metric name, rather than adding all these parameters to kneighbors. Certainly, it does not belong in the method, but in the class.
sklearn/metrics/pairwise.py
Outdated
def check_pairwise_arrays(X, Y, precomputed=False, dtype=None):
def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
                          copy=False, force_all_finite=True):
There was a problem hiding this comment.
copy and force_all_finite should be added to the docstring parameters, no?
Hmm, I seem to have broken the code here. I had to do a force push due to some issues on my local machine, and that seems to have made CI unhappy again. Looking into it at the moment.

@jnothman @jaquesgrobler Hi guys -- things seem to be running OK now. I have tried to address all of your comments in this update. Please take a look at your convenience. Thank you.
jnothman
left a comment
There was a problem hiding this comment.
Just a glance at docs so far. Thanks for the quick work
sklearn/metrics/pairwise.py
Outdated
# Pairwise distances in the presence of missing values
def masked_euclidean_distances(X, Y=None, squared=False,
                               missing_values="NaN", copy=True, **kwargs):
    """
There was a problem hiding this comment.
PEP 257: there should be a one-line summary here.
sklearn/metrics/pairwise.py
Outdated
in dense matrices X and Y with missing values in arbitrary
coordinates. The following formula is used for this:

    dist(X, Y) = (X.shape[1] * 1 / ((dot(NX, NYT)))) *
There was a problem hiding this comment.
Could we give the formula in terms of vectors rather than matrices? It's much more straightforward to understand relative to the well known Euclidean distance.
There was a problem hiding this comment.
It is there, one or two paragraphs below the matrix notation: "Breakdown of euclidean distance calculation between a vector pair x,y...". Or did you mean something else?
There was a problem hiding this comment.
Well, the user doesn't care how it's computed. They care what it means and why it behaves a certain way with their data. Better with documentation that gives a functional description and does not need to be updated whenever the implementation is.
Implementation notes can be commented inside the function if they will help maintainers and curious users.
There was a problem hiding this comment.
Hmm, I am not totally sure what you mean. I have edited the docstring to change it to what I think you meant -- please let me know what you think about the following:
This formulation zero-weights feature coordinates
with missing value in either vector in the pair and up-weights the
remaining coordinates. For instance, say we have two sample points (x1,
y1) and (x2, NaN). To calculate the euclidean distance, first the square
"distance" is calculated based only on the first feature coordinate,
as the second coordinate is missing in one of the samples,
i.e., we get (x2-x1)**2. This squared distance is scaled-up by the ratio
of total number of coordinates to the number of available coordinates,
which in this case is 2/1 = 2. Now, we are left with 2*((x2-x1)**2).
Finally, if squared=False then the square root of this is evaluated
otherwise the value is returned as is.
There was a problem hiding this comment.
All I meant was that it's not especially helpful to the reader to describe the operation in terms of matrices and masks, and that should be deleted. But it can be described in terms of vectors.
I'm struggling with your description, and find "zero-weights" and "up-weights" particularly confusing.
You could state something like "In accordance with [x] we calculate Euclidean distance between vectors with some elements missing as: the sum of squared differences between elements that are not missing in either vector, scaled in inverse proportion to the number of elements not missing in either vector, and the square root taken." Alternatively "... as: the Euclidean distance between the elements that are not missing in either vector, multiplied by sqrt(vector length / number of elements not missing in either vector)." I'm not sure there about "Euclidean distance between elements". Is "the Euclidean distance between vectors consisting of the elements that are not missing in either input vector" clearer?
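The definition being negotiated here can be sketched for a single vector pair. This is a hypothetical reference implementation of the wording above, not the PR's code; `masked_euclidean` is an assumed name:

```python
import numpy as np

def masked_euclidean(x, y, squared=False):
    """Euclidean distance over the coordinates not missing in either vector,
    scaled in inverse proportion to the number of such coordinates
    (hypothetical sketch of the definition discussed above)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))  # not missing in either vector
    if not present.any():
        return np.nan  # no common non-missing coordinates
    # sum of squared differences over usable coordinates, scaled by
    # (vector length / number of usable coordinates)
    d2 = x.size / present.sum() * np.sum((x[present] - y[present]) ** 2)
    return d2 if squared else np.sqrt(d2)
```

With x = (3, 2) and y = (7, NaN), only the first coordinate is usable, so the squared distance is 2/1 * (7 - 3)**2 = 32, matching the worked (x2 - x1)**2 example above.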
array([[ 1.        ],
       [ 1.41421356]])

See also
There was a problem hiding this comment.
I think this deserves a reference to research / textbooks / encyclopaedia where Euclidean distance with missing values is used / defined.
There was a problem hiding this comment.
Sure, will add that.
sklearn/metrics/pairwise.py
Outdated
               axis=1) == Y.shape[1])):
    raise ValueError("One or more rows only contain missing values.")
#
# if kill_missing:
There was a problem hiding this comment.
My thinking was that a row of all NaNs will only introduce unnecessary NaN values into the distance matrix. It does not technically matter for this specific situation (or for kneighbors), since all it does is return a row (or column) of NaNs, but why keep it when it can introduce issues further down the analysis chain and contributes nothing useful? So the current implementation requires users to drop samples containing nothing but NaN values before passing the dataset to masked_kneighbors or masked_euclidean_distances(). However, if you would prefer to remove this check for any reason, let me know and I will get rid of it.
There was a problem hiding this comment.
I'm fine with all-NaN rows resulting in an error.
That still doesn't explain why you have a large block of commented-out code.
There was a problem hiding this comment.
Oops, that was accidentally left there.
jnothman
left a comment
There was a problem hiding this comment.
This kill_missing parameter is still a bit of a mystery to me.
sklearn/metrics/pairwise.py
Outdated
| """ | ||
| # Check and except sparse matrices | ||
| if issparse(X) or (Y is not None and issparse(Y)): | ||
| raise ValueError( |
There was a problem hiding this comment.
Perhaps check_pairwise_arrays should have an accept_sparse parameter.
sklearn/metrics/pairwise.py
Outdated
#             (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
#              np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
There was a problem hiding this comment.
Do you mean Python 2 division? we only need from __future__ import division at the top of the file (and I'm surprised it's not already there)
There was a problem hiding this comment.
It's the multiply that you've changed. I don't get why that's necessary...? Nor do I get how it could be substantially slower.
sklearn/metrics/pairwise.py
Outdated
#              np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
distances = np.multiply(np.multiply(X.shape[1],
There was a problem hiding this comment.
X.shape[1] / np.dot(NX, NYT) should suffice here
sklearn/metrics/pairwise.py
Outdated
# Get Y.T mask and anti-mask and set Y.T's missing to zero
YT = Y.T
mask_YT = _get_mask(YT, missing_values)
NYT = (~mask_YT).astype(np.int8)
There was a problem hiding this comment.
does the astype help with performance? I suspect it does not.
There was a problem hiding this comment.
No, I did that because leaving it as bool was returning incorrect values. I think the issue is that the dot product of bool matrices returns a bool, which of course means we do not get the sum of Trues as a sum of ones, as we would want for the individual dot products.
There was a problem hiding this comment.
Although this probably means that I should not use int8 either since that could be a problem when dataset has a lot of columns/features and 8 bits might not be enough ... will change that :)
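A minimal demonstration of the bool-dtype pitfall described here: NumPy evaluates the dot product of two boolean arrays in boolean arithmetic, so the count of shared non-missing coordinates collapses to True/False unless the masks are cast first.

```python
import numpy as np

# Not-missing masks for two vectors: two coordinates are present in both.
a = np.array([True, True, False])
b = np.array([True, True, True])

print(np.dot(a, b))                          # a bool, not the count we need
print(np.dot(a.astype(int), b.astype(int)))  # 2 -- the actual count
```

And as the comment above notes, a dtype as narrow as int8 could itself overflow once more than 127 coordinates are shared, so a wider integer type is the safer cast.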
sklearn/metrics/pairwise.py
Outdated
# Calculate distances

# distances = (X.shape[1] * 1 / ((np.dot(NX, NYT)))) * \
#             (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
There was a problem hiding this comment.
A comment on np.dot((X * X), NYT) might be helpful. But I can't work out how to word it, so perhaps not :)
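One possible wording, expressed as a runnable sketch (a hypothetical helper, not the PR's code): with missing entries zero-filled, each term of the expansion (x - y)**2 = x**2 - 2*x*y + y**2 is masked automatically, so np.dot(X * X, NYT) sums x_k**2 only over coordinates where the other vector is present.

```python
import numpy as np

def masked_sq_dists(X, Y):
    """Pairwise squared masked-euclidean distances (hypothetical sketch).

    NX, NY are not-missing masks cast to float; X0, Y0 are the inputs with
    NaNs replaced by zero, so any product through a missing coordinate
    vanishes automatically.
    """
    NX = (~np.isnan(X)).astype(float)
    NY = (~np.isnan(Y)).astype(float)
    X0, Y0 = np.nan_to_num(X), np.nan_to_num(Y)
    # (x - y)^2 = x^2 - 2xy + y^2, each term restricted to shared coordinates:
    # np.dot(X0**2, NY.T) sums x_k^2 only where y is present, and vice versa.
    d2 = np.dot(X0 ** 2, NY.T) - 2 * np.dot(X0, Y0.T) + np.dot(NX, (Y0 ** 2).T)
    # rescale by n_features / number of coordinates present in both rows
    return X.shape[1] / np.dot(NX, NY.T) * d2
```

For X = [[3, 2]] and Y = [[7, NaN]] this yields 2/1 * (9 - 42 + 49) = 32, consistent with the vector-pair description; row pairs with no shared coordinates divide by zero and come out non-finite.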
sklearn/metrics/pairwise.py
Outdated
| "callable" % (metric, _VALID_METRICS)) | ||
|
|
||
| # To handle kill_missing = False | ||
| kill_missing = kwds.get("kill_missing") |
There was a problem hiding this comment.
I don't get what this is all about. Why aren't we just adding 'masked_euclidean' to PAIRWISE_DISTANCE_FUNCTIONS?
There was a problem hiding this comment.
Yeah, this one was a tough call. Ideally we probably don't want the user to pass "masked_euclidean", right? Which leaves the option of checking whether kill_missing == False and, if so, using "masked_euclidean" as the metric when the user passes "euclidean". I was afraid that allowing the user to pass "euclidean" when it actually meant calling masked_euclidean might trigger all sorts of unintended things down the chain if both versions shared the same function dictionary; in other words, I was just playing it safe. But if you think it is okay, then I will do the conversion to the masked_ version by checking for kill_missing == False.
|
I would rather the user specify 'masked_euclidean'. Why not? Certainly it's
no problem in the case of Imputer.
Ohh, cool... That makes my life easier :)
@jnothman Just pushed a commit addressing the issues that you raised. Thanks a lot for all your feedback so far. PS: Please don't mind the random comment blocks for now; I will get rid of them soon :)

I'm all for that. :P
jnothman
left a comment
There was a problem hiding this comment.
Something closer to a full review.
sklearn/neighbors/base.py
Outdated
        return dist, neigh_ind
    return neigh_ind

def masked_kneighbors(self, X=None, n_neighbors=None, return_distance=True,
There was a problem hiding this comment.
Is the main point in duplicating this to pass force_all_finite=False? I don't think it's necessary.
Rather, I think we should be leaving the data validation to the pairwise_distances and ball/binary tree (which seem to validate when queried, but not when constructed), and remove it from here / make it minimal. I admit that getting this right might be tricky, but I am not happy with a solution that duplicates kneighbors (and why not radius_neighbors also?) for the sake of validating more leniently.
There was a problem hiding this comment.
Do you mean the masked_kneighbors method itself? I thought you wanted me to create a separate method to handle datasets with missing values. Initially I had the missing-value handling within kneighbors(), but I created this later. Maybe I misunderstood you then?
There was a problem hiding this comment.
Yes, must have been a misunderstanding. I only asked for separation in pairwise_distances and then only as much separation as there is between euclidean_distances and cosine_distances and manhattan_distances.
S = pairwise_distances(X, metric="masked_euclidean")
S2 = masked_euclidean_distances(X)
assert_array_almost_equal(S, S2)
# Euclidean distance, with Y != X.
There was a problem hiding this comment.
I would only keep this small test, and its point would be to check that pairwise_distances did not perform any unnecessary finiteness validation.
There was a problem hiding this comment.
So remove pairwise_distances(X, Y, metric="masked_euclidean") yeah?
There was a problem hiding this comment.
I mean that you should test pairwise_distances(X, Y, metric="masked_euclidean") once, to make sure that it does not raise an error due to NaNs
              [8., 2., 4., np.nan, 8.],
              [5., np.nan, 5., np.nan, 1.],
              [8., np.nan, np.nan, np.nan, np.nan]])
D1 = masked_euclidean_distances(X, missing_values="NaN")
There was a problem hiding this comment.
Don't bother with this. Rather, after checking that it works in the X, Y case, check that just m_e_d(X) gives the same result as m_e_d(X, X)
              [np.nan, np.nan, 5., 4., 7.],
              [np.nan, np.nan, np.nan, 4., 5.]])

D3 = np.array([[6.32455532, 6.95221787, 4.74341649],
There was a problem hiding this comment.
I'd rather see the tests show working, at least in some cases. Certainly you should test the distance calculation in the squared=True case for readability, then test the invariance that squared is meant to obtain.
assert_almost_equal(masked_euclidean_distances(X[:1], Y[:1], squared=True),
[[5/2 * ((7-3)**2 + (2-2)**2)]])
sklearn/metrics/pairwise.py
Outdated
============ ====================================
============ ====================================
metric       Function
============ ====================================
There was a problem hiding this comment.
The first underline should be longer.
sklearn/metrics/pairwise.py
Outdated
'l2'         metrics.pairwise.euclidean_distances
'manhattan'  metrics.pairwise.manhattan_distances
============ ====================================
============ ====================================
There was a problem hiding this comment.
The first overline should be longer.
sklearn/metrics/pairwise.py
Outdated
'l2'               metrics.pairwise.euclidean_distances
'manhattan'        metrics.pairwise.manhattan_distances
'masked_euclidean' metrics.pairwise.masked_euclidean_distances
============ ====================================
There was a problem hiding this comment.
The first underline should be longer.
sklearn/metrics/pairwise.py
Outdated
vector in the pair or if there are no common non-missing coordinates then
NaN is returned for that pair.

References
There was a problem hiding this comment.
We'd usually put this after Returns, before See Also (see for example additive_chi2_kernel in this file).
sklearn/metrics/pairwise.py
Outdated
# Calculate distances

distances = (X.shape[1] / ((np.dot(NX, NYT)))) * \
            (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
There was a problem hiding this comment.
please drop unnecessary parentheses around X * X and YT * YT
jnothman
left a comment
There was a problem hiding this comment.
This is starting to get there. I'm not so happy about the special-casing in neighbours, but it's alright for now.
For efficiency reasons, the euclidean distance between a pair of row
vector x and y is computed as::
There was a problem hiding this comment.
You've, I suppose accidentally, remind a whole lot of blank lines
There was a problem hiding this comment.
This is weird -- the spacing does not exist on my machine. Has to be some artifact of Github.
There was a problem hiding this comment.
Well if you removed it it doesn't exist. It existed before, but not after your PR. This is not an artifact of github.
There was a problem hiding this comment.
Haha good one. But I think I misunderstood the comment -- @jnothman what do you mean by "remind a whole lot of blank lines"? Did you mean "removed"?
There was a problem hiding this comment.
I did indeed. Reviewing in the phone is a terrible habit.
There was a problem hiding this comment.
Gotcha. I initially thought you meant I added a whole lot of blank lines, which I obviously could not find. But yeah I do see the removed blank lines, and I have no idea how it happened! Sorry about that.
sklearn/metrics/pairwise.py
Outdated
return distances if squared else np.sqrt(distances, out=distances)


# Pairwise distances in the presence of missing values
assert_array_almost_equal(D1, D2)

# check when squared = True
There was a problem hiding this comment.
Please just do the first test with squared=True then assert_almost_equal (med(X,Y)**2, med(X,Y, squared=True)).
Good tests, in my opinion, should look like a proof by induction. First you prove a base case, then you show that invariants hold in extending from the base case. The base case should ideally be something the reader can easily reason is doing the right thing, hence rational numbers or worked examples.
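The "proof by induction" advice might look like this in practice. This is a hypothetical standalone test sketch; `ref_masked_euclidean` is an assumed stand-in for the function under review, restricted to a single vector pair:

```python
import numpy as np
from numpy.testing import assert_almost_equal

def ref_masked_euclidean(x, y, squared=False):
    # Assumed stand-in for masked_euclidean_distances on one vector pair.
    present = ~(np.isnan(x) | np.isnan(y))
    d2 = x.size / present.sum() * np.sum((x[present] - y[present]) ** 2)
    return d2 if squared else np.sqrt(d2)

x = np.array([3., 2., np.nan, 4., 6.])
y = np.array([7., 2., 5., np.nan, 1.])

# Base case a reader can verify by hand: coordinates 0, 1 and 4 are shared,
# so the squared distance is 5/3 * ((7-3)**2 + (2-2)**2 + (1-6)**2).
assert_almost_equal(ref_masked_euclidean(x, y, squared=True),
                    5 / 3 * ((7 - 3) ** 2 + (2 - 2) ** 2 + (1 - 6) ** 2))

# Invariant extending the base case: squaring the unsquared result
# reproduces the squared result.
assert_almost_equal(ref_masked_euclidean(x, y) ** 2,
                    ref_masked_euclidean(x, y, squared=True))
```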
sklearn/neighbors/base.py
Outdated
def _fit(self, X):
    if self.metric in _MASKED_SUPPORTED_METRICS:
        kill_missing = False
sklearn/metrics/pairwise.py
Outdated
in dense matrices X and Y with missing values in arbitrary
coordinates.

The following formula is used for this:
There was a problem hiding this comment.
Please cut the rest of the description down to a few sentences describing the calculation between vector pairs
There was a problem hiding this comment.
I don't know why you're repeatedly ignoring this comment.
There was a problem hiding this comment.
Hmm, I did cut the description down by 8-10 lines compared to my previous commit. Sorry if it looked like I was ignoring it; that was definitely not my intention. But, anyway, I will cut it down further.
There was a problem hiding this comment.
Sorry, I hadn't noticed. I have tried to suggest that the matrix formulation here is unhelpful. You just need enough for the intuition behind calculating the metric to be clear. A couple of sentences.
sklearn/neighbors/base.py
Outdated
| @@ -355,6 +372,10 @@ class from an array representing our data set and ask who's | |||
| if self.effective_metric_ == 'euclidean': | |||
There was a problem hiding this comment.
Use `or` (or `in`) to put these cases in one.
sklearn/neighbors/base.py
Outdated
| "Nearest neighbor algorithm does not currently support" | ||
| "the use of sparse matrices." | ||
| ) | ||
| else: |
There was a problem hiding this comment.
Use elif rather than more nesting.
There was a problem hiding this comment.
Please correct me if I am mistaken, but it seems the two "if" statements following the "else" are not mutually exclusive. This would preclude the use of two "elif"s instead right? However, I think I can simply remove the "else" as it seems to be redundant there.
There was a problem hiding this comment.
Sorry, no need to use elif. You can just drop the else clause, as the preceding if clause raises an error.
@jnothman I have pushed the changes you asked for. Thanks again!

@jnothman @amueller @jaquesgrobler Hey guys -- just a friendly ping to request feedback so I can wrap this up :)
It's in my pile. I'm taking a while to work through it.
jnothman
left a comment
There was a problem hiding this comment.
I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements a n_neighbors feature in Imputer...? unless you didn't want to do that part.
sklearn/metrics/pairwise.py
Outdated
where NX and NYT represent the logical-not of the missing masks of
X and Y.T, respectively.
Formula in matrix form derived by:
Shreya Bhattarai <shreya.bhattarai@gmail.com>
There was a problem hiding this comment.
I don't think the docstring is an appropriate place to place credits about implementation details. If you wish, note it in a comment in the code.
sklearn/metrics/pairwise.py
Outdated
to be any format. False means that a sparse matrix input will
raise an error.

.. deprecated:: 0.19
There was a problem hiding this comment.
If this is being added, the deprecation note is irrelevant.
There was a problem hiding this comment.
Not sure I understand this. The deprecation note is for passing accept_sparse=None, which is not directly relevant to us?
There was a problem hiding this comment.
I mean that this behaviour is changed in check_array, not here. Deprecation is there only to help users taking advantage of a previously supported interface.
sklearn/metrics/pairwise.py
Outdated
# NOTE: force_all_finite=False allows not only NaN but also +/- inf
X, Y = check_pairwise_arrays(X, Y, accept_sparse=False,
                             force_all_finite=False, copy=copy)
if (np.any(np.isinf(X.data)) or
There was a problem hiding this comment.
Do we ever overwrite X if copy=False??
There was a problem hiding this comment.
At the moment masked_euclidean_distances() sets copy=True by default. I did that because X is altered during distance calculation whereby all NaNs are replaced with zeros. What do you think?
There was a problem hiding this comment.
Ahh sorry, I forgot that. I suspect that the user benefits little from being able to not copy (roughly the same memory is occupied by the mask), but I suppose it doesn't hurt to keep it in as long as it is tested.
sklearn/metrics/pairwise.py
Outdated
| "+/- Infinite values are not allowed.") | ||
|
|
||
| # Check if any rows have only missing value | ||
| if np.any(_get_mask(X, missing_values).sum(axis=1) == X.shape[1])\ |
There was a problem hiding this comment.
These repeated calls to _get_mask and any are relatively expensive; they should not be repeated here and below. And perhaps a helper should be factored out of Imputer.
There was a problem hiding this comment.
By a helper I mean a separate function, perhaps in sklearn.utils, that can be reused.
There was a problem hiding this comment.
Is it okay if I do that as a separate PR later? I am thinking this PR might become a little unwieldy if I modify utils on top of already having modified both pairwise and neighbors. What do you think?
There was a problem hiding this comment.
I think you should at least avoid repeating element-wise operations here. If a helper refactors between here and Imputer, yes, make the change in utils, in this PR. It is only relevant because of this PR.
There was a problem hiding this comment.
Which module within sklearn.utils do you think is most appropriate? Or a new one instead? I could not locate a clear candidate at a quick glance.
Sure, but what do I do with the old kNN imputation PR? Would you prefer I start a new one, or that I just edit that PR by referencing this instead?
forgot about that pr. great! let's work on it...
I think a few of your error messages do not have test coverage.

The new function should be listed in
sklearn/metrics/pairwise.py
Outdated
    return X
elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
    func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
sklearn/metrics/pairwise.py
Outdated
- From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
  'manhattan']. These metrics support sparse matrix inputs.
  'manhattan']. These metrics support sparse matrix
  inputs.
if issparse(X):
    if allow_nans:
        raise ValueError(
There was a problem hiding this comment.
This doesn't appear to be tested.
@jnothman Hey, thanks a lot for the comments! A quick question: given that this has been merged with the PR for KNNImputer, should I address your comments here or in the other one?
If you want to focus on KNNImputer, close this and work there. I only suggested creating this as a more tractable stepping stone for reviewers to iterate over. I still think it has value in that goal.
Ok cool, I will keep this since you think it might be useful.
Reference Issue
Issue: 4844, 2989
Prelude to: 9212
What does this implement/fix? Explain your changes.
This is actually the first part of another PR. The referenced PR implements a KNN-based imputation strategy using its own k-nearest-neighbor calculator, but it was suggested here that it might be better to make changes within sklearn.metrics instead, so that the existing sklearn.neighbors machinery could be used. Given this context, this PR modifies relevant modules in sklearn.metrics (and, to a very minor extent, sklearn.neighbors) so that the euclidean distance between samples with arbitrary missing coordinates can be calculated.