FIX Adjusts xi in test_extract_xi#14201
Conversation
sklearn/cluster/optics_.py
Outdated
| The instance. | ||
| """ | ||
| X = check_array(X, dtype=np.float) | ||
| X = check_array(X, dtype=np.float32) |
There was a problem hiding this comment.
Hmm, if something fails in 64bit due to numerical issues, I don't understand why it would not fail in 32bit. What's the advantage of doing this? I would say keeping f64 is always good as far as numerical precision is concerned.
There was a problem hiding this comment.
This doesn't sound like a good idea to me. Most estimators accept a float64, if not enforce it. Which means usually if the user wants the data not to be copied, they can feed a float64 data. This change forces a copy on all those data, other than loosing precision.
There was a problem hiding this comment.
Most CI jobs pass and we're unable to find any related bugs, so I guess it's not worthwhile to drop float64 support in order to make CIs green. Maybe increase the tolerance or construct another dataset?
|
Maybe if this is fixing the issue, we should better generate the data for the tests? |
sklearn/cluster/tests/test_optics.py
Outdated
| X, expected_labels = shuffle(X, expected_labels, random_state=rng) | ||
|
|
||
| if _IS_32BIT: | ||
| X = X.astype(np.float32) |
There was a problem hiding this comment.
float64 should still work on 32bit arch. If it fails it may mean that something internally in OPTICS creates arrays of dtype np.int (or np.intp) instead of X.dtype...
There was a problem hiding this comment.
Time for a deep dive into OPTICS. I hope I have better vision when I surface.
|
Here is where the error is coming from: X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
-1, -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)
clust = OPTICS(min_samples=3, min_cluster_size=3,
max_eps=20, cluster_method='xi',
xi=0.1).fit(X)
# this may fail if the predecessor correction is not at work!
assert_array_equal(clust.labels_, expected_labels)which is failing on: E AssertionError:
E Arrays are not equal
E
E Mismatch: 18.8%
E Max absolute difference: 3
E Max relative difference: nan
E x: array([ 0, 0, -1, -1, 1, 3, 3, 2, 0, 3, 3, -1, 1, 1, -1, 2, -1,
E 4, 0, -1, 4, 0, 4, 2, -1, 1, 1, 4, 2, 3, 4, -1])
E y: array([ 0, 0, 2, 2, 1, 3, 3, 2, 0, 3, 3, 2, 1, 1, 2, 2, -1,
E 4, 0, 2, 4, 0, 4, 2, -1, 1, 1, 4, 2, 3, 4, 2])Some of the clusters in |
|
In the following line: scikit-learn/sklearn/cluster/optics_.py Line 476 in f339609 the In 64 bit: reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479917and in 32 bit: reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479916This is why in 32bit, 15 is chosen to be the Edit:
|
|
@thomasjpfan that's probably because it's the nearest neighbor algorithm returning different values based on the dtype/architecture. I had seen this issue, and kinda solved it with changing the dataset. It doesn't look like an optics issue to me, i.e. this optics test is surfacing an issue in the neighbors algorithm, I think. |
|
This PR was reduced to only updating the |
jnothman
left a comment
There was a problem hiding this comment.
This is quite a comfortable fix!!
jnothman
left a comment
There was a problem hiding this comment.
This is quite a comfortable fix!!
|
LGTM, thanks. |
Reference Issues/PRs
Fixes #13739
What does this implement/fix? Explain your changes.
Adjusts xi in test_extract_xi
Any other comments?
This test pass on i686 on scikit-learn-wheels configured to point to this branch.
Here is the diff of scikit-learn-wheels used to run the test.
cc @qinhanmin2014 @adrinjalali