[WIP] Make spectral embedding unconditionally deterministic and related fixes #4299
raghavrv wants to merge 1 commit into scikit-learn:master
Conversation
force-pushed 1d79ac5 to 095401a
I don't think we should change the affinity here. Results should be deterministic no matter what.
I don't like the global random state. That means if you only run particular tests, the result will be different than if you run all tests. That can make debugging tricky.
Ah! I didn't think of that... ;) will delete that commit...!
Don't delete it, just
I actually unnecessarily refactored out the previously correctly placed rng initializations, assuming it would be neater... So deleting that commit should revert to the previous state, which was correct :)
force-pushed 386a31a to 071fcaf
Should ...? If so, ... To fix them both, should I send a pull request to numpy and add a fix to ...?
force-pushed dc98016 to 9c104a8
Numpy does not support sparse matrices.
done :)
But this still does not fix the issue, right? Why is it not deterministic when the graph is not fully connected?
I'll investigate further and let you know!
sklearn/utils/estimator_checks.py (outdated)
Why a special case in this common check? Once the remaining sign switch is fixed (the constant abs values case), SpectralEmbedding should be fully deterministic and there would be no need for a specific dataset in the checks, right?
Currently 2 things cause the output to be non-deterministic:

- Major one: when the data produces an adjacency graph which is not fully connected. (But a warning is raised stating that spectral embedding may not behave correctly since the adjacency graph is not fully connected, so I am not quite sure if non-deterministic output is to be expected here.)
- Minor one: when there are multiple max abs values, the sign has to be arbitrarily given to the first index etc... (I'll fix this and push the commit soon...)

So yes, if the 1st one is to be fixed then we should not special-case the input... :)
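The "minor" sign fix described above could be sketched roughly like this (a hypothetical helper — the name `deterministic_sign_flip` and the one-eigenvector-per-column layout are my assumptions, not the PR's actual code):

```python
import numpy as np

# Hypothetical helper: flip each eigenvector's sign so that its entry of
# largest absolute value is positive.  np.argmax breaks ties by taking
# the first index, which makes the flip deterministic even when several
# entries share the same max abs value (the case discussed above).
def deterministic_sign_flip(u):
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    return u * signs

u = np.array([[1.0, -2.0],
              [-3.0, 1.0]])
flipped = deterministic_sign_flip(u)
print(flipped)  # the largest-magnitude entry in each column is now positive
```

Since a sign-flipped eigenvector is still an eigenvector, this normalization changes nothing mathematically; it only pins down one of the two valid sign choices.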
Well but the first should also be deterministic! There is no reason that a non-connected graph gives non-deterministic results.
Okay! So I've (or I think I've) narrowed down the issue to a nearly singular graph laplacian when the graph is not fully connected...

- arpack's eigen solver seems to produce non-deterministic eigen vectors when the laplacian matrix has a very low determinant (order e-50). (~~Only "lobpcg" handles this perfectly, producing deterministic output every time... "amg" is not quite stable with nearly singular matrices (raises LinAlgError randomly).~~ Wrong observation: both "lobpcg" and "amg" also tend to throw errors with nearly / fully singular matrices.)
- When this particular S_X, S_y is used, the determinant of the laplacian is of the order e-20, which seems to be handled perfectly by the "arpack", "lobpcg" as well as "amg" eigen solvers.

Could you kindly confirm if my observations are true?

We could fix this by reverting to "lobpcg" when the determinant of the laplacian is very low (but that involves an additional determinant computation step).

(Also note that we currently fall back to the "lobpcg" solver when arpack raises a ~~LinAlgError~~ RuntimeError... which is when the laplacian is perfectly singular.)

NOTE: I made a wrong observation that lobpcg handles singular matrices perfectly. Both "lobpcg" and "amg" also tend to throw errors with nearly / fully singular matrices (I hadn't tested it for different random states back then).
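To see why a non-connected graph leads to a (near-)singular laplacian in the first place, here is a small numpy illustration (my own sketch, not code from the PR): the unnormalized laplacian L = D - A has one zero eigenvalue per connected component, so any disconnected graph makes det(L) vanish.

```python
import numpy as np

# Adjacency matrix of a graph with two disconnected components
# (nodes {0, 1} and nodes {2, 3}).
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Unnormalized graph laplacian L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A

# One zero eigenvalue per connected component; the eigenvectors spanning
# that null space are only determined up to an arbitrary rotation, which
# is where an iterative solver's non-determinism creeps in.
eigenvalues = np.linalg.eigvalsh(L)
n_zero = int(np.sum(np.abs(eigenvalues) < 1e-10))
print(n_zero)  # 2 zero eigenvalues, one per component
```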
What is the change in the result from the solver? Are the eigenvalues too close? And what is the behavior then?
arpack can take a v0 to seed the random vector. Unfortunately, passing a gaussian distributed v0 is not correct. The correct way to do it would be to replicate:
https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/dgetv0.f
That looks complicated though. I am no fortran developer myself.
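As a pragmatic illustration of the v0 idea (not the faithful dgetv0 replication the comment asks for), scipy's `eigsh` does accept a `v0` starting vector, and fixing it makes repeated runs reproducible:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# A fixed symmetric test matrix.
rng = np.random.RandomState(0)
M = rng.rand(20, 20)
M = M + M.T  # symmetrize

# Without v0, ARPACK draws its own random starting vector, so repeated
# calls may return eigenvectors with flipped signs.  Supplying an
# explicit v0 pins the iteration down.  (As noted above, this does not
# replicate ARPACK's own dgetv0 initialization; it is only a pragmatic
# workaround.)
v0 = rng.uniform(-1, 1, size=M.shape[0])
vecs1 = eigsh(M, k=3, v0=v0)[1]
vecs2 = eigsh(M, k=3, v0=v0)[1]
print(np.allclose(vecs1, vecs2))
```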
Thanks for the comments... This is just a minor confirmation that arpack's eigsh does not raise RuntimeError but instead just gives out random values, which is the root cause of the current issue at hand...
```python
>>> import numpy as np
>>> from scipy.sparse.linalg import eigsh
>>> N = np.array([[0,]*5]*5, dtype=float)
>>> print(eigsh(N, k=1))
(array([ 0.]), array([[-0.32529456],
       [ 0.29618396],
       [ 0.794161  ],
       [-0.04909478],
       [-0.41636105]]))
>>> print(eigsh(N, k=1))
(array([ 0.]), array([[-0.34134676],
       [-0.71507399],
       [ 0.00610712],
       [-0.58006083],
       [ 0.18879544]]))
```
Thanks for confirming @ragv. Let's try to make the common tests use the deterministic dense solver then.
Note: `np.array([[0,]*5]*5, dtype=float)` is better written `np.zeros(shape=(5, 5), dtype=np.float64)`.
> Note: `np.array([[0,]*5]*5, dtype=float)` is better written `np.zeros(shape=(5, 5), dtype=np.float64)`.

Ah thanks!! :)
> Let's try to make the common tests use the deterministic dense solver then.

Do you mean to say that in `set_fast_parameters` we could set `eigen_solver` to `"lobpcg"`?
lobpcg or dense depending on the models. I don't think the notation is consistent.
force-pushed f484f5f to b11b002
@ogrisel Even with
Also copying a chat from gitter -
@ogrisel - Nope... Follow up question. Do you want me to change
Are the eigenvalues exactly equal to zero? Then we should remove the corresponding eigenvalues, right? That should get rid of the non-determinism.
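The "remove the zero eigenvalues" idea could be sketched like this (a hypothetical helper — the name and the tolerance are mine, not the PR's code):

```python
import numpy as np

# Hypothetical helper: drop eigenpairs whose eigenvalue is numerically
# zero, since the eigenvectors spanning the null space of a singular
# laplacian are arbitrary and hence non-deterministic.
def drop_null_space(eigenvalues, eigenvectors, tol=1e-10):
    keep = np.abs(eigenvalues) > tol
    return eigenvalues[keep], eigenvectors[:, keep]

vals = np.array([0.0, 1e-15, 0.5, 2.0])
vecs = np.eye(4)  # one eigenvector per column
vals_kept, vecs_kept = drop_null_space(vals, vecs)
print(vals_kept)  # only the two nonzero eigenvalues remain
```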
For the rename: leave it as is currently. Our policy on what is public and what is private in
sklearn/utils/tests/test_extmath.py (outdated)
force-pushed b11b002 to c4df19d
force-pushed ad684c6 to b8b6bb8
Fixes #4236

Started off as addressing #4249 (comment)

- Add `deterministic_vector_sign_flip` to make sure the sign of the first element of every eigen vector is positive always.
- Make `SpectralEmbedding` unconditionally deterministic even when the graph is not fully connected. (Doesn't fix it fully.) Use the `"lobpcg"` solver in tests instead for `SpectralEmbedding` in the `check_pipeline_consistency` test.
- (Another PR) Make `LaplacianEmbedding` accept lists and add tests for the same.
- Use data that does not lead to the "graph not fully connected" warning, only to avoid cluttering the output.

ping @amueller
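The end-to-end property this PR is after can be checked with a quick script (a sketch; using `affinity="rbf"` here is my choice, to keep the affinity graph fully connected and sidestep the disconnected-graph case discussed in this thread):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Fit twice on the same data with the same random_state; if
# SpectralEmbedding is deterministic, the embeddings should match
# exactly.  affinity="rbf" yields a fully connected affinity graph.
rng = np.random.RandomState(42)
X = rng.rand(30, 5)

e1 = SpectralEmbedding(n_components=2, affinity="rbf",
                       random_state=0).fit_transform(X)
e2 = SpectralEmbedding(n_components=2, affinity="rbf",
                       random_state=0).fit_transform(X)
print(np.allclose(e1, e2))
```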