Prediction label retrieval should be moved to after the first fit call#12267
Prediction label retrieval should be moved to after the first fit call#12267oleksandr-pavlyk wants to merge 1 commit intoscikit-learn:masterfrom
Conversation
|
No, We indeed should test more explicitly that |
|
Discussion was here: #10978 |
|
can you please instead check that even with a single iteration calling |
jnothman
left a comment
There was a problem hiding this comment.
The point is that you need to run check_random_state only in fit not in __init__.
|
@jnothman I checked the code and we only call @amueller I am not sure I completely understood your request, but here is my attempt at verifying that calling I am also unsure whether you think the proposed change is not a good one, and is indicative of an issue with our changes. Please clarify |
|
Sorry for being unclear. Yes, I don't think your proposed change is good. Needing this change means your estimator violates sklearn api. |
|
I checked that with our changes does not raise asserts, but does. It has nothing to do |
|
Concurrently with the current discussion I will try to understand what determines the order in which cluster centers are returned in DAAL. |
|
Why does |
|
@amueller Would it be OK if the check that |
|
Aha. It just so happened that for |
|
I don't think using |
|
Ok, fair enough. I resolved it on my side. |
What does this implement/fix? Explain your changes.
In
estimator_checks::check_clusteringa check is made that labels produced byfitwith the givenrandom_stateare the same as the result offit_predictwith the samerandom_state.However, the actual retrieval of labels was not made after call
fit(X), but after subsequent callfit(X.tolist()), which may have consumed from the random stream, which may, in some implementations of KMeans, like that based on Intel DAAL, cause cluster centers to be reordered, and labels to come out equivalent up to a permutation causing a failure.Any other comments?
Hoping for no objections here :)
@GaelVaroquaux