In OPTICS (`_extract_optics`), we include several parameters which do not appear in the original paper (Automatic Extraction of Clusters from Hierarchical Clustering Representations), including `rejection_ratio`, `significant_min`, and the ratio of points in the child that we check (not exposed to users). I can't understand why we need these parameters. I think we took this code from https://github.com/amyxzhang/OPTICS-Automatic-Clustering, and she noted in her code: "An implementation of the following algorithm, with some minor add-ons." I think that, as scikit-learn, we should check whether these add-ons are reasonable and necessary. E.g.,
- For `rejection_ratio`, the original paper says that "We experimented with different ratios and in fact, any value in the range 0.7-0.8 always gives good results", so I guess it won't make too much difference?
- For `significant_min`, I don't think it makes sense to users, since we have already normalized RD (reachability distance) at this point.
- For the magic 0.8 (ratio of points in the child we check) inside the code, I think we should remove it, or at least make it public (I won't vote +1 for it at this point).
- I think we need to allow users to pass an int to `min_maxima_ratio`, like `min_cluster_size`. And the relationship between `min_cluster_size` and `min_samples` is still unclear.
- We're using the number of points when checking whether a point needs to be moved; that looks wrong, right? We should use RD.
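
To make the int-or-float suggestion concrete, here is a minimal sketch of how one parameter could accept either an absolute number of points or a fraction of the dataset. `resolve_size` is a hypothetical helper, not an existing scikit-learn function, and the convention it uses (an int is a count, a float in (0, 1] is a fraction of `n_samples`) is my assumption rather than what `_extract_optics` does today.

```python
import numbers


def resolve_size(value, n_samples, name="min_maxima_ratio"):
    """Convert an int count or a float fraction into an absolute point count.

    Hypothetical helper for illustration only; not part of scikit-learn.
    Assumed convention: an int is an absolute number of points, a float in
    (0, 1] is a fraction of n_samples.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError(f"{name} must be an int or a float, got bool")

    if isinstance(value, numbers.Integral):
        if not 1 <= value <= n_samples:
            raise ValueError(
                f"{name} must be in [1, {n_samples}] when given as an int, "
                f"got {value}"
            )
        return int(value)

    if isinstance(value, numbers.Real):
        if not 0 < value <= 1:
            raise ValueError(
                f"{name} must be in (0, 1] when given as a float, got {value}"
            )
        # Never resolve to zero points, even for very small fractions.
        return max(1, int(value * n_samples))

    raise TypeError(f"{name} must be an int or a float, got {type(value).__name__}")


# Both spellings resolve to the same absolute count for 100 samples.
count_from_int = resolve_size(50, 100)
count_from_fraction = resolve_size(0.5, 100)
```

Dispatching on the type rather than on the value keeps `1` (one point) distinct from `1.0` (the whole dataset), which is the ambiguity a shared int/float parameter would otherwise introduce.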
ping @espg