-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
OPTICS extraction methods #12036
Description
Related
This is related to some issues we discussed and encountered in:
- OPTICS detecting the wrong outlier (Issue OPTICS detecting the wrong outlier #11677)
- fixes OPTICS split_points detected as NOISE and last point not detected as outlier (PR [WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857)
Intro
To continue the discussion, it is important to distinguish between two main tasks included in the OPTICS module/class:
- calculate the reachability plot, which is almost equivalent to a dendrogram as shown in [2].
- extract clusters
There's only one way/definition to extract the reachability plot the OPTICS way [1]. But there are multiple ways to extract the clusters, or the hierarchy of clusters from the reachability plot produced by OPTICS. The original OPTICS article [1] suggests two methods to extract the clusters:
- extract clusters based on a hard threshold on
eps[1, Fig. 8]. This method is often referred to asextract_dbscan, since there exists a DBSCAN with a set of parameters which gives the same result. - extract clusters based on the steepness observed on the reachability plot [1, Sec. 4]. The main parameter to this method is ξ (xi), and the method is often called
extract_xi(R, ELKI).
There are other methods to extract clusters from a given reachability plot (or a dendrogram), and the one included in our OPTICS's fit right now is the one introduced in [2] which is well cited. Let's call this method extract_SQLNK (for the lack of a name, taking the first character of the authors' surnames). The article argues that it produces better clusters than extract_xi, with some empirically found default values for the parameters.
Current Status
At the moment, calculating the reachability plot is taken from @espg's implementation [3], and extract_SQLNK (not with this name though) is taken from @amyxzhang's implementatoin [4].
Although the OPTICS class includes an extract_dbscan method, it's not the one used in the fit function. The fit function calculates the reachability plot and then uses extract_SQLNK to extract the clusters.
Right not the code does not include an implementation of extract_xi.
Proposal
We need 3 extract_* methods corresponding to the above 3 methods. Two of them are already implemented. We only need to implement the extract_xi.
The constructor takes a parameter as extract_method which specifies which extract method should be called in fit.
The extract_* methods run fast (O(n)) once the reachability plot is calculated. So it makes sense to have them as public functions and let the user call them after fit. The fit function uses the output of one of these extract_* functions to set the labels_. Calling an extract_* function explicitly, won't change any attribute of the class.
The extract_xi and extract_SQLNK methods are allowed to have fixes and additions on top of what's proposed in the original articles, with the same spirit that it's done in @amyxzhang's implementation and in R [5]:
The used algorithm was originally contributed by the ELKI framework, but contains a set of fixes.
@amyxzhang, @espg: is the above analysis sound to you? Am I missing something?
@jnothman, @amueller, @rth, @qinhanmin2014 (and others), what do you think?
[1] OPTICS: Ordering Points To Identify the Clustering Structure
[2] Automatic Extraction of Clusters from Hierarchical Clustering Representations
[3] https://github.com/espg/OPTICS
[4] https://github.com/amyxzhang/OPTICS-Automatic-Clustering
[5] https://rdrr.io/cran/dbscan/man/optics.html