Clarify DBSCAN eps parameter misunderstanding #13563
jnothman merged 1 commit into scikit-learn:master
sklearn/cluster/dbscan_.py
Outdated
as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster, and the most important
Change "the most important" to "one of the most important" or "an important", since min_samples is also important?
I agree with @kno10 that min_samples is not as important because having more or fewer core samples in a region is less essential than determining whether samples are in the same (or any) cluster.
I don't like "and" here as it's not clear what "the most important ..." applies to. Start a new sentence "This is the most important"
jnothman left a comment:
You might note the importance of tuning eps in the user guide (doc/modules/cluster.rst) or summary section of the docstring. Do we have an example illustrating the effect of this parameter? What would be a good dataset to illustrate with?
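One possible way to illustrate the effect of eps is sketched below. It is not part of this PR: the two-moons dataset, the eps grid, and the `summarize` helper are all assumptions chosen for illustration. It sweeps eps at a fixed min_samples and reports how many clusters and noise points DBSCAN finds.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative dataset and eps grid; both are assumptions, not from the PR.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)


def summarize(eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    # Label -1 marks noise and is not counted as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


results = {eps: summarize(eps) for eps in (0.001, 0.1, 0.2, 0.5, 10.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps:<6} clusters={n_clusters} noise={n_noise}")
```

At the extremes the behaviour is easy to predict: a very small eps leaves every point as noise, and an eps larger than the data extent merges everything into a single cluster. The interesting values lie in between and depend on the data scale.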
As seen here: https://stackoverflow.com/a/55388827/1939754, the old description of the eps parameter can be misunderstood as a maximum distance between any two points in a cluster. Also add a reference that discusses parameterization.
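A minimal sketch of why eps is not a bound on distances within a cluster, using toy data that is assumed here and does not come from the PR: points spaced 0.5 apart on a line are chained into one cluster with eps=0.6, even though the two ends are much farther apart than eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# 20 points on a line, each 0.5 away from its neighbours (toy data, assumed).
X = (0.5 * np.arange(20)).reshape(-1, 1)

# Every point has at least one other point within eps, so with min_samples=2
# (which counts the point itself) every point is a core sample.
labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)

# Largest distance between any two points; here they all share one cluster.
diameter = float(X.max() - X.min())
n_clusters = len(set(labels) - {-1})
print(n_clusters, diameter)  # 1 9.5
```

The cluster diameter (9.5) is far larger than eps (0.6) because cluster membership is transitive through neighbouring core samples.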
Made this two sentences, added a reference with discussion of parameterization, and added a paragraph to the user guide on parameterization, too.
As seen here: https://stackoverflow.com/a/55388827/1939754
the old description of the eps parameter can be misunderstood as a maximum distance of any two points.
Also, people really need to tune this parameter, not rely on the bad default value.
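As a rough starting point for tuning, one commonly used heuristic is to look at the sorted distance from each point to its k-th nearest neighbour and pick eps near the "knee" of that curve. The sketch below is an assumption for illustration, not something this PR or its added reference prescribes; the blob dataset and the choice of `min_samples` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Hypothetical data and min_samples; replace with your own.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
min_samples = 5

# Querying the training data returns each point as its own first neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the min_samples-th neighbour (counting the point itself).
# A point becomes a core sample roughly once eps reaches its value here.
k_dist = np.sort(distances[:, -1])

for q in (0.5, 0.9, 0.99):
    print(f"{int(q * 100)}% of points have k-distance <= {np.quantile(k_dist, q):.3f}")
```

Plotting `k_dist` and choosing eps where the curve bends sharply upward is one way to avoid relying on the default eps=0.5, which is only meaningful if the data happen to be on a matching scale.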