FEA PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort#25097
FEA PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort#25097jjerphan wants to merge 5 commits intoscikit-learn:mainfrom
PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort#25097Conversation
ogrisel
left a comment
There was a problem hiding this comment.
Can you please sync with main to get github happy?
sklearn/utils/_sorting.pxd
Outdated
| cnp.npy_intp* samples, | ||
| cnp.npy_intp start, | ||
| cnp.npy_intp end, | ||
| ) nogil |
There was a problem hiding this comment.
Do we really need to export the low-level helper functions like sift_down, swp, median3 in the .pxd file?
Also I don't think we need to export both sort and introsort and I would find it more intuitive to rename:
sorttointrosort(the main API to call from other modules in scikit-learn),introsorttointrosort_recursionor_introsortor similar to make it more explicit that this function is not meant to be used directly but only useful to expose the prototype imposed by the recursive calls in the algorithm.
There was a problem hiding this comment.
By the way I think we could add a few lines of doc for each sorting function (assuming we stop exposing the helpers).
There was a problem hiding this comment.
I did iterations locally that I have not yet pushed and that are treating most of your comments.
sklearn/tree/_splitter.pyx
Outdated
| from ..utils._sorting cimport ( | ||
| sort, | ||
| swap, | ||
| median3, | ||
| introsort, | ||
| heapsort | ||
| ) |
There was a problem hiding this comment.
I have the impression that only the sort function is called in the _splitter.pyx file:
| from ..utils._sorting cimport ( | |
| sort, | |
| swap, | |
| median3, | |
| introsort, | |
| heapsort | |
| ) | |
| from ..utils._sorting cimport sort |
sklearn/utils/_sorting.pyx
Outdated
| root = maxind | ||
|
|
||
|
|
||
| cdef void heapsort(cnp.npy_float32* Xf, cnp.npy_intp* samples, cnp.npy_intp n) nogil: |
There was a problem hiding this comment.
Maybe we could rename:
Xftodataorvaluessamplestoindicesntosize
to make the prototype of this function more generic and more intuitive when used in other contexts than for tree growing. Similar comment for the other functions.
|
@ogrisel: note that this PR is experimental (hence the draft mode): I am cutting corners pragmatically to see if the algorithm can be made stable. |
|
As of d27f138, |
PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simulataneous sortPairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort
|
We might want to adapt Glide sort to have it sort a structure of arrays. |
|
Based on a IRL discussion with @ogrisel, we also might be interested in exploring https://github.com/scandum/fluxsort which should be Cython-interfaceable. |
|
@Micky774, @OmarManzoor or anyone else: Similarly to other PRs of mine, if you are interested in continuing this work, feel free to do so (I do no have time for it currently). |
|
Closing, might reopen later. |
Reference Issues/PRs
Towards #22587.
What does this implement/fix? Explain your changes.
Use a stable sort to support boolean distance metrics, as explained here:
scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
Lines 65 to 71 in 7af5297
Any other comments?
This moves sorting utilities to be able to reuse them in for
PairwiseDistancesReductions.