ENH Add the fused CSR dense case for Euclidean Specializations#25044
ENH Add the fused CSR dense case for Euclidean Specializations#25044lorentzenchr merged 12 commits intoscikit-learn:mainfrom
Conversation
a943f79 to
0243c1d
Compare
This was already studied in: scikit-learn#25449 Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
|
Hi @OmarManzoor, @Vincent-Maladiere, @Micky774 and @adam2392: might some of you might be interested in reviewing this PR? 🙂 |
Sure! I might need some time to understand the context of the PR though. |
|
I am wondering about the kind of tests we could add. Should we add tests on combinations of sparse and dense datasets for all the interfaces? Should we add more of them? |
Yes, feel free to take your time and do not feel pressured (the class hierarchy is rather dense and its implementations Tempita-heavy). |
This is required with Cython>=3.0.
|
Should I write more about the design? |
OmarManzoor
left a comment
There was a problem hiding this comment.
LGTM. Maybe we can enhance the tests by including the combinations within test_sqeuclidean_row_norms and test_pairwise_distances_argkmin.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
I think there are enough tests in this PR and the already existing tests in |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
6a3f58c to
45d425a
Compare
lorentzenchr
left a comment
There was a problem hiding this comment.
LGTM
PS: I had to find something:smirk:
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx.tp
Outdated
Show resolved
Hide resolved
| # uses the Squared Euclidean matrix decomposition, i.e.: | ||
| # | ||
| # ||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||² |
There was a problem hiding this comment.
Is this "trick" still documented somewhere in the pairwise distances codebase?
There was a problem hiding this comment.
Yes, it is given here:
I am thinking of writing a couple of notes for the design of sklearn.metrics._pairwise_distances_reduction. What do you think of it?
There was a problem hiding this comment.
That would be great. Not too many details, but giving a good overview.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
|
Which co-authors should I keep when merging? |
All, I guess? I let this choice to the maintainer merging. |
|
Thank you for your reviews, @OmarManzoor, @ogrisel, and @lorentzenchr. |
Reference Issues/PRs
Towards #22587.
Follow up of #24556.
What does this implement/fix? Explain your changes.
The CSR-dense and the dense-CSR cases were chosen not to be supported for the Euclidean specialization of all
PairwiseDistancesReductions(for more context see #23585 (comment)).This PR implements
SparseDenseMiddleTermComputerallows computing the middle term of the distance matrix decomposition for the Euclidean specializations, covering those two missing cases.Hence this completes all the combinations for the Euclidean specialisations (:tada:).
Any other comments?
Different designs have been explored in other Pull Requests to factor some logic altogether or rethink
DatasetsPairsw.r.t.MiddleTermComputerfor the Euclidean specialisations.In overall, this PR seems to have the best tradeoff regarding performance and duplication of code.
Benchmarks
This makes using
PairwiseDistancesReductionson the CSR-dense and the dense-CSR for euclidean competitive w.r.t to the previous implementation relying on joblib.One can get up to ×2 on a laptop.
Details
TODO:
DenseDenseMiddleTermComputer64DatasetsPairw.r.t the newMiddleTermComputerto remove the duplicated logic and to clarify responsabilities (esp. squared euclidean norm computations)DatasetsPairinstead ofArgKminandRadiusNeighbors#25170, works, but suffers from a lot of method dispatches overhead