DRAFT ENH Introduce AVX SIMD instructions for ManhattanDistances.dist#27145
DRAFT ENH Introduce AVX SIMD instructions for ManhattanDistances.dist#27145Micky774 wants to merge 21 commits intoscikit-learn:mainfrom
AVX SIMD instructions for ManhattanDistances.dist#27145Conversation
|
Thinks that came to my mind seeing this:
Otherwise I think this is pretty cool. |
That's something I'm hoping we can discuss a bit more. I'm unfamiliar with Rust, and although I know it is picking up popularity pretty quickly, afaik it does make it harder to work on relatively complex data structures (the antagonistic example is doubly-linked-lists) but I don't honestly know to what degree. My (comparative) familiarity with C++biases me. I'd love for someone more familiar with Rust to educate me in this regard :) I do think that there is potentially a lot of room to improve code readability by leveraging C++ templating and binding/importing via Cython. This is actually a line-item for discussion in the new native performance discussion group (see notes) although has not been broadly discussed yet. IMO it is easier to slowly-but-safely transition from pure Cython/C to some C++ in key areas than to introduce Rust, but I've not looked at any example implementations so I can still be swayed. I do also believe that the ability to include a relatively simple static dispatch to a widely supported SIMD target, propagating performance gains across a potentially wide swathe of the library is itself a strong merit for C++ inclusion (even if not widespread adoption).
Right, that seems to be a common opinion. I have it added as a submodule right now for (my) convenience but it is simple to vendor it as a header-only library (they're under BSD-3 so licensing is no issue either). I don't have a strong preference one way or the other so I think we can navigate around that. Edit: It's also not necessary at runtime at all, so we could try to specify it as a build-time requirement for users if they want SIMD acceleration and setup CI solutions for wheel building. I'm not sure what the best way to do that is, since (as usual) windows makes things tricky |
I would be very skeptical of that. C++ has become a monster. I would agree with you maybe 20 years ago though. But I've done enough C++ that I really wouldn't want us to maintain C++ code for multiple platforms.
I don't mind vendoring some C++ library at all, as long as we don't have to maintain it, and it doesn't give us much headache to support multiple platforms. And as long as the only thing we have to do is to add a very light binding layer which could also be auto-generated, then we should be fine. |
|
As discussed on the last monthly meeting (though it is not reported in its notes), the introduction of C++ code (be it done via this PR or another one) necessitates considering several technical aspects (packaging consideration, linkage and build system, depending on third party library or vendoring it, etc.) and several social aspects (when do we want to use C++ over existing and used solutions like Cython, who gets to maintain C++ implementations for those part of the project to have an healthy bus-factor, what are the conventions around C++, etc.) of scikit-learn. We decided that such choices needs to be discussed and voted in favor or in disfavor via a SLEP (I invoke @lorentzenchr for confirmation). IIRC, @Micky774 (let me know if I am wrong) proposed to author one; I also propose helping with writing it. |
|
Yes, let’s discuss in a slep which has as its goal the usage of simd instructions. Rust, for instance, would then be one alternative. Let me add here that we hit the boundaries of Cython several times, e.g. usage of tempita templates. |
Reference Issues/PRs
A follow-up to #26010
What does this implement/fix? Explain your changes.
This PR demonstrates the minimal infrastructure required for proper building and runtime dispatch of
AVXSIMD instructions. This means that scikit-learn can optionally be built withAVXinstructions. If, at runtime,AVXsupport is found on the runtime machine and scikit-learn was built withAVXsupport, then the optimizedAVXinstructions are leveraged when computing distances. Any other case results in the usual scalar loops currently found onmain.Concrete changes
_simd/directory housing new filessimd.hppwhich provides a definition and extern declarations of templated Manhattan distance functions (xsimd_manhattan_dist)simd.cppwhich provides externed declarations ofxsimd_manhattan_dist. This file serves as a translation unit to target with-mavxfeature flags.avx_dist_metricslibrary which is linked against to isolate the rest of_dist_metricsfrom requiring additional feature flags (preventing auto-vectorization of unintended code, e.g. backup scalar loops)WITH_SIMDmacro to correctly compile SIMD optimized code, or default to a trivial implementation to satisfy compilationDIST_METRICSmacro to correctly allow inclusion of_dist_optim.cppin_dist_metrics.pyxbut not in other Cython files (necessary since anycimportof_dist_metrics.pyxforces the inclusion of_dist_optim.cpp)Adds build and runtime dependency onUsesarchspecto provide out-of-the-box supportxsimddirectly for runtime microarchitecture feature detection (to determine whether building withAVXis possible, and whether running withAVXis possible) via_runtime_check.pyxand by compilingruntime_check_programinsetup.pyfor a quick build-time check.xsimdas a submodule (~2MBbut could be reduced by half if we only include theincludedirectory)self.{r}dist(...)calls to<Class>.rdist_csr(self, ...)to work around Cython's idiosyncrasies when compiling to C++TODO:
Resolve whether to add a dependency onUsesarchspecor vendor itxsimddirectlyxsimdas a submodule, vendor an instance, or only include as build-time dependency for wheels and conda-forgeAny other comments?
AVXinstructions were deployed in late 2011 and 2012 and thus are likely found in a significant portion of our users' machines. Honestly, we may want to consider going toAVX2directly since it was released only a couple years later...I don't know of any comprehensive surveys in the ML field which indicate the presence of these features on work machines and laptops.
Initial benchmarks generated with this script (exported from a Jupyter notebook).
Note
PR_noruntimeis the case where the library is built withAVXsupport, but the runtime check fails to detectAVXcapabilities.PR_nobuildis built withoutAVXcapabilities.PRis built withAVXcapabilities, and uses them at runtime.There are no slowdowns compared to main, and when built with
AVXsupport, and it is detected at runtime, we see minor speedups even in cases of smalln_features, to drastic speedups (4-7x) whenn_featuresis large.Plots