Now that we have implemented optimized native scorers for all data types, across our two supported architectures (x64 and ARM64), in both single and "bulk" variants, we need to fine-tune them.
The native functions have internal implementation choices and parameters that can be tuned: bulk size, prefetching, use of different SIMD instructions, specialized implementations for higher-tier hardware (SVE/AVX-512), etc. We also have different bulk algorithms and implementations, and different unrolling levels and mechanisms. We should assess which ones are the most effective and adopt them across the codebase, to make the code more readable, more maintainable, and consistently more efficient.
Related tasks/issues:
Add missing benchmarks/tests:
Consolidate/fix implementations:
Optimizations:
int4 x86 SIMD optimizations #144649