AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099
ggerganov merged 3 commits into ggml-org:master
Except with perplexity, the performance looks good compared to q4_1. I am not sure why there is a discrepancy there.
Before merging this: the current Time per token on M1 Pro:
I want to bring it close to 50-60 ms / token. I will try to optimize this with the highest priority, so we can decide on the final
Well, #1083 was a bit rushed IMO, but I tried to address the loose ends. For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on AVX.
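To illustrate the AVX vs. AVX2 distinction for a horizontal sum of ints, here is a minimal sketch. It is not the code from this PR or from either suggestion. The function names are hypothetical, and the `target` attributes assume GCC or Clang on x86. The first variant does its integer arithmetic in 128-bit registers, so it needs only AVX. The second uses the 256-bit `_mm256_hadd_epi32`, which exists only in AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: sum 8 int32 values using only 128-bit integer adds.
// The 256-bit load, cast and lane extraction are all plain AVX.
__attribute__((target("avx")))
static int hsum_i32_8_avx(const int32_t v[8]) {
    const __m256i x  = _mm256_loadu_si256((const __m256i *) v);
    const __m128i lo = _mm256_castsi256_si128(x);       // elements 0..3
    const __m128i hi = _mm256_extractf128_si256(x, 1);  // elements 4..7
    __m128i sum = _mm_add_epi32(lo, hi);                // 4 partial sums
    // Add the upper pair onto the lower pair, then the odd element onto the even one.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);                      // element 0 holds the total
}

// Hypothetical AVX2-only variant: 256-bit horizontal adds work within each
// 128-bit lane, so the two lane totals are combined at the end.
__attribute__((target("avx2")))
static int hsum_i32_8_avx2(const int32_t v[8]) {
    __m256i x = _mm256_loadu_si256((const __m256i *) v);
    x = _mm256_hadd_epi32(x, x);  // per lane: pairwise sums
    x = _mm256_hadd_epi32(x, x);  // per lane: every element is the lane total
    const __m128i lo = _mm256_castsi256_si128(x);
    const __m128i hi = _mm256_extracti128_si256(x, 1);
    return _mm_cvtsi128_si32(_mm_add_epi32(lo, hi));
}
```

Both functions return the same value for the same input. Which one is faster depends on the CPU, which is consistent with no measurable difference being seen here.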
Finally, I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for
Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline functions.
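As a rough illustration of this kind of refactoring, the sketch below moves a repeated intrinsic sequence (the horizontal sum of 8 floats) into a `static inline` helper and calls it from a simple dot product. The names `hsum_f32_8_sketch` and `dot_f32_sketch` are hypothetical and are not taken from this PR. The `target` attribute assumes GCC or Clang on x86.

```c
#include <immintrin.h>

// Hypothetical helper: horizontal sum of the 8 floats in a __m256.
// Keeping the sequence in one inline function lets every kernel reuse it.
__attribute__((target("avx")))
static inline float hsum_f32_8_sketch(const __m256 x) {
    __m128 res = _mm256_extractf128_ps(x, 1);           // upper 4 floats
    res = _mm_add_ps(res, _mm256_castps256_ps128(x));   // + lower 4 floats
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));     // fold 4 -> 2
    res = _mm_add_ss(res, _mm_movehdup_ps(res));        // fold 2 -> 1
    return _mm_cvtss_f32(res);
}

// Hypothetical caller: plain float dot product that uses the helper
// for the final reduction instead of repeating the intrinsics.
__attribute__((target("avx")))
static float dot_f32_sketch(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float sum = hsum_f32_8_sketch(acc);
    for (; i < n; ++i) {
        sum += a[i] * b[i];  // scalar tail for leftover elements
    }
    return sum;
}
```

Because the helper is `static inline`, the compiler can still expand it at each call site, so the refactoring should not by itself add call overhead.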