Add partial AVX512 Linux support for dot product on 4-bit quantized values by Ameobea · Pull Request #80 · antimatter15/alpaca.cpp

Ameobea · 2023-03-20T10:32:23Z

Changes

* Update Makefile to detect AVX512 support and add compiler flags if it's available
* Add an AVX512 implementation based on the existing AVX2 one; it computes the dot product on one 32-value block of 4-bit quantized ints at a time
* Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at a time instead of 16
* Use the built-in AVX512 horizontal reduce add to get the sum at the end
* Manually unroll the inner dot product loop to reduce loop counter overhead
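For readers who want to see the shape of these steps, here is a minimal sketch in C. It is not the code from this PR: the names `dot_block_i8` and `dot_blocks_i8` are hypothetical, the inputs are assumed to be already unpacked into signed bytes, and the per-block float scale factors are left out. The AVX512 path is only compiled when the compiler defines `__AVX512F__` and `__AVX512BW__`; otherwise a scalar fallback with the same result is used.

```c
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

#define QK 32 /* values per block, matching the 32-value blocks described above */

/* Hypothetical helper: integer dot product of one block of 32 signed bytes. */
static int32_t dot_block_i8(const int8_t *x, const int8_t *y) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    /* Load 32 bytes from each input (unaligned loads). */
    __m256i bx = _mm256_loadu_si256((const __m256i *)x);
    __m256i by = _mm256_loadu_si256((const __m256i *)y);
    /* Sign-extend 32 x int8 to 32 x int16, filling one 512-bit register each. */
    __m512i x16 = _mm512_cvtepi8_epi16(bx);
    __m512i y16 = _mm512_cvtepi8_epi16(by);
    /* Multiply 16-bit lanes and add adjacent pairs, giving 16 x int32. */
    __m512i prod = _mm512_madd_epi16(x16, y16);
    /* Horizontal reduce-add across the 16 int32 lanes. */
    return _mm512_reduce_add_epi32(prod);
#else
    /* Scalar fallback, used when AVX512F/BW are not enabled at compile time. */
    int32_t sum = 0;
    for (int i = 0; i < QK; i++) {
        sum += (int32_t)x[i] * (int32_t)y[i];
    }
    return sum;
#endif
}

/* Hypothetical driver: sums over nb blocks, manually unrolled by two. */
static int32_t dot_blocks_i8(const int8_t *x, const int8_t *y, int nb) {
    int32_t sum = 0;
    int i = 0;
    for (; i + 1 < nb; i += 2) {
        sum += dot_block_i8(x + i * QK, y + i * QK);
        sum += dot_block_i8(x + (i + 1) * QK, y + (i + 1) * QK);
    }
    if (i < nb) { /* leftover block when nb is odd */
        sum += dot_block_i8(x + i * QK, y + i * QK);
    }
    return sum;
}
```

Building with flags such as `-mavx512f -mavx512bw` (or `-march=native` on a capable CPU) selects the intrinsic path; without them the scalar loop runs, which makes it easy to compare the two for correctness.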

Performance Impact

I'm seeing around a 10% speedup on the 4-bit quantized 7B model when running on my AMD 7950X.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms
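
As a quick check on the figures in the two logs above, the per-token times give the following ratios. The token counts mentioned afterwards are derived from the logged numbers and are not stated in the PR.

$$
\frac{92.17 - 82.90}{92.17} \approx 0.101, \qquad \frac{92.17}{82.90} \approx 1.112
$$

That is roughly 10% less time per token, or about 1.11x the throughput. The total predict times are not directly comparable, because they imply runs of different lengths: 23502.37 / 92.17 ≈ 255 tokens before versus 5720.41 / 82.90 ≈ 69 tokens after.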

I was hoping for more, but some other things I tried, like converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.
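
For context, a nibble-unpacking step of this kind can be sketched in scalar C as below. This is only an illustration and not the function from the codebase: the name `bytes_from_nibbles_32` is hypothetical, and the ordering (low nibble first within each byte) is an assumption.

```c
#include <stdint.h>

/* Hypothetical scalar sketch: expand 16 packed bytes (two 4-bit values each)
 * into 32 bytes holding one value in the range 0..15 each.
 * Assumption: the low nibble comes first within each pair. */
static void bytes_from_nibbles_32(const uint8_t packed[16], uint8_t out[32]) {
    for (int i = 0; i < 16; i++) {
        out[2 * i]     = packed[i] & 0x0F; /* low 4 bits  */
        out[2 * i + 1] = packed[i] >> 4;   /* high 4 bits */
    }
}
```

A vectorized version does the same work with shifts, masks, and interleaves on whole registers; handling two blocks at once would mean doing this for 32 packed bytes per call.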

antimatter15 · 2023-03-20T10:39:04Z

Looks great!

That said, I'm trying to minimize the number of deviations from the upstream https://github.com/ggerganov/llama.cpp repo, so that would be a more appropriate place for this PR!

Ameobea · 2023-03-20T10:42:37Z

Yeah I'll get a PR up there tomorrow as well.

In the meantime, if anyone else has access to a Linux machine with an AVX512-capable CPU, it would be great if they could test this to make sure it works on their setup as well.

Ameobea · 2023-03-20T11:20:42Z

Created ggml-org/llama.cpp#320

Ameobea added 2 commits March 19, 2023 23:15

Some optimizations to AVX512 code

b4d82fb

* Manual unrolling on inner dot product loop to reduce loop counter overhead
* Add some extra AVX512 compiler flags if detected in makefile

Ameobea mentioned this pull request Mar 20, 2023

Add initial AVX512 support for dot product on Linux ggml-org/llama.cpp#320

Merged

antimatter15 closed this Mar 21, 2023

dfyz mentioned this pull request Apr 15, 2023

≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() ggml-org/llama.cpp#933

Merged

Ameobea wants to merge 2 commits into antimatter15:master from Ameobea:avx512-support
