ggml : alternative Q4_3 implementation using modified Q8_0 by ggerganov · Pull Request #1109 · ggml-org/llama.cpp

ggerganov · 2023-04-21T17:56:41Z

This one looks promising: it does not change the Q4_3 format from master and only slightly modifies Q8_0 by adding low and high sums. The results should be identical, but the Q4_3 dot product now evaluates much faster:

#define QK8_0 32
typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_0];  // quants
} block_q8_0;

llama_print_timings:      sample time =    47.11 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   482.44 ms /     8 tokens (   60.30 ms per token)
llama_print_timings:        eval time =  3419.36 ms /    63 runs   (   54.28 ms per run)
llama_print_timings:       total time =  3959.05 ms
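
To illustrate why the extra sums help, here is a minimal scalar sketch. It assumes that a Q4_3 value is reconstructed as `d * q + m`, that two 16-element Q4_3 blocks line up with one 32-element Q8_0 block, and that `s0`/`s1` cover the first and second half of `qs`. The type `q4_3_sketch` and the three functions are hypothetical names for this illustration, not code from the PR; the sketch also uses plain floats and unpacked nibbles to stay short. Under those assumptions, the `m` term of each Q4_3 block multiplies a sum that is already stored in the Q8_0 block, so it does not have to be recomputed for every dot product.

```c
#include <assert.h>
#include <stdint.h>

#define QK8_0 32
#define QK4_3 16   // assumption: two Q4_3 blocks per Q8_0 block

typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_0];  // quants
} block_q8_0;

// Hypothetical simplified Q4_3 block: x[i] ~= d * q[i] + m, with q[i] in [0, 15].
typedef struct {
    float   d;
    float   m;
    uint8_t q[QK4_3];
} q4_3_sketch;

// Quantize 32 floats to 8 bits and store the scaled half-block sums.
static void quantize_q8_0_sketch(const float *y, block_q8_0 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; i++) {
        const float a = y[i] < 0.0f ? -y[i] : y[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < QK8_0; i++) {
        const float v = y[i] * id;
        const int   q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f)); // round to nearest
        out->qs[i] = (int8_t) q;
        if (i < QK8_0 / 2) sum0 += q; else sum1 += q;
    }
    out->d  = d;
    out->s0 = d * (float) sum0;
    out->s1 = d * (float) sum1;
}

// Dot product using the stored sums: only the q * qs products need a loop.
static float dot_sketch(const q4_3_sketch *x0, const q4_3_sketch *x1,
                        const block_q8_0 *y) {
    assert(x0 && x1 && y);
    int isum0 = 0, isum1 = 0;
    for (int i = 0; i < QK4_3; i++) {
        isum0 += x0->q[i] * y->qs[i];
        isum1 += x1->q[i] * y->qs[QK4_3 + i];
    }
    return x0->d * y->d * (float) isum0 + x0->m * y->s0
         + x1->d * y->d * (float) isum1 + x1->m * y->s1;
}

// Reference: dequantize both sides element by element and multiply.
static float dot_reference(const q4_3_sketch *x0, const q4_3_sketch *x1,
                           const block_q8_0 *y) {
    float sum = 0.0f;
    for (int i = 0; i < QK4_3; i++) {
        sum += (x0->d * x0->q[i] + x0->m) * (y->d * y->qs[i]);
        sum += (x1->d * x1->q[i] + x1->m) * (y->d * y->qs[QK4_3 + i]);
    }
    return sum;
}
```

Both functions compute the same quantity up to floating-point rounding; the difference is that `dot_sketch` pays for the offset term with two multiply-adds per Q8_0 block instead of a per-element sum.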

I think this is the way to go. But let's see the ppl results from the Q4_3a approach (#1108) first.

ggerganov · 2023-04-21T20:20:27Z

Will fix the AVX2 implementation tomorrow and merge it

sw · 2023-04-21T20:50:40Z

ggml.c

As mentioned in #1099 where I intend to fix this, the #if condition is wrong here, causing the code below to be executed for AVX2, essentially duplicating the work. Just a thing to keep in mind or fix when measuring performance.

ggerganov mentioned this pull request Apr 21, 2023

ggml : alternative Q4_3 format + implementation #1108

Closed

ggerganov marked this pull request as ready for review April 21, 2023 20:14

sw reviewed Apr 21, 2023

ggerganov mentioned this pull request Apr 22, 2023

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

Merged

ggerganov added 4 commits April 22, 2023 10:37

ggml : prefer vzip to vuzp

ec805ee

This way we always use the same type of instruction across all quantizations

ggml : alternative Q4_3 implementation using modified Q8_0

5425e06

ggml : fix Q4_3 scalar implementation

829c480

ggml : slight improvement of Q4_3 - no need for loop unrolling

76b6b26

ggerganov force-pushed the q4_3b branch from 25b41a3 to 76b6b26 Compare April 22, 2023 07:42

ggml : fix AVX paths for Q8_0 quantization

2c358ec

ggerganov merged commit 955ef9a into master Apr 22, 2023

ggerganov deleted the q4_3b branch April 22, 2023 07:55

This was referenced Apr 22, 2023

Q8_0: unbreak AVX #1117

Closed

Continuous layouts for quantization q4_0c #1073

Closed

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremely slow prompt processing when using quantized kv cache. #12352

Closed

ggerganov merged 5 commits into master from q4_3b
