IBM Granite MoE Architecture by gabe-l-hart · Pull Request #9438 · ggml-org/llama.cpp

gabe-l-hart · 2024-09-11T16:22:47Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

Dependencies

This PR depends on merging the initial GraniteLM PR (IBM Granite Architecture #9412).

Description

This PR introduces the granitemoe model architecture from IBM. It emulates the changes made in the corresponding transformers PR.

The granitemoe architecture follows a very similar pattern to the granite architecture and its changes relative to llama. For the MoE variant, the base architecture is mixtral (MoE branch of llama here in llama.cpp). The same four additional multipliers are added (embeddings_multiplier, attention_multiplier, residual_multiplier, and logits_scale).
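As a rough illustration of the four multipliers named above, here is a minimal sketch. Where each multiplier is applied (embeddings scaled on input, attention scores scaled in place of the usual 1/sqrt(head_dim), block outputs scaled before the residual add, logits divided by the logits scale) is an assumption based on my reading of the Hugging Face Granite implementation, not something this PR description states. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GraniteScales:
    """Hypothetical container for the four Granite multipliers."""
    embeddings_multiplier: float
    attention_multiplier: float
    residual_multiplier: float
    logits_scale: float


def scale_embeddings(embeddings, s):
    # Assumption: token embeddings are multiplied once, right after lookup.
    return [x * s.embeddings_multiplier for x in embeddings]


def scale_attention_scores(scores, s):
    # Assumption: replaces the usual 1/sqrt(head_dim) factor on Q.K scores.
    return [x * s.attention_multiplier for x in scores]


def add_residual(residual, block_out, s):
    # Assumption: the block output is scaled before the residual add.
    return [r + o * s.residual_multiplier for r, o in zip(residual, block_out)]


def scale_logits(logits, s):
    # Assumption: final logits are divided by the logits scale.
    return [x / s.logits_scale for x in logits]
```

With all four values set to their neutral settings, the computation reduces to the unscaled llama/mixtral path, which is why the rest of the graph can be reused unchanged.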

Testing

This PR can be tested using ibm/PowerMoE-3b from huggingface following the same testing steps used for granite (here).


gabe-l-hart · 2024-09-23T14:17:34Z

Hi @compilade @ggerganov! This PR is now ready for full review.

We're eager to get the granitemoe architecture fully supported in llama.cpp (and then to follow up with support in ollama). I'm sure you are perpetually swamped, so I just want to check whether this is in your review queue at this point and whether you have any targets for merging support.

(also, thanks for the great project and all the work you do!)


gabe-l-hart · 2024-09-23T16:32:55Z

It looks like the failing test is on the windows server's Erase Slot server logs scenario. This seems like it should be unrelated to this PR. Without knowing the tests well, is there any likelihood that this is a false negative? I can dig further if needed.


ggerganov · 2024-09-23T16:48:39Z

is there any likelihood that this is a false negative?

Yes, this is unrelated to the PR, no need to investigate.

gguf-py/gguf/tensor_mapping.py

convert_hf_to_gguf.py

gguf-py/gguf/constants.py

src/llama.cpp

gabe-l-hart · 2024-09-23T20:00:50Z

Thanks for the detailed review @compilade! I believe I have all of the comments addressed at this point.

compilade

From the first few chunks of wikitext-2-raw with llama-perplexity and https://huggingface.co/ibm/PowerMoE-3b at Q8_0, I get [1]4.4570,[2]5.1116,[3]5.3469,[4]5.9955, so this does appear to work correctly.


gabe-l-hart · 2024-09-24T16:31:28Z

@compilade After you pointed out that I was missing the output tensor in the recent granitemoe snapshots, I dug a little deeper, and it seems that the model team has added it for granite (dense) as well. I've added another commit to this PR to fix that too. I'm not sure what the preferred PR hygiene is, so I'm happy to move that to a separate fix PR if the preference is for more well-encapsulated changes.


* feat(gguf-py): Add granitemoe architecture
  This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers.
  Branch: GraniteMoE
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add GraniteMoeModel
  GraniteMoe has the same configuration deltas as Granite
  Branch: GraniteMoE
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoe convert): Split the double-sized input layer into gate and up
  After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear." Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route.
  Branch: GraniteMoE
  Co-Authored-By: alex.brooks@ibm.com
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(granitemoe): Implement granitemoe
  GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite.
  Branch: GraniteMoE
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Typo fix in docstring
  Co-Authored-By: ggerganov@gmail.com
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(conversion): Simplify tensor name mapping in conversion
  Branch: GraniteMoE
  Co-Authored-By: git@compilade.net
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Remove unused tensor name mappings
  Branch: GraniteMoE
  Co-Authored-By: git@compilade.net
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Sanity check on merged FFN tensor sizes
  Branch: GraniteMoE
  Co-Authored-By: git@compilade.net
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)
  Branch: GraniteMoE
  Co-Authored-By: git@compilade.net
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granite): Add missing 'output' tensor for Granite
  This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture
  Branch: GraniteMoE
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
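The conversion-time split of "input_linear" described in the commit messages above can be sketched with NumPy as follows. This is an illustration only, not the code in convert_hf_to_gguf.py. The function names, the (n_expert, 2 * ffn_dim, n_embd) weight layout, the SiLU activation, and the choice that the first half is the gate (w1) and the second half is up (w3) are all assumptions made for the example. The size check mirrors the "sanity check on merged FFN tensor sizes" commit.

```python
import numpy as np


def split_input_linear(merged, ffn_dim):
    """Split a merged (n_expert, 2*ffn_dim, n_embd) tensor into gate and up.

    Assumption: the first ffn_dim rows are the gate (w1) projection and the
    remaining ffn_dim rows are the up (w3) projection.
    """
    if merged.shape[-2] != 2 * ffn_dim:
        raise ValueError(
            f"merged FFN dim is {merged.shape[-2]}, expected {2 * ffn_dim}"
        )
    gate = merged[..., :ffn_dim, :]
    up = merged[..., ffn_dim:, :]
    return gate, up


def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def merged_expert(x, merged, down):
    # Granite-style expert: one matmul, then chunk the result in two.
    h = merged @ x
    half = h.shape[0] // 2
    return down @ (silu(h[:half]) * h[half:])


def split_expert(x, gate, up, down):
    # Mixtral-style expert: separate gate and up projections.
    return down @ (silu(gate @ x) * (up @ x))
```

Because the single matmul over the concatenated weight yields the gate and up activations stacked end to end, slicing the weight ahead of time gives the same result as chunking the activation afterwards. That is why the split can be done once at conversion time and the standard mixtral graph reused.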

This is a port of the work done in llama.cpp directly ggml-org/llama.cpp#9438 Branch: GraniteThreeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>


gabe-l-hart force-pushed the GraniteMoE branch 2 times, most recently from 5f37be3 to 3219f58, September 11, 2024 16:29

github-actions bot added the python label Sep 11, 2024

gabe-l-hart mentioned this pull request Sep 11, 2024

IBM granite/granitemoe architecture support ollama/ollama#6760

Merged

2 tasks

compilade reviewed Sep 14, 2024

gguf-py/gguf/gguf_writer.py (outdated, resolved)

gabe-l-hart force-pushed the GraniteMoE branch 3 times, most recently from 1b235d0 to 2615459, September 17, 2024 12:46

gabe-l-hart marked this pull request as ready for review September 17, 2024 12:46

gabe-l-hart force-pushed the GraniteMoE branch from 2615459 to 474c7fb, September 23, 2024 14:17

ggerganov approved these changes Sep 23, 2024

convert_hf_to_gguf.py (outdated, resolved)

compilade reviewed Sep 23, 2024

convert_hf_to_gguf.py (outdated, resolved)

gabe-l-hart force-pushed the GraniteMoE branch from 31ed122 to 1349625, September 23, 2024 17:03

compilade reviewed Sep 23, 2024

compilade approved these changes Sep 23, 2024

gabe-l-hart and others added 9 commits September 24, 2024 10:24

feat(convert_hf_to_gguf): Add GraniteMoeModel

e0b7229

GraniteMoe has the same configuration deltas as Granite Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Typo fix in docstring

71bc4c1

Co-Authored-By: ggerganov@gmail.com Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(conversion): Simplify tensor name mapping in conversion

5eb28c4

Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(convert): Remove unused tensor name mappings

f236099

Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(convert): Sanity check on merged FFN tensor sizes

317b15b

Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix: Allow "output" layer in granite moe architecture (convert and cpp)

1c8b3e4

Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

gabe-l-hart force-pushed the GraniteMoE branch from e071bc8 to 1c8b3e4, September 24, 2024 16:24

fix(granite): Add missing 'output' tensor for Granite

a843f1f

This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

compilade added the merge ready label Sep 24, 2024

ggerganov merged commit 3d6bf69 into ggml-org:master Sep 25, 2024

gabe-l-hart deleted the GraniteMoE branch September 25, 2024 12:45

ericcurtin added a commit to containers/ramalama that referenced this pull request Oct 21, 2024

Update llama.cpp to fix granite3-moe models

883a9d4

This was added recently to llama.cpp: ggml-org/llama.cpp#9438 Signed-off-by: Eric Curtin <ecurtin@redhat.com>

ericcurtin mentioned this pull request Oct 21, 2024

Update llama.cpp to fix granite3-moe models containers/ramalama#340

Merged

gabe-l-hart mentioned this pull request Nov 4, 2024

Granite three support mozilla-ai/llamafile#608

Merged


IBM Granite MoE Architecture #9438
ggerganov merged 10 commits into ggml-org:master from gabe-l-hart:GraniteMoE


3 participants
