Granite three support #608

Merged
cjpais merged 4 commits into mozilla-ai:main from gabe-l-hart:GraniteThreeSupport
Mar 17, 2025

Conversation

@gabe-l-hart
Contributor

@gabe-l-hart gabe-l-hart commented Nov 4, 2024

Description

This PR adds support for the "granite" and "granitemoe" architectures in order to support IBM's Granite 3.0. The changes mirror those added upstream in llama.cpp.

These models are currently available via HuggingFace and Ollama.

Testing

I did my development on a Mac M3 without gmake natively installed. To avoid a system-level install, I wrapped my dev environment in docker with the following two scripts:

Dockerfile
FROM ubuntu
RUN apt-get update && apt-get install -y build-essential unzip wget
build_dockerized.sh
#!/usr/bin/env bash

cd "$(dirname "${BASH_SOURCE[0]}")"

docker buildx build . -t llamafile-builder:latest --load
docker run --rm -it --entrypoint bash -w /src -v "$PWD:/src" -v "$HOME/models:/models" llamafile-builder:latest
build_in_docker.sh
#!/usr/bin/env bash

gguf_file="$1"
if [ $# -ge 2 ]
then
    model_name="$2"
else
    model_name="$(basename "$gguf_file" | cut -d'.' -f 1)"
fi
echo "Model Name: $model_name"

# Build (NOTE: First build may fail due to the need to download tools)
make -j || make -j

# Install the built binaries
make install PREFIX=/usr/local

# Make a temp dir to work in
start_dir="$PWD"
temp_dir="$(mktemp -d)"
cd "$temp_dir"

# Copy over the model and base binary
echo "Copying source materials..."
cp "$gguf_file" .
cp "$(which llamafile)" "$model_name.llamafile"

# Make the .args file
echo "Making .args file..."
echo "-m
$(basename "$gguf_file")
--host
0.0.0.0
-ngl
9999
..." > .args

# Pack it all together
echo "Packing with zipalign..."
zipalign -j0 "$model_name.llamafile" "$(basename "$gguf_file")" .args

# Move it back to the root dir
mv "$model_name.llamafile" "$start_dir/"
echo "DONE"

With these scripts, my workflow was:

  1. Download pre-quantized versions of the models (e.g. ollama pull then grab the $HOME/.ollama/models/blobs/... blob for the GGUF file)
    • NOTE: IBM does not currently host official quantized versions, but there are many community quantizations available on Hugging Face (dense, moe)
  2. Launch the docker build shell (./build_dockerized.sh)
  3. Build the llamafile inside (./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b)
  4. Run the llamafile outside the docker shell (./granite3-dense-2b.llamafile -p "tell me a story")
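Taken together, the workflow above can be sketched as a single script. This is a rough rendering, not part of the PR: `find_gguf_blob` is a hypothetical helper (Ollama stores pulled models as content-addressed blobs, so it identifies GGUF files by their 4-byte magic rather than by filename), and the model paths are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the workflow above (assumptions noted in the lead-in).

find_gguf_blob() {
    # Print every regular file in the given directory whose first four
    # bytes are the GGUF magic ("GGUF").
    for blob in "$1"/*; do
        [ -f "$blob" ] || continue
        if [ "$(head -c 4 "$blob")" = "GGUF" ]; then
            echo "$blob"
        fi
    done
}

# 1. Locate the GGUF blob pulled by `ollama pull`:
#      find_gguf_blob "$HOME/.ollama/models/blobs"
# 2. Launch the docker build shell:   ./build_dockerized.sh
# 3. Build the llamafile inside:      ./build_in_docker.sh /models/<model>.gguf <name>
# 4. Run it outside the container:    ./<name>.llamafile -p "tell me a story"
```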

Open Questions

Solved! I found the PR added after mine in llama.cpp to update the chat template to support "granite": ggml-org/llama.cpp#10013

When running in interactive mode, the chat template seems to use different special tokens from those defined in the chat_template metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
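One quick way to sanity-check what a GGUF file actually carries is to look for the `tokenizer.chat_template` metadata key directly in the binary. This is a crude heuristic sketch, not a real GGUF parser; the file path in the comment is illustrative.

```shell
has_chat_template() {
    # Heuristic: GGUF stores the chat template under the
    # tokenizer.chat_template metadata key; grep -a scans the binary
    # file as text and succeeds if the key string is present.
    grep -aq 'tokenizer.chat_template' "$1"
}

# Example (path illustrative):
#   has_chat_template /models/granite-3.0-2b-instruct.Q4_K_M.gguf \
#       && echo "template key present" || echo "no template key"
```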

@DK013

DK013 commented Nov 6, 2024

I was waiting for this. Thanks a lot for your hard work mate @gabe-l-hart

@BradHutchings

Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both!

-Brad

@BradHutchings

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

@gabe-l-hart
Contributor Author

Hi @jart! I wanted to check in and see if this PR is something you would consider for upstream merging. I see that you use llama.cpp/README.llamafile to track the version of llama.cpp being used and the list of local modifications on top. I didn't see a clean way to re-bump the commit and apply those deltas, but I'd be happy to re-do this change set to be a full llama.cpp bump if that's preferred.

@DK013

DK013 commented Nov 24, 2024

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

I have been wanting to try it but wasn't getting enough time to sit and resolve the errors on a windows machine. @BradHutchings would you mind sharing your build so I can run some tests as well?
Thanks in advance

@pawel665j

pawel665j commented Nov 24, 2024 via email

@BradHutchings

@DK013 My llamafile builds are here: https://huggingface.co/bradhutchings/DemoMachine-LLMs

@DK013

DK013 commented Jan 22, 2025

@jart any update when we can get an official build with this merged?

@cjpais
Collaborator

cjpais commented Mar 13, 2025

I'll give it a spin and plan for it to be in 0.9.2

@cjpais
Collaborator

cjpais commented Mar 14, 2025

It does work, but there is an issue where the model likes to generate endlessly (at least in the llamafile default cli). I suspect it's an issue with the chat template.

On the server side, we see this error:

{"function":"validate_model_chat_template","level":"ERR","line":491,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"12271296","timestamp":1741972043}

However, generation seems to work properly.

It would be nice for this to work properly before merging. FWIW, I do see the code for the chat template, but it seems like maybe it's not being picked up or respected? It's possible this could be a change needed in llamafile itself to pick it up; I will take a look when I can as well.

@gabe-l-hart
Contributor Author

Thanks for looking at this @cjpais! You're almost certainly right that it's a chat template issue. I haven't looked at Llamafile much since opening this PR, but it looks like you haven't bumped llama.cpp in quite a while (since before this PR). Do you have plans to bump in the future, or will this continue as a logical fork from an earlier point in history? I ask because all of these changes (plus the chat template fix) would come in with a bumped llama.cpp.

@cjpais
Collaborator

cjpais commented Mar 14, 2025

Right now I have no plans of bumping it, though I do think it should be done. I think @stlhood could chime in here if there's a specific plan to get the upstream changes. I do think it would make the project much more maintainable, but I suspect it would be quite the effort to get the upstream changes

It's definitely on the mind as we would like to get in Gemma 3 support and other models in the future as well.

Edit: @gabe-l-hart, if you do have time to take a look, I would be happy to test any changes and get things merged, just let me know what's possible for you and I will plan accordingly

This is a port of the work done in llama.cpp directly
ggml-org/llama.cpp#9412

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a port of the work done in llama.cpp directly
ggml-org/llama.cpp#9438

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteThreeSupport

This is a port of the work done in llama.cpp with a slight tweak for the
tool call response:
ggml-org/llama.cpp#10013

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Contributor Author

I've got a quick window now before knocking off for the day, but seem to have killed my local dockerized development environment unfortunately. I'm going to see how quickly I can revive it, but can you give more details on how you saw that error? I do see that I have the basic templating pulled in from llama.cpp, but it's definitely missing a lot of the detailed templating from the full model.

@gabe-l-hart
Contributor Author

Ok, I can repro now. I think this has to do with the 3.2 models (it didn't happen for 3.0 and I haven't tested for 3.1). If I explicitly pass --chat-template granite this doesn't happen. Will try for a quick fix!

@gabe-l-hart
Contributor Author

Aaah, I see what happened. The GGUF-encoded string for the chat template got truncated somehow during GGUF conversion. The jinja2 template is really long, so the clause to detect the <|start_of_role|> sequence isn't tripping because that part of the template got chopped off. I'll add an extra check for "Granite" since that appears early in the default system prompt.
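The detection order matters here. A rough shell rendering of that fallback logic (the real check lives in llamafile's C++ template detection; the function name below is illustrative):

```shell
detect_granite_template() {
    # Prefer the explicit <|start_of_role|> marker; fall back to the
    # "Granite" substring, which appears early in the default system
    # prompt and so survives even when the template string was
    # truncated during GGUF conversion; otherwise fall back to chatml.
    case "$1" in
        *'<|start_of_role|>'*) echo granite ;;
        *Granite*)             echo granite ;;
        *)                     echo chatml ;;
    esac
}
```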

For some Granite models, the jinja2 template string is very long and gets
truncated when converting to GGUF. The string "Granite" appears in the
default system prompt logic, so this will catch those models that were
truncated before the clause for the <|start_of_role|> token.

NOTE: Any future granite architectures which contain "Granite" in the
system prompt will need to include their logic _before_ this block.

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart force-pushed the GraniteThreeSupport branch from f2557e3 to f44916f on March 14, 2025 22:54
@gabe-l-hart
Contributor Author

Fix pushed. Should be good to go @cjpais!

@cjpais
Collaborator

cjpais commented Mar 14, 2025

This looks good for pretty much all of the models I tested.

The only one that noticeably didn't work was: bartowski/granite-3.0-3b-a800m-instruct-GGUF

I get:
(screenshot of the error output)

The later revision (3.1) worked fine.

I'm thinking of pulling it in despite this, but will look closer as well. FWIW, I ran it in LM Studio without issue.

Hope you have a nice weekend

@cjpais cjpais merged commit 17d7f4a into mozilla-ai:main Mar 17, 2025
1 check passed
@cjpais
Collaborator

cjpais commented Mar 17, 2025

I'm gonna go ahead and merge this. I did a little bit of print debugging and it looks like the chat template is being applied, so I don't have a strong idea of what the issue with generation is here.

5 participants