Granite three support #608

Merged
cjpais merged 4 commits into mozilla-ai:main from gabe-l-hart:GraniteThreeSupport
Mar 17, 2025

Conversation

@gabe-l-hart
Contributor

@gabe-l-hart gabe-l-hart commented Nov 4, 2024

Description

This PR adds support for the "granite" and "granitemoe" architectures in order to support IBM's Granite 3.0. The changes mirror those added upstream in llama.cpp.

These models are currently available via HuggingFace and Ollama.

Testing

I did my development on a Mac M3 without gmake natively installed. To avoid a system-level install, I wrapped my dev environment in docker with the following two scripts:

Dockerfile
FROM ubuntu
RUN apt-get update && apt-get install -y build-essential unzip wget
build_dockerized.sh
#!/usr/bin/env bash

cd "$(dirname "${BASH_SOURCE[0]}")"

docker buildx build . -t llamafile-builder:latest --load
docker run --rm -it --entrypoint bash -w /src -v "$PWD:/src" -v "$HOME/models:/models" llamafile-builder:latest
build_in_docker.sh
#!/usr/bin/env bash

gguf_file="$1"
if [ $# -ge 2 ]
then
    model_name="$2"
else
    model_name="$(basename "$gguf_file" | cut -d'.' -f 1)"
fi
echo "Model Name: $model_name"

# Build (NOTE: First build may fail due to the need to download tools)
make -j || make -j

# Install the built binaries
make install PREFIX=/usr/local

# Make a temp dir to work in
start_dir="$PWD"
temp_dir="$(mktemp -d)"
cd "$temp_dir"

# Copy over the model and base binary
echo "Copying source materials..."
cp "$gguf_file" .
cp "$(which llamafile)" "$model_name.llamafile"

# Make the .args file
echo "Making .args file..."
echo "-m
$(basename "$gguf_file")
--host
0.0.0.0
-ngl
9999
..." > .args

# Pack it all together
echo "Packing with zipalign..."
zipalign -j0 "$model_name.llamafile" "$(basename "$gguf_file")" .args

# Move it back to the root dir
mv "$model_name.llamafile" "$start_dir/"
echo "DONE"

With these scripts, my workflow was:

  1. Download pre-quantized versions of the models (e.g. ollama pull then grab the $HOME/.ollama/models/blobs/... blob for the GGUF file)
    • NOTE: IBM does not currently host official quantized versions, but there are many community quantizations available on Hugging Face (dense, moe)
  2. Launch the docker build shell (./build_dockerized.sh)
  3. Build the llamafile inside (./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b)
  4. Run the llamafile outside the docker shell (./granite3-dense-2b.llamafile -p "tell me a story")
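Taken together, the workflow above can be sketched as a single script. This is a rough rendering, not part of the PR: `find_gguf_blob` is a hypothetical helper (Ollama stores pulled models as content-addressed blobs, so it identifies GGUF files by their 4-byte magic rather than by filename), and the model paths are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the workflow above (assumptions noted in the lead-in).

find_gguf_blob() {
    # Print every regular file in the given directory whose first four
    # bytes are the GGUF magic ("GGUF").
    for blob in "$1"/*; do
        [ -f "$blob" ] || continue
        if [ "$(head -c 4 "$blob")" = "GGUF" ]; then
            echo "$blob"
        fi
    done
}

# 1. Locate the GGUF blob pulled by `ollama pull`:
#      find_gguf_blob "$HOME/.ollama/models/blobs"
# 2. Launch the docker build shell:   ./build_dockerized.sh
# 3. Build the llamafile inside:      ./build_in_docker.sh /models/<model>.gguf <name>
# 4. Run it outside the container:    ./<name>.llamafile -p "tell me a story"
```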

Open Questions

Solved! I found the PR added after mine in llama.cpp to update the chat template to support "granite": ggml-org/llama.cpp#10013

When running in interactive mode, the chat template seems to use different special tokens from those defined in the chat_template metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
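One quick way to sanity-check what a GGUF file actually carries is to look for the `tokenizer.chat_template` metadata key directly in the binary. This is a crude heuristic sketch, not a real GGUF parser; the file path in the comment is illustrative.

```shell
has_chat_template() {
    # Heuristic: GGUF stores the chat template under the
    # tokenizer.chat_template metadata key; grep -a scans the binary
    # file as text and succeeds if the key string is present.
    grep -aq 'tokenizer.chat_template' "$1"
}

# Example (path illustrative):
#   has_chat_template /models/granite-3.0-2b-instruct.Q4_K_M.gguf \
#       && echo "template key present" || echo "no template key"
```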

@DK013

DK013 commented Nov 6, 2024

I was waiting for this. Thanks a lot for your hard work mate @gabe-l-hart

@BradHutchings

Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both!

-Brad

@BradHutchings

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

@gabe-l-hart
Contributor Author

Hi @jart! I wanted to check in and see if this PR is something you would consider for upstream merging. I see that you use llama.cpp/README.llamafile to track the version of llama.cpp being used and the list of local modifications on top. I didn't see a clean way to re-bump the commit and apply those deltas, but I'd be happy to re-do this change set to be a full llama.cpp bump if that's preferred.

@DK013

DK013 commented Nov 24, 2024

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

I have been wanting to try it but wasn't getting enough time to sit and resolve the errors on a windows machine. @BradHutchings would you mind sharing your build so I can run some tests as well?
Thanks in advance

@pawel665j

pawel665j commented Nov 24, 2024 via email

@BradHutchings

@DK013 My llamafile builds are here: https://huggingface.co/bradhutchings/DemoMachine-LLMs

@DK013

DK013 commented Jan 22, 2025

@jart any update when we can get an official build with this merged?

@cjpais
Collaborator

cjpais commented Mar 13, 2025

I'll give it a spin and plan for it to be in 0.9.2

@cjpais
Collaborator

cjpais commented Mar 14, 2025

It does work, but there is an issue where the model likes to generate endlessly (at least in the llamafile default cli). I suspect it's an issue with the chat template.

On the server side, we see this error:

{"function":"validate_model_chat_template","level":"ERR","line":491,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"12271296","timestamp":1741972043}

However, generation seems to work properly.

It would be nice for this to work properly before merging. FWIW, I do see the code for the chat template, but it seems like maybe it's not being picked up or respected? It's possible this could be a change needed in llamafile itself to pick it up; I will take a look when I can as well.

@gabe-l-hart
Contributor Author

Thanks for looking at this @cjpais! You're almost certainly right that it's a chat template issue. I haven't looked at Llamafile much since opening this PR, but it looks like you haven't bumped llama.cpp in quite a while (since before this PR). Do you have plans to bump in the future, or will this continue as a logical fork from an earlier point in history? I ask because all of these changes (plus the chat template fix) would come in with a bumped llama.cpp.

@cjpais
Collaborator

cjpais commented Mar 14, 2025

Right now I have no plans of bumping it, though I do think it should be done. I think @stlhood could chime in here if there's a specific plan to get the upstream changes. I do think it would make the project much more maintainable, but I suspect it would be quite the effort to get the upstream changes

It's definitely on the mind as we would like to get in Gemma 3 support and other models in the future as well.

Edit: @gabe-l-hart, if you do have time to take a look, I would be happy to test any changes and get things merged, just let me know what's possible for you and I will plan accordingly

This is a port of the work done in llama.cpp directly
ggml-org/llama.cpp#9412

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a port of the work done in llama.cpp directly
ggml-org/llama.cpp#9438

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteThreeSupport

This is a port of the work done in llama.cpp with a slight tweak for the
tool call response:
ggml-org/llama.cpp#10013

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Contributor Author

I've got a quick window now before knocking off for the day, but seem to have killed my local dockerized development environment unfortunately. I'm going to see how quickly I can revive it, but can you give more details on how you saw that error? I do see that I have the basic templating pulled in from llama.cpp, but it's definitely missing a lot of the detailed templating from the full model.

@gabe-l-hart
Contributor Author

Ok, I can repro now. I think this has to do with the 3.2 models (it didn't happen for 3.0 and I haven't tested for 3.1). If I explicitly pass --chat-template granite this doesn't happen. Will try for a quick fix!

@gabe-l-hart
Contributor Author

Aaah, I see what happened. The GGUF-encoded string for the chat template got truncated somehow during GGUF conversion. The jinja2 template is really long, so the clause to detect the <|start_of_role|> sequence isn't tripping because that part of the template got chopped off. I'll add an extra check for "Granite" since that appears early in the default system prompt.
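The detection order matters here. A rough shell rendering of that fallback logic (the real check lives in llamafile's C++ template detection; the function name below is illustrative):

```shell
detect_granite_template() {
    # Prefer the explicit <|start_of_role|> marker; fall back to the
    # "Granite" substring, which appears early in the default system
    # prompt and so survives even when the template string was
    # truncated during GGUF conversion; otherwise fall back to chatml.
    case "$1" in
        *'<|start_of_role|>'*) echo granite ;;
        *Granite*)             echo granite ;;
        *)                     echo chatml ;;
    esac
}
```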

For some Granite models, the jinja2 template string is very long and gets
truncated when converting to GGUF. The string "Granite" appears in the
default system prompt logic, so this will catch those models that were
truncated before the clause for the <|start_of_role|> token.

NOTE: Any future granite architectures which contain "Granite" in the
system prompt will need to include their logic _before_ this block.

Branch: GraniteThreeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart force-pushed the GraniteThreeSupport branch from f2557e3 to f44916f on March 14, 2025 22:54
@gabe-l-hart
Contributor Author

Fix pushed. Should be good to go @cjpais!

@cjpais
Collaborator

cjpais commented Mar 14, 2025

This looks good for pretty much all of the models I tested.

The only one that noticeably didn't work was: bartowski/granite-3.0-3b-a800m-instruct-GGUF

I get:
(screenshot of the error output)

The later revision (3.1) worked fine.

I'm thinking of pulling it in despite this, but will look closer as well. FWIW, I ran it in LM Studio without issue.

Hope you have a nice weekend

@cjpais cjpais merged commit 17d7f4a into mozilla-ai:main Mar 17, 2025
1 check passed
@cjpais
Collaborator

cjpais commented Mar 17, 2025

I'm gonna go ahead and merge this. I did a little bit of print debugging and it looks like the chat template is being applied, so I don't have a strong idea of what the issue with generation is here.

5 participants