Conversation
Force-pushed from 473e3fc to bbe64fe
|
I was waiting for this. Thanks a lot for your hard work mate @gabe-l-hart |
|
Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both! -Brad |
|
I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart! |
|
Hi @jart! I wanted to check in and see if this PR is something you would consider for upstream merging. I see that you use llama.cpp/README.llamafile to track the version of |
I have been wanting to try it but wasn't getting enough time to sit and resolve the errors on a Windows machine. @BradHutchings would you mind sharing your build so I can run some tests as well? |
|
I'll try to shoot a video and send a link to the repositories I want to run; right now I'm running a light version consisting of one exe file. My system:
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (2 processors)
128 GB RAM
Windows 10 Pro, Version 22H2
Installation date: 06.04.2024
OS build: 19045.5131
Windows Feature Experience Pack 1000.19060.1000.0
|
|
@DK013 My llamafile builds are here: https://huggingface.co/bradhutchings/DemoMachine-LLMs |
Force-pushed from bbe64fe to f2557e3
|
@jart any update when we can get an official build with this merged? |
|
I'll give it a spin and plan for it being in 0.9.2. |
|
It does work, but there is an issue where the model likes to generate endlessly (at least in the llamafile default CLI). I suspect it's an issue with the chat template. On the server side, we see this error: However, generation seems to work properly. It would be nice for this to work properly before merging. FWIW, I do see the code for the chat template, but it seems like maybe it's not being picked up or respected? It's possible this could be a change needed in llamafile itself to pick it up; I will take a look when I can as well. |
|
Thanks for looking at this @cjpais! You're almost certainly right that it's a chat template issue. I haven't looked at Llamafile much since opening this PR, but it looks like you haven't bumped |
|
Right now I have no plans of bumping it, though I do think it should be done. I think @stlhood could chime in here if there's a specific plan to get the upstream changes. I do think it would make the project much more maintainable, but I suspect it would be quite the effort to pull in the upstream changes. It's definitely on our minds, as we would like to get Gemma 3 support and other models in the future as well. Edit: @gabe-l-hart, if you do have time to take a look, I would be happy to test any changes and get things merged; just let me know what's possible for you and I will plan accordingly. |
This is a port of the work done in llama.cpp directly ggml-org/llama.cpp#9412 Branch: GraniteThreeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a port of the work done in llama.cpp directly ggml-org/llama.cpp#9438 Branch: GraniteThreeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteThreeSupport This is a port of the work done in llama.cpp with a slight tweak for the tool call response: ggml-org/llama.cpp#10013 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
|
I've got a quick window now before knocking off for the day, but I seem to have killed my local dockerized development environment, unfortunately. I'm going to see how quickly I can revive it, but can you give more details on how you saw that error? I do see that I have the basic templating pulled in from |
|
Ok, I can repro now. I think this has to do with the 3.2 models (it didn't happen for 3.0 and I haven't tested for 3.1). If I explicitly pass |
|
Aaah, I see what happened. The GGUF-encoded string for the chat template got truncated somehow during gguf conversion. The jinja2 template is really long, so the clause to detect the |
For some Granite models, the jinja2 template string is very long and gets truncated when converting to GGUF. The string "Granite" appears in the default system prompt logic, so this will catch those models that were truncated before the clause for the <|start_of_role|> token. NOTE: Any future granite architectures which contain "Granite" in the system prompt will need to include their logic _before_ this block. Branch: GraniteThreeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Force-pushed from f2557e3 to f44916f
|
Fix pushed. Should be good to go @cjpais! |
|
This looks good for pretty much all of the models I tested. The only one that noticeably didn't work was bartowski/granite-3.0-3b-a800m-instruct-GGUF. The later revision (3.1) worked fine. I'm thinking of pulling this in despite that, but will look closer as well. FWIW, I ran it in LM Studio without issue. Hope you have a nice weekend! |
|
I'm gonna go ahead and merge this. I did a little bit of print debugging, and it looks like the chat template is being applied, so I don't have a strong idea of what the issue with generation is here. |

Description
This PR adds support for the `"granite"` and `"granitemoe"` architectures in order to support IBM's Granite 3.0. The changes mirror those added in `llama.cpp` upstream:

- `"granite"`: IBM Granite Architecture ggml-org/llama.cpp#9412
- `"granitemoe"`: IBM Granite MoE Architecture ggml-org/llama.cpp#9438

These models are currently available via HuggingFace and Ollama:

- `granite3-dense` (`"granite"`): https://ollama.com/library/granite3-dense
- `granite3-moe` (`"granitemoe"`): https://ollama.com/library/granite3-moe

Testing
I did my development on a Mac M3 without `gmake` natively installed. To avoid a system-level install, I wrapped my dev environment in `docker` with the following two scripts:

Dockerfile
build_dockerized.sh
build_in_docker.sh
With these scripts, my workflow was:

1. Fetch the GGUF model (`ollama pull`, then grab the `$HOME/.ollama/models/blobs/...` blob for the GGUF file)
2. Build the docker image (`./build_dockerized.sh`)
3. Build `llamafile` inside (`./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b`)
4. Run the `llamafile` outside the docker shell (`./granite3-dense-2b.llamafile -p "tell me a story"`)

Open Questions
Solved! I found the PR added after mine in `llama.cpp` to update the chat template to support `"granite"`: ggml-org/llama.cpp#10013

Original question: When running in interactive mode, the chat template seems to be using different special tokens besides those defined in the `chat_template` metadata in the GGUF file. I haven't dug enough yet to understand if this is something that can be pulled automatically from the GGUF, or if there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.