convert: support Mistral 3 Large MoE #17730
Conversation
So far so good with this; in a couple of hours I will be able to test generation.

Seems to work and produce coherent results!

This PR still needs to be cleaned up before it is ready for review 😅
# remap hparams from Mistral MoE format to DeepseekV2 format
# we do it this way to be able to reuse the DeepseekV2Model set_gguf_parameters logic
Somewhat ugly but an acceptable trade-off.
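The remapping trade-off discussed above could look something like this minimal sketch; the key names used here are hypothetical placeholders for illustration, not the exact ones in the PR:

```python
def remap_hparams(hparams: dict, renames: dict) -> dict:
    """Return a copy of hparams with keys renamed, so that downstream
    code written for the target format (e.g. DeepseekV2's
    set_gguf_parameters) can be reused unchanged."""
    out = dict(hparams)
    for src, dst in renames.items():
        if src in out:
            out[dst] = out.pop(src)
    return out

# hypothetical Mistral-MoE -> DeepseekV2 key names, for illustration only
renames = {
    "num_experts": "n_routed_experts",
    "experts_per_token": "num_experts_per_tok",
}
cfg = {"num_experts": 128, "experts_per_token": 4, "hidden_size": 8192}
remapped = remap_hparams(cfg, renames)
```

The copy-then-pop approach keeps the original config dict untouched, which matters when the same hparams are read again later in the conversion.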
@ngxson Thank you so much for this. I've tried your Q4_K_M and it seems to work just fine. Is there any other setting or change needed for the conversion?

It disappeared?? 👀 I can re-upload if necessary, I guess. Only difference is using

Yeah, I've used the mistral format. Then I guess I have a corrupted bf16 version (I cannot think of anything else): https://github.com/csabakecskemeti/ministral-3_dequantizer_fp8-bf16

I can see it here: https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-BF16

You're right, they just removed it from the collection (if it was ever there :p); that's where I looked for it. My bad.
It looks like @ngxson forgot |
CISC left a comment
@csabakecskemeti This should work.
I can confirm it works with the changes suggested by @CISC.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@ngxson Ouch, that second suggestion should not have been applied directly; GitHub messes up changes outside of the preview area. :(
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert: support Mistral 3 Large MoE
* filter out vision tensors, add missing keys
* handle vocab
* add temperature_length
* fix mscale_all_dim
* clean up
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix
* Update gguf-py/gguf/tensor_mapping.py

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
WIP, the code is quite ugly for now, but I just want to get it working first.
Remember to convert with the `--mistral-format` argument, as the weights are not yet transformers-compatible.

Output:
- F16 weights: 1.35 TB
- Q8_0 weights: 716 GB, and I don't have enough hardware to test it

Edit: thanks @bartowski1182 for testing it!
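A minimal sketch of the conversion invocation, assuming the usual `convert_hf_to_gguf.py` entry point; the model directory and output file name are placeholders:

```shell
# Hypothetical example: paths and output name are placeholders.
# --mistral-format tells the converter the weights are in Mistral's own
# layout rather than a transformers-compatible one.
python convert_hf_to_gguf.py /path/to/Mistral-Large-3-Instruct \
    --mistral-format \
    --outtype f16 \
    --outfile mistral-large-3-moe-f16.gguf
```

Expect very large outputs at this model size (1.35 TB at F16 per the numbers above), so make sure the destination volume has enough free space before starting.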
NOTE: this PR only covers the conversion to GGUF. The C++ code is still missing the llama 4 scaling needed for this model to work; that will come in another PR.