mtmd : add ultravox audio input #13623
OK, somehow it works magically; the code is still nowhere near finished. Tested using the first 6 seconds from https://www.youtube.com/watch?v=vP4iY1TtS3s
With the
Next step is to allow more than 30s input.
if (has_audio) {
    LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
            " https://github.com/ggml-org/llama.cpp/pull/13623\n", __func__);
}
The model hallucinates on audio longer than 1 minute and I'm still not sure why (I haven't yet had time to try the same audio on transformers).
But I think putting a small notice here is enough for now. This is kind of experimental support at this point; hopefully we will get gemma 3n supported soon.
convert_hf_to_gguf.py
Outdated
self.hparams["image_size"] = self.hparams["num_mel_bins"]
self.hparams["patch_size"] = self.hparams["num_mel_bins"]
Are the image_size and patch_size used in the audio encoder?
It is unused; I left it in from my first draft version so the warmup works. But yeah, I should remove this.
tools/mtmd/mtmd.h
Outdated
#define MTMD_DEFAULT_MEDIA_MARKER "<__media__>"

// deprecated marker, use MTMD_DEFAULT_MEDIA_MARKER instead
We have such constants in llama.h and ggml.h, but we eventually have to start moving those behind API calls. It's more future-proof.
ggerganov
left a comment
The preprocessor will convert input PCM to a mel spectrogram with dimensions n_frames * n_mel, so it can be considered a grayscale (1-channel) image with W=n_frames and H=n_mel.
This is a neat idea. Do you think it would be compatible with other audio models or is this a lucky coincidence for this architecture? I guess the question is if all audio encoders work with 2D spectrograms.
I have seen so far just 2 types of model:
So overall, I think this system should work well for most audio models.
I'll resolve the 2 comments a bit later today, and will merge it after that. Thanks for reviewing this!

truly AI expert, ......genius programmer, another gg!


Supersedes #12745
Important
Support for llama-server will be added in a separate PR.
For ultravox, it does not work very well with audio longer than 1 minute. Not sure why.
How it works
This PR specifically targets the ultravox model, which is essentially a fine-tuned Whisper encoder and a custom projector.
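Since the encoder is Whisper-based, its input is a mel spectrogram rather than raw PCM. The sketch below only illustrates the resulting shape when the spectrogram is viewed as a 1-channel image (W=n_frames, H=n_mel). The constants (16 kHz sample rate, hop length of 160 samples), the `mel_image` struct, the `make_mel_image` helper and the row-major layout are assumptions for illustration, not code from this PR; the real preprocessing also runs the FFT and mel filters and may pad the input.

```cpp
#include <cstddef>
#include <vector>

// Assumed Whisper-style constants (illustrative, not taken from this PR).
constexpr int SAMPLE_RATE = 16000; // samples per second
constexpr int HOP_LENGTH  = 160;   // samples between consecutive frames

// Hypothetical container: a mel spectrogram viewed as a grayscale image.
struct mel_image {
    int n_frames = 0;        // image width  (W)
    int n_mel    = 0;        // image height (H)
    std::vector<float> data; // assumed row-major: data[mel * n_frames + frame]

    float & at(int mel, int frame) {
        return data[(size_t) mel * n_frames + frame];
    }
};

// Hypothetical helper: allocate a zero-filled spectrogram for n_samples of
// mono PCM. Only the shape is computed here; padding is ignored.
mel_image make_mel_image(size_t n_samples, int n_mel) {
    mel_image img;
    img.n_frames = (int) (n_samples / HOP_LENGTH);
    img.n_mel    = n_mel;
    img.data.assign((size_t) img.n_frames * (size_t) n_mel, 0.0f);
    return img;
}
```

Under these assumptions, 6 seconds of audio (96000 samples) with 128 mel bins gives a 600 x 128 "image".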
Most of the preprocessing code is copied from whisper.cpp. The preprocessor converts input PCM to a mel spectrogram with dimensions n_frames * n_mel, so it can be considered a grayscale (1-channel) image with W=n_frames and H=n_mel.
The preprocessing code is inside mtmd-audio.cpp; the mel filter values are hard-coded for convenience.
Demo CLI
Supported formats: mp3, wav, flac
Example output:
New API
The API now accepts PCM F32 as input via mtmd_bitmap_init_from_audio(). Optionally, you can check whether a given bitmap is audio by using mtmd_bitmap_is_audio().
The helper mtmd_helper_bitmap_init_from_buf/file is extended to load input file data into the correct mtmd_bitmap type (decided by the magic bytes of the file), so it will just work out of the box without any changes in application code.
mtmd_input_chunk now has a new type called MTMD_INPUT_CHUNK_TYPE_AUDIO.
You can get the number of audio/image tokens that a chunk takes via the newly added mtmd_input_chunk_get_n_tokens API.
The rest of the process (encode/decode) is the same as before, so very few changes are needed in downstream applications.
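To illustrate what "decided by the magic bytes of the file" can mean for the three supported formats, here is a minimal, self-contained sketch. The `audio_format` enum and `sniff_audio_format` function are hypothetical names, and the checks are the commonly documented file signatures; the actual helper in this PR may detect formats differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result type, for illustration only.
enum class audio_format { unknown, wav, mp3, flac };

// Hypothetical helper: guess the audio container from the first bytes of a buffer.
audio_format sniff_audio_format(const uint8_t * buf, size_t len) {
    // WAV: "RIFF" <4-byte chunk size> "WAVE"
    if (len >= 12 && std::memcmp(buf, "RIFF", 4) == 0 && std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return audio_format::wav;
    }
    // FLAC: stream marker "fLaC"
    if (len >= 4 && std::memcmp(buf, "fLaC", 4) == 0) {
        return audio_format::flac;
    }
    // MP3: either an ID3v2 tag at the start of the file ...
    if (len >= 3 && std::memcmp(buf, "ID3", 3) == 0) {
        return audio_format::mp3;
    }
    // ... or an MPEG frame sync (11 consecutive set bits)
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return audio_format::mp3;
    }
    return audio_format::unknown;
}
```

A caller would only go down the audio decoding path when the result is not `audio_format::unknown`, and would otherwise fall back to the image path.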
For complete changes, see tools/mtmd/mtmd-cli.cpp: https://github.com/ggml-org/llama.cpp/pull/13623/files#diff-4bfe825a05fa2d2598cc93f39aaa081605d2fd82823bd5d15e7dab72acd85e7c
Deprecated API
The image marker <__image__> will continue to work, but it is deprecated because a new marker, <__media__>, is being added. This marker is defined in MTMD_DEFAULT_MEDIA_MARKER.
These 3 APIs will be deprecated (but will continue to function, NO breaking change):
- mtmd_image_tokens_get_n_tokens
- mtmd_image_tokens_get_id
- mtmd_image_tokens_get_n_pos
They simply change their prefix to mtmd_input_chunk_:
- mtmd_input_chunk_get_n_tokens
- mtmd_input_chunk_get_id
- mtmd_input_chunk_get_n_pos
TODO in next PRs:
- miniaudio.h and stb_image.h to mtmd_helper
- mtmd_image_tokens_get_n_tokens / n_pos / id