mtmd: Expose helper_decode_image_chunk (#13366)
Conversation
I think this can be made simpler: in the application code, you can handle the embedding copy as I said. This way, you can even have a C++ struct with a `std::vector<float>`, which makes memory management much easier. The mtmd API already provides enough functions for you to do that, so I think we should not extend it further.
A struct in your app could look like this:
```cpp
struct my_image {
    std::vector<float> embeddings; // the encoded embeddings
    mtmd_input_chunk * chunk;      // the chunk containing mtmd_image_tokens
};
```
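For illustration, filling that struct might look like this; a minimal sketch, assuming the existing `mtmd_input_chunk_get_tokens_image`, `mtmd_encode`, `mtmd_get_output_embd`, and `mtmd_image_tokens_get_n_tokens` calls, with `llama_model_n_embd` giving the embedding width:

```cpp
#include <stdexcept>
#include <vector>
#include "llama.h"
#include "mtmd.h"

// uses the my_image struct defined above
my_image encode_and_cache(mtmd_context * mctx, const llama_model * model, mtmd_input_chunk * chunk) {
    const mtmd_image_tokens * img_tokens = mtmd_input_chunk_get_tokens_image(chunk);
    if (mtmd_encode(mctx, img_tokens) != 0) {
        throw std::runtime_error("mtmd_encode failed");
    }
    // the returned pointer is owned by mctx and only valid until the next
    // encode call, hence the copy into the std::vector
    const float * embd = mtmd_get_output_embd(mctx);
    const size_t n_floats = mtmd_image_tokens_get_n_tokens(img_tokens)
                          * (size_t) llama_model_n_embd(model);
    my_image out;
    out.embeddings.assign(embd, embd + n_floats);
    out.chunk = chunk; // the chunk's lifetime stays with the application
    return out;
}
```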
Nice, thanks! 💯 💯
Btw @mattjcly, one nice-to-have thing that I'm thinking about: currently `mtmd_helper_decode_image_chunk` runs non-stop, even though it actually splits the work into smaller batches under the hood.
This can lead to a poor UX where the user hits the "stop" button in the UI, but `mtmd_helper_decode_image_chunk` still tries to decode the whole image, which may take some extra seconds to finish.
I'm thinking about another version of `mtmd_helper_decode_image_chunk` (ofc will add it in another PR) that supports interruptibility. Maybe we could expose the `i_batch` and `n_batch` to the public API. Do you have any other ideas?
Edit: another idea could be to add a helper that does pre/post batch preparation; then you can `llama_decode(prepared_image_batch)` in the user code, but this may still look quite cumbersome 😞
I like this - I think that 1) having a point where the decoding can be stopped in between batches would be great, and 2) having a way, as a user, to get progress information during image decoding in the multi-batch case (other than just the current log) would be great.
Interesting. How would you envision this as the method of supporting interruptibility from the client side? Just trying to understand more.
The most intuitive way is to provide application code with the notion of "a list of batches" instead of a one-do-all API call. Pseudocode looks like this:

```
list_batches = mtmd_generate_decode_batches()
for batch in list_batches:
    llama_decode(batch)
```

Then if you want the interruptibility:

```
list_batches = mtmd_generate_decode_batches()
for batch in list_batches:
    if check_user_interrupt():
        break  # stop the decode
    llama_decode(batch)
```

I'm thinking along this line; maybe this will be implemented as a cpp-only API to make it easier to manage batch allocation.
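To make the batch-list idea concrete, here is a rough C++ sketch of what a cpp-only version could look like; `mtmd_prepared_batches` and the surrounding names are hypothetical, not existing mtmd API:

```cpp
#include <cstdio>
#include <functional>
#include <vector>
#include "llama.h"

// Hypothetical: what a cpp-only "list of batches" API might hand back.
// The llama_batch views would point into storage owned by this struct.
struct mtmd_prepared_batches {
    std::vector<llama_batch> batches; // one view per n_batch-sized slice
};

// The application owns the decode loop, so it can stop between batches
// and surface progress to the UI.
bool decode_interruptible(llama_context * lctx,
                          mtmd_prepared_batches & prepared,
                          const std::function<bool()> & user_interrupted) {
    const size_t n = prepared.batches.size();
    for (size_t i = 0; i < n; i++) {
        if (user_interrupted()) {
            return false; // stop the decode between batches
        }
        if (llama_decode(lctx, prepared.batches[i]) != 0) {
            return false; // decode error
        }
        fprintf(stderr, "image decode progress: %zu/%zu batches\n", i + 1, n);
    }
    return true;
}
```

This also gives the client the per-batch progress hook asked about above.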
* origin/master: (39 commits)
  - server : vision support via libmtmd (ggml-org#12898)
  - sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  - metal : optimize MoE for large batches (ggml-org#13388)
  - CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  - llama : do not crash if there is no CPU backend (ggml-org#13395)
  - CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  - imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  - llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  - mtmd : fix batch_view for m-rope (ggml-org#13397)
  - llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  - rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  - vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  - server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  - ci : limit write permission to only the release step + fixes (ggml-org#13392)
  - mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  - server : (webui) fix a very small misalignment (ggml-org#13387)
  - server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  - convert : support rope_scaling type and rope_type (ggml-org#13349)
  - mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  - context : allow cache-less context for embeddings (ggml-org#13108)
  - ...
New API
Decoding-only helper
`mtmd_helper_decode_image_chunk`: Split out from `mtmd_helper_eval_chunk_single`. Same logic as before, but exposing it as a standalone function enables clients to run `mtmd_encode` at some prior time, cache those embeddings, and then send them in later to `mtmd_helper_decode_image_chunk` to decode the embeddings without having to re-encode the image (expensive). A sketch of this flow follows the list below.

Edit: removed the below APIs that were in the original PR
Output embedding copy
`mtmd_get_output_embd_copy`: Allows the client to embed with `mtmd_encode`, then get a copy of the embeddings to hold onto past the lifetime of the embeddings within the `mtmd_context`. Useful for caching these embeddings and sending them into `mtmd_helper_decode_image` later.

mtmd_image_tokens management functions

`mtmd_image_tokens_copy`: Allows clients to get a copy of the `mtmd_image_tokens` from an `mtmd_input_chunk`, for later use to send alongside pre-computed embeddings to `mtmd_helper_decode_image`.
`mtmd_image_tokens_free`: For use to free an `mtmd_image_tokens *`, as can be received from `mtmd_image_tokens_copy`.
`image_tokens_ptr` (made public, existed privately in `mtmd.cpp` before): Enables auto memory management of `mtmd_image_tokens *`.

@ngxson I'm thinking that maybe there's a way to avoid the need to expose new API for `mtmd_image_tokens`, since I feel like the rationale "for later use to send alongside pre-computed embeddings" behind `mtmd_image_tokens_copy` could potentially be weak, and the API of `mtmd_helper_decode_image` could be reworked to not need this object in full? But it also seemed like the simplest conversion to enable decoupled embedding + decoding.
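Putting the pieces together, a sketch of the decoupled encode/decode flow this PR enables, assuming the `my_image` caching struct from the review discussion and the `mtmd_helper_decode_image_chunk` signature added here:

```cpp
#include <vector>
#include "llama.h"
#include "mtmd.h"

// my_image as sketched in the review discussion above
struct my_image {
    std::vector<float>  embeddings; // cached output of an earlier mtmd_encode
    mtmd_input_chunk  * chunk;      // the chunk containing mtmd_image_tokens
};

// Decode previously encoded embeddings; no mtmd_encode call here, so the
// expensive vision encode step is skipped entirely.
int32_t decode_cached_image(mtmd_context * mctx,
                            llama_context * lctx,
                            my_image      & img,
                            llama_pos       n_past,
                            llama_seq_id    seq_id,
                            int32_t         n_batch,
                            llama_pos     * new_n_past) {
    return mtmd_helper_decode_image_chunk(mctx, lctx, img.chunk,
                                          img.embeddings.data(),
                                          n_past, seq_id, n_batch, new_n_past);
}
```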