Avoid re-encoding mtmd chunk when prefilling the MTP context #24380

(This issue is mostly for transparency; no action should be taken by other maintainers or contributors.)

I plan to address this issue along with the new `mtmd_batch` API.

`process_chunk` is called twice on the same chunk, once for the target context and once for the draft context, so the same image is encoded twice (cc @am17an for visibility):

https://github.com/ggml-org/llama.cpp/blob/76da2450a4f2cc9ce6c7fc8229e25dc0a4b41e5d/tools/server/server-context.cpp#L2979-L2999

	while (slot.prompt.n_tokens() < slot.task->n_tokens() && input_tokens[slot.prompt.n_tokens()] == LLAMA_TOKEN_NULL) {
	    // process the image
	    size_t n_tokens_out = 0;
	    int32_t res = input_tokens.process_chunk(ctx_tgt, mctx, slot.prompt.n_tokens(), slot.prompt.tokens.pos_next(), slot.id, n_tokens_out);
	    if (res != 0) {
	        SLT_ERR(slot, "failed to process image, res = %d\n", res);
	        send_error(slot, "failed to process image", ERROR_TYPE_SERVER);
	        slot.release();
	        continue;
	    }

	    if (ctx_dft && llama_get_ctx_other(ctx_dft.get()) != ctx_tgt) {
	        // TODO: in the future, figure out how to infuse target embeddings to the images
	        //       for now, we skip this for simplicity
	        //       maybe we simply need to call `common_speculative_process()` on the mtmd batches in the `process_chunk` above?
	        //       [TAG_MTMD_DRAFT_PROCESSING]
	        res = input_tokens.process_chunk(ctx_dft.get(), mctx, slot.prompt.n_tokens(), slot.prompt.tokens.pos_next(), slot.id, n_tokens_out);
	        if (res != 0) {
	            GGML_ABORT("failed to process multi-modal data on draft context\n");
	        }
	    }
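
One possible direction, shown only as a rough sketch: encode each chunk once, keep the resulting embeddings, and decode them into both contexts. Everything in the block below is hypothetical (`chunk_embd`, `chunk_embd_cache`, `fake_ctx`, `process_chunk_once`); none of it is part of the mtmd or server API, and the planned `mtmd_batch` API may look quite different. The toy types only illustrate the "encode once, decode twice" idea.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the encoder output of one multimodal chunk.
using chunk_embd = std::vector<float>;

// Hypothetical cache: runs the (expensive) encoder at most once per chunk id.
struct chunk_embd_cache {
    std::function<chunk_embd(const std::string &)> encode; // assumed encoder callback
    std::unordered_map<std::string, chunk_embd>    cache;
    size_t n_encode_calls = 0;

    const chunk_embd & get(const std::string & chunk_id) {
        auto it = cache.find(chunk_id);
        if (it == cache.end()) {
            // cache miss: encode the chunk and remember the result
            n_encode_calls++;
            it = cache.emplace(chunk_id, encode(chunk_id)).first;
        }
        return it->second;
    }
};

// Hypothetical stand-in for a context that decodes precomputed embeddings.
struct fake_ctx {
    size_t n_embd_decoded = 0;
    void decode(const chunk_embd & embd) { n_embd_decoded += embd.size(); }
};

// Process one chunk on the target context and, if present, on the draft
// context, without encoding the chunk a second time.
inline void process_chunk_once(chunk_embd_cache & cache,
                               const std::string & chunk_id,
                               fake_ctx & ctx_tgt,
                               fake_ctx * ctx_dft) {
    ctx_tgt.decode(cache.get(chunk_id));      // first call encodes the chunk
    if (ctx_dft != nullptr) {
        ctx_dft->decode(cache.get(chunk_id)); // cache hit, no re-encoding
    }
}
```

In this sketch the draft context only pays for decoding; whether the real fix caches embeddings like this or shares a single batch between contexts is left open here.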
