
Feature/gemma mtp#3

Merged
Ooooze merged 7 commits into feature/turboquant-kv-cache from feature/gemma-mtp
May 7, 2026


@Ooooze Ooooze commented May 7, 2026


Overview

Additional information

Requirements

Ooooze added 7 commits May 6, 2026 20:58
Introduces support for the Gemma 4 MTP assistant, allowing for enhanced speculative decoding. This includes new command-line options for specifying the MTP head and draft model, as well as updates to the model architecture and tensor handling. The assistant integrates with the target model, enabling efficient draft generation and improved performance in speculative tasks.

Changes include:
- New command-line options: `--mtp-head` and `--draft-block-size`.
- Updates to the model loading process to accommodate the MTP assistant.
- Enhancements in tensor management for MTP-specific operations.
- Documentation updates for usage examples and guidelines.

The goal is faster generation: the assistant drafts tokens cheaply and the target model verifies them.
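The draft-then-verify idea behind this commit can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding, not the code from this PR: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and each "model" is just a function from a token prefix to its next token.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Toy target model (assumption for illustration only)."""
    return (sum(prefix) * 31 + len(prefix)) % 17


def toy_draft(prefix: List[int]) -> int:
    """Toy draft head: agrees with the target except at every 4th position."""
    tok = toy_target(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 17


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], block_size: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens the target accepts."""
    # 1. The cheap draft head proposes block_size tokens autoregressively.
    drafted: List[int] = []
    for _ in range(block_size):
        drafted.append(draft(prefix + drafted))

    # 2. The target checks each drafted position. A real engine does this in
    #    one batched pass; a plain loop keeps the sketch readable.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target's pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, block_size: int) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, block_size))
    return out[:len(prompt) + n_tokens]
```

Because every accepted token is one the target would have chosen itself, the output matches plain greedy decoding with the target; the draft only changes how many target passes are needed.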
…oding

This commit introduces an asynchronous MTP draft pipeline, enhancing the speculative decoding process. Key changes include:

- Updated `draft_block_size` to 3, optimizing performance based on empirical results.
- Added new APIs: `llama_decode_mtp_async` and `llama_decode_mtp_wait` for non-blocking draft requests.
- Enhanced documentation to reflect the async pipeline's functionality and usage.
- Implemented tests to ensure parity between synchronous and asynchronous draft generation.

The goal is higher throughput: draft requests no longer block the caller while the draft is computed.
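The submit/wait split and the sync/async parity test described above follow a common pattern, sketched here under stated assumptions. The signatures of `llama_decode_mtp_async` and `llama_decode_mtp_wait` are not shown in this PR text, so they are not reproduced; `draft_sync` and `AsyncDrafter` are hypothetical stand-ins.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def draft_sync(prefix: List[int], block_size: int) -> List[int]:
    """Toy deterministic drafter standing in for a blocking draft call."""
    out: List[int] = []
    for _ in range(block_size):
        ctx = prefix + out
        out.append((sum(ctx) * 7 + len(ctx)) % 101)
    return out


class AsyncDrafter:
    """Hypothetical wrapper: submit() returns at once, wait() blocks for the result."""

    def __init__(self) -> None:
        # One worker keeps draft requests ordered, like a single draft stream.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, prefix: List[int], block_size: int) -> Future:
        # Copy the prefix so later edits by the caller cannot affect the request.
        return self._pool.submit(draft_sync, list(prefix), block_size)

    @staticmethod
    def wait(request: Future) -> List[int]:
        return request.result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


# Parity check: the async path must return exactly what the sync path returns.
drafter = AsyncDrafter()
request = drafter.submit([5, 8, 13], 3)
# ... the caller is free to do other work here ...
assert drafter.wait(request) == draft_sync([5, 8, 13], 3)
drafter.close()
```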
…P handling

This commit introduces significant improvements to the speculative decoding process by implementing a pipeline depth-2 mechanism that allows MTP draft computation to overlap with target verification. Key changes include:

- Added `prepare_next` and `cancel` hooks in the `common_speculative_state` interface for better async draft management.
- Implemented logic to drain any pending MTP requests before new iterations to prevent race conditions.
- Updated documentation to reflect the new pipeline depth-2 functionality and its implications for performance.
- Enhanced the `common_speculative` API with new functions for managing async MTP work.
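One way to picture the `prepare_next` and `cancel` hooks listed above is the sketch below. It is an assumption-laden illustration, not the `common_speculative_state` interface itself: `PipelinedDrafter`, its `draft` method, and `toy_draft` are hypothetical. The next draft starts on a guessed prefix while verification runs; if the guess turns out wrong, the pending request is drained before a new one is issued.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple


def toy_draft(prefix: List[int], block_size: int) -> List[int]:
    """Toy deterministic drafter (assumption for illustration only)."""
    out: List[int] = []
    for _ in range(block_size):
        ctx = prefix + out
        out.append((sum(ctx) * 7 + len(ctx)) % 101)
    return out


class PipelinedDrafter:
    """Hypothetical depth-2 pipeline: at most one draft in flight ahead of verification."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Tuple[List[int], Future]] = None

    def prepare_next(self, predicted_prefix: List[int]) -> None:
        """Start drafting for the prefix we expect to have after verification."""
        self.cancel()  # never keep two requests in flight
        guess = list(predicted_prefix)
        self._pending = (guess, self._pool.submit(toy_draft, guess, self.block_size))

    def cancel(self) -> None:
        """Drop the pending request, draining it if it has already started."""
        if self._pending is not None:
            _, fut = self._pending
            if not fut.cancel():
                fut.result()  # running or done: wait so it cannot race the next request
            self._pending = None

    def draft(self, prefix: List[int]) -> List[int]:
        """Return a draft for prefix, reusing the prepared one when the guess was right."""
        if self._pending is not None:
            guess, fut = self._pending
            if guess == prefix:
                self._pending = None
                return fut.result()
        self.cancel()  # wrong guess (or none): drain, then draft synchronously
        return toy_draft(list(prefix), self.block_size)

    def close(self) -> None:
        self.cancel()
        self._pool.shutdown(wait=True)
```

A caller would invoke `prepare_next` right after handing the current draft to the target, guessing that every drafted token is accepted, and call `draft` once the verified prefix is known.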

These changes aim to raise speculative decoding throughput by overlapping draft computation with target verification, without racing against pending requests.

This commit introduces an optional NDJSON tracer for MTP draft and accept events, controlled by the environment variable `LLAMA_MTP_ACC_TRACE`. Key changes include:

- Implementation of the `mtp_acc_tracer` class for tracing MTP events with configurable output options.
- Integration of tracing logic into the `common_speculative_state_mtp` structure, capturing relevant metrics during draft and acceptance processes.
- Updates to the MTP decoding functions to utilize in-graph argmax for improved performance and reduced data transfer overhead.
- Addition of a new shell script for running the Gemma 4 MTP server with enhanced configuration options.

These enhancements aim to provide better observability and performance in MTP operations, facilitating debugging and optimization of the speculative decoding process.
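An NDJSON tracer writes one JSON object per line, which keeps the output easy to stream and grep. The sketch below shows the general shape only. The class name `MtpAccTracer`, the event names, and the field names are hypothetical, and treating any value other than empty or `0` as "enabled" is an assumption; the PR text does not say how `LLAMA_MTP_ACC_TRACE` is interpreted.

```python
import io
import json
import os
import sys
from typing import Optional, TextIO


class MtpAccTracer:
    """Hypothetical tracer: emits one JSON object per line when tracing is enabled."""

    ENV_VAR = "LLAMA_MTP_ACC_TRACE"

    def __init__(self, sink: Optional[TextIO] = None) -> None:
        # Assumption: unset, empty, or "0" means disabled; anything else enables tracing.
        self.enabled = os.environ.get(self.ENV_VAR, "") not in ("", "0")
        self._sink = sink if sink is not None else sys.stderr

    def emit(self, event: str, **fields: object) -> None:
        if not self.enabled:
            return  # tracing off: no formatting work, no output
        record = {"event": event, **fields}
        # Compact separators keep each record on a single short line.
        self._sink.write(json.dumps(record, separators=(",", ":")) + "\n")


# Usage: trace one draft and one accept event into an in-memory buffer.
os.environ[MtpAccTracer.ENV_VAR] = "1"
buffer = io.StringIO()
tracer = MtpAccTracer(buffer)
tracer.emit("draft", n_drafted=3)
tracer.emit("accept", n_accepted=2)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Each line parses on its own, so a partially written trace from a crashed run is still usable up to the last complete record.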
…cing

This commit introduces an in-graph argmax for MTP draft processing, significantly improving throughput by reducing data transfer overhead. Key changes include:

- Use of `ggml_argmax` on the final logits inside the graph, so the host reads back only the selected token ID instead of the full logits.
- Addition of a diagnostic feature for per-draft acceptance tracing, enabling detailed logging of MTP events for better observability.
- Documentation updates to reflect these enhancements and provide usage examples for the new tracing functionality.
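The saving from an in-graph argmax comes from what is copied back to the host. The toy model below is a rough sketch, assuming 32-bit logits and a 32-bit token ID; the function names and the vocabulary size in the usage lines are hypothetical and no device is involved.

```python
from typing import Sequence, Tuple

BYTES_PER_F32 = 4  # assumed logit width
BYTES_PER_I32 = 4  # assumed token-ID width


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties in this sketch)."""
    return max(range(len(values)), key=values.__getitem__)


def host_side_argmax(logits: Sequence[float]) -> Tuple[int, int]:
    """Baseline: copy the whole logits row to the host, then scan it there."""
    bytes_read = len(logits) * BYTES_PER_F32
    return argmax(logits), bytes_read


def in_graph_argmax(logits: Sequence[float]) -> Tuple[int, int]:
    """Argmax runs as the last graph node; the host reads a single ID."""
    token_id = argmax(logits)  # stands in for the device-side reduction
    return token_id, BYTES_PER_I32


# Usage with a hypothetical 1000-entry vocabulary.
logits = [float((i * 37) % 1000) for i in range(1000)]
baseline_token, baseline_bytes = host_side_argmax(logits)
graph_token, graph_bytes = in_graph_argmax(logits)
```

Both paths pick the same token; only the readback size differs, and the gap grows linearly with vocabulary size. This only applies to greedy drafting, since sampling would need more than the top index.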

Together these changes cut device-to-host traffic in the MTP draft path and make per-draft acceptance easier to debug.

This commit improves the handling of tensors in the MTP process, specifically for the Gemma 4 assistant. Key changes include:

- Updated tensor conversion logic to maintain integer types for specific tensors, ensuring compatibility with centroid routing.
- Introduced handling for `mtp.centroids.weight` and `mtp.token_ordering.weight`, ensuring correct tensor shapes and types during processing.
- Enhanced documentation to clarify the new tensor structures and their implications for MTP operations.
- Added new scripts for quantizing and running the Gemma 4 Edge assistant with improved configuration options.
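A short sketch of why integer tensors need to keep their type during conversion. It assumes, without confirmation from this PR, that `mtp.token_ordering.weight` holds integer token IDs and that the default conversion path would otherwise cast tensors to 16-bit floats; `convert_tensor` and `f16_round_trip` are hypothetical helpers. Half precision cannot represent every integer above 2048, so an index table passed through it can silently point at the wrong token.

```python
import struct
from typing import List, Sequence, Tuple, Union

Number = Union[int, float]


def f16_round_trip(x: float) -> float:
    """Store x as an IEEE half-precision float and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def convert_tensor(values: Sequence[Number], keep_integer: bool) -> Tuple[str, List[Number]]:
    """Return (dtype name, converted values) for a flat tensor.

    keep_integer=True models the fixed behaviour: integer data passes through
    unchanged. Otherwise everything is cast to f16, as a float weight would be.
    """
    if keep_integer and all(isinstance(v, int) for v in values):
        return "i32", list(values)
    return "f16", [f16_round_trip(float(v)) for v in values]


# Usage: a tiny, made-up token-ordering table.
token_ordering = [0, 2048, 4097]
kept_dtype, kept = convert_tensor(token_ordering, keep_integer=True)
cast_dtype, cast = convert_tensor(token_ordering, keep_integer=False)
# After the f16 cast, ID 4097 reads back as 4096.0: a different token.
```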

These changes target the performance and accuracy of the MTP draft process, particularly when using ordered embeddings.

This commit introduces TurboQuant, a new family of WHT-rotated low-bit quantization formats designed for KV cache and model weight compression. Key changes include:

- Added support for KV cache types (`turbo2`, `turbo3`, `turbo4`) with significant compression ratios.
- Introduced weight quantization formats (`TQ3_1S`, `TQ4_1S`) for efficient model size reduction.
- Enhanced documentation detailing usage, backend support, and practical examples for TurboQuant integration.
- Added new command-line options for enabling TurboQuant features in the server.

These enhancements aim to optimize memory usage and improve performance in bandwidth-bound scenarios, particularly on Apple Silicon and discrete GPUs.
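The phrase "WHT-rotated" refers to applying a Walsh-Hadamard transform before quantizing, which spreads outliers across all coordinates so a low-bit grid fits the data better. The sketch below shows that general technique with a plain symmetric quantizer. It is not the `turbo2`/`turbo3`/`turbo4` or `TQ3_1S`/`TQ4_1S` layout, whose block sizes and scale encodings are not described in this PR text; all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def wht(x: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in y]


def quantize(x: Sequence[float], bits: int) -> Tuple[float, List[int]]:
    """Symmetric uniform quantizer: one scale per vector, codes in [-levels, levels]."""
    levels = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x)
    scale = amax / levels if amax > 0 else 1.0
    codes = [max(-levels, min(levels, round(v / scale))) for v in x]
    return scale, codes


def dequantize(scale: float, codes: Sequence[int]) -> List[float]:
    return [c * scale for c in codes]


def rotated_round_trip(x: Sequence[float], bits: int) -> List[float]:
    """Rotate, quantize, dequantize, rotate back."""
    scale, codes = quantize(wht(x), bits)
    return wht(dequantize(scale, codes))


# Usage: a vector with a single outlier, the hard case for direct quantization.
x = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
restored = rotated_round_trip(x, bits=4)
```

After rotation the single outlier becomes eight values of equal magnitude, which the quantizer can represent with far less error than a spike next to zeros. Because the transform is orthonormal, the quantization error measured in the rotated domain has the same L2 norm after rotating back.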
@Ooooze Ooooze merged commit 98bbdfe into feature/turboquant-kv-cache May 7, 2026
16 of 41 checks passed