Feature/gemma mtp by Ooooze · Pull Request #3 · AtomicBot-ai/atomic-llama-cpp-turboquant

Ooooze · 2026-05-07T10:00:06Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

Introduces support for the Gemma 4 MTP assistant, allowing for enhanced speculative decoding. This includes new command-line options for specifying the MTP head and draft model, as well as updates to the model architecture and tensor handling. The assistant integrates with the target model, enabling efficient draft generation and improved performance in speculative tasks.

Changes include:

- New command-line options: `--mtp-head` and `--draft-block-size`.
- Updates to the model loading process to accommodate the MTP assistant.
- Enhancements in tensor management for MTP-specific operations.
- Documentation updates for usage examples and guidelines.

This feature aims to improve the overall functionality and efficiency of the model in handling complex tasks.
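For readers unfamiliar with the draft-and-verify loop that a draft block size controls, here is a minimal greedy sketch. It is not the code from this PR: `speculative_step`, `target_next`, and `draft_next` are hypothetical names, the two "models" are toy functions, and a real implementation would verify the whole draft block in one batched target pass rather than token by token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], block_size: int) -> List[int]:
    """One greedy draft-and-verify step (illustrative only)."""
    # 1. The cheap draft model proposes `block_size` tokens autoregressively.
    draft: List[int] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order.
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after a 4, where it guesses 0.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With these toy models, a step starting from `[1]` accepts the full block plus one bonus token, while a step starting from `[3]` stops at the first disagreement.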

…oding

This commit introduces an asynchronous MTP draft pipeline, enhancing the speculative decoding process. Key changes include:

- Updated `draft_block_size` to 3, optimizing performance based on empirical results.
- Added new APIs: `llama_decode_mtp_async` and `llama_decode_mtp_wait` for non-blocking draft requests.
- Enhanced documentation to reflect the async pipeline's functionality and usage.
- Implemented tests to ensure parity between synchronous and asynchronous draft generation.

These improvements aim to increase throughput and efficiency in handling complex tasks within the model.
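The submit-then-wait pattern and the sync/async parity check described in this commit can be sketched in a few lines. This is an illustration only: the signatures of `llama_decode_mtp_async` and `llama_decode_mtp_wait` are not shown on this page, so the `AsyncDrafter` class, its methods, and the toy `draft_tokens` rule below are all hypothetical stand-ins.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def draft_tokens(context: List[int], block_size: int = 3) -> List[int]:
    """Hypothetical stand-in for a draft head: a deterministic toy rule."""
    out: List[int] = []
    last = context[-1]
    for _ in range(block_size):
        last = (last * 31 + 7) % 1000
        out.append(last)
    return out


class AsyncDrafter:
    """Submit a draft request without blocking, then collect it later."""

    def __init__(self) -> None:
        # A single worker keeps draft requests strictly ordered.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, context: List[int], block_size: int = 3) -> Future:
        # Copy the context so later caller-side mutation cannot race the worker.
        return self._pool.submit(draft_tokens, list(context), block_size)

    def wait(self, handle: Future) -> List[int]:
        # Blocks until the draft is ready and returns its tokens.
        return handle.result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

A parity test then only needs to assert that `wait(submit(ctx))` returns the same tokens as the synchronous call for the same context.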

…P handling

This commit introduces significant improvements to the speculative decoding process by implementing a pipeline depth-2 mechanism that allows MTP draft computation to overlap with target verification. Key changes include:

- Added `prepare_next` and `cancel` hooks in the `common_speculative_state` interface for better async draft management.
- Implemented logic to drain any pending MTP requests before new iterations to prevent race conditions.
- Updated documentation to reflect the new pipeline depth-2 functionality and its implications for performance.
- Enhanced the `common_speculative` API with new functions for managing async MTP work.

These enhancements aim to improve throughput and efficiency in speculative decoding tasks, ensuring smoother operation during concurrent processing.
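One way to picture the `prepare_next` / `cancel` hooks and the drain-before-iterate rule is the sketch below. It borrows only the two hook names from the commit message; the `PipelinedDrafter` class, its `draft` method, and the "reuse the draft only if the predicted context matches" policy are assumptions made for illustration, not the behaviour of `common_speculative_state`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple


class PipelinedDrafter:
    """Depth-2 sketch: draft for the next step while the target verifies."""

    def __init__(self, draft_fn: Callable[[List[int]], List[int]]) -> None:
        self._draft_fn = draft_fn
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Tuple[tuple, object]] = None

    def prepare_next(self, predicted_context: List[int]) -> None:
        # Start drafting early, assuming verification accepts everything.
        self.cancel()
        key = tuple(predicted_context)
        self._pending = (key, self._pool.submit(self._draft_fn, list(key)))

    def cancel(self) -> None:
        # Drain: after this call no draft request is left in flight.
        if self._pending is not None:
            _, fut = self._pending
            if not fut.cancel():
                fut.result()  # already running or done: wait it out
            self._pending = None

    def draft(self, context: List[int]) -> List[int]:
        key = tuple(context)
        if self._pending is not None and self._pending[0] == key:
            # The early guess was right: reuse the overlapped work.
            _, fut = self._pending
            self._pending = None
            return fut.result()
        # The guess was wrong (or absent): drain, then draft synchronously.
        self.cancel()
        return self._draft_fn(list(key))

    def close(self) -> None:
        self.cancel()
        self._pool.shutdown(wait=True)
```

Draining in `cancel` before every new request is what keeps a stale draft from racing with the next iteration in this sketch.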

This commit introduces an optional NDJSON tracer for MTP draft and accept events, controlled by the environment variable `LLAMA_MTP_ACC_TRACE`. Key changes include:

- Implementation of the `mtp_acc_tracer` class for tracing MTP events with configurable output options.
- Integration of tracing logic into the `common_speculative_state_mtp` structure, capturing relevant metrics during draft and acceptance processes.
- Updates to the MTP decoding functions to utilize in-graph argmax for improved performance and reduced data transfer overhead.
- Addition of a new shell script for running the Gemma 4 MTP server with enhanced configuration options.

These enhancements aim to provide better observability and performance in MTP operations, facilitating debugging and optimization of the speculative decoding process.
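NDJSON simply means one JSON object per line, which makes a trace easy to append to and to post-process. The sketch below shows the idea with an in-memory sink. The record schema (`event`, `n_tokens`) and the `NdjsonTracer` and `acceptance_rate` names are hypothetical; the actual fields written by `mtp_acc_tracer`, and how the value of `LLAMA_MTP_ACC_TRACE` is interpreted, are not shown on this page.

```python
import io
import json
from typing import TextIO


class NdjsonTracer:
    """Write one compact JSON object per line to a text sink."""

    def __init__(self, sink: TextIO) -> None:
        self._sink = sink

    def emit(self, event: str, **fields) -> None:
        record = {"event": event, **fields}
        self._sink.write(json.dumps(record, separators=(",", ":")) + "\n")


def acceptance_rate(ndjson_text: str) -> float:
    """Accepted draft tokens divided by drafted tokens (hypothetical schema)."""
    drafted = 0
    accepted = 0
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        if rec["event"] == "draft":
            drafted += rec["n_tokens"]
        elif rec["event"] == "accept":
            accepted += rec["n_tokens"]
    return accepted / drafted if drafted else 0.0


# Example: two draft blocks of 3 tokens, with 2 and 1 tokens accepted.
sink = io.StringIO()
tracer = NdjsonTracer(sink)
tracer.emit("draft", n_tokens=3)
tracer.emit("accept", n_tokens=2)
tracer.emit("draft", n_tokens=3)
tracer.emit("accept", n_tokens=1)
```

Because each line parses on its own, a partially written trace from an interrupted run can still be analysed up to its last complete line.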

…cing

This commit introduces an in-graph argmax for MTP draft processing, significantly improving throughput by reducing data transfer overhead. Key changes include:

- Implementation of `ggml_argmax` to publish final logits, allowing the host to read only the necessary token ID.
- Addition of a diagnostic feature for per-draft acceptance tracing, enabling detailed logging of MTP events for better observability.
- Documentation updates to reflect these enhancements and provide usage examples for the new tracing functionality.

These improvements aim to optimize MTP operations and facilitate debugging in the speculative decoding process.
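The saving from computing the argmax inside the graph is easy to quantify: the host reads back one integer instead of one float per vocabulary entry. The sketch below is plain Python rather than ggml code, and the helper names and the vocabulary size used in the example are hypothetical.

```python
from typing import Sequence


def argmax(logits: Sequence[float]) -> int:
    """Index of the largest logit; ties resolve to the lowest index."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def bytes_full_logits(vocab_size: int) -> int:
    # Reading back every logit as float32 costs 4 bytes per vocabulary entry.
    return vocab_size * 4


def bytes_token_id() -> int:
    # Reading back only the selected token as int32 costs 4 bytes in total.
    return 4


def transfer_reduction(vocab_size: int) -> float:
    """How many times smaller the readback is when only the ID is copied."""
    return bytes_full_logits(vocab_size) / bytes_token_id()
```

For a hypothetical vocabulary of 1000 entries the readback shrinks from 4000 bytes to 4 bytes per drafted token, and the ratio grows linearly with vocabulary size. This only applies to greedy drafting, since sampling strategies that need the full distribution would still require the logits.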

This commit improves the handling of tensors in the MTP process, specifically for the Gemma 4 assistant. Key changes include:

- Updated tensor conversion logic to maintain integer types for specific tensors, ensuring compatibility with centroid routing.
- Introduced handling for `mtp.centroids.weight` and `mtp.token_ordering.weight`, ensuring correct tensor shapes and types during processing.
- Enhanced documentation to clarify the new tensor structures and their implications for MTP operations.
- Added new scripts for quantizing and running the Gemma 4 Edge assistant with improved configuration options.

These enhancements aim to optimize the performance and accuracy of the MTP draft process, particularly when using ordered embeddings.
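The reason integer tensors need special treatment during conversion is that index-like data is destroyed by a cast to a floating-point or quantized type. A minimal sketch of such a rule follows. It is not the conversion script from this PR: the `choose_output_dtype` function, the dtype strings, and the dtypes assigned to the two tensors in the example are assumptions for illustration only.

```python
# Dtype labels used by this sketch (hypothetical, not a real library's enum).
INTEGER_DTYPES = {"i8", "i16", "i32", "i64"}


def choose_output_dtype(name: str, src_dtype: str,
                        default_float: str = "f16") -> str:
    """Pick the dtype a tensor should be written with.

    Integer tensors hold indices or orderings, so they are passed through
    unchanged; everything else is converted to the default float type.
    """
    if src_dtype in INTEGER_DTYPES:
        return src_dtype
    return default_float


# Example inputs: which dtype each tensor really has is an assumption here.
example = {
    "mtp.token_ordering.weight": "i32",
    "mtp.centroids.weight": "f32",
}
planned = {name: choose_output_dtype(name, dt) for name, dt in example.items()}
```

Keying the rule on the source dtype rather than on a list of tensor names is a design choice of this sketch; a name-based allowlist would work equally well if the set of integer tensors is fixed.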

This commit introduces TurboQuant, a new family of WHT-rotated low-bit quantization formats designed for KV cache and model weight compression. Key changes include:

- Added support for KV cache types (`turbo2`, `turbo3`, `turbo4`) with significant compression ratios.
- Introduced weight quantization formats (`TQ3_1S`, `TQ4_1S`) for efficient model size reduction.
- Enhanced documentation detailing usage, backend support, and practical examples for TurboQuant integration.
- Added new command-line options for enabling TurboQuant features in the server.

These enhancements aim to optimize memory usage and improve performance in bandwidth-bound scenarios, particularly on Apple Silicon and discrete GPUs.
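"WHT-rotated" refers to applying a Walsh-Hadamard transform before quantizing. The sketch below shows only that rotation step, using the standard fast in-place butterfly with orthonormal scaling. It says nothing about the actual `turbo*` or `TQ*` block layouts, which are not described on this page, and the function name `fwht` is chosen for illustration.

```python
import math
from typing import List, Sequence


def fwht(values: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]


# A single outlier is spread evenly across all coordinates.
spike = [1.0, 0.0, 0.0, 0.0]
rotated = fwht(spike)       # every entry becomes 0.5
restored = fwht(rotated)    # applying the transform again undoes it
```

Spreading outliers this way narrows the value range that a low-bit quantizer has to cover, and because the scaled transform is its own inverse, the same routine can be used to rotate back after dequantization.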

Ooooze added 7 commits May 6, 2026 20:58

Ooooze merged commit 98bbdfe into feature/turboquant-kv-cache May 7, 2026
16 of 41 checks passed

github-actions bot added the documentation, testing, examples, server, python, script, and model labels May 7, 2026

Ooooze merged 7 commits into feature/turboquant-kv-cache from feature/gemma-mtp
