Add continuous batching to server #1027

Merged: Blaizzy merged 44 commits into main from pc/continous-batch on Apr 16, 2026

Blaizzy (Owner) commented on Apr 16, 2026

Summary

Continuous Batching Server

  • Single GPU thread with BatchGenerator processes multiple concurrent requests together
  • New requests join the active batch immediately without waiting for existing ones to finish
  • Mixed batches of image and text-only requests supported with drain-before-insert for correct embeddings
  • asyncio.to_thread unblocks the FastAPI event loop for true concurrent request handling
  • Time-budgeted generation loop (0.5s bursts) for responsive batching
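
A minimal sketch of the single-worker design described above, using only the standard library. `ToyBatchWorker`, `submit`, `collect`, and the fake token strings are hypothetical stand-ins, not the PR's `ResponseGenerator` or `BatchGenerator`. The sketch also assumes that queued requests are picked up at burst boundaries, which may differ from how the actual server admits them.

```python
import queue
import threading
import time


class ToyBatchWorker:
    """Hypothetical stand-in for a single GPU thread; not the PR's ResponseGenerator."""

    def __init__(self, burst_seconds=0.5):
        self.requests = queue.Queue()   # request handlers put work here
        self.burst_seconds = burst_seconds
        self.active = {}                # request id -> per-sequence state
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request_id, total_tokens):
        """Queue a request and return the queue its tokens will arrive on."""
        out = queue.Queue()
        self.requests.put((request_id, total_tokens, out))
        return out

    def _drain_pending(self, block):
        # Pull every queued request so they all join the batch together.
        try:
            item = self.requests.get(timeout=0.05) if block else self.requests.get_nowait()
            while True:
                request_id, total_tokens, out = item
                self.active[request_id] = {"done": 0, "total": total_tokens, "out": out}
                item = self.requests.get_nowait()
        except queue.Empty:
            pass

    def _step(self):
        # One fake decode step: emit one token for every active sequence.
        for request_id in list(self.active):
            state = self.active[request_id]
            state["out"].put(f"{request_id}:{state['done']}")
            state["done"] += 1
            if state["done"] >= state["total"]:
                state["out"].put(None)  # sentinel: this sequence is finished
                del self.active[request_id]

    def _run(self):
        while not self._stop.is_set():
            # Block briefly when idle; otherwise only check for newcomers.
            self._drain_pending(block=not self.active)
            deadline = time.monotonic() + self.burst_seconds
            while self.active and time.monotonic() < deadline:
                self._step()

    def close(self):
        self._stop.set()
        self._thread.join()


def collect(out):
    """Read tokens from an output queue until the sentinel arrives."""
    tokens = []
    while (token := out.get()) is not None:
        tokens.append(token)
    return tokens
```

Because only the worker thread touches `active`, no lock is needed around the batch state; the two queues are the only shared objects.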

OpenAI-Compatible Response Format

  • id, object, created, index, prompt_tokens_details — matches mlx-lm field-for-field
  • reasoning/content split for thinking models (handles <|channel>thought/<channel|> and <think>/</think>)
  • Streaming: tag-aware state machine routes tokens to reasoning vs content fields
  • completion_tokens excludes thinking tag tokens (matches mlx-lm count)
  • /v1/ route aliases for OpenAI SDK compatibility
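
The streaming split can be pictured as a two-state machine. The sketch below is illustrative only: `ThinkingRouter` and `route` are hypothetical names, and it assumes each thinking tag arrives as one complete streamed piece, which the PR's implementation may not require.

```python
class ThinkingRouter:
    """Hypothetical two-state router: pieces go to 'reasoning' or 'content'."""

    def __init__(self, open_tag="<think>", close_tag="</think>"):
        self.open_tag = open_tag
        self.close_tag = close_tag
        self.in_reasoning = False

    def feed(self, piece):
        """Return (field, text) for one streamed piece, or None for a tag."""
        if piece == self.open_tag:
            self.in_reasoning = True
            return None
        if piece == self.close_tag:
            self.in_reasoning = False
            return None
        field = "reasoning" if self.in_reasoning else "content"
        return field, piece


def route(pieces):
    """Accumulate routed pieces into separate reasoning and content strings."""
    router = ThinkingRouter()
    out = {"reasoning": "", "content": ""}
    for piece in pieces:
        routed = router.feed(piece)
        if routed is not None:
            field, text = routed
            out[field] += text
    return out
```

The tags themselves are swallowed, so neither field contains `<think>` or `</think>`.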

Multi-Turn Tool Calling

  • ChatMessage accepts role="tool", tool_calls, tool_call_id, name
  • Tool metadata preserved through apply_chat_template to Jinja templates
  • process_tool_calls parses model output for structured tool_calls[] responses
  • Works in both streaming and non-streaming paths
  • finish_reason="tool_calls" emitted correctly
  • Tool arguments normalized (JSON string ↔ dict) for cross-template compatibility
  • Gemma4 tool parser returns arguments as JSON string per OpenAI spec
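
The argument normalization in both directions can be sketched as follows. The helper names `arguments_as_dict` and `arguments_as_json` are hypothetical and are not taken from the PR; the only assumption is that arguments arrive either as a JSON string or as a mapping.

```python
import json


def arguments_as_dict(arguments):
    """For templates that iterate over arguments: accept a JSON string or a mapping."""
    if isinstance(arguments, str):
        return json.loads(arguments) if arguments.strip() else {}
    return dict(arguments)


def arguments_as_json(arguments):
    """For API responses: emit a JSON string, leaving existing strings untouched."""
    if isinstance(arguments, str):
        return arguments  # already encoded; encoding again would double-encode
    return json.dumps(arguments)
```

Passing an already-encoded string through `arguments_as_json` returns it unchanged, which is what avoids double-encoding on a round trip.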

Vision Feature Caching

  • vision_cache kwarg passed to get_input_embeddings — all 44 models patched
  • Cache lookup/store happens inside the vision tower call (no encode_image needed)
  • Gemma4: 228x speedup, Qwen3.5: 23x speedup on cache hit
  • --vision-cache-size CLI flag (default: 20 images, ~30MB max), LRU eviction
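
An LRU feature cache of this kind can be sketched with `collections.OrderedDict`. `LRUFeatureCache` and `encode_with_cache` are hypothetical names for illustration; the PR's `VisionFeatureCache` interface is not shown in this description and may differ.

```python
from collections import OrderedDict


class LRUFeatureCache:
    """Hypothetical LRU cache keyed by an image identifier."""

    def __init__(self, max_size=20):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, features):
        self._items[key] = features
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._items)


def encode_with_cache(cache, key, encoder):
    """Run the (expensive) encoder only on a cache miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    features = encoder(key)
    cache.put(key, features)
    return features
```

On a hit the encoder is never called, which is where a speedup on repeated images would come from.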

Model Fixes

  • Gemma4: Deduplicated attention mask creation (_make_masks) — matches mlx-lm per-step speed
  • Qwen3.5: Reset _position_ids/_rope_deltas per prefill, clamp negative BatchKVCache offsets
  • All models: Always use batch-aware caches in _process_prompts (fixes RotatingKVCache.extend crash)
  • BatchGenerator: Use tokenizer.stopping_criteria in addition to stop_tokens set
  • StoppingCriteria: add_eos_token_ids handles int token IDs (not just strings)
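
For the Gemma4 item, the idea of building one mask per layer type and reusing it can be sketched as below. `make_masks`, `run_layers`, and `build_mask` are hypothetical; the signature of the PR's `_make_masks` is not shown here, so this illustrates the deduplication only.

```python
def make_masks(layer_types, build_mask):
    """Build one mask per unique layer type instead of one per layer."""
    masks = {}
    for kind in layer_types:
        if kind not in masks:
            masks[kind] = build_mask(kind)
    return masks


def run_layers(layers, layer_types, build_mask, x):
    """Apply each layer with the shared mask for its type."""
    masks = make_masks(layer_types, build_mask)
    for layer, kind in zip(layers, layer_types):
        x = layer(x, masks[kind])
    return x
```

With two layer types, `build_mask` runs twice per forward pass regardless of how many layers the model has.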

Server Features

  • --model flag to pre-load model at startup
  • enable_thinking on by default, configurable per request
  • Full sampling params: top_k, min_p, repetition_penalty, logit_bias, thinking_budget
  • GenerationArguments with to_generate_kwargs()/to_template_kwargs() — no duplicated kwargs
  • _build_gen_args() shared helper for both /responses and /chat/completions
  • ResizeShapeInput with field_validator for proper validation
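
One way to keep sampling and template options in a single place is a dataclass with two projection methods. `GenArgsSketch` is a hypothetical illustration: the `max_tokens` and `temperature` fields, the defaults, and the decision about which fields go to the generator versus the chat template are all assumptions, not the PR's `GenerationArguments`.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class GenArgsSketch:
    """Hypothetical container for per-request generation options."""

    max_tokens: int = 256            # assumed field, not named in the PR text
    temperature: float = 0.0         # assumed field, not named in the PR text
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    logit_bias: Optional[Dict[int, float]] = None
    enable_thinking: bool = True
    thinking_budget: Optional[int] = None

    def to_generate_kwargs(self):
        """Options for the generation call; unset (None) values are dropped."""
        template_only = {"enable_thinking"}
        return {
            key: value
            for key, value in asdict(self).items()
            if key not in template_only and value is not None
        }

    def to_template_kwargs(self):
        """Options for the chat template (assumed to be enable_thinking only)."""
        return {"enable_thinking": self.enable_thinking}
```

Each endpoint can then build one object per request and call the two methods, rather than assembling keyword dictionaries inline.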

Benchmarks

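Server /chat/completions, google/gemma-4-26b-a4b-it, thinking enabled, Apple Silicon:

| B | Mode  | Server  | Time  | Prompt tok/s | Gen tok/s | Total tok/s | Peak Mem |
|---|-------|---------|-------|--------------|-----------|-------------|----------|
| 1 | seq   | mlx-lm  | 2.26s | 9            | 43        | 52          | n/a      |
| 1 | seq   | mlx-vlm | 2.28s | 9            | 44        | 53          | 51.7 GB  |
| 1 | batch | mlx-lm  | 2.12s | 9            | 46        | 56          | n/a      |
| 1 | batch | mlx-vlm | 1.90s | 11           | 53        | 63          | 51.7 GB  |
| 4 | batch | mlx-lm  | 3.70s | 25           | 101       | 126         | n/a      |
| 4 | batch | mlx-vlm | 3.96s | 23           | 96        | 118         | 52.0 GB  |
| 8 | batch | mlx-lm  | 7.98s | 22           | 88        | 109         | n/a      |
| 8 | batch | mlx-vlm | 7.95s | 22           | 89        | 111         | 52.3 GB  |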

At B=8: mlx-vlm matches mlx-lm (111 vs 109 tok/s).

Tool calling: 5-tool agentic workflow (weather → forecast → search → book → email) completes in 4 turns.

Vision cache: 228x speedup on gemma4 cache hit (229ms → 1ms), 1GB peak memory saved.

Test plan

  • Sequential and concurrent text-only generation
  • Sequential and concurrent image generation (real images)
  • Mixed batch (text + images) with vision cache
  • Multi-turn with back-references (4 users × 4 turns)
  • Multi-turn tool calling (gemma4 + qwen3.5)
  • Streaming tool call parsing
  • Server streaming and non-streaming endpoints
  • /responses and /v1/ route aliases
  • Response format matches mlx-lm (all fields, token counts)
  • Vision feature caching across all 44 models
  • --model pre-loading, --vision-cache-size
  • 396 unit tests pass, 27 server tests pass

Blaizzy and others added 30 commits April 16, 2026 05:30
- Introduced a global ResponseGenerator for managing continuous batching in the server.
- Updated Batch and BatchGenerator classes to handle per-sequence samplers and logits processors, improving flexibility in prompt handling.
- Enhanced LanguageModel to support per-sequence cache offsets for better performance during batched generation.
- Refactored insert and remove methods in BatchGenerator to accommodate new features and improve resource management.
- Added detailed docstrings for clarity on new functionalities and usage.
…nseGenerator

- Added peak memory tracking to Response and StreamingToken classes for better resource management.
- Updated BatchGenerator to include peak memory usage during response generation.
- Introduced preprocessing of images and audio before queuing requests to optimize performance and reduce blocking.
- Refactored input handling to utilize preprocessed inputs, improving efficiency in the generation process.
- Enhanced comments and documentation for clarity on new functionalities.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cessary chunking

- Drain pending text-only prompts BEFORE inserting image request (not after),
  so image prompts aren't prefilled without their embeddings
- Only chunk prefill when prompt exceeds prefill_step_size (matching main's behavior)
- Pass mask to get_input_embeddings for correct attention_mask_4d generation
- Remove stale pixel_values from gen_kwargs passed to language model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ResponseGenerator is server-specific infrastructure (threaded queue,
request dispatching) — it belongs with the server, not the generation
library. BatchGenerator stays in generate.py for offline batch use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collect all pending requests from the queue before calling next(),
so multiple text-only requests get inserted and prefilled together
in a single batch call instead of one-at-a-time.

Text-only batch speedup: 1.16x -> 1.54x (gemma4-26b)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reset _position_ids and _rope_deltas in _process_prompts before each
  new batch prefill to prevent stale position state from previous batches
- Clamp negative BatchKVCache offsets to 0 in Qwen3.5 language model

Qwen3.5 stores position state on the model instance which breaks when
BatchGenerator prefills multiple batches sequentially. The stale cached
positions from batch N are incorrectly used for batch N+1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the old thread/queue approach with a cleaner design:
- Single dedicated thread owns all GPU work via BatchGenerator
- FastAPI handlers submit preprocessed requests to a queue
- GPU thread runs next() in a tight time-budgeted loop (0.5s bursts)
- Concurrent requests are batched together automatically
- Image requests drain pending text-only first, then prefill inline

Results on gemma-4-26b (5 prompts, max_tokens=100):
  Sequential: 3.56s (72 tok/s) — 2.3x faster than mlx-lm
  Batch:      2.27s (112 tok/s) — 1.46x faster than mlx-lm

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-call dual create_attention_mask with _make_masks() that
creates one mask per unique layer type (full_attention, sliding_attention)
and reuses it across layers. Also streamline the layer loop to use
zip iteration instead of index lookups.

Closes the 1.21x per-step gap with mlx-lm (89 vs 88 tok/s, batch of 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All preprocessing (prepare_inputs + get_input_embeddings) now runs on the
single GPU thread. Callers only put raw data (prompt, image paths, args)
on the queue. This prevents concurrent Metal access when multiple
FastAPI requests arrive simultaneously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… thread

- Pass enable_thinking from request to apply_chat_template on both
  /responses and /chat/completions endpoints
- Move prepare_inputs (CPU: tokenize, load images) to caller thread,
  keep only get_input_embeddings (GPU) on the GPU thread — reduces
  GPU thread blocking during batch generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The async endpoints were blocking the event loop on Queue.get(),
preventing concurrent request processing. Wrap blocking generate()
and token iteration in asyncio.to_thread so the event loop stays
free to accept new requests.

Before: B=4 batch 65 tok/s (sequential execution)
After:  B=4 batch 120 tok/s (true concurrent batching)
At B=8: matches mlx-lm (114 vs 114 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
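
The event-loop fix in the commit above can be illustrated with the standard library alone. In this sketch `demo` and `heartbeat` are hypothetical names: a blocking `Queue.get()` is moved onto a worker thread with `asyncio.to_thread` (Python 3.9+), and a heartbeat task shows that the loop keeps running while the call waits.

```python
import asyncio
import queue
import threading


async def demo():
    results = queue.Queue()
    ticks = 0

    async def heartbeat():
        # Keeps counting only if the event loop is not blocked.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # Simulate a generation thread that delivers its result after 0.1 s.
    threading.Timer(0.1, results.put, args=("done",)).start()
    # Calling results.get() directly here would stall the loop until the
    # result arrived; to_thread runs the blocking call on a worker thread.
    result = await asyncio.to_thread(results.get)
    beat.cancel()
    return result, ticks
```

Running `asyncio.run(demo())` returns the result together with a tick count above one, since the heartbeat advanced while the blocking call was pending.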
- Separate thinking tokens into `reasoning` field, clean answer in `content`
  (handles <|channel>thought...<channel|> and <think>...</think>)
- Add `id`, `object`, `created` to response/chunk models
- Use standard `prompt_tokens`/`completion_tokens` field names
- Stream: tag-aware state machine routes tokens to reasoning vs content
- Add `data: [DONE]` SSE terminator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply _split_thinking to both non-streaming and streaming /responses
output, so output_text and content contain clean answers and reasoning
is in a separate field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add index to ChatChoice and ChatStreamChoice
- Add prompt_tokens_details with cached_tokens to UsageStats

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Count raw generated tokens minus thinking tag tokens instead of
re-encoding text (which loses boundary tokens). Now matches
mlx-lm exactly: prompt_tokens=23, completion_tokens=40, total=63.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
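
The counting rule in the commit above (raw generated tokens minus thinking tag tokens) can be sketched as follows. `count_completion_tokens` and the token IDs in the usage line are hypothetical; real tag IDs depend on the tokenizer.

```python
def count_completion_tokens(token_ids, tag_token_ids):
    """Count generated tokens, excluding tokens that are thinking tags."""
    tag_token_ids = set(tag_token_ids)
    tag_count = sum(1 for token in token_ids if token in tag_token_ids)
    return len(token_ids) - tag_count
```

For example, with hypothetical tag IDs 100 and 101, `count_completion_tokens([100, 7, 8, 101, 9], {100, 101})` gives 3. Counting IDs directly avoids the decode and re-encode round trip that the commit describes as losing boundary tokens.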
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…an tags

Four bugs fixed:

1. prompt_utils: Tool messages (tool_calls, tool_call_id, role="tool")
   now pass through to the Jinja template instead of being stripped
   by _get_role_content/get_message_json.

2. server ChatMessage: Accept role="tool", add tool_calls, tool_call_id,
   name fields. Server message processing preserves these fields.

3. server: Add process_tool_calls + tool parser detection to
   /chat/completions. Model output is parsed for <|tool_call> tags,
   returned as structured tool_calls with finish_reason="tool_calls".

4. tool_parsers/gemma4: Return arguments as JSON string (OpenAI spec),
   not dict, preventing double-encoding on round-trip.

Also fixes thinking tag leaks:
- _split_thinking handles partial "thought\n...<channel|>" continuation
- Tool call turns return content=None (no leaked <|tool_call> tags)

Tested: 5-tool agentic workflow (weather → forecast → search → book →
email) completing in 4 turns with correct tool selection and
back-references across turns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tool args

- BatchGenerator._next() now checks tokenizer.stopping_criteria(t) in
  addition to self.stop_tokens. Fixes models like Qwen3.5 where
  config.eos_token_id is None but the tokenizer has <|im_end|> as EOS.
- Normalize tool_calls arguments from JSON string to dict before passing
  to Jinja templates (Qwen3.5 template iterates arguments with |items).
- Handle partial </think> tag (no opening <think>) in _split_thinking.
- Strip model control tokens from tool-call remaining text.

Before: Qwen3.5 generated past <|im_end|>, hallucinated fake turns.
After: Clean 3-turn tool calling matching mlx-lm behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-add VisionFeatureCache that was lost during rebase:
- Create in get_cached_model, clear in unload_model_sync
- Pass vision_cache to all fallback stream_generate/generate calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Updated ResponseGenerator to include vision_cache for improved efficiency.
- Modified __init__ method to accept vision_cache parameter.
- Enhanced _gpu_embed method to utilize cached image features, reducing redundant processing for repeated images.
- Adjusted request handling to accommodate images in batch processing.

This change optimizes the handling of image inputs, leveraging caching to enhance performance.
… CLI

- VisionFeatureCache integrated into ResponseGenerator._gpu_embed:
  cache hit skips vision encoder (229ms → 1ms, 1GB memory saved)
- Shared VisionFeatureCache instance between ResponseGenerator and
  fallback paths
- --vision-cache-size CLI flag (default: 20 images, ~30MB max)
- LRU eviction, cleared on model unload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Changed section headers for Multi-Image Chat Support and Video Understanding to improve readability.
- Nested Usage Examples under their respective sections for better organization.
- Added detailed command line examples for both Multi-Image Chat Support and Video Understanding features.
Blaizzy and others added 8 commits April 16, 2026 17:13
Fixes test_generate.py import error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n-only tests

- test_gemma4_tool_parser: arguments is now JSON string per OpenAI spec
- test_generate: Batch needs samplers/logits_processors/tokens fields,
  logprobs is List[mx.array] not mx.array
- Skip tests for main-only features not in this branch:
  - TestSamplerArgs (make_sampler API)
  - CLI enable_thinking/thinking_start_token
  - Server schema (enable_thinking, resize_shape single int)
  - TurboQuant kv_quant_scheme parameter

387 passed, 10 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace generate.py with main's version (make_sampler, TurboQuant,
enable_thinking, PromptCacheState, vision_cache) plus our one addition:
_process_prompts resets _position_ids/_rope_deltas for Qwen3.5 batch compat.

Restore test files from main. 3 server tests fail due to our server
having a different schema (continuous batching rewrite) — not regressions.

393 passed, 3 failed (server schema), 1 skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ings

- Pass vision_cache + _image_key as kwargs to get_input_embeddings so
  models can cache/reuse vision features internally (gemma4: 228x,
  Qwen3.5: 23x speedup on cache hit)
- Always call get_input_embeddings (even text-only) since main's
  BatchGenerator._process_prompts requires inputs_embeds
- Fix add_eos_token_ids to handle int token IDs (not just strings)
  — fixes crash when config.eos_token_id provides ints
- Add vision_cache support to gemma4 and qwen3_5 get_input_embeddings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced new fields in GenerationArguments and OpenAIRequest for top_k, min_p, repetition_penalty, logit_bias, enable_thinking, thinking_budget, and thinking_start_token.
- Added methods to convert GenerationArguments to keyword arguments for generation and template functions.
- Implemented a helper function to build GenerationArguments from request data.
- Updated resize_shape handling with a field validator to normalize input.

These changes enhance the model's generation capabilities and support advanced features for thinking mode.
- Removed redundant instantiation of GenerationArguments in multiple locations within the responses and chat completions endpoints.
- Replaced instances of `args` with `gen_args` to streamline the argument passing to the response generator.
- This change simplifies the code and improves maintainability by reducing duplication.
- Refactor GenerationArguments with full sampling params (top_k, min_p,
  repetition_penalty, logit_bias, enable_thinking, thinking_budget,
  thinking_start_token) + to_generate_kwargs()/to_template_kwargs()
- Add _build_gen_args() helper, remove all inline kwargs duplication
- Add ResizeShapeInput type + field_validator for resize_shape
- Forward all sampling/thinking params to generate() fallback paths
- Fix RotatingKVCache.extend crash: always use batch-aware caches in
  _process_prompts (removes single-prompt standard cache optimization
  that breaks when continuous batching extends the batch later)
- Add vision_cache support to all 44 model get_input_embeddings
- Fix add_eos_token_ids to handle int token IDs
- Fix phi3_v vision cache patch ordering

396 passed, 1 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test GenerationArguments, _build_gen_args, _split_thinking,
ChatMessage tool-calling schema, process_tool_calls, and
_count_thinking_tag_tokens.

27 server tests pass (18 new + 9 existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy force-pushed the pc/continous-batch branch from fffcac8 to 0be8c0f on April 16, 2026 19:42
Blaizzy and others added 6 commits April 16, 2026 21:57
/v1/responses, /v1/chat/completions, /v1/models now mirror
the base routes (hidden from schema docs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulate full output during streaming and parse tool calls at the end.
If tool calls are detected, emit a final SSE chunk with structured
tool_calls[] and finish_reason="tool_calls" — matching what OpenAI SDKs
(and tools like Pi) expect for agentic workflows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
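
The accumulate-then-parse approach in the commit above can be sketched as a generator. This is an assumption-laden illustration: `stream_chunks` is a hypothetical name, the `<tool_call>{...}</tool_call>` JSON format is assumed for the example and is not the Gemma4 format, and the chunk dictionaries only loosely follow the OpenAI streaming shape.

```python
import json
import re

# Assumed tag format for illustration; real parsers are model-specific.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def stream_chunks(pieces):
    """Yield one chunk per streamed piece, then a final chunk with the outcome."""
    full_text = ""
    for piece in pieces:
        full_text += piece  # accumulate so tool calls can be parsed at the end
        yield {"delta": {"content": piece}, "finish_reason": None}

    calls = []
    for index, match in enumerate(TOOL_CALL_RE.finditer(full_text)):
        payload = json.loads(match.group(1))
        calls.append(
            {
                "index": index,
                "type": "function",
                "function": {
                    "name": payload["name"],
                    # Arguments are sent as a JSON string, per the OpenAI spec.
                    "arguments": json.dumps(payload.get("arguments", {})),
                },
            }
        )

    if calls:
        yield {"delta": {"tool_calls": calls}, "finish_reason": "tool_calls"}
    else:
        yield {"delta": {}, "finish_reason": "stop"}
```

Parsing only after the last piece keeps the parser simple, at the cost of the client not seeing structured tool calls until generation ends.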
Restore: Thinking Budget, model docs table (DOTS-OCR, Gemma 4, etc.),
Vision Feature Caching, TurboQuant KV Cache, Activation Quantization,
/v1/ route docs, server preload docs.

Add: Continuous Batching section under Server, --vision-cache-size option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g headers

- Replace @app.on_event("startup") with asynccontextmanager lifespan
- Add CORSMiddleware (allow all origins)
- Restore KV quant env helpers (get_prefill_step_size, get_quantized_kv_bits, etc.)
- Add --kv-bits, --kv-quant-scheme, --kv-group-size, --max-kv-size,
  --quantized-kv-start, --prefill-step-size, --reload CLI args
- Add Cache-Control/Connection/X-Accel-Buffering headers to StreamingResponse
- Import DEFAULT_KV_*, DEFAULT_THINKING_*, DEFAULT_PREFILL_STEP_SIZE
- Remove unused chat_messages variable
- reload=False by default (was True)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The streaming /chat/completions path was emitting finish_reason=null
on all chunks. Now propagates the actual finish_reason (stop/length)
from the ResponseGenerator's StreamingToken to the SSE chunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy merged commit e2e9e67 into main on Apr 16, 2026
1 check passed
Blaizzy added a commit that referenced this pull request Apr 16, 2026
* Add continuous batching support

- Introduced a global ResponseGenerator for managing continuous batching in the server.
- Updated Batch and BatchGenerator classes to handle per-sequence samplers and logits processors, improving flexibility in prompt handling.
- Enhanced LanguageModel to support per-sequence cache offsets for better performance during batched generation.
- Refactored insert and remove methods in BatchGenerator to accommodate new features and improve resource management.
- Added detailed docstrings for clarity on new functionalities and usage.

* Enhance memory tracking and preprocessing in BatchGenerator and ResponseGenerator

- Added peak memory tracking to Response and StreamingToken classes for better resource management.
- Updated BatchGenerator to include peak memory usage during response generation.
- Introduced preprocessing of images and audio before queuing requests to optimize performance and reduce blocking.
- Refactored input handling to utilize preprocessed inputs, improving efficiency in the generation process.
- Enhanced comments and documentation for clarity on new functionalities.

* add continous batching

* Add back PromptCacheState removed during rebase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix continuous batching image prefill: drain before insert, skip unnecessary chunking

- Drain pending text-only prompts BEFORE inserting image request (not after),
  so image prompts aren't prefilled without their embeddings
- Only chunk prefill when prompt exceeds prefill_step_size (matching main's behavior)
- Pass mask to get_input_embeddings for correct attention_mask_4d generation
- Remove stale pixel_values from gen_kwargs passed to language model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move ResponseGenerator from generate.py to server.py

ResponseGenerator is server-specific infrastructure (threaded queue,
request dispatching) — it belongs with the server, not the generation
library. BatchGenerator stays in generate.py for offline batch use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add continuous batching section to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add server start command to continuous batching section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --model flag to pre-load model at server startup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Optimize text-only batching: drain queue before next() for fused prefill

Collect all pending requests from the queue before calling next(),
so multiple text-only requests get inserted and prefilled together
in a single batch call instead of one-at-a-time.

Text-only batch speedup: 1.16x -> 1.54x (gemma4-26b)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix batch position state for models with cached rope (Qwen3.5)

- Reset _position_ids and _rope_deltas in _process_prompts before each
  new batch prefill to prevent stale position state from previous batches
- Clamp negative BatchKVCache offsets to 0 in Qwen3.5 language model

Qwen3.5 stores position state on the model instance which breaks when
BatchGenerator prefills multiple batches sequentially. The stale cached
positions from batch N are incorrectly used for batch N+1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rewrite ResponseGenerator: single GPU thread with tight batch loop

Replace the old thread/queue approach with a cleaner design:
- Single dedicated thread owns all GPU work via BatchGenerator
- FastAPI handlers submit preprocessed requests to a queue
- GPU thread runs next() in a tight time-budgeted loop (0.5s bursts)
- Concurrent requests are batched together automatically
- Image requests drain pending text-only first, then prefill inline

Results on gemma-4-26b (5 prompts, max_tokens=100):
  Sequential: 3.56s (72 tok/s) — 2.3x faster than mlx-lm
  Batch:      2.27s (112 tok/s) — 1.46x faster than mlx-lm

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Optimize gemma4 attention mask creation: deduplicate by layer type

Replace per-call dual create_attention_mask with _make_masks() that
creates one mask per unique layer type (full_attention, sliding_attention)
and reuses it across layers. Also streamline the layer loop to use
zip iteration instead of index lookups.

Closes the 1.21x per-step gap with mlx-lm (89 vs 88 tok/s, batch of 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Metal crash: move all preprocessing to GPU thread

All preprocessing (prepare_inputs + get_input_embeddings) now runs on the
single GPU thread. Callers only put raw data (prompt, image paths, args)
on the queue. This prevents concurrent Metal access when multiple
FastAPI requests arrive simultaneously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add enable_thinking flag to server + move CPU preprocessing to caller thread

- Pass enable_thinking from request to apply_chat_template on both
  /responses and /chat/completions endpoints
- Move prepare_inputs (CPU: tokenize, load images) to caller thread,
  keep only get_input_embeddings (GPU) on the GPU thread — reduces
  GPU thread blocking during batch generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Default enable_thinking to true on server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix event loop blocking: use asyncio.to_thread for Queue operations

The async endpoints were blocking the event loop on Queue.get(),
preventing concurrent request processing. Wrap blocking generate()
and token iteration in asyncio.to_thread so the event loop stays
free to accept new requests.

Before: B=4 batch 65 tok/s (sequential execution)
After:  B=4 batch 120 tok/s (true concurrent batching)
At B=8: matches mlx-lm (114 vs 114 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Match OpenAI response format: split reasoning/content, add metadata

- Separate thinking tokens into `reasoning` field, clean answer in `content`
  (handles <|channel>thought...<channel|> and <think>...</think>)
- Add `id`, `object`, `created` to response/chunk models
- Use standard `prompt_tokens`/`completion_tokens` field names
- Stream: tag-aware state machine routes tokens to reasoning vs content
- Add `data: [DONE]` SSE terminator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Split thinking tags in /responses endpoint too

Apply _split_thinking to both non-streaming and streaming /responses
output, so output_text and content contain clean answers and reasoning
is in a separate field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add missing OpenAI fields: index, prompt_tokens_details

- Add index to ChatChoice and ChatStreamChoice
- Add prompt_tokens_details with cached_tokens to UsageStats

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix completion_tokens count to match mlx-lm

Count raw generated tokens minus thinking tag tokens instead of
re-encoding text (which loses boundary tokens). Now matches
mlx-lm exactly: prompt_tokens=23, completion_tokens=40, total=63.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Format with black and isort

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix multi-turn tool calling: preserve metadata, parse tool calls, clean tags

Four bugs fixed:

1. prompt_utils: Tool messages (tool_calls, tool_call_id, role="tool")
   now pass through to the Jinja template instead of being stripped
   by _get_role_content/get_message_json.

2. server ChatMessage: Accept role="tool", add tool_calls, tool_call_id,
   name fields. Server message processing preserves these fields.

3. server: Add process_tool_calls + tool parser detection to
   /chat/completions. Model output is parsed for <|tool_call> tags,
   returned as structured tool_calls with finish_reason="tool_calls".

4. tool_parsers/gemma4: Return arguments as JSON string (OpenAI spec),
   not dict, preventing double-encoding on round-trip.

Also fixes thinking tag leaks:
- _split_thinking handles partial "thought\n...<channel|>" continuation
- Tool call turns return content=None (no leaked <|tool_call> tags)

Tested: 5-tool agentic workflow (weather → forecast → search → book →
email) completing in 4 turns with correct tool selection and
back-references across turns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix BatchGenerator stop: use tokenizer.stopping_criteria + normalize tool args

- BatchGenerator._next() now checks tokenizer.stopping_criteria(t) in
  addition to self.stop_tokens. Fixes models like Qwen3.5 where
  config.eos_token_id is None but the tokenizer has <|im_end|> as EOS.
- Normalize tool_calls arguments from JSON string to dict before passing
  to Jinja templates (Qwen3.5 template iterates arguments with |items).
- Handle partial </think> tag (no opening <think>) in _split_thinking.
- Strip model control tokens from tool-call remaining text.

Before: Qwen3.5 generated past <|im_end|>, hallucinated fake turns.
After: Clean 3-turn tool calling matching mlx-lm behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore VisionFeatureCache to server

Re-add VisionFeatureCache that was lost during rebase:
- Create in get_cached_model, clear in unload_model_sync
- Pass vision_cache to all fallback stream_generate/generate calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add vision caching to ResponseGenerator

- Updated ResponseGenerator to include vision_cache for improved efficiency.
- Modified __init__ method to accept vision_cache parameter.
- Enhanced _gpu_embed method to utilize cached image features, reducing redundant processing for repeated images.
- Adjusted request handling to accommodate images in batch processing.

This change optimizes the handling of image inputs, leveraging caching to enhance performance.

* Add vision feature caching to ResponseGenerator + --vision-cache-size CLI

- VisionFeatureCache integrated into ResponseGenerator._gpu_embed:
  cache hit skips vision encoder (229ms → 1ms, 1GB memory saved)
- Shared VisionFeatureCache instance between ResponseGenerator and
  fallback paths
- --vision-cache-size CLI flag (default: 20 images, ~30MB max)
- LRU eviction, cleared on model unload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document server options and vision cache in README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Nest Continuous Batching under Server in README table of contents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update README.md to enhance structure and clarity

- Changed section headers for Multi-Image Chat Support and Video Understanding to improve readability.
- Nested Usage Examples under their respective sections for better organization.
- Added detailed command line examples for both Multi-Image Chat Support and Video Understanding features.

* Add back normalize_resize_shape lost during rebase

Fixes test_generate.py import error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix tests: update for Batch fields, tool parser JSON string, skip main-only tests

- test_gemma4_tool_parser: arguments is now JSON string per OpenAI spec
- test_generate: Batch needs samplers/logits_processors/tokens fields,
  logprobs is List[mx.array] not mx.array
- Skip tests for main-only features not in this branch:
  - TestSamplerArgs (make_sampler API)
  - CLI enable_thinking/thinking_start_token
  - Server schema (enable_thinking, resize_shape single int)
  - TurboQuant kv_quant_scheme parameter

387 passed, 10 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore generate.py and tests from main, re-apply position state reset

Replace generate.py with main's version (make_sampler, TurboQuant,
enable_thinking, PromptCacheState, vision_cache) plus our one addition:
_process_prompts resets _position_ids/_rope_deltas for Qwen3.5 batch compat.

Restore test files from main. 3 server tests fail due to our server
having a different schema (continuous batching rewrite) — not regressions.

393 passed, 3 failed (server schema), 1 skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Vision cache via kwargs, fix add_eos_token_ids, always produce embeddings

- Pass vision_cache + _image_key as kwargs to get_input_embeddings so
  models can cache/reuse vision features internally (gemma4: 228x,
  Qwen3.5: 23x speedup on cache hit)
- Always call get_input_embeddings (even text-only) since main's
  BatchGenerator._process_prompts requires inputs_embeds
- Fix add_eos_token_ids to handle int token IDs (not just strings)
  — fixes crash when config.eos_token_id provides ints
- Add vision_cache support to gemma4 and qwen3_5 get_input_embeddings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add generation arguments and thinking mode support

- Introduced new fields in GenerationArguments and OpenAIRequest for top_k, min_p, repetition_penalty, logit_bias, enable_thinking, thinking_budget, and thinking_start_token.
- Added methods to convert GenerationArguments to keyword arguments for generation and template functions.
- Implemented a helper function to build GenerationArguments from request data.
- Updated resize_shape handling with a field validator to normalize input.

These changes enhance the model's generation capabilities and support advanced features for thinking mode.

* Refactor response generation arguments handling

- Removed redundant instantiation of GenerationArguments in multiple locations within the responses and chat completions endpoints.
- Replaced instances of `args` with `gen_args` to streamline the argument passing to the response generator.
- This change simplifies the code and improves maintainability by reducing duplication.

* Add generation arguments and thinking mode support

- Refactor GenerationArguments with full sampling params (top_k, min_p,
  repetition_penalty, logit_bias, enable_thinking, thinking_budget,
  thinking_start_token) + to_generate_kwargs()/to_template_kwargs()
- Add _build_gen_args() helper, remove all inline kwargs duplication
- Add ResizeShapeInput type + field_validator for resize_shape
- Forward all sampling/thinking params to generate() fallback paths
- Fix RotatingKVCache.extend crash: always use batch-aware caches in
  _process_prompts (removes single-prompt standard cache optimization
  that breaks when continuous batching extends the batch later)
- Add vision_cache support to all 44 model get_input_embeddings
- Fix add_eos_token_ids to handle int token IDs
- Fix phi3_v vision cache patch ordering

396 passed, 1 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add tests for continuous batching server components

Test GenerationArguments, _build_gen_args, _split_thinking,
ChatMessage tool-calling schema, process_tool_calls, and
_count_thinking_tag_tokens.

27 server tests pass (18 new + 9 existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add /v1/ route aliases for OpenAI SDK compatibility

/v1/responses, /v1/chat/completions, /v1/models now mirror
the base routes (hidden from schema docs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update DEFAULT_MAX_TOKENS in generate.py to 2048 for enhanced generation capabilities

* Parse tool calls in streaming path for OpenAI SDK compatibility

Accumulate full output during streaming and parse tool calls at the end.
If tool calls are detected, emit a final SSE chunk with structured
tool_calls[] and finish_reason="tool_calls" — matching what OpenAI SDKs
(and tools like Pi) expect for agentic workflows.
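
As a rough illustration of the final chunk described above, the sketch below builds one SSE line carrying structured tool calls. The helper name `final_tool_call_chunk`, the chunk id, and the model name are hypothetical; only the `tool_calls[]` and `finish_reason="tool_calls"` fields come from the commit message, and the rest follows the general OpenAI chunk layout.

```python
import json


def final_tool_call_chunk(tool_calls, chunk_id, model):
    """Build the last SSE line of a stream that ended in tool calls.

    Hypothetical helper: the server's real function names are not shown here.
    """
    payload = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [
            {
                "index": 0,
                # Structured calls go in the delta, not in the text content.
                "delta": {"tool_calls": tool_calls},
                "finish_reason": "tool_calls",
            }
        ],
    }
    # SSE frames are "data: <json>" followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


example_calls = [
    {
        "index": 0,
        "id": "call_0",
        "type": "function",
        # Arguments are a JSON string, as the OpenAI spec expects.
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }
]
example_chunk = final_tool_call_chunk(example_calls, "chatcmpl-demo", "demo-model")
```

A client would still expect the `data: [DONE]` terminator after this chunk.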

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore lost README sections from main, add continuous batching

Restore: Thinking Budget, model docs table (DOTS-OCR, Gemma 4, etc.),
Vision Feature Caching, TurboQuant KV Cache, Activation Quantization,
/v1/ route docs, server preload docs.

Add: Continuous Batching section under Server, --vision-cache-size option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore server features from main: lifespan, CORS, KV quant, streaming headers

- Replace @app.on_event("startup") with asynccontextmanager lifespan
- Add CORSMiddleware (allow all origins)
- Restore KV quant env helpers (get_prefill_step_size, get_quantized_kv_bits, etc.)
- Add --kv-bits, --kv-quant-scheme, --kv-group-size, --max-kv-size,
  --quantized-kv-start, --prefill-step-size, --reload CLI args
- Add Cache-Control/Connection/X-Accel-Buffering headers to StreamingResponse
- Import DEFAULT_KV_*, DEFAULT_THINKING_*, DEFAULT_PREFILL_STEP_SIZE
- Remove unused chat_messages variable
- reload=False by default (was True)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix streaming finish_reason: propagate token.finish_reason to SSE chunks

The streaming /chat/completions path was emitting finish_reason=null
on all chunks. Now propagates the actual finish_reason (stop/length)
from the ResponseGenerator's StreamingToken to the SSE chunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy added a commit that referenced this pull request Apr 17, 2026
Keep speculative decoding drafter loading, CLI args, and draft_model
wiring in server.py. Keep is_batch_offset logic in language.py as
superset of main's batch offset handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy added a commit that referenced this pull request Apr 20, 2026
* add dtree + dflash

* Refactor Qwen3.5 model to remove capture_layer_ids and streamline forward pass. Update DFlash imports and enhance parity check for lightweight smoke testing. Remove unused cache snapshot functionality and simplify speculative decoding CLI. Clean up tree verification module and optimize DFlash loop for better performance.

* Refactor CLI argument handling in speculative module for improved readability and consistency. Simplify token processing in main function by renaming variables for clarity. Remove redundant code and enhance overall structure in dflash.py and cli.py.

* Enhance dflash_generate function in dflash_loop.py for improved clarity and performance. Update docstring for accepted_in_round explanation and add a new variable L to clarify drafted slots per round. Refactor comments for better understanding of draft cache trimming and token emission process.

* refactor dflash loop

* Update dependencies in requirements.txt and uv.lock for improved compatibility and performance. Bump versions for mlx, transformers, and mlx-lm, and update hf-xet to version 1.4.3 with new source and wheel links.

* add support for drafter loading

* Refactor DFlash drafter implementation by consolidating loading logic and removing deprecated components. Introduce a new DFlashConfig class for configuration management and streamline the load_drafter function to utilize shared loading utilities. Add a parity check for basic functionality verification. Remove the old qwen3_5_dflash module and its associated files.

* refactor dflash generate

* Introduce draft_kind parameter to generate_step for flexible speculative decoding. Update logic to handle different draft types, starting with 'dflash', and raise an error for unsupported types.

* Refactor speculative decoding logic in generate_step and _dflash_rounds functions. Simplify docstrings and streamline yield statements for clarity and consistency.

* Add speculative walk function and refactor DFlashDraftModel acceptance tracking. Update _dflash_rounds to utilize new walk logic and streamline cache management. Enhance generate_step for clarity in acceptance reporting.

* Update docstring in LanguageModel to clarify the handling of gated delta states during cache restoration.

* Refactor rotary embedding and attention mechanisms in Qwen3_5 model. Update apply_multimodal_rotary_pos_emb to ensure dtype consistency and introduce _precise_swiglu for improved gate handling in RMSNormGated.

* Add batch processing for speculative walk and DFlash rounds in generate.py. Implement rollback_speculative_cache_batch in LanguageModel for batch cache management. Update DFlashDraftModel to support batch bonuses in draft_block method. Enhance generate_step to handle batch decoding logic.

* Add continuous batching to server (#1027)

* Add continuous batching support

- Introduced a global ResponseGenerator for managing continuous batching in the server.
- Updated Batch and BatchGenerator classes to handle per-sequence samplers and logits processors, improving flexibility in prompt handling.
- Enhanced LanguageModel to support per-sequence cache offsets for better performance during batched generation.
- Refactored insert and remove methods in BatchGenerator to accommodate new features and improve resource management.
- Added detailed docstrings for clarity on new functionalities and usage.

* Enhance memory tracking and preprocessing in BatchGenerator and ResponseGenerator

- Added peak memory tracking to Response and StreamingToken classes for better resource management.
- Updated BatchGenerator to include peak memory usage during response generation.
- Introduced preprocessing of images and audio before queuing requests to optimize performance and reduce blocking.
- Refactored input handling to utilize preprocessed inputs, improving efficiency in the generation process.
- Enhanced comments and documentation for clarity on new functionalities.

* Add continuous batching

* Add back PromptCacheState removed during rebase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix continuous batching image prefill: drain before insert, skip unnecessary chunking

- Drain pending text-only prompts BEFORE inserting image request (not after),
  so image prompts aren't prefilled without their embeddings
- Only chunk prefill when prompt exceeds prefill_step_size (matching main's behavior)
- Pass mask to get_input_embeddings for correct attention_mask_4d generation
- Remove stale pixel_values from gen_kwargs passed to language model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move ResponseGenerator from generate.py to server.py

ResponseGenerator is server-specific infrastructure (threaded queue,
request dispatching) — it belongs with the server, not the generation
library. BatchGenerator stays in generate.py for offline batch use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add continuous batching section to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add server start command to continuous batching section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --model flag to pre-load model at server startup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Optimize text-only batching: drain queue before next() for fused prefill

Collect all pending requests from the queue before calling next(),
so multiple text-only requests get inserted and prefilled together
in a single batch call instead of one-at-a-time.

Text-only batch speedup: 1.16x -> 1.54x (gemma4-26b)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix batch position state for models with cached rope (Qwen3.5)

- Reset _position_ids and _rope_deltas in _process_prompts before each
  new batch prefill to prevent stale position state from previous batches
- Clamp negative BatchKVCache offsets to 0 in Qwen3.5 language model

Qwen3.5 stores position state on the model instance which breaks when
BatchGenerator prefills multiple batches sequentially. The stale cached
positions from batch N are incorrectly used for batch N+1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rewrite ResponseGenerator: single GPU thread with tight batch loop

Replace the old thread/queue approach with a cleaner design:
- Single dedicated thread owns all GPU work via BatchGenerator
- FastAPI handlers submit preprocessed requests to a queue
- GPU thread runs next() in a tight time-budgeted loop (0.5s bursts)
- Concurrent requests are batched together automatically
- Image requests drain pending text-only first, then prefill inline

Results on gemma-4-26b (5 prompts, max_tokens=100):
  Sequential: 3.56s (72 tok/s) — 2.3x faster than mlx-lm
  Batch:      2.27s (112 tok/s) — 1.46x faster than mlx-lm
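
The loop described above can be sketched roughly as follows. This is a simplified stand-in, not the server's code: `run_burst`, `fake_step`, and the request dicts are hypothetical, and `fake_step` stands in for one BatchGenerator `next()` call. Only the drain-then-step structure and the 0.5s budget come from the commit message.

```python
import queue
import time


def run_burst(step, pending, active, budget_s=0.5):
    """Admit every queued request, then step the batch until the budget expires.

    Returns the number of decode steps taken in this burst.
    """
    # Drain the queue first so new requests join the batch together.
    while True:
        try:
            active.append(pending.get_nowait())
        except queue.Empty:
            break

    steps = 0
    deadline = time.monotonic() + budget_s
    # Stop when the batch is empty or the time budget is spent, so the
    # caller can come back and admit requests that arrived in the meantime.
    while active and time.monotonic() < deadline:
        step(active)
        steps += 1
    return steps


def fake_step(batch):
    """Hypothetical decode step: one token per request, drop finished ones."""
    for request in batch:
        request["remaining"] -= 1
    batch[:] = [r for r in batch if r["remaining"] > 0]


demo_pending = queue.Queue()
demo_pending.put({"id": 1, "remaining": 2})
demo_pending.put({"id": 2, "remaining": 3})
demo_active = []
demo_steps = run_burst(fake_step, demo_pending, demo_active)
```

Keeping the burst short is a latency trade-off: a longer budget means fewer queue checks, but new requests wait longer before joining the batch.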

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Optimize gemma4 attention mask creation: deduplicate by layer type

Replace per-call dual create_attention_mask with _make_masks() that
creates one mask per unique layer type (full_attention, sliding_attention)
and reuses it across layers. Also streamline the layer loop to use
zip iteration instead of index lookups.

Closes the 1.21x per-step gap with mlx-lm (89 vs 88 tok/s, batch of 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Metal crash: move all preprocessing to GPU thread

All preprocessing (prepare_inputs + get_input_embeddings) now runs on the
single GPU thread. Callers only put raw data (prompt, image paths, args)
on the queue. This prevents concurrent Metal access when multiple
FastAPI requests arrive simultaneously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add enable_thinking flag to server + move CPU preprocessing to caller thread

- Pass enable_thinking from request to apply_chat_template on both
  /responses and /chat/completions endpoints
- Move prepare_inputs (CPU: tokenize, load images) to caller thread,
  keep only get_input_embeddings (GPU) on the GPU thread — reduces
  GPU thread blocking during batch generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Default enable_thinking to true on server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix event loop blocking: use asyncio.to_thread for Queue operations

The async endpoints were blocking the event loop on Queue.get(),
preventing concurrent request processing. Wrap blocking generate()
and token iteration in asyncio.to_thread so the event loop stays
free to accept new requests.

Before: B=4 batch 65 tok/s (sequential execution)
After:  B=4 batch 120 tok/s (true concurrent batching)
At B=8: matches mlx-lm (114 vs 114 tok/s)
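
A minimal sketch of the pattern, assuming a worker thread that pushes tokens onto a `queue.Queue` and uses `None` as an end marker (both are assumptions for illustration; the names below are hypothetical). The blocking `Queue.get` runs in a worker thread via `asyncio.to_thread`, so several streams can make progress on one event loop.

```python
import asyncio
import queue
import threading
import time


def blocking_worker(out, n):
    """Stand-in for the generation thread: emits n tokens, then a sentinel."""
    for i in range(n):
        time.sleep(0.01)
        out.put(f"tok{i}")
    out.put(None)  # sentinel marking the end of the stream


async def stream_tokens(out):
    """Yield tokens without blocking the event loop on Queue.get()."""
    while True:
        # Queue.get blocks, so it runs in a thread and is awaited here.
        item = await asyncio.to_thread(out.get)
        if item is None:
            break
        yield item


async def collect(n):
    q = queue.Queue()
    threading.Thread(target=blocking_worker, args=(q, n), daemon=True).start()
    return [tok async for tok in stream_tokens(q)]


async def demo():
    # Two "requests" are served concurrently on the same event loop.
    return await asyncio.gather(collect(3), collect(2))


demo_result = asyncio.run(demo())
```

Calling `out.get()` directly inside the coroutine would stall every other request until a token arrived, which matches the sequential behavior described above.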

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Match OpenAI response format: split reasoning/content, add metadata

- Separate thinking tokens into `reasoning` field, clean answer in `content`
  (handles <|channel>thought...<channel|> and <think>...</think>)
- Add `id`, `object`, `created` to response/chunk models
- Use standard `prompt_tokens`/`completion_tokens` field names
- Stream: tag-aware state machine routes tokens to reasoning vs content
- Add `data: [DONE]` SSE terminator
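
For the non-streaming case, the reasoning/content split might look like the sketch below. `split_thinking` is a hypothetical helper written for illustration (the server's `_split_thinking` is not shown on this page), and it only covers the `<think>...</think>` form by default.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, content); reasoning is None when no closing tag exists."""
    if close_tag not in text:
        # No thinking block: everything is the answer.
        return None, text.strip()
    head, _, tail = text.partition(close_tag)
    if open_tag in head:
        # Keep only what follows the opening tag; any text before it is dropped.
        head = head.split(open_tag, 1)[1]
    return head.strip(), tail.strip()


full_split = split_thinking("<think>plan the answer</think>42")
partial_split = split_thinking("plan the answer</think>42")
plain_split = split_thinking("just an answer")
```

The streaming path needs more care than this, because a tag can arrive split across several tokens.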

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Split thinking tags in /responses endpoint too

Apply _split_thinking to both non-streaming and streaming /responses
output, so output_text and content contain clean answers and reasoning
is in a separate field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add missing OpenAI fields: index, prompt_tokens_details

- Add index to ChatChoice and ChatStreamChoice
- Add prompt_tokens_details with cached_tokens to UsageStats

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix completion_tokens count to match mlx-lm

Count raw generated tokens minus thinking tag tokens instead of
re-encoding text (which loses boundary tokens). Now matches
mlx-lm exactly: prompt_tokens=23, completion_tokens=40, total=63.
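
The counting rule can be sketched as below, assuming each thinking tag maps to a single token id (an assumption; real tokenizers may split tags differently). `usage_stats` and the token ids are hypothetical.

```python
def usage_stats(prompt_ids, generated_ids, tag_token_ids):
    """Count completion tokens as raw generated tokens minus thinking-tag tokens."""
    tag_count = sum(1 for token in generated_ids if token in tag_token_ids)
    completion = len(generated_ids) - tag_count
    return {
        "prompt_tokens": len(prompt_ids),
        "completion_tokens": completion,
        "total_tokens": len(prompt_ids) + completion,
    }


# Hypothetical ids: 900 and 901 stand for the opening and closing tags.
demo_prompt = list(range(23))
demo_generated = [900] + list(range(100, 110)) + [901] + list(range(200, 230))
demo_usage = usage_stats(demo_prompt, demo_generated, {900, 901})
```

Counting ids directly avoids decoding and re-encoding the text, which is where the boundary tokens mentioned above were lost.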

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Format with black and isort

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix multi-turn tool calling: preserve metadata, parse tool calls, clean tags

Four bugs fixed:

1. prompt_utils: Tool messages (tool_calls, tool_call_id, role="tool")
   now pass through to the Jinja template instead of being stripped
   by _get_role_content/get_message_json.

2. server ChatMessage: Accept role="tool", add tool_calls, tool_call_id,
   name fields. Server message processing preserves these fields.

3. server: Add process_tool_calls + tool parser detection to
   /chat/completions. Model output is parsed for <|tool_call> tags,
   returned as structured tool_calls with finish_reason="tool_calls".

4. tool_parsers/gemma4: Return arguments as JSON string (OpenAI spec),
   not dict, preventing double-encoding on round-trip.

Also fixes thinking tag leaks:
- _split_thinking handles partial "thought\n...<channel|>" continuation
- Tool call turns return content=None (no leaked <|tool_call> tags)

Tested: 5-tool agentic workflow (weather → forecast → search → book →
email) completing in 4 turns with correct tool selection and
back-references across turns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix BatchGenerator stop: use tokenizer.stopping_criteria + normalize tool args

- BatchGenerator._next() now checks tokenizer.stopping_criteria(t) in
  addition to self.stop_tokens. Fixes models like Qwen3.5 where
  config.eos_token_id is None but the tokenizer has <|im_end|> as EOS.
- Normalize tool_calls arguments from JSON string to dict before passing
  to Jinja templates (Qwen3.5 template iterates arguments with |items).
- Handle partial </think> tag (no opening <think>) in _split_thinking.
- Strip model control tokens from tool-call remaining text.

Before: Qwen3.5 generated past <|im_end|>, hallucinated fake turns.
After: Clean 3-turn tool calling matching mlx-lm behavior.
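
The argument normalization can be sketched as a pair of small converters. Both function names are hypothetical; the only behavior taken from the commit message is that templates receive a dict while OpenAI-style responses carry a JSON string.

```python
import json


def arguments_to_dict(arguments):
    """Turn a JSON-string arguments field into a dict for Jinja templates."""
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            return arguments  # leave malformed input untouched
        # Only accept objects; a bare number or list stays as the original string.
        return parsed if isinstance(parsed, dict) else arguments
    return arguments


def arguments_to_json(arguments):
    """Turn a dict arguments field into the JSON string the OpenAI spec expects."""
    return arguments if isinstance(arguments, str) else json.dumps(arguments)


as_dict = arguments_to_dict('{"city": "Paris"}')
as_json = arguments_to_json({"city": "Paris"})
```

Because each converter passes through values that are already in the target form, applying it twice does not double-encode the arguments.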

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore VisionFeatureCache to server

Re-add VisionFeatureCache that was lost during rebase:
- Create in get_cached_model, clear in unload_model_sync
- Pass vision_cache to all fallback stream_generate/generate calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add vision caching to ResponseGenerator

- Updated ResponseGenerator to include vision_cache for improved efficiency.
- Modified __init__ method to accept vision_cache parameter.
- Enhanced _gpu_embed method to utilize cached image features, reducing redundant processing for repeated images.
- Adjusted request handling to accommodate images in batch processing.

This change optimizes the handling of image inputs, leveraging caching to enhance performance.

* Add vision feature caching to ResponseGenerator + --vision-cache-size CLI

- VisionFeatureCache integrated into ResponseGenerator._gpu_embed:
  cache hit skips vision encoder (229ms → 1ms, 1GB memory saved)
- Shared VisionFeatureCache instance between ResponseGenerator and
  fallback paths
- --vision-cache-size CLI flag (default: 20 images, ~30MB max)
- LRU eviction, cleared on model unload
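
The LRU behavior listed above can be illustrated with a small cache built on `collections.OrderedDict`. This is a generic sketch, not the actual `VisionFeatureCache`: the class name, method names, and string values are hypothetical, and real entries would hold vision features keyed by image.

```python
from collections import OrderedDict


class LRUFeatureCache:
    """Tiny least-recently-used cache keyed by an image identifier."""

    def __init__(self, max_size=20):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # A hit marks the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict from the least recently used end until within the limit.
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)

    def clear(self):
        """Drop everything, e.g. when the model is unloaded."""
        self._items.clear()

    def __len__(self):
        return len(self._items)


demo_cache = LRUFeatureCache(max_size=2)
demo_cache.put("a.png", "features-a")
demo_cache.put("b.png", "features-b")
demo_cache.get("a.png")  # touch a.png so b.png becomes the oldest entry
demo_cache.put("c.png", "features-c")
```

Bounding the cache by entry count keeps memory use predictable when many distinct images pass through the server.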

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document server options and vision cache in README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Nest Continuous Batching under Server in README table of contents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update README.md to enhance structure and clarity

- Changed section headers for Multi-Image Chat Support and Video Understanding to improve readability.
- Nested Usage Examples under their respective sections for better organization.
- Added detailed command line examples for both Multi-Image Chat Support and Video Understanding features.

* Add back normalize_resize_shape lost during rebase

Fixes test_generate.py import error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix tests: update for Batch fields, tool parser JSON string, skip main-only tests

- test_gemma4_tool_parser: arguments is now JSON string per OpenAI spec
- test_generate: Batch needs samplers/logits_processors/tokens fields,
  logprobs is List[mx.array] not mx.array
- Skip tests for main-only features not in this branch:
  - TestSamplerArgs (make_sampler API)
  - CLI enable_thinking/thinking_start_token
  - Server schema (enable_thinking, resize_shape single int)
  - TurboQuant kv_quant_scheme parameter

387 passed, 10 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore generate.py and tests from main, re-apply position state reset

Replace generate.py with main's version (make_sampler, TurboQuant,
enable_thinking, PromptCacheState, vision_cache) plus our one addition:
_process_prompts resets _position_ids/_rope_deltas for Qwen3.5 batch compat.

Restore test files from main. 3 server tests fail due to our server
having a different schema (continuous batching rewrite) — not regressions.

393 passed, 3 failed (server schema), 1 skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Vision cache via kwargs, fix add_eos_token_ids, always produce embeddings

- Pass vision_cache + _image_key as kwargs to get_input_embeddings so
  models can cache/reuse vision features internally (gemma4: 228x,
  Qwen3.5: 23x speedup on cache hit)
- Always call get_input_embeddings (even text-only) since main's
  BatchGenerator._process_prompts requires inputs_embeds
- Fix add_eos_token_ids to handle int token IDs (not just strings)
  — fixes crash when config.eos_token_id provides ints
- Add vision_cache support to gemma4 and qwen3_5 get_input_embeddings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add generation arguments and thinking mode support

- Introduced new fields in GenerationArguments and OpenAIRequest for top_k, min_p, repetition_penalty, logit_bias, enable_thinking, thinking_budget, and thinking_start_token.
- Added methods to convert GenerationArguments to keyword arguments for generation and template functions.
- Implemented a helper function to build GenerationArguments from request data.
- Updated resize_shape handling with a field validator to normalize input.

These changes enhance the model's generation capabilities and support advanced features for thinking mode.

* Refactor response generation arguments handling

- Removed redundant instantiation of GenerationArguments in multiple locations within the responses and chat completions endpoints.
- Replaced instances of `args` with `gen_args` to streamline the argument passing to the response generator.
- This change simplifies the code and improves maintainability by reducing duplication.

* Add generation arguments and thinking mode support

- Refactor GenerationArguments with full sampling params (top_k, min_p,
  repetition_penalty, logit_bias, enable_thinking, thinking_budget,
  thinking_start_token) + to_generate_kwargs()/to_template_kwargs()
- Add _build_gen_args() helper, remove all inline kwargs duplication
- Add ResizeShapeInput type + field_validator for resize_shape
- Forward all sampling/thinking params to generate() fallback paths
- Fix RotatingKVCache.extend crash: always use batch-aware caches in
  _process_prompts (removes single-prompt standard cache optimization
  that breaks when continuous batching extends the batch later)
- Add vision_cache support to all 44 model get_input_embeddings
- Fix add_eos_token_ids to handle int token IDs
- Fix phi3_v vision cache patch ordering

396 passed, 1 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add tests for continuous batching server components

Test GenerationArguments, _build_gen_args, _split_thinking,
ChatMessage tool-calling schema, process_tool_calls, and
_count_thinking_tag_tokens.

27 server tests pass (18 new + 9 existing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add /v1/ route aliases for OpenAI SDK compatibility

/v1/responses, /v1/chat/completions, /v1/models now mirror
the base routes (hidden from schema docs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update DEFAULT_MAX_TOKENS in generate.py to 2048 for enhanced generation capabilities

* Parse tool calls in streaming path for OpenAI SDK compatibility

Accumulate full output during streaming and parse tool calls at the end.
If tool calls are detected, emit a final SSE chunk with structured
tool_calls[] and finish_reason="tool_calls" — matching what OpenAI SDKs
(and tools like Pi) expect for agentic workflows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore lost README sections from main, add continuous batching

Restore: Thinking Budget, model docs table (DOTS-OCR, Gemma 4, etc.),
Vision Feature Caching, TurboQuant KV Cache, Activation Quantization,
/v1/ route docs, server preload docs.

Add: Continuous Batching section under Server, --vision-cache-size option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore server features from main: lifespan, CORS, KV quant, streaming headers

- Replace @app.on_event("startup") with asynccontextmanager lifespan
- Add CORSMiddleware (allow all origins)
- Restore KV quant env helpers (get_prefill_step_size, get_quantized_kv_bits, etc.)
- Add --kv-bits, --kv-quant-scheme, --kv-group-size, --max-kv-size,
  --quantized-kv-start, --prefill-step-size, --reload CLI args
- Add Cache-Control/Connection/X-Accel-Buffering headers to StreamingResponse
- Import DEFAULT_KV_*, DEFAULT_THINKING_*, DEFAULT_PREFILL_STEP_SIZE
- Remove unused chat_messages variable
- reload=False by default (was True)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix streaming finish_reason: propagate token.finish_reason to SSE chunks

The streaming /chat/completions path was emitting finish_reason=null
on all chunks. Now propagates the actual finish_reason (stop/length)
from the ResponseGenerator's StreamingToken to the SSE chunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add batch speculative decoding with automatic sequence filtering and docs

- _dflash_rounds_batch: continuous batch support — finished sequences
  are filtered from target caches (via BatchKVCache.filter) and the
  drafter cold-restarts for the new batch size. stop_check callback
  for per-sequence EOS detection. active_idx mapping keeps stable
  original indices across batch changes.
- _speculative_walk_batch: per-sequence acceptance walk for B > 1.
- generate_step: B > 1 dispatch to batch path when draft_model is set.
- docs/usage.md: speculative decoding section with CLI, single-sequence,
  and batch generate examples.

Verified: B=1 regression (57 tok/s, 4.1 accept), B=4 batch (74.5 tok/s,
2.72x vs sequential AR), continuous filtering (short sequences exit
early, remaining continue correctly).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add DFlash speculative decoding to server + update README

Server (mlx_vlm/server.py):
- --draft-model CLI arg loads a speculative drafter at startup
- ResponseGenerator._run_speculative: dedicated GPU thread loop for
  DFlash batch speculative decoding. Collects pending requests,
  batch-prefills with capture_layer_ids, runs _dflash_rounds_batch,
  dispatches tokens to per-request queues. Finished sequences are
  handled by stop_check callback.
- Acceptance metrics logged per batch: [DFlash] batch=N tokens=M
  accept=X rounds=Y

README.md:
- Speculative Decoding (DFlash) section with CLI examples for text,
  image, and server usage
- --draft-model added to Server Options table

Benchmarks (Qwen3.5-4B + DFlash drafter, same prompt, 200 tokens):
  Server AR:     1 req 24.9 tok/s | 4 req 49.5 tok/s
  Server DFlash: 1 req 44.9 tok/s | 4 req 85.1 tok/s (1.7-1.8x)
  Acceptance: ~4.0 tokens/round

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Switch batch rollback from min-trim to max-trim with stale KV zeroing

rollback_speculative_cache_batch now trims to max(accepted) instead of
min(accepted). Stale KV entries for shorter-accepted sequences are
zeroed so attention assigns near-zero weight to them. GDN replay uses
a per-sequence mask. Each sequence emits its full accepted+1 tokens
per round instead of being capped to the global minimum.

B=8 throughput: 52.6 → 73.7 tok/s (+40%)
B=8 acceptance: 0.44 → 1.22 (+177%)
No collapse at high batch sizes — steady scaling from B=1 to B=8.
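
The difference between the two trimming policies can be sketched with plain lists. This is only a bookkeeping illustration under stated assumptions: `plan_rollback` is a hypothetical helper, `block_len` is taken to be the number of speculative slots written per sequence in a round, and the real method operates on KV cache tensors rather than Python lists.

```python
def plan_rollback(block_len, accepted):
    """Plan a max-trim rollback for one speculative round.

    Returns (slots to trim, per-sequence stale masks, tokens emitted per sequence).
    """
    keep = max(accepted)      # shared cache length kept for the whole batch
    trim = block_len - keep   # slots dropped from every sequence
    # Slots past a sequence's own accepted count are stale and get zeroed.
    stale = [[pos >= a for pos in range(keep)] for a in accepted]
    emitted = [a + 1 for a in accepted]  # accepted tokens plus one bonus token
    return trim, stale, emitted


demo_trim, demo_stale, demo_emitted = plan_rollback(4, [1, 3])
# Under min-trim, every sequence would be capped at min(accepted) + 1 tokens.
min_trim_emitted = [min([1, 3]) + 1] * 2
```

In this toy round, max-trim emits 6 tokens across the batch where min-trim would emit 4, which is the effect the throughput numbers above reflect.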

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: apply black, isort, autoflake formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add draft_model attrs to test_generate_cli_smoke Namespace

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: seed RNG in test_turboquant_prod_is_nearly_unbiased_across_seeds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --top-logprobs-k argument to CLI for configurable top-K log probabilities

This commit introduces a new command-line argument, --top-logprobs-k, to allow users to set the cap for per-token top-K log probabilities directly via the CLI. The implementation mirrors the existing TOP_LOGPROBS_K environment variable functionality, enhancing usability. The README has been updated to reflect this addition.

* Revert redundant is_batch_offset mRoPE path in Qwen3.5 language model

Main's existing cache_offsets handling covers the batch case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up drafter __init__.py: remove extra comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up qwen3_dflash: remove excessive comments and docstrings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Merge rollback_speculative_cache and rollback_speculative_cache_batch into one method

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: apply black formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mdkirin pushed a commit to mdkirin/mlx-seori that referenced this pull request Apr 26, 2026
…MoE pos fix

Absorb upstream/main (4-19 to 4-25 batch). All of the fork's core assets
are preserved: MTP (ported from mlx-lm, Qwen3.5 dense+MoE), PrefixCache
hybrid, server hardening (MLX_MEMORY_LIMIT_GB env, /v1/status, loaded
models included in /v1/models, model pinning, busy tracking, GC threshold,
last_request, removal of the OOM-prone startup warmup), server-side
thinking strip + streaming incremental, null tool_calls guard.

Absorbed from upstream: continuous batching server (Blaizzy#1027), DFlash
speculative decoding (Blaizzy#1029, Blaizzy#1053 fix), thread-local generation
stream (Blaizzy#1050, hasattr guard for mlx<0.32), batch_generate/server VLM
fixes (Blaizzy#1055), Qwen3.5/3.6 MoE stale position IDs + gdn_sink
compatibility (Blaizzy#1040), tool-call markup strip (Blaizzy#1037), KV cache
quantization (Blaizzy#1030), Qwen2-3.5 VL torch-free video processors
(Blaizzy#1048), Gemma4 LoRA NaN/freeze fix (Blaizzy#1052), Gemma4 video,
Youtu-VL, distributed inference, and more.

Conflict resolution principle: the fork's MTP n_confirmed and upstream's
gdn_sink coexist in the same function by extending its signature. The fork
branched before Blaizzy#1029 (DFlash) was introduced, so the gdn_sink body
logic is inactive in our models (None is passed); the signature still
accepts it to keep compatibility. When the position_ids cache is reused,
the fork's ">= cache_offset + seq_length" check covers the Blaizzy#1040 fix
more precisely. The LanguageModelOutput.hidden_states/gdn_states fields
are kept for compatibility with upstream's additions.

Verification: syntax + import OK for 4 files. Confirmed compatible with
mlx 0.31.0 on an M3 with 96GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>