Add continuous batching to server #1027
Merged
Conversation
fffcac8 to 0be8c0f
Blaizzy added a commit that referenced this pull request Apr 16, 2026
* Add continuous batching support
- Introduced a global ResponseGenerator for managing continuous batching in the server.
- Updated Batch and BatchGenerator classes to handle per-sequence samplers and logits processors, improving flexibility in prompt handling.
- Enhanced LanguageModel to support per-sequence cache offsets for better performance during batched generation.
- Refactored insert and remove methods in BatchGenerator to accommodate new features and improve resource management.
- Added detailed docstrings for clarity on new functionalities and usage.
* Enhance memory tracking and preprocessing in BatchGenerator and ResponseGenerator
- Added peak memory tracking to Response and StreamingToken classes for better resource management.
- Updated BatchGenerator to include peak memory usage during response generation.
- Introduced preprocessing of images and audio before queuing requests to optimize performance and reduce blocking.
- Refactored input handling to utilize preprocessed inputs, improving efficiency in the generation process.
- Enhanced comments and documentation for clarity on new functionalities.
* Add continuous batching
* Add back PromptCacheState removed during rebase
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix continuous batching image prefill: drain before insert, skip unnecessary chunking
- Drain pending text-only prompts BEFORE inserting image request (not after),
so image prompts aren't prefilled without their embeddings
- Only chunk prefill when prompt exceeds prefill_step_size (matching main's behavior)
- Pass mask to get_input_embeddings for correct attention_mask_4d generation
- Remove stale pixel_values from gen_kwargs passed to language model
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Move ResponseGenerator from generate.py to server.py
ResponseGenerator is server-specific infrastructure (threaded queue,
request dispatching) — it belongs with the server, not the generation
library. BatchGenerator stays in generate.py for offline batch use.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add continuous batching section to README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add server start command to continuous batching section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add --model flag to pre-load model at server startup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Optimize text-only batching: drain queue before next() for fused prefill
Collect all pending requests from the queue before calling next(),
so multiple text-only requests get inserted and prefilled together
in a single batch call instead of one-at-a-time.
Text-only batch speedup: 1.16x -> 1.54x (gemma4-26b)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
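The drain-before-step idea in the commit above can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `drain_pending` and the string "requests" are hypothetical names.

```python
import queue


def drain_pending(q: queue.Queue) -> list:
    """Collect every request currently waiting, without blocking."""
    pending = []
    while True:
        try:
            pending.append(q.get_nowait())
        except queue.Empty:
            return pending


requests = queue.Queue()
for prompt in ("a", "b", "c"):
    requests.put(prompt)

# All three requests are collected first, so they can be inserted and
# prefilled together in one batch call rather than one per step.
batch = drain_pending(requests)
```

Draining first matters because a request that arrives one step late would otherwise be prefilled on its own.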
* Fix batch position state for models with cached rope (Qwen3.5)
- Reset _position_ids and _rope_deltas in _process_prompts before each
new batch prefill to prevent stale position state from previous batches
- Clamp negative BatchKVCache offsets to 0 in Qwen3.5 language model
Qwen3.5 stores position state on the model instance which breaks when
BatchGenerator prefills multiple batches sequentially. The stale cached
positions from batch N are incorrectly used for batch N+1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Rewrite ResponseGenerator: single GPU thread with tight batch loop
Replace the old thread/queue approach with a cleaner design:
- Single dedicated thread owns all GPU work via BatchGenerator
- FastAPI handlers submit preprocessed requests to a queue
- GPU thread runs next() in a tight time-budgeted loop (0.5s bursts)
- Concurrent requests are batched together automatically
- Image requests drain pending text-only first, then prefill inline
Results on gemma-4-26b (5 prompts, max_tokens=100):
Sequential: 3.56s (72 tok/s) — 2.3x faster than mlx-lm
Batch: 2.27s (112 tok/s) — 1.46x faster than mlx-lm
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
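A rough sketch of the single-owner-thread design described in the commit above, assuming a hypothetical `step` callable in place of `BatchGenerator.next()`. The request dicts and `fake_step` are made up for illustration; only the 0.5 s burst length comes from the commit message.

```python
import queue
import threading
import time

BURST_SECONDS = 0.5  # time budget per burst, as stated in the commit message


def gpu_worker(inbox: queue.Queue, step, stop: threading.Event) -> None:
    """Only this thread calls step(); every other thread just enqueues work."""
    active = []
    while not stop.is_set():
        # Pick up whatever arrived since the last burst.
        while True:
            try:
                active.append(inbox.get_nowait())
            except queue.Empty:
                break
        if not active:
            time.sleep(0.001)
            continue
        deadline = time.monotonic() + BURST_SECONDS
        while active and time.monotonic() < deadline:
            active = step(active)  # step returns the requests still running


def fake_step(active):
    """Hypothetical stand-in for one decode step: one token per request."""
    still_running = []
    for req in active:
        req["out"].put(f"tok{req['done']}")
        req["done"] += 1
        if req["done"] < req["max_tokens"]:
            still_running.append(req)
        else:
            req["out"].put(None)  # end-of-stream marker
    return still_running


inbox, stop = queue.Queue(), threading.Event()
worker = threading.Thread(target=gpu_worker, args=(inbox, fake_step, stop), daemon=True)
worker.start()

outs = [queue.Queue() for _ in range(2)]
for out in outs:
    inbox.put({"out": out, "done": 0, "max_tokens": 3})

# Each caller reads its own output queue until the end-of-stream marker.
results = [list(iter(out.get, None)) for out in outs]
stop.set()
worker.join()
```

Because only the worker ever calls `step`, requests that arrive while a burst is running are picked up at the next drain and join the active batch.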
* Optimize gemma4 attention mask creation: deduplicate by layer type
Replace per-call dual create_attention_mask with _make_masks() that
creates one mask per unique layer type (full_attention, sliding_attention)
and reuses it across layers. Also streamline the layer loop to use
zip iteration instead of index lookups.
Closes the 1.21x per-step gap with mlx-lm (89 vs 88 tok/s, batch of 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix Metal crash: move all preprocessing to GPU thread
All preprocessing (prepare_inputs + get_input_embeddings) now runs on the
single GPU thread. Callers only put raw data (prompt, image paths, args)
on the queue. This prevents concurrent Metal access when multiple
FastAPI requests arrive simultaneously.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add enable_thinking flag to server + move CPU preprocessing to caller thread
- Pass enable_thinking from request to apply_chat_template on both
/responses and /chat/completions endpoints
- Move prepare_inputs (CPU: tokenize, load images) to caller thread,
keep only get_input_embeddings (GPU) on the GPU thread — reduces
GPU thread blocking during batch generation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Default enable_thinking to true on server
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix event loop blocking: use asyncio.to_thread for Queue operations
The async endpoints were blocking the event loop on Queue.get(),
preventing concurrent request processing. Wrap blocking generate()
and token iteration in asyncio.to_thread so the event loop stays
free to accept new requests.
Before: B=4 batch 65 tok/s (sequential execution)
After: B=4 batch 120 tok/s (true concurrent batching)
At B=8: matches mlx-lm (114 vs 114 tok/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
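To illustrate the fix above: a blocking `Queue.get()` called directly inside an `async def` handler stalls the event loop, while wrapping it in `asyncio.to_thread` lets other requests proceed. A minimal sketch, with a producer thread standing in for the GPU thread; `next_token` is a hypothetical helper name.

```python
import asyncio
import queue
import threading


async def next_token(tokens: queue.Queue):
    """Wait for the next token in a worker thread, not on the event loop."""
    return await asyncio.to_thread(tokens.get)


async def main() -> list:
    tokens = queue.Queue()
    # The producer thread plays the role of the generation thread.
    threading.Thread(
        target=lambda: [tokens.put(t) for t in ("Hello", " world", None)],
        daemon=True,
    ).start()
    received = []
    while (tok := await next_token(tokens)) is not None:
        received.append(tok)
    return received


received = asyncio.run(main())
```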
* Match OpenAI response format: split reasoning/content, add metadata
- Separate thinking tokens into `reasoning` field, clean answer in `content`
(handles <|channel>thought...<channel|> and <think>...</think>)
- Add `id`, `object`, `created` to response/chunk models
- Use standard `prompt_tokens`/`completion_tokens` field names
- Stream: tag-aware state machine routes tokens to reasoning vs content
- Add `data: [DONE]` SSE terminator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
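For the non-streaming case, separating reasoning from the answer could look roughly like the sketch below. It handles only the two complete tag pairs named in the commit above. The function name `split_thinking` and its return shape are assumptions, and the PR's `_split_thinking` may differ (later commits add handling for partial tags).

```python
import re
from typing import Optional, Tuple

# Tag pairs named in the commit message.
_PATTERNS = [
    re.compile(r"<think>(.*?)</think>", re.DOTALL),
    re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL),
]


def split_thinking(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, content); reasoning is None when no tags are found."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            reasoning = match.group(1).strip()
            # Content is everything outside the matched thinking block.
            content = (text[: match.start()] + text[match.end():]).strip()
            return reasoning, content
    return None, text.strip()
```

The `reasoning` value would go in the separate response field and `content` would carry the clean answer.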
* Split thinking tags in /responses endpoint too
Apply _split_thinking to both non-streaming and streaming /responses
output, so output_text and content contain clean answers and reasoning
is in a separate field.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add missing OpenAI fields: index, prompt_tokens_details
- Add index to ChatChoice and ChatStreamChoice
- Add prompt_tokens_details with cached_tokens to UsageStats
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix completion_tokens count to match mlx-lm
Count raw generated tokens minus thinking tag tokens instead of
re-encoding text (which loses boundary tokens). Now matches
mlx-lm exactly: prompt_tokens=23, completion_tokens=40, total=63.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Format with black and isort
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix multi-turn tool calling: preserve metadata, parse tool calls, clean tags
Four bugs fixed:
1. prompt_utils: Tool messages (tool_calls, tool_call_id, role="tool")
now pass through to the Jinja template instead of being stripped
by _get_role_content/get_message_json.
2. server ChatMessage: Accept role="tool", add tool_calls, tool_call_id,
name fields. Server message processing preserves these fields.
3. server: Add process_tool_calls + tool parser detection to
/chat/completions. Model output is parsed for <|tool_call> tags,
returned as structured tool_calls with finish_reason="tool_calls".
4. tool_parsers/gemma4: Return arguments as JSON string (OpenAI spec),
not dict, preventing double-encoding on round-trip.
Also fixes thinking tag leaks:
- _split_thinking handles partial "thought\n...<channel|>" continuation
- Tool call turns return content=None (no leaked <|tool_call> tags)
Tested: 5-tool agentic workflow (weather → forecast → search → book →
email) completing in 4 turns with correct tool selection and
back-references across turns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix BatchGenerator stop: use tokenizer.stopping_criteria + normalize tool args
- BatchGenerator._next() now checks tokenizer.stopping_criteria(t) in
addition to self.stop_tokens. Fixes models like Qwen3.5 where
config.eos_token_id is None but the tokenizer has <|im_end|> as EOS.
- Normalize tool_calls arguments from JSON string to dict before passing
to Jinja templates (Qwen3.5 template iterates arguments with |items).
- Handle partial </think> tag (no opening <think>) in _split_thinking.
- Strip model control tokens from tool-call remaining text.
Before: Qwen3.5 generated past <|im_end|>, hallucinated fake turns.
After: Clean 3-turn tool calling matching mlx-lm behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
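The argument normalization described above (JSON string on the wire, dict for templates that iterate over arguments) might be sketched as follows. `arguments_for_template` is a hypothetical helper and the tool call is example data; the PR's actual code is not shown here.

```python
import json


def arguments_for_template(tool_call: dict) -> dict:
    """Return a copy whose function.arguments is a dict, decoding JSON if needed."""
    function = dict(tool_call["function"])
    args = function.get("arguments")
    if isinstance(args, str):
        # An empty string is treated as "no arguments".
        function["arguments"] = json.loads(args) if args else {}
    return {**tool_call, "function": function}


call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})},
}
normalized = arguments_for_template(call)
```

The original message is left untouched, so the JSON-string form is still what gets returned to the client.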
* Restore VisionFeatureCache to server
Re-add VisionFeatureCache that was lost during rebase:
- Create in get_cached_model, clear in unload_model_sync
- Pass vision_cache to all fallback stream_generate/generate calls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add vision caching to ResponseGenerator
- Updated ResponseGenerator to include vision_cache for improved efficiency.
- Modified __init__ method to accept vision_cache parameter.
- Enhanced _gpu_embed method to utilize cached image features, reducing redundant processing for repeated images.
- Adjusted request handling to accommodate images in batch processing.
This change optimizes the handling of image inputs, leveraging caching to enhance performance.
* Add vision feature caching to ResponseGenerator + --vision-cache-size CLI
- VisionFeatureCache integrated into ResponseGenerator._gpu_embed:
cache hit skips vision encoder (229ms → 1ms, 1GB memory saved)
- Shared VisionFeatureCache instance between ResponseGenerator and
fallback paths
- --vision-cache-size CLI flag (default: 20 images, ~30MB max)
- LRU eviction, cleared on model unload
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
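As an illustration of the LRU eviction mentioned above, here is a small cache built on `OrderedDict`. It is a sketch of the general technique, not the `VisionFeatureCache` implementation. The class name `LRUFeatureCache` and its methods are assumptions; only the default of 20 entries mirrors the CLI default.

```python
from collections import OrderedDict


class LRUFeatureCache:
    """Keep at most max_size entries, evicting the least recently used."""

    def __init__(self, max_size: int = 20):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None  # cache miss: caller runs the vision encoder
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, features) -> None:
        self._items[key] = features
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry

    def clear(self) -> None:
        """Drop everything, e.g. when the model is unloaded."""
        self._items.clear()


cache = LRUFeatureCache(max_size=2)
cache.put("img_a", "features_a")
cache.put("img_b", "features_b")
cache.get("img_a")                # img_a becomes most recently used
cache.put("img_c", "features_c")  # evicts img_b
```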
* Document server options and vision cache in README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Nest Continuous Batching under Server in README table of contents
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update README.md to enhance structure and clarity
- Changed section headers for Multi-Image Chat Support and Video Understanding to improve readability.
- Nested Usage Examples under their respective sections for better organization.
- Added detailed command line examples for both Multi-Image Chat Support and Video Understanding features.
* Add back normalize_resize_shape lost during rebase
Fixes test_generate.py import error.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix tests: update for Batch fields, tool parser JSON string, skip main-only tests
- test_gemma4_tool_parser: arguments is now JSON string per OpenAI spec
- test_generate: Batch needs samplers/logits_processors/tokens fields,
logprobs is List[mx.array] not mx.array
- Skip tests for main-only features not in this branch:
  - TestSamplerArgs (make_sampler API)
  - CLI enable_thinking/thinking_start_token
  - Server schema (enable_thinking, resize_shape single int)
  - TurboQuant kv_quant_scheme parameter
387 passed, 10 skipped, 0 failed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Restore generate.py and tests from main, re-apply position state reset
Replace generate.py with main's version (make_sampler, TurboQuant,
enable_thinking, PromptCacheState, vision_cache) plus our one addition:
_process_prompts resets _position_ids/_rope_deltas for Qwen3.5 batch compat.
Restore test files from main. 3 server tests fail due to our server
having a different schema (continuous batching rewrite) — not regressions.
393 passed, 3 failed (server schema), 1 skipped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Vision cache via kwargs, fix add_eos_token_ids, always produce embeddings
- Pass vision_cache + _image_key as kwargs to get_input_embeddings so
models can cache/reuse vision features internally (gemma4: 228x,
Qwen3.5: 23x speedup on cache hit)
- Always call get_input_embeddings (even text-only) since main's
BatchGenerator._process_prompts requires inputs_embeds
- Fix add_eos_token_ids to handle int token IDs (not just strings)
— fixes crash when config.eos_token_id provides ints
- Add vision_cache support to gemma4 and qwen3_5 get_input_embeddings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add generation arguments and thinking mode support
- Introduced new fields in GenerationArguments and OpenAIRequest for top_k, min_p, repetition_penalty, logit_bias, enable_thinking, thinking_budget, and thinking_start_token.
- Added methods to convert GenerationArguments to keyword arguments for generation and template functions.
- Implemented a helper function to build GenerationArguments from request data.
- Updated resize_shape handling with a field validator to normalize input.
These changes enhance the model's generation capabilities and support advanced features for thinking mode.
* Refactor response generation arguments handling
- Removed redundant instantiation of GenerationArguments in multiple locations within the responses and chat completions endpoints.
- Replaced instances of `args` with `gen_args` to streamline the argument passing to the response generator.
- This change simplifies the code and improves maintainability by reducing duplication.
* Add generation arguments and thinking mode support
- Refactor GenerationArguments with full sampling params (top_k, min_p,
repetition_penalty, logit_bias, enable_thinking, thinking_budget,
thinking_start_token) + to_generate_kwargs()/to_template_kwargs()
- Add _build_gen_args() helper, remove all inline kwargs duplication
- Add ResizeShapeInput type + field_validator for resize_shape
- Forward all sampling/thinking params to generate() fallback paths
- Fix RotatingKVCache.extend crash: always use batch-aware caches in
_process_prompts (removes single-prompt standard cache optimization
that breaks when continuous batching extends the batch later)
- Add vision_cache support to all 44 model get_input_embeddings
- Fix add_eos_token_ids to handle int token IDs
- Fix phi3_v vision cache patch ordering
396 passed, 1 skipped, 0 failed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add tests for continuous batching server components
Test GenerationArguments, _build_gen_args, _split_thinking,
ChatMessage tool-calling schema, process_tool_calls, and
_count_thinking_tag_tokens.
27 server tests pass (18 new + 9 existing).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add /v1/ route aliases for OpenAI SDK compatibility
/v1/responses, /v1/chat/completions, /v1/models now mirror
the base routes (hidden from schema docs).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update DEFAULT_MAX_TOKENS in generate.py to 2048 for enhanced generation capabilities
* Parse tool calls in streaming path for OpenAI SDK compatibility
Accumulate full output during streaming and parse tool calls at the end.
If tool calls are detected, emit a final SSE chunk with structured
tool_calls[] and finish_reason="tool_calls" — matching what OpenAI SDKs
(and tools like Pi) expect for agentic workflows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
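The end-of-stream behaviour described above can be sketched as building the last SSE events once the full output has been parsed. The helper names `sse` and `final_events` are hypothetical, and the chunk carries only the fields mentioned in this PR (`index`, `delta`, `finish_reason`) plus the standard `object` value; a real chunk would also carry `id`, `created` and the model name.

```python
import json


def sse(payload: dict) -> str:
    """Format one server-sent event carrying a JSON payload."""
    return f"data: {json.dumps(payload)}\n\n"


def final_events(tool_calls: list) -> list:
    """Build the closing events: a final chunk, then the [DONE] terminator."""
    finish_reason = "tool_calls" if tool_calls else "stop"
    delta = {"tool_calls": tool_calls} if tool_calls else {}
    chunk = {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    return [sse(chunk), "data: [DONE]\n\n"]


events = final_events(
    [
        {
            "index": 0,
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ]
)
```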
* Restore lost README sections from main, add continuous batching
Restore: Thinking Budget, model docs table (DOTS-OCR, Gemma 4, etc.),
Vision Feature Caching, TurboQuant KV Cache, Activation Quantization,
/v1/ route docs, server preload docs.
Add: Continuous Batching section under Server, --vision-cache-size option.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Restore server features from main: lifespan, CORS, KV quant, streaming headers
- Replace @app.on_event("startup") with asynccontextmanager lifespan
- Add CORSMiddleware (allow all origins)
- Restore KV quant env helpers (get_prefill_step_size, get_quantized_kv_bits, etc.)
- Add --kv-bits, --kv-quant-scheme, --kv-group-size, --max-kv-size,
--quantized-kv-start, --prefill-step-size, --reload CLI args
- Add Cache-Control/Connection/X-Accel-Buffering headers to StreamingResponse
- Import DEFAULT_KV_*, DEFAULT_THINKING_*, DEFAULT_PREFILL_STEP_SIZE
- Remove unused chat_messages variable
- reload=False by default (was True)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix streaming finish_reason: propagate token.finish_reason to SSE chunks
The streaming /chat/completions path was emitting finish_reason=null
on all chunks. Now propagates the actual finish_reason (stop/length)
from the ResponseGenerator's StreamingToken to the SSE chunk.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
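A minimal sketch of the fix described above: each SSE chunk copies `finish_reason` from the token it wraps instead of always emitting `null`. The `StreamingToken` dataclass here is a simplified stand-in with assumed fields, and `to_sse_chunk` is a hypothetical helper name, not the server's actual code.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingToken:
    """Simplified stand-in; the field names are assumptions."""

    text: str
    finish_reason: Optional[str] = None  # "stop", "length", or None mid-stream


def to_sse_chunk(token: StreamingToken) -> str:
    """Build one chat-completion SSE chunk from a streamed token."""
    chunk = {
        "object": "chat.completion.chunk",
        "choices": [
            {
                "index": 0,
                "delta": {"content": token.text},
                # Propagate the generator's value rather than hard-coding null.
                "finish_reason": token.finish_reason,
            }
        ],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```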
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy added a commit that referenced this pull request on Apr 17, 2026
Keep speculative decoding drafter loading, CLI args, and draft_model wiring in server.py. Keep is_batch_offset logic in language.py as superset of main's batch offset handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy added a commit that referenced this pull request on Apr 20, 2026
* add dtree + dflash
* Refactor Qwen3.5 model to remove capture_layer_ids and streamline forward pass. Update DFlash imports and enhance parity check for lightweight smoke testing. Remove unused cache snapshot functionality and simplify speculative decoding CLI. Clean up tree verification module and optimize DFlash loop for better performance.
* Refactor CLI argument handling in speculative module for improved readability and consistency. Simplify token processing in main function by renaming variables for clarity. Remove redundant code and enhance overall structure in dflash.py and cli.py.
* Enhance dflash_generate function in dflash_loop.py for improved clarity and performance. Update docstring for accepted_in_round explanation and add a new variable L to clarify drafted slots per round. Refactor comments for better understanding of draft cache trimming and token emission process.
* refactor dflash loop
* Update dependencies in requirements.txt and uv.lock for improved compatibility and performance. Bump versions for mlx, transformers, and mlx-lm, and update hf-xet to version 1.4.3 with new source and wheel links.
* add support for drafter loading
* Refactor DFlash drafter implementation by consolidating loading logic and removing deprecated components. Introduce a new DFlashConfig class for configuration management and streamline the load_drafter function to utilize shared loading utilities. Add a parity check for basic functionality verification. Remove the old qwen3_5_dflash module and its associated files.
* refactor dflash generate
* Introduce draft_kind parameter to generate_step for flexible speculative decoding. Update logic to handle different draft types, starting with 'dflash', and raise an error for unsupported types.
* Refactor speculative decoding logic in generate_step and _dflash_rounds functions. Simplify docstrings and streamline yield statements for clarity and consistency.
* Add speculative walk function and refactor DFlashDraftModel acceptance tracking. Update _dflash_rounds to utilize new walk logic and streamline cache management. Enhance generate_step for clarity in acceptance reporting.
* Update docstring in LanguageModel to clarify the handling of gated delta states during cache restoration.
* Refactor rotary embedding and attention mechanisms in Qwen3_5 model. Update apply_multimodal_rotary_pos_emb to ensure dtype consistency and introduce _precise_swiglu for improved gate handling in RMSNormGated.
* Add batch processing for speculative walk and DFlash rounds in generate.py. Implement rollback_speculative_cache_batch in LanguageModel for batch cache management. Update DFlashDraftModel to support batch bonuses in draft_block method. Enhance generate_step to handle batch decoding logic.
* Add continuous batching to server (#1027)
* Add batch speculative decoding with automatic sequence filtering and docs
- _dflash_rounds_batch: continuous batch support; finished sequences are filtered from target caches (via BatchKVCache.filter) and the drafter cold-restarts for the new batch size. stop_check callback for per-sequence EOS detection. active_idx mapping keeps stable original indices across batch changes.
- _speculative_walk_batch: per-sequence acceptance walk for B > 1.
- generate_step: B > 1 dispatch to batch path when draft_model is set.
- docs/usage.md: speculative decoding section with CLI, single-sequence, and batch generate examples.
Verified: B=1 regression (57 tok/s, 4.1 accept), B=4 batch (74.5 tok/s, 2.72x vs sequential AR), continuous filtering (short sequences exit early, remaining continue correctly).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add DFlash speculative decoding to server + update README
Server (mlx_vlm/server.py):
- --draft-model CLI arg loads a speculative drafter at startup
- ResponseGenerator._run_speculative: dedicated GPU thread loop for DFlash batch speculative decoding. Collects pending requests, batch-prefills with capture_layer_ids, runs _dflash_rounds_batch, dispatches tokens to per-request queues. Finished sequences are handled by stop_check callback.
- Acceptance metrics logged per batch: [DFlash] batch=N tokens=M accept=X rounds=Y
README.md:
- Speculative Decoding (DFlash) section with CLI examples for text, image, and server usage
- --draft-model added to Server Options table
Benchmarks (Qwen3.5-4B + DFlash drafter, same prompt, 200 tokens):
Server AR: 1 req 24.9 tok/s | 4 req 49.5 tok/s
Server DFlash: 1 req 44.9 tok/s | 4 req 85.1 tok/s (1.7-1.8x)
Acceptance: ~4.0 tokens/round
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Switch batch rollback from min-trim to max-trim with stale KV zeroing
rollback_speculative_cache_batch now trims to max(accepted) instead of min(accepted). Stale KV entries for shorter-accepted sequences are zeroed so attention assigns near-zero weight to them. GDN replay uses a per-sequence mask. Each sequence emits its full accepted+1 tokens per round instead of being capped to the global minimum.
B=8 throughput: 52.6 → 73.7 tok/s (+40%)
B=8 acceptance: 0.44 → 1.22 (+177%)
No collapse at high batch sizes, with steady scaling from B=1 to B=8.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: apply black, isort, autoflake formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add draft_model attrs to test_generate_cli_smoke Namespace
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: seed RNG in test_turboquant_prod_is_nearly_unbiased_across_seeds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add --top-logprobs-k argument to CLI for configurable top-K log probabilities
This commit introduces a new command-line argument, --top-logprobs-k, to allow users to set the cap for per-token top-K log probabilities directly via the CLI. The implementation mirrors the existing TOP_LOGPROBS_K environment variable functionality, enhancing usability. The README has been updated to reflect this addition.
* Revert redundant is_batch_offset mRoPE path in Qwen3.5 language model
Main's existing cache_offsets handling covers the batch case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Clean up drafter __init__.py: remove extra comments
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Clean up qwen3_dflash: remove excessive comments and docstrings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Merge rollback_speculative_cache and rollback_speculative_cache_batch into one method
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: apply black formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Signed-off-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mdkirin pushed a commit to mdkirin/mlx-seori that referenced this pull request on Apr 26, 2026
…MoE pos fix
Absorb upstream/main (4-19 to 4-25 batch). All of the fork's core assets are preserved: MTP (ported from mlx-lm, Qwen3.5 dense+MoE), PrefixCache hybrid, server hardening (MLX_MEMORY_LIMIT_GB env, /v1/status, /v1/models including loaded models, model pinning, busy tracking, GC threshold, last_request, removal of the OOM-prone startup warmup), server-side thinking strip + incremental streaming, null tool_calls guard.
Absorbed from upstream: continuous batching server (Blaizzy#1027), DFlash speculative decoding (Blaizzy#1029, Blaizzy#1053 fix), thread-local generation stream (Blaizzy#1050, hasattr guard for mlx<0.32), batch_generate/server VLM fixes (Blaizzy#1055), Qwen3.5/3.6 MoE stale position IDs + gdn_sink compatibility (Blaizzy#1040), tool-call markup strip (Blaizzy#1037), KV cache quantization (Blaizzy#1030), Qwen2-3.5 VL torch-free video processors (Blaizzy#1048), Gemma4 LoRA NaN/freeze fix (Blaizzy#1052), Gemma4 video, Youtu-VL, distributed inference, and more.
Conflict-resolution principle: the signature is extended so that the fork's MTP n_confirmed and upstream's gdn_sink coexist in the same function. The fork branched before Blaizzy#1029 (DFlash) was introduced, so the gdn_sink body logic is inactive in our models (None is passed); the signature still accepts it to keep compatibility. When the position_ids cache is reused, the fork's ">= cache_offset + seq_length" check covers the Blaizzy#1040 fix more precisely. The LanguageModelOutput.hidden_states/gdn_states fields are compatible with the upstream additions.
Verification: syntax + import OK for 4 files. mlx 0.31.0 compatibility confirmed on an M3 with 96GB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Continuous Batching Server
- A single GPU thread running `BatchGenerator` processes multiple concurrent requests together
- `asyncio.to_thread` unblocks the FastAPI event loop for true concurrent request handling
OpenAI-Compatible Response Format
- `id`, `object`, `created`, `index`, `prompt_tokens_details`: matches mlx-lm field-for-field
- `reasoning`/`content` split for thinking models (handles `<|channel>thought`/`<channel|>` and `<think>`/`</think>`)
- Streaming: a tag-aware state machine routes tokens to `reasoning` vs `content` fields
- `completion_tokens` excludes thinking tag tokens (matches mlx-lm count)
- `/v1/` route aliases for OpenAI SDK compatibility
Multi-Turn Tool Calling
- `ChatMessage` accepts `role="tool"`, `tool_calls`, `tool_call_id`, `name`
- Tool messages pass through `apply_chat_template` to Jinja templates
- `process_tool_calls` parses model output for structured `tool_calls[]` responses
- `finish_reason="tool_calls"` emitted correctly
Vision Feature Caching
- `vision_cache` kwarg passed to `get_input_embeddings`; all 44 models patched
- Models cache vision features internally (no `encode_image` needed)
- `--vision-cache-size` CLI flag (default: 20 images, ~30MB max), LRU eviction
Model Fixes
- Gemma4: attention mask creation deduplicated by layer type (`_make_masks`), matches mlx-lm per-step speed
- Qwen3.5: reset `_position_ids`/`_rope_deltas` per prefill, clamp negative `BatchKVCache` offsets
- Always use batch-aware caches in `_process_prompts` (fixes `RotatingKVCache.extend` crash)
- `BatchGenerator` checks `tokenizer.stopping_criteria` in addition to the `stop_tokens` set
- `add_eos_token_ids` handles int token IDs (not just strings)
Server Features
- `--model` flag to pre-load model at startup
- `enable_thinking` on by default, configurable per request
- Sampling and thinking parameters: `top_k`, `min_p`, `repetition_penalty`, `logit_bias`, `thinking_budget`
- `GenerationArguments` with `to_generate_kwargs()`/`to_template_kwargs()`: no duplicated kwargs
- `_build_gen_args()` shared helper for both `/responses` and `/chat/completions`
- `ResizeShapeInput` with `field_validator` for proper validation
Benchmarks
Server `/chat/completions` with `google/gemma-4-26b-a4b-it`, thinking enabled, Apple Silicon:
At B=8: mlx-vlm matches mlx-lm (111 vs 109 tok/s).
Tool calling: 5-tool agentic workflow (weather → forecast → search → book → email) completes in 4 turns.
Vision cache: 228x speedup on gemma4 cache hit (229ms → 1ms), 1GB peak memory saved.
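The cache behaviour behind that speedup (LRU eviction, a fixed image budget, cleared on model unload) can be sketched with an `OrderedDict`. This is an illustrative stand-in rather than the actual `VisionFeatureCache`: the class name, the method names, and the way image keys are derived are all assumptions.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUFeatureCache:
    """Toy LRU cache for encoded image features (hypothetical API)."""

    def __init__(self, max_size: int = 20):
        self.max_size = max_size
        self._store: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return cached features, or run the (expensive) encoder once."""
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        return value

    def clear(self) -> None:
        """Drop everything, e.g. when the model is unloaded."""
        self._store.clear()

    def __len__(self) -> int:
        return len(self._store)
```

On a hit the encoder callable is never invoked, which is where a saving of the kind reported above would come from.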
Test plan
- `/responses` and `/v1/` route aliases
- `--model` pre-loading, `--vision-cache-size`
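For reference, the `reasoning`/`content` split listed under OpenAI-Compatible Response Format can be approximated as follows. This sketch handles only the `<think>...</think>` form plus a bare closing `</think>` with no opening tag; the function name `split_thinking` is hypothetical, and the server's `_split_thinking` may differ in its details.

```python
from typing import Optional, Tuple

OPEN, CLOSE = "<think>", "</think>"


def split_thinking(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, content) extracted from raw model output."""
    if CLOSE not in text:
        # No thinking block at all: everything is the answer.
        return None, text.strip()
    before, _, after = text.partition(CLOSE)
    head, sep, body = before.partition(OPEN)
    if sep:
        # Normal case: reasoning sits between the tags.
        reasoning, prefix = body, head
    else:
        # Opening tag missing (e.g. it was part of the prompt):
        # treat everything before the closing tag as reasoning.
        reasoning, prefix = before, ""
    return reasoning.strip() or None, (prefix + after).strip()
```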