Conversation
- Add stop sequences (stop parameter, text trimming) - Add /v1/completions text completion endpoint (streaming + non-streaming) - Accurate token counting via lmInput.text.tokens.size (replaces chars÷4) - Add seed parameter for deterministic generation (MLXRandom.seed) - Add stream_options.include_usage for streaming token stats - Add CORS support via --cors CLI flag with CORSMiddleware - Extract handler closures into standalone functions (Swift type-checker fix) - Add ServerConfig struct for CLI defaults bundling - Expand test suite: 6 → 13 test sections (32 assertions total) All 32 tests pass.
…ra sampling params
- Add response_format: { type: 'json_object' } with prompt injection + fence stripping
- Add --vision CLI flag for VLM model loading via VLMModelFactory
- Parse OpenAI multipart content (string or [{type:'text',...},{type:'image_url',...}])
- Decode base64 data URIs and HTTP URLs into UserInput.Image for VLM inference
- Accept top_k, frequency_penalty, presence_penalty (API compat)
- Add MLXVLM package dependency
- Add 4 new regression tests (Tests 14-17), total: 38 assertions
All 38 tests pass.
…utdown, stats - Add --mem-limit CLI flag (sets Memory.memoryLimit + Memory.cacheLimit) - Add ServerStats actor tracking requests, tokens, generation timing - Enhanced /health endpoint with GPU memory (active/peak/cache/total), architecture, request/token stats - Add /metrics Prometheus-compatible endpoint (8 metrics with TYPE/HELP) - Add SIGTERM/SIGINT graceful shutdown handlers - Wire stats tracking into all 6 handler functions - Add 3 new regression tests (Tests 18-20), total: 49 assertions All 49 tests pass.
- Add --api-key CLI option for bearer token authentication - ApiKeyMiddleware validates Authorization: Bearer <key> header - Health and metrics endpoints exempt from auth (monitoring tools) - Returns 401 with OpenAI-style error JSON for invalid/missing keys - Config line shows auth=enabled/disabled - Add Test 21: 5 auth assertions (unauthenticated, wrong key, valid key, health exempt, metrics exempt) All 54 tests pass.
- Add PromptCache actor (saves/restores KV cache state per-layer) - Cache keyed by system prompt text hash - On cache hit: restore KV state, skip cached prefix tokens, process only new tokens - On cache miss: generate normally, save system prompt KV state asynchronously - Health/metrics endpoints exempt from cache - Uses container.perform() for direct model access with cache-aware generation All 54 tests pass.
The Metal shader library is required at runtime by MLX Swift. Install via: python3 -m venv + pip install mlx + copy metallib. Also trigger CI on feature/* branches.
Every 8 tokens, insert a 50μs Task.sleep to yield the GPU. This prevents heavy inference from freezing the macOS UI (WindowServer). Applied to all 4 generation loops: - Chat streaming - Chat non-streaming - Text streaming - Text non-streaming
- New ModelProfiler.swift: reads config.json, measures weight files (follows HF Hub symlinks), computes memory requirements (weights + KV cache + 20% overhead), and outputs a PartitionPlan with strategy (fullGPU/swapAssisted/layerPartitioned/tooLarge) - New --info flag: dry-run profiler prints formatted memory analysis report and exits without loading the model - New --gpu-layers option: accepts 'auto' or integer, ready for future GPU/CPU layer splitting (Phase 2) - Pre-load profiling: automatically detects overcommit ratio and sets MLX cache limits (2MB cache for swap-assisted mode to let OS manage page caching, inspired by Flash-MoE research) - Enhanced /health endpoint: includes partition data (strategy, overcommit_ratio, weight/kv/total GB, GPU layers, estimated tok/s) - Ready event JSON: includes partition data for downstream integration - Rename main.swift -> Server.swift (required by Swift compiler when adding second source file with @main attribute)
Phase 2 integration — connects the mlx-swift-lm fork's new LayerPartitionable protocol to mlx-server's CLI and profiler: - --gpu-layers N: explicitly set N layers on GPU, rest on CPU - --gpu-layers auto: use partition plan recommendation - Auto-partition: when model exceeds available RAM (overcommit > 1.0), automatically applies the recommended GPU layer count - PartitionPlan: added mutable gpuLayers field (updated after actual partitioning) and cpu_layers in /health response - Fixed .chunk API change in latest fork (now returns tokenId tuple) - Updated Package.swift comment to note partitioning support
- Package.resolved was accidentally deleted when switching to local path for fork development. Regenerated with resolved fork commit. - e2e-test.yml: added 3-attempt retry loop for transient HuggingFace API failures (HTTP 500). Set HF_HUB_DOWNLOAD_TIMEOUT=120. - build.yml: added feature/* branches to trigger CI builds.
FFTW-style auto-tuning that profiles optimal cache limits per model × hardware combination. First run benchmarks 4 configurations (tight → unlimited), measures tok/s, and persists the winner to ~/.mlx-server/wisdom/<key>.json. Subsequent runs load instantly. New CLI flags: --calibrate Force re-calibration even if wisdom exists Integration with startup flow: 1. Check for existing wisdom → apply instantly 2. Or run calibration trials → store + apply 3. --mem-limit always overrides wisdom Calibrator.swift: ~290 lines, zero new dependencies.
…ming for large MoE models
…icit SPM test target failure
- tests/test_turbo_quant.cpp: 9 standalone C++ tests for turbo_quant.h algorithm - T1: Lloyd-Max centroids match turboquant_plus Python reference (tol 1e-5) - T2: WHT sign arrays match turbo-wht.h (seed=42 and seed=1042) - T3: FWHT is self-inverse (MSE < 1e-10) - T4: Forward∘Inverse rotation == identity (MSE < 1e-8) - T5: WHT rotation is norm-preserving (unitary, δ < 1e-4) - T6: 3-bit pack/unpack round-trip (0 mismatches over all 8 index values) - T7: V-cache SNR = 14.6 dB over 200 random d=128 Gaussian vectors - T8: K-cache inner-product SNR = 13.7 dB over 100 random key/query pairs - T9: fp16 conversion round-trip < 0.3% relative error All 9/9 tests pass with clang++ -std=c++17 -O2. - .github/workflows/build.yml: run TurboQuant tests after swift build - .github/workflows/e2e-test.yml: run TurboQuant tests as fast pre-flight before expensive model download (fail early if compression math is broken)
…heartbeat) - ThinkingStateTracker: streaming state machine that splits <think>…</think> tokens into delta.reasoning_content vs delta.content (llama-server compatible) - extractThinkingBlock(): non-streaming extraction for handleChatNonStreaming - enableThinking param wired into both streaming and non-streaming handlers - Prefill heartbeat: BoolFlag actor + Task emitting ssePrefillChunk every 2s while prompt is being processed — prevents silent connections on long prefills - sseChunk() refactored: delta string replaced with reasoningContent/content optional fields — cleaner separation of thinking vs response tokens - ssePrefillChunk(): new SSE event type 'prefill_progress' for client-side UX - AssistantMessage: added reasoningContent field with CodingKey 'reasoning_content' - handleChatNonStreaming: applies extractThinkingBlock + JSON stripping on responseContent (not raw fullText) so thinking tokens are never returned as content
mlx::core::fast::turbo_encode() does not exist in the upstream MLX library yet. Replace the broken call with a std::runtime_error stub that compiles cleanly. The real implementation will be wired in once the TurboQuant C++ core is available upstream.
Adds the missing mlx::core::fast::turbo_encode_k() and turbo_encode_v()
functions that the C API stub was placeholding.
Algorithm (from turbo_quant.h):
turbo_encode_k: 3-bit PolarQuant (WHT rotation + Lloyd-Max centroids)
+ 1-bit QJL residual — K cache, 68 bytes/token
turbo_encode_v: 3-bit PolarQuant only — V cache, 50 bytes/token
Buffer layout per token:
K: indices[48] | qjl_signs[16] | norm_fp16[2] | rnorm_fp16[2]
V: indices[48] | norm_fp16[2]
Layout matches the Metal decompression path in sdpa_vector.h which
already implements turbo_dequant_k/v for on-the-fly decode during SDPA.
The encode path is CPU-side (eval + iterate), which is appropriate since
compression runs once per appended KV token, not in the hot forward pass.
Files changed:
fast.h — declare turbo_encode_k/v in namespace mlx::core::fast
fast.cpp — implement using turbo_quant.h primitives
mlx-c fast.cpp — replace runtime_error stub with real call
Phase 2: Server.swift integration of TurboQuant KV-cache compression.
CLI:
--turbo-kv Enable 3-bit PolarQuant+QJL KV compression on all
KVCacheSimple layers. Compresses history > 8192 tokens
to ~3.5 bits/token — recommended for 100k+ context.
Default: disabled (zero overhead when off).
KVCache.swift (submodule):
KVCacheSimple.turboQuantEnabled: Bool = false
Now settable at runtime so Server.swift can activate per-request.
Server.swift:
- @Flag --turbo-kv added to CLI
- turboKV stored in ServerConfig
- Startup log shows turbo_kv=enabled/disabled
- Sets .turboQuantEnabled = true on each KVCacheSimple before prefill
- Rename Sources/mlx-server/ → Sources/SwiftLM/ - Update Package.swift: package name, target name, source path - Update all [mlx-server] log prefixes to [SwiftLM] - Update ~/.mlx-server/wisdom/ path to ~/.swiftlm/wisdom/ - Update CLI commandName to SwiftLM - Update GitHub Actions workflows: binary path, tarball names, release titles - Update all documentation files
55c3e14 to
480e349
Compare
…name The CI cache was built when the repo was named 'mlx-server'. After renaming to 'SwiftLM', clang embedded the old path in the PCH, causing: error: PCH was compiled with module cache path '.../mlx-server/.build/...' but the path is currently '.../SwiftLM/.build/...' Fixes: - build.yml, e2e-test.yml: scope cache key to 'spm-SwiftLM-' so the old 'spm-' prefixed cache is never restored as a partial match - Add 'Clear stale module cache' step (rm ModuleCache/) before swift build to eliminate any stale PCH artifacts that sneak in via partial cache hits - tests/test-server.sh: update comment and default binary path to SwiftLM
Two bugs fixed in the SSD streaming throughput logger: 1. STREAMING CORRUPTION — The [⚡️ SSD Stream] log line was printed to std::cout (stdout), the same fd as the token stream. Since Swift token output uses print(text, terminator: "") with no newline, the metric lines interleaved mid-token, corrupting the SSE response body observed by clients and the Electron log display. Fix: switch to std::cerr (stderr). Electron routes stderr separately as [mlx-server:err] and never forwards it as SSE content. 2. LOG FLOODING — The 1-second throttle emitted ~60 lines/minute. With a 4 t/s MoE model this produces a metric line for almost every token. Fix: throttle to 10 seconds (10'000'000'000 ns). 3. WRONG UNIT — The metric was labelled 'MB/s' but the value computed was total MB read in the window (not divided by elapsed seconds). Fix: divide bytes by elapsed_s to get true NVMe throughput in MB/s. New format (stderr, every 10 s): [⚡️ SSD Stream] 3456 MB/s | 24070 chunks | avg 0.003 ms/chunk
Stack:
C++ (moe_stream_op.cpp): Add 4 lifetime atomics (bytes, ns, chunks,
window_throughput_mbs). Accumulated per expert-chunk load,
never reset. After each 10 s log window, write throughput_mbs.
Implement extern C mlx_ssd_metrics_snapshot() with full struct
definition in the .cpp TU.
C ABI (fast.h + include/mlx/c/fast.h): Declare MlxSSDMetricsSnapshot
typedef + mlx_ssd_metrics_snapshot() in both the mlx-c copy and the
Swift-visible umbrella header. moe_stream_op.h keeps only a forward
declaration + extern C bridge (avoids redefinition at link time).
Swift (MLXFast.swift): New MLXFast.SSDMetricsSnapshot struct +
MLXFast.ssdMetricsSnapshot() calling through to the C function.
Server.swift: /metrics emits 4 new Prometheus gauges/counters when
the server is started with --stream-experts:
swiftlm_ssd_throughput_mbps (gauge, 10 s rolling average)
swiftlm_ssd_bytes_read_total (counter, lifetime)
swiftlm_ssd_chunks_total (counter, lifetime)
swiftlm_ssd_chunk_latency_ms (gauge, lifetime average)
Example when SSD streaming is active:
$ curl http://127.0.0.1:8080/metrics | grep ssd
swiftlm_ssd_throughput_mbps 3456.0
swiftlm_ssd_bytes_read_total 1234567890
swiftlm_ssd_chunks_total 82340
swiftlm_ssd_chunk_latency_ms 0.0028
- Updates the TurboQuantization section in README to explain the fusion of V2 speed and V3 quality algorithms - Adds 'docs/turboquant_hybrid_architecture.md' with deep-dive technical analysis of the Lloyd-Max + QJL Metal integration
- Expanded the ThinkingStateTracker to match both `<think>` and `<thinking>` open/close tags - Fixes an issue where Qwen models would leak 'ing>' or other characters into the SSE stream because the tracker strictly looked for the 8-character DeepSeek '</think>' tag
- Added support for the top-level 'enable_thinking' parameter in ChatCompletionRequest - Ensures compatibility with Aegis-AI's gateway which passes 'enable_thinking' directly in the root JSON instead of nested in 'chat_template_kwargs'
…rbo-kv fixes - Streaming path: log content as null (not raw fullText) when tool_calls are present - mlx-swift-lm: TurboKV now compresses from token 1 (not 512) - mlx-swift-lm: head_dim guard prevents fatal crash on Qwen 122B (dim=256 != 128)
Qwen3.5-122B has head_dim=256. The C++ encoder now processes D=256 as two consecutive D=128 sub-groups using the existing TurboQuantK/V structs. Record sizes double: K=136b, V=100b per token for D=256.
- Add MLXInferenceCore shared Swift library target - InferenceEngine actor with AsyncStream<GenerationToken> API - ChatMessage, GenerationConfig, ModelCatalog models - Device-aware model recommendations based on physical RAM - Add SwiftLMChat/ Xcode project (iOS 17+ / macOS 14+) - SwiftLMChatApp.swift: entry point with macOS menu commands - RootView: NavigationSplitView (macOS) / NavigationStack (iOS) - ChatView: streaming message display, input bar, stop button - MessageBubble: custom bubble shape, typing indicator, thinking disclosure - ModelPickerView: device-aware model list with RAM fit badges - SettingsView: temperature, max tokens, top-p, thinking mode - ChatViewModel: @observable bridges InferenceEngine to SwiftUI - generate_xcodeproj.py: stdlib-only project generator - Update Package.swift: add iOS 17+ platform, MLXInferenceCore target
…ntion Implements decode path: packed uint8 compressed KV history → float32 for concatenation with hot window before passing to standard SDPA. Supports D=128 (68B/50B records) and D=256 (136B/100B records).
Full TurboKV pipeline now operational: - Encode history → 3-bit PolarQuant (existing) - Truncate hot cache to hot window (new) - Decode + prepend history at every SDPA call (new) RAM savings: ~5.5x on history tokens (3-bit vs 16-bit fp16)
…e, commit references
Replace hash-based system-prompt cache with longest-common-prefix scan: - Store full token sequence alongside KV state - On each request: scan token-by-token to find longest shared prefix - Restore KV state, trim excess via layer.trim() for partial matches - Save full prompt after every request (not just system-prompt) Benefits: System prompt matched exactly, no token-count approximation bug Conversation history reuse (any shared prefix, not just system prompt) Partial prefix matches (e.g. same system + first N turns) also benefit Works correctly with TurboKV (state getter now returns full fp16 context)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.