
Feature/api parity roadmap#5

Merged
solderzzc merged 92 commits into main from feature/api-parity-roadmap
Mar 31, 2026

Conversation

@solderzzc
Member

No description provided.

simba and others added 30 commits March 22, 2026 23:10
- Add stop sequences (stop parameter, text trimming)
- Add /v1/completions text completion endpoint (streaming + non-streaming)
- Accurate token counting via lmInput.text.tokens.size (replaces chars÷4)
- Add seed parameter for deterministic generation (MLXRandom.seed)
- Add stream_options.include_usage for streaming token stats
- Add CORS support via --cors CLI flag with CORSMiddleware
- Extract handler closures into standalone functions (Swift type-checker fix)
- Add ServerConfig struct for CLI defaults bundling
- Expand test suite: 6 → 13 test sections (32 assertions total)

All 32 tests pass.
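The stop-sequence handling above (stop parameter plus text trimming) boils down to truncating generated text at the earliest occurrence of any stop string. A minimal sketch, in C++ rather than the server's Swift, with an illustrative helper name (`applyStops` is not the actual implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Truncate generated text at the earliest occurrence of any stop sequence.
// Hypothetical helper name; the real logic lives in the Swift server.
std::string applyStops(const std::string& text,
                       const std::vector<std::string>& stops) {
    size_t cut = text.size();
    for (const auto& s : stops) {
        size_t pos = text.find(s);
        if (pos != std::string::npos && pos < cut) cut = pos;  // earliest wins
    }
    return text.substr(0, cut);
}
```

Everything from the first matching stop string onward is dropped, so the stop token itself is never returned to the client.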
…ra sampling params

- Add response_format: { type: 'json_object' } with prompt injection + fence stripping
- Add --vision CLI flag for VLM model loading via VLMModelFactory
- Parse OpenAI multipart content (string or [{type:'text',...},{type:'image_url',...}])
- Decode base64 data URIs and HTTP URLs into UserInput.Image for VLM inference
- Accept top_k, frequency_penalty, presence_penalty (API compat)
- Add MLXVLM package dependency
- Add 4 new regression tests (Tests 14-17), total: 38 assertions

All 38 tests pass.
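The "fence stripping" step for `response_format: { type: 'json_object' }` removes markdown code fences that models often wrap around JSON output. A hedged sketch of one way to do it (illustrative C++, not the server's Swift code):

```cpp
#include <cassert>
#include <string>

// Strip a leading ```json (or bare ```) line and a trailing ``` line,
// so the client receives raw JSON. Sketch only; names are illustrative.
std::string stripFences(std::string s) {
    if (s.rfind("```", 0) == 0) {              // leading fence line
        size_t nl = s.find('\n');
        s = (nl == std::string::npos) ? "" : s.substr(nl + 1);
    }
    size_t tail = s.rfind("```");              // trailing fence
    if (tail != std::string::npos &&
        s.find_first_not_of("`\n ", tail) == std::string::npos)
        s = s.substr(0, tail);
    return s;
}
```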
…utdown, stats

- Add --mem-limit CLI flag (sets Memory.memoryLimit + Memory.cacheLimit)
- Add ServerStats actor tracking requests, tokens, generation timing
- Enhanced /health endpoint with GPU memory (active/peak/cache/total), architecture, request/token stats
- Add /metrics Prometheus-compatible endpoint (8 metrics with TYPE/HELP)
- Add SIGTERM/SIGINT graceful shutdown handlers
- Wire stats tracking into all 6 handler functions
- Add 3 new regression tests (Tests 18-20), total: 49 assertions

All 49 tests pass.
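For context, Prometheus exposition format pairs each metric with `# HELP` and `# TYPE` comment lines. A hypothetical excerpt of what a payload like the one above could look like (the metric names here are illustrative, not the server's actual eight):

```text
# HELP requests_total Total HTTP requests served
# TYPE requests_total counter
requests_total 42
# HELP tokens_generated_total Total tokens generated
# TYPE tokens_generated_total counter
tokens_generated_total 18231
```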
- Add --api-key CLI option for bearer token authentication
- ApiKeyMiddleware validates Authorization: Bearer <key> header
- Health and metrics endpoints exempt from auth (monitoring tools)
- Returns 401 with OpenAI-style error JSON for invalid/missing keys
- Config line shows auth=enabled/disabled
- Add Test 21: 5 auth assertions (unauthenticated, wrong key, valid key, health exempt, metrics exempt)

All 54 tests pass.
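The middleware's decision reduces to one predicate: monitoring paths pass through, everything else needs an exact `Bearer <key>` match. A sketch under those assumptions (illustrative C++; the real check is the Swift `ApiKeyMiddleware`):

```cpp
#include <cassert>
#include <string>

// true if the request may proceed; /health and /metrics are exempt so
// monitoring tools work without credentials. Names are illustrative.
bool isAuthorized(const std::string& path,
                  const std::string& authHeader,
                  const std::string& apiKey) {
    if (path == "/health" || path == "/metrics") return true;
    const std::string prefix = "Bearer ";
    if (authHeader.rfind(prefix, 0) != 0) return false;  // missing/malformed
    return authHeader.substr(prefix.size()) == apiKey;   // exact key match
}
```

A failed check maps to the 401 with OpenAI-style error JSON described above.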
- Add PromptCache actor (saves/restores KV cache state per-layer)
- Cache keyed by system prompt text hash
- On cache hit: restore KV state, skip cached prefix tokens, process only new tokens
- On cache miss: generate normally, save system prompt KV state asynchronously
- Health/metrics endpoints exempt from cache
- Uses container.perform() for direct model access with cache-aware generation

All 54 tests pass.
The Metal shader library is required at runtime by MLX Swift.
Install via: python3 -m venv + pip install mlx + copy metallib.
Also trigger CI on feature/* branches.
Every 8 tokens, insert a 50μs Task.sleep to yield the GPU.
This prevents heavy inference from freezing the macOS UI
(WindowServer). Applied to all 4 generation loops:
- Chat streaming
- Chat non-streaming
- Text streaming
- Text non-streaming
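The yield pattern above is simple to state: every 8th token, sleep ~50 µs so the GPU can service WindowServer. A minimal C++ sketch of the loop shape (the real code is the Swift generation loop using `Task.sleep`):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Emit tokens, yielding the GPU every 8th token via a short sleep.
// Returns the number of yields, purely for illustration.
int generateWithYields(int totalTokens) {
    int yields = 0;
    for (int i = 1; i <= totalTokens; ++i) {
        // ... emit token i ...
        if (i % 8 == 0) {
            std::this_thread::sleep_for(std::chrono::microseconds(50));
            ++yields;
        }
    }
    return yields;
}
```

At typical generation speeds the added latency is negligible (a few milliseconds per thousand tokens) while the UI stays responsive.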
- New ModelProfiler.swift: reads config.json, measures weight files
  (follows HF Hub symlinks), computes memory requirements (weights +
  KV cache + 20% overhead), and outputs a PartitionPlan with strategy
  (fullGPU/swapAssisted/layerPartitioned/tooLarge)

- New --info flag: dry-run profiler prints formatted memory analysis
  report and exits without loading the model

- New --gpu-layers option: accepts 'auto' or integer, ready for future
  GPU/CPU layer splitting (Phase 2)

- Pre-load profiling: automatically detects overcommit ratio and sets
  MLX cache limits (2MB cache for swap-assisted mode to let OS manage
  page caching, inspired by Flash-MoE research)

- Enhanced /health endpoint: includes partition data (strategy,
  overcommit_ratio, weight/kv/total GB, GPU layers, estimated tok/s)

- Ready event JSON: includes partition data for downstream integration

- Rename main.swift -> Server.swift (required by Swift compiler when
  adding second source file with @main attribute)
Phase 2 integration — connects the mlx-swift-lm fork's new
LayerPartitionable protocol to mlx-server's CLI and profiler:

- --gpu-layers N: explicitly set N layers on GPU, rest on CPU
- --gpu-layers auto: use partition plan recommendation
- Auto-partition: when model exceeds available RAM (overcommit > 1.0),
  automatically applies the recommended GPU layer count

- PartitionPlan: added mutable gpuLayers field (updated after actual
  partitioning) and cpu_layers in /health response

- Fixed .chunk API change in latest fork (now returns tokenId tuple)

- Updated Package.swift comment to note partitioning support
- Package.resolved was accidentally deleted when switching to local
  path for fork development. Regenerated with resolved fork commit.

- e2e-test.yml: added 3-attempt retry loop for transient HuggingFace
  API failures (HTTP 500). Set HF_HUB_DOWNLOAD_TIMEOUT=120.

- build.yml: added feature/* branches to trigger CI builds.
FFTW-style auto-tuning that profiles optimal cache limits per
model × hardware combination. First run benchmarks 4 configurations
(tight → unlimited), measures tok/s, and persists the winner to
~/.mlx-server/wisdom/<key>.json. Subsequent runs load instantly.

New CLI flags:
  --calibrate    Force re-calibration even if wisdom exists

Integration with startup flow:
  1. Check for existing wisdom → apply instantly
  2. Or run calibration trials → store + apply
  3. --mem-limit always overrides wisdom

Calibrator.swift: ~290 lines, zero new dependencies.
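The selection step at the heart of the calibration flow is: run the candidate configurations, keep the one with the best tok/s, persist it. A sketch of that step (illustrative C++; `Trial`/`pickWisdom` are not the actual Calibrator.swift names):

```cpp
#include <cassert>
#include <vector>

struct Trial { long cacheLimitMB; double tokPerSec; };

// Pick the benchmarked configuration with the highest throughput.
// The caller would serialize the winner to ~/.mlx-server/wisdom/<key>.json.
Trial pickWisdom(const std::vector<Trial>& trials) {
    Trial best = trials.front();
    for (const auto& t : trials)
        if (t.tokPerSec > best.tokPerSec) best = t;
    return best;
}
```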
- tests/test_turbo_quant.cpp: 9 standalone C++ tests for turbo_quant.h algorithm
  - T1: Lloyd-Max centroids match turboquant_plus Python reference (tol 1e-5)
  - T2: WHT sign arrays match turbo-wht.h (seed=42 and seed=1042)
  - T3: FWHT is self-inverse (MSE < 1e-10)
  - T4: Forward∘Inverse rotation == identity (MSE < 1e-8)
  - T5: WHT rotation is norm-preserving (unitary, δ < 1e-4)
  - T6: 3-bit pack/unpack round-trip (0 mismatches over all 8 index values)
  - T7: V-cache SNR = 14.6 dB over 200 random d=128 Gaussian vectors
  - T8: K-cache inner-product SNR = 13.7 dB over 100 random key/query pairs
  - T9: fp16 conversion round-trip < 0.3% relative error
  All 9/9 tests pass with clang++ -std=c++17 -O2.

- .github/workflows/build.yml: run TurboQuant tests after swift build
- .github/workflows/e2e-test.yml: run TurboQuant tests as fast pre-flight
  before expensive model download (fail early if compression math is broken)
…heartbeat)

- ThinkingStateTracker: streaming state machine that splits <think>…</think>
  tokens into delta.reasoning_content vs delta.content (llama-server compatible)
- extractThinkingBlock(): non-streaming extraction for handleChatNonStreaming
- enableThinking param wired into both streaming and non-streaming handlers
- Prefill heartbeat: BoolFlag actor + Task emitting ssePrefillChunk every 2s
  while prompt is being processed — prevents silent connections on long prefills
- sseChunk() refactored: delta string replaced with reasoningContent/content
  optional fields — cleaner separation of thinking vs response tokens
- ssePrefillChunk(): new SSE event type 'prefill_progress' for client-side UX
- AssistantMessage: added reasoningContent field with CodingKey 'reasoning_content'
- handleChatNonStreaming: applies extractThinkingBlock + JSON stripping on
  responseContent (not raw fullText) so thinking tokens are never returned as content
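The non-streaming extraction step can be sketched as follows: find the `<think>…</think>` span, return the inside as reasoning and the rest as content. This is an illustrative C++ stand-in for the Swift `extractThinkingBlock()`, handling only the single well-formed-block case:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split full text into (reasoning, content). If no complete thinking
// block is present, everything is content. Sketch only.
std::pair<std::string, std::string> extractThinking(const std::string& text) {
    const std::string open = "<think>", close = "</think>";
    size_t a = text.find(open);
    size_t b = text.find(close);
    if (a == std::string::npos || b == std::string::npos || b < a)
        return {"", text};
    std::string reasoning = text.substr(a + open.size(), b - a - open.size());
    std::string content = text.substr(0, a) + text.substr(b + close.size());
    return {reasoning, content};
}
```

The streaming `ThinkingStateTracker` does the same split incrementally, which is harder because a tag can arrive fragmented across token boundaries.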
mlx::core::fast::turbo_encode() does not exist in the upstream MLX
library yet. Replace the broken call with a std::runtime_error stub
that compiles cleanly. The real implementation will be wired in once
the TurboQuant C++ core is available upstream.
Adds the missing mlx::core::fast::turbo_encode_k() and turbo_encode_v()
functions that the C API stub was placeholding.

Algorithm (from turbo_quant.h):
  turbo_encode_k: 3-bit PolarQuant (WHT rotation + Lloyd-Max centroids)
                  + 1-bit QJL residual — K cache, 68 bytes/token
  turbo_encode_v: 3-bit PolarQuant only — V cache, 50 bytes/token

Buffer layout per token:
  K: indices[48] | qjl_signs[16] | norm_fp16[2] | rnorm_fp16[2]
  V: indices[48] | norm_fp16[2]

Layout matches the Metal decompression path in sdpa_vector.h which
already implements turbo_dequant_k/v for on-the-fly decode during SDPA.

The encode path is CPU-side (eval + iterate), which is appropriate since
compression runs once per appended KV token, not in the hot forward pass.

Files changed:
  fast.h     — declare turbo_encode_k/v in namespace mlx::core::fast
  fast.cpp   — implement using turbo_quant.h primitives
  mlx-c fast.cpp — replace runtime_error stub with real call
Phase 2: Server.swift integration of TurboQuant KV-cache compression.

CLI:
  --turbo-kv   Enable 3-bit PolarQuant+QJL KV compression on all
               KVCacheSimple layers. Compresses history > 8192 tokens
               to ~3.5 bits/token — recommended for 100k+ context.
               Default: disabled (zero overhead when off).

KVCache.swift (submodule):
  KVCacheSimple.turboQuantEnabled: Bool = false
    Now settable at runtime so Server.swift can activate per-request.

Server.swift:
  - @Flag --turbo-kv added to CLI
  - turboKV stored in ServerConfig
  - Startup log shows turbo_kv=enabled/disabled
  - Sets .turboQuantEnabled = true on each KVCacheSimple before prefill
- Rename Sources/mlx-server/ → Sources/SwiftLM/
- Update Package.swift: package name, target name, source path
- Update all [mlx-server] log prefixes to [SwiftLM]
- Update ~/.mlx-server/wisdom/ path to ~/.swiftlm/wisdom/
- Update CLI commandName to SwiftLM
- Update GitHub Actions workflows: binary path, tarball names, release titles
- Update all documentation files
…name

The CI cache was built when the repo was named 'mlx-server'. After renaming
to 'SwiftLM', clang embedded the old path in the PCH, causing:

  error: PCH was compiled with module cache path '.../mlx-server/.build/...'
  but the path is currently '.../SwiftLM/.build/...'

Fixes:
- build.yml, e2e-test.yml: scope cache key to 'spm-SwiftLM-' so the old
  'spm-' prefixed cache is never restored as a partial match
- Add 'Clear stale module cache' step (rm ModuleCache/) before swift build
  to eliminate any stale PCH artifacts that sneak in via partial cache hits
- tests/test-server.sh: update comment and default binary path to SwiftLM
Three bugs fixed in the SSD streaming throughput logger:

1. STREAMING CORRUPTION — The [⚡️ SSD Stream] log line was printed to
   std::cout (stdout), the same fd as the token stream. Since Swift
   token output uses print(text, terminator: "") with no newline, the
   metric lines interleaved mid-token, corrupting the SSE response body
   observed by clients and the Electron log display.
   Fix: switch to std::cerr (stderr). Electron routes stderr separately
   as [mlx-server:err] and never forwards it as SSE content.

2. LOG FLOODING — The 1-second throttle emitted ~60 lines/minute. With
   a 4 t/s MoE model this produces a metric line for almost every token.
   Fix: throttle to 10 seconds (10'000'000'000 ns).

3. WRONG UNIT — The metric was labelled 'MB/s' but the value computed
   was total MB read in the window (not divided by elapsed seconds).
   Fix: divide bytes by elapsed_s to get true NVMe throughput in MB/s.

New format (stderr, every 10 s):
  [⚡️ SSD Stream] 3456 MB/s | 24070 chunks | avg 0.003 ms/chunk
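The unit fix in bug 3, in isolation: divide the bytes read during the log window by the elapsed seconds, rather than reporting raw megabytes. A one-function sketch (illustrative name, not the exact code in `moe_stream_op.cpp`):

```cpp
#include <cassert>
#include <cmath>

// True window throughput in MB/s: bytes read, divided by MiB, divided
// by elapsed wall-clock seconds. Guards against a zero-length window.
double windowThroughputMBs(unsigned long long bytesInWindow, double elapsedS) {
    if (elapsedS <= 0.0) return 0.0;
    return (bytesInWindow / (1024.0 * 1024.0)) / elapsedS;
}
```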
Stack:
  C++ (moe_stream_op.cpp): Add 4 lifetime atomics (bytes, ns, chunks,
    window_throughput_mbs). Accumulated per expert-chunk load,
    never reset. After each 10 s log window, write throughput_mbs.
    Implement extern C mlx_ssd_metrics_snapshot() with full struct
    definition in the .cpp TU.

  C ABI (fast.h + include/mlx/c/fast.h): Declare MlxSSDMetricsSnapshot
    typedef + mlx_ssd_metrics_snapshot() in both the mlx-c copy and the
    Swift-visible umbrella header. moe_stream_op.h keeps only a forward
    declaration + extern C bridge (avoids redefinition at link time).

  Swift (MLXFast.swift): New MLXFast.SSDMetricsSnapshot struct +
    MLXFast.ssdMetricsSnapshot() calling through to the C function.

  Server.swift: /metrics emits 4 new Prometheus gauges/counters when
    the server is started with --stream-experts:
      swiftlm_ssd_throughput_mbps   (gauge, 10 s rolling average)
      swiftlm_ssd_bytes_read_total  (counter, lifetime)
      swiftlm_ssd_chunks_total      (counter, lifetime)
      swiftlm_ssd_chunk_latency_ms  (gauge, lifetime average)

Example when SSD streaming is active:
  $ curl http://127.0.0.1:8080/metrics | grep ssd
  swiftlm_ssd_throughput_mbps 3456.0
  swiftlm_ssd_bytes_read_total 1234567890
  swiftlm_ssd_chunks_total 82340
  swiftlm_ssd_chunk_latency_ms 0.0028
- Updates the TurboQuantization section in README to explain the fusion of V2 speed and V3 quality algorithms
- Adds 'docs/turboquant_hybrid_architecture.md' with deep-dive technical analysis of the Lloyd-Max + QJL Metal integration
- Expanded the ThinkingStateTracker to match both `<think>` and `<thinking>` open/close tags
- Fixes an issue where Qwen models would leak 'ing>' or other characters into the SSE stream because the tracker strictly looked for the 8-character DeepSeek '</think>' tag
- Added support for the top-level 'enable_thinking' parameter in ChatCompletionRequest
- Ensures compatibility with Aegis-AI's gateway which passes 'enable_thinking' directly in the root JSON instead of nested in 'chat_template_kwargs'
…rbo-kv fixes

- Streaming path: log content as null (not raw fullText) when tool_calls are present
- mlx-swift-lm: TurboKV now compresses from token 1 (not 512)
- mlx-swift-lm: head_dim guard prevents fatal crash on Qwen 122B (dim=256 != 128)
Qwen3.5-122B has head_dim=256. The C++ encoder now processes D=256
as two consecutive D=128 sub-groups using the existing TurboQuantK/V
structs. Record sizes double: K=136 B, V=100 B per token for D=256.
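The record-size arithmetic follows directly from the buffer layout given earlier: a D=128 K record is indices[48] + qjl_signs[16] + norm_fp16[2] + rnorm_fp16[2] = 68 bytes and a V record is 48 + 2 = 50 bytes, so treating D=256 as two D=128 sub-groups doubles both. As a sketch (hypothetical helper names):

```cpp
#include <cassert>

// Per-token compressed record sizes, derived from the D=128 layout.
// headDim must be a multiple of 128 (the sub-group width).
int kRecordBytes(int headDim) { return (headDim / 128) * (48 + 16 + 2 + 2); }
int vRecordBytes(int headDim) { return (headDim / 128) * (48 + 2); }
```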
- Add MLXInferenceCore shared Swift library target
  - InferenceEngine actor with AsyncStream<GenerationToken> API
  - ChatMessage, GenerationConfig, ModelCatalog models
  - Device-aware model recommendations based on physical RAM

- Add SwiftLMChat/ Xcode project (iOS 17+ / macOS 14+)
  - SwiftLMChatApp.swift: entry point with macOS menu commands
  - RootView: NavigationSplitView (macOS) / NavigationStack (iOS)
  - ChatView: streaming message display, input bar, stop button
  - MessageBubble: custom bubble shape, typing indicator, thinking disclosure
  - ModelPickerView: device-aware model list with RAM fit badges
  - SettingsView: temperature, max tokens, top-p, thinking mode
  - ChatViewModel: @observable bridges InferenceEngine to SwiftUI
  - generate_xcodeproj.py: stdlib-only project generator

- Update Package.swift: add iOS 17+ platform, MLXInferenceCore target
…ntion

Implements decode path: packed uint8 compressed KV history → float32
for concatenation with hot window before passing to standard SDPA.
Supports D=128 (68B/50B records) and D=256 (136B/100B records).
Full TurboKV pipeline now operational:
  - Encode history → 3-bit PolarQuant (existing)
  - Truncate hot cache to hot window (new)
  - Decode + prepend history at every SDPA call (new)

RAM savings: ~5.5x on history tokens (3-bit vs 16-bit fp16)
Replace hash-based system-prompt cache with longest-common-prefix scan:
- Store full token sequence alongside KV state
- On each request: scan token-by-token to find longest shared prefix
- Restore KV state, trim excess via layer.trim() for partial matches
- Save full prompt after every request (not just system-prompt)

Benefits:
  System prompt matched exactly, no token-count approximation bug
  Conversation history reuse (any shared prefix, not just system prompt)
  Partial prefix matches (e.g. same system + first N turns) also benefit
  Works correctly with TurboKV (state getter now returns full fp16 context)
@solderzzc solderzzc merged commit bf9e87e into main Mar 31, 2026
4 checks passed
@solderzzc solderzzc deleted the feature/api-parity-roadmap branch March 31, 2026 06:28