
Feature/api parity roadmap#5

Merged
solderzzc merged 92 commits into main from feature/api-parity-roadmap
Mar 31, 2026

Conversation

@solderzzc
Member

No description provided.

simba and others added 30 commits March 22, 2026 23:10
- Add stop sequences (stop parameter, text trimming)
- Add /v1/completions text completion endpoint (streaming + non-streaming)
- Accurate token counting via lmInput.text.tokens.size (replaces chars÷4)
- Add seed parameter for deterministic generation (MLXRandom.seed)
- Add stream_options.include_usage for streaming token stats
- Add CORS support via --cors CLI flag with CORSMiddleware
- Extract handler closures into standalone functions (Swift type-checker fix)
- Add ServerConfig struct for CLI defaults bundling
- Expand test suite: 6 → 13 test sections (32 assertions total)

All 32 tests pass.
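The stop-sequence handling above (stop parameter plus text trimming) boils down to truncating generated text at the earliest occurrence of any stop string. A minimal sketch, in C++ rather than the server's Swift, with an illustrative helper name (`applyStops` is not the actual implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Truncate generated text at the earliest occurrence of any stop sequence.
// Hypothetical helper name; the real logic lives in the Swift server.
std::string applyStops(const std::string& text,
                       const std::vector<std::string>& stops) {
    size_t cut = text.size();
    for (const auto& s : stops) {
        size_t pos = text.find(s);
        if (pos != std::string::npos && pos < cut) cut = pos;  // earliest wins
    }
    return text.substr(0, cut);
}
```

Everything from the first matching stop string onward is dropped, so the stop token itself is never returned to the client.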
…ra sampling params

- Add response_format: { type: 'json_object' } with prompt injection + fence stripping
- Add --vision CLI flag for VLM model loading via VLMModelFactory
- Parse OpenAI multipart content (string or [{type:'text',...},{type:'image_url',...}])
- Decode base64 data URIs and HTTP URLs into UserInput.Image for VLM inference
- Accept top_k, frequency_penalty, presence_penalty (API compat)
- Add MLXVLM package dependency
- Add 4 new regression tests (Tests 14-17), total: 38 assertions

All 38 tests pass.
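The "fence stripping" step for `response_format: { type: 'json_object' }` removes markdown code fences that models often wrap around JSON output. A hedged sketch of one way to do it (illustrative C++, not the server's Swift code):

```cpp
#include <cassert>
#include <string>

// Strip a leading ```json (or bare ```) line and a trailing ``` line,
// so the client receives raw JSON. Sketch only; names are illustrative.
std::string stripFences(std::string s) {
    if (s.rfind("```", 0) == 0) {              // leading fence line
        size_t nl = s.find('\n');
        s = (nl == std::string::npos) ? "" : s.substr(nl + 1);
    }
    size_t tail = s.rfind("```");              // trailing fence
    if (tail != std::string::npos &&
        s.find_first_not_of("`\n ", tail) == std::string::npos)
        s = s.substr(0, tail);
    return s;
}
```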
…utdown, stats

- Add --mem-limit CLI flag (sets Memory.memoryLimit + Memory.cacheLimit)
- Add ServerStats actor tracking requests, tokens, generation timing
- Enhanced /health endpoint with GPU memory (active/peak/cache/total), architecture, request/token stats
- Add /metrics Prometheus-compatible endpoint (8 metrics with TYPE/HELP)
- Add SIGTERM/SIGINT graceful shutdown handlers
- Wire stats tracking into all 6 handler functions
- Add 3 new regression tests (Tests 18-20), total: 49 assertions

All 49 tests pass.
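For context, Prometheus exposition format pairs each metric with `# HELP` and `# TYPE` comment lines. A hypothetical excerpt of what a payload like the one above could look like (the metric names here are illustrative, not the server's actual eight):

```text
# HELP requests_total Total HTTP requests served
# TYPE requests_total counter
requests_total 42
# HELP tokens_generated_total Total tokens generated
# TYPE tokens_generated_total counter
tokens_generated_total 18231
```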
- Add --api-key CLI option for bearer token authentication
- ApiKeyMiddleware validates Authorization: Bearer <key> header
- Health and metrics endpoints exempt from auth (monitoring tools)
- Returns 401 with OpenAI-style error JSON for invalid/missing keys
- Config line shows auth=enabled/disabled
- Add Test 21: 5 auth assertions (unauthenticated, wrong key, valid key, health exempt, metrics exempt)

All 54 tests pass.
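The middleware's decision reduces to one predicate: monitoring paths pass through, everything else needs an exact `Bearer <key>` match. A sketch under those assumptions (illustrative C++; the real check is the Swift `ApiKeyMiddleware`):

```cpp
#include <cassert>
#include <string>

// true if the request may proceed; /health and /metrics are exempt so
// monitoring tools work without credentials. Names are illustrative.
bool isAuthorized(const std::string& path,
                  const std::string& authHeader,
                  const std::string& apiKey) {
    if (path == "/health" || path == "/metrics") return true;
    const std::string prefix = "Bearer ";
    if (authHeader.rfind(prefix, 0) != 0) return false;  // missing/malformed
    return authHeader.substr(prefix.size()) == apiKey;   // exact key match
}
```

A failed check maps to the 401 with OpenAI-style error JSON described above.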
- Add PromptCache actor (saves/restores KV cache state per-layer)
- Cache keyed by system prompt text hash
- On cache hit: restore KV state, skip cached prefix tokens, process only new tokens
- On cache miss: generate normally, save system prompt KV state asynchronously
- Health/metrics endpoints exempt from cache
- Uses container.perform() for direct model access with cache-aware generation

All 54 tests pass.
The Metal shader library is required at runtime by MLX Swift.
Install via: python3 -m venv + pip install mlx + copy metallib.
Also trigger CI on feature/* branches.
Every 8 tokens, insert a 50μs Task.sleep to yield the GPU.
This prevents heavy inference from freezing the macOS UI
(WindowServer). Applied to all 4 generation loops:
- Chat streaming
- Chat non-streaming
- Text streaming
- Text non-streaming
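The yield pattern above is simple to state: every 8th token, sleep ~50 µs so the GPU can service WindowServer. A minimal C++ sketch of the loop shape (the real code is the Swift generation loop using `Task.sleep`):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Emit tokens, yielding the GPU every 8th token via a short sleep.
// Returns the number of yields, purely for illustration.
int generateWithYields(int totalTokens) {
    int yields = 0;
    for (int i = 1; i <= totalTokens; ++i) {
        // ... emit token i ...
        if (i % 8 == 0) {
            std::this_thread::sleep_for(std::chrono::microseconds(50));
            ++yields;
        }
    }
    return yields;
}
```

At typical generation speeds the added latency is negligible (a few milliseconds per thousand tokens) while the UI stays responsive.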
- New ModelProfiler.swift: reads config.json, measures weight files
  (follows HF Hub symlinks), computes memory requirements (weights +
  KV cache + 20% overhead), and outputs a PartitionPlan with strategy
  (fullGPU/swapAssisted/layerPartitioned/tooLarge)

- New --info flag: dry-run profiler prints formatted memory analysis
  report and exits without loading the model

- New --gpu-layers option: accepts 'auto' or integer, ready for future
  GPU/CPU layer splitting (Phase 2)

- Pre-load profiling: automatically detects overcommit ratio and sets
  MLX cache limits (2MB cache for swap-assisted mode to let OS manage
  page caching, inspired by Flash-MoE research)

- Enhanced /health endpoint: includes partition data (strategy,
  overcommit_ratio, weight/kv/total GB, GPU layers, estimated tok/s)

- Ready event JSON: includes partition data for downstream integration

- Rename main.swift -> Server.swift (required by Swift compiler when
  adding second source file with @main attribute)
Phase 2 integration — connects the mlx-swift-lm fork's new
LayerPartitionable protocol to mlx-server's CLI and profiler:

- --gpu-layers N: explicitly set N layers on GPU, rest on CPU
- --gpu-layers auto: use partition plan recommendation
- Auto-partition: when model exceeds available RAM (overcommit > 1.0),
  automatically applies the recommended GPU layer count

- PartitionPlan: added mutable gpuLayers field (updated after actual
  partitioning) and cpu_layers in /health response

- Fixed .chunk API change in latest fork (now returns tokenId tuple)

- Updated Package.swift comment to note partitioning support
- Package.resolved was accidentally deleted when switching to local
  path for fork development. Regenerated with resolved fork commit.

- e2e-test.yml: added 3-attempt retry loop for transient HuggingFace
  API failures (HTTP 500). Set HF_HUB_DOWNLOAD_TIMEOUT=120.

- build.yml: added feature/* branches to trigger CI builds.
FFTW-style auto-tuning that profiles optimal cache limits per
model × hardware combination. First run benchmarks 4 configurations
(tight → unlimited), measures tok/s, and persists the winner to
~/.mlx-server/wisdom/<key>.json. Subsequent runs load instantly.

New CLI flags:
  --calibrate    Force re-calibration even if wisdom exists

Integration with startup flow:
  1. Check for existing wisdom → apply instantly
  2. Or run calibration trials → store + apply
  3. --mem-limit always overrides wisdom

Calibrator.swift: ~290 lines, zero new dependencies.
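The selection step at the heart of the calibration flow is: run the candidate configurations, keep the one with the best tok/s, persist it. A sketch of that step (illustrative C++; `Trial`/`pickWisdom` are not the actual Calibrator.swift names):

```cpp
#include <cassert>
#include <vector>

struct Trial { long cacheLimitMB; double tokPerSec; };

// Pick the benchmarked configuration with the highest throughput.
// The caller would serialize the winner to ~/.mlx-server/wisdom/<key>.json.
Trial pickWisdom(const std::vector<Trial>& trials) {
    Trial best = trials.front();
    for (const auto& t : trials)
        if (t.tokPerSec > best.tokPerSec) best = t;
    return best;
}
```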
- tests/test_turbo_quant.cpp: 9 standalone C++ tests for turbo_quant.h algorithm
  - T1: Lloyd-Max centroids match turboquant_plus Python reference (tol 1e-5)
  - T2: WHT sign arrays match turbo-wht.h (seed=42 and seed=1042)
  - T3: FWHT is self-inverse (MSE < 1e-10)
  - T4: Forward∘Inverse rotation == identity (MSE < 1e-8)
  - T5: WHT rotation is norm-preserving (unitary, δ < 1e-4)
  - T6: 3-bit pack/unpack round-trip (0 mismatches over all 8 index values)
  - T7: V-cache SNR = 14.6 dB over 200 random d=128 Gaussian vectors
  - T8: K-cache inner-product SNR = 13.7 dB over 100 random key/query pairs
  - T9: fp16 conversion round-trip < 0.3% relative error
  All 9/9 tests pass with clang++ -std=c++17 -O2.

- .github/workflows/build.yml: run TurboQuant tests after swift build
- .github/workflows/e2e-test.yml: run TurboQuant tests as fast pre-flight
  before expensive model download (fail early if compression math is broken)
…heartbeat)

- ThinkingStateTracker: streaming state machine that splits <think>…</think>
  tokens into delta.reasoning_content vs delta.content (llama-server compatible)
- extractThinkingBlock(): non-streaming extraction for handleChatNonStreaming
- enableThinking param wired into both streaming and non-streaming handlers
- Prefill heartbeat: BoolFlag actor + Task emitting ssePrefillChunk every 2s
  while prompt is being processed — prevents silent connections on long prefills
- sseChunk() refactored: delta string replaced with reasoningContent/content
  optional fields — cleaner separation of thinking vs response tokens
- ssePrefillChunk(): new SSE event type 'prefill_progress' for client-side UX
- AssistantMessage: added reasoningContent field with CodingKey 'reasoning_content'
- handleChatNonStreaming: applies extractThinkingBlock + JSON stripping on
  responseContent (not raw fullText) so thinking tokens are never returned as content
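The non-streaming extraction step can be sketched as follows: find the `<think>…</think>` span, return the inside as reasoning and the rest as content. This is an illustrative C++ stand-in for the Swift `extractThinkingBlock()`, handling only the single well-formed-block case:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split full text into (reasoning, content). If no complete thinking
// block is present, everything is content. Sketch only.
std::pair<std::string, std::string> extractThinking(const std::string& text) {
    const std::string open = "<think>", close = "</think>";
    size_t a = text.find(open);
    size_t b = text.find(close);
    if (a == std::string::npos || b == std::string::npos || b < a)
        return {"", text};
    std::string reasoning = text.substr(a + open.size(), b - a - open.size());
    std::string content = text.substr(0, a) + text.substr(b + close.size());
    return {reasoning, content};
}
```

The streaming `ThinkingStateTracker` does the same split incrementally, which is harder because a tag can arrive fragmented across token boundaries.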
mlx::core::fast::turbo_encode() does not exist in the upstream MLX
library yet. Replace the broken call with a std::runtime_error stub
that compiles cleanly. The real implementation will be wired in once
the TurboQuant C++ core is available upstream.
Adds the missing mlx::core::fast::turbo_encode_k() and turbo_encode_v()
functions that the C API stub was placeholding.

Algorithm (from turbo_quant.h):
  turbo_encode_k: 3-bit PolarQuant (WHT rotation + Lloyd-Max centroids)
                  + 1-bit QJL residual — K cache, 68 bytes/token
  turbo_encode_v: 3-bit PolarQuant only — V cache, 50 bytes/token

Buffer layout per token:
  K: indices[48] | qjl_signs[16] | norm_fp16[2] | rnorm_fp16[2]
  V: indices[48] | norm_fp16[2]

Layout matches the Metal decompression path in sdpa_vector.h which
already implements turbo_dequant_k/v for on-the-fly decode during SDPA.

The encode path is CPU-side (eval + iterate), which is appropriate since
compression runs once per appended KV token, not in the hot forward pass.

Files changed:
  fast.h     — declare turbo_encode_k/v in namespace mlx::core::fast
  fast.cpp   — implement using turbo_quant.h primitives
  mlx-c fast.cpp — replace runtime_error stub with real call
Phase 2: Server.swift integration of TurboQuant KV-cache compression.

CLI:
  --turbo-kv   Enable 3-bit PolarQuant+QJL KV compression on all
               KVCacheSimple layers. Compresses history > 8192 tokens
               to ~3.5 bits/token — recommended for 100k+ context.
               Default: disabled (zero overhead when off).

KVCache.swift (submodule):
  KVCacheSimple.turboQuantEnabled: Bool = false
    Now settable at runtime so Server.swift can activate per-request.

Server.swift:
  - @Flag --turbo-kv added to CLI
  - turboKV stored in ServerConfig
  - Startup log shows turbo_kv=enabled/disabled
  - Sets .turboQuantEnabled = true on each KVCacheSimple before prefill
- Rename Sources/mlx-server/ → Sources/SwiftLM/
- Update Package.swift: package name, target name, source path
- Update all [mlx-server] log prefixes to [SwiftLM]
- Update ~/.mlx-server/wisdom/ path to ~/.swiftlm/wisdom/
- Update CLI commandName to SwiftLM
- Update GitHub Actions workflows: binary path, tarball names, release titles
- Update all documentation files
…name

The CI cache was built when the repo was named 'mlx-server'. After renaming
to 'SwiftLM', clang embedded the old path in the PCH, causing:

  error: PCH was compiled with module cache path '.../mlx-server/.build/...'
  but the path is currently '.../SwiftLM/.build/...'

Fixes:
- build.yml, e2e-test.yml: scope cache key to 'spm-SwiftLM-' so the old
  'spm-' prefixed cache is never restored as a partial match
- Add 'Clear stale module cache' step (rm ModuleCache/) before swift build
  to eliminate any stale PCH artifacts that sneak in via partial cache hits
- tests/test-server.sh: update comment and default binary path to SwiftLM
Three bugs fixed in the SSD streaming throughput logger:

1. STREAMING CORRUPTION — The [⚡️ SSD Stream] log line was printed to
   std::cout (stdout), the same fd as the token stream. Since Swift
   token output uses print(text, terminator: "") with no newline, the
   metric lines interleaved mid-token, corrupting the SSE response body
   observed by clients and the Electron log display.
   Fix: switch to std::cerr (stderr). Electron routes stderr separately
   as [mlx-server:err] and never forwards it as SSE content.

2. LOG FLOODING — The 1-second throttle emitted ~60 lines/minute. With
   a 4 t/s MoE model this produces a metric line for almost every token.
   Fix: throttle to 10 seconds (10'000'000'000 ns).

3. WRONG UNIT — The metric was labelled 'MB/s' but the value computed
   was total MB read in the window (not divided by elapsed seconds).
   Fix: divide bytes by elapsed_s to get true NVMe throughput in MB/s.

New format (stderr, every 10 s):
  [⚡️ SSD Stream] 3456 MB/s | 24070 chunks | avg 0.003 ms/chunk
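The unit fix in bug 3, in isolation: divide the bytes read during the log window by the elapsed seconds, rather than reporting raw megabytes. A one-function sketch (illustrative name, not the exact code in `moe_stream_op.cpp`):

```cpp
#include <cassert>
#include <cmath>

// True window throughput in MB/s: bytes read, divided by MiB, divided
// by elapsed wall-clock seconds. Guards against a zero-length window.
double windowThroughputMBs(unsigned long long bytesInWindow, double elapsedS) {
    if (elapsedS <= 0.0) return 0.0;
    return (bytesInWindow / (1024.0 * 1024.0)) / elapsedS;
}
```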
Stack:
  C++ (moe_stream_op.cpp): Add 4 lifetime atomics (bytes, ns, chunks,
    window_throughput_mbs). Accumulated per expert-chunk load,
    never reset. After each 10 s log window, write throughput_mbs.
    Implement extern C mlx_ssd_metrics_snapshot() with full struct
    definition in the .cpp TU.

  C ABI (fast.h + include/mlx/c/fast.h): Declare MlxSSDMetricsSnapshot
    typedef + mlx_ssd_metrics_snapshot() in both the mlx-c copy and the
    Swift-visible umbrella header. moe_stream_op.h keeps only a forward
    declaration + extern C bridge (avoids redefinition at link time).

  Swift (MLXFast.swift): New MLXFast.SSDMetricsSnapshot struct +
    MLXFast.ssdMetricsSnapshot() calling through to the C function.

  Server.swift: /metrics emits 4 new Prometheus gauges/counters when
    the server is started with --stream-experts:
      swiftlm_ssd_throughput_mbps   (gauge, 10 s rolling average)
      swiftlm_ssd_bytes_read_total  (counter, lifetime)
      swiftlm_ssd_chunks_total      (counter, lifetime)
      swiftlm_ssd_chunk_latency_ms  (gauge, lifetime average)

Example when SSD streaming is active:
  $ curl http://127.0.0.1:8080/metrics | grep ssd
  swiftlm_ssd_throughput_mbps 3456.0
  swiftlm_ssd_bytes_read_total 1234567890
  swiftlm_ssd_chunks_total 82340
  swiftlm_ssd_chunk_latency_ms 0.0028
- Updates the TurboQuantization section in README to explain the fusion of V2 speed and V3 quality algorithms
- Adds 'docs/turboquant_hybrid_architecture.md' with deep-dive technical analysis of the Lloyd-Max + QJL Metal integration
- Expanded the ThinkingStateTracker to match both `<think>` and `<thinking>` open/close tags
- Fixes an issue where Qwen models would leak 'ing>' or other characters into the SSE stream because the tracker strictly looked for the 8-character DeepSeek '</think>' tag
- Added support for the top-level 'enable_thinking' parameter in ChatCompletionRequest
- Ensures compatibility with Aegis-AI's gateway which passes 'enable_thinking' directly in the root JSON instead of nested in 'chat_template_kwargs'
…rbo-kv fixes

- Streaming path: log content as null (not raw fullText) when tool_calls are present
- mlx-swift-lm: TurboKV now compresses from token 1 (not 512)
- mlx-swift-lm: head_dim guard prevents fatal crash on Qwen 122B (dim=256 != 128)
Qwen3.5-122B has head_dim=256. The C++ encoder now processes D=256
as two consecutive D=128 sub-groups using the existing TurboQuantK/V
structs. Record sizes double: K=136 B, V=100 B per token for D=256.
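The record-size arithmetic follows directly from the buffer layout given earlier: a D=128 K record is indices[48] + qjl_signs[16] + norm_fp16[2] + rnorm_fp16[2] = 68 bytes and a V record is 48 + 2 = 50 bytes, so treating D=256 as two D=128 sub-groups doubles both. As a sketch (hypothetical helper names):

```cpp
#include <cassert>

// Per-token compressed record sizes, derived from the D=128 layout.
// headDim must be a multiple of 128 (the sub-group width).
int kRecordBytes(int headDim) { return (headDim / 128) * (48 + 16 + 2 + 2); }
int vRecordBytes(int headDim) { return (headDim / 128) * (48 + 2); }
```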
- Add MLXInferenceCore shared Swift library target
  - InferenceEngine actor with AsyncStream<GenerationToken> API
  - ChatMessage, GenerationConfig, ModelCatalog models
  - Device-aware model recommendations based on physical RAM

- Add SwiftLMChat/ Xcode project (iOS 17+ / macOS 14+)
  - SwiftLMChatApp.swift: entry point with macOS menu commands
  - RootView: NavigationSplitView (macOS) / NavigationStack (iOS)
  - ChatView: streaming message display, input bar, stop button
  - MessageBubble: custom bubble shape, typing indicator, thinking disclosure
  - ModelPickerView: device-aware model list with RAM fit badges
  - SettingsView: temperature, max tokens, top-p, thinking mode
  - ChatViewModel: @observable bridges InferenceEngine to SwiftUI
  - generate_xcodeproj.py: stdlib-only project generator

- Update Package.swift: add iOS 17+ platform, MLXInferenceCore target
…ntion

Implements decode path: packed uint8 compressed KV history → float32
for concatenation with hot window before passing to standard SDPA.
Supports D=128 (68B/50B records) and D=256 (136B/100B records).
Full TurboKV pipeline now operational:
  - Encode history → 3-bit PolarQuant (existing)
  - Truncate hot cache to hot window (new)
  - Decode + prepend history at every SDPA call (new)

RAM savings: ~5.5x on history tokens (3-bit vs 16-bit fp16)
Replace hash-based system-prompt cache with longest-common-prefix scan:
- Store full token sequence alongside KV state
- On each request: scan token-by-token to find longest shared prefix
- Restore KV state, trim excess via layer.trim() for partial matches
- Save full prompt after every request (not just system-prompt)

Benefits:
  System prompt matched exactly, no token-count approximation bug
  Conversation history reuse (any shared prefix, not just system prompt)
  Partial prefix matches (e.g. same system + first N turns) also benefit
  Works correctly with TurboKV (state getter now returns full fp16 context)
@solderzzc solderzzc merged commit bf9e87e into main Mar 31, 2026
4 checks passed
@solderzzc solderzzc deleted the feature/api-parity-roadmap branch March 31, 2026 06:28