Releases: SharpAI/SwiftLM
SwiftLM b197
SwiftLM b197-7f62ac9
fix(deps): use remote URL dependencies for mlx-swift and mlx-swift-lm
Changelog
- fix(deps): use remote URL dependencies for mlx-swift and mlx-swift-lm (7f62ac9)
Download
Quick Start
Please refer to the Getting Started section in the README for full installation and usage instructions.
Note:
mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
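The note above can be made concrete with a short launch sketch. The archive name below is an assumption based on the naming pattern of earlier releases on this page; the snippet simulates an extracted directory and shows the adjacency check you should perform before launching.

```shell
# Sketch of the post-extract layout check. The real extraction step would be:
#   tar -xzf SwiftLM-b197-macos-arm64.tar.gz   # archive name assumed from earlier releases
# Here we simulate an extracted directory so the check itself is runnable anywhere:
dir=$(mktemp -d)
touch "$dir/SwiftLM" "$dir/mlx.metallib"
chmod +x "$dir/SwiftLM"

# Refuse to launch unless mlx.metallib sits next to the SwiftLM binary.
if [ -f "$dir/mlx.metallib" ]; then
  echo "metallib OK — safe to run ./SwiftLM"
else
  echo "mlx.metallib missing — Metal GPU compute would fail" >&2
fi
```

In a real install, run the binary from the extracted directory so the relative lookup of mlx.metallib succeeds.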
SwiftLM b196
SwiftLM b196-da535ea
chore: isolate HomeSec benchmark output to local tmp directory
Changelog
- chore: isolate HomeSec benchmark output to local tmp directory (da535ea)
- fix(benchmark): remove trailing /v1 from gateway URL for HomeSec option (c508b04)
- chore: sync and lock mlx-swift-lm to latest main (73429e1)
- feat: integrate HomeSec Benchmark as option 3 (LLM only) (5da1648)
- feat: add Delete ALL Models option mapping to huggingface hub cache (c75b50a)
- feat: add 8-bit model variants and model maintenance option to benchmark menu (36e7d49)
- chore: bump mlx-swift-lm for Gemma 4 mixed-precision shape fix (bbb137a)
- chore(release): finalize 0.2.9 release candidate with Gemma 4 8-bit verification and updated benchmark scripts (4a73a5f)
Download
Quick Start
Please refer to the Getting Started section in the README for full installation and usage instructions.
Note:
mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b188
SwiftLM b188-cf0319f
ci: add C/C++/Metal extensions and lock file to release trigger paths
Changelog
- ci: add C/C++/Metal extensions and lock file to release trigger paths (cf0319f)
- Rename SwiftLM Chat to SwiftBuddy in README (Resolves #13) (5ad7716)
- Make SwiftLM macOS screencast GIF clickable to high-res YouTube video (8d67925)
- Stack iOS app demo beneath macOS demo in README header (826f905)
- Promote iOS app section higher in README below Features (e336438)
- Fix ugly README layout by moving mobile GIF to iOS section and prioritizing wide Mac demo (bf9d1bd)
- Add macOS inference demo GIF to README (8b7d407)
- Remove redundant GPU metallib warning from README (7170997)
- Update release CI to use build.sh and package mlx.metallib instead of default.metallib (99e1679)
- Move Quick Start (Getting Started) setup instructions to top of README (734b938)
- Update README to replace disjointed scripts with unified run_benchmark.sh documentation (8eec8e1)
- Refactor run_benchmark.sh to apply model picker to both benchmark suites (5f27226)
- Consolidate both benchmark suites into run_benchmark.sh interactive menu (e1a50cf)
- Add killall SwiftLM to end of bash test loop (6af47c3)
- Restructure benchmarks section with Test 1 and Test 2 headers (5d73e0e)
- Update README with correct binary path and clarify sliding window test (ddc8a75)
- Update README for new build workflow and change Qwen2.5 to 3.5 in benchmark menu (e978096)
- Add rich ANSI console visualization after benchmark completes (c199866)
- Remove huggingface_hub dependency — SwiftLM downloads models natively via HubApi (9f643a8)
- Build mlx.metallib from source via cmake instead of tracking pre-built binary (391cb43)
- Fix build.sh: use tracked pre-built metallib instead of dynamic find (eab1e47)
- Rename SwiftLM Chat to SwiftBuddy in README (553f637)
- Force-add the version-matched default.metallib binary so it is available upon clone (fab10ac)
- Add interactive benchmark launch script with menu (af60626)
- Support automatic HuggingFace downloading to ./models via profile_runner.py (54f7121)
- Update build.sh to dynamically find default.metallib (44a0baa)
- Fix Liquid syntax errors, add build.sh, create tmp directory in profile_runner (045120e)
Download
Quick Start
tar -xzf SwiftLM-b188-macos-arm64.tar.gz
# mlx.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b160
SwiftLM b160-1233435
feat: complete extreme context profiling & fix prompt cache for TurboQuant
- Fix: Prevent prompt cache from decoding TurboQuant compressed polar buffers back to fp16, saving ~19GB GPU allocation at 100K context.
- Feat: Add GPU allocation tracking via ioreg to capture true memory demand including swap memory.
- Docs: Update README with benchmark summary and multi-device profiling results structure.
- Add run-benchmark workflow skill.
- Add total memory and active GPU memory monitoring in MemoryUtils.
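The ioreg-based GPU allocation tracking described above can be approximated by shelling out to ioreg and scraping an allocation counter from the accelerator entry. A minimal sketch follows; the statistics key name ("In use system memory") is an assumption, since PerformanceStatistics keys vary by macOS version and GPU family, and this is an illustration rather than the project's actual implementation.

```python
import re
import subprocess


def parse_gpu_bytes(ioreg_text: str, key: str = "In use system memory") -> int:
    """Extract a byte count for `key` from ioreg-style output.

    The key name is an assumption; real PerformanceStatistics keys vary
    by macOS version and GPU family. Returns 0 if the key is absent.
    """
    m = re.search(rf'"{re.escape(key)}"\s*=\s*(\d+)', ioreg_text)
    return int(m.group(1)) if m else 0


def sample_gpu_bytes() -> int:
    """Poll ioreg for accelerator statistics (macOS only)."""
    out = subprocess.run(
        ["ioreg", "-r", "-d", "1", "-c", "IOAccelerator"],
        capture_output=True, text=True,
    ).stdout
    return parse_gpu_bytes(out)
```

Polling a counter like this at prefill boundaries is one way to capture demand that exceeds Metal's reported working set, including swap-backed allocations.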
Changelog
- feat: complete extreme context profiling & fix prompt cache for TurboQuant (1233435)
- feat: extend profile_runner.py parameterization to test extreme contexts - Add --contexts flag to seamlessly loop through scale factors - Refactor script to output extended markdown matrix encompassing context depths - Enables sequential TTFT scaling tests up to 100k prompts (170501a)
- feat: persist Aegis-AI Physical Model Profiler and backend physical memory logger - Injects C++ 'mach_task_basic_info' logging to parse real Apple Silicon wire memory limit - Extracts 'OS_RAM' string output at prefill boundaries - Integrates interactive --model parameter into profiling script matrix for ease-of-use. (2cd373f)
Download
Quick Start
tar -xzf SwiftLM-b160-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b156
SwiftLM b156-39ecadd
chore: move debugging scripts to dedicated folder
Changelog
- chore: move debugging scripts to dedicated folder (39ecadd)
- test: update harness runner loopback bindings and benchmark report (e593137)
- fix: stabilize Gemma4 MoE inference — dynamic attention mask slicing (32ce0e2)
Download
Quick Start
tar -xzf SwiftLM-b156-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b153
SwiftLM b153-c59f6a1
Merge pull request #11 from SharpAI/feature/gemma-4-inference-ahan-moment
Feature/gemma 4 inference ahan moment
Changelog
- Bump submodule. (e2ca7cc)
- Bump submodule to enable Gemma 4 SSD streaming (dee5c00)
- Bump mlx-swift-lm to fix MediaProcessing Swift 6 Strict Concurrency error (f59d527)
- Support OpenAI's developer role (3d53a89)
- Update InferenceEngine to load TransformersTokenizer from HubDownloader and update submodule reference (6e94301)
- Map ChatCompletionRequest tool_calls natively into Chat.Message to retain contextual history (63a5cf8)
- Update mlx-swift-lm submodule: Gemma4 tool parser + weight mapping (2955afa)
- Update mlx-swift-lm submodule: RotatingKVCache mask fix (ee39ad2)
- Fix JSON mode system prompt injection template exception (0639d5a)
- Add Gemma4 native tool call parser (64d8a3c)
- Fix Gemma 4 sliding window rotating KV cache regression and weight mapping (03025a4)
- Fix SwiftLM inference server cache alignment, sliding window sigtrap, and prompt cache save race condition (a2b70dc)
- feat: Sync submodule — TurboKV 512-dim virtual head splitting (67562be)
- fix: Prevent crash on full prompt cache hit (100% match) (eac5ab3)
- feat: Stabilize Gemma-4 backend inference and sync submodules (32dd183)
Download
Quick Start
tar -xzf SwiftLM-b153-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b137
SwiftLM b137-5f51468
Export MLXInferenceCore in Package.swift
Changelog
- Export MLXInferenceCore in Package.swift (5f51468)
- chore: Update Gemma 4 benchmark metrics and add comprehensive testing suite (5a05548)
- feat: rename SwiftLMChat → SwiftBuddy, add design doc (d770bce)
- docs: remove duplicate GIF embed, keep single intro line for iOS 13 Pro 6GB (049bce7)
- docs: fix iOS demo GIF path to existing docs/demo.gif (f821ce0)
- docs: add iPhone 13 Pro 6GB live demo GIF intro to iOS section (cf19434)
- feat(chat): unified iOS + macOS premium UI overhaul (66fe453)
Download
Quick Start
tar -xzf SwiftLM-b137-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b129
SwiftLM b129-afb677c
fix(ci): compile default.metallib from .metal sources instead of searching for binary
The .metal shader sources are tracked in git but default.metallib is
gitignored (*.metallib rule). Previous approach searched for a pre-built
binary that CI never has. Now compiles fresh from the 39 tracked .metal
source files using xcrun metal + metallib — guaranteed version-matched
to the Swift binary by construction since it uses the same source files.
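The two-step compile described above (xcrun metal producing .air object files, then xcrun metallib linking them) can be sketched as a small command builder. This is an illustration of the toolchain invocation, not the release workflow's actual script; source paths are hypothetical, and the function only constructs the commands rather than running them.

```python
from pathlib import Path


def metallib_build_commands(src_dir: str, out: str = "default.metallib"):
    """Build the xcrun command lines that compile tracked .metal sources
    into a single metallib — version-matched to the binary by construction,
    since the same sources are used. Returns the commands without running them."""
    sources = sorted(Path(src_dir).glob("**/*.metal"))
    # Step 1: compile each .metal source to an intermediate .air file.
    compile_cmds = [
        ["xcrun", "-sdk", "macosx", "metal", "-c", str(s),
         "-o", str(s.with_suffix(".air"))]
        for s in sources
    ]
    # Step 2: link all .air files into one metallib.
    airs = [str(s.with_suffix(".air")) for s in sources]
    link_cmd = ["xcrun", "-sdk", "macosx", "metallib", *airs, "-o", out]
    return compile_cmds, link_cmd
```

On a macOS CI runner the returned commands could be executed with subprocess.run; elsewhere the builder is still useful for inspecting what would be run.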
Changelog
- fix(ci): compile default.metallib from .metal sources instead of searching for binary (afb677c)
- docs: warn against Python mlx-metal metallib version mismatch (33e1511)
- fix(release): correct metallib source — it ships in mlx-swift submodule, not built by swift build (e6556fc)
- fix(release): bundle default.metallib in release tarball (2d6b174)
- docs: add flash-moe reference to README and introduce benchmark test script (11e7078)
- chore: bump mlx-swift-lm submodule (iOS I/O fix, ExpertStreaming, Mistral4) (1922374)
- docs: add iOS demo GIF, iOS build instructions, and contributor Team ID note (c22abd0)
- feat(ios): iOS-first TabView UI + stable inference lifecycle (d31ad49)
- feat(swiftlmchat): HuggingFace live search + font/color fixes (ffe7b23)
- fix(mlx-swift): remove non-existent cuda.cpp from Package.swift exclude list (6a7449a)
- fix(swiftlmchat): full xcodebuild macOS compilation (4a46560)
- fix(inference): Swift 6 Sendable + deprecated API cleanup on main (8d08728)
- feat: iOS expert streaming via mmap page-cache for MoE models (541da29)
- feat(swiftlmchat): proactive iOS lifecycle — unload on background, reload on foreground (d454c0c)
- feat(swiftlmchat): platform-aware model management for iOS + macOS (dc4069f)
- fix(swiftlmchat): wire MLXInferenceCore sources + SPM packages into Xcode project (32b7483)
- feat: iOS expert streaming via mmap page-cache for MoE models (8671bc9)
- feat(swiftlmchat): proactive iOS lifecycle — unload on background, reload on foreground (7def72a)
- feat(swiftlmchat): platform-aware model management for iOS + macOS (c8bd3ba)
- fix(swiftlmchat): wire MLXInferenceCore sources + SPM packages into Xcode project (35e6172)
Download
Quick Start
tar -xzf SwiftLM-b129-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note:
default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
SwiftLM b107
SwiftLM b107-bf9e87e
Merge pull request #5 from SharpAI/feature/api-parity-roadmap
Feature/api parity roadmap
Changelog
- feat(prompt-cache): token-by-token prefix match (llama-server style) (01df003)
- docs(turbo-kv): add implementation status, hot-window design rationale, commit references (6ede853)
- fix(turbo-kv): drop token count from log, show ratio+MB saved (layer-agnostic) (3df7430)
- feat(turbo-kv): add mlx_turbo_kv_record C atomic + 10s log hook (dc6af72)
- feat(turbo-kv): stage 2 — activate compressed KV attention pipeline (28a00f9)
- feat(turbo-kv): add turbo_decode_k/v — batch dequantize for SDPA attention (e141627)
- feat: add SwiftLM Chat multiplatform app (iOS + macOS) (957f763)
- feat(turbo-kv): support head_dim=256 via two 128-dim sub-groups (48fd996)
- fix(server): null content in tool_calls log; bump mlx-swift-lm for turbo-kv fixes (4143d3b)
- chore(deps): bump mlx-swift-lm for SSD background telemetry fixes (c323412)
- fix(server): support top-level enable_thinking parameter (e444985)
- fix(server): support Qwen tags in state tracker (3f4cc09)
- docs: clarify TurboQuant hybrid architecture in README (6c6d62c)
- docs(engine): add TurboQuant C++ architecture notes (a83fa7d)
- feat(metrics): expose SSD Flash-Stream stats to /metrics endpoint (26d2319)
- fix(ssd-stream): route metrics to stderr, throttle to 10s, fix MB/s calc (60d538b)
- fix(ci): Fix stale PCH module cache error after mlx-server→SwiftLM rename (ce626c1)
- fix(metrics): rename Prometheus metrics from mlx_server_ to swiftlm_ prefix (60cc3e3)
- docs: add MIGRATION_NOTE.md for Aegis-AI mlx-server → SwiftLM rename (1f7087b)
- refactor: rename project from mlx-server to SwiftLM (480e349)
- feat(turboquant): wire --turbo-kv flag into server and KVCache (2ca5b02)
- feat(turboquant): implement turbo_encode_k/v CPU encode path (8286492)
- fix(mlx-c): stub out turbo_encode to fix CI build (70ac5e8)
- fix: Correct buffer range removal in ThinkingStateTracker (use ..<upperBound) (7dd655f)
- feat: Add thinking/reasoning support (ThinkingStateTracker + prefill heartbeat) (dfa9fba)
- test: Add TurboQuant unit tests to CI/CD pipeline (a042879)
- docs: remove llama.cpp VLM comparison table from README (511a59b)
- feat: Implement real TurboQuant KV cache compression (ported from llama-cpp-turboquant) (2bb7017)
- docs: Add TurboQuant KV cache algorithm description to README (ab35fd2)
- fix(server): Remove debug prompt_debug print from slot_launch log (7c20227)
- fix(ssd): Restore 5 GB/s throughput + correct output via tensor_name cache key fix (95edfdc)
- fix(ssd): Key expert offset cache by tensor_name not E — gate/up/down proj no longer collide (b814c88)
- feat(ssd): Wire mlx_fast_pread_into for high-throughput SSD weight streaming (1c1ded9)
- feat(ssd): Add mlx_fast_pread_into for direct NVMe reads into evaluated MLX buffers (3ea2afb)
- feat(ssd): Restore SSD stream metrics around prefault() call (2f655a1)
- fix(mlx-server): Restore correct output by using prefault+slice instead of raw SSD blob (5849c5a)
- fix(mlx-server): Restore SSD streaming throughput and mem-limit enforcement via C++ memory-aware loader (231c62c)
- fix(ssd): Rewrite streamed_gather_mm primitive to load directly into allocator::malloc enforcing memory limits (65e5497)
- feat: llama-server style logging + SSE CRLF fix (82bcc4b)
- feat: real-time token streaming to stdout + fflush (da21efe)
- feat: per-request chat_template_kwargs.enable_thinking support (8d3e15f)
- fix(build): capture streamExperts as local let before escaping health route closure (95a126d)
- fix(metrics): correct gpu_layers, strategy, and estimated_tok_s for SSD streaming mode (4dc61a6)
- fix: replace SWAP-ASSISTED warning with SSD STREAMING label when streamExperts active (40d65d0)
- feat: add thinking and ssd_stream to Config log line for observability (54a619f)
- feat: log full JSON response body matching llama-server log_server_r format (4c5e54b)
- feat: add llama-server style generation logging + API response format docs (d3da36e)
- fix(memory): use physical RAM budget for SSD streaming instead of Metal's capped working set size (df7d154)
- docs: add AEGIS_INTEGRATION.md with complete Aegis-AI sidecar setup guide (b19abdb)
- feat(mlx-swift): implement 1-second interval aggregated SSD read metrics for cleaner console output (b0b3b9b)
- build(ci): add .gitmodules mapping to fix mlx-swift-lm cloning failure in GitHub Actions (3c879db)
- docs: remove Aegis-AI integration block temporarily to prepare for new hero section (f27da83)
- docs: remove outdated Metal compile lockup warning as MoE streamed inference primitives resolve the delay (0c3288b)
- docs: add M5 to requirements and highlight pre-built binary usage (2de99c9)
- fix(Server): update mlx-swift-lm submodule to receive Evaluate.swift iteration mapping bugfix (5d2ea4b)
- fix: correct finish_reason=length and tool_calls test robustness (d40e9e4)
- test(e2e): extend test suite from 21 to 31 tests (7f3911e)
- ci: re-trigger workflow after mlx-swift-lm submodule push (e6a421a)
- fix(ci): add mlx-swift-lm git submodule and checkout submodules in CI - Create .gitmodules pointing to SharpAI/mlx-swift-lm.git - Add submodules: recursive to both build.yml and e2e-test.yml checkout steps - Remove mlx-swift-lm from .gitignore so git tracks the submodule pointer (1aaa13b)
- fix: enforce SIGKILL in e2e tests and expand HF timeout (9f70969)
- docs: update quick start and curl snippet to demonstrate 122B model deduplication JSON query (349fc53)
- docs: remove vLLM completely from matrix (5b31a29)
- docs: fix test hardware to M5 Pro 64GB (8b1d723)
- docs: revert incorrect Flash-MoE designation for mlx-server and restore vLLM column (b2ccae9)
- docs: remove vLLM column and correctly attribute Flash-MoE features (310b940)
- fix(ci): relocate standalone test scripts to scripts/ to prevent implicit SPM test target failure (c6078ad)
- docs: fix hardware specs and document 4-bit JSON quantization caveat (8082313)
- test: structure test scripts into tests directory and ignore artifacts (e0da633)
- docs: add Flash-MoE and vLLM to comparison table (91852f8)
- docs: recreate README with mlx-server comparisons and architecture details (ed2d2b0)
- feat(mlx-swift): expose MLXFast.streamedGatherMM and update c-api signature (9131ff7)
- feat(server): auto-wire safetensors resolution and stream environment path mapping (e29155c)
- feat(mlx-c): expose SSD streamed_gather_mm primitive to c-api (b5e6ade)
- feat(mlx): integrate core C++ unified memory SSD streaming primitives (86bcee8)
- feat: localize MLX frameworks to write C++ turboquant and ssd streamer scaffolds (eced528)
- fix(ux): add autonomous task-driven progress bar and restore MB counts (7d23eba)
- feat(ux): add robust caching bandwidth speedometer (441cb2b)
- fix(ux): progress bar fraction handling for non-byte counts (3472cba)
- feat: add download speed and progress bar UX (5255cd6)
- feat(moe): Expose --stream-experts flag to enable SSD inference streaming for large MoE models (7210980)
- feat: add auto-calibration 'Wisdom' system (3aae8e2)
- chore: update fork to f8f315b (20 model architectures with LayerPartitionable) (8dd340e)
- fix: restore Package.resolved and add CI retry for HF downloads (ebef9b0)
- feat: wire GPU/CPU layer partitioning to --gpu-layers flag (837ced0)
- feat: add memory-aware model partitioning framework (9c89cfc)
- feat: GPU yield — prevent Metal from starving macOS WindowServer (fd4a5e3)
- fix: CI — install mlx.metallib from Python mlx package (30c06d6)
- feat: Prompt caching — reduce TTFT by reusing system prompt KV state (20e1ce8)
- feat: API key authentication (--api-key flag) (9fe2175)
- feat: Phase 3 — Memory limit, /metrics, enhanced /health, graceful shutdown, stats (e4ebecb)
- feat: Phase 2 — JSON mode, VLM vision support, multipart content, extra sampling params (6589cbe)
- feat: Phase 1 API parity with mlx-lm (337ec6d)
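The "token-by-token prefix match (llama-server style)" prompt-cache entry in the changelog above can be illustrated with a short sketch: the server reuses cached KV state for the longest token prefix shared with the new prompt and only prefills the tail. The function below is an illustration of the idea, not the project's actual implementation.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the longest shared prefix between a cached prompt and a
    new one. Tokens up to this point can reuse cached KV state, so only
    the remaining tail needs to be prefilled, reducing TTFT."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Example: a shared system prompt (first 4 tokens) followed by a new user turn.
cached = [1, 2, 3, 4, 50, 51]
new = [1, 2, 3, 4, 60, 61]
assert reusable_prefix_len(cached, new) == 4  # only the last 2 tokens are prefilled
```

A full-hit edge case (100% match, fixed in a later entry) corresponds to the prefix length equaling the new prompt's length, which the caller must handle without attempting a zero-token prefill.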
Download
Quick Start
tar -xzf SwiftLM-b107-macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note: Requires mlx.metallib next to the binary for GPU compute. See README for setup.
mlx-server b21
mlx-server b21-2d382ef
Merge pull request #3 from SharpAI/feature/api-parity-roadmap
Feature/api parity roadmap
Changelog
- fix: CI — install mlx.metallib from Python mlx package (4086ce9)
- feat: Prompt caching — reduce TTFT by reusing system prompt KV state (6b34c97)
- feat: API key authentication (--api-key flag) (75e927d)
- feat: Phase 3 — Memory limit, /metrics, enhanced /health, graceful shutdown, stats (433d90e)
- feat: Phase 2 — JSON mode, VLM vision support, multipart content, extra sampling params (bfc980a)
- feat: Phase 1 API parity with mlx-lm (519bfda)
Download
Quick Start
tar -xzf mlx-server-b21-macos-arm64.tar.gz
./mlx-server --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
Note: Requires mlx.metallib next to the binary for GPU compute. See README for setup.