
Releases: SharpAI/SwiftLM

SwiftLM b197

06 Apr 21:48


SwiftLM b197-7f62ac9

fix(deps): use remote URL dependencies for mlx-swift and mlx-swift-lm

Changelog

  • fix(deps): use remote URL dependencies for mlx-swift and mlx-swift-lm (7f62ac9)

Download

Quick Start

Please refer to the Getting Started section in the README for full installation and usage instructions.

Note: mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
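
The colocation requirement above can be enforced with a small launch guard. This is a hedged sketch: it assumes the binary and mlx.metallib were both extracted into the current directory, and the commented launch line mirrors the Quick Start commands elsewhere in these notes.

```shell
# Refuse to launch when mlx.metallib is not next to the SwiftLM binary,
# since Metal GPU compute fails without it.
if [ -f "./mlx.metallib" ] && [ -x "./SwiftLM" ]; then
  STATUS="ready"
  echo "mlx.metallib found; safe to launch SwiftLM"
  # ./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
else
  STATUS="missing"
  echo "mlx.metallib or SwiftLM binary not found in $(pwd)" >&2
fi
```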

SwiftLM b196

06 Apr 17:38


SwiftLM b196-da535ea

chore: isolate HomeSec benchmark output to local tmp directory

Changelog

  • chore: isolate HomeSec benchmark output to local tmp directory (da535ea)
  • fix(benchmark): remove trailing /v1 from gateway URL for HomeSec option (c508b04)
  • chore: sync and lock mlx-swift-lm to latest main (73429e1)
  • feat: integrate HomeSec Benchmark as option 3 (LLM only) (5da1648)
  • feat: add Delete ALL Models option mapping to huggingface hub cache (c75b50a)
  • feat: add 8-bit model variants and model maintenance option to benchmark menu (36e7d49)
  • chore: bump mlx-swift-lm for Gemma 4 mixed-precision shape fix (bbb137a)
  • chore(release): finalize 0.2.9 release candidate with Gemma 4 8-bit verification and updated benchmark scripts (4a73a5f)

Download

Quick Start

Please refer to the Getting Started section in the README for full installation and usage instructions.

Note: mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b188

06 Apr 06:30


SwiftLM b188-cf0319f

ci: add C/C++/Metal extensions and lock file to release trigger paths

Changelog

  • ci: add C/C++/Metal extensions and lock file to release trigger paths (cf0319f)
  • Rename SwiftLM Chat to SwiftBuddy in README (Resolves #13) (5ad7716)
  • Make SwiftLM macOS screencast GIF clickable to high-res YouTube video (8d67925)
  • Stack iOS app demo beneath macOS demo in README header (826f905)
  • Promote iOS app section higher in README below Features (e336438)
  • Fix ugly README layout by moving mobile GIF to iOS section and prioritizing wide Mac demo (bf9d1bd)
  • Add macOS inference demo GIF to README (8b7d407)
  • Remove redundant GPU metallib warning from README (7170997)
  • Update release CI to use build.sh and package mlx.metallib instead of default.metallib (99e1679)
  • Move Quick Start (Getting Started) setup instructions to top of README (734b938)
  • Update README to replace disjointed scripts with unified run_benchmark.sh documentation (8eec8e1)
  • Refactor run_benchmark.sh to apply model picker to both benchmark suites (5f27226)
  • Consolidate both benchmark suites into run_benchmark.sh interactive menu (e1a50cf)
  • Add killall SwiftLM to end of bash test loop (6af47c3)
  • Restructure benchmarks section with Test 1 and Test 2 headers (5d73e0e)
  • Update README with correct binary path and clarify sliding window test (ddc8a75)
  • Update README for new build workflow and change Qwen2.5 to 3.5 in benchmark menu (e978096)
  • Add rich ANSI console visualization after benchmark completes (c199866)
  • Remove huggingface_hub dependency — SwiftLM downloads models natively via HubApi (9f643a8)
  • Build mlx.metallib from source via cmake instead of tracking pre-built binary (391cb43)
  • Fix build.sh: use tracked pre-built metallib instead of dynamic find (eab1e47)
  • Rename SwiftLM Chat to SwiftBuddy in README (553f637)
  • Force-add the version-matched default.metallib binary so it is available upon clone (fab10ac)
  • Add interactive benchmark launch script with menu (af60626)
  • Support automatic HuggingFace downloading to ./models via profile_runner.py (54f7121)
  • Update build.sh to dynamically find default.metallib (44a0baa)
  • Fix Liquid syntax errors, add build.sh, create tmp directory in profile_runner (045120e)

Download

Quick Start

tar -xzf SwiftLM-b188-macos-arm64.tar.gz
# mlx.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: mlx.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.
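
Once the server is running, it can be smoke-tested over HTTP. A hedged sketch: the /health endpoint and OpenAI-style /v1/chat/completions path are assumptions based on features the changelogs mention (enhanced /health, /metrics, API parity with mlx-lm) and may differ in a given build.

```shell
# Probe a locally running SwiftLM server; fall back gracefully when
# nothing is listening on the Quick Start port.
BASE="http://127.0.0.1:5413"
if curl -sf --max-time 2 "$BASE/health" >/dev/null 2>&1; then
  # OpenAI-style chat completion request (path assumed, not confirmed)
  curl -s "$BASE/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"mlx-community/Qwen2.5-3B-Instruct-4bit","messages":[{"role":"user","content":"Say hi"}]}'
else
  echo "no SwiftLM server reachable at $BASE"
fi
```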

SwiftLM b160

06 Apr 01:21


SwiftLM b160-1233435

feat: complete extreme context profiling & fix prompt cache for TurboQuant

  • Fix: Prevent prompt cache from decoding TurboQuant compressed polar buffers back to fp16, saving ~19GB GPU allocation at 100K context.
  • Feat: Add GPU allocation tracking via ioreg to capture true memory demand including swap memory.
  • Docs: Update README with benchmark summary and multi-device profiling results structure.
  • Add run-benchmark workflow skill.
  • Add total memory and active GPU memory monitoring in MemoryUtils.
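
The ioreg-based GPU allocation tracking mentioned above can be approximated from the shell. A hedged sketch, assuming the Apple Silicon AGXAccelerator service and its PerformanceStatistics keys, which vary across macOS versions; on non-macOS hosts the probe is skipped.

```shell
# Query IORegistry for the Apple GPU accelerator's performance stats,
# which include in-use GPU memory on Apple Silicon.
if command -v ioreg >/dev/null 2>&1; then
  GPU_STATS="$(ioreg -r -d 1 -c AGXAccelerator 2>/dev/null \
    | grep -o '"In use system memory"=[0-9]*' || true)"
  echo "${GPU_STATS:-GPU memory key not found}"
else
  GPU_STATS=""
  echo "ioreg unavailable (not macOS); skipping GPU allocation probe"
fi
```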

Changelog

  • feat: complete extreme context profiling & fix prompt cache for TurboQuant (1233435)
  • feat: extend profile_runner.py parameterization to test extreme contexts - Add --contexts flag to seamlessly loop through scale factors - Refactor script to output extended markdown matrix encompassing context depths - Enables sequential TTFT scaling tests up to 100k prompts (170501a)
  • feat: persist Aegis-AI Physical Model Profiler and backend physical memory logger - Injects C++ 'mach_task_basic_info' logging to parse real Apple Silicon wire memory limit - Extracts 'OS_RAM' string output at prefill boundaries - Integrates interactive --model parameter into profiling script matrix for ease-of-use. (2cd373f)

Download

Quick Start

tar -xzf SwiftLM-b160-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b156

05 Apr 19:10


SwiftLM b156-39ecadd

chore: move debugging scripts to dedicated folder

Changelog

  • chore: move debugging scripts to dedicated folder (39ecadd)
  • test: update harness runner loopback bindings and benchmark report (e593137)
  • fix: stabilize Gemma4 MoE inference — dynamic attention mask slicing (32ce0e2)

Download

Quick Start

tar -xzf SwiftLM-b156-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b153

04 Apr 23:35

SwiftLM b153-c59f6a1

Merge pull request #11 from SharpAI/feature/gemma-4-inference-ahan-moment

Feature/gemma 4 inference ahan moment

Changelog

  • Bump submodule. (e2ca7cc)
  • Bump submodule to enable Gemma 4 SSD streaming (dee5c00)
  • Bump mlx-swift-lm to fix MediaProcessing Swift 6 Strict Concurrency error (f59d527)
  • Support OpenAI's developer role (3d53a89)
  • Update InferenceEngine to load TransformersTokenizer from HubDownloader and update submodule reference (6e94301)
  • Map ChatCompletionRequest tool_calls natively into Chat.Message to retain contextual history (63a5cf8)
  • Update mlx-swift-lm submodule: Gemma4 tool parser + weight mapping (2955afa)
  • Update mlx-swift-lm submodule: RotatingKVCache mask fix (ee39ad2)
  • Fix JSON mode system prompt injection template exception (0639d5a)
  • Add Gemma4 native tool call parser (64d8a3c)
  • Fix Gemma 4 sliding window rotating KV cache regression and weight mapping (03025a4)
  • Fix SwiftLM inference server cache alignment, sliding window sigtrap, and prompt cache save race condition (a2b70dc)
  • feat: Sync submodule — TurboKV 512-dim virtual head splitting (67562be)
  • fix: Prevent crash on full prompt cache hit (100% match) (eac5ab3)
  • feat: Stabilize Gemma-4 backend inference and sync submodules (32dd183)

Download

Quick Start

tar -xzf SwiftLM-b153-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b137

03 Apr 15:07


SwiftLM b137-5f51468

Export MLXInferenceCore in Package.swift

Changelog

  • Export MLXInferenceCore in Package.swift (5f51468)
  • chore: Update Gemma 4 benchmark metrics and add comprehensive testing suite (5a05548)
  • feat: rename SwiftLMChat → SwiftBuddy, add design doc (d770bce)
  • docs: remove duplicate GIF embed, keep single intro line for iPhone 13 Pro 6GB (049bce7)
  • docs: fix iOS demo GIF path to existing docs/demo.gif (f821ce0)
  • docs: add iPhone 13 Pro 6GB live demo GIF intro to iOS section (cf19434)
  • feat(chat): unified iOS + macOS premium UI overhaul (66fe453)

Download

Quick Start

tar -xzf SwiftLM-b137-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b129

01 Apr 20:41


SwiftLM b129-afb677c

fix(ci): compile default.metallib from .metal sources instead of searching for binary

The .metal shader sources are tracked in git, but default.metallib is
gitignored (by the *.metallib rule). The previous approach searched for a
pre-built binary that CI never has. The workflow now compiles fresh from the
39 tracked .metal source files using xcrun metal + metallib, which guarantees
a version match with the Swift binary by construction, since both are built
from the same source tree.
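
The compile step described above can be sketched as a shell fragment. Assumptions are flagged: the Source/Metal path is hypothetical (the release does not say where the 39 .metal files live), while the two-stage pipeline (metal to .air, metallib to .metallib) is the standard Metal offline toolchain.

```shell
# Two-stage Metal offline build: compile each .metal source to AIR,
# then link all AIR objects into a single default.metallib.
SRC_DIR="Source/Metal"   # hypothetical location of the tracked sources
if command -v xcrun >/dev/null 2>&1 && ls "$SRC_DIR"/*.metal >/dev/null 2>&1; then
  mkdir -p build/air
  for src in "$SRC_DIR"/*.metal; do
    xcrun -sdk macosx metal -c "$src" -o "build/air/$(basename "${src%.metal}").air"
  done
  xcrun -sdk macosx metallib build/air/*.air -o default.metallib
  BUILD_RESULT="built default.metallib"
else
  BUILD_RESULT="skipped (no Metal toolchain or sources here)"
fi
echo "$BUILD_RESULT"
```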

Changelog

  • fix(ci): compile default.metallib from .metal sources instead of searching for binary (afb677c)
  • docs: warn against Python mlx-metal metallib version mismatch (33e1511)
  • fix(release): correct metallib source — it ships in mlx-swift submodule, not built by swift build (e6556fc)
  • fix(release): bundle default.metallib in release tarball (2d6b174)
  • docs: add flash-moe reference to README and introduce benchmark test script (11e7078)
  • chore: bump mlx-swift-lm submodule (iOS I/O fix, ExpertStreaming, Mistral4) (1922374)
  • docs: add iOS demo GIF, iOS build instructions, and contributor Team ID note (c22abd0)
  • feat(ios): iOS-first TabView UI + stable inference lifecycle (d31ad49)
  • feat(swiftlmchat): HuggingFace live search + font/color fixes (ffe7b23)
  • fix(mlx-swift): remove non-existent cuda.cpp from Package.swift exclude list (6a7449a)
  • fix(swiftlmchat): full xcodebuild macOS compilation (4a46560)
  • fix(inference): Swift 6 Sendable + deprecated API cleanup on main (8d08728)
  • feat: iOS expert streaming via mmap page-cache for MoE models (541da29)
  • feat(swiftlmchat): proactive iOS lifecycle — unload on background, reload on foreground (d454c0c)
  • feat(swiftlmchat): platform-aware model management for iOS + macOS (dc4069f)
  • fix(swiftlmchat): wire MLXInferenceCore sources + SPM packages into Xcode project (32b7483)
  • feat: iOS expert streaming via mmap page-cache for MoE models (8671bc9)
  • feat(swiftlmchat): proactive iOS lifecycle — unload on background, reload on foreground (7def72a)
  • feat(swiftlmchat): platform-aware model management for iOS + macOS (c8bd3ba)
  • fix(swiftlmchat): wire MLXInferenceCore sources + SPM packages into Xcode project (35e6172)

Download

Quick Start

tar -xzf SwiftLM-b129-macos-arm64.tar.gz
# default.metallib is included — run from the extracted directory
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: default.metallib is bundled in this archive. Keep it in the same directory as the SwiftLM binary — Metal GPU compute will fail if it is missing.

SwiftLM b107

31 Mar 06:35

SwiftLM b107-bf9e87e

Merge pull request #5 from SharpAI/feature/api-parity-roadmap

Feature/api parity roadmap

Changelog

  • feat(prompt-cache): token-by-token prefix match (llama-server style) (01df003)
  • docs(turbo-kv): add implementation status, hot-window design rationale, commit references (6ede853)
  • fix(turbo-kv): drop token count from log, show ratio+MB saved (layer-agnostic) (3df7430)
  • feat(turbo-kv): add mlx_turbo_kv_record C atomic + 10s log hook (dc6af72)
  • feat(turbo-kv): stage 2 — activate compressed KV attention pipeline (28a00f9)
  • feat(turbo-kv): add turbo_decode_k/v — batch dequantize for SDPA attention (e141627)
  • feat: add SwiftLM Chat multiplatform app (iOS + macOS) (957f763)
  • feat(turbo-kv): support head_dim=256 via two 128-dim sub-groups (48fd996)
  • fix(server): null content in tool_calls log; bump mlx-swift-lm for turbo-kv fixes (4143d3b)
  • chore(deps): bump mlx-swift-lm for SSD background telemetry fixes (c323412)
  • fix(server): support top-level enable_thinking parameter (e444985)
  • fix(server): support Qwen tags in state tracker (3f4cc09)
  • docs: clarify TurboQuant hybrid architecture in README (6c6d62c)
  • docs(engine): add TurboQuant C++ architecture notes (a83fa7d)
  • feat(metrics): expose SSD Flash-Stream stats to /metrics endpoint (26d2319)
  • fix(ssd-stream): route metrics to stderr, throttle to 10s, fix MB/s calc (60d538b)
  • fix(ci): Fix stale PCH module cache error after mlx-server→SwiftLM rename (ce626c1)
  • fix(metrics): rename Prometheus metrics from mlx_server_ to swiftlm_ prefix (60cc3e3)
  • docs: add MIGRATION_NOTE.md for Aegis-AI mlx-server → SwiftLM rename (1f7087b)
  • refactor: rename project from mlx-server to SwiftLM (480e349)
  • feat(turboquant): wire --turbo-kv flag into server and KVCache (2ca5b02)
  • feat(turboquant): implement turbo_encode_k/v CPU encode path (8286492)
  • fix(mlx-c): stub out turbo_encode to fix CI build (70ac5e8)
  • fix: Correct buffer range removal in ThinkingStateTracker (use ..<upperBound) (7dd655f)
  • feat: Add thinking/reasoning support (ThinkingStateTracker + prefill heartbeat) (dfa9fba)
  • test: Add TurboQuant unit tests to CI/CD pipeline (a042879)
  • docs: remove llama.cpp VLM comparison table from README (511a59b)
  • feat: Implement real TurboQuant KV cache compression (ported from llama-cpp-turboquant) (2bb7017)
  • docs: Add TurboQuant KV cache algorithm description to README (ab35fd2)
  • fix(server): Remove debug prompt_debug print from slot_launch log (7c20227)
  • fix(ssd): Restore 5 GB/s throughput + correct output via tensor_name cache key fix (95edfdc)
  • fix(ssd): Key expert offset cache by tensor_name not E — gate/up/down proj no longer collide (b814c88)
  • feat(ssd): Wire mlx_fast_pread_into for high-throughput SSD weight streaming (1c1ded9)
  • feat(ssd): Add mlx_fast_pread_into for direct NVMe reads into evaluated MLX buffers (3ea2afb)
  • feat(ssd): Restore SSD stream metrics around prefault() call (2f655a1)
  • fix(mlx-server): Restore correct output by using prefault+slice instead of raw SSD blob (5849c5a)
  • fix(mlx-server): Restore SSD streaming throughput and mem-limit enforcement via C++ memory-aware loader (231c62c)
  • fix(ssd): Rewrite streamed_gather_mm primitive to load directly into allocator::malloc enforcing memory limits (65e5497)
  • feat: llama-server style logging + SSE CRLF fix (82bcc4b)
  • feat: real-time token streaming to stdout + fflush (da21efe)
  • feat: per-request chat_template_kwargs.enable_thinking support (8d3e15f)
  • fix(build): capture streamExperts as local let before escaping health route closure (95a126d)
  • fix(metrics): correct gpu_layers, strategy, and estimated_tok_s for SSD streaming mode (4dc61a6)
  • fix: replace SWAP-ASSISTED warning with SSD STREAMING label when streamExperts active (40d65d0)
  • feat: add thinking and ssd_stream to Config log line for observability (54a619f)
  • feat: log full JSON response body matching llama-server log_server_r format (4c5e54b)
  • feat: add llama-server style generation logging + API response format docs (d3da36e)
  • fix(memory): use physical RAM budget for SSD streaming instead of Metal's capped working set size (df7d154)
  • docs: add AEGIS_INTEGRATION.md with complete Aegis-AI sidecar setup guide (b19abdb)
  • feat(mlx-swift): implement 1-second interval aggregated SSD read metrics for cleaner console output (b0b3b9b)
  • build(ci): add .gitmodules mapping to fix mlx-swift-lm cloning failure in GitHub Actions (3c879db)
  • docs: remove Aegis-AI integration block temporarily to prepare for new hero section (f27da83)
  • docs: remove outdated Metal compile lockup warning as MoE streamed inference primitives resolve the delay (0c3288b)
  • docs: add M5 to requirements and highlight pre-built binary usage (2de99c9)
  • fix(Server): update mlx-swift-lm submodule to receive Evaluate.swift iteration mapping bugfix (5d2ea4b)
  • fix: correct finish_reason=length and tool_calls test robustness (d40e9e4)
  • test(e2e): extend test suite from 21 to 31 tests (7f3911e)
  • ci: re-trigger workflow after mlx-swift-lm submodule push (e6a421a)
  • fix(ci): add mlx-swift-lm git submodule and checkout submodules in CI - Create .gitmodules pointing to SharpAI/mlx-swift-lm.git - Add submodules: recursive to both build.yml and e2e-test.yml checkout steps - Remove mlx-swift-lm from .gitignore so git tracks the submodule pointer (1aaa13b)
  • fix: enforce SIGKILL in e2e tests and expand HF timeout (9f70969)
  • docs: update quick start and curl snippet to demonstrate 122B model deduplication JSON query (349fc53)
  • docs: remove vLLM completely from matrix (5b31a29)
  • docs: fix test hardware to M5 Pro 64GB (8b1d723)
  • docs: revert incorrect Flash-MoE designation for mlx-server and restore vLLM column (b2ccae9)
  • docs: remove vLLM column and correctly attribute Flash-MoE features (310b940)
  • fix(ci): relocate standalone test scripts to scripts/ to prevent implicit SPM test target failure (c6078ad)
  • docs: fix hardware specs and document 4-bit JSON quantization caveat (8082313)
  • test: structure test scripts into tests directory and ignore artifacts (e0da633)
  • docs: add Flash-MoE and vLLM to comparison table (91852f8)
  • docs: recreate README with mlx-server comparisons and architecture details (ed2d2b0)
  • feat(mlx-swift): expose MLXFast.streamedGatherMM and update c-api signature (9131ff7)
  • feat(server): auto-wire safetensors resolution and stream environment path mapping (e29155c)
  • feat(mlx-c): expose SSD streamed_gather_mm primitive to c-api (b5e6ade)
  • feat(mlx): integrate core C++ unified memory SSD streaming primitives (86bcee8)
  • feat: localize MLX frameworks to write C++ turboquant and ssd streamer scaffolds (eced528)
  • fix(ux): add autonomous task-driven progress bar and restore MB counts (7d23eba)
  • feat(ux): add robust caching bandwidth speedometer (441cb2b)
  • fix(ux): progress bar fraction handling for non-byte counts (3472cba)
  • feat: add download speed and progress bar UX (5255cd6)
  • feat(moe): Expose --stream-experts flag to enable SSD inference streaming for large MoE models (7210980)
  • feat: add auto-calibration 'Wisdom' system (3aae8e2)
  • chore: update fork to f8f315b (20 model architectures with LayerPartitionable) (8dd340e)
  • fix: restore Package.resolved and add CI retry for HF downloads (ebef9b0)
  • feat: wire GPU/CPU layer partitioning to --gpu-layers flag (837ced0)
  • feat: add memory-aware model partitioning framework (9c89cfc)
  • feat: GPU yield — prevent Metal from starving macOS WindowServer (fd4a5e3)
  • fix: CI — install mlx.metallib from Python mlx package (30c06d6)
  • feat: Prompt caching — reduce TTFT by reusing system prompt KV state (20e1ce8)
  • feat: API key authentication (--api-key flag) (9fe2175)
  • feat: Phase 3 — Memory limit, /metrics, enhanced /health, graceful shutdown, stats (e4ebecb)
  • feat: Phase 2 — JSON mode, VLM vision support, multipart content, extra sampling params (6589cbe)
  • feat: Phase 1 API parity with mlx-lm (337ec6d)

Download

Quick Start

tar -xzf SwiftLM-b107-macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: Requires mlx.metallib next to the binary for GPU compute. See README for setup.

mlx-server b21

24 Mar 16:58

mlx-server b21-2d382ef

Merge pull request #3 from SharpAI/feature/api-parity-roadmap

Feature/api parity roadmap

Changelog

  • fix: CI — install mlx.metallib from Python mlx package (4086ce9)
  • feat: Prompt caching — reduce TTFT by reusing system prompt KV state (6b34c97)
  • feat: API key authentication (--api-key flag) (75e927d)
  • feat: Phase 3 — Memory limit, /metrics, enhanced /health, graceful shutdown, stats (433d90e)
  • feat: Phase 2 — JSON mode, VLM vision support, multipart content, extra sampling params (bfc980a)
  • feat: Phase 1 API parity with mlx-lm (519bfda)

Download

Quick Start

tar -xzf mlx-server-b21-macos-arm64.tar.gz
./mlx-server --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Note: Requires mlx.metallib next to the binary for GPU compute. See README for setup.