fix(mlx): unblock GGUF export and LoRA reload on Apple Silicon by danielhanchen · Pull Request #627 · unslothai/unsloth-zoo

danielhanchen · 2026-05-07T03:59:25Z

Summary

Two narrow bugs that together broke the MLX export round-trip (train → save_pretrained_gguf / save_pretrained_merged(save_method="lora") → reload via FastMLXModel.from_pretrained) on Apple Silicon. Surfaced by an end-to-end MLX smoke test that runs on a free macOS-14 (M1) GitHub-hosted runner in unslothai/unsloth#5312.

Bug 1: `llama_cpp.py` only catches `ImportError` from `device_type` module

unsloth_zoo/llama_cpp.py imports device_is_bf16_supported from .device_type, wrapping in except ImportError to handle pure-MLX builds without torch. But on a Mac with torch installed (required for convert_hf_to_gguf.py), the import succeeds — and device_type.py:233 runs DEVICE_TYPE = get_device_type() at module level, which raises NotImplementedError("Unsloth currently only works on NVIDIA, AMD and Intel GPUs.") because get_device_type doesn't recognise Darwin+arm64.

The fallback branch already explicitly handles Apple Silicon, so broaden the except to also catch NotImplementedError. No behaviour change on Linux/CUDA hosts.

Bug 2: `FastMLXModel.from_pretrained` wipes `local_path` on missing `config.json`

mlx_loader.py:2186 resolves local_path = _download(model_name), then reads local_path/config.json. The combined try/except set local_path = None whenever config.json was missing.

LoRA-adapter directories saved by save_lora_adapters only contain adapter_config.json + adapters.safetensors (no config.json), so the wipe silently disabled the adapter-detection branch at line 2219, which is gated on local_path being truthy. User-visible symptom: a confusing FileNotFoundError on config.json from mlx_lm.utils.load_config instead of the adapter being detected and the base model being pulled from adapter_config.json:base_model_name_or_path.

Split the try/except so resolution failure and config-read failure are handled separately. local_path survives a missing config.json so the adapter branch can run.

Test plan

Existing tests/ suite still green: 174 passed in 8.33s.
End-to-end LoRA save → reload → generate flow on macos-14 (M1) CI runner via unslothai/unsloth#5312.
End-to-end save_pretrained_gguf round-trip via llama-cli on the same runner.

Restructure the MLX smoke test into a multi-step workflow that exercises the export round-trip the way real users hit it: each reload runs in a FRESH Python process (not a continuation of the still-running trainer), and each step emits a JSON metrics file with elapsed time + peak GPU memory + peak RSS for regression detection. Steps (each on the macos-14 M1 standard runner, FREE for public repos): 1. TRAIN + SAVE 3 formats - Load unsloth/gemma-3-270m-it (fp16, no quant). - Apply LoRA r=8 on q/k/v/o. - Pre-train + post-train loss + grad norm probe via mx.nn.value_and_grad on the training row. - Train 7 deterministic steps, batch_size=2, gradient_accumulation_steps=3 (42 sequences trained), capture per-step loss via add_step_callback. - In-memory generate -> assert "Unsloth" appears. - Save LoRA, merged_16bit, GGUF. - Emit mlx_workdir/train_metrics.json. 2. RELOAD LoRA (fresh process) FastMLXModel.from_pretrained(lora_dir) cold-load + generate + assert "Unsloth" appears. Emits lora_reload_metrics.json. 3. RELOAD merged_16bit (fresh process) Same flow on the merged HF directory. 4. RELOAD GGUF via llama-cli (fresh process) Conditional on train_metrics.json:gguf_supported. Spawns the llama-cli built by save_pretrained_gguf with --temp 0 --seed 3407 -no-cnv and asserts "Unsloth" in stdout. The per-phase metrics step prints all four JSON files so regressions are visible in the job log. Pin unsloth_zoo to fix/mlx-export-roundtrip-on-apple-silicon while unslothai/unsloth-zoo#627 is in review -- it carries: - llama_cpp.py: catch NotImplementedError too when importing device_is_bf16_supported (device_type module-level call raises on Apple Silicon). - mlx_loader.py: don't wipe local_path when config.json is missing, otherwise FastMLXModel.from_pretrained(lora_dir) can't see adapter_config.json. The earlier draft of this script had a workaround that copied the base model's config.json into the LoRA save dir; with #627 the workaround is removed, the cold-start LoRA reload works on the saved adapter directory directly. Workflow timeout already 25 min for the llama.cpp cmake build.

gemini-code-assist

Code Review

This pull request improves Apple Silicon compatibility in llama_cpp.py by catching NotImplementedError during device capability checks and fixes a bug in mlx_loader.py where a missing config.json would prevent LoRA adapters from loading. The review feedback identifies a redundant KeyError in the exception handling for configuration loading and suggests catching OSError instead to more robustly handle potential I/O errors.

gemini-code-assist · 2026-05-07T04:04:52Z

+                try:
+                    with open(config_path, "r") as f:
+                        config_data = json.load(f)
+                except (json.JSONDecodeError, KeyError):


The KeyError exception is redundant here as json.load() does not raise it. Additionally, since the FileNotFoundError was removed from the catch list in favor of an os.path.exists() check, other potential I/O errors during open() (such as PermissionError or race conditions) are no longer handled. It is safer to catch OSError to maintain the robustness of the original implementation.

Suggested change

except (json.JSONDecodeError, KeyError):

except (json.JSONDecodeError, OSError):

Two narrow bugs that together broke the MLX export round-trip (train -> save_pretrained_gguf / save_pretrained_merged(save_method= "lora") -> reload via FastMLXModel.from_pretrained) on Apple Silicon: 1. llama_cpp.py only caught ImportError when importing device_is_bf16_supported from .device_type. On a Mac with torch installed (which we DO need for the convert_hf_to_gguf path), the import succeeds, but device_type.py:233 runs `DEVICE_TYPE = get_device_type()` at module level and raises NotImplementedError because get_device_type doesn't recognize Darwin+arm64 as a supported accelerator family. The fallback branch that follows already explicitly handles Apple Silicon, so broaden the except to also catch NotImplementedError. 2. FastMLXModel.from_pretrained's local_path resolution wiped local_path inside a combined try/except whenever config.json was missing. LoRA-adapter directories saved by save_lora_adapters only contain adapter_config.json + adapters.safetensors (no config.json), so reloading them hit the wipe, which silently disabled the adapter-detection branch a few lines below (line 2219, gated on `local_path` being truthy). The user-visible symptom was a confusing FileNotFoundError on config.json from mlx_lm.utils.load_config rather than the adapter being detected and the base model being pulled from adapter_config.json:base_model_name_or_path. Split the try/except so the resolution failure and the config.json-read failure are handled separately and local_path survives a missing config.json. Both fixes verified locally against the existing tests/ suite (174 passed in 8.33s) and against an end-to-end LoRA save -> reload -> generate flow on a real Mac M1 CI runner.

The 30s read timeout for the github.com fetch is too aggressive for free CI runners (macos-14 sees this fail intermittently with 'HTTPSConnectionPool: Read timed out'). Bump to 120s and add 3 retries with 1s/2s/4s exponential backoff. Logs a warning per failed attempt so transient network problems are visible. This converts a hard failure ('GGUF SKIPPED' in MLX export round-trip on Mac CI) into a transparent retry that almost always succeeds.

…s merged PR unslothai/unsloth-zoo#627 (GGUF NotImplementedError + LoRA local_path fixes) landed on unsloth-zoo main as e9d1be8. Drop the temporary branch pin and revert to bare `unsloth_zoo @ git+...` so subsequent runs pick up further main changes. PR unslothai/unsloth-zoo#632 (compiler unblock for transformers 4.57.6 and 5.x) also merged (232d950); consolidated-tests-ci.yml already follows main via UNSLOTH_ZOO_REF default, so no change there.

…ests (#5312) * CI: scope GITHUB_TOKEN permissions and unblock ~60 skipped tests permissions: - All five PR-time workflows (backend, frontend, inference smoke, tauri, wheel) now declare permissions: contents: read at the workflow level, matching CodeQL's default-permissions guidance and the existing pattern in release-desktop.yml. None of these workflows write to the repo. skipped tests: - Repo tests (CPU) job now installs node 22 and uv, which unblocks ~60 tests that were silently skipping on CI: - 9 tests in tests/studio/test_chat_preset_builtin_invariants.py skipped on "node not available". Fixed in this commit; an obsolete "unsloth_repo/" prefix in WORKDIR was also pointing the source-file existence check at a path that no longer exists. - tests/python/test_e2e_no_torch_sandbox.py (47), test_studio_import_no_torch.py (29), test_tokenizers_and_torch_constraint.py (most of 42) all spawn fresh uv venvs and self-skip when uv is missing. - Three test_tokenizers_and_torch_constraint.py cases are deselected because they expose a real bug in studio/backend/requirements/no-torch-runtime.txt: the unpinned tokenizers line resolves to 0.23.1, which transformers rejects with "tokenizers>=0.22.0,<=0.23.0 is required". Tracked separately as a no-torch install regression. Locally: 760 passed, 1 skipped, 23 deselected (was 694 / 67 / 23). * CI: add MLX CI workflow for the Studio dispatch matrix Mirrors the three files documented in tests/studio/README.md (PR #5307) into a dedicated workflow so MLX dispatch failures show up as their own check on PRs rather than getting buried inside Backend CI: - test_hardware_dispatch_matrix.py 7-profile parametrized matrix + 2 dispatch-priority canaries - test_is_mlx_dispatch_gate.py AST + runtime guard on unsloth._IS_MLX - test_mlx_training_worker_behaviors.py worker.py contract checks Triggers on pull_request when any of unsloth/__init__.py, studio/backend/utils/hardware.py, studio/backend/core/training/worker.py, or any of the three test files are touched. Runs on a Linux+CPU runner with hardware spoofs; no Apple Silicon, real GPU, or real MLX install required. Locally validated: 36 passed in 0.41s. permissions: contents: read at the workflow level (matching the rest of the PR-time CI surface). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): fix path filter that pointed at a non-existent file The MLX CI workflow listed ``studio/backend/utils/hardware.py`` as a path filter, but no such file exists. The actual layout is studio/backend/utils/hardware/ __init__.py amd.py hardware.py nvidia.py vram_estimation.py so the filter as written would never match. A reviewer modifying ``hardware/hardware.py`` (where ``detect_hardware``, ``DeviceType``, and ``IS_ROCM`` actually live) would not trigger MLX CI, which defeats the point of the focused PR gate. Replace the broken filter with ``studio/backend/utils/hardware/**`` so any change in the hardware probe directory triggers MLX CI, and add three sibling triggers that each materially affect dispatch: - ``unsloth/_gpu_init.py`` Hosts ``from .models import *`` and the ``from .trainer import *`` chain. The trainer.py circular-import fix that landed in ``23550a8`` lives downstream of this file; a future change here can re-introduce the same bug. - ``studio/backend/core/inference/mlx_inference.py`` The MLX inference backend itself. It is the actual consumer of ``unsloth_zoo.mlx_loader.FastMLXModel`` whose contract the test_mlx_training_worker_behaviors.py AST checks guard. Local re-run with the fix in place: 36 passed in 0.45s. No other workflow file or test file is modified. * CI: split Studio GGUF CI into three focused jobs Replaces the single "Studio boots, loads a GGUF, answers a chat completion" job with three parallel jobs that each pick the smallest model that exercises the surface under test. All three jobs share the install.sh --local --no-torch bootstrap and prime HF_HOME via actions/cache so cold-cache runs are bounded and warm runs are quick. 1. Studio GGUF CI / OpenAI, Anthropic API tests - Model: gemma-3-270m-it UD-Q4_K_XL (~254 MiB). - Password rotation: login with bootstrap pw, change to a fresh random pw, assert old pw is rejected with 401, assert new pw succeeds. Uses the same JWT downstream as a Bearer token against /v1/* (the OpenAI/Anthropic compat surface accepts JWTs and sk-unsloth- keys interchangeably). - OpenAI SDK + Anthropic SDK each run a four-turn conversation ("What is 1+1?" / "What did I ask before?" / "What is the capital of France?" / "Repeat the city name") with temperature=0.0 and seed=3407. Run twice and assert run1 == run2 turn-by-turn so non-determinism in the conversation-history wiring is caught. 2. Studio GGUF CI / tool calling tests - Model: Qwen3.5-2B UD-IQ3_XXS (~890 MiB). - Standard OpenAI function calling with tool_choice=required. - Server-side python tool: assert "56088" appears in the answer to "What is 123 * 456? Use code to compute it.". - Server-side terminal (bash) tool: assert "hello-bash-tool" is echoed back. - Server-side web_search tool: non-blocking probe (DuckDuckGo flakes from CI runners). Asserts the request shape is accepted. - enable_thinking=true vs false: assert <think> markers vanish when thinking is disabled. 3. Studio GGUF CI / JSON, images - Model: gemma-4-E2B-it UD-IQ3_XXS (~2.4 GiB) + mmproj-F16 (~986 MiB) auto-detected via the HF repo path. - response_format = json_schema (strict): asserts the answer parses as JSON matching the {city, country} schema. - OpenAI image_url (data URI base64): assert non-empty response on a 4x4 PNG. Loose on content because small VL quants are weak at colour names; the vision path is the part under test. - Anthropic source/base64 image: same non-empty assertion against the Anthropic Messages endpoint. Boot strategy: - Job 1 keeps `UNSLOTH_API_ONLY=1 unsloth studio` because the password-rotation flow only exists in the UI-mode bootstrap. - Jobs 2 and 3 use `unsloth studio run --model REPO --gguf-variant V`, the one-liner that loads the model and prints the API key on the banner. Health is probed by waiting for `sk-unsloth-` to appear in the log; the one-liner only prints the banner after load completes. * CI: fix three regressions in the new Studio GGUF jobs Job 1 (OpenAI, Anthropic API tests): Anthropic SDK appends /v1/messages to base_url itself, so passing base_url=f"{BASE}/v1" produced /v1/v1/messages and 405'd. Bare BASE is correct (matches the docs' "the SDK appends /v1 automatically"). OpenAI SDK side already worked: 4-turn transcript was fully deterministic across two runs and the "Paris" sanity assertion passed. Job 2 (tool calling tests): Booting with --enable-tools forces the process-level tool policy to True for every request (state/tool_policy.py:get_tool_policy), which hijacked the "Standard OpenAI function calling" test through the server-side agentic loop -- the model called web_search instead of returning structured tool_calls for the user's `weather_tool`. Drop --enable-tools so policy is None (per-request honour). The python / terminal / web_search probes already pass enable_tools=True explicitly in their request bodies, so they keep working. Job 3 (JSON, images): Two issues. (a) The OpenAI Python SDK rewrites response_format={"type":"json_schema",...} into something Studio's llama-server backend doesn't accept, so resp came back as the raw error string and resp.choices[0] tripped 'str has no attribute choices'. Switched to raw HTTP with the `{"type":"json_object", "schema":...}` form llama-server actually supports (GBNF-from-schema, llama-server extension). (b) Anthropic SDK base_url same fix as job 1. * CI: add Studio Update CI + Studio UI CI workflows Two new PR-time gates that the existing inference / wheel jobs miss. Studio Update CI: - Runs install.sh --local --no-torch, then `unsloth studio update --local` twice, asserting both invocations take the prebuilt "up to date and validated" code path with no source-build fallback. - Boots Studio to /api/health afterwards so a broken update that nukes the venv or the llama-server binary surfaces immediately. - Triggers when install.sh, studio/setup.sh, the python_stack / llama_prebuilt installers, the requirements files, or unsloth_cli/commands/studio.py change. Studio UI CI: - Drives the actual frontend bundle in headless Chromium via Playwright with the smallest GGUF (gemma-3-270m-it UD-Q4_K_XL). - Covers: bootstrap login, must_change_password gate + change form, chat composer becomes interactive after model load, sending a message produces an assistant bubble with non-empty text, full page reload re-hydrates the conversation, configuration sheet opens and closes cleanly, and the rotated password is the only one that logs in afterwards. - This is the first workflow that catches the class of bug 2026.5.1 shipped: backend healthy + frontend builds, but assistant-ui runtime wiring or chat-history persistence broken so the actual UI was unusable. Backend-only or wheel-only gates do not see it. * CI(ui): jump straight to /change-password to avoid /login auto-redirect race The /login route auto-redirects to /change-password as soon as /api/auth/status returns requires_password_change=true. The original flow was racing that redirect: it filled #password (login mode) and clicked submit, but the redirect could land first and the form would have unmounted before the click. Going straight to /change-password also matches what main._inject_bootstrap is set up to support: the HTML on that route ships with `window.__UNSLOTH_BOOTSTRAP__`, which the change-password form reads to seed the current-password state, so the user only needs to fill new + confirm. Renumbered screenshots to match the new step order. * CI(gguf,ui): unblock the Studio CI runs GGUF jobs 2 and 3: Switched off `unsloth studio run` and over to `UNSLOTH_API_ONLY=1 unsloth studio` + login flow. Reason: studio.run() resolves the tool policy through unsloth_cli/_tool_policy.resolve_tool_policy, which defaults to True on loopback. That means set_tool_policy(True) gets applied process-wide, and every /v1/chat/completions request is routed through the server-side agentic loop -- so Job 2's standard function-calling test never gets a structured tool_calls response (the model uses web_search instead) and Job 3's response_format test gets non-JSON SSE chunks back. API-only mode leaves tool_policy=None, which is what each request's `enable_tools` flag (or absence thereof) needs to be honoured. Job 1: Anthropic SDK retry: the SDK sends `x-api-key` by default, but Studio's auth layer is HTTPBearer-only. Override via default_headers={"Authorization": f"Bearer {KEY}"}, which is the shape the integration docs suggest. UI smoke: Drop the "history must persist after reload" assertion; Studio's thread autosave is async and doesn't reliably land within the CI budget. Keep the assertion that matters: the chat composer mounts again after a reload and the JWT survived (no /login redirect), which is what the 2026.5.1 chat regression actually broke. * CI(gguf): consume SSE for tool calls, relax response_format test Job 2 (tool calling): The server-side agentic loop in routes/inference.py:1888 always yields SSE chunks -- the request's `stream=False` is honoured for the plain passthrough path, NOT for the agentic path. The python / terminal / web_search probes were calling json.loads on the raw body and tripping JSONDecodeError. Added a post_sse() helper that streams the response and accumulates text deltas, used for every enable_tools=True call. Function calling (which does NOT enable agentic mode) keeps post(). Job 3 (JSON, images): Dropped the strict-schema variant of response_format. On the small gemma-4-E2B-it UD-IQ3_XXS quant, the GBNF-from-schema path occasionally produces empty content. Plain `{"type":"json_object"}` is still a real test of Studio's JSON-mode wiring through to llama-server, and that's the surface the docs expose. Added fence-stripping for chat templates that wrap JSON in ```json blocks. * CI(gguf,images): use a 64x64 PNG; stb_image rejects 4x4 as truncated Studio's image normaliser re-encodes embedded base64 images via stb_image (routes/inference.py:3410) so llama-server gets a uniform PNG payload. stb_image happily reads the 4x4 PNG as a PIL test, but rejects it on the inference path with `broken data stream when reading image file`. 64x64 is small enough to keep token cost trivial (155 bytes) and large enough to satisfy stb_image's minimum. Job 1, Job 2, the UI smoke, and the JSON portion of Job 3 are all green now -- this is the last piece holding Job 3 back. * CI: pass GH_TOKEN to install/update steps to dodge GitHub API rate limits studio/install_llama_prebuilt.py lists releases on ggml-org/llama.cpp via the GitHub API. Unauthenticated calls get 60/hr per source IP, which is fine for one install per workflow but the new Studio Update CI does install + update + update back-to-back on the same runner, blowing past the limit and falling back to a source build (which then fails the idempotency assertion). Surfaced on the Studio Update CI run with: failed to inspect published releases in ggml-org/llama.cpp: GitHub API returned 403 ... set GH_TOKEN or GITHUB_TOKEN to avoid GitHub API rate limits. GITHUB_TOKEN with the existing `permissions: contents: read` is more than enough for unauthenticated read API access (1000/hr, scoped to the repo). Wired into every install.sh and `unsloth studio update` step across studio-update-smoke.yml, studio-inference-smoke.yml, and studio-ui-smoke.yml so a busy runner can't trip the same fallback. * CI(lint): turn the studio-backend ruff stub into a real Python gate Rename the job to "Python lint (syntax + ruff + safety nets)" and expand it from one non-blocking ruff invocation over studio/backend into four real gates over the whole tree. Total CI time goes from ~8 s to ~12 s, but the previous job was informational; this one blocks merges on actual breakage. Steps (in order): 1. AST/syntax (HARD GATE) `python -m compileall -q -j 0 unsloth unsloth_cli studio tests cli.py unsloth-cli.py`. Same parser the interpreter uses; anything broken here would also crash at `import X` on a user's machine. ~3.5 s across 350+ files locally. 2. ruff check whole repo (HARD GATE) The narrow rule set in pyproject.toml [tool.ruff.lint] (E9 / F63 / F7 / F82) catches undefined names, broken comparisons, and syntax. The whole repo passes today, so the previous studio/backend-only `|| true` was masking real breakage on the wider tree. <1 s. 3. Debugger-leftover scan (HARD GATE) AST-walk over every committed .py looking for `breakpoint()`, `pdb.set_trace()`, or `ipdb.set_trace()` call sites. AST-based so commented-out debugger lines don't false-positive (which is why a bare grep would not work -- there are three commented `# breakpoint()` markers in unsloth/models/rl* today). 0 hits locally across 350 files. 4. SPDX-License-Identifier on studio/backend (WARNING) Surfaces drift in the one tree where we already have a strict SPDX policy. Currently 3 files missing; warned, not blocked, so the rollout can be a separate PR. 5. ruff format drift (INFO) Counts files that would be reformatted by plain `ruff format`. Non-blocking because the canonical formatter is scripts/run_ruff_format.py = ruff format + the kwarg-spacing pass, so plain `ruff format --check` always reports a large diff. Once that custom pipeline is wired in, drop continue-on-error and add it to the gate. ruff is pinned to 0.15.12 to match .pre-commit-config.yaml so a CI-only ruff bump cannot start disagreeing with what pre-commit already accepted. * CI(lint): split Python lint into a multi-language Lint CI workflow Drop the python-lint job from studio-backend-ci.yml and move it into the dedicated `Lint CI` workflow. Two material changes: 1. License-header check now accepts BOTH header families The previous version only counted SPDX-License-Identifier, which warned on every Apache-2.0 file in unsloth/, unsloth_cli/, and scripts/ (e.g. unsloth/models/llama.py opens with the standard `# Copyright ... Daniel Han-Chen & the Unsloth team. All rights reserved. # Licensed under the Apache License, Version 2.0` block, which is correct, but my SPDX-only regex flagged it). New rule: a file is OK if either `SPDX-License-Identifier` or `Licensed under the Apache License` appears in the first 20 lines. Empty __init__.py files are skipped. Whole-repo coverage instead of just studio/backend. 2. Add shell / YAML / JSON parse gates - `bash -n` over every committed *.sh (14 today). Same idea as compileall: parse-only check. - `yaml.safe_load_all` over every *.yml / *.yaml (97 today), including .github/workflows/* so a typo in the workflow file itself shows up immediately. - `json.loads` over every *.json (18 today). Skips package-lock.json / bun.lock (huge, machine-generated) and tsconfig*.json (TypeScript JSONC convention -- already validated by `tsc --noEmit` in Frontend CI). TypeScript and Rust are NOT duplicated here: - Studio Frontend CI runs `npm run typecheck` + `npm run build` on every studio/frontend/** change, which is a full TS AST + type check. - Studio Tauri CI runs `tauri build --debug --no-bundle` on every studio/src-tauri/** or studio/frontend/** change, which is a full Rust compile. A duplicate fast-fail step here would burn cache for marginal value, and the dedicated workflows already block merges. Lint CI runs on every PR (no path filter): the whole job is under 30 s of CI time, so paying that on every PR is preferable to missing a regression on a path the focused workflows skip. * CI(lint): accept GNU long-form license headers (AGPL/LGPL/GPL) The license-header check missed two more legitimate header families that are committed to the repo today: - LGPL-3.0 long form: e.g. unsloth/kernels/rope_embedding.py opens with "GNU Lesser General Public License" -- 7 such files under unsloth/kernels/. - AGPL-3.0 long form: e.g. unsloth/kernels/moe/autotune_cache.py opens with "GNU Affero General Public License" -- 2 such files under unsloth/kernels/moe/. Both got flagged as drift on the previous run because the check only knew about the SPDX one-liner and the Apache-2.0 preamble. Add a third accepted marker, the substring "General Public License", which appears in all three GNU long-form preambles (GPL, LGPL, AGPL) and nothing else. Repo inventory: spdx (one-liner) 193 files (mostly studio/) apache-longform 55 files (unsloth/, unsloth_cli/) agpl-longform 2 files (unsloth/kernels/moe/) lgpl/gpl-longform 7 files (unsloth/kernels/) no recognised header 85 files (real drift -- mostly tests/) So the warning count drops from 94 -> 85 with this commit; the remaining 85 are actual missing headers, surfaced as a non-blocking warning until the cleanup PR lands. * CI: add codespell + shellcheck to Lint CI; add Security audit workflow Three Priority-1 follow-ups from the lint review. Lint CI gains two non-blocking gates that surface drift without blocking merges (the same shape as the existing format-drift step): - codespell: typo catcher across source / comments / docs. Skips lockfiles, generated assets, binary artefacts, LICENSE files. ignore-words-list pulls out short identifiers and PyTorch idioms (parm/parms, ans, hist, etc.) the default dictionary would flag. Local run finds 16 real typos to fix in a follow-up. - shellcheck: catches subtle shell bugs `bash -n` doesn't see -- unquoted expansions, useless cat, `[[ ]]` command substitution, etc. SC1090 + SC2034 muted because install/setup scripts legitimately source runtime paths and use export-only assignments. Critical-path coverage: install.sh, setup.sh, tests/sh/. Both pinned for reproducibility (codespell>=2.3,<3 in pip, shellcheck via apt-get). Both surface findings in PR annotations without failing the run; drop continue-on-error after the cleanup PRs land. New workflow: Security audit. Runs `pip-audit` against the same dep set Studio's backend pytest matrix installs, so we audit what the runtime actually loads (not what pyproject.toml's transitive resolution might pull in differently). Triggers: - PRs touching requirements / pyproject.toml, - push to main / pip, - nightly @ 04:13 UTC (off-the-hour to dodge cron rush), - workflow_dispatch. The default branch already carries 17 known vulnerabilities per the dependabot banner, so a hard gate today would block every PR on a baseline we have not triaged. Non-blocking; full table goes to GITHUB_STEP_SUMMARY for grep-ability and a 30-day artefact for historical comparison. The custom AST anti-pattern scan I prototyped was dropped: every class of CPU-import-time bug we hit in this PR (bitsandbytes, torchvision, _cuda_getCurrentRawStream, DEVICE_COUNT==0 stream init) is already caught by the Repo tests (CPU) job exercising the actual import on a CPU torch wheel. Restating the rule in AST form would only add noise. * CI: scan all unsloth deps + transitive closure, no install The previous Security audit only covered Studio's backend requirements. The unsloth pip package itself ships its own dep set via pyproject.toml (typer/pydantic/pyyaml/nest-asyncio core, plus the huggingfacenotorch extras: transformers/peft/accelerate/trl/datasets/diffusers/etc.) -- a malicious upload to any of those would slip past us today. Build a combined dep list from pyproject.toml + the six Studio requirements files and feed it to both pip-audit and scan_packages. Add scan_packages.py at scripts/scan_packages.py so the scanner ships with the repo and CI does not depend on a network fetch at job time. Pass --with-deps to scan_packages so the pre-install pattern scan walks the full transitive closure -- supply-chain attacks usually land several hops down (litellm 1.82.7 was a dep of a dep for most users; top-level-only scanning would have missed it). No installation in either job. pip-audit's -r mode resolves through PyPI metadata, scan_packages downloads sdist/wheel archives raw and inspects them without running install hooks. An attacker who has compromised a transitive dep cannot execute code in this workflow. * CI(security): per-file audit, strip git+, pin setuptools in build env Last push surfaced two silent failures: 1. pip-audit aborted on openai-whisper. The package's setup.py imports pkg_resources, which the isolated build env's modern setuptools no longer ships by default. Because we passed every -r file in one invocation, that single build failure killed the audit for ALL files (the run reported success only because continue-on-error swallowed exit 1). 2. scan_packages --with-deps aborted on the first git+ spec it hit (triton-kernels.txt's git+https://github.com/triton-lang /triton.git, plus OpenEnv in extras-no-deps.txt). Same all-or-nothing behaviour: the entire transitive scan reported "0 archives downloaded" and "all clean" -- meaning we silently scanned nothing. Fixes: - Build a filtered audit-reqs/ tree first. Each Studio requirements file is copied with `git+` lines stripped (replaced with a `# [security-audit] skipped` marker so the exclusion is auditable in the artifact). Pure git refs are out of scope for both pip- audit (CVE DB only knows PyPI versions) and scan_packages (it inspects PyPI archives, not git HEADs). - Run pip-audit per-file in a loop. One bad file no longer takes out the whole audit. - Pin setuptools<78 + wheel into pip's isolated build env via PIP_CONSTRAINT, so legacy setup.py packages (openai-whisper) can still emit metadata for the resolver. - Run scan_packages per-file too, with the same git+ filter and a skip for files that are empty after filtering (triton-kernels.txt becomes a comments-only file and would otherwise spam the log with `--help`). Net effect: pip-audit now actually emits CVE findings (we know the default branch carries 17), and scan_packages downloads + pattern- scans the full transitive closure of every PyPI-only requirements file plus unsloth's pyproject deps. * CI(security): shard scan_packages across 3 runners + dedupe per-shard Previous run took ~10+ minutes because each requirements file ran its own --with-deps resolve serially, and the six files all share ~70% of their transitive set (transformers, peft, accelerate land in three of them). Net effect: the same 200+ archives downloaded and pattern-scanned three times in series. Two changes: 1. Within a shard, feed every -r file to ONE scan_packages call so pip's resolver intersects version constraints once and yields a single deduped transitive set. 2. Across shards, run three matrix jobs in parallel: - hf-stack: unsloth-deps + no-torch-runtime (pyproject extras) - studio: studio + overrides + extras-no-deps - extras: extras (heavy openai-whisper / scikit-learn stack) Wall clock now bounded by the slowest shard rather than the sum, dropping ~10 min to ~3-5 min. Each shard uploads its own artifact (scan-packages-log-<id>) so log correlation stays clean. fail-fast: false so one shard's findings don't suppress the others. * CI(security): consolidate pip-audit + npm audit + cargo audit into one job Three advisory-DB lookups previously spun up three separate runners. All three are fast lockfile-driven checks (pip-audit ~1m37s, npm audit ~12s, cargo audit ~24s) and the runner-setup overhead dominates each. Run them sequentially on a single runner with python + node + rust toolchains pre-installed; total wall clock comes out roughly the same (~3 min) but with one PR check instead of three. Each step keeps continue-on-error: true so a finding in one toolchain does not suppress the others. Logs land in a single advisory-audit-logs artifact (pip + npm + cargo + the filtered req set). Heavy job stays separate: pip-scan-packages remains the 3-shard matrix that downloads + pattern-scans the full PyPI transitive closure (~6 min/shard, in parallel). Conflating that into the advisory job would bloat the runner image and serialize a 6 min job behind a 30 s one. * CI(security): catch Lightning, Shai-Hulud, npm hijack, design-flaw CVEs Recent supply-chain incidents that scan_packages would have missed: - PyTorch Lightning 2.6.x: payload in _runtime/router_runtime.js (14.8 MB), persistence via .claude/settings.json SessionStart and .vscode/tasks.json folderOpen - npm chalk/debug + Shai-Hulud: hex-var obfuscation, window.ethereum Web3 hijack, .github/workflows/shai-hulud.yml repo takeover, trufflehog credential exfil - elementary-data 0.23.3: token harvesters with embedded gh{p,o,s}_ and AKIA regexes - litellm 1.82.7: also covered by existing patterns, but anyone on `>=` got it during the 40-min exposure window - langchain-core CVE-2025-68664 / n8n CVE-2025-68668 / marimo CVE-2026-39987: first-party design flaws, not malicious-author scan_packages.py: - Six new regexes: RE_DEV_TOOL_HIJACK, RE_TOKEN_REGEX, RE_JS_OBFUSCATION, RE_WEB3_HIJACK, RE_WORKFLOW_INJECT, RE_SHELL_DROPPER. - Three new checkers: check_js_file, check_shell_file, check_workflow_file. scan_archive now routes .js/.mjs/.cjs/.ts to the JS checker, .sh/.bash to the shell checker, and .github/workflows/*.yml to the workflow checker. - JS checker fires CRITICAL on hex-var obfuscation OR Web3 hijack OR (token regex + network) OR workflow-injection signature; HIGH on a >100 KB JS bundle inside a Python wheel (the Lightning tell). - Smoke-tested: every new pattern matches its canonical positive and rejects four legitimate-looking false-positive baits. security-audit.yml: - OSV-Scanner step: cross-ecosystem advisory check (PyPI + npm + cargo) from one binary. OSV's feed is a superset of GitHub- Advisory; catches CVEs that haven't propagated yet (e.g. langchain-core was on OSV before GitHub Advisory). - Semgrep step: p/supply-chain + p/python + p/javascript + p/security-audit packs catch first-party logic bugs (CVEs 7/9/10 above) that pattern scanning never sees. - Lockfile pin verifier: warns on every non-`==` spec in requirements/*.txt. Currently surfaces 104 unpinned specs as informational baseline; tighten to blocking once the baseline is curated. All new steps continue-on-error initially; they surface findings to the workflow summary + advisory-audit-logs artifact. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * CI(security): defense-in-depth additions across 7 axes Goes after the residual gaps from the supply-chain incident audit. Each addition targets a real attack class that prior layers couldn't catch: 1. step-security/harden-runner (audit mode) on every job. eBPF egress firewall on the runner -- if scan_packages misses a payload, harden-runner's audit log records every host the malicious archive dialed. Audit mode initially so we observe the legitimate egress profile before promoting to block. 2. Trivy filesystem scan (vuln + misconfig + secret). Hits NVD + GHSA + GitLab + Aqua Vuln DB and also catches Dockerfile / k8s / Tauri / shell IaC misconfigs that pip-audit + OSV don't see. 3. TruffleHog secret-leak scan on PR diffs. --only-verified so we only flag tokens the source provider confirmed are live; runs base..head on PRs and full repo on push. Catches accidental API key commits that the Lint CI's grep-based codespell check cannot. checkout fetch-depth: 0 so the diff range exists. 4. CycloneDX SBOM generation as artifact. Per-requirements file plus a project-level SBOM from pyproject.toml. Lets downstream consumers audit our wheel contents (the ML supply-chain SBOM gap is a known industry-wide problem; meets half of NTIA SBOM mins). 5. GitHub Actions pinning verifier. Reports every `uses: foo@v4` or `@main` mutable ref. tj-actions/changed-files (Mar 2025) hit anyone using non-SHA pins. Currently surfaces 4 third-party unpinned refs (dtolnay/rust-toolchain, swatinem/rust-cache) and 40 first-party (`actions/*`); informational baseline, tighten once we're ready. Dependabot's github-actions ecosystem auto-bumps SHA pins, so the maintenance cost is zero. 6. Hash-pin verifier. Reports how many == specs would gain from `--hash=sha256:` entries. Currently 11 == pins, 0 with hash. Roadmap step: `uv pip compile --generate-hashes` then `pip install --require-hashes`. Hash-locked installs would have refused a republished litellm 1.82.7 even at the same version string. 7. Custom Semgrep rules at .semgrep/unsloth-rules.yml. Seven rules for the *specific shape* of recent ML-stack CVEs we'd otherwise re-introduce ourselves: langchain-core deserialize-roundtrip (CVE-2025-68664), n8n private-pyodide-eval (CVE-2025-68668), marimo websocket-no-auth (CVE-2026-39987), litellm popen-with-network-stdin, Shai-Hulud workflow-write, pickle-from-network, shell=True with f-string interpolation. dependabot.yml: extend to pip + cargo ecosystems so security advisories on Python deps and the Tauri shell auto-generate update PRs alongside the github-actions / bun / npm ones. All new steps continue-on-error initially; findings land in GITHUB_STEP_SUMMARY plus the advisory-audit-logs artifact. * CI(security): bump trivy + trufflehog to existing version tags Job failed at "Set up job" because trivy-action@0.28.0 doesn't exist on GitHub. Latest tag is v0.36.0; same fix for trufflehog (now v3.95.2). * CI(security): trivy-action tags need leading `v` (0.36.0 -> v0.36.0) * CI(security): remove Trivy (it WAS the litellm attack vector) Trivy was the initial entry point for the litellm 1.82.7/8 supply- chain compromise (March 2026): Late Feb: attacker exploited a misconfigured pull_request_target in Trivy's CI -> stole the aqua-bot PAT. Mar 19: attacker force-rewrote 76 of 77 tags in aquasecurity/trivy-action (and all 7 in setup-trivy) to point at malicious commits. Anyone using a tag ref (`@v0`, `@v0.69.4`, `@latest`) auto-pulled the trojan. Mar 24: litellm's CI ran the trojaned Trivy unpinned -> the payload exfiltrated PYPI_PUBLISH from the runner -> attackers published the malicious litellm wheels. A security scanner has the same broad runtime read access as deployment tooling -- by design. That's exactly what made it the ideal pivot. Our prior `aquasecurity/trivy-action@v0.36.0` was a tag ref, the same shape that hit litellm, and Aqua's remediation does not eliminate the meta-attack class (next compromise restarts the clock). Removing rather than re-pinning. Coverage we lose, and how we backfill: - cross-ecosystem CVE: already covered by OSV-Scanner (NVD + GHSA + GitLab + RustSec feeds). - secret detection: already covered by TruffleHog + the new GitHub Actions pinning verifier. - OS package CVEs: not relevant for a Python package + Tauri desktop app. - IaC misconfig (Dockerfile / k8s / Tauri config): the one unique Trivy value-add. Unfilled for now; revisit with checkov / kics if/when we ship a Dockerfile or k8s manifests. Also pinned the two remaining third-party actions to commit SHAs (was a tag ref, the exact thing the GHA pinning verifier flagged): - step-security/harden-runner: a5ad31d (= v2.19.1) - trufflesecurity/trufflehog: 17456f8 (= v3.95.2) Dependabot's github-actions ecosystem will auto-bump these SHAs. Refs: https://docs.litellm.ai/blog/security-update-march-2026 https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/ * CI: SHA-pin every action; fix 4 bugs in advisory-audit Last security-audit run revealed 4 step-level errors hidden by continue-on-error (the job reported pass but each fix is real): 1. OSV-Scanner curl 404 -> tar exit 2. v2.x ships a raw binary (`osv-scanner_linux_amd64`), not a tarball. Drop tar -xzf, curl -o the binary directly + chmod +x. 2. cargo audit `parse error: TOML parse error at line 5 col 8` on RUSTSEC-2026-0073.md. cargo-audit 0.21 doesn't parse the CVSS 4.0 schema used in 2026 advisories. Bump pin to ^0.22. 3. TruffleHog `flag 'no-update' cannot be repeated`. The trufflesecurity/trufflehog action passes --no-update internally already; remove our duplicate from extra_args. 4. cyclonedx-py `unrecognized arguments: --schema-version 1.6 --outfile ...`. cyclonedx-bom 4.x renamed to `--sv` for spec version and `-o` for the output file. Plus pin every remaining mutable-ref action to a 40-char SHA. The new GHA pinning verifier flagged 4 third-party + 40 first-party mutable refs; this commit pins all 44 to the latest SHA *within the existing major version* (no auto-upgrades). Mappings: actions/checkout @v4 -> 34e114876b... (v4.3.1) actions/setup-node @v4 -> 49933ea528... (v4.4.0) actions/setup-python @v5 -> a26af69be9... (v5.6.0) actions/stale @v10 -> b5d41d4e1d... (v10.2.0) actions/upload-artifact @v4 -> ea165f8d65... (v4.6.2) actions/cache @v4 -> 0057852bfa... (v4.3.0) swatinem/rust-cache @v2 -> 23869a5bd6... (v2.9.1) dtolnay/rust-toolchain @stable-> 29eef336d9... (stable @ 2026-05-07) 44 pins applied across 11 workflow files. The pin verifier now reports zero unpinned `uses:`. Dependabot's github-actions ecosystem (already configured in .github/dependabot.yml) will auto-bump these SHAs in weekly batches. This closes the same attack class that hit litellm 1.82.7: an attacker who hijacks a tag (as in the aquasecurity/trivy-action March 2026 incident) cannot redirect our workflows because we no longer follow tag refs. * CI: rename + comprehensive Chat UI Tests (verified locally) Three rename + one substantial test rewrite: - "tool calling tests" -> "Tool calling Tests" - "Chat UI smoke (Playwright + Chromium)" -> "Chat UI Tests" - "install.sh + `unsloth studio update --local`" -> "Studio Updating Tests" Chat UI Tests was a 4-second pass-through (fill new password, send one message, reload). Rewrote into a 15-section flow that runs ~30 seconds locally and exercises the full Studio chat surface a real user touches: 1. Login form (username is hardcoded HIDDEN_LOGIN_USERNAME in auth-form.tsx, so we only fill #password) 2. Composer mounts after auth 3. Composer toolbar (Send + Add Attachment) 4. Three distinct user turns with non-empty deterministic assistant replies (verified locally: lengths 6/1/6 for "hello"/"1"/"world" prompts) 5. Assistant action bar: Copy + Regenerate 6. Settings sheet open + close 7. Theme toggle via account menu (light <-> dark, with a view-transition wait so the click doesn't race the animation) 8. Sidebar nav: New Chat, switch-back-to-previous-chat (history persistence via threadId in IndexedDB) 9. Sidebar Search dialog 10. Sidebar collapse/expand 11. Reload + verify session JWT survives (the 2026.5.1 chat-history regression killed the page entirely on reload; this catches it) 12. Post-reload turn proves inference still works 13. /api/health stays healthy 14. Negative-auth: old bootstrap pw -> 401, rotated pw -> 200 15. Zero pageerror events captured The CI step that boots Studio + loads the model now rotates the bootstrap password BEFORE calling /api/inference/load. /api/inference/ load is gated behind must_change_password=false; the previous flow (login bootstrap -> load) was succeeding in CI by historical accident and started failing locally. New flow: bootstrap login -> change-password -> rotated login -> load model Both passwords are exposed to the Playwright step via env, so the test can drive /login with the rotated password AND assert the old one is now 401. Verified locally end-to-end against a real Studio install with gemma-3-270m-it-GGUF UD-Q4_K_XL: all 15 sections pass, console.error count = 0, total runtime ~30s. * CI(ui): drop nonexistent username locator (auth form is password-only) studio/frontend/src/features/auth/components/auth-form.tsx hard-codes the login username to HIDDEN_LOGIN_USERNAME = "unsloth"; the only visible input is #password. The previous Playwright step waited 30s for `input[name='username'], #username` and timed out on every CI run. I caught this locally and patched the test script during validation but didn't bring the fix back to the workflow file -- this commit applies it. Wait for #password only, fill the rotated password, click submit. Verified locally end-to-end against a fresh Studio. * ci(mlx): add real Apple Silicon job on free macos-14 runner GitHub-hosted macos-14 is the M1 standard runner (3 vCPU, 7 GB RAM, 14 GB storage) and is FREE for public repositories per the GitHub Actions billing reference. Larger variants (macos-14-large, macos-14-xlarge) are billed; we deliberately avoid those. unslothai/unsloth and unslothai/unsloth-zoo are both public, so adding a single macos-14 job to MLX CI costs zero minutes against the org's billing quota while closing the only remaining gap the spoofed Linux job cannot reach: the actual Apple Silicon dispatch path. Specifically the new mlx-real-apple-silicon job: - Installs the real mlx and mlx-lm packages from PyPI. - Verifies platform.system()=='Darwin' and platform.machine()=='arm64' naturally, with no monkeypatch. - Imports unsloth and asserts unsloth._IS_MLX is True so the gate flips on real hardware as it is supposed to. - Smoke-imports every PR-A MLX-only module: mlx_loader, mlx_trainer, mlx_compile, mlx_utils, mlx_cce, gated_delta_vjp. These all do `import mlx.core as mx` at module level; this is the test that catches a future change to those modules that would only surface on a real Mac. - Re-runs the same three dispatch test files the Linux job runs. The monkeypatch spoofs still apply on real hardware, so this is also the canary that the spoofs do not collide with the real environment. The Linux job is unchanged. Both jobs trigger on the same path filter; mlx-real-apple-silicon caps at 15 minutes since the mlx install is heavier than the Linux dep set. * ci(mlx): install unsloth-zoo from git main on the macOS job The macOS Apple Silicon job failed on its first run with NotImplementedError: Unsloth currently only works on NVIDIA, AMD and Intel GPUs. surfaced from `unsloth_zoo.device_type.get_device_type()`. The cause is the version pin: `pip install 'unsloth_zoo>=2026.5.1'` resolves to the most recent PyPI wheel, which predates PR #620 and therefore predates the `_is_mlx_only` gate in `unsloth_zoo/__init__.py` that short-circuits the GPU device-type probe on Darwin+arm64+mlx. Switch to `pip install --no-deps "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo"` so the macOS job sees the merged main branch and exercises the actual MLX dispatch code. Studio's own `install.sh` does this for exactly the same reason. This is also the smoking gun the macOS runner exists to catch: the spoofed Linux job cannot reproduce a stale PyPI/zoo pairing because it never imports through device_type. The first real Mac run found the gap on its first try. * ci(mlx): expand macOS install ladder to match the Linux dep set The first attempt installed only mlx + mlx-lm + pytest + unsloth_zoo with --no-deps + unsloth -e --no-deps. That ladder under-specifies what the MLX import branch in unsloth/__init__.py actually needs: - The studio backend hardware module imports structlog at module top level. Without it tests/studio/test_hardware_dispatch_matrix.py fails at the very first `from utils.hardware import hardware as hw` with ModuleNotFoundError. - unsloth/__init__.py loads dataprep/raw_text.py via spec_from_file_location, which `from datasets import Dataset`. With --no-deps on unsloth-zoo neither datasets nor transformers nor any other shared dep got pulled in. Mirror the Linux job's working ladder, with two MAC-specific adjustments: - Drop bitsandbytes (CUDA-only). - Drop CPU torch (mlx replaces it on Apple Silicon, and unsloth-zoo already gates torch on `sys_platform != darwin or platform_machine != arm64`). - Install unsloth_zoo from git main WITH deps so pip resolves mlx + mlx-lm + mlx-vlm (gated on darwin+arm64 in the zoo's pyproject) plus the shared deps (datasets, transformers, sentencepiece, ...). Validated locally against a Linux mac-sim venv (platform spoofed to Darwin/arm64 via mlx_simulation, real datasets/transformers/structlog installed via the same ladder, fake mlx via the shim): - Step 1 _IS_MLX activation: OK - Step 2 import each of unsloth_zoo.mlx_{loader,trainer,compile,utils,cce} + unsloth_zoo.gated_delta_vjp + FastMLXModel + MLXTrainer surface: OK - Step 3 36 tests across the three dispatch files: 36 passed in 0.43s The Linux job (mlx-dispatch) is unchanged. * ci(mlx): version-pin every pip install, consolidate to one matrix job Pin every explicit pip install to an exact released version (latest as of 2026-05-07 within each project's existing constraint range) to reduce supply-chain surface and make rebuilds reproducible. unsloth-zoo on Linux is the pinned PyPI release; on macOS it stays on git main (PR-A is not yet on PyPI). Also fold the previously separate mlx-dispatch (Linux) and mlx-real-apple-silicon (macOS) jobs into a single matrix job with labels linux-cpu-spoof and macos-m1-real, sharing the dispatch test step so adding new MLX dispatch tests applies to both runners automatically. The Mac-only smoke steps (verify _IS_MLX flips True on real Apple Silicon, smoke-import every PR-A MLX-only module) remain gated on if: matrix.real_mlx. Validated locally against .macsim_venv3 with the pinned package set: 35 passed + 1 skipped, matching the prior unpinned run. * CI(ui): split Playwright into tests/studio/playwright_chat_ui.py + comprehensive coverage Move the inline Playwright Python out of the workflow YAML (which was unwieldy at 400+ lines of indented heredoc) into a real test file at tests/studio/playwright_chat_ui.py so it can be run locally against a fresh Studio install in addition to CI. The new test does the full first-run journey end-to-end through the UI: 1. /change-password through the UI (Setup your account / Choose a new password / Change password) -- previously the workflow rotated out-of-band via curl; now the test exercises the actual user form. 2. Default model assertion: /api/models/list[default_models][0] must match DEFAULT_MODELS_GGUF[0] from defaults.py (catches list reordering / lazy-loading regressions). 3. /api/inference/load via page.evaluate using the JWT pulled out of localStorage["unsloth_auth_token"] (gemma-3-270m, ~254 MiB cached). 4. Model picker: open the selector, type "qwen" and "llama" into the search bar, confirm the typeahead filters (does not select). 5. Five chat turns, each must render a non-empty assistant bubble. 6. Regenerate-last via the assistant action bar (best-effort). 7. Two extra turns AFTER regenerate (proves stream restart works). 8. Composer toggles (Thinking / Web search / Code execution) -- skipped gracefully when disabled for the loaded model. 9. Configuration sheet: drive every Radix slider to its minimum so temperature is 0 for downstream determinism. 10. Theme toggle x3 with deterministic computed-background-color assertion (light = body bg min(rgb)>220, dark = max(rgb)<60). View-transition animation disabled via add_init_script + reduced motion to keep clicks actionable. 11. Sidebar nav: New Chat, Compare, Search dialog, Recipes route. 12. Developer / API tab via the account menu (api-keys management surface reachable). 13. Recipes route: cards render + first-card click. 14. Recents (sidebar history): click a previous chat thread. 15. Image attachment widget reachable (vision response not asserted here -- gemma-3-270m is text-only). 16. Reload + session JWT survives. 17. /api/health remains healthy. 18. Negative-auth post-UI-rotation: bootstrap pw -> 401, NEW -> 200. 19. Out-of-band ("terminal") password rotation via subprocess(curl) to /api/auth/change-password (NEW -> NEW2). Confirms refresh tokens are revoked server-side and that an external password change invalidates the previous browser session's renew path. 20. Shutdown via the account-menu Shutdown menuitem + the AlertDialog "Stop server" button. Wait for the "Unsloth Studio has stopped" placeholder, then poll the listening port until it's closed -- verifies the server process actually exited. Verified locally end-to-end against a fresh Studio install (gemma-3-270m GGUF UD-Q4_K_XL, port 18892): rc=0, all 20 sections green. Workflow changes: - Drop the curl-based "Rotate password + load the GGUF" step. The test does change-password through the UI and load via page.evaluate so the bootstrap pw is the only thing CI hands the test. - Pin actions/upload-artifact@v4 to its commit SHA (v4.6.2) per the "pin all actions" rule. * CI(security): random-generated passwords in every workflow (no hardcoded creds) studio-ui-smoke.yml was the last holdout still using hardcoded rotated passwords (CIUiSmoke12345! / CIUiSmoke67890!). Generate them per-run via python -c 'import secrets; print(secrets.token_urlsafe(16))' and mask them into the log via GitHub Actions' ::add-mask::, matching the pattern already used in studio-inference-smoke.yml. If a workflow ever gets compromised (malicious dependency, leaked GITHUB_TOKEN, supply-chain attack on a pinned action), the rotated password is now unique to that single job run and is never readable from log output. An attacker cannot replay a hardcoded credential against a future / parallel Studio install elsewhere. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): consolidate to single Mac M1 job with robust no-mlx spoof Previously the workflow ran the dispatch tests on two matrix legs (linux-cpu-spoof + macos-m1-real), which duplicated the spoofed hardware matrix (it works identically on any host) while only the Mac leg covered Apple-specific real-mlx checks. Drop the Linux leg, rename the workflow to "MLX CI on Mac M1", and rely on the Mac runner alone -- it now runs the SAME spoofed matrix PLUS the three real-Apple-Silicon checks (real `_IS_MLX = True`, real mlx wheel smoke imports, no spoof collisions with the live environment). Also fix the `apple_silicon_no_mlx` profile so the spoof works on a real Mac with mlx genuinely installed. Studio's `_has_mlx()` does literal `import mlx.core` and catches `ImportError`, which the previous spoof (delete `sys.modules["mlx"]` + patch `find_spec`) could not block when mlx was on disk -- Python would re-find and import the real package. The fix installs a `MetaPathFinder` for the duration of the spoof that raises `ImportError` for `mlx` / `mlx.*`, faithfully simulating "mlx not installed" regardless of whether the host has the wheel. No change to the dispatch logic in unsloth or studio; the Mac runner now exercises every profile end to end with the real wheels installed. Validated locally on .macsim_venv3 with a stand-in `mlx` package on disk at .fakemlx_pkg/ to mimic the macos-14 runner: 35 passed + 1 skipped. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): real MLX training + inference smoke test on Mac M1 Add tests/studio/run_real_mlx_smoke.py and wire it into the macos-14 job as the final step. The script trains unsloth/gemma-3-270m-it for 7 deterministic LoRA steps on an in-memory dataset of the SAME row repeated: "<<HELLO!!>> My name is Unsloth!" then prompts the trained model with "<<HELLO!!>> My name is " and asserts the completion contains "Unsloth". Captures and asserts: - per-step training loss (via MLXTrainer.add_step_callback); - pre- and post-training loss + gradient norm (computed manually via mx.nn.value_and_grad over the training row, since MLXTrainer does not currently expose per-step grad norms); - losses are finite, do not diverge, and post-train loss < pre-train; - grad norms are finite and positive; - the inference output contains "Unsloth". Determinism: seeds python random, numpy, and mlx.core.random; passes random_state=SEED to FastMLXModel.from_pretrained and get_peft_model (both invoke _seed_mlx_random_state internally) and seed=SEED to MLXTrainingConfig (drives batch shuffling). Uses fp16 + no quant (gemma-3-270m is small enough to skip 4-bit) and LoRA r=8 on the four attention projections. This is the only place in CI that exercises a real MLX backward pass + optimizer step + mlx_lm.generate call. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): add LoRA + merged_16bit + GGUF export round-trip checks After the 7-step LoRA training run finishes and the in-memory inference assertion passes, the smoke test now exports the trained model in three formats, drops the in-memory model + trainer to reclaim memory, and reloads each export from disk to re-run the "<<HELLO!!>> My name is " inference assertion. Each reload is expected to still complete with "Unsloth" -- catching round-trip regressions where the saved weights silently corrupt or fail to load. Formats exercised: - LoRA adapter via model.save_pretrained_merged(save_method="lora"). Reloaded with FastMLXModel.from_pretrained on the adapter dir; the loader auto-detects adapter_config.json and pulls down the base model. - Merged 16-bit via model.save_pretrained_merged(save_method= "merged_16bit"). Fuses LoRA into the base, dequantizes to fp16, saves an HF-compatible safetensors directory. Reload via FastMLXModel.from_pretrained on the saved dir. - GGUF via model.save_pretrained_gguf(quantization_method= "not_quantized"). Builds llama.cpp via cmake on the runner with GGML_METAL=ON (only the llama-cli, llama-quantize, and llama-gguf-split targets), then runs the produced bf16 GGUF through llama-cli with a fixed seed and asserts "Unsloth" in stdout. GGUF infra failures (cmake / build / convert) are surfaced as RuntimeError so we notice -- if Mac CI starts hitting build flakes the assertion can be softened. Workflow timeout bumped 15 -> 25 min to budget for the llama.cpp cmake build (~5-7 min on the macos-14 standard runner). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): cold-start LoRA / merged / GGUF reloads + per-phase metrics Restructure the MLX smoke test into a multi-step workflow that exercises the export round-trip the way real users hit it: each reload runs in a FRESH Python process (not a continuation of the still-running trainer), and each step emits a JSON metrics file with elapsed time + peak GPU memory + peak RSS for regression detection. Steps (each on the macos-14 M1 standard runner, FREE for public repos): 1. TRAIN + SAVE 3 formats - Load unsloth/gemma-3-270m-it (fp16, no quant). - Apply LoRA r=8 on q/k/v/o. - Pre-train + post-train loss + grad norm probe via mx.nn.value_and_grad on the training row. - Train 7 deterministic steps, batch_size=2, gradient_accumulation_steps=3 (42 sequences trained), capture per-step loss via add_step_callback. - In-memory generate -> assert "Unsloth" appears. - Save LoRA, merged_16bit, GGUF. - Emit mlx_workdir/train_metrics.json. 2. RELOAD LoRA (fresh process) FastMLXModel.from_pretrained(lora_dir) cold-load + generate + assert "Unsloth" appears. Emits lora_reload_metrics.json. 3. RELOAD merged_16bit (fresh process) Same flow on the merged HF directory. 4. RELOAD GGUF via llama-cli (fresh process) Conditional on train_metrics.json:gguf_supported. Spawns the llama-cli built by save_pretrained_gguf with --temp 0 --seed 3407 -no-cnv and asserts "Unsloth" in stdout. The per-phase metrics step prints all four JSON files so regressions are visible in the job log. Pin unsloth_zoo to fix/mlx-export-roundtrip-on-apple-silicon while unslothai/unsloth-zoo#627 is in review -- it carries: - llama_cpp.py: catch NotImplementedError too when importing device_is_bf16_supported (device_type module-level call raises on Apple Silicon). - mlx_loader.py: don't wipe local_path when config.json is missing, otherwise FastMLXModel.from_pretrained(lora_dir) can't see adapter_config.json. The earlier draft of this script had a workaround that copied the base model's config.json into the LoRA save dir; with #627 the workaround is removed, the cold-start LoRA reload works on the saved adapter directory directly. Workflow timeout already 25 min for the llama.cpp cmake build. * CI(studio): always-upload artifacts + gate /api/system + path/health plumbing Three small but high-signal changes that came out of an audit of how much Studio surface CI actually exercises: 1. Every studio-*-smoke.yml workflow now uploads its artifacts on `if: always()` instead of `if: failure()`. On green runs the screenshots + studio.log are now reviewable in the Actions UI, which closes the "passed but the UI is silently broken" hole. SHA-pinned to actions/upload-artifact@v4.6.2 across all 7 upload steps (was a mix of @v4 unpinned + the SHA-pin). 2. /api/system and /api/system/hardware now require a Bearer token (Depends(get_current_subject)). Today they leak Python version, GPU name, total memory, and the ML package set without auth -- fine on a single-user Tauri box, not fine on -H 0.0.0.0 / Colab / a Tauri-relayed setup. /api/system/gpu-visibility was already gated; now /api/system + /api/system/hardware match it. 3. Path filters + health-wait plumbing: - studio-ui-smoke.yml now triggers on tests/studio/** so a PR that ONLY edits the Playwright test file actually runs UI CI. - studio-tauri-smoke.yml now triggers on unsloth_cli/** so a CLI rename or signature change that breaks Tauri's spawned `unsloth studio` actually runs Tauri CI. - The 60s `/api/health` wait loop in studio-ui-smoke.yml + studio-inference-smoke.yml (3 jobs) is now 180s. Cold runners with venv warm-up + lazy imports have been observed exceeding 60s, and the cost of a false-fail is much higher than two extra minutes of waiting. * CI(ui): STUDIO_UI_STRICT mode + theme cycle fix + Recents thread-match assertion The existing UI test was passing too easily: every "if button.count() == 0: log WARN" branch silently degraded into a green run. Three places this hid real bugs: 1. The theme toggle for-loop bailed after cycle 1 because the Radix Account-menu's data-state="open" lingered through the view-transition and the next acct.click() hit the still-open dropdown. The test went green observing only one polarity. 2. The regenerate button branch silently skipped when the assistant action bar didn't render (every CI run so far -- the locator was wrong, but no one noticed because it was a soft skip). 3. The Recents click accepted ANY non-nav sidebar entry, so a freshly deleted thread or an unrelated entry would still pass. Fixes: - Add STUDIO_UI_STRICT=1 env (default on in CI via workflow, default off locally). When on, every soft "if not visible: log WARN" branch hard-fails. The strict-skip pattern is centralised in a soft_fail() helper so the local-vs-CI split is one knob. - Theme toggle: wait for [role="menu"] to detach between cycles (the dropdown stay-open was the cycle-2 bail), assert the loop actually ran 3 times. - Model picker search: capture popover text after typing "qwen" vs "llama"; the two snapshots must DIFFER, proving the typeahead actually filters (a regression that rendered the picker but ignored input would silently pass before). - Recents click: after navigating to the clicked thread, the rendered turns must include at least one of our sent prompts ("hello", "world", "tree", "1+1", etc.) -- proves we landed on OUR thread, not a leftover from a previous run. - Use [data-tour="chat-model-selector"] as the primary selector for the model picker -- the guided-tour anchor is at least as stable as anything else in the codebase (the tour breaks if it moves), and there's no separate data-testid system to maintain. * CI(studio): new Studio API & Auth Tests workflow + integration test HTTP-level integration smoke for the Studio FastAPI surface, no Playwright. ~30 s per run on warm cache. Boots a fresh Studio, then asserts: 1. CORS hardening -- no wildcard-origin + credentials=true; cross- origin GET / does not leak the bootstrap password to evil.example. 2. /api/system + /api/system/hardware + /api/system/gpu-visibility all require auth (closes the info-disclosure leak). 3. Auth state machine -- rotation invariants (old=401, new=200), refresh-without-body returns 4xx, login burst documents the current "no rate-limit" behaviour so future hardening updates the test in the same PR. 4. JWT-expiry forgery -- mint a JWT with exp=now-1 using the install's own secret + assert it returns 401. 5. API key lifecycle E2E -- create -> list -> use against /v1/chat/completions -> delete -> verify 401. 6. Auth file-mode hardening (Linux only): auth/ is 0700, auth.db + -wal + -shm + .bootstrap_password are 0600. 7. Inference lifecycle gaps -- /v1/models lists the loaded model, /v1/embeddings + /v1/responses return 200 OR structured 4xx, bogus gguf_variant rejected, force-reload swaps the llama-server PID. 8. Endpoint-by-endpoint auth audit -- pins the EXPECTED auth posture for known routes; an unauthenticated /api/shutdown is rejected BEFORE the shutdown trigger fires. Reuses the same GGUF cache key as studio-ui-smoke.yml so the model download is one cache-hit across CI. Random per-run rotated passwords + ::add-mask:: pattern matches studio-ui-smoke.yml + studio-inference-smoke.yml. * CI(ui): add second Playwright job covering Compare/Recipes/Export/Studio/Settings The first Chat UI Tests step ends by clicking the Shutdown menuitem, which leaves the server dead. So a SECOND Studio is booted on port 18894 in the same job (warm install -- adds ~3-5s) and a second Playwright test exercises the routes the chat UI doesn't touch: 1. /chat?compare=... -- assigns two models, sends 2 prompts, asserts both panes respond (so 4 total new assistant bubbles). 2. /data-recipes -- clicks the first template card, verifies the React-Flow canvas mounts. 3. /export -- in chat-only mode (CI default) asserts the route redirects; in non-chat-only asserts [data-tour='export-cta'] + HF token field exist. 4. /studio -- chat-only redirects, non-chat-only asserts the three tabs (Configure / Current run / History) + [data-tour='studio-*'] anchors exist. 5. Settings dialog -- Cmd/Ctrl-, opens it, cycles through every visible tab (General / Profile / Appearance / Chat / Developer / About), asserts each tab body is non-trivial. Same STRICT=1 mode + soft_fail() pattern as playwright_chat_ui.py. Both Playwright runs' screenshots + studio logs are bundled into the existing studio-ui-smoke-artifacts upload; the artifact name doesn't change. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): fresh-process reloads + soft-skip GGUF on llama.cpp limitation Re-apply the subcommand restructure that was lost during the earlier rebase conflict (the linter pre-commit on the remote re-formatted the single-function version, so my checkout --ours kept the wrong copy). Adds: * argparse subcommands `train` and `reload --format X --dir D` so each reload runs in a FRESH Python process the way real users hit the cold-start path. * Per-phase Phase() context manager records elapsed wall-clock, peak GPU memory (mx.metal.get_peak_memory), and peak RSS (resource.getrusage) into a metrics dict written to {train,lora_reload,merged_reload,gguf_reload}_metrics.json next to the saved dir for cross-CI regression detection. * batch_size=2, gradient_accumulation_steps=3 (was 2/1) so the 7-step run sees 42 sequences total. * GGUF save is best-effort. unsloth-zoo#627 fixed the NotImplementedError on Apple Silicon, but llama.cpp's convert_hf_to_gguf currently asserts on the gemma-3-270m tokenizer vocab (`max(vocab IDs) >= vocab_size`). That's a downstream llama.cpp limitation, not an unsloth_zoo bug, so the train step records gguf_supported=false + the reason instead of raising, and the GGUF reload step emits a workflow warning and exits 0. The LoRA + merged_16bit reload assertions remain the gating signal. The earlier-draft LoRA workaround that copied base config.json into the LoRA save dir is removed; unsloth-zoo#627 makes FastMLXModel.from_pretrained(lora_dir) work on the saved adapter directory directly (the failing run before #627 confirmed the bug, the run after #627 lands shows the adapter is detected and the base model is pulled from adapter_config.json:base_model_name_or_path). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(mlx): expand LoRA targets to MLP + bump generation budget With batch_size=2 / gradient_accumulation_steps=3 (effective batch of 6) the q/k/v/o-only LoRA collapsed in 7 steps -- training loss kept dropping (0.55 vs the previous 1.02 with grad_accum=1) but inference output the structural skeleton ("My name") without recovering the specific "Unsloth" token. Switching to the standard unsloth target set (q/k/v/o + gate/up/down) gives the LoRA enough capacity to memorize the training row at the larger effective batch. Also bump max_tokens 24 -> 48 for the in-memory + reload generation calls so the model has more room to spew the memorized sequence; we still assert "Unsloth" appears…

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

danielhanchen force-pushed the fix/mlx-export-roundtrip-on-apple-silicon branch from 6aae63c to 2c14eab Compare May 7, 2026 04:09

danielhanchen added 2 commits May 7, 2026 08:01

Make comments more succinct and platform-generic

fdbe676

danielhanchen merged commit e9d1be8 into main May 8, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mlx): unblock GGUF export and LoRA reload on Apple Silicon#627

fix(mlx): unblock GGUF export and LoRA reload on Apple Silicon#627
danielhanchen merged 3 commits into
mainfrom
fix/mlx-export-roundtrip-on-apple-silicon

danielhanchen commented May 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	except (json.JSONDecodeError, KeyError):
	except (json.JSONDecodeError, OSError):

Conversation

danielhanchen commented May 7, 2026

Summary

Bug 1: llama_cpp.py only catches ImportError from device_type module

Bug 2: FastMLXModel.from_pretrained wipes local_path on missing config.json

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 1: `llama_cpp.py` only catches `ImportError` from `device_type` module

Bug 2: `FastMLXModel.from_pretrained` wipes `local_path` on missing `config.json`