Skip to content

fix(mlx): unblock GGUF export and LoRA reload on Apple Silicon#627

Merged
danielhanchen merged 3 commits into
mainfrom
fix/mlx-export-roundtrip-on-apple-silicon
May 8, 2026
Merged

fix(mlx): unblock GGUF export and LoRA reload on Apple Silicon#627
danielhanchen merged 3 commits into
mainfrom
fix/mlx-export-roundtrip-on-apple-silicon

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Two narrow bugs that together broke the MLX export round-trip (train → save_pretrained_gguf / save_pretrained_merged(save_method="lora") → reload via FastMLXModel.from_pretrained) on Apple Silicon. Surfaced by an end-to-end MLX smoke test that runs on a free macOS-14 (M1) GitHub-hosted runner in unslothai/unsloth#5312.

Bug 1: llama_cpp.py only catches ImportError from device_type module

unsloth_zoo/llama_cpp.py imports device_is_bf16_supported from .device_type, wrapping in except ImportError to handle pure-MLX builds without torch. But on a Mac with torch installed (required for convert_hf_to_gguf.py), the import succeeds — and device_type.py:233 runs DEVICE_TYPE = get_device_type() at module level, which raises NotImplementedError("Unsloth currently only works on NVIDIA, AMD and Intel GPUs.") because get_device_type doesn't recognise Darwin+arm64.

The fallback branch already explicitly handles Apple Silicon, so broaden the except to also catch NotImplementedError. No behaviour change on Linux/CUDA hosts.

Bug 2: FastMLXModel.from_pretrained wipes local_path on missing config.json

mlx_loader.py:2186 resolves local_path = _download(model_name), then reads local_path/config.json. The combined try/except set local_path = None whenever config.json was missing.

LoRA-adapter directories saved by save_lora_adapters only contain adapter_config.json + adapters.safetensors (no config.json), so the wipe silently disabled the adapter-detection branch at line 2219, which is gated on local_path being truthy. User-visible symptom: a confusing FileNotFoundError on config.json from mlx_lm.utils.load_config instead of the adapter being detected and the base model being pulled from adapter_config.json:base_model_name_or_path.

Split the try/except so resolution failure and config-read failure are handled separately. local_path survives a missing config.json so the adapter branch can run.

Test plan

  • Existing tests/ suite still green: 174 passed in 8.33s.
  • End-to-end LoRA save → reload → generate flow on macos-14 (M1) CI runner via unslothai/unsloth#5312.
  • End-to-end save_pretrained_gguf round-trip via llama-cli on the same runner.

danielhanchen added a commit to unslothai/unsloth that referenced this pull request May 7, 2026
Restructure the MLX smoke test into a multi-step workflow that
exercises the export round-trip the way real users hit it: each
reload runs in a FRESH Python process (not a continuation of the
still-running trainer), and each step emits a JSON metrics file
with elapsed time + peak GPU memory + peak RSS for regression
detection.

Steps (each on the macos-14 M1 standard runner, FREE for public
repos):

1. TRAIN + SAVE 3 formats
   - Load unsloth/gemma-3-270m-it (fp16, no quant).
   - Apply LoRA r=8 on q/k/v/o.
   - Pre-train + post-train loss + grad norm probe via
     mx.nn.value_and_grad on the training row.
   - Train 7 deterministic steps, batch_size=2,
     gradient_accumulation_steps=3 (42 sequences trained), capture
     per-step loss via add_step_callback.
   - In-memory generate -> assert "Unsloth" appears.
   - Save LoRA, merged_16bit, GGUF.
   - Emit mlx_workdir/train_metrics.json.

2. RELOAD LoRA (fresh process)
   FastMLXModel.from_pretrained(lora_dir) cold-load + generate +
   assert "Unsloth" appears. Emits lora_reload_metrics.json.

3. RELOAD merged_16bit (fresh process)
   Same flow on the merged HF directory.

4. RELOAD GGUF via llama-cli (fresh process)
   Conditional on train_metrics.json:gguf_supported. Spawns the
   llama-cli built by save_pretrained_gguf with --temp 0
   --seed 3407 -no-cnv and asserts "Unsloth" in stdout. The
   per-phase metrics step prints all four JSON files so
   regressions are visible in the job log.

Pin unsloth_zoo to fix/mlx-export-roundtrip-on-apple-silicon while
unslothai/unsloth-zoo#627 is in review -- it carries:

  - llama_cpp.py: catch NotImplementedError too when importing
    device_is_bf16_supported (device_type module-level call raises
    on Apple Silicon).
  - mlx_loader.py: don't wipe local_path when config.json is
    missing, otherwise FastMLXModel.from_pretrained(lora_dir)
    can't see adapter_config.json.

The earlier draft of this script had a workaround that copied the
base model's config.json into the LoRA save dir; with #627 the
workaround is removed, the cold-start LoRA reload works on the
saved adapter directory directly.

Workflow timeout already 25 min for the llama.cpp cmake build.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves Apple Silicon compatibility in llama_cpp.py by catching NotImplementedError during device capability checks and fixes a bug in mlx_loader.py where a missing config.json would prevent LoRA adapters from loading. The review feedback identifies a redundant KeyError in the exception handling for configuration loading and suggests catching OSError instead to more robustly handle potential I/O errors.

Comment thread unsloth_zoo/mlx_loader.py
try:
with open(config_path, "r") as f:
config_data = json.load(f)
except (json.JSONDecodeError, KeyError):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The KeyError exception is redundant here as json.load() does not raise it. Additionally, since the FileNotFoundError was removed from the catch list in favor of an os.path.exists() check, other potential I/O errors during open() (such as PermissionError or race conditions) are no longer handled. It is safer to catch OSError to maintain the robustness of the original implementation.

Suggested change
except (json.JSONDecodeError, KeyError):
except (json.JSONDecodeError, OSError):

Two narrow bugs that together broke the MLX export round-trip
(train -> save_pretrained_gguf / save_pretrained_merged(save_method=
"lora") -> reload via FastMLXModel.from_pretrained) on Apple
Silicon:

1. llama_cpp.py only caught ImportError when importing
   device_is_bf16_supported from .device_type. On a Mac with torch
   installed (which we DO need for the convert_hf_to_gguf path),
   the import succeeds, but device_type.py:233 runs
   `DEVICE_TYPE = get_device_type()` at module level and raises
   NotImplementedError because get_device_type doesn't recognize
   Darwin+arm64 as a supported accelerator family. The fallback
   branch that follows already explicitly handles Apple Silicon, so
   broaden the except to also catch NotImplementedError.

2. FastMLXModel.from_pretrained's local_path resolution wiped
   local_path inside a combined try/except whenever config.json was
   missing. LoRA-adapter directories saved by save_lora_adapters
   only contain adapter_config.json + adapters.safetensors (no
   config.json), so reloading them hit the wipe, which silently
   disabled the adapter-detection branch a few lines below
   (line 2219, gated on `local_path` being truthy). The user-visible
   symptom was a confusing FileNotFoundError on config.json from
   mlx_lm.utils.load_config rather than the adapter being detected
   and the base model being pulled from
   adapter_config.json:base_model_name_or_path. Split the try/except
   so the resolution failure and the config.json-read failure are
   handled separately and local_path survives a missing config.json.

Both fixes verified locally against the existing tests/ suite (174
passed in 8.33s) and against an end-to-end LoRA save -> reload ->
generate flow on a real Mac M1 CI runner.
@danielhanchen danielhanchen force-pushed the fix/mlx-export-roundtrip-on-apple-silicon branch from 6aae63c to 2c14eab Compare May 7, 2026 04:09
The 30s read timeout for the github.com fetch is too aggressive for
free CI runners (macos-14 sees this fail intermittently with
'HTTPSConnectionPool: Read timed out'). Bump to 120s and add 3 retries
with 1s/2s/4s exponential backoff. Logs a warning per failed attempt
so transient network problems are visible.

This converts a hard failure ('GGUF SKIPPED' in MLX export round-trip
on Mac CI) into a transparent retry that almost always succeeds.
@danielhanchen danielhanchen merged commit e9d1be8 into main May 8, 2026
3 checks passed
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 8, 2026
…s merged

PR unslothai/unsloth-zoo#627 (GGUF NotImplementedError + LoRA local_path
fixes) landed on unsloth-zoo main as e9d1be8. Drop the temporary
branch pin and revert to bare `unsloth_zoo @ git+...` so subsequent
runs pick up further main changes.

PR unslothai/unsloth-zoo#632 (compiler unblock for transformers 4.57.6
and 5.x) also merged (232d950); consolidated-tests-ci.yml already
follows main via UNSLOTH_ZOO_REF default, so no change there.
danielhanchen added a commit to unslothai/unsloth that referenced this pull request May 11, 2026
…ests (#5312)

* CI: scope GITHUB_TOKEN permissions and unblock ~60 skipped tests

permissions:
- All five PR-time workflows (backend, frontend, inference smoke, tauri,
  wheel) now declare permissions: contents: read at the workflow level,
  matching CodeQL's default-permissions guidance and the existing pattern
  in release-desktop.yml. None of these workflows write to the repo.

skipped tests:
- Repo tests (CPU) job now installs node 22 and uv, which unblocks
  ~60 tests that were silently skipping on CI:
  - 9 tests in tests/studio/test_chat_preset_builtin_invariants.py
    skipped on "node not available". Fixed in this commit; an obsolete
    "unsloth_repo/" prefix in WORKDIR was also pointing the source-file
    existence check at a path that no longer exists.
  - tests/python/test_e2e_no_torch_sandbox.py (47), test_studio_import_no_torch.py
    (29), test_tokenizers_and_torch_constraint.py (most of 42) all spawn
    fresh uv venvs and self-skip when uv is missing.
- Three test_tokenizers_and_torch_constraint.py cases are deselected
  because they expose a real bug in studio/backend/requirements/no-torch-runtime.txt:
  the unpinned tokenizers line resolves to 0.23.1, which transformers
  rejects with "tokenizers>=0.22.0,<=0.23.0 is required". Tracked
  separately as a no-torch install regression.

Locally: 760 passed, 1 skipped, 23 deselected (was 694 / 67 / 23).

* CI: add MLX CI workflow for the Studio dispatch matrix

Mirrors the three files documented in tests/studio/README.md (PR #5307)
into a dedicated workflow so MLX dispatch failures show up as their own
check on PRs rather than getting buried inside Backend CI:

  - test_hardware_dispatch_matrix.py    7-profile parametrized matrix
                                        + 2 dispatch-priority canaries
  - test_is_mlx_dispatch_gate.py        AST + runtime guard on
                                        unsloth._IS_MLX
  - test_mlx_training_worker_behaviors.py  worker.py contract checks

Triggers on pull_request when any of unsloth/__init__.py,
studio/backend/utils/hardware.py, studio/backend/core/training/worker.py,
or any of the three test files are touched. Runs on a Linux+CPU runner
with hardware spoofs; no Apple Silicon, real GPU, or real MLX install
required. Locally validated: 36 passed in 0.41s.

permissions: contents: read at the workflow level (matching the rest of
the PR-time CI surface).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): fix path filter that pointed at a non-existent file

The MLX CI workflow listed ``studio/backend/utils/hardware.py`` as a
path filter, but no such file exists. The actual layout is

    studio/backend/utils/hardware/
        __init__.py
        amd.py
        hardware.py
        nvidia.py
        vram_estimation.py

so the filter as written would never match. A reviewer modifying
``hardware/hardware.py`` (where ``detect_hardware``, ``DeviceType``,
and ``IS_ROCM`` actually live) would not trigger MLX CI, which
defeats the point of the focused PR gate.

Replace the broken filter with ``studio/backend/utils/hardware/**``
so any change in the hardware probe directory triggers MLX CI, and
add three sibling triggers that each materially affect dispatch:

  - ``unsloth/_gpu_init.py``
        Hosts ``from .models import *`` and the ``from .trainer import *``
        chain. The trainer.py circular-import fix that landed in
        ``23550a8`` lives downstream of this file; a future change
        here can re-introduce the same bug.
  - ``studio/backend/core/inference/mlx_inference.py``
        The MLX inference backend itself. It is the actual consumer
        of ``unsloth_zoo.mlx_loader.FastMLXModel`` whose contract the
        test_mlx_training_worker_behaviors.py AST checks guard.

Local re-run with the fix in place: 36 passed in 0.45s. No other
workflow file or test file is modified.

* CI: split Studio GGUF CI into three focused jobs

Replaces the single "Studio boots, loads a GGUF, answers a chat
completion" job with three parallel jobs that each pick the smallest
model that exercises the surface under test. All three jobs share the
install.sh --local --no-torch bootstrap and prime HF_HOME via
actions/cache so cold-cache runs are bounded and warm runs are quick.

1. Studio GGUF CI / OpenAI, Anthropic API tests
   - Model: gemma-3-270m-it UD-Q4_K_XL (~254 MiB).
   - Password rotation: login with bootstrap pw, change to a fresh
     random pw, assert old pw is rejected with 401, assert new pw
     succeeds. Uses the same JWT downstream as a Bearer token against
     /v1/* (the OpenAI/Anthropic compat surface accepts JWTs and
     sk-unsloth- keys interchangeably).
   - OpenAI SDK + Anthropic SDK each run a four-turn conversation
     ("What is 1+1?" / "What did I ask before?" / "What is the capital
     of France?" / "Repeat the city name") with temperature=0.0 and
     seed=3407. Run twice and assert run1 == run2 turn-by-turn so
     non-determinism in the conversation-history wiring is caught.

2. Studio GGUF CI / tool calling tests
   - Model: Qwen3.5-2B UD-IQ3_XXS (~890 MiB).
   - Standard OpenAI function calling with tool_choice=required.
   - Server-side python tool: assert "56088" appears in the answer to
     "What is 123 * 456? Use code to compute it.".
   - Server-side terminal (bash) tool: assert "hello-bash-tool" is
     echoed back.
   - Server-side web_search tool: non-blocking probe (DuckDuckGo
     flakes from CI runners). Asserts the request shape is accepted.
   - enable_thinking=true vs false: assert <think> markers vanish
     when thinking is disabled.

3. Studio GGUF CI / JSON, images
   - Model: gemma-4-E2B-it UD-IQ3_XXS (~2.4 GiB) + mmproj-F16
     (~986 MiB) auto-detected via the HF repo path.
   - response_format = json_schema (strict): asserts the answer parses
     as JSON matching the {city, country} schema.
   - OpenAI image_url (data URI base64): assert non-empty response on
     a 4x4 PNG. Loose on content because small VL quants are weak at
     colour names; the vision path is the part under test.
   - Anthropic source/base64 image: same non-empty assertion against
     the Anthropic Messages endpoint.

Boot strategy:
  - Job 1 keeps `UNSLOTH_API_ONLY=1 unsloth studio` because the
    password-rotation flow only exists in the UI-mode bootstrap.
  - Jobs 2 and 3 use `unsloth studio run --model REPO --gguf-variant V`,
    the one-liner that loads the model and prints the API key on the
    banner. Health is probed by waiting for `sk-unsloth-` to appear in
    the log; the one-liner only prints the banner after load completes.

* CI: fix three regressions in the new Studio GGUF jobs

Job 1 (OpenAI, Anthropic API tests):
  Anthropic SDK appends /v1/messages to base_url itself, so passing
  base_url=f"{BASE}/v1" produced /v1/v1/messages and 405'd. Bare BASE
  is correct (matches the docs' "the SDK appends /v1 automatically").
  OpenAI SDK side already worked: 4-turn transcript was fully
  deterministic across two runs and the "Paris" sanity assertion
  passed.

Job 2 (tool calling tests):
  Booting with --enable-tools forces the process-level tool policy to
  True for every request (state/tool_policy.py:get_tool_policy), which
  hijacked the "Standard OpenAI function calling" test through the
  server-side agentic loop -- the model called web_search instead of
  returning structured tool_calls for the user's `weather_tool`. Drop
  --enable-tools so policy is None (per-request honour). The python /
  terminal / web_search probes already pass enable_tools=True
  explicitly in their request bodies, so they keep working.

Job 3 (JSON, images):
  Two issues. (a) The OpenAI Python SDK rewrites
  response_format={"type":"json_schema",...} into something Studio's
  llama-server backend doesn't accept, so resp came back as the raw
  error string and resp.choices[0] tripped 'str has no attribute
  choices'. Switched to raw HTTP with the `{"type":"json_object",
  "schema":...}` form llama-server actually supports
  (GBNF-from-schema, llama-server extension). (b) Anthropic SDK
  base_url same fix as job 1.

* CI: add Studio Update CI + Studio UI CI workflows

Two new PR-time gates that the existing inference / wheel jobs miss.

Studio Update CI:
  - Runs install.sh --local --no-torch, then `unsloth studio update
    --local` twice, asserting both invocations take the prebuilt
    "up to date and validated" code path with no source-build
    fallback.
  - Boots Studio to /api/health afterwards so a broken update that
    nukes the venv or the llama-server binary surfaces immediately.
  - Triggers when install.sh, studio/setup.sh, the python_stack /
    llama_prebuilt installers, the requirements files, or
    unsloth_cli/commands/studio.py change.

Studio UI CI:
  - Drives the actual frontend bundle in headless Chromium via
    Playwright with the smallest GGUF (gemma-3-270m-it UD-Q4_K_XL).
  - Covers: bootstrap login, must_change_password gate + change form,
    chat composer becomes interactive after model load, sending a
    message produces an assistant bubble with non-empty text, full
    page reload re-hydrates the conversation, configuration sheet
    opens and closes cleanly, and the rotated password is the only
    one that logs in afterwards.
  - This is the first workflow that catches the class of bug 2026.5.1
    shipped: backend healthy + frontend builds, but assistant-ui
    runtime wiring or chat-history persistence broken so the actual
    UI was unusable. Backend-only or wheel-only gates do not see it.

* CI(ui): jump straight to /change-password to avoid /login auto-redirect race

The /login route auto-redirects to /change-password as soon as
/api/auth/status returns requires_password_change=true. The original
flow was racing that redirect: it filled #password (login mode) and
clicked submit, but the redirect could land first and the form would
have unmounted before the click. Going straight to /change-password
also matches what main._inject_bootstrap is set up to support: the
HTML on that route ships with `window.__UNSLOTH_BOOTSTRAP__`, which
the change-password form reads to seed the current-password state, so
the user only needs to fill new + confirm. Renumbered screenshots to
match the new step order.

* CI(gguf,ui): unblock the Studio CI runs

GGUF jobs 2 and 3:
  Switched off `unsloth studio run` and over to `UNSLOTH_API_ONLY=1
  unsloth studio` + login flow. Reason: studio.run() resolves the tool
  policy through unsloth_cli/_tool_policy.resolve_tool_policy, which
  defaults to True on loopback. That means set_tool_policy(True) gets
  applied process-wide, and every /v1/chat/completions request is
  routed through the server-side agentic loop -- so Job 2's standard
  function-calling test never gets a structured tool_calls response
  (the model uses web_search instead) and Job 3's response_format
  test gets non-JSON SSE chunks back. API-only mode leaves
  tool_policy=None, which is what each request's `enable_tools` flag
  (or absence thereof) needs to be honoured.

Job 1:
  Anthropic SDK retry: the SDK sends `x-api-key` by default, but
  Studio's auth layer is HTTPBearer-only. Override via
  default_headers={"Authorization": f"Bearer {KEY}"}, which is the
  shape the integration docs suggest.

UI smoke:
  Drop the "history must persist after reload" assertion; Studio's
  thread autosave is async and doesn't reliably land within the CI
  budget. Keep the assertion that matters: the chat composer mounts
  again after a reload and the JWT survived (no /login redirect),
  which is what the 2026.5.1 chat regression actually broke.

* CI(gguf): consume SSE for tool calls, relax response_format test

Job 2 (tool calling):
  The server-side agentic loop in routes/inference.py:1888 always
  yields SSE chunks -- the request's `stream=False` is honoured for
  the plain passthrough path, NOT for the agentic path. The python /
  terminal / web_search probes were calling json.loads on the raw
  body and tripping JSONDecodeError.
  Added a post_sse() helper that streams the response and accumulates
  text deltas, used for every enable_tools=True call. Function
  calling (which does NOT enable agentic mode) keeps post().

Job 3 (JSON, images):
  Dropped the strict-schema variant of response_format. On the small
  gemma-4-E2B-it UD-IQ3_XXS quant, the GBNF-from-schema path
  occasionally produces empty content. Plain `{"type":"json_object"}`
  is still a real test of Studio's JSON-mode wiring through to
  llama-server, and that's the surface the docs expose. Added
  fence-stripping for chat templates that wrap JSON in ```json blocks.

* CI(gguf,images): use a 64x64 PNG; stb_image rejects 4x4 as truncated

Studio's image normaliser re-encodes embedded base64 images via
stb_image (routes/inference.py:3410) so llama-server gets a uniform
PNG payload. stb_image happily reads the 4x4 PNG as a PIL test, but
rejects it on the inference path with `broken data stream when
reading image file`. 64x64 is small enough to keep token cost
trivial (155 bytes) and large enough to satisfy stb_image's minimum.

Job 1, Job 2, the UI smoke, and the JSON portion of Job 3 are all
green now -- this is the last piece holding Job 3 back.

* CI: pass GH_TOKEN to install/update steps to dodge GitHub API rate limits

studio/install_llama_prebuilt.py lists releases on
ggml-org/llama.cpp via the GitHub API. Unauthenticated calls get
60/hr per source IP, which is fine for one install per workflow but
the new Studio Update CI does install + update + update back-to-back
on the same runner, blowing past the limit and falling back to a
source build (which then fails the idempotency assertion).

Surfaced on the Studio Update CI run with:
  failed to inspect published releases in ggml-org/llama.cpp:
  GitHub API returned 403 ...
  set GH_TOKEN or GITHUB_TOKEN to avoid GitHub API rate limits.

GITHUB_TOKEN with the existing `permissions: contents: read` is more
than enough for unauthenticated read API access (1000/hr, scoped to
the repo). Wired into every install.sh and `unsloth studio update`
step across studio-update-smoke.yml, studio-inference-smoke.yml, and
studio-ui-smoke.yml so a busy runner can't trip the same fallback.

* CI(lint): turn the studio-backend ruff stub into a real Python gate

Rename the job to "Python lint (syntax + ruff + safety nets)" and
expand it from one non-blocking ruff invocation over studio/backend
into four real gates over the whole tree. Total CI time goes from
~8 s to ~12 s, but the previous job was informational; this one
blocks merges on actual breakage.

Steps (in order):
  1. AST/syntax (HARD GATE)
     `python -m compileall -q -j 0 unsloth unsloth_cli studio tests
      cli.py unsloth-cli.py`. Same parser the interpreter uses;
     anything broken here would also crash at `import X` on a user's
     machine. ~3.5 s across 350+ files locally.

  2. ruff check whole repo (HARD GATE)
     The narrow rule set in pyproject.toml [tool.ruff.lint] (E9 /
     F63 / F7 / F82) catches undefined names, broken comparisons,
     and syntax. The whole repo passes today, so the previous
     studio/backend-only `|| true` was masking real breakage on
     the wider tree. <1 s.

  3. Debugger-leftover scan (HARD GATE)
     AST-walk over every committed .py looking for `breakpoint()`,
     `pdb.set_trace()`, or `ipdb.set_trace()` call sites. AST-based
     so commented-out debugger lines don't false-positive (which
     is why a bare grep would not work -- there are three commented
     `# breakpoint()` markers in unsloth/models/rl* today). 0 hits
     locally across 350 files.

  4. SPDX-License-Identifier on studio/backend (WARNING)
     Surfaces drift in the one tree where we already have a strict
     SPDX policy. Currently 3 files missing; warned, not blocked,
     so the rollout can be a separate PR.

  5. ruff format drift (INFO)
     Counts files that would be reformatted by plain `ruff format`.
     Non-blocking because the canonical formatter is
     scripts/run_ruff_format.py = ruff format + the kwarg-spacing
     pass, so plain `ruff format --check` always reports a large
     diff. Once that custom pipeline is wired in, drop
     continue-on-error and add it to the gate.

ruff is pinned to 0.15.12 to match .pre-commit-config.yaml so a
CI-only ruff bump cannot start disagreeing with what pre-commit
already accepted.

* CI(lint): split Python lint into a multi-language Lint CI workflow

Drop the python-lint job from studio-backend-ci.yml and move it into
the dedicated `Lint CI` workflow. Two material changes:

1. License-header check now accepts BOTH header families
   The previous version only counted SPDX-License-Identifier, which
   warned on every Apache-2.0 file in unsloth/, unsloth_cli/, and
   scripts/ (e.g. unsloth/models/llama.py opens with the standard
   `# Copyright ... Daniel Han-Chen & the Unsloth team. All rights
   reserved. # Licensed under the Apache License, Version 2.0` block,
   which is correct, but my SPDX-only regex flagged it).
   New rule: a file is OK if either `SPDX-License-Identifier` or
   `Licensed under the Apache License` appears in the first 20 lines.
   Empty __init__.py files are skipped. Whole-repo coverage instead
   of just studio/backend.

2. Add shell / YAML / JSON parse gates
   - `bash -n` over every committed *.sh (14 today). Same idea as
     compileall: parse-only check.
   - `yaml.safe_load_all` over every *.yml / *.yaml (97 today),
     including .github/workflows/* so a typo in the workflow file
     itself shows up immediately.
   - `json.loads` over every *.json (18 today). Skips
     package-lock.json / bun.lock (huge, machine-generated) and
     tsconfig*.json (TypeScript JSONC convention -- already
     validated by `tsc --noEmit` in Frontend CI).

TypeScript and Rust are NOT duplicated here:
  - Studio Frontend CI runs `npm run typecheck` + `npm run build`
    on every studio/frontend/** change, which is a full TS AST +
    type check.
  - Studio Tauri CI runs `tauri build --debug --no-bundle` on every
    studio/src-tauri/** or studio/frontend/** change, which is a
    full Rust compile.
A duplicate fast-fail step here would burn cache for marginal
value, and the dedicated workflows already block merges.

Lint CI runs on every PR (no path filter): the whole job is
under 30 s of CI time, so paying that on every PR is preferable
to missing a regression on a path the focused workflows skip.

* CI(lint): accept GNU long-form license headers (AGPL/LGPL/GPL)

The license-header check missed two more legitimate header families
that are committed to the repo today:

  - LGPL-3.0 long form: e.g. unsloth/kernels/rope_embedding.py opens
    with "GNU Lesser General Public License" -- 7 such files under
    unsloth/kernels/.
  - AGPL-3.0 long form: e.g. unsloth/kernels/moe/autotune_cache.py
    opens with "GNU Affero General Public License" -- 2 such files
    under unsloth/kernels/moe/.

Both got flagged as drift on the previous run because the check
only knew about the SPDX one-liner and the Apache-2.0 preamble.
Add a third accepted marker, the substring "General Public License",
which appears in all three GNU long-form preambles (GPL, LGPL,
AGPL) and nothing else. Repo inventory:

   spdx (one-liner)        193 files (mostly studio/)
   apache-longform          55 files (unsloth/, unsloth_cli/)
   agpl-longform             2 files (unsloth/kernels/moe/)
   lgpl/gpl-longform         7 files (unsloth/kernels/)
   no recognised header     85 files (real drift -- mostly tests/)

So the warning count drops from 94 -> 85 with this commit; the
remaining 85 are actual missing headers, surfaced as a non-blocking
warning until the cleanup PR lands.

* CI: add codespell + shellcheck to Lint CI; add Security audit workflow

Three Priority-1 follow-ups from the lint review.

Lint CI gains two non-blocking gates that surface drift without
blocking merges (the same shape as the existing format-drift step):

  - codespell: typo catcher across source / comments / docs. Skips
    lockfiles, generated assets, binary artefacts, LICENSE files.
    ignore-words-list pulls out short identifiers and PyTorch
    idioms (parm/parms, ans, hist, etc.) the default dictionary
    would flag. Local run finds 16 real typos to fix in a follow-up.

  - shellcheck: catches subtle shell bugs `bash -n` doesn't see --
    unquoted expansions, useless cat, `[[ ]]` command substitution,
    etc. SC1090 + SC2034 muted because install/setup scripts
    legitimately source runtime paths and use export-only
    assignments. Critical-path coverage: install.sh, setup.sh,
    tests/sh/.

Both pinned for reproducibility (codespell>=2.3,<3 in pip,
shellcheck via apt-get). Both surface findings in PR annotations
without failing the run; drop continue-on-error after the cleanup
PRs land.

New workflow: Security audit. Runs `pip-audit` against the same
dep set Studio's backend pytest matrix installs, so we audit what
the runtime actually loads (not what pyproject.toml's transitive
resolution might pull in differently). Triggers:
  - PRs touching requirements / pyproject.toml,
  - push to main / pip,
  - nightly @ 04:13 UTC (off-the-hour to dodge cron rush),
  - workflow_dispatch.

The default branch already carries 17 known vulnerabilities per
the dependabot banner, so a hard gate today would block every PR
on a baseline we have not triaged. Non-blocking; full table goes
to GITHUB_STEP_SUMMARY for grep-ability and a 30-day artefact for
historical comparison.

The custom AST anti-pattern scan I prototyped was dropped: every
class of CPU-import-time bug we hit in this PR (bitsandbytes,
torchvision, _cuda_getCurrentRawStream, DEVICE_COUNT==0 stream
init) is already caught by the Repo tests (CPU) job exercising
the actual import on a CPU torch wheel. Restating the rule
in AST form would only add noise.

* CI: scan all unsloth deps + transitive closure, no install

The previous Security audit only covered Studio's backend requirements.
The unsloth pip package itself ships its own dep set via pyproject.toml
(typer/pydantic/pyyaml/nest-asyncio core, plus the huggingfacenotorch
extras: transformers/peft/accelerate/trl/datasets/diffusers/etc.) -- a
malicious upload to any of those would slip past us today. Build a
combined dep list from pyproject.toml + the six Studio requirements
files and feed it to both pip-audit and scan_packages.

Add scan_packages.py at scripts/scan_packages.py so the scanner ships
with the repo and CI does not depend on a network fetch at job time.

Pass --with-deps to scan_packages so the pre-install pattern scan
walks the full transitive closure -- supply-chain attacks usually land
several hops down (litellm 1.82.7 was a dep of a dep for most users;
top-level-only scanning would have missed it).

No installation in either job. pip-audit's -r mode resolves through
PyPI metadata, scan_packages downloads sdist/wheel archives raw and
inspects them without running install hooks. An attacker who has
compromised a transitive dep cannot execute code in this workflow.

* CI(security): per-file audit, strip git+, pin setuptools in build env

Last push surfaced two silent failures:

  1. pip-audit aborted on openai-whisper. The package's setup.py
     imports pkg_resources, which the isolated build env's modern
     setuptools no longer ships by default. Because we passed every
     -r file in one invocation, that single build failure killed the
     audit for ALL files (the run reported success only because
     continue-on-error swallowed exit 1).
  2. scan_packages --with-deps aborted on the first git+ spec it
     hit (triton-kernels.txt's git+https://github.com/triton-lang
     /triton.git, plus OpenEnv in extras-no-deps.txt). Same
     all-or-nothing behaviour: the entire transitive scan reported
     "0 archives downloaded" and "all clean" -- meaning we silently
     scanned nothing.

Fixes:

  - Build a filtered audit-reqs/ tree first. Each Studio requirements
    file is copied with `git+` lines stripped (replaced with a
    `# [security-audit] skipped` marker so the exclusion is auditable
    in the artifact). Pure git refs are out of scope for both pip-
    audit (CVE DB only knows PyPI versions) and scan_packages (it
    inspects PyPI archives, not git HEADs).
  - Run pip-audit per-file in a loop. One bad file no longer takes
    out the whole audit.
  - Pin setuptools<78 + wheel into pip's isolated build env via
    PIP_CONSTRAINT, so legacy setup.py packages (openai-whisper) can
    still emit metadata for the resolver.
  - Run scan_packages per-file too, with the same git+ filter and a
    skip for files that are empty after filtering (triton-kernels.txt
    becomes a comments-only file and would otherwise spam the log
    with `--help`).

Net effect: pip-audit now actually emits CVE findings (we know the
default branch carries 17), and scan_packages downloads + pattern-
scans the full transitive closure of every PyPI-only requirements
file plus unsloth's pyproject deps.

* CI(security): shard scan_packages across 3 runners + dedupe per-shard

Previous run took ~10+ minutes because each requirements file ran
its own --with-deps resolve serially, and the six files all share
~70% of their transitive set (transformers, peft, accelerate land
in three of them). Net effect: the same 200+ archives downloaded and
pattern-scanned three times in series.

Two changes:
  1. Within a shard, feed every -r file to ONE scan_packages call so
     pip's resolver intersects version constraints once and yields
     a single deduped transitive set.
  2. Across shards, run three matrix jobs in parallel:
       - hf-stack: unsloth-deps + no-torch-runtime  (pyproject extras)
       - studio:   studio + overrides + extras-no-deps
       - extras:   extras (heavy openai-whisper / scikit-learn stack)
     Wall clock now bounded by the slowest shard rather than the
     sum, dropping ~10 min to ~3-5 min.

Each shard uploads its own artifact (scan-packages-log-<id>) so log
correlation stays clean. fail-fast: false so one shard's findings
don't suppress the others.

* CI(security): consolidate pip-audit + npm audit + cargo audit into one job

Three advisory-DB lookups previously spun up three separate runners.
All three are fast lockfile-driven checks (pip-audit ~1m37s, npm audit
~12s, cargo audit ~24s) and the runner-setup overhead dominates each.
Run them sequentially on a single runner with python + node + rust
toolchains pre-installed; total wall clock comes out roughly the same
(~3 min) but with one PR check instead of three.

Each step keeps continue-on-error: true so a finding in one toolchain
does not suppress the others. Logs land in a single advisory-audit-logs
artifact (pip + npm + cargo + the filtered req set).

Heavy job stays separate: pip-scan-packages remains the 3-shard matrix
that downloads + pattern-scans the full PyPI transitive closure (~6
min/shard, in parallel). Conflating that into the advisory job would
bloat the runner image and serialize a 6 min job behind a 30 s one.

* CI(security): catch Lightning, Shai-Hulud, npm hijack, design-flaw CVEs

Recent supply-chain incidents that scan_packages would have missed:
  - PyTorch Lightning 2.6.x: payload in _runtime/router_runtime.js
    (14.8 MB), persistence via .claude/settings.json SessionStart
    and .vscode/tasks.json folderOpen
  - npm chalk/debug + Shai-Hulud: hex-var obfuscation, window.ethereum
    Web3 hijack, .github/workflows/shai-hulud.yml repo takeover,
    trufflehog credential exfil
  - elementary-data 0.23.3: token harvesters with embedded gh{p,o,s}_
    and AKIA regexes
  - litellm 1.82.7: also covered by existing patterns, but anyone on
    `>=` got it during the 40-min exposure window
  - langchain-core CVE-2025-68664 / n8n CVE-2025-68668 / marimo
    CVE-2026-39987: first-party design flaws, not malicious-author

scan_packages.py:
  - Six new regexes: RE_DEV_TOOL_HIJACK, RE_TOKEN_REGEX,
    RE_JS_OBFUSCATION, RE_WEB3_HIJACK, RE_WORKFLOW_INJECT,
    RE_SHELL_DROPPER.
  - Three new checkers: check_js_file, check_shell_file,
    check_workflow_file. scan_archive now routes .js/.mjs/.cjs/.ts
    to the JS checker, .sh/.bash to the shell checker, and
    .github/workflows/*.yml to the workflow checker.
  - JS checker fires CRITICAL on hex-var obfuscation OR Web3 hijack
    OR (token regex + network) OR workflow-injection signature; HIGH
    on a >100 KB JS bundle inside a Python wheel (the Lightning tell).
  - Smoke-tested: every new pattern matches its canonical positive
    and rejects four legitimate-looking false-positive baits.

security-audit.yml:
  - OSV-Scanner step: cross-ecosystem advisory check (PyPI + npm
    + cargo) from one binary. OSV's feed is a superset of GitHub-
    Advisory; catches CVEs that haven't propagated yet (e.g.
    langchain-core was on OSV before GitHub Advisory).
  - Semgrep step: p/supply-chain + p/python + p/javascript +
    p/security-audit packs catch first-party logic bugs (CVEs 7/9/10
    above) that pattern scanning never sees.
  - Lockfile pin verifier: warns on every non-`==` spec in
    requirements/*.txt. Currently surfaces 104 unpinned specs as
    informational baseline; tighten to blocking once the baseline
    is curated.

All new steps continue-on-error initially; they surface findings to
the workflow summary + advisory-audit-logs artifact.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* CI(security): defense-in-depth additions across 7 axes

Goes after the residual gaps from the supply-chain incident audit.
Each addition targets a real attack class that prior layers couldn't
catch:

  1. step-security/harden-runner (audit mode) on every job. eBPF
     egress firewall on the runner -- if scan_packages misses a
     payload, harden-runner's audit log records every host the
     malicious archive dialed. Audit mode initially so we observe
     the legitimate egress profile before promoting to block.

  2. Trivy filesystem scan (vuln + misconfig + secret). Hits NVD +
     GHSA + GitLab + Aqua Vuln DB and also catches Dockerfile / k8s /
     Tauri / shell IaC misconfigs that pip-audit + OSV don't see.

  3. TruffleHog secret-leak scan on PR diffs. --only-verified so we
     only flag tokens the source provider confirmed are live; runs
     base..head on PRs and full repo on push. Catches accidental API
     key commits that the Lint CI's grep-based codespell check
     cannot. checkout fetch-depth: 0 so the diff range exists.

  4. CycloneDX SBOM generation as artifact. Per-requirements file
     plus a project-level SBOM from pyproject.toml. Lets downstream
     consumers audit our wheel contents (the ML supply-chain SBOM gap
     is a known industry-wide problem; meets half of NTIA SBOM mins).

  5. GitHub Actions pinning verifier. Reports every `uses: foo@v4`
     or `@main` mutable ref. tj-actions/changed-files (Mar 2025) hit
     anyone using non-SHA pins. Currently surfaces 4 third-party
     unpinned refs (dtolnay/rust-toolchain, swatinem/rust-cache) and
     40 first-party (`actions/*`); informational baseline, tighten
     once we're ready. Dependabot's github-actions ecosystem
     auto-bumps SHA pins, so the maintenance cost is zero.

  6. Hash-pin verifier. Reports how many == specs would gain from
     `--hash=sha256:` entries. Currently 11 == pins, 0 with hash.
     Roadmap step: `uv pip compile --generate-hashes` then
     `pip install --require-hashes`. Hash-locked installs would have
     refused a republished litellm 1.82.7 even at the same version
     string.

  7. Custom Semgrep rules at .semgrep/unsloth-rules.yml. Seven rules
     for the *specific shape* of recent ML-stack CVEs we'd otherwise
     re-introduce ourselves: langchain-core deserialize-roundtrip
     (CVE-2025-68664), n8n private-pyodide-eval (CVE-2025-68668),
     marimo websocket-no-auth (CVE-2026-39987), litellm
     popen-with-network-stdin, Shai-Hulud workflow-write,
     pickle-from-network, shell=True with f-string interpolation.

dependabot.yml: extend to pip + cargo ecosystems so security
advisories on Python deps and the Tauri shell auto-generate update
PRs alongside the github-actions / bun / npm ones.

All new steps continue-on-error initially; findings land in
GITHUB_STEP_SUMMARY plus the advisory-audit-logs artifact.

* CI(security): bump trivy + trufflehog to existing version tags

Job failed at "Set up job" because trivy-action@0.28.0 doesn't exist
on GitHub. Latest tag is v0.36.0; same fix for trufflehog (now v3.95.2).

* CI(security): trivy-action tags need leading `v` (0.36.0 -> v0.36.0)

* CI(security): remove Trivy (it WAS the litellm attack vector)

Trivy was the initial entry point for the litellm 1.82.7/8 supply-
chain compromise (March 2026):

  Late Feb: attacker exploited a misconfigured pull_request_target in
            Trivy's CI -> stole the aqua-bot PAT.
  Mar 19:   attacker force-rewrote 76 of 77 tags in
            aquasecurity/trivy-action (and all 7 in setup-trivy) to
            point at malicious commits. Anyone using a tag ref
            (`@v0`, `@v0.69.4`, `@latest`) auto-pulled the trojan.
  Mar 24:   litellm's CI ran the trojaned Trivy unpinned -> the
            payload exfiltrated PYPI_PUBLISH from the runner ->
            attackers published the malicious litellm wheels.

A security scanner has the same broad runtime read access as
deployment tooling -- by design. That's exactly what made it the
ideal pivot. Our prior `aquasecurity/trivy-action@v0.36.0` was a tag
ref, the same shape that hit litellm, and Aqua's remediation does
not eliminate the meta-attack class (next compromise restarts the
clock). Removing rather than re-pinning.

Coverage we lose, and how we backfill:
  - cross-ecosystem CVE: already covered by OSV-Scanner (NVD + GHSA
    + GitLab + RustSec feeds).
  - secret detection: already covered by TruffleHog + the new
    GitHub Actions pinning verifier.
  - OS package CVEs: not relevant for a Python package + Tauri
    desktop app.
  - IaC misconfig (Dockerfile / k8s / Tauri config): the one unique
    Trivy value-add. Unfilled for now; revisit with checkov / kics
    if/when we ship a Dockerfile or k8s manifests.

Also pinned the two remaining third-party actions to commit SHAs
(was a tag ref, the exact thing the GHA pinning verifier flagged):
  - step-security/harden-runner: a5ad31d (= v2.19.1)
  - trufflesecurity/trufflehog:  17456f8 (= v3.95.2)

Dependabot's github-actions ecosystem will auto-bump these SHAs.
Refs: https://docs.litellm.ai/blog/security-update-march-2026
      https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/

* CI: SHA-pin every action; fix 4 bugs in advisory-audit

Last security-audit run revealed 4 step-level errors hidden by
continue-on-error (the job reported pass but each fix is real):

  1. OSV-Scanner curl 404 -> tar exit 2. v2.x ships a raw binary
     (`osv-scanner_linux_amd64`), not a tarball. Drop tar -xzf,
     curl -o the binary directly + chmod +x.
  2. cargo audit `parse error: TOML parse error at line 5 col 8`
     on RUSTSEC-2026-0073.md. cargo-audit 0.21 doesn't parse the
     CVSS 4.0 schema used in 2026 advisories. Bump pin to ^0.22.
  3. TruffleHog `flag 'no-update' cannot be repeated`. The
     trufflesecurity/trufflehog action passes --no-update
     internally already; remove our duplicate from extra_args.
  4. cyclonedx-py `unrecognized arguments: --schema-version 1.6
     --outfile ...`. cyclonedx-bom 4.x renamed to `--sv` for spec
     version and `-o` for the output file.

Plus pin every remaining mutable-ref action to a 40-char SHA. The
new GHA pinning verifier flagged 4 third-party + 40 first-party
mutable refs; this commit pins all 44 to the latest SHA *within
the existing major version* (no auto-upgrades). Mappings:

  actions/checkout         @v4    -> 34e114876b... (v4.3.1)
  actions/setup-node       @v4    -> 49933ea528... (v4.4.0)
  actions/setup-python     @v5    -> a26af69be9... (v5.6.0)
  actions/stale            @v10   -> b5d41d4e1d... (v10.2.0)
  actions/upload-artifact  @v4    -> ea165f8d65... (v4.6.2)
  actions/cache            @v4    -> 0057852bfa... (v4.3.0)
  swatinem/rust-cache      @v2    -> 23869a5bd6... (v2.9.1)
  dtolnay/rust-toolchain   @stable-> 29eef336d9... (stable @ 2026-05-07)

44 pins applied across 11 workflow files. The pin verifier now
reports zero unpinned `uses:`. Dependabot's github-actions
ecosystem (already configured in .github/dependabot.yml) will
auto-bump these SHAs in weekly batches.

This closes the same attack class that hit litellm 1.82.7: an
attacker who hijacks a tag (as in the aquasecurity/trivy-action
March 2026 incident) cannot redirect our workflows because we no
longer follow tag refs.

* CI: rename + comprehensive Chat UI Tests (verified locally)

Three rename + one substantial test rewrite:

  - "tool calling tests"                         -> "Tool calling Tests"
  - "Chat UI smoke (Playwright + Chromium)"      -> "Chat UI Tests"
  - "install.sh + `unsloth studio update --local`" -> "Studio Updating Tests"

Chat UI Tests was a 4-second pass-through (fill new password, send one
message, reload). Rewrote into a 15-section flow that runs ~30 seconds
locally and exercises the full Studio chat surface a real user touches:

  1.  Login form (username is hardcoded HIDDEN_LOGIN_USERNAME in
      auth-form.tsx, so we only fill #password)
  2.  Composer mounts after auth
  3.  Composer toolbar (Send + Add Attachment)
  4.  Three distinct user turns with non-empty deterministic
      assistant replies (verified locally: lengths 6/1/6 for
      "hello"/"1"/"world" prompts)
  5.  Assistant action bar: Copy + Regenerate
  6.  Settings sheet open + close
  7.  Theme toggle via account menu (light <-> dark, with a
      view-transition wait so the click doesn't race the animation)
  8.  Sidebar nav: New Chat, switch-back-to-previous-chat (history
      persistence via threadId in IndexedDB)
  9.  Sidebar Search dialog
  10. Sidebar collapse/expand
  11. Reload + verify session JWT survives (the 2026.5.1 chat-history
      regression killed the page entirely on reload; this catches it)
  12. Post-reload turn proves inference still works
  13. /api/health stays healthy
  14. Negative-auth: old bootstrap pw -> 401, rotated pw -> 200
  15. Zero pageerror events captured

The CI step that boots Studio + loads the model now rotates the
bootstrap password BEFORE calling /api/inference/load. /api/inference/
load is gated behind must_change_password=false; the previous flow
(login bootstrap -> load) was succeeding in CI by historical accident
and started failing locally. New flow:

  bootstrap login -> change-password -> rotated login -> load model

Both passwords are exposed to the Playwright step via env, so the
test can drive /login with the rotated password AND assert the old
one is now 401.

Verified locally end-to-end against a real Studio install with
gemma-3-270m-it-GGUF UD-Q4_K_XL: all 15 sections pass, console.error
count = 0, total runtime ~30s.

* CI(ui): drop nonexistent username locator (auth form is password-only)

studio/frontend/src/features/auth/components/auth-form.tsx hard-codes
the login username to HIDDEN_LOGIN_USERNAME = "unsloth"; the only
visible input is #password. The previous Playwright step waited 30s
for `input[name='username'], #username` and timed out on every CI run.

I caught this locally and patched the test script during validation
but didn't bring the fix back to the workflow file -- this commit
applies it. Wait for #password only, fill the rotated password, click
submit. Verified locally end-to-end against a fresh Studio.

* ci(mlx): add real Apple Silicon job on free macos-14 runner

GitHub-hosted macos-14 is the M1 standard runner (3 vCPU, 7 GB RAM,
14 GB storage) and is FREE for public repositories per the GitHub
Actions billing reference. Larger variants (macos-14-large,
macos-14-xlarge) are billed; we deliberately avoid those.

unslothai/unsloth and unslothai/unsloth-zoo are both public, so
adding a single macos-14 job to MLX CI costs zero minutes against
the org's billing quota while closing the only remaining gap the
spoofed Linux job cannot reach: the actual Apple Silicon dispatch
path. Specifically the new mlx-real-apple-silicon job:

  - Installs the real mlx and mlx-lm packages from PyPI.
  - Verifies platform.system()=='Darwin' and platform.machine()=='arm64'
    naturally, with no monkeypatch.
  - Imports unsloth and asserts unsloth._IS_MLX is True so the gate
    flips on real hardware as it is supposed to.
  - Smoke-imports every PR-A MLX-only module: mlx_loader, mlx_trainer,
    mlx_compile, mlx_utils, mlx_cce, gated_delta_vjp. These all do
    `import mlx.core as mx` at module level; this is the test that
    catches a future change to those modules that would only surface
    on a real Mac.
  - Re-runs the same three dispatch test files the Linux job runs.
    The monkeypatch spoofs still apply on real hardware, so this is
    also the canary that the spoofs do not collide with the real
    environment.

The Linux job is unchanged. Both jobs trigger on the same path
filter; mlx-real-apple-silicon caps at 15 minutes since the mlx
install is heavier than the Linux dep set.

* ci(mlx): install unsloth-zoo from git main on the macOS job

The macOS Apple Silicon job failed on its first run with

    NotImplementedError: Unsloth currently only works on NVIDIA, AMD
    and Intel GPUs.

surfaced from `unsloth_zoo.device_type.get_device_type()`. The cause
is the version pin: `pip install 'unsloth_zoo>=2026.5.1'` resolves
to the most recent PyPI wheel, which predates PR #620 and therefore
predates the `_is_mlx_only` gate in `unsloth_zoo/__init__.py` that
short-circuits the GPU device-type probe on Darwin+arm64+mlx.

Switch to `pip install --no-deps "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo"`
so the macOS job sees the merged main branch and exercises the
actual MLX dispatch code. Studio's own `install.sh` does this for
exactly the same reason.

This is also the smoking gun the macOS runner exists to catch:
the spoofed Linux job cannot reproduce a stale PyPI/zoo pairing
because it never imports through device_type. The first real Mac
run found the gap on its first try.

* ci(mlx): expand macOS install ladder to match the Linux dep set

The first attempt installed only mlx + mlx-lm + pytest +
unsloth_zoo with --no-deps + unsloth -e --no-deps. That ladder
under-specifies what the MLX import branch in unsloth/__init__.py
actually needs:

  - The studio backend hardware module imports structlog at module
    top level. Without it tests/studio/test_hardware_dispatch_matrix.py
    fails at the very first `from utils.hardware import hardware as hw`
    with ModuleNotFoundError.
  - unsloth/__init__.py loads dataprep/raw_text.py via
    spec_from_file_location, which `from datasets import Dataset`. With
    --no-deps on unsloth-zoo neither datasets nor transformers nor any
    other shared dep got pulled in.

Mirror the Linux job's working ladder, with two MAC-specific
adjustments:

  - Drop bitsandbytes (CUDA-only).
  - Drop CPU torch (mlx replaces it on Apple Silicon, and unsloth-zoo
    already gates torch on `sys_platform != darwin or platform_machine != arm64`).
  - Install unsloth_zoo from git main WITH deps so pip resolves
    mlx + mlx-lm + mlx-vlm (gated on darwin+arm64 in the zoo's
    pyproject) plus the shared deps (datasets, transformers,
    sentencepiece, ...).

Validated locally against a Linux mac-sim venv (platform spoofed to
Darwin/arm64 via mlx_simulation, real datasets/transformers/structlog
installed via the same ladder, fake mlx via the shim):

  - Step 1 _IS_MLX activation: OK
  - Step 2 import each of unsloth_zoo.mlx_{loader,trainer,compile,utils,cce}
    + unsloth_zoo.gated_delta_vjp + FastMLXModel + MLXTrainer surface: OK
  - Step 3 36 tests across the three dispatch files: 36 passed in 0.43s

The Linux job (mlx-dispatch) is unchanged.

* ci(mlx): version-pin every pip install, consolidate to one matrix job

Pin every explicit pip install to an exact released version (latest
as of 2026-05-07 within each project's existing constraint range)
to reduce supply-chain surface and make rebuilds reproducible.
unsloth-zoo on Linux is the pinned PyPI release; on macOS it stays
on git main (PR-A is not yet on PyPI).

Also fold the previously separate mlx-dispatch (Linux) and
mlx-real-apple-silicon (macOS) jobs into a single matrix job with
labels linux-cpu-spoof and macos-m1-real, sharing the dispatch
test step so adding new MLX dispatch tests applies to both runners
automatically. The Mac-only smoke steps (verify _IS_MLX flips True
on real Apple Silicon, smoke-import every PR-A MLX-only module)
remain gated on if: matrix.real_mlx.

Validated locally against .macsim_venv3 with the pinned package
set: 35 passed + 1 skipped, matching the prior unpinned run.

* CI(ui): split Playwright into tests/studio/playwright_chat_ui.py + comprehensive coverage

Move the inline Playwright Python out of the workflow YAML (which was
unwieldy at 400+ lines of indented heredoc) into a real test file at
tests/studio/playwright_chat_ui.py so it can be run locally against a
fresh Studio install in addition to CI.

The new test does the full first-run journey end-to-end through the
UI:

  1. /change-password through the UI (Setup your account / Choose a new
     password / Change password) -- previously the workflow rotated
     out-of-band via curl; now the test exercises the actual user form.
  2. Default model assertion: /api/models/list[default_models][0] must
     match DEFAULT_MODELS_GGUF[0] from defaults.py (catches list
     reordering / lazy-loading regressions).
  3. /api/inference/load via page.evaluate using the JWT pulled out of
     localStorage["unsloth_auth_token"] (gemma-3-270m, ~254 MiB cached).
  4. Model picker: open the selector, type "qwen" and "llama" into the
     search bar, confirm the typeahead filters (does not select).
  5. Five chat turns, each must render a non-empty assistant bubble.
  6. Regenerate-last via the assistant action bar (best-effort).
  7. Two extra turns AFTER regenerate (proves stream restart works).
  8. Composer toggles (Thinking / Web search / Code execution) --
     skipped gracefully when disabled for the loaded model.
  9. Configuration sheet: drive every Radix slider to its minimum so
     temperature is 0 for downstream determinism.
  10. Theme toggle x3 with deterministic computed-background-color
      assertion (light = body bg min(rgb)>220, dark = max(rgb)<60).
      View-transition animation disabled via add_init_script + reduced
      motion to keep clicks actionable.
  11. Sidebar nav: New Chat, Compare, Search dialog, Recipes route.
  12. Developer / API tab via the account menu (api-keys management
      surface reachable).
  13. Recipes route: cards render + first-card click.
  14. Recents (sidebar history): click a previous chat thread.
  15. Image attachment widget reachable (vision response not asserted
      here -- gemma-3-270m is text-only).
  16. Reload + session JWT survives.
  17. /api/health remains healthy.
  18. Negative-auth post-UI-rotation: bootstrap pw -> 401, NEW -> 200.
  19. Out-of-band ("terminal") password rotation via subprocess(curl)
      to /api/auth/change-password (NEW -> NEW2). Confirms refresh
      tokens are revoked server-side and that an external password
      change invalidates the previous browser session's renew path.
  20. Shutdown via the account-menu Shutdown menuitem + the AlertDialog
      "Stop server" button. Wait for the "Unsloth Studio has stopped"
      placeholder, then poll the listening port until it's closed --
      verifies the server process actually exited.

Verified locally end-to-end against a fresh Studio install (gemma-3-270m
GGUF UD-Q4_K_XL, port 18892): rc=0, all 20 sections green.

Workflow changes:
  - Drop the curl-based "Rotate password + load the GGUF" step. The
    test does change-password through the UI and load via page.evaluate
    so the bootstrap pw is the only thing CI hands the test.
  - Pin actions/upload-artifact@v4 to its commit SHA (v4.6.2) per the
    "pin all actions" rule.

* CI(security): random-generated passwords in every workflow (no hardcoded creds)

studio-ui-smoke.yml was the last holdout still using hardcoded rotated
passwords (CIUiSmoke12345! / CIUiSmoke67890!). Generate them per-run
via python -c 'import secrets; print(secrets.token_urlsafe(16))' and
mask them into the log via GitHub Actions' ::add-mask::, matching the
pattern already used in studio-inference-smoke.yml.

If a workflow ever gets compromised (malicious dependency, leaked
GITHUB_TOKEN, supply-chain attack on a pinned action), the rotated
password is now unique to that single job run and is never readable
from log output. An attacker cannot replay a hardcoded credential
against a future / parallel Studio install elsewhere.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): consolidate to single Mac M1 job with robust no-mlx spoof

Previously the workflow ran the dispatch tests on two matrix legs
(linux-cpu-spoof + macos-m1-real), which duplicated the spoofed
hardware matrix (it works identically on any host) while only the
Mac leg covered Apple-specific real-mlx checks. Drop the Linux leg,
rename the workflow to "MLX CI on Mac M1", and rely on the Mac
runner alone -- it now runs the SAME spoofed matrix PLUS the three
real-Apple-Silicon checks (real `_IS_MLX = True`, real mlx wheel
smoke imports, no spoof collisions with the live environment).

Also fix the `apple_silicon_no_mlx` profile so the spoof works on a
real Mac with mlx genuinely installed. Studio's `_has_mlx()` does
literal `import mlx.core` and catches `ImportError`, which the
previous spoof (delete `sys.modules["mlx"]` + patch `find_spec`)
could not block when mlx was on disk -- Python would re-find and
import the real package. The fix installs a `MetaPathFinder` for
the duration of the spoof that raises `ImportError` for `mlx` /
`mlx.*`, faithfully simulating "mlx not installed" regardless of
whether the host has the wheel. No change to the dispatch logic in
unsloth or studio; the Mac runner now exercises every profile end
to end with the real wheels installed.

Validated locally on .macsim_venv3 with a stand-in `mlx` package
on disk at .fakemlx_pkg/ to mimic the macos-14 runner: 35 passed +
1 skipped.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): real MLX training + inference smoke test on Mac M1

Add tests/studio/run_real_mlx_smoke.py and wire it into the macos-14
job as the final step. The script trains unsloth/gemma-3-270m-it
for 7 deterministic LoRA steps on an in-memory dataset of the SAME
row repeated:

    "<<HELLO!!>> My name is Unsloth!"

then prompts the trained model with "<<HELLO!!>> My name is " and
asserts the completion contains "Unsloth". Captures and asserts:

- per-step training loss (via MLXTrainer.add_step_callback);
- pre- and post-training loss + gradient norm (computed manually via
  mx.nn.value_and_grad over the training row, since MLXTrainer does
  not currently expose per-step grad norms);
- losses are finite, do not diverge, and post-train loss < pre-train;
- grad norms are finite and positive;
- the inference output contains "Unsloth".

Determinism: seeds python random, numpy, and mlx.core.random; passes
random_state=SEED to FastMLXModel.from_pretrained and
get_peft_model (both invoke _seed_mlx_random_state internally) and
seed=SEED to MLXTrainingConfig (drives batch shuffling). Uses fp16
+ no quant (gemma-3-270m is small enough to skip 4-bit) and LoRA
r=8 on the four attention projections.

This is the only place in CI that exercises a real MLX backward
pass + optimizer step + mlx_lm.generate call.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): add LoRA + merged_16bit + GGUF export round-trip checks

After the 7-step LoRA training run finishes and the in-memory
inference assertion passes, the smoke test now exports the trained
model in three formats, drops the in-memory model + trainer to
reclaim memory, and reloads each export from disk to re-run the
"<<HELLO!!>> My name is " inference assertion. Each reload is
expected to still complete with "Unsloth" -- catching round-trip
regressions where the saved weights silently corrupt or fail to
load.

Formats exercised:

- LoRA adapter via model.save_pretrained_merged(save_method="lora").
  Reloaded with FastMLXModel.from_pretrained on the adapter dir;
  the loader auto-detects adapter_config.json and pulls down the
  base model.

- Merged 16-bit via model.save_pretrained_merged(save_method=
  "merged_16bit"). Fuses LoRA into the base, dequantizes to fp16,
  saves an HF-compatible safetensors directory. Reload via
  FastMLXModel.from_pretrained on the saved dir.

- GGUF via model.save_pretrained_gguf(quantization_method=
  "not_quantized"). Builds llama.cpp via cmake on the runner with
  GGML_METAL=ON (only the llama-cli, llama-quantize, and
  llama-gguf-split targets), then runs the produced bf16 GGUF
  through llama-cli with a fixed seed and asserts "Unsloth" in
  stdout. GGUF infra failures (cmake / build / convert) are
  surfaced as RuntimeError so we notice -- if Mac CI starts hitting
  build flakes the assertion can be softened.

Workflow timeout bumped 15 -> 25 min to budget for the llama.cpp
cmake build (~5-7 min on the macos-14 standard runner).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): cold-start LoRA / merged / GGUF reloads + per-phase metrics

Restructure the MLX smoke test into a multi-step workflow that
exercises the export round-trip the way real users hit it: each
reload runs in a FRESH Python process (not a continuation of the
still-running trainer), and each step emits a JSON metrics file
with elapsed time + peak GPU memory + peak RSS for regression
detection.

Steps (each on the macos-14 M1 standard runner, FREE for public
repos):

1. TRAIN + SAVE 3 formats
   - Load unsloth/gemma-3-270m-it (fp16, no quant).
   - Apply LoRA r=8 on q/k/v/o.
   - Pre-train + post-train loss + grad norm probe via
     mx.nn.value_and_grad on the training row.
   - Train 7 deterministic steps, batch_size=2,
     gradient_accumulation_steps=3 (42 sequences trained), capture
     per-step loss via add_step_callback.
   - In-memory generate -> assert "Unsloth" appears.
   - Save LoRA, merged_16bit, GGUF.
   - Emit mlx_workdir/train_metrics.json.

2. RELOAD LoRA (fresh process)
   FastMLXModel.from_pretrained(lora_dir) cold-load + generate +
   assert "Unsloth" appears. Emits lora_reload_metrics.json.

3. RELOAD merged_16bit (fresh process)
   Same flow on the merged HF directory.

4. RELOAD GGUF via llama-cli (fresh process)
   Conditional on train_metrics.json:gguf_supported. Spawns the
   llama-cli built by save_pretrained_gguf with --temp 0
   --seed 3407 -no-cnv and asserts "Unsloth" in stdout. The
   per-phase metrics step prints all four JSON files so
   regressions are visible in the job log.

Pin unsloth_zoo to fix/mlx-export-roundtrip-on-apple-silicon while
unslothai/unsloth-zoo#627 is in review -- it carries:

  - llama_cpp.py: catch NotImplementedError too when importing
    device_is_bf16_supported (device_type module-level call raises
    on Apple Silicon).
  - mlx_loader.py: don't wipe local_path when config.json is
    missing, otherwise FastMLXModel.from_pretrained(lora_dir)
    can't see adapter_config.json.

The earlier draft of this script had a workaround that copied the
base model's config.json into the LoRA save dir; with #627 the
workaround is removed, the cold-start LoRA reload works on the
saved adapter directory directly.

Workflow timeout already 25 min for the llama.cpp cmake build.

* CI(studio): always-upload artifacts + gate /api/system + path/health plumbing

Three small but high-signal changes that came out of an audit of how
much Studio surface CI actually exercises:

  1. Every studio-*-smoke.yml workflow now uploads its artifacts on
     `if: always()` instead of `if: failure()`. On green runs the
     screenshots + studio.log are now reviewable in the Actions UI,
     which closes the "passed but the UI is silently broken" hole.
     SHA-pinned to actions/upload-artifact@v4.6.2 across all 7 upload
     steps (was a mix of @v4 unpinned + the SHA-pin).

  2. /api/system and /api/system/hardware now require a Bearer token
     (Depends(get_current_subject)). Today they leak Python version,
     GPU name, total memory, and the ML package set without auth --
     fine on a single-user Tauri box, not fine on -H 0.0.0.0 / Colab
     / a Tauri-relayed setup. /api/system/gpu-visibility was already
     gated; now /api/system + /api/system/hardware match it.

  3. Path filters + health-wait plumbing:
     - studio-ui-smoke.yml now triggers on tests/studio/** so a PR
       that ONLY edits the Playwright test file actually runs UI CI.
     - studio-tauri-smoke.yml now triggers on unsloth_cli/** so a CLI
       rename or signature change that breaks Tauri's spawned
       `unsloth studio` actually runs Tauri CI.
     - The 60s `/api/health` wait loop in studio-ui-smoke.yml +
       studio-inference-smoke.yml (3 jobs) is now 180s. Cold runners
       with venv warm-up + lazy imports have been observed exceeding
       60s, and the cost of a false-fail is much higher than two
       extra minutes of waiting.

* CI(ui): STUDIO_UI_STRICT mode + theme cycle fix + Recents thread-match assertion

The existing UI test was passing too easily: every "if button.count() == 0:
log WARN" branch silently degraded into a green run. Three places this
hid real bugs:

  1. The theme toggle for-loop bailed after cycle 1 because the Radix
     Account-menu's data-state="open" lingered through the view-transition
     and the next acct.click() hit the still-open dropdown. The test
     went green observing only one polarity.
  2. The regenerate button branch silently skipped when the assistant
     action bar didn't render (every CI run so far -- the locator was
     wrong, but no one noticed because it was a soft skip).
  3. The Recents click accepted ANY non-nav sidebar entry, so a freshly
     deleted thread or an unrelated entry would still pass.

Fixes:

  - Add STUDIO_UI_STRICT=1 env (default on in CI via workflow,
    default off locally). When on, every soft "if not visible: log
    WARN" branch hard-fails. The strict-skip pattern is centralised
    in a soft_fail() helper so the local-vs-CI split is one knob.
  - Theme toggle: wait for [role="menu"] to detach between cycles
    (the dropdown stay-open was the cycle-2 bail), assert the loop
    actually ran 3 times.
  - Model picker search: capture popover text after typing "qwen" vs
    "llama"; the two snapshots must DIFFER, proving the typeahead
    actually filters (a regression that rendered the picker but
    ignored input would silently pass before).
  - Recents click: after navigating to the clicked thread, the
    rendered turns must include at least one of our sent prompts
    ("hello", "world", "tree", "1+1", etc.) -- proves we landed on
    OUR thread, not a leftover from a previous run.
  - Use [data-tour="chat-model-selector"] as the primary selector
    for the model picker -- the guided-tour anchor is at least as
    stable as anything else in the codebase (the tour breaks if it
    moves), and there's no separate data-testid system to maintain.

* CI(studio): new Studio API & Auth Tests workflow + integration test

HTTP-level integration smoke for the Studio FastAPI surface, no
Playwright. ~30 s per run on warm cache. Boots a fresh Studio, then
asserts:

  1. CORS hardening -- no wildcard-origin + credentials=true; cross-
     origin GET / does not leak the bootstrap password to evil.example.
  2. /api/system + /api/system/hardware + /api/system/gpu-visibility
     all require auth (closes the info-disclosure leak).
  3. Auth state machine -- rotation invariants (old=401, new=200),
     refresh-without-body returns 4xx, login burst documents the
     current "no rate-limit" behaviour so future hardening updates the
     test in the same PR.
  4. JWT-expiry forgery -- mint a JWT with exp=now-1 using the install's
     own secret + assert it returns 401.
  5. API key lifecycle E2E -- create -> list -> use against
     /v1/chat/completions -> delete -> verify 401.
  6. Auth file-mode hardening (Linux only): auth/ is 0700, auth.db +
     -wal + -shm + .bootstrap_password are 0600.
  7. Inference lifecycle gaps -- /v1/models lists the loaded model,
     /v1/embeddings + /v1/responses return 200 OR structured 4xx,
     bogus gguf_variant rejected, force-reload swaps the llama-server
     PID.
  8. Endpoint-by-endpoint auth audit -- pins the EXPECTED auth posture
     for known routes; an unauthenticated /api/shutdown is rejected
     BEFORE the shutdown trigger fires.

Reuses the same GGUF cache key as studio-ui-smoke.yml so the model
download is one cache-hit across CI.

Random per-run rotated passwords + ::add-mask:: pattern matches
studio-ui-smoke.yml + studio-inference-smoke.yml.

* CI(ui): add second Playwright job covering Compare/Recipes/Export/Studio/Settings

The first Chat UI Tests step ends by clicking the Shutdown menuitem,
which leaves the server dead. So a SECOND Studio is booted on port
18894 in the same job (warm install -- adds ~3-5s) and a second
Playwright test exercises the routes the chat UI doesn't touch:

  1. /chat?compare=... -- assigns two models, sends 2 prompts, asserts
     both panes respond (so 4 total new assistant bubbles).
  2. /data-recipes -- clicks the first template card, verifies the
     React-Flow canvas mounts.
  3. /export -- in chat-only mode (CI default) asserts the route
     redirects; in non-chat-only asserts [data-tour='export-cta'] +
     HF token field exist.
  4. /studio -- chat-only redirects, non-chat-only asserts the three
     tabs (Configure / Current run / History) + [data-tour='studio-*']
     anchors exist.
  5. Settings dialog -- Cmd/Ctrl-, opens it, cycles through every
     visible tab (General / Profile / Appearance / Chat / Developer /
     About), asserts each tab body is non-trivial.

Same STRICT=1 mode + soft_fail() pattern as playwright_chat_ui.py.

Both Playwright runs' screenshots + studio logs are bundled into the
existing studio-ui-smoke-artifacts upload; the artifact name doesn't
change.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): fresh-process reloads + soft-skip GGUF on llama.cpp limitation

Re-apply the subcommand restructure that was lost during the earlier
rebase conflict (the linter pre-commit on the remote re-formatted the
single-function version, so my checkout --ours kept the wrong copy).
Adds:

  * argparse subcommands `train` and `reload --format X --dir D` so
    each reload runs in a FRESH Python process the way real users
    hit the cold-start path.
  * Per-phase Phase() context manager records elapsed wall-clock,
    peak GPU memory (mx.metal.get_peak_memory), and peak RSS
    (resource.getrusage) into a metrics dict written to
    {train,lora_reload,merged_reload,gguf_reload}_metrics.json
    next to the saved dir for cross-CI regression detection.
  * batch_size=2, gradient_accumulation_steps=3 (was 2/1) so the
    7-step run sees 42 sequences total.
  * GGUF save is best-effort. unsloth-zoo#627 fixed the
    NotImplementedError on Apple Silicon, but llama.cpp's
    convert_hf_to_gguf currently asserts on the gemma-3-270m
    tokenizer vocab (`max(vocab IDs) >= vocab_size`). That's a
    downstream llama.cpp limitation, not an unsloth_zoo bug, so the
    train step records gguf_supported=false + the reason instead of
    raising, and the GGUF reload step emits a workflow warning and
    exits 0. The LoRA + merged_16bit reload assertions remain the
    gating signal.

The earlier-draft LoRA workaround that copied base config.json into
the LoRA save dir is removed; unsloth-zoo#627 makes
FastMLXModel.from_pretrained(lora_dir) work on the saved adapter
directory directly (the failing run before #627 confirmed the bug,
the run after #627 lands shows the adapter is detected and the base
model is pulled from adapter_config.json:base_model_name_or_path).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(mlx): expand LoRA targets to MLP + bump generation budget

With batch_size=2 / gradient_accumulation_steps=3 (effective batch
of 6) the q/k/v/o-only LoRA collapsed in 7 steps -- training loss
kept dropping (0.55 vs the previous 1.02 with grad_accum=1) but
inference output the structural skeleton ("My name") without
recovering the specific "Unsloth" token. Switching to the standard
unsloth target set (q/k/v/o + gate/up/down) gives the LoRA enough
capacity to memorize the training row at the larger effective
batch. Also bump max_tokens 24 -> 48 for the in-memory + reload
generation calls so the model has more room to spew the memorized
sequence; we still assert "Unsloth" appears…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant