Fix MLX save and GGUF export parity issues#697
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive regression tests and fixes for MLX save and GGUF export parity, including VLM processor repairs, tied embedding materialization, and VLM weight sanitization. The review feedback suggests improving robustness by safely checking if thinker_config is a dictionary before accessing its attributes to prevent potential AttributeErrors, and expanding the saved weights check to support single model.safetensors files in addition to sharded indexes.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves MLX model save/export parity (especially for VLMs) to better match the CUDA path, including GGUF export behavior and preserved sidecar metadata, and adds regression tests for these scenarios.
Changes:
- Enhance config extraction/saving for MLX-LM vs MLX-VLM, including dataclass handling and backend-aware
save_config. - Add export-time fixes for GGUF/VLM parity (tensor rewrites, tied
lm_headmaterialization, and llama.cpp path anchoring). - Repair degraded MLX-VLM processors by rebuilding image processor + processor class from sidecar configs; add a dedicated regression test suite.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
unsloth_zoo/mlx/utils.py |
Adds backend-aware config saving, tied-embedding materialization, VLM GGUF rewrite pipeline, sidecar copying, and GGUF export adjustments. |
unsloth_zoo/mlx/loader.py |
Repairs degraded VLM processors from sidecar configs and filters supported kwargs for GGUF save/push bindings. |
unsloth_zoo/llama_cpp.py |
Ensures patched converter scripts are written in the correct directory for package layouts. |
tests/test_mlx_save_export_regressions.py |
Adds regression tests covering the new MLX save/export parity behaviors without heavy model downloads. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Prefer HF vision aliases when inverting mlx-vlm sanitizers, while keeping same-name tensor rewrites as a fallback. This preserves Gemma3 patch embedding layout fixes and avoids stopping early on Qwen-family MLX vision_tower names.
Use loaded VLM model instances when replaying mlx-vlm sanitizers for GGUF tensor rewrites, while keeping the config-derived class pipeline as a fallback. This covers models whose top-level sanitizer delegates through submodules, such as GLM-OCR.
444b88b to
0336fc3
Compare
This reverts commit d8cdf89. Also removes the related regression coverage and later helper hardening now that bug-4 is out of scope.
|
Triple-confirmed this PR end-to-end: 1. Local Linux, four Python versions47 tests pass on each of 3.10 / 3.11 / 3.12 / 3.13: the 23 regressions from 2. Cross-OS shim path
3. Real Apple Silicon mlx wheels plus Studio bootDrove Pre-PR vs post-PR diffCopied this PR's test file onto Coverage per fix
Notes for the maintainer
LGTM. |
…nslothai#697 - unsloth_zoo/mlx/loader.py: _read_json_file returned lists/strings/None for sidecars holding valid but non-object JSON, which crashed _repair_degraded_vlm_processor with AttributeError on .get(). Coerce non-dict payloads to {} to match the documented contract. - tests/test_mlx_save_export_edge_cases.py: 71 tests covering branches the regression suite does not reach: rank-3/5D/1D tensor matching, the _MlxVlmSanitizeProxy two-arg sanitize path, the real shard rewrite loop with safetensors files plus index.json remapping, duplicate-name guards, NextN decrement and skip branches, sidecar filesystem edges (unicode, symlinks, permissions, src equals dst), six-thread concurrent GGUF exports over the patcher env lock, a user-set UNSLOTH_LLAMA_CPP_SCRIPTS_DIR override, and monolith vs package converter placement. - tests/mlx_simulation/mlx_stub.py: save_safetensors now normalizes tensors to contiguous; real MLX has no contiguity constraint, so the shim should not reject the rewrite loop's transposed views. All 273 MLX tests pass on Linux via the simulation shim across Python 3.10-3.13 and transformers 4.51.3, 4.57.6 and 5.5.0. The only failure is the pre-existing FakeModel one in test_mlx_finetune_last_n_layers.py, tracked in unslothai#700.
|
Pushed Bug fix:
|
Comment and docstring only pass over the PR diff, applying the repo brevity policy: collapse multi line blocks, drop comments that restate the code, and keep load bearing rationale. Scope is restricted to lines this PR adds so the pass cannot conflict with the comment tightening already merged on main. - unsloth_zoo/llama_cpp.py: 3 line placement comment down to 2. - unsloth_zoo/mlx/utils.py: file backed safetensors gotcha down to 2 lines, dropped a comment restating _copy_source_sidecars. - unsloth_zoo/mlx/loader.py: _repair_degraded_vlm_processor docstring 8 to 5 lines keeping the why. - tests/test_mlx_save_export_regressions.py: module docstring 5 to 3 lines. - tests/test_mlx_save_export_edge_cases.py: 14 three line section banners collapsed to one line each, two line docstrings to one, removed docstrings and comments that restate test names or adjacent assertions. Net 45 lines removed. Every file verified code identical to the previous commit by AST comparison with docstrings stripped; comment_tools.py check passes for all files except the one intentional docstring deletion, which the AST comparison covers. Full MLX suite: 273 passed, 1 skipped, plus the known pre existing failure tracked in unslothai#700.
|
Pushed Scope is restricted to lines this PR adds, so it cannot conflict with the comment tightening already merged on main. Every touched file was verified code identical to the previous commit by AST comparison with docstrings stripped, and the full MLX suite still passes (273 passed, 1 skipped, plus the known pre existing failure tracked in #700). |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50309fe78c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| first_conversion=None, | ||
| token=None, |
There was a problem hiding this comment.
Move first_conversion after existing upload args
Adding first_conversion before token changes the existing positional API: a caller using the previous form push_to_hub_gguf(model, tok, dir, repo, "q8_0", token) now binds the token string as first_conversion and leaves token=None. In that scenario the export attempts to use the HF token as the intermediate GGUF dtype and the Hub client is created unauthenticated, so existing positional callers fail or upload without credentials; append the new option after private or make it keyword-only to preserve compatibility.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Valid catch. Fixed in 350d790: first_conversion now sits after token and private, so the pre-existing positional form push_to_hub_gguf(model, tok, dir, repo, method, token) keeps binding token as token. Guarded by test_push_to_hub_gguf_positional_token_stays_token.
| module_model_type = (model_type or "").replace("-", "_") | ||
| module_candidates = ( | ||
| f"mlx_vlm.models.{module_model_type}.processing", | ||
| f"mlx_vlm.models.{module_model_type}.processing_{module_model_type}", | ||
| ) |
There was a problem hiding this comment.
Resolve remapped mlx-vlm processor modules
When the saved model_type is one of mlx-vlm's remapped aliases (the loader already accounts for MODEL_REMAPPING in _build_vlm_model_types), this only probes the raw name with hyphens replaced. In that case a degraded tokenizer-only processor cannot be repaired because the custom processor class lives under the remapped mlx-vlm package, so subsequent saves still miss the image processor metadata; include the remapped target module name in the candidates before falling back to Transformers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed. 350d790 adds the MODEL_REMAPPING target package to the module candidates, so aliased model types resolve their custom processor classes before falling back to Transformers. Covered by test_resolve_processor_class_follows_model_remapping.
| if model is None or getattr(model, "sanitize", None) is None: | ||
| return [] |
There was a problem hiding this comment.
Allow submodule-only VLM sanitizers
For VLM wrappers whose top-level model object does not define sanitize but a component such as vision_tower or thinker.vision_tower does, this early return prevents the submodule scan below from ever running. _prepare_vlm_gguf_export_directory then sees no sanitizer pipelines and skips the HF-layout rewrite, leaving those GGUF exports with MLX-native tensor names/layouts; build pipelines from available submodule sanitizers even when the wrapper itself has no sanitizer.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed. 350d790 makes the wrapper step optional: submodule sanitizers such as vision_tower.sanitize now build replay pipelines even when the wrapper has no sanitize of its own, so those exports get the HF-layout rewrite. Covered by test_sanitize_pipelines_built_from_submodule_only_sanitizers plus a guard that the wrapper-first ordering is unchanged when the wrapper does sanitize.
| if image_processor_type: | ||
| try: | ||
| import transformers | ||
| image_processor_class = getattr(transformers, image_processor_type, None) | ||
| if image_processor_class is not None: | ||
| return image_processor_class(**image_kwargs) | ||
| except Exception: | ||
| pass | ||
|
|
||
| try: | ||
| from transformers import AutoImageProcessor | ||
| return AutoImageProcessor.from_pretrained(model_path) |
There was a problem hiding this comment.
Try mlx-vlm image processors before giving up
When the degraded processor came from a model whose image processor is implemented by mlx-vlm rather than transformers, this path cannot rebuild it: it only looks up image_processor_type on the transformers module and then retries AutoImageProcessor.from_pretrained, which is the failure mode this repair is meant to recover from. For those models the function returns None, so _repair_degraded_vlm_processor leaves a tokenizer-only processor and saved exports still miss multimodal processor metadata; use the mlx-vlm model module's ImageProcessor/load_image_processor path as a candidate here.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed. 350d790 tries the mlx-vlm processing module for the named image processor class after the Transformers lookup and before the AutoImageProcessor retry, so mlx-vlm native image processors can be rebuilt from sidecar kwargs. Covered by test_image_processor_rebuild_uses_mlx_vlm_module_class.
All four review comments were valid; each is fixed with a regression test in tests/test_mlx_save_export_edge_cases.py: - push_to_hub_gguf: first_conversion moved after token and private so pre-existing positional callers keep their meaning; it was inserted before token and would have bound a positionally passed HF token as the intermediate GGUF dtype. - _get_mlx_vlm_model_sanitize_pipelines: wrappers without a top level sanitize no longer return zero pipelines; submodule sanitizers such as vision_tower.sanitize now produce replay pipelines on their own. - _resolve_mlx_vlm_processor_class: probes the MODEL_REMAPPING target package as well, so aliased model types resolve their custom processor classes. - _build_vlm_image_processor_from_config: falls back to the mlx-vlm processing module for image processor classes that do not exist in transformers, before the AutoImageProcessor retry. Full MLX suite: 278 passed, 1 skipped. Full repo suite: 1114 passed; the two failures are pre-existing on the branch (FakeModel stub tracked in unslothai#700, and the PyPI version sync test which lags because this branch predates the latest release and resolves on merge).
…tubs (#700) * Fix mlx-ci.yml torch install gap + finetune-last-n-layers FakeModel stubs Two unrelated cleanups surfaced while validating PR #697: 1. .github/workflows/mlx-ci.yml fails on every scheduled run. pyproject gates `torch` off on darwin+arm64 (mlx is the native path on Apple Silicon), so `pip install -e .[mlx]` on macos-14 does not pull torch. But tests/mlx_simulation/__init__.py:52 unconditionally `import torch` so the autouse shim that powers the test suite can monkeypatch torch.Tensor with MLX-only methods. Collection of tests/test_mlx_torch_shim_smoke.py blows up with `ModuleNotFoundError: No module named 'torch'` before any test runs. Reproduces: most recent scheduled cron run on main (run id 26499388502) failed in 52s with exactly that error at collection time. Fix: install the CPU torch wheel explicitly before `.[mlx]`, matching the local-dev install order. Three-line patch. 2. tests/test_mlx_finetune_last_n_layers.py FakeModel is incomplete. `unsloth_zoo/mlx/loader.py::FastMLXModel.get_peft_model` prints a "LoRA applied -- N trainable params (P% of T total)" stats line at the tail of the function via `mlx.utils.tree_flatten(model.trainable_parameters())` plus `model.parameters()`. The synthetic FakeModel in this test never implemented those two methods, so the assertion targeting `linear_to_lora_layers`'s captured `num_layers` is never reached -- the test crashes at `loader.py:3712` with `AttributeError: 'FakeModel' object has no attribute 'trainable_parameters'`. The test exercises a parameter passthrough; the stats line is incidental. Fix: stub `trainable_parameters` and `parameters` on FakeModel to return empty dicts so tree_flatten yields zero entries and the stats line prints "0.00% of 0 total" without raising. Verified locally: python -m pytest tests/test_mlx_finetune_last_n_layers.py -v 2 passed python -m pytest tests/test_mlx_*.py -q 180 passed, 1 skipped python -m pytest tests/ -q --ignore=tests/security 1017 passed, 50 skipped * Match pyproject torch upper bound (<2.13.0, not <2.11.0) pyproject.toml line 26 declares torch>=2.4.0,<2.13.0 ; (sys_platform != 'darwin' or platform_machine != 'arm64') The mlx-ci.yml install used <2.11.0 by mistake. Bring the workflow into sync so it does not block torch 2.11 / 2.12 which the project itself supports. --------- Co-authored-by: Daniel Han-Chen <michaelhan2050@gmail.com>
|
Ran a post-merge verification on this since it landed mid-review: the parity fixes are artifact-level correct, not just call-level. Confirmed One small follow-up worth a one-liner: the "already saved" check in |
This PR fixes several issues around MLX merged saves and GGUF export parity.
Bug list:
VLM merged/full saves were missing VLM config fidelity
Root cause: MLX merged saves used the text-only
mlx_lm.utils.save_configpath for VLM checkpoints, so VLM-specific config structure, normalization, and fields were not preserved inconfig.json.Commit: 615aff8
QLoRA merged 16-bit saves were not fully dequantized
Root cause: The merged 16-bit path treated LoRA fusion with
fuse(dequantize=True)as sufficient dequantization, but quantized non-LoRA modules could remain packed.Commit: 55fa173
VLM GGUF mmproj output was metadata-only / missing real projector tensors
Root cause: The temporary GGUF staging directory contained mlx-vlm sanitized tensor names/layouts, while llama.cpp MMPROJ conversion expects HF/llama.cpp projector names/layouts; in-place rewriting also risked corrupting file-backed tensors.
Commit: 6fcaa19
MLX GGUF export options like
first_conversionwere dropped by the bound save methodRoot cause: The monkey-patched MLX
save_pretrained_ggufwrapper accepted CUDA-compatible**kwargsbut discarded them when calling the utility implementation, so caller-specified GGUF options likefirst_conversionnever reached export.Commit: 0029c54
VLM GGUF exports could leave same-name vision tensors in MLX layout, e.g. Gemma3 patch embeddings stayed OHWI instead of OIHW
Root cause: VLM sanitizer inversion only considered renamed candidates; for models whose sanitizer leaves a tensor name unchanged but changes layout, replay could accept the unchanged name/tensor and stop before applying the needed layout transform.
Commit: 97c980f
VLM GGUF sanitizer replay failed for models whose sanitizers need real submodules, e.g. GLM-OCR
Root cause: Sanitizer replay used a fake proxy object with only
config/args, but model sanitizers such as GLM-OCR call submodule sanitizers through real attributes likeself.vision_tower, so replay failed and no tensor rewrite happened.Commit: c7002e7
VLM GGUF export could fail with modern package-layout llama.cpp converters due missing
conversion/sibling packageRoot cause: Modern llama.cpp conversion scripts are package entrypoints that import sibling
conversion.*modules; writing/running the patched script from a detached cache directory broke that package layout, and the MLX path could also use a different llama.cpp tree than the shared patcher root.Commit: 4546225
MLX VLM loading could degrade to tokenizer-only processors, so GGUF staging omitted image processor metadata needed by mmproj conversion
Root cause: GLM-OCR custom processor construction could fail through Transformers
AutoImageProcessorwithouttorchvision, and mlx-vlm fell back to tokenizer-only metadata.Commit: 398cb9c
VLM GGUF staging could omit image processor sidecars like
preprocessor_config.json, breaking mmproj conversionRoot cause: MLX merged saves trusted tokenizer/model saves to emit every non-weight sidecar needed by downstream converters, but degraded or tokenizer-only processors can omit source files such as
preprocessor_config.json,processor_config.json,video_preprocessor_config.json, tokenizer model files, or custom processing files.Commit: df5deb9
GLM-OCR GGUF exports could advertise a NextN/MTP layer that the MLX model did not export, making llama.cpp load fail on missing
blk.16.*tensorsRoot cause: Source config metadata could advertise speculative/NextN/MTP layers through fields such as
num_nextn_predict_layers,nextn_predict_layers, ormtp_num_hidden_layers, but the loaded/exported MLX language model did not contain those extra blocks.Commit: d6d5d39
Raw mlx-vlm model saves could miss
config.jsonbecausemodel.configdataclass configs were not extractedRoot cause:
_get_model_configonly handled_configdictionaries andmodel.args, while raw mlx-vlm models can expose config as a dataclass-likemodel.config.Commit: 962dec8
Validation:
python -m pytest tests/test_mlx_save_export_regressions.py -q.