fix(mlx): forward resume_from_checkpoint to MLXTrainer.train() by BardiaKoopah · Pull Request #6173 · unslothai/unsloth

BardiaKoopah · 2026-06-10T22:34:24Z

Summary

Studio's frontend exposes a Resume action and submits requests with resume_from_checkpoint set to a previous run's output_dir. The CUDA training paths in worker.py read this field from config and pass it to trainer.train() (see lines 2729-2787 and 3108-3229). The MLX path _run_mlx_training did neither: it never read config['resume_from_checkpoint'] and called trainer.train() with no args.

The MLX trainer also didn't accept the kwarg, so even threading it through would have been a no-op. The trainer-side support is added in unslothai/unsloth-zoo#751, which:

Writes optimizer_state.safetensors and trainer_state.json on every checkpoint
Adds resume_from_checkpoint: str | None = None to MLXTrainer.train()
Loads adapter weights, optimizer state, and trainer scalars when resuming
Fast-forwards the loop counter and dataloader to the resume position
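
As a rough illustration of the last two bullets only, the sketch below persists a step counter in trainer_state.json and skips already-consumed batches on resume. It is not the unsloth-zoo implementation: the helper names, the `global_step` key, and the one-batch-per-step simplification are assumptions made for the example, and adapter weights and optimizer state are left out entirely.

```python
import itertools
import json
from pathlib import Path
from typing import Iterable, Iterator, Optional


def save_trainer_state(checkpoint_dir: Path, step: int) -> None:
    """Write the scalar trainer state next to the other checkpoint files."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # "global_step" is a hypothetical key name chosen for this sketch.
    state_file = checkpoint_dir / "trainer_state.json"
    state_file.write_text(json.dumps({"global_step": step}))


def load_resume_step(checkpoint_dir: Optional[Path]) -> int:
    """Return the step to resume from, or 0 for a fresh run."""
    if checkpoint_dir is None:
        return 0
    state = json.loads((checkpoint_dir / "trainer_state.json").read_text())
    return int(state["global_step"])


def batches_from(batches: Iterable, start_step: int) -> Iterator:
    """Skip the batches consumed before the checkpoint was written.

    Assumes one batch per optimizer step; with gradient accumulation the
    skip count would be start_step times the accumulation factor.
    """
    return itertools.islice(batches, start_step, None)
```

A resumed loop would then start its counter at `load_resume_step(...)` and iterate over `batches_from(...)`, so the first batch it sees is the one a fresh run would have seen at that step.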

This PR is the 2-line companion that makes Studio's MLX path actually use the kwarg.

Changes

Read resume_from_checkpoint from config near the other config.get() extractions in _run_mlx_training.
Pass it as a kwarg at the trainer.train() call site.
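
The two changes can be sketched in isolation as follows. `FakeTrainer` and `run_training` are stand-ins invented for this example (the real code lives in _run_mlx_training and uses MLXTrainer); only the `config.get(...) or None` line and the keyword call mirror the diff quoted in the review comments.

```python
from typing import Any, Optional


class FakeTrainer:
    """Hypothetical stand-in for MLXTrainer; it only records the kwarg."""

    def __init__(self) -> None:
        self.received: Optional[str] = None

    def train(self, resume_from_checkpoint: Optional[str] = None) -> None:
        self.received = resume_from_checkpoint


def run_training(config: dict, trainer: Any) -> None:
    # Change 1: read the field alongside the other config.get() extractions.
    # `or None` folds a missing key or an empty string into None.
    resume_from_checkpoint = config.get("resume_from_checkpoint") or None
    # Change 2: forward it as a keyword argument at the call site.
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```

Normalizing with `or None` means a request that carries an empty resume_from_checkpoint behaves like a fresh run.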

Verification

Together with unsloth-zoo#751, on M2 16GB with unsloth/Qwen3-0.6B + unsloth/LaTeX_OCR, max_steps=10, save_steps=5, grad_accum=4:

Step	Fresh loss	Resumed loss
6	2.1686279773712158	2.168627977371216
7	1.6476788520812988	1.6476788520812988
8	1.4659109115600586	1.4659109115600586
9	1.4719936847686768	1.4719936847686768
10	1.4772168397903442	1.4772168397903442

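Every post-resume loss matches the fresh-run baseline bit for bit. The step 6 entries differ only in how many digits were printed (17 versus 16 significant digits); both parse to the same float.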
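
One way to check a claim like this is to compare the IEEE 754 encodings of the logged losses rather than their printed forms. The sketch below is an assumption about how such a check could be written, not the script used for this PR; the helper names are made up, and the values are copied from the table above.

```python
import struct
from typing import Dict, Optional


def float_bits(x: float) -> bytes:
    """Return the IEEE 754 binary64 encoding of a Python float."""
    return struct.pack("<d", x)


def first_mismatch(fresh: Dict[int, float], resumed: Dict[int, float]) -> Optional[int]:
    """Return the first step whose losses differ in any bit, else None."""
    for step in sorted(fresh):
        if float_bits(fresh[step]) != float_bits(resumed[step]):
            return step
    return None


fresh_losses = {
    6: 2.1686279773712158,
    7: 1.6476788520812988,
    8: 1.4659109115600586,
    9: 1.4719936847686768,
    10: 1.4772168397903442,
}
resumed_losses = {
    6: 2.168627977371216,
    7: 1.6476788520812988,
    8: 1.4659109115600586,
    9: 1.4719936847686768,
    10: 1.4772168397903442,
}

mismatch = first_mismatch(fresh_losses, resumed_losses)
```

Comparing encoded bytes is stricter than `==` in two corner cases: it distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal.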

Dependency

This PR is a no-op without unsloth-zoo#751. The two should land in order: zoo first, then this PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1f583bd81

chatgpt-codex-connector · 2026-06-10T22:37:33Z

    gc.collect()
    mx.synchronize()
-    trainer.train()
+    trainer.train(resume_from_checkpoint=resume_from_checkpoint)


Gate the new MLX train kwarg on zoo support

In installs where unsloth-zoo predates the companion MLXTrainer.train(resume_from_checkpoint=...) change, even a fresh MLX run now fails here with TypeError: train() got an unexpected keyword argument 'resume_from_checkpoint', because the kwarg is passed unconditionally even when its value is None. The repo still allows those installs (studio/backend/requirements/base.txt is unpinned and pyproject.toml only requires unsloth_zoo>=2026.6.2), so landing this worker change without a version bump or signature fallback makes Apple Silicon training unusable rather than a no-op. Please enforce the newer zoo version before this call or feature-detect/fallback for older trainers.
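
The feature-detect option mentioned above could look roughly like this, using `inspect.signature` from the standard library. The helper names and the error message are hypothetical, and this is a sketch of one possible fallback, not what was merged.

```python
import inspect
from typing import Any, Callable, Optional


def accepts_kwarg(fn: Callable[..., Any], name: str) -> bool:
    """True if fn can be called with `name` passed as a keyword argument."""
    params = inspect.signature(fn).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts the name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def train_with_optional_resume(trainer: Any, resume_from_checkpoint: Optional[str]) -> Any:
    """Forward the kwarg only when the installed trainer supports it."""
    if accepts_kwarg(trainer.train, "resume_from_checkpoint"):
        return trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    if resume_from_checkpoint is not None:
        # Failing loudly avoids silently restarting from step 0.
        raise RuntimeError(
            "This trainer does not support resume_from_checkpoint; "
            "upgrade unsloth-zoo to resume MLX runs."
        )
    return trainer.train()
```

With this shape, a fresh run on an older trainer still works, and only an actual resume request surfaces the version mismatch.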

chatgpt-codex-connector · 2026-06-10T22:37:33Z

+    # the floor for the MLX path, so the Resume UI button silently restarted
+    # from step 0 (the CUDA path at lines 2729 / 3108 has been forwarding
+    # this all along).
+    resume_from_checkpoint = config.get("resume_from_checkpoint") or None


Reuse the checkpoint output directory on MLX resume

When Studio resumes a run, routes/training.py normalizes resume_from_checkpoint, but TrainingBackend.start_training only serializes that checkpoint into the worker config and drops the route's output_dir. The CUDA paths compensate with _output_dir_from_resume_checkpoint(...); the MLX path still falls through to a fresh timestamped output dir. In the normal Resume UI flow, MLX therefore resumes from the old checkpoint but saves and reports completion under a new output directory. The history query that marks older runs as resumed_later matches on output_dir, so it will not mark the original stopped run as resumed, and users can keep resuming from the stale checkpoint. Please derive the MLX output_dir from resume_from_checkpoint the same way the other worker paths do.
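
The body of _output_dir_from_resume_checkpoint(...) is not shown in this thread, so the following is only a guess at the general shape of such a helper. The function name `output_dir_for_resume` and the `checkpoint-<step>` subdirectory layout are assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed layout: <output_dir>/checkpoint-<step>
_CHECKPOINT_DIR = re.compile(r"^checkpoint-\d+$")


def output_dir_for_resume(resume_from_checkpoint: Optional[str], fresh_output_dir: str) -> str:
    """Pick the directory a resumed run should keep writing to."""
    if not resume_from_checkpoint:
        # Fresh run: keep the newly generated (for example timestamped) dir.
        return fresh_output_dir
    path = Path(resume_from_checkpoint)
    if _CHECKPOINT_DIR.match(path.name):
        # Pointed at a specific checkpoint: reuse its parent run directory.
        return str(path.parent)
    # Pointed at the run's output_dir itself: reuse it as-is.
    return str(path)
```

Reusing the original directory keeps the resumed run under the same output_dir that the history query matches on.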

danielhanchen merged commit 0ac0511 into unslothai:main Jun 11, 2026
23 of 28 checks passed

danielhanchen merged 1 commit into unslothai:main from BardiaKoopah:feat/mlx-studio-resume-wiring
