fix(mlx): forward resume_from_checkpoint to MLXTrainer.train()#6173

Merged
danielhanchen merged 1 commit into
unslothai:mainfrom
BardiaKoopah:feat/mlx-studio-resume-wiring
Jun 11, 2026

@BardiaKoopah

Contributor

Summary

Studio's frontend exposes a Resume action and submits requests with resume_from_checkpoint set to a previous run's output_dir. The CUDA training paths in worker.py read this field from config and pass it to trainer.train() (see lines 2729-2787 and 3108-3229). The MLX path _run_mlx_training did neither: it never read config['resume_from_checkpoint'] and called trainer.train() with no args.

The MLX trainer also didn't accept the kwarg, so even threading it through would have been a no-op. The trainer-side support is added in unslothai/unsloth-zoo#751, which:

  • Writes optimizer_state.safetensors and trainer_state.json on every checkpoint
  • Adds resume_from_checkpoint: str | None = None to MLXTrainer.train()
  • Loads adapter weights, optimizer state, and trainer scalars when resuming
  • Fast-forwards the loop counter and dataloader to the resume position
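The last two bullets can be illustrated with a toy loop. This is only a sketch of the general resume pattern, not the unsloth-zoo implementation: the `global_step` field name, the `checkpoint-N` directory layout, and every function below are assumptions made for illustration, and adapter weights and optimizer state are left out entirely.

```python
import itertools
import json
import os
import tempfile


def save_trainer_state(checkpoint_dir, step):
    """Write a minimal trainer_state.json (the field name is hypothetical)."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "trainer_state.json"), "w") as f:
        json.dump({"global_step": step}, f)


def load_trainer_state(checkpoint_dir):
    with open(os.path.join(checkpoint_dir, "trainer_state.json")) as f:
        return json.load(f)


def train(batches, max_steps, save_steps, output_dir, resume_from_checkpoint=None):
    """Toy loop: returns the (step, batch) pairs it processed."""
    start_step = 0
    if resume_from_checkpoint is not None:
        start_step = load_trainer_state(resume_from_checkpoint)["global_step"]

    # Fast-forward the data stream past the batches already consumed.
    stream = itertools.islice(batches, start_step, None)

    processed = []
    # The loop counter continues from the saved step instead of restarting at 1.
    for step, batch in enumerate(stream, start=start_step + 1):
        if step > max_steps:
            break
        processed.append((step, batch))
        if step % save_steps == 0:
            save_trainer_state(os.path.join(output_dir, f"checkpoint-{step}"), step)
    return processed


with tempfile.TemporaryDirectory() as out:
    fresh = train(iter(range(100, 110)), max_steps=10, save_steps=5, output_dir=out)
    resumed = train(
        iter(range(100, 110)),
        max_steps=10,
        save_steps=5,
        output_dir=out,
        resume_from_checkpoint=os.path.join(out, "checkpoint-5"),
    )

# Steps 6..10 of the resumed run see the same batches as the fresh run.
assert resumed == fresh[5:]
```

In this toy, resuming only reproduces the fresh run because the data order is deterministic and the stream is advanced by exactly the number of steps recorded in the checkpoint.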

This PR is the 2-line companion that makes Studio's MLX path actually use the kwarg.

Changes

  1. Read resume_from_checkpoint from config near the other config.get() extractions in _run_mlx_training.
  2. Pass it as a kwarg at the trainer.train() call site.
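The two steps above can be sketched in isolation as follows. `FakeTrainer` and `run_mlx_training_sketch` are hypothetical stand-ins so the snippet runs without MLX; only the `config.get(...) or None` line and the `trainer.train(resume_from_checkpoint=...)` call mirror the diff in this PR.

```python
class FakeTrainer:
    """Hypothetical stand-in for a trainer whose train() accepts the kwarg."""

    def train(self, resume_from_checkpoint=None):
        # Echo the argument back so the caller can see what was forwarded.
        return resume_from_checkpoint


def run_mlx_training_sketch(config, trainer):
    # Step 1: read the field; a missing key or empty string becomes None.
    resume_from_checkpoint = config.get("resume_from_checkpoint") or None
    # Step 2: forward it at the train() call site.
    return trainer.train(resume_from_checkpoint=resume_from_checkpoint)


trainer = FakeTrainer()
forwarded_fresh = run_mlx_training_sketch({}, trainer)
forwarded_empty = run_mlx_training_sketch({"resume_from_checkpoint": ""}, trainer)
forwarded_resume = run_mlx_training_sketch(
    {"resume_from_checkpoint": "outputs/previous_run"}, trainer
)
```

The `or None` normalization means a frontend that submits an empty string is treated the same as one that omits the field.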

Verification

Together with unsloth-zoo#751, on M2 16GB with unsloth/Qwen3-0.6B + unsloth/LaTeX_OCR, max_steps=10, save_steps=5, grad_accum=4:

| Step | Fresh loss | Resumed loss |
|---|---|---|
| 6 | 2.1686279773712158 | 2.168627977371216 |
| 7 | 1.6476788520812988 | 1.6476788520812988 |
| 8 | 1.4659109115600586 | 1.4659109115600586 |
| 9 | 1.4719936847686768 | 1.4719936847686768 |
| 10 | 1.4772168397903442 | 1.4772168397903442 |

Every post-resume loss matches the fresh-run baseline bit for bit.

Dependency

This PR is a no-op without unsloth-zoo#751. The two should land in order: zoo first, then this PR.

Studio's frontend exposes a Resume action and submits requests with
resume_from_checkpoint set to a previous run's output_dir. The CUDA
training paths in worker.py read this field from config and pass it to
trainer.train() (see lines 2729-2787 and 3108-3229). The MLX path
_run_mlx_training did neither: it never read config['resume_from_checkpoint']
and called trainer.train() with no args. The MLX trainer also did not
accept the kwarg, so even threading it through would have been a no-op.

With this PR + the unsloth-zoo companion PR adding the trainer-side
support (saves optimizer_state + trainer_state, accepts and applies
resume_from_checkpoint in MLXTrainer.train()), MLX Resume now works
end-to-end. Verified on M2 16GB with Qwen3-0.6B + unsloth/LaTeX_OCR:
loss at every post-resume step matches a fresh run bit for bit
(2.168627977371216 == 2.168627977371216 at step 6, etc).

Two lines: read the field near the other config.get() extractions in
_run_mlx_training, pass it as a kwarg at the trainer.train() call site.

Companion PR: unslothai/unsloth-zoo#751

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1f583bd81

  gc.collect()
  mx.synchronize()
- trainer.train()
+ trainer.train(resume_from_checkpoint=resume_from_checkpoint)

[P1] Gate the new MLX train kwarg on zoo support

In installs where unsloth-zoo predates the companion MLXTrainer.train(resume_from_checkpoint=...) change, even a fresh MLX run now fails here with TypeError: train() got an unexpected keyword argument 'resume_from_checkpoint', because the kwarg is passed unconditionally even when its value is None. The repo still allows those installs (studio/backend/requirements/base.txt is unpinned and pyproject.toml only requires unsloth_zoo>=2026.6.2), so landing this worker change without a version bump or signature fallback makes Apple Silicon training unusable rather than a no-op. Please enforce the newer zoo version before this call or feature-detect/fallback for older trainers.
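One way to feature-detect support, sketched here under assumptions, is to inspect the signature of `train()` before passing the kwarg. `OldTrainer`, `NewTrainer`, `accepts_kwarg`, and `call_train` are hypothetical names for illustration and are not part of unsloth or unsloth-zoo; `inspect.signature` is the standard-library call.

```python
import inspect


def accepts_kwarg(fn, name):
    """Return True if fn can be called with keyword argument `name`."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Signature unavailable (e.g. some builtins): assume unsupported.
        return False
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all also accepts the name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_train(trainer, resume_from_checkpoint=None):
    if accepts_kwarg(trainer.train, "resume_from_checkpoint"):
        return trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    if resume_from_checkpoint is not None:
        # Fail loudly instead of silently restarting from step 0.
        raise RuntimeError(
            "Installed trainer does not support resume_from_checkpoint"
        )
    return trainer.train()


class OldTrainer:
    """Hypothetical trainer that predates the kwarg."""

    def train(self):
        return "fresh"


class NewTrainer:
    """Hypothetical trainer that accepts the kwarg."""

    def train(self, resume_from_checkpoint=None):
        return resume_from_checkpoint or "fresh"


old_result = call_train(OldTrainer())
new_result = call_train(NewTrainer(), resume_from_checkpoint="outputs/previous_run")
```

Raising when an older trainer receives a resume request is a design choice in this sketch; the alternative the review mentions, enforcing a newer unsloth_zoo version, would avoid the runtime check altogether.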

# the floor for the MLX path, so the Resume UI button silently restarted
# from step 0 (the CUDA path at lines 2729 / 3108 has been forwarding
# this all along).
resume_from_checkpoint = config.get("resume_from_checkpoint") or None

[P1] Reuse the checkpoint output directory on MLX resume

When Studio resumes a run, routes/training.py normalizes resume_from_checkpoint, but TrainingBackend.start_training only serializes that checkpoint into the worker config and drops the route's output_dir. The CUDA paths compensate with _output_dir_from_resume_checkpoint(...), while the MLX path still falls through to a fresh timestamped output dir. In the normal Resume UI flow, this means MLX resumes from the old checkpoint but saves and reports completion under a new output directory. The history query marks older runs as resumed_later by matching output_dir, so it will not mark the original stopped run as resumed, and users can keep resuming from the stale checkpoint. Please derive the MLX output_dir from resume_from_checkpoint the same way the other worker paths do.
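A rough sketch of what deriving the output directory could look like is shown below. It is not the body of `_output_dir_from_resume_checkpoint`, which is not shown in this PR; `output_dir_for_run` is a hypothetical helper, and the `checkpoint-N` subdirectory convention is an assumption.

```python
from pathlib import Path


def output_dir_for_run(resume_from_checkpoint, fresh_output_dir):
    """Pick the output dir: reuse the resumed run's dir, else the fresh one.

    Assumption: a checkpoint path is either the run's output_dir itself or a
    'checkpoint-N' subdirectory inside it.
    """
    if not resume_from_checkpoint:
        return fresh_output_dir
    path = Path(resume_from_checkpoint)
    if path.name.startswith("checkpoint-"):
        # Checkpoint subdirectory: the run directory is its parent.
        return str(path.parent)
    # Already the run directory.
    return str(path)


fresh_dir = output_dir_for_run(None, "outputs/run_new")
from_run_dir = output_dir_for_run("outputs/run_old", "outputs/run_new")
from_checkpoint = output_dir_for_run("outputs/run_old/checkpoint-5", "outputs/run_new")
```

With this shape, a resumed run keeps writing under the original run's directory, which is what a history query that matches on output_dir would need.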

@danielhanchen danielhanchen merged commit 0ac0511 into unslothai:main Jun 11, 2026
23 of 28 checks passed