fix(mlx): forward resume_from_checkpoint to MLXTrainer.train() #6173
Studio's frontend exposes a Resume action and submits requests with resume_from_checkpoint set to a previous run's output_dir. The CUDA training paths in worker.py read this field from config and pass it to trainer.train() (see lines 2729-2787 and 3108-3229). The MLX path _run_mlx_training did neither: it never read config['resume_from_checkpoint'] and called trainer.train() with no args. The MLX trainer also did not accept the kwarg, so even threading it through would have been a no-op.

With this PR plus the unsloth-zoo companion PR that adds the trainer-side support (saves optimizer_state + trainer_state, accepts and applies resume_from_checkpoint in MLXTrainer.train()), MLX Resume now works end-to-end.

Verified on M2 16GB with Qwen3-0.6B + unsloth/LaTeX_OCR: loss at every post-resume step matches a fresh run bit for bit (2.168627977371216 == 2.168627977371216 at step 6, etc).

The change is two lines: read the field near the other config.get() extractions in _run_mlx_training, and pass it as a kwarg at the trainer.train() call site.

Companion PR: unslothai/unsloth-zoo#751
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d1f583bd81
  gc.collect()
  mx.synchronize()
- trainer.train()
+ trainer.train(resume_from_checkpoint=resume_from_checkpoint)
Gate the new MLX train kwarg on zoo support
In installs where unsloth-zoo predates the companion MLXTrainer.train(resume_from_checkpoint=...) change, even a fresh MLX run now fails here with TypeError: train() got an unexpected keyword argument 'resume_from_checkpoint', because the kwarg is passed unconditionally even when its value is None. The repo still allows those installs (studio/backend/requirements/base.txt is unpinned and pyproject.toml only requires unsloth_zoo>=2026.6.2), so landing this worker change without a version bump or signature fallback makes Apple Silicon training unusable rather than a no-op. Please enforce the newer zoo version before this call or feature-detect/fallback for older trainers.
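One way to implement the feature-detect fallback suggested above is to inspect the signature of train() before forwarding the kwarg. The following is only a minimal sketch: the helper name train_with_optional_resume and the two stand-in trainer classes are hypothetical and are not part of unsloth or unsloth-zoo.

```python
import inspect


def train_with_optional_resume(trainer, resume_from_checkpoint=None):
    """Call trainer.train(), forwarding resume_from_checkpoint only if supported."""
    params = inspect.signature(trainer.train).parameters
    accepts_kwarg = "resume_from_checkpoint" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_kwarg:
        return trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    if resume_from_checkpoint is not None:
        # Older trainer: resuming is impossible, so fail loudly instead of
        # silently restarting from step 0.
        raise RuntimeError(
            "installed trainer does not support resume_from_checkpoint"
        )
    # Older trainer and no resume requested: a fresh run still works.
    return trainer.train()


# Hypothetical stand-ins for illustration only; not the real MLXTrainer.
class OldTrainer:
    def train(self):
        return "fresh"


class NewTrainer:
    def train(self, resume_from_checkpoint=None):
        if resume_from_checkpoint:
            return f"resumed:{resume_from_checkpoint}"
        return "fresh"
```

With this shape, a fresh run keeps working on an older trainer, while a resume request against an older trainer raises a clear error rather than a TypeError or a silent restart. Enforcing a minimum unsloth-zoo version would be the alternative.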
# the floor for the MLX path, so the Resume UI button silently restarted
# from step 0 (the CUDA path at lines 2729 / 3108 has been forwarding
# this all along).
resume_from_checkpoint = config.get("resume_from_checkpoint") or None
Reuse the checkpoint output directory on MLX resume
When Studio resumes a run, routes/training.py normalizes resume_from_checkpoint, but TrainingBackend.start_training only serializes that checkpoint into the worker config and drops the route's output_dir; the CUDA paths compensate with _output_dir_from_resume_checkpoint(...), while the MLX path still falls through to a fresh timestamped output dir. In the normal Resume UI flow this means MLX resumes from the old checkpoint but saves and reports completion under a new output directory, so the history query that marks older runs as resumed_later by matching output_dir will not mark the original stopped run as resumed and users can keep resuming from the stale checkpoint. Please derive MLX output_dir from resume_from_checkpoint the same way the other worker paths do.
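A rough sketch of what deriving output_dir from the checkpoint could look like follows. The real _output_dir_from_resume_checkpoint helper is not shown in this thread, so the function below (output_dir_for_resume, a hypothetical name) is an assumption about its behavior, not a copy of it. In particular, the handling of a trailing checkpoint-<step> directory assumes the common Hugging Face layout.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: per-step checkpoints are stored as "<output_dir>/checkpoint-<step>".
_CHECKPOINT_DIR = re.compile(r"^checkpoint-\d+$")


def output_dir_for_resume(
    resume_from_checkpoint: Optional[str], fresh_output_dir: str
) -> str:
    """Pick the output directory for a run, reusing the old one on resume."""
    if not resume_from_checkpoint:
        # Not a resume: keep the freshly generated (timestamped) directory.
        return fresh_output_dir
    path = Path(resume_from_checkpoint)
    if _CHECKPOINT_DIR.match(path.name):
        # Resuming from a specific step: the run directory is the parent.
        return str(path.parent)
    # Studio sends the previous run's output_dir, so reuse it directly.
    return str(path)
```

Reusing the original directory keeps the resumed run's output_dir equal to that of the stopped run, which is what the history query described above matches on.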
Summary
Studio's frontend exposes a Resume action and submits requests with
resume_from_checkpointset to a previous run's output_dir. The CUDA training paths inworker.pyread this field from config and pass it totrainer.train()(see lines 2729-2787 and 3108-3229). The MLX path_run_mlx_trainingdid neither: it never readconfig['resume_from_checkpoint']and calledtrainer.train()with no args.The MLX trainer also didn't accept the kwarg, so even threading it through would have been a no-op. The trainer-side support is added in
unslothai/unsloth-zoo#751, which:optimizer_state.safetensorsandtrainer_state.jsonon every checkpointresume_from_checkpoint: str | None = NonetoMLXTrainer.train()This PR is the 2-line companion that makes Studio's MLX path actually use the kwarg.
Changes
resume_from_checkpointfrom config near the otherconfig.get()extractions in_run_mlx_training.trainer.train()call site.Verification
Together with
unsloth-zoo#751, on M2 16GB withunsloth/Qwen3-0.6B+unsloth/LaTeX_OCR,max_steps=10,save_steps=5,grad_accum=4:Every post-resume loss matches the fresh-run baseline bit for bit.
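A bit-for-bit comparison like the one described can be scripted once per-step losses from both runs are collected. The sketch below is illustrative only: the function names and the dict-of-losses input format are assumptions, and the only loss value taken from this PR is the step-6 figure quoted in the description.

```python
import struct


def float_bits(x: float) -> bytes:
    """Return the raw IEEE-754 double encoding of x."""
    return struct.pack("<d", x)


def mismatched_steps(fresh: dict, resumed: dict) -> list:
    """Steps logged by the resumed run whose loss differs bitwise from the fresh run."""
    return sorted(
        step
        for step, loss in resumed.items()
        if step not in fresh or float_bits(fresh[step]) != float_bits(loss)
    )


# Step 6 value as reported in the PR description; both runs logged the same bits.
fresh_losses = {6: 2.168627977371216}
resumed_losses = {6: 2.168627977371216}
print(mismatched_steps(fresh_losses, resumed_losses))
```

Comparing the packed bytes rather than using == makes the check strict: values such as 0.0 and -0.0 compare equal numerically but are reported as a mismatch here.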
Dependency
This PR is a no-op without unsloth-zoo#751. The two should land in order: zoo first, then this PR.