Summary
After PR #634 the MLX trainer sets MLXTrainingConfig.max_grad_value = 5.0 (originally 1.0 in the same PR) and, at training-config-resolution time, zeroes out a user-supplied max_grad_norm when both are non-zero, with only a one-line log notice. This breaks HuggingFace/TRL parity for the MLX path: a fine-tune that converges and emits a sensible greedy completion under TRL's SFTTrainer on CUDA produces gibberish on MLX given identical hyperparameters.
Repro
Identical 7-step LoRA on unsloth/gemma-3-270m-it, train row = "<<HELLO!!>> My name is Unsloth!", bs=2, grad_accum=3, lr=1e-3, lr_scheduler=constant, warmup=0, optim=adamw, weight_decay=0, max_seq=64, seed=3407, LoRA r=8 on q/k/v/o.
| Run | clip mode | step 1 → 7 loss | greedy completion of "<<HELLO!!>> My name is " |
|---|---|---|---|
| CUDA (torch+TRL) | max_grad_norm=1.0 only | 7.64 → 1.16 | " 1! ... My name is Unsloth! ..." (contains "Unsloth") |
| CUDA (torch+TRL) | elementwise clip_grad_value_(1.0) only | 7.64 → 1.19 | " 1! What are you doing?! ..." (no "Unsloth") |
| MLX (post-#634) | default (max_grad_value=1.0/5.0 overrides max_grad_norm) | 10.55 → 0.10 | '5 lbs!' |
| MLX (post-#634) | user sets max_grad_value=0 so max_grad_norm=1.0 wins | 10.55 → 0.17 | '5 lbs!' |
| MLX (pre-#634, last green @ unsloth 12295c1f) | trainer default | 10.55 → 5.04 (non-monotone) | " Unsloth!\n\nMy name is Unsloth! ..." (contains "Unsloth") |
The CUDA mirror script lives at temp/torchcodec_test/cuda_mirror.py in my local workspace, with the results JSON in the same directory. MLX numbers come from the MLX CI on Mac M1 workflow on unslothai/unsloth.
Bisection
Only one unsloth-zoo commit landed between MLX-CI last green (2026-05-14T10:52Z, unsloth 12295c1f) and first red (2026-05-14T12:24Z, unsloth a9322946): #634, e6d8f7f, 2026-05-14T12:10:03Z.
What changed in #634 that broke parity
MLXTrainingConfig.max_grad_value introduced and defaulted to a non-zero value (1.0, later 5.0 in the same PR).
unsloth_zoo/mlx/trainer.py:733-738: when both max_grad_norm > 0 and max_grad_value > 0, max_grad_norm is forced to 0 with a printed notice. Users passing the HF/TRL-standard max_grad_norm=1.0 get it dropped, with that single log line as the only indication.
bias_correction=True was correctly added to match torch.optim.AdamW. That part is HF parity and should stay.
The elementwise cap rotates the gradient direction per leaf, which is mathematically different from clip_grad_norm and is not what HF/TRL users opt into when they set max_grad_norm. The CUDA mirror above shows the same direction-rotation effect under torch with clip_grad_value_(1.0) only: a nearly identical loss curve (7.64 → 1.19 versus 7.64 → 1.16), but a broken completion.
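To make the direction-rotation point concrete, here is a minimal pure-Python sketch (no torch or MLX required). The helper names and the toy gradient are illustrative assumptions, and the 1e-6 epsilon is just a common guard against division by zero, not a claim about any particular library's implementation. It compares the cosine similarity between a raw gradient and each clipped version.

```python
import math

def clip_by_norm(grad, max_norm):
    """Scale the whole vector so its L2 norm is at most max_norm.

    Every element is multiplied by the same factor, so direction is preserved.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-6))  # epsilon is an illustrative guard
    return [g * scale for g in grad]

def clip_by_value(grad, max_value):
    """Clamp each element to [-max_value, max_value] independently."""
    return [max(-max_value, min(max_value, g)) for g in grad]

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy gradient with one dominant component (hypothetical values).
grad = [10.0, 0.5, -0.1]
by_norm = clip_by_norm(grad, 1.0)
by_value = clip_by_value(grad, 1.0)

print("norm clip cosine: ", cosine(grad, by_norm))
print("value clip cosine:", cosine(grad, by_value))
```

Norm clipping keeps the cosine at 1 up to floating-point error, while the elementwise clamp shrinks only the dominant component and so changes the relative weight of the others. Whether that rotation fully explains the MLX regression is a separate question; the sketch only shows that the two modes are not interchangeable.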
Recommended fix
In unsloth_zoo/mlx/trainer.py:
- MLXTrainingConfig.max_grad_value: float | None = None (off by default).
- At resolution time (lines 727-739) treat None as "feature disabled":
  - None → _clip_grad_value = False, never override max_grad_norm.
  - 0 → same as None (off).
  - explicit float > 0 → opt-in, only then warn-and-override if max_grad_norm is also set.
- Leave bias_correction=True (PyTorch parity).
Effect: a default MLXTrainingConfig honors args.max_grad_norm and matches CUDA HF/TRL semantics. Power users can still opt into elementwise clipping by passing max_grad_value explicitly.
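The resolution rules above could be sketched roughly as follows. This is not the code in unsloth_zoo/mlx/trainer.py: the function name resolve_grad_clipping and its return shape are hypothetical, chosen only to make the three cases (None, 0, explicit positive float) easy to check.

```python
from typing import Optional, Tuple

def resolve_grad_clipping(
    max_grad_norm: Optional[float],
    max_grad_value: Optional[float] = None,
) -> Tuple[float, float, bool]:
    """Hypothetical helper returning (effective_norm, effective_value, clip_grad_value).

    Illustrates the proposed semantics only; names and return shape are assumptions.
    """
    norm = float(max_grad_norm or 0.0)
    value = float(max_grad_value or 0.0)

    if value <= 0.0:
        # None or 0: elementwise clipping is disabled and max_grad_norm is honored.
        return norm, 0.0, False

    if norm > 0.0:
        # The user explicitly opted into elementwise clipping while norm clipping
        # is also set: warn, then let max_grad_value take precedence.
        print(
            "Unsloth: max_grad_norm and max_grad_value are both enabled; "
            "ignoring max_grad_norm in favor of max_grad_value."
        )
    return 0.0, value, True

# Default config: the HF/TRL-standard max_grad_norm=1.0 survives untouched.
print(resolve_grad_clipping(1.0))
# Explicit opt-in: elementwise clipping wins, with a warning.
print(resolve_grad_clipping(1.0, 5.0))
```

With this shape, the override path is reachable only when the caller passes a positive max_grad_value, so a default configuration never loses its max_grad_norm.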
Why this matters
Unsloth's MLX path is sold as a drop-in for SFTTrainer on Apple Silicon. Today, a user fine-tuning with an identical config on CUDA and MLX gets different gradient-clipping semantics and visibly different convergence basins. The MLX side prints the notice "Unsloth: max_grad_norm and max_grad_value are both enabled; ignoring max_grad_norm in favor of max_grad_value.", but that line is one of dozens in the training logs and is easy to miss.
I will open a follow-up PR with the change above; filing this first so the rationale is captured separately from the patch.
MLX CI runs referenced in the Repro section (MLX CI on Mac M1 workflow on unslothai/unsloth):
- Last green: unsloth 12295c1f, completion contains "Unsloth".
- First red: unsloth a9322946, step callback error: _on_step() takes 8 positional arguments but 9 were given (separate signature drift fixed in unsloth PR #5498).
- max_grad_value=0: run 25986340874 at 5428914, callback fixed, training reaches inference, AssertionError: in-memory generation gibberish: '5 lbs!'.