Hi,
First of all, thanks for your work on fixing gradient accumulation! I have a question about the implementation in unsloth-zoo. In the blog post https://unsloth.ai/blog/gradient you say:

> This means naively averaging over each gradient accumulation step is wrong, but instead we must derive the denominator beforehand.
But looking at the code, I can see that you simply add up the losses, while the multiplication by the denominator is commented out:

    loss = model(input_ids = input_ids, labels = labels, n_items = n_items).loss
    # loss = loss * inverse_gradient_accumulation_steps
    accumulated_loss += loss.detach()

Shouldn't the loss be multiplied by the denominator here to match the "After - Unsloth fix" graph?
Code reference: unsloth-zoo/unsloth_zoo/training_utils.py, lines 268 to 270 in 7b0048e
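
To make the question concrete, here is a small arithmetic sketch of the three options as I understand them. It is not taken from unsloth-zoo: the per-token loss values are made up, and it assumes that passing `n_items` makes the model return the summed token loss divided by the total token count across all accumulation steps. If that assumption holds, plain addition would already reproduce the full-batch loss and the extra `inverse_gradient_accumulation_steps` factor would not be needed.

```python
# Hypothetical per-token losses for two micro-batches of unequal length.
micro_batches = [
    [2.0, 4.0],                      # 2 non-padded tokens
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # 6 non-padded tokens
]
steps = len(micro_batches)

# Denominator derived beforehand: total token count over all micro-batches.
n_items = sum(len(mb) for mb in micro_batches)

# Reference: mean over every token, as if it were one large batch.
full_batch = sum(sum(mb) for mb in micro_batches) / n_items

# Naive accumulation: per-micro-batch mean, scaled by 1 / steps.
naive = sum((sum(mb) / len(mb)) / steps for mb in micro_batches)

# Assumed behaviour with n_items: each step returns sum / n_items,
# and the results are simply added up with no further scaling.
accumulated_loss = 0.0
for mb in micro_batches:
    loss = sum(mb) / n_items
    accumulated_loss += loss

print(full_batch, naive, accumulated_loss)
```

In this toy case the naive average differs from the full-batch loss, while the plain sum matches it. Is that what happens inside the model when `n_items` is passed?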