Question for @lewtun @kashif:
We noticed that the load balancing loss (aux_loss) implemented for MoE models in modeling_mixtral.py#L1244 is not added to the loss computed in the DPO/KTO trainers.
Isn't this term needed to keep the load balanced across experts after fine-tuning, or does the KL penalty sufficiently prevent the router weights from changing too much?
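
To make the question concrete, here is a minimal sketch of what we have in mind. It is a simplified pure-Python version of a Switch-Transformer-style load balancing term for a single MoE layer, not the actual code in modeling_mixtral.py, and the function names `load_balancing_loss` and `total_preference_loss` as well as the coefficient value are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, top_k=2):
    """Simplified auxiliary load balancing loss for one MoE layer.

    router_logits: list of per-token lists, one logit per expert.
    Hypothetical helper for illustration, not the library implementation.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Fraction of tokens whose top-k selection includes each expert.
    counts = [0.0] * num_experts
    for row in probs:
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:top_k]
        for e in top:
            counts[e] += 1.0
    frac_tokens = [c / num_tokens for c in counts]

    # Mean router probability assigned to each expert.
    mean_prob = [
        sum(row[e] for row in probs) / num_tokens for e in range(num_experts)
    ]

    # Larger when token assignments and router probability mass
    # concentrate on the same few experts.
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


def total_preference_loss(preference_loss, aux_loss, aux_loss_coef=0.001):
    """Add the weighted auxiliary term to a DPO/KTO-style scalar loss.

    aux_loss_coef is an illustrative value (assumption), not a recommendation.
    """
    return preference_loss + aux_loss_coef * aux_loss


# Two tokens routed to disjoint expert pairs vs. both routed to the same pair.
balanced = load_balancing_loss([[10, 10, -10, -10], [-10, -10, 10, 10]])
collapsed = load_balancing_loss([[10, 10, -10, -10], [10, 10, -10, -10]])
print(balanced, collapsed)
```

In this toy setup the collapsed routing yields a larger auxiliary term than the balanced one, which is the signal we would expect to lose if the term is dropped from the trainer loss.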