[DPO/KTO] Mixtral Load Balancing Loss

Question for @lewtun @kashif:

We noticed that the load balancing loss (aux_loss) implemented for MoE models ([modeling_mixtral.py#L1244](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mixtral/modeling_mixtral.py#L1391)) is not added to the loss computed in the DPO/KTO trainers.

Isn't this needed so that load balancing between experts is still maintained after fine-tuning, or does the KL penalty already prevent the router weights from changing too much?
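To make the question concrete, here is a minimal pure-Python sketch of what adding the auxiliary term to a preference loss could look like. It is an illustration under assumptions, not the code in transformers or in the DPO/KTO trainers. `load_balancing_loss` is a simplified Switch-Transformer-style balancing term, `dpo_sigmoid_loss` is the standard sigmoid DPO loss for a single pair, and `total_loss` and `aux_loss_coef` (with its value) are hypothetical names chosen for this example.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair of sequence log-probs."""
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) == log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


def load_balancing_loss(router_probs, top_k=1):
    """Simplified balancing term: num_experts * sum_e f_e * P_e.

    router_probs: one list of per-expert probabilities per token.
    f_e: fraction of tokens whose top-k experts include expert e.
    P_e: mean router probability assigned to expert e.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    tokens_per_expert = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        ranked = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        for e in ranked[:top_k]:
            tokens_per_expert[e] += 1.0 / num_tokens
        for e, p in enumerate(probs):
            mean_prob[e] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(tokens_per_expert, mean_prob))


def total_loss(preference_loss, aux_loss, aux_loss_coef=0.001):
    """Hypothetical combination: preference loss plus a scaled auxiliary loss."""
    return preference_loss + aux_loss_coef * aux_loss


# Balanced routing gives a smaller auxiliary term than collapsed routing.
balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
pair_loss = dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0)  # zero margin
print(balanced, collapsed, total_loss(pair_loss, collapsed))
```

In this toy setup the auxiliary term grows when all tokens are routed to the same expert, so adding it to the preference loss would push the router back toward balanced usage. Whether that is needed on top of the KL constraint is exactly the open question.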


[DPO/KTO] Mixtral Load Balancing Loss #1544
