SwiGLU activation function

### Feature request

[SwiGLU](https://arxiv.org/pdf/2002.05202v1.pdf) was recently used in [PaLM](https://arxiv.org/abs/2204.02311), and several papers report improved performance with it, so it would be good to have a SwiGLU implementation available as an activation function.



### Motivation

I am building a biomedical RoBERTa-based model with a specific biomedical vocabulary. It can be seen as a version of PubMedBERT with the RoBERTa architecture and a BPE vocabulary.

Since RoBERTa is already a few years old, I also want to add recent improvements in architecture and training.

I have tried to build a RoBERTa model with two extra features: removing the bias from the FFN layers, and adding the SwiGLU activation to them.

My approach has been to copy the code of [modeling_roberta.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_roberta.py) and change its `RobertaIntermediate` class into an `EXcellRobertaIntermediate` class that includes the `swiglu` activation and passes `bias=config.dense_layer_bias` when instantiating `nn.Linear`.

This works well for an initial training run. However, I run into problems when loading the model.
The first problem was that the model config has `activation=swiglu` and some context manager does not allow that option. As a dirty workaround, I kept `activation=gelu` in the config while keeping the SwiGLU in the code. This works and the model trains, but if I then want to train it further or fine-tune it, the extra layers created for the SwiGLU are dropped on loading. Here is an example output:

```
from smtag.excell_roberta.modeling_excell_roberta import EXcellRobertaForMaskedLM
model = EXcellRobertaForMaskedLM.from_pretrained('/app/excell-roberta-training/checkpoint-50/')

  loading configuration file /app/excell-roberta-training/checkpoint-50/config.json
  Model config EXcellRobertaConfig {
    "architectures": [
      "EXcellRobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bias_dense_layers": false,
    "bias_norm": false,
    "bos_token_id": 0,
    "classifier_dropout": null,
    "dense_layer_bias": false,
    "eos_token_id": 1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-12,
    "max_position_embeddings": 514,
    "model_type": "roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 3,
    "position_embedding_type": "absolute",
    "sep_token_id": 1,
    "swiglu": true,
    "tokenizer_class": "RobertaTokenizerFast",
    "torch_dtype": "float32",
    "transformers_version": "4.20.0",
    "type_vocab_size": 1,
    "use_cache": true,
    "vocab_size": 64000
  }
  
  loading weights file /app/excell-roberta-training/checkpoint-50/pytorch_model.bin
  Some weights of the model checkpoint at /app/excell-roberta-training/checkpoint-50/ were not used when initializing EXcellRobertaForMaskedLM: ['roberta.encoder.layer.2.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.0.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.3.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.11.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.8.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.7.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.9.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.5.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.6.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.4.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.1.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.10.intermediate.intermediate_dense.weight']
  - This IS expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  - This IS NOT expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  All the weights of EXcellRobertaForMaskedLM were initialized from the model checkpoint at /app/excell-roberta-training/checkpoint-50/.
  If your task is similar to the task the model of the checkpoint was trained on, you can already use EXcellRobertaForMaskedLM for predictions without further training.
  
  model(**excell("acetyltransferase is something that should give extra subtokens to the tokenizer", truncation=True, padding="max_length", return_tensors='pt'))
  
  MaskedLMOutput(loss=None, logits=tensor([[[-0.1479,  0.3992, -0.3396,  ..., -0.3373, -0.8730, -0.7037],
           [ 0.1812,  0.5421, -0.4052,  ..., -0.0612, -0.6076, -1.0300],
           [-0.1578,  0.6487, -0.8400,  ...,  0.0745, -0.6941, -0.7082],
           ...,
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326],
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326],
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326]]],
         grad_fn=<AddBackward0>), hidden_states=None, attentions=None)
  
```
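One situation that produces this kind of "weights were not used" message is a checkpoint containing parameters that the freshly built model does not declare. The toy sketch below reproduces that with plain PyTorch. The `Intermediate` class and its `use_gate` flag are hypothetical stand-ins for the custom layer and `config.swiglu`, and this is not a claim about what actually happens in the run above.

```python
from torch import nn


class Intermediate(nn.Module):
    """Toy stand-in for a custom intermediate layer (hypothetical)."""

    def __init__(self, use_gate: bool):
        super().__init__()
        self.dense = nn.Linear(8, 16, bias=False)
        if use_gate:
            # Only created when the flag is set at construction time.
            self.intermediate_dense = nn.Linear(8, 16, bias=False)


# "Training" side: the extra layer exists and ends up in the checkpoint.
checkpoint = Intermediate(use_gate=True).state_dict()

# "Loading" side: the extra layer is not built, so its weights have nowhere to go.
reloaded = Intermediate(use_gate=False)
result = reloaded.load_state_dict(checkpoint, strict=False)

print(result.unexpected_keys)  # checkpoint keys with no matching parameter
print(result.missing_keys)     # model parameters absent from the checkpoint
```

Comparing the two key lists for the real model and checkpoint may help narrow down whether `intermediate_dense` is being created when the model is instantiated for loading.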

I would like to ask whether there is a recommended way to do this, or whether it is possible at all without major modifications to transformers.

Once the model is published, we plan to submit a request to add it to the library.

I would also be happy to contribute the SwiGLU activation, if that is possible. The main issue I see is that using SwiGLU requires an extra `nn.Linear` layer, so it does not behave like the other activation functions, which are plain callables.

I would also be happy to contribute on this topic.
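To illustrate why the gated variant does not fit the "plain callable" pattern, here is a minimal sketch of a feed-forward block that owns its projections. The names `SwiGLUFeedForward`, `in_proj` and `out_proj` are hypothetical and not part of transformers. The layout also differs from my test code: here the block projects straight back to `hidden_size`.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SwiGLUFeedForward(nn.Module):
    """Hypothetical gated FFN: (SiLU(x W) * x V) W2."""

    def __init__(self, hidden_size: int, intermediate_size: int, bias: bool = False):
        super().__init__()
        # One fused projection produces both the value half and the gate half.
        self.in_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=bias)
        self.out_proj = nn.Linear(intermediate_size, hidden_size, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.in_proj(x).chunk(2, dim=-1)
        return self.out_proj(F.silu(gate) * value)


ffn = SwiGLUFeedForward(hidden_size=768, intermediate_size=3072)
out = ffn(torch.randn(2, 16, 768))
print(out.shape)
```

Unlike `gelu`, this cannot be looked up by name and applied elementwise, because the activation changes the size of the last dimension and needs learned weights around it.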

### Your contribution

I have made two main modifications to the original RoBERTa code:

First, I defined a `SwiGLU` class. I know this is not the right place to define it, but so far this has only been a test.

```python
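import torch.nn.functional as F
from torch import nn

class SwiGLU(nn.Module):
    def forward(self, x):
        # Split the last dimension in half: one half is the value, the other the gate.
        x, gate = x.chunk(2, dim=-1)
        return F.silu(gate) * x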
```
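For reference, the feed-forward variant from the SwiGLU paper linked above can be written as follows, where $W$, $V$ and $W_2$ are weight matrices and $\sigma$ is the logistic sigmoid. The class above computes only the elementwise product, assuming the preceding linear layer outputs the concatenation of $xV$ (first half) and $xW$ (second half).

```math
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = \left(\mathrm{Swish}_1(xW) \otimes xV\right) W_2,
\qquad \mathrm{Swish}_1(z) = z\,\sigma(z)
```

`F.silu` is this $\mathrm{Swish}_1$ function.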

The other modification is:

```python
import torch
from torch import nn
from transformers.activations import ACT2FN


class EXcellRobertaIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.dense_layer_bias)
        self.swiglu = config.swiglu
        if self.swiglu:
            # SwiGLU halves the last dimension, so project back up to intermediate_size.
            self.intermediate_act_fn = SwiGLU()
            self.intermediate_dense = nn.Linear(config.intermediate_size // 2, config.intermediate_size, bias=config.dense_layer_bias)
        elif isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        if self.swiglu:
            # Extra projection that only exists in the SwiGLU configuration.
            hidden_states = self.intermediate_dense(hidden_states)
        return hidden_states
```

I would be happy to contribute the SwiGLU activation and, eventually, to bring the entire model to transformers.
