Description
Feature request
Since it has been recently used in PaLM and several papers report its better performance, it would be good to have access to a SwiGLU implementation as an activation function.
Motivation
I am building a biomedical RoBERTa-based model with a domain-specific biomedical vocabulary. It can be seen as a version of PubMedBERT with the RoBERTa architecture and a BPE vocabulary.
Since RoBERTa is already a few years old, I also want to add recent improvements to the architecture and training.
I have tried to build a RoBERTa model with two extra features: removing the bias from the FFN layers and adding the SwiGLU activation to them.
My approach has been to copy the code of modeling_roberta.py and change its RobertaIntermediate class into an EXcellRobertaIntermediate class that includes the SwiGLU activation and passes bias=config.dense_layer_bias to the nn.Linear instantiation.
This works well for a first training run of the model. However, I run into problems when loading the model.
The first problem was that the model config had activation=swiglu and some ContextManager does not allow that option. As a dirty workaround, I kept activation=gelu in the config while keeping SwiGLU in the code. This works and the model trains, but if I then want to train it further or fine-tune it, loading drops the extra layers created for SwiGLU. Here is an example output:
from smtag.excell_roberta.modeling_excell_roberta import EXcellRobertaForMaskedLM
model = EXcellRobertaForMaskedLM.from_pretrained('/app/excell-roberta-training/checkpoint-50/')
loading configuration file /app/excell-roberta-training/checkpoint-50/config.json
Model config EXcellRobertaConfig {
"architectures": [
"EXcellRobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bias_dense_layers": false,
"bias_norm": false,
"bos_token_id": 0,
"classifier_dropout": null,
"dense_layer_bias": false,
"eos_token_id": 1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 3,
"position_embedding_type": "absolute",
"sep_token_id": 1,
"swiglu": true,
"tokenizer_class": "RobertaTokenizerFast",
"torch_dtype": "float32",
"transformers_version": "4.20.0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 64000
}
loading weights file /app/excell-roberta-training/checkpoint-50/pytorch_model.bin
Some weights of the model checkpoint at /app/excell-roberta-training/checkpoint-50/ were not used when initializing EXcellRobertaForMaskedLM: ['roberta.encoder.layer.2.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.0.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.3.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.11.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.8.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.7.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.9.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.5.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.6.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.4.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.1.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.10.intermediate.intermediate_dense.weight']
- This IS expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of EXcellRobertaForMaskedLM were initialized from the model checkpoint at /app/excell-roberta-training/checkpoint-50/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EXcellRobertaForMaskedLM for predictions without further training.
model(**excell("acetyltransferase is something that should give extra subtokens to the tokenizer", truncation=True, padding="max_length", return_tensors='pt'))
MaskedLMOutput(loss=None, logits=tensor([[[-0.1479, 0.3992, -0.3396, ..., -0.3373, -0.8730, -0.7037],
[ 0.1812, 0.5421, -0.4052, ..., -0.0612, -0.6076, -1.0300],
[-0.1578, 0.6487, -0.8400, ..., 0.0745, -0.6941, -0.7082],
...,
[-0.2610, 0.6921, -0.6040, ..., -0.0400, -0.6101, -0.9326],
[-0.2610, 0.6921, -0.6040, ..., -0.0400, -0.6101, -0.9326],
[-0.2610, 0.6921, -0.6040, ..., -0.0400, -0.6101, -0.9326]]],
grad_fn=<AddBackward0>), hidden_states=None, attentions=None)
I would like to check with you whether there is a recommended way to do this, or whether it is possible at all without big modifications to transformers.
Once the model is published, we plan to submit a request to add it to the library.
I would also be happy to contribute the SwiGLU activation, if that is possible. The main issue I see is that SwiGLU, as used here, needs an extra nn.Linear layer instantiated alongside it. It therefore behaves differently from the other activation functions, which are plain callables.
I will also be happy to contribute on this topic.
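To make the difference concrete, here is a minimal, dependency-free sketch of why a gated activation does not behave like a drop-in elementwise callable. The helper names (silu, elementwise_activation, swiglu) are hypothetical and only mirror the chunk-and-gate logic of the SwiGLU class in this issue; this is not how transformers implements its activations.

```python
import math


def silu(v):
    # SiLU (a.k.a. swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def elementwise_activation(values):
    # A typical activation maps each feature independently,
    # so the output has the same width as the input.
    return [silu(v) for v in values]


def swiglu(values):
    # Split the features into two halves (like x.chunk(2, dim=-1)):
    # the first half is the value, the second half is the gate.
    half = len(values) // 2
    x, gate = values[:half], values[half:]
    return [silu(g) * xi for xi, g in zip(x, gate)]


features = [1.0, -2.0, 0.5, 3.0]
plain = elementwise_activation(features)  # width stays 4
gated = swiglu(features)                  # width drops to 2
```

Because the gated variant halves the feature width, the layer that follows has to expect intermediate_size // 2 inputs, which is why an additional nn.Linear shows up in my implementation.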
Your contribution
I have added two main modifications to the original code of RoBERTa:
First, I generated the class SwiGLU. I know that this is not the place to define this class, but this has been a test so far.
class SwiGLU(nn.Module):
    def forward(self, x):
        x, gate = x.chunk(2, dim=-1)
        return F.silu(gate) * x

The other modification is:
class EXcellRobertaIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.dense_layer_bias)
        self.swiglu = config.swiglu
        if self.swiglu:
            self.intermediate_act_fn = SwiGLU()
            # SwiGLU halves the last dimension, so project back up to intermediate_size.
            self.intermediate_dense = nn.Linear(config.intermediate_size // 2, config.intermediate_size, bias=config.dense_layer_bias)
        elif isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.swiglu:
            hidden_states = self.dense(hidden_states)
            hidden_states = self.intermediate_act_fn(hidden_states)
            hidden_states = self.intermediate_dense(hidden_states)
        else:
            hidden_states = self.dense(hidden_states)
            hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states

I would be happy to contribute the SwiGLU activation and eventually to bring the entire model to transformers.
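One way to check in isolation whether the extra projection survives a save/load cycle is a plain state_dict round trip. The sketch below is only a stand-in under stated assumptions: TinyIntermediate and its small sizes are hypothetical, it always takes the SwiGLU path, and it does not go through from_pretrained, so it cannot reproduce the warning above. It only shows which parameter names the module registers and that a strict reload restores them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    def forward(self, x):
        x, gate = x.chunk(2, dim=-1)
        return F.silu(gate) * x


class TinyIntermediate(nn.Module):
    """Hypothetical small stand-in for EXcellRobertaIntermediate (SwiGLU path only)."""

    def __init__(self, hidden_size=8, intermediate_size=32, bias=False):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size, bias=bias)
        self.intermediate_act_fn = SwiGLU()
        # SwiGLU halves the last dimension, so project back up.
        self.intermediate_dense = nn.Linear(intermediate_size // 2, intermediate_size, bias=bias)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return self.intermediate_dense(hidden_states)


torch.manual_seed(0)
layer = TinyIntermediate()
x = torch.randn(2, 5, 8)  # (batch, sequence, hidden_size)
out = layer(x)

# Round trip: copy the weights into a freshly initialized module.
state = layer.state_dict()
restored = TinyIntermediate()
result = restored.load_state_dict(state, strict=True)  # raises on any key mismatch
same_output = torch.allclose(out, restored(x))
```

If the same check passes on the full model but from_pretrained still reports intermediate_dense.weight as unused, that would suggest the module is being built without the SwiGLU branch at load time, for instance if the config value is not being read as expected. That is a guess on my side, not something I have confirmed.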