GitHub - buetnlpbio/BiRNA-BERT

This current version is substantially updated than the previous ones and we suggest re-analyses of any data which was analyzed using prior versions of BIRNA-BERT

Authors' implementation of BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization

Paper Link

BiRNA-BERT

BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. BiRNA-BERT has been trained on BPE tokens and individual nucleotides. As a result, it can generate both granular nucleotide-level embeddings and efficient sequence-level embeddings (using BPE).

BiRNA-BERT was trained using the MosaicBERT framework - https://huggingface.co/mosaicml/mosaic-bert-base

Extracting RNA embeddings

BATCH_SIZE = 16
LR = 2e-5
EPOCHS = 50
WARMUP_RATIO = 0.1
GRAD_ACCUM_STEPS = 1
MODEL_NAME = "Lancelot53/birnabert-2ep"
TOKENIZER = "buetnlpbio/birna-tokenizer"

# https://huggingface.co/Lancelot53/birnabert-2ep
import torch
from torch.utils.data import Dataset
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.optim as optim
import transformers
import torch.nn.functional as F
from transformers import get_constant_schedule_with_warmup, AutoModelForMaskedLM, AutoTokenizer, AutoModel
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
config = transformers.BertConfig.from_pretrained(MODEL_NAME)
birnabert = AutoModelForMaskedLM.from_pretrained(MODEL_NAME,config=config,trust_remote_code=True)
birnabert.cls = torch.nn.Identity()

birnabert.to(device)

# To get sequence embeddings
seq_embed = birnabert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(seq_embed.logits.shape) # CLS + 4 BPE token embeddings + SEP

# To get nucleotide embeddings
char_embed = birnabert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt")) 
print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP

Explicitly increasing max sequence length

config = transformers.BertConfig.from_pretrained(MODEL_NAME)
config.alibi_starting_size = 2048 # maximum sequence length updated to 2048 from config default of 1024

mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert",config=config,trust_remote_code=True)

Download Model and Tokenizer (External Link)

Download model
Download tokenizer

Citation

@article {Tahmid2024.07.02.601703,
	author = {Tahmid, Md Toki and Shahgir, Haz Sameen and Mahbub, Sazan and Dong, Yue and Bayzid, Md. Shamsuzzoha},
	title = {BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization},
	elocation-id = {2024.07.02.601703},
	year = {2024},
	doi = {10.1101/2024.07.02.601703},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/07/04/2024.07.02.601703},
	eprint = {https://www.biorxiv.org/content/early/2024/07/04/2024.07.02.601703.full.pdf},
	journal = {bioRxiv}
}

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
TOKENIZER		TOKENIZER
example_code_and_data		example_code_and_data
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

This current version is substantially updated than the previous ones and we suggest re-analyses of any data which was analyzed using prior versions of BIRNA-BERT

BiRNA-BERT

Extracting RNA embeddings

Explicitly increasing max sequence length

Download Model and Tokenizer (External Link)

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

This current version is substantially updated than the previous ones and we suggest re-analyses of any data which was analyzed using prior versions of BIRNA-BERT

BiRNA-BERT

Extracting RNA embeddings

Explicitly increasing max sequence length

Download Model and Tokenizer (External Link)

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages