This script generates per-residue protein embeddings using the ProtTrans BERT-BFD model from the bio_embeddings library (https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/embed/prottrans_bert_bfd_embedder.py).
Each protein sequence is embedded and saved as an individual .npy file.
The script expects a CSV file named `pro_id_seq_human.csv` with the following columns:
| Column | Description |
|---|---|
| ProteinID | Unique protein identifier (e.g., UniProt ID) |
| Protein_sequence | Amino-acid sequence |
Example:

```
ProteinID,Protein_sequence
P35225,MADSASESDTDGAGGNSSSSAAMQSS...
```
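A quick sanity check on the input CSV can be sketched with pandas; the inline CSV text below is a stand-in for `pro_id_seq_human.csv`, and the column-check logic is an assumption, not part of `extract_bio.py`:

```python
from io import StringIO

import pandas as pd

# Minimal stand-in for pro_id_seq_human.csv (the two required columns)
csv_text = "ProteinID,Protein_sequence\nP35225,MADSASESDTDGAGG\n"
df = pd.read_csv(StringIO(csv_text))

# Verify the required columns are present before embedding
assert {"ProteinID", "Protein_sequence"}.issubset(df.columns)
print(len(df))  # 1
```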
#### Environment Setup
Create and activate a Conda environment:
```bash
conda create -n bioemb python=3.8 -y
conda activate bioemb
```

Install dependencies:

```bash
pip install bio-embeddings pandas numpy tqdm lmdb
```
A CUDA-enabled GPU is strongly recommended for reasonable runtime.
Inside `extract_bio.py`, set the output directory:

```python
output_dir = "/home/magesh/protein_embeddings_npy"
```

Adjust the batch size if GPU memory is limited:

```python
batch_size = 1024  # reduce if CUDA OOM occurs
```

Run the script from the repository directory:

```bash
python extract_bio.py
```
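The `batch_size` setting amounts to chunking the sequence list before embedding. The loop below is an illustrative sketch only; how `extract_bio.py` actually batches internally is an assumption:

```python
# Toy sequence list standing in for the CSV contents
sequences = [f"SEQ{i}" for i in range(10)]
batch_size = 4  # would be 1024 in the script; reduce on CUDA OOM

# Slice the list into consecutive chunks of at most batch_size items
batches = [sequences[i:i + batch_size]
           for i in range(0, len(sequences), batch_size)]
print([len(b) for b in batches])  # [4, 4, 2]
```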
Each protein produces one `.npy` file:

```
P35225_embedding.npy
```
Each file contains a Python dictionary:

```python
{
    "protein_id": str,
    "fasta": str,
    "embedding": np.ndarray  # shape: (sequence_length, 1024)
}
```

To load an embedding:

```python
import numpy as np

data = np.load("P35225_embedding.npy", allow_pickle=True).item()
embedding = data["embedding"]
```

- Embeddings are per-residue (not pooled)
- Embedding dimension: 1024
- Suitable for PPI, GNNs, clustering, and other protein ML tasks
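Because the embeddings are per-residue, tasks that need one fixed-length vector per protein (e.g. clustering) typically pool over the residue axis. A minimal mean-pooling sketch, using a dummy array in place of a real embedding:

```python
import numpy as np

# Dummy per-residue embedding: 120 residues x 1024 dims
per_residue = np.random.rand(120, 1024).astype(np.float32)

# Mean-pool over the residue axis -> one 1024-d vector per protein
per_protein = per_residue.mean(axis=0)
print(per_protein.shape)  # (1024,)
```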