
# Protein Bio-Embeddings Generator

This script generates per-residue protein embeddings using the ProtTrans BERT-BFD model from the [bio_embeddings](https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/embed/prottrans_bert_bfd_embedder.py) library.

Each protein sequence is embedded and saved as an individual `.npy` file.


## Input File

The script expects a CSV file named `pro_id_seq_human.csv`.

### Required Columns

| Column | Description |
| --- | --- |
| `ProteinID` | Unique protein identifier (e.g., UniProt ID) |
| `Protein_sequence` | Amino-acid sequence |

Example:

```csv
ProteinID,Protein_sequence
P35225,MADSASESDTDGAGGNSSSSAAMQSS...
```
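A short pandas sketch of loading and validating this input format (the column check is illustrative and not necessarily part of `extract_bio.py`; the sequence below is truncated for the example):

```python
import io

import pandas as pd

# Example rows in the CSV format described above (sequence truncated)
csv_text = """ProteinID,Protein_sequence
P35225,MADSASESDTDGAGGNSS
"""

# In the real script this would be pd.read_csv("pro_id_seq_human.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Verify both required columns are present before embedding
required = {"ProteinID", "Protein_sequence"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {missing}")
```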

## Environment Setup

Create and activate a Conda environment:

```bash
conda create -n bioemb python=3.8 -y
conda activate bioemb
```

Install dependencies:

```bash
pip install bio-embeddings pandas numpy tqdm lmdb
```

A CUDA-enabled GPU is strongly recommended for reasonable runtime.

## Configuration

Inside `extract_bio.py`, set the output directory:

```python
output_dir = "/home/magesh/protein_embeddings_npy"
```

Adjust the batch size if GPU memory is limited:

```python
batch_size = 1024  # reduce if CUDA OOM occurs
```

## Run the Script

From the repository directory:

```bash
python extract_bio.py
```

## Output

Each protein produces one `.npy` file:

```
P35225_embedding.npy
```

Each file contains a Python dictionary:

```python
{
  "protein_id": str,
  "fasta": str,
  "embedding": np.ndarray  # shape: (sequence_length, 1024)
}
```
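A minimal sketch of how one record could be assembled and saved in this layout. The zero array stands in for the real ProtTrans BERT-BFD output, and the example ID/sequence are placeholders; file naming follows the pattern above:

```python
import numpy as np

protein_id = "P35225"          # placeholder ID
sequence = "MADSASESDTDGAGGNSS"  # placeholder (truncated) sequence

# Stand-in for the real per-residue embedding: shape (len(sequence), 1024)
embedding = np.zeros((len(sequence), 1024), dtype=np.float32)

record = {"protein_id": protein_id, "fasta": sequence, "embedding": embedding}

# np.save pickles the dict inside a 0-d object array,
# which is why loading requires allow_pickle=True
np.save(f"{protein_id}_embedding.npy", record)
```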

## Load an Embedding

```python
import numpy as np

data = np.load("P35225_embedding.npy", allow_pickle=True).item()
embedding = data["embedding"]
```
## Notes

- Embeddings are per-residue (not pooled)
- Embedding dimension: 1024
- Suitable for PPI, GNNs, clustering, and other protein ML tasks
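Because the embeddings are per-residue, tasks that need a single vector per protein must pool over the sequence axis first; mean pooling is one common choice. A sketch using a stand-in array (ones in place of real embedding values):

```python
import numpy as np

# Per-residue embedding: (sequence_length, 1024); ones as a stand-in
per_residue = np.ones((18, 1024), dtype=np.float32)

# Mean-pool over the residue axis to get one (1024,) protein-level vector
per_protein = per_residue.mean(axis=0)
```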