This script generates per-residue protein embeddings using the ProtTrans BERT-BFD model from the bio_embeddings library (https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/embed/prottrans_bert_bfd_embedder.py).
Each protein sequence is embedded and saved as an individual .npy file.
The script expects a CSV file named `pro_id_seq_human.csv` with the following columns:
| Column | Description |
|---|---|
| ProteinID | Unique protein identifier (e.g., UniProt ID) |
| Protein_sequence | Amino-acid sequence |
Example:

```
ProteinID,Protein_sequence
P35225,MADSASESDTDGAGGNSSSSAAMQSS...
```
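A quick sanity check on the input CSV can be sketched with pandas; the inline CSV text below is a stand-in for `pro_id_seq_human.csv`, and the column-check logic is an assumption, not part of `extract_bio.py`:

```python
from io import StringIO

import pandas as pd

# Minimal stand-in for pro_id_seq_human.csv (the two required columns)
csv_text = "ProteinID,Protein_sequence\nP35225,MADSASESDTDGAGG\n"
df = pd.read_csv(StringIO(csv_text))

# Verify the required columns are present before embedding
assert {"ProteinID", "Protein_sequence"}.issubset(df.columns)
print(len(df))  # 1
```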
#### Environment Setup
Create and activate a Conda environment:
```bash
conda create -n bioemb python=3.8 -y
conda activate bioemb
```

Install dependencies:

```bash
pip install bio-embeddings pandas numpy tqdm lmdb
```
A CUDA-enabled GPU is strongly recommended for reasonable runtime.
Inside `extract_bio.py`, set the output directory:

```python
output_dir = "/home/magesh/protein_embeddings_npy"
```

Adjust the batch size if GPU memory is limited:

```python
batch_size = 1024  # reduce if CUDA OOM occurs
```

Run the script from the repository directory:

```bash
python extract_bio.py
```
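The `batch_size` setting amounts to chunking the sequence list before embedding. The loop below is an illustrative sketch only; how `extract_bio.py` actually batches internally is an assumption:

```python
# Toy sequence list standing in for the CSV contents
sequences = [f"SEQ{i}" for i in range(10)]
batch_size = 4  # would be 1024 in the script; reduce on CUDA OOM

# Slice the list into consecutive chunks of at most batch_size items
batches = [sequences[i:i + batch_size]
           for i in range(0, len(sequences), batch_size)]
print([len(b) for b in batches])  # [4, 4, 2]
```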
Each protein produces one `.npy` file:

```
P35225_embedding.npy
```
Each file contains a Python dictionary:

```python
{
    "protein_id": str,
    "fasta": str,
    "embedding": np.ndarray  # shape: (sequence_length, 1024)
}
```

To load an embedding:

```python
import numpy as np

data = np.load("P35225_embedding.npy", allow_pickle=True).item()
embedding = data["embedding"]
```

- Embeddings are per-residue (not pooled)
- Embedding dimension: 1024
- Suitable for PPI, GNNs, clustering, and other protein ML tasks
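Because the embeddings are per-residue, tasks that need one fixed-length vector per protein (e.g. clustering) typically pool over the residue axis. A minimal mean-pooling sketch, using a dummy array in place of a real embedding:

```python
import numpy as np

# Dummy per-residue embedding: 120 residues x 1024 dims
per_residue = np.random.rand(120, 1024).astype(np.float32)

# Mean-pool over the residue axis -> one 1024-d vector per protein
per_protein = per_residue.mean(axis=0)
print(per_protein.shape)  # (1024,)
```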