
Tutorial: Aggregating and Clustering AlphaFold2 Embeddings from DPEB

This tutorial demonstrates how to load .npy embedding files from the DeepDrug Protein Embeddings Bank (DPEB), perform mean pooling, and apply clustering and t-SNE visualization to analyze protein families.


Objective

This script will perform the following actions:

  • Load AlphaFold2 .npy embedding files stored in the dataset
  • Apply mean pooling across residue-level embeddings to aggregate each protein into a single fixed-length vector
  • Merge these embeddings with protein family labels from protein_families23k.csv
  • Save the aggregated results to eppi_alphafold_aggregated_embeddings.csv
  • Generate t-SNE plots and save them as:
    • raw_embeddings_tsne_Alphafold.png
    • raw_embeddings_kmeans_tsne_Alphafold.png
  • Print clustering evaluation metrics (Accuracy, Precision, Recall, F1-Score)
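As a minimal sketch of the mean pooling step listed above: each protein's per-residue embedding is averaged over the residue axis, so proteins of different lengths end up with vectors of the same size. The array shape `(num_residues, dim)` and the helper name `mean_pool` are assumptions for illustration, not taken from tutorial_clustering.py.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average a (num_residues, dim) array over the residue axis."""
    if residue_embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_residues, dim)")
    return residue_embeddings.mean(axis=0)


# Two toy "proteins" with different lengths but the same embedding width.
short_protein = np.arange(12, dtype=np.float32).reshape(3, 4)
long_protein = np.ones((50, 4), dtype=np.float32)

pooled_short = mean_pool(short_protein)
pooled_long = mean_pool(long_protein)

# Both results have shape (4,), regardless of sequence length.
print(pooled_short.shape, pooled_long.shape)
```

Because the pooled vectors share one shape, they can be stacked into a single matrix for clustering and t-SNE.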

Input Files

  • AlphaFold2 embeddings: .npy files inside a .rar archive
  • Protein family annotations: protein_families23k.csv
  • Aggregated embeddings (output, generated by the script): eppi_alphafold_aggregated_embeddings.csv
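The flow from these inputs to the aggregated CSV might look roughly like the sketch below. It makes several assumptions that the README does not confirm: each .npy file is named after its protein ID and holds a plain `(num_residues, dim)` array, and the annotation file has columns called `protein_id` and `family` (both column names are hypothetical). Toy files in a temporary directory stand in for the real data.

```python
import ast
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def aggregate_folder(embedding_folder: str) -> pd.DataFrame:
    """Mean-pool every .npy file in a folder into one row per protein."""
    rows = []
    for path in sorted(Path(embedding_folder).glob("*.npy")):
        arr = np.load(path)  # assumed shape: (num_residues, dim)
        rows.append({"protein_id": path.stem, "embedding": arr.mean(axis=0).tolist()})
    return pd.DataFrame(rows, columns=["protein_id", "embedding"])


with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-ins for the extracted embedding files.
    np.save(Path(tmp) / "P1.npy", np.ones((5, 3)))
    np.save(Path(tmp) / "P2.npy", np.zeros((8, 3)))

    # Toy stand-in for the family annotations (hypothetical column names).
    families = pd.DataFrame(
        {"protein_id": ["P1", "P2", "P3"], "family": ["kinase", "protease", "kinase"]}
    )

    pooled = aggregate_folder(tmp)
    # An inner join keeps only proteins that have both an embedding and a label.
    merged = pooled.merge(families, on="protein_id", how="inner")

    out_csv = Path(tmp) / "aggregated_demo.csv"
    merged.to_csv(out_csv, index=False)

    # List-valued cells come back from CSV as strings; ast.literal_eval restores them.
    reloaded = pd.read_csv(out_csv)
    reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval)

print(reloaded)
```

If the real files store pickled objects such as dictionaries rather than plain arrays, `np.load` would additionally need `allow_pickle=True`, which should only be used on files from a trusted source.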

Files in This Tutorial Folder

  • tutorial_clustering.py: Main script that performs loading, aggregation, clustering, and visualization.
  • protein_families23k.csv: CSV file mapping each protein to its family (used for supervised evaluation).
  • Output:
    • eppi_alphafold_aggregated_embeddings.csv (generated). Pre-computed aggregated embeddings for each embedding type are also provided in CSV format inside the corresponding embedding folder.
    • generated t-SNE plots/raw_embeddings_tsne_Alphafold.png (generated)
    • generated t-SNE plots/raw_embeddings_kmeans_tsne_Alphafold.png (generated)
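Cluster IDs from KMeans are arbitrary integers, so scoring them against family labels requires some mapping from clusters to families. The README does not say how the script does this. The sketch below assumes a simple majority-vote mapping, in which each cluster is assigned the most common true family among its members, and then computes the four reported metrics with scikit-learn. The helper name `map_clusters_to_labels` and the toy data are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.preprocessing import LabelEncoder


def map_clusters_to_labels(cluster_ids, true_labels):
    """Relabel each cluster with the majority true label among its members."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    predicted = np.empty_like(true_labels)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        values, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = values[np.argmax(counts)]
    return predicted


# Toy data: six proteins, two families, two clusters with arbitrary IDs.
families = ["kinase", "kinase", "kinase", "protease", "protease", "protease"]
y_true = LabelEncoder().fit_transform(families)  # kinase -> 0, protease -> 1
clusters = [2, 2, 5, 5, 5, 5]

y_pred = map_clusters_to_labels(clusters, y_true)
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
```

Majority voting lets several clusters map to the same family. A one-to-one assignment (for example via the Hungarian algorithm) is a stricter alternative and can give different scores.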

Required Libraries

  • numpy, pandas
  • matplotlib
  • scikit-learn (provides TSNE, KMeans, and LabelEncoder)
  • torch
  • ast (part of the Python standard library; no installation needed)

Activate your conda environment and install any missing dependencies.

conda activate DPEB
pip install numpy pandas matplotlib scikit-learn torch
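To confirm that the environment is complete before running the tutorial, a quick check like the following may help. It is a sketch, not part of the tutorial script. Note that the scikit-learn package is imported under the name `sklearn`.

```python
import importlib.util

# Import names (not pip package names) of the third-party dependencies.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "torch"]


def missing_modules(names):
    """Return the subset of module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```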

File Setup and Path Configuration

Before running the tutorial script, make sure to complete the following steps:

Download and Unrar Embedding Files

Download the embedding archives from the Box link. For example:

  • All_ePPI_Alphafold2_Embeddings_np_v1.3.rar
  • esm2_dict_embeddings.rar
  • protvec_dict_embeddings.rar
  • bioemb_dict_embeddings.rar

Extract them using the following commands:

# Install unrar if needed
sudo apt install unrar       # For Ubuntu/Debian
# or
sudo yum install unrar       # For RHEL/CentOS

# Extract the .rar file
unrar x All_ePPI_Alphafold2_Embeddings_np_v1.3.rar

Download protein_families23k.csv from the GitHub tutorial folder.

Update File Paths

In your script (e.g., tutorial_clustering.py), update the following paths:

# Example: Set this to your local extracted embedding directory
embedding_folder = "/your/local/path/to/All_ePPI_Alphafold2_Embeddings_np_v1.3/"

# Example: Set this to your local path to the metadata CSV file
protein_file = "/your/local/path/to/protein_families23k.csv"
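A small sanity check on these two paths can catch typos before the script starts loading data. The sketch below assumes the .npy files sit directly inside the embedding folder rather than in nested subfolders. The function name `check_paths` is hypothetical.

```python
from pathlib import Path


def check_paths(embedding_folder: str, protein_file: str) -> list:
    """Return a list of human-readable problems; an empty list means both paths look usable."""
    problems = []
    folder = Path(embedding_folder)
    if not folder.is_dir():
        problems.append(f"Embedding folder not found: {folder}")
    elif not any(folder.glob("*.npy")):
        problems.append(f"No .npy files found directly inside: {folder}")
    if not Path(protein_file).is_file():
        problems.append(f"Metadata CSV not found: {protein_file}")
    return problems


# Placeholder paths from the example above; replace with your local paths.
issues = check_paths(
    "/your/local/path/to/All_ePPI_Alphafold2_Embeddings_np_v1.3/",
    "/your/local/path/to/protein_families23k.csv",
)
for issue in issues:
    print(issue)
```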

Run the Tutorial Script

You can execute the tutorial by running the Python script inside the tutorial/ folder:

python tutorial_clustering.py

The t-SNE visualizations generated by the tutorial script can be found at:

tutorial/generated t-SNE plots
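
For readers who want to see how such a plot can be produced, here is a self-contained sketch that clusters vectors with KMeans and projects them to 2-D with t-SNE, colored by cluster. It runs on synthetic data and writes to a temporary directory. The parameter values, output file name, and the choice to cluster the original vectors rather than the t-SNE coordinates are all assumptions and may differ from tutorial_clustering.py.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for pooled embeddings: 3 well-separated groups of 20 points in 16-D.
centers = [0.0, 10.0, 20.0]
embeddings = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 16)) for c in centers])

# Cluster in the original embedding space, then project to 2-D for plotting only.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
coords = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(embeddings)

out_dir = Path(tempfile.mkdtemp())
plot_path = out_dir / "demo_kmeans_tsne.png"  # hypothetical file name

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(coords[:, 0], coords[:, 1], c=cluster_ids, cmap="tab10", s=15)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("KMeans clusters on t-SNE projection (synthetic data)")
fig.savefig(plot_path, dpi=150)
plt.close(fig)

print("Saved plot to", plot_path)
```

t-SNE requires the perplexity to be smaller than the number of samples, so it may need adjusting for very small subsets.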