This tutorial demonstrates how to load .npy embedding files from the DeepDrug Protein Embeddings Bank (DPEB), perform mean pooling, and apply clustering and t-SNE visualization to analyze protein families.
This script will perform the following actions:
- Load AlphaFold2 `.npy` embedding files stored in the dataset
- Apply mean pooling across residue-level embeddings to produce fixed-length vectors
- Merge these embeddings with protein family labels from `protein_families23k.csv`
- Save the aggregated results to `eppi_alphafold_aggregated_embeddings.csv`
- Generate t-SNE plots and save them as:
  - `raw_embeddings_tsne_Alphafold.png`
  - `raw_embeddings_kmeans_tsne_Alphafold.png`
- Print clustering evaluation metrics (Accuracy, Precision, Recall, F1-Score)
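
The mean pooling step in the list above can be sketched as follows. This is a minimal illustration, not the code from `tutorial_clustering.py`: it assumes each `.npy` file holds one protein's array of shape `(residues, embedding_dim)` and that the file name (without extension) identifies the protein. The helper names `mean_pool_file` and `pool_folder`, the protein IDs, and the array sizes in the demo are hypothetical.

```python
from pathlib import Path
import tempfile

import numpy as np


def mean_pool_file(npy_path):
    """Load one embedding file and average over the residue axis."""
    emb = np.load(npy_path)
    if emb.ndim == 1:
        # Already a fixed-length vector; nothing to pool.
        return emb
    # Treat the last axis as the embedding dimension and average the rest.
    return emb.reshape(-1, emb.shape[-1]).mean(axis=0)


def pool_folder(folder):
    """Return {file stem: pooled vector} for every .npy file in `folder`."""
    return {p.stem: mean_pool_file(p) for p in sorted(Path(folder).glob("*.npy"))}


# Demo on synthetic data (assumption: real files follow the same layout).
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    np.save(Path(tmp) / "P12345.npy", rng.normal(size=(120, 384)))
    np.save(Path(tmp) / "Q67890.npy", rng.normal(size=(75, 384)))
    pooled = pool_folder(tmp)

# Proteins of different lengths end up with vectors of the same length.
print({name: vec.shape for name, vec in pooled.items()})
```

Because the average is taken over residues, proteins of different lengths map to vectors of identical size, which is what the clustering and t-SNE steps require.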
The tutorial uses the following files:
- AlphaFold2 embeddings: `.npy` files inside a `.rar` archive
- Protein family annotations: `protein_families23k.csv`
- Output aggregated embeddings: `eppi_alphafold_aggregated_embeddings.csv`
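
How the pooled vectors, the family annotations, and the output CSV relate can be sketched with pandas. The column names `protein_id` and `family` are assumptions made for illustration (the actual header of `protein_families23k.csv` may differ), and the tiny in-memory tables stand in for the real files.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for pooled embeddings: {protein ID: fixed-length vector}.
pooled = {
    "P1": np.array([0.1, 0.2, 0.3]),
    "P2": np.array([0.4, 0.5, 0.6]),
    "P3": np.array([0.7, 0.8, 0.9]),
}

# One row per protein, one column per embedding dimension.
emb_df = pd.DataFrame.from_dict(pooled, orient="index")
emb_df.columns = [f"dim_{i}" for i in range(emb_df.shape[1])]
emb_df = emb_df.rename_axis("protein_id").reset_index()

# Stand-in for protein_families23k.csv (hypothetical column names).
families = pd.read_csv(
    io.StringIO("protein_id,family\nP1,kinase\nP2,protease\nP4,kinase\n")
)

# The inner join keeps only proteins that have both an embedding and a label.
merged = emb_df.merge(families, on="protein_id", how="inner")

# With a real run, a file path such as
# "eppi_alphafold_aggregated_embeddings.csv" would be passed to to_csv.
csv_text = merged.to_csv(index=False)
print(merged)
```

An inner join silently drops unlabeled proteins; comparing `len(merged)` with `len(emb_df)` is a quick way to see how many were lost.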
The tutorial folder contains:
- `tutorial_clustering.py`: Main script that performs loading, aggregation, clustering, and visualization.
- `protein_families23k.csv`: CSV file mapping each protein to its family (used for supervised evaluation).
- Output:
  - `eppi_alphafold_aggregated_embeddings.csv` (generated). Aggregated embeddings of each type are also provided in CSV format inside each embedding folder.
  - `generated t-SNE plots/raw_embeddings_tsne_Alphafold.png` (generated)
  - `generated t-SNE plots/raw_embeddings_kmeans_tsne_Alphafold.png` (generated)
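
The family labels are described above as being used for supervised evaluation of the clustering. One common way to do this, shown here as a sketch and not necessarily what `tutorial_clustering.py` does, is to run K-Means, assign each cluster the majority family among its members, and then compute classification metrics. The synthetic data, the family names, and the weighted averaging are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for pooled embeddings: three well-separated families.
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(30, 8)) for c in (0.0, 10.0, -10.0)]
)
families = np.repeat(["kinase", "protease", "transporter"], 30)

# Encode family names as integers 0..n_families-1.
y_true = LabelEncoder().fit_transform(families)

# Cluster with as many clusters as there are families.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster IDs are arbitrary, so map each cluster to its majority family.
y_pred = np.empty_like(y_true)
for c in np.unique(clusters):
    mask = clusters == c
    y_pred[mask] = np.bincount(y_true[mask]).argmax()

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
```

Majority-vote mapping can assign several clusters to the same family, which inflates scores when the number of clusters is large; a one-to-one assignment is a stricter alternative.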
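The script relies on the following packages, classes, and modules:
- `numpy`, `pandas`
- `matplotlib`, `scikit-learn`
- `torch`, `TSNE`, `KMeans`
- `ast`, `LabelEncoder`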
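
To see which of these packages are missing from the active environment, a small standard-library check like the one below can help. It is an illustrative sketch: the helper name `missing_packages` is hypothetical. Note that `scikit-learn` is imported as `sklearn` (it provides `TSNE`, `KMeans`, and `LabelEncoder`), and `ast` ships with Python, so it needs no installation.

```python
import importlib.util

# Import name -> name used with pip.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "sklearn": "scikit-learn",
    "torch": "torch",
}


def missing_packages(required=REQUIRED):
    """Return pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages. Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```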
Activate your conda environment and install any missing dependencies.
conda activate DPEB
pip install scikit-learn matplotlib pandas torch

Before running the tutorial script, make sure to complete the following steps:
Download the embedding archives from the Box link. For example:
- `All_ePPI_Alphafold2_Embeddings_np_v1.3.rar`
- `esm2_dict_embeddings.rar`
- `protvec_dict_embeddings.rar`
- `bioemb_dict_embeddings.rar`
Extract them using the following commands:
# Install unrar if needed
sudo apt install unrar # For Ubuntu/Debian
# or
sudo yum install unrar # For RHEL/CentOS
# Extract the .rar file
unrar x All_ePPI_Alphafold2_Embeddings_np_v1.3.rar

Download `protein_families23k.csv` from the GitHub tutorial folder.
In your script (e.g., tutorial_clustering.py), update the following paths:
# Example: Set this to your local extracted embedding directory
embedding_folder = "/your/local/path/to/All_ePPI_Alphafold2_Embeddings_np_v1.3/"
# Example: Set this to your local path to the metadata CSV file
protein_file = "/your/local/path/to/protein_families23k.csv"

You can execute the tutorial by running the Python script inside the tutorial/ folder:
python tutorial_clustering.py

The t-SNE visualizations generated by the tutorial script can be found at: