SurfPro: Functional Protein Design Based on Continuous Surface

Model Architecture

This repository contains the code and data for the ICML 2024 paper "SurfPro: Functional Protein Design Based on Continuous Surface".

The overall model architecture is shown below:

[Figure: overall SurfPro model architecture]

Environment

The dependencies can be set up using the following commands:
conda create -n surfpro python=3.8 -y 
conda activate surfpro 
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y 
bash setup.sh 

To install SRU++, run

git clone https://github.com/asappresearch/sru
cd sru 
git checkout 3.0.0-dev 
pip install . 

Download Data

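We provide all surface data for CATH 4.2, the binder design task, and the enzyme design task at SurfPro data. The links below are Google Drive share links, so plain wget may save a web page rather than the data; if that happens, download the files in a browser and place them in the directories created below.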

mkdir data 
cd data 
mkdir cath42 && cd cath42
wget https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link
cd .. && mkdir binder_design && cd binder_design
wget https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link
cd .. && mkdir enzyme_design && cd enzyme_design
wget https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link
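
As one possible alternative to wget for these Google Drive share links, the sketch below uses the third-party gdown package (`pip install gdown`), which is not part of this repository. The `DRIVE_LINKS` table and the `fetch_surfpro_data` and `is_folder_link` helpers are hypothetical names introduced for illustration, the target directories simply mirror the layout above, and gdown's behaviour may differ between versions.

```python
import os
from pathlib import Path

# Share links copied from the commands above; keys are the target subdirectories.
DRIVE_LINKS = {
    "cath42": "https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link",
    "binder_design": "https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link",
    "enzyme_design": "https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link",
}


def is_folder_link(url: str) -> bool:
    """Drive folder links contain '/drive/folders/'; single files use '/file/d/'."""
    return "/drive/folders/" in url


def fetch_surfpro_data(root: str = "data") -> None:
    """Download every entry of DRIVE_LINKS into root/<name>/ (hypothetical helper)."""
    import gdown  # third-party; imported lazily so this module loads without it

    for name, url in DRIVE_LINKS.items():
        target = Path(root) / name
        target.mkdir(parents=True, exist_ok=True)
        if is_folder_link(url):
            # download_folder fetches the files of a shared folder.
            gdown.download_folder(url=url, output=str(target), quiet=False)
        else:
            # fuzzy=True lets gdown extract the file ID from a '/view' share link;
            # the trailing separator asks gdown to keep the original file name.
            gdown.download(url=url, output=str(target) + os.sep, quiet=False, fuzzy=True)
```

Calling `fetch_surfpro_data("data")` from the repository root would then populate `data/cath42`, `data/binder_design`, and `data/enzyme_design`; the downloaded archives still need to be decompressed as described in the training sections.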

Download Model

We provide the model weights trained on CATH 4.2 at CATH4.2 model weights.

We also provide model weights pretrained on all PDB surfaces (SurfPro-Pretrain) at pretrained model weights.

Prepare Surface for your own model

We provide surface data for all three tasks at SurfPro data.

If you want to generate surfaces from your own PDB files, use preprocess/prepare_surface.py.

First run the MSMS tool on your structures to generate the corresponding vert files. Then pass the FASTA file and the vert files to the script to prepare the surfaces. To run the code:

python preprocess/prepare_surface.py --data_path fasta_file_path --split train --output_path output_data_path

By default, the vert files are expected in the fasta_file_path/msms directory.

Inverse Folding Task Training

First download the corresponding data (the CATH 4.2 surfaces) and decompress it:
mkdir cath42 && cd cath42
wget https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link
tar -xvzf octree_aa_surf_5k_sorted.tar.gz

Then train the model:

bash train_surface_inverse_folding.sh

Binder Design Training

First download the corresponding data:
mkdir binder_design && cd binder_design
wget https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link

Then decompress the target-binder data, which is needed for pAE_interaction evaluation:

cd binder_design/Binder_Design_Data
tar -xvzf binder.pkl.tar.gz

Then train the model. Assuming the checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt, run the training script:

bash binder_design_finetune.sh

Enzyme Design Training

First download the corresponding data:
mkdir enzyme_design && cd enzyme_design
wget https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link

Then train the model. Assuming the checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt, run the training script:

bash binder_design_finetune.sh

Inference

To generate protein sequences for CATH 4.2, design binders, or design enzymes, run the corresponding script:
bash generation_cath42.sh
bash generate_binder.sh
bash generate_enzyme.sh

The output directory contains two files:

  1. protein.txt contains the designed protein sequences
  2. src.seq.txt contains the ground-truth sequences
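
To inspect these two files programmatically, a minimal sketch is given below. It assumes that both files hold one sequence per line in the same order, which this README does not confirm; `read_sequences` and `load_pairs` are hypothetical helpers introduced here, and any spaces between residues are stripped.

```python
from pathlib import Path


def read_sequences(path):
    """Return the non-empty lines of a text file with all whitespace removed."""
    lines = Path(path).read_text().splitlines()
    return ["".join(line.split()) for line in lines if line.strip()]


def load_pairs(output_dir):
    """Pair ground-truth and designed sequences by line order (assumed layout)."""
    designed = read_sequences(Path(output_dir) / "protein.txt")
    native = read_sequences(Path(output_dir) / "src.seq.txt")
    if len(designed) != len(native):
        raise ValueError(
            f"{len(designed)} designed vs {len(native)} ground-truth sequences"
        )
    # Each pair is (ground truth, design).
    return list(zip(native, designed))
```

The length check is there because a mismatch in line counts would silently pair the wrong sequences in any downstream metric.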

Evaluation

Inverse Folding Task Evaluation

We provide a script that calculates the recovery rate after pairwise alignment at evaluation/amino_acid_recovary_rate.py.

You need to provide the source sequence and designed sequence files.
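
For readers who want to see what a recovery rate after pairwise alignment can look like, here is a self-contained sketch. It is not the repository's script: it uses a plain Needleman-Wunsch global alignment with unit match, mismatch, and gap scores, and it divides the number of identical aligned residues by the ground-truth length. The scoring scheme, the denominator, and the names `global_align` and `recovery_rate` are all assumptions made for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def recovery_rate(native, designed):
    """Fraction of ground-truth residues reproduced at aligned positions."""
    if not native:
        raise ValueError("ground-truth sequence is empty")
    aln_native, aln_designed = global_align(native, designed)
    matches = sum(
        x == y and x != "-" for x, y in zip(aln_native, aln_designed)
    )
    # Assumption: normalise by the ground-truth length, not the alignment length.
    return matches / len(native)
```

With these settings, a design that drops one residue from a six-residue ground truth, such as `recovery_rate("ACDEFG", "ACDFG")`, scores 5/6, because the alignment opens a single gap instead of shifting every later residue out of register.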

Binder Design Task Evaluation

To generate the superimposed structure files of the designed binders and target proteins, use evaluation/super_impose.py.

Then apply the scripts provided by dl_binder_design to calculate the pAE_interaction scores.

Enzyme Design Task Evaluation

We provide the ESP evaluation data at [ESP_data_eval](https://drive.google.com/file/d/1LlYvJvV69dxtqblAUAkJ-CbFs9h_Q34j/view?usp=drive_link)

The format for ESP evaluation is (Protein_Sequence Substrate_Representation) for each test case.

The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link

Citation

If you find our work helpful, please consider citing our paper.
@inproceedings{songsurfpro,
  title={SurfPro: Functional Protein Design Based on Continuous Surface},
  author={Song, Zhenqiao and Huang, Tinglin and Li, Lei and Jin, Wengong},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}
