This repository contains the code and data for the ICML 2024 paper SurfPro: Functional Protein Design Based on Continuous Surface.
The overall model architecture is shown below:
The dependencies can be set up using the following commands:
conda create -n surfpro python=3.8 -y
conda activate surfpro
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y
bash setup.sh

To install SRU++, run:
git clone https://github.com/asappresearch/sru
cd sru
git checkout 3.0.0-dev
pip install .

We provide all surface data for the CATH 4.2, binder design, and enzyme design tasks at SurfPro data:
mkdir data
cd data
mkdir cath42 && cd cath42
wget https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link
cd .. && mkdir binder_design && cd binder_design
wget https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link
cd .. && mkdir enzyme_design && cd enzyme_design
wget https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link

We provide the model weights trained on CATH 4.2 at CATH4.2 model weights.
We also provide our model weights pretrained on all PDB surfaces (SurfPro-Pretrain) at pretrained model weights.
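The Google Drive URLs passed to wget in this README are sharing links, so wget will typically save the HTML viewer page rather than the archive itself. The sketch below is one possible workaround, not part of the repository: it extracts the file ID from a Drive file link and builds a direct-download URL. The helper names `drive_file_id` and `direct_download_url` are hypothetical, and the `uc?export=download` URL pattern is an assumption based on common usage; large files may still require an extra confirmation step, and folder links are not handled.

```python
import re
from typing import Optional

# Matches the file ID in links of the form
# https://drive.google.com/file/d/<FILE_ID>/view?...
_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_file_id(url: str) -> Optional[str]:
    """Return the file ID of a Drive file link, or None for other links (e.g. folders)."""
    match = _FILE_ID_PATTERN.search(url)
    return match.group(1) if match else None


def direct_download_url(url: str) -> Optional[str]:
    """Build a direct-download URL (assumed pattern) from a Drive file sharing link."""
    file_id = drive_file_id(url)
    if file_id is None:
        return None
    return f"https://drive.google.com/uc?export=download&id={file_id}"


if __name__ == "__main__":
    cath_link = "https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link"
    print(direct_download_url(cath_link))
```

The folder links (binder design and enzyme design) return `None` here, so those are easiest to download through the browser.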
We provide surface data for all three tasks at SurfPro data.
To generate surfaces from your own PDB files, use preprocess/prepare_surface.py.
First apply the MSMS tool to generate the corresponding vert files, then pass the FASTA file and the vert files to the script to prepare the surfaces. To run the code:
python preprocess/prepare_surface.py --data_path fasta_file_path --split train --output_path output_data_path

The vert files are placed in the fasta_file_path/msms directory by default.
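Before running the preprocessing script, it can help to sanity-check a vert file. The sketch below reads the vertex coordinates from the text of an MSMS .vert file. It assumes the usual MSMS layout (comment lines starting with `#`, one short count line, then one vertex per line beginning with x, y, z and the surface normal); the function name `read_vert_coordinates` is hypothetical and this is not the parsing code used by preprocess/prepare_surface.py.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def read_vert_coordinates(text: str) -> List[Vertex]:
    """Extract (x, y, z) for every vertex line in the contents of a .vert file."""
    vertices: List[Vertex] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment headers.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Assumption: the count line has fewer than 6 fields, while vertex
        # lines carry at least x, y, z and the three normal components.
        if len(fields) < 6:
            continue
        x, y, z = (float(value) for value in fields[:3])
        vertices.append((x, y, z))
    return vertices
```

For a file on disk, pass the result of `open(path).read()`; an empty result usually means MSMS did not produce a surface for that structure.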
For the inverse folding task, first download the corresponding data and decompress it:
mkdir binder_design && cd binder_design
wget https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link
tar -xvzf octree_aa_surf_5k_sorted.tar.gz

Then train the model:
bash train_surface_inverse_folding.sh

For the binder design task, first download the data:
mkdir binder_design && cd binder_design
wget https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link

Then decompress the target-binder data, which is required for pAE_interaction evaluation:
cd binder_design/Binder_Design_Data
tar -xvzf binder.pkl.tar.gz

Then train the model. Suppose the model checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt. The training script is:
bash binder_design_finetune.sh

For the enzyme design task, first download the data:
mkdir enzyme_design && cd enzyme_design
wget https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link

Then train the model. Suppose the model checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt. The training script is:
bash binder_design_finetune.sh

To generate sequences with a trained model, run the script for the corresponding task:
bash generation_cath42.sh
bash generate_binder.sh
bash generate_enzyme.sh

There are two items in the output directory:
- protein.txt contains the designed protein sequences
- src.seq.txt contains the ground truth sequences
You need to provide the source sequence and designed sequence files.
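One metric commonly reported for inverse folding is sequence recovery, the fraction of positions where the designed residue matches the ground truth. The sketch below illustrates that computation and is not the repository's evaluation script. It assumes one sequence per line in each file, the same line order in protein.txt and src.seq.txt, and that any spaces between residues can be dropped; the function names are hypothetical.

```python
from typing import List, Sequence


def read_sequences(path: str) -> List[str]:
    """Read one sequence per non-empty line, dropping spaces (assumed file layout)."""
    with open(path) as handle:
        return [line.strip().replace(" ", "") for line in handle if line.strip()]


def sequence_recovery(designed: str, reference: str) -> float:
    """Fraction of positions where the designed residue equals the reference residue."""
    if not reference or len(designed) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == r for d, r in zip(designed, reference))
    return matches / len(reference)


def mean_recovery(designed: Sequence[str], references: Sequence[str]) -> float:
    """Average per-sequence recovery over paired designed/reference sequences."""
    if not references or len(designed) != len(references):
        raise ValueError("need the same non-zero number of designed and reference sequences")
    scores = [sequence_recovery(d, r) for d, r in zip(designed, references)]
    return sum(scores) / len(scores)
```

A typical call would be `mean_recovery(read_sequences("protein.txt"), read_sequences("src.seq.txt"))`, with the paths adjusted to your output directory.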
To generate the superimposed structures of the designed binders and target proteins, use evaluation/super_impose.py.
Then we apply the scripts provided at dl_binder_design to calculate pAE_interaction scores.
We provide the ESP evaluation data at [ESP_data_eval](https://drive.google.com/file/d/1LlYvJvV69dxtqblAUAkJ-CbFs9h_Q34j/view?usp=drive_link).
The format for ESP evaluation is (Protein_Sequence Substrate_Representation) for each test case.
The evaluation code for ESP score is developed by Alexander Kroll, which can be found at link
If you find our work helpful, please consider citing our paper.

@inproceedings{songsurfpro,
title={SurfPro: Functional Protein Design Based on Continuous Surface},
author={Song, Zhenqiao and Huang, Tinglin and Li, Lei and Jin, Wengong},
booktitle={Forty-first International Conference on Machine Learning},
year={2024}
}
