bioRxiv
New Results

Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep) and Its Implications for Protein Engineering

Junwen Luo, Yi Cai, Jialin Wu, Hongmin Cai, Xiaofeng Yang, Zhanglin Lin
doi: https://doi.org/10.1101/2020.12.22.423916
Junwen Luo
1School of Biology and Biological Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
Yi Cai
1School of Biology and Biological Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
Jialin Wu
1School of Biology and Biological Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
Hongmin Cai
2School of Computer Science and Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
Xiaofeng Yang
1School of Biology and Biological Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
  • For correspondence: zhanglinlin{at}scut.edu.cn biyangxf{at}scut.edu.cn
Zhanglin Lin
1School of Biology and Biological Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China
  • For correspondence: zhanglinlin{at}scut.edu.cn biyangxf{at}scut.edu.cn

Abstract

In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far, these applications have been based mostly on primary sequence information, while the vast amount of tertiary structure information remains untapped. In this study, we devised a self-supervised representation learning framework (PtsRep) to extract the fundamental features of 35,568 unlabeled protein tertiary structures deposited in the PDB. The learned embeddings were evaluated on two commonly recognized protein engineering tasks: the prediction of protein stability and the prediction of the fluorescence brightness of green fluorescent protein (GFP) variants, with training datasets of 16,431 and 26,198 proteins or variants, respectively. On both tasks, PtsRep outperformed the two benchmark methods UniRep and TAPE-BERT, which were pre-trained on much larger datasets of 24 and 32 million protein sequences, respectively. Protein clustering analyses demonstrated that PtsRep can capture the structural signatures of proteins. Further testing on the GFP dataset revealed two important implications for protein engineering: (1) a reduced and experimentally manageable training dataset (20%, or 5,239 variants) yielded satisfactory prediction performance for PtsRep, achieving a recall of 70% for the 26 brightest variants among the 795 variants retrieved from the testing dataset; (2) counter-intuitively, when only the bright variants were used for training, the performance of PtsRep and the benchmarks did not worsen but instead improved slightly. This study provides a new avenue for learning and exploring general protein structural representations for protein engineering.
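
The recall figure quoted in the abstract (the share of the brightest variants recovered among the variants a model retrieves) can be illustrated with a short sketch. This is only one plausible reading of that metric, not the authors' evaluation code: the function name `recall_of_top_variants`, its parameters, and the toy scores are all hypothetical and chosen for illustration.

```python
def recall_of_top_variants(true_scores, predicted_scores, n_top, n_retrieved):
    """Fraction of the n_top truly best items that appear among the
    n_retrieved items ranked highest by the model's predictions.

    Hypothetical helper for illustration; not taken from the PtsRep study.
    """
    indices = range(len(true_scores))
    # Items that are actually the best, according to measured values.
    truly_best = set(
        sorted(indices, key=lambda i: true_scores[i], reverse=True)[:n_top]
    )
    # Items the model would select, according to predicted values.
    retrieved = set(
        sorted(indices, key=lambda i: predicted_scores[i], reverse=True)[:n_retrieved]
    )
    return len(truly_best & retrieved) / n_top


# Toy data, made up for illustration (not from the study).
true_brightness = [3.9, 1.2, 3.7, 0.5, 2.8, 3.5]
predicted_brightness = [3.1, 0.9, 1.0, 0.4, 3.0, 3.3]

# The two truly brightest variants are at indices 0 and 2; the three
# highest-ranked predictions are indices 5, 0 and 4, so one of two is found.
print(recall_of_top_variants(true_brightness, predicted_brightness,
                             n_top=2, n_retrieved=3))  # prints 0.5
```

Under this reading, the abstract's setting would correspond to `n_top=26` and `n_retrieved=795` applied to the GFP testing dataset.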

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted March 29, 2021.
Subject Area

  • Bioinformatics