Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy

Xu, Yunxin; Liu, Di; Gong, Haipeng

doi:10.1038/s43588-024-00716-2

Article
Published: 25 October 2024

Nature Computational Science volume 4, pages 840–850 (2024)

A preprint version of the article is available at bioRxiv.

Abstract

Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models—GeoFitness, GeoDDG and GeoDTm—for the prediction of fitness score, ΔΔG and ΔT_m of a protein upon mutations, respectively. GeoFitness engages a specialized loss function to allow supervised training of a unified model using the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔT_m prediction, the encoder of GeoFitness is reutilized as a pre-trained module in GeoDDG and GeoDTm to overcome the challenge of lacking sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
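The benchmark comparison above is reported as a Spearman correlation coefficient, that is, the Pearson correlation of the ranks of predicted and measured values. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function names and the toy ΔΔG values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical ddG values (kcal/mol), for illustration only.
measured = [-1.2, 0.3, -2.5, 0.8, -0.4]
predicted = [-0.9, 0.1, -1.7, 0.2, -1.0]
rho = spearman(predicted, measured)
```

Because only the ordering of the values matters, the metric rewards a predictor that ranks mutations correctly even when its absolute scale is off.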

**Fig. 1: Summary of the DMS, ΔΔG and ΔT_m data.**

**Fig. 2: Schematic overview of the model architecture.**

**Fig. 3: Pairwise comparisons of GeoFitness with other methods for the prediction of mutation effects on protein fitness.**

**Fig. 4: Detailed analysis for GeoFitness.**

Data availability

The protein fitness datasets for pre-training the GeoFitness model are available from the following sources: MaveDB (https://www.mavedb.org) and DeepSequence (https://doi.org/10.1038/s41592-018-0138-4). The datasets and prediction results are also available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dms. For the GeoDDG model, training dataset S8754, all test datasets including S669, S461, S783, S^sym, S2000, M1261 and the protein stability-related cDNA proteolysis DMS dataset, as well as the corresponding prediction results, are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/ddG. For the GeoDTm model, the training dataset S4346, the test dataset S571, as well as the corresponding prediction results are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dTm. Source data are provided with this paper.

Code availability

All source code of GeoStab-suite is available on GitHub (https://github.com/Gonglab-THU/GeoStab). A reproducible code capsule of GeoStab-suite is available through Code Ocean⁵⁰. An online server for GeoStab-suite is available at https://structpred.life.tsinghua.edu.cn/server_geostab.html. GeoStab-suite is built on Python 3.9.7, PyTorch 1.13.0, biopandas 0.4.1, biopython 1.81, click 8.1.7 and pdb-tools 2.5.0. GeoStab-suite uses the following tools to generate features: AlphaFold v2.2.0 (https://github.com/google-deepmind/alphafold), FoldX 5.0 (https://foldxsuite.crg.eu/), and ESM-1v and ESM-2 (esm2_t33_650M_UR50D; https://github.com/facebookresearch/esm).
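When reproducing the environment, it can help to check the installed packages against the versions listed above. The sketch below is one possible way to do that with the standard library; it is not part of GeoStab-suite. The distribution names (for example `torch` for PyTorch) are assumed to be the usual PyPI names, and `find_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Versions quoted in the code availability statement. The distribution names
# are an assumption: they are the usual PyPI names, not taken from the article.
PINNED = {
    "torch": "1.13.0",
    "biopandas": "0.4.1",
    "biopython": "1.81",
    "click": "8.1.7",
    "pdb-tools": "2.5.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Local build tags such as "1.13.0+cu117" count as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without installing anything; in normal use the default queries the active environment.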

References

Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Dahiyat, B. I. & Mayo, S. L. De novo protein design: fully automated sequence selection. Science 278, 82–87 (1997).
Tokuriki, N. & Tawfik, D. S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 19, 596–604 (2009).
Pucci, F., Bourgeas, R. & Rooman, M. High-quality thermodynamic data on the stability changes of proteins upon single-site mutations. J. Phys. Chem. Ref. Data 45, 023104 (2016).
Yeoman, C. J. et al. in Advances in Applied Microbiology (eds Laskin, A. I. et al.) 1–55 (Elsevier, 2010); https://doi.org/10.1016/s0065-2164(10)70001-0
Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35, 1978–1980 (2018).
Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Li, M. et al. SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. J. Cheminform. 15, 12 (2023).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Proc. Advances in Neural Information Processing Systems Vol. 34 (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).
Rao, R. M. et al. MSA Transformer. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).
Mansoor, S., Baek, M., Juergens, D., Watson, J. L. & Baker, D. Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold. Protein Sci. 32, e4780 (2023).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Dehouck, Y. et al. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25, 2537–2543 (2009).
Montanucci, L., Capriotti, E., Frank, Y., Ben-Tal, N. & Fariselli, P. DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations. BMC Bioinformatics 20, 335 (2019).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Benevenuta, S., Pancotti, C., Fariselli, P., Birolo, G. & Sanavia, T. An antisymmetric neural network to predict free energy changes in protein variants. J. Phys. D Appl. Phys. 54, 245403 (2021).
Li, B., Yang, Y. T., Capra, J. A. & Gerstein, M. B. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput. Biol. 16, e1008291 (2020).
Pancotti, C. et al. A deep-learning sequence-based method to predict protein stability changes upon genetic variations. Genes 12, 911 (2021).
Fariselli, P., Martelli, P. L., Savojardo, C. & Casadio, R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 31, 2816–2821 (2015).
Capriotti, E., Fariselli, P., Rossi, I. & Casadio, R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 9, S6 (2008).
Chen, Y. et al. PremPS: predicting the impact of missense mutations on protein stability. PLoS Comput. Biol. 16, e1008543 (2020).
Zhou, Y., Pan, Q., Pires, D. E. V., Rodrigues, C. H. M. & Ascher, D. B. DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res. 51, W122–W128 (2023).
Iqbal, S. et al. Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations. Brief. Bioinform. 22, bbab184 (2021).
Pancotti, C. et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief. Bioinform. 23, bbab555 (2022).
Pucci, F., Schwersensky, M. & Rooman, M. Artificial intelligence challenges for predicting the impact of mutations on protein stability. Curr. Opin. Struct. Biol. 72, 161–168 (2022).
Masso, M. & Vaisman, I. I. AUTO-MUTE 2.0: a portable framework with enhanced capabilities for predicting protein functional consequences upon mutation. Adv. Bioinform. 2014, 278385 (2014).
Pucci, F., Bourgeas, R. & Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: introducing HoTMuSiC. Sci. Rep. 6, 23257 (2016).
Louis, B. B. V. & Abriata, L. A. Reviewing challenges of predicting protein melting temperature change upon mutation through the full analysis of a highly detailed dataset with high-resolution structures. Mol. Biotechnol. 63, 863–884 (2021).
Berman, H. M. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).
Pucci, F., Bernaerts, K. V., Kwasigroch, J. M. & Rooman, M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics 34, 3659–3665 (2018).
Usmanova, D. R. et al. Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation. Bioinformatics 34, 3653–3658 (2018).
Hernández, I. M., Dehouck, Y., Bastolla, U., López-Blanco, J. R. & Chacón, P. Predicting protein stability changes upon mutation using a simple orientational potential. Bioinformatics 39, btad011 (2023).
Laimer, J., Hofer, H., Fritz, M., Wegenkittl, S. & Lackner, P. Maestro—multi agent stability prediction upon point mutations. BMC Bioinformatics 16, 116 (2015).
Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444 (2023).
Rodrigues, C. H., Pires, D. E. & Ascher, D. B. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 30, 60–69 (2020).
Blondel, M., Teboul, O., Berthet, Q. & Djolonga, J. Fast differentiable sorting and ranking. In Proc. 37th International Conference on Machine Learning (eds Daume, H. & Singh, A.) 950–959 (ICML, 2020).
Nikam, R., Kulandaisamy, A., Harini, K., Sharma, D. & Gromiha, M. M. ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Res. 49, D420–D424 (2020).
Xavier, J. S. et al. ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Res. 49, D475–D479 (2020).
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
Buel, G. R. & Walters, K. J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 29, 1–2 (2022).
Pak, M. A. et al. Using AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE 18, e0282689 (2023).
Kumar, M. D. S. ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res. 34, D204–D206 (2006).
Nair, P. S. & Vihinen, M. Varibench: a benchmark database for variations. Hum. Mutat. 34, 42–49 (2013).
Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Proc. Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) 1417 (Curran Associates, 2019).
Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Code Ocean https://doi.org/10.24433/CO.2318813.v1 (2024).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (#32171243 to H.G.) and by the Beijing Frontier Research Center for Biological Structure.

Author information

Authors and Affiliations

MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
Yunxin Xu, Di Liu & Haipeng Gong
Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
Yunxin Xu, Di Liu & Haipeng Gong

Contributions

Y.X. and H.G. proposed the methodology and designed the experiment. Y.X. implemented the experiment. Y.X. and D.L. analyzed the results. Y.X., D.L. and H.G. wrote the paper. All authors agreed with the final paper.

Corresponding author

Correspondence to Haipeng Gong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Emil Alexov, Matsvei Tsishyn and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Detailed analysis for GeoDDG-Seq.

a, Side-by-side comparison between GeoDDG-Seq and the second-best sequence-based predictor, INPS-Seq, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (49 in GeoDDG-Seq vs. 54 in INPS-Seq) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDDG-Seq. Source data are provided as a Source Data file.

Extended Data Fig. 2 Detailed analysis for GeoDDG-3D.

a, Side-by-side comparison between GeoDDG-3D and the second-best structure-based predictor, MAESTRO, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (52 in GeoDDG-3D vs. 64 in MAESTRO) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDDG-3D. Source data are provided as a Source Data file.

Extended Data Fig. 3 Detailed analysis for GeoDTm.

a, Side-by-side comparison between GeoDTm-3D and the second-best structure-based predictor, HoTMuSiC, on S571. Predictions with error > 11.0 °C are identified as outliers (68 in GeoDTm-3D vs. 83 in HoTMuSiC) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDTm-3D and GeoDTm-Seq. Source data are provided as a Source Data file.
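The Extended Data captions describe two simple analysis steps: flagging predictions whose absolute error exceeds a threshold, and estimating a 95% confidence interval by bootstrapping. The sketch below illustrates both under stated assumptions and is not the authors' analysis code. The captions do not say which statistic the interval is computed for, so the mean absolute error used here is an assumption, as are the percentile method, the function names and the toy ΔΔG values.

```python
import random


def count_outliers(predicted, measured, threshold):
    """Count predictions whose absolute error exceeds the threshold."""
    return sum(abs(p - m) > threshold for p, m in zip(predicted, measured))


def mean_absolute_error(predicted, measured):
    """Assumed statistic; the captions do not name the one actually used."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def bootstrap_ci(predicted, measured, statistic, n_resamples=2000, seed=0):
    """Percentile 95% interval from resampling (prediction, label) pairs."""
    rng = random.Random(seed)
    n = len(predicted)
    estimates = []
    for _ in range(n_resamples):
        # Draw n pairs with replacement so each pair stays together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(statistic([predicted[i] for i in idx],
                                   [measured[i] for i in idx]))
    estimates.sort()
    # Integer arithmetic avoids floating-point surprises in the cut points.
    lower = estimates[n_resamples * 25 // 1000]
    upper = estimates[n_resamples * 975 // 1000 - 1]
    return lower, upper


# Hypothetical ddG values (kcal/mol), for illustration only.
measured = [-1.2, 0.3, -2.5, 0.8, -0.4, -3.9]
predicted = [-0.9, 0.1, -1.7, 0.2, -1.0, -0.8]
n_outliers = count_outliers(predicted, measured, threshold=2.5)
low, high = bootstrap_ci(predicted, measured, mean_absolute_error)
```

For ΔT_m data the same `count_outliers` call would take the 11.0 °C threshold from the caption instead of 2.5 kcal/mol. Resampling pairs rather than predictions and labels separately keeps each prediction matched to its own measurement.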

Supplementary information

Supplementary Information

Supplementary Methods, Model Evaluation, Tables 1–9 and Figs. 1–8.

Reporting Summary

Source data

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci 4, 840–850 (2024). https://doi.org/10.1038/s43588-024-00716-2

Received: 29 August 2023
Accepted: 03 October 2024
Published: 25 October 2024
Issue Date: November 2024
DOI: https://doi.org/10.1038/s43588-024-00716-2
