Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy

A preprint version of the article is available at bioRxiv.

Abstract

Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models—GeoFitness, GeoDDG and GeoDTm—for the prediction of fitness score, ΔΔG and ΔTm of a protein upon mutations, respectively. GeoFitness engages a specialized loss function to allow supervised training of a unified model using the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reutilized as a pre-trained module in GeoDDG and GeoDTm to overcome the challenge of lacking sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
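The benchmark figures quoted in the abstract are Spearman correlation coefficients between measured and predicted values. As a minimal sketch of that metric (not the authors' evaluation code), the snippet below ranks both lists, averaging tied ranks, and takes the Pearson correlation of the ranks. The function names and the example ΔΔG values are hypothetical and chosen for illustration only.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(measured, predicted):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(measured), average_ranks(predicted))


# Hypothetical measured and predicted ddG values (kcal/mol), illustration only.
measured_ddg = [-1.2, 0.3, 2.1, 0.8, -0.5]
predicted_ddg = [-0.9, 0.6, 1.7, 0.4, -0.2]
print(round(spearman(measured_ddg, predicted_ddg), 3))
```

Because the coefficient depends only on ranks, it rewards a predictor for ordering mutations correctly even when the predicted magnitudes are off.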

Fig. 1: Summary of the DMS, ΔΔG and ΔTm data.
Fig. 2: Schematic overview of the model architecture.
Fig. 3: Pairwise comparisons of GeoFitness with other methods for the prediction of mutation effects on protein fitness.
Fig. 4: Detailed analysis for GeoFitness.

Data availability

The protein fitness datasets for pre-training the GeoFitness model are available from the following sources: MaveDB (https://www.mavedb.org) and DeepSequence (https://doi.org/10.1038/s41592-018-0138-4). The datasets and prediction results are also available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dms. For the GeoDDG model, training dataset S8754, all test datasets including S669, S461, S783, Ssym, S2000, M1261 and the protein stability-related cDNA proteolysis DMS dataset, as well as the corresponding prediction results, are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/ddG. For the GeoDTm model, the training dataset S4346, the test dataset S571, as well as the corresponding prediction results are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dTm. Source data are provided with this paper.

Code availability

All source code for GeoStab-suite is available on GitHub (https://github.com/Gonglab-THU/GeoStab). A reproducible code capsule of GeoStab-suite is available through Code Ocean (ref. 50). An online server for GeoStab-suite is available at https://structpred.life.tsinghua.edu.cn/server_geostab.html. GeoStab-suite is built on Python 3.9.7, PyTorch 1.13.0, biopandas 0.4.1, biopython 1.81, click 8.1.7 and pdb-tools 2.5.0. GeoStab-suite uses the following tools to generate features: AlphaFold v2.2.0 (https://github.com/google-deepmind/alphafold), FoldX 5.0 (https://foldxsuite.crg.eu/), and ESM-1v and ESM-2 (esm2_t33_650M_UR50D; https://github.com/facebookresearch/esm).
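As a rough aid for reproducing this environment, the sketch below compares installed package versions against the ones listed above. It is not part of GeoStab-suite. The helper names are hypothetical, and the dictionary keys assume the usual PyPI distribution names (in particular `torch` for PyTorch).

```python
from importlib import metadata

# Versions copied from the Code availability statement. The keys are assumed
# PyPI distribution names; "torch" stands in for PyTorch.
PINNED = {
    "torch": "1.13.0",
    "biopandas": "0.4.1",
    "biopython": "1.81",
    "click": "8.1.7",
    "pdb-tools": "2.5.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for every missing or mismatched package."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # A local build tag such as "1.13.0+cu117" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_environment()` with no arguments inspects the current interpreter; an empty dictionary means every listed package matches. The Python version itself (3.9.7) is not checked here.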

References

  1. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  2. Dahiyat, B. I. & Mayo, S. L. De novo protein design: fully automated sequence selection. Science 278, 82–87 (1997).

  3. Tokuriki, N. & Tawfik, D. S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 19, 596–604 (2009).

  4. Pucci, F., Bourgeas, R. & Rooman, M. High-quality thermodynamic data on the stability changes of proteins upon single-site mutations. J. Phys. Chem. Ref. Data 45, 023104 (2016).

  5. Yeoman, C. J. et al. in Advances in Applied Microbiology (eds Laskin, A. I. et al.) 1–55 (Elsevier, 2010); https://doi.org/10.1016/s0065-2164(10)70001-0

  6. Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35, 1978–1980 (2018).

  7. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).

  8. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  9. Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).

  10. Li, M. et al. SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. J. Cheminform 15, 12 (2023).

  11. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Proc. Advances in Neural Information Processing Systems Vol. 34 (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).

  12. Rao, R. M. et al. MSA Transformer. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).

  13. Mansoor, S., Baek, M., Juergens, D., Watson, J. L. & Baker, D. Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold. Protein Sci. 32, e4780 (2023).

  14. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  15. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  16. Dehouck, Y. et al. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25, 2537–2543 (2009).

  17. Montanucci, L., Capriotti, E., Frank, Y., Ben-Tal, N. & Fariselli, P. DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations. BMC Bioinformatics 20, 335 (2019).

  18. Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).

  19. Benevenuta, S., Pancotti, C., Fariselli, P., Birolo, G. & Sanavia, T. An antisymmetric neural network to predict free energy changes in protein variants. J. Phys. D Appl. Phys. 54, 245403 (2021).

  20. Li, B., Yang, Y. T., Capra, J. A. & Gerstein, M. B. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput. Biol. 16, e1008291 (2020).

  21. Pancotti, C. et al. A deep-learning sequence-based method to predict protein stability changes upon genetic variations. Genes 12, 911 (2021).

  22. Fariselli, P., Martelli, P. L., Savojardo, C. & Casadio, R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 31, 2816–2821 (2015).

  23. Capriotti, E., Fariselli, P., Rossi, I. & Casadio, R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 9, S6 (2008).

  24. Chen, Y. et al. PremPS: predicting the impact of missense mutations on protein stability. PLoS Comput. Biol. 16, e1008543 (2020).

  25. Zhou, Y., Pan, Q., Pires, D. E. V., Rodrigues, C. H. M. & Ascher, D. B. DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res. 51, W122–W128 (2023).

  26. Iqbal, S. et al. Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations. Brief. Bioinform. 22, bbab184 (2021).

  27. Pancotti, C. et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief. Bioinform. 23, bbab555 (2022).

  28. Pucci, F., Schwersensky, M. & Rooman, M. Artificial intelligence challenges for predicting the impact of mutations on protein stability. Curr. Opin. Struct. Biol. 72, 161–168 (2022).

  29. Masso, M. & Vaisman, I. I. AUTO-MUTE 2.0: a portable framework with enhanced capabilities for predicting protein functional consequences upon mutation. Adv. Bioinform. 2014, 278385 (2014).

  30. Pucci, F., Bourgeas, R. & Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: introducing HoTMuSiC. Sci. Rep. 6, 23257 (2016).

  31. Louis, B. B. V. & Abriata, L. A. Reviewing challenges of predicting protein melting temperature change upon mutation through the full analysis of a highly detailed dataset with high-resolution structures. Mol. Biotechnol. 63, 863–884 (2021).

  32. Berman, H. M. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).

  33. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  34. Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).

  35. Pucci, F., Bernaerts, K. V., Kwasigroch, J. M. & Rooman, M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics 34, 3659–3665 (2018).

  36. Usmanova, D. R. et al. Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation. Bioinformatics 34, 3653–3658 (2018).

  37. Hernández, I. M., Dehouck, Y., Bastolla, U., López-Blanco, J. R. & Chacón, P. Predicting protein stability changes upon mutation using a simple orientational potential. Bioinformatics 39, btad011 (2023).

  38. Laimer, J., Hofer, H., Fritz, M., Wegenkittl, S. & Lackner, P. Maestro—multi agent stability prediction upon point mutations. BMC Bioinformatics 16, 116 (2015).

  39. Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444 (2023).

  40. Rodrigues, C. H., Pires, D. E. & Ascher, D. B. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 30, 60–69 (2020).

  41. Blondel, M., Teboul, O., Berthet, Q. & Djolonga, J. Fast differentiable sorting and ranking. In Proc. 37th International Conference on Machine Learning (eds Daume, H. & Singh, A.) 950–959 (ICML, 2020).

  42. Nikam, R., Kulandaisamy, A., Harini, K., Sharma, D. & Gromiha, M. M. ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Res. 49, D420–D424 (2020).

  43. Xavier, J. S. et al. ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Res. 49, D475–D479 (2020).

  44. Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).

  45. Buel, G. R. & Walters, K. J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 29, 1–2 (2022).

  46. Pak, M. A. et al. Using AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE 18, e0282689 (2023).

  47. Kumar, M. D. S. ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res. 34, D204–D206 (2006).

  48. Nair, P. S. & Vihinen, M. Varibench: a benchmark database for variations. Hum. Mutat. 34, 42–49 (2013).

  49. Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Proc. Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) 1417 (Curran Associates, 2019).

  50. Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Code Ocean https://doi.org/10.24433/CO.2318813.v1 (2024).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (#32171243 to H.G.) and by the Beijing Frontier Research Center for Biological Structure.

Author information

Contributions

Y.X. and H.G. proposed the methodology and designed the experiment. Y.X. implemented the experiment. Y.X. and D.L. analyzed the results. Y.X., D.L. and H.G. wrote the paper. All authors agreed with the final paper.

Corresponding author

Correspondence to Haipeng Gong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Emil Alexov, Matsvei Tsishyn and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Detailed analysis for GeoDDG-Seq.

a Side-by-side comparison between GeoDDG-Seq and the second-best sequence-based predictor, INPS-Seq, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (49 in GeoDDG-Seq vs. 54 in INPS-Seq) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b Ablation study of GeoDDG-Seq. Source data are provided as a Source Data file.

Extended Data Fig. 2 Detailed analysis for GeoDDG-3D.

a Side-by-side comparison between GeoDDG-3D and the second-best structure-based predictor, MAESTRO, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (52 in GeoDDG-3D vs. 64 in MAESTRO) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b Ablation study of GeoDDG-3D. Source data are provided as a Source Data file.

Extended Data Fig. 3 Detailed analysis for GeoDTm.

a Side-by-side comparison between GeoDTm-3D and the second-best structure-based predictor, HoTMuSiC, on S571. Predictions with error > 11.0 °C are identified as outliers (68 in GeoDTm-3D vs. 83 in HoTMuSiC) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b Ablation study of GeoDTm-3D and GeoDTm-Seq. Source data are provided as a Source Data file.
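The Extended Data captions rely on two simple procedures: flagging predictions whose error exceeds a fixed threshold, and a bootstrapped 95% confidence interval. The captions do not state which quantity the interval refers to or which bootstrap variant was used, so the sketch below is an assumption throughout: it uses a percentile bootstrap over resampled mutations and, for illustration, the mean absolute error as the statistic. All function names are hypothetical.

```python
import random


def count_outliers(measured, predicted, threshold):
    """Count predictions whose absolute error is strictly above the threshold."""
    return sum(abs(m - p) > threshold for m, p in zip(measured, predicted))


def mean_absolute_error(measured, predicted):
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def bootstrap_ci(measured, predicted, statistic,
                 n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(measured, predicted).

    Each resample draws len(measured) paired data points with replacement.
    """
    rng = random.Random(seed)
    n = len(measured)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(statistic([measured[i] for i in idx],
                                   [predicted[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_resamples)]
    upper = estimates[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical values, illustration only (kcal/mol for a ddG benchmark).
measured = [0.0, 1.0, -3.0, 0.4, 2.2]
predicted = [0.5, 4.0, -0.4, 0.1, 2.0]
print(count_outliers(measured, predicted, threshold=2.5))
print(bootstrap_ci(measured, predicted, mean_absolute_error))
```

The same `count_outliers` call applies to the ΔTm benchmark by passing the 11.0 °C threshold instead of 2.5 kcal/mol.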

Supplementary information

Supplementary Information

Supplementary Methods, Model Evaluation, Tables 1–9 and Figs. 1–8.

Reporting Summary

Source data

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci 4, 840–850 (2024). https://doi.org/10.1038/s43588-024-00716-2

  • DOI: https://doi.org/10.1038/s43588-024-00716-2
