Abstract
Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for predicting the fitness score, ΔΔG and ΔTm of a protein upon mutation, respectively. GeoFitness employs a specialized loss function that allows supervised training of a unified model on the large amount of multi-labeled fitness data in the deep mutational scanning database. To improve the downstream tasks of ΔΔG and ΔTm prediction, for which labeled data are scarce, the encoder of GeoFitness is reused as a pre-trained module in GeoDDG and GeoDTm. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
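Because the benchmark comparison is stated in terms of the Spearman correlation coefficient, a minimal sketch of that metric may help readers reproduce such comparisons. It is not taken from the GeoStab-suite code: the function names (average_ranks, pearson, spearman) and the sample values are hypothetical, and only the Python standard library is assumed.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical predicted and measured values, for illustration only.
predicted = [0.1, 0.5, 0.9, 2.0]
measured = [1.0, 4.0, 9.0, 16.0]
rho = spearman(predicted, measured)
```

Since the metric depends only on rank order, any monotonically increasing relationship between predictions and measurements gives a coefficient of 1, regardless of scale.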
Data availability
The protein fitness datasets for pre-training the GeoFitness model are available from the following sources: MaveDB (https://www.mavedb.org) and DeepSequence (https://doi.org/10.1038/s41592-018-0138-4). The datasets and prediction results are also available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dms. For the GeoDDG model, training dataset S8754, all test datasets including S669, S461, S783, Ssym, S2000, M1261 and the protein stability-related cDNA proteolysis DMS dataset, as well as the corresponding prediction results, are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/ddG. For the GeoDTm model, the training dataset S4346, the test dataset S571, as well as the corresponding prediction results are available at https://github.com/Gonglab-THU/GeoStab/tree/main/data/dTm. Source data are provided with this paper.
Code availability
All source code of GeoStab-suite is available on GitHub (https://github.com/Gonglab-THU/GeoStab). A reproducible code capsule of GeoStab-suite is available through Code Ocean (ref. 50; https://doi.org/10.24433/CO.2318813.v1). An online server for GeoStab-suite is available at https://structpred.life.tsinghua.edu.cn/server_geostab.html. GeoStab-suite is built on Python 3.9.7, PyTorch 1.13.0, biopandas 0.4.1, biopython 1.81, click 8.1.7 and pdb-tools 2.5.0. GeoStab-suite uses the following tools to generate features: AlphaFold v2.2.0 (https://github.com/google-deepmind/alphafold), FoldX 5.0 (https://foldxsuite.crg.eu/) as well as ESM-1v and ESM-2 (esm2_t33_650M_UR50D; https://github.com/facebookresearch/esm).
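As a convenience for readers setting up an environment, the following is a minimal sketch of a check that the installed packages match the versions stated above. It is not part of GeoStab-suite. The version numbers come from the statement above, whereas the distribution names used as keys (for example, torch for PyTorch) and the helper names installed_version and find_mismatches are assumptions made for illustration.

```python
from importlib import metadata

# Versions as stated in the Code availability statement. The distribution
# names used as keys are assumptions about how each package is installed.
PINNED = {
    "torch": "1.13.0",
    "biopandas": "0.4.1",
    "biopython": "1.81",
    "click": "8.1.7",
    "pdb-tools": "2.5.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that is missing or differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Ignore a local build suffix such as "+cu117" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The lookup function is passed as a parameter so that the comparison logic can be exercised without touching the real environment.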
References
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Dahiyat, B. I. & Mayo, S. L. De novo protein design: fully automated sequence selection. Science 278, 82–87 (1997).
Tokuriki, N. & Tawfik, D. S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 19, 596–604 (2009).
Pucci, F., Bourgeas, R. & Rooman, M. High-quality thermodynamic data on the stability changes of proteins upon single-site mutations. J. Phys. Chem. Ref. Data 45, 023104 (2016).
Yeoman, C. J. et al. in Advances in Applied Microbiology (eds Laskin, A. I. et al.) 1–55 (Elsevier, 2010); https://doi.org/10.1016/s0065-2164(10)70001-0
Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35, 1978–1980 (2018).
Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Li, M. et al. SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. J. Cheminform. 15, 12 (2023).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Proc. Advances in Neural Information Processing Systems Vol. 34 (eds Ranzato, M. et al.) 29287–29303 (Curran Associates, 2021).
Rao, R. M. et al. MSA Transformer. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).
Mansoor, S., Baek, M., Juergens, D., Watson, J. L. & Baker, D. Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold. Protein Sci. 32, e4780 (2023).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Dehouck, Y. et al. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25, 2537–2543 (2009).
Montanucci, L., Capriotti, E., Frank, Y., Ben-Tal, N. & Fariselli, P. DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations. BMC Bioinformatics 20, 335 (2019).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Benevenuta, S., Pancotti, C., Fariselli, P., Birolo, G. & Sanavia, T. An antisymmetric neural network to predict free energy changes in protein variants. J. Phys. D Appl. Phys. 54, 245403 (2021).
Li, B., Yang, Y. T., Capra, J. A. & Gerstein, M. B. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput. Biol. 16, e1008291 (2020).
Pancotti, C. et al. A deep-learning sequence-based method to predict protein stability changes upon genetic variations. Genes 12, 911 (2021).
Fariselli, P., Martelli, P. L., Savojardo, C. & Casadio, R. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 31, 2816–2821 (2015).
Capriotti, E., Fariselli, P., Rossi, I. & Casadio, R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 9, S6 (2008).
Chen, Y. et al. PremPS: predicting the impact of missense mutations on protein stability. PLoS Comput. Biol. 16, e1008543 (2020).
Zhou, Y., Pan, Q., Pires, D. E. V., Rodrigues, C. H. M. & Ascher, D. B. DDMut: predicting effects of mutations on protein stability using deep learning. Nucleic Acids Res. 51, W122–W128 (2023).
Iqbal, S. et al. Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations. Brief. Bioinform. 22, bbab184 (2021).
Pancotti, C. et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief. Bioinform. 23, bbab555 (2022).
Pucci, F., Schwersensky, M. & Rooman, M. Artificial intelligence challenges for predicting the impact of mutations on protein stability. Curr. Opin. Struct. Biol. 72, 161–168 (2022).
Masso, M. & Vaisman, I. I. AUTO-MUTE 2.0: a portable framework with enhanced capabilities for predicting protein functional consequences upon mutation. Adv. Bioinform. 2014, 278385 (2014).
Pucci, F., Bourgeas, R. & Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: introducing HoTMuSiC. Sci. Rep. 6, 23257 (2016).
Louis, B. B. V. & Abriata, L. A. Reviewing challenges of predicting protein melting temperature change upon mutation through the full analysis of a highly detailed dataset with high-resolution structures. Mol. Biotechnol. 63, 863–884 (2021).
Berman, H. M. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).
Pucci, F., Bernaerts, K. V., Kwasigroch, J. M. & Rooman, M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics 34, 3659–3665 (2018).
Usmanova, D. R. et al. Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation. Bioinformatics 34, 3653–3658 (2018).
Hernández, I. M., Dehouck, Y., Bastolla, U., López-Blanco, J. R. & Chacón, P. Predicting protein stability changes upon mutation using a simple orientational potential. Bioinformatics 39, btad011 (2023).
Laimer, J., Hofer, H., Fritz, M., Wegenkittl, S. & Lackner, P. Maestro—multi agent stability prediction upon point mutations. BMC Bioinformatics 16, 116 (2015).
Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444 (2023).
Rodrigues, C. H., Pires, D. E. & Ascher, D. B. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 30, 60–69 (2020).
Blondel, M., Teboul, O., Berthet, Q. & Djolonga, J. Fast differentiable sorting and ranking. In Proc. 37th International Conference on Machine Learning (eds Daume, H. & Singh, A.) 950–959 (ICML, 2020).
Nikam, R., Kulandaisamy, A., Harini, K., Sharma, D. & Gromiha, M. M. ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Res. 49, D420–D424 (2020).
Xavier, J. S. et al. ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Res. 49, D475–D479 (2020).
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
Buel, G. R. & Walters, K. J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 29, 1–2 (2022).
Pak, M. A. et al. Using AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE 18, e0282689 (2023).
Kumar, M. D. S. ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res. 34, D204–D206 (2006).
Nair, P. S. & Vihinen, M. Varibench: a benchmark database for variations. Hum. Mutat. 34, 42–49 (2013).
Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Proc. Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) 1417 (Curran Associates, 2019).
Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Code Ocean https://doi.org/10.24433/CO.2318813.v1 (2024).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (#32171243 to H.G.) and by the Beijing Frontier Research Center for Biological Structure.
Author information
Authors and Affiliations
Contributions
Y.X. and H.G. proposed the methodology and designed the experiment. Y.X. implemented the experiment. Y.X. and D.L. analyzed the results. Y.X., D.L. and H.G. wrote the paper. All authors agreed with the final paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Emil Alexov, Matsvei Tsishyn and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Detailed analysis for GeoDDG-Seq.
a, Side-by-side comparison between GeoDDG-Seq and the second-best sequence-based predictor, INPS-Seq, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (49 in GeoDDG-Seq vs. 54 in INPS-Seq) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDDG-Seq. Source data are provided as a Source Data file.
Extended Data Fig. 2 Detailed analysis for GeoDDG-3D.
a, Side-by-side comparison between GeoDDG-3D and the second-best structure-based predictor, MAESTRO, on S669. Predictions with error > 2.5 kcal/mol are identified as outliers (52 in GeoDDG-3D vs. 64 in MAESTRO) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDDG-3D. Source data are provided as a Source Data file.
Extended Data Fig. 3 Detailed analysis for GeoDTm.
a, Side-by-side comparison between GeoDTm-3D and the second-best structure-based predictor, HoTMuSiC, on S571. Predictions with error > 11.0 °C are identified as outliers (68 in GeoDTm-3D vs. 83 in HoTMuSiC) and are colored red in the figures. The 95% confidence interval is calculated by bootstrapping. b, Ablation study of GeoDTm-3D and GeoDTm-Seq. Source data are provided as a Source Data file.
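The outlier counts and bootstrapped intervals in the Extended Data captions can be illustrated with a short sketch. This is not the authors' analysis code. The captions do not state which quantity the 95% confidence interval refers to or which bootstrap variant was used, so the sketch assumes a percentile bootstrap and uses the mean absolute error as a stand-in statistic; the function names and sample values are hypothetical.

```python
import random


def count_outliers(predicted, measured, threshold):
    """Count predictions whose absolute error exceeds the threshold."""
    return sum(1 for p, m in zip(predicted, measured) if abs(p - m) > threshold)


def mean_abs_error(predicted, measured):
    """Stand-in statistic; the captions do not name the bootstrapped quantity."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def bootstrap_ci(predicted, measured, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval over paired (predicted, measured) points."""
    rng = random.Random(seed)
    n = len(predicted)
    estimates = []
    for _ in range(n_resamples):
        # Resample pairs with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            statistic([predicted[i] for i in idx], [measured[i] for i in idx])
        )
    estimates.sort()
    low_index = int((1.0 - level) / 2.0 * n_resamples)
    high_index = n_resamples - 1 - low_index
    return estimates[low_index], estimates[high_index]


# Hypothetical values, with the 2.5 kcal/mol threshold from the ΔΔG captions.
predicted = [0.0, 3.0, -1.0, 1.2, 0.4]
measured = [0.5, 0.0, 2.0, 1.0, 0.1]
n_outliers = count_outliers(predicted, measured, threshold=2.5)
low, high = bootstrap_ci(predicted, measured, mean_abs_error)
```

Any paired statistic, such as a correlation coefficient, can be passed in place of mean_abs_error; the fixed seed only makes the resampling repeatable.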
Supplementary information
Supplementary Information
Supplementary Methods, Model Evaluation, Tables 1–9 and Figs. 1–8.
Source data
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci 4, 840–850 (2024). https://doi.org/10.1038/s43588-024-00716-2
DOI: https://doi.org/10.1038/s43588-024-00716-2