Abstract
Systematic variation is a common issue in metabolomics data analysis, so scaling and normalization techniques are used to preprocess the data. Although several scaling methods are available in the literature, the choice of scaling, transformation, and/or normalization technique influences the subsequent statistical analysis. Choosing a scaling technique that yields accurate downstream results and supports sound decisions is therefore challenging. Moreover, the existing scaling techniques are sensitive to outliers or extreme values. To fill this gap, we introduce a new weighted scaling approach that is robust against outliers, so that no additional outlier detection or treatment step is needed in data preprocessing, and that provides more accurate results for downstream analysis. We evaluated the proposed method against conventional scaling and normalization techniques on artificial and real metabolomics datasets, in both the absence and presence of different percentages of outliers. The results show that, in most cases, the proposed scaling technique outperforms the traditional scaling methods in both settings and improves downstream metabolomics analysis. The R function of the proposed robust scaling method is available at https://github.com/nishithkumarpaul/robustScaling/blob/main/wscaling.R
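The outlier sensitivity of conventional scaling mentioned above can be illustrated with a minimal Python sketch. It contrasts autoscaling (mean and standard deviation) with a generic median/MAD scaling on a made-up intensity vector for one metabolite. The median/MAD scaler is only a stand-in for a robust method and is not the weighted scaling approach proposed in the article; the function names and the toy data are assumptions introduced for illustration.

```python
import statistics


def auto_scale(values):
    """Autoscaling: centre on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def median_mad_scale(values):
    """Generic robust stand-in (NOT the article's weighted method):
    centre on the median and divide by the MAD scaled by 1.4826."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [(v - med) / (1.4826 * mad) for v in values]


def max_shift(scale, reference, contaminated):
    """Largest change in the scaled values of the samples shared by both
    vectors (all but the last one) when the last sample becomes an outlier."""
    ref = scale(reference)[:-1]
    con = scale(contaminated)[:-1]
    return max(abs(r - c) for r, c in zip(ref, con))


# Hypothetical intensities of one metabolite across eight samples.
clean = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
# Same vector with the last sample replaced by an extreme value.
contaminated = clean[:-1] + [100.0]

auto_shift = max_shift(auto_scale, clean, contaminated)
robust_shift = max_shift(median_mad_scale, clean, contaminated)

print(f"autoscaling: max shift of the regular samples = {auto_shift:.3f}")
print(f"median/MAD:  max shift of the regular samples = {robust_shift:.3f}")
```

In this toy example a single extreme value inflates the mean and standard deviation, so the autoscaled values of the regular samples move much more than the median/MAD-scaled ones. This is the kind of distortion a robust scaling method aims to avoid.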
Data availability
Two datasets are available at the National Institutes of Health (NIH) Common Fund’s National Metabolomics Data Repository (NMDR) website (https://www.metabolomicsworkbench.org/data/index.php). The breast cancer dataset was produced by GC–TOF-MS from blood samples of 134 subjects and processed with ChromaTOF software (v. 2.32).
Acknowledgements
We thank the Ministry of Science and Technology, Bangladesh, for the National Science and Technology (NST) fellowship.
Author information
Contributions
BB analyzed the data, performed the statistical analysis, and drafted the manuscript. NK and MAA developed the weighted scaling approach for metabolomics data analysis. NK, MAA, and MAH coordinated and supervised the project. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests.
About this article
Cite this article
Biswas, B., Kumar, N., Hoque, M.A. et al. Weighted scaling approach for metabolomics data analysis. Jpn J Stat Data Sci 6, 785–802 (2023). https://doi.org/10.1007/s42081-023-00205-2
DOI: https://doi.org/10.1007/s42081-023-00205-2