Abstract
Because experimental detection of DNA-binding proteins (DBPs) is costly, many machine learning (ML) algorithms have been used to process and detect DBPs on a large scale. Previous methods, however, did not account for noisy samples. In this study, a fuzzy twin support vector machine (FTWSVM) is employed to detect DBPs. First, multiple types of protein sequence features are converted into kernel matrices; then, a multiple kernel learning (MKL) algorithm linearly combines these kernels; next, a self-representation (SR)-based membership function estimates the membership value (weight) of each training sample; finally, the integrated kernel matrix and the membership values are fed into the FTWSVM-SR model for training and testing. Compared with other predictive models, FTWSVM based on SR (FTWSVM-SR) achieves the best Matthews correlation coefficient (MCC): 0.7410 and 0.5909 on two independent test sets (PDB186 and PDB2272), respectively. These results indicate that our method can be an effective tool for detecting DBPs. The model can screen and analyze DBPs on a large scale before biochemical experiments are carried out.
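As a rough illustration of two steps named in the abstract, the following minimal Python sketch linearly combines several kernel matrices and computes the MCC from confusion-matrix counts. It is not the authors' implementation: the RBF kernel, the random feature matrix, and the hand-picked weights are assumptions made for illustration only (in the paper the weights are learned by the MKL algorithm), and the function names are hypothetical.

```python
import math
import numpy as np


def rbf_kernel(X, gamma):
    """RBF (Gaussian) kernel matrix for the rows of X (assumed kernel type)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def combine_kernels(kernels, weights):
    """Linear combination of kernel matrices with weights normalized to sum to 1.

    The weights here are placeholders; an MKL algorithm would learn them.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical data: 6 proteins described by 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K1 = rbf_kernel(X, gamma=0.1)
K2 = rbf_kernel(X, gamma=1.0)
K = combine_kernels([K1, K2], weights=[1.0, 3.0])
```

Normalizing the non-negative weights keeps the combined matrix a convex combination of valid kernels, so it remains symmetric and positive semidefinite and can be passed to any kernel-based classifier.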
Acknowledgements
This study was supported by the National Science Foundation of China (NSFC 61873112, 61922020, 62172076 and 61902271) and the Special Science Foundation of Quzhou (2021D004). The authors also thank Professors Bin Liu, Xiuquan Du and Leyi Wei for kindly sharing the dataset.
Ethics declarations
Conflict of Interest
The authors have no competing interests.
Availability of data and material
The related data can be downloaded from: https://figshare.com/s/934f45e3a3e7693691d5.
About this article
Cite this article
Zou, Y., Ding, Y., Peng, L. et al. FTWSVM-SR: DNA-Binding Proteins Identification via Fuzzy Twin Support Vector Machines on Self-Representation. Interdiscip Sci Comput Life Sci 14, 372–384 (2022). https://doi.org/10.1007/s12539-021-00489-6
DOI: https://doi.org/10.1007/s12539-021-00489-6