计算机科学 ›› 2019, Vol. 46 ›› Issue (5): 36-43.doi: 10.11896/j.issn.1002-137X.2019.05.005
周卫星, 石海鹤
ZHOU Wei-xing, SHI Hai-he
摘要: 高通量测序(High-throughput Sequencing,HTS)技术是继第一代测序技术之后发展起来的一种新型测序方式,又被称为下一代测序技术。与第一代测序技术中采用基于Sanger方法的自动、半自动毛细管测序方法不同,高通量测序技术采用了基于焦磷酸测序的并行测序技术,是对传统测序技术的一项重要技术突破,它不仅克服了第一代测序技术高成本、低通量、低速度的缺点,而且能满足现代分子生物学和基因组学快速发展的需求,达到低成本、高通量以及快速的目的。相较于第一代测序数据,高通量测序数据具有典型的长度短、覆盖度不均匀以及准确率低的特点,同时第三代测序技术虽保持了高通量测序技术边测序边合成的思想,但采用了更为高效的单分子实时测序技术和纳米孔测序技术,具有高通量、低成本和测序数据长的优势。因此,要获得完整的全基因组基因序列,生物学家就需要使用一种技术将短测序reads拼装成一条完整的基因单链序列。在这种情况下,序列拼接算法应运而生。首先,介绍了序列拼接算法的发展背景以及高通量测序技术的相关概念,分析了高通量测序技术在序列拼接算法中所具有的优势;其次,通过总结序列拼接算法的发展成果,按基于greedy策略、基于Overlap-Layout-Consensus (OLC)策略和基于De Bruijn Graph (DBG)策略的分类对序列拼接算法进行阐述;最后,探讨了序列拼接算法的相关研究方向和发展趋势。
中图分类号:
[1]WARD R M,SCHMIEDER R,HIGHNAM G,et al.Big data challenges and opportunities in high-throughput sequencing[J].Systems Biomedicine,2013,1(1):29-34. [2]PEARSON W R Y,LIPMAN D J.Improved tools for biological sequence analysis[J].Proceedings of the National Academy of Sciences of the United Sates of America,1988,85(46):16138-16143. [3]ALTSCHUL S F,MADDEN T L,SCHÄFFER A A,et al. Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402. [4]LARKIN M A,BLACKSHIELDS G,BROWN N P,et al.Clus-tal W and Clustal X version 2.0[J].Bioinformatics,2007,23(21):2947-2948. [5]DARLING A E,MAU B,PERNA N T.progressive Mauve:Mul-tiple Genome Alignment with Gene Gain,Loss and Rearrangement[J].PLOS One,2010,5(6):e11147. [6]BLANCHETTE M,KENT W J,RIEMER C,et al.Aligningmultiple genomic sequences with the threaded blockset aligner[J].Genome Research,2004,14(4):708-715. [7]CANTOR C R.Orchestrating the Human Genome Project[J].Science,1990,248(4951):49-51. [8]CONSORTIUM T G P.A global reference for human genetic variation,the 1000 Genomes Project Consortium[J].Nature,2015,526:68-74. [9]CONSORTIUM T U.The UK10K project identifies rare va-riants in health and disease[J].Nature,2015,526(7571):82-90. [10]WATSON M.Illuminating the future of DNA sequencing[J].Genome Biology,2014,15(2):108. [11]Sanger F,Nicklen S,Coulson A R.DNA sequencing with chain-terminating inhibitors[J].Proceedings of the National Academy of Sciences of the United States of America,1977,74(12):5463-5467. [12]MOROZOVA O.Applications of next-generation sequencingtechnologies in functional genomics[J].Genomics,2008,92(5):255-264. [13]WOOLEY J C,GODZIK A,FRIEDBERG I.A Primer on Meta-genomics[J].PLOS Computational Biology,2010,6(2):e1000667. [14]WU X,ZHU X,WU G Q,et al.Data Mining with Big Data[J].IEEE Transactions on Knowledge & Data Engineering,2014,26(1):97-107. [15]FONSECA N A,RUNG J,BRAZMA A,et al.Tools for mapping high-throughput sequencing data[J].Bioinformatics,2012,28(24):3169-3177. [16]NIEDRINGHAUS T P,MILANOVA D,KERBY M B,et al.Landscape of Next-Generation Sequencing Technologies[J].Analytical Chemistry,2011,83(12):4327-4341. [17]HARRIS T D,BUZBY P R,BABCOCK H,et al.Single-molecule DNA sequencing of a viral genome[J].Science,2008,320(5872):106-109. [18]FLUSBERG B A,WEBSTER D,LEE J,et al.Direct detection of DNA methylation during single-molecule,real-time sequencing[J].Nature Methods,2010,7(6):461-465. [19]RUSK N.Cheap third-generation sequencing[J].Nature Me-thods,2009,6(4):244-244. [20]CHIN C S,PELUSO P,SEDLAZECK F J,et al.Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing[J].Nature Methods,2016,13(12):1050-1054. [21]HEEREMA S J,DEKKER C.Graphene nanodevices for DNA sequencing[J].Nature Nanotechnology,2016,11(2):127-136. [22]HEATHER J M,CHAIN B.The sequence of sequencers:The history of sequencing DNA[J].Genomics,2016,107(1):1-8. [23]SHENDURE J,JI H.Next-generation DNA sequencing[J].Nature Biotechnology,2008,26(10):1135-1145. [24]JACKMAN S D,VANDERVALK B P,MOHAMADI H,et al.ABySS 2.0:resource-efficient assembly of large genomes using a Bloom filter[J].Genome Research,2017,27(5):768-777. [25]ZIMIN A V,STEVENS K A,CREPEAU M W,et al.An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing[J].GIGA Science,2017,6(1):1-4. [26]MOSTOVOY Y,LEVYSAKIN M,LAM J,et al.A hybrid approach for de novo human genome sequence assembly and phasing[J].Nature Methods,2016,13(7):587-590. [27]LIU Y,BERTIL S,MASKELL D L.Parallelized short read assembly of large genomes using de Bruijn graphs[J].BMC Bioinformatics,2011,12(1):354. [28]WARREN R L,SUTTON G G,JONES S J M,et al.Assembling millions of short DNA sequences using SSAKE[J].Bioinforma-tics,2007,23(4):500-501. [29]JECK W R,REINHARDT J A,BALTRUS D A,et al.Extending assembly of short DNA sequences to handle error[J].Bioinformatics,2007,23(21):2942-2944. [30]DOHM J C,LOTTAZ C,BORODINA T,et al.SHARCGS,a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing[J].Genome Research,2007,17(11):1697-1706. [31]FARRANT G K,HOEBEKE M,PARTENSKY F,et al.WiseScaffolder:an algorithm for the semi-automatic scaffolding of Next Generation Sequencing data[J].BMC Bioinformatics,2015,16(1):281. [32]CAO M D,NGUYEN S H,GANESAMOORTHY D,et al.Scaffolding and completing genome assemblies in real-time withna-nopore sequencing[J].Nature Communications,2017,8:14515. [33]HIEU T N,ZIAUR R M,HE L,et al.Complete De NovoAssembly of Monoclonal Antibody Sequences[J].Scientific Reports,2016,6:31730. [34]MIN L,LIAO Z,HE Y,et al.ISEA:Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2017,14(4):916-925. [35]MYERS E W,SUTTON G G,DELCHER A L,et al.A Whole-Genome Assembly of Drosophila[J].Science,2000,287(5461):2196-2204. [36]DE L B M,MCCOMBIE W R.Assembling genomic DNA se-quences with PHRAP[J].Current Protocols in Bioinformatics,2007,17(1):11.4.1-11.4.15. [37]MARGULIES M,EGHOLM M,ALTMAN W E,et al.Genome sequencing in microfabricated high-density picolitre reactors[J].Nature,2005,437(7057):376-380. [38]KAMATH G M,SHOMORONY I,XIA F,et al.HINGE:long-read assembly achieves optimal repeat resolution[J].Genome Research,2017,27(5):747-756. [39]MYERS G.Efficient Local Alignment Discovery amongst Noisy Long Reads[M]∥Algorithms in Bioinformatics.Berlin:Sprin-ger,2014:52-67. [40]KOREN S,WALENZ B P,BERLIN K,et al.Canu:scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation[J].Genome Research,2017,27(5):722-736. [41]YE C,HILL C M,WU S,et al.DBG2OLC:Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies[J].Scientific Reports,2016,6(X):31900. [42]YE C,MA Z S,CANNON C H,et al.Exploiting sparseness in de novo genome assembly[J].BMC Bioinformatics,2012,13(S6):S1. [43]LI H.Minimap and miniasm:fast mapping and de novo assembly for noisy long sequences[J].Bioinformatics,2015,32(14):2103-2110. [44]VASER R,SOVIC' I,NAGARAJAN N,et al.Fast and accurate de novo genome assembly from long uncorrected reads[J].Genome Research,2017,27(5):737-746. [45]JANSEN H J,LIEM M,JONGRAADSEN S A,et al.Rapid de novo assembly of the European eel genome from nanopore sequencing reads[J].Scientific Reports,2017,7(1):7213. [46]BAAIJENS J A, AZE A, RIVALS E,et al.De novo assembly of viral quasispecies using overlap graphs[J].Genome Research,2017,27(5):835-848. [47]PENG G,JI P,ZHAO F.A novel codon-based de Bruijn graph algorithm for gene construction from unassembled transcriptomes[J].Genome Biology,2016,17(1):232. [48]CAMERON D L,SCHROEDER J,PENINGTON J S,et al. GRIDSS:sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly[J].Genome Research,2017,27(12):1-11. [49]WEISENFELD N I,KUMAR V,SHAH P,et al.Direct determination of diploid genome sequences[J].Genome Research,2017,27(5):757-767. [50]BUTLER J,MACCALLUM I,KLEBER M,et al.ALLPATHS:de novo assembly of whole-genome shotgun microreads[J].Genome Research,2008,18(5):810-820. [51]PEVZNER P A,TANG H,WATERMAN M S.An Eulerianpath approach to DNA fragment assembly[J].Proceedings of the National Academy of Sciences of the United Sates of America,2001,98(17):9748-9753. [52]GNERRE S,JAFFE D B.High-quality draft assemblies of mammalian genomes from massively parallel sequence data[J].Proceedings of the National Academy of Sciences of the United Sates of America,2011,108(4):1513-1518. [53]LI R,ZHU H,RUAN J,et al.De novo assembly of human genomes with massively parallel short read sequencing[J].Genome Research,2010,20(2):265-272. [54]LUO R,LIU B,XIE Y,et al.SOAPdenovo2:an empirically improved memory-efficient short-read de novo assembler[J].GIGA Science,2012,1(1):18. [55]XIE Y,WU G,TANG J,et al.SOAPdenovo-Trans:de novotranscriptome assembly with short RNA-Seq reads[J].Bioinformatics,2014,30(12):1660-1666. [56]PENG Y,LEUNG H C M,YIU S M,et al.IDBA-UD:a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth[J].Bioinformatics,2012,28(11):1420-1428. [57]CHITSAZ H,YEE-GREENBAUM J L,TESLER G,et al.Efficient de novo assembly of single-cell bacterial genomes from short-read data sets[J].Nature Biotechnology,2011,29(10):915-921. [58]PENG Y,LEUNG H C M,YIU S M,et al.IDBA-tran:a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels[J].Bioinformatics,2013,29(13):326-334. [59]NURK S,MELESHKO D,KOROBEYNIKOV A,et al.metaSPAdes:a new versatile metagenomics assembler[J].Genome Research,2017,27(5):824-834[60]BANKEVICH A,NURK S,ANTIPOV D,et al.SPAdes:A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing[J].Journal of Computational Biology,2012,19(5):455-477. [61]PRJIBELSKI A D,VASILINETC I,BANKEVICH A,et al.ExSPAnder:a universal repeat resolver for DNA fragment assembly[J].Bioinformatics,2014,30(12):293-301. [62]SAFONOVA Y,BANKEVICH A,PEVZNER P A.dipSPAdes:Assembler for Highly Polymorphic Diploid Genomes[C]∥International Conference on Research in Computational Molecular Biology.New York:Springer-Verlag,2014:265-279. [63]ANTIPOV D,KOROBEYNIKOV A,MCLEAN J S,et al.hybridSPAdes:an algorithm for hybrid assembly of short and long reads[J].Bioinformatics,2015,32(7):1009-1015. [64]VASILINETC I,PRJIBELSKI A D,GUREVICH A,et al.Assembling short reads from jumping libraries with large insert sizes[J].Bioinformatics,2015,31(20):3262-3268. |
[1] | 邵子灏, 杨世宇, 马国杰. 室内信息服务的基础——低成本定位技术研究综述 Foundation of Indoor Information Services:A Survey of Low-cost Localization Techniques 计算机科学, 2022, 49(9): 228-235. https://doi.org/10.11896/jsjkx.210900260 |
[2] | 孙刚, 伍江江, 陈浩, 李军, 徐仕远. 一种基于切比雪夫距离的隐式偏好多目标进化算法 Hidden Preference-based Multi-objective Evolutionary Algorithm Based on Chebyshev Distance 计算机科学, 2022, 49(6): 297-304. https://doi.org/10.11896/jsjkx.210500095 |
[3] | 李丹丹, 吴宇翔, 朱聪聪, 李仲康. 基于多种改进策略的改进麻雀搜索算法 Improved Sparrow Search Algorithm Based on A Variety of Improved Strategies 计算机科学, 2022, 49(6A): 217-222. https://doi.org/10.11896/jsjkx.210700032 |
[4] | 胡聪, 何晓晖, 邵发明, 张艳武, 卢冠林, 王金康. 基于极大极稳定区域及SVM的交通标志检测 Traffic Sign Detection Based on MSERs and SVM 计算机科学, 2022, 49(6A): 325-330. https://doi.org/10.11896/jsjkx.210300117 |
[5] | 杨健楠, 张帆. 一种结合双注意力机制和层次网络结构的细碎农作物分类方法 Classification Method for Small Crops Combining Dual Attention Mechanisms and Hierarchical Network Structure 计算机科学, 2022, 49(6A): 353-357. https://doi.org/10.11896/jsjkx.210200169 |
[6] | 张嘉淏, 刘峰, 齐佳音. 一种基于Bottleneck Transformer的轻量级微表情识别架构 Lightweight Micro-expression Recognition Architecture Based on Bottleneck Transformer 计算机科学, 2022, 49(6A): 370-377. https://doi.org/10.11896/jsjkx.210500023 |
[7] | 王方红, 范兴刚, 杨静静, 周杰, 王德恩. 一种基于有向感知区域调整的强栅栏构建算法 Strong Barrier Construction Algorithm Based on Adjustment of Directional Sensing Area 计算机科学, 2022, 49(6A): 612-618. https://doi.org/10.11896/jsjkx.210300291 |
[8] | 张志龙, 史贤俊, 秦玉峰. 基于改进准深度算法的诊断策略优化方法 Diagnosis Strategy Optimization Method Based on Improved Quasi Depth Algorithm 计算机科学, 2022, 49(6A): 729-732. https://doi.org/10.11896/jsjkx.210700076 |
[9] | 高元浩, 罗晓清, 张战成. 基于特征分离的红外与可见光图像融合算法 Infrared and Visible Image Fusion Based on Feature Separation 计算机科学, 2022, 49(5): 58-63. https://doi.org/10.11896/jsjkx.210200148 |
[10] | 刘洋, 李凡长. 基于变分贝叶斯的纤维丛元学习算法 Fiber Bundle Meta-learning Algorithm Based on Variational Bayes 计算机科学, 2022, 49(3): 225-231. https://doi.org/10.11896/jsjkx.201100111 |
[11] | 乔杰, 蔡瑞初, 郝志峰. 一种基于信息瓶颈的因果关系挖掘方法 Mining Causality via Information Bottleneck 计算机科学, 2022, 49(2): 198-203. https://doi.org/10.11896/jsjkx.210100053 |
[12] | 赵学磊, 季新生, 刘树新, 李英乐, 李海涛. 基于路径连接强度的有向网络链路预测方法 Link Prediction Method for Directed Networks Based on Path Connection Strength 计算机科学, 2022, 49(2): 216-222. https://doi.org/10.11896/jsjkx.210100107 |
[13] | 魏昕, 冯锋. 基于高斯-柯西变异的帝国竞争算法优化 Optimization of Empire Competition Algorithm Based on Gauss-Cauchy Mutation 计算机科学, 2021, 48(11A): 142-146. https://doi.org/10.11896/jsjkx.201200071 |
[14] | 王友卫, 朱晨, 朱建明, 李洋, 凤丽洲, 刘江淳. 基于用户兴趣词典和LSTM的个性化情感分类方法 User Interest Dictionary and LSTM Based Method for Personalized Emotion Classification 计算机科学, 2021, 48(11A): 251-257. https://doi.org/10.11896/jsjkx.201200202 |
[15] | 肖满, 李伟东. 两点混合环上的半在线算法 Semi-online Algorithms for Mixed Ring with Two Nodes 计算机科学, 2021, 48(11A): 441-445. https://doi.org/10.11896/jsjkx.201100153 |
|