Computer Science (计算机科学), 2019, Vol. 46, Issue (5): 36-43. doi: 10.11896/j.issn.1002-137X.2019.05.005

• Survey •


周卫星, 石海鹤   

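  1. (College of Computer Information and Engineering, Jiangxi Normal University, Nanchang 330022)
  • Received: 2018-08-09 Revised: 2018-12-13 Published: 2019-05-15
  • About the authors: ZHOU Wei-xing (born 1994), male, master's student, CCF student member; his main research interest is biological sequence analysis. SHI Hai-he (born 1979), female, Ph.D., professor, CCF member; her main research interests are bioinformatics, software engineering, and formal methods. E-mail: haiheshi@jxnu.edu.cn (corresponding author).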

Survey on Sequence Assembly Algorithms in High-throughput Sequencing

ZHOU Wei-xing, SHI Hai-he   

  1. (College of Computer Information and Engineering, Jiangxi Normal University, Nanchang 330022, China)
  • Received: 2018-08-09 Revised: 2018-12-13 Published: 2019-05-15

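Abstract: High-throughput sequencing (HTS) technology is a sequencing approach that emerged after first-generation sequencing and is also called next-generation sequencing. Whereas first-generation sequencing relies on automated and semi-automated capillary sequencing based on the Sanger method, HTS uses parallel sequencing based on pyrosequencing. This is an important advance over traditional sequencing: it overcomes the high cost, low throughput, and low speed of first-generation sequencing, and it meets the needs of rapidly developing modern molecular biology and genomics for low-cost, high-throughput, and fast sequencing. Compared with first-generation sequencing data, HTS data typically have short read lengths, uneven coverage, and low accuracy. Third-generation sequencing keeps the sequencing-by-synthesis idea of HTS but adopts the more efficient single-molecule real-time sequencing and nanopore sequencing technologies, giving it the advantages of high throughput, low cost, and long reads. To obtain a complete whole-genome sequence, biologists therefore need a technique that assembles short sequencing reads into one complete single-stranded gene sequence, and sequence assembly algorithms arose to meet this need. This paper first introduces the development background of sequence assembly algorithms and the related concepts of HTS, and analyzes the advantages of HTS for sequence assembly. It then summarizes the achievements in sequence assembly algorithms and describes them in three categories: those based on the greedy strategy, those based on the Overlap-Layout-Consensus (OLC) strategy, and those based on the De Bruijn Graph (DBG) strategy. Finally, it discusses related research directions and development trends of sequence assembly algorithms.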

Keywords: De Bruijn Graph, greedy, Overlap-Layout-Consensus, high-throughput sequencing technology, sequence assembly algorithms

Abstract: High-throughput sequencing (HTS) technology, also known as next-generation sequencing technology, is a sequencing method developed after first-generation sequencing technology. Unlike the automatic and semi-automatic capillary sequencing methods based on the Sanger method, HTS adopts parallel sequencing based on pyrosequencing. It not only overcomes the high cost, low throughput, and low speed of first-generation sequencing, but also meets the demands of rapidly developing modern molecular biology and genomics for low cost, high throughput, and fast speed. Compared with first-generation sequencing data, HTS data are characterized by short lengths, uneven coverage, and low accuracy. Third-generation sequencing retains the sequencing-by-synthesis principle of HTS while adopting more efficient single-molecule real-time sequencing and nanopore sequencing, which gives it the advantages of high throughput, low cost, and long reads. Therefore, to obtain a complete genome sequence, a technique is needed to assemble short sequencing reads into a complete single-stranded gene sequence, and sequence assembly algorithms were proposed for this purpose. First, the development background of sequence assembly algorithms and the related concepts of HTS are introduced, and the advantages of HTS for sequence assembly are analyzed. Second, the development of sequence assembly algorithms is summarized, and the algorithms are described in three categories: the greedy strategy, the Overlap-Layout-Consensus (OLC) strategy, and the De Bruijn Graph (DBG) strategy. Finally, the research directions and development trends of sequence assembly algorithms are discussed.
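
As a rough illustration of the greedy strategy named in the abstract, the Python sketch below repeatedly merges the two reads that share the longest suffix-prefix overlap until no pair overlaps enough. It is a toy example under stated assumptions and not the method of any particular assembler covered by the survey: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the sample reads are all invented here, and the sketch ignores sequencing errors, reverse complements, and repeats.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Greedily merge the best-overlapping pair of reads until none is left.

    Returns the list of remaining contigs (a single string in the ideal case).
    `min_len` is an assumed threshold, chosen only for this illustration.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        # Examine every ordered pair and remember the longest overlap.
        for a, b in permutations(contigs, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # no remaining pair overlaps by at least min_len
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Merge: keep all of best_a, then the non-overlapping tail of best_b.
        contigs.append(best_a + best_b[best_len:])
    return contigs


if __name__ == "__main__":
    # Three hypothetical error-free reads sampled from "ATGGCGTGCAAT".
    sample_reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT"]
    print(greedy_assemble(sample_reads))
```

Comparing every ordered pair in each round is quadratic in the number of reads, which is acceptable for a handful of reads but not for real sequencing data. The greedy choice can also misassemble repeated regions, because it commits to the locally best merge without considering the global structure of the overlaps.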

Key words: De Bruijn graph, Greedy, High-throughput sequencing, Overlap-layout-consensus, Sequence assembly algorithms


  • CLC Number: TP301.6
