ALGORITHMS FOR

NEXT-GENERATION SEQUENCING

Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

Version Date: 20170421

International Standard Book Number-13: 978-1-4665-6550-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface xi

1 Introduction 1
1.1 DNA, RNA, protein and cells . . . . . . . . . . . . . . . . . . 1
1.2 Sequencing technologies . . . . . . . . . . . . . . . . . . . . . 3
1.3 First-generation sequencing . . . . . . . . . . . . . . . . . . . 4
1.4 Second-generation sequencing . . . . . . . . . . . . . . . . . 6
1.4.1 Template preparation . . . . . . . . . . . . . . . . . . 6
1.4.2 Base calling . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.3 Polymerase-mediated methods based on reversible
terminator nucleotides . . . . . . . . . . . . . . . . . . 7
1.4.4 Polymerase-mediated methods based on unmodified
nucleotides . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.5 Ligase-mediated method . . . . . . . . . . . . . . . . . 11
1.5 Third-generation sequencing . . . . . . . . . . . . . . . . . . 12
1.5.1 Single-molecule real-time sequencing . . . . . . . . . . 12
1.5.2 Nanopore sequencing method . . . . . . . . . . . . . . 13
1.5.3 Direct imaging of DNA using electron microscopy . . 15
1.6 Comparison of the three generations of sequencing . . . . . . 16
1.7 Applications of sequencing . . . . . . . . . . . . . . . . . . . 17
1.8 Summary and further reading . . . . . . . . . . . . . . . . . 19
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 NGS file formats 21

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Raw data files: fasta and fastq . . . . . . . . . . . . . . . . . 22
2.3 Alignment files: SAM and BAM . . . . . . . . . . . . . . . . 24
2.3.1 FLAG . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 CIGAR string . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Bed format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Variant Call Format (VCF) . . . . . . . . . . . . . . . . . . . 29
2.6 Format for representing density data . . . . . . . . . . . . . 31
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3 Related algorithms and data structures 35

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Recursion and dynamic programming . . . . . . . . . . . . . 35
3.2.1 Key searching problem . . . . . . . . . . . . . . . . . . 36
3.2.2 Edit-distance problem . . . . . . . . . . . . . . . . . . 37
3.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Maximum likelihood . . . . . . . . . . . . . . . . . . . 39
3.3.2 Unobserved variable and EM algorithm . . . . . . . . 40
3.4 Hash data structures . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Maintain an associative array by simple hashing . . . 43
3.4.2 Maintain a set using a Bloom filter . . . . . . . . . . . 45
3.4.3 Maintain a multiset using a counting Bloom filter . . . 46
3.4.4 Estimating the similarity of two sets using minHash . 47
3.5 Full-text index . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Suffix trie and suffix tree . . . . . . . . . . . . . . . . 49
3.5.2 Suffix array . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 FM-index . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.3.1 Inverting the BWT B to the original text T 53
3.5.3.2 Simulate a suffix array using the FM-index . 54
3.5.3.3 Pattern matching . . . . . . . . . . . . . . . 55
3.5.4 Simulate a suffix trie using the FM-index . . . . . . . 55
3.5.5 Bi-directional BWT . . . . . . . . . . . . . . . . . . . 56
3.6 Data compression techniques . . . . . . . . . . . . . . . . . . 58
3.6.1 Data compression and entropy . . . . . . . . . . . . . 58
3.6.2 Unary, gamma, and delta coding . . . . . . . . . . . . 59
3.6.3 Golomb code . . . . . . . . . . . . . . . . . . . . . . . 60
3.6.4 Huffman coding . . . . . . . . . . . . . . . . . . . . . . 60
3.6.5 Arithmetic code . . . . . . . . . . . . . . . . . . . . . 62
3.6.6 Order-k Markov Chain . . . . . . . . . . . . . . . . . . 64
3.6.7 Run-length encoding . . . . . . . . . . . . . . . . . . . 65
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4 NGS read mapping 69

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Overview of the read mapping problem . . . . . . . . . . . . 70
4.2.1 Mapping reads with no quality score . . . . . . . . . . 70
4.2.2 Mapping reads with a quality score . . . . . . . . . . . 71
4.2.3 Brute-force solution . . . . . . . . . . . . . . . . . . . 72
4.2.4 Mapping quality . . . . . . . . . . . . . . . . . . . . . 74
4.2.5 Challenges . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Align reads allowing a small number of mismatches . . . . . 76
4.3.1 Mismatch seed hashing approach . . . . . . . . . . . . 77
4.3.2 Read hashing with a spaced seed . . . . . . . . . . . . 78
4.3.3 Reference hashing approach . . . . . . . . . . . . . . . 82
4.3.4 Suffix trie-based approaches . . . . . . . . . . . . . . . 84
4.3.4.1 Estimating the lower bound of the number of
mismatches . . . . . . . . . . . . . . . . . . . 87
4.3.4.2 Divide and conquer with the enhanced pigeon
hole principle . . . . . . . . . . . . . . . . . . 89
4.3.4.3 Aligning a set of reads together . . . . . . . 92
4.3.4.4 Speed up utilizing the quality score . . . . . 94
4.4 Aligning reads allowing a small number of mismatches
and indels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.1 q-mer approach . . . . . . . . . . . . . . . . . . . . . . 97
4.4.2 Computing alignment using a suffix trie . . . . . . . . 99
4.4.2.1 Computing the edit distance using a suffix trie 100
4.4.2.2 Local alignment using a suffix trie . . . . . . 103
4.5 Aligning reads in general . . . . . . . . . . . . . . . . . . . . 105
4.5.1 Seed-and-extension approach . . . . . . . . . . . . . . 107
4.5.1.1 BWA-SW . . . . . . . . . . . . . . . . . . . . 108
4.5.1.2 Bowtie 2 . . . . . . . . . . . . . . . . . . . . 109
4.5.1.3 BatAlign . . . . . . . . . . . . . . . . . . . . 110
4.5.1.4 Cushaw2 . . . . . . . . . . . . . . . . . . . . 111
4.5.1.5 BWA-MEM . . . . . . . . . . . . . . . . . . . 112
4.5.1.6 LAST . . . . . . . . . . . . . . . . . . . . . . 113
4.5.2 Filter-based approach . . . . . . . . . . . . . . . . . . 114
4.6 Paired-end alignment . . . . . . . . . . . . . . . . . . . . . . 116
4.7 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5 Genome assembly 123

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.2 Whole genome shotgun sequencing . . . . . . . . . . . . . . . 124
5.2.1 Whole genome sequencing . . . . . . . . . . . . . . . . 124
5.2.2 Mate-pair sequencing . . . . . . . . . . . . . . . . . . 126
5.3 De novo genome assembly for short reads . . . . . . . . . . . 126
5.3.1 Read error correction . . . . . . . . . . . . . . . . . . 128
5.3.1.1 Spectral alignment problem (SAP) . . . . . . 129
5.3.1.2 k-mer counting . . . . . . . . . . . . . . . . . 133
5.3.2 Base-by-base extension approach . . . . . . . . . . . . 138
5.3.3 De Bruijn graph approach . . . . . . . . . . . . . . . . 141
5.3.3.1 De Bruijn assembler (no sequencing error) . 143
5.3.3.2 De Bruijn assembler (with sequencing errors) 144
5.3.3.3 How to select k . . . . . . . . . . . . . . . . . 146
5.3.3.4 Additional issues of the de Bruijn graph
approach . . . . . . . . . . . . . . . . . . . . 147
5.3.4 Scaffolding . . . . . . . . . . . . . . . . . . . . . . . . 150
5.3.5 Gap filling . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.4 Genome assembly for long reads . . . . . . . . . . . . . . . . 154
5.4.1 Assemble long reads assuming long reads have a low
sequencing error rate . . . . . . . . . . . . . . . . . . . 155
5.4.2 Hybrid approach . . . . . . . . . . . . . . . . . . . . . 157
5.4.2.1 Use mate-pair reads and long reads to improve
the assembly from short reads . . . . . . . . 160
5.4.2.2 Use short reads to correct errors in long reads 160
5.4.3 Long read approach . . . . . . . . . . . . . . . . . . . 161
5.4.3.1 MinHash for all-versus-all pairwise alignment 162
5.4.3.2 Computing consensus using Falcon Sense . . 163
5.4.3.3 Quiver consensus algorithm . . . . . . . . . . 165
5.5 How to evaluate the goodness of an assembly . . . . . . . . . 168
5.6 Discussion and further reading . . . . . . . . . . . . . . . . . 168
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

6 Single nucleotide variation (SNV) calling 175

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.1.1 What are SNVs and small indels? . . . . . . . . . . . 175
6.1.2 Somatic and germline mutations . . . . . . . . . . . . 178
6.2 Determine variations by resequencing . . . . . . . . . . . . . 178
6.2.1 Exome/targeted sequencing . . . . . . . . . . . . . . . 179
6.2.2 Detection of somatic and germline variations . . . . . 180
6.3 Single locus SNV calling . . . . . . . . . . . . . . . . . . . . 180
6.3.1 Identifying SNVs by counting alleles . . . . . . . . . . 181
6.3.2 Identify SNVs by binomial distribution . . . . . . . . 182
6.3.3 Identify SNVs by Poisson-binomial distribution . . . . 184
6.3.4 Identifying SNVs by the Bayesian approach . . . . . . 185
6.4 Single locus somatic SNV calling . . . . . . . . . . . . . . . . 187
6.4.1 Identify somatic SNVs by the Fisher exact test . . . . 187
6.4.2 Identify somatic SNVs by verifying that the SNVs
appear in the tumor only . . . . . . . . . . . . . . . . 188
6.4.2.1 Identify SNVs in the tumor sample by
posterior odds ratio . . . . . . . . . . . . . . 188
6.4.2.2 Verify if an SNV is somatic by the posterior
odds ratio . . . . . . . . . . . . . . . . . . . . 191
6.5 General pipeline for calling SNVs . . . . . . . . . . . . . . . 192
6.6 Local realignment . . . . . . . . . . . . . . . . . . . . . . . . 193
6.7 Duplicate read marking . . . . . . . . . . . . . . . . . . . . . 195
6.8 Base quality score recalibration . . . . . . . . . . . . . . . . 195
6.9 Rule-based filtering . . . . . . . . . . . . . . . . . . . . . . . 198
6.10 Computational methods to identify small indels . . . . . . . 199
6.10.1 Split-read approach . . . . . . . . . . . . . . . . . . . 199
6.10.2 Span distribution-based clustering approach . . . . . . 200
6.10.3 Local assembly approach . . . . . . . . . . . . . . . . 203
6.11 Correctness of existing SNV and indel callers . . . . . . . . . 204
6.12 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7 Structural variation calling 209

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2 Formation of SVs . . . . . . . . . . . . . . . . . . . . . . . . 211
7.3 Clinical effects of structural variations . . . . . . . . . . . . . 214
7.4 Methods for determining structural variations . . . . . . . . 215
7.5 CNV calling . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.5.1 Computing the raw read count . . . . . . . . . . . . . 218
7.5.2 Normalize the read counts . . . . . . . . . . . . . . . . 219
7.5.3 Segmentation . . . . . . . . . . . . . . . . . . . . . . . 219
7.6 SV calling pipeline . . . . . . . . . . . . . . . . . . . . . . . . 222
7.6.1 Insert size estimation . . . . . . . . . . . . . . . . . . . 222
7.7 Classifying the paired-end read alignments . . . . . . . . . . 223
7.8 Identifying candidate SVs from paired-end reads . . . . . . . 226
7.8.1 Clustering approach . . . . . . . . . . . . . . . . . . . 227
7.8.1.1 Clique-finding approach . . . . . . . . . . . . 228
7.8.1.2 Confidence interval overlapping approach . . 229
7.8.1.3 Set cover approach . . . . . . . . . . . . . . . 233
7.8.1.4 Performance of the clustering approach . . . 236
7.8.2 Split-mapping approach . . . . . . . . . . . . . . . . . 236
7.8.3 Assembly approach . . . . . . . . . . . . . . . . . . . . 237
7.8.4 Hybrid approach . . . . . . . . . . . . . . . . . . . . . 238
7.9 Verify the SVs . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.10 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

8 RNA-seq 245
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.2 High-throughput methods to study the transcriptome . . . . 247
8.3 Application of RNA-seq . . . . . . . . . . . . . . . . . . . . . 248
8.4 Computational Problems of RNA-seq . . . . . . . . . . . . . 250
8.5 RNA-seq read mapping . . . . . . . . . . . . . . . . . . . . . 250
8.5.1 Features used in RNA-seq read mapping . . . . . . . . 250
8.5.1.1 Transcript model . . . . . . . . . . . . . . . . 250
8.5.1.2 Splice junction signals . . . . . . . . . . . . . 252
8.5.2 Exon-first approach . . . . . . . . . . . . . . . . . . . 253
8.5.3 Seed-and-extend approach . . . . . . . . . . . . . . . . 256
8.6 Construction of isoforms . . . . . . . . . . . . . . . . . . . . 260
8.7 Estimating expression level of each transcript . . . . . . . . . 261
8.7.1 Estimating transcript abundances when every read
maps to exactly one transcript . . . . . . . . . . . . . 261
8.7.2 Estimating transcript abundances when a read maps to
multiple isoforms . . . . . . . . . . . . . . . . . . . . . 264
8.7.3 Estimating gene abundance . . . . . . . . . . . . . . . 266
8.8 Summary and further reading . . . . . . . . . . . . . . . . . 268
8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

9 Peak calling methods 271

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
9.2 Techniques that generate density-based datasets . . . . . . . 271
9.2.1 Protein DNA interaction . . . . . . . . . . . . . . . . . 271
9.2.2 Epigenetics of our genome . . . . . . . . . . . . . . . . 273
9.2.3 Open chromatin . . . . . . . . . . . . . . . . . . . . . 274
9.3 Peak calling methods . . . . . . . . . . . . . . . . . . . . . . 274
9.3.1 Model fragment length . . . . . . . . . . . . . . . . . . 276
9.3.2 Modeling noise using a control library . . . . . . . . . 279
9.3.3 Noise in the sample library . . . . . . . . . . . . . . . 280
9.3.4 Determination if a peak is significant . . . . . . . . . . 281
9.3.5 Unannotated high copy number regions . . . . . . . . 283
9.3.6 Constructing a signal profile by Kernel methods . . . 284
9.4 Sequencing depth of the ChIP-seq libraries . . . . . . . . . . 285
9.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 286
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

10 Data compression techniques used in NGS files 289

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
10.2 Strategies for compressing fasta/fastq files . . . . . . . . . . 290
10.3 Techniques to compress identifiers . . . . . . . . . . . . . . . 290
10.4 Techniques to compress DNA bases . . . . . . . . . . . . . . 291
10.4.1 Statistical-based approach . . . . . . . . . . . . . . . . 291
10.4.2 BWT-based approach . . . . . . . . . . . . . . . . . . 292
10.4.3 Reference-based approach . . . . . . . . . . . . . . . . 295
10.4.4 Assembly-based approach . . . . . . . . . . . . . . . . 297
10.5 Quality score compression methods . . . . . . . . . . . . . . 299
10.5.1 Lossless compression . . . . . . . . . . . . . . . . . . . 300
10.5.2 Lossy compression . . . . . . . . . . . . . . . . . . . . 301
10.6 Compression of other NGS data . . . . . . . . . . . . . . . . 302
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

References 307

Index 339
Preface

Next-generation sequencing (NGS) is a recently developed technology that
enables us to generate hundreds of billions of DNA bases from the samples.
We can use NGS to reconstruct the genome, understand genomic variations,
recover transcriptomes, and identify the transcription factor binding sites
or the epigenetic marks.

The NGS technology radically changes how we collect genomic data from
the samples. Instead of studying a particular gene or a particular genomic
region, NGS technologies enable us to perform genome-wide studies in an
unbiased manner. Although more raw data can be obtained from sequencing
machines, we face computational challenges in analyzing such big datasets.
Hence, it is important to develop efficient and accurate computational
methods to analyze or process such datasets. This book is intended to give
an in-depth introduction to such algorithmic techniques.

The primary audience of this book includes advanced undergraduate
students and graduate students from mathematics or computer science
departments. We assume that readers have some training in college-level
biology, statistics, discrete mathematics and algorithms.

This book was developed partly from the teaching material for the course
on Combinatorial Methods in Bioinformatics, which I taught at the National
University of Singapore, Singapore. The chapters in this book are classified
based on the application domains of the NGS technologies. In each chapter, a
brief introduction to the technology is first given. Then, different methods or
algorithms for analyzing such NGS datasets are described. To illustrate each
algorithm, detailed examples are given. At the end of each chapter, exercises
are given to help readers understand the topics.

Chapter 1 introduces the next-generation sequencing technologies. We
cover the three generations of sequencing, starting from Sanger sequencing
(first generation). Then, we cover second-generation sequencing, which
includes Illumina Solexa sequencing. Finally, we describe the third-generation
sequencing technologies, which include PacBio sequencing and nanopore
sequencing.

Chapter 2 introduces a few NGS file formats, which facilitate downstream
analysis and data transfer. They include fasta, fastq, SAM, BAM, BED, VCF
and WIG formats. Fasta and fastq are file formats for describing the raw
sequencing reads generated by the sequencers. SAM and BAM are file formats
for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.

To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 briefly describes these techniques.

Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper aligns each raw read on the reference genome, that is, it
identifies the region in the reference genome that is highly similar to the
read. Then, the read mapper outputs all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6-9.)

Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct novel genomes, such as those of plants and
bacteria. De novo assembly involves a few steps: error correction, contig
assembly (de Bruijn graph approach or base-by-base extension approach),
scaffolding and gap filling. This chapter describes techniques developed for
these steps.

Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 3000 bases and 1 small indel in
every 1000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.

Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. However, they
are more likely to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.

All of the above technologies are related to genome sequencing. We can also
sequence RNA. This technology is known as RNA-seq. Chapter 8 studies
methods for analyzing RNA-seq data. By applying computational methods to
RNA-seq data, we can recover the transcriptome. More precisely, RNA-seq
enables us to identify exons and splice junctions. Then, we can predict the
isoforms of the genes. We can also determine the expression of each
transcript and each gene.

By combining chromatin immunoprecipitation and next-generation
sequencing, we can sequence genome regions that are bound by some
transcription factors or that carry epigenetic marks. This technology is
known as ChIP-seq. The computational methods that identify those binding
sites are known as ChIP-seq peak callers. Chapter 9 is devoted to discussing
computational methods for this purpose.

As stated earlier, NGS data is huge, and the NGS data files are usually
big. It is difficult to store and transfer NGS files. One solution is to
compress the NGS data files. A number of compression methods have been
developed, and some of the compression formats, such as BAM, bigBed and
bigWig, are used frequently in the literature. Chapter 10 aims to describe
these compression techniques. We also describe techniques that enable us to
randomly access the compressed NGS data files.

Supplementary material can be found at
http://www.comp.nus.edu.sg/~ksung/algo_in_ngs/.

I would like to thank my PhD supervisors Tak-Wah Lam and Hing-Fung
Ting and my collaborators Francis Y. L. Chin, Kwok Pui Choi, Edwin
Cheung, Axel Hillmer, Wing Kai Hon, Jansson Jesper, Ming-Yang Kao,
Caroline Lee, Nikki Lee, Hon Wai Leong, Alexander Lezhava, John Luk,
See-Kiong Ng, Franco P. Preparata, Yijun Ruan, Kunihiko Sadakane, Chialin Wei,
Limsoon Wong, Siu-Ming Yiu, and Louxin Zhang. My knowledge of NGS and
bioinformatics was enriched through numerous discussions with them. I would
like to thank Ramesh Rajaby, Kunihiko Sadakane, Chandana Tennakoon,
Hugo Willy, and Han Xu for helping to proofread some of the chapters. I
would also like to thank my parents Kang Fai Sung and Siu King Wong, my
three brothers Wing Hong Sung, Wing Keung Sung, and Wing Fu Sung, my
wife Lily Or, and my three kids Kelly, Kathleen and Kayden for their support.

Finally, if you have any suggestions for improvement or if you identify any
errors in the book, please send an email to me at ksung@comp.nus.edu.sg. I
thank you in advance for your helpful comments in improving the book.

Wing-Kin Sung
Chapter 1
Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by
Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod
and McCarty [12] demonstrated that DNA is the major carrier of genetic
information, not protein. In 1953, James Watson and Francis Crick discovered
the basic structure of DNA, which is a double helix [310]. After that, people
started to work on DNA intensively.

DNA sequencing sprang to life in 1972, when Frederick Sanger (at the
University of Cambridge, England) began to work on the genome sequence using
a variation of the recombinant DNA method. The full DNA sequence of a viral
genome (bacteriophage φX174) was completed by Sanger in 1977 [259, 260].
Based on the power of sequencing, Sanger established genomics,¹ which is the
study of the entirety of an organism’s hereditary information, encoded in DNA
(or RNA for certain viruses). Note that it is different from molecular biology
or genetics, whose primary focus is to investigate the roles and functions of
single genes.

During the last decades, DNA sequencing has improved rapidly. We can
sequence the whole human genome within a day and compare multiple
individual human genomes. This book is devoted to understanding the
bioinformatics issues related to DNA sequencing. In this introduction, we
briefly review DNA, RNA and protein. Then, we describe various sequencing
technologies. Lastly, we describe the applications of sequencing technologies.

¹ The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a
geneticist at the Jackson Laboratory (Bar Harbor, ME), at a meeting held in Maryland on
the mapping of the human genome in 1986.

1.1 DNA, RNA, protein and cells

Deoxyribonucleic acid (DNA) is used as the genetic material (with the
exception that certain viruses use RNA as the genetic material). The basic
building block of DNA is the DNA nucleotide. There are 4 types of DNA
nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA
nucleotides can be chained together to form a strand of DNA. Each strand of
DNA is asymmetric. It begins at the 5′ end and ends at the 3′ end.

When two opposing DNA strands satisfy the Watson-Crick rule, they can
be interwoven together by hydrogen bonds and form a double-stranded DNA.
The Watson-Crick rule (or complementary base pairing rule) requires that the
two nucleotides in opposing strands be a complementary base pair, that is,
they must be an (A, T) pair or a (C, G) pair. (Note that A = T and C ≡ G are
bound with the help of two and three hydrogen bonds, respectively.) Figure 1.1
gives an example of a double-stranded DNA. One strand is ACGTAGCT while the
other strand is its reverse complement, i.e., AGCTACGT.

5′ − A  C   G   T  A  G   C   T  − 3′
     || ||| ||| || || ||| ||| ||
3′ − T  G   C   A  T  C   G   A  − 5′

FIGURE 1.1: The double-stranded DNA. The two strands show a
complementary base pairing.

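
The reverse-complement example above can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the names COMPLEMENT, HYDROGEN_BONDS, reverse_complement and hydrogen_bond_count are illustrative assumptions. It applies the Watson-Crick rule base by base and counts hydrogen bonds using the two-bond (A = T) and three-bond (C ≡ G) figures quoted above.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds per base pair, as stated in the text (A=T: 2, C≡G: 3).
HYDROGEN_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(strand: str) -> str:
    """Return the opposing strand, written from its own 5' end to its 3' end."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bond_count(strand: str) -> int:
    """Count the hydrogen bonds between `strand` and its complementary strand."""
    return sum(HYDROGEN_BONDS[base] for base in strand)


print(reverse_complement("ACGTAGCT"))   # AGCTACGT
print(hydrogen_bond_count("ACGTAGCT"))  # 20
```

Taking the reverse complement twice returns the original strand, which is a convenient sanity check when handling both strands of a genome.
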
The double-stranded DNAs are located in the nucleus (and mitochondria)
of every cell. A cell can contain multiple pieces of double-stranded DNA, each
of which is called a chromosome. As a whole, the collection of chromosomes is
called a genome; the human genome consists of 23 pairs of chromosomes, and
its total length is roughly 3 billion base pairs.

The genome provides the instructions for the cell to perform daily life
functions. Through the process of transcription, the enzyme RNA polymerase
transcribes genes (the basic functional units) in our genome into transcripts
(or RNA molecules). This process is known as gene expression. The complete
set of transcripts in a cell is denoted as its transcriptome.

Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides:
adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference
between the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide
has an extra OH group. This extra OH group enables the RNA nucleotide to
form more hydrogen bonds. Transcripts are usually single stranded instead of
double stranded.

There are two types of transcripts: non-coding RNA (ncRNA) and messenger
RNA (mRNA). ncRNAs are transcripts that are not translated into proteins.
They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs),
short ncRNAs (of length < 30 bp, including miRNA, siRNA and piRNA) and
long ncRNAs (of length > 200 bp; examples include Xist and HOTAIR).

mRNA is the intermediate between DNA and protein. Each mRNA
consists of three parts: a 5′ untranslated region (5′ UTR), a coding region and
a 3′ untranslated region (3′ UTR). The length of the coding region is a
multiple of 3. It is a sequence of triplets of nucleotides called codons. Each
codon corresponds to an amino acid.

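
To make the codon structure concrete, here is a small Python sketch (an illustration, not taken from the book). It splits a coding region into triplets and maps each codon to a one-letter amino acid code. The codon table is an assumption brought in from outside the text: it is the standard genetic code written in the common compact form, with bases ordered U, C, A, G and "*" marking a stop codon. The function names split_codons and translate are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order for each codon position; "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(coding_region: str) -> list[str]:
    """Cut a coding region into consecutive triplets (codons)."""
    if len(coding_region) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [coding_region[i:i + 3] for i in range(0, len(coding_region), 3)]


def translate(coding_region: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for codon in split_codons(coding_region):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(split_codons("AUGGCCUGGUAA"))  # ['AUG', 'GCC', 'UGG', 'UAA']
print(translate("AUGGCCUGGUAA"))     # MAW
```

The sketch takes only the coding region as input; locating that region between the 5′ UTR and the 3′ UTR of a real mRNA is a separate problem that it does not attempt.
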
Through translation, the ribosome translates each mRNA into a
protein, which is the sequence of amino acids corresponding to the sequence of
codons in the mRNA. Proteins form complex 3D structures. Each protein is
a biological nanomachine that performs a specialized function. For example,
enzymes are proteins that work as catalysts to promote chemical reactions
for generating energy or digesting food. Other proteins, called transcription
factors, interact with the genome to turn transcription on or off. Through
the interaction among DNA, RNA and protein, our genome dictates which
cells should grow, when cells should die and how cells should be structured,
and it creates various body parts.

All cells in our body are developed from a single cell through cell division.
When a cell divides, the double helix genome is separated into single-stranded
DNA molecules. An enzyme called DNA polymerase uses each single-stranded
DNA molecule as the template to replicate the genome into two identical
double helixes. By this replication process, all cells within the same individual
will have the same genome. However, due to errors in copying, some variations
(called mutations) might happen in some cells. Those variations or mutations
may cause diseases such as cancer.

Different individuals have similar genomes, but they also have genome
variations that contribute to different phenotypes. For example, the colors of
our hair and eyes are controlled by the differences in our genomes. By
studying and comparing genomes of different individuals, researchers develop
an understanding of the factors that cause different phenotypes and diseases.
Such knowledge ultimately helps to gain insights into the mystery of life and
contributes to improving human health.

1.2 Sequencing technologies

DNA sequencing is a process that determines the order of the nucleotide
bases. It translates the DNA of a specific organism into a format that is
decipherable by researchers and scientists. DNA sequencing has allowed
scientists to better understand genes and their roles within our body. Such
knowledge has become indispensable for understanding biological processes,
as well as in application fields such as diagnostic or forensic research. The
advent of DNA sequencing has significantly accelerated biological research
and discovery.
To facilitate genomic studies, we need to sequence the genomes of different
species or different individuals. A number of sequencing technologies have
been developed during the last decades. Roughly speaking, the development
of sequencing technologies consists of three phases:

• First-generation sequencing: Sequencing based on chemical degradation
and gel electrophoresis.

• Second-generation sequencing: Sequencing many DNA fragments in
parallel. It has higher yield and lower cost, but shorter reads.

• Third-generation sequencing: Sequencing a single DNA molecule
without the need to halt between read steps.

In the following sections, we will discuss the three phases in detail.

1.3 First-generation sequencing

Sanger and Coulson proposed first-generation sequencing in 1975 [259,
260]. It enables us to sequence a DNA template of length 500-1000 bp within
a few hours. The detailed steps are as follows (see Figure 1.3).

1. Amplify the DNA template by cloning.

2. Generate all possible prefixes of the DNA template.

3. Separate the prefixes by electrophoresis.

4. Read out the sequence with fluorescent tags.

Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two techniques
have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on cleaving the DNA at specific nucleotides
with chemicals. Four different chemical treatments are used to generate all
sequences ending with A, C, G and T, respectively. This allows us to generate
all possible prefixes of the template. This technique is most efficient for
short DNA sequences. However, it is considered unsafe because of the
extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alternative.
Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural
deoxynucleotides (dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve
as non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA
synthesis reaction is randomly terminated whenever a ddNTP is added to the
growing oligonucleotide chain, resulting in truncated products of varying
lengths, each with the corresponding ddNTP at its 3’ terminus.
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA
fragments by their lengths using gel electrophoresis (Step 3). Gel
electrophoresis is based on the fact that DNA is negatively charged. When an
electrical field is applied to a mixture of DNA on a gel, the DNA fragments
will move from the negative pole to the positive pole. Due to friction, short
DNA fragments travel faster than long DNA fragments. Hence, gel
electrophoresis separates the mixture into bands, each containing DNA
molecules of the same length.

[Figure 1.2: two reaction diagrams. (a) Growing strand + dATP -> extended
strand + H+ + PPi. (b) Growing strand + ddATP -> terminated strand + H+ + PPi.]

FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.

[Figure 1.3: workflow diagram. DNA template -> insert into vector -> insert
into host cell -> cloning -> cyclic sequencing -> electrophoresis & readout.
With the template 3’-GCATCGGCATATG...-5’ and the primer 5’-CGTA, cyclic
sequencing yields the prefixes CGTAG, CGTAGC, ..., CGTAGCCGTATAC, and the
readout is GCCGTATAC.]

FIGURE 1.3: The steps of Sanger sequencing.
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
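
Steps 2 to 4 can be mimicked in a few lines of Python. This is only a toy sketch under simplifying assumptions: every possible prefix is produced exactly once, the gel separates fragments perfectly by length, and the dye of each fragment is read without error. The function name `sanger_readout` is hypothetical; the example uses the primer CGTA and the extended strand shown in Figure 1.3.

```python
import random


def sanger_readout(synthesized_strand, primer_length=0):
    """Toy model of Steps 2-4 for the strand synthesized from the template."""
    # Step 2: chain termination yields every prefix that extends the primer.
    fragments = [synthesized_strand[:n]
                 for n in range(primer_length + 1, len(synthesized_strand) + 1)]
    # The reaction product is an unordered mixture of these fragments.
    random.Random(0).shuffle(fragments)
    # Step 3: electrophoresis orders the fragments by length (shortest first).
    fragments.sort(key=len)
    # Step 4: the dye on the terminal ddNTP reveals the last base of each band.
    return "".join(fragment[-1] for fragment in fragments)


readout = sanger_readout("CGTAGCCGTATAC", primer_length=4)
```

Reading the bands from shortest to longest gives the bases added after the primer, which is why separation by length is enough to recover the sequence.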
In summary, the Sanger method can generate sequences of length ∼800 bp.
The process can be fully automated and hence it was a popular DNA
sequencing method in 1970-2000. However, it is expensive and its throughput
is low: it can only process a limited number of DNA fragments per unit of time.

1.4 Second-generation sequencing

Second-generation sequencing can generate hundreds of millions of short
reads per instrument run. When compared with first-generation sequencing,
it has the following advantages: (1) it uses clone-free amplification, and (2) it
can sequence many reads in parallel. Some commercially available technologies
include Roche/454, Illumina, ABI SOLiD, Ion Torrent, Helicos BioSciences
and Complete Genomics.
In general, second-generation sequencing involves the following two main
steps: (1) template preparation and (2) base calling in parallel. Section 1.4.1
describes Step 1 while Sections 1.4.2-1.4.5 describe Step 2.

1.4.1 Template preparation

Given a set of DNA fragments, the template preparation step first generates
a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template on a bead. First, one DNA template
and one bead are enclosed within a water drop in oil. The surface of every
bead is coated with a primer corresponding to one type of adaptor. The DNA
template hybridizes with one primer on the surface of the bead. Then, it is
PCR amplified within the water drop. Figure 1.4(b) illustrates emPCR.
emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficiency. This limitation creates uneven sequencing of the DNA
templates in the sample.
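
[Figure 1.4: three panels. (a) A DNA fragment with adaptors attached to both
ends. (b) emPCR: templates and beads in a water drop in oil; a template binds
to the bead, followed by PCR for a few rounds. (c) Bridge PCR.]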

FIGURE 1.4: (a) From the DNA fragments, DNA template is created by
attaching the two ends with adaptor sequences. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.

1.4.2 Base calling

Now we have many PCR clones of amplified templates (see Figure 1.5(a)).
This step aims to read the DNA sequences from the amplified templates in
parallel. This is called the cyclic-array method. There are two approaches:
the polymerase-mediated method (also called sequencing by synthesis) and
the ligase-mediated method (also called sequencing by ligation). The
polymerase-mediated method is further divided into methods based on
reversible terminator nucleotides and methods based on unmodified
nucleotides. Below, we will discuss these approaches.

1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides
A reversible terminator nucleotide is a modified nucleotide. Similar to
ddNTPs, during DNA polymerase-dependent synthesis, if a reversible
terminator nucleotide is incorporated onto the DNA template, the DNA
synthesis is terminated. Unlike with ddNTPs, however, we can reverse the
termination and restart the DNA synthesis.
Figure 1.5(b) demonstrates how we use reversible terminator nucleotides
for sequencing. First, we hybridize the primer on the adaptor of the template.
Then, by DNA polymerase, a reversible terminator nucleotide is incorporated
onto the template. After that, we scan the signal of the dye attached to the
reversible terminator nucleotide by imaging. After imaging, the termination
is reversed by cleaving the dye-nucleotide linker. By repeating the steps, we
can sequence the complete DNA template.

[Figure 1.5: (a) PCR clones on a flat surface. (b) Cycle diagram: add
reversible terminator dGTP; wash & scan; after scanning, reverse the
termination; repeat the steps to sequence other bases.]

FIGURE 1.5: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) PCR clones of the DNA templates are evenly
distributed on a flat surface. Each PCR clone contains many DNA templates of
the same type. (b) The steps of polymerase-mediated methods based on
reversible terminator nucleotides.
Two commercial sequencers use this approach: Illumina and Helicos
BioSciences.
The Illumina sequencer amplifies the DNA templates by bridge PCR.
Then, all PCR clones are distributed on the glass plate. By using the four-
color cyclic reversible termination (CRT) cycle (see Figure 1.6(b)), we can
sequence all the DNA templates in parallel.
The major error of Illumina sequencing is substitution error, with a higher
portion of errors occurring when the previously incorporated nucleotide is
a G.
Another major issue of Illumina sequencing is that the accuracy decreases
with increasing nucleotide addition steps. The errors accumulate due to
failures in cleaving off the fluorescent tags or due to errors in controlling the
reversible terminator nucleotides. As a result, bases fail to get incorporated
into the template strand or extra bases get incorporated [190].

[Figure 1.6: (a) a flat surface with PCR clones; (b) four-color CRT cycles;
(c) one-color CRT cycles with the base order A, C, G, T, A, C, G, T.]

FIGURE 1.6: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) A flat surface with many PCR clones. In
particular, we show the DNA templates for two clones. (b) Four-color cyclic
reversible termination (CRT) cycle. Within each cycle, we extend the template
of each PCR clone by one base. The color indicates the extended base.
Precisely, the four colors, dark gray, black, white and light gray, correspond
to the four nucleotides A, C, G and T, respectively. (c) One-color cyclic
reversible termination (CRT) cycle. Each cycle tries to extend the template of
each PCR clone by one particular base. If the extension is successful, the
white color is lit up.
Helicos BioSciences does not perform PCR amplification. It is a
single-molecule sequencing method. It first immobilizes the DNA templates on
a flat surface. Then, all DNA templates on the surface are sequenced in
parallel by using a one-color cyclic reversible termination (CRT) cycle (see
Figure 1.6(c)). Note that this technology can also be used to sequence RNA
directly by using reverse transcriptase instead of DNA polymerase. However,
the reads generated by Helicos BioSciences are very short (∼25 bp). It is also
slow and expensive.
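
To contrast the four-color and one-color CRT cycles, here is a small Python sketch. It is an idealized model, not a description of any vendor's software: it assumes error-free chemistry, treats the input as the strand being synthesized, and assumes the one-color cycles offer bases in the fixed order A, C, G, T. All function names are hypothetical.

```python
def four_color_crt(read):
    """Four-color CRT: every cycle extends by one base; the color names it."""
    return list(read)  # one observed color (base) per cycle


def one_color_crt(read, base_order="ACGT"):
    """One-color CRT: each cycle offers one base type; record hit or miss."""
    if set(read) - set(base_order):
        raise ValueError("read contains a base that is never offered")
    signals, pos, cycle = [], 0, 0
    while pos < len(read):
        offered = base_order[cycle % len(base_order)]
        lit = read[pos] == offered
        signals.append(lit)
        if lit:
            pos += 1  # the terminator allows at most one base per cycle
        cycle += 1
    return signals


def decode_one_color(signals, base_order="ACGT"):
    """Recover the read from the cycles in which the template lit up."""
    return "".join(base_order[i % len(base_order)]
                   for i, lit in enumerate(signals) if lit)


signals = one_color_crt("GATC")
decoded = decode_one_color(signals)
```

In this model the four-color scheme needs one cycle per base, while the one-color scheme needs 10 cycles for the 4-base read GATC, which is consistent with the remark that the one-color approach is slow.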

1.4.4 Polymerase-mediated methods based on unmodified nucleotides
The previous methods require the use of modified nucleotides. Actually,
we can sequence the DNA templates using unmodified nucleotides. The basic
observation is that the incorporation of a deoxyribonucleotide triphosphate
(dNTP) into a growing DNA strand involves the formation of a covalent bond
and the release of pyrophosphate and a positively charged hydrogen ion (see
Figure 1.2). Hence, it is possible to sequence the DNA template by detecting
the concentration change of pyrophosphate or hydrogen ion. Roche 454 and
Ion Torrent are two sequencers which take advantage of this principle.
The Roche 454 sequencer performs sequencing by detecting pyrophosphates.
This is called pyrosequencing. First, the 454 sequencer uses emPCR to
amplify the templates. Then, amplified beads are loaded into an array of wells.
(Each well contains one amplified bead which corresponds to one DNA
template.) In each iteration, a single type of dNTP flows across the wells. If the
dNTP is complementary to the template in a well, polymerase will extend by
one base and release pyrophosphate. With the help of the enzymes sulfurylase
and luciferase, the pyrophosphate is converted into visible light. A CCD camera
detects the light signal from all wells in parallel. For each well, the light
intensity generated is recorded as a flowgram. For example, if the DNA template
in a well is TCGGTAAAAAACAGTTTCCT, Figure 1.7 is the corresponding
flowgram. Precisely, the light signal can be detected only when the dNTP that
flows across the well is complementary to the template. If the template has
a homopolymer of length k, the light intensity detected is k-fold higher. By
interpreting the flowgram, we can recover the DNA sequence.
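
The flowgram idea can be sketched in Python as follows. This is an idealized model with noise-free intensities; it assumes the flow order A, C, G, T shown on the axis of Figure 1.7 and, as a simplification, treats the example sequence as the strand being synthesized. The names `flowgram` and `decode_flowgram` are hypothetical.

```python
def flowgram(read, flow_order="ACGT"):
    """Ideal signal per flow: the length of the homopolymer run incorporated."""
    if set(read) - set(flow_order):
        raise ValueError("read contains a base that is never flowed")
    signals, pos = [], 0
    while pos < len(read):
        for base in flow_order:  # one full round of flows
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append(run)  # 0 means no incorporation, hence no light
    return signals


def decode_flowgram(signals, flow_order="ACGT"):
    """Recover the sequence by repeating each flowed base `signal` times."""
    return "".join(flow_order[i % len(flow_order)] * n
                   for i, n in enumerate(signals))


example = "TCGGTAAAAAACAGTTTCCT"
signals = flowgram(example)
```

For the example sequence this yields 20 flows with a peak of 6 at the AAAAAA homopolymer. A real detector measures these intensities with noise, which is where the homopolymer length errors discussed next come from.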
However, when the homopolymer is long (say longer than 6), the detector
is not sensitive enough to report the correct length of the homopolymer.
Therefore, the Roche 454 sequencer has a higher rate of indel errors.
Ion Torrent was created by the same person who created 454. It is the first
semiconductor sequencing chip available on the commercial market. Instead of
detecting pyrophosphate, it performs sequencing by detecting hydrogen ions.
The basic method of Ion Torrent is the same as that of Roche 454. It also uses
emPCR to amplify the templates and the amplified beads are also loaded into
a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is
complementary to the template, polymerase will extend by one base and release
H+. The release of H+ changes the pH of the solution in the well, and an
ISFET sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the
flowgram obtained from the ISFET sensor, we can recover the sequences of the
templates.

[Figure 1.7: bar chart of light intensity (scale 1 to 6) for each flow in the
order ACGTACGTACGTACGTACGT.]

FIGURE 1.7: The flowgram for the DNA sequence TCGGTAAAAAACAGTTTCCT.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot distinguish long homopolymers.

1.4.5 Ligase-mediated method

Instead of extending the template base by base using polymerase,
ligase-mediated methods use probes to check the bases on the template. ABI
SOLiD is the commercial sequencer that uses this approach. In SOLiD, the
templates are first amplified by emPCR. After that, millions of templates are
placed on a plate. SOLiD then tries to probe the bases of all templates in
parallel. In every iteration, SOLiD probes two adjacent bases of each template,
i.e., it uses two-base color encoding. The color coding scheme is shown in the
following table. For example, the DNA template ATGGA is coded as A3102.

      A   C   G   T
  A   0   1   2   3
  C   1   0   3   2
  G   2   3   0   1
  T   3   2   1   0

The primary advantage of the two-base color encoding is that it improves
single nucleotide variation (SNV) calling. Since every base is covered by
two color bases, it reduces the error rate for calling SNVs. However, conversion
from color bases to nucleotide bases is not simple. Errors may be generated
during the conversion process.
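
The two-base color encoding can be reproduced with a short Python sketch. One observation used here, which is not stated in the text but can be checked against the table entry by entry, is that numbering the bases A=0, C=1, G=2, T=3 makes each color the bitwise XOR of the two base numbers. The function names are hypothetical.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def to_color_space(sequence):
    """Encode a sequence as its first base followed by one color per base pair."""
    colors = "".join(str(CODE[a] ^ CODE[b])
                     for a, b in zip(sequence, sequence[1:]))
    return sequence[0] + colors


def from_color_space(encoded):
    """Decode by walking from the known first base, one color at a time."""
    bases = [encoded[0]]
    for color in encoded[1:]:
        bases.append(BASES[CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


encoded = to_color_space("ATGGA")
# A single wrong color (second color 1 -> 0) corrupts every later base.
corrupted = from_color_space("A3002")
```

Decoding the corrupted color string gives ATTTC instead of ATGGA: every base after the erroneous color changes. This illustrates why the conversion from color bases to nucleotide bases can generate errors.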
In summary, second-generation sequencing enables us to generate hundreds
of billions of bases per run. However, each run takes days to finish due to a
large number of scanning and washing cycles. The addition of a base in each
cycle is not 100% correct, which causes sequencing errors. Furthermore, base
extensions of some strands may lag behind or run ahead. Hence, errors
accumulate as the reads get longer. This is the reason why second-generation
sequencing cannot produce very long reads. Furthermore, due to the PCR
amplification bias, this approach may miss some templates with high or low
GC content.

1.5 Third-generation sequencing

Although many of us are still using second-generation sequencing, third-
generation sequencing is coming. There is no fixed definition for third-
generation sequencing yet. Here, we define it as a single molecule sequencing
(SMS) technology without the need to halt between read steps (whether
enzymatic or otherwise). A number of third-generation sequencing methods have
been proposed. They include:

• Single-molecule real-time sequencing

• Nanopore-sequencing technologies

• Direct imaging of individual DNA molecules using advanced microscopy
techniques

1.5.1 Single-molecule real-time sequencing

Pacific BioSciences released their PacBio RS sequencing platform [71].
Their approach is called single-molecule real-time (SMRT) sequencing. It
mimics what happens in our body as cells divide and copy their DNA with the
DNA polymerase machine. Precisely, PacBio RS immobilizes DNA polymerase
molecules on an array slide. When a DNA template gets in touch with a
DNA polymerase, DNA synthesis happens with four fluorescently labeled
nucleotides. By detecting the light emitted, PacBio RS reconstructs the DNA
sequences. Figure 1.8 illustrates the SMRT sequencing approach.
PacBio RS sequencing requires no prior amplification of the DNA template.
Hence, it has no PCR bias. It can achieve more uniform coverage and lower GC
bias when compared with Illumina sequencing [79]. It can read long sequences
of length up to 20,000 bp, with an average read length of about 10,000 bp.
Another advantage of PacBio RS is that it can detect methylation status
simultaneously.

[Figure 1.8: an array slide with immobilized polymerase molecules.]

FIGURE 1.8: The illustration of PacBio sequencing. On an array slide,
there are a number of immobilized DNA polymerase molecules. When a DNA
template gets in touch with the DNA polymerase (see the polymerase at the
bottom right), DNA synthesis happens with the fluorescently labeled
nucleotides. By detecting the emitted light signal, we can reconstruct the
DNA sequence.

However, PacBio sequencing is more costly. It is about 3-4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate,
up to 17.9% [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.

1.5.2 Nanopore sequencing method

A nanopore is a pore of nano size on a thin membrane. When a voltage
is applied across the membrane, charged molecules that are small enough can
move from the negative well to the positive well. Moreover, molecules with
different structures will have different efficiencies in passing through the pore
and affect the electrical conductivity. By studying the electrical conductivity,
we can determine the molecules that pass through the pore.
This idea has been used in a number of methods for sequencing DNA.
These methods are called nanopore sequencing methods. Since nanopore
methods use unmodified DNA, they require an extremely small amount of input
material. They also have the potential to sequence long DNA reads efficiently
at low cost. There are a number of companies working on nanopore
sequencing methods. They include (1) Oxford Nanopore, (2) IBM
Transistor-mediated DNA sequencing, (3) Genia and (4) NABsys.
Oxford Nanopore technology detects nucleotides by measuring the ionic
current flowing through the pore. It allows the single-stranded DNA sequence
to flow through the pore continuously. As illustrated in Figure 1.9, DNA
material is placed in the top chamber. The positive charge draws a strand of
DNA from the top chamber to the bottom chamber through the nanopore. By
detecting the difference in the electrical conductivity in the pore, the DNA
sequence is decoded. (Note that IBM’s DNA transistor is a prototype which
uses a similar idea.)

FIGURE 1.9: An illustration of the sequencing technique of Oxford
Nanopore.

The approach has difficulty in calling individual bases accurately.
Instead, Oxford Nanopore technology reads the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA sequence can be
decoded base by base.
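
To make the k-mer decoding idea concrete, the following Python sketch runs a Viterbi search over k-mer states. It is a toy model and not Oxford Nanopore's actual base caller: it assumes k = 3 to keep the state space small, a made-up signal table in which the i-th k-mer produces current level i, exactly one measurement per k-mer, Gaussian-like noise (squared-error cost) and uniform transition probabilities. All names are hypothetical.

```python
from itertools import product

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
# Hypothetical signal model: k-mer number i produces current level i.
LEVEL = {kmer: float(i) for i, kmer in enumerate(KMERS)}


def viterbi_decode(signal):
    """Most likely base sequence for a list of per-k-mer current levels."""
    # Cost = squared distance between the observed and the expected level.
    cost = {kmer: (signal[0] - LEVEL[kmer]) ** 2 for kmer in KMERS}
    backpointers = []
    for observed in signal[1:]:
        new_cost, pointer = {}, {}
        for kmer in KMERS:
            # A k-mer can only follow a k-mer that overlaps it by k-1 bases.
            previous = min((b + kmer[:-1] for b in "ACGT"),
                           key=cost.__getitem__)
            new_cost[kmer] = cost[previous] + (observed - LEVEL[kmer]) ** 2
            pointer[kmer] = previous
        backpointers.append(pointer)
        cost = new_cost
    # Trace back from the cheapest final k-mer.
    kmer = min(cost, key=cost.__getitem__)
    path = [kmer]
    for pointer in reversed(backpointers):
        kmer = pointer[kmer]
        path.append(kmer)
    path.reverse()
    return path[0] + "".join(k[-1] for k in path[1:])


def simulate_signal(sequence, noise=(0.2, -0.3, 0.1)):
    """Toy levels for successive k-mers, with small deterministic noise."""
    kmers = [sequence[i:i + K] for i in range(len(sequence) - K + 1)]
    return [LEVEL[kmer] + noise[i % len(noise)] for i, kmer in enumerate(kmers)]


decoded = viterbi_decode(simulate_signal("ACGTTGCA"))
```

The overlap constraint between consecutive k-mers is what lets the model output one new base per step even though each measurement depends on k bases.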
Oxford Nanopore has announced two sequencers: MinION and GridION.
MinION is a disposable USB-key sequencer. GridION is an expandable
sequencer. Oxford Nanopore claimed that GridION can sequence a human
genome at 30x coverage in 6 days at US$2200-$3600. It has the potential to
decode a DNA fragment of length 100,000 bp. Its cost is about US$25-$40 per
gigabase. Although it is not expensive, the error rate is about 17.8% (4.9%
insertion error, 7.8% deletion error and 5.1% substitution error) [115].
Unlike Oxford nanopore technology, Genia suggested combining nanopore
and the DNA polymerase to sequence a single-strand DNA template. In Genia,
the DNA polymerase is tethered with a biological nanopore. When a DNA
template gets in touch with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T, each attached with a
different short tag. When a nucleotide is incorporated into the DNA template,
the tag is cleaved and it will travel through the biological nanopore and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100,000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments
are bound by the probes. Those DNA fragments with bound probes are
driven through a nanopore (see Figure 1.10(a)), creating a current-versus-time
tracing. The trace gives the positions of the probes on the fragment.
(See Figure 1.10(b).) We can align the fragments based on their inter-probe
distances; then, we obtain a probe map for the genome (see Figure 1.10(c)).
We can obtain the probe maps for different probes. By aligning all of them,
we obtain the whole genome.

[Figure 1.10: panels (a), (b) and (c).]

FIGURE 1.10: Consider a DNA fragment hybridized with a particular
probe. After it passes through the nanopore (see (a)), an electrical signal
profile is obtained (see (b)). By aligning the electrical signal profiles
generated from a set of DNA fragments, we obtain the probe map for a genome
(see (c)).
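
The probe-map idea described above can be sketched in Python. This is a simplified illustration, not NABsys's method: it assumes probe positions are observed exactly, models hybridization as an exact string match of a hypothetical probe GATC, and aligns two fragments by matching the tail of one distance list with the head of the other. All names and the toy genome are assumptions.

```python
def probe_positions(fragment, probe):
    """Start positions at which the probe sequence occurs on the fragment."""
    positions = []
    start = fragment.find(probe)
    while start != -1:
        positions.append(start)
        start = fragment.find(probe, start + 1)
    return positions


def inter_probe_distances(positions):
    """Distances between consecutive probe positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def overlap_offset(dist_a, dist_b, min_shared=1):
    """Smallest i such that dist_a[i:] matches the start of dist_b, else None."""
    for i in range(len(dist_a) - min_shared + 1):
        tail = dist_a[i:]
        if dist_b[:len(tail)] == tail:
            return i
    return None


probe = "GATC"
# Toy genome with probe sites at positions 2, 10, 16, 26 and 31.
genome = "TT" + probe + "AAAA" + probe + "CC" + probe + "TTTTTT" + probe + "A" + probe
fragment_a, fragment_b = genome[:24], genome[8:]
dist_a = inter_probe_distances(probe_positions(fragment_a, probe))
dist_b = inter_probe_distances(probe_positions(fragment_b, probe))
offset = overlap_offset(dist_a, dist_b)
```

Here the two fragments share the inter-probe distance 6, so they can be placed relative to each other without reading any individual base. Real data would need tolerance for measurement error and more than one shared distance to be convincing.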

Unlike Genia, Oxford Nanopore technology and the IBM DNA transistor,
NABsys does not require a very accurate current measurement from the
nanopore. The company claims that this method is cheap, fast and accurate,
and that its read length is long.

1.5.3 Direct imaging of DNA using electron microscopy

Another choice is to use direct imaging. ZS Genetics is developing
methods based on transmission electron microscopy (TEM). Reveo is developing
a technology based on scanning tunneling microscope (STM) tips. DNA is
placed on a conductive surface for detecting bases electronically using STM
tips and tunneling current measurements. Both approaches have the potential
to sequence very long reads (millions of bases) at low cost. However, they are
still in the development phase. No sequencing machine is available yet.

TABLE 1.1: Comparison of the three generations of sequencing

                   First generation     Second generation      Third generation
Amplification      In-vivo cloning      In-vitro PCR           Single molecule
                   and amplification
Sequencing         Electrophoresis      Cyclic array           Nanopore, electron
                                        sequencing             microscopy or
                                                               real-time monitoring
                                                               of PCR
Starting material  More                 Less (< 1 µg)          Even less
Cost               Expensive            Cheap                  Very cheap
Time               Very slow            Fast                   Very fast
Read length        About 800 bp         Short                  Very long
Accuracy           < 1% error           < 1% error (mismatch   High error rate
                                        or homopolymer error)

1.6 Comparison of the three generations of sequencing

We have discussed the technologies of the three generations of sequencing.
Table 1.1 summarizes their key features. Currently, we are in the late phase
of second-generation sequencing and at the early phase of third-generation
sequencing. We can already see a dramatic drop in sequencing cost. Figure 1.11
shows the sequencing cost over time. Cost per genome is calculated based on
6-fold coverage for Sanger sequencing, 10-fold coverage for 454 sequencing
and 30-fold coverage for Illumina (or SOLiD) sequencing. Note that the
sequencing cost does not include the data management cost and the
bioinformatics analysis cost. There was a sudden reduction in sequencing cost
in January 2008, which is due to the introduction of second-generation
sequencing. In the future, the sequencing cost is expected to drop further.
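
Since the cost per genome depends on the fold coverage, it may help to see how coverage translates into a number of reads. The Python sketch below uses the usual definition of coverage (total sequenced bases divided by genome size). The human genome size of 3 billion bp and the Illumina read length of 100 bp are round-number assumptions; the Sanger read length of 800 bp follows Section 1.3. The function name is hypothetical.

```python
import math


def reads_needed(genome_size, read_length, coverage):
    """Number of reads whose total length equals coverage * genome size."""
    return math.ceil(coverage * genome_size / read_length)


HUMAN_GENOME_SIZE = 3_000_000_000  # assumption: about 3 billion bp

# 6-fold coverage with Sanger reads of about 800 bp.
sanger_reads = reads_needed(HUMAN_GENOME_SIZE, 800, 6)
# 30-fold coverage with short reads of 100 bp (assumed read length).
short_reads = reads_needed(HUMAN_GENOME_SIZE, 100, 30)
```

Under these assumptions, 6-fold Sanger coverage needs 22.5 million reads while 30-fold short-read coverage needs 900 million reads, which is only practical when reads are produced in parallel.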

[Figure 1.11: two curves, cost per Mb of DNA bases and cost per genome,
plotted on a logarithmic cost axis ($0.01 to $100,000,000) from September 2001
to September 2015.]

FIGURE 1.11: The sequencing cost over time. There are two curves. The
blue curve shows the sequencing cost per million sequenced bases while
the red curve shows the sequencing cost per human genome. (Data is obtained
from http://www.genome.gov/sequencingcosts.)

1.7 Applications of sequencing

The previous sections describe three generations of sequencing. This
section describes their applications.
Genome assembly: Genome assembly aims to reconstruct the genome
of some species. Since our genome is long, we still cannot read the whole
chromosome in one step. The current solution is to sequence the fragments of
the genome one by one using a sequencer. Then, by overlapping the fragments
computationally, the complete genome is reconstructed.
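
As a rough illustration of reconstructing a genome by overlapping fragments, here is a greedy merging sketch in Python. It is a toy heuristic under strong assumptions (error-free fragments from one strand, no repeats, no fragment contained in another) and is not the assembly method developed later in the book. The function names and the example fragments are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0  # no overlap of at least min_length


def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_length)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more; keep the contigs separate
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


contigs = greedy_assemble(["GCCGATA", "ACGTTGCA", "TGCATGCC"])
```

The three toy fragments merge into the single contig ACGTTGCATGCCGATA. With real reads, sequencing errors and repeats make the problem much harder.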
Many genome assembly projects have been finished. We have obtained
the reference genomes for many species, including human, mouse, rice, etc.
The human genome project is probably the most important genome assembly
project. This project started in 1984 and was declared complete in 2003. The
project cost more than US$3 billion. Although it was expensive, the project
enables us to have a better understanding of the human genome. Given the
human reference genome, researchers can examine the list of genes in
humans. We know that the number of protein-coding genes in humans is about
20,000, and that they cover only 3% of the whole genome. Subsequently, we can
also understand the differences among individuals and understand the
differences between cancerous and healthy human genomes.
The project also improved the genome assembly process. It led to the
whole genome shotgun approach, which is the most common assembly
approach nowadays. By coupling the whole genome shotgun approach and
next-generation sequencing, we obtain the reference genomes of many species.
(See Chapter 5 for methods to reconstruct a genome.)
Genome variations finding: The genome of each individual is different
from the reference human genome. Roughly speaking, there are four
types of genome variations: single nucleotide variations (SNVs), short indels,
copy number variations (CNVs) and structural variations (SVs). Figure 6.1
illustrates these four types of variations. Genome variations can cause
cancer. For example, in chronic myelogenous leukemia (CML), a translocation
exists between chromosome 9 and chromosome 22, which fuses the ABL1 and
BCR genes together to form a fusion gene, BCR-ABL1. Such a translocation
is known to be present in 95 percent of patients with CML. Another example
is a deletion in chromosome 21 that fuses the ERG and TMPRSS2
genes. The TMPRSS2-ERG fusion is seen in approximately 50 percent of
prostate cancers, and researchers have found that this fusion enhances the
invasiveness of prostate cancer. Genome sequencing of cancers enables us to
identify the variations of each individual. Apart from genome variations in
cancers, many novel disease-causing variations have been discovered for
childhood diseases and neurological diseases. In the future, we expect everyone
will undergo genome sequencing. Depending on the variations, different
personalized therapeutics can be applied to different patients. This is known
as personalized medicine or stratified medicine. (See Chapters 6 and 7 for
methods to call genome variations.)
Reconstructing the transcriptome: Although every human cell has
the same human genome, human cells in different tissues express different
sets of genes at different times. The set of genes expressed in a particular cell
type is called its transcriptome. In the past, the transcriptome was extracted
using technologies like microarrays. However, microarrays can only report the
expression of known genes. They fail to discover novel splice variants and
novel genes. Due to the advance in sequencing technologies, we can use
RNA-seq to solve these problems. We can not only measure gene expression more
accurately, but can also discover novel genes and novel splice variants. (See
Chapter 8 for methods to analyze RNA-seq data.)
Decoding the transcriptional regulation: Some proteins called
transcription factors (TFs) bind to the genome and regulate the expression of
genes. If a TF fails to bind to the genome, the corresponding target gene will
fail to express and the cell cannot function properly. For example, one type
of breast cancer is ER+ cancer. In ER+ cancer cells, ER, GATA3 and FoxA1
form a functional enhanceosome that regulates a set of genes and drives the
core ERα function. It is important to understand how they work together.
To know the binding sites of each TF, we can apply ChIP-seq. ChIP-seq is
a sequencing protocol that enables us to identify the binding sites of each TF
on a genome-wide scale. By studying the ChIP-seq data, we can understand
how TFs work together, the relationship between TFs and transcriptomes,
etc. (See Chapter 9 for methods to analyze ChIP-seq data.)
Many other applications: Apart from the above applications, sequencing
has been applied to many other research areas, including metagenomics,
3D modeling of the genome, etc.

1.8 Summary and further reading

This chapter summarizes the three generations of sequencing. It also briefly
describes their applications. For a survey of second-generation sequencing,
please refer to [200]. For more detail on third-generation sequencing, please
refer to [263].

1.9 Exercises
1. Consider the DNA sequence 5’-ACTCAGTTCG-3’. What is its reverse
complement? The SOLiD sequencer will output color-based sequences.
What is the expected color-based sequence for the above DNA sequence
and its reverse complement? Do you observe an interesting property?

2. Should we always use second- or third-generation sequencing instead of
first-generation sequencing? If not, when should we use Sanger
sequencing?
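
For Exercise 1, a small script can be used to check a hand-derived answer. The sketch below is only an illustration, not the sequencer's actual software: it assumes the commonly described SOLiD two-base encoding, in which each pair of adjacent bases maps to one of four colors, and it realizes that mapping as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The names `BASE_CODE`, `COMPLEMENT`, `reverse_complement` and `color_space` are hypothetical, and the leading primer base that real color-space reads carry is omitted.

```python
# Minimal sketch, assuming the standard two-base (color-space) encoding.
# The 2-bit base codes are an assumption chosen so that XOR of adjacent
# bases gives the color: 0 for AA/CC/GG/TT, 1 for AC/CA/GT/TG,
# 2 for AG/GA/CT/TC, 3 for AT/TA/CG/GC.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def color_space(seq):
    """Return the list of colors (0-3), one per pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


if __name__ == "__main__":
    dna = "ACTCAGTTCG"
    rc = reverse_complement(dna)
    print("sequence:          ", dna, color_space(dna))
    print("reverse complement:", rc, color_space(rc))
```

Running the script prints both sequences next to their color lists, so the two lists can be compared side by side when looking for the property the exercise asks about.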
References
1000 Genomes Project Consortium , G. R. Abecasis , A. Auton , L. D. Brooks , M. A. DePristo , R. M. Durbin , R. E. Handsaker , H. M.
Kang , G. T. Marth , and G. A. McVean . An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65,
Nov 2012.
A. Abyzov , A. E. Urban , M. Snyder , and M. Gerstein . CNVnator: An approach to discover, genotype, and characterize typical and
atypical CNVs from family and population genome sequencing. Genome Research, 21(6):974–984, Jun 2011.
E. Ahrné , L. Molzahn , T. Glatter , and A. Schmidt . Critical assessment of proteome-wide label-free absolute abundance estimation
strategies. Proteomics, 13(17):2567–2578, Sep 2013.
C. A. Albers , G. Lunter , D. G. MacArthur , G. McVean , W. H. Ouwehand , and R. Durbin . Dindel: Accurate indel calls from short-read
data. Genome Research, 21(6):961–973, Jun 2011.
S. F. Altschul , T. L. Madden , A. A. Schäffer , J. Zhang , Z. Zhang , W. Miller , and D. J. Lipman . Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, Sep 1997.
A. Ameur , A. Wetterbom , L. Feuk , and U. Gyllensten . Global and unbiased detection of splice junctions from RNA-seq data. Genome
Biol, 11(3):R34, 2010.
A. Amir , M. Lewenstein , and E. Porat . Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2):257–275,
Feb 2004.
E. L. Anson and E. W. Myers . ReAligner: A program for refining DNA sequence multi-alignments. Journal of Computational Biology,
4(3):369–383, 1997.
P. N. Ariyaratne and W.-K. Sung . PE-Assembler: De novo assembler using short paired-end reads. Bioinformatics, 27(2):167–174, 2011.
K. F. Au , H. Jiang , L. Lin , Y. Xing , and W. H. Wong . Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic
Acids Res, 38(14):4570–4578, Aug 2010.
P. N. C. B. Audergon , S. Catania , A. Kagansky , P. Tong , M. Shukla , A. L. Pidoux , and R. C. Allshire . Restricted epigenetic inheritance
of H3K9 methylation. Science, 348(6230):132–135, Apr 2015.
O. T. Avery , C. M. Macleod , and M. McCarty . Studies on the chemical nature of the substance inducing transformation of pneumococcal
types: Induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med, 79(2):137–158,
Feb 1944.
A. Bankevich , S. Nurk , D. Antipov , A. A. Gurevich , M. Dvorkin , A. S. Kulikov , V. M. Lesin , S. I. Nikolenko , S. Pham , A. D. Prjibelski ,
A. V. Pyshkin , A. V. Sirotkin , N. Vyahhi , G. Tesler , M. A. Alekseyev , and P. A. Pevzner . SPAdes: A new genome assembly algorithm
and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, May 2012.
V. Bansal , O. Harismendy , R. Tewhey , S. S. Murray , N. J. Schork , E. J. Topol , and K. A. Frazer . Accurate detection and genotyping of
SNPs utilizing population sequencing data. Genome Research, 20(4):537–545, Apr 2010.
H. Bao , Y. Xiong , H. Guo , R. Zhou , X. Lu , Z. Yang , Y. Zhong , and S. Shi . MapNext: A software tool for spliced and unspliced
alignments and SNP detection of short sequence reads. BMC Genomics, 10 Suppl 3:S13, 2009.
D. W. Barnett , E. K. Garrison , A. R. Quinlan , M. P. Strömberg , and G. T. Marth . BamTools: A C++ API and toolkit for analyzing and
managing BAM files. Bioinformatics, 27(12):1691–1692, Jun 2011.
C. Bartenhagen and M. Dugas . Robust and exact structural variation detection with paired-end and soft-clipped alignments: SoftSV
compared with eight algorithms. Brief Bioinform, 17(1):51–62, Jan 2016.
Y. Benjamini and T. P. Speed . Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res,
40(10):e72, May 2012.
D. R. Bentley , S. Balasubramanian , H. P. Swerdlow , G. P. Smith , J. Milton , C. G. Brown , K. P. Hall , D. J. Evers , C. L. Barnes , H. R.
Bignell , J. M. Boutell , J. Bryant , R. J. Carter , R. Keira Cheetham , A. J. Cox , D. J. Ellis , M. R. Flatbush , N. A. Gormley , S. J.
Humphray , L. J. Irving , M. S. Karbelashvili , S. M. Kirk , H. Li , X. Liu , K. S. Maisinger , L. J. Murray , B. Obradovic , T. Ost , M. L.
Parkinson , M. R. Pratt , I. M. J. Rasolonjatovo , M. T. Reed , R. Rigatti , C. Rodighiero , M. T. Ross , A. Sabot , S. V. Sankar , A. Scally ,
G. P. Schroth , M. E. Smith , V. P. Smith , A. Spiridou , P. E. Torrance , S. S. Tzonev , E. H. Vermaas , K. Walter , X. Wu , L. Zhang , M. D.
Alam , C. Anastasi , I. C. Aniebo , D. M. D. Bailey , I. R. Bancarz , S. Banerjee , S. G. Barbour , P. A. Baybayan , V. A. Benoit , K. F.
Benson , C. Bevis , P. J. Black , A. Boodhun , J. S. Brennan , J. A. Bridgham , R. C. Brown , A. A. Brown , D. H. Buermann , A. A. Bundu ,
J. C. Burrows , N. P. Carter , N. Castillo , M. Chiara E Catenazzi , S. Chang , R. Neil Cooley , N. R. Crake , O. O. Dada , K. D.
Diakoumakos , B. Dominguez-Fernandez , D. J. Earnshaw , U. C. Egbujor , D. W. Elmore , S. S. Etchin , M. R. Ewan , M. Fedurco , L. J.
Fraser , K. V. Fuentes Fajardo , W. Scott Furey , D. George , K. J. Gietzen , C. P. Goddard , G. S. Golda , P. A. Granieri , D. E. Green , D.
L. Gustafson , N. F. Hansen , K. Harnish , C. D. Haudenschild , N. I. Heyer , M. M. Hims , J. T. Ho , A. M. Horgan , K. Hoschler , S. Hurwitz
, D. V. Ivanov , M. Q. Johnson , T. James , T. A. Huw Jones , G.-D. Kang , T. H. Kerelska , A. D. Kersey , I. Khrebtukova , A. P. Kindwall ,
Z. Kingsbury , P. I. Kokko-Gonzales , A. Kumar , M. A. Laurent , C. T. Lawley , S. E. Lee , X. Lee , A. K. Liao , J. A. Loch , M. Lok , S. Luo ,
R. M. Mammen , J. W. Martin , P. G. McCauley , P. McNitt , P. Mehta , K. W. Moon , J. W. Mullens , T. Newington , Z. Ning , B. Ling Ng ,
S. M. Novo , M. J. O’Neill , M. A. Osborne , A. Osnowski , O. Ostadan , L. L. Paraschos , L. Pickering , A. C. Pike , A. C. Pike , D. Chris
Pinkard , D. P. Pliskin , J. Podhasky , V. J. Quijano , C. Raczy , V. H. Rae , S. R. Rawlings , A. Chiva Rodriguez , P. M. Roe , J. Rogers ,
M. C. Rogert Bacigalupo , N. Romanov , A. Romieu , R. K. Roth , N. J. Rourke , S. T. Ruediger , E. Rusman , R. M. Sanches-Kuiper , M.
R. Schenker , J. M. Seoane , R. J. Shaw , M. K. Shiver , S. W. Short , N. L. Sizto , J. P. Sluis , M. A. Smith , J. Ernest Sohna Sohna , E. J.
Spence , K. Stevens , N. Sutton , L. Szajkowski , C. L. Tregidgo , G. Turcatti , S. Vandevondele , Y. Verhovsky , S. M. Virk , S. Wakelin ,
G. C. Walcott , J. Wang , G. J. Worsley , J. Yan , L. Yau , M. Zuerlein , J. Rogers , J. C. Mullikin , M. E. Hurles , N. J. McCooke , J. S. West
, F. L. Oaks , P. L. Lundberg , D. Klenerman , R. Durbin , and A. J. Smith . Accurate whole human genome sequencing using reversible
terminator chemistry. Nature, 456(7218):53–59, Nov 2008.
K. Berlin , S. Koren , C.-S. Chin , J. P. Drake , J. M. Landolin , and A. M. Phillippy . Assembling large genomes with single-molecule
sequencing and locality-sensitive hashing. Nat Biotechnol, 33(6):623–630, Jun 2015.
T. R. Bhangale , M. J. Rieder , R. J. Livingston , and D. A. Nickerson . Comprehensive identification and characterization of diallelic
insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet, 14(1):59–69, Jan 2005.
J. Blom , T. Jakobi , D. Doppmeier , S. Jaenicke , J. Kalinowski , J. Stoye , and A. Goesmann . Exact and complete short-read alignment to
microbial genomes using graphics processing unit programming. Bioinformatics, 27(10):1351–1358, May 2011.
B. H. Bloom . Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, July 1970.
M. Boetzer , C. V. Henkel , H. J. Jansen , D. Butler , and W. Pirovano . Scaffolding pre-assembled contigs using SSPACE. Bioinformatics,
27(4):578–579, Feb 2011.
V. Boeva , A. Zinovyev , K. Bleakley , J.-P. Vert , I. Janoueix-Lerosey , O. Delattre , and E. Barillot . Control-free calling of copy number
alterations in deep-sequencing data using GC-content normalization. Bioinformatics, 27(2):268–269, Jan 2011.
J. K. Bonfield and M. V. Mahoney . Compression of FASTQ and SAM format sequencing data. PLoS One, 8(3):e59190, 2013.
F. Bonomi , M. Mitzenmacher , R. Panigrahy , S. Singh , and G. Varghese . An improved construction for counting Bloom filters. In
European Symp on Algorithms (ESA), pages 684–695, 2006.
A. Bowe , T. Onodera , K. Sadakane , and T. Shibuya . Succinct de Bruijn graphs. In Workshop on Algorithms in Bioinformatics (WABI),
pages 225–235, 2012.
A. P. Boyle , J. Guinney , G. E. Crawford , and T. S. Furey . F-Seq: A feature density estimator for high-throughput sequence tags.
Bioinformatics, 24(21):2537–2538, Nov 2008.
A. P. Boyle , L. Song , B.-K. Lee , D. London , D. Keefe , E. Birney , V. R. Iyer , G. E. Crawford , and T. S. Furey . High-resolution genome-
wide in vivo footprinting of diverse transcription factors in human cells. Genome Research, 21(3):456–464, 2011.
S. Brenner , M. Johnson , J. Bridgham , G. Golda , D. H. Lloyd , D. Johnson , S. Luo , S. McCurdy , M. Foy , M. Ewan , R. Roth , D.
George , S. Eletr , G. Albrecht , E. Vermaas , S. R. Williams , K. Moon , T. Burcham , M. Pallas , R. B. DuBridge , J. Kirchner , K. Fearon ,
J. Mao , and K. Corcoran . Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat
Biotechnol, 18(6):630–634, Jun 2000.
J. D. Buenrostro , P. G. Giresi , L. C. Zaba , H. Y. Chang , and W. J. Greenleaf . Transposition of native chromatin for fast and sensitive
epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods, 10(12):1213–1218, 2013.
S. Burkhardt and J. Kärkkäinen . Better filtering with gapped q-grams. In Combinatorial Pattern Matching (CPM), pages 73–85, 2001.
J. Butler , I. MacCallum , M. Kleber , I. A. Shlyakhter , M. K. Belmonte , E. S. Lander , C. Nusbaum , and D. B. Jaffe . ALLPATHS: De novo
assembly of whole-genome shotgun microreads. Genome Research, 18(5):810–820, May 2008.
J. S. Carroll , C. A. Meyer , J. Song , W. Li , T. R. Geistlinger , J. Eeckhoute , A. S. Brodsky , E. K. Keeton , K. C. Fertuck , G. F. Hall , Q.
Wang , S. Bekiranov , V. Sementchenko , E. A. Fox , P. A. Silver , T. R. Gingeras , X. S. Liu , and M. Brown . Genome-wide analysis of
estrogen receptor binding sites. Nat Genet, 38(11):1289–1297, Nov 2006.
N. P. Carter . Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet, 39(7 Suppl):S16–S21, Jul
2007.
M. Chaisson , P. Pevzner , and H. Tang . Fragment assembly with short reads. Bioinformatics, 20(13):2067–2074, Sep 2004.
M. J. Chaisson and P. A. Pevzner . Short read fragment assembly of bacterial genomes. Genome Research, 18(2):324–330, Feb 2008.
M. J. Chaisson and G. Tesler . Mapping single molecule sequencing reads using basic local alignment with successive refinement
(BLASR): Application and theory. BMC Bioinformatics, 13:238, 2012.
M. J. P. Chaisson , J. Huddleston , M. Y. Dennis , P. H. Sudmant , M. Malig , F. Hormozdiari , F. Antonacci , U. Surti , R. Sandstrom , M.
Boitano , J. M. Landolin , J. A. Stamatoyannopoulos , M. W. Hunkapiller , J. Korlach , and E. E. Eichler . Resolving the complexity of the
human genome using single-molecule sequencing. Nature, 517(7536):608–611, Jan 2015.
G. Chen , C. Wang , and T. Shi . Overview of available methods for diverse RNA-seq data analyses. Sci China Life Sci,
54(12):1121–1128, Dec 2011.
K. Chen , L. Chen , X. Fan , J. Wallis , L. Ding , and G. Weinstock . TIGRA: A targeted iterative graph routing assembler for breakpoint
assembly. Genome Research, 24(2):310–317, Feb 2014.
K. Chen , J. W. Wallis , M. D. McLellan , D. E. Larson , J. M. Kalicki , C. S. Pohl , S. D. McGrath , M. C. Wendl , Q. Zhang , D. P. Locke , X.
Shi , R. S. Fulton , T. J. Ley , R. K. Wilson , L. Ding , and E. R. Mardis . BreakDancer: An algorithm for high-resolution mapping of
genomic structural variation. Nat Methods, 6(9):677–681, Sep 2009.
D. Y. Chiang , G. Getz , D. B. Jaffe , M. J. T. O’Kelly , X. Zhao , S. L. Carter , C. Russ , C. Nusbaum , M. Meyerson , and E. S. Lander .
High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods, 6(1):99–103, Jan 2009.
C.-S. Chin , D. H. Alexander , P. Marks , A. A. Klammer , J. Drake , C. Heiner , A. Clum , A. Copeland , J. Huddleston , E. E. Eichler , S.
W. Turner , and J. Korlach . Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods,
10(6):563–569, Jun 2013.
C.-S. Chin , J. Sorenson , J. B. Harris , W. P. Robins , R. C. Charles , R. R. Jean-Charles , J. Bullard , D. R. Webster , A. Kasarskis , P.
Peluso , E. E. Paxinos , Y. Yamaichi , S. B. Calderwood , J. J. Mekalanos , E. E. Schadt , and M. K. Waldor . The origin of the Haitian
cholera outbreak strain. N Engl J Med, 364(1):33–42, Jan 2011.
H. Chitsaz , J. L. Yee-Greenbaum , G. Tesler , M.-J. Lombardo , C. L. Dupont , J. H. Badger , M. Novotny , D. B. Rusch , L. J. Fraser , N.
A. Gormley , O. Schulz-Trieglaff , G. P. Smith , D. J. Evers , P. A. Pevzner , and R. S. Lasken . Efficient de novo assembly of single-cell
bacterial genomes from short-read data sets. Nat Biotechnol, 29(10):915–921, Oct 2011.
M. Choi , U. I. Scholl , W. Ji , T. Liu , I. R. Tikhonova , P. Zumbo , A. Nayir , A. Bakkaloğlu , S. Ozen , S. Sanjad , C. Nelson-Williams , A.
Farhi , S. Mane , and R. P. Lifton . Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci
USA, 106(45):19096–19101, Nov 2009.
H.-T. Chu , W. W. L. Hsiao , J.-C. Chen , T.-J. Yeh , M.-H. Tsai , H. Lin , Y.-W. Liu , S.-A. Lee , C.-C. Chen , T. T. H. Tsao , and C.-Y. Kao .
EBARDenovo: Highly accurate de novo assembly of RNA-seq with efficient chimera-detection. Bioinformatics, 29(8):1004–1010, Apr
2013.
K. Cibulskis , M. S. Lawrence , S. L. Carter , A. Sivachenko , D. Jaffe , C. Sougnez , S. Gabriel , M. Meyerson , E. S. Lander , and G. Getz
. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 31(3):213–219, Mar 2013.
K. Cibulskis , A. McKenna , T. Fennell , E. Banks , M. DePristo , and G. Getz . ContEst: Estimating cross-contamination of human samples
in next-generation sequencing data. Bioinformatics, 27(18):2601–2602, Sep 2011.
N. Cloonan , A. R. R. Forrest , G. Kolle , B. B. A. Gardiner , G. J. Faulkner , M. K. Brown , D. F. Taylor , A. L. Steptoe , S. Wani , G. Bethel
, A. J. Robertson , A. C. Perkins , S. J. Bruce , C. C. Lee , S. S. Ranade , H. E. Peckham , J. M. Manning , K. J. McKernan , and S. M.
Grimmond . Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods, 5(7):613–619, Jul 2008.
N. Cloonan , Q. Xu , G. J. Faulkner , D. F. Taylor , D. T. P. Tang , G. Kolle , and S. M. Grimmond . RNA-MATE: A recursive mapping
strategy for high-throughput RNA-sequencing data. Bioinformatics, 25(19):2615–2616, Oct 2009.
D. F. Conrad , D. Pinto , R. Redon , L. Feuk , O. Gokcumen , Y. Zhang , J. Aerts , T. D. Andrews , C. Barnes , P. Campbell , T. Fitzgerald ,
M. Hu , C. H. Ihm , K. Kristiansson , D. G. Macarthur , J. R. Macdonald , I. Onyiah , A. W. C. Pang , S. Robson , K. Stirrups , A. Valsesia ,
K. Walter , J. Wei , Wellcome Trust Case Control Consortium , C. Tyler-Smith , N. P. Carter , C. Lee , S. W. Scherer , and M. E. Hurles . Origins and functional impact
of copy number variation in the human genome. Nature, 464(7289):704–712, Apr 2010.
T. A. Cooper , L. Wan , and G. Dreyfuss . RNA and disease. Cell, 136(4):777–793, Feb 2009.
A. J. Cox , M. J. Bauer , T. Jakobi , and G. Rosone . Large-scale compression of genomic sequence databases with the Burrows-Wheeler
transform. Bioinformatics, 28(11):1415–1419, Jun 2012.
M. P. Cox , D. A. Peterson , and P. J. Biggs . SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data.
BMC Bioinformatics, 11:485, 2010.
R. Dahm . Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum Genet, 122(6):565–581, Jan 2008.
K. Daily , P. Rigor , S. Christley , X. Xie , and P. Baldi . Data structures and compression algorithms for high-throughput sequencing
technologies. BMC Bioinformatics, 11:514, 2010.
P. Danecek , A. Auton , G. Abecasis , C. A. Albers , E. Banks , M. A. DePristo , R. E. Handsaker , G. Lunter , G. T. Marth , S. T. Sherry ,
G. McVean , R. Durbin , and 1000 Genomes Project Analysis Group . The variant call format and VCFtools. Bioinformatics,
27(15):2156–2158, Aug 2011.
A. Dayarian , T. P. Michael , and A. M. Sengupta . SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC
Bioinformatics, 11:345, 2010.
F. De Bona , S. Ossowski , K. Schneeberger , and G. Rätsch . Optimal spliced alignments of short sequence reads. Bioinformatics,
24(16):i174–i180, Aug 2008.
F. Denoeud , J.-M. Aury , C. Da Silva , B. Noel , O. Rogier , M. Delledonne , M. Morgante , G. Valle , P. Wincker , C. Scarpelli , O. Jaillon ,
and F. Artiguenave . Annotating genomes with massive-scale RNA sequencing. Genome Biol, 9(12):R175, 2008.
S. Deorowicz and S. Grabowski . Compression of DNA sequence reads in FASTQ format. Bioinformatics, 27(6):860–862, Mar 2011.
M. A. DePristo , E. Banks , R. Poplin , K. V. Garimella , J. R. Maguire , C. Hartl , A. A. Philippakis , G. del Angel , M. A. Rivas , M. Hanna ,
A. McKenna , T. J. Fennell , A. M. Kernytsky , A. Y. Sivachenko , K. Cibulskis , S. B. Gabriel , D. Altshuler , and M. J. Daly . A framework
for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43(5):491–498, May 2011.
M. T. Dimon , K. Sorber , and J. L. DeRisi . HMMSplicer: A tool for efficient and sensitive discovery of known and novel splice junctions in
RNA-seq data. PLoS One, 5(11):e13875, 2010.
H. Do and W. Sung . Compressed directed acyclic word graph with application in local alignment. Algorithmica, 67(2):125–141, 2013.
A. Dobin , C. A. Davis , F. Schlesinger , J. Drenkow , C. Zaleski , S. Jha , P. Batut , M. Chaisson , and T. R. Gingeras . STAR: Ultrafast
universal RNA-seq aligner. Bioinformatics, 29(1):15–21, Jan 2013.
J. C. Dohm , C. Lottaz , T. Borodina , and H. Himmelbauer . SHARCGS, a fast and highly accurate short-read assembly algorithm for de
novo genomic sequencing. Genome Research, 17(11):1697–1706, Nov 2007.
N. Donmez and M. Brudno . SCARPA: Scaffolding reads with practical algorithms. Bioinformatics, 29(4):428–434, Feb 2013.
J. Eid , A. Fehr , J. Gray , K. Luong , J. Lyle , G. Otto , P. Peluso , D. Rank , P. Baybayan , B. Bettman , A. Bibillo , K. Bjornson , B.
Chaudhuri , F. Christians , R. Cicero , S. Clark , R. Dalal , A. Dewinter , J. Dixon , M. Foquet , A. Gaertner , P. Hardenbol , C. Heiner , K.
Hester , D. Holden , G. Kearns , X. Kong , R. Kuse , Y. Lacroix , S. Lin , P. Lundquist , C. Ma , P. Marks , M. Maxham , D. Murphy , I. Park
, T. Pham , M. Phillips , J. Roy , R. Sebra , G. Shen , J. Sorenson , A. Tomaney , K. Travers , M. Trulson , J. Vieceli , J. Wegener , D. Wu ,
A. Yang , D. Zaccarin , P. Zhao , F. Zhong , J. Korlach , and S. Turner . Real-time DNA sequencing from single polymerase molecules.
Science, 323(5910):133–138, Jan 2009.
A. C. English , S. Richards , Y. Han , M. Wang , V. Vee , J. Qu , X. Qin , D. M. Muzny , J. G. Reid , K. C. Worley , and R. A. Gibbs . Mind
the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One, 7(11):e47768, 2012.
B. Ewing , L. Hillier , M. C. Wendl , and P. Green . Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
Genome Research, 8(3):175–185, Mar 1998.
L. Fan , P. Cao , J. Almeida , and A. Z. Broder . Summary cache: A scalable Wide-area Web cache sharing protocol. IEEE/ACM
Transactions on Networking, 8(3):281–293, June 2000.
H. Fang , Y. Wu , G. Narzisi , J. A. O’Rawe , L. T. J. Barrón , J. Rosenbaum , M. Ronemus , I. Iossifov , M. C. Schatz , and G. J. Lyon .
Reducing INDEL calling errors in whole genome and exome sequencing data. Genome Med, 6(10):89, 2014.
M. Farach . Optimal suffix tree construction with large alphabets. In IEEE Symposium on Foundations of Computer Science (FOCS),
pages 137–143, 1997.
J. Fernandez-Banet , N. P. Lee , K. T. Chan , H. Gao , X. Liu , W.-K. Sung , W. Tan , S. T. Fan , R. T. Poon , S. Li , K. Ching , P. A. Rejto ,
M. Mao , and Z. Kan . Decoding complex patterns of genomic rearrangement in hepatocellular carcinoma. Genomics, 103(2–3):189–203,
2014.
P. Ferragina and G. Manzini . Opportunistic data structures with applications. In IEEE Symposium on Foundations of Computer Science
(FOCS), pages 390–398, 2000.
M. Ferrarini , M. Moretto , J. A. Ward , N. Šurbanovski , V. Stevanović , L. Giongo , R. Viola , D. Cavalieri , R. Velasco , A. Cestaro , and D.
J. Sargent . An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics,
14:670, 2013.
L. Feuk , A. R. Carson , and S. W. Scherer . Structural variation in the human genome. Nat Rev Genet, 7(2):85–97, Feb 2006.
N. A. Fonseca , J. Rung , A. Brazma , and J. C. Marioni . Tools for mapping high-throughput sequencing data. Bioinformatics,
28(24):3169–3177, Dec 2012.
X. Fu , N. Fu , S. Guo , Z. Yan , Y. Xu , H. Hu , C. Menzel , W. Chen , Y. Li , R. Zeng , and P. Khaitovich . Estimating accuracy of RNA-seq
and microarrays with proteomics. BMC Genomics, 10:161, 2009.
E. R. Gamazon , R. S. Huang , M. E. Dolan , and N. J. Cox . Copy number polymorphisms and anticancer pharmacogenomics. Genome
Biol, 12(5):R46, 2011.
S. Gao , D. Bertrand , B. K. H. Chia , and N. Nagarajan . OPERA-LG: Efficient and exact scaffolding of large, repeat-rich eukaryotic
genomes with performance guarantees. Genome Biol, 17:102, 2016.
S. Gao , W.-K. Sung , and N. Nagarajan . Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences.
Journal of Computational Biology, 18(11):1681–1691, Nov 2011.
M. Garber , M. G. Grabherr , M. Guttman , and C. Trapnell . Computational methods for transcriptome annotation and quantification using
RNA-seq. Nat Methods, 8(6):469–477, Jun 2011.
A. Gillet-Markowska , H. Richard , G. Fischer , and I. Lafontaine . Ulysses: Accurate detection of low-frequency structural variations in
large insert-size sequencing libraries. Bioinformatics, 31(6):801–808, Mar 2015.
G. Gonnella and S. Kurtz . Readjoiner: A fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13:82,
2012.
M. G. Grabherr , B. J. Haas , M. Yassour , J. Z. Levin , D. A. Thompson , I. Amit , X. Adiconis , L. Fan , R. Raychowdhury , Q. Zeng , Z.
Chen , E. Mauceli , N. Hacohen , A. Gnirke , N. Rhind , F. di Palma , B. W. Birren , C. Nusbaum , K. Lindblad-Toh , N. Friedman , and A.
Regev . Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol, 29(7):644–652, Jul 2011.
G. R. Grant , M. H. Farkas , A. D. Pizarro , N. F. Lahens , J. Schug , B. P. Brunk , C. J. Stoeckert , J. B. Hogenesch , and E. A. Pierce .
Comparative analysis of RNA-seq alignment algorithms and the RNA-seq unified mapper (RUM). Bioinformatics, 27(18):2518–2528, Sep
2011.
M. Griffith , O. L. Griffith , J. Mwenifumbo , R. Goya , A. S. Morrissy , R. D. Morin , R. Corbett , M. J. Tang , Y.-C. Hou , T. J. Pugh , G.
Robertson , S. Chittaranjan , A. Ally , J. K. Asano , S. Y. Chan , H. I. Li , H. McDonald , K. Teague , Y. Zhao , T. Zeng , A. Delaney , M.
Hirst , G. B. Morin , S. J. M. Jones , I. T. Tai , and M. A. Marra . Alternative expression analysis by RNA sequencing. Nat Methods,
7(10):843–847, Oct 2010.
W. Gu , F. Zhang , and J. R. Lupski . Mechanisms for human genomic rearrangements. Pathogenetics, 1(1):4, 2008.
M. Guttman , M. Garber , J. Z. Levin , J. Donaghey , J. Robinson , X. Adiconis , L. Fan , M. J. Koziol , A. Gnirke , C. Nusbaum , J. L. Rinn ,
E. S. Lander , and A. Regev . Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic
structure of lincRNAs. Nat Biotechnol, 28(5):503–510, May 2010.
I. Hajirasouliha , F. Hormozdiari , C. Alkan , J. M. Kidd , I. Birol , E. E. Eichler , and S. C. Sahinalp . Detection and characterization of novel
sequence insertions using paired-end next-generation sequencing. Bioinformatics, 26(10):1277–1283, May 2010.
J. Hardy and A. Singleton . Genomewide association studies and human disease. N Engl J Med, 360(17):1759–1768, Apr 2009.
P. J. Hastings , G. Ira , and J. R. Lupski . A microhomology-mediated break-induced replication model for the origin of human copy number
variation. PLoS Genet, 5(1):e1000327, Jan 2009.
P. J. Hastings , J. R. Lupski , S. M. Rosenberg , and G. Ira . Mechanisms of change in gene copy number. Nat Rev Genet, 10(8):551–564,
Aug 2009.
A. Hatem , D. Bozdağ , A. E. Toland , and Ü. V. Çatalyürek . Benchmarking short sequence mapping tools. BMC Bioinformatics, 14:184,
2013.
M. Hayes , Y. S. Pyon , and J. Li . A model-based clustering method for genomic structural variant prediction and genotyping using paired-
end sequencing data. PLoS One, 7(12):e52881, 2012.
D. Hernandez , P. François , L. Farinelli , M. Østerås , and J. Schrenzel . De novo bacterial genome sequencing: Millions of very short
reads assembled on a desktop computer. Genome Research, 18(5):802–809, May 2008.
J. R. Hesselberth , X. Chen , Z. Zhang , P. J. Sabo , R. Sandstrom , A. P. Reynolds , R. E. Thurman , S. Neph , M. S. Kuehn , W. S. Noble
, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature Methods, 6(4):283–289, 2009.
M. Holtgrewe , A.-K. Emde , D. Weese , and K. Reinert . A novel and well-defined benchmarking method for second generation read
mapping. BMC Bioinformatics, 12:210, 2011.
N. Homer , B. Merriman , and S. F. Nelson . BFAST: An alignment tool for large scale genome resequencing. PLoS One, 4(11):e7767,
2009.
W.-K. Hon , K. Sadakane , and W.-K. Sung . Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on
Computing, 38(6):2162–2178, 2009.
F. Hormozdiari , C. Alkan , E. E. Eichler , and S. C. Sahinalp . Combinatorial algorithms for structural variation detection in high-throughput
sequenced genomes. Genome Research, 19(7):1270–1278, Jul 2009.
F. Hormozdiari , I. Hajirasouliha , P. Dao , F. Hach , D. Yorukoglu , C. Alkan , E. E. Eichler , and S. C. Sahinalp . Next-generation
VariationHunter: Combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26(12):i350–i357, Jun 2010.
M. Hsi-Yang Fritz , R. Leinonen , G. Cochrane , and E. Birney . Efficient storage of high throughput DNA sequencing data using reference-
based compression. Genome Research, 21(5):734–740, May 2011.
Y. Hu , K. Wang , X. He , D. Y. Chiang , J. F. Prins , and J. Liu . A probabilistic framework for aligning paired-end RNA-seq data.
Bioinformatics, 26(16):1950–1957, Aug 2010.
S. Huang , J. Zhang , R. Li , W. Zhang , Z. He , T.-W. Lam , Z. Peng , and S.-M. Yiu . SOAPsplice: Genome-wide ab initio detection of
splice junctions from RNA-seq data. Front Genet, 2:46, 2011.
Y. Huang , Y. Hu , C. D. Jones , J. N. MacLeod , D. Y. Chiang , Y. Liu , J. F. Prins , and J. Liu . A robust method for transcript
quantification with RNA-seq data. Journal of Computational Biology, 20(3):167–187, Mar 2013.
D. H. Huson , K. Reinert , and E. W. Myers . The greedy path-merging algorithm for contig scaffolding. J. ACM, 49(5):603–615, Sept.
2002.
D. Huy Hoang and W.-K. Sung . CWig: Compressed representation of Wiggle/BedGraph format. Bioinformatics, 30(18):2543–2550, Sep
2014.
R. M. Idury and M. S. Waterman . A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2):291–306, 1995.
S. Ivakhno , T. Royce , A. J. Cox , D. J. Evers , R. K. Cheetham , and S. Tavaré . CNAseg: A novel framework for identification of copy
number changes in cancer from second-generation sequencing data. Bioinformatics, 26(24):3051–3058, Dec 2010.
M. Jain , I. T. Fiddes , K. H. Miga , H. E. Olsen , B. Paten , and M. Akeson . Improved data analysis for the MinION nanopore sequencer.
Nat Methods, 12(4):351–356, Apr 2015.
G. Jean , A. Kahles , V. T. Sreedharan , F. De Bona , and G. Rätsch . RNA-seq read alignments with PALMapper. Curr Protoc
Bioinformatics, Chapter 11:Unit 11.6, Dec 2010.
W. R. Jeck , J. A. Reinhardt , D. A. Baltrus , M. T. Hickenbotham , V. Magrini , E. R. Mardis , J. L. Dangl , and C. D. Jones . Extending
assembly of short DNA sequences to handle error. Bioinformatics, 23(21):2942–2944, Nov 2007.
H. Ji . Computational analysis of ChIP-seq data. Methods Mol Biol, 674:143–159, 2010.
H. Ji , H. Jiang , W. Ma , D. S. Johnson , R. M. Myers , and W. H. Wong . An integrated software system for analyzing ChIP-chip and
ChIP-seq data. Nat Biotechnol, 26(11):1293–1300, Nov 2008.
H. Ji , X. Li , Q.-f. Wang , and Y. Ning . Differential principal component analysis of ChIP-seq. Proc Natl Acad Sci USA,
110(17):6789–6794, Apr 2013.
H. Jiang and W. H. Wong . SeqMap: Mapping massive amount of oligonucleotides to the genome. Bioinformatics, 24(20):2395–2396, Oct
2008.
Y. Jiang , A. L. Turinsky , and M. Brudno . The missing indels: An estimate of indel variation in a human genome and analysis of factors
that impede detection. Nucleic Acids Res, 43(15):7217–7228, Sep 2015.
Y. Jiang , Y. Wang , and M. Brudno . PRISM: Pair-read informed split-read mapping for base-pair level detection of insertion, deletion and
structural variants. Bioinformatics, 28(20):2576–2583, Oct 2012.
D. S. Johnson , A. Mortazavi , R. M. Myers , and B. Wold . Genome-wide mapping of in vivo protein-DNA interactions. Science,
316(5830):1497–1502, Jun 2007.
J. M. Johnson , J. Castle , P. Garrett-Engele , Z. Kan , P. M. Loerch , C. D. Armour , R. Santos , E. E. Schadt , R. Stoughton , and D. D.
Shoemaker . Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science,
302(5653):2141–2144, Dec 2003.
D. C. Jones , W. L. Ruzzo , X. Peng , and M. G. Katze . Compression of next-generation sequencing reads aided by highly efficient de
novo assembly. Nucleic Acids Res, 40(22):e171, Aug 2012.
R. Jothi , S. Cuddapah , A. Barski , K. Cui , and K. Zhao . Genome-wide identification of in vivo protein-DNA binding sites from ChIP-seq
data. Nucleic Acids Res, 36(16):5221–5231, Sep 2008.
A. Kalsotra and T. A. Cooper . Functional consequences of developmentally regulated alternative splicing. Nat Rev Genet,
12(10):715–729, Oct 2011.
Z. Kan , H. Zheng , X. Liu , S. Li , T. Barber , Z. Gong , H. Gao , K. Hao , M. D. Willard , J. Xu , R. Hauptschein , P. A. Rejto , J. Fernandez
, G. Wang , Q. Zhang , B. Wang , R. Chen , J. Wang , N. P. Lee , W. Zhou , Z. Lin , Z. Peng , K. Yi , S. Chen , L. Li , X. Fan , J. Yang , R.
Ye , J. Ju , K. Wang , H. Estrella , S. Deng , P. Wei , M. Qiu , I. H. Wulur , J. Liu , M. E. Ehsani , C. Zhang , A. Loboda , W. K. Sung , A.
Aggarwal , R. T. Poon , S. T. Fan , J. Hardwick , J. Wang , C. Reinhard , H. Dai , Y. Li , J. M. Luk , and M. Mao . Whole genome
sequencing identifies recurrent mutations in hepatocellular carcinoma. Genome Research, 23(9):1422–1433, Jun 2013.
C. Kanduri , L. Ukkola-Vuoti , J. Oikkonen , G. Buck , C. Blancher , P. Raijas , K. Karma , H. Lähdesmäki , and I. Järvelä . The genome-
wide landscape of copy number variations in the MUSGEN study provides evidence for a founder effect in the isolated Finnish population.
Eur J Hum Genet, 21(12):1411–1416, Dec 2013.
E. Karakoc , C. Alkan , B. J. O’Roak , M. Y. Dennis , L. Vives , K. Mark , M. J. Rieder , D. A. Nickerson , and E. E. Eichler . Detection of
structural variants and indels within exome data. Nat Methods, 9(2):176–178, Feb 2012.
T. M. Keane , K. Wong , and D. J. Adams . RetroSeq: Transposable element discovery from next-generation sequencing data.
Bioinformatics, 29(3):389–390, Feb 2013.
J. D. Kececioglu and E. W. Myers . Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1/2):7–51, 1995.
W. J. Kent . BLAT: The BLAST-like alignment tool. Genome Research, 12(4):656–664, Apr 2002.
W. J. Kent , A. S. Zweig , G. Barber , A. S. Hinrichs , and D. Karolchik . BigWig and BigBed: Enabling browsing of large distributed
datasets. Bioinformatics, 26(17):2204–2207, Sep 2010.
H. Keren , G. Lev-Maor , and G. Ast . Alternative splicing and evolution: Diversification, exon definition and function. Nat Rev Genet,
11(5):345–355, May 2010.
P. V. Kharchenko , M. Y. Tolstorukov , and P. J. Park . Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat
Biotechnol, 26(12):1351–1359, Dec 2008.
J. M. Kidd , T. Graves , T. L. Newman , R. Fulton , H. S. Hayden , M. Malig , J. Kallicki , R. Kaul , R. K. Wilson , and E. E. Eichler . A
human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell, 143(5):837–847, Nov 2010.
S. M. Kielbasa , R. Wan , K. Sato , P. Horton , and M. C. Frith . Adaptive seeds tame genomic sequence comparison. Genome Research,
21(3):487–493, Jan 2011.
D. Kim , G. Pertea , C. Trapnell , H. Pimentel , R. Kelley , and S. L. Salzberg . TopHat2: Accurate alignment of transcriptomes in the
presence of insertions, deletions and gene fusions. Genome Biol, 14(4):R36, Apr 2013.
J. B. Kim , H. Zaehres , G. Wu , L. Gentile , K. Ko , V. Sebastiano , M. J. Araúzo-Bravo , D. Ruau , D. W. Han , M. Zenke , and H. R.
Schöler . Pluripotent stem cells induced from adult neural stem cells by reprogramming with two factors. Nature, 454(7204):646–650, Jul
2008.
T.-M. Kim , L. J. Luquette , R. Xi , and P. J. Park . rSW-seq: Algorithm for detection of copy number alterations in deep sequencing data.
BMC Bioinformatics, 11:432, 2010.
D. C. Koboldt , K. Chen , T. Wylie , D. E. Larson , M. D. McLellan , E. R. Mardis , G. M. Weinstock , R. K. Wilson , and L. Ding . VarScan:
Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics, 25(17):2283–2285, Sep 2009.
D. C. Koboldt , Q. Zhang , D. E. Larson , D. Shen , M. D. McLellan , L. Lin , C. A. Miller , E. R. Mardis , L. Ding , and R. K. Wilson .
VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3):568–576,
Mar 2012.
J. O. Korbel , A. Abyzov , X. J. Mu , N. Carriero , P. Cayting , Z. Zhang , M. Snyder , and M. B. Gerstein . PEMer: A computational
framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome
Biol, 10(2):R23, 2009.
S. Koren , M. C. Schatz , B. P. Walenz , J. Martin , J. T. Howard , G. Ganapathy , Z. Wang , D. A. Rasko , W. R. McCombie , E. D. Jarvis ,
and A. M. Phillippy . Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol, 30(7):693–700,
Jul 2012.
C. Kozanitis , C. Saunders , S. Kruglyak , V. Bafna , and G. Varghese . Compressing genomic sequence fragments using SlimGene.
Journal of Computational Biology, 18(3):401–413, Mar 2011.
P. Krawitz , C. Rödelsperger , M. Jäger , L. Jostins , S. Bauer , and P. N. Robinson . Microindel detection in short-read sequence data.
Bioinformatics, 26(6):722–729, Mar 2010.
H. Y. K. Lam , X. J. Mu , A. M. Stütz , A. Tanzer , P. D. Cayting , M. Snyder , P. M. Kim , J. O. Korbel , and M. B. Gerstein . Nucleotide-
resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol, 28(1):47–55, Jan 2010.
T. W. Lam , W.-K. Sung , S.-L. Tam , C.-K. Wong , and S.-M. Yiu . Compressed indexing and local alignment of DNA. Bioinformatics,
24(6):791–797, 2008.
T. W. Lam , A. Tam , E. Wu , R. Li , S. Wong , and S. M. Yiu . High throughput short read alignment via bi-directional BWT. In IEEE
International Conference on Bioinformatics and Biomedicine (BIBM), pages 31–36, 2009.
G. M. Landau and U. Vishkin . Introducing efficient parallelism into approximate string matching and a new serial algorithm. In Annual ACM
Symposium on Theory of Computing (STOC), pages 220–230, 1986.
B. Langmead and S. L. Salzberg . Fast gapped-read alignment with Bowtie 2. Nat Methods, 9(4):357–359, Apr 2012.
B. Langmead , C. Trapnell , M. Pop , and S. L. Salzberg . Ultrafast and memory-efficient alignment of short DNA sequences to the human
genome. Genome Biol, 10(3):R25, 2009.
J. Laserson , V. Jojic , and D. Koller . Genovo: De novo assembly for metagenomes. Journal of Computational Biology, 18(3):429–443,
Mar 2011.
R. M. Layer , C. Chiang , A. R. Quinlan , and I. M. Hall . LUMPY: A probabilistic framework for structural variant discovery. Genome Biol,
15(6):R84, 2014.
J. A. Lee , C. M. B. Carvalho , and J. R. Lupski . A DNA replication mechanism for generating nonrecurrent rearrangements associated
with genomic disorders. Cell, 131(7):1235–1247, Dec 2007.
S. Lee , F. Hormozdiari , C. Alkan , and M. Brudno . MoDIL: Detecting small indels from clone-end sequencing with mixtures of
distributions. Nat Methods, 6(7):473–474, Jul 2009.
S. Lee , E. Xing , and M. Brudno . MoGUL: Detecting common insertions and deletions in a population. In Annual Intl Conf on Comp
Molecular Biology (RECOMB), pages 357–368, 2010.
W.-P. Lee , M. P. Stromberg , A. Ward , C. Stewart , E. P. Garrison , and G. T. Marth . MOSAIK: A hash-based algorithm for accurate next-
generation sequencing short-read mapping. PLoS One, 9(3):e90581, 2014.
J. Z. Levin , M. Yassour , X. Adiconis , C. Nusbaum , D. A. Thompson , N. Friedman , A. Gnirke , and A. Regev . Comprehensive
comparative analysis of strand-specific RNA sequencing methods. Nat Methods, 7(9):709–715, Sep 2010.
H. Li . A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation
from sequencing data. Bioinformatics, 27(21):2987–2993, Nov 2011.
H. Li . Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28(14):1838–1844, Jul
2012.
H. Li . Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20):2843–2851, Oct 2014.
H. Li . Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, Jul 2016.
H. Li and R. Durbin . Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, Jul
2009.
H. Li and R. Durbin . Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, Mar 2010.
H. Li , B. Handsaker , A. Wysoker , T. Fennell , J. Ruan , N. Homer , G. Marth , G. Abecasis , R. Durbin , and 1000 Genome Project Data
Processing Subgroup . The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, Aug 2009.
H. Li and N. Homer . A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform, 11(5):473–483, Sep
2010.
H. Li , J. Ruan , and R. Durbin . Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome
Research, 18(11):1851–1858, Nov 2008.
J. J. Li , P. J. Bickel , and M. D. Biggin . System wide analyses have underestimated protein abundances and the importance of
transcription in mammals. PeerJ, 2:e270, Feb 2014.
R. Li , W. Fan , G. Tian , H. Zhu , L. He , J. Cai , Q. Huang , Q. Cai , B. Li , Y. Bai , Z. Zhang , Y. Zhang , W. Wang , J. Li , F. Wei , H. Li ,
M. Jian , J. Li , Z. Zhang , R. Nielsen , D. Li , W. Gu , Z. Yang , Z. Xuan , O. A. Ryder , F. C.-C. Leung , Y. Zhou , J. Cao , X. Sun , Y. Fu ,
X. Fang , X. Guo , B. Wang , R. Hou , F. Shen , B. Mu , P. Ni , R. Lin , W. Qian , G. Wang , C. Yu , W. Nie , J. Wang , Z. Wu , H. Liang , J.
Min , Q. Wu , S. Cheng , J. Ruan , M. Wang , Z. Shi , M. Wen , B. Liu , X. Ren , H. Zheng , D. Dong , K. Cook , G. Shan , H. Zhang , C.
Kosiol , X. Xie , Z. Lu , H. Zheng , Y. Li , C. C. Steiner , T. T.-Y. Lam , S. Lin , Q. Zhang , G. Li , J. Tian , T. Gong , H. Liu , D. Zhang , L.
Fang , C. Ye , J. Zhang , W. Hu , A. Xu , Y. Ren , G. Zhang , M. W. Bruford , Q. Li , L. Ma , Y. Guo , N. An , Y. Hu , Y. Zheng , Y. Shi , Z. Li
, Q. Liu , Y. Chen , J. Zhao , N. Qu , S. Zhao , F. Tian , X. Wang , H. Wang , L. Xu , X. Liu , T. Vinar , Y. Wang , T.-W. Lam , S.-M. Yiu , S.
Liu , H. Zhang , D. Li , Y. Huang , X. Wang , G. Yang , Z. Jiang , J. Wang , N. Qin , L. Li , J. Li , L. Bolund , K. Kristiansen , G. K.-S. Wong ,
M. Olson , X. Zhang , S. Li , H. Yang , J. Wang , and J. Wang . The sequence and de novo assembly of the giant panda genome. Nature,
463(7279):311–317, Jan 2010.
R. Li , Y. Li , K. Kristiansen , and J. Wang . SOAP: Short oligonucleotide alignment program. Bioinformatics, 24(5):713–714, Mar 2008.
R. Li , H. Zhu , J. Ruan , W. Qian , X. Fang , Z. Shi , Y. Li , S. Li , G. Shan , K. Kristiansen , S. Li , H. Yang , J. Wang , and J. Wang . De
novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20(2):265–272, Feb 2010.
S. Li , R. Li , H. Li , J. Lu , Y. Li , L. Bolund , M. H. Schierup , and J. Wang . SOAPindel: Efficient identification of indels from short paired
reads. Genome Research, 23(1):195–200, Jan 2013.
J.-Q. Lim , C. Tennakoon , P. Guan , and W.-K. Sung . BatAlign: An incremental method for accurate alignment of sequencing reads.
Nucleic Acids Res, 43(16):e107, Jul 2015.
H. Lin , Z. Zhang , M. Q. Zhang , B. Ma , and M. Li . ZOOM! zillions of oligos mapped. Bioinformatics, 24(21):2431–2437, Nov 2008.
M. R. Lindberg , I. M. Hall , and A. R. Quinlan . Population-based structural variation discovery with Hydra-Multi. Bioinformatics,
31(8):1286–1289, Apr 2015.
C.-M. Liu , T. Wong , E. Wu , R. Luo , S.-M. Yiu , Y. Li , B. Wang , C. Yu , X. Chu , K. Zhao , R. Li , and T.-W. Lam . SOAP3: Ultra-fast
GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6):878–879, Mar 2012.
Y. Liu and B. Schmidt . Long read alignment based on maximal exact match seeds. Bioinformatics, 28(18):i318–i324, Sep 2012.
Y. Liu , B. Schmidt , and D. L. Maskell . CUSHAW: A CUDA compatible short read aligner to large genomes based on the Burrows-
Wheeler transform. Bioinformatics, 28(14):1830–1837, Jul 2012.
J. R. Lupski . Genomic rearrangements and sporadic disease. Nat Genet, 39(7 Suppl):S43–S47, Jul 2007.
A. Magi , M. Benelli , S. Yoon , F. Roviello , and F. Torricelli . Detecting common copy number variants in high-throughput sequencing data
by using JointSLM algorithm. Nucleic Acids Res, 39(10):e65, May 2011.
U. Manber and G. Myers . Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
G. Manzini . An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430, 2001.
G. Marçais and C. Kingsford . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics,
27(6):764–770, Mar 2011.
S. Marco-Sola , M. Sammeth , R. Guigó , and P. Ribeca . The GEM mapper: Fast, accurate and versatile alignment by filtration. Nat
Methods, 9(12):1185–1188, Dec 2012.
E. R. Mardis . Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif), 6:287–303, 2013.
J. C. Marioni , C. E. Mason , S. M. Mane , M. Stephens , and Y. Gilad . RNA-seq: An assessment of technical reproducibility and
comparison with gene expression arrays. Genome Research, 18(9):1509–1517, Sep 2008.
T. Marschall , I. G. Costa , S. Canzar , M. Bauer , G. W. Klau , A. Schliep , and A. Schönhuth . CLEVER: Clique-enumerating variant
finder. Bioinformatics, 28(22):2875–2882, Nov 2012.
J. Martin , V. M. Bruno , Z. Fang , X. Meng , M. Blow , T. Zhang , G. Sherlock , M. Snyder , and Z. Wang . Rnnotator: An automated de
novo transcriptome assembly pipeline from stranded RNA-seq reads. BMC Genomics, 11:663, 2010.
A. M. Maxam and W. Gilbert . A new method for sequencing DNA. Proc Natl Acad Sci USA, 74(2):560–564, Feb 1977.
S. A. McCarroll , A. Huett , P. Kuballa , S. D. Chilewski , A. Landry , P. Goyette , M. C. Zody , J. L. Hall , S. R. Brant , J. H. Cho , R. H.
Duerr , M. S. Silverberg , K. D. Taylor , J. D. Rioux , D. Altshuler , M. J. Daly , and R. J. Xavier . Deletion polymorphism upstream of IRGM
associated with altered IRGM expression and Crohn’s disease. Nat Genet, 40(9):1107–1112, Sep 2008.
M. T. McCarthy and C. A. O’Callaghan . PeaKDEck: A kernel density estimator-based peak calling program for DNaseI-seq data.
Bioinformatics, 30(9):1302–1304, May 2014.
K. J. McKernan , H. E. Peckham , G. L. Costa , S. F. McLaughlin , Y. Fu , E. F. Tsung , C. R. Clouser , C. Duncan , J. K. Ichikawa , C. C.
Lee , Z. Zhang , S. S. Ranade , E. T. Dimalanta , F. C. Hyland , T. D. Sokolsky , L. Zhang , A. Sheridan , H. Fu , C. L. Hendrickson , B. Li ,
L. Kotler , J. R. Stuart , J. A. Malek , J. M. Manning , A. A. Antipova , D. S. Perez , M. P. Moore , K. C. Hayashibara , M. R. Lyons , R. E.
Beaudoin , B. E. Coleman , M. W. Laptewicz , A. E. Sannicandro , M. D. Rhodes , R. K. Gottimukkala , S. Yang , V. Bafna , A. Bashir , A.
MacBride , C. Alkan , J. M. Kidd , E. E. Eichler , M. G. Reese , F. M. De La Vega , and A. P. Blanchard . Sequence and structural variation
in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research,
19(9):1527–1541, Sep 2009.
P. Medvedev , M. Fiume , M. Dzamba , T. Smith , and M. Brudno . Detecting copy number variation with mated short reads. Genome
Research, 20(11):1613–1622, Nov 2010.
P. Melsted and J. K. Pritchard . Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics, 12:333, 2011.
M. L. Metzker . Sequencing technologies: The next generation. Nat Rev Genet, 11(1):31–46, Jan 2010.
K. H. Miga , C. Eisenhart , and W. J. Kent . Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce
false positive alignments. Nucleic Acids Res, 43(20):e133, Nov 2015.
C. A. Miller , O. Hampton , C. Coarfa , and A. Milosavljevic . ReadDepth: A parallel R package for detecting copy number alterations from
short sequencing reads. PLoS One, 6(1):e16327, 2011.
R. E. Mills , K. Walter , C. Stewart , R. E. Handsaker , K. Chen , C. Alkan , A. Abyzov , S. C. Yoon , K. Ye , R. K. Cheetham , A. Chinwalla ,
D. F. Conrad , Y. Fu , F. Grubert , I. Hajirasouliha , F. Hormozdiari , L. M. Iakoucheva , Z. Iqbal , S. Kang , J. M. Kidd , M. K. Konkel , J.
Korn , E. Khurana , D. Kural , H. Y. K. Lam , J. Leng , R. Li , Y. Li , C.-Y. Lin , R. Luo , X. J. Mu , J. Nemesh , H. E. Peckham , T. Rausch ,
A. Scally , X. Shi , M. P. Stromberg , A. M. Stütz , A. E. Urban , J. A. Walker , J. Wu , Y. Zhang , Z. D. Zhang , M. A. Batzer , L. Ding , G. T.
Marth , G. McVean , J. Sebat , M. Snyder , J. Wang , K. Ye , E. E. Eichler , M. B. Gerstein , M. E. Hurles , C. Lee , S. A. McCarroll , J. O.
Korbel , and 1000 Genomes Project . Mapping copy number variation by population-scale genome sequencing. Nature, 470(7332):59–65, Feb 2011.
M. Mohiyuddin , J. C. Mu , J. Li , N. Bani Asadi , M. B. Gerstein , A. Abyzov , W. H. Wong , and H. Y. K. Lam . MetaSV: An accurate and
integrative structural-variant caller for next generation sequencing. Bioinformatics, 31(16):2741–2744, Aug 2015.
V. Moncunill , S. Gonzalez , S. Beà , L. O. Andrieux , I. Salaverria , C. Royo , L. Martinez , M. Puiggròs , M. Segura-Wang , A. M. Stütz , A.
Navarro , R. Royo , J. L. Gelpí , I. G. Gut , C. López-Otín , M. Orozco , J. O. Korbel , E. Campo , X. S. Puente , and D. Torrents .
Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat Biotechnol,
32(11):1106–1112, Nov 2014.
S. B. Montgomery , D. L. Goode , E. Kvikstad , C. A. Albers , Z. D. Zhang , X. J. Mu , G. Ananda , B. Howie , K. J. Karczewski , K. S. Smith
, V. Anaya , R. Richardson , J. Davis , 1000 Genomes Project Consortium , D. G. MacArthur , A. Sidow , L. Duret , M. Gerstein , K. D.
Makova , J. Marchini , G. McVean , and G. Lunter . The origin, evolution, and functional impact of short insertion-deletion variants identified
in 179 human genomes. Genome Research, 23(5):749–761, May 2013.
A. Mortazavi , B. A. Williams , K. McCue , L. Schaeffer , and B. Wold . Mapping and quantifying mammalian transcriptomes by RNA-seq.
Nat Methods, 5(7):621–628, Jul 2008.
J. C. Mu , H. Jiang , A. Kiani , M. Mohiyuddin , N. Bani Asadi , and W. H. Wong . Fast and accurate read alignment for resequencing.
Bioinformatics, 28(18):2366–2373, Sep 2012.
J. M. Mullaney , R. E. Mills , W. S. Pittard , and S. E. Devine . Small insertions and deletions (INDELs) in human genomes. Hum Mol
Genet, 19(R2):R131–R136, Oct 2010.
O. Muralidharan , G. Natsoulis , J. Bell , D. Newburger , H. Xu , I. Kela , H. Ji , and N. Zhang . A cross-sample statistical model for SNP
detection in short-read sequencing data. Nucleic Acids Res, 40(1):e5, Jan 2012.
G. H. Murillo , N. You , X. Su , W. Cui , M. P. Reilly , M. Li , K. Ning , and X. Cui . MultiGeMS: Detection of SNVs from multiple samples
using model selection on high-throughput sequencing data. Bioinformatics, 32(10):1486–1492, May 2016.
E. W. Myers . The fragment assembly string graph. Bioinformatics, 21 Suppl 2:ii79–ii85, Sep 2005.
E. W. Myers , G. G. Sutton , A. L. Delcher , I. M. Dew , D. P. Fasulo , M. J. Flanigan , S. A. Kravitz , C. M. Mobarry , K. H. Reinert , K. A.
Remington , E. L. Anson , R. A. Bolanos , H. H. Chou , C. M. Jordan , A. L. Halpern , S. Lonardi , E. M. Beasley , R. C. Brandon , L. Chen ,
P. J. Dunn , Z. Lai , Y. Liang , D. R. Nusskern , M. Zhan , Q. Zhang , X. Zheng , G. M. Rubin , M. D. Adams , and J. C. Venter . A
whole-genome assembly of Drosophila. Science, 287(5461):2196–2204, Mar 2000.
T. Namiki , T. Hachiya , H. Tanaka , and Y. Sakakibara . MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly
from short sequence reads. Nucleic Acids Res, 40(20):e155, Nov 2012.
G. Narzisi , J. A. O’Rawe , I. Iossifov , H. Fang , Y.-H. Lee , Z. Wang , Y. Wu , G. J. Lyon , M. Wigler , and M. C. Schatz . Accurate de novo
and transmitted indel detection in exome-capture data using microassembly. Nat Methods, 11(10):1033–1036, Oct 2014.
A. Natarajan , G. G. Yardimci , N. C. Sheffield , G. E. Crawford , and U. Ohler . Predicting cell-type-specific gene expression from regions
of open chromatin. Genome Research, 22(9):1711–1722, 2012.
A. M. Newman , S. V. Bratman , H. Stehr , L. J. Lee , C. L. Liu , M. Diehn , and A. A. Alizadeh . FACTERA: A practical method for the
discovery of genomic rearrangements at breakpoint resolution. Bioinformatics, 30(23):3390–3393, Dec 2014.
K. P. Ng , A. M. Hillmer , C. T. H. Chuah , W. C. Juan , T. K. Ko , A. S. M. Teo , P. N. Ariyaratne , N. Takahashi , K. Sawada , Y. Fei , S.
Soh , W. H. Lee , J. W. J. Huang , J. C. Allen , X. Y. Woo , N. Nagarajan , V. Kumar , A. Thalamuthu , W. T. Poh , A. L. Ang , H. T. Mya ,
G. F. How , L. Y. Yang , L. P. Koh , B. Chowbay , C.-T. Chang , V. S. Nadarajan , W. J. Chng , H. Than , L. C. Lim , Y. T. Goh , S. Zhang ,
D. Poh , P. Tan , J.-E. Seet , M.-K. Ang , N.-M. Chau , Q.-S. Ng , D. S. W. Tan , M. Soda , K. Isobe , M. M. Nöthen , T. Y. Wong , A.
Shahab , X. Ruan , V. Cacheux-Rataboul , W.-K. Sung , E. H. Tan , Y. Yatabe , H. Mano , R. A. Soo , T. M. Chin , W.-T. Lim , Y. Ruan ,
and S. T. Ong . A common BIM deletion polymorphism mediates intrinsic resistance and inferior responses to tyrosine kinase inhibitors in
cancer. Nat Med, 18(4):521–528, Apr 2012.
P. Ng , C.-L. Wei , W.-K. Sung , K. P. Chiu , L. Lipovich , C. C. Ang , S. Gupta , A. Shahab , A. Ridwan , C. H. Wong , E. T. Liu , and Y.
Ruan . Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods, 2(2):105–111,
Feb 2005.
A. Oshlack , M. D. Robinson , and M. D. Young . From RNA-seq reads to differential expression results. Genome Biol, 11(12):220, 2010.
T. D. Otto , M. Sanders , M. Berriman , and C. Newbold . Iterative correction of reference nucleotides (iCORN) using second generation
sequencing technology. Bioinformatics, 26(14):1704–1707, Jul 2010.
R. Pagh and F. F. Rodler . Cuckoo hashing. J. Algorithms, 51(2):122–144, 2004.
Q. Pan , O. Shai , L. J. Lee , B. J. Frey , and B. J. Blencowe . Deep surveying of alternative splicing complexity in the human transcriptome
by high-throughput sequencing. Nat Genet, 40(12):1413–1415, Dec 2008.
P. J. Park . ChIP-seq: Advantages and challenges of a maturing technology. Nat Rev Genet, 10(10):669–680, Oct 2009.
S. Pathak and S. Rajasekaran . LFQC: A lossless compression algorithm for FASTQ files. Bioinformatics, 31(20):3276–3281, Oct 2015.
C. E. Pearson , K. Nichol Edamura , and J. D. Cleary . Repeat instability: Mechanisms of dynamic mutations. Nat Rev Genet,
6(10):729–742, Oct 2005.
Y. Peng , H. C. M. Leung , S.-M. Yiu , and F. Y. L. Chin . IDBA: A practical iterative de Bruijn graph de novo assembler. In Annual Intl Conf
on Comp Molecular Biology (RECOMB), pages 426–440, 2010.
Y. Peng , H. C. M. Leung , S. M. Yiu , and F. Y. L. Chin . Meta-IDBA: A de novo assembler for metagenomic data. Bioinformatics,
27(13):i94–i101, Jul 2011.
Y. Peng , H. C. M. Leung , S.-M. Yiu , and F. Y. L. Chin . T-IDBA: A de novo iterative de Bruijn graph assembler for transcriptome. In
Annual Intl Conf on Comp Molecular Biology (RECOMB), pages 337–338, 2011.
Y. Peng , H. C. M. Leung , S. M. Yiu , and F. Y. L. Chin . IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data
with highly uneven depth. Bioinformatics, 28(11):1420–1428, Jun 2012.
P. A. Pevzner , H. Tang , and M. S. Waterman . An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA,
98(17):9748–9753, Aug 2001.
J. K. Pickrell , D. J. Gaffney , Y. Gilad , and J. K. Pritchard . False positive peaks in ChIP-seq and other sequencing-based functional
assays caused by unannotated high copy number regions. Bioinformatics, 27(15):2144–2146, Aug 2011.
V. Plagnol , J. Curtis , M. Epstein , K. Y. Mok , E. Stebbings , S. Grigoriadou , N. W. Wood , S. Hambleton , S. O. Burns , A. J. Thrasher ,
D. Kumararatne , R. Doffinger , and S. Nejentsev . A robust model for read count data in exome sequencing experiments and implications
for copy number variant calling. Bioinformatics, 28(21):2747–2754, Nov 2012.
A. Pohl and M. Beato . bwtool: A tool for bigWig files. Bioinformatics, 30(11):1618–1619, Jun 2014.
M. Pop , D. S. Kosack , and S. L. Salzberg . Hierarchical scaffolding with Bambus. Genome Research, 14(1):149–159, Jan 2004.
A. R. Quinlan , R. A. Clark , S. Sokolova , M. L. Leibowitz , Y. Zhang , M. E. Hurles , J. C. Mell , and I. M. Hall . Genome-wide mapping and
assembly of structural variant breakpoints in the mouse genome. Genome Research, 20(5):623–635, May 2010.
A. R. Quinlan and I. M. Hall . BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, Mar
2010.
A. R. Quinlan and I. M. Hall . Characterizing complex structural variation in germline and somatic genomes. Trends Genet, 28(1):43–53,
Jan 2012.
K. Ragunathan , G. Jih , and D. Moazed . Epigenetic inheritance uncoupled from sequence-specific recruitment. Science,
348(6230):1258699, Apr 2015.
P. Ramachandran , G. A. Palidwor , C. J. Porter , and T. J. Perkins . MaSC: Mappability-sensitive cross-correlation for estimating mean
fragment length of single-end short-read sequencing data. Bioinformatics, 29(4):444–450, Feb 2013.
T. Rausch , T. Zichner , A. Schlattl , A. M. Stütz , V. Benes , and J. O. Korbel . DELLY: Structural variant discovery by integrated paired-
end and split-read analysis. Bioinformatics, 28(18):i333–i339, Sep 2012.
J. Reumers , P. De Rijk , H. Zhao , A. Liekens , D. Smeets , J. Cleary , P. Van Loo , M. Van Den Bossche , K. Catthoor , B. Sabbe , E.
Despierre , I. Vergote , B. Hilbush , D. Lambrechts , and J. Del-Favero . Optimized filtering reduces the error rate in detecting genomic
variants by short-read sequencing. Nat Biotechnol, 30(1):61–68, Jan 2012.
H. S. Rhee and B. F. Pugh . Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell,
147(6):1408–1419, Dec 2011.
F. J. Ribeiro , D. Przybylski , S. Yin , T. Sharpe , S. Gnerre , A. Abouelleil , A. M. Berlin , A. Montmayeur , T. P. Shea , B. J. Walker , S. K.
Young , C. Russ , C. Nusbaum , I. MacCallum , and D. B. Jaffe . Finished bacterial genomes from shotgun sequence data. Genome
Research, 22(11):2270–2277, Nov 2012.
G. Rizk , D. Lavenier , and R. Chikhi . DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, Mar 2013.
N. D. Roberts , R. D. Kortschak , W. T. Parker , A. W. Schreiber , S. Branford , H. S. Scott , G. Glonek , and D. L. Adelson . A comparative
analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18):2223–2230, Sep 2013.
G. Robertson , M. Hirst , M. Bainbridge , M. Bilenky , Y. Zhao , T. Zeng , G. Euskirchen , B. Bernier , R. Varhol , A. Delaney , N. Thiessen ,
O. L. Griffith , A. He , M. Marra , M. Snyder , and S. Jones . Genome-wide profiles of STAT1 DNA association using chromatin
immunoprecipitation and massively parallel sequencing. Nat Methods, 4(8):651–657, Aug 2007.
G. Robertson , J. Schein , R. Chiu , R. Corbett , M. Field , S. D. Jackman , K. Mungall , S. Lee , H. M. Okada , J. Q. Qian , M. Griffith , A.
Raymond , N. Thiessen , T. Cezard , Y. S. Butterfield , R. Newsome , S. K. Chan , R. She , R. Varhol , B. Kamoh , A.-L. Prabhu , A. Tam ,
Y. Zhao , R. A. Moore , M. Hirst , M. A. Marra , S. J. M. Jones , P. A. Hoodless , and I. Birol . De novo assembly and analysis of RNA-seq
data. Nat Methods, 7(11):909–912, Nov 2010.
L. Roguski and S. Deorowicz . DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213–2215, Aug 2014.
A. Roth , J. Ding , R. Morin , A. Crisan , G. Ha , R. Giuliany , A. Bashashati , M. Hirst , G. Turashvili , A. Oloumi , M. A. Marra , S. Aparicio ,
and S. P. Shah . JointSNVMix: A probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation
sequencing data. Bioinformatics, 28(7):907–913, Apr 2012.
J. M. Rothberg , W. Hinz , T. M. Rearick , J. Schultz , W. Mileski , M. Davey , J. H. Leamon , K. Johnson , M. J. Milgrew , M. Edwards , J.
Hoon , J. F. Simons , D. Marran , J. W. Myers , J. F. Davidson , A. Branting , J. R. Nobile , B. P. Puc , D. Light , T. A. Clark , M. Huber , J.
T. Branciforte , I. B. Stoner , S. E. Cawley , M. Lyons , Y. Fu , N. Homer , M. Sedova , X. Miao , B. Reed , J. Sabina , E. Feierstein , M.
Schorn , M. Alanjary , E. Dimalanta , D. Dressman , R. Kasinskas , T. Sokolsky , J. A. Fidanza , E. Namsaraev , K. J. McKernan , A.
Williams , G. T. Roth , and J. Bustillo . An integrated semiconductor device enabling non-optical genome sequencing. Nature,
475(7356):348–352, Jul 2011.
J. D. Rowley . Letter: A new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence
and Giemsa staining. Nature, 243(5405):290–293, Jun 1973.
J. Rozowsky , G. Euskirchen , R. K. Auerbach , Z. D. Zhang , T. Gibson , R. Bjornson , N. Carriero , M. Snyder , and M. B. Gerstein .
PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol, 27(1):66–75, Jan 2009.
M. Ruffalo , T. LaFramboise , and M. Koyutürk . Comparative analysis of algorithms for next-generation sequencing read alignment.
Bioinformatics, 27(20):2790–2796, Oct 2011.
S. M. Rumble , P. Lacroute , A. V. Dalca , M. Fiume , A. Sidow , and M. Brudno . SHRiMP: Accurate mapping of short color-space reads.
PLoS Comput Biol, 5(5):e1000386, May 2009.
S. Saha , A. B. Sparks , C. Rago , V. Akmaev , C. J. Wang , B. Vogelstein , K. W. Kinzler , and V. E. Velculescu . Using the transcriptome
to annotate the genome. Nat Biotechnol, 20(5):508–512, May 2002.
M. K. Sakharkar , V. T. K. Chow , and P. Kangueane . Distributions of exons and introns in the human genome. In Silico Biol,
4(4):387–393, 2004.
L. Salmela , V. Mäkinen , N. Välimäki , J. Ylinen , and E. Ukkonen . Fast scaffolding with small independent mixed integer programs.
Bioinformatics, 27(23):3259–3265, Dec 2011.
F. Sanger and A. R. Coulson . A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol,
94(3):441–448, May 1975.
F. Sanger , S. Nicklen , and A. R. Coulson . DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA,
74(12):5463–5467, Dec 1977.
J. F. Sathirapongsasuti , H. Lee , B. A. J. Horst , G. Brunner , A. J. Cochran , S. Binder , J. Quackenbush , and S. F. Nelson . Exome
sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics, 27(19):2648–2654, Oct 2011.
A. Sboner , L. Habegger , D. Pflueger , S. Terry , D. Z. Chen , J. S. Rozowsky , A. K. Tewari , N. Kitabayashi , B. J. Moss , M. S. Chee , F.
Demichelis , M. A. Rubin , and M. B. Gerstein . FusionSeq: A modular framework for finding gene fusions by analyzing paired-end RNA-
sequencing data. Genome Biol, 11(10):R104, 2010.
E. E. Schadt , S. Turner , and A. Kasarskis . A window into third-generation sequencing. Hum Mol Genet, 19(R2):R227–R240, Oct 2010.
S. Schbath , V. Martin , M. Zytnicki , J. Fayolle , V. Loux , and J.-F. Gibrat . Mapping reads on a genomic sequence: An algorithmic
overview and a practical comparative analysis. J Comput Biol, 19(6):796–813, Jun 2012.
R. Schmieder and R. Edwards . Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6):863–864, Mar 2011.
J. P. Schouten , C. J. McElgunn , R. Waaijer , D. Zwijnenburg , F. Diepvens , and G. Pals . Relative quantification of 40 nucleic acid
sequences by multiplex ligation-dependent probe amplification. Nucleic Acids Res, 30(12):e57, Jun 2002.
J. Schröder , A. Hsu , S. E. Boyle , G. Macintyre , M. Cmero , R. W. Tothill , R. W. Johnstone , M. Shackleton , and A. T. Papenfuss .
Socrates: Identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics, 30(8):1064–1072,
Jan 2014.
M. H. Schulz , D. R. Zerbino , M. Vingron , and E. Birney . Oases: Robust de novo RNA-seq assembly across the dynamic range of
expression levels. Bioinformatics, 28(8):1086–1092, Apr 2012.
Z. Shao , Y. Zhang , G.-C. Yuan , S. H. Orkin , and D. J. Waxman . MAnorm: A robust model for quantitative comparison of ChIP-seq data
sets. Genome Biol, 13(3):R16, 2012.
L. Shi , Y. Guo , C. Dong , J. Huddleston , H. Yang , X. Han , A. Fu , Q. Li , N. Li , S. Gong , K. Lintner , Q. Ding , Z. Wang , J. Hu , D.
Wang , F. Wang , L. Wang , G. Lyon , Y. Guan , Y. Shen , O. Evgrafov , J. Knowles , F. Thibaud-Nissen , V. Schneider , C. Yu , L. Zhou ,
E. Eichler , K. So , and K. Wang . Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun, 7:12065, Jun 2016.
H. Shin , T. Liu , X. Duan , Y. Zhang , and X. S. Liu . Computational methodology for ChIP-seq analysis. Quantitative Biology, 1(1):54–70,
Mar 2013.
T. Shiraki , S. Kondo , S. Katayama , K. Waki , T. Kasukawa , H. Kawaji , R. Kodzius , A. Watahiki , M. Nakamura , T. Arakawa , S. Fukuda
, D. Sasaki , A. Podhajska , M. Harbers , J. Kawai , P. Carninci , and Y. Hayashizaki . Cap analysis gene expression for high-throughput
analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci USA, 100(26):15776–15781, Dec 2003.
J. T. Simpson and R. Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373, Jun
2010.
J. T. Simpson and R. Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Research,
22(3):549–556, Mar 2012.
J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol. ABySS: A parallel assembler for short read
sequence data. Genome Research, 19(6):1117–1123, Jun 2009.
S. Sindi, E. Helman, A. Bashir, and B. J. Raphael. A geometric approach for classification and comparison of structural variants.
Bioinformatics, 25(12):i222–i230, Jun 2009.
S. Sindi, S. Onal, L. C. Peng, H.-T. Wu, and B. J. Raphael. An integrative probabilistic model for identification of structural variation in
sequencing data. Genome Biol, 13(3):R22, 2012.
E. Siragusa, D. Weese, and K. Reinert. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic
Acids Res, 41(7):e78, Apr 2013.
A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC
Bioinformatics, 9:128, 2008.
M. J. Solomon and A. Varshavsky. Formaldehyde-mediated DNA-protein crosslinking: A probe for in vivo chromatin structures. Proc Natl
Acad Sci USA, 82(19):6470–6474, 1985.
C. Soneson and M. Delorenzi. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14:91,
2013.
L. Song, Z. Zhang, L. L. Grasfeder, A. P. Boyle, P. G. Giresi, B.-K. Lee, N. C. Sheffield, S. Gräf, M. Huss, D. Keefe, et al. Open
chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Research,
21(10):1757–1767, 2011.
T. Souaiaia, Z. Frazier, and T. Chen. ComB: SNP calling and mapping analysis for color and nucleotide space platforms. Journal of
Computational Biology, 18(6):795–807, Jun 2011.
A. Srebrow and A. R. Kornblihtt. The connection between splicing and cancer. J Cell Sci, 119(Pt 13):2635–2641, Jul 2006.
S. Stamm, J.-J. Riethoven, V. Le Texier, C. Gopalakrishnan, V. Kumanduri, Y. Tang, N. L. Barbosa-Morais, and T. A. Thanaraj. ASD:
A bioinformatics resource on alternative splicing. Nucleic Acids Res, 34(Database issue):D46–D55, Jan 2006.
P. Stankiewicz and J. R. Lupski. Structural variation in the human genome and its role in disease. Annu Rev Med, 61:437–455, 2010.
P. J. Stephens, C. D. Greenman, B. Fu, F. Yang, G. R. Bignell, L. J. Mudie, E. D. Pleasance, K. W. Lau, D. Beare, L. A. Stebbings,
S. McLaren, M.-L. Lin, D. J. McBride, I. Varela, S. Nik-Zainal, C. Leroy, M. Jia, A. Menzies, A. P. Butler, J. W. Teague, M. A. Quail,
J. Burton, H. Swerdlow, N. P. Carter, L. A. Morsberger, C. Iacobuzio-Donahue, G. A. Follows, A. R. Green, A. M. Flanagan, M. R.
Stratton, P. A. Futreal, and P. J. Campbell. Massive genomic rearrangement acquired in a single catastrophic event during cancer
development. Cell, 144(1):27–40, Jan 2011.
M. Sultan, M. H. Schulz, H. Richard, A. Magen, A. Klingenhoff, M. Scherf, M. Seifert, T. Borodina, A. Soldatov, D. Parkhomchuk, D.
Schmidt, S. O’Keeffe, S. Haas, M. Vingron, H. Lehrach, and M.-L. Yaspo. A global view of gene activity and alternative splicing by
deep sequencing of the human transcriptome. Science, 321(5891):956–960, Aug 2008.
W.-K. Sung. Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC Mathematical and Computational Biology. CRC
Press, 2009.
W.-K. Sung, H. Zheng, S. Li, R. Chen, X. Liu, Y. Li, N. P. Lee, W. H. Lee, P. N. Ariyaratne, C. Tennakoon, F. H. Mulawadi, K. F.
Wong, A. M. Liu, R. T. Poon, S. T. Fan, K. L. Chan, Z. Gong, Y. Hu, Z. Lin, G. Wang, Q. Zhang, T. D. Barber, W.-C. Chou, A.
Aggarwal, K. Hao, W. Zhou, C. Zhang, J. Hardwick, C. Buser, J. Xu, Z. Kan, H. Dai, M. Mao, C. Reinhard, J. Wang, and J. M. Luk.
Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet, 44(7):765–769, May 2012.
Y. Surget-Groba and J. I. Montoya-Burgos. Optimization of de novo transcriptome assembly from next-generation sequencing data.
Genome Research, 20(10):1432–1440, Oct 2010.
J. K. Teer and J. C. Mullikin. Exome sequencing: The sweet spot before whole genomes. Hum Mol Genet, 19(R2):R145–R151, Oct 2010.
W. Tembe, J. Lowey, and E. Suh. G-SQZ: Compact encoding of genomic sequence and quality data. Bioinformatics, 26(17):2192–2194,
Sep 2010.
C. Tennakoon, R. W. Purbojati, and W.-K. Sung. BatMis: A fast algorithm for k-mismatch mapping. Bioinformatics, 28(16):2122–2128,
Aug 2012.
C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, and L. Pachter. Differential analysis of gene regulation at transcript
resolution with RNA-seq. Nat Biotechnol, 31(1):46–53, Jan 2013.
C. Trapnell, L. Pachter, and S. L. Salzberg. TopHat: Discovering splice junctions with RNA-seq. Bioinformatics, 25(9):1105–1111, May
2009.
C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript
assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol,
28(5):511–515, May 2010.
T. J. Treangen, D. D. Sommer, F. E. Angly, S. Koren, and M. Pop. Next generation sequence assembly with AMOS. Curr Protoc
Bioinformatics, Chapter 11:Unit 11.8, Mar 2011.
A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. M. Myers, and A. Sidow. Genome-wide analysis of
transcription factor binding sites based on ChIP-seq data. Nat Methods, 5(9):829–834, Sep 2008.
V. B. Vega, E. Cheung, N. Palanisamy, and W.-K. Sung. Inherent signals in sequencing-based Chromatin-ImmunoPrecipitation control
libraries. PLoS One, 4(4):e5241, 2009.
V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270(5235):484–487, Oct
1995.
R. Wan, V. N. Anh, and K. Asai. Transformations for the compression of FASTQ quality scores of next-generation sequencing data.
Bioinformatics, 28(5):628–635, Mar 2012.
E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G. P. Schroth, and C. B. Burge. Alternative
isoform regulation in human tissue transcriptomes. Nature, 456(7221):470–476, Nov 2008.
H. Wang, D. Nettleton, and K. Ying. Copy number variation detection using next generation sequencing read counts. BMC
Bioinformatics, 15:109, 2014.
J. Wang, C. G. Mullighan, J. Easton, S. Roberts, S. L. Heatley, J. Ma, M. C. Rusch, K. Chen, C. C. Harris, L. Ding, L. Holmfeldt, D.
Payne-Turner, X. Fan, L. Wei, D. Zhao, J. C. Obenauer, C. Naeve, E. R. Mardis, R. K. Wilson, J. R. Downing, and J. Zhang. CREST
maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods, 8(8):652–654, 2011.
K. Wang, D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, P. Mieczkowski, S. A. Grimm, C. M. Perou, J. N.
MacLeod, D. Y. Chiang, J. F. Prins, and J. Liu. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic
Acids Res, 38(18):e178, Oct 2010.
X. Wang, Z. Wu, and X. Zhang. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq.
J Bioinform Comput Biol, 8 Suppl 1:177–192, Dec 2010.
Z. Wang, M. Gerstein, and M. Snyder. RNA-seq: A revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, Jan 2009.
R. L. Warren, G. G. Sutton, S. J. M. Jones, and R. A. Holt. Assembling millions of short DNA sequences using SSAKE. Bioinformatics,
23(4):500–501, Feb 2007.
J. D. Watson and F. H. Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–738,
Apr 1953.
D. Weese, A.-K. Emde, T. Rausch, A. Döring, and K. Reinert. RazerS: Fast read mapping with sensitivity control. Genome Research,
19(9):1646–1654, Sep 2009.
C.-L. Wei, Q. Wu, V. B. Vega, K. P. Chiu, P. Ng, T. Zhang, A. Shahab, H. C. Yong, Y. Fu, Z. Weng, J. Liu, X. D. Zhao, J.-L. Chew,
Y. L. Lee, V. A. Kuznetsov, W.-K. Sung, L. D. Miller, B. Lim, E. T. Liu, Q. Yu, H.-H. Ng, and Y. Ruan. A global map of p53
transcription-factor binding sites in the human genome. Cell, 124(1):207–219, Jan 2006.
P. Weiner. Linear pattern matching algorithms. Switching and Automata Theory, pages 1–11, 1973.
J. Weischenfeldt, O. Symmons, F. Spitz, and J. O. Korbel. Phenotypic impact of genomic structural variation: Insights from and for
human disease. Nat Rev Genet, 14(2):125–138, Feb 2013.
D. A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. McGuire, W. He, Y.-J. Chen, V. Makhijani, G. T. Roth, X. Gomes, K.
Tartaro, F. Niazi, C. L. Turcotte, G. P. Irzyk, J. R. Lupski, C. Chinault, X.-Z. Song, Y. Liu, Y. Yuan, L. Nazareth, X. Qin, D. M. Muzny,
M. Margulies, G. M. Weinstock, R. A. Gibbs, and J. M. Rothberg. The complete genome of an individual by massively parallel DNA
sequencing. Nature, 452(7189):872–876, Apr 2008.
B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett, J. Rogers, and J. Bähler. Dynamic
repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453(7199):1239–1243, Jun 2008.
A. Wilm, P. P. K. Aw, D. Bertrand, G. H. T. Yeo, S. H. Ong, C. H. Wong, C. C. Khor, R. Petric, M. L. Hibberd, and N. Nagarajan.
LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput
sequencing datasets. Nucleic Acids Res, 40(22):11189–11201, Dec 2012.
K. Wong, T. M. Keane, J. Stalker, and D. J. Adams. Enhanced structural variant and breakpoint detection using SVMerge by integration
of multiple detection methods and local assembly. Genome Biol, 11(12):R128, 2010.
J. Wu, O. Anczuków, A. R. Krainer, M. Q. Zhang, and C. Zhang. OLego: Fast and sensitive mapping of spliced mRNA-seq reads using
small seeds. Nucleic Acids Res, 41(10):5149–5163, Apr 2013.
C. Xie and M. T. Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC
Bioinformatics, 10:80, 2009.
Y. Xie, G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, W. Huang, G. He, S. Gu, S. Li, X. Zhou, T.-W. Lam, Y. Li, X. Xu, G. K.-S.
Wong, and J. Wang. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-seq reads. Bioinformatics,
30(12):1660–1666, Jun 2014.
H. Xu, L. Handoko, X. Wei, C. Ye, J. Sheng, C.-L. Wei, F. Lin, and W.-K. Sung. A signal-noise model for significance analysis of
ChIP-seq with negative control. Bioinformatics, 26(9):1199–1204, May 2010.
H. Xu and W.-K. Sung. Identifying differential histone modification sites from ChIP-seq data. Methods Mol Biol, 802:293–303, 2012.
H. Yang, Y. Zhong, C. Peng, J.-Q. Chen, and D. Tian. Important role of indels in somatic mutations of human cancer genes. BMC Med
Genet, 11:128, 2010.
L. Yang, L. J. Luquette, N. Gehlenborg, R. Xi, P. S. Haseley, C.-H. Hsieh, C. Zhang, X. Ren, A. Protopopov, L. Chin, R. Kucherlapati,
C. Lee, and P. J. Park. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell, 153(4):919–929, May
2013.
L. R. Yates and P. J. Campbell. Evolution of the cancer genome. Nat Rev Genet, 13(11):795–806, Nov 2012.
K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning. Pindel: A pattern growth approach to detect break points of large deletions and
medium sized insertions from paired-end short reads. Bioinformatics, 25(21):2865–2871, Nov 2009.
S. Yoon, Z. Xuan, V. Makarov, K. Ye, and J. Sebat. Sensitive and accurate detection of copy number variants using read depth of
coverage. Genome Research, 19(9):1586–1592, Sep 2009.
D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research,
18(5):821–829, May 2008.
Y. Zhang, E.-W. Lameijer, P. A. ’t Hoen, Z. Ning, P. E. Slagboom, and K. Ye. PASSion: A pattern growth algorithm-based pipeline for
splice junction detection in paired-end RNA-seq data. Bioinformatics, 28(4):479–486, Feb 2012.
Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S.
Liu. Model-based analysis of ChIP-seq (MACS). Genome Biol, 9(9):R137, 2008.
Z. D. Zhang, J. Du, H. Lam, A. Abyzov, A. E. Urban, M. Snyder, and M. Gerstein. Identification of genomic indels and structural
variations using split reads. BMC Genomics, 12:375, 2011.

