ALGORITHMS FOR
NEXT-GENERATION SEQUENCING
Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
1 Introduction
1.1 DNA, RNA, protein and cells
1.2 Sequencing technologies
1.3 First-generation sequencing
1.4 Second-generation sequencing
1.4.1 Template preparation
1.4.2 Base calling
1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides
1.4.4 Polymerase-mediated methods based on unmodified nucleotides
1.4.5 Ligase-mediated method
1.5 Third-generation sequencing
1.5.1 Single-molecule real-time sequencing
1.5.2 Nanopore sequencing method
1.5.3 Direct imaging of DNA using electron microscopy
1.6 Comparison of the three generations of sequencing
1.7 Applications of sequencing
1.8 Summary and further reading
1.9 Exercises
8 RNA-seq
8.1 Introduction
8.2 High-throughput methods to study the transcriptome
8.3 Application of RNA-seq
8.4 Computational problems of RNA-seq
8.5 RNA-seq read mapping
8.5.1 Features used in RNA-seq read mapping
8.5.1.1 Transcript model
8.5.1.2 Splice junction signals
8.5.2 Exon-first approach
8.5.3 Seed-and-extend approach
8.6 Construction of isoforms
8.7 Estimating expression level of each transcript
8.7.1 Estimating transcript abundances when every read maps to exactly one transcript
8.7.2 Estimating transcript abundances when a read maps to multiple isoforms
8.7.3 Estimating gene abundance
References
Index
Preface
for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.
To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 is devoted to briefly describing these techniques.
Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper will align each raw read on the reference genome, that is,
identify the region in the reference genome which is highly similar to the read.
Then, the read mapper will output all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6−9.)
Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct novel genomes such as those of plants and bacteria.
De novo assembly involves a few steps: error correction, contig assembly (de Bruijn
graph approach or base-by-base extension approach), scaffolding and gap filling.
This chapter describes techniques developed for these steps.
Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 1000 bases and 1 small indel in
every 3000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.
Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. However, they
are more prone to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.
All the above technologies are related to genome sequencing. We can also
sequence RNA. This technology is known as RNA-seq. Chapter 8 studies methods
for analyzing RNA-seq. By applying computational methods on RNA-seq,
we can recover the transcriptome. More precisely, RNA-seq enables us to identify
exons and splice junctions. Then, we can predict the isoforms of the genes.
We can also determine the expression of each transcript and each gene.
By combining chromatin immunoprecipitation and next-generation sequencing,
we can sequence genome regions that are bound by some transcription
factors or with epigenetic marks. Such technology is known as ChIP-seq.
The computational methods that identify those binding sites are known
Wing-Kin Sung
Chapter 1
Introduction
1 The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a
geneticist at the Jackson Laboratory (Bar Harbor, ME) at a meeting held in Maryland on
the mapping of the human genome in 1986.
5’- A   C   G   T   A   G   C   T  -3’
    ||  ||| ||| ||  ||  ||| ||| ||
3’- T   G   C   A   T   C   G   A  -5’
FIGURE 1.1: The double-stranded DNA. The two strands show complementary
base pairing.
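The base pairing in Figure 1.1 can be sketched in a few lines of Python. This is a minimal illustration; the names COMPLEMENT, complement and reverse_complement are chosen here and do not come from the text. It assumes the strand contains only the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(strand: str) -> str:
    """Base-by-base partner strand, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Partner strand rewritten in its own 5'->3' direction."""
    return complement(strand)[::-1]

top = "ACGTAGCT"                 # upper strand of Figure 1.1, 5'->3'
print(complement(top))           # TGCATCGA (lower strand, read 3'->5')
print(reverse_complement(top))   # AGCTACGT (lower strand, read 5'->3')
```

Reversing is needed because sequences are conventionally written 5’ to 3’, while the lower strand in the figure is drawn 3’ to 5’.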
3. Separation by electrophoresis.
Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two techniques
have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on the cleaving of nucleotides by chemicals.
Four different chemicals are used to generate all sequences ending with A, C, G
and T, respectively. This allows us to generate all possible prefixes of the template.
This technique is most efficient for short DNA sequences. However, it
is considered unsafe because of the extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alternative.
Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural deoxynucleotides
(dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve as
non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA synthesis
reaction is randomly terminated whenever a ddNTP is added to the growing
oligonucleotide chain, resulting in truncated products of varying lengths with
an appropriate ddNTP at their 3’ terminus.
FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA
fragments by their lengths using gel electrophoresis (Step 3). Gel electrophoresis
is based on the fact that DNA is negatively charged. When an electrical
field is applied to a mixture of DNA on a gel, the DNA fragments will move
from the negative pole to the positive pole. Due to friction, short DNA fragments
travel faster than long DNA fragments. Hence, gel electrophoresis
separates the mixture into bands, each containing DNA molecules of the same
length.
[Figure: Sanger sequencing workflow. DNA template; insert into vector; insert into host cell;
cloning; cyclic sequencing; electrophoresis and readout. Example: template 3’-GCATCGGCATATG...-5’
with primer 5’-CGTA gives terminated products CGTAG, CGTAGC, ..., CGTAGCCGTATAC, and the
readout is GCCGTATAC.]
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
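Steps 2 to 4 can be sketched as a small simulation. This is an idealized illustration, assuming exactly one terminated product of every length and a perfect separation by length; the function names sanger_fragments and read_gel are hypothetical. The primer CGTA and the synthesized strand CGTAGCCGTATAC follow the worked example in the workflow figure.

```python
import random

def sanger_fragments(strand: str, primer: str, seed: int = 0) -> list[str]:
    """Step 2: all chain-terminated products (primer extended by 1, 2, ... bases)."""
    if not strand.startswith(primer):
        raise ValueError("the synthesized strand must start with the primer")
    fragments = [strand[:n] for n in range(len(primer) + 1, len(strand) + 1)]
    random.Random(seed).shuffle(fragments)  # the reaction yields an unordered mixture
    return fragments

def read_gel(fragments: list[str]) -> str:
    """Steps 3-4: order the fragments by length, then read each terminal (dye-labeled) base."""
    bands = sorted(fragments, key=len)      # shorter fragments travel faster
    return "".join(fragment[-1] for fragment in bands)

mixture = sanger_fragments("CGTAGCCGTATAC", primer="CGTA")
print(read_gel(mixture))  # GCCGTATAC
```

Only the last base of each fragment is read, because only the terminating ddNTP carries a fluorescent tag.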
In summary, the Sanger method can generate sequences of length ∼800 bp.
The process can be fully automated and hence it was a popular DNA sequenc
Given a set of DNA fragments, the template preparation step first generates
a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template on a bead. First of all, one piece of
DNA template and a bead are enclosed within a water droplet in oil. The surface
of every bead is coated with a primer corresponding to one type of adaptor.
The DNA template will hybridize with one primer on the surface of the bead.
Then, it is PCR amplified within the water droplet. Figure 1.4(b) illustrates
the emPCR. emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficiency. This limitation creates uneven sequencing of the DNA
templates in the sample.
FIGURE 1.4: (a) From the DNA fragments, DNA template is created by
attaching the two ends with adaptor sequences. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.
[Figure: sequencing with reversible terminators. (a) PCR clone. (b) Add reversible terminator
dGTP; wash and scan; after scanning, reverse the termination; repeat the steps to sequence
other bases.]
[Figure: flowgram. Vertical axis: intensity (1 to 6); horizontal axis: nucleotide flow order
ACGTACGTACGTACGTACGT.]
a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is complementary
to the template, polymerase will extend by one base and release H+.
The release of H+ changes the pH of the solution in the well, and an ISFET
sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the flowgram
obtained from the ISFET sensor, we can recover the sequences of the
templates.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot accurately determine the lengths of
long homopolymers.
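The interpretation of a flowgram can be sketched as follows. This is a minimal illustration, assuming a fixed cyclic flow order ACGT, a signal proportional to the number of bases incorporated in each flow, and decoding by simple rounding. The names FLOW_ORDER, simulate_flowgram and decode_flowgram are hypothetical, and real base callers use calibrated signal models.

```python
FLOW_ORDER = "ACGT"  # assumed cyclic order in which the dNTPs are flowed

def simulate_flowgram(seq: str) -> list[tuple[str, float]]:
    """Ideal, noise-free flowgram of a synthesized strand: one (dNTP, signal) pair per flow."""
    if set(seq) - set(FLOW_ORDER):
        raise ValueError("sequence may only contain A, C, G and T")
    flows, pos, k = [], 0, 0
    while pos < len(seq):
        base = FLOW_ORDER[k % len(FLOW_ORDER)]
        run = 0
        while pos < len(seq) and seq[pos] == base:  # a homopolymer is extended in one flow
            run += 1
            pos += 1
        flows.append((base, float(run)))
        k += 1
    return flows

def decode_flowgram(flows: list[tuple[str, float]]) -> str:
    """Round each signal to the nearest homopolymer length and concatenate the bases."""
    return "".join(base * int(signal + 0.5) for base, signal in flows)

flows = simulate_flowgram("TTGAAAC")
print(decode_flowgram(flows))  # TTGAAAC
```

A noisy signal of about 5.5 could be rounded to either 5 or 6 bases, which is why long homopolymers are hard to call with this kind of readout.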
A C G T
A 0 1 2 3
C 1 0 3 2
G 2 3 0 1
T 3 2 1 0
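Assuming the table above is the two-base color code behind the color-based sequences mentioned in Exercise 1, it can be applied as sketched below. The names COLOR, encode and decode are hypothetical, and the representation (a list of colors plus a known first base) is an assumption made for illustration.

```python
# Color of each pair of adjacent bases, copied from the table: COLOR[first][second].
COLOR = {
    "A": {"A": 0, "C": 1, "G": 2, "T": 3},
    "C": {"A": 1, "C": 0, "G": 3, "T": 2},
    "G": {"A": 2, "C": 3, "G": 0, "T": 1},
    "T": {"A": 3, "C": 2, "G": 1, "T": 0},
}

def encode(seq: str) -> list[int]:
    """One color per pair of adjacent bases."""
    return [COLOR[a][b] for a, b in zip(seq, seq[1:])]

def decode(first_base: str, colors: list[int]) -> str:
    """Rebuild the bases from a known first base and the colors."""
    seq = first_base
    for color in colors:
        row = COLOR[seq[-1]]
        # Each color appears exactly once per row, so the next base is unique.
        seq += next(base for base in "ACGT" if row[base] == color)
    return seq

colors = encode("ATGGC")
print(colors)               # [3, 1, 0, 3]
print(decode("A", colors))  # ATGGC
```

Because every base is covered by two consecutive colors, decoding needs one known base to start from, and a single wrong color changes all bases decoded after it.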
• Nanopore-sequencing technologies
of length up to 20,000 bp, with an average read length of about 10,000 bp.
Another advantage of PacBio RS is that it can detect methylation status
simultaneously.
However, PacBio sequencing is more costly. It is about 3−4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate
of up to 17.9% [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.
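A rough calculation shows why repeated sequencing helps. The sketch below is a simplification, assuming that errors in different passes are independent and that each position is called by majority vote; it ignores the alignment difficulties caused by indel errors. The function name majority_error_bound is hypothetical. The value returned is an upper bound, because the wrong calls would also have to agree with each other to win the vote.

```python
from math import comb

def majority_error_bound(p: float, n: int) -> float:
    """Probability that more than half of n independent passes are wrong at a position."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Per-pass error rate of 17.9%, as quoted above, for an odd number of passes.
for passes in (1, 3, 5, 9, 15):
    print(passes, majority_error_bound(0.179, passes))
```

Under these assumptions the bound falls quickly as the number of passes grows, because the errors are unbiased and a majority of passes rarely fail at the same position.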
flow through the pore continuously. As illustrated in Figure 1.9, DNA material
is placed in the top chamber. The positive charge draws a strand of DNA
from the top chamber to the bottom chamber through the
nanopore. By detecting the changes in the electrical conductivity in the
pore, the DNA sequence is decoded. (Note that IBM’s DNA transistor is a
prototype which uses a similar idea.)
The approach has difficulty in calling individual bases accurately. Instead,
Oxford Nanopore Technologies reads the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA sequence can be
decoded base by base.
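The idea of decoding k-base signals with a hidden Markov model can be sketched with the Viterbi algorithm. This is a toy illustration and not the vendor's method. It assumes k = 3, exactly one signal per one-base shift, Gaussian noise, uniform transitions between overlapping k-mers, and a made-up signal level for each k-mer (the function level is hypothetical; real pore models are measured).

```python
import itertools

K = 3  # toy k-mer size; the text suggests about 5
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]

def level(kmer: str) -> float:
    """Hypothetical mean signal of a k-mer: its value as a base-4 number."""
    value = 0
    for base in kmer:
        value = 4 * value + "ACGT".index(base)
    return float(value)

def decode_signals(signals: list[float], sigma: float = 1.0) -> str:
    """Viterbi decoding: hidden states are k-mers, consecutive states overlap by k-1 bases."""
    if not signals:
        return ""

    def log_emit(kmer: str, x: float) -> float:
        return -((x - level(kmer)) ** 2) / (2 * sigma**2)  # Gaussian log-likelihood

    score = {s: log_emit(s, signals[0]) for s in KMERS}
    back = []
    for x in signals[1:]:
        new_score, pointer = {}, {}
        for s in KMERS:
            # Allowed predecessors end with the first k-1 bases of s (uniform
            # transition probability, so the constant log term is omitted).
            best = max((b + s[:-1] for b in "ACGT"), key=lambda p: score[p])
            new_score[s] = score[best] + log_emit(s, x)
            pointer[s] = best
        score = new_score
        back.append(pointer)
    state = max(score, key=score.get)
    path = [state]
    for pointer in reversed(back):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path[0] + "".join(s[-1] for s in path[1:])

seq = "ACGTTGCA"
noise = [0.3, -0.3, 0.2, -0.2, 0.3, -0.1]
signals = [level(seq[i:i + K]) + noise[i] for i in range(len(seq) - K + 1)]
print(decode_signals(signals))  # ACGTTGCA
```

The overlap constraint is what lets neighboring signals support each other: every base contributes to k consecutive signals, so an ambiguous reading in one round can be resolved by the rounds around it.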
Oxford Nanopore Technologies has announced two sequencers: MinION and
GridION. MinION is a disposable USB-key sequencer. GridION is an expandable
sequencer. Oxford Nanopore Technologies claimed that GridION can
sequence 30x coverage of a human genome in 6 days at US$2200−$3600. It
has the potential to decode a DNA fragment of length 100,000 bp. Its cost is
about US$25−$40 per gigabase. Although it is not expensive, the error rate is
about 17.8% (4.9% insertion error, 7.8% deletion error and 5.1% substitution
error) [115].
Unlike Oxford Nanopore Technologies, Genia suggested combining a nanopore
and the DNA polymerase to sequence a single-stranded DNA template. In Genia,
the DNA polymerase is tethered to a biological nanopore. When a DNA
template comes into contact with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T, each attached to a
different short tag. When a nucleotide is incorporated into the growing strand,
the tag is cleaved and travels through the biological nanopore, and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100,000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments
are bound by the probes. Those DNA fragments with bound probes are
driven through a nanopore (see Figure 1.10(a)), creating a current-versus-time
trace. The trace gives the positions of the probes on the fragment
(see Figure 1.10(b)). We can align the fragments based on their inter-probe
distances; then, we obtain a probe map for the genome (see Figure 1.10(c)).
We can obtain the probe maps for different probes. By aligning all of them,
we obtain the whole genome.
Unlike Genia, Oxford Nanopore Technologies and the IBM DNA transistor,
NABsys does not require a very accurate current measurement from the
nanopore. The company claims that this method is cheap and fast, that its
reads are long and that it is accurate.
[Figure: cost in US$ on a logarithmic scale ($0.01 to $10,000,000) against time (Sep-01 to
Sep-15). Series: Cost per Mb of DNA bases; Cost per Genome.]
FIGURE 1.11: The sequencing cost over time. There are two curves. The
blue curve shows the sequencing cost per million sequenced bases while
the red curve shows the sequencing cost per human genome. (Data obtained
from http://www.genome.gov/sequencingcosts.)
ing has been applied to many other research areas, including metagenomics,
3D modeling of the genome, etc.
1.9 Exercises
1. Consider the DNA sequence 5’-ACTCAGTTCG-3’. What is its reverse
complement? The SOLiD sequencer will output color-based sequences.
What is the expected color-based sequence for the above DNA sequence
and its reverse complement? Do you observe an interesting property?