Next-Generation Sequencing Data Analysis 2nd Edition
For each NGS application, this book covers topics from experimental design,
sample processing, sequencing strategy formulation, to sequencing read quality
control, data preprocessing, read mapping or assembly, and more advanced
stages that are specific to each application. Major applications include:
Before detailing the analytic steps for each of these applications, the book
presents introductory cellular and molecular biology as a refresher mostly
for data scientists, the ins and outs of widely used NGS platforms, and an
overview of computing needs for NGS data management and analysis. The
book concludes with a chapter on the changing landscape of NGS technologies
and data analytics.
The second edition of this book builds on the well-received first edition
by providing updates to each chapter. Two brand new chapters have been
added to meet rising data analysis demands on single-cell RNA-seq and clinical
sequencing. The increasing use of long-read sequencing has also been
reflected in all NGS applications. This book discusses concepts and principles
that underlie each analytic step, along with software tools for implementation.
It highlights key features of the tools while omitting tedious details to
provide an easy-to-follow guide for practitioners in life sciences, bioinformatics,
biostatistics, and data science. Tools introduced in this book are open
source and freely available.
Next-Generation
Sequencing Data
Analysis
Second Edition
Xinkun Wang
Contents
4.2.1.2 Implementation
4.2.1.3 Error Rate, Read Length, Data Output, and Cost
4.2.1.4 Sequence Data Generation
4.2.2 Pacific Biosciences Single-Molecule Real-Time (SMRT) Long-Read Sequencing
4.2.2.1 Sequencing Principle
4.2.2.2 Implementation
4.2.2.3 Error Rate, Read Length, Data Output, and Cost
4.2.2.4 Sequence Data Generation
4.2.3 Oxford Nanopore Technologies (ONT) Long-Read Sequencing
4.2.3.1 Sequencing Principle
4.2.3.2 Implementation
4.2.3.3 Error Rate, Read Length, Data Output, and Cost
4.2.3.4 Sequence Data Generation
4.2.4 Ion Torrent Semiconductor Sequencing
4.2.4.1 Sequencing Principle
4.2.4.2 Implementation
4.2.4.3 Error Rate, Read Length, Data Output, and Cost
4.2.4.4 Sequence Data Generation
4.3 A Typical NGS Workflow
4.4 Biases and Other Adverse Factors That May Affect NGS Data Accuracy
4.4.1 Biases in Library Construction
4.4.2 Biases and Other Factors in Sequencing
4.5 Major Applications of NGS
4.5.1 Transcriptomic Profiling (Bulk and Single-Cell RNA-Seq)
4.5.2 Genetic Mutation and Variation Identification
4.5.3 De Novo Genome Assembly
4.5.4 Protein-DNA Interaction Analysis (ChIP-Seq)
4.5.5 Epigenomics and DNA Methylation Study (Methyl-Seq)
4.5.6 Metagenomics
12. De Novo Genome Assembly with Long and/or Short Reads
12.1 Genomic Factors and Sequencing Strategies for De Novo Assembly
Part I
1. The Cellular System and the Code of Life
DOI: 10.1201/9780429329180-2
1.3 Molecules in Cells
Different types of molecules are needed to carry out the various cellular
processes. In a typical cell, water is the most abundant molecule, representing
70% of the total cell weight. Besides water, there is a large variety of small
and large molecules. The major categories of small molecules include inorganic
ions (Na+, K+, Ca2+, Cl-, Mg2+, etc.), monosaccharides, fatty acids, amino acids,
and nucleotides. Major varieties of large molecules are polysaccharides, lipids,
proteins, and nucleic acids (DNA and RNA). Among these components,
the inorganic ions are important for signaling (e.g., waves of Ca2+ represent
an important intracellular signal), cell energy storage (e.g., in the form of the
Na+/K+ cross-membrane gradient), or protein structure/function (e.g., Mg2+ is
an essential cofactor for many metalloproteins). Carbohydrates (including
monosaccharides and polysaccharides), fatty acids, and lipids are major
energy-providing molecules in the cell. Lipids are also the major component
of the cell membrane. Proteins, which are assembled from 20 types of amino acids
in different order and length, underlie almost all cellular activities, including
metabolism, signal transduction, DNA replication, and cell division. They
are also the building blocks of many subcellular structures, such as the
cytoskeleton (see next section). Nucleic acids carry the code of life in their nearly
endless nucleotide permutations, which not only provides instructions on the
assembly of all proteins in cells but also exerts control on how such assembly
is carried out based on environmental conditions.
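To give a rough sense of scale for the "nearly endless" permutations mentioned above, the number of distinct linear sequences grows exponentially with length: 4^n for a nucleic acid of n nucleotides and 20^n for a protein of n amino acids. The short Python sketch below works through that arithmetic. The function name `sequence_space` and the example length of 10 are illustrative choices, not values taken from the text.

```python
def sequence_space(alphabet_size: int, length: int) -> int:
    """Count the distinct linear sequences of a given length.

    Each position can hold any of `alphabet_size` building blocks,
    so the count is alphabet_size ** length.
    """
    if alphabet_size < 1 or length < 0:
        raise ValueError("alphabet_size must be >= 1 and length must be >= 0")
    return alphabet_size ** length


# 4 nucleotide types, 10 positions (illustrative length)
dna_10 = sequence_space(4, 10)
# 20 amino acid types, 10 positions (illustrative length)
protein_10 = sequence_space(20, 10)

print(dna_10)      # 1048576
print(protein_10)  # 10240000000000
```

Even at a length of only 10 units, the protein sequence space is already in the trillions. Real genes and proteins are far longer, which is what makes the permutations effectively unlimited.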
1.4.1 Nucleus
Since DNA stores the code of life, it must be protected and properly maintained
to avoid possible damage and ensure accuracy and stability. As proper execu-
tion of the genetic information embedded in the DNA is critical to the normal
functioning of a cell, gene expression must also be tightly regulated under
all conditions. The nucleus, located in the center of most cells in eukaryotes,
offers a well-protected environment for DNA storage, maintenance, and gene
expression. The nuclear space is enclosed by the nuclear envelope, which consists
of two concentric membranes. To allow movement of proteins and RNAs across
the nuclear envelope, which is essential for gene expression, there are pores
on the nuclear envelope that span the inner and outer membranes. The
mechanical support of the nucleus is provided by the nucleoskeleton, a network
of structural proteins including lamins and actin, among others. Inside the
nucleus, long strings of DNA molecules, through binding to certain proteins
called histones, are heavily packed to fit into the limited nuclear space. In
prokaryotic cells, the nucleoid, a nucleus-like, irregularly shaped region without
a membrane enclosure, provides a similar but less well-protected space for DNA.
FIGURE 1.1
The general structure of a typical eukaryotic cell. Shown here is an animal cell.
Labeled structures: nucleus, nucleolus, nuclear envelope (with nuclear pores),
chromatin, cell membrane, cytoplasm, ribosome, rough ER, smooth ER, Golgi
apparatus, mitochondrion, lysosome, peroxisome, endosome, centrosome,
microtubule, microfilament, and intermediate filament.
1.4.2 Cell Membrane
The cell membrane serves as a barrier to protect the internal structure of a
cell from the outside environment. Biochemically, the cell membrane, as well
as all other intracellular membranes such as the nuclear envelope, assumes
a lipid bilayer structure. While protecting the cell's internal structure,
the cell membrane is also where the cell exchanges materials, and concurrently
energy, with the outside environment. Since the membrane is made of lipids,
most water-soluble substances, including ions, carbohydrates, amino acids,
and nucleotides, cannot directly cross it. To overcome this barrier, there are
channels, transporters, and pumps, all of which are specialized proteins, on
the cell membrane. Channels and transporters facilitate passive movement,
that is, in the direction from high to low concentration, without consumption
of cellular energy. Pumps, on the other hand, provide active transport:
they move molecules against the concentration gradient and therefore
consume energy.
The cell membrane is also where a cell receives most incoming signals from
the environment. After signal molecules bind to their specific receptors on the
cell membrane, the signal is relayed to the inside, usually eliciting a series of
intracellular reactions. The ultimate cellular response that the signal induces
is dependent on the nature of the signal, as well as the type and condition
of the cell. For example, upon detecting insulin in the blood via the insulin
receptor in their membrane, cells in the liver respond by taking up glucose
from the blood for storage.
1.4.3 Cytoplasm
Inside the cell membrane, the cytoplasm is the thick solution that contains the
majority of cellular substances, including all organelles in eukaryotic cells
but excluding the nucleus in eukaryotic cells and the DNA in prokaryotic
cells. The general fluid component of the cytoplasm that excludes the
organelles is called the cytosol. The cytosol makes up more than half of the
cellular volume and is where many cellular activities take place, including
a large number of metabolic steps such as glycolysis and interconversion of
molecules, and most signal transduction steps. In prokaryotic cells, due to
the lack of the nucleus and other specialized organelles, the cytosol is almost
the entire intracellular space and where most cellular activities take place.
Besides water, the cytosol contains large amounts of small and large
molecules. Small molecules, such as inorganic ions, provide an overall bio-
chemical environment for cellular activities. In addition, ions such as Na+,
K+, and Ca2+ also have substantial concentration differences between the
cytosol and the extracellular space. Cells spend a lot of energy maintaining
these concentration differences, and use them for signaling and metabolic
purposes. For example, the concentration of Ca2+ in the cytosol is normally
kept very low at ~10⁻⁷ M, whereas in the extracellular space it is ~10⁻³ M. The
rushing in of Ca2+ under certain conditions through ligand- or voltage-gated
channels serves as an important messenger, inducing responses in a
number of signaling pathways, some of which lead to altered gene expression.
Besides small molecules, the cytosol also contains large numbers of
macromolecules. Far from simply diffusing at random in the cytosol,
these large molecules form molecular machines that collectively function as
1.4.5 Ribosome
The ribosome is the protein assembly factory in cells, translating genetic
information carried in messenger RNAs (mRNAs) into proteins. There are vast
1.4.6 Endoplasmic Reticulum
As indicated by the name, the endoplasmic reticulum (ER) is a network of
membrane-enclosed spaces throughout the cytosol. These spaces interconnect and
form a single internal environment called the ER lumen. There are two types of
ER in cells: rough ER and smooth ER. The rough ER is where all cell membrane
proteins, such as ion channels, transporters, pumps, and signal molecule
receptors, as well as secretory proteins, such as insulin, are produced and
sorted. The characteristic surface roughness of this type of ER comes from the
ribosomes that bind to it on the outside. Proteins destined for the cell
membrane or secretion, once emerging from these ribosomes, are threaded into
the ER lumen. This ER-targeting process is mediated by a signal sequence, or
“address tag,” located at the beginning of these proteins. This signal
sequence is subsequently cleaved off inside the ER before the protein synthesis
process is complete. Functionally different from the rough ER, the smooth ER
plays an important role in lipid synthesis for the replenishment of cellular
membranes.
Besides membrane and secretory protein preparation and lipid synthesis,
one other important function of the ER is to sequester Ca2+ from the cytosol. In
Ca2+-mediated cell signaling, shortly after entry of the calcium wave into the
cytosol, most of the incoming Ca2+ needs to be pumped out of the cell and/or
sequestered into specific organelles such as the ER and mitochondria.
1.4.7 Golgi Apparatus
Besides ER, the Golgi apparatus also plays an indispensable role in sorting
as well as dispatching proteins to the cell membrane, extracellular space,
or other subcellular destinations. Many proteins synthesized in the ER are
sent to the Golgi apparatus via small vesicles for further processing before
being sent to their final destinations. Therefore the Golgi apparatus is
sometimes metaphorically described as the “post office” of the cell. The
processing carried out in this organelle includes chemical modification of some
of the proteins, such as adding oligosaccharide side chains, which serve as
“address labels.” Other important functions of the Golgi apparatus include
synthesizing carbohydrates and extracellular matrix materials, such as the
polysaccharides used to build the plant cell wall.
1.4.8 Cytoskeleton
Cellular processes like the trafficking of proteins in vesicles from ER to the
Golgi apparatus, or the movement of a mitochondrion from one intracellular
location to another, are not simply based on diffusion. Rather, they follow
tracks provided by a protein-made skeletal structure inside the cytosol, the
cytoskeleton. Besides providing tracks for intracellular transport, the
cytoskeleton, like the skeleton in the human body, plays an equally important
role in maintaining cell shape and protecting the cell framework from physical
stresses, as the lipid bilayer cell membrane is fragile and vulnerable to
such stresses. In eukaryotic cells, there are three major types of cytoskeletal
structures: microfilament, microtubule, and intermediate filament. Each type
is made of distinct proteins and has its own unique characteristics and
functions. For example, microfilament and microtubule are assembled from
actins and tubulins, respectively, and differ in thickness (the diameter is
around 6 nm for microfilament and 23 nm for microtubule). While biochemically
and structurally different, both the microfilament and the microtubule
have been known to provide tracks for mRNA transport in the form of large
ribonucleoprotein complexes to specific intracellular sites, such as the distal
end of a neuronal dendrite, for targeted protein translation [4]. Besides its role
in intracellular transportation, the microtubule also plays a key role in cell
division through attaching to the duplicated chromosomes and moving them
equally into two daughter cells. In this process, all microtubules involved are
organized around a small organelle called a centrosome. Previously thought
to be only present in eukaryotic cells, cytoskeletal structures have also been
discovered in prokaryotic cells [5].
1.4.9 Mitochondrion
The mitochondrion is the “powerhouse” in eukaryotic cells. While some
energy is produced from the glycolytic pathway in the cytosol, most
energy is generated from the Krebs cycle and the oxidative phosphorylation
process, which take place in the many mitochondria contained in a
cell. The number of mitochondria in a cell is ultimately dependent on its
energy demand. The more energy a cell needs, the more mitochondria
it has. Structurally, the mitochondrion is an organelle enclosed by two
membranes. The outer membrane is highly permeable to most cytosolic
molecules, and as a result the intermembrane space between the outer and
inner membranes is similar to the cytosol. Most of the energy-releasing
process occurs in the inner membrane and in the matrix, that is, the space
enclosed by the inner membrane. For the energy release, high-energy electron
carriers generated from the Krebs cycle in the matrix are fed into an
electron transport chain embedded in the inner membrane. The energy
released from the transfer of high-energy electrons through the chain to
molecular oxygen (O2), the final electron acceptor, creates a proton gradient
across the inner membrane. This proton gradient serves as the
energy source for the synthesis of ATP, the universal energy currency in
cells. Since prokaryotic cells do not have this organelle, their ATP synthesis
takes place on the cytoplasmic membrane instead.
According to the widely accepted endosymbiotic theory, the mitochondrion
originated from an ancient α-proteobacterium. Not surprisingly, the
mitochondrion carries its own DNA, but the genetic information contained
in the mitochondrial DNA (mtDNA) is extremely limited compared to the
nuclear DNA. The human mitochondrial DNA, for example, is 16,569 bp
in size and contains 37 genes, including 22 for transfer RNAs (tRNAs), 2
for rRNAs, and 13 for mitochondrial proteins. While it is much smaller
than the nuclear genome, there are multiple copies of mtDNA
molecules in each mitochondrion. Since cells usually contain hundreds
to thousands of mitochondria, there is a large number of mtDNA
molecules in each cell. In comparison, most cells only contain two copies
of the nuclear DNA. As a result, when sequencing cellular DNA samples,
sequences derived from mitochondrial DNA usually comprise a notable,
sometimes substantial, percentage of total generated reads. Although
small, the mitochondrial genomic system is fully functional and has the
entire set of protein factors for mtDNA transcription, translation, and
replication. As a result of its activity, when cellular RNA molecules are
sequenced, those transcribed from the mitochondrial genome also generate
significant amounts of reads in the sequence output.
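As a rough illustration of why mitochondrial reads can be so prominent, the sketch below estimates the share of DNA reads expected from mtDNA if reads were sampled in proportion to the number of base pairs each genome contributes. The 16,569 bp mtDNA length, the gene counts, and the two nuclear genome copies come from the text. The haploid nuclear genome size of about 3.1 billion bp, the example copy numbers, and the function name are assumptions made for illustration, and real libraries deviate from this idealized proportional sampling.

```python
MT_GENOME_BP = 16_569        # human mtDNA length (from the text)
NUCLEAR_HAPLOID_BP = 3.1e9   # ASSUMPTION: approximate haploid nuclear genome size
NUCLEAR_COPIES = 2           # most cells carry two nuclear genome copies (from the text)

# Gene content of human mtDNA as listed in the text.
MT_GENES = {"tRNA": 22, "rRNA": 2, "protein": 13}
assert sum(MT_GENES.values()) == 37


def expected_mt_read_fraction(mt_copies_per_cell: float) -> float:
    """Idealized fraction of DNA reads from mtDNA, assuming reads are
    sampled in proportion to base pairs contributed by each genome."""
    if mt_copies_per_cell < 0:
        raise ValueError("mt_copies_per_cell must be non-negative")
    mt_bp = mt_copies_per_cell * MT_GENOME_BP
    nuclear_bp = NUCLEAR_COPIES * NUCLEAR_HAPLOID_BP
    return mt_bp / (mt_bp + nuclear_bp)


# Hypothetical copy numbers spanning the "hundreds to thousands" range.
for copies in (100, 1_000, 10_000):
    share = expected_mt_read_fraction(copies)
    print(f"{copies:>6} mtDNA copies per cell -> {share:.2%} of reads")
```

Under these assumptions the mtDNA share rises with copy number even though each mtDNA molecule is tiny relative to the nuclear genome.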
The many copies of mtDNA molecules in a cell may not all have the same
sequence due to mutations in individual molecules. Heteroplasmy occurs
when cells contain a heterogeneous set of mtDNA molecules. In general, mito-
chondrial DNA has a higher mutation rate than its nuclear counterpart. This
is because the transfer of high-energy electrons along the electron transport
chain can produce reactive oxygen species as byproducts, which can oxidize
and cause mutations in mtDNA. To make this situation even worse, the DNA
repair capability in mitochondria is rather limited. Increased heteroplasmy
has been associated with higher risk of developing aging-related diseases,
including Alzheimer’s disease, heart disease, and Parkinson’s disease [6].
Furthermore, mitochondrial DNA mutations have been known to underlie
aging and cancer development [7]. Certain hereditary mtDNA mutations
also underlie maternally inherited diseases that mostly affect the nervous
system and muscle, both of which are characterized by high energy demand.
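When mtDNA is sequenced, heteroplasmy at a given position can be summarized as the fraction of reads that do not carry the most common base. The sketch below computes that summary from per-base read counts. The function name, the input format, and the example counts are hypothetical, and the sketch ignores sequencing errors, which a real analysis would need to separate from true low-level variants.

```python
from collections import Counter
from typing import Mapping


def heteroplasmy_fraction(base_counts: Mapping[str, int]) -> float:
    """Fraction of reads at one mtDNA position that differ from the
    most common base (0.0 means all reads agree)."""
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    major = max(base_counts.values())
    return (total - major) / total


# Hypothetical position covered by 200 reads: 180 show A, 20 show G.
pileup = Counter("A" * 180 + "G" * 20)
print(heteroplasmy_fraction(pileup))  # 0.1
```

A value of 0.0 corresponds to a position where every read agrees, while larger values indicate a more mixed population of mtDNA molecules at that position.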
1.4.10 Chloroplast
In animal cells, the mitochondrion is the only organelle that contains an
extranuclear genome. Plant and algal cells have another extranuclear genome
besides the mitochondrial genome: the plastid genome. The plastid is an organelle
that can differentiate into various forms, the most prominent of which is the
chloroplast. The chloroplast carries out photosynthesis by capturing the energy
in sunlight and fixing it into carbohydrates using carbon dioxide as substrate,
releasing oxygen in the same process. For energy capturing, the green pigment
called chlorophyll first absorbs energy from sunlight, which is then transferred
through an electron transport chain to build up a proton gradient to drive the
synthesis of ATP. Despite the different energy source, the buildup of the proton
gradient for ATP synthesis in the chloroplast is very similar to that for ATP
synthesis in the mitochondrion. The chloroplast ATP derived from the captured
light energy is then spent on CO2 fixation. Similar to the mitochondrion, the
chloroplast also has two membranes: a highly permeable outer membrane and a
much less permeable inner membrane. The photosynthetic electron transport chain,
however, is not located in the inner membrane, but in the membrane of a series of
sac-like structures called thylakoids located in the chloroplast stroma (analogous
to the mitochondrial matrix).
The plastid is believed to have evolved from an endosymbiotic cyanobacterium,
which has gradually lost the majority of the genes in its genome over millions of
years. The current size of most plastid genomes is 120–200 kb, coding for rRNAs,
tRNAs, and proteins. In higher plants there are around 100 genes coding for
various proteins of the photosynthetic system [8]. The transmission of plastid
DNA (ptDNA) from parent to offspring is more complicated than the maternal
transmission of mtDNA usually observed in animals. Based on the transmission
pattern, it can be classified into three types: 1) maternal, inheritance only
through the female parent; 2) paternal, inheritance only through the male parent;
or 3) biparental, inheritance through both parents [9]. Similar to the situation
in the mitochondrion, there exist multiple copies of ptDNA in each plastid, and as
a result there are large numbers of ptDNA molecules in each cell with potential
heteroplasmy. Transcription from these ptDNA molecules also generates copious
amounts of RNAs in the organelle. Therefore, sequence reads from ptDNA or RNA
comprise part of the data when sequencing plant and algal DNA or RNA samples,
along with those from mtDNA or RNA.
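One simple way to see how much of a data set comes from organelle genomes is to tally mapped reads per reference sequence. The sketch below parses text in the four-column, tab-separated layout that `samtools idxstats` prints (reference name, sequence length, mapped reads, unmapped reads) and reports the share of mapped reads on organelle sequences. The reference names `chrM` and `chrC`, the lengths and counts in the example, and the function name are hypothetical; organelle naming varies between reference assemblies.

```python
from typing import Dict, Iterable


def organelle_read_shares(idxstats_text: str,
                          organelle_names: Iterable[str]) -> Dict[str, float]:
    """Share of all mapped reads assigned to each named organelle sequence."""
    mapped = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Columns: reference name, length, mapped reads, unmapped reads
        name, _length, n_mapped, _n_unmapped = line.split("\t")
        mapped[name] = int(n_mapped)
    total = sum(mapped.values())
    if total == 0:
        raise ValueError("no mapped reads found")
    return {name: mapped.get(name, 0) / total for name in organelle_names}


# Hypothetical per-reference counts for a plant sample (all values made up).
example = (
    "chr1\t30000000\t700000\t0\n"
    "chrM\t370000\t100000\t0\n"
    "chrC\t150000\t200000\t0\n"
    "*\t0\t0\t5000\n"   # the final '*' row holds reads with no reference
)

shares = organelle_read_shares(example, ["chrM", "chrC"])
print(shares)  # {'chrM': 0.1, 'chrC': 0.2}
```

In this made-up example, 30% of mapped reads come from the two organelle genomes, which is the kind of contribution the paragraph above describes for plant and algal samples.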
References
1. Vale RD. The molecular motor toolbox for intracellular transport. Cell 2003,
112(4):467–480.
2. de Duve C. Peroxisomes and related particles in historical perspective. Ann N
Y Acad Sci 1982, 386:1–4.
3. Gabaldon T. Evolution of the peroxisomal proteome. Subcell Biochem 2018,
89:221–233.
4. Das S, Vera M, Gandin V, Singer RH, Tutucci E. Intracellular mRNA transport
and localized translation. Nat Rev Mol Cell Biol 2021, 22(7):483–504.
5. Mayer F. Cytoskeletons in prokaryotes. Cell Biol Int 2003, 27(5):429–438.
6. Chocron ES, Munkacsy E, Pickering AM. Cause or casualty: the role of
mitochondrial DNA in aging and age-associated disease. Biochim Biophys Acta Mol
Basis Dis 2019, 1865(2):285–297.
7. Smith ALM, Whitehall JC, Greaves LC. Mitochondrial DNA mutations in
ageing and cancer. Mol Oncol 2022, 16(18):3276–3294.
8. de Vries J, Archibald JM. Plastid genomes. Curr Biol 2018, 28(8):R336–R337.
9. Harris SA, Ingram R. Chloroplast DNA and biosystematics: the effects of
intraspecific diversity and plastid transmission. Taxon 1991:393–412.
10. Roy U, Grewal RK, Roy S. Complex Networks and Systems Biology. In:
Systems and Synthetic Biology. Springer; 2015: 129–150.