Next-Generation Sequencing Data Analysis 2nd Edition

Uploaded by

Zay Yar Win

Next-Generation Sequencing Data Analysis

Next-generation DNA and RNA sequencing has revolutionized biology and
medicine. With sequencing costs continuously dropping and our ability to
generate large datasets rising, data analysis becomes more important than
ever. Next-Generation Sequencing Data Analysis walks readers through
next-generation sequencing (NGS) data analysis step by step for a wide range
of NGS applications.

For each NGS application, this book covers topics from experimental design,
sample processing, sequencing strategy formulation, to sequencing read quality
control, data preprocessing, read mapping or assembly, and more advanced
stages that are specific to each application. Major applications include:

• RNA-seq: Both bulk and single cell (separate chapters)
• Genotyping and variant discovery through whole genome/exome sequencing
• Clinical sequencing and detection of actionable variants
• De novo genome assembly
• ChIP-seq to map protein-DNA interactions
• Epigenomics through DNA methylation sequencing
• Metagenome sequencing for microbiome analysis

Before detailing the analytic steps for each of these applications, the book
presents introductory cellular and molecular biology as a refresher mostly
for data scientists, the ins and outs of widely used NGS platforms, and an
overview of computing needs for NGS data management and analysis. The
book concludes with a chapter on the changing landscape of NGS technologies
and data analytics.

The second edition of this book builds on the well-received first edition
by providing updates to each chapter. Two brand new chapters have been
added to meet rising data analysis demands on single-cell RNA-seq and
clinical sequencing. The increasing use of long-read sequencing is also
reflected in all NGS applications. This book discusses concepts and principles
that underlie each analytic step, along with software tools for
implementation. It highlights key features of the tools while omitting tedious
details to provide an easy-to-follow guide for practitioners in life sciences,
bioinformatics, biostatistics, and data science. Tools introduced in this book
are open source and freely available.
Next-Generation Sequencing Data Analysis
Second Edition

Xinkun Wang
Second edition published 2024
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2024 Taylor & Francis Group, LLC
First edition published by CRC Press 2016
Reasonable efforts have been made to publish reliable data and information, but the author
and publisher cannot assume responsibility for the validity of all materials or the consequences
of their use. The authors and publishers have attempted to trace the copyright holders of all
material reproduced in this publication and apologize to copyright holders if permission to
publish in this form has not been obtained. If any copyright material has not been acknowledged
please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted,
reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means,
now known or hereafter invented, including photocopying, microfilming, and recording, or in
any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and
are used only for identification and explanation without intent to infringe.
ISBN: 9780367349899 (hbk)
ISBN: 9781032505701 (pbk)
ISBN: 9780429329180 (ebk)
DOI: 10.1201/9780429329180
Typeset in Palatino
by Newgen Publishing UK
Contents

Preface to the Second Edition
Author

Part I Introduction to Cellular and Molecular Biology

1. The Cellular System and the Code of Life
1.1 The Cellular Challenge
1.2 How Cells Meet the Challenge
1.3 Molecules in Cells
1.4 Intracellular Structures or Spaces
1.4.1 Nucleus
1.4.2 Cell Membrane
1.4.3 Cytoplasm
1.4.4 Endosome, Lysosome, and Peroxisome
1.4.5 Ribosome
1.4.6 Endoplasmic Reticulum
1.4.7 Golgi Apparatus
1.4.8 Cytoskeleton
1.4.9 Mitochondrion
1.4.10 Chloroplast
1.5 The Cell as a System
1.5.1 The Cellular System
1.5.2 Systems Biology of the Cell
1.5.3 How to Study the Cellular System

2. DNA Sequence: The Genome Base
2.1 The DNA Double Helix and Base Sequence
2.2 How DNA Molecules Replicate and Maintain Fidelity
2.3 How the Genetic Information Stored in DNA Is Transferred to Protein
2.4 The Genomic Landscape
2.4.1 The Minimal Genome
2.4.2 Genome Sizes
2.4.3 Protein-Coding Regions of the Genome
2.4.4 Non-Coding Genomic Elements
2.5 DNA Packaging, Sequence Access, and DNA-Protein Interactions
2.5.1 DNA Packaging
2.5.2 Sequence Access
2.5.3 DNA-Protein Interactions
2.6 DNA Sequence Mutation and Polymorphism
2.7 Genome Evolution
2.8 Epigenome and DNA Methylation
2.9 Genome Sequencing and Disease Risk
2.9.1 Mendelian (Single-Gene) Diseases
2.9.2 Complex Diseases That Involve Multiple Genes
2.9.3 Diseases Caused by Genome Instability
2.9.4 Epigenomic/Epigenetic Diseases

3. RNA: The Transcribed Sequence
3.1 RNA as the Messenger
3.2 The Molecular Structure of RNA
3.3 Generation, Processing, and Turnover of RNA as a Messenger
3.3.1 DNA Template
3.3.2 Transcription of Prokaryotic Genes
3.3.3 Pre-mRNA Transcription of Eukaryotic Genes
3.3.4 Maturation of mRNA
3.3.5 Transport and Localization
3.3.6 Stability and Decay
3.3.7 Major Steps of mRNA Transcript Level Regulation
3.4 RNA Is More Than a Messenger
3.4.1 Ribozyme
3.4.2 snRNA and snoRNA
3.4.3 RNA for Telomere Replication
3.4.4 RNAi and Small Non-Coding RNAs
3.4.4.1 miRNA
3.4.4.2 siRNA
3.4.4.3 piRNA
3.4.5 Long Non-Coding RNAs
3.4.6 Other Non-Coding RNAs
3.5 The Cellular Transcriptional Landscape

Part II Introduction to Next-Generation Sequencing (NGS) and NGS Data Analysis

4. Next-Generation Sequencing (NGS) Technologies: Ins and Outs
4.1 How to Sequence DNA: From First Generation to the Next
4.2 Ins and Outs of Different NGS Platforms
4.2.1 Illumina Reversible Terminator Short-Read Sequencing
4.2.1.1 Sequencing Principle
4.2.1.2 Implementation
4.2.1.3 Error Rate, Read Length, Data Output, and Cost
4.2.1.4 Sequence Data Generation
4.2.2 Pacific Biosciences Single-Molecule Real-Time (SMRT) Long-Read Sequencing
4.2.2.1 Sequencing Principle
4.2.2.2 Implementation
4.2.2.3 Error Rate, Read Length, Data Output, and Cost
4.2.2.4 Sequence Data Generation
4.2.3 Oxford Nanopore Technologies (ONT) Long-Read Sequencing
4.2.3.1 Sequencing Principle
4.2.3.2 Implementation
4.2.3.3 Error Rate, Read Length, Data Output, and Cost
4.2.3.4 Sequence Data Generation
4.2.4 Ion Torrent Semiconductor Sequencing
4.2.4.1 Sequencing Principle
4.2.4.2 Implementation
4.2.4.3 Error Rate, Read Length, Data Output, and Cost
4.2.4.4 Sequence Data Generation
4.3 A Typical NGS Workflow
4.4 Biases and Other Adverse Factors That May Affect NGS Data Accuracy
4.4.1 Biases in Library Construction
4.4.2 Biases and Other Factors in Sequencing
4.5 Major Applications of NGS
4.5.1 Transcriptomic Profiling (Bulk and Single-Cell RNA-Seq)
4.5.2 Genetic Mutation and Variation Identification
4.5.3 De Novo Genome Assembly
4.5.4 Protein-DNA Interaction Analysis (ChIP-Seq)
4.5.5 Epigenomics and DNA Methylation Study (Methyl-Seq)
4.5.6 Metagenomics

5. Early-Stage Next-Generation Sequencing (NGS) Data Analysis: Common Steps
5.1 Basecalling, FASTQ File Format, and Base Quality Score
5.2 NGS Data Quality Control and Preprocessing
5.3 Read Mapping
5.3.1 Mapping Approaches and Algorithms
5.3.2 Selection of Mapping Algorithms and Reference Genome Sequences
5.3.3 SAM/BAM as the Standard Mapping File Format
5.3.4 Mapping File Examination and Operation
5.4 Tertiary Analysis

6. Computing Needs for Next-Generation Sequencing (NGS) Data Management and Analysis
6.1 NGS Data Storage, Transfer, and Sharing
6.2 Computing Power Required for NGS Data Analysis
6.3 Cloud Computing
6.4 Software Needs for NGS Data Analysis
6.4.1 Parallel Computing
6.5 Bioinformatics Skills Required for NGS Data Analysis

Part III Application-Specific NGS Data Analysis

7. Transcriptomics by Bulk RNA-Seq
7.1 Principle of RNA-Seq
7.2 Experimental Design
7.2.1 Factorial Design
7.2.2 Replication and Randomization
7.2.3 Sample Preparation and Sequencing Library Preparation
7.2.4 Sequencing Strategy
7.3 RNA-Seq Data Analysis
7.3.1 Read Mapping
7.3.2 Quantification of Reads
7.3.3 Normalization
7.3.4 Batch Effect Removal
7.3.5 Identification of Differentially Expressed Genes
7.3.6 Multiple Testing Correction
7.3.7 Gene Clustering
7.3.8 Functional Analysis of Identified Genes
7.3.9 Differential Splicing Analysis
7.4 Visualization of RNA-Seq Data
7.5 RNA-Seq as a Discovery Tool

8. Transcriptomics by Single-Cell RNA-Seq
8.1 Experimental Design
8.1.1 Single-Cell RNA-Seq General Approaches
8.1.2 Cell Number and Sequencing Depth
8.1.3 Batch Effects Minimization and Sample Replication
8.2 Single-Cell Preparation, Library Construction, and Sequencing
8.2.1 Single-Cell Preparation
8.2.2 Single Nuclei Preparation
8.2.3 Library Construction and Sequencing
8.3 Preprocessing of scRNA-Seq Data
8.3.1 Initial Data Preprocessing and Quality Control
8.3.2 Alignment and Transcript Counting
8.3.3 Data Cleanup Post Alignment
8.3.4 Normalization
8.3.5 Batch Effects Correction
8.3.6 Signal Imputation
8.4 Feature Selection, Dimension Reduction, and Visualization
8.4.1 Feature Selection
8.4.2 Dimension Reduction
8.4.3 Visualization
8.5 Cell Clustering, Cell Identity Annotation, and Compositional Analysis
8.5.1 Cell Clustering
8.5.2 Cell Identity Annotation
8.5.3 Compositional Analysis
8.6 Differential Expression Analysis
8.7 Trajectory Inference
8.8 Advanced Analyses
8.8.1 SNV/CNV Detection and Allele-Specific Expression Analysis
8.8.2 Alternative Splicing Analysis
8.8.3 Gene Regulatory Network Inference

9. Small RNA Sequencing
9.1 Small RNA NGS Data Generation and Upstream Processing
9.1.1 Data Generation
9.1.2 Preprocessing
9.1.3 Mapping
9.1.4 Identification of Known and Putative Small RNA Species
9.1.5 Normalization
9.2 Identification of Differentially Expressed Small RNAs
9.3 Functional Analysis of Identified Known Small RNAs

10. Genotyping and Variation Discovery by Whole Genome/Exome Sequencing
10.1 Data Preprocessing, Mapping, Realignment, and Recalibration
10.2 Single Nucleotide Variant (SNV) and Short Indel Calling
10.2.1 Germline SNV and Indel Calling
10.2.2 Somatic Mutation Detection
10.2.3 Variant Calling from RNA Sequencing Data
10.2.4 Variant Call Format (VCF)
10.2.5 Evaluating VCF Results
10.3 Structural Variant (SV) Calling
10.3.1 Short-Read-Based SV Calling
10.3.2 Long-Read-Based SV Calling
10.3.3 CNV Detection
10.3.4 Integrated SV Analysis
10.4 Annotation of Called Variants

11. Clinical Sequencing and Detection of Actionable Variants
11.1 Clinical Sequencing Data Generation
11.1.1 Patient Sample Collection
11.1.2 Library Preparation and Sequencing Approaches
11.2 Read Mapping and Variant Calling
11.3 Variant Filtering
11.3.1 Frequency of Occurrence
11.3.2 Functional Consequence
11.3.3 Existing Evidence of Relationship to Human Disease
11.3.4 Clinical Phenotype Match
11.3.5 Mode of Inheritance
11.4 Variant Ranking and Prioritization
11.5 Classification of Variants Based on Pathogenicity
11.5.1 Classification of Germline Variants
11.5.2 Classification of Somatic Variants
11.6 Clinical Review and Reporting
11.6.1 Use of Artificial Intelligence in Variant Reporting
11.6.2 Expert Review
11.6.3 Generation of Testing Report
11.6.4 Variant Validation
11.6.5 Incorporation into a Patient’s Electronic Health Record
11.6.6 Reporting of Secondary Findings
11.6.7 Patient Counseling and Periodic Report Updates
11.7 Bioinformatics Pipeline Validation

12. De Novo Genome Assembly with Long and/or Short Reads
12.1 Genomic Factors and Sequencing Strategies for De Novo Assembly
12.1.1 Genomic Factors That Affect De Novo Assembly
12.1.2 Sequencing Strategies for De Novo Assembly
12.2 Assembly of Contigs
12.2.1 Sequence Data Preprocessing, Error Correction, and Assessment of Genome Characteristics
12.2.2 Contig Assembly Algorithms
12.2.3 Polishing
12.3 Scaffolding and Gap Closure
12.4 Assembly Quality Evaluation
12.5 Limitations and Future Development

13. Mapping Protein-DNA Interactions with ChIP-Seq
13.1 Principle of ChIP-Seq
13.2 Experimental Design
13.2.1 Experimental Control
13.2.2 Library Preparation
13.2.3 Sequencing Length and Depth
13.2.4 Replication
13.3 Read Mapping, Normalization, and Peak Calling
13.3.1 Data Quality Control and Read Mapping
13.3.2 Peak Calling
13.3.3 Post-Peak Calling Quality Control
13.3.4 Peak Visualization
13.4 Differential Binding Analysis
13.5 Functional Analysis
13.6 Motif Analysis
13.7 Integrated ChIP-Seq Data Analysis

14. Epigenomics by DNA Methylation Sequencing
14.1 DNA Methylation Sequencing Strategies
14.1.1 Bisulfite Conversion Methyl-Seq
14.1.1.1 Whole-Genome Bisulfite Sequencing (WGBS)
14.1.1.2 Reduced Representation Bisulfite Sequencing (RRBS)
14.1.2 Enzymatic Conversion Methyl-Seq
14.1.3 Enrichment-Based Methyl-Seq
14.1.4 Differentiation of Cytosine Methylation from Demethylation Products
14.2 DNA Methylation Sequencing Data Analysis
14.2.1 Quality Control and Preprocessing
14.2.2 Read Mapping
14.2.3 Quantification of DNA Methylation/Demethylation Products
14.2.4 Visualization
14.3 Detection of Differentially Methylated Cytosines and Regions
14.4 Data Verification, Validation, and Interpretation

15. Whole Metagenome Sequencing for Microbial Community Analysis
15.1 Experimental Design and Sample Preparation
15.1.1 Metagenome Sample Collection
15.1.2 Metagenome Sample Processing
15.2 Sequencing Approaches
15.3 Overview of Shotgun Metagenome Sequencing Data Analysis
15.4 Sequencing Data Quality Control and Preprocessing
15.5 Taxonomic Characterization of a Microbial Community
15.5.1 Metagenome Assembly
15.5.2 Sequence Binning
15.5.3 Calling of Genes and Other Genomic Elements from Metagenomic Sequences
15.5.4 Taxonomic Profiling
15.6 Functional Characterization of a Microbial Community
15.6.1 Gene Function Annotation
15.6.2 Gene Function Profiling and Metabolic Pathway Reconstruction
15.7 Comparative Metagenomic Analysis
15.7.1 Metagenome Sequencing Data Normalization
15.7.2 Identification of Differentially Abundant Species or OTUs
15.8 Integrated Metagenomics Data Analysis Pipelines
15.9 Metagenomics Data Repositories

Part IV The Changing Landscape of NGS


Technologies and Data Analysis

16. What’s Next for Next-​Generation Sequencing (NGS)?......................... 365


16.1 The Changing Landscape of Next-​Generation
Sequencing (NGS)................................................................................ 365
16.2 Newer Sequencing Technologies....................................................... 366
16.3 Continued Evolution and Growth of Bioinformatics
Tools for NGS Data Analysis.............................................................. 369
16.4 Efficient Management of NGS Analytic Workflows........................ 370
Contents xiii

16.5 Deepening Applications of NGS to Single-Cell and


Spatial Sequencing............................................................................... 372
16.6 Increasing Use of Machine Learning in NGS Data
Analytics................................................................................................ 374

Appendix I Common File Types Used in NGS Data Analysis.................. 383


Appendix II Glossary......................................................................................... 387
Index...................................................................................................................... 397

Preface to the Second Edition

When I started working on the second edition of Next-Generation Sequencing
Data Analysis, my primary goal was to add new chapters and contents on
clinical sequencing, single-cell sequencing, and third-generation sequencing
(i.e., long reads) data analyses. These contents were either absent, or only
briefly discussed, in the first edition. For example, data processing for clinical
applications, where NGS has a direct impact on public health, was absent, and
a new chapter that covers clinical sequencing data QA/QC, standard analysis
pipeline, and clinical interpretation is beneficial to the community. The
dramatic growth in single-cell sequencing also warrants a new chapter, because
extracting rich biological information at the single-cell resolution requires
a new set of tools different from what is used to analyze “bulk” sequencing
data. Although long-read sequencing was covered in the first edition
of this book, technologies have since then made significant improvements
and achieved wide usage. Such developments require extensive updates to
nearly all of the applications, from RNA-seq to metagenomics.
After meeting my primary goal, I set out to update the rest of the book.
This took much longer than I had initially planned. The challenges were
twofold. The first was updating the list of tools for each of the NGS applications.
Thanks to the productivity of the bioinformatics community, most NGS
applications have seen an abundance of tool development, and as a result
many new tools have emerged. Updating this large number of new tools took
quite some time. The second challenge was selectively introducing new and
existing tools, instead of overwhelming readers with a long list of every tool
that has ever existed. While the tools presented in this edition are by no means
the most representative among all tools available, I made every effort to select
most of the effective open source tools in existence as of late 2022, drawing
information from benchmarking studies, citations, and recent updates.
In writing this edition, I have developed a renewed appreciation of the
intensity, excitement, and multiplicity of expertise in the NGS field. For
instance, there is an increasing convergence of expertise from artificial
intelligence, computer science, and high-performance computing. At the same
time, because of the highly dynamic nature of the field, it becomes
increasingly challenging to keep abreast of the latest developments. This new
edition represents an effort from a practitioner in the field towards the goal of
informing readers on recent NGS data analysis tools. I would like to express
my gratitude to the many researchers and clinicians I have interacted with
in my role as the director of the Northwestern University NGS Facility. It is
their need for the latest NGS technologies that has kept me up to date with
the NGS field.


Author

Xinkun Wang is Research Professor and the Director of the Next-Generation
Sequencing Facility at Northwestern University in Chicago. Dr. Wang’s first
foray into the genomics field was during his doctoral training, performing
microarray-based gene expression analysis. From 2005 to 2015, he was the
founding director of the University of Kansas Genomics Facility, prior to
moving to Northwestern to head the Northwestern University Sequencing
Facility (NUSeq) in late 2015. Dr. Wang is a renowned expert on genomics
technologies and data mining and their applications to the biomedical field.
Besides his monographic publications, he has published extensively in
neuroscience, with a focus on brain aging and neurodegenerative diseases
(mostly Alzheimer’s disease). Dr. Wang has served as principal investigator
on dozens of grants. Dr. Wang’s other professional activities include serving
on journal editorial boards, and as a reviewer for journals, publishers, and
funding agencies.
Dr. Wang is a member of the American Society of Human Genetics, the Association
of Biomolecular Resource Facilities, the Honor Society of Phi Kappa Phi, and
the Society for Neuroscience. Dr. Wang was born in Shandong province, China,
and is a first-generation college graduate. His off-work hobbies include
cycling and Alpine skiing.


Part I Introduction to Cellular and Molecular Biology

1 The Cellular System and the Code of Life

1.1 The Cellular Challenge


A cell, although minuscule with a diameter of less than 50 μm, works wonders if
you compare it to any human-​made system. Moreover, it perpetuates itself using
the information coded in its DNA. If you have ever thought of designing an
artificial system with this level of sophistication, you will appreciate the many
seemingly insurmountable challenges such a system needs to overcome. A cell has
a complicated internal system, containing many types of molecules and parts.
To sustain the system, a cell needs to perform a wide variety of tasks, the most
fundamental of which are to maintain its internal order, prevent its system from
malfunctioning or breaking down, and reproduce or even improve the system,
in an environment that is constantly changing.
Energy is needed to maintain the internal order of the cellular system.
Without constant energy input, the entropy of the system will gradually
increase, as dictated by the second law of thermodynamics, and ultimately
lead to the destruction of the system. Besides energy, raw “building” material
is also constantly needed to renew its internal parts or build new ones if
needed, as the internal structure of a cell is dynamic and responds to constant
changes in environmental conditions. Therefore, to maintain equilibrium both
internally and with the environment, a cell requires a constant influx of energy
and raw material, and excretion of its waste. Guiding the capture of the requisite
energy and raw material for its survival and the perpetuation of the system is
the information encoded in its DNA sequence.
During the course of evolution a great number of organisms no longer
function as a single cell. The human body, for example, contains trillions
of cells. In a multicellular system, each cell becomes specialized to perform
a specific function, e.g., β-cells in our pancreas synthesize and release
insulin, and cortical neurons in the brain perform neurobiological functions
that underlie learning and memory. Despite this “division of labor,” the
challenges a single-​cell organism faces still hold true for each one of these
cells. Instead of dealing with the external environment directly, they interact
with and respond to changes in their microenvironment.

DOI: 10.1201/9780429329180-2

1.2 How Cells Meet the Challenge


Many cells, like algae and plant cells, directly capture energy from the sun or
other energy sources. Other cells (or organisms) obtain energy from the
environment as heterotrophs. For raw material, cells can either use the captured
energy to fix carbon dioxide from the air into simple organic compounds, which
are then converted to other requisite molecules, or directly obtain organic
molecules from the environment and convert them to requisite materials.
In the meantime, existing cellular components can also be broken down
when not needed for the re-​use of their building material. This process of
energy capture and utilization, and synthesis, interconversion, and breaking
down for re-​use of molecular material, constitutes the cellular metabolism.
Metabolism, the most fundamental characteristic of a cell, involves numerous
biochemical reactions.
Reception and transduction of various signals in the environment are crucial
for cellular survival. Reception of signals relies on specific receptors situated
on the cell surface, and for some signals, those inside the cell. Transduction
of incoming signals usually involves cascades of events in the cell, through
which the original signals are amplified and modulated. In response, the cellular
metabolic profile is altered. The cellular signal reception and transduction
network is composed of circuits that are organized into various pathways.
Malfunctioning of these pathways can have a detrimental effect on cellular
response to the environment and eventually its survival.
Perpetuation and evolution of the cellular system rely on DNA replication
and cell division. The replication of DNA (to be detailed in Chapter 2) is a
high-​fidelity, but not error-​free, process. While maintaining the stability of
the system, this process also provides the mechanism for the diversification
and evolution of the cellular system. The cell division process is also tightly
regulated, for the most part to ensure equal transfer of the replicated DNA
into daughter cells. For the majority of multicellular organisms that reproduce
sexually, in the process of germ cell formation the DNA is replicated
once but cell division occurs twice, leading to the reduction of DNA material
by half in the gametes. The recombination of DNA from female and male
gametes leads to further diversification in the offspring.

1.3 Molecules in Cells
Different types of molecules are needed to carry out the various cellular
processes. In a typical cell, water is the most abundant, representing 70% of
the total cell weight. Besides water, there are a large variety of small and large
molecules. The major categories of small molecules include inorganic ions
(Na⁺, K⁺, Ca²⁺, Cl⁻, Mg²⁺, etc.), monosaccharides, fatty acids, amino acids, and
nucleotides. Major varieties of large molecules are polysaccharides, lipids,
proteins, and nucleic acids (DNA and RNA). Among these components,
the inorganic ions are important for signaling (e.g., waves of Ca²⁺ represent
an important intracellular signal), cell energy storage (e.g., in the form of the
Na⁺/K⁺ cross-membrane gradient), or protein structure/function (e.g., Mg²⁺ is
an essential cofactor for many metalloproteins). Carbohydrates (including
monosaccharides and polysaccharides), fatty acids, and lipids are major
energy-providing molecules in the cell. Lipids are also the major component
of the cell membrane. Proteins, which are assembled from 20 types of amino acids
in different order and length, underlie almost all cellular activities, including
metabolism, signal transduction, DNA replication, and cell division. They
are also the building blocks of many subcellular structures, such as the
cytoskeleton (see next section). Nucleic acids carry the code of life in their nearly
endless nucleotide permutations, which not only provide instructions on the
assembly of all proteins in cells but also exert control on how such assembly
is carried out based on environmental conditions.
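
To make the "nearly endless" permutations concrete, the short sketch below counts how many distinct linear sequences of a given length can be written with 4 nucleotides or 20 amino acids. It is an illustration only: the function name `sequence_space` and the example lengths are chosen here and do not come from the text.

```python
# Illustrative sketch: size of the sequence space for linear polymers.
# The function name and the example lengths are assumptions for illustration.

NUCLEOTIDES = 4    # A, C, G, T (or U in RNA)
AMINO_ACIDS = 20   # the 20 standard amino acids mentioned in the text


def sequence_space(alphabet_size: int, length: int) -> int:
    """Return the number of distinct sequences of `length` over an alphabet."""
    return alphabet_size ** length


if __name__ == "__main__":
    for n in (10, 100):
        dna = sequence_space(NUCLEOTIDES, n)
        protein = sequence_space(AMINO_ACIDS, n)
        print(f"length {n}: {dna:.3e} nucleotide sequences, "
              f"{protein:.3e} peptide sequences")
```

Even at a length of 100, far shorter than most genes or proteins, the number of possible sequences is astronomically large.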

1.4 Intracellular Structures or Spaces


Cells maintain a well-organized internal structure (Figure 1.1). Based on the
complexity of their internal structure, cells are divided into two major
categories: prokaryotic and eukaryotic cells. The fundamental difference between
them is whether a nucleus is present. Prokaryotic cells, being the more
primordial of the two, do not have a nucleus, and as a result their DNA is located in
a nucleus-like but non-enclosed area. Prokaryotic cells also lack organelles,
which are specialized and compartmentalized intracellular structures that
carry out different cellular functions (detailed next). Eukaryotic cells, on the
other hand, contain a distinct nucleus dedicated to DNA storage, maintenance,
and expression. Furthermore, they contain various organelles including the
endoplasmic reticulum (ER), Golgi apparatus, cytoskeleton, mitochondrion,
and chloroplast (plant cells). The following is an introduction to the various
intracellular structures and spaces, including the nucleus, the organelles,
and other subcellular structures and spaces such as the cell membrane and
cytoplasm.

1.4.1 Nucleus
Since DNA stores the code of life, it must be protected and properly maintained
to avoid possible damage and ensure accuracy and stability. As proper execution
of the genetic information embedded in the DNA is critical to the normal
functioning of a cell, gene expression must also be tightly regulated under
all conditions. The nucleus, located in the center of most cells in eukaryotes,
offers a well-protected environment for DNA storage, maintenance, and gene
expression. The nuclear space is enclosed by the nuclear envelope, consisting of
two concentric membranes. To allow movement of proteins and RNAs across
the nuclear envelope, which is essential for gene expression, there are pores
on the nuclear envelope that span the inner and outer membranes. The mechanical
support of the nucleus is provided by the nucleoskeleton, a network
of structural proteins including lamins and actin among others. Inside the
nucleus, long strings of DNA molecules, through binding to certain proteins
called histones, are heavily packed to fit into the limited nuclear space. In
prokaryotic cells, a nucleus-like, irregularly shaped region without a membrane
enclosure, called the nucleoid, provides a similar but not as well-protected
space for DNA.

FIGURE 1.1
The general structure of a typical eukaryotic cell. Shown here is an animal cell.
[Labeled structures: nucleus, nuclear envelope (with nuclear pores), nucleolus,
chromatin, cell membrane, cytoplasm, ribosome, rough ER, smooth ER, Golgi
apparatus, mitochondrion, lysosome, peroxisome, endosome, centrosome,
microtubule, microfilament, intermediate filament.]

1.4.2 Cell Membrane
The cell membrane serves as a barrier to protect the internal structure of a
cell from the outside environment. Biochemically, the cell membrane, as well
as all other intracellular membranes such as the nuclear envelope, assumes
a lipid bilayer structure. While offering protection to the internal structure,
the cell membrane is also where cells exchange materials, and concurrently
energy, with the outside environment. Since the membrane is made of lipids,
most water-soluble substances, including ions, carbohydrates, amino acids,
and nucleotides, cannot directly cross it. To overcome this barrier, there are
channels, transporters, and pumps, all of which are specialized proteins, on
the cell membrane. Channels and transporters facilitate passive movement,
that is, in the direction from high to low concentration, without consumption
of cellular energy. Pumps, on the other hand, provide active transportation of
the molecules, since they transport the molecules against the concentration
gradient and therefore consume energy.
The cell membrane is also where a cell receives most incoming signals from
the environment. After signal molecules bind to their specific receptors on the
cell membrane, the signal is relayed to the inside, usually eliciting a series of
intracellular reactions. The ultimate cellular response that the signal induces
is dependent on the nature of the signal, as well as the type and condition
of the cell. For example, upon detecting insulin in the blood via the insulin
receptor in their membrane, cells in the liver respond by taking up glucose
from the blood for storage.

1.4.3 Cytoplasm
Inside the cell membrane, cytoplasm is the thick solution that contains the
majority of cellular substances, including all organelles in eukaryotic cells
but excluding the nucleus in eukaryotic cells and the DNA in prokaryotic
cells. The general fluid component of the cytoplasm that excludes the
organelles is called the cytosol. The cytosol makes up more than half of the
cellular volume and is where many cellular activities take place, including
a large number of metabolic steps such as glycolysis and interconversion of
molecules, and most signal transduction steps. In prokaryotic cells, due to
the lack of the nucleus and other specialized organelles, the cytosol is almost
the entire intracellular space and where most cellular activities take place.
Besides water, the cytosol contains large amounts of small and large
molecules. Small molecules, such as inorganic ions, provide an overall
biochemical environment for cellular activities. In addition, ions such as Na⁺,
K⁺, and Ca²⁺ also have substantial concentration differences between the
cytosol and the extracellular space. Cells spend a lot of energy maintaining
these concentration differences, and use them for signaling and metabolic
purposes. For example, the concentration of Ca²⁺ in the cytosol is normally
kept very low at ~10⁻⁷ M whereas in the extracellular space it is ~10⁻³ M. The
rushing in of Ca²⁺ under certain conditions through ligand- or voltage-gated
channels serves as an important messenger, inducing responses in a
number of signaling pathways, some of which lead to altered gene expression.
Besides small molecules, the cytosol also contains large numbers of
macromolecules. Far from simply diffusing randomly in the cytosol,
these large molecules form molecular machines that collectively function as
a “bustling metropolitan city” [1]. These supra-macromolecular machines
are usually assembled out of multiple proteins, or proteins and RNA. Their
emergence and disappearance are dynamic and regulated by external and
internal conditions.
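
As a rough illustration of how steep the Ca²⁺ gradient quoted above is, the sketch below computes the fold difference between the two concentrations and the corresponding equilibrium potential from the standard Nernst equation. The Nernst calculation is not part of the original text; it is added here only as a worked example, and the temperature of 37 °C and the function name `nernst_potential_mv` are assumptions.

```python
import math

# Illustrative sketch (not from the original text): quantify the Ca2+ gradient
# using the concentrations quoted above and the standard Nernst equation.

R = 8.314462618   # gas constant, J/(mol*K)
F = 96485.33212   # Faraday constant, C/mol


def nernst_potential_mv(conc_out: float, conc_in: float, charge: int,
                        temp_c: float = 37.0) -> float:
    """Equilibrium potential in millivolts for an ion of the given charge.

    temp_c = 37.0 (body temperature) is an assumed default.
    """
    temp_k = temp_c + 273.15
    return 1000.0 * (R * temp_k) / (charge * F) * math.log(conc_out / conc_in)


ca_cytosol = 1e-7        # mol/L, approximate value from the text
ca_extracellular = 1e-3  # mol/L, approximate value from the text

fold_difference = ca_extracellular / ca_cytosol
e_ca = nernst_potential_mv(ca_extracellular, ca_cytosol, charge=2)

if __name__ == "__main__":
    print(f"Extracellular/cytosolic Ca2+ ratio: about {fold_difference:,.0f}-fold")
    print(f"Nernst potential for Ca2+ at 37 C: about {e_ca:.0f} mV")
```

Under these assumptions the gradient is roughly four orders of magnitude, which is consistent with the text's point that opening Ca²⁺ channels produces a strong inward rush of the ion.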

1.4.4 Endosome, Lysosome, and Peroxisome


Endocytosis is the process by which cells bring macromolecules, or other
particulate substances such as bacteria or cell debris, into the cytoplasm from the
surroundings. The endosome and the lysosome are two organelles that are involved
in this process. To initiate endocytosis, part of the cell membrane forms a
pit, engulfs the external substances, and then an endocytotic vesicle pinches
off from the cell membrane into the cytosol. The endosome, normally in the size
range of 300–400 nm in diameter, forms from the fusion of these endocytotic
vesicles. The internalized materials contained in the endosome are sent to
other organelles such as the lysosome for further digestion.
The lysosome is the principal site for intracellular digestion of internalized
materials as well as obsolete components inside the cell. Like the condition
in our stomach, the inside of the lysosome is acidic (pH at 4.5–​5.0), providing
an ideal condition for the many digestive enzymes within. These enzymes
can break down proteins, DNA, RNA, lipids, and carbohydrates. Normally
the lysosome membrane keeps these digestive enzymes from leaking into
the cytosol. Even in the event of these enzymes leaking out of the lysosome,
they can do little harm to the cell, since their digestive activities are heavily
dependent on the acidic environment inside the lysosome whereas the pH of
the cytosol is slightly alkaline (around 7.2).
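
Because pH is a logarithmic scale, the gap between the lysosomal pH (4.5–5.0) and the cytosolic pH (around 7.2) quoted above corresponds to a large difference in hydrogen ion concentration. The following minimal sketch works through that arithmetic; the function names are chosen here for illustration.

```python
# Illustrative sketch: convert the pH values quoted in the text into
# hydrogen ion concentrations and compare them. Function names are assumptions.


def hydrogen_ion_molar(ph: float) -> float:
    """Hydrogen ion concentration in mol/L, from the definition pH = -log10[H+]."""
    return 10.0 ** (-ph)


def fold_more_acidic(ph_acidic: float, ph_reference: float) -> float:
    """How many times higher [H+] is at ph_acidic than at ph_reference."""
    return hydrogen_ion_molar(ph_acidic) / hydrogen_ion_molar(ph_reference)


if __name__ == "__main__":
    cytosol_ph = 7.2
    for lysosome_ph in (4.5, 5.0):
        fold = fold_more_acidic(lysosome_ph, cytosol_ph)
        print(f"Lysosome pH {lysosome_ph}: [H+] is about {fold:.0f}-fold "
              f"higher than in the cytosol (pH {cytosol_ph})")
```

The difference works out to between roughly 150-fold and 500-fold, which helps explain why enzymes tuned to the lysosomal interior lose most of their activity in the cytosol.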
The peroxisome is morphologically similar to the lysosome, but it contains
a different set of proteins, mostly oxidative enzymes that use molecular
oxygen to extract hydrogen from organic compounds to form hydrogen peroxide.
The hydrogen peroxide can then be used to oxidize other substrates,
such as phenols or alcohols, via peroxidation reactions. As an example, liver
and kidney cells use these reactions to detoxify various toxic substances that
enter the body. Another function of the peroxisome is to break down long-chain
fatty acids into smaller molecules by oxidation. Despite its important
functions, the origin of the peroxisome is still not entirely clear. One theory
proposes that this organelle has an endosymbiotic origin [2]. If this theory
holds true, all genes in the genome of the original endosymbiotic organism
must have been transferred to the nuclear genome. A more recent hypothesis,
however, is that it had an endogenous origin from the endomembrane
system, similar to the lysosome and the Golgi apparatus (see next section) [3].

1.4.5 Ribosome
The ribosome is the protein assembly factory in cells, translating genetic
information carried in messenger RNAs (mRNAs) into proteins. There are vast
numbers of ribosomes, usually from thousands to millions, in a typical cell.
While both prokaryotic and eukaryotic ribosomes are composed of two
components (or subunits), eukaryotic ribosomes are larger than their
prokaryotic counterparts. In eukaryotic cells, the two ribosomal subunits are
first assembled inside the nucleus in a region called the nucleolus and then
shipped out to the cytoplasm. In the cytoplasm, ribosomes can be either
free, or attached to another organelle (the ER). Biochemically, ribosomes
contain more than 50 proteins and several ribosomal RNA (rRNA) species.
Because ribosomes are highly abundant in cells, rRNAs are the most abundant
species in total RNA extracts, accounting for 85% to 90% of all RNA. For
profiling cellular RNA populations using next-gen sequencing (NGS), rRNAs
are usually not of interest despite their abundance and therefore need to be
depleted to avoid generation of overwhelming amounts of sequencing reads
from them.
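
The effect of rRNA abundance on sequencing output can be sketched with simple arithmetic. The example below assumes, as a simplification, that the share of reads from each RNA class is proportional to its share of the RNA in the library; the parameter `depletion_efficiency` (the fraction of rRNA removed before sequencing) and the value 0.99 are hypothetical and are not taken from the text.

```python
# Illustrative sketch: expected fraction of informative (non-rRNA) reads
# before and after rRNA depletion.
# Assumptions: read share is proportional to RNA share; depletion_efficiency
# is a hypothetical parameter, not a value from the text.


def non_rrna_read_fraction(rrna_fraction: float,
                           depletion_efficiency: float = 0.0) -> float:
    """Fraction of reads expected from non-rRNA species."""
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    if not 0.0 <= depletion_efficiency <= 1.0:
        raise ValueError("depletion_efficiency must be between 0 and 1")
    remaining_rrna = rrna_fraction * (1.0 - depletion_efficiency)
    non_rrna = 1.0 - rrna_fraction
    total = non_rrna + remaining_rrna
    if total == 0.0:
        raise ValueError("no RNA left after depletion")
    return non_rrna / total


if __name__ == "__main__":
    for rrna in (0.85, 0.90):
        before = non_rrna_read_fraction(rrna)
        after = non_rrna_read_fraction(rrna, depletion_efficiency=0.99)
        print(f"rRNA {rrna:.0%}: informative reads {before:.1%} without "
              f"depletion, {after:.1%} with 99% depletion (hypothetical)")
```

Without depletion only about 10% to 15% of reads would come from other RNA species, which is why rRNA removal (or enrichment of the RNA of interest) is a routine step before RNA sequencing.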

1.4.6 Endoplasmic Reticulum
As indicated by the name, the ER is a network of membrane-enclosed spaces
throughout the cytosol. These spaces interconnect and form a single internal
environment called the ER lumen. There are two types of ER in cells: rough
ER and smooth ER. The rough ER is where all cell membrane proteins, such
as ion channels, transporters, pumps, and signal molecule receptors, as well
as secretory proteins, such as insulin, are produced and sorted. The
characteristic surface roughness of this type of ER comes from the ribosomes that
bind to it on the outside. Proteins destined for the cell membrane or secretion,
once emerging from these ribosomes, are threaded into the ER lumen.
This ER-targeting process is mediated by a signal sequence, or “address
tag,” located at the beginning part of these proteins. This signal sequence
is subsequently cleaved off inside the ER before the protein synthesis process is
complete. Functionally different from the rough ER, the smooth ER plays an
important role in lipid synthesis for the replenishment of cellular membranes.
Besides membrane and secretory protein preparation and lipid synthesis,
one other important function of the ER is to sequester Ca²⁺ from the cytosol. In
Ca²⁺-mediated cell signaling, shortly after entry of the calcium wave into the
cytosol, most of the incoming Ca²⁺ needs to be pumped out of the cell and/or
sequestered into specific organelles such as the ER and mitochondria.

1.4.7 Golgi Apparatus
Besides the ER, the Golgi apparatus also plays an indispensable role in sorting
as well as dispatching proteins to the cell membrane, extracellular space,
or other subcellular destinations. Many proteins synthesized in the ER are
sent to the Golgi apparatus via small vesicles for further processing before
being sent to their final destinations. Therefore the Golgi apparatus is
sometimes metaphorically described as the “post office” of the cell. The
processing carried out in this organelle includes chemical modification of some
of the proteins, such as adding oligosaccharide side chains, which serve as
“address labels.” Other important functions of the Golgi apparatus include
synthesizing carbohydrates and extracellular matrix materials, such as the
polysaccharide for the building of the plant cell wall.

1.4.8 Cytoskeleton
Cellular processes like the trafficking of proteins in vesicles from the ER to the
Golgi apparatus, or the movement of a mitochondrion from one intracellular
location to another, are not simply based on diffusion. Rather, they follow
certain protein-made skeletal structures inside the cytosol, that is, the
cytoskeleton, as tracks. Besides providing tracks for intracellular transport, the
cytoskeleton, like the skeleton in the human body, plays an equally important
role in maintaining cell shape, and protecting the cell framework from physical
stresses, as the lipid bilayer cell membrane is fragile and vulnerable to
such stresses. In eukaryotic cells, there are three major types of cytoskeletal
structures: microfilament, microtubule, and intermediate filament. Each type
is made of distinct proteins and has its own unique characteristics and
functions. For example, microfilament and microtubule are assembled from
actins and tubulins, respectively, and have different thickness (the diameter is
around 6 nm for microfilament and 23 nm for microtubule). While biochemically
and structurally different, both the microfilament and the microtubule
have been known to provide tracks for mRNA transport in the form of large
ribonucleoprotein complexes to specific intracellular sites, such as the distal
end of a neuronal dendrite, for targeted protein translation [4]. Besides its role
in intracellular transportation, the microtubule also plays a key role in cell
division through attaching to the duplicated chromosomes and moving them
equally into two daughter cells. In this process, all microtubules involved are
organized around a small organelle called a centrosome. Previously thought
to be only present in eukaryotic cells, cytoskeletal structure has also been
discovered in prokaryotic cells [5].

1.4.9 Mitochondrion
The mitochondrion is the “powerhouse” in eukaryotic cells. While some
energy is produced from the glycolytic pathway in the cytosol, most
energy is generated from the Krebs cycle and the oxidative phosphorylation
process that take place in the many mitochondria contained in a
cell. The number of mitochondria in a cell is ultimately dependent on its
energy demand. The more energy a cell needs, the more mitochondria
it has. Structurally, the mitochondrion is an organelle enclosed by two
membranes. The outer membrane is highly permeable to most cytosolic
molecules, and as a result the intermembrane space between the outer and
inner membranes is similar to the cytosol. Most of the energy-releasing
process occurs in the inner membrane and in the matrix, that is, the space
enclosed by the inner membrane. For the energy release, high-energy electron
carriers generated from the Krebs cycle in the matrix are fed into an
electron transport chain embedded in the inner membrane. The energy
released from the transfer of high-energy electrons through the chain to
molecular oxygen (O₂), the final electron acceptor, creates a proton gradient
across the inner membrane. This proton gradient serves as the
energy source for the synthesis of ATP, the universal energy currency in
cells. In prokaryotic cells, since they do not have this organelle, ATP synthesis
takes place on their cytoplasmic membrane instead.
The origin of the mitochondrion, based on the widely accepted endosymbiotic
theory, is an ancient α-Proteobacterium. So not surprisingly, the
mitochondrion carries its own DNA, but the genetic information contained
in the mitochondrial DNA (mtDNA) is extremely limited compared to the
nuclear DNA. The human mitochondrial DNA, for example, is 16,569 bp
in size coding for 37 genes, including 22 for transfer RNAs (tRNAs), 2
for rRNAs, and 13 for mitochondrial proteins. While it is much smaller
compared to the nuclear genome, there are multiple copies of mtDNA
molecules in each mitochondrion. Since cells usually contain hundreds
to thousands of mitochondria, there are a large number of mtDNA
molecules in each cell. In comparison, most cells only contain two copies
of the nuclear DNA. As a result, when sequencing cellular DNA samples,
sequences derived from mitochondrial DNA usually comprise a notable,
sometimes substantial, percentage of total generated reads. Although
small, the mitochondrial genomic system is fully functional and has the
entire set of protein factors for mtDNA transcription, translation, and
replication. As a result of its activity, when cellular RNA molecules are
sequenced, those transcribed from the mitochondrial genome also generate
significant amounts of reads in the sequence output.
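
The effect of mtDNA copy number on DNA sequencing output can be estimated with a back-of-the-envelope calculation. The sketch below uses the 16,569 bp genome size and gene counts from the text; the nuclear DNA content per cell (about 6.2 billion bp for two copies of the human genome), the mtDNA copy numbers tried, and the assumption that reads are sampled uniformly by DNA mass are all simplifying assumptions introduced here.

```python
# Illustrative sketch: expected share of DNA sequencing reads from mtDNA.
# From the text: mtDNA size and gene counts. Assumed here: nuclear DNA per
# diploid cell (~6.2e9 bp), copy numbers, and uniform sampling of reads.

MTDNA_BP = 16_569
MT_GENES = {"tRNA": 22, "rRNA": 2, "protein": 13}


def mtdna_read_fraction(mtdna_copies_per_cell: float,
                        nuclear_bp_per_cell: float = 6.2e9) -> float:
    """Fraction of total DNA (and, ideally, of reads) contributed by mtDNA."""
    mt_bp = mtdna_copies_per_cell * MTDNA_BP
    return mt_bp / (mt_bp + nuclear_bp_per_cell)


if __name__ == "__main__":
    print(f"Total mitochondrial genes: {sum(MT_GENES.values())}")
    for copies in (100, 1_000, 10_000):  # hypothetical copy numbers
        share = mtdna_read_fraction(copies)
        print(f"{copies:>6} mtDNA copies per cell -> {share:.2%} of reads")
```

Even though each mtDNA molecule is tiny, its share of reads scales with copy number, so cell types rich in mitochondria can yield a noticeably larger mitochondrial fraction.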
The many copies of mtDNA molecules in a cell may not all have the same
sequence due to mutations in individual molecules. Heteroplasmy occurs
when cells contain a heterogeneous set of mtDNA molecules. In general,
mitochondrial DNA has a higher mutation rate than its nuclear counterpart. This
is because the transfer of high-​energy electrons along the electron transport
chain can produce reactive oxygen species as byproducts, which can oxidize
and cause mutations in mtDNA. To make this situation even worse, the DNA
repair capability in mitochondria is rather limited. Increased heteroplasmy
has been associated with higher risk of developing aging-​related diseases,
including Alzheimer’s disease, heart disease, and Parkinson’s disease [6].
Furthermore, mitochondrial DNA mutations have been known to underlie
aging and cancer development [7]. Certain hereditary mtDNA mutations
also underlie maternally inherited diseases that mostly affect the nervous
system and muscle, both of which are characterized by high energy demand.
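
In sequencing data, heteroplasmy at a given mtDNA position is commonly summarized as the fraction of reads that carry the variant allele. The following minimal sketch shows that calculation; the function name and the read counts are hypothetical, and a real analysis would also need to account for sequencing errors and coverage.

```python
# Illustrative sketch: heteroplasmy level at one mtDNA position, estimated as
# the fraction of reads supporting the variant (alternate) allele.
# The function name and read counts below are hypothetical.


def heteroplasmy_fraction(ref_reads: int, alt_reads: int) -> float:
    """Return alt / (ref + alt) for reads covering a single position."""
    if ref_reads < 0 or alt_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


if __name__ == "__main__":
    ref, alt = 950, 50  # hypothetical counts at one position
    level = heteroplasmy_fraction(ref, alt)
    print(f"Heteroplasmy level: {level:.1%} ({alt} of {ref + alt} reads)")
```

A value of 0 or 1 corresponds to homoplasmy (all copies share the same allele), while intermediate values indicate a mixed population of mtDNA molecules.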
1.4.10 Chloroplast
In animal cells, the mitochondrion is the only organelle that contains an
extranuclear genome. Plant and algae cells have another extranuclear genome
besides the mitochondrial genome, the plastid genome. The plastid is an organelle
that can differentiate into various forms, the most prominent of which is the
chloroplast. The chloroplast carries out photosynthesis through capturing the
energy in sunlight and fixing it into carbohydrates using carbon dioxide as
substrate, and releasing oxygen in the same process. For energy capturing, the
green pigment called chlorophyll first absorbs energy from sunlight, which is then
transferred through an electron transport chain to build up a proton gradient to
drive the synthesis of ATP. Despite the different energy source, the buildup of
the proton gradient for ATP synthesis in the chloroplast is very similar to that for
ATP synthesis in the mitochondrion. The chloroplast ATP derived from the captured
light energy is then spent on CO₂ fixation. Similar to the mitochondrion, the
chloroplast also has two membranes: a highly permeable outer membrane and a much
less permeable inner membrane. The photosynthetic electron transport chain,
however, is not located in the inner membrane, but in the membrane of a series of
sac-like structures called thylakoids located in the chloroplast stroma (analogous
to the mitochondrial matrix).
The plastid is believed to have evolved from an endosymbiotic cyanobacterium,
which has gradually lost the majority of its genes over millions of years.
Most plastid genomes are currently 120–200 kb in size and code for rRNAs,
tRNAs, and proteins. In higher plants there are around 100 genes coding for
various proteins of the photosynthetic system [8]. The transmission of plastid
DNA (ptDNA) from parent to offspring is more complicated than the maternal
transmission of mtDNA usually observed in animals. Based on the transmission
pattern, it can be classified into three types: 1) maternal, inheritance only
through the female parent; 2) paternal, inheritance only through the male
parent; or 3) biparental, inheritance through both parents [9]. As in the
mitochondrion, each plastid contains multiple copies of ptDNA, and as a result
each cell contains large numbers of ptDNA molecules, with potential
heteroplasmy. Transcription of these ptDNA molecules also generates copious
amounts of RNA in the organelle. Therefore, when plant and algal DNA or RNA
samples are sequenced, reads derived from ptDNA or plastid RNA make up part of
the data, along with those from mtDNA or mitochondrial RNA.
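
As a rough illustration of this point, the share of organelle-derived reads in a mapped dataset can be tallied from the name of the reference sequence to which each read aligns. The Python sketch below is a minimal example under stated assumptions, not a prescribed workflow: the function name organelle_read_fractions and the contig names ChrC and ChrM are hypothetical (organelle contigs are named differently from one reference assembly to another), and the input is simply a list of contig names, one per mapped read, which in practice would be extracted from an alignment file.

```python
from collections import Counter


def organelle_read_fractions(read_contigs, organelle_contigs):
    """Return the fraction of mapped reads assigned to each genome compartment.

    read_contigs: iterable of reference contig names, one per mapped read.
    organelle_contigs: dict mapping a contig name to an organelle label;
        any contig not listed is counted as "nuclear".
    """
    counts = Counter()
    total = 0
    for contig in read_contigs:
        total += 1
        counts[organelle_contigs.get(contig, "nuclear")] += 1
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical contig names; check the naming used by your own reference.
ORGANELLE_CONTIGS = {"ChrC": "plastid", "ChrM": "mitochondrion"}

# Toy input: 6 nuclear reads, 3 plastid reads, 1 mitochondrial read.
example_reads = ["Chr1"] * 6 + ["ChrC"] * 3 + ["ChrM"] * 1
fractions = organelle_read_fractions(example_reads, ORGANELLE_CONTIGS)

for label in sorted(fractions):
    print(f"{label}: {fractions[label]:.0%}")
```

With this toy input, 3 of the 10 reads are assigned to the plastid genome and 1 to the mitochondrial genome. In real plant or algal samples the proportions depend on the tissue and the library preparation, so the numbers here carry no biological meaning.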

1.5 The Cell as a System


1.5.1 The Cellular System
From the above description of a typical cell, it is obvious that the cell is a
self-organizing system, containing many different molecules and structures that
work together coherently. Unlike non-biological systems, whether natural or
artificial (such as a car or a computer), the cell system is unique in that it
continuously renews and perpetuates itself without violating
the laws of the physical world. It achieves this by obtaining energy from
and exchanging materials with its environment. The cellular system is also
characterized by its autonomy, that is, all of its activities are self-​regulated.
This autonomy is conferred by the genetic instructions coded in the cell’s
DNA. Besides such characteristics, the cell system is highly robust, as its
homeostasis is not easily disturbed by changes in its surroundings. This
robustness is a result of billions of years of evolution, which has led to the
building of tremendous complexity into the system. To study this complexity,
biologists have mostly taken a reductionist approach, studying the
different cellular molecules and structures piece by piece. This approach has
been highly successful and much knowledge has been gathered on most parts
of the system. For a cell to function as a single entity, however, these different
parts do not work alone. To study how it operates as a whole, the different
parts need to be studied in the context of the entire system and therefore
a holistic approach is also needed. It has become more and more clear to
researchers in the life science community that the interactions between the
different cellular parts are as important as, if not more important than, any
part alone.

1.5.2 Systems Biology of the Cell


Systems biology is an emerging field that studies the complicated interactions
among the different parts of biological systems. It is an application of
systems theory to the biological field. Introduced by the biologist Ludwig von
Bertalanffy in the 1940s, this theory aims to investigate the principles common
to all complex systems, and to describe these principles using mathematical
models. The theory is applicable to many disciplines, including physics,
sociology, and biology, and one of its goals is to unify the principles of
systems as uncovered from the different disciplines. It is expected, therefore,
that principles uncovered from other systems may be applicable to biological
systems and guide a better understanding of how they work.
In the traditional reductionist approach, a single gene or protein is the
basic functioning unit. In systems biology, however, the basic unit is a
genetic circuit. A genetic circuit can be defined as a group of genes (or the
proteins they encode) that work together to perform a certain task. A cell
relies on genetic circuits to carry out a multitude of tasks, from the
transduction of extracellular signals to the cell interior and the step-by-step
breakdown of energy molecules (such as glucose) to release energy, to the
replication of DNA prior to cell division. It is these genetic circuits that
underlie cellular behavior and physiology. If the information or material
flux in a genetic circuit is blocked or goes awry, the whole system will be
influenced, which might lead to the malfunctioning of the system and likely
a diseased state.
Based on the hierarchical organization principle of systems, genetic circuits
interact with each other and form a complicated genetic network. Mapping
out a genetic network is a higher goal of systems biology. Genetic networks
have been shown to share some common characteristics with non-biological
networks such as human society or the Internet [10]. One of these
characteristics is modularity, referring to the fact that genes (or proteins) that
work together to achieve a common goal often form a module and the module
is used as a single functional unit when needed. Another common character-
istic is the existence of hub or anchor nodes in the network, as a small number
of highly connected genes (or proteins) in a genetic network serve as hubs or
anchors through which other genes (or proteins) are connected to each other.
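
To make the notions of hubs and modules more concrete, the short Python sketch below represents a genetic network as an undirected graph, flags highly connected genes as hubs, and then shows which groups of genes remain connected once the hubs are set aside. It is only a toy illustration under stated assumptions: the gene names, the interaction list, the degree cutoff, and the helper names (build_network, find_hubs, components_without) are all hypothetical, and treating connected components as modules is a deliberate simplification rather than an established module-detection method.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (gene_a, gene_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def find_hubs(adjacency, min_degree):
    """Return genes with at least min_degree interaction partners."""
    return sorted(g for g, nbrs in adjacency.items() if len(nbrs) >= min_degree)


def components_without(adjacency, removed):
    """Return connected groups of genes after leaving out the removed genes."""
    removed = set(removed)
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in removed or start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.append(node)
            for nbr in adjacency[node]:
                if nbr not in removed and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(sorted(component))
    return components


# Hypothetical interactions: H links two otherwise separate gene pairs.
interactions = [
    ("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"),
    ("A", "B"), ("C", "D"),
]
network = build_network(interactions)
hubs = find_hubs(network, min_degree=3)        # arbitrary cutoff
modules = components_without(network, hubs)

print("hubs:", hubs)
print("groups left without hubs:", modules)
```

In this toy network, gene H is the only node that meets the cutoff, and leaving it out splits the remaining genes into the pairs A-B and C-D, which mirrors the idea that other genes are connected to each other through a hub.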

1.5.3 How to Study the Cellular System


Research into the systems biology of the cell is largely enabled by techno-
logical advancements in genomics, proteomics, and metabolomics. High-​
throughput genomics technologies, for example, allow simultaneous analysis
of tens of thousands of genes in an organism’s genome. The genome refers to the
whole set of genetic material in an organism’s DNA, including both
protein-coding and non-coding sequences. Similarly, the proteome and metabolome
are defined as the complement of proteins and metabolites (small molecules),
respectively, in a cell or population of cells. Proteomics, through simultaneous
separation and identification of proteins in a proteome, provides answers to
the questions of how many proteins are present in the target cell(s) and at
what abundance levels. Metabolomics, on the other hand, through analyzing
a large number of metabolites simultaneously, monitors the metabolic status
of target cells.
The development of modern genomics technologies was mostly initiated
when the human genome was sequenced by the Human Genome Project.
The completion of the sequencing of this genome and the genomes of other
organisms, and the concurrent development of genomics technologies,
have for the first time offered an opportunity to study the systems properties
of the cell. The first big wave of genomics technologies was mostly
centered on the microarray, which enabled analysis of the transcriptome and,
subsequently, the study of genome-wide sequence polymorphisms and the
epigenome. By studying all RNAs transcribed in a cell or population of
cells, transcriptomic analysis investigates what genes are active and how
active. Determination of genome-​wide sequence variations among indi-
viduals in a population enables examination of the relationship between
certain genomic polymorphisms and cellular dysfunctions, phenotypic
traits, or diseases. Epigenomic studies provide answers to the question
of how the genomic information encoded in the DNA sequence is regulated
by the code conferred by chemical modifications of DNA bases. More
recently, the development of NGS technologies has brought more power,
coverage, and resolution to the study of the genome, the transcriptome,
and the epigenome (for details on the development of NGS technologies,
see Chapter 4). These NGS technologies, along with recent technological
developments in proteomics and metabolomics, further empower the
study of the cellular system.

References
1. Vale RD. The molecular motor toolbox for intracellular transport. Cell 2003,
112(4):467–​480.
2. de Duve C. Peroxisomes and related particles in historical perspective. Ann N
Y Acad Sci 1982, 386:1–​4.
3. Gabaldon T. Evolution of the peroxisomal proteome. Subcell Biochem 2018,
89:221–​233.
4. Das S, Vera M, Gandin V, Singer RH, Tutucci E. Intracellular mRNA transport
and localized translation. Nat Rev Mol Cell Biol 2021, 22(7):483–​504.
5. Mayer F. Cytoskeletons in prokaryotes. Cell Biol Int 2003, 27(5):429–​438.
6. Chocron ES, Munkacsy E, Pickering AM. Cause or casualty: the role of mito-
chondrial DNA in aging and age-​associated disease. Biochim Biophys Acta Mol
Basis Dis 2019, 1865(2):285–​297.
7. Smith ALM, Whitehall JC, Greaves LC. Mitochondrial DNA mutations in
ageing and cancer. Mol Oncol 2022, 16(18):3276–​3294.
8. de Vries J, Archibald JM. Plastid genomes. Curr Biol 2018, 28(8):R336–​R337.
9. Harris SA, Ingram R. Chloroplast DNA and biosystematics: the effects of
intraspecific diversity and plastid transmission. Taxon 1991:393–​412.
10. Roy U, Grewal RK, Roy S. Complex Networks and Systems Biology. In:
Systems and Synthetic Biology. Springer; 2015: 129–​150.

DNA Sequence
Fraser CM , Gocayne JD , White O , Adams MD , Clayton RA , Fleischmann RD , Bult CJ ,
Kerlavage AR , Sutton G , Kelley JM et al. The minimal gene complement of Mycoplasma
genitalium . Science 1995, 270(5235):397–403.
Hutchison CA , 3rd, Chuang RY , Noskov VN , Assad-Garcia N , Deerinck TJ , Ellisman MH ,
Gill J , Kannan K , Karas BJ , Ma L et al . Design and synthesis of a minimal bacterial genome.
Science 2016, 351(6280):aad6253.
Bennett GM , Moran NA . Small, smaller, smallest: the origins and evolution of ancient dual
symbioses in a Phloem-feeding insect. Genome Biol Evol 2013, 5(9):1675–1688.
Pellicer J , Fay MF , Leitch IJ . The largest eukaryotic genome of them all? Bot J Linn Soc 2010,
164(1):10–15.
Shapiro JA , von Sternberg R. Why repetitive DNA is essential to genome function. Biol Rev
Camb Philos Soc 2005, 80(2):227–250.
Roach JC , Glusman G , Smit AF , Huff CD , Hubley R , Shannon PT , Rowen L , Pant KP ,
Goodman N , Bamshad M et al . Analysis of genetic inheritance in a family quartet by whole-
genome sequencing. Science 2010, 328(5978):636–639.
Mahmoud M , Gobet N , Cruz-Davalos DI, Mounier N, Dessimoz C, Sedlazeck FJ. Structural
variant calling: the long and the short of it. Genome Biol 2019, 20(1):246.
Ebert P , Audano PA , Zhu Q , Rodriguez-Martin B , Porubsky D , Bonder MJ , Sulovari A ,
Ebler J , Zhou W , Serra Mari R et al . Haplotype-resolved diverse human genomes and
integrated analysis of structural variation. Science 2021, 372(6537):eabf7117.
Malnic B , Godfrey PA , Buck LB . The human olfactory receptor gene family. Proc Natl Acad
Sci U S A 2004, 101(8):2584–2589.
Inai Y , Ohta Y , Nishikimi M . The whole structure of the human nonfunctional L-gulono-
gamma-lactone oxidase gene—the gene responsible for scurvy—and the evolution of repetitive
sequences thereon. J Nutr Sci Vitaminol 2003, 49(5):315–319.
Law JA , Jacobsen SE . Establishing, maintaining and modifying DNA methylation patterns in
plants and animals. Nat Rev Genet 2010, 11(3):204–220.
Cedar H , Bergman Y. Linking DNA methylation and histone modification: patterns and
paradigms. Nat Rev Genet 2009, 10(5):295–304.
Guo W , Chung WY , Qian M , Pellegrini M , Zhang MQ . Characterizing the strand-specific
distribution of non-CpG methylation in human pluripotent cells. Nucleic Acids Res 2014,
42(5):3009–3016.
Wu H , Zhang Y . Reversing DNA methylation: mechanisms, genomics, and biological functions.
Cell 2014, 156(1–2):45–68.
Shademan B , Biray Avci C , Nikanfar M , Nourazarian A. Application of next-generation
sequencing in neurodegenerative diseases: opportunities and challenges. Neuromolecular Med
2021, 23(2):225–235.
Nishiyama A , Nakanishi M . Navigating the DNA methylation landscape of cancer. Trends
Genet 2021, 37(11):1012–1027.
Pappalardo XG , Barra V . Losing DNA methylation at repetitive elements and breaking bad.
Epigenetics Chromatin 2021, 14(1):25.

RNA
Bedard AV , Hien EDM , Lafontaine DA . Riboswitch regulation mechanisms: RNA, metabolites
and regulatory proteins. Biochim Biophys Acta Gene Regul Mech 2020, 1863(3):194501.
Ray PS , Jia J , Yao P , Majumder M , Hatzoglou M , Fox PL . A stress-responsive RNA switch
regulates VEGFA expression. Nature 2009, 457(7231):915–919.
Xu B , Zhu Y , Cao C , Chen H , Jin Q , Li G , Ma J , Yang SL , Zhao J , Zhu J et al . Recent
advances in RNA structurome. Sci China Life Sci 2022, 65(7):1285–1324.
Imashimizu M , Oshima T , Lubkowska L , Kashlev M . Direct assessment of transcription
fidelity by high-resolution RNA sequencing. Nucleic Acids Res 2013, 41(19):9090–9104.
Wang ET , Sandberg R , Luo S , Khrebtukova I , Zhang L , Mayr C , Kingsmore SF , Schroth
GP , Burge CB . Alternative isoform regulation in human tissue transcriptomes. Nature 2008,
456(7221):470–476.
Pan Q , Shai O , Lee LJ , Frey BJ , Blencowe BJ . Deep surveying of alternative splicing
complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008,
40(12):1413–1415.
Keegan LP , Gallo A , O'Connell MA . The many roles of an RNA editor. Nat Rev Genet 2001,
2(11):869–878.
Bratt E , Ohman M . Coordination of editing and splicing of glutamate receptor pre-mRNA. RNA
2003, 9(3):309–318.
Pfeiffer BE , Huber KM . Current advances in local protein synthesis and synaptic plasticity. J
Neurosci 2006, 26(27):7147–7150.
Rustad TR , Minch KJ , Brabant W , Winkler JK , Reiss DJ , Baliga NS , Sherman DR . Global
analysis of mRNA stability in Mycobacterium tuberculosis . Nucleic Acids Res 2013,
41(1):509–517.
Sharova LV , Sharov AA , Nedorezov T , Piao Y , Shaik N , Ko MS . Database for mRNA half-
life of 19 977 genes obtained by DNA microarray analysis of pluripotent and differentiating
mouse embryonic stem cells. DNA Res 2009, 16(1):45–58.
Yang E , van Nimwegen E , Zavolan M , Rajewsky N , Schroeder M , Magnasco M , Darnell JE ,
Jr. Decay rates of human mRNAs: correlation with functional characteristics and sequence
attributes. Genome Res 2003, 13(8):1863–1872.
Figueroa A , Cuadrado A , Fan J , Atasoy U , Muscat GE , Munoz-Canoves P , Gorospe M ,
Munoz A . Role of HuR in skeletal myogenesis through coordinate regulation of muscle
differentiation genes. Mol Cell Biol 2003, 23(14):4991–5004.
Kulkarni M , Ozgur S , Stoecklin G . On track with P-bodies. Biochem Soc Trans 2010, 38(Pt
1):242–251.
Labno A , Tomecki R , Dziembowski A . Cytoplasmic RNA decay pathways – enzymes and
mechanisms. Biochim Biophys Acta 2016, 1863(12):3125–3147.
Willis DE , Twiss JL . Regulation of protein levels in subcellular domains through mRNA
transport and localized translation. Mol Cell Proteomics 2010, 9(5):952–962.
Jeffares DC , Poole AM , Penny D . Relics from the RNA world. J Mol Evol 1998, 46(1):18–36.
Cech TR . Structural biology. The ribosome is a ribozyme. Science 2000, 289(5481):878–879.
Zhang L , Vielle A , Espinosa S , Zhao R . RNAs in the spliceosome: insight from cryoEM
structures. Wiley Interdiscip Rev RNA 2019, 10(3):e1523.
Wilson RC , Doudna JA . Molecular mechanisms of RNA interference. Annu Rev Biophys 2013,
42:217–239.
Friedman RC , Farh KK , Burge CB , Bartel DP . Most mammalian mRNAs are conserved
targets of microRNAs. Genome Res 2009, 19(1):92–105.
Kawamata T , Tomari Y . Making RISC. Trends Biochem Sci 2010, 35(7):368–376.
Carthew RW , Sontheimer EJ . Origins and mechanisms of miRNAs and siRNAs. Cell 2009,
136(4):642–655.
Liu X , Hao L , Li D , Zhu L , Hu S. Long non-coding RNAs and their biological roles in plants.
Genomics Proteomics Bioinformatics 2015, 13(3):137–147.
Derrien T , Johnson R , Bussotti G , Tanzer A , Djebali S , Tilgner H , Guernec G , Martin D ,
Merkel A , Knowles DG et al . The GENCODE v7 catalog of human long noncoding RNAs:
analysis of their gene structure, evolution, and expression. Genome Res 2012,
22(9):1775–1789.
Gupta RA , Shah N , Wang KC , Kim J , Horlings HM , Wong DJ , Tsai MC , Hung T , Argani P ,
Rinn JL et al . Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer
metastasis. Nature 2010, 464(7291):1071–1076.
Zhao J , Sun BK , Erwin JA , Song JJ , Lee JT . Polycomb proteins targeted by a short repeat
RNA to the mouse X chromosome. Science 2008, 322(5902):750–756.
Li W , Notani D , Ma Q , Tanasa B , Nunez E , Chen AY , Merkurjev D , Zhang J , Ohgi K , Song
X et al . Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation.
Nature 2013, 498(7455):516–520.
Yoon JH , Abdelmohsen K , Srikantan S , Yang X , Martindale JL , De S , Huarte M , Zhan M ,
Becker KG , Gorospe M . LincRNA-p21 suppresses target mRNA translation. Mol Cell 2012,
47(4):648–655.
Gong C , Maquat LE . lncRNAs transactivate STAU1-mediated mRNA decay by duplexing with
3’ UTRs via Alu elements. Nature 2011, 470(7333):284–288.
Yarmishyn AA , Kurochkin IV . Long noncoding RNAs: a potential novel class of cancer
biomarkers. Front Genet 2015, 6:145.
Ni YQ , Xu H , Liu YS . Roles of Long Non-coding RNAs in the development of aging-related
neurodegenerative diseases. Front Mol Neurosci 2022, 15:844193.
Nisar S , Bhat AA , Singh M , Karedath T , Rizwan A , Hashem S , Bagga P , Reddy R , Jamal F
, Uddin S et al . Insights into the role of CircRNAs: biogenesis, characterization, functional, and
clinical impact in human malignancies. Front Cell Dev Biol 2021, 9:617281.
Cech TR , Steitz JA . The noncoding RNA revolution—trashing old rules to forge new ones. Cell
2014, 157(1):77–94.
Carninci P , Kasukawa T , Katayama S , Gough J , Frith MC , Maeda N , Oyama R , Ravasi T ,
Lenhard B , Wells C et al . The transcriptional landscape of the mammalian genome. Science
2005, 309(5740):1559–1563.
Djebali S , Davis CA , Merkel A , Dobin A , Lassmann T , Mortazavi A , Tanzer A , Lagarde J ,
Lin W , Schlesinger F et al . Landscape of transcription in human cells. Nature 2012,
489(7414):101–108.

Next-Generation Sequencing (NGS) Technologies


Bentley DR , Balasubramanian S , Swerdlow HP , Smith GP , Milton J , Brown CG , Hall KP ,
Evers DJ , Barnes CL , Bignell HR et al . Accurate whole human genome sequencing using
reversible terminator chemistry. Nature 2008, 456(7218):53–59.
Picard toolkit (https://broadinstitute.github.io/picard/)
Eid J , Fehr A , Gray J , Luong K , Lyle J , Otto G , Peluso P , Rank D , Baybayan P , Bettman B
et al . Real-time DNA sequencing from single polymerase molecules. Science 2009,
323(5910):133–138.
Wenger AM , Peluso P , Rowell WJ , Chang PC , Hall RJ , Concepcion GT , Ebler J ,
Fungtammasan A , Kolesnikov A , Olson ND et al . Accurate circular consensus long-read
sequencing improves variant detection and assembly of a human genome. Nat Biotechnol
2019, 37(10):1155–1162.
Jain M , Fiddes IT , Miga KH , Olsen HE , Paten B , Akeson M . Improved data analysis for the
MinION nanopore sequencer. Nat Methods 2015, 12(4):351–356.
Rang FJ , Kloosterman WP , de Ridder J. From squiggle to basepair: computational approaches
for improving nanopore sequencing read accuracy. Genome Biol 2018, 19(1):90.
Poptsova MS , Il'icheva IA , Nechipurenko DY , Panchenko LA , Khodikov MV , Oparina NY ,
Polozov RV , Nechipurenko YD , Grokhovsky SL . Non-random DNA fragmentation in next-
generation sequencing. Sci Rep 2014, 4:4532.
Seguin-Orlando A , Schubert M , Clary J , Stagegaard J , Alberdi MT , Prado JL , Prieto A ,
Willerslev E , Orlando L. Ligation bias in Illumina next-generation DNA libraries: implications for
sequencing ancient genomes. PLoS One 2013, 8(10):e78575.
Hafner M , Renwick N , Brown M , Mihailovic A , Holoch D , Lin C , Pena JT , Nusbaum JD ,
Morozov P , Ludwig J et al . RNA-ligase-dependent biases in miRNA representation in deep-
sequenced small RNA cDNA libraries. RNA 2011, 17(9):1697–1712.
Aird D , Ross MG , Chen WS , Danielsson M , Fennell T , Russ C , Jaffe DB , Nusbaum C ,
Gnirke A. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries.
Genome Biol 2011, 12(2):R18.
Wang B , Wan L , Wang A , Li LM . An adaptive decorrelation method removes Illumina DNA
base-calling errors caused by crosstalk between adjacent clusters. Sci Rep 2017, 7:41348.
Whiteford N , Skelly T , Curtis C , Ritchie ME , Lohr A , Zaranek AW , Abnizova I , Brown C .
Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 2009,
25(17):2194–2199.
Cacho A , Smirnova E , Huzurbazar S , Cui X . A Comparison of base-calling algorithms for
illumina sequencing technology. Brief Bioinform 2016, 17(5): 786–795.

Early-Stage Next-Generation Sequencing (NGS) Data Analysis


Cacho A , Smirnova E , Huzurbazar S , Cui X . A comparison of base-calling algorithms for
Illumina sequencing technology. Brief Bioinform 2016, 17(5): 786–795.
Wick RR , Judd LM , Holt KE . Performance of neural network basecalling tools for Oxford
Nanopore sequencing. Genome Biol 2019, 20(1):129.
Boza V , Brejova B , Vinar T . DeepNano: deep recurrent neural networks for base calling in
MinION nanopore reads. PLoS One 2017, 12(6):e0178751.
David M , Dursi LJ , Yao D , Boutros PC , Simpson JT . Nanocall: an open source basecaller for
Oxford Nanopore sequencing data. Bioinformatics 2017, 33(1):49–55.
Teng H , Cao MD , Hall MB , Duarte T , Wang S , Coin LJM . Chiron: translating nanopore raw
signal directly into nucleotide sequence using deep learning. GigaScience 2018, 7(5):giy037.
Zeng J , Cai H , Peng H , Wang H , Zhang Y , Akutsu T. Causalcall: nanopore basecalling using
a temporal convolutional network. Front Genet 2019, 10:1332.
FastQC : A Quality Control Tool for High Throughput Sequence Data [Online]
(www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Patel RK , Jain M . NGS QC Toolkit: a toolkit for quality control of next generation sequencing
data. PLoS One 2012, 7(2):e30619.
Chen S , Zhou Y , Chen Y , Gu J . fastp: an ultra-fast all-in-one FASTQ preprocessor.
Bioinformatics 2018, 34(17):i884–i890.
Albrecht S , Sprang M , Andrade-Navarro MA , Fontaine JF . seqQscorer: automated quality
control of next-generation sequencing data using machine learning. Genome Biol 2021,
22(1):75.
Brown J , Pirrung M , McCue LA . FQC Dashboard: integrates FastQC results into a web-based,
interactive, and extensible FASTQ quality control tool. Bioinformatics 2017, 33(19):3137–3139.
Ewels P , Magnusson M , Lundin S , Kaller M . MultiQC: summarize analysis results for multiple
tools and samples in a single report. Bioinformatics 2016, 32(19):3047–3048.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads.
EMBnet J 2011, 17(1):10–12.
Bolger AM , Lohse M , Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data.
Bioinformatics 2014, 30(15):2114–2120.
De Coster W , D'Hert S , Schultz DT , Cruts M , Van Broeckhoven C. NanoPack: visualizing and
processing long-read sequencing data. Bioinformatics 2018, 34(15):2666–2669.
Leger A , Leonardi T . pycoQC, interactive quality control for Oxford Nanopore Sequencing. J
Open Source Softw 2019, 4(34):1236.
Hufnagel DE , Hufford MB , Seetharam AS . SequelTools: a suite of tools for working with
PacBio Sequel raw sequence data. BMC Bioinformatics 2020, 21(1):429.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018,
34(18):3094–3100.
Li R , Li Y , Kristiansen K , Wang J . SOAP: short oligonucleotide alignment program.
Bioinformatics 2008, 24(5):713–714.
Li H , Ruan J , Durbin R. Mapping short DNA sequencing reads and calling variants using
mapping quality scores. Genome Res 2008, 18(11):1851–1858.
Raczy C , Petrovski R , Saunders CT , Chorny I , Kruglyak S , Margulies EH , Chuang HY ,
Kallberg M , Kumar SA , Liao A et al . Isaac: ultra-fast whole-genome secondary analysis on
Illumina sequencing platforms. Bioinformatics 2013, 29(16):2041–2043.
Roberts M , Hayes W , Hunt BR , Mount SM , Yorke JA . Reducing storage requirements for
biological sequence comparison. Bioinformatics 2004, 20(18):3363–3369.
Burrows M , Wheeler D. A block-sorting lossless data compression algorithm. In: Digital SRC
Research Report : 1994: Citeseer; 1994.
Ferragina P , Manzini G. Opportunistic data structures with applications. In: Proceedings 41st
annual symposium on foundations of computer science: 2000 : IEEE; 2000: 390–398.
Li H , Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform.
Bioinformatics 2010, 26(5):589–595.
Langmead B , Salzberg SL . Fast gapped-read alignment with Bowtie 2. Nat Methods 2012,
9(4):357–359.
Langmead B , Wilks C , Antonescu V , Charles R . Scaling read aligners to hundreds of threads
on general-purpose processors. Bioinformatics 2019, 35(3):421–432.
Li R , Yu C , Li Y , Lam TW , Yiu SM , Kristiansen K , Wang J . SOAP2: an improved ultrafast
tool for short read alignment. Bioinformatics 2009, 25(15):1966–1967.
Dobin A , Davis CA , Schlesinger F , Drenkow J , Zaleski C , Jha S , Batut P , Chaisson M ,
Gingeras TR . STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15–21.
Smith TF , Waterman MS . Identification of common molecular subsequences. J Mol Biol 1981,
147(1):195–197.
Needleman SB , Wunsch CD . A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453.
Hamming RW . Error detecting and error correcting codes. Bell Syst Tech J 1950,
29(2):147–160.
Smith AD , Xuan Z , Zhang MQ. Using quality scores and longer reads improves accuracy of
Solexa read mapping. BMC Bioinformatics 2008, 9:128.
Langmead B , Trapnell C , Pop M , Salzberg SL . Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25.
Hach F , Hormozdiari F , Alkan C , Hormozdiari F , Birol I , Eichler EE , Sahinalp SC . mrsFAST:
a cache-oblivious algorithm for short-read mapping. Nat Methods 2010, 7(8):576–577.
Lunter G , Goodson M . Stampy: a statistical algorithm for sensitive and fast mapping of Illumina
sequence reads. Genome Res 2011, 21(6):936–939.
David M , Dzamba M , Lister D , Ilie L , Brudno M . SHRiMP2: sensitive yet practical SHort
Read Mapping. Bioinformatics 2011, 27(7):1011–1012.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv:13033997, 2013.
Sovic I , Sikic M , Wilm A , Fenlon SN , Chen S , Nagarajan N . Fast and sensitive mapping of
nanopore sequencing reads with GraphMap. Nat Commun 2016, 7:11307.
Chaisson MJ , Tesler G. Mapping single molecule sequencing reads using basic local alignment
with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012, 13:238.
Kielbasa SM , Wan R , Sato K , Horton P , Frith MC . Adaptive seeds tame genomic sequence
comparison. Genome Res 2011, 21(3):487–493.
Sedlazeck FJ , Rescheneder P , Smolka M , Fang H , Nattestad M , von Haeseler A , Schatz
MC . Accurate detection of complex structural variations using single-molecule sequencing. Nat
Methods 2018, 15(6):461–468.
Jain C , Rhie A , Zhang H , Chu C , Walenz BP , Koren S , Phillippy AM . Weighted minimizer
sampling improves long read mapping. Bioinformatics 2020, 36(Suppl_ 1):i111–i118.
Jain C , Rhie A , Hansen NF , Koren S , Phillippy AM . Long-read mapping to repetitive
reference sequences using Winnowmap2. Nat Methods 2022.
Zheng H , Kingsford C , Marcais G. Improved design and analysis of practical minimizers.
Bioinformatics 2020, 36(Suppl_1):i119–i127.
Shukla HG , Bawa PS , Srinivasan S. hg19KIndel: ethnicity normalized human reference
genome. BMC Genomics 2019, 20(1):459.
Chen NC , Solomon B , Mun T , Iyer S , Langmead B . Reference flow: reducing reference bias
using multiple population genomes. Genome Biol 2021, 22(1):8.
Vasimuddin M , Misra S , Li H , Aluru S. Efficient architecture-aware acceleration of BWA-MEM
for multicore systems. In: 2019 IEEE International Parallel and Distributed Processing
Symposium (IPDPS): 2019 : IEEE; 2019: 314–324.
Hsi-Yang Fritz M , Leinonen R , Cochrane G , Birney E . Efficient storage of high throughput
DNA sequencing data using reference-based compression. Genome Res 2011, 21(5):734–740.
Yuan Y , Norris C , Xu Y , Tsui KW , Ji Y , Liang H . BM-Map: an efficient software package for
accurately allocating multireads of RNA-sequencing data. BMC Genomics 2012, 13 Suppl 8:S9.
Thorvaldsdottir H , Robinson JT , Mesirov JP . Integrative Genomics Viewer (IGV): high-
performance genomics data visualization and exploration. Brief Bioinform 2013, 14(2):178–192.
Carver T , Harris SR , Berriman M , Parkhill J , McQuillan JA . Artemis: an integrated platform
for visualization and analysis of high-throughput sequence-based experimental data.
Bioinformatics 2012, 28(4):464–469.
SeqMonk (www.bioinformatics.babraham.ac.uk/projects/seqmonk/)
Buels R , Yao E , Diesh CM , Hayes RD , Munoz-Torres M , Helt G , Goodstein DM , Elsik CG ,
Lewis SE , Stein L et al . JBrowse: a dynamic web platform for genome visualization and
analysis. Genome Biol 2016, 17:66.
Milne I , Bayer M , Cardle L , Shaw P , Stephen G , Wright F , Marshall D . Table—next
generation sequence assembly visualization. Bioinformatics 2010, 26(3):401–402.
Okonechnikov K , Conesa A , Garcia-Alcalde F . Qualimap 2: advanced multi-sample quality
control for high-throughput sequencing data. Bioinformatics 2016, 32(2):292–294.

Computing Needs for Next-Generation Sequencing (NGS) Data


Management and Analysis
Li R , Zhu H , Ruan J , Qian W , Fang X , Shi Z , Li Y , Li S , Shan G , Kristiansen K et al . De
novo assembly of human genomes with massively parallel short read sequencing. Genome Res
2010, 20(2):265–272.
Lampa S , Dahlo M , Olason PI , Hagberg J , Spjuth O . Lessons learned from implementing a
national infrastructure in Sweden for storage and analysis of next-generation sequencing data.
GigaScience 2013, 2(1):9.
Supernat A , Vidarsson OV , Steen VM , Stokowy T . Comparison of three variant callers for
human whole genome sequencing. Sci Rep 2018, 8(1):17851.
Rasche A , Lienhard M , Yaspo ML , Lehrach H , Herwig R. ARH-seq: identification of
differential splicing in RNA-seq data. Nucleic Acids Res 2014, 42(14):e110.
Farazi TA , Brown M , Morozov P , Ten Hoeve JJ , Ben-Dov IZ , Hovestadt V , Hafner M ,
Renwick N , Mihailovic A , Wessels LF et al . Bioinformatic analysis of barcoded cDNA libraries
for small RNA profiling by next-generation sequencing. Methods 2012, 58(2):171–187.
Schadt EE , Linderman MD , Sorenson J , Lee L , Nolan GP . Computational solutions to large-
scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657.
Galaxy C. The Galaxy platform for accessible, reproducible and collaborative biomedical
analyses: 2022 update. Nucleic Acids Res 2022, 50(W1):W345–W351.
Schatz MC , Philippakis AA , Afgan E , Banks E , Carey VJ , Carroll RJ , Culotti A , Ellrott K ,
Goecks J , Grossman RL et al . Inverting the model of genomics data sharing with the NHGRI
Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom 2022,
2(1):100085.
Transcriptomics by Bulk RNA-Seq
Li J , Fu C , Speed TP , Wang W , Symmans WF . Accurate RNA sequencing from formalin-
fixed cancer tissue To represent high-quality transcriptome from frozen tissue. JCO Precis
Oncol 2018, 2:PO.17.00091.
Parekh S , Ziegenhain C , Vieth B , Enard W , Hellmann I . The impact of amplification on
differential expression analyses by RNA-seq. Sci Rep 2016, 6:25533.
Zhulidov PA , Bogdanova EA , Shcheglov AS , Vagner LL , Khaspekov GL , Kozhemyako VB ,
Matz MV , Meleshkevitch E , Moroz LL , Lukyanov SA et al . Simple cDNA normalization using
Kamchatka crab duplex-specific nuclease. Nucleic Acids Res 2004, 32(3):e37.
Yang L , Duff MO , Graveley BR , Carmichael GG , Chen LL . Genomewide characterization of
non-polyadenylated RNAs. Genome Biol 2011, 12(2):R16.
Busby MA , Stewart C , Miller CA , Grzeda KR , Marth GT . Scotty: a web tool for designing
RNA-Seq experiments to measure differential gene expression. Bioinformatics 2013,
29(5):656–657.
Bi R , Liu P. Sample size calculation while controlling false discovery rate for differential
expression analysis with RNA-sequencing experiments. BMC Bioinformatics 2016, 17:146.
Wu H , Wang C , Wu Z. PROPER: comprehensive power evaluation for differential expression
using RNA-seq. Bioinformatics 2015, 31(2):233–241.
Zhao S , Li CI , Guo Y , Sheng Q , Shyr Y. RnaSeqSampleSize: real data based sample size
estimation for RNA sequencing. BMC Bioinformatics 2018, 19(1):191.
Ching T , Huang S , Garmire LX . Power analysis and sample size estimation for RNA-seq
differential expression. RNA 2014, 20(11):1684–1696.
Auer PL , Doerge RW . Statistical design and analysis of RNA sequencing data. Genetics 2010,
185(2):405–416.
Robinson DG , Storey JD . subSeq: determining appropriate sequencing depth through efficient
read subsampling. Bioinformatics 2014, 30(23):3424–3426.
Grant GR , Farkas MH , Pizarro AD , Lahens NF , Schug J , Brunk BP , Stoeckert CJ ,
Hogenesch JB , Pierce EA . Comparative analysis of RNA-Seq alignment algorithms and the
RNA-Seq unified mapper (RUM). Bioinformatics 2011, 27(18):2518–2528.
Ryan MC , Cleland J , Kim R , Wong WC , Weinstein JN . SpliceSeq: a resource for analysis
and visualization of RNA-Seq data on alternative splicing and its functional impacts.
Bioinformatics 2012, 28(18):2385–2387.
Trapnell C , Pachter L , Salzberg SL . TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 2009, 25(9):1105–1111.
Kim D , Pertea G , Trapnell C , Pimentel H , Kelley R , Salzberg SL . TopHat2: accurate
alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome
Biol 2013, 14(4):R36.
Wang K , Singh D , Zeng Z , Coleman SJ , Huang Y , Savich GL , He X , Mieczkowski P ,
Grimm SA , Perou CM et al . MapSplice: accurate mapping of RNA-seq reads for splice junction
discovery. Nucleic Acids Res 2010, 38(18):e178.
Au KF , Jiang H , Lin L , Xing Y , Wong WH . Detection of splice junctions from paired-end RNA-
seq data by SpliceMap. Nucleic Acids Res 2010, 38(14):4570–4578.
Marco-Sola S , Sammeth M , Guigo R , Ribeca P . The GEM mapper: fast, accurate and
versatile alignment by filtration. Nat Methods 2012, 9(12):1185–1188.
Dobin A , Davis CA , Schlesinger F , Drenkow J , Zaleski C , Jha S , Batut P , Chaisson M ,
Gingeras TR . STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15–21.
Kim D , Langmead B , Salzberg SL . HISAT: a fast spliced aligner with low memory
requirements. Nat Methods 2015, 12(4):357–360.
Kim D , Paggi JM , Park C , Bennett C , Salzberg SL . Graph-based genome alignment and
genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 2019, 37(8):907–915.
Wu TD , Reeder J , Lawrence M , Becker G , Brauer MJ . GMAP and GSNAP for Genomic
Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality. Methods Mol Biol
2016, 1418:283–334.
Krizanovic K , Echchiki A , Roux J , Sikic M . Evaluation of tools for long read RNA-seq splice-
aware alignment. Bioinformatics 2018, 34(5):748–754.
Liu B , Liu Y , Li J , Guo H , Zang T , Wang Y. deSALT: fast and accurate long transcriptomic
read alignment with de Bruijn graph-based index. Genome Biol 2019, 20(1):274.
Marić J , Sović I , Križanović K , Nagarajan N , Šikić M . Graphmap2 - splice-aware RNA-seq
mapper for long reads. bioRxiv 2019, doi: https://doi.org/10.1101/720458
Sahlin K , Makinen V . Accurate spliced alignment of long RNA sequencing reads.
Bioinformatics 2021, 37(24):4643–4651.
Graubert A , Aguet F , Ravi A , Ardlie KG , Getz G. RNA-SeQC 2: Efficient RNA-seq quality
control and quantification for large cohorts. Bioinformatics 2021, 37(18):3048–3050.
Wang L , Wang S , Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 2012,
28(16):2184–2185.
Hartley SW , Mullikin JC . QoRTs: a comprehensive toolset for quality control and data
processing of RNA-Seq experiments. BMC Bioinformatics 2015, 16:224.
Bushmanova E , Antipov D , Lapidus A , Prjibelski AD . rnaSPAdes: a de novo transcriptome
assembler and its application to RNA-Seq data. GigaScience 2019, 8(9):giz100.
Grabherr MG , Haas BJ , Yassour M , Levin JZ , Thompson DA , Amit I , Adiconis X , Fan L ,
Raychowdhury R , Zeng Q et al . Full-length transcriptome assembly from RNA-Seq data
without a reference genome. Nat Biotechnol 2011, 29(7):644–652.
Chang Z , Li G , Liu J , Zhang Y , Ashby C , Liu D , Cramer CL , Huang X . Bridger: a new
framework for de novo transcriptome assembly using RNA-seq data. Genome Biol 2015, 16:30.
Robertson G , Schein J , Chiu R , Corbett R , Field M , Jackman SD , Mungall K , Lee S , Okada
HM , Qian JQ et al . De novo assembly and analysis of RNA-seq data. Nat Methods 2010,
7(11):909–912.
Xie Y , Wu G , Tang J , Luo R , Patterson J , Liu S , Huang W , He G , Gu S , Li S et al .
SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.
Bioinformatics 2014, 30(12):1660–1666.
Schulz MH , Zerbino DR , Vingron M , Birney E . Oases: robust de novo RNA-seq assembly
across the dynamic range of expression levels. Bioinformatics 2012, 28(8):1086–1092.
Pertea M , Pertea GM , Antonescu CM , Chang TC , Mendell JT , Salzberg SL . StringTie
enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 2015,
33(3):290–295.
Kovaka S , Zimin AV , Pertea GM , Razaghi R , Salzberg SL , Pertea M . Transcriptome
assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 2019, 20(1):278.
Shumate A , Wong B , Pertea G , Pertea M . Improved transcriptome assembly using a hybrid
of long and short reads with StringTie. PLoS Comput Biol 2022, 18(6):e1009730.
Prjibelski AD , Puglia GD , Antipov D, Bushmanova E, Giordano D, Mikheenko A, Vitale D,
Lapidus A. Extending rnaSPAdes functionality for hybrid transcriptome assembly. BMC
Bioinformatics 2020, 21(Suppl 12):302.
Liao Y , Smyth GK , Shi W . featureCounts: an efficient general purpose program for assigning
sequence reads to genomic features. Bioinformatics 2014, 30(7):923–930.
Anders S , Pyl PT , Huber W . HTSeq—a Python framework to work with high-throughput
sequencing data. Bioinformatics 2015, 31(2):166–169.
Li B , Dewey CN . RSEM: accurate transcript quantification from RNA-Seq data with or without
a reference genome. BMC Bioinformatics 2011, 12:323.
Roberts A , Pachter L . Streaming fragment assignment for real-time analysis of sequencing
experiments. Nat Methods 2013, 10(1):71–73.
Trapnell C , Roberts A , Goff L , Pertea G , Kim D , Kelley DR , Pimentel H , Salzberg SL , Rinn
JL , Pachter L . Differential gene and transcript expression analysis of RNA-seq experiments
with TopHat and Cufflinks. Nat Protoc 2012, 7(3):562–578.
Robert C , Watson M . Errors in RNA-Seq quantification affect genes of relevance to human
disease. Genome Biol 2015, 16:177.
Bray NL , Pimentel H , Melsted P , Pachter L . Near-optimal probabilistic RNA-seq
quantification. Nat Biotechnol 2016, 34(5):525–527.
Patro R , Duggal G , Love MI , Irizarry RA , Kingsford C . Salmon provides fast and bias-aware
quantification of transcript expression. Nat Methods 2017, 14(4):417–419.
Patro R , Mount SM , Kingsford C . Sailfish enables alignment-free isoform quantification from
RNA-seq reads using lightweight algorithms. Nat Biotechnol 2014, 32(5):462–464.
Wu DC , Yao J , Ho KS , Lambowitz AM, Wilke CO. Limitations of alignment-free tools in total
RNA-seq quantification. BMC Genomics 2018, 19(1):510.
Bullard JH , Purdom E , Hansen KD , Dudoit S. Evaluation of statistical methods for
normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 2010,
11:94.
Hansen KD , Irizarry RA , Wu Z . Removing technical variability in RNA-seq data using
conditional quantile normalization. Biostatistics 2012, 13(2):204–216.
Love MI , Huber W , Anders S. Moderated estimation of fold change and dispersion for RNA-
seq data with DESeq2. Genome Biol 2014, 15(12):550.
Robinson MD , McCarthy DJ , Smyth GK . edgeR: a Bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics 2010, 26(1):139–140.
Robinson MD , Oshlack A. A scaling normalization method for differential expression analysis of
RNA-seq data. Genome Biol 2010, 11(3):R25.
Kadota K , Nishiyama T , Shimizu K. A normalization strategy for comparing tag count data.
Algorithms Mol Biol 2012, 7(1):5.
Sun J , Nishiyama T , Shimizu K , Kadota K . TCC: an R package for comparing tag count data
with robust normalization strategies. BMC Bioinformatics 2013, 14:219.
Li J , Witten DM , Johnstone IM , Tibshirani R . Normalization, testing, and false discovery rate
estimation for RNA-sequencing data. Biostatistics 2012, 13(3):523–538.
Manimaran S , Selby HM , Okrah K , Ruberman C , Leek JT , Quackenbush J , Haibe-Kains B ,
Bravo HC , Johnson WE . BatchQC: interactive software for evaluating sample and batch
effects in genomic data. Bioinformatics 2016, 32(24):3836–3838.
Johnson WE , Li C , Rabinovic A . Adjusting batch effects in microarray expression data using
empirical Bayes methods. Biostatistics 2007, 8(1):118–127.
Risso D , Ngai J , Speed TP , Dudoit S . Normalization of RNA-seq data using factor analysis of
control genes or samples. Nat Biotechnol 2014, 32(9):896–902.
Leek JT . svaseq: removing batch effects and other unwanted noise from sequencing data.
Nucleic Acids Res 2014, 42(21):e161.
Zhang Y , Parmigiani G , Johnson WE . ComBat-seq: batch effect adjustment for RNA-seq
count data. NAR Genom Bioinform 2020, 2(3):lqaa078.
Marioni JC , Mason CE , Mane SM , Stephens M , Gilad Y . RNA-seq: an assessment of
technical reproducibility and comparison with gene expression arrays. Genome Res 2008,
18(9):1509–1517.
Anders S , Huber W . Differential expression analysis for sequence count data. Genome Biol
2010, 11(10):R106.
Ritchie ME , Phipson B , Wu D , Hu Y , Law CW , Shi W , Smyth GK . limma powers differential
expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015,
43(7):e47.
Frazee AC , Pertea G , Jaffe AE , Langmead B , Salzberg SL , Leek JT . Ballgown bridges the
gap between transcriptome assembly and expression analysis. Nat Biotechnol 2015,
33(3):243–246.
Hardcastle TJ , Kelly KA . baySeq: empirical Bayesian methods for identifying differential
expression in sequence count data. BMC Bioinformatics 2010, 11:422.
Trapnell C , Hendrickson DG , Sauvageau M , Goff L , Rinn JL , Pachter L . Differential analysis
of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 2013, 31(1):46–53.
Wang L , Feng Z , Wang X , Wang X , Zhang X . DEGseq: an R package for identifying
differentially expressed genes from RNA-seq data. Bioinformatics 2010, 26(1):136–138.
Leng N , Dawson JA , Thomson JA , Ruotti V , Rissman AI , Smits BM , Haag JD , Gould MN ,
Stewart RM , Kendziorski C . EBSeq: an empirical Bayes hierarchical model for inference in
RNA-seq experiments. Bioinformatics 2013, 29(8):1035–1043.
Li J , Tibshirani R . Finding consistent patterns: a nonparametric approach for identifying
differential expression in RNA-Seq data. Stat Methods Med Res 2013, 22(5):519–536.
Tarazona S , Furio-Tari P , Turra D , Pietro AD , Nueda MJ , Ferrer A , Conesa A . Data quality
aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic
Acids Res 2015, 43(21):e140.
Corchete LA , Rojas EA , Alonso-Lopez D , De Las Rivas J , Gutierrez NC , Burguillo FJ .
Systematic comparison and assessment of RNA-seq procedures for gene expression
quantitative analysis. Sci Rep 2020, 10(1):19737.
Stupnikov A , McInerney CE , Savage KI , McIntosh SA , Emmert-Streib F , Kennedy R , Salto-
Tellez M , Prise KM , McArt DG . Robustness of differential gene expression analysis of RNA-
seq. Comput Struct Biotechnol J 2021, 19:3470–3481.
Feng J , Meyer CA , Wang Q , Liu JS , Shirley Liu X , Zhang Y. GFOLD: a generalized fold
change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012,
28(21):2782–2788.
Claverie JM , Ta TN . ACDtool: a web-server for the generic analysis of large data sets of
counts. Bioinformatics 2019, 35(1):170–171.
Audic S , Claverie JM . The significance of digital gene expression profiles. Genome Res 1997,
7(10):986–995.
Benjamini Y , Hochberg Y . Controlling the false discovery rate – a practical and powerful
approach to multiple testing. J R Stat Soc Ser B Methodol 1995, 57(1):289–300.
Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res
2021, 49(D1):D325–D334.
Kanehisa M , Furumichi M , Tanabe M , Sato Y , Morishima K. KEGG: new perspectives on
genomes, pathways, diseases and drugs. Nucleic Acids Res 2017, 45(D1):D353–D361.
Rodchenkov I , Babur O , Luna A , Aksoy BA , Wong JV , Fong D , Franz M , Siper MC ,
Cheung M , Wrana M et al . Pathway Commons 2019 update: integration, analysis and
exploration of pathway data. Nucleic Acids Res 2020, 48(D1):D489–D497.
Martens M , Ammar A , Riutta A , Waagmeester A , Slenter DN , Hanspers K , Miller RA , Digles D ,
Lopes EN , Ehrhart F et al . WikiPathways: connecting communities. Nucleic Acids Res 2021,
49(D1):D613–D621.
Gillespie M , Jassal B , Stephan R , Milacic M , Rothfels K , Senff-Ribeiro A , Griss J , Sevilla C ,
Matthews L , Gong C et al . The reactome pathway knowledgebase 2022. Nucleic Acids Res
2022, 50(D1):D687–D692.
Xie Z , Bailey A , Kuleshov MV , Clarke DJB , Evangelista JE , Jenkins SL , Lachmann A ,
Wojciechowicz ML , Kropiwnicki E , Jagodnik KM et al . Gene set knowledge discovery with
Enrichr. Curr Protoc 2021, 1(3):e90.
Young MD , Wakefield MJ , Smyth GK , Oshlack A. Gene ontology analysis for RNA-seq:
accounting for selection bias. Genome Biol 2010, 11(2):R14.
Eden E , Navon R , Steinfeld I , Lipson D , Yakhini Z . GOrilla: a tool for discovery and
visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 2009, 10:48.
Raudvere U , Kolberg L , Kuzmin I , Arak T , Adler P , Peterson H , Vilo J . g:Profiler: a web
server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic
Acids Res 2019, 47(W1):W191–W198.
Huang da W , Sherman BT , Lempicki RA . Systematic and integrative analysis of large gene
lists using DAVID bioinformatics resources. Nat Protoc 2009, 4(1):44–57.
Chen J , Bardes EE , Aronow BJ , Jegga AG . ToppGene Suite for gene list enrichment analysis
and candidate gene prioritization. Nucleic Acids Res 2009, 37(Web Server issue):W305–311.
Subramanian A , Tamayo P , Mootha VK , Mukherjee S , Ebert BL , Gillette MA , Paulovich A ,
Pomeroy SL , Golub TR , Lander ES et al . Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005,
102(43):15545–15550.
Doncheva NT , Morris JH , Gorodkin J , Jensen LJ . Cytoscape StringApp: network analysis and
visualization of proteomics data. J Proteome Res 2019, 18(2):623–632.
Montojo J , Zuberi K , Rodriguez H , Kazi F , Wright G , Donaldson SL , Morris Q , Bader GD .
GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. Bioinformatics
2010, 26(22):2927–2928.
Bindea G , Mlecnik B , Hackl H , Charoentong P , Tosolini M , Kirilovsky A , Fridman WH ,
Pages F , Trajanoski Z , Galon J . ClueGO: a Cytoscape plug-in to decipher functionally
grouped gene ontology and pathway annotation networks. Bioinformatics 2009,
25(8):1091–1093.
Merico D , Isserlin R , Stueker O , Emili A , Bader GD . Enrichment map: a network-based
method for gene-set enrichment visualization and interpretation. PLoS One 2010, 5(11):e13984.
Anders S , Reyes A , Huber W . Detecting differential usage of exons from RNA-seq data.
Genome Res 2012, 22(10):2008–2017.
Hartley SW , Mullikin JC . Detection and visualization of differential splicing in RNA-Seq data
with JunctionSeq. Nucleic Acids Res 2016, 44(15):e127.
Katz Y , Wang ET , Airoldi EM , Burge CB . Analysis and design of RNA sequencing
experiments for identifying isoform regulation. Nat Methods 2010, 7(12):1009–1015.
Shen S , Park JW , Lu ZX , Lin L , Henry MD , Wu YN , Zhou Q , Xing Y . rMATS: Robust and
flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc Natl Acad
Sci U S A 2014, 111(51):E5593–E5601.
Vaquero-Garcia J , Barrera A , Gazzara MR , Gonzalez-Vallinas J , Lahens NF , Hogenesch JB ,
Lynch KW , Barash Y. A new view of transcriptome complexity and regulation through the lens
of local splicing variations. Elife 2016, 5:e11752.
Trincado JL , Entizne JC , Hysenaj G , Singh B , Skalic M , Elliott DJ , Eyras E . SUPPA2: fast,
accurate, and uncertainty-aware differential splicing analysis across multiple conditions.
Genome Biol 2018, 19(1):40.
Kahles A , Ong CS , Zhong Y , Ratsch G . SplAdder: identification, quantification and testing of
alternative splicing events from RNA-Seq data. Bioinformatics 2016, 32(12):1840–1847.
Li YI , Knowles DA , Humphrey J , Barbeira AN , Dickinson SP , Im HK , Pritchard JK .
Annotation-free quantification of RNA splicing using LeafCutter. Nat Genet 2018,
50(1):151–158.
Sterne-Weiler T , Weatheritt RJ , Best AJ , Ha KCH , Blencowe BJ . Efficient and Accurate
Quantitative Profiling of Alternative Splicing Patterns of Any Complexity on a Laptop. Mol Cell
2018, 72(1):187–200 e186.
Cascianelli S , Molineris I , Isella C , Masseroli M , Medico E. Machine learning for RNA
sequencing-based intrinsic subtyping of breast cancer. Sci Rep 2020, 10(1):14071.
Karam R , Conner B , LaDuca H , McGoldrick K , Krempely K , Richardson ME , Zimmermann H ,
Gutierrez S , Reineke P , Hoang L et al . Assessment of Diagnostic Outcomes of RNA Genetic
Testing for Hereditary Cancer. JAMA Netw Open 2019, 2(10):e1913900.
Salzman J , Gawad C , Wang PL , Lacayo N , Brown PO . Circular RNAs are the predominant
transcript isoform from hundreds of human genes in diverse cell types. PLoS One 2012,
7(2):e30733.
Davare MA , Tognon CE . Detecting and targeting oncogenic fusion proteins in the genomic era.
Biol Cell 2015, 107(5):111–129.
Haas BJ , Dobin A , Li B , Stransky N , Pochet N , Regev A. Accuracy assessment of fusion
transcript detection via read-mapping and de novo fusion transcript assembly-based methods.
Genome Biol 2019, 20(1):213.

Transcriptomics by Single-Cell RNA-Seq


Cui Y , Irudayaraj J . Inside single cells: quantitative analysis with advanced optics and
nanomaterials. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2015, 7(3):387–407.
Huang XT , Li X , Qin PZ , Zhu Y , Xu SN , Chen JP . Technical advances in single-cell RNA
sequencing and applications in normal and malignant hematopoiesis. Front Oncol 2018, 8:582.
Shalek AK , Satija R , Shuga J , Trombetta JJ , Gennert D , Lu D , Chen P , Gertner RS ,
Gaublomme JT , Yosef N et al . Single-cell RNA-seq reveals dynamic paracrine control of
cellular variation. Nature 2014, 510(7505):363–369.
Marinov GK , Williams BA , McCue K , Schroth GP , Gertz J , Myers RM , Wold BJ . From
single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.
Genome Res 2014, 24(3):496–510.
Zhang M , Zou Y , Xu X , Zhang X , Gao M , Song J , Huang P , Chen Q , Zhu Z , Lin W et al .
Highly parallel and efficient single cell mRNA sequencing with paired picoliter chambers. Nat
Commun 2020, 11(1):2118.
Hagemann-Jensen M , Ziegenhain C , Chen P , Ramskold D , Hendriks GJ , Larsson AJM ,
Faridani OR , Sandberg R . Single-cell RNA counting at allele and isoform resolution using
Smart-seq3. Nat Biotechnol 2020, 38(6):708–714.
Macosko EZ , Basu A , Satija R , Nemesh J , Shekhar K , Goldman M , Tirosh I , Bialas AR ,
Kamitaki N , Martersteck EM et al . Highly Parallel Genome-wide Expression Profiling of
Individual Cells Using Nanoliter Droplets. Cell 2015, 161(5):1202–1214.
Klein AM , Mazutis L , Akartuna I , Tallapragada N , Veres A , Li V , Peshkin L , Weitz DA ,
Kirschner MW . Droplet barcoding for single-cell transcriptomics applied to embryonic stem
cells. Cell 2015, 161(5):1187–1201.
Cao J , Packer JS , Ramani V , Cusanovich DA , Huynh C , Daza R , Qiu X , Lee C , Furlan SN ,
Steemers FJ et al . Comprehensive single-cell transcriptional profiling of a multicellular
organism. Science 2017, 357(6352):661–667.
Zheng GX , Terry JM , Belgrader P , Ryvkin P , Bent ZW , Wilson R , Ziraldo SB , Wheeler TD ,
McDermott GP , Zhu J et al . Massively parallel digital transcriptional profiling of single cells. Nat
Commun 2017, 8:14049.
Zhang X , Li T , Liu F , Chen Y , Yao J , Li Z , Huang Y , Wang J . Comparative analysis of
droplet-based ultra-high-throughput single-cell RNA-seq systems. Mol Cell 2019, 73(1):130–142
e135.
Ding J , Adiconis X , Simmons SK , Kowalczyk MS , Hession CC , Marjanovic ND , Hughes TK ,
Wadsworth MH , Burks T , Nguyen LT et al . Systematic comparison of single-cell and single-
nucleus RNA-sequencing methods. Nat Biotechnol 2020, 38(6):737–746.
How many Cells (https://satijalab.org/howmanycells)
Svensson V , da Veiga Beltrame E , Pachter L. Quantifying the tradeoff between sequencing
depth and cell number in single-cell RNA-seq. bioRxiv 2019, doi: https://doi.org/10.1101/762773
Zhang MJ , Ntranos V , Tse D . Determining sequencing depth in a single-cell RNA-seq
experiment. Nat Commun 2020, 11(1):774.
Parekh S , Ziegenhain C , Vieth B , Enard W , Hellmann I. zUMIs – A fast and flexible pipeline
to process RNA sequencing data with UMIs. GigaScience 2018, 7(6):giy059.
Heimberg G , Bhatnagar R , El-Samad H , Thomson M . Low dimensionality in gene expression
data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell
Syst 2016, 2(4):239–250.
Wu AR , Neff NF , Kalisky T , Dalerba P , Treutlein B , Rothenberg ME , Mburu FM , Mantalas
GL , Sim S , Clarke MF et al . Quantitative assessment of single-cell RNA-sequencing methods.
Nat Methods 2014, 11(1):41–46.
Svensson V , Natarajan KN , Ly LH , Miragaia RJ , Labalette C , Macaulay IC , Cvejic A ,
Teichmann SA . Power analysis of single-cell RNA-sequencing experiments. Nat Methods
2017, 14(4):381–387.
10x Genomics . Technical Note–Removal of Dead Cells from Single Cell Suspensions Improves
Performance for 10× Genomics® Single Cell Applications. 2017.
van den Brink SC , Sage F , Vertesy A , Spanjaard B , Peterson-Maduro J , Baron CS , Robin C ,
van Oudenaarden A. Single-cell sequencing reveals dissociation-induced gene expression in
tissue subpopulations. Nat Methods 2017, 14(10):935–936.
Wohnhaas CT , Leparc GG , Fernandez-Albert F , Kind D , Gantner F , Viollet C , Hildebrandt T ,
Baum P . DMSO cryopreservation is the method of choice to preserve cells for droplet-based
single-cell RNA sequencing. Sci Rep 2019, 9(1):10699.
Cha J , Lee I . Single-cell network biology for resolving cellular heterogeneity in human
diseases. Exp Mol Med 2020, 52(11):1798–1808.
Korrapati S , Taukulis I , Olszewski R , Pyle M , Gu S , Singh R , Griffiths C , Martin D , Boger E ,
Morell RJ et al . Single cell and single nucleus RNA-seq reveal cellular heterogeneity and
homeostatic regulatory networks in adult mouse stria vascularis. Front Mol Neurosci 2019,
12:316.
Gao R , Kim C , Sei E , Foukakis T , Crosetto N , Chan LK , Srinivasan M , Zhang H , Meric-
Bernstam F , Navin N . Nanogrid single-nucleus RNA sequencing reveals phenotypic diversity
in breast cancer. Nat Commun 2017, 8(1):228.
Liang Q , Dharmat R , Owen L , Shakoor A , Li Y , Kim S , Vitale A , Kim I , Morgan D , Liang S
et al . Single-nuclei RNA-seq on human retinal tissue provides improved transcriptome profiling.
Nat Commun 2019, 10(1):5743.
Zhu YY , Machleder EM , Chenchik A , Li R , Siebert PD . Reverse transcriptase template
switching: a SMART approach for full-length cDNA library construction. BioTechniques 2001,
30(4):892–897.
Bray NL , Pimentel H , Melsted P , Pachter L . Near-optimal probabilistic RNA-seq
quantification. Nat Biotechnol 2016, 34(5):525–527.
Du Y , Huang Q , Arisdakessian C , Garmire LX . Evaluation of STAR and Kallisto on single cell
RNA-seq data alignment. G3 2020, 10(5):1775–1783.
Melsted P , Ntranos V , Pachter L . The barcode, UMI, set format and BUStools. Bioinformatics
2019, 35(21):4472–4473.
Srivastava A , Malik L , Smith T , Sudbery I , Patro R. Alevin efficiently estimates accurate gene
abundances from dscRNA-seq data. Genome Biol 2019, 20(1):65.
McGinnis CS , Murrow LM , Gartner ZJ . DoubletFinder: doublet detection in single-cell RNA
sequencing data using artificial nearest neighbors. Cell Syst 2019, 8(4):329–337 e324.
Wolock SL , Lopez R , Klein AM . Scrublet: Computational identification of cell doublets in
single-cell transcriptomic data. Cell Syst 2019, 8(4):281–291 e289.
Gayoso A , Shor J , Carr AJ , Sharma R , Pe'er D. DoubletDetection. Zenodo; 2020:
https://zenodo.org/record/2678042.
Bais AS , Kostka D . scds: computational annotation of doublets in single-cell RNA sequencing
data. Bioinformatics 2020, 36(4):1150–1158.
Bernstein NJ , Fong NL , Lam I , Roy MA , Hendrickson DG , Kelley DR . Solo: doublet
identification in single-cell RNA-seq via semi-supervised deep learning. Cell Syst 2020,
11(1):95–101 e105.
Xi NM , Li JJ . Benchmarking computational doublet-detection methods for single-cell RNA
sequencing data. Cell Syst 2021, 12(2):176–194.
Stoeckius M , Zheng S , Houck-Loomis B , Hao S , Yeung BZ , Mauck WM 3rd , Smibert P ,
Satija R. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for
single cell genomics. Genome Biol 2018, 19(1):224.
Kang HM , Subramaniam M , Targ S , Nguyen M , Maliskova L , McCarthy E , Wan E , Wong S ,
Byrnes L , Lanata CM et al . Multiplexed droplet single-cell RNA-sequencing using natural
genetic variation. Nat Biotechnol 2018, 36(1):89–94.
Xu J , Falconer C , Nguyen Q , Crawford J , McKinnon BD , Mortlock S , Senabouth A ,
Andersen S , Chiu HS , Jiang LD et al . Genotype-free demultiplexing of pooled single-cell RNA-
seq. Genome Biol 2019, 20(1):290.
Heaton H , Talman AM , Knights A , Imaz M , Gaffney DJ , Durbin R , Hemberg M , Lawniczak
MKN . Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference
genotypes. Nat Methods 2020, 17(6):615–620.
Huang YH , McCarthy DJ , Stegle O . Vireo: Bayesian demultiplexing of pooled single-cell RNA-
seq data without genotype reference. Genome Biol 2019, 20(1):273.
Lun ATL , Riesenfeld S , Andrews T , Dao TP , Gomes T , participants in the 1st Human Cell
Atlas Jamboree, Marioni JC. EmptyDrops: distinguishing cells from empty droplets in droplet-based
single-cell RNA sequencing data. Genome Biol 2019, 20(1):63.
Heiser CN , Wang VM , Chen B , Hughey JJ , Lau KS . Automated quality control and cell
identification of droplet-based single-cell data using dropkick. Genome Res 2021,
31(10):1742–1752.
Ni Z , Chen S , Brown J , Kendziorski C . CB2 improves power of cell detection in droplet-based
single-cell RNA sequencing data. Genome Biol 2020, 21(1):137.
Yang S , Corbett SE , Koga Y , Wang Z , Johnson WE , Yajima M , Campbell JD .
Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 2020,
21(1):57.
Young MD , Behjati S. SoupX removes ambient RNA contamination from droplet-based
single-cell RNA sequencing data. GigaScience 2020, 9(12):giaa151.
Osorio D , Cai JJ . Systematic determination of the mitochondrial proportion in human and mice
tissues for single-cell RNA sequencing data quality control. Bioinformatics 2021, 37(7):963–967.
Bacher R , Chu LF , Leng N , Gasch AP , Thomson JA , Stewart RM , Newton M , Kendziorski
C . SCnorm: robust normalization of single-cell RNA-seq data. Nat Methods 2017,
14(6):584–586.
Yip SH , Wang P , Kocher JA , Sham PC , Wang J . Linnorm: improved statistical analysis for
single cell RNA-seq expression data. Nucleic Acids Res 2017, 45(22):e179.
Vallejos CA , Marioni JC , Richardson S. BASiCS: Bayesian Analysis of Single-Cell Sequencing
Data. PLoS Comput Biol 2015, 11(6):e1004333.
Qiu X , Hill A , Packer J , Lin D , Ma YA , Trapnell C . Single-cell mRNA quantification and
differential analysis with Census. Nat Methods 2017, 14(3):309–315.
Risso D , Perraudeau F , Gribkova S , Dudoit S , Vert JP . A general and flexible method for
signal extraction from single-cell RNA-seq data. Nat Commun 2018, 9(1):284.
Hafemeister C , Satija R . Normalization and variance stabilization of single-cell RNA-seq data
using regularized negative binomial regression. Genome Biol 2019, 20(1):296.
Butler A , Hoffman P , Smibert P , Papalexi E , Satija R . Integrating single-cell transcriptomic
data across different conditions, technologies, and species. Nat Biotechnol 2018,
36(5):411–420.
Wolf FA , Angerer P , Theis FJ . SCANPY: large-scale single-cell gene expression data
analysis. Genome Biol 2018, 19(1):15.
Lun AT , Bach K , Marioni JC . Pooling across cells to normalize single-cell RNA sequencing
data with many zero counts. Genome Biol 2016, 17:75.
Lopez R , Regier J , Cole MB , Jordan MI , Yosef N . Deep generative modeling for single-cell
transcriptomics. Nat Methods 2018, 15(12):1053–1058.
Lytal N , Ran D , An L. Normalization methods on single-cell RNA-seq data: an empirical
survey. Front Genet 2020, 11:41.
Cole MB , Risso D , Wagner A , DeTomaso D , Ngai J , Purdom E , Dudoit S , Yosef N .
Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.
Cell Syst 2019, 8(4):315–328 e318.
Buttner M , Miao Z , Wolf FA , Teichmann SA , Theis FJ . A test metric for assessing single-cell
RNA-seq batch correction. Nat Methods 2019, 16(1):43–49.
Johnson WE , Li C , Rabinovic A . Adjusting batch effects in microarray expression data using
empirical Bayes methods. Biostatistics 2007, 8(1):118–127.
Haghverdi L , Lun ATL , Morgan MD , Marioni JC . Batch effects in single-cell RNA-sequencing
data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018, 36(5):421–427.
Stuart T , Satija R . Integrative single-cell analysis. Nat Rev Genet 2019, 20(5):257–272.
Welch JD , Kozareva V , Ferreira A , Vanderburg C , Martin C , Macosko EZ . Single-cell multi-
omic integration compares and contrasts features of brain cell identity. Cell 2019,
177(7):1873–1887 e1817.
Hie B , Bryson B , Berger B . Efficient integration of heterogeneous single-cell transcriptomes
using Scanorama. Nat Biotechnol 2019, 37(6):685–691.
Polanski K , Young MD , Miao Z , Meyer KB , Teichmann SA , Park JE . BBKNN: fast batch
alignment of single cell transcriptomes. Bioinformatics 2020, 36(3):964–965.
Korsunsky I , Millard N , Fan J , Slowikowski K , Zhang F , Wei K , Baglaenko Y , Brenner M ,
Loh PR , Raychaudhuri S . Fast, sensitive and accurate integration of single-cell data with
Harmony. Nat Methods 2019, 16(12):1289–1296.
Barkas N , Petukhov V , Nikolaeva D , Lozinsky Y , Demharter S , Khodosevich K , Kharchenko
PV . Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat Methods
2019, 16(8):695–698.
Tran HTN , Ang KS , Chevrier M , Zhang X , Lee NYS , Goh M , Chen J . A benchmark of
batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 2020,
21(1):12.
Rousseeuw PJ . Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J Comput Appl Math 1987, 20:53–65.
van Dijk D , Sharma R , Nainys J , Yim K , Kathail P , Carr AJ , Burdziak C , Moon KR , Chaffer
CL , Pattabiraman D et al . Recovering gene interactions from single-cell data using data
diffusion. Cell 2018, 174(3):716–729 e727.
Wagner F , Yan Y , Yanai I . K-nearest neighbor smoothing for high-throughput single-cell RNA-
Seq data. bioRxiv 2018, doi: https://doi.org/10.1101/217737
Huang M , Wang J , Torre E , Dueck H , Shaffer S , Bonasio R , Murray JI , Raj A , Li M , Zhang
NR . SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018,
15(7):539–542.
Wang J , Agarwal D , Huang M , Hu G , Zhou Z , Ye C , Zhang NR . Data denoising with
transfer learning in single-cell transcriptomics. Nat Methods 2019, 16(9):875–878.
Linderman GC , Zhao J , Roulis M , Bielecki P , Flavell RA , Nadler B , Kluger Y. Zero-
preserving imputation of single-cell RNA-seq data. Nat Commun 2022, 13(1):192.
Lin PJ , Troup M , Ho JWK . CIDR: Ultrafast and accurate clustering through imputation for
single-cell RNA-seq data. Genome Biol 2017, 18(1):59.
Eraslan G , Simon LM , Mircea M , Mueller NS , Theis FJ . Single-cell RNA-seq denoising using
a deep count autoencoder. Nat Commun 2019, 10(1):390.
Li WV , Li JJ . An accurate and robust imputation method scImpute for single-cell RNA-seq
data. Nat Commun 2018, 9(1):997.
Mongia A , Sengupta D , Majumdar A . McImpute: matrix completion based imputation for single
cell RNA-seq data. Front Genet 2019, 10:9.
Gong W , Kwak IY , Pota P , Koyano-Nakagawa N , Garry DJ . DrImpute: imputing dropout
events in single cell RNA sequencing data. BMC Bioinformatics 2018, 19(1):220.
Lahnemann D , Koster J , Szczurek E , McCarthy DJ , Hicks SC , Robinson MD , Vallejos CA ,
Campbell KR , Beerenwinkel N , Mahfouz A et al . Eleven grand challenges in single-cell data
science. Genome Biol 2020, 21(1):31.
Andrews TS , Hemberg M . False signals induced by single-cell imputation. F1000Research
2018, 7:1740.
Hou W , Ji Z , Ji H , Hicks SC . A systematic evaluation of single-cell RNA-sequencing
imputation methods. Genome Biol 2020, 21(1):218.
Li Y , Willer C , Sanna S , Abecasis G . Genotype imputation. Annu Rev Genomics Hum Genet
2009, 10:387–406.
Brennecke P , Anders S , Kim JK , Kolodziejczyk AA , Zhang X , Proserpio V , Baying B , Benes
V , Teichmann SA , Marioni JC et al . Accounting for technical noise in single-cell RNA-seq
experiments. Nat Methods 2013, 10(11):1093–1095.
Buettner F , Natarajan KN , Casale FP , Proserpio V , Scialdone A , Theis FJ , Teichmann SA ,
Marioni JC , Stegle O . Computational analysis of cell-to-cell heterogeneity in single-cell RNA-
sequencing data reveals hidden subpopulations of cells. Nat Biotechnol 2015, 33(2):155–160.
Yip SH , Sham PC , Wang J . Evaluation of tools for highly variable gene discovery from single-
cell RNA-seq data. Brief Bioinform 2019, 20(4):1583–1589.
Duo A , Robinson MD , Soneson C. A systematic performance evaluation of clustering methods
for single-cell RNA-seq data. F1000Research 2018, 7:1141.
Townes FW , Hicks SC , Aryee MJ , Irizarry RA . Feature selection and dimension reduction for
single-cell RNA-Seq based on a multinomial model. Genome Biol 2019, 20(1):295.
Andrews TS , Hemberg M . M3Drop: dropout-based feature selection for scRNASeq.
Bioinformatics 2019, 35(16):2865–2867.
Trapnell C , Cacchiarelli D , Grimsby J , Pokharel P , Li S , Morse M , Lennon NJ , Livak KJ ,
Mikkelsen TS , Rinn JL . The dynamics and regulators of cell fate decisions are revealed by
pseudotemporal ordering of single cells. Nat Biotechnol 2014, 32(4):381–386.
Shao C , Hofer T . Robust classification of single-cell transcriptome data by nonnegative matrix
factorization. Bioinformatics 2017, 33(2):235–242.
Pierson E , Yau C . ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression
analysis. Genome Biol 2015, 16:241.
Buettner F , Pratanwanich N , McCarthy DJ , Marioni JC , Stegle O . f-scLVM: scalable and
versatile factor analysis for single-cell RNA-seq. Genome Biol 2017, 18(1):212.
Mahfouz A , van de Giessen M , van der Maaten L , Huisman S , Reinders M , Hawrylycz MJ ,
Lelieveldt BP . Visualizing the spatial gene expression organization in the brain through non-
linear similarity embeddings. Methods 2015, 73:79–89.
Senabouth A , Lukowski SW , Hernandez JA , Andersen SB , Mei X , Nguyen QH , Powell JE .
ascend: R package for analysis of single-cell RNA-seq data. GigaScience 2019, 8(8):giz087.
Sun S , Zhu J , Ma Y , Zhou X . Accuracy, robustness and scalability of dimensionality reduction
methods for single-cell RNA-seq analysis. Genome Biol 2019, 20(1):269.
Welch JD , Hartemink AJ , Prins JF . SLICER: inferring branched, nonlinear cellular trajectories
from single cell RNA-seq data. Genome Biol 2016, 17(1):106.
Haghverdi L , Buettner F , Theis FJ . Diffusion maps for high-dimensional single-cell analysis of
differentiation data. Bioinformatics 2015, 31(18):2989–2998.
Sun X , Liu Y , An L . Ensemble dimensionality reduction and feature gene extraction for single-
cell RNA-seq data. Nat Commun 2020, 11(1):5853.
Becht E , McInnes L , Healy J , Dutertre CA , Kwok IWH , Ng LG , Ginhoux F , Newell EW .
Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 2018,
37(1):38–44.
Cao J , Spielmann M , Qiu X , Huang X , Ibrahim DM , Hill AJ , Zhang F , Mundlos S ,
Christiansen L , Steemers FJ et al . The single-cell transcriptional landscape of mammalian
organogenesis. Nature 2019, 566(7745):496–502.
Kobak D , Berens P . The art of using t-SNE for single-cell transcriptomics. Nat Commun 2019,
10(1):5416.
Hu Q , Greene CS . Parameter tuning is a key part of dimensionality reduction via deep
variational autoencoders for single cell RNA transcriptomics. Pac Symp Biocomput 2019,
24:362–373.
Ding J , Condon A , Shah SP . Interpretable dimensionality reduction of single cell transcriptome
data with deep generative models. Nat Commun 2018, 9(1):2002.
Deng Y , Bao F , Dai QH , Wu LF , Altschuler SJ . Scalable analysis of cell-type composition
from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019, 16(4):311.
Wang B , Zhu J , Pierson E , Ramazzotti D , Batzoglou S . Visualization and analysis of single-
cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017, 14(4):414–416.
Xiang R , Wang W , Yang L , Wang S , Xu C , Chen X . A comparison for dimensionality
reduction methods of single-cell RNA-seq data. Front Genet 2021, 12:646936.
Wolf FA , Hamey FK , Plass M , Solana J , Dahlin JS , Gottgens B , Rajewsky N , Simon L ,
Theis FJ . PAGA: graph abstraction reconciles clustering with trajectory inference through a
topology preserving map of single cells. Genome Biol 2019, 20(1):59.
Moon KR , van Dijk D , Wang Z , Gigante S , Burkhardt DB , Chen WS , Yim K , Elzen AVD ,
Hirn MJ , Coifman RR et al . Visualizing structure and transitions in high-dimensional biological
data. Nat Biotechnol 2019, 37(12):1482–1492.
Anchang B , Hart TD , Bendall SC , Qiu P , Bjornson Z , Linderman M , Nolan GP , Plevritis SK .
Visualization and cellular hierarchy inference of single-cell data using SPADE. Nat Protoc 2016,
11(7):1264–1279.
Weinreb C , Wolock S , Klein AM . SPRING: a kinetic interface for visualizing high dimensional
single-cell expression data. Bioinformatics 2018, 34(7):1246–1248.
Kim T , Chen IR , Lin Y , Wang AY , Yang JYH , Yang P . Impact of similarity metrics on single-
cell RNA-seq data clustering. Brief Bioinform 2019, 20(6):2316–2326.
Moussa M , Mandoiu II . Single cell RNA-seq data clustering using TF-IDF based methods.
BMC Genomics 2018, 19(Suppl 6):569.
Blondel VD , Guillaume JL , Lambiotte R , Lefebvre E . Fast unfolding of communities in large
networks. J Stat Mech-Theory E 2008, doi:10.1088/1742-5468/2008/10/P10008
Traag VA , Waltman L , van Eck NJ . From Louvain to Leiden: guaranteeing well-connected
communities. Sci Rep 2019, 9(1):5233.
Kiselev VY , Kirschner K , Schaub MT , Andrews T , Yiu A , Chandra T , Natarajan KN , Reik W ,
Barahona M , Green AR et al . SC3: consensus clustering of single-cell RNA-seq data. Nat
Methods 2017, 14(5):483–486.
Zurauskiene J , Yau C . pcaReduce: hierarchical clustering of single cell transcriptional profiles.
BMC Bioinformatics 2016, 17:140.
Grun D , Lyubimova A , Kester L , Wiebrands K , Basak O , Sasaki N , Clevers H , van
Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types.
Nature 2015, 525(7568):251–255.
Xu C , Su Z . Identification of cell types from single-cell transcriptomes using a novel clustering
method. Bioinformatics 2015, 31(12):1974–1980.
Lake BB , Chen S , Sos BC , Fan J , Kaeser GE , Yung YC , Duong TE , Gao D , Chun J ,
Kharchenko PV et al . Integrative single-cell analysis of transcriptional and epigenetic states in
the human adult brain. Nat Biotechnol 2018, 36(1):70–80.
Guo M , Wang H , Potter SS , Whitsett JA , Xu Y. SINCERA: a pipeline for single-cell RNA-seq
profiling analysis. PLoS Comput Biol 2015, 11(11):e1004575.
Freytag S , Tian L , Lonnstedt I , Ng M , Bahlo M . Comparison of clustering tools in R for
medium-sized 10× Genomics single-cell RNA-sequencing data. F1000Research 2018, 7:1297.
Yu D , Huber W , Vitek O. Shrinkage estimation of dispersion in negative binomial models for
RNA-seq experiments with small sample size. Bioinformatics 2013, 29(10):1275–1282.
Pliner HA , Shendure J , Trapnell C . Supervised classification enables rapid annotation of cell
atlases. Nat Methods 2019, 16(10):983–986.
Domanskyi S , Szedlak A , Hawkins NT , Wang J , Paternostro G , Piermarocchi C. Polled
Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell
RNA-sequencing clusters. BMC Bioinformatics 2019, 20(1):369.
Zhang Z , Luo D , Zhong X , Choi JH , Ma Y , Wang S , Mahrt E , Guo W , Stawiski EW ,
Modrusan Z et al . SCINA: A semi-supervised subtyping algorithm of single cells and bulk
samples. Genes 2019, 10(7):531.
Zhang AW , O'Flanagan C , Chavez EA , Lim JLP , Ceglia N , McPherson A , Wiens M , Walters
P , Chan T , Hewitson B et al . Probabilistic cell-type assignment of single-cell RNA-seq for
tumor microenvironment profiling. Nat Methods 2019, 16(10):1007–1015.
Xu C , Lopez R , Mehlman E , Regier J , Jordan MI , Yosef N. Probabilistic harmonization and
annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol 2021,
17(1):e9620.
Franzen O , Gan LM , Bjorkegren JLM . PanglaoDB: a web server for exploration of mouse and
human single-cell RNA sequencing data. Database (Oxford) 2019, 2019.
Zhang X , Lan Y , Xu J , Quan F , Zhao E , Deng C , Luo T , Xu L , Liao G , Yan M et al .
CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids
Res 2019, 47(D1):D721–D728.
Zeisel A , Hochgerner H , Lonnerberg P , Johnsson A , Memic F , van der Zwan J , Haring M ,
Braun E , Borm LE , La Manno G et al . Molecular architecture of the mouse nervous system.
Cell 2018, 174(4):999–1014 e1022.
Ecker JR , Geschwind DH , Kriegstein AR , Ngai J , Osten P , Polioudakis D , Regev A , Sestan
N , Wickersham IR , Zeng H . The BRAIN Initiative Cell Census Consortium: Lessons Learned
toward Generating a Comprehensive Brain Cell Atlas. Neuron 2017, 96(3):542–557.
Saunders A , Macosko EZ , Wysoker A , Goldman M , Krienen FM , de Rivera H , Bien E ,
Baum M , Bortolin L , Wang S et al . Molecular diversity and specializations among the cells of
the adult mouse brain. Cell 2018, 174(4):1015–1030 e1016.
Regev A , Teichmann SA , Lander ES , Amit I , Benoist C , Birney E , Bodenmiller B , Campbell
P , Carninci P , Clatworthy M et al . The Human Cell Atlas. Elife 2017, 6:e27041.
Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse
organs creates a Tabula Muris. Nature 2018, 562(7727):367–372.
Kiselev VY , Yiu A , Hemberg M . scmap: projection of single-cell RNA-seq data across data
sets. Nat Methods 2018, 15(5):359–362.
Tan Y , Cahan P . SingleCellNet: a computational tool to classify single cell RNA-seq data
across platforms and across species. Cell Syst 2019, 9(2):207–213 e202.
Ma F , Pellegrini M. ACTINN: automated identification of cell types in single cell RNA
sequencing. Bioinformatics 2020, 36(2):533–538.
Alquicira-Hernandez J , Sathe A , Ji HP , Nguyen Q , Powell JE . scPred: accurate supervised
method for cell-type classification from single-cell RNA-seq data. Genome Biol 2019, 20(1):264.
Pedregosa F , Varoquaux G , Gramfort A , Michel V , Thirion B , Grisel O , Blondel M ,
Prettenhofer P , Weiss R , Dubourg V . Scikit-learn: Machine learning in Python. J Mach Learn
Res 2011, 12:2825–2830.
Bakken T , Cowell L , Aevermann BD , Novotny M , Hodge R , Miller JA , Lee A , Chang I ,
McCorrison J , Pulendran B et al . Cell type discovery and representation in the era of high-
content single cell phenotyping. BMC Bioinformatics 2017, 18(Suppl 17):559.
Wang S , Pisco AO , McGeever A , Brbic M , Zitnik M , Darmanis S , Leskovec J , Karkanias J ,
Altman RB . Leveraging the cell ontology to classify unseen cell types. Nat Commun 2021,
12(1):5556.
Bernstein MN , Ma Z , Gleicher M , Dewey CN . CellO: comprehensive and hierarchical cell type
classification of human cells with the Cell Ontology. iScience 2021, 24(1):101913.
Abdelaal T , Michielsen L , Cats D , Hoogduin D , Mei H , Reinders MJT , Mahfouz A. A
comparison of automatic cell identification methods for single-cell RNA sequencing data.
Genome Biol 2019, 20(1):194.
Huang Q , Liu Y , Du Y , Garmire LX . Evaluation of cell type Annotation R Packages on Single-
cell RNA-seq Data. Genomics Proteomics Bioinformatics 2021, 19(2):267–281.
Aran D , Looney AP , Liu L , Wu E , Fong V , Hsu A , Chak S , Naikawadi RP , Wolters PJ ,
Abate AR et al . Reference-based analysis of lung single-cell sequencing reveals a transitional
profibrotic macrophage. Nat Immunol 2019, 20(2):163–172.
de Kanter JK , Lijnzaad P , Candelli T , Margaritis T , Holstege FCP . CHETAH: a selective,
hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res
2019, 47(16):e95.
Cao ZJ , Wei L , Lu S , Yang DC , Gao G . Searching large-scale scRNA-seq databases via
unbiased cell embedding with Cell BLAST. Nat Commun 2020, 11(1):3458.
Hu J , Li X , Hu G , Lyu Y , Susztak K , Li M . Iterative transfer learning with neural network for
clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2020,
2(10):607–618.
Lieberman Y , Rokach L , Shay T . CaSTLe – Classification of single cells by transfer learning:
Harnessing the power of publicly available single cell RNA sequencing experiments to annotate
new experiments. PLoS One 2018, 13(10):e0205499.
Haber AL , Biton M , Rogel N , Herbst RH , Shekhar K , Smillie C , Burgin G , Delorey TM ,
Howitt MR , Katz Y et al . A single-cell survey of the small intestinal epithelium. Nature 2017,
551(7680):333–339.
Hashimoto K , Kouno T , Ikawa T , Hayatsu N , Miyajima Y , Yabukami H , Terooatea T , Sasaki
T , Suzuki T , Valentine M et al . Single-cell transcriptomics reveals expansion of cytotoxic CD4
T cells in supercentenarians. Proc Natl Acad Sci U S A 2019, 116(48):24242–24251.
Cao Y , Lin Y , Ormerod JT , Yang P , Yang JYH , Lo KK . scDC: single cell differential
composition analysis. BMC Bioinformatics 2019, 20(Suppl 19):721.
Jackson DA . Compositional data in community ecology: the paradigm or peril of proportions?
Ecology 1997, 78(3):929–940.
Büttner M , Ostner J , Müller C , Theis F , Schubert B. scCODA is a Bayesian model for
compositional single-cell data analysis. Nat Commun 2021, 12(1):6876.
Kharchenko PV , Silberstein L , Scadden DT . Bayesian approach to single-cell differential
expression analysis. Nat Methods 2014, 11(7):740–742.
Finak G , McDavid A , Yajima M , Deng J , Gersuk V , Shalek AK , Slichter CK , Miller HW ,
McElrath MJ , Prlic M et al . MAST: a flexible statistical framework for assessing transcriptional
changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol
2015, 16:278.
Delmans M , Hemberg M . Discrete distributional differential expression (D3E) – a tool for gene
expression analysis of single-cell RNA-seq data. BMC Bioinformatics 2016, 17:110.
Korthauer KD , Chu LF , Newton MA , Li Y , Thomson J , Stewart R , Kendziorski C . A
statistical approach for identifying differential distributions in single-cell RNA-seq experiments.
Genome Biol 2016, 17(1):222.
Vu TN , Wills QF , Kalari KR , Niu N , Wang L , Rantalainen M , Pawitan Y . Beta-Poisson
model for single-cell RNA-seq data analyses. Bioinformatics 2016, 32(14):2128–2135.
Chen W , Li Y , Easton J , Finkelstein D , Wu G , Chen X. UMI-count modeling and differential
expression analysis for single-cell RNA sequencing. Genome Biol 2018, 19(1):70.
Miao Z , Deng K , Wang X , Zhang X . DEsingle for detecting three types of differential
expression in single-cell RNA-seq data. Bioinformatics 2018, 34(18):3223–3224.
Ye C , Speed TP , Salim A . DECENT: differential expression with capture efficiency
adjustmeNT for single-cell RNA-seq data. Bioinformatics 2019, 35(24):5155–5162.
Das S , Rai SN . SwarnSeq: An improved statistical approach for differential expression
analysis of single-cell RNA-seq data. Genomics 2021, 113(3):1308–1324.
Van den Berge K , Roux de Bezieux H , Street K , Saelens W , Cannoodt R , Saeys Y , Dudoit
S , Clement L . Trajectory-based differential expression analysis for single-cell sequencing data.
Nat Commun 2020, 11(1):1201.
Wang T , Li B , Nelson CE , Nabavi S. Comparative analysis of differential gene expression
analysis tools for single-cell RNA sequencing data. BMC Bioinformatics 2019, 20(1):40.
Soneson C , Robinson MD . Bias, robustness and scalability in single-cell differential expression
analysis. Nat Methods 2018, 15(4):255–261.
Dal Molin A , Baruzzo G , Di Camillo B. Single-cell RNA-sequencing: assessment of differential
expression analysis methods. Front Genet 2017, 8:62.
Jaakkola MK , Seyednasrollah F , Mehmood A , Elo LL . Comparison of methods to detect
differentially expressed genes between single-cell populations. Brief Bioinform 2017,
18(5):735–743.
Van den Berge K , Perraudeau F , Soneson C , Love MI , Risso D , Vert JP , Robinson MD ,
Dudoit S , Clement L . Observation weights unlock bulk RNA-seq tools for zero inflation and
single-cell applications. Genome Biol 2018, 19(1):24.
Love MI , Huber W , Anders S . Moderated estimation of fold change and dispersion for RNA-
seq data with DESeq2. Genome Biol 2014, 15(12):550.
Robinson MD , McCarthy DJ , Smyth GK . edgeR: a Bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics 2010, 26(1):139–140.
Ritchie ME , Phipson B , Wu D , Hu Y , Law CW , Shi W , Smyth GK . limma powers differential
expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015,
43(7):e47.
Seyednasrollah F , Rantanen K , Jaakkola P , Elo LL . ROTS: reproducible RNA-seq biomarker
detector-prognostic markers for clear cell renal cell cancer. Nucleic Acids Res 2016, 44(1):e1.
Vieth B , Parekh S , Ziegenhain C , Enard W , Hellmann I. A systematic evaluation of single cell
RNA-seq analysis pipelines. Nat Commun 2019, 10(1):4667.
Saelens W , Cannoodt R , Todorov H , Saeys Y . A comparison of single-cell trajectory
inference methods. Nat Biotechnol 2019, 37(5):547–554.
Street K , Risso D , Fletcher RB , Das D , Ngai J , Yosef N , Purdom E , Dudoit S. Slingshot: cell
lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 2018,
19(1):477.
Ji Z , Ji H . TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.
Nucleic Acids Res 2016, 44(13):e117.
Shin J , Berg DA , Zhu Y , Shin JY , Song J , Bonaguidi MA , Enikolopov G , Nauen DW ,
Christian KM , Ming GL et al . Single-cell RNA-seq with waterfall reveals molecular cascades
underlying adult neurogenesis. Cell Stem Cell 2015, 17(3):360–372.
Haghverdi L , Buttner M , Wolf FA , Buettner F , Theis FJ . Diffusion pseudotime robustly
reconstructs lineage branching. Nat Methods 2016, 13(10):845–848.
Rashid S , Kotton DN , Bar-Joseph Z . TASIC: determining branching models from time series
single cell data. Bioinformatics 2017, 33(16):2504–2512.
Schiebinger G , Shu J , Tabaka M , Cleary B , Subramanian V , Solomon A , Gould J , Liu S ,
Lin S , Berube P et al . Optimal-Transport Analysis of Single-Cell Gene Expression identifies
developmental trajectories in reprogramming. Cell 2019, 176(4):928–943 e922.
Lin C , Bar-Joseph Z . Continuous-state HMMs for modeling time-series single-cell RNA-Seq
data. Bioinformatics 2019, 35(22):4707–4715.
Tran TN , Bader GD . Tempora: cell trajectory inference using time-series single-cell RNA
sequencing data. PLoS Comput Biol 2020, 16(9):e1008205.
Lonnberg T , Svensson V , James KR , Fernandez-Ruiz D , Sebina I , Montandon R , Soon MS ,
Fogg LG , Nair AS , Liligeto U et al . Single-cell RNA-seq and computational analysis using
temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria. Sci Immunol 2017,
2(9):eaal2192.
LineagePulse (https://github.com/YosefLab/LineagePulse)
La Manno G , Soldatov R , Zeisel A , Braun E , Hochgerner H , Petukhov V , Lidschreiber K ,
Kastriti ME , Lonnerberg P , Furlan A et al . RNA velocity of single cells. Nature 2018,
560(7719):494–498.
Bergen V , Lange M , Peidli S , Wolf FA , Theis FJ . Generalizing RNA velocity to transient cell
states through dynamical modeling. Nat Biotechnol 2020, 38(12):1408–1414.
Herman JS , Sagar, Grun D. FateID infers cell fate bias in multipotent progenitors from single-
cell RNA-seq data. Nat Methods 2018, 15(5):379–386.
Bendall SC , Davis KL , Amir el AD , Tadmor MD , Simonds EF , Chen TJ , Shenfeld DK , Nolan
GP , Pe'er D . Single-cell trajectory detection uncovers progression and regulatory coordination
in human B cell development. Cell 2014, 157(3):714–725.
Setty M , Tadmor MD , Reich-Zeliger S , Angel O , Salame TM , Kathail P , Choi K , Bendall S ,
Friedman N , Pe'er D. Wishbone identifies bifurcating developmental trajectories from single-cell
data. Nat Biotechnol 2016, 34(6):637–645.
Kim S , Scheffler K , Halpern AL , Bekritsky MA , Noh E , Kallberg M , Chen X , Kim Y , Beyter D ,
Krusche P et al . Strelka2: fast and accurate calling of germline and somatic variants. Nat
Methods 2018, 15(8):591–594.
Rodriguez-Meira A , Buck G , Clark SA , Povinelli BJ , Alcolea V , Louka E , McGowan S ,
Hamblin A , Sousos N , Barkas N et al . Unravelling Intratumoral Heterogeneity through High-
Sensitivity Single-Cell Mutational Analysis and Parallel RNA Sequencing. Mol Cell 2019,
73(6):1292–1305 e1298.
Pysam (https://github.com/pysam-developers/pysam)
Fasterius E , Uhlen M , Al-Khalili Szigyarto C. Single-cell RNA-seq variant analysis for
exploration of genetic heterogeneity in cancer. Sci Rep 2019, 9(1):9524.
Zafar H , Wang Y , Nakhleh L , Navin N , Chen K . Monovar: single-nucleotide variant detection
in single cells. Nat Methods 2016, 13(6):505–507.
Schnepp PM , Chen M , Keller ET , Zhou X . SNV identification from single-cell RNA
sequencing data. Hum Mol Genet 2019, 28(21):3569–3583.
Poirion O , Zhu X , Ching T , Garmire LX . Using single nucleotide variations in single-cell RNA-
seq to identify subpopulations and genotype-phenotype linkage. Nat Commun 2018, 9(1):4892.
Fangal VD . CTAT Mutations: A Machine Learning Based RNA-Seq Variant Calling Pipeline
Incorporating Variant Annotation, Prioritization, and Visualization. 2020.
Huang X , Huang Y . Cellsnp-lite: an efficient tool for genotyping single cells. Bioinformatics
2021, 37(23):4569–4571.
Liu F , Zhang Y , Zhang L , Li Z , Fang Q , Gao R , Zhang Z. Systematic comparative analysis of
single-nucleotide variant detection methods from single-cell RNA sequencing data. Genome
Biol 2019, 20(1):242.
Chung W , Eum HH , Lee HO , Lee KM , Lee HB , Kim KT , Ryu HS , Kim S , Lee JE , Park YH
et al . Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary
breast cancer. Nat Commun 2017, 8:15081.
Fan J , Lee HO , Lee S , Ryu DE , Lee S , Xue C , Kim SJ , Kim K , Barkas N , Park PJ et al .
Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell
RNA-seq data. Genome Res 2018, 28(8):1217–1227.
Serin Harmanci A , Harmanci AO , Zhou X. CaSpER identifies and visualizes CNV events by
integrative analysis of single-cell or bulk RNA-sequencing data. Nat Commun 2020, 11(1):89.
inferCNV of the Trinity CTAT Project (https://github.com/broadinstitute/inferCNV)
Muller S , Cho A , Liu SJ , Lim DA , Diaz A . CONICS integrates scRNA-seq with DNA
sequencing to map gene expression to tumor sub-clones. Bioinformatics 2018,
34(18):3217–3219.
van de Geijn B , McVicker G , Gilad Y , Pritchard JK . WASP: allele-specific software for robust
molecular quantitative trait locus discovery. Nat Methods 2015, 12(11):1061–1063.
Borel C , Ferreira PG , Santoni F , Delaneau O , Fort A , Popadin KY , Garieri M , Falconnet E ,
Ribaux P , Guipponi M et al . Biased allelic expression in human primary fibroblast single cells.
Am J Hum Genet 2015, 96(1):70–80.
Song Y , Botvinnik OB , Lovci MT , Kakaradov B , Liu P , Xu JL , Yeo GW . Single-cell
alternative splicing analysis with expedition reveals splicing dynamics during neuron
differentiation. Mol Cell 2017, 67(1):148–161 e145.
Huang Y , Sanguinetti G . BRIE: transcriptome-wide splicing quantification in single cells.
Genome Biol 2017, 18(1):123.
Huang Y , Sanguinetti G. BRIE2: computational identification of splicing phenotypes from
single-cell transcriptomic experiments. Genome Biol 2021, 22(1):251.
Matsumoto H , Hayashi T , Ozaki H , Tsuyuzaki K , Umeda M , Iida T , Nakamura M , Okano H ,
Nikaido I . An NMF-based approach to discover overlooked differentially expressed gene
regions from single-cell RNA-seq data. NAR Genom Bioinform 2019, 2(1):lqz020.
Ling JP , Wilks C , Charles R , Leavey PJ , Ghosh D , Jiang L , Santiago CP , Pang B ,
Venkataraman A , Clark BS et al . ASCOT identifies key regulators of neuronal subtype-specific
splicing. Nat Commun 2020, 11(1):137.
Ozaki H , Hayashi T , Umeda M , Nikaido I . Millefy: visualizing cell-to-cell heterogeneity in read
coverage of single-cell RNA sequencing datasets. BMC Genomics 2020, 21(1):177.
Wen WX , Mead AJ , Thongjuea S. VALERIE: Visual-based inspection of alternative splicing
events at single-cell resolution. PLoS Comput Biol 2020, 16(9):e1008195.
Hu Y , Wang K , Li M. Detecting differential alternative splicing events in scRNA-seq with or
without unique molecular identifiers. PLoS Comput Biol 2020, 16(6):e1007925.
Benegas G , Fischer J , Song YS . Robust and annotation-free analysis of alternative splicing
across diverse cell types in mice. Elife 2022, 11:e73520.
Liu S , Zhou B , Wu L , Sun Y , Chen J , Liu S . Single-cell differential splicing analysis reveals
high heterogeneity of liver tumor-infiltrating T cells. Sci Rep 2021, 11(1):5325.
Aibar S , Gonzalez-Blas CB , Moerman T , Huynh-Thu VA , Imrichova H , Hulselmans G ,
Rambow F , Marine JC , Geurts P , Aerts J et al . SCENIC: single-cell regulatory network
inference and clustering. Nat Methods 2017, 14(11):1083–1086.
Matsumoto H , Kiryu H , Furusawa C , Ko MSH , Ko SBH , Gouda N , Hayashi T , Nikaido I .
SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during
differentiation. Bioinformatics 2017, 33(15):2314–2321.
Matsumoto H , Kiryu H . SCOUP: a probabilistic model based on the Ornstein-Uhlenbeck
process to analyze single-cell expression data during differentiation. BMC Bioinformatics 2016,
17(1):232.
Chan TE , Stumpf MPH , Babtie AC . Gene Regulatory Network Inference from Single-Cell Data
Using Multivariate Information Measures. Cell Syst 2017, 5(3):251–267 e253.
Specht AT , Li J . LEAP: constructing gene co-expression networks for single-cell RNA-
sequencing data using pseudotime ordering. Bioinformatics 2017, 33(5):764–766.
Liu H , Li P , Zhu M , Wang X , Lu J , Yu T . Nonlinear Network Reconstruction from Gene
Expression Data Using Marginal Dependencies Measured by DCOL. PLoS One 2016,
11(7):e0158247.
Cordero P , Stuart JM . Tracing Co-Regulatory Network Dynamics in Noisy, Single-Cell
Transcriptome Trajectories. Pac Symp Biocomput 2017, 22:576–587.
Aubin-Frankowski PC , Vert JP . Gene regulation inference from single-cell RNA-seq data with
linear differential equations and velocity inference. Bioinformatics 2020, 36(18):4774–4780.
Huynh-Thu VA , Irrthum A , Wehenkel L , Geurts P . Inferring regulatory networks from
expression data using tree-based methods. PLoS One 2010, 5(9):e12776.
Moerman T , Aibar Santos S , Bravo Gonzalez-Blas C , Simm J , Moreau Y , Aerts J , Aerts S .
GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks.
Bioinformatics 2019, 35(12):2159–2161.
Woodhouse S , Piterman N , Wintersteiger CM , Gottgens B , Fisher J . SCNS: a graphical tool
for reconstructing executable regulatory networks from single-cell genomic data. BMC Syst Biol
2018, 12(1):59.
Lim CY , Wang H , Woodhouse S , Piterman N , Wernisch L , Fisher J , Gottgens B . BTR:
training asynchronous Boolean models using single-cell expression data. BMC Bioinformatics
2016, 17(1):355.
Pratapa A , Jalihal AP , Law JN , Bharadwaj A , Murali TM . Benchmarking algorithms for gene
regulatory network inference from single-cell transcriptomic data. Nat Methods 2020,
17(2):147–154.
Chen S , Mar JC . Evaluating methods of inferring gene regulatory networks highlights their lack
of performance for single cell gene expression data. BMC Bioinformatics 2018, 19(1):232.
Nguyen H , Tran D , Tran B , Pehlivan B , Nguyen T . A comprehensive survey of regulatory
network inference methods using single-cell RNA sequencing data. Brief Bioinform 2020,
22(3):bbaa190.
Kang Y , Thieffry D , Cantini L . Evaluating the Reproducibility of Single-Cell Gene Regulatory
Network Inference Algorithms. Front Genet 2021, 12:617282.
Dai H , Li L , Zeng T , Chen L . Cell-specific network constructed by single-cell RNA sequencing
data. Nucleic Acids Res 2019, 47(11):e62.
Li L , Dai H , Fang Z , Chen L. c-CSN: Single-cell RNA Sequencing Data Analysis by
Conditional Cell-specific Network. Genomics Proteomics Bioinformatics 2021, 19(2):319–329.
Langfelder P , Horvath S. WGCNA: an R package for weighted correlation network analysis.
BMC Bioinformatics 2008, 9:559.

Small RNA Sequencing


Huang V , Qin Y , Wang J , Wang X , Place RF , Lin G , Lue TF , Li LC . RNAa is conserved in
mammalian cells. PLoS One 2010, 5(1):e8848.
Androvic P , Benesova S , Rohlova E , Kubista M , Valihrach L . Small RNA-sequencing for
analysis of circulating miRNAs: benchmark study. J Mol Diagn 2022, 24(4):386–394.
Baran-Gale J , Kurtz CL , Erdos MR , Sison C , Young A , Fannin EE , Chines PS , Sethupathy
P . Addressing bias in small RNA library preparation for sequencing: a new protocol recovers
microRNAs that evade capture by current methods. Front Genet 2015, 6:352.
Benesova S , Kubista M , Valihrach L . Small RNA-sequencing: approaches and considerations
for miRNA analysis. Diagnostics (Basel) 2021, 11(6):964.
Metpally RP , Nasser S , Malenica I , Courtright A , Carlson E , Ghaffari L , Villa S , Tembe W ,
Van Keuren-Jensen K . Comparison of analysis tools for miRNA high throughput sequencing
using nerve crush as a model. Front Genet 2013, 4:20.
Kang W , Eldfjell Y , Fromm B , Estivill X , Biryukova I , Friedlander MR . miRTrace reveals the
organismal origins of microRNA sequencing data. Genome Biol 2018, 19(1):213.
Aparicio-Puerta E , Gomez-Martin C , Giannoukakos S , Medina JM , Marchal JA , Hackenberg
M . mirnaQC: a webserver for comparative quality control of miRNA-seq data. Nucleic Acids
Res 2020, 48(W1):W262–W267.
Friedlander MR , Mackowiak SD , Li N , Chen W , Rajewsky N . miRDeep2 accurately identifies
known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res
2012, 40(1):37–52.
Aparicio-Puerta E , Gomez-Martin C , Giannoukakos S , Medina JM , Scheepbouwer C ,
Garcia-Moreno A , Carmona-Saez P , Fromm B , Pegtel M , Keller A et al . sRNAbench and
sRNAtoolbox 2022 update: accurate miRNA and sncRNA profiling for model and non-model
organisms. Nucleic Acids Res 2022, 50(W1):W710–W717.
Johnson NR , Yeoh JM , Coruh C , Axtell MJ . Improved placement of multi-mapping small
RNAs. G3 2016, 6(7):2103–2111.
Wu X , Kim TK , Baxter D , Scherler K , Gordon A , Fong O , Etheridge A , Galas DJ , Wang K .
sRNAnalyzer-a flexible and customizable small RNA sequencing data analysis pipeline. Nucleic
Acids Res 2017, 45(21):12140–12151.
Patil AH , Halushka MK . miRge3.0: a comprehensive microRNA and tRF sequencing analysis
pipeline. NAR Genom Bioinform 2021, 3(3):lqab068.
Fehlmann T , Kern F , Laham O , Backes C , Solomon J , Hirsch P , Volz C , Muller R , Keller A .
miRMaster 2.0: multi-species non-coding RNA sequencing analyses at scale. Nucleic Acids
Res 2021, 49(W1):W397–W408.
Morin RD , O'Connor MD , Griffith M , Kuchenbauer F , Delaney A , Prabhu AL , Zhao Y ,
McDonald H , Zeng T , Hirst M et al . Application of massively parallel sequencing to microRNA
profiling and discovery in human embryonic stem cells. Genome Res 2008, 18(4):610–621.
Tomasello L , Distefano R , Nigita G , Croce CM . The MicroRNA Family Gets Wider: The
IsomiRs Classification and Role. Front Cell Dev Biol 2021, 9:668648.
Barturen G , Rueda A , Hamberg M , Alganza A , Lebron R , Kotsyfakis M , Shi B-J , Koppers-
Lalic D , Hackenberg M . sRNAbench: profiling of small RNAs and its sequence variants in
single or multi-species high-throughput experiments. Methods Next-Generation Seq. 2014,
1:21–31.
Garmire LX , Subramaniam S . Evaluation of normalization methods in mammalian microRNA-
Seq data. RNA 2012, 18(6):1279–1288.
Dillies MA , Rau A , Aubert J , Hennequet-Antier C , Jeanmougin M , Servant N , Keime C ,
Marot G , Castel D , Estelle J et al . A comprehensive evaluation of normalization methods for
Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 2013, 14(6):671–683.
Tam S , Tsao MS , McPherson JD . Optimization of miRNA-seq data preprocessing. Brief
Bioinform 2015, 16(6):950–963.
Agarwal V , Bell GW , Nam JW , Bartel DP . Predicting effective microRNA target sites in
mammalian mRNAs. Elife 2015, 4:e05005.
Enright AJ , John B , Gaul U , Tuschl T , Sander C , Marks DS . MicroRNA targets in
Drosophila. Genome Biol 2003, 5(1):R1.
Betel D , Koppal A , Agius P , Sander C , Leslie C . Comprehensive modeling of microRNA
targets predicts functional non-conserved and non-canonical sites. Genome Biol 2010,
11(8):R90.
Chen Y , Wang X . miRDB: an online database for prediction of functional microRNA targets.
Nucleic Acids Res 2020, 48(D1):D127–D131.
Sticht C , De La Torre C , Parveen A , Gretz N . miRWalk: An online resource for prediction of
microRNA binding sites. PLoS One 2018, 13(10):e0206239.
Krek A , Grun D , Poy MN , Wolf R , Rosenberg L , Epstein EJ , MacMenamin P , da Piedade I ,
Gunsalus KC , Stoffel M et al . Combinatorial microRNA target predictions. Nat Genet 2005,
37(5):495–500.
Kertesz M , Iovino N , Unnerstall U , Gaul U , Segal E . The role of site accessibility in
microRNA target recognition. Nat Genet 2007, 39(10):1278–1284.
Miranda KC , Huynh T , Tay Y , Ang YS , Tam WL , Thomson AM , Lim B , Rigoutsos I . A
pattern-based method for the identification of microRNA binding sites and their corresponding
heteroduplexes. Cell 2006, 126(6):1203–1217.
Rehmsmeier M , Steffen P , Hochsmann M , Giegerich R. Fast and effective prediction of
microRNA/target duplexes. RNA 2004, 10(10):1507–1517.
Wen M , Cong P , Zhang Z , Lu H , Li T . DeepMirTar: a deep-learning approach for predicting
human miRNA targets. Bioinformatics 2018, 34(22):3781–3787.
Paraskevopoulou MD , Georgakilas G , Kostoulas N , Vlachos IS , Vergoulis T , Reczko M ,
Filippidis C , Dalamagas T , Hatzigeorgiou AG . DIANA-microT web server v5.0: service
integration into miRNA functional analysis workflows. Nucleic Acids Res 2013, 41(Web Server
issue):W169–173.
Ritchie W , Flamant S , Rasko JE . Predicting microRNA targets and functions: traps for the
unwary. Nat Methods 2009, 6(6):397–398.
Vlachos IS , Zagganas K , Paraskevopoulou MD , Georgakilas G , Karagkouni D , Vergoulis T ,
Dalamagas T , Hatzigeorgiou AG . DIANA-miRPath v3.0: deciphering microRNA function with
experimental support. Nucleic Acids Res 2015, 43(W1):W460–466.

Genotyping and Variation Discovery by Whole Genome/Exome Sequencing


Acuna-Hidalgo R , Veltman JA , Hoischen A. New insights into the generation and role of de
novo mutations in health and disease. Genome Biol 2016, 17(1):241.
Miller MB , Reed HC , Walsh CA . Brain Somatic Mutation in Aging and Alzheimer’s Disease.
Annu Rev Genomics Hum Genet 2021, 22:239–256.
Martincorena I , Campbell PJ . Somatic mutation in cancer and normal cells. Science 2015,
349(6255):1483–1489.
McKenna A , Hanna M , Banks E , Sivachenko A , Cibulskis K , Kernytsky A , Garimella K ,
Altshuler D , Gabriel S , Daly M et al . The Genome Analysis Toolkit: a MapReduce framework
for analyzing next-generation DNA sequencing data. Genome Res 2010, 20(9):1297–1303.
Danecek P , Bonfield JK , Liddle J , Marshall J , Ohan V , Pollard MO , Whitwham A , Keane T ,
McCarthy SA , Davies RM et al . Twelve years of SAMtools and BCFtools. GigaScience 2021,
10(2).
Kim S , Scheffler K , Halpern AL , Bekritsky MA , Noh E , Kallberg M , Chen X , Kim Y , Beyter D
, Krusche P et al . Strelka2: fast and accurate calling of germline and somatic variants. Nat
Methods 2018, 15(8):591–594.
Garrison E , Marth G . Haplotype-based variant detection from short-read sequencing.
arXiv:1207.3907, 2012.
Luo R , Schatz MC , Salzberg SL . 16GT: a fast and sensitive variant caller using a 16-genotype
probabilistic model. GigaScience 2017, 6(7):1–4.
Koboldt DC , Zhang Q , Larson DE , Shen D , McLellan MD , Lin L , Miller CA , Mardis ER ,
Ding L , Wilson RK . VarScan 2: somatic mutation and copy number alteration discovery in
cancer by exome sequencing. Genome Res 2012, 22(3):568–576.
Liu J , Shen Q , Bao H. Comparison of seven SNP calling pipelines for the next-generation
sequencing data of chickens. PLoS One 2022, 17(1):e0262574.
Valueva MV , Nagornov N , Lyakhov PA , Valuev GV , Chervyakov NI . Application of the
residue number system to reduce hardware costs of the convolutional neural network
implementation. Math Comput Simul 2020, 177:232–243.
Poplin R , Chang PC , Alexander D , Schwartz S , Colthurst T , Ku A , Newburger D , Dijamco J
, Nguyen N , Afshar PT et al . A universal SNP and small-indel variant caller using deep neural
networks. Nat Biotechnol 2018, 36(10):983–987.
Zheng Z , Li S , Su J , Leung AW-S , Lam T-W , Luo R . Symphonizing pileup and full-alignment
for deep learning-based long-read variant calling. Nat Comput Sci 2022, 2(12):797–803.
Edge P , Bansal V . Longshot enables accurate variant calling in diploid genomes from single-
molecule long read sequencing. Nat Commun 2019, 10(1):4660.
Ahsan MU , Liu Q , Fang L , Wang K. NanoCaller for accurate detection of SNPs and indels in
difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks.
Genome Biol 2021, 22(1):261.
medaka (https://github.com/nanoporetech/medaka)
Shafin K , Pesout T , Chang PC , Nattestad M , Kolesnikov A , Goel S , Baid G , Kolmogorov M
, Eizenga JM , Miga KH et al . Haplotype-aware variant calling with PEPPER-Margin-
DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 2021,
18(11):1322–1332.
Supernat A , Vidarsson OV , Steen VM , Stokowy T . Comparison of three variant callers for
human whole genome sequencing. Sci Rep 2018, 8(1):17851.
Pei S , Liu T , Ren X , Li W , Chen C , Xie Z. Benchmarking variant callers in next-generation
and third-generation sequencing analysis. Brief Bioinform 2021, 22(3):bbaa148.
Lin YL , Chang PC , Hsu C , Hung MZ , Chien YH , Hwu WL , Lai F , Lee NC . Comparison of
GATK and DeepVariant by trio sequencing. Sci Rep 2022, 12(1):1809.
Benjamin D , Sato T , Cibulskis K , Getz G , Stewart C , Lichtenstein L . Calling somatic SNVs
and indels with Mutect2. bioRxiv 2019, doi: https://doi.org/10.1101/861054
Larson DE , Harris CC , Chen K , Koboldt DC , Abbott TE , Dooling DJ , Ley TJ , Mardis ER ,
Wilson RK , Ding L . SomaticSniper: identification of somatic point mutations in whole genome
sequencing data. Bioinformatics 2012, 28(3):311–317.
Lai Z , Markovets A , Ahdesmaki M , Chapman B , Hofmann O , McEwen R , Johnson J ,
Dougherty B , Barrett JC , Dry JR . VarDict: a novel and versatile variant caller for next-
generation sequencing in cancer research. Nucleic Acids Res 2016, 44(11):e108.
Sahraeian SME , Liu R , Lau B , Podesta K , Mohiyuddin M , Lam HYK . Deep convolutional
neural networks for accurate somatic mutation detection. Nat Commun 2019, 10(1):1041.
Roth A , Ding J , Morin R , Crisan A , Ha G , Giuliany R , Bashashati A , Hirst M , Turashvili G ,
Oloumi A et al . JointSNVMix: a probabilistic model for accurate detection of somatic mutations
in normal/tumour paired next-generation sequencing data. Bioinformatics 2012, 28(7):907–913.
Narzisi G , Corvelo A , Arora K , Bergmann EA , Shah M , Musunuri R , Emde AK , Robine N ,
Vacic V , Zody MC . Genome-wide somatic variant calling using localized colored de Bruijn
graphs. Commun Biol 2018, 1:20.
Cai L , Yuan W , Zhang Z , He L , Chou KC . In-depth comparison of somatic point mutation
callers based on different tumor next-generation sequencing depth data. Sci Rep 2016, 6:36540.
Kroigard AB , Thomassen M , Laenkholm AV , Kruse TA , Larsen MJ . Evaluation of nine
somatic variant callers for detection of somatic mutations in exome and targeted deep
sequencing data. PLoS One 2016, 11(3):e0151664.
Chen Z , Yuan Y , Chen X , Chen J , Lin S , Li X , Du H . Systematic comparison of somatic
variant calling performance among different sequencing depth and mutation frequency. Sci Rep
2020, 10(1):3501.
Zhao S , Agafonov O , Azab A , Stokowy T , Hovig E. Accuracy and efficiency of germline
variant calling pipelines for human genome data. Sci Rep 2020, 10(1):20222.
GATK Best Practices Workflow for RNAseq Short Variant Discovery (SNPs + Indels)
(https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-
discovery-SNPs-Indels-)
Piskol R , Ramaswami G , Li JB . Reliable identification of genomic variants from RNA-seq data.
Am J Hum Genet 2013, 93(4):641–651.
Tang X , Baheti S , Shameer K , Thompson KJ , Wills Q , Niu N , Holcomb IN , Boutet SC ,
Ramakrishnan R , Kachergus JM et al . The eSNV-detect: a computational system to identify
expressed single nucleotide variants from transcriptome sequencing data. Nucleic Acids Res
2014, 42(22):e172.
Goya R , Sun MG , Morin RD , Leung G , Ha G , Wiegand KC , Senz J , Crisan A , Marra MA ,
Hirst M et al . SNVMix: predicting single nucleotide variants from next-generation sequencing of
tumors. Bioinformatics 2010, 26(6):730–736.
Oikkonen L , Lise S. Making the most of RNA-seq: Pre-processing sequencing data with
Opossum for reliable SNP variant detection. Wellcome Open Res 2017, 2:6.
Danecek P , Auton A , Abecasis G , Albers CA , Banks E , DePristo MA , Handsaker RE ,
Lunter G , Marth GT , Sherry ST et al . The variant call format and VCFtools. Bioinformatics
2011, 27(15):2156–2158.
Knaus BJ , Grunwald NJ . vcfr: a package to manipulate and visualize variant call format data in
R. Mol Ecol Resour 2017, 17(1):44–53.
Freudenberg-Hua Y , Freudenberg J , Kluck N , Cichon S , Propping P , Nothen MM . Single
nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative
sample of the European population. Genome Res 2003, 13(10):2271–2276.
Li J , Jew B , Zhan L , Hwang S , Coppola G , Freimer NB , Sul JH . ForestQC: Quality control
on genetic variants from next-generation sequencing data using random forest. PLoS Comput
Biol 2019, 15(12):e1007556.
Gezsi A , Bolgar B , Marx P , Sarkozy P , Szalai C , Antal P. VariantMetaCaller: automated
fusion of variant calling pipelines for quantitative, precision-based filtering. BMC Genomics
2015, 16:875.
Cantarel BL , Weaver D , McNeill N , Zhang J , Mackey AJ , Reese J. BAYSIC: a Bayesian
method for combining sets of genome variants with improved specificity and sensitivity. BMC
Bioinformatics 2014, 15:104.
RTG Tools : Utilities for accurate VCF comparison and manipulation
(https://github.com/RealTimeGenomics/rtg-tools)
Sudmant PH , Rausch T , Gardner EJ , Handsaker RE , Abyzov A , Huddleston J , Zhang Y , Ye
K , Jun G , Fritz MH et al . An integrated map of structural variation in 2,504 human genomes.
Nature 2015, 526(7571):75–81.
Korbel JO , Abyzov A , Mu XJ , Carriero N , Cayting P , Zhang Z , Snyder M , Gerstein MB .
PEMer: a computational framework with simulation-based error models for inferring genomic
structural variants from massive paired-end sequencing data. Genome Biol 2009, 10(2):R23.
Chen K , Wallis JW , McLellan MD , Larson DE , Kalicki JM , Pohl CS , McGrath SD , Wendl MC
, Zhang Q , Locke DP et al . BreakDancer: an algorithm for high-resolution mapping of genomic
structural variation. Nat Methods 2009, 6(9):677–681.
Zeitouni B , Boeva V , Janoueix-Lerosey I , Loeillet S , Legoix-ne P, Nicolas A , Delattre O ,
Barillot E . SVDetect: a tool to identify genomic structural variations from paired-end and mate-
pair sequencing data. Bioinformatics 2010, 26(15):1895–1896.
1-2-3-SV (https://github.com/Vityay/1-2-3-SV)
Ye K , Guo L , Yang X , Lamijer EW , Raine K , Ning Z . Split-read indel and structural variant
calling using PINDEL. Methods Mol Biol 2018, 1833:95–105.
Iqbal Z , Caccamo M , Turner I , Flicek P , McVean G . De novo assembly and genotyping of
variants using colored de Bruijn graphs. Nat Genet 2012, 44(2):226–232.
Liu S , Huang S , Rao J , Ye W , Genome Denmark Consortium II , Krogh A , Wang J .
Discovery, genotyping and characterization of structural variation and novel sequence at single
nucleotide resolution from de novo genome assemblies on a population scale. GigaScience
2015, 4:64.
Rausch T , Zichner T , Schlattl A , Stutz AM , Benes V , Korbel JO . DELLY: structural variant
discovery by integrated paired-end and split-read analysis. Bioinformatics 2012,
28(18):i333–i339.
Yang L , Luquette LJ , Gehlenborg N , Xi R , Haseley PS , Hsieh CH , Zhang C , Ren X ,
Protopopov A , Chin L et al . Diverse mechanisms of somatic structural variations in human
cancer genomes. Cell 2013, 153(4):919–929.
Bartenhagen C , Dugas M . Robust and exact structural variation detection with paired-end and
soft-clipped alignments: SoftSV compared with eight algorithms. Brief Bioinform 2016,
17(1):51–62.
Kronenberg ZN , Osborne EJ , Cone KR , Kennedy BJ , Domyan ET , Shapiro MD , Elde NC ,
Yandell M. Wham: Identifying structural variants of biological consequence. PLoS Comput Biol
2015, 11(12):e1004572.
Sindi S , Helman E , Bashir A , Raphael BJ . A geometric approach for classification and
comparison of structural variants. Bioinformatics 2009, 25(12):i222–230.
Sindi SS , Onal S , Peng LC , Wu HT , Raphael BJ . An integrative probabilistic model for
identification of structural variation in sequencing data. Genome Biol 2012, 13(3):R22.
Handsaker RE , Van Doren V , Berman JR , Genovese G , Kashin S , Boettger LM , McCarroll
SA . Large multiallelic copy number variations in humans. Nat Genet 2015, 47(3):296–303.
Qi J , Zhao F. inGAP-sv: a novel scheme to identify and visualize structural variation from
paired end mapping data. Nucleic Acids Res 2011, 39(Web Server issue):W567–575.
Quinlan AR , Clark RA , Sokolova S , Leibowitz ML , Zhang Y , Hurles ME , Mell JC , Hall IM .
Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome.
Genome Res 2010, 20(5):623–635.
Chen X , Schulz-Trieglaff O , Shaw R , Barnes B , Schlesinger F , Kallberg M , Cox AJ ,
Kruglyak S , Saunders CT . Manta: rapid detection of structural variants and indels for germline
and cancer sequencing applications. Bioinformatics 2016, 32(8):1220–1222.
Cameron DL , Schroder J , Penington JS , Do H , Molania R , Dobrovic A , Speed TP,
Papenfuss AT. GRIDSS: sensitive and specific genomic rearrangement detection using
positional de Bruijn graph assembly. Genome Res 2017, 27(12):2050–2060.
Wala JA , Bandopadhayay P , Greenwald NF , O'Rourke R , Sharpe T , Stewart C ,
Schumacher S , Li Y , Weischenfeldt J , Yao X et al . SvABA: genome-wide detection of
structural variants and indels by local assembly. Genome Res 2018, 28(4):581–591.
Wang J , Mullighan CG , Easton J , Roberts S , Heatley SL , Ma J , Rusch MC , Chen K , Harris
CC , Ding L et al . CREST maps somatic structural variation in cancer genomes with base-pair
resolution. Nat Methods 2011, 8(8):652–654.
Layer RM , Chiang C , Quinlan AR , Hall IM . LUMPY: a probabilistic framework for structural
variant discovery. Genome Biol 2014, 15(6):R84.
Eisfeldt J , Vezzi F , Olason P , Nilsson D , Lindstrand A . TIDDIT, an efficient and
comprehensive structural variant caller for massive parallel sequencing data. F1000Research
2017, 6:664.
pbsv (https://github.com/PacificBiosciences/pbsv)
Sedlazeck FJ , Rescheneder P , Smolka M , Fang H , Nattestad M , von Haeseler A , Schatz
MC . Accurate detection of complex structural variations using single-molecule sequencing. Nat
Methods 2018, 15(6):461–468.
Ebert P , Audano PA , Zhu Q , Rodriguez-Martin B , Porubsky D , Bonder MJ , Sulovari A ,
Ebler J , Zhou W , Serra Mari R et al . Haplotype-resolved diverse human genomes and
integrated analysis of structural variation. Science 2021, 372(6537):eabf7117.
Gardner EJ , Lam VK , Harris DN , Chuang NT , Scott EC , Pittard WS , Mills RE , 1000 Genomes
Project Consortium , Devine SE . The Mobile Element Locator Tool (MELT): population-scale mobile
element discovery and biology. Genome Res 2017, 27(11):1916–1929.
Tham CY , Tirado-Magallanes R , Goh Y , Fullwood MJ , Koh BTH , Wang W , Ng CH , Chng
WJ , Thiery A , Tenen DG et al . NanoVar: accurate characterization of patients’ genomic
structural variants using low-depth nanopore sequencing. Genome Biol 2020, 21(1):56.
Cretu Stancu M , van Roosmalen MJ , Renkens I , Nieboer MM , Middelkamp S , de Ligt J ,
Pregno G , Giachino D , Mandrile G , Espejo Valle-Inclan J et al . Mapping and phasing of
structural variation in patient genomes using nanopore sequencing. Nat Commun 2017,
8(1):1326.
Zhou W , Emery SB , Flasch DA , Wang Y , Kwan KY , Kidd JM , Moran JV , Mills RE .
Identification and characterization of occult human-specific LINE-1 insertions using long-read
sequencing technology. Nucleic Acids Res 2020, 48(3):1146–1163.
Heller D , Vingron M . SVIM: structural variant identification using mapped long reads.
Bioinformatics 2019, 35(17):2907–2915.
Gong L , Wong CH , Cheng WC , Tjong H , Menghi F , Ngan CY , Liu ET , Wei CL . Picky
comprehensively detects high-resolution structural variants in nanopore long reads. Nat
Methods 2018, 15(6):455–460.
Cleal K , Baird DM . Dysgu: efficient structural variant calling using short or long reads. Nucleic
Acids Res 2022, 50(9):e53.
Zheng GX , Lau BT , Schnall-Levin M , Jarosz M , Bell JM , Hindson CM , Kyriazopoulou-
Panagiotopoulou S , Masquelier DA , Merrill L , Terry JM et al . Haplotyping germline and
cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016,
34(3):303–311.
Zhang F , Christiansen L , Thomas J , Pokholok D , Jackson R , Morrell N , Zhao Y , Wiley M ,
Welch E , Jaeger E et al . Haplotype phasing of whole human genomes using bead-based
barcode partitioning in a single tube. Nat Biotechnol 2017, 35(9):852–857.
Wang O , Chin R , Cheng X , Wu MKY , Mao Q , Tang J , Sun Y , Anderson E , Lam HK , Chen
D et al . Efficient and unique cobarcoding of second-generation sequencing reads from long
DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo
assembly. Genome Res 2019, 29(5):798–808.
Chen Z , Pham L , Wu TC , Mo G , Xia Y , Chang PL , Porter D , Phan T , Che H , Tran H et al .
Ultralow-input single-tube linked-read library method enables short-read second-generation
sequencing systems to routinely generate highly accurate and economical long-range
sequencing information. Genome Res 2020, 30(6):898–909.
Long Ranger (https://github.com/10XGenomics/longranger)
Abyzov A , Urban AE , Snyder M , Gerstein M. CNVnator: an approach to discover, genotype,
and characterize typical and atypical CNVs from family and population genome sequencing.
Genome Res 2011, 21(6):974–984.
Xie C , Tammi MT . CNV-seq, a new method to detect copy number variation using
high-throughput sequencing. BMC Bioinformatics 2009, 10:80.
Talevich E , Shain AH , Botton T , Bastian BC . CNVkit: Genome-wide copy number detection
and visualization from targeted DNA sequencing. PLoS Comput Biol 2016, 12(4):e1004873.
Klambauer G , Schwarzbauer K , Mayr A , Clevert DA , Mitterecker A , Bodenhofer U ,
Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-
generation sequencing data with a low false discovery rate. Nucleic Acids Res 2012, 40(9):e69.
Ivakhno S , Royce T , Cox AJ , Evers DJ , Cheetham RK , Tavare S . CNAseg--a novel
framework for identification of copy number changes in cancer from second-generation
sequencing data. Bioinformatics 2010, 26(24):3051–3058.
Zhu M , Need AC , Han Y , Ge D , Maia JM , Zhu Q , Heinzen EL , Cirulli ET , Pelak K , He M et
al . Using ERDS to infer copy-number variants in high-coverage genomes. Am J Hum Genet
2012, 91(3):408–421.
Yoon S , Xuan Z , Makarov V , Ye K , Sebat J . Sensitive and accurate detection of copy
number variants using read depth of coverage. Genome Res 2009, 19(9):1586–1592.
Boeva V , Popova T , Bleakley K , Chiche P , Cappo J , Schleiermacher G , Janoueix-Lerosey I
, Delattre O , Barillot E . Control-FREEC: a tool for assessing copy number and allelic content
using next-generation sequencing data. Bioinformatics 2012, 28(3):423–425.
Alkan C , Kidd JM , Marques-Bonet T , Aksay G , Antonacci F , Hormozdiari F , Kitzman JO ,
Baker C , Malig M , Mutlu O et al . Personalized copy number and segmental duplication maps
using next-generation sequencing. Nat Genet 2009, 41(10):1061–1067.
Chiang DY , Getz G , Jaffe DB , O'Kelly MJ , Zhao X , Carter SL , Russ C , Nusbaum C ,
Meyerson M , Lander ES . High-resolution mapping of copy-number alterations with massively
parallel sequencing. Nat Methods 2009, 6(1):99–103.
Miller CA , Hampton O , Coarfa C , Milosavljevic A . ReadDepth: a parallel R package for
detecting copy number alterations from short sequencing reads. PLoS One 2011, 6(1):e16327.
Roller E , Ivakhno S , Lee S , Royce T , Tanner S . Canvas: versatile and scalable detection of
copy number variants. Bioinformatics 2016, 32(15):2375–2377.
Dharanipragada P , Vogeti S , Parekh N . iCopyDAV: Integrated platform for copy number
variations-Detection, annotation and visualization. PLoS One 2018, 13(4):e0195334.
Cameron DL , Di Stefano L , Papenfuss AT . Comprehensive evaluation and characterisation of
short read general-purpose structural variant calling software. Nat Commun 2019, 10(1):3240.
Kosugi S , Momozawa Y , Liu X , Terao C , Kubo M , Kamatani Y . Comprehensive evaluation
of structural variation detection algorithms for whole genome sequencing. Genome Biol 2019,
20(1):117.
Wong K , Keane TM , Stalker J , Adams DJ . Enhanced structural variant and breakpoint
detection using SVMerge by integration of multiple detection methods and local assembly.
Genome Biol 2010, 11(12):R128.
Zarate S , Carroll A , Mahmoud M , Krasheninina O , Jun G , Salerno WJ , Schatz MC ,
Boerwinkle E , Gibbs RA , Sedlazeck FJ . Parliament2: Accurate structural variant calling at
scale. GigaScience 2020, 9(12):giaa145.
Becker T , Lee WP , Leone J , Zhu Q , Zhang C , Liu S , Sargent J , Shanker K , Mil-Homens A ,
Cerveira E et al . FusorSV: an algorithm for optimally combining data from multiple structural
variation detection methods. Genome Biol 2018, 19(1):38.
Jeffares DC , Jolly C , Hoti M , Speed D , Shaw L , Rallis C , Balloux F , Dessimoz C , Bahler J ,
Sedlazeck FJ . Transient structural variations have strong effects on quantitative traits and
reproductive isolation in fission yeast. Nat Commun 2017, 8:14061.
Mohiyuddin M , Mu JC , Li J , Bani Asadi N , Gerstein MB , Abyzov A , Wong WH , Lam HY .
MetaSV: an accurate and integrative structural-variant caller for next generation sequencing.
Bioinformatics 2015, 31(16):2741–2744.
Medvedev P , Fiume M , Dzamba M , Smith T , Brudno M . Detecting copy number variation
with mated short reads. Genome Res 2010, 20(11):1613–1622.
Wang K , Li M , Hakonarson H . ANNOVAR: functional annotation of genetic variants from high-
throughput sequencing data. Nucleic Acids Res 2010, 38(16):e164.
Cingolani P , Platts A , Wang le L , Coon M , Nguyen T , Wang L , Land SJ , Lu X , Ruden DM .
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:
SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 2012,
6(2):80–92.
McLaren W , Gil L , Hunt SE , Riat HS , Ritchie GR , Thormann A , Flicek P , Cunningham F .
The Ensembl Variant Effect Predictor. Genome Biol 2016, 17(1):122.
Hinrichs AS , Raney BJ , Speir ML , Rhead B , Casper J , Karolchik D , Kuhn RM , Rosenbloom
KR , Zweig AS , Haussler D et al . UCSC Data Integrator and Variant Annotation Integrator.
Bioinformatics 2016, 32(9):1430–1432.
SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation138/)

Clinical Sequencing and Detection of Actionable Variants


Patel LR , Nykter M , Chen K , Zhang W . Cancer genome sequencing: understanding
malignancy as a disease of the genome, its conformation, and its evolution. Cancer Lett
2013, 340(2):152–160.
Steuer CE , Ramalingam SS . Tumor mutation burden: leading immunotherapy to the era of
precision medicine? J Clin Oncol 2018, 36(7):631–632.
Farnaes L , Hildreth A , Sweeney NM , Clark MM , Chowdhury S , Nahas S , Cakici JA , Benson
W , Kaplan RH , Kronick R et al . Rapid whole-genome sequencing decreases infant morbidity
and cost of hospitalization. NPJ Genom Med 2018, 3:10.
Jones S , Anagnostou V , Lytle K , Parpart-Li S , Nesselbush M , Riley DR , Shukla M ,
Chesnick B , Kadan M , Papp E et al . Personalized genomic analyses for cancer mutation
discovery and interpretation. Sci Transl Med 2015, 7(283):283ra253.
Srinivasan M , Sedmak D , Jewell S. Effect of fixatives and tissue processing on the content and
integrity of nucleic acids. Am J Pathol 2002, 161(6):1961–1971.
Hedegaard J , Thorsen K , Lund MK , Hein AM , Hamilton-Dutoit SJ , Vang S , Nordentoft I ,
Birkenkamp-Demtroder K , Kruhoffer M , Hager H et al . Next-generation sequencing of RNA
and DNA isolated from paired fresh-frozen and formalin-fixed paraffin-embedded samples of
human cancer and normal tissue. PLoS One 2014, 9(5):e98187.
Do H , Dobrovic A . Sequence artifacts in DNA from formalin-fixed tissues: causes and
strategies for minimization. Clin Chem 2015, 61(1):64–71.
McDonough SJ , Bhagwate A , Sun ZF , Wang C , Zschunke M , Gorman JA , Kopp KJ ,
Cunningham JM . Use of FFPE-derived DNA in next generation sequencing: DNA extraction
methods. PLoS One 2019, 14(4).
Oreskovic A , Brault ND , Panpradist N , Lai JJ , Lutz BR . Analytical comparison of methods for
extraction of short cell-free DNA from urine. J Mol Diagn 2019, 21(6):1067–1078.
Diefenbach RJ , Lee JH , Kefford RF , Rizos H . Evaluation of commercial kits for purification of
circulating free DNA. Cancer Genet 2018, 228–229:21–27.
Alborelli I , Generali D , Jermann P , Cappelletti MR , Ferrero G , Scaggiante B , Bortul M ,
Zanconati F , Nicolet S , Haegele J et al . Cell-free DNA analysis in healthy individuals by next-
generation sequencing: a proof of concept and technical validation study. Cell Death Dis 2019,
10(7):534.
Jiang P , Chan CW , Chan KC , Cheng SH , Wong J , Wong VW , Wong GL , Chan SL , Mok TS
, Chan HL et al . Lengthening and shortening of plasma DNA in hepatocellular carcinoma
patients. Proc Natl Acad Sci U S A 2015, 112(11):E1317–1325.
Wyatt AW , Annala M , Aggarwal R , Beja K , Feng F , Youngren J , Foye A , Lloyd P , Nykter M
, Beer TM et al . Concordance of circulating tumor DNA and matched metastatic tissue biopsy in
prostate cancer. J Natl Cancer Inst 2017, 109(12).
Chen M , Zhao H . Next-generation sequencing in liquid biopsy: cancer screening and early
detection. Hum Genomics 2019, 13(1):34.
Jones AG , Small CM , Paczolt KA , Ratterman NL . A practical guide to methods of parentage
analysis. Mol Ecol Resour 2010, 10(1):6–30.
Zhang L , Dong X , Lee M , Maslov AY , Wang T , Vijg J . Single-cell whole-genome sequencing
reveals the functional landscape of somatic mutations in B lymphocytes across the human
lifespan. Proc Natl Acad Sci U S A 2019, 116(18):9014–9019.
Petrackova A , Vasinek M , Sedlarikova L , Dyskova T , Schneiderova P , Novosad T , Papajik T
, Kriegova E . Standardization of sequencing coverage depth in NGS: recommendation for
detection of clonal and subclonal mutations in cancer diagnostics. Front Oncol 2019, 9:851.
Meggendorfer M , Jobanputra V , Wrzeszczynski KO , Roepman P , de Bruijn E , Cuppen E ,
Buttner R , Caldas C , Grimmond S , Mullighan CG et al . Analytical demands to use whole-
genome sequencing in precision oncology. Semin Cancer Biol 2021, 84:16–22.
Salk JJ , Schmitt MW , Loeb LA . Enhancing the accuracy of next-generation sequencing for
detecting rare and subclonal mutations. Nat Rev Genet 2018, 19(5):269–285.
Schmitt MW , Kennedy SR , Salk JJ , Fox EJ , Hiatt JB , Loeb LA . Detection of ultra-rare
mutations by next-generation sequencing. Proc Natl Acad Sci U S A 2012,
109(36):14508–14513.
Zook JM , McDaniel J , Olson ND , Wagner J , Parikh H , Heaton H , Irvine SA , Trigg L , Truty
R , McLean CY et al . An open resource for accurately benchmarking small variant and
reference calls. Nat Biotechnol 2019, 37(5):561–566.
Miller NA , Farrow EG , Gibson M , Willig LK , Twist G , Yoo B , Marrs T , Corder S , Krivohlavek
L , Walter A et al . A 26-hour system of highly sensitive whole genome sequencing for
emergency management of genetic diseases. Genome Med 2015, 7(1):100.
Mestek-Boukhibar L , Clement E , Jones WD , Drury S , Ocaka L , Gagunashvili A , Le Quesne
Stabej P , Bacchelli C , Jani N , Rahman S et al . Rapid Paediatric Sequencing (RaPS):
comprehensive real-life workflow for rapid diagnosis of critically ill children. J Med Genet 2018,
55(11):721–728.
Kendig KI , Baheti S , Bockol MA , Drucker TM , Hart SN , Heldenbrand JR , Hernaez M ,
Hudson ME , Kalmbach MT , Klee EW et al . Sentieon DNASeq variant calling workflow
demonstrates strong computational performance and accuracy. Front Genet 2019, 10:736.
Loka TP , Tausch SH , Renard BY . Reliable variant calling during runtime of Illumina
sequencing. Sci Rep 2019, 9(1):16502.
Stranneheim H , Engvall M , Naess K , Lesko N , Larsson P , Dahlberg M , Andeer R ,
Wredenberg A , Freyer C , Barbaro M et al . Rapid pulsed whole genome sequencing for
comprehensive acute diagnostics of inborn errors of metabolism. BMC Genomics 2014,
15:1090.
Clark MM , Hildreth A , Batalov S , Ding Y , Chowdhury S , Watkins K , Ellsworth K , Camp B ,
Kint CI , Yacoubian C et al . Diagnosis of genetic diseases in seriously ill children by rapid
whole-genome sequencing and automated phenotyping and interpretation. Sci Transl Med
2019, 11(489):eaat6177.
Zemojtel T , Kohler S , Mackenroth L , Jager M , Hecht J , Krawitz P , Graul-Neumann L ,
Doelken S , Ehmke N , Spielmann M et al . Effective diagnosis of genetic disease by
computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014,
6(252):252ra123.
Karczewski KJ , Francioli LC , Tiao G , Cummings BB , Alföldi J , Wang Q , Collins RL ,
Laricchia KM , Ganna A , Birnbaum DP et al . The mutational constraint spectrum quantified
from variation in 141,456 humans. Nature 2020, 581(7809):434–443.
Siva N . 1000 Genomes project. Nat Biotechnol 2008, 26(3):256.
Taliun D , Harris DN , Kessler MD , Carlson J , Szpiech ZA , Torres R , Taliun SAG , Corvelo A ,
Gogarten SM , Kang HM et al . Sequencing of 53,831 diverse genomes from the NHLBI
TOPMed Program. Nature 2021, 590(7845):290–299.
UK10K Consortium , Walter K , Min JL , Huang J , Crooks L , Memari Y , McCarthy S , Perry JR ,
Xu C , Futema M et al . The UK10K project identifies rare variants in health and disease. Nature
2015, 526(7571):82–90.
Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP)
(http://evs.gs.washington.edu/EVS/)
MacDonald JR , Ziman R , Yuen RK , Feuk L , Scherer SW . The Database of Genomic
Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res
2014, 42(Database issue):D986–992.
Firth HV , Richards SM , Bevan AP , Clayton S , Corpas M , Rajan D , Van Vooren S , Moreau
Y , Pettett RM , Carter NP . DECIPHER: Database of Chromosomal Imbalance and Phenotype
in Humans Using Ensembl Resources. Am J Hum Genet 2009, 84(4):524–533.
Lappalainen I , Lopez J , Skipper L , Hefferon T , Spalding JD , Garner J , Chen C , Maguire M ,
Corbett M , Zhou G et al . dbVar and DGVa: public archives for genomic structural variation.
Nucleic Acids Res 2013, 41(Database issue):D936–941.
Obenchain V , Lawrence M , Carey V , Gogarten S , Shannon P , Morgan M .
VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants.
Bioinformatics 2014, 30(14):2076–2078.
Jaganathan K , Kyriazopoulou Panagiotopoulou S , McRae JF , Darbandi SF , Knowles D , Li YI
, Kosmicki JA , Arbelaez J , Cui W , Schwartz GB et al . Predicting splicing from primary
sequence with deep learning. Cell 2019, 176(3):535–548.e24.
Yeo G , Burge CB . Maximum entropy modeling of short sequence motifs with applications to
RNA splicing signals. J Comput Biol 2004, 11(2–3):377–394.
Reese MG , Eeckman FH , Kulp D , Haussler D . Improved splice site detection in Genie. J
Comput Biol 1997, 4(3):311–323.
Jian X , Boerwinkle E , Liu X . In silico prediction of splice-altering single nucleotide variants in
the human genome. Nucleic Acids Res 2014, 42(22):13534–13544.
Wai HA , Lord J , Lyon M , Gunning A , Kelly H , Cibin P , Seaby EG , Spiers-Fitzgerald K , Lye
J , Ellard S et al . Blood RNA analysis can increase clinical diagnostic rate and resolve variants
of uncertain significance. Genet Med 2020, 22(6):1005–1014.
Riepe TV , Khan M , Roosing S , Cremers FPM , 't Hoen PAC . Benchmarking deep learning
splice prediction tools using functional splice assays. Hum Mutat 2021, 42(7):799–810.
Rentzsch P , Schubach M , Shendure J , Kircher M . CADD-Splice-improving genome-wide
variant effect prediction using deep learning-derived splice scores. Genome Med 2021,
13(1):31.
Adzhubei IA , Schmidt S , Peshkin L , Ramensky VE , Gerasimova A , Bork P , Kondrashov AS
, Sunyaev SR . A method and server for predicting damaging missense mutations. Nat Methods
2010, 7(4):248–249.
Choi Y , Chan AP . PROVEAN web server: a tool to predict the functional effect of amino acid
substitutions and indels. Bioinformatics 2015, 31(16):2745–2747.
Vaser R , Adusumalli S , Leng SN , Sikic M , Ng PC . SIFT missense predictions for genomes.
Nat Protoc 2016, 11(1):1–9.
Steinhaus R , Proft S , Schuelke M , Cooper DN , Schwarz JM , Seelow D. MutationTaster2021.
Nucleic Acids Res 2021, 49(W1):W446–W451.
Reva B , Antipin Y , Sander C. Predicting the functional impact of protein mutations: application
to cancer genomics. Nucleic Acids Res 2011, 39(17):e118.
Carter H , Douville C , Stenson PD , Cooper DN , Karchin R . Identifying Mendelian disease
genes with the variant effect scoring tool. BMC Genomics 2013, 14 Suppl 3:S3.
Shihab HA , Gough J , Cooper DN , Stenson PD , Barker GL , Edwards KJ , Day IN , Gaunt TR
. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions
using hidden Markov models. Hum Mutat 2013, 34(1):57–65.
Pejaver V , Urresti J , Lugo-Martinez J , Pagel KA , Lin GN , Nam HJ , Mort M , Cooper DN ,
Sebat J , Iakoucheva LM et al . Inferring the molecular and phenotypic impact of amino acid
variants with MutPred2. Nat Commun 2020, 11(1):5918.
Chun S , Fay JC . Identification of deleterious mutations within three human genomes. Genome
Res 2009, 19(9):1553–1561.
Siepel A , Bejerano G , Pedersen JS , Hinrichs AS , Hou M , Rosenbloom K , Clawson H ,
Spieth J , Hillier LW , Richards S et al . Evolutionarily conserved elements in vertebrate, insect,
worm, and yeast genomes. Genome Res 2005, 15(8):1034–1050.
Cooper GM , Stone EA , Asimenos G , NISC Comparative Sequencing Program , Green ED , Batzoglou S , Sidow A .
Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 2005,
15(7):901–913.
Garber M , Guttman M , Clamp M , Zody MC , Friedman N , Xie X . Identifying novel
constrained elements by exploiting biased substitution patterns. Bioinformatics 2009,
25(12):i54–62.
Rentzsch P , Witten D , Cooper GM , Shendure J , Kircher M . CADD: predicting the
deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019,
47(D1):D886–D894.
Jagadeesh KA , Wenger AM , Berger MJ , Guturu H , Stenson PD , Cooper DN , Bernstein JA ,
Bejerano G . M-CAP eliminates a majority of variants of uncertain significance in clinical
exomes at high sensitivity. Nat Genet 2016, 48(12):1581–1586.
Li C , Zhi D , Wang K , Liu X . MetaRNN: Differentiating Rare Pathogenic and Rare Benign
Missense SNVs and InDels using deep learning. bioRxiv 2021.
Ioannidis NM , Rothstein JH , Pejaver V , Middha S , McDonnell SK , Baheti S , Musolf A , Li Q ,
Holzinger E , Karyadi D et al . REVEL: an ensemble method for predicting the pathogenicity of
rare missense variants. Am J Hum Genet 2016, 99(4):877–885.
Ionita-Laza I , McCallum K , Xu B , Buxbaum JD . A spectral approach integrating functional
genomic annotations for coding and noncoding variants. Nat Genet 2016, 48(2):214–220.
Hu H , Huff CD , Moore B , Flygare S , Reese MG , Yandell M. VAAST 2.0: improved variant
classification and disease-gene identification using a conservation-controlled amino acid
substitution matrix. Genet Epidemiol 2013, 37(6):622–634.
Landrum MJ , Lee JM , Benson M , Brown GR , Chao C , Chitipiralla S , Gu B , Hart J , Hoffman
D , Jang W et al . ClinVar: improving access to variant interpretations and supporting evidence.
Nucleic Acids Res 2018, 46(D1):D1062–D1067.
Stenson PD , Ball EV , Mort M , Phillips AD , Shaw K , Cooper DN . The Human Gene Mutation
Database (HGMD) and its exploitation in the fields of personalized genomics and molecular
evolution. Curr Protoc Bioinformatics 2012, Chapter 1:1.13.1-1.13.20.
Online Mendelian Inheritance in Man, OMIM (https://omim.org/)
Tate JG , Bamford S , Jubb HC , Sondka Z , Beare DM , Bindal N , Boutselakis H , Cole CG ,
Creatore C , Dawson E et al . COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic
Acids Res 2019, 47(D1):D941–D947.
Grossman RL , Heath AP , Ferretti V , Varmus HE , Lowy DR , Kibbe WA , Staudt LM . Toward
a Shared Vision for Cancer Genomic Data. N Engl J Med 2016, 375(12):1109–1112.
AACR Project GENIE Consortium . AACR Project GENIE: powering precision medicine through an International
Consortium. Cancer Discov 2017, 7(8):818–831.
Robinson PN , Kohler S , Oellrich A , Sanger Mouse Genetics Project , Wang K , Mungall CJ , Lewis
SE , Washington N , Bauer S , Seelow D et al . Improved exome prioritization of disease genes
through cross-species phenotype comparison. Genome Res 2014, 24(2):340–348.
Sifrim A , Popovic D , Tranchevent LC , Ardeshirdavani A , Sakai R , Konings P , Vermeesch JR
, Aerts J , De Moor B , Moreau Y . eXtasy: variant prioritization by genomic data fusion. Nat
Methods 2013, 10(11):1083–1084.
Smedley D , Schubach M , Jacobsen JOB , Kohler S , Zemojtel T , Spielmann M , Jager M ,
Hochheiser H , Washington NL , McMurry JA et al . A whole-genome analysis framework for
effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet
2016, 99(3):595–606.
Kohler S , Schulz MH , Krawitz P , Bauer S , Dolken S , Ott CE , Mundlos C , Horn D , Mundlos
S , Robinson PN . Clinical diagnostics in human genetics with semantic similarity searches in
ontologies. Am J Hum Genet 2009, 85(4):457–464.
Singleton MV , Guthery SL , Voelkerding KV , Chen K , Kennedy B , Margraf RL , Durtschi J ,
Eilbeck K , Reese MG , Jorde LB et al . Phevor combines multiple biomedical ontologies for
accurate identification of disease-causing alleles in single individuals and small nuclear families.
Am J Hum Genet 2014, 94(4):599–610.
Javed A , Agrawal S , Ng PC . Phen-Gen: combining phenotype and genotype to analyze rare
disorders. Nat Methods 2014, 11(9):935–937.
Stelzer G , Plaschkes I , Oz-Levi D , Alkelai A , Olender T , Zimmerman S , Twik M , Belinky F ,
Fishilevich S , Nudel R et al . VarElect: the phenotype-based variation prioritizer of the
GeneCards Suite. BMC Genomics 2016, 17 Suppl 2:444.
Boudellioua I , Kulmanov M , Schofield PN , Gkoutos GV , Hoehndorf R. DeepPVP: phenotype-
based prioritization of causative variants using deep learning. BMC Bioinformatics 2019,
20(1):65.
Rodriguez-Garcia MA , Gkoutos GV , Schofield PN , Hoehndorf R. Integrating phenotype
ontologies with PhenomeNET. J Biomed Semantics 2017, 8(1):58.
Davydov EV , Goode DL , Sirota M , Cooper GM , Sidow A , Batzoglou S. Identifying a high
fraction of the human genome to be under selective constraint using GERP++. PLoS Comput
Biol 2010, 6(12):e1001025.
Landrum MJ , Chitipiralla S , Brown GR , Chen C , Gu B , Hart J , Hoffman D , Jang W , Kaur K
, Liu C et al . ClinVar: improvements to accessing data. Nucleic Acids Res 2020,
48(D1):D835–D844.
Bragin E , Chatzimichali EA , Wright CF , Hurles ME , Firth HV , Bevan AP , Swaminathan GJ .
DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence
and copy-number variation. Nucleic Acids Res 2014, 42(Database issue):D993–D1000.
Smedley D , Jacobsen JO , Jager M , Kohler S , Holtgrewe M , Schubach M , Siragusa E ,
Zemojtel T , Buske OJ , Washington NL et al . Next-generation diagnostics and disease-gene
discovery with the Exomiser. Nat Protoc 2015, 10(12):2004–2015.
Yang H , Robinson PN , Wang K . Phenolyzer: phenotype-based prioritization of candidate
genes for human diseases. Nat Methods 2015, 12(9):841–843.
Holtgrewe M , Stolpe O , Nieminen M , Mundlos S , Knaus A , Kornak U , Seelow D ,
Segebrecht L , Spielmann M , Fischer-Zirnsak B et al . VarFish: comprehensive DNA variant
analysis for diagnostics and research. Nucleic Acids Res 2020, 48(W1):W162–W169.
Li MX , Gui HS , Kwan JS , Bao SY , Sham PC . A comprehensive framework for prioritizing
variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Res 2012,
40(7):e53.
Flygare S , Hernandez EJ , Phan L , Moore B , Li M , Fejes A , Hu H , Eilbeck K , Huff C , Jorde
L et al . The VAAST Variant Prioritizer (VVP): ultrafast, easy to use whole genome variant
prioritization tool. BMC Bioinformatics 2018, 19(1):57.
Geoffroy V , Pizot C , Redin C , Piton A , Vasli N , Stoetzel C , Blavier A , Laporte J , Muller J.
VaRank: a simple and powerful tool for ranking genetic variants. PeerJ 2015, 3:e796.
Alexander J , Mantzaris D , Georgitsi M , Drineas P , Paschou P . Variant Ranker: a web-tool to
rank genomic data according to functional significance. BMC Bioinformatics 2017, 18(1):341.
Ip E , Chapman G , Winlaw D , Dunwoodie SL , Giannoulatou E . VPOT: A Customizable
Variant Prioritization Ordering Tool for Annotated Variants. Genomics Proteomics Bioinformatics
2019, 17(5):540–545.
Richards S , Aziz N , Bale S , Bick D , Das S , Gastier-Foster J , Grody WW , Hegde M , Lyon E
, Spector E et al . Standards and guidelines for the interpretation of sequence variants: a joint
consensus recommendation of the American College of Medical Genetics and Genomics and
the Association for Molecular Pathology. Genet Med 2015, 17(5):405–424.
Li Q , Wang K . InterVar: Clinical Interpretation of Genetic Variants by the 2015 ACMG-AMP
Guidelines. Am J Hum Genet 2017, 100(2):267–280.
Whiffin N , Walsh R , Govind R , Edwards M , Ahmad M , Zhang X , Tayal U , Buchan R ,
Midwinter W , Wilk AE et al . CardioClassifier: disease- and gene-specific computational
decision support for clinical genome interpretation. Genet Med 2018, 20(10):1246–1254.
Nicora G , Limongelli I , Gambelli P , Memmi M , Malovini A , Mazzanti A , Napolitano C , Priori
S , Bellazzi R . CardioVAI: An automatic implementation of ACMG-AMP variant interpretation
guidelines in the diagnosis of cardiovascular diseases. Hum Mutat 2018, 39(12):1835–1846.
Li MM , Datto M , Duncavage EJ , Kulkarni S , Lindeman NI , Roy S , Tsimberidou AM ,
Vnencak-Jones CL , Wolff DJ , Younes A et al . Standards and Guidelines for the Interpretation
and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the
Association for Molecular Pathology, American Society of Clinical Oncology, and College of
American Pathologists. J Mol Diagn 2017, 19(1):4–23.
He MM , Li Q , Yan M , Cao H , Hu Y , He KY , Cao K , Li MM , Wang K. Variant Interpretation
for Cancer (VIC): a computational tool for assessing clinical impacts of somatic variants.
Genome Med 2019, 11(1):53.
Li Q , Ren Z , Cao K , Li MM , Wang K , Zhou Y . CancerVar: an artificial intelligence-
empowered platform for clinical interpretation of somatic mutations in cancer. Sci Adv 2022,
8(18):eabj1624.
Tamborero D , Rubio-Perez C , Deu-Pons J , Schroeder MP , Vivancos A , Rovira A , Tusquets
I , Albanell J , Rodon J , Tabernero J et al . Cancer Genome Interpreter annotates the biological
and clinical relevance of tumor alterations. Genome Med 2018, 10(1):25.
Griffith M , Spies NC , Krysiak K , McMichael JF , Coffman AC , Danos AM , Ainscough BJ ,
Ramirez CA , Rieke DT , Kujan L et al . CIViC is a community knowledgebase for expert
crowdsourcing the clinical interpretation of variants in cancer. Nat Genet 2017, 49(2):170–174.
Huang L , Fernandes H , Zia H , Tavassoli P , Rennert H , Pisapia D , Imielinski M , Sboner A ,
Rubin MA , Kluk M et al . The cancer precision medicine knowledge base for structured clinical-
grade mutations and interpretations. J Am Med Inform Assoc 2017, 24(3):513–519.
Zomnir MG , Lipkin L , Pacula M , Meneses ED , MacLeay A , Duraisamy S , Nadhamuni N , Al
Turki SH , Zheng Z , Rivera M et al . Artificial intelligence approach for variant reporting. JCO
Clin Cancer Inform 2018, 2.
den Dunnen JT , Dalgleish R , Maglott DR , Hart RK , Greenblatt MS , McGowan-Jordan J ,
Roux AF , Smith T , Antonarakis SE , Taschner PE . HGVS Recommendations for the
Description of Sequence Variants: 2016 update. Hum Mutat 2016, 37(6):564–569.
Braschi B , Denny P , Gray K , Jones T , Seal R , Tweedie S , Yates B , Bruford E .
Genenames.org: the HGNC and VGNC resources in 2019. Nucleic Acids Res 2019,
47(D1):D786–D792.
Arteche-Lopez A , Avila-Fernandez A , Romero R , Riveiro-Alvarez R , Lopez-Martinez MA ,
Gimenez-Pardo A , Velez-Monsalve C , Gallego-Merlo J , Garcia-Vara I , Almoguera B et al .
Sanger sequencing is no longer always necessary based on a single-center validation of 1109
NGS variants in 825 clinical exomes. Sci Rep 2021, 11(1):5697.
Beck TF , Mullikin JC , NISC Comparative Sequencing Program , Biesecker LG . Systematic evaluation of Sanger
validation of next-generation sequencing variants. Clin Chem 2016, 62(4):647–654.
Kerkhof J , Schenkel LC , Reilly J , McRobbie S , Aref-Eshghi E , Stuart A , Rupar CA , Adams
P , Hegele RA , Lin H et al . Clinical validation of copy number variant detection from targeted
next-generation sequencing panels. J Mol Diagn 2017, 19(6):905–920.
Miller DT , Lee K , Chung WK , Gordon AS , Herman GE , Klein TE , Stewart DR , Amendola
LM , Adelman K , Bale SJ et al . ACMG SF v3.0 list for reporting of secondary findings in clinical
exome and genome sequencing: a policy statement of the American College of Medical
Genetics and Genomics (ACMG). Genet Med 2021, 23(8):1381–1390.
Miller DT , Lee K , Gordon AS , Amendola LM , Adelman K , Bale SJ , Chung WK , Gollob MH ,
Harrison SM , Herman GE et al . Recommendations for reporting of secondary findings in
clinical exome and genome sequencing, 2021 update: a policy statement of the American
College of Medical Genetics and Genomics (ACMG). Genet Med 2021, 23(8):1391–1398.
Appelbaum PS , Parens E , Berger SM , Chung WK , Burke W . Is there a duty to reinterpret
genetic data? The ethical dimensions. Genet Med 2020, 22(3):633–639.
Clayton EW , Appelbaum PS , Chung WK , Marchant GE , Roberts JL , Evans BJ . Does the
law require reinterpretation and return of revised genomic results? Genet Med 2021,
23(5):833–836.
Deignan JL , Chung WK , Kearney HM , Monaghan KG , Rehder CW , Chao EC , ACMG Laboratory
Quality Assurance Committee . Points to consider in the reevaluation and reanalysis of genomic test results: a
statement of the American College of Medical Genetics and Genomics (ACMG). Genet Med
2019, 21(6):1267–1270.
Roy S , Coldren C , Karunamurthy A , Kip NS , Klee EW , Lincoln SE , Leon A , Pullambhatla M
, Temple-Smolkin RL , Voelkerding KV et al . Standards and Guidelines for Validating Next-
Generation Sequencing Bioinformatics Pipelines: A Joint Recommendation of the Association
for Molecular Pathology and the College of American Pathologists. J Mol Diagn 2018,
20(1):4–27.
Ewing AD , Houlahan KE , Hu Y , Ellrott K , Caloian C , Yamaguchi TN , Bare JC , P'ng C,
Waggott D , Sabelnykova VY et al . Combining tumor genome simulation with crowdsourcing to
benchmark somatic single-nucleotide-variant detection. Nat Methods 2015, 12(7):623–630.

De Novo Genome Assembly with Long and/or Short Reads


Schatz MC , Delcher AL , Salzberg SL . Assembly of large genomes using second-generation
sequencing. Genome Res 2010, 20(9):1165–1173.
Zerbino DR , Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn
graphs. Genome Res 2008, 18(5):821–829.
Simpson JT , Wong K , Jackman SD , Schein JE , Jones SJ , Birol I . ABySS: a parallel
assembler for short read sequence data. Genome Res 2009, 19(6):1117–1123.
Li R , Zhu H , Ruan J , Qian W , Fang X , Shi Z , Li Y , Li S , Shan G , Kristiansen K et al. De
novo assembly of human genomes with massively parallel short read sequencing. Genome Res
2010, 20(2):265–272.
English AC , Richards S , Han Y , Wang M , Vee V , Qu J , Qin X , Muzny DM , Reid JG ,
Worley KC et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read
sequencing technology. PLoS One 2012, 7(11):e47768.
Rayamajhi N , Cheng CC , Catchen JM . Evaluating Illumina-, Nanopore-, and PacBio-based
genome assembly strategies with the bald notothen, Trematomus borchgrevinki. G3 2022,
12(11):jkac192.
van Heesch S , Kloosterman WP , Lansu N , Ruzius FP , Levandowsky E , Lee CC , Zhou S ,
Goldstein S , Schwartz DC , Harkins TT et al. Improving mammalian genome scaffolding using
large insert mate-pair next-generation sequencing. BMC Genomics 2013, 14:257.
Li R , Fan W , Tian G , Zhu H , He L , Cai J , Huang Q , Cai Q , Li B , Bai Y et al. The sequence
and de novo assembly of the giant panda genome. Nature 2010, 463(7279):311–317.
Wang O , Chin R , Cheng X , Wu MKY , Mao Q , Tang J , Sun Y , Anderson E , Lam HK , Chen
D et al. Efficient and unique cobarcoding of second-generation sequencing reads from long
DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo
assembly. Genome Res 2019, 29(5):798–808.
Chen Z , Pham L , Wu TC , Mo G , Xia Y , Chang PL , Porter D , Phan T , Che H , Tran H et al.
Ultralow-input single-tube linked-read library method enables short-read second-generation
sequencing systems to routinely generate highly accurate and economical long-range
sequencing information. Genome Res 2020, 30(6):898–909.
Zheng GX , Lau BT , Schnall-Levin M , Jarosz M , Bell JM , Hindson CM , Kyriazopoulou-
Panagiotopoulou S , Masquelier DA , Merrill L , Terry JM et al. Haplotyping germline and cancer
genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016, 34(3):303–311.
Nagarajan N , Pop M . Sequence assembly demystified. Nat Rev Genet 2013,
14(3):157–167.
Desai A , Marwah VS , Yadav A , Jha V , Dhaygude K , Bangar U , Kulkarni V , Jere A .
Identification of optimum sequencing depth especially for de novo genome assembly of small
genomes using next generation sequencing data. PLoS One 2013, 8(4):e60204.
Chen Y , Nie F , Xie SQ , Zheng YF , Dai Q , Bray T , Wang YX , Xing JF , Huang ZJ , Wang DP
et al. Efficient assembly of nanopore reads via highly accurate and intact error correction.
Nat Commun 2021, 12(1):60.
Chen Y , Zhang Y , Wang AY , Gao M , Chong Z . Accurate long-read de novo assembly
evaluation with Inspector. Genome Biol 2021, 22(1):312.
Magoc T , Salzberg SL . FLASH: fast length adjustment of short reads to improve genome
assemblies. Bioinformatics 2011, 27(21):2957–2963.
Zhang J , Kobert K , Flouri T , Stamatakis A . PEAR: a fast and accurate Illumina Paired-End
reAd mergeR. Bioinformatics 2014, 30(5):614–620.
Aronesty E. Comparison of sequencing utility programs. Open Bioinformatics J 2013, 7(1):1–8.
Masella AP , Bartram AK , Truszkowski JM , Brown DG , Neufeld JD. PANDAseq: paired-end
assembler for Illumina sequences. BMC Bioinformatics 2012, 13:31.
Rognes T , Flouri T , Nichols B , Quince C , Mahe F . VSEARCH: a versatile open source tool
for metagenomics. PeerJ 2016, 4:e2584.
Li H. BFC: correcting Illumina sequencing errors. Bioinformatics 2015, 31(17):2885–2887.
Heo Y , Wu XL , Chen D , Ma J , Hwu WM . BLESS: bloom filter-based error correction solution
for high-throughput sequencing reads. Bioinformatics 2014, 30(10):1354–1362.
Song L , Florea L , Langmead B . Lighter: fast and memory-efficient sequencing error correction
without counting. Genome Biol 2014, 15(11):509.
Liu Y , Schroder J , Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for
Illumina sequence data. Bioinformatics 2013, 29(3):308–315.
Schulz MH , Weese D , Holtgrewe M , Dimitrova V , Niu S , Reinert K , Richard H . Fiona: a
parallel and automatic strategy for read error correction. Bioinformatics 2014, 30(17):i356–363.
Salmela L , Schroder J . Correcting errors in short reads by multiple alignments. Bioinformatics
2011, 27(11):1455–1461.
Gnerre S , Maccallum I , Przybylski D , Ribeiro FJ , Burton JN , Walker BJ , Sharpe T , Hall G ,
Shea TP , Sykes S et al. High-quality draft assemblies of mammalian genomes from massively
parallel sequence data. Proc Natl Acad Sci U S A 2011, 108(4):1513–1518.
Simpson JT , Durbin R . Efficient de novo assembly of large genomes using compressed data
structures. Genome Res 2012, 22(3):549–556.
Pevzner PA , Tang H , Waterman MS . An Eulerian path approach to DNA fragment assembly.
Proc Natl Acad Sci U S A 2001, 98(17):9748–9753.
Marcais G , Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences
of k-mers. Bioinformatics 2011, 27(6):764–770.
Rhie A , Walenz BP , Koren S , Phillippy AM. Merqury: reference-free quality, completeness,
and phasing assessment for genome assemblies. Genome Biol 2020, 21(1):245.
Ranallo-Benavidez TR , Jaron KS , Schatz MC . GenomeScope 2.0 and Smudgeplot for
reference-free profiling of polyploid genomes. Nat Commun 2020, 11(1):1432.
Wang JR , Holt J , McMillan L , Jones CD . FMLRC: Hybrid long read error correction using an
FM-index. BMC Bioinformatics 2018, 19(1):50.
Koren S , Schatz MC , Walenz BP , Martin J , Howard JT , Ganapathy G , Wang Z , Rasko DA ,
McCombie WR , Jarvis ED et al. Hybrid error correction and de novo assembly of single-
molecule sequencing reads. Nat Biotechnol 2012, 30(7):693–700.
Salmela L , Rivals E . LoRDEC: accurate and efficient long read error correction. Bioinformatics
2014, 30(24):3506–3514.
Au KF , Underwood JG , Lee L , Wong WH . Improving PacBio long read accuracy by short
read alignment. PLoS One 2012, 7(10):e46679.
Goodwin S , Gurtowski J , Ethe-Sayers S , Deshpande P , Schatz MC , McCombie WR . Oxford
Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome.
Genome Res 2015, 25(11):1750–1756.
Hackl T , Hedrich R , Schultz J , Forster F . proovread: large-scale high-accuracy PacBio
correction through iterative short read consensus. Bioinformatics 2014, 30(21):3004–3011.
Baid G , Cook DE , Shafin K , Yun T , Llinares-López F , Berthet Q , Wenger AM , Rowell WJ ,
Nattestad M , Yang H et al. DeepConsensus improves the accuracy of sequences with a gap-
aware sequence transformer. Nat Biotechnol 2023, 41(2):232–238.
Koren S , Walenz BP , Berlin K , Miller JR , Bergman NH , Phillippy AM . Canu: scalable and
accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res
2017, 27(5):722–736.
Cheng H , Concepcion GT , Feng X , Zhang H , Li H . Haplotype-resolved de novo assembly
using phased assembly graphs with Hifiasm. Nat Methods 2021, 18(2):170–175.
Chin CS , Peluso P , Sedlazeck FJ , Nattestad M , Concepcion GT , Clum A , Dunn C , O'Malley
R , Figueroa-Balderas R , Morales-Cruz A et al. Phased diploid genome assembly with single-
molecule real-time sequencing. Nat Methods 2016, 13(12):1050–1054.
Nowoshilow S , Schloissnig S , Fei JF , Dahl A , Pang AWC , Pippel M , Winkler S , Hastie AR ,
Young G , Roscito JG et al. The axolotl genome and the evolution of key tissue formation
regulators. Nature 2018, 554(7690):50–55.
Xiao CL , Chen Y , Xie SQ , Chen KN , Wang Y , Han Y , Luo F , Xie Z . MECAT: fast mapping,
error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods
2017, 14(11):1072–1074.
Salmela L , Walve R , Rivals E , Ukkonen E . Accurate self-correction of errors in long reads
using de Bruijn graphs. Bioinformatics 2017, 33(6):799–806.
Bao E , Xie F , Song C , Song D . FLAS: fast and high-throughput algorithm for PacBio long-
read self-correction. Bioinformatics 2019, 35(20):3953–3960.
Berlin K , Koren S , Chin CS , Drake JP , Landolin JM , Phillippy AM. Assembling large
genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015,
33(6):623–630.
Lander ES , Waterman MS . Genomic mapping by fingerprinting random clones: a mathematical
analysis. Genomics 1988, 2(3):231–239.
Warren RL , Sutton GG , Jones SJ , Holt RA . Assembling millions of short DNA sequences
using SSAKE. Bioinformatics 2007, 23(4):500–501.
Dohm JC , Lottaz C , Borodina T , Himmelbauer H . SHARCGS, a fast and highly accurate
short-read assembly algorithm for de novo genomic sequencing. Genome Res 2007,
17(11):1697–1706.
Jeck WR , Reinhardt JA , Baltrus DA , Hickenbotham MT , Magrini V , Mardis ER , Dangl JL ,
Jones CD . Extending assembly of short DNA sequences to handle error. Bioinformatics 2007,
23(21):2942–2944.
Hernandez D , Francois P , Farinelli L , Osteras M , Schrenzel J. De novo bacterial genome
sequencing: millions of very short reads assembled on a desktop computer. Genome Res 2008,
18(5):802–809.
Li H. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.
Bioinformatics 2012, 28(14):1838–1844.
Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
Bioinformatics 2016, 32(14):2103–2110.
Kamath GM , Shomorony I , Xia F , Courtade TA , Tse DN . HINGE: long-read assembly
achieves optimal repeat resolution. Genome Res 2017, 27(5):747–756.
Chin C-S , Khalak A . Human Genome Assembly in 100 Minutes. bioRxiv 2019, doi:
https://doi.org/10.1101/705616
Shafin K , Pesout T , Lorig-Roach R , Haukness M , Olsen HE , Bosworth C , Armstrong J ,
Tigyi K , Maurer N , Koren S et al. Nanopore sequencing and the Shasta toolkit enable efficient
de novo assembly of eleven human genomes. Nat Biotechnol 2020, 38(9):1044–1053.
Vaser R , Šikić M . Time- and memory-efficient genome assembly with Raven. Nat Comput Sci
2021, 1(5):332–336.
NextDenovo (https://github.com/Nextomics/NextDenovo)
Liu H , Wu S , Li A , Ruan J. SMARTdenovo: a de novo assembler using long noisy reads.
Gigabyte 2021:1–9.
Nurk S , Walenz BP , Rhie A , Vollger MR , Logsdon GA , Grothe R , Miga KH , Eichler EE ,
Phillippy AM , Koren S . HiCanu: accurate assembly of segmental duplications, satellites, and
allelic variants from high-fidelity long reads. Genome Res 2020, 30(9):1291–1305.
Myers EW . Toward simplifying and accurately formulating fragment assembly. J Comput Biol
1995, 2(2):275–290.
Gonnella G , Kurtz S. Readjoiner: a fast and memory efficient string graph-based sequence
assembler. BMC Bioinformatics 2012, 13:82.
Luo R , Liu B , Xie Y , Li Z , Huang W , Yuan J , He G , Chen Y , Pan Q , Liu Y et al.
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
GigaScience 2012, 1(1):18.
Nurk S , Bankevich A , Antipov D , Gurevich AA , Korobeynikov A , Lapidus A , Prjibelski AD ,
Pyshkin A , Sirotkin A , Sirotkin Y et al. Assembling single-cell genomes and mini-metagenomes
from chimeric MDA products. J Comput Biol 2013, 20(10):714–737.
Kolmogorov M , Yuan J , Lin Y , Pevzner PA . Assembly of long, error-prone reads using repeat
graphs. Nat Biotechnol 2019, 37(5):540–546.
Ruan J , Li H . Fast and accurate long-read assembly with wtdbg2. Nat Methods 2020,
17(2):155–158.
Zimin AV , Marcais G , Puiu D , Roberts M , Salzberg SL , Yorke JA . The MaSuRCA genome
assembler. Bioinformatics 2013, 29(21):2669–2677.
Ye C , Hill CM , Wu S , Ruan J , Ma ZS . DBG2OLC: Efficient Assembly of Large Genomes
Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci Rep 2016,
6:31900.
Di Genova A , Buena-Atienza E , Ossowski S , Sagot MF . Efficient hybrid de novo assembly of
human genomes with WENGAN. Nat Biotechnol 2021, 39(4):422–430.
Walker BJ , Abeel T , Shea T , Priest M , Abouelliel A , Sakthikumar S , Cuomo CA , Zeng Q ,
Wortman J , Young SK et al. Pilon: an integrated tool for comprehensive microbial variant
detection and genome assembly improvement. PLoS One 2014, 9(11):e112963.
Medaka (https://github.com/nanoporetech/medaka)
Vaser R , Sovic I , Nagarajan N , Sikic M . Fast and accurate de novo genome assembly from
long uncorrected reads. Genome Res 2017, 27(5):737–746.
Loman NJ , Quick J , Simpson JT . A complete bacterial genome assembled de novo using only
nanopore sequencing data. Nat Methods 2015, 12(8):733–735.
Hu J , Fan J , Sun Z , Liu S . NextPolish: a fast and efficient genome polishing tool for long-read
assembly. Bioinformatics 2020, 36(7):2253–2255.
Zimin AV , Salzberg SL . The genome polishing tool POLCA makes fast and accurate
corrections in genome assemblies. PLoS Comput Biol 2020, 16(6):e1007981.
Huang N , Nie F , Ni P , Luo F , Gao X , Wang J . NeuralPolish: a novel Nanopore polishing
method based on alignment matrix construction and orthogonal Bi-GRU Networks.
Bioinformatics 2021, 37(19):3120–3127.
Boetzer M , Pirovano W . SSPACE-LongRead: scaffolding bacterial draft genomes using long
read sequence information. BMC Bioinformatics 2014, 15:211.
Warren RL , Yang C , Vandervalk BP , Behsaz B , Lagman A , Jones SJ , Birol I . LINKS:
Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience 2015, 4:35.
Gao S , Bertrand D , Chia BK , Nagarajan N. OPERA-LG: efficient and exact scaffolding of
large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biol 2016,
17:102.
Qin M , Wu S , Li A , Zhao F , Feng H , Ding L , Ruan J. LRScaf: improving draft genomes using
long noisy reads. BMC Genomics 2019, 20(1):955.
SMIS (Single Molecular Integrative Scaffolding) (www.sanger.ac.uk/tool/smis/)
Nguyen SH , Cao MD , Coin LJM . Real-time resolution of short-read assembly graph using
ONT long reads. PLoS Comput Biol 2021, 17(1):e1008586.
Putnam NH , O'Connell BL , Stites JC , Rice BJ , Blanchette M , Calef R , Troll CJ , Fields A ,
Hartley PD , Sugnet CW et al. Chromosome-scale shotgun assembly using an in vitro method
for long-range linkage. Genome Res 2016, 26(3):342–350.
Dudchenko O , Batra SS , Omer AD , Nyquist SK , Hoeger M , Durand NC , Shamim MS ,
Machol I , Lander ES , Aiden AP et al . De novo assembly of the Aedes aegypti genome using
Hi-C yields chromosome-length scaffolds. Science 2017, 356(6333):92–95.
Ghurye J , Rhie A , Walenz BP , Schmitt A , Selvaraj S , Pop M , Phillippy AM , Koren S.
Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput
Biol 2019, 15(8):e1007273.
Yeo S , Coombe L , Warren RL , Chu J , Birol I . ARCS: scaffolding genome drafts with linked
reads. Bioinformatics 2018, 34(5):725–731.
Kuleshov V , Snyder MP , Batzoglou S . Genome assembly from synthetic long read clouds.
Bioinformatics 2016, 32(12):i216–i224.
Guo L , Xu M , Wang W , Gu S , Zhao X , Chen F , Wang O , Xu X , Seim I , Fan G et al. SLR-
superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom
scheme. BMC Bioinformatics 2021, 22(1):158.
Kosugi S , Hirakawa H , Tabata S . GMcloser: closing gaps in assemblies accurately with a
likelihood-based selection of contig or long-read alignments. Bioinformatics 2015,
31(23):3733–3741.
Paulino D , Warren RL , Vandervalk BP , Raymond A , Jackman SD , Birol I . Sealer: a scalable
gap-closing application for finishing draft genomes. BMC Bioinformatics 2015, 16:230.
Xu GC , Xu TJ , Zhu R , Zhang Y , Li SQ , Wang HW , Li JT . LR_Gapcloser: a tiling path-based
gap closer that uses long reads to complete genome assembly. GigaScience 2019, 8(1):giy157.
Xu M , Guo L , Gu S , Wang O , Zhang R , Peters BA , Fan G , Liu X , Xu X , Deng L et al. TGS-
GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone
long reads. GigaScience 2020, 9(9):giaa094.
Zimin AV , Salzberg SL . The SAMBA tool uses long reads to improve the contiguity of genome
assemblies. PLoS Comput Biol 2022, 18(2):e1009860.
Schmeing S , Robinson MD . Gapless provides combined scaffolding, gap filling and assembly
correction with long reads. bioRxiv 2022, doi: https://doi.org/10.1101/2022.03.08.483466
Simao FA , Waterhouse RM , Ioannidis P , Kriventseva EV , Zdobnov EM . BUSCO: assessing
genome assembly and annotation completeness with single-copy orthologs. Bioinformatics
2015, 31(19):3210–3212.
Gurevich A , Saveliev V , Vyahhi N , Tesler G. QUAST: quality assessment tool for genome
assemblies. Bioinformatics 2013, 29(8):1072–1075.
Alonge M , Soyk S , Ramakrishnan S , Wang X , Goodwin S , Sedlazeck FJ , Lippman ZB ,
Schatz MC . RaGOO: fast and accurate reference-guided scaffolding of draft genomes.
Genome Biol 2019, 20(1):224.
Tamazian G , Dobrynin P , Krasheninnikova K , Komissarov A , Koepfli KP , O'Brien SJ .
Chromosomer: a reference-based genome arrangement tool for producing draft chromosome
sequences. GigaScience 2016, 5(1):38.
Kim J , Larkin DM , Cai Q , Asan, Zhang Y , Ge RL , Auvil L , Capitanu B , Zhang G , Lewin HA
et al. Reference-assisted chromosome assembly. Proc Natl Acad Sci U S A 2013,
110(5):1785–1790.
Nurk S , Koren S , Rhie A , Rautiainen M , Bzikadze AV , Mikheenko A , Vollger MR , Altemose
N , Uralsky L , Gershman A et al. The complete sequence of a human genome. Science 2022,
376(6588):44–53.

Mapping Protein-DNA Interactions with ChIP-Seq

Chen Y , Negre N , Li Q , Mieczkowska JO , Slattery M , Liu T , Zhang Y , Kim TK , He HH ,
Zieba J et al. Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 2012,
9(6):609–614.
Visa N , Jordan-Pla A . ChIP and ChIP-Related Techniques: Expanding the fields of application
and improving ChIP performance. Methods Mol Biol 2018, 1689:1–7.
Skene PJ , Henikoff JG , Henikoff S . Targeted in situ genome-wide profiling with high efficiency
for low cell numbers. Nat Protoc 2018, 13(5):1006–1019.
Meyer CA , Liu XS . Identifying and mitigating bias in next-generation sequencing methods for
chromatin biology. Nat Rev Genet 2014, 15(11):709–721.
Jordán-Pla A , Visa N. Considerations on experimental design and data analysis of chromatin
immunoprecipitation experiments. Methods Mol Biol 2018, 1689:9–28.
Daley T , Smith AD . Predicting the molecular complexity of sequencing libraries. Nat Methods
2013, 10(4):325–327.
ENCODE Software Tools (www.encodeproject.org/software/)
Irreproducible Discovery Rate (IDR) (www.encodeproject.org/software/idr/)
Diaz A , Nellore A , Song JS . CHANCE: comprehensive software for quality control and
validation of ChIP-seq data. Genome Biol 2012, 13(10):R98.
Ramirez F , Ryan DP , Gruning B , Bhardwaj V , Kilpert F , Richter AS , Heyne S , Dundar F ,
Manke T . deepTools2: a next generation web server for deep-sequencing data analysis.
Nucleic Acids Res 2016, 44(W1):W160–165.
Nakato R , Shirahige K . Sensitive and robust assessment of ChIP-seq read distribution using a
strand-shift profile. Bioinformatics 2018, 34(14):2356–2363.
Landt SG , Marinov GK , Kundaje A , Kheradpour P , Pauli F , Batzoglou S , Bernstein BE ,
Bickel P , Brown JB , Cayting P et al. ChIP-seq guidelines and practices of the ENCODE and
modENCODE consortia. Genome Res 2012, 22(9):1813–1831.
Zhang Y , Liu T , Meyer CA , Eeckhoute J , Johnson DS , Bernstein BE , Nusbaum C , Myers
RM , Brown M , Li W et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008,
9(9):R137.
Kharchenko PV , Tolstorukov MY , Park PJ . Design and analysis of ChIP-seq experiments for
DNA-binding proteins. Nat Biotechnol 2008, 26(12):1351–1359.
Rozowsky J , Euskirchen G , Auerbach RK , Zhang ZD , Gibson T , Bjornson R , Carriero N ,
Snyder M , Gerstein MB . PeakSeq enables systematic scoring of ChIP-seq experiments
relative to controls. Nat Biotechnol 2009, 27(1):66–75.
Heinz S , Benner C , Spann N , Bertolino E , Lin YC , Laslo P , Cheng JX , Murre C , Singh H ,
Glass CK . Simple combinations of lineage-determining transcription factors prime cis-
regulatory elements required for macrophage and B cell identities. Mol Cell 2010,
38(4):576–589.
Zang C , Schones DE , Zeng C , Cui K , Zhao K , Peng W . A clustering approach for
identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics
2009, 25(15):1952–1958.
Ibrahim MM , Lacadie SA , Ohler U . JAMM: a peak finder for joint analysis of NGS replicates.
Bioinformatics 2015, 31(1):48–55.
Oh D , Strattan JS , Hur JK , Bento J , Urban AE , Song G , Cherry JM . CNN-Peaks: ChIP-Seq
peak detection pipeline using convolutional neural networks that imitate human visual
inspection. Sci Rep 2020, 10(1):7933.
Hentges LD , Sergeant MJ , Cole CB , Downes DJ , Hughes JR , Taylor S . LanceOtron: a deep
learning peak caller for genome sequencing experiments. Bioinformatics 2022,
38(18):4255–4263.
Ji H , Jiang H , Ma W , Johnson DS , Myers RM , Wong WH . An integrated software system for
analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–1300.
Jothi R , Cuddapah S , Barski A , Cui K , Zhao K . Genome-wide identification of in vivo protein-
DNA binding sites from ChIP-Seq data. Nucleic Acids Res 2008, 36(16):5221–5231.
Feng X , Grossman R , Stein L . PeakRanger: a cloud-enabled peak caller for ChIP-seq data.
BMC Bioinformatics 2011, 12:139.
Rashid NU , Giresi PG , Ibrahim JG , Sun W , Lieb JD . ZINBA integrates local covariates with
DNA-seq data to identify broad and narrow regions of enrichment, even within amplified
genomic regions. Genome Biol 2011, 12(7):R67.
Carroll TS , Liang Z , Salama R , Stark R , de Santiago I . Impact of artifact removal on ChIP
quality metrics in ChIP-seq and ChIP-exo data. Front Genet 2014, 5:75.
Amemiya HM , Kundaje A , Boyle AP . The ENCODE Blacklist: Identification of Problematic
Regions of the Genome. Sci Rep 2019, 9(1):9354.
Chen X , Xu H , Yuan P , Fang F , Huss M , Vega VB , Wong E , Orlov YL , Zhang W , Jiang J
et al. Integration of external signaling pathways with the core transcriptional network in
embryonic stem cells. Cell 2008, 133(6):1106–1117.
Shen L , Shao NY , Liu X , Maze I , Feng J , Nestler EJ . diffReps: detecting differential
chromatin modification sites from ChIP-seq data with biological replicates. PLoS One 2013,
8(6):e65598.
Manser P , Reimers M. A simple scaling normalization for comparing ChIP-Seq samples. PeerJ
PrePrints 2014, 1.
Tu S , Li M , Chen H , Tan F , Xu J , Waxman DJ , Zhang Y , Shao Z . MAnorm2 for
quantitatively comparing groups of ChIP-seq samples. Genome Res 2021, 31(1):131–145.
Nair NU , Sahu AD , Bucher P , Moret BM . ChIPnorm: a statistical method for normalizing and
identifying differential regions in histone modification ChIP-seq libraries. PLoS One 2012,
7(8):e39573.
Taslim C , Wu J , Yan P , Singer G , Parvin J , Huang T , Lin S , Huang K . Comparative study
on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics 2009,
25(18):2334–2340.
Orlando DA , Chen MW , Brown VE , Solanki S , Choi YJ , Olson ER , Fritz CC , Bradner JE ,
Guenther MG . Quantitative ChIP-Seq normalization reveals global modulation of the
epigenome. Cell Rep 2014, 9(3):1163–1170.
Lun AT , Smyth GK . csaw: a Bioconductor package for differential binding analysis of ChIP-seq
data using sliding windows. Nucleic Acids Res 2016, 44(5):e45.
Zhang Y , Lin YH , Johnson TD , Rozek LS , Sartor MA . PePr: a peak-calling prioritization
pipeline to identify consistent or differential peaks from replicated ChIP-Seq data. Bioinformatics
2014, 30(18):2568–2575.
Xu H , Wei CL , Lin F , Sung WK . An HMM approach to genome-wide identification of
differential histone modification sites from ChIP-seq data. Bioinformatics 2008,
24(20):2344–2349.
Allhoff M, Sere K, Pires JF, Zenke M, Costa IG. Differential peak calling of ChIP-seq signals with
replicates with THOR. Nucleic Acids Res 2016, 44(20):e153.
Stark R, Brown G. DiffBind: differential binding analysis of ChIP-Seq peak data. R package;
2011.
Liang K , Keles S . Detecting differential binding of transcription factors with ChIP-seq.
Bioinformatics 2012, 28(1):121–122.
Steinhauser S , Kurzawa N , Eils R , Herrmann C . A comprehensive comparison of tools for
differential ChIP-seq analysis. Brief Bioinform 2016, 17(6):953–966.
Eder T , Grebien F . Comprehensive assessment of differential ChIP-seq tools guides optimal
algorithm selection. Genome Biol 2022, 23(1):119.
Chen L , Wang C , Qin ZS , Wu H . A novel statistical method for quantitative comparison of
multiple ChIP-seq datasets. Bioinformatics 2015.
Taslim C , Huang T , Lin S. DIME: R-package for identifying differential ChIP-seq based on an
ensemble of mixture models. Bioinformatics 2011, 27(11):1569–1570.
Schweikert G, Kuo D. MMDiff2: statistical testing for ChIP-Seq data sets. R package
version 1.24.0; 2022.
Song Q , Smith AD . Identifying dispersed epigenomic domains from ChIP-Seq data.
Bioinformatics 2011, 27(6):870–871.
Yu G , Wang LG , He QY . ChIPseeker: an R/Bioconductor package for ChIP peak annotation,
comparison and visualization. Bioinformatics 2015, 31(14):2382–2383.
McLean CY , Bristor D , Hiller M , Clarke SL , Schaar BT , Lowe CB , Wenger AM , Bejerano G .
GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010,
28(5):495–501.
Zhu LJ , Gazin C , Lawson ND , Pages H , Lin SM , Lapointe DS , Green MR . ChIPpeakAnno:
a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics 2010,
11:237.
Welch RP , Lee C , Imbriano PM , Patil S , Weymouth TE , Smith RA , Scott LJ , Sartor MA .
ChIP-Enrich: gene set enrichment testing for ChIP-seq data. Nucleic Acids Res 2014.
Liu T , Ortiz JA , Taing L , Meyer CA , Lee B , Zhang Y , Shin H , Wong SS , Ma J , Lei Y et al.
Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol 2011,
12(8):R83.
Machanick P , Bailey TL . MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics
2011, 27(12):1696–1697.
Bailey TL , Johnson J , Grant CE , Noble WS . The MEME Suite. Nucleic Acids Res 2015,
43(W1):W39–49.
Droit A, Gottardo R, Robertson G, Li L. rGADEM: de novo motif discovery. R package version
2.44.0; 2022.
Thomas-Chollier M , Herrmann C , Defrance M , Sand O , Thieffry D , van Helden J . RSAT
peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 2012, 40(4):e31.
Castro-Mondragon JA , Riudavets-Puig R , Rauluseviciute I , Lemma RB , Turchi L , Blanc-
Mathieu R , Lucas J , Boddie P , Khan A , Manosalva Perez N et al . JASPAR 2022: the 9th
release of the open-access database of transcription factor binding profiles. Nucleic Acids Res
2022, 50(D1):D165–D173.
Mahony S , Benos PV . STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic
Acids Res 2007, 35(Web Server issue):W253–258.
Gupta S , Stamatoyannopoulos JA , Bailey TL , Noble WS . Quantifying similarity between
motifs. Genome Biol 2007, 8(2):R24.
McLeay RC , Bailey TL . Motif Enrichment Analysis: a unified framework and an evaluation on
ChIP data. BMC Bioinformatics 2010, 11:165.
Bailey TL , Machanick P . Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res 2012,
40(17):e128.
Bailey TL , Grant CE . SEA: Simple Enrichment Analysis of motifs. bioRxiv 2021, doi:
https://doi.org/10.1101/2021.08.23.457422.
Grant CE , Bailey TL , Noble WS . FIMO: scanning for occurrences of a given motif.
Bioinformatics 2011, 27(7):1017–1018.
The MEME Suite (http://meme.nbcr.net/meme/)
Ernst J , Kellis M . Discovery and characterization of chromatin states for systematic annotation
of the human genome. Nat Biotechnol 2010, 28(8):817–825.
Klein HU , Schafer M , Porse BT , Hasemann MS , Ickstadt K , Dugas M . Integrative analysis of
histone ChIP-seq and transcription data using Bayesian mixture models. Bioinformatics 2014,
30(8):1154–1162.
Schafer M , Klein HU , Schwender H . Integrative analysis of multiple genomic variables using a
hierarchical Bayesian model. Bioinformatics 2017, 33(20):3220–3227.
Wang S , Sun H , Ma J , Zang C , Wang C , Wang J , Tang Q , Meyer CA , Zhang Y , Liu XS .
Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols
2013, 8(12):2502–2515.
Shin H , Liu T , Manrai AK , Liu XS . CEAS: cis-regulatory element annotation system.
Bioinformatics 2009, 25(19):2605–2606.

Epigenomics by DNA Methylation Sequencing


Seiler Vellame D , Castanho I , Dahir A , Mill J , Hannon E . Characterizing the properties of
bisulfite sequencing data: maximizing power and sensitivity to identify between-group
differences in DNA methylation. BMC Genomics 2021, 22(1):446.
Standards and Guidelines for Whole Genome Shotgun Bisulfite Sequencing
(www.roadmapepigenomics.org/protocols)
Ziller MJ , Hansen KD , Meissner A , Aryee MJ . Coverage recommendations for methylation
analysis by whole-genome bisulfite sequencing. Nat Methods 2015, 12(3):230–232.
Meissner A , Gnirke A , Bell GW , Ramsahoye B , Lander ES , Jaenisch R . Reduced
representation bisulfite sequencing for comparative high-resolution DNA methylation analysis.
Nucleic Acids Res 2005, 33(18):5868–5877.
Sun Z , Cunningham J , Slager S , Kocher JP . Base resolution methylome profiling:
considerations in platform selection, data preprocessing and analysis. Epigenomics 2015,
7(5):813–828.
Nautiyal S , Carlton VE , Lu Y , Ireland JS , Flaucher D , Moorhead M , Gray JW , Spellman P ,
Mindrinos M , Berg P et al. High-throughput method for analyzing methylation of CpGs in
targeted genomic regions. Proc Natl Acad Sci U S A 2010, 107(28):12587–12592.
Varley KE , Mitra RD . Bisulfite Patch PCR enables multiplexed sequencing of promoter
methylation across cancer samples. Genome Res 2010, 20(9):1279–1287.
Deng J , Shoemaker R , Xie B , Gore A , LeProust EM , Antosiewicz-Bourget J , Egli D ,
Maherali N , Park IH , Yu J et al. Targeted bisulfite sequencing reveals changes in DNA
methylation associated with nuclear reprogramming. Nat Biotechnol 2009, 27(4):353–360.
Ivanov M , Kals M , Kacevska M , Metspalu A , Ingelman-Sundberg M , Milani L . In-solution
hybrid capture of bisulfite-converted DNA for targeted bisulfite sequencing of 174 ADME genes.
Nucleic Acids Res 2013, 41(6):e72.
Liu MC, Oxnard GR, Klein EA, Swanton C, Seiden MV, CCGA Consortium. Sensitive and
specific multi-cancer detection and localization using methylation signatures in cell-free DNA.
Ann Oncol 2020, 31(6):745–759.
Han Y , Zheleznyakova GY , Marincevic-Zuniga Y , Kakhki MP , Raine A , Needhamsen M ,
Jagodic M . Comparison of EM-seq and PBAT methylome library methods for low-input DNA.
Epigenetics 2021, 17(10):1195–1204.
Vaisvila R , Ponnaluri VKC , Sun Z , Langhorst BW , Saleh L , Guan S , Dai N , Campbell MA ,
Sexton BS , Marks K et al. Enzymatic methyl sequencing detects DNA methylation at single-
base resolution from picograms of DNA. Genome Res 2021, 31(7):1280–1289.
Feng S , Zhong Z , Wang M , Jacobsen SE . Efficient and accurate determination of genome-
wide DNA methylation patterns in Arabidopsis thaliana with enzymatic methyl sequencing.
Epigenetics Chromatin 2020, 13(1):42.
Sun Z, Vaisvila R, Hussong LM, Yan B, Baum C, Saleh L, Samaranayake M, Guan S, Dai N,
Correa IR Jr et al. Nondestructive enzymatic deamination enables single-molecule long-read
amplicon sequencing for the determination of 5-methylcytosine and 5-hydroxymethylcytosine
at single-base resolution. Genome Res 2021, 31(2):291–300.
Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, Downey SL, Johnson BE, Fouse SD,
Delaney A, Zhao Y et al. Comparison of sequencing-based methods to profile DNA
methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010,
28(10):1097–1105.
Nair SS , Coolen MW , Stirzaker C , Song JZ , Statham AL , Strbenac D , Robinson MD , Clark
SJ . Comparison of methyl-DNA immunoprecipitation (MeDIP) and methyl-CpG binding domain
(MBD) protein capture for genome-wide DNA methylation analysis reveal CpG sequence
coverage bias. Epigenetics 2011, 6(1):34–44.
Rodriguez-Aguilera JR , Ecsedi S , Goldsmith C , Cros MP , Dominguez-Lopez M , Guerrero-
Celis N , Perez-Cabeza de Vaca R , Chemin I , Recillas-Targa F , Chagoya de Sanchez V et al.
Genome-wide 5-hydroxymethylcytosine (5hmC) emerges at early stage of in vitro differentiation
of a putative hepatocyte progenitor. Sci Rep 2020, 10(1):7822.
Yu M , Han D , Hon GC , He C . Tet-Assisted Bisulfite Sequencing (TAB-seq). Methods Mol Biol
2018, 1708:645–663.
Booth MJ , Branco MR , Ficz G , Oxley D , Krueger F , Reik W , Balasubramanian S .
Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base
resolution. Science 2012, 336(6083):934–937.
Booth MJ , Marsico G , Bachman M , Beraldi D , Balasubramanian S . Quantitative sequencing
of 5-formylcytosine in DNA at single-base resolution. Nat Chem 2014, 6(5):435–440.
Liu Y , Hu Z , Cheng J , Siejka-Zielinska P , Chen J , Inoue M , Ahmed AA , Song CX .
Subtraction-free and bisulfite-free specific sequencing of 5-methylcytosine and its oxidized
derivatives at base resolution. Nat Commun 2021, 12(1):618.
Flusberg BA , Webster DR , Lee JH , Travers KJ , Olivares EC , Clark TA , Korlach J , Turner
SW . Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat
Methods 2010, 7(6):461–465.
Laszlo AH , Derrington IM , Brinkerhoff H , Langford KW , Nova IC , Samson JM , Bartlett JJ ,
Pavlenok M , Gundlach JH . Detection and mapping of 5-methylcytosine and 5-
hydroxymethylcytosine with nanopore MspA. Proc Natl Acad Sci U S A 2013,
110(47):18904–18909.
Schreiber J , Wescoe ZL , Abu-Shumays R , Vivian JT , Baatar B , Karplus K , Akeson M . Error
rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine
along individual DNA strands. Proc Natl Acad Sci U S A 2013, 110(47):18910–18915.
Tse OYO, Jiang P, Cheng SH, Peng W, Shang H, Wong J, Chan SL, Poon LCY, Leung TY,
Chan KCA et al. Genome-wide detection of cytosine methylation by single molecule real-time
sequencing. Proc Natl Acad Sci U S A 2021, 118(5):e2019768118.
Rand AC , Jain M , Eizenga JM , Musselman-Brown A , Olsen HE , Akeson M , Paten B .
Mapping DNA methylation with high-throughput nanopore sequencing. Nat Methods 2017,
14(4):411–413.
Trim Galore! (www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
Hansen KD , Langmead B , Irizarry RA . BSmooth: from whole genome bisulfite sequencing
reads to differentially methylated regions. Genome Biol 2012, 13(10):R83.
Liang F , Tang B , Wang Y , Wang J , Yu C , Chen X , Zhu J , Yan J , Zhao W , Li R . WBSA:
web service for bisulfite sequencing data analysis. PLoS One 2014, 9(1):e86707.
Xi Y , Li W . BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics
2009, 10:232.
Frith MC , Mori R , Asai K. A mostly traditional approach improves alignment of bisulfite-
converted DNA. Nucleic Acids Res 2012, 40(13):e100.
Wu TD , Nacu S . Fast and SNP-tolerant detection of complex variants and splicing in short
reads. Bioinformatics 2010, 26(7):873–881.
Xi Y , Bock C , Muller F , Sun D , Meissner A , Li W . RRBSMAP: a fast, accurate and user-
friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012,
28(3):430–432.
Krueger F , Andrews SR . Bismark: a flexible aligner and methylation caller for Bisulfite-Seq
applications. Bioinformatics 2011, 27(11):1571–1572.
Pedersen BS, Eyring K, De S, Yang IV, Schwartz DA. Fast and accurate alignment of long
bisulfite-seq reads. arXiv preprint arXiv:1401.1129; 2014.
Huang KYY , Huang YJ , Chen PY . BS-Seeker3: ultrafast pipeline for bisulfite sequencing.
BMC Bioinformatics 2018, 19(1):111.
Guo W , Fiziev P , Yan W , Cokus S , Sun X , Zhang MQ , Chen PY , Pellegrini M . BS-Seeker2:
a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics 2013, 14:774.
Kunde-Ramamoorthy G , Coarfa C , Laritsky E , Kessler NJ , Harris RA , Xu M , Chen R , Shen
L , Milosavljevic A , Waterland RA . Comparison and quantitative verification of mapping
algorithms for whole-genome bisulfite sequencing. Nucleic Acids Res 2014, 42(6):e43.
Nunn A, Otto C, Stadler PF, Langenberger D. Comprehensive benchmarking of software for
mapping whole genome bisulfite data: from read alignment to DNA methylation analysis. Brief
Bioinform 2021, 22(5):bbab021.
Sun X, Han Y, Zhou L, Chen E, Lu B, Liu Y, Pan X, Cowley AW Jr, Liang M, Wu Q et al.
A comprehensive evaluation of alignment software for reduced representation bisulfite
sequencing data. Bioinformatics 2018, 34(16):2715–2723.
Zhou Q , Lim JQ , Sung WK , Li G. An integrated package for bisulfite DNA methylation data
analysis with indel-sensitive mapping. BMC Bioinformatics 2019, 20(1):47.
Bock C. Analysing and interpreting DNA methylation data. Nat Rev Genet 2012,
13(10):705–719.
Lin X , Sun D , Rodriguez B , Zhao Q , Sun H , Zhang Y , Li W . BSeQC: quality control of
bisulfite sequencing experiments. Bioinformatics 2013, 29(24):3227–3229.
MethylDackel (https://github.com/dpryan79/MethylDackel)
Akalin A , Kormaksson M , Li S , Garrett-Bakelman FE , Figueroa ME , Melnick A , Mason CE .
methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation
profiles. Genome Biol 2012, 13(10):R87.
Qu J , Zhou M , Song Q , Hong EE , Smith AD . MLML: consistent simultaneous estimates of
DNA methylation and hydroxymethylation. Bioinformatics 2013, 29(20):2645–2646.
MethPipe (https://github.com/smithlabcode/methpipe/)
Smith ZD , Chan MM , Mikkelsen TS , Gu H , Gnirke A , Regev A , Meissner A . A unique
regulatory phase of DNA methylation in the early mammalian embryo. Nature 2012,
484(7394):339–344.
Liu Y , Siegmund KD , Laird PW , Berman BP . Bis-SNP: Combined DNA methylation and SNP
calling for Bisulfite-seq data. Genome Biol 2012, 13(7):R61.
Washington University EpiGenome Browser (http://epigenomegateway.wustl.edu/browser/)
Kent WJ , Zweig AS , Barber G , Hinrichs AS , Karolchik D . BigWig and BigBed: enabling
browsing of large distributed datasets. Bioinformatics 2010, 26(17):2204–2207.
Dorff KC , Chambwe N , Zeno Z , Simi M , Shaknovich R , Campagne F . GobyWeb: simplified
management and analysis of gene expression and DNA methylation sequencing data. PLoS
One 2013, 8(7):e69666.
Muller F , Scherer M , Assenov Y , Lutsik P , Walter J , Lengauer T , Bock C . RnBeads 2.0:
comprehensive analysis of DNA methylation data. Genome Biol 2019, 20(1):55.
Assenov Y , Muller F , Lutsik P , Walter J , Lengauer T , Bock C . Comprehensive analysis of
DNA methylation data with RnBeads. Nat Methods 2014, 11(11):1138–1140.
Jiang P , Sun K , Lun FM , Guo AM , Wang H , Chan KC , Chiu RW , Lo YM , Sun H . Methy-
Pipe: an integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis.
PLoS One 2014, 9(6):e100360.
Sun D , Xi Y , Rodriguez B , Park HJ , Tong P , Meong M , Goodell MA , Li W . MOABS: model
based analysis of bisulfite sequencing data. Genome Biol 2014, 15(2):R38.
Zhang Y , Liu H , Lv J , Xiao X , Zhu J , Liu X , Su J , Li X , Wu Q , Wang F et al. QDMR: a
quantitative method for identification of differentially methylated regions by entropy. Nucleic
Acids Res 2011, 39(9):e58.
Piao Y , Xu W , Park KH , Ryu KH , Xiang R . Comprehensive evaluation of differential
methylation analysis methods for bisulfite sequencing data. Int J Environ Res Public Health
2021, 18(15):7975.
Liu Y , Han Y , Zhou L , Pan X , Sun X , Liu Y , Liang M , Qin J , Lu Y , Liu P . A comprehensive
evaluation of computational tools to identify differential methylation regions using RRBS data.
Genomics 2020, 112(6):4567–4576.
Statham AL , Strbenac D , Coolen MW , Stirzaker C , Clark SJ , Robinson MD . Repitools: an R
package for the analysis of enrichment-based epigenomic data. Bioinformatics 2010,
26(13):1662–1663.
Park Y , Figueroa ME , Rozek LS , Sartor MA . MethylSig: a whole genome DNA methylation
analysis pipeline. Bioinformatics 2014, 30(17):2414–2422.
Dolzhenko E , Smith AD . Using beta-binomial regression for high-precision differential
methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC
Bioinformatics 2014, 15:215.
Gaspar JM , Hart RP . DMRfinder: efficiently identifying differentially methylated regions from
MethylC-seq data. BMC Bioinformatics 2017, 18(1):528.
Juhling F , Kretzmer H , Bernhart SH , Otto C , Stadler PF , Hoffmann S . metilene: fast and
sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res
2016, 26(2):256–262.
Feng H , Conneely KN , Wu H . A Bayesian hierarchical model to detect differentially
methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014,
42(8):e69.
Halachev K , Bast H , Albrecht F , Lengauer T , Bock C. EpiExplorer: live exploration and global
analysis of large epigenomic datasets. Genome Biol 2012, 13(10):R96.
McLean CY , Bristor D , Hiller M , Clarke SL , Schaar BT , Lowe CB , Wenger AM , Bejerano G .
GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010,
28(5):495–501.

Whole Metagenome Sequencing for Microbial Community Analysis


Lloyd KG , Steen AD , Ladau J , Yin J , Crosby L . Phylogenetically novel uncultured microbial
cells dominate earth microbiomes. mSystems 2018, 3(5):e00055-18.
Yarza P , Ludwig W , Euzeby J , Amann R , Schleifer KH , Glockner FO , Rossello-Mora R .
Update of the All-Species Living Tree Project based on 16S and 23S rRNA sequence analyses.
Syst Appl Microbiol 2010, 33(6):291–299.
Delmont TO , Robe P , Clark I , Simonet P , Vogel TM . Metagenomic comparison of direct and
indirect soil DNA extraction approaches. J Microbiol Methods 2011, 86(3):397–400.
McIver LJ , Abu-Ali G , Franzosa EA , Schwager R , Morgan XC , Waldron L , Segata N ,
Huttenhower C . bioBakery: a meta’omic analysis environment. Bioinformatics 2018,
34(7):1235–1237.
BMTagger (http://biowulf.nih.gov/apps/bmtagger.html)
Schmieder R , Edwards R . Fast identification and removal of sequence contamination from
genomic and metagenomic datasets. PLoS One 2011, 6(3):e17288.
Bushnell B. BBMap: a fast, accurate, splice-aware aligner. Lawrence Berkeley National
Lab. (LBNL), Berkeley, CA (United States); 2014.
Xu H , Luo X , Qian J , Pang X , Song J , Qian G , Chen J , Chen S . FastUniq: a fast de novo
duplicates removal tool for paired short reads. PLoS One 2012, 7(12):e52249.
Nayfach S , Shi ZJ , Seshadri R , Pollard KS , Kyrpides NC . New insights from uncultivated
genomes of the global human gut microbiome. Nature 2019, 568(7753):505–510.
Almeida A , Mitchell AL , Boland M , Forster SC , Gloor GB , Tarkowska A , Lawley TD , Finn
RD . A new genomic blueprint of the human gut microbiota. Nature 2019, 568(7753):499–504.
Nayfach S , Roux S , Seshadri R , Udwary D , Varghese N , Schulz F , Wu D , Paez-Espino D ,
Chen IM , Huntemann M et al . A genomic catalog of Earth’s microbiomes. Nat Biotechnol 2021,
39(4):499–509.
Kolmogorov M , Bickhart DM , Behsaz B , Gurevich A , Rayko M , Shin SB , Kuhn K , Yuan J ,
Polevikov E , Smith TPL et al. metaFlye: scalable long-read metagenome assembly using
repeat graphs. Nat Methods 2020, 17(11):1103–1110.
Vaser R , Šikić M . Time- and memory-efficient genome assembly with Raven. Nat Comput Sci
2021, 1(5):332–336.
Koren S , Walenz BP , Berlin K , Miller JR , Bergman NH , Phillippy AM . Canu: scalable and
accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res
2017, 27(5):722–736.
Feng X , Cheng H , Portik D , Li H . Metagenome assembly of high-fidelity long reads with
Hifiasm-meta. Nat Methods 2022, 19(6):671–674.
Nurk S , Meleshko D , Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile
metagenomic assembler. Genome Res 2017, 27(5):824–834.
Bankevich A , Nurk S , Antipov D , Gurevich AA , Dvorkin M , Kulikov AS , Lesin VM , Nikolenko
SI , Pham S , Prjibelski AD et al. SPAdes: a new genome assembly algorithm and its
applications to single-cell sequencing. J Comput Biol 2012, 19(5):455–477.
Li D , Liu CM , Luo R , Sadakane K , Lam TW . MEGAHIT: an ultra-fast single-node solution for
large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015,
31(10):1674–1676.
Peng Y , Leung HC , Yiu SM , Chin FY . IDBA-UD: a de novo assembler for single-cell and
metagenomic sequencing data with highly uneven depth. Bioinformatics 2012,
28(11):1420–1428.
Namiki T , Hachiya T , Tanaka H , Sakakibara Y . MetaVelvet: an extension of Velvet assembler
to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012,
40(20):e155.
Afiahayati, Sato K , Sakakibara Y . MetaVelvet-SL: an extension of the Velvet assembler to a de
novo metagenomic assembler utilizing supervised learning. DNA Res 2015, 22(1):69–77.
Boisvert S , Raymond F , Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo
metagenome assembly and profiling. Genome Biol 2012, 13(12):R122.
Zimin AV , Marcais G , Puiu D , Roberts M , Salzberg SL , Yorke JA . The MaSuRCA genome
assembler. Bioinformatics 2013, 29(21):2669–2677.
Antipov D , Korobeynikov A , McLean JS , Pevzner PA . hybridSPAdes: an algorithm for hybrid
assembly of short and long reads. Bioinformatics 2016, 32(7):1009–1015.
Bertrand D , Shaw J , Kalathiyappan M , Ng AHQ , Kumar MS , Li C , Dvornicic M , Soldo JP ,
Koh JY , Tong C et al. Hybrid metagenomic assembly enables high-resolution analysis of
resistance determinants and mobile elements in human microbiomes. Nat Biotechnol 2019,
37(8):937–944.
Parks DH , Imelfort M , Skennerton CT , Hugenholtz P , Tyson GW . CheckM: assessing the
quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome
Res 2015, 25(7):1043–1055.
Mikheenko A , Saveliev V , Gurevich A . MetaQUAST: evaluation of metagenome assemblies.
Bioinformatics 2016, 32(7):1088–1090.
Simao FA , Waterhouse RM , Ioannidis P , Kriventseva EV , Zdobnov EM . BUSCO: assessing
genome assembly and annotation completeness with single-copy orthologs. Bioinformatics
2015, 31(19):3210–3212.
Mineeva O , Rojas-Carulla M , Ley RE , Scholkopf B , Youngblut ND . DeepMAsED: evaluating
the quality of metagenomic assemblies. Bioinformatics 2020, 36(10):3011–3017.
Clark SC , Egan R , Frazier PI , Wang Z . ALE: a generic assembly likelihood evaluation
framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics
2013, 29(4):435–443.
Koren S , Treangen TJ , Pop M . Bambus 2: scaffolding metagenomes. Bioinformatics 2011,
27(21):2964–2971.
Kang DD , Li F , Kirton E , Thomas A , Egan R , An H , Wang Z . MetaBAT 2: an adaptive
binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.
PeerJ 2019, 7:e7359.
Wu YW , Simmons BA , Singer SW . MaxBin 2.0: an automated binning algorithm to recover
genomes from multiple metagenomic datasets. Bioinformatics 2016, 32(4):605–607.
Imelfort M , Parks D , Woodcroft BJ , Dennis P , Hugenholtz P , Tyson GW . GroopM: an
automated tool for the recovery of population genomes from related metagenomes. PeerJ 2014,
2:e603.
Rosella (https://github.com/rhysnewell/rosella)
Alneberg J , Bjarnason BS , de Bruijn I , Schirmer M , Quick J , Ijaz UZ , Lahti L , Loman NJ ,
Andersson AF , Quince C . Binning metagenomic contigs by coverage and composition. Nat
Methods 2014, 11(11):1144–1146.
Strous M , Kraft B , Bisdorf R , Tegetmeyer HE . The binning of metagenomic contigs for
microbial physiology of mixed cultures. Front Microbiol 2012, 3:410.
Nissen JN , Johansen J , Allesoe RL , Sonderby CK , Armenteros JJA , Gronbech CH , Jensen
LJ , Nielsen HB , Petersen TN , Winther O et al. Improved metagenome binning and assembly
using deep variational autoencoders. Nat Biotechnol 2021, 39(5):555–560.
Wang Z , Huang P , You R , Sun F , Zhu S . MetaBinner: a high-performance and stand-alone
ensemble binning method to recover individual genomes from complex microbial communities.
Genome Biol 2023, 24(1):1.
Uritskiy GV , DiRuggiero J , Taylor J . MetaWRAP-a flexible pipeline for genome-resolved
metagenomic data analysis. Microbiome 2018, 6(1):158.
Sieber CMK , Probst AJ , Sharrar A , Thomas BC , Hess M , Tringe SG , Banfield JF . Recovery
of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat
Microbiol 2018, 3(7):836–843.
Meyer F , Fritz A , Deng ZL , Koslicki D , Lesker TR , Gurevich A , Robertson G , Alser M ,
Antipov D , Beghini F et al. Critical Assessment of Metagenome Interpretation: the second
round of challenges. Nat Methods 2022, 19(4):429–440.
Huson DH , Beier S , Flade I , Gorska A , El-Hadidi M , Mitra S , Ruscheweyh HJ , Tappu R .
MEGAN Community Edition – Interactive Exploration and Analysis of Large-Scale Microbiome
Sequencing Data. PLoS Comput Biol 2016, 12(6):e1004957.
Huson DH , Albrecht B , Bagci C , Bessarab I , Gorska A , Jolic D , Williams RBH . MEGAN-LR:
new algorithms allow accurate binning and easy interactive exploration of metagenomic long
reads and contigs. Biol Direct 2018, 13(1):6.
Wood DE , Lu J , Langmead B . Improved metagenomic analysis with Kraken 2. Genome Biol
2019, 20(1):257.
Gregor I , Droge J , Schirmer M , Quince C , McHardy AC . PhyloPythiaS+: a self-training
method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ
2016, 4:e1603.
Buchfink B , Xie C , Huson DH . Fast and sensitive protein alignment using DIAMOND. Nat
Methods 2015, 12(1):59–60.
Wood DE , Salzberg SL . Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biol 2014, 15(3):R46.
Hyatt D , Chen GL , Locascio PF , Land ML , Larimer FW , Hauser LJ . Prodigal: prokaryotic
gene recognition and translation initiation site identification. BMC Bioinformatics 2010, 11:119.
Noguchi H , Taniguchi T , Itoh T . MetaGeneAnnotator: detecting species-specific patterns of
ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage
genomes. DNA Res 2008, 15(6):387–396.
Lomsadze A , Gemayel K , Tang S , Borodovsky M . Modeling leaderless transcription and
atypical genes results in more accurate gene prediction in prokaryotes. Genome Res 2018,
28(7):1079–1089.
Rho M , Tang H , Ye Y . FragGeneScan: predicting genes in short and error-prone reads.
Nucleic Acids Res 2010, 38(20):e191.
Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL. Gene prediction with Glimmer for
metagenomic sequences augmented by classification and clustering. Nucleic Acids Res 2012,
40(1):e9.
Al-Ajlan A , El Allali A . CNN-MGP: Convolutional Neural Networks for Metagenomics Gene
Prediction. Interdiscip Sci 2019, 11(4):628–635.
Sommer MJ , Salzberg SL . Balrog: A universal protein model for prokaryotic gene prediction.
PLoS Comput Biol 2021, 17(2):e1008727.
Lowe TM , Eddy SR . tRNAscan-SE: a program for improved detection of transfer RNA genes in
genomic sequence. Nucleic Acids Res 1997, 25(5):955–964.
Laslett D , Canback B . ARAGORN, a program to detect tRNA genes and tmRNA genes in
nucleotide sequences. Nucleic Acids Res 2004, 32(1):11–16.
Bland C , Ramsey TL , Sabree F , Lowe M , Brown K , Kyrpides NC , Hugenholtz P . CRISPR
recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced
palindromic repeats. BMC Bioinformatics 2007, 8:209.
Grissa I , Vergnaud G , Pourcel C . CRISPRFinder: a web tool to identify clustered regularly
interspaced short palindromic repeats. Nucleic Acids Res 2007, 35(Web Server issue):W52–57.
Biswas A , Staals RH , Morales SE , Fineran PC , Brown CM . CRISPRDetect: A flexible
algorithm to define CRISPR arrays. BMC Genomics 2016, 17:356.
Beghini F , McIver LJ , Blanco-Miguez A , Dubois L , Asnicar F , Maharjan S , Mailyan A ,
Manghi P , Scholz M , Thomas AM et al. Integrating taxonomic, functional, and strain-level
profiling of diverse microbial communities with bioBakery 3. Elife 2021, 10:e65088.
Milanese A , Mende DR , Paoli L , Salazar G , Ruscheweyh HJ , Cuenca M , Hingamp P , Alves
R , Costea PI , Coelho LP et al. Microbial abundance, activity and population genomic profiling
with mOTUs2. Nat Commun 2019, 10(1):1014.
Menzel P , Ng KL , Krogh A . Fast and sensitive taxonomic classification for metagenomics with
Kaiju. Nat Commun 2016, 7:11257.
Kim D , Song L , Breitwieser FP , Salzberg SL . Centrifuge: rapid and sensitive classification of
metagenomic sequences. Genome Res 2016, 26(12):1721–1729.
Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in
metagenomics data. PeerJ Comput Sci 2017, 3:e104.
Chaumeil PA , Mussig AJ , Hugenholtz P , Parks DH . GTDB-Tk: a toolkit to classify genomes
with the Genome Taxonomy Database. Bioinformatics 2019, 36(6):1925–1927.
Parks DH , Chuvochina M , Waite DW , Rinke C , Skarshewski A , Chaumeil PA , Hugenholtz P.
A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree
of life. Nat Biotechnol 2018, 36(10):996–1004.
Dilthey AT , Jain C , Koren S , Phillippy AM . Strain-level metagenomic assignment and
compositional estimation for long reads with MetaMaps. Nat Commun 2019, 10(1):3066.
Fan J , Huang S , Chorlton SD . BugSeq: a highly accurate cloud platform for long-read
metagenomic analyses. BMC Bioinformatics 2021, 22(1):160.
Portik DM , Brown CT , Pierce-Ward NT . Evaluation of taxonomic classification and profiling
methods for long-read shotgun metagenomic sequencing datasets. BMC Bioinformatics 2022,
23(1):541.
UniProt C . UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021,
49(D1):D480–D489.
Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-
Lafosse T, Qureshi M, Raj S et al. The InterPro protein families and domains database: 20
years on. Nucleic Acids Res 2021, 49(D1):D344–D354.
Galperin MY , Wolf YI , Makarova KS , Vera Alvarez R , Landsman D , Koonin EV . COG
database update: focus on microbial diversity, model organisms, and widespread pathogens.
Nucleic Acids Res 2021, 49(D1):D274–D281.
Huerta-Cepas J , Szklarczyk D , Heller D , Hernandez-Plaza A , Forslund SK , Cook H , Mende
DR , Letunic I , Rattei T , Jensen LJ et al. eggNOG 5.0: a hierarchical, functionally and
phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.
Nucleic Acids Res 2019, 47(D1):D309–D314.
Seemann T . Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014,
30(14):2068–2069.
Shaffer M , Borton MA , McGivern BB , Zayed AA , La Rosa SL , Solden LM , Liu P , Narrowe
AB , Rodriguez-Ramos J , Bolduc B et al. DRAM for distilling microbial metabolism to automate
the curation of microbiome function. Nucleic Acids Res 2020, 48(16):8883–8900.
Tanizawa Y , Fujisawa T , Nakamura Y . DFAST: a flexible prokaryotic genome annotation
pipeline for faster genome publication. Bioinformatics 2018, 34(6):1037–1039.
Tatusova T , DiCuccio M , Badretdin A , Chetvernin V , Nawrocki EP , Zaslavsky L , Lomsadze
A , Pruitt KD , Borodovsky M , Ostell J . NCBI prokaryotic genome annotation pipeline. Nucleic
Acids Res 2016, 44(14):6614–6624.
Keegan KP , Glass EM , Meyer F . MG-RAST, a Metagenomics Service for Analysis of
Microbial Community Structure and Function. Methods Mol Biol 2016, 1399:207–233.
Chen IA , Chu K , Palaniappan K , Ratner A , Huang J , Huntemann M , Hajek P , Ritter S ,
Varghese N , Seshadri R et al. The IMG/M data management and analysis system v.6.0: new
tools and advanced capabilities. Nucleic Acids Res 2021, 49(D1):D751–D763.
Kanehisa M , Sato Y , Morishima K . BlastKOALA and GhostKOALA: KEGG tools for functional
characterization of genome and metagenome sequences. J Mol Biol 2016, 428(4):726–731.
Mitchell AL , Almeida A , Beracochea M , Boland M , Burgin J , Cochrane G , Crusoe MR , Kale
V , Potter SC , Richardson LJ et al. MGnify: the microbiome analysis resource in 2020. Nucleic
Acids Res 2020, 48(D1):D570–D578.
Cantalapiedra CP , Hernandez-Plaza A , Letunic I , Bork P , Huerta-Cepas J . eggNOG-mapper
v2: functional annotation, orthology assignments, and domain prediction at the metagenomic
scale. Mol Biol Evol 2021, 38(12):5825–5829.
Kanehisa M , Furumichi M , Tanabe M , Sato Y , Morishima K . KEGG: new perspectives on
genomes, pathways, diseases and drugs. Nucleic Acids Res 2017, 45(D1):D353–D361.
Caspi R , Billington R , Keseler IM , Kothari A , Krummenacker M , Midford PE , Ong WK , Paley
S , Subhraveti P , Karp PD . The MetaCyc database of metabolic pathways and enzymes – a
2019 update. Nucleic Acids Res 2020, 48(D1):D445–D453.
Nazeen S , Yu YW , Berger B . Carnelian uncovers hidden functional patterns across diverse
study populations from whole metagenome sequencing reads. Genome Biol 2020, 21(1):47.
Ye Y , Doak TG . A parsimony approach to biological pathway reconstruction/inference for
genomes and metagenomes. PLoS Comput Biol 2009, 5(8):e1000465.
Seaver SMD , Liu F , Zhang Q , Jeffryes J , Faria JP , Edirisinghe JN , Mundy M , Chia N , Noor
E , Beber ME et al. The ModelSEED Biochemistry Database for the integration of metabolic
annotations and the reconstruction, comparison and analysis of metabolic models for plants,
fungi and microbes. Nucleic Acids Res 2021, 49(D1):D1555.
Machado D , Andrejev S , Tramontano M , Patil KR . Fast automated reconstruction of genome-
scale metabolic models for microbial species and communities. Nucleic Acids Res 2018,
46(15):7542–7553.
Paley S , Billington R , Herson J , Krummenacker M , Karp PD . Pathway tools visualization of
organism-scale metabolic networks. Metabolites 2021, 11(2):64.
Capela J , Lagoa D , Rodrigues R , Cunha E , Cruz F , Barbosa A , Bastos J , Lima D , Ferreira
EC , Rocha M et al. merlin, an improved framework for the reconstruction of high-quality
genome-scale metabolic models. Nucleic Acids Res 2022, 50(11):6052–6066.
Wang H , Marcisauskas S , Sanchez BJ , Domenzain I , Hermansson D , Agren R , Nielsen J ,
Kerkhoven EJ . RAVEN 2.0: a versatile toolbox for metabolic network reconstruction and a case
study on Streptomyces coelicolor. PLoS Comput Biol 2018, 14(10):e1006541.
Garza DR , van Verk MC , Huynen MA , Dutilh BE . Towards predicting the environmental
metabolome from metagenomics with a mechanistic model. Nat Microbiol 2018, 3(4):456–460.
Noecker C , Eng A , Srinivasan S , Theriot CM , Young VB , Jansson JK , Fredricks DN ,
Borenstein E . Metabolic model-based integration of microbiome taxonomic and metabolomic
profiles elucidates mechanistic links between ecological and metabolic variation. mSystems
2016, 1(1):e00013-15.
Mallick H , Franzosa EA , McIver LJ , Banerjee S , Sirota-Madi A , Kostic AD , Clish CB ,
Vlamakis H , Xavier RJ , Huttenhower C . Predictive metabolomic profiling of microbial
communities using amplicon or metagenomic sequences. Nat Commun 2019, 10(1):3136.
Paulson JN , Stine OC , Bravo HC , Pop M . Differential abundance analysis for microbial
marker-gene surveys. Nat Methods 2013, 10(12):1200–1202.
McMurdie PJ , Holmes S . Waste not, want not: why rarefying microbiome data is inadmissible.
PLoS Comput Biol 2014, 10(4):e1003531.
Segata N , Izard J , Waldron L , Gevers D , Miropolsky L , Garrett WS , Huttenhower C .
Metagenomic biomarker discovery and explanation. Genome Biol 2011, 12(6):R60.
Parks DH , Tyson GW , Hugenholtz P , Beiko RG . STAMP: statistical analysis of taxonomic
and functional profiles. Bioinformatics 2014, 30(21):3123–3124.
Mandal S , Van Treuren W , White RA , Eggesbo M , Knight R , Peddada SD . Analysis of
composition of microbiomes: a novel method for studying microbial composition. Microb Ecol
Health Dis 2015, 26:27663.
Lin H , Peddada SD . Analysis of compositions of microbiomes with bias correction. Nat
Commun 2020, 11(1):3514.
Martin BD , Witten D , Willis AD . Modeling microbial abundances and dysbiosis with beta-
binomial regression. Ann Appl Stat 2020, 14(1):94–115.
Mallick H , Rahnavard A , McIver LJ , Ma S , Zhang Y , Nguyen LH , Tickle TL , Weingart G ,
Ren B , Schwager EH et al. Multivariable association discovery in population-scale meta-omics
studies. PLoS Comput Biol 2021, 17(11):e1009442.
Bolyen E , Rideout JR , Dillon MR , Bokulich NA , Abnet CC , Al-Ghalith GA , Alexander H , Alm
EJ , Arumugam M , Asnicar F et al. Reproducible, interactive, scalable and extensible
microbiome data science using QIIME 2. Nat Biotechnol 2019, 37(8):852–857.
Shi W , Qi H , Sun Q , Fan G , Liu S , Wang J , Zhu B , Liu H , Zhao F , Wang X et al. gcMeta: a
Global Catalogue of Metagenomics platform to support the archiving, standardization and
analysis of microbiome data. Nucleic Acids Res 2019, 47(D1):D637–D648.

What's Next for Next-Generation Sequencing (NGS)?


Drmanac R , Sparks AB , Callow MJ , Halpern AL , Burns NL , Kermani BG , Carnevali P ,
Nazarenko I , Nilsen GB , Yeung G et al. Human genome sequencing using unchained base
reads on self-assembling DNA nanoarrays. Science 2010, 327(5961):78–81.
Fehlmann T , Reinheimer S , Geng C , Su X , Drmanac S , Alexeev A , Zhang C , Backes C ,
Ludwig N , Hart M et al. cPAS-based sequencing on the BGISEQ-500 to explore small non-
coding RNAs. Clin Epigenetics 2016, 8:123.
Zhu K , Du P , Xiong J , Ren X , Sun C , Tao Y , Ding Y , Xu Y , Meng H , Wang CC et al.
Comparative Performance of the MGISEQ-2000 and Illumina X-Ten Sequencing Platforms for
Paleogenomics. Front Genet 2021, 12:745508.
Tedersoo L , Albertsen M , Anslan S, Callahan B. Perspectives and Benefits of High-
Throughput Long-Read Sequencing in Microbial Ecology. Appl Environ Microbiol 2021,
87(17):e0062621.
Drmanac S , Callow M , Chen L , Zhou P , Eckhardt L , Xu C , Gong M , Gablenz S , Rajagopal
J , Yang Q et al. CoolMPS™: Advanced massively parallel sequencing using antibodies specific
to each natural nucleobase. bioRxiv 2020, doi: https://doi.org/10.1101/2020.02.19.953307
Arslan S , Garcia FJ , Guo M , Kellinger MW , Kruglyak S , LeVieux JA , Mah AH , Wang H ,
Zhao J , Zhou C . Sequencing by avidity enables high accuracy with low reagent consumption.
bioRxiv 2022, doi: https://doi.org/10.1101/2022.11.03.514117
Almogy G , Pratt M , Oberstrass F , Lee L , Mazur D , Beckett N , Barad O , Soifer I , Perelman
E , Etzioni Y et al. Cost-efficient whole genome-sequencing using novel mostly natural
sequencing-by-synthesis chemistry and open fluidics platform. bioRxiv 2022, doi:
https://doi.org/10.1101/2022.05.29.493900
Foox J , Tighe SW , Nicolet CM , Zook JM , Byrska-Bishop M , Clarke WE , Khayat MM ,
Mahmoud M , Laaguiby PK , Herbert ZT et al. Performance assessment of DNA sequencing
platforms in the ABRF Next-Generation Sequencing Study. Nat Biotechnol 2021,
39(9):1129–1140.
Puchtler TJ , Johnson K , Palmer RN , Talbot EL , Ibbotson LA , Powalowska PK , Knox R ,
Shibahara A , P MSC, Newell OJ et al. Single-molecule DNA sequencing of widely varying GC-
content using nucleotide release, capture and detection in microdroplets. Nucleic Acids Res
2020, 48(22):e132.
Marx V . Nanopores: a sequencer in your backpack. Nat Methods 2015, 12(11):1015–1018.
Zheng GX , Lau BT , Schnall-Levin M , Jarosz M , Bell JM , Hindson CM , Kyriazopoulou-
Panagiotopoulou S , Masquelier DA , Merrill L , Terry JM et al. Haplotyping germline and cancer
genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016, 34(3):303–311.
Wang O , Chin R , Cheng X , Wu MKY , Mao Q , Tang J , Sun Y , Anderson E , Lam HK , Chen
D et al. Efficient and unique cobarcoding of second-generation sequencing reads from long
DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo
assembly. Genome Res 2019, 29(5):798–808.
Chen Z , Pham L , Wu TC , Mo G , Xia Y , Chang PL , Porter D , Phan T , Che H , Tran H et al.
Ultralow-input single-tube linked-read library method enables short-read second-generation
sequencing systems to routinely generate highly accurate and economical long-range
sequencing information. Genome Res 2020, 30(6):898–909.
Liu S , Wu I , Yu YP , Balamotis M , Ren B , Ben Yehezkel T , Luo JH . Targeted transcriptome
analysis using synthetic long read sequencing uncovers isoform reprograming in the
progression of colon cancer. Commun Biol 2021, 4(1):506.
Nurk S , Bankevich A , Antipov D , Gurevich AA , Korobeynikov A , Lapidus A , Prjibelski AD ,
Pyshkin A , Sirotkin A , Sirotkin Y et al. Assembling single-cell genomes and mini-metagenomes
from chimeric MDA products. J Comput Biol 2013, 20(10):714–737.
Antipov D , Korobeynikov A , McLean JS , Pevzner PA. hybridSPAdes: an algorithm for hybrid
assembly of short and long reads. Bioinformatics 2016, 32(7):1009–1015.
Li H . Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
Bioinformatics 2016, 32(14):2103–2110.
Kolmogorov M , Yuan J , Lin Y , Pevzner PA . Assembly of long, error-prone reads using repeat
graphs. Nat Biotechnol 2019, 37(5):540–546.
Chen Y , Nie F , Xie SQ , Zheng YF , Dai Q , Bray T , Wang YX , Xing JF , Huang ZJ , Wang DP
et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat
Commun 2021, 12(1):60.
Vaser R , Šikić M . Time- and memory-efficient genome assembly with Raven. Nat Comput Sci
2021, 1(5):332–336.
Li H . Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018,
34(18):3094–3100.
Li H . New strategies to improve minimap2 alignment accuracy. Bioinformatics 2021,
37(23):4572–4574.
Bai J , Bandla C , Guo J , Vera Alvarez R , Bai M , Vizcaino JA , Moreno P , Gruning B , Sallou
O , Perez-Riverol Y . BioContainers Registry: Searching Bioinformatics and Proteomics Tools,
Packages, and Containers. J Proteome Res 2021, 20(4):2056–2061.
Jalili V , Afgan E , Gu Q , Clements D , Blankenberg D , Goecks J , Taylor J , Nekrutenko A .
The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020
update. Nucleic Acids Res 2020, 48(W1):W395–W402.
Terra (https://app.terra.bio/)
Di Tommaso P , Chatzou M , Floden EW , Barja PP , Palumbo E , Notredame C . Nextflow
enables reproducible computational workflows. Nat Biotechnol 2017, 35(4):316–319.
Kotliar M , Kartashov AV , Barski A . CWL-Airflow: a lightweight pipeline manager supporting
Common Workflow Language. GigaScience 2019, 8(7):giz084.
Zhang K , Hocker JD , Miller M , Hou X , Chiou J , Poirion OB , Qiu Y , Li YE , Gaulton KJ ,
Wang A et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 2021,
184(24):5985–6001 e5919.
Fang R , Preissl S , Li Y , Hou X , Lucero J , Wang X , Motamedi A , Shiau AK , Zhou X , Xie F
et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun
2021, 12(1):1337.
Bravo Gonzalez-Blas C , Minnoye L , Papasokrati D , Aibar S , Hulselmans G , Christiaens V ,
Davie K , Wouters J , Aerts S . cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq
data. Nat Methods 2019, 16(5):397–400.
Xiong L , Xu K , Tian K , Shao Y , Tang L , Gao G , Zhang M , Jiang T , Zhang QC . SCALE
method for single-cell ATAC-seq analysis via latent feature extraction. Nat Commun 2019,
10(1):4576.
Stuart T , Srivastava A , Madad S , Lareau CA , Satija R . Single-cell chromatin state analysis
with Signac. Nat Methods 2021, 18(11):1333–1341.
Dean FB , Nelson JR , Giesler TL , Lasken RS . Rapid amplification of plasmid and phage DNA
using Phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res
2001, 11(6):1095–1099.
Zong C , Lu S , Chapman AR , Xie XS . Genome-wide detection of single-nucleotide and copy-
number variations of a single human cell. Science 2012, 338(6114):1622–1626.
Telenius H , Carter NP , Bebb CE , Nordenskjold M , Ponder BA , Tunnacliffe A . Degenerate
oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer.
Genomics 1992, 13(3):718–725.
Imamura H , Monsieurs P , Jara M , Sanders M , Maes I , Vanaerschot M , Berriman M , Cotton
JA , Dujardin JC , Domagalska MA . Evaluation of whole genome amplification and
bioinformatic methods for the characterization of Leishmania genomes at a single cell level. Sci
Rep 2020, 10(1):15043.
Fu Y , Zhang F , Zhang X , Yin J , Du M , Jiang M , Liu L , Li J , Huang Y , Wang J . High-
throughput single-cell whole-genome amplification through centrifugal emulsification and eMDA.
Commun Biol 2019, 2:147.
Hosokawa M , Nishikawa Y , Kogawa M , Takeyama H . Massively parallel whole genome
amplification for single-cell sequencing using droplet microfluidics. Sci Rep 2017, 7(1):5199.
Zahn H , Steif A , Laks E , Eirew P , VanInsberghe M , Shah SP , Aparicio S , Hansen CL .
Scalable whole-genome single-cell library preparation without preamplification. Nat Methods
2017, 14(2):167–173.
Dong X , Zhang L , Milholland B , Lee M , Maslov AY , Wang T , Vijg J . Accurate identification
of single-nucleotide variants in whole-genome-amplified single cells. Nat Methods 2017,
14(5):491–493.
Luquette LJ , Bohrson CL , Sherman MA , Park PJ . Identification of somatic mutations in single
cell DNA-seq using a spatial model of allelic imbalance. Nat Commun 2019, 10(1):3908.
Roth A , McPherson A , Laks E , Biele J , Yap D , Wan A , Smith MA , Nielsen CB , McAlpine
JN , Aparicio S et al. Clonal genotype and population structure inference from single-cell tumor
sequencing. Nat Methods 2016, 13(7):573–576.
Poirion O , Zhu X , Ching T , Garmire LX . Using single nucleotide variations in single-cell RNA-
seq to identify subpopulations and genotype-phenotype linkage. Nat Commun 2018, 9(1):4892.
Bakker B , Taudt A , Belderbos ME , Porubsky D , Spierings DC , de Jong TV , Halsema N ,
Kazemier HG , Hoekstra-Wakker K , Bradley A et al. Single-cell sequencing reveals karyotype
heterogeneity in murine and human malignancies. Genome Biol 2016, 17(1):115.
Garvin T , Aboukhalil R , Kendall J , Baslan T , Atwal GS , Hicks J , Wigler M , Schatz MC .
Interactive analysis and assessment of single-cell copy-number variations. Nat Methods 2015,
12(11):1058–1060.
Smallwood SA , Lee HJ , Angermueller C , Krueger F , Saadeh H , Peat J , Andrews SR ,
Stegle O , Reik W , Kelsey G . Single-cell genome-wide bisulfite sequencing for assessing
epigenetic heterogeneity. Nat Methods 2014, 11(8):817–820.
Guo H , Zhu P , Guo F , Li X , Wu X , Fan X , Wen L , Tang F . Profiling DNA methylome
landscapes of mammalian cells with single-cell reduced-representation bisulfite sequencing. Nat
Protoc 2015, 10(5):645–659.
Luo C , Rivkin A , Zhou J , Sandoval JP , Kurihara L , Lucero J , Castanon R , Nery JR , Pinto-
Duarte A , Bui B et al. Robust single-cell DNA methylome profiling with snmC-seq2. Nat
Commun 2018, 9(1):3824.
Wu P , Gao Y , Guo W , Zhu P . Using local alignment to enhance single-cell bisulfite
sequencing data efficiency. Bioinformatics 2019, 35(18):3273–3278.
Schultz MD , He Y , Whitaker JW , Hariharan M , Mukamel EA , Leung D , Rajagopal N , Nery
JR , Urich MA , Chen H et al. Human body epigenome maps reveal noncanonical DNA
methylation variation. Nature 2015, 523(7559):212–216.
Niemoller C , Wehrle J , Riba J , Claus R , Renz N , Rhein J , Bleul S , Stosch JM , Duyster J ,
Plass C et al. Bisulfite-free epigenomics and genomics of single cells through methylation-
sensitive restriction. Commun Biol 2021, 4(1):153.
Liao J , Lu X , Shao X , Zhu L , Fan X . Uncovering an organ’s molecular architecture at single-
cell resolution by spatially resolved transcriptomics. Trends Biotechnol 2021, 39(1):43–58.
Stickels RR , Murray E , Kumar P , Li J , Marshall JL , Di Bella DJ , Arlotta P , Macosko EZ ,
Chen F . Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2.
Nat Biotechnol 2021, 39(3):313–319.
Dries R , Zhu Q , Dong R , Linus Eng C-H , Li H , Liu K , Fu Y , Zhao T , Sarkar A , Bao F et al.
Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome
Biol 2021, 22(1):78.
Zhao E , Stone MR , Ren X , Guenthoer J , Smythe KS , Pulliam T , Williams SR , Uytingco CR
, Taylor SEB , Nghiem P et al. Spatial transcriptomics at subspot resolution with BayesSpace.
Nat Biotechnol 2021, 39(11):1375–1384.
Stephens ZD , Lee SY , Faghri F , Campbell RH , Zhai CX , Efron MJ , Iyer R , Schatz MC ,
Sinha S , Robinson GE . Big Data: Astronomical or Genomical? PLoS Biol 2015, 13(7).
Schmidt B , Hildebrandt A . Next-generation sequencing: big data meets high performance
computing. Drug Discov Today 2017, 22(4):712–717.
Olson ND , Wagner J , McDaniel J , Stephens SH , Westreich ST , Prasanna AG , Johanson E ,
Boja E , Maier EJ , Serang O et al. PrecisionFDA Truth Challenge V2: Calling variants from
short- and long-reads in difficult-to-map regions. Cell Genom 2022, 2(5):100129.
Poplin R , Chang PC , Alexander D , Schwartz S , Colthurst T , Ku A , Newburger D , Dijamco J
, Nguyen N , Afshar PT et al. A universal SNP and small-indel variant caller using deep neural
networks. Nat Biotechnol 2018, 36(10):983–987.
Shafin K , Pesout T , Chang PC , Nattestad M , Kolesnikov A , Goel S , Baid G , Kolmogorov M
, Eizenga JM , Miga KH et al. Haplotype-aware variant calling with PEPPER-Margin-
DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 2021,
18(11):1322–1332.
Luo R , Wong C-L , Wong Y-S , Tang C-I , Liu C-M , Leung C-M , Lam T-W . Exploring the limit
of using a deep neural network on pileup data for germline variant calling. Nat Mach Intell 2020,
2(4):220–227.
Zheng Z , Li S , Su J , Leung AW-S , Lam T-W , Luo R . Symphonizing pileup and full-alignment
for deep learning-based long-read variant calling. Nat Comput Sci 2022, 2(12):797–803.
Ahsan MU , Liu Q , Fang L, Wang K. NanoCaller for accurate detection of SNPs and indels in
difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks.
Genome Biol 2021, 22(1):261.
Edge P , Bansal V . Longshot enables accurate variant calling in diploid genomes from single-
molecule long read sequencing. Nat Commun 2019, 10(1):4660.
Cai L , Wu Y , Gao J . DeepSV: accurate calling of genomic deletions from high-throughput
sequencing data using deep convolutional neural network. BMC Bioinformatics 2019, 20(1):665.
Hill T , Unckless RL . A Deep Learning Approach for Detecting Copy Number Variation in Next-
Generation Sequencing Data. G3 2019, 9(11):3575–3582.
Wang L , Xi Y , Sung S , Qiao H . RNA-seq assistant: machine learning based methods to
identify more transcriptional regulated genes. BMC Genomics 2018, 19(1):546.
Bonet J , Chen M , Dabad M , Heath S , Gonzalez-Perez A , Lopez-Bigas N , Lagergren J .
DeepMP: a deep learning tool to detect DNA base modifications on Nanopore sequencing data.
Bioinformatics 2022, 38(5):1235–1243.
Ni P , Huang N , Nie F , Zhang J , Zhang Z , Wu B , Bai L , Liu W , Xiao CL , Luo F et al.
Genome-wide detection of cytosine methylations in plant from Nanopore data using deep
learning. Nat Commun 2021, 12(1):5976.
Tse OYO , Jiang P , Cheng SH , Peng W , Shang H , Wong J , Chan SL , Poon LCY , Leung TY
, Chan KCA et al. Genome-wide detection of cytosine methylation by single molecule real-time
sequencing. Proc Natl Acad Sci U S A 2021, 118(5):e2019768118.
Hollister EB , Oezguen N , Chumpitazi BP , Luna RA , Weidler EM , Rubio-Gonzales M ,
Dahdouli M , Cope JL , Mistretta TA , Raza S et al. Leveraging human microbiome features to
diagnose and stratify children with irritable bowel syndrome. J Mol Diagn 2019, 21(3):449–461.
Abraham J , Heimberger AB , Marshall J , Heath E , Drabick J , Helmstetter A , Xiu J , Magee D
, Stafford P , Nabhan C et al. Machine learning analysis using 77,044 genomic and
transcriptomic profiles to accurately predict tumor type. Transl Oncol 2021, 14(3):101016.
