
Briefings in Bioinformatics, 18(3), 2017, 530–536

doi: 10.1093/bib/bbw020
Advance Access Publication Date: 24 March 2016
Paper

A review of bioinformatic pipeline frameworks


Jeremy Leipzig
Corresponding author: Jeremy Leipzig, Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, 3535 Market Street, Room 1063, Philadelphia, PA 19104, USA. Tel.: +1 215 426 1375; Fax: +1 215 590 5245; E-mail: leipzigj@email.chop.edu

Abstract
High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata. Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.

Key words: pipeline; workflow; framework

Background

Bioinformatic analyses invariably involve shepherding files through a series of transformations, called a pipeline or a workflow. Typically, these transformations are done by third-party executable command line software written for Unix-compatible operating systems. The advent of next-generation sequencing (NGS), in which millions of short DNA sequences are used as the source input for interpreting a range of biological phenomena, has intensified the need for robust pipelines. NGS analyses tend to involve steps such as sequence alignment and genomic annotation that are both time-intensive and parameter-heavy. A basic exome pipeline delivering called variants from raw sequence could consist of as few as 12 steps, most of which can be run in parallel, but a real analysis will typically involve several additional downstream steps and complex report generation (Figure 1).

Although bioinformatics-specific pipelines such as bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen) and Omics Pipe [1] offer high-performance automated analysis, they are not frameworks in the sense that they are not easily extensible to integrate new user-defined tools. A bioinformatics framework should be able to accommodate production pipelines consisting of both serial and parallel steps, complex dependencies, varied software and data file types, fixed and user-defined parameters and deliverables. Many modern pipeline frameworks offer advanced features, such as displays for visualizing progress in real time, the ability to instantiate containerized tools that can run anywhere, support for performing work on distributed clusters or in the cloud and graphical user interfaces that allow workflows to be built by users without writing code. What distinguishes frameworks from each other is not features but design philosophy. To understand the origins of these frameworks requires closer examination of their predecessors, i.e. scripts and Makefiles.

Scripts

Scripts, written in Unix shell or other scripting languages such as Perl, can be seen as the most basic form of pipeline framework. Scripting allows variables and conditional logic to be used to build flexible pipelines. However, in terms of 'robustness', as defined by Sussman [2], scripts tend to be quite brittle. In particular, scripts lack two key features necessary for the efficient processing of data: support for 'dependencies' and 'reentrancy'. Dependencies refer to upstream files (or tasks) that downstream transformation steps require as input. When a dependency is updated, associated downstream files should be updated as well. Reentrancy is the ability of a program to continue where it left off if interrupted, obviating the need to restart from the beginning of a process. Pipelines often include steps that fail for any number of reasons, such as network or disk issues, file corruption or bugs. A pipeline must be able to recover from the nearest checkpoint rather than overwrite or 'clobber' otherwise usable intermediate files. In addition, the introduction of new upstream files, such as samples, in an analysis should not necessitate reprocessing existing samples.

Jeremy Leipzig is a bioinformatics software developer at The Children’s Hospital of Philadelphia. He received his MS in Computer Science from North
Carolina State University. He is pursuing a PhD in Information Studies at Drexel University.
Submitted: 10 December 2015; Received (in revised form): 29 January 2016
© The Author 2016. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.


Figure 1. A DAG (Directed Acyclic Graph) depicting a trio analysis pipeline for detecting de novo mutations.

Make

Despite its origin as a compiler build automation tool early in computing history, the Make utility [3] is still successfully used to manage file transformations common to scientific computing pipelines. Make introduced the concept of 'implicit wildcard rules', which define available file transformations based on file suffixes (Figure 2).

A dependency tree is generated by Make from these rules. When Make is asked to build a downstream file, or 'target', file modification datetimes are used to determine whether any of that target's dependencies are newer than the target or its intermediates. The dependency tree allows Make to infer the steps required to make any target for which a rule chain exists. Make is a 'domain-specific language' (DSL): it provides a convention-based syntax for describing inputs and outputs, with special symbols ($<, $@, $*) to represent shortcuts for accessing filename stems, paths and suffixes of both the target and prerequisites (Figure 3).

Because it was never designed for scientific pipelines, Make has several limitations that render it impractical for modern bioinformatic analyses. Make has no built-in support for distributed computing, so dispatching tasks that can be run in parallel on several nodes of a cluster is not easily done within the Make framework. Make's syntax is restricted to one wildcard per rule and does not allow for lookup tables or other means of associating inputs to outputs other than exact filename stem matching. Although Make allows a low level of shell scripting, more sophisticated logic is difficult to implement.

Figure 2. The basic Make rule syntax.

Figure 3. A Make rule for performing a sequence alignment using bwa mem [4]. Two paired fastq (.fq) files are used to produce a SAM alignment file. Symbols are used to represent various pattern matching elements of filenames.
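The figures themselves survive only as captions here; a minimal sketch in their spirit, assuming a hypothetical hg19.fa reference and paired fastq files named <sample>_1.fq and <sample>_2.fq, could read:

    # Implicit wildcard rule: any .sam target can be built from the
    # matching pair of .fq files. $@ expands to the target, $^ to all
    # prerequisites; the stem matched by % is available as $*.
    # (GNU Make recipe lines must begin with a tab.)
    %.sam: %_1.fq %_2.fq
            bwa mem hg19.fa $^ > $@

Running 'make sampleA.sam' would cause Make to locate sampleA_1.fq and sampleA_2.fq via the wildcard and execute the alignment only if the target is missing or older than its prerequisites.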
Modern pipeline frameworks

In recent years, a number of new pipeline frameworks have been developed to address Make's limitations in syntax, monitoring and parallel processing, as well as to offer new features relevant to bioinformatics and reproducible research, such as visualization, version tracking and summary reports (Table 1).

Implicit convention frameworks

Implicit frameworks preserve the implicit wildcard idioms introduced by Make while extending its capabilities, usually by leveraging full-featured scripting languages such as Python to implement logic both inside and outside of rules.

Table 1. A classification of modern pipeline frameworks

Syntax     Paradigm        Interaction                     Examples
Implicit   Convention      CLI                             Snakemake, Nextflow, BigDataScript
Explicit   Convention      CLI                             Ruffus, bpipe
Explicit   Configuration   CLI                             Pegasus
Explicit   Class           CLI                             Queue, Toil
Implicit   Class           CLI                             Luigi
Explicit   Configuration   Open Source Server Workbench    Galaxy, Taverna
Explicit   Configuration   Commercial Cloud Workbench      DNAnexus, SevenBridges
Explicit   Configuration   Open Source Cloud API           Arvados, Agave

[The original table also rates each row with stars for ease of development, ease of use and performance; the ratings are not reproduced here.]

Note. Ease of development refers to the effort required to compose workflows and also wrap new tools, such as custom or publicly available scripts and executables. Ease of use refers to the effort required to use existing pipelines to process new data, such as samples, and also the ease of sharing pipelines in a collaborative fashion. Performance refers to the efficiency of the framework in executing a pipeline, in terms of both parallelization and scalability. More stars connotes 'easier' or 'faster'.

Figure 4. A Snakemake rule for performing a sequence alignment. This example uses a global configuration dictionary that allows parameters to be specified in JSON or
YAML-formatted configuration files or in the command line.
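The rule in the figure is not reproduced here; a minimal sketch of such a Snakemake rule, assuming a config.yaml that defines the reference genome path under a hypothetical key "ref", might read:

    # A Snakemake rule in the spirit of Figure 4. The reference path is
    # drawn from the global config dictionary, which Snakemake populates
    # from a JSON/YAML configuration file or from the command line.
    configfile: "config.yaml"

    rule bwa_mem:
        input:
            r1="{sample}_1.fq",
            r2="{sample}_2.fq"
        output:
            "{sample}.sam"
        params:
            ref=config["ref"]
        shell:
            "bwa mem {params.ref} {input.r1} {input.r2} > {output}"

As in Make, asking for sampleA.sam causes the {sample} wildcard to be resolved and the paired inputs to be inferred.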

Figure 5. Tasks in Ruffus explicitly depend on other tasks, not file targets.
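A minimal sketch of this explicit style in Ruffus follows (filenames and commands are illustrative; error handling omitted):

    # Ruffus tasks in the spirit of Figure 5: to_bam() names align() as
    # its antecedent instead of matching a filename pattern.
    from ruffus import transform, suffix, pipeline_run
    import subprocess

    @transform(["sampleA.fq"], suffix(".fq"), ".sam")
    def align(input_file, output_file):
        subprocess.check_call(
            "bwa mem hg19.fa {0} > {1}".format(input_file, output_file),
            shell=True)

    # Explicit dependency: the first argument is the upstream task.
    @transform(align, suffix(".sam"), ".bam")
    def to_bam(input_file, output_file):
        subprocess.check_call(
            "samtools view -b {0} > {1}".format(input_file, output_file),
            shell=True)

    pipeline_run([to_bam])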

Snakemake [5] builds on the implicit or wildcard-based logic of Make while extending its capabilities by allowing Python to be interspersed through the pipeline in conjunction with a DSL. Some implicit frameworks, such as Nextflow (http://nextflow.io), provide tools to abstract and manage filenaming into global variables to reduce ambiguity. BigDataScript [6] is a stand-alone DSL that offers its own language-independent syntax for implementing pipeline logic (Figure 4).

Explicit frameworks

Implicit frameworks demand the user define rules or recipes for performing file transformations separately from target(s). Although this approach is logical from the standpoint of defining individual rules, users typically have a preconceived idea of the order of operations. Implicit frameworks force users to think more carefully about filenames rather than about the process. In response, some frameworks such as Ruffus [7] and bpipe [8] use an explicit paradigm, as used in scripts, in which the rule topology is defined by the user, the order is fixed and tasks simply refer to each other rather than using a target naming scheme (Figure 5).

Configuration frameworks

Many pipeline frameworks dispense with inline scripting code and instead use a configuration-based, rather than convention-based, means of describing tasks. Pegasus [9] is a National Science Foundation (NSF)-funded workflow system originally designed for the physical sciences. Like all configuration-based frameworks, Pegasus is explicit: it does not implicitly infer how to produce targets but instead requires a fixed XML file that describes individual job run instances and their dependencies (Figure 6).

Class-based frameworks

Some high-performance workflow languages are implemented in a class-based pure language manner. Although these may resemble DSL-based frameworks superficially, class-based implementations are often closely bound to an existing code library rather than various executables. Class-based pipelines often contain many thousands of lines of code implementing domain logic. Genome Analysis Toolkit (GATK) [10] is a large Java library for variant analysis, and Queue is a GATK-integrated Scala framework that provides abstract classes for implementing pipelines. Luigi (https://github.com/spotify/luigi) and Toil (https://github.com/bd2kgenomics/toil) are pure-Python frameworks that are not bound to any bioinformatics codebase, but offer explicit Application Programming Interfaces (APIs) for defining task dependencies from within task methods. Luigi places particular emphasis on scheduled execution, monitoring, visualization and the implicit dependency resolution of tasks. Toil offers a strong focus on cloud execution. A minimal Luigi sketch appears below.

Many existing implementations of bioinformatics software tend to work with large 'monolithic' disk-based files, which impedes the ability of work tasks to be efficiently farmed out to individual cores or nodes in a cluster, or to ephemeral machine instances in the cloud. Efforts such as Big Data Genomics (http://bdgenomics.org) aim to make common data formats 'splittable' for use with Hadoop and Spark-based scalable distributed computing frameworks. These efforts will likely also require the use of new or existing class-based pipelines for tasks to be tightly coupled to individual data structures within the library, allowing a high level of granularity in terms of the concurrent processing of data.

Figure 6. A Pegasus DAX (Directed Acyclic Graph in XML). A subsequent step to alignment has been included to show that a Pegasus task relies on explicit job IDs to
identify its antecedents rather than a filename pattern to identify its dependencies. Pegasus has no built-in system of variable injection, but includes APIs to produce
DAX files.
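The DAX in the figure is an image in the original; an illustrative sketch of the job-ID-based dependency declaration, loosely following the DAX 3 schema (element details abbreviated and assumed), might look like:

    <adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.4"
          name="align-pipeline">
      <job id="ID0000001" name="bwa_mem">
        <argument>mem hg19.fa sampleA_1.fq sampleA_2.fq</argument>
        <uses name="sampleA_1.fq" link="input"/>
        <uses name="sampleA_2.fq" link="input"/>
        <uses name="sampleA.sam" link="output"/>
      </job>
      <job id="ID0000002" name="sort_sam">
        <argument>sampleA.sam sampleA.sorted.sam</argument>
        <uses name="sampleA.sam" link="input"/>
        <uses name="sampleA.sorted.sam" link="output"/>
      </job>
      <!-- The child names its parent by job ID, not by filename -->
      <child ref="ID0000002">
        <parent ref="ID0000001"/>
      </child>
    </adag>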

Figure 7. The Galaxy Workflow Editor allows users to link inputs, outputs and tools on a graphical canvas.

Server workbenches

Unlike the command line-based pipeline frameworks reviewed previously, workbenches allow end-users, typically scientists, to design analyses by linking preconfigured modular tools together, typically using a drag-and-drop graphical interface. Because they require exacting specifications of inputs and outputs, workbenches are intrinsically a subset of configuration-based pipelines. The most popular bioinformatics server workbenches are Galaxy [11] and Taverna [12]. Galaxy serves as a Web-based interface for command line tools, whereas Taverna offers stand-alone clients and allows pipelines to access tools distributed across the Internet. Both allow users to share workflows and are intended for local installations (Figure 7).

For tools that already have a component plug-in, Taverna is an easy solution for end-users. Creating a new plug-in requires an in-depth knowledge of the XML-based API and exact specifications of acceptable input filetypes, parameter values, resource management and exception behavior. The onus is entirely on the developer to provide a means for new tools to exist in the Taverna ecosystem. Adding a new executable to Galaxy often requires only 20 lines of configuration code, but Galaxy wrappers can be less robust than those in Taverna, which requires slightly more familiarity with each tool on the part of end-users to implement.
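As a sense of scale for that configuration budget, a hypothetical Galaxy tool wrapper for a simple aligner (identifiers and labels invented for illustration) fits comfortably within it:

    <tool id="bwa_mem_sketch" name="BWA-MEM (sketch)" version="0.1.0">
      <description>Align paired reads with bwa mem</description>
      <command>bwa mem $ref $reads1 $reads2 > $output</command>
      <inputs>
        <param name="ref" type="data" format="fasta" label="Reference"/>
        <param name="reads1" type="data" format="fastq" label="Reads (mate 1)"/>
        <param name="reads2" type="data" format="fastq" label="Reads (mate 2)"/>
      </inputs>
      <outputs>
        <data name="output" format="sam" label="Alignment"/>
      </outputs>
    </tool>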

Cloud workbenches and APIs

Cloud computing, defined here as the on-demand rent of virtualized computing infrastructure from remote managed data centers, offers an attractive scalable option for collaborative multi-institutional research in terms of 'bringing the tools to the data'. Although subscription and compute costs are decreasing, the speed of file transfer over the Internet to the cloud remains an issue for these platforms. While all of the aforementioned pipelines can be installed on cloud infrastructure [13], cloud workbenches offer a layer of abstraction that simplifies the complex process of provisioning servers.

Commercial workbenches, such as DNAnexus (http://dnanexus.com), SevenBridges (http://sbgenomics.com) and Illumina's BaseSpace (http://basespace.illumina.com), leverage the scalability of cloud computing to offer high performance while offering development and user experiences comparable with local server-based open source workbenches. These providers also support APIs that allow users to launch automated large batch analyses without using a Web interface.

Next-generation cloud-based open source workbenches, such as Curoverse's Arvados (https://curoverse.com) and the iPlant Collaborative's Agave [14], largely dispense with the Web GUI as a primary design tool and instead are built from the ground up as APIs designed to enable the migration of local analyses to the cloud for collaborative research.
Future trends

The need for a consistent means of distributing popular tools among so many frameworks is driving an effort to standardize workflow description languages. The Common Workflow Language Specification (CWL; https://github.com/common-workflow-language) describes a shared platform for developing new tool descriptors, which has particular utility in supporting cloud-enabled workbenches and plug-ins. Among the frameworks reviewed here, Taverna, Galaxy, Toil, Arvados and SevenBridges have already made significant progress toward supporting the CWL. Another promising trend is the containerization of bioinformatic tools using Docker, lightweight virtualization software, which will enable frameworks to easily accommodate tools with complex software dependencies (Figure 8).

Figure 8. A snippet of the common workflow language describing the bwa mem alignment program.
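The snippet in the figure is not reproduced here; an illustrative tool descriptor in later (v1.0) CWL syntax, which postdates the figure, might read:

    # CWL tool descriptor sketch for bwa mem (file names illustrative).
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: [bwa, mem]
    inputs:
      reference:
        type: File
        inputBinding: {position: 1}
      reads:
        type: File[]
        inputBinding: {position: 2}
    outputs:
      alignment:
        type: stdout
    stdout: sampleA.sam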

Choosing a pipeline framework

Although there is no formal study of bioinformatics pipeline users specifically, a previous survey (https://github.com/michaelbarton/bioinformatics-career-survey) suggests the audience for bioinformatics development is evenly mixed between those with biological and computer science backgrounds, and between large and small institutions. The choice of framework should be informed both by the demands of developing the pipeline and by the requirements of those using it, even if the developers and end-users are the same people. The use of pipeline frameworks is intimately tied to reproducible computational research [15], as ad hoc analyses are not likely to be implemented in a pipeline. Reusable pipelines that can be run in the cloud are often preferable in terms of reproducible research and the type of collaborative 'big science' popular in modern sequencing studies.

Choosing between an implicit or explicit syntax is largely a question of personal preference. To developers unfamiliar with Make rule syntax, arranging a series of implicit wildcard rules and trusting the engine to infer a dependency tree can seem unintuitive, but this idiomatic style offers a high level of convenience for integrating executable tools.

Convention-based frameworks tend to encourage a high level of internal business logic. They also allow polished deliverables (Web sites, PDF reports) to be easily generated from the underlying data. At the same time, pipelines that 'think on their feet' would seem inherently less reproducible when compared with configuration-based pipelines that demand a paper trail, but often the latter simply force developers to write dynamic tools to generate static configurations. Because they are 'set in stone', configuration-based pipelines often enable cluster schedulers to consume an entire work plan in its entirety instead of receiving tasks in a piecemeal fashion, allowing the scheduler to better anticipate load and allocate both memory and compute resources.

Workbenches and class-based frameworks can be considered heavyweight. There are costs in terms of flexibility and ease of development associated with making a pipeline accessible or fast. Integrating new tools into workbenches clearly increases their audience, but, ironically, the developers who are most capable of developing plug-ins for workbenches are the least likely to use them. Class-based frameworks offer a high level of performance but, like workbenches, require highly skilled developers to build and maintain, and performance improvements are not guaranteed to justify the additional development time. The transition to high-performance computing (HPC) frameworks will likely favor class-based pipeline frameworks in the future, although this will severely limit the number of developers who will be able to contribute to these pipelines, owing to the inherent complexity of HPC development compared with DSLs. A recent survey of institutions using bioinformatics pipelines [16] found that virtually every participant anticipated further use of HPC-enabled pipelines in the future and had struggled with issues of reproducibility and data provenance. These issues require intense attention to implementing highly customized solutions that do not lend themselves to lightweight pipelines.

For those laboratories that neither serve a large number of pure biologists who demand a workbench interface nor require the high level of performance that class-based pipelines offer, a clear choice is not so obvious. One heuristic to consider when choosing a framework is 'return on investment'. Laboratories that conduct large-scale, highly repetitive research requiring a high degree of data provenance and versioning may benefit from configuration-based pipelines. Laboratories doing exploratory proofs-of-concept would see little reason to use more heavyweight frameworks; explicit DSL-based pipelines are adequate.

Finally, most laboratories, especially those without access to internal HPC resources, should consider cloud-based workbenches and APIs. These offer many of the features of server-based frameworks, with the added bonus of unlimited scalability and collaborative research opportunities, albeit incurring direct costs.

Although this review is not intended to be an exhaustive list of pipeline frameworks, such lists do exist (e.g. https://github.com/pditommaso/awesome-pipeline). For laboratories relying solely on scripts, the choice of a framework, especially one to accommodate new custom tools, may seem overwhelming and irreversible, but all frameworks use the parameterization of inputs, outputs and tool descriptors. Once a script-based pipeline is implemented in one framework, transitioning to a different one is relatively simple should priorities change.

Key Points

• Key pipeline concepts of dependency and reentrancy were introduced by Make.
• Pipelines are best distinguished not by features but by design philosophy.
• Modern bioinformatic frameworks use a convention, configuration or class-based design paradigm and use an explicit or implicit syntax.
• Workbenches and class-based frameworks offer ease of use and performance, respectively, but require additional investment in time and expertise to integrate new tools.
• Cloud-based platforms offer scalability and collaborative research advantages.
• Developers choosing a pipeline framework should consider the return on investment when considering more heavyweight options.

Acknowledgements

I thank Deanne Taylor, Chaomei Chen, and Jane Greenberg for their support. I also thank the anonymous reviewers for their helpful comments and suggestions.

Funding

This work is supported by The Children's Hospital of Philadelphia.

References

1. Fisch KM, Meißner T, Gioia L, et al. Omics Pipe: a community-based framework for reproducible multi-omics data analysis. Bioinformatics 2015;31:1724–8.
2. Sussman GJ. Building robust systems: an essay. Citeseer 2007;113:1324.
3. Stallman RM, McGrath R. GNU Make: A Program for Directed Recompilation, Version 3.79.1. Boston: GNU Press, 2002.
4. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 2013;1303.3997v2.
5. Köster J, Rahmann S. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 2012;28:2520–2.
6. Cingolani P, Sladek R, Blanchette M. BigDataScript: a scripting language for data pipelines. Bioinformatics 2015;31:10–16.
7. Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics 2010;26:2778–9.
8. Sadedin SP, Pope B, Oshlack A. Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics 2012;28:1525–6.
9. Deelman E, Singh G, Su M-H, et al. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci Program 2005;13:219–37.
10. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
11. Goecks J, Nekrutenko A, Taylor J, et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86.
12. Wolstencroft K, Haines R, Fellows D, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 2013;41:W557–61.
13. Liu B, Madduri RK, Sotomayor B, et al. Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses. J Biomed Inform 2014;49:119–33.
14. Dooley R, Vaughn M, Stanzione D, et al. Software-as-a-service: the iPlant Foundation API. 5th IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), 2012, Salt Lake City, Utah, USA.
15. Hurley DG, Budden DM, Crampin EJ. Virtual reference environments: a simple way to make research reproducible. Brief Bioinform 2015;16:901–3.
16. Spjuth O, Bongcam-Rudloff E, Hernández GC, et al. Experiences with workflows for automating data-intensive bioinformatics. Biol Direct 2015;10:43.
