A Review of Bioinformatic Pipeline Frameworks
doi: 10.1093/bib/bbw020
Advance Access Publication Date: 24 March 2016
Paper
Abstract
High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata.
Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a
configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.
Jeremy Leipzig is a bioinformatics software developer at The Children’s Hospital of Philadelphia. He received his MS in Computer Science from North
Carolina State University. He is pursuing a PhD in Information Studies at Drexel University.
Submitted: 10 December 2015; Received (in revised form): 29 January 2016
© The Author 2016. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Figure 1. A DAG (Directed Acyclic Graph) depicting a trio analysis pipeline for detecting de novo mutations.
corruption or bugs. A pipeline must be able to recover from the nearest checkpoint rather than overwrite or 'clobber' otherwise usable intermediate files. In addition, the introduction of new upstream files, such as samples, in an analysis should not necessitate reprocessing existing samples.

Make

Despite its origin as a compiler build automation tool early in computing history, the Make utility [3] is still successfully used to manage file transformations common to scientific computing pipelines. Make introduced the concept of 'implicit wildcard rules', which define available file transformations based on file suffixes (Figure 2).

Figure 2. The basic Make rule syntax.
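As a concrete illustration of this convention, here is a minimal sketch of an implicit wildcard (pattern) rule of the kind Figure 2 depicts; the bwa invocation and reference filename are illustrative assumptions, not a reproduction of the paper's figure:

    # Pattern rule: build any .sai alignment from the matching .fastq.
    # $< expands to the first prerequisite (the .fastq file),
    # $@ expands to the target (the .sai file).
    %.sai: %.fastq
            bwa aln ref.fasta $< > $@

With such a rule in place, running 'make sample1.sai' rebuilds the target only when it is missing or older than sample1.fastq.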
A dependency tree is generated by Make from these rules. When Make is asked to build a downstream file, or 'target', file modification datetimes are used to determine whether any of that target's dependencies are newer than the target or its intermediates. The dependency tree allows Make to infer the steps required to make any target for which a rule chain exists. Make is a 'domain-specific language' (DSL): it provides a convention-based syntax for describing inputs and outputs, with special symbols ($<, $@, $*) serving as shortcuts for accessing filename stems, paths and suffixes of both the target and prerequisites (Figure 3).

Because it was never designed for scientific pipelines, Make has several limitations that render it impractical for modern bioinformatic analyses. Make has no built-in support for distributed computing, so dispatching tasks that can be run in parallel on several nodes of a cluster is not easily done within the Make framework. Make's syntax is restricted to one wildcard per rule and does not allow for lookup tables or other means of associating inputs to outputs other than exact filename stem matching. Although Make allows a low level of shell scripting, more sophisticated logic is difficult to implement.
Modern pipeline frameworks
In recent years, a number of new pipeline frameworks have been developed to address Make's limitations in syntax, monitoring and parallel processing, as well as to offer new features relevant to bioinformatics and reproducible research, such as visualization, version tracking and summary reports (Table 1).
Note. Ease of development refers to the effort required to compose workflows and also wrap new tools, such as custom or publicly available scripts and executables.
Ease of use refers to the effort required to use existing pipelines to process new data, such as samples, and also the ease of sharing pipelines in a collaborative fashion.
Performance refers to the efficiency of the framework in executing a pipeline, in terms of both parallelization and scalability. More stars connote 'easier' or 'faster'.
Figure 4. A Snakemake rule for performing a sequence alignment. This example uses a global configuration dictionary that allows parameters to be specified in JSON or
YAML-formatted configuration files or in the command line.
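A minimal sketch of such a rule, with hypothetical file patterns and a hypothetical 'reference' configuration key (not reproduced from the paper's figure):

    # The {sample} wildcard matches requested outputs to inputs;
    # config["reference"] is injected from a JSON/YAML config file
    # or from the command line (e.g. --config reference=ref.fasta).
    rule align:
        input:
            "samples/{sample}.fastq"
        output:
            "aligned/{sample}.sai"
        shell:
            "bwa aln {config[reference]} {input} > {output}"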
Figure 5. Tasks in Ruffus explicitly depend on other tasks, not file targets.
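For comparison, a small hypothetical Ruffus pipeline in this spirit; note that the @transform decorator names the upstream task object itself, not an output file pattern:

    import subprocess
    from ruffus import originate, transform, suffix, pipeline_run

    @originate(["sample1.fastq"])
    def fetch_reads(output_file):
        # Placeholder upstream task that would stage the raw reads.
        open(output_file, "w").close()

    # The dependency here is the task fetch_reads, not a filename.
    @transform(fetch_reads, suffix(".fastq"), ".sai")
    def align(input_file, output_file):
        subprocess.check_call(
            "bwa aln ref.fasta %s > %s" % (input_file, output_file),
            shell=True)

    pipeline_run([align])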
Figure 6. A Pegasus DAX (Directed Acyclic Graph in XML). A subsequent step to alignment has been included to show that a Pegasus task relies on explicit job IDs to
identify its antecedents rather than a filename pattern to identify its dependencies. Pegasus has no built-in system of variable injection, but includes APIs to produce
DAX files.
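An abbreviated, hand-written sketch of that structure (job names and arguments are hypothetical, and namespace attributes are omitted; in practice the DAX APIs emit this XML):

    <adag name="align-pipeline">
      <job id="ID0000001" name="bwa_aln">
        <argument>ref.fasta sample1.fastq</argument>
      </job>
      <job id="ID0000002" name="bwa_samse">
        <argument>ref.fasta sample1.sai sample1.fastq</argument>
      </job>
      <!-- dependencies reference explicit job IDs, not file patterns -->
      <child ref="ID0000002">
        <parent ref="ID0000001"/>
      </child>
    </adag>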
Figure 7. The Galaxy Workflow Editor allows users to link inputs, outputs and tools on a graphical canvas.
to design analyses by linking preconfigured modular tools together, typically using a drag-and-drop graphical interface. Because they require exacting specifications of inputs and outputs, workbenches are intrinsically a subset of configuration-based pipelines. The most popular bioinformatics server workbenches are Galaxy [11] and Taverna [12]. Galaxy serves as a Web-based interface for command line tools, whereas Taverna offers stand-alone clients and allows pipelines to access tools distributed across the Internet. Both allow users to share workflows and are intended for local installations (Figure 7).

For tools that already have a component plug-in, using Taverna is an easy solution for end-users. Creating a new plug-in requires an in-depth knowledge of the XML-based API and exact specifications of acceptable input filetypes, parameter values, resource management and exception behavior. The onus is entirely on the developer to provide a means for new tools to exist in the Taverna ecosystem. Adding a new executable to Galaxy often requires only 20 lines of configuration code, but Galaxy wrappers can be less robust than those in Taverna, which requires slightly more familiarity with each tool on the part of end-users to implement.
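To give a rough sense of that scale, a minimal hypothetical Galaxy tool wrapper is a short XML file along these lines (the $-prefixed variables are substituted by Galaxy's templating):

    <tool id="bwa_aln" name="BWA aln" version="0.1.0">
      <description>Align reads against a reference</description>
      <command>bwa aln $reference $reads > $output</command>
      <inputs>
        <param name="reads" type="data" format="fastq" label="Reads"/>
        <param name="reference" type="data" format="fasta" label="Reference"/>
      </inputs>
      <outputs>
        <data name="output" format="sai" label="Alignment (.sai)"/>
      </outputs>
    </tool>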