PaPy: Parallel and Distributed Data-Processing Pipelines in Python
PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written Python functions (nodes) connected by 'pipes' (edges) into a directed acyclic graph. These functions are arbitrarily definable, and can make use of any Python modules or external binaries. Given a user-defined topology and collection of input data, functions are composed into nested higher-order maps, which are transparently and robustly evaluated in parallel on a single computer or on remote hosts. Local and remote computational resources can be flexibly pooled and assigned to functional nodes, thereby allowing facile load-balancing and pipeline optimization to maximize computational throughput. Input items are processed by nodes in parallel, and traverse the graph in batches of adjustable size - a trade-off between lazy-evaluation, parallelism, and memory consumption. The processing of a single item can be parallelized in a scatter/gather scheme. The simplicity and flexibility of distributed workflows using PaPy bridges the gap between desktop -> grid, enabling this new computing paradigm to be leveraged in the processing of large scientific datasets.

Introduction

Computationally-intense fields ranging from astronomy to chemoinformatics to computational biology typically involve complex workflows of data production or aggregation, processing, and analysis. Several fundamentally different forms of data - sequence strings (text files), coordinates (and coordinate trajectories), images, interaction maps, microarray data, videos, arrays - may exist in distinct file formats, and are typically processed using available tools. Inputs/outputs are generally linked (if at all) via intermediary files in the context of some automated build software or scripts. The recent exponential growth of datasets generated by high-throughput scientific approaches (e.g. structural genomics [TeStYo09]) or high-performance parallel computing methods (e.g. molecular dynamics [KlLiDr09]) necessitates more flexible and scalable tools at the consumer end, enabling, for instance, the leveraging of multiple CPU cores and computational grids. However, using files to communicate and synchronize processes is generally inconvenient and inefficient, particularly if specialized scientific Python modules (e.g., BioPython [CoAnCh09], PyCogent [Knight07], Cinfony [OBHu08], MMTK [Hinsen00], Biskit [GrNiLe07]) are to be used.

Many computational tasks fundamentally consist of chained transformations of collections of data that are independent, and likely of variable type (strings, images, etc.). The scientific programmer is required to write transformation steps, connect them and - for large datasets to be feasible - parallelize the processing. Solutions to this problem generally can be divided into: (i) Make-like software build tools, (ii) workflow management systems (WMS), or (iii) grid engines and frontends. PaPy, which stands for parallel pipelines in Python, is a module for processing arbitrary streams of data (files, records, simulation frames, images, videos, etc.) via functions connected into directed graphs (flowcharts) like a WMS. It is not a parallel computing paradigm like MapReduce [DeGh08] or BSP [SkHiMc96], nor is it a dependency-handling build tool like Scons [Knight05]. Neither does it support declarative programming [Lloyd94]. In a nutshell, PaPy is a tool that makes it easy to structure procedural workflows into Python scripts. The tasks and data are composed into nested higher-order map functions, which are transparently and robustly evaluated in parallel on a single computer or remote hosts.

Workflow management solutions typically provide a means to connect standardized tasks via a structured, well-defined data format to construct a workflow. For transformations outside the default repertoire of the program, the user must program a custom task with inputs and outputs in some particular (WMS-specific) format. This, then, limits the general capability of a WMS in utilizing available codes to perform non-standard or computationally-demanding analyses. Examples of existing frameworks for constructing data-processing pipelines include Taverna (focused on web-services; run locally [OiAdFe04]), DAGMan (general; part of the Condor workload management system [ThTaLi05]) and Cyrille2 (focused on genomics; run on SGE clusters [Ham08]). A typical drawback of integrated WMS solutions such as the above is that, for tasks which are not in the standard repertoire of the program, the user has to either develop a custom task or revert to traditional scripting for parts of the pipeline; while such an approach offers an immediate solution, it is not easily sustainable, scalable, or adaptable, insofar as the processing logic becomes hardwired into these script-based workflows.

In PaPy, pipelines are constructed from Python functions with strict call semantics. Most general-purpose functions to support input/output, databases, inter-process communication (IPC), serialization, topology, and mathematics are already a part of PaPy. Domain-specific functions (e.g. parsing a specific file-format)
must be user-provided, but have no limitations as to functional complexity, used libraries, called binaries or web-services, etc. Therefore, as a general pipeline construction tool, PaPy is intentionally lightweight, and is entirely agnostic of specific application domains.

Our approach with PaPy is a highly modular workflow-engine, which neither enforces a particular data-exchange or restricted programming model, nor is tied to a single, specific application domain. This level of abstraction enables existing code-bases to be easily wrapped into a PaPy pipeline and benefit from its robustness to exceptions, logging, and parallelism.

Architecture and design
PaPy is a Python module ("papy") written to enable the logical design and deployment of efficient data-processing pipelines. Central design goals were to make the framework (i) natively parallel, (ii) flexible, (iii) robust, (iv) free of idiosyncrasies and dependencies, and (v) easily usable. Therefore, PaPy's modular, object-oriented architecture utilizes familiar concepts such as map constructs from functional programming, and directed acyclic graphs. Parallelism is achieved through the shared worker-pool model [Sunderam90].

The architecture of PaPy is remarkably simple, yet flexible. It consists of only four core component classes to enable construction of a data-processing pipeline. Each class provides an isolated subset of the functionality (Table 1), which together includes facilities for arbitrary flow-chart topology, execution (serial, parallel, distributed), user function wrapping, and run-time interactions (e.g. logging). The pipeline is a way of expressing what (functions), where (topology) and how (parallelism) a collection of (potentially interdependent) calculations should be performed.

Table 1: Components (classes) and their roles.

IMap¹: Implements a process/thread pool. Evaluates multiple, nested map functions in parallel, using a mixture of threads or processes (locally) and, optionally, remote RPyC servers.

Piper, Worker: Processing nodes of the pipeline created by wrapping user-defined functions; also, exception handling, logging, and scatter-gather functionality.

Dagger: Defines the data-flow and the pipeline in the form of a directed acyclic graph (DAG); allows one to add, remove, connect pipers, and validate topology. Coordinates the starting/stopping of IMaps.

Plumber: Interface to monitor and run a pipeline; provides […]

¹Note that the IMap class is available as a separate Python module.

Pipelines (see Figure 1) are constructed by connecting functional units (Piper instances) by directed pipes, and are represented as a directed acyclic graph data structure (Dagger instance). The pipers correspond to nodes and the pipes to edges in a graph. The topological sort of this graph reflects the input/output dependencies of the pipers, and it is worth noting that any valid DAG is a valid PaPy pipeline topology (e.g., pipers can have multiple incoming and outgoing pipes, and the pipeline can have multiple inputs and outputs). A pipeline input consists of an iterable collection of data items, e.g. a list. PaPy does not utilize a custom file format to store a pipeline; instead, pipelines are constructed and saved as executable Python code. The PaPy module can be arbitrarily used within a Python script, although some helpful and relevant conventions to construct a workflow script are described in the online documentation.
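As a minimal sketch of this construction style (distilled from the full example given later in this paper; the trivial double function is added here purely for illustration, and a Piper created without a parallel IMap is assumed to evaluate serially):

from papy import Plumber, Piper, Worker
from papy import workers

def double(inbox):
    # a user-written function with strict call semantics:
    # input items arrive in an 'inbox' sequence
    return inbox[0] * 2

# wrap the functions and connect the two pipers by a directed pipe
p_double = Piper(Worker(double))
p_print = Piper(Worker(workers.io.print_))
pipes = Plumber()
pipes.add_pipe((p_double, p_print))

# feed an iterable collection of input items and run to completion
pipes.start([range(5)])
pipes.run()
pipes.wait()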
The functionality of a piper is defined by user-written functions, which are Python functions with strict call semantics. There are no limits as to what a function does, apart from the requirement that any modules it utilizes must be available on the remote execution hosts (if utilizing RPyC). A function can be used by multiple pipers, and multiple functions can be composed within a single piper. CPU-intensive tasks with little input data (e.g., MD simulations, collision detection, graph matching) are preferred because of the high speed-up through parallel execution.

Within a PaPy pipeline, data are shared as Python objects; this is in contrast to workflow management solutions (e.g., Taverna) that typically enforce a specific data exchange scheme. The user has the choice to use any or none of the structured data-exchange formats, provided the tools for using them are available for Python. Communicated Python objects need to be serializable, by default using the standard Pickle protocol.

Synchronization and data communication between pipers within a pipeline is achieved by virtue of queues and locked pipes. No outputs or intermediate results are implicitly stored, in contrast to the usage of temporary files by Make-like software. Data can be saved anywhere within the pipeline by using pipers for data serialization (e.g. JSON) and archiving (e.g. file-based). PaPy maintains data integrity in the sense that an executing pipeline stopped by the user will have no pending (lost) results.
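Because communicated objects must, by default, survive the standard Pickle protocol noted above, a quick stdlib-only round-trip (independent of PaPy) shows what can safely cross piper boundaries:

import pickle

item = {'id': 42, 'coords': [(0.0, 1.5), (2.3, 4.4)]}
# objects exchanged between pipers must survive this round-trip
restored = pickle.loads(pickle.dumps(item))
assert restored == item
# objects that cannot be pickled (e.g. open file handles) should be
# created inside the consuming function rather than sent through a pipe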
Parallelism

Parallel execution is a major issue for workflows, particularly (i) those involving highly CPU-intensive methods like MD simulations or Monte Carlo sampling, or (ii) those dealing with large datasets (such as arise in astrophysics, genomics, etc.). PaPy provides
[…] too large to fit into memory (or to be stored as files), and to cope with highly variable execution times for input items (a common scenario on a heterogeneous grid, and one which would arise for certain types of tasks, such as replica-exchange MD simulations).

Figure 3. The stride as a trade-off between memory consumption and parallelism of execution. Rectangular boxes represent graph traversal in batches. The pipers involved (N-1, N, N+2) are shown on the right (explanation in text).

Inter-process communication

A major aspect - and often bottleneck - of parallel computing is inter-process communication (IPC; Fig. 4) [LiYa00]. In PaPy, IPC occurs between parallel pipers connected in a workflow. The communication process is two-stage and involves a manager process, i.e. the local Python interpreter used to start the workflow (Fig. 4). A coordinating process is necessary because the connected nodes might evaluate functions in processes with no parent/child relationship. If communication occurs between processes on different hosts, an additional step of IPC (involving a local and a remote RPyC process) is present. Inter-process communication involves data serialization (i.e. representation in a form which can be sent or stored), the actual data-transmission (e.g. over a network socket) and, finally, de-serialization on the recipient end. Because the local manager process is involved in serializing (de-serializing) data to (from) each parallel process, it can clearly limit pipeline performance if large amounts of data are to be communicated.

PaPy provides functionality for direct communication of producer and consumer processes, thereby mostly eliminating the manager process from IPC and alleviating the bottleneck described above. Multiple serialization and transmission media are supported. In general, the producer makes data available (e.g. by serializing it and opening a network socket) and sends only the information needed by the consumer end to locate the data (e.g. the host and port of the network socket) via the manager process. The consumer end receives this information and reads the data. Direct communication comes at the cost of losing platform-independence, as the operating system(s) have to properly support the chosen transmission medium (e.g. Unix pipes). Table 2 summarizes PaPy's currently available options.

Table 2: Direct inter-process communication methods.²

Method    OS         Remarks
socket    all        Communication between hosts connected by a network.
pipe      UNIX-like  Communication between processes on a single host.
file      all        The storage location needs to be accessible by all processes, e.g. over NFS or a SAMBA share.
shm       POSIX      Shared memory support is provided by the posix_shm library; it is an alternative to communication by pipes.
database  all        Serialized data can be stored as (key, value) pairs in a database. The keys are semi-random. Currently SQLite and MySQL are supported, as provided by mysql-python and sqlite3.

²In addition to the default Pickle protocol, supported serialization includes, e.g., marshal and JSON.
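The mechanics of the socket method can be pictured in plain Python: the producer serializes the data and opens a listening socket, only the socket's location travels through the coordinating process, and the consumer connects and de-serializes. The sketch below mimics, rather than calls, PaPy's dump_item/load_item machinery, with a Queue standing in for the manager process:

import pickle
import socket
import threading
import Queue

def produce(data, announce):
    # serialize the data and make it available on a network socket
    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))        # let the OS choose a free port
    srv.listen(1)
    announce.put(srv.getsockname())   # only (host, port) goes via the manager
    conn, _ = srv.accept()
    conn.sendall(pickle.dumps(data))
    conn.close()
    srv.close()

def consume(announce):
    # receive the location, then read and de-serialize the payload
    conn = socket.create_connection(announce.get())
    buf = ''
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
    conn.close()
    return pickle.loads(buf)

manager = Queue.Queue()   # stands in for the manager process
producer = threading.Thread(target=produce, args=(range(10), manager))
producer.start()
print consume(manager)    # prints [0, 1, ..., 9]
producer.join()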
Note that it is possible to avoid some IPC by logically grouping processing steps within a single piper. This is done by constructing a single piper instance from a worker instance created from a tuple of user-written functions, instead of constructing multiple piper instances from single-function worker instances. A worker instance is a callable object passed to the constructor of the Piper class. Also, note that any linear, non-branching segment of a pipeline can be collapsed into a single piper. This has the performance advantage that no IPC occurs between functions within a single piper, as they are executed in the same process.
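As a sketch of such grouping (the tuple-of-functions Worker follows the usage in the example listing below; the parse and compute functions are made up for illustration, and each composed function is assumed to receive its predecessor's output in the inbox):

from papy import Piper, Worker

def parse(inbox):
    # hypothetical first step: turn a raw text line into numbers
    return [float(field) for field in inbox[0].split()]

def compute(inbox):
    # hypothetical second step: reduce the numbers to one value
    return sum(inbox[0])

# one piper, functions composed within a single process: no IPC
p_both = Piper(Worker((parse, compute)))

# equivalent two-piper version: results must cross process boundaries
p_parse = Piper(Worker(parse))
p_compute = Piper(Worker(compute))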
[…] functions. These two features have been a major design goal. Robustness is achieved by embedding calls to user functions in a try ... except clause. If an exception is raised, it is caught and does not stop the execution of the workflow (rather, it is wrapped and passed as a placeholder). Subsequent pipers ignore and propagate such objects. Logging is supported via the logging module from the Python standard library. The papy and IMap packages emit logging statements at several levels of detail, i.e. DEBUG, INFO, ERROR; additionally, a function to easily set up and save or display logs is included. The log is written in real time, and can be used to monitor the execution of a workflow.
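The wrapping behavior can be pictured with ordinary Python; the sketch below illustrates the described mechanism and is not PaPy's actual implementation:

def wrap_call(func, inbox):
    # a caught exception becomes a placeholder result instead of
    # aborting the workflow
    try:
        return func(inbox)
    except Exception, exception:
        return exception

def downstream(inbox):
    # subsequent pipers ignore and propagate placeholder objects
    if isinstance(inbox[0], Exception):
        return inbox[0]
    return inbox[0] * 2

placeholder = wrap_call(lambda inbox: 1 / inbox[0], [0])  # ZeroDivisionError caught
print downstream([placeholder])                           # propagated, not raised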
Usage notes
A started parallel piper consumes a sequence of N input items (where N is defined by the "stride" argument), and produces a sequence of N resultant items. Pipers are by default "ordered", meaning that an input item and its corresponding result item have the same index in both sequences. The order in which result items become available may differ from the order in which input items are submitted for parallel processing. In a pipeline, result items of an upstream piper are input items for a downstream piper. The downstream piper can process input items only as fast as result items are produced by the upstream piper. Thus, an inefficiency arises if the upstream piper does not return an available result because it is out of order. This results in idle processes, and the problem can be addressed by using a "stride" larger than the number of processes, or by allowing the upstream piper to return results in the order they become available. The first solution results in higher memory consumption, while the second irreversibly abolishes the original order of input data.
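A hedged sketch of these two remedies follows; the stride and ordered keyword names mirror the text's terminology, but their exact placement in the API is an assumption here:

from IMap import IMap
from papy import Piper, Worker

def slow(inbox):
    # stand-in for a task with highly variable run time
    import time, random
    time.sleep(random.random())
    return inbox[0]

# remedy 1: a stride larger than the number of processes keeps
# workers busy, at the cost of buffering more items in memory
imap_wide = IMap(worker_num=4, stride=8)    # 'stride' kwarg: assumed API

# remedy 2: give up input order so results return as soon as ready
p_unordered = Piper(Worker(slow), parallel=IMap(worker_num=4),
                    ordered=False)          # 'ordered' kwarg: assumed API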
Graphical interface

As a Python package, PaPy's main purpose is to supply and expose an API for the abstraction of a parallel workflow. This has the advantage of flexibility (e.g. usage within other Python programs), but requires that the programmer learn the API. A graphical user interface (GUI) is currently being actively developed (Fig. 5). The motivation for this functionality is to allow a user to interactively construct, execute (e.g. pause execution), and monitor (e.g. view logs) a workflow. While custom functions will still have to be written in Python, the GUI liberates the user from knowing the specifics of the PaPy API; instead, the user explores the construction of PaPy workflows by connecting objects via navigation in the GUI.

Workflow construction example

The following code listing illustrates the steps in the construction of a distributed PaPy pipeline. The first of the two nodes evaluates a function (which simply determines the host on which it is run), and the second prints the result locally. The first piper is assigned to a virtual resource combining local and remote processes. The script takes two command-line arguments: a definition of the available remote hosts and a switch for using TCP sockets for direct inter-process communication between the pipers. The source code uses the imports decorator, a construct that allows import statements to be attached to the code of a function. As noted earlier, the imported modules must be available on all hosts on which this function is run.

The pipeline is started, for example, via:

$ python pipeline.py \
    --workers=HOST1:PORT1#2,HOST2:PORT1#4

which uses 2 processes on HOST1 and 4 on HOST2, and all locally-available CPUs. Remote hosts can be started (assuming appropriate firewall settings) by:

$ python RPYC_PATH/servers/classic_server.py \
    -m forking -p PORT
#!/usr/bin/env python
# Part 0: import the PaPy infrastructure.
# papy and IMap are separate modules
from papy import Plumber, Piper, Worker
from IMap import IMap, imports
from papy import workers

# Part 1: define user functions
@imports(['socket', 'os', 'threading'])
def where(inbox):
    result = "input: %s, host: %s, parent: %s, " \
             "process: %s, thread: %s" % \
        (inbox[0],
         socket.gethostname(),      # the host name as reported by the OS
         os.getppid(),              # parent process id
         os.getpid(),               # process id
         threading._get_ident())    # unique Python thread identifier
    return result

# Part 2: define the topology
def pipeline(remote, use_tcp):
    # create an IMap instance which uses the 'remote' hosts
    imap_ = IMap(worker_num=0, worker_remote=remote)
    # define the communication protocol, i.e. create worker
    # instances with or without explicit load_item functions
    if not use_tcp:
        w_where = Worker(where)
        w_print = Worker(workers.io.print_)
    else:
        w_where = Worker((where, workers.io.dump_item),
                         kwargs=({}, {'type': 'tcp'}))
        w_print = Worker((workers.io.load_item,
                          workers.io.print_))
    # combine the worker instances into piper instances
    p_where = Piper(w_where, parallel=imap_)
    p_print = Piper(w_print, debug=True)
    # assemble the piper instances into a workflow
    # (pipers become nodes of the graph)
    pipes = Plumber()
    pipes.add_pipe((p_where, p_print))
    return pipes

# Part 3: execute the pipeline
if __name__ == '__main__':
    # The following code is not PaPy-specific; it interprets the
    # command-line arguments using getopt.
    import sys
    from getopt import getopt
    args = dict(getopt(sys.argv[1:], '',
                       ['use_tcp=', 'workers='])[0])
    # parse arguments
    use_tcp = eval(args['--use_tcp'])   # bool
    remote = args['--workers']
    remote = remote.split(',')
    remote = [hn.split('#') for hn in remote]
    remote = [(h, int(n)) for h, n in remote]
    # create the pipeline (see comments in the function)
    pipes = pipeline(remote, use_tcp)
    # the input to the pipeline is a list of 100 integers
    pipes.start([range(100)])
    # this starts the pipeline execution
    pipes.run()
    # wait until all input items are processed
    pipes.wait()
    # pause and stop (a running pipeline cannot be stopped)
    pipes.pause()
    pipes.stop()
    # print execution statistics
    print pipes.stats

Discussion and conclusions

In the context of PaPy, the factors dictating the computational efficiency of a user's pipeline are the nature of the individual functions (nodes, pipers), and the nature of the data linkages between the constituent nodes in the graph (edges, pipes). Although distributed and parallel computing methods are becoming ubiquitous in many scientific domains (e.g., biologically meaningful µsec-scale MD simulations [KlLiDrSh09]), data post-processing and analysis are not keeping pace, and will become only increasingly difficult on desktop workstations.

It is expected that the intrinsic flexibility underlying PaPy's design, and its easy resource distribution, could make it a useful component in the scientist's data-reduction toolkit. It should be noted that some data-generation workflows might also be expressible as pipelines. For instance, parallel tempering / replica-exchange MD [EaDe05] and multiple-walker metadynamics [Raiteri06] are examples of intrinsically parallelizable algorithms for the exploration and reconstruction of free energy surfaces of sufficient granularity. In those computational contexts, PaPy could be used to orchestrate data generation as well as data aggregation / reduction / analysis.

In conclusion, we have designed and implemented PaPy, a workflow-engine for the Python programming language. PaPy's features and capabilities include: (1) construction of arbitrarily complex pipelines; (2) flexible tuning of local and remote parallelism; (3) specification of shared local and remote resources; (4) versatile handling of inter-process communication; and (5) an adjustable laziness/parallelism/memory trade-off. In terms of usability and other strengths, we note that PaPy exhibits (1) robustness to exceptions; (2) graceful support for time-outs; (3) real-time logging functionality; (4) cross-platform interoperability; (5) extensive testing and documentation (a 60+ page manual); and (6) a simple, object-oriented API accompanied by a preliminary version of a GUI.

Availability

PaPy is distributed as an open-source, platform-independent Python (CPython 2.6) module at http://muralab.org/PaPy, where extensive documentation also can be found. It is easily installed via the Python Package Index (PyPI) at http://pypi.python.org/pypi/papy/ using setuptools by easy_install papy.

Acknowledgements

We thank the University of Virginia for start-up funds in support of this research.