PaPy: Parallel and Distributed Data-Processing Pipelines in Python
PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written Python functions (nodes) connected by 'pipes' (edges) into a directed acyclic graph. These functions are arbitrarily definable, and can make use of any Python modules or external binaries. Given a user-defined topology and collection of input data, functions are composed into nested higher-order maps, which are transparently and robustly evaluated in parallel on a single computer or on remote hosts. Local and remote computational resources can be flexibly pooled and assigned to functional nodes, thereby allowing facile load-balancing and pipeline optimization to maximize computational throughput. Input items are processed by nodes in parallel, and traverse the graph in batches of adjustable size - a trade-off between lazy-evaluation, parallelism, and memory consumption. The processing of a single item can be parallelized in a scatter/gather scheme. The simplicity and flexibility of distributed workflows using PaPy bridges the gap between desktop -> grid, enabling this new computing paradigm to be leveraged in the processing of large scientific datasets.

Introduction

Computationally-intense fields ranging from astronomy to chemoinformatics to computational biology typically involve complex workflows of data production or aggregation, processing, and analysis. Several fundamentally different forms of data - sequence strings (text files), coordinates (and coordinate trajectories), images, interaction maps, microarray data, videos, arrays - may exist in distinct file formats, and are typically processed using available tools. Inputs/outputs are generally linked (if at all) via intermediary files in the context of some automated build software or scripts. The recent exponential growth of datasets generated by high-throughput scientific approaches (e.g. structural genomics [TeStYo09]) or high-performance parallel computing methods (e.g. molecular dynamics [KlLiDr09]) necessitates more flexible and scalable tools at the consumer end, enabling, for instance, the leveraging of multiple CPU cores and computational grids. However, using files to communicate and synchronize processes is generally inconvenient and inefficient, particularly if specialized scientific Python modules (e.g., BioPython [CoAnCh09], PyCogent [Knight07], Cinfony [OBHu08], MMTK [Hinsen00], Biskit [GrNiLe07]) are to be used.

Many computational tasks fundamentally consist of chained transformations of collections of data that are independent, and likely of variable type (strings, images, etc.). The scientific programmer is required to write transformation steps, connect them and - for large datasets to be feasible - parallelize the processing. Solutions to this problem generally can be divided into: (i) Make-like software build tools, (ii) workflow management systems (WMS), or (iii) grid engines and frontends. PaPy, which stands for parallel pipelines in Python, is a module for processing arbitrary streams of data (files, records, simulation frames, images, videos, etc.) via functions connected into directed graphs (flowcharts) like a WMS. It is not a parallel computing paradigm like MapReduce [DeGh08] or BSP [SkHiMc96], nor is it a dependency-handling build tool like Scons [Knight05]. Neither does it support declarative programming [Lloyd94]. In a nutshell, PaPy is a tool that makes it easy to structure procedural workflows into Python scripts. The tasks and data are composed into nested higher-order map functions, which are transparently and robustly evaluated in parallel on a single computer or remote hosts.

Workflow management solutions typically provide a means to connect standardized tasks via a structured, well-defined data format to construct a workflow. For transformations outside the default repertoire of the program, the user must program a custom task with inputs and outputs in some particular (WMS-specific) format. This, then, limits the general capability of a WMS in utilizing available codes to perform non-standard or computationally-demanding analyses. Examples of existing frameworks for constructing data-processing pipelines include Taverna (focused on web-services; run locally [OiAdFe04]), DAGMan (general; part of the Condor workload management system [ThTaLi05]) and Cyrille2 (focused on genomics; run on SGE clusters [Ham08]). A typical drawback of integrated WMS solutions such as the above is that, for tasks which are not in the standard repertoire of the program, the user has to either develop a custom task or revert to traditional scripting for parts of the pipeline; while such an approach offers an immediate solution, it is not easily sustainable, scalable, or adaptable, insofar as the processing logic becomes hardwired into these script-based workflows.

In PaPy, pipelines are constructed from Python functions with strict call semantics. Most general-purpose functions to support input/output, databases, inter-process communication (IPC), serialization, topology, and mathematics are already a part of PaPy. Domain-specific functions (e.g. parsing a specific file-format)
must be user-provided, but have no limitations as to functional complexity, used libraries, called binaries or web-services, etc. Therefore, as a general pipeline construction tool, PaPy is intentionally lightweight, and is entirely agnostic of specific application domains.

Our approach with PaPy is a highly modular workflow-engine, which neither enforces a particular data-exchange or restricted programming model, nor is tied to a single, specific application domain. This level of abstraction enables existing code-bases to be easily wrapped into a PaPy pipeline and benefit from its robustness to exceptions, logging, and parallelism.

Architecture and design
PaPy is a Python module ("papy") written to enable the logical design and deployment of efficient data-processing pipelines. Central design goals were to make the framework (i) natively parallel, (ii) flexible, (iii) robust, (iv) free of idiosyncrasies and dependencies, and (v) easily usable. Therefore, PaPy's modular, object-oriented architecture utilizes familiar concepts such as map constructs from functional programming, and directed acyclic graphs. Parallelism is achieved through the shared worker-pool model [Sunderam90].

The architecture of PaPy is remarkably simple, yet flexible. It consists of only four core component classes to enable construction of a data-processing pipeline. Each class provides an isolated subset of the functionality (Table 1), which together includes facilities for arbitrary flow-chart topology, execution (serial, parallel, distributed), user function wrapping, and run-time interactions (e.g. logging). The pipeline is a way of expressing what (functions), where (topology) and how (parallelism) a collection of (potentially interdependent) calculations should be performed.

Table 1: Components (classes) and their roles.

IMap¹: Implements a process/thread pool. Evaluates multiple, nested map functions in parallel, using a mixture of threads or processes (locally) and, optionally, remote RPyC servers.

Piper, Worker: Processing nodes of the pipeline created by wrapping user-defined functions; also, exception handling, logging, and scatter-gather functionality.

Dagger: Defines the data-flow and the pipeline in the form of a directed acyclic graph (DAG); allows one to add, remove, connect pipers, and validate topology. Coordinates the starting/stopping of IMaps.

Plumber: Interface to monitor and run a pipeline; provides […]

¹Note that the IMap class is available as a separate Python module.

Pipelines (see Figure 1) are constructed by connecting functional units (Piper instances) by directed pipes, and are represented as a directed acyclic graph data structure (Dagger instance). The pipers correspond to nodes and the pipes to edges in a graph. The topological sort of this graph reflects the input/output dependencies of the pipers, and it is worth noting that any valid DAG is a valid PaPy pipeline topology (e.g., pipers can have multiple incoming and outgoing pipes, and the pipeline can have multiple inputs and outputs). A pipeline input consists of an iterable collection of data items, e.g. a list. PaPy does not utilize a custom file format to store a pipeline; instead, pipelines are constructed and saved as executable Python code. The PaPy module can be arbitrarily used within a Python script, although some helpful and relevant conventions to construct a workflow script are described in the online documentation.
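As a minimal sketch of this construction style (distilled from the full example given later in this paper; the trivial double function is added here purely for illustration, and a Piper created without a parallel IMap is assumed to evaluate serially):

from papy import Plumber, Piper, Worker
from papy import workers

def double(inbox):
    # a user-written function with strict call semantics:
    # input items arrive in an 'inbox' sequence
    return inbox[0] * 2

# wrap the functions and connect the two pipers by a directed pipe
p_double = Piper(Worker(double))
p_print = Piper(Worker(workers.io.print_))
pipes = Plumber()
pipes.add_pipe((p_double, p_print))

# feed an iterable collection of input items and run to completion
pipes.start([range(5)])
pipes.run()
pipes.wait()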
The functionality of a piper is defined by user-written functions, which are Python functions with strict call semantics. There are no limits as to what a function does, apart from the requirement that any modules it utilizes must be available on the remote execution hosts (if utilizing RPyC). A function can be used by multiple pipers, and multiple functions can be composed within a single piper. CPU-intensive tasks with little input data (e.g., MD simulations, collision detection, graph matching) are preferred because of the high speed-up through parallel execution.

Within a PaPy pipeline, data are shared as Python objects; this is in contrast to workflow management solutions (e.g., Taverna) that typically enforce a specific data exchange scheme. The user has the choice to use any or none of the structured data-exchange formats, provided the tools for using them are available for Python. Communicated Python objects need to be serializable, by default using the standard Pickle protocol.

Synchronization and data communication between pipers within a pipeline is achieved by virtue of queues and locked pipes. No outputs or intermediate results are implicitly stored, in contrast to the usage of temporary files by Make-like software. Data can be saved anywhere within the pipeline by using pipers for data serialization (e.g. JSON) and archiving (e.g. file-based). PaPy maintains data integrity in the sense that an executing pipeline stopped by the user will have no pending (lost) results.
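Because communicated objects must, by default, survive the standard Pickle protocol noted above, a quick stdlib-only round-trip (independent of PaPy) shows what can safely cross piper boundaries:

import pickle

item = {'id': 42, 'coords': [(0.0, 1.5), (2.3, 4.4)]}
# objects exchanged between pipers must survive this round-trip
restored = pickle.loads(pickle.dumps(item))
assert restored == item
# objects that cannot be pickled (e.g. open file handles) should be
# created inside the consuming function rather than sent through a pipe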
Parallelism

Parallel execution is a major issue for workflows, particularly (i) those involving highly CPU-intensive methods like MD simulations or Monte Carlo sampling, or (ii) those dealing with large datasets (such as arise in astrophysics, genomics, etc.). PaPy provides
[…] too large to fit into memory (or to be stored as files), and to cope with highly variable execution times for input items (a common scenario on a heterogeneous grid, and one which would arise for certain types of tasks, such as replica-exchange MD simulations).

Figure 3. The stride as a trade-off between memory consumption and parallelism of execution. Rectangular boxes represent graph traversal in batches. The pipers involved (N-1, N, N+2) are shown on the right (explanation in text).

Inter-process communication

A major aspect - and often bottleneck - of parallel computing is inter-process communication (IPC; Fig. 4) [LiYa00]. In PaPy, IPC occurs between parallel pipers connected in a workflow. The communication process is two-stage and involves a manager process, i.e. the local Python interpreter used to start the workflow (Fig. 4). A coordinating process is necessary because the connected nodes might evaluate functions in processes with no parent/child relationship. If communication occurs between processes on different hosts, an additional step of IPC (involving a local and a remote RPyC process) is present. Inter-process communication involves data serialization (i.e. representation in a form which can be sent or stored), the actual data-transmission (e.g. over a network socket) and, finally, de-serialization on the recipient end. Because the local manager process is involved in serializing (de-serializing) data to (from) each parallel process, it can clearly limit pipeline performance if large amounts of data are to be communicated.

PaPy provides functionality for direct communication of producer and consumer processes, thereby mostly eliminating the manager process from IPC and alleviating the bottleneck described above. Multiple serialization and transmission media are supported. In general, the producer makes data available (e.g. by serializing it and opening a network socket) and sends only the information needed by the consumer end to locate the data (e.g. the host and port of the network socket) via the manager process. The consumer end receives this information and reads the data. Direct communication comes at the cost of losing platform-independence, as the operating system(s) have to properly support the chosen transmission medium (e.g. Unix pipes). Table 2 summarizes PaPy's currently available options.

Table 2: Direct inter-process communication methods.²

Method    OS         Remarks
socket    all        Communication between hosts connected by a network.
pipe      UNIX-like  Communication between processes on a single host.
file      all        The storage location needs to be accessible by all processes, e.g. over NFS or a SAMBA share.
shm       POSIX      Shared memory support is provided by the posix_shm library; it is an alternative to communication by pipes.
database  all        Serialized data can be stored as (key, value) pairs in a database. The keys are semi-random. Currently SQLite and MySQL are supported, as provided by mysql-python and sqlite3.

²In addition to the default Pickle protocol, supported serialization includes, e.g., marshal and JSON.
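The mechanics of the socket method can be pictured in plain Python: the producer serializes the data and opens a listening socket, only the socket's location travels through the coordinating process, and the consumer connects and de-serializes. The sketch below mimics, rather than calls, PaPy's dump_item/load_item machinery, with a Queue standing in for the manager process:

import pickle
import socket
import threading
import Queue

def produce(data, announce):
    # serialize the data and make it available on a network socket
    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))        # let the OS choose a free port
    srv.listen(1)
    announce.put(srv.getsockname())   # only (host, port) goes via the manager
    conn, _ = srv.accept()
    conn.sendall(pickle.dumps(data))
    conn.close()
    srv.close()

def consume(announce):
    # receive the location, then read and de-serialize the payload
    conn = socket.create_connection(announce.get())
    buf = ''
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
    conn.close()
    return pickle.loads(buf)

manager = Queue.Queue()   # stands in for the manager process
producer = threading.Thread(target=produce, args=(range(10), manager))
producer.start()
print consume(manager)    # prints [0, 1, ..., 9]
producer.join()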
Note that it is possible to avoid some IPC by logically grouping processing steps within a single piper. This is done by constructing a single piper instance from a worker instance created from a tuple of user-written functions, instead of constructing multiple piper instances from single-function worker instances. A worker instance is a callable object passed to the constructor of the Piper class. Also, note that any linear, non-branching segment of a pipeline can be collapsed into a single piper. This has the performance advantage that no IPC occurs between functions within a single piper, as they are executed in the same process.
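As a sketch of such grouping (the tuple-of-functions Worker follows the usage in the example listing below; the parse and compute functions are made up for illustration, and each composed function is assumed to receive its predecessor's output in the inbox):

from papy import Piper, Worker

def parse(inbox):
    # hypothetical first step: turn a raw text line into numbers
    return [float(field) for field in inbox[0].split()]

def compute(inbox):
    # hypothetical second step: reduce the numbers to one value
    return sum(inbox[0])

# one piper, functions composed within a single process: no IPC
p_both = Piper(Worker((parse, compute)))

# equivalent two-piper version: results must cross process boundaries
p_parse = Piper(Worker(parse))
p_compute = Piper(Worker(compute))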
[…] functions. These two features have been a major design goal. Robustness is achieved by embedding calls to user functions in a try ... except clause. If an exception is raised, it is caught and does not stop the execution of the workflow (rather, it is wrapped and passed as a placeholder). Subsequent pipers ignore and propagate such objects. Logging is supported via the logging module from the Python standard library. The papy and IMap packages emit logging statements at several levels of detail, i.e. DEBUG, INFO, ERROR; additionally, a function to easily set up and save or display logs is included. The log is written in real time, and can be used to monitor the execution of a workflow.
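The wrapping behavior can be pictured with ordinary Python; the sketch below illustrates the described mechanism and is not PaPy's actual implementation:

def wrap_call(func, inbox):
    # a caught exception becomes a placeholder result instead of
    # aborting the workflow
    try:
        return func(inbox)
    except Exception, exception:
        return exception

def downstream(inbox):
    # subsequent pipers ignore and propagate placeholder objects
    if isinstance(inbox[0], Exception):
        return inbox[0]
    return inbox[0] * 2

placeholder = wrap_call(lambda inbox: 1 / inbox[0], [0])  # ZeroDivisionError caught
print downstream([placeholder])                           # propagated, not raised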
Usage notes
A started parallel piper consumes a sequence of N input items (where N is defined by the "stride" argument), and produces a sequence of N resultant items. Pipers are by default "ordered", meaning that an input item and its corresponding result item have the same index in both sequences. The order in which result items become available may differ from the order in which input items are submitted for parallel processing. In a pipeline, result items of an upstream piper are input items for a downstream piper. The downstream piper can process input items only as fast as result items are produced by the upstream piper. Thus, an inefficiency arises if the upstream piper does not return an available result because it is out of order. This results in idle processes, and the problem can be addressed by using a "stride" larger than the number of processes, or by allowing the upstream piper to return results in the order they become available. The first solution results in higher memory consumption, while the second irreversibly abolishes the original order of input data.
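A hedged sketch of these two remedies follows; the stride and ordered keyword names mirror the text's terminology, but their exact placement in the API is an assumption here:

from IMap import IMap
from papy import Piper, Worker

def slow(inbox):
    # stand-in for a task with highly variable run time
    import time, random
    time.sleep(random.random())
    return inbox[0]

# remedy 1: a stride larger than the number of processes keeps
# workers busy, at the cost of buffering more items in memory
imap_wide = IMap(worker_num=4, stride=8)    # 'stride' kwarg: assumed API

# remedy 2: give up input order so results return as soon as ready
p_unordered = Piper(Worker(slow), parallel=IMap(worker_num=4),
                    ordered=False)          # 'ordered' kwarg: assumed API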
Graphical interface

As a Python package, PaPy's main purpose is to supply and expose an API for the abstraction of a parallel workflow. This has the advantage of flexibility (e.g. usage within other Python programs), but requires that the programmer learn the API. A graphical user interface (GUI) is currently being actively developed (Fig. 5). The motivation for this functionality is to allow a user to interactively construct, execute (e.g. pause execution), and monitor (e.g. view logs) a workflow. While custom functions will still have to be written in Python, the GUI liberates the user from knowing the specifics of the PaPy API; instead, the user explores the construction of PaPy workflows by connecting objects via navigation in the GUI.

Workflow construction example

The following code listing illustrates the steps in the construction of a distributed PaPy pipeline. The first of the two nodes evaluates a function (which simply determines the host on which it is run), and the second prints the result locally. The first piper is assigned to a virtual resource combining local and remote processes. The script takes two command-line arguments: a definition of the available remote hosts and a switch for using TCP sockets for direct inter-process communication between the pipers. The source code uses the imports decorator, a construct that allows import statements to be attached to the code of a function. As noted earlier, the imported modules must be available on all hosts on which this function is run.

The pipeline is started, for example, via:

$ python pipeline.py \
    --workers=HOST1:PORT1#2,HOST2:PORT1#4

which uses 2 processes on HOST1 and 4 on HOST2, and all locally-available CPUs. Remote hosts can be started (assuming appropriate firewall settings) by:

$ python RPYC_PATH/servers/classic_server.py \
    -m forking -p PORT
#!/usr/bin/env python
# Part 0: import the PaPy infrastructure.
# papy and IMap are separate modules
from papy import Plumber, Piper, Worker
from IMap import IMap, imports
from papy import workers

# Part 1: define user functions
@imports(['socket', 'os', 'threading'])
def where(inbox):
    result = "input: %s, host: %s, parent: %s, " \
             "process: %s, thread: %s" % \
        (inbox[0],
         socket.gethostname(),      # the host name as reported by the OS
         os.getppid(),              # parent process id
         os.getpid(),               # process id
         threading._get_ident())    # unique Python thread identifier
    return result

# Part 2: define the topology
def pipeline(remote, use_tcp):
    # create an IMap instance which uses the 'remote' hosts
    imap_ = IMap(worker_num=0, worker_remote=remote)
    # define the communication protocol, i.e. create worker
    # instances with or without explicit load_item functions
    if not use_tcp:
        w_where = Worker(where)
        w_print = Worker(workers.io.print_)
    else:
        w_where = Worker((where, workers.io.dump_item),
                         kwargs=({}, {'type': 'tcp'}))
        w_print = Worker((workers.io.load_item,
                          workers.io.print_))
    # combine the worker instances into piper instances
    p_where = Piper(w_where, parallel=imap_)
    p_print = Piper(w_print, debug=True)
    # assemble the piper instances into a workflow
    # (pipers become nodes of the graph)
    pipes = Plumber()
    pipes.add_pipe((p_where, p_print))
    return pipes

# Part 3: execute the pipeline
if __name__ == '__main__':
    # The following code is not PaPy-specific; it interprets the
    # command-line arguments using getopt.
    import sys
    from getopt import getopt
    args = dict(getopt(sys.argv[1:], '',
                       ['use_tcp=', 'workers='])[0])
    # parse arguments
    use_tcp = eval(args['--use_tcp'])   # bool
    remote = args['--workers']
    remote = remote.split(',')
    remote = [hn.split('#') for hn in remote]
    remote = [(h, int(n)) for h, n in remote]
    # create the pipeline (see comments in the function)
    pipes = pipeline(remote, use_tcp)
    # the input to the pipeline is a list of 100 integers
    pipes.start([range(100)])
    # this starts the pipeline execution
    pipes.run()
    # wait until all input items are processed
    pipes.wait()
    # pause and stop (a running pipeline cannot be stopped)
    pipes.pause()
    pipes.stop()
    # print execution statistics
    print pipes.stats

Discussion and conclusions

In the context of PaPy, the factors dictating the computational efficiency of a user's pipeline are the nature of the individual functions (nodes, pipers), and the nature of the data linkages between the constituent nodes in the graph (edges, pipes). Although distributed and parallel computing methods are becoming ubiquitous in many scientific domains (e.g., biologically meaningful µsec-scale MD simulations [KlLiDrSh09]), data post-processing and analysis are not keeping pace, and will become only increasingly difficult on desktop workstations.

It is expected that the intrinsic flexibility underlying PaPy's design, and its easy resource distribution, could make it a useful component in the scientist's data-reduction toolkit. It should be noted that some data-generation workflows might also be expressible as pipelines. For instance, parallel tempering / replica-exchange MD [EaDe05] and multiple-walker metadynamics [Raiteri06] are examples of intrinsically parallelizable algorithms for the exploration and reconstruction of free energy surfaces of sufficient granularity. In those computational contexts, PaPy could be used to orchestrate data generation as well as data aggregation / reduction / analysis.

In conclusion, we have designed and implemented PaPy, a workflow-engine for the Python programming language. PaPy's features and capabilities include: (1) construction of arbitrarily complex pipelines; (2) flexible tuning of local and remote parallelism; (3) specification of shared local and remote resources; (4) versatile handling of inter-process communication; and (5) an adjustable laziness/parallelism/memory trade-off. In terms of usability and other strengths, we note that PaPy exhibits (1) robustness to exceptions; (2) graceful support for time-outs; (3) real-time logging functionality; (4) cross-platform interoperability; (5) extensive testing and documentation (a 60+ page manual); and (6) a simple, object-oriented API accompanied by a preliminary version of a GUI.

Availability

PaPy is distributed as an open-source, platform-independent Python (CPython 2.6) module at http://muralab.org/PaPy, where extensive documentation also can be found. It is easily installed via the Python Package Index (PyPI) at http://pypi.python.org/pypi/papy/ using setuptools by easy_install papy.

Acknowledgements

We thank the University of Virginia for start-up funds in support of this research.