
DNA Mapping using Processor-in-Memory Architecture

Dominique Lavenier, Jean-Francois Roy, David Furodet

To cite this version:


Dominique Lavenier, Jean-Francois Roy, David Furodet. DNA Mapping using Processor-in-Memory
Architecture. Workshop on Accelerator-Enabled Algorithms and Applications in Bioinformatics, Dec
2016, Shenzhen, China. ⟨hal-01399997⟩

HAL Id: hal-01399997


https://hal.science/hal-01399997v1
Submitted on 21 Nov 2016

DNA Mapping using Processor-in-Memory
Architecture

Dominique Lavenier
IRISA / CNRS
Rennes - France
lavenier@irisa.fr

Jean-Francois Roy, David Furodet
UPMEM
Grenoble - France
jroy@upmem.com

Abstract - This paper presents the implementation of a mapping algorithm on a new Processing-in-Memory (PIM) architecture developed by the UPMEM company. UPMEM's solution consists in adding processing units into the DRAM, to minimize data access time and maximize bandwidth, in order to drastically accelerate data-consuming algorithms. The technology developed by UPMEM makes it possible to combine 256 cores with 16 GBytes of DRAM on a standard DIMM module. An experiment on DNA mapping with a Human genome dataset shows that a speed-up of 25 can be obtained with UPMEM technology compared to fast mapping software such as BWA, Bowtie2 or NextGenMap running on 16 Intel threads. The experiments also highlight that data transfer from the storage device limits the performance of the implementation. The use of SSD drives can boost the speed-up to 80.

Keywords - mapping; processing-in-memory; PIM; bioinformatics; genomics; I/O disk bandwidth; hardware accelerator.

I. INTRODUCTION

With the fast evolution of NGS (Next Generation Sequencing) technology, mapping DNA sequences to complete genomes is now a daily bioinformatics task. However, it requires significant computing power. Each sequencing run generates hundreds of millions of short DNA sequences (from 100 to 250 bp in length) that are compared with one or several reference genomes to extract new knowledge.

More specifically, the mapping process consists in aligning short fragments of DNA to large sequences (typically full genomes). Contrary to BLAST-like alignments that locate any portion of similar subsequences, the mapping action performs a complete alignment of the DNA fragments on the target sequence, by adding constraints such as the maximum number of substitution or insertion/deletion errors.

From a computer science point of view, the challenge is to be able to rapidly map hundreds of millions of short DNA fragments to full genomes, such as the Human Genome (3.2 x 10^9 bp). The output of a mapping is a list of coordinates, for each DNA fragment, where matches have been found. As genome structures are highly repetitive, a DNA fragment can match at many locations. A mapping quality is thus associated with each match; the quality value depends on the mapping confidence on a specific region. The output is generally encoded using the SAM/BAM format [1].

Many mappers are available [3][4]. They have their own pros and cons depending on several criteria such as speed, memory footprint, multithreading implementation, sensitivity or precision. The BWA mapper [2], based on the Burrows-Wheeler Transform, can be considered a good reference since it is often used in many bioinformatics pipelines. Examples of other mappers are Bowtie [5], NextGenMap [10], SOAP2 [6], BFAST [7] or GASSST [8].

Mapping hundreds of millions of short DNA sequences on complex genomes is time-consuming. It may take several hours of computation on a standard multicore processor. Thus, in order to significantly reduce the computation time of such treatments, many hardware implementations on GPU or FPGA accelerators have been proposed.

From the GPU side, software such as CUSHAW [12][13], BarraCUDA [14][15], MaxSSmap [16] or SOAP3-dp [17] has been developed (the list is not exhaustive). These tools provide interesting speed-ups compared to purely CPU-centric software. We may also cite NextGenMap [10], which can use GPU resources (if available) to reduce runtime by 20-50%. Globally, average speed-ups from 5 to 8 can be achieved compared to standard multicore processors. The great advantage of this solution is that GPU boards are cheap and can easily equip any bioinformatics server.

From the FPGA side, several reconfigurable architecture projects [18][19][20] have proposed interesting approaches. The mapping core engines have high potential due to aggressive hardware customization. Unfortunately, the bioinformatics community cannot leverage these developments, because the hardware is not available. That said, the TimeLogic company commercializes the VelocciMapper, a very fast proprietary mapping solution on their FPGA DeCypher platform [21] that seems promising.

More recently, the Edico-Genome company has developed a custom VLSI chip, called DRAGEN [22], mainly dedicated to the mapping of DNA sequences, even if it can perform other bioinformatics tasks. With this chip, standard bioinformatics pipelines that intensively use mapping software (such as GATK [23]) can be highly sped up.
This paper presents an alternative solution based on the Processing-in-Memory (PIM) concept. The PIM concept is not new. In the past, several research projects have explored the potential of building processing units as close as possible to the data. The Berkeley IRAM project [24] probably pioneered this kind of architecture to limit the Von Neumann bottleneck between the memory and the CPU. The PIM project of the University of Notre Dame [25] was also an attempt to solve this problem by combining processors and memories on a single chip.

The UPMEM solution addresses the same problem by building DIMM modules integrating high density DRAM and RISC processors. The idea is to complement the main memory of a multicore processor with such smart modules. Data within these modules can be processed independently by activating the in-memory computing power, releasing the pressure on CPU-memory transactions.

The DNA mapping task perfectly illustrates how such a time-consuming application can benefit from the PIM architecture. DNA sequences have to be compared with thousands of locations within a reference genome. Offloading this activity directly to the PIM-DRAM modules, and parallelizing the whole process across hundreds of PIM cores, avoids a lot of CPU-memory transactions compared to a standard multithreaded solution.

The rest of the paper is structured as follows: the next section briefly describes the main features of the UPMEM solution. Section 3 details how the mapping process is implemented on the UPMEM architecture. Section 4 evaluates the performances and Section 5 compares them with current mapping software. Section 6 concludes the paper.

II. UPMEM ARCHITECTURE OVERVIEW

UPMEM technology is based on the concept of Processing-in-Memory (PIM). The basic idea is to add processing elements next to the data, i.e. in the DRAM, to maximize the bandwidth and minimize the latency. The host processor acts as an orchestrator: it performs read/write operations to the memories and commands/controls the co-processors embedded in the DRAM. This data-centric model of distributed processing is optimal for data-consuming algorithms.

The UPMEM PIM-DRAM solution can be packaged on 16 GBytes DIMM modules with 256 processors: one processor every 64 MBytes of DRAM. Each processor can run its own independent program. In addition, to hide memory latency, these processors are highly multithreaded (up to 24 threads can be run simultaneously) in such a way that the context is switched between threads at every clock cycle.

The UPMEM processor, called DPU (DRAM Processing Unit), is a triadic RISC processor with 24 32-bit registers per thread. In addition to memory instructions, it comes with built-in atomic instructions and conditional branching bundled with arithmetic and logic operations.

From a programming point of view, two different programs must be specified: (1) the host program that dispatches the data to the co-processor memories, sends commands and input data, and retrieves the results; (2) the program that executes the treatment on the data stored in the PIM DRAM. The latter is often a short program performing basic operations on the data; it is called a tasklet. Note, however, that the architecture of the UPMEM DPU allows different tasklets to be specified and run concurrently on different blocks of data.

Depending on the server configuration (i.e. the number of 16 GBytes UPMEM PIM-DRAM modules), a large number of DPUs can process data in parallel. Each DPU only accesses 64 MBytes and cannot directly communicate with its neighbors. Data exchanges, if needed, must go through the host processor. A DPU has a fast working memory (64 KBytes) acting as a cache/scratchpad memory shared by all the tasklets (threads) running on the same DPU. This working memory can be used to transfer blocks of data from the DRAM, and can be explicitly managed by the programmer.

To sum up, programming an application consists in writing a main program (run on the host processor) and one or several tasklets that will be executed on the DPUs. The main program has to synchronize the data transfers to/from the DPUs, as well as the tasklet execution. Note that the tasklets can be run asynchronously with the host program, allowing host tasks to be overlapped with DPU tasks.
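As an illustration of this programming model, the minimal C sketch below mimics the host/tasklet split: a tasklet copies a block from the (simulated) DPU DRAM into its small working memory, processes it, and writes it back, while the host program dispatches data, launches the DPUs and collects the results. The arrays and the sequential launch loop are stand-ins chosen for readability; they are not the UPMEM SDK API.

#include <stdio.h>
#include <string.h>

#define NB_DPUS 4                   /* a real server hosts thousands of DPUs  */
#define CHUNK   8                   /* items assigned to each DPU             */

static int dpu_dram[NB_DPUS][CHUNK];   /* stands for the 64-MByte DPU DRAM    */

/* Tasklet: the short program run inside a DPU.                               */
static void tasklet_main(int *dram_block, int n)
{
    int wram[CHUNK];                           /* stands for the 64-KByte WRAM */
    memcpy(wram, dram_block, n * sizeof(int)); /* DRAM -> working memory       */
    for (int i = 0; i < n; i++)
        wram[i] *= 2;                          /* basic operation on the data  */
    memcpy(dram_block, wram, n * sizeof(int)); /* working memory -> DRAM       */
}

/* Host program: dispatch data, start the DPUs, retrieve the results.         */
int main(void)
{
    for (int d = 0; d < NB_DPUS; d++)          /* dispatch input data          */
        for (int i = 0; i < CHUNK; i++)
            dpu_dram[d][i] = d * CHUNK + i;

    for (int d = 0; d < NB_DPUS; d++)          /* "launch" the tasklets        */
        tasklet_main(dpu_dram[d], CHUNK);      /* (on real hardware: parallel) */

    for (int d = 0; d < NB_DPUS; d++)          /* collect results              */
        printf("DPU %d: first result = %d\n", d, dpu_dram[d][0]);
    return 0;
}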
Recently, UPMEM conducted a proof-of-concept project to validate the technical feasibility of the DPU core on a DRAM process. Indeed, DRAM manufacturing processes are optimized for cost and bitcell density, and have never been designed to host computing logic.

III. MAPPING ON UPMEM

A. Overview

This section presents the mapping strategy elaborated to fully exploit the PIM architecture. The main idea is to distribute an indexing structure (computed from the genome) across the DPU memories. The host processor receives the DNA sequences and, according to specific k-mer features, dispatches them to the DPUs. To globally optimize the treatment, groups of DNA sequences are sent to the DPUs before starting the mapping process. Results are sent back to the host processor. DNA sequences that have not been mapped are reallocated to other DPUs for further investigation. A three-pass processing allows more than 99% of the DNA sequences to be mapped. This strategy supposes that the complete index has first been downloaded into the DPU memories.

The following algorithm illustrates the overall mapping process:

1: Distribute the genome index across the DPUs
2: Loop N
3:   List LIN ← P x DNA sequences
4:   Loop 3
5:     Dispatch sequences of list LIN into DPUs
6:     Run mapping process
7:     Get results → 2 lists: LGOOD & LBAD
8:     Output LGOOD
9:     LIN ← LBAD
The first loop (line 2) performs N iterations. N is the ratio of the number of DNA sequences to map divided by the number of DNA sequences processed in a single iteration. Typically, a single iteration processes 10^6 sequences. The second loop (line 4) dispatches the sequences of the list LIN into the DPUs. In the first iteration, the list LIN contains all the DNA sequences. The mapping (line 6) is run in parallel and provides a mapping score (and coordinates) for all DNA sequences. The results are split into two lists (line 7): a list of sequences with good scores (list LGOOD) and a list with bad scores (list LBAD). Based on new k-mers, the list LBAD is dispatched to the DPUs in the 2nd and 3rd iterations. The following figure illustrates the mapping process:

Figure 1: According to their k-mer composition, DNA sequences are dispatched among the UPMEM memories that house a distributed index of the genome. Mapping is run independently on all parts of the index. DNA sequences that have not been mapped are dispatched again to the index according to other k-mer criteria. After 3 rounds, more than 99% of the DNA sequences are mapped.
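The loop of Section III.A can also be read as the following self-contained C sketch. The outer Loop N over successive batches of P sequences is reduced to a single batch, and dispatch_to_dpus, run_dpus and mapped_ok are stubs standing for the real primitives; only the three-pass control structure is meant to be illustrative.

#include <stdio.h>
#include <stdbool.h>

#define P 8                        /* sequences per batch (10^6 in the paper) */

static void dispatch_to_dpus(const int *reads, int n, int pass)
{   /* stub: would encode the k-mer of each read and copy it to its DPU */
    (void)reads; (void)n; (void)pass;
}
static void run_dpus(void) { /* stub: would start the tasklets */ }
static bool mapped_ok(int read, int pass)
{   /* stub: fake score, a few reads fail the first passes */
    return (read + pass) % 3 != 0;
}

int main(void)
{
    /* Line 1: the genome index is assumed to be already distributed.        */
    int lin[P], lbad[P];
    for (int i = 0; i < P; i++) lin[i] = i;          /* line 3: LIN <- P reads */
    int n_in = P;

    for (int pass = 0; pass < 3 && n_in > 0; pass++) {   /* line 4: Loop 3    */
        dispatch_to_dpus(lin, n_in, pass);               /* line 5            */
        run_dpus();                                      /* line 6            */
        int n_bad = 0;
        for (int i = 0; i < n_in; i++) {                 /* line 7: split     */
            if (mapped_ok(lin[i], pass))
                printf("read %d mapped at pass %d\n", lin[i], pass); /* line 8 */
            else
                lbad[n_bad++] = lin[i];
        }
        for (int i = 0; i < n_bad; i++)                  /* line 9: LIN <- LBAD */
            lin[i] = lbad[i];
        n_in = n_bad;
    }
    return 0;
}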

B. Genome Indexing

To speed up the mapping and to avoid systematically comparing the DNA sequences with the full text of the genome, the genome is indexed using words of k characters, called k-mers. To each k-mer a list of coordinates specifying its locations is attached, typically the chromosome number and the position of the k-mer on that chromosome. The mapping process then consists in extracting one or several k-mers from each DNA sequence in order to rapidly locate its position on the genome. The k-mer acts more or less as an anchor from which a complete match can be computed.

The index is composed of a first table of 4^k entries (Index1) that provides, for every possible k-mer, the list of coordinates where it occurs in the genome. The list of coordinates is stored in a second table (Index2). More specifically, for a specific k-mer, Index1 gives its address in Index2 and its number of occurrences. A line in Index2 indicates the chromosome number and a position on that chromosome.

Figure 2: Index1 provides, for all possible k-mers, the number of occurrences and an entry in Index2. Index2 is a list of coordinates specifying the chromosome number and a position.

The UPMEM implementation splits Index2 into N parts, N being the number of available DPUs. As each DPU has a limited memory (64 MBytes), it cannot store the complete genome. Consequently, k-mer positions along the genome are useless inside a DPU without additional information. Thus, in addition to the coordinates, portions of the genome text corresponding to the neighborhood of the k-mers are memorized. The global indexing scheme is shown below.

Figure 3: Index1 is stored on the host computer. Index2 is distributed among the DPU memories. Neighborhood information is added to directly perform the mapping analysis.
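To make the two-level index concrete, the sketch below shows one possible C layout of an Index1 entry, together with the classical 2-bit encoding that turns a k-mer (k = 13) into a slot number in the 4^13-entry table. Field names and layout are illustrative assumptions, not the exact data structures of the implementation.

#include <stdint.h>
#include <stdio.h>

#define K 13                              /* k-mer length used in the paper   */

typedef struct {                          /* one Index1 entry                 */
    uint32_t nb_occ;                      /* number of occurrences            */
    uint32_t index2_addr;                 /* entry point into Index2          */
} index1_entry_t;                         /* Index1 has 4^13 = 64 M entries   */

/* 2-bit encoding of a k-mer: A=0, C=1, G=2, T=3 gives a value in
   [0, 4^K - 1], directly usable as a slot number in Index1.                  */
static uint32_t kmer_code(const char *seq)
{
    uint32_t code = 0;
    for (int i = 0; i < K; i++) {
        uint32_t c;
        switch (seq[i]) {
            case 'A': c = 0; break;
            case 'C': c = 1; break;
            case 'G': c = 2; break;
            default : c = 3; break;       /* 'T' (ambiguous bases need care)  */
        }
        code = (code << 2) | c;
    }
    return code;
}

int main(void)
{
    printf("Index1 slot of ACGTACGTACGTA = %u (table size = %u entries)\n",
           (unsigned)kmer_code("ACGTACGTACGTA"), (unsigned)(1u << (2 * K)));
    return 0;
}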
Thus, for one k-mer, a line of Index2 memorizes the chromosome number (1 Byte), the position of the k-mer on the chromosome (4 Bytes) and a neighborhood of 180 bp where each nucleotide is 2-bit encoded (45 Bytes). Storing one k-mer therefore requires 50 Bytes. Inside a DPU, 50 MBytes are allocated for the storage of the index or, in other words, for the capability to store an equivalent genome of 1 Mbp. The rest of the memory is used for DNA sequences and result transfers.

C. Mapping algorithm

The host processor receives a flow of DNA sequences. For each sequence, a k-mer corresponding to the k first characters is extracted. Based on this k-mer, the DNA sequence is dispatched to the corresponding DPU. Every P sequences (P = 10^6), the host activates the DPUs to start the mapping process of the DNA sequences stored in each DPU.

More precisely, a specific DPU receives an average of Q = P/N DNA sequences. The mapping consists in comparing these Q sequences with the portions of the genome text stored inside each DPU memory, knowing that the k first characters are identical. The comparison algorithm can be more or less complex depending on the required mapping quality. For stringent mapping allowing only substitution errors, a simple Hamming distance can be computed. For mapping with insertion/deletion errors, a banded Smith-Waterman algorithm can be performed. A detailed implementation, and the tasklet code, can be found in [28].

However, this strategy does not guarantee that all mapping locations are found. If an error occurs within the k first characters, the DNA sequence will be dispatched to the wrong DPU and no correct mapping will be detected. Thus, for DNA sequences with a low score, the next k characters are taken into consideration to form a new k-mer, allowing a new dispatching. If again no good score is computed, the next k characters are considered. Practically, after 3 iterations, the best matches are systematically found.
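As an illustration of this comparison step, the following sketch declares one possible layout of an Index2 line (1 + 4 + 45 bytes) and counts substitutions between a 2-bit encoded read and the stored neighborhood, i.e. the simple Hamming-style scoring mentioned above. It is a self-contained approximation under assumed packing conventions, not the tasklet code of [28].

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NEIGH_BP 180                       /* neighborhood length (bp)        */

typedef struct {                           /* one Index2 line: 1 + 4 + 45 B   */
    uint8_t  chromosome;
    uint32_t position;
    uint8_t  neighborhood[NEIGH_BP / 4];   /* 2 bits per nucleotide = 45 B    */
} index2_entry_t;

/* Extract the 2-bit code of base i from a packed buffer (assumed ordering).  */
static uint8_t base_at(const uint8_t *packed, int i)
{
    return (packed[i / 4] >> (2 * (i % 4))) & 0x3;     /* 0..3 = A,C,G,T      */
}

/* Count substitutions between a 2-bit packed read and the stored
   neighborhood, starting at a given offset (the position of the anchor).     */
static int count_mismatches(const uint8_t *read_2bit, int read_len,
                            const index2_entry_t *e, int offset)
{
    int errors = 0;
    for (int i = 0; i < read_len; i++)
        if (base_at(read_2bit, i) != base_at(e->neighborhood, offset + i))
            errors++;
    return errors;                         /* low count = good mapping score  */
}

int main(void)
{
    index2_entry_t e;
    memset(&e, 0, sizeof e);               /* neighborhood = all 'A'          */
    uint8_t read[25];                      /* a 100-bp read, 2-bit packed     */
    memset(read, 0, sizeof read);          /* read = all 'A' -> 0 mismatches  */
    printf("mismatches = %d\n", count_mismatches(read, 100, &e, 0));
    return 0;
}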
D. Post processing

As the mapping is fully performed inside the DPUs, no more computation is required. The post processing simply consists in getting the results from the DPUs and formatting the data (SAM/BAM format for example) before writing them to disk.

IV. PERFORMANCE EVALUATION

Performances have been evaluated with a DELL server (Xeon Processor E5-2670, 40 cores, 2.5 GHz, 64 GBytes RAM) running Linux Fedora 20. In our implementation, I/O transfer has a great impact on the overall performances, and thus the hard disk read speed is an important parameter. We measured an average bandwidth of 130 MB/s (local disk).

As the UPMEM memory devices are not yet available, estimations are done with the UPMEM Cycle Accurate Simulator (CAS) developed by the company. Tasklet programs are written in C and compiled (with a specific compiler) for the DPU target processors. Binaries are directly executed by the CAS simulator. The CAS is fully consistent with the hardware version as it actually represents the reference design. The CAS is used for intensive testing of the design because it is faster and easier to run with the test suite. The gap between the two versions is null, and this is systematically verified by a specific qualification process. As a consequence, the CAS execution cycle count exactly reflects what the real hardware will produce.

Performances have been evaluated on the following dataset:

• Human Genome (3.2 Gbp)
• DNA sequences: a set of 111 x 10^6 100 bp sequences (13 GBytes)

To store the index corresponding to the Human genome, the minimum number of DPUs is equal to 3.2 x 10^9 / 10^6 = 3200 DPUs (cf. previous section: a DPU stores an index that represents the equivalent of only 1 Mbp). The UPMEM configuration is thus set to 3328 DPUs (13 DIMM modules). In that situation, the UPMEM memory is equal to 208 GB.

We evaluate the execution time according to the algorithm of section 3:

1. Distribution of the genome index across the DPUs
2. Loop:
   a. Dispatching of the sequences to the DPUs
   b. Mapping
   c. Result analysis

A. Distribution of the genome index across the DPUs

This step can be divided into the two following actions:

• Download the index from the storage device
• Dispatch the index inside the DPUs

As for most mappers, the index is pre-computed. In our case, Index1 is fully pre-computed, and the genome is formatted to facilitate its encoding into the DPU memories. The size of Index1 is determined by the k-mer length. Here, the k-mer length is set to 13. The number of entries of Index1 is thus equal to 4^13 = 64 M entries. One entry stores the number of k-mers (1 integer) and an address in Index2 (1 integer). Thus the total size of Index1 is 256 MBytes. This index is stored in the host computer memory and requires 2 sec to be downloaded (disk bandwidth = 130 MB/s). The size of the Fasta file containing the genome is equal to the size of the genome (3.2 GBytes). Its download time is equal to 25 sec.

Dispatching Index2 across the DPUs consists in writing, for each k-mer of the genome, 48 bytes in a DPU memory, that is globally 3.2 x 10^9 x 48 = 153.6 GBytes. The bandwidth for transferring data from the host memory to the DPU memories is estimated to 11.53 GB/s (see [11]). The time for transferring the index is thus equal to 153.6/11.53 = 13.3 sec. With the associated overhead to format the data, we globally estimate this step to 15 sec.

Actually, downloading the genome and dispatching the index into the DPUs are overlapped, and practical time measurements of this initialization step (TINIT) are under 30 sec.
B. Loop execution

The loop performs the following actions:

1. Get a block of DNA sequences from disk
2. Dispatch the DNA sequences to the DPUs
3. Initialize and start the DPUs
4. Perform the mapping
5. Collect the results from the DPUs
6. Analyze and write the results

For each action we detail how the execution time is determined.

1. Get a block of DNA sequences from disk

In our implementation, a loop iteration processes 10^6 DNA sequences. These sequences are read from the local disk. One million DNA sequences of length 100 in Fasta format represent approximately 130 MBytes of data (text sequence + annotation). The time to read this information depends again on the I/O bandwidth of the storage device. With a bandwidth of 130 MB/sec, the time T1 is equal to 1 sec.

2. Dispatch the DNA sequences to the DPUs

Dispatching the DNA sequences to the DPUs is fast: it consists in coding the 13 first characters of each sequence and in copying the sequence to the target DPU. Experiments indicate an execution time < 40 ms. Transferring 100 MBytes of data (10^6 sequences of 100 bp) to the DPU memories is also very fast: it requires 0.1/11.5 = 8.7 ms. Overall, this step takes a maximum of T2 = 50 ms.

3. Initialize and start the DPUs

A DPU runs 10 tasklets. Each tasklet receives two parameters: the number of DNA sequences to process, and the address where these fragments are stored. This represents 2 integers (8 bytes) per tasklet, or 80 bytes per DPU, or an overall transfer of 80 x 3328 = 266,240 bytes. The equivalent time T3 is 266240/11.53x10^9 = 23 μs. As broadcasting commands to 128 DPUs simultaneously is possible, booting the DPUs consists in sending 3328/128 = 26 commands. This time is negligible.

4. Perform the mapping

On average, a DPU receives 10^6/3328 = 300 DNA sequences to process (3328 is the number of available DPUs). The number of occurrences of a k-mer of size 13 is approximately the size of the genome divided by 4^13, that is 3.2 x 10^9 / 4^13 = 50. The number of mappings that must be executed by one DPU is thus equal to 15,000 (300 x 50). The simulations executed on the UPMEM Cycle Accurate Simulator range from 10x10^6 to 25x10^6 cycles to perform such a treatment, depending on the DPU load. As a matter of fact, the repartition inside the DPUs is not uniform; it depends on the nature of the DNA sequences. We have to take into account the worst execution time since all DPUs must finish before all results can be analyzed.

In the second and third rounds, only the fraction of the DNA sequences that have not matched is sent to other DPUs. It represents less than 10% of the initial number of sequences. The impact on the overall execution time is weak. An upper bound estimation for the 3 loop iterations is 30 x 10^6 cycles, leading to an execution time T4 of 40 ms with a 750 MHz DPU processor frequency.

5. Collect the results

For each DNA sequence, the DPU outputs the following information: genome coordinates and mapping score (2 integers). There are thus 2 x 4 x 10^6 = 8 MBytes to transfer. The transfer time is T5 = 0.7 ms.

6. Analyze and write the results

This step, which is run on the host processor, evaluates the score of the mapping and selects the DNA sequences that have to be analyzed again. It also writes results to the output file. Our experimentation estimates the execution time T6 to 0.1 sec in the worst case.

Actions 2 to 6 are iterated 3 times. The first time involves all DNA fragments, the second time less than 10% and the third time less than 3%. The cumulated execution time of actions 2 to 6 is thus approximately equal to:

T2-6 = 1.13 x (50 + 40 + 100) = 190 ms.

Actually, getting the data from the disk (action 1) can be overlapped with the other tasks (actions 2 to 6), leading to the following TLOOP execution time:

TLOOP = max(T1, T2-6) = 1 sec

C. Overall Execution time

The overall execution time T for mapping 111 x 10^6 DNA sequences to the Human genome is approximately given by:

T = TINIT + 111 x TLOOP = 30 + 111 x 1 = 141 sec.

The general execution scheme is as follows:

Figure 4: Scheduling of the different tasks. Loading of the data and computation are overlapped. In the implementation, data transfers dominate the overall process.
As we can see, the TLOOP execution time is mainly constrained by the I/O disk bandwidth: loading 10^6 100 bp DNA sequences requires about 1 second while processing these data takes only 0.2 second.

A possible hardware optimization is to use SSD storage devices, which have a larger throughput. We tested with the 512 GB SSD drive present on the server, with an average bandwidth of 700 MB/sec. In that case, the time for distributing the index is now dominated by the index dispatch (~15 s). For the loop execution time, a good balance is achieved between the time for getting the DNA sequences from the SSD (~185 ms) and the time for executing actions 2 to 6 (~200 ms).

In that situation the new overall execution time is given by:

T = 15 + 111 x 0.2 = 37.2 sec.

V. COMPARISON WITH OTHER MAPPERS

To evaluate the speed-up brought by the UPMEM technology, we compared the execution time with the following mappers:

• BWA [2]
• Bowtie2 [5]
• NextGenMap [10]

The three software packages have been run with different numbers of threads (8, 16 and 32) and with their default parameters (details of the experimentation can be found in [28]). The following table gives the execution times.

             8 threads   16 threads   32 threads
BWA             5901        3475         2191
Bowtie2         5215        2916         2241
NextGenMap      3485        2104         1552

Table 1: Execution time (in seconds) of the mapping of 111 millions of 100 bp DNA sequences to the Human Genome. The three software packages have been run on a DELL server with the following characteristics: Xeon Processor E5-2670, 40 cores 2.5 GHz, 64 GBytes RAM.

The speed-up is calculated as the ratio between the reference software execution time and the estimated UPMEM execution time, considering both the hard disk (141 sec) and SSD (37.2 sec) configurations. For example, for BWA with 16 threads, 3475/141 ≈ 24 with the hard disk and 3475/37.2 ≈ 93 with the SSD.

             8 threads        16 threads       32 threads
             HARD    SSD      HARD    SSD      HARD    SSD
BWA            41    157        24     93        15     58
Bowtie2        36    140        20     78        16     60
NextGenMap     24     93        15     56        11     41

Table 2: Speed-up of the UPMEM mapping implementation compared to the 3 software executions.

The three software packages have also been run with and without SSD storage devices. We did not detect any significant difference in the execution time. We end up with the same conclusion as Lee et al. [29]: mapper software does not benefit from SSD performance.

VI. CONCLUSION

UPMEM PIM technology is a data-centric hardware accelerator. As opposed to GPU, FPGA and custom VLSI chips that focus on powerful processing units, the computational power of Processing-in-Memory is brought by the large and scalable number of independent processing elements, each one being composed of a processing unit and a DRAM bank. The Von Neumann bottleneck is pushed away, and embarrassingly parallel applications can highly benefit from this architecture by distributing computations across processing elements.

On the genomic side, many other treatments are good candidates for an efficient implementation on UPMEM. An implementation study of the well-known BLAST software [26] has shown an expected speed-up of 25 compared to a server running 20 Intel cores [27]. There is also a lot of room for implementing many NGS analyses such as short read or long read correction, genome assembly (especially large k-mer counts), GWAS studies, etc. The difficulty is how to split the problem into thousands of tasklets, each of them working independently on a small part of the data.

Performances of hardware accelerators are tightly correlated to their computing infrastructure environment. For the mapping problem, where huge volumes of data have to be processed, performances are clearly restrained by data access. In our case, the bandwidth of the hard disk drive is a critical bottleneck. SSD technology can help to increase data transfer. Feeding such accelerators optimally is probably the main problem, especially for large bioinformatics centers where data are stored on large centralized storage devices of several tens of terabytes. Servers that house hardware accelerators must have a privileged mass-storage connection to keep all their potential computing power.

The SDK of the UPMEM DPU has been made available ahead of silicon to enable the porting of applications and to anticipate the potential benefit of using such an architecture. The SDK comes with a C compiler, a simulator and the APIs needed to build a full application, and will continue to be enhanced. It is widely open to the community. In parallel, UPMEM is partnering with DRAM manufacturers to build silicon chips assembled on DIMM modules, with a prototyping cycle in 2017.

REFERENCES

[1] The SAM/BAM Format Specification Working Group, Sequence Alignment/Map Format Specification, April 2015, https://samtools.github.io/hts-specs/SAMv1.pdf
[2] Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.
[3] Jing Shang, Fei Zhu, Wanwipa Vongsangnak, Yifei Tang, Wenyu Zhang, and Bairong Shen, Evaluation and Comparison of Multiple Aligners for Next-Generation Sequencing Data Analysis, BioMed Research International, vol. 2014, Article ID 309650, 16 pages, 2014.
[4] Schbath S, Martin V, Zytnicki M, Fayolle J, Loux V, Gibrat J-F. Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis. Journal of Computational Biology. 2012;19(6):796-813. doi:10.1089/cmb.2012.0022.
[5] Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
[6] Li R, Yu C, Li Y, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966-1967.
[7] Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009;4:e7767.
[8] Rizk G, Lavenier D. GASSST: global alignment short sequence search tool. Bioinformatics. 2010;26:2534-2540.
[9] Ayat Hatem, Doruk Bozdağ, Amanda E Toland, Ümit V Çatalyürek, Benchmarking short sequence mapping tools, BMC Bioinformatics 2013, 14:184.
[10] Fritz J. Sedlazeck, Philipp Rescheneder, and Arndt von Haeseler. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics (2013) 29(21): 2790-2791, first published online August 23, 2013, doi:10.1093/bioinformatics/btt468.
[11] UPMEM DPU - Data exchange with main CPU. UPMEM Technical note, version 1.3.
[12] Y. Liu, B. Schmidt, D. Maskell: CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform, Bioinformatics (2012) 28(14): 1830-1837.
[13] Y. Liu, B. Schmidt: CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing. IEEE Design & Test of Computers 31(1):31-39, 2014.
[14] Klus P, Lam S, Lyberg D, Cheung MS, Pullan G, McFarlane I, Yeo GSH, Lam BY. (2012) BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC Research Notes, 5:27.
[15] Langdon WB, Lam BY, Petke J, Harman M. (2015) Improving CUDA DNA Analysis Software with Genetic Programming. Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation - GECCO '15.
[16] Turki Turki and Usman Roshan, MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence, BMC Genomics, 15(1):969, 2014.
[17] Luo R, Wong T, Zhu J, Liu C-M, Zhu X, et al. (2013) SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner. PLoS ONE 8(5).
[18] C. B. Olson et al., "Hardware acceleration of short read mapping," Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, IEEE, pp. 161-168, 2012.
[19] J. Arram, K. H. Tsoi, W. Luk, and P. Jiang, "Reconfigurable acceleration of short read mapping," in Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on, pp. 210-217, IEEE, 2013.
[20] J. Arram, et al. "Leveraging FPGAs for Accelerating Short Read Alignment." IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2016.
[21] http://www.timelogic.com/catalog/799/velocimapper
[22] http://www.edicogenome.com/dragen/
[23] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, 2010, Genome Research 20:1297-303.
[24] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., and Yelick, K. (1997). "A Case for Intelligent RAM: IRAM," IEEE Micro, 17(2), pp. 34-44.
[25] Kogge, P. M., T. Sunaga, E. Retter, et al. (1995). Combined DRAM and Logic Chip for Massively Parallel Applications. 16th IEEE Conf. on Advanced Research in VLSI, Raleigh, NC.
[26] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
[27] Dominique Lavenier, Charles Deltel, David Furodet, Jean-François Roy. BLAST on UPMEM. [Research Report] RR-8878, INRIA, 2016.
[28] Dominique Lavenier, Charles Deltel, David Furodet, Jean-François Roy. MAPPING on UPMEM. [Research Report] RR-8923, INRIA, 2016.
[29] Sungmin Lee, Hyeyoung Min, Sungroh Yoon. Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Briefings in Bioinformatics 2015; 17(4): 713-727.
