
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA51647.2021.00045

Memristive Data Ranking

Ananth Krishna Prasad∗, Morteza Rezaalipour†, Masoud Dehyadegari† and Mahdi Nazm Bojnordi∗
∗University of Utah, †K.N. Toosi University of Technology
Email: ∗{ananth, bojnordi}@cs.utah.edu, †{amrezaalipour, dehyadegari}@kntu.ac.ir

Abstract—Sorting is a fundamental operation in many large-scale data processing applications. In big data computing, sorting imposes a massive requirement on the available memory bandwidth because of its natural demand for pairwise comparison. This high bandwidth requirement often leads to a significant degradation in performance and energy-efficiency. Processing-in-memory has been examined as an effective solution to the memory bandwidth problem for SIMD and data-parallel operations, which does not necessarily solve the bandwidth problem for pairwise comparison. This paper proposes a viable hardware/software mechanism for performing large-scale data ranking in memory with a bandwidth complexity of O(1). Large-scale comparison, which forms the core computation of sorting algorithms, is reformulated in terms of novel bit-level operations within the physical memory arrays for in-situ ranking, thereby eliminating the need for any pairwise comparison outside the memory arrays. The proposed mechanism, called RIME, provides an API library granting the user application sufficient control over the fundamental operations for in-situ ranking, sorting, and merging. Our simulation results on a set of high-performance parallel sorting kernels indicate 12.4−50.7× throughput gains for RIME. When used for ranking and sorting in a set of database applications, graph analytics, and network processing, RIME achieves more than 90% energy reduction and 2.3−43.6× performance improvements.

I. INTRODUCTION

The continued growth in IoT, mobile devices, and cloud-based services has led to the emergence of large datasets and big data workloads. Analyzing, querying, and filtering massive amounts of data in a structured manner becomes increasingly hard. In these cases, a large amount of data often needs to be sorted, whether because of dataset properties [1], real-time requests from web users [2], or algorithm features [3]. Also, sorting data is often the key to enabling efficient searching algorithms [4]. Data clustering, an important kernel in data mining applications, depends heavily on sort and search operations [5]. Therefore, sorting an array of numbers is an active area of research and a vital operation in many application domains such as image processing, database processing, genome analysis, and text analysis [6].

New sorting algorithms have been invented in every decade, adapted to the computer architectures and data distributions of their time, such as radixsort [7], mergesort [8], and quicksort [9]. Further research has been conducted to identify efficient sorting algorithms using hardware accelerators [10], multiple cores exploiting SIMD instructions [11, 12], GPUs [13], and ASICs [14]. As the computational capability of processors and GPUs grows through more cores and threads, the demand for memory bandwidth increases proportionally [15]. For datasets orders of magnitude larger than the on-chip cache capacity, the demand for high memory bandwidth results in sorting performance being bottlenecked by the limited off-chip memory bandwidth. Large-scale memory management [16] and in-memory databases [17] have recently been explored as promising solutions to the data movement and bandwidth challenges. The efficiency of these techniques primarily depends on minimizing data movement between the processor cores and off-chip memory using a hierarchy of memories, non-uniform access to memory, transactional memories, and non-volatile technologies. Nevertheless, such optimizations do not eliminate, but rather mitigate, the extent of data accesses required to perform sorting on the processing core.

The recent advent of emerging memory interfaces [18, 19] and cell technologies [20, 21] has enabled in-memory computation with large-scale data-parallel operations, such as bitwise XOR. Prior work on processing in memory (PIM) has shown various applications ranging from combinatorial optimization [22, 23] and neural network computation [24–26] to graph analytics [27]. The existing PIM solutions mostly focus on accelerating matrix/vector operations inside memory arrays or utilizing high-bandwidth interfaces for near-data computation. Instead, RIME proposes an in-situ approach to memristive ranking-in-memory using a HW/SW co-design that minimizes the bandwidth requirements of sort algorithms. The main contributions of this paper are as follows. (1) Large-scale sorting workloads are characterized in terms of bandwidth and throughput requirements, and the primary reason for the poor performance of sorting at low bandwidths is identified. (2) A novel memory system architecture is designed to enable in-memory min/max computation using a large-scale, massively parallel bitwise algorithm. (3) The necessary driver support and userspace API are provided to enable fine-grained control over the proposed system for efficient in-situ ranking and ordinary memory operations. (4) Detailed evaluations of the proposed architecture at the system and circuit levels are provided, which indicate significant performance improvements and energy savings over the existing systems.

II. BACKGROUND AND MOTIVATIONS

A. Applications of Sorting

Sorting is a fundamental operation in database applications. For example, sorting is very common in query retrieval to prepare the query results in a particular order using the OrderBy clause. In addition, sorting may be necessary in several join operations, such as the sort-merge join algorithm. It also serves in index creation, user-requested output sorting, ranking, duplicate removal, and grouping operations [28]. Numerous techniques have been proposed to realize efficient sorting based on multicore processors, GPUs, and SIMD architectures [4].

MapReduce is used to perform massive data sorting in distributed systems. In particular, Shuffle is an important part of MapReduce that performs sorting and transfers the outputs of the maps to the reducers [29]. The execution time of algorithms such as Kruskal's is dominated by sorting. Prim's, string-processing, and Dijkstra's algorithms are based on the priority queue, which relies on sorting and ranking data in a queue. Many other applications, such as numerical computations, combinatorial search, operations research, and commercial computing, are often based on sorting [30]. Moreover, sorting the retrieval results from PageRank, HillTop, and HITS (Hypertext Induced Topic Search) in a reasonable time is a significant challenge [31]. Not only is sorting integer values important, but several applications also need to sort real-valued data, which is not as simple as sorting integers. For example, Kim et al. [32] exploit integer arithmetic on floating-point data to reduce the execution time.

B. Sorting Algorithms

Quicksort. Quicksort was first introduced by Sir C. A. R. Hoare in 1961 [33]. Quicksort is based on the divide-and-conquer paradigm, which resolves a complex problem by repeatedly dividing it into simpler subproblems until the solutions to the subproblems become trivial. The algorithm starts with a Partition phase in which a bound element (or pivot) is selected from the given array as a dividing line for partitioning it into two smaller segments (or sub-arrays). At the end of the Partition phase, if the size of a sub-array is less than a cut-off amount, i.e., the solution becomes trivial, the sub-array may then be sorted by known methods, or even by programs specialized for sorting arrays containing fewer than cut-off elements. Conversely, if the sub-arrays are fairly large, the partitioning process continues to divide them into even smaller ones. The time complexity of Quicksort is O(n^2) in the worst case and O(n log n) on average [34].

Mergesort. Mergesort was suggested by John von Neumann as early as 1945 [34]. Similar to Quicksort, Mergesort employs the divide-and-conquer paradigm and recursively sorts a given array of elements. As the name of the algorithm suggests, Mergesort consists of a merge algorithm and recursive calls. The merge algorithm takes two or more non-empty sorted arrays and outputs a final array that is also sorted. Generally, Mergesort first divides the input array into multiple sub-arrays, each containing only a single element, by recursive calls; a sub-array that contains only a single element is considered sorted. Then, it repeatedly merges the sub-arrays until only one array remains, which is the sorted output array [34].

Radixsort. Radixsort employs a different scheme than the previous sorting algorithms, as it looks through the individual digits of elements to perform a digit-inspection process. For d-digit elements, starting from the most significant digit (MSD) to the least significant digit (LSD), the algorithm sorts the elements considering only one digit at a time, such that all the elements with smaller digits appear on the left-hand side of elements with larger digits. By iterating this process from digit d−1 down to 0, the input array becomes sorted. Radixsort may also be applied in the opposite digit direction (i.e., from LSD to MSD) [34].

Heapsort. Heapsort is based on the heap data structure, which is a complete binary tree of data points. The maximum or minimum value is always located at the root of the heap tree. During each iteration, Heapsort removes the root node from the array, substitutes the last element of the array for the root, and re-heaps the array [35]. The time complexity of Heapsort is O(n log n).
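As a concrete reference for the digit-inspection scheme of Radixsort described above, the following minimal C sketch implements the LSD variant with a counting sort per pass; the function name and the byte-per-digit choice are illustrative additions, not taken from the paper.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* LSD radixsort on 32-bit keys, one byte ("digit") per pass. */
    void radix_sort_u32(uint32_t *a, size_t n) {
        uint32_t *tmp = malloc(n * sizeof *tmp);
        if (tmp == NULL) return;
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            for (size_t i = 0; i < n; i++)          /* digit histogram */
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (int d = 0; d < 256; d++)           /* prefix sums -> offsets */
                count[d + 1] += count[d];
            for (size_t i = 0; i < n; i++)          /* stable scatter */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof *a);
        }
        free(tmp);
    }

The stable scatter is essential: the LSD direction is only correct because each pass preserves the relative order established by the previous, less significant digits.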
C. Design Challenges and Opportunities

1) Memory Bandwidth Requirements: First, not all sort algorithms exhibit the same bandwidth requirements. Figure 1 shows the number of accesses served by a memory system below the on-die cache for Mergesort (M/S), Quicksort (Q/S), and Radixsort (R/S). We consider two memory configurations for this analysis: one with unlimited bandwidth and the other with an off-chip memory interface [36] (Section VI provides the detailed system configuration). Increasing the workload size on the bandwidth-unlimited system results in a higher number of memory accesses (Figure 1(a)), which may be influenced by the number of processor cores (Figure 1(b)). In real memory systems, however, the bandwidth is limited. Figure 1(c) shows that the sustained memory bandwidth becomes more restricted as the number of cores varies from 1 to 64.

Fig. 1. Bandwidth requirements for sort algorithms.

Second, the performance of sort algorithms is sensitive to the available memory bandwidth. Figure 2 shows the throughput of the sort algorithms, in terms of million keys per second (MKps), on three systems with different available bandwidths. For this analysis, in addition to the unlimited and off-chip bandwidths, we consider a high-bandwidth memory system with an in-package DRAM [37]. In an ideal memory system with unlimited bandwidth (a), R/S outperforms both Q/S and M/S at the cost of exerting significant data movement on the memory interface. This superiority, however, is taken over by Q/S in the realistic memory systems with limited bandwidth, i.e., the in-package (b) and off-chip (c) memories.

Fig. 2. Impact of available bandwidth on performance.

2) Opportunities and Potentials: The above analyses of sorting algorithms indicate that (1) the bandwidth requirement scales linearly with the size of the working set, and (2) the throughput of sorting is limited by bandwidth. Similar observations have been made by the prior work on StreamBox-HBM [38], where a sort-merge based algorithm for streaming computation outperforms hash-join based approaches for in-package memories. That work shows that the throughput of GroupBy, one of the key kernels in streaming computation, increases linearly with the number of cores for HBM, while it stagnates beyond 16 cores for DRAM. Thus, if this bandwidth bottleneck of sorting can be removed, performance can improve massively.

One of the main reasons sorting requires a large bandwidth lies at the heart of the algorithms: comparison. In naive terms, worst-case sorting requires all possible pairwise comparisons of values. Even though more sophisticated algorithms such as Quicksort, Radixsort, and Heapsort improve the bandwidth efficiency, they still do not solve the underlying issue, which is access to pairs of values in memory. In contrast, RIME enables large-scale in-situ bitwise comparison that massively improves the bandwidth efficiency by eliminating unnecessary data movement on the memory interface. Given the breadth of applications that depend on large-scale data sorting, this can substantially accelerate sorting kernels at the core of large-scale data processing applications.

D. Memristive Array Structure

Memristive technology has been promoted as an alternative to conventional memories due to its scalability, non-volatility, and freedom from leakage power. Moreover, memristive devices have shown unique capabilities for efficient in-memory processing. In particular, resistive RAM (RRAM) is one of the most promising memristive devices under commercial development and shows great potential for building main memory systems [21]. Numerous cell architectures have been proposed in the literature to optimize RRAM for better reliability, density, and computational capabilities. 1R crosspoint arrays are denser but lack isolated access to individual rows and columns [40]. As the proposed in-situ approach requires isolated column access, a 1T1R memory cell (Figure 3) is preferred over the 1R crosspoint.

Fig. 3. The 1T-1R memory cell [39] used for RIME.

III. DESIGN OVERVIEW

A. In-Memory Min/Max Computation

Inspired by the prior work on bit-serial median filters [41, 42], we design a new algorithm for computing the minimum (or maximum) of any N numbers in k serial steps, where k is the number of bits used for representing each number. The proposed algorithm is applicable to signed/unsigned fixed-point and floating-point number formats.

1) Unsigned Fixed-Point Numbers: We consider α integer bits and β fraction bits to represent unsigned fixed-point numbers. Every k-bit number is represented in the form of $b_{\alpha-1} \cdots b_0 \bullet b_{-1} \cdots b_{-\beta}$, where the $b_i$ are the binary digits and $k = \alpha + \beta$. (Typically, β is set to 0 for representing pure integer numbers.) The value of each number is computed by $\sum_{i=-\beta}^{\alpha-1} 2^i b_i$. Therefore, a number with more leading 0s produces a smaller value; for example, 0001.11 is less than 0010.00. We employ this simple principle to design a bit-serial algorithm for finding the minimum (or maximum) of multiple numbers in a set. As shown in Algorithm 1, starting from the most significant bit position (i.e., k − 1), we follow a k-step algorithm to examine the binary values of all bit positions (i.e., pos). At every step, some of the non-minimum (or non-maximum) values may be removed from the set. First, we search for 1 at the current bit position (pos) to form a selection of matching numbers (sel). The selected numbers are removed from the set only if set and sel are not equal. As a result, all the numbers remaining in the set at the end have the minimum value.

Algorithm 1 Find the minimum of unsigned fixed-point numbers
1: set ← {all numbers}
2: for pos in (k − 1, · · · , 1, 0) do
3:   sel ← ∅
4:   for all num ∈ set do
5:     if num[pos] = 1 then sel ← sel ∪ {num}
6:     end if
7:   end for
8:   if sel ≠ set then set ← set − sel
9:   end if
10: end for
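The following self-contained C function mirrors the semantics of Algorithm 1 for unsigned integers; the selected flags emulate the per-row select vector that the hardware keeps in latches (Section IV). This is a software sketch of the algorithm, not the in-memory implementation.

    #include <stdint.h>
    #include <stdlib.h>

    /* Software model of Algorithm 1: find the minimum of n unsigned
     * k-bit values by scanning bit positions from MSB to LSB. */
    uint32_t bit_serial_min(const uint32_t *num, size_t n, int k) {
        uint8_t *selected = malloc(n);
        uint32_t result = 0;
        for (size_t i = 0; i < n; i++) selected[i] = 1;
        for (int pos = k - 1; pos >= 0; pos--) {
            size_t ones = 0, members = 0;
            for (size_t i = 0; i < n; i++) {     /* bitwise column search */
                if (!selected[i]) continue;
                members++;
                if ((num[i] >> pos) & 1) ones++;
            }
            if (ones == 0 || ones == members)    /* "all 0 or 1": no load */
                continue;                        /* sel = set or sel empty */
            for (size_t i = 0; i < n; i++)       /* selective row exclusion */
                if (selected[i] && ((num[i] >> pos) & 1))
                    selected[i] = 0;
        }
        for (size_t i = 0; i < n; i++)           /* any survivor is minimal */
            if (selected[i]) { result = num[i]; break; }
        free(selected);
        return result;
    }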

Figure 4 shows how the proposed algorithm finds the minimum of 5 unsigned fixed-point numbers with α = 3 and β = 2. First, the most significant bit of all numbers is compared with 1, and the matching numbers (i.e., 4.00 and 6.50) are excluded from the set (Step 1). We then compare the second most significant bit of the remaining numbers with 1. As we find no matches, none of the numbers is removed from the set during Step 2. We repeat this process for the next bit position during Step 3. As all the remaining numbers have a matching 1 in the third bit position, none of the numbers should be excluded from the set. During Steps 4 and 5, the next matching numbers, respectively 1.75 and 1.25, are excluded from the set. Finally, the remaining number in the set (i.e., 1.00) represents the minimum value of the given numbers.

Fig. 4. Illustrative example of finding the minimum of 5 unsigned fixed-point numbers.

2) Signed Fixed-Point Numbers: We use the two's complement format for representing signed fixed-point numbers in the form of $s\,b_{\alpha-2} \cdots b_0 \bullet b_{-1} \cdots b_{-\beta}$, where s is the sign bit. Every signed fixed-point value may be computed by $-2^{\alpha-1}s + \sum_{i=-\beta}^{\alpha-2} 2^i b_i$. Similar to unsigned values, having 0s in the more significant $b_i$ results in a smaller value. However, a 1 in the sign bit position makes the value negative. To support signed numbers, we change Algorithm 1 to search for matching 0s (instead of 1s) in the first iteration of the loop (pos = k − 1). Therefore, the proposed algorithm excludes all the positive values from the set during Step 1 if a mix of positive and negative numbers is given. If only positive numbers are present, corresponding to the case where all bit values in the first iteration of the loop are zero, the operation proceeds over the remaining bit positions as in Algorithm 1 to find the minimum.

3) Floating-Point Numbers: The IEEE standard for floating-point arithmetic (IEEE 754) proposes a three-segment layout for real-valued numbers comprising one sign bit (s), a multi-bit exponent (e), and a multi-bit fraction (f). Every floating-point value may be computed by $(-1)^s \times (1 + f) \times 2^{e-b}$, where b is a positive bias added to the exponent. Similar to signed fixed-point numbers, at the sign bit position the algorithm searches for 0s to remove matching numbers from the set. A constant offset is added to the exponent bits, but this does not change the monotonic relationship between the actual exponent and the magnitude of the represented value. As a result, there is virtually no difference in the algorithm between signed fixed-point numbers and floating-point numbers.

Figure 5 shows an example of finding the minimum of 3 numbers in a hypothetical 8-bit floating-point format similar to IEEE 754, with 4 mantissa bits and 3 exponent bits. At the first step, the sign bit is checked. At the second step, given that not all values in the first (sign) column were zero, the algorithm searches for 1s to find the number with the maximum possible magnitude. After Step 4, only one selected value remains, and that is the minimum value.

Fig. 5. Illustrative example of finding the minimum of 3 floating-point numbers.

B. Rank/Sort/Merge Operation

In-memory min/max computation can significantly alleviate the bandwidth costs of large-scale ranking, sorting, and merging operations.

1) Sorting: As established before, conventional sort algorithms require a significant memory bandwidth due to their complex access patterns for comparing pairs of data points. Depending on the type of algorithm, the complexity of memory bandwidth for sorting N data points may vary between O(N log N) and O(N^2) for large data sets [43]. (The memory bandwidth complexity is significantly lower for small data sets that entirely or partially fit in the on-chip cache.) Our proposed hardware/software approach lowers the bandwidth complexity of sort operations to O(N), eliminating the unnecessary data movement for finding the min/max of the given data points. From the software point of view, the proposed sort operation is carried out similarly to reading data from an array of values. For a specific data range in memory, every access provides the next minimum value of the array; therefore, repeating this process N times results in an ordered stream of data from memory to the processor. First, the in-memory min/max computer is initialized for a new data range in the memory. On every sort access, the next minimum value of the data range is computed and sent to the processor. Also, the newly found data is flagged for exclusion from the data range for the subsequent sort accesses. The exclusion flags remain until the hardware is initialized for a new sort operation or the data memory is released through the provided APIs (explained in Section V).

2) Ranking: Similar to sorting, conventional data ranking algorithms consume a significant memory bandwidth. A natural way of finding the kth ordered item of N numbers is to repeat a sort algorithm until reaching the kth min/max of the numbers. This approach may result in a bandwidth complexity of O(kN). Using the proposed in-memory min/max finder, we can decrease the bandwidth cost of finding the kth ordered value in a data range to k accesses, which indicates a bandwidth complexity of O(k). For a given k, the in-memory hardware repeats the min/max computation for k iterations until the desired value is found.
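To make the access pattern concrete, the sketch below reads an ordered stream via repeated minimum accesses using the rime_* calls named in Section V. The prototypes and the type constant are assumptions, since the paper does not show exact signatures; the kth-ranked value falls out of the same loop after k calls.

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed prototypes: the names come from Section V of the paper,
     * but the exact signatures are not shown in the surviving text. */
    void *rime_malloc(size_t bytes);
    void  rime_init(void *begin, void *end, int data_type);
    void  rime_min(uint32_t *out);   /* returns the next minimum */

    /* Sort: each access yields the next minimum, so N accesses produce
     * an ordered stream with O(1) bandwidth per access. */
    void read_sorted(uint32_t *region, size_t n, uint32_t *sorted) {
        rime_init(region, region + n, /* assumed type constant */ 0);
        for (size_t i = 0; i < n; i++)
            rime_min(&sorted[i]);
    }

    /* Rank: the kth smallest value costs exactly k accesses, i.e., O(k). */
    uint32_t read_kth(uint32_t *region, size_t n, size_t k) {
        uint32_t v = 0;
        rime_init(region, region + n, 0);
        for (size_t i = 0; i < k; i++)
            rime_min(&v);
        return v;
    }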

3) Merging: A merge operation refers to combining two (or more) data sets into a single ordered set of data. The resultant set may include all members of the input sets or only the data points that exist in all input sets (a.k.a. merge-join in databases). Having a sorting algorithm at its heart, merging has the same bandwidth complexity as sorting. The conventional merge operations require a bandwidth complexity as low as O(N log N), where N is the size of the resultant merged data, whereas our proposed hardware/software solution reduces this complexity down to O(N). To support fast merge operations, the in-memory hardware implements concurrent min/max computation on multiple data ranges. Figure 6 shows how to merge two data sets (A and B) into a stream of ordered numbers. After initializing the data ranges, software reads the first minimum value from each data set (i.e., 1 and 4). The smaller min value is 1 from A, which is selected for the output stream. As a replacement for this value, the next min value is read from A, and the min selection process repeats until all the values from both sets are accessed. In the case of a merge-join operation, the output stream will only include the min values that exist in both sets (i.e., 5).

Fig. 6. Illustrative example of merging two data sets.

IV. PROPOSED ARCHITECTURE

Integrating min/max compute logic in memory chips may introduce significant overheads in terms of performance, energy, and memory capacity. To minimize the overheads, we propose minimal changes to the periphery and organization of conventional memristive arrays for in-situ value ranking.

A. Memristive In-Situ Ranking

As explained in Section III-A, the key operation for bit-serial min/max computation is a repetitive search for a bit value (1 or 0) within individual columns of a data array. At every step, the outcome of the search is a match vector indicating which rows of the array should be excluded from the data set. As a result, the memory array needs to support two new operations: bitwise column search and selective row exclusion. To enable these new operations, we choose the conventional 1T1R structure explained in Section II-D. As shown in Figure 7, we propose an extra control mechanism at each memristive array to enable selective wordline activation and iterative match vector generation on the selectlines.

Fig. 7. Enabling selective wordline activation and match vector generation.

1) Bitwise Column Search: Every column search is performed on a set of selected cells within a particular column of the array. A select vector, connected to all the wordlines, determines the selected cells for each column search operation. Initially, all memristive rows containing the data points are selected by the select vector (Section IV-B2). The data points are represented with multi-bit values stored in single-level memory cells, where each cell represents a single bit of a value. Therefore, each column of the array includes one bit value from each of multiple data points. For every bitwise search, the bitline driver of the current column is first enabled to make a read current flow through the bitline. The bitline is connected to the binary cells, which represent 0 and 1 using high (H) and low (L) resistance states, respectively. Ideally, the bitline currents reach the selectlines (recall Section II-D) only after passing through the selected cells that represent 1 with their low resistance states. In practice, however, the cells in the high resistance state pass current too; but the magnitude of the current flowing through each memristive cell is inversely proportional to its resistive state. To create a near-ideal situation for bitwise search, we choose memristive devices that provide a large dynamic range of resistance states (i.e., R_H is much larger than R_L). By sensing the selectlines, we perform a column read at the array periphery. As shown in Figure 7, the result undergoes a bitwise XNOR with the reference bit value, a 1-bit search key, to generate a match vector.

2) Selective Row Exclusion: Row exclusion is performed by loading the generated match vector into the select vector, where more 1s may be turned into 0s. This reduces the number of selected rows for the next iteration of the bit-serial min/max computation. To ensure that only non-minimal values are excluded from the data set (Section III-A), the newly generated match vector is loaded into the select vector only if at least one of the selected cells differs from the others. As shown in Figure 7, the All 0 or 1 Logic block generates the load signal for the select vector latches. At the end of each column operation, the contents of the select vector are updated only if the load signal is driven high.

B. Memory Organization

1) Mat Architecture: One major component of the additional circuitry for computing min/max within memory arrays is the sensing circuit at each selectline for producing the match vector. In a conventional memristive array, the sensing circuits are connected to the bitlines for reading a row of the array, whereas the proposed column search operation needs the sensing circuits at the selectlines. We propose a physical structure for the memristive arrays that enables sharing the sensing circuits between read and column search operations. Figure 8 shows how every four arrays within a mat share the sensing and driving circuits for row read, column search, and row write operations. At the center of each mat, a controller is employed to operate the sense amps and drivers appropriately for the received read, write, and column search commands. All four memristive arrays are active during each mat command to perform a bit-parallel access. The outcome of each column search is a binary signal indicating whether at least one of the mat arrays requires a row exclusion. This signal is then sent upstream to the chip controller for further processing.

Fig. 8. Sharing sense and drive circuits for row read (a), column search (b), and row write (c) operations.
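The sketch below models one column-search step for a 64-row array slice as bit vectors, including the XNOR match generation and the conditional select-vector update guarded by the "all 0 or 1" logic. It is a behavioral sketch of the operations described above, not the sensing circuit itself; names and the 64-row width are illustrative.

    #include <stdint.h>

    typedef struct { uint64_t match; int has_one; int mixed; } col_result;

    /* One bitwise column search: 'cells' holds the column's stored bits,
     * 'select' is the select vector, 'key_bit' is the 1-bit search key. */
    col_result column_search(uint64_t cells, uint64_t select, int key_bit) {
        uint64_t xnor = key_bit ? cells : ~cells;  /* 1 where cell == key */
        col_result r;
        r.match = xnor & select;                   /* selected matching rows */
        r.has_one = (r.match != 0);
        r.mixed = (r.match != 0) && (r.match != select); /* not all 0 or 1 */
        return r;
    }

    /* Selective row exclusion: load only when the column was mixed. */
    uint64_t update_select(uint64_t select, col_result r) {
        return r.mixed ? (select & ~r.match) : select;
    }

Note that has_one and mixed correspond to the two per-mat signals sent upstream for multi-mat exclusion, as described in Section IV-B2.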

2) Chip Organization: Building upon the proposed mat structure, we design a memristive chip capable of storing data points and performing in-situ min/max computation. Every chip comprises a controller and multiple banks that are connected using a data/index H-tree. The banks are further divided into subbanks, which are similarly connected using an internal data/index H-tree. Each subbank comprises multiple mats and a selector that keeps only one mat active per access.

Multi-Mat Management. Each mat is designed to compute the min/max value of its data points independently. However, not all data may fit in a single mat; therefore, multi-mat management is necessary to enable computing the min/max value of larger data sets. Along these lines, we design a special data/index tree that transfers data and addresses in both directions between the chip controller and the mats. (In the conventional memory architecture, interconnection trees are typically used for sending address and control in one direction only, from the controller to the memory arrays.) This capability is necessary for two reasons: (1) the memory location of the result is needed after every min/max computation, and (2) global knowledge about all data points is necessary to accurately perform a column exclusion across multiple mats. Figure 9 shows an example that needs global knowledge for a multi-mat exclusion. A column search command is sent to 3 mats for finding 1s at the second most significant bit of all the data points. The local search results are zero, all, and one matches in Mats 0, 1, and 2, respectively. Following the local mat computation steps, Mats 0 and 1 would not exclude any data points due to having zero and all matches. This, however, results in not excluding the numbers in Mat 1, which is wrong. Instead, in a multi-mat row exclusion, all 3 mats need to be checked for their row exclusion needs. After comparison, each mat returns two signals to the controller: one is the output of the "all 0 or all 1" logic, and the other indicates whether there was a 1 in the column. In the example, Mats 0, 1, and 2 return 00, 01, and 10 to the controller, respectively. Based on these signals, the controller decides which rows to exclude in the computation. To realize this mechanism efficiently, we use a specialized data/index tree that ORs all the exclusion signals from the mats to signal the chip controller when an update to the select vectors is required.

Fig. 9. Multi-mat row exclusion.

Tree-based Index Computation. Upon completing a min/max operation, which finds the global min/max output across the span of the selected mats, the data/index tree is expected to compute the memory address of the output. We design the data/index tree to act as a priority encoder that selects only one min/max value per bitwise column search. The outcome of each min/max computation is a multi-bit index progressively produced in the data/index tree and sent upstream from the arrays. Each bit of the index is generated by one of the tree nodes along the path. Figure 10 shows an example index generation for 16 arrays across 4 memristive mats, where arrays 2, 7, and 12 contain the min/max value. Each mat generates a binary signal (E) indicating whether it contains the min/max value (the exclusion signal is reused for this purpose) and an initial index (A) representing which array/row has the min/max value (priority is always given to the smaller indices). At every node of the tree, the A_i and E_i signals from the two children are combined to generate A_n and E_n. E_n is a binary signal computed by ORing the corresponding signals produced by the children. A_n is a multi-bit value produced by concatenating one most significant bit to the index selected from the children: A_0 is selected if E_0 is 1; otherwise, we choose A_1. The additional bit is therefore 0 when A_0 is selected and 1 otherwise.

Fig. 10. Calculating the address of the minimum value in H-trees.

The output of the index reduction per column-wise computation is sent to the chip controller, which performs a per-mat global select vector update. The process continues until either we reach the LSB or only one selected value is left across all the selected mats. If multiple values remain selected after reaching the LSB, they are all equal to the minimum value in the data set. The output of the index reduction tree at this stage corresponds to the array with the lowest address holding the minimum value, which ensures a stable sort of the data.
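A minimal C sketch of one tree node's combine function, under the (E, A) semantics described above; the struct layout and field widths are illustrative assumptions.

    #include <stdint.h>

    typedef struct {
        int      e;     /* 1 if this subtree contains the min/max value  */
        uint32_t a;     /* index of the winning array within the subtree */
        int      bits;  /* number of valid bits in 'a'                   */
    } node_sig;

    /* Combine the (E, A) signals of two children; priority goes to the
     * smaller index (child 0), which also keeps the sort stable. */
    node_sig combine(node_sig c0, node_sig c1) {
        node_sig n;
        n.e = c0.e | c1.e;                       /* En = E0 OR E1 */
        int pick1 = !c0.e && c1.e;               /* choose A1 only if E0 = 0 */
        node_sig src = pick1 ? c1 : c0;
        n.a = ((uint32_t)pick1 << src.bits) | src.a;  /* prepend one MSB */
        n.bits = src.bits + 1;
        return n;
    }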

Select Vector Initialization. To make use of the proposed in-situ accelerator, it is necessary to initialize the select vector of all memristive arrays containing the data points prior to the iterative min/max computation. We allow the application to determine an address range for every initialization at the software level (Section V). The process is then completed in hardware by sending the begin and end of the address range to the data/index tree. From root to leaves, the begin and end signals exclude all tree branches with addresses respectively below and above the address range. At the memristive arrays of the remaining branches, the select bit of each row is set to 1 if it is within the range; otherwise, it is set to 0. Figure 11 shows how the begin and end signals, specifying an address range from 5 to 10, are sent to the memory arrays. Every node of the tree relays the signals to one or both of its children based on a certain address bit. This provides a fast and efficient way of initializing the select bits prior to in-situ computation.

Fig. 11. Initializing the select vectors via H-trees.

V. SOFTWARE-HARDWARE INTERFACE

In this paper, we use a DDR4 [36] interface to enable fast and efficient byte-addressable communication between software and the proposed accelerator. Depending on the application requirements, modules of the proposed memristive architecture may be included in the system for storage and in-memory data ranking purposes. A small fraction of the address space visible to software within every chip is mapped to an internal RAM array and is used for implementing the data buffers and the configuration parameters. Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register. Both memory configuration and data transfer accesses are performed through ordinary DDR4 reads and writes. This is made possible by making all accesses to the accelerator in-order strong-uncacheable.

DIMM Organization. To support large-scale data ranking problems whose working set does not fit within a single chip, it is possible to interconnect multiple RIME chips under a dual in-line memory module (DIMM) [44]. Moreover, a system may include multiple DIMMs for larger data. Each DIMM is configured to be used either in the RIME mode or in the normal storage mode, decided at system boot time. Runtime reconfigurability between the RIME and normal storage modes is not allowed, owing to constraints imposed by the tree-based index reduction architecture (more details are provided later in this section). Each DIMM is equipped with control registers, data buffers, and a controller. The controller receives the DDR4 commands, data, and address bits from the external interface, and orchestrates the necessary data movement and computation among all of the chips on the DIMM.

Software Support. The proposed system provides a userspace API library for efficient utilization of the in-memory processing capabilities by the user. The API enables applications to (1) allocate memory in the accelerator, (2) configure the hardware prior to each computation, and (3) compute the min/max of the dataset. Any allocated memory in the accelerator may be used as normal with load and store instructions, within the constraints imposed by the tree-based index reduction method. The various functions offered as part of the RIME API, along with their usage, are shown in the example code snippet in Figure 12. We design rime_init(), rime_min(), and rime_max() as part of the API to initialize the hardware and compute the minimum and maximum values, respectively. The third argument of the rime_init() call specifies the data type stored in the memory. rime_min() and rime_max() take a pointer to the target sorted array. The main operation of rime_init() is configuring the data/index tree and setting the operational mode of the chip controller. rime_init() also allows defining a sub-region within a region created by rime_malloc() for the min/max operation through appropriate arguments. This capability gives the user flexibility in controlling the ranges of data considered by specific operations.

Fig. 12. Example code snippet for forming a sorted list.
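The listing in Figure 12 did not survive extraction in the available copy; the following is a hedged reconstruction consistent with the API behavior described above and with the 100-iteration loop referenced in the multi-DIMM example later in this section. The prototypes, the region size, and the RIME_TYPE_UINT32 constant are assumptions.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    void *rime_malloc(size_t bytes);                  /* assumed prototypes */
    void  rime_free(void *p);
    void  rime_init(void *begin, void *end, int type);
    void  rime_min(uint32_t *out);

    #define N_KEYS 1000000          /* illustrative region size   */
    #define RIME_TYPE_UINT32 0      /* hypothetical type constant */

    int main(void) {
        uint32_t *data = rime_malloc(N_KEYS * sizeof(uint32_t));
        /* ... populate data with ordinary stores ... */
        rime_init(data, data + N_KEYS, RIME_TYPE_UINT32);
        uint32_t sorted[100];
        for (int i = 0; i < 100; i++)
            rime_min(&sorted[i]);   /* each access: the next minimum */
        for (int i = 0; i < 100; i++)
            printf("%u\n", sorted[i]);
        rime_free(data);
        return 0;
    }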
Memory Allocation for RIME. The tree-based reduction connects multiple physically contiguous mats in a subbank (Section IV-B2). This makes it necessary that large chunks of contiguous virtual memory be allocated to contiguous mats of physical memory to efficiently utilize the tree-based reduction method. Moreover, reserving such virtual pages onto a contiguous physical space on demand could become impossible due to physical memory fragmentation. This necessitates that no fragmented physical regions be allocated to the virtual space when rime_malloc() is called. RIME ensures this requirement through a driver that avoids fragmentation-prone allocation on the RIME-defined address spaces. The driver has tunable parameters that specify the number of pages to be reserved on startup during an mmap call, and the number of additional pages to reserve when the initially reserved block becomes full (similar to many malloc implementations). When the available reserved blocks are all taken, the driver reserves additional contiguous physical memory and expands the existing allocated memory region.

Fig. 13. Physical page allocation for malloc in the normal storage mode (a) and RIME-defined DIMMs (b).

Such a difference in memory allocation between the RIME-defined and normal storage regions is highlighted in Figure 13. Each small square in the figure denotes a physical page. There are three instances of malloc calls (A, B, and C), each of a different size, within a region reserved by mmap for the normal storage mode (a) and the contiguous RIME mode (b). In the conventional case of memory allocation, a virtually contiguous address space of multiple pages may not be mapped to physically contiguous pages. In such a case, it is highly inefficient to perform RIME operations because the system cannot exploit its reduction tree to efficiently compute the minimum value index for every column-wise comparison. In the RIME-defined regions, the physical pages of each malloc are contiguous, thereby utilizing the tree reduction efficiently.

One drawback of the contiguous physical page allocation is that if the size of a rime_malloc request exceeds the size of any physically unallocated contiguous space in the RIME region, memory allocation for that malloc is not possible. This is accounted for in the rime_malloc implementation, which returns a null pointer in such cases. The user can then use rime_free to release unnecessary allocated memory within the RIME region and retry the allocation. Notably, a contiguous physical region is only necessary if the DIMM address space is to be used for RIME computation. In the case of using the DIMM for normal memory purposes, the conventional allocation mechanism is sufficient.
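The free-and-retry pattern described above can be wrapped as follows; the prototypes are the same assumed ones as before, and the spare block handed back by the caller is hypothetical.

    #include <stddef.h>

    void *rime_malloc(size_t bytes);   /* assumed prototypes (see above) */
    void  rime_free(void *p);

    /* Retry a failed rime_malloc() (NULL return) after freeing an
     * unneeded region to open up contiguous physical space. */
    void *rime_malloc_retry(size_t bytes, void **spare) {
        void *p = rime_malloc(bytes);
        if (p == NULL && *spare != NULL) {
            rime_free(*spare);
            *spare = NULL;
            p = rime_malloc(bytes);    /* second attempt */
        }
        return p;
    }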
Address Mapping and Multi-DIMM Support. DRAM address mappings may be interleaved at fine granularity across channels to exploit further parallelism during block transfers. A RIME DIMM does not allow such address mapping (see above): assume two 1GB single-DIMM channels (RIME 0 and RIME 1); the address space 0x00000000–0x3FFFFFFF maps to RIME 0 and 0x40000000–0x7FFFFFFF maps to RIME 1 (bit location 2^30 is used to extract the DIMM address). Each rime_min/rime_max call is accompanied by the starting/ending addresses of the target data range as arguments (Figure 12). Therefore, all the chips within each RIME DIMM are configured for the operational address ranges. If the data spans more than one channel, the API sends multiple such commands to the RIME DIMMs.

Fig. 14. Example of sorting in case of data spread across multiple channels.

Figure 14 shows an example two-channel RIME system, where each channel has 8 chips. Four iterations of the for-loop in Figure 12 are shown. During the first iteration, software activates all the chips across all DIMMs to receive a single min/max value from each chip (the chip controller excludes this value from the range). The library buffers these values and performs a comparison in the CPU to find the absolute min/max value (circled in the figure). Next, only the chip that had the minimum value in the previous iteration will be active, and it returns a new min/max value to replace the previous one. For example, at i = 1, the chip in RIME 1 that earlier computed the minimum value of 5 needs to compute a new minimum value. This process continues for 100 iterations to find the 100 minimum values. The extra buffered values are discarded when a new rime_init() is called for the same address range/sub-range, and succeeding rime_min/rime_max calls follow the same approach.
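A library-side sketch of the Figure 14 loop: buffer one candidate per channel, emit the smallest, and refill only the winning channel. chan_min() stands in for the per-channel rime_min access and is a hypothetical helper.

    #include <stdint.h>

    #define NCHAN 2

    uint32_t chan_min(int chan);   /* assumed: next minimum from one
                                      channel's RIME DIMM */

    void multi_channel_sort(uint32_t *out, int n) {
        uint32_t cand[NCHAN];
        for (int c = 0; c < NCHAN; c++)       /* first access: all chips */
            cand[c] = chan_min(c);
        for (int i = 0; i < n; i++) {
            int win = 0;
            for (int c = 1; c < NCHAN; c++)   /* CPU-side comparison */
                if (cand[c] < cand[win]) win = c;
            out[i] = cand[win];
            cand[win] = chan_min(win);        /* refill the winner only */
        }
    }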
VI. EXPERIMENTAL SETUP

A. Architecture

Based on the prior work on ESESC [45], we develop a QEMU-based cycle-accurate simulator to model a multicore out-of-order processor. For the baseline systems, we interface the processor to cycle-accurate components for an off-chip main memory using DDR4 DRAM [36] and an eight-vault HBM [37]. To realize the proposed API and software support, we modify QEMU for an extended version of the memkind library [46] that enables special memory allocation and in-memory ranking. Table I shows the simulation parameters.

TABLE I
SIMULATION PARAMETERS

Core Type: 64 4-issue cores, 2 GHz, 256 ROB entries
Cache / Instruction L1: 32KB, direct-mapped, 64B block, hit/miss: 2/2
Cache / Data L1: 32KB, 4-way, LRU, 64B block, hit/miss: 2/2, MESI
Cache / Shared L2: 8MB, 16-way, LRU, 64B block, hit/miss: 15/12
HBM / Configuration: 2KB row buffer, 2GB DDR4-2000, Channels/Ranks/Banks: 4/8/8
HBM / Timing (CPU cycles): tRCD:44, tCAS:44, tCCD:16, tWTR:31, tWR:4, tRTP:46, tBL:4, tCWD:61, tRP:44, tRRD:16, tRAS:112, tRC:271, tFAW:181
Main Memory / Configuration: 8KB row buffer, 8Gb DDR4-1600 chips, Channels/Ranks/Banks: 4/2/8
Main Memory / Timing (CPU cycles): tRCD:44, tCAS:44, tCCD:16, tWTR:31, tWR:4, tRTP:46, tBL:10, tCWD:61, tRP:44, tRRD:16, tRAS:112, tRC:271, tFAW:181
RIME / Configuration: Channels/Chips/Banks/Subbanks: 1/8/64/64, 1Gb DDR4-1600-compatible chips, 512x512 SLC subarrays, die area: 20.54 mm^2
RIME / Timing and Power: tRead: 4.3ns, tWrite: 54.2ns, tCompute: 282.5ns, vRead: 1V, vWrite: 2V, vCompute: 1V, compute energy/chip: 51.3nJ

B. Circuits

We model the data array, sensing circuits, drivers, mat controller, and interconnect elements using SPICE predictive technology models [47] of NMOS and PMOS transistors at 22nm. To estimate the area, delay, dynamic energy, and leakage power of the proposed memristive system, we perform circuit simulations of the building blocks using Cadence (SPECTRE) [48]. Then, we use the resistive memory parameters provided by the prior work [49] to evaluate the read/write/compute voltages, area, delay, and energy of the data arrays. All the additional gates, latches, and control logic are synthesized using the Cadence Encounter RTL Compiler [50] with FreePDK [51] at 45nm. The results are then scaled down to a 22nm memory technology node. All the SRAM units for the tables and data buffers at the chip controller are evaluated using CACTI 6.5 [52]. To estimate the system power/energy, we use the cycle-accurate simulator in coordination with McPAT [53] for the processor die, the Micron power calculator [54] for the main memory, and prior work on HBM memories [55] for the in-package memory architecture. The overheads associated with the additional circuitry are measured through modeling with CACTI 6.5 [52]. The match vectors incur a 3% area overhead per mat. Including all additional latches, control logic, tree reduction, and multiplexers, each mat has an 8% area overhead, amounting to a 5% die overhead.

C. Workloads

In addition to various sorting kernels (i.e., mergesort, quicksort, radixsort, and heapsort), we develop two versions of six applications for execution on the proposed RIME architecture and on the conventional multicore CPU with in-package and off-chip memory systems. All of the workloads are compiled with GCC using the -O3 parameter for the MIPS64 ISA.

GroupBy. Scalar aggregate and GroupBy are two types of aggregates often used to summarize a large set of records for strategic decision making. In particular, GroupBy refers to generating a set of groups for a given table [4] (whereas, in scalar aggregates, the whole table is grouped and a single value is produced), which is a key operator for decision support systems, databases, and big data processing [56]. In GroupBy, the whole table is split into several groups depending on a specific key. Then, functions such as filtering, aggregation, and transformation are applied within each group. Finally, the groups are merged or joined to create a new table. Sorting is at the heart of modern large-scale GroupBy functions [38]. We devise a key-value database using quicksort (Q/S) for the GroupBy application to achieve the highest throughput.

MergeJoin. Sort MergeJoin is a key operation in database systems, which refers to combining records from several tables. Numerous proposals have been made to accelerate MergeJoin through parallelism [57] or FPGA accelerators [58]. For the key-value database, we devise a MergeJoin that sorts two large tables to generate a new table that includes only the items that exist in both input tables.

Kruskal's and Prim's Algorithms. The minimum spanning tree (MST) is a crucial concept in graph theory. It plays a key role in a broad domain of applications, including vehicular ad-hoc networks (VANET) [59], multi-level Steiner trees [60], touring problems, VLSI layout, network organization, and rail transit networks [61]. Kruskal's and Prim's algorithms are two main tools for forming the MST of a given graph. In Kruskal's algorithm, all the graph edges are sorted from low weight to high; then, the graph edges are iteratively added to the output MST. Prim's algorithm starts from a vertex and iteratively finds a local vertex with the minimum cost to include in the output MST.

Dijkstra's Algorithm. It finds the shortest paths from a source graph node to all other nodes. The algorithm needs the data to be sorted first and is very common in network routing protocols [62–65]. We devise a program that iteratively finds the vertex with the minimum distance from the source node. The algorithm is similar to Prim's; however, Prim's produces a minimum spanning tree, whereas Dijkstra's produces a shortest-path tree.

A*-Search Algorithm. A*-search is a smart algorithm for path finding and graph traversal, which is commonly used for finding the shortest path from one point to another in a graph with multiple obstacles. The algorithm plays a significant role in robotics, web-based maps, virtual reality systems, geographic information systems, and games [66–68]. We realize a 2D binary matrix representing obstacles with 0 and non-obstacles with 1. The goal is then to find a path from the source to a destination only through non-obstacle cells.

Strict Priority Queue. In a priority queue, data is arranged in descending/ascending order based on priority. Every dequeue operation removes the minimum/maximum entry of the queue based on the values. We use the heap structure for the baseline priority queue application. Numerous network algorithms for routing and congestion management are based on strict priority queues [69].

Among the above workloads, Dijkstra's, Kruskal's, and Prim's algorithms work with IEEE-754 floating-point values, while the rest of the workloads use integers. Note that if the dataset uses fixed-point values, it is processed by RIME in the fixed-point mode; if the dataset uses floating-point values, it is processed in the floating-point mode. No data conversion is required.

VII. EVALUATIONS

A. Performance

Sorting. Figure 15 shows the throughput of various sort algorithms, in terms of million keys per second (MKps), using RIME and the baseline systems when the data size varies from 0.5–65M keys. RIME achieves superior performance over both baselines for all the evaluated data sizes. Compared with the off-chip baseline, the in-package memory offers a higher memory bandwidth, which results in average throughput gains of 2.4× (M/S), 2.3× (Q/S), 8.1× (R/S), and 1.9× (H/S). In contrast, RIME lowers the bandwidth complexity of sorting via in-situ computation, thereby gaining 30.2× (M/S), 12.4× (Q/S), 50.7× (R/S), and 26× (H/S) average throughputs.

Fig. 15. Throughput of the evaluated sorting algorithms.

data sizes, the HBM implementation of GroupBy achieves used for adding and removing packets to a buffer. On every
1.1 − 2× better performance than off-chip DRAM. Whereas, remove, a packet with the minimum key value is removed
RIME improves performance by 5.4 − 23.1×. Similarly, the from the queue. To model various loads and packet rates, we
HBM version of MergeJoin performs 1.1 − 2× better than the assess performance for a range of initial buffer sizes (0.5-65M
off-chip DRAM baseline; while, RIME improves performance packets) and various ratios of packet add to remove (i.e., R).
by 5.6 − 24.1×. Figure 18 shows the throughput of removing packets from the
buffer for RIME and the baseline systems. We use a heap
Off-Chip RIME Off-Chip RIME
In-Package In-Package structure for the baseline priority queues, which need heap
[Figure: two panels (GroupBy, MergeJoin) plotting Throughput (MKps) against Data Size (Million Keys).]
Fig. 16. Throughput of merge and join algorithms for various sizes.

Ranking. Figure 17 shows the throughput of various algorithms based on data ranking. RIME improves performance significantly. For Kruskal, the HBM implementation achieves 2.8−3.7× of the off-chip performance, while RIME gains 8.5−20.9×. Similarly for Dijkstra, the performance gains over the off-chip baseline are 1.2−2.2× and 7.5−17.2× for HBM and RIME, respectively. Such performance gains for RIME are enabled by the significant reduction in memory bandwidth requirements. We observe similar trends in Prim and A*-Search. The performance gains for Prim are 2−4.4× and 6.3−14.3× for the HBM and RIME systems. A*-Search on HBM and RIME respectively achieves 1−1.1× and 2.3−23× the performance of the off-chip DRAM baseline.

[Figure: four panels (Kruskal, Prim, Dijkstra, A*-Search) plotting Throughput (MKps) against Data Size (Million Keys) for the Off-Chip, In-Package, and RIME systems.]
Fig. 17. Throughput of graph algorithms for various sizes.

Across all the evaluated applications, we found the performance of ranking with RIME to be largely insensitive to data size. However, when executing an application, multiple ranking operations may be carried out in between frequent RIME initialization and application phases that are sensitive to data size. Therefore, we observe a stagnation in the throughput of RIME as data size increases for most of the evaluated workloads.

Strict Priority Queuing. We evaluate strict priority queuing using a packet processing workload, where two threads are used to add packets to and remove packets from a shared buffer; the software baselines require sorted-order maintenance at both insert and remove operations. Therefore, increasing the buffer size and the add-to-remove ratio results in a lower throughput for the baseline HBM and off-chip systems. In contrast, RIME achieves a constantly high throughput due to using ordinary memory writes for adding packets to the queue and low-complexity accesses for removing packets from the buffer. Across all the evaluated sizes and rates, RIME gains 6.1−43.6× better performance than both the HBM and off-chip baselines.
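To make this access pattern concrete, the sketch below shows how the two queue operations might map onto a C binding of the RIME library; the type and function names (rime_queue_t, rime_init, rime_insert, rime_extract_min) are hypothetical stand-ins for the API of Section V, not the actual interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical C binding of the RIME library (Section V); these
     * declarations are illustrative stand-ins, not the published API. */
    typedef struct rime_queue rime_queue_t;
    rime_queue_t *rime_init(size_t capacity);        /* set up flag bits over a key buffer */
    void rime_insert(rime_queue_t *q, uint32_t key); /* one ordinary memory write */
    uint32_t rime_extract_min(rime_queue_t *q);      /* in-situ rank, then exclude the winner */

    /* Strict priority queuing: inserts are plain writes with no sorted-order
     * maintenance, and each removal ranks the buffer in place, so memory
     * traffic per operation stays flat as the buffer grows. */
    void process_packets(rime_queue_t *q, const uint32_t *prio,
                         size_t num_add, size_t num_remove)
    {
        for (size_t i = 0; i < num_add; i++)
            rime_insert(q, prio[i]);                 /* add phase: O(1) bandwidth each */
        for (size_t i = 0; i < num_remove; i++) {
            uint32_t next = rime_extract_min(q);     /* remove phase: in-memory ranking */
            (void)next;                              /* forward packet to the egress port */
        }
    }

In this sketch, the ratio num_add/num_remove plays the role of the packet rate R in Figure 18: raising it (or the buffer size) adds sorted-order maintenance work for the software baselines but leaves the per-operation cost of the in-memory queue unchanged.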
[Figure: three panels (Off-Chip (DDR4), In-Package (HBM), RIME) plotting Throughput (MKps) against Data Size (Million Keys) for packet rates R=1 through R=5.]
Fig. 18. Throughput of strict priority queue for various packet rates and sizes.

B. Power and Energy

The proposed software library (Section V) for controlling RIME DIMMs ensures a peak power of 1W for all the evaluated applications. For system energy evaluation, we execute all the evaluated workloads on 65M keys. Figure 19 shows the system energy consumed by the HBM and RIME systems, normalized to the off-chip baseline. The HBM system consumes an average of 24% more energy than off-chip for A*-Search and Strict Priority Queuing. This is mainly due to (1) the similar execution times of the two baselines and (2) the additional static power consumed by the in-package and off-chip memories in the HBM baseline. For the other applications, HBM can significantly reduce the execution time, thereby decreasing the system energy by about 40%. RIME achieves average energy reductions of 94% (Kruskal), 92% (Dijkstra), 91% (Prim), 95% (GroupBy), 95% (MergeJoin), 94% (A*-Search), and 96% (Strict Priority Queuing).

[Figure: two panels of stacked bars showing Relative Energy (components: Off-Chip DRAM or RRAM, In-Package memory, and CPU) for Kruskal, Dijkstra, Prim, GroupBy, MergeJoin, A*-Search, and SPQ with R=1 through R=5.]
Fig. 19. System energy for various applications (65M keys).

C. Lifetime

RRAM devices can endure a finite number of writes, ranging from 10^6 to 10^12 [70–72]. We assess the impact of this finite endurance on the lifetime of the proposed memristive system. Notice that wear is induced only by writing to the memristive arrays; therefore, we need to track the number of writes performed per memory location during the execution of each workload. Notably, unlike conventional sort algorithms, RIME does not require any data swaps during the sort iterations, so no additional writes to the memristive cells are necessary during a RIME sort operation. It is also notable that the initialization and exclusion operations are applied to the flag bits, which are implemented in CMOS latches. By tracking the total number of writes per second carried out during the execution of all applications, we first identify the block with the highest write frequency. We then compute the lifetime assuming that this most frequently written block keeps receiving writes at the same rate until it stops working. Based on this study, for an endurance of 10^8 writes, we expect a lifetime of at least 376 years for the evaluated applications.
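As a sanity check on this estimate, the lifetime calculation reduces to dividing the cell endurance by the write rate of the hottest block; the rate below is back-solved from the reported 376-year lifetime (assuming roughly 3.15 × 10^7 seconds per year) rather than measured directly:

\[
\text{lifetime} = \frac{E}{f_{\max}}
\quad\Rightarrow\quad
f_{\max} \approx \frac{10^{8}\ \text{writes}}{376 \times 3.15 \times 10^{7}\ \text{s}} \approx 8.4 \times 10^{-3}\ \text{writes/s}.
\]

That is, the hottest block absorbs fewer than one write per minute, consistent with RIME confining per-iteration state changes to the CMOS flag latches rather than the memristive cells.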
VIII. RELATED WORK

Sorting with SIMD and GPU Accelerators. Software approaches have been proposed to exploit SIMD instructions for utilizing data-level parallelism in sorting applications [11]. Inoue et al. [11] present a sorting algorithm based on a multi-way Mergesort that reduces cache misses by avoiding random accesses when rearranging data. CloudRAMSort performs large-scale distributed sorting using SIMD and multicore architectures [73]. Hou et al. [13] propose a segmented sort mechanism for load-balanced processing on GPUs by combining or splitting data segments of different sizes. Stehle et al. [16] propose a hybrid Radixsort that reduces the amount of memory transfer in GPUs. Satish et al. [74] present an implementation of Radixsort and Mergesort on many-core GPUs that exploits fine-grained parallelism with minimal global communication. Several algorithms have been devised to improve the performance of modern databases using parallel multicore processing, SIMD instructions, GPUs, and ASICs [75]. Albutiu et al. [57] develop a parallel sort-merge algorithm to minimize query response times in databases. Chhugani et al. [15] present a multi-threaded SIMD implementation of Mergesort based on a binary tree to better utilize bandwidth. The existing software solutions rely on improving performance by reducing data movement between the processor and memory. However, they do not solve the fundamental issue, which is the need to access data from memory. In contrast, RIME enables in-situ ranking with no need to transfer data from memory to the processor.

Sorting with FPGA and ASIC Accelerators. Kobayashi et al. [76] propose an FPGA-based sorting accelerator that receives data from a host PC through the PCIe bus and sends the sorted data back. Bonsai [77] proposes an adaptive merge-tree based sorting solution to optimize FPGA sorting performance across all scales of data (MB to GB). Zhang et al. [78] develop a CPU-FPGA heterogeneous platform for Mergesort. Casper et al. [79] present a hardware design for selection, merge join, and sorting for database applications. Specialized hardware has also been considered for accelerating sort operations [14]. These solutions still require moving data from memory to the accelerator prior to sorting, whereas RIME enables both storing and sorting within the same memory arrays with no need to move data elements.

Ranking Accelerators. Numerous techniques have been proposed in the literature to speed up median computation. Kumar et al. [80] present a hardware implementation for computing the median of 25 integers in three clock cycles at 394 MHz. Szántó et al. [81] propose a hierarchical histogram-based median filter in GPUs for parallel applications. Sindhu et al. [82] design a comparator for fast sorting and ranking of data. Venkatappareddy et al. [83] propose a methodology to employ the binary median filter for polynomial expressions. Lin et al. [84] propose a 1D comparison-free bit-level median filter by cascading different median units. Rupesh et al. [85, 86] examine a data clustering accelerator based on in-situ median calculation in RRAM. Unlike the existing solutions, the proposed memory system is capable of accelerating sort and general ranking operations rather than only finding the median.

In-Memory Processing. Numerous in-memory processing accelerators have been proposed in the literature that aim at reducing data movement between the processor and memory by performing computation on the memory chips. Computational RAM [87] builds a system where SIMD pipelines are placed next to the memory arrays for in-memory computation. A similar approach is proposed by Parallel PIM [88] to perform SIMD operations in memory. Active Pages [89] proposes a microprocessor that includes additional logic circuits in DRAM chips for in-memory computation. FlexRAM [90] and intelligent RAM (IRAM) [91] are other examples of in-memory processing that have been evaluated on different technologies. However, RIME focuses on fast and efficient in-memory ranking for a different class of applications.

IX. CONCLUSIONS

Large-scale sorting is a fundamental operation for future data-intensive applications. This paper characterized the bandwidth and throughput requirements of large-scale sorting workloads and identified the primary reason for the poor performance of sorting. As an effective solution to the bandwidth problem in sorting applications, we examined a novel memory system capable of data ranking in memory. The proposed architecture exhibits significant potential for orders-of-magnitude performance and energy-efficiency gains in future large-scale data processing.

X. ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers for their useful feedback. This work was supported in part by the National Science Foundation (NSF) under Grant CCF-1755874.

REFERENCES

[1] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained parallelism in algebraic multigrid methods,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C123–C152, 2012.
[2] K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang, “Mega-kv: a case for gpus to maximize the throughput of in-memory key-value stores,” Proceedings of the VLDB Endowment, vol. 8, no. 11, pp. 1226–1237, 2015.
[3] P. Flick and S. Aluru, “Parallel distributed memory construction of suffix and longest common prefix arrays,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–10, 2015.
[4] D. Taniar, C. H. Leung, W. Rahayu, and S. Goel, High-performance parallel database processing and grid databases, vol. 67. John Wiley & Sons, 2008.
[5] E. Kovacs and I. Ignat, “Clustering with prototype entity selection compared with k-means,” Journal of Control Engineering and Applied Informatics, vol. 9, no. 1, pp. 11–18, 2007.
[6] V. Jugé, “Adaptive shivers sort: An alternative sorting algorithm,” in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1639–1654, SIAM, 2020.
[7] M. D. MacLaren, “Internal sorting by radix plus sifting,” Journal of the ACM (JACM), vol. 13, no. 3, pp. 404–411, 1966.
[8] H. H. Goldstine, J. Von Neumann, and J. Von Neumann, “Planning and coding of problems for an electronic computing instrument,” 1947.
[9] C. A. R. Hoare, “Algorithm 64: quicksort,” Communications of the ACM, vol. 4, no. 7, p. 321, 1961.
[10] R. Bordawekar, D. Brand, M. Cho, B. R. Konigsburg, and R. Puri, “Radix sort acceleration using custom asic,” May 24 2018. US Patent App. 15/857,770.
[11] H. Inoue and K. Taura, “Simd- and cache-friendly algorithm for sorting an array of structures,” Proceedings of the VLDB Endowment, vol. 8, no. 11, pp. 1274–1285, 2015.
[12] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, “Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 351–362, 2010.
[13] K. Hou, W. Liu, H. Wang, and W.-c. Feng, “Fast segmented sort on gpus,” in Proceedings of the International Conference on Supercomputing, pp. 1–10, 2017.
[14] S. Haas, S. Scholze, S. Höppner, A. Ungethüm, C. Mayr, R. Schüffny, W. Lehner, and G. Fettweis, “Application-specific architectures for energy-efficient database query processing and optimization,” Microprocessors and Microsystems, vol. 55, pp. 119–130, 2017.
[15] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey, “Efficient implementation of sorting on multi-core simd cpu architecture,” Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1313–1324, 2008.
[16] E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efficient hybrid radix sort on gpus,” in Proceedings of the 2017 ACM International Conference on Management of Data, pp. 417–432, 2017.
[17] T. Lahiri, M.-A. Neimat, and S. Folkman, “Oracle timesten: An in-memory database for enterprise applications,” IEEE Data Eng. Bull., vol. 36, no. 2, pp. 6–13, 2013.
[18] M. P. (Intel), An Intro to MCDRAM (High Bandwidth Memory) on Knights Landing. Intel, January 2016. https://software.intel.com/en-us/blogs/2016/01/20/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.
[19] P. Behnam and M. N. Bojnordi, “Redcache: reduced dram caching,” in 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2020.
[20] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2201–2227, 2010.
[21] P. Behnam, A. P. Chowdhury, and M. N. Bojnordi, “R-cache: A highly set-associative in-package cache using memristive arrays,” in 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 423–430, IEEE, 2018.
[22] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, “Memory devices and applications for in-memory computing,” Nature Nanotechnology, pp. 1–16, 2020.
[23] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1–13, IEEE, 2016.
[24] T. P. Xiao, C. H. Bennett, B. Feinberg, S. Agarwal, and M. J. Marinella, “Analog architectures for neural network acceleration based on non-volatile memory,” Applied Physics Reviews, vol. 7, no. 3, p. 031301, 2020.
[25] A. Pal Chowdhury, P. Kulkarni, and M. Nazm Bojnordi, “Mb-cnn: memristive binary convolutional neural networks for embedded mobile devices,” Journal of Low Power Electronics and Applications, vol. 8, no. 4, p. 38, 2018.
[26] M. N. Bojnordi and E. Ipek, “The memristive boltzmann machines,” IEEE Micro, vol. 37, no. 3, pp. 22–29, 2017.
[27] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 105–117, 2015.
[28] G. Graefe, “Implementing sorting in database systems,” ACM Computing Surveys (CSUR), vol. 38, no. 3, pp. 10–es, 2006.
[29] Z. Wang, L. Tian, D. Guo, and X. Jiang, “Optimization and analysis of large scale data sorting algorithm based on hadoop,” arXiv preprint arXiv:1506.00449, 2015.
[30] “Sorting applications.” https://algs4.cs.princeton.edu/25applications/. Accessed: 2020-03-12.
[31] L. Z. Xiang, “Research and improvement of pagerank sort algorithm based on retrieval results,” in 2014 7th International Conference on Intelligent Computation Technology and Automation, pp. 468–471, IEEE, 2014.
[32] C. Kim, S. Yoon, and D. Kim, “Fast sort of floating-point data for data engineering,” Advances in Engineering Software, vol. 42, no. 1-2, pp. 50–54, 2011.
[33] C. A. R. Hoare, “Algorithm 64: Quicksort,” Commun. ACM, vol. 4, p. 321, July 1961.
[34] D. E. Knuth, The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. USA: Addison Wesley Longman Publishing Co., Inc., 1998.
[35] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
[36] “JEDEC standard: DDR4 SDRAM,” JEDEC Solid State Technology Association, 2012.
[37] S. JEDEC, “High bandwidth memory (hbm) dram,” JESD235, 2013.
[38] H. Miao, M. Jeon, G. Pekhimenko, K. S. McKinley, and F. X. Lin, “Streambox-hbm: Stream analytics on high bandwidth hybrid memory,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 167–181, 2019.
[39] M. Zangeneh and A. Joshi, “Design and optimization of nonvolatile multibit 1t1r resistive ram,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 8, pp. 1815–1828, 2014.
[40] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, “Overcoming the challenges of crossbar resistive memory architectures,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476–488, IEEE, 2015.
[41] P.-E. Danielsson, “Getting the median faster,” Computer Graphics and Image Processing, vol. 17, no. 1, pp. 71–78, 1981.
[42] I. Hatirnaz, F. Gurkaynak, and Y. Leblebici, “Realization of a programmable rank-order filter architecture using capacitive threshold logic gates,” in ISCAS'99. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI (Cat. No. 99CH36349), vol. 1, pp. 435–438, IEEE, 1999.
[43] D. H. Yoon and F. Petrini, “Hourglass: A bandwidth-driven performance model for sorting algorithms,” in Supercomputing (J. M. Kunkel, T. Ludwig, and H. W. Meuer, eds.), (Cham), pp. 93–108, Springer International Publishing, 2014.
[44] “DDR4 SDRAM registered DIMM design specification,” JEDEC Solid State Technology Association, 2014.
[45] E. K. Ardestani and J. Renau, “Esesc: A fast multicore simulator using time-based sampling,” in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 448–459, IEEE, 2013.
[46] “Memkind.”
[47] W. Zhao and Y. Cao, “New generation of predictive technology model for sub-45nm design exploration,” in International Symposium on Quality Electronic Design, 2006.
[48] “Spectre circuit simulator.” http://www.cadence.com/products/cic/spectre_circuit/pages/default.aspx.
[49] M. Wu, Y. Lin, W. Jang, C. Lin, and T. Tseng, “Low-power and highly reliable multilevel operation in ZrO2 1t1r rram,” IEEE Electron Device Letters, vol. 32, no. 8, pp. 1026–1028, 2011.
[50] “Encounter RTL compiler.” http://www.cadence.com/products/ld/rtl_compiler/.
[51] “Free PDK 45nm open-access based PDK for the 45nm technology node.” http://www.eda.ncsu.edu/wiki/FreePDK.
[52] S. Wilton and N. Jouppi, “CACTI: An enhanced cache access and cycle time model,” vol. 31, pp. 677–688, May 1996.
[53] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in International Symposium on Computer Architecture, 2009.
[54] “Micron DDR4 power calculator.” https://www.micron.com/~/media/documents/products/power-calculator/ddr4_power_calc.xlsm.
[55] M. O'Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, “Fine-grained dram: Energy-efficient dram for extreme bandwidth systems,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 '17, (New York, NY, USA), pp. 41–54, ACM, 2017.
[56] S. Chaudhuri and K. Shim, “Including group-by in query optimization,” in VLDB, vol. 94, pp. 354–366, 1994.
[57] M.-C. Albutiu, A. Kemper, and T. Neumann, “Massively parallel sort-merge joins in main memory multi-core database systems,” arXiv preprint arXiv:1207.0145, 2012.
[58] M.-T. Xue, Q.-J. Xing, C. Feng, F. Yu, and Z.-G. Ma, “Fpga-accelerated hash join operation for relational databases,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2019.
[59] J. J. Kponyo, Y. Kuang, E. Zhang, and K. Domenic, “Vanet cluster-on-demand minimum spanning tree (mst) prim clustering algorithm,” in 2013 International Conference on Computational Problem-Solving (ICCP), pp. 101–104, IEEE, 2013.
[60] R. Ahmed, F. D. Sahneh, S. Kobourov, and R. Spence, “Kruskal-based approximation algorithm for the multi-level steiner tree problem,” arXiv preprint arXiv:2002.06421, 2020.
[61] T. Liang, H. Liu, and Y. Tan, “Research on the gravity planning model of prefecture city rail transit network,” in E3S Web of Conferences, vol. 145, p. 02005, EDP Sciences, 2020.
[62] B. Musznicki, M. Tomczak, and P. Zwierzykowski, “Dijkstra-based localized multicast routing in wireless sensor networks,” in 2012 8th International Symposium on Communication Systems, Networks & Digital Signal Processing (CSNDSP), pp. 1–6, IEEE, 2012.
[63] I. Koutsopoulos, E. Noutsi, and G. Iosifidis, “Dijkstra goes social: Social-graph-assisted routing in next generation wireless networks,” in European Wireless 2014; 20th European Wireless Conference, pp. 1–7, VDE, 2014.
[64] F. Yue-zhen, L. Dun-min, W. Qing-chun, and J. Fa-chao, “An improved dijkstra algorithm used on vehicle optimization route planning,” in 2010 2nd International Conference on Computer Engineering and Technology, 2010.
[65] C. Liu, Y. Li, W. Cheng, and G. Shi, “An improved multi-channel aodv routing protocol based on dijkstra algorithm,” in 2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA), pp. 547–551, IEEE, 2019.
[66] S. Bandi and D. Thalmann, “The use of space discretization for autonomous virtual humans (video session),” in Proceedings of the Second International Conference on Autonomous Agents, pp. 336–337, 1998.
[67] J. Yao, C. Lin, X. Xie, A. J. Wang, and C.-C. Hung, “Path planning for virtual human motion using improved a* star algorithm,” in 2010 Seventh International Conference on Information Technology: New Generations, pp. 1154–1158, IEEE, 2010.
[68] B. M. ElHalawany, H. M. Abdel-Kader, A. TagEldeen, A. E. Elsayed, and Z. B. Nossair, “Modified a* algorithm for safer mobile robot navigation,” in 2013 5th International Conference on Modelling, Identification and Control (ICMIC), pp. 74–78, IEEE, 2013.
[69] D. Medhi and K. Ramasamy, Network Routing: Algorithms, Protocols, and Architectures. Morgan Kaufmann, 2017.
[70] C. Cheng, A. Chin, and F. Yeh, “Novel ultra-low power rram with good endurance and retention,” in VLSI Technology (VLSIT), 2010 Symposium on, pp. 85–86, June 2010.
[71] H. Akinaga and H. Shima, “Resistive random access memory (reram) based on metal oxides,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2237–2251, 2010.
[72] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, “Self-rectifying bipolar TaOx/TiO2 rram with superior endurance over 10^12 cycles for 3d high-density storage-class memory,” in VLSI Technology (VLSIT), 2013 Symposium on, pp. T166–T167, IEEE, 2013.
[73] C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani, “Cloudramsort: fast and efficient large-scale distributed ram sort on shared-nothing cluster,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 841–850, 2012.
[74] N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore gpus,” in 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10, IEEE, 2009.
[75] H. Zhang, G. Chen, B. C. Ooi, K.-L. Tan, and M. Zhang, “In-memory big data management and processing: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1920–1948, 2015.
[76] R. Kobayashi and K. Kise, “A high performance fpga-based sorting accelerator with a data compression mechanism,” IEICE Transactions on Information and Systems, vol. 100, no. 5, pp. 1003–1015, 2017.
[77] N. Samardzic, W. Qiao, V. Aggarwal, M. F. Chang, and J. Cong, “Bonsai: High-performance adaptive merge tree sorting,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 282–294, 2020.
[78] C. Zhang, R. Chen, and V. Prasanna, “High throughput large scale sorting on a cpu-fpga heterogeneous platform,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 148–155, IEEE, 2016.
[79] J. Casper and K. Olukotun, “Hardware acceleration of database operations,” in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 151–160, 2014.
[80] V. Kumar, A. Asati, and A. Gupta, “Low-latency median filter core for hardware implementation of 5×5 median filtering,” IET Image Processing, vol. 11, no. 10, pp. 927–934, 2017.
[81] P. Szántó and B. Fehér, “Hierarchical histogram-based median filter for gpus,” Acta Polytechnica Hungarica, vol. 15, no. 2, 2018.
[82] E. Sindhu and K. Vasanth, “Vlsi architectures for 8 bit data comparators for rank ordering image applications,” in 2019 International Conference on Communication and Signal Processing (ICCSP), pp. 0087–0093, IEEE, 2019.
[83] P. Venkatappareddy, B. Lall, C. Jayanth, K. Dinesh, and M. Deepthi, “Novel methods for implementation of efficient median filter,” in 2017 14th IEEE India Council International Conference (INDICON), pp. 1–5, IEEE, 2017.
[84] C. Lin, W.-T. Chen, Y.-C. Chou, and P.-Y. Chen, “A novel comparison-free 1d median filter,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2019.
[85] Y. K. Rupesh, P. Behnam, G. R. Pandla, M. Miryala, and M. N. Bojnordi, “Accelerating k-medians clustering using a novel 4t-4r rram cell,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2709–2722, 2018.
[86] Y. K. Rupesh and M. N. Bojnordi, “Large scale data clustering using memristive k-median computation,” in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 374–374, IEEE, 2017.
[87] D. G. Elliott, M. Stumm, W. M. Snelgrove, C. Cojocaru, and R. Mckenzie, “Computational ram: implementing processors in memory,” IEEE Design & Test of Computers, vol. 16, pp. 32–41, Jan 1999.
[88] M. Gokhale, B. Holmes, and K. Iobst, “Processing in memory: the terasys massively parallel pim array,” Computer, vol. 28, pp. 23–31, Apr 1995.
[89] M. Oskin, F. T. Chong, and T. Sherwood, “Active pages: a computation model for intelligent memory,” in Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235), pp. 192–203, Jun 1998.
[90] Y. Kang, W. Huang, S. M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, “Flexram: Toward an advanced intelligent memory system,” in 2012 IEEE 30th International Conference on Computer Design (ICCD), pp. 5–14, Sept 2012.
[91] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent ram,” IEEE Micro, vol. 17, pp. 34–44, Mar. 1997.

