
electronics

Article
Efficient Processing-in-Memory System Based on RISC-V Instruction Set Architecture
Jihwan Lim, Jeonghun Son and Hoyoung Yoo *

Department of Electronics Engineering, Chungnam National University, Daejeon 34134, Republic of Korea;
jihwan.lim@abov.co.kr (J.L.); jhsohn.cas@o.cnu.ac.kr (J.S.)
* Correspondence: hyyoo@cnu.ac.kr

Abstract: Extensive research on deep learning and big data has produced efficient methods for processing large volumes of data and conserving computing resources. Particularly in domains like the IoT (Internet of Things), where computing power is constrained, processing large volumes of data efficiently is crucial. The processing-in-memory (PIM) architecture was introduced as a method for efficient large-scale data processing. However, PIM research has focused on changes within the memory itself rather than on the needs of low-cost solutions such as the IoT. This paper proposes a new approach that uses the PIM architecture to effectively overcome memory bottlenecks in domains with constrained computing performance. We adopt the RISC-V instruction set architecture for our proposed PIM system's design, implementation, and comprehensive performance evaluation. By minimizing core modifications and introducing PIM instructions at the ISA level, the proposal allows low-spec systems such as IoT devices to leverage PIM capabilities efficiently. We evaluate the performance of the proposed architecture by comparing it with existing structures using convolution operations, the fundamental unit of deep-learning and big data computations. The experimental results show that our proposed structure achieves a 34.4% improvement in processing speed and an 18% reduction in energy consumption compared to conventional von Neumann-based architectures. This substantiates its effectiveness at the application level, extending to fields such as deep learning and big data.

Keywords: processing in memory (PIM); artificial intelligence (AI); machine learning; deep learning;
RISC-V; Internet of Things (IoT)
1. Introduction

With the advancements in big data, deep learning, and the IoT, research focusing on efficient large-scale data processing and resource conservation has become pivotal in each respective field [1]. However, despite much research, algorithms still consume substantial computing resources due to structural limitations in how large-scale data are read from memory and processed, and the search for efficient structures remains an important research topic [2]. These structural limitations stem from the memory bottlenecks typical of von Neumann architectures, where all calculations are processed by the core, necessitating the loading of extensive data from memory [3]. Thus, the speed of memory access becomes a major limiting factor [4]. Yet, due to physical constraints, memory access speeds are significantly lower than core processing speeds, making the performance of algorithms that process large-scale data heavily reliant on the speed of loading data from memory [3].

Research to improve memory access limitations includes increasing memory bandwidth, enhancing memory speeds, and exploring graphics processing unit (GPU) technologies [5]. High-bandwidth memory (HBM), for instance, uses wide memory interfaces to increase data transfer rates, but structural bottlenecks remain an important challenge [6]. GPUs use parallel computing logic to offload
computations from the core, acting as accelerators. However, the bottleneck in transferring data from memory to the GPU complicates achieving optimal performance [7]. To address these bottlenecks, PIM architectures have recently been proposed [8]. Unlike traditional von Neumann architectures, in which the core alone processes data and the memory merely stores it, PIM integrates memory and processing to perform data processing efficiently, offering a promising approach to overcoming memory bottlenecks [9]. However, research on such PIM architectures has primarily focused on optimizing the in-memory computation logic rather than on developing PIM solutions for low-cost environments such as the IoT. Consequently, how to effectively utilize PIM architectures within the IoT remains a necessary and unresolved area of research [10].
This paper proposes a novel structure that combines the PIM architecture with computing resources constrained to a low area and low performance, as an alternative that mitigates the data-movement bottleneck between memory and processors inherent in traditional computer architectures [11]. Specifically, we introduce a PIM system based on the RISC-V instruction set architecture and demonstrate effective strategies for alleviating the memory bottlenecks prevalent in existing computing systems [12]. We evaluate the proposed PIM system through fundamental operations such as convolution in deep learning [13]. The experimental results highlight the enhanced performance of the proposed architecture in data-intensive fields like deep learning and big data, underscoring its capability to facilitate efficient large-scale data processing in resource-constrained environments such as the IoT [14]. The remainder of this paper is organized as follows. Section 2, Background, introduces the fundamental information necessary to describe the proposed architecture. Section 3, Proposed Design, explains the proposed architecture, demonstrates its feasibility, and compares it with existing architectures to highlight the improvements. Section 4, Experimental Results, evaluates the algorithm and circuit performance of both the proposed architecture and the conventional von Neumann architecture, demonstrating the advantages of our approach. Finally, Section 5, Conclusions, summarizes our claims and outlines future research directions.

2. Background
This section provides the essential background for this study. First, it introduces the concept of the convolution operation, which serves as the fundamental computational unit in deep-learning algorithms. Second, it describes the RISC-V RV32I architecture, which is based on the traditional von Neumann architecture and uses the RISC instruction set.

2.1. Convolution
Convolution is an important operation in deep learning and image processing, used
to apply filters to signals or images, create new signals or images, or identify specific
features in an image. This operation involves computing the sum of element-wise products
between an input signal (or image) and a kernel (filter). The mathematical representation
of convolution is shown in Equation (1) [15].

$$[f * g](i, j) = \sum_{d=0}^{n} \sum_{p=0}^{k} \sum_{q=0}^{k} f(p, q, d) \cdot g(p + i, q + j, d) \qquad (1)$$

Here, f represents a kernel of size k, g represents an input image, and ∗ denotes the convolution operator. Moving the kernel f across the input data g, a multiply–accumulate operation is performed between corresponding elements. Convolution thus calculates a weighted sum of neighboring pixel values to generate each new pixel value [f ∗ g](i, j). Figure 1 illustrates the concept of the convolution operation. For an input of size H_in × W_in × n, a k × k × n kernel performs convolution: the element-wise product between the input data and the kernel data at the same location is computed, followed by summation to determine the output value. This process is repeated by shifting the kernel, generating output data of size H_out × W_out.
Using convolution, various types of filters can be applied to images, enabling operations such as sharpening, edge detection, or blurring. Convolution is also fundamental in deep learning, where convolutional layers are composed of these operations to extract features from images. Through iterative operations, convolutional layers learn optimal kernel values. Ultimately, convolutional neural networks (CNNs) utilize these learned convolutional layers to classify images or perform tasks such as object detection [16].
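For concreteness, the following C sketch implements the direct convolution just described (stride 1, no padding). The buffer layout and names are illustrative assumptions, not the benchmark code used later in the paper.

```c
#include <stddef.h>

/* Direct convolution of an H_in x W_in x n input with a k x k x n kernel,
 * producing an H_out x W_out output (stride 1, no padding).
 * Buffers are flattened row-major: in[(y * W_in + x) * n + d]. */
void conv2d(const float *in, const float *ker, float *out,
            size_t H_in, size_t W_in, size_t n, size_t k)
{
    size_t H_out = H_in - k + 1;
    size_t W_out = W_in - k + 1;

    for (size_t i = 0; i < H_out; i++) {
        for (size_t j = 0; j < W_out; j++) {
            float acc = 0.0f;
            /* Sum of element-wise products, as in Equation (1). */
            for (size_t d = 0; d < n; d++)
                for (size_t p = 0; p < k; p++)
                    for (size_t q = 0; q < k; q++)
                        acc += ker[(p * k + q) * n + d] *
                               in[((p + i) * W_in + (q + j)) * n + d];
            out[i * W_out + j] = acc;
        }
    }
}
```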

Figure 1. Operation of convolution.

2.2. RISC-V RV32I Architecture

RISC-V is an open-source instruction set architecture (ISA) designed according to the principles of the reduced instruction set computer (RISC). Its main features are simplicity, scalability, and modularity. Figure 2 illustrates the six representative instruction formats used in the 32-bit RISC-V ISA. Each instruction is structured according to a specific format, making the instructions concise and intuitive and enabling an efficient processor design.
R-type: funct7 (31–25) | rs2 (24–20) | rs1 (19–15) | funct3 (14–12) | rd (11–7) | opcode (6–0)
I-type: imm[11:0] (31–20) | rs1 (19–15) | funct3 (14–12) | rd (11–7) | opcode (6–0)
S-type: imm[11:5] (31–25) | rs2 (24–20) | rs1 (19–15) | funct3 (14–12) | imm[4:0] (11–7) | opcode (6–0)
B-type: imm[12|10:5] (31–25) | rs2 (24–20) | rs1 (19–15) | funct3 (14–12) | imm[4:1|11] (11–7) | opcode (6–0)
U-type: imm[31:12] (31–12) | rd (11–7) | opcode (6–0)
J-type: imm[20|10:1|11|19:12] (31–12) | rd (11–7) | opcode (6–0)

Figure 2. RISC-V base instruction formats.

Table 1 lists the key RISC-V instructions that appear prominently when compiling convolution operations. Convolution involves multiplying a kernel with an image and accumulating the results into the existing output, commonly utilizing add and mul operations, as well as addi and slli instructions for memory address calculation [17]. Table 2 presents assembly code performing a convolution operation using RISC-V instructions. The lw instructions load kernel and image data from memory into the core, the core performs the multiplication, and the result is stored to the memory area. Subsequently, the operation results are loaded from memory into the core using lw, an addition is performed, and the result is stored to the memory area. Repeating these steps while moving through the memory areas computes the
final convolution result. Figure 3 depicts the structure of a RISC-V core designed based
on RV32I ISA. It comprises a five-stage pipelined structure: instruction fetch, instruction
decoding, execute, memory, and write back. It includes control logic to manage the pipeline
and generate control signals, an internal SRAM controller, and a load–store unit (LSU) to
control peripheral addresses, a Harvard architecture with separate data and instruction memories, and an AXI4 (Advanced eXtensible Interface 4) system bus. Such structures are widely used in low-area, low-power systems like the IoT, leveraging space-efficient designs and achieving high performance through the Harvard architecture and pipelining techniques [18].
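To connect the C source to the instruction stream in Tables 1 and 2 below, consider a multiply–accumulate loop such as the following (a hypothetical fragment, not the paper's benchmark source). A RISC-V compiler lowers each loop-body statement to the Table 2 pattern, with addi and slli producing the element addresses.

```c
/* Multiply-accumulate step of a convolution inner loop. Each statement in
 * the loop body compiles to the sequences listed in Table 2: lw/lw/mul/sw
 * for the product, then lw/lw/add/sw for the accumulation. */
void mac(int *out, const int *ker, const int *img, int n)
{
    for (int i = 0; i < n; i++)
        out[i] += ker[i] * img[i];
}
```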

Table 1. Basic instructions of a convolution operation.

Instruction          Instruction Format   Meaning
add rd, rs2, rs1     R-type               R[rd] = R[rs2] + R[rs1]
mul rd, rs2, rs1     R-type               R[rd] = R[rs2] × R[rs1]
slli rd, rs1, imm    I-type               R[rd] = R[rs1] << imm
addi rd, rs1, imm    I-type               R[rd] = R[rs1] + imm

Table 2. Convolution operation as assembly instructions.

Assembly Instruction   Meaning
lw x14, −32(x8)        Load data at memory address R[x8]−32 into register x14
lw x15, −56(x8)        Load data at memory address R[x8]−56 into register x15
mul x15, x14, x15      Multiply the data in registers x14, x15 and store the result in register x15
sw x15, −40(x9)        Store data in register x15 to memory address R[x9]−40
lw x14, −88(x9)        Load data at memory address R[x9]−88 into register x14
lw x15, −40(x9)        Load data at memory address R[x9]−40 into register x15
add x15, x14, x15      Add the data in registers x14, x15 and store the result in register x15
sw x15, −40(x9)        Store data in register x15 to memory address R[x9]−40

Figure 3. Block diagram of RISC-V RV32I architecture.

3. Proposed Design

In the existing von Neumann architecture, the core must fetch data from memory, perform computations, and then store the results back into memory. This process involves significant data movement, and systems targeting low-power, low-performance environments, like RISC-V-based ones, have limitations in bandwidth and speed between memory and processors due to various constraints such as power and area. These speed limitations during data movement lead to overall system degradation. In this paper, we propose to mitigate this issue by incorporating PIM capabilities into RISC-V systems to minimize the data movement between memory and processors.
We delineate methods for utilizing PIM systems in RISC-V from both software and hardware perspectives. First, for software-based PIM processing, we propose a method in which PIM instructions are recognized as existing load instructions rather than introducing entirely new instructions. This approach treats PIM instructions within the core as load instructions, enabling an efficient hardware design without significant alterations to the existing core system. Moreover, since they are recognized as load operations, compatibility with existing code structures is maintained without requiring additional decode logic. Figure 4 illustrates the format of the proposed PIM instructions. To maintain compatibility with existing systems, we base the PIM-type format on the existing I-type format of load instructions, dividing the original 12-bit immediate field into two 6-bit fields to enable simultaneous access to two memory locations with a single PIM instruction. The positions of the opcode, funct, rd, rs1, and imm fields align exactly with the existing I-type format, obviating the need for separate decode logic. Table 3 shows the transformation of the key instructions used in convolution operations (add, mul, slli, and addi) into the PIM instructions add.p, mul.p, slli.p, and addi.p. The add.p and mul.p instructions derived from add and mul can operate on data from two memory addresses before loading, while slli.p and addi.p derived from slli and addi can perform the slli and addi operations on data from a single memory address before loading. Table 4 presents an example of converting the convolution operations from Table 2 into PIM operations. In the conventional method, four instructions are used: two lw instructions to load the data from memory into registers, followed by either a mul or an add operation, and an sw instruction to store the result back to memory. In contrast, the transformed sequence integrates the two lw instructions and the operation instruction into a single instruction recognized by the core (interpreted as a load), followed by the sw instruction. Consequently, using PIM instructions reduces the number of instructions the core processes from three to one and allows parallel processing with PIM units, potentially achieving up to a 66% performance improvement in a core with a cycles-per-instruction (CPI) value of 1. While significant performance gains over conventional instructions can be expected, representing multiple register addresses and immediate values in a single instruction presents a trade-off between flexibility of address access and performance. Compiling with a commercially available RISC-V compiler revealed immediate values exceeding the representable range (2^6), which cannot be processed by PIM instructions. Therefore, a PIM-aware compiler is essential to effectively address such trade-offs. This proposal is not limited to convolution code: it can be applied to any code where two lw instructions and one computation instruction, or one lw instruction and one computation instruction, are arranged consecutively. However, because it is difficult to demonstrate this transformation across all algorithms, this paper explains its application specifically to CNNs as a representative example.
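To make the instruction-count argument explicit: with CPI = 1, a lw–lw–op triple that previously took three core cycles is fetched as a single PIM instruction, so the cycle count for that fragment drops by at most

$$\frac{3 - 1}{3} \approx 66\%.$$

In practice, sw instructions, branches, and non-convertible code dilute this bound, which is consistent with the 31.4–34.4% reductions reported in Section 4.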

Figure 4. RISC-V base PIM instruction format.

Table 3. Basic PIM instructions of a convolution operation.

PIM Instruction                 Instruction Format   Meaning
add.p rd, imm(rs1), imm(rs1)    PIM-type             R[rd] = M[rs1+imm[25:20]] + M[rs1+imm[31:26]]
mul.p rd, imm(rs1), imm(rs1)    PIM-type             R[rd] = M[rs1+imm[25:20]] × M[rs1+imm[31:26]]
slli.p rd, imm(rs1), imm        PIM-type             R[rd] = M[rs1+imm[25:20]] << imm[31:26]
addi.p rd, imm(rs1), imm        PIM-type             R[rd] = M[rs1+imm[25:20]] + imm[31:26]
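A software view of this decoding, as a minimal sketch: the fields sit exactly where I-type places them, with the immediate split per Figure 4. The struct, helper names, and the sign-extension of the 6-bit offsets are our assumptions for illustration (the paper only notes that values beyond the 2^6 range cannot be encoded).

```c
#include <stdint.h>

/* Field layout of a PIM-type instruction per Figure 4: identical to I-type,
 * except the 12-bit immediate is split into two 6-bit memory offsets. */
typedef struct {
    uint32_t opcode; /* bits [6:0]   */
    uint32_t rd;     /* bits [11:7]  */
    uint32_t funct3; /* bits [14:12] */
    uint32_t rs1;    /* bits [19:15] */
    int32_t  off1;   /* bits [25:20], first memory offset  */
    int32_t  off2;   /* bits [31:26], second memory offset */
} pim_inst_t;

/* Sign-extend a 6-bit field (an assumption for illustration). */
static int32_t sext6(uint32_t x) { return ((int32_t)(x << 26)) >> 26; }

pim_inst_t decode_pim(uint32_t inst)
{
    pim_inst_t d;
    d.opcode = inst & 0x7F;
    d.rd     = (inst >> 7)  & 0x1F;
    d.funct3 = (inst >> 12) & 0x07;
    d.rs1    = (inst >> 15) & 0x1F;
    d.off1   = sext6((inst >> 20) & 0x3F); /* imm[25:20] */
    d.off2   = sext6((inst >> 26) & 0x3F); /* imm[31:26] */
    return d;
}
```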

Table 4. Convolution operation as PIM assembly instructions.

Assembly Instruction           Meaning
mul.p x15, −32(x8), −56(x8)    Multiply (in the PU) the data at memory R[x8]−32 and R[x8]−56 and load the result to register x15
sw x15, −40(x9)                Store data in register x15 to memory address R[x9]−40
add.p x15, −88(x9), −40(x9)    Add (in the PU) the data at memory R[x9]−88 and R[x9]−40 and load the result to register x15
sw x15, −88(x9)                Store data in register x15 to memory address R[x9]−88

Secondly, to implement the PIM instructions in hardware, we introduce the PIM logic not inside the memory itself but in the SRAM controller, and we design a control unit within the core
to manage it. Figure 5 illustrates the RISC-V core structure for processing the proposed
instructions. In the conventional von Neumann architecture system shown in Figure 3,
processing units responsible for processing reside exclusively within the core. However, our
proposed PIM structure incorporates a processing unit (PU) within the SRAM controller,
enabling processing units to perform operations on read data from memory internally,
transmitting only minimal data to the core. The PU added to the SRAM controller is
designed to perform operations using read data from memory and does not fetch or decode
instructions directly, necessitating the addition of a PIM control unit (PCU) within the core
to decode PIM-related instructions and generate commands for executing operations in the
PIM memory controller before the instruction is executed in the internal processing stage in
the core. This unit generates PIM-transaction-related signals (PIMen, PIMsel, and PIMaddr)
when instructions are fetched and decoded from the instruction memory to control the
PIM logic. Additionally, one of its critical roles is to produce signals (pipeCtrl_pim2lw)
to ensure that PIM instructions are processed similarly to lw instructions in the core’s
pipeline without additional pipeline stalls, smoothly integrating the process. Furthermore,
it outputs lw-related flags to prevent the control logic from identifying PIM instructions as
unknown instructions. This technique ensures that actual operations are performed not
in the pipeline in the core but via instructions sent to PIM, simultaneously reading the
data from memory. The control signals generated by PCU are utilized in the PU of the
SRAM controller. The PU performs PIM operations using output Q1 and Q2 of the data
memory. To facilitate this, a phase register is added to synchronize the SRAM read time
and control signals, and a bypass mux is added to allow the memory data to be loaded
directly without passing through the operation logic when PIM is not in use. This structure
maintains compatibility with existing systems while enabling the efficient execution of PIM instructions through a streamlined design.
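As a conceptual model of this datapath (not the authors' RTL), the PU in the SRAM controller can be sketched as follows. The names Q1, Q2, PIMen, and PIMsel mirror the signals in the text; the operation encoding and the C modeling itself are our assumptions.

```c
#include <stdint.h>

typedef enum { PIM_ADD, PIM_MUL, PIM_SLLI, PIM_ADDI } pim_op_t; /* assumed encoding */

/* One cycle of the PU in the SRAM controller: combine the two read ports
 * Q1/Q2 of the dual-port data memory when PIMen is set; otherwise the
 * bypass mux returns Q1 unchanged, i.e., a plain lw path. */
uint32_t pu_cycle(int PIMen, pim_op_t PIMsel, uint32_t Q1, uint32_t Q2,
                  uint32_t imm6)
{
    if (!PIMen)            /* bypass mux: PIM not in use */
        return Q1;
    switch (PIMsel) {
    case PIM_ADD:  return Q1 + Q2;
    case PIM_MUL:  return Q1 * Q2;
    case PIM_SLLI: return Q1 << imm6;  /* single-operand forms use imm[31:26] */
    case PIM_ADDI: return Q1 + imm6;
    default:       return Q1;
    }
}
```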

Figure 5. Block diagram of proposed architecture.

4. Experimental Results
To evaluate our proposed architecture, we assess the processing speed performance
using convolution operations, which are the basic units used in deep-learning and large-
scale data processing fields. Additionally, we compare the area when synthesized using
the Synopsys Design Compiler for the CMOS 28 nm process at 100 MHz. We specifically
compare the input data size of 224 × 224 × 3, typical for deep-learning algorithms targeting
IoT devices like MobileNet, and three kernel sizes: 3 × 3, 5 × 5, and 7 × 7. The source
code, originally written in C, is compiled using the RISC-V GNU Toolchain 12.1.0, with manual modifications to incorporate the PIM instructions proposed in this paper. This step is necessary because no compiler yet exists that can emit and optimize PIM instructions. Our experimental results demonstrate significant performance improvements
in the proposed architecture featuring PIM instructions.
Firstly, in our proposed structure, the execution time of convolutions significantly
decreases compared to traditional von Neumann-based architectures. Figure 6 compares the
memory access rate for different kernel sizes between the proposed and existing structures.
As a result, the memory access rate decreased by 24% when performing the convolution
operation compared to the original. Figure 7 compares operation speeds for different kernel
sizes between the proposed and existing structures. For each kernel size, the proposed
structure showed a reduction in processing time of 31.4%, 32.7%, and 34.4% compared to the
existing structure. This latency reduction is due to PIM instructions performing calculations
directly in memory, reducing the number of instructions executed by the processor. While,
theoretically, this could lead to a 66% performance improvement (as three instructions
are consolidated into one), the actual compilation revealed limitations due to branch
instructions and non-convertible code, resulting in restrained performance gains. However,
these results are derived from experiments without full compiler optimization, suggesting
that more extensive use of PIM instructions could yield higher performance improvements.

Figure 6. Comparison of memory access rate.

Figure 7. Comparison of execution time according to kernel size.
Secondly, we compare the synthesis results of the proposed architecture with those of the existing structure in terms of area. The proposed structure reduces data movement and the number of instructions processed in the core, enhancing the processing speed. However, implementing PIM systems requires additional logic to control the PIM and to perform operations in memory. This trade-off weighs the high processing speed advantage against
tions in memory. This trade-off evaluates the high processing speed advantage against
(normalized on a two-input NAND gate) between the proposed and existing structures.
We evaluated areas divided into the core area with an added PCU, the interface (IF) area
with an added PU, Memory (Mem), and Peripheral (Peri). The peripheral area remained
the same; the proposed structure has increased in gate count by 285 for PCU and 5389 for
PIM PU compared to traditional structures, resulting in a 1.27% increase in overall core and
interface areas, leading to negligible area growth for significant performance improvements.
However, there was a 31% increase in the memory area due to the dual-port RAM used in our system. If the proposed structure omits the dual-port RAM, the memory
area implementation could match that of the existing structure, albeit with an additional
cycle for loading memory data. This presents a choice for users between area and per-
formance priorities. Through our experiments, we confirm that the proposed structure
achieves an over 30% processing performance improvement with a modest 1.27% increase
in the logic area. Figure 9 compares the energy consumption between the proposed and
existing structures. Through our experiments, the energy consumption decreased by about
18% compared to the original.
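As a sanity check on the 1.27% figure: the added logic totals 285 + 5389 = 5674 equivalent gates, implying a baseline core-plus-interface size of roughly

$$\frac{285 + 5389}{0.0127} \approx 4.5 \times 10^{5}$$

equivalent NAND2 gates. This baseline is our own back-calculation from the stated numbers, not a figure quoted in the paper.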

Figure 8. Comparison of equivalent gate count by design.

Figure 9. Comparison of energy consumption.
5. Conclusions

This paper proposes a PIM architecture as an alternative to traditional von Neumann architectures for efficient large-scale data processing in resource-constrained environments. The proposed system, based on RISC-V, introduces an innovative approach to efficiently handle large-scale data processing tasks in domains with limited computing performance. To evaluate this system, performance assessments are conducted on operations commonly used in large-scale data processing. In convolution experiments, we compare the processing speed of the proposed PIM system against a standard RISC-V processor. The results
demonstrate how the PIM system alleviates bottlenecks associated with data movement.
By performing computations directly within memory, the proposed system enhances the
processing speed and reduces the latency associated with data movement. This insight sug-
gests that leveraging PIM can significantly improve efficiency in scenarios where the data
movement is a critical bottleneck. The findings of this paper provide important insights
that open avenues for the development and application of PIM systems in future IoT and
embedded environments. Effectively harnessing PIM will require support from the appli-
cation software perspective. Subsequent research should focus on optimizing compilers
capable of compiling new PIM instructions and developing optimized applications that
utilize these instructions.

Author Contributions: Conceptualization, J.L.; Software, J.L. and J.S.; Project administration, H.Y.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.
Acknowledgments: This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIT) (No. 2022R1A5A8026986). This work was supported
by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded
by the Korea government (MSIT) (2022-0-01170).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Liang, D.S. Smart and Fast Data Processing for Deep Learning in Internet of Things: Less is more. IEEE Internet Things J. 2019, 6,
5981–5989. [CrossRef]
2. Zhuoying, Z.; Ziling, T.; Pinghui, M.; Xiaonan, W.; Dan, Z.; Xin, Z.; Ming, T.; Jie, L. A Heterogeneous Parallel Non-von Neumann
Architecture System for Accurate and Efficient Machine Learning Molecular Dynamics. IEEE Trans. Circuits Syst. I Regul. Pap.
2023, 70, 2439–2449.
3. Azriel, L.; Mendelson, A.; Weiser, U. Peripheral memory: A technique for fighting memory bandwidth bottleneck. IEEE Comput.
Archit. Lett. 2015, 14, 54–57. [CrossRef]
4. Souvik, K.; Priyanka, G.; Jeffry, L.; Hemanth, C.; BVVSN, R. Memristors Enabled Computing Correlation Parameter In-Memory
System: A Potential Alternative to Von Neumann Architecture. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 2022, 30,
755–768.
5. Cristobal, N.; Roberto, C.; Ricardo, B.; Javier, A.; Raimundo, V. GPU Tensor Cores for Fast Arithmetic Reductions. IEEE Trans.
Parallel Distrib. Syst. 2021, 32, 72–84.
6. Lee, J.; Kim, J.; Kim, K.; Ku, Y.; Kim, D.; Jeong, C.; Yun, T.; Kim, H.; Cho, H.; Oh, S.; et al. High bandwidth memory (HBM)
with TSV technique. In Proceedings of the 2016 13th International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 29
December 2016.
7. Park, I.; Singhal, N.; Lee, M.; Cho, S.; Kim, C. Design and Performance Evaluation of Image Processing Algorithms on GPUs.
IEEE Trans. Parallel Distrib. Syst. 2011, 22, 91–104. [CrossRef]
8. Kim, D.; Yu, C.; Xie, S.; Chen, Y.; Kim, J.; Kim, B.; Kulkarni, J.; Kim, T. An Overview of Processing-in-Memory Circuits for Artificial
Intelligence and Machine Learning. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 338–353. [CrossRef]
9. Lee, S.; Kang, S.; Lee, J.; Kim, H.; Lee, E.; Seo, S.; Yoon, H.; Lee, S.; Lim, K.; Shin, H.; et al. Hardware Architecture and Software
Stack for PIM Based on Commercial DRAM Technology. In Proceedings of the 2021 ACM/IEEE 48th annual International
Symposium on Computer Architecture, Valencia, Spain, 4 August 2021.
10. Lee, W.J.; Kim, C.H.; Paik, Y.; Kim, S.W. PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA. IEEE Access
2023, 11, 8622–8632. [CrossRef]
11. Heo, J.; Kim, J.; Han, W.; Kim, J.; Kim, J. SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator with Local Error
Prediction for Area/Energy-Efficient On-Device Learning. IEEE J. Solid-State Circuits 2024, 59, 2671–2683. [CrossRef]
12. Elshimy, M.; Iskandar, V.; Goehringer, D.; Mohamed, A. A Near-Memory Dynamically Programmable Many-Core Overlay. In
Proceedings of the 2023 IEEE 16th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, Singapore,
18–21 December 2023.

13. Dinelli, G.; Meoni, G.; Rapuano, E.; Fanucci, L. Advantages and Limitations of Fully on-Chip CNN FPGA-Based Hardware
Accelerator. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems, Seville, Spain, 12–14 October
2020.
14. Heo, J.; Kim, J.; Lim, S.; Han, W.; Kim, J. T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device
Training. IEEE J. Solid-State Circuits 2023, 58, 600–613. [CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
16. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.
IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [CrossRef] [PubMed]
17. Wang, S.; Wang, X.; Xu, Z.; Chen, B.; Feng, C.; Wang, Q.; Ye, T. Optimizing CNN Computation Using RISC-V Custom Instruction
Sets for Edge Platforms. IEEE Trans. Comput. 2024, 73, 1371–1384. [CrossRef]
18. Shin, D.; Yoo, H. The Heterogeneous Deep Neural Network Processor With a Non-von Neumann Architecture. Proc. IEEE 2020,
108, 1245–1260. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
