Article
Efficient Processing-in-Memory System Based on RISC-V
Instruction Set Architecture
Jihwan Lim, Jeonghun Son and Hoyoung Yoo *
Department of Electronics Engineering, Chungnam National University, Daejeon 34134, Republic of Korea;
jihwan.lim@abov.co.kr (J.L.); jhsohn.cas@o.cnu.ac.kr (J.S.)
* Correspondence: hyyoo@cnu.ac.kr
Abstract: Extensive research on deep learning and big data has produced efficient methods for processing large volumes of data, along with studies on conserving computing resources. Particularly in domains like the IoT (Internet of Things), where computing power is constrained, processing large volumes of data efficiently enough to conserve resources is crucial. The processing-in-memory (PIM) architecture was introduced as a method for efficient large-scale data processing. However, existing PIM research focuses on changes within the memory itself rather than addressing the needs of low-cost solutions such as the IoT. This paper proposes a new approach that uses the PIM architecture to overcome memory bottlenecks effectively in domains with constrained computing performance. We adopt the RISC-V instruction set architecture for the design, implementation, and comprehensive performance evaluation of our proposed PIM system. By minimizing core modifications and introducing PIM instructions at the ISA level, our proposal enables low-spec systems such as IoT devices to leverage PIM capabilities efficiently. We evaluate the performance of the proposed architecture by comparing it with existing structures using convolution operations, the fundamental unit of deep-learning and big data computations. The experimental results show that our proposed structure achieves a 34.4% improvement in processing speed and an 18% improvement in power consumption compared to conventional von Neumann-based architectures, substantiating its effectiveness at the application level in fields such as deep learning and big data.
Keywords: processing in memory (PIM); artificial intelligence (AI); machine learning; deep learning;
RISC-V; Internet of Things (IoT)
1. Introduction

With the advancements in big data, deep learning, and the IoT, research on efficient large-scale data processing and resource conservation has become pivotal in each respective field [1]. However, despite much research, algorithms still consume substantial computing resources because of structural limitations in how large-scale data are read from memory and processed, and the search for efficient structures remains an important research topic [2]. These structural limitations stem from the memory bottleneck typical of von Neumann architectures, where all calculations are processed by the core, necessitating the loading of extensive data from memory [3]. Thus, the speed of memory access becomes a major limiting factor [4]. Yet, owing to physical constraints, memory access speeds are significantly lower than core processing speeds, making the performance of algorithms that process large-scale data heavily reliant on the speed of loading data from memory [3].
Research on overcoming memory access limitations includes increasing memory bandwidth, enhancing memory speeds, and exploiting graphics processing unit (GPU) technologies [5]. High-bandwidth memory (HBM), for instance, uses high-bandwidth memory interfaces to increase data transfer rates, but structural bottlenecks remain an important challenge [6]. GPUs act as accelerators, using parallel computing logic to offload computations from the core; however, the bottleneck in transferring data from memory to the GPU complicates achieving optimal performance [7]. To address these bottlenecks, PIM architectures have recently been proposed [8]. Unlike
traditional von Neumann architectures where only the core processes and memory stores
data, PIM integrates the memory and processing to efficiently perform data processing,
offering a promising approach to overcoming memory bottlenecks [9]. However, research
on such PIM architectures has primarily focused on optimizing the in-memory computation
logic rather than developing PIM solutions for low-cost environments such as the IoT.
Consequently, studies on how to effectively utilize PIM architectures within the IoT remain
a necessary and unresolved area of research [10].
This paper proposes a novel structure that brings the PIM architecture to computing environments constrained in area and performance, as an alternative for mitigating the data movement bottleneck between memory and processors inherent in traditional computer architectures [11]. Specifically, we introduce a PIM system based on the RISC-V instruction set architecture and demonstrate effective strategies for alleviating the memory bottlenecks prevalent in existing computing systems [12]. We evaluate the proposed PIM system by assessing its performance on fundamental operations such as convolution in deep learning [13]. The experimental results highlight the enhanced performance of the proposed architecture in data-intensive fields like deep learning and big data, underscoring its capability to facilitate efficient large-scale data processing in resource-constrained environments such as the IoT [14]. The remainder of this paper is organized as follows.
In Section 2, Background, we introduce the fundamental information necessary to describe
our proposed architecture. In Section 3, Proposed Design, we provide an explanation of
our proposed architecture, demonstrate its feasibility, and compare it with existing archi-
tectures to highlight improvements. In Section 4, Experimental Results, we evaluate the
algorithm and circuit performance of both the proposed architecture and the conventional
von Neumann architecture, thereby proving the superiority of our approach. Finally, in
Section 5, Conclusion, we summarize our claims and outline future research directions.
2. Background
The background section provides essential information for this study. First, we introduce the concept of the convolution operation, which serves as the fundamental computational unit in deep-learning algorithms. Second, we describe the RISC-V RV32I architecture, which is based on the traditional von Neumann architecture and uses the RISC instruction set.
2.1. Convolution
Convolution is an important operation in deep learning and image processing, used
to apply filters to signals or images, create new signals or images, or identify specific
features in an image. This operation involves computing the sum of element-wise products
between an input signal (or image) and a kernel (filter). The mathematical representation
of convolution is shown in Equation (1) [15].
$$[f * g](i, j) = \sum_{d=0}^{n}\sum_{p=0}^{k}\sum_{q=0}^{k} f(p, q, d) \cdot g(p+i, q+j, d) \tag{1}$$
Here, f represents a kernel of size k, g represents an input image, and ∗ denotes the convolution operator. As the filter f moves across the input data g, a multiply–accumulate operation is performed between corresponding elements. Convolution calculates a weighted sum of neighboring pixel values to generate a new pixel value [f ∗ g](i, j). Figure 1 illustrates the concept of the convolution operation. For an input of size H_in × W_in × N, a k × k × n kernel performs convolution. It computes the element-wise product between the input data and kernel data at the same location, followed by summation to determine the output data. This process is repeated by shifting the kernel, generating output data of size H_out × W_out.
Using convolution, various types of filters can be applied to images, enabling operations such as sharpening, edge detection, or blurring. Convolution is also fundamental in deep learning, where convolutional layers are composed of these operations to extract features from images. Through iterative operations, convolutional layers learn optimal kernel values. Ultimately, convolutional neural networks (CNNs) utilize these learned convolutional layers to classify images or perform tasks such as object detection [16].

Figure 1. Operation of convolution.
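To make Equation (1) concrete, the following C sketch implements the operation directly for a stride-1, no-padding case. It is a minimal illustration of ours, not code from the paper; the function name, channel-last memory layout, and parameter names are assumptions.

    #include <stddef.h>

    /* Direct convolution following Equation (1):
     * out[i][j] = sum over d, p, q of f(p, q, d) * g(p + i, q + j, d). */
    void conv2d(const float *g,   /* input,  h_in x w_in x n (channel-last) */
                const float *f,   /* kernel, k x k x n                      */
                float *out,       /* output, (h_in-k+1) x (w_in-k+1)        */
                size_t h_in, size_t w_in, size_t n, size_t k)
    {
        size_t h_out = h_in - k + 1;   /* no padding, stride 1 */
        size_t w_out = w_in - k + 1;

        for (size_t i = 0; i < h_out; i++)
            for (size_t j = 0; j < w_out; j++) {
                float acc = 0.0f;
                for (size_t d = 0; d < n; d++)
                    for (size_t p = 0; p < k; p++)
                        for (size_t q = 0; q < k; q++)
                            acc += f[(p * k + q) * n + d] *
                                   g[((p + i) * w_in + (q + j)) * n + d];
                out[i * w_out + j] = acc;   /* one output pixel [f*g](i,j) */
            }
    }

Each term of the innermost statement is one multiply–accumulate between a kernel element and the corresponding input element, which is exactly the work that Sections 2.2 and 3 map onto RISC-V instructions.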
2.2. RISC-V RV32I Architecture

RISC-V is an open-source instruction set architecture (ISA) designed according to the principles of the reduced instruction set computer (RISC). Its main features include simplicity, scalability, and modularity. Figure 2 below illustrates the six representative instruction formats used in the 32-bit RISC-V ISA. Each instruction is structured according to a specific format, making the instructions concise and intuitive and enabling an efficient processor design.
bits     31           25 24  20 19  15 14    12 11          7 6      0
R-type   funct7          rs2    rs1    funct3   rd            opcode
I-type   imm[11:0]              rs1    funct3   rd            opcode
S-type   imm[11:5]       rs2    rs1    funct3   imm[4:0]      opcode
B-type   imm[12|10:5]    rs2    rs1    funct3   imm[4:1|11]   opcode
U-type   imm[31:12]                             rd            opcode
J-type   imm[20|10:1|11|19:12]                  rd            opcode

Figure 2. RISC-V base instruction formats.
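As a concrete illustration of these formats, the C sketch below extracts the fields of an R-type instruction word with shifts and masks. It is our example, not code from the paper; the type and function names are assumptions.

    #include <stdint.h>

    /* Field extraction for a 32-bit R-type RISC-V instruction:
     * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
    typedef struct {
        uint8_t opcode, rd, funct3, rs1, rs2, funct7;
    } rtype_t;

    static rtype_t decode_rtype(uint32_t insn)
    {
        rtype_t r;
        r.opcode = insn         & 0x7F;  /* bits 6:0   */
        r.rd     = (insn >> 7)  & 0x1F;  /* bits 11:7  */
        r.funct3 = (insn >> 12) & 0x07;  /* bits 14:12 */
        r.rs1    = (insn >> 15) & 0x1F;  /* bits 19:15 */
        r.rs2    = (insn >> 20) & 0x1F;  /* bits 24:20 */
        r.funct7 = (insn >> 25) & 0x7F;  /* bits 31:25 */
        return r;
    }

The fixed field positions are what keep the decode stage simple: opcode, rd, rs1, and rs2 occupy the same bits in every format in which they appear.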
Table 1 lists the key RISC-V instructions used prominently when compiling convolution operations. Convolution involves multiplying a kernel with an image and accumulating the results into the existing output, commonly utilizing add and mul operations, as well as addi and slli instructions for memory address calculation [17]. Table 2 presents assembly code that performs a convolution operation using RISC-V instructions. The lw instruction loads kernel and image data from memory into the core, which performs the multiplications and stores the results back to the memory area. Subsequently, the core loads the operation results from memory using lw, performs the additions, and stores the results in memory. Repeating these steps while moving through the memory areas computes the final convolution result. Figure 3 depicts the structure of a RISC-V core designed based on the RV32I ISA. It comprises a five-stage pipelined structure: instruction fetch, instruction decode, execute, memory, and write back. It includes control logic to manage the pipeline and generate control signals, an internal SRAM controller and a load–store unit (LSU) to control peripheral addresses, a Harvard architecture with separate data and instruction memories, and an AXI4 (Advanced eXtensible Interface 4) system bus. Such structures are widely used in low-area, low-power systems like the IoT, leveraging space-efficient designs and achieving high performance through the Harvard architecture and pipelining techniques [18].
Figure 3. Block diagram of RISC-V RV32I architecture.
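For reference, the per-element pattern that the compiler lowers to the instructions of Table 1 looks roughly like the following C fragment. The comments show a typical instruction mapping of ours; it is an illustration, not the paper's Table 2 listing.

    #include <stdint.h>

    /* One multiply-accumulate step of the convolution on a von Neumann core.
     * Each memory operand must travel through a register before it is used. */
    void mac_step(const int32_t *kernel, const int32_t *image,
                  int32_t *out, uint32_t idx)
    {
        /* index scaling and addressing: slli (idx << 2), addi/add */
        int32_t k = kernel[idx];   /* lw  t0, 0(a0)  */
        int32_t x = image[idx];    /* lw  t1, 0(a1)  */
        int32_t p = k * x;         /* mul t2, t0, t1 */
        *out += p;                 /* lw t3, 0(a2); add t3, t3, t2; sw t3, 0(a2) */
    }

Every element of the kernel and image therefore costs several loads and stores in addition to the arithmetic itself, which is exactly the data movement the proposed design targets.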
3. Proposed Design

In the existing von Neumann architecture, the core must fetch data from memory, perform computations, and then store the results back into memory. This process involves significant data movement, and systems targeting low-power and low-performance environments like RISC-V have limitations in bandwidth and speed between memory and processors due to various constraints such as power and area. These speed limitations during data movement lead to overall system degradation. In this paper, we propose to address this by introducing PIM instructions at the ISA level while minimizing modifications to the core.

Table 1. Configuring the basic instruction of a convolution operation.

Instruction          Instruction Format   Meaning
add rd, rs2, rs1     R-type               R[rd] = M[rs2] + M[rs1]
mul rd, rs2, rs1     R-type               R[rd] = M[rs2] × M[rs1]
slli rd, rs1, imm    R-type               R[rd] = M[rs1] ≪ imm
addi rd, rs1, imm    R-type               R[rd] = M[rs1] + imm

Figure 4. RISC-V base PIM instruction format.

Figure 5. Block diagram of proposed architecture.
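Since the full encoding in Figure 4 is not reproduced here, the following C model only sketches plausible semantics for a fused PIM multiply–accumulate operating on memory operands without round-tripping through core registers. The name pim_mac and the exact semantics are our illustration, consistent with the memory-operand meanings listed in Table 1, not the paper's definition.

    #include <stdint.h>

    /* Behavioral model of a fused PIM multiply-accumulate: the memory-side
     * logic reads both operands and accumulates the product in place, so a
     * single instruction replaces the lw/lw/mul/lw/add/sw sequence. */
    static inline void pim_mac(int32_t *mem, uint32_t rd,
                               uint32_t rs1, uint32_t rs2)
    {
        mem[rd] += mem[rs1] * mem[rs2];
    }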
4. Experimental Results

To evaluate our proposed architecture, we assess the processing speed performance using convolution operations, which are the basic units used in deep-learning and large-scale data processing fields. Additionally, we compare the area when synthesized using the Synopsys Design Compiler for the CMOS 28 nm process at 100 MHz. We specifically
compare the input data size of 224 × 224 × 3, typical for deep-learning algorithms targeting
IoT devices like MobileNet, and three kernel sizes: 3 × 3, 5 × 5, and 7 × 7. The source
code, originally written in C, is compiled using the RISC-V GNU Toolchain 12.1.0 and then modified to incorporate the PIM instructions proposed in this paper. This step is necessary because no existing compiler can emit and optimize PIM instructions. Our experimental results demonstrate significant performance improvements
in the proposed architecture featuring PIM instructions.
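For illustration, custom instructions such as these can be hand-placed with the stock RISC-V GNU toolchain through the assembler's .insn directive, which assembles an R-type word from raw field values without compiler support. The sketch below is ours; the opcode (the custom-0 slot, 0x0B) and funct values are placeholders, not the encoding defined in Figure 4, and it only assembles for a RISC-V target.

    #include <stdint.h>

    /* Emit one custom R-type instruction carrying three register operands.
     * Syntax: .insn r opcode, funct3, funct7, rd, rs1, rs2 */
    static inline void pim_mac_insn(uint32_t rd_addr, uint32_t rs1_addr,
                                    uint32_t rs2_addr)
    {
        __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                         :
                         : "r"(rd_addr), "r"(rs1_addr), "r"(rs2_addr)
                         : "memory");
    }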
Firstly, in our proposed structure, the execution time of convolutions significantly
decreases compared to traditional von Neumann-based architectures. Figure 6 compares the
memory access rate for different kernel sizes between the proposed and existing structures.
As a result, the memory access rate decreased by 24% when performing the convolution operation compared to the original structure. Figure 7 compares operation speeds for different kernel
sizes between the proposed and existing structures. For each kernel size, the proposed
structure showed a reduction in processing time of 31.4%, 32.7%, and 34.4% compared to the
existing structure. This latency reduction is due to PIM instructions performing calculations
directly in memory, reducing the number of instructions executed by the processor. While,
theoretically, this could lead to a 66% performance improvement (as three instructions
are consolidated into one), the actual compilation revealed limitations due to branch
instructions and non-convertible codes, resulting in restrained performance gains. However,
these results are derived from experiments without full compiler optimization, suggesting
that more extensive use of PIM instructions could yield higher performance improvements.
5. Conclusion

In this paper, we proposed an efficient PIM system based on the RISC-V instruction set architecture to handle large-scale data processing tasks in domains with limited computing performance.
To evaluate this system, performance assessments are conducted on operations commonly
used in large-scale data processing. In convolution experiments, we compare the process-
ing speed of the proposed PIM system against a standard RISC-V processor. The results
demonstrate how the PIM system alleviates bottlenecks associated with data movement.
By performing computations directly within memory, the proposed system enhances the
processing speed and reduces the latency associated with data movement. This insight sug-
gests that leveraging PIM can significantly improve efficiency in scenarios where the data
movement is a critical bottleneck. The findings of this paper provide important insights
that open avenues for the development and application of PIM systems in future IoT and
embedded environments. Effectively harnessing PIM will require support from the appli-
cation software perspective. Subsequent research should focus on optimizing compilers
capable of compiling new PIM instructions and developing optimized applications that
utilize these instructions.
Author Contributions: Conceptualization, J.L.; Software, J.L. and J.S.; Project administration, H.Y.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The original contributions presented in the study are included in the
article, further inquiries can be directed to the corresponding author.
Acknowledgments: This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIT) (No. 2022R1A5A8026986). This work was supported
by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded
by the Korea government (MSIT) (2022-0-01170).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Liang, D.S. Smart and Fast Data Processing for Deep Learning in Internet of Things: Less is more. IEEE Internet Things J. 2019, 6,
5981–5989. [CrossRef]
2. Zhuoying, Z.; Ziling, T.; Pinghui, M.; Xiaonan, W.; Dan, Z.; Xin, Z.; Ming, T.; Jie, L. A Heterogeneous Parallel Non-von Neumann
Architecture System for Accurate and Efficient Machine Learning Molecular Dynamics. IEEE Trans. Circuits Syst. I Regul. Pap.
2023, 70, 2439–2449.
3. Azriel, L.; Mendelson, A.; Weiser, U. Peripheral memory: A technique for fighting memory bandwidth bottleneck. IEEE Comput.
Archit. Lett. 2015, 14, 54–57. [CrossRef]
4. Souvik, K.; Priyanka, G.; Jeffry, L.; Hemanth, C.; BVVSN, R. Memristors Enabled Computing Correlation Parameter In-Memory
System: A Potential Alternative to Von Neumann Architecture. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 2022, 30,
755–768.
5. Cristobal, N.; Roberto, C.; Ricardo, B.; Javier, A.; Raimundo, V. GPU Tensor Cores for Fast Arithmetic Reductions. IEEE Trans.
Parallel Distrib. Syst. 2021, 32, 72–84.
6. Lee, J.; Kim, J.; Kim, K.; Ku, Y.; Kim, D.; Jeong, C.; Yun, T.; Kim, H.; Cho, H.; Oh, S.; et al. High bandwidth memory (HBM)
with TSV technique. In Proceedings of the 2016 13th International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 29
December 2016.
7. Park, I.; Singhal, N.; Lee, M.; Cho, S.; Kim, C. Design and Performance Evaluation of Image Processing Algorithms on GPUs.
IEEE Trans. Parallel Distrib. Syst. 2011, 22, 91–104. [CrossRef]
8. Kim, D.; Yu, C.; Xie, S.; Chen, Y.; Kim, J.; Kim, B.; Kulkarni, J.; Kim, T. An Overview of Processing-in-Memory Circuits for Artificial
Intelligence and Machine Learning. IEEE J. Emerg. Sel. Top. Circuits Syst. 2022, 12, 338–353. [CrossRef]
9. Lee, S.; Kang, S.; Lee, J.; Kim, H.; Lee, E.; Seo, S.; Yoon, H.; Lee, S.; Lim, K.; Shin, H.; et al. Hardware Architecture and Software
Stack for PIM Based on Commercial DRAM Technology. In Proceedings of the 2021 ACM/IEEE 48th Annual International
Symposium on Computer Architecture, Valencia, Spain, 4 August 2021.
10. Lee, W.J.; Kim, C.H.; Paik, Y.; Kim, S.W. PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA. IEEE Access
2023, 11, 8622–8632. [CrossRef]
11. Heo, J.; Kim, J.; Han, W.; Kim, J.; Kim, J. SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator with Local Error
Prediction for Area/Energy-Efficient On-Device Learning. IEEE J. Solid-State Circuits 2024, 59, 2671–2683. [CrossRef]
12. Elshimy, M.; Iskandar, V.; Goehringer, D.; Mohamed, A. A Near-Memory Dynamically Programmable Many-Core Overlay. In
Proceedings of the 2023 IEEE 16th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, Singapore,
18–21 December 2023.
13. Dinelli, G.; Meoni, G.; Rapuano, E.; Fanucci, L. Advantages and Limitations of Fully on-Chip CNN FPGA-Based Hardware
Accelerator. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems, Seville, Spain, 12–14 October
2020.
14. Heo, J.; Kim, J.; Lim, S.; Han, W.; Kim, J. T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device
Training. IEEE J. Solid-State Circuits 2023, 58, 600–613. [CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
16. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.
IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [CrossRef] [PubMed]
17. Wang, S.; Wang, X.; Xu, Z.; Chen, B.; Feng, C.; Wang, Q.; Ye, T. Optimizing CNN Computation Using RISC-V Custom Instruction
Sets for Edge Platforms. IEEE Trans. Comput. 2024, 73, 1371–1384. [CrossRef]
18. Shin, D.; Yoo, H. The Heterogeneous Deep Neural Network Processor With a Non-von Neumann Architecture. Proc. IEEE 2020,
108, 1245–1260. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.