the Systolic Array-based Accelerator DataSet (SA-DS). SA-DS comprises a diverse collection of spatial arrays following the standardized Berkeley Gemmini accelerator generator template, enabling design reuse, adaptation, and customization. SA-DS is intended to spark LLM-centered research on DNN hardware accelerator architecture. We envision that SA-DS provides a framework which will shape the course of DNN hardware acceleration research for generations to come. SA-DS is open-sourced under the permissive MIT license at https://github.com/ACADLab/SA-DS.
Index Terms—Systolic array design, LLM-powered hardware synthesis, accelerator architecture optimization, EDA
1 INTRODUCTION
Artificial Intelligence (AI) has shown a remarkable potential to address complex design problems ranging from software development to drug discovery. A key advantage of AI is a significant reduction of the manual effort and expertise requirements. This promising capability suggests its application in hardware design, particularly for developing the specialized AI accelerators needed to keep pace with the rapid evolution of Deep Neural Networks (DNNs) [1]. In hardware design for DNNs, the complexity and the need for expert knowledge have been major limitations [2], [3].

Systolic array accelerators, typically obtained with specialized AI hardware generators such as Gemmini [3], have significantly advanced the processing capabilities for DNNs, providing high throughput and energy efficiency. These systems, integrated with architectures like the Rocket Chip processor [4], demonstrate the scalability and flexibility necessary for contemporary AI applications. Despite these appealing achievements, challenges such as the low-level nature, the complex programming interfaces, memory usage, and the need for extensive development times persist [5]. Moreover, systolic array accelerator generators like Gemmini [3] generally face limitations in efficiently handling diverse and irregular computational patterns beyond their optimized standard operations [6]. These limitations underscore the need for innovative solutions such as AI model-based approaches [5], [6].

At the frontier of AI advancement, Large Language Models (LLMs) [7] offer an appealing solution for alleviating the challenges in hardware accelerator design. Along this line, GPT4AIGChip [8] exemplifies using LLMs to automate the hardware design process, from conceptual design to synthesis to fabrication. However, the lack of specialized datasets of hardware accelerator design artifacts presents a strong obstacle to fully leveraging the potential of LLMs [9]. This limitation restricts usage to vanilla LLMs without fine-tuning or in-context learning [9], two of the most effective approaches for maximizing LLM capabilities.

To bridge this gap, we introduce a Systolic Array Accelerator Dataset (SA-DS) to facilitate effective learning and generation of optimized designs by LLMs. Specifically, our contributions include:
(1) We create, curate, and release SA-DS, the first systolic array accelerator dataset for DNN hardware accelerator design. Each data point in SA-DS features a verbal description of an accelerator micro-architecture and a Chisel description of the design itself. These accelerator designs are obtained using the Gemmini generator [3].
(2) We demonstrate the potential of SA-DS in enabling LLM-based hardware accelerator design by showcasing its suitability for generating viable accelerator designs via one-shot prompts with multiple LLMs. Experimental results validate the suitability of SA-DS in providing high-quality and relevant accelerator design examples to contemporary LLMs, including GPT [7], Claude [10], and Google's Gemini [11], as compared to existing Hardware Description Language (HDL) datasets.

∗ These authors contributed equally.
This work is supported in part by the National Science Foundation under Grant No. 2228028. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
• M. Nazzal, D. Vungarala, M. Morsali, A. Ghosh, A. Khreishah, and S. Angizi are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA. E-mail: {mn69, dv336, mm2772, arnob.ghosh, abadallah, shaahin.angizi}@njit.edu.
• C. Zhang is with the School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA. E-mail: chaozhang@gatech.edu.

2 BACKGROUND
Static Accelerator Generation Tools: A wide array of static DNN accelerator design tools has been developed, such as VTA [12], MAGNet [13], and DNNWeaver [14], to suit various applications. These tools provide many hardware
architecture templates supporting vector/systolic spatial
TABLE 1
Comparison of the state-of-the-art LLM-based HDL/HLS generators.

Function/Property      | Ours                | ChatEDA [9]  | VeriGen [15] | GPT4AIGChip [8]     | ChipGPT [16] | Chip-Chat [17] | AutoChip [18]
Function               | AI Accelerator Gen. | RTL-to-GDSII | Verilog Gen. | AI Accelerator Gen. | Verilog Gen. | Verilog Gen.   | Verilog Gen.
Chatbot∗               | ✓                   |              |              |                     |              |                |
Dataset                | ✓                   | NA†          | ✓ (Verilog)  | NA                  | NA           | NA             | NA
Output format          | Chisel              | GDSII        | Verilog      | HLS                 | Verilog      | Verilog        | Verilog
Automated verification | ✓                   | †            |              |                     |              |                |
Multi-shot examples    | ✓                   |              |              |                     |              |                |
Human in loop          | Low                 | NA           | Medium       | Medium              | Medium       | High           | Low

∗ A user interface featuring prompt optimization for the input of the LLM. † Not applicable.
[Figure: Gemmini-generated accelerator overview — a spatial array of tiles of processing elements (PEs) performing multiply-accumulate, fed by a transposer, im2col unit, and controller with dependency management, a DMA engine with a local TLB, and an accumulator (ACC) with partial-sum and weight-preload paths supporting output-stationary (OS) and weight-stationary (WS) operation.]
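The figure names OS (output-stationary) and WS (weight-stationary) dataflows. As a generic illustration of the distinction, independent of Gemmini's actual PE implementation, the sketch below contrasts the two at the level of a single PE: in WS mode the weight stays latched while partial sums stream through; in OS mode each PE accumulates its output in place.

```scala
// Generic illustration of the two dataflows named in the figure above,
// not Gemmini's actual PE implementation.
object DataflowSketch extends App {
  // One PE step, WS: the weight is stationary; the activation and the
  // partial sum move through the PE, which forwards psumIn + w * act.
  def peStepWS(weight: Int, actIn: Int, psumIn: Int): Int =
    psumIn + weight * actIn

  // One PE step, OS: the accumulator is stationary; weights and
  // activations stream past and the output builds up in place.
  final class PeOS {
    private var acc = 0
    def step(weight: Int, actIn: Int): Unit = acc += weight * actIn
    def result: Int = acc
  }

  val pe = new PeOS
  Seq((1, 2), (3, 4)).foreach { case (w, a) => pe.step(w, a) }
  println(pe.result)                                // 1*2 + 3*4 = 14
  println(peStepWS(weight = 5, actIn = 2, psumIn = 14)) // 14 + 5*2 = 24
}
```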
synthesis tools to comprehensively assess and validate the PPA metrics of the hardware design.

4 SA-DS SAMPLE CREATION
SA-DS uses the Gemmini generator to provide a variety of spatial array designs, making it easier for users to adapt and reuse these designs for different projects. The dataset and design tools are developed with Chisel, a programming language embedded in Scala [22], known for its clear and efficient coding style [23]. Gemmini's configurable nature allows for significant customization, suiting various application-specific requirements, thus supporting the advancement in AI chip design [24]. This combination of a versatile template like Gemmini and a powerful design language like Chisel ensures that SA-DS can effectively meet the diverse needs of hardware design in AI applications.

Fig. 3. The design space parameters of the proposed SA-DS based on Gemmini [3]. [The space spans function units (training convolutions, non-linear activations, normalization, max pooling), spatial array size (tile rows/columns, meshRows, meshColumns), input type (Float, SInt), and systolic array dataflow (OS, WS, Both). Two example configurations are shown:
DS_Acc #1: accType = SInt(16.W), spatialArrayOutputType = SInt(20.W), meshRows = 2, meshColumns = 2, dataflow = Dataflow.OS, has_training_convs = true, has_max_pool = false, has_nonlinear_activations = false, has_dw_convs = true, has_normalizations = false, has_first_layer_optimizations = false.
DS_Acc #500: accType = SInt(32.W), spatialArrayOutputType = SInt(20.W), meshRows = 16, meshColumns = 16, dataflow = Dataflow.BOTH, has_training_convs = false, has_max_pool = false, has_nonlinear_activations = false, has_dw_convs = true, has_normalizations = false, has_first_layer_optimizations = false.]
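To make the shape of a data point concrete, here is a minimal, self-contained Scala sketch of a configuration record mirroring the Fig. 3 parameters, with values taken from the DS_Acc #1 and #500 examples. DSAccConfig and its verbalDescription method are our simplified stand-ins, not Gemmini's actual configuration class, which expresses these fields through Chisel types such as SInt(16.W).

```scala
// A simplified stand-in for one SA-DS design point; field names mirror
// the Fig. 3 parameters, with bit widths in place of Chisel SInt types.
sealed trait Dataflow
case object OS extends Dataflow
case object WS extends Dataflow
case object Both extends Dataflow

final case class DSAccConfig(
  accTypeBits: Int,            // accumulator precision, e.g. SInt(16.W) -> 16
  spatialArrayOutputBits: Int, // spatial-array output precision
  meshRows: Int,               // PE rows
  meshColumns: Int,            // PE columns
  dataflow: Dataflow,          // OS, WS, or Both (runtime-selectable)
  hasTrainingConvs: Boolean,
  hasMaxPool: Boolean,
  hasNonlinearActivations: Boolean,
  hasNormalizations: Boolean
) {
  // Each SA-DS sample pairs the Chisel config with a verbal description.
  def verbalDescription: String =
    s"A ${meshRows}x${meshColumns} systolic array with $dataflow dataflow, " +
    s"SInt($accTypeBits.W) accumulation and " +
    s"SInt($spatialArrayOutputBits.W) spatial-array output."
}

object SADSExamples extends App {
  // Values taken from the DS_Acc #1 and #500 examples in Fig. 3.
  val dsAcc1 = DSAccConfig(16, 20, 2, 2, OS,
    hasTrainingConvs = true, hasMaxPool = false,
    hasNonlinearActivations = false, hasNormalizations = false)
  val dsAcc500 = DSAccConfig(32, 20, 16, 16, Both,
    hasTrainingConvs = false, hasMaxPool = false,
    hasNonlinearActivations = false, hasNormalizations = false)
  println(dsAcc1.verbalDescription)
  println(dsAcc500.verbalDescription)
}
```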
Algorithm 1 describes how SA-DS is created within the Chipyard framework [25], which ensures that the designs are verifiable. The process focuses on generating spatial array structures and function units from the Gemmini codebase. Specifically, the algorithm iterates through various configurations of these elements, as indicated in line 6, guided by insights from analyzing the Gemmini code. Each modification made during this process is checked for accuracy using Verilator, ensuring that each version of the design (M) is properly annotated to highlight its key features. The variables and their values used in this algorithm are carefully chosen based on extensive testing with the Gemmini template, leading to a diverse set of potential Gemmini designs, as shown in Fig. 3.

SA-DS, generated as detailed in Algorithm 1, offers a variety of configurations influenced by Function Unit (FU) availability and spatial array sizes, yielding a structured dataset that is easily navigable and applicable to diverse hardware design needs. As illustrated in Fig. 3, the dataset organizes these configurations into six main categories, each containing 1536 unique samples. These samples are enriched with dataflow variations, namely Output Stationary (OS), Weight Stationary (WS), and their combination, accounting for 512 samples per dataflow type. The distribution of these parameters and their corresponding function units is systematically represented in Fig. 4, facilitating an understanding of the dataset's comprehensive nature and the interplay between different function units in each configuration.

Significance of the Parameters: M represents the configurations derived from Gemmini, maintaining full-stack compatibility. These configurations are crucial for defining the hardware accelerator's micro-architectural elements based on Gemmini's template, including scratchpad and accumulator sizes. Key parameters include:
• Spatial Array Size: Defines the number of PEs, crucial for computational capacity.
• DataFlow: Manages data movement among PEs, with options like OS, WS, or an automatic selection at runtime.
• Function Units: Additional units that support DNN functionalities like ReLU and normalization.
• Accumulation & Spatial Array Output Type: Affects computation precision, primarily supporting signed integer types, with potential expansion to floating-point and complex integer types.
These elements facilitate customization to meet specific application requirements.
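As a compact executable rendering of the generate-and-verify loop in Algorithm 1 (shown next), the following self-contained Scala sketch enumerates parameter combinations and keeps only the verified variants. The parameter grid is a small illustrative subset of the real design space, and verifyWithVerilator is stubbed here; the actual flow checks each variant with Verilator inside Chipyard.

```scala
// A compact sketch of Algorithm 1 in plain Scala. The grid and the
// substitution step are runnable as-is; verifyWithVerilator is a stub
// standing in for the Chipyard/Verilator build-and-check step.
object SADSCreation extends App {
  // P: changeable variable parameters (small subset for illustration).
  val grid: Map[String, Seq[String]] = Map(
    "meshRows"    -> Seq("2", "4", "8", "16"),
    "meshColumns" -> Seq("2", "4", "8", "16"),
    "dataflow"    -> Seq("Dataflow.OS", "Dataflow.WS", "Dataflow.BOTH")
  )

  // Cartesian product of all parameter assignments (line 6 of Algorithm 1).
  def combinations(p: List[(String, Seq[String])]): Seq[Map[String, String]] =
    p match {
      case Nil => Seq(Map.empty)
      case (k, vs) :: rest =>
        for (v <- vs; tail <- combinations(rest)) yield tail + (k -> v)
    }

  // Lines 7-10: rewrite each parameter's value in the source template.
  def applyCombination(source: String, combo: Map[String, String]): String =
    combo.foldLeft(source) { case (src, (param, value)) =>
      src.replaceAll(s"$param\\s*=\\s*[^,\\n]+", s"$param = $value")
    }

  // Lines 11-14: keep only designs that pass verification. Stubbed here;
  // the real check elaborates the Chisel design and runs Verilator.
  def verifyWithVerilator(modified: String): Boolean = true

  val template = "meshRows = 2,\nmeshColumns = 2,\ndataflow = Dataflow.OS,"
  val verified: Seq[String] =
    combinations(grid.toList)
      .map(applyCombination(template, _))
      .filter(verifyWithVerilator)

  println(s"${verified.size} verified variants generated")
}
```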
Algorithm 1 SA-DS Creation with the Gemmini Generator
 1: Input: Source code S
 2: Output: List of verified modified source codes M
 3: P ← list of changeable variable parameters
 4: M ← empty list
 5: function GENERATEVARIATIONS(S, P)
 6:   for each combination in P do
 7:     Smod ← S
 8:     for each (parameter, value) in combination do
 9:       Replace parameter in Smod with value
10:     end for
11:     verified ← VERIFYWITHVERILATOR(Smod)
12:     if verified then
13:       M.append(Smod)
14:     end if
15:   end for
16:   return M
17: end function
18: function VERIFYWITHVERILATOR(Smod)
19:   return Verilator verification result for Smod
20: end function

Fig. 4. The frequency of samples in SA-DS in terms of (a) function units in each category based on the dataflow for systolic arrays, and (b) function units available individually or in combination with the others. [Two bar charts: (a) frequency vs. dataflow (OS, WS, BOTH); (b) frequency vs. function units (FU through FU+5); the legend distinguishes configurations with one to six function units.]
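The VERIFYWITHVERILATOR step (lines 18-20 of Algorithm 1) is a black box in the pseudocode. The sketch below shows one minimal way such a check could be wired up, assuming the modified configuration has already been elaborated to Verilog; the actual SA-DS flow performs verification through the Chipyard build, so treat this lint-only invocation, and the choice of input file, as illustrative stand-ins.

```scala
import scala.sys.process._
import java.io.File

// A minimal stand-in for the VERIFYWITHVERILATOR step of Algorithm 1.
// `verilator --lint-only` checks that a Verilog design compiles cleanly;
// the full SA-DS flow instead builds and checks each configuration
// inside Chipyard.
object VerifyWithVerilator {
  def apply(verilogFile: File): Boolean =
    // Exit code 0 means Verilator accepted the design.
    Seq("verilator", "--lint-only", verilogFile.getPath).! == 0
}
```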
5 EXPERIMENTAL ANALYSIS
In this section, we evaluate the effectiveness of SA-DS in supporting the design generation process for hardware accelerators via LLMs, referencing the framework depicted in Fig. 2. Due to space limitations, and given that LLM fine-tuning and prompt optimization require further research, our analysis is conducted conceptually. We initiate a proof-of-concept experiment that benchmarks SA-DS against a recent HLS dataset (HLSD) from [26]. This experiment utilizes each dataset to supply one-shot examples for LLM prompts, aiming to enhance the generation of hardware designs from verbal descriptions. To objectively assess the impact of each dataset, we analyze the code quality derived from representative prompt-code pairs selected from SA-DS. The experiment extends across four prominent LLMs: GPT-4 [7], GPT-3.5 [27], Claude [10], and Gemini Advanced [11]. Reflecting the diversity of hardware specifications covered by SA-DS, our methodology includes randomly selecting test sets from the six categories within SA-DS, ensuring each category is represented by 30 samples.
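The paper does not spell out its prompt template, so the following is only a plausible shape for how a sampled SA-DS pair could seed a one-shot prompt; the object, field names, and wording are ours, for illustration only.

```scala
// A hypothetical one-shot prompt builder: one SA-DS sample (verbal
// description + Chisel configuration) is shown to the LLM before the
// new request. The template text is our assumption, not the paper's.
object OneShotPrompt {
  final case class Sample(description: String, chiselConfig: String)

  def build(example: Sample, request: String): String =
    s"""You generate Gemmini-style Chisel accelerator configurations.
       |
       |Example description:
       |${example.description}
       |
       |Example configuration:
       |${example.chiselConfig}
       |
       |Now generate a configuration for:
       |$request
       |""".stripMargin
}
```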
Evaluation is conducted by manual code review by an HLS and Chisel expert. Due to the manual nature of verification, a bi-state verification scheme is adopted. A pass characterizes the generation of complete and functional code complying with the verbal description, the case where the LLM generates the most crucial portions of the code and leaves redundant lines, or the case where the code is extended with unintended functionalities beyond what is requested in the verbal description. Conversely, a fail refers to generating incomplete code, incorrect file headers, or incurring fatal errors of various types that render the code non-functional. We additionally use Verilator as an automated design verification tool, exclusively for codes marked as pass. The results of this experiment are summarized in Table 2.
TABLE 2
Suitability of one-shot examples: SA-DS vs. HLSD

LLM             | SA-DS Pass | SA-DS Fail | HLSD Pass | HLSD Fail
GPT-4           | 135        | 45         | 72        | 108
Gemini Advanced | 144        | 36         | 57        | 123
GPT-3.5         | 155        | 25         | 68        | 112
Claude          | 150        | 30         | 71        | 109
The comparison between the SA-DS and HLSD datasets in generating one-shot prompts for LLMs such as GPT-4, Gemini Advanced, GPT-3.5, and Claude, as shown in Table 2, reveals a clear pattern. SA-DS consistently shows fewer failures and more passes across all tested LLMs, with around 46% more passes on average. This suggests that SA-DS's examples better align with the LLMs' capabilities, leading to more effective code generation. The higher pass rates with SA-DS imply that, while not perfect, the generated code often needs fewer revisions to meet design requirements, indicating its practical value in streamlining the accelerator design process.
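As a quick arithmetic cross-check of Table 2 (our own, not from the paper): pass and fail sum to 180 prompts in every row (6 categories × 30 samples), and averaging the per-LLM pass-rate gaps gives roughly 44 percentage points in SA-DS's favor, in line with the reported "around 46% more passes on average".

```scala
// Sanity-check arithmetic on the Table 2 counts.
object Table2Stats extends App {
  //              LLM              SA-DS pass, HLSD pass (out of 180 each)
  val rows = Seq(
    ("GPT-4",           135, 72),
    ("Gemini Advanced", 144, 57),
    ("GPT-3.5",         155, 68),
    ("Claude",          150, 71)
  )
  val total = 180.0
  for ((llm, sads, hlsd) <- rows)
    println(f"$llm%-16s SA-DS ${sads / total * 100}%.1f%%  HLSD ${hlsd / total * 100}%.1f%%")

  // Average per-LLM pass-rate gap, in percentage points (~43.9).
  val avgGap = rows.map { case (_, s, h) => (s - h) / total }.sum / rows.size * 100
  println(f"Average pass-rate gap: $avgGap%.1f percentage points")
}
```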
6 CONCLUSION
This study has introduced the first publicly accessible LLM prompt-Chisel code dataset, dubbed SA-DS. The prompt and code examples in SA-DS cover a wide variety of applications and design criteria. A proof-of-concept experiment has showcased the benefits of SA-DS in enabling the high-quality generation of hardware accelerator designs from mere verbal descriptions by novice users. This exemplifies the promising potential of further research in the area of utilizing LLMs for automated hardware design generation. Key directions along this line include fine-tuning high-end LLMs for hardware design, optimized multi-shot learning, and prompt engineering serving the objectives of design efficiency in terms of execution time, hardware cost, and power consumption.

REFERENCES
[1] Y.-H. Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE JSSC, vol. 52, no. 1, pp. 127–138, 2016.
[2] W.-Q. Ren et al., "A survey on collaborative DNN inference for edge intelligence," Machine Intelligence Research, vol. 20, no. 3, pp. 370–395, 2023.
[3] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in DAC. IEEE, 2021, pp. 769–774.
[4] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz et al., "The Rocket Chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, vol. 4, pp. 6–2, 2016.
[5] P. Xu and Y. Liang, "Automatic code generation for Rocket Chip RoCC accelerators," 2020.
[6] R. Xu, S. Ma, Y. Wang, and Y. Guo, "HESA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators," in DATE. IEEE, 2021, pp. 657–662.
[7] (2023) OpenAI ChatGPT. [Online]. Available: https://openai.com/research/gpt-4
[8] Y. Fu et al., "GPT4AIGChip: Towards next-generation AI accelerator design automation via large language models," in ICCAD, 2023, pp. 1–9.
[9] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," in MLCAD. IEEE, 2023, pp. 1–6.
[10] (2023) Anthropic. [Online]. Available: https://www.anthropic.com
[11] (2024) Gemini. [Online]. Available: https://deepmind.google
[12] T. Moreau et al., "VTA: An open hardware-software stack for deep learning," arXiv preprint arXiv:1807.04188, vol. 10, 2018.
[13] R. Venkatesan et al., "MAGNet: A modular accelerator generator for neural networks," in ICCAD. IEEE, 2019, pp. 1–8.
[14] H. Sharma et al., "From high-level deep neural models to FPGAs," in MICRO. IEEE, 2016, pp. 1–12.
[15] S. Thakur et al., "VeriGen: A large language model for Verilog code generation," ACM TODAES, 2023.
[16] K. Chang et al., "ChipGPT: How far are we from natural language hardware design," arXiv preprint arXiv:2305.14019, 2023.
[17] J. Blocklove et al., "Chip-Chat: Challenges and opportunities in conversational hardware design," arXiv preprint arXiv:2305.13243, 2023.
[18] S. Thakur et al., "AutoChip: Automating HDL generation using LLM feedback," arXiv preprint arXiv:2311.04887, 2023.
[19] N. Friedman, "Introducing GitHub Copilot: Your AI pair programmer," URL: https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer, 2021.
[20] H. Pearce, B. Tan, and R. Karri, "DAVE: Deriving automatically Verilog from English," in Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD, 2020, pp. 27–32.
[21] Y. Lu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," arXiv preprint arXiv:2308.05345, 2023.
[22] (2024) Scala. [Online]. Available: https://www.scala-lang.org
[23] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," in DAC, 2012, pp. 1216–1225.
[24] M. Chen, W. Shao, P. Xu, M. Lin, K. Zhang, F. Chao, R. Ji, Y. Qiao, and P. Luo, "DiffRate: Differentiable compression rate for efficient vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17164–17174.
[25] A. Amid et al., "Chipyard: Integrated design, simulation, and implementation framework for custom SoCs," IEEE Micro, vol. 40, no. 4, pp. 10–21, 2020.
[26] Z. Wei et al., "HLSDataset: Open-source dataset for ML-assisted FPGA design using high level synthesis," in ASAP. IEEE, 2023, pp. 197–204.
[27] (2023) ChatGPT-3.5. [Online]. Available: https://openai.com/blog/chatgpt