
arXiv:2404.10875v1 [cs.AR] 16 Apr 2024

A Dataset for Large Language Model-Driven AI Accelerator Generation

Mahmoud Nazzal∗, Member, IEEE, Deepak Vungarala∗, Student Member, IEEE, Mehrdad Morsali, Student Member, IEEE, Chao Zhang, Member, IEEE, Arnob Ghosh, Member, IEEE, Abdallah Khreishah, Senior Member, IEEE, and Shaahin Angizi, Senior Member, IEEE

Abstract—In the ever-evolving landscape of Deep Neural Network (DNN) hardware acceleration, unlocking the true potential of systolic array accelerators has long been hindered by the daunting challenges of expertise and time investment. Large Language Models (LLMs) offer a promising solution for automating code generation, which is key to unlocking unprecedented efficiency and performance in various domains, including hardware description code. However, the successful application of LLMs to hardware accelerator design is contingent upon the availability of specialized datasets tailored for this purpose. To bridge this gap, we introduce the Systolic Array-based Accelerator DataSet (SA-DS). SA-DS comprises a diverse collection of spatial arrays following Berkeley's standardized Gemmini accelerator generator template, enabling design reuse, adaptation, and customization. SA-DS is intended to spark LLM-centred research on DNN hardware accelerator architecture. We envision that SA-DS provides a framework which will shape the course of DNN hardware acceleration research for generations to come. SA-DS is open-sourced under the permissive MIT license at https://github.com/ACADLab/SA-DS.

Index Terms—Systolic array design, LLM-powered hardware synthesis, accelerator architecture optimization, EDA
1 INTRODUCTION

Artificial Intelligence (AI) has shown remarkable potential to address complex design problems ranging from software development to drug discovery. A key advantage of AI is a significant reduction in the manual effort and expertise required. This promising capability suggests applying AI to hardware design, particularly for developing the specialized AI accelerators needed to keep pace with the rapid evolution of Deep Neural Networks (DNNs) [1]. In hardware design for DNNs, the complexity and the need for expert knowledge have been major limitations [2], [3].

Systolic array accelerators, typically obtained with specialized AI hardware generators such as Gemmini [3], have significantly advanced the processing capabilities for DNNs, providing high throughput and energy efficiency. These systems, integrated with architectures like the Rocket Chip processor [4], demonstrate the scalability and flexibility necessary for contemporary AI applications. Despite these appealing achievements, challenges such as the low-level nature of the tools, complex programming interfaces, memory usage, and the need for extensive development times persist [5]. Moreover, systolic array accelerator generators like Gemmini [3] generally face limitations in efficiently handling diverse and irregular computational patterns beyond their optimized standard operations [6]. These limitations underscore the need for innovative solutions such as AI model-based approaches [5], [6].

At the frontier of AI advancement, Large Language Models (LLMs) [7] offer an appealing solution for alleviating the challenges in hardware accelerator design. Along this line, GPT4AIGChip [8] exemplifies using LLMs to automate the hardware design process, from conceptual design to synthesis to fabrication. However, the lack of specialized datasets of hardware accelerator design artifacts presents a strong obstacle to fully leveraging the potential of LLMs [9]. This limitation restricts usage to vanilla LLMs without fine-tuning or in-context learning [9], two of the most effective approaches for maximizing LLM capabilities.

To bridge this gap, we introduce the Systolic Array-based Accelerator Dataset (SA-DS) to facilitate effective learning and generation of optimized designs by LLMs. Specifically, our contributions include:

(1) We create, curate, and release SA-DS, the first systolic array accelerator dataset for DNN hardware accelerator design. Each data point in SA-DS features a verbal description of an accelerator micro-architecture and a Chisel description of the design itself. These accelerator designs are obtained using the Gemmini generator [3].

(2) We demonstrate the potential of SA-DS in enabling LLM-based hardware accelerator design by showcasing its suitability for generating viable accelerator designs through 1-shot prompts with multiple LLMs. Experimental results validate the suitability of SA-DS in providing high-quality and relevant accelerator design examples to contemporary LLMs including GPT [7], Claude [10], and Google's Gemini [11], as compared to existing Hardware Description Language (HDL) datasets.

∗ These authors contributed equally.
This work is supported in part by the National Science Foundation under Grant No. 2228028. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
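As a concrete picture of contribution (1), one SA-DS-style data point pairs a verbal micro-architecture description with the Chisel configuration that realizes it. The record layout and field names below are our own illustration, not the dataset's actual schema; the configuration values echo an example from the paper's design-space figure:

```python
# Illustrative sketch of a single SA-DS-style data point: a verbal
# micro-architecture description paired with the Chisel configuration
# that realizes it. Field names here are hypothetical.
sample = {
    "description": (
        "A 2x2 output-stationary systolic array with 16-bit signed "
        "integer accumulation and no max pooling."
    ),
    "chisel_config": (
        "accType = SInt(16.W),\n"
        "meshRows = 2, meshColumns = 2,\n"
        "dataflow = Dataflow.OS,\n"
        "has_max_pool = false,"
    ),
}

# Such a pair can then serve as the 1-shot example in an LLM prompt:
prompt = (
    "Example description:\n" + sample["description"] + "\n\n"
    "Example Chisel configuration:\n" + sample["chisel_config"] + "\n\n"
    "Now generate a configuration for the following description:\n"
)
```

The prompt string above is one plausible 1-shot template; the exact prompt wording used in the experiments is not part of this sketch.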
• M. Nazzal, D. Vungarala, M. Morsali, A. Ghosh, A. Khreishah, and S. Angizi are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA. E-mail: {mn69, dv336, mm2772, arnob.ghosh, abadallah, shaahin.angizi}@njit.edu.
• C. Zhang is with the School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA. E-mail: chaozhang@gatech.edu.

2 BACKGROUND

Static Accelerator Generation Tools: A wide array of static DNN accelerator design tools have been developed, such as VTA [12], MAGNet [13], and DNNWeaver [14], to suit various applications. These tools provide many hardware architecture templates supporting vector/systolic spatial arrays, data flows, software ecosystems, and OS support.
TABLE 1
Comparison of the state-of-the-art LLM-based HDL/HLS generators.

| Function/Property      | Ours                | ChatEDA [9]  | VeriGen [15] | GPT4AIGChip [8]     | ChipGPT [16] | Chip-Chat [17] | AutoChip [18] |
|------------------------|---------------------|--------------|--------------|---------------------|--------------|----------------|---------------|
| Function               | AI Accelerator Gen. | RTL-to-GDSII | Verilog Gen. | AI Accelerator Gen. | Verilog Gen. | Verilog Gen.   | Verilog Gen.  |
| Chatbot∗               | ✓                   |              |              |                     |              |                |               |
| Dataset                | ✓                   | NA†          | ✓ (Verilog)  | NA                  | NA           | NA             |               |
| Output format          | Chisel              | GDSII        | Verilog      | HLS                 | Verilog      | Verilog        | Verilog       |
| Automated Verification | ✓                   |              |              |                     |              |                |               |
| Multi-shot examples    | ✓                   |              |              |                     |              |                |               |
| Human in Loop          | Low                 | NA           | Medium       | Medium              | Medium       | High           | Low           |

∗ A user interface featuring Prompt Optimization for the input of the LLM. † Not applicable.

Fig. 1. Gemmini architectural template for ASIC accelerator design [3]. (The figure shows the DMA engine, scratchpad banks with local TLB, transposer, im2col unit, controller with dependency management, the spatial array of tiles of Processing Elements with weight preload and partial sums, the accumulator SRAM, and the matrix scalar multiplier with activation, pooling, and bit-shift units.)

Among these tools, Gemmini [3] is a comprehensive and well-packaged open-source infrastructure tailored to designing full-stack DNN accelerators. Gemmini offers a versatile hardware framework, a multi-layered software stack, and an integrated System-on-Chip (SoC) environment based on the architectural template shown in Fig. 1. Central to a design provided by Gemmini is a spatial array architecture, where the template employs a 2-D array of tiles containing Processing Elements (PEs). These PE units operate in parallel, handling Multiply-Accumulate (MAC) operations efficiently. To optimize the area, power, and performance trade-offs, the size of the spatial array, the input type, function units such as non-linear activations, pooling, and normalization, and the dataflow can all be adjusted.

LLM for Hardware Design: LLMs are increasingly pivotal in generating HDL and High-Level Synthesis (HLS) code. Table 1 provides a high-level comparison of the state-of-the-art methods along this line. GitHub Copilot [19] pioneered automatic code generation, setting the stage for domain-specific applications like DAVE [20]. Expanding on these capabilities, VeriGen [15] and ChatEDA [9] advance the field by refining hardware design workflows and automating the Register-Transfer Level (RTL) to Graphic Data System version II (GDSII) process with fine-tuned LLMs. ChipGPT [16] and AutoChip [18] further this evolution by integrating LLMs to generate and optimize hardware designs, with AutoChip producing accurate Verilog code through simulation feedback. Chip-Chat [17] demonstrates the efficacy of interactive LLMs like ChatGPT-4 in fast-tracking design space exploration. RTLLM [21] and GPT4AIGChip [8] specifically target design process efficiency, highlighting LLMs' capacity to manage complex design tasks and expand access to AI accelerator design, showcasing the broad potential of LLMs in hardware design. However, except for GPT4AIGChip [8], these works do not address using LLMs for AI hardware accelerator architecture design.

3 SA-DS AND AN ENVISIONED FRAMEWORK

We propose a dataset for LLM-enhanced AI hardware accelerator design and envision a general framework for its applicability. Due to space limitations, we restrict our presentation to describing how it can be utilized. This framework is outlined in Fig. 2. Since an LLM's response is determined by both the prompt and the model coefficients, the framework focuses on these two aspects. An immediate usage of SA-DS in the envisioned framework is to help fine-tune a generic LLM for the task of hardware accelerator design (Step 2 in Fig. 2). This can be achieved by allowing its samples to partially or fully alter the model coefficients. Besides, multi-shot prompting techniques such as in-context learning can be used, where SA-DS functions as the source of multi-shot examples. Also, an initial prompt conveying the user's intent and the key software and hardware specifications of the intended design can be further engineered/optimized through the Prompt Optimizer step (Step 1). However, administering this optimization requires the development of timely and accurate evaluation techniques and metrics for the generated designs. Since SA-DS combines verbal-description and systolic-array-design pairs, a systolic array accelerator is taken as the outcome from the LLM (Step 3). Next, a third-party quality evaluation tool can be utilized to provide a quantitative evaluation of the design, verify functional correctness, and integrate the design with the full stack (Step 4).

Fig. 2. An envisioned framework for utilizing SA-DS.

In the proposed usage framework, once the LLM generates systolic array accelerators (Step 3), the process moves to quality and functional evaluation (Step 4). The design, often formulated in Chisel, undergoes conversion to RTL using tools like Verilator to assess functional correctness and full-stack integration, thus translating into a verifiable HDL. This step crucially embeds an automated RTL-GDSII validation process, where generated designs are assessed and flagged as Valid or Invalid based on their code-sequence completeness and input-output correctness. Valid designs advance to resource validation, focusing on optimizing for Power, Performance, and Area (PPA) metrics. Conversely, designs flagged as Invalid trigger a feedback loop for error analysis and LLM retraining, facilitating iterative refinement (Steps 2 to 5) to meet the set performance criteria. Ultimately, the process culminates in Step 6, where a script generates Tool Command Language (TCL) instructions to automate RTL evaluation for HDL codes, integrating with synthesis tools to comprehensively assess and validate the PPA metrics of the hardware design.
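The six-step flow described above can be sketched as a simple generate-verify-refine loop. This is a conceptual sketch only: every function below is a hypothetical stub standing in for a real component (an LLM endpoint, the Chisel-to-RTL conversion, a Verilator check), not an interface from the paper's released code or from Gemmini's tooling:

```python
# Hedged sketch of the envisioned SA-DS usage flow (Fig. 2).
# All function names and return values are illustrative stubs.

def optimize_prompt(user_intent):
    # Step 1: Prompt Optimizer -- refine the user's verbal spec.
    return "Generate a Gemmini-style systolic array: " + user_intent

def llm_generate(prompt):
    # Step 3: stand-in for a generic or SA-DS fine-tuned (Step 2) LLM
    # returning a Chisel-style configuration.
    return "meshRows = 2, meshColumns = 2, dataflow = Dataflow.OS"

def verify_design(chisel_src):
    # Step 4: stand-in for Chisel-to-RTL conversion (e.g., via
    # Verilator) plus completeness and input-output checks.
    return "meshRows" in chisel_src and "dataflow" in chisel_src

def design_loop(user_intent, max_rounds=3):
    prompt = optimize_prompt(user_intent)          # Step 1
    for _ in range(max_rounds):
        design = llm_generate(prompt)              # Step 3
        if verify_design(design):                  # Step 4: Valid
            # A Valid design would proceed to PPA validation and the
            # TCL-scripted synthesis hand-off (Step 6).
            return design
        # Step 5: Invalid -- feed the failure back for refinement.
        prompt += "\nThe previous attempt failed verification; revise it."
    return None
```

In a real deployment, llm_generate would call the model under test and verify_design would invoke Verilator on the elaborated RTL; the loop structure is the point of the sketch, not the stubs.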
4 SA-DS SAMPLE CREATION

SA-DS uses the Gemmini generator to provide a variety of spatial array designs, making it easier for users to adapt and reuse these designs for different projects. The dataset and design tools are developed with Chisel, a programming language embedded in Scala [22], known for its clear and efficient coding style [23]. Gemmini's configurable nature allows for significant customization, suiting various application-specific requirements, thus supporting the advancement of AI chip design [24]. This combination of a versatile template like Gemmini and a powerful design language like Chisel ensures that SA-DS can effectively meet the diverse needs of hardware design in AI applications.

Fig. 3. The design space parameters of the proposed SA-DS based on Gemmini [3]. (The figure spans function units such as training convolutions, non-linear activation, normalization, max pooling, and first-layer optimization; Float/SInt input types; tile and mesh rows/columns; and the OS/WS/Both dataflows, illustrated with two example configurations, DS_Acc #1 (accType = SInt(16.W), meshRows = meshColumns = 2, dataflow = Dataflow.OS) and DS_Acc #500 (accType = SInt(32.W), meshRows = meshColumns = 16, dataflow = Dataflow.BOTH).)

Algorithm 1 describes how SA-DS is created within the Chipyard framework [25], which ensures that the designs are verifiable. The process focuses on generating spatial array structures and function units from the Gemmini codebase. Specifically, the algorithm iterates through various configurations of these elements, as indicated in line 6, guided by insights from analyzing the Gemmini code. Each modification made during this process is checked for accuracy using Verilator, ensuring that each version of the design (M) is properly annotated to highlight its key features. The variables and their values used in this algorithm are carefully chosen based on extensive testing with the Gemmini template, leading to a diverse set of potential Gemmini designs, as shown in Fig. 3.

SA-DS, generated as detailed in Algorithm 1, offers a variety of configurations influenced by Function Unit (FU) availability and spatial array sizes, leading to a structured dataset easily navigable and applicable to diverse hardware design needs. As illustrated in Fig. 3, the dataset organizes these configurations into six main categories, each containing 1536 unique samples. These samples are enriched with dataflow variations like Output Stationary (OS), Weight Stationary (WS), and their combination, accounting for 512 samples per dataflow type. The distribution of these parameters and their corresponding function units is systematically represented in Fig. 4, facilitating an understanding of the dataset's comprehensive nature and the interplay between different function units in each configuration.

Algorithm 1 SA-DS Creation with the Gemmini Generator
1: Input: Source Code S
2: Output: List of Verified Modified Source Codes M
3: P ← list of changeable variable parameters
4: M ← empty list
5: function GenerateVariations(S, P)
6:   for each combination in P do
7:     Smod ← S
8:     for each (parameter, value) in combination do
9:       Replace parameter in Smod with value
10:      end for
11:      verified ← VerifyWithVerilator(Smod)
12:      if verified then
13:        M.append(Smod)
14:      end if
15:   end for
16:   return M
17: end function
18: function VerifyWithVerilator(Smod)
19:   return Verilator verification result for Smod
20: end function

Significance of the Parameters: M represents the configurations derived from Gemmini, maintaining full-stack compatibility. These configurations are crucial for defining the hardware accelerator's micro-architectural elements based on Gemmini's template, including scratchpad and accumulator sizes. Key parameters include:

• Spatial Array Size: Defines the number of PEs, crucial for computational capacity.
• DataFlow: Manages data movement among PEs, with options like OS, WS, or an automatic selection at runtime.
• Function Units: Additional units that support DNN functionalities like ReLU and normalization.
• Accumulation & Spatial Array Output Type: Affects computation precision, primarily supporting signed integer types, with potential expansion to floating-point and complex integer types.

These elements facilitate customization to meet specific application requirements.

Fig. 4. The frequency of samples in SA-DS in terms of (a) function units in each category based on the dataflow for systolic arrays, and (b) function units available individually or in combination with others.

5 EXPERIMENTAL ANALYSIS

In this section, we evaluate the effectiveness of SA-DS in supporting the design generation process for hardware accelerators via LLMs, referencing the framework depicted in Fig. 2.
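Algorithm 1 amounts to a grid sweep over Gemmini's configurable parameters with a verification gate. A compact sketch of the same loop, over an illustrative subset of the Fig. 3 design space and with the Verilator check stubbed out, might look like this:

```python
from itertools import product

# A compact sketch of Algorithm 1's sweep. The parameter subset and
# values below are illustrative (drawn from Fig. 3's design space);
# the released dataset covers more fields and far larger grids.
param_space = {
    "meshRows": [2, 16],
    "dataflow": ["OS", "WS", "BOTH"],
    "has_max_pool": [False, True],
}

def verify_with_verilator(config):
    # Stub for VerifyWithVerilator (lines 18-20): in the real flow the
    # modified Chisel source is elaborated and checked with Verilator.
    return True

def generate_variations(param_space):
    # GenerateVariations (lines 5-17): iterate every combination of
    # parameter values, apply it, and keep only verified designs.
    verified = []
    keys = list(param_space)
    for values in product(*param_space.values()):
        config = dict(zip(keys, values))
        if verify_with_verilator(config):
            verified.append(config)
    return verified

designs = generate_variations(param_space)
print(len(designs))  # 2 x 3 x 2 = 12 combinations in this toy grid
```

With the full parameter lists, the same sweep yields the 512 samples per dataflow type reported above; here the toy grid keeps the structure visible.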
Due to space limitations, and given that LLM fine-tuning and prompt optimization require further research, our analysis is conducted conceptually. We initiate a proof-of-concept experiment that benchmarks SA-DS against a recent HLS dataset (HLSD) from [26]. This experiment considers utilizing each dataset to supply one-shot examples for LLM prompts, aiming to enhance the generation of hardware designs from verbal descriptions. To objectively assess the impact of each dataset, we analyze the code quality derived from representative prompt-code pairs selected from SA-DS. The experiment extends across four prominent LLMs: GPT-4 [7], GPT-3.5 [27], Claude [10], and Gemini Advanced [11]. Reflecting the diversity of hardware specifications covered by SA-DS, our methodology includes randomly selecting test sets from six categories within SA-DS, ensuring each category is represented by 30 samples.

Evaluation is conducted by manual code review by an HLS and Chisel expert. Due to the manual nature of verification, a bi-state verification scheme is adopted. A pass characterizes the generation of complete and functional code complying with the verbal description, the case where the LLM generates the most crucial portions of the code and leaves redundant lines, or the case where the code is extended with functionalities beyond what is requested in the verbal description. Conversely, a fail refers to generating incomplete code, incorrect file headers, or incurring fatal errors of different types that render the code non-functional. Therefore, we use Verilator as an automated design verification tool exclusively for codes marked as Pass. The results of this experiment are summarized in Table 2.

TABLE 2
Suitability of 1-shot examples: SA-DS vs. HLSD

| LLM             | SA-DS Pass | SA-DS Fail | HLSD Pass | HLSD Fail |
|-----------------|------------|------------|-----------|-----------|
| GPT-4           | 135        | 45         | 72        | 108       |
| Gemini Advanced | 144        | 36         | 57        | 123       |
| GPT-3.5         | 155        | 25         | 68        | 112       |
| Claude          | 150        | 30         | 71        | 109       |

The comparison between the SA-DS and HLSD datasets in generating one-shot prompts for LLMs like GPT-4, Gemini Advanced, GPT-3.5, and Claude, as shown in Table 2, reveals a clear pattern. SA-DS consistently shows fewer failures and more passes across all tested LLMs, with around 46% more passes on average. This suggests that SA-DS's examples better align with the LLMs' capabilities, leading to more effective code generation. The higher pass rates with SA-DS imply that, while not perfect, the generated code often needs fewer revisions to meet design requirements, indicating its practical value in streamlining the accelerator design process.

6 CONCLUSION

This study has introduced the first publicly accessible LLM prompt-Chisel code dataset, dubbed SA-DS. The prompt and code examples in SA-DS cover a wide variety of applications and design criteria. A proof-of-concept experiment has showcased the benefits of SA-DS in enabling the high-quality generation of hardware accelerator designs from mere verbal descriptions by novice users. This exemplifies the promising potential for further research in the area of utilizing LLMs for automated hardware design generation. Key examples along this line include fine-tuning high-end LLMs for hardware design, optimized multi-shot learning, and prompt engineering serving the objectives of design efficiency in terms of execution time, hardware cost, and power consumption.

REFERENCES

[1] Y.-H. Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE JSSC, vol. 52, no. 1, pp. 127–138, 2016.
[2] W.-Q. Ren et al., "A survey on collaborative DNN inference for edge intelligence," Machine Intelligence Research, vol. 20, no. 3, pp. 370–395, 2023.
[3] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in DAC. IEEE, 2021, pp. 769–774.
[4] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz et al., "The Rocket Chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, vol. 4, pp. 6–2, 2016.
[5] P. Xu and Y. Liang, "Automatic code generation for Rocket Chip RoCC accelerators," 2020.
[6] R. Xu, S. Ma, Y. Wang, and Y. Guo, "HESA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators," in DATE. IEEE, 2021, pp. 657–662.
[7] (2023) OpenAI ChatGPT. [Online]. Available: https://openai.com/research/gpt-4
[8] Y. Fu et al., "GPT4AIGChip: Towards next-generation AI accelerator design automation via large language models," in ICCAD, 2023, pp. 1–9.
[9] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," in MLCAD. IEEE, 2023, pp. 1–6.
[10] (2023) Anthropic. [Online]. Available: https://www.anthropic.com
[11] (2024) Gemini. [Online]. Available: https://deepmind.google
[12] T. Moreau et al., "VTA: An open hardware-software stack for deep learning," arXiv preprint arXiv:1807.04188, vol. 10, 2018.
[13] R. Venkatesan et al., "MAGNet: A modular accelerator generator for neural networks," in ICCAD. IEEE, 2019, pp. 1–8.
[14] H. Sharma et al., "From high-level deep neural models to FPGAs," in MICRO. IEEE, 2016, pp. 1–12.
[15] S. Thakur et al., "VeriGen: A large language model for Verilog code generation," ACM TODAES, 2023.
[16] K. Chang et al., "ChipGPT: How far are we from natural language hardware design," arXiv preprint arXiv:2305.14019, 2023.
[17] J. Blocklove et al., "Chip-Chat: Challenges and opportunities in conversational hardware design," arXiv preprint arXiv:2305.13243, 2023.
[18] S. Thakur et al., "AutoChip: Automating HDL generation using LLM feedback," arXiv preprint arXiv:2311.04887, 2023.
[19] N. Friedman, "Introducing GitHub Copilot: Your AI pair programmer," URL: https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer, 2021.
[20] H. Pearce, B. Tan, and R. Karri, "DAVE: Deriving automatically Verilog from English," in Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD, 2020, pp. 27–32.
[21] Y. Lu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," arXiv preprint arXiv:2308.05345, 2023.
[22] (2024) Scala. [Online]. Available: https://www.scala-lang.org
[23] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," in DAC, 2012, pp. 1216–1225.
[24] M. Chen, W. Shao, P. Xu, M. Lin, K. Zhang, F. Chao, R. Ji, Y. Qiao, and P. Luo, "DiffRate: Differentiable compression rate for efficient vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17164–17174.
[25] A. Amid et al., "Chipyard: Integrated design, simulation, and implementation framework for custom SoCs," IEEE Micro, vol. 40, no. 4, pp. 10–21, 2020.
[26] Z. Wei et al., "HLSDataset: Open-source dataset for ML-assisted FPGA design using high level synthesis," in ASAP. IEEE, 2023, pp. 197–204.
[27] (2023) ChatGPT-3.5. [Online]. Available: https://openai.com/blog/chatgpt
