New Solutions on LLM Acceleration, Optimization, and Application
Yingbing Huang1, Lily Jiaxin Wan1, Hanchen Ye1, Manvi Jha1, Jinghua Wang1, Yuhong Li1, Xiaofan Zhang2, Deming Chen1
{yh21, wan25, hanchen8, manvij2, jinghua3, leeyh, dchen}@illinois.edu, xiaofanz@google.com
1 University of Illinois Urbana-Champaign, 2 Google
All the authors contributed equally.
The presented work is an expanded and more comprehensive study based on our invited DAC'24 paper with the same title and co-authors.

Abstract

…comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present significant challenges in both training and deployment, leading to substantial computational and storage costs as well as heightened energy consumption. In this paper, we provide a review of recent advancements and research directions aimed at addressing these challenges and enhancing the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies with a vision to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we delve into LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study to leverage LLMs for assisting circuit design, we examine LLM-aided design methodologies for an important task: High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize in HLS verification and debugging. For each aspect mentioned above, we begin with a detailed background study, followed by the presentation of several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further advancements. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.

1 Introduction

In recent years, Large Language Models (LLMs) have emerged as powerful tools across various domains, revolutionizing natural language processing, information retrieval, LLM-aided design, and others. The ability of LLMs to understand, generate, and manipulate human language has propelled them to the forefront of research and applications in various industries. These models, trained on vast amounts of data, are now being extended through efforts to train LLMs with groundbreaking multimodal capabilities, encompassing both visual and speech understanding [27, 28, 55, 56]. They have been successfully applied in various applications, including virtual assistants, content generation, question-answering systems, and recommendation systems. The versatility and effectiveness of LLMs have made them indispensable tools in various industries, driving innovation and accelerating progress in artificial intelligence.

In domains such as LLM-aided design, large language models have been utilized for a variety of tasks, including high-level synthesis, hardware description generation, and functional verification, significantly streamlining the design process and reducing time-to-market for hardware designs [38]. For instance, ChipNeMo [32] enhances LLaMA2 with domain-specific optimizations for more efficient hardware design. AutoChip [48] focuses on automating HDL generation using feedback from LLMs, thereby improving the iterative design process. ChatEDA [53] leverages LLMs to create an autonomous agent for EDA, while VeriGen [49] specializes in generating Verilog code. Additionally, DIVAS [41] provides an end-to-end framework for SoC security analysis, and another approach utilizes LLMs to fix hardware security bugs [1]. Moreover, large language models are also being utilized for automated code generation for information technology tasks in YAML, further showcasing their versatility and efficiency [42].

However, the widespread adoption of these models has been hindered by their demanding computational requirements, which often result in slow response times and high costs for hardware and energy. Addressing these challenges is crucial to fully harnessing the potential of LLMs and unlocking their benefits in real-world applications. To this end, this paper explores a comprehensive approach to optimizing LLMs at the algorithm, hardware, compiler, and design-automation levels, aiming to enhance their efficiency and performance across diverse applications.

Previous works explore various algorithmic strategies aimed at decreasing the inference latency of LLMs. We begin by examining various methods for optimizing parameter utilization in large language models.
These methods include techniques such as early exiting and layer skipping [13], which help reduce computational overhead, as well as contextual sparsity, which dynamically prunes irrelevant parameters during inference [33]. Additionally, previous works explore the Mixture-of-Experts (MoE) approach, which distributes computation across multiple sub-models to enhance efficiency and scalability [12]. We then delve into optimization techniques for the Key-Value (KV) cache, which is crucial for complex tasks like chain-of-thought reasoning and information retrieval. Additionally, we discuss advancements in parallel decoding [29], including techniques for aligning small and large model predictions, multi-token prediction, and parallel processing capabilities. Building upon this background study, we propose two novel approaches: a parallel decoding framework called Medusa [5], which employs multiple decoding heads coupled with an optimized tree-based decoding strategy, and SnapKV, a method for effectively reducing KV cache size [31]. Our experimental results demonstrate significant speedups in inference time without compromising generation quality, along with improved memory efficiency. Finally, we outline future directions for tailored algorithmic optimization, advancements in KV compression, and tackling the computational load from speculative decoding, aiming to boost LLM efficiency and effectiveness in practical applications.
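To make the parallel-decoding idea concrete, below is a minimal sketch of Medusa-style decoding heads; the simple linear heads are an illustrative simplification of the paper's design (which uses residual blocks and tree attention), not the authors' implementation. Each extra head predicts a token one position further into the future from the same last hidden state, so one forward pass proposes a multi-token candidate for verification.

```python
# Minimal sketch of Medusa-style decoding heads (simplified; illustrative only).
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        # One lightweight head per extra future position.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden) from the base model's final layer.
        # Head i predicts the token at offset i+2, while the base LM head
        # keeps predicting offset +1; top-k picks from each head form a
        # candidate tree that the base model verifies in a single pass.
        return torch.stack([head(last_hidden) for head in self.heads])

heads = MedusaHeads(hidden_size=4096, vocab_size=32000)
logits = heads(torch.randn(1, 4096))          # (num_heads, batch, vocab)
candidates = logits.topk(k=3, dim=-1).indices  # per-position token proposals
```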
LLM-hardware co-design aims to customize hardware architectures to meet the demands of LLMs while providing insights to optimize LLM architectures [15]. Previously, we proposed an LLM-hardware co-design framework called AutoDistill [64], which integrates model compression, transformer architecture exploration, and multi-objective optimization to produce student models with lower inference latency and smaller sizes while maintaining high accuracy. Moreover, a pruning-aware quantization strategy that combines two effective LLM compression methods, pruning and quantization, to optimize LLM architectures for hardware efficiency has been proposed [51]. Furthermore, we explore the potential of reconfigurable and heterogeneous hardware for LLMs, aiming to dynamically adjust hardware architectures to accommodate the latest LLM advancements and model compression methods, thereby enhancing both model quality and hardware efficiency.
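As a rough illustration of combining the two compression methods, and not the specific pruning-aware quantization algorithm of [51], the sketch below applies magnitude pruning and then symmetric per-tensor int8 quantization to one weight matrix:

```python
# Illustrative sketch: unstructured magnitude pruning followed by symmetric
# per-tensor int8 quantization of a single weight matrix.
import torch

def prune_then_quantize(w: torch.Tensor, sparsity: float = 0.5):
    # Zero the smallest-magnitude entries (unstructured pruning).
    k = max(1, int(sparsity * w.numel()))
    threshold = w.abs().flatten().kthvalue(k).values
    pruned = torch.where(w.abs() > threshold, w, torch.zeros_like(w))
    # Symmetric quantization: map the max magnitude to 127.
    scale = pruned.abs().max().clamp(min=1e-8) / 127.0
    q = (pruned / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale  # reconstruct with q.float() * scale

q, scale = prune_then_quantize(torch.randn(1024, 1024))
```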
The demand for efficient hardware accelerators for deep neural networks has led to a new direction of using High-Level Synthesis (HLS) frameworks [7, 9] to quickly translate model architectures into hardware implementations. However, exploring the vast design space effectively to achieve optimal solutions remains a significant challenge. We summarize two novel compilation frameworks we published previously: ScaleHLS [21, 57] and HIDA [60]. ScaleHLS leverages the MLIR infrastructure [26] for scalable High-Level Synthesis, optimizing hardware designs with a Design Space Exploration engine by performing multi-level transformations. As far as we know, ScaleHLS was the first flow that could take a PyTorch model and transform it into synthesizable C code that can then be translated into RTL code for hardware implementation. HIDA, built on top of the ScaleHLS framework, automates the conversion of algorithmic hardware descriptions into efficient dataflow architectures, also directly generating HLS accelerators from PyTorch models. Looking forward, we discuss future directions, including spatial architecture exploration, runtime reconfiguration for scalability, and heterogeneous computing solutions, to further enhance the efficiency and scalability of hardware accelerators for LLMs. Through these advancements, we aim to address the computational and memory challenges associated with LLM acceleration, ultimately improving the performance and energy efficiency of hardware implementations.

There has recently been growing interest in leveraging LLMs to enhance Electronic Design Automation (EDA) processes, offering significant potential for improving design productivity, code generation, and verification [53]. Existing research in this domain encompasses various applications, including assistant chatbots for design workflow enhancement, Verilog and script generation, and Hardware Description Language (HDL) verification and analysis. Despite these advancements, several challenges persist, notably the scarcity of high-quality, domain-specific datasets and the need for more specialized LLMs tailored to grasp the intricacies of electronic design languages and processes. As a case study to leverage LLMs for assisting circuit design, we focus on an important task: High-Level Synthesis (HLS) functional verification. We pursue this task through the construction of the Chrysalis dataset, an extensive collection of HLS designs embedded with diverse sets of realistic bugs, and the development of an HLS-specific debugging assistant. A debugging assistant built by fine-tuning an LLM on the Chrysalis dataset aims to significantly expedite the verification process, enhance productivity, and reduce time-to-market for hardware designs. Additionally, we outline future research directions, including LLM-aided formal verification and the integration of LLMs into multi-agent systems for hardware/software design automation, offering a transformative approach to streamlining the design, verification, and debugging processes in EDA.

In the rest of the paper, Section 2 delves into algorithm-level acceleration for LLMs, while Section 3 provides an overview of hardware co-design tailored for LLMs. Section 4 focuses on compilers for mapping LLMs to accelerators, and Section 5 explores LLM-aided design. Finally, Section 6 presents the conclusion of the study.

2 LLM Algorithm-Level Acceleration

2.1 Background Study

LLMs excel in tasks such as coding copilot, question answering, and summarization. However, their autoregressive nature, where each token depends on all previous ones, makes decoding memory-intensive and hard to parallelize. This results in significant latency due to the massive size of LLM parameters, impacting applications requiring fast responses like chatbots. Addressing the challenge of reducing inference latency in LLMs is becoming increasingly critical. This section primarily explores previous methods aimed at decreasing the inference latency of LLMs from an algorithmic standpoint, which could facilitate straightforward implementation and integration across various platforms.
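The sequential dependency can be seen in a minimal greedy decoding loop, sketched below assuming a Hugging Face-style causal LM interface (logits plus past_key_values); each iteration must wait for the previous token, and the KV cache grows by one position per step:

```python
# Sketch of greedy autoregressive decoding (interface assumed, not from the paper).
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, max_new_tokens: int = 32):
    tokens, next_input, past = input_ids, input_ids, None
    for _ in range(max_new_tokens):
        # Each step depends on the previous token: no parallelism across steps.
        out = model(input_ids=next_input, past_key_values=past, use_cache=True)
        past = out.past_key_values              # KV cache grows by one position
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        next_input = next_token                 # only the new token is fed back
    return tokens
```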
[Figure: Medusa decoding. A frozen original model (embedding, transformer layers, and LM head) is augmented with trainable Medusa heads; the LM head and each Medusa head emit top-k predictions for successive positions, which are assembled into candidate continuations and checked in a single prediction step.]
2.1.1 Efficient Parameter Utilization. An early study [45] shows that only a necessary subset of parameters is used per input token to reduce language model inference latency while preserving accuracy. The concepts of early exiting and layer skipping [13, 44] in decoder architectures allow for efficient generation of sequences by potentially exiting the decoding process early based on certain criteria, …
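A hedged sketch of the early-exit idea follows; the per-layer softmax-confidence test and the reuse of a single LM head at intermediate depths are simplifying assumptions, not the exact criteria of [13, 44]:

```python
# Hedged sketch of confidence-based early exiting in a decoder stack.
import torch

@torch.no_grad()
def early_exit_logits(layers, lm_head, hidden, threshold: float = 0.9):
    """layers: callables mapping hidden states to hidden states."""
    logits = None
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        logits = lm_head(hidden[:, -1])           # predict from this depth
        if torch.softmax(logits, dim=-1).max() >= threshold:
            return logits, depth                  # exit: skip remaining layers
    return logits, depth
```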
[…]

…improvement. Flash2_hmask is the result of combining FlashAttention2 [10] with our pruning-aware quantization approach [51].
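Because the surrounding text is fragmentary here, the sketch below is speculative: it assumes Flash2_hmask pairs a fused FlashAttention-2-style kernel with a per-head pruning mask produced by the quantization flow. PyTorch's scaled_dot_product_attention dispatches to a fused FlashAttention kernel when available.

```python
# Speculative sketch only: fused attention combined with a head-pruning mask.
import torch
import torch.nn.functional as F

def masked_head_attention(q, k, v, head_keep: torch.Tensor):
    # q, k, v: (batch, heads, seq, head_dim); head_keep: (heads,) in {0., 1.}
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out * head_keep.view(1, -1, 1, 1)  # zero the pruned heads' outputs
```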
…system and hardware architecture, and we anticipate further innovations in model size reduction, efficient sharding strategies, optimized data layouts, and other techniques to fully exploit the potential of target systems.

3.3.2 Reconfigurable and heterogeneous hardware for LLMs. Reconfigurable hardware, such as FPGAs, is a …

[Figure: ScaleHLS framework. Front ends lower designs into a graph-level IR (TOSA, Linalg, Tensor), a loop-level IR (Affine, Vector, MemRef), and a directive level (HLSCpp); optimization passes at each level are driven by an automated DSE engine, with translation and lowering paths to the CIRCT framework and an HLS C/C++ emitter paired with an HLS QoR estimator.]
[…]

4.2.2 HIDA. HIDA [60] is an HLS framework built upon ScaleHLS with hierarchical dataflow intermediate representations (IR) and optimizations, enabling the automated transformation of algorithmic hardware descriptions into efficient dataflow architectures. Fig. 7 shows HIDA's overall architecture. The core of HIDA is its novel dataflow IR, named …

Figure 7. HIDA framework architecture [60].

…LLM layers and buffering only a subset of intermediate results on-chip, spatial architectures can markedly decrease on-chip memory requirements and overall latency. This approach presents a compelling solution for LLM inference. Nonetheless, the automatic generation of spatial architectures remains challenging, opening vast avenues for innovation in compilation and architecture design.
4.3.2 Runtime Reconfiguration. To achieve spatial parallelization, tasks must be instantiated simultaneously on-chip. However, due to constrained computational and on-chip memory resources, it is infeasible to simultaneously map all layers of emerging LLMs on-chip, which significantly limits the scalability of spatial architectures. Consequently, runtime reconfiguration emerges as a crucial strategy for enabling scalable spatial solutions. The main challenge lies in automating the balance between spatial and sequential execution — that is, addressing the scheduling problem — to optimize the performance-energy trade-off.
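As a toy illustration of the scheduling problem, assuming hypothetical per-layer resource costs and a fixed on-chip budget, a greedy partition of layers into spatially mapped groups, executed sequentially with a reconfiguration between groups, might look like this:

```python
# Illustrative sketch: greedily group consecutive layers under a resource
# budget; each group runs spatially, and groups run sequentially via
# runtime reconfiguration. Costs and budget are hypothetical numbers.
def partition_layers(layer_costs, budget):
    groups, current, used = [], [], 0
    for layer, cost in enumerate(layer_costs):
        if used + cost > budget and current:
            groups.append(current)        # reconfigure before the next group
            current, used = [], 0
        current.append(layer)
        used += cost
    if current:
        groups.append(current)
    return groups

# 8 equal-cost layers, room for 3 at a time -> three groups, two reconfigurations
print(partition_layers([1] * 8, budget=3))  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```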
4.3.3 Heterogeneous Computation. Accelerating LLMs presents a unique challenge due to their dual nature, being both computation-bound and memory-bound. The prefill phase of LLMs is dominated by General Matrix Multiply (GEMM) operators, making it computation-intensive. In contrast, the generation phase is dominated by General Matrix-Vector (GEMV) operations, demanding substantial memory bandwidth to keep the computation units engaged (refer to Section 2.2). This dual nature of LLMs unveils significant opportunities for heterogeneous computing solutions, where compilers assume an important role in code generation for heterogeneous platforms and in facilitating efficient communication between them.
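The contrast can be made concrete with illustrative shapes (not measurements from the paper): prefill amortizes each weight read over many tokens, while a decode step streams the same weights to produce a single token's activations, which is why it is bandwidth-bound.

```python
# Sketch contrasting the two phases with illustrative shapes.
import torch

hidden, seq = 4096, 1024
w = torch.randn(hidden, hidden)

# Prefill: the whole prompt is processed at once as a GEMM; each weight
# element fetched from memory is reused across `seq` tokens (compute-bound).
prompt_states = torch.randn(seq, hidden)
prefill_out = prompt_states @ w

# Decode: each step is a GEMV; the entire weight matrix must be streamed
# from memory to produce one row of output (bandwidth-bound).
token_state = torch.randn(1, hidden)
decode_out = token_state @ w
```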
4.3.4 Advanced HIDA. Although the HIDA framework can conduct effective dataflow-aware design space exploration, optimizing streaming buffers in LLM accelerators remains a formidable challenge due to the self-attention mechanism and complex inter-layer connections in LLMs. Enhancements to the HIDA framework could address more complicated stream optimizations to reduce on-chip memory consumption. Additionally, recent works [8, 67] have demonstrated the ability to generate efficient kernel designs through customized scheduling. We propose to integrate these highly optimized kernels into the HIDA explorer to further improve the efficiency of LLM accelerators. We also propose to enhance the code generation of HIDA to support more hardware platforms with dataflow architectures, such as the AMD Versal ACAP [3].
5 LLM-Aided Design

5.1 Background Study

The existing related work regarding leveraging LLMs in the field of EDA can be divided into three categories [66]: (1) Assistant Chatbot for Enhanced Design Workflow: ChipNeMo [32] leverages domain adaptation techniques such as custom tokenizers, domain-adaptive continued pretraining, and Supervised Fine-Tuning (SFT) atop the foundation model LLaMA2 [37]. This integration facilitates instant access to knowledge, streamlines the query-response process, and diminishes reliance on conventional search methods and associated delays. (2) HDL and Script Generation: LLMs, such as those in AutoChip [48] and VeriGen [49], have shown their effectiveness in generating Verilog code and EDA tool scripts from natural language instructions. (3) HDL Verification and Analysis: RTLFixer [50] exemplifies this by introducing a framework aimed at correcting Verilog code errors utilizing tools like OpenAI GPT-3.5/GPT-4, supplemented by Retrieval-Augmented Generation (RAG) and ReAct prompting techniques. Additional efforts in this area focus on generating SystemVerilog Assertions (SVA) for security purposes [23, 35, 41], illustrating the wide-ranging potential of LLMs in bolstering HDL verification and analysis processes. ChIRAAG [34] is proposed to generate SVA assertions from natural language specifications based on GPT-4. For those assertions with syntax or simulation errors, LLMs can receive automatic feedback from log files and then regenerate the SVA for retest. Orenes-Vera [39] proposed an iterative methodology where LLMs, particularly GPT-4, are prompted with refined rules and RTL code to generate SVA, which is then evaluated for correctness and completeness through a series of testbench simulations and revisions.

Despite the promising developments in LLM-aided design within EDA, several challenges remain: (1) Data Quality and Availability: The efficacy of LLMs in EDA critically hinges on the availability of high-quality, domain-specific datasets for their training and refinement. Unfortunately, the proprietary nature of many electronic designs and the tools used for EDA significantly limits access to such datasets. The bulk of detailed hardware design code, primarily developed within corporate settings, is not made public. This restriction leads to a scarcity of accessible, high-grade datasets, thus hindering the development and optimization of LLMs specifically engineered for EDA applications. (2) Development of Specialized LLMs: There is a critical need for the development of LLMs that are specifically tailored to grasp the complexities of electronic design languages and processes. Generic models, while useful, often lack the nuanced understanding required to effectively generate, verify, and analyze hardware code and to interact with EDA tools at a level that matches human experts. This necessitates a concerted effort to create more specialized models that can comprehend and manipulate the intricate details of electronic designs with a high degree of accuracy and efficiency.

5.2 Proposed Works

One use case of LLM-aided design is to harness LLMs for enhancing the verification and debugging processes for HLS code development. HLS, with its higher level of abstraction, can significantly improve design productivity, as explained in Section 4.1.

5.2.1 Chrysalis Dataset. The cornerstone of our proposed work is the Chrysalis dataset, an extensive collection of HLS designs embedded with a diverse set of realistic bugs [51]. This dataset is meticulously curated from a wide range of open-source HLS benchmark suites, featuring over 1,500 designs, each available both in a version embedded with bugs and in the corresponding bug-free version. Fig. 8 outlines our methodology for constructing the Chrysalis dataset. We begin by gathering open-source benchmark suites and difficult bug types (bugs also include non-ideal HLS Pragma insertions), which compilers often struggle to identify. These suites are then converted into function-level designs. Using the Maximal Marginal Relevance (MMR) algorithm, we select the top-k similar designs from the RAG database for bug-injection prompts. The prompt generation chooses one strategy based on bug type statistics: one combining In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and Chain-of-Thought (CoT); the other using just RAG and ICL. After integration, the prompts are processed by an LLM (GPT-4 in our case) to generate bug (or non-ideal Pragma) injection solutions, which are then validated through both automatic procedures and manual checks by hardware engineers. Successful solutions are added to the RAG database to enhance its diversity and volume, improving future solutions. Our evaluations show 88% validity for all the bugs. This dataset serves not only as a tool for training our domain-specific LLM, but also as a benchmark for evaluating the model's proficiency in identifying and suggesting fixes for common and complex HLS bugs.
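A compact sketch of MMR-based selection is given below; the cosine similarity and the lambda = 0.7 relevance-diversity weight are illustrative assumptions, not the dataset-construction parameters used in the paper:

```python
# Sketch of Maximal Marginal Relevance (MMR) selection over design embeddings.
import numpy as np

def mmr_select(query_vec, design_vecs, k=5, lam=0.7):
    """Greedily pick k designs, balancing similarity to the query (relevance)
    against similarity to already-picked designs (redundancy)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(design_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, design_vecs[i])
            redundancy = max((cos(design_vecs[i], design_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the top-k designs for the bug-injection prompt
```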
Figure 8. Overview of the Chrysalis Dataset Construction and Iterative Upgrade Process: For each check iteration, it involves evaluating the dataset's validity and expanding the RAG dataset accordingly. Through these iterations, the quality of the dataset progressively improves.
5.2.2 HLS-specific Debugging Assistant. Building upon the Chrysalis dataset, our next step involves the creation of an HLS-specific debugging assistant, as Fig. 9 shows. Engineers typically design test vectors and create test benches manually, then perform C simulations and co-simulations to analyze and identify potential bugs, which is time-consuming. To improve the efficiency of the debugging process, we proposed a novel flow leveraging the capabilities of LLMs on top of the traditional HLS debugging flow. This LLM will be fine-tuned to understand the intricacies of HLS code, enabling it to identify errors, suggest corrections, and provide guidance directly within the developers' workflow. The assistant aims to integrate seamlessly with popular development environments, offering real-time support to engineers as they navigate the complexities of HLS designs. By providing context-aware suggestions and corrections, the debugging assistant will significantly expedite the verification process, enhancing productivity and reducing the time-to-market for hardware designs.

Figure 9. LLM-based HLS Debugging Flows Working Together with Traditional Flows: By incorporating our LLM debugging assistant, the number of bugs requiring verification by test cases can be significantly reduced.

The entire methodology could be adapted to RTL debugging as well, starting from the bug injection stage, using open-source LLMs and developing a domain-specific RTL debugger through fine-tuning. To effectively transition to this new application, we must tackle diverse bug types specific to RTL, such as those related to timing constraints, race conditions, and synthesis-simulation mismatches. Particularly, the inherent timing characteristics of RTL designs can lead to more complex bugs, often manifesting as issues in timing analysis that are not present in higher-level abstractions. Given the complexity of RTL code, one ambitious goal is to reduce manual debugging and verification effort by building an advanced and automated RTL verification and debugging flow. This would involve enriching our dataset to include RTL designs (which can be generated through HLS working with our Chrysalis dataset) together with test benches and test vectors. A flow similar to Fig. 8 can be developed to assess and improve the validity of such bug injections. Furthermore, seamless integration with EDA tools is crucial to enable real-time analysis and correction within the existing design frameworks.

5.3 Future Directions

5.3.1 LLM-Aided Debugging. Our research highlights challenges in using LLMs to inject certain HLS bug types, such as operator precedence errors and incorrect data access patterns for interface pragmas. These difficulties stem from sparse code patterns and the complexities of the existing codebase, necessitating further investigation and refined methodologies for effective bug injection. Additionally, as manual verification of bug injections remains necessary in our current flow, creating an automated flow to estimate performance could speed up the identification and resolution of non-ideal Pragma issues, thus enhancing the quality and quantity of the dataset. Furthermore, for the HLS-specific debugging assistant, we will employ Low-Rank Adaptation (LoRA) [18] for supervised fine-tuning on state-of-the-art open-source LLMs such as LLaMA3 [36], utilizing commercial HLS documentation for design guidelines and rules together with our Chrysalis dataset.
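A minimal sketch of such a LoRA fine-tuning setup with the Hugging Face PEFT library is shown below; the base checkpoint, target modules, and rank are illustrative choices rather than the reported configuration:

```python
# Hedged sketch of LoRA [18] supervised fine-tuning with Hugging Face PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)      # trains only the low-rank adapters
model.print_trainable_parameters()
# ...followed by supervised fine-tuning on (buggy design, bug report/fix)
# pairs drawn from the Chrysalis dataset.
```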
5.3.2 LLM-Aided Formal Verification. LLMs can enhance the formal verification process in hardware design by generating precise assertions for the proof of correctness. By integrating these assertions into the formal verification workflow, LLMs can substantially increase hardware design productivity. One promising direction is to explore an iterative process: after the initial proof attempts, the theorem prover's feedback is utilized to refine the LLM's output. This feedback loop enables the LLM to adjust its generated proofs iteratively until the assertions are fully verifiable. Through this dynamic interaction between LLMs and theorem provers, the generation of program proofs becomes both faster and more achievable. This methodology not only speeds up the verification process but also ensures a higher degree of reliability in hardware design verification.
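One way to realize this feedback loop, with `llm` and `prover_check` as hypothetical stand-ins for a model call and a theorem-prover invocation (not an API from this paper), is sketched below:

```python
# Hedged sketch of the iterative LLM/theorem-prover refinement loop.
def refine_until_verified(llm, prover_check, spec: str, max_rounds: int = 5):
    feedback = ""
    for _ in range(max_rounds):
        assertions = llm(
            f"Spec:\n{spec}\nProver feedback:\n{feedback}\n"
            "Generate SVA assertions proving correctness."
        )
        ok, feedback = prover_check(assertions)  # (verified?, counterexample/log)
        if ok:
            return assertions                    # all assertions verified
    raise RuntimeError("assertions not verifiable within the round budget")
```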
5.3.3 LLM-Aided Hardware/Software Design Automation. In the realm of EDA, employing LLM multi-agent systems promises a transformative approach to streamlining the design, verification, and debugging processes. These sophisticated systems autonomously manage various phases of the workflow, seamlessly transitioning from design to verification and debugging. By deploying multiple specialized LLM agents — each adept in distinct facets of the design process such as code generation, verification, error detection, and performance optimization — a highly efficient pipeline is crafted. This orchestrated integration allows the agents to collaboratively refine and optimize the design iteratively, leveraging real-time feedback and comprehensive verification results. Throughout the process, hardware engineers are only tasked with overseeing the initial specification and periodically reviewing the outputs from the LLMs to ensure that they align with the design intentions and confirm the reliability of the LLMs' outputs.
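A skeletal version of such an agent pipeline, with every agent function a hypothetical placeholder, could be organized as follows:

```python
# Sketch of a multi-agent loop: generation -> verification -> debugging,
# iterating on feedback; all agent callables are hypothetical placeholders.
def design_pipeline(spec, gen_agent, verify_agent, debug_agent, max_iters=3):
    design = gen_agent(spec)
    for _ in range(max_iters):
        report = verify_agent(design)          # verification results + errors
        if report["passed"]:
            return design                      # hand off for human review
        design = debug_agent(design, report)   # targeted fix from feedback
    return design
```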
6 Conclusion

In our study, we focused on optimizing LLMs to reduce inference latency and improve efficiency across various applications. We presented a new method, Medusa, that uses multiple decoding heads for prediction, coupled with an optimized tree-based decoding strategy for parallel token processing, to speed up the execution of LLMs. We also proposed a novel method, SnapKV, that effectively reduces KV cache size, addressing the computational and memory bottlenecks in scenarios involving long sequence inputs.

We discussed LLM/hardware co-design to integrate both hardware optimization for efficient execution and model architecture exploration for improved system efficiency while maintaining LLM accuracy. HLS frameworks like ScaleHLS and HIDA were explored for accelerating LLMs directly from PyTorch models, envisioning automated generation of spatial architectures and heterogeneous computing solutions.

We also explored the advancements in LLM-aided design for EDA and discussed a novel flow to create the Chrysalis dataset that can be used to train an LLM-based HLS-specific debugging assistant. A similar strategy can be adopted for building an RTL-specific debugging assistant as well. These methods are promising for streamlining the debugging and verification process of hardware code development.

For each aspect mentioned above, we also outlined promising future directions for further research and exploration.

Acknowledgments

This work is supported in part by the IBM-Illinois Discovery Accelerator Institute, AMD Center of Excellence at UIUC, AMD Heterogeneous Adaptive Compute Cluster (HACC) initiative, NSF 2117997 grant through the A3D3 institute, and Semiconductor Research Corporation (SRC) 2023-CT-3175 grant.

References

[1] Baleegh Ahmad et al. 2023. Fixing hardware security bugs with large language models. arXiv:2302.01215.
[2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245.
[3] AMD/Xilinx. [n. d.]. Versal Adaptive Compute Acceleration Platform. https://www.xilinx.com/products/silicon-devices/acap/versal.html
[4] Tianle Cai et al. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv:2401.10774.
[5] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In ICML. https://openreview.net/forum?id=PEpbUobfJv
[6] Charlie Chen et al. 2023. Accelerating large language model decoding with speculative sampling. arXiv:2302.01318.
[7] Deming Chen et al. 2005. xPilot: A Platform-Based Behavioral Synthesis System. In SRC Techcon.
[8] Hongzheng Chen et al. 2024. Allo: A Programming Model for Composable Accelerator Design. arXiv:2404.04815.
[9] Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Trans. Reconfigurable Technol. Syst. 15, 4, Article 51, 42 pages. https://doi.org/10.1145/3530775
[10] Tri Dao et al. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.
[11] Tim Dettmers et al. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv:2208.07339.
[12] Nan Du et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML.
[13] Maha Elbayad et al. 2019. Depth-adaptive transformer. arXiv:1910.10073.
[14] Yichao Fu et al. 2024. Break the sequential dependency of LLM inference using lookahead decoding. arXiv:2402.02057.
[15] Cong Hao et al. 2018. Deep neural network model and FPGA accelerator co-design: Opportunities and challenges. In ICSICT.
[16] Cong Hao et al. 2019. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In DAC.
[17] Seongmin Hong et al. 2022. DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation. In MICRO.
[18] Edward J. Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
[19] Albert Q. Jiang et al. 2024. Mixtral of experts. arXiv:2401.04088.
[20] Norm Jouppi et al. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In ISCA.
[21] Hyegang Jun et al. 2023. AutoScaleDSE: A scalable design space exploration engine for high-level synthesis. In TRETS.
[22] Jordan Juravsky et al. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099.
[23] Rahul Kande et al. 2023. LLM-assisted generation of hardware assertions. arXiv:2306.14027.
[24] Achintya Kundu et al. 2024. Efficiently Distilling LLMs for Edge Applications. arXiv:2404.01353.
[25] Woosuk Kwon et al. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP.
[26] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2–14.
[27] Zhihong Lei, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi, Youyuan Zhang, Mirko Hannemann, Man-Hung Siu, et al. 2024. Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 10096–10100.
[28] Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, et al. 2023. Acoustic Model Fusion for End-to-End Speech Recognition. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 1–7.
[29] Yaniv Leviathan et al. 2023. Fast inference from transformers via speculative decoding. In ICML.
[30] Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems.
[31] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. arXiv:2404.14469.
[32] Mingjie Liu et al. 2023. ChipNeMo: Domain-adapted LLMs for chip design. arXiv:2311.00176.
[33] Zichang Liu et al. 2023. Deja vu: Contextual sparsity for efficient LLMs at inference time. In ICML.
[34] Bhabesh Mali et al. 2024. ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation. arXiv:2402.00093.
[35] Xingyu Meng et al. 2023. Unlocking hardware security assurance: The potential of LLMs. arXiv:2308.11042.
[36] Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/
[37] Meta. 2024. Llama 2: Open source, free for research and commercial use. https://llama.meta.com/llama2/
[38] Md Rakib Hossain Misu et al. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. Proc. ACM Softw. Eng. 1, FSE. https://doi.org/10.1145/3643763
[39] Marcelo Orenes-Vera et al. 2023. Using LLMs to facilitate formal verification of RTL. arXiv e-prints, arXiv–2309.
[40] James M. Ortega and Werner C. Rheinboldt. 2000. Iterative solution of nonlinear equations in several variables. Classics in Applied Mathematics.
[41] Sudipta Paria et al. 2023. DIVAS: An LLM-based end-to-end framework for SoC security analysis and policy-based protection. arXiv:2308.06932.
[42] Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade, Matt Jones, Alessandro Morari, and Ruchir Puri. 2023. Invited: Automated Code Generation for Information Technology Tasks in YAML through Large Language Models. In 2023 60th ACM/IEEE Design Automation Conference (DAC). 1–4. https://doi.org/10.1109/DAC56929.2023.10247987
[43] Andrea Santilli et al. 2023. Accelerating transformer inference for translation via parallel decoding. arXiv:2305.10427.
[44] Tal Schuster et al. 2022. Confident adaptive language modeling. In Advances in Neural Information Processing Systems.
[45] Antoine Simoulin et al. 2021. How many layers and why? An analysis of the model depth in transformers. In IJCNLP Student Research Workshop.
[46] Mitchell Stern et al. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems.
[47] Gemini Team et al. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL].
[48] Shailja Thakur et al. 2023. AutoChip: Automating HDL generation using LLM feedback. arXiv:2311.04887.
[49] Shailja Thakur et al. 2023. VeriGen: A large language model for Verilog code generation. In TRETS.
[50] YunDa Tsai et al. 2023. RTLFixer: Automatically fixing RTL syntax errors with large language models. arXiv:2311.16543.
[51] Lily Jiaxin Wan et al. 2024. Software/Hardware Co-design for LLM and Its Application for Design Verification. In ASP-DAC.
[52] Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
[53] Haoyuan Wu et al. 2024. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1–1. https://doi.org/10.1109/TCAD.2024.3383347
[54] Guangxuan Xiao et al. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML.
[55] Mingbin Xu, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han, Yaqiao Deng, Zhen Huang, and Mahesh Krishnamoorthy. 2023. Conformer-Based Speech Recognition on Extreme Edge-Computing Devices. arXiv:2312.10359.
[56] Mingbin Xu, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang, Arturo Argueta, Shiyi Han, Yaqiao Deng, et al. 2023. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[57] Hanchen Ye et al. 2022. ScaleHLS: A new scalable high-level synthesis framework on multi-level intermediate representation. In HPCA.
[58] Hanchen Ye et al. 2022. ScaleHLS: A scalable high-level synthesis framework with multi-level transformations and optimizations. In DAC.
[59] Hanchen Ye et al. 2023. High-level Synthesis for Domain Specific Computing. In ISPD.
[60] Hanchen Ye et al. 2024. HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. In ASPLOS.
[61] Haoran You et al. 2023. ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design. In HPCA.
[62] Shulin Zeng et al. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA. arXiv:2401.03868.
[63] Xiaofan Zhang et al. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In ICCAD.
[64] Xiaofan Zhang et al. 2022. AutoDistill: An end-to-end framework to explore and distill hardware-efficient language models. arXiv:2201.08539.
[65] Zhenyu Zhang et al. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv:2306.14048.
[66] Ruizhe Zhong et al. 2023. LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation. arXiv:2401.12224.
[67] Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, et al. 2023. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 153–164.
[68] Barret Zoph et al. 2022. Designing effective sparse expert models. arXiv:2202.08906.