New Solutions on LLM Acceleration, Optimization, and Application
Yingbing Huang1 , Lily Jiaxin Wan1 , Hanchen Ye1 , Manvi Jha1 , Jinghua Wang1 , Yuhong Li1 ,
Xiaofan Zhang2 , Deming Chen1
{yh21, wan25, hanchen8, manvij2, jinghua3, leeyh, dchen}@illinois.edu, xiaofanz@google.com
1 University of Illinois Urbana-Champaign, 2 Google

All the authors contributed equally. The presented work is an expanded and more comprehensive study based on our invited DAC'24 paper with the same title and co-authors.

Abstract

Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present significant challenges in both training and deployment, leading to substantial computational and storage costs as well as heightened energy consumption. In this paper, we provide a review of recent advancements and research directions aimed at addressing these challenges and enhancing the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies with a vision to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we delve into LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study to leverage LLMs for assisting circuit design, we examine LLM-aided design methodologies for an important task: High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize in HLS verification and debugging. For each aspect mentioned above, we begin with a detailed background study, followed by the presentation of several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further advancements. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.

1 Introduction

In recent years, Large Language Models (LLMs) have emerged as powerful tools across various domains, revolutionizing natural language processing, information retrieval, LLM-aided design, and others. The ability of LLMs to understand, generate, and manipulate human language has propelled them to the forefront of research and applications in various industries. These models, trained on vast amounts of text data, demonstrate unparalleled proficiency in tasks such as text generation, translation, summarization, sentiment analysis, and more [47] [36]. Additionally, there are ongoing efforts to train LLMs with groundbreaking multimodal capabilities, encompassing both visual and speech understanding [27, 28, 55, 56]. They have been successfully applied in various applications, including virtual assistants, content generation, question-answering systems, and recommendation systems. The versatility and effectiveness of LLMs have made them indispensable tools in various industries, driving innovation and accelerating progress in artificial intelligence.

In domains such as LLM-aided design, large language models have been utilized for a variety of tasks, including high-level synthesis, hardware description generation, and functional verification, significantly streamlining the design process and reducing time-to-market for hardware designs [38]. For instance, ChipNeMo [32] enhances LLaMA2 with domain-specific optimizations for more efficient hardware design. AutoChip [48] focuses on automating HDL generation using feedback from LLMs, thereby improving the iterative design process. ChatEDA [53] leverages LLMs to create an autonomous agent for EDA, while VeriGen [49] specializes in generating Verilog code. Additionally, DIVAS [41] provides an end-to-end framework for SoC security analysis, and another approach utilizes LLMs to fix hardware security bugs [1]. Moreover, large language models are also being utilized for automated code generation for information technology tasks in YAML, further showcasing their versatility and efficiency [42].

However, the widespread adoption of these models has been hindered by their demanding computational requirements, which often result in slow response times and high costs for hardware and energy. Addressing these challenges is crucial to fully harnessing the potential of LLMs and unlocking their benefits in real-world applications. To address these challenges, this paper explores a comprehensive approach to optimize LLMs at the algorithm, hardware, compiler, and design-automation levels, aiming to enhance their efficiency and performance across diverse applications.

Previous works explore various algorithmic strategies aimed at decreasing the inference latency of LLMs. We begin by examining various methods for optimizing parameter utilization in large language models. These methods include techniques such as early exiting and layer skipping [13], which help reduce computational overhead, as well as contextual sparsity, which dynamically prunes irrelevant parameters during inference [33]. Additionally, previous works explore a Mixture of Experts (MoE) approach, which distributes computation across multiple sub-models to enhance efficiency and scalability [12]. We then delve into optimization techniques for the Key-Value (KV) cache, which is crucial for complex tasks like chain-of-thought reasoning and information retrieval. Additionally, we discuss advancements in parallel decoding [29], including techniques for aligning small and large model predictions, multi-token predictions, and parallel processing capabilities. Building upon this background study, we propose two novel approaches: a parallel decoding framework called Medusa [5], which employs multiple decoding heads coupled with an optimized tree-based decoding strategy, and SnapKV, a method for effectively reducing KV cache size [31]. Our experimental results demonstrate significant speedups in inference time without compromising generation quality, along with improved memory efficiency. Finally, we outline future directions for tailored algorithmic optimization, advancements in KV compression, and tackling the computational load from speculative decoding, aiming to boost LLM efficiency and effectiveness in practical applications.

LLM-hardware co-design aims to customize hardware architectures to meet the demands of LLMs while providing insights to optimize LLM architectures [15]. Previously, we proposed an LLM-hardware co-design framework called AutoDistill [64], which integrates model compression, transformer architecture exploration, and multi-objective optimization to produce student models with lower inference latency and smaller sizes while maintaining high accuracy. Moreover, a pruning-aware quantization strategy that combines two effective LLM compression methods, pruning and quantization, to optimize LLM architectures for hardware efficiency has been proposed [51]. Furthermore, we explore the potential of reconfigurable and heterogeneous hardware for LLMs, aiming to dynamically adjust hardware architectures to accommodate the latest LLM advancements and model compression methods, thereby enhancing both model quality and hardware efficiency.

The demand for efficient hardware accelerators for deep neural networks has led to a new direction of using High-Level Synthesis (HLS) frameworks [7] [9] to quickly translate model architectures into hardware implementations. However, exploring the vast design space effectively to achieve optimal solutions remains a significant challenge. We summarize two novel compilation frameworks published previously by us: ScaleHLS [21, 57] and HIDA [60]. ScaleHLS leverages the MLIR infrastructure [26] for scalable High-Level Synthesis, optimizing hardware designs with a Design Space Exploration engine by performing multi-level transformations. As far as we know, ScaleHLS was the first flow that could take a PyTorch model and transform it into synthesizable C code that can then be translated into RTL code for hardware implementation. HIDA, built on top of the ScaleHLS framework, automates the conversion of algorithmic hardware descriptions into efficient dataflow architectures, also directly generating HLS accelerators from PyTorch models. Looking forward, we discuss future directions, including spatial architecture exploration, runtime reconfiguration for scalability, and heterogeneous computing solutions, to further enhance the efficiency and scalability of hardware accelerators for LLMs. Through these advancements, we aim to address the computational and memory challenges associated with LLM acceleration, ultimately improving the performance and energy efficiency of hardware implementations.

There has recently been a growing interest in leveraging LLMs to enhance Electronic Design Automation (EDA) processes, offering significant potential for improving design productivity, code generation, and verification [53]. Existing research in this domain encompasses various applications, including assistant chatbots for design workflow enhancement, Verilog and script generation, and Hardware Description Language (HDL) verification and analysis. Despite these advancements, several challenges persist, notably the scarcity of high-quality, domain-specific datasets and the need for more specialized LLMs tailored to grasp the intricacies of electronic design languages and processes. As a case study to leverage LLMs for assisting circuit design, we focus on an important task: High-Level Synthesis (HLS) functional verification. We pursue this task through the construction of the Chrysalis dataset, an extensive collection of HLS designs embedded with diverse sets of realistic bugs, and the development of an HLS-specific debugging assistant. A debugging assistant can be built by training an LLM fine-tuned on the Chrysalis dataset, which aims to significantly expedite the verification process, enhance productivity, and reduce time-to-market for hardware designs. Additionally, we outline future research directions, including LLM-aided formal verification and the integration of LLMs into multi-agent systems for hardware/software design automation, offering a transformative approach to streamlining the design, verification, and debugging processes in EDA.

In the rest of the paper, Section 2 delves into algorithm-level acceleration for LLMs, while Section 3 provides an overview of hardware co-design tailored for LLMs. Section 4 focuses on the compiler for mapping LLMs to accelerators, and Section 5 explores LLM-aided designs. Finally, Section 6 presents the conclusion of the study.
2 LLM Algorithm-Level Acceleration

2.1 Background Study
LLMs excel in tasks such as coding copilots, question answering, and summarization. However, their autoregressive nature, where each token depends on all previous ones, makes decoding memory-intensive and hard to parallelize. This results in significant latency due to the massive size of LLM parameters, impacting applications requiring fast responses like chatbots. Addressing the challenge of reducing inference latency in LLMs is becoming increasingly critical. This section primarily explores previous methods aimed at decreasing the inference latency of LLMs from an algorithmic standpoint, which could facilitate straightforward implementation and integration across various platforms.

2.1.1 Efficient Parameter Utilization. An early study [45] shows that only a necessary subset of parameters is used per input token, which can reduce language model inference latency while preserving accuracy. The concepts of early exiting and layer skipping [13, 44] in decoder architectures allow for efficient generation of sequences by potentially exiting the decoding process early based on certain criteria, thus saving computational resources while maintaining output quality. From another perspective, contextual sparsity, as investigated by Liu et al. [33], leverages the insight that a significant portion of model weights can be dynamically omitted without affecting performance, capitalizing on the variability of importance across different weights for different inputs. Lastly, the Mixture of Experts (MoE) [12, 19, 68] approach decouples model size from computational demands, enabling significant scaling of model capacity with minimal impact on inference efficiency, offering a pathway to enhancing model performance without proportional increases in computational burden.
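To make the early-exit idea concrete, the minimal PyTorch sketch below stops running decoder layers once an intermediate prediction is confident enough. The toy model, the reuse of a shared LM head at every depth, and the confidence threshold are illustrative assumptions, not the exact mechanisms of [13, 44, 45].

```python
# Hedged sketch: confidence-based early exiting in a toy decoder-only model.
# Model size, exit criterion, and threshold are illustrative assumptions.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab=32000, dim=512, layers=8, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(layers))
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    @torch.no_grad()
    def generate_token(self, ids, exit_threshold=0.9):
        """Run layers one by one; stop as soon as the intermediate
        prediction is confident enough (early exit)."""
        seq_len = ids.size(1)
        # Causal mask: positions may only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.embed(ids)
        for depth, layer in enumerate(self.layers):
            h = layer(h, src_mask=mask)
            probs = self.lm_head(h[:, -1]).softmax(-1)   # reuse the shared LM head
            conf, token = probs.max(-1)
            if conf.item() >= exit_threshold:            # confident: skip remaining layers
                return token, depth + 1
        return token, len(self.layers)                   # fell through: full depth

model = TinyDecoder()
ids = torch.randint(0, 32000, (1, 16))
token, layers_used = model.generate_token(ids)
print(f"emitted token {token.item()} after {layers_used} layers")
```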
2.1.2 KV Cache Optimization. The KV cache in LLMs stores previously computed attention keys and values. This caching proves particularly effective for complex tasks like chain-of-thought [52] reasoning or information retrieval [30]. However, it introduces overheads like setup time, extra memory for cache storage, and the complexity of managing cache validity when the sequence length or batch size increases. Several strategies have been developed to enhance KV cache efficiency. One key approach is through advanced KV cache management techniques. For instance, vLLM [25] introduces PagedAttention, which stores keys and values in segmented memory blocks, allowing for more efficient retrieval during attention calculations. Additionally, solutions like Hydragen [22] employ a shared-prefix KV cache strategy, greatly improving cache reuse rates by leveraging common sequences. Another significant advancement is the use of KV cache compression [65], which implements eviction policies to selectively retain tokens in the cache, guided by a scoring function based on cumulative attention.
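As an illustration of eviction-style KV cache compression in the spirit of [65], the sketch below keeps a recent window plus the tokens that received the highest cumulative attention; the budget sizes and the scoring rule are illustrative only.

```python
# Hedged sketch of score-based KV cache eviction in the spirit of [65]:
# keep the most recent tokens plus the "heavy hitters" with the largest
# cumulative attention. Budget sizes are illustrative.
import torch

def evict_kv(keys, values, attn_weights, recent=128, heavy=128):
    """keys/values: (seq, dim); attn_weights: (num_queries, seq) attention the
    cached tokens received so far. Returns the compressed cache and the kept indices."""
    seq_len = keys.size(0)
    if seq_len <= recent + heavy:
        return keys, values, torch.arange(seq_len)
    scores = attn_weights.sum(dim=0)                 # cumulative attention per token
    scores[-recent:] = float("inf")                  # always keep the recent window
    keep = torch.topk(scores, recent + heavy).indices.sort().values
    return keys[keep], values[keep], keep

keys, values = torch.randn(1024, 64), torch.randn(1024, 64)
attn = torch.rand(32, 1024).softmax(dim=-1)
k, v, kept = evict_kv(keys, values, attn)
print(k.shape, v.shape, kept[:5])
```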
2.1.3 Parallel Decoding. Parallel decoding presents a unique approach by executing multiple decoding steps simultaneously to reduce the overall number of steps needed, diverging from traditional methods. It typically involves a smaller "draft" model predicting several upcoming words, which are then collectively assessed by the main LLM. This technique is specifically adapted for LLM efficiency. Recent advancements in parallel decoding for LLMs include techniques by Leviathan et al. [29] and Chen et al. [6], which introduce a resampling strategy to align small and large model predictions with the LLM's distribution, ensuring output consistency. Stern et al. [46] explore multi-token predictions from a single forward pass using a linear projection layer and a tree-based decoding structure to improve decoded sequence acceptance. Additionally, Santilli et al. [43] and Fu et al. [14] adapt Jacobi and Gauss-Seidel algorithms [40] for parallelizing decoding, incorporating n-gram reuse and attention masks to enhance LLM efficiency.
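The draft-and-verify pattern can be summarized by the hedged sketch below, which uses greedy acceptance for clarity instead of the exact rejection-sampling rule of [6, 29]; the two model callables are stand-ins for any small draft model and large target model.

```python
# Hedged sketch of draft-and-verify parallel decoding. Greedy acceptance is
# used instead of the exact rejection-sampling rule of [6, 29]; draft_model
# and target_model stand in for callables returning (batch, seq, vocab) logits.
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, ids, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = ids
    for _ in range(k):
        nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    proposed = draft[:, ids.size(1):]                        # (1, k)

    # 2) One forward pass of the large model scores all k positions at once.
    target_pred = target_model(draft)[:, ids.size(1) - 1:-1].argmax(-1)

    # 3) Accept the longest agreeing prefix, then take the target's token at
    #    the first mismatch (empty if every draft token was accepted).
    agree = (proposed == target_pred).int().cumprod(dim=-1).sum().item()
    accepted = proposed[:, :agree]
    bonus = target_pred[:, agree:agree + 1]
    return torch.cat([ids, accepted, bonus], dim=-1)

# Toy usage with random-logit "models" over a 100-token vocabulary.
vocab = 100
draft_model = lambda x: torch.randn(x.size(0), x.size(1), vocab)
target_model = lambda x: torch.randn(x.size(0), x.size(1), vocab)
ids = torch.randint(0, vocab, (1, 8))
print(speculative_step(target_model, draft_model, ids).shape)
```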
Figure 1. The proposed parallel decoding framework Medusa. During inference, each head generates multiple top predictions for its designated position. These predictions are assembled into candidates processed in parallel using a tree-based attention mechanism. Then the framework verifies the candidates and accepts a continuation [4].

2.2 Proposed Works
LLM inference is predominantly memory-bandwidth-bound, with the main latency bottleneck stemming from accelerators' memory bandwidth rather than arithmetic computations. This bottleneck is inherent to the sequential nature of auto-regressive decoding, where each forward pass requires transferring the complete model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. This process, which generates only a single token, under-utilizes the arithmetic computation potential of modern accelerators, leading to inefficiency. In our proposed work [4], named Medusa, we revisit the concept of parallel decoding with a new perspective, noting that current research primarily aims to boost generation speed through draft models. Yet, obtaining an appropriate draft model either from scratch or from distillation is non-trivial. Also, hosting dual-sized models on a server presents challenges, and it is even harder to integrate the draft model into a distributed system. To tackle this, we present a novel approach (shown in Fig. 1) using multiple decoding heads as the adapter for prediction, coupled with an optimized tree-based decoding strategy, enhancing the efficiency of the method. Our proposed technique does not need a separate draft model and allows for seamless integration into existing LLM systems. Our experiments demonstrate that limited-resource fine-tuning can achieve over 2.2x speedup without compromising generation quality, while full fine-tuning further improves the speedup to 2.3-2.8x on models of various sizes. Furthermore, parallel decoding improves resource utilization due to increased matrix operations for multi-token validation per step. By employing an optimized tree-based attention mechanism, we strive to minimize the overhead introduced by parallel decoding. Our focus on optimizing both fine-tuning and inference with the decoding adapter in the context of speculative decoding presents a novel direction for enhancing LLM performance.
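A minimal sketch of the decoding-head adapter is shown below: each extra head maps the final hidden state to top-k candidates for one future position. The tree-based candidate assembly and verification of the full Medusa framework [4, 5] are omitted, and all sizes are illustrative.

```python
# Hedged sketch of Medusa-style extra decoding heads: each head predicts the
# token k steps ahead from the same last hidden state. Tree-based candidate
# assembly and verification from [4, 5] are omitted; sizes are illustrative.
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    def __init__(self, hidden=4096, vocab=32000, num_heads=3):
        super().__init__()
        # One lightweight head per speculative position.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                          nn.Linear(hidden, vocab, bias=False))
            for _ in range(num_heads))

    def forward(self, last_hidden, top_k=3):
        """last_hidden: (batch, hidden) hidden state of the final position.
        Returns top-k candidate tokens for each future position."""
        return [head(last_hidden).topk(top_k, dim=-1).indices for head in self.heads]

heads = MedusaHeads()
h = torch.randn(1, 4096)
candidates = heads(h)           # list of (1, top_k) token ids, one per head
for i, c in enumerate(candidates, start=1):
    print(f"position +{i}: {c.tolist()}")
```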
Furthermore, our method, SnapKV [31], effectively reduces KV cache size, addressing the computational and memory bottlenecks in scenarios involving long sequence inputs. Our findings demonstrate consistent attention allocation patterns of important features in prompts throughout the generation process, independent of prompt formats. This observation highlights the potential of KV cache compression for long sequence inputs, which could reduce the computational and memory overhead of attention calculation during generation steps. Leveraging this insight, our approach intelligently identifies these attention allocation patterns by using the window of features at the end of the long sequence input, as shown in Fig. 2, and compresses the KV cache accordingly. This proposed work achieves consistent decoding speeds, significantly enhancing generation speed by 3.6x and improving memory efficiency by 8.2x compared to the baseline when processing inputs of 16k tokens.

Figure 2. The graph shows the simplified workflow of SnapKV, where the orange area represents the group of positions per head clustered and selected by SnapKV.

2.3 Future Directions
2.3.1 Enhanced Versatility in Parallel Decoding. With the growth in the size of LLMs and their deployment across both cloud and edge devices, accurately predicting the performance of parallel decoding models has become increasingly complex. A universal solution for LLM performance optimization remains elusive. Without accurate prediction before training and deployment, there is a risk of wasting computational resources and failing to meet performance targets. We focus on modeling and predicting various scenarios, utilizing an analytic model to achieve precise performance estimations. Our objective is to create predictive frameworks that aid in selecting optimal parallel decoding algorithms and their hyperparameters tailored to model size, task complexity, and performance goals. This aims to enhance efficiency, adaptability, and overall model performance.

2.3.2 Combining KV Compression and Parallel Decoding. Leveraging KV compression, we see opportunities for notable improvements in tasks with large input prompts, like summarization and multi-round chats, where precise prompt compression will be crucial for maintaining retrieval accuracy and understanding. In long-context scenarios, directly processing the entire prompt and performing inference with parallel decoding introduces significant inference overhead due to the increased computational complexity and memory requirements. To address this, we explore effective attention mechanisms such as Group Query Attention [2] and techniques like quantizing the KV cache to reduce computational load. These refinements are intended to boost LLM efficiency and effectiveness in practical uses while maintaining the generation quality.
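As a concrete illustration of one of the refinements mentioned above, the sketch below applies per-token int8 quantization to the KV cache. It is a generic illustration under simple assumptions, not a specific published scheme.

```python
# Hedged sketch of per-token int8 KV cache quantization; scales are kept in
# fp16. Illustrative only, not a specific published quantization scheme.
import torch

def quantize_kv(x):
    """x: (seq, heads, head_dim) keys or values. Returns an int8 tensor plus
    per-token scales, shrinking the cache roughly 2x versus fp16."""
    flat = x.reshape(x.size(0), -1)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 127.0
    q = torch.round(flat / scale).clamp(-127, 127).to(torch.int8)
    return q.reshape_as(x), scale.half()

def dequantize_kv(q, scale):
    flat = q.reshape(q.size(0), -1).float() * scale.float()
    return flat.reshape_as(q)

k = torch.randn(1024, 8, 64)
qk, s = quantize_kv(k)
err = (dequantize_kv(qk, s) - k).abs().max()
print(f"int8 elements: {qk.numel()}, max abs error: {err:.4f}")
```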
3 LLM-Hardware Co-design

3.1 Background Study
The exceptional capabilities of LLMs are countered by their significant memory and computational overheads. Addressing these, LLM-hardware co-design, inspired by the DNN-accelerator co-design methodology [15] [16], customizes hardware to meet these demands and provides insights to optimize LLM architectures. Specialized accelerators, like GPUs, TPUs, and FPGAs, enhance parallel processing and memory capacity and provide efficient LLM execution. At the same time, software strategies, such as model distillation, pruning, and quantization, can effectively reduce LLM size and complexity, making them adaptable to hardware constraints.

We have seen customized hardware accelerators built for LLM workloads. For example, Tensor Processing Units (TPUs) [20] are designed to efficiently handle matrix operations, which are fundamental for LLM attention and linear layers. These accelerator designs enhance LLM support by integrating High Bandwidth Memory (HBM), reconfigurable high-speed interconnects, and multi-type parallel computation support, offering cost-effective LLM training and serving solutions. Beyond ASICs, FPGA-based accelerators are being actively investigated for their potential to provide more flexible and faster-turnaround solutions. For example, DFX [17] utilizes model parallelism and enables rapid concurrent execution of transformer-based workloads, while FlightLLM [62] introduces a configurable compute unit and an LLM mapping flow to support LLM inference.

For LLM designs, researchers have investigated hardware-aware model compression technologies to optimize LLM architectures. FlashAttention [10] reduces the number of High Bandwidth Memory (HBM) accesses by using tiling techniques in attention computations and extends them to block-sparse attention. PagedAttention [25] divides the KV cache into blocks and manages the blocks as pages in an OS's virtual memory, reducing internal and external fragmentation and thus increasing the efficiency within a single request. In addition, model distillation, pruning, and quantization have proven to improve hardware efficiency for LLM deployment. MLFS [24] freezes a base model and stores many small low-rank adapter matrices, which maintains high-quality encoder models for edge applications and reduces training time. LLM.int8() [11] develops a two-part vector-wise quantization procedure and a new mixed-precision decomposition scheme, enabling models like OPT-175B to run on a single server with consumer GPUs. SmoothQuant [54] uses a per-channel smoothing factor to handle outliers in activations and achieves up to 1.56x speedup and 2x memory reduction. ViTCoD [61] prunes attention maps to either dense or sparse patterns and designs an accelerator that coordinates between these two workloads to boost hardware utilization while integrating on-chip encoder and decoder engines.

3.2 Proposed Works
Following the LLM-hardware co-design method, we propose AutoDistill [64], an end-to-end model distillation framework to deliver hardware-efficient models. As shown in Fig. 3, AutoDistill introduces a three-stage solution, which includes model architecture exploration, model compression, and model evaluation to deliver efficient models given the target hardware and hardware-software tradeoff requirements. To facilitate the hardware/software co-design process, these stages are tightly connected and continuously iterated in a quality-performance space. During the evaluation stage, model quality and its hardware performance results are passed back to the model exploration to guide the search engine in finding a better model architecture that could fulfill both software and hardware requirements. Results show that AutoDistill can efficiently produce student models with lower inference latency and smaller sizes, while maintaining high accuracy on multiple downstream tasks, such as SQuAD for question answering and reading comprehension, as shown in Table 1.

Figure 3. The proposed AutoDistill framework [64].

Table 1. The results on SQuADv1.1. Ours-1 and Ours-2 denote two models designed with AutoDistill [64].

Model         # Param   Latency   F1     EM
BERT-BASE     109 M     -         88.5   80.8
DistilBERT    67 M      -         85.8   77.1
DistilBERT*   67 M      -         86.9   79.1
TinyBERT6*    67 M      -         87.5   79.7
NAS-BERT*     60 M      -         88.0   80.5
NAS-BERT*†    60 M      -         88.4   81.2
MobileBERT    25.3 M    0.65 ms   90.0   82.9
MobileBERT‡   25.3 M    0.65 ms   87.7   80.0
Ours-1        22.8 M    0.59 ms   88.4   80.8
Ours-2        20.6 M    0.49 ms   88.1   80.5

We also propose a pruning-aware quantization strategy by combining two of the most effective LLM compression methods, pruning and quantization, for LLM-hardware co-design [51]. We observe a similar sparsity distribution pattern of attention heads across various datasets, as shown in Fig. 4, which could potentially be used to smartly choose either complete pruning (0 bits) or different quantization precisions (4, 8, or 16 bits) for attention heads on individual layers without additional overhead, based on the hardware-level objective. Moreover, the pruning-aware quantization method could also be combined with other state-of-the-art hardware-aware LLM acceleration frameworks, such as flash attention in Fig. 5, and give higher throughput.
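The head-level precision assignment described above can be sketched as follows; the variance-based activity measure and the thresholds are illustrative assumptions rather than the exact policy of [51].

```python
# Hedged sketch of the pruning-aware quantization idea above: assign each
# attention head a precision (0 = pruned, else 4/8/16 bits) from its activity,
# measured here as output variance over a calibration set. Thresholds are
# illustrative, not the exact policy of [51].
import torch

def assign_head_precisions(head_outputs, thresholds=(0.05, 0.2, 0.5)):
    """head_outputs: (num_heads, tokens, head_dim) calibration activations.
    Returns a list of bit-widths, one per head."""
    activity = head_outputs.flatten(start_dim=1).var(dim=1)   # per-head variance
    activity = activity / activity.max().clamp(min=1e-9)      # normalize to [0, 1]
    bits = []
    for a in activity.tolist():
        if a < thresholds[0]:
            bits.append(0)      # inactive head: prune entirely
        elif a < thresholds[1]:
            bits.append(4)
        elif a < thresholds[2]:
            bits.append(8)
        else:
            bits.append(16)
    return bits

calib = torch.randn(12, 4096, 64) * torch.linspace(0.1, 1.0, 12)[:, None, None]
print(assign_head_precisions(calib))
```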
Figure 4. The profiling results on the activity of heads across different datasets by measuring each head's contribution based on its variance over the input sequence. Heads that show low variance are considered inactive, leading to contextual sparsity.

Figure 5. The preliminary result from forward throughput improvement. Flash2_hmask is the result from the combination of FlashAttention2 [10] and our pruning-aware quantization approach [51].

3.3 Future Directions
3.3.1 System-aware algorithmic optimization. As LLMs continue to grow in size and complexity, it is becoming increasingly critical to design them with the hardware system in mind. This means model developers should consider the hardware system configurations, including accelerator architecture, compute power, memory capacity, system topology, network bandwidth, and model parallelism strategies, in addition to the LLM design parameters.

This future direction will create multiple optimization opportunities to explore the combined design space consisting of hardware and software configurations. Model quality will not be the only optimization objective; hardware performance and efficiency metrics, such as queries per second (QPS) and latency, will also be considered and explored as parts of the LLM designs. With such an enhanced design space, LLMs can be specifically tailored to the underlying system and hardware architecture, and we anticipate further innovations in model size reduction, efficient sharding strategies, optimized data layouts, and other techniques to fully utilize the potential of target systems.

3.3.2 Reconfigurable and heterogeneous hardware for LLMs. Reconfigurable hardware, such as FPGAs, is a promising solution to address continuously evolving LLM designs. It offers the ability to adapt to the specific computational patterns of different LLM workloads, which allows fast development of hardware accelerators to efficiently handle key LLM operations, including matrix multiplication and attention mechanisms. Additionally, it can be combined with heterogeneous hardware to explore new compute paradigms, such as adopting in-memory computing, to address memory-bound operations. This direction enables trade-offs between model quality and hardware efficiency.

3.3.3 Co-design for edge LLM applications. Co-design for edge LLM applications is crucial, given the intricate challenges posed by edge computing's energy and resource limitations. LLM-hardware co-design emerges as a promising solution to these challenges, aiming to harmonize software and hardware to optimize LLM performance on edge devices. Future research will focus on creating tailored architectures and algorithms that efficiently manage computational resources, ensuring that the quality of LLM services remains high. This could involve exploring adaptive power management techniques, optimizing memory usage, and enhancing processing speeds without sacrificing the accuracy or responsiveness of LLM applications.

4 LLM-to-Accelerator Compiler

4.1 Background Study
High-Level Synthesis (HLS) [7] [9] is vital for rapidly developing efficient, high-density hardware accelerators, enabling quick evaluation of different algorithmic choices. The challenge of enabling a scalable compiler from LLM models to HLS-based accelerators lies in effectively exploring the vast design space, which can lead to sub-optimal solutions if not done well, undermining the productivity benefits of HLS. To tackle this challenge, in this section, we will introduce two compilation frameworks, ScaleHLS and HIDA, which can generate HLS accelerators directly from PyTorch models.

4.2 Proposed Works

Figure 6. ScaleHLS framework architecture [57].
4.2.1 ScaleHLS. ScaleHLS [21, 57–59] is a scalable HLS framework based on Multi-Level Intermediate Representation (MLIR) [26]. Fig. 6 shows the overall architecture. ScaleHLS supports C/C++ and PyTorch as design entries. Once the inputs are parsed, ScaleHLS supports three levels of IR, graph, loop, and directive, to apply the HLS-oriented optimizations progressively. At the graph and loop levels, graph optimizations (e.g., node fusion and coarse-grained pipelining) and loop optimizations can be performed efficiently. At the lowest directive level, HLS-specific optimizations are applied to fine-tune the hardware micro-architecture.

On top of each level of IR, ScaleHLS provides a set of transform passes to optimize HLS designs. By performing each transform pass at the "correct" level of abstraction, ScaleHLS is able to leverage the intrinsic hierarchy of HLS designs and reduce the algorithmic complexity of transforms. Meanwhile, we propose a Design Space Exploration (DSE) engine to automatically optimize the configurable design parameters and search for the Pareto-dominating design points in the latency-resource utilization space. Finally, the optimized IR is emitted as synthesizable HLS C/C++ code. Experimental results show that, compared to baseline designs synthesized by Vitis HLS without manual directive insertion or code rewriting, ScaleHLS improves performance by up to 3825.0x on representative neural network models.
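The flavor of this directive-level exploration can be illustrated with the toy sketch below, which enumerates unroll and pipeline configurations under an assumed analytical cost model and keeps the Pareto-dominating (latency, resource) points. The cost model and numbers are purely illustrative; this is not the actual ScaleHLS DSE engine.

```python
# Toy sketch of latency/resource design space exploration in the spirit of
# the ScaleHLS DSE engine. The analytical cost model is illustrative only.
from itertools import product

TRIP_COUNT = 1024

def estimate(unroll, pipeline):
    # Toy model: unrolling divides the trip count, pipelining lowers the
    # initiation interval but costs extra registers/LUTs.
    ii = 1 if pipeline else 4
    latency = (TRIP_COUNT // unroll) * ii + 10
    resource = unroll * (30 if pipeline else 20)
    return latency, resource

def pareto_front(points):
    front = []
    for cfg, (lat, res) in points:
        dominated = any(l <= lat and r <= res and (l, r) != (lat, res)
                        for _, (l, r) in points)
        if not dominated:
            front.append((cfg, (lat, res)))
    return sorted(front, key=lambda p: p[1][0])

designs = [((u, p), estimate(u, p))
           for u, p in product([1, 2, 4, 8, 16, 32], [False, True])]
for (unroll, pipeline), (lat, res) in pareto_front(designs):
    print(f"unroll={unroll:2d} pipeline={pipeline!s:5} latency={lat:5d} resource={res:4d}")
```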
Figure 7. HIDA framework architecture [60].

4.2.2 HIDA. HIDA [60] is an HLS framework built upon ScaleHLS with hierarchical dataflow intermediate representations (IR) and optimizations, enabling the automated transformation of algorithmic hardware descriptions to efficient dataflow architectures. Fig. 7 shows HIDA's overall architecture. The core of HIDA is its novel dataflow IR, named HIDA-IR, which is designed for modeling dataflow at two distinct abstraction levels: Functional and Structural. This dual-level approach is critical for capturing the dataflow's characteristics and its multi-level hierarchy, thereby facilitating effective optimizations.

Another important aspect of HIDA is the introduction of HIDA-OPT, a new dataflow optimizer. This optimizer utilizes a pattern-driven task fusion algorithm coupled with an intensity- and connection-aware dataflow parallelization algorithm, which can capture the computation complexity and interconnection topology of the dataflow nodes during the parallelization process. Furthermore, HIDA is designed to be end-to-end and extensible, supporting both PyTorch and C++ inputs. This flexibility empowers users to rapidly experiment with various design parameters and prototype new dataflow architectures, broadening the framework's applicability and ease of use. Despite being fully automated and able to handle various applications, HIDA achieved throughputs that were 8.54x and 1.29x higher than those of ScaleHLS and RTL-based neural network accelerators [63], respectively.
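To illustrate the intensity-aware parallelization idea, the following toy sketch assigns power-of-two parallel factors to dataflow nodes so that the slowest stage is repeatedly accelerated under a resource budget. The heuristic is an illustrative stand-in for HIDA-OPT, not its actual algorithm.

```python
# Toy sketch of intensity-aware dataflow parallelization in the spirit of
# HIDA-OPT: balance stage latencies by repeatedly speeding up the bottleneck
# node while a resource budget remains. Heuristic and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    ops: int          # work per invocation (e.g., MACs)
    factor: int = 1   # parallelization factor to be assigned

def parallelize(nodes, total_budget=64):
    budget = total_budget - len(nodes)
    while budget > 0:
        bottleneck = max(nodes, key=lambda n: n.ops / n.factor)
        if bottleneck.factor > budget:
            break
        budget -= bottleneck.factor       # doubling costs its current factor
        bottleneck.factor *= 2            # factors restricted to powers of two
    return nodes

pipeline = [Node("conv1", 1 << 20), Node("conv2", 1 << 22),
            Node("attn", 1 << 24), Node("proj", 1 << 21)]
for n in parallelize(pipeline):
    print(f"{n.name:6s} factor={n.factor:3d} stage_latency={n.ops // n.factor}")
```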
4.3 Future Directions
4.3.1 Spatial Architecture. Due to the substantial volume of parameters and intermediate computations involved in LLMs, the bottleneck in hardware acceleration frequently resides in the external memory bandwidth. Contrary to the Von Neumann architecture, which consistently battles the "memory wall," spatial architectures can leverage on-chip communication among tasks to minimize frequent external memory accesses. By overlapping the execution of distinct LLM layers and buffering only a subset of intermediate results on-chip, spatial architectures can markedly decrease on-chip memory requirements and overall latency. This approach presents a compelling solution for LLM inference. Nonetheless, the automatic generation of spatial architectures remains challenging, opening vast avenues for innovation in compilation and architecture design.

4.3.2 Runtime Reconfiguration. To achieve spatial parallelization, tasks must be instantiated simultaneously on-chip. However, due to constrained computational and on-chip memory resources, it is infeasible to simultaneously map all layers of emerging LLMs on-chip, which significantly limits the scalability of spatial architectures. Consequently, runtime reconfiguration emerges as a crucial strategy for enabling scalable spatial solutions. The main challenge lies in automating the balance between spatial and sequential execution, that is, addressing the scheduling problem, to optimize the performance-energy trade-off.
4.3.3 Heterogeneous Computation. Accelerating LLMs presents a unique challenge due to their dual nature, being both computation-bound and memory-bound. The prefill phase of LLMs is dominated by General Matrix Multiply (GEMM) operators, making it computation-intensive. In contrast, the generation phase is dominated by General Matrix-Vector (GEMV) operations, demanding substantial memory bandwidth to keep the computation units engaged (refer to Section 2.2). This dual nature of LLMs unveils significant opportunities for heterogeneous computing solutions, where compilers assume an important role in code generation for heterogeneous platforms and in facilitating efficient communication between them.
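A back-of-the-envelope calculation makes this dual nature concrete: the weights are streamed once per forward pass, so prefill amortizes the traffic over many tokens while single-token decoding amortizes it over one. The numbers below assume a 7B-parameter fp16 model and are purely illustrative.

```python
# Back-of-the-envelope sketch of the dual compute/memory nature described
# above, using rough numbers for a 7B-parameter fp16 model. Illustrative only.
PARAMS = 7e9
BYTES_PER_PARAM = 2          # fp16 weights
PROMPT_TOKENS = 2048

def arithmetic_intensity(tokens):
    flops = 2 * PARAMS * tokens              # ~2 FLOPs per weight per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights streamed once per pass
    return flops / bytes_moved               # FLOPs per byte

print(f"prefill (GEMM, {PROMPT_TOKENS} tokens): "
      f"{arithmetic_intensity(PROMPT_TOKENS):.0f} FLOPs/byte")
print(f"decode  (GEMV, 1 token):               "
      f"{arithmetic_intensity(1):.0f} FLOPs/byte")
# A typical accelerator needs on the order of 100+ FLOPs/byte to stay
# compute-bound, so prefill saturates the ALUs while single-token decoding
# is limited by memory bandwidth.
```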
4.3.4 Advanced HIDA. Although the HIDA framework can conduct effective dataflow-aware design space exploration, optimizing streaming buffers in LLM accelerators remains a formidable challenge due to the self-attention mechanism and complex inter-layer connections in LLMs. Enhancements to the HIDA framework could address more complicated stream optimizations to reduce on-chip memory consumption. Additionally, recent works [8, 67] have demonstrated the ability to generate efficient kernel designs through customized scheduling. We propose to integrate these highly optimized kernels into the HIDA explorer to further improve the efficiency of LLM accelerators. We also propose to enhance the code generation of HIDA to support more hardware platforms with dataflow architectures, such as AMD Versal ACAP [3].

5 LLM-Aided Design

5.1 Background Study
The existing related work on leveraging LLMs in the field of EDA can be divided into three categories [66]: (1) Assistant Chatbot for Enhanced Design Workflow: ChipNeMo [32] leverages domain adaptation techniques such as custom tokenizers, domain-adaptive continued pretraining, and Supervised Fine-Tuning (SFT) atop the foundation model LLaMA2 [37]. This integration facilitates instant access to knowledge, streamlines the query-response process, and diminishes reliance on conventional search methods and associated delays. (2) HDL and Script Generation: LLMs, such as those in AutoChip [48] and VeriGen [49], have shown their effectiveness in generating Verilog code and EDA tool scripts from natural language instructions. (3) HDL Verification and Analysis: RTLFixer [50] exemplifies this by introducing a framework aimed at correcting Verilog code errors utilizing tools like OpenAI GPT-3.5/GPT-4, supplemented by Retrieval Augmented Generation (RAG) and ReAct prompting techniques. Additional efforts in this area focus on generating SystemVerilog Assertions (SVA) for security purposes [23] [35] [41], illustrating the wide-ranging potential of LLMs in bolstering HDL verification and analysis processes. CHIRAAG [34] is proposed to generate SVA assertions from natural language specifications based on GPT-4. For assertions with syntax or simulation errors, the LLM receives automatic feedback from the log file and then regenerates the SVA for retest. Orenes-Vera [39] proposed an iterative methodology where LLMs, particularly GPT-4, are prompted with refined rules and RTL code to generate SVA, which is then evaluated for correctness and completeness through a series of testbench simulations and revisions.

Despite the promising developments in LLM-aided design within EDA, several challenges remain: (1) Data Quality and Availability: The efficacy of LLMs in EDA critically hinges on the availability of high-quality, domain-specific datasets for their training and refinement. Unfortunately, the proprietary nature of many electronic designs and the tools used for EDA significantly limits access to such datasets. The bulk of detailed hardware design code, primarily developed within corporate settings, is not made public. This restriction leads to a scarcity of accessible, high-grade datasets, thus hindering the development and optimization of LLMs specifically engineered for EDA applications. (2) Development of Specialized LLMs: There is a critical need for the development of LLMs that are specifically tailored to grasp the complexities of electronic design languages and processes. Generic models, while useful, often lack the nuanced understanding required to effectively generate, verify, and analyze hardware code and to interact with EDA tools at a level that matches human experts. This necessitates a concerted effort to create more specialized models that can comprehend and manipulate the intricate details of electronic designs with a high degree of accuracy and efficiency.

5.2 Proposed Works
One use case of LLM-aided design is to harness LLMs for enhancing the verification and debugging processes for HLS code development. HLS, with its higher level of abstraction, can significantly improve design productivity, as explained in Section 4.1.
5.2.1 Chrysalis Dataset. The cornerstone of our proposed work is the Chrysalis dataset, an extensive collection of HLS designs embedded with a diverse set of realistic bugs [51]. This dataset is meticulously curated from a wide range of open-source HLS benchmark suites, featuring over 1,500 designs, with both the version embedded with bugs and the corresponding bug-free version. Fig. 8 outlines our methodology for constructing the Chrysalis dataset. We begin by gathering open-source benchmark suites and difficult bug types (bugs also include non-ideal HLS Pragma insertions), which compilers often struggle to identify. These suites are then converted to function-level designs. Using the Maximal Marginal Relevance (MMR) algorithm, we select the top-k similar designs from the RAG database for bug injection prompts. The prompt generation chooses one strategy based on bug type statistics: one combining In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and Chain-of-Thought (CoT); the other using just RAG and ICL. After integration, the prompts are processed by an LLM (GPT-4 in our case) to generate bug (or non-ideal Pragma) injection solutions, which are then validated through both automatic procedures and manual checks by hardware engineers. Successful solutions are added to the RAG database to enhance its diversity and volume, improving future solutions. Our evaluations show 88% validity for all the bugs. This dataset serves not only as a tool for training our domain-specific LLM, but also as a benchmark for evaluating the model's proficiency in identifying and suggesting fixes for common and complex HLS bugs.

Figure 8. Overview of the Chrysalis Dataset Construction and Iterative Upgrade Process: For each check iteration, it involves evaluating the dataset's validity and expanding the RAG dataset accordingly. Through these iterations, the quality of the dataset progressively improves.
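The prompt-assembly step of this flow can be sketched as follows; the similarity measure, prompt wording, and example database are illustrative stand-ins, and the call to the LLM itself is left out.

```python
# Hedged sketch of assembling an ICL+RAG+CoT bug-injection prompt in the
# spirit of Fig. 8: pick top-k similar designs with an MMR-style criterion
# and combine them with reasoning steps. All specifics are illustrative.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def mmr_select(query, pool, k=2, lam=0.7):
    selected, candidates = [], list(pool)
    while candidates and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * similarity(c, query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def build_prompt(design, bug_type, rag_examples):
    examples = mmr_select(design, rag_examples)
    shots = "\n\n".join(f"Example {i + 1}:\n{e}" for i, e in enumerate(examples))
    return (
        "Context: You are injecting realistic HLS bugs for a debugging dataset.\n"
        f"Bug type: {bug_type}\n"
        "Steps: (1) identify a suitable code region; (2) apply the bug; "
        "(3) report the changed line and keep the code compilable.\n"
        f"{shots}\n\nDesign under modification:\n{design}\n")

rag_db = ["void mul(int *a) { /* #pragma HLS pipeline */ }",
          "void add(int *a, int *b) { /* elementwise loop */ }"]
prompt = build_prompt("void gemm(...) { /* tiled matrix multiply */ }",
                      "non-ideal pragma placement", rag_db)
print(prompt[:200])
# The assembled prompt would then be sent to an LLM (GPT-4 in the paper's
# flow), and the returned injection validated automatically and manually.
```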
5.2.2 HLS-specific Debugging Assistant. Building upon the Chrysalis dataset, our next step involves the creation of an HLS-specific debugging assistant, as Fig. 9 shows. Engineers typically design test vectors and create test benches manually, then perform C simulations and co-simulations to analyze and identify potential bugs, which is time-consuming. To improve the efficiency of the debugging process, we proposed a novel flow leveraging the capabilities of LLMs on top of the traditional HLS debugging flow. This LLM will be fine-tuned to understand the intricacies of HLS code, enabling it to identify errors, suggest corrections, and provide guidance directly within the developers' workflow. The assistant aims to integrate seamlessly with popular development environments, offering real-time support to engineers as they navigate the complexities of HLS designs. By providing context-aware suggestions and corrections, the debugging assistant will significantly expedite the verification process, enhancing productivity and reducing the time-to-market for hardware designs.

Figure 9. LLM-based HLS Debugging Flows Working Together with Traditional Flows: By incorporating our LLM debugging assistant, the number of bugs requiring verification by test cases can be significantly reduced.

The entire methodology could be adapted to RTL debugging as well, starting from the bug injection stage, using open-source LLMs and developing a domain-specific RTL debugger through fine-tuning. To effectively transition to this new application, we must tackle diverse bug types specific to RTL, such as those related to timing constraints, race conditions, and synthesis-simulation mismatches. In particular, the inherent timing characteristics of RTL designs can lead to more complex bugs, often manifesting as issues in timing analysis that are not present in higher-level abstractions. Given the complexity of RTL code, one ambitious goal is to reduce manual debugging and verification effort by building an advanced and automated RTL verification and debugging flow. This would involve enriching our dataset to include RTL designs (which can be generated through HLS working with our Chrysalis dataset) together with test benches and test vectors. A flow similar to Fig. 8 can be developed to assess and improve the validity of such bug injections. Furthermore, seamless integration with EDA tools is crucial to enable real-time analysis and correction within the existing design frameworks.

5.3 Future Directions
5.3.1 LLM-Aided Debugging. Our research highlights challenges in using LLMs to inject certain HLS bug types, such as operator precedence errors and incorrect data access patterns for interface pragmas. These difficulties stem from sparse code patterns and the complexities of the existing codebase, necessitating further investigation and refined methodologies for effective bug injection. Additionally, as manual verification of bug injections remains necessary in our current flow, creating an automated flow to estimate performance could speed up the identification and resolution of non-ideal Pragma issues, thus enhancing the quality and quantity of the dataset. Furthermore, for the HLS-specific debugging assistant, we will employ Low-Rank Adaptation (LoRA) [18] for supervised fine-tuning on state-of-the-art open-source LLMs such as LLaMA3 [36], utilizing commercial HLS documentation for design guidelines and rules together with our Chrysalis dataset.
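To make the LoRA step concrete, the sketch below implements a from-scratch low-rank adapter around a frozen linear layer; in practice an existing LoRA implementation and an open model such as LLaMA3 would be used, and the rank and scaling shown are illustrative.

```python
# Minimal from-scratch LoRA adapter illustrating what the supervised
# fine-tuning above trains: the frozen base weight W stays fixed while the
# low-rank update B*A (rank r) is learned. Rank and scaling are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pretrained layer
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

base = nn.Linear(4096, 4096)
layer = LoRALinear(base)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```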
5.3.2 LLM-Aided Formal Verification. LLMs can enhance the formal verification process in hardware design by generating precise assertions for the proof of correctness. By integrating these assertions into the formal verification workflow, LLMs can substantially increase hardware design productivity. One promising direction is to explore an iterative process: after the initial proof attempts, the theorem prover's feedback is utilized to refine the LLM's output. This feedback loop enables the LLM to adjust its generated proofs iteratively until the assertions are fully verifiable. Through this dynamic interaction between LLMs and theorem provers, the generation of program proofs becomes both faster and more achievable. This methodology not only speeds up the verification process but also ensures a higher degree of reliability in hardware design verification.

5.3.3 LLM-Aided Hardware/Software Design Automation. In the realm of EDA, employing LLM multi-agent systems promises a transformative approach to streamlining the design, verification, and debugging processes. These sophisticated systems autonomously manage various phases of the workflow, seamlessly transitioning from design to verification and debugging. By deploying multiple specialized LLM agents, each adept in distinct facets of the design process such as code generation, verification, error detection, and performance optimization, a highly efficient pipeline is crafted. This orchestrated integration allows the agents to collaboratively refine and optimize the design iteratively, leveraging real-time feedback and comprehensive verification results. Throughout the process, hardware engineers are only tasked with overseeing the initial specification and periodically reviewing the outputs from the LLMs to ensure that they align with the design intentions and confirm the reliability of the LLMs' outputs.

6 Conclusion

In our study, we focused on optimizing LLMs to reduce inference latency and improve efficiency across various applications. We presented a new method, Medusa, that uses multiple decoding heads for prediction, coupled with an optimized tree-based decoding strategy for parallel token processing, to speed up the execution of LLMs. We also proposed a novel method, SnapKV, that effectively reduces KV cache size, addressing the computational and memory bottlenecks in scenarios involving long sequence inputs.

We discussed LLM/hardware co-design to integrate both hardware optimization for efficient execution and model architecture exploration for improved system efficiency while maintaining LLM accuracy. HLS frameworks like ScaleHLS and HIDA were explored for accelerating LLMs directly from PyTorch models, envisioning automated generation of spatial architectures and heterogeneous computing solutions.

We also explored the advancements in LLM-aided design for EDA and discussed a novel flow to create the Chrysalis dataset that can be used to train an LLM-based HLS-specific debugging assistant. A similar strategy can be adopted for building an RTL-specific debugging assistant as well. These methods are promising for streamlining the debugging and verification process of hardware code development.

For each aspect mentioned above, we also outlined promising future directions for further research and exploration.

Acknowledgments
This work is supported in part by the IBM-Illinois Discovery Accelerator Institute, AMD Center of Excellence at UIUC, AMD Heterogeneous Adaptive Compute Cluster (HACC) initiative, NSF 2117997 grant through the A3D3 institute, and Semiconductor Research Corporation (SRC) 2023-CT-3175 grant.

References
[1] Baleegh Ahmad et al. 2023. Fixing hardware security bugs with large language models. arXiv:2302.01215.
[2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
[3] AMD/Xilinx. [n. d.]. Versal Adaptive Compute Acceleration Platform. https://www.xilinx.com/products/silicon-devices/acap/versal.html
[4] Tianle Cai et al. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv:2401.10774.
[5] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In ICML. https://openreview.net/forum?id=PEpbUobfJv
[6] Charlie Chen et al. 2023. Accelerating large language model decoding with speculative sampling. arXiv:2302.01318.
[7] Deming Chen et al. 2005. xPilot: A Platform-Based Behavioral Synthesis System. In SRC Techcon.
[8] Hongzheng Chen et al. 2024. Allo: A Programming Model for Composable Accelerator Design. arXiv preprint arXiv:2404.04815.
[9] Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Trans. Reconfigurable Technol. Syst. 15, 4, Article 51, 42 pages. https://doi.org/10.1145/3530775
[10] Tri Dao et al. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems.
[11] Tim Dettmers et al. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
[12] Nan Du et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML.
[13] Maha Elbayad et al. 2019. Depth-adaptive transformer. arXiv:1910.10073.
[14] Yichao Fu et al. 2024. Break the sequential dependency of LLM inference using lookahead decoding. arXiv:2402.02057.
[15] Cong Hao et al. 2018. Deep neural network model and FPGA accelerator co-design: Opportunities and challenges. In ICSICT.
[16] Cong Hao et al. 2019. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In DAC.
[17] Seongmin Hong et al. 2022. DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation. In MICRO.
[18] Edward J Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[19] Albert Q Jiang et al. 2024. Mixtral of experts. arXiv:2401.04088.
[20] Norm Jouppi et al. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In ISCA.
[21] Hyegang Jun et al. 2023. AutoScaleDSE: A scalable design space exploration engine for high-level synthesis. In TRETS.
[22] Jordan Juravsky et al. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099.
[23] Rahul Kande et al. 2023. LLM-assisted generation of hardware assertions. arXiv:2306.14027.
[24] Achintya Kundu et al. 2024. Efficiently Distilling LLMs for Edge Applications. arXiv preprint arXiv:2404.01353.
[25] Woosuk Kwon et al. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP.
[26] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2–14.
[27] Zhihong Lei, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi, Youyuan Zhang, Mirko Hannemann, Man-Hung Siu, et al. 2024. Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization. In ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 10096–10100.
[28] Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, et al. 2023. Acoustic Model Fusion For End-to-End Speech Recognition. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 1–7.
[29] Yaniv Leviathan et al. 2023. Fast inference from transformers via speculative decoding. In ICML.
[30] Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.
[31] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. arXiv preprint arXiv:2404.14469.
[32] Mingjie Liu et al. 2023. ChipNeMo: Domain-adapted LLMs for chip design. arXiv:2311.00176.
[33] Zichang Liu et al. 2023. Deja vu: Contextual sparsity for efficient LLMs at inference time. In ICML.
[34] Bhabesh Mali et al. 2024. ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation. arXiv preprint arXiv:2402.00093.
[35] Xingyu Meng et al. 2023. Unlocking hardware security assurance: The potential of LLMs. arXiv:2308.11042.
[36] Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/
[37] Meta. 2024. Llama 2: open source, free for research and commercial use. https://llama.meta.com/llama2/
[38] Md Rakib Hossain Misu et al. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. Proc. ACM Softw. Eng. 1, FSE. https://doi.org/10.1145/3643763
[39] Marcelo Orenes-Vera et al. 2023. Using LLMs to facilitate formal verification of RTL. arXiv e-prints, arXiv–2309.
[40] James M Ortega and Werner C Rheinboldt. 2000. Iterative solution of nonlinear equations in several variables. Classics in Applied Mathematics.
[41] Sudipta Paria et al. 2023. DIVAS: An LLM-based end-to-end framework for SoC security analysis and policy-based protection. arXiv:2308.06932.
[42] Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade, Matt Jones, Alessandro Morari, and Ruchir Puri. 2023. Invited: Automated Code generation for Information Technology Tasks in YAML through Large Language Models. In 2023 60th ACM/IEEE Design Automation Conference (DAC). 1–4. https://doi.org/10.1109/DAC56929.2023.10247987
[43] Andrea Santilli et al. 2023. Accelerating transformer inference for translation via parallel decoding. arXiv:2305.10427.
[44] Tal Schuster et al. 2022. Confident adaptive language modeling. Advances in Neural Information Processing Systems.
[45] Antoine Simoulin et al. 2021. How many layers and why? An analysis of the model depth in transformers. In IJCNLP Student Research Workshop.
[46] Mitchell Stern et al. 2018. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems.
[47] Gemini Team et al. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL].
[48] Shailja Thakur et al. 2023. AutoChip: Automating HDL generation using LLM feedback. arXiv:2311.04887.
[49] Shailja Thakur et al. 2023. VeriGen: A large language model for Verilog code generation. In TRETS.
[50] YunDa Tsai et al. 2023. RTLFixer: Automatically fixing RTL syntax errors with large language models. arXiv:2311.16543.
[51] Lily Jiaxin Wan et al. 2024. Software/Hardware Co-design for LLM and Its Application for Design Verification. In ASP-DAC.
[52] Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[53] Haoyuan Wu et al. 2024. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1–1. https://doi.org/10.1109/TCAD.2024.3383347
[54] Guangxuan Xiao et al. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML.
[55] Mingbin Xu, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han, Yaqiao Deng, Zhen Huang, and Mahesh Krishnamoorthy. 2023. Conformer-Based Speech Recognition On Extreme Edge-Computing Devices. arXiv preprint arXiv:2312.10359.
[56] Mingbin Xu, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang, Arturo Argueta, Shiyi Han, Yaqiao Deng, et al. 2023. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[57] Hanchen Ye et al. 2022. ScaleHLS: A new scalable high-level synthesis framework on multi-level intermediate representation. In HPCA.
[58] Hanchen Ye et al. 2022. ScaleHLS: A scalable high-level synthesis framework with multi-level transformations and optimizations. In DAC.
[59] Hanchen Ye et al. 2023. High-level Synthesis for Domain Specific Computing. In ISPD.
[60] Hanchen Ye et al. 2024. HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. In ASPLOS.
[61] Haoran You et al. 2023. ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design. In HPCA.
[62] Shulin Zeng et al. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA. arXiv:2401.03868.
[63] Xiaofan Zhang et al. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In ICCAD.
[64] Xiaofan Zhang et al. 2022. AutoDistill: An end-to-end framework to explore and distill hardware-efficient language models. arXiv:2201.08539.
[65] Zhenyu Zhang et al. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv:2306.14048.
[66] Ruizhe Zhong et al. 2023. LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation. arXiv:2401.12224.
[67] Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, et al. 2023. CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 153–164.
[68] Barret Zoph et al. 2022. Designing effective sparse expert models. arXiv:2202.08906.
