MPReport 2

The document defines key concepts related to scheduling multi-model AI workloads on specialized hardware, including Multi-chip Modules (MCM) and the Energy-Delay Product (EDP) as a metric for efficiency. It discusses the NN-Baton scheduler and proposes a new scheduling methodology that optimizes workload distribution across chiplets, emphasizing the importance of time windows and layer granularity. The methodology is structured in four steps, focusing on partitioning, node allocation, segmentation, and mapping to enhance performance metrics such as latency and energy consumption.


Definitions :

Multi-model AI Workload : A workload for an application that combines the features of
various models to achieve optimized outcomes, with a particular model being used for a
particular task. Ex: a self-driving car uses vision models to detect signs, a LiDAR-based model
for 3D mapping, and decision-making models for real-time control.

Multi-chip Module (MCM) : A circuit made of multiple, independent integrated circuits where,
often, each chip is specialized for a particular function.

Multi-Tenancy : Running different workloads/benchmarks from different vendors, or 'tenants', on
the same hardware.

What do they mean by 'scheduling' the workload?

●​ Different models may run well on different types of processors. Ex: a vision model may
be well-suited for a GPU and an NLP model for an AI core.
●​ One of the requirements is that the models must run simultaneously. The running of one
model must not interfere with the running of another.

Metrics :
Energy-Delay Product :
Energy-Delay Product (EDP) is a metric that measures the efficiency of a design by balancing
energy consumption against execution speed. It is calculated by multiplying normalized energy
consumption by normalized execution time.
EDP is used to:
●​ Evaluate the trade-offs between power-saving techniques for digital designs
●​ Analyze data from CPU frequency tuning and network bandwidth experiments
●​ Determine the level of tuning that can be applied while meeting runtime requirements
●​ Calculate the expected turnaround time for a particular application
EDP is a well-known metric for software energy efficiency. However, it is not always adequate,
because it cannot distinguish between optimizations that have the same EDP value but belong
to different categories.
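
Written out, the definition above amounts to the following (a restatement of the prose, not an additional formula from the paper):

```latex
% Energy-Delay Product: normalized energy times normalized execution time
\mathrm{EDP} = E_{\mathrm{norm}} \times T_{\mathrm{norm}}
```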

Workloads :
MLPerf Benchmarks : A highly relevant, up-to-date set of benchmarks most commonly used to
test the efficacy of any hardware aiming to perform training and inference on an AI workload.
XRBench : A collection of multi-model, multi-task workloads that support eXtended Reality
(XR) applications, typically AR/VR.
The paper talks about ten multi-model scenarios:
The first five are curated using MLPerf and represent multi-tenancy scenarios.
The next five are curated using XRBench for AR/VR scenarios.
Hardware Used :
ShiDianNao : A specialized hardware accelerator designed to speed up CNNs. It is optimized
to consume low power by leveraging an on-chip memory system to minimize data movement.
NVDLA : A deep-learning accelerator developed by NVIDIA that is highly configurable and
optimized for machine learning and deep learning inference tasks.

Background :
Scheduling AI workloads on AI hardware and MCMs :
Scheduling on CPUs/GPUs :
●​ Most of these compute units are based on homogeneous cores.
●​ Heterogeneity is found only in very crude ways: big and little cores in CPUs, CUDA
and Tensor cores in GPUs.
●​ They offer limited programmer control and employ a cache-based memory system,
which hinders the scheduling of workloads.
Scheduling on customized AI accelerators :
●​ Give full programmer/compiler control over memory operations.
●​ Employ scratchpad-memory-based systems.
●​ Scratchpad memory is:
smaller and faster than a cache
directly connected to the CPU core, ensuring fast computation
used for very specific calculations
Scheduling on MCM AI accelerators :
●​ Come with a Network-on-Package (NoP) that facilitates communication between the
chiplets on a package (or module).
●​ Scaling is an issue.
As shown in the figure above, for the example, the paper talks about an MCM containing a total
of 4 chiplets - 3 NVDLA-like and 1 ShiDianNao-like. They use the NN-Baton scheduler.
NN-Baton : A scheduling tool that handles workload orchestration onto the MCM at a chiplet
granularity level, i.e., it directs exactly which workload should be run on which chiplet.
They consider two models : ResNet-50 (3 layers) and GPT-2 (1 layer). The combination of a
vision model and an NLP model makes the data sufficiently heterogeneous and
hence appropriate for their research.
Figures C1 and C2 show the functioning of the NN-Baton scheduler, which schedules the entire
ResNet-50 workload onto a single chiplet, since each chiplet is capable of processing the entire
workload individually. In C3, the researchers use their proposed scheduler to achieve better
performance than NN-Baton.
Similarly, in C4 and C5, they demonstrate the working of NN-Baton and their scheduler. In C6,
they propose time-window-based scheduling and draw out even better performance. This
time window forms the crux of their methodology.
Complexity of the scheduling space :
For a multi-model workload constituting N models, with model i containing $L_i$ layers, they defined:

$L = \sum_{i=1}^{N} L_i$
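
As a quick instantiation with the running example from earlier (ResNet-50 contributing 3 layers and GPT-2 contributing 1 layer):

```latex
% Total layer count for the two-model example
L = \sum_{i=1}^{2} L_i = L_{\text{ResNet-50}} + L_{\text{GPT-2}} = 3 + 1 = 4
```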
This is the total no. of layers under consideration. Then, the complexity of the scheduling space
is given by :

Methodology :
The scheduling algorithm is divided into two stages -
●​ the first stage involves determining layers from a model or multiple models that can be
executed without any dependencies. These layers are grouped into a time window.
●​ the second stage involves spatial and temporal partitioning of the layers within a time
window, to direct which particular layer will be scheduled onto which chiplet.
Thus, the scheduling takes place at a layer granularity level.

As inputs, the scheduling framework receives :

●​ description files of the multi-model workloads (layer parameters, topology,
dependencies, etc.)
●​ a description file of the MCM hardware specification (the number of chiplets, the shape
and dataflow organization of the chiplet arrays, NoP bandwidth, on-chiplet memory size, etc.)

As output, the scheduling framework reports an optimized schedule with expected metrics such
as latency, energy, and EDP. Special emphasis is laid on EDP, as it reveals more information about
how exactly each layer performs on each chiplet.
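
As an illustration, here is a minimal sketch of what these two description files might contain, written as Python dicts (every field name and value below is an assumption for illustration; the paper's actual file format is not reproduced in this document):

```python
# Hypothetical description files for the running example, expressed as
# Python dicts. All field names and values are illustrative assumptions.
workload_desc = {
    "models": [
        # Layer counts follow the ResNet-50/GPT-2 example above.
        {"name": "ResNet-50", "layers": 3, "dependencies": [(0, 1), (1, 2)]},
        {"name": "GPT-2", "layers": 1, "dependencies": []},
    ],
}

hardware_desc = {
    "num_chiplets": 4,
    "chiplet_types": ["NVDLA-like"] * 3 + ["ShiDianNao-like"],
    "nop_bandwidth_gbps": 100,  # assumed value
    "onchip_memory_kb": 512,    # assumed value
}
```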
We provide some definitions next.
Definitions :
1.​ Time Window : A coarse-grained group of layers that can be run without any
dependencies. A time window may contain layers from a single model or from
multiple models. Layers included in a time window 'i' MUST NOT appear in
another time window 'j', where i != j.
2.​ Segment : A subset of layers within a time window that is scheduled onto a chiplet. A
set of segments must include ALL layers within a time window, and the segments must
be mutually exclusive, i.e., no two segments may have a common layer.
3.​ Latency in Communication (Lat_com) : The latency in communication is defined as :

'NoP' refers to the Network-on-Package; its bandwidth is an important factor. The
communication cost also includes the number of hops from the source to the
destination, and the term 'delta' accounts for traffic within the network. Finally, the
latency in accessing memory is also included in case an off-chip request is made.
4.​ Layer Latency : The latency incurred by a single layer l mapped onto a chiplet is
defined as :

Here, the first term is the latency of loading the layer operands, the second
term is the layer computation cost incurred while scheduling the layer on a
particular chiplet, and the third term is the latency in transferring the results from
one layer to a subsequent layer.
5.​ Overall Latency : The overall latency to schedule the workload is given as :

where the latency of a time window is given as :

It is determined over every segment within the window.
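
Since the equations themselves are not reproduced above, here is a minimal sketch of how these latency definitions compose, with simplified stand-in arithmetic (the per-term formulas, the max over segments within a window, and the sum over windows are all assumptions consistent with the prose):

```python
# Simplified stand-ins for the latency definitions above; the exact
# formulas are assumptions, not the paper's equations.

def lat_com(data_bytes, nop_bw, hops, delta, offchip_lat=0):
    """Communication latency: NoP transfer time scaled by the hop count and
    the traffic factor delta, plus memory latency for off-chip requests."""
    return hops * (data_bytes / nop_bw) * (1 + delta) + offchip_lat

def layer_latency(load_lat, compute_lat, transfer_lat):
    """Layer latency: loading the layer operands, computing the layer on its
    chiplet, and transferring results to the subsequent layer."""
    return load_lat + compute_lat + transfer_lat

def window_latency(segment_latencies):
    """A time window finishes when its slowest segment finishes."""
    return max(segment_latencies)

def overall_latency(window_latencies):
    """Time windows execute one after another, so their latencies add up."""
    return sum(window_latencies)

# Example: two windows, each holding two segments (arbitrary cycle counts).
print(overall_latency([window_latency([100, 80]),
                       window_latency([60, 90])]))  # -> 190
```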


Steps Involved :

In the 1st step, the algorithm :

●​ generates candidate time-window partitioning strategies by sampling a set of discrete
points in time as boundary points
●​ assigns layers from the models to each time window

In the example, we see a total of 6 layers from Model A and 5 layers from Model B.
A special hyperparameter 'nsplits' is used to predetermine the number of time windows, which
leads to finding proper cut points for each model.
The algorithm adopts a first-fit greedy-packing method to assign layers to time windows, as
sketched below. Using this approach ensures that :
●​ low-latency layers are run in earlier windows (restricts starvation)
●​ the number of time windows can be dynamically controlled by skipping trivial time
windows with no workloads
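
Here is a minimal sketch of such first-fit greedy packing, assuming each layer carries an estimated latency and using a per-window latency budget as the fit criterion (the budget heuristic is an assumption, and layer dependencies are ignored for brevity):

```python
# First-fit greedy packing of layers into at most nsplits time windows.
# The per-window latency budget is an illustrative assumption, and
# dependencies between layers are ignored in this simplified sketch.

def pack_layers(layers, nsplits):
    """layers: list of (name, est_latency) pairs."""
    budget = sum(lat for _, lat in layers) / nsplits
    windows = [[] for _ in range(nsplits)]
    loads = [0.0] * nsplits
    # Visiting low-latency layers first sends them to earlier windows,
    # which restricts starvation.
    for name, lat in sorted(layers, key=lambda x: x[1]):
        for w in range(nsplits):  # first fit: earliest window with room
            if loads[w] + lat <= budget or w == nsplits - 1:
                windows[w].append(name)
                loads[w] += lat
                break
    # Skip trivial (empty) windows, dynamically shrinking the window count.
    return [w for w in windows if w]

print(pack_layers([("A1", 5), ("A2", 3), ("B1", 8), ("B2", 1)], nsplits=2))
# -> [['B2', 'A2'], ['A1', 'B1']]
```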

In the 2nd step, the algorithm determines how many 'nodes' the layers from each model are to
be run on. The paper refers to the chiplets as 'nodes' in this step because here the
algorithm doesn't assume any information about the underlying chiplets. A uniform distribution
rule allocates Ni nodes to the ith model as follows:

where E(Pi) represents the expected value of a target performance optimization metric (latency,
energy, EDP) for model i and |C| represents the total no. of chiplets.
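
Since the allocation formula itself is not reproduced above, here is a minimal sketch of one plausible reading: nodes shared in proportion to each model's E(Pi), with largest-remainder rounding (both the proportionality and the rounding rule are assumptions):

```python
# Allocate |C| nodes across models in proportion to E(P_i).
# Proportional sharing with largest-remainder rounding is an assumed
# stand-in for the paper's exact uniform distribution rule.

def allocate_nodes(expected_metric, num_chiplets):
    total = sum(expected_metric)
    raw = [num_chiplets * e / total for e in expected_metric]
    alloc = [int(r) for r in raw]
    # Hand leftover nodes to the largest fractional remainders.
    leftover = num_chiplets - sum(alloc)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc

# Example: 4 chiplets; model 0 expects 3x the latency of model 1.
print(allocate_nodes([30, 10], num_chiplets=4))  # -> [3, 1]
```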

In the 3rd step, we partition the layers into segments, making sure the resources allocated to
the layers in the previous step still hold. Given |Li| and |Ni| as the respective number of layers
and number of assigned nodes from the previous step for model workload mi, the maximum number
of segments that can be generated for mi is upper-bounded by Ni.
Thus, the overall segmentation space complexity becomes :

To reduce the search complexity, the algorithm supports a hyperparameter that is particularly
beneficial in cases where time windows have model workloads with dissimilar layer distributions
(e.g., one model with a heavy layer, another with numerous small layers), which could lead to an
explosion in the number of trivial segmentation options from the low-cost layers, causing the
search space complexity to rise. Small layers are better suited for continuous execution on
the same resource, so to limit unnecessary, costly inter-chiplet data movements, the algorithm
designates a node allocation constraint that restricts the number of nodes assigned to workloads
with a disproportionately large number of layers. A sketch of the segmentation enumeration is
given below.
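
The sketch enumerates the segmentation options for one model within a time window (treating segments as contiguous runs of layers is an assumption; the max_segments cap corresponds to the node allocation Ni from the previous step):

```python
# Enumerate ways to split an ordered layer list into at most max_segments
# contiguous, mutually exclusive segments that together cover every layer.
# Contiguity of segments is an illustrative assumption.
from itertools import combinations

def segmentations(layers, max_segments):
    n = len(layers)
    options = []
    for k in range(1, min(max_segments, n) + 1):
        for cuts in combinations(range(1, n), k - 1):  # k-1 cut points
            bounds = (0,) + cuts + (n,)
            options.append([layers[a:b] for a, b in zip(bounds, bounds[1:])])
    return options

# Example: 3 ResNet-50 layers with a node allocation of N_i = 2.
for option in segmentations(["r1", "r2", "r3"], max_segments=2):
    print(option)
# -> [['r1', 'r2', 'r3']], [['r1'], ['r2', 'r3']], [['r1', 'r2'], ['r3']]
```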

In the 4th step, the mapping of the segments onto the chiplets takes place. The algorithm does
the mapping in two phases : spatial and temporal.
Spatial Mapping : determines which segment runs on which chiplet
Temporal Mapping : determines the execution order of segments on each chiplet
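
A minimal sketch of the two phases follows (the greedy earliest-free-chiplet policy is an assumption standing in for the paper's actual search):

```python
# Two-phase mapping of segments onto chiplets: spatial (which chiplet)
# and temporal (execution order per chiplet). The greedy earliest-free
# placement below is an illustrative assumption.

def map_segments(segments, num_chiplets):
    """segments: list of (name, est_latency) pairs."""
    free_at = [0.0] * num_chiplets           # when each chiplet frees up
    schedule = [[] for _ in range(num_chiplets)]
    for name, lat in segments:
        c = min(range(num_chiplets), key=lambda i: free_at[i])  # spatial
        schedule[c].append(name)  # temporal: append order = execution order
        free_at[c] += lat
    return schedule

print(map_segments([("s1", 4), ("s2", 2), ("s3", 3)], num_chiplets=2))
# -> [['s1'], ['s2', 's3']]
```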
