MPReport 2
Multi-model AI Workload : A workload for an application that combines the capabilities of
several models to achieve an optimized outcome, with a particular model being used for a
particular task. Ex : a self-driving car uses vision models to detect signs, a LiDAR-based model
for 3D mapping, and decision-making models for real-time control.
Metrics :
Energy-Delay Product :
Energy-Delay Product (EDP) is a metric that measures the efficiency of a design by balancing
energy consumption against execution speed. It is calculated by multiplying normalized energy
consumption by normalized execution time.
EDP is used to:
● Evaluate the trade-offs between power-saving techniques for digital designs
● Analyze data from CPU frequency tuning and network bandwidth experiments
● Determine the level of tuning that can be applied while meeting runtime requirements
● Calculate the expected turnaround time for a particular application
EDP is a well-known metric for software energy efficiency. However, it is not adequate on its
own, because it cannot distinguish between optimizations that have the same EDP value but
belong to different categories (e.g., halving energy while doubling delay leaves EDP unchanged).
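A minimal Python sketch of the metric and of the limitation just described (illustrative values, not from the paper):

```python
# Minimal sketch of the EDP metric (illustrative values, not from the paper).
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-Delay Product: lower is better."""
    return energy_joules * delay_seconds

baseline = edp(energy_joules=10.0, delay_seconds=2.0)  # EDP = 20.0
half_energy_double_delay = edp(5.0, 4.0)               # EDP = 20.0
double_energy_half_delay = edp(20.0, 1.0)              # EDP = 20.0

# All three configurations share the same EDP, even though they represent
# very different optimization categories -- the limitation noted above.
print(baseline, half_energy_double_delay, double_energy_half_delay)
```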
Workloads :
MLPerf Benchmarks : A highly relevant, up-to-date set of benchmarks, most commonly used to
test the efficacy of any hardware aiming to perform training and inference on an AI workload.
XRBench : A collection of multi-model, multi-task workloads designed to support eXtended
Reality (XR) applications. Usually used for AR/VR applications.
The paper discusses ten multi-model scenarios:
● The first five are curated using MLPerf and represent multi-tenancy scenarios
● The next five are curated using XRBench and represent AR/VR scenarios
Hardware Used :
ShiDianNao : A specialized hardware accelerator designed to speed up CNNs. It is optimized
to consume low power by leveraging an on-chip memory system to minimize data movement.
NVDLA : A deep-learning accelerator developed by NVIDIA that is highly configurable and
optimized for machine learning and deep learning inference tasks.
Background :
Scheduling AI workloads on AI Hardware and MCMs :
Scheduling on CPUs/GPUs :
● Most of these compute units (CUs) are based on homogeneous cores
● Heterogeneity appears only in crude forms: big and little cores in CPUs, CUDA
and Tensor cores in GPUs
● Offer limited programmer control and employ a cache-based memory system,
which hinders the scheduling of workloads
Scheduling on Customized AI accelerators :
● Give full programmer/compiler control over memory operations
● Employ scratchpad-memory-based systems
Scratchpad Memory :
● Smaller and faster than a cache
● Directly connected to the CPU core, ensuring fast computation
● Used for very specific calculations
Scheduling on MCM AI Accelerators :
● Come with a Network-on-Package (NoP) that facilitates communication between
the chiplets on a package (or module)
● Scaling is an issue
As shown in the figure above, as a running example the paper considers an MCM containing a
total of 4 chiplets : 3 NVDLA-like and 1 ShiDianNao-like. They use the NN-Baton scheduler.
NN-Baton : A scheduling tool that handles workload orchestration onto the MCM at chiplet
granularity, i.e., it directs exactly which workload should be run on which chiplet.
They consider two models : ResNet-50 (3 layers) and GPT-2 (1 layer). The combination of a
vision model and an NLP model makes the data sufficiently heterogeneous and hence
appropriate for their research.
Figures C1 and C2 show the functioning of the NN-Baton scheduler, which schedules the entire
ResNet-50 workload onto a single chiplet, since each chiplet is capable of processing the entire
workload individually. In C3, the researchers use their proposed scheduler to achieve better
performance than NN-Baton.
Similarly, in C4 and C5, they demonstrate the working of NN-Baton and their scheduler. In C6,
they propose time-window-based scheduling and achieve even better performance. This time
window forms the crux of their methodology.
Complexity of the scheduling space :
For a multi-model workload comprising N models, with model i containing L_i layers, they define:

\[ L = \sum_{i=1}^{N} L_i \]

which is the total number of layers under consideration. Then, the complexity of the scheduling
space is given by :
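As a hedged intuition for this complexity (an illustrative lower bound, not the paper's exact formula): each of the L layers can be assigned to any of the |C| chiplets, so the spatial assignments alone number |C|^L, and temporal ordering multiplies the space further.

```latex
% Illustrative lower bound, not the paper's exact formula:
% each of the L layers can be placed on any of |C| chiplets,
% so spatial assignments alone number |C|^L; temporal ordering
% of layers on each chiplet multiplies the space further.
\[
  |\mathcal{S}| \;\geq\; |C|^{L}
\]
```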
Methodology :
The scheduling algorithm is divided into two stages:
● The first stage identifies layers, from one or more models, that can be executed
without any unresolved dependencies. These layers are grouped into a time window.
● The second stage performs spatial and temporal partitioning of the layers within a time
window to direct which particular layer will be scheduled onto which chiplet.
Thus, the scheduling takes place at layer granularity.
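To make the first stage concrete, here is a minimal Python sketch (an illustration, not the paper's code) that builds time windows by repeatedly collecting layers whose dependencies have all completed; the layer names and dependency format below are hypothetical:

```python
# Minimal sketch of stage 1: grouping dependency-free layers into time
# windows. This is an illustration, not the paper's implementation; the
# layer names and dependency dict below are hypothetical.
def build_time_windows(deps: dict[str, set[str]]) -> list[set[str]]:
    """deps maps each layer to the set of layers it depends on."""
    remaining = dict(deps)
    done: set[str] = set()
    windows: list[set[str]] = []
    while remaining:
        # All layers whose dependencies have already completed can run now.
        ready = {l for l, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("cyclic dependency among layers")
        windows.append(ready)
        done |= ready
        for l in ready:
            del remaining[l]
    return windows

# Two models: a 3-layer chain and a single-layer model.
deps = {
    "r1": set(), "r2": {"r1"}, "r3": {"r2"},  # model A: r1 -> r2 -> r3
    "g1": set(),                              # model B: one independent layer
}
print(build_time_windows(deps))
# [{'r1', 'g1'}, {'r2'}, {'r3'}] -- r1 and g1 share the first time window.
```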
Here, the first term accounts for the latency of loading the layer's operands; the second
term arises from the computation cost incurred when scheduling the layer on a particular
chiplet; and the third term captures the latency of transferring the results from one layer
to a subsequent layer.
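Reconstructed as a hedged sketch from the three terms just described (the symbols are assumed, not the paper's notation):

```latex
% Hedged reconstruction from the three terms described above;
% the symbols are assumed, not the paper's own notation.
\[
  T_{\ell,c} \;=\; T_{\text{load}}(\ell)
           \;+\; T_{\text{comp}}(\ell, c)
           \;+\; T_{\text{xfer}}(\ell, \ell+1)
\]
```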
5. Overall Latency : The overall latency to schedule the workload is given as :
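A hedged sketch of what this could look like under the time-window approach (an assumption, not the paper's exact formula): windows execute sequentially, and within a window the chiplets run in parallel, so each window costs as much as its slowest chiplet.

```latex
% Hedged sketch (assumed form): windows execute sequentially, and within a
% window the chiplets run in parallel, so each window costs its slowest chiplet.
\[
  T_{\text{total}} \;=\; \sum_{w \in \text{windows}} \;\max_{c \in C}\; T(w, c)
\]
```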
In the 2nd step, the algorithm determines how many ‘nodes’ the layers from each model
should be run on. The paper refers to the chiplets as ‘nodes’ in this step because here the
algorithm does not yet assume any information about the underlying chiplets. A uniform
distribution rule allocates N_i nodes to the i-th model as follows:
where E(P_i) represents the expected value of a target performance optimization metric (latency,
energy, EDP) for model i, and |C| represents the total number of chiplets.
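A minimal Python sketch of such an allocation (the proportional rule below is my assumption; the paper's exact formula may differ):

```python
# Minimal sketch of the node-allocation step. The proportional rule below is
# an assumption (nodes in proportion to each model's expected cost E(P_i));
# the paper's exact formula may differ.
def allocate_nodes(expected_cost: list[float], num_chiplets: int) -> list[int]:
    total = sum(expected_cost)
    # Give each model at least one node, distribute the rest proportionally.
    alloc = [max(1, round(num_chiplets * p / total)) for p in expected_cost]
    # Trim overshoot from the largest allocations so the total fits |C|.
    while sum(alloc) > num_chiplets:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

# E.g., expected EDP of 3.0 for ResNet-50 vs 1.0 for GPT-2, on 4 chiplets:
print(allocate_nodes([3.0, 1.0], num_chiplets=4))  # -> [3, 1]
```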
In the 3rd step, the layers are partitioned into segments, while ensuring that the resources
allocated in the previous step still hold. Given |L_i| and |N_i| as the respective number of layers
and number of nodes assigned in the previous step to model workload m_i, the maximum
number of segments that can be generated for m_i is upper-bounded by N_i.
Thus, the overall segmentation space complexity becomes :
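As a hedged sketch of that count (assuming each model's layer chain is split into contiguous segments; not necessarily the paper's exact expression): splitting L_i layers into at most N_i contiguous segments means choosing up to N_i − 1 cut points among the L_i − 1 gaps between layers, and the per-model counts multiply.

```latex
% Hedged sketch (assumed contiguous segmentation, not the paper's exact form):
% choosing up to N_i - 1 cut points among the L_i - 1 gaps of model i.
\[
  \prod_{i=1}^{N} \;\sum_{k=1}^{N_i} \binom{L_i - 1}{\,k - 1\,}
\]
```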
In the 4th step, the mapping of the segments onto the chiplets takes place. The algorithm
performs the mapping in two phases : spatial and temporal
Spatial Mapping : determines which segment runs on which chiplet
Temporal Mapping : determines the execution order of segments on each chiplet
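A toy Python sketch of the two phases together (purely illustrative; the paper's mapping algorithm is more sophisticated): spatial mapping greedily assigns each segment to the least-loaded chiplet, and temporal mapping keeps each chiplet's queue in arrival order.

```python
# Toy sketch of the two mapping phases (illustrative only; the paper's
# algorithm is more sophisticated). Segments are (name, cost) pairs.
def map_segments(segments: list[tuple[str, float]], chiplets: list[str]):
    load = {c: 0.0 for c in chiplets}
    queues: dict[str, list[tuple[str, float]]] = {c: [] for c in chiplets}
    for name, cost in segments:
        # Spatial mapping: greedily pick the least-loaded chiplet.
        c = min(chiplets, key=lambda ch: load[ch])
        queues[c].append((name, cost))
        load[c] += cost
    # Temporal mapping: order each chiplet's queue (here: arrival order,
    # which already respects the time-window ordering of the segments).
    return queues

segs = [("resnet_seg0", 3.0), ("resnet_seg1", 2.0), ("gpt2_seg0", 1.0)]
print(map_segments(segs, chiplets=["nvdla0", "nvdla1", "shidiannao0"]))
```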