MPReport 2

The document defines key concepts related to scheduling multi-model AI workloads on specialized hardware, including Multi-chip Modules (MCM) and the Energy-Delay Product (EDP) as a metric for efficiency. It discusses the NN-Baton scheduler and proposes a new scheduling methodology that optimizes workload distribution across chiplets, emphasizing the importance of time windows and layer granularity. The methodology is structured in four steps, focusing on partitioning, node allocation, segmentation, and mapping to enhance performance metrics such as latency and energy consumption.


Definitions :

Multi-model AI Workload : A workload for an application that combines the features of
various models to achieve optimized outcomes, with a particular model being used for a
particular task. Ex: a self-driving car uses vision models to detect signs, a LiDAR-based model
for 3D mapping, and decision-making models for real-time control.

Multi-chip Module (MCM) : A circuit made of multiple, independent integrated circuits where,
often, each chip is specialized for a particular function.

Multi-Tenancy : Running different workloads/benchmarks from different vendors, or 'tenants', on
the same hardware.

What do they mean by 'scheduling' the workload?

●​ Different models may run well on different types of processors. Ex: a vision model may
be well-suited for a GPU and an NLP model for an AI core.
●​ One of the requirements is that the models must run simultaneously. The running of one
model must not interfere with the running of another.

Metrics :
Energy-Delay Product :
Energy-Delay Product (EDP) is a metric that measures the efficiency of a design by balancing
energy consumption against execution speed. It is calculated by multiplying normalized energy
consumption by normalized execution time.
EDP is used to:
●​ Evaluate the trade-offs between power-saving techniques for digital designs
●​ Analyze data from CPU frequency tuning and network bandwidth experiments
●​ Determine the level of tuning that can be applied while meeting runtime requirements
●​ Calculate the expected turnaround time for a particular application
EDP is a well-known metric for software energy efficiency. However, it is not always adequate,
because it cannot distinguish between optimizations that have the same EDP value but belong
to different categories.
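
Written out, the definition above amounts to the following (a restatement of the prose, not an additional formula from the paper):

```latex
% Energy-Delay Product: normalized energy times normalized execution time
\mathrm{EDP} = E_{\mathrm{norm}} \times T_{\mathrm{norm}}
```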

Workloads :
MLPerf Benchmarks : A highly relevant, up-to-date set of benchmarks most commonly used to
test the efficacy of any hardware aiming to perform training and inference on an AI workload.
XRBench : A collection of multi-model, multi-task workloads that support eXtended Reality
(XR) applications, typically AR/VR.
The paper talks about ten multi-model scenarios:
The first five are curated using MLPerf and represent multi-tenancy scenarios.
The next five are curated using XRBench for AR/VR scenarios.
Hardware Used :
ShiDianNao : A specialized hardware accelerator designed to speed up CNNs. It is optimized
to consume low power by leveraging an on-chip memory system to minimize data movement.
NVDLA : A deep-learning accelerator developed by NVIDIA that is highly configurable and
optimized for machine learning and deep learning inference tasks.

Background :
Scheduling AI workloads on AI hardware and MCMs :
Scheduling on CPUs/GPUs :
●​ Most of these compute units are based on homogeneous cores.
●​ Heterogeneity is found only in very crude ways: big and little cores in CPUs, CUDA
and Tensor cores in GPUs.
●​ They offer limited programmer control and employ a cache-based memory system,
which hinders the scheduling of workloads.
Scheduling on customized AI accelerators :
●​ Give full programmer/compiler control over memory operations.
●​ Employ scratchpad-memory-based systems.
●​ Scratchpad memory is:
smaller and faster than a cache
directly connected to the CPU core, ensuring fast computation
used for very specific calculations
Scheduling on MCM AI accelerators :
●​ Come with a Network-on-Package (NoP) that facilitates communication between the
chiplets on a package (or module).
●​ Scaling is an issue.
As shown in the figure above, for the example, the paper talks about an MCM containing a total
of 4 chiplets - 3 NVDLA-like and 1 ShiDianNao-like. They use the NN-Baton scheduler.
NN-Baton : A scheduling tool that handles workload orchestration onto the MCM at a chiplet
granularity level, i.e., it directs exactly which workload should be run on which chiplet.
They consider two models : ResNet-50 (3 layers) and GPT-2 (1 layer). The combination of a
vision model and an NLP model makes the data sufficiently heterogeneous and
hence appropriate for their research.
Figures C1 and C2 show the functioning of the NN-Baton scheduler, which schedules the entire
ResNet-50 workload onto a single chiplet, since each chiplet is capable of processing the entire
workload individually. In C3, the researchers use their proposed scheduler to achieve better
performance than NN-Baton.
Similarly, in C4 and C5, they demonstrate the working of NN-Baton and their scheduler. In C6,
they propose time-window-based scheduling and draw out even better performance. This
time window forms the crux of their methodology.
Complexity of the scheduling space :
For a multi-model workload constituting N models, with model i containing $L_i$ layers, they defined:

$L = \sum_{i=1}^{N} L_i$
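
As a quick instantiation with the running example from earlier (ResNet-50 contributing 3 layers and GPT-2 contributing 1 layer):

```latex
% Total layer count for the two-model example
L = \sum_{i=1}^{2} L_i = L_{\text{ResNet-50}} + L_{\text{GPT-2}} = 3 + 1 = 4
```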
This is the total no. of layers under consideration. Then, the complexity of the scheduling space
is given by :

Methodology :
The scheduling algorithm is divided into two stages -
●​ the first stage involves determining layers from a model or multiple models that can be
executed without any dependencies. These layers are grouped into a time window.
●​ the second stage involves spatial and temporal partitioning of the layers within a time
window, to direct which particular layer will be scheduled onto which chiplet.
Thus, the scheduling takes place at a layer granularity level.

As inputs, the scheduling framework receives :

●​ description files of the multi-model workloads (layer parameters, topology,
dependencies, etc.)
●​ a description file of the MCM hardware specification (the number of chiplets, the shape
and dataflow organization of the chiplet arrays, NoP bandwidth, on-chiplet memory size, etc.)

As output, the scheduling framework reports an optimized schedule with expected metrics such
as latency, energy, and EDP. Special emphasis is laid on EDP, as it reveals more information about
how exactly each layer performs on each chiplet.
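
As an illustration, here is a minimal sketch of what these two description files might contain, written as Python dicts (every field name and value below is an assumption for illustration; the paper's actual file format is not reproduced in this document):

```python
# Hypothetical description files for the running example, expressed as
# Python dicts. All field names and values are illustrative assumptions.
workload_desc = {
    "models": [
        # Layer counts follow the ResNet-50/GPT-2 example above.
        {"name": "ResNet-50", "layers": 3, "dependencies": [(0, 1), (1, 2)]},
        {"name": "GPT-2", "layers": 1, "dependencies": []},
    ],
}

hardware_desc = {
    "num_chiplets": 4,
    "chiplet_types": ["NVDLA-like"] * 3 + ["ShiDianNao-like"],
    "nop_bandwidth_gbps": 100,  # assumed value
    "onchip_memory_kb": 512,    # assumed value
}
```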
We provide some definitions next.
Definitions :
1.​ Time Window : A coarse-grained group of layers that can be run without any
dependencies. A time window may contain layers from a single model or from
multiple models. Layers included in a time window 'i' MUST NOT appear in
another time window 'j', where i != j.
2.​ Segment : A subset of layers within a time window that is scheduled onto a chiplet. A
set of segments must include ALL layers within a time window, and the segments must
be mutually exclusive, i.e., no two segments may have a common layer.
3.​ Latency in Communication (Lat_com) : The latency in communication is defined as :

'NoP' refers to the Network-on-Package; its bandwidth is an important factor. The
communication cost also includes the number of hops from the source to the
destination, and the term 'delta' accounts for traffic within the network. Finally, the
latency in accessing memory is also included in case an off-chip request is made.
4.​ Layer Latency : The latency incurred by a single layer l mapped onto a chiplet is
defined as :

Here, the first term is the latency of loading the layer operands, the second
term is the layer computation cost incurred while scheduling the layer on a
particular chiplet, and the third term is the latency in transferring the results from
one layer to a subsequent layer.
5.​ Overall Latency : The overall latency to schedule the workload is given as :

where the latency of a time window is given as :

It is determined over every segment within the window.
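
Since the equations themselves are not reproduced above, here is a minimal sketch of how these latency definitions compose, with simplified stand-in arithmetic (the per-term formulas, the max over segments within a window, and the sum over windows are all assumptions consistent with the prose):

```python
# Simplified stand-ins for the latency definitions above; the exact
# formulas are assumptions, not the paper's equations.

def lat_com(data_bytes, nop_bw, hops, delta, offchip_lat=0):
    """Communication latency: NoP transfer time scaled by the hop count and
    the traffic factor delta, plus memory latency for off-chip requests."""
    return hops * (data_bytes / nop_bw) * (1 + delta) + offchip_lat

def layer_latency(load_lat, compute_lat, transfer_lat):
    """Layer latency: loading the layer operands, computing the layer on its
    chiplet, and transferring results to the subsequent layer."""
    return load_lat + compute_lat + transfer_lat

def window_latency(segment_latencies):
    """A time window finishes when its slowest segment finishes."""
    return max(segment_latencies)

def overall_latency(window_latencies):
    """Time windows execute one after another, so their latencies add up."""
    return sum(window_latencies)

# Example: two windows, each holding two segments (arbitrary cycle counts).
print(overall_latency([window_latency([100, 80]),
                       window_latency([60, 90])]))  # -> 190
```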


Steps Involved :

In the 1st step, the algorithm :

●​ generates candidate time-window partitioning strategies by sampling a set of discrete
points in time as boundary points
●​ assigns layers from the models to each time window

In the example, we see a total of 6 layers from Model A and 5 layers from Model B.
A special hyperparameter 'nsplits' is used to predetermine the number of time windows, which
leads to finding proper cut points for each model.
The algorithm adopts a first-fit greedy-packing method to assign layers to time windows, as
sketched below. Using this approach ensures that :
●​ low-latency layers are run in earlier windows (restricts starvation)
●​ the number of time windows can be dynamically controlled by skipping trivial time
windows with no workloads
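
Here is a minimal sketch of such first-fit greedy packing, assuming each layer carries an estimated latency and using a per-window latency budget as the fit criterion (the budget heuristic is an assumption, and layer dependencies are ignored for brevity):

```python
# First-fit greedy packing of layers into at most nsplits time windows.
# The per-window latency budget is an illustrative assumption, and
# dependencies between layers are ignored in this simplified sketch.

def pack_layers(layers, nsplits):
    """layers: list of (name, est_latency) pairs."""
    budget = sum(lat for _, lat in layers) / nsplits
    windows = [[] for _ in range(nsplits)]
    loads = [0.0] * nsplits
    # Visiting low-latency layers first sends them to earlier windows,
    # which restricts starvation.
    for name, lat in sorted(layers, key=lambda x: x[1]):
        for w in range(nsplits):  # first fit: earliest window with room
            if loads[w] + lat <= budget or w == nsplits - 1:
                windows[w].append(name)
                loads[w] += lat
                break
    # Skip trivial (empty) windows, dynamically shrinking the window count.
    return [w for w in windows if w]

print(pack_layers([("A1", 5), ("A2", 3), ("B1", 8), ("B2", 1)], nsplits=2))
# -> [['B2', 'A2'], ['A1', 'B1']]
```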

In the 2nd step, the algorithm determines how many 'nodes' the layers from each model are to
be run on. The paper refers to the chiplets as 'nodes' in this step because here the
algorithm doesn't assume any information about the underlying chiplets. A uniform distribution
rule allocates Ni nodes to the ith model as follows:

where E(Pi) represents the expected value of a target performance optimization metric (latency,
energy, EDP) for model i and |C| represents the total no. of chiplets.
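
Since the allocation formula itself is not reproduced above, here is a minimal sketch of one plausible reading: nodes shared in proportion to each model's E(Pi), with largest-remainder rounding (both the proportionality and the rounding rule are assumptions):

```python
# Allocate |C| nodes across models in proportion to E(P_i).
# Proportional sharing with largest-remainder rounding is an assumed
# stand-in for the paper's exact uniform distribution rule.

def allocate_nodes(expected_metric, num_chiplets):
    total = sum(expected_metric)
    raw = [num_chiplets * e / total for e in expected_metric]
    alloc = [int(r) for r in raw]
    # Hand leftover nodes to the largest fractional remainders.
    leftover = num_chiplets - sum(alloc)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc

# Example: 4 chiplets; model 0 expects 3x the latency of model 1.
print(allocate_nodes([30, 10], num_chiplets=4))  # -> [3, 1]
```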

In the 3rd step, we partition the layers into segments, making sure the resources allocated to
the layers in the previous step still hold. Given |Li| and |Ni| as the respective number of layers
and number of assigned nodes from the previous step for model workload mi, the maximum number
of segments that can be generated for mi is upper-bounded by Ni.
Thus, the overall segmentation space complexity becomes :

To reduce the search complexity, the algorithm supports a hyperparameter that is particularly
beneficial in cases where time windows have model workloads with dissimilar layer distributions
(e.g., one model with a heavy layer, another with numerous small layers), which could lead to an
explosion in the number of trivial segmentation options from the low-cost layers, causing the
search space complexity to rise. Small layers are better suited for continuous execution on
the same resource, so to limit unnecessary, costly inter-chiplet data movements, the algorithm
designates a node allocation constraint that restricts the number of nodes assigned to workloads
with a disproportionately large number of layers. A sketch of the segmentation enumeration is
given below.
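
The sketch enumerates the segmentation options for one model within a time window (treating segments as contiguous runs of layers is an assumption; the max_segments cap corresponds to the node allocation Ni from the previous step):

```python
# Enumerate ways to split an ordered layer list into at most max_segments
# contiguous, mutually exclusive segments that together cover every layer.
# Contiguity of segments is an illustrative assumption.
from itertools import combinations

def segmentations(layers, max_segments):
    n = len(layers)
    options = []
    for k in range(1, min(max_segments, n) + 1):
        for cuts in combinations(range(1, n), k - 1):  # k-1 cut points
            bounds = (0,) + cuts + (n,)
            options.append([layers[a:b] for a, b in zip(bounds, bounds[1:])])
    return options

# Example: 3 ResNet-50 layers with a node allocation of N_i = 2.
for option in segmentations(["r1", "r2", "r3"], max_segments=2):
    print(option)
# -> [['r1', 'r2', 'r3']], [['r1'], ['r2', 'r3']], [['r1', 'r2'], ['r3']]
```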

In the 4th step, the mapping of the segments onto the chiplets takes place. The algorithm does
the mapping in two phases : spatial and temporal.
Spatial Mapping : determines which segment runs on which chiplet
Temporal Mapping : determines the execution order of segments on each chiplet
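
A minimal sketch of the two phases follows (the greedy earliest-free-chiplet policy is an assumption standing in for the paper's actual search):

```python
# Two-phase mapping of segments onto chiplets: spatial (which chiplet)
# and temporal (execution order per chiplet). The greedy earliest-free
# placement below is an illustrative assumption.

def map_segments(segments, num_chiplets):
    """segments: list of (name, est_latency) pairs."""
    free_at = [0.0] * num_chiplets           # when each chiplet frees up
    schedule = [[] for _ in range(num_chiplets)]
    for name, lat in segments:
        c = min(range(num_chiplets), key=lambda i: free_at[i])  # spatial
        schedule[c].append(name)  # temporal: append order = execution order
        free_at[c] += lat
    return schedule

print(map_segments([("s1", 4), ("s2", 2), ("s3", 3)], num_chiplets=2))
# -> [['s1'], ['s2', 's3']]
```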
