
1

[COMSE6998-015] Fall
2024
Introduction to Deep
Learning and LLM based
Generative AI Systems

Lecture 1 09/03/24
Class Introduction
• Instructors
Parijat Dube <pd2637@columbia.edu>
Research Staff Member at IBM Research, NY
Machine learning, deep learning, system performance optimization, generative AI for enterprise automation

Chen Wang <cw3687@columbia.edu>
Research Staff Member at IBM Research, NY
Kubernetes, container cloud platforms, data-driven/QoE-based resource management, AI4Sys & Sys4AI, LLM serving & fine-tuning, sustainable cloud & AI systems

2
• Course Assistants:
Anmol Jain aj3231@columbia.edu
Dhruvi Shah dms2338@columbia.edu
Harshvardhan Srivastava hs3447@columbia.edu

Class Introduction
• Prerequisites:
• Knowledge of ML and use of ML algorithms
• Programming in Python

• Class on CourseWork:
https://courseworks2.columbia.edu/courses/203899
All information about the class will be available here including
syllabus, announcements and assignments

3
Today’s Agenda

• Course Overview
• Syllabus
• Assignments and Grading
• Logistics
• Machine Learning Systems
• Machine Learning on Cloud and Model Lifecycle
• Model Performance and Complexity Tradeoffs
• Class 1 Topics

4
Course Information
• What this course will cover?
• DL training architectures, hyperparameters
• LLM pre-training, fine-tuning and inference serving systems
• Cloud-based DL/LLM systems and performance issues
• DL/LLM systems performance evaluation tools, techniques, benchmarks
• DL/LLM systems performance optimization
• Programming assignments involving GPUs on cloud
• Research paper readings
• What this course will not cover?
• Details about specific ML algorithms
• Mathematical analysis of ML/DL algorithms

5
Educational Objectives
• Identify different components of the DL/LLM system stack and their interdependencies
• Knowledge of the ML/LLM model lifecycle and the steps in making a trained model production ready
• Train a DL/LLM model and make it a web service for inferencing
• Performance considerations, tools, and techniques at different stages of the model lifecycle: development, testing, deployment
• DL/LLM training pitfalls and techniques/best practices for data processing
• Ability to train DL/LLM models on cloud platforms using GPUs
• Performance characterization of DL/LLM systems
• Knowledge of DL/LLM benchmarks and performance metrics
• Performance optimization of DL/LLM systems

6
Class 1: Fundamentals of Deep Learning (DL)
• ML performance concepts/techniques: overfitting, generalization, bias,
variance tradeoff, regularization
• Performance metrics: algorithmic and system level
• DL training: backpropagation, activation functions, data preprocessing,
batch normalization, SGD and its variants, exploding and vanishing
gradients, weight initialization
• DL training hyperparameters
• batch size, learning rate, momentum, weight decay
• Regularization techniques in DL Training: dropout, early stopping, data
augmentation

7
Class 2: Distributed Training and Standard
DL Architectures
• Single node training
• Model and Data Parallelism
• Distributed training, parameter server, all reduce
• Hardware Acceleration: GPUs, TPUs, specialized AI systems
• DL architectures: CNNs, RNNs, LSTMs, Transformers, GANs, Diffusion
models

8
Class 3: Cloud Technologies and ML Platforms
ML and Cloud Technologies
• ML system stack on cloud
• Docker, Kubernetes
Cloud Based ML Platforms
• ML as a service offering: AWS, Microsoft, Google, and IBM
• System stack, capabilities and tools support
• Deploying production ML applications using TorchX
• Scaling ML applications to a cluster using Ray
• Job scheduling on DL clusters

9
Class 4: Operational Machine Learning
• DevOps principles in machine learning
• DL deployment in production environments
• Automated Machine Learning, H2O AutoML
• Machine Learning Operationalization (MLOps)
• MLOps open-source platforms: Kubeflow, MLflow
• MLOps open-source tools
• Open Neural Network Exchange (ONNX)

10
Class 5: ML Monitoring and Benchmarking
• Monitoring tools: TensorBoard, resource usage using nvidia-smi
• Training-logs and their analysis
• Time-series analysis of monitoring data
• Drift detection and re-training
• MLPerf suite: MLPerf Training, MLPerf Inference, MLPerf Storage
• Time-to-Train performance metric

11
Class 6: Attention, Transformer, and Popular Large Language Models (LLMs)

• Seq2Seq models
• Encoder and decoder
• Attention mechanism
• Transformer architecture: self-attention, multi-head attention, encoder-
decoder attention
• LLMs: BERT, OpenAI GPT, Llama, Gemini, Claude, IBM Granite

12
Class 7: Prompt Engineering and LLM Apps
LLM Use Cases
• Translation
• Code Generation
• Summarization
• Proofread and Correct
• Math Calculation
• Entity Extraction
• Call Functions
Prompt Engineering
• Basics of Prompting
• Prompt Elements
• General Tips
• Examples
Prompt Engineering Techniques
• Zero-shot Prompting
• Few-shot Prompting
• Chain-of-Thought, Automatic CoT
• ...
LLM App Development
• LangChain
• LlamaIndex
• ...

13
Class 8: RAG and LLM Agents
Capabilities & Limitations of LLM
• Knowledge Cutoffs
• Hallucinations
• Structured Data Challenge
• Biases

Retrieval Augmented Generation


• Use Cases
• Semantic Search
• Summarization
• Keyword Search and Embeddings
• Retrieval and Rerank
• Answer Generation

Vector Databases

Functions, Tools and LLM Agents

Graph RAG and Agentic RAG


14
Class 9: Pre-Training for LLM
• Pre-training Concepts: training from existing foundation models; training from scratch; model selection from HuggingFace and PyTorch hubs
• Training Process for Different Architectures: encoder-only models; decoder-only models; sequence-to-sequence (seq-to-seq) models
• Managing High Memory Requirements: quantization techniques; challenges with consumer-grade hardware
• Optimizing Training Resources: Distributed Data Parallel (DDP); Fully Sharded Data Parallel (FSDP); Zero Redundancy Optimizer (ZeRO)
• Scaling Model Training: balancing model size, training data volume, and compute budget; insights from the Chinchilla study
• Use Cases for Custom LLM Pre-training: domain adaptation (e.g., law, medicine); introduction to BloombergGPT as an example

15
Class 10: Fine Tuning Techniques
• PEFT: introduction of PEFT, benefits, PEFT methods
• LoRA: introduction of LoRA, benefits, performance
• Multi-task Fine-Tuning: benefits, FLAN models
• Prompt Tuning: explanation, benefits
• Instruction Fine-Tuning Process: training data preparation, dataset splitting, prompt-completion pairs, performance evaluation
• Comparison and Practical Application (Instruction Fine-Tuning, Prompt Tuning and PEFT): LoRA vs Prompt Tuning; QLoRA and its benefits

16
Class 11: LLM Benchmarks
• Purpose & Motivation: evaluate performance, efficiency, and limitations; motivation for comprehensive benchmarks with performance and accuracy metrics
• Types of Benchmarking: model and system benchmarks; accuracy, speed, reliability, scalability
• Evaluation Metrics: ROUGE, BLEU, N-grams, Precision/Recall/F1
• Unseen Data Evaluation and Advanced Techniques: generalization, risks
• Benchmarks: GLUE, SuperGLUE, MMLU, BIG-bench, HELM
• Tools and Resources: MLPerf, LLMPerf, HuggingFace Leaderboard, fmperf
17
Class 12: Efficient Serving of LLMs
• Batching: static batching, dynamic batching, continuous batching
• GPU Memory Optimization Techniques: Flash Attention, Paged Attention, kernel optimization, unified paging
• LLM Serving Frameworks: vLLM, DeepSpeed-MII, TensorRT, HuggingFace TGI server
• Fine-Tuned Model Serving: vLLM, S-LoRA, LoRAX
• Performance & Trade-offs: throughput; performance vs. fairness in scheduling/routing
• Resource Optimization: GPU sharing, GPU optimization, multiplexing
18
Class 13: RLHF
• Introduction: concept (human feedback guides reinforcement learning); purpose (align LLMs with human values/preferences)
• RLHF Fine-Tuning Methods: instruction fine-tuning (tailor LLMs to follow specific instructions); path methods (guide learning based on human feedback); toxicity/misinformation (mitigate harmful content)
• Challenges in RLHF: human alignment (ensure helpful, honest, harmless models)
• RLHF Process: reward model training (evaluate outputs, assign rewards); automated labelers (replace human labelers, select completions); LLM updating (align models with human feedback)
• Proximal Policy Optimization (PPO): overview (fine-tuning via reinforcement learning); RLHF application (iterative updates, maximize rewards)
• Constitutional AI: concept (ethical AI alignment with human values); role in RLHF (ensure adherence to societal norms, trust, safety)

19
Class 14: Multimodal Generative AI Systems
• Definition & Importance: process multiple data types (text, images, audio, video); importance for human-like perception and interaction
• Beyond Language Models: expansion to incorporate visual, auditory, and sensory inputs; examples: DALL-E, GPT-4 with vision
• Creating Large Multimodal Models (LMMs): incorporation of additional modalities into LLMs; encoder-decoder and transformer-based architectures; cross-modal and contrastive learning for training; Flamingo as a visual language model with few-shot learning
• The Multimodality Revolution: shift from unimodal to multimodal AI; advancements in computer vision, speech recognition, and NLP; integration into unified models
20
Recommended Books
• This course does not follow any textbook
• List of books (covering deep learning topics)
• Charu Aggarwal “Neural Networks and Deep Learning”, available at
https://link.springer.com/book/10.1007/978-3-319-94463-0
• Goodfellow, Bengio, Courville, “Deep Learning”, available at
http://www.deeplearningbook.org
• These books are good for material covered in Module 1 and 2 of syllabus.
• For basics of machine learning concepts, an excellent textbook is G. James et al., “An Introduction to Statistical Learning”. The second edition is available for free download at https://www.statlearning.com
• All other reading material will be posted on Canvas.
21
Assignments and Grading
• Distribution of marks:
• Assignments: 30%
• Quizzes: 10%
• Final Project: 30%
• Final Exam: 30%
• Assignments (30%)
• 5 assignments
• Assignments posted at the end of lectures 2, 4, 6, 8, 10; due in 2 weeks
• All programming assignments should be done as Jupyter notebooks, unless
specified otherwise
• Quizzes (10%)
• Canvas
• 5 quizzes
22
Assignments and Grading (contd.)
• Final Project (30%): Team assignment. Team of 2. Any project involving
performance of LLM systems. 1-page project proposal due by
Midterm. Detailed rubric shall be provided. Project grading:
• Project proposal (5%) – due in mid-October
• Midpoint checkpoint (5%) – due before Thanksgiving break
• GitHub repo with README, documented code (5%)
• Final presentation and demo (15%)
• Final Exam (30%): In person exam

23
Class Logistics
• Reach CAs: Office hours (over Zoom) and Ed
• Access to Computer Clusters
• Class communications: Ed

24
AI Timeline
• Inflection points in AI adoption are closely tied to innovations in computing and the availability of big data
• 2006: Amazon Elastic Compute Cloud and S3 were launched; the term “cloud computing” was coined around the same time
25


Factors Contributing to AI Success
• Algorithms, Data, Compute, Applications
• Distributed training algorithms scaling up to 100s of GPUs
• Data growing at an exponential rate: Internet, social media, Internet of Things (IoT)
• Compute power growth with specialized cores: GPUs, TPUs
• Development of innovative applications
• 2012: AlexNet by Krizhevsky et al. at the ImageNet competition
• Simple convolutional neural network: 5 convolutional, 3 fully connected layers
• GPU based; beat other models by an 11% margin
• Triggered a “Cambrian Explosion” in deep learning technologies
“Neural networks are growing and evolving at an extraordinary rate, at a lightning rate, … What started out just five years ago with AlexNet … five years later, thousands of species of AI have emerged.”

26
Evolution of Large Language Models (LLMs)

Figure from Mohamadi et al, ChatGPT in the Age of Generative AI and Large Language Models: A Concise Survey
27
AI at US Open
• https://www.ibm.com/case-studies/us-open
• Generative AI models for generating content
• IBM watsonx AI and data platform built for business
• IBM Granite foundation models
• watsonx.data: to connect and curate the USTA’s trusted data sources
• Foundation models were trained to translate tennis data into cogent descriptions
  • summarizing entire matches in the case of Match Reports
  • generating sentences that describe the action in highlight reels for AI Commentary

“Foundation models are incredibly powerful and are ushering in a new age of generative AI. But to generate meaningful business outcomes, they need to be trained on high-quality data and develop domain expertise. And that’s why an organization’s proprietary data is the key differentiator when it comes to AI.”
Shannon Miller, IBM Consulting
28
AI Blogs of Major Companies
• Meta: https://ai.meta.com/blog/
• Google: https://ai.googleblog.com
• IBM Research:
https://www.ibm.com/blogs/research/category/ai/
• Microsoft: https://news.microsoft.com/source/topics/ai/
• AWS Machine Learning Blog:
https://aws.amazon.com/blogs/machine-learning/

29
Machine Learning System
A composition of one or more software components,
with possible interactions, deployed on a hardware
platform with the purpose of achieving some
performance objective.
A Machine Learning system is a system where one or
more software components are machine learning
based.
• Why study ML systems?
• Algorithms run on real and (possibly) faulty hardware in production environments
• Theoretical performance is far from observed performance
• To characterize holistic performance: not just the algorithm but the end-to-end performance of the entire system

30
Constituents of a ML System

(Figure: a Machine Learning System comprises Hardware/Infrastructure, Software Platform, Algorithm(s), and Data, which together produce the Model)

31
Infrastructure
• Compute units and accelerators, memory, storage, network
• Resources can be acquired as bare metal, VMs/Containers on cloud
• Hardware can help improve performance pretty much everywhere in
the pipeline
• Design better hardware
• Adapt existing architectures to ML tasks.
• Develop brand-new architectures for ML.
• Hardware compute precision affects performance: tradeoff between
accuracy and runtime

32
(Learning) Algorithm
• General and domain specific architectures
• Hyperparameter tuning to extract the best performance
• Affects the resource requirements: compute (FLOPS), memory
• Performance (runtime) and scalability of an algorithm depends on:
• Hardware/Infrastructure
• Software platform (frameworks, libraries, drivers)

33
Data
• Data is a critical element; data is king in ML
• Different modalities: Audio, video, images, text
• Data sources, collection, labeling, quality, data storage
• Data type determines the choice of learning algorithm
• Making the data business ready is challenging
• Many data-driven organizations are spending 80 percent of their time
on data preparation and find it a major bottleneck.
• DataOps: tools, processes, and organizational structures to cope with
significant increase in volume, velocity, and variety of data.
34
Software Engineering in ML Systems
• Machine learning applications run as pipelines that ingest data, compute features, identify model(s), discover hyperparameters, train model(s), validate and deploy model(s).
• Making a model a production-capable web service
  • Containerization (Docker), cluster deployment (K8s)
  • APIs exposed as a web service (TensorFlow Serving/ONNX Runtime)
• Workflow engines (e.g., Kubeflow) automate the ML pipeline
• Deployment monitoring and operational analytics
• DevOps principles applicable to ML systems:
  • Continuous Integration, Continuous Delivery (CI/CD)
  • Predictability
    • “A model may be unexplainable—but an API cannot be unpredictable”
  • Reproducibility and Traceability
    • Provenance for machine learning artifacts
• ML-specific testing and monitoring, apart from traditional software testing:
  • Data testing
  • Infrastructure testing
  • Model testing
  • Production testing
35
Creating an ML System
• System Objective: What problem will my system solve ? What is the target deployment scenario? What are
the performance objectives ?
• Solution Approach: What is my solution and its components ?
• Data Collection
• Identifying the data sources: What are my data sources ?
• Collecting the data
• Data Preparation
• Preparing the data: Is my data business ready ?
• Ingesting the data: What is the right storage for my data ?
• Model Development
• Identification and training
• Model evaluation
• Hyperparameter Tuning
• Model Deployment
• Model optimization (if needed) for the deployment infrastructure
• Model packaging and deployment
• Monitoring and Feedback
• Is my deployed model performing as expected ?
• Is there a drift in model performance which requires re-training ?
36
Inhibitors in Successful Implementation of
ML Solutions
• Deployment and automation
• Reproducibility of models and predictions
• Diagnostics
• Governance and regulatory compliance
• Scalability
• Collaboration
• Monitoring and management

37
Cloud Computing
• Access to computing resources and storage on demand
• Pay-as-you-go model
• Heterogeneous resources: GPUs, CPUs, storage type
• Different offering models: IaaS, PaaS, SaaS, MLaaS
• Different deployment models: Public, private, hybrid cloud
• Provisioning, maintenance, monitoring, life-cycle-management

38
Cloud and AI
• AI
• Harness power of Big Data and
compute
• Cloud
• Access to Big Data
• Platform to quickly develop,
deploy, and test AI solutions
• Ease in AI reachability

Yuichi Yoda 39
Cloud based Machine Learning Services
• IBM Watson Studio
https://www.ibm.com/products/watson-studio
• Amazon Sagemaker
https://aws.amazon.com/sagemaker
• Microsoft Azure Machine Learning
https://azure.com/ml
• Google Vertex AI Platform
https://cloud.google.com/vertex-ai/

40
Deep Learning on Cloud Stack

41
AI Model Training Lifecycle
Performance considerations at each stage
• Data preprocessing: de-noising, de-biasing,
train/test set creation
• Feature engineering: search efficient data
transformations
• Model training: model identification/synthesis,
hyperparameter tuning, regularization
• Model hardening: efficient adversarial training
• Model serving: hardware, model pruning and
compression
• Monitoring: response time, drift detection
• Continuous learning: model adaptability, retraining
42
Practical Machine Learning Systems



43
Class 1: Fundamentals of Deep Learning (DL)

44
Linear Regression

45
Mean Square Error (MSE)
(Figure: CASE 1, CASE 2, CASE 3, each comparing true values and predicted values)
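For reference, MSE has the standard definition (not shown on the slide; y_i are the true values and ŷ_i the predictions over n points):

MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²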

46
Overfitting and Underfitting
• Overfitting: model performs well on training data but does not generalize well to unseen data (test data)
• Underfitting: model is not complex enough to capture the pattern in the training data well and therefore suffers from low performance on unseen data

47
Bias of a model

48
49
50
51
52
53
54
55
Model Complexity Tradeoffs
(Figure: CASE 1, CASE 2, CASE 3)

56
Model Complexity Tradeoffs
• Simple model
  • Fails to completely capture the relationship between features
  • Introduces bias: consistent test error across different choices of training data
  • Low variance
  • Increasing training data does not help in reducing bias
• Complex model captures nuances in training data, causing overfitting (generalization gap)
  • Low bias
  • Train error << Test error
  • With different training instances, the model prediction for the same test instance will be very different – high variance
57
Bias-Variance Tradeoff

• High bias, low variance: increase model complexity
• High variance, low bias: increase training data

58
Regularization
• Techniques used to improve generalization of a model by reducing its complexity
• Techniques to make a model perform well on test data, often at the expense of its performance on training data
• Avoid overfitting, reduce variance
• Simpler models are preferable: low memory, increased interpretability
• However, simpler models may reduce the expressive power of models
• Sample techniques:
  • Parameter norm penalties
  • L2 and L1 norm weight decay
  • Noise injection
  • Dropout

59
Regularization in Regression
L2 Regularization Loss (Ridge Regression)

L1 Regularization Loss (LASSO)
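For reference, these losses have the standard form (a sketch consistent with the slide titles, not copied from the slide; w are the regression weights and λ the regularization strength):

Ridge (L2): L(w) = Σ_i (y_i − w·x_i)² + λ Σ_j w_j²
LASSO (L1): L(w) = Σ_i (y_i − w·x_i)² + λ Σ_j |w_j|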

What value of lambda to choose ?

60
Bias-variance tradeoff with Lambda

61
Performance Metrics
• Algorithmic performance: accuracy, precision, recall, F1-score, ROC
• System performance: training time, inference time, training cost, memory requirement, training efficiency

62
Accuracy, Precision, Recall, Specificity
Confusion matrix (predicted value vs. true value):
• true positive (tp): predicted positive, actually positive
• false positive (fp): predicted positive, actually negative
• false negative (fn): predicted negative, actually positive
• true negative (tn): predicted negative, actually negative

• Precision = tp / (tp + fp); false discovery rate = 1 − Precision
• Sensitivity (recall, true positive rate) = tp / (tp + fn)
• Specificity = tn / (tn + fp); false positive rate = 1 − Specificity
• Balanced accuracy = (Sensitivity + Specificity) / 2
  • Considers all entries in the confusion matrix
  • Value lies between 0 (worst classifier) and 1 (best classifier)
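As a minimal illustration, these metrics can be computed from predicted and true labels in Python (a sketch; the toy arrays and NumPy usage are assumptions, not course material):

import numpy as np

def confusion_counts(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # toy labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # toy predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                  # sensitivity, true positive rate
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)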

63
Balancing Precision and Recall
• F1 score: harmonic mean of precision and recall; a measure of classifier accuracy
  F1 = 2 · (Precision · Recall) / (Precision + Recall)
• F_β score: β = 1 gives the F1 score; recall is considered β times as important as precision
  F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

64
Performance dependence on threshold
• Classifier gives prediction score (the probability of belonging to positive class)
and then we apply a threshold on the score to predict the class
• Metrics like accuracy, precision, recall, Fbeta score are all calculated on the
predicted classes and not on prediction scores
• Metric value depends on the threshold

65
Receiver Operating Characteristic (ROC)
• Plots true positive rate (recall) vs. false positive rate at different thresholds
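A sketch of computing the ROC curve from prediction scores with scikit-learn (assuming scikit-learn is available; the toy labels and scores are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                     # toy labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]    # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                   # area under the ROC curve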

66
Runtime (training time) of ML
95 OpenML datasets, 7 classification algorithms, fit time
• For the same dataset, depending on the choice of classifier, the runtime can differ by orders of magnitude
• Runtime has algorithmic, dataset, and system-level dependencies
• Runtime depends on how the model scales with different features of the input: training data size, linear separability, complexity (number of output classes for classification problems), algorithmic hyperparameters, statistical meta-features of the data (mean size per class, log number of features)
67
DL Training Time
• DL performance is closely tied to the hardware
  • compute power, memory, network
• Tesla V100: 640 tensor cores (> 100 TFLOPS), 16 GB
• NVIDIA NVLink: 300 GB/s
• Volta-optimized CUDA libraries

68
DL Inference Throughput

69
Deep Learning Training
(Figure: forward phase, compute loss, backward phase)

• Forward phase
• Loss calculation
• Backward phase
• Weight update
70
Deep Learning Training Steps

• Forward phase:
  • compute the activations of the hidden units based on the current value of weights
  • calculate the output
  • calculate the loss function

• Backward phase:
  • compute the partial derivatives of the loss function w.r.t. all the weights
  • use the backpropagation algorithm to calculate the partial derivatives recursively
  • backpropagation changes the weights (and biases) in a network to decrease the loss

• Update the weights using gradient descent

71
Gradient Descent

• Loss is calculated over all the training points at each weight update
• Memory requirements may be prohibitive
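The full-batch update the slide refers to has the standard form (a sketch; w are the weights, α the learning rate, and the loss is summed over all n training points):

w ⇐ w − α · ∇_w Σ_{i=1..n} L_i(w)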

72
Stochastic Gradient Descent (SGD)

• Loss is calculated using one training point at each weight update

• Stochastic gradient descent is only a randomized approximation of the true loss function.

73
Mini-batch Gradient Descent

• A batch 𝐵 of training points is used in a single update of weights


• Increases the memory requirements: layer outputs are matrices instead of vectors, and in the backward phase matrices of gradients are calculated.
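A minimal NumPy sketch of mini-batch gradient descent for linear regression (illustrative only; the toy data, learning rate, and batch size are assumptions, not course code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                               # toy features
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)                 # toy targets

w = np.zeros(5)                                              # model weights
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    perm = rng.permutation(len(X))                           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                 # one mini-batch B
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)           # gradient of MSE on the mini-batch
        w -= lr * grad                                       # weight update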

74
Hyperparameters in Deep Learning
• Network architecture: number of hidden layers, number of hidden units per layer
• Activation functions
• Weight initializer
• Learning rate
• Batch size
• Momentum
• Optimizer

75
Popular Activation Functions
• Sigmoid: φ(z) = 1 / (1 + e^(−z))
• Tanh: φ(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• ReLU (Rectified Linear Unit): φ(z) = max{z, 0}

(Figure: plots of the sigmoid, tanh, and ReLU activation functions)
76
Weight Initialization
• Initializing all weights to the same value will cause neurons to evolve symmetrically
• Generally, biases are initialized with 0 values and weights with random numbers; initializing weights to random values breaks symmetry and enables different neurons to learn different features
• Initial value of weights is important:
  • Poor initializations can lead to bad convergence or no learning.
  • Instability across different layers (vanishing and exploding gradients).

77
Vanishing and Exploding Gradients

∂L/∂h_t = φ′(w_(t+1) · h_t) · w_(t+1) · ∂L/∂h_(t+1)

• For sigmoid activation, φ′(z) = φ(z)(1 − φ(z)), which has a maximum value of 0.25 at φ(z) = 0.5
• Each ∂L/∂h_t will be less than 0.25 times ∂L/∂h_(t+1)
• As we (back)propagate further, the gradient keeps decreasing; after r layers the gradient reduces to 0.25^r (≈ 10^(−6) for r = 10) of its original value, causing the update magnitudes of earlier layers to be very small compared to later layers => vanishing gradient problem
• If we use an activation with a larger gradient and larger weights => the gradient may become very large during backpropagation (exploding gradients)
• Improper initialization of weights also causes vanishing (too-small weights) or exploding (too-large weights) gradients
78
Activation Function Derivatives
(Figure: derivatives of the sigmoid, tanh, and ReLU activation functions)

• Sigmoid and tanh derivatives vs. ReLU
• Sigmoid and tanh gradients saturate at large values of the argument; very susceptible to the vanishing gradient problem
• ReLU is faster to train; most commonly used activation function in deep learning
79
Vanishing and Exploding Gradients

80
Popular Weight Initializers
(Example: r_in = 4, r_out = 3)
• Xavier initialization
  • Each neuron weight is sampled from a 0-mean Gaussian distribution with standard deviation sqrt(2 / (r_in + r_out)), where r_in and r_out are the number of input and output weights for the neuron
  • Used with sigmoid or tanh activations
  • Also referred to as Glorot initialization (as in Keras)
• He initialization
  • Sample weights from a 0-mean Gaussian distribution with standard deviation sqrt(2 / r), where r can be r_in or r_out
  • Used with ReLU activation (see the PyTorch sketch below)

https://www.deeplearning.ai/ai-notes/initialization/ 81
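As an illustration, Xavier/Glorot and He initialization in PyTorch might look like this (a sketch; the layer sizes follow the r_in = 4, r_out = 3 example above):

import torch.nn as nn

layer = nn.Linear(4, 3)                                           # r_in = 4, r_out = 3
nn.init.xavier_normal_(layer.weight)                              # Glorot/Xavier: std = sqrt(2 / (r_in + r_out))
nn.init.zeros_(layer.bias)                                        # biases initialized to 0

relu_layer = nn.Linear(4, 3)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # He initialization for ReLU layers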
Learning rate: large value vs small value
Learning rate schedule
• Start with a higher learning rate to explore the loss space => find good starting values for the weights
• Use smaller learning rates in later steps to converge to a minimum => tune the weights slowly
• Different choices of decay functions: exponential, inverse, multi-step, polynomial
  • Inverse: α_t = α_0 / (1 + γ·t)
  • Exponential: α_t = α_0 · exp(−γ·t)
  • Polynomial (n = 1 gives linear): α_t = α_0 · (1 − t / t_max)^n
  • Multi-step: α_t = α_0 · γ^n at step n
• Babysitting the learning rate
• Training with different learning rate decay
• Keras learning rate schedules and decay
• Other new forms: cosine decay
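A small Python sketch of these decay schedules (illustrative; the values of α_0, γ, n, t_max, and the milestones are assumptions):

import math

alpha0, gamma, n, t_max = 0.1, 0.05, 2, 100        # assumed hyperparameter values

def inverse_decay(t):      return alpha0 / (1 + gamma * t)
def exponential_decay(t):  return alpha0 * math.exp(-gamma * t)
def polynomial_decay(t):   return alpha0 * (1 - t / t_max) ** n     # n = 1 gives linear decay
def multistep_decay(t, milestones=(30, 60, 90), step_gamma=0.1):
    k = sum(t >= m for m in milestones)            # number of milestones already passed
    return alpha0 * step_gamma ** k

for t in (0, 10, 50, 90):
    print(t, inverse_decay(t), exponential_decay(t), polynomial_decay(t), multistep_decay(t))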

83
Batch size
• Effect of batch size on learning
• Batch size is restricted by the GPU memory (12 GB for K40, 16 GB for P100 and V100) and the model size
  • Model and a batch of data need to remain in GPU memory for one iteration
• Are you restricted to working with small mini-batches for large models and/or GPUs with limited memory?
  • No, you can simulate a large batch size by delaying gradient/weight updates to happen every n iterations (instead of n = 1); supported by frameworks as gradient accumulation (sketched below)
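A gradient-accumulation sketch in PyTorch (illustrative; the toy model, data, and accumulation factor are assumptions, not course code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4                                       # effective batch = accum_steps x micro-batch size

optimizer.zero_grad()
for step in range(20):
    x, y = torch.randn(8, 10), torch.randn(8, 1)      # one micro-batch of toy data
    loss = loss_fn(model(x), y) / accum_steps         # scale so accumulated gradients average over the large batch
    loss.backward()                                   # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                              # one weight update every accum_steps iterations
        optimizer.zero_grad()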

84
What Batch Size to Choose?
• Hardware constraints (GPU memory) dictate the largest batch size
• Should we try to work with the largest possible batch size?
  • Large batch size gives more confidence in gradient estimation
  • Large batch size allows you to work with higher learning rates, faster convergence
  • Large batch size leads to poor generalization (Keskar et al 2016)
    • Lands on sharp minima, whereas small-batch SGD finds flat minima which generalize better

85
Learning Rate and Batch Size Relationship
• “Noise scale” in stochastic gradient descent (Smith et al 2017):
  g = ε (N/B − 1) ≈ ε N / B for B ≪ N
  where N is the training dataset size, B the batch size, and ε the learning rate
• There is an optimum noise scale g which maximizes the test set accuracy (at constant learning rate)
  • Introduces an optimal batch size proportional to the learning rate when B ≪ N
• Increasing the batch size has the same effect as decreasing the learning rate
  • Achieves near-identical model performance on the test set with the same number of training epochs but significantly fewer parameter updates
Learning rate decrease Vs Batch size increase

87
Gradient Descent Convergence

• GD convergence is poor due to differences in gradient values along different dimensions
• Effective descent direction gets away from the minimum if we use a finite learning rate
• Gradient descent might also get trapped at saddle points and/or local minima

We need to:
• Move quickly in directions with small but consistent (pointing in one direction, +ve or −ve) gradients.
• Move slowly in directions with big but inconsistent (oscillating between −ve and +ve) gradients.

C. Aggarwal. Neural Networks and Deep Learning


Gradient Descent with Momentum
(Figure: loss vs. value of a neural network parameter; (a) relative directions of the starting point and optimum, (b) without momentum GD slows down in flat regions and gets trapped in a local optimum, (c) with momentum)

• Add momentum to GD updates (see the update rule sketched below)
• Learning is accelerated as oscillations are damped and updates progress in the consistent directions of loss decrease
• Enables working with large learning rate values and hence faster convergence

C. Aggarwal. Neural Networks and Deep Learning
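The momentum update referred to above has the standard form (a sketch in common notation, not the slide's exact symbols; β is the momentum parameter and α the learning rate):

v ⇐ β·v − α·∇_w L(w)
w ⇐ w + v

With β = 0 this reduces to plain gradient descent; β close to 1 keeps more of the previous update direction.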


Nesterov Momentum
• Simple momentum-based updates cause the solution to overshoot the target minimum
• Idea is to use some lookahead in computing the updates
  • Put on the brakes as the marble reaches near the bottom of the hill
• Differs from the standard momentum method in terms of where the gradient is computed
Parameter-specific learning rates
• Apply a different learning rate to each parameter at each step
• Encourage faster relative movement in gently sloping direction
• Penalize dimension with large fluctuations in gradient
• Several optimizers: AdaGrad, RMSProp, RMSProp + Nesterov momentum, AdaDelta, Adam
• Differ in the manner in which the parameter-specific learning rates are calculated (see the PyTorch sketch below)
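As an illustration, these adaptive optimizers can be swapped in PyTorch (a sketch; the model and hyperparameter values are placeholders, not course settings):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # placeholder model
# Each optimizer keeps per-parameter statistics that scale the effective learning rate:
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01, momentum=0.9)
opt_adam    = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))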
Normalizing Input Data
• Min-max normalization (for feature j of input datapoint i):
  x_ij ⇐ (x_ij − min_j) / (max_j − min_j)
  • Data with smaller standard deviation; scaled to be in the range [0, 1]
  • Lessens the effect of outliers
• Standardization:
  x_ij ⇐ (x_ij − mean_j) / std_dev_j
• Normalization helps in the convergence of the optimization algorithm
• Should apply the same normalization parameters to both train and test set
  • Normalization parameters are calculated using train data
• Training converges faster when the inputs are normalized
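A sketch of applying the same normalization parameters to train and test data (illustrative NumPy code; the toy data are assumptions):

import numpy as np

X_train = np.random.rand(100, 3)          # toy training data
X_test  = np.random.rand(20, 3)           # toy test data

mean = X_train.mean(axis=0)               # normalization parameters from train data only
std  = X_train.std(axis=0)

X_train_std = (X_train - mean) / std      # standardize training data
X_test_std  = (X_test - mean) / std       # apply the SAME train parameters to test data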

92
Batch Normalization
• Internal covariate shift – change in the distribution of network activations due to change in network parameters during training
• Idea is to reduce internal covariate shift by applying normalization to the inputs of each layer
• Achieve a fixed distribution of inputs at each layer
• Normalization for each training mini-batch
• Batch normalization enables training with larger learning rates
  • Reduces the dependence of gradients on the scale of the parameters
• Faster convergence and better generalization

93
Batch Normalization
(Figure: the batch normalization algorithm)
Why this step?
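The batch normalization transform shown in the figure has the standard form (not copied from the slide; m is the mini-batch size, γ and β are learned scale and shift parameters, ε a small constant for numerical stability):

μ_B = (1/m) Σ_i x_i
σ_B² = (1/m) Σ_i (x_i − μ_B)²
x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε)
y_i = γ·x̂_i + β

The final scale-and-shift step preserves the layer's representational capacity: the network can learn γ and β to undo the normalization if that is what minimizes the loss.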

94
L2 and L1 Regularization in Neural Networks
• L2 regularization loss (the penalty term acts as weight decay)
• L1 regularization loss
95
L2 vs L1 Regularization in Neural Networks
• Value of lambda (hyperparameter) can be tuned using the validation set
• L1 regularization leads to sparse weight matrices; used to determine edges to prune
• Both L2 and L1 regularization move the weights progressively towards 0
• Multiplicative vs additive weight decay (see the sketch below)
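In practice, L2 weight decay and an additive L1 penalty might be combined like this in PyTorch (a sketch; the model, λ values, and toy data are assumptions):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# L2 regularization via the optimizer's weight_decay argument
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)                 # toy batch
l1_lambda = 1e-4
loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())    # additive L1 penalty on the loss
(loss + l1_lambda * l1_penalty).backward()
optimizer.step()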

96
Early Stopping

• Stop the training when validation error starts rising, to prevent overfitting
• Early stopping is an implicit regularization technique
• Done in hindsight; define a performance criterion and checkpoint the latest “best” model
• May not help with large datasets with less likelihood of overfitting
• No principled approach to early stopping; can be tricky when the validation error has multiple local minima
• L2 regularization (with a proper value of the regularizer) may achieve similar or better performance than early stopping
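A minimal early-stopping loop in PyTorch (a sketch; the toy model, data, and patience value are assumptions, not course code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
X_train, y_train = torch.randn(200, 10), torch.randn(200, 1)   # toy data
X_val,   y_val   = torch.randn(50, 10),  torch.randn(50, 1)

best_val, best_state, patience, bad_epochs = float('inf'), None, 5, 0
for epoch in range(100):
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:                                     # validation improved: checkpoint
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                              # stop when validation stops improving
            break
model.load_state_dict(best_state)                               # restore the latest "best" model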
97
Dropout
• Dropout is a regularization technique to deal with the overfitting problem and improve generalization
• Prevents co-adaptation of activation units
  • a feature detector is only helpful in the context of several other specific feature detectors
• Probabilistically drop input features or activation units in hidden layers
• Approximately combines exponentially many different neural network architectures
• Layer-dependent dropout probability (~0.2 for input, ~0.5 for hidden)
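A sketch of layer-dependent dropout in PyTorch (the layer sizes and probabilities are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),          # drop ~20% of input features
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # drop ~50% of hidden activations
    nn.Linear(256, 10),
)
model.train()                   # dropout active during training
model.eval()                    # dropout disabled at inference time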

98
Dataset Augmentation
• Artificially enlarge the training set by adding
transformations/perturbations of the training data
• Domain-specific transformations
• Provides more training data
• Helps in model generalization and prevents overfitting
• Augmentation techniques (images):
• Horizontal and Vertical shift
• Horizontal and Vertical flip
• Rotation
• Brightness
• Erosion and Dilation
• Noising
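For images, such augmentations are commonly composed with torchvision (a sketch, assuming torchvision is available; the parameter values are illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # horizontal/vertical shift
    transforms.ToTensor(),
])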

99
Prepare for Lecture 2
• Access to compute cluster
• Set up your GCP account (you may want to work with Google Colab)
  • We will provide you GCP coupons
• Get an account and familiarize yourself with cloud computing clusters
• First homework posted on 09/12

100
