
1

[COMSE6998-015] Fall
2024
Introduction to Deep
Learning and LLM based
Generative AI Systems

Lecture 1 09/03/24
Class Introduction
• Instructors
Parijat Dube <pd2637@columbia.edu>
Research Staff Member at IBM Research, NY
Machine learning, deep learning, system performance optimization, generative AI for enterprise automation

Chen Wang <cw3687@columbia.edu>
Research Staff Member at IBM Research, NY
Kubernetes, container cloud platforms, data-driven/QoE-based resource management, AI4Sys & Sys4AI, LLM serving & fine-tuning, sustainable cloud & AI systems

2
• Course Assistants:
Anmol Jain aj3231@columbia.edu
Dhruvi Shah dms2338@columbia.edu
Harshvardhan Srivastava hs3447@columbia.edu

Class Introduction
• Prerequisites:
• Knowledge of ML and use of ML algorithms
• Programming in Python

• Class on CourseWork:
https://courseworks2.columbia.edu/courses/203899
All information about the class will be available here including
syllabus, announcements and assignments

3
Today’s Agenda

• Course Overview
• Syllabus
• Assignments and Grading
• Logistics
• Machine Learning Systems
• Machine Learning on Cloud and Model Lifecycle
• Model Performance and Complexity Tradeoffs
• Class 1 Topics

4
Course Information
• What this course will cover?
• DL training architectures, hyperparameters
• LLM pre-training, fine-tuning and inference serving systems
• Cloud-based DL/LLM systems and performance issues
• DL/LLM systems performance evaluation tools, techniques, benchmarks
• DL/LLM systems performance optimization
• Programming assignments involving GPUs on cloud
• Research paper readings
• What this course will not cover?
• Details about specific ML algorithms
• Mathematical analysis of ML/DL algorithms

5
Educational Objectives
• Identify different components of the DL/LLM system stack and their interdependencies
• Knowledge of the ML/LLM model lifecycle and the steps in making a trained model production ready
• Train a DL/LLM model and make it a web service for inferencing
• Performance considerations, tools, and techniques at different stages of the model lifecycle: development, testing, deployment
• DL/LLM training pitfalls and techniques/best practices for data processing
• Ability to train DL/LLM models on cloud platforms using GPUs
• Performance characterization of DL/LLM systems
• Knowledge of DL/LLM benchmarks and performance metrics
• Performance optimization of DL/LLM systems

6
Class 1: Fundamentals of Deep Learning (DL)
• ML performance concepts/techniques: overfitting, generalization, bias,
variance tradeoff, regularization
• Performance metrics: algorithmic and system level
• DL training: backpropagation, activation functions, data preprocessing,
batch normalization, SGD and its variants, exploding and vanishing
gradients, weight initialization
• DL training hyperparameters
• batch size, learning rate, momentum, weight decay
• Regularization techniques in DL Training: dropout, early stopping, data
augmentation

7
Class 2: Distributed Training and Standard
DL Architectures
• Single node training
• Model and Data Parallelism
• Distributed training, parameter server, all reduce
• Hardware Acceleration: GPUs, TPUs, specialized AI systems
• DL architectures: CNNs, RNNs, LSTMs, Transformers, GANs, Diffusion
models

8
Class 3: Cloud Technologies and ML Platforms
ML and Cloud Technologies
• ML system stack on cloud
• Docker, Kubernetes
Cloud Based ML Platforms
• ML as a service offering: AWS, Microsoft, Google, and IBM
• System stack, capabilities and tools support
• Deploying production ML applications using TorchX
• Scaling ML applications to a cluster using Ray
• Job scheduling on DL clusters

9
Class 4: Operational Machine Learning
• DevOps principles in machine learning
• DL deployment in production environments
• Automated Machine Learning, H2O AutoML
• Machine Learning Operationalization (MLOps)
• MLOps open-source platforms: Kubeflow, MLflow
• MLOps open-source tools
• Open Neural Network Exchange (ONNX)

10
Class 5: ML Monitoring and Benchmarking
• Monitoring tools: TensorBoard, resource usage using nvidia-smi
• Training-logs and their analysis
• Time-series analysis of monitoring data
• Drift detection and re-training
• MLPerf suite: MLPerf Training, MLPerf Inference, MLPerf Storage
• Time-to-Train performance metric

11
Class 6: Attention, Transformer, and Popular Large Language Models (LLMs)

• Seq2Seq models
• Encoder and decoder
• Attention mechanism
• Transformer architecture: self-attention, multi-head attention, encoder-
decoder attention
• LLMs: BERT, OpenAI GPT, Llama, Gemini, Claude, IBM Granite

12
Class 7: Prompt Engineering and LLM Apps
LLM Use Cases
• Translation
• Code Generation
• Summarization
• Proofread and Correct
• Math Calculation
• Entity Extraction
• Call Functions
Prompt Engineering
• Basics of Prompting
• Prompt Elements
• General Tips
• Examples
Prompt Engineering Techniques
• Zero-shot Prompting
• Few-shot Prompting
• Chain-of-Thought, Automatic CoT
• ...
LLM App Development
• LangChain
• LlamaIndex
• ...

13
Class 8: RAG and LLM Agents
Capabilities & Limitations of LLM
• Knowledge Cutoffs
• Hallucinations
• Structured Data Challenge
• Biases

Retrieval Augmented Generation


• Use Cases
• Semantic Search
• Summarization
• Keyword Search and Embeddings
• Retrieval and Rerank
• Answer Generation

Vector Databases

Functions, Tools and LLM Agents

Graph RAG and Agentic RAG


14
Class 9: Pre-Training for LLM
• Pre-training Concepts: training from existing foundation models; training from scratch; model selection from HuggingFace and PyTorch hubs
• Training Process for Different Architectures: encoder-only models; decoder-only models; sequence-to-sequence (seq-to-seq) models
• Managing High Memory Requirements: quantization techniques; challenges with consumer-grade hardware
• Optimizing Training Resources: Distributed Data Parallel (DDP); Fully Sharded Data Parallel (FSDP); Zero Redundancy Optimizer (ZeRO)
• Scaling Model Training: balancing model size, training data volume, and compute budget; insights from the Chinchilla study
• Use Cases for Custom LLM Pre-training: domain adaptation (e.g., law, medicine); introduction to BloombergGPT as an example

15
Class 10: Fine Tuning Techniques
• PEFT: introduction of PEFT, benefits, PEFT methods
• LoRA: introduction of LoRA, benefits, performance
• Multi-task Fine-Tuning: benefits, FLAN models
• Prompt Tuning: explanation, benefits
• Instruction Fine-Tuning Process: training data preparation, dataset splitting, prompt-completion pairs, performance evaluation
• Comparison and Practical Application (Instruction Fine-Tuning, Prompt Tuning and PEFT): LoRA vs Prompt Tuning; QLoRA and its benefits

16
Class 11: LLM Benchmarks
• Purpose & Motivation: evaluate performance, efficiency, and limitations; motivation for comprehensive benchmarks with performance and accuracy metrics
• Types of Benchmarking: model and system benchmarks; accuracy, speed, reliability, scalability
• Evaluation Metrics: ROUGE, BLEU, N-grams, Precision/Recall/F1
• Unseen Data Evaluation and Advanced Techniques: generalization, risks
• Benchmarks: GLUE, SuperGLUE, MMLU, BIG-bench, HELM
• Tools and Resources: MLPerf, LLMPerf, HuggingFace Leaderboard, fmperf
17
Class 12: Efficient Serving of LLMs
• Batching: static batching, dynamic batching, continuous batching
• GPU Memory Optimization Techniques: Flash Attention, Paged Attention, kernel optimization, unified paging
• LLM Serving Frameworks: vLLM, DeepSpeed-MII, TensorRT, HuggingFace TGI server
• Fine-Tuned Model Serving: vLLM, S-LoRA, LoRAX
• Performance & Trade-offs: throughput; performance vs. fairness in scheduling/routing
• Resource Optimization: GPU sharing, GPU optimization, multiplexing
18
Class 13: RLHF
• Introduction: concept (human feedback guides reinforcement learning); purpose (align LLMs with human values/preferences)
• RLHF Fine-Tuning Methods: instruction fine-tuning (tailor LLMs to follow specific instructions); path methods (guide learning based on human feedback); toxicity/misinformation (mitigate harmful content)
• Challenges in RLHF: human alignment (ensure helpful, honest, harmless models)
• RLHF Process: reward model training (evaluate outputs, assign rewards); automated labelers (replace human labelers, select completions); LLM updating (align models with human feedback)
• Proximal Policy Optimization (PPO): overview (fine-tuning via reinforcement learning); RLHF application (iterative updates, maximize rewards)
• Constitutional AI: concept (ethical AI alignment with human values); role in RLHF (ensure adherence to societal norms, trust, safety)

19
Class 14: Multimodal Generative AI Systems
• Definition & Importance: process multiple data types (text, images, audio, video); importance for human-like perception and interaction
• Beyond Language Models: expansion to incorporate visual, auditory, and sensory inputs; examples: DALL-E, GPT-4 with vision
• Creating Large Multimodal Models (LMMs): incorporation of additional modalities into LLMs; encoder-decoder and transformer-based architectures; cross-modal and contrastive learning for training; Flamingo as a visual language model with few-shot learning
• The Multimodality Revolution: shift from unimodal to multimodal AI; advancements in computer vision, speech recognition, and NLP; integration into unified models
20
Recommended Books
• This course does not follow any textbook
• List of books (covering deep learning topics)
• Charu Aggarwal “Neural Networks and Deep Learning”, available at
https://link.springer.com/book/10.1007/978-3-319-94463-0
• Goodfellow, Bengio, Courville, “Deep Learning”, available at
http://www.deeplearningbook.org
• These books are good for material covered in Module 1 and 2 of syllabus.
• For basics of machine learning concepts, an excellent textbook is G. James et al., “An Introduction to Statistical Learning”. The second edition is available for free download at https://www.statlearning.com
• All other reading material will be posted on Canvas.
21
Assignments and Grading
• Distribution of marks:
• Assignments: 30%
• Quizzes: 10%
• Final Project: 30%
• Final Exam: 30%
• Assignments (30%)
• 5 assignments
• Assignments posted at the end of lectures 2, 4, 6, 8, 10; due in 2 weeks
• All programming assignments should be done as Jupyter notebooks, unless
specified otherwise
• Quizzes (10%)
• Canvas
• 5 quizzes
22
Assignments and Grading (contd.)
• Final Project (30%): Team assignment. Team of 2. Any project involving
performance of LLM systems. 1-page project proposal due by
Midterm. Detailed rubric shall be provided. Project grading:
• Project proposal (5%) – due in mid-October
• Midpoint checkpoint (5%) – due before Thanksgiving break
• GitHub repo with README, documented code (5%)
• Final presentation and demo (15%)
• Final Exam (30%): In person exam

23
Class Logistics
• Reach CAs: Office hours (over Zoom) and Ed
• Access to Computer Clusters
• Class communications: Ed

24
AI Timeline
• Inflection points in AI adoption are closely tied to innovations in computing and the availability of big data
• 2006: Amazon Elastic Compute Cloud and S3 were launched; the term “cloud computing” was coined around the same time
25


Factors Contributing to AI Success
• Algorithms, Data, Compute, Applications
• Distributed training algorithms scaling up to 100s of GPUs
• Data growing at an exponential rate: Internet, social media, Internet of Things (IoT)
• Compute power growth with specialized cores: GPUs, TPUs
• Development of innovative applications
• 2012: AlexNet by Krizhevsky et al. at the ImageNet competition
• Simple convolutional neural network: 5 convolutional, 3 fully connected layers
• GPU based; beat other models by an 11% margin
• Triggered a “Cambrian Explosion” in deep learning technologies
“Neural networks are growing and evolving at an extraordinary rate, at a lightning rate, … What started out just five years ago with AlexNet … five years later, thousands of species of AI have emerged.”

26
Evolution of Large Language Models (LLMs)

Figure from Mohamadi et al, ChatGPT in the Age of Generative AI and Large Language Models: A Concise Survey
27
AI at US Open
• https://www.ibm.com/case-studies/us-open
• Generative AI models for generating content
• IBM watsonx AI and data platform built for business
• IBM Granite foundation models
• watsonx.data: to connect and curate the USTA’s trusted data sources
• Foundation models were trained to translate tennis data into cogent descriptions
  • summarizing entire matches in the case of Match Reports
  • generating sentences that describe the action in highlight reels for AI Commentary

“Foundation models are incredibly powerful and are ushering in a new age of generative AI. But to generate meaningful business outcomes, they need to be trained on high-quality data and develop domain expertise. And that’s why an organization’s proprietary data is the key differentiator when it comes to AI.”
Shannon Miller, IBM Consulting
28
AI Blogs of Major Companies
• Meta: https://ai.meta.com/blog/
• Google: https://ai.googleblog.com
• IBM Research:
https://www.ibm.com/blogs/research/category/ai/
• Microsoft: https://news.microsoft.com/source/topics/ai/
• AWS Machine Learning Blog:
https://aws.amazon.com/blogs/machine-learning/

29
Machine Learning System
A composition of one or more software components,
with possible interactions, deployed on a hardware
platform with the purpose of achieving some
performance objective.
A Machine Learning system is a system where one or
more software components are machine learning
based.
• Why study ML systems?
• Algorithms run on real and (possibly) faulty hardware in production environments
• Theoretical performance is far from observed performance
• To characterize holistic performance: not just the algorithm but the end-to-end performance of the entire system

30
Constituents of a ML System

(Figure: a Machine Learning System comprises Hardware/Infrastructure, Software Platform, Algorithm(s), and Data, which together produce the Model)

31
Infrastructure
• Compute units and accelerators, memory, storage, network
• Resources can be acquired as bare metal, VMs/Containers on cloud
• Hardware can help improve performance pretty much everywhere in
the pipeline
• Design better hardware
• Adapt existing architectures to ML tasks.
• Develop brand-new architectures for ML.
• Hardware compute precision affects performance: tradeoff between
accuracy and runtime

32
(Learning) Algorithm
• General and domain specific architectures
• Hyperparameter tuning to extract the best performance
• Affects the resource requirements: compute (FLOPS), memory
• Performance (runtime) and scalability of an algorithm depends on:
• Hardware/Infrastructure
• Software platform (frameworks, libraries, drivers)

33
Data
• Data is a critical element; data is king in ML
• Different modalities: Audio, video, images, text
• Data sources, collection, labeling, quality, data storage
• Data type determines the choice of learning algorithm
• Making the data business ready is challenging
• Many data-driven organizations are spending 80 percent of their time
on data preparation and find it a major bottleneck.
• DataOps: tools, processes, and organizational structures to cope with
significant increase in volume, velocity, and variety of data.
34
Software Engineering in ML Systems
• Machine learning applications run as pipelines that ingest data, compute features, identify model(s), discover hyperparameters, train model(s), validate and deploy model(s).
• Making a model a production-capable web service
  • Containerization (Docker), cluster deployment (K8s)
  • APIs exposed as a web service (TensorFlow Serving/ONNX Runtime)
• Workflow engines (e.g., Kubeflow) automate the ML pipeline
• Deployment monitoring and operational analytics
• DevOps principles applicable to ML systems:
  • Continuous Integration, Continuous Delivery (CI/CD)
  • Predictability
    • “A model may be unexplainable—but an API cannot be unpredictable”
  • Reproducibility and Traceability
    • Provenance for machine learning artifacts
• ML-specific testing and monitoring, apart from traditional software testing:
  • Data testing
  • Infrastructure testing
  • Model testing
  • Production testing
35
Creating an ML System
• System Objective: What problem will my system solve ? What is the target deployment scenario? What are
the performance objectives ?
• Solution Approach: What is my solution and its components ?
• Data Collection
• Identifying the data sources: What are my data sources ?
• Collecting the data
• Data Preparation
• Preparing the data: Is my data business ready ?
• Ingesting the data: What is the right storage for my data ?
• Model Development
• Identification and training
• Model evaluation
• Hyperparameter Tuning
• Model Deployment
• Model optimization (if needed) for the deployment infrastructure
• Model packaging and deployment
• Monitoring and Feedback
• Is my deployed model performing as expected ?
• Is there a drift in model performance which requires re-training ?
36
Inhibitors in Successful Implementation of
ML Solutions
• Deployment and automation
• Reproducibility of models and predictions
• Diagnostics
• Governance and regulatory compliance
• Scalability
• Collaboration
• Monitoring and management

37
Cloud Computing
• Access to computing resources and storage on demand
• Pay-as-you-go model
• Heterogeneous resources: GPUs, CPUs, storage type
• Different offering models: IaaS, PaaS, SaaS, MLaaS
• Different deployment models: Public, private, hybrid cloud
• Provisioning, maintenance, monitoring, life-cycle-management

38
Cloud and AI
• AI
• Harness power of Big Data and
compute
• Cloud
• Access to Big Data
• Platform to quickly develop,
deploy, and test AI solutions
• Ease in AI reachability

Yuichi Yoda 39
Cloud based Machine Learning Services
• IBM Watson Studio
https://www.ibm.com/products/watson-studio
• Amazon Sagemaker
https://aws.amazon.com/sagemaker
• Microsoft Azure Machine Learning
https://azure.com/ml
• Google Vertex AI Platform
https://cloud.google.com/vertex-ai/

40
Deep Learning on Cloud Stack

41
AI Model Training Lifecycle
Performance considerations at each stage
• Data preprocessing: de-noising, de-biasing,
train/test set creation
• Feature engineering: search efficient data
transformations
• Model training: model identification/synthesis,
hyperparameter tuning, regularization
• Model hardening: efficient adversarial training
• Model serving: hardware, model pruning and
compression
• Monitoring: response time, drift detection
• Continuous learning: model adaptability, retraining
42
Practical Machine Learning Systems



43
Class 1: Fundamentals of Deep Learning (DL)

44
Linear Regression

45
Mean Square Error (MSE)
(Figure: CASE 1, CASE 2, CASE 3, each comparing true values and predicted values)
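For reference, MSE has the standard definition (not shown on the slide; y_i are the true values and ŷ_i the predictions over n points):

MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²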

46
Overfitting and Underfitting
• Overfitting: model performs well on training data but does not generalize well to unseen data (test data)
• Underfitting: model is not complex enough to capture the pattern in the training data well and therefore suffers from low performance on unseen data

47
Bias of a model

48
49
50
51
52
53
54
55
Model Complexity Tradeoffs
(Figure: CASE 1, CASE 2, CASE 3)

56
Model Complexity Tradeoffs
• Simple model
  • Fails to completely capture the relationship between features
  • Introduces bias: consistent test error across different choices of training data
  • Low variance
  • Increasing training data does not help in reducing bias
• Complex model captures nuances in training data, causing overfitting (generalization gap)
  • Low bias
  • Train error << Test error
  • With different training instances, the model prediction for the same test instance will be very different – high variance
57
Bias-Variance Tradeoff

• High bias, low variance: increase model complexity
• High variance, low bias: increase training data

58
Regularization
• Techniques used to improve generalization of a model by reducing its complexity
• Techniques to make a model perform well on test data, often at the expense of its performance on training data
• Avoid overfitting, reduce variance
• Simpler models are preferable: low memory, increased interpretability
• However, simpler models may reduce the expressive power of models
• Sample techniques:
  • Parameter norm penalties
  • L2 and L1 norm weight decay
  • Noise injection
  • Dropout

59
Regularization in Regression
L2 Regularization Loss (Ridge Regression)

L1 Regularization Loss (LASSO)
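For reference, these losses have the standard form (a sketch consistent with the slide titles, not copied from the slide; w are the regression weights and λ the regularization strength):

Ridge (L2): L(w) = Σ_i (y_i − w·x_i)² + λ Σ_j w_j²
LASSO (L1): L(w) = Σ_i (y_i − w·x_i)² + λ Σ_j |w_j|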

What value of lambda to choose ?

60
Bias-variance tradeoff with Lambda

61
Performance Metrics
• Algorithmic performance: accuracy, precision, recall, F1-score, ROC
• System performance: training time, inference time, training cost, memory requirement, training efficiency

62
Accuracy, Precision, Recall, Specificity
Confusion matrix (predicted value vs. true value):
• true positive (tp): predicted positive, actually positive
• false positive (fp): predicted positive, actually negative
• false negative (fn): predicted negative, actually positive
• true negative (tn): predicted negative, actually negative

• Precision = tp / (tp + fp); false discovery rate = 1 − Precision
• Sensitivity (recall, true positive rate) = tp / (tp + fn)
• Specificity = tn / (tn + fp); false positive rate = 1 − Specificity
• Balanced accuracy = (Sensitivity + Specificity) / 2
  • Considers all entries in the confusion matrix
  • Value lies between 0 (worst classifier) and 1 (best classifier)
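As a minimal illustration, these metrics can be computed from predicted and true labels in Python (a sketch; the toy arrays and NumPy usage are assumptions, not course material):

import numpy as np

def confusion_counts(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # toy labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # toy predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                  # sensitivity, true positive rate
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)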

63
Balancing Precision and Recall
• F1 score: harmonic mean of precision and recall; a measure of classifier accuracy
  F1 = 2 · (Precision · Recall) / (Precision + Recall)
• F_β score: β = 1 gives the F1 score; recall is considered β times as important as precision
  F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

64
Performance dependence on threshold
• Classifier gives prediction score (the probability of belonging to positive class)
and then we apply a threshold on the score to predict the class
• Metrics like accuracy, precision, recall, Fbeta score are all calculated on the
predicted classes and not on prediction scores
• Metric value depends on the threshold

65
Receiver Operating Characteristic (ROC)
• Plots true positive rate (recall) vs. false positive rate at different thresholds
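A sketch of computing the ROC curve from prediction scores with scikit-learn (assuming scikit-learn is available; the toy labels and scores are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                     # toy labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]    # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                   # area under the ROC curve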

66
Runtime (training time) of ML
95 OpenML datasets, 7 classification algorithms, fit time
• For the same dataset, depending on the choice of classifier, the runtime can differ by orders of magnitude
• Runtime has algorithmic, dataset, and system-level dependencies
• Runtime depends on how the model scales with different features of the input: training data size, linear separability, complexity (number of output classes for classification problems), algorithmic hyperparameters, statistical meta-features of the data (mean size per class, log number of features)
67
DL Training Time
• DL performance is closely tied to the hardware
  • compute power, memory, network
• Tesla V100: 640 tensor cores (> 100 TFLOPS), 16 GB
• NVIDIA NVLink: 300 GB/s
• Volta-optimized CUDA libraries

68
DL Inference Throughput

69
Deep Learning Training
(Figure: forward phase, compute loss, backward phase)

• Forward phase
• Loss calculation
• Backward phase
• Weight update
70
Deep Learning Training Steps

• Forward phase:
  • compute the activations of the hidden units based on the current value of weights
  • calculate the output
  • calculate the loss function

• Backward phase:
  • compute the partial derivatives of the loss function w.r.t. all the weights
  • use the backpropagation algorithm to calculate the partial derivatives recursively
  • backpropagation changes the weights (and biases) in a network to decrease the loss

• Update the weights using gradient descent

71
Gradient Descent

• Loss is calculated over all the training points at each weight update
• Memory requirements may be prohibitive
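The full-batch update the slide refers to has the standard form (a sketch; w are the weights, α the learning rate, and the loss is summed over all n training points):

w ⇐ w − α · ∇_w Σ_{i=1..n} L_i(w)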

72
Stochastic Gradient Descent (SGD)

• Loss is calculated using one training point at each weight update

• Stochastic gradient descent is only a randomized approximation of the true loss function.

73
Mini-batch Gradient Descent

• A batch 𝐵 of training points is used in a single update of weights


• Increases the memory requirements: layer outputs are matrices instead of vectors, and in the backward phase matrices of gradients are calculated.
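A minimal NumPy sketch of mini-batch gradient descent for linear regression (illustrative only; the toy data, learning rate, and batch size are assumptions, not course code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                               # toy features
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)                 # toy targets

w = np.zeros(5)                                              # model weights
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    perm = rng.permutation(len(X))                           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                 # one mini-batch B
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)           # gradient of MSE on the mini-batch
        w -= lr * grad                                       # weight update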

74
Hyperparameters in Deep Learning
• Network architecture: number of hidden layers, number of hidden units per layer
• Activation functions
• Weight initializer
• Learning rate
• Batch size
• Momentum
• Optimizer

75
Popular Activation Functions
• Sigmoid: φ(z) = 1 / (1 + e^(−z))
• Tanh: φ(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• ReLU (Rectified Linear Unit): φ(z) = max{z, 0}

(Figure: plots of the sigmoid, tanh, and ReLU activation functions)
76
Weight Initialization
• Initializing all weights to the same value will cause neurons to evolve symmetrically
• Generally, biases are initialized with 0 values and weights with random numbers; initializing weights to random values breaks symmetry and enables different neurons to learn different features
• Initial value of weights is important:
  • Poor initializations can lead to bad convergence or no learning.
  • Instability across different layers (vanishing and exploding gradients).

77
Vanishing and Exploding Gradients

∂L/∂h_t = φ′(w_(t+1) · h_t) · w_(t+1) · ∂L/∂h_(t+1)

• For sigmoid activation, φ′(z) = φ(z)(1 − φ(z)), which has a maximum value of 0.25 at φ(z) = 0.5
• Each ∂L/∂h_t will be less than 0.25 times ∂L/∂h_(t+1)
• As we (back)propagate further, the gradient keeps decreasing; after r layers the gradient reduces to 0.25^r (≈ 10^(−6) for r = 10) of its original value, causing the update magnitudes of earlier layers to be very small compared to later layers => vanishing gradient problem
• If we use an activation with a larger gradient and larger weights => the gradient may become very large during backpropagation (exploding gradients)
• Improper initialization of weights also causes vanishing (too-small weights) or exploding (too-large weights) gradients
78
Activation Function Derivatives
(Figure: derivatives of the sigmoid, tanh, and ReLU activation functions)

• Sigmoid and tanh derivatives vs. ReLU
• Sigmoid and tanh gradients saturate at large values of the argument; very susceptible to the vanishing gradient problem
• ReLU is faster to train; most commonly used activation function in deep learning
79
Vanishing and Exploding Gradients

80
Popular Weight Initializers
(Example: r_in = 4, r_out = 3)
• Xavier initialization
  • Each neuron weight is sampled from a 0-mean Gaussian distribution with standard deviation sqrt(2 / (r_in + r_out)), where r_in and r_out are the number of input and output weights for the neuron
  • Used with sigmoid or tanh activations
  • Also referred to as Glorot initialization (as in Keras)
• He initialization
  • Sample weights from a 0-mean Gaussian distribution with standard deviation sqrt(2 / r), where r can be r_in or r_out
  • Used with ReLU activation (see the PyTorch sketch below)

https://www.deeplearning.ai/ai-notes/initialization/ 81
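As an illustration, Xavier/Glorot and He initialization in PyTorch might look like this (a sketch; the layer sizes follow the r_in = 4, r_out = 3 example above):

import torch.nn as nn

layer = nn.Linear(4, 3)                                           # r_in = 4, r_out = 3
nn.init.xavier_normal_(layer.weight)                              # Glorot/Xavier: std = sqrt(2 / (r_in + r_out))
nn.init.zeros_(layer.bias)                                        # biases initialized to 0

relu_layer = nn.Linear(4, 3)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # He initialization for ReLU layers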
Learning rate: large value vs small value
Learning rate schedule
• Start with a higher learning rate to explore the loss space => find good starting values for the weights
• Use smaller learning rates in later steps to converge to a minimum => tune the weights slowly
• Different choices of decay functions: exponential, inverse, multi-step, polynomial
  • Inverse: α_t = α_0 / (1 + γ·t)
  • Exponential: α_t = α_0 · exp(−γ·t)
  • Polynomial (n = 1 gives linear): α_t = α_0 · (1 − t / t_max)^n
  • Multi-step: α_t = α_0 · γ^n at step n
• Babysitting the learning rate
• Training with different learning rate decay
• Keras learning rate schedules and decay
• Other new forms: cosine decay
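A small Python sketch of these decay schedules (illustrative; the values of α_0, γ, n, t_max, and the milestones are assumptions):

import math

alpha0, gamma, n, t_max = 0.1, 0.05, 2, 100        # assumed hyperparameter values

def inverse_decay(t):      return alpha0 / (1 + gamma * t)
def exponential_decay(t):  return alpha0 * math.exp(-gamma * t)
def polynomial_decay(t):   return alpha0 * (1 - t / t_max) ** n     # n = 1 gives linear decay
def multistep_decay(t, milestones=(30, 60, 90), step_gamma=0.1):
    k = sum(t >= m for m in milestones)            # number of milestones already passed
    return alpha0 * step_gamma ** k

for t in (0, 10, 50, 90):
    print(t, inverse_decay(t), exponential_decay(t), polynomial_decay(t), multistep_decay(t))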

83
Batch size
• Effect of batch size on learning
• Batch size is restricted by the GPU memory (12 GB for K40, 16 GB for P100 and V100) and the model size
  • Model and a batch of data need to remain in GPU memory for one iteration
• Are you restricted to working with small mini-batches for large models and/or GPUs with limited memory?
  • No, you can simulate a large batch size by delaying gradient/weight updates to happen every n iterations (instead of n = 1); supported by frameworks as gradient accumulation (sketched below)
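A gradient-accumulation sketch in PyTorch (illustrative; the toy model, data, and accumulation factor are assumptions, not course code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4                                       # effective batch = accum_steps x micro-batch size

optimizer.zero_grad()
for step in range(20):
    x, y = torch.randn(8, 10), torch.randn(8, 1)      # one micro-batch of toy data
    loss = loss_fn(model(x), y) / accum_steps         # scale so accumulated gradients average over the large batch
    loss.backward()                                   # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                              # one weight update every accum_steps iterations
        optimizer.zero_grad()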

84
What Batch Size to Choose?
• Hardware constraints (GPU memory) dictate the largest batch size
• Should we try to work with the largest possible batch size?
  • Large batch size gives more confidence in gradient estimation
  • Large batch size allows you to work with higher learning rates, faster convergence
  • Large batch size leads to poor generalization (Keskar et al 2016)
    • Lands on sharp minima, whereas small-batch SGD finds flat minima which generalize better

85
Learning Rate and Batch Size Relationship
• “Noise scale” in stochastic gradient descent (Smith et al 2017):
  g = ε (N/B − 1) ≈ ε N / B for B ≪ N
  where N is the training dataset size, B the batch size, and ε the learning rate
• There is an optimum noise scale g which maximizes the test set accuracy (at constant learning rate)
  • Introduces an optimal batch size proportional to the learning rate when B ≪ N
• Increasing the batch size has the same effect as decreasing the learning rate
  • Achieves near-identical model performance on the test set with the same number of training epochs but significantly fewer parameter updates
Learning rate decrease Vs Batch size increase

87
Gradient Descent Convergence

• GD convergence is poor due to differences in gradient values along different dimensions
• Effective descent direction gets away from the minimum if we use a finite learning rate
• Gradient descent might also get trapped at saddle points and/or local minima

We need to:
• Move quickly in directions with small but consistent (pointing in one direction, +ve or −ve) gradients.
• Move slowly in directions with big but inconsistent (oscillating between −ve and +ve) gradients.

C. Aggarwal. Neural Networks and Deep Learning


Gradient Descent with Momentum
(Figure: loss vs. value of a neural network parameter; (a) relative directions of the starting point and optimum, (b) without momentum GD slows down in flat regions and gets trapped in a local optimum, (c) with momentum)

• Add momentum to GD updates (see the update rule sketched below)
• Learning is accelerated as oscillations are damped and updates progress in the consistent directions of loss decrease
• Enables working with large learning rate values and hence faster convergence

C. Aggarwal. Neural Networks and Deep Learning
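The momentum update referred to above has the standard form (a sketch in common notation, not the slide's exact symbols; β is the momentum parameter and α the learning rate):

v ⇐ β·v − α·∇_w L(w)
w ⇐ w + v

With β = 0 this reduces to plain gradient descent; β close to 1 keeps more of the previous update direction.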


Nesterov Momentum
• Simple momentum-based updates cause the solution to overshoot the target minimum
• Idea is to use some lookahead in computing the updates
  • Put on the brakes as the marble reaches near the bottom of the hill
• Differs from the standard momentum method in terms of where the gradient is computed
Parameter-specific learning rates
• Apply a different learning rate to each parameter at each step
• Encourage faster relative movement in gently sloping direction
• Penalize dimension with large fluctuations in gradient
• Several optimizers: AdaGrad, RMSProp, RMSProp + Nesterov momentum, AdaDelta, Adam
• Differ in the manner in which the parameter-specific learning rates are calculated (see the PyTorch sketch below)
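As an illustration, these adaptive optimizers can be swapped in PyTorch (a sketch; the model and hyperparameter values are placeholders, not course settings):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # placeholder model
# Each optimizer keeps per-parameter statistics that scale the effective learning rate:
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01, momentum=0.9)
opt_adam    = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))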
Normalizing Input Data
• Min-max normalization (for feature j of input datapoint i):
  x_ij ⇐ (x_ij − min_j) / (max_j − min_j)
  • Data with smaller standard deviation; scaled to be in the range [0, 1]
  • Lessens the effect of outliers
• Standardization:
  x_ij ⇐ (x_ij − mean_j) / std_dev_j
• Normalization helps in the convergence of the optimization algorithm
• Should apply the same normalization parameters to both train and test set
  • Normalization parameters are calculated using train data
• Training converges faster when the inputs are normalized
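A sketch of applying the same normalization parameters to train and test data (illustrative NumPy code; the toy data are assumptions):

import numpy as np

X_train = np.random.rand(100, 3)          # toy training data
X_test  = np.random.rand(20, 3)           # toy test data

mean = X_train.mean(axis=0)               # normalization parameters from train data only
std  = X_train.std(axis=0)

X_train_std = (X_train - mean) / std      # standardize training data
X_test_std  = (X_test - mean) / std       # apply the SAME train parameters to test data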

92
Batch Normalization
• Internal covariate shift – change in the distribution of network activations due to change in network parameters during training
• Idea is to reduce internal covariate shift by applying normalization to the inputs of each layer
• Achieve a fixed distribution of inputs at each layer
• Normalization for each training mini-batch
• Batch normalization enables training with larger learning rates
  • Reduces the dependence of gradients on the scale of the parameters
• Faster convergence and better generalization

93
Batch Normalization
(Figure: the batch normalization algorithm)
Why this step?
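The batch normalization transform shown in the figure has the standard form (not copied from the slide; m is the mini-batch size, γ and β are learned scale and shift parameters, ε a small constant for numerical stability):

μ_B = (1/m) Σ_i x_i
σ_B² = (1/m) Σ_i (x_i − μ_B)²
x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε)
y_i = γ·x̂_i + β

The final scale-and-shift step preserves the layer's representational capacity: the network can learn γ and β to undo the normalization if that is what minimizes the loss.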

94
L2 and L1 Regularization in Neural Networks
• L2 regularization loss (the penalty term acts as weight decay)
• L1 regularization loss
95
L2 vs L1 Regularization in Neural Networks
• Value of lambda (hyperparameter) can be tuned using the validation set
• L1 regularization leads to sparse weight matrices; used to determine edges to prune
• Both L2 and L1 regularization move the weights progressively towards 0
• Multiplicative vs additive weight decay (see the sketch below)
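In practice, L2 weight decay and an additive L1 penalty might be combined like this in PyTorch (a sketch; the model, λ values, and toy data are assumptions):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# L2 regularization via the optimizer's weight_decay argument
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)                 # toy batch
l1_lambda = 1e-4
loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())    # additive L1 penalty on the loss
(loss + l1_lambda * l1_penalty).backward()
optimizer.step()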

96
Early Stopping

• Stop the training when validation error starts rising, to prevent overfitting
• Early stopping is an implicit regularization technique
• Done in hindsight; define a performance criterion and checkpoint the latest “best” model
• May not help with large datasets with less likelihood of overfitting
• No principled approach to early stopping; can be tricky when the validation error has multiple local minima
• L2 regularization (with a proper value of the regularizer) may achieve similar or better performance than early stopping
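A minimal early-stopping loop in PyTorch (a sketch; the toy model, data, and patience value are assumptions, not course code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
X_train, y_train = torch.randn(200, 10), torch.randn(200, 1)   # toy data
X_val,   y_val   = torch.randn(50, 10),  torch.randn(50, 1)

best_val, best_state, patience, bad_epochs = float('inf'), None, 5, 0
for epoch in range(100):
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:                                     # validation improved: checkpoint
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                              # stop when validation stops improving
            break
model.load_state_dict(best_state)                               # restore the latest "best" model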
97
Dropout
• Dropout is a regularization technique to deal with the overfitting problem and improve generalization
• Prevents co-adaptation of activation units
  • a feature detector is only helpful in the context of several other specific feature detectors
• Probabilistically drop input features or activation units in hidden layers
• Approximately combines exponentially many different neural network architectures
• Layer-dependent dropout probability (~0.2 for input, ~0.5 for hidden)
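A sketch of layer-dependent dropout in PyTorch (the layer sizes and probabilities are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),          # drop ~20% of input features
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # drop ~50% of hidden activations
    nn.Linear(256, 10),
)
model.train()                   # dropout active during training
model.eval()                    # dropout disabled at inference time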

98
Dataset Augmentation
• Artificially enlarge the training set by adding
transformations/perturbations of the training data
• Domain-specific transformations
• Provides more training data
• Helps in model generalization and prevents overfitting
• Augmentation techniques (images):
• Horizontal and Vertical shift
• Horizontal and Vertical flip
• Rotation
• Brightness
• Erosion and Dilation
• Noising
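For images, such augmentations are commonly composed with torchvision (a sketch, assuming torchvision is available; the parameter values are illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # horizontal/vertical shift
    transforms.ToTensor(),
])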

99
Prepare for Lecture 2
• Access to compute cluster
• Set up your GCP account (you may want to work with Google Colab)
  • We will provide you GCP coupons
• Get an account and familiarize yourself with cloud computing clusters
• First homework posted on 09/12

100
