Deep Learning Week 2
Data Type:
Structured -> organized in a predefined schema or format (rows and columns). E.g., a demographic dataset with statistics on different cities' population, GDP per capita, and economic growth.
Unstructured -> does not have a predefined schema or format. E.g., text documents, images, audio files, and videos.
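A minimal sketch of the two data types in Python; the dataset, its column names, and all values below are hypothetical placeholders used only for illustration.

```python
import pandas as pd

# Structured: a predefined schema (named columns, fixed types), one row per city.
# City names and numbers are placeholder values, not real statistics.
cities = pd.DataFrame({
    "city": ["CityA", "CityB", "CityC"],
    "population": [500_000, 250_000, 100_000],
    "gdp_per_capita": [30_000, 25_000, 20_000],
    "growth_rate": [0.02, 0.03, 0.01],
})
print(cities.dtypes)  # every column has a known name and type

# Unstructured: no predefined schema -- raw text, images, audio, or video.
review_text = "The service was excellent and the delivery arrived early."
print(len(review_text.split()))  # any structure must be extracted by the model
```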
Lesson 2 – Standard Notations for Deep Learning
Some Deep Learning mathematical notations.
Processing the training set -> Usually the entire training set is processed without using an explicit for loop. This is different from the typical approach of using a for loop to step through the $m$ training examples one by one.
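For example, a single linear step $z = w^{T}X + b$ over all $m$ examples can be written as one matrix operation instead of a loop; the NumPy sketch below (array names and sizes are illustrative assumptions) shows both versions producing the same result.

```python
import numpy as np

m, n_x = 1000, 5                 # illustrative sizes: m examples, n_x features
X = np.random.randn(n_x, m)      # each column is one training example
w = np.random.randn(n_x, 1)
b = 0.1

# Vectorized: the whole training set is processed in one matrix operation.
z_vec = w.T @ X + b              # shape (1, m)

# Explicit for loop over the m examples (the approach contrasted above).
z_loop = np.zeros((1, m))
for i in range(m):
    z_loop[0, i] = w[:, 0] @ X[:, i] + b

assert np.allclose(z_vec, z_loop)
```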
Notations -> Superscript $(i)$ -> the $i$th training example | Superscript $[l]$ -> the $l$th layer
Number of examples -> $m$ | Input size -> $n_x$ | Output size -> $n_y$ | Hidden units of the $l$th layer -> $n_h^{[l]}$ | Number of layers -> $L$
Objects used in Neural Networks:
Input Matrix -> $X \in \mathbb{R}^{n_x \times m}$ → It contains the feature values for each training example; $m$ -> examples, $n_x$ -> features.
Training Example -> $x^{(i)} \in \mathbb{R}^{n_x}$ → It is represented as a column vector.
Label Matrix -> $Y \in \mathbb{R}^{n_y \times m}$ → It contains the desired outputs for each training example; $m$ -> examples, $n_y$ -> classes.
Output Label -> $y^{(i)} \in \mathbb{R}^{n_y}$ → It is the class or value for the $i$th training example.
Weight Matrix -> $W^{[l]} \in \mathbb{R}^{\text{number of units in next layer} \times \text{number of units in the previous layer}}$ → $[l]$ = layer.
Bias Vector -> $b^{[l]} \in \mathbb{R}^{\text{number of units in next layer}}$ → It allows the network to shift the activation function output.
Predicted Output Vector -> $\hat{y} \in \mathbb{R}^{n_y}$ → It can also be denoted $a^{[L]}$, where $L$ is the number of layers.
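A small NumPy sketch of the shapes these objects take, assuming a network with $n_x = 4$ input features, one hidden layer of $n_h = 5$ units, $n_y = 3$ output classes, and $m = 8$ examples (all sizes are illustrative assumptions):

```python
import numpy as np

n_x, n_h, n_y, m = 4, 5, 3, 8     # illustrative sizes

X = np.random.randn(n_x, m)       # input matrix X: one column per training example
Y = np.random.randn(n_y, m)       # label matrix Y
x_i = X[:, 0:1]                   # training example x^(i): a column vector of shape (n_x, 1)

W1 = np.random.randn(n_h, n_x)    # W^[1]: (units in next layer) x (units in previous layer)
b1 = np.zeros((n_h, 1))           # b^[1]: one bias per unit, shifts the activation output
W2 = np.random.randn(n_y, n_h)    # W^[2]
b2 = np.zeros((n_y, 1))           # b^[2]

print(X.shape, Y.shape, x_i.shape, W1.shape, b1.shape)  # (4, 8) (3, 8) (4, 1) (5, 4) (5, 1)
```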
Examples of equations of common forward propagation:
$a = g^{[l]}(W_x x^{(i)} + b_1)$, where $g^{[l]}$ is the $l$th layer activation function → Activation $a$ of the $l$th layer. It is computed by applying the activation function $g^{[l]}$ to the linear combination of the weights $W$ and activations $x$ from the previous layer, plus the bias term $b_1$.
$\hat{y}^{(i)} = \mathrm{softmax}(W_h h + b_2)$ → Activation $\hat{y}^{(i)}$ of the output layer. It is computed by applying the softmax activation function to the linear combination of the weights $W$, input features $h$, and bias term $b_2$.
General activation formula: $a_j^{[l]} = g^{[l]}\left(\sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}\right) = g^{[l]}(z_j^{[l]})$ → Activation $a_j^{[l]}$ of the $j$th neuron in the $l$th layer. It is computed by applying the activation function $g^{[l]}$ to the weighted sum of the activations $a_k^{[l-1]}$ from the neurons in the previous layer $l-1$, multiplied by the weights $w_{jk}^{[l]}$, plus the bias term $b_j^{[l]}$.
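A minimal forward-propagation sketch that follows these equations, assuming a tanh hidden layer and a softmax output layer (the layer sizes and the choice of tanh as $g^{[1]}$ are illustrative assumptions, not part of the notation itself):

```python
import numpy as np

def softmax(z):
    # Subtract the column-wise max for numerical stability, then normalize.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

n_x, n_h, n_y, m = 4, 5, 3, 8
X = np.random.randn(n_x, m)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))

# a^[1] = g^[1](W^[1] x + b^[1]): hidden-layer activation (tanh chosen as g^[1]).
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)

# y_hat = softmax(W^[2] h + b^[2]): output-layer activation, columns sum to 1.
Z2 = W2 @ A1 + b2
Y_hat = softmax(Z2)
print(Y_hat.shape, Y_hat.sum(axis=0))   # (3, 8), each column sums to ~1.0
```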
Examples of cost functions:
Cost function: $J(x, W, b, y) = J(\hat{y}, y)$ → It measures the difference between the predicted output $\hat{y}$ (which is a function of the input $x$, weights $W$, and biases $b$) and the true output $y$, providing a quantitative measure of how well the network is performing.
Cross-entropy cost function: $J_{CE}(\hat{y}, y) = -\sum_{i=1}^{m} y^{(i)} \log \hat{y}^{(i)}$ → It measures the dissimilarity between the predicted probabilities $\hat{y}^{(i)}$ and the true labels $y^{(i)}$ across all $m$ training examples; minimizing this cost during training improves the network's performance.
Mean Absolute Error (MAE) cost function: $J_{1}(\hat{y}, y) = \sum_{i=1}^{m} \left| y^{(i)} - \hat{y}^{(i)} \right|$ → It sums the absolute differences between the predicted values $\hat{y}^{(i)}$ and the actual values $y^{(i)}$ across all $m$ examples (divide by $m$ to obtain the mean). MAE penalizes all errors equally, making it less sensitive to outliers than Mean Squared Error (MSE).
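A hedged NumPy sketch of these two cost functions, assuming one-hot labels Y and predicted probabilities Y_hat of shape (n_y, m); the small eps inside the log is an added numerical-stability assumption, and the example values are purely illustrative.

```python
import numpy as np

def cross_entropy_cost(Y_hat, Y, eps=1e-12):
    # J_CE(y_hat, y) = -sum_i y^(i) * log(y_hat^(i)); eps avoids log(0).
    return -np.sum(Y * np.log(Y_hat + eps))

def mae_cost(Y_hat, Y):
    # J_1(y_hat, y) = sum_i |y^(i) - y_hat^(i)|; divide by m to report the mean.
    return np.sum(np.abs(Y - Y_hat))

# Illustrative check with n_y = 3 classes and m = 2 examples.
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])         # one-hot true labels
Y_hat = np.array([[0.7, 0.2], [0.2, 0.7], [0.1, 0.1]])     # predicted probabilities
print(cross_entropy_cost(Y_hat, Y), mae_cost(Y_hat, Y))
```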