Deep Learning Chapter 1

The document covers the basics of deep networks, focusing on linear algebra, probability distributions, and gradient-based optimization techniques in machine learning. It explains key concepts such as scalars, vectors, matrices, tensors, and various optimization algorithms like Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Additionally, it discusses the importance of model capacity, overfitting, and underfitting in machine learning.

Uploaded by

startrader196
UNIT I DEEP NETWORKS BASICS

Linear Algebra: Scalars - Vectors - Matrices and tensors; Probability Distributions - Gradient-based Optimization - Machine Learning Basics: Capacity - Overfitting and underfitting - Hyperparameters and validation sets - Estimators - Bias and variance - Stochastic gradient descent - Challenges motivating deep learning; Deep Networks: Deep feedforward networks; Regularization - Optimization.

1. LINEAR ALGEBRA

Linear algebra grew out of methods, dating to the early 18th century, for finding the unknowns in systems of linear equations and solving those equations easily. It is also a prerequisite for learning machine learning and data science. Deep learning is a subdomain of machine learning concerned with algorithms, called artificial neural networks, that imitate the structure and function of the brain. Linear algebra is a form of continuous rather than discrete mathematics.

1.1 USES OF LINEAR ALGEBRA

1. Optimization of data
2. Implementation of linear regression in machine learning
3. Use in neural networks and the data science field
4. Better graphics experience
5. Improved statistics
6. Creating better machine learning algorithms
7. Estimating the forecast of machine learning
8. Easy to learn

1.1.1 Better Graphics Experience

Linear algebra supports graphical processing in machine learning, including tasks such as edge detection. Image, audio, video, and other data sets are represented as matrices, and matrix decomposition techniques make it possible to compute with large and complex data.

1.1.2 Improved Statistics

Statistics is an important tool for organizing data in machine learning. Linear algebra also helps in understanding the concepts of statistics in a better manner.

1.1.3 Creating better Machine Learning algorithms

A few supervised learning algorithms that can be created using linear algebra are:

1. Logistic Regression
2. Linear Regression
3. Decision Trees
4. Support Vector Machines (SVM)

Some unsupervised learning algorithms that can also be created with the help of linear algebra are:

1. Singular Value Decomposition (SVD)
2. Clustering
3. Components Analysis

1.1.4 Easy to Learn

... mathematics and its applications.

1.2 EXAMPLES OF LINEAR ALGEBRA IN MACHINE LEARNING

Below are some popular examples of linear algebra in machine learning:

1. Datasets and Data Files
2. Linear Regression
3. Recommender Systems
4. One-hot encoding
5. Regularization
6. Principal Component Analysis
7. Images and Photographs
8. Singular-Value Decomposition
9. Deep Learning
10. Latent Semantic Analysis

1.3 SCALARS

A scalar is just a single number, in contrast to most other objects in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics and usually give them lower-case variable names. For example, we might say "Let $s \in \mathbb{R}$ be the slope of the line" while defining a real-valued scalar, or "Let $n \in \mathbb{N}$ be the number of units" while defining a natural-number scalar.

1.4 VECTORS

A vector is an array of numbers arranged in order. We can identify each individual number by its index in that ordering. We typically give vectors lower-case names written in bold typeface, such as $\boldsymbol{x}$. The elements of the vector are identified by writing its name in italic typeface with a subscript. The first element of $\boldsymbol{x}$ is $x_1$, the second element is $x_2$, and so on.

1.5 MATRICES

A matrix is a 2-D array of numbers. We usually give matrices upper-case variable names in bold typeface, such as $\boldsymbol{A}$. If a real-valued matrix $\boldsymbol{A}$ has a height of $m$ and a width of $n$, then we say that $\boldsymbol{A} \in \mathbb{R}^{m \times n}$. We usually identify the elements of a matrix using its name in italic but not bold font, with the indices separated by commas:

$$\boldsymbol{A} = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$$

1.6 TENSORS

In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element of $\mathsf{A}$ at coordinates $(i, j, k)$ by writing $\mathsf{A}_{i,j,k}$.

[Figure: examples of arrays with different numbers of axes]

1.7 PROBABILITY DISTRIBUTIONS

Probability denotes the possibility of something happening. It is a mathematical concept that quantifies how likely events are to occur, that is, the degree to which something is likely to happen. Probability values lie between 0 and 1. This fundamental theory of probability also applies to probability distributions.

1.7.1 Discrete Variable and Probability Mass Function

The probability mass function is the function that describes the probability associated with a discrete random variable $X$. This function is written $P(X)$, or $P(X = x)$ to avoid confusion. $P(X = x)$ is the probability that the random variable $X$ takes the value $x$.

1.8 GRADIENT-BASED OPTIMIZATION

1.8.1 Optimizer

Optimizers update the parameters of neural networks, such as the weights and the learning rate, to minimize the loss function. Here, the loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

1.8.2 The Intuition behind Optimizers with an Example

Imagine a climber hiking down a hill with no sense of direction. He does not know the right way to reach the valley, but he can tell whether he is moving closer to his destination (going downhill) or further away from it (going uphill). If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley. This is exactly the intuition behind optimizers: to reach a global minimum of the loss function.
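The climber analogy can be sketched in a few lines of Python. This is only an illustration of the intuition, not a real optimizer: the one-dimensional `loss` function (with its minimum placed at w = 3), the starting point, and the helper name `hike_downhill` are all assumptions made for this example. Like the climber, the code never computes a direction in advance; it only checks whether a small step left or right lowers the loss.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at w = 3 (an assumption).
    return (w - 3.0) ** 2


def hike_downhill(w, step=0.1, max_steps=1000):
    """Repeatedly step in whichever direction lowers the loss."""
    for _ in range(max_steps):
        current = loss(w)
        left, right = loss(w - step), loss(w + step)
        if left < current and left <= right:
            w -= step          # going left is downhill
        elif right < current:
            w += step          # going right is downhill
        else:
            break              # neither neighbour is lower: bottom reached
    return w


w_final = hike_downhill(w=-2.0)
print(w_final, loss(w_final))
```

The search stops within about one step size of the minimum. Gradient-based optimizers replace the trial steps with the derivative of the loss, which gives the downhill direction directly.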
1.8.3 Instances of Gradient-Based Optimizers

Different instances of gradient-descent-based optimizers are as follows:

• Batch Gradient Descent, also called Vanilla Gradient Descent or simply Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Mini-Batch Gradient Descent (MB-GD)

1.8.4 Batch Gradient Descent

Gradient descent is an optimization algorithm used when training deep learning models. It assumes a convex function and updates its parameters iteratively to move a given function toward its local minimum. The update rule is:

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

In the above formula,

• $\alpha$ is the learning rate,
• $J$ is the cost function, and
• $\theta$ is the parameter to be updated.

As you can see, the gradient is the partial derivative of $J$ (the cost function) with respect to $\theta_j$. Note that, as we get closer to the global minimum, the slope (gradient) of the curve becomes less and less steep, which results in a smaller derivative. This in turn reduces the step size automatically, even though the learning rate itself stays fixed.

Gradient descent is the most basic but most widely used optimizer. It directly uses the derivative of the loss function and the learning rate to reduce the loss and tries to reach the global minimum. The gradient descent algorithm has many applications, including:

• Linear Regression,
• Classification Algorithms,
• Backpropagation in Neural Networks, etc.

[Figure: cost versus weight, showing the initial weight, the incremental steps, the gradient, and the minimum cost]

Our aim is to reach the bottom of the graph (cost vs. weight), or a point where we can no longer move downhill: a local minimum.

1.8.5 Role of Gradient

In general, the gradient represents the slope of the function. Gradients are partial derivatives that describe the change in the loss function with respect to a small change in the parameters of the function. This slight change in the loss function tells us the next step to take to reduce the output of the loss function.
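To make the update rule θ = θ − α · dJ/dθ and the role of the gradient concrete, here is a minimal sketch on a one-parameter problem. The cost function J(θ) = (θ − 4)², the starting value, and the learning rate of 0.1 are assumptions chosen for the illustration. The sketch records the size of each step to show that the steps shrink as the gradient flattens near the minimum, while the learning rate never changes.

```python
def cost(theta):
    # Hypothetical convex cost J(theta) = (theta - 4)^2 (an assumption).
    return (theta - 4.0) ** 2


def cost_gradient(theta):
    # Analytic derivative dJ/dtheta of the cost above.
    return 2.0 * (theta - 4.0)


def gradient_descent(theta, learning_rate=0.1, n_steps=50):
    """Apply theta = theta - learning_rate * dJ/dtheta for n_steps."""
    step_sizes = []
    for _ in range(n_steps):
        step = learning_rate * cost_gradient(theta)
        theta = theta - step           # move against the gradient
        step_sizes.append(abs(step))   # record how far we moved
    return theta, step_sizes


theta_final, step_sizes = gradient_descent(theta=0.0)
print(theta_final, step_sizes[0], step_sizes[-1])
```

With this cost, each update multiplies the distance to the minimum by 0.8, so both the error and the step size decay geometrically.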
1.8.6 Role of Learning Rate

The learning rate represents the size of the steps our optimization algorithm takes to reach the global minimum. To ensure that gradient descent reaches the minimum, we must set the learning rate to an appropriate value, neither too low nor too high. Taking very large steps, i.e., using a large learning rate, may skip over the global minimum, so the model may never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., using a small learning rate, can take a very long time to converge. The size of each step also depends on the gradient value.

[Figure: descent paths with a big learning rate and with a small learning rate]

The gradient points in the direction of increase. Our aim is to find the minimum point in the valley, so we have to go in the opposite direction of the gradient. Therefore, we update the parameters in the negative gradient direction to minimize the loss.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta)$

In code, Batch Gradient Descent looks something like this:

    for epoch in range(epochs):
        # find_gradient is a placeholder: gradient of the loss over the whole dataset
        params_gradient = find_gradient(loss_function, data, parameters)
        parameters = parameters - learning_rate * params_gradient

Advantages of Batch Gradient Descent

1. Easy computation.
2. Easy to implement.
3. Easy to understand.

Disadvantages of Batch Gradient Descent

1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large, it may take a very long time (even years) to converge to the minimum.
3. Requires large memory to calculate the gradient on the whole dataset.

1.9 STOCHASTIC GRADIENT DESCENT

1. To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as an extension of gradient descent.
2. One of the disadvantages of the gradient descent algorithm is that it requires a lot of memory to load the entire dataset at once to compute the derivative of the loss function.
3. In the SGD algorithm, we compute the derivative using one data point at a time, i.e., we update the model's parameters more frequently.
4. Therefore, the model parameters are updated after the computation of the loss on each training example.

For example, given a dataset that contains 1000 rows, SGD updates the model parameters 1000 times in one complete pass over the dataset, instead of once as in gradient descent.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; x^{(i)}, y^{(i)})$, where $\{x^{(i)}, y^{(i)}\}$ are the training examples.

We want training to be even faster, so we take a gradient descent step for each training example. Let's see the implications in the figure below:

[Figure: SGD vs GD. Left panel: Stochastic Gradient Descent. Right panel: Gradient Descent. '+' denotes a minimum of the cost. SGD leads to many oscillations on the way to convergence, but each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).]

Let's try to draw some insights from the above diagram:

1. In the left diagram we have SGD, where we take a gradient descent step for each example (one example per step); the right diagram is GD (one step per entire training set).
2. SGD is quite noisy, but each update is much cheaper to compute, and it is possible that it does not converge to a minimum.
3. It is observed that SGD takes more iterations than GD to reach a minimum.
4. On the contrary, GD takes fewer steps to reach a minimum. SGD is noisier and takes more iterations because the frequent parameter updates have high variance and cause the loss to fluctuate with varying intensity.
5. Its code snippet simply adds a loop over the training examples and finds the gradient with respect to each of the training examples.
    import numpy as np

    for epoch in range(epochs):
        np.random.shuffle(data)  # visit the examples in a new random order each epoch
        for example in data:
            # find_gradient is a placeholder: gradient of the loss on one example
            params_gradient = find_gradient(loss_function, example, parameters)
            parameters = parameters - learning_rate * params_gradient

Advantages of Stochastic Gradient Descent

1. Convergence takes less time compared to the other variants, since the model parameters are updated frequently.
2. Requires less memory, as there is no need to store the loss values for the whole dataset.
3. May find new minima.

Disadvantages of Stochastic Gradient Descent

1. High variance in model parameters.
2. Even after reaching the global minimum, it may overshoot.
3. To reach the same convergence as gradient descent, we need to slowly reduce the value of the learning rate.

1.9.1 Mini-Batch Gradient Descent

1. To overcome the problem of large time complexity in the SGD algorithm, the MB-GD algorithm comes into the picture as an extension of SGD.
2. Beyond that, it also overcomes the problems of gradient descent. It is therefore considered the best among the variations of gradient descent. The MB-GD algorithm takes a batch (subset) of points from the dataset to compute the derivative.

[Figure: Stochastic Gradient Descent vs Mini-Batch Gradient Descent]

3. It is observed that the derivative of the loss function for MB-GD is almost the same as the derivative of the loss function for GD after some number of iterations.
4. However, the number of iterations needed to reach a minimum is larger for MB-GD than for GD, and the cost of computation is also larger.
5. The weight update depends on the derivative of the loss for a batch of points. The updates in MB-GD are noisier than in GD because the derivative does not always point towards the minimum.
6. It updates the model parameters after every batch: the algorithm divides the dataset into batches and updates the parameters after each one.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; B^{(i)})$, where $\{B^{(i)}\}$ are the batches of training examples.

In the code snippet, instead of iterating over examples, we now iterate over mini-batches of size 30:

    import numpy as np

    for epoch in range(epochs):
        np.random.shuffle(data)  # reshuffle so each epoch sees different batches
        # get_batches and find_gradient are placeholders, not library functions
        for batch in get_batches(data, batch_size=30):
            params_gradient = find_gradient(loss_function, batch, parameters)
            parameters = parameters - learning_rate * params_gradient

Advantages of Mini-Batch Gradient Descent

1. Updates the model parameters frequently and also has less variance.
2. Requires a medium amount of memory, neither very low nor very high.

Disadvantages of Mini-Batch Gradient Descent

1. The parameter updates in MB-GD are much noisier than the weight updates in the GD algorithm.
2. Compared to the GD algorithm, it takes a longer time to converge.
3. May get stuck at local minima.

1.9.2 Challenges with all types of Gradient-based Optimizers

1.9.2.1 Optimum Learning Rate

Choosing an optimum value of the learning rate is difficult. If we choose a learning rate that is too small, gradient descent may take a very long time to converge. For more about this challenge, refer to the section Role of Learning Rate above.

1.9.2.2 Constant Learning Rate

The same constant learning rate is applied to all the parameters, but there may be some parameters that we do not want to change at the same rate.

1.10 MACHINE LEARNING BASICS: Capacity, Overfitting and Underfitting

1.10.1 Capacity of a model

Model capacity is the ability to fit a wide variety of functions.

1. A model with low capacity struggles to fit the training set.
2. A high-capacity model can overfit by memorizing the training set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, i.e., the set of functions that the learning algorithm is allowed to select as the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space.
We can generalize linear regression to include polynomials in its hypothesis space, which increases the model's capacity.

1.10.1.1 Capacity of Polynomial Curve Fits

A polynomial of degree 1 gives a linear regression model with the prediction

$$\hat{y} = b + w x$$

By introducing $x^2$ as another feature provided to the regression model, we can learn a model that is quadratic as a function of $x$:

$$\hat{y} = b + w_1 x + w_2 x^2$$

The output is still a linear function of the parameters, so we can use the normal equations to train the model in closed form. We can continue to add more powers of $x$ as additional features, e.g., to obtain a polynomial of degree 9:

$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$$

1.10.1.2 Appropriate Capacity

1. Machine learning algorithms perform well when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.
2. Models with insufficient capacity are unable to solve complex tasks.
3. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.

1.10.1.3 Ordering Learning Machines by Capacity

The goal of learning is to choose an optimal element of a structure (e.g., the polynomial degree) and estimate its coefficients from a given training sample. For approximating functions that are linear in their parameters, such as polynomials, complexity is given by the number of free parameters. For functions that are nonlinear in their parameters, complexity is defined by the VC dimension. The optimal choice of model complexity provides the minimum of the expected risk.

[Figure: classification error versus model complexity, from underfitting to overfitting, showing the true risk, the empirical risk, and the confidence interval]

1.10.1.4 Representational and Effective Capacity

Representational capacity: specifies the family of functions the learning algorithm can choose from.

Effective capacity: imperfections in the optimization algorithm can make the effective capacity smaller than the representational capacity.
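The polynomial curve fits of Section 1.10.1.1 can be tried out with a short NumPy sketch. Everything about the data here is an assumption made for illustration: the ground-truth quadratic, the noise level, the ten training points, and the random seed. The helper names `fit_polynomial` and `mean_squared_error` are likewise hypothetical. The fit uses `np.linalg.lstsq`, which solves the same least-squares problem as the normal equations but does not form them explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    # Assumed ground truth for this illustration: a quadratic.
    return 1.0 + 2.0 * x - 3.0 * x ** 2


# Ten noisy training points and a dense, noise-free test grid.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_function(x_train) + rng.normal(scale=0.1, size=x_train.shape)
x_test = np.linspace(-1.0, 1.0, 101)
y_test = true_function(x_test)


def fit_polynomial(x, y, degree):
    """Least-squares fit; returns coefficients [b, w_1, ..., w_degree]."""
    features = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    coefficients, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coefficients


def mean_squared_error(coefficients, x, y):
    features = np.vander(x, len(coefficients), increasing=True)
    return float(np.mean((features @ coefficients - y) ** 2))


train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    coefficients = fit_polynomial(x_train, y_train, degree)
    train_mse[degree] = mean_squared_error(coefficients, x_train, y_train)
    test_mse[degree] = mean_squared_error(coefficients, x_test, y_test)
    print(degree, train_mse[degree], test_mse[degree])
```

The training error can only decrease as the degree grows, because each hypothesis space contains the previous one; with ten points, the degree-9 polynomial fits the training set essentially exactly. Comparing the training error with the test error for each degree is what indicates whether a model is underfitting or overfitting.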
