FML Unit 5

Neural Networks

Syllabus : Multi-layer perceptron, activation functions, network training - gradient descent optimization - stochastic gradient descent, error backpropagation, from shallow networks to deep networks - Unit saturation (aka the vanishing gradient problem) - ReLU, hyperparameter tuning, batch normalization, regularization, dropout.

Contents
4.1 Perceptron
4.2 Activation Functions
4.3 Gradient Descent Optimization
4.4 Error Backpropagation
4.5 Shallow Networks
4.6 Deep Network
4.7 Vanishing Gradient Problem
4.8 ReLU
4.9 Hyperparameter Tuning
4.10 Normalization
4.11 Regularization
4.12 Two Marks Questions with Answers

4.1 Perceptron

* The perceptron is a feed-forward network with one output neuron that learns a separating hyper-plane in a pattern space. The "n" linear Fx neurons feed forward to one threshold output Fy neuron. The perceptron separates linearly separable sets of patterns.

4.1.1 Single Layer Perceptron

* The Single Layer Perceptron (SLP) is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target (1, 0).
* We can connect any number of McCulloch-Pitts neurons together in any way we like. An arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output layer of McCulloch-Pitts neurons is known as a perceptron.
* A single layer feed-forward network consists of one or more output neurons, each of which is connected with a weighting factor Wi to all of the inputs Xi.
* The perceptron is thus a single-layer artificial network with only one neuron. The neuron unit calculates the linear combination of its real-valued or boolean inputs and passes it through a threshold activation function. Fig. 4.1.1 shows a perceptron with inputs 1 to N, weights, a threshold θ and a sigmoid / threshold output unit.
* In the simplest case the network has only two inputs and a single output. The output of the neuron is
  y = f(Σi Wi Xi + b)
* Suppose that the activation function is a threshold; then
  f(s) = 1 if s > 0
  f(s) = -1 if s ≤ 0
* The perceptron can represent most of the primitive boolean functions AND, OR, NAND and NOR, but it cannot represent XOR.
* In a single layer perceptron the initial weight values are assigned randomly, because the network has no previous knowledge. It sums all the weighted inputs; if the sum is greater than the threshold value θ, the neuron is activated (output = 1) :
  Output = 1 if W1X1 + W2X2 + ... + WnXn > θ
  Output = 0 if W1X1 + W2X2 + ... + WnXn ≤ θ
* The input values are presented to the perceptron, and if the predicted output is the same as the desired output, the performance is considered satisfactory and no changes to the weights are made. If the output does not match the desired output, the weights need to be changed to reduce the error. The weight adjustment is
  ΔW = η × d × x
  where x is the input data, d is the difference between the desired output and the predicted output, and η is the learning rate.
* If the output of the perceptron is correct, we take no action. If the output is incorrect, the weight vector is updated as W → W + ΔW.
* The process of weight adaptation is called learning.
* Perceptron learning algorithm :
  1. Select a random sample from the training set as input.
  2. If the classification is correct, do nothing.
  3. If the classification is incorrect, modify the weight vector W using Wj = Wj + η d(n) Xj(n).
  4. Repeat this procedure until the entire training set is classified correctly.
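* The update rule above can be written as a short program. The following is a minimal NumPy sketch of the perceptron learning algorithm (the function name, the learning rate of 0.1 and the use of the bipolar AND patterns discussed later in this section are illustrative choices, not part of the original text) :

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron learning: W <- W + eta * d * x whenever a sample is misclassified.
    The input is augmented with a constant 1 so the bias is learned as an extra weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # append bias input
    w = np.zeros(Xb.shape[1])                        # initial weights
    for _ in range(max_epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            pred = 1 if np.dot(w, xi) > 0 else -1    # threshold activation
            d = target - pred                        # desired output minus predicted output
            if d != 0:
                w += eta * d * xi                    # weight adjustment Delta W = eta * d * x
                errors += 1
        if errors == 0:                              # entire training set classified correctly
            break
    return w

# Bipolar AND patterns (a linearly separable problem)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, -1, -1, 1])
print(train_perceptron(X, y))
```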
4.1.2 Multilayer Perceptron

* A MultiLayer Perceptron (MLP) has the same structure as a single layer perceptron but with one or more hidden layers; its simple processing units are called perceptrons.
* A typical multilayer perceptron network consists of a layer of source nodes forming the input layer, one or more hidden layers of computation nodes and an output layer of nodes.
* It is not possible to find weights which enable a single layer perceptron to deal with non-linearly separable problems like XOR. Fig. 4.1.2 contrasts the OR, AND and XOR patterns.

4.1.3 Limitation of Learning in Perceptron : Linear Separability

* Consider two-input patterns (X1, X2) being classified into two classes, as shown in Fig. 4.1.3. Each point marked with the symbol x or O represents a pattern with a set of values (X1, X2), and each pattern is classified into one of the two classes. Notice that these classes can be separated by a single straight line L.
* Linear separability refers to the fact that classes of patterns with n-dimensional vectors X = (X1, X2, ..., Xn) can be separated with a single decision surface. In the case above, the line L represents the decision surface.
* If two classes of patterns can be separated by a decision boundary represented by a linear equation, they are said to be linearly separable, and the simple network can correctly classify every pattern. The decision boundary (i.e. W, b or θ) of linearly separable classes can be determined either by some learning procedure or by solving linear equation systems based on representative patterns of each class.
* If such a decision boundary does not exist, the two classes are said to be linearly inseparable. Linearly inseparable problems cannot be solved by the simple network; a more sophisticated architecture is needed.
* Examples of linearly separable classes :
  1. Logical AND function, bipolar patterns :
     x1 x2 y : (-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)
     A suitable decision boundary is w1 = 1, w2 = 1, b = -1, i.e. x1 + x2 - 1 = 0 (Fig. 4.1.4; X : class I (y = 1), O : class II (y = -1)).
  2. Logical OR function, bipolar patterns :
     x1 x2 y : (-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, 1)
     A suitable decision boundary is w1 = 1, w2 = 1, b = 1, i.e. x1 + x2 + 1 = 0 (Fig. 4.1.5; X : class I (y = 1), O : class II (y = -1)).
* Example of linearly inseparable classes :
  1. Logical XOR (exclusive OR) function, bipolar patterns :
     x1 x2 y : (-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)
     (Fig. 4.1.6; X : class I (y = 1), O : class II (y = -1))
* No line can separate these two classes, as can be seen from the fact that the following linear inequality system has no solution :
  b + w1 + w2 < 0    ...(1)
  b - w1 + w2 ≥ 0    ...(2)
  b + w1 - w2 ≥ 0    ...(3)
  b - w1 - w2 < 0    ...(4)
  because we get b < 0 from (1) + (4) and b ≥ 0 from (2) + (3), which is a contradiction.
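* These examples can be checked directly in code. The short NumPy sketch below (the helper name and the brute-force grid are illustrative; the AND and OR weights are the ones given above) verifies the two separable boundaries and shows that no weight setting on a coarse grid classifies the XOR patterns :

```python
import numpy as np

def predict(w1, w2, b, x1, x2):
    """Bipolar threshold unit: output 1 if b + w1*x1 + w2*x2 > 0, else -1."""
    return 1 if b + w1 * x1 + w2 * x2 > 0 else -1

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
and_targets = [-1, -1, -1, 1]
or_targets  = [-1,  1,  1, 1]
xor_targets = [-1,  1,  1, -1]

# Decision boundaries from the text: AND uses w1 = w2 = 1, b = -1; OR uses w1 = w2 = 1, b = 1.
print([predict(1, 1, -1, *p) for p in patterns] == and_targets)   # True
print([predict(1, 1,  1, *p) for p in patterns] == or_targets)    # True

# A brute-force search over a grid of weights finds no boundary for XOR.
grid = np.linspace(-2, 2, 21)
found = any(all(predict(w1, w2, b, *p) == t for p, t in zip(patterns, xor_targets))
            for w1 in grid for w2 in grid for b in grid)
print(found)   # False - XOR is not linearly separable
```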
4.2 Activation Functions

* An activation function, also known as a transfer function, is used to map input nodes to output nodes in a certain fashion.
* The activation function is one of the most important factors in a neural network : it decides whether or not a neuron will be activated and whether its output is transferred to the next layer.
* Activation functions help in normalizing the output between 0 and 1, or between -1 and 1, and they support the process of backpropagation because of their differentiable property. During backpropagation the loss function is minimized, and the derivative of the activation function helps gradient descent move towards a (local) minimum.
* The activation function basically decides, in any neural network, whether the given input or received information is relevant or irrelevant.
* These activation functions give a multilayer network greater representational power than a single layer network, but only when non-linearity is introduced.
* The input to the activation function is the weighted sum defined by
  sum = I1W1 + I2W2 + ... + InWn = Σi IiWi + b
* Activation function : Logistic function
  f(sum) = 1 / (1 + e^(-s·sum)) = (1 + e^(-s·sum))^(-1)
  Fig. 4.2.1 shows the logistic function and its first derivative. The logistic function increases monotonically from a lower limit (0) to an upper limit (1) as sum increases. Its values vary between 0 and 1, with a value of 0.5 when sum is zero.
* Activation function : Arc tangent
  f(sum) = (2/π) · tan⁻¹(s·sum)
  Fig. 4.2.2 shows the arc tangent activation function.
* Activation function : Hyperbolic tangent
  f(sum) = tanh(s·sum)
  Fig. 4.2.3 shows the hyperbolic tangent activation function.

4.2.1 Identity or Linear Activation Function

* A linear activation is a mathematical equation used for obtaining output vectors with specific properties.
* It is a simple straight-line activation function in which the output is directly proportional to the weighted sum of the neuron's inputs.
* Linear activation functions are better at giving a wide range of activations, and a line with a positive slope may increase the firing rate as the input rate increases.
* Fig. 4.2.4 shows the identity function. The equation for a linear activation function is
  f(x) = ax
  When a = 1, f(x) = x, and this special case is known as the identity function.
* Properties :
  1. The range is -infinity to +infinity.
  2. It provides a convex error surface, so optimisation can be achieved faster.
  3. df(x)/dx = a, which is constant, so it cannot be optimised with gradient descent.
* Limitations :
  1. Since the derivative is constant, the gradient has no relation with the input.
  2. Backpropagation is constant, as the change is simply delta x.
  3. In practice this activation function does not work well in neural networks.

4.2.2 Sigmoid

* A sigmoid function produces a curve with an "S" shape. The example sigmoid function shown in Fig. 4.2.5 is a special case of the logistic function, which models the growth of some set :
  sig(t) = 1 / (1 + e^(-t))
* In general, a sigmoid function is real-valued and differentiable, having either a non-negative or a non-positive first derivative, one local minimum and one local maximum.
* The logistic sigmoid function is related to the hyperbolic tangent as follows :
  1 - 2·sig(x) = 1 - 2 / (1 + e^(-x)) = -tanh(x/2)
* Sigmoid functions are often used in artificial neural networks to introduce non-linearity in the model. A neural network element computes a linear combination of its input signals and applies a sigmoid function to the result.
* A reason for its popularity in neural networks is that the sigmoid function satisfies a property relating its derivative to itself that is computationally easy to evaluate :
  d/dt sig(t) = sig(t) · (1 - sig(t))
* Derivatives of the sigmoid function are usually employed in learning algorithms.
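* The functions above translate directly into code. The following NumPy sketch (the function names and the steepness parameter s are illustrative conventions, not fixed by the text) implements the logistic, arc tangent, hyperbolic tangent and linear activations, together with the sigmoid derivative identity used in learning algorithms :

```python
import numpy as np

def logistic(total, s=1.0):
    """f(sum) = 1 / (1 + exp(-s*sum)); values in (0, 1), equal to 0.5 at sum = 0."""
    return 1.0 / (1.0 + np.exp(-s * total))

def arc_tangent(total, s=1.0):
    """f(sum) = (2/pi) * arctan(s*sum); values in (-1, 1)."""
    return (2.0 / np.pi) * np.arctan(s * total)

def hyperbolic_tangent(total, s=1.0):
    """f(sum) = tanh(s*sum); values in (-1, 1)."""
    return np.tanh(s * total)

def linear(total, a=1.0):
    """f(x) = a*x; with a = 1 this is the identity function."""
    return a * total

def sigmoid_derivative(total):
    """d/dt sig(t) = sig(t) * (1 - sig(t)), the identity quoted above."""
    s = logistic(total)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 5)
print(logistic(x))
print(hyperbolic_tangent(x))
print(sigmoid_derivative(x))
```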
4.3 Gradient Descent Optimization

* Gradient descent is an optimization algorithm used in machine learning to minimize a function by iteratively moving towards the minimum value of that function.
* We essentially use this algorithm when we have to find the least possible values that satisfy a given cost function. In machine learning, more often than not we try to minimize loss functions (like the Mean Squared Error). By minimizing the loss function, we can improve our model, and gradient descent is one of the most popular algorithms used for this purpose. Fig. 4.3.1 illustrates the gradient descent algorithm on a cost function.
* The graph in Fig. 4.3.1 shows how a gradient descent algorithm works. We first take a point on the cost function and begin moving in steps towards the minimum point. The size of each step, i.e. how quickly we converge to the minimum point, is defined by the learning rate. With a higher learning rate we can cover more ground per step, but at the risk of overshooting the minima; on the other hand, small steps (smaller learning rates) consume a lot of time to reach the lowest point.
* The direction in which the algorithm has to move (towards the minimum) is also important. We calculate this by using derivatives. A derivative is basically the slope of the graph at a particular point; we obtain it by finding the tangent line to the graph at that point. The steeper the tangent, the more steps are needed to reach the minimum point; a less steep tangent suggests that fewer steps are required to reach the minimum point.

4.3.1 Stochastic Gradient Descent

* The word 'stochastic' means a system or process that is linked with random probability. Hence, in Stochastic Gradient Descent (SGD) a few samples are selected randomly, instead of the whole data set, for each iteration.
* Stochastic Gradient Descent is a type of gradient descent that processes one training example per iteration. It works through the training examples within an epoch one at a time, updating the parameters after each example.
* As it requires only one training example at a time, it is easier to fit in memory. The frequent updates come at some computational cost and cause the loss to fluctuate, but in comparison to batch gradient descent they give more detail and speed.
* Further, because of the frequent updates each step uses what is effectively a noisy gradient. However, this noise can sometimes help in finding the global minimum and in escaping a local minimum.
* Advantages of stochastic gradient descent :
  a) It is easier to allocate in the desired memory.
  b) It is relatively fast to compute compared with batch gradient descent.
  c) It is more efficient for large datasets.
* Disadvantages of stochastic gradient descent :
  a) SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations.
  b) SGD is sensitive to feature scaling.

4.4 Error Backpropagation

* Backpropagation is a training method used for a multilayer neural network. It is also called the generalized delta rule. It is a gradient descent method which minimizes the total squared error of the output computed by the net.
* The backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.
* Backpropagation is a systematic method and a generalization of the Widrow-Hoff error correction rule. About 80 % of ANN applications use backpropagation.
* Fig. 4.4.1 shows a backpropagation network : inputs with synaptic weights feed a summing junction, and the sum (together with a threshold) is passed through an activation function to produce the output y.
* Consider a simple neuron :
  a. The neuron has a summing junction and an activation function.
  b. Any function that is continuous and differentiable everywhere with respect to the sum can be used as the activation function.
  c. Examples : the logistic function, the arc tangent function and the hyperbolic tangent activation function.
* These activation functions give the multilayer network greater representational power than a single layer network, but only when non-linearity is introduced.
* Need of hidden layers :
  1. A network with only two layers (input and output) can only represent the input with whatever representation already exists in the input data.
  2. If the data is discontinuous or non-linearly separable, the innate representation is inconsistent and the mapping cannot be learned using two layers (input and output).
  3. Therefore, hidden layer(s) are used between the input and output layers.
* Weights connect units (neurons) in one layer only to those in the next higher layer. The output of a unit is scaled by the value of the connecting weight and fed forward to provide a portion of the activation for the units in the next higher layer.
* Backpropagation can be applied to an artificial neural network with any number of hidden layers. The training objective is to adjust the weights so that the application of a set of inputs produces the desired outputs.
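* To make the forward and backward computations concrete, here is a minimal NumPy sketch of error backpropagation for a network with one hidden layer and sigmoid activations, trained on the XOR problem from Section 4.1 (the layer sizes, learning rate and variable names are illustrative assumptions, not a reference implementation from the text) :

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))                      # logistic activation

# Small random weights (both positive and negative) so the network is not saturated.
W1 = rng.uniform(-0.5, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.uniform(-0.5, 0.5, (4, 1)); b2 = np.zeros(1)
eta = 0.5

for epoch in range(20000):
    # Forward pass: layer by layer, the output of one layer is the input to the next.
    h = sig(X @ W1 + b1)
    y = sig(h @ W2 + b2)

    # Backward pass: error signals propagate backward and are used to adjust the weights.
    delta_out = (y - T) * y * (1 - y)                    # output-layer deltas (delta rule)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)         # hidden-layer deltas

    W2 -= eta * h.T @ delta_out; b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid; b1 -= eta * delta_hid.sum(axis=0)

print(np.round(y, 2))    # typically approaches [[0], [1], [1], [0]]
```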
As the middle layer neurons have no target values, it makes the problem complex. * Selection of number of hidden units the number of input units. 1. Never choose h to be more than twice the number of input units. + The number of hidden units depends on 2. You can load p patterns of I elements into log, p hidden units. 3. Ensure that we must have at least Ie times as many training examples. Feature extraction requires fewer hidden units than inputs. 4, 5. Learning many examples of disjointed inputs requires more hidden units than inputs. 2 The number of hidden units required for a classification task increases with the number of classes in the task. Large networks Tequire longer training times. TECHNICAL PUBLICATIONS® . an Uprthrust for knowledge 4-15 Neural Networks chine Leaming tors influencing Backpropagation training fatto waining time can be reduced by using 1, Bias : Networks with biases can represent relationships between inputs and outputs more easily than networks without biases. Adding a bias to each neuron is usually desirable to offset the origin of the activation function. The weight of the bias is trainable similar to weight except that the input is always +1. Momentum : The use of momentum enhances the stability of the training, process. Momentum is used to keep the training process going in the same general direction analogous to the way that momentum of a moving object behaves. In backpropagation with momentum, the weight change is a combination of the current gradient and the previous gradient. "ERI Advantages and Disadvantages advantages of backpropagation : 1, Itis simple, fast and easy to program. 2. Only numbers of the input are tuned and not any other parameter. 3, No need to have prior knowledge about the network. 4, It is flexible. 5, A standard approach and works efficiently. 6. It does not require the user to learn special functions. " Disadvantages of backpropagation : 1. Backpropagation possibly be sensitive 2. The performance of this is highly reliant on the input data to noisy data and irregularity. 3. Needs excessive time for training. 4. The need for a matrix-based method for backpropagation instead of mini - _ EEX shallow Networks + The terms shallow and deep refer to the number of layers in a neural network; shallow neural networks refer to a neural network that have a small number of layers, usually regarded as having a single hidden layer and deep neural networks refer to neural networks that have multiple hidden layers. Both types of networks perform certain tasks better than the other and selecting the right network depth is important for creating a successful model. es of the feature vector of the data to be hidden layer of nodes (neurons) each of batch. In a shallow neural network, the valu classified (the input layer) are passed to a TECHNICAL PUBLICATIONS® - an up-thrust for knawiedge Neural, 4-16 Netw Machine Learning ; , : ¢ activation function, g, acting on, the ing to som which generates a response according i values, 2. weighted sum of those valu‘ is then passed to a final, output + layer . Jes of each unit in the hidden layer 1s Te” Pe ro tices a + The respons single uit), layer (which may consist of classification prediction output. EX Deep Network is a new area of machine s ae eres objective of moving machine learning ee t0 one of is aan Deep learning is about learning multiple levels of presen in . 
snd absteetion that help to make sense of data such as images, sound and text, learning research, which has been ‘Deep learning’ means using a neural network with several layers of odes between input and output. It is generally better than other met! Bs S on image, speech and certain other types of data because the series of layers tween input and output do feature identification and processing in a series of stages, just as our brains seem to. Deep Learning emphasizes the network architecture of today's most successful machine leaming approaches. These methods are based on "deep" multi - layer neural networks with many hidden layers. EEE TensorFlow * TensorFlow is one of the most popular frameworks used to build deep learning models. The framework is developed by Google Brain Team. * Languages like C++, R and Python are supported by the framework to create the models as well as the libraries, This framework can be accessed from both - desktop and mobile, * The translator used by Google is the best example of TensorFlow. In this, the model is created by adding the functionalities of text classification, natural language processing, speech or handwriting recognition, image recognition, etc. * The framework has its own visualization toolkit, named TensorBoard which helps in powerful data visualization of the network along with its performance. * One more tool added in TensorFlow, TensorFlo} and easy deployment of the newly developed a change in the existing API or architecture. © TensorFlow framework comes along with a detailed documentation for the users to adapt it quickly and easil aan ; ly, making it ¢ framework to model deep learning algorithms, epee cares Recep lease w Serving, can be used for quick Igorithms without introducing any TECHNICAL PUBLICATIONS® - an “p-thrust for knowledge “a ochine Learning 4017 Neural Networks some of the characteristics of TensorFlow is : © Multiple GPU supported. © One can visualize graphs and queues easily using TensorBoard o Powerful documentation and larger support from community. pa Keras + If you are comfortable in programming with Python, then learning Keras will not prove hard to you. This will be the most recommended framework to create deep Jearning models for ones having a sound of Python. Keras is built purely on Python and can run on the top of TensorFlow. Due to its complexity and use of low - level libraries, TensorFlow can be comparatively harder to adapt for the new users as compared to Keras. Users those who are beginners in deep learning, and find its models difficult to understand in TensorFlow generally prefer Keras as it solves all complex models in no time. + Keras has been developed keeping in mind the complexities in the deep learning models, and hence it can run quickly’ to get the results in minimum time. Convolutional as well as Recurrent Neural networks are supported in Keras. The framework can run easily on CPU and GPU. + The models in Keras can be classified into 2 categories = 1, Sequential model : The layers in the deep learning model are defined in a sequential manner. Hence the implementation of the layers in this model will also be done sequentially. 2. Keras functional API : Deep leaming models that has multiple outputs, or has shared layers, ie. more complex models can be implemented in Keras functional API. EEE] pitference between Deep Network and Shallow Network | | Sr.No. Deep network Shallow network | | : | 1 Deep network contains many Shallow network contains only © | | hidden layers. fone hidden layer. 
| peed y | | 2 Deep network can compactly Shallow networks with one | | express highly complex functions Hidden layer cannot place i over input space complex functions over the input | L | | TECHNICAL PUBLICATIONS® - an up-thrust for knowledge Machine Leaming Vanishing Gradient Problem a network is more | i 3. Training in DN is easy and no ee ein our current | issue of local minima in DN te wath , net's needs more | | 4. DN can fit functions better with shallow 2 ee batter | | Jess parameters than a shallow paral a | network So sail isa problem that user face, when we are training ke backpropagation. This dient-based methods il : .d tune the parameters of the earlier layers in The vanishing gradient problem Neural Networks by using gta problem makes it difficult to learn an the network. The vanishing gradient pro multilayer feed-forward netwo: have the ability to propagate use the model back to the layers near the.input end of the model. dered unable to learn on a specific It results in models with many layers being re! dataset, It could even cause models with many layers to prematurely converge to a substandard solution. When the backpropagation algorithm advances downwards or backward going from the output layer to the input layer, the gradients tend to shrink, becoming ‘mailer and smaller till they approach zero. This ends up leaving the weights of the initial or lower layers practically unchanged. In this situation, the gradient descent does not ever end up converging to the optimum. blem is essentially 4 situation in which a deep rk or a Recurrent Neural Network (RNN) does not ful gradient information from the output end of Vanishing gradient does not necessarily imply that the gradient vector is all zero. It implies that the gradients are minuscule, which would cause the learning to be very slow. ‘The most important solution to the vanishin; i is i gradient problem is a specific type of neural network called Long Short-Term Memory Networks (LSTMs). ‘ ” Indication of vanishing gradient problem : a) The parameters of the higher la ers change t i parameters of lower layers barely change. ea aaa b) The model weights could become 0 during training, ©) The model learns at a particular ‘ ly slow pace ini a very early phase after only a few iio are elena Some methods that are proposed to overcome the vani. a) Residual neural networks (ResNets) oor g could stagnate at ishing gradient problem : TECHNICA! Bret im» Machine Learning. 4-19 Neural Networks b) Multi-level hierarchy ) Long Short Term Memory (LSTM) d) Faster hardware e) ReLU ) Batch normalization [Eel ReLu « Rectified Linear Unit (ReLU) solve the vanishing gradient problem. ReLU is a nor-linear function or piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. « It is the most commonly used activation function in neural networks, especially in Convolutional Neural Networks (CNNs) and Multilayer perceptron’s. + Mathematically, it is expressed as f(x) = max (0, x) where x : input to neuron _« Fig. 4.8.1 shows ReLU function 0 Riz) = max(0, 2) | | t i rs -10_ 800 8 6 4 2 ° Fig. 4.8.1 ReLU function _* The derivative of an activation function is required when updating the weights during the back-propagation of the error. The slope of ReLU is 1 for positive values and 0 for negative values. It becomes non-differentiable when the input x is zero, but it can be safely assumed to be zero and causes no problem in practice. _ * ReLU is used in the hidden layers instead of Sigmoid or tanh. 
4.8 ReLU

* The Rectified Linear Unit (ReLU) helps solve the vanishing gradient problem. ReLU is a non-linear (piecewise linear) function that outputs the input directly if it is positive; otherwise it outputs zero.
* It is the most commonly used activation function in neural networks, especially in Convolutional Neural Networks (CNNs) and multilayer perceptrons.
* Mathematically it is expressed as
  f(x) = max(0, x), where x is the input to the neuron.
  Fig. 4.8.1 shows the ReLU function R(z) = max(0, z).
* The derivative of an activation function is required when updating the weights during the backpropagation of the error. The slope of ReLU is 1 for positive values and 0 for negative values. It becomes non-differentiable when the input x is zero, but the derivative can safely be taken as zero there, and this causes no problem in practice.
* ReLU is used in the hidden layers instead of sigmoid or tanh. The ReLU function also avoids the computational complexity of the logistic sigmoid and tanh functions.
* A ReLU activation unit is known to be less likely to create a vanishing gradient problem, because its derivative is always 1 for positive values of the argument.
* Advantages of the ReLU function :
  a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the error.
  b) It is easy to implement and very fast.
  c) The calculation speed is very fast, since the ReLU function involves only a direct relationship.
  d) It can be used for deep network training.
* Disadvantages of the ReLU function :
  a) When the input is negative, ReLU is not fully functional : units that keep receiving negative inputs output zero and stop learning, so the unit can "die". This is also known as the dead neurons (dying ReLU) problem.
  b) The ReLU function can only be used within the hidden layers of a neural network model.

4.8.1 LReLU and EReLU

1. LReLU
* The Leaky ReLU (LReLU) is one of the best known ReLU variants. It is the same as ReLU for positive numbers, but instead of being 0 for all negative values it has a small constant slope (less than 1).
* Leaky ReLU is a type of activation function that helps to prevent the function from becoming saturated at 0 : it has a small slope for negative inputs instead of the zero slope of the standard ReLU.
* Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Fig. 4.8.2 shows the LReLU function.
* The leak helps to increase the range of the ReLU function. Usually, the value of the leak coefficient a is 0.01 or so.
* The motivation for using LReLU instead of ReLU is that constant zero gradients can also result in slow learning, as happens when a saturated neuron uses a sigmoid activation function.

2. EReLU
* An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution during training for the positive inputs, in order to control the amount of non-linearity.
* The EReLU is defined as EReLU(x) = max(Rx, 0), where R is a random number drawn from a uniform distribution.
* At test time, the EReLU becomes the identity function for positive inputs.

4.9 Hyperparameter Tuning

* Hyperparameters are parameters whose values control the learning process and determine the values of the model parameters that a learning algorithm ends up learning.
* While designing a machine learning model, one always has multiple choices for the architectural design of the model. This creates confusion about which design to choose on grounds of optimality, and because of this there are always trials in search of a well-performing machine learning model. The parameters that are used to define these machine learning models are known as hyperparameters, and the rigorous search over these parameters to build an optimized model is known as hyperparameter tuning.
* Hyperparameters are not model parameters, which can be trained directly from the data. Model parameters specify how to transform the input into the required output, whereas hyperparameters define the actual structure of the model that produces that output.

4.9.1 Layer Size

* Layer size is defined by the number of neurons in a given layer. Input and output layers are relatively easy to figure out, because they correspond directly to how our modelling problem handles input and output.
* For the input layer, the size will match the number of features in the input vector. For the output layer, it will either be a single output neuron or a number of neurons matching the number of classes we are trying to predict.
* A neural network with 3 layers will usually give better performance than one with 2 layers, but increasing the number of layers beyond 3 does not help much in ordinary fully connected networks. In the case of a CNN, an increasing number of layers does make the model better. A small tuning sketch is given below.

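* Tuning a structural hyperparameter such as the hidden-layer size can be done with a simple search loop. The sketch below (the candidate sizes, the toy dataset and the validation split are illustrative assumptions) trains one small Keras Sequential model per candidate layer size and keeps the best one by validation accuracy :

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # toy binary target

def build_model(hidden_units):
    model = keras.Sequential([
        layers.Input(shape=(10,)),
        layers.Dense(hidden_units, activation="relu"),   # layer size is the hyperparameter
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

best_units, best_acc = None, 0.0
for units in [2, 8, 32]:                            # candidate layer sizes to try
    model = build_model(units)
    history = model.fit(X, y, validation_split=0.25, epochs=30, verbose=0)
    acc = history.history["val_accuracy"][-1]
    if acc > best_acc:
        best_units, best_acc = units, acc
print(best_units, best_acc)
```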