The document covers various machine learning techniques for predictive modeling, including decision trees, logistic regression, neural networks, and k-nearest neighbors (kNN). It explains the structure and functioning of decision trees, their advantages, and the algorithms used for building them, along with an overview of logistic regression and neural networks. Additionally, it discusses model assessment techniques like Cp, AIC, and BIC for evaluating regression models.
UNIT-III
Machine Learning for prediction
Predictive modeling: decision trees, logistic regression, neural networks, kNN, Bayesian methods. Regression models: assessing predictive models - batch approach to model assessment, percent correct classification, rank-ordered approach to model assessment, assessing regression models.

Decision Tree
• A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• A decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation of all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, like a tree, it starts with a root node, which expands into further branches and constructs a tree-like structure.
• To build a tree, we use the CART algorithm, which stands for Classification and Regression Tree.
• A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.
• Note: a decision tree can handle categorical data (Yes/No) as well as numeric data.

Why use Decision Trees?
• There are many algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Two reasons for using a decision tree are:
• Decision trees usually mimic human thinking while making a decision, so they are easy to understand.
• The logic behind a decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies
• Root node: the node from which the decision tree starts. It represents the entire dataset, which is further divided into two or more homogeneous sets.
• Leaf node: a final output node; the tree cannot be split further after reaching a leaf node.
• Splitting: the process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/sub-tree: a tree formed by splitting the tree.
• Pruning: the process of removing unwanted branches from the tree.
• Parent/child node: the root node of the tree is called the parent node, and the other nodes are called child nodes.

How does the Decision Tree algorithm work?
• Step 1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step 3: Divide S into subsets that contain the possible values of the best attribute.
• Step 4: Generate a decision tree node that contains the best attribute.
• Step 5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
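As a rough illustration of these steps (not part of the original notes), the sketch below fits a tree with scikit-learn's DecisionTreeClassifier, which implements an optimised CART. The tiny "job offer" style dataset and its feature names are invented purely for demonstration.

```python
# Minimal sketch: fit a CART-style decision tree with scikit-learn.
# The dataset and feature names (salary, distance, cab_facility) are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [salary_in_lakhs, distance_km, cab_facility (0/1)]
X = [
    [12, 5, 1],
    [6, 25, 0],
    [15, 30, 1],
    [7, 8, 0],
    [11, 40, 0],
]
y = ["Accept", "Decline", "Accept", "Decline", "Decline"]  # target labels

# criterion="entropy" chooses splits by information gain (see the ASM section below)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))
print(tree.predict([[13, 10, 1]]))  # classify a new candidate's offer
```

The printed tree shows the learned decision nodes and leaf nodes in text form, mirroring the structure described above.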
Example
• Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits into the next decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).

Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. Two popular ASM techniques are:
• Information gain and the Gini index.

1. Information Gain
• Information gain is the measurement of the change in entropy after a dataset is segmented on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize information gain, and the node/attribute with the highest information gain is split first. It can be calculated using the formula:
  Information Gain = Entropy(S) - [weighted average * Entropy(each feature)]
• Entropy: entropy is a metric that measures the impurity in a given attribute; it specifies the randomness in the data. For a set S in which class i occurs with proportion p_i, entropy can be calculated as:
  Entropy(S) = - Σ p_i * log2(p_i)

LOGISTIC REGRESSION
• Logistic regression is a statistical method used for binary classification; it predicts the probability that an input belongs to one of two possible classes (usually denoted 0 and 1).
• Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It is widely used in many fields, including machine learning, medical research, economics, and the social sciences.
• Logistic regression is a simple yet powerful algorithm for binary classification tasks, and its interpretability and efficiency make it a popular choice in many applications. It can also be extended to handle multiclass classification problems using techniques such as one-vs-all (OvA) or softmax regression.
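As a brief, illustrative sketch (assuming scikit-learn is available; the data is synthetic and generated only for demonstration), the following fits a logistic regression classifier and reads off predicted class probabilities.

```python
# Minimal sketch of binary classification with logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()        # learns weights for the logistic (sigmoid) function
clf.fit(X, y)

probs = clf.predict_proba(X[:5])  # probability of class 0 and class 1 for each row
labels = clf.predict(X[:5])       # hard 0/1 predictions at the default 0.5 threshold
print(probs)
print(labels)
```

The predicted probabilities are what make logistic regression so interpretable: each row sums to 1 and can be thresholded differently depending on the application.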
Neural Networks
• Neural networks, inspired by the structure and functioning of the human brain, are a class of machine learning models that excel at learning complex patterns and representations from data.
• They consist of interconnected nodes, known as neurons, organized into layers.
• Neural networks are a fundamental component of deep learning, a subfield of machine learning characterized by the use of deep architectures with multiple layers.

Key Concepts
• Neurons: neurons are the basic building blocks of a neural network. Each neuron processes input data and produces an output.
• Layers: neural networks are organized into layers, including an input layer, one or more hidden layers, and an output layer. The input layer receives data, the hidden layers process it, and the output layer produces the final result.
• Weights and biases: weights and biases are parameters that the neural network learns during training. They determine the strength of the connections between neurons and affect the output.
• Activation functions: activation functions introduce non-linearities into the model, allowing it to learn complex relationships in the data. Common activation functions include sigmoid, tanh, and the rectified linear unit (ReLU).
• Feedforward and backpropagation: in the training phase, data is passed through the network in a feedforward manner to make predictions. The backpropagation algorithm is then used to adjust the weights and biases based on the error, optimizing the model.
• Loss function: the loss function measures the difference between the predicted output and the actual target. During training, the goal is to minimize this loss, guiding the network to make more accurate predictions.
• Deep learning: deep neural networks have multiple hidden layers, enabling them to learn hierarchical representations of data. This depth allows them to handle intricate features and patterns.
• Types of neural networks: different architectures cater to different tasks, including Convolutional Neural Networks (CNNs) for image-related tasks, Recurrent Neural Networks (RNNs) for sequential data, and Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks for improved handling of long-range dependencies.

Applications
• Image and speech recognition: neural networks have achieved significant success in tasks such as image classification, object detection, and speech recognition.
• Natural Language Processing (NLP): they are widely used in language-related tasks, including sentiment analysis, machine translation, and text generation.
• Medical diagnostics: neural networks contribute to medical image analysis, disease diagnosis, and predicting patient outcomes.
• Autonomous vehicles: in autonomous driving, neural networks play a crucial role in tasks such as object detection, path planning, and decision-making.
• Financial forecasting: neural networks are applied to predict stock prices, analyze market trends, and model financial data.

Example: a virtual personal assistant's speech recognition
• Imagine using a virtual personal assistant such as Apple's Siri, Amazon's Alexa, or Google Assistant. These assistants employ neural networks, particularly recurrent neural networks (RNNs) and convolutional neural networks (CNNs), for speech recognition.
• 1. Data input: you activate the virtual assistant by saying a command, such as "Hey Siri" or "Alexa."
• 2. Neural network processing: the neural network within the assistant's system processes the incoming audio data in real time. The network has been trained on vast datasets containing a wide variety of spoken phrases and words.
• 3. Feature extraction: the network extracts relevant features from the audio data, capturing the nuances of your voice, accent, and speech patterns. Recurrent neural networks are particularly effective at handling sequential data such as spoken language.
• 4. Pattern recognition: the network recognizes patterns and converts the audio input into a sequence of words or commands. This involves complex computations to understand context, syntax, and semantics.
• 5. Command execution: based on the recognized speech, the assistant executes the corresponding command. For example, if you say, "What's the weather today?" the network interprets the query and triggers the appropriate response by fetching real-time weather information.
• 6. Continuous learning: neural networks in virtual assistants are often designed for continuous learning. As users interact more, the network adapts to individual speech patterns, accents, and preferences, enhancing its performance over time.
• This example illustrates how neural networks in speech recognition applications have become integral to our daily lives. The technology enables natural and seamless interaction with devices, showcasing the power of neural networks in understanding and processing complex patterns in real-world scenarios.
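To connect the key concepts above (weighted sums, biases, activation functions, feedforward computation) to code, here is a toy, untrained feedforward pass written with NumPy. The layer sizes and random weights are arbitrary and purely illustrative; this is not a model of any real assistant.

```python
# Toy feedforward pass: each layer computes a weighted sum of its inputs plus a
# bias, then applies an activation function. No training is performed here.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))        # input layer: 3 features

W1 = rng.normal(size=(4, 3))     # hidden layer: 4 neurons
b1 = np.zeros(4)
h = relu(W1 @ x + b1)            # weighted sum + bias, then ReLU activation

W2 = rng.normal(size=(1, 4))     # output layer: 1 neuron
b2 = np.zeros(1)
y_hat = sigmoid(W2 @ h + b2)     # sigmoid gives a probability-like output

print(y_hat)
# Training would compare y_hat to a target with a loss function and use
# backpropagation to adjust W1, b1, W2, b2.
```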
What is an Artificial Neural Network?
• Relationship between a biological neural network and an artificial neural network:
  • Dendrites -> Inputs
  • Cell nucleus -> Nodes
  • Synapse -> Weights
  • Axon -> Output
• Input layer: as the name suggests, it accepts inputs in several different formats provided by the programmer.
• Hidden layer: the hidden layer sits between the input and output layers. It performs all the calculations needed to find hidden features and patterns.
• Output layer: the input goes through a series of transformations in the hidden layers, which finally results in the output conveyed by this layer.
• The artificial neural network takes the inputs, computes their weighted sum, and adds a bias. This computation is represented in the form of a transfer function.

K-Nearest Neighbour
• K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using K-NN.
• K-NN can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• At the training phase the KNN algorithm just stores the dataset, and when it receives new data it classifies that data into the category most similar to the new data.

Example
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog.
• For this identification we can use the KNN algorithm, since it works on a similarity measure.
• Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.

Why do we need a K-NN algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1. To which of these categories will the data point belong?
• To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
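A minimal sketch of this Category A / Category B situation, assuming scikit-learn is available; the two-feature points and the new point x1 are made up for illustration.

```python
# Classify a new point x1 by majority vote among its k nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],      # Category A points
     [6, 6], [6, 7], [7, 6]]      # Category B points
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours
knn.fit(X, y)                              # "training" just stores the data

x1 = [[5, 5]]                              # the new data point x1
print(knn.predict(x1))                     # -> ['B']: its 3 nearest neighbours are all B
```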
Cp (Mallows' Cp)
• Purpose: Cp was developed by Colin Mallows to assess the quality of linear regression models. It is primarily used in the context of regression analysis.
• Calculation: Cp measures the trade-off between model fit and model complexity. It is calculated as follows:
  Cp = (SSE_p / MSE) - (n - 2p)
  where:
  • SSE_p: the sum of squared errors for the model with p predictor variables.
  • MSE: the mean squared error for the full model with all predictor variables.
  • n: the number of data points.
  • p: the number of predictor variables in the model.
• Interpretation: a smaller Cp value indicates a better trade-off between model fit and complexity. Cp is used to assess whether a model with a subset of predictor variables is competitive with the full model, while penalizing for model complexity.

AIC (Akaike Information Criterion)
• Purpose: AIC is a general-purpose model selection criterion used in a wide range of statistical models, including linear regression, time series analysis, and more.
• Calculation: AIC is based on the likelihood function of the model and is calculated as follows:
  AIC = -2 * log(Likelihood) + 2 * k
  where:
  • Likelihood: the likelihood of the model given the data.
  • k: the number of estimated parameters in the model.
• Interpretation: AIC balances the fit of the model against its complexity. A lower AIC value indicates a better model. It effectively penalizes models with more parameters, encouraging simplicity.

BIC (Bayesian Information Criterion)
• Purpose: BIC is similar to AIC but tends to penalize model complexity more heavily. It is also used for model selection in a variety of statistical contexts.
• Calculation: BIC is calculated as follows:
  BIC = -2 * log(Likelihood) + k * log(n)
  where:
  • Likelihood: the likelihood of the model given the data.
  • k: the number of estimated parameters in the model.
  • n: the number of data points.
• Interpretation: BIC favors simpler models more strongly than AIC. A lower BIC value indicates a better model. Compared to AIC, BIC is more conservative in model selection, often resulting in a more parsimonious choice.
• The choice between Cp, AIC, and BIC depends on the specific context and goals of the analysis. Cp is suited to linear regression, while AIC and BIC are versatile and widely used across statistical modeling scenarios. BIC is the most conservative and favors simpler models the most, while AIC strikes a balance between model fit and complexity.

Bayesian Estimation of the Parameters of a Function
• We now discuss the case where we estimate the parameters not of a distribution, but of some function of the input, for regression or classification. Again, our approach is to treat these parameters as random variables with a prior distribution and use Bayes' rule to calculate a posterior distribution. We can then either evaluate the full integral, approximate it, or use the MAP estimate.

Assessing Predictive Models
• Assessing predictive models is a critical step in the data science and machine learning workflow. The goal is to understand how well a model performs and whether it can generalize to unseen data. An overview of the process and key metrics follows.

1. Split the Data
• Training set: used to train the model.
• Validation set: used to fine-tune hyperparameters and avoid overfitting.
• Test set: used to evaluate the final performance of the model.
• A common practice is to split the data into 70-80% for training and the remaining 20-30% for testing. In more advanced setups, cross-validation (e.g., k-fold cross-validation) is used for a more robust evaluation.
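A sketch of this splitting practice, assuming scikit-learn; the 60/20/20 proportions, the synthetic data, and the choice of logistic regression are illustrative only.

```python
# Train / validation / test split plus k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# Hold back a test set first, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))  # for tuning decisions
print("test accuracy:", model.score(X_test, y_test))      # final, held-back estimate

# More robust: 5-fold cross-validation on the non-test portion.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_trainval, y_trainval, cv=cv)
print("5-fold CV accuracy:", scores.mean())
```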
2. Key Performance Metrics
• The choice of evaluation metric depends on the type of model (classification, regression, etc.) and the specific problem at hand.
• Assessing predictive models involves evaluating how well a model performs on both seen (training) data and unseen (testing or validation) data. This process helps in identifying the model's generalizability, predictive power, and potential areas for improvement. A structured approach to assessing predictive models follows.

Batch Approach to Model Assessment
• The first approach to assessing model accuracy is a batch approach, which means that all the records in the test or validation data are used to compute the accuracy, without regard to the order of predictions in the data.
• A second approach, based on a rank-ordered sorting of the predictions, is considered next.
• Throughout this discussion, the target variable for binary classification is shown as having 1s and 0s, although any two values can be used without loss of generality, such as "Y" and "N," or "good" and "bad."

Percent Correct Classification
• Percent correct classification is a basic metric used to evaluate the performance of a classification model.
• It measures the proportion of correct predictions made by the model out of the total number of predictions.
• While it provides a quick idea of a model's accuracy, it is not always sufficient on its own, especially in scenarios such as imbalanced datasets.
• Percent correct classification refers to the percentage of instances (data points) that the model classifies correctly. It is often used to measure the effectiveness of classification algorithms: the higher the percentage, the better the model is at making correct predictions.

Example Calculation
• Suppose a model classifies 100 data points into two classes (Class A and Class B) and correctly classifies 85 of them. Then:
  Percent Correct = (85 / 100) × 100 = 85%
• So the model has 85% accuracy in its predictions.

Confusion Matrix and Percent Correct
• To better understand percent correct, it is helpful to look at the confusion matrix. The confusion matrix summarizes the performance of a classification algorithm and includes the following terms:
• True Positive (TP): the number of instances correctly predicted as the positive class.
• True Negative (TN): the number of instances correctly predicted as the negative class.
• False Positive (FP): the number of instances incorrectly predicted as the positive class.
• False Negative (FN): the number of instances incorrectly predicted as the negative class.

Limitations of Percent Correct
• While percent correct is easy to understand, it has several limitations.
• Imbalanced datasets: when there is a significant imbalance between the number of instances in different classes, percent correct may give a false impression of model performance. For example, in a dataset with 95% instances of Class A and 5% of Class B, a model that always predicts Class A could achieve 95% accuracy but would perform poorly on Class B.
• For example, consider:
  • Class A: 95 samples
  • Class B: 5 samples
• If the model predicts every sample as Class A, it will be correct 95% of the time, but it will have missed every instance of Class B, leading to poor performance on the minority class.
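A small sketch that computes the confusion-matrix counts and percent correct (accuracy), assuming scikit-learn; the labels and predictions are hypothetical and chosen so that the result matches the 85%-correct example above.

```python
# Confusion matrix counts and percent correct for 100 hypothetical predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1] * 50 + [0] * 50                       # hypothetical actual classes
y_pred = [1] * 45 + [0] * 5 + [0] * 40 + [1] * 10  # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP=45, TN=40, FP=10, FN=5

accuracy = (tp + tn) / (tp + tn + fp + fn)         # percent correct as a fraction
print(accuracy, accuracy_score(y_true, y_pred))    # both print 0.85
```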
When to Use Percent Correct
• Percent correct can be useful when:
  • The dataset is balanced (i.e., each class has a similar number of instances).
  • A quick, overall evaluation of model performance is needed.
  • The costs of misclassification are relatively uniform across classes.
• However, if the data is imbalanced, or if performance on specific classes is important, other metrics such as precision, recall, or the F1-score should be considered in addition to percent correct.

Summary
• Percent correct classification is a simple, easy-to-understand metric that gives the percentage of correct predictions made by a model. It is calculated as the ratio of correct predictions to the total number of predictions. While useful for quick assessments, it has limitations when datasets are imbalanced or when different types of misclassification carry different costs. More comprehensive metrics such as precision, recall, and F1-score are often used alongside percent correct for a deeper evaluation of a model's performance.

ROC Curve and AUC
• The ROC curve (Receiver Operating Characteristic curve) and AUC (Area Under the Curve) are used to evaluate the performance of binary classification models. These metrics are particularly useful when dealing with imbalanced datasets or when you want to assess the model across different thresholds.

ROC Curve (Receiver Operating Characteristic Curve)
• The ROC curve is a graphical representation of the performance of a classification model at different classification thresholds. The curve plots two metrics:
  • True Positive Rate (TPR), or sensitivity, on the y-axis.
  • False Positive Rate (FPR) on the x-axis.

AUC (Area Under the Curve)
• The AUC is a scalar value that summarizes the overall performance of the model based on the ROC curve; it is the area under the ROC curve.
• What does AUC represent?
  • AUC = 0.5: the model performs no better than random guessing; the ROC curve is a diagonal line from (0,0) to (1,1).
  • AUC = 1: the model perfectly classifies all positive and negative instances, with no errors (perfect separation).
  • 0.5 < AUC < 1: the model performs better than random guessing, with higher values indicating better performance.
  • AUC < 0.5: the model performs worse than random guessing, which may indicate that it has learned the wrong decision boundary (or that it is simply a bad model).
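A sketch of computing the ROC curve and AUC with scikit-learn, using the predicted probabilities of a classifier trained on synthetic data; the dataset and model are illustrative assumptions, not part of the original notes.

```python
# ROC curve points and AUC from a classifier's predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points tracing the ROC curve
print("AUC:", roc_auc_score(y_test, scores))      # 0.5 = random, 1.0 = perfect
```

Plotting fpr against tpr reproduces the ROC curve described above; the AUC is simply the area under that curve.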
Rank-Ordered Approach to Model Assessment
• In contrast to batch approaches to computing model accuracy, rank-ordered metrics begin by sorting the numeric output of the predictive model, either the probability or confidence of a classification model or the actual predicted output of a regression model.
• The rank-ordered predictions are binned into segments, and summary statistics related to the accuracy of the model are computed either individually for each segment or cumulatively as you traverse the sorted list.
• The three most common rank-ordered error metrics are gains charts, lift charts, and ROI charts. In each of these charts, the x-axis is the percent depth of the rank-ordered list of probabilities, and the y-axis is the gain, lift, or ROI produced by the model at that depth.
• In the context of rank-ordered error metrics for evaluating machine learning models, particularly in classification tasks, gains charts, lift charts, and Return on Investment (ROI) charts are often used to assess model performance, especially when dealing with imbalanced datasets. These charts are used primarily in marketing, finance, and customer analytics to evaluate how well a model identifies the most relevant or profitable segments of the data.

1. Gains Chart
• A gains chart measures the effectiveness of a classification model in identifying the target class (e.g., "churned customers," "purchased product"). It compares the cumulative percentage of true positive cases (correct predictions) identified by the model at various thresholds to a random classifier (which would select cases at random).

How a Gains Chart Works
• X-axis: the percentage of the total data, ranked from most likely to least likely to belong to the target class; this is typically expressed in deciles or percentiles.
• Y-axis: the cumulative percentage of actual positives, or true positive cases (e.g., customers who actually churned).
• The chart plots two curves:
  • Model curve: the cumulative number of true positives identified by the model at various cut-offs (percentiles).
  • Random curve: the performance of a random classifier, where the percentage of true positives found is directly proportional to the percentage of data sampled.
• Interpretation:
  • A perfect model has a steep initial slope, meaning it quickly identifies the majority of the positive class within the smallest percentage of samples; it should outperform the random classifier.
  • The larger the area between the model curve and the random curve, the better the model is at identifying the positive class efficiently.

2. Lift Chart
• A lift chart is similar to the gains chart but focuses on how much better the model is at identifying the target class than random guessing, in a more quantifiable way.

How a Lift Chart Works
• X-axis: again, the percentage of the population (samples) ranked by their likelihood of being part of the target class (e.g., likelihood of churn).
• Y-axis: the lift, which is the ratio of the model's performance (true positive rate) to that of random guessing.
• Lift > 1 indicates that the model identifies more positive cases than random guessing.
• Lift = 1 indicates that the model's performance is similar to random guessing.
• Lift charts provide an easily interpretable view of model performance, particularly in marketing or customer analytics, where identifying the right segment of customers (e.g., the top 10% most likely to churn) is crucial.
• Example: in a direct mail campaign, a lift chart shows how much more effective the model is at identifying customers who will respond positively to an offer, compared with a random selection.
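A rough sketch of how cumulative gains and lift per decile can be computed from sorted model scores, assuming pandas and NumPy are available; the scores and outcomes are synthetic and for illustration only.

```python
# Build a gains/lift table: sort records by model score, cut into 10 deciles,
# and compare cumulative positives against what random targeting would find.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
score = rng.random(1000)                           # model probabilities (synthetic)
actual = (rng.random(1000) < score).astype(int)    # synthetic 1/0 outcomes, correlated with score

df = pd.DataFrame({"score": score, "actual": actual}).sort_values("score", ascending=False)
df["decile"] = np.repeat(np.arange(1, 11), 100)    # decile 1 = highest-scoring 10%

positives_by_decile = df.groupby("decile")["actual"].sum()
cum_gain = positives_by_decile.cumsum() / df["actual"].sum()  # gains-chart y-values
lift = cum_gain / (np.arange(1, 11) / 10)                     # lift vs. random targeting

print(pd.DataFrame({"cum_gain": cum_gain, "lift": lift}))
```

In the printed table, lift well above 1 in the top deciles is exactly the behaviour a good gains or lift chart is meant to show.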
3. Return on Investment (ROI) Chart
• A Return on Investment (ROI) chart measures the financial return or benefit from a model's predictions. The ROI focuses on how much profit or value the model generates relative to the cost of applying it, especially in business settings such as marketing, finance, or sales.

How an ROI Chart Works
• X-axis: the percentage of the population (samples) ranked by their likelihood of being in the target class.
• Y-axis: the ROI, i.e., the cumulative financial gain or profit achieved by applying the model's predictions.
• The ROI is calculated by comparing the cost of targeting certain customers or segments with the financial return from successfully identifying those customers:
  • True Positives (TP): customers who were correctly predicted to take an action (e.g., purchase, churn).
  • Cost per Action (C): the cost associated with targeting or acting on a prediction (e.g., the cost of sending a promotional email).
  • Profit per Action (P): the profit made from a correct prediction (e.g., the profit from customers who purchase after receiving a promotion).
• The ROI chart shows how much profit or benefit can be expected at each decile of the population, based on the model's predictions.
• Example: for a marketing campaign promoting a new product, the ROI chart could show how much more profit could be made by focusing on the top decile (the 10% of customers most likely to buy) compared with random targeting.

Assessing Regression Models
• Metrics for regression models are batch methods computed for the data partition used to assess the models (training, testing, validation, and so on). The most commonly used metric is the coefficient of determination, known as R2 and pronounced "r squared." R2 measures the percentage of the variance of the target variable that is explained by the model.
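A brief sketch of computing R2 for a regression model on a held-back test partition, assuming scikit-learn; the synthetic data and linear model are illustrative assumptions.

```python
# R-squared (coefficient of determination) on a test partition.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Fraction of the target's variance explained by the model on the test partition.
print("R^2:", r2_score(y_test, y_pred))
```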