DADS303 - MBA 3 - Machine Learning
Varun Asthana
Roll No. 2114501153
Program – Online MBA
Course – Introduction to Machine Learning
Directorate of Online Education
Note: Answer all questions. Kindly note that answers for 10-mark questions should be approximately 400-450
words. Each question is followed by an evaluation scheme.
A1) ML serves as a solution by extracting meaningful information from huge sets of raw data. If
implemented correctly, it can solve many business complexities and predict complex behavioural
patterns from customer or user data.
1. Customer Lifetime Value Prediction – Lifetime Value or LTV is a parameter that tells how
loyal a customer is to a platform, website or business. The LTV of a customer can be
predicted using ML from purchase patterns, browsing history and behavioural patterns.
2. Predictive Maintenance – manufacturing companies often carry out preventive maintenance
that is expensive and time consuming. With the use of ML, factory data can be used to build
historical datasets, workflow visualization tools, flexible analyses and feedback loops. From
these, many hidden patterns and insights can be found using ML.
3. Eliminates Manual Data Entry – predictive data modelling and ML can help eliminate the
errors caused by manual data entry. As the resulting data is of good quality, it can be
analysed and value can be added to the business.
4. Detecting Spam – ML can help detect spam and potential threats to a platform or website.
ML techniques, including neural networks, detect spam and phishing messages.
5. Product Recommendation – based on the data accumulated for a product over a period of
time, analysis is done to draw insights for product improvement. This supports website
optimization and product optimization, and the product can be made more user friendly.
6. Financial Analysis – using ML predictive modelling, large volumes of historical data can be
analysed to support portfolio management, algorithmic trading and fraud detection.
7. Image Recognition – Image recognition is performed by various companies using data mining
and ML, combining pattern recognition with knowledge discovery from databases. It is used in
different domains such as automobiles, healthcare, etc.
8. Medical Diagnosis – patients' health improvement and healthcare cost reduction are achieved
by using ML's superior diagnostic tools and effective treatment plans.
9. Improving Cyber Security – ML can be used to increase data security in an organisation and
can solve many problems in this area. ML allows organisations to build new technologies
that quickly and effectively detect unknown threats.
10. Increasing Customer Satisfaction – ML can help improve customer satisfaction. This is
achieved by analysing customer feedback; common problem areas are identified and resolved
at the product level. After a problem is identified, the customer can also be assigned to a
suitable executive for resolution.
Q2) What do you mean by Regularization? Briefly discuss various methods to do Regularization in
Regression.
A2) Before understanding the term regularization, let us understand ‘overfitting’ and ‘underfitting’
of the model. To train our model, we give it some data to learn, followed by plotting the data points
and drawing the best fit line to understand the relationship between multiple variables. This is called
data fitting. We can call the model a good fit if all the necessary patterns are captured and no
irrelevant or random data points or patterns are present. These undesired data points are called noise.
When the model is trained on the noise as well, it is called overfitting. Conversely, a scenario where
the ML model can neither learn the relationship between the variables in the training data nor predict
or classify a new data point is called underfitting.
So, Regularization in Machine Learning is a technique used to calibrate ML models so that the
adjusted loss function is minimized and overfitting/underfitting is avoided.
1. Ridge Regularization
It fixes overfitting or underfitting by adding a penalty equivalent to the sum of the
squares of the magnitudes of the coefficients, i.e. it performs regularization by
shrinking the coefficients.
The cost function is: Cost = Loss + λ × Σ‖w‖², where Loss is the sum of squared
residuals, λ is the penalty for errors, and w is the slope (coefficient) of the fitted line.
The higher the value of λ, the more the coefficients are shrunk. Ridge regularization
therefore reduces multicollinearity and reduces model complexity through coefficient shrinkage.
2. Lasso Regularization
It fixes overfitting or underfitting by adding a penalty equivalent to the sum of the
absolute values of the coefficients: Cost = Loss + λ × Σ|w|.
Because it penalizes the absolute magnitude of the coefficients, lasso can shrink some
coefficients all the way to zero, which also performs feature selection by eliminating the
least important variables (see the short R sketch after this list).
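As a minimal sketch (not part of the original answer), both methods can be fitted in R with the glmnet package, assuming it is installed; the alpha argument switches between the ridge and lasso penalties, and lambda controls the penalty strength. The built-in mtcars dataset is used purely for illustration.

# Sketch of ridge and lasso regression, assuming the "glmnet" package
# and the built-in mtcars dataset.
library(glmnet)

x <- as.matrix(mtcars[, c("hp", "wt", "disp")])   # predictor matrix
y <- mtcars$mpg                                   # response

ridge_fit <- glmnet(x, y, alpha = 0)   # alpha = 0: penalty = lambda * sum(w^2)  (ridge)
lasso_fit <- glmnet(x, y, alpha = 1)   # alpha = 1: penalty = lambda * sum(|w|)  (lasso)

# Choose lambda by cross-validation and inspect the lasso coefficients;
# some of them may be shrunk exactly to zero.
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")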
A3) Binary logistic regression is a statistical method used to model the relationship between a binary
dependent variable (also called the response or outcome variable) and one or more independent
variables (also called predictors or explanatory variables). The dependent variable takes on only two
values, typically coded as 0 or 1, representing the absence or presence of an event, respectively.
The goal of binary logistic regression is to estimate the probability of the dependent variable being 1,
given the values of the independent variables. The logistic regression model uses a logistic function
(also known as a sigmoid function) to transform a linear combination of the independent variables
into a probability value between 0 and 1.
p = 1 / (1 + exp(-z))
where p is the probability of the dependent variable being 1, exp is the exponential function, and z is
a linear combination of the independent variables:
z = b0 + b1*x1 + b2*x2 + ... + bk*xk
where b0 is the intercept and b1, ..., bk are the coefficients of the independent variables x1, ..., xk.
To estimate the coefficients in the model, a method called maximum likelihood estimation is used.
The method involves finding the values of the coefficients that maximize the likelihood of observing
the data given the model.
Once the coefficients are estimated, the model can be used to predict the probability of the
dependent variable being 1, given a set of values for the independent variables. A threshold value
can then be chosen to classify the observations into two groups, typically 0 or 1.
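As a small numeric sketch of the sigmoid transformation and the threshold step (the coefficient values and predictor vectors below are invented for illustration, not taken from the text):

x1 <- c(1.0, 2.5, 0.3, 4.2)               # hypothetical values of a first predictor
x2 <- c(0.5, 1.5, 2.0, 0.1)               # hypothetical values of a second predictor
z  <- 0.5 + 1.2 * x1 - 0.8 * x2           # linear combination z = b0 + b1*x1 + b2*x2
p  <- 1 / (1 + exp(-z))                   # sigmoid maps z to a probability between 0 and 1
predicted_class <- ifelse(p > 0.5, 1, 0)  # classify with a 0.5 threshold
p
predicted_class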
Binary logistic regression is commonly used in various fields, such as finance, marketing, medicine,
and social sciences, to analyze and predict binary outcomes.
In the below example, we fit a binary logistic regression model using the "glm" function in R,
specifying the formula vs ~ mpg + hp + wt, where vs is the binary dependent variable and mpg, hp,
and wt are the independent variables. We use the "binomial" family to specify that we are fitting a
binary logistic regression model.
We then display the summary of the model, which shows the estimated coefficients for the
independent variables, their standard errors, the z-values, and the p-values.
Finally, we make predictions on new data by creating a new data frame with values for mpg, hp, and
wt, and using the predict function to obtain the predicted probabilities of vs being 1 for each
observation in the new data frame. The type = "response" argument specifies that we want the
predicted probabilities instead of the linear predictor values.
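The code itself is not reproduced in the original answer; the following is a minimal sketch consistent with the description above, using R's built-in mtcars dataset (in which vs is already coded as 0/1), with illustrative values in the new data frame.

# Binary logistic regression on mtcars, as described above.
model <- glm(vs ~ mpg + hp + wt, data = mtcars, family = "binomial")

summary(model)   # estimated coefficients, standard errors, z-values and p-values

# Predicted probabilities of vs = 1 for new observations (illustrative values)
new_data <- data.frame(mpg = c(21, 30), hp = c(110, 65), wt = c(2.6, 1.8))
predict(model, newdata = new_data, type = "response")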
A4) K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science by grouping the unlabeled dataset into various clusters.
K refers to the number of pre-defined clusters that need to be created in the process. So for example
if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It allows us to arrange the data points into clusters of different groups and thus makes it
convenient to discover the categories of groups in the unlabeled dataset on its own, without the
need for any training. It is a centroid-based algorithm, where each cluster is associated with a
centroid, and its main aim is to minimize the sum of distances between each data point and the
centroid of its cluster.
In this algorithm, the unlabeled dataset is given as input; the algorithm divides it into k clusters
and repeats the process until it finds the best clusters. It should be noted that the value of k
must be predetermined. The algorithm:
Determines the best value for the K center points or centroids by an iterative process.
Assigns each data point to its closest k-center; the data points that are near a particular
k-center form a cluster.
Hence each cluster contains data points with some commonalities, which is why they are grouped
together.
Suppose M1 and M2 are two variables whose scatter plot is as shown below:
Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.
We need to choose some random k points or centroids to form the clusters. These points can be either
points from the dataset or any other points. So, here we are selecting the below two points as k
points, which are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue
centroid, and the points on the right of the line are close to the yellow centroid. Let's color them
blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To
choose the new centroids, we compute the center of gravity (mean) of the data points in each cluster
and move the centroids there, as below:
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process
of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue
points are to the right of the line. So, these three points will be reassigned to the other centroid.
As reassignment has taken place, we again go back to the step of finding new centroids or
K-points.
We will repeat the process by finding the center of gravity of each cluster, so the new centroids will
be as shown in the below image:
With the new centroids, we again draw the median line and reassign the data points, so the image
will be as below:
We can see in the above image that no data points change sides of the line, which means the
assignments have stabilized and our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will
be as shown in the below image:
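The referenced figures are not reproduced here. As a minimal sketch (the data are randomly generated for illustration and the variable names M1 and M2 match the example above), the same procedure can be run with R's built-in kmeans function:

# K-means on two illustrative variables M1 and M2 with K = 2.
set.seed(42)
M1 <- c(rnorm(25, mean = 2), rnorm(25, mean = 8))
M2 <- c(rnorm(25, mean = 2), rnorm(25, mean = 8))
points_df <- data.frame(M1, M2)

fit <- kmeans(points_df, centers = 2)   # iteratively updates centroids and assignments

fit$centers                             # final centroids (center of gravity of each cluster)
fit$cluster                             # cluster assignment of every data point

plot(points_df, col = fit$cluster, pch = 19)       # points colored by cluster
points(fit$centers, col = 1:2, pch = 8, cex = 2)   # final centroids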
Q5) Briefly explain ‘Splitting Criteria’, ‘Merging Criteria’ and ‘Stopping criteria’ in Decision Tree.
SPLITTING CRITERIA
The objective of splitting criteria is to find the optimal way to partition the data into homogeneous
subsets in terms of the target variable.
Information gain: Information gain measures the reduction in entropy (i.e., degree of
disorder) achieved by splitting a node based on a particular variable (a small numeric
sketch follows this list).
Gain ratio: Gain ratio is similar to information gain, but it takes into account the
intrinsic information of a variable, which is the degree to which a variable is capable
of making finer partitions in the data. The gain ratio penalizes variables that have too
many categories or levels.
Chi-square: This criterion is used to determine whether a split based on a particular variable
is statistically significant. The optimal split is the one with the highest chi-square value.
Reduction in variance: This criterion is used in regression trees and measures the
reduction in variance achieved by splitting a node based on a particular variable. The
optimal split is the one that minimizes the weighted sum of the variance of the child
nodes.
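As a small numeric sketch of the information gain criterion (the functions below are written directly from the definitions above rather than taken from a package, and the mtcars split is purely illustrative):

# Entropy and information gain for a single binary split.
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions at the node
  -sum(p * log2(p))                     # Shannon entropy (degree of disorder)
}

info_gain <- function(labels, split) {
  # 'split' is a logical vector sending each observation to the left/right child node
  n <- length(labels)
  child_entropy <- sum(split) / n * entropy(labels[split]) +
                   sum(!split) / n * entropy(labels[!split])
  entropy(labels) - child_entropy       # reduction in entropy achieved by the split
}

# Example: how much does splitting on wt > 3.2 reduce disorder in the vs label?
info_gain(mtcars$vs, mtcars$wt > 3.2)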
MERGING CRITERIA
Merging criteria, also known as pruning criteria, are used in decision tree algorithms to determine
when to stop growing the tree by merging or pruning some of its nodes. The objective of merging
criteria is to prevent overfitting, which occurs when the tree is too complex and captures noise or
random variation in the data rather than the underlying patterns or relationships.
STOPPING CRITERIA
Stopping criteria in decision trees refer to the rules that determine when to stop splitting a node and
make it a leaf node. The decision tree algorithm continues to split nodes until a certain stopping
criterion is met. Some commonly used stopping criteria in decision trees include the following (a
short R sketch follows this list):
Maximum tree depth: This criterion specifies the maximum depth of the tree. Once
the tree reaches the maximum depth, the algorithm stops splitting and creates a leaf
node.
Minimum number of samples: This criterion specifies the minimum number of
samples required to split a node. If the number of samples at a node is less than the
specified minimum, the node becomes a leaf node.
Maximum number of leaf nodes: This criterion specifies the maximum number of
leaf nodes allowed in the tree. Once the maximum number is reached, the algorithm
stops splitting and creates leaf nodes.
Minimum impurity decrease: This criterion specifies the minimum amount of
reduction in impurity that must be achieved by splitting a node. If the impurity
decrease is less than the specified minimum, the node becomes a leaf node.
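A minimal sketch of how such stopping criteria are typically set in R, assuming the rpart package is installed (the parameter values and the iris dataset are illustrative choices, not from the original answer):

# Decision tree with explicit stopping criteria, using the rpart package.
library(rpart)

fit <- rpart(
  Species ~ .,              # classify iris species from all other variables
  data    = iris,
  method  = "class",
  control = rpart.control(
    maxdepth = 3,           # maximum tree depth
    minsplit = 20,          # minimum number of samples required to attempt a split
    cp       = 0.01         # minimum improvement (complexity parameter) required for a split
  )
)

printcp(fit)                     # complexity table used for pruning (merging) decisions
pruned <- prune(fit, cp = 0.02)  # prune (merge) back the weaker splits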
Q6) What is Support Vector Machine? What are the various steps in using Support Vector
Machine?
A6) Support Vector Machine (SVM) is a fast and dependable classification algorithm that performs
very well with a limited amount of data to analyse. It is a supervised machine learning model that
uses classification algorithms for two-group classification problems. In the scatter plot below, the
support vectors are the data points that lie nearest to the decision boundary.
You can also see that a blue line separates the two categories. This line is called the hyperplane. The
objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly
classifies the data points.
The dimension of the hyperplane depends upon the number of features. If the number of input
features is two, the hyperplane is just a line. If the number of input features is three, the
hyperplane becomes a 2-D plane. It becomes difficult to visualize when the number of features
exceeds three. The distance between the support vectors and the separating hyperplane is called the
margin. The hyperplane is best when the margin is maximum. Now that we have understood what
SVM is, let us understand how it is used in R (the steps below are followed by an illustrative code sketch):
We use 70% of the data for training and the remaining 30% for testing.
We then fit a support vector machine model using the svm function from the
"e1071" package. We specify the formula Species ~ . to indicate that we want to
predict the "Species" variable based on all the other variables in the dataset.
We display the model summary using the summary function, which shows the
number of support vectors, the kernel function used, and other model parameters.
We then make predictions on the testing set using the predict function, and create a
confusion matrix using the table function to see how well the model performed.
Finally, we calculate the model accuracy by dividing the number of correct
predictions by the total number of observations in the testing set.
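The corresponding code is not reproduced in the original answer; a minimal sketch consistent with the steps above, assuming the e1071 package and the built-in iris dataset, is:

# SVM workflow described above: 70/30 split, fit, predict, confusion matrix, accuracy.
library(e1071)

set.seed(123)
train_idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))  # 70% for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]                                        # remaining 30% for testing

model <- svm(Species ~ ., data = train)   # default radial-basis kernel
summary(model)                            # kernel, cost and number of support vectors

pred <- predict(model, newdata = test)
confusion <- table(Predicted = pred, Actual = test$Species)
confusion

accuracy <- sum(diag(confusion)) / nrow(test)  # correct predictions / total test observations
accuracy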