
Machine Learning
What is Machine Learning?

● It enables a machine to automatically learn from data, improve performance from experience, and
predict things without being explicitly programmed.
● A Machine Learning system learns from historical data, builds prediction models, and whenever it
receives new data, predicts the output for it.
● Machine Learning is a program that analyses data and learns to predict the outcome.

E.g., Cars24 price predictions, movie/book suggestions, product suggestions, email filtering, Facebook
auto-tagging, etc.
1. Movie recommendation system
2. Factors responsible for sales reduction
3. Today's website traffic
4. Predicting stock market growth patterns

Model Building Lifecycle

1. Problem definition
2. Hypothesis generation
3. Data collection
4. EDA
5. Predictive modeling
6. Model deployment

EDA Steps

1. Check the data types of the columns
2. Find and handle the null values
3. Detect and handle the outliers
4. Find and handle skewness
5. Encoding
6. Standardization (scaling)
7. Feature engineering
Machine Learning Types
Applications of Machine Learning
1. Image Recognition:
Image recognition is used to identify objects, persons, places, digital images, etc.
For example: Facebook's automatic friend tagging suggestion.
The technology behind this is machine learning's face detection and recognition algorithms.

2. Speech Recognition:
Speech recognition is the process of converting voice instructions into text; it is also known as "speech to text" or "computer
speech recognition." For example: Google's Search by Voice.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic Prediction:

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
- Real-time location of the vehicle from the Google Maps app and sensors
- Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its
database to improve performance.
4. Product Recommendations:

Whenever we search for a product on Amazon, we start seeing advertisements for the same product while surfing the internet
in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products according to customer
interest.

5. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam.
Important mail appears in our inbox with the important symbol, and spam emails land in our spam box.

6. Stock Market Trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so
machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

7. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able
to build 3D models that can predict the exact position of lesions in the brain.
This helps in finding brain tumors and other brain-related diseases easily.
Regression

● Regression is a supervised learning technique where we predict continuous data, i.e. the target
variable is continuous in nature.

What is Linear Regression?

It is the process of establishing a relationship between x and y; when the relationship is linear in
nature, we call it linear regression.

y = mx + c
Q. Explain Linear Regression.

Definition: It is the process of establishing a relationship between a dependent and an independent variable; when the relation is
linear in nature, we call it linear regression.

Goal: To find/create the best-fit line using the formula

Y = MX + C
Where,
Y is the dependent variable
X is the independent variable
M is the slope/gradient/weight/coefficient of regression
C is the point of intersection with the y-axis (the intercept)

Objectives:
1. To establish the relationship between x and y
2. To predict new observations

Advantages: Very easy to interpret.

Disadvantages: Affected by outliers, missing values, and skewness.

Performance: We use MAE, MSE, and RMSE to evaluate the model's error,
and the R² score to evaluate how well the model is performing.
Optimization:

With xmean = 3 and ymean = 3.6:

x    y    x-xmean   y-ymean   (x-xmean)^2   (x-xmean)(y-ymean)
1    3    -2        -0.6      4              1.2
2    4    -1         0.4      1             -0.4
3    2     0        -1.6      0              0
4    4     1         0.4      1              0.4
5    5     2         1.4      4              2.8

m = Σ(x-xmean)(y-ymean) / Σ(x-xmean)^2 = 4 / 10 = 0.4, and c = ymean - m·xmean = 3.6 - 1.2 = 2.4.
R Squared

x    y    yp     yp-ymean   (yp-ymean)^2   y-ymean   (y-ymean)^2
1    3    2.8    -0.8       0.64           -0.6      0.36
2    4    3.2    -0.4       0.16            0.4      0.16
3    2    3.6     0         0              -1.6      2.56
4    4    4.0     0.4       0.16            0.4      0.16
5    5    4.4     0.8       0.64            1.4      1.96

R² = Σ(yp-ymean)² / Σ(y-ymean)² = 1.6 / 5.2 ≈ 0.31

KNN - K-Nearest Neighbors Classifier

● It is a supervised machine learning algorithm used to solve classification
problems.
● It is a non-parametric algorithm.

How does KNN work?

1. Select/decide the number of neighbors, K.
2. Calculate the distance from the new data point to the existing data points.
3. Take the K nearest neighbors as per the calculated distance.
4. Calculate the mode of these points, i.e. count the number of data points in each
category and assign the new data point to the category where the count is maximum.

What is 'K' in KNN?

It is a parameter that refers to the number of nearest neighbors to be considered in the
decision-making process.
Worked example (classification): predict the Result for X = {Python: 6, ML: 8}

Python   ML   Result
4        3    F
6        7    P
7        7    P
5        8    F
8        5    P
7        6    P

A second example dataset with a continuous target:

x      y
3.1    1500
3.8    2500
4.2    3000
5.2    8000
6.3    5500
2.3    1000
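A minimal sketch of the classification example with scikit-learn, assuming Euclidean distance and K = 3.

```python
# KNN on the (Python, ML) -> Result table above; K = 3 neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[4, 3], [6, 7], [7, 7], [5, 8], [8, 5], [7, 6]])
y_train = np.array(['F', 'P', 'P', 'F', 'P', 'P'])

knn = KNeighborsClassifier(n_neighbors=3)  # step 1: choose K
knn.fit(X_train, y_train)                  # KNN only stores the data ("lazy")

print(knn.predict([[6, 8]]))               # mode of the 3 nearest -> 'P'
```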
Features of KNN

1. No training time (fast training):

● KNN does not learn anything during its training period; hence it is known as a 'lazy
algorithm'.
● It just stores the data in memory.
● It comes into action while predicting new observations.

2. It works with both linear and non-linear datasets.

3. It does not perform well on large datasets
[since it takes a lot of time to calculate distances].
4. Easy to implement.
It requires (i) the value of K and (ii) a distance metric.
5. KNN is a supervised ML algorithm used to solve classification as well as
regression problems.
Interview Questions

1. What is KNN? How does it work?
2. What is 'K' in KNN?
3. How can KNN be used to solve a regression problem?
4. Why is KNN known as a lazy algorithm?
5. Does KNN work on large datasets?
6. Explain the advantages and disadvantages of KNN.
Confusion Matrix Terminologies

➢ TP (True Positives):

Actual positives in the data, which have been correctly predicted as positive by our model.
Hence True Positive.

➢ TN (True Negatives):

Actual negatives in the data, which have been correctly predicted as negative by our model.
Hence True Negative.

➢ FP (False Positives) / Type-1 Error:

Actual negatives in the data, but our model has predicted them as positive.
Hence False Positive.

➢ FN (False Negatives) / Type-2 Error:

Actual positives in the data, but our model has predicted them as negative.
Hence False Negative.
➢ Accuracy: Tells the percentage of correctly predicted values out of all the data points.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

➢ TPR (True Positive Rate) or Recall: Out of all the positive data points, how many have been correctly identified as positive
by our model.

TPR = TP / (TP + FN)

➢ TNR (True Negative Rate) or Specificity: Out of all the negative data points, how many have been correctly identified as
negative by our model.

TNR = TN / (TN + FP)

➢ FPR (False Positive Rate): Out of all the negative data points, how many have been falsely identified as positive by
our model.

FPR = FP / (TN + FP)

➢ FNR (False Negative Rate): Out of all the positive data points, how many have been falsely identified as negative by
our model.

FNR = FN / (TP + FN)

➢ Precision: Out of all the points which have been identified as positive by our model, how many are actually positive.

Precision = TP / (TP + FP)
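A small sketch computing these metrics from the TP/TN/FP/FN counts with scikit-learn; the ten labels are hypothetical, purely to exercise the formulas.

```python
# Extracting TP, TN, FP, FN from a confusion matrix and applying
# the formulas above (labels are made up for illustration).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)            # TPR
precision = tp / (tp + fp)
fpr       = fp / (fp + tn)

print(accuracy, recall, precision, fpr)   # 0.8, 0.8, 0.8, 0.2
```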
Regularization

A.) Overfitting:

● Low bias, high variance.
● This situation occurs when the model tries to cover/learn every data point in the
dataset.
● While doing this it may also learn noisy data, and hence cannot predict
new observations accurately.
● In overfitting, the training accuracy is high and the testing accuracy is low.
● Here the model does not work properly on new observations.

How to solve the problem of overfitting?

● Regularization
● Cross-validation techniques
● Ensemble learning

B.) Underfitting:

● High bias, low/high variance.
● This situation occurs when the model fails to identify/learn the
patterns.
● Here the training accuracy is very low.

C.) Good fit model:

● In this situation the model performs well on training data as well
as testing data.
● Here we have low training error and low testing error.
● Low bias and low variance.
What is Regularization?

● It solves the problem of overfitting by adding an extra error term/penalty term to the existing model,
which manipulates/tunes the coefficients.

● This technique can be used in such a way that it allows us to retain all variables or features in the
model while reducing their magnitude.

1.) Ridge (L2 regularization):

- Reduces the coefficient values of less important features.
- It adds a small amount of bias, known as the error term/penalty term/ridge regression penalty.
- It reduces the complexity of the model by shrinking the coefficients.
- If you do not want to perform dimensionality reduction, use Ridge, as it only reduces the coefficient
values.
- The error term is calculated by multiplying lambda with the sum of the squared coefficients.

2.) Lasso (L1 regularization):

- It sets the coefficient values of the less important features to 0.
- It therefore does not include all the features.
- If you want to perform dimensionality reduction/feature selection, use L1 or Lasso, as it
sets coefficient values to zero (i.e. it retains only the important features).
- The error term is calculated by multiplying lambda with the sum of the absolute values of the coefficients.
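A hedged sketch contrasting the two penalties on synthetic data with only 3 informative features out of 10: Ridge shrinks all coefficients, while Lasso drives the unimportant ones exactly to zero (scikit-learn's alpha plays the role of lambda).

```python
# Ridge vs Lasso: same data, different penalty terms.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: lambda * sum of squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: lambda * sum of absolute coefficients

print(np.round(ridge.coef_, 2))      # all features kept, coefficients shrunk
print(np.round(lasso.coef_, 2))      # unimportant coefficients exactly 0
```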
Logistic Regression

● It is a supervised ML algorithm used to solve classification problems.
● It predicts outcomes that are categorical in nature.
● Logistic regression uses the sigmoid/logistic function to classify a data point.
● The logistic/sigmoid function always returns a probabilistic value that lies between 0 and 1:
σ(z) = 1 / (1 + e^(-z))
● In logistic regression, instead of fitting a best-fit line, we fit an "S"-shaped curve, which predicts
two maximum values (0 or 1).
● The curve indicates the likelihood of something.
● The sigmoid function maps any value into the range 0-1.
● The logistic function uses a threshold which helps to classify a data point:
values above the threshold are classified as 1, values below the threshold are classified as 0.
● It is widely used to solve binary classification problems.

-------------------------------------
Assumptions of Logistic Regression:
1) The target must be categorical in nature.
2) No multicollinearity.
-------------------------------------

Advantages:

- Performs well on linear data.
- Results are easily interpretable.
- It works well on large datasets.
- Fast training because of the sigmoid function.
- Works well on binary datasets.

Disadvantages:

- It does not perform well on non-linear data.
- It does not work well on high-dimensional data (many features).
- It makes assumptions about the data.
- It does not work well on multi-class classification datasets.
Linear Regression                              Logistic Regression

1. Continuous target variable                  1. Categorical target variable
2. Least squares criterion                     2. Maximum likelihood
3. Straight line                               3. "S"-shaped curve
4. Predicted values lie in the range           4. Predicted values lie in the range
   -∞ to +∞                                       0 to 1
5. Loss: MSE                                   5. Loss: binary cross-entropy
6. Metrics: MAE, MSE, RMSE, R² score           6. Metrics: accuracy score, confusion
                                                  matrix, classification report
● Hyper-Tuning Parameters:

1) penalty: adds the penalty term; possible values are {l1, l2, elasticnet, none}
2) solver: liblinear, sag, saga, lbfgs
3) multi_class: auto, ovr, multinomial
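A minimal sketch tying together the sigmoid, the 0.5 threshold, and the hyper-parameters listed above (the penalty and solver values come from that list; the dataset is synthetic).

```python
# Sigmoid + threshold: how logistic regression turns a linear score
# into a probability and then a class label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # maps any value into the range (0, 1)

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(penalty='l2', solver='lbfgs')  # L2 is the default
clf.fit(X, y)

z = clf.decision_function(X[:3])          # the linear part: w·x + c
proba = sigmoid(z)                        # same as clf.predict_proba(X[:3])[:, 1]
print(proba, (proba >= 0.5).astype(int))  # threshold at 0.5 -> class labels
```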

ROC-AUC Curve (Receiver Operating Characteristic - Area Under the Curve)

➢ It is a performance metric for classification problems at various threshold settings.
➢ It tells how capable the model is of distinguishing between classes.
➢ The higher the AUC, the better the model is at predicting the 0 class as 0 and the 1 class as 1.
➢ A high AUC value means the model is good, and vice versa.
➢ It is a graph of TPR vs. FPR, where TPR is on the Y-axis and FPR is on the X-axis.
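A hedged sketch on synthetic data: roc_curve returns one (FPR, TPR) point per threshold, and roc_auc_score gives the area under that curve.

```python
# Plotting TPR vs FPR across thresholds and computing the AUC.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one point per threshold

plt.plot(fpr, tpr)                              # FPR on X-axis, TPR on Y-axis
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.show()
print('AUC:', roc_auc_score(y_te, scores))      # higher AUC = better model
```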
Interview Questions:

1) How does logistic regression work? (imp)
2) What is the sigmoid function? / Importance of the sigmoid function. (imp)
3) Importance of the threshold in logistic regression. (imp)
4) Can logistic regression work with large data? --> Yes, it requires large data.
5) Explain the drawbacks of logistic regression.
6) When would you use logistic regression? (imp)
7) Explain the ROC-AUC curve.
8) How will you improve logistic regression performance? / What are the hyper-tuning parameters of
logistic regression? (imp)
9) Does logistic regression use regularization by default? ---> Yes, L2 by default.
10) Explain the solver in logistic regression. (imp)
11) Advantages and disadvantages of logistic regression.
12) Logistic regression vs. linear regression. (imp)
Support Vector Machine (SVM)

● A supervised ML algorithm which can be used to solve classification as well as regression problems.

● Objective:
SVM is based on the idea of finding a hyperplane/decision line in an N-dimensional space that best
separates the features into different classes.

● Hyperplane:

➢ Hyperplanes are decision boundaries that classify the data points into classes. Data points falling on either
side of the hyperplane can be classified into different classes.

➢ The dimension of the hyperplane depends on the number of features: with 2 features the hyperplane
is a line; with 3 features it is a 2-D plane; in general, for n features it is an (n-1)-dimensional hyperplane.
● Support vectors:
○ Support vectors are the data points that are closest to the hyperplane and
influence its position.
○ Support vectors play an important role in drawing the decision line/hyperplane.

● Margin: The distance of the support vectors from the hyperplane is called the margin, i.e. the
distance from the boundary line to the decision line.

● Best hyperplane ----> the hyperplane with the highest margin is considered the best
hyperplane.

● Kernel: A kernel is used to handle non-linear datasets, where we cannot draw a good
decision line directly. The kernel adds an extra dimension to handle non-linear data by finding the best
hyperplane in a higher-dimensional space.
Advantages of SVM:
1) It can handle linear as well as non-linear data: it handles linear data by finding the
best decision line, and non-linear data by using the kernel trick.

2) It can be used to solve classification as well as regression problems.

3) Stability: a small change to the data does not affect the hyperplane.

Disadvantages:
1) Choosing the correct kernel type is difficult.
2) Extensive memory requirement: it is a highly complex algorithm requiring a high volume of
computation.
3) Long training time on large non-linear data.
4) It requires feature scaling.
5) The results of SVM are difficult to interpret.
Hyper-Parameters:

1. C: a hyper-parameter which controls the error tolerance.
A low C allows a wider margin and tolerates more misclassification; a high C penalizes
misclassification more heavily and fits the training data more tightly.
Commonly tried values: C = [0.001, 0.01, 0.1, 1, 10, 100]

2. Gamma: decides how much curvature we want in the decision boundary.
Commonly tried values: gamma = [0.001, 0.01, 0.1, 1, 10, 100]

3. Kernel: used to handle non-linear datasets.
Values of kernel:
i. Sigmoid & tanh: these kernels are used in neural networks.
ii. Linear kernel: used when the data is linearly separable.
iii. Poly kernel: used in image processing.
iv. RBF (default): the Radial Basis Function is used when we do not have prior knowledge about the
data.
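A hedged sketch tuning these three hyper-parameters with a grid search on synthetic, scaled data; the value grids mirror the lists above.

```python
# Grid search over C, gamma, and kernel for an SVM classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X = StandardScaler().fit_transform(X)   # SVM requires feature scaling

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```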
Decision Tree

● It is a supervised ML algorithm that uses labelled data to classify a data point.
● It can be used to solve regression as well as classification problems.
● It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
● It uses different node types: root node, branch/decision node, and leaf node.
● It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.

● On what basis does a DT select a feature for further splitting?

sol 1) On the basis of impurity: a DT selects the feature with the lowest impurity.
sol 2) Information gain.

● How to calculate impurity?

1) Gini index: Gini = 1 - p² - q²,
where p is the probability that an event will occur (e.g. likes the movie) and
q is the probability that the event will not occur (does not like the movie).

2) Entropy: Entropy = -p log(p) - q log(q)  (log base 2 is commonly used)

Advantages of DT:

➔ Results of a DT are easy to interpret.
➔ DTs are not affected by noisy data.
➔ It can handle non-linear data as well.
➔ It can solve regression as well as classification problems.

Disadvantages of DT:

➔ It is not suitable for large and high-dimensional datasets.
➔ It is not flexible, as new data might require reconstructing the DT.
➔ It always tends to overfit. (IMP)

● How to solve the overfitting problem of a DT?

--> Use pruning techniques (see the sketch below):
1) max_depth
2) min_samples_leaf
3) min_samples_split
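A minimal sketch of these pruning parameters using scikit-learn's DecisionTreeClassifier; the Iris dataset and the specific values are stand-ins for illustration.

```python
# Pruning a decision tree by limiting depth and split/leaf sizes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree tends to overfit; pruning constrains its growth.
tree = DecisionTreeClassifier(
    criterion='gini',        # impurity measure: 'gini' or 'entropy'
    max_depth=3,             # 1) limit the depth of the tree
    min_samples_leaf=5,      # 2) minimum data points allowed in a leaf
    min_samples_split=10,    # 3) minimum data points needed to split a node
)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```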
Naive Bayes Classifier

-> It is a supervised ML algorithm used to solve classification problems using the concept of Bayes'
theorem.
-> It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
-> Applications of the Naive Bayes classifier: text classification, sentiment analysis.

Bayes' theorem:
P(A|B) = P(B|A) · P(A) / P(B)

Advantages of Naive Bayes:

1) It can handle non-linear data.
2) Simple to implement: it is easy to calculate conditional probabilities.
3) Very fast training.
4) Works well on large datasets.
5) It can solve binary as well as multi-class classification problems.
6) It can handle categorical as well as continuous data.
7) It can handle text data, and hence it is widely used in text classification.

Disadvantages:

1) It is based on the assumption that all input features are independent of each other, i.e. it assumes
that all the attributes are mutually independent.

2) Zero probability problem: this algorithm assigns zero probability to a
categorical variable whose category was not seen in the training set.
Types of Naive Bayes:

1) Bernoulli NB: solves binary classification problems where you have 2 categories, i.e. yes or no.

2) Multinomial NB: solves multi-class classification problems where you have more than 2 classes; it also
solves binary classification problems with imbalanced classes.

3) Gaussian NB: if you have continuous data in your columns, or numeric features which exhibit a
normal distribution, then Gaussian NB is the correct choice.
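A hedged sketch of the three variants in scikit-learn; the binarisation and count transforms below are only there to manufacture the feature types each variant expects.

```python
# Choosing a Naive Bayes variant based on feature type.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

gnb = GaussianNB().fit(X, y)               # continuous, roughly normal features

Xb = (X > 0).astype(int)                   # binarised features (yes/no)
bnb = BernoulliNB().fit(Xb, y)             # binary features

Xc = Xb.cumsum(axis=1)                     # non-negative count-like features
mnb = MultinomialNB(alpha=1.0).fit(Xc, y)  # alpha smoothing sidesteps the
                                           # zero-probability problem
print(gnb.score(X, y), bnb.score(Xb, y), mnb.score(Xc, y))
```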
Ensemble Learning

● It is used to improve the predictive power of an algorithm by applying multiple
algorithms and aggregating them.

● This approach gives better predictive performance compared to a single
learning approach.

★ Techniques of Ensemble Learning:

1. Bagging
2. Boosting
3. Stacking
1. Bagging:

How does it work?

➢ Step 1: Multiple subsets are created from the original dataset (data points inside
the subsets are selected randomly).
➢ Step 2: A base model is created on each of the subsets.
➢ Step 3: Each model learns in parallel, independently of the others.
➢ Step 4: Final predictions are determined by aggregating the predictions of all the
models.

● Bagging is an independent process, i.e. models are built independently of each other.
● Bagging is a parallel process, i.e. models can be built in parallel.
● An example of bagging is Random Forest.
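A minimal sketch of bagging with scikit-learn; BaggingClassifier's default base model is a decision tree, and Random Forest is the named example above.

```python
# Bagging: many base models on random subsets, predictions aggregated.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Generic bagging: 50 base models (decision trees by default), each
# trained on a random subset; predictions are aggregated by voting.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Random Forest = bagging of trees + random feature selection per split.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), rf.score(X, y))
```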
2. Boosting:

● Boosting is a sequential process.
● A model is built from the training data, and then a second model is built which tries to
correct the errors present in the first model.
● This process continues until the training dataset is predicted correctly or a stopping condition is reached.

How does it work?

➢ Step 1: Initialize the dataset and assign equal weights to each data point.
➢ Step 2: Provide this as input to the model and identify the wrongly classified data points.
➢ Step 3: Increase the weights of the misclassified data points.
➢ Step 4: If the required result is achieved, stop; else go to step 2.

● Boosting is a dependent process, i.e. the models built depend on each other.
● Boosting is a non-parallel process, i.e. the models cannot be built in parallel.
● Examples of boosting: AdaBoost, Gradient Boosting, XGBoost.
● Boosting is used to reduce bias (training errors).
Gradient Boosting:
It tries to minimize the loss or error by calculating partial derivatives with respect to the weights (m) and the intercept (c).

How it works:
Step 1: Select random values of m & c.
Step 2: Build the model.
Step 3: Calculate the partial derivatives with respect to the previous values of m & c and update them.
Step 4: Continue this process until we reach the global minimum.
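The four steps above are, in effect, gradient descent on m and c. Below is a plain-NumPy sketch minimising MSE on the same small dataset used in the linear regression example; the learning rate and iteration count are assumed values.

```python
# Gradient descent on m and c for the line y = mx + c, minimising MSE.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 2, 4, 5], dtype=float)

m, c = 0.0, 0.0            # step 1: start from initial values
lr = 0.01                  # learning rate (assumed hyper-parameter)

for _ in range(5000):      # step 4: repeat until near the global minimum
    y_pred = m * x + c     # step 2: build the model
    dm = (-2 / len(x)) * np.sum(x * (y - y_pred))  # step 3: dMSE/dm
    dc = (-2 / len(x)) * np.sum(y - y_pred)        #         dMSE/dc
    m -= lr * dm
    c -= lr * dc

print(m, c)                # converges towards m = 0.4, c = 2.4
```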
UNSUPERVISED LEARNING

Clustering

● Clustering is an unsupervised learning process of creating groups of data points
based on similarity.
● Here we do not have a target column; we look at the data, try to club
similar observations together, and form different groups.

● Applications of clustering / where to apply clustering?

- Customer segmentation
- Recommendation systems

● How to perform clustering?

- We have two algorithms to perform clustering:
1) K-Means clustering
2) Hierarchical clustering
How does K-Means work?

Here K is the number of groups/clusters to make.

1) Decide the value of K.
(To decide the value of K we must have domain knowledge.)

2) Select K centroids.
(Centroids can be selected randomly or can be selected from the data points.)

3) Assign each data point to the nearest centroid/cluster by calculating the Euclidean distance. Then find
the new centroid of each cluster, and keep repeating this process for the inner iteration count (default value: 300);
then calculate the inertia.

4) Now re-generate the initial centroids and go back to step 3. Repeat this whole process for the outer iteration
count (default value: 10).

5) The final centroids/clusters selected are those with the lowest inertia value.

How are good/final clusters selected? How do we select the number of clusters to make?

1) You must have domain knowledge.
2) Use the Elbow technique/method.

➢ Silhouette Method:

● It measures how similar a data point is to its own cluster compared to other clusters.
● The silhouette value ranges between -1 and 1.
● Positive values indicate that the point is placed in the correct cluster.
● Negative values indicate that there are too many or too few clusters.
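A hedged sketch running K-Means for several values of K and printing inertia (for the elbow method) and the silhouette score; max_iter=300 and n_init=10 match the defaults quoted above.

```python
# K-Means with elbow (inertia) and silhouette checks.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, max_iter=300, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    # Inertia keeps dropping as k grows; the "elbow" suggests the best k.
    # A silhouette score near 1 means well-separated clusters.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, labels), 2))
```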
Interview Questions
- What is clustering?
- Why use clustering? / Applications of clustering?
- What is K in K-Means?
- Difference between K-Means and the KNN algorithm.
- How does K-Means work?
- How are the best clusters selected?
- What is inertia and why is it important?
- How do you select the best value of K?
Hierarchical Clustering
● Hierarchical clustering is an unsupervised ML algorithm used to group the data into clusters by building
a hierarchy of clusters.
● We develop the hierarchy of clusters in the form of a tree.
● This tree-shaped structure is known as a dendrogram.

# Step 1: Consider each data point as a single cluster, so if we have N data points then we have N clusters.
# Step 2: Merge the two most similar data points/clusters into one cluster, i.e. now we have N-1 clusters.
# Step 3: Repeat this process until one final cluster is formed.
# Step 4: From this, draw the dendrogram to find the optimal number of clusters.

Types of hierarchical clustering:

1) Agglomerative clustering --> bottom-up approach ---> works by clubbing similar data points together.
2) Divisive clustering --> top-down approach ---> works by dividing dissimilar data points.
Q. How to find similar data points?

Using a linkage method: this helps us calculate the distance between clusters.

Types of linkage methods: single, complete, centroid, ward, average.

Q. How to find the optimal number of clusters from a dendrogram?

Ans: We find the longest vertical distance that does not cut any horizontal line, and place the cut there.
Linkage Methods

1.) Single Linkage:

For single linkage, the two clusters with the closest minimum distance are merged. This process
repeats until there is only a single cluster left.

2.) Complete Linkage:

For complete linkage, the two clusters with the closest maximum distance are merged. This process
repeats until there is only a single cluster left.

3.) Centroid Linkage:

For centroid linkage, the two clusters with the lowest centroid distance are merged. This process
repeats until there is only a single cluster left.

4.) Ward's Linkage:

For Ward's linkage, two clusters are merged based on their error sum of squares (ESS) values. The two
clusters with the lowest ESS are merged.

5.) Average Linkage:

The average linkage method uses the average pairwise proximity among all pairs of objects in different
clusters. The clusters with the lowest average distance are merged.
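A hedged sketch of agglomerative clustering with SciPy: linkage builds the merge tree, dendrogram draws it, and fcluster cuts it into a chosen number of clusters (the method argument takes the linkage types listed above).

```python
# Agglomerative clustering with a dendrogram, on random 2-D points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))            # 20 points -> start with 20 clusters

Z = linkage(X, method='ward')           # or 'single', 'complete', 'average', ...
dendrogram(Z)                           # the tree-shaped merge structure
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters
print(labels)
```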
PCA - Principal Component Analysis

1. Standardize the range of the continuous variables.

2. Compute the covariance matrix to identify correlations.

3. Calculate the eigenvectors and corresponding eigenvalues.

4. Sort the eigenvectors according to their eigenvalues in decreasing order.

5. Choose the first k eigenvectors; these will be the new k dimensions.

6. Transform the original n-dimensional data points into k dimensions.

Variance: A measure of variability; it simply measures how spread out the dataset is.
Mathematically, it is the average squared deviation from the mean.
Covariance: A measure of the extent to which corresponding elements from two sets of ordered
data move in the same direction:
cov(x, y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
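A minimal sketch of the six steps above with scikit-learn, which performs steps 2-6 internally once the data is standardized; the Iris dataset is used purely as a stand-in.

```python
# PCA: standardise, then project n-dimensional data onto k components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # step 1: standardise

pca = PCA(n_components=2)                   # choose the first k = 2 components
X_k = pca.fit_transform(X_std)              # steps 2-6 happen internally

print(pca.explained_variance_ratio_)        # variance captured per component
print(X_k.shape)                            # (150, 2): n dims -> k dims
```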
