Machine Learning GL
BUSINESS REPORT
RHEA.S.M
PGPDSBA Online Sep_B 2021
Page 1 of 25
Table of Contents
1. Problem-1: Modelling
   1.1. Objective
   1.2. Descriptive and Exploratory Data Analysis
        1.2.1. Descriptive Data Analysis
        1.2.2. Univariate and Bivariate Data Analysis
        1.2.3. Correlation Analysis
        1.2.4. Outlier Analysis
   1.3. Categorical Variables Treatment and Scaling
        1.3.1. Encoding of the Variables
        1.3.2. Scaling of Variables
        1.3.3. Data Split
   1.4. Logistic Regression Analysis vs. LDA
        1.4.1. LR Models Performance and Inference
        1.4.2. LDA Models Performance and Inference
   1.5. KNN Model and Naïve Bayes Model
        1.5.1. Naïve Bayes Models Performance and Inference
        1.5.2. KNN Model Performance and Inference
   1.6. Bagging and Boosting
        1.6.1. Bagging Performance and Inference
        1.6.2. Boosting Performance and Inference
   1.7. Models Performance and Inference
   1.8. Insights and Recommendations
2. Problem-2: Text Mining and Analysis
   2.1. Objective
   2.2. Background
   2.3. Analysis Methodology
List of Figures
Fig 1: Histograms with KDE and box plots for continuous and ordinal variables
Fig 2: Count plot for categorical variables
Fig 3: Box plots: target variable vs. continuous and ordinal variables
Fig 4: Swarm plots: target variable vs. continuous and ordinal variables
Fig 5: Pair plot
Fig 6: Heat map (correlation plot) for continuous/ordinal variables
Fig 7: Outlier analysis
List of Tables
Table 1: Summary of descriptive statistics
Table 2: Encoding of categorical variables
Table 3: Logistic Regression
Table 4: Linear Discriminant Analysis
Table 5: Naïve Bayes
Table 6: KNN
Table 7: Bagging
Table 8: Boosting
Table 9: Comparison between different models
Table 10: Length of words and sentences
Table 11: Length of words after removing stop words and punctuation
Table 12: Three most common words
1. Problem-1: Modelling
1.1. Objective
The objective of this problem is to build a model that predicts which party a voter will vote for on the basis of the given information, in order to create an exit poll that helps predict the overall win and the seats covered by a particular party.
We additionally have to choose a final model after comparing all the models built, and write inferences about which model is best/optimised.
Background: You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model, to predict which party a voter will vote for on the basis of the given information, to create
an exit poll that will help in predicting overall win and seats covered by a particular party.
Data Dictionary:
vote: Party choice: Conservative or Labour.
age: Age in years.
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to
5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: An 11-point scale that measures respondents' attitudes
toward European integration. High scores represent
‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0
to 3.
gender: Male or Female
1.2. Descriptive and Exploratory Data Analysis

1.2.1. Descriptive Data Analysis:
The provided data set consists of a total of 9 variables: 8 independent variables and one dependent variable.
a) Independent variables: age, economic.cond.national, economic.cond.household,
Blair, Hague, Europe, political.knowledge.
b) Dependent variable: ‘vote’.
Data set contains total of 1525 entries among which 7 integer type variables and 2 object
type variables.
Duplicates were verified, 8 duplicate rows were present in the data set which were
removed before further analysis was done on the data. There are no null values in the
dataset.
Table-1 presents the head(), tail() and info() output, together with both the normal and statistical description of the dataset.
In the statistical description, the zero values for political.knowledge are valid, since the rating scale used to assess that knowledge runs from 0 to 3.
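The checks above can be sketched in pandas; the small frame below is an illustrative stand-in for the survey file (the real dataset has 1525 rows and 9 variables), but the same calls apply to the full data:

```python
import pandas as pd

# Illustrative stand-in for the survey data; the same calls
# apply to the full 1525-row dataset.
df = pd.DataFrame({
    "vote":   ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age":    [43, 36, 35, 24, 43],
    "gender": ["female", "male", "male", "female", "female"],
})

before = len(df)
df = df.drop_duplicates()                 # the report removed 8 duplicate rows
print("duplicates removed:", before - len(df))

print("nulls:", df.isnull().sum().sum())  # 0 -> no missing values
print(df.describe(include="all"))         # the statistical part of Table-1
```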
Table-1: Summary of Descriptive statistics information
1.2.2. Univariate and Bivariate Data Analysis:
Univariate analysis is the simplest form of analysing data: each variable is examined separately. Bivariate analysis, by contrast, measures the relationship between two variables.
Figure-1 shows the individual distributions of all continuous and ordinal variables in the data set. Most variables are roughly normally distributed, except for 'age' and 'Europe', which are right-skewed. Box plots for each variable are plotted alongside the respective histograms.
There appear to be outliers in 'economic.cond.national' and 'economic.cond.household', but since these are ordinal variables they should not be treated.
Figure-1 Histograms with KDE and Box plots for continuous and ordinal variables
Figure-2 shows count plots of the categorical variables 'vote' and 'gender'. There are more female voters than male voters in the data set, and most voters prefer the Labour leader over the Conservative one.
Figure-3 compares the target variable with the continuous and ordinal variables using box plots. The average age of voters preferring the Labour leader is about 50 years, while that of voters preferring the Conservative leader is about 60 years.
Figure-3 Box plots: Target Variable vs. continuous and ordinal variables.
Figure-4 compares the target variable with the continuous and ordinal variables using swarm plots. Voters from a wide range of age groups lean toward the Labour leader, and the assessment of the national economic condition appears better among Labour supporters. There is little difference in the assessment of current household conditions.
Figure-4 Swarm plots: Target Variable vs. continuous and ordinal variables.
1.2.3. Correlation Analysis:
Figure-6 shows the heat map (correlation plot) of the variables. A heat map is a two-dimensional representation of data in which values are represented by colours, providing an immediate visual summary of the information.
The correlations between the variables are weak: mostly positive, with a few weakly negative. Most absolute values are below 0.4, so there is no strong linear relationship between any pair of variables; the value of one variable does not materially drive another, and the variables can be treated as fairly independent.
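A sketch of the correlation heat map, on synthetic ordinal ratings standing in for the survey columns; the 0.4 threshold mirrors the criterion used above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic ordinal ratings standing in for the survey columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(300, 3)),
                  columns=["economic.cond.national",
                           "economic.cond.household", "Blair"])

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.savefig("figure6_sketch.png")

# Any off-diagonal |r| above 0.4 would indicate a meaningful relationship.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print("max |r|:", off_diag.abs().max().max())
```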
Figure-6 Heat map or Correlation plot for continuous/ordinal variables
1.2.4. Outlier Analysis:
Box plots of all continuous/ordinal variables were used to check for outliers; Figure-7 shows the outlier analysis before treatment.
There are no outliers in 'age', the only continuous variable, while two of the ordinal variables do show outliers. Because these variables are ordinal, the decision was taken not to treat them: fewer voters chose low ratings on the assessments, but their opinions are valid responses rather than errors, even though they appear as outliers in the analysis. Hence they must not be treated.
1.3. Categorical Variables Treatment and Scaling

1.3.1. Encoding of the Variables:
The two object-type variables, 'vote' and 'gender', were converted to numeric codes before modelling; Table-2 summarises the encoding of the categorical variables.
1.3.2. Scaling of Variables:
Feature scaling is a method used to normalize the range of independent variables or features of
data. In data processing, it is also known as data normalization and is generally performed during
the data preprocessing step.
We need to perform Feature Scaling when we are dealing with Gradient Descent Based algorithms
(Linear and Logistic Regression) and Distance-based algorithms (KNN, K-means) as these are
very sensitive to the range of the data points.
Scaling is not mandatory for LDA and Naïve Bayes: these modelling techniques are not affected by feature scaling, so it does no harm if the data is scaled. Decision trees and tree-based ensemble methods (Random Forest, XGBoost) are likewise invariant to feature scaling, but it can still be a good idea to rescale/standardise the data.
Hence all the models were built after standardising the data using the z-score technique.
1.3.3. Data Split:
The data was split into train and test sets in a 70:30 ratio using a fixed random state. The X data frame contains the 8 independent variables, while y contains the dependent variable, vote (choice of party).
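The scaling and split steps above can be sketched as follows; the features and target here are synthetic stand-ins for the survey columns:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and target.
rng = np.random.default_rng(2)
X = pd.DataFrame({"age": rng.integers(24, 94, 100),
                  "Blair": rng.integers(1, 6, 100)})
y = pd.Series(rng.choice(["Labour", "Conservative"], 100), name="vote")

X_scaled = X.apply(zscore)   # z-score standardisation of each column

# 70:30 train/test split with a fixed random state, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```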
1.4. Logistic Regression Analysis vs. LDA
LDA works best when all the independent/predictor variables are continuous (not categorical) and follow a normal distribution, whereas Logistic Regression makes no such assumption and categorical variables can be used as predictors.
A Logistic Regression model was fitted to the X-train and y-train data using sklearn's LogisticRegression function, and a Linear Discriminant Analysis was performed on the same data.
1.4.1. LR Models Performance and Inference:
As per the Logistic Regression analysis, the following summary metrics are presented. The model was tuned using a parameter grid; the best parameters from the grid were applied to the model.
Logistic regression does not really have any critical hyper parameters to tune. Sometimes, you can
see useful differences in performance or convergence with different solvers (solver). The ‘newton-
cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no
regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual
formulation only for the L2 penalty. For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and
‘lbfgs’ handle multinomial loss. Chosen solver is ‘sag’.
Penalty: A regression model that uses L1 regularization technique is called Lasso Regression and
model which uses L2 is called Ridge Regression. The key difference between these two is the
penalty term. Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss
function. Chosen penalty ‘l2’
Tolerance (tol): the stopping criterion; it tells scikit-learn to stop searching for a minimum (or maximum) once the specified tolerance is achieved, i.e. once the solution is close enough.
max_iter: the maximum number of iterations taken for the solvers to converge.
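The grid search described above can be sketched as follows; the data is a synthetic stand-in from make_classification, and the grid values are illustrative (the report's grid selected solver='sag' with the 'l2' penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the hyper-parameters discussed above.
param_grid = {"solver": ["newton-cg", "sag", "lbfgs"],
              "penalty": ["l2"],
              "tol": [1e-4, 1e-3],
              "max_iter": [1000]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_)
```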
Table-3 Logistic Regression
Train Data Set: AUC = 0.890
Test Data Set: AUC = 0.885
In general, a model fits the data well if the differences between the observed values and the
model's predicted values are small and unbiased. The train and test model scores are not too far
apart from one another hence this is a good fit model.
Accuracy might not be the best possible metric on which to base a decision, so we also need to be aware of metrics such as precision, recall and F1-score, which are more relevant for choosing the best model.
Precision is the fraction of true positive examples among the examples that the model classified as positive: the number of true positives divided by the number of true positives plus false positives.
Recall, also known as sensitivity, is the fraction of examples classified as positive among the total number of positive examples: the number of true positives divided by the number of true positives plus false negatives.
When both recall and precision are important we look at the F1-score, the harmonic mean of precision and recall, which combines the two into a single number and measures the model's accuracy on the dataset. These scores are close to 1, indicating good accuracy.
1.4.2: LDA Models Performance and Inference
As per Linear Discriminant Analysis the following summary metrics are presented.
The model is tuned using a param grid and the best parameter is chosen from the said grid and
applied to the model.
Solvers: chosen ‘lsqr’ by the grid
a) ‘svd’: Singular value decomposition (default). Does not compute the covariance
matrix, therefore this solver is recommended for data with a large number of features.
b) ‘lsqr’: Least squares solution. Can be combined with shrinkage or custom covariance
estimator.
c) ‘eigen’: Eigenvalue decomposition. Can be combined with shrinkage or custom
covariance estimator.
Shrinkage: This should be left to None if covariance_estimator is used. Note that shrinkage
works only with ‘lsqr’ and ‘eigen’ solvers. Set to ‘auto’.
Tol: Absolute threshold for a singular value of X to be considered significant, used to estimate
the rank of X.
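The LDA grid search can be sketched as below, again on synthetic stand-in data; shrinkage is only supported by the 'lsqr' and 'eigen' solvers, so 'svd' is listed without it (the report's grid selected 'lsqr' with shrinkage='auto'):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 'svd' cannot be combined with shrinkage, hence the two sub-grids.
param_grid = [{"solver": ["svd"]},
              {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto"]}]
grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_)
```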
Table-4 Linear Discriminant Analysis
Train Data Set: AUC = 0.890
Test Data Set: AUC = 0.888
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, accuracy alone may not be the best decision metric; precision, recall and F1-score were also examined, and these scores are close to 1, indicating good accuracy.
1.5. KNN Model and Naïve Bayes Model
A general difference between KNN and the other models is the large amount of real-time computation KNN needs at prediction time. Comparing the two models here: Naïve Bayes is much faster than KNN, because KNN defers all of its computation to the moment of prediction.
Gaussian Naïve Bayes fit is applied for X Train and Y Train data set using sklearn GaussianNB
model function.
KNN Analysis is also performed for X Train and Y Train data.
1.5.1. Naïve Bayes Models Performance and Inference:
As per the Naïve Bayes analysis, the following summary metrics are presented.
Gaussian Naïve Bayes offers little scope for hyper-parameter tuning; as with any machine learning algorithm, its accuracy is improved mainly through simple techniques applied to the dataset, such as data pre-processing and feature selection. Hence the model was built with the default parameters.
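The default-parameter fit can be sketched as follows, with make_classification standing in for the survey data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

nb = GaussianNB()                       # default parameters, as in the report
nb.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, nb.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1])
print("train AUC:", train_auc, "test AUC:", test_auc)
```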
Classification reports for the train and test data sets are presented in Table-5.
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.5.2. KNN Model Performance and Inference:
Table-6 KNN
Train Data Set: AUC = 0.912
Test Data Set: AUC = 0.848
The train and test scores are further apart here (train AUC 0.912 vs. test AUC 0.848), so this is not an ideal good fit; the model is slightly overfit. As discussed above, precision, recall and F1-score were also examined; these scores are reasonably close to 1.
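A KNN tuning sketch on stand-in data; the neighbour counts are illustrative (odd values avoid ties in the majority vote):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Odd neighbour counts avoid ties in the majority vote.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 7, 9, 11]},
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"])
```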
1.6. Bagging and Boosting
Bagging and boosting are ensemble methods that merge the predictions of several base models. Bagging decreases variance, not bias, and helps with over-fitting in a model; boosting decreases bias, not variance. In bagging, each model receives an equal weight.
A boosting fit was applied to the X-train and y-train data using sklearn's AdaBoostClassifier function, and a bagging analysis (Random Forest) was also performed on the same data.
1.6.1. Bagging Performance and Inference:
As per the Bagging model (Random Forest), the following summary metrics are presented. The Random Forest was tuned using a parameter grid, and the best parameters from the grid were applied to the bagging model.
n_estimators (default=100): the number of trees in the forest.
max_depth (default=None): the maximum depth of each tree; if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
min_samples_split (int or float, default=2): the minimum number of samples required to split an internal node.
min_samples_leaf (int or float, default=1): the minimum number of samples required at a leaf node; a split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
max_features (default="auto"): the number of features to consider when looking for the best split.
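The Random Forest grid search can be sketched as below; the data and the grid values are illustrative stand-ins, not the report's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the parameters listed above.
param_grid = {"n_estimators": [50, 100],
              "max_depth": [4, None],
              "min_samples_split": [2, 10],
              "min_samples_leaf": [1, 5],
              "max_features": ["sqrt"]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```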
Table-7 Bagging
Train Data Set: AUC = 0.891, Model Score = 0.7898
Test Data Set: AUC = 0.869, Model Score = 0.7982
Confusion matrices for both data sets are shown in Table-7.
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.6.2. Boosting Performance and Inference:
As per the Boosting model, the following summary metrics are presented. Two types of boosting were performed in the Jupyter notebook, and the better of the two is showcased here: Adaptive Boosting was chosen, since its train and test values are closer together and it represents a better fit than the Gradient Boosting model.
base_estimator (default=None): the base estimator from which the boosted ensemble is built; if None, the base estimator is a DecisionTreeClassifier initialised with max_depth=1.
n_estimators (default=50): the maximum number of estimators at which boosting is terminated; in case of a perfect fit, the learning procedure stops early.
learning_rate (default=1.0): the weight applied to each classifier at each boosting iteration; a higher learning rate increases the contribution of each classifier. There is a trade-off between learning_rate and n_estimators.
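An AdaBoost sketch with the defaults described above (depth-1 decision "stumps" as base estimators), on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Defaults: 50 depth-1 decision-tree stumps, learning_rate=1.0.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, ada.predict_proba(X_test)[:, 1])
print("test AUC:", test_auc)
```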
Table-8 Boosting
Train Data Set: AUC = 0.911
Test Data Set: AUC = 0.897
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.7. Models Performance and Inference
All the models were compared with respect to AUC and model scores/accuracy values for both train and test data (refer to Table-9 below).
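The comparison behind Table-9 can be sketched as a loop over the fitted models; the data here is a synthetic stand-in, so the printed numbers are illustrative, not the report's values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {"LR": LogisticRegression(max_iter=1000),
          "LDA": LinearDiscriminantAnalysis(),
          "NB": GaussianNB(),
          "KNN": KNeighborsClassifier(),
          "RF": RandomForestClassifier(random_state=1),
          "AdaBoost": AdaBoostClassifier(random_state=1)}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = {
        "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        "test_acc": accuracy_score(y_test, model.predict(X_test)),
    }
for name, r in rows.items():
    print(name, r)
```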
The LDA model seems to be the best fit of the lot, the train and the test values seem closest
for this model type. In general, a model fits the data well if the differences between the
observed values and the model's predicted values are small and unbiased. The train and test
model scores are not too far apart from one another hence this is a good fit model.
Accuracy alone may not be the best metric on which to base this decision, so precision, recall and F1-score (defined earlier) were also compared across the models. LDA performs better than the other models on the basis of accuracy, AUC and recall, while precision and F1-score are similar for all models and therefore cannot be used as distinguishing factors.
Hence, considering all the models, LDA is the better model for predicting the overall win and the seats covered by a particular party.
1.8. Insights and Recommendations
The data set contains a total of 1525 entries, with 7 integer-type variables and 2 object-type variables; 8 of these are independent variables and one is the dependent variable. Eight duplicate rows were found and removed, and there are no null values in the dataset.
The target variable is 'vote', and the visualisations clearly show that more voters preferred the Labour leader. There were also more female voters than male.
The correlations between the variables are weak: mostly positive, with a few weakly negative. Most values in the correlation plot are below 0.4, so there is no strong relationship between the variables; the value of one variable does not materially drive another, and the variables can be treated as fairly independent.
There are no outliers in 'age', the only continuous variable, while two of the ordinal variables show outliers. Since these variables are ordinal, the outliers were not treated: fewer voters chose low ratings on the assessments, but their opinions are valid responses even though they appear as outliers in the analysis.
The LDA model seems to be the best fit of the lot, the train and the test values seem closest for
this model type. In general, a model fits the data well if the differences between the observed
values and the model's predicted values are small and unbiased. The train and test model scores
are not too far apart from one another hence this is a good fit model.
LDA performs better than the other model on the basis of Accuracy, AUC and Recall. The
Precision and F1 Score seem to be similar for all, hence that cannot be used as a distinguishing
factor.
2. Problem-2: Text Mining and Analysis

2.1. Objective
The objective of this problem is to extract machine-readable facts from three speeches of Presidents of the United States of America.
The purpose of text analysis is to create structured data out of free-text content; the process can be thought of as slicing and dicing heaps of unstructured, heterogeneous documents into easy-to-manage and interpretable data pieces.
2.2. Background
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
2.3. Analysis Methodology
We firstly imported the necessary libraries for Text Mining and Analysis. The Natural Language
Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical
natural language processing for English written in the Python programming language.
We downloaded the corpus named in the question, i.e. 'inaugural', and extracted the three speeches using the raw() function.
We then examined the words and sentences found in each of the three speeches. Table-10 gives the number of characters, words and sentences in each speech.
We then removed all the stop words from each of the speeches, and proceeded to stem the words and remove punctuation. A total of 211 stop words were removed from the speeches.
Table 11: Length of words after removing stop words and punctuation

                   Franklin D. Roosevelt   John F. Kennedy   Richard Nixon
Characters (raw)                    4590              4771            5950
Words (split)                        627               693             833
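The three-most-common-words computation behind Table-12 can be sketched with a Counter; the token list here is hypothetical, standing in for a full cleaned speech:

```python
from collections import Counter

# Hypothetical cleaned tokens from one speech; the real Table-12 counts
# come from the full cleaned speeches.
words = ["nation", "spirit", "nation", "life", "spirit",
         "nation", "democracy", "life", "nation"]
top3 = Counter(words).most_common(3)
print(top3)
```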
The three most common words (Table-12) were then identified for each speech:
A. Franklin D. Roosevelt
B. John F. Kennedy
C. Richard Nixon