Machine Learning GL
BUSINESS REPORT
RHEA.S.M
PGPDSBA Online Sep_B 2021
Page 1 of 25
Table of Contents
1. Problem-1: Modelling
   1.1. Objective
   1.2. Descriptive and Exploratory Data Analysis
        1.2.1. Descriptive Data Analysis
        1.2.2. Univariate and Bivariate Data Analysis
        1.2.3. Correlation Analysis
        1.2.4. Outlier Analysis
   1.3. Categorical Variables Treatment and Scaling
        1.3.1. Encoding of the Variables
        1.3.2. Scaling of Variables
        1.3.3. Data Split
   1.4. Logistic Regression Analysis vs. LDA
        1.4.1. LR Models Performance and Inference
        1.4.2. LDA Models Performance and Inference
   1.5. KNN Model and Naïve Bayes Model
        1.5.1. Naïve Bayes Models Performance and Inference
        1.5.2. KNN Model Performance and Inference
   1.6. Bagging and Boosting
        1.6.1. Bagging Performance and Inference
        1.6.2. Boosting Performance and Inference
   1.7. Models Performance and Inference
   1.8. Insights and Recommendations
2. Problem-2: Text Mining and Analysis
   2.1. Objective
   2.2. Background
   2.3. Analysis Methodology
List of Figures
Fig 1: Histograms with KDE and box plots for continuous and ordinal variables
Fig 2: Count plot for categorical variables
Fig 3: Box plots: target variable vs. continuous and ordinal variables
Fig 4: Swarm plots: target variable vs. continuous and ordinal variables
Fig 5: Pair plot
Fig 6: Heat map (correlation plot) for continuous/ordinal variables
Fig 7: Outlier analysis
List of Tables
Table 1: Summary of descriptive statistics
Table 2: Encoding of categorical variables
Table 3: Logistic Regression
Table 4: Linear Discriminant Analysis
Table 5: Naïve Bayes
Table 6: KNN
Table 7: Bagging
Table 8: Boosting
Table 9: Comparison between different models
Table 10: Length of words and sentences
Table 11: Length of words after removing stop words and punctuation
Table 12: Three most common words
1. Problem-1: Modelling
1.1. Objective
The objective of this problem is to build a model that predicts which party a voter will vote for on the basis of the given information, in order to create an exit poll that helps predict the overall win and the seats covered by a particular party.
We additionally have to choose a final model after comparing all the models built, and write inferences about which model is best/optimised.
Background: You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model, to predict which party a voter will vote for on the basis of the given information, to create
an exit poll that will help in predicting overall win and seats covered by a particular party.
Data Dictionary:
vote: Party choice: Conservative or Labour.
age: Age in years.
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to
5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: An 11-point scale that measures respondents' attitudes
toward European integration. High scores represent
‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0
to 3.
gender: Male or Female
1.2. Descriptive and Exploratory Data Analysis

1.2.1. Descriptive Data Analysis:
The provided data set consists of a total of 9 variables: 8 independent variables and one dependent variable.
a) Independent variables: age, economic.cond.national, economic.cond.household,
Blair, Hague, Europe, political.knowledge.
b) Dependent variable: ‘vote’.
Data set contains total of 1525 entries among which 7 integer type variables and 2 object
type variables.
Duplicates were verified, 8 duplicate rows were present in the data set which were
removed before further analysis was done on the data. There are no null values in the
dataset.
Table-1 presents the head(), tail() and info() output, together with both the normal and statistical description of the dataset.
In the statistical description, the zero values for political.knowledge are valid, since the rating scale used to assess that knowledge runs from 0 to 3.
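The checks above can be sketched in pandas; the small frame below is an illustrative stand-in for the survey file (the real dataset has 1525 rows and 9 variables), but the same calls apply to the full data:

```python
import pandas as pd

# Illustrative stand-in for the survey data; the same calls
# apply to the full 1525-row dataset.
df = pd.DataFrame({
    "vote":   ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age":    [43, 36, 35, 24, 43],
    "gender": ["female", "male", "male", "female", "female"],
})

before = len(df)
df = df.drop_duplicates()                 # the report removed 8 duplicate rows
print("duplicates removed:", before - len(df))

print("nulls:", df.isnull().sum().sum())  # 0 -> no missing values
print(df.describe(include="all"))         # the statistical part of Table-1
```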
Table-1: Summary of Descriptive statistics information
1.2.2. Univariate and Bivariate Data Analysis:
Univariate analysis is the simplest form of analysing data: each variable is examined separately. Bivariate analysis, by contrast, measures the relationship between two variables.
Figure-1 shows the individual distributions of all continuous and ordinal variables in the data set. Most variables are roughly normally distributed, except for 'age' and 'Europe', which are right-skewed. Box plots for each variable are plotted alongside the respective histograms.
There appear to be outliers in 'economic.cond.national' and 'economic.cond.household', but since these are ordinal variables they should not be treated.
Figure-1 Histograms with KDE and Box plots for continuous and ordinal variables
Figure-2 shows count plots of the categorical variables 'vote' and 'gender'. There are more female voters than male voters in the data set, and most voters prefer the Labour leader over the Conservative one.
Figure-3 compares the target variable with the continuous and ordinal variables using box plots. The average age of voters preferring the Labour leader is about 50 years, while that of voters preferring the Conservative leader is about 60 years.
Figure-3 Box plots: Target Variable vs. continuous and ordinal variables.
Figure-4 compares the target variable with the continuous and ordinal variables using swarm plots. Voters from a wide range of age groups lean toward the Labour leader, and the assessment of the national economic condition appears better among Labour supporters. There is little difference in the assessment of current household conditions.
Figure-4 Swarm plots: Target Variable vs. continuous and ordinal variables.
1.2.3. Correlation Analysis:
Figure-6 shows the heat map (correlation plot) of the variables. A heat map is a two-dimensional representation of data in which values are represented by colours, providing an immediate visual summary of the information.
The correlations between the variables are weak: mostly positive, with a few weakly negative. Most absolute values are below 0.4, so there is no strong linear relationship between any pair of variables; the value of one variable does not materially drive another, and the variables can be treated as fairly independent.
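A sketch of the correlation heat map, on synthetic ordinal ratings standing in for the survey columns; the 0.4 threshold mirrors the criterion used above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic ordinal ratings standing in for the survey columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(300, 3)),
                  columns=["economic.cond.national",
                           "economic.cond.household", "Blair"])

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.savefig("figure6_sketch.png")

# Any off-diagonal |r| above 0.4 would indicate a meaningful relationship.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print("max |r|:", off_diag.abs().max().max())
```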
Figure-6 Heat map or Correlation plot for continuous/ordinal variables
1.2.4. Outlier Analysis:
Box plots of all continuous/ordinal variables were used to check for outliers; Figure-7 shows the outlier analysis before treatment.
There are no outliers in 'age', the only continuous variable, while two of the ordinal variables do show outliers. Because these variables are ordinal, the decision was taken not to treat them: fewer voters chose low ratings on the assessments, but their opinions are valid responses rather than errors, even though they appear as outliers in the analysis. Hence they must not be treated.
1.3. Categorical Variables Treatment and Scaling

1.3.1. Encoding of the Variables:
The two object-type variables, 'vote' and 'gender', were converted to numeric codes before modelling; Table-2 summarises the encoding of the categorical variables.
1.3.2. Scaling of Variables:
Feature scaling is a method used to normalize the range of independent variables or features of
data. In data processing, it is also known as data normalization and is generally performed during
the data preprocessing step.
We need to perform Feature Scaling when we are dealing with Gradient Descent Based algorithms
(Linear and Logistic Regression) and Distance-based algorithms (KNN, K-means) as these are
very sensitive to the range of the data points.
Scaling is not mandatory for LDA and Naïve Bayes: these modelling techniques are not affected by feature scaling, so it does no harm if the data is scaled. Decision trees and tree-based ensemble methods (Random Forest, XGBoost) are likewise invariant to feature scaling, but it can still be a good idea to rescale/standardise the data.
Hence all the models were built after standardising the data using the z-score technique.
1.3.3. Data Split:
The data was split into train and test sets in a 70:30 ratio using a fixed random state. The X data frame contains the 8 independent variables, while y contains the dependent variable, vote (choice of party).
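The scaling and split steps above can be sketched as follows; the features and target here are synthetic stand-ins for the survey columns:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and target.
rng = np.random.default_rng(2)
X = pd.DataFrame({"age": rng.integers(24, 94, 100),
                  "Blair": rng.integers(1, 6, 100)})
y = pd.Series(rng.choice(["Labour", "Conservative"], 100), name="vote")

X_scaled = X.apply(zscore)   # z-score standardisation of each column

# 70:30 train/test split with a fixed random state, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```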
1.4. Logistic Regression Analysis vs. LDA
LDA works best when all the independent/predictor variables are continuous (not categorical) and follow a normal distribution, whereas Logistic Regression makes no such assumption and categorical variables can be used as predictors.
A Logistic Regression model was fitted to the X-train and y-train data using sklearn's LogisticRegression function, and a Linear Discriminant Analysis was performed on the same data.
1.4.1. LR Models Performance and Inference:
As per the Logistic Regression analysis, the following summary metrics are presented. The model was tuned using a parameter grid; the best parameters from the grid were applied to the model.
Logistic regression does not really have any critical hyper parameters to tune. Sometimes, you can
see useful differences in performance or convergence with different solvers (solver). The ‘newton-
cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no
regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual
formulation only for the L2 penalty. For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and
‘lbfgs’ handle multinomial loss. Chosen solver is ‘sag’.
Penalty: A regression model that uses L1 regularization technique is called Lasso Regression and
model which uses L2 is called Ridge Regression. The key difference between these two is the
penalty term. Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss
function. Chosen penalty ‘l2’
Tolerance (tol): the stopping criterion; it tells scikit-learn to stop searching for a minimum (or maximum) once the specified tolerance is achieved, i.e. once the solution is close enough.
max_iter: the maximum number of iterations taken for the solvers to converge.
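The grid search described above can be sketched as follows; the data is a synthetic stand-in from make_classification, and the grid values are illustrative (the report's grid selected solver='sag' with the 'l2' penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the hyper-parameters discussed above.
param_grid = {"solver": ["newton-cg", "sag", "lbfgs"],
              "penalty": ["l2"],
              "tol": [1e-4, 1e-3],
              "max_iter": [1000]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_)
```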
Table-3 Logistic Regression
Train Data Set: AUC = 0.890
Test Data Set: AUC = 0.885
In general, a model fits the data well if the differences between the observed values and the
model's predicted values are small and unbiased. The train and test model scores are not too far
apart from one another hence this is a good fit model.
Accuracy might not be the best possible metric on which to base a decision, so we also need to be aware of metrics such as precision, recall and F1-score, which are more relevant for choosing the best model.
Precision is the fraction of true positive examples among the examples that the model classified as positive: the number of true positives divided by the number of true positives plus false positives.
Recall, also known as sensitivity, is the fraction of examples classified as positive among the total number of positive examples: the number of true positives divided by the number of true positives plus false negatives.
When both recall and precision are important we look at the F1-score, the harmonic mean of precision and recall, which combines the two into a single number and measures the model's accuracy on the dataset. These scores are close to 1, indicating good accuracy.
1.4.2: LDA Models Performance and Inference
As per Linear Discriminant Analysis the following summary metrics are presented.
The model is tuned using a param grid and the best parameter is chosen from the said grid and
applied to the model.
Solvers: chosen ‘lsqr’ by the grid
a) ‘svd’: Singular value decomposition (default). Does not compute the covariance
matrix, therefore this solver is recommended for data with a large number of features.
b) ‘lsqr’: Least squares solution. Can be combined with shrinkage or custom covariance
estimator.
c) ‘eigen’: Eigenvalue decomposition. Can be combined with shrinkage or custom
covariance estimator.
Shrinkage: This should be left to None if covariance_estimator is used. Note that shrinkage
works only with ‘lsqr’ and ‘eigen’ solvers. Set to ‘auto’.
Tol: Absolute threshold for a singular value of X to be considered significant, used to estimate
the rank of X.
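The LDA grid search can be sketched as below, again on synthetic stand-in data; shrinkage is only supported by the 'lsqr' and 'eigen' solvers, so 'svd' is listed without it (the report's grid selected 'lsqr' with shrinkage='auto'):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 'svd' cannot be combined with shrinkage, hence the two sub-grids.
param_grid = [{"solver": ["svd"]},
              {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto"]}]
grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_)
```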
Table-4 Linear Discriminant Analysis
Train Data Set: AUC = 0.890
Test Data Set: AUC = 0.888
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, accuracy alone may not be the best decision metric; precision, recall and F1-score were also examined, and these scores are close to 1, indicating good accuracy.
1.5. KNN Model and Naïve Bayes Model
A general difference between KNN and the other models is the large amount of real-time computation KNN needs at prediction time. Comparing the two models here: Naïve Bayes is much faster than KNN, because KNN defers all of its computation to the moment of prediction.
Gaussian Naïve Bayes fit is applied for X Train and Y Train data set using sklearn GaussianNB
model function.
KNN Analysis is also performed for X Train and Y Train data.
1.5.1. Naïve Bayes Models Performance and Inference:
As per the Naïve Bayes analysis, the following summary metrics are presented.
Gaussian Naïve Bayes offers little scope for hyper-parameter tuning; as with any machine learning algorithm, its accuracy is improved mainly through simple techniques applied to the dataset, such as data pre-processing and feature selection. Hence the model was built with the default parameters.
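The default-parameter fit can be sketched as follows, with make_classification standing in for the survey data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

nb = GaussianNB()                       # default parameters, as in the report
nb.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, nb.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1])
print("train AUC:", train_auc, "test AUC:", test_auc)
```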
Classification reports for the train and test data sets are presented in Table-5.
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.5.2. KNN Model Performance and Inference:
Table-6 KNN
Train Data Set: AUC = 0.912
Test Data Set: AUC = 0.848
The train and test scores are further apart here (train AUC 0.912 vs. test AUC 0.848), so this is not an ideal good fit; the model is slightly overfit. As discussed above, precision, recall and F1-score were also examined; these scores are reasonably close to 1.
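A KNN tuning sketch on stand-in data; the neighbour counts are illustrative (odd values avoid ties in the majority vote):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Odd neighbour counts avoid ties in the majority vote.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 7, 9, 11]},
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"])
```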
1.6. Bagging and Boosting
Bagging and boosting are ensemble methods that merge the predictions of several base models. Bagging decreases variance, not bias, and helps with over-fitting in a model; boosting decreases bias, not variance. In bagging, each model receives an equal weight.
A boosting fit was applied to the X-train and y-train data using sklearn's AdaBoostClassifier function, and a bagging analysis (Random Forest) was also performed on the same data.
1.6.1. Bagging Performance and Inference:
As per the Bagging model (Random Forest), the following summary metrics are presented. The Random Forest was tuned using a parameter grid, and the best parameters from the grid were applied to the bagging model.
n_estimators (default=100): the number of trees in the forest.
max_depth (default=None): the maximum depth of each tree; if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
min_samples_split (int or float, default=2): the minimum number of samples required to split an internal node.
min_samples_leaf (int or float, default=1): the minimum number of samples required at a leaf node; a split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
max_features (default="auto"): the number of features to consider when looking for the best split.
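The Random Forest grid search can be sketched as below; the data and the grid values are illustrative stand-ins, not the report's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the parameters listed above.
param_grid = {"n_estimators": [50, 100],
              "max_depth": [4, None],
              "min_samples_split": [2, 10],
              "min_samples_leaf": [1, 5],
              "max_features": ["sqrt"]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```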
Table-7 Bagging
Train Data Set: AUC = 0.891, Model Score = 0.7898
Test Data Set: AUC = 0.869, Model Score = 0.7982
Confusion matrices for both data sets are shown in Table-7.
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.6.2. Boosting Performance and Inference:
As per the Boosting model, the following summary metrics are presented. Two types of boosting were performed in the Jupyter notebook, and the better of the two is showcased here: Adaptive Boosting was chosen, since its train and test values are closer together and it represents a better fit than the Gradient Boosting model.
base_estimator (default=None): the base estimator from which the boosted ensemble is built; if None, the base estimator is a DecisionTreeClassifier initialised with max_depth=1.
n_estimators (default=50): the maximum number of estimators at which boosting is terminated; in case of a perfect fit, the learning procedure stops early.
learning_rate (default=1.0): the weight applied to each classifier at each boosting iteration; a higher learning rate increases the contribution of each classifier. There is a trade-off between learning_rate and n_estimators.
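An AdaBoost sketch with the defaults described above (depth-1 decision "stumps" as base estimators), on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Defaults: 50 depth-1 decision-tree stumps, learning_rate=1.0.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, ada.predict_proba(X_test)[:, 1])
print("test AUC:", test_auc)
```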
Table-8 Boosting
Train Data Set: AUC = 0.911
Test Data Set: AUC = 0.897
The train and test model scores are not too far apart from one another, hence this is a good-fit model. As discussed above, precision, recall and F1-score were also examined; these scores are close to 1, indicating good accuracy.
1.7. Models Performance and Inference
All the models were compared with respect to AUC and model scores/accuracy values for both train and test data (refer to Table-9 below).
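The comparison behind Table-9 can be sketched as a loop over the fitted models; the data here is a synthetic stand-in, so the printed numbers are illustrative, not the report's values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {"LR": LogisticRegression(max_iter=1000),
          "LDA": LinearDiscriminantAnalysis(),
          "NB": GaussianNB(),
          "KNN": KNeighborsClassifier(),
          "RF": RandomForestClassifier(random_state=1),
          "AdaBoost": AdaBoostClassifier(random_state=1)}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = {
        "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        "test_acc": accuracy_score(y_test, model.predict(X_test)),
    }
for name, r in rows.items():
    print(name, r)
```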
The LDA model seems to be the best fit of the lot, the train and the test values seem closest
for this model type. In general, a model fits the data well if the differences between the
observed values and the model's predicted values are small and unbiased. The train and test
model scores are not too far apart from one another hence this is a good fit model.
Accuracy alone may not be the best metric on which to base this decision, so precision, recall and F1-score (defined earlier) were also compared across the models. LDA performs better than the other models on the basis of accuracy, AUC and recall, while precision and F1-score are similar for all models and therefore cannot be used as distinguishing factors.
Hence, considering all the models, LDA is the better model for predicting the overall win and the seats covered by a particular party.
1.8. Insights and Recommendations
The data set contains a total of 1525 entries, with 7 integer-type variables and 2 object-type variables; 8 of these are independent variables and one is the dependent variable. Eight duplicate rows were found and removed, and there are no null values in the dataset.
The target variable is 'vote', and the visualisations clearly show that more voters preferred the Labour leader. There were also more female voters than male.
The correlations between the variables are weak: mostly positive, with a few weakly negative. Most values in the correlation plot are below 0.4, so there is no strong relationship between the variables; the value of one variable does not materially drive another, and the variables can be treated as fairly independent.
There are no outliers in 'age', the only continuous variable, while two of the ordinal variables show outliers. Since these variables are ordinal, the outliers were not treated: fewer voters chose low ratings on the assessments, but their opinions are valid responses even though they appear as outliers in the analysis.
The LDA model seems to be the best fit of the lot, the train and the test values seem closest for
this model type. In general, a model fits the data well if the differences between the observed
values and the model's predicted values are small and unbiased. The train and test model scores
are not too far apart from one another hence this is a good fit model.
LDA performs better than the other model on the basis of Accuracy, AUC and Recall. The
Precision and F1 Score seem to be similar for all, hence that cannot be used as a distinguishing
factor.
2. Problem-2: Text Mining and Analysis

2.1. Objective
The objective of this problem is to extract machine-readable facts from three speeches of Presidents of the United States of America.
The purpose of text analysis is to create structured data out of free-text content; the process can be thought of as slicing and dicing heaps of unstructured, heterogeneous documents into easy-to-manage and interpretable data pieces.
2.2. Background
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
2.3. Analysis Methodology
We firstly imported the necessary libraries for Text Mining and Analysis. The Natural Language
Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical
natural language processing for English written in the Python programming language.
We downloaded the corpus named in the question, i.e. 'inaugural', and extracted the three speeches using the raw() function.
We then examined the words and sentences found in each of the three speeches. Table-10 gives the number of characters, words and sentences in each speech.
We then removed all the stop words from each of the speeches, and proceeded to stem the words and remove punctuation. A total of 211 stop words were removed from the speeches.
Table 11: Length of words after removing stop words and punctuation

                   Franklin D. Roosevelt   John F. Kennedy   Richard Nixon
Characters (raw)                    4590              4771            5950
Words (split)                        627               693             833
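The three-most-common-words computation behind Table-12 can be sketched with a Counter; the token list here is hypothetical, standing in for a full cleaned speech:

```python
from collections import Counter

# Hypothetical cleaned tokens from one speech; the real Table-12 counts
# come from the full cleaned speeches.
words = ["nation", "spirit", "nation", "life", "spirit",
         "nation", "democracy", "life", "nation"]
top3 = Counter(words).most_common(3)
print(top3)
```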
The three most common words (Table-12) were then identified for each speech:
A. Franklin D. Roosevelt
B. John F. Kennedy
C. Richard Nixon