ML - Business Report - Priyanka Sharma
Priyanka Sharma
Nov’22 Batch
You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, to create an exit poll that will help in predicting the overall win and seats covered by a particular party.
Read the dataset. Do the descriptive statistics and the null value condition check. Write an inference on it.
Data Information:
Observation:
We have dropped the 'unnamed' column from the dataset as it is not useful for our study. The data set had 1525 rows and 9 columns. After dropping the duplicate values, there are 1517 rows and 9 columns.
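For reference, a minimal sketch of this loading and cleaning step. The file name 'Election_Data.xlsx' and the column label 'Unnamed: 0' are assumptions, not stated in the report:

```python
import pandas as pd

# Hypothetical file name; the report does not state the actual path.
df = pd.read_excel('Election_Data.xlsx')

# Drop the serial-number artifact column, as it is not useful for the study.
df = df.drop(columns=['Unnamed: 0'])

print(df.shape)            # (1525, 9)
df = df.drop_duplicates()  # 8 duplicate rows removed
print(df.shape)            # (1517, 9)

print(df.describe(include='all'))  # descriptive statistics
print(df.isnull().sum())           # null-value condition check
```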
Data description:
Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.
Data types:
There are 7 numerical and 2 categorical data types in the data.
Univariate Analysis: Description
The labour party has a higher number of votes; it has more than double the votes of the conservative party.
Mean of 'economic.cond.national':
Observation:
The count of score 3 is slightly higher than that of the 2nd highest score, 4, whose count is 538.
Count plot of 'economic.cond.household':
Mean of 'economic.cond.household':
Observation:
The count of score 3 is moderately higher than that of the 2nd highest score, 4, whose count is 435.
Count plot of 'Blair':
Observation:
The count of score 4 is much higher than that of the 2nd highest score, 2, whose count is 434.
Count plot of 'Hague':
Observation:
Mean of 'Europe':
Observation:
We can clearly see that the labour party has got more votes than the conservative party.
In every age group, the labour party has got more votes than the conservative party.
Female votes are considerably higher than the male votes in both parties.
Out of 82 people who gave a score of 5, 73 people have voted for the labour party.
Out of 538 people who gave a score of 4, 447 people have voted for the labour party. This is
the highest set of people in the labour party.
Out of 604 people who gave a score of 3, 405 people have voted for the labour party. This is the 2nd highest set of people in the labour party. The remaining 199 people have voted for the conservative party.
Out of 256 people who gave a score of 2, 116 people have voted for the labour party. 140 people have voted for the conservative party. This is the instance where the conservative party has got more votes than the labour party.
Out of 37 people who gave a score of 1, 16 people have voted for the labour party. 21 people have voted for the conservative party.
Observation:
Out of 92 people who gave a score of 5, 69 people have voted for the labour party.
Out of 435 people who gave a score of 4, 349 people have voted for the labour party. This is
the 2nd highest set of people in the labour party.
Out of 645 people who gave a score of 3, 448 people have voted for the labour party. This is the highest set of people in the labour party. The remaining 197 people have voted for the conservative party.
Out of 280 people who gave a score of 2, 154 people have voted for the labour party. 126
people have voted for the conservative party.
Out of 65 people who gave a score of 1, 37 people have voted for the labour party. 28 people have voted for the conservative party.
In all these instances, the labour party has more votes than the conservative party.
Out of 152 people who gave a score of 5, 149 people have voted for the labour party. The remaining 3 people, despite giving a score of 5 to the labour leader, have chosen to vote for the conservative party.
Out of 833 people who gave a score of 4, 676 people have voted for the labour party. The remaining 157 people, despite giving a score of 4 to the labour leader, have chosen to vote for the conservative party.
Only 1 person has given a score of 3, and that person has voted for the conservative party.
Out of 434 people who gave a score of 2, 240 people have voted for the conservative party. The remaining 194 people, despite giving an unsatisfactory score of 2 to the labour leader, have chosen to vote for the labour party.
Out of 97 people who gave a score of 1, 59 people have voted for the conservative party. The remaining 38 people, despite giving the lowest score of 1 to the labour leader, have chosen to vote for the labour party.
Observation:
Out of 73 people who gave a score of 5, 59 people have voted for the conservative party.
Out of 557 people who gave a score of 4, 286 people have voted for the conservative party. The remaining 271 people, despite giving a score of 4 to the conservative leader, have chosen to vote for the labour party.
Out of 37 people who gave a score of 3, 28 have voted for the labour party; the remaining 9, who gave an average score of 3 to the conservative leader, have voted for the conservative party.
Out of 617 people who gave a score of 2, 522 people have voted for the labour party. The remaining 95 people, despite giving an unsatisfactory score of 2 to the conservative leader, have chosen to vote for the conservative party.
Out of 233 people who gave a score of 1, 222 people have voted for the labour party.
The scores of 4 and 5 have more votes in the conservative party, although at score 4 the votes are almost equal in both parties, with the conservative party slightly ahead.
The scores of 1, 2 and 3 have more votes in the labour party. Still, a significant percentage of people who gave a bad score to the conservative leader chose to vote for 'Hague'.
Viewing the exact values of the variables of 'vote' with respect to 'Europe':
Observation:
Out of 338 people who gave a score of 11, 166 people have voted for the labour party and
172 people have voted for the conservative party.
People who gave scores of 7 to 10 have voted for labour and conservative almost equally; the conservative party seems to be slightly ahead in these instances.
Out of 207 people who gave a score of 6, 172 people have voted for the labour party and 35 people have voted for the conservative party.
People who gave a score of 1 to 6 have predominantly voted for the labour party. As we can see, there are a total of 770 people who have given scores from 1 to 6. Out of these 770 people, 672 have voted for the labour party, i.e., 87.28% of them have chosen the labour party.
So, we can infer that the lower the 'Eurosceptic' sentiment, the higher the votes for the labour party.
Checking pair-wise distribution of the continuous variables:
Observation:
From the distribution plots, we can see that some of the variables are skewed.
From the scatter plots, we can see that there is mostly no correlation between the variables.
A correlation matrix is a table which shows the correlation coefficient between variables. Correlation values range from -1 to +1. Values closer to zero mean that there is no linear trend between the two variables; values close to +1 mean that the correlation is positive, and values close to -1 mean that it is negative. The correlation heat map helps us to visualize the correlation between two variables.
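A minimal sketch of how such a heat map is usually drawn, assuming df is the cleaned DataFrame from the loading step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)  # pairwise correlation coefficients
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heat map')
plt.show()
```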
Observation:
Through this matrix, we can see that there is mostly no correlation in the dataset. There are some variables that are moderately positively correlated and some that are slightly negatively correlated.
Train-test-split:
Our model will use all the variables, and 'vote' is the target variable. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets.
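A sketch of the split, assuming the categorical columns have already been encoded; the 70:30 ratio and the random_state value are assumptions:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['vote'])  # all predictors
y = df['vote']                 # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
```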
Why scaling?:
The dataset contains features highly varying in magnitudes, units and range, e.g., between the 'age' column and the other columns.
Since most machine learning algorithms use Euclidean distance between two data points in their computations, this is a problem.
If left alone, these algorithms take in only the magnitude of features, neglecting the units. The results would vary greatly between different units, e.g., 1 km and 1000 metres.
The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.
To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
In this case, we have a lot of encoded, ordinal, categorical and continuous variables. So, we use the MinMaxScaler technique to scale the data.
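A sketch of the scaling step. Fitting on the training data only (to avoid leakage) is standard practice, though the report does not state it explicitly:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                         # rescales every feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on train only
X_test_scaled = scaler.transform(X_test)        # apply the same ranges to test
```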
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because
the error margin is low and the error in both train and test data is not too high. Thus, the
model is not over-fitted or under-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable. We take K value as 7.
Validity of the model:
As we can see, the train data has 100% accuracy and the test data has 84% accuracy. The difference is more than 10%. So, we can infer that the KNN model is over-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
Train data:
Accuracy: 83.6%
Precision: 86%
Recall: 92%
F1-Score: 89%
Test data:
Accuracy: 84.21%
Precision: 86%
Recall: 90%
F1-Score: 88%
Comparison of the performance (%) of the regular and tuned LR models:

Train:      Regular  Tuned
Precision      86      86
Recall         92      92
F1-score       89      89

Test:       Regular  Tuned
Precision      86      86
Recall         89      90
F1-score       87      88
As we can see from the above tabular comparison, there is not much difference between the performance of the regular LR model and the tuned LR model. The values are high overall and there is no over-fitting or under-fitting. Therefore, both models are equally good models.
Linear Discriminant Analysis Model Tuning: Best parameters:
Train data:
Accuracy: 83.22%
Precision: 87%
Recall: 90%
F1-Score: 88%
Test data:
Accuracy: 83.99%
Precision: 87%
Recall: 89%
F1-Score: 88%
Train:      Regular  Tuned
Precision      86      87
Recall         91      90
F1-score       89      88

Test:       Regular  Tuned
Precision      86      87
Recall         89      89
F1-score       88      88
As we can see from the above tabular comparison, there is not much difference between the performance of the regular LDA model and the tuned LDA model. The values are high overall and there is no over-fitting or under-fitting. Therefore, both models are equally good models.
Train data:
Accuracy: 84.35%
Precision: 88%
Recall: 91%
F1-Score: 89%
Test data:
Accuracy: 86.18%
Precision: 87%
Recall: 93%
F1-Score: 90%
Comparison on performance of both regular and tuned KNN models:
Train:      Regular  Tuned
Precision     100      88
Recall        100      91
F1-score      100      89

Test:       Regular  Tuned
Precision      86      87
Recall         90      93
F1-score       88      90
As we can see, the regular KNN model was over-fitted. But model tuning has helped the model to recover from over-fitting.
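A hedged sketch of how such tuning is typically done with GridSearchCV; the parameter grid below is illustrative, not the exact grid used in the report:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': list(range(3, 21, 2)),  # odd k values avoid voting ties
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)
best_knn = grid.best_estimator_
print(best_knn.score(X_train_scaled, y_train),  # train accuracy
      best_knn.score(X_test_scaled, y_test))    # test accuracy
```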
0 = age
1 = economic.cond.national
2 = economic.cond.household
3 = Blair
4 = Hague
5 = Europe
6 = political.knowledge
7 = gender_male
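This index-to-feature legend typically accompanies a feature-importance plot. A sketch of how it could be produced with a tree ensemble; the choice of Random Forest here is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

features = ['age', 'economic.cond.national', 'economic.cond.household',
            'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender_male']
plt.barh(range(len(features)), rf.feature_importances_)
plt.yticks(range(len(features)), features)
plt.xlabel('Feature importance')
plt.show()
```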
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
Test data:
Accuracy: 82.68%
Precision: 84%
Recall: 91%
F1-Score: 88%
The model is over-fitted. We will use bagging to improve the performance of the model.
The model is not over-fitted. The values are good. Therefore, the model is a good model.
The model is not over-fitted. The values are better than those of the AdaBoosting model. The model is a good model.
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
There is no over-fitting. There is improvement from the regular model. The model is a good model.
The gradient boost classifier, after tuning, has improved the model significantly.
The difference between the train and test accuracies has also been reduced.
Logistic Regression - Regular:
Train data:
Accuracy: 83.41%
Precision: 86%
Recall: 92%
F1-Score: 89%
AUC: 88.98%
Test data:
Accuracy: 82.68%
Precision: 86%
Recall: 89%
F1-Score: 87%
AUC: 88.4%
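For reference, a sketch of how the Accuracy / Precision / Recall / F1 / AUC figures in these sections are typically computed, shown here for a logistic regression; the default solver settings are an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score)

lr = LogisticRegression().fit(X_train_scaled, y_train)

for X_, y_ in [(X_train_scaled, y_train), (X_test_scaled, y_test)]:
    y_pred = lr.predict(X_)
    y_prob = lr.predict_proba(X_)[:, 1]       # probability of the positive class
    print(accuracy_score(y_, y_pred))         # Accuracy
    print(classification_report(y_, y_pred))  # Precision, Recall, F1-Score
    print(roc_auc_score(y_, y_prob))          # AUC
```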
Logistic Regression - Tuned:
Train data:
Accuracy: 83.6%
Precision: 86%
Recall: 92%
F1-Score: 89%
AUC: 88.89%
Test data:
Accuracy: 84.21%
Precision: 86%
Recall: 90%
F1-Score: 88%
AUC: 89.05%
As we can see, there is not much difference between the performance of the regular LR model and the tuned LR model. The values are high overall and there is no over-fitting. Therefore, both models are equally good models.
Linear Discriminant Analysis - Regular:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Train]
Train data:
Accuracy: 83.41%
Precision: 86%
Recall: 91%
F1-Score: 89%
AUC: 88.94%
Test data:
Accuracy: 83.33%
Precision: 86%
Recall: 89%
F1-Score: 88%
AUC: 88.76%
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
Linear Discriminant Analysis - Tuned:
[Plots: Accuracy - Train, ROC and AUC - Train, Accuracy - Test]
Train data:
Accuracy: 83.22%
Precision: 87%
Recall: 90%
F1-Score: 88%
AUC: 88.68%
Test data:
Accuracy: 83.99%
Precision: 87%
Recall: 89%
F1-Score: 88%
AUC: 89.33%
As we can see, there is not much difference between the performance of the regular LDA model and the tuned LDA model.
KNN - Regular:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Test]
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
AUC: 100%
Test data:
Accuracy: 83.77%
Precision: 86%
Recall: 90%
F1-Score: 88%
Validity of the model:
As we can see, the train data has 100% accuracy and the test data has 84% accuracy. The difference is more than 10%. So, we can infer that the KNN model is over-fitted.
KNN - Tuned:
[Plots: Accuracy - Train, Accuracy - Test]
Observation:
Train data:
Accuracy: 84.35%
Precision: 88%
Recall: 91%
F1-Score: 89%
AUC: 90.23%
Test data:
Accuracy: 86.18%
Precision: 87%
Recall: 93%
F1-Score: 90%
AUC: 92.27%
As we can see, the regular KNN model was over-fitted. But model tuning has helped the model to recover from over-fitting.
Naive Bayes:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Test]
Observation:
Train data:
Accuracy: 83.51%
Precision: 88%
Recall: 90%
Test data:
Accuracy: 82.24%
Precision: 87%
Recall: 87%
F1-Score: 87%
AUC: 87.64%
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
There are no hyper-parameters to tune in the Naive Bayes model. So, we cannot tune this model.
Random Forest:
[Plot: Accuracy - Train]
Observation:
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
AUC: 100%
Test data:
Accuracy: 82.68%
Precision: 84%
Recall: 91%
F1-Score: 88%
AUC: 88.62%
Bagging:
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 95.38%
Precision: 95%
Recall: 98%
F1-Score: 97%
AUC: 99.4%
Test data:
Accuracy: 82.46%
Precision: 83%
Recall: 92%
F1-Score: 87%
AUC: 89.4%
After using bagging, the model is still over-fitted. The values are high, but the difference between the train and test accuracy is high.
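A sketch of the bagging step. Using a Random Forest base estimator mirrors the report's sequence, but the exact base estimator and n_estimators are assumptions (the keyword is base_estimator in scikit-learn versions before 1.2):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),  # assumed base estimator
    n_estimators=100,
    random_state=1)
bag.fit(X_train, y_train)

print(bag.score(X_train, y_train))  # train accuracy
print(bag.score(X_test, y_test))    # test accuracy
```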
[Plots: Accuracy - Train, Accuracy - Test]
Observation:
Train data:
Accuracy: 84.45%
Precision: 86%
Recall: 94%
F1-Score: 90%
AUC: 90.41%
Test data:
Accuracy: 81.36%
Precision: 82%
Recall: 92%
F1-Score: 87%
AUC: 88.58%
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 84.26%
Precision: 87%
Recall: 91%
F1-Score: 89%
AUC: 89.79%
Test data:
Accuracy: 82.02%
Precision: 86%
Recall: 87%
F1-Score: 87%
AUC: 87.81%
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 93.5%
Precision: 95%
Recall: 96%
F1-Score: 95%
AUC: 98.62%
Test data:
Accuracy: 83.11%
Precision: 86%
Recall: 89%
F1-Score: 88%
AUC: 89.99%
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 89.26%
Precision: 91%
Recall: 94%
F1-Score: 93%
AUC: 95.11%
Test data:
Accuracy: 83.33%
Precision: 85%
Recall: 91%
F1-Score: 88%
AUC: 89.87%
The values are high. There is no over-fitting of any sort. The model is a good model.
Gradient Boost - Tuned:
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 88.31%
Precision: 89%
Recall: 95%
F1-Score: 92%
AUC: 94.69%
Test data:
Accuracy: 87.28%
Precision: 88%
Recall: 94%
F1-Score: 91%
AUC: 94.97%
The tuning of the Gradient Boost model has improved the model further. The values are high. The tuned model is better than the regular model.
[Table: comparison of Accuracy, Precision, Recall, F1-score and AUC (%) for the regular and tuned versions of each model; e.g., LDA - Regular: 83.41%, 86%, 91%, 89%, 88.94%]
Comparing the AUC, ROC curve on the train data of all the tuned models:
In all the models, the tuned ones are better than the regular models. So, we compare only the tuned models and describe which model is the best/optimized.
Comparing the AUC, ROC curve on the test data of all the tuned models:
In all the models, the tuned ones are better than the regular models. So, we compare only the tuned models and describe which model is the best/optimized.
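A sketch of such a combined ROC comparison; tuned_models is a hypothetical dictionary of the fitted tuned estimators, and the positive label 'Labour' is an assumption about how the target is encoded:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical container of fitted, tuned models, e.g.
# tuned_models = {'LR': lr_tuned, 'LDA': lda_tuned, 'KNN': best_knn, ...}
plt.figure(figsize=(8, 6))
for name, est in tuned_models.items():
    probs = est.predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs, pos_label='Labour')
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```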
Conclusion:
The tuned gradient boost model performs the best, with an 88.31% accuracy score on train and an 87.28% accuracy score on test. It also has the best AUC score, about 94%, in both train and test data, which is the highest among all the models.
It also has a precision score of 88% and a recall of 94%, which are also among the highest of all the models. So, we conclude that the Gradient Boost Tuned model is the best/optimized model.
Based on these predictions, what are the insights?
Insights:
The labour party has more than double the votes of the conservative party.
Most people have given scores of 3 and 4 for the national economic condition, and the average score is 3.245221.
Most people have given scores of 3 and 4 for the household economic condition, and the average score is 3.137772.
Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.
The average score of Blair is 3.335531 and the average score of Hague is 2.749506. So, here we can see that Blair has a better score.
On a scale of 0 to 3, about 30% of the total population has zero knowledge about
politics/parties.
People who gave a low score of 1 to a certain party still decided to vote for the same party instead of voting for the other party. This can be because of a lack of political knowledge among the people.
People who have a higher Eurosceptic sentiment have voted for the conservative party; the lower the Eurosceptic sentiment, the higher the votes for the labour party.
Out of 454 people who gave a score of 0 for political knowledge, 360 people have voted for the
labour party and 94 people have voted for the conservative party.
All models performed well on training data set as well as test dataset. The tuned models have
performed better than the regular models.
There is no over-fitting in any model except Random Forest and Bagging regular models.
Gradient Boosting model tuned is the best/optimized model.
Business recommendations:
Hyper-parameter tuning is an important aspect of model building. There are limitations to this, as processing all these parameter combinations requires a huge amount of processing power. But if tuning can be done with many sets of parameters, we might get even better results.
Gathering more data will also help in training the models and thus improving the predictive
powers.
We can also create a function in which all the models predict the outcome in sequence. This will help in better understanding the probability of what the outcome will be.
Use the Gradient Boosting model without scaling for predicting the outcome, as it has the best optimized performance.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States
of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
Find the number of characters, words, and sentences for the mentioned documents.
Number of characters:
Number of words:
Number of sentences:
There are 68 sentences in President Franklin D. Roosevelt's speech.
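A sketch of these counts using the nltk inaugural corpus; the fileids follow nltk's 'YYYY-Name.txt' convention:

```python
import nltk
from nltk.corpus import inaugural

nltk.download('inaugural')
nltk.download('punkt')  # needed for sentence tokenization

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    text = inaugural.raw(fileid)
    print(fileid,
          'characters:', len(text),
          'words:', len(inaugural.words(fileid)),
          'sentences:', len(inaugural.sents(fileid)))
```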
Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stop-words)
President Franklin D. Roosevelt (1941):
nation - 11
know - 10
spirit - 9
President John F. Kennedy (1961):
let - 11
us - 10
sides - 9
President Richard Nixon (1973):
us - 26
let - 22
peace - 19
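A sketch of how these frequencies are obtained; lower-casing and keeping only alphabetic tokens are assumptions about the preprocessing:

```python
from collections import Counter

import nltk
from nltk.corpus import inaugural, stopwords

nltk.download('stopwords')

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    tokens = [w.lower() for w in inaugural.words(fileid)
              if w.isalpha() and w.lower() not in stop]
    print(fileid, Counter(tokens).most_common(3))  # top three words
```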
Plot the word cloud of each of the three speeches (after removing the stop-words).
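A sketch of the word clouds using the third-party wordcloud package; its use here, and the figure dimensions, are assumptions:

```python
import matplotlib.pyplot as plt
from nltk.corpus import inaugural, stopwords
from wordcloud import WordCloud

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    # join the stop-word-filtered tokens back into one string for the cloud
    text = ' '.join(w.lower() for w in inaugural.words(fileid)
                    if w.isalpha() and w.lower() not in stop)
    wc = WordCloud(width=800, height=400,
                   background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(fileid)
    plt.show()
```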