ML - Business Report - Priyanka Sharma
Priyanka Sharma
Nov’22 Batch
You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, to create an exit poll that will help in predicting the overall win and seats covered by a particular party.
Read the dataset. Do the descriptive statistics and the null value condition check. Write an inference on it.
Data Information:
Observation:
We have dropped the 'unnamed' column from the dataset as it is not useful for our study. The data set had 1525 rows and 9 columns. After dropping the duplicate values, there are 1517 rows and 9 columns.
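For reference, a minimal sketch of this loading and cleaning step. The file name 'Election_Data.xlsx' and the column label 'Unnamed: 0' are assumptions, not stated in the report:

```python
import pandas as pd

# Hypothetical file name; the report does not state the actual path.
df = pd.read_excel('Election_Data.xlsx')

# Drop the serial-number artifact column, as it is not useful for the study.
df = df.drop(columns=['Unnamed: 0'])

print(df.shape)            # (1525, 9)
df = df.drop_duplicates()  # 8 duplicate rows removed
print(df.shape)            # (1517, 9)

print(df.describe(include='all'))  # descriptive statistics
print(df.isnull().sum())           # null-value condition check
```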
Data description:
Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.
Data types:
There are 7 numerical and 2 categorical data types in the data.
Univariate Analysis: Description
The labour party has a higher number of votes; it has more than double the votes of the conservative party.
Mean of 'economic.cond.national':
Observation:
The count of score 3 is slightly higher than that of the 2nd highest score, 4, whose count is 538.
Count plot of 'economic.cond.household':
Mean of 'economic.cond.household':
Observation:
The count of score 3 is moderately higher than that of the 2nd highest score, 4, whose count is 435.
Count plot of 'Blair':
Observation:
The count of score 4 is much higher than that of the 2nd highest score, 2, whose count is 434.
Count plot of 'Hague':
Observation:
Mean of 'Europe':
Observation:
We can clearly see that the labour party has got more votes than the conservative party.
In every age group, the labour party has got more votes than the conservative party.
Female votes are considerably higher than the male votes in both parties.
Out of 82 people who gave a score of 5, 73 people have voted for the labour party.
Out of 538 people who gave a score of 4, 447 people have voted for the labour party. This is
the highest set of people in the labour party.
Out of 604 people who gave a score of 3, 405 people have voted for the labour party. This is the 2nd highest set of people in the labour party. The remaining 199 people have voted for the conservative party.
Out of 256 people who gave a score of 2, 116 people have voted for the labour party. 140 people have voted for the conservative party. This is the instance where the conservative party has got more votes than the labour party.
Out of 37 people who gave a score of 1, 16 people have voted for the labour party. 21 people have voted for the conservative party.
Observation:
Out of 92 people who gave a score of 5, 69 people have voted for the labour party.
Out of 435 people who gave a score of 4, 349 people have voted for the labour party. This is
the 2nd highest set of people in the labour party.
Out of 645 people who gave a score of 3, 448 people have voted for the labour party. This is the highest set of people in the labour party. The remaining 197 people have voted for the conservative party.
Out of 280 people who gave a score of 2, 154 people have voted for the labour party. 126
people have voted for the conservative party.
Out of 65 people who gave a score of 1, 37 people have voted for the labour party. 28 people have voted for the conservative party.
In all these instances, the labour party has more votes than the conservative party.
Out of 152 people who gave a score of 5, 149 people have voted for the labour party. The remaining 3 people, despite giving a score of 5 to the labour leader, have chosen to vote for the conservative party.
Out of 833 people who gave a score of 4, 676 people have voted for the labour party. The remaining 157 people, despite giving a score of 4 to the labour leader, have chosen to vote for the conservative party.
Only 1 person has given a score of 3, and that person has voted for the conservative party.
Out of 434 people who gave a score of 2, 240 people have voted for the conservative party. The remaining 194 people, despite giving an unsatisfactory score of 2 to the labour leader, have chosen to vote for the labour party.
Out of 97 people who gave a score of 1, 59 people have voted for the conservative party. The remaining 38 people, despite giving the lowest score of 1 to the labour leader, have chosen to vote for the labour party.
Observation:
Out of 73 people who gave a score of 5, 59 people have voted for the conservative party.
Out of 557 people who gave a score of 4, 286 people have voted for the conservative party. The remaining 271 people, despite giving a score of 4 to the conservative leader, have chosen to vote for the labour party.
Out of 37 people who gave a score of 3, 28 have voted for the labour party; the remaining 9, who gave an average score of 3 to the conservative leader, have voted for the conservative party.
Out of 617 people who gave a score of 2, 522 people have voted for the labour party. The remaining 95 people, despite giving an unsatisfactory score of 2 to the conservative leader, have chosen to vote for the conservative party.
Out of 233 people who gave a score of 1, 222 people have voted for the labour party.
The scores of 4 and 5 have more votes in the conservative party, although at score 4 the votes are almost equal in both parties, with the conservative party slightly ahead.
The scores of 1, 2 and 3 have more votes in the labour party. Still, a significant percentage of people who gave a bad score to the conservative leader chose to vote for 'Hague'.
Viewing the exact values of the variables of 'vote' with respect to 'Europe':
Observation:
Out of 338 people who gave a score of 11, 166 people have voted for the labour party and
172 people have voted for the conservative party.
People who gave scores of 7 to 10 have voted for labour and conservative almost equally; the conservative party seems to be slightly ahead in these instances.
Out of 207 people who gave a score of 6, 172 people have voted for the labour party and 35 people have voted for the conservative party.
People who gave a score of 1 to 6 have predominantly voted for the labour party. As we can see, there are a total of 770 people who have given scores from 1 to 6. Out of these 770 people, 672 have voted for the labour party, i.e., 87.28% of them have chosen the labour party.
So, we can infer that the lower the 'Eurosceptic' sentiment, the higher the votes for the labour party.
Checking pair-wise distribution of the continuous variables:
Observation:
From the distribution plots, we can see that some of the variables are skewed.
From the scatter plots, we can see that there is mostly no correlation between the variables.
A correlation matrix is a table which shows the correlation coefficient between variables. Correlation values range from -1 to +1. Values closer to zero mean that there is no linear trend between the two variables; values close to +1 mean that the correlation is positive, and values close to -1 mean that it is negative. The correlation heat map helps us to visualize the correlation between two variables.
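A minimal sketch of how such a heat map is usually drawn, assuming df is the cleaned DataFrame from the loading step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)  # pairwise correlation coefficients
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heat map')
plt.show()
```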
Observation:
Through this matrix, we can see that there is mostly no correlation in the dataset. There are some variables that are moderately positively correlated and some that are slightly negatively correlated.
Train-test-split:
Our model will use all the variables, and 'vote' is the target variable. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets.
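A sketch of the split, assuming the categorical columns have already been encoded; the 70:30 ratio and the random_state value are assumptions:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['vote'])  # all predictors
y = df['vote']                 # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
```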
Why scaling?:
The dataset contains features highly varying in magnitudes, units and range, e.g., between the 'age' column and the other columns.
Since most machine learning algorithms use Euclidean distance between two data points in their computations, this is a problem.
If left alone, these algorithms take in only the magnitude of features, neglecting the units. The results would vary greatly between different units, e.g., 1 km and 1000 metres.
The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.
To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
In this case, we have a lot of encoded, ordinal, categorical and continuous variables. So, we use the MinMaxScaler technique to scale the data.
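A sketch of the scaling step. Fitting on the training data only (to avoid leakage) is standard practice, though the report does not state it explicitly:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                         # rescales every feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on train only
X_test_scaled = scaler.transform(X_test)        # apply the same ranges to test
```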
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because
the error margin is low and the error in both train and test data is not too high. Thus, the
model is not over-fitted or under-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable. We take K value as 7.
Validity of the model:
As we can see, the train data has 100% accuracy and the test data has 84% accuracy. The difference is more than 10%. So, we can infer that the KNN model is over-fitted.
There are no outliers present in the continuous variable 'age'. The remaining variables are
categorical in nature. Our model will use all the variables and 'vote_Labour' is the target
variable.
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
Train data:
Accuracy: 83.6%
Precision: 86%
Recall: 92%
F1-Score: 89%
Test data:
Accuracy: 84.21%
Precision: 86%
Recall: 90%
F1-Score: 88%
Comparison of the performance (%) of the regular and tuned LR models:

Train:      Regular  Tuned
Precision      86      86
Recall         92      92
F1-score       89      89

Test:       Regular  Tuned
Precision      86      86
Recall         89      90
F1-score       87      88
As we can see from the above tabular comparison, there is not much difference between the performance of the regular LR model and the tuned LR model. The values are high overall and there is no over-fitting or under-fitting. Therefore, both models are equally good models.
Linear Discriminant Analysis Model Tuning: Best parameters:
Train data:
Accuracy: 83.22%
Precision: 87%
Recall: 90%
F1-Score: 88%
Test data:
Accuracy: 83.99%
Precision: 87%
Recall: 89%
F1-Score: 88%
Train:      Regular  Tuned
Precision      86      87
Recall         91      90
F1-score       89      88

Test:       Regular  Tuned
Precision      86      87
Recall         89      89
F1-score       88      88
As we can see from the above tabular comparison, there is not much difference between the performance of the regular LDA model and the tuned LDA model. The values are high overall and there is no over-fitting or under-fitting. Therefore, both models are equally good models.
Train data:
Accuracy: 84.35%
Precision: 88%
Recall: 91%
F1-Score: 89%
Test data:
Accuracy: 86.18%
Precision: 87%
Recall: 93%
F1-Score: 90%
Comparison on performance of both regular and tuned KNN models:
Train:      Regular  Tuned
Precision     100      88
Recall        100      91
F1-score      100      89

Test:       Regular  Tuned
Precision      86      87
Recall         90      93
F1-score       88      90
As we can see, the regular KNN model was over-fitted. But model tuning has helped the model to recover from over-fitting.
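A hedged sketch of how such tuning is typically done with GridSearchCV; the parameter grid below is illustrative, not the exact grid used in the report:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': list(range(3, 21, 2)),  # odd k values avoid voting ties
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)
best_knn = grid.best_estimator_
print(best_knn.score(X_train_scaled, y_train),  # train accuracy
      best_knn.score(X_test_scaled, y_test))    # test accuracy
```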
0 = age
1 = economic.cond.national
2 = economic.cond.household
3 = Blair
4 = Hague
5 = Europe
6 = political.knowledge
7 = gender_male
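This index-to-feature legend typically accompanies a feature-importance plot. A sketch of how it could be produced with a tree ensemble; the choice of Random Forest here is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

features = ['age', 'economic.cond.national', 'economic.cond.household',
            'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender_male']
plt.barh(range(len(features)), rf.feature_importances_)
plt.yticks(range(len(features)), features)
plt.xlabel('Feature importance')
plt.show()
```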
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
Test data:
Accuracy: 82.68%
Precision: 84%
Recall: 91%
F1-Score: 88%
The model is over-fitted. We will use bagging to improve the performance of the model.
The model is not over-fitted. The values are good. Therefore, the model is a good model.
The model is not over-fitted. The values are better than those of the AdaBoosting model. The model is a good model.
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
There is no over-fitting. There is improvement from the regular model. The model is a good model.
The gradient boost classifier, after tuning, has improved the model significantly.
The difference between the train and test accuracies has also been reduced.
Logistic Regression - Regular:
Train data:
Accuracy: 83.41%
Precision: 86%
Recall: 92%
F1-Score: 89%
AUC: 88.98%
Test data:
Accuracy: 82.68%
Precision: 86%
Recall: 89%
F1-Score: 87%
AUC: 88.4%
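For reference, a sketch of how the Accuracy / Precision / Recall / F1 / AUC figures in these sections are typically computed, shown here for a logistic regression; the default solver settings are an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score)

lr = LogisticRegression().fit(X_train_scaled, y_train)

for X_, y_ in [(X_train_scaled, y_train), (X_test_scaled, y_test)]:
    y_pred = lr.predict(X_)
    y_prob = lr.predict_proba(X_)[:, 1]       # probability of the positive class
    print(accuracy_score(y_, y_pred))         # Accuracy
    print(classification_report(y_, y_pred))  # Precision, Recall, F1-Score
    print(roc_auc_score(y_, y_prob))          # AUC
```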
Logistic Regression - Tuned:
Train data:
Accuracy: 83.6%
Precision: 86%
Recall: 92%
F1-Score: 89%
AUC: 88.89%
Test data:
Accuracy: 84.21%
Precision: 86%
Recall: 90%
F1-Score: 88%
AUC: 89.05%
As we can see, there is not much difference between the performance of the regular LR model and the tuned LR model. The values are high overall and there is no over-fitting. Therefore, both models are equally good models.
Linear Discriminant Analysis - Regular:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Train]
Train data:
Accuracy: 83.41%
Precision: 86%
Recall: 91%
F1-Score: 89%
AUC: 88.94%
Test data:
Accuracy: 83.33%
Precision: 86%
Recall: 89%
F1-Score: 88%
AUC: 88.76%
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
Linear Discriminant Analysis - Tuned:
[Plots: Accuracy - Train, ROC and AUC - Train, Accuracy - Test]
Train data:
Accuracy: 83.22%
Precision: 87%
Recall: 90%
F1-Score: 88%
AUC: 88.68%
Test data:
Accuracy: 83.99%
Precision: 87%
Recall: 89%
F1-Score: 88%
AUC: 89.33%
As we can see, there is not much difference between the performance of the regular LDA model and the tuned LDA model.
KNN - Regular:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Test]
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
AUC: 100%
Test data:
Accuracy: 83.77%
Precision: 86%
Recall: 90%
F1-Score: 88%
Validity of the model:
As we can see, the train data has 100% accuracy and the test data has 84% accuracy. The difference is more than 10%. So, we can infer that the KNN model is over-fitted.
KNN - Tuned:
[Plots: Accuracy - Train, Accuracy - Test]
Observation:
Train data:
Accuracy: 84.35%
Precision: 88%
Recall: 91%
F1-Score: 89%
AUC: 90.23%
Test data:
Accuracy: 86.18%
Precision: 87%
Recall: 93%
F1-Score: 90%
AUC: 92.27%
As we can see, the regular KNN model was over-fitted. But model tuning has helped the model to recover from over-fitting.
Naive Bayes:
[Plots: Accuracy - Train, Accuracy - Test, ROC and AUC - Test]
Observation:
Train data:
Accuracy: 83.51%
Precision: 88%
Recall: 90%
Test data:
Accuracy: 82.24%
Precision: 87%
Recall: 87%
F1-Score: 87%
AUC: 87.64%
The error in the test data is slightly higher than the train data, which is absolutely fine because the error margin is low and the error in both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
There are no hyper-parameters to tune in the Naive Bayes model. So, we cannot tune this model.
Random Forest:
[Plot: Accuracy - Train]
Observation:
Train data:
Accuracy: 100%
Precision: 100%
Recall: 100%
F1-Score: 100%
AUC: 100%
Test data:
Accuracy: 82.68%
Precision: 84%
Recall: 91%
F1-Score: 88%
AUC: 88.62%
Bagging:
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 95.38%
Precision: 95%
Recall: 98%
F1-Score: 97%
AUC: 99.4%
Test data:
Accuracy: 82.46%
Precision: 83%
Recall: 92%
F1-Score: 87%
AUC: 89.4%
After using bagging, the model is still over-fitted. The values are high, but the difference between the train and test accuracy is high.
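A sketch of the bagging step. Using a Random Forest base estimator mirrors the report's sequence, but the exact base estimator and n_estimators are assumptions (the keyword is base_estimator in scikit-learn versions before 1.2):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),  # assumed base estimator
    n_estimators=100,
    random_state=1)
bag.fit(X_train, y_train)

print(bag.score(X_train, y_train))  # train accuracy
print(bag.score(X_test, y_test))    # test accuracy
```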
[Plots: Accuracy - Train, Accuracy - Test]
Observation:
Train data:
Accuracy: 84.45%
Precision: 86%
Recall: 94%
F1-Score: 90%
AUC: 90.41%
Test data:
Accuracy: 81.36%
Precision: 82%
Recall: 92%
F1-Score: 87%
AUC: 88.58%
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 84.26%
Precision: 87%
Recall: 91%
F1-Score: 89%
AUC: 89.79%
Test data:
Accuracy: 82.02%
Precision: 86%
Recall: 87%
F1-Score: 87%
AUC: 87.81%
The tuning of the model has helped the model recover from over-fitting. Now the model is a good model.
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 93.5%
Precision: 95%
Recall: 96%
F1-Score: 95%
AUC: 98.62%
Test data:
Accuracy: 83.11%
Precision: 86%
Recall: 89%
F1-Score: 88%
AUC: 89.99%
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 89.26%
Precision: 91%
Recall: 94%
F1-Score: 93%
AUC: 95.11%
Test data:
Accuracy: 83.33%
Precision: 85%
Recall: 91%
F1-Score: 88%
AUC: 89.87%
The values are high. There is no over-fitting of any sort. The model is a good model.
Gradient Boost - Tuned:
[Plot: Accuracy - Test]
Observation:
Train data:
Accuracy: 88.31%
Precision: 89%
Recall: 95%
F1-Score: 92%
AUC: 94.69%
Test data:
Accuracy: 87.28%
Precision: 88%
Recall: 94%
F1-Score: 91%
AUC: 94.97%
The tuning of the Gradient Boost model has improved the model further. The values are high. The tuned model is better than the regular model.
[Table: comparison of Accuracy, Precision, Recall, F1-score and AUC (%) for the regular and tuned versions of each model; e.g., LDA - Regular: 83.41%, 86%, 91%, 89%, 88.94%]
Comparing the AUC, ROC curve on the train data of all the tuned models:
In all the models, the tuned ones are better than the regular models. So, we compare only the tuned models and describe which model is the best/optimized.
Comparing the AUC, ROC curve on the test data of all the tuned models:
In all the models, the tuned ones are better than the regular models. So, we compare only the tuned models and describe which model is the best/optimized.
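A sketch of such a combined ROC comparison; tuned_models is a hypothetical dictionary of the fitted tuned estimators, and the positive label 'Labour' is an assumption about how the target is encoded:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical container of fitted, tuned models, e.g.
# tuned_models = {'LR': lr_tuned, 'LDA': lda_tuned, 'KNN': best_knn, ...}
plt.figure(figsize=(8, 6))
for name, est in tuned_models.items():
    probs = est.predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs, pos_label='Labour')
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```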
Conclusion:
The tuned gradient boost model performs the best, with an 88.31% accuracy score on train and an 87.28% accuracy score on test. It also has the best AUC score, about 94%, in both train and test data, which is the highest among all the models.
It also has a precision score of 88% and a recall of 94%, which are also among the highest of all the models. So, we conclude that the Gradient Boost Tuned model is the best/optimized model.
Based on these predictions, what are the insights?
Insights:
The labour party has more than double the votes of the conservative party.
Most people have given scores of 3 and 4 for the national economic condition, and the average score is 3.245221.
Most people have given scores of 3 and 4 for the household economic condition, and the average score is 3.137772.
Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.
The average score of Blair is 3.335531 and the average score of Hague is 2.749506. So, here we can see that Blair has a better score.
On a scale of 0 to 3, about 30% of the total population has zero knowledge about
politics/parties.
People who gave a low score of 1 to a certain party still decided to vote for the same party instead of voting for the other party. This can be because of a lack of political knowledge among the people.
People who have a higher Eurosceptic sentiment have voted for the conservative party; the lower the Eurosceptic sentiment, the higher the votes for the labour party.
Out of 454 people who gave a score of 0 for political knowledge, 360 people have voted for the
labour party and 94 people have voted for the conservative party.
All models performed well on training data set as well as test dataset. The tuned models have
performed better than the regular models.
There is no over-fitting in any model except Random Forest and Bagging regular models.
Gradient Boosting model tuned is the best/optimized model.
Business recommendations:
Hyper-parameter tuning is an important aspect of model building. There are limitations to this, as processing all these parameter combinations requires a huge amount of processing power. But if tuning can be done with many sets of parameters, we might get even better results.
Gathering more data will also help in training the models and thus improving the predictive
powers.
We can also create a function in which all the models predict the outcome in sequence. This will help in better understanding the probability of what the outcome will be.
Use the Gradient Boosting model without scaling for predicting the outcome, as it has the best optimized performance.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States
of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
Find the number of characters, words, and sentences for the mentioned documents.
Number of characters:
Number of words:
Number of sentences:
There are 68 sentences in President Franklin D. Roosevelt's speech.
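A sketch of these counts using the nltk inaugural corpus; the fileids follow nltk's 'YYYY-Name.txt' convention:

```python
import nltk
from nltk.corpus import inaugural

nltk.download('inaugural')
nltk.download('punkt')  # needed for sentence tokenization

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    text = inaugural.raw(fileid)
    print(fileid,
          'characters:', len(text),
          'words:', len(inaugural.words(fileid)),
          'sentences:', len(inaugural.sents(fileid)))
```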
Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stop-words)
President Franklin D. Roosevelt (1941):
nation - 11
know - 10
spirit - 9
President John F. Kennedy (1961):
let - 11
us - 10
sides - 9
President Richard Nixon (1973):
us - 26
let - 22
peace - 19
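A sketch of how these frequencies are obtained; lower-casing and keeping only alphabetic tokens are assumptions about the preprocessing:

```python
from collections import Counter

import nltk
from nltk.corpus import inaugural, stopwords

nltk.download('stopwords')

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    tokens = [w.lower() for w in inaugural.words(fileid)
              if w.isalpha() and w.lower() not in stop]
    print(fileid, Counter(tokens).most_common(3))  # top three words
```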
Plot the word cloud of each of the three speeches (after removing the stop-words).
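A sketch of the word clouds using the third-party wordcloud package; its use here, and the figure dimensions, are assumptions:

```python
import matplotlib.pyplot as plt
from nltk.corpus import inaugural, stopwords
from wordcloud import WordCloud

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    # join the stop-word-filtered tokens back into one string for the cloud
    text = ' '.join(w.lower() for w in inaugural.words(fileid)
                    if w.isalpha() and w.lower() not in stop)
    wc = WordCloud(width=800, height=400,
                   background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(fileid)
    plt.show()
```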