Advanced Statistics Project Report Final
ADVANCED STATISTICS
Project Report
PGP-DSBA
SAIRA BANU
PGP – DATA SCIENCE AND BUSINESS ANALYTICS
Table of Contents
1 Problem Statement 1 .......... 3
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually .......... 5
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results .......... 5
1.3 Perform one-way ANOVA for the variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results .......... 6
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result .......... 7
1.5 What is the interaction between the two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot .......... 8
1.6 Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result? .......... 8
1.7 Explain the business implications of performing ANOVA for this particular case study .......... 9
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA? .......... 14
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling .......... 22
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data] .......... 28
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? .......... 30
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both] .......... 32
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features .......... 33
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features] .......... 34
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate? .......... 35
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained] .......... 36
Problem Statement 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate.
Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty,
and Executive or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality
assumption may not always hold if the sample size is small.]
Data Description:
Education has three levels: Doctorate, Bachelors, and High school graduate.
Occupation has four levels: Administrative and Clerical, Sales, Professional or Specialty, and Executive or Managerial.
Salary: integer variable recording the salary of each of the 40 individuals.
Sample Dataset:
Figure 2: Sample containing null data
1.1Q) State the null and the alternate hypothesis for conducting one-
way ANOVA for both Education and Occupation individually.
Solution:
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of three or more independent
(unrelated) groups.
Formulation of hypotheses for the one-way ANOVA of Salary on Education:
H0: The mean salary is the same across all three education levels.
Ha: The mean salary differs for at least one education level.
Formulation of hypotheses for the one-way ANOVA of Salary on Occupation:
H0: The mean salary is the same across all four occupation levels.
Ha: The mean salary differs for at least one occupation level.
1.2Q) Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
To perform one-way ANOVA for Education with respect to the variable 'Salary', we fit the ANOVA model in the Jupyter notebook and generate the AOV table. We get the following output:
The p-value for Education is below 0.05, so the null hypothesis is rejected: the mean salary differs across education levels.
1.3Q) Perform one-way ANOVA for the variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
The F ratio for Occupation is 0.88, much lower than that for Education: the variance across the occupation levels is much lower than the variance within each level. Hence the null hypothesis is not rejected; we conclude that salary shows no dependence on occupation, and all four occupations have the same mean salary.
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out
which class means are significantly different. Interpret the result.
Solution:
For Salary with Education, where we found the null hypothesis to be rejected, we use the Tukey HSD test (tukeyhsd) to check which class means are significantly different.
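The Tukey HSD step can be sketched as below, again with synthetic stand-in data in place of the actual SalaryData.csv, so the table it prints is illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data with clearly separated group means (hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Education": np.repeat(["HS-grad", "Bachelors", "Doctorate"], 12),
    "Salary": np.concatenate([
        rng.normal(50_000, 5_000, 12),
        rng.normal(90_000, 8_000, 12),
        rng.normal(150_000, 10_000, 12),
    ]),
})

# Pairwise comparison of group means at the 5% level.
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())  # 'reject = True' rows mark significantly different pairs
```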
The null hypothesis is rejected in (1.2); from the above boxplot we can see that the HS-grad mean salary is significantly lower than that of the other qualifications.
Problem 1B:
1.5Q) What is the interaction between the two treatments? Analyse
the effects of one variable on the other (Education and Occupation)
with the help of an interaction plot.
Solution:
As seen from the interaction plots below, there seems to be a moderate interaction between the two categorical variables.
Adm-clerical and Sales professionals with Bachelors and Doctorate degrees earn almost similar salary packages.
From the above interaction plot we can clearly see that Salary is affected by both Education and Occupation. So, we can say that the salary earned is dependent on both Education and Occupation.
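An interaction plot of this kind can be produced with statsmodels. The sketch below uses synthetic stand-in data (the level names follow the data description; the salary values are hypothetical), so the plotted pattern will not match the report's figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.factorplots import interaction_plot

# Synthetic balanced design: 3 education levels x 4 occupations x 3 replicates.
rng = np.random.default_rng(1)
edu = np.tile(np.repeat(["HS-grad", "Bachelors", "Doctorate"], 4), 3)
occ = np.tile(["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"], 9)
salary = rng.normal(80_000, 20_000, edu.size)

# One line per occupation, mean salary per education level on the y-axis.
fig, ax = plt.subplots(figsize=(8, 5))
interaction_plot(x=edu, trace=occ, response=salary, ax=ax)
ax.set_xlabel("Education")
ax.set_ylabel("Mean Salary")
fig.savefig("interaction_plot.png")
```

Roughly parallel lines would indicate little interaction; crossing lines indicate that the effect of education on salary differs by occupation.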
1.6Q) Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
Solution:
From the above analysis it can be seen that the p-values are less than the significance level (0.05), and hence the null hypothesis is rejected in this case.
Hence, we can say that there is an interaction between Education and Occupation, which tells us that salary depends on both education and occupation.
1.7Q) Explain the business implications of performing ANOVA for this particular case study.
Solution:
We conducted hypothesis tests to check whether there is any relationship between Salary and the factors Education and Occupation.
From question 1.2 we can see that the null hypothesis is rejected, stating that the mean salary differs across education levels.
From question 1.3 we can see that the null hypothesis is not rejected, stating that the mean salary does not differ across occupations.
From question 1.6 we can see that the null hypothesis is rejected, stating that there is an interaction between Education and Occupation, which in turn affects the salary.
===================**********************===================
2 Problem Statement: 2
Data Description:
The purpose of the dataset is to study data obtained from different colleges, on which we perform exploratory data analysis to deduce inferences.
Figure 10: Problem 2- Data Description
2.1Q) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Solution:
The "Names" column can be removed from the dataset before performing the exploratory data analysis.
There is no null data in the dataset.
Duplicate Detection:
Outlier Treatment:
From the box plot, we can see that we need to treat outliers for all variables except "Top25perc", as below.
Figure 15: After Outlier Treatment
Figure 17: Univariate Analysis – Skewness of Data
Inference:
The function shows a box plot, histogram, or distplot to view the distribution, along with the numerical variable's statistical description and any outliers that may exist.
Figure 18: Univariate Analysis- Distplot and Boxplot
Bivariate Analysis:
Bivariate analysis is done using a heatmap of the correlation matrix, wherein the dependency between pairs of variables is checked.
The correlation indicates how two variables are related to one another and to what extent.
2.2 Is scaling necessary for PCA in this case? Give justification and
perform scaling.
Solution:
From the boxplot of the outlier-treated data (Figure 15: After Outlier Treatment), we can see that there are some extreme variations in the ranges of the variables, e.g., Grad.Rate and Outstate, where the difference in range makes the two hard to compare. Also, a variable with a high standard deviation will carry a higher weight in the calculation than a variable with a low standard deviation. Hence, it is necessary to scale the data to standardise the range of all the dimensions.
PCA computes new axes along the directions of maximum variance in the data, so unscaled variables with large variances would dominate the components.
From the above description and sample of the scaled dataset, we can see that the scale of each variable is standardised: the mean tends to 0 and the standard deviation tends to 1.
Also, the data is now evenly spread out, making it easier to derive deductions from it.
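The scaling step can be sketched with sklearn's StandardScaler. The two columns below are hypothetical stand-ins chosen to mimic the scale mismatch described above (Outstate in thousands, Grad.Rate a percentage); they are not the actual college data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in columns on very different scales.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Outstate": rng.normal(10_000, 4_000, 100),
    "Grad.Rate": rng.normal(65, 17, 100),
})

# z-score each column: subtract the mean, divide by the standard deviation.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().loc[["mean", "std"]])  # mean ~ 0, std ~ 1 per column
```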
2.3Q) Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
Solution:
Correlation Matrix:
Inference:
On standardised data the covariance matrix and the correlation matrix are identical, since every variable has unit variance; any difference between the two matrices before scaling comes purely from the differing scales of the variables.
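The identity between the covariance and correlation matrices on scaled data can be checked numerically; a minimal sketch with synthetic correlated data:

```python
import numpy as np

# Synthetic data with correlated columns (stand-in for the college data).
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# z-score scaling with population standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# On standardised data, covariance (ddof=0) equals correlation.
cov = np.cov(Z, rowvar=False, ddof=0)
corr = np.corrcoef(Z, rowvar=False)
print(np.allclose(cov, corr))  # True
```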
Figure 25: Heatmap of correlation of Scaled data
2.4 Check the dataset for outliers before and after scaling. What insight
do you derive here?
Solution:
Below is the dataset after outlier treatment in Figure 27: Outlier treated Dataset. As seen in Figure 27: Boxplot of Outlier treated Data, before the treatment the dataset had many outliers in all but one dimension; post treatment we have a negligible number of outliers.
Above is the dataset after outlier and scaling treatment, shown in Figure 29: After Outlier and Scaling treatment on Dataset. As seen in Figure 30: Boxplot of after Outlier and Scaling treatment on Data, scaling helped standardise the data, giving the same weight to all variables and supplying PCA with comparable axes.
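The report does not show its outlier-treatment code; a common approach, sketched below on hypothetical data, is to clip each variable to the IQR whisker limits used by the box plot.

```python
import numpy as np
import pandas as pd

def treat_outliers_iqr(series: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the whisker limits."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical 'Apps'-like column with a few extreme values appended.
rng = np.random.default_rng(11)
apps = pd.Series(np.append(rng.normal(3000, 800, 95), [25_000] * 5))
treated = treat_outliers_iqr(apps)
print(apps.max(), treated.max())  # the treated maximum is pulled down to the whisker
```

Clipping (rather than dropping rows) keeps the sample size intact, which matters for a dataset of this size.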
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print
Both].
Solution:
To perform PCA, we first run Bartlett's test of sphericity, which tests the hypothesis that the variables are uncorrelated in the population.
After performing the test, the p-value is 0, which means the null hypothesis is rejected: at least one pair of variables in the data is correlated, hence PCA is recommended.
Generally, if the MSA (measure of sampling adequacy) is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, an MSA above 0.7 is expected to provide a considerable reduction in dimension and the extraction of meaningful components.
Here the MSA value for the dataset is 0.86, which indicates that performing PCA should yield a considerable reduction in dimensionality.
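Bartlett's test of sphericity can be computed directly from the correlation matrix; a minimal sketch (this hand-rolled helper is an illustration, not the report's actual code, and the data is synthetic):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X: np.ndarray):
    """Bartlett's test of sphericity: H0 is that the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Test statistic: -(n - 1 - (2p + 5)/6) * ln|R|, chi-square with p(p-1)/2 df.
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

# Synthetic data: six columns sharing one latent factor, hence correlated.
rng = np.random.default_rng(9)
latent = rng.normal(size=(300, 1))
X = latent + 0.5 * rng.normal(size=(300, 6))

chi2, p_value = bartlett_sphericity(X)
print(round(p_value, 4))  # near 0: reject H0, variables are correlated, PCA is justified
```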
Eigenvectors:
Performing PCA on the scaled and treated data using sklearn, we find the above outputs.
Eigenvalues:
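A minimal sketch of this extraction with sklearn, on synthetic stand-in data (the report's actual 17-column scaled dataset is not reproduced here): `explained_variance_` holds the eigenvalues and `components_` the eigenvectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the scaled college data.
rng = np.random.default_rng(21)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Z = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(Z)
print("Eigenvalues:\n", pca.explained_variance_)   # sorted in decreasing order
print("Eigenvectors (one per row):\n", pca.components_)
```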
2.6 Perform PCA and export the data of the Principal Component
(Eigen vectors) into a data frame with the original features.
Solution:
After performing PCA and exporting the eigenvectors from question 2.5 into a data frame with the original features as columns, below is a sample of the resulting data frame with all the components.
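The export step can be sketched as below. The feature list is a hypothetical subset of the real column names, used only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical subset of the dataset's columns, with synthetic values.
features = ["Apps", "Accept", "Enroll", "Top10perc"]
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4)),
                  columns=features)

# Fit PCA on scaled data, then label the eigenvector matrix with the
# original feature names as columns and PC1..PCn as the row index.
pca = PCA().fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f"PC{i + 1}" for i in range(len(features))],
)
print(loadings.round(2))
```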
Figure 32: Data Frame with Eigen Vectors
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
Solution:
The linear (explicit) form of the first PC is obtained by using the eigenvector for PC1; refer to Figure 32: Data Frame with Eigen Vectors.
The linear equation of PC1:
PC1 = 0.24 * Apps + 0.21 * Accept + 0.16 * Enroll + 0.34 * Top10perc + 0.34 * Top25perc + 0.13 * F.Undergrad + 0.01 * P.Undergrad + 0.30 * Outstate + 0.25 * Room.Board + 0.09 * Books - 0.05 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.20 * perc.alumni + 0.34 * Expend + 0.25 * Grad.Rate
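An equation of this form can be generated directly from the first row of the eigenvector matrix, rounded to two decimals. The sketch below uses a hypothetical three-feature stand-in, so its coefficients differ from those above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in subset of the feature names.
features = ["Apps", "Accept", "Enroll"]
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3)),
                  columns=features)

# components_[0] is the eigenvector for PC1; format it as a linear equation.
pca = PCA().fit(StandardScaler().fit_transform(df))
pc1 = " + ".join(f"{w:.2f} * {name}"
                 for w, name in zip(pca.components_[0], features))
print("PC1 =", pc1)
```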
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
Solution:
Inference:
From the above Figure 33: Scree Plot we can see the individual variance explained by each principal component. We can observe a sudden decrease in slope from the third principal component onwards, meaning that the maximum variance is captured by the first two principal components. This point is also called the elbow (inflection) point.
Cumulative Plot:
Figure 34: Cumulative Variance
Inference:
From the scree plot (Figure 33: Scree Plot) and the cumulative eigen variance in Figure 34: Cumulative Variance, we can see that around 8 components explain about 90% of the variance. Thus, the dimensions can be reduced from 17 to 8, giving an optimal solution.
The data can now be reoriented onto new axes by projecting it onto the directions of the eigenvectors.
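The "smallest number of components reaching 90% cumulative variance" rule can be sketched as follows, on synthetic stand-in data (so the component count it prints need not be 8):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the scaled college dataset.
rng = np.random.default_rng(17)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
Z = StandardScaler().fit_transform(X)

# Cumulative share of variance explained by the first k components.
pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# argmax finds the first index where the 90% threshold is reached.
n_components = int(np.argmax(cum_var >= 0.90) + 1)
print(cum_var.round(3))
print("Components needed for 90% variance:", n_components)
```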
2.9Q) Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
Solution:
After PCA we found that the dimensions can be reduced to 8 components, whose mutual correlation is 0, as per Figure 35: PCA Transformed Correlation matrix.
This indicates that the redundant dimensions are removed.
The 8 principal components now represent almost 90% of the variance in the data.
Figure 35: PCA Transformed Correlation matrix
PC1: Shows the number of students who have to pay out-of-state tuition.
PC2: Shows that student admission depends highly on applications, acceptance, and enrolment, where these components are highly correlated.
PC3: Shows the cost of books for a student.
PC4: Indicates the Top 10 and Top 25 percentages.
PC5: Represents the percentage of faculty with Ph.D. and terminal degrees.
PC6: Represents the student/faculty ratio.
PC7: Highlights the estimated personal spending for a student and the graduation rate.
PC8: Highlights the alumni members.
Figure 36: PCA Transformed Data frame Heatmap
Final Inferences from the above PCA:
In our case study, after performing multivariate analysis we observed that many of the variables are correlated. Thus, we do not need all of these variables for analysis, but we are not sure which to drop and which to keep; hence we perform PCA, which captures the information (in the form of variance) from all these variables in new dimension variables. Based on the amount of information required, we can then select the number of new dimension variables to retain.
The new dimension variables are independent of each other, which also helps certain algorithms.
The dimensionality reduction obtained from PCA requires less computing power, i.e., faster processing for further analysis.
The dimensionality reduction also requires less storage space.
The dimensionality reduction also helps address overfitting, which mainly occurs when there are too many variables.
======================************************======================