Advanced Statistics Project Report Final
ADVANCED STATISTICS
Project Report
PGP-DSBA
SAIRA BANU
PGP – DATA SCIENCE AND BUSINESS ANALYTICS
Table of Contents
1 Problem Statement 1 .......... 3
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually .......... 5
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results .......... 5
1.3 Perform one-way ANOVA for the variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results .......... 6
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result .......... 7
1.5 What is the interaction between the two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot .......... 8
1.6 Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result? .......... 8
1.7 Explain the business implications of performing ANOVA for this particular case study .......... 9
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA? .......... 14
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling .......... 22
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data] .......... 28
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? .......... 30
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both] .......... 32
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features .......... 33
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features] .......... 34
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate? .......... 35
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained] .......... 36
Problem Statement 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate.
Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty,
and Executive or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality
assumption may not always hold if the sample size is small.]
Data Description:
Education has three levels: Doctorate, Bachelors, and High school graduate.
Occupation has four levels: Administrative and Clerical, Sales, Professional or Specialty, and Executive or Managerial.
Salary: integer variable recording the salary of each of the 40 individuals.
Sample Dataset:
Figure 2: Sample containing null data
1.1Q) State the null and the alternate hypothesis for conducting one-
way ANOVA for both Education and Occupation individually.
Solution:
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of three or more independent
(unrelated) groups.
Formulation of hypotheses for the one-way ANOVA of Salary on Education:
H0: The mean salary is the same across all three education levels.
Ha: The mean salary differs for at least one education level.
Formulation of hypotheses for the one-way ANOVA of Salary on Occupation:
H0: The mean salary is the same across all four occupation levels.
Ha: The mean salary differs for at least one occupation level.
1.2Q) Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
To perform one-way ANOVA for Education with respect to the variable 'Salary', we fit the ANOVA model in the Jupyter notebook and generate the AOV table. We get the following output:
The p-value for Education is below 0.05, so the null hypothesis is rejected: the mean salary differs across education levels.
1.3Q) Perform one-way ANOVA for the variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
The F ratio for Occupation is 0.88, much lower than that for Education: the variance across the occupation levels is much lower than the variance within each level. Hence the null hypothesis is not rejected; we conclude that salary shows no dependence on occupation, and all four occupations have the same mean salary.
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out
which class means are significantly different. Interpret the result.
Solution:
For Salary with Education, where we found the null hypothesis to be rejected, we use the Tukey HSD test (tukeyhsd) to check which class means are significantly different.
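The Tukey HSD step can be sketched as below, again with synthetic stand-in data in place of the actual SalaryData.csv, so the table it prints is illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data with clearly separated group means (hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Education": np.repeat(["HS-grad", "Bachelors", "Doctorate"], 12),
    "Salary": np.concatenate([
        rng.normal(50_000, 5_000, 12),
        rng.normal(90_000, 8_000, 12),
        rng.normal(150_000, 10_000, 12),
    ]),
})

# Pairwise comparison of group means at the 5% level.
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())  # 'reject = True' rows mark significantly different pairs
```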
The null hypothesis is rejected in (1.2); from the above boxplot we can see that the HS-grad mean salary is significantly lower than that of the other qualifications.
Problem 1B:
1.5Q) What is the interaction between the two treatments? Analyse
the effects of one variable on the other (Education and Occupation)
with the help of an interaction plot.
Solution:
As seen from the interaction plots below, there seems to be a moderate interaction between the two categorical variables.
Adm-clerical and Sales professionals with Bachelors and Doctorate degrees earn almost similar salary packages.
From the above interaction plot we can clearly see that Salary is affected by both Education and Occupation. So, we can say that the salary earned is dependent on both Education and Occupation.
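An interaction plot of this kind can be produced with statsmodels. The sketch below uses synthetic stand-in data (the level names follow the data description; the salary values are hypothetical), so the plotted pattern will not match the report's figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.factorplots import interaction_plot

# Synthetic balanced design: 3 education levels x 4 occupations x 3 replicates.
rng = np.random.default_rng(1)
edu = np.tile(np.repeat(["HS-grad", "Bachelors", "Doctorate"], 4), 3)
occ = np.tile(["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"], 9)
salary = rng.normal(80_000, 20_000, edu.size)

# One line per occupation, mean salary per education level on the y-axis.
fig, ax = plt.subplots(figsize=(8, 5))
interaction_plot(x=edu, trace=occ, response=salary, ax=ax)
ax.set_xlabel("Education")
ax.set_ylabel("Mean Salary")
fig.savefig("interaction_plot.png")
```

Roughly parallel lines would indicate little interaction; crossing lines indicate that the effect of education on salary differs by occupation.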
1.6Q) Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
Solution:
From the above analysis it can be seen that the p-values are less than the significance level (0.05), and hence the null hypothesis is rejected in this case.
Hence, we can say that there is an interaction between Education and Occupation, which tells us that salary depends on both education and occupation.
1.7Q) Explain the business implications of performing ANOVA for this particular case study.
Solution:
We conducted hypothesis tests to check whether there is any relationship between Salary and the factors Education and Occupation.
From question 1.2 we can see that the null hypothesis is rejected, stating that the mean salary differs across education levels.
From question 1.3 we can see that the null hypothesis is not rejected, stating that the mean salary does not differ across occupations.
From question 1.6 we can see that the null hypothesis is rejected, stating that there is an interaction between Education and Occupation, which in turn affects the salary.
===================**********************===================
2 Problem Statement: 2
Data Description:
The purpose of the dataset is to study data obtained from different colleges, on which we perform exploratory data analysis to deduce inferences.
Figure 10: Problem 2- Data Description
2.1Q) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Solution:
The "Names" column can be removed from the dataset before performing the exploratory data analysis.
There is no null data in the dataset.
Duplicate Detection:
Outlier Treatment:
From the box plot, we can see that we need to treat outliers for all variables except "Top25perc", as below.
Figure 15: After Outlier Treatment
Figure 17: Univariate Analysis – Skewness of Data
Inference:
The function shows a box plot, histogram, or distplot to view the distribution, along with the numerical variable's statistical description and any outliers that may exist.
Figure 18: Univariate Analysis- Distplot and Boxplot
Bivariate Analysis:
Bivariate analysis is done using a heatmap of the correlation matrix, wherein the dependency between pairs of variables is checked.
The correlation indicates how two variables are related to one another and to what extent.
2.2 Is scaling necessary for PCA in this case? Give justification and
perform scaling.
Solution:
From the boxplot of the outlier-treated data (Figure 15: After Outlier Treatment), we can see that there are some extreme variations in the ranges of the variables, e.g., Grad.Rate and Outstate, where the difference in range makes the two hard to compare. Also, a variable with a high standard deviation will carry a higher weight in the calculation than a variable with a low standard deviation. Hence, it is necessary to scale the data to standardise the range of all the dimensions.
PCA computes new axes along the directions of maximum variance in the data, so unscaled variables with large variances would dominate the components.
From the above description and sample of the scaled dataset, we can see that the scale of each variable is standardised: the mean tends to 0 and the standard deviation tends to 1.
Also, the data is now evenly spread out, making it easier to derive deductions from it.
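The scaling step can be sketched with sklearn's StandardScaler. The two columns below are hypothetical stand-ins chosen to mimic the scale mismatch described above (Outstate in thousands, Grad.Rate a percentage); they are not the actual college data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in columns on very different scales.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Outstate": rng.normal(10_000, 4_000, 100),
    "Grad.Rate": rng.normal(65, 17, 100),
})

# z-score each column: subtract the mean, divide by the standard deviation.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().loc[["mean", "std"]])  # mean ~ 0, std ~ 1 per column
```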
2.3Q) Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
Solution:
Correlation Matrix:
Inference:
On standardised data the covariance matrix and the correlation matrix are identical, since every variable has unit variance; any difference between the two matrices before scaling comes purely from the differing scales of the variables.
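The identity between the covariance and correlation matrices on scaled data can be checked numerically; a minimal sketch with synthetic correlated data:

```python
import numpy as np

# Synthetic data with correlated columns (stand-in for the college data).
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# z-score scaling with population standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# On standardised data, covariance (ddof=0) equals correlation.
cov = np.cov(Z, rowvar=False, ddof=0)
corr = np.corrcoef(Z, rowvar=False)
print(np.allclose(cov, corr))  # True
```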
Figure 25: Heatmap of correlation of Scaled data
2.4 Check the dataset for outliers before and after scaling. What insight
do you derive here?
Solution:
Below is the dataset after outlier treatment in Figure 27: Outlier treated Dataset. As seen in Figure 27: Boxplot of Outlier treated Data, before the treatment the dataset had many outliers in all but one dimension; post treatment we have a negligible number of outliers.
Above is the dataset after outlier and scaling treatment, shown in Figure 29: After Outlier and Scaling treatment on Dataset. As seen in Figure 30: Boxplot of after Outlier and Scaling treatment on Data, scaling helped standardise the data, giving the same weight to all variables and supplying PCA with comparable axes.
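The report does not show its outlier-treatment code; a common approach, sketched below on hypothetical data, is to clip each variable to the IQR whisker limits used by the box plot.

```python
import numpy as np
import pandas as pd

def treat_outliers_iqr(series: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the whisker limits."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical 'Apps'-like column with a few extreme values appended.
rng = np.random.default_rng(11)
apps = pd.Series(np.append(rng.normal(3000, 800, 95), [25_000] * 5))
treated = treat_outliers_iqr(apps)
print(apps.max(), treated.max())  # the treated maximum is pulled down to the whisker
```

Clipping (rather than dropping rows) keeps the sample size intact, which matters for a dataset of this size.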
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print
Both].
Solution:
To perform PCA, we first run Bartlett's test of sphericity, which tests the hypothesis that the variables are uncorrelated in the population.
After performing the test, the p-value is 0, which means the null hypothesis is rejected: at least one pair of variables in the data is correlated, hence PCA is recommended.
Generally, if the MSA (measure of sampling adequacy) is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, an MSA above 0.7 is expected to provide a considerable reduction in dimension and the extraction of meaningful components.
Here the MSA value for the dataset is 0.86, which indicates that performing PCA should yield a considerable reduction in dimensionality.
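Bartlett's test of sphericity can be computed directly from the correlation matrix; a minimal sketch (this hand-rolled helper is an illustration, not the report's actual code, and the data is synthetic):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X: np.ndarray):
    """Bartlett's test of sphericity: H0 is that the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Test statistic: -(n - 1 - (2p + 5)/6) * ln|R|, chi-square with p(p-1)/2 df.
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

# Synthetic data: six columns sharing one latent factor, hence correlated.
rng = np.random.default_rng(9)
latent = rng.normal(size=(300, 1))
X = latent + 0.5 * rng.normal(size=(300, 6))

chi2, p_value = bartlett_sphericity(X)
print(round(p_value, 4))  # near 0: reject H0, variables are correlated, PCA is justified
```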
Eigenvectors:
Performing PCA on the scaled and treated data using sklearn, we find the above outputs.
Eigenvalues:
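A minimal sketch of this extraction with sklearn, on synthetic stand-in data (the report's actual 17-column scaled dataset is not reproduced here): `explained_variance_` holds the eigenvalues and `components_` the eigenvectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the scaled college data.
rng = np.random.default_rng(21)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Z = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(Z)
print("Eigenvalues:\n", pca.explained_variance_)   # sorted in decreasing order
print("Eigenvectors (one per row):\n", pca.components_)
```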
2.6 Perform PCA and export the data of the Principal Component
(Eigen vectors) into a data frame with the original features.
Solution:
After performing PCA and exporting the eigenvectors from question 2.5 into a data frame with the original features as columns, below is a sample of the resulting data frame with all the components.
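The export step can be sketched as below. The feature list is a hypothetical subset of the real column names, used only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical subset of the dataset's columns, with synthetic values.
features = ["Apps", "Accept", "Enroll", "Top10perc"]
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4)),
                  columns=features)

# Fit PCA on scaled data, then label the eigenvector matrix with the
# original feature names as columns and PC1..PCn as the row index.
pca = PCA().fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f"PC{i + 1}" for i in range(len(features))],
)
print(loadings.round(2))
```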
Figure 32: Data Frame with Eigen Vectors
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
Solution:
The linear (explicit) form of the first PC is obtained by using the eigenvector for PC1; refer to Figure 32: Data Frame with Eigen Vectors.
The linear equation of PC1:
PC1 = 0.24 * Apps + 0.21 * Accept + 0.16 * Enroll + 0.34 * Top10perc + 0.34 * Top25perc + 0.13 * F.Undergrad + 0.01 * P.Undergrad + 0.30 * Outstate + 0.25 * Room.Board + 0.09 * Books - 0.05 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.20 * perc.alumni + 0.34 * Expend + 0.25 * Grad.Rate
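An equation of this form can be generated directly from the first row of the eigenvector matrix, rounded to two decimals. The sketch below uses a hypothetical three-feature stand-in, so its coefficients differ from those above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in subset of the feature names.
features = ["Apps", "Accept", "Enroll"]
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3)),
                  columns=features)

# components_[0] is the eigenvector for PC1; format it as a linear equation.
pca = PCA().fit(StandardScaler().fit_transform(df))
pc1 = " + ".join(f"{w:.2f} * {name}"
                 for w, name in zip(pca.components_[0], features))
print("PC1 =", pc1)
```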
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
Solution:
Inference:
From the above Figure 33: Scree Plot we can see the individual variance explained by each principal component. We can observe a sudden decrease in slope from the third principal component onwards, meaning that the maximum variance is captured by the first two principal components. This point is also called the elbow (inflection) point.
Cumulative Plot:
Figure 34: Cumulative Variance
Inference:
From the scree plot (Figure 33: Scree Plot) and the cumulative eigen variance in Figure 34: Cumulative Variance, we can see that around 8 components explain about 90% of the variance. Thus, the dimensions can be reduced from 17 to 8, giving an optimal solution.
The data can now be reoriented onto new axes by projecting it onto the directions of the eigenvectors.
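The "smallest number of components reaching 90% cumulative variance" rule can be sketched as follows, on synthetic stand-in data (so the component count it prints need not be 8):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the scaled college dataset.
rng = np.random.default_rng(17)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
Z = StandardScaler().fit_transform(X)

# Cumulative share of variance explained by the first k components.
pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# argmax finds the first index where the 90% threshold is reached.
n_components = int(np.argmax(cum_var >= 0.90) + 1)
print(cum_var.round(3))
print("Components needed for 90% variance:", n_components)
```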
2.9Q) Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
Solution:
After PCA we found that the dimensions can be reduced to 8 components, whose mutual correlation is 0, as per Figure 35: PCA Transformed Correlation matrix.
This indicates that the redundant dimensions are removed.
The 8 principal components now represent almost 90% of the variance in the data.
Figure 35: PCA Transformed Correlation matrix
PC1: Shows the number of students who have to pay out-of-state tuition.
PC2: Shows that student admission depends highly on applications, acceptance, and enrolment, where these components are highly correlated.
PC3: Shows the cost of books for a student.
PC4: Indicates the Top 10 and Top 25 percentages.
PC5: Represents the percentage of faculty with Ph.D. and terminal degrees.
PC6: Represents the student/faculty ratio.
PC7: Highlights the estimated personal spending for a student and the graduation rate.
PC8: Highlights the alumni members.
Figure 36: PCA Transformed Data frame Heatmap
Final Inferences from the above PCA:
In our case study, after performing multivariate analysis we observed that many of the variables are correlated. Thus, we do not need all of these variables for analysis, but we are not sure which to drop and which to keep; hence we perform PCA, which captures the information (in the form of variance) from all these variables in new dimension variables. Based on the amount of information required, we can then select the number of new dimension variables to retain.
The new dimension variables are independent of each other, which also helps certain algorithms.
The dimensionality reduction obtained from PCA requires less computing power, i.e., faster processing for further analysis.
The dimensionality reduction also requires less storage space.
The dimensionality reduction also helps address overfitting, which mainly occurs when there are too many variables.
======================************************======================