
8/28/2022

ADVANCED STATISTICS
Project Report
PGP-DSBA

SAIRA BANU
PGP – DATA SCIENCE AND BUSINESS ANALYTICS
Table of Contents

1 Problem Statement 1
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually
1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result
1.5 What is the interaction between the two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot
1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study
2 Problem Statement 2
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors [Using Sklearn PCA Print Both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only) [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

List of Figures

Figure 1: Problem Statement 1: Sample Dataset
Figure 2: Sample containing null data
Figure 3: One way ANOVA: Salary and Education
Figure 4: One way ANOVA: Salary and Occupation
Figure 5: Multiple comparison of Salary and Education
Figure 6: Boxplot salary vs Education
Figure 7: Point Plot interaction between education and occupation
Figure 8: Two Way ANOVA
Figure 9: Problem Statement 2: Data Info
Figure 10: Problem 2 – Data Description
Figure 11: Problem 2 – Sample Dataset
Figure 12: Problem 2 – Null Data
Figure 13: Problem 2 – Duplicate Data
Figure 14: Before Outlier Treatment Boxplot
Figure 15: After Outlier Treatment
Figure 16: Univariate Analysis – Data Description
Figure 17: Univariate Analysis – Skewness of Data
Figure 18: Univariate Analysis – Distplot and Boxplot
Figure 19: Correlation Heatmap
Figure 20: Data Description – After removing outliers
Figure 21: Pair Plot
Figure 22: Scaled data Sample
Figure 23: Scaled data Description
Figure 24: Correlation Matrix
Figure 25: Heatmap of correlation of Scaled data
Figure 26: Covariance Matrix
Figure 27: Outlier treated Dataset
Figure 28: Boxplot of Outlier treated Data
Figure 29: After Outlier and Scaling treatment on Dataset
Figure 30: Boxplot of after Outlier and Scaling treatment on Data
Figure 31: Eigen Vectors Matrix
Figure 32: Data Frame with Eigen Vectors
Figure 33: Scree Plot
Figure 34: Cumulative Variance
Figure 35: PCA Transformed Correlation matrix
Figure 36: PCA Transformed Data frame Heatmap
Figure 37: Transformed Dataset after performing PCA

Problem Statement 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate.
Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty,
and Executive or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality
assumption may not always hold if the sample size is small.]

Data Description:
 Education has three levels: Doctorate, Bachelors and high school graduate.
 Occupation has four levels: Administrative and Clerical, Sales, Professional or
Specialty, and Executive or Managerial.
 Salary: Integer data type showing salary of 40 individuals.
Sample Dataset:

Figure 1: Problem Statement 1: Sample Dataset


The dataset contains the salaries of 40 individuals along with each person’s education qualification and occupation. Whether salary depends on these two components is to be determined through hypothesis testing, and the conclusions are derived below.

Figure 2: Sample containing null data

1.1Q) State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.

Solution:

The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of three or more independent
(unrelated) groups.

Formulation of hypotheses for conducting one-way ANOVA for education qualification with respect to salary:

H0: The mean salary is the same across all three education levels (salary does not depend on education).

H1: The mean salary differs for at least one education level (salary depends on education).

Formulation of hypotheses for conducting one-way ANOVA for occupation with respect to salary:

H0: The mean salary is the same across all four occupation levels (salary does not depend on occupation).

H1: The mean salary differs for at least one occupation level (salary depends on occupation).

Significance level (α) = 0.05

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
The hypothesis for conducting one way ANOVA for Education:
H0: 3 Education Levels have the same mean salary.
H1: At least 1 Education level has different mean salary.
Significance level (α) = 0.05

To perform one-way ANOVA for Education with respect to the variable ‘Salary’, we fit the ANOVA model in the Jupyter notebook and generate the ANOVA table. We get the following output:

Figure 3: One way ANOVA: Salary and Education


The p-value (1.257709e-08) is less than 0.05, hence we reject the null hypothesis.
Therefore, we infer that the mean salary is different in at least one category of Education.
The F ratio is 30.96, meaning the variance between education levels is about 31 times the variance within each category, i.e., the between-group variance is much larger than the within-group variance. Hence, we conclude that education qualification affects the mean salary and at least one education level gives a different salary.
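A minimal sketch of how this one-way ANOVA table can be produced with statsmodels is given below. It assumes the data is loaded into a pandas DataFrame df with columns named Salary, Education and Occupation (as described for SalaryData.csv); it is an illustration, not the exact notebook code.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assumed file and column names, matching the data description above
df = pd.read_csv("SalaryData.csv")

# One-way ANOVA: Salary ~ Education (question 1.2)
model_edu = ols("Salary ~ C(Education)", data=df).fit()
print(sm.stats.anova_lm(model_edu, typ=2))

# One-way ANOVA: Salary ~ Occupation (used in question 1.3)
model_occ = ols("Salary ~ C(Occupation)", data=df).fit()
print(sm.stats.anova_lm(model_occ, typ=2))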

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
The hypotheses for conducting one-way ANOVA for Occupation:
H0: All 4 Occupation levels have the same mean salary.
H1: At least 1 Occupation level has a different mean salary.
Significance level (α) = 0.05
To perform one-way ANOVA on Salary with respect to Occupation, we fit the ANOVA model in the Jupyter notebook and generate the ANOVA table. We get the following output:

Figure 4: One way ANOVA: Salary and Occupation


The P value (0.458508) is more than 0.05, hence we fail to reject the null hypothesis.

The F ratio is 0.88, which is much lower than for the education levels: the variance across the occupation levels is much lower than the variance within each segment. Hence, we conclude that salary has no dependency on occupation and all 4 occupations have the same mean salary.

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out
which class means are significantly different. Interpret the result.
Solution:
Since the null hypothesis was rejected for Education in (1.2), we use Tukey’s HSD (tukeyhsd) test to check which class means are significantly different.
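A sketch of the Tukey HSD step is shown below, again assuming a DataFrame df with Salary and Education columns; the exact notebook code may differ.

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("SalaryData.csv")  # assumed file name

# Pairwise comparison of mean Salary across Education levels at alpha = 0.05
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())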

Figure 5: Multiple comparison of Salary and Education


Conclusion: In the above result we observe that every pair of education class means differs significantly; the null hypothesis of equal means is rejected for each comparison.

Figure 6: Boxplot salary vs Education

The null hypothesis was rejected in (1.2); from the above boxplot we can see that the HS-grad mean salary is significantly lower than that of the other qualifications.

Problem 1B:
1.5Q) What is the interaction between the two treatments? Analyse
the effects of one variable on the other (Education and Occupation)
with the help of an interaction plot.
Solution:
As seen from the below interaction plots, there seems to be moderate interaction
between the two categorical variables.
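One way to draw such an interaction (point) plot is sketched below, assuming the same df with Salary, Education and Occupation columns; the use of seaborn’s pointplot is an assumption, not necessarily the plotting call used in the notebook.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("SalaryData.csv")  # assumed file name

# Mean Salary per Education level, one line per Occupation
sns.pointplot(data=df, x="Education", y="Salary", hue="Occupation")
plt.title("Interaction between Education and Occupation")
plt.show()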

Figure 7: Point Plot interaction between education and occupation


 As the p-value for the Education*Occupation interaction term (see the two-way ANOVA in question 1.6) is less than the significance level of 0.05, we can say that there is a statistically significant interaction between Education and Occupation.

 Adm-clerical and sales professionals with bachelors and doctorate degrees earn
almost similar salary packages.

 From the Above interaction plot we can clearly see that Salary is affected by both
Education and Occupation. So, we can say that the Salary earned is dependent on
both Education and Occupation.

1.6Q) Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state your results. How will you interpret this result?

Solution:

Formulation of hypotheses for conducting two-way ANOVA based on education and occupation with respect to salary:

H0 (Education): The mean salary is the same across all education levels.

H0 (Occupation): The mean salary is the same across all occupation levels.

H0 (Interaction): There is no interaction effect of education and occupation on salary.

H1 (for each test): The corresponding null hypothesis does not hold, i.e., the factor (or the interaction) has an effect on salary.

Significance level (α) = 0.05.
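A sketch of the two-way ANOVA with the interaction term is given below, under the same assumptions about df and its column names.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("SalaryData.csv")  # assumed file name

# Two-way ANOVA: main effects of Education and Occupation plus their interaction
model = ols("Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))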

Figure 8: Two Way ANOVA

From the above analysis, it can be seen that the p-values are less than the significance level (0.05), and hence the null hypotheses are rejected in this case.

Hence, we can say that there is an interaction between Education and Occupation, and that salary depends on education and occupation together.

1.7 Explain the business implications of performing ANOVA for this particular case study.

Solution:

 Performing one-way ANOVA on Education and Occupation tells us whether there is any relationship between the salary earned by a person and their education and occupation.

 We conducted hypothesis tests to check whether there is any relationship between Salary and Education, and between Salary and Occupation.

 Salary is the dependent variable; we checked whether it depends on Education and Occupation.

 From question 1.2 we can see that the null hypothesis is rejected, stating that the mean salary differs across education levels.

 From question 1.3 we can see that the null hypothesis is not rejected, stating that the mean salary is not different across occupations on their own.

 From question 1.6 we can see that the null hypothesis is rejected: there is an interaction between education and occupation, which in turn affects the salary.

===================**********************===================

2 Problem Statement: 2

The dataset Education - Post 12th Standard.csv contains information on various colleges. You are expected to do a Principal Component Analysis for this case study according to the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.

Data Description:

Figure 9: Problem 2 – Data Info

 The purpose of the dataset is to study data obtained from different colleges; we perform exploratory data analysis on it to draw inferences.

 Principal component analysis is then performed to reduce the dimensions (remove redundant dimensions) while retaining the directions that capture the highest variance; the resulting principal components have zero correlation with one another.

Figure 10: Problem 2- Data Description

Figure 11: Problem 2 – Sample Dataset


Exploratory Data Analysis:

Figure 12: Problem 2 – Null Data

 The “Names” column can be removed from the dataset before doing the exploratory data analysis.
 There is no null data in the dataset.

Duplicate Detection:

Figure 13: Problem 2-Duplicate Data


 There are no duplicate records to fix.

Outlier Treatment:
 From the box plot (Figure 14: Before Outlier Treatment Boxplot), we can see that we need to treat outliers in every variable except “Top25perc”, as shown below.

Figure 14: Before Outlier Treatment Boxplot


 We treat outliers by capping the values that lie beyond the IQR-based lower and upper bounds (a sketch of this capping follows the list below).
 Higher outliers are capped at the 95th percentile values.
 Lower outliers are capped at the 5th percentile values.
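A sketch of this capping is given below. The file name and the dropped ‘Names’ column follow the description above; the variable and function names (edu_df, treated_df, cap_outliers) are assumptions.

import pandas as pd

edu_df = pd.read_csv("Education - Post 12th Standard.csv").drop(columns=["Names"])

def cap_outliers(series: pd.Series) -> pd.Series:
    """Cap values outside the IQR whiskers at the 5th / 95th percentiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    low_cap, high_cap = series.quantile(0.05), series.quantile(0.95)
    capped = series.where(series <= upper, high_cap)   # cap high outliers at the 95th percentile
    return capped.where(capped >= lower, low_cap)      # cap low outliers at the 5th percentile

treated_df = edu_df.apply(cap_outliers)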

Figure 15: After Outlier Treatment

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Solution:
From the Treatment done above we have a dataset with no extreme outliers on which
univariate and multivariate analysis can be performed.
The further analysis would be done on outlier treated data.
Univariate Analysis:

Figure 16: Univariate Analysis – Data Description

Figure 17: Univariate Analysis – Skewness of Data

Inference:

 The maximum positive skewness is observed for the ‘P. Undergrad’ variable.
 A positive skewness greater than 1 is observed for 9 variables.
 The least skewness is observed for ‘Grad. Rate’.
 The highest negative skewness is observed for the ‘Terminal’ variable; the ‘PhD’ variable has a nearly identical negative skewness.

To display information as part of the univariate analysis of the numeric variables, we define a function, “univariate Analysis”, that takes two arguments: the name of the column and the number of bins.

The function shows a box plot and a histogram/distplot to view the distribution, the numerical variable’s statistical description, and any outliers that may exist.
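A sketch of such a function is given below; the snake_case name univariate_analysis and the use of seaborn’s boxplot/histplot are assumptions standing in for the notebook’s own helper.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def univariate_analysis(df: pd.DataFrame, column: str, bins: int = 20) -> None:
    """Show a boxplot and a histogram for one numeric column, plus its summary statistics."""
    fig, (ax_box, ax_hist) = plt.subplots(nrows=2, sharex=True, figsize=(8, 5))
    sns.boxplot(x=df[column], ax=ax_box)                        # outliers visible in the boxplot
    sns.histplot(df[column], bins=bins, kde=True, ax=ax_hist)   # distribution shape
    ax_box.set_title(f"Distribution of {column}")
    plt.show()
    print(df[column].describe())

# Example usage on the outlier-treated data:
# univariate_analysis(treated_df, "Apps", bins=25)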

Figure 18: Univariate Analysis- Distplot and Boxplot

Bivariate Analysis:

Figure 19: Correlation Heatmap

Figure 20: Data Description – After removing outliers


Figure 21: Pair Plot

 Bivariate analysis is done using the heatmap of the correlation matrix, wherein the dependency between pairs of variables is checked.

 “Apps” has a high correlation with “Accept” and “Enroll”.

 The correlation indicates how strongly two variables depend on one another and in which direction.

 One objective of PCA is to remove this correlation by producing uncorrelated components.

2.2 Is scaling necessary for PCA in this case? Give justification and
perform scaling.
Solution:

From the boxplot of the outlier-treated data (Figure 15: After Outlier Treatment) we can see that there are extreme variations in the ranges of the variables, e.g., Grad. Rate and Outstate, which makes them difficult to compare directly. Also, a variable with a high standard deviation would carry a higher weight in the calculation than a variable with a low standard deviation. Hence, it is necessary to scale the data to standardise the range of all the dimensions.

PCA computes its new axes from the variance of the data, so unscaled variables with large variances would dominate the components.

We use the Z-score method to scale the data; a sketch follows below.
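A sketch of the Z-score scaling is given below, assuming treated_df is the outlier-treated numeric DataFrame from the earlier sketch.

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(treated_df)                      # each column: mean ~ 0, std ~ 1
scaled_df = pd.DataFrame(scaled, columns=treated_df.columns)
print(scaled_df.describe().round(2))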

Figure 22: Scaled data Sample

Figure 23: Scaled data Description

From the above description and sample of the scaled dataset we can see that each variable is now standardised: the mean is approximately 0 and the standard deviation approximately 1.

The data is now on a common scale, making it easier to compare variables and derive deductions.

2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]

Solution:

Correlation Matrix:

Figure 24: Correlation Matrix

Inference:

 Covariance shows the direction of the linear relationship between variables.

 Correlation measures both the strength and the direction of the linear relationship between two variables.

 Correlation is a scaled (standardised) form of covariance.

 Covariance indicates how two variables move together: if an increase in one variable is accompanied by an increase in the other, the covariance is positive; if one decreases while the other increases, it is negative.

 On Z-score scaled data every variable has unit variance, so the covariance matrix and the correlation matrix are essentially identical; a quick check is sketched below.
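The check below is a sketch, assuming scaled_df from the scaling step; the small tolerance allows for the n-1 denominator used by pandas’ cov().

import numpy as np

cov_matrix = scaled_df.cov()
corr_matrix = scaled_df.corr()
# On Z-score scaled data the two matrices coincide (up to the n-1 scaling factor)
print(np.allclose(cov_matrix, corr_matrix, atol=1e-2))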

Figure 25: Heatmap of correlation of Scaled data

Figure 26: Covariance Matrix

2.4 Check the dataset for outliers before and after scaling. What insight
do you derive here?

Solution:

Below is the dataset after outlier treatment (Figure 27: Outlier treated Dataset).

Figure 27: Outlier treated Dataset

As seen in Figure 28: Boxplot of Outlier treated Data below, before the treatment the dataset had many outliers in all but one dimension; post treatment there is a negligible number of outliers.

Figure 28: Boxplot of Outlier treated Data


Figure 29: After Outlier and Scaling treatment on Dataset

 The above is the dataset after outlier and scaling treatment (Figure 29: After Outlier and Scaling treatment on Dataset).

Figure 30: Boxplot of after Outlier and Scaling treatment on Data

 As seen in Figure 30: Boxplot of after Outlier and Scaling treatment on Data, scaling has standardised the data, giving the same weight to every variable and providing PCA with comparable axes.

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print
Both].

Solution:

Statistical tests to be done before PCA - Bartlett's test of sphericity:

To perform PCA we first run Bartlett’s test of sphericity, which tests the hypothesis that the variables are uncorrelated in the population (i.e., that the correlation matrix is an identity matrix).

H0: The variables in the data are uncorrelated.

H1: At least one pair of variables in the data is correlated.

After performing the test, the p-value is 0, which means that the null hypothesis is rejected and at least one pair of variables in the data is correlated; hence PCA is recommended.

Kaiser-Meyer-Olkin (KMO) Test:

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is.

Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension and extraction of meaningful components.

Here, for this dataset, the overall MSA value is 0.86, which indicates that performing PCA should give a considerable reduction in dimensionality.
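A sketch of both checks is given below; the use of the factor_analyzer package is an assumption, and scaled_df is the scaled DataFrame from question 2.2.

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's test of sphericity: null hypothesis is an identity correlation matrix
chi_square, p_value = calculate_bartlett_sphericity(scaled_df)
print(f"Bartlett's test: chi2 = {chi_square:.2f}, p-value = {p_value:.4f}")

# KMO / MSA: sampling adequacy per variable and overall
kmo_per_variable, kmo_overall = calculate_kmo(scaled_df)
print(f"Overall KMO / MSA value: {kmo_overall:.2f}")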

Step 1: Create the covariance matrix (as in question 2.3).

Step 2: Get the eigenvectors and eigenvalues.

Eigen vectors:

Figure 31: Eigen Vectors Matrix

Performing PCA on scaled and treated data using Sklearn, we find the above outputs.

Eigenvalues (reported here as the proportion of total variance explained by each component, i.e., sklearn's explained_variance_ratio_):

[0.33151857, 0.28373652, 0.06464061, 0.05855307, 0.05274046, 0.04497099, 0.03449059, 0.03257588, 0.02603662, 0.02245497, 0.01443066, 0.00862682, 0.00799196, 0.00727087, 0.00438662, 0.0032887, 0.00228608]
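A sketch of how these outputs (and the data frame used in question 2.6) can be obtained with sklearn's PCA is given below; scaled_df is assumed from question 2.2. pca.components_ holds the eigenvectors, pca.explained_variance_ the eigenvalues, and explained_variance_ratio_ the proportions listed above.

import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=len(scaled_df.columns))
pca.fit(scaled_df)

# Eigenvectors: one row per principal component, one column per original feature
eigvec_df = pd.DataFrame(pca.components_, columns=scaled_df.columns).round(2)
print(eigvec_df)

print("Eigenvalues:", pca.explained_variance_.round(4))
print("Proportion of variance explained:", pca.explained_variance_ratio_.round(4))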

2.6 Perform PCA and export the data of the Principal Component
(Eigen vectors) into a data frame with the original features.

Solution:

After performing PCA, the eigenvectors from question 2.5 are exported into a data frame whose columns are the original features. Below is a sample of this data frame loaded with all the components.

Figure 32: Data Frame with Eigen Vectors

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]

Solution:

The linear (explicit) form of the first PC is obtained by using the eigenvector loadings for PC1.

Below is the linear form of the first PC (refer to Figure 32: Data Frame with Eigen Vectors).

The linear equation of PC1:

PC1 = 0.24 * Apps + 0.21 * Accept + 0.16 * Enroll + 0.34 * Top10perc + 0.34 * Top25perc + 0.13 * F.Undergrad + 0.01 * P.Undergrad + 0.30 * Outstate + 0.25 * Room.Board + 0.09 * Books - 0.05 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.20 * perc.alumni + 0.34 * Expend + 0.25 * Grad.Rate
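The equation above can be generated from the fitted PCA object, as sketched below (pca and scaled_df are assumed from the sketch in question 2.5).

# Build the PC1 equation string from the first row of eigenvector loadings
pc1_terms = [f"{loading:+.2f} * {feature}"
             for loading, feature in zip(pca.components_[0], scaled_df.columns)]
print("PC1 =", " ".join(pc1_terms))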

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?

Solution:

Figure 33: Scree Plot

Inference:
From the above Figure 33: Scree Plot we can see the individual variances explained by the principal components. We can observe a sudden decrease in slope from the third principal component onwards, which means most of the variance is captured by the first two principal components. This point is also called the inflection (elbow) point.

Cumulative variance: [0.33151857, 0.61525509, 0.6798957, 0.73844877, 0.79118924, 0.83616023, 0.87065082, 0.9032267, 0.92926332, 0.95171829, 0.96614894, 0.97477576, 0.98276773, 0.9900386, 0.99442522, 0.99771392, 1.0]

Cumulative Plot:

Figure 34: Cumulative Variance

Inference:

From the scree plot (Figure 33: Scree Plot) and the cumulative explained variance (Figure 34: Cumulative Variance) we can see that around 8 components explain about 90% of the variance. Thus, the dimensions can be reduced from 17 to 8, giving an optimal solution. The data can then be re-expressed on the new axes by projecting it onto the directions of the eigenvectors (the principal components); a sketch follows.
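A sketch of the cumulative-variance check and the reduction to 8 components is given below (pca and scaled_df assumed from the earlier sketch; the 90% threshold follows the discussion above).

import numpy as np
from sklearn.decomposition import PCA

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.90) + 1)    # first component count reaching 90%
print(n_components, cumulative[n_components - 1])        # 8 components, ~0.90 cumulative variance

pca_8 = PCA(n_components=n_components)
reduced = pca_8.fit_transform(scaled_df)                 # data re-expressed on the new axes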

2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained].

Solution:

 After PCA we found that the dimensions can be reduced to 8 components, whose mutual correlation is 0, as per Figure 35: PCA Transformed Correlation matrix.
 This indicates that the redundant dimensions are removed.
 The 8 principal components now represent almost 90% of the variance in the data.

Figure 35: PCA Transformed Correlation matrix

Inference from the transformed data (the PCs):

 PC1: Shows the number of students who pay out-of-state tuition.
 PC2: Shows that student admission depends highly on applications, acceptance and enrolment, which are highly correlated.
 PC3: Shows the cost of books for a student.
 PC4: Indicates the Top 10 and Top 25 percentages.
 PC5: Represents the percentage of faculty with PhD and terminal degrees.
 PC6: Represents the student/faculty ratio.
 PC7: Highlights estimated personal spending for a student and the graduation rate.
 PC8: Highlights the alumni-related variable.

Figure 36: PCA Transformed Data frame Heatmap

The Transformed Dataset after PCA:

Figure 37: Transformed Dataset after performing PCA

Final Inferences from the above PCA:

 PCA captures the maximum possible variance of the original variables with the minimum number of dimensions. Different methods are employed to capture the optimum information with minimum dimensions; one of the most common is using the cumulative values of the eigenvalues to decide the optimum number.
 For better visualisation we first convert the cumulative values into percentage terms, then decide on the minimum amount of information we want to capture in the new dimensions, and select the principal components that explain the desired percentage of variance. For example, in our case, if we want at least 90% of the variance explained by the new dimensions, we select the first 8 principal components, which explain approximately 90% of the variance.
 Other methods employed for selecting the optimum number of principal components are the Kaiser rule, where we keep all principal components with eigenvalues greater than 1, and the scree plot, where we keep only the principal components before the inflection point (i.e., a sharp break in slope).

Business Implications of using Principal Component Analysis:

 In our case study, after performing multivariate analysis we have observed that
many of the variables are correlated. Thus, we don't need all these variables for
analysis but we are not sure which variables to drop and which to select, hence
we perform PCA, which captures the information (in the form of variance) from
all these variables into new dimension variables. Now based on the requirement
of information we can select the number of new dimension variables required.
 The new dimension variables are independent of each other, which also benefits algorithms that assume uncorrelated inputs.
 The dimensionality reduction obtained from PCA requires less computing power, i.e., faster processing for further analysis.
 The dimensionality reduction also requires less storage space.
 The dimensionality reduction also helps in addressing the overfitting issue, which mainly occurs when there are too many variables.

======================************************======================

