Great Learning
Authored by: ANIMESH HALDER
Content
Problem 2: A Survey
Introduction
Data Description
Basic information
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not treat Outliers unless specifically asked to do so].
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both].
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [Hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
List of Tables:
List of Figures:
Problem 1: Analysis of Salary as a Function of Educational Qualification and
Occupation
Problem statement: Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each
person's educational qualification and occupation are noted. Educational qualification is at three levels,
High school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or speciality, and Executive or managerial. A different number of observations are
in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may not always
hold if the sample size is small.].
Introduction:
The purpose of the study is to examine the relationship between the salary earned and the educational qualification and occupation of 40 individuals. The dataset covers 3 educational qualifications and 4 types of occupations. Analysing it reveals how salary differs with qualification and chosen job type. The ANOVA test is used here as the diagnostic tool.
Data Description:
Education Qualification: Doctorate, Bachelors, and HS Grad
Occupation: Adm-clerical, Sales, Prof-specialty, and Exec-managerial
Salary: Salary earned by the individual (continuous variable)
Basic information:
Education object
Occupation object
Salary int64
There are a total of 40 rows and 3 columns in the dataset. Of the 3 columns, 2 (Education and Occupation) are of object type and the remaining 1 (Salary) is of integer type. No missing values or NaN entries are found. Total memory usage: 1.1+ KB
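A minimal sketch of loading and inspecting the data (the frame name salary_df is an assumed name reused in the later sketches):

import pandas as pd

# Load the salary dataset (file name as given in the problem statement)
salary_df = pd.read_csv('SalaryData.csv')

# Structure: 40 rows, 3 columns (Education, Occupation as object; Salary as int64)
salary_df.info()
print(salary_df.head())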
1A.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
In the ANOVA test, hypothesis testing is mandatory. The null hypothesis states that the means of all the individual treatments are the same, whereas the alternative hypothesis states that at least one mean differs from the others.
In the present dataset, the continuous variable is a function of the two categorical treatments. Hence two sets of hypotheses can be defined for conducting the one-way ANOVA, as below:
Set 1:
H0: The mean salaries are the same for all Educational Qualifications.
H1: At least one Educational Qualification has a mean salary different from the others.
Set 2:
H0: The mean salaries are the same for all types of Occupations.
H1: At least one Occupation has a mean salary different from the others.
For either set of hypotheses, if the calculated p-value is smaller than the chosen significance level (α = 0.05), the null hypothesis is rejected; otherwise we fail to reject it.
1A.2 Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The dataset is used with two treatments: one categorical (Education) and one continuous variable (Salary).
The null hypothesis (H0) states that the mean salaries are the same for all Educational Qualifications, whereas the alternative hypothesis (H1) states that the mean salary differs for at least one Educational Qualification.
Running the one-way ANOVA, the obtained p-value (1.257709e-08) is smaller than the significance level of 0.05; hence we reject the null hypothesis, indicating that mean salary differs across educational levels.
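A sketch of this test with statsmodels, assuming the salary_df frame from the loading sketch and the column names listed above:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA of Salary on Education
model_edu = ols('Salary ~ C(Education)', data=salary_df).fit()
anova_edu = sm.stats.anova_lm(model_edu, typ=2)
print(anova_edu)   # p-value for C(Education) ~ 1.26e-08 < 0.05, so H0 is rejected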
1A.3 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The one-way ANOVA is applied to the reformed dataset, consisting of categorical treatment Occupation
and the continuous variable Salary. As per the rule of the ANOVA test, the hypothesis is declared, for
the dependent variable salary in terms of the independent treatment, as follows:
H0: The mean salaries are the same for all types of Occupations.
H1: The mean salaries are different (at least one) for all types of Occupations.
In these statements, H0 and H1 are the null and alternative hypotheses respectively.
The test results show that the p-value (0.458508) is greater than the significance level; hence we fail to reject the null hypothesis, i.e. there is no evidence that mean salary differs across occupations.
1A.4 If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded).
The null hypothesis is rejected in (2). This means that at least one educational qualification has a mean salary different from the others.
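One possible (non-graded) way to locate the differing class means is a Tukey HSD post-hoc test; a sketch with statsmodels, reusing the assumed salary_df frame:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparison of mean Salary across Education levels
tukey = pairwise_tukeyhsd(endog=salary_df['Salary'],
                          groups=salary_df['Education'],
                          alpha=0.05)
print(tukey.summary())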
1B.1 What is the interaction between two treatments? Analyze the effects of one variable on the other
(Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’ function
from the ‘seaborn’ function].
The interaction of the variable Salary with the other two variables, Education and Occupation, is analyzed. The interactions are shown in Figure 1, produced with the 'pointplot' function available in the seaborn library (a sketch is given after the figure caption). From Figure 1, a distribution table (Table 1) can be formed, and the following observations can be drawn.
3. The salary changes almost linearly from Adm-clerical to Prof-specialty for Doctorates. The trend falls for the Exec-managerial occupation, though the salary drawn there is still higher than the salary drawn by Doctorates in Sales.
Figure 1: Dependency of drawing salary on the occupation for three types of educational qualifications.
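A sketch of how Figure 1 can be reproduced with seaborn's pointplot, assuming the salary_df frame:

import matplotlib.pyplot as plt
import seaborn as sns

# Interaction plot: mean Salary per Occupation, one line per Education level
sns.pointplot(x='Occupation', y='Salary', hue='Education', data=salary_df)
plt.title('Salary vs Occupation for each Education level')
plt.show()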
1B.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?
The hypotheses for the two-way ANOVA with interaction are declared as follows:
H0: There is no interaction between the treatments in determining Salary.
H1: There is an interaction between the treatments in determining Salary.
Due to the inclusion of the interaction term, the p-values of the first two treatments change compared with the two-way ANOVA without the interaction term, but these changes do not alter the conclusions for those treatments. The p-value (0.000022325) of the interaction term between 'Education' and 'Occupation' is below 0.05, so the null hypothesis of no interaction is rejected.
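A sketch of the two-way ANOVA with the interaction term, again assuming the salary_df frame:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two-way ANOVA including the Education*Occupation interaction term
model_int = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)',
                data=salary_df).fit()
print(sm.stats.anova_lm(model_int, typ=2))
# The interaction term's p-value (~2.2e-05) is below 0.05, so H0 of no interaction is rejected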
1B.3 Explain the business implications of performing ANOVA for this particular case study.
If we take the salary drawn as a reflection of how well an educational qualification fits an occupation, then, referring to Table 1 and Figure 1, Doctorates are best placed in the Prof-specialty job, as they draw the maximum salary there. Exec-managerial is the best selection for Bachelors, as it is the highest-paying occupation for them, while Bachelors are not well suited to the Prof-specialty occupation. People with an HS-Grad qualification are eligible for Adm-clerical, Sales, and Prof-specialty jobs, but they perform worst in Sales, so Adm-clerical and Prof-specialty are the two occupations where they can perform well.
Problem 2: A Survey
Problem statement: The dataset Education - Post 12th Standard.csv contains information on various
colleges. You are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the
following file: Data Dictionary.xlsx.
Introduction:
The purpose of the study is to analyse the colleges according to their different features in order to distinguish the best among them. The dataset contains 777 different colleges, with information on the number of applications for admission, the number of accepted students and the number of enrolled individuals. Apart from these 3 treatments, there are 2 others describing the percentage of enrolled students who were in the top 10% and top 25% of their 12th class respectively. The dataset also describes the size of the colleges in terms of the number of full-time and part-time undergraduate students, and the number of students for whom the college or university is out-of-state tuition. There are 9 further treatments describing the cost structure and quality of the institutions: cost of room and board, estimated book costs per student, estimated personal spending per student, percentage of faculty with PhDs, percentage of faculty with a terminal degree, student/faculty ratio, instructional expenditure per student, percentage of alumni who donate, and graduation rate.
EDA and PCA are the two tools used here for the analysis.
Data Description:
Names: Names of various universities and colleges
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Percentage of new students from top 10% of Higher Secondary class
Top25perc: Percentage of new students from top 25% of Higher Secondary class
Full-Time Undergrad: Number of full-time undergraduate students
Part-Time Undergrad: Number of part-time undergraduate students
Outstate: Number of students for whom the particular college or university is Out-of-state tuition
Room & Board: Cost of Room and board
Books: Estimated book costs for a student
Personal: Estimated personal spending for a student
PhD: Percentage of faculties with PhD's
Terminal: Percentage of faculties with a terminal degree
S-F Ratio: Student/faculty ratio
perc_alumni: Percentage of alumni who donate
Expend: The Instructional expenditure per student
Graduation Rate: Graduation rate
Basic information:
Names object
Apps int64
Accept int64
Enroll int64
Top10perc int64
Top25perc int64
F.Undergrad int64
P.Undergrad int64
Outstate int64
Room.Board int64
Books int64
Personal int64
PhD int64
Terminal int64
S.F.Ratio float64
perc.alumni int64
Expend int64
Grad.Rate int64
There are a total of 777 rows and 18 columns in the dataset. Of the 18 columns, 1 (Names) is of object type, 1 (S.F.Ratio) is float64 and the remaining 16 are of integer type. No missing values or NaN entries are found. Total memory usage: 109.4+ KB
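A minimal sketch of loading and inspecting the data (the frame name edu_df is an assumed name reused in the later sketches):

import pandas as pd

# Load the college dataset (file name as given in the problem statement)
edu_df = pd.read_csv('Education - Post 12th Standard.csv')

# 777 rows x 18 columns; 'Names' is the only non-numeric column
edu_df.info()
print(edu_df.describe().T)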
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
A boxplot is a suitable option for the univariate analysis, as shown in Figure 2; the plots show that the numeric features are affected by outliers.
Figure 2: Boxplots of the dataset, showing the distribution of the data. The dataset is affected by outliers.
The multivariate analysis can be performed using the correlation matrix. The analysis shows that the number of applications to an institute is highly correlated with the number of acceptances and the number of student enrolments. Beyond these two, applications are also highly correlated with the number of full-time undergraduates, and the same holds between enrolment and full-time undergraduates. The graduation rate is moderately correlated with out-of-state tuition and with the percentage of students who were in the top 10% and top 25% of their higher secondary class, and not strongly correlated with PhD, Terminal or other factors such as Books and the student-faculty ratio.
Thus, the insights from the analysis are as follows (an EDA code sketch follows the list):
1. There are 17 numeric fields in the dataset.
2. The number of enrolled students per institute ranges from 35 to 6,392.
3. The number of full-time undergraduate students ranges from 139 to a maximum of 31,643.
4. The average student/faculty ratio is about 14.
5. The graduation rate varies from 10 to a maximum of 118, which points to at least one data-quality issue since a rate cannot exceed 100.
6. Outliers are present in almost every feature (they are not treated here, as instructed).
7. Applications, acceptances and enrolments are highly correlated with each other and with the number of full-time undergraduates.
8. Massachusetts Institute of Technology is the most preferred institute among new students who were in the top 10% of their higher secondary class.
9. The University of California at Irvine is preferred by new students who were in the top 25% of their higher secondary class.
10. Both institutes have among the best infrastructures in terms of faculty qualification and student-faculty ratio.
11. Among all the institutes, the graduation rate is highest at Cazenovia College.
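A sketch of the EDA steps above, assuming the edu_df frame from the loading sketch; numeric_df is an assumed helper name for the 17 numeric columns:

import matplotlib.pyplot as plt
import seaborn as sns

numeric_df = edu_df.drop(columns=['Names'])   # keep the 17 numeric features

# Univariate view: one boxplot per feature (as in Figure 2)
fig, axes = plt.subplots(5, 4, figsize=(16, 14))
for ax, col in zip(axes.ravel(), numeric_df.columns):
    sns.boxplot(y=numeric_df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()

# Multivariate view: correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()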
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
In the given dataset, scaling is necessary before PCA because the features have very different magnitude ranges (e.g., Apps and F.Undergrad run into the thousands while the percentage features stay below 100). Since PCA maximizes variance, unscaled features with large ranges would dominate the principal components; scaling brings all variables onto the same scale.
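A sketch of the scaling step using scipy's zscore (sklearn's StandardScaler would give the same result); scaled_df is an assumed name reused in later sketches:

from scipy.stats import zscore

# numeric_df holds the 17 numeric columns, as in the EDA sketch above
scaled_df = numeric_df.apply(zscore)
print(scaled_df.describe().T[['mean', 'std']])   # every feature now has mean ~0 and std ~1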
2.3 Comment on the comparison between the covariance and the correlation matrices from this
data [on scaled data].
Covariance measures the direction of the linear relationship between two variables, whereas correlation is the scaled (standardized) form of covariance and measures both the strength and the direction of that relationship. On scaled data the two coincide: the covariance matrix of the standardized features is the correlation matrix of the original features. A heat map of all 17 features is used for better visualization (Figure 4a). After PCA, the scree plot (Figure 3) shows how many components are required to retain about 88 per cent of the total variance; in the present study 8 principal components are sufficient, and the correlation heat map of these 8 components (Figure 4b) confirms that they are uncorrelated with one another.
Figure 3: The scree plot, showing the proportion of variance explained by each principal component. According to the plot, the first component explains about 32% of the total variance.
The selected 8 principal components (Figure 4b) are the transformed data that replace the 17 originally selected features (Figure 4a). These 8 principal components are mutually uncorrelated; each one correlates only with itself.
Figure 4: Heat maps showing the r values for the covariance and correlation matrices. (a) The covariance/correlation matrix of the scaled features shows all pairwise relationships between the treatments, and (b) the correlation matrix of the eight PCs shows that each component correlates only with itself.
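A quick check of this comparison on the scaled data, reusing the assumed scaled_df frame: on standardized features the covariance matrix and the correlation matrix are essentially identical.

import numpy as np

# On standardized data the covariance and correlation matrices coincide
cov_scaled = np.cov(scaled_df.T)         # covariance matrix of the scaled features
corr_scaled = np.corrcoef(scaled_df.T)   # correlation matrix of the same features
print(np.allclose(cov_scaled, corr_scaled, atol=1e-2))   # True, up to the n vs n-1 normalisation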
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please
do not treat Outliers unless specifically asked to do so].
The dataset before scaling, shown in Figure 2, contains outliers, and the ranges of the variables differ from one variable to the next. To bring all the parameters onto a standard range, scaling is performed. Scaling does not change the pattern of the data distribution; it only shifts and rescales each variable to mean 0 and standard deviation 1. After scaling the original dataset (Figure 5), all the minimum values become negative, e.g. the minimum for the number of applications is -0.755134, for the estimated book cost -2.747779 and for the student/faculty ratio -2.929799; these negative values simply correspond to observations below the mean and are an expected consequence of standardization.
Figure 5: Boxplots of all the variables after scaling. The pattern of the distribution remains unaltered; only the range of the data changes, which produces negative values.
The insight is that the outliers present before scaling are still present after scaling; as instructed, they are not treated here.
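A sketch that makes this point explicit, counting 1.5*IQR outliers per feature before and after scaling (reusing the assumed numeric_df and scaled_df frames):

# Flag outliers with the 1.5*IQR rule, before and after scaling
def iqr_outlier_count(df):
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()

print(iqr_outlier_count(numeric_df))   # outlier count per feature before scaling
print(iqr_outlier_count(scaled_df))    # identical counts after scaling: scaling does not remove outliers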
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both].
Here all 17 features are used, as there is no prior idea of how many components will be required to perform PCA. The PCA class from the 'sklearn.decomposition' library is used to obtain the eigenvectors and the eigenvalues.
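A minimal sketch of this step, reusing the assumed scaled_df frame:

from sklearn.decomposition import PCA

# Fit PCA on the scaled data, keeping all 17 components
pca = PCA(n_components=17)
pca.fit(scaled_df)

print(pca.explained_variance_)   # eigenvalues, one per principal component
print(pca.components_)           # eigenvectors, one row per principal component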
The extracted eigenvalues represent the amount of variance captured by each principal component:
[5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123, 0.84849117, 0.6057878, 0.58787222,
0.53061262, 0.4043029, 0.31344588, 0.22061096, 0.16779415, 0.1439785, 0.08802464, 0.03672545,
0.02302787]
The extracted eigenvectors or PCA components are:
[[0.2487656, 0.2076015, 0.17630359, 0.35427395, 0.34400128, 0.15464096, 0.0264425, 0.29473642,
0.24903045, 0.06475752, -0.04252854, 0.31831287, 0.31705602, -0.17695789, 0.20508237, 0.31890875,
0.25231565],
[0.33159823, 0.37211675, 0.40372425, -0.08241182, -0.04477866, 0.41767377, 0.31508783, -0.24964352, -
0.13780888, 0.05634184, 0.21992922, 0.05831132, 0.04642945, 0.24666528, -0.24659527, -0.13168986, -
0.16924053],
[-0.0630921, -0.10124906, -0.08298557, 0.03505553, -0.02414794, -0.06139298, 0.13968172, 0.04659887,
0.14896739, 0.67741165, 0.49972112, -0.12702837, -0.06603755, -0.2898484, -0.14698927, 0.22674398, -
0.20806465],
[0.28131053, 0.26781735, 0.16182677, -0.05154725, -0.10976654, 0.10041234, -0.15855849, 0.13129136,
0.18499599, 0.08708922, -0.23071057, -0.53472483, -0.51944302, -0.16118949, 0.01731422, 0.07927349,
0.26912907],
[0.00574141, 0.05578609, -0.05569364, -0.39543434, -0.42653359, -0.04345437, 0.30238541, 0.222532,
0.56091947, -0.12728883, -0.22231102, 0.14016633, 0.20471973, -0.07938825, -0.21629741, 0.07595812, -
0.10926791],
[-0.01623744, 0.00753468, -0.04255798, -0.0526928, 0.03309159, -0.04345423, -0.19119858, -0.03000039,
0.16275545, 0.64105495, -0.331398, 0.09125552, 0.15492765, 0.48704587, -0.04734001, -0.29811862,
0.21616331],
[-0.04248635, -0.01294972, -0.02769289, -0.16133207, -0.11848556, -0.02507636, 0.06104235, 0.10852897,
0.20974423, -0.14969203, 0.63379006, -0.00109641, -0.02847701, 0.21925936, 0.24332116, -0.22658448,
0.55994394],
[-0.1030904, -0.05627096, 0.05866236, -0.12267803, -0.10249197, 0.07888964, 0.57078382, 0.009846, -
0.22145344, 0.21329301, -0.23266084, -0.07704, -0.01216133, -0.08360487, 0.67852365, -0.05415938, -
0.00533554],
[-0.09022708, -0.17786481, -0.12856071, 0.34109986, 0.40371199, -0.05944192, 0.5606729, -0.00457333,
0.27502255, -0.13366335, -0.09446889, -0.18518152, -0.2549382, 0.27454438, -0.25533491, -0.04913888,
0.04190431],
[0.0525098, 0.04114008, 0.03448791, 0.06402578, 0.01454923, 0.02084718, -0.22310581, 0.18667536,
0.29832424, -0.08202922, 0.13602762, -0.1234522, -0.08857846, 0.47204525, 0.42299971, 0.13228633, -
0.59027107],
[0.04304621, -0.05840559, -0.06939888, -0.00810481, -0.27312847, -0.08115782, 0.10069332, 0.14322067, -
0.35932173, 0.03194004, -0.01857847, 0.04037233, -0.0589734, 0.44500073, -0.13072798, 0.69208887,
0.219839],
[0.02407091, -0.14510245, 0.01114315, 0.0385543, -0.08935156, 0.05617677, -0.06353607, -0.82344378,
0.35455973, -0.02815937, -0.03926403, 0.02322243, 0.01648504, -0.01102621, 0.18266065, 0.3259823,
0.1221067]]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features.
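A minimal sketch of this step, reusing the fitted pca object and scaled_df from section 2.5; loadings_df is an assumed name:

import pandas as pd

# Loadings as a data frame: one row per PC, one column per original feature
loadings_df = pd.DataFrame(pca.components_,
                           columns=scaled_df.columns,
                           index=['PC' + str(i + 1) for i in range(pca.n_components_)])
print(loadings_df.round(2))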
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only). [Hint: write the linear equation of PC in terms of eigenvectors and
corresponding features].
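Taking the first row of the eigenvector matrix printed in section 2.5, rounding to two decimals and assuming the features appear in the column order listed under Basic information, the first principal component can be written as:
PC1 = 0.25*Apps + 0.21*Accept + 0.18*Enroll + 0.35*Top10perc + 0.34*Top25perc + 0.15*F.Undergrad + 0.03*P.Undergrad + 0.29*Outstate + 0.25*Room.Board + 0.06*Books - 0.04*Personal + 0.32*PhD + 0.32*Terminal - 0.18*S.F.Ratio + 0.21*perc.alumni + 0.32*Expend + 0.25*Grad.Rate
where each feature denotes the scaled (standardized) value of the corresponding original variable.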
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
The cumulative explained variance of the first eight principal components is 88.67%. The general rule of thumb is to choose the first n PCs such that they explain 70-90% of the total variance. Hence the cumulative result of the eigenvalues helps in selecting the required number of principal components. In this case, the first eight PCs have been selected, capturing 88.7% of the variation and thereby reducing the initial dimensionality of the dataset by roughly half.
The eigenvector associated with the largest eigenvalue indicates the direction in which the data has the
most variance. Similarly, the eigenvector associated with the second largest eigenvalue indicates the
direction in which the data has the second most variance and so on.
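A sketch of the cumulative calculation, reusing the fitted pca object:

import numpy as np

# Cumulative percentage of variance explained by the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
print(cum_var.round(2))   # the first 8 PCs reach roughly 88.7%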
2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal
Components Obtained]
The business implications can be viewed in Figure 6 and are described below (a loadings heat-map sketch follows the list).
The first Principal Component can be viewed as a measure of the variables Top10perc, Top25perc, Terminal, PhD and Expend. These five criteria vary together: if one increases, the others tend to increase as well.
The second Principal Component can be viewed as a measure of the variables Enroll and F.Undergrad. These two criteria vary together: if one increases, the other tends to increase as well. Thus, we may conclude that the number of full-time undergraduate students increases as the number of enrolments increases.
The third Principal Component can be viewed as a measure of variables Books and Personal.
These two criteria vary together. Thus, estimated Personal spending of students increases with
an increase in estimated book cost for a student.
The fourth Principal Component can be viewed as a measure of variables PhD and Terminal.
These two criteria vary together. The percentage of faculty with a terminal degree increases with
the percentage of faculty having a PhD.
The fifth Principal Component can be viewed as a measure of the variable Room.Board. Colleges scoring high on this component tend to have a high cost of room and board.
The sixth Principal Component is primarily a measure of the variable Books, i.e. the estimated book cost for a student.
The seventh Principal Component can be viewed as a measure of variables Personal and Grad
Rate. These two variables vary together. The Graduation rate increases with an increase in
estimated Personal spending for a student.
The eighth Principal Component can be viewed as a measure of variables Percentage of alumni
who donate and P.Undergrad. These two criteria vary together. The percentage of donations by
the alumni increases with the percentage of part-time undergraduate students.
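A sketch of how a loadings heat map (in the spirit of Figure 6) can be drawn from the assumed loadings_df built in section 2.6, to support the interpretations above:

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the loadings of the first 8 PCs
plt.figure(figsize=(12, 6))
sns.heatmap(loadings_df.iloc[:8], annot=True, fmt='.2f', cmap='coolwarm')
plt.show()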