Data Analysis in Management with SPSS Software

J.P. Verma
Research and Advanced Studies
Lakshmibai National University
of Physical Education
Gwalior, MP, India
IBM SPSS Statistics has been used in solving various applications in different chapters of the book with the permission of the International Business Machines Corporation, © SPSS, Inc., an IBM Company. The various screen images of the software are reprinted courtesy of International Business Machines Corporation, © SPSS. SPSS was acquired by IBM in October 2009.
IBM, the IBM logo, ibm.com, and SPSS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "IBM Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
While serving as a faculty member of statistics for the last 30 years, I have observed that non-statistics faculty and research scholars in different disciplines find it difficult to use statistical techniques in their research problems. Even when their theoretical concepts are sound, it is troublesome for them to use statistical software. This book provides readers with a greater understanding of a variety of statistical techniques, along with the procedure for using the most popular statistical software package, SPSS.
The book strengthens the intuitive understanding of the material, thereby increasing the reader's ability to successfully analyze data in the future. It enhances the reader's capability to apply data analysis techniques to a broader spectrum of research problems.
The book is intended for undergraduate and postgraduate courses, along with pre-doctoral and doctoral coursework on data analysis, statistics, and quantitative methods taught in management and allied disciplines like psychology, economics, education, nursing, medicine, and other behavioral and social sciences. It is equally useful to advanced researchers in the humanities and the behavioral and social sciences in solving their research problems.
The book has been written to provide researchers in different disciplines with solutions for using SPSS, one of the most powerful statistical software packages. It will serve students as a self-learning text on using SPSS to apply statistical techniques to their research problems.
In most research studies, data are analyzed using multivariate statistics, which poses an additional problem for beginners. These techniques cannot be understood without in-depth knowledge of statistical concepts. Further, several fields in science, engineering, and the humanities have developed their own nomenclature, assigning different names to the same concepts. Thus, one has to gather sufficient knowledge and experience in order to analyze data efficiently.
This book covers most of the statistical techniques, including some of the most powerful multivariate techniques, along with detailed analysis and interpretation of the SPSS output that research scholars in different disciplines require to achieve their research objectives.
The USP of this book is that even without in-depth knowledge of statistics, one can learn various statistical techniques and their applications on one's own.
Each chapter is self-contained and starts with introductory concepts, application areas, and the statistical techniques used in the chapter, followed by a step-by-step solved example with SPSS. In each chapter, an in-depth interpretation of the SPSS output has been provided to help readers understand the application of statistical techniques in different situations. Since the SPSS output generated in different statistical applications is raw and cannot be used directly for reporting, a model way of writing the results has been shown wherever it is required.
This book focuses on providing readers with the knowledge and skills needed to carry out research in management, the humanities, and the social and behavioral sciences by using SPSS. Given its coverage and the prospect of learning computing skills with SPSS, this book is a must for every researcher from graduate-level studies onward. Toward the end of each chapter, short-answer questions, multiple-choice questions, and assignments have been provided as practice exercises for the readers.
Common mistakes, such as using a two-tailed test for testing a one-tailed hypothesis, using the term "level of confidence" to define the level of significance, or using statements like "accepting the null hypothesis" instead of "not being able to reject the null hypothesis," have been explained extensively in the text so that readers may avoid such mistakes while organizing and conducting their research work.
Faculty who use this book will find it very useful, as it presents many illustrations with either real or simulated data to discuss the analytical techniques in different chapters. Some of the examples cited in the text are from my own and my colleagues' research studies.
This book consists of 14 chapters. Chapter 1 deals with data types, data cleaning, and the procedure for starting SPSS on one's system. The notations used throughout the book for SPSS commands are explained in this chapter. Chapter 2 deals with descriptive studies and discusses the different situations under which such studies can be undertaken, together with the procedure for computing various descriptive statistics. Besides the computing procedure in SPSS, a new approach is shown toward the end of the second chapter for developing a profile graph, which can be used to compare different domains of a population.
Chapter 3 explains the chi-square test and its different applications by means of solved examples. The step-by-step procedure of computing chi-square using SPSS has been discussed. Chi-square is a test of significance for association between attributes, but it also provides a comparison of two groups when the responses are measured on a nominal scale. This fact has been discussed for the benefit of the readers.
Chapter 4 explains the procedure of computing the correlation matrix and partial correlations using SPSS. Emphasis has been given to how to interpret the relationships.
In Chapter 5, computing multiple correlation and regression analysis is discussed. Both approaches to regression analysis in SPSS, i.e., the Stepwise and Enter methods, have been discussed for estimating any measurable phenomenon.
Contents
1 Data Management  1
   Introduction  1
   Types of Data  3
      Metric Data  3
      Nonmetric Data  4
   Important Definitions  5
      Variable  5
      Attribute  6
      Mutually Exclusive Attributes  6
      Independent Variable  6
      Dependent Variable  6
      Extraneous Variable  6
   The Sources of Research Data  7
      Primary Data  7
      Secondary Data  9
   Data Cleaning  9
      Detection of Errors  10
   Typographical Conventions Used in This Book  11
   How to Start SPSS  11
   Preparing Data File  13
      Defining Variables and Their Properties Under Different Columns  13
      Defining Variables for the Data in Table 1.1  16
      Entering the Data  16
   Importing Data in SPSS  17
      Importing Data from an ASCII File  18
      Importing Data File from Excel Format  22
   Exercise  25
2 Descriptive Analysis  29
   Introduction  29
   Measures of Central Tendency  31
      Mean  31
      Median  36
      Mode  38
      Summary of When to Use the Mean, Median, and Mode  40
   Measures of Variability  41
      The Range  41
      The Interquartile Range  41
      The Standard Deviation  42
      Variance  45
      The Index of Qualitative Variation  46
   Standard Error  47
   Coefficient of Variation (CV)  48
   Moments  49
   Skewness  50
   Kurtosis  51
   Percentiles  52
      Percentile Rank  53
   Situation for Using Descriptive Study  53
   Solved Example of Descriptive Statistics Using SPSS  54
      Computation of Descriptive Statistics Using SPSS  54
      Interpretation of the Outputs  58
   Developing Profile Chart  62
   Summary of the SPSS Commands  63
   Exercise  64
3 Chi-Square Test and Its Application  69
   Introduction  69
   Advantages of Using Crosstabs  70
   Statistics Used in Cross Tabulations  70
      Chi-Square Statistic  70
      Chi-Square Test  72
      Application of Chi-Square Test  73
      Contingency Coefficient  79
      Lambda Coefficient  79
      Phi Coefficient  79
      Gamma  80
      Cramer's V  80
      Kendall Tau  80
   Situation for Using Chi-Square  80
   Solved Examples of Chi-Square for Testing an Equal Occurrence Hypothesis  81
      Logit  417
      Logistic Function  417
      Logistic Regression Equation  417
      Judging the Efficiency of the Logistic Model  418
   Understanding Logistic Regression  419
      Graphical Explanation of Logistic Model  419
      Logistic Model with Mathematical Equation  421
      Interpreting the Logistic Function  422
      Assumptions in Logistic Regression  423
      Important Features of Logistic Regression  423
   Research Situations for Logistic Regression  424
   Steps in Logistic Regression  425
   Solved Example of Logistic Analysis Using SPSS  426
      First Step  427
      Second Step  428
      SPSS Commands for the Logistic Regression  428
      Interpretation of Various Outputs Generated in Logistic Regression  431
      Explanation of Odds Ratios  437
      Conclusion  437
   Summary of the SPSS Commands for Logistic Regression  437
   Exercise  438
14 Multidimensional Scaling for Product Positioning  443
   Introduction  443
   What Is Multidimensional Scaling  444
   Terminologies Used in Multidimensional Scaling  444
      Objects and Subjects  444
      Distances  445
      Similarity vs. Dissimilarity Matrices  445
      Stress  445
      Perceptual Mapping  445
      Dimensions  446
   What We Do in Multidimensional Scaling?  446
      Procedure of Dissimilarity-Based Approach of Multidimensional Scaling  446
      Procedure of Attribute-Based Approach of Multidimensional Scaling  447
      Assumptions in Multidimensional Scaling  448
      Limitations of Multidimensional Scaling  449
   Solved Example of Multidimensional Scaling (Dissimilarity-Based Approach) Using SPSS  449
Index  475
Chapter 1
Data Management
Learning Objectives
After completing this chapter, you should be able to do the following:
• Explain different types of data generated in management research.
• Know the characteristics of variables.
• Learn to remove outliers from the data by understanding different data cleaning methods before using the data in SPSS.
• Understand the difference between primary and secondary data.
• Know the formats used in this book for different SPSS commands, subcommands, and options.
• Learn to install SPSS package for data analysis.
• Understand the procedure of importing data in other formats into SPSS.
• Prepare the data file for analysis in SPSS.
Introduction

The difference between inferential and inductive studies is that the phenomenon which we infer on the basis of the sample already exists in inferential studies, whereas it is yet to occur in inductive studies. Thus, assessing the satisfaction level in an organization on the basis of a sample of employees is a problem of inferential statistics.
Finally, applied studies refer to those studies which are used in solving real-life problems. Statistical methods such as time series analysis, index numbers, quality control, and sample surveys are included in this class of analysis.
Types of Data
Depending upon the data types, two broad categories of statistical techniques are
used for data analysis. For instance, parametric tests are used if the data are metric,
whereas in case of nonmetric data, nonparametric tests are used. It is therefore
important to know in advance the types of data which are generated in management
research.
Data can be classified in two categories, that is, metric and nonmetric. Metric and
nonmetric data are also known as quantitative and qualitative data, respectively.
Metric data is analyzed using parametric tests such as t, F, Z, and correlation coeffi-
cient, whereas nonparametric tests such as sign test, median test, chi-square test,
Mann-Whitney test, and Kruskal-Wallis test are used in analyzing nonmetric data.
Certain assumptions about the data and the form of the distribution need to be satisfied when using parametric tests. Parametric tests are more powerful than nonparametric tests, provided the required assumptions are satisfied. On the other hand, nonparametric tests are more flexible and easy to use, and very few assumptions need to be satisfied before using them. Nonparametric tests are also known as distribution-free tests.
Let us understand the characteristics of different types of metric and nonmetric
data generated in research. Metric data is further classified into interval and ratio
data. On the other hand, nonmetric data is classified into nominal and ordinal. The
details of these four types of data are discussed below under two broad categories,
namely, metric data and nonmetric data, and are shown in Fig. 1.1.
Metric Data
Data is said to be metric if it is measured at least on an interval scale. Metric data is always associated with a scale measure and is therefore also known as scale data or quantitative data. Metric data can be measured on two different types of scale, namely interval and ratio.
Fig. 1.1 Classification of data types into metric and nonmetric data
Interval Data

Interval data is measured along a scale on which each position is equidistant from the next, so the distance between any two adjacent points is equivalent. In interval data, the doubling principle breaks down because there is no true zero on the scale. For instance, a score of 6 given to an individual on an IQ-based rating does not mean that the person is twice as good as a person with a score of 3. Thus, interval variables have values in which differences are uniform and meaningful, but ratios are not. Interval data may be obtained if, for example, job satisfaction or level of frustration is rated on a scale of 1–10.
Ratio Data

Data on a ratio scale has a meaningful zero value and an equidistant measure (i.e., the difference between 30 and 40 is the same as the difference between 60 and 70). For example, 60 marks obtained on a test are twice as many as 30 marks. This is so because the ratio scale has a true zero. Ratio data can be multiplied and divided because of the equidistant measure and the doubling principle. Observations that we measure or count are usually ratio data. Examples of ratio data are height, weight, sales data, stock price, advance tax, etc.
Nonmetric Data
Nominal Data
Ordinal Data
Variables on the ordinal scale are also known as categorical variables, but here the categories are ordered. The order of items is often defined by assigning numbers to them to show their relative position. Categorical variables that assess performance (good, average, poor, etc.) are ordinal variables. Similarly, attitudes (strongly agree, agree, undecided, disagree, and strongly disagree) are also ordinal variables. On the basis of the order of an ordinal variable, we may know which value is better or worse on the measured phenomenon, but the distance between ordered categories is not measurable. No arithmetic can be done with ordinal data, as they show sequence only. Data obtained on an ordinal scale is in terms of ranks. Ordinal data is denoted as "ordinal" in SPSS.
Important Definitions
Variable
A variable is a phenomenon that changes from time to time, place to place, and individual to individual. Examples of variables are salary, scores in the CAT examination, height, weight, etc. Variables can further be divided into discrete and continuous. Discrete variables are those which can assume values only from a limited set of numbers, for example, the number of persons in a department, the number of retail outlets, or the number of bolts in a box. On the other hand, continuous variables are those that can take any value within a range, for example, height, weight, or distance.
Attribute
Mutually Exclusive Attributes

Attributes are said to be mutually exclusive if they cannot occur at the same time. Thus, in a survey, a person can choose only one option from a list of alternatives (as opposed to selecting as many as might apply). Similarly, in choosing gender, one can choose either male or female.
Independent Variable
Dependent Variable
Extraneous Variable
Any additional variable that may provide alternative explanations or cast doubt on the conclusions of an experimental study is known as an extraneous variable. If the effect of three different teaching methods on students' performance is to be compared, then the IQ of the students may be termed an extraneous variable, as it might affect learning efficiency during experimentation if the IQs are not the same in all the groups.
The Sources of Research Data
In designing a research experiment, one needs to specify the kind of data required and how to obtain it. The researcher may obtain the data from a reliable source if it is available; if the required data is not available from any source, it may be collected by the researchers themselves. Several agencies collect data for specified purposes and make it available for other researchers to draw further meaningful conclusions as per their plan of study. Some commercial agencies even provide real-time data to users at a cost. Data obtained from other sources is referred to as secondary data, whereas data collected by the researchers themselves is known as primary data. We shall now discuss other features of these data in the following sections.
Primary Data
The data obtained during a study by the researchers themselves, or with the help of their colleagues, subordinates, or field investigators, is known as primary data. Primary data is obtained by the researcher when the relevant data is not available from reliable sources, when such data does not exist with any agency, or when the study is an experimental study in which specific treatments are required to be given. Primary data is much more reliable because the investigator is involved in the data collection and hence can ensure the correctness of the data. Different methods can be used by the researcher to collect primary data. These methods are explained below.
By Observation
The data in this method is obtained by observation. One can ensure the quality of the data, as the investigator himself observes the real situation and records the data. For example, to assess the quality of a product, one can observe how the articles are prepared by a particular process. In an experimental study, the performance of the subjects, their behavior, and other temperaments can be noted after they have undergone a treatment. The drawback of this method is that it can become very frustrating for the investigator to be present all the time to collect the data. Further, if an experiment involves human beings, the subjects may become self-conscious in the presence of an investigator, due to which their performance may be affected, ultimately resulting in inaccurate data.
Through Surveys
This is the most widely used method of data collection in the areas of management, psychology, market research, and other behavioral studies. The researcher must try to motivate respondents by explaining to them the purpose of the survey and the impact of their responses on the results of the study. The questionnaire must be short and must hide the identity of the respondents. Further, respondents may be given reasonable incentives as the budget allows. For instance, a pen, a pencil, or a notepad printed with statements like "With best compliments from...," "With thanks...," "Go green," or "Save the environment" may be provided before seeking their opinion on the questionnaire. You can print your organization's name, or your own name if you are an independent researcher. The first two slogans may promote your company, whereas the other two convey a social message to the respondents. The investigator must ensure the authenticity of the collected data by cross-checking some of the sampled information.
From Interviews
Data collected through direct interviews allows the investigator to pursue in-depth questioning and follow-up questions. The method is slow and costly and forces an individual to be away from the job during the time of the interview. During the interview, the respondent may provide wrong information if certain sensitive issues are touched upon, which the respondent may like to avoid on the premise that they might harm his reputation. For instance, if the respondent's salary is very low and questions are asked about his salary, it is quite likely that you will end up with wrong information. Similarly, in asking how much one spends on sports for one's children in a year, you might get wrong information due to the false ego of the respondent.
Through Logs
The data obtained through logs maintained by organizations may be used as primary data. Fault logs, error logs, complaint logs, and transaction logs may be used to extract the data required for a study. Such data, if used well, provide valuable findings about system performance over time under different conditions, as they are empirical data obtained from objective data sources.
Primary data can be considered reliable because you know how it was collected and what was done to it. It is something like cooking for yourself: you know what went into it.
Secondary Data
Data obtained from some other source, instead of by the investigator himself, is termed secondary data. Usually, agencies collect data for some specific purpose and then publish it for the use of researchers, who may draw meaningful conclusions from it as per their requirements. Many government agencies make their real-time data available to researchers for use in their research studies. For instance, census data collected by the National Sample Surveys Organization may be used by researchers to obtain several kinds of demographic and socioeconomic information. Government departments and universities maintain their open-source data and allow researchers to use it. Nowadays, many commercial agencies collect data in different fields and make it available to researchers at a nominal cost.
The secondary data may be obtained from many sources; some of them are listed
below:
• Government ministries through national informatics center
• Government departments
• Universities
• Thesis and research reports
• Open-source data
• Commercial organizations
Care must be taken to ensure the reliability of the agency from which the data is obtained. One must obtain approval from the concerned department, agency, organization, university, or individual for using their data, and due acknowledgment must be given in the research report. Further, the data source must be mentioned whenever data obtained from secondary sources is used.
In making comparison between primary and secondary data, one may conclude
that primary data is expensive and difficult to acquire, but it is more reliable.
Secondary data is cheap and easy to collect but must be used with caution.
Data Cleaning
Before preparing the data file for analysis, it is important to organize the data on paper first. There is a good chance that the data set contains errors or outliers, and if so, the results obtained may be erroneous. Analysts tend to waste a lot of time trying to draw valid conclusions from erroneous data. Thus, it is of utmost importance that the data be cleaned before analysis. If the data is clean, the analysis is straightforward and valid conclusions may be drawn.
In data cleaning, the invalid data is detected first and then it is corrected. Some of
the common sources of errors are as follows:
• Typing errors in data entry
• Not applicable option or blank options are coded as “0”
• Data for one variable column is entered under the adjacent column
• Coding errors
• Data collection errors
Detection of Errors
Wrongly fed data can be detected by means of descriptive statistics computed by SPSS. The following approaches may be useful in this regard.

Using Frequencies

Normally, the value of the standard deviation is less than the mean, except in the case of certain distributions like the negative binomial. Thus, if the standard deviation of a variable like age, height, or IQ is more than its mean, this can only happen if some of the values of that variable are outliers. Such entries can easily be identified and removed.
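The same check can also be run through SPSS syntax instead of the menus. The lines below are a minimal sketch, assuming a variable named Age already exists in the active data file (the variable name is illustrative only):

   * Compute mean, standard deviation, and extremes for Age,
   * suppressing the frequency table itself.
   FREQUENCIES VARIABLES=Age
     /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
     /FORMAT=NOTABLE.

If the standard deviation reported for Age exceeds its mean, the minimum and maximum values will usually point to the outlying entries.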
Logic Checks

Errors in data may also be detected by checking whether the responses are logical or not. For example, one would expect to see 100% of responses, not 110%. Similarly, if a female employee is asked whether she has availed maternity leave so far, the reply is marked "yes," but you notice that the respondent is coded as male, such logical errors can be spotted by examining the values of the categorical variable. The logical approach should be used judiciously to avoid embarrassing situations in reporting, such as the finding that 10% of the men in the sample had availed maternity leave during the last 10 years.
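One convenient way to run such a logic check in SPSS is to cross-tabulate the two categorical variables and look for cells that should be empty. The syntax below is a hedged sketch; the variable names Gender and MatLeave are assumed for illustration:

   * A male respondent coded "yes" on maternity leave will
   * appear as a nonzero count in an impossible cell.
   CROSSTABS
     /TABLES=Gender BY MatLeave.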
Typographical Conventions Used in This Book

Throughout the book, certain conventions have been followed in writing commands, by means of symbols, bold words, italic words, and words in quotes, to signify special meanings. Readers should note these conventions for easy understanding of the commands used in different chapters of this book.
Start ⟹ All Programs: Denotes a menu command, which means choosing the command All Programs from the Start menu. Similarly, Analyze ⟹ Correlate ⟹ Partial means open the Analyze menu, then open the Correlate submenu, and finally choose Partial.
Regression: Any word written in bold refers to the main command of a window in the SPSS package.
Prod_Data: Any word or combination of words written in italics refers to a variable.
"Name": Any word or combination of words written in double quotes refers to a subcommand.
'Scale': Any word written in single quotes refers to an option under a subcommand.
Continue: Refers to the end of the selection of commands in a window; it takes you to the next level of options in a computation.
OK: Refers to the end of selecting all the options required for a particular analysis. After pressing OK, SPSS invariably leads you to the output window.
This book has been written with reference to IBM SPSS Statistics version 20.0; however, in all the previous versions of SPSS, the computing procedure is more or less similar.
How to Start SPSS

SPSS needs to be activated on your computer before entering the data. This can be done by left-clicking the SPSS tag, reached through the SPSS directory under Start and All Programs (if the SPSS directory has been created in the Programs file). SPSS can be activated on your computer system using the following command sequence:

Start ⟹ All Programs ⟹ IBM SPSS Statistics ⟹ IBM SPSS Statistics 20
If you use the above-mentioned command sequence, the screen shall look like
Fig. 1.2.
After clicking the tag SPSS, you will get the following screen to prepare the data
file or open the existing data file.
If you are entering data for a new problem and the file is to be created for the first time, check the option "Type in data" in the above-mentioned window. If an existing file is to be opened or edited, select the option "Open an existing data source" instead.
Preparing Data File
Table 1.1 FDI inflows and trade (in percent) in different states

S.N.   FDI inflows   Exports   Imports   Trade
1      4.92          4.03      3.12      3.49
2      0.07          4.03      3.12      3.49
3      0.00          1.11      2.69      2.04
4      5.13          17.11     27.24     23.07
5      11.14         13.43     11.24     12.14
6      0.48          1.14      3.41      2.47
7      0.30          2.18      1.60      1.84
8      29.34         20.56     18.68     19.45
9      0.57          1.84      1.16      1.44
10     0.03          1.90      1.03      1.39
11     8.63          5.24      9.24      7.59
12     0.00          3.88      6.51      5.43
13     2.20          7.66      1.57      4.08
14     2.37          4.04      4.76      4.46
15     34.01         14.53     3.35      7.95
16     0.81          1.00      1.03      1.02
Click OK to get the screen to define the variables in the Variable View. Details
of preparing data file are shown below.
The procedure of preparing the data file shall be explained by means of the data
shown in Table 1.1.
Defining Variables and Their Properties Under Different Columns

In SPSS, before entering data, all the variables need to be defined in the Variable View. Once the Type in data option has been selected in the screen shown in Fig. 1.3, click Variable View. This will allow you to define all the variables in SPSS. The blank screen shall look like Fig. 1.4.
Now you are ready for defining the variables row wise.
Column 1: In the first column, short names of the variables are defined. A variable name must start with a letter and may use underscores and numerals in between, without any gap. There should be no space between any two characters of the variable name. Further, a variable name must not start with a numeral or any special character.
Column 2: Under the column heading “Type,” format of the variable (numeric or
nonnumeric) and the number of digits before and after decimal are
defined. This can be done by double-clicking the concerned cell. The
screen shall look like Fig. 1.5.
Column 3: Under the column heading "Width," the number of digits a variable can have may be altered.
Column 4: In this column, the number of decimal places a variable can have may be altered.
Column 5: Under the column heading "Label," the full name of the variable can be defined. The user can take advantage of this facility to write the expanded name of the variable in whatever way one feels like.
Column 6: Under the column heading "Values," the coding of the variable may be defined by double-clicking the cell. Sometimes the variable is classificatory in nature. For example, if there is a choice of any one of the following four departments for training
(a) Production
(b) Marketing
(c) Human resource
(d) Public relation
then these departments can be coded as 1 = Production, 2 = Marketing, 3 = Human resource, and 4 = Public relation. While entering data into the computer, these codes are entered as per the response of a particular subject. The SPSS window showing the option for entering codes has been shown in Fig. 1.6.
Column 7: In a survey study, it is quite likely that a respondent does not reply to certain questions, which creates the problem of missing values. Such missing values can be defined under the column heading "Missing."
Column 8: Under the heading "Columns," the width of the column space where data is typed in Data View is defined.
Column 9: Under the column heading "Align," the alignment of the data while feeding may be defined as left, right, or center.
Column 10: Under the column heading "Measure," the variable type may be defined as scale, ordinal, or nominal.
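The value labels and missing-value definitions described above can also be set through SPSS syntax rather than the Variable View grid. The lines below are a minimal sketch; the variable name Dept and the missing-value code 9 are assumed for illustration:

   * Attach labels to the department codes described under Column 6.
   VALUE LABELS Dept 1 'Production' 2 'Marketing'
     3 'Human resource' 4 'Public relation'.
   * Treat the code 9 as a user-missing value for Dept.
   MISSING VALUES Dept (9).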
Fig. 1.6 Screen showing how to define the code for the different labels of the variable
Defining Variables for the Data in Table 1.1

1. Write the short names of all the five variables as States, FDI_Inf, Export, Import, and Trade under the column heading "Name."
2. Under the column heading "Label," the full names of these variables may be defined as FDI Inflows, Export Data, Import Data, and Trade Data. One can take the liberty of defining more detailed names for these variables as well.
3. Use the default entries in the rest of the columns.
After defining variables in the variable view, the screen shall look like Fig. 1.7.
Entering the Data

After defining all the five variables in the Variable View, click Data View at the bottom left of the screen to open the format for entering the data. The data can be entered column-wise for each variable. After entering the data, the screen will look like Fig. 1.8. Save the data file in the desired location before further processing.
Fig. 1.7 Variables along with their characteristics for the data shown in Table 1.1

Fig. 1.8 Screen showing entered data for all the variables in the data view

After preparing the data file, one may use the different types of statistical analysis available under the tag Analyze in the SPSS package. Different types of statistical analyses have been discussed in different chapters of the book, along with their interpretations. Methods of data entry differ in certain applications; for instance, readers are advised to note carefully the way data is entered for the application in Example 6.2 in Chap. 6. Relevant details have been discussed in that chapter.
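Any procedure under the Analyze tag can also be invoked through syntax, which is convenient when an analysis has to be repeated. As a small, hedged example for the file of Table 1.1 (using the short variable names defined above):

   * Summary statistics for the four metric variables.
   DESCRIPTIVES VARIABLES=FDI_Inf Export Import Trade
     /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM.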
Importing Data in SPSS

In SPSS, data can be imported from ASCII as well as Excel files. The procedure for importing these two types of data files is discussed in the following sections.
Importing Data from an ASCII File

In an ASCII file, the data for each variable may be separated by a space, tab, comma, or some other character. The Text Import Wizard in SPSS enables you to import data from an ASCII file format. Consider a data set in an ASCII file saved on the desktop under the file name Business data.

1. Use the command sequence File > Open > Data to reach the file selection window.
– Choose "Text" as the "File Type" if your ASCII file has the .txt extension. Otherwise, choose the option "All files."
– After selecting the file that you want to import, click Open as shown in Fig. 1.9.
2. After choosing the ASCII file from its saved location in Fig. 1.9, the Text Import Wizard will pop up automatically, as shown in Fig. 1.10, and will take you to further options for importing the file. Take the following step:
– If your file does not match a predefined format (which is usually the case), simply click Next.
3. After clicking the Next option above, you will get the screen shown in Fig. 1.11. Take the following steps:
– Check the option "Delimited," as the data in the file is separated by either a space or a comma.
– If the variable names are written in the first row of your data file, set the header row option to "Yes," otherwise "No." Here, the option "Yes" will be selected because the variable names have been written in the first row of the data file. Click Next.
Fig. 1.9 Selecting an ASCII file saved as a text file for importing in SPSS

Fig. 1.10 Import text wizard for opening an ASCII file in SPSS

Fig. 1.11 Defining options for delimiter and header row

Fig. 1.12 Defining options for beginning line of data and number of cases to be selected

4. After clicking the option Next, you will get the screen shown in Fig. 1.12. Enter the line number where the first case of your data begins. If there is no variable name in the first line of the data file, line 1 is selected; otherwise, line 2 is selected, as the data starts from line 2 of the data file. Take the following steps:
– Check the option "Each line represents a case." Normally, each line in your data file represents a case.
– Check the option "All of the cases." Usually, you import all the cases from the file. The other option may be tried if only a few cases are to be imported from the file. Click Next to get the screen shown in Fig. 1.13.
5. In Fig. 1.13, the delimiters of the data file (here a comma or space) are set:
– Check the delimiter "Comma," as the data is separated by commas. Other delimiters may be selected if they are used in the data file.
– Check "Double quote" as the text qualifier. Other options may be checked if the values are enclosed by something other than double quotes.
– On the basis of the options you choose, SPSS formats the file in the small screen at the bottom, where you can check that everything is set correctly. Click Next when everything is OK to get the screen shown in Fig. 1.14.
6. In Fig. 1.14, you can define the specifications for the variables, but you may simply skip this if you have already defined your variables or want to do it later. Click Next to get the screen shown in Fig. 1.15.
7. In Fig. 1.15, keep all the default options and check whether your actual data file is shown in the window. Once your data is shown in the window, click Finish. This will import your file into SPSS successfully.
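The same import can be captured in syntax, which is convenient when the file has to be re-read repeatedly. The sketch below assumes a comma-delimited file with a header row; the path and the variable names and formats are illustrative only:

   * Read a comma-delimited text file, skipping the header row.
   GET DATA
     /TYPE=TXT
     /FILE='C:\Users\me\Desktop\Business data.txt'
     /ARRANGEMENT=DELIMITED
     /DELIMITERS=","
     /QUALIFIER='"'
     /FIRSTCASE=2
     /VARIABLES=
       Sales F8.2
       Profit F8.2.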
Importing Data File from Excel Format

A data file prepared in Excel can be imported into SPSS by a simple command. While importing an Excel data file, one must ensure that it is not open. The sequence of commands for importing an Excel data file is as follows:
1. File > Open > Data > requiredfile
– Choose "Excel" as the File Type if your Excel file has the .xls extension. Otherwise, choose the option "All files."
– After selecting the file that you want to import, click Open as shown in Fig. 1.16.
2. After choosing the required Excel file from its saved location in Fig. 1.16, you will get the pop-up screen called "Opening Excel Data Source," as shown in Fig. 1.17. Take the following steps:
– Check the option "Read variable names from the first row of data" if you are using a header row in the data file.
– Select the worksheet from which you want to import the data. The screen will show you all the worksheets of the file containing data. If you have data only in the first worksheet, leave this option as it is.
– If you want to use only a portion of the data from the file, define the fields in the "Range" option, like A3:E8. This means that the data from cell A3 through cell E8 shall be selected.
– Press Continue to open the Excel file in SPSS.
Fig. 1.17 Option for defining the range of data in Excel sheet to be imported in SPSS
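For repeated use, the Excel import can likewise be expressed in syntax. The following is a minimal sketch under assumed names; the path, sheet name, and cell range are illustrative:

   * Read an Excel worksheet, taking variable names from the
   * first row and restricting the import to the range A3:E8.
   GET DATA
     /TYPE=XLSX
     /FILE='C:\Users\me\Desktop\Business data.xlsx'
     /SHEET=NAME 'Sheet1'
     /CELLRANGE=RANGE 'A3:E8'
     /READNAMES=ON.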
Exercise
Short-Answer Questions
Note: Write the answer to each of the questions in not more than 200 words.
Q.1. What do you mean by inductive and inferential statistics? What is the differ-
ence between them? Explain by means of example.
Q.2. What do you mean by metric and nonmetric data? Discuss them by means of
example.
Q.3. Under what situations should analytical studies be conducted? Discuss a situation where such a study can be used.
Q.4. What do you mean by mutually exclusive and independent attributes? Give
two examples where the attributes are not mutually exclusive.
Q.5. What is an extraneous variable? How does it affect the findings of an experiment? Suggest remedies for eliminating its effects.
Q.6. While feeding the data in SPSS, what are the possible mistakes that a user
might commit?
Q.7. Explain in brief how an error can be identified during data feeding.
Multiple-Choice Questions
Note: For each of the question, there are four alternative answers. Tick mark the
one that you consider the closest to the correct answer.
1. Consider the following statements:
I. Parametric tests do not assume anything about the form of the distribution.
II. Nonparametric tests are simple to use.
III. Parametric tests are the most powerful if their assumptions are satisfied.
IV. Nonparametric tests are based upon the assumption of normality.
Choose the correct statements from those listed above.
(a) (I) and (II)
(b) (I) and (III)
(c) (II) and (III)
(d) (III) and (IV)
2. If the respondents were required to rate themselves on emotional strength on a
9-point scale, what type of data would be generated?
(a) Ratio
(b) Interval
(c) Nominal
(d) Ordinal
Answers: Q.1 (c), Q.2 (b), Q.3 (d), Q.4 (b), Q.5 (a), Q.6 (a), Q.7 (d), Q.8 (b), Q.9 (b), Q.10 (b)
Chapter 2
Descriptive Analysis
Learning Objectives
After completing this chapter, you should be able to do the following:
• Learn the importance of descriptive studies.
• Know the various statistics used in descriptive studies.
• Understand the situations in management research for undertaking a descriptive
study.
• Describe and interpret various descriptive statistics.
• Learn the procedure of computing descriptive statistics using SPSS.
• Know the procedure of developing the profile chart of a product or organization.
• Discuss the findings in the outputs generated by the SPSS in a descriptive study.
Introduction
Descriptive studies are carried out to understand the profile of any organization that
follows certain common practice. For example, one may like to know or be able to
describe the characteristics of an organization that implement flexible working
timing or that have a certain working culture. Descriptive studies may be
undertaken to describe the characteristics of a group of employees in an organiza-
tion. The purpose of descriptive studies is to prepare a profile or to describe
interesting phenomena from an individual or an organizational point of view.
Although descriptive studies can identify sales pattern over a period of time or in
different geographical locations but cannot ascertain the causal factors. These
studies are often very useful for developing further research hypotheses for testing.
Descriptive research may include case studies, cross-sectional studies, or longitu-
dinal investigations.
Different statistics are computed in descriptive studies to describe the nature of the data. These statistics, computed from the sample, provide a summary of various measures. Descriptive statistics are computed in almost every experimental research study. The primary goal in a descriptive study is to describe the sample at a specific point in time without trying to make inferences or causal statements. Normally, there are three primary reasons to conduct such studies:
1. To understand an organization by knowing its system
2. To help in needs assessment and planning resource allocation
3. To identify areas of further research
Descriptive studies help in identifying patterns and relationships that might otherwise go unnoticed.
A descriptive study may be undertaken to ascertain and describe the characteristics of the variables of interest in a given situation. For instance, a study of an organization in terms of the percentage of employees in different age categories, their job satisfaction level, motivation level, gender composition, and salary structure can be considered a descriptive study. Quite frequently, descriptive studies are undertaken in organizations to understand the characteristics of a group of employees, such as age, educational level, job status, and length of service in different departments.
Descriptive studies may also be undertaken to know the characteristics of all
those organizations that operate in the same sector. For example, one may try to
describe the production policy, sales criteria, or advertisement campaign in phar-
macy companies. Thus, the goal of descriptive study is to offer the researcher a
profile or to describe relevant aspects of the phenomena of interest in an organiza-
tion, industry, or a domain of population. In many cases, such information may be
vital before considering certain corrective steps.
In a typical profile study, we compute various descriptive statistics like the mean, standard deviation, coefficient of variation, range, skewness, and kurtosis. These descriptive statistics explain different features of the data. For instance, the mean gives an average value of the measurement, whereas the standard deviation describes the variation of the scores around their mean value; the coefficient of variation provides the relative variability of the scores; the range gives the maximum variation; skewness describes the symmetry of the distribution; and kurtosis describes its peakedness or flatness.
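All of these profile statistics can be obtained in one pass in SPSS. The syntax below is a hedged sketch; the variable name Satisfaction is assumed for illustration:

   * Profile statistics for one variable: center, spread, and shape.
   DESCRIPTIVES VARIABLES=Satisfaction
     /STATISTICS=MEAN STDDEV RANGE SKEWNESS KURTOSIS.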
In descriptive studies, one tries to obtain information regarding the current status of different phenomena. The purpose of such a study is to describe "what exists" with respect to the situational variables.
In descriptive research, the statement of the problem needs to be defined first, and then the identification of the required information is planned. Once the objectives of the study are identified, the method of data collection is planned so as to obtain an unbiased sample, and it is therefore important to define the population domain clearly. Further, an optimum sample size should be selected for the study, as this enhances the efficiency of estimating the population characteristics.
Once the data is collected, it should be compiled in a meaningful manner for further processing and reporting. The nature of each variable can be studied by looking at the values of different descriptive statistics. If the purpose of the study is
analytical as well, then the data may further be analyzed for testing the different formulated hypotheses.
On the basis of the descriptive statistics and graphical pictures of the parameters, different kinds of generalizations and predictions can be made. While conducting descriptive studies, one also gains insight for identifying the future scope of related research studies.
Measures of Central Tendency

Researchers are often interested in defining a value that best describes some characteristic of the population. Often, this characteristic is a measure of central tendency. A measure of central tendency is a single score that attempts to describe a set of data by identifying the central position within that set of data. The three most common measures of central tendency are the mean, the median, and the mode. Measures of central tendency are also known as measures of central location. Perhaps you are most familiar with the mean (also known as the average) as a measure of central tendency, but there are others, such as the median and the mode, which are appropriate in specific situations.
The mean, median, and mode are all valid measures of central tendency, but, under different conditions, some measures become more appropriate than others. In the following sections, we will look at the various features of the mean, median, and mode and the conditions under which each is most appropriate.
Mean
The mean is the most widely used and most popular measure of central tendency. It is also termed the average. It gives an idea of what an average score looks like. For instance, one might be interested in knowing the average sale of items per day on the basis of the monthly sales figures. The mean is a good measure of central tendency for symmetric distributions but may be misleading in the case of skewed distributions. The mean can be computed with both discrete and continuous data. It is obtained by dividing the sum of all scores by the number of scores in the data set.
If $X_1, X_2, \ldots, X_n$ are the $n$ scores in the data set, then the sample mean, usually denoted by $\bar{X}$ (pronounced "X bar"), is

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
This formula is usually written using the Greek capital letter $\sum$, pronounced "sigma," which means "sum of":

$$\bar{X} = \frac{1}{n}\sum X \qquad (2.1)$$

The corresponding population mean is denoted by $\mu$:

$$\mu = \frac{1}{n}\sum X$$

The mean is a model of your data set and indicates that, on average, the data tend to concentrate around it. You may notice that the mean is often not one of the actual values observed in your data set.
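As a quick worked illustration of Eq. (2.1), take the ten weights used later in the median example (56, 45, 53, 41, 48, 53, 52, 65, 38, 42):

$$\bar{X} = \frac{56 + 45 + 53 + 41 + 48 + 53 + 52 + 65 + 38 + 42}{10} = \frac{493}{10} = 49.3$$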
If $X_1, X_2, X_3, \ldots, X_n$ are $n$ scores with frequencies $f_1, f_2, f_3, \ldots, f_n$, respectively, in the data set, then the mean is computed as

$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{\sum f_i X_i}{n} \qquad (2.2)$$

where $\sum f_i X_i$ is the total of all the scores and $n = \sum f_i$.
In case the data is arranged in class-interval format, $X$ will be the midpoint of the class interval. Let us see how to read the data shown in class-interval form in Table 2.1. The first class interval shows that ten articles are in the price range of Rs. 1–50, six articles are in the range of Rs. 51–100, and so on. Here, the exact price of each article is not known because the prices have been grouped together. Thus, in the case of grouped data, the scores lose their own identity. This becomes problematic, as it is difficult to add scores whose magnitudes are not known. To overcome this problem, an assumption is made while computing the mean and standard deviation from grouped data: it is assumed that the frequency is concentrated at the midpoint of the class interval. By assuming this, the identity of each and every score is regained, which makes it possible to compute the sum of all the scores, as required for computing the mean and standard deviation. With this assumption, however, it is quite likely that the scores are either underestimated or overestimated. For instance, in Table 2.1, if all the ten items
would have had prices in the range of Rs. 1–50 but, due to the assumption, are taken to have the price Rs. 25.5, a negative error may be created, which is added in the computation of the mean. It is equally likely that the prices of the other six items are on the higher side, say Rs. 90, whereas they are assumed to have the price Rs. 75.5, which creates a positive error. Thus, these positive and negative errors tend to add up to zero in the computation of the mean.
In Table 2.1, $\sum fX$ represents the sum of all the scores, and therefore,

$$\bar{X} = \frac{\sum f_i X_i}{n} = \frac{2914}{28} = 104.07$$
To simplify the computation, the scores may be transformed by a change of origin and scale:

$$D = \frac{X - A}{i}$$

where $A$ and $i$ are the origin and scale, respectively. Thus, any value which is subtracted from all the scores in the data set is termed the origin, and any value by which all the scores are divided is known as the scale. The choice of origin and scale is up to the researcher, but the one criterion to keep in mind is that the very purpose of the transformation is to simplify the data and the computation.
Let us see the effect of the change of origin and scale on the computation of the mean. If all the $X$ scores are transformed into $D$ by the above transformation, then taking the summation on both sides,

$$\sum D = \sum \frac{X - A}{i} \;\Rightarrow\; \sum (X - A) = i \sum D$$

Dividing both sides by $n$ gives

$$\bar{X} - A = i\bar{D} \;\Rightarrow\; \bar{X} = A + i\bar{D}$$
Thus, we have seen that if all the scores $X$ are transformed into $D$ by changing the origin and scale to $A$ and $i$, respectively, then the original mean can be obtained by multiplying the new mean $\bar{D}$ by the scale $i$ and adding the origin value to it. It may therefore be concluded that the mean is not independent of a change of origin and scale.
In case of grouped data, the mean can be computed by transforming the scores obtained as midpoints of the class intervals. Consider the data shown in Table 2.1 once again. After computing the midpoints of the class intervals, let us transform the scores by changing the origin and scale to 175.5 and 50, respectively. Usually, the origin (A) is taken as the midpoint of the middlemost class interval, and the scale (i) is taken as the width of the class interval. The origin A is also known as the assumed mean (Table 2.2).
Here, the width of the class interval = i = 50 and the assumed mean A = 175.5.
Since we know that

\bar{X} = A + i\bar{D} = A + i \cdot \frac{1}{n} \sum f D

\Rightarrow \bar{X} = 175.5 + 50 \times \frac{1}{28} \times (-40)

= 175.5 - 71.43 = 104.07
In computing the mean, the factor i(1/n)ΣfD can be considered as the correction factor. If the assumed mean is taken higher than the actual mean, the correction factor shall be negative, and if it is taken lower than the actual mean, the correction factor will be positive. One may even take the assumed mean as the midpoint of the lowest or highest class interval, but in that case the magnitude of the correction factor shall be higher and the very purpose of simplifying the computation shall be defeated. Thus, the correct strategy is to take the midpoint of the middlemost class interval as the assumed mean. However, in case the number of class intervals is even, the midpoint of either of the two middle class intervals may be taken as the assumed mean.
Properties of Mean
3. If two groups of n₁ and n₂ scores have the means \bar{X}_1 and \bar{X}_2, respectively, then the combined mean of the two groups is given by

\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}

4. The sum of the deviations of a set of values from their arithmetic mean is always 0. In other words,

\sum (X - \bar{X}) = 0
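Both properties are straightforward to verify numerically. A minimal sketch in Python with made-up scores:

# Combined mean of two groups, and the zero-sum property of deviations.
g1 = [2, 4, 6, 8]      # n1 = 4, mean 5
g2 = [10, 20, 30]      # n2 = 3, mean 20
m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
combined = (len(g1) * m1 + len(g2) * m2) / (len(g1) + len(g2))
print(combined == sum(g1 + g2) / len(g1 + g2))        # True

x = g1 + g2
xbar = sum(x) / len(x)
print(round(abs(sum(v - xbar for v in x)), 10))       # 0.0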
Median
Median is the middlemost score in the data set arranged in order of magnitude. It is
a positional average and is not affected by the extreme scores. If X1, X2, . . ., Xn are
the n scores in a data set arranged in the ascending or descending order, then its
median is obtained by
M_d = \left(\frac{n+1}{2}\right)\text{th score}   (2.4)

One should note that (n + 1)/2 is not itself the median; the score lying at that position is the median. Consider the weights of the following ten subjects: 56, 45, 53, 41, 48, 53, 52, 65, 38, 42.
After arranging the scores in ascending order:

S.N.:    1   2   3   4   5   6   7   8   9   10
Weight: 38  41  42  45  48  52  53  53  56  65

Here, n = 10. Thus,

M_d = \left(\frac{10+1}{2}\right)\text{th score} = 5.5\text{th score} = \frac{48 + 52}{2} = 50
In case of an odd number of scores, a single score lies in the middle, but in case of an even number of scores, the middlemost score is obtained by taking the average of the two middle scores.
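The rule translates directly into code; a minimal sketch in Python for the ten weights above:

# Median of raw scores: middle value (odd n) or mean of the two middle values (even n).
def median(scores):
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

weights = [56, 45, 53, 41, 48, 53, 52, 65, 38, 42]
print(median(weights))   # 50.0 -> average of the 5th and 6th ordered scores (48 and 52)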
Median is used in case the effect of extreme scores needs to be avoided. For
example, consider the marks of the students in a college as shown below:
Student: 1 2 3 4 5 6 7 8 9 10
Marks: 35 40 30 32 35 39 33 32 91 93
The mean score for these ten students is 46. However, the raw data suggests that
this mean value might not be the best way to accurately reflect the typical perfor-
mance of a student, as most students have marks in between 30 and 40. Here, the
mean is being skewed by the two large scores. Therefore, in this situation, the median gives a better estimate of the average than the mean. Thus, in a situation where the
effect of extreme scores needs to be avoided, median should be preferred over
mean. In case the data is normally distributed, the values of mean, median, and
mode are same. Moreover, they all represent the most typical value in the data set.
However, as the data becomes skewed, the mean loses its ability to provide the best
central location as the mean is being dragged in the direction of skew. In that case,
the median best retains this position and is not influenced much by the skewed
values. As a rule of thumb if the data is non-normal, then it is customary to use the
median instead of the mean.
While computing the median for grouped data, it is assumed that the frequencies are
equally distributed in the class interval. This assumption is also used in computing
the quartile deviation because median and quartile deviation both are nonparamet-
ric statistics and depend upon positional score. In case of grouped data, the median
is computed by the following formula:
M_d = l + \frac{\frac{n}{2} - F}{f_m} \times i   (2.5)

where
l : lower limit of the median class
n : total of all the frequencies
F : cumulative frequency of the class just below the median class
f_m : frequency of the median class
i : width of the class interval
The computation of the median shall be shown by means of an example.
Consider the marks in mathematics obtained by the students as shown in Table 2.3.
In computing median, first of all we need to find the median class. Median class
is the one in which the median is supposed to lie. To obtain the median class, we
compute n/2 and then we look for this value in the column of cumulative frequency.
The class interval for which the cumulative frequency includes the value n/2 is
taken as median class.
Here, n = 70, and therefore, n/2 = 70/2 = 35.
Now, we look for 35 in the column of cumulative frequency. You can see that
the class interval 31–35 has a cumulative frequency 48 which includes the value
n/2 ¼ 35. Thus, 31–35 is the median class. After deciding the median class, the
median can be computed by using the formula (2.5).
Here, l = lower limit of the median class = 30.5
f_m = frequency of the median class = 20
F = cumulative frequency of the class just below the median class = 28
i = width of the class interval = 5

Substituting these values in the formula (2.5),

M_d = l + \frac{\frac{n}{2} - F}{f_m} \times i = 30.5 + \frac{35 - 28}{20} \times 5 = 30.50 + 1.75 = 32.25
In computing the lower limit of the median class, 0.5 has been subtracted from the lower limit because the class intervals are discrete. Any value which is equal to or greater than 30.5 shall fall in the class interval 31–35, and that is why the actual lower limit is taken as 30.5 instead of 31. But in case of continuous class intervals, the lower limit of the class interval is the actual lower limit, and we do not subtract 0.5 from it. In case of continuous class intervals, it is further assumed that the upper limit is excluded from the class interval. This makes the class intervals mutually exclusive.
In Table 2.3, the lowest class interval is truncated, and therefore, its midpoint cannot be computed; hence, the mean cannot be computed in this situation. Thus, if the class intervals are truncated at one or both ends, the median is the best choice as a measure of central tendency.
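Formula (2.5) is a one-line computation once the median class is identified. A minimal sketch in Python, using the values worked out above from Table 2.3:

# Grouped median, formula (2.5): Md = l + ((n/2 - F) / fm) * i
def grouped_median(l, n, F, fm, i):
    return l + ((n / 2 - F) / fm) * i

print(grouped_median(l=30.5, n=70, F=28, fm=20, i=5))   # 32.25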
Mode
Mode can be defined as the score that occurs most frequently in a set of data. If the
scores are plotted, then the mode is represented by the highest bar in a bar chart or
histogram. Therefore, mode can be considered as the most popular option in the set
of responses. Usually, the mode is computed for categorical data, where we wish to know which category is the most common. The advantage of the mode is that it is not affected by extreme scores (outliers). Sometimes, there could be two scores having equal or nearly equal frequencies in the data set. In that case, the data set will have two modes, and the distribution shall be known as bimodal. Thus, on the basis of the number of modes, the distribution of the scores may be unimodal, bimodal, or multimodal. Consider the following data set: 2, 5, 4, 7, 6, 3, 7, 8, 7, 9, 1, 7. Here, the score 7 is repeated the maximum number of times; hence, the mode of this data set is 7.
The mode can be used in a variety of situations. For example, if a pizza shop sells 12 different varieties of pizzas, the mode would represent the most popular pizza. The mode may be computed to know which textbook is more popular than the others, and accordingly, the publisher would print more copies of that book instead of printing an equal number of all books.
Similarly, it is important for a manufacturer to produce more of the most popular shoes, because manufacturing different shoes in equal numbers would cause a shortage of some shoes and an oversupply of others. Other applications of the mode may be to find the most popular brand of soft drink or biscuits and to take the manufacturing decision accordingly.
Drawbacks of Mode
In computing the mode with grouped data, first of all one needs to identify the modal class. The class interval for which the frequency is maximum is taken as the modal class. The frequency of the modal class is denoted by f_m, and the frequencies of the classes just before and after the modal class are represented by f₁ and f₂, respectively. Once these frequencies are identified, they can be used to compute the value of the mode. The formula for computing the mode with grouped data is given by
M_0 = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times i   (2.6)

where
l : lower limit of the modal class
f_m : frequency of the modal class
f_1 : frequency of the class just below the modal class
f_2 : frequency of the class just above the modal class
i : width of the class interval
Table 2.4 shows the distribution of the age of bank employees. Let us compute the value of the mode in order to find the most frequent age of the employees in the bank.
Since the maximum frequency is 50 for the class interval 26–30, this will be the modal class here.
M_0 = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times i = 25.5 + \frac{50 - 25}{2 \times 50 - 25 - 10} \times 5 = 25.5 + 1.92 = 27.42
Thus, one may conclude that most of the employees in the bank are around 27 years of age.
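Formula (2.6) can be checked the same way; a minimal sketch in Python with the values of Table 2.4 used above:

# Grouped mode, formula (2.6): M0 = l + (fm - f1) / (2*fm - f1 - f2) * i
def grouped_mode(l, fm, f1, f2, i):
    return l + (fm - f1) / (2 * fm - f1 - f2) * i

print(round(grouped_mode(l=25.5, fm=50, f1=25, f2=10, i=5), 2))   # 27.42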
Measures of Variability
Variability refers to the extent to which the scores vary from each other. A data set is said to have high variability when it contains values considerably higher and lower than the mean value. The terms variability, dispersion, and spread are synonyms and refer to how much the distribution is spread out. A measure of central tendency describes the central location of the data set, but the central location by itself is not sufficient to define the characteristics of the data set. Two data sets may be similar in their central location but may differ in their variability. Thus, a measure of central tendency and a measure of variability are both required to describe the nature of the data correctly. There are four measures of variability that are frequently used: the range, the interquartile range, the variance, and the standard deviation. In the following paragraphs, we will look at each of these four measures of variability in more detail.
The Range
The range is the crudest measure of variability and is obtained by subtracting the
lowest score from the highest score in the data set. It is rarely used because it is
based on only two extreme scores. The range is simple to compute and is useful
when it is required to evaluate the whole of a data set. The range is useful in
showing the maximum spread within a data set. It can be used to compare the spread
between similar data sets.
Using the range becomes problematic if one of the extreme scores is exceptionally high or low (referred to as an outlier). In that case, the range so computed may not represent the true variability within the data set. Consider a situation where the scores obtained by students on a test were recorded and the minimum and maximum scores were 25 and 72, respectively. If a particular student did not appear in the exam for some reason and his score was posted as zero, then the range becomes 72 (72 − 0) instead of 47 (72 − 25). Thus, in the presence of an outlier, the range gives a wrong picture of the variability within the data set. To overcome the problem of outliers in a data set, the interquartile range is often calculated instead of the range.
The interquartile range is a measure that indicates the variability of the central 50% of the values within the data set. The data can further be divided into quarters by identifying the lower and upper quartiles. The lower quartile (Q₁) is equivalent to the 25th percentile of the data arranged in order of magnitude, whereas the upper quartile (Q₃) is equivalent to the 75th percentile. Thus, Q₁ is a point below which 25% of the scores lie, and Q₃ refers to a score below which 75% of the scores lie. Since the median is a score below which 50% of the scores lie, the upper quartile lies halfway between the median and the highest value in the data set, whereas the lower quartile lies halfway between the median and the lowest value in the data set. The interquartile range is computed by subtracting the lower quartile from the upper quartile and is given by

Q = Q_3 - Q_1   (2.7)
The interquartile range provides a better picture of the overall data set by ignoring the outliers. However, just like the range, the interquartile range also depends upon only two values. Statistically, the standard deviation is a more powerful measure of variability, as it is computed from all the values in the data set.
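For ungrouped data, the quartiles and the interquartile range can be obtained with numpy. A minimal sketch (note that numpy's default linear interpolation for percentiles may differ slightly from the positional method described above):

import numpy as np

# Interquartile range, formula (2.7): Q = Q3 - Q1
scores = np.array([38, 41, 42, 45, 48, 52, 53, 53, 56, 65])
q1, q3 = np.percentile(scores, [25, 75])
print(q1, q3, q3 - q1)   # 42.75 53.0 10.25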
The standard deviation is the most widely used measure of variability; its value depends upon how closely the scores cluster around the mean value. It can be computed only for interval or ratio data. The standard deviation is the square root of the average squared deviation of the scores from the mean value and is represented by σ (termed sigma):

\sigma = \sqrt{\frac{1}{N} \sum (X - \mu)^2}
After simplification,

\sigma = \sqrt{\frac{1}{N} \sum X^2 - \left(\frac{\sum X}{N}\right)^2}   (2.8)

where μ is the population mean. The symbol σ is used for the population standard deviation, whereas S is used for the sample standard deviation. The population standard deviation σ
can be estimated from the sample data by the following formula:
S = \sqrt{\frac{1}{n-1} \sum (X - \bar{X})^2}

After simplifying,

S = \sqrt{\frac{1}{n-1} \sum X^2 - \frac{(\sum X)^2}{n(n-1)}}   (2.9)
If X₁, X₂, X₃, . . ., Xₙ are the n scores with f₁, f₂, f₃, . . ., fₙ frequencies respectively in the data set, then the standard deviation shall be given as

S = \sqrt{\frac{1}{n-1} \sum f (X - \bar{X})^2}

After simplification,

S = \sqrt{\frac{1}{n-1} \sum f X^2 - \frac{(\sum f X)^2}{n(n-1)}}   (2.10)
where \bar{X} refers to the sample mean. The standard deviation measures the aggregate
variation of every value within a data set from the mean. It is the most robust and
widely used measure of variability because it takes into account every score in the
data set.
When the scores in a data set are tightly bunched together, the standard deviation
is small. When the scores are widely apart, the standard deviation will be relatively
large. The standard deviation is usually presented in conjunction with the mean and
is measured in the same units.
The sample standard deviation of a series of scores can be computed by using the formula (2.9). Following are the data on a memory retention test obtained from 10 individuals. The scores are the number of items recollected by the individuals (Table 2.5).
Here, n = 10, \sum X = 47, and \sum X^2 = 247.
Substituting these values in the formula (2.9),

S = \sqrt{\frac{1}{n-1} \sum X^2 - \frac{(\sum X)^2}{n(n-1)}} = \sqrt{\frac{247}{9} - \frac{(47)^2}{10 \times 9}} = \sqrt{27.44 - 24.54} = 1.7
Thus, the standard deviation of the test scores on memory retention is 1.7. From this value of the standard deviation alone, no conclusion can be drawn as to whether the variation is small or large, because the standard deviation is a measure of absolute variability. This problem can be solved by computing the coefficient of variation, which will be discussed later in this chapter.
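Formula (2.9) needs only n, ΣX, and ΣX², so the computation above can be checked in a couple of lines; a minimal sketch in Python:

from math import sqrt

# Sample SD from sums, formula (2.9)
n, sum_x, sum_x2 = 10, 47, 247
S = sqrt(sum_x2 / (n - 1) - sum_x**2 / (n * (n - 1)))
print(round(S, 2))   # 1.7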
Let us see what happens to the standard deviation if the origin and scale of the scores are changed in the data set. Let the scores be transformed by using the following transformation:

D = \frac{X - A}{i} \quad \Rightarrow \quad X = A + iD

where "A" is the origin and "i" is the scale. One can choose any value for the origin, but the value of the scale is usually the width of the class interval.
Taking summation on both sides and dividing both sides by n, we get

\bar{X} = A + i\bar{D}

Subtracting the two relations gives X - \bar{X} = i(D - \bar{D}), so the standard deviations are related by

S_X = i \, S_D   (2.11)

Thus, it may be concluded that the standard deviation is free from the change of origin but is affected by the change of scale.
Let us compute the standard deviation for the data shown in Table 2.1, reproduced in Table 2.6. After computing the midpoints of the class intervals, let us transform the scores by taking the origin and scale as 175.5 and 50, respectively. Usually, the origin (A) is taken as the midpoint of the middlemost class interval, and the scale (i) is taken as the width of the class interval. The origin A is also known as the assumed mean.
Here, the width of the class interval = i = 50 and the assumed mean A = 175.5.
From the equation (2.11),

S_X = i \, S_D = i \sqrt{\frac{1}{n-1} \sum f (D - \bar{D})^2}

After simplification,

S_X = i \sqrt{\frac{1}{n-1} \sum f D^2 - \frac{(\sum f D)^2}{n(n-1)}}

Substituting the values of n, \sum f D, and \sum f D^2, we get

S_X = 50 \sqrt{\frac{128}{28 - 1} - \frac{(-40)^2}{28 \times 27}} = 50\sqrt{4.74 - 2.12} = 80.93
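The same figure follows from the coded sums quoted in the text (n = 28, ΣfD = −40, ΣfD² = 128); a minimal sketch in Python:

from math import sqrt

# Step-deviation SD: S_X = i * sqrt(sum(fD^2)/(n-1) - (sum(fD))^2 / (n(n-1)))
n, i = 28, 50
sum_fd, sum_fd2 = -40, 128
S_X = i * sqrt(sum_fd2 / (n - 1) - sum_fd**2 / (n * (n - 1)))
print(round(S_X, 2))   # 81.0; the text's 80.93 comes from rounding the
                       # intermediate values to 4.74 and 2.12 before the square root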
Variance
The variance is the square of the standard deviation. It can be defined as the average of the squared deviations of the scores from their mean value. It measures the variation of the scores in the distribution, that is, the magnitude of variation among the scores around the mean value. In other words, it measures the consistency of the data. A higher variance indicates more heterogeneity, whereas a lower variance represents more homogeneity in the data.
Like standard deviation, it also measures the variability of scores that are
measured on an interval or ratio scale. The variance is usually represented by σ² and is computed as

\sigma^2 = \frac{1}{N} \sum (X - \mu)^2   (2.12)
The variance can be estimated from the sample by using the following formula:

S^2 = \frac{1}{n-1} \sum (X - \bar{X})^2 = \frac{1}{n-1} \sum X^2 - \frac{(\sum X)^2}{n(n-1)}
Measures of variability like range, standard deviation, or variance are computed for
interval or ratio data. What if the data is in nominal form? In social research, one
may encounter many situations where it is required to measure the variability of the
data based on nominal scale. For example, one may like to find the variability of
ethnic population in a city, variation in the responses on different monuments,
variability in the liking of different sports in an institution, etc. In all these
situations, an index of qualitative variation (IQV) may be computed by the following formula to find the magnitude of variability:

IQV = \frac{K(100^2 - \sum P^2)}{100^2 (K - 1)}   (2.13)

where
K : the number of categories
\sum P^2 : the sum of the squared percentages of frequencies in all the groups
The IQV is based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution. The IQV can vary from 0.00 to 1.00. When all the cases are in one category, there is no variation and the IQV is 0.00. On the other hand, if all the cases are equally distributed across the categories, there is maximum variation and the IQV is 1.00.
To show the computation process, consider an example where the number of students belonging to different communities was recorded as shown in Table 2.7.
Here, we have K = number of categories = 5:

IQV = \frac{K(100^2 - \sum P^2)}{100^2 (K - 1)} = \frac{5(100^2 - 5{,}018.34)}{100^2 (5 - 1)} = \frac{24{,}908.3}{40{,}000} = 0.62
By looking at the formula (2.13), you can see that the IQV is partly a function of the number of categories. Here, we used five categories of communities. Had we used a larger number of categories, the IQV would have been smaller; on the other hand, had the number of categories been smaller, the value of the IQV would have been higher than what we obtained.
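Formula (2.13) in code; a minimal sketch in Python (the category percentages of Table 2.7 are not reproduced here, so ΣP² = 5,018.34 is taken from the text):

# Index of qualitative variation, formula (2.13)
def iqv(k, sum_p2):
    return k * (100**2 - sum_p2) / (100**2 * (k - 1))

print(round(iqv(k=5, sum_p2=5018.34), 2))   # 0.62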
Standard Error
If we draw n samples from the same population and compute their means, these means will not be the same but will differ from each other. The variation among these means is referred to as the standard error of the mean. Thus, the standard error of any statistic is the standard deviation of that statistic in the sampling distribution. The standard error measures the sampling fluctuation of any statistic and is widely used in statistical inference. The standard error gives a measure of how well a sample represents the population. When the sample is a true representative of the population, the standard error will be small.
Constructing confidence intervals and testing of significance are based on standard errors. The standard error of the mean can be used to compare the observed mean with a hypothesized value. The two values may be declared different at the 5% level if the ratio of the difference to the standard error is less than −2 or greater than +2.
The standard error of any statistic is affected by the sample size. In general, the standard error decreases as the sample size increases. It is denoted by σ with a subscript indicating the statistic for which it is computed.
Let \bar{X}_1, \bar{X}_2, \bar{X}_3, . . ., \bar{X}_n be the means of n samples drawn from the same population. Then the standard deviation of these n mean scores is the standard error of the mean. The standard error of the sample mean can be estimated even from a single sample. If a sample consists of n scores drawn from a population with standard deviation σ, then the standard error of the mean is given by

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}   (2.14)

Similarly, the standard error of the standard deviation is given by

\sigma_S = \frac{\sigma}{\sqrt{2n}}   (2.15)
Like standard error of the mean, the standard error of the standard deviation also
measures the fluctuation of standard deviations among the samples.
Remark If population standard deviation s is unknown, it may be estimated by the
sample standard deviation S.
The coefficient of variation (CV) measures variability relative to the mean and is computed as

CV = \frac{S}{\bar{X}} \times 100   (2.16)

where S and \bar{X} represent the sample standard deviation and sample mean, respectively.
Since the coefficient of variation measures relative variability and expresses the variability in percentage terms, it can be used to judge whether a particular parameter is more variable or less variable. The coefficient of variation can be used for comparing the variability of two groups when their mean values are not equal. It may also be used to compare the variability of two groups of data having different units.
On the other hand, the standard deviation is a measure of absolute variability, and therefore, it cannot be used to assess the variability of a data set without knowing its mean value. Further, the standard deviation cannot be used to compare the variability of two sets of scores if their mean values differ.
Consider the following statistics obtained on the number of customers visiting the retail outlets of a company at two different locations in a month. Let us see what conclusions can be drawn from this information.

Location    A     B
Mean       40    20
SD          8     6
CV        20%   30%

Although the standard deviation at location A is higher, the coefficients of variation show that the number of customers at location B is relatively more variable than at location A.
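The comparison is a one-liner; a minimal sketch in Python:

# Coefficient of variation, formula (2.16): CV = S / mean * 100
def cv(sd, mean):
    return sd / mean * 100

print(cv(8, 40), cv(6, 20))   # 20.0 30.0 -> location B is relatively more variable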
Moments
A moment is a quantitative value that describes the shape of a set of points. A moment can be central or noncentral. A central moment is represented by m_r, whereas a noncentral moment is denoted by m'_r. If the deviations of the scores are taken around the mean, the moment is central, and if they are taken around zero or any other arbitrary value, it is known as a noncentral moment. The rth central moment is given by

m_r = \frac{1}{n} \sum (X - \bar{X})^r   (2.17)

and the rth noncentral moment about zero by

m'_r = \frac{1}{n} \sum X^r   (2.18)
The first noncentral moment m'_1 about zero always represents the mean of the distribution. The noncentral moments can be used to compute the central moments by means of a recurrence relation.
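Formulas (2.17) and (2.18) in code; a minimal sketch in Python with made-up data:

# Central moment m_r and noncentral moment (about zero) m'_r
def central_moment(x, r):
    xbar = sum(x) / len(x)
    return sum((v - xbar) ** r for v in x) / len(x)

def raw_moment(x, r):
    return sum(v ** r for v in x) / len(x)

data = [2, 4, 4, 6, 9]
print(raw_moment(data, 1))       # 5.0 -> the first noncentral moment is the mean
print(central_moment(data, 2))   # 5.6 -> the second central moment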
Skewness
Skewness gives an idea about the symmetry of the data. In a symmetrical distribution, if the curve is divided in the middle, the two parts become mirror images of each other. If the curve is not symmetrical, it is said to be skewed. The skewness of a distribution is represented by b₁ and is given by

b_1 = \frac{m_3^2}{m_2^3}   (2.19)

where m₂ and m₃ are the second and third central moments. (Since this ratio of squares is non-negative, the sign of the skewness is conventionally taken from the sign of m₃.) For a symmetric distribution, b₁ is 0. A distribution is positively skewed if b₁ is positive and negatively skewed if it is negative. In a positively skewed distribution, the tail is heavy toward the right side of the curve, whereas in a negatively skewed curve, the tail is heavy toward the left side. Further, in a positively skewed curve the median is greater than the mode, whereas in a negatively skewed curve the median is less than the mode. Both these curves are shown in Fig. 2.1a, b.
The standard error of the skewness is given by

SE(\text{Skewness}) = SE(b_1) = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}   (2.20)

where n is the sample size. Some authors use \sqrt{6/n} for computing the standard error of the skewness, but it is a poor approximation for small samples.
The standard error of skewness can be used to test its significance. In testing the significance of skewness, the following Z statistic, which follows a normal distribution, is used:

Z = \frac{\sqrt{n(n-1)}}{n-2} \times \frac{b_1}{SE(b_1)}   (2.21)
Fig. 2.1 (a and b) Showing positively and negatively skewed curves (in the positively skewed curve the order along the score axis is mode, median, mean; in the negatively skewed curve it is mean, median, mode)
If the calculated Z < −2, the population is very likely to be skewed negatively. On the other hand, if the calculated Z > +2, it may be concluded that the population is positively skewed.
In general, a skewness value more than twice its standard error indicates a departure from symmetry. This gives a criterion to test whether the skewness (positive or negative) of a distribution is significant or not. If the data is positively skewed, it simply means that the majority of the scores are less than the mean value, and in case of negative skewness, most of the scores are more than the mean value.
Kurtosis
Kurtosis is a statistical measure used for describing the distribution of the observed data around the mean. It measures the extent to which the observations cluster around the mean value. It is measured by γ (gamma) and is computed as

\gamma = b_2 - 3   (2.22)

where b_2 = \frac{m_4}{m_2^2}, and m₂ and m₄ represent the second and fourth central moments, respectively.
For a normal distribution, the value of kurtosis (γ) is zero. A positive value of kurtosis indicates that the observations cluster more around the mean value and have longer tails than those of the normal distribution, whereas a distribution with negative kurtosis indicates that the observations cluster less around the mean and have shorter tails.
Depending upon the value of kurtosis, the distribution of scores can be classified into one of three categories: leptokurtic, mesokurtic, or platykurtic.
If for any variable the kurtosis is positive, the curve is known as leptokurtic, and it represents a low level of data fluctuation, as the observations cluster around the mean. On the other hand, if the kurtosis is negative, the curve is known as platykurtic, and it means that the data has a larger degree of variance. In other words, if the value of kurtosis is significant and positive, it signifies less variability in the data set, or we may say that the data is more homogeneous. On the other hand, significant negative kurtosis indicates that there is more variability in the data set, or we may conclude that the data is more heterogeneous. Further, if the kurtosis is 0, the curve is classified as mesokurtic; its flatness is equivalent to that of the normal curve. Thus, a normal curve is always a mesokurtic curve. The three types of curves are shown in Fig. 2.2.
The standard error of kurtosis is given by

SE(\text{Kurtosis}) = SE(\gamma) = 2\,SE(b_1) \sqrt{\frac{n^2 - 1}{(n-3)(n+5)}}   (2.23)

where n is the sample size. Some authors suggest the approximate formula for the standard error of kurtosis as

SE(\text{Kurtosis}) = \sqrt{\frac{24}{n}}   (2.24)

but this formula is a poor approximation for small samples.
The standard error of the kurtosis is used to test its significance. The test statistic Z can be computed as

Z = \frac{\gamma}{SE(\gamma)}   (2.25)

This Z follows a normal distribution. The critical value of Z is approximately ±2 for a two-tailed test of the hypothesis that kurtosis = 0 at approximately the 5% level. If the calculated Z is < −2, then the population is very likely to have negative kurtosis and the distribution may be considered platykurtic. On the other hand, if the calculated Z is > +2, then the population is very likely to have positive kurtosis and the distribution may be considered leptokurtic.
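Formulas (2.20) and (2.23) are easy to check numerically. For n = 20, the minimal sketch below (Python) reproduces the standard errors .512 and .992 that appear in the SPSS output of Table 2.9 later in this chapter:

from math import sqrt

# SE of skewness, formula (2.20), and SE of kurtosis, formula (2.23)
def se_skew(n):
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurt(n):
    return 2 * se_skew(n) * sqrt((n**2 - 1) / ((n - 3) * (n + 5)))

n = 20
print(round(se_skew(n), 3), round(se_kurt(n), 3))   # 0.512 0.992
# A skewness or kurtosis value whose |Z| = |value / SE| exceeds about 2 is
# significant at approximately the 5% level.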
Percentiles
Percentiles are used to develop norms based on the performance of the subjects.
A given percentile indicates the percentage of scores below it and is denoted by PX.
For example, P40 is a score below which 40% scores lie. Median is also known as
P50, and it indicates that 50% scores lie below it. Percentiles can be computed to
know the position of an individual on any parameter. For instance, 95th percentile
obtained by a student in GMAT examination indicates that his performance is better
than 95% of the students appearing in that examination.
Since the 25th percentile P₂₅, the 50th percentile P₅₀, and the 75th percentile P₇₅ are also known as the first, second, and third quartiles, respectively, the procedure for computing other percentiles is the same as that adopted for computing quartiles. Quartiles (the 25th, 50th, and 75th percentiles) divide the data into four groups of equal size. Percentiles at decile points and quartiles can be computed by using SPSS.
Percentile Rank
A percentile rank can be defined as the percentage of scores that fall at or below a given score. Thus, if the percentile rank of a score A is X, it indicates that X percent of the scores lie at or below the score A. The percentile rank can be computed from the following formula:

\text{Percentile rank of the score } X = \frac{CF - 0.5 f_s}{n} \times 100   (2.26)

where
CF : cumulative frequency, i.e., the number of scores at or below X
f_s : number of times the score X occurs in the data set
n : number of scores in the data set
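Formula (2.26) in code; a minimal sketch in Python with a small made-up data set:

# Percentile rank, formula (2.26): PR = (CF - 0.5 * fs) / n * 100,
# where CF counts the scores at or below X and fs counts the occurrences of X.
def percentile_rank(scores, x):
    cf = sum(1 for v in scores if v <= x)
    fs = scores.count(x)
    return (cf - 0.5 * fs) / len(scores) * 100

marks = [30, 32, 32, 33, 35, 35, 39, 40, 91, 93]   # illustrative data
print(percentile_rank(marks, 35))   # 50.0 -> the score 35 has a percentile rank of 50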
Situation for Using Descriptive Study

There may be a variety of situations where a descriptive study may be planned. One such situation is discussed below to illustrate the use of such a study.
Nowadays, industries are also assuming social responsibilities toward society. They engage themselves in many social activities like adult literacy, slum development, HIV and AIDS programs, community development, energy conservation drives, and go-green campaigns. One such organization has started an HIV and AIDS program in which it not only promotes awareness but also provides treatment. The company provides antiretroviral therapy to anyone in the community who is HIV-positive, irrespective of whether that person is an employee of the company or not. The company also provides counseling, education, and training and disseminates information on nutrition, health, and hygiene. The target population of the company for this program is truck drivers, contract and migrant workers, employees of local organizations, and members of the local community. A descriptive study may be conducted to investigate the following issues:
(a) Number of programs organized in different sections of the society
(b) Number of people who attended the awareness program in different sections of
the society
(c) Number of people who are affected with HIV/AIDS in different sections of the
society
(d) The most vulnerable group affected by the HIV
(e) Details of population affected from HIV in different age and sex categories
To cater to the above objectives, the data may be processed as follows:
(i) Classify the data on HIV/AIDS-infected persons in different sections of the society, like truck drivers, contract laborers, migrant laborers, employees of local establishments, and members of the local community, month-wise for the last 5 years.
(ii) Classify the number of participants attending the HIV/AIDS awareness program in different sections of the society month-wise for the last 5 years.
(iii) Compute the largest and smallest scores, mean, SD, coefficient of variation, standard error, skewness, kurtosis, and quartile deviation for the data in all the groups.
All these computations can be done by using SPSS; the procedure is explained below by means of a solved example.
(a) Preparing the Data File
The first step is to prepare the data file. This involves starting SPSS on the system, defining the variables, and entering the data (a data file may also be imported into SPSS from other sources). The following steps will help you to prepare the data file:
(i) Starting the SPSS: Use the following command sequence to start SPSS on your system:
Start → All Programs → IBM SPSS Statistics → IBM SPSS Statistics 20
After clicking the Type in Data option, you will be taken to the Variable View option for defining the variables in the study.
(ii) Defining variables: In this example, there are eight variables that need to
be defined along with their properties. Do the following:
1. Click Variable View in the left corner of the bottom of the screen to
define variables and their properties.
2. Write short name of all the eight variables as Del_Speed, Price_Lev,
Price_Fle, Manu_Ima, Service, Salfor_Ima, Prod_Qua, and Sat_Lev
under the column heading Name.
3. Under the column heading Label, full name of these variables may be
defined as Delivery Speed, Price Level, Price Flexibility, Manufac-
turer Image, Service, Salesforce Image, Product Quality, and Satis-
faction Level.
4. Since all the variables were measured on an interval scale, select the option "Scale" under the heading Measure for each variable.
5. Use default entries in rest of the columns.
After defining variables in Variable View, the screen shall look like
Fig. 2.3.
(iii) Entering data: After defining all the eight variables in the Variable View,
click Data View on the left bottom of the screen to open the format for
entering the data column wise. For each variable, enter the data column
wise. After entering the data, the screen will look like Fig. 2.4. Save the
data file in the desired location before further processing.
(b) SPSS Commands for Descriptive Analysis
After entering the data in data view, do the following steps for computing
desired descriptive statistics:
(i) SPSS commands for descriptive statistics: In data view, click the following
commands in sequence:
Analyze ⇨ Descriptive Statistics ⇨ Frequencies
The screen shall look like as shown in Fig. 2.5.
(ii) Selecting variables for computing descriptive statistics: After clicking the
Frequencies tag, you will be taken to the next screen for selecting the variables for which descriptive statistics need to be computed. The screen shall look like the one shown in Fig. 2.6. Do the following:
– Select the variables Del_Speed, Price_Lev, Price_Fle, Manu_Ima, Ser-
vice, Salfor_Ima, Prod_Qua, and Sat_Lev from the left panel to the
“Variable(s)” section of the right panel.
Here, all the eight variables can be selected one by one or all at once. To do
so, the variable(s) needs to be selected from the left panel, and by arrow
command, it may be brought to the right panel. The screen shall look like
Fig. 2.6.
(iii) Selecting option for computation: After selecting the variables, options
need to be defined for the computation of desired statistics. Do the
following:
– Click the option Statistics on the screen as shown in Fig. 2.6. This will
take you to the next screen that is shown in Fig. 2.7. Do the following:
– Check the options “Quartiles” and “Cut points for 10 equal groups”
in “Percentile Values” section.
– Check the option “Mean,” “Median,” and “Mode” under “Central
Tendency” section.
– Check the options "Std. Deviation," "Variance," "Range," "Minimum," "Maximum," and "S.E. mean" under the "Dispersion" section.
Remarks
(a) You have four different classes of statistics like “Percentile Value,”
“Central Tendency,” “Dispersion,” and “Distribution” that can be
computed. Any or all the options may be selected under these
categories. Under the category “Percentile Values,” quartiles can be
checked (√) for computing Q1 and Q3. For computing percentiles at
decile points, cut points can be selected for 10 equal groups. Similarly, if the percentiles dividing the data into 5 equal groups are required, the cut points may be selected as 5.
(b) In using the option cut points for the percentiles, output contains some
additional information on frequency in different segments. If the
researcher is interested, the same may be incorporated in the findings;
otherwise, it may be ignored.
(c) “Percentile” option is selected if percentile values at different
intervals are required to be computed. For example, if we are inter-
ested in computing P4, P16, P27, and P39, then these numbers are added
in the “Percentile(s)” option.
(d) In this problem, only quartiles and cut points for “10” options have
been checked under the heading “Percentile Values,” whereas under
the heading “Central Tendency,” “Dispersion,” and “Distribution,” all
the options have been checked.
(iv) Option for graph: The option Chart can be clicked in Fig. 2.6 if graph is
required to be constructed. Any one of the option under this tag like bar
charts, pie charts, or histograms may be selected. If no chart is required,
then option “None” may be selected.
Fig. 2.4 Screen showing entered data for all the variables in the data view
Different interpretations can be made from the results in Table 2.9. However, some
of the important findings that can be drawn are as follows:
1. Except for price level, the mean and median of all the variables are nearly equal.
Fig. 2.5 Screen showing the SPSS commands for computing descriptive statistics
2. The standard error of the mean is the least for service, whereas it is the maximum for price flexibility.
3. As a guideline, a skewness value more than twice its standard error indicates a departure from symmetry. Since no variable's skewness is greater than twice its standard error (2 × .512 = 1.024), all the variables are symmetrically distributed.
4. SPSS uses the statistic b₂ − 3 for kurtosis. Thus, for a normal distribution, the kurtosis value is 0. If for any variable the value of kurtosis is positive, its distribution is known as leptokurtic, which indicates a low level of data fluctuation around the mean value, whereas a negative value of kurtosis indicates a large degree of variance among the data, and the distribution is known as platykurtic. Since the value of kurtosis for none of the variables is more than twice its standard error of kurtosis (2 × .992 = 1.984), none of the kurtosis values is significant. In other words, the distribution of all the variables is mesokurtic.
5. The minimum and maximum values of a parameter can reveal some interesting facts and provide the range of variation. For instance, the delivery speed of the products is in the range of 1.3–6 days. Thus, one can expect the delivery of any product in at most 6 days' time, and accordingly, one may place the order.
Table 2.9 Output showing various statistics for different attributes of the company
Del_Speed Price_Lev Price_Fle Manu_Ima Service Salfor_Ima Prod_Qua Sat_Lev
N Valid 20 20 20 20 20 20 20 20
Missing 0 0 0 0 0 0 0 0
Mean 3.7150 2.2150 8.1150 5.5350 2.9650 2.8150 6.9650 5.0450
SE of mean .29094 .28184 .33563 .22933 .14203 .20841 .30177 .17524
Median 3.8000 1.7000 8.3500 5.4000 3.0000 2.6500 6.8000 5.0000
Mode 3.40a 1.30a 5.70a 4.70 3.00 2.30a 6.80 4.30a
Std. deviation 1.30112 1.26044 1.50097 1.02561 .63518 .93205 1.34957 .78370
Variance 1.693 1.589 2.253 1.052 .403 .869 1.821 .614
Skewness .144 .970 .338 .380 .010 .459 .001 .490
SE of skewness .512 .512 .512 .512 .512 .512 .512 .512
Kurtosis .732 .092 1.435 .467 .288 .500 .324 .566
SE of kurtosis .992 .992 .992 .992 .992 .992 .992 .992
Range 4.70 4.60 4.20 4.00 2.50 3.20 5.20 2.90
Minimum 1.30 .60 5.70 3.80 1.80 1.40 4.50 3.90
Z = \frac{X - \bar{X}}{S}

Thus, the mean of all the variables will become the same. The values so obtained are shown in Table 2.11.
Step 3: Convert these Z values into linearly transformed scores by using the transformation Z_l = 50 + 10Z. By using this transformation, the negative values of the Z-scores can be converted into positive scores. The descriptive statistics in the form of linearly transformed scores are shown in Table 2.12.
Table 2.11 Standard scores of minimum, maximum, and average of all the variables

                     Min (Z)   Mean (Z)   Max (Z)
Delivery speed        -1.86       0        1.76
Price level           -1.28       0        2.37
Price flexibility     -1.61       0        1.19
Manufacturer image    -1.69       0        2.19
Service               -1.83       0        2.08
Salesforce image      -1.53       0        1.91
Product quality       -1.83       0        2.02
Satisfaction level    -1.47       0        2.24
Step 4: Use Excel graphic functionality for developing line diagram to show the
company’s profile on its various parameters. The profile chart so prepared
is shown in Fig. 2.8.
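Steps 2 and 3 can be verified from the means, standard deviations, and minimums reported in Table 2.9. A minimal sketch in Python for the delivery speed variable:

# Z-score of the minimum and its linear transform Zl = 50 + 10*Z
mean, sd, minimum = 3.7150, 1.30112, 1.30   # delivery speed, from Table 2.9
z_min = (minimum - mean) / sd
print(round(z_min, 2), round(50 + 10 * z_min, 1))   # -1.86 31.4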
2. Click Variable View tag and define the variables Del_Speed, Price_Lev,
Price_Fle, Manu_Ima, Service, Salfor_Ima, Prod_Qua, and Sat_Lev as a scale
variable.
3. Once the variables are defined, then type the data for these variables by clicking
Data View.
4. In the data view, follow the below-mentioned command sequence for computing
descriptive statistics:
Analyze ! Descriptive Statistics ! Frequencies
5. Select all the variables from left panel to the right panel for computing various
descriptive statistics.
6. Click the tag Statistics and check the options under the headings “Percentile
Values,” “Central Tendency,” “Dispersion,” and “Distribution.” Press
Continue.
7. Click the Charts option and select the required chart, if graph is required for all
the variables.
8. Click OK to get the output for descriptive statistics.
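Readers who wish to cross-check the SPSS output outside SPSS can compute the same statistics with pandas; a minimal sketch with illustrative data (pandas' skew() and kurt() use bias-adjusted estimators comparable to those SPSS reports, and kurt() already subtracts 3):

import pandas as pd

# Cross-check of descriptive statistics for one variable (illustrative values)
s = pd.Series([4.1, 1.8, 3.4, 2.7, 6.0, 1.9, 4.6, 1.3], name="Del_Speed")
print(s.mean(), s.median(), s.std(), s.sem())   # mean, median, sample SD, SE of mean
print(s.skew(), s.kurt())                       # adjusted skewness and excess kurtosis
print(s.quantile([0.25, 0.5, 0.75]))            # quartiles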
Exercise
Short-Answer Questions
Note: Write the answer to each of the following questions in not more than 200 words:
Q.1. If average performance of two groups is equal, can it be said that both the
groups are equally good?
Q.2. What do you mean by absolute and relative variability? Explain by means of
examples.
Q.3. What is the coefficient of variation? In what situation should it be computed? With the help of the following data on BSE quotes during the last trading sessions, can it be concluded that WIPRO's quotes were more variable than GAIL's?
Q.4. Is there any difference between standard error of mean and error in computing
the mean? Explain your answer.
Q.5. If the skewness of a set of data is zero, can the data be said to be normally distributed? If yes, how? And if no, how can it be checked for normality?
Q.6. If the performance of a student is at the 96th percentile in a particular subject, can it be concluded that he is very intelligent in that subject? Explain your answer.
Q.7. What is a quartile measure? In what situation should it be used?
Multiple-Choice Questions
Note: Questions 1–10 have four alternative answers each. Tick mark the one that you consider closest to the correct answer.
1. If a researcher is interested to know the number of employees in an organization
belonging to different regions and how many of them have opted for club
memberships, the study may be categorized as
(a) Descriptive
(b) Inferential
(c) Philosophical
(d) Descriptive and inferential both
2. Choose the correct sequence of commands to compute descriptive statistics.
(a) Analyze -> Descriptive Statistics -> Frequencies
(b) Analyze -> Frequencies -> Descriptive Statistics
(c) Analyze -> Frequencies
(d) Analyze -> Descriptive Statistics
3. Which pair of statistics are nonparametric statistics?
(a) Mean and median
(b) Mean and SD
(c) Median and SD
(d) Median and Q.D.
4. Standard error of mean can be defined as
(a) Error in computing mean
(b) Difference in sample and population mean
(c) Variation in the mean values among the samples drawn from the same
population
(d) Error in measuring the data on which mean is computed
5. The value of skewness for a given set of data shall be significant if
(a) Skewness is more than twice its standard error.
(b) Skewness is more than its standard error.
(c) Skewness and standard error are equal.
(d) Skewness is less than its standard error.
6. Kurtosis in SPSS is assessed by
(a) b₂
(b) b₂ + 3
(c) b₂ − 3
(d) 2 + b₂
7. In order to prepare the profile chart, minimum scores for each variable are
converted into
(a) Percentage
(b) Standard score
(c) Percentile score
(d) Rank
8. While selecting option for percentile in SPSS, cut points are used for
(a) Computing Q1 and Q3
(b) Preparing the percentile at deciles points only
(c) Cutting Q1 and Q3
(d) Computing the percentiles at fixed interval points
9. If IQ of a group of students is positively skewed, what conclusions could be
drawn?
(a) Most of the students are less intelligent.
(b) Most of the students are more intelligent.
(c) There are equal number of high and low intelligent students.
(d) Nothing can be said about the intelligence of the students.
10. If the data is platykurtic, what can be said about its variability?
(a) More variability exists.
(b) Less variability exists.
(c) Variability is equivalent to normal distribution.
(d) Nothing can be said about the variability.
Assignment
1. Following table shows the data on different abilities of employees in an organi-
zation. Compute various descriptive statistics and interpret its findings.
2. Following are the grades of ten MBA students in 10 courses. Compute various
descriptive statistics and interpret your findings.
Answers to Multiple-Choice Questions
Q.1 (a)   Q.2 (a)
Q.3 (d)   Q.4 (c)
Q.5 (a)   Q.6 (c)
Q.7 (b)   Q.8 (d)
Q.9 (a)   Q.10 (a)
Chapter 3
Chi-Square Test and Its Application
Learning Objectives
After completing this chapter you should be able to do the following:
• Know the use of chi-square in analyzing nonparametric data.
• Understand the application of chi-square in different research situations.
• Know the advantages of crosstabs analysis.
• Learn to construct the hypothesis in applying chi-square test.
• Explain the situations in which different statistics like contingency coefficient,
lambda coefficient, phi coefficient, gamma, Cramer’s V, and Kendall tau, for
measuring an association between two attributes, can be used.
• Learn the procedure of data feeding in preparing the data file for analysis using SPSS.
• Describe the procedure of testing an equal occurrence hypothesis and testing the
significance of an association in different applications by using SPSS.
• Interpret the output of chi-square analysis generated in SPSS.
Introduction
In survey research, mainly two types of hypotheses are tested. One may test the goodness of fit for a single attribute or may test the significance of the association between any two attributes. To test an equal occurrence hypothesis, it is required to tabulate the observed frequency for each variable. The chi-square statistic in the "Nonparametric Tests" section of SPSS may be used to test the hypothesis of equal occurrence.
The scores need to be arranged in a contingency table for studying an association between any two attributes. A contingency table is the arrangement of frequencies in rows and columns. The process of creating a contingency table from the observed frequencies is known as crosstab. The cross tabulation procedure provides a tabulation of two variables in a two-way table. A frequency distribution provides the distribution of one variable, whereas a contingency table describes the distribution of two or more variables simultaneously (Table 3.1).
Following are some of the advantages of crosstabs analysis:
1. Crosstabs analysis is easy to understand and is good for researchers who do not want to use more sophisticated statistical techniques.
2. Crosstabs treats all data as nominal. In other words, the data is treated as nominal
even if it is measured in interval, ratio, or ordinal form.
3. A table is more explanatory than a single statistics.
4. They are simple to conduct.
Chi-Square Statistic
If X₁, . . ., Xₙ are independent and identically distributed N(μ, σ²) random variables, then the statistic

\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2

follows the chi-square distribution with n degrees of freedom and is written as

\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2 \sim \chi^2(n)

Fig. 3.1 Probability distribution of chi-square for different degrees of freedom (curves shown for n = 1 to 5)

The probability density function of the chi-square variate with n degrees of freedom is given by

f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1}\, e^{-x/2}, \quad x > 0, \quad n = 1, 2, 3, \ldots   (3.1)
The mean and variance of the chi-square statistic are n and 2n, respectively. The χ² distribution is not unique but depends upon the degrees of freedom. The family of distributions with varying degrees of freedom is shown in Fig. 3.1.
If χ₁² and χ₂² are two independent chi-square variates with n₁ and n₂ degrees of freedom, respectively, then χ₁² + χ₂² is also a chi-square variate with n₁ + n₂ degrees of freedom. This additive property is used extensively in questionnaire studies. Consider a study to compare the attitudes of male and female consumers about a particular brand of car. The questionnaire may consist of questions under three factors, namely, financial considerations, driving comforts, and facilities. Each of these factors may have several questions. On each of the questions, the attitudes of male and female users may be compared using chi-square. Further, by using the additive property, the chi-squares of the questions under a particular factor may be added to compare the attitudes of males and females on that factor.
Chi-Square Test
The chi-square test is the most frequently used nonparametric statistical test. It is also known as the Pearson chi-square test and provides a mechanism to test the independence of two categorical variables. The chi-square test is based upon the chi-square distribution, just as a t-test is based upon the t-distribution or an F-test upon the F-distribution. The results of Pearson's chi-square test are evaluated by reference to the chi-square distribution.
The chi-square statistic is denoted by χ² and is pronounced "kai-square." The properties of chi-square were first investigated by Karl Pearson in 1900, and hence the test is named after him.
In using the chi-square test, the chi-square (χ²) statistic is computed as

\chi^2 = \sum_{i=1}^{n} \frac{(f_o - f_e)^2}{f_e}   (3.2)

where f_o and f_e are the observed and expected frequencies for each of the possible outcomes, respectively.
The chi-square test is used for two purposes: first, to test the goodness of fit and,
second, to test the independence of two attributes. In both the situations, we intend
to determine whether the observed frequencies significantly differ from the theoret-
ical (expected) frequencies. The chi-square tests in these two situations shall be
discussed in the following sections:
Table 3.2 Preferences of the college students about different colors of cold drinks

Color         White   Orange   Brown
Frequencies    50      40       30
The "Nonparametric Tests" option in SPSS provides the computation of the chi-square (χ²) statistic. For example, to test whether three specializations offered in a course are equally popular, the following set of hypotheses may be tested:
H0: All three specializations are equally popular.
H1: All three specializations are not equally popular.
By using the procedure discussed above for applying the chi-square test, the null hypothesis may be tested. The procedure will become clear from the following solved examples:
Example 3.1 A beverages company produces cold drink with three different
colors. One hundred and twenty college students were asked about their
preferences. The responses are shown in Table 3.2. Do these data show that all
the flavors were equally liked by the students? Test your hypothesis at .05 level of
significance.
Solution: Here it is required to test the null hypothesis of equal occurrence; hence, the expected frequencies corresponding to each of the three observed frequencies shall be obtained by dividing the total of all the observed frequencies by the number of categories. Hence, the expected frequency (f_e) for each category shall be (Table 3.3)

f_e = \frac{50 + 40 + 30}{3} = 40
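The rest of Example 3.1 follows mechanically and can be checked with scipy, which assumes equal expected frequencies when none are supplied; a minimal sketch:

from scipy.stats import chisquare

# Equal-occurrence test for Example 3.1: observed (50, 40, 30), expected 40 each
stat, p = chisquare(f_obs=[50, 40, 30])
print(stat, round(p, 3))   # 5.0 0.082 -> below chi2_.05(2) = 5.991, so H0 is retained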
For Example 3.2, the expected frequencies are obtained from the given ratio of the grades:

Expected number of students getting grade A = \frac{3}{3 + 2 + 3 + 4} \times 300 = \frac{3}{12} \times 300 = 75

Expected number of students getting grade B = \frac{2}{12} \times 300 = 50

Expected number of students getting grade C = \frac{3}{12} \times 300 = 75

Expected number of students getting grade D = \frac{4}{12} \times 300 = 100
Thus, the observed and expected frequencies can be listed as shown in Table 3.4.

\chi^2 = \sum_{i=1}^{r} \frac{(f_o - f_e)^2}{f_e} = \frac{(90-75)^2}{75} + \frac{(65-50)^2}{50} + \frac{(60-75)^2}{75} + \frac{(85-100)^2}{100}

= \frac{225}{75} + \frac{225}{50} + \frac{225}{75} + \frac{225}{100} = 3 + 4.5 + 3 + 2.25 = 12.75

\Rightarrow \text{Calculated } \chi^2 = 12.75
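The same computation can be reproduced with scipy; a minimal sketch:

from scipy.stats import chisquare

# Goodness of fit for the grades: expected counts in the ratio 3:2:3:4 out of 300
stat, p = chisquare(f_obs=[90, 65, 60, 85], f_exp=[75, 50, 75, 100])
print(round(stat, 2), round(p, 4))   # 12.75 0.0052 -> significant at the .05 level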
The observed frequencies of two attributes A and B, having r and c levels respectively, can be arranged in an r × c contingency table:

            B₁       B₂      . . .    B_c      Total
A₁        f_o11    f_o12    . . .   f_o1c     (A₁)
A₂        f_o21    f_o22    . . .   f_o2c     (A₂)
. . .
A_r       f_or1    f_or2    . . .   f_orc     (A_r)
Total      (B₁)     (B₂)    . . .    (B_c)      N

The expected frequency for the cell in row i and column j is computed from the marginal totals (A_i) and (B_j) as

E_{ij} = \frac{(A_i)(B_j)}{N}
Thus,

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{oij} - f_{eij})^2}{f_{eij}}   (3.3)
The value of the w2 variate so obtained can be used to test the independence of
two attributes.
Consider a situation where it is required to test the significance of association
between Gender (male and female) and Response (“prefer day shift” and “prefer
night shift”). In this situation, following hypotheses may be tested:
H0: Gender and Response toward shift preferences are independent.
H1: There is an association between the Gender and Response toward shift
preferences.
The calculated value of chi-square (w2 ) obtained from the formula (3.3) may be
compared with that of its tabulated value for testing the null hypothesis.
Thus, if the calculated χ² is less than the tabulated χ² with (r − 1)(c − 1) df at some level of significance, then H0 may not be rejected; otherwise, H0 may be rejected.
Remark If H0 is rejected, we may interpret that there is a significant association
between the gender and their preferences toward shifts. Here, significant associa-
tion simply means that the response pattern of male and female is different. The
readers may note that chi-square statistic is used to test the significance of associa-
tion, but ultimately one gets the comparison between the levels of one attribute
across the levels of other attribute.
Example 3.3 Five hundred families were investigated to test the belief that high-
income people usually prefer to visit private hospitals and low-income people often
go to government hospitals whenever they fall sick. The results so obtained are
shown in Table 3.5.
Test whether income and hospital preferences are independent. Compute the
contingency coefficient to find the strength of association. Test your hypothesis at
5% level.
Solution The null hypothesis to be tested is
H0: Income and hospital preferences are independent.
Before computing the value of chi-square, the expected frequencies for each cell
need to be computed with the marginal totals and grand totals given in the observed
frequency (fo) table. The procedure is discussed in Table 3.6.
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{oij} - f_{eij})^2}{f_{eij}}

= \frac{(125 - 140.4)^2}{140.4} + \frac{(145 - 129.6)^2}{129.6} + \frac{(135 - 119.6)^2}{119.6} + \frac{(95 - 110.4)^2}{110.4}

= 1.69 + 1.83 + 1.98 + 2.15 = 7.65

\Rightarrow \text{Calculated } \chi^2 = 7.65
Test of Significance
Here, r = 2 and c = 2, and therefore the degrees of freedom are (r − 1)(c − 1) = 1.
From Table A.6 in the Appendix, the tabulated \chi^2_{.05}(1) = 3.841.
Since the calculated χ² > the tabulated \chi^2_{.05}(1), the null hypothesis may be rejected at the .05 level of significance. It may therefore be concluded that there is an association between the income level and the type of hospital preferred by the people.
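Example 3.3 can likewise be reproduced with scipy.stats.chi2_contingency; a minimal sketch (the row/column layout of Table 3.5 is assumed here, and correction=False switches off the Yates correction so that the uncorrected value 7.65 is obtained):

from scipy.stats import chi2_contingency

# 2x2 table of Example 3.3; rows: income level, columns: hospital type (assumed layout)
observed = [[125, 145],
            [135, 95]]
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof, round(p, 4))   # 7.65 1 0.0057 -> H0 rejected at .05
print(expected)   # the expected frequencies 140.4, 129.6, 119.6, 110.4 used above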
The following points should be kept in mind while using the chi-square test:
(a) While using the chi-square test, one must ensure that the sample is random, representative, and adequate in size.
(b) Chi-square should not be calculated if the frequencies are in percentage form; in that case, the frequencies must be converted back to absolute numbers before using the test.
(c) If any of the cell frequencies is less than 5, then for each cell, .5 is subtracted from the absolute difference of the observed and expected frequencies while computing the chi-square. This correction is known as Yates' correction. SPSS automatically applies this correction while computing the chi-square.
(d) The sum of the observed frequencies should be equal to the sum of the expected frequencies.
(a) In SPSS, the null hypothesis is not tested on the basis of the comparison
between calculated and tabulated chi-square; rather, it uses the concept of
p value. p value is the probability of rejecting the null hypothesis when actually
it is true.
(b) Thus, the chi-square is said to be significant at the 5% level if the p value is less than .05 and nonsignificant if it is more than .05.
Contingency Coefficient
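The contingency coefficient measures the strength of association between two attributes in an r × c table. It is computed from the chi-square statistic as
$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$
where N is the total number of observations. Its value lies between 0 and 1, although it never attains 1 exactly; larger values indicate a stronger association. For the data of Example 3.3, C = √(7.65/(7.65 + 500)) ≈ 0.12, a weak though statistically significant association.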
Lambda Coefficient
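The lambda coefficient (Goodman and Kruskal’s lambda) is a proportional-reduction-in-error measure of association for nominal variables. It indicates the extent to which the error in predicting the category of one variable is reduced by knowing the category of the other and ranges from 0 (knowing one variable is of no help in predicting the other) to 1 (perfect prediction).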
Phi Coefficient
In a situation when both the variables are binary, phi coefficient is used to measure
the degree of association between them. This measure is similar to the correlation
coefficient in its interpretation. Two binary variables are considered positively
associated if most of the data falls along the diagonal cells. In contrast, two binary
variables are considered negatively associated if most of the data falls off the
diagonal.
The assumptions of normality and homogeneity can be violated when the categories are extremely uneven, as when the proportions are close to .90 or .95 versus .10 or .05. In such cases, the phi coefficient can be significantly attenuated. The assumption of linearity, on the other hand, cannot be violated within the context of the phi coefficient of correlation.
Gamma
If both the variables are measured at the ordinal level, Gamma is used for testing the
strength of association of the cross tabulations. It makes no adjustment for either
table size or ties. The value of Gamma can range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
Cramer’s V
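Cramer’s V is a chi-square-based measure of the strength of association suitable for tables larger than 2 × 2. The standard formula is
$$V = \sqrt{\frac{\chi^2}{N \cdot \min(r-1,\; c-1)}}$$
where N is the total frequency. V ranges from 0 (no association) to 1 (perfect association), and for a 2 × 2 table it reduces to the absolute value of the phi coefficient.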
Kendall Tau
Tau b and Tau c both test the strength of association of the cross tabulations in a
situation when both variables are measured at the ordinal level. Both these tests Tau
b and Tau c make adjustments for ties, but Tau b is most suitable for square tables
whereas Tau c is most suitable for rectangular tables. Their values can range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
Chi-square is one of the most popular nonparametric statistical tests used in questionnaire studies. Two different types of hypotheses, that is, goodness of fit and the significance of association between two attributes, can be tested using chi-square.
Testing an equal occurrence hypothesis is a special case of goodness of fit.
In testing an equal occurrence hypothesis, the observed frequencies on the different levels of a factor are obtained. The total of the observed frequencies for all the levels is divided by the number of levels to obtain the expected frequency for each level.
Consider an experiment in which it is intended to test whether all the three
locations, that is, Delhi, Mumbai, and Chennai, are equally preferred by the
employees of an organization for posting. Out of 250 employees surveyed, 120
preferred Delhi, 80 preferred Mumbai, and 50 preferred Chennai. In this situation,
the following null hypothesis may be tested using chi-square:
H0: All the three locations are equally preferred.
Against the alternative hypothesis:
H1: All the three locations are not equally preferred.
Here, the chi-square test can be used to test the null hypothesis of equal
occurrence.
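Here each expected frequency is 250/3 ≈ 83.33. As a hedged illustration (assuming SciPy; this is a cross-check, not part of the SPSS workflow described below), the goodness-of-fit chi-square can be computed as follows:

```python
# Equal-occurrence (goodness-of-fit) test for the location preferences.
from scipy.stats import chisquare

observed = [120, 80, 50]  # Delhi, Mumbai, Chennai

# With no expected frequencies supplied, chisquare assumes equal occurrence,
# i.e., each expected frequency is 250 / 3.
chi2, p = chisquare(observed)
print(round(chi2, 2))  # 29.6 with 2 df
print(p < 0.05)        # True: H0 of equal preference is rejected
```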
Another application of chi-square is to test the significance of association between any two attributes. Suppose it is desired to know whether consumers’ preference for a specific brand of soap depends upon their socioeconomic status, where the responses of 200 customers are shown in Table 3.7.
The following null hypothesis may be tested by using the chi-square for two
samples at 5% level to answer the question.
H0: Socioeconomic status and soap preferences are independent.
Against the alternative hypothesis:
H1: There is an association between the socioeconomic status and soap preferences.
If the null hypothesis is rejected, one may draw the conclusion that the preference of soap is significantly associated with the socioeconomic status of an individual. In other words, it may be concluded that the response patterns of customers of high and low socioeconomic status are different.
The above two different kinds of application of chi-square have been discussed
below by means of solved examples using SPSS.
Example 3.4 In a study, 90 workers were tested for their job satisfaction. Their job
satisfaction level was obtained on the basis of the questionnaire, and the
respondents were classified into one of the three categories, namely, low, average,
and high. The observed frequencies are shown in Table 3.8. Compute chi-square in
testing whether there is any specific trend in their job satisfaction.
After checking the option Type in Data on the screen, you will be taken to the Variable View option for defining the variables in the study.
(ii) Defining variables: There is only one variable Job Satisfaction Level that
needs to be defined. Since this variable can assume any one of the three
values, it is a nominal variable. The procedure of defining the variable in
SPSS is as follows:
1. Click Variable View to define variables and their properties.
2. Write short name of the variable, that is, Job_Sat under the column
heading Name.
3. For this variable, define the full name, that is, Job Satisfaction Level
under the column heading Label.
4. Under the column heading Values, define “1” for low, “2” for medium,
and “3” for high.
5. Under the column heading Measure, select the option “Nominal”
because the variable Job_Sat is a nominal variable.
6. Use default entries in rest of the columns.
After defining the variables in variable view, the screen shall look like
Fig. 3.2.
(iii) Entering data: Once the variable Job_Sat has been defined in the Variable
View, click Data View on the left bottom of the screen to open the format
for entering data column wise.
In this example, we have only one variable Job_Sat with three levels as
Low, Medium, and High. The Low satisfaction level was observed in 40
workers, whereas Medium satisfaction level was observed in 30 workers
and High satisfaction level was observed in 20 workers. Since these levels
have been defined as 1, 2, and 3, the data shall be entered under one
variable Job_Sat as shown below:
After entering data, the screen shall look like Fig. 3.3. Only partial data has been shown in the figure, as the data set is too long to fit in the window.
Save the data file in the desired location before further processing.
(b) SPSS Commands for Computing Chi-Square
After preparing the data file in data view, take the following steps to compute
the chi-square:
(i) Initiating the SPSS commands to compute chi-square for a single variable: In Data View, click the following commands in sequence:
Analyze ⟶ Nonparametric Tests ⟶ Chi-Square
(ii) Selecting variable for computing chi-square: After clicking the “Chi-
Square” option, you will be taken to the next screen for selecting the
variable for which chi-square needs to be computed. Since there is only
one variable, Job Satisfaction Level, in the example, select it from the left
panel by using the left click of the mouse and bring it to the right panel by
clicking the arrow. The screen shall look like Fig. 3.5.
(iii) Selecting the option for computation: After selecting the variable, option
needs to be defined for the computation of chi-square. Take the following
steps:
– Click the Options in the screen shown in Fig. 3.5. This will take you to
the next screen that is shown in Fig. 3.6.
– Check the option “Descriptive.”
– Use default entries in other options.
– Click Continue. This will take you back to the screen shown in Fig. 3.5.
– Press OK.
Table 3.9 shows the observed and expected frequencies of the different levels of job
satisfaction. No cell frequency is less than 5, and, therefore, no correction is
required while computing chi-square.
Example 3.5 Out of 80 MBA students, 40 were given academic counseling throughout the semester, whereas the other 40 did not receive this counseling. On the basis of their marks in the final examination, their performance was categorized as improved, unchanged, or deteriorated. Based on the results shown in Table 3.11, can it be concluded at the 5% level that the academic counseling is effective?
Solution In order to check whether academic counseling is effective, we shall test the significance of association between treatment and performance. If the association between these two attributes is significant, it may be interpreted that the pattern of performance in the counseling and control groups is not the same. In that case, it might be concluded that the counseling is effective, since the number of improved cases is higher in the counseling group than in the control group.
Thus, it is important to compute the chi-square first in order to test the null hypothesis.
H0: There is no association between treatment and performance.
Against the alternative hypothesis:
H1: There is an association between treatment and performance.
The commands for computing chi-square in the case of two samples are different from those for the one-sample case of Example 3.4.
In two-sample case, chi-square is computed using Crosstabs option in Descrip-
tive statistics command of SPSS. The chi-square so obtained shall be used for
testing the above-mentioned null hypothesis. Computation of chi-square for two
samples using SPSS has been shown in the following steps:
By checking the option Type in Data on the screen, you will be taken to the Variable View option for defining the variables in the study.
(ii) Defining variables: Here, two variables Treatment and Performance need
to be defined. Since both these variables are classificatory in nature, they
are treated as nominal variables in SPSS. The procedure of defining
variables and their characteristics in SPSS is as follows:
1. Click Variable View to define variables and their properties.
2. Write short name of the variables as Treatment and Performance under
the column heading Name.
3. Under the column heading Label, full name of the Treatment and
Performance variables may be defined as Treatment groups and Perfor-
mance status, respectively. There is flexibility in choosing full name of
each variable.
4. In the Treatment row, double-click the cell under the column Values and add the following values to different labels:
Value Label
1 Counseling group
2 Control group
5. Similarly, in the Performance row, add the following values:
Value Label
3 Improved
4 Unchanged
5 Deteriorated
Data feeding procedure for the data of Table 3.11 in SPSS under Data View
S.N. 1–40 take Treatment = 1 (type “1” forty times, as there are 40 students in the counseling group), and S.N. 41–80 take Treatment = 2 (type “2” forty times, as there are 40 students in the control group). The Performance column is entered as follows:
S.N.   Treatment  Performance  Entry pattern
1–22   1          3            Type “3” twenty-two times: 22 students in the counseling group showed Improved performance
23–30  1          4            Type “4” eight times: 8 students showed Unchanged performance
31–40  1          5            Type “5” ten times: 10 students showed Deteriorated performance
41–44  2          3            Type “3” four times: 4 students in the control group showed Improved performance
45–49  2          4            Type “4” five times: 5 students showed Unchanged performance
50–80  2          5            Type “5” thirty-one times: 31 students showed Deteriorated performance
Treatment coding: 1 = Counseling group, 2 = Control group
Performance coding: 3 = Improved, 4 = Unchanged, 5 = Deteriorated
After entering the data, the screen will look like Fig. 3.8. The screen shows only partial data, as the column-wise entries run to 80 rows. Save the data file in the desired location before further processing.
(b) SPSS Commands for Computing Chi-square with Two Variables
After entering all the data by clicking the data view, take the following steps for
computing chi-square:
(i) Initiating the SPSS commands for computing chi-square: In Data View,
click the following commands in sequence:
Analyze ⟶ Descriptive Statistics ⟶ Crosstabs
The screen shall look like Fig. 3.9.
(ii) Selecting variables for computing chi-square: After clicking the
“Crosstabs” option, you will be taken to the next screen for selecting
variables for the crosstabs analysis and computing chi-square. Out of the
two variables, one has to be selected in the Row(s) panel and the other in
the Column(s) panel.
Fig. 3.9 Screen showing the SPSS commands for computing chi-square in crosstabs
Select the variables Treatment group and Performance status from the left
panel and bring them to the “Row(s)” and “Column(s)” sections of the right
panel, respectively, by arrow button. The screen shall look like Fig. 3.10.
(iii) Selecting option for computation: After selecting variables, option needs
to be defined for the crosstabs analysis and computation of chi-square.
Take the following steps:
– Click Statistics option to get the screen shown in Fig. 3.11.
– Check the options “Chi-square” and “Contingency coefficient.”
– Click Continue.
– Click Cells option to get the screen shown in Fig. 3.12. Then,
– Check the options “Observed” and “Expected” under the Counts
section. Observed is checked by default.
– Click Continue. You will be taken back to the screen shown in
Fig. 3.10.
– Use default entries in other options. Readers are advised to try other
options and see what changes they are getting.
– Click OK.
Fig. 3.10 Screen showing selection of variables for chi-square in crosstabs
Fig. 3.11 Screen showing option for computing chi-square and contingency coefficient
Fig. 3.12 Screen showing option for computing observed and expected frequencies
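As with the one-sample case, the result can be cross-checked outside SPSS. The sketch below is an illustration assuming SciPy, not the book’s SPSS procedure; it runs the same association test on the 2 × 3 table of Table 3.11.

```python
# Chi-square test of association between treatment and performance.
from scipy.stats import chi2_contingency

# Rows: counseling group, control group
# Columns: improved, unchanged, deteriorated
observed = [[22, 8, 10],
            [4, 5, 31]]

chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), df)  # 23.91 with df = 2
print(p < 0.05)            # True: H0 rejected; counseling appears effective
```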
7. Select the variable Job_Sat from left panel to the right panel.
8. Click the tag Options and check the box of “Descriptive.” Press Continue.
9. Click OK to get the output.
(b) For computing chi-square statistic (for testing the significance of associa-
tion between two attributes):
1. Start SPSS by using the following command sequence:
2. Click Variable View tag and define the variable Treatment and Perfor-
mance as “Nominal” variables.
3. In the Treatment row, double-click the cell under the column Values and
add the values “1” for Counseling group and “2” for Control group.
Similarly, in the Performance row, define the value “3” for Improved,
4 for Unchanged, and 5 for Deteriorated.
4. Use default entries in rest of the columns.
5. Click Data View tag and feed first forty entries as 1 and next forty entries
as 2 for the Treatment variable.
6. Similarly for the Performance variable, enter first twenty-two entries as 3,
next eight entries as 4, and further ten entries as 5. These three sets of
entries are for counseling group. Similarly for showing the entries of
control group, enter first four entries as 3, next five entries as 4, and after
that thirty-one entries as 5 in the same column.
7. Click the following command sequence for computing chi-square:
Analyze ⟶ Descriptive Statistics ⟶ Crosstabs
8. Select variables Treatment group and Performance status from the left panel
to the “Row(s)” and “Column(s)” sections of the right panel, respectively.
9. Click the option Statistics and check the options “Chi-square” and “Con-
tingency coefficient.” Press Continue.
10. Click OK to get the output.
Exercise
Short-Answer Questions
Note: Write the answer to each of the questions in not more than 200 words:
Q.1. Responses were obtained from males and females on different questions related to their knowledge about smoking. There were three possible responses, Agree, Undecided, and Disagree, for each of the questions. How will you compare the knowledge of males and females about smoking?
Q.2. Write in brief two important applications of chi-square.
Q.3. How will you frame a null hypothesis in testing the significance of an associ-
ation between gender and IQ where IQ is classified into high and low
category? Write the decision criteria in testing the hypothesis.
Q.4. Can chi-square be used for comparing the attitudes of males and females on the issue “Foreign retail chains may be allowed in India” if the frequencies are given in the 3 × 5 table below? If so or otherwise, interpret your findings. Under what situation is chi-square the most robust test?
Q.5 If chi-square is significant, it indicates that an association between the two attributes exists. How would you find the magnitude of the association?
Q.6 What is the phi coefficient? In what situation is it used? Explain by means of an example.
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick mark the
one that you consider the closest to the correct answer.
1. For testing the significance of association between Gender and IQ level, the
command sequence for computing chi-square in SPSS is
(a) Analyze -> Nonparametric Tests -> Chi-square
(b) Analyze -> Descriptive Statistics -> Crosstabs
(c) Analyze -> Chi-square -> Nonparametric Tests
(d) Analyze -> Crosstabs -> Chi-square
2. Choose the most appropriate statement about the null hypothesis in chi-square.
(a) There is an association between gender and response.
(b) There is no association between gender and response.
(c) There are 50–50% chances of significant and insignificant association.
(d) None of the above is correct.
4. The calculated value of chi-square for the following table is
Gender
                Male  Female
Region  North   30    20
        South   10    40
(a) 16.67
(b) 166.7
(c) 1.667
(d) 1667
5. Chi-square is used for
(a) Finding magnitude of an association between two attributes
(b) Finding significance of an association between two attributes
(c) Comparing the variation between two attributes
(d) Comparing median of two attributes
6. Chi-square is the most robust test if the frequency table is
(a) 2 × 2
(b) 2 × 3
(c) 3 × 3
(d) m × n
7. While using chi-square for testing an association between the attributes, SPSS
provides Crosstabs option. Choose the most appropriate statement.
(a) Crosstabs treats all data as nominal.
(b) Crosstabs treats all data as ordinal.
(c) Crosstabs treats some data as nominal and some data as ordinal.
(d) Crosstabs treats data as per the problem.
8. If responses are obtained in the form of frequencies on a 5-point scale and it is required to compare the responses of males and females on the issue “Marketing stream is good for the female students,” which statistical test would you prefer?
(a) Two-sample t-test
(b) Paired t-test
(c) One-way ANOVA
(d) Chi-square test
9. If the p value for a chi-square is .02, what conclusion can you draw?
(a) Chi-square is significant at 95% confidence.
(b) Chi-square is not significant at 95% confidence.
(c) Chi-square is significant at the .01 level.
(d) Chi-square is not significant at the .05 level.
10. The degrees of freedom of chi-square in an r × c table are
(a) r + c
(b) r + c − 1
(c) r × c
(d) (r − 1)(c − 1)
11. Phi coefficient is used if
(a) Both the variables are ordinal.
(b) Both the variables are binary.
(c) Both the variables are interval.
(d) One of the variables is nominal and the other is ordinal.
12. Gamma coefficient is used if
(a) Both the variables are interval.
(b) Both the variables are binary.
(c) Both the variables are ordinal.
(d) Both the variables may be on any scale.
Assignments
1. Following are the frequencies of students in an institute belonging to Low, Medium, and High IQ groups. Can it be concluded that there is a specific trend of IQ among the students? Test your hypothesis at the 5% level.
Frequencies of the students in different IQ groups
2. Test the significance of association between gender and the skill level of workers for the data shown below. Obtain the chi-square value along with the expected frequencies and percentage frequencies in the Crosstabs output and interpret your findings. Test your hypothesis at 5% level.
Workers
Skilled Unskilled
Gender Male 50 15
Female 15 40
Answers to Multiple-Choice Questions
Q.1 b Q.2 b
Q.3 b Q.4 a
Q.5 b Q.6 a
Q.7 a Q.8 d
Q.9 a Q.10 d
Q.11 b Q.12 c
Chapter 4
Correlation Matrix and Partial Correlation:
Explaining Relationships
Learning Objectives
After completing this chapter, you should be able to do the following:
• Learn the concept of linear correlation and partial correlation.
• Explore the research situations in which partial correlation can be effectively
used.
• Understand the procedure in testing the significance of product moment correla-
tion and partial correlation.
• Develop the hypothesis to test the significance of correlation coefficient.
• Formulate research problems where correlation matrix and partial correlation
can be used to draw effective conclusion.
• Learn the application of correlation matrix and partial correlation through case
study discussed in this chapter.
• Understand the procedure of using SPSS in computing correlation matrix and
partial correlation.
• Interpret the output of correlation matrix and partial correlation generated in
SPSS.
Introduction
One of the thrust areas in management research is to find ways and means to improve productivity. It is therefore important to know the variables that affect it. Once these variables are identified, an effective strategy may be adopted by prioritizing them to enhance productivity in the organization. For instance, if a company needs to improve the sale of a product, its first priority would be to ensure the product’s quality and then to improve other variables like the resources available to the marketing team, their incentive criteria, and dealers’ schemes. This is because product quality is the most important parameter in enhancing sale.
Even if the sample is random, it is not possible to find the real relationship
between any two variables as it might be affected by other variables. For instance, if
the correlation computed between height and weight of the children belonging to
age category 12–18 years is 0.85, it may not be considered as the real relationship.
Here all the subjects are in the developmental age, and in this age category, if the
height increases, weight also increases; therefore, the relationship exhibited
between height and weight is due to the impact of age as well. To know the real
relationship between the height and weight, one must eliminate the effect of age.
This can be done in two ways. First, all the subjects can be taken in the same age
category, but it is not possible in the experimental situation once the data collection
is over. Even if an experimenter tries to control the effect of one or two variables
manually, it may not be possible to control the effect of other variables; otherwise
one might end up with getting one or two samples only for the study.
In the second approach, the effects of independent variables are eliminated
statistically by partialing out their effects by computing partial correlation. Partial
correlation provides the relationship between any two variables after partialing out
the effect of other independent variables.
Although the correlation coefficient may not give the clear picture of the real
relationship between any two variables, it provides the inputs for computing partial
and multiple correlations, and, therefore, in most of the studies, it is important to
compute the correlation matrix among the variables. This chapter discusses the
procedure for computing correlation matrix and partial correlation using SPSS.
A matrix is an arrangement of scores in rows and columns, and if its elements are correlation coefficients, it is known as a correlation matrix. Usually, only the upper-diagonal values of a correlation matrix are written. For instance, the correlation
matrix with the variables X1, X2, X3, and X4 may look like as follows:
X1 X2 X3 X4
X1 1 0.5 0.3 0.6
X2 1 0.7 0.8
X3 1 0.4
X4 1
The lower-diagonal values in the matrix are not written because the correlation between X2 and X4 is the same as the correlation between X4 and X2.
Some authors prefer to write the above correlation matrix in the following form:
X1 X2 X3 X4
X1 0.5 0.3 0.6
X2 0.7 0.8
X3 0.4
X4
In this correlation matrix, the diagonal values are not written, as these values are obviously 1: the correlation of a variable with itself is always one.
In this section, we shall discuss the product moment correlation and partial
correlation along with testing of their significance.
$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2]\,[N\sum Y^2 - (\sum Y)^2]}} \tag{4.1}$$
where N is the number of paired scores. The limits of r are from −1 to +1. A positive value of r means higher scores on one variable tend to be paired with higher scores on the other, or lower scores on one variable tend to be paired with lower scores on the other. On the other hand, a negative value of r means higher scores on one variable tend to be paired with lower scores on the other and vice versa. Further, r = +1 indicates a perfect positive relationship between the two variables: if X increases (decreases), Y increases (decreases) in exact linear proportion. Similarly, r = −1 signifies a perfect negative linear correlation: if X is increased (decreased), Y is decreased (increased) in exact linear proportion. The three extreme values of the correlation coefficient r are shown graphically in Fig. 4.1.
Example 4.1: Following are the scores on age and memory retention. Compute the
correlation coefficient and test its significance at 5% level (Table 4.1).
Solution In order to compute the correlation coefficient, first the summations ΣX, ΣY, ΣX², ΣY², and ΣXY shall be computed in Table 4.2.
Fig. 4.1 Graphical presentations of the three extreme cases of correlation coefficient
Here N is 10:
$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2]\,[N\sum Y^2 - (\sum Y)^2]}} = \frac{10 \times 597 - 91 \times 67}{\sqrt{[10 \times 853 - 91^2]\,[10 \times 461 - 67^2]}}$$
$$= \frac{-127}{\sqrt{249 \times 121}} = -0.732$$
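The arithmetic can be verified directly from the column totals of Table 4.2 (ΣX = 91, ΣY = 67, ΣXY = 597, ΣX² = 853, ΣY² = 461); a minimal sketch in plain Python:

```python
# Recomputing r for Example 4.1 from the summary totals.
from math import sqrt

N, Sx, Sy, Sxy, Sx2, Sy2 = 10, 91, 67, 597, 853, 461
r = (N * Sxy - Sx * Sy) / sqrt((N * Sx2 - Sx**2) * (N * Sy2 - Sy**2))
print(round(r, 3))  # -0.732: memory retention decreases as age increases
```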
To see why r must lie between −1 and +1, note that
$$s_x^2 = \frac{1}{n}\sum (X - \bar{X})^2 \;\Rightarrow\; \sum (X - \bar{X})^2 = n s_x^2$$
$$s_y^2 = \frac{1}{n}\sum (Y - \bar{Y})^2 \;\Rightarrow\; \sum (Y - \bar{Y})^2 = n s_y^2$$
and
$$r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n s_x s_y} \;\Rightarrow\; \sum (X - \bar{X})(Y - \bar{Y}) = n s_x s_y r_{xy}$$
Now
$$\sum \left(\frac{X - \bar{X}}{s_x} \pm \frac{Y - \bar{Y}}{s_y}\right)^2 = \frac{\sum (X - \bar{X})^2}{s_x^2} \pm \frac{2\sum (X - \bar{X})(Y - \bar{Y})}{s_x s_y} + \frac{\sum (Y - \bar{Y})^2}{s_y^2}$$
$$= \frac{n s_x^2}{s_x^2} \pm \frac{2 n s_x s_y r}{s_x s_y} + \frac{n s_y^2}{s_y^2} = 2n \pm 2nr = 2n(1 \pm r)$$
Since a sum of squares is never negative, 2n(1 ± r) ≥ 0 for n > 0, so that
$$1 + r \geq 0 \;\Rightarrow\; r \geq -1 \tag{4.2}$$
$$1 - r \geq 0 \;\Rightarrow\; r \leq 1 \tag{4.3}$$
$$\therefore\; -1 \leq r \leq 1$$
Let us apply a transformation shifting the origin and scale of X and Y. Let
$$U = \frac{X - a}{h} \quad\text{and}\quad V = \frac{Y - b}{k} \qquad (h, k > 0)$$
Then X = a + hU, so that $\bar{X} = a + h\bar{U}$ and
$$X - \bar{X} = h(U - \bar{U}) \tag{4.5}$$
Similarly, Y = b + kV, so that $\bar{Y} = b + k\bar{V}$ and
$$Y - \bar{Y} = k(V - \bar{V}) \tag{4.6}$$
Substituting the values of $(X - \bar{X})$ and $(Y - \bar{Y})$ from Eqs. (4.5) and (4.6) into (4.4),
$$r_{x,y} = \frac{\sum h(U - \bar{U})\, k(V - \bar{V})}{\sqrt{\sum h^2 (U - \bar{U})^2}\,\sqrt{\sum k^2 (V - \bar{V})^2}} = \frac{hk}{hk}\cdot\frac{\sum (U - \bar{U})(V - \bar{V})}{\sqrt{\sum (U - \bar{U})^2}\,\sqrt{\sum (V - \bar{V})^2}} = r_{u,v}$$
$$\therefore\; r_{x,y} = r_{u,v}$$
Thus, it may be concluded that the coefficient of correlation between any two
variables is independent of change of origin and scale.
4. The correlation coefficient is the geometric mean of the two regression coefficients. If $b_{yx}$ and $b_{xy}$ are the regression coefficients, then
$$r_{xy} = \sqrt{b_{yx}\, b_{xy}}$$
5. One must ensure that the correlation coefficient is generalized only to the population from which the sample was drawn. For a small sample, the correlation between any two variables may come out high by chance; if it does, it must be verified with a larger, representative, and relevant sample.
One of the main limitations of the correlation coefficient is that it measures only the linear relationship between two variables. Further, the correlation coefficient should be computed only when the data are measured either on an interval or a ratio scale.
The other limitation of the correlation coefficient is that it does not give the real
relationship between the variables. To overcome this problem, partial correlation
may be computed which explains the real relationship between the variables after
controlling for other variables with certain limitations.
First Approach
The easiest way to test the null hypothesis mentioned above is to look up the critical value of r with n − 2 degrees of freedom at the desired level of significance in Table A.3 in the Appendix. If the calculated value of r is less than or equal to the critical value of r, the null hypothesis fails to be rejected, and if the calculated r is greater than the critical value of r, the null hypothesis may be rejected. For instance, if the correlation coefficient between height and self-esteem of 25 individuals is 0.45, the critical value of r required for significance at the .05 level and N − 2 (= 23) df can be seen from Table A.3 in the Appendix as 0.396. Since the calculated value of r (0.45) is greater than the critical value of r (0.396), the null hypothesis may be rejected at the .05 level of significance, and we may conclude that there is a significant correlation between height and self-esteem.
Second Approach
$$t = \frac{r}{\sqrt{1 - r^2}}\,\sqrt{n - 2} \tag{4.7}$$
Here r is the observed correlation coefficient and n is the number of paired sets of
data.
The calculated value of t is compared with the tabulated value of t at the .05 level and n − 2 df (= t.05(n−2)). The value of tabulated t can be obtained from Table A.2 in the Appendix.
Thus, if Cal t ≤ t.05(n−2), the null hypothesis fails to be rejected at the .05 level of significance, and if Cal t > t.05(n−2), the null hypothesis may be rejected at the .05 level of significance.
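A small sketch (assuming SciPy; the function name is ours, chosen for illustration) that applies Eq. (4.7) to the height/self-esteem example from the first approach:

```python
# Testing the significance of r with the t-statistic of Eq. (4.7).
from math import sqrt
from scipy.stats import t as t_dist

def test_r(r, n, alpha=0.05):
    """Two-tailed test of H0: rho = 0 for a product moment correlation."""
    t_cal = r / sqrt(1 - r**2) * sqrt(n - 2)
    t_tab = t_dist.ppf(1 - alpha / 2, df=n - 2)
    return t_cal, t_tab, abs(t_cal) > t_tab

print(test_r(0.45, 25))  # (2.42, 2.07, True): H0 rejected, as before
```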
Third Approach
Note: The SPSS output follows the third approach and provides p values for each of the correlation coefficients in the correlation matrix.
Partial Correlation
$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \tag{4.8}$$
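Equation (4.8) translates directly into code; below is a minimal helper (our own illustrative function, not an SPSS facility):

```python
# First-order partial correlation r12.3 per Eq. (4.8).
from math import sqrt

def partial_r(r12, r13, r23):
    return (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2))

# Higher orders reuse the same formula on lower-order partials, e.g.,
# r12.34 = partial_r(r12.3, r14.3, r24.3).
```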
Example 4.2: The following correlation matrix shows the correlation among
different academic performance parameters. Compute partial correlations r12.3
and r12.34 and test their significance. Interpret the findings also (Table 4.3).
Solution
(i) Computation of r12.3: Since we know that
$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$
(ii) Computation of r12.34: We shall first compute the first-order partial correlations r12.3, r14.3, and r24.3, which are required to compute the second-order partial correlation r12.34. Since r12.3 has already been computed above, the remaining two are computed from
$$r_{14.3} = \frac{r_{14} - r_{13}\, r_{34}}{\sqrt{(1 - r_{13}^2)(1 - r_{34}^2)}} \quad\text{and}\quad r_{24.3} = \frac{r_{24} - r_{23}\, r_{34}}{\sqrt{(1 - r_{23}^2)(1 - r_{34}^2)}}$$
After substituting the values of r12.3, r14.3, and r24.3, the second-order partial correlation becomes
$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\, r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}$$
Dependent variable
1. Job satisfaction (X1)
Independent variables
1. Autonomy (X2)
2. Organizational culture (X3)
3. Compensation (X4)
4. Upward communications (X5)
5. Job training opportunity (X6)
6. Management style (X7)
7. Performance appraisal (X8)
8. Recognition (X9)
9. Working atmosphere (X10)
10. Working relationships (X11)
The following computations may be done to fulfill the objectives of the study:
1. Compute product moment correlation coefficient between Job satisfaction and
each of the environmental and motivational variables.
2. Identify a few independent variables that show significant correlations with Job satisfaction for further developing the regression model. Say these selected variables are X3, X9, X6, and X2.
3. Out of these identified variables in step 2, pick up the one having the highest
correlation with the dependent variable (X1), say it is X6.
4. Then find the partial correlation between the variables X1 and X6 by eliminating
the effect of variables X3, X9, and X2 in steps. In other words, find the first-order
partial correlation r16.3, second-order partial correlation r16.39, and third-order
partial correlation r16.392.
5. Similarly find the partial correlation between other identified variables X3, X9,
and X2 with that of dependent variable (X1) in steps. In other words, compute the
following three more sets of partial correlation:
(i) r13.9, r13.96, and r13.962
(ii) r19.3, r19.36, and r19.362
(iii) r12.3, r12.39, and r12.396
Statistical Test
To address the objectives of the study and to test the listed hypotheses, the
following computations may be done:
1. Correlation matrix among all the independent variables and dependent variable
2. Partial correlations of different orders between the Job satisfaction and identified
independent variables
Thus, we have seen how a research situation requires computing correlation
matrix and partial correlations to fulfill the objectives.
Example 4.3 To understand the relationships between patient’s loyalty and other
variables, a study was conducted on 20 patients in a hospital. The following data
was obtained. Construct the correlation matrix and compute different partial
correlations using SPSS and interpret the findings (Table 4.4).
Solution First of all, the correlation matrix shall be computed using SPSS; the option shall be selected to flag the significant correlation values. After selecting the variables that show significant correlation with customer loyalty, the partial correlation shall be computed between customer loyalty and any of these selected variables after controlling for the effect of the remaining variables. The correlation coefficients and partial correlations so obtained in the SPSS output shall be tested for significance by using the p value.
The values of mean and standard deviation for all the variables are shown in
Table 4.5. The user may draw the conclusions accordingly, and the findings may
be used for further analysis in the study.
The actual output shows the full correlation matrix, but only upper diagonal
values of the correlation matrix are shown in Table 4.6. This table shows the
magnitude of correlation coefficients along with their p values and sample size.
Fig. 4.4 Screen showing SPSS commands for computing correlation matrix
Fig. 4.5 Screen showing selection of variables for computing correlation matrix
Fig. 4.6 Screen showing option for computing correlation matrix and other statistics
The product moment correlation coefficient is also known as the Pearson correlation, as it was developed by the British mathematician Karl Pearson. The value of the correlation coefficient required for significance (known as the critical value) at 5% as well as
at 1% level can be seen from Table A.3 in the Appendix. Thus, at 18 degrees of
freedom, the critical values of r at 5 and 1% are 0.444 and 0.561, respectively. The
correlation coefficient with one asterisk (*) mark is significant at 5% level, whereas
the one with two asterisk (**) marks shows the significance at 1% level. In this
example, the research hypothesis is two-tailed which states that “There is a signifi-
cant correlation between the two variables.” The following conclusions may be
drawn from the results in Table 4.6:
(a) Customer loyalty is significantly correlated with Customer trust at the 5% level, whereas it is significantly correlated with Service quality and Customer satisfaction at the 1% level.
(b) Customer satisfaction is highly correlated with Service quality. This is expected, as good service quality makes customers satisfied, and only satisfied customers would be loyal to any hospital.
(c) All those correlation coefficients having p value less than .05 are significant at
5% level. This is shown by asterisk (*) mark by the side of the correlation
coefficient. Similarly correlations having p value less than .01 are significant at
1% level, and this is indicated by two asterisk (**) marks by the side of
correlation coefficient.
Table 4.6 Correlation matrix for the data on customer’s behavior along with p values

                                                  Customer    Service       Customer           Customer
                                                  trust (X1)  quality (X2)  satisfaction (X3)  loyalty (X4)
Customer trust (X1)         Pearson correlation   1           .754**        .704**             .550*
                            Sig. (2-tailed)                   .000          .001               .012
                            N                     20          20            20                 20
Service quality (X2)        Pearson correlation               1             .910**             .742**
                            Sig. (2-tailed)                                 .000               .000
                            N                                 20            20                 20
Customer satisfaction (X3)  Pearson correlation                             1                  .841**
                            Sig. (2-tailed)                                                    .000
                            N                                               20                 20
Customer loyalty (X4)       Pearson correlation                                                1
                            Sig. (2-tailed)
                            N                                                                  20

**Correlation is significant at the 0.01 level (2-tailed); *Correlation is significant at the 0.05 level (2-tailed)
The decision of eliminating the effect of variables X2 and X1 has been taken because
both these variables are significantly correlated with the criterion variable. How-
ever, one can investigate the relationship between X4 vs. X2 after eliminating the
effect of the variables X3 and X1. Similarly partial correlation between X4 vs. X1
may also be investigated after eliminating the effect of the variables X3 and X2. The
procedure of computing these partial correlations with SPSS has been discussed in
the following sections:
(a) Data File for Computing Partial Correlation
The data file which was prepared for computing correlation matrix shall be used
for computing the partial correlations. Thus, procedure for defining the
variables and entering the data for all the variables is exactly the same as was
done in case of computing correlation matrix.
(b) SPSS Commands for Partial Correlation
After entering all the data in the data view, take the following steps for
computing partial correlation:
(i) Initiating the SPSS commands for partial correlation: In Data View, go to
the following commands in sequence:
Analyze ⟶ Correlate ⟶ Partial
The screen shall look like Fig. 4.7.
(ii) Selecting variables for partial correlation: After clicking the Partial option,
you will get the next screen for selecting variables for the partial correlation.
– Select the two variables Customer loyalty (X4) and Customer satisfac-
tion (X3) from the left panel to the “Variables” section in the right panel.
Here, relationship between the variables X4 and X3 needs to be
computed after controlling the effects of Service quality (X2) and
Customer trust (X1).
– Select the variables Service quality (X2) and Customer trust (X1) from
the left panel to the “Controlling for” section in the right panel. X2 and
X1 are the two variables whose effects are to be eliminated.
The selection of variables is made either one by one or all at once. To do
so, the variable needs to be selected from the left panel, and by arrow
command, it may be brought to the right panel. The screen shall look like
Fig. 4.8.
(iii) Selecting options for computation: After selecting the variables for partial
correlation and identifying controlling variables, option needs to be defined
for the computation of partial correlation. Take the following steps:
– In the screen shown in Fig. 4.8, ensure that the options “Two-tailed”
and “Display actual significance level” are checked. By default they are
checked.
– Click the tag Options; you will get the screen as shown in Fig. 4.9.
Take the following steps:
– Check the box of “Means and standard deviations.”
Solved Example of Correlation Matrix and Partial Correlations by SPSS 125
Fig. 4.7 Screen showing SPSS commands for computing partial correlations
– Use the default entries in other options. Readers are advised to try
other options and see what changes they are getting in their outputs.
– Click Continue.
– Click OK.
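What SPSS computes here can also be reproduced from first principles: the partial correlation of X4 and X3 controlling for X2 and X1 equals the ordinary correlation between the residuals of X4 and X3 after regressing each on the control variables. The sketch below assumes NumPy and uses placeholder arrays, since the raw data of Table 4.4 is not reproduced here.

```python
# Partial correlation as the correlation of regression residuals.
import numpy as np

def partial_corr(y1, y2, controls):
    """Correlation of y1 and y2 after removing the linear effect of controls."""
    Z = np.column_stack([np.ones(len(y1))] + list(controls))
    res1 = y1 - Z @ np.linalg.lstsq(Z, y1, rcond=None)[0]
    res2 = y2 - Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]
    return np.corrcoef(res1, res2)[0, 1]

# Placeholder arrays standing in for X1, X2, X3, X4 of Table 4.4:
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=20), rng.normal(size=20)
x3 = 0.8 * x2 + rng.normal(size=20)              # stands in for satisfaction
x4 = 0.7 * x3 + 0.4 * x1 + rng.normal(size=20)   # stands in for loyalty
print(partial_corr(x4, x3, [x2, x1]))
```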
Table 4.7 shows the descriptive statistics of all the variables selected in the study.
Values of mean and standard deviations may be utilized for further analysis.
Readers may note that similar table of descriptive statistics was also obtained
while computing correlation matrix by using SPSS.
In Table 4.8, partial correlation between Customer loyalty (X4) and Customer
satisfaction (X3) after controlling the effect of Service quality (X2) and Customer
trust (X1) is shown as 0.600. Since p value for this partial correlation is .009, which
is less than .01, it is significant at 1% level. It may be noted that the correlation
coefficient between Customer loyalty and Customer satisfaction in Table 4.6 is
0.841 which is highly significant, but when the effects of Service quality and
Customer trust are eliminated, the correlation drops to 0.600. This partial correlation of 0.600 is still substantial in the given sample, and, hence, it may be concluded that within the framework of this study, there exists a real relationship between Customer loyalty and Customer satisfaction. One may draw the conclusion that Customer satisfaction is the most important factor for maintaining patients’ loyalty toward the hospital.
Fig. 4.9 Screen showing option for computing partial correlation and other statistics
Table 4.8 Partial correlation between Customer loyalty (X4) and Customer satisfaction (X3) after controlling the effect of Service quality (X2) and Customer trust (X1)

Control variables: Service quality (X2) and Customer trust (X1)
                                                        Customer       Customer
                                                        loyalty (X4)   satisfaction (X3)
Customer loyalty (X4)       Correlation                 1.000          .600
                            Significance (2-tailed)                    .009
                            df                          0              16
Customer satisfaction (X3)  Correlation                 .600           1.000
                            Significance (2-tailed)     .009
                            df                          16             0
Note: Readers are advised to compute partial correlations of different orders with the same data
2. Click Variable View tag and define the variables Trust, Service, Satisfac-
tion, and Loyalty as Scale variables.
3. Once the variables are defined, type the data column wise for these variables
by clicking Data View.
4. In Data View, click the following commands in sequence for correlation
matrix:
Analyze ⟶ Correlate ⟶ Bivariate
5. Select all the variables from left panel to the “Variables” section of the right
panel.
6. Ensure that the options “Pearson,” “Two-tailed,” and “Flag significant
correlations” are checked by default.
7. Click the tag Options and check the box of “Means and standard
deviations.” Click Continue.
8. Click OK for output.
(b) For Computing Partial Correlation
1. Follow steps 1–3 as discussed above.
2. With the same data file, follow the below-mentioned commands in sequence
for computing partial correlations:
Analyze ⟶ Correlate ⟶ Partial
3. Select any two variables between which the partial correlation needs to be
computed from left panel to the “Variables” section of the right panel.
Select the variables whose effects are to be controlled, from left panel to the
“Controlling for” section in the right panel.
4. After selecting the variables for computing partial correlation, click the caption Options on the screen. Check the box “Means and standard deviations” and press Continue.
5. Click OK to get the output of the partial correlation and descriptive
statistics.
Exercise
Short-Answer Questions
Note: Write the answer to each of the questions in not more than 200 words.
Q.1. “Product moment correlation coefficient is a deceptive measure of relation-
ship, as it does not reveal anything about the real relationship between two
variables.” Comment on this statement.
Q.2. Describe a research situation in management where partial correlation can be
used to draw some meaningful conclusions.
Q.3. Compute the correlation coefficient between X and Y and interpret your findings, considering that Y and X are perfectly related by the equation Y = X².
X: −3 −2 −1 0 1 2
Y: 9 4 1 0 1 4
Q.4. How will you test the significance of partial correlation using t-test?
Q.5. What does the p value refer to? How is it used in testing the significance of
product moment correlation coefficient?
Multiple-Choice Questions
Note: For each of the question, there are four alternative answers. Tick mark the
one that you consider the closest to the correct answer.
1. In testing the significance of product moment correlation coefficient, degree of
freedom for t-test is
(a) N − 1
(b) N + 2
(c) N + 1
(d) N − 2
2. If the sample size increases, the value of correlation coefficient required for its
significance
(a) Increases
(b) Decreases
(c) Remains constant
(d) May increase or decrease
3. Product moment correlation coefficient measures the relationship which is
(a) Real
(b) Linear
(c) Curvilinear
(d) None of the above
4. Given that r12 = 0.7 and r12.3 = 0.28, where X1 is academic performance, X2 is
entrance test score, and X3 is IQ, what interpretation can be drawn?
(a) Entrance test score is an important contributory variable to the academic
performance.
(b) IQ affects the relationship between academic performance and entrance
test score in a negative fashion.
(c) IQ has got nothing to do with the academic performance.
(d) It seems there is no real relationship between academic performance and
entrance test score.
5. If the p value for a partial correlation is 0.001, what conclusion can be drawn?
(a) Partial correlation is not significant at 5% level.
(b) Partial correlation is significant at 1% level.
2. The data in the following table shows the determinants of US domestic price of
copper during 1966–1980. Compute the following and interpret your findings:
(a) Correlation matrix with all the six variables
(b) Partial correlations: r12.3, r12.34, and r12.346
(c) Partial correlations: r13.2, r13.24, and r13.246
Determinants of US domestic price of copper
Answers to Multiple-Choice Questions
Q.1 d Q.2 b
Q.3 b Q.4 d
Q.5 b Q.6 a
Q.7 c Q.8 b
Q.9 c Q.10 d
Chapter 5
Regression Analysis and Multiple Correlations:
For Estimating a Measurable Phenomenon
Learning Objectives
After completing this chapter, you should be able to do the following:
• Explain the use of regression analysis and multiple correlation in research.
• Interpret various terms involved in regression analysis.
• Learn to use SPSS for doing regression analysis.
• Understand the procedure of identifying the most efficient regression model.
• Know the method of constructing the regression equation based on the SPSS
output.
Introduction
Regression analysis deals with estimating the value of dependent variable on the
basis of one or more independent variables. To do so, an equation is developed
between dependent and independent variables by means of least square method.
When the estimation is done on the basis of one independent variable, the procedure
is known as simple regression, and if the estimation involves more than one
independent variable, it is referred to as multiple regression analysis.
In multiple regression analysis, the dependent variable is referred to as Y,
whereas independent variables are denoted as X. The dependent variable is also
known as criterion variable. The goal is to develop an equation that will determine
the Y variable in a linear function of corresponding X variables. The regression
equation can be either linear or curvilinear, but our discussion shall be limited to
linear regression only.
In regression analysis, a regression model is developed by using the observed
data obtained on dependent variable and several independent variables. During the
process, only those independent variables are picked up for developing the model
which shows significant relationship with dependent variable. Therefore, the
researcher must be careful in identifying the independent variables in regression
analysis study. It may be quite possible that some of the important independent
variables might have been left in the study, and, therefore, in spite of the best
possible effort, the regression model so developed may not be reliable.
Multiple regression analysis can be used in many applications of management
and behavioral researches. Numerous situations can be listed where the use of this
technique can provide an edge to the decision makers for optimum solutions.
For example, in order to evaluate and reform the existing organization and make it more responsive to new challenges, the management may be interested to know the factors responsible for sales (to plan the business strategy), the estimated inventory required in a given month, and the factors affecting the job satisfaction of the employees. They may also be interested in developing models for deciding the pay packets of employees, the factors that motivate people to work, or the parameters that affect the productivity of work. In all these situations, regression
model may provide the input to the management for strategic decision-making. The
success of the model depends upon the inclusion of relevant independent variables
in the study. For instance, a psychologist may like to draw up variables that directly
affect one’s mental health causing abnormal behavior. Therefore, it is important for
the researchers to review the literature thoroughly for identifying the relevant
independent variables for estimating the criterion variable.
Besides regression analysis, there are other quantitative and qualitative methods
used in performance forecasting. But the regression analysis is one of the most
popularly used quantitative techniques.
In developing a multiple regression equation, one needs to know the efficiency
in estimating the dependent variable on the basis of the identified independent
variables in the model. The efficiency of estimation is measured by the coefficient
of determination (R2) which is the square of multiple correlation. The coefficient of
determination explains the percentage of variance in the dependent variable by the
identified independent variables in the model. The multiple correlation explains the
relationship between the group of independent variables and dependent variable.
Thus, high multiple correlation ensures greater accuracy in estimating the value of
dependent variable on the basis of independent variables. Usually multiple correla-
tion, R is computed during regression analysis to indicate the validity of regression
model. It is necessary to show the value of R2 along with regression equation for
having an idea about the efficiency in prediction.
Any regression model having larger multiple correlation gives better estimates
in comparison to that of other models. We will see an explanation of the multiple
correlation while discussing the solved example later in this chapter.
Multiple Correlation
If the number of independent variables is more than two, then the multiple
correlation is computed from the following formula:
$$R_{1.2345\ldots n} = \sqrt{1 - (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)\cdots(1 - r_{1n.23\ldots(n-1)}^2)} \tag{5.2}$$
1. The multiple correlation can never be lower than the highest correlation between
dependent and any of the independent variables. For instance, the value of R1.234
can never be less than the value of any of the product moment correlations r12,
r13, or r14.
2. Sometimes, an independent variable does not show any relationship with depen-
dent variable, but if it is combined with some other variable, its effect becomes
significant. Such variable is known as suppression variable. These suppression
variables should be handled carefully. Thus, if the independent variables are
identified on the basis of their magnitude of correlations with the dependent
variable for developing regression line, some of the suppression variable might
136 5 Regression Analysis and Multiple Correlations: For Estimating a Measurable. . .
(ii) Computation of R1.234:
$$R_{1.234} = \sqrt{1 - (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)}$$
Thus, substituting the values of r12, r13.2, and r14.23,
$$R_{1.234} = \sqrt{1 - \left[1 - (0.5)^2\right]\left[1 - 0.63^2\right]\left[1 - 0.95^2\right]} = \sqrt{1 - [0.75 \times 0.603 \times 0.10]} = 0.976$$
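The product form of Eq. (5.2) is easy to check numerically. A one-off sketch in Python (the small difference from the rounded hand value of 0.976 comes from carrying 1 − 0.95² = 0.0975 at full precision):

```python
# Recomputing R1.234 from the correlations used above.
from math import sqrt

r12, r13_2, r14_23 = 0.5, 0.63, 0.95
R = sqrt(1 - (1 - r12**2) * (1 - r13_2**2) * (1 - r14_23**2))
print(round(R, 3))  # 0.978
```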
Interpretation
Since R1.234 = 0.976 is very close to 1, the three independent variables taken together estimate the dependent variable with high accuracy.
Coefficient of Determination
It can be defined as the variance explained in the dependent variable on the basis of
the independent variables selected in the regression model. It is the square of
multiple correlation and is represented by R². Thus, in regression analysis, R² is used for assessing the efficiency of the regression model. If for a particular
regression model R is 0.8, it means that 64% of the variability in the dependent
variable can be explained by the independent variables selected in the model.
$$(Y - \bar{Y}) = b_{yx}(X - \bar{X}) \tag{5.3}$$
Equation 5.3 can be used to predict the value of Y if the value of X is known. Similarly, to estimate the value of X from the value of Y, the regression equation of X on Y shall be used, which is shown in Eq. 5.4.
$$(X - \bar{X}) = b_{xy}(Y - \bar{Y}) \tag{5.4}$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y, respectively, and $b_{yx}$ and $b_{xy}$ are the regression coefficients. These regression coefficients can be computed as
$$b_{yx} = r\,\frac{s_Y}{s_X} \tag{5.5}$$
$$b_{xy} = r\,\frac{s_X}{s_Y} \tag{5.6}$$
After substituting the value of $b_{yx}$ in Eq. (5.3) and solving, we get
$$Y = r\,\frac{s_Y}{s_X}\,X + \left(\bar{Y} - r\,\frac{s_Y}{s_X}\,\bar{X}\right) \tag{5.7}$$
$$\Rightarrow\; Y = BX + C \tag{5.8}$$
where $B = r\,\dfrac{s_Y}{s_X}$ and $C = \bar{Y} - r\,\dfrac{s_Y}{s_X}\,\bar{X}$. The coefficients B and C are known as the unstandardized regression coefficient and the regression constant, respectively.
Remark Here $\bar{Y}$ is the mean of Y and $\bar{X}$ is the mean of X.
After substituting the values of byx and bxy in the regression equations (5.3) and
(5.4), we get
$$(Y - \bar{Y}) = r\,\frac{s_y}{s_x}(X - \bar{X}) \qquad\qquad (X - \bar{X}) = r\,\frac{s_x}{s_y}(Y - \bar{Y})$$
$$\frac{Y - \bar{Y}}{s_y} = r\,\frac{X - \bar{X}}{s_x} \qquad\qquad \frac{X - \bar{X}}{s_x} = r\,\frac{Y - \bar{Y}}{s_y}$$
$$Z_y = \beta_x Z_x \tag{5.9}$$
$$Z_x = \beta_y Z_y \tag{5.10}$$
The Eqs. (5.9) and (5.10) are known as regression equations in standard score form, and the coefficients $\beta_x$ and $\beta_y$ are known as beta coefficients and are referred to as standardized regression coefficients.
The two regression equations (5.3) and (5.4) are different. Equation (5.3) is known
as regression equation of Y on X and is used to estimate the value of Y on the basis of
X, whereas Eq. (5.4) is known as regression equation of X on Y and is used for
estimating the value of X if Y is known. These two equations can be rewritten as
follows:
$(Y - \bar{Y}) = b_{yx}\,(X - \bar{X})$

$(Y - \bar{Y}) = \dfrac{1}{b_{xy}}\,(X - \bar{X})$

These two regression equations can be the same only if the expressions on the right-hand side of the two equations are the same. That is,

$b_{yx}\,(X - \bar{X}) = \dfrac{1}{b_{xy}}\,(X - \bar{X})$

$\Rightarrow\ b_{yx}\,b_{xy} = 1$

$\Rightarrow\ r\,\dfrac{s_y}{s_x} \times r\,\dfrac{s_x}{s_y} = 1$

$\Rightarrow\ r^2 = 1 \ \Rightarrow\ r = \pm 1$
Hence, the two regression equations shall be similar if there is a perfect positive
or perfect negative correlation between them. In that situation, same regression
equation can be used to estimate the value of Y or value of X.
The regression coefficient can be obtained for the given set of data by simplifying
the formula:
$B = r\,\dfrac{s_Y}{s_X}$

Writing r and the two standard deviations in terms of the raw sums,

$B = \dfrac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \times \dfrac{\sqrt{N\sum Y^2 - (\sum Y)^2}}{\sqrt{N\sum X^2 - (\sum X)^2}}$

After solving,

$B = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$   (5.11)
$C = \bar{Y} - r\,\dfrac{s_Y}{s_X}\,\bar{X}$

After simplification,

$C = \dfrac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2}$   (5.12)
Thus, by substituting the value of B and C in Eq. (5.8), regression equation can
be developed.
Example 5.2 Consider the two sets of scores on autonomy (X) and job satisfaction (Y) shown below. Compute the regression coefficient "B" and constant "C" and develop the regression equation.

Autonomy (X):          15  13   7  11   9
Job satisfaction (Y):   9   8   5   8   6
To compute "B" and "C," we shall first compute $\sum X = 55$, $\sum Y = 36$, $\sum X^2 = 645$, and $\sum XY = 416$, with N = 5.
$B = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2} = \dfrac{5 \times 416 - 55 \times 36}{5 \times 645 - 55 \times 55} = 0.5$

$C = \dfrac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2} = \dfrac{36 \times 645 - 55 \times 416}{5 \times 645 - 55 \times 55} = 1.7$
These values can be obtained from the SPSS output discussed in the solved
Example 5.1. The SPSS produces these outputs on the basis of least square
methods. The method of least square has been discussed later in this chapter.
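These hand computations can be cross-checked programmatically. A minimal plain-Python sketch using the data of Example 5.2 (no SPSS involved; the variable names are illustrative):

```python
# A quick check of Example 5.2; X = autonomy, Y = job satisfaction
X = [15, 13, 7, 11, 9]
Y = [9, 8, 5, 8, 6]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

B = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x**2)       # Eq. (5.11)
C = (sum_y * sum_x2 - sum_x * sum_xy) / (N * sum_x2 - sum_x**2)  # Eq. (5.12)
print(B, C)  # 0.5 1.7
```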
1. The square root of the product of two regression coefficients is equal to the
correlation coefficient between X and Y. The sign of the correlation coefficient is
equal to the sign of the regression coefficients. Further, the signs of the two
regression coefficients are always same.
$\because\ b_{yx}\,b_{xy} = r\,\dfrac{s_y}{s_x} \times r\,\dfrac{s_x}{s_y} = r^2$

$\Rightarrow\ r = \sqrt{b_{yx}\,b_{xy}}$ (with the sign of r taken as the common sign of the two regression coefficients)
To prove that the sign of the correlation coefficient between X and Y and both
the regression coefficients are same, consider the following formula:
$r_{xy} = \dfrac{\mathrm{Cov}(X, Y)}{s_X\,s_Y}$   (5.13)

$b_{yx} = \dfrac{\mathrm{Cov}(X, Y)}{s_X^2}$   (5.14)

$b_{xy} = \dfrac{\mathrm{Cov}(X, Y)}{s_Y^2}$   (5.15)
Since $s_X$ and $s_Y$ are always positive, the signs of $r_{xy}$, $b_{yx}$, and $b_{xy}$ will be the same and will depend only upon the sign of Cov(X, Y).
2. If one of the regression coefficients is greater than 1, the other must be less than 1. In other words, both regression coefficients may be less than 1, but both can never be greater than 1. This follows because $b_{yx}\,b_{xy} = r^2 \leq 1$, so if one coefficient exceeds 1, the other must fall below 1. We also have, by the arithmetic-geometric mean inequality,

$\dfrac{b_{yx} + b_{xy}}{2} \geq \sqrt{b_{yx}\,b_{xy}} = r$
The simple linear regression equation (5.8) is also known as least squares regression
equation. Let us plot the paired values of Xi and Yi for n sets of data; the scattergram
shall look like Fig. 5.1.
The line of best fit can be represented as
$\hat{Y} = BX + C$
where B is the slope of the line and C is the intercept on Y axis. There can be many
lines passing through these points, but the line of best fit shall be the one for which
the sum of the squares of the residuals should be least. This fact can be explained as
follows:
Each sample point has two dimensions X and Y. Thus, for ith point, Yi is the
actual value and Y^i is the estimated value obtained from the line. We shall call the
line as the line of best fit if the total sum of squares is least for all these points.
$\sum \left(Y_i - \hat{Y}_i\right)^2 = \left(Y_1 - \hat{Y}_1\right)^2 + \left(Y_2 - \hat{Y}_2\right)^2 + \left(Y_3 - \hat{Y}_3\right)^2 + \ldots + \left(Y_n - \hat{Y}_n\right)^2$
Since the criterion used for selecting the best fit line is based upon the fact that
the squares of the residuals should be least, the regression equation is known as least
square regression equation. This method of developing regression equation is
known as ordinary least square method (OLS) or simply least square method.
Least square means that the criterion used to select the best fitting line is that the
sum of the squares of the residuals should be least.
In other words, the least squares regression equation is the line for which the sum of squared residuals $\sum \left(Y_i - \hat{Y}_i\right)^2$ is least.
The line of best fit is chosen on the basis of some algebra based on the concept of
differentiation and solving the normal equations. We can compute the regression
coefficient B and regression constant C so that the sum of the squared residuals is
minimized. The procedure is as follows:
Consider a set of n data points (X1,Y1), (X2,Y2), . . ., (Xn,Yn), then the regression
line is
$Y_i = \hat{Y}_i + e_i$   (5.17)
where Yi is the actual value and Y^i is the estimated value obtained from the
regression line shown in Eq. (5.16). The ei is the amount of error in estimating Yi.
Our effort is to minimize the errors $e_i$ for every i, so as to get the best fit of the regression
line. This can be done by minimizing the sum of the squared deviation S2 as shown
below:
$S^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \left(Y_i - BX_i - C\right)^2$   (5.18)
The coefficients B and C are so chosen that S2 is minimized. This can be done by
differentiating equation (5.18) first with respect to B and then with respect to C and
equating the results to zero.
Thus,
$\dfrac{\partial S^2}{\partial B} = -2\sum_{i=1}^{n} X_i\left(Y_i - BX_i - C\right) = 0$

and $\dfrac{\partial S^2}{\partial C} = -2\sum_{i=1}^{n} \left(Y_i - BX_i - C\right) = 0$
Solving these equations, we get
$\sum_{i=1}^{n} X_i\left(Y_i - BX_i - C\right) = 0$

and $\sum_{i=1}^{n} \left(Y_i - BX_i - C\right) = 0$
Taking the summation inside the bracket, the equations become
$B\sum_{i=1}^{n} X_i^2 + C\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i Y_i$   (5.19)

$B\sum_{i=1}^{n} X_i + nC = \sum_{i=1}^{n} Y_i$   (5.20)
The above two equations are known as normal equations having two unknowns
B and C.
After solving these equations for B and C, we get

$B = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$

and

$C = \dfrac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2}$
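For readers who prefer to see the normal equations in action, here is a small numpy sketch (numpy assumed available) that solves Eqs. (5.19) and (5.20) directly as a 2 × 2 linear system, reusing the data of Example 5.2:

```python
import numpy as np

X = np.array([15, 13, 7, 11, 9], dtype=float)
Y = np.array([9, 8, 5, 8, 6], dtype=float)
n = len(X)

# [ sum(X^2)  sum(X) ] [B]   [ sum(XY) ]
# [ sum(X)    n      ] [C] = [ sum(Y)  ]
A = np.array([[np.sum(X**2), np.sum(X)],
              [np.sum(X),    n]])
b = np.array([np.sum(X * Y), np.sum(Y)])
B, C = np.linalg.solve(A, b)
print(B, C)  # 0.5 1.7 -- same values as Eqs. (5.11) and (5.12)
```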
In using the linear regression model, the following assumptions must be satisfied:
1. Both the variables X and Y must be measured on either interval or ratio scale.
2. The regression model is linear in nature.
3. Error terms in estimating the dependent variable are independent and normally
distributed.
4. The variance of the errors in predicting the dependent variable is constant irrespective of the values of X (homoscedasticity).
Multiple Regression
$Y = a + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4$
where
Y is a dependent variable
X1, X2, X3, and X4 are the independent variables
a represents regression constant
b1, b2, b3, and b4 are the unstandardized regression coefficients
In using SPSS for regression analysis, the regression coefficients are computed in the output. The significance of these regression coefficients is tested by means of a t-test. A regression coefficient is significant at the 5% level if its significance value (p value) provided in the output is less than .05. Significance of a regression coefficient indicates that the corresponding variable significantly explains the variation in the dependent variable and contributes to the regression model. An F-test is computed in the output to test the significance of the overall model, whereas R2 and adjusted R2 show the percentage of variability in the dependent variable explained by all the independent variables together in the model. Further, standardized regression coefficients are computed in the output to find the relative predictability of the independent variables in the model.
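Readers working outside SPSS can obtain an analogous output from other statistical software. As an illustration only, the following Python sketch uses the statsmodels package on synthetic data (all names and values here are hypothetical) to produce the B coefficients with their t and p values, the overall F-test, R2, and adjusted R2:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data: 50 cases, four independent variables
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 2 + X @ np.array([1.5, -0.8, 0.0, 0.4]) + rng.normal(size=50)

# Fit Y = a + b1*X1 + ... + b4*X4 and print the full summary table
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())
```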
While doing regression analysis, the independent variables are selected either on the basis of the literature or some known information. In conducting a regression study, a large number of independent variables are usually selected, and, therefore, there is a need to identify only those independent variables which explain the maximum variation in the dependent variable. This can be done by following either of two methods in SPSS, namely, the "Stepwise" method or the "Enter" method.
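The logic of stepwise selection can be sketched in a few lines of code. The following Python function (statsmodels assumed) is a simplified forward-selection illustration, not SPSS's exact algorithm, which additionally re-tests already entered variables for removal:

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(y, X, names, p_enter=0.05):
    """Simplified forward selection: repeatedly add the candidate variable
    with the smallest p value, as long as that p value is below p_enter."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = np.asarray(fit.pvalues)[-1]  # p value of the newest variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]
```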
Enter Method
The main focus of any industry is to maximize the profits by controlling different
strategic parameters. Optimum processes are identified, employees are motivated,
incentives are provided to sales force, and human resources are strengthened to
enhance the productivity and improve profit scenario. All these situations lead to an
exploratory study where the end result is estimated on the basis of certain indepen-
dent parameters. For instance, if one decides to know what all parameters are required
to boost the sales figure in an organization, then a regression study may be planned.
The parameters like employee’s incentives, retailer’s margin, user’s schemes, prod-
uct info, advertisement expenditure, and socioeconomic status may be studied to
develop the regression model. Similarly, regression analysis may be used to identify
the parameters responsible for job satisfaction in the organization. In such case,
parameters like employee’s salary, motivation, incentives, medical facility, family
welfare incentives, and training opportunity may be selected as independent variables
for developing regression model for estimating the job satisfaction of an employee.
Regression analysis may identify independent variables which may be used for
developing strategies in production process, inventory control, capacity utilization,
sales criteria, etc. Further, regression analysis may be used to estimate the value of the dependent variable at some point of time if the values of the independent variables are known. This is more relevant in situations where the value of the dependent variable is difficult to observe directly. For instance, when launching a new product in a particular city, the sales figure is not yet known, which complicates decisions about stock inventory. By using a regression model for sales, one can estimate the sales figure for a particular month.
Example 5.3 In order to assess the feasibility of a guaranteed annual wage, the Rand Corporation conducted a study to assess the response of labor supply in terms of average hours of work (Y) based on different independent parameters. The data were drawn from a national sample of 6,000 households with male heads earning less than $15,000 annually. These data are given in Table 5.3. Apply regression analysis by using SPSS to suggest a regression model for estimating the average hours worked during the year based on the identified independent parameters.
Solution To develop the regression model for estimating the average hours of
working during the year for guaranteed wages on the basis of socioeconomic
variables, do the following steps:
(i) Choose the “stepwise regression” method in SPSS to get the regression
coefficients of the independent variables identified in the model for developing
the regression equation.
(ii) Test the regression coefficients for their significance through the t-test by using their significance values (p values) in the output.
(iii) Test the regression model for its significance through the F-value by looking at its significance value (p value) in the output.
(iv) Use the value of R2 in the output to know the amount of variance explained in
the dependent variable by the identified independent variables together in the
model.
Steps involved in getting the output of regression analysis by using SPSS have been
explained in the following sections.
Table 5.3 Data on average yearly hour and other socioeconomic variables
Hours Rate ERSP ERNO NEIN Assets Age DEP School
S.N. (X1) (X2) (X3) (X4) (X5) (X6) (X7) (X8) (X9)
1 2,157 2.905 1,121 291 380 7,250 38.5 2.340 10.5
2 2,174 2.970 1,128 301 398 7,744 39.3 2.335 10.5
3 2,062 2.350 1,214 326 185 3,068 40.1 2.851 8.9
4 2,111 2.511 1,203 49 117 1,632 22.4 1.159 11.5
5 2,134 2.791 1,013 594 730 12,710 57.7 1.229 8.8
6 2,185 3.040 1,135 287 382 7,706 38.6 2.602 10.7
7 2,210 3.222 1,100 295 474 9,338 39.0 2.187 11.2
8 2,105 2.493 1,180 310 255 4,730 39.9 2.616 9.3
9 2,267 2.838 1,298 252 431 8,317 38.9 2.024 11.1
10 2,205 2.356 885 264 373 6,789 38.8 2.662 9.5
11 2,121 2.922 1,251 328 312 5,907 39.8 2.287 10.3
12 2,109 2.499 1,207 347 271 5,069 39.7 3.193 8.9
13 2,108 2.796 1,036 300 259 4,614 38.2 2.040 9.2
14 2,047 2.453 1,213 297 139 1,987 40.3 2.545 9.1
15 2,174 3.582 1,141 414 498 10,239 40.0 2.064 11.7
16 2,067 2.909 1,805 290 239 4,439 39.1 2.301 10.5
17 2,159 2.511 1,075 289 308 5,621 39.3 2.486 9.5
18 2,257 2.516 1,093 176 392 7,293 37.9 2.042 10.1
19 1,985 1.423 553 381 146 1,866 40.6 3.833 6.6
20 2,184 3.636 1,091 291 560 11,240 39.1 2.328 11.6
21 2,084 2.983 1,327 331 296 5,653 39.8 2.208 10.2
22 2,051 2.573 1,194 279 172 2,806 40.0 2.362 9.1
23 2,127 3.262 1,226 314 408 8,042 39.5 2.259 10.8
24 2,102 3.234 1,188 414 352 7,557 39.8 2.019 10.7
25 2,098 2.280 973 364 272 4,400 40.6 2.661 8.4
26 2,042 2.304 1,085 328 140 1,739 41.8 2.444 8.2
27 2,181 2.912 1,072 304 383 7,340 39.0 2.337 10.2
28 2,186 3.015 1,122 30 352 7,292 37.2 2.046 10.9
29 2,188 3.010 990 366 374 7,325 38.4 2.847 10.6
30 2,077 1.901 350 209 951 370 37.4 4.158 8.2
31 2,196 3.009 947 294 342 6,888 37.5 3.047 10.6
32 2,093 1.899 342 311 120 1,425 37.5 4.512 8.1
33 2,173 2.959 1,116 296 387 7,625 39.2 2.342 10.5
34 2,179 2.971 1,128 312 397 7,779 39.4 2.341 10.5
35 2,200 2.980 1,126 204 393 7,885 39.2 2.341 10.6
Source: D. H. Greenberg and M. Kosters, Income Guarantees and the Working Poor, The Rand
Corporation, R-579-OEO, December 1970.
Hours(X1): average hours worked during the year
Rate(X2): average hourly wage (dollars)
ERSP(X3): average yearly earnings of spouse (dollars)
ERNO(X4): average yearly earnings of other family members (dollars)
NEIN(X5): average yearly non-earned income
Assets(X6): average family asset holdings (bank account) (dollars)
Age(X7): average age of respondent
Dep(X8): average number of dependents
School(X9): average highest grade of school completed
Fig. 5.3 Screen showing entered data for all the variables in the data view
Either the variable selection is made one by one or all at once. To do so, the
variable needs to be selected from the left panel, and by arrow command, it
may be brought to the right panel. After choosing the variables for
analysis, the screen shall look like Fig. 5.5.
(iii) Selecting the options for computation: After selecting the variables, option
needs to be defined for the regression analysis. Take the following steps:
– In the screen shown in Fig. 5.5, click the tag Statistics; you will get the
screen as shown in Fig. 5.6.
– Check the box “R squared change,” “Descriptive,” and “Part and
partial correlations.”
– By default, the options “Estimates” and “Model fit” are checked.
Ensure that they remain checked.
– Click Continue. You will now be taken back to the screen shown in
Fig. 5.5.
By checking the option "R squared change," the output shall include the values of R2 and adjusted R2 along with their change statistics. Similarly, by checking the option "Descriptive," the output will provide the values of means and standard deviations along with the correlation matrix of all the variables, whereas checking the option "Part and partial correlations" shall provide the partial correlations of various orders between Average hours worked during the year and the other variables. Readers are advised to try other options and see what changes they get in their outputs.
– In the option Method shown in Fig. 5.5, select “Stepwise.”
– Click OK.
(c) Getting the Output
Clicking the OK tag in Fig. 5.5 will lead you to the output window. In the output window of SPSS, the relevant outputs can be selected by right-clicking the mouse and may be copied into a word file. The output panel shall have the following results:
1. Mean and standard deviation
2. Correlation matrix along with significance value
3. Model summary along with the values of R, R2 and adjusted R2
4. ANOVA table showing F-values for all the models
Different outputs generated in the SPSS are shown below along with their
interpretations.
1. The values of mean and standard deviation for all the variables are shown in
Table 5.4. These values can be used for further analysis in the study. By using
the procedure discussed in Chap. 2, a profile chart may be prepared by comput-
ing other descriptive statistics for all the variables.
2. The correlation matrix in Table 5.5 shows the correlations among the variables
along with their significance value (p value). Significance of these correlations
has been tested using a one-tailed test.

Fig. 5.6 Screen showing options for computing various components of regression analysis

The correlation coefficient with one asterisk mark (*) indicates its significance at the 5% level. The asterisk mark (*) is put on a correlation coefficient if its value is more than the value of the correlation coefficient required for significance at the 5% level, which is .284. For a one-tailed test, the required value of "r" for significance with 33 (= N − 2) df can be seen from Table A.3 in the Appendix.
Table 5.5 Correlation matrix for different variables along with significance level

Pearson correlation
         Hours   Rate    Ersp    Erno    Nein    Assets  Age     Dep     School
Hours    1.000   .556**  .124    .245    .413**  .716**  .077    .339*   .681**
Rate     .556**  1.000   .572**  .059    .297*   .783**  .044    .601**  .881**
Ersp     .124    .572**  1.000   .041    .238    .298*   .015    .693**  .549**
Erno     .245    .059    .041    1.000   .152    .296*   .775**  .050    .299*
Nein     .413**  .297*   .238    .152    1.000   .512**  .347*   .045    .219
Assets   .716**  .783**  .298*   .296*   .512**  1.000   .414**  .530**  .634**
Age      .077    .044    .015    .775**  .347*   .414**  1.000   .048    .331*
Dep      .339*   .601**  .693**  .050    .045    .530**  .048    1.000   .603**
School   .681**  .881**  .549**  .299*   .219    .634**  .331*   .603**  1.000

Sig. (1-tailed)
         Hours   Rate    Ersp    Erno    Nein    Assets  Age     Dep     School
Hours    .       .000    .239    .078    .007    .000    .330    .023    .000
Rate     .000    .       .000    .368    .041    .000    .401    .000    .000
Ersp     .239    .000    .       .408    .084    .041    .465    .000    .000
Erno     .078    .368    .408    .       .192    .042    .000    .387    .041
Nein     .007    .041    .084    .192    .       .001    .021    .398    .103
Assets   .000    .000    .041    .042    .001    .       .007    .001    .000
Age      .330    .401    .465    .000    .021    .007    .       .391    .026
Dep      .023    .000    .000    .387    .398    .001    .391    .       .000
School   .000    .000    .000    .041    .103    .000    .026    .000    .
Hours: Average hours worked during the year
Rate: Average hourly wage
Ersp: Average yearly earnings of spouse
Erno: Average yearly earnings of other family members
Nein: Average yearly non-earned income
Assets: Average family asset holdings
Age: Average age of respondent
Dep: Average number of dependents
School: Average highest grade of school completed
*Significant at 0.05 level (1-tailed); significant value of r at .05 level with 33 df (1-tailed) = 0.284
**Significant at 0.01 level (1-tailed); significant value of r at .01 level with 33 df (1-tailed) = 0.392
Similarly, for a one-tailed test, the significance value for the correlation coefficient at the .01 level with 33 (= N − 2) df can be seen as 0.392. Thus, all those correlation coefficients having values more than 0.392 are significant at the 1% level. Such correlation coefficients have been shown with two asterisk marks (**).
Readers may also show the correlation matrix by writing the upper diagonal
values as has been done in Chap. 4.
3. From Table 5.5, it can be seen that Hours (Average hours worked during the
year) is significantly correlated with Rate (Average hourly wage), Nein (Average
yearly non-earned income), Assets (Average family asset holdings), and School
(Average highest grade of school completed) at 1% level, whereas with Dep
(Average number of dependents) at 5% level.
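A correlation matrix with one-tailed significance values, analogous to Table 5.5, can also be computed outside SPSS. A numpy/scipy sketch follows (the function and its names are ours, not part of any SPSS workflow):

```python
import numpy as np
from scipy import stats

def corr_with_one_tailed_p(data):
    """Pearson correlations for every pair of columns of an (N x k) array,
    together with one-tailed p values, mirroring the layout of Table 5.5."""
    n, k = data.shape
    r = np.corrcoef(data, rowvar=False)
    rr = np.clip(r, -0.9999, 0.9999)         # guard against division by zero on the diagonal
    t = rr * np.sqrt((n - 2) / (1 - rr**2))  # t statistic with n - 2 df
    p = stats.t.sf(np.abs(t), df=n - 2)      # one-tailed significance value
    np.fill_diagonal(p, np.nan)              # the diagonal has no meaningful p value
    return r, p
```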
Table 5.6 Model summary along with the values of R and R square

                                                        Change statistics
Model  R      R square  Adj. R square  SE of estimate   R square change  F change  df1  df2  Sig. F change
1      .716a  .512      .498           45.44102         .512             34.687    1    33   .000
2      .861b  .742      .726           33.58681         .229             28.405    1    32   .000
3      .879c  .773      .751           32.00807         .031             4.235     1    31   .048

a. Predictors: (Constant), Average family asset holdings
b. Predictors: (Constant), Average family asset holdings, Average yearly earnings of other family members
c. Predictors: (Constant), Average family asset holdings, Average yearly earnings of other family members, Average number of dependents
4. The three regression models generated by SPSS have been presented in Table 5.6. In the third model, the value of R2 is .773, which is the maximum, and, therefore, the third model shall be used to develop the regression equation. It can be seen from Table 5.6 that in the third model, three independent variables, namely, Assets (Average family asset holdings), Erno (Average yearly earnings of other family members), and Dep (Average number of dependents), have been identified, and, therefore, the regression equation shall be developed using these three variables only. The R2 value for this model is 0.773; therefore, these three independent variables together explain 77.3% of the variation in Hours (Average hours worked during the year) in the USA. Thus, this model can be considered appropriate for developing the regression equation.
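For reference, the adjusted R2 reported next to R2 in Table 5.6 corrects R2 for the number of predictors k and the sample size n. A quick check against Model 3 (R2 = .773, n = 35, k = 3), assuming the usual adjustment formula:

```python
def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.773, 35, 3), 3))  # ~0.751, matching Table 5.6
```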
5. In Table 5.7, F-values for all the models have been shown. Since F-value for the
third model is highly significant, it may be concluded that the model selected is
highly efficient also.
6. Table 5.8 shows the unstandardized and standardized regression coefficients in all the three models. Unstandardized coefficients are also known as "B" coefficients and are used to develop the regression equation, whereas standardized regression coefficients are denoted by "b" and are used to explain the relative importance of the independent variables in terms of their contribution toward the dependent variable in the model. In the third model, the t-values for all the three regression coefficients are significant, as their significance values (p values) are less than .05. Thus, it may be concluded that the variables Assets (Average family asset holdings), Erno (Average yearly earnings of other family members), and Dep (Average number of dependents) significantly explain the variations in Hours (Average hours worked during the year).
Regression Equation
Table 5.7 ANOVA table showing F-values for all the modelsa
Model Sum of squares df Mean square F Sig.
1 Regression 71,625.498 1 71,625.498 34.687 .000b
Residual 68,141.245 33 2,064.886
Total 139,766.743 34
2 Regression 103,668.390 2 51,834.195 45.949 .000c
Residual 36,098.353 32 1,128.074
Total 139,766.743 34
3 Regression 108,006.736 3 36,002.245 35.141 .000d
Residual 31,760.007 31 1,024.516
Total 139,766.743 34
a. Dependent variable: Average hours worked during the year
b. Predictors: (Constant), Average family asset holdings
c. Predictors: (Constant), Average family asset holdings, Average yearly earnings of other family members
d. Predictors: (Constant), Average family asset holdings, Average yearly earnings of other family members, Average number of dependents
Using the Model 3 coefficients from Table 5.8, the regression equation takes the form Hours = C + B1 × Assets + B2 × Erno + B3 × Dep, where
Hours: Average hours worked during the year
Assets: Average family asset holdings
Erno: Average yearly earnings of other family members
Dep: Average number of dependents
Thus, it may be concluded that the above regression equation is quite reliable, as the value of R2 is 0.773. In other words, the three variables selected in this regression equation explain 77.3% of the total variability in Hours (Average hours worked during the year), which is quite good. Since the F-value for this regression model is highly significant, the model is reliable. At the same time, all the regression coefficients in this model are highly significant, and, therefore, it may be interpreted that all the three variables selected in the model, namely, Assets (Average family asset holdings), Erno (Average yearly earnings of other family members), and Dep (Average number of dependents), have significant predictability in estimating the value of Hours (Average hours worked during the year) in the USA.
Table 5.8 Regression coefficients of selected variables in different models along with their t-values and partial correlationsa

                                          Unstandardized coefficients   Standardized coefficients                Correlations
Model                                     B          Std. error         Beta      t        Sig.   Zero-order  Partial  Part
1  (Constant)                             2,042.064  17.869                       114.280  .000
   Average family asset holdings          .016       .003               .716      5.890    .000   .716        .716     .716
2  (Constant)                             2,123.257  20.162                       105.308  .000
   Average family asset holdings          .019       .002               .864      9.190    .000   .716        .852     .826
   Average yearly earnings of other       .338       .063               .501      5.330    .000   .245        .686     .479
   family members
3  (Constant)                             2,064.285  34.503                       59.828   .000
   Average family asset holdings          .022       .002               .993      9.092    .000   .716        .853     .778
   Average yearly earnings of other       .371       .063               .550      5.933    .000   .245        .729     .508
   family members
   Average number of dependents           20.816     10.116             .215      2.058    .048   .339        .347     .176

a. Dependent variable: Average hours worked during the year
3. Once the data file is ready, use the following command sequence for selecting
the variables for analysis.
Analyze ! Regression ! Linear
4. Move the dependent variable from the left panel into the "Dependent" section of the right panel. Move all the independent variables from the left panel into the "Independent(s)" section of the right panel.
5. After selecting the variables for regression analysis, click the tag Statistics on
the screen. Check the box “R squared change,” “Descriptive,” and “Part and
partial correlations.” Press Continue.
6. In the Method option, select “Stepwise,” then press OK to get the different
outputs for regression analysis.
Exercise
Short-Answer Questions
Note: Write answer to each of the questions in not more than 200 words.
Q.1. Describe regression analysis. Explain the difference between simple regres-
sion and multiple regression models.
Q.2. What is the difference between stepwise regression and backward regression?
Q.3. Discuss the role of R2 in regression analysis. Explain multiple correlation and
its order.
Q.4. Explain an experimental situation where regression analysis can be used.
Q.5. How will you know that the variables which are selected in the regression
analysis are valid?
Q.6. What is the difference between Stepwise and Enter method in developing
multiple regression equation?
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick mark the one that you consider closest to the correct answer.
1. The range of multiple correlation R is
(a) −1 to 0
(b) 0 to 1
(c) −1 to 1
(d) None of the above
2. SPSS commands for multiple regression analysis is
(a) Analyze -> Linear -> Regression
(b) Analyze -> Regression -> Linear
(c) Analyze -> Linear Regression
(d) Analyze -> Regression Linear
Models           No. of independent variables   R2
(a) Model I      5                              0.88
(b) Model II     4                              0.87
(c) Model III    3                              0.86
(d) Model IV     2                              0.65
(c) Both Dealer’s incentive and Product price are significant at .01 level in the
model.
(d) Both Dealer’s incentive and Product price are not significant at .05 level in
the model.
8. Choose correct statement about B and b coefficients.
(a) “B” is an unstandardized coefficient and “b” is a standardized coefficient.
(b) “b” is an unstandardized coefficient and “B” is a standardized coefficient.
(c) Both “B” and “b” are standardized coefficients.
(d) Both “B” and “b” are unstandardized coefficients.
Assignments
1. The data on copper industry and its determinants in the US market during
1951–1980 are shown in the following table. Construct a regression model and
develop the regression equation by using the SPSS. Test the significance of
regression coefficients and explain the robustness of the regression model to
predict the price of the copper in the US market.
Determinants of US domestic price of copper
DPC GNP IIP MEPC NOH PA
66.50 2,127.6 145.2 709.8 2,023.30 54.42
98.30 2,628.80 152.5 935.7 1,749.20 61.01
101.40 2,633.10 147.1 940.9 1,298.50 70.87
DPC = 12-month average US domestic price of copper (cents per pound)
GNP = annual gross national product ($, billions)
IIP = 12-month average index of industrial production
MEPC = 12-month average London Metal Exchange price of copper (pounds sterling)
NOH = number of housing starts per year (thousands of units)
PA = 12-month average price of aluminum (cents per pound)
Note: The data were collected by Gary R. Smith from sources such as American Metal Market, Metals Week, and US Department of Commerce publications.
2. Data in the following table shows the crime rate in 47 states in the USA in 1960.
Develop a suitable regression model for estimating the crime rate depending
upon identified socioeconomic variables.
US crime data for 47 states
S.N. R Age ED EX0 LF N NW U1 U2 X
25 52.3 130 116 63 641 14 26 70 21 196
26 199.3 131 121 160 631 3 77 102 41 152
27 34.2 135 109 69 540 6 4 80 22 139
28 121.6 152 112 82 571 10 79 103 28 215
29 104.3 119 107 166 521 168 89 92 36 154
30 69.6 166 89 58 521 46 254 72 26 237
31 37.3 140 93 55 535 6 20 135 40 200
32 75.4 125 109 90 586 97 82 105 43 163
33 107.2 147 104 63 560 23 95 76 24 233
34 92.3 126 118 97 542 18 21 102 35 166
35 65.3 123 102 97 526 113 76 124 50 158
36 127.2 150 100 109 531 9 24 87 38 153
37 83.1 177 87 58 638 24 349 76 28 254
38 56.6 133 104 51 599 7 40 99 27 225
39 82.6 149 88 61 515 36 165 86 35 251
40 115.1 145 104 82 560 96 126 88 31 228
41 88 148 122 72 601 9 19 84 20 144
42 54.2 141 109 56 523 4 2 107 37 170
43 82.3 162 99 75 522 40 208 73 27 224
44 103 136 121 95 574 29 36 111 37 162
45 45.5 139 88 46 480 19 49 135 53 249
46 50.8 126 104 106 599 40 24 78 25 171
47 84.9 130 121 90 623 3 22 113 40 160
Source: W. Vandaele, "Participation in Illegitimate Activities: Erlich Revisited," in A. Blumstein, J. Cohen, and D. Nagin, eds., Deterrence and Incapacitation, National Academy of Sciences, 1978, pp. 270–335.
R = crime rate, number of offenses reported to police per million population
Age = number of males of age 14–24 per 1,000 population
S = indicator variable for southern states (0 = no, 1 = yes)
ED = mean number of years of schooling times 10 for persons age 25 or older
EX0 = 1960 per capita expenditure on police by state and local government
LF = labor force participation rate per 1,000 civilian urban males age 14–24
N = state population size in hundred thousands
NW = number of nonwhites per 1,000 population
U1 = unemployment rate of urban males per 1,000 of age 14–24
U2 = unemployment rate of urban males per 1,000 of age 35–39
X = the number of families per 1,000 earning below 1/2 the median income
Answers to Multiple-Choice Questions
Q.1 (b)  Q.2 (b)  Q.3 (c)  Q.4 (c)  Q.5 (d)  Q.6 (c)  Q.7 (c)  Q.8 (a)
Chapter 6
Hypothesis Testing for Decision-Making
Learning Objectives
After completing this chapter, you should be able to do the following:
• Understand the purpose of hypothesis testing.
• Learn to construct the hypotheses.
• Know the situations for using one- and two-tailed tests.
• Describe the procedure of hypothesis testing.
• Understand the p value.
• Learn the computing procedure manually in different situations by using t-tests.
• Identify an appropriate t-test in different research situations.
• Know the assumptions under which t-test should be used.
• Describe the situations in which one-tailed and two-tailed tests should be used.
• Interpret the difference between one-tailed and two-tailed hypotheses.
• Learn to compute t-statistic in different research situations by using SPSS.
• Learn to interpret the outputs of different t-tests generated in SPSS.
Introduction
Human beings are progressive in nature. Most of our decisions in life are governed
by our past experiences. These decisions may be subjective or objective. Subjective
decisions are solely based upon one’s own perception of viewing issues. These
perceptions keep on changing from person to person. Same thing or situation can
be perceived differently by different persons, and therefore, the decision cannot be
universalized. On the other hand, if decisions are taken on the basis of scientific
law, it is widely accepted and works well in the similar situations.
Decision makers are always engaged in identifying optimum decision in a given
situation for solving a problem. Theory of statistical inference which is based on
scientific principles provide optimum solution to these decision makers. Statistical
inference includes theory of estimation and testing of hypothesis. In this chapter,
Hypothesis Construction
Testing of Hypothesis
[Flowchart: choice of test statistic in testing of hypothesis. For a single group mean: z = (x̄ − μ)/(σ/√n) for a large sample, or t = (x̄ − μ)/(S/√n) for a small sample. For comparing two group means (unrelated groups): z = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂), or t = (x̄₁ − x̄₂)/(S√(1/n₁ + 1/n₂)). For comparing two group means (related groups): t = d̄/(s_d/√n). For comparing variances: F = S₁²/S₂².]
Null Hypothesis
Alternative Hypothesis
Test Statistic
In hypothesis testing, the decision about rejecting or not rejecting the null hypothe-
sis depends upon the value of test statistic. A test statistic is a random variable
X whose value is tested against the critical value to arrive at a decision.
If a random sample of size n is drawn from a normal population with mean μ and variance σ², then the sampling distribution of the mean will also be normal with mean μ and variance σ²/n. As per the central limit theorem, even if the population from which the sample is drawn is not normal, the sample mean will still follow the normal distribution with mean μ and variance σ²/n, provided the sample size n is large (n > 30).
Thus, in case of a large sample (n > 30), for testing a hypothesis concerning the mean, the z-test is used. However, in case of a small sample (n ≤ 30), the distribution of the sample mean follows the t-distribution if the population variance is not known. In such a situation, the t-test is used. In case the population standard deviation (σ) is unknown, it is estimated by the sample standard deviation (S). For different sample sizes, the t-curve is different, and it approaches the normal curve for sample size n > 30. All these curves are symmetrical and bell shaped and distributed around t = 0. The exact shape of the t-curve depends on the degrees of freedom.
In one-way ANOVA, the comparison of the between-group variance with the within-group variance is done by using the F-statistic. The critical value of F can be obtained from Table A.4 or A.5 in the Appendix for a particular level of significance and the degrees of freedom between and within the groups.
Rejection Region
Rejection region is a part of the sample space in which if the value of test statistic
falls, null hypothesis is rejected. Rejection region is also known as critical region. The
value of the statistic in the distribution that divides sample space into acceptance and
rejection region is known as critical value. These can be seen in Fig. 6.2.
The size of the rejection region is determined by the level of significance (α). The level of significance is that probability level below which we reject the null hypothesis. The term statistical significance of a statistic refers only to the rejection of a null hypothesis at some level α. It indicates that the observed difference between the sample mean and the mean of the sampling distribution did not occur by chance alone. To conclude, if the test statistic falls in the rejection/critical region, H0 is rejected; otherwise, we fail to reject H0.
We have seen that the research hypothesis is tested by means of testing the null hypothesis. Thus, the focus of the researcher is to find whether the null hypothesis can be rejected on the basis of the sample data or not. In testing the null hypothesis, the researcher has two options, that is, either to reject the null hypothesis or fail to reject it. Further, the null hypothesis itself may in truth be either true or false in either of these situations. Thus, the researcher has four possible courses of action in testing the null hypothesis. Two of these, that is, rejecting the null hypothesis when it is false and failing to reject the null hypothesis when it is true, are correct decisions, whereas the remaining two, that is, rejecting the null hypothesis when it is true and failing to reject the null hypothesis when it is false, are wrong decisions. These two wrong decisions are known as the two different kinds of errors in hypothesis testing. All four courses of action have been summarized in Table 6.1.
Thus, in hypothesis testing, a researcher is exposed to two types of errors known
as type I and type II errors.
Type I error can be defined as rejecting the null hypothesis, H0, when it is true. The probability of type I error is known as the level of significance and is denoted by α. The choice of α determines the critical values. Looking at the relative importance of the decision, the researcher fixes the value of α. Normally the level of significance is chosen as .05 or .01.
Type II error is said to be committed if we fail to reject the null hypothesis (H0) when it is false. The probability of type II error is denoted by the Greek letter β and is used to determine the power of the test. The value of β depends on the way the null hypothesis is false. For example, in testing the null hypothesis of equal population means for a fixed sample size, the probability of type II error decreases as the difference between the population means increases. The term 1 − β is said to be the power of the test. The power of the test is the probability of rejecting the null hypothesis when it is false.
Often type I and type II errors are confused with α and β, respectively. In fact, α is not the type I error but the probability of type I error, and similarly β is the probability of type II error and not the type II error itself. Since α is a probability, it can take any value between 0 and 1, and one should write a statement like "the null hypothesis may be rejected at the .05 level of significance." The level of significance (α) should thus be expressed as a fraction, such as .05 or .01, or equivalently written as the 5% or 1% level. For a fixed sample size, the simultaneous reduction of type I and type II errors is not possible because if you try to minimize one error, the other error will increase. Therefore, there are two ways of reducing these two errors.
The first approach is to increase the sample size. This is not always possible in research studies because once the data are collected, the same have to be used by the researcher for drawing inferences. Moreover, by increasing the sample size, a researcher may lose control over the experiment, due to which these errors get elevated. The second approach is to identify the error which is more severe, fix it at a desired level, and then try to minimize the other error to the maximum possible extent.
In most research studies, type I error is considered to be more severe because wrongly rejecting a correct hypothesis forces us to accept the wrong alternative hypothesis. For example, consider an experiment where it is desired to test the effectiveness of an advertisement campaign on sales performance. The null hypothesis required to be tested in this case would be, "The advertisement campaign either does not have any impact on sales or may reduce the sales performance." Now if the null hypothesis is wrongly rejected, the organization would go for the said advertisement campaign, which in fact is not effective. This decision would unnecessarily enhance the budget expenditure without any further appreciation in the revenue model. The severity of type I error can also be seen in the following legal analogy.
Accused persons are presumed to be innocent unless they are proved to be guilty. The purpose of the trial is to see whether the null hypothesis of innocence can be rejected based on the evidence. Here the type I error (rejecting a correct null hypothesis) means convicting the innocent, whereas the type II error (failing to reject the false null hypothesis) means letting the guilty go free. Here the type I error is more severe than the type II error because punishing an innocent person is regarded as worse than letting a guilty person go unpunished. Type I error becomes even more serious if the crime is murder and the person gets the punishment of a death sentence. Thus, usually in research studies, the type I error is fixed at the desired level of, say, .05 or .01, and then the type II error is minimized as much as possible.
The values of α and β depend upon each other. For a fixed sample size, the only way to reduce the probability of making one type of error is to increase the other. Consider a situation where it is desired to compare the means of two populations. Let us assume that the rejection region has critical values ±∞. Using this statistical test, H0 will never get rejected, as the test excludes every possible difference in sample means. Since the null hypothesis will never be rejected, the probability of rejecting the null hypothesis when it is true will be zero; in other words, α = 0. Since the null hypothesis will never be rejected, the probability of type II error (failing to reject the null hypothesis when it is false) will be 1, or β = 1.

Now consider a rejection region whose critical values are 0, 0. In this case, the rejection region includes every possible difference in sample means. This test will always reject H0. Since the null hypothesis will always be rejected, the probability of type I error (rejecting H0 when it is true) will be 1, or α = 1. Since the null hypothesis is always rejected, the probability of type II error (failing to reject H0 when it is false) is 0, or β = 0.

To conclude, a statistical test having a rejection region bounded by the critical values ±∞ has α = 0 and β = 1, whereas a test with a rejection region bounded by the critical values 0, 0 has α = 1 and β = 0. Now consider a test having a rejection region bounded by the critical values ±q. As q increases from 0 to ∞, α decreases from 1 to 0, while β increases from 0 to 1.
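This trade-off can be made concrete with a small computation. The sketch below (scipy assumed) considers a two-sided rejection region |z| > q for a z-test of H0: μ = 0, with the true mean under H1 placed two standard errors away (an arbitrary illustrative choice), and prints α and β as q grows:

```python
from scipy import stats

mu1 = 2.0  # true mean under H1, in standard-error units (illustrative)
for q in (0.0, 1.0, 1.96, 3.0):
    alpha = 2 * stats.norm.sf(q)                               # P(reject H0 | H0 true)
    beta = stats.norm.cdf(q - mu1) - stats.norm.cdf(-q - mu1)  # P(fail to reject | H1 true)
    print(f"q={q:.2f}  alpha={alpha:.3f}  beta={beta:.3f}")
# As q grows, alpha falls toward 0 while beta rises toward 1, and vice versa.
```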
Consider an experiment in which the null and alternative hypotheses are H0 and H1, respectively. We perform a test to determine whether or not the null hypothesis should be rejected in favor of the alternative hypothesis. In this situation, two different kinds of tests can be performed. One may either use a one-tailed test to see whether there is an increase or a decrease in the parameter or may decide to use a two-tailed test to verify any change in the parameter, whether an increase or a decrease. The word tail refers to the far left and far right of a distribution curve. These one-tailed and two-tailed tests can be performed at either of the two levels of significance, 0.01 or 0.05.
One-Tailed Test A statistical test is known as one-tailed test if the null hypothesis
(H0) is rejected only for the values of the test statistic falling into one specified tail of
its sampling distribution. In one-tailed test, the direction is specified, that is, we are
interested to verify whether population parameter is greater than some value. Or at
times we may be interested to know whether the population parameter is less than
some value. In other words, the researcher is clear as to what specifically he/she is
interested to test. Depending upon the research hypothesis, one-tailed test can be
classified as right-tailed or left-tailed tests. If the research hypothesis is to test whether
Fig. 6.3 Critical regions at 5% level in (a) left-tailed test and (b) right-tailed test
the population mean is greater than some specified value, then the test is known as
right-tailed test and the entire critical region shall lie in the right tail only. And if the
test statistic falls into right critical region, the alternative hypothesis will be accepted
instead of the null hypothesis. On the other hand, if the research hypothesis is to test
whether the population mean is less than some specified value, then the test is known
as left-tailed test and the entire critical region shall lie in the left tail only. The critical
regions at 5% level in both these situations are shown in Fig. 6.3.
Two-Tailed Test A statistical test is said to be a two-tailed test if the null
hypothesis (H0) is rejected only for values of the test statistic falling into either
tail of its sampling distribution. In two-tailed test, no direction is specified. We are
only interested to test whether the population parameter is either greater than or less
than some specified value. If the test statistic falls into either of the critical regions,
the alternative hypothesis will be accepted instead of the null hypothesis. In two-
tailed test, the critical region is divided in both the tails. For example, if the null
hypothesis is tested at 5% level, the critical region shall be divided in both the tails
as shown in Fig. 6.4. Tables A.1 and A.2 in Appendix provide critical values for z-
test and t-test, respectively.
A one-tailed test is used when we are quite sure about the direction of the difference in advance (e.g., exercise will improve the fitness level). With that assumption, the level of significance (α) is calculated from only one tail of the distribution. In a two-tailed test, by contrast, the probability is calculated from both tails.
For instance, if the significance of the correlation between age and medical expenses is tested, one might hypothesize that medical expenses may increase or stay the same but will never decrease with age. In such a case, a one-tailed hypothesis should be used. On the other hand, in testing the correlation between people's weights and their incomes, we may have no reason to believe that income will increase with weight or that income will decrease with weight. Here we might be interested just to find out whether there is any relationship at all, and that is a two-tailed hypothesis.
The issue in deciding between one-tailed and two-tailed tests is not whether or not you expect a difference to exist. Had you known whether or not there was a difference, there would be no reason to collect the data. Instead, the question is whether the direction of a difference can only go one way. One should only use a one-tailed test if there is absolute certainty before data collection that, in the overall populations, either there is no difference or there is a difference in a specified direction. Further, if you end up observing a difference in the opposite direction, you should be ready to attribute that difference to random sampling, without bothering about the fact that the measured difference might reflect a true difference in the overall populations. If a difference in the "wrong" direction would bring even a little meaning to your findings, you should use a two-tailed test.
The advantage of using a one-tailed hypothesis is that a smaller sample suffices to test it, which often reduces the cost of the experiment. On the other hand, it is easier to reject the null hypothesis with a one-tailed test than with a two-tailed test, so the level of significance effectively increases in a one-tailed test. For this reason, it is rarely correct to perform a one-tailed test; usually we want to test whether any difference exists.
The strategy in choosing between one-tailed and two-tailed tests is to prefer a two-
tailed test unless there is a strong belief that the difference in the population can
only be in one direction. If the two-tailed test is statistically significant (p < α), interpret the findings in a one-tailed manner. Consider an experiment in which it is desired to test the null hypothesis that the average cure time of cold and cough by a newly introduced vitamin C tablet is 4 days against the alternative hypothesis that it is not. If a sample of 64 patients has an average recovery time of 3.5 days with S = 1.0 day, the p value in this testing would be 0.0002, and therefore the null hypothesis H0 will be rejected and we accept the alternative hypothesis H1. Thus, in this situation, it is concluded that the recovery time is not equal to 4 days for the new prescription of vitamin C.
But we may conclude more than that by saying that the recovery time is less than 4 days with the new prescription of vitamin C. We arrive at this conclusion by combining two facts: firstly, we have shown that the recovery time is different from 4 days, which means it must be either less or more than 4 days, and secondly, the sample mean X̄ (= 3.5 days) in this problem is less than the specified value, that is, 4 days (the population mean). After combining these two facts, it may be concluded that the average recovery time (3.5 days) is significantly lower than 4 days. This conclusion is quite logical because if we again test the null hypothesis H0: μ ≥ 4 against the alternative hypothesis H1: μ < 4 (a one-tailed test), the p value would be 0.0001, which is even smaller than 0.0002.
Thus, we may conclude by first answering the original question and then writing about the directional difference, such as "The mean recovery time in cold and cough symptoms with the new prescription of vitamin C is different from 4 days"; in fact, it is less than 4 days.
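The two p values quoted above can be checked with a few lines of Python (scipy assumed). Using the normal approximation, the exact figures come out slightly smaller than the rounded values quoted in the text, but the relationship between the one-tailed and two-tailed values is the same:

```python
import math
from scipy import stats

xbar, mu0, S, n = 3.5, 4.0, 1.0, 64
z = (xbar - mu0) / (S / math.sqrt(n))     # = -4.0
p_two_tailed = 2 * stats.norm.sf(abs(z))  # H1: mu != 4
p_one_tailed = stats.norm.cdf(z)          # H1: mu < 4
print(z, p_two_tailed, p_one_tailed)      # one-tailed p is half the two-tailed p
```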
What Is p Value?
Degrees of Freedom
Any parameter can be estimated with certain amount of information or data set. The
number of independent pieces of data or scores that are used to estimate a parameter is
known as degrees of freedom and is usually abbreviated as df. In general, the degrees
of freedom of an estimate are calculated as the number of independent scores that are
required to estimate the parameter minus the number of parameters estimated as
intermediate steps in the estimation of the parameter itself. In general, each item being
estimated costs one degree of freedom.
The degrees of freedom can be defined as the number of independent scores or
pieces of information that are free to vary in computing a statistic.
Since the variance σ² is estimated by the statistic S², which is computed from a random sample of n independent scores, let us see what the degrees of freedom of S² are. Since S is computed from the sample of n scores, its degrees of freedom would have been n, but because one degree of freedom is lost due to the condition that $\sum (X - \bar{X}) = 0$, the degrees of freedom for S² are n − 1. If we go by the definition, the degrees of freedom of S² are equal to the number of independent scores (n) minus the number of parameters estimated as intermediate steps (one, as μ is estimated by X̄) and are therefore equal to n − 1.
In the case of two samples, the pooled standard deviation S is computed by using n₁ + n₂ observations. In the computation of S, the two parameters μ₁ and μ₂ are estimated by X̄₁ and X̄₂; hence, two degrees of freedom are lost, and therefore the degrees of freedom for estimating S are n₁ + n₂ − 2.
In computing chi-square in a 2 × 2 contingency table for testing the independence between rows and columns, it is assumed that you already know 3 pieces of information: the row proportions, the column proportions, and the total number of observations. Since the total number of pieces of information in the contingency table is 4, and 3 are already known before computing the chi-square statistic, the degrees of freedom are 4 − 3 = 1. We know that the degrees of freedom for chi-square are obtained by (r − 1) × (c − 1); hence, with this formula also, the degrees of freedom in a 2 × 2 contingency table are 1.
One-Sample t-Test
A t-test can be defined as a statistical test used for testing of hypothesis in which the
test statistic follows a Student’s t-distribution under the assumption that the null
hypothesis is true. This test is used if the population standard deviation is not known
and the distribution of the population from which the sample has been drawn is
normally distributed. Usually the t-test is used for small sample sizes (n < 30) in a situation where the population standard deviation is not known. Even if the sample is large (n ≥ 30), if the population standard deviation is not known, the t-test should still be used instead of the z-test. A one-sample t-test is used for testing
whether the population mean is equal to a predefined value or not. An example of a
one-sample t-test may be to see whether population average sleep time is equal to 5 h
or not.
In using the t-test, it is assumed that the distribution of the data is approximately normal. The t-distribution depends on the sample size. Its parameter is called the degrees of freedom (df), which is equal to n − 1, where n is the sample size.
$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$   (6.1)
In the era of a housing boom, everybody is interested in buying a home, and the role of banking institutions is very important in this regard. Every bank tries to woo its clients by highlighting specific features of its housing loan, like a lower assessment fee, quick sanctioning of loans, and waiving of the penalty for prepayment. One particular bank was more interested in concentrating on loan processing time instead of other attributes and therefore made certain changes in its loan processing procedure, without sacrificing the risk features, so as to serve its clients with a quicker processing time. The bank wants to test whether its mean loan processing time differs from a competitor's claim of 4 h. The bank randomly selected a sample of loan applications in its branches and noted the processing time for each case. On the basis of this sample data, the authorities may be interested to test whether the bank's processing time in all its branches is equal to 4 h or not. A one-sample t-test can provide the solution for testing the hypothesis in this situation.
Example 6.1 A professor wishes to know if his statistics class has a good back-
ground of basic math. Ten students were randomly chosen from the class and were
given a math proficiency test. Based on the previous experience, it was
hypothesized that the average class performance on such math proficiency test
cannot be less than 75. The professor wishes to know whether this hypothesis
may be accepted or not. Test your hypothesis at 5% level assuming that the
distribution of the population is normal. The scores obtained by the students are
as follows:
Math proficiency score: 71, 60, 80, 73, 82, 65, 90, 87, 74, and 72
Solution The following steps shall show the procedure of applying the t-test for
one sample in testing the hypothesis, whether the students of statistics class had
their average score on math proficiency test equal to 75 or not.
(a) Hypotheses:

H0: μ ≥ 75 against H1: μ < 75

(b) Test statistic:

$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$

(c) Computation: Since n = 10, $\bar{X} = \dfrac{754}{10} = 75.4$, and

$S = \sqrt{\dfrac{1}{n-1}\sum X^2 - \dfrac{(\sum X)^2}{n(n-1)}} = \sqrt{\dfrac{57648}{9} - \dfrac{754^2}{10 \times 9}} = \sqrt{6405.33 - 6316.84} = 9.41$

Thus, $t = \dfrac{75.4 - 75}{9.41/\sqrt{10}} = 0.134$.
(d) Decision criteria: From Table A.2 in the Appendix, the tabulated value of t for a one-tailed (left-tailed) test at the .05 level of significance with 9 degrees of freedom is t.05(9) = 1.833, so the critical region is t < −1.833. Since the calculated t (= 0.134) does not fall in the critical region, the null hypothesis fails to be rejected at the 5% level.
(e) Inference: Since the null hypothesis fails to be rejected, the alternative hypothesis that the average math proficiency performance of the students is less than 75 cannot be accepted. Thus, it may be concluded that the average students' performance on the math proficiency test is equal to or higher than 75.
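Example 6.1 can also be reproduced in a single call outside SPSS; a scipy sketch follows (the alternative='less' argument encodes H1: μ < 75):

```python
from scipy import stats

# Example 6.1: H0: mu >= 75 against H1: mu < 75
scores = [71, 60, 80, 73, 82, 65, 90, 87, 74, 72]
t_stat, p_value = stats.ttest_1samp(scores, popmean=75, alternative='less')
print(round(t_stat, 3), round(p_value, 3))  # t = 0.134; p > .05, so H0 is not rejected
```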
Two-Sample t-Test for Unrelated Groups

The two-sample t-test is used for testing the hypothesis of equality of means of two normally distributed populations. All t-tests are usually called Student's t-tests, but strictly speaking, this name should be used only if the variances of the two populations are also assumed to be equal. The two-sample t-test is based on the assumption that the variances of the populations, \sigma_1^2 and \sigma_2^2, are unknown and the population distributions are normal. In case the assumption of equality of variances is not met, the test used in such a situation is called Welch's t-test. Readers may consult some other text for this test.
We often want to compare the means of two different populations, for example, comparing the effect of two different diets on weight, the effect of two teaching methodologies on performance, or the IQ of boys and girls. In such situations, the two-sample t-test can be used. One of the conditions of using the two-sample t-test is that the samples are independent and identically distributed. Consider an experiment in which job satisfaction needs to be compared among bank employees working in rural and urban areas. Two randomly selected groups of 30 subjects each may be drawn from rural and urban areas. Assuming all other conditions of the employees, like salary structure, status, and age category, to be similar, the null hypothesis of no difference in their job satisfaction scores may be tested by using the two-sample t-test for independent samples. In this case, the two samples are independent because the subjects in the two groups are different individuals.
The following assumptions need to be fulfilled before using the two-sample t-test
for independent groups:
• The distributions of both the populations from which the samples have been
drawn are normally distributed.
• The variances of the two populations are nearly equal.
In this test, the degrees of freedom are

df = df_1 + df_2 = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2

(a) Hypotheses to be tested:

H_0: \mu_1 = \mu_2
H_1: \mu_1 \ne \mu_2

(b) Test statistic:

\text{Calculated } t = \frac{\bar{X}_1 - \bar{X}_2}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad (6.2)

where S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}
(c) Degrees of freedom: n_1 + n_2 - 2
(d) Decision criteria
In a two-tailed test, the critical region is divided between both tails. If the level of significance is \alpha, then the area in each tail would be \alpha/2. If the critical value is t_{\alpha/2}, then:

if calculated |t| \le t_{\alpha/2}, H_0 fails to be rejected at the \alpha level of significance,
and if calculated |t| > t_{\alpha/2}, H_0 may be rejected at the \alpha level of significance.

Note: The calculated t is taken in absolute value because the difference between the two means may be positive or negative.
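The pooled statistic of Eq. (6.2) is easy to compute directly from summary figures. The following sketch, with the hypothetical helper name pooled_t, simply transcribes the formula; it is offered as an illustration, not as part of the SPSS procedure described in this book:

    import math

    def pooled_t(mean1, s1, n1, mean2, s2, n2):
        """Two-sample t-statistic with pooled SD, transcribing Eq. (6.2)."""
        # Pooled standard deviation
        s = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
        t = (mean1 - mean2) / (s * math.sqrt(1.0 / n1 + 1.0 / n2))
        return t, n1 + n2 - 2  # t-value and its degrees of freedom

    # Example usage with two hypothetical samples of size 9
    print(pooled_t(24.44, 3.71, 9, 30.56, 3.47, 9))  # t is about -3.62 with 16 df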
We have already discussed the situations in which a one-tailed test should be used. A one-tailed test should only be used if the experimenter, on the basis of past information, is absolutely sure that the difference can go in only one direction. A one-tailed test can be either right-tailed or left-tailed. In a right-tailed test, it is desired to test whether the mean of the first group is greater than the mean of the second group. In other words, the researcher is interested in a particular group only. In such testing, if the null hypothesis is rejected, it can be concluded that the first group mean is significantly higher than the second group mean. A situation where a right-tailed test can be used is to test whether the frustration level is higher among employees whose jobs are not linked with incentives than among those whose jobs are linked with incentives. Here the first group is the one whose jobs are not linked with incentives, whereas the second group's jobs are linked with incentives. In this situation, it is assumed that employees feel happier in their jobs if they are linked with incentives. The testing protocol for the right-tailed hypothesis is as follows:
(a) Hypotheses to be tested:

H_0: \mu_1 \le \mu_2
H_1: \mu_1 > \mu_2

(b) Test statistic:

\text{Calculated } t = \frac{\bar{X}_1 - \bar{X}_2}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}
(c) Degrees of freedom: n_1 + n_2 - 2
(d) Decision criteria
In a one-tailed test, the entire critical region lies in one tail only. Here the research hypothesis is right-tailed; hence, the entire critical region lies in the right tail, and therefore the sign of the critical value is positive. If the critical value is represented by t_{\alpha}, then:

if calculated t \le t_{\alpha}, H_0 fails to be rejected at the \alpha level of significance,
and if calculated t > t_{\alpha}, H_0 may be rejected at the \alpha level of significance.
At times the researcher is interested in testing whether a particular group mean is less than the second one. In this type of hypothesis testing, it is desired to test whether the mean of the first group is less than the mean of the second group. Here, if the null hypothesis is rejected, it can be concluded that the first group mean is significantly smaller than the second group mean. Consider a situation where an exercise therapist is interested in knowing whether a 4-week weight reduction program is effective when implemented on housewives. Two groups consisting of 20 women each are selected for the study; the first group is exposed to the weight reduction program, whereas the second group serves as a control and does not take part in any special activity other than daily normal work. If the therapist is interested in knowing whether, on average, the first (treatment) group shows a reduction in weight in comparison to those who did not participate in the program, the left-tailed test may be used. In this situation, as per experience, it is known that a weight reduction program will in general reduce weight in comparison to non-participation, and therefore a one-tailed test is appropriate. The testing protocol in applying the left-tailed test is as follows:
(a) Hypotheses to be tested:

H_0: \mu_1 \ge \mu_2
H_1: \mu_1 < \mu_2

(b) Test statistic:

\text{Calculated } t = \frac{\bar{X}_1 - \bar{X}_2}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}
Table 6.2 Data on stress level for the students in both the counseling groups
Personal counseling: 27 22 28 21 23 22 20 31 26
Audiovisual counseling: 35 28 24 28 31 32 33 34 30
Test your hypothesis at the 5% level to see whether one method of counseling is better than the other. It is assumed that the population variances are equal and both the populations are normally distributed.
Solution To test the required hypothesis, the following steps shall explain the
procedure.
(a) Here the hypotheses which need to be tested are

H_0: \mu_{Personal} = \mu_{Audiovisual}
H_1: \mu_{Personal} \ne \mu_{Audiovisual}

(b) Test statistic:

t = \frac{\bar{X} - \bar{Y}}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

In order to compute the value of the t-statistic, the mean and standard deviation of both the groups, along with the pooled standard deviation S, have to be computed first (Table 6.3).

(c) Computation: Since n_1 = n_2 = 9, \bar{X} = \frac{220}{9} = 24.44 and \bar{Y} = \frac{275}{9} = 30.56

S_X = \sqrt{\frac{1}{n_1 - 1}\sum X^2 - \frac{(\sum X)^2}{n_1(n_1 - 1)}}
    = \sqrt{\frac{5488}{8} - \frac{(220)^2}{9 \times 8}} = \sqrt{686 - 672.22}
    = 3.71
Similarly,

S_Y = \sqrt{\frac{1}{n_2 - 1}\sum Y^2 - \frac{(\sum Y)^2}{n_2(n_2 - 1)}}
    = \sqrt{\frac{8499}{8} - \frac{(275)^2}{9 \times 8}} = \sqrt{1062.38 - 1050.35}
    = 3.47
One of the conditions for using the two-sample t-test for independent groups is that the variances of the two populations must be the same. This hypothesis can be tested by using the F-test:

F = \frac{S_X^2}{S_Y^2} = \frac{3.71^2}{3.47^2} = 1.14
From Table A.4 in the Appendix, the tabulated F_{.05}(8,8) = 3.44. Since the calculated value of F is less than the tabulated F, it may not be concluded that the variances of the two groups are different, and therefore the two-sample t-test for independent samples can be applied in this example.

Remark: In computing the F-statistic, the larger variance must be kept in the numerator and the smaller one in the denominator.
After substituting the values of \bar{X}, \bar{Y}, and the pooled standard deviation S, we get

\text{calculated } t = \frac{\bar{X} - \bar{Y}}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{24.44 - 30.56}{3.59\sqrt{\frac{1}{9} + \frac{1}{9}}} = \frac{-6.12}{1.69} = -3.62

\Rightarrow \text{calculated } |t| = 3.62
(d) Decision criteria: From Table A.2 in the Appendix, the tabulated value of t for a two-tailed test at the .05 level of significance with 16 (= n_1 + n_2 - 2) degrees of freedom is t_{.05}(16) = 2.12. Since calculated |t| (= 3.62) > t_{.05}(16), the null hypothesis may be rejected at the 5% level in favor of the alternative hypothesis.
(e) Inference: Since the null hypothesis is rejected, the alternative hypothesis that the average stress scores of the personal counseling and audiovisual counseling groups are not the same is accepted. Further, since the mean stress score of the personal counseling group is significantly lower than that of the audiovisual group, it may be concluded that personal counseling is more effective than audiovisual counseling in reducing stress among women.
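As a cross-check of Example 6.2, the variance-ratio F and the equal-variance t can be reproduced with SciPy as sketched below; this is an illustration outside the SPSS workflow, and note that scipy.stats.ttest_ind reports t with the sign of \bar{X} − \bar{Y} (about −3.62 here) together with a two-tailed p value:

    import numpy as np
    from scipy import stats

    personal    = np.array([27, 22, 28, 21, 23, 22, 20, 31, 26])
    audiovisual = np.array([35, 28, 24, 28, 31, 32, 33, 34, 30])

    # Variance-ratio F check (larger variance in the numerator)
    f_ratio = personal.var(ddof=1) / audiovisual.var(ddof=1)  # about 1.14

    # Two-sample t-test assuming equal variances
    t_stat, p_value = stats.ttest_ind(personal, audiovisual, equal_var=True)
    print(round(f_ratio, 2), round(t_stat, 2), round(p_value, 4))  # 1.14, -3.62, p < .05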
Example 6.3 A researcher wishes to know whether girls' marriage age in metro cities is higher than in class B cities. Twelve families from metro cities and 11 families from class B cities were randomly chosen and asked the age at which their daughters got married. The data so obtained are shown in Table 6.4. Can it be concluded from the given data that girls' marriage age is higher in metro cities than in class B cities? Test your hypothesis at the 5% level, assuming that the population variances are equal and that the distributions of both the populations from which the samples have been drawn are normal.
Solution In order to test the hypothesis, the following steps shall be performed:
(a) The hypotheses which need to be tested are

H_0: \mu_{Metro} \le \mu_{ClassB}
H_1: \mu_{Metro} > \mu_{ClassB}

(b) Test statistic:

t = \frac{\bar{X} - \bar{Y}}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
To compute the value of the t-statistic, the mean and standard deviation of both the groups, along with the pooled standard deviation S, need to be computed first (Table 6.5).

(c) Computation: Here n_1 = 12 and n_2 = 11, \bar{X} = \frac{342}{12} = 28.5 and \bar{Y} = \frac{272}{11} = 24.73

S_X = \sqrt{\frac{1}{n_1 - 1}\sum X^2 - \frac{(\sum X)^2}{n_1(n_1 - 1)}}
    = \sqrt{\frac{9854}{11} - \frac{(342)^2}{12 \times 11}} = \sqrt{895.82 - 886.09}
    = 3.12
Similarly,

S_Y = \sqrt{\frac{1}{n_2 - 1}\sum Y^2 - \frac{(\sum Y)^2}{n_2(n_2 - 1)}}
    = \sqrt{\frac{6784}{10} - \frac{(272)^2}{11 \times 10}} = \sqrt{678.4 - 672.58}
    = 2.41
Since the t-test can only be applied if the variances of both the populations are the same, this hypothesis can be tested by using the F-test:

F = \frac{S_X^2}{S_Y^2} = \frac{3.12^2}{2.41^2} = 1.67

The tabulated value of F can be seen from Table A.4 in the Appendix: tabulated F_{.05}(11,10) = 2.85. Since the calculated value of F is less than the tabulated F, the hypothesis of equality of variances in the two groups may not be rejected, and therefore the two-sample t-test for independent samples can be applied in this example.
After substituting the values of \bar{X}, \bar{Y}, and the pooled standard deviation S, we get

\text{calculated } t = \frac{\bar{X} - \bar{Y}}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{28.5 - 24.73}{2.80\sqrt{\frac{1}{12} + \frac{1}{11}}} = \frac{3.77}{2.80 \times 0.42} = 3.21
(d) Decision criteria: From Table A.2 in the Appendix, the tabulated value of t for a one-tailed test at the .05 level of significance with 21 (= n_1 + n_2 - 2) degrees of freedom is t_{.05}(21) = 1.721. Similarly, for a one-tailed test, the tabulated value of t at the .01 level of significance is t_{.01}(21) = 2.518. Since calculated t (= 3.21) > t_{.05}(21), the null hypothesis may be rejected at the 5% level. Further, the calculated value of t is also greater than the tabulated value at the 1% level; hence, the t-value is significant at the 1% level as well.
(e) Inference: Since the null hypothesis is rejected, the alternative hypothesis that the marriage age of girls in metro cities is higher than in class B cities is accepted. It may thus be concluded that girls in metro cities tend to marry later than those in class B cities.
Paired t-Test

The paired t-test is used to test the null hypothesis that the difference between two responses measured on the same experimental units has a mean value of zero. This statistical test is normally used to test the research hypothesis of whether the posttreatment response is better than the pretreatment response. The paired t-test is used in all those situations where there is only one experimental group and no control group. The question tested here is whether the treatment is effective or not. This is done by measuring the responses of the subjects in the experimental group before and after the treatment. There are several instances in which the paired t-test may be used, for example, to see the effectiveness of a management development program on functional efficiency, the effectiveness of a weight training program in weight reduction, or the effectiveness of psychological training in enhancing memory retention power.

The paired t-test is also known as the "repeated measures" t-test. In using the paired t-test, the data must be obtained in pairs on the same set of subjects before and after the experiment.
While applying the paired t-test for two related groups, the pairwise differences d_i are computed for all n paired observations. The mean \bar{d} and standard deviation S_d of the differences d_i are calculated. The paired t-statistic is then computed as follows:

t = \frac{\bar{d}}{S_d/\sqrt{n}} \qquad (6.3)
An assumption in using the paired t-test is that the differences d_i follow the normal distribution. An experiment where paired differences are computed is often more powerful, since it can eliminate differences between the samples that inflate the total variance \sigma^2. When the comparison is made between groups of similar experimental units, it is called blocking. The paired difference experiment is an example of a randomized block experiment.

Note: The blocking has to be done before the experiment is performed.
While using the paired t-test, the following assumptions need to be satisfied:

1. The distribution of the population is normal.
2. The distribution of the pairwise differences is normal, and the differences are a random sample.
3. Cases must be independent of each other.

Remark: If the normality assumption is not fulfilled, you may use the nonparametric Wilcoxon signed-rank test for paired difference designs.
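For the case where normality of the differences is doubtful, the Wilcoxon signed-rank test mentioned in the remark is available in SciPy as scipy.stats.wilcoxon; the before/after arrays below are hypothetical and serve only to show the call:

    from scipy import stats

    # Hypothetical paired responses (illustration only)
    before = [12, 15, 11, 14, 13, 16, 12, 15]
    after  = [10, 14, 12, 11, 12, 13, 11, 13]

    # Tests whether the median of the pairwise differences is zero
    stat, p_value = stats.wilcoxon(before, after)
    print(stat, round(p_value, 3))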
The testing protocol for the paired t-test is similar to that of the two-sample t-test for independent groups discussed above. In applying the paired t-test, the only difference is that the test statistic is

t = \frac{\bar{d}}{S_d/\sqrt{n}}

instead of the one used in the two-sample t-test. Further, in the paired t-test, the degrees of freedom are n − 1. While using the paired t-test, one should normally construct the two-tailed test first; if the difference is significant, then by looking at the sample means of the pre- and posttest responses, one may interpret which group mean is higher. In general, a one-tailed test should be avoided unless there is strong evidence that the difference can go in only one direction. In a one-tailed test, the null hypothesis is rejected more readily than in a two-tailed test at the same level of significance, since the critical value is smaller.
Table 6.6 Calorie intake of the women participants before and after the nutrition educative
program
Before: 2,900 2,850 2,950 2,800 2,700 2,850 2,400 2,200 2,650 2,500 2,450 2,650
After: 2,800 2,750 2,800 2,800 2,750 2,800 2,450 2,250 2,550 2,450 2,400 2,500
The trade-off between one- and two-tailed tests has been discussed in detail under the criteria for using one-tailed and two-tailed tests earlier in this chapter.
(c) Test statistic:

t = \frac{\bar{d}}{S_d/\sqrt{n}}

where \bar{d} is the mean of the differences between X and Y, and S_d is the standard deviation of these differences, given by

S_d = \sqrt{\frac{1}{n-1}\sum d^2 - \frac{(\sum d)^2}{n(n-1)}}
\bar{d} and S_d shall be computed first to find the value of the t-statistic (Table 6.7):

\bar{d} = \frac{600}{12} = 50

and
S_d = \sqrt{\frac{1}{n-1}\sum d^2 - \frac{(\sum d)^2}{n(n-1)}} = \sqrt{\frac{90000}{11} - \frac{(600)^2}{12 \times 11}}
    = \sqrt{8181.82 - 2727.27}
    = 73.85

\text{calculated } t = \frac{\bar{d}}{S_d/\sqrt{n}} = \frac{50}{73.85/\sqrt{12}} = \frac{50\sqrt{12}}{73.85} = 2.345
(d) Decision criteria: From Table A.2 in the Appendix, the tabulated value of t for a two-tailed test at the .05 level of significance with 11 (= n − 1) degrees of freedom is t_{.05/2}(11) = 2.201. Since calculated t (= 2.345) > t_{.05/2}(11), the null hypothesis may be rejected at the 5% level against the alternative hypothesis. It may therefore be concluded that the mean calorie intakes before and after the nutritional educative program are not the same.
Since the mean calorie intake of the after-testing group is lower than that of the before-testing group, it may be concluded that the mean calorie intake after the program is significantly less than before.
(e) Inference: It is therefore concluded that the nutritional educative program is
effective in reducing the calorie intake among the participants.
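The paired computation of this example can be verified with scipy.stats.ttest_rel, which implements exactly the statistic of Eq. (6.3); this sketch is illustrative and sits outside the SPSS workflow:

    import numpy as np
    from scipy import stats

    # Calorie intake data from Table 6.6
    before = np.array([2900, 2850, 2950, 2800, 2700, 2850, 2400, 2200, 2650, 2500, 2450, 2650])
    after  = np.array([2800, 2750, 2800, 2800, 2750, 2800, 2450, 2250, 2550, 2450, 2400, 2500])

    t_stat, p_value = stats.ttest_rel(before, after)
    print(round(t_stat, 3), round(p_value, 3))  # t = 2.345, p < .05 -> reject H0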
H_0: \mu = 28
against the alternative hypothesis
H_1: \mu \ne 28
After using the SPSS commands as mentioned below for testing the population
mean to be equal to 28, the output will generate the value of t-statistic along with its
p value. If p value is less than .05, then the t-statistic will be significant and the null
hypothesis shall be rejected at 5% level in favor of alternative hypothesis; other-
wise, we would fail to reject the null hypothesis.
(i) Starting the SPSS: Use the following command sequence to start SPSS:

Start → All Programs → IBM SPSS Statistics → IBM SPSS Statistics 20
After checking the option Type in Data on the screen you will be taken to
the Variable View option for defining the variables in the study.
(ii) Defining variables: There is only one variable in this example which needs to be defined in SPSS along with its properties. Since the variable is measured on an interval scale, it will be defined as a 'Scale' variable. The procedure of defining the variable in SPSS is as follows:
1. Click Variable View to define the variable and its properties.
2. Write short name of the variable as Age under the column heading Name.
3. Under the column heading Label, define full name of the variable as
Employees’ Age.
4. Under the column heading Measure, select the option ‘Scale’ for the
variable.
5. Use default entries in all other columns.
After defining the variable in variable view, the screen shall look like Fig. 6.5.
(iii) Entering data: After defining the variable in the Variable View, click Data
View on the left bottom of the screen to enter the data. Enter the data for the
variable column wise. After entering the data, the screen will look like
Fig. 6.6. Save the data file in the desired location before further processing.
(b) SPSS Commands for Computing t-Statistic
After entering the data in the data view, follow these steps for computing t-
statistic:
(i) Initiating the SPSS commands for one-sample t-test: In data view, go to the
following commands in sequence:
Analyze → Compare Means → One-Sample t Test
The screen shall look like Fig. 6.7 as shown below.
(ii) Selecting variables for t-statistic: After clicking the One-Sample t Test
option you will be taken to the next screen for selecting variable for
computing t-statistic. Select the variable Age from left panel and bring it
to the right panel by clicking the arrow sign. In case of computing t-value
for more than one variable simultaneously, all the variables can be selected
together. The screen shall look like Fig. 6.8.
(iii) Selecting the options for computation: After selecting the variable, option
needs to be defined for the one-sample t-test. Do the following:
– In the screen shown in Fig. 6.8, enter the ‘test value’ as 28. This is the
assumed population mean age that we need to verify in the hypothesis.
– Click the tag Options, you will get the screen shown in Fig. 6.9. Enter
the confidence interval as 95% and click Continue and then you will be
taken back to the screen shown in Fig. 6.8.
Fig. 6.5 Defining variable and its characteristics for the data shown in Table 6.8
Fig. 6.7 Screen showing SPSS commands for one sample t-test
Fig. 6.9 Screen showing options for computing one-sample t-test and selecting significance level
Table 6.11 t-table for the data on employees' age

Mean    SD     Mean diff.   t-value   p value
31.53   3.54   3.53         3.862     .002
the mouse, and the content may be copied in the word file. The output panel
shall have the following results:
1. Sample statistics showing mean, standard deviation, and standard error
2. Table showing the value of t and its significance level
In this example, all the outputs so generated by the SPSS will look like
Tables 6.9 and 6.10. The model way of writing the results of one-sample t-
test has been shown in Table 6.11.
The mean, standard deviation, and standard error of mean for the data on age are
given in Table 6.9. These values may be used for further analysis.
From Table 6.11, it can be seen that the t-value is equal to 3.862 along with its
p value .002. Since p value is less than 0.05, it may be concluded that the t-value is
significant and the null hypothesis may be rejected at the 5% level. Further, since the average age of the employees in this problem is 31.53 years, which is higher than the assumed age of 28 years, it may be inferred that the average age of the employees in the organization is higher than 28 years.
H_0: \mu_A = \mu_B
against the alternative hypothesis
H_1: \mu_A \ne \mu_B
After computing the value of t-statistic for two independent samples by the
SPSS, it will be tested for its significance. The SPSS output also gives the signifi-
cance value (p value) corresponding to the t-value. The t-value would be significant
if its corresponding p value is less than .05, and in that case, the null hypothesis
shall be rejected at 5% level; otherwise, null hypothesis is failed to be rejected.
One of the conditions in using the two sample t-test is that the variance of the
two groups must be equal or nearly equal. The SPSS uses Levene’s F-test to test
this assumption. If the p value for the F-test is more than .05, the null hypothesis of equality of variances may not be rejected, and this supports the validity of the t-test.
Another important feature in this example is the style of feeding the data for
SPSS analysis. The readers should note the procedure of defining the variables and
feeding the data carefully in this example. Here there are two variables Pizza
Company and Delivery Time. Pizza Company is a nominal variable, whereas
Delivery Time is a scale variable.
3. Under the column heading Label, full names of these variables may be
defined as Pizza Company and Delivery Time, respectively. Readers
may choose some other names of these variables as well.
4. For the variable Company, double-click the cell under the column
heading Values and add the following values to different levels:
Value Label
1 Company A
2 Company B
The screen for defining the values can be seen in Fig. 6.10.
5. Under the column heading Measure, select the option ‘Nominal’ for the
Company variable and ‘Scale’ for the Del_Time variable.
6. Use default entries in rest of the columns.
After defining the variables in variable view, the screen shall look like
Fig. 6.10
(iii) Entering data
After defining both the variables in Variable View, click Data View on
the left corner in the bottom of the screen shown in Fig. 6.10 to open the
data entry format column wise. For the Company variable, type first twelve
scores as 1 and the next ten scores as 2 in the column. This is because the
value ‘1’ denotes Company A and there are 12 delivery time scores
reported by the customers. Similarly, the value ‘2’ denotes Company B
and there are 10 delivery time scores as reported by the customers. After
entering the data, the screen will look like Fig. 6.11.
bring it in the “Test Variable” section of the right panel. Similarly, select
the variable Pizza Company from the left panel and bring it to the “Group-
ing Variable” section of the right panel.
Select variable from the left panel and bring it to the right panel by using
the arrow key. After selecting both the variables, the values ‘1’ and ‘2’ need
to be defined for the grouping variable Pizza Company by pressing the tag
‘Define Groups.’ The screen shall look like Fig. 6.13.
Note: Many variables can be defined in the variable view in the same data
file for computing several t-values for different independent groups.
(iii) Selecting options for computation: After selecting the variables, option
needs to be defined for the two-sample t-test. Do the following:
– In the screen shown in Fig. 6.13, click the tag Options and you will get
the screen shown in Fig. 6.14.
– Enter the confidence interval as 95% and click Continue to get back to
the screen shown in Fig. 6.13. By default, the confidence interval is
95%; however, if desired, it may be changed to some other level.
The confidence level is the one at which the hypothesis needs to be
tested. In this problem, the null hypothesis is required to be tested at .05
level of significance, and therefore, the confidence level here shall be
95%. One can choose the confidence level as 90 or 99% if the level of
significance for testing the hypothesis is .10 or .01, respectively.
– Click OK on the screen shown in Fig. 6.13.
Fig. 6.13 Screen showing selection of variable for two-sample t-test for unrelated groups
Fig. 6.14 Screen showing the option for choosing the significance level
Table 6.14 Levene's test for equality of variances and t-test for equality of means of two unrelated groups (delivery time in sec.)

                              F     Sig.   t      df    Sig. (2-tailed)  Mean diff.  SE diff.  95% CI of the diff. (lower, upper)
Equal variances assumed       .356  .557   3.028  20    .007             2.43        .804      (0.76, 4.11)
Equal variances not assumed                3.139  19.3  .005             2.43        .775      (0.81, 4.05)
right click of the mouse, and it may be copied in the word file. The output panel
shall have the following results:
1. Descriptive statistics for the data in different groups
2. ‘F-’ and ‘t-’values for testing the equality of variances and equality of
means, respectively
(i) In this example, all the outputs so generated by the SPSS will look like
Tables 6.13 and 6.14. The model way of writing the results of two-sample t-
test for unrelated samples has been shown in Table 6.15.
The following interpretations can be made on the basis of the results shown in the
above outputs:
1. Table 6.13 shows the mean, standard deviation, and standard error of the mean for the data on delivery time for both the pizza companies. The mean delivery time of company B is less than that of company A. However, whether this difference is significant or not shall be revealed by looking at the t-value and its associated p value. If the t-value is not significant, one should not draw conclusions about the delivery times of the pizza companies by looking at the sample means alone.
Table 6.15 t-table for the data on delivery time along with the F-value

Groups      Mean    S.D.   Mean diff.  SE of mean diff.  t-value  p value  F-value  p value
Company A   20.58   2.16   2.43        .804              3.028    .007     .356     .557
Company B   18.15   1.45
2. One of the conditions for using the two-sample t-ratio for unrelated groups is that
the variance of the two groups must be equal. To test the equality of variances,
Levene's test was used. In Table 6.14, the F-value is .356, which is not significant since its p value (.557) is more than .05. Thus, the null hypothesis of equality of variances may not be rejected, and it is concluded that the variances of the two groups are equal.
3. It can be seen from Table 6.15 that the value of t-statistic is 3.028. This t-value is
significant as its p value is 0.007 which is less than .05. Thus, the null hypothesis
of equality of population means of two groups is rejected, and it may be
concluded that the average delivery time of the pizza in both the companies is
different. Further, average delivery time of the company B is less than that of the
company A, and therefore, it may be concluded that the delivery of pizza by the
company B to their customers is faster than that of the company A.
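Since only summary figures appear in Table 6.15, readers can still reproduce the t-value in Python from those summaries alone using scipy.stats.ttest_ind_from_stats; the sketch below is illustrative and assumes the sample sizes n_1 = 12 and n_2 = 10 used in this example:

    from scipy import stats

    # Summary figures from Table 6.15
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=20.58, std1=2.16, nobs1=12,
        mean2=18.15, std2=1.45, nobs2=10,
        equal_var=True)
    print(round(t_stat, 3), round(p_value, 3))  # about t = 3.03, p = .007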
Remark: The reader may note that initially a two-tailed hypothesis was tested in this example, but the final conclusion has been drawn as in a one-tailed test. This is because, if the t-statistic is significant in a two-tailed test, it will also be significant in a one-tailed test. To make this clearer, consider that for a two-tailed test the critical value is t_{\alpha/2} at the \alpha level of significance. This value is always greater than the critical value t_{\alpha} of a one-tailed test, and therefore, if the calculated value of t is greater than t_{\alpha/2}, it will also be greater than t_{\alpha}.
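The inequality t_{\alpha/2} > t_{\alpha} in this remark can be seen numerically from the t quantile function; the following illustrative snippet uses scipy.stats.t.ppf with the 16 degrees of freedom of the earlier counseling example:

    from scipy import stats

    df = 16
    t_two = stats.t.ppf(1 - 0.05 / 2, df)  # two-tailed critical value, about 2.120
    t_one = stats.t.ppf(1 - 0.05, df)      # one-tailed critical value, about 1.746
    # t_{alpha/2} exceeds t_{alpha}, so a t significant two-tailed
    # is also significant one-tailed
    print(round(t_two, 3), round(t_one, 3))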
Example 6.7 An experiment was conducted to know the impact of new advertise-
ment campaign on sale of television of a particular brand. The number of television
units sold on 12 consecutive working days before and after launching the advertise-
ment campaign in a city was recorded. The data obtained are shown in Table 6.16.
Solution Here the hypotheses which need to be tested are H_0: \mu_{Before} = \mu_{After} against the alternative hypothesis H_1: \mu_{Before} \ne \mu_{After}.
After getting the value of t-statistic for paired sample in the output of SPSS, it
needs to be tested for its significance. The output so generated by the SPSS also
gives the significance level (p value) along with t-value. The null hypothesis may be
rejected if the p value is less than .05; otherwise, it is accepted. If the null hypothesis
is rejected, an appropriate conclusion may be drawn regarding the effectiveness of
the advertisement campaign by looking to the mean values of the sales before and
after the advertisement.
In this problem, there are two variables TV Sold before Advertisement and TV Sold
after Advertisement. For both these variables, data shall be entered in two different
columns unlike the way it was entered in two-sample t-test for unrelated groups.
Fig. 6.15 Variables along with their characteristics for the data shown in Table 6.16
ratio scale. These variables can be defined along with their properties in
SPSS by using the following steps:
1. After clicking the Type in Data above, click the Variable View to define
the variables and their properties.
2. Write short names of the variables as Before_Ad and After_Ad under the column heading Name.
3. Under the column heading Label, the full names of these variables may be defined as TV Sold before Advertisement and TV Sold after Advertisement, respectively. Readers may choose some other names for these variables if so desired.
4. Under the column heading Measure, select the option ‘Scale’ for both
the variables.
5. Use default entries in rest of the columns.
After defining the variables in variable view, the screen shall look like
Fig. 6.15.
(iii) Entering the data
Once both these variables are defined in the Variable View, click Data
View on the left corner in the bottom of the screen as shown in Fig. 6.15 to
open the format for entering the data column wise. For both these
variables, data is entered column wise. After entering the data, the screen
will look like Fig. 6.16.
(b) SPSS Commands for Paired t-Test
After entering all the data in the data view, take following steps for paired t-test.
(i) Initiating SPSS commands for paired t-test: In data view, click the follow-
ing commands in sequence:
Analyze ! Compare means ! Paired-Samples t Test
The screen shall look like Fig. 6.17.
(ii) Selecting variables for analysis: After clicking the Paired-Samples t Test,
the next screen will follow for variable selection. Select the variable TV
Sold before Advertisement and TV Sold after Advertisement from left panel
and bring them to the right panel as variable 1 and variable 2 of pair 1. After
selecting both the variables, the screen shall look like Fig. 6.18.
Note: Many pairs of variables can be defined in the variable view in the
same data file for computing several paired t-tests. These pairs of variables
can be selected together in the screen as shown in Fig. 6.18.
(iii) Selecting options for computation: After selecting the variables, option
needs to be defined for computing paired t-test. Do the following:
– In the screen as shown in Fig. 6.18, click the tag Options and you will
get the screen where by default confidence level is selected 95%. No
need of doing anything except to click Continue.
One can define the confidence level as 90 or 99% if the level of
significance for testing the hypothesis is .10 or .01, respectively.
– Click OK on the screen shown in Fig. 6.18.
The following interpretations can be made on the basis of the results shown in the
above output:
1. The values of the mean, standard deviation, and standard error of the mean for
the data on TV sales before and after the advertisement are shown in Table 6.17.
These values can be used to draw conclusion as to whether the advertisement
campaign was effective or not.
2. It can be seen from Table 6.18 that the value of t-statistic is 4.204. This t-value is
significant as the p value is 0.001 which is less than .05. Thus, the null hypothesis
of equality of average TV sales before and after advertisement is rejected, and
therefore, it may be concluded that the average sale of the TV units before and
after the advertisement is not same.
Further, by looking at the values of the mean sales of TV units before and after the advertisement in Table 6.17, you may note that the average sales have increased after the advertisement campaign. Since the null hypothesis has been rejected, it may thus be concluded that the sale of TV units has increased significantly due to the advertisement campaign.
You may notice that we started with a two-tailed test but ended up drawing a one-tailed conclusion. This is because, if the t-value is significant at the 5% level in a two-tailed test, then it will also be significant in a one-tailed test.
(vi) Select the Company and Del_Time variables from left panel and bring
them in the “Test Variable” and “Grouping Variable” sections of the right
panel, respectively.
(vii) Define the values 1 and 2 as two groups for the grouping variable
Company.
(viii) By clicking the tag Options, ensure that confidence interval is selected as
95% and click Continue.
(ix) Press OK for output.
(c) For Paired t-Test
(i) Start the SPSS the way it is done in case of one-sample t-test.
(ii) In variable view, define the variables After_Ad and Before_Ad as scale
variables.
(iii) In the data view, follow the below-mentioned command sequence for
computing the value of t after entering the data for both the variables:
Analyze ! Compare means ! Paired-Samples t Test
(iv) Select the variables After_Ad and Before_Ad from left panel and bring
them to the right panel as variable 1 and variable 2 of pair 1.
(v) By clicking the tag Options, ensure that confidence interval is selected as
95% and click Continue.
(vi) Press OK for output.
Exercise
Short-Answer Questions
Note: Write the answer to each of the following questions in not more than 200
words.
Q.1. What do you mean by pooled standard deviation? How will you compute it?
Q.2. Discuss the criteria of choosing a statistical test in testing hypothesis
concerning mean and variances.
Q.3. What are the various considerations in constructing null and alternative
hypotheses?
Q.4. What are the various steps in testing a hypothesis?
Q.5. Discuss the advantages and disadvantages of one- and two-tailed tests.
Q.6. Explain the situations in which one- and two-tailed tests should be used.
Q.7. Discuss the concept of one- and two-tailed hypotheses in terms of rejection
region.
Q.8. What do you mean by type I and type II errors? Discuss the situations when
type II error is to be controlled.
Q.9. What do you mean by p value? How is it used in testing the significance of a test statistic?
H_0: \mu_{Rural} \ge \mu_{Urban}
H_1: \mu_{Rural} < \mu_{Urban}
(b)
H_0: \mu_{Male} = \mu_{Female}
H_1: \mu_{Male} > \mu_{Female}
(c)
H_0: \mu_{Male} = \mu_{Female}
H_1: \mu_{Male} < \mu_{Female}
(d)
H_0: \mu_{Male} \ne \mu_{Female}
H_1: \mu_{Male} = \mu_{Female}
H_0: \mu_1 \ge \mu_2
H_1: \mu_1 < \mu_2
2. The following data set represents the weight of the average daily household
waste (kg/day/house) generated from 20 houses in a locality:
4.1 3.7 4.3 2.5 2.5 6.8 4.0 4.5 4.6 7.1
3.5 3.1 6.6 5.5 6.5 4.1 4.2 4.8 5.1 4.8
Can it be concluded that the average daily household waste of that community is
5.0 kg/day/house? Test your hypothesis at 1% level.
3. A feeding experiment was conducted with two random samples of pigs on the relative value of limestone and bone meal for bone development. The data so obtained on ash content are shown in the following table:

Ash contents (%) in the bones of pigs
S.N.   Limestone   Bone meal
1      48.9        52.5
2      52.3        53.9
3      51.4        53.2
4      50.6        49.9
5      52.0        51.6
6      45.8        48.5
7      50.5        52.6
8      52.1        44.6
9      53.0        52.8
10     46.5        48.8
Test the significance of the difference between the mean ash content of the two
groups at 5% level.
4. A company wanted to know which of two pizza types, fresh veggie and peppy paneer, was more popular among people. An experiment was conducted in which 12 men were given the two types of pizza, fresh veggie and peppy paneer, to eat on two different days. Each pizza was carefully weighed at exactly 16 oz. After 20 min, the leftover pizzas were weighed, and the amount of each type of pizza remaining per person was calculated, assuming that the subjects would eat more of the pizza type they preferred. The data so obtained are shown in the following table.

Weights of the leftover pizzas in both varieties
S.N.   Fresh veggie (in oz.)   Peppy paneer (in oz.)
1      12.5                    15.0
2      5.87                    7.1
3      14.0                    14.0
4      12.3                    13.7
5      3.5                     14.2
6      2.6                     5.6
7      14.4                    15.4
8      10.2                    11.3
9      4.5                     15.6
10     6.5                     10.5
11     4.3                     8.5
12     8.4                     9.3
Apply the paired t-test and interpret your findings. Do people seem to prefer fresh veggie pizza over peppy paneer pizza? Test your hypothesis at the 5% level.
Answers to Multiple-Choice Questions
Q.1 b Q.2 d
Q.3 c Q.4 c
Q.5 d Q.6 a
Q.7 a Q.8 a
Q.9 a Q.10 b
Q.11 d Q.12 d
Q.13 c Q.14 d
Assignments
1. Calculated value of t = 0.037; the average IQ score of the students is 101.
2. Calculated value of t = −1.286; the average daily household waste of the community is 5 kg/day/house.
3. Calculated value of t = −0.441; the mean ash contents of both the groups are the same.
4. Calculated value of t = 3.193, which is significant. People prefer fresh veggie pizza.
Chapter 7
One-Way ANOVA: Comparing Means of More
than Two Samples
Learning Objectives
After completing this chapter, you should be able to do the following:
• Understand the basics of one-way analysis of variance (ANOVA).
• Learn to interpret the model involved in one-way analysis of variance.
• Learn the different designs of ANOVA.
• Describe the situations in which one-way analysis of variance should be used.
• Learn the manual procedure of applying one-way ANOVA in testing of
hypothesis.
• Construct the null and research hypotheses to be tested in the research study.
• Learn what happens if multiple t-tests are used instead of one-way ANOVA.
• Understand the steps involved in one-way analysis of variance in equal and
unequal sample sizes.
• Interpret the significance of F-statistic using the concept of p value.
• Know the procedure of making data file for analysis in SPSS.
• Understand the steps involved in using SPSS for solving the problems of one-
way analysis of variance.
• Describe the output of one-way analysis of variance obtained in SPSS.
Introduction
As per sampling theory, if the groups are drawn from the same population, the variance estimated from the group means (the between-group variance) should be no larger than what the variance within the groups would lead one to expect. Thus, a high ratio of between-group to within-group variance (the F-value) indicates that the samples have been drawn from different populations.
There are a variety of situations in which one-way analysis of variance can be used to compare the means of more than two groups. Consider a study in which it is required to compare the responses of students belonging to the north, south, west, and east regions towards liking of the mess food in a university. If the quality of mess food is rated on a scale of 1–10 (1 = "I hate the food," 10 = "Best food ever"), then the responses of the students belonging to different regions can be obtained in the form of interval scores. Here the independent variable would be the student's region, having four different levels, namely north, south, east, and west, whereas the response of the students is the dependent variable. To achieve the objective of the study, the null hypothesis of no difference among the mean responses of the four groups may be tested against the alternative hypothesis that at least one group mean differs. If the null hypothesis is rejected, a post hoc test is used to find out which group's liking is the highest.
Similarly a human resource manager may wish to determine whether the
achievement motivation differs among the employees in three different age
categories (<25, 26–35, and >35 years) after attending a training program. Here,
the independent variable is the employee’s age category, whereas the achievement
motivation is the dependent variable. In this case, it is desired to test whether the
data provide sufficient evidence to indicate that the mean achievement motivation
of any age category differs from other. The one-way ANOVA can be used to answer
this question.
There are three basic principles of design of experiments, that is, randomization,
replication, and local control. Out of these three, only randomization and replica-
tion need to be satisfied by the one-way ANOVA experiments. Randomization
refers to the random allocation of the treatment to experimental units. On the other
hand, replication refers to the application of each individual level of the factor to
multiple subjects. In other words, the experiment must be replicated in more than
one subject. In the above example several employees in each age group should be
selected in a random fashion in order to satisfy the principles of randomization and
replication. This facilitates in drawing the representative sample.
One-Way ANOVA
It is used to compare the means of more than two independent groups. In one-way
ANOVA, the effect of different levels of only one factor on the dependent variable
is investigated. Usually one-way ANOVA is used for more than two groups because
two groups may be compared using the t-test. In comparing two group means, t and F are related as F = t^2. In using one-way ANOVA, the experimenter is often interested in investigating the effect of different treatments on some subjects, which may be people, animals, plants, etc. For instance, obesity can be compared among the employees of three different departments of an organization: marketing, production, and human resources. Similarly, the anxiety of employees can be compared across three different units of an organization. Thus, one-way ANOVA has wide application in management sciences, humanities, and social sciences.
Factorial ANOVA
A factorial design is the one in which the effect of two factors on the dependent
variable is investigated. Here each factor may have several levels and each combi-
nation becomes a treatment. Usually factorial ANOVA is used to compare the main
effect of each factor as well as their interaction effects across the levels of other
factor on the criterion variable. But the situation may arise where each combination
of levels in two factors is treated as a single treatment and it is required to compare
the effect of these treatments on the dependent variable. In such situation one-way
ANOVA can be used to test the required hypothesis. Consider a situation where the effect of different combinations of duration and time of day on learning efficiency is to be investigated. The durations of interest are 30 and 60 minutes, and the subjects are given training in the morning and evening sessions for a learning task. The four combinations of treatments would be morning with 30 minutes' duration, morning with 60 minutes' duration, evening with 30 minutes' duration, and evening with 60 minutes' duration. In this case, neither the main effects nor the interaction effects are of interest to the investigator; rather, the combinations of these levels form the four levels of a single independent treatment factor.
If the number of factors and their levels are large, then lots of experimental
groups need to be created which is practically not possible, and in that case
fractional factorial design is used. In this design, only important combinations are
studied.
Repeated Measures ANOVA

Repeated measures ANOVA is used when the same subjects are given different treatments at different time intervals. In this design, the same criterion variable is measured many times on each subject. This design is known as a repeated measures design because repeated measures are taken at different times in order to see the impact of time on changes in the criterion variable. In some studies of repeated measures design, the same criterion variable is compared under two or more different conditions. For example, in order to see the impact of temperature on memory retention, a subject's memory might be tested once in an air-conditioned atmosphere and another time at normal room temperature.
The experimenter must ensure that the carryover effect does not exist in
administering different treatments on the same subjects. The studies in repeated
measure design are also known as longitudinal studies.
Multivariate ANOVA
Multivariate ANOVA is used when there are two or more dependent variables.
It provides solution to test the three hypotheses, namely, (a) whether changes in
independent variables have significant impact in dependent variables, (b) whether
interaction among independent variables is significant, and (c) whether interaction
among dependent variables is significant. Multivariate analysis of variance is also
known as MANOVA. In this design, the dependent variables must be loosely
related with each other. They should neither be highly correlated nor totally
uncorrelated among themselves. Multivariate ANOVA is used to compare the
effects of two or more treatments on a group of dependent variables. The dependent
variables should be such so that together it conveys some meaning. Consider an
experiment where the impact of educational background on three personality traits
honesty, courtesy, and responsibility is to be studied in an organization. The
subjects may be classified on the basis of their educational qualification: high school, graduation, or postgraduation. Here the independent variable is Education, with three different levels (high school, graduation, and postgraduation), whereas the dependent variables are the three personality traits, namely honesty, courtesy, and responsibility. One-way MANOVA enables us to compare the effect of education on the personality of an individual as a whole.
Let us suppose that there are r groups of scores where the first group has n_1 scores, the second has n_2 scores, and so on, and the rth group has n_r scores. If X_{ij} represents the jth score in the ith group (i = 1, 2, ..., r; j = 1, 2, ..., n_i), then these scores can be shown as follows:

Group                                            Total   Mean
1       X11   X12   ...   X1j   ...   X1n1       R1      X̄1
2       X21   X22   ...   X2j   ...   X2n2       R2      X̄2
...
i       Xi1   Xi2   ...   Xij   ...   Xini       Ri      X̄i
...
r       Xr1   Xr2   ...   Xrj   ...   Xrnr       Rr      X̄r

G = R1 + R2 + ... + Rr
Here,
N = n_1 + n_2 + ... + n_r is the total number of scores,
G is the grand total of all N scores, and
R_i is the total of all the scores in the ith group.
The total variability among the above-mentioned N scores can be attributed due
to the variability between groups and variability within groups. Thus, the total
variability can be broken into the following two components:
This is known as one-way ANOVA model where it is assumed that the variability
among the scores may be due to the groups. After developing the model, the
significance of the group variability is tested by comparing the variability between
groups with that of variability within groups by using the F-test.
The null hypothesis which is being tested in this case is that whether variability
between groups (SSb) and variability within the groups (SSw) are the same or not.
If the null hypothesis is rejected, it is concluded that the variability due to groups is
significant, and it is inferred that means of all the groups are not same. On the other
hand, if the null hypothesis is not rejected, one may draw the inference that group
means do not differ significantly. Thus, if r groups are required to be compared on
some criterion variable, then the null hypothesis can be tested by following the
below mentioned steps:
(a) Hypothesis construction: The following null hypothesis is tested:

H_0: \mu_1 = \mu_2 = \ldots = \mu_r

against the alternative hypothesis that at least one group mean differs from the others.

(b) Computation of the sums of squares:
(i) Total sum of squares (TSS): This is the sum of squared deviations of all the scores from their mean value. It is usually denoted by TSS and is given by

TSS = \sum_i \sum_j \left(X_{ij} - \frac{G}{N}\right)^2

which, after solving, becomes

TSS = \sum_i \sum_j X_{ij}^2 - \frac{G^2}{N} \qquad (7.2)
Here G is the grand total of all the scores. The degrees of freedom for the total sum of squares is N − 1, and therefore its mean sum of squares is computed by dividing TSS by N − 1.
(ii) Sum of squares between groups (SS_b): The sum of squares between groups can be defined as the variation of the group means around the grand mean of the data set. In other words, it is the measure of variation between the group means and is usually denoted by SS_b. This is also known as the variation due to assignable causes. The sum of squares between groups is computed as

SS_b = \sum_i \frac{R_i^2}{n_i} - \frac{G^2}{N} \qquad (7.3)
ANOVA table

Sources of variation   SS     df      MSS                       F-value
Between groups         SS_b   r − 1   MSS_b = SS_b/(r − 1)      F = MSS_b/MSS_w
Within groups          SS_w   N − r   MSS_w = SS_w/(N − r)
Total                  TSS    N − 1
(v) F-statistic: Under the normality assumptions, the F-value obtained in the above table, that is,

F = \frac{MSS_b}{MSS_w} \qquad (7.5)

is tested for significance. If it is significant, pairs of group means may be compared by using the critical difference

\text{Critical difference} = t_{.05}(N - r)\sqrt{\frac{2\,MSS_w}{n}} \qquad (7.6)

for groups of equal size n; for unequal group sizes, 2/n is replaced by \frac{1}{n_i} + \frac{1}{n_j}, where n_i and n_j represent the sample sizes of the ith and jth groups, respectively, and the other symbols have their usual meanings.
The SPSS output provides a p value (significance value) for each pair of means to test the significance of the difference between them. If the p value for any pair of means is less than .05, it is concluded that the means are significantly different; otherwise, they are not. SPSS provides various options for post hoc tests, and one may choose one or more of them while running the analysis.
While applying one-way ANOVA for comparing means of different groups, the
following assumptions are made:
1. The data must be measured either on interval or ratio scale.
2. The samples must be independent.
3. The dependent variable must be normally distributed.
4. The population from which the samples have been drawn must be normally
distributed.
5. The variances of the population must be equal.
6. The errors are independent and normally distributed.
Remarks
1. ANOVA is a relatively robust procedure in case of violations of the normality
assumption.
2. In case the data is ordinal, a nonparametric alternative such as Kruskal-Wallis
one-way analysis of variance should be used instead of parametric one-way
ANOVA.
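For the ordinal-data case of Remark 2, SciPy exposes the Kruskal-Wallis test as scipy.stats.kruskal; the three rating groups below are hypothetical and serve only to show the call:

    from scipy import stats

    # Hypothetical ordinal ratings from three groups (illustration only)
    g1 = [7, 5, 6, 8, 6]
    g2 = [4, 3, 5, 4, 2]
    g3 = [6, 7, 5, 8, 7]

    h_stat, p_value = stats.kruskal(g1, g2, g3)
    print(round(h_stat, 3), round(p_value, 3))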
Many times a researcher argues: what if I use three t-tests rather than one-way ANOVA for comparing the means of three groups? One argument is: why apply the t-test three times if equality of means can be tested once with one-way ANOVA? If the number of groups is larger, one needs to apply a large number of t-tests; for example, in the case of six groups, one needs to apply ^6C_2 = 15 t-tests instead of a single one-way ANOVA. This may be one argument in favor of using one-way ANOVA, but the main problem in using multiple t-tests instead of one-way ANOVA is that the type I error gets inflated. If the level of significance has been chosen as p_1, then Fisher showed that the type I error rate expands from p_1 to some larger value as the number of tests between paired means increases. The error rate expansion is constant and predictable and can be computed by the following equation:

p = 1 - (1 - p_1)^r \qquad (7.8)

where p is the new level of significance and r is the number of t-tests used for comparing all the pairs of group means.
For example, in comparing three group means, if t-tests are used instead of one-way ANOVA and the level of significance is chosen as .05, then the total number of paired comparisons would be ^3C_2 = 3. Here, p_1 = 0.05 and r = 3, and therefore the actual level of significance becomes

p = 1 - (1 - p_1)^r = 1 - (1 - 0.05)^3 = 1 - 0.95^3 = 1 - 0.8574 = 0.143
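Equation (7.8) is a one-liner to evaluate; the illustrative snippet below reproduces the 0.143 figure and also shows how quickly the effective type I error grows for 6 and 15 pairwise t-tests:

    p1 = 0.05                  # level of significance of each individual t-test
    for r in (3, 6, 15):       # number of pairwise t-tests
        p = 1 - (1 - p1) ** r  # Eq. (7.8): effective type I error rate
        print(r, round(p, 3))  # 3 -> 0.143, 6 -> 0.265, 15 -> 0.537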
One-way ANOVA is used when more than two group means are compared. Such
situations are very frequent in management research where a researcher may like to
compare more than two group means. For instance, one may like to compare the
mood state of the employees working in three different plants or to compare the
occupational stress among three different age categories of employees in an
organization.
Consider an experiment where the market analyst of a company is interested in knowing the effect of three different types of incentives on the sale of a particular brand of shampoo. The shampoo is sold to the customers under three schemes: in the first scheme, 20% extra shampoo is offered at the same price; in the second scheme, the shampoo is sold with a free bath soap; and in the third scheme, it is sold with a free ladies' perfume. These three schemes are offered to the customers in the same outlet for 3 months. During the second month, the sales of the shampoo are recorded under all three schemes for 20 days. In this situation, the scheme is the independent variable, having three different levels (20% extra shampoo, shampoo with a bath soap, and shampoo with a ladies' perfume), whereas the sales figure is the dependent variable. Here the null hypothesis to be tested would be

H_0: The average sale of shampoo is the same under all three incentive schemes

against the alternative hypothesis

H_1: At least one group mean is different.
The one-way ANOVA may be applied to compute F-value. If F-statistic is
significant, the null hypothesis may be rejected, and in that case, a post hoc test
may be applied to find as to which incentive is the most attractive in improving the
sale of the shampoo. On the other hand, if F-value is not significant, one fails to
reject the null hypothesis, and in that case, there would be no reason to believe that
any one incentive is better than others to enhance the sale.
Example 7.1 An audio company predicts that students learn more effectively with constant, low-tune melodious music in the background, as opposed to an irregular loud orchestra or no music at all. To verify this hypothesis, a study was planned by dividing 30 students into three groups of ten each. Students were assigned to the three groups at random, and all of them were given a comprehension passage to read for 20 min. Students in group 1 studied the passage with low-tune melodious music at a constant volume in the background, whereas the students in group 2 were exposed to a loud orchestra and those in group 3 to no music at all while reading. After reading the passage, they were asked to answer a few questions on it. The marks obtained are shown in Table 7.1.
Do these data confirm that learning is more effective with a particular background music? Test your hypothesis at the 5% level.
Solution Following steps shall be taken to test the required hypothesis:
(a) Hypotheses construction: The researcher is interested in testing the following
null hypothesis:
Table 7.2 Computation of group totals, group means, and grand total

              Music     Orchestra   Without music
                8           4             3
                4           6             4
                8           3             6
                6           4             2
                6           3             1
                7           8             2
                3           3             6
                7           2             4
                9           4             1
                6           3             2
Group total   R1 = 64     R2 = 40     R3 = 31        G = R1 + R2 + R3 = 135
Group mean      6.4         4.0           3.1
(iii) Total sum of squares:

$$TSS = RSS - CF = \sum_i\sum_j X_{ij}^2 - \frac{G^2}{N} = 755 - 607.5 = 147.5$$

(iv) Sum of squares between groups:

$$SS_b = \sum_i \frac{R_i^2}{n_i} - \frac{G^2}{N} = \frac{64^2 + 40^2 + 31^2}{10} - 607.5 = 665.7 - 607.5 = 58.2$$
$$CD = t_{.05}(27)\,\sqrt{\frac{2\,MSS_w}{n}} = 2.052 \times \sqrt{\frac{2 \times 3.31}{10}} = 1.67$$
(e) Results
The group means may be compared by arranging them in descending order, as shown in Table 7.4.
It is clear from Table 7.4 that the mean difference between the "music" and "orchestra" groups, as well as between the "music" and "without music" groups, is greater than the critical difference. Since the mean difference between the orchestra and without music groups is not significant, it is shown by clubbing their means with a line in Table 7.4.
(f) Inference
From the results, it is clear that the mean learning performance in music group
is significantly higher than that of orchestra as well as nonmusic groups,
whereas the mean learning of orchestra group is equal to that of nonmusic
group. It is therefore concluded that melodious music improves the learning
efficiency.
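The manual computation of this example can be cross-checked programmatically. The following is a minimal Python sketch using SciPy (an assumption of this illustration, since the book's workflow is SPSS); it recomputes the F-value and the critical difference of Eq. (7.6) from the data of Table 7.2:

# Cross-check of Example 7.1: F-value and LSD critical difference.
from scipy import stats

music     = [8, 4, 8, 6, 6, 7, 3, 7, 9, 6]
orchestra = [4, 6, 3, 4, 3, 8, 3, 2, 4, 3]
no_music  = [3, 4, 6, 2, 1, 2, 6, 4, 1, 2]

f_stat, p_val = stats.f_oneway(music, orchestra, no_music)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")          # F is about 8.8

n, r, N = 10, 3, 30
groups = (music, orchestra, no_music)
ss_within = sum(sum((x - sum(g) / n) ** 2 for x in g) for g in groups)
mss_w = ss_within / (N - r)                          # about 3.31
t_crit = stats.t.ppf(0.975, N - r)                   # two-tailed t.05, 27 df
cd = t_crit * (2 * mss_w / n) ** 0.5
print(f"MSSw = {mss_w:.2f}, CD = {cd:.2f}")          # CD is about 1.67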
Example 7.2 The data in the following table indicate the psychological health ratings of corporate executives in the banking, insurance, and retail sectors. Apply one-way ANOVA to test whether the executives of any particular sector are better in their psychological health than those of the other sectors. Test your hypothesis at the 5% as well as the 1% level (Table 7.5).
Solution
In this problem, it is required to test the following null hypothesis:

H0: The mean psychological health rating is the same in the banking, insurance, and retail sectors,

against the alternative that at least one sector mean differs. While defining the Sector variable in SPSS, the following value labels are used:
Value Label
1 Banking
2 Insurance
3 Retail
5. Under the column heading Measure, select the option “Scale” for the
Psy_Health variable and “Nominal” for the Sector variable.
6. Use default entries in rest of the columns. The screen shall look like
Fig. 7.1.
Remark: Many variables can be defined in the variable view simulta-
neously if ANOVA is to be applied for more than one variable.
(iii) Entering data
After defining both the variables in Variable View, click Data View on
the left corner in the bottom of the screen as shown in Fig. 7.1 to open the
data entry format column wise. After entering the data, the screen will look
like as shown in Fig. 7.2. Since the data is large, only a portion of data is
shown in the figure. Save the data file in the desired location before further
processing.
(b) SPSS commands for one-way ANOVA
After entering all the data in the data view, follow the below-mentioned steps
for one-way analysis of variance:
(i) Initiating the SPSS commands for one-way ANOVA: In data view, click the following commands in sequence:
Analyze ➾ Compare Means ➾ One-Way ANOVA
Fig. 7.5 Screen showing options for post hoc test and significance level
Different descriptive statistics are shown in Table 7.6, which may be used to study the nature of the data. Further, descriptive profiles of the psychological health rating for the corporate executives in different sectors can be developed by using the values of the mean, standard deviation, and minimum and maximum scores in each group. The procedure for developing such a profile has been discussed in Chap. 2 of this book.
Table 7.6 Descriptive statistics for the data on psychological health among corporate executives in different sectors

                                        95% confidence interval for mean
            N    Mean    SD     SE     Lower bound   Upper bound   Min.    Max.
Banking     15   43.40   5.84   1.51   40.17         46.63         36.00   59.00
Insurance   15   40.00   4.97   1.28   37.25         42.75         32.00   53.00
Retail      15   48.53   8.53   2.20   43.81         53.26         32.00   65.00
Total       45   43.98   7.38   1.10   41.76         46.20         32.00   65.00

Note: Values have been rounded off to two decimal places
The mean of different groups in Table 7.6 and the results of Table 7.8 have been
used to prepare the graphics shown in Table 7.9 which can be used to draw
conclusions about post hoc comparison of means.
The F-value in Table 7.7 is significant at the 5% level because its p value (= .004) is less than .05. Thus, the null hypothesis of no difference among the means of the three groups may be rejected at the 5% level. Since the p value is also less than .01, the null hypothesis may be rejected at the 1% level as well.
Here, the F-value is significant; hence, the post hoc test needs to be applied for
testing the significance of mean difference between different pairs of groups.
Table 7.8 provides such comparison. It can be seen from this table that the
difference between banking and retail groups on their psychological health rating
is significant at 5% level because the p value for this mean difference is .04 which is
less than .05.
Similarly, the difference between insurance and retail groups on their psycho-
logical health is also significant at 5% as well as 1% level because the p value
attached to this mean difference is .001 which is less than .05 as well as .01.
There is no significant difference between the banking and insurance groups on
their psychological health rating because the p value attached to this group is .167
which is more than .05.
All three of the above findings can be easily understood by looking at the graphics in Table 7.9. From this table, it is clear that the mean psychological health rating score is highest among the executives of the retail sector in comparison to those of the banking and insurance sectors. It may thus be concluded that the psychological health of the executives in the retail sector is the best of the three sectors.
Solution Solving a one-way ANOVA problem with unequal samples through SPSS is almost identical to the equal-sample case. With unequal sample sizes, one should be careful in feeding the data; the data-feeding procedure for this case is discussed below. The SPSS procedure itself is described only briefly here, as it is exactly the same as in Example 7.2; readers are advised to refer to Example 7.2 in case of any doubt while solving this problem with unequal sample sizes.
Here, the null hypothesis which needs to be tested is
$$H_0: \mu_A = \mu_B = \mu_C$$
against the alternative hypothesis that at least one group mean differs.
If the null hypothesis is rejected, post hoc test will be used for comparing group
means. Since the sample sizes are different, the Scheffe’s test has been used for post
hoc analysis.
Table 7.10 Occupational stress scores among the employees in different age categories

Group A (<40 years)   Group B (40–55 years)   Group C (>55 years)
        54                    75                      55
        48                    68                      51
        47                    68                      59
        54                    71                      64
        56                    79                      52
        62                    86                      48
        56                    81                      65
        45                    79                      48
        51                    72                      56
        54                    78                      49
        48                    69
        52
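Before walking through the SPSS steps, the same analysis can be sketched in Python. This is an illustration under stated assumptions rather than the book's procedure: SciPy's f_oneway handles unequal group sizes directly, and the Tukey-Kramer test from statsmodels is substituted here for Scheffe's test, which SciPy does not provide:

# One-way ANOVA with unequal sample sizes (data of Table 7.10).
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group_a = [54, 48, 47, 54, 56, 62, 56, 45, 51, 54, 48, 52]   # <40 years
group_b = [75, 68, 68, 71, 79, 86, 81, 79, 72, 78, 69]       # 40-55 years
group_c = [55, 51, 59, 64, 52, 48, 65, 48, 56, 49]           # >55 years

f_stat, p_val = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_val:.6f}")

# Post hoc pairwise comparison (Tukey-Kramer in place of Scheffe)
scores = group_a + group_b + group_c
labels = ["A"] * len(group_a) + ["B"] * len(group_b) + ["C"] * len(group_c)
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))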
Value Label
1 Group A (<40 years)
2 Group B (40–55 years)
3 Group C (>55 years)
5. Under the column heading Measure, select the option “Scale” for the
Stress variable and “Nominal” for the Age_Gp variable.
6. Use default entries in rest of the columns.
After defining all the variables in variable view, the screen shall look like
Fig. 7.7.
Remark: More than one variable can be defined in the variable view for
doing ANOVA for many variables simultaneously.
(iii) Entering the data: After defining the variables in the Variable View, enter
the data, column-wise in Data View. The data feeding shall be done as
follows:
Solved Example of One-Way ANOVA with Unequal Sample 243
(b) SPSS commands for one-way ANOVA for unequal sample size
After entering all the data in data view, save the data file in the desired location
before further processing.
(i) Initiating the SPSS commands for one-way ANOVA: In data view, go to the
following commands in sequence:
Analyze ➾ Compare Means ➾ One-Way ANOVA
(ii) Selecting variables for analysis: After clicking the One-Way ANOVA
option, you will be taken to the next screen for selecting variables. Select
the variables Stress scores and Age group from left panel to the “Dependent
list” section and “Factor” section of the right panel, respectively. The
screen shall look like Fig. 7.9.
(iii) Selecting options for computation: After variable selection, option needs to
be defined for generating outputs in one-way ANOVA. This shall be done
as follows:
– Click the tag Post Hoc in the screen shown in Fig. 7.9.
– Check the option “Scheffe.” This test is selected because the sample
sizes are unequal; however, you can choose any other test if you so
desire.
– If graph needs to be prepared, select the option “Means plot.”
– Write “Significance level” as .05. Usually this is written by default;
however, you may write any other significance level like .01 or .10
as well.
– Click Continue.
– Click the tag Options and then check “Descriptive.” Click Continue.
– After selecting the options, click OK.
(c) Getting the output
After clicking OK on the screen as shown in Fig. 7.9, the output shall be
generated in the output window. The relevant outputs can be selected by
using right click of the mouse and may be copied in the word file. The following
output shall be generated in this example:
(a) Descriptive statistics
Table 7.11 Descriptive statistics for the data on occupational stress of employees in different age categories

                                                95% confidence interval for mean
                       N    Mean    SD     SE     Lower bound   Upper bound   Min.    Max.
Group A (<40 years)    12   52.25   4.77   1.38   49.23         55.28         45.00   62.00
Group B (40–55 years)  11   75.09   5.97   1.80   71.08         79.10         68.00   86.00
Group C (>55 years)    10   54.70   6.29   1.99   50.20         59.20         48.00   65.00
Total                  33   60.61  11.80   2.05   56.42         64.79         45.00   86.00
Table 7.11 shows the descriptive statistics of the data on occupational stress of
employees in different age categories. These statistics can be used to develop a
graphic profile of the employee’s occupational stress in different age categories.
Fig. 7.10 Graphical presentation of mean scores of occupational stress in three different age
categories
The procedure of developing such profile has been discussed in detail in Chap. 2 of
this book. Further, these descriptive statistics can be used to discuss the nature of
data in different age categories.
Table 7.12 gives the value of calculated F. The p value attached with the F is
.000 which is less than .05 as well as .01; hence, it is significant at 5% as well as 1%
levels. Since the F-value is significant, the null hypothesis of no difference in the
occupational stress among the employees in all the three age categories is rejected.
The post hoc test is now used to compare the means in different pairs.
SPSS provides the option of choosing the post hoc test, and, therefore, one may
choose any one or more test for post hoc analysis. In this example, the Scheffe’s test
was chosen to compare the means in different pairs. Table 7.13 provides such
comparisons.
It can be seen that the difference between the occupational stress of the employees in group A (<40 years) and group B (40–55 years) is significant at both the 5% and the 1% level, as the p value for this mean difference is .000, which is less than .05 as well as .01. Similarly, the mean difference in occupational stress between the employees in group B (40–55 years) and group C (>55 years) is also significant at both the 5% and 1% levels, as its p value of .000 is again less than .05 and .01. However, there is no significant difference between the occupational stress of the employees in group A (<40 years) and group C (>55 years), because the p value is .606.
Table 7.13 Post hoc comparison of group means using Scheffe's test

(I) Age group           (J) Age group           Mean diff. (I − J)   SE        Sig. (p value)
Group A (<40 years)     Group B (40–55 years)   −22.84091*           2.36531   .000
                        Group C (>55 years)     −2.45000             2.42623   .606
Group B (40–55 years)   Group A (<40 years)     22.84091*            2.36531   .000
                        Group C (>55 years)     20.39091*            2.47585   .000
Group C (>55 years)     Group A (<40 years)     2.45000              2.42623   .606
                        Group B (40–55 years)   −20.39091*           2.47585   .000

Note: The values of lower bound and upper bound have been omitted from the original output
*The mean difference is significant at the 5% as well as the 1% level
(iii) Under the column heading Values, define “1” for banking, “2” for insur-
ance, and “3” for retail.
(iv) After defining variables, type the data for these variables by clicking Data
View.
(v) In the data view, follow the below-mentioned command sequence for the
computation involved in one-way analysis of variance:
Analyze ➾ Compare Means ➾ One-Way ANOVA
(vi) Select the variables Psychological health rating and Different sector from
left panel to the “Dependent list” section and “Factor” section of the right
panel, respectively.
(vii) Click the tag Post Hoc and check the option “LSD” and ensure that the
value of “Significance level” is written as .05. Click Continue.
(viii) Click the tag Options and then check “Descriptive.” Press Continue.
(ix) Press OK for output.
Exercise
Note: Write answer to each of the following questions in not more than 200 words.
Q.1. In an experiment, it is desired to compare the time taken to complete a task
by the employees in three age groups, namely, 20–30, 31–40, and
41–50 years. Write the null hypothesis as well as all possible types of
alternative hypotheses.
Q.2. Explain a situation where one-way analysis of variance can be applied.
Which variances are compared in one-way ANOVA?
Q.3. Define the principles of ANOVA. What impact will it have if these principles are not met?
Q.4. In what situations are factorial experiments planned? Discuss a specific situation where such an experiment can be used.
Q.5. What is a repeated measures design? What precaution must one take in framing such an experiment?
Q.6. Discuss the procedure of one-way ANOVA in testing of hypotheses.
Q.7. Write a short note on post hoc tests.
Q.8. What do you mean by the different sums of squares? Which sum of squares would you like to increase and which to decrease in your experiment, and why?
Q.9. What are the assumptions in applying one-way ANOVA?
Q.10. If you use multiple t-tests instead of one-way ANOVA, what impact will it have on the results?
Q.11. Analysis of variance is used for comparing means of different groups, but in
doing so F-test is applied, which is a test of significance for comparing the
variances of two groups. Discuss this anomaly.
Q.12. What do you mean by the post hoc test? Differentiate between LSD and
Scheffe’s test.
Q.13. What is a p value? In what context is it used?
Multiple Choice Questions
Note: For each of the questions, there are four alternative answers. Tick mark the one that you consider closest to the correct answer.
1. In one-way ANOVA experiment, which of the following is a randomization
assumption that must be true?
(a) The treatment must be randomly assigned to the subjects.
(b) Groups must be chosen randomly.
(c) The type of data can be randomly chosen to either categorical or quantitative.
(d) The treatments must be randomly assigned to the groups.
2. Choose the correct statement.
(a) Total sum of square is additive in nature.
(b) Total mean sum of square is additive in nature.
(c) Total sum of square is nonadditive.
(d) None of the above is correct.
3. In one-way ANOVA, Xij represents
(a) The sample mean of the criterion variable for the ith group
(b) The criterion variable value for the ith subject in the jth group
(c) The number of observations in the jth group
(d) The criterion variable value for the jth subject in the ith group
4. In one-way ANOVA, TSS measures
(a) The variability within groups
(b) The variability between groups
(c) The overall variability in the data
(d) The variability of the criterion variable in any group.
5. In an experiment, three unequal groups are compared, with the total number of observations in all the groups being 31 (some entries of the following table are missing). Calculate the F-test statistic for the one-way ANOVA.

Source           df   SS     MS     F
Between groups        7.5    3.75   ?
Within groups
Total                 20.8
(a) 17.89
(b) 789
(c) 7.89
(d) 78.9
6. Choose the correct statement.
(a) LSD may be used for unequal sample size.
(b) Scheffe’s test may be used for unequal sample size.
(c) Scheffe’s test may be used for comparing more than ten groups.
(d) None of the above is correct.
7. If two groups having 10 observations each are compared by using one-way ANOVA and if SSw = 140, then what will be the value of MSSw?
(a) 50
(b) 5
(c) 0.5
(d) 50.5
8. In a one-way ANOVA, if the level of significance is fixed at .05 and the p value associated with the F-statistic is 0.062, then what should you do?
(a) Reject H0, and it is concluded that the group population means are not all
equal.
(b) Reject H0, and it may be concluded that it is reasonable that the group
population means are all equal.
(c) Fail to reject H0, and it may be concluded that the group population means
are not all equal.
(d) Fail to reject H0, and it may be concluded that there is no reason to believe
that the population means differ.
9. Choose the correct statement.
(a) If F-statistic is significant at .05 level, it will also be significant at .01 level.
(b) If F-statistic is significant at .01 level, it may not be significant at .05 level.
(c) If F-statistic is significant at .01 level, it will necessarily be significant at
.05 level.
(d) If F-statistic is not significant at .01 level, it will not be significant at .05
level.
10. Choose the correct statement.
(a) If p value is 0.02, F-statistic shall be significant at 5% level.
(b) If p value is 0.02, F-statistic shall not be significant at 5% level.
(c) If p value is 0.02, F-statistic shall be significant at 1% level.
(d) None of the above is correct.
11. In comparing the IQ among three classes using one-way ANOVA in SPSS,
choose the correct statement about the variable types.
17. In one-way ANOVA, four groups were compared for their memory retention
power. These four groups had 8, 12, 10, and 11 subjects, respectively. What
shall be the degree of freedom of between groups?
(a) 41
(b) 37
(c) 3
(d) 40
18. If motivation has to be compared among the employees of three different plants using one-way ANOVA, then the variables Motivation and Plant need to be selected in SPSS. Choose the correct selection strategy.
(a) Motivation in "Factor" section and Plant in "Dependent list" section.
(b) Motivation in "Dependent list" section and Plant in "Factor" section.
(c) Both Motivation and Plant in "Dependent list" section.
(d) Both Motivation and Plant in "Factor" section.
Assignments
1. A CFL company was interested in knowing the impact of weather on the life of its bulbs. Bulbs were lit continuously under hot, humid, and cold environmental conditions until they fused. The following are the numbers of hours they lasted in the different conditions:
Apply one-way analysis of variance to test whether the average life of the bulbs is the same in all the weather conditions. Test your hypothesis at the 5% as well as the 1% level of significance.
2. It was experienced by a researcher that the housewives read local news with
more interests in comparison to the news containing health information and read
Learning Objectives
After completing this chapter, you should be able to do the following:
• Explain the importance of two-way analysis of variance (ANOVA) in research.
• Understand different designs where two-way ANOVA can be used.
• Describe the assumptions used in two-way analysis of variance.
• Learn to construct various hypotheses to be tested in two-way analysis of
variance.
• Interpret various terms involved in two-way analysis of variance.
• Learn to apply two-way ANOVA manually in your data.
• Understand the procedure of analyzing the interaction between two factors.
• Know the procedure of using SPSS for two-way ANOVA.
• Learn the model way of writing the results in two-way analysis of variance by
using the output obtained in the SPSS.
• Interpret the output obtained in two-way analysis of variance.
Introduction
          District
          1   2   3   4
Outlet 1  B   C   D   A
Outlet 2  D   A   B   C
Outlet 3  C   D   A   B
Outlet 4  A   B   C   D
All the three basic principles of design, that is, randomization, replication, and local control, are used in planning a two-way ANOVA experiment in order to minimize the error variance. In one-way ANOVA experiments, only two principles, i.e., randomization and replication, are used to control the error variance, whereas in two-way ANOVA experiments all three principles, i.e., randomization, replication, and local control, are used. The very purpose
of using these three principles of design is to enable the researcher to conclude with
more authority that the variation in the criterion variable is due to the identified
level of a particular factor.
In two-way ANOVA experiment, the principle of randomization means that the
samples in each group are selected in a random fashion so as to make the groups as
homogeneous as possible. The randomization avoids biases and brings control in
the experiment and helps in reducing the error variance up to a certain extent.
The principle of replication refers to studying the effect of two factors on more
than one subject in each cell. The logic is that one should get the same findings on
more than one subject. In two-way ANOVA experiment, the principle of replication
allows a researcher to study the significance of interaction between the two factors.
Interaction effect cannot be studied if there is only one observation in each cell.
The principle of local control refers to making the groups as homogeneous as
possible so that variation due to one or more assignable causes may be segregated
from the experimental error. Thus, the application of local control helps us in
reducing the error variation and making the design more efficient.
In the example discussed above, in studying the effect of age on job satisfaction, if the employees were divided only according to their age, then we would have ignored the effect of gender on job satisfaction, which would have increased the experimental error. However, if the researcher feels that job satisfaction varies with salary structure rather than with gender, then the subjects may be selected as per their salary bracket within the different age categories. This might further reduce the experimental error. Thus, maximum homogeneity can be ensured among the observations in each cell by including in the design the factor that is known to vary with the criterion variable.
Classification of ANOVA
By using the above-mentioned principles the two-way ANOVA can be used for
different designs. Some of the most popular designs where two-way ANOVA can
be used are discussed below.
Two-way ANOVA is most widely used in factorial designs. A factorial design is used when there is more than one independent variable; the independent variables are also referred to as factors. In a factorial design, there are at least two factors. Usually, two-way analysis of variance is used in factorial designs having two factors.
In this design, the effect of two factors (having different levels) is seen on the dependent variable. Consider an experiment where age (A) and gender (B) are taken as two factors whose effect has to be seen on the dependent variable, sincerity. Further, let factor A have three levels (20–30, 31–40, and 41–50 years) and factor B have two levels (male and female). Thus, in this design, 2 × 3, that is, six combinations of treatment groups need to be taken. This design facilitates studying the effect of both factors A and B on the dependent variable. Further, in this design, the significance of the interaction effect between the two factors can also be tested. The factorial design is very popular in behavioral research, the social sciences, and the humanities. It has a few advantages over single-factor designs. The most important aspect of the factorial design is that it can provide information about how the factors interact or combine in the effect they have on the dependent variable. The factorial design shall be discussed in detail while solving a two-way ANOVA problem later in this chapter.
Another design where two-way ANOVA is used is the repeated measures design. This design is also known as a within-subject design. In this design, the same subject is tested under repeated conditions over time. The repeated measures design can be considered an extension of the paired-samples t-test because, in this case, comparison is made between more than two repeated measures. The repeated measures design is used to eliminate individual differences as a source of between-group differences, which helps to create a more powerful test. The only care to be taken in the repeated measures design is that, while testing the same subject repeatedly, there should be no carryover effect (see the sketch below).
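For readers who also work outside SPSS, a one-factor repeated measures ANOVA can be sketched with statsmodels' AnovaRM; the variable names, subjects, and scores below are hypothetical illustrations, not data from this book:

# Repeated measures ANOVA: each subject is measured under every condition.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.DataFrame({
    "subject":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["t1", "t2", "t3"] * 4,
    "score":     [5, 7, 8, 4, 6, 9, 6, 6, 7, 5, 8, 9],
})

result = AnovaRM(df, depvar="score", subject="subject",
                 within=["condition"]).fit()
print(result)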
In this design, the effect of two factors is studied on more than one dependent variable. It is similar to the factorial design having two factors, with the only difference being that here we have more than one dependent variable. At times, it makes sense to combine the dependent variables for drawing conclusions about the effects of the two factors on them. For instance, if in an experiment the effect of teaching methods and hostel facilities has to be seen on the overall academic performance (consisting of four subjects: Physics, Chemistry, Math, and English) of students, then it makes sense to see the effect of these two factors on all the subjects together. Once the effect of any of these factors is found to be significant, a two-way ANOVA for each of the dependent variables is applied (a programmatic sketch follows).
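A two-factor MANOVA of the kind described in this paragraph can be sketched in Python with statsmodels; the factor names and all scores below are hypothetical illustrations:

# MANOVA: joint effect of two factors on four dependent variables.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "teaching": ["lecture"] * 6 + ["seminar"] * 6,
    "hostel":   (["yes"] * 3 + ["no"] * 3) * 2,
    "Physics":   [62, 65, 61, 55, 58, 54, 70, 72, 69, 60, 63, 61],
    "Chemistry": [58, 66, 55, 52, 59, 50, 71, 65, 67, 55, 64, 58],
    "Math":      [67, 61, 66, 54, 62, 57, 69, 76, 68, 63, 60, 66],
    "English":   [55, 63, 58, 49, 54, 52, 64, 70, 62, 59, 55, 60],
})

mv = MANOVA.from_formula(
    "Physics + Chemistry + Math + English ~ teaching * hostel", data=df)
print(mv.mv_test())   # Wilks' lambda etc. for each effect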
Two-way ANOVA design is more efficient over one-way ANOVA because of the
following four reasons:
1. Unlike one-way ANOVA, the two-way ANOVA design facilitates us to test the
effect of two factors at the same time.
2. Since in two-way ANOVA variation is explained by two assignable causes, it
reduces the error variance. Due to this fact, two-way ANOVA design is more
efficient than one-way ANOVA.
3. In two-way ANOVA, one can test for independence of the factors provided there
is more than one observation per cell. However, number of observations in each
cell must be equal. On the other hand, in one-way ANOVA, one may have the
unequal number of scores in each group.
4. Besides reducing the error variance, two-way ANOVA also reduces the computation, as it effectively combines several one-way ANOVAs.
Factors
Independent variables are usually known as factors. In two-way ANOVA, the effect of two factors is studied on a certain criterion variable. Each of the two factors may have two or more levels. The degrees of freedom for each factor are equal to the number of levels of the factor minus one.
Treatment Groups
Main Effect
The main effect is the effect of one independent variable (or factor) on the dependent variable across all the levels of the other variable. The interaction is ignored for this part: just the rows or just the columns are used, not mixed. This part is similar to a one-way analysis of variance; each of the variances calculated to analyze the main effects (rows and columns) is like a between-groups variance. The degrees of freedom for a main effect are one less than its number of levels. For example, if factor A has r levels and factor B has c levels, then the degrees of freedom for factors A and B would be r − 1 and c − 1, respectively.
Interaction Effect
The joint effect of the two factors on the dependent variable is known as the interaction effect. It can also be defined as the effect that one factor has on the other factor. The degrees of freedom for the interaction are the product of the degrees of freedom of the two factors: if factors A and B have r and c levels, respectively, then the degrees of freedom for the interaction would be (r − 1)(c − 1).
Within-Group Variation
The within-group variation is the sum of squares within each treatment group.
In two-way ANOVA, all treatment groups must have the same sample size. The
total number of treatment groups is the product of the number of levels for each
factor. The within variance is equal to within variation divided by its degrees of
freedom. The within group is also denoted as error. The within-group variation is
often denoted by SSE.
Let us suppose that there are two factors A and B whose effects have to be tested on
the criterion variable X, and let the factors A and B have levels r and c, respectively,
with n units per cell, then these scores can be written as follows:
                               Factor B
                  1                  …   j                  …   c                  Row total
Factor A   1   X111, X112, …, X11n      X1j1, X1j2, …, X1jn     X1c1, X1c2, …, X1cn    R1
               (cell total T11)         (T1j)                   (T1c)
           i   Xi11, Xi12, …, Xi1n      Xij1, Xij2, …, Xijn     Xic1, Xic2, …, Xicn    Ri
               (Ti1)                    (Tij)                   (Tic)
           r   Xr11, Xr12, …, Xr1n      Xrj1, Xrj2, …, Xrjn     Xrc1, Xrc2, …, Xrcn    Rr
               (Tr1)                    (Trj)                   (Trc)
Column total        C1                       Cj                      Cc             G = ΣRi = ΣCj
where
Xijk represents the kth score in the (i,j)th cell
Tij represents the total of all the n scores in the (i,j)th cell
G is the grand total of all the scores
Ri is the total of all the scores in ith level of the factor A
Cj is the total of all the scores in jth level of the factor B
N is the total number of scores and is equal to r × c × n
In two-way ANOVA, the total variability among the above-mentioned N scores can be attributed to the variability due to rows (or factor A), due to columns (or factor B), due to interaction (row × column, A × B), and due to error. Thus, the total variability can be broken into the following four components: sum of squares due to row factor A (SSR), sum of squares due to column factor B (SSC), sum of squares due to interaction A × B (SSI), and sum of squares due to error (SSE).

Remark: SSE is the within-group variability, which was represented by SSw in one-way ANOVA.
(i) Total sum of squares (TSS): It represents the total variation present in the
data set and is usually represented by TSS. It is defined as the sum of the
squared deviations of all the scores in the data set from their grand mean.
The TSS is computed by the following formula:
$$TSS = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}\left(X_{ijk} - \frac{G}{N}\right)^2$$

which, after solving, becomes

$$TSS = \sum_i\sum_j\sum_k X_{ijk}^2 - \frac{G^2}{N} \qquad (8.2)$$

Since the degrees of freedom for the TSS are N − 1, the mean sum of squares is computed by dividing the TSS by N − 1.
(ii) Sum of squares due to row factor (SSR): It is the measure of variation
between the row group means and is usually denoted by SSR. This is also
known as the variation due to row factor (one of the assignable causes).
The sum of squares due to row is computed as follows:
$$SSR = \sum_{i=1}^{r}\frac{R_i^2}{nc} - \frac{G^2}{N} \qquad (8.3)$$

The degrees of freedom for SSR are r − 1, and the mean sum of squares for the row factor is obtained by dividing SSR by r − 1.
(iii) Sum of squares due to column factor (SSC): It is the measure of variation between the column group means and is computed as follows:

$$SSC = \sum_{j=1}^{c}\frac{C_j^2}{nr} - \frac{G^2}{N} \qquad (8.4)$$
The degrees of freedom for SSC are given by c − 1, as there are c column group means that are required to be compared. The mean sum of squares for the column factor is obtained by dividing SSC by c − 1.
(iv) Sum of squares due to interaction (SSI): It is the measure of variation due
to the interaction of both the factors A and B. It facilitates us to test whether
or not there is a significant interaction effect. The sum of squares due to
interaction is usually denoted by SSI. The SSI is computed as follows:
$$SSI = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{T_{ij}^2}{n} - \frac{G^2}{N} - SSR - SSC \qquad (8.5)$$
(v) Sum of squares due to error (SSE): It is obtained by subtraction: SSE = TSS − SSR − SSC − SSI. The mean sum of squares due to error is obtained by dividing SSE by its degrees of freedom, N − rc.
(vi) ANOVA table: This is a summary table showing different sum of squares
and mean sum of squares for all the components of variation. The compu-
tation of F-values is shown in this table. This table is popularly known as
two-way ANOVA table. After computing all the sum of squares, the
ANOVA table is prepared for further analysis which is shown as follows:
(vii) F-statistic: Under the normality assumptions, the F-value obtained in the ANOVA table, say, for the row factor, follows an F-distribution with (r − 1, N − rc) degrees of freedom. An F-statistic is computed for each source of variation.
In two-way ANOVA, we investigate the effect of main effects along with the
interaction effect of two factors on dependent variable. The following situation
shall develop an insight among the researchers to appreciate the use of this analysis.
Consider a situation where a mobile company is interested to examine the effect
of gender and age of the customers on the frequency of short messaging service
(sms) sent per week. Each person may be classified according to gender (men and
women) and age category (16–25, 26–35, and 36–45 years). Thus, there will be six
groups, one for each combination of gender and age. Random sample of equal size
in each group may be drawn, and each person may be asked about the number of
sms he or she sends per week. In this situation, there are three main research
questions that can be answered:
(a) Whether the number of sms sent depends on gender
(b) Whether the number of sms sent depends on age
(c) Whether the number of sms sent depends on gender differently for different age categories, and vice versa
All these questions can be answered through testing of hypothesis in two-way
ANOVA model. The first two questions simply ask whether sending sms depends
on age and gender. On the other hand, the third question asks whether sending sms
depends on gender differently for people in different age category, or whether
sending sms depends on age differently for men and women. This is because one
may think that men send more sms than women in 18–25 years age category, but
women send more sms than men in 26–55 years age category. After applying the
two-way ANOVA model, one may be able to explain the above-mentioned research
questions in the following manner:
whether the factor gender has a significant impact on the number of sms sent
irrespective of their age categories. And if it is so, one may come to know whether
men send more sms than women or vice versa, irrespective of their age categories.
Similarly, one can test whether the factor age has a significant impact on the number of sms sent irrespective of gender. And if the age factor is significant, one can know in which age category people send more sms, irrespective of their gender.
The most important aspect of two-way ANOVA is to know whether an interaction effect of gender and age on sending sms is present. One may test whether these two factors, that is, gender and age, are independent of each other in deciding the number of sms sent. The interaction analysis allows us to compare the average sms
sent in different age categories in each of the men and women groups separately.
Similarly, it also provides the comparison of the average sms sent by the men and
women in different age categories separately.
The information provided through this analysis may be used by the marketing
department to chalk out their promotional strategy for men and women separately
in different age categories for the mobile users.
Computation
Before computing components of different sum of squares, let us first compute the
row, column, and cell total along with the grand total (Table 8.2).
1. Raw sum of squares (sum of squares of all the scores in the study):

$$RSS = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n} X_{ijk}^2$$
$$= (15^2 + 13^2 + \dots + 9^2) + (15^2 + 14^2 + \dots + 8^2) + (18^2 + 16^2 + \dots + 16^2) + (10^2 + 7^2 + \dots + 8^2) + (13^2 + 14^2 + \dots + 14^2) + (11^2 + 10^2 + \dots + 11^2)$$
$$= 792 + 666 + 1080 + 343 + 1106 + 655 = 4642$$
2. Correction factor:

$$CF = \frac{G^2}{N} = \frac{362^2}{30} = 4368.13$$
3. Total sum of squares:

$$TSS = \sum_i\sum_j\sum_k X_{ijk}^2 - \frac{G^2}{N} = RSS - CF = 4642 - 4368.13 = 273.87$$
4. Sum of squares due to the row factor (gender):

$$SSR = \sum_{i=1}^{r}\frac{R_i^2}{nc} - \frac{G^2}{N} = \frac{190^2 + 172^2}{5 \times 3} - 4368.13 = 10.80$$
5. Sum of squares due to the column factor (incentives):

$$SSC = \sum_{j=1}^{c}\frac{C_j^2}{nr} - \frac{G^2}{N} = \frac{103^2 + 130^2 + 129^2}{5 \times 2} - 4368.13 = 46.87$$
6. Sum of squares due to interaction:

$$SSI = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{T_{ij}^2}{n} - \frac{G^2}{N} - SSR - SSC$$
$$= \frac{62^2 + 56^2 + 72^2 + 41^2 + 74^2 + 57^2}{5} - 4368.13 - 10.80 - 46.87 = 4514 - 4425.80 = 88.20$$
7. Sum of squares due to error:

$$SSE = TSS - SSR - SSC - SSI = 273.87 - 10.80 - 46.87 - 88.20 = 128$$
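These hand computations can be verified with a few lines of Python; the sketch below rebuilds every sum of squares from the cell totals and the raw sum of squares quoted above (the full raw data table is not reproduced here):

# Rebuild the two-way ANOVA sums of squares from the cell totals.
cell_totals = [[62, 56, 72],    # male row:   incentives I, II, III
               [41, 74, 57]]    # female row: incentives I, II, III
n, RSS = 5, 4642                # scores per cell; raw sum of squares
r, c = 2, 3
N = r * c * n
G = sum(sum(row) for row in cell_totals)                 # grand total 362
CF = G ** 2 / N                                          # correction factor
TSS = RSS - CF
row_totals = [sum(row) for row in cell_totals]
col_totals = [sum(row[j] for row in cell_totals) for j in range(c)]
SSR = sum(R ** 2 for R in row_totals) / (n * c) - CF
SSC = sum(C ** 2 for C in col_totals) / (n * r) - CF
SSI = sum(t ** 2 for row in cell_totals for t in row) / n - CF - SSR - SSC
SSE = TSS - SSR - SSC - SSI
print(f"TSS={TSS:.2f} SSR={SSR:.2f} SSC={SSC:.2f} SSI={SSI:.2f} SSE={SSE:.2f}")
# Expected: TSS=273.87 SSR=10.80 SSC=46.87 SSI=88.20 SSE=128.00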
Tabulated value of F can be seen from Table A.4 in the Appendix. Thus, from
Table A.4, the value of F.05 (1,24) ¼ 4.26 and F.05 (2,24) ¼ 3.40.
In Table 8.3, since the calculated value of F for incentives and interaction is
greater than their corresponding tabulated value of F, these two F-ratios are
significant. However, F-value for gender is not significant.
Table 8.3 Two-way ANOVA table for the data on sales of toothpaste

Source of variation                 SS       df                   MSS     F        Tab. F at 5% level
Gender                              10.80    r − 1 = 1            10.80   2.03     4.26
Incentives                          46.87    c − 1 = 2            23.44   4.398*   3.40
Interaction (gender × incentives)   88.20    (r − 1)(c − 1) = 2   44.10   8.27*    3.40
Error                               128.00   N − rc = 24          5.33
Corrected total                     273.87   N − 1 = 29

*Significant at 5% level
A post hoc test shall be used to further analyze the column factor (incentives) and the interaction effect.
Table 8.4 shows that the mean difference between the II and III incentive groups is less than the critical difference (= 2.13); hence, there is no difference between these two incentive groups. To show this, a line has been drawn below these two group means. On the other hand, there is a significant difference between the means of the II and I as well as the III and I incentive groups. Thus, it may be concluded that the II and III incentives are equally effective and better than the I incentive in enhancing the sale of the toothpaste, irrespective of the gender of the sales manager.
Table 8.5 provides the post hoc comparison of the means of the male and female groups within each of the three incentive groups. Since the mean difference between males and females in each of the three incentive groups is higher than the critical difference, these differences are significant at the 5% level. Further, it may be concluded that the average sales of the male group under the I and III incentives are higher than those of the female group, whereas the average sales of the female group are higher than those of the male group under the II incentive.
Table 8.6 shows the comparison of the different incentive groups within each gender group. It can be seen from this table that in the male section, average sales are significantly different in the III and II incentive groups, whereas average sales in the III and I incentive groups as well as the I and II incentive groups are the same. On the other hand, in the female section, the average sales in all three incentive groups are significantly different from each other.
Thus, on the basis of the given data, the results suggest that the III incentive should be preferred if the toothpaste is promoted by a male sales manager, whereas sales would increase with a female sales manager using the II incentive.
5. For the variable Sweetness, double-click the cell under the column
Values and add following values to different labels:
Value Label
1 Semisweet chocolate
2 Bittersweet chocolate
3 Unsweetened chocolate
6. Similarly, for the variable Colour, add the following values to different
labels:
Value Label
1 White
2 Milk
3 Dark
there will be 45 rows of data. Under the column Sweetness, the first 15 scores are entered as 1 (denoting semisweet chocolate), the next 15 scores are entered as 2 (denoting bittersweet chocolate), and the remaining 15 scores are entered as 3 (denoting unsweetened chocolate). Under the column Colour, the first five scores are entered as 1 (denoting white chocolate), the next five as 2 (denoting milk chocolate), and the subsequent five as 3 (denoting dark chocolate); these 15 entries belong to the semisweet chocolate group. The Colour codes for the bittersweet and unsweetened chocolate groups are just a repetition of those for the semisweet group. Thus, after feeding the first 15 entries in the Colour column, repeat this set of 15 entries twice in the same column.
After entering the data, the screen will look like Fig. 8.2. Save the data
file in the desired location before further processing.
(b) SPSS commands for two-way ANOVA
After entering the data in data view as per above-mentioned scheme, follow the
below-mentioned steps for two-way analysis of variance:
(i) Initiating the SPSS commands for two-way ANOVA: In Data View, click
the following commands in sequence:
Analyze ➾ General Linear Model ➾ Univariate
The screen shall look like Fig. 8.3.
(ii) Selecting variables for two-way ANOVA: After clicking the Univariate
option, you will be taken to the next screen for selecting variables. Select
the variables Chocolate_Sale from left panel to the “Dependent variable”
section of the right panel. Similarly, select the variables Chocolate_S-
weetness and Chocolate_Colour from left panel to the “Fixed Factor(s)”
section of the right panel. The screen will look like Fig. 8.4.
(iii) Selecting the option for computation: After selecting the variables, various
options need to be defined for generating the output in two-way ANOVA.
Do the following:
– Click the tag Post Hoc in the screen shown in Fig. 8.4. Then,
– Select the factors Sweetness and Colour from the left panel to the
“Post Hoc Tests for” panel on the right side by using the arrow key.
– Check the option “LSD.” LSD test is selected as a post hoc because
the sample sizes are equal in each cell.
The screen will look like Fig. 8.5.
– Click Continue. This will again take you back to the screen as shown in
Fig. 8.4.
– Now click the tag Options on the screen and do the following steps:
– Check the option “Descriptive.”
Fig. 8.2 Screen showing data entry for different variables in the data view
Fig. 8.6 Screen showing options for descriptive statistics and comparison of main effects
right click of the mouse and may be copied in the word file. Here, the following
outputs shall be selected:
1. Descriptive statistics
2. Two-way ANOVA table
3. Pairwise comparisons of sweetness groups (all color groups combined)
4. Pairwise comparisons of different color groups (all sweetness groups
combined)
In this example, all the identified outputs so generated by the SPSS will look
like as shown in Tables 8.8, 8.9, 8.10, and 8.11.
In order to interpret the findings, these outputs may be rearranged so that it can
directly be used in your project. These rearranged formatted tables have been
shown under the heading “Model Way of Writing the Results” in the next
section.
difference among different groups. The procedure of comparing group means has
been discussed later in this section.
The first important table consisting F-values for the factors and interaction can
be reproduced by deleting some of the contents of Table 8.9. The information so
reduced is shown in Table 8.12.
The p-values for Sweetness, Color, and the Interaction (Sweetness × Color) in Table 8.12 are less than .05; hence, all three F-values are significant at the 5% level. Thus, the null hypotheses for the Sweetness factor, the Color factor, and the Interaction (Sweetness × Color) may be rejected at the .05 level of significance. Post hoc comparisons shall now be carried out for these factors and the interaction; these analyses are shown below.
Table 8.12 Two-way ANOVA table for the data on chocolate sale

Source of variation   SS         df   MSS      F       p-value (sig.)
Sweetness             339.51     2    169.76   9.83    .000
Color                 120.04     2    60.02    3.47    .042
Sweetness × Color     1,487.42   4    371.86   21.52   .000
Error                 622.00     36   17.28
Corrected total       2,568.98   44
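A table of this form can also be produced outside SPSS. The sketch below uses statsmodels on a small hypothetical long-format data set (the real example has 3 sweetness levels × 3 colours × 5 observations; the column names and values here are assumptions for illustration only):

# Two-way ANOVA table via an OLS model with interaction.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "sweetness": ["semi"] * 6 + ["bitter"] * 6,
    "colour":    (["white"] * 3 + ["dark"] * 3) * 2,
    "sale":      [23, 25, 22, 33, 35, 34, 28, 26, 27, 21, 20, 23],
})

model = smf.ols("sale ~ C(sweetness) * C(colour)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, df, F, and p for each effect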
Table 8.13 Comparison of mean chocolate sale in all the three sweetness groups (all colors
combined)
Bittersweet chocolate Semisweet chocolate Unsweetened chocolate CD at 5% level
26.93 25.73 20.60 3.08
For row analysis, critical difference has been obtained by using the LSD test. The
value of “t” at .05 level and 36 df (error degrees of freedom) can be obtained from
Table A.2 in Appendix.
Thus,

$$CD_{row} = t_{.05}(36)\,\sqrt{\frac{2\,(MSS)_E}{nc}} = 2.03 \times \sqrt{\frac{2 \times 17.28}{5 \times 3}} = 3.08$$

[n = number of scores in each cell = 5; c = number of columns (colour groups) = 3]
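The same critical difference can be reproduced in a couple of lines of Python (SciPy assumed):

# LSD critical difference for the row-wise comparison of Table 8.13.
from scipy import stats

mss_e, df_error = 17.28, 36   # error mean square and its df (Table 8.12)
n, c = 5, 3                   # scores per cell; number of colour groups
t_crit = stats.t.ppf(0.975, df_error)          # two-tailed t at .05
cd_row = t_crit * (2 * mss_e / (n * c)) ** 0.5
print(round(cd_row, 2))       # about 3.08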
Table 8.13 has been obtained by combining the contents of Tables 8.8 and 8.10; readers are advised to note how this table has been generated. If the difference between any two group means is higher than the critical difference, the difference is said to be significant. Applying this principle to Table 8.13, two conclusions can be drawn.
(a) Average sale of chocolate in bittersweet and semisweet categories is signifi-
cantly higher than that of unsweetened category.
(b) Average sale of chocolate in bittersweet and semisweet categories is equal.
It may thus be inferred that bittersweet and semisweet chocolates are more
preferred than unsweetened chocolates irrespective of the color of the chocolate.
Remark By looking at the p-values in Table 8.10, you can infer which group means differ significantly: if the significance value (p-value) for any mean difference is less than .05, the difference is considered significant. Using that table alone, however, you can test the significance of a mean difference but cannot easily tell which group mean is higher unless the results of Table 8.8 are brought in. Hence, it is advisable to construct Table 8.13 for the post hoc analysis so as to get a clear picture.
Table 8.14 Comparison of mean chocolate sale in all the three Color groups
(all sweetness types combined)
White chocolate Milk chocolate Dark chocolate CD at 5% level
26.40 24.47 22.40 3.08
For column analysis, critical difference has been obtained by using the LSD test as
there are equal numbers of samples in each column.
Thus,

$$CD_{column} = t_{.05}(36)\,\sqrt{\frac{2\,(MSS)_E}{nr}} = 2.03 \times \sqrt{\frac{2 \times 17.28}{5 \times 3}} = 3.08$$

[n = number of scores in each cell = 5; r = number of rows (sweetness groups) = 3]
Table 8.14 has been obtained from the contents in Tables 8.8 and 8.11.
From Table 8.14, the following two conclusions can be drawn:
(a) There is no difference in the average sale of white and milk chocolates; similarly, the average sale of milk and dark chocolates is also the same.
(b) The average sale of white chocolates is significantly higher than that of dark chocolates.
Thus, it may be inferred, in general, that the mean sale of white chocolates is higher than that of dark chocolates, irrespective of the type of sweetness.
Remark You may note that critical difference for row and column analysis is
same. It is so because the number of rows is equal to the number of columns.
Interaction Analysis
Since F-value for the interaction is significant, it indicates that there is a joint effect
of the chocolate sweetness and chocolate colors on the sale of chocolates. In other
words, there is an association between sweetness and color of the chocolates. Thus,
to compare the average chocolate sale among the three levels of sweetness in each of the color groups, and to compare the average sales of all three types of colored chocolates in each of the sweetness groups, Tables 8.15 and 8.16 have been prepared.
Table 8.15 Comparison of mean chocolate sale among different sweetness groups in each of the
three colour groups
Color Sweetness CD at 5% level
White 27.80 (Bittersweet) 27.80 (Unsweetened) 23.60 (Semisweet) 5.34
Tables 8.15 and 8.16 have been generated with the help of the contents of Table 8.8. Readers are advised to note that the CD is the same for comparing all three sweetness groups within each color group as well as for comparing all the color groups within each sweetness group. This is because the number of samples (n) in each cell is equal.
If the difference of group means is higher than that of the critical difference, it
denotes that there is a significant difference between the two means; otherwise,
group means are equal. If the mean difference is not significant, an underline is put
against both the groups.
From Table 8.15, the following three conclusions can be drawn:
(a) The average sale of chocolates in all the three categories of sweetness groups is
same for white chocolates.
(b) In milk chocolates, the average sale in bittersweet category is significantly
higher than that of unsweetened and semisweet categories.
(c) In dark chocolates, the average sale in semisweet category is significantly
higher than that of bittersweet and unsweetened categories.
It is thus concluded that for white chocolates it hardly matters which sweetness flavor is sold, whereas the type of sweetness matters in the case of milk and dark chocolates.
Table 8.16 Comparison of mean chocolate sale among different colour groups in each of the
three sweetness groups
Sweetness Color CD at 5% level
Semisweet 34.00 (Dark) 23.60 (White) 19.60 (Milk) 5.34
(vi) In the data view, follow the below-mentioned command sequence for two-
way ANOVA:
Analyze ➾ General Linear Model ➾ Univariate
(vii) Select the variable Chocolate_Sale from left panel to the “dependent variable”
section of the right panel. Similarly, select the variables Chocolate_Sweetness
and Chocolate_Colour from left panel to the “Fixed Factor(s)” section of the
right panel.
(viii) Click the tag Post Hoc and select the factors Sweetness and Colour from the
left panel to the “Post Hoc test” panel on the right side. Check the option
“LSD” and then click Continue.
(ix) Click the tag Options, and select the variables OVERALL, Sweetness, Colour, and Sweetness × Colour from the left panel to the right panel. Check the "Compare main effects" and "Descriptive" boxes and ensure that the value of significance is .05. Click Continue.
(x) Press OK for output.
Exercise
Note: Write answer to each of the following questions in not more than 200 words.
Q.1. What do you mean by main effects, interactions effects, and within-group
variance? Explain by means of an example.
Q.2. Justify the name “two-way analysis of variance.” What are the advantages of
using two-way ANOVA design over one-way?
Q.3. While using two-way ANOVA, what assumptions need to be made about the
data?
Q.4. Describe an experimental situation where two-way ANOVA can be used.
Discuss different types of hypotheses that you would like to test.
Q.5. Discuss a situation where a factorial design can be used in market research. What research questions would you like to investigate?
Q.6. What is repeated measure design? Explain by means of an example. What
precaution should be taken in planning such design?
Q.7. Explain MANOVA and discuss any one situation where it can be applied in
management studies.
Q.8. Describe the Latin square design and discuss its layout. How is it different from a factorial design?
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick mark the one that you consider closest to the correct answer.
1. In applying two-way ANOVA in an experiment where "r" levels of factor A and "c" levels of factor B are studied, what will be the degrees of freedom for the interaction?
(a) rc
(b) r + c
(c) rc − 1
(d) (r − 1)(c − 1)
2. In an experiment “r” levels of factor A are compared in “c” levels of factor B.
Thus, there are N scores in this experiment. What is the degree of freedom for
within group?
(a) N − rc
(b) N + rc
(c) N − rc + 1
(d) Nrc − 1
3. In a two-way ANOVA, if the two factors A and B have levels 3 and 4,
respectively, and the number of scores per cell is 3, what would be the degrees
of freedom of error?
(a) 36
(b) 24
(c) 12
(d) 9
4. In order to apply two-way ANOVA
(a) There should be equal number of observations in each cell.
(b) There may be unequal number of observations in each cell.
(c) There should be at least ten observations in each cell.
(d) There is no restriction on the number of observations per cell.
5. Consider an experiment in which the Satisfaction levels of employees (men and
women both) were compared in their plants located in three different cities.
Choose the correct statement in defining the three variables Gender, City, and
Satisfaction level in SPSS:
(a) Gender and Satisfaction level are Scale variables and City is Nominal.
(b) Gender and City are Nominal variables and Satisfaction level is Scale.
(c) Gender and City are Scale variables and Satisfaction level is Nominal.
(d) City and Satisfaction level are Scale variables and Gender is Nominal.
12. What should be the minimum number of observations in order to perform two-
way ANOVA with interaction effect?
(a) 8
(b) 6
(c) 4
(d) 2
Assignments
1. Four salesmen were appointed by a company to sell its products in door-to-door marketing. Their sales were observed in three seasons, summer, rainy, and winter, on a month-to-month basis. The sales data so obtained (in lakhs of rupees) are shown in the following table:
Learning Objectives
After completing this chapter, you should be able to do the following:
• Learn the concept of analysis of covariance.
• Know the applications of analysis of covariance in different situations.
• Describe the concept of covariate and neutralize its effect from the treatment
effect.
• Know the model involved in the analysis of covariance.
• Understand the concept of adjusting treatment means for covariate using linear
regression.
• Understand the analysis of covariance graphically.
• Learn the method of using analysis of covariance.
• Understand why analysis of covariance is a more efficient design in comparison to one-way analysis of variance.
• Formulate the hypotheses in analysis of covariance.
• Understand the assumptions used in analysis of covariance.
• Know the method of preparing data file for analysis in SPSS.
• Learn the steps involved in using SPSS for analysis of covariance.
• Interpret the output obtained in analysis of covariance.
• Learn the model way of writing the results of analysis.
Introduction
Fig. 9.1 Graph of the line Y = mX + c in the slope-intercept form, with intercept c on the y-axis, passing through the point (x̄, ȳ)
In order to understand the ANCOVA model, let us first refresh our concept of
representing the line in the slope intercept form. You may recall that this line used
to be represented by
Y = mX + c (9.1)
where m is the slope and c is the intercept of the line on the y-axis. Graphically, this equation can be represented as in Fig. 9.1.
Equation of line in any form may be converted to this slope intercept form for
comparing their slopes and intercepts.
The line shown in Fig. 9.1 is a regression line for estimating the value of Y if the value of X is known. Now if you look at the vertical line over x̄, it intersects the regression line at (x̄, ȳ). In other words, the point (x̄, ȳ) actually lies on the regression line. This concept shall be used to explain the analysis of covariance.
To understand ANCOVA, let us consider A and B represent the two treatments.
Further, YA and YB represent the values of the criterion variable, whereas XA and XB represent the values of the covariate in the two treatment groups A and B, respectively. These two treatments are represented by the lines A and B in Fig. 9.2. If a higher value of the criterion variable is desirable, the treatment whose line lies higher may be taken as the more effective one.
Fig. 9.2 Geometrical representation of two treatments (Y’s) with their covariates (X’s)
If Yij represents the jth score of the criterion variable in the ith treatment group and
Xij represents the jth score of the covariate in the ith treatment group, then the one-
way ANCOVA model is represented as follows:
Yij = m + b(Xij − X̄) + eij (9.2)
where
m is the overall population mean (on criterion variable)
b is the common slope of the lines
eij are unexplained error terms, which are independent and normally distributed with mean 0 and variance σ²
One-way analysis of covariance fits a straight line to each treatment group of X-Y
data, such that the slopes of the lines are all equal. This fitted model may then be
used to test the following null hypothesis:
H0: The intercepts for each line are equal.
This hypothesis tests whether all the treatment group means are equal after making the adjustment for X (the covariate). Here, we assume that the slopes are equal; there is no point in comparing treatment effects if one treatment produces a positive effect whereas the other induces a negative one.
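Readers comfortable with programming may find it helpful to see this two-part logic in code. The following is a minimal sketch (not part of the book's SPSS workflow) using Python's statsmodels; the data frame and all values in it are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data: y = criterion variable, x = covariate, group = treatment
df = pd.DataFrame({
    "y":     [12, 10, 8, 9, 13, 11, 7, 6, 9, 8],
    "x":     [14, 12, 5, 8, 16, 13, 4, 3, 7, 6],
    "group": ["A"] * 5 + ["B"] * 5,
})

# Step 1: fit a model with a group-by-covariate interaction. A significant
# interaction means the slopes differ, and comparing intercepts is pointless.
slopes = smf.ols("y ~ C(group) * x", data=df).fit()
print(anova_lm(slopes, typ=2))          # inspect the C(group):x row

# Step 2: if the slopes can be taken as equal, drop the interaction and fit
# the model of Eq. (9.2): a common slope b with group-specific intercepts.
# The C(group) row then tests H0: all intercepts (adjusted means) are equal.
ancova = smf.ols("y ~ C(group) + x", data=df).fit()
print(anova_lm(ancova, typ=2))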
Let us see how the treatment means are adjusted for the covariate and are computed for comparison. Summing both sides of Eq. (9.2) over j and dividing by n (the number of scores in each treatment group), we get
(1/n) Σj Yij = m + b[ (1/n) Σj Xij − X̄ ] + (1/n) Σj eij
⇒ Ȳi = m + b(X̄i − X̄) + ēi
and, since the average of the error terms ēi is zero,
Ȳi = m + b(X̄i − X̄) (9.3)
where
Ȳi is the mean of the criterion variable in the ith treatment group
m is the overall population mean (on the criterion variable)
X̄i is the mean of the covariate (X data) in the ith treatment group
Other symbols have their usual meanings. If one-way ANCOVA model has two
treatment groups, then the model (9.3) will give rise to two straight lines as shown
in Fig. 9.2. Thus, by testing the hypothesis H0 in ANCOVA, we actually compare the two treatment means ȲA and ȲB after adjusting them for the covariate.
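The adjustment implied by Eq. (9.3) is the usual ANCOVA one: each observed group mean is corrected by b(X̄i − X̄). A small numeric sketch, with every value hypothetical, shows the direction of the correction.

import numpy as np

b = 0.4                                      # hypothetical common slope
grand_x = 8.0                                # hypothetical grand covariate mean
group_means_y = np.array([10.0, 12.5, 9.0])  # hypothetical observed group means
group_means_x = np.array([6.0, 9.0, 8.5])    # hypothetical covariate group means

# Adjusted mean = observed mean - b * (group covariate mean - grand mean).
adjusted = group_means_y - b * (group_means_x - grand_x)
print(adjusted)   # groups starting above the grand covariate mean are pulled down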
Remark
1. If the slopes of the lines A and B are equal, it indicates that the effects of both treatments are in the same direction only; both treatments will induce either a positive or a negative effect.
2. By comparing the intercepts of the lines, we try to see whether the effect of all the treatments on the criterion variable is the same or not.
against the possible alternatives that at least one group mean differs
where
Adj_Post_Adv_1 is the adjusted mean sale in the first treatment group (where the first advertisement campaign was used)
Adj_Post_Adv_2 is the adjusted mean sale in the second treatment group (where the second advertisement campaign was used)
Adj_Post_Adv_3 is the adjusted mean sale in the third treatment group (where no advertisement campaign was used)
The ANCOVA table generated in the SPSS output contains the value of the F-statistic along with its significance value. Thus, if the F-value is significant, the null hypothesis is rejected and a post hoc test is used to compare the adjusted posttreatment means of the different groups in pairs.
Assumptions in ANCOVA
Example 9.1 A study was planned to investigate the effect of different doses of
vitamin C in curing the cold. Forty-five subjects who were suffering from cold symptoms were divided into three groups. The first two groups were given a low dose and a high dose of vitamin C every day, whereas the third group was given a placebo. The number of days these subjects had been suffering from cold before starting the treatment was taken as the covariate, whereas the curing time in each treatment group was recorded as the dependent variable. The data so obtained on the subjects are shown in Table 9.1.
Table 9.1 Data on cold duration before and during implementation of vitamin C in different
groups
Contents of vitamin C
S.N. High dose Low dose Placebo
Pre days Post days Pre days Post days Pre days Post days
1 0 2 14 12 1 10
2 10 3 16 13 10 8
3 11 5 5 8 5 14
4 15 9 12 10 6 9
5 6 3 0 1 10 13
6 12 8 8 4 5 11
7 9 7 12 9 12 15
8 13 7 5 10 13 15
9 1 6 19 10 6 10
10 8 13 14 8 19 20
11 7 12 6 11 8 12
12 6 10 8 11 8 14
13 4 3 5 8 6 12
14 3 2 2 6 5 9
15 4 3 4 6 8 14
Pre days: Cold duration before treatment
Post days: Cold duration during treatment
against the alternative hypothesis that at least one group mean (adjusted) is different
where
μ(Adj Days in Treatment A) is the adjusted mean curing time in treatment group A
μ(Adj Days in Treatment B) is the adjusted mean curing time in treatment group B
μ(Adj Days in Treatment C) is the adjusted mean curing time in treatment group C
The SPSS output provides ANCOVA table along with pairwise comparison of
adjusted post means of different treatment groups. The pairwise comparison of
means is done only when the F-ratio is significant.
The analysis of covariance table generated in the SPSS output looks similar to the one-way ANOVA table, as only the adjusted post means are compared here.
In the ANCOVA table, F-value is shown along with its significance value
(p-value). The F-value would be significant if its corresponding p-value is less
than .05, and in that case null hypothesis would be rejected. Once the F-value
is found to be significant, then a post hoc test is used to compare the paired
means. SPSS provides the choice of post hoc test to be used in the analysis.
In this example, since the sample sizes are equal, LSD test shall be used
as a post hoc test for comparing the group means. The SPSS output provides
the significance value (p-value) for each pair of difference of group
means. Thus, by looking at the values of means, the best treatment may be
identified.
After clicking the Type in Data, you will be taken to the Variable View
option for defining variables in the study.
(ii) Defining variables
In this example, three variables, namely, vitamin dose, cold duration before treatment, and cold duration during treatment, need to be defined. The
procedure of defining these variables along with their characteristics is
as follows:
1. Click the Variable View to define the variables and their properties.
2. Write short name of the variables as Vitamin_Dose, Pre_Days and
Post_Days under the column heading Name.
3. Under the column heading Label, define full name of these variables as
Vitamin dose, Cold duration before treatment, and Cold duration during
treatment. Other names may also be chosen for describing these
variables.
4. Under the column heading Measure, select the option “Nominal” for the
variable Vitamin dose and “Scale” for the variables Cold duration before
treatment and Cold duration during treatment.
5. For the variable Vitamin dose, double-click the cell under the column
Values and add the following values to different labels:
Value Label
1 Treatment A
2 Treatment B
3 Treatment C
Fig. 9.7 Screen showing option for choosing sum of squares and model type
mouse and may be copied in the word file. The identified outputs shall be
rearranged for interpreting the findings. The details have been shown under the
heading Model Way of Writing the Results.
(d) SPSS output
The readers should note the kind of outputs to be selected from the output
window of SPSS for explaining the findings. The following four outputs have
been selected for discussing the results of ANCOVA:
1. Descriptive statistics
2. Adjusted estimates of the dependent variable
3. ANCOVA table
4. Post hoc comparison table
These outputs have been shown in Tables 9.2, 9.3, 9.4, and 9.5.
The above output generated by the SPSS can be shown in a much more user-friendly format by modifying the relevant contents of Tables 9.2, 9.3, 9.4, and
9.5. The below-mentioned edited outputs can directly be shown in the project,
dissertation, or thesis. These modified outputs shall be used to discuss the findings
of ANCOVA.
(a) Descriptive Statistics of the Data Obtained on the Criterion Variable
The mean and standard deviation of the criterion variable in different treatment
groups have been shown in Table 9.6. Entries in this table have been copied
from Table 9.2. If you are interested in computing different descriptive statistics for the covariate (number of days having cold symptoms before treatment) also, the same can be computed by using the procedure discussed in Chap. 2. However, SPSS does not generate these statistics during the ANCOVA analysis.
Look at the table heading which can be used in writing the final results in
your study.
Table 9.7 Adjusted mean and standard error for the data on cold duration in different groups
during treatment
95% Confidence interval
Vitamin dose Mean Std. error Lower bound Upper bound
Treatment A 6.5a .70 5.09 7.93
Treatment B 8.2a .70 6.78 9.63
Treatment C 12.4a .70 10.94 13.77
ᵃ Covariates appearing in the model are evaluated at the following values: Cold duration before treatment = 8.0222. Values have been rounded off
From Table 9.6, it can be seen that average time taken to cure the cold
symptoms is highest in treatment group C whereas the least time is in treatment
group A. Treatment C signifies the placebo, whereas treatment A is the high
dose of vitamin C. The next question is to see whether this difference is
significant or not after adjusting for the covariate (number of days having
cold symptoms before treatment).
(b) Descriptive Statistics of the Data Obtained on the Criterion Variable after
Adjusting for Covariate
The adjusted mean and standard error of the criterion variable in different
treatment groups have been shown in Table 9.7. The mean of criterion variable
has been obtained in all the three treatment groups after adjusting for the
covariate (Number of days having cold symptoms before treatment). These
data have been taken from Table 9.3. Readers may note that these values are
different from that of the unadjusted values shown in Table 9.6. The advantage
of using the ANCOVA is that the differences in the posttesting means are
compensated for the initial differences in the scores. In other words, it may be
said that the effect of covariate is eliminated in comparing the effectiveness of
treatments on the criterion variable.
Kindly note the heading of the table which may be used for writing the final
results of ANCOVA.
(c) ANCOVA Table for the Data on Criterion Variable (Number of Days Having
Cold Symptoms During Treatment)
The main ANCOVA table may be reproduced by deleting some of the
unwanted details of Table 9.4. The final results of ANCOVA have been
shown in Table 9.8. The “significance” (Sig.) value has been renamed as p-value. In most of the scientific literature, the term p-value is used instead of significance value.
Table 9.8 ANCOVA table for the data on cold duration in different groups during treatment
Source Sum of squares df Mean square F Sig. (p-value)
Pre_Days 183.993 1 183.993 24.961 .000
Vitamin_Dose 270.768 2 135.384 18.367 .000
Error 302.217 41 7.371
Corrected total 756.978 44
Table 9.8 shows the F-value for comparing the adjusted means of the criterion
variable in three Vitamin_Dose groups (treatment A, treatment B, and treatment
C). You can note that the F-statistic computed for Vitamin_Dose is significant
because the p-value associated with it is .000, which is less than .05. Thus, the null
hypothesis of no difference among the adjusted means for the data on criterion
variable (number of days having cold symptoms during treatment) in three
treatment groups may be rejected at 5% level.
Remark: You can see that the F-value for Pre_Days (covariate) is also significant. It shows that the initial conditions of the experimental groups are not the same, and that is why we apply ANCOVA, adjusting the mean values of the criterion variable for the covariate.
(d) Post Hoc Comparison for the Group Means in Post-measurement Adjusted
with the Initial Differences
Since F-statistic is significant, post hoc comparison has been made for the
adjusted means of the three treatment groups, which is shown in Table 9.9.
This table has been obtained by deleting some of the information from Table 9.5. It may be noted here that the p-value for the mean difference between treatments A and C, as well as between treatments B and C, is .000. Since the p-value is less than .05, both these mean differences are significant at the 5% level. Thus, the
following conclusions can be drawn:
(i) There is a significant difference between the adjusted means of criterion
variable (Number of days having cold symptoms during treatment) in
treatment A (High vitamin C dose) and treatment C (Placebo).
(ii) Click Variable View tag and define the variables Vitamin_Dose as nominal
variable and Pre_Days and Post_Days as scale variables.
(iii) Under the column heading Values against the variable Vitamin_Dose,
define “1” for Treatment A, “2” for Treatment B, and “3” for Treatment C.
(iv) After defining the variables, type the data for these variables by clicking
Data View.
(v) In the data view, follow the below-mentioned command sequence for ANCOVA:
Analyze → General Linear Model → Univariate
(vi) Select the variables Cold duration during treatment, Vitamin dose, and Cold
duration before treatment from left panel to the “Dependent variable”
section, “Fixed Factor(s)” section, and “Covariate(s)” section of the right
panel, respectively.
(vii) Click the tag Model and select the Sum of Squares option as “Type I.” Press
Continue.
(viii) Click the tag Options and select the variables Overall and Vitamin_Dose
from the left panel to the “Display Means for” section of the right panel.
Check the option “Compare main effects” and “Descriptive statistics.”
Ensure the value of significance as .05 or .01 as the case may be. Press
Continue.
(ix) Click OK for output.
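For readers who wish to cross-check the output outside SPSS, the same ANCOVA can be reproduced in Python with statsmodels. This is a sketch, not part of the book's workflow: it enters the data of Table 9.1, uses sequential (Type I) sums of squares with the covariate entered first, mirroring the Model choice above, and should agree with Table 9.8 up to rounding. The last three lines give LSD-style pairwise comparisons (t-tests without multiplicity correction) of the adjusted means.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Data of Table 9.1 (15 subjects per group).
pre_a  = [0, 10, 11, 15, 6, 12, 9, 13, 1, 8, 7, 6, 4, 3, 4]      # high dose
post_a = [2, 3, 5, 9, 3, 8, 7, 7, 6, 13, 12, 10, 3, 2, 3]
pre_b  = [14, 16, 5, 12, 0, 8, 12, 5, 19, 14, 6, 8, 5, 2, 4]     # low dose
post_b = [12, 13, 8, 10, 1, 4, 9, 10, 10, 8, 11, 11, 8, 6, 6]
pre_c  = [1, 10, 5, 6, 10, 5, 12, 13, 6, 19, 8, 8, 6, 5, 8]      # placebo
post_c = [10, 8, 14, 9, 13, 11, 15, 15, 10, 20, 12, 14, 12, 9, 14]

df = pd.DataFrame({
    "Pre_Days":  pre_a + pre_b + pre_c,
    "Post_Days": post_a + post_b + post_c,
    "Dose":      ["A"] * 15 + ["B"] * 15 + ["C"] * 15,
})

# ANCOVA with sequential (Type I) sums of squares, covariate entered first.
fit = smf.ols("Post_Days ~ Pre_Days + C(Dose)", data=df).fit()
print(anova_lm(fit, typ=1))     # cf. Table 9.8

# LSD-style pairwise comparisons of adjusted means. With treatment coding,
# each C(Dose) coefficient is a difference of adjusted means vs. group A.
print(fit.t_test("C(Dose)[T.B] = 0"))                   # B vs. A
print(fit.t_test("C(Dose)[T.C] = 0"))                   # C vs. A
print(fit.t_test("C(Dose)[T.B] - C(Dose)[T.C] = 0"))    # B vs. C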
Exercise
Note: Write the answer to each of the following questions in not more than 200 words.
Q1. What do you mean by a covariate? How is it controlled in ANCOVA? Give a specific example.
Q2. Describe an experimental situation where ANCOVA can be applied. Construct
null hypothesis and all possible alternative hypotheses.
Q3. Thirty boys were selected for direct marketing of a vacuum cleaner in three
similar cities. In each of the cities, 10 boys were sent for direct marketing for a
month. Three different kinds of incentives, namely, conveyance allowance,
two percent bonus, and gifts were offered to these sales agents in these three
cities on completing the target. To compare the effectiveness of three different
incentives on sale, which statistical technique should be used?
Q4. If two treatment groups are to be compared on some criterion variable, how do you interpret the result if the slopes of the two regression lines are the same? Further, if the intercepts are equal, what does it convey? Explain by means of a graphical representation.
312 9 Analysis of Covariance: Increasing Precision in Comparison by Controlling. . .
Q5. Explain the statement “the analysis of covariance is a mix of one-way ANOVA
and linear regression.”
Q6. Why is the observed mean of the criterion variable adjusted in ANCOVA? How is this adjustment done?
Q7. What are the various assumptions used in analysis of covariance?
Q8. Which design is more efficient among one-way ANOVA and ANCOVA, and why?
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick the one that you consider closest to the correct answer.
1. In designing an experiment, if randomization is not possible, control is achieved by matching the groups. This matching is done on the variable which is
(a) Independent
(b) Extraneous
(c) Dependent
(d) Any variable found suitable
2. Covariate is a variable which is supposed to be correlated with
(a) Criterion variable
(b) Independent variable
(c) Dependent variable
(d) None of the above
3. In ANCOVA, while doing post hoc analysis, which group means are
compared?
(a) Pretest group means
(b) Posttest group means
(c) Pretest adjusted group means
(d) Posttest adjusted group means
4. In ANCOVA, if the slopes of the regression lines in different treatment groups
are same, one can infer that
(a) Some of the treatments will show the improvement where the other
treatments may show the deterioration.
(b) All the treatments will show either deterioration or improvement but with
varying degrees.
(c) One cannot tell about the improvement or deterioration due to different
treatments.
(d) All treatments will have the same amount of improvement in the criterion
variable.
5. In ANCOVA, if intercepts of the regression lines in the two treatment groups
are same, then it may be inferred that
tested for their financial knowledge before and after the training program.
While using SPSS for ANCOVA, three variables, namely, Pre_Knowledge,
Post_Knowledge, and Treatment_Group, need to be defined. Choose
the correct types of each variable.
(a) Pre_Knowledge and Post_Knowledge are Scale and Treatment_Group is
Ordinal.
(b) Pre_Knowledge and Post_Knowledge are Nominal and Treatment_Group
is Scale.
(c) Pre_Knowledge and Treatment_Group are Scale and Post_Knowledge is
Nominal.
(d) Pre_Knowledge and Post_Knowledge are Scale and Treatment_Group is
Nominal.
12. While using SPSS for ANCOVA, the three variables, namely, Pre_Test, Post_Test, and Treatment_Group, are classified as
(a) Post_Test as Dependent variable whereas Pre_Test and Treatment_Group
as Fixed Factor(s)
(b) Post_Test as Dependent variable, Pre_Test as Covariate, and
Treatment_Group as Fixed Factor
(c) Treatment_Group as Dependent variable, Pre_Test and Post_Test as Fixed
Factor(s)
(d) Treatment_Group as Dependent variable, Post_Test as Covariate, and
Pre_Test as Fixed Factor
13. Choose the correct sequence of commands in SPSS for starting ANCOVA.
(a) Analyze → Univariate → General Linear Model
(b) Analyze → General Linear Model → Multivariate
(c) Analyze → General Linear Model → Univariate
(d) Analyze → General Linear Model → Repeated Measures
Assignments
1. In a psychological experiment, 60 subjects were randomly divided into three
equal groups. These groups were taught with audiovisual aid, traditional method,
and need-based methods. Prior to the treatments, learning motivation of all the
subjects was assessed. After 4 weeks, improvement in academic achievements
was noted. The data so obtained on academic achievements are shown in
Table A-1.
Apply analysis of covariance to see which methodology of teaching is more effective for academic achievement. Test your hypothesis at the .05 as well as the .01 level of significance.
2. A study was conducted to know the impact of gender on life optimism. Since age is considered a factor affecting life optimism, it was taken as the covariate.
Table A-1 Scores on academic achievements and learning motivation in three types of teaching
methods
Learning Objectives
After completing this chapter, you should be able to do the following:
• Understand the concept of cluster analysis.
• Know the different terminologies used in cluster analysis.
• Learn to compute different distances used in the analysis.
• Understand different techniques of clustering.
• Describe the assumptions used in the analysis.
• Explain the situations where cluster analysis can be used.
• Learn the procedure of using cluster analysis.
• Know the use of hierarchical cluster analysis and K-means cluster analysis.
• Describe the situation under which two-step cluster should be used.
• Understand various outputs of cluster analysis.
• Know the procedure of using cluster analysis with SPSS.
• Understand the different commands used in SPSS for cluster analysis and their outcomes.
• Learn to interpret the outputs of cluster analysis generated by the SPSS.
Introduction
Market analysts are always in search of strategies responsible for buying behavior.
The whole lot of customers can be grouped on the basis of their buying behavior
patterns. This segmentation of customers helps analysts in developing marketing
strategy for different products in different segments of customers. These segments
are developed on the basis of buying behavior of the customers in such a way so
that the individuals in the segments are more alike but the individuals in different
segments differ to a great extent in their characteristics. The concept of segmenting
may be used to club different television serials into homogeneous categories on the basis of their characteristics. An archaeological surveyor may like to cluster different idols excavated from archaeological digs into the civilizations from which they originated. These idols may be clustered on the basis of their physical and chemical parameters to identify their age and the civilization to which they belong.
Doctors may diagnose a patient for viral infection and determine whether distinct
subgroups can be identified on the basis of a clinical checklist and pathological
tests. Thus, in different fields several situations may arise where it is required to
segment the subjects on the basis of their behaviour pattern so that an appropriate
strategy may be formed for these segments separately. Segmenting may also be
done for the objects based on their similarity of features and characteristics. Such
segmenting of objects may be useful for making a policy decision. For instance, all
the cars can be classified into small, medium, and large segments depending upon their features like engine power, price, seating capacity, luggage capacity, and fuel consumption. Different policies may be adopted by the authorities to promote these segments of vehicles.
The problem of segmentation shall be discussed in this chapter by means of cluster analysis. More emphasis has been given to understanding the various concepts of this analysis and the procedure used in it. Further, a solved example has been worked out using SPSS for the easy understanding of readers. The reader should note how the different outputs generated in this analysis by SPSS have been interpreted.
Distance Measure
listeners, etc. The simplest way of computing distances between cases in a multidi-
mensional space is to compute Euclidean distances. There are many methods
available for computing distances and it is up to the researcher to identify an
appropriate method according to the nature of the problem. Although plenty of
methods are available for computing distances between the cases, we are discussing
herewith the five most frequently used methods. These methods for computing the
distances shall be discussed later in this chapter by using some data.
Consider the data in Table 10.1 where age, income, and qualification are the
three different parameters on which employees need to be grouped into different
clusters. We will see the computation of distances between the two employees
using different distance methods.
deij = √( Σₖ (Xik − Xjk)² ),  k = 1, 2, …, n   (10.1)
where
Xik is the measurement of the ith case on the kth variable
Xjk is the measurement of the jth case on the kth variable
n is the number of variables
Let us compute the Euclidean distance between first and second employee by
using their profile as shown in Table 10.1.
Since the Euclidean distance is the square root of the squared Euclidean distance, the Euclidean distance between the first and second employees is de12 = √0.38 = 0.62.
In computing the Euclidean distance, each difference between the two employees on a variable is squared so that positive and negative differences do not cancel each other. After adding all of the squared differences, we take the square root. We do so because squaring changes the units of measurement, and taking the square root brings the distance back to the original unit of measurement.
If Euclidean distances are smaller, the cases are more similar. However, this
measure depends on the units of measurement for the variables. If variables are
measured on different scales, variables with large values will contribute more to the
distance measure than the variables with small values. It is therefore important to
standardize scores before proceeding with the analysis if variables are measured on
different scales. In SPSS, you can standardize variables in different ways.
Manhattan Distance
The Manhattan distance between two cases is computed by summing the absolute differences along each variable. The Manhattan distance is also known as the city-
block distance and is appropriate when the data set is discrete. By using the data of
Table 10.1 the Manhattan distance between first and second employee has been
computed in the Table 10.3.
Thus, the Manhattan distance between the first and second employees is dm = 0.2 + 0.3 + 0.5 = 1.00.
Chebyshev Distance
The Chebyshev distance between two cases is obtained by finding the maximum absolute difference in their values on any variable. This distance is used when we want to define two cases as “different” if they differ on any one of the dimensions. The Chebyshev distance is computed as

dcij = maxₖ |Xik − Xjk|

In Table 10.1, the Chebyshev distance between the first and fourth employees would be 2.8, as this is the maximum absolute difference between these two employees, which occurs on the income variable.
Pearson Distance
The Pearson distance (dp) is based on the Pearson correlation coefficient (r) between the observations of the two cases. This distance is computed as dp = 1 − r and lies between 0 and 2. Since the maximum and minimum values of r can be +1 and −1, respectively, the range of the Pearson distance (dp) is from 0 to 2. A value of dp equal to 0 indicates that the cases are alike, whereas a value of 2 indicates that the cases are entirely distinct.
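The four measures above are straightforward to compute directly. In the sketch below (not from the book), the two profiles are hypothetical standardized scores chosen so that the variable-wise differences are 0.2, 0.3, and 0.5, matching the worked Euclidean (0.62) and Manhattan (1.00) figures above.

import numpy as np
from scipy.stats import pearsonr

emp1 = np.array([0.5, 0.4, 1.0])    # hypothetical standardized profile
emp2 = np.array([0.3, 0.7, 0.5])    # hypothetical standardized profile

euclidean = np.sqrt(np.sum((emp1 - emp2) ** 2))   # Eq. (10.1): gives 0.62
manhattan = np.sum(np.abs(emp1 - emp2))           # city-block distance: 1.00
chebyshev = np.max(np.abs(emp1 - emp2))           # largest single difference: 0.5
pearson_d = 1 - pearsonr(emp1, emp2)[0]           # dp = 1 - r, lies in [0, 2]

print(euclidean, manhattan, chebyshev, pearson_d)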
Clustering Procedure
Fig. 10.1 Classification of clustering procedures: hierarchical procedures (agglomerative, comprising linkage, variance (Ward's procedure), and centroid methods, and divisive) and nonhierarchical procedures (sequential threshold, parallel threshold, and optimizing partitioning methods)
Two clusters may be linked together on the basis of the smallest distance between two objects, one from each of the two different clusters. Similarly, two clusters may be linked together on the basis of the maximum distance between two objects, one from each cluster. There are thus different ways in which the objects can be clustered together. The entire set of clustering procedures can be broadly classified into three different categories, that is, hierarchical clustering, nonhierarchical clustering, and two-step clustering.
procedures shall be discussed in detail under various headings in this section. The
details of various classification procedures have been shown graphically in
Fig. 10.1.
Hierarchical Clustering
Agglomerative Clustering
Centroid Method
In this method, clusters are merged on the basis of the Euclidean distance between
the cluster centroids. Clusters having least Euclidean distance between their
centroids are merged together. In this method, if two unequal-sized groups are merged together, then the larger of the two tends to dominate the merged cluster. Since centroid methods compare the means of the two clusters, outliers affect them less than
most other hierarchical clustering methods. However, it may not perform well in
comparison to Ward’s method or average linkage method (Milligan 1980). Linkage
of clusters using centroid method is shown in Fig. 10.4.
Variance Methods
In this method, clusters are formed so as to minimize the within-cluster variance. In other words, clusters are linked if the variation within the two clusters is the least. This is done by examining the squared Euclidean distances of the cases from the cluster means. The method used in checking for the minimum variance in forming clusters is known as Ward's minimum variance method. This method tends to join clusters having a small number of observations and is biased towards producing clusters with the same shape and with nearly equal numbers of observations. The variance method is very sensitive to outliers. If “a” to “g” represent seven clusters, then cluster formation using Ward's method can be shown graphically as in Fig. 10.5.
Linkage Methods
In agglomerative clustering, clusters are formed on the basis of three different types of linkage methods, described below (a code sketch of all three, along with Ward's method, follows the list).
1. Single Linkage Method: In this method, clusters are formed on the basis of
minimum distance between the closest members of the two clusters. This is also
known as nearest neighbor rule. This kind of linkage can be seen in Fig. 10.6.
2. Complete Linkage Method: In this method, clusters are formed on the basis of
minimum distance between the farthest members of the two clusters. This is also
known as furthest neighbor rule. Complete linkage can be shown by Fig. 10.7.
3. Average Linkage Method: This procedure uses the minimum average distance between all pairs of objects (in each pair, one member must be from a different cluster) as the criterion to form the next higher cluster. Average linkage can be shown by Fig. 10.8.
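All three linkage criteria, as well as Ward's minimum variance method, are available in scipy's hierarchical clustering routines; the sketch below, on randomly generated hypothetical data, shows where the choice is made.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))          # seven hypothetical cases on three variables

Z_single   = linkage(X, method="single")     # nearest neighbour rule
Z_complete = linkage(X, method="complete")   # furthest neighbour rule
Z_average  = linkage(X, method="average")    # average linkage
Z_ward     = linkage(X, method="ward")       # Ward's minimum variance method

# Cut any of the resulting trees into, say, three clusters:
print(fcluster(Z_ward, t=3, criterion="maxclust"))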
Divisive Clustering
is in thousands. Because of this, so much processing is required that, even with a modern computer, one needs to wait for some time to get the results. On the other hand, the k-means clustering method does not require computation of all possible distances.
Nonhierarchical clustering has three different approaches, that is, the sequential threshold method, the parallel threshold method, and the optimizing partitioning method.
The sequential threshold method is based on finding a cluster center and then
grouping all objects that are within a specified threshold distance from the center.
Here, one cluster is created at a time.
In parallel threshold method, several cluster centers are determined simulta-
neously and then objects are grouped depending upon the specified threshold
distance from these centers. These threshold distances may be adjusted to include
more or fewer objects in the clusters.
The optimizing partitioning method is similar to the other two nonhierarchical methods except that it allows for reassignment of objects to another cluster depending on some
optimizing criterion. In this method, a nonhierarchical procedure is run first, and
then objects are reassigned so as to optimize an overall criterion.
Precautions: K-means clustering is very sensitive to outliers because they will usually be selected as initial cluster centers. If outliers exist in the data, they will form clusters with a small number of cases. Therefore, it is important for the researcher to screen the data for outliers and remove them before starting the cluster analysis.
Two-Step Cluster
Pre-clusters are the clusters of original cases/objects that are used in place of raw
data to reduce the size of the distance matrix between all possible pairs of cases.
After completing the pre-clustering, the cases in the same pre-cluster are treated as
a single entity. Thus, the size of the distance matrix depends upon the number of
pre-clusters instead of cases. Hierarchical clustering method is used on these pre-
clusters instead of the original cases.
In the second step, the standard hierarchical clustering algorithm is used on the pre-
clusters for obtaining the cluster solution. The agglomerative clustering algorithm
may be used to produce a range of cluster solutions. To determine which number of
clusters is the best, each of these cluster solutions may be compared using either
Schwarz’s Bayesian criterion (BIC) or the Akaike information criterion (AIC) as
the clustering criterion. The readers are advised to read about these procedures from
some other texts.
Cluster analysis is normally used for data measured on an interval scale and rarely for ratio data. In cluster analysis, distances are computed between the pairs of cases on each of the variables, and if the units of measurement of these variables differ, then one must worry about the impact of the units on these distances. Variables having larger values will have a larger impact on the distance compared to variables that have smaller values. In that case, one must standardize the variables to a mean of 0 and a standard deviation of 1.
If the variables are measured on an interval scale and the range of the scale is the same for each variable, then standardization of the variables is not required; but if the range of the measurement scale differs from variable to variable, or if the variables are measured on a ratio scale, then one must standardize the variables in some way so that they all contribute equally to the distance or similarity between cases, as sketched below.
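A minimal sketch of such a standardization (the data matrix below is hypothetical):

import numpy as np
from scipy.stats import zscore

# Hypothetical raw data: age (years), income (rupees), education (years).
X = np.array([[25, 32000, 12],
              [41, 58000, 16],
              [35, 40000, 14]], dtype=float)

Xz = zscore(X, axis=0, ddof=1)                    # column-wise z-scores
print(Xz.mean(axis=0), Xz.std(axis=0, ddof=1))    # roughly 0 and exactly 1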
Icicle Plots
It is the plotting of cases joining to form the clusters at each stage. You can see in
Fig. 10.10 what is happening at each step of the cluster analysis when average
linkage between groups is used to link the clusters. The figure is called an icicle plot
because the columns representing cases look like icicles hanging from eaves. Each column represents one of the cases/objects you are clustering. Each row represents a cluster solution with a different number of clusters.
If you look at the figure from the bottom up, the last row (not shown) is the first step of the analysis, where each of the cases is a cluster of its own. The number of clusters at that point is 6. The five-cluster solution arises when the cases “a” and “b” are joined into a
cluster. It is so because they had the smallest distance of all pairs. The four-cluster
solution results from the merging of the cases “d” and “e” into a cluster. The three-
cluster solution is the result of combining the cases “c” with “de.” Going similarly,
for the one cluster solution, all of the cases are combined into a single cluster.
Fig. 10.10 Icicle plot for six cases (a–f): each row corresponds to a cluster solution with 1–5 clusters, and the X's in a row show which cases are joined in that solution
Remarks
1. When pairs of cases are tied for the smallest distance to form a cluster, an arbitrary selection is made; therefore, if the cases are sorted differently, you might get a different cluster solution. But that should not bother you, as there is no right or wrong answer to a cluster analysis; many groupings are equally viable.
2. In case of large number of cases in cluster analysis, icicle plot can be developed
by taking cases as rows. You must specify the “Horizontal” on the Cluster Plots
dialog box.
The Dendrogram
The dendrogram is the graphical display of the distances at which clusters are combined. The dendrogram can be seen in Fig. 10.22 and is read from left to right. Vertical
lines show joined clusters. The position of the line on the scale represents the distance at
which clusters are joined. The observed distances are rescaled to fall into the range of
1–25, and hence you do not see the actual distances; however, the ratio of the rescaled
distances within the dendrogram is the same as the ratio of the original distances. In fact,
the dendrogram is the graphical representation of the information provided by the
agglomeration schedule.
Consider the data of four employees on three different parameters, age, income, and qualification, as shown in Table 10.4. Let us see how the proximity matrix is developed on these data.
The proximity matrix is the arrangement of the squared Euclidean distances between all pairs of cases in rows and columns. The squared Euclidean distances are computed by adding the squared differences between the two employees on each of the three variables.
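The construction can be sketched in a few lines of code. Since the Table 10.4 values are not reproduced here, the four profiles below are hypothetical:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four hypothetical standardized profiles (age, income, qualification).
X = np.array([[0.5, 0.4, 1.0],
              [0.3, 0.7, 0.5],
              [0.9, 0.2, 0.8],
              [0.1, 0.6, 0.3]])

prox = squareform(pdist(X, metric="sqeuclidean"))
print(np.round(prox, 2))    # symmetric matrix with zeros on the diagonal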
In using cluster analysis, one needs to follow different steps to get the final results.
You may not understand all the steps at this moment but use it as a blueprint of the
analysis and proceed further, and I am sure by the time you finish reading the entire
chapter, you will have a fairly good idea about its application. Once you understand
different concepts of cluster analysis discussed in this chapter, you will be taken to a
solved example by using SPSS, and this will give you practical knowledge of using
this analysis to your data set with SPSS. Below are the steps which are used in
cluster analysis:
1. Identify the variables on which subjects/objects need to be clustered.
2. Select the distance measure for computing distance between cases. One can
choose any of the distance measures like squared Euclidean distance, Manhattan
distance, Chebyshev distance, or Mahalanobis (or correlation) distance.
3. Decide the clustering procedure to be used from the wide variety of clustering
procedure available in the hierarchical or nonhierarchical clustering sections.
4. Decide on the number of clusters to be formed. The sole criterion in deciding the number of clusters is that one should be able to explain these clusters on the basis of their characteristics.
5. Map and interpret clusters using illustrative techniques like perceptual maps,
icicle plots, and dendrograms and draw conclusions.
6. Assess the reliability and validity of the obtained clusters by using any one or more of the following methods (a sketch of one such stability check follows the list):
(i) Apply the cluster analysis on the same data by using different distance
measure.
(ii) Apply the cluster analysis on the same data by using different clustering
technique.
(iii) Split the same data randomly into two halves and apply the cluster analysis
separately on each part.
(iv) Repeat cluster analysis on same data several times by deleting one variable
each time.
(v) Repeat cluster analysis several times, using a different order each time.
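These checks can be made quantitative by comparing two cluster solutions with an agreement index. The sketch below (hypothetical data; K-means is used purely for illustration) computes the adjusted Rand index, which equals 1 for identical groupings.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))         # hypothetical cases on five variables

# Two solutions obtained under different starting conditions; check (v)
# above can be mimicked by shuffling the rows of X instead.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

print(adjusted_rand_score(labels_a, labels_b))   # close to 1 => stable clustering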
Cluster analysis can be applied to a wide variety of research problems in the areas of management, psychology, medicine, pharmaceuticals, social sciences, etc. Following are the situations where this technique can be applied:
1. Cluster analysis can be used to classify the consumer population into market
segments for understanding the requirements of potential customers in different
groups. Such studies may be useful in segmenting the market, identifying the
target market, product positioning, and developing new products.
2. In a big departmental store, all inventories may be clustered into different groups for placing them in the same location or giving them similar codes, for enhancing sales and easy monitoring of the products.
3. In the field of psychiatry, cluster analysis may provide clusters of symptoms, such as those of paranoia and schizophrenia, which is essential for successful therapy.
4. In educational research, all schools of a district can be classified into different
clusters on the basis of the parameters like number of children, teacher’s
strength, total grant, school area, and location to develop and implement the
programs and policies effectively for each of these groups separately.
5. In the area of mass communication, television channels may be classified into
homogeneous groups based on certain characteristics like TRP, number of
programs televised per week, number of artists engaged, coverage time,
programs in different sectors, advertisements received, and turnover. Different
policies may be developed for different groups of channels by the regulatory
body.
6. In medical research, cluster analysis may provide the solution for clustering of
diseases so that new drugs may be developed for different clusters of diseases.
This analysis may also be useful in clustering the patients on the basis of
symptoms for easy monitoring of drug therapy on mass scale.
Stage 1
1. The first step in cluster analysis is to apply the hierarchical cluster analysis in
SPSS to find the agglomerative schedule and proximity matrix for the data
obtained on each of the variables for all the cases. To form clusters, you need
Stage 2
3. The second step in cluster analysis is to apply the K-means cluster analysis in SPSS. The process is not stopped at the first stage because the K-means analysis provides much more stable clusters, owing to the iterative procedure involved in it, in comparison to the single-pass hierarchical methods. The K-means analysis provides four outputs, namely, initial cluster centers, case listing of cluster membership, final cluster centers, and analysis of variance for all the variables in each of the clusters.
4. The case listing of cluster membership is used to describe which case belongs to which of the clusters.
5. The final cluster centers are obtained by doing iteration on the initial cluster
solutions. It provides the final solution. On the basis of final cluster centers, the
characteristics of different clusters are explained.
6. Finally, the ANOVA table describes which of the variables differ significantly across the identified clusters in the problem.
The detailed discussion of the above-mentioned outputs in cluster analysis shall
be done by means of the results obtained in the solved example using SPSS.
Example 10.1 A media company wants to cluster its target audience in terms of
their preferences toward quality, contents, and features of FM radio stations.
Twenty randomly chosen students from a university served as the sample for the study. The below-mentioned 14 questions were finally selected by the research team after content and item analysis; these measured many of the variables of interest. The respondents were asked to mark their responses on a 5-point scale where 1 represented complete disagreement and 5 complete agreement. The responses of the respondents on all the 14 questions that measured different dimensions of FM stations are shown in Table 10.6.
Table 10.6 Response of students on the questions related to quality, contents, and features of FM
radio stations
SN Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14
1 5 4 2 5 1 3 5 2 1 4 3 4 3 4
2 1 4 4 2 5 2 3 5 2 3 4 2 2 3
3 2 2 3 3 2 4 2 3 4 4 2 2 4 4
4 5 3 3 4 4 4 5 3 2 5 2 5 3 5
5 4 1 2 4 1 1 5 4 2 4 3 4 2 4
6 4 2 3 4 2 5 2 1 5 2 1 3 5 3
7 2 3 2 2 3 4 3 4 4 3 4 3 5 2
8 5 2 2 5 2 4 5 2 2 4 1 5 1 4
9 2 4 4 2 5 3 4 4 3 3 5 2 2 2
10 3 4 4 2 4 3 2 5 2 4 3 3 2 2
11 4 5 4 3 5 4 4 4 1 1 5 4 3 2
12 2 4 5 1 4 2 2 4 2 4 3 4 4 3
13 2 5 4 2 5 3 3 5 3 2 5 3 3 2
14 1 5 4 5 4 3 2 5 3 3 5 4 4 2
15 2 5 5 3 4 2 3 4 4 3 4 3 3 3
16 5 3 2 4 5 2 4 4 3 5 2 5 2 5
17 5 2 3 5 2 3 5 2 4 5 3 4 4 4
18 5 2 2 2 2 4 4 3 4 2 2 2 4 1
19 4 3 3 3 4 5 2 3 5 4 3 2 5 2
20 3 4 4 1 2 4 4 2 4 2 3 4 4 3
characteristics and preparing data file, and, therefore, these steps shall be skipped in
this chapter. In case of any clarification, readers are advised to go through Chap. 1
for detailed guidelines for preparing the data file.
The steps involved in using SPSS for cluster analysis shall be discussed first, and
then the output obtained from the analysis shall be shown and explained. The whole
scheme of cluster analysis with SPSS is as follows:
Stage 1
First of all, the hierarchical cluster analysis shall be done by using the sequence of
SPSS commands. The following outputs would be generated in this analysis:
(a) Proximity matrix of distances (similarity) between all the cases/objects
(b) Agglomerative schedule
(c) Icicle plot
(d) Dendrogram
On the basis of fusion coefficients in the agglomerative schedule, the number of
clusters (say K) is decided.
Stage 2
After deciding the number of clusters in the hierarchical cluster analysis, the data are again subjected to K-means cluster analysis in SPSS. Using this analysis, the following outputs would be generated (a Python sketch of both stages follows the list):
(a) Initial cluster centers
(b) Case listing of cluster membership
(c) Final cluster centers
(d) Analysis of variance for comparing the clusters on each of the variables
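Before turning to the SPSS steps, here is a compact sketch (not a substitute for the SPSS outputs discussed below) of the same two-stage scheme in Python, using the Table 10.6 responses: average linkage on squared Euclidean distances in stage 1, followed by K-means with the three clusters suggested by stage 1.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Responses of the 20 students on Q1-Q14 (Table 10.6).
X = np.array([
    [5, 4, 2, 5, 1, 3, 5, 2, 1, 4, 3, 4, 3, 4],
    [1, 4, 4, 2, 5, 2, 3, 5, 2, 3, 4, 2, 2, 3],
    [2, 2, 3, 3, 2, 4, 2, 3, 4, 4, 2, 2, 4, 4],
    [5, 3, 3, 4, 4, 4, 5, 3, 2, 5, 2, 5, 3, 5],
    [4, 1, 2, 4, 1, 1, 5, 4, 2, 4, 3, 4, 2, 4],
    [4, 2, 3, 4, 2, 5, 2, 1, 5, 2, 1, 3, 5, 3],
    [2, 3, 2, 2, 3, 4, 3, 4, 4, 3, 4, 3, 5, 2],
    [5, 2, 2, 5, 2, 4, 5, 2, 2, 4, 1, 5, 1, 4],
    [2, 4, 4, 2, 5, 3, 4, 4, 3, 3, 5, 2, 2, 2],
    [3, 4, 4, 2, 4, 3, 2, 5, 2, 4, 3, 3, 2, 2],
    [4, 5, 4, 3, 5, 4, 4, 4, 1, 1, 5, 4, 3, 2],
    [2, 4, 5, 1, 4, 2, 2, 4, 2, 4, 3, 4, 4, 3],
    [2, 5, 4, 2, 5, 3, 3, 5, 3, 2, 5, 3, 3, 2],
    [1, 5, 4, 5, 4, 3, 2, 5, 3, 3, 5, 4, 4, 2],
    [2, 5, 5, 3, 4, 2, 3, 4, 4, 3, 4, 3, 3, 3],
    [5, 3, 2, 4, 5, 2, 4, 4, 3, 5, 2, 5, 2, 5],
    [5, 2, 3, 5, 2, 3, 5, 2, 4, 5, 3, 4, 4, 4],
    [5, 2, 2, 2, 2, 4, 4, 3, 4, 2, 2, 2, 4, 1],
    [4, 3, 3, 3, 4, 5, 2, 3, 5, 4, 3, 2, 5, 2],
    [3, 4, 4, 1, 2, 4, 4, 2, 4, 2, 3, 4, 4, 3],
])

# Stage 1: hierarchical clustering (average linkage on squared Euclidean
# distances). The merge order should parallel the agglomeration schedule,
# though SPSS may scale the fusion coefficients differently.
Z = linkage(pdist(X, metric="sqeuclidean"), method="average")
print(Z[:, 2])                       # fusion coefficients, smallest merge first

# Stage 2: K-means with the three clusters suggested by stage 1.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # case listing of cluster membership
print(km.cluster_centers_)           # final cluster centers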
(a) Data file: After defining the variable names and their labels, prepare the data file for the responses of the students on all the variables shown in Table 10.6. The data file shall look like the one shown in Fig. 10.11.
(b) Initiating command for hierarchical cluster analysis: After preparing the data
file, start the hierarchical analysis in SPSS by the following command sequence
(Fig. 10.12):
Analyze → Classify → Hierarchical Cluster
Fig. 10.11 Showing data file for all the variables in SPSS
(i) Selecting variables for analysis: After clicking the Hierarchical Cluster
option, you will be taken to the next screen for selecting variables. Select the
variables as follows:
– Select all the variables and bring them in the “Variable(s)” section.
– Ensure that in the “Display” section, the options “Statistics” and “Plots” are
checked. These are selected by default.
– In case a variable denoting the label of each case was defined in the variable view while preparing the data file, bring that variable under the section “Label Cases by.” While defining the variable for the label in the variable view, define its variable type as String under the column heading Type. However, for the time being, you can skip the process of defining the variable for the label and leave the option “Label Cases by” blank.
The screen will look like as shown in Fig. 10.13.
(ii) Selecting options for computation: After selecting the variables, you need to
define different options for generating all the four outputs of hierarchical
analysis. Take the following steps:
– Click the tag Statistics in the screen shown in Fig. 10.13 and take the
following steps:
– Ensure that the “Agglomerative schedule” is checked. By default, it is
checked.
Fig. 10.14 Screen showing option for generating agglomerative schedule and proximity matrix
Fig. 10.16 Selecting options for cluster method and distance measure criteria
– Click Continue. You will be taken back to the screen shown in Fig. 10.13.
The screen will look like as shown in Fig. 10.16.
– Click OK
(c) Getting the output: Clicking the option OK shall generate a lot of outputs in the output window. The four outputs that would be selected are the Proximity matrix, Agglomeration schedule, Icicle plot, and Dendrogram. These outputs have been shown in Tables 10.7 and 10.8 and in Figs. 10.21 and 10.22.
Stage 1 was the explorative process where number of initial clusters was identified.
These initial clusters were identified on the basis of fusion coefficients in the
agglomerative schedule. After deciding the number of clusters, apply the K-
means cluster analysis in stage 2. In stage 1, three clusters were identified on the
basis of the agglomeration schedule in Table 10.8 (for details, see Interpretation of
Findings). This shall be used to find the final solution in the K-means cluster
analysis. The data file developed for the hierarchical analysis is also used for the
K-means cluster analysis. Follow these steps in stage 2.
(i) Initiating command for K-means cluster analysis: Start the K-means analysis by using the following command sequence (Fig. 10.17):
Analyze → Classify → K-Means Cluster
(ii) Selecting variables for analysis: After clicking the K-Means Cluster Analysis
option, you will be taken to the next screen for selecting variables. Select the
variables as follows:
– Select all the variables and bring them in the “Variable(s)” section.
– Write number of clusters as 3. This is so because only three clusters were
identified from the hierarchical analysis.
– Click the option Iterate and ensure that the maximum number of iterations is written as 10. In fact, this is set by default. If you want to allow more than 10 iterations, the desired number may be mentioned here.
– Click Continue.
Fig. 10.20 Screen showing options for cluster information and ANOVA
Interpretations of Findings
Stage 1: The agglomerative cluster analysis done in stage 1 provided the outputs
shown in Tables 10.7 and 10.8 and in Figs. 10.21 and 10.22. The agglomerative
analysis is explorative in nature. Its primary purpose is to identify the initial cluster
solution. Therefore, one should take all possible parameters to identify the clusters
so that important parameters are not left out. We shall now discuss the results
generated in the agglomerative analysis in stage 1.
Proximity Matrix: To Know How Alike (or Different) the Cases Are
Table 10.7 is a proximity matrix which shows distances between the cases. One can
choose any distance criterion like squared Euclidean distance, Manhattan distance,
Chebyshev distance, Mahalanobis (or correlation) distance, or Pearson correlation
distance. In this example, the squared Euclidean distance was chosen as a measure
of distance. The minimum distance exists between the 9th and 13th cases which is
6.00, whereas the maximum distance is observed between the 8th and 13th cases
which is 87.00. The minimum distance means that these two cases would combine
at the very first instance. This can be seen from Table 10.8 where 9th and 13th cases
are combined into a single cluster in the very first stage. Similarly, the 8th and 13th
cases are in the extreme clusters which can be seen in the dendrogram shown in
Fig. 10.22.
Table 10.8 is an agglomerative schedule which shows how and when the clusters
are combined. The agglomerative schedule is used to decide the number of clusters
present in the data and one should identify the number of clusters by using the
column labeled “Coefficients” in this table. These coefficients are also known as
fusion coefficients. The values under this column are the distance (or similarity)
statistic used to form the cluster. From these values, you get an idea as to how the
clusters have been combined. In case of using dissimilarity measures, small
coefficients indicate that those fairly homogenous clusters are being attached to
each other. On the other hand, large coefficients show that the dissimilar clusters are
being combined. In using similarity measures, the reverse is true, that is, large
coefficients indicate that the homogeneous clusters are being attached to each other,
whereas small coefficients reveal that dissimilar clusters are being combined.
The value of fusion coefficient depends on the clustering method and the
distance measure you choose. These coefficients help you decide how many
clusters you need to represent the data. The process of cluster formation is stopped
when the increase (for distance measures) or decrease (for similarity measures) in
Table 10.7 Proximity matrix
Squared Euclidean distance
Case 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 .000 68.000 45.000 19.000 21.000 52.000 52.000 16.000 61.000 51.000 49.000 60.000 69.000 62.000 54.000 33.000 17.000 45.000 58.000 42.000
2 68.000 .000 39.000 57.000 55.000 82.000 30.000 80.000 7.000 11.000 29.000 16.000 9.000 24.000 12.000 49.000 71.000 59.000 48.000 42.000
3 45.000 39.000 .000 40.000 40.000 19.000 17.000 47.000 40.000 30.000 64.000 31.000 46.000 43.000 31.000 46.000 30.000 28.000 17.000 25.000
4 19.000 57.000 40.000 .000 30.000 51.000 49.000 15.000 52.000 40.000 46.000 45.000 60.000 61.000 47.000 10.000 16.000 50.000 45.000 41.000
5 21.000 55.000 40.000 30.000 .000 65.000 49.000 23.000 56.000 46.000 66.000 55.000 68.000 67.000 53.000 28.000 22.000 44.000 67.000 51.000
6 52.000 82.000 19.000 51.000 65.000 .000 34.000 48.000 71.000 61.000 73.000 62.000 73.000 68.000 56.000 65.000 33.000 23.000 20.000 28.000
7 52.000 30.000 17.000 49.000 49.000 34.000 .000 66.000 23.000 25.000 37.000 26.000 21.000 26.000 24.000 53.000 41.000 21.000 14.000 18.000
8 16.000 80.000 47.000 15.000 23.000 48.000 66.000 .000 73.000 57.000 67.000 74.000 87.000 84.000 70.000 25.000 21.000 47.000 64.000 52.000
9 61.000 7.000 40.000 52.000 56.000 71.000 23.000 73.000 .000 14.000 20.000 23.000 6.000 25.000 11.000 50.000 58.000 44.000 35.000 31.000
10 51.000 11.000 30.000 40.000 46.000 61.000 25.000 57.000 14.000 .000 26.000 11.000 14.000 25.000 15.000 36.000 54.000 40.000 31.000 33.000
11 49.000 29.000 64.000 46.000 66.000 73.000 37.000 67.000 20.000 26.000 .000 35.000 14.000 29.000 27.000 54.000 64.000 50.000 49.000 35.000
12 60.000 16.000 31.000 45.000 55.000 62.000 26.000 74.000 23.000 11.000 35.000 .000 19.000 28.000 14.000 45.000 57.000 53.000 38.000 26.000
13 69.000 9.000 46.000 60.000 68.000 73.000 21.000 87.000 6.000 14.000 14.000 19.000 .000 15.000 9.000 56.000 70.000 50.000 37.000 31.000
14 62.000 24.000 43.000 61.000 67.000 68.000 26.000 84.000 25.000 25.000 29.000 28.000 15.000 .000 14.000 59.000 61.000 67.000 40.000 46.000
15 54.000 12.000 31.000 47.000 53.000 56.000 24.000 70.000 11.000 15.000 27.000 14.000 9.000 14.000 .000 43.000 47.000 49.000 32.000 24.000
16 33.000 49.000 46.000 10.000 28.000 65.000 53.000 25.000 50.000 36.000 54.000 45.000 56.000 59.000 43.000 .000 26.000 58.000 51.000 55.000
17 17.000 71.000 30.000 16.000 22.000 33.000 41.000 21.000 58.000 54.000 64.000 57.000 70.000 61.000 47.000 26.000 .000 36.000 35.000 37.000
18 45.000 59.000 28.000 50.000 44.000 23.000 21.000 47.000 44.000 40.000 50.000 53.000 50.000 67.000 49.000 58.000 36.000 .000 21.000 23.000
19 58.000 48.000 17.000 45.000 67.000 20.000 14.000 64.000 35.000 31.000 49.000 38.000 37.000 40.000 32.000 51.000 35.000 21.000 .000 28.000
20 42.000 42.000 25.000 41.000 51.000 28.000 18.000 52.000 31.000 33.000 35.000 26.000 31.000 46.000 24.000 55.000 37.000 23.000 28.000 .000
This is a dissimilarity matrix
the coefficients between the two adjacent steps is large. In this example, the process
can be stopped at the three-cluster solution, after stage 17. Let us see how this is done.
We should look for the coefficients from the last row upward because we want
the lowest possible number of clusters due to economy and its interpretability.
Stage 20 represents a one cluster solution where all the cases are combined into one
cluster, and, therefore, it is not shown in Table 10.8. The largest difference
(389.800–261.524) exists in the coefficients between stages 18 and 19, which
means we have to stop the process of cluster formation after stage 19; this would
result in only two-cluster solution. However, we may not be interested to represent
the data by two clusters only; therefore, we will look for the next larger difference
of (261.524–172.417) which is equal to 89.107 (between stage 18, the three-cluster
solution, and stage 17, the four-cluster solution). The next one after that is
(172.417–153.750), only 18.667, between stages 17 and 16. Thereafter, the differ-
ence keeps decreasing. So we decide to stop the cluster formation at stage 18 which
is a three-cluster solution.
Thus, in general, the strategy is to first identify the largest difference between the coefficients of adjacent stages and take the stage with the lower coefficient as the cluster solution. However, it is up to the researcher to decide the number of clusters depending upon their interpretability.
its interpretability. You can see from the dendrogram shown in Fig. 10.22 that three
clusters are clearly visible in this case.
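For readers who want to reproduce this stopping rule outside SPSS, a minimal sketch using SciPy is given below. The random responses standing in for the 20 cases and the choice of average (between-groups) linkage are illustrative assumptions, not the exact SPSS settings of this example.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative stand-in for the 20 cases x 14 survey items of this example.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(20, 14)).astype(float)

# Average (between-groups) linkage; column 2 of Z holds the agglomeration
# coefficients, one per stage, like the Coefficients column of Table 10.8.
Z = linkage(X, method="average", metric="euclidean")
coeffs = Z[:, 2]

# Differences between adjacent stages, read upward from the last stage:
# stopping just before the largest jump gives the suggested solution.
jumps = np.diff(coeffs)
stop_stage = int(np.argmax(jumps)) + 1        # last merge to keep (1-indexed)
print(f"stop after stage {stop_stage}: {len(X) - stop_stage}-cluster solution")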
The agglomeration schedule starts with the pair of cases that has the smallest distance, as shown by the icicle plot in Fig. 10.21. A cluster is then grown by adding cases, and the lower of the merged case numbers becomes the number of the newly formed cluster. For example, if a cluster is formed by merging cases 3 and 6, it is known as cluster 3; if clusters are formed by merging cases 3 and 1, the result is known as cluster 1.
The columns labeled “Stage Cluster First Appears” show the step at which each of the two clusters being joined first appeared. For example, at stage 9, when clusters 1 and 17 are combined, the table tells you that cluster 1 was first formed at stage 7 and cluster 17 is a single case, and that the resulting cluster (known as cluster 1) will see action again at stage 11 (under the column “Next stage”). If the number of cases is small, the icicle plot summarizes the step-by-step clustering better than the agglomeration schedule.
Figure 10.22 shows the dendrogram, which plots the cluster distances. It provides a visual representation of the distance at which clusters are combined. We read the dendrogram from left to right. A vertical line represents joined clusters, and the position of the line on the scale shows the distance at which they are joined. The computed distances are rescaled to the range 1–25, so the actual distances cannot be read off the plot; however, the ratios of the rescaled distances within the dendrogram are the same as the ratios of the original distances. The first vertical line, corresponding to the smallest rescaled distance, is for cases 9 and 13. The next vertical line, at the next smallest distance, is for cluster 9 and case 2. It can be seen from Table 10.8 that the lowest coefficient is 3.000, for cases 9 and 13, and the next smallest, 7.333, is for cluster 9 and case 2. Thus, what you see in this plot is what you already know from the agglomeration schedule.
Remark: While reading the dendrogram, one should try to determine at what stage the distances between the clusters being combined become large. Look for large distances between sequential vertical lines; in this case, the large distance between the vertical lines suggests a three-cluster solution.
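The dendrogram itself can be drawn from the same kind of linkage matrix as in the previous sketch; note that SciPy plots the actual joining distances rather than the 1–25 rescaled ones used by SPSS, but the ratios tell the same story.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(20, 14)).astype(float)   # stand-in data, as before
Z = linkage(X, method="average", metric="euclidean")

plt.figure(figsize=(6, 4))
dendrogram(Z, orientation="right", labels=[f"case {i + 1}" for i in range(20)])
plt.xlabel("distance at which clusters are joined")
plt.tight_layout()
plt.show()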
Stage 2 With the help of hierarchical cluster analysis, the number of clusters was fixed at three. After this, K-means cluster analysis was applied to obtain the final solution of the cluster means. SPSS generated the outputs in the form of Tables 10.9, 10.10, 10.11, 10.12, 10.13, and 10.14. We shall now explain these outputs and discuss the cluster characteristics.
The first step in K-means clustering is to find the K centers. This is done iteratively. Here the value of K is three because three clusters were decided on the basis of the agglomeration schedule. We start with an initial set of centers and keep modifying them until the change between two iterations is small enough. Although one can supply guessed centers as initial starting points, it is advisable to let SPSS find K cases that are well separated and use their values as the initial cluster centers. In our example, Table 10.9 shows the initial centers.
Once the initial cluster centers are selected by SPSS, each case is assigned to the nearest cluster according to its distance from the cluster centers. After all the cases have been assigned, the cluster centers are recomputed from their member cases, and all the cases are reassigned using the recomputed centers. This process continues until no cluster center changes appreciably. Since the number of iterations is 10 by default in SPSS (see Fig. 10.18), this process of assigning cases and recomputing centers is repeated at most ten times. In this example, you can see from Table 10.10 that three iterations were sufficient.
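Stage 2 can likewise be scripted; the sketch below uses scikit-learn's KMeans on the same illustrative stand-in data, with the iteration limit of 10 mirroring the SPSS default mentioned above.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(20, 14)).astype(float)   # stand-in responses

# k = 3, as decided from the agglomeration schedule; max_iter=10 mirrors SPSS.
km = KMeans(n_clusters=3, max_iter=10, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # analogue of the "Final Cluster Centers" table
print(km.labels_ + 1)        # analogue of the cluster-membership table
print(km.n_iter_)            # iterations actually used (three in the book's run)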
Table 10.11 shows the final cluster centers after the iterations stop and the cases are reassigned to the clusters. Using these final cluster centers, the cluster characteristics are described.
Each question in this example is responded to on a 1–5 scale, where 5 stands for total agreement and 1 for total disagreement. Thus, any score in Table 10.11 above 2.5 indicates agreement with the statement, and a score below 2.5 reflects disagreement. Owing to these criteria, the characteristics of the three clusters of cases were as follows (refer to the question details in Example 10.1):
the question details in Example 10.1):
Cluster 1
FM listeners belonging to this cluster were of the strong opinion that channels
should provide more old Hindi songs (Q.1) and provide some incentives to the
listeners (Q.4). They strongly feel that the humor and ability to deliver interesting
programs make RJs more popular (Q.7). The channel should play 24 × 7 (Q.10)
and must air information regarding educational opportunity available in the city
(Q.12), and the RJ must speak in local dialect (Q.14).
Further, listeners belonging to this cluster feel that FM channels should air more
entertaining programs (Q.6) and should provide more opportunity to listeners to
talk to the celebrities (Q.8).
Cluster 2
Listeners belonging to this cluster strongly felt that FM channels must provide
solutions to personal problems (Q.2), RJs presentation skill to be important for the
channels (Q.3), channels to provide more often the latest songs (Q.5), channels to
arrange more dialogues between celebrities and their audience (Q.8), and should air
information about sports other than cricket also (Q.11).
Further, listeners in this cluster also felt that FM channels should air more entertaining programs (Q.6) and that humor and the ability to deliver interesting programs make RJs more popular (Q.7). The channels must play 24 × 7 (Q.10) and should provide information regarding educational opportunities (Q.12) and shopping offers (Q.13) available in the city.
Cluster 3
Listeners in this cluster were strongly of the view that the FM channels must contain
more entertaining programs (Q.6), RJs voice must be very clear and melodious
(Q.9), and channels should provide information regarding shopping offers available
in the city (Q.13).
Further, listeners in this cluster were also of the view that channels should air more old Hindi songs (Q.1) and provide solutions to personal problems (Q.2), and they believe RJs to be the key factor in popularizing the FM channels (Q.3). They felt that humorous RJs make programs more interesting (Q.7), that channels should provide more opportunities for listeners to talk to celebrities (Q.8), and that channels should operate 24 × 7 (Q.10) while also airing information regarding educational opportunities available in the city (Q.12).
Table 10.12 shows ANOVA for the data on all the 14 variables. The F-ratios
computed in the table describe the differences between the clusters. F-ratio is
significant at 5% level if the significance level (p-value) associated with it is less
than .05. Thus, it can be seen in Table 10.12 that F-ratios for all the variables are
significant at 5% level as their corresponding p-values are less than .05.
Remark
1. There is divided opinion on the issue of using ANOVA for comparing the clusters on each of the parameters. The footnote in Table 10.12 warns that the
observed significance levels should not be interpreted in the usual fashion
because the clusters have been selected to maximize the differences between
clusters.
Cluster Membership
Table 10.13 shows the cluster membership of the cases. You can see that six cases
belong to cluster 1, eight cases to cluster 2, and six cases to cluster 3.
Table 10.14 is a summary of Table 10.13. Clusters with very few cases are undesirable unless those cases are really different from the remaining ones.
Exercise
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick the one that you consider closest to the correct answer.
Answer Q.1 to Q.4 on the basis of the following information. In a cluster
analysis, if the data on two cases are as follows:
Case 1 14 8 10
Case 2 10 11 12
Learning Objectives
After completing this chapter, you should be able to do the following:
• Understand factor analysis and its applications.
• Learn the difference between exploratory and confirmatory factor analysis.
• Know the use of factor analysis in developing test batteries.
• Interpret different terms involved in factor analysis.
• Explain the situations where factor analysis can be used.
• Know the procedure of retaining the factors and identifying the variables in it.
• Explain the steps involved in factor analysis.
• Understand the steps involved in using SPSS for factor analysis.
• Discuss the outputs obtained in factor analysis.
• Learn to write the results of factor analysis in standard format.
Introduction
Factor Loading
Factor loading can be defined as the correlation coefficient between a variable and a factor. Just like Pearson's r, the squared factor loading of a variable indicates the percentage of variability in that variable explained by the factor. As a rule of thumb, a factor loading of 0.7 or higher indicates that the factor extracts sufficient variance from that variable. The percentage of variance in all the variables accounted for by a factor can be computed by dividing the sum of the squared factor loadings for that factor by the number of variables and multiplying by 100.
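As a quick, hypothetical illustration of this computation:

import numpy as np

# Hypothetical 5-variable x 2-factor loading matrix.
L = np.array([[0.80, 0.10],
              [0.75, 0.20],
              [0.10, 0.85],
              [0.20, 0.70],
              [0.65, 0.30]])

# Percent of total variance explained by each factor: the sum of the squared
# loadings in its column, divided by the number of variables, times 100.
pct_var = 100 * (L ** 2).sum(axis=0) / L.shape[0]
print(pct_var)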
Communality
The communality can be defined as the sum of the squared factor loadings of a variable across all the factors. It is the variance in that variable accounted for by all the factors together. The communality of a variable is represented by h². It measures the percentage of variance in a given variable explained by all the factors jointly and may be regarded as the reliability of the variable. A low communality indicates that the variable is not useful in explaining the characteristics of the group and that the factor model is not working well for that variable. Thus, variables whose communalities are low should be removed from the model, as such variables are not related to the others. Any variable whose communality is <.4 should usually be dropped. However, communalities must be interpreted in relation to the interpretability of the factors. For instance, a communality of .80 may seem high, but it is meaningless unless the factor on which the variable loads is interpretable (though it usually will be). On the other hand, a communality of .25 may look low but becomes meaningful if the variable helps define the factor well.
Hence, it is not the value of the communality itself that matters most, but the variable's role in the interpretation of the factor. Nevertheless, a variable with a very high communality usually explains the factor well. If the value of a communality exceeds 1, something is wrong with the solution: either the sample is too small or the researcher has extracted too many or too few factors.
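Continuing the hypothetical loading matrix of the previous sketch (with the last row altered to give a low communality), the h² values follow directly, and the rule of thumb flags any variable with h² < .4:

import numpy as np

L = np.array([[0.80, 0.10],
              [0.75, 0.20],
              [0.10, 0.85],
              [0.20, 0.70],
              [0.45, 0.30]])   # last variable deliberately weak

h2 = (L ** 2).sum(axis=1)      # squared loadings summed across the factors
print(h2)
print(np.where(h2 < 0.4)[0])   # variables that are candidates for removal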
Eigenvalue
The eigenvalue for a given factor measures the variance in all the variables that is accounted for by that factor. It is also called the characteristic root. The sum of the eigenvalues of all the factors equals the number of variables. The decision about the number of factors to be retained in the factor analysis is taken on the basis of the eigenvalues: if a factor has a low eigenvalue, it contributes little to the explanation of variance in the variables and may be dropped. In short, eigenvalues measure the amount of variation in the total sample accounted for by each factor.
Kaiser Criteria
While applying factor analysis, one needs to decide how many factors should be retained. As per Kaiser's criterion, only those factors having eigenvalues >1 should be retained. Initially, each standardized variable contributes an eigenvalue of 1; thus, unless a factor extracts at least as much variance as the equivalent of one original variable, it is dropped. This criterion was proposed by Kaiser and is the one most widely used by researchers.
The scree plot is a graphical representation of the factors, plotted along the X-axis, against their eigenvalues on the Y-axis. As one moves to the right along the X-axis, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow; the factors above the elbow in the plot are retained. The scree test was developed by Cattell. "Scree" is a term from geology: the scree is the rubble at the bottom of a cliff. In the scree test, if a factor is important, it will have a large variance. The scree plot may look like Fig. 11.1.
Fig. 11.1 A typical scree plot: eigenvalues (Y-axis) plotted against component number (X-axis)
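Both the Kaiser criterion and the raw material for a scree plot come from the eigenvalues of the correlation matrix, as this sketch shows (the data are random stand-ins, so the retained count is illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 12))        # stand-in data: 25 cases, 12 variables
eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

print(eig.sum())        # always equals the number of variables (here 12)
print((eig > 1).sum())  # Kaiser criterion: number of factors to retain
# Plotting eig against 1..12 gives a scree plot like Fig. 11.1.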
Varimax Rotation
The unrotated factor solution obtained from principal component analysis is rotated using one of the rotational techniques to enhance the interpretability of the factors. Varimax rotation is the most widely used rotation technique in factor analysis. It is an orthogonal rotation of the factor axes that maximizes the variance of the squared loadings of a factor across all the variables in the factor matrix, which has the effect of redistributing the original variables among the extracted factors. After varimax rotation, each variable tends to have either a large or a small loading on any particular factor, which helps the researcher identify each variable with one and only one factor. This is the most common rotation option. Other rotational strategies are quartimax, equamax, direct oblimin, and promax, which are not much used by researchers.
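The quantity that varimax maximizes can be written in a few lines; the two hypothetical matrices below show that a simple-structure solution, where each variable loads mainly on one factor, scores higher on this criterion than a diffuse one:

import numpy as np

def varimax_criterion(L):
    """Sum, over factors, of the variance of the squared loadings."""
    return (L ** 2).var(axis=0).sum()

diffuse = np.full((4, 2), 0.5)                # every loading equal
simple = np.array([[0.9, 0.1], [0.8, 0.2],
                   [0.1, 0.9], [0.2, 0.8]])   # near simple structure
print(varimax_criterion(diffuse), varimax_criterion(simple))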
Factor analysis involves several steps, which are discussed below. You may not understand all the steps at first glance, but do not lose heart and continue to read. Once you go through the solved example discussed later in this chapter, the concepts will become fully clear. The steps discussed below cannot be done manually but can be carried out with any statistical package, so try to relate these steps to the output of a factor analysis.
1. Compute descriptive statistics for all the variables. Usually mean and standard
deviation are provided by the standard statistical packages while running the
factor analysis. However, you may run descriptive statistics program to compute
other descriptive statistics like skewness, kurtosis, standard error, and coefficient
of variability to understand the nature of the variables under study.
2. Prepare correlation matrix with all the variables taken in the study.
3. Apply the KMO test to check the adequacy of the data for running factor analysis. The value of KMO ranges from 0 to 1; the larger the value, the more adequate the sample. As a convention, any KMO value above .5 signifies that the sample is adequate for running factor analysis. A value of 0 indicates that distinct factors cannot be formed, and hence the sample is not adequate; on the other hand, a value approaching 1 means the factor analysis will yield distinct and reliable factors. Kaiser recommends accepting values >0.5 as acceptable (values below this should lead you either to collect more data or to rethink which variables to include). Further, values between 0.5 and 0.7 are mediocre, values between 0.7 and 0.8 are good, values between 0.8 and 0.9 are great, and values above 0.9 are superb (Hutcheson and Sofroniou 1999). (A scripted version of this check appears in the sketch after this list.)
4. Apply Bartlett's test of sphericity to test the null hypothesis that the correlation matrix is an identity matrix. If the correlation matrix is an identity matrix, factor analysis is inappropriate. Thus, if Bartlett's test of sphericity is significant, it is concluded that the correlation matrix is not an identity matrix and the factor analysis can be run.
5. Obtain the unrotated factor solution using principal component analysis. This provides the number of factors along with their eigenvalues; we retain only those factors whose eigenvalues satisfy the Kaiser criterion (>1), which can also be shown graphically by the scree plot. This solution also provides the factor loadings of the variables on the different factors, the percentage of variability explained by each factor, and the total variability explained by all the factors retained in the model.
6. Thus, this primary factor analysis solution tells you the percentage of variability explained by all the identified factors together. However, it is not yet possible to assign the variables to factors, because some variables may load on more than one factor. This problem is sorted out by choosing an appropriate rotation technique.
7. Obtain the final solution using the varimax rotation option available in SPSS. This solves the problem of redundancy of variables across factors. As a rule of thumb, if the factor loading of a variable on a factor is 0.7 or more, the variable belongs to that factor. The reason for choosing 0.7 as the cutoff is that a factor loading is a correlation coefficient, so at least 49% (0.7² = 0.49) of the variability of the variable is then explained by the factor to which it belongs. However, other variables whose loadings are <0.7 can also be identified with a factor on the basis of their explainability.
8. The factors identified above are given names depending upon the nature of the variables included in them.
9. If the purpose of the factor analysis is also to develop a test battery, then one or two variables from each factor may be selected on the basis of the magnitude of their loadings. The variables so selected form the test battery, and each variable in it is assigned a weight. The weight assigned to a variable depends upon the percentage of variability explained by the factor to which it belongs. Usually the first factor explains the maximum variance, and therefore two or three variables may be kept from it, depending upon the nature of the variables and their explainability. From the rest of the factors, normally one variable per factor is selected, as the sole purpose of the factor analysis is to reduce the number of variables so that the maximum variance in the group may be explained.
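For readers who want to mirror steps 1–7 in a script rather than in SPSS, a minimal sketch using the third-party factor_analyzer package follows. The file name climate.csv is hypothetical, and factor_analyzer extracts by minimum residuals by default rather than by principal components, so its numbers will differ somewhat from SPSS's PCA-based output.

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                             calculate_kmo)

df = pd.read_csv("climate.csv")     # hypothetical file holding the 12 items

# Steps 3-4: adequacy checks.
chi2, p = calculate_bartlett_sphericity(df)
_, kmo = calculate_kmo(df)
print(f"KMO = {kmo:.3f}; Bartlett chi-square = {chi2:.2f}, p = {p:.4f}")

# Steps 5-7: extract four factors and rotate with varimax.
fa = FactorAnalyzer(n_factors=4, rotation="varimax").fit(df)
print(fa.loadings_)              # rotated factor loadings
print(fa.get_communalities())    # h2 for each variable
print(fa.get_factor_variance())  # variance, proportion, cumulative proportion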
While using the factor analysis, the following assumptions are made:
1. All the constructs which measure the concepts have been included in the study.
2. Sufficient sample size has been taken for factor analysis. Normally sample size
must be equal to 5–20 times the number of variables taken in the study.
3. No outlier is present in the data.
4. Multicollinearity among the variables does not exist.
5. Homoscedasticity between the variables is not required, because factor analysis is a linear function of the measured variables. (Homoscedasticity between two variables means that the variance around the regression line is the same for all values of the predictor variable X.)
6. Variables should be linear in nature. Nonlinear variables may also be used after transforming them into linear form.
7. Data used in the factor analysis is based on interval scale or ratio scale.
Although factor analysis is a very useful multivariate statistical technique, it has some limitations as well.
• Much of the advantage of the factor analysis technique can be realized only if the researcher is able to collect a sufficient set of product attributes. If some of the important attributes are missed, the results of the factor analysis will not be efficient.
• If the majority of the variables are highly related to each other and distinct from the other items, factor analysis will assign a single factor to them. This will not reveal other factors that capture more interesting relationships.
• Naming the factors may require the researcher's knowledge of the subject matter and theoretical concepts, because multiple attributes are often highly correlated for no apparent reason.
Example 11.1
An industrial researcher wanted to investigate the climate of an organization. A set of 12 questions was developed to measure different parameters of the climate. The subjects responded to these questions on a five-point scale, with 5 indicating a strongly agree and 1 a strongly disagree attitude toward the question. The responses obtained on the questionnaire are shown in Table 11.1 along with the description of the questions. Apply the factor analysis technique to study the factor structure, and suggest a test battery that can be used for assessing the climate of any industrial organization. Also apply the scree test for retaining factors graphically and the KMO test for testing the adequacy of the data.
Statements
1. Employees are encouraged to attend training programs organized by outside
agencies.
2. Any employee can reach up to the top level management position during their career.
3. Employees are praised by the immediate boss for doing something useful in the
organization
4. Medical facilities for the employees and their families are excellent
5. Employees are given preference in jobs announced by the group of companies.
6. For doing some creative work or working creatively, employees get incentives
7. Employee’s children are honored for their excellent performance in their
education.
8. Employees are cooperative in helping each other to solve their professional
problems
9. Fees of employees children are reimbursed during their schooling
10. Employees get fast promotion if their work is efficient and consistent
11. Senior managers are sensitive to the personal problems of their employees.
12. Employees get cheaper loan for buying vehicles.
Solution
By applying factor analysis, the following issues shall be resolved:
1. To decide the number of factors to be retained and the total variance explained
by these factors
2. To identify the variables in each factor retained in the final solution, on the basis of their factor loadings
3. To give names to each factor retained on the basis of the nature of variables
included in it
4. To suggest the test battery for assessing the climate of any industrial
organization
5. To test the adequacy of sample size used in factor analysis
Before running the SPSS commands for factor analysis, a data file needs to be prepared. By now you should be familiar with preparing a data file; if not, you may go through the procedure discussed in Chap. 1. Take the following steps for generating the outputs in factor analysis:
(i) Data file: In this problem, all 12 statements are independent variables. These variables have been defined as 'Scale' variables because they were measured on an interval scale. Variables measured on interval as well as ratio scales are treated as scale variables in SPSS. After preparing the data file by defining variable names and their labels, it will look like Fig. 11.2.
(ii) Initiating command for factor analysis: Once the data file is prepared, click the following command sequence in the Data View:
Analyze → Dimension Reduction → Factor
Fig. 11.2 Screen showing data file for the factor analysis in SPSS
Fig. 11.5 Screen showing option for correlation matrix and initial factor solution
3. Table 11.4 shows the result of the KMO test, which tells whether the sample size taken for the factor analysis was adequate. It tests whether the partial correlations among the variables are small. The value of KMO ranges from 0 to 1; the closer it is to 1, the more adequate the sample size for running the factor analysis. Usually a KMO value above 0.5 is considered sufficient for doing factor analysis reliably. In this case, the KMO value is 0.408, which is <.5; hence, the sample size is not adequate, and more data should be collected for the analysis. Since this is a simulated example developed to make the procedure clear, a small data set was used.
Further, Bartlett's test of sphericity is used to test the null hypothesis that the correlation matrix is an identity matrix. Since the significance value (p value) of Bartlett's test in Table 11.4 is .002, which is <.01, the test is significant and the correlation matrix is not an identity matrix. Thus, it may be concluded that the factor model is appropriate.
4. Table 11.5 shows the communalities of all the variables. A high communality indicates that the major portion of a variable's variability is explained by the identified factors. If the communality of a variable is <.4, it is considered useless and should normally be removed from the model. From Table 11.5, it can be seen that the communalities of all the variables are more than .4; hence, all the variables are useful in the model.
Table 11.3 Correlation matrix for the parameters of the organizational climate
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12
S1 1 0.188 0.17 0.35 0.299 0.318 0.271 0.329 0.118 0.041 0.321 0.068
S2 0.188 1 0.193 0.407* 0.339 0.252 0.073 0.088 0.232 0.027 0.027 0.239
S3 0.17 0.193 1 0.349 0.175 0.354 0.068 0.236 0.302 0.591** 0.366 0.089
S4 0.35 0.407* 0.349 1 0.088 0.212 0.244 0.237 0.679** 0.095 0.18 0.243
S5 0.299 0.339 0.175 0.088 1 0.176 0.136 0.058 0.092 0.362 0.203 0.118
S6 0.318 0.252 0.354 0.212 0.176 1 0 0.152 0.168 0.397* 0.036 0.066
S7 0.271 0.073 0.068 0.244 0.136 0 1 0.05 0.315 0.063 0.375 0.303
S8 0.329 0.088 0.236 0.237 0.058 0.152 0.05 1 0.186 0.239 0.468 0.234
S9 0.118 0.232 0.302 0.679** 0.092 0.168 0.315 0.186 1 0.027 0.254 0.242
S10 0.041 0.027 0.591** 0.095 0.362 0.397* 0.063 0.239 0.027 1 0.302 0.028
S11 0.321 0.027 0.366 0.18 0.203 0.036 0.375 0.468 0.254 0.302 1 0.208
S12 0.068 0.239 0.089 0.243 0.118 0.066 0.303 0.234 0.242 0.028 0.208 1
Value of "r" required for significance at the .05 level = 0.396, d.f. = N − 2 = 23; * significant at .05 level
Value of "r" required for significance at the .01 level = 0.505, d.f. = N − 2 = 23; ** significant at .01 level
Fig. 11.8 Scree plot for the organizational climate data: eigenvalues (Y-axis) against component numbers 1–12 (X-axis)
5. Table 11.6 shows the factors extracted and the variance explained by these
factors. It can be seen that after rotation, the first, second, third, and fourth
factors explain 19.266, 18.601, 17.355, and 13.923% of the total variance,
respectively. Thus, all these four factors together explain 69.144% of the total
variance.
The eigenvalues for each of the factors are shown in Table 11.6. Only those factors whose eigenvalues are 1 or more are retained. Here, you can see that the eigenvalues for the first four factors are >1; hence, only four factors have been retained in this study.
Figure 11.8 shows the scree plot, which is obtained by plotting the factors (along the X-axis) against their eigenvalues (along the Y-axis). The plot shows that only four factors have eigenvalues above the elbow; hence, only four factors have been retained in this study.
6. Table 11.7 shows the initial unrotated solution of the factor analysis. Four factors have been extracted in this study, and the factor loadings of all the variables on each of the four factors are shown in the table. Since this is an unrotated factor solution, some of the variables may show contributions to more than one factor. To avoid this situation, the factors are rotated. Varimax rotation has been used in this example, as it is the most popular method among researchers owing to its efficiency.
7. After varimax rotation, the final solution so obtained is shown in Table 11.8. A clearer picture emerges in this final solution of which variables explain which factors: the rotation encourages each variable to appear in one and only one factor.
Variables are usually identified with a factor if their loading on that factor is 0.7 or more, which ensures that the factor extracts sufficient variance from the variable. However, one may lower this threshold if sufficient variables cannot otherwise be identified with the factor. In this problem, the variables have been retained in a factor when their loadings are greater than or equal to 0.6. Under this criterion, the variables have been grouped into four factors, namely, welfare, motivation, interpersonal relation, and career, which are shown in Tables 11.9, 11.10, 11.11, and 11.12.
Factor 1 in Table 11.9 contains variables that measure the welfare of employees
in an organization, and therefore it may be termed as “Welfare Factor.” On the
other hand, all items mentioned in Table 11.10 measure the motivation of
employees; hence, factor 2 is named as “Motivation Factor.” Similarly the
items in Tables 11.11 and 11.12 are related with measuring relationships
among employees and career-related issues; hence, factor 3 and factor 4 may
be termed as “interpersonal relation factor” and “career factor,” respectively.
In order to develop a test battery to measure the climate of an organization, one may choose variables from these identified factors. Since the percentage contributions of the factors to the total variability are more or less the same, the variable with the highest loading on each factor may be picked to form the test battery for measuring the climate of an organization. The test battery so developed is shown in Table 11.13. One may also choose more than one variable from one or two factors, depending upon their explainability.
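Picking the highest-loading variable from each factor, as done for Table 11.13, is a one-liner once the rotated loadings are in a table; the loading values and statement labels below are hypothetical stand-ins for Table 11.8:

import pandas as pd

loadings = pd.DataFrame(
    {"Welfare": [0.82, 0.10, 0.75, 0.05],
     "Motivation": [0.12, 0.79, 0.20, 0.71]},
    index=["S4", "S6", "S9", "S10"])

battery = loadings.idxmax(axis=0)   # statement with the top loading per factor
print(battery)                      # e.g. Welfare -> S4, Motivation -> S6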
Since this was a simulated study, readers are advised to run a confirmatory factor analysis on these questions with a larger data set before using the instrument to measure the organizational climate.
Summary of the SPSS Commands for Factor Analysis
(i) Start SPSS and prepare data file by defining the variables and their properties
in Variable View and typing the data column-wise in Data View.
(ii) In the data view, follow the below-mentioned command sequence for factor
analysis:
Analyze → Dimension Reduction → Factor
(iii) Select all the variables from left panel to the “Variables” section of the right
panel.
(iv) Click the tag Descriptives and check the options “Univariate descriptives,”
“Initial Solution,” “Coefficients,” “Significance levels,” and “KMO and
Bartlett’s test of sphericity.” Press Continue.
(v) Click the tag Extraction and then check "Scree plot." Let the other options remain at their defaults. Press Continue.
(vi) Click the tag Rotation and then check the "Varimax" rotation option. Let the other options remain at their defaults. Press Continue.
(vii) Click OK for output.
Exercise
Short Answer Questions
Note: Write the answer to each of the following questions in not more than 200 words.
Q.1. What do you mean by a factor? What is the criterion for retaining a factor in a study and identifying the variables in it?
Q.2. How is factor analysis useful in understanding group characteristics?
Q.3. Describe an experimental situation in which factor analysis can be used.
Q.4. How can factor analysis be useful in developing a questionnaire?
Q.5. Discuss the procedure of developing a test battery to assess the lifestyle of employees of an organization.
Q.6. What is principal component analysis, and how is it used in factor analysis?
Q.7. What do you mean by eigenvalue? How does Kaiser's criterion work in retaining factors in the model?
Q.8. What do you mean by the scree test? How is it useful in identifying graphically the factors to be retained?
Q.9. What is the importance of communality in factor analysis?
Q.10. What is the significance of factor loadings? How are they used to identify the variables to be retained in the factors?
Q.11. Why are the factors rotated to get the final solution in factor analysis? Which is the most popular rotation method, and why?
Multiple-Choice Questions
Note: Each of the following questions has four alternative answers. Tick the one that you consider closest to the correct answer.
1. Factor analysis is a technique for
(a) Correlation analysis
(b) Dimension reduction
(c) Finding the most important variable
(d) Comparing factors
2. Principal component analysis extracts the maximum variance in the
(a) Last extracted factor
(b) Second extracted factor
(c) First extracted factor
(d) Any extracted factor
10. Varimax rotation is used to get the final solution. After rotation
(a) Factor explaining maximum variance is extracted first
(b) All factors whose eigenvalues are more than 1 are extracted
(c) Three best factors are extracted
(d) Nonoverlapping of variables across the factors emerges
11. Eigenvalue is also known as
(a) Characteristic root
(b) Factor loading
(c) Communality
(d) None of the above
12. KMO test in factor analysis is used to test whether
(a) Factors extracted are valid or not?
(b) Variables identified in each factor are valid or not?
(c) Sample size taken for the factor analysis was adequate or not?
(d) Multicollinearity among the variables exists or not?
13. Bartlett’s test in factor analysis is used for testing
(a) Sample adequacy
(b) Whether correlation matrix is identity matrix
(c) Usefulness of variable
(d) Retaining the factors in the model
14. While using factor analysis certain assumptions need to be satisfied. Choose the
most appropriate assumption
(a) Data used in the factor analysis is based on interval scale or ratio scale
(b) Multicollinearity among the variables exists
(c) Outliers are present in the data
(d) Size of the sample does not affect the analysis.
Assignments
1. It is decided to measure the personality profile of the senior executives in a manufacturing industry. Eleven personality characteristics were measured on 30 senior executives chosen randomly from an organization. Marks on each of these characteristics were given on a ten-point scale; the meaning of each characteristic is described below the table. The data so obtained are shown in the following table. Apply factor analysis using varimax rotation. Discuss your findings and answer the following questions:
(a) Is the data adequate for factor analysis?
(b) Is sphericity significant?
(c) How many factors have been extracted?
(d) In your opinion, what should be the names of the factors?
(e) What factor loading do you suggest for a variable to qualify in a factor?
(f) Can you suggest a test battery for screening the personality characteristics of an executive?
Explanation of Parameters
(a) Friendliness: Being friendly with others and trying to stay networked all the time
(b) Achievement: Doing one's best on difficult tasks and achieving recognition
(c) Orderliness: Doing work systematically
(d) Autonomy: Leading one's life the way one feels like
(e) Dominance: Always being ready to assume leadership
(f) Sensitiveness: Understanding the other person's point of view in analyzing a situation
(g) Exhibition: Showcasing oneself through appearance, speech, and manner to attract others
(h) Endurance: Staying focused on work until it is completed and being able to work without being distracted
(i) Neediness: Always being ready to take the support of others with grace and remaining obliged for it
(j) Helping temperament: Always being ready to help the needy and less fortunate
(k) Learn to change: Always being ready to adapt to a changing environment
2. A researcher wants to know the factors that are responsible for people choosing the Rajdhani Express on different routes in India. Twenty respondents who had recently traveled on this train were selected for their responses. These subjects were given a questionnaire consisting of the ten questions mentioned below. They were asked to give their opinion on a seven-point scale, where 1 indicates complete agreement and 7 complete disagreement. The responses so obtained are shown in the following table.
Apply factor analysis and use varimax rotation to discuss your findings. Explain the factors so extracted in the study.
Questionnaire includes
1. The attendants are caring
2. The bedding provided in the train is neat and clean.
Response data obtained from the passengers on the services provided during journey in the train
S. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
N. Caring Bedding Courteous Food Spray Toilets Timeliness Seats Clean Snacks
1 2 2 1 2 3 4 2 2 4 1
2 3 1 2 3 2 5 4 2 5 2
3 4 2 4 4 3 6 4 3 6 3
4 1 1 2 3 2 4 3 2 4 2
5 2 2 3 4 2 4 4 3 3 3
6 3 2 2 3 3 3 3 2 3 2
7 4 1 3 6 2 5 5 1 5 5
8 5 1 5 5 3 6 5 2 6 4
9 5 1 2 2 2 5 3 2 5 1
10 3 2 3 2 2 5 3 3 5 2
11 6 1 4 4 2 4 4 2 6 3
12 6 2 6 3 3 6 3 3 6 2
13 2 5 3 4 6 5 4 6 4 3
14 1 2 3 3 3 4 4 3 2 2
15 3 1 2 4 3 5 5 2 3 3
16 4 5 3 4 5 4 5 6 4 3
17 2 2 2 1 3 3 3 3 5 1
18 2 1 3 3 2 3 4 2 6 2
19 1 2 2 3 3 4 3 3 2 2
20 3 1 1 1 4 2 2 2 3 1
Learning Objectives
After completing this chapter, you should be able to do the following:
• Understand the importance of discriminant analysis in research.
• List the research situations where discriminant analysis can be used.
• Understand the importance of assumptions used in discriminant analysis.
• Know the different concepts used in discriminant analysis.
• Understand the steps involved in using SPSS for discriminant analysis.
• Interpret the output obtained in discriminant analysis.
• Explain the procedure for developing a decision rule using the discriminant model.
• Know how to write the results of discriminant analysis in standard format.
Introduction
Often we come across situations where it is interesting to know why two naturally occurring groups differ. For instance, after passing school, students can opt to continue further studies, or they may opt for some skill-related work. One may be interested to know what makes them choose their course of action; in other words, it may be desired to know on what parameters these two groups are distinct. Similarly, one may like to identify the parameters that distinguish customers' liking for two brands of soft drink, or that make engineering and management students different. Thus, to identify the independent parameters responsible for discriminating between two such groups, a statistical technique known as discriminant analysis (DA) is used. Discriminant analysis is a multivariate statistical technique used frequently in management, social sciences, and humanities research. There are a variety of situations where this technique can play a major role in the decision-making process. For instance, the government is keen that more and more students opt for the science stream in order to have technological advancement in the country. One may therefore investigate the factors responsible for class XI students choosing the commerce or science stream. After identifying the parameters that discriminate between science and commerce students, decision makers may focus their attention on shifting students' mindset toward opting for the science stream.
Yet another application of discriminant analysis is in the food industry. In launching a new food product, much of its success depends upon taste, and therefore the product formulation must be optimized to obtain the sensory quality expected by consumers. Thus, the decision maker may be interested in the parameters that distinguish an existing similar product from the newly proposed product in terms of properties like sensory characteristics, percentage of ingredients added, pricing, and contents. In this chapter, the discriminant analysis technique is discussed in detail along with its application in SPSS.
Z = c + b1X1 + b2X2 + … + bnXn    (12.1)
where
c is a constant
b's are the discriminant coefficients
X's are the predictor variables
Only those independent variables that are found to have significant discriminating power in classifying a subject into one of the two groups are retained. The discriminant function so developed is used for predicting the group of a new observation set.
Discriminant analysis is more fully known as discriminant function analysis, but the shorter term is commonly used. In discriminant analysis, the dependent variable is categorical, whereas the independent variables are metric. The dependent variable may have more than two classes, but the analysis is most powerful when there are just two. In this text, discriminant analysis is discussed only for the two-group problem.
After developing the discriminant model, the discriminant function Z is computed for a given set of new observations, and the subject/object is assigned to the first group if the value of Z is less than 0 and to the second group if it is more than 0. This criterion holds if an equal number of observations is taken in both groups when developing the discriminant function; with unequal sample sizes, the threshold may shift to either side of zero.
The main purpose of a discriminant analysis is to predict group membership
based on a linear combination of the predictive variables. In using this technique,
the procedure starts with a set of observations where both group membership and
the values of the interval variables are known. The end result of the procedure is a
model that allows prediction of group membership when only the interval variables
are known.
A second purpose of the discriminant analysis is to study the relationship
between group membership and the variables used to predict group membership.
This provides information about the relative importance of independent variables in
predicting group membership.
Discriminant function analysis is similar to ordinary least squares (OLS) regression analysis. The main difference is in the nature of the dependent variable: in discriminant function analysis it is categorical (preferably dichotomous), whereas in multiple regression it is continuous. Other differences lie in the assumptions that must be satisfied, which are discussed later in this chapter.
Discriminant Function
Z = c + b1X1 + b2X2 + … + bnXn
where
b1, b2, …, bn are the discriminant coefficients
X1, X2, …, Xn are the discriminating variables
c is a constant
The discriminant function is also known as the canonical root. This discriminant function is used to classify subjects/cases into one of the two groups on the basis of the observed values of the predictor variables.
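A minimal sketch of fitting such a function with scikit-learn is given below. The two groups of random data are stand-ins, and decision_function plays the role of Z, its sign implementing the Z < 0 / Z > 0 rule described earlier for equal group sizes.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (14, 5)),    # group 1 (illustrative)
               rng.normal(1, 1, (14, 5))])   # group 2 (illustrative)
y = np.array([1] * 14 + [2] * 14)

lda = LinearDiscriminantAnalysis().fit(X, y)

Z = lda.decision_function(X)                 # Z = c + b1*X1 + ... + bn*Xn
print(lda.intercept_, lda.coef_)             # the constant c and the b's
print(np.where(Z < 0, 1, 2))                 # group assignment by the sign of Z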
Classification Matrix
Box’s M Test
While applying ANOVA, one of the assumptions is that the variances are equivalent for each group; in DA, the corresponding assumption is that the variance-covariance matrices are equivalent. Box's M test examines the null hypothesis that the covariance matrices do not differ between the groups formed by the dependent variable. The researcher would not like this test to be significant, so that the null hypothesis that the groups do not differ can be retained. Thus, if Box's M test is insignificant, the assumption required for DA holds true.
However, with large samples, a significant Box's M result is not regarded as too important. Where three or more groups exist and Box's M is significant, groups with very small log determinants should be deleted from the analysis.
Eigenvalues
The eigenvalue is an index of overall model fit. It provides information on each of the discriminant functions (equations) produced. In discriminant analysis, the maximum number of discriminant functions produced is the number of groups minus 1; if the dependent variable has two categories, only one discriminant function is generated. In DA, one tries to predict group membership from a set of predictor variables. If the dependent variable has two categories and there are n predictor variables, then a linear discriminant equation, Zi = c + b1X1 + b2X2 + … + bnXn, is constructed such that the two groups differ as much as possible on Z. The weights b1, b2, …, bn in the discriminant score (Zi) for each subject are chosen so that, if an ANOVA were done on Z, the ratio of the between-groups sum of squares to the within-groups sum of squares would be as large as possible. The value of this ratio is known as the eigenvalue.
Thus, the eigenvalue is computed from the data on Z and is the quantity maximized by the discriminant function coefficients.
Eigenvalue = SS(between groups) / SS(within groups)    (12.2)
The larger the eigenvalue, the better the model discriminates between the groups.
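Equation (12.2) can be computed directly from the discriminant scores; the helper below is a sketch, with z the vector of scores and groups the corresponding group labels:

import numpy as np

def eigenvalue(z, groups):
    """Between-groups over within-groups sum of squares of the scores z."""
    grand = z.mean()
    labels = np.unique(groups)
    ss_b = sum((groups == g).sum() * (z[groups == g].mean() - grand) ** 2
               for g in labels)
    ss_w = sum(((z[groups == g] - z[groups == g].mean()) ** 2).sum()
               for g in labels)
    return ss_b / ss_w

# Well-separated scores give a large eigenvalue:
print(eigenvalue(np.array([-2.0, -1.5, 1.4, 2.1]), np.array([1, 1, 2, 2])))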
Wilks’ Lambda
The different steps involved in discriminant analysis are discussed in this section. Initially you may not understand all the steps clearly, but continue reading; once you complete the solved example using SPSS discussed in this chapter, your understanding of the topic will be enhanced. The steps discussed below cannot be performed manually but may be carried out with any statistical package, so go through them and try to relate them to the outputs of your discriminant analysis.
1. The first step in discriminant analysis is to identify the independent variables having significant discriminating power. This is done by entering all the independent variables together in the model or one by one; the options for these two methods appear in SPSS as "Enter independents together" and "Use stepwise method," respectively.
In the stepwise method, an independent variable is entered into the model if its corresponding coefficient is significant at the 5% level and is removed at a subsequent stage if it is no longer significant at the 10% level. Thus, in developing the discriminant function, the stepwise procedure enters only significant independent variables. The model so developed must then be tested for its robustness.
2. In the second step, a discriminant function model is developed using the discriminant coefficients of the predictor variables and the value of the constant shown in the "Unstandardized canonical discriminant function coefficients" table generated in the SPSS output. This is similar to developing a regression equation. The function so generated may be used to classify an individual into either of the two groups. The discriminant function looks as follows:
Z = c + b1X1 + b2X2 + … + bnXn
where
Z is the discriminant function
X's are the predictor variables in the model
c is the constant
b's are the discriminant coefficients of the predictor variables
3. After developing the discriminant model, Wilks' lambda is computed in the third step to test the significance of the discriminant function developed in the model; this indicates the robustness of the discriminant model. The value of Wilks' lambda ranges from 0 to 1, and a value close to 0 indicates better discriminating power. Further, a significant chi-square value indicates that the discrimination between the two groups is highly significant.
After the independent variables have been selected as predictors in the discriminant model, the model is tested for its significance in classifying subjects/cases correctly into groups. For this, SPSS generates a classification matrix, also known as a confusion matrix. This matrix shows the numbers of correct and wrong classifications of subjects in both groups. A high percentage of correct classifications indicates the validity of the model, though the level of accuracy shown in the classification matrix may not hold for all future classification of new subjects/cases. (A scripted analogue of steps 2–4 is sketched after this list.)
4. In the fourth step, the relative importance of the predictor variables in discriminating between the two groups is assessed. SPSS generates the "Standardized canonical discriminant function coefficients" table: the variable with the highest coefficient is the most powerful in discriminating between the two groups, whereas the variable with the smallest coefficient has the least discriminating power.
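The sketch below mirrors steps 2–4 with scikit-learn, which has no built-in stepwise selection and therefore corresponds to the "Enter independents together" option. The data are the random stand-ins used earlier; for a single discriminant function, Wilks' lambda follows from the eigenvalue of Eq. (12.2) as 1/(1 + eigenvalue), and the confusion matrix is the classification matrix of step 3.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (14, 5)), rng.normal(1, 1, (14, 5))])
y = np.array([1] * 14 + [2] * 14)

lda = LinearDiscriminantAnalysis().fit(X, y)
z = lda.decision_function(X)                   # discriminant scores Z

# Eigenvalue = SS(between) / SS(within) of the scores, as in Eq. (12.2).
m1, m2 = z[y == 1].mean(), z[y == 2].mean()
ss_b = 14 * (m1 - z.mean()) ** 2 + 14 * (m2 - z.mean()) ** 2
ss_w = ((z[y == 1] - m1) ** 2).sum() + ((z[y == 2] - m2) ** 2).sum()
eig = ss_b / ss_w

print("Wilks' lambda:", 1 / (1 + eig))                     # step 3
print("canonical correlation:", (eig / (1 + eig)) ** 0.5)
print(confusion_matrix(y, lda.predict(X)))                 # classification matrix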
While applying discriminant analysis, one should test the assumptions underlying it. The following assumptions are required to be fulfilled when using this analysis:
1. Each of the independent variables is normally distributed. This assumption can be examined with histograms of the frequency distributions. In fact, violations of the normality assumption are usually not serious, as the resulting significance tests are still fairly reliable. One may also use specific tests of skewness and kurtosis, in addition to the graphs, for checking normality.
2. All variables have linear and homoscedastic relationships. It is assumed that the
variance/covariance matrices of variables are homogeneous in both the groups.
Box M test is used for testing the homogeneity of variances/covariances in both
the groups. However, it is sensitive to deviations from multivariate normality
and should not be taken too seriously.
3. Dependent variable is a true dichotomy. The continuous variable should never
be dichotomized for the purpose of applying discriminant analysis.
4. The groups must be mutually exclusive, with every subject or case belonging to
only one group.
5. All cases must be independent. One should not use correlated data like before-
after and matched pair data.
6. Sample sizes of both the groups should not differ to a great extent. If the sample
sizes are in the ratio 80:20, logistic regression may be preferred.
7. Sample size must be sufficient. As a guideline, there should be at least five to six
times as many cases as independent variables.
8. No independent variables should have a zero variability in either of the groups
formed by the dependent variable.
9. Outliers should not be present in the data. Inspecting the descriptive statistics helps detect them.
Discriminant analysis is used to develop a model for classifying future cases/objects into one of two groups on the basis of predictor variables. Hence, it is widely used in studies related to management, social sciences, humanities, and other applied sciences. Some of the research situations where this analysis can be used are discussed below:
1. In a hospitality firm, data can be collected on employees in two different job classifications: (1) customer support personnel and (2) back office management. The human resources manager may like to know whether these two job classifications require different personality types. Each employee may be tested with a battery of psychological tests measuring socialization, extroversion, frustration level, and orthodoxy of approach.
The model can be used to prioritize the predictor variables, which can then be used to identify employees of each category during the selection process. Further, the model may be helpful in developing training programs for future employees recruited in the different categories.
2. A college authority might divide a group of past graduate students into two groups: students who finished the economics honors program in 3 years and those who did not. Discriminant analysis could be used to predict successful completion of the honors program from independent variables like SAT score, class XII maths score, and age of the candidates. Investigating the prediction model might provide insight into how each predictor, individually and in combination, predicted completion or noncompletion of the economics honors program at the undergraduate level.
3. A marketing manager may like to develop a model of the choice between two different kinds of toothpaste on the basis of the product and customer profiles. The independent variables may consist of the age and sex of the customer and the quantity, taste, and price of the products. The insight from the developed model may help decision makers in the company develop and market their products with success.
4. A social scientist may like to know the predictor variables responsible for smoking. Data on variables like the age at which the first cigarette was smoked and other reasons for smoking, like self-image, peer pressure, and frustration level, can be studied to develop a model for classifying an individual as a smoker or a nonsmoker. The knowledge accrued from the developed model may be used to run an ad campaign against smoking.
5. In medical research, one may like to predict whether a patient will survive a burn injury based on combinations of demographic and treatment variables. The predictor variables might include burn percentage, body parts involved, age, sex, and time between the incident and arrival at the hospital. In such situations, the discriminant model so developed would allow a doctor to assess the chances of recovery from the predictor variables, and it might also give insight into how the variables interact in predicting recovery.
Example 12.1 The marketing division of a bank wants to develop a policy for issuing a Visa gold card to its customers, through which one can shop and withdraw up to Rs. 100,000 at a time for 30 days without any interest. Out of its many customers, only a handful are to be chosen for this facility. Thus, a model is required, based on the bank's existing practice of issuing a similar card. Data were collected on 28 customers of the bank who had earlier been either issued or denied a similar card. Apply discriminant analysis to develop a discriminant function for issuing or denying the Visa gold card to customers on the basis of their profiles. Also test the significance of the model so obtained, and discuss the efficiency of classification and the relative importance of the predictor variables retained in the model (Table 12.1).
Solution
Here it is required to do the following:
1. To develop a discriminant function for deciding whether a customer should be issued the gold card
2. To identify the predictor variables in the model and find their relative importance
3. To test the significance of the model
4. To explain the efficiency of classification
These issues shall be discussed with the output generated by SPSS in this example. Thus, the procedure of using SPSS for discriminant analysis in the given example is explained first, and thereafter the output is discussed in the light of the objectives of the study.
In order to perform discriminant analysis with SPSS, a data file needs to be prepared first. Since the initial steps of preparing the data file have been explained in earlier chapters, they are not repeated here; in case of difficulty, go through the procedure discussed in Chap. 1. Take the following steps for generating the outputs in discriminant analysis:
(i) Data file: Here, five independent variables and one dependent variable need to be defined. The dependent variable Card_decision is defined as a nominal variable, whereas all five independent variables are scale variables in SPSS. After preparing the data file by defining variable names and their labels, the screen will look as shown in Fig. 12.1.
(ii) Initiating command for discriminant analysis: After preparing the data file, click the following command sequence in the Data View:
Analyze → Classify → Discriminant
Fig. 12.1 Screen showing partial data file for the discriminant analysis in SPSS
(iii) Selecting variables for analysis: When the contribution of the independent variables is known in advance, all the variables are entered together to build the model; such studies are known as confirmatory studies. In this example, all the variables have been selected to build the model. The screen will look like Fig. 12.3.
(iv) Selecting the options for computation: After selecting the variables, different options need to be defined for generating the output in discriminant analysis. Take the following steps:
– Click the tag Statistics in the screen shown in Fig. 12.3.
– Check the options "Means" and "Box's M" in the "Descriptives" section.
Fig. 12.4 Screen showing the options for descriptive statistics and discriminant coefficients
Table 12.2 Group statistics: mean and standard deviation of all independent variables in different
groups
Issue decision Variable Mean SD
Card issued Average daily balance in last 1 year 67,112.00 9989.74
Number of balance less than 5,000 in last 1 year 1.509 1.29
Annual income in lakh 30.03 5.19
Family size 4.57 1.50
Average transaction per month 9.86 4.62
Card denied Average daily balance in last 1 year 44,566.57 7923.67
Number of balance less than 5,000 in last 1 year 15.64 17.38
Annual income in lakh 31.30 7.56
Family size 4.36 1.15
Average transaction per month 11.64 2.95
Total Average daily balance in last 1 year 55,839.29 14,493.43
Number of balance less than 5,000 in last 1 year 8.5714 14.08
Annual income in lakh 30.67 6.39
Family size 4.46 1.32
Average transaction per month 10.75 3.91
1. Table 12.2 shows the mean and standard deviation of each independent variable in each group. The readers may draw relevant conclusions as per their objectives from this table.
2. Table 12.3 reveals the values of the unstandardized discriminant coefficients, which are used in constructing the discriminant function. Since all the independent variables were included to develop the model, the discriminant coefficients of all five independent variables are shown in Table 12.3.
Thus, the discriminant function can be constructed by using the value of the constant and the coefficients of these five independent variables shown in Table 12.3. In general form, it is
Z = B₀ + B₁X₁ + B₂X₂ + B₃X₃ + B₄X₄ + B₅X₅
where
X₁ is average daily balance in last 1 year
X₂ is number of balances less than 5,000 in last 1 year
X₃ is annual income in lakh
X₄ is family size
X₅ is average transaction per month
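The classification rule itself is easy to apply once the coefficients are known. The short Python sketch below shows how a customer's profile is turned into a discriminant score Z and a decision; the constant and coefficient values are placeholders for illustration only, not the actual values from Table 12.3.

# Hypothetical constant and unstandardized coefficients (placeholders,
# not the values reported in Table 12.3).
b0 = -4.50
b = [0.0001, -0.08, 0.02, 0.10, -0.05]      # coefficients of X1..X5

# One customer's profile: X1..X5 in the order listed above.
x = [61000, 3, 29.5, 4, 10]

# Discriminant score: Z = b0 + b1*X1 + ... + b5*X5
z = b0 + sum(bi * xi for bi, xi in zip(b, x))

# With two groups of equal size, the usual cutoff is Z = 0.
decision = "issue card" if z > 0 else "deny card"
print(f"Z = {z:.3f} -> {decision}")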
Table 12.5 Wilks' lambda and chi-square test
Test of function(s) Wilks' lambda Chi-square df Sig.
1 .336 25.618 5 .000
3. The canonical correlation is 0.815, as shown in Table 12.4. This indicates that approximately 66% (0.815² ≈ 0.664) of the variation between the two groups is explained by the discriminant model.
Since Wilks' lambda gives the proportion of variance left unexplained by the model, the smaller its value, the better the discriminant model. The value of Wilks' lambda lies between 0 and 1. Its value here is 0.336, as shown in Table 12.5; hence, the model can be considered good because only 33.6% of the variability is not explained by the model. To test the significance of Wilks' lambda, the chi-square value shown in Table 12.5 is calculated. Since its associated p value is .000, which is less than .05, it may be inferred that the model is statistically significant.
4. Table 12.6 is a classification matrix which summarizes the correct and wrong classifications of cases in both groups on the basis of the developed discriminant model. This table shows that, out of the 14 customers to whom the credit card was issued, 12 were correctly classified by the developed model and 2 were wrongly classified into the card-denied group. On the other hand, out of the 14 customers to whom the card was denied, 13 were correctly classified by the model into the card-denied group and only 1 customer was wrongly classified into the card-issued group. Thus, out of 28 cases, 25 (89.3%) were correctly classified by the model, which is quite high; hence, the model can be considered valid. Since this model was developed on the basis of a small sample, the level of accuracy shown in the classification matrix may not hold for all future classifications of new cases.
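The percentage reported in the classification matrix can be verified directly from the cell counts; a minimal check in Python using the figures quoted above:

# Cell counts from Table 12.6: (actual group, predicted group) -> count.
confusion = {("issued", "issued"): 12, ("issued", "denied"): 2,
             ("denied", "issued"): 1, ("denied", "denied"): 13}

correct = confusion[("issued", "issued")] + confusion[("denied", "denied")]
total = sum(confusion.values())
print(f"{correct}/{total} = {100 * correct / total:.1f}% correctly classified")
# prints 25/28 = 89.3% correctly classified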
5. Table 12.7 shows the standardized discriminant coefficients of the independent variables in the model. The magnitude of these coefficients indicates the discriminating power of the variables in the model: the variable having the higher standardized coefficient contributes more to the discrimination between the groups.
(Figure: distribution of the two groups along the discriminant axis, with group centroids at Z = −1.354 and Z = +1.354)
(i) After preparing the data file, follow the below-mentioned command sequence for discriminant analysis:
Analyze → Classify → Discriminant
(ii) Select the dependent variable Card_decision from the left panel into the "Grouping Variable" section of the right panel and define its minimum and maximum range as "1" and "2." Further, select all the independent variables from the left panel into the "Independents" section of the right panel. Check the option "Enter independents together."
(iii) Click the tag Statistics and check options for “Means,” “Fisher’s,” and
“Unstandardized” in it. Click Continue.
(iv) Click the tag Classify and check option for “Summary table.” Press Continue.
(v) Press OK for output.
Exercise
Q.8. Explain the role of Wilks' lambda in discriminant analysis. Comment on the model if its value is 0, 0.5, or 1 in three different situations.
Q.9. Explain the purpose of the classification matrix in discriminant analysis. How is the percentage of correct classification similar to R²?
Q.10. What is a discriminant function and how is it developed? How is this function used in decision-making?
Q.11. One of the conditions in discriminant analysis is that "all variables have linear and homoscedastic relationships." Explain the meaning of this statement.
Q.12. What do you mean by the discriminating power of the variables? How will you assess it?
Multiple-Choice Questions
Note: For each of the questions, there are four alternative answers. Tick mark the one that you consider closest to the correct answer.
1. In discriminant analysis, independent variables are treated as
(a) Scale
(b) Nominal
(c) Ordinal
(d) Ratio
2. In discriminant analysis, dependent variable is measured on the scale known as
(a) Grouping
(b) Ordinal
(c) Nominal
(d) Criterion
3. Discriminant function is also known as
(a) Eigenvalue
(b) Regression coefficient
(c) Canonical root
(d) Discriminant coefficient
4. Confusion matrix is used to denote
(a) Correctly classified cases
(b) Discriminant coefficients
(c) F-values
(d) Robustness of different models
5. The decision criteria in discriminant analysis are as follows:
Classify in first group if Z < 0
Classify in second group if Z > 0
The above criteria hold true
(a) If the sizes of the samples in both the groups are equal
(b) If the sizes of the samples in both the groups are nearly equal
(c) If the sizes of the samples in both the groups are in the proportion of 4:1
(d) In all the situations
Results of the examination and subject’s profile
S.N. Bank examination result IQ English Numerical aptitude Reasoning
17 Not successful 56 73 75 83
18 Not successful 65 64 56 84
19 Not successful 56 58 64 86
20 Successful 95 68 78 82
21 Successful 92 80 74 83
22 Not successful 45 73 71 91
23 Successful 85 56 89 74
24 Successful 68 45 83 85
25 Not successful 64 73 64 84
26 Not successful 70 71 56 86
27 Successful 78 74 84 94
28 Not successful 64 70 55 86
29 Not successful 42 67 51 76
30 Successful 82 67 90 83
Develop a discriminant model. Test the significance of the developed model and find the relative importance of the independent variables in the model. Compare the efficiency of the two discriminant function models obtained by taking all the variables at once and by the stepwise method.
2. A branded apparel company wanted to reward its loyal customers with an incentive in the form of a 60% discount in the first week of the New Year. The company had a loose policy of classifying a customer as loyal or disloyal on the basis of certain criteria which were rather subjective. However, the management was interested in developing a more scientific approach to build a model for classifying a customer into the loyal or disloyal group. A sample of 30 customers was chosen from the database, and their purchase details were recorded, as shown in the following table:
Apply discriminant analysis to build a classification model which can be used for existing and future customers to reward them as per the company policy. Test the significance of the model so developed.
S.N. Customer classification No. of purchases/year Purchase amount in a year No. of kids' wear apparel/year No. of ladies apparel/year No. of gents apparel/year
11 Disloyal 6 34,012 3 2 15
12 Loyal 12 67,000 12 8 5
13 Loyal 5 92,008 20 12 9
14 Disloyal 4 12,000 6 2 8
15 Loyal 10 71,540 6 15 8
16 Disloyal 4 13,450 1 2 15
17 Loyal 14 125,000 24 15 8
18 Loyal 20 80,000 5 20 7
19 Disloyal 5 56,021 15 10 15
20 Loyal 9 170,670 21 25 12
21 Disloyal 6 1,012 1 1 1
22 Disloyal 7 54,276 13 8 15
23 Loyal 15 100,675 25 25 5
24 Loyal 12 106,750 30 15 4
25 Disloyal 11 3,500 2 2 3
26 Disloyal 5 2,500 2 1 3
27 Loyal 10 89,065 14 21 8
28 Loyal 9 80,540 15 19 16
29 Disloyal 7 12,000 4 4 6
30 Disloyal 3 5,056 4 2 3
Answers to Multiple-Choice Questions
Q.1 a Q.2 c
Q.3 c Q.4 a
Q.5 a Q.6 d
Q.7 c Q.8 d
Q.9 b Q.10 b
Q.11 b Q.12 c
Q.13 c Q.14 d
Q.15 d
Chapter 13
Logistic Regression: Developing a Model
for Risk Analysis
Learning Objectives
After completing this chapter, you should be able to do the following:
• Learn the difference between logistic regression and ordinary least squares
regression.
• Know the situation where logistic regression can be used.
• Describe the logit transformation used in the analysis.
• Understand different terminologies used in logistic regression.
• Explain the steps involved in logistic regression.
• Understand the assumptions used in the analysis.
• Know the SPSS procedure involved in logistic regression.
• Understand the odds ratio and its use in interpreting the findings.
• Interpret the outputs of logistic regression generated by the SPSS.
Introduction
Logistic regression is a kind of predictive model that can be used when the dependent variable is a categorical variable having two categories and the independent variables are either numerical or categorical. Examples of such categorical dependent variables are job success (success/failure), purchase decision (buy/not buy), and survival after treatment (survive/not survive).
Before getting involved in a serious discussion of logistic regression, one must understand the different terminologies involved in it. The terms required for understanding logistic regression are discussed herewith.
Outcome Variable
The outcome (or target) variable in logistic regression is the binary dependent variable, coded 1 if the event of interest occurs and 0 otherwise.
Natural Log and Exponential Function
The natural log is the usual logarithmic function with base e. The natural log of X is written as log(X) or ln(X). On the other hand, the exponential function involves the constant "e," whose value is equal to 2.71828182845904 (≈2.72). The exponential of X is written as exp(X) = e^X. Thus, exp(4) equals 2.72^4 ≈ 54.74.
Since the natural log and the exponential function are inverses of each other,
e^4 = 54.74 ⟹ ln(54.74) = 4
Odds Ratio
If the probability of success (p) of any event is 0.8, then the probability of its failure is (1 − p) = 1 − 0.8 = 0.2. The odds of success can be defined as the ratio of the probability of success to the probability of failure. Thus, in this example, the odds of success are 0.8/0.2 = 4. In other words, the odds of success are 4 to 1. If the probability of success is 0.5, then the odds of success are 1, and it may be concluded that the odds of success are 1 to 1.
In logistic regression, the odds ratio can be obtained by finding the exponential of a regression coefficient, exp(B), sometimes written as e^B. If the regression coefficient B equals 0.80, the odds ratio will be approximately 2.23, because exp(0.8) ≈ 2.23.
An odds ratio of 2.23 indicates that the odds of Y = 1 are multiplied by 2.23 when the value of X is increased by one unit. If an odds ratio is 0.5, it indicates that the odds of Y = 1 are halved with an increase of X by one unit (here there is a negative relationship between X and Y). On the other hand, an odds ratio of 1.0 indicates that there is no relationship between X and Y.
The odds ratio can be better understood when both variables Y and X are dichotomous. In that case, the odds ratio compares the odds that Y is 1 when X is 1 with the odds that Y is 1 when X is 0. If the odds ratio is given, then the B coefficient can be obtained by taking the log of the odds ratio, because the log and exponential functions are inverses of each other.
The transformation from probability to odds is a monotonic transformation: the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1, whereas the odds range from 0 to positive infinity. Similarly, the transformation from odds to log of odds, known as the log transformation, is also monotonic; the greater the odds, the greater the log of odds, and vice versa. Thus, if the probability of success increases, the odds and the log odds both increase, and vice versa.
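These relationships are easy to verify numerically; the short Python sketch below reproduces the figures used above:

import math

p = 0.8                        # probability of success
odds = p / (1 - p)             # 0.8 / 0.2 = 4.0, i.e., odds of 4 to 1
log_odds = math.log(odds)      # the logit, about 1.386

B = 0.80                       # a regression coefficient in log-odds units
odds_ratio = math.exp(B)       # exp(0.8) is about 2.23
B_recovered = math.log(odds_ratio)   # taking the log recovers B

print(odds, round(log_odds, 3), round(odds_ratio, 2), round(B_recovered, 2))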
Maximum Likelihood
Maximum likelihood is the method of finding the parameter values that minimize the deviation between the observed and predicted values, using calculus (specifically, derivatives). It is different from ordinary least squares (OLS) regression, where we simply try to find the best-fitting line by minimizing the squared residuals.
In the maximum likelihood (ML) method, the computer uses different "iterations" in which different solutions are tried to obtain the smallest possible deviation or best fit. After finding the best solution, the computer provides the final value of the deviance, which is denoted as "−2 log likelihood" in SPSS. Cohen et al. (2003) called this deviance statistic −2LL, whereas some other authors like Hosmer and Lemeshow (1989) called it D. This deviance statistic follows the chi-square distribution.
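The deviance can be computed by hand from the observed outcomes and the probabilities predicted by a fitted model; a minimal sketch with made-up values:

import math

y = [1, 0, 1, 1, 0]            # observed outcomes
p = [0.9, 0.2, 0.7, 0.8, 0.4]  # predicted probabilities from some model

# Log likelihood of the observed data under the model.
ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
         for yi, pi in zip(y, p))

deviance = -2 * ll             # the "-2 log likelihood" reported by SPSS
print(round(deviance, 3))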
Logit
In logistic regression, logit is a special case of a link function. In fact, this logit
serves as a dependent variable and is estimated from the model.
Logistic Function
A logistic curve is an S-shaped (sigmoid) curve obtained from the logistic function given by

p = f(z) = e^z/(1 + e^z)    (13.2)

In the logistic function, the argument z is marked along the horizontal axis and the value of the function f(z) along the vertical axis (Fig. 13.1).
The main feature of this logistic function is that the variable z can assume any value from −∞ to +∞, but the outcome variable p can take values only in the range 0–1. This function is used in the logistic regression model to find the probability of occurrence of the target variable (Y = 1) for given values of the independent variables.
The logistic regression equation is similar to the ordinary least squares (OLS) regression equation, with the only difference that the dependent variable here is the log odds of the probability that the dependent variable Y = 1. It is written as follows:

logit = ln[p̂/(1 − p̂)] = B₀ + B₁X₁ + B₂X₂ + … + BₙXₙ    (13.3)

where B₀ is an intercept and B₁, B₂, …, Bₙ are the regression coefficients of X₁, X₂, …, Xₙ, respectively. The dependent variable in logistic regression is the log odds, which is also known as the logit.
Since in logistic regression the log odds acts as the dependent variable which is regressed on the independent variables, the interpretation of the regression coefficients is not as straightforward as in OLS regression. In OLS regression, the regression coefficient b represents the change in Y for one unit change in X. This interpretation is not valid for the logistic regression equation; instead, the regression coefficient B is converted into an odds ratio to interpret the occurrence of the outcome variable. The interpretation of the odds ratio has been discussed in detail above under the heading "Odds Ratio."
The deviance statistic, −2 log likelihood, is used as a measure of model fit instead of R². It tells you about the fit of the observed values (Y) to the expected values (Ŷ). As the difference between the observed and expected values increases, the fit of the model becomes poorer. Thus, the effort is to keep the deviance as small as possible. If more relevant variables are added to the equation, the deviance becomes smaller, indicating an improvement in fit.
Let us first understand the concept of logistic regression with one independent variable. Consider a situation where we try to predict whether a customer would buy a product (Y) depending upon the number of days (X) he saw the advertisement of that product. It is assumed that the customers who watch the advertisement for many days will be more likely to buy the product. The value of Y is 1 if the product is purchased by the customer and 0 if not.
Since the dependent variable is not continuous, the goal of logistic regression is to predict the likelihood that Y is equal to 1 (rather than 0) given certain values of X. Thus, if there is a positive relationship between X and Y, then the probability that a customer will buy the product (Y = 1) will increase with the increase in the value of X (the number of days the advertisement was seen). Hence, we are actually predicting probabilities instead of the value of the dependent variable.
Fig. 13.2 Graphical representation of the probability of buying versus the number of days the advertisement was seen (probability on the vertical axis, ranging from 0.0 to 1.0)
If Y is the target (dependent) variable and X is the predictor variable, and if the probability that Y = 1 is denoted by p̂, then the probability that Y is 0 would be 1 − p̂. The logistic model for predicting p̂ is given by

ln[p̂/(1 − p̂)] = B₀ + B₁X    (13.4)

where ln[p̂/(1 − p̂)] is the log of the odds, known as the logit, B₀ is the constant, and B₁ is the regression coefficient.
In effect, in logistic regression this logit, ln[p̂/(1 − p̂)], is the dependent variable against which the independent variables are regressed.
From Eq. (13.4), the probability (p̂) that Y = 1 can be computed for a given value of X. Let us assume that

Z = ln[p̂/(1 − p̂)] = B₀ + B₁X    (13.5)

⟹ p̂/(1 − p̂) = e^Z

or

p̂ = e^Z/(1 + e^Z) = e^(B₀+B₁X)/(1 + e^(B₀+B₁X))    (13.6)

Thus, in logistic regression, first the logit, or log of odds, ln[p̂/(1 − p̂)], is computed for a given value of X, and then the probability (p̂) that Y = 1 is computed by using formula (13.6). In fact, (13.6) is just the logistic function

f(z) = e^z/(1 + e^z)    (13.7)

If this function is plotted by taking z on the horizontal axis and f(z) on the vertical axis, it looks as shown in Fig. 13.3.
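The computation in (13.5) and (13.6) can be traced with a few lines of Python; the intercept and slope used here are hypothetical placeholders, chosen only to show the S-shaped growth of the probability:

import math

def logistic(z):
    # f(z) = e**z / (1 + e**z); the output always lies between 0 and 1.
    return math.exp(z) / (1 + math.exp(z))

B0, B1 = -3.0, 0.5             # hypothetical coefficients for Z = B0 + B1*X

for x in [0, 4, 8, 12]:        # number of days the advertisement was seen
    z = B0 + B1 * x
    print(x, round(logistic(z), 3))   # estimated probability that Y = 1
# prints 0 0.047, 4 0.269, 8 0.731, 12 0.953 - rising along the S-curve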
In logistic regression, the logistic function shown in Fig. 13.3 is used for estimating the probability of the event happening (Y = 1) for different values of X. Let us see how it is done.
In the logistic function shown in (13.7), the input is z and the output is f(z). The value of z is estimated from the logistic regression Eq. (13.5) on the basis of the value of X. The important characteristic of the logistic function is that its input can take any value from negative infinity to positive infinity, but the output will always lie in the range 0–1.
If there are n independent variables, then the value of z, the logit or log of odds, shall be estimated by the equation

Z = logit = ln[p̂/(1 − p̂)] = B₀ + B₁X₁ + B₂X₂ + … + BₙXₙ    (13.8)

where B₀ is an intercept and B₁, B₂, …, Bₙ are the regression coefficients of X₁, X₂, …, Xₙ, respectively.
The variable Z is estimated from (13.8) for given values of the Xs. It is a measure of the total contribution of all the independent variables used in the model.
If the outcome variable is the risk of an event happening, say the bankruptcy of an organization, then each of the regression coefficients shows the contribution of its variable toward the probability of that outcome. A positive regression coefficient indicates that the explanatory variable increases the probability of the outcome, whereas a negative coefficient indicates that it decreases that probability.
2. The independent variables are not required to be linearly related to the dependent variable.
3. It can be used with data having nonlinear relationships.
4. The dependent variable need not follow a normal distribution.
5. The assumption of homoscedasticity is not required; in other words, no homogeneity-of-variance assumption is needed.
Although logistic regression is very flexible and can be used in many situations without imposing many restrictions on the data set, these advantages come at a cost: it requires a large data set to achieve reliable and meaningful results. Whereas in OLS regression and discriminant analysis 5 to 10 observations per independent variable are considered the minimum threshold, logistic regression requires at least 50 observations per independent variable to achieve reliable findings.
Due to the flexibility of its assumptions, logistic regression is widely used in many applications. Some specific applications are discussed below:
1. A food-joint chain may be interested to know what factors influence customers to buy a big-size Pepsi at its fast-food centers. The factors may include the type of pizza ordered (veg. or non-veg.), whether French fries were ordered, the age of the customer, and the customer's body size (bulky or normal). The logistic model can help identify the parameters most likely responsible for buying a big-size Pepsi across the different food chains.
2. A study may investigate the parameters responsible for getting admission to the MBA program at Harvard Business School. The target variable is dichotomous, with 1 indicating success in getting admission and 0 indicating failure. The parameters of interest may be the working experience of the candidates in years, grades in the qualifying examination, TOEFL and GMAT scores, and scores on the testimonials. By way of the logistic model, the relative importance of the independent variables may be identified, and the probability of success of an individual may be estimated on the basis of the known values of the independent variables.
3. A market research company may be interested in investigating the variables responsible for a customer buying a particular life insurance cover. The target variable may be 1 if the customer buys the policy and 0 if not. The possible independent variables in the study may be age, gender, socioeconomic status, family size, profession (service/business), etc. By knowing the most likely drivers of success in selling the policy, the company may direct its campaign toward the right audience.
After understanding the concepts involved in logistic regression, now you are ready
to use this analysis for your problem. The detailed procedure of this analysis using
SPSS shall be discussed by using a practical example. But before that, let us
summarize the steps involved in using the logistic regression:
1. Define the target variable and code it 1 if the event occurs and 0 otherwise. The
target variable should always be dichotomous.
2. Identify the relevant independent variables responsible for the occurrence of
target variable.
3. If any independent variable is categorical with more than two categories, define the coding of the different categories as discussed in the "Assumptions" section.
4. Develop a regression model by taking the dependent variable as the log odds of the probability that the target variable Y = 1. The logistic regression model can be developed either by using forward/backward step methods or by using all the independent variables in the model. Forward/backward step methods are usually used in exploratory studies, where it is not known whether the independent variables have an effect on the target variable. On the other hand, all the independent variables are used in developing a model when the effect of the independent variables is known in advance and one tries to validate the model. Several options for forward/backward methods are available in SPSS, but the "Forward:LR" method is considered the most efficient. On the other hand, for taking all the independent variables in the model, SPSS provides the default "Enter" option.
5. After choosing the method for binary logistic regression, the model would look as follows, where p̂ is the probability that the target variable Y = 1:

ln[p̂/(1 − p̂)] = Z = B₀ + B₁X₁ + B₂X₂ + … + BₙXₙ

The variables have their usual meanings. The log odds ln[p̂/(1 − p̂)] is also known as the logit.
6. The probability of occurrence of the target variable can be estimated for a given set of values of the independent variables by using formula (13.6), p̂ = e^Z/(1 + e^Z). This equation gives rise to the logistic curve, which is S-shaped as shown in Fig. 13.3; the probability can also be read off this curve after computing the value of Z.
7. The exponential of a regression coefficient is known as the odds ratio. These odds ratios are used to find the relative contribution of each independent variable toward the occurrence of the target variable; thus, the odds ratio corresponding to each regression coefficient is computed. For example, an odds ratio of 3.2 for the variable X₁ indicates that the odds of Y = 1 are multiplied by 3.2 when X₁ is increased by one unit. If the odds ratio for the variable X₃ is 0.5, the odds of Y = 1 are halved with a one-unit increase in X₃ (a negative relationship between X₃ and Y). On the other hand, if the odds ratio for the variable X₂ is 1.0, there is no relationship between X₂ and Y.
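Although this book works with SPSS, the same model can be cross-checked in other software. A minimal sketch, assuming the Python statsmodels package is available, using made-up data for the advertisement example:

import numpy as np
import statsmodels.api as sm

# Made-up data: X = days the advertisement was seen, y = bought (1) or not (0).
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

exog = sm.add_constant(X)            # adds the intercept column for B0
model = sm.Logit(y, exog).fit()      # maximum likelihood estimation

print(model.params)                  # B0 and B1, in log-odds units
print(np.exp(model.params))          # odds ratios, as in SPSS exp(B)
print(-2 * model.llf)                # the -2 log likelihood deviance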
Table 13.2 Data on the candidate’s profile along with success status
S.N. Job success Education Sex Experience in years Age Metro Marital status
1 1 16 1 7 23 1 1
2 0 15 1 5 25 0 0
3 1 16 1 5 27 1 1
4 1 15 1 2 26 1 0
5 0 16 0 3 28 0 0
6 1 15 1 2 26 0 1
7 0 13 1 3 33 1 1
8 0 12 0 2 32 0 1
9 1 12 1 3 26 1 1
10 0 13 0 3 30 0 0
11 1 12 0 1 28 1 1
12 0 12 0 2 28 0 0
13 1 15 1 6 32 1 1
14 1 12 1 3 38 0 1
15 0 16 0 2 23 0 0
16 1 15 1 3 22 1 0
17 1 16 1 7 23 0 1
18 0 15 1 5 25 0 0
19 1 16 1 5 27 1 1
20 1 12 0 2 28 0 1
21 1 16 1 4 28 1 0
22 1 15 1 3 28 0 1
23 0 12 0 2 26 1 0
24 0 14 0 5 29 0 0
Job success : 0 : Failure 1: Success
Sex : 0 : Female 1: Male
Metro : 0 : Nonmetro resident 1: Metro resident
Marital status : 0 : Unmarried 1: Married
Solution
The above-mentioned problem can be solved by using SPSS. The steps involved in
getting the outputs shall be discussed first and then the output so generated shall be
explained to fulfill the objectives of the study.
The logistic regression in SPSS is run in two steps. The outputs generated in these
two sections have been discussed in the following two steps:
First Step
The first step, called Block 0, includes no predictors and just the intercept. This model is developed by using only the constant and no predictors. The logistic regression compares this model with a model having all the predictors to assess whether the latter model is more efficient. Often researchers are not interested in this model; in this part, a "null model," having no predictors and just the intercept, is described. Because of this, all the variables entered into the model will figure in the table titled "Variables not in the Equation."
Second Step
The second step, called Block 1, includes the information about the variables that are included in and excluded from the analysis, the coding of the dependent variable, and the coding of any categorical variables listed on the categorical subcommand. This section is the most interesting part of the output; the outputs generated here are used to test the significance of the overall model, the regression coefficients, and the odds ratios.
The above-mentioned outputs in two steps are generated by the SPSS through a
single sequence of commands, but the outputs are generated in two different
sections with the headings “Block 0: Beginning Block” and “Block 1.” You have
the liberty to use any method of entering independent variables in the model out of
different methods available in SPSS. These will be discussed while explaining
screen shots of logistic regression in the next section.
The procedure of logistic regression in SPSS shall be defined first and then
relevant outputs shall be shown with explanation.
To run the commands for logistic regression, a data file is required to be prepared. The
procedure for preparing the data file has been explained in Chap. 1. After preparing the
data file, do the following steps for generating outputs in logistic regression:
(i) Data file: In this problem, job success is the dependent variable, which is binary in nature. Out of the six independent variables, three, namely, sex, metro, and marital status, are binary, whereas the remaining three, education, experience, and age, are scale variables. In SPSS, all binary variables are defined as nominal. After preparing the data file by defining the variable names and their labels, it will look as shown in Fig. 13.4.
(ii) Initiating command for logistic regression: After preparing the data file, click the following commands in sequence (Fig. 13.5):
Analyze → Regression → Binary Logistic
(iii) Selecting variables for analysis: After clicking the Binary Logistic option, you will get the next screen for selecting the dependent and independent variables. After selecting all the independent variables, you need to indicate which of them are categorical by clicking the Categorical option. The selection of variables can be made by following the below-mentioned steps:
Fig. 13.4 Screen showing data file for the logistic regression analysis in SPSS
– Select the dependent variable from the left panel to the “Dependent” section
in the right panel.
– Select all independent variables including categorical variables from left
panel to the “Covariates” section in the right panel.
– Click the command Categorical and select the categorical variables from
the “Covariates” section to the “Categorical Covariates” in the right panel.
The screen will look like Fig. 13.6.
– Click Continue.
(iv) Selecting options for computation: After selecting the variables, you need to
define different options for generating the outputs in logistic regression. Do the
following steps:
– Click the tag Options in the screen shown in Fig. 13.6 for selecting outputs
related to statistics and plots. Do the following steps:
– Check “Classification Plots.”
– Check “Hosmer-Lemeshow goodness-of-fit.”
– Let all other default options be selected. The screen will look like Fig. 13.7.
– Click Continue.
(v) Selecting method for entering independent variables in logistic regression: You
need to define the method of entering the independent variables for developing the
model. You can choose any of the options like Enter, Forward:LR, Forward:Wald,
Backward:LR, or Backward:Wald. Enter method is usually selected when a specific
model needs to be tested or the contribution of independent variables toward the
target variable is known in advance. On the other hand, if the study is exploratory in
nature, then any of the forward or backward methods can be used. In this study, the
Forward:LR method shall be used because the study is exploratory in nature.
Descriptive Findings
Table 13.3 shows the number of cases (N) in each category (e.g., included in the analysis, missing, and total) and their percentages. In logistic regression, listwise deletion of missing data is done by default in SPSS. Since there are no missing data, the number of missing cases is shown as 0. Table 13.4 shows the coding of the dependent variable used in the data file, that is, 1 for success and 0 for failure in getting the job.
Fig. 13.7 Screen showing options for generating classification plots and the Hosmer-Lemeshow goodness-of-fit
Table 13.5 shows the coding of all the categorical independent variables along with their frequencies in the study. While coding the categorical variables, the highest number should be allotted to the reference category because, by default, SPSS considers the category with the highest coding as the reference category and gives it the code 0. For instance, if you define the coding of the variable sex as 0 for "female" and 1 for "male," then SPSS will consider male as the reference category, convert its code to 0, and code the other category, female, as 1.
If you look at the coding of the independent categorical variables in Table 13.2, that is, sex (0: female, 1: male), metro (0: nonmetro resident, 1: metro resident), and marital status (0: unmarried, 1: married), these codings have been reversed by SPSS, as shown in Table 13.5. This is because SPSS by default considers the highest coding as the reference category and converts it into 0. However, you can change the reference category to the lowest coding in the SPSS screen shown in Fig. 13.6.
Analytical Findings
The findings in this section are the most interesting part of the output. These
findings include the test of the overall model, significance of regression
coefficients, and the values of the odds ratios.
In this study, since the Forward:LR method has been chosen for the logistic regression, you will get more than one model, with different numbers of variables in them. The results of the logistic regression shall be discussed in two blocks. In the first block, the logistic regression model is developed by using only the constant, without any of the independent variables. This model may be used as a baseline for judging the utility of the model developed in Block 1 using the identified independent variables.
In Block 0, the results are shown for the model with only the constant included, before any coefficients (i.e., those relating to education, sex, experience, age, metro, and marital) are entered into the equation. Logistic regression compares the model obtained in Block 0 with a model including the predictors to determine whether the latter model is more efficient. Table 13.6 shows that if nothing is known about the independent variables and one simply guesses that a person would be selected for the job, one would be correct 58.3% of the time. Table 13.7 shows that the Wald statistic is not significant, as its significance value is 0.416, which is more than 0.05. Hence, the constant-only model is not worthwhile and is equivalent to just guessing about the target variable in the absence of any knowledge about the independent variables.
Table 13.8 shows whether each independent variable would improve the model or not. You can see that the variables sex, metro, and marital may improve the model, as they are significant, with sex and marital slightly better than metro. Inclusion of these variables would add to the predictive power of the model. If these variables had not been significant and able to contribute to the prediction, the analysis would obviously be terminated at this stage.
In this block, results of the different models with different independent variables
shall be discussed.
Table 13.9 shows the value of −2 log likelihood (−2LL), which is a deviance statistic between the observed and predicted values of the dependent variable. If this deviance statistic is insignificant, it indicates that the model is good and there is no difference between the observed and predicted values of the dependent variable. This number in absolute terms is not very informative. However, it can be used to compare different models having different numbers of predictor variables. For instance, in Table 13.9, the value of −2LL has reduced from 24.053 to 18.549. This indicates that there is an improvement in model 2 on including an additional variable, sex. In fact, the value of −2LL should keep decreasing as significant predictor variables are added to the model.
Unlike the OLS regression equation, there is no exact counterpart of R² in logistic regression, because the dependent variable is dichotomous and R² cannot be used to show the efficiency of prediction. However, several authors have suggested pseudo R-squares, which are not equivalent to the R² calculated in OLS regression; these statistics should therefore be interpreted with great caution. Two such pseudo R-squares, suggested by Cox and Snell and by Nagelkerke, are shown in Table 13.9. As per Cox and Snell's R², 44.3% of the variation in the dependent variable is explained by the logistic model. On the other hand, Nagelkerke's R² indicates that 59.7% of the variability of the dependent variable is explained by the independent variables in the model. Nagelkerke's R² is a more reliable measure of the relationship than Cox and Snell's R² and will normally be higher.
In order to find whether the deviance statistic −2 log likelihood is insignificant or not, Hosmer and Lemeshow suggested the chi-square statistic shown in Table 13.10. For the model to be considered efficient, this chi-square statistic should be insignificant. Since the p value associated with the chi-square in Table 13.10 is .569 for the second model, which is greater than .05, it is insignificant, and it can be interpreted that the model is efficient.
Table 13.11 is a classification table which shows the observed and predicted values of the dependent variable for both models. In the second model, it can be seen that out of the 10 candidates who did not succeed in getting the job, 4 were wrongly predicted to get the job. Similarly, out of the 14 candidates who succeeded in getting the job, none was wrongly predicted to be a failure. Thus, the model correctly classified 83.3% of the cases, obtained as (20/24) × 100.
Table 13.12 is the most important table; it shows the values of the regression coefficients B, the Wald statistics and their significance, and the odds ratio exp(B) for each variable in both models. The B coefficients are used to develop the logistic regression equation for predicting the dependent variable from the independent variables. These coefficients are in log-odds units. Thus, the logistic regression equation in the second model is given by

ln[p/(1 − p)] = 2.779 − 2.666 × Sex(1) − 2.711 × Marital(1)

where p is the probability of getting the job. The dependent variable in the logistic regression is known as logit(p), which is equal to ln[p/(1 − p)].
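Using these coefficients, the estimated probability of job success can be computed for any candidate profile; a small sketch (recall that, after the SPSS recoding described earlier, Sex(1) = 1 denotes a female candidate and Marital(1) = 1 an unmarried candidate):

import math

def p_success(sex1, marital1):
    # Sex(1): 1 = female, 0 = male; Marital(1): 1 = unmarried, 0 = married.
    logit = 2.779 - 2.666 * sex1 - 2.711 * marital1
    return math.exp(logit) / (1 + math.exp(logit))

print(round(p_success(0, 0), 3))   # married male: about 0.942
print(round(p_success(1, 1), 3))   # unmarried female: about 0.069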
The estimates obtained in the above logistic regression equation explain the relationship between the independent variables and the dependent variable, where the dependent variable is on the logit scale. These estimates tell the amount of increase (or decrease, if the sign of the coefficient is negative) in the estimated log odds of "job success" = 1 that would be predicted by a one-unit increase (or decrease) in the predictor, holding all other predictors constant.
Because regression coefficients B are in log-odds units, they are often difficult to
interpret; hence, they are converted into odds ratios which are equal to exp(B).
These odds ratios are shown in the last column of Table 13.12.
Significance of the Wald statistic indicates that the variable significantly predicts success in getting the bank job, but it should be relied upon only when the sample size is quite large, preferably more than 500. With a small sample, the level of significance gets inflated and does not give the correct picture. Since in this problem the chi-square value of the Hosmer and Lemeshow test shown in Table 13.10 is insignificant, the second model, with the two independent variables marital and sex, can be considered valid for predicting success in getting the bank's job.
In Table 13.12, exp(B) represents the odds ratio for each of the predictors; the farther an odds ratio is from 1 (in either direction), the greater the predictive value of the variable. Since the second model is the final model in this study, the discussion shall be restricted to the variables in this model. Here both independent variables, sex and marital, are significant, with odds ratios of .070 for sex(1) and .066 for marital(1), which are of very similar strength in predicting success in getting the bank's job.
The value of exp(B) for the variable sex(1) is 0.070. It indicates that if the candidate appearing in the bank exam is female, the odds of success decrease by 93% (0.070 − 1.000 = −0.930). In other words, if a female candidate appears in the bank examination, her odds of success would be 93% less than those of a male candidate, other variables being kept constant. Similarly, the exp(B) value of the variable marital(1) is 0.066, indicating a decrease in the odds of 93.4% (0.066 − 1.000 = −0.934). It can be interpreted that if the candidate appearing in the bank examination is unmarried, his or her odds of success would be 93.4% less than those of a married candidate, provided other variables are kept constant.
Conclusion
To conclude, if the candidate is male and married, the odds of getting selected for a bank job increase in comparison to a female, unmarried candidate.
(i) Start SPSS and prepare the data file by defining the variables and their properties
in Variable View and typing the data column-wise in Data View.
(ii) In the Data View, follow the below-mentioned command sequence for logistic regression:
Analyze → Regression → Binary Logistic
(iii) Select the dependent variable from the left panel to the “Dependent” section in
the right panel and all independent variables including categorical variables
from left panel to the “Covariates” section in the right panel.
(iv) By clicking the Categorical command, select the categorical variables from
the “Covariates” section to the “Categorical Covariates” in the right panel and
click Continue.
(v) Click the tag Options and check “Classification Plots” and “Hosmer-
Lemeshow goodness-of-fit” and click Continue.
(vi) Ensure that the method Forward:LR is selected and then click OK for the output.
Exercise
(c) 17.12
(d) 3
3. If the probability of success is 0.6, then the odds of success is
(a) 0.4
(b) 1.5
(c) 2.4
(d) 0.75
4. In a logistic regression, if the odds ratio for an independent variable is 2.5, then
which of the following is true?
(a) The probability of the dependent variable happening is 0.25.
(b) The odds against the dependent variable happening is 2.5.
(c) The odds for the dependent variable happening is 2.5.
(d) The odds for the dependent variable happening is 2.5 against one unit
increase in the independent variable.
5. If p is the probability of success, then the logit of p is
(a) ln[p/(1 − p)]
(b) ln[p/(1 + p)]
(c) log[p/(1 − p)]
(d) log[p/(1 + p)]
Assignments
1. Following are the scores of 90 candidates in different subjects obtained in an MBA entrance examination. Apply logistic regression to develop a model for predicting success in the examination on the basis of the independent variables. Discuss the comparative importance of the independent variables in predicting success in the examination. For the variable MBA, the code 1 represents success and 0 indicates failure in the examination. Similarly, for gender, 1 indicates male and 0 indicates female.
MBA English Reasoning Math Gender MBA English Reasoning Math Gender
1 68 50 65 0 0 46 52 55 1
0 39 44 52 1 0 39 41 33 0
0 44 44 46 1 0 52 49 49 0
1 50 54 61 1 0 28 46 43 0
1 71 65 72 0 0 42 54 50 1
1 63 65 71 1 0 47 42 52 0
0 34 44 40 0 0 47 57 48 1
1 63 49 69 0 0 52 59 58 0
0 68 43 64 0 0 47 52 43 1
0 47 45 56 1 1 55 62 41 0
0 47 46 49 1 0 44 52 43 0
0 63 52 54 0 0 47 41 46 0
0 52 51 53 0 0 45 55 44 1
0 55 54 66 0 0 47 37 43 0
1 60 68 67 1 0 65 54 61 0
0 35 35 40 0 0 43 57 40 1
0 47 54 46 1 0 47 54 49 0
1 71 63 69 0 1 57 62 56 0
0 57 52 40 1 0 68 59 61 1
0 44 50 41 0 0 52 55 50 0
0 65 46 57 0 0 42 57 51 0
1 68 59 58 1 0 42 39 42 1
1 73 61 57 1 1 66 67 67 1
0 36 44 37 0 1 47 62 53 0
0 43 54 55 0 0 57 50 50 0
1 73 62 62 1 1 47 61 51 1
0 52 57 64 1 1 57 62 72 1
0 41 47 40 0 0 52 59 48 1
0 50 54 50 0 0 44 44 40 1
0 50 52 46 1 0 50 59 53 1
0 50 52 53 0 0 39 54 39 0
0 47 46 52 0 1 57 62 63 1
1 62 62 45 1 0 57 50 51 1
0 55 57 56 1 0 42 57 45 0
0 50 41 45 1 0 47 46 39 0
0 39 53 54 1 0 42 36 42 1
0 50 49 56 0 0 60 59 62 0
0 34 35 41 0 0 44 49 44 0
0 57 59 54 1 0 63 60 65 1
1 65 60 72 0 1 65 67 63 1
1 68 62 56 0 0 39 54 54 0
0 42 54 47 0 0 50 52 45 1
0 53 59 49 1 1 52 65 60 0
1 59 63 60 1 1 60 62 49 1
0 47 59 54 1 0 44 49 48 0
and 0 if not elected earlier). Apply the logistic regression and develop the model
for predicting success in assembly election.
Answers to Multiple-Choice Questions
Chapter 14
Multidimensional Scaling for Product Positioning
Learning Objectives
After completing this chapter, you should be able to do the following:
• Know the use of multidimensional scaling in market research.
• Understand the different terms used in multidimensional scaling.
• Learn the procedures used in multidimensional scaling.
• Identify the research situations where multidimensional scaling can be used.
• Describe the SPSS procedure involved in multidimensional scaling.
• Explain the various outputs generated by the SPSS in this analysis.
Introduction
world in their own way. From this perspective, an MDS procedure based on predefined attributes is not completely satisfactory, as it fails to take individual experience into account. One way to overcome this problem is to look at the constructs an individual uses to construe the world. Since MDS is often used to identify the key dimensions underlying customer evaluations of products, services, or companies, once the data are at hand, multidimensional scaling can help determine the following:
• The dimensions respondents use while evaluating the objects
• The relative importance of each dimension
• How the objects are placed in the perceptual map
Thus, by using multidimensional scaling methods, one can analyze the current level of consumer satisfaction in the market and modify the marketing mix based upon current consumer preferences and satisfaction.
Distances
Distance refers to the difference between two objects on any one or more dimensions as perceived by a respondent. It is the fundamental measurement concept in MDS. Distance may also be referred to as similarity, preference, dissimilarity, or proximity. There exist many alternative distance measures, but all are functions of dissimilarity/similarity or preference judgments.
Similarity and Dissimilarity Matrix
If the cells of a matrix represent the degree of similarity between the pairs represented by the rows and columns of the matrix, the matrix is said to be a similarity matrix. On the other hand, if the cells of the matrix represent the extent to which one object is preferred over the other in the pair, the matrix is said to be a dissimilarity matrix. Larger cell values represent greater distance. The algorithm used by SPSS in multidimensional scaling is more efficient with dissimilarity/preference measures than with similarity/proximity measures. For this reason, distance matrices are used in SPSS instead of similarity matrices.
Stress
Stress (phi) is a goodness-of-fit statistic that measures the efficiency of MDS models: the smaller the stress, the better the fit. Stress measures the difference between the interpoint distances in the computed MDS space and the corresponding actual input distances. High stress indicates measurement error, and it may also reflect having too few dimensions. Stress is not much affected by sample size, provided the number of objects is appreciably more than the number of dimensions.
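The stress statistic reported by SPSS later in this chapter is Kruskal's stress formula 1, which may be written in its standard textbook form as

Stress₁ = √[ Σ (d_ij − d̂_ij)² / Σ d_ij² ]

where d_ij are the interpoint distances in the fitted configuration, d̂_ij are the disparities derived from the input proximities, and the sums are taken over all pairs of objects.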
Perceptual Mapping
Perceptual mapping refers to plotting the objects in the reduced-dimensional space obtained from the analysis so that their relative positions can be assessed visually; similar objects appear close together on the map.
Dimensions
While preparing the dissimilarity matrix, the respondent may be asked to rate two objects/products on particular characteristics such as color, look, energy efficiency, and cost. These characteristics are said to be the dimensions on which the evaluation takes place. Usually the products are rated on two or more dimensions. These dimensions may be predefined, or the respondents may perceive them on their own.
1. Find the distance matrix among all the objects. It can be obtained simply by having a consumer rank or rate the distances between an object and every other object. In practice, the consumer is shown a card containing a pair of objects and asked to specify a number, on some numerical scale, representing the difference between the two objects; this process is repeated for all pairs of brands included in the study. In this process, no attributes are specified on which the consumer must judge the difference. The distance measures so obtained for all pairs of objects are compiled into a matrix, as shown in Table 14.1. This distance matrix serves as the input data for the multidimensional scaling.
2. After obtaining the distance matrix for each consumer, take the average of these distances for each pair of objects to form the final distance matrix, which is normally used as the input data. However, multidimensional scaling can be applied to a single respondent as well.
3. Compute the value of "stress" for the solution in each dimensionality. Since the value of stress represents a measure of lack of fit, the intention is to get a solution with an acceptably low value of stress.
4. On the basis of the lowest stress value among the different solutions obtained in step 3, decide the number of dimensions.
5. After deciding the number of dimensions, plot the objects on a map for a visual assessment of their positioning.
6. Name these dimensions by keeping in mind the attributes of the brands, like cost, features, and look. The procedure will become clear from the solved Example 14.1, and a programmatic sketch of these steps follows this list.
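The following is a programmatic sketch of the above procedure, assuming the Python scikit-learn package is available; note that its MDS routine differs from SPSS ALSCAL, so the coordinates and stress values will not match the SPSS output exactly, and the 4 × 4 dissimilarity matrix below is made up:

import numpy as np
from sklearn.manifold import MDS

# Made-up symmetric dissimilarity matrix for four brands (0-8 scale).
D = np.array([[0, 3, 6, 7],
              [3, 0, 5, 6],
              [6, 5, 0, 2],
              [7, 6, 2, 0]], dtype=float)

# Two-dimensional configuration from the precomputed dissimilarities.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

print(coords)        # coordinates of each brand in the perceptual map
print(mds.stress_)   # raw stress (not Kruskal's normalized stress-1)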
discriminant function. For example, in the case of three objects/brands you could get two functions, and with four objects you may get up to three discriminant functions. The solution of the discriminant analysis gives an eigenvalue for each discriminant function, which reflects the amount of variance explained by that function. The percentage of variance explained by the discriminant functions is used to decide how many of them should be retained. If two discriminant functions are used, they form the two axes of the perceptual map, whereas if three discriminant functions are used, you get three perceptual maps: function 1 vs. function 2, function 1 vs. function 3, and function 2 vs. function 3. These discriminant functions represent the axes on which the objects are first located, and thereafter the attributes are located.
To find the number of dimensions and the perceptual map of different objects,
following steps are used:
1. Obtain consumers’ perceptions on different attributes on the different competing
brands. This serves as the input data for the discriminant analysis.
2. Run the discriminant analysis by taking all the independent variables together in
the model. The option for this method can be seen in SPSS as “Enter
independents together.”
3. The SPSS output shall generate the following results:
(a) Group statistics including mean and standard deviation
(b) Unstandardized canonical discriminant function coefficients table
(c) Eigen values and canonical correlation
(d) Wilks’ lambda and chi-square test
(e) Classification matrix
(f) Standardized canonical discriminant function coefficients
(g) Functions at group centroids
Remark: For generating the above-mentioned outputs for MDS, you can refer
back the solved Example 12.1 in Chap. 12.
4. The eigenvalues decide how many discriminant functions you want to use.
5. Draw the perceptual map (or maps) by using the standardized canonical discriminant coefficients. This can be done using Excel or any other graphics package. The discriminant functions denote the axes on which the objects/brands are first located, and then the attributes are placed on the same graph.
3. The respondents have the same perception about the dimensionality in assessing
the distances among the objects.
4. Respondents will attach the same level of importance to a dimension, even if all
respondents perceive this dimension.
5. There is no change in the judgments of a stimulus in terms of either dimensions
or levels of importance over time.
Although MDS is widely used for positioning brand images and comparing product characteristics, it has some limitations as well.
1. It is difficult to obtain the similarities and preferences of the respondents toward a group of objects because the perceptions of the subjects may differ considerably.
2. Because every product has many variant models with different characteristics, the group of objects taken for comparing brand images may itself differ on many counts. Due to this, true positioning may not be possible.
3. Preferences change over time, place, and socioeconomic status, and therefore the brand positioning obtained in a particular study may not be generalizable.
4. Bias may exist in the data collection.
5. In the case of nonmetric data, all MDS techniques are subject to the problems of local optima and degenerate solutions.
6. Although metric MDS is more robust than nonmetric MDS and produces good maps, the interpretation of the dimensions, the main task in MDS, is highly subjective and depends upon the questioning of the interviewers.
Example 14.1 Twenty customers were asked to rate 8 cars by being shown cards bearing the names of a pair of cars. All possible pairs of cars were shown, and the customers were asked to rate the dissimilarity of one car from the other on an 8-point scale. If a customer perceived the two cars to be completely dissimilar, a score of 8 was given, and if the two cars were perceived to be exactly similar, a score of 0 was given. The dissimilarity scores so obtained are shown in Table 14.1. Use multidimensional scaling to find the number of dimensions the consumers use in assessing the different brands, and name these dimensions. Develop a perceptual map and position these eight brands of cars in the multidimensional space.
The data file needs to be prepared before using the SPSS commands to generate outputs in multidimensional scaling. The following steps would be performed to get the relevant outputs for further interpretation in the analysis.
(i) Data file: Here, eight variables need to be defined. All these variables shall be defined as ordinal, as the scores are dissimilarity ratings. After preparing the data file by defining the variable names and their labels, it will look like Fig. 14.1.
(ii) Initiating command for multidimensional analysis: After preparing the data
file, click the following command sequence in the Data View:
Analyze → Scale → Multidimensional Scaling (ALSCAL)
The screen shall look like Fig. 14.2.
(iii) Selecting variables for multidimensional scaling: After clicking the Multidimensional Scaling option, SPSS will take you to the window where the variables are selected.
• Select all the variables from left panel to the “Variables” section of the right
panel.
• Click the tag Model in the screen shown in Fig. 14.3.
– Write the minimum and maximum dimensions for which the solution is required. Since, in this problem, there are eight brands, a maximum of a three-dimensional solution shall be obtained. In the case of a larger number of brands, solutions of more dimensions may be investigated.
– Let the other options remain checked by default.
– Click Continue.
• Click the tag Options in the screen shown in Fig. 14.3.
– Check the option "Group plots" in the Display section.
– Let the other options remain checked by default.
– Click Continue.
The screens for these options shall look like Fig. 14.4.
• Click OK for output.
Fig. 14.1 Screen showing data file for the multidimensional scaling in SPSS
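Readers who wish to cross-check the point-and-click run outside SPSS can sketch the same analysis in Python with scikit-learn’s nonmetric MDS. This is only an illustration, not part of the chapter’s SPSS procedure: the dissimilarity matrix below is a random placeholder standing in for the Table 14.1 ratings (which are not reproduced here), two of the brand names are hypothetical, and scikit-learn version 1.4 or later is assumed for the normalized_stress argument.

# A minimal sketch of the ALSCAL run outside SPSS, using scikit-learn's
# nonmetric (ordinal) MDS. The 8 x 8 matrix is a random placeholder for
# the Table 14.1 ratings; replace it with the actual data.
import numpy as np
from sklearn.manifold import MDS

brands = ["Alto", "Wagon R", "Swift", "Santro", "Tata Indica",
          "Ford Figo", "Brand7", "Brand8"]            # last two are hypothetical
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(1, 9, size=(8, 8)), 1)   # placeholder 8-point ratings
diss = upper + upper.T                                # symmetric, zero diagonal

mds = MDS(n_components=3, metric=False,               # metric=False -> nonmetric MDS
          dissimilarity="precomputed", random_state=0,
          normalized_stress="auto")
coords = mds.fit_transform(diss)                      # 8 x 3 stimulus configuration
for name, xyz in zip(brands, coords.round(3)):
    print(name, xyz)
print("stress:", round(mds.stress_, 5))               # normalized (Stress-1) value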
(iv) Getting the output: After clicking the OK option in Fig. 14.3, the outputs of the multidimensional scaling shall be generated in the output window. Selected outputs can be copied into a word-processing file by right-clicking the identified area of the output. Out of the many outputs generated by SPSS, the following relevant outputs have been picked up for discussion:
These outputs, generated by SPSS, are shown in Tables 14.2, 14.3, 14.4, 14.5, 14.6, and 14.7 and Fig. 14.5. Tables 14.2 and 14.3 contain the three-dimensional solution, and the stress value for this solution is 0.07911. Tables 14.4 and 14.5 contain the two-dimensional solution along with its stress value of 0.16611. On the other hand, the one-dimensional solution is shown in Tables 14.6 and 14.7 along with its stress value of 0.42024.
The stress value shows the lack of fit and therefore should be as close to zero as possible. By this criterion, the one-dimensional solution is not good at all, as it has the maximum stress value (0.42024). The two-dimensional solution looks better, as its stress is closer to zero, but the three-dimensional solution is the best because its stress value is the least.
Since there are only eight brands in this problem, it is not possible to get a solution in more than three dimensions. If you have more than 14 or 15 brands, you may try higher-dimensional solutions. To find the optimum solution, one needs to trade off the stress value against the number of dimensions. A short sketch of this trade-off follows.
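The trade-off can be made concrete with a short continuation of the earlier sketch (same caveats: placeholder data, scikit-learn assumed): fit the solution for one, two, and three dimensions and compare the stress values, stopping at the smallest dimensionality after which an extra dimension no longer buys a worthwhile drop, much like reading a scree plot.

# Fit nonmetric MDS in 1, 2 and 3 dimensions and compare stress values,
# analogous to the 0.42024 / 0.16611 / 0.07911 pattern reported above.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
upper = np.triu(rng.integers(1, 9, size=(8, 8)), 1)
diss = upper + upper.T                       # placeholder dissimilarity matrix

for k in (1, 2, 3):
    mds = MDS(n_components=k, metric=False, dissimilarity="precomputed",
              random_state=0, normalized_stress="auto")
    mds.fit(diss)
    print(f"{k}-dimensional solution: stress = {mds.stress_:.5f}")
# Choose the smallest k after which stress stops improving appreciably.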
Three-Dimensional Solution
Based on the stress value, the three-dimensional solution is the best as in that case
the stress value is the least and closest to zero. Therefore, the next task is to define
the names of these three dimensions. The dimensions are attributes of the brands, identified through experience, through knowledge of the market gained from a survey of the customers, or through a combination of these methods. Thus, the three dimensions may be named as follows:
Dimension 1: Spacious
Dimension 2: Fuel economy
Dimension 3: Stylish
Looking at the scores on the three dimensions in Table 14.3, it may be concluded that brands like Wagon R, Swift, and Santro are more spacious than other brands of similar cars. Brands like Tata Indica and Alto are fuel-economical cars, whereas brands like Ford Figo and Swift are more stylish cars.
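Naming a dimension is a judgment call, but the judgment is usually anchored in the extremes: the brands with the largest positive and negative coordinates on a dimension suggest what that dimension measures. A minimal sketch (hypothetical coordinates in place of the Table 14.3 configuration):

# For each dimension, list the brands at the two extremes -- the usual
# starting point when proposing a name for that dimension.
import numpy as np

brands = ["Alto", "Wagon R", "Swift", "Santro", "Tata Indica",
          "Ford Figo", "Brand7", "Brand8"]            # last two are hypothetical
coords = np.random.default_rng(3).normal(size=(8, 3)) # placeholder 8 x 3 config

for d in range(coords.shape[1]):
    order = np.argsort(coords[:, d])                  # indices from low to high
    print(f"Dimension {d + 1}: low = {brands[order[0]]}, "
          f"high = {brands[order[-1]]}")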
Table 14.2 Iteration details for the three-dimensional solution (Young’s S-stress formula 1 is used)
Iteration S-stress Improvement
1 .14535
2 .12004 .02531
3 .11372 .00632
4 .11188 .00184
5 .11126 .00062
Iterations stopped because
S-stress improvement is <.001000
Stress and squared correlation (RSQ) in distances. RSQ values are the proportion of variance of the scaled data (disparities) in the partition (row, matrix, or entire data) which is accounted for by their corresponding distances. Stress values are Kruskal’s stress formula 1.
For matrix,
Stress = .07911 RSQ = .92211
Configuration derived in three dimensions
Two-Dimensional Solution
For the sake of understanding, the perceptual map shall be discussed for the two-dimensional solution. If the two-dimensional solution had been preferred instead of the three-dimensional one, the perceptual map would be as shown in Fig. 14.5. Looking at this figure, brands like Swift, Santro, and Wagon R are perceived to be similar (spacious). Similarly, brands like Tata Indica and Alto are perceived to be similar (fuel economy). In this case, we lose the information on the third dimension, which was “stylishness” in the three-dimensional solution. This loss of information may be critical in some cases. It is therefore advisable to analyze the data with a three-dimensional solution instead of a two-dimensional one, provided the stress value warrants it.
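A perceptual map such as Fig. 14.5 is simply a labeled scatter of the two-dimensional stimulus coordinates. The sketch below shows how such a map can be drawn with matplotlib; the coordinates are hypothetical stand-ins for the Table 14.5 configuration (not reproduced here), and the axis labels borrow the dimension names proposed above.

# Draw a perceptual map from a two-dimensional MDS configuration.
import numpy as np
import matplotlib.pyplot as plt

brands = ["Alto", "Wagon R", "Swift", "Santro", "Tata Indica",
          "Ford Figo", "Brand7", "Brand8"]              # last two are hypothetical
coords2 = np.random.default_rng(1).normal(size=(8, 2))  # placeholder coordinates

fig, ax = plt.subplots()
ax.scatter(coords2[:, 0], coords2[:, 1])
for name, (x, y) in zip(brands, coords2):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
ax.axhline(0, linewidth=0.5)                            # quadrant lines aid reading
ax.axvline(0, linewidth=0.5)
ax.set_xlabel("Dimension 1 (spacious)")
ax.set_ylabel("Dimension 2 (fuel economy)")
ax.set_title("Perceptual map (cf. Fig. 14.5)")
plt.show()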
Table 14.4 Iteration details for the two-dimensional solution (Young’s S-stress formula 1 is used)
Iteration S-stress Improvement
1 .22053
2 .19234 .02820
3 .17623 .01611
4 .16411 .01211
5 .15461 .00950
6 .14791 .00670
7 .14367 .00424
8 .14159 .00208
9 .14139 .00020
Iterations stopped because
S-stress improvement is <.001000
Stress and squared correlation (RSQ) in distances. RSQ values are the proportion of variance of the scaled data (disparities) in the partition (row, matrix, or entire data) which is accounted for by their corresponding distances. Stress values are Kruskal’s stress formula 1.
For matrix,
Stress = .16611 RSQ = .87594
Configuration derived in two dimensions
(i) Start SPSS and prepare the data file by defining the variables and their properties in the Variable View and typing the data column-wise in the Data View.
(ii) In the Data View, follow the below-mentioned command sequence for multidimensional scaling:
Analyze → Scale → Multidimensional Scaling (ALSCAL)
(iii) Move all the variables from the left panel to the “Variables” section of the right panel.
(iv) Click the button Model and write the minimum and maximum dimensions for which the solution is required. Leave the other options at their defaults. Click Continue.
(v) Click the button Options in the screen and check the option “Group plots” in the Display section. Leave the other options at their defaults. Click Continue. Click OK for output.
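One detail the steps above take for granted is how the single dissimilarity matrix in the data file is obtained from many respondents. A minimal sketch follows, assuming each of the 20 customers rated all 28 unordered pairs of the 8 brands on the 0–8 scale (the ratings array here is a random placeholder): average the ratings across customers and arrange the means in a symmetric matrix with zeros on the diagonal.

# Collapse many customers' pairwise ratings into one dissimilarity matrix.
import itertools
import numpy as np

n_brands, n_customers = 8, 20
n_pairs = n_brands * (n_brands - 1) // 2                  # 28 pairs for 8 brands
rng = np.random.default_rng(2)
ratings = rng.integers(0, 9, size=(n_customers, n_pairs)) # placeholder ratings

mean_ratings = ratings.mean(axis=0)                       # average over customers
diss = np.zeros((n_brands, n_brands))
pairs = itertools.combinations(range(n_brands), 2)        # (0,1), (0,2), ..., (6,7)
for (i, j), r in zip(pairs, mean_ratings):
    diss[i, j] = diss[j, i] = r                           # symmetric, zero diagonal
print(diss.round(2))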
Exercise
1. MDS refers to
(a) Multidimensional spaces
(b) Multidirectional spaces
(c) Multidimensional perceptual scaling
(d) Multidimensional scaling
2. Stress is a measure of
(a) Distance between the two brands
(b) Goodness of fit
(c) Correctness of the perceptual map
(d) Error involved in deciding the nomenclature of dimensions
3. Perceptual mapping is a
(a) Graphical representation of the dimensions in multidimensional space
(b) Graphical representation of objects in multidimensional space
(c) Graphical representation of the distances of the objects
(d) Graphical representation of brands in two-dimensional space
4. Dimensions refer to
(a) The brands on which clustering is made
(b) The characteristics of the brands which are clubbed for assessment
(c) The brands which have some attributes common in them
(d) The characteristics on which the evaluation may take place
5. In the dissimilarity-based approach of multidimensional scaling, the input data are
(a) Nominal
(b) Ordinal
(c) Scale
(d) Ratio
6. The solution of multidimensional scaling is accurate if the value of stress is
(a) Less than 1
(b) More than 1
(c) Closer to 0
(d) Closer to 0.5
7. In the attribute-based approach of multidimensional scaling, the input data can be
(a) Interval
(b) Nominal
(c) Ordinal
(d) None of the above
8. One of the assumptions in multidimensional scaling is
(a) The respondents will not rate the objects on the same dimensions.
(b) Dimensions are orthogonal.
(c) Respondents will not attach the same level of importance to a dimension,
even if all respondents perceive this dimension.
(d) Data are nominal.
9. Choose the correct sequence of commands in SPSS for multidimensional scaling.
(a) Analyze → Scale → Multidimensional Scaling (ALSCAL)
(b) Analyze → Multidimensional Scaling (ALSCAL) → Scale
(c) Analyze → Scale → Multidimensional Scaling (PROXSCAL)
(d) Analyze → Multidimensional Scaling (PROXSCAL) → Scale
10. The following solutions are obtained in multidimensional scaling:
(i) One-dimensional solution with stress score = 0.7659
(ii) Two-dimensional solution with stress score = 0.4328
(iii) Three-dimensional solution with stress score = 0.1348
(iv) Four-dimensional solution with stress score = 0.0924
Which solution would you prefer?
(a) ii
(b) i
(c) iv
(d) iii
Assignments
1. A refrigerator company wanted to draw a perceptual map using its consumers’
perceptions regarding its own brand and five competing brands. These six brands
were Samsung, LG, Videocon, Godrej, Sharp, and Hitachi. The customers were
shown a card containing a pair of names of these brands and were asked to rate in
terms of dissimilarity between the two on an 8-point rating scale. The rating of
8 indicates that the two brands are distinctively apart, whereas 1 indicates that the
two brands are exactly similar as perceived by the customers. This exercise was done on all the pairs of brands. The average dissimilarity ratings obtained from all the customers are shown in the following table. Apply multidimensional scaling and interpret your findings by plotting the perceptual map of these brands.

Dissimilarity ratings obtained by the customers on the six brands of refrigerators
Samsung LG Videocon Godrej Sharp Hitachi
Samsung 0 4 3 7 4 3
LG 0 3 8 3 2
Videocon 0 7 3 5
Godrej 0 6 8
Sharp 0 4
Hitachi 0
2. The authorities in a university wanted to assess its teachers as perceived by their
students on a seven-point scale by drawing the perceptual map. Six teachers,
Smith, Anderson, Clark, Wright, Mitchell, and Johnson, were rated by 25 students. A score of 7 indicated that the two teachers were perceived as distinctively apart, whereas a score of 1 indicated that they were perceived as exactly similar. The following dissimilarity matrix was obtained on the basis of the average dissimilarity scores from all 25 students. Using the multidimensional scaling technique, draw the perceptual map.
Table A.1 The normal curve area between the mean and a given z value
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Blair RC (1981) A reaction to ‘Consequences of failure to meet assumptions underlying the fixed
effects analysis of variance and covariance’. Rev Educ Res 51:499–507
Borg I, Groenen P (2005) Modern multidimensional scaling: theory and applications, 2nd edn.
Springer, New York, pp 207–212
Box GEP (1953) Non-normality and tests on variances. Biometrika 40(3/4):318–335, JSTOR
2333350
Buda A, Jarynowski A (2010) Life-time of correlations and its applications, vol 1. Wydawnictwo
51 Niezalezne, Wrocław
Caliński T, Kageyama S (2000) Block designs: a randomization approach, Volume I: Analysis, vol
150, Lecture notes in statistics. Springer, New York
Cameron AC, Windmeijer FAG (1997) An R-squared measure of goodness of fit for some
common nonlinear regression models. J Econom 77(2):329–342
Cattell RB (1966) The scree test for the number of factors. Multivar Behav Res 1(2):245–276.
University of Illinois, Urbana-Champaign, IL
Chatfield C (1993) Calculating interval forecasts. J Bus Econ Stat 11:121–135
Chernoff H, Lehmann EL (1954) The use of maximum likelihood estimates in χ² tests for goodness-of-fit. Ann Math Stat 25(3):579–586. doi:10.1214/aoms/1177728726
Chow SL (1996) Statistical significance: rationale, validity and utility, vol 1, Introducing statistical
methods. Sage Publications Ltd, London
Christensen R (2002) Plane answers to complex questions: the theory of linear models, 3rd edn.
Springer, New York
Clatworthy J, Buick D, Hankins M, Weinman J, Horne R (2005) The use and reporting of cluster
analysis in health psychology: a review. Br J Health Psychol 10:329–358
Cliff N, Keats JA (2003) Ordinal measurement in the behavioral sciences. Erlbaum, Mahwah
Cohen J (1994) The earth is round (p < .05). Am Psychol 49(12):997–1003. This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review
Cohen J, Cohen P, West SG, Aiken LS (2002) Applied multiple regression – correlation analysis for the behavioral sciences. Routledge Academic, New York
Cohen J, Cohen P, West SG, Aiken LS (2003) Applied multiple regression/correlation analysis for
the behavioral sciences, 3rd edn. Erlbaum, Mahwah
Corder GW, Foreman DI (2009) Nonparametric statistics for non-statisticians: a step-by-step
approach. Wiley, Hoboken, New Jersey
Cox TF, Cox MAA (2001) Multidimensional scaling. Chapman and Hall, Boca Raton
Cox DR, Hinkley DV (1974) Theoretical Statistics, Chapman & Hall
Cox DR, Reid N (2000) The theory of design of experiments. Chapman & Hall/CRC, Fl
Cramer D (1997) Basic statistics for social research. Routledge, London
Critical Values of the Chi-Squared Distribution. NIST/SEMATECH e-Handbook of Statistical
Methods. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/
handbook/eda/section3/eda3674.htm
Crown WH (1998) Statistical models for the social and behavioral sciences: multiple regression
and limited-dependent variable models. Praeger, Westport/London
Darlington RB (2004) Factor analysis. http://comp9.psych.cornell.edu/Darlington/factor.htm.
Retrieved 22 July 2011
Devlin SJ, Gnanadesikan R, Kettenring JR (1975) Robust estimation and outlier detection with
correlation coefficients. Biometrika 62(3):531–545. doi:10.1093/biomet/62.3.531. JSTOR
2335508
Ding C, He X (July 2004) K-means clustering via principal component analysis. In: Proceedings of
international conference on machine learning (ICML 2004), pp 225–232. http://ranger.uta.edu/
~chqding/papers/KmeansPCA1.pdf
Dobson AJ, Barnett AG (2008) Introduction to generalized linear models, 3rd edn. Chapman and
Hall/CRC, Boca Raton
Dodge Y (2003) The Oxford dictionary of statistical terms. Oxford University Press, Oxford
Dowdy S, Wearden S (1983) Statistics for research. Wiley, New York
Draper NR, Smith H. Applied regression analysis, Wiley series in probability and statistics. Wiley, New York
Duda RO, Hart PE, Stork DH (2000) Pattern classification, 2nd edn. Wiley Interscience, New York
Fisher RA (1921) On the probable error of a coefficient of correlation deduced from a small sample
(PDF). Metron 1(4):3–32. http://hdl.handle.net/2440/15169. Retrieved 25 Mar 2011
Fisher RA (1924) The distribution of the partial correlation coefficient. Metron 3(3–4):329–332.
http://digital.library.adelaide.edu.au/dspace/handle/2440/15182
Fisher RA (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh, p 43
Fisher RA (1954) Statistical methods for research workers, 12th edn. Oliver and Boyd, Edinburgh,
London
Flyvbjerg B (2011) Case study. In: Denzin NK, Lincoln YS (eds) The Sage handbook of
qualitative research, 4th edn. Sage, Thousand Oaks, pp 301–316
Fotheringham AS, Brunsdon C, Charlton M (2002) Geographically weighted regression: the
analysis of spatially varying relationships. Wiley, Hoboken, NJ
Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat
Assoc 78:553–569
Fox J (1997) Applied regression analysis, linear models and related methods. Sage, Thousand
Oaks, California
Francis DP, Coats AJ, Gibson D (1999) How high can a correlation coefficient be? Int J Cardiol
69:185–199. doi:10.1016/S0167-5273(99)00028-5
Freedman DA (2005) Statistical models: theory and practice. Cambridge University Press,
Cambridge
Freedman DA et al (2007) Statistics, 4th edn. W.W. Norton & Company, New York
Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc (American Statistical
Association) 84(405):165–175. doi:10.2307/2289860. JSTOR 2289860. MR0999675.
http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-4389.pdf
Gayen AK (1951) The frequency distribution of the product moment correlation coefficient in
random samples of any size draw from non-normal universes. Biometrika 38:219–247.
doi:10.1093/biomet/38.1-2.219
Gibbs Jack P, Poston JR, Dudley L (1975) The division of labor: conceptualization and related
measures. Soc Forces 53(3):468–476
Glover DM, Jenkins WJ, Doney SC (2008) Least squares and regression techniques, goodness of
fit and tests, non-linear least squares techniques. Woods Hole Oceanographic Institute, Woods
Hole
Gorsuch RL (1983) Factor analysis. Lawrence Erlbaum, Hillsdale
Green P (1975) Marketing applications of MDS: assessment and outlook. J Market 39(1):24–31.
doi:10.2307/1250799
Greenwood PE, Nikulin MS (1996) A guide to chi-squared testing. Wiley, New York
Hardin J, Hilbe J (2003) Generalized estimating equations. Chapman and Hall/CRC, London
Hardin J, Hilbe J (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College
Station
Harlow L, Mulaik SA, Steiger JH (eds) (1997) What if there were no significance tests? Lawrence
Erlbaum Associates, Mahwah, NJ
Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall/CRC, New York
Hempel CG (1952) Fundamentals of concept formation in empirical science. The University of
Chicago Press, Chicago, p 33
Hettmansperger TP, McKean JW (1998) Robust nonparametric statistical methods, 1st edn,
Kendall’s library of statistics 5. Edward Arnold, London, p xiv+467
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton, FL
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments. I and II, 2nd edn.,
Wiley, New York
Hosmer DW, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York/
Chichester
Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowl Discov 2:283–304
Hubbard R, Armstrong JS (2006) Why we don’t really know what statistical significance means:
implications for educators. J Market Educ 28(2):114. doi:10.1177/0273475306288399
Hubbard R, Parsa AR, Luthy MR (1997) The spread of statistical significance testing in psychol-
ogy: the case of the Journal of Applied Psychology. Theory Psychol 7:545–554
Hutcheson G, Sofroniou N (1999) The multivariate social scientist: introductory statistics using
generalized linear models. Sage Publications, Thousand Oaks
Jardine N, Sibson R (1968) The construction of hierarchic and non-hierarchic classifications.
Comput J 11:177
Jones LV, Tukey JW (December 2000) A sensible formulation of the significance test. Psychol
Methods 5(4):411–414. doi:10.1037/1082-989X.5.4.411. PMID 11194204. http://content.apa.
org/journals/met/5/4/411
Kempthorne O (1952) The design and analysis of experiments, Wiley, New York
Kendall MG (1955) Rank correlation methods. Charles Griffin & Co., London
Kendall MG, Stuart A (1973) The advanced theory of statistics, vol 2: Inference and relationship.
Griffin, London
Kenney JF, Keeping ES (1951) Mathematics of statistics, Pt. 2, 2nd edn. Van Nostrand, Princeton
Kirk RE (1995) Experimental design: procedures for the behavioral sciences, 3rd edn. Brooks/
Cole, Pacific Grove
Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing
the robustness of PCA-based correlation clustering algorithms. In: Scientific and statistical
database management. (Lecture notes in computer science 5069). doi:10.1007/978-3-540-
69497-7_27; ISBN 978-3-540-69476-2, p 418
Kruskal JB, Wish M (1978) Multidimensional scaling, Sage University paper series on quantita-
tive application in the social sciences. Sage Publications, Beverly Hills/London, pp 7–11
Kutner H, Nachtsheim CJ, Neter J (2004) Applied linear regression models, 4th edn. McGraw-
Hill/Irwin, Boston, p 25
Larsen RJ, Stroup DF (1976) Statistics in the real world. Macmillan, New York
Ledesma RD, Valero-Mora P (2007) Determining the number of factors to retain in EFA: an easy-
to-use computer program for carrying out parallel analysis. Pract Assess Res Eval 12(2):1–11
Lee Y, Nelder J, Pawitan Y (2006) Generalized linear models with random effects: unified analysis
via H-likelihood, Chapman & Hall/CRC, Boca Raton, FL
Lehmann EL (1970) Testing statistical hypothesis, 5th edn. Wiley, New York
Lehmann EL (1992) Introduction to Neyman and Pearson (1933) on the problem of the most
efficient tests of statistical hypotheses. In: Kotz S, Johnson NL (eds) Breakthroughs in
statistics, vol 1. Springer, New York (Followed by reprinting of the paper)
Lehmann EL (1997) Testing statistical hypotheses: the story of a book. Stat Sci 12(1):48–52
Lehmann EL, Romano JP (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
Lentner M, Bishop T (1993) Experimental design and analysis, 2nd edn. Valley Book Company,
Blacksburg
Lewis-Beck MS (1995) Data analysis: an introduction. Sage Publications Inc, Thousand Oaks.
California
Lindley DV (1987) “Regression and correlation analysis,” New Palgrave: A dictionary of
economics, vol 4, pp 120–123.
Lomax RG (2007) Statistical concepts: a second course, Lawrence Erlbaum Associates, NJ
MacCallum R (1983) A comparison of factor analysis programs in SPSS, BMDP, and SAS.
Psychometrika 48(48):doi:10.1007/BF02294017
Mackintosh NJ (1998) IQ and human intelligence. Oxford University Press, Oxford, pp 30–31
Maranell GM (2007) Chapter 31. In: Scaling: a sourcebook for behavioral scientists. Aldine
Transaction, New Brunswick/London, pp 402–405
Mark J, Goldberg MA (2001) Multiple regression analysis and mass assessment: a review of the
issues. Apprais J Jan:89–109
Mayo DG, Spanos A (2006) Severe testing as a basic concept in a Neyman-Pearson philosophy of
induction. Br J Philos Sci 57(2):323. doi:10.1093/bjps/axl003
McCloskey DN, Ziliak ST (2008) The cult of statistical significance: how the standard error costs
us jobs, justice, and lives. University of Michigan Press, Ann Arbor, MI
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman and Hall/CRC, Boca
Raton
Mellenbergh GJ (2008) Chapter 8: Research designs: testing of research hypotheses. In: Adèr HJ,
Mellenbergh GJ (eds) (with contributions by D.J. Hand) Advising on research methods:
a consultant’s companion. Johannes van Kessel Publishing, Huizen, pp 183–209
Menard S (2002) Applied logistic regression analysis, Quantitative applications in the social
sciences, 2nd edn. Sage Publications, Thousand Oaks, California
Mezzich JE, Solomon H (1980) Taxonomy and behavioral science. Academic Press, Inc., New
York
Michell J (1986) Measurement scales and statistics: a clash of paradigms. Psychol Bull 3:398–407
Miranda A, Le Borgne YA, Bontempi G (2008) New routes from minimal approximation error to
principal components. Neural Process Lett 27(3):197–207, Springer
Milligan GW (1980) An examination of the effect of six types of error perturbation on fifteen
clustering algorithms. Psychometrika 45:325–342
Morrison D, Henkel R (eds) (2006/1970) The significance test controversy. Aldine Transaction, New Brunswick
Nagelkerke NJD (1991) A note on a general definition of the coefficient of determination. Biometrika 78(3):691–692
Narens L (1981) On the scales of measurement. J Math Psychol 24:249–275
Nelder J, Wedderburn R (1972) Generalized linear models. J R Stat Soc A (General). 135
(3):370–384 (Blackwell Publishing). doi:10.2307/2344614. JSTOR 2344614
Nemes S, Jonasson JM, Genell A, Steineck G (2009) Bias in odds ratios by logistic regression
modelling and sample size. BMC Med Res Methodol 9:56, BioMedCentral
Neyman J, Pearson ES (1933) On the problem of the most efficient tests of statistical hypotheses.
Phil Trans R Soc A 231:289–337. doi:10.1098/rsta.1933.0009
Nickerson RS (2000) Null hypothesis significance tests: a review of an old and continuing
controversy. Psychol Methods 5(2):241–301
Pearson K, Fisher RA, Inman HF (1994) Karl Pearson and R. A. Fisher on statistical tests: a 1935
exchange from nature. Am Stat 48(1):2–11
Perriere G, Thioulouse J (2003) Use of correspondence discriminant analysis to predict the
subcellular location of bacterial proteins. Comput Methods Progr Biomed 70:99–105
Plackett RL (1983) Karl Pearson and the Chi-squared test. Int Stat Rev (International Statistical
Institute (ISI)) 51(1):59–72. doi:10.2307/1402731
Rahman NA (1968) A course in theoretical statistics. Charles Griffin and Company, London
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc
(American Statistical Association) 66(336):846–850. doi:10.2307/2284239, JSTOR 2284239
Rawlings JO, Pantula SG, Dickey DA (1998) Applied regression analysis: a research tool, 2nd edn.
Springer, New York
Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat
42(1):59–66
Rozeboom WW (1966) Scaling theory and the nature of measurement. Synthese 16:170–233
Rummel RJ (1976) Understanding correlation. http://www.hawaii.edu/powerkills/UC.HTM
Schervish MJ (1987) A review of multivariate analysis. Stat Sci 2(4):396–413. doi:10.1214/ss/
1177013111, ISSN 0883-4237. JSTOR 2245530
Schervish M (1996) Theory of statistics. Springer, New York, p 218. ISBN 0387945466
Sen PK, Anderson TW, Arnold SF, Eaton ML, Giri NC, Gnanadesikan R, Kendall MG, Kshirsagar
AM et al (1986) Review: contemporary textbooks on multivariate statistical analysis:
Classification matrix, 392, 395, 404, 405
Cluster analysis, 318
  assumptions, 331
  procedure, 330
  situation suitable for cluster analysis, 331
  solution with SPSS, 333
  steps in cluster analysis, 332
  terminologies used, 318
Clustering criteria, 322
Clustering procedure, 321
  hierarchical clustering, 322
  nonhierarchical clustering (k-means), 326
  two-step clustering, 327
Cluster membership, 354
Coefficient of determination R2, 134, 137
Coefficient of variability, 44
Coefficient of variation, 30, 48
Communality, 360, 362–363, 375
Concomitant variable, 292
Confidence interval, 48
Confirmatory study, 149, 360–361, 392, 399–400
Confusion matrix, 392
Contingency coefficient, 79
Contingency table, 69–70, 73, 76, 79, 178, 262
Correlation coefficient, 3, 104, 141, 176
  computation, 106
  ecological fallacy, 110
  limitations, 111
  misleading situations, 110
  properties, 108
  testing the significance, 111
  unexplained causative relationship, 110
Correlation matrix, 105
  computation, 106
  computing with SPSS, 117
  situations for application, 115
Cox and Snell’s R2, 435
Cramer’s V, 80
Critical difference, 227, 265
Critical region, 171, 175, 183, 185
Critical value, 50, 52, 111, 170–175, 182–185
Crosstab, 69–70, 88, 92

D
Data Analysis, 2, 3
Data cleaning, 9
Data mining, 1
Data warehousing, 1
Degrees of freedom, 70–72, 76, 111, 171, 177–179, 181–183, 185, 191, 226–227, 259–260, 263–265, 417
Dendrogram, 322–323, 329–330, 332, 346
  plotting cluster distances, 349
Dependent variable, 6
Descriptive research, 30
Descriptive statistics, 10, 29–31, 365
  computation with SPSS, 54
Descriptive study, 2, 29, 53
Design of experiments, 222
Detection of errors
  using frequencies, 10
  using logic checks, 10
  using mean and standard deviation, 10
  using minimum and maximum scores, 10
Deviance, 416, 418–419, 434
Deviance statistic, 416, 434–435
Dimensions, 446–447
Discriminant analysis, 389
  assumptions, 396
  discriminant function, 390–396, 398, 404
  procedure of analysis, 394
  research situations for discriminant analysis, 396
  stepwise method, 392
  what is discriminant analysis?, 390
Discriminant model, 390, 395
Discriminant score, 406
Dissection, 318
Dissimilarity based approach of multidimensional scaling, 446
  procedure for multidimensional scaling, 446
  steps for solution, 446
Dissimilarity matrix, 445
Dissimilarity measures, 344, 446
Distance matrix, 322, 446, 447
Distance measure, 318
Distances, 445
Distribution free tests, 3

E
Eigenvalue, 361, 363, 365, 393
Equal occurrence hypothesis, 69
Error variance, 256–257, 259, 262, 292, 298, 419
Euclidean distance, 319–320, 324, 329, 331
Euclidean space, 320
Experimental error, 292
Exploratory study, 149, 360, 392, 430
Exponential function, 415
Extraneous variable, 6
F
Factor, 259
Factor analysis, 359
  assumptions, 366
  characteristics, 367
  limitations, 367
  situations suitable for factor analysis, 367
  solutions with SPSS, 368
  used in confirmatory studies, 360
  used in exploratory studies, 360
  what we do in factor analysis, 365
Factorial ANOVA, 223, 257
Factorial design, 223, 257–258
Factor loading, 362, 365, 366, 379
Factor matrix, 364
Final cluster centers, 350
Forward:LR method, 425, 428, 430–431, 433–434
Frequency distribution, 69
F statistic, 171, 221, 223, 226–227, 229, 262, 264–265
F test, 3, 72, 146, 182
Functions at group centroids, 396
Fusion coefficients, 333, 335, 340, 344

G
Gamma, 80
Goodness of fit, 69, 73, 417

H
Hierarchical clustering, 322, 324, 326–328, 331
  agglomerative clustering, 322–323
  divisive clustering, 322, 325
Homoscedasticity, 366
Homoscedastic relationships, 396
Hypothesis
  alternative hypothesis, 170–171, 173–177, 179–183, 222, 225, 229, 262
  non parametric, 168–169
  null, 72, 74, 77, 111, 112, 169–179, 181–184, 191–193, 221–222, 225, 227, 229–230, 232, 262, 265, 280, 295, 297, 393
  parametric, 168
  research hypothesis, 169–170, 175, 184, 191
Hypothesis construction, 168
Hypothesis testing, 171

I
Icicle plots, 328–329, 331, 333, 335, 348
Identity matrix, 365, 375
Importing data in SPSS
  from an ASCII file, 18
  from the Excel file, 22
Independent variable, 6
Index of quartile variation, 46
Inductive studies, 2
Inferential studies, 2
Initial cluster centers, 349
Interaction, 224, 256, 260, 262
Inter-quartile range
  lower quartile, 41, 42
  upper quartile, 41, 42
Interval data, 1, 3, 4
Interval scale, 3

K
Kaiser’s criteria, 363, 365
k-means clustering, 326, 327, 332
KMO test, 365, 375
Kruskal-Wallis test
Kurtosis, 30, 49–52

L
Lambda coefficient, 79
Least significant difference (LSD) test, 227, 265
Least square method, 143
Left tailed test, 175, 184–185
Leptokurtic curve, 51–52
Level of significance, 72, 77, 111–112, 171–177, 179, 182–185, 192, 227–229, 262, 265
Likelihood ratio test, 417
Linear regression, 133, 143, 145, 292, 298, 419
Linkage methods, 324
  average linkage method, 325
  complete linkage method, 325
  single linkage method, 325
Logistic curve, 415, 417
Logistic distribution, 419
Logistic function, 417, 421
  interpretation, 422
Logistic model with mathematical equation, 421
Logistic regression, 396, 413
V
Variable
  categorical, 5
  continuous, 5
  dependent, 6
  discrete, 5
  extraneous, 6
  independent, 6
Variance, 46, 178
Variance maximizing rotation, 362
Varimax rotation, 364, 366, 379

W
Wald statistics, 436
Ward’s method, 324
Wilk’s Lambda, 394, 395, 404
Within group variation, 260

Z
Z distribution, 168
Z test, 3, 168, 178