Module 6 RM: Advanced Data Analysis Techniques
1) CORRELATION
Correlation and regression analysis are related in the sense that both deal with
relationships among variables. The correlation coefficient is a measure of linear
association between two variables. Values of the correlation coefficient are always
between -1 and +1. A correlation coefficient of +1 indicates that two variables are
perfectly related in a positive linear sense; a correlation coefficient of -1 indicates that
two variables are perfectly related in a negative linear sense, and a correlation coefficient
of 0 indicates that there is no linear relationship between the two variables. For simple
linear regression, the sample correlation coefficient is the square root of the coefficient of
determination, with the sign of the correlation coefficient being the same as the sign of
b1, the coefficient of x1 in the estimated regression equation.
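As a minimal sketch of this relationship in Python (using scipy, with invented x and y values purely for illustration), the sample correlation coefficient can be computed directly and compared with the output of a simple linear regression:

    # Minimal sketch: Pearson r versus the coefficient of determination (r squared)
    # from a simple linear regression. The data below are made up for illustration.
    from scipy import stats

    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.0]

    r, p_value = stats.pearsonr(x, y)    # correlation coefficient, always between -1 and +1
    fit = stats.linregress(x, y)         # simple linear regression of y on x

    print(r)                             # sample correlation coefficient
    print(fit.rvalue**2)                 # coefficient of determination (r squared)
    print(abs(r), abs(fit.rvalue))       # |r| equals the square root of r squared
    print(fit.slope)                     # the sign of r matches the sign of this slope (b1)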
Neither regression nor correlation analyses can be interpreted as establishing cause-and-
effect relationships. They can indicate only how or to what extent variables are associated
with each other. The correlation coefficient measures only the degree of linear
association between two variables. Any conclusions about a cause-and-effect relationship
must be based on the judgment of the analyst.
What is the difference between correlation and linear regression?
Correlation and linear regression are not the same.
What is the goal?
Correlation quantifies the degree to which two variables are related. Correlation does not
fit a line through the data points. You are simply computing a correlation coefficient (r)
that tells you how much one variable tends to change when the other one does. When r is
0.0, there is no relationship. When r is positive, there is a trend that one variable goes up
as the other one goes up. When r is negative, there is a trend that one variable goes up as
the other one goes down.
Linear regression finds the best line that predicts Y from X.
What kind of data?
Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate.
Linear regression is usually used when X is a variable you manipulate (time,
concentration, etc.)
Does it matter which variable is X and which is Y?
With correlation, you don't have to think about cause and effect. It doesn't matter which
of the two variables you call "X" and which you call "Y". You'll get the same correlation
coefficient if you swap the two.
The decision of which variable you call "X" and which you call "Y" matters in
regression, as you'll get a different best-fit line if you swap the two. The line that best
predicts Y from X is not the same as the line that predicts X from Y (however, both those lines have the same value for R2).
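A short sketch (invented data, scipy) illustrates this: regressing Y on X and X on Y gives different best-fit lines, but the same R2:

    # Sketch: the best-fit line predicting Y from X differs from the line
    # predicting X from Y, but both share the same R squared. Data are illustrative.
    from scipy import stats

    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.0, 2.7, 4.1, 4.4, 6.2, 6.1, 7.9, 8.4]

    y_on_x = stats.linregress(x, y)
    x_on_y = stats.linregress(y, x)

    print(y_on_x.slope, y_on_x.intercept)       # line predicting Y from X
    print(x_on_y.slope, x_on_y.intercept)       # a different line predicting X from Y
    print(y_on_x.rvalue**2, x_on_y.rvalue**2)   # R squared is identical in both directions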
Assumptions
The correlation coefficient itself is simply a way to describe how two variables vary
together, so it can be computed and interpreted for any two variables. Further inferences,
however, require an additional assumption -- that both X and Y are measured, and both
are sampled from Gaussian distributions. This is called a bivariate Gaussian distribution.
If those assumptions are true, then you can interpret the confidence interval of r and the P
value testing the null hypothesis that there really is no correlation between the two
variables (and any correlation you observed is a consequence of random sampling).
With linear regression, the X values can be measured or can be a variable controlled by
the experimenter. The X values are not assumed to be sampled from a Gaussian
distribution. The vertical distances of the points from the best-fit line (the residuals) are
assumed to follow a Gaussian distribution, with the SD of the scatter not related to the X
or Y values.
Relationship between results
Correlation computes the value of the Pearson correlation coefficient, r. Its value ranges
from -1 to +1.
Linear regression quantifies goodness of fit with r2, sometimes shown in uppercase as
R2. If you put the same data into correlation (which is rarely appropriate; see above), the
square of r from correlation will equal r2 from regression.
2) FACTOR ANALYSIS
Suppose we measured a number of body dimensions on a sample of people: height, arm length, leg length, chest circumference, distance
between eyes, etc. We'd expect that many of the measurements would be correlated, and
we'd say that the explanation for these correlations is that there is a common underlying
factor of body size. It is this kind of common factor that we are looking for with factor
analysis, although in psychology the factors may be less tangible than body size.
To carry the body measurement example further, we probably wouldn't expect body size
to explain all of the variability of the measurements: for example, there might be a
lankiness factor, which would explain some of the variability of the circumference
measures and limb lengths, and perhaps another factor for head size which would have
some independence from body size (what factors emerge is very dependent on what
variables are measured). Even with a number of common factors such as body size,
lankiness and head size, we still wouldn't expect to account for all of the variability in the
measures (or explain all of the correlations), so the factor analysis model includes a
unique factor for each variable which accounts for the variability of that variable which is
not due to any of the common factors.
Why carry out factor analyses? If we can summarise a multitude of measurements with a
smaller number of factors without losing too much information, we have achieved some
economy of description, which is one of the goals of scientific investigation. It is also
possible that factor analysis will allow us to test theories involving variables which are
hard to measure directly. Finally, at a more prosaic level, factor analysis can help us
establish that sets of questionnaire items (observed variables) are in fact all measuring the
same underlying factor (perhaps with varying reliability) and so can be combined to form
a more reliable measure of that factor.
There are a number of different varieties of factor analysis: the discussion here is limited
to principal axis factor analysis and factor solutions in which the common factors are
uncorrelated with each other. It is also assumed that the observed variables are
standardized (mean zero, standard deviation of one) and that the factor analysis is based
on the correlation matrix of the observed variables.
The factor analysis model expresses each observed (standardized) variable as a linear combination of the common factors plus a unique factor, for example:
X1 = a11F1 + a12F2 + … + a1kFk + U1
X2 = a21F1 + a22F2 + … + a2kFk + U2
and so on for each of the n observed variables X1, …, Xn and k common factors F1, …, Fk. The
coefficients are called loadings (a variable is said to 'load' on a factor) and, when the
factors are uncorrelated, they also show the correlation between each variable and a given
factor. In the model above, a11 is the loading for variable X1 on F1, a23 is the loading
for variable X2 on F3, etc.
When the coefficients are correlations, i.e., when the factors are uncorrelated, the sum of
the squares of the loadings for variable X1, namely a11² + a12² + … + a1k², shows
the proportion of the variance of variable X1 which is accounted for by the common
factors. This is called the communality. The larger the communality for each variable, the
more successful a factor analysis solution is.
By the same token, the sum of the squares of the coefficients for a factor -- for F1 it
would be [a11² + a21² + … + an1²] -- shows the proportion of the variance of all the
variables which is accounted for by that factor.
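As a small numerical sketch (the loading values below are invented for illustration), both quantities can be computed directly from a loading matrix:

    # Sketch: communalities and variance explained per factor from a loading matrix.
    # Rows are variables (X1..X4), columns are uncorrelated common factors (F1, F2).
    # The loading values are invented.
    import numpy as np

    loadings = np.array([
        [0.70, 0.20],
        [0.65, 0.30],
        [0.20, 0.60],
        [0.15, 0.55],
    ])

    communalities = (loadings**2).sum(axis=1)    # a_i1^2 + a_i2^2 for each variable
    factor_variance = (loadings**2).sum(axis=0)  # a_1j^2 + ... + a_nj^2 for each factor

    print(communalities)                         # variance of each variable accounted for
                                                 # by the common factors
    print(factor_variance / loadings.shape[0])   # proportion of total (standardized)
                                                 # variance accounted for by each factor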
What is a factor?
The key concept of factor analysis is that multiple observed variables have similar
patterns of responses because of their association with an underlying latent variable, the
factor, which cannot easily be measured.
For example, people may respond similarly to questions about income, education, and
occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there are the same number of factors as there are variables. Each
factor captures a certain amount of the overall variance in the observed variables, and the
factors are always listed in order of how much variation they explain.
The eigenvalue is a measure of how much of the variance of the observed variables a
factor explains. Any factor with an eigenvalue ≥1 explains more variance than a single
observed variable.
So if the factor for socioeconomic status had an eigenvalue of 2.3 it would explain as
much variance as 2.3 of the three variables. This factor, which captures most of the
variance in those three variables, could then be used in other analyses.
The factors that explain the least amount of variance are generally discarded. Deciding how many factors are useful to retain is a separate question (the scree test discussed later in this module is one common approach).
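As a minimal sketch (simulated data, since no data set accompanies this passage), the eigenvalues of the correlation matrix of the observed variables can be inspected to see how many exceed 1:

    # Sketch: eigenvalues of the correlation matrix of the observed variables.
    # Factors with eigenvalue >= 1 explain more variance than a single variable.
    # The data are simulated from one underlying factor plus noise variables.
    import numpy as np

    rng = np.random.default_rng(0)
    latent = rng.normal(size=200)
    data = np.column_stack(
        [latent + rng.normal(scale=0.7, size=200) for _ in range(3)]
        + [rng.normal(size=200) for _ in range(2)]
    )

    corr = np.corrcoef(data, rowvar=False)        # 5 x 5 correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted, largest first

    print(eigenvalues)
    print((eigenvalues >= 1).sum())               # how many factors pass the >= 1 rule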
What are factor loadings?
The relationship of each variable to the underlying factor is expressed by the so-called
factor loading. Here is an example of the output of a simple factor analysis looking at
indicators of wealth, with just six variables and two resulting factors.
Variables                                            Factor 1    Factor 2
Income                                               0.65        0.11
Education                                            0.59        0.25
Occupation                                           0.48        0.19
House value                                          0.38        0.60
Number of public parks in neighborhood               0.13        0.57
Number of violent crimes per year in neighborhood    0.23        0.55
The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1. Based
on the variables loading highly onto Factor 1, we could call it “Individual socioeconomic
status.”
House value, number of public parks, and number of violent crimes per year, however,
have high factor loadings on the other factor, Factor 2. They seem to indicate the overall
wealth within the neighborhood, so we may want to call Factor 2 “Neighborhood
socioeconomic status.”
Notice that the variable house value also is marginally important in Factor 1 (loading =
0.38). This makes sense, since the value of a person’s house should be associated with his
or her income.
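A rough sketch of how such a loading table might be produced in Python with scikit-learn's FactorAnalysis (the variable set and data are simulated here, and the unrotated loadings will not reproduce the exact numbers in the table above):

    # Sketch: extracting two factors from six standardized indicators of wealth.
    # The data are simulated; loadings will differ from the table in the text.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    n = 500
    individual_ses = rng.normal(size=n)      # hypothetical latent factor 1
    neighborhood_ses = rng.normal(size=n)    # hypothetical latent factor 2

    X = np.column_stack([
        individual_ses + rng.normal(scale=0.8, size=n),                          # income
        individual_ses + rng.normal(scale=0.9, size=n),                          # education
        individual_ses + rng.normal(scale=1.0, size=n),                          # occupation
        0.5*individual_ses + neighborhood_ses + rng.normal(scale=0.9, size=n),   # house value
        neighborhood_ses + rng.normal(scale=1.0, size=n),                        # public parks
        neighborhood_ses + rng.normal(scale=1.0, size=n),                        # violent crimes
    ])
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the observed variables

    fa = FactorAnalysis(n_components=2).fit(X)
    print(fa.components_.T)                    # rows: variables, columns: factor loadings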
3) DISCRIMINANT ANALYSIS
The purposes of discriminant analysis (DA)
Discriminant Function Analysis (DA) undertakes the same task as multiple linear regression by predicting an outcome. However, multiple linear regression is limited to cases where the dependent variable on the Y axis is an interval variable, so that the combination of predictors will, through the regression equation, produce estimated mean population numerical Y values for given values of weighted combinations of X values.
But many interesting variables are categorical, such as political party voting intention,
migrant/non-migrant status, making a profit or not, holding a particular credit card,
owning, renting or paying a mortgage for a house, employed/unemployed, satisfied
versus dissatisfied employees, which customers are likely to buy a product or not buy,
what distinguishes Stellar Bean clients from Gloria Beans clients, whether a person is a
credit risk or not, etc.
DA is used when:
• the dependent is categorical, with the predictor IVs at interval level, such as age, income, attitudes, perceptions, and years of education, although dummy variables can be used as predictors as in multiple regression (logistic regression IVs can be of any level of measurement);
• there are more than two DV categories, unlike logistic regression, which is limited to a dichotomous dependent variable.
The discriminant function takes a form such as:
D = v1X1 + v2X2 + v3X3 + … + viXi + a
where D = the discriminant function, v = the discriminant coefficient or weight for a variable, X = a respondent's score on that variable, a = a constant, and i = the number of predictor variables.
This function is similar to a regression equation or function. The v’s are unstandardized
discriminant coefficients analogous to the b’s in the regression equation. These v’s
maximize the distance between the means of the criterion (dependent) variable.
Standardized discriminant coefficients can also be used, like beta weights in regression.
Good predictors tend to have large weights. What you want this function to do is
maximize the distance between the categories, i.e. come up with an equation that has
strong discriminatory power between groups. After using an existing set of data to
calculate the discriminant function and classify cases, any new cases can then be
classified. The number of discriminant functions is one less than the number of groups. There
is only one function for the basic two group discriminant analysis.
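A minimal sketch of a two-group discriminant analysis using scikit-learn's LinearDiscriminantAnalysis (the data, group labels, and predictor names below are invented for illustration):

    # Sketch: two-group discriminant analysis. The fitted coefficients play the role
    # of the v's, and new cases can be classified after fitting. Data are invented.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(2)
    group0 = rng.normal(loc=[30, 20], scale=5, size=(50, 2))   # e.g. age, income
    group1 = rng.normal(loc=[45, 35], scale=5, size=(50, 2))
    X = np.vstack([group0, group1])
    y = np.array([0]*50 + [1]*50)                              # categorical dependent

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.coef_, lda.intercept_)     # unstandardized discriminant coefficients
    print(lda.predict([[40, 30]]))       # classify a new case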
There are also several requirements for DA:
• the cases should fall into two or more mutually exclusive groups or categories, for instance three groups taking three available levels of amounts of housing loan;
• the groups or categories should be defined before collecting the data;
• the attribute(s) used to separate the groups should discriminate quite clearly between the groups, so that group or category overlap is clearly non-existent or minimal;
• group sizes of the dependent should not be grossly different and should be at least five times the number of independent variables.
There are several purposes of DA:
• To investigate differences between groups on the basis of the attributes of the cases, indicating which attributes contribute most to group separation. The descriptive technique successively identifies the linear combination of attributes known as canonical discriminant functions (equations) which contribute maximally to group separation.
• Predictive DA addresses the question of how to assign new cases to groups. The DA function uses a person's scores on the predictor variables to predict the category to which the individual belongs.
• To determine the most parsimonious way to distinguish between groups.
• To classify cases into groups. Statistical significance tests using chi-square enable you to see how well the function separates the groups.
• To test theory by checking whether cases are classified as predicted.
Discriminant Analysis
Discriminant Analysis may be used for two objectives: either we want to assess the
adequacy of classification, given the group memberships of the objects under study; or
we wish to assign objects to one of a number of (known) groups of objects. Discriminant
Analysis may thus have a descriptive or a predictive objective.
In both cases, some group assignments must be known before carrying out the
Discriminant Analysis. Such group assignments, or labelling, may be arrived at in any
way. Hence Discriminant Analysis can be employed as a useful complement to Cluster
Analysis (in order to judge the results of the latter) or Principal Components Analysis.
Alternatively, in star-galaxy separation, for instance, using digitised images, the analyst
may define group (stars, galaxies) membership visually for a conveniently small training
set or design set.
Methods implemented in this area are Multiple Discriminant Analysis, Fisher's Linear
Discriminant Analysis, and K-Nearest Neighbours Discriminant Analysis.
Multiple Discriminant Analysis
(MDA) is also termed Discriminant Factor Analysis and Canonical Discriminant
Analysis. It adopts a similar perspective to PCA: the rows of the data matrix to be
examined constitute points in a multidimensional space, as also do the group mean
vectors. Discriminating axes are determined in this space, in such a way that
optimal separation of the predefined groups is attained. As with PCA, the problem
becomes mathematically the eigenreduction of a real, symmetric matrix. The eigenvalues represent the discriminating power of the associated eigenvectors. The g group mean vectors lie in a space of dimension at most g - 1. This will be the number of discriminant axes or factors obtainable in the most common practical case, when n > m > g (where n is the number of rows, and m the number of columns of the input data matrix).
Linear Discriminant Analysis
is the 2-group case of MDA. It optimally separates two groups, using the
Mahalanobis metric or generalized distance. It also gives the same linear
separating decision surface as Bayesian maximum likelihood discrimination in the
case of equal class covariance matrices.
K-NNs Discriminant Analysis
Non-parametric (distribution-free) methods dispense with the need for
assumptions regarding the probability density function. They have become very
popular especially in the image processing area. The K-NNs method assigns an
object of unknown affiliation to the group to which the majority of its K nearest
neighbours belongs.
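A short sketch of the K-NN idea with scikit-learn (the two-dimensional data and the star/galaxy labels are invented for illustration):

    # Sketch: assign an object of unknown affiliation to the group to which the
    # majority of its K nearest neighbours belongs. Data are invented.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(3)
    stars = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
    galaxies = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
    X = np.vstack([stars, galaxies])
    y = np.array(["star"]*100 + ["galaxy"]*100)   # labels from a small training/design set

    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict([[2.5, 2.0]]))   # majority vote among the 5 nearest neighbours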
There is no best discrimination method. A few remarks concerning the advantages and
disadvantages of the methods studied are as follows.
• Analytical simplicity or computational reasons may lead to initial consideration of linear discriminant analysis or the NN-rule.
• Linear discrimination is the most widely used in practice. Often the 2-group method is used repeatedly for the analysis of pairs of multigroup data (yielding a separate analysis for each pair of groups).
4) CLUSTER ANALYSIS
Cluster analysis is a convenient method for identifying homogeneous groups of objects
called clusters. Objects (or cases, observations) in a specific cluster share many
characteristics, but are very dissimilar to objects not belonging to that cluster.
The objective of cluster analysis is to identify groups of objects (in this case, customers)
that are very similar with regard to their price consciousness and brand loyalty and assign
them into clusters. After having decided on the clustering variables (brand loyalty and
price consciousness), we need to decide on the clustering procedure to form our groups of
objects. This step is crucial for the analysis, as different procedures require different
decisions prior to analysis. There is an abundance of different approaches and little
guidance on which one to use in practice.
An important problem in the application of cluster analysis is the decision regarding how
many clusters should be derived from the data.
The next step is to choose a clustering algorithm.
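As a minimal sketch (with invented customer scores for price consciousness and brand loyalty), one common choice is k-means clustering, where the number of clusters is specified up front:

    # Sketch: k-means clustering of customers on two clustering variables.
    # Scores are invented; the choice of 3 clusters is arbitrary here.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    price_consciousness = rng.uniform(1, 7, size=60)
    brand_loyalty = rng.uniform(1, 7, size=60)
    X = np.column_stack([price_consciousness, brand_loyalty])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)             # cluster membership for each customer
    print(kmeans.cluster_centers_)    # cluster centres on the two clustering variables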
5) MULTIDIMENSIONAL SCALING
Multidimensional Scaling
General Purpose
Logic of MDS
Computational Approach
How many dimensions to specify?
Interpreting the Dimensions
Applications
MDS and Factor Analysis
General Purpose
Multidimensional scaling (MDS) can be considered to be an alternative to factor analysis
(see Factor Analysis). In general, the goal of the analysis is to detect meaningful
underlying dimensions that allow the researcher to explain observed similarities or
dissimilarities (distances) between the investigated objects. In factor analysis, the
similarities between objects (e.g., variables) are expressed in the correlation matrix. With
MDS, you can analyze any kind of similarity or dissimilarity matrix, in addition to
correlation matrices.
Logic of MDS
The following simple example may demonstrate the logic of an MDS analysis. Suppose
we take a matrix of distances between major US cities from a map. We then analyze this
matrix, specifying that we want to reproduce the distances based on two dimensions. As a
result of the MDS analysis, we would most likely obtain a two-dimensional
representation of the locations of the cities, that is, we would basically obtain a two-
dimensional map.
In general then, MDS attempts to arrange "objects" (major cities in this example) in a
space with a particular number of dimensions (two-dimensional in this example) so as to
reproduce the observed distances. As a result, we can "explain" the distances in terms of
underlying dimensions; in our example, we could explain the distances in terms of the
two geographical dimensions: north/south and east/west.
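A small sketch of this idea using scikit-learn's MDS on an invented, symmetric distance matrix for four hypothetical cities:

    # Sketch: recover a 2-D configuration from an observed distance matrix.
    # The city names and distances are invented for illustration.
    import numpy as np
    from sklearn.manifold import MDS

    cities = ["A", "B", "C", "D"]
    distances = np.array([
        [  0, 300, 400, 700],
        [300,   0, 500, 600],
        [400, 500,   0, 350],
        [700, 600, 350,   0],
    ], dtype=float)

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(distances)   # one (x, y) point per city
    print(coords)
    print(mds.stress_)                      # lack-of-fit of the 2-D configuration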
Orientation of axes. As in factor analysis, the actual orientation of axes in the final
solution is arbitrary. To return to our example, we could rotate the map in any way we
want; the distances between cities remain the same. Thus, the final orientation of axes in
the plane or space is mostly the result of a subjective decision by the researcher, who will
choose an orientation that can be most easily explained. To return to our example, we
could have chosen an orientation of axes other than north/south and east/west; however,
that orientation is most convenient because it "makes the most sense" (i.e., it is easily
interpretable).
Computational Approach
MDS is not so much an exact procedure as rather a way to "rearrange" objects in an
efficient manner, so as to arrive at a configuration that best approximates the observed
distances. It actually moves objects around in the space defined by the requested number
of dimensions, and checks how well the distances between objects can be reproduced by
the new configuration. In more technical terms, it uses a function minimization algorithm
that evaluates different configurations with the goal of maximizing the goodness-of-fit
(or minimizing "lack of fit").
Measures of goodness-of-fit: Stress. The most common measure that is used to evaluate
how well (or poorly) a particular configuration reproduces the observed distance matrix is
the stress measure. The raw stress value Phi of a configuration is defined by:
Phi = Σ [dij - f(δij)]²
In this formula, dij stands for the reproduced distances, given the respective number of dimensions, and δij (delta-ij) stands for the input data (i.e., observed distances). The expression f(δij) indicates a nonmetric, monotone transformation of the observed input
data (distances). Thus, it will attempt to reproduce the general rank-ordering of distances
between the objects in the analysis.
There are several similar related measures that are commonly used; however, most of
them amount to the computation of the sum of squared deviations of observed distances
(or some monotone transformation of those distances) from the reproduced distances.
Thus, the smaller the stress value, the better is the fit of the reproduced distance matrix to
the observed distance matrix.
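As a very small sketch of the metric version of this computation (taking f(δij) = δij and using invented pairwise distances), raw stress can be computed directly:

    # Sketch: raw stress of a configuration, metric case (f(delta_ij) = delta_ij).
    # observed = input distances, reproduced = distances in the fitted configuration.
    import numpy as np

    observed = np.array([4.0, 7.0, 3.0, 6.0])      # delta_ij for each pair (i, j)
    reproduced = np.array([4.2, 6.5, 3.1, 6.4])    # d_ij from the current configuration

    raw_stress = np.sum((reproduced - observed)**2)
    print(raw_stress)    # smaller values indicate a better fit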
Shepard diagram. You can plot the reproduced distances for a particular number of
dimensions against the observed input data (distances). This scatterplot is referred to as a
Shepard diagram. This plot shows the reproduced distances plotted on the vertical (Y)
axis versus the original similarities plotted on the horizontal (X) axis (hence, the
generally negative slope). This plot also shows a step-function. This line represents the
so-called D-hat values, that is, the result of the monotone transformation f(δ) of the
input data. If all reproduced distances fall onto the step-line, then the rank-ordering of
distances (or similarities) would be perfectly reproduced by the respective solution
(dimensional model). Deviations from the step-line indicate lack of fit.
How Many Dimensions to Specify?
If you are familiar with factor analysis, you will be quite aware of this issue. If you are
not familiar with factor analysis, you may want to read the Factor Analysis section in the
manual; however, this is not necessary in order to understand the following discussion. In
general, the more dimensions we use in order to reproduce the distance matrix, the better
is the fit of the reproduced matrix to the observed matrix (i.e., the smaller is the stress). In
fact, if we use as many dimensions as there are variables, then we can perfectly reproduce
the observed distance matrix. Of course, our goal is to reduce the observed complexity of
nature, that is, to explain the distance matrix in terms of fewer underlying dimensions. To
return to the example of distances between cities, once we have a two-dimensional map it
is much easier to visualize the location of and navigate between cities, as compared to
relying on the distance matrix only.
Sources of misfit. Let's consider for a moment why fewer factors may produce a worse
representation of a distance matrix than would more factors. Imagine the three cities A,
B, and C, and the three cities D, E, and F; shown below are their distances from each
other.
      A    B    C               D    E    F
A     0                   D     0
B    90    0              E    90    0
C    90   90    0         F   180   90    0
In the first matrix, all cities are exactly 90 miles apart from each other; in the second
matrix, cities D and F are 180 miles apart. Now, can we arrange the three cities (objects)
on one dimension (line)? Indeed, we can arrange cities D, E, and F on one dimension:
D---90 miles---E---90 miles---F
D is 90 miles away from E, and E is 90 miles away from F; thus, D is 90+90=180 miles
away from F. If you try to do the same thing with cities A, B, and C you will see that
there is no way to arrange the three cities on one line so that the distances can be
reproduced. However, we can arrange those cities in two dimensions, in the shape of a
triangle:
A
90 miles 90 miles
B 90 miles C
Arranging the three cities in this manner, we can perfectly reproduce the distances
between them. Without going into much detail, this small example illustrates how a
particular distance matrix implies a particular number of dimensions. Of course, "real"
data are never this "clean," and contain a lot of noise, that is, random variability that
contributes to the differences between the reproduced and observed matrix.
Scree test. A common way to decide how many dimensions to use is to plot the stress
value against different numbers of dimensions. This test was first proposed by Cattell
(1966) in the context of the number-of-factors problem in factor analysis (see Factor
Analysis); Kruskal and Wish (1978; pp. 53-60) discuss the application of this plot to
MDS.
Cattell suggests finding the place where the smooth decrease of stress values (eigenvalues
in factor analysis) appears to level off to the right of the plot. To the right of this point,
you find, presumably, only "factorial scree" - "scree" is the geological term referring to
the debris which collects on the lower part of a rocky slope.
Interpreting the Dimensions
Sometimes, as in the example of distances between cities, the resultant dimensions are easily interpreted. At
other times, the points in the plot form a sort of "random cloud," and there is no
straightforward and easy way to interpret the dimensions. In the latter case, you should
try to include more or fewer dimensions and examine the resultant final configurations.
Often, more interpretable solutions emerge. However, if the data points in the plot do not
follow any pattern, and if the stress plot does not show any clear "elbow," then the data
are most likely random "noise."
In addition to "meaningful dimensions," you should also look for clusters of points or
particular patterns and configurations (such as circles, manifolds, etc.). For a detailed
discussion of how to interpret final configurations, see Borg and Lingoes (1987), Borg
and Shye (in press), or Guttman (1968).
Applications
The "beauty" of MDS is that we can analyze any kind of distance or similarity matrix.
These similarities can represent people's ratings of similarities between objects, the
percent agreement between judges, the number of times a subject fails to discriminate
between stimuli, etc. For example, MDS methods used to be very popular in
psychological research on person perception where similarities between trait descriptors
were analyzed to uncover the underlying dimensionality of people's perceptions of traits
(see, for example Rosenberg, 1977). They are also very popular in marketing research, in
order to detect the number and nature of dimensions underlying the perceptions of
different brands or products (see, for example, Green & Carmone, 1970).
In general, MDS methods allow the researcher to ask relatively unobtrusive questions
("how similar is brand A to brand B") and to derive from those questions underlying
dimensions without the respondents ever knowing what is the researcher's real interest.
MDS and Factor Analysis
Even though there are similarities in the type of research questions to which these two
procedures can be applied, MDS and factor analysis are fundamentally different methods.
Factor analysis requires that the underlying data are distributed as multivariate normal,
and that the relationships are linear. MDS imposes no such restrictions. As long as the
rank-ordering of distances (or similarities) in the matrix is meaningful, MDS can be used.
In terms of resultant differences, factor analysis tends to extract more factors
(dimensions) than MDS; as a result, MDS often yields more readily interpretable
solutions. Most importantly, however, MDS can be applied to any kind of distances or
similarities, while factor analysis requires us to first compute a correlation matrix. MDS
can be based on subjects' direct assessment of similarities between stimuli, while factor
analysis requires subjects to rate those stimuli on some list of attributes (for which the
factor analysis is performed).
In summary, MDS methods are applicable to a wide variety of research designs because
distance measures can be obtained in any number of ways (for different examples, refer
to the references provided at the beginning of this section).
6) DESCRIPTIVE STATISTICS
Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a
collection of data, or the quantitative description itself. Descriptive statistics are
distinguished from inferential statistics (or inductive statistics), in that descriptive
statistics aim to summarize a sample, rather than use the data to learn about the
population that the sample of data is thought to represent. This generally means that
descriptive statistics, unlike inferential statistics, are not developed on the basis of
probability theory. Even when a data analysis draws its main conclusions using
inferential statistics, descriptive statistics are generally also presented. For example in a
paper reporting on a study involving human subjects, there typically appears a table
giving the overall sample size, sample sizes in important subgroups (e.g., for each
treatment or exposure group), and demographic or clinical characteristics such as the
average age, the proportion of subjects of each sex, and the proportion of subjects with
related comorbidities.
Some measures that are commonly used to describe a data set are measures of central
tendency and measures of variability or dispersion. Measures of central tendency include
the mean, median and mode, while measures of variability include the standard deviation
(or variance), the minimum and maximum values of the variables, kurtosis and skewness.
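A minimal sketch computing these summary measures in Python (numpy and scipy, on a small invented data set):

    # Sketch: common descriptive statistics for a small invented data set.
    import numpy as np
    from scipy import stats

    data = np.array([23, 25, 25, 27, 29, 31, 34, 34, 34, 40])

    print(np.mean(data), np.median(data))              # central tendency: mean, median
    print(stats.mode(data, keepdims=False).mode)       # central tendency: mode
    print(np.std(data, ddof=1), np.var(data, ddof=1))  # dispersion: SD and variance
    print(data.min(), data.max())                      # minimum and maximum
    print(stats.skew(data), stats.kurtosis(data))      # shape: skewness and kurtosis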
Use in statistical analysis
Descriptive statistics provides simple summaries about the sample and about the
observations that have been made. Such summaries may be either quantitative, i.e.
summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may
either form the basis of the initial description of the data as part of a more extensive
statistical analysis, or they may be sufficient in and of themselves for a particular
investigation.
For example, the shooting percentage in basketball is a descriptive statistic that
summarizes the performance of a player or a team. This number is the number of shots
made divided by the number of shots taken. For example, a player who shoots 33% is
making approximately one shot in every three. The percentage summarizes or describes
multiple discrete events. Consider also the grade point average. This single number
describes the general performance of a student across the range of their course
experiences.
The use of descriptive and summary statistics has an extensive history and, indeed, the
simple tabulation of populations and of economic data was the first way the topic of
statistics appeared. More recently, a collection of summarisation techniques has been
formulated under the heading of exploratory data analysis: an example of such a
technique is the box plot.
In the business world, descriptive statistics provide a useful summary of security returns
when researchers perform empirical and analytical analysis, as they give a historical
account of return behavior.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its
central tendency (including the mean, median, and mode) and dispersion (including the
range and quantiles of the data-set, and measures of spread such as the variance and
standard deviation). The shape of the distribution may also be described via indices such
as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted
in graphical or tabular format, including histograms and stem-and-leaf displays.
Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to
describe the relationship between pairs of variables. In this case, descriptive statistics
include:
Cross-tabulations and contingency tables
Graphical representation via scatterplots
Quantitative measures of dependence
Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis does not simply describe each variable separately; it also describes the relationship between two different variables.[5] Quantitative measures of dependence include
correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if
one or both are not) and covariance (which reflects the scale variables are measured on).
The slope, in regression analysis, also reflects the relationship between variables. The
unstandardised slope indicates the unit change in the criterion variable for a one unit
change in the predictor. The standardised slope indicates this change in standardised (z-
score) units. Highly skewed data are often transformed by taking logarithms. Use of
logarithms makes graphs more symmetrical and look more similar to the normal
distribution, making them easier to interpret intuitively.
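A short sketch of these bivariate measures on two invented variables:

    # Sketch: quantitative measures of dependence for two invented variables.
    import numpy as np
    from scipy import stats

    x = np.array([25, 30, 35, 40, 45, 50, 55, 60])    # e.g. age
    y = np.array([18, 22, 27, 30, 33, 35, 36, 38])    # e.g. hourly wage

    print(stats.pearsonr(x, y))     # Pearson's r (both variables continuous)
    print(stats.spearmanr(x, y))    # Spearman's rho (rank-based)
    print(np.cov(x, y)[0, 1])       # covariance (depends on the measurement scale)

    fit = stats.linregress(x, y)
    print(fit.slope)                # unstandardised slope: unit change in y per unit x
    print(fit.slope * x.std(ddof=1) / y.std(ddof=1))   # standardised slope (z-score units)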
7) INFERENTIAL STATISTICS
Inferential Statistics
Unlike descriptive statistics, which are used to describe the characteristics (i.e.
distribution, central tendency, and dispersion) of a single variable, inferential statistics are
used to make inferences about the larger population based on the sample. Since a sample
is a small subset of the larger population (or sampling frame), the inferences are
necessarily error prone. That is, we cannot say with 100% confidence that the
characteristics of the sample accurately reflect the characteristics of the larger population
(or sampling frame) too. Hence, only qualified inferences can be made, within a degree
of certainty, which is often expressed in terms of probability (e.g., 90% or 95%
probability that the sample reflects the population).
Typically, inferential statistics deals with analyzing two (called BIVARIATE analysis) or
more (called MULTIVARIATE analysis) variables. In this discussion, we will limit
ourselves to 2 variables, i.e. BIVARIATE ANALYSIS.
There are different types of inferential statistics that are used. The type of inferential
statistics used depends on the type of variable (i.e. NOMINAL, ORDINAL, INTERVAL/
RATIO). While the type of statistical analysis is different for these variables, the main
idea is the same: we try to determine how one variable compares to another. Values of
one variable could be systematically higher/ lower/ or the same as the other (e.g., men's
and women's wages). Alternatively, there could be a relationship between the two (e.g.
age and wages), in which case, we find the correlation between them. The different types
of analysis could be summarized as below:
Type of variable                            Inferential statistics
Nominal (e.g. GENDER: male and female)      Compare the DISTRIBUTION and CENTRAL TENDENCY [carry out a separate test to check the validity (i.e. margin of error) of the comparison, in which DISPERSION measures are used]
Ordinal (e.g. class grades)                 Beyond scope [should be taught in a Statistics class]
Ratio/Interval (e.g. AGE and WAGE)          Regression Analysis
FIRST STEP
The first step in the regression analysis is to chart the X and Y values graphically to
visually see if there is indeed a relationship between the X and the Y. X is typically on
the horizontal (x) axis; Y is typically on the vertical (y) axis. This chart of plotted values
is called a scatterplot. The scatterplot should give you a good visual clue as to whether X
and Y are related or not. See the charts below. A POSITIVE association between AGE
and WAGES would have an upward trend (positive slope), where higher WAGES
correspond to higher AGE and lower WAGES correspond to lower AGE. A NEGATIVE
association would be indicated by the opposite effect (negative slope), where the older
individuals (i.e. higher AGE) have lower WAGES than the younger individuals (i.e.
lower AGE) (this could arguably apply in computer programming, which is a relatively
young field). A RANDOM association (i.e. zero association) is one where the scatterplot
does not indicate any trend (i.e. either positive or negative). In this case, young as well as
old individuals may expect to earn high or low earnings (i.e. the trend would be flat).
There are, however, many cases where the relationship between X and Y may not be as
linear; the relationship may be curvilinear, e.g., U or reverse U. For example, WAGES
might rise with AGE up to a certain number of years (say, retirement), and decrease after
that (a reverse U). All of this information can be visually gleaned from the scatterplot.
Examine the following scatterplots.
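A minimal sketch of this first step with matplotlib (the AGE and WAGE values are invented for illustration):

    # Sketch: scatterplot of WAGE against AGE to visually check for a relationship.
    # The values are invented.
    import matplotlib.pyplot as plt

    age = [22, 25, 28, 32, 36, 40, 45, 50, 55, 60]
    wage = [14, 17, 19, 23, 26, 30, 33, 36, 38, 40]

    plt.scatter(age, wage)    # X (AGE) on the horizontal axis, Y (WAGE) on the vertical
    plt.xlabel("AGE (years)")
    plt.ylabel("WAGE (hourly)")
    plt.title("Scatterplot of WAGE versus AGE")
    plt.show()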
SECOND STEP
The second step is to test the relationship mathematically. We will deal only with LINEAR
relationships here. In a linear relationship, if you recall high school mathematics, the
relationship between X and Y can be described by a single line. A line is given by the
equation:
Y = A + B * X, where
Y = Dependent variable;
X = Independent variable;
A = Intercept on the Y axis;
B = Slope (or gradient).
[In different books you might see the slope represented by m and the intercept represented by c.]
I will not get into the statistical procedures for how to calculate the values for A and B;
these are covered in the class on statistics [You can simply calculate this using Excel, as
shown in class]. Here, my interest is more in explaining and interpreting what these
values mean. From the scatterplot and the regression line, you should be able to more
precisely understand the relationship between X and Y. There are several likely
scenarios:
(a) the line is at 45 degrees (i.e. B = 1), which means that X and Y have a perfect
relationship (i.e. for 1 unit increase in X, there is a corresponding 1 unit increase in Y).
That means our hypothesis is fully true. However, this is rarely the case in most social
science studies;
(b) the line is off from 45 degrees but is inclined close to it (i.e. B~1), which means X
and Y are indeed related (i.e. for 1 unit increase in X, there is a fractional increase in Y).
If there is a positive slope (i.e. the line is inclined upward), the hypothesis holds true; if
there is a negative slope, the hypothesis does not hold true. This is more likely to be the
case in many occasions.
(c) the line is vertical or horizontal (i.e. B=0 or infinite), which means X and Y are not
related. This means our hypothesis is not true.
Thus the value B tells much about the relationship between the Independent and
Dependent variables.
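For illustration, here is a minimal Python sketch (scipy's linregress, with invented AGE and WAGE values) that estimates A and B for such a line:

    # Sketch: estimating the intercept (A) and slope (B) of Y = A + B * X.
    # The AGE and WAGE values are invented for illustration.
    from scipy import stats

    age = [22, 25, 28, 32, 36, 40, 45, 50, 55, 60]
    wage = [14, 17, 19, 23, 26, 30, 33, 36, 38, 40]

    fit = stats.linregress(age, wage)
    print(fit.intercept)    # A: intercept on the Y axis
    print(fit.slope)        # B: slope (gradient)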
The regression equation is really useful in predicting the value of Y for a given value of
X. That is, in the above example of relationship between AGE and WAGE, you will be
able to predict what WAGE one will earn at a particular AGE, when the values of A and
B are given. Thus, suppose the regression equation between AGE and WAGE is given as
(A= -6; B= 0.9):
WAGE = -6 + 0.9 * AGE [WAGE is hourly; AGE is in years]
Then, at the AGE of 45, the person could expect to receive: -6 + 0.9 * 45 = -6 + 40.5 = $34.5 per hour.
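Checking that prediction in a couple of lines (same hypothetical A = -6 and B = 0.9):

    # Sketch: predicting WAGE from AGE with the equation WAGE = -6 + 0.9 * AGE.
    A, B = -6, 0.9
    age = 45
    wage = A + B * age
    print(wage)    # 0.9 * 45 = 40.5, so the predicted hourly wage is 34.5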
[The value A is the value of Y when X = 0. This value is of no statistical use unless X can
actually take values near 0.]
THIRD STEP
Obviously, from the scatterplot and regression equation, you should now be able to
predict if there is indeed any relationship between the Independent and Dependent
variables. The third step tells you how much of an effect the Independent variable has on
the Dependent variable. Here, we calculate the Correlation coefficient. This coefficient,
also called Pearson's R, gives the strength of relationship between the two variables.
[Again, I am not describing how to calculate; this should be covered in Statistics class;
you can simply do this using Excel as shown in class]. The value of Pearson's R can range anywhere between -1 and +1; its absolute value indicates the strength of the relationship. Generally, in social science, an absolute value of R above 0.6 indicates a strong relationship between the two variables. A value between 0.3 and 0.6 indicates a moderate relationship. Anything below 0.3 indicates a weak relationship.
More generally, the value of R-squared (i.e. the squared value of Pearson's R) is
calculated to give the percentage strength of relationship between the independent and
dependent variables. Similar to R, R-squared value could be anywhere between 0 and 1.
Let's say in the above example, the Pearson's R is 0.7. This value indicates that there is a
strong relationship between AGE and WAGES. The R-squared value is 0.7 * 0.7 = 0.49.
This means that AGE accounts for 49% of the variation in one's WAGES. [The other 51 percent could be due to other factors, such as education, etc.]
There are additional steps required to test if the values of R and R-squared above are
indeed reliable; these should be covered in your Statistics class.
Inferential statistics allow you to:
• generalize beyond the characteristics of your sample;
• test the relationship between your independent (causal) variables and your dependent (effect) variables.
Why use inferential statistics?
• Top-tiered journals will not publish articles that do NOT use inferential statistics.
• Inferential statistics allow you to generalize your findings to the larger population.
For example, you draw a random sample from the population you want to study and, using a pre-established formula, determine that your sample size is large enough.
The following types of inferential statistics are relatively common and relatively easy to
interpret:
One sample test of difference/One sample hypothesis test
Confidence Interval
Contingency Tables and Chi Square Statistic
T-test or ANOVA
Pearson Correlation
Bi-variate Regression
Multi-variate Regression
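As a brief sketch, a few of these tests can be run with scipy on invented data (the numbers are purely illustrative):

    # Sketch: a few common inferential statistics on invented data.
    import numpy as np
    from scipy import stats

    sample = np.array([5.1, 4.9, 5.4, 5.0, 5.3, 5.2, 4.8, 5.5])
    print(stats.ttest_1samp(sample, popmean=5.0))    # one-sample t-test

    table = np.array([[30, 20], [25, 35]])           # a 2 x 2 contingency table
    print(stats.chi2_contingency(table))             # chi-square test of independence

    group_a = np.array([3.1, 3.4, 2.9, 3.6, 3.3])
    group_b = np.array([3.9, 4.1, 3.8, 4.3, 4.0])
    print(stats.ttest_ind(group_a, group_b))         # two-sample t-test
    print(stats.pearsonr(group_a, group_b))          # Pearson correlation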
INTRODUCTION
Factor analysis is a method for investigating whether a number of variables of interest
Y1, Y2, …, Yl, are linearly related to a smaller number of unobservable factors F1, F2, …, Fk.
The fact that the factors are not observable disqualifies regression and other methods
previously examined. We shall see, however, that under certain conditions the
hypothesized factor model has certain implications, and these implications in turn can be
tested against the observations. Exactly what these conditions and implications are, and
how the model can be tested, must be explained with some care.
Factor Analysis
Factor analysis is a statistical method used to study the dimensionality of a set of
variables. In factor analysis, latent variables represent unobserved constructs and are
referred to as factors or dimensions.
• Exploratory Factor Analysis (EFA)
Used to explore the dimensionality of a measurement instrument by finding the smallest number of interpretable factors needed to explain the correlations among a set of variables – exploratory in the sense that it places no structure on the linear relationships between the observed variables and the factors, but only specifies the number of latent variables.
• Confirmatory Factor Analysis (CFA)
Used to study how well a hypothesized factor model fits a new sample from the same
population or a sample from a different population – characterized by allowing
restrictions on the parameters of the model.
Applications of Factor Analysis
• Personality and cognition in psychology
• Child Behavior Checklist (CBCL)
• MMPI
• Attitudes in sociology, political science, etc.
• Achievement in education
• Diagnostic criteria in mental health
Issues
• History of EFA versus CFA
• Can hypothesized dimensions be found?
• Validity of measurements
A Possible Research Strategy For Instrument Development
1. Pilot study 1
• Small n, EFA
• Revise, delete, and add items
2. Pilot study 2
• Small n, EFA
• Formulate tentative CFA model
3. Pilot study 3
• Larger n, CFA
• Test model from Pilot study 2 using random half of the sample
• Revise into new CFA model
• Cross-validate new CFA model using other half of data
4. Large scale study, CFA
5. Investigate other populations