Hoffmann - Linear Regression Analysis - Second Edition
John P. Hoffmann
Department of Sociology Brigham Young University
Contents
Preface to the second edition
1  A Review of Some Elementary Statistical Concepts
2  Simple Linear Regression Analysis
3  Multiple Linear Regression Analysis
4  The ANOVA Table and Goodness-of-Fit Statistics
5  Comparing Linear Regression Models
6  Dummy Variables in Linear Regression Models
7  Specification Errors in Linear Regression Models
8  Measurement Errors in Linear Regression Models
9  Collinearity and Multicollinearity
10 Nonlinear Associations and Interaction Terms
11 Heteroscedasticity and Autocorrelation
12 Influential Observations: Leverage Points and Outliers
13 A Brief Introduction to Logistic Regression
14 Conclusions
Appendix A: Data Management
Appendix B: Formulas
Numerous books and articles explain how to conduct what is usually termed a linear regression analysis or estimate a linear regression model. Students are usually exposed to this model in a second course on applied statistics. As we shall learn, the reason for its popularity is because (1) it is relatively easy to understand; (2) statistical software designed to estimate it is widely available; and (3) it is a flexible and powerful method for conducting important types of analyses. Nonetheless, even though linear regression models are widely used and offer an important set of tools for understanding the association between two or more variables, they are often misused by those who fail to take into account the many assumptions analysts are forced to make. For instance, a linear regression model does not provide suitable results when two or more of the explanatory variables (if you dont know this term, you will; just keep reading) have high correlations (see Chapter 9). Similarly, if one of the data points is substantially different from all the others, then the linear regression model is forced to compensate, often with poorly fitting results (see Chapter 12). One of the goals of this presentation, therefore, is to provide a relatively painless overview of the assumptions we make when using linear regression models. The main purpose of this presentation, though, is to show the reader how to use linear regression models in studies that include quantitative data. Specific objectives include discussing why linear regression models are used, what they tell us about the relationships between two or more variables, what the assumptions of the model are and how to determine if they are satisfied, what to do when the assumptions are not satisfied (sometimes we may still use linear regression by making some simple modifications to the model), and what to do when the outcome variable is measured using only two categories. As we shall learn, linear regression models are not designed for two-category outcome variables, so well discuss another model, known as a logistic regression model, which is designed for these types of variables (see Chapter 13).
I have been teaching these methods for several years and have seen many students struggle and more succeed. Thus, I know how important it is that students are familiar with some standard statistical concepts before beginning to learn about linear regression analysis. Hence, the first chapter is designed as a quick and dirty review of elementary statistics. For the reader who has not been exposed to this material (or who has forgotten it), I recommend reviewing a basic statistics textbook to become familiar with means, medians, standard deviations, standard errors, z-scores, t-tests, correlations and covariances, and analysis-of-variance (ANOVA). I also suggest that the reader take some time to learn Stata, a statistical software package designed to carry out most of the analyses presented herein. It is a relatively easy program to master, especially for students who have used some type of spreadsheet software or other statistical software. I find that Stata combines the best of all software worlds: ease of use and a comprehensive set of tools. It allows users several ways to request regression models and other statistical procedures, including through a command line, user-defined files (called .do files; which are simple programs of instruction), and drop-down menus. In this presentation, we will rely on the command line approach, although I always encourage my students to write out programs using Stata .do files so they have a record of the commands used. Recording work in Statas log files is also strongly recommended. The chapters follow the typical format for books on linear regression analysis. As mentioned earlier, we first review elementary statistics. This is followed by an introductory discussion of the simple linear regression model. Second, we learn how to interpret the results of the linear regression model. Third, we see how to add additional explanatory variables to the model. This transforms it into a multiple linear regression model. We then learn about goodness-of-fit statistics, model comparison procedures, and dummy explanatory variables. Fourth, we move into an in-depth discussion of the assumptions of linear regression models. We spend several chapters on exciting (okay, too strong of a word) and mystifying topics such as multicollinearity,
heteroscedasticity, autocorrelation, and influential observations. We finish the presentation by learning about the logistic regression model, which, as mentioned earlier, is designed for outcome variables that include only two categories (e.g., Do you support the death penalty for murder? 0 = no, 1 = yes). There is an important issue that I hope readers will consistently consider as they examine this text. Statistics has, for better or worse, been maligned by many observers in recent years. Books with titles such as How to Lie with Statistics are popular and can lend an air of disbelief to many studies that use statistical analysis. Researchers and statistics educators are often to blame for this disbelief. We frequently fail to impart two things to students and consumers: (1) a healthy dose of skepticism; and (2) encouraging them to use their imagination and common sense when assessing data and the results of analyses. I hope that readers of the following chapters will be comfortable using their imagination and reasoning skills as they consider this material and as they embark on their own quantitative studies. In fact, I never wish to underemphasize the importance of imagination for the person using statistical techniques. Nor should we suspend our common sense and knowledge of the research literature simply because a set of numbers demonstrates some unusual conclusion. This is not to say that statistical analysis is not valuable or that the results are misleading. Clearly, statistical analysis has led to many important discoveries in medicine and the social sciences, as well as informed policy in a positive way. The point I wish to impart is that we need a combination of tools including statistics, but also our own ideas and reasoning abilities to help us understand important social and behavioral issues.
I strongly recommend that you become familiar with Stata's help menu, as well as similar tools such as the findit command. These are invaluable sources for learning about Stata. The UCLA Academic Technology Service website on statistical software (http://www.ats.ucla.edu/stat/stata) is an excellent (and free) resource for learning how to use Stata.
Acknowledgements
I would first of all like to thank the many students, undergraduate and graduate, who have completed courses with me on this topic or on other statistical topics. I have learned so much more from them than they will ever learn from me. I have had the privilege of having as research assistants Scott Baldwin, Colter Mitchell, and Bryan Johnson. Each has contributed to my courses on statistics in many ways, almost all positive! Karen Spence, Kristina Beacom and Amanda Cooper helped me put together the chapters. I am indebted to them for the wonderful assistance they provided. Wade Jacobsen did excellent work on the second edition. He helped me revise each chapter to show how Stata may be used to conduct linear regression analysis.
Probabilities are normally presented using, not surprisingly, the letter P. One way to represent the probability of a five from a roll of a die is with P(5). So we may write P(5) = 0.167 or P(5) = 1/6. You might recall that some statistical tests, such as t-tests (see the description later in the chapter) or ANOVAs, are often accompanied by p-values. As we shall learn, p-values are a type of probability value used in many statistical tests. By combining the principles of probability and elementary statistical concepts, we may develop the basic foundations for statistical analysis. In general, there are two types of statistical analyses: Descriptive and inferential. The former set of methods is normally used to describe or summarize one or more variables (recall that the term variables is used to indicate phenomena that can take on more than one value; this contrasts with constants, or phenomena that take on only one value). Some common terms that you are probably familiar with are measures of central tendency and measures of dispersion. Well see several of these measures a little later. Then there are the many graphical techniques that may be used to see the variable. You might recognize techniques such as histograms, stem-and-leaf plots, dot-plots, and box-and-whisker plots. Inferential statistics are designed to infer or deduce something about a population from a sample. Suppose, for instance, that we are interested in determining who is likely to win the next Presidential election in the United States. Well assume there are only two candidates from which to choose: Clinton and Palin. Of course, it would be enormously expensive to ask each person who is likely to vote in the next election their choice of President. Therefore, we may take a sample of likely voters and ask them who they plan to vote for. Can we infer anything about the population of voters based on our sample? The answer is that it depends on a number of factors. Did we collect a good sample? Were the people who responded honest? Do people change their minds as the election approaches? We dont have time to get into the many issues involved in sampling, so well have to assume that our sample is a good representation of the population
from which it is drawn. Most important for our purposes is this: Inferential statistics include a set of techniques designed to help us answer questions about a population from a sample. Another way of dividing up statistics is to compare techniques that deal with one variable from those that deal with two or more variables. Most readers of this presentation will likely be familiar with techniques designed for one variable. These include, as we shall see later, most of the descriptive statistical methods. The bulk of this presentation, at least in later chapters, concerns a technique designed for analyzing two or more variables simultaneously. A key question that motivates us is whether two or more variables are associated in some way. As the values of one variable increase, do the values of the other variable also tend to increase? Or do they decrease? In elementary statistics students are introduced to covariances and correlations, two techniques designed to answer these questions generally. But recall that you are not necessarily saying that one variable somehow changes another variable. Remember the maxim: Correlation does not equal causation? Well try to avoid the term causation in this presentation because it involves many thorny philosophical issues (see Judea Pearl (2000), Causality: Modeling, Reasoning, and Inference, New York: Cambridge University Press). Nonetheless, one of our main concerns is whether one or more variables is associated with another variable in a systematic way. Determining the characteristics of this association is one of the main goals of the linear regression model that we shall learn about later.
Suppose we asked a sample of people to report their weights and they rounded the weights to the nearest kilogram. What would be your best guess of the average weight among the sample? Though it is not always the best choice, the most frequently used measure is the arithmetic mean, which is computed using the following formula:
$E[X] = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
The term on the left-hand side of the equation is E[X]. This is a short-hand way of saying that this is the expected value of the variable X. It is often used to represent the mean. To be more precise, we might also list this term as E[weight in kg], but normally, as long as it's clear that X = weight in kilograms, using E[X] is sufficient. The middle term, read as x-bar, may also be familiar as a common symbol for the mean. The formula for computing the mean is rather simple. We add all the values of the variable and divide this sum by the number of observations. Note that the rather cumbersome symbol that looks like an overgrown E in the right-hand part of the equation is the summation sign; it tells us to add whatever is next to it. The symbol xᵢ denotes the specific values of the x variable, or the individual weights that we've measured. The subscript i indicates each observation. The symbol n represents the sample size. Sometimes the individual observations are represented as i = 1, …, n. If you know that n = 5, then you know there are five individual observations in your sample. In statistics, we often use upper-case Roman letters to represent population values and lower-case Roman letters to represent sample values. Therefore, when we say E[X] = x̄, we are implying that our sample mean estimates the population expected value, or the population mean. Here's a simple example: We have a sample of people's weights that consists of the following set: [84, 75, 80, 69, 90]. The sum of this set is 84 + 75 + 80 + 69 + 90 = 398; therefore the mean is 398/5 = 79.6. Another way of thinking about this mean value is that it represents the center of gravity. Suppose we have a plank of wood that is magically weightless (or of uniform weight across its span). We order
the people from lightest to heaviest trying to space them out proportional to their weights and ask them to sit on our plank of wood. The mean is the point of balance, or the place at which we would place a fulcrum underneath to balance the people on the plank.
[Figure: five weights of 69, 75, 80, 84, and 90 kg arranged along a plank, which balances at the mean of 79.6.]
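If you want to verify this arithmetic in Stata, here is a minimal sketch; the variable name weight is invented for the illustration:

    * a minimal sketch: enter the five example weights by hand and summarize them
    clear
    input weight
    84
    75
    80
    69
    90
    end
    summarize weight
    display r(mean)    // 79.6, the same value computed by hand above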
There are some additional things you should know about the mean. First, it is measured in the same units as the observations. If your observations are not all measured in the same unit (e.g., some peoples weights are in kilograms, others in pounds), then the mean cannot be interpreted. Second, the mean provides a good measure of central tendency if the variable is measured continuously and is normally distributed. What do these terms mean? A variable is measured continuously or we say the variable is continuous if it can conceivably take on any real number. Of course, we usually cannot be so precise when we measure things, so it is not uncommon to round our measures to whole numbers or integers. We also often measure things using positive numbers only; it makes little sense, for instance, to try to measure a persons weight using negative numbers. The other type of variable is known as discrete or categorical; these variables have a finite number of possible values. For example, we normally use only two categories to measure gender: female and male. Hence, this is a discrete variable. We say a variable is normally distributed if it follows a bell-shaped curve. Order the values of the variable from lowest to highest and then plot them by their frequencies or the percent of observations that have a particular value (we must assume that there are many values of our variable). We may then view the shape of the distribution of the variable. The graph on the next page shows an
example of a bell-shaped distribution, usually termed a normal or Gaussian distribution, using a simulated sample of weights. (It is known as Gaussian after the famous German mathematician, Carl Friedrich Gauss, who purportedly discovered it.)
[Figure: Distribution of Weights. A histogram of a simulated sample of weights showing a bell-shaped curve; the vertical axis is the percent of observations.]
We shall return to means and the normal distribution frequently in this presentation. To give you a hint of what is to come, the linear regression model is designed, in part, to predict means for particular sets of observations in the sample. For instance, if we have information on the heights of our sample members, we may wish to use this information to predict their weights. Our predictions could include predicting the mean weight of people who are 72 centimeters tall. We may use a linear regression model to do this. But, suppose that our variable does not follow a normal distribution. May we still use the mean to represent the average value? The simple answer is yes, as long as the distribution does not deviate too far from the normal distribution. In many situations in the social and behavioral sciences, though, variables do not have normal distributions. A good example of this involves annual income. When we ask a sample of people about their annual incomes, we usually find
that a few people earn a lot more than others. Measures of income typically provide skewed distributions, with long right tails. If asked to find a good measure of central tendency for income, there are several solutions available. First, we may take the natural (Napierian) logarithm of the variable. You might recall from an earlier mathematics course that using the natural logarithm (or the base 10 logarithm) pulls in extreme values. If this is not clear, try taking your calculator and using the LN function with some large and small values (e.g., 10 and 1,500). You will easily see the effect this has on the values of a variable. If you're lucky, you may find that taking the natural logarithm of a variable with a long right tail transforms it into a normal distribution. The square root or cube root of a variable may also work to normalize a skewed distribution. We'll see examples of this in Chapter 10. Second, there are several direct measures of central tendency appropriate for skewed distributions (or other distributions plagued by extreme values such as outliers; see Chapter 12), such as the trimmed mean and the median. The trimmed mean cuts off some percentage of values from the upper and lower ends of the distribution, usually five percent, and uses the remaining values to compute the mean value. The median should be familiar to you. It is the middle value of the distribution. In order to find it, we first order the values of the variable from lowest to highest. Then we choose the middle value if there are an odd number of observations, or the average of the middle two values if there are an even number of observations. If you are familiar with percentiles (or quartiles or deciles), then you might recall that the median is the 50th percentile of a distribution. The median is known as a robust statistic because it is relatively unaffected by extreme values. As an example, suppose we have two variables, one that follows a normal distribution (or something close to it) and another that has an extreme value:

Variable 1: [45, 50, 55, 60, 65, 70, 75]
Variable 2: [46, 51, 54, 59, 66, 71, 375]
Variable 1 has a mean of 60 and a median of 60, so we make the same estimate of its central value regardless of which measure is used. In contrast, Variable 2 has a mean of 103, but a median of 59. Although we might debate the point, I think most people would agree that the median is a better representative of the average value than the mean for Variable 2. The next issue to address from elementary statistics involves measures of dispersion. As the name implies, these measures consider the spread of the distribution of a variable. Most readers are familiar with the term standard deviation, since it is the most common measure for continuous variables. However, before seeing the formula for the standard deviation, it is useful to consider some other measures of dispersion. The most basic measure is the sum of squares, or SS[X]:
$SS[X] = \sum_{i=1}^{n} (x_i - \bar{x})^2$
This formula first computes deviations from the mean $(x_i - \bar{x})$, squares each one, and adds them up. If you've learned about ANOVA models, the sum of squares should be familiar. Perhaps you even recall that there are various forms of the sum of squares. We'll learn more about these in Chapter 4. A second measure of dispersion that is likely more familiar to you is the variance, or Var[X]. It is often labeled as s². The formula is
$Var[X] = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Notice that another way of computing the variance is to take the sum of squares and divide it by the sample size minus one. One of the drawbacks of the variance is that it is measured not in units of the variable, but rather in squared units of the variable. To transform it into the same units as the variable, it is therefore a simple matter to take the square root of the variance. This measure is the standard deviation (it is also denoted by the letter s):
$SD[X] = s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
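Continuing the small hypothetical weight example from earlier, Stata's summarize command stores the variance and standard deviation, so these formulas can be checked directly:

    quietly summarize weight
    display r(Var)    // the sample variance, 261.2/4 = 65.3
    display r(sd)     // its square root, the standard deviation, about 8.08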
A variable's distribution, assuming it is normal, is often represented by its mean and standard deviation (or variance). In short-hand form, this is listed as $x \sim N(\bar{x}, s)$ (the wavy line means distributed as). Obviously, a variable that is measured in the same units as another and that shares the same mean is less dispersed if its standard deviation is smaller. Although not often used, another promising measure of dispersion is the coefficient of variation (CV), which is computed by dividing the standard deviation by the mean of the variable (s / x̄). It is often multiplied by 100. The CV is valuable because it shows how much a variable varies about its mean. An important point to remember is that symbols such as s and s² are used to represent sample statistics. Greek symbols or, as we have seen up until this point, upper-case Roman letters are often used to represent population statistics. For example, the symbol for the population mean is the Greek letter mu (μ), whereas the symbol for the population standard deviation is the Greek letter sigma (σ). However, we'll see when we get into the symbology of linear regression that Greek letters are used to represent other quantities as well. Another useful measure of dispersion or variability refers not to the variable directly, but rather to its mean. When we compute the variance or the standard deviation, we are concerned with the spread of the distribution of the variable. But imagine that we take many, many samples from a population and compute a mean for each sample. We would end up with a sample of means from the population rather than simply a sample of observations. We could then compute a mean of these means, or an overall mean, which should, assuming we do a good job of drawing the samples, reflect pretty accurately the actual mean of the population of observations. Nonetheless, these numerous means will also have a
distribution. It is possible to plot these means to see if they follow a normal distribution. In fact, an important theorem from mathematical statistics states that the sample means follow a normal distribution even if they come from a non-normally distributed variable in the population (see Chapter 3). This is a very valuable finding because it allows us to make important claims about the linear regression model. We shall learn about these claims in later chapters. Our concern here is not whether the means are normally distributed, at least not directly. Rather, we need to consider a measure of the dispersion of these means. Statistical theory suggests that a good measure (estimate) of dispersion is the standard error of the mean. It is computed using the sample standard deviation as
$se(\text{mean}) = \frac{s}{\sqrt{n}}$
Standard errors are very useful in linear regression analysis. We shall see later that there is another type of standard error known as the standard error of the slope coefficient which we use to make inferences about the regression model.
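Sticking with the same hypothetical weight variable, the standard error of the mean is easy to compute from the stored results, and the mean command reports it directly:

    quietly summarize weight
    display r(sd)/sqrt(r(N))    // 8.08 / sqrt(5), roughly 3.6
    mean weight                 // reports the mean along with its standard error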
Standardizing a Variable
One of the difficult issues when we are presented with different continuous variables is that they are rarely measured in the same units. Education is measured in years, income is measured in dollars, body weight is measured in pounds or kilograms, and food intake is measured in kilocalories. It is convenient to have a way to adjust variables so their measurement units are similar. You might recall that this is one of the purposes of z-scores. Assuming we have a normally distributed set of variables, we may transform them into z-scores so they are comparable. A z-score transformation uses the following formula:
$z\text{-score} = \frac{(x_i - \bar{x})}{s}$
Each observation of a variable is put through this formula to yield z-scores, or what are commonly known as standardized values. The unit of measurement for z-scores is standard deviations. The mean of a set of z-scores is zero, whereas its standard deviation is one. You may remember that z-scores are used to determine what percentage of a distribution falls a certain distance from the mean. For example, 95% of the observations from a normal distribution fall within 1.96 standard deviations of the mean. This translates into 95% of the observations using standardized values falling within 1.96 z-scores of the mean. With a slight modification, this phenomenon is helpful when we wish to make inferences from the results of the linear regression model to the population. The plot of z-scores from a normally distributed variable is known as the standard normal distribution. As mentioned earlier, one of the principal advantages of computing z-scores is that they provide a tool for comparing variables that are measured in different units; this will come in handy as we learn about linear regression models. Of course, we must be intuitively familiar with standard deviations to be able to make these comparisons.
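As a quick sketch, Stata's egen function std() creates standardized values; the variable name weight is again the invented example from above:

    egen zweight = std(weight)
    summarize zweight    // the mean is (essentially) zero and the standard deviation is one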
Two other statistics reviewed in elementary courses are the covariance and the Pearson's product-moment correlation (there are actually many types of correlations; the type attributed to the statistician Karl Pearson is the most common). A covariance is a measure of the joint variation of two continuous variables. In less technical terms, we claim that two variables covary when there is a high probability that large values of one are accompanied by large or small values of the other. For instance, height and weight covary because large values of one tend to accompany large values of the other in a population or in most samples. This is not a perfect association because there is clearly substantial variation in heights and weights among people. The equation for the covariance is
$Cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
The equation computes deviations from the means of both variables, multiplies them, adds up these products for each observation, and then divides this sum by the sample size minus one. Don't forget that this implies the x's and y's come from the same unit, whether it is a person, place, or thing. One of the problems with the covariance is that it depends on the measurement scheme of both variables. It would be helpful to have a measure of association that did not depend on these measurement units, but rather offered a way to compare various associations of different sets of variables. The correlation coefficient accomplishes this task. Among the several formulas we might use to compute the correlation, the following equations are perhaps the most intuitive:
$Corr(x, y) = \frac{Cov(x, y)}{\sqrt{Var[x] \times Var[y]}}$

$Corr(x, y) = \frac{\sum (z_x)(z_y)}{n - 1}$
The first equation shows that the correlation is the covariance divided by a joint measure of variability: the variances of each variable multiplied together, with the square root of this quantity representing what we might call the joint or pooled standard deviation. The second equation shows the relationship between z-scores and correlations. We might even say, without violating too many tenets of the statistical community, that the correlation is a standardized measure of association. A couple of interesting properties of correlations are (1) they always fall between −1 and +1, with positive numbers indicating a positive association and negative numbers indicating a negative association (as one variable increases the other tends to decrease). A correlation of zero means there is no statistical association, at least not one that can be measured assuming a straight-line association, between the two variables. (2) The correlation is unchanged if we add a constant to the values of the variables or if we multiply the values by some constant number. However, these constants must have the same sign, negative or positive. As mentioned earlier, there are several other types of correlations (or what we refer to generally as measures of association) in addition to Pearson's. For instance, a Spearman's correlation is based on the ranks of the values of variables, rather than the actual values. Similar to the median when compared to the mean, it is less sensitive to extreme values. There are also various measures of association designed for categorical variables, such as gamma, Cramér's V, lambda, eta, and odds ratios. Odds ratios, in particular, are popular for estimating the association between two binary (two-category) variables. We shall learn more about odds ratios in Chapter 13.
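For readers who want to try these measures in Stata, here is a minimal sketch using two hypothetical continuous variables named x and y:

    correlate x y, covariance    // the covariance of x and y (the off-diagonal entry)
    correlate x y                // Pearson's correlation, always between -1 and +1
    spearman x y                 // Spearman's rank-based correlation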
A t-test is commonly used to determine whether the mean from one sample (or subsample) is different from the mean of another sample (or subsample). For example, I may wish to know whether the mean income of adults from Salt Lake City, UT, is higher than the mean income of adults from St. George, UT. If I have good samples of adults from these two cities, I can consider a couple of things. First, I can take the difference between the means. Let's say that the average annual income among a Salt Lake City sample is $35,000 and the average annual income among a St. George sample is $32,500. It appears as though the average income in Salt Lake City is higher. But we must consider something else: We have two samples, so we must consider the possibility of sampling error. Our samples likely have different means than the true population means, so we should take this into account. A t-test is designed to consider these issues by, first, taking the difference between the two means and, second, by considering the sampling error with what is known as the pooled standard deviation. This provides an estimate of the overall variability in the means. The name t-test is used because the t-value that results from the test follows what is termed a Student's t distribution. This distribution looks a lot like the normal distribution; in fact, it is almost indistinguishable when the sample size is greater than 50. At smaller sample sizes the t-distribution has fatter tails and is a bit flatter in the middle than the normal distribution. As mentioned earlier, the t-test has two components: The difference between the means and an estimate of the pooled standard deviation. The following equation shows the form the t-test takes:
$t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad \text{where } s_p = \sqrt{\frac{(n_1 - 1)s_x^2 + (n_2 - 1)s_y^2}{n_1 + n_2 - 2}}$
The $s_p$ in the above equation is the pooled standard deviation. The n's are the sample sizes and the s² terms are the variances for the two groups. A key assumption that this type of t-test makes is that the variances are equal for the two groups represented by the means. Some researchers are uncomfortable making such an assumption, so they use the following test, which is known as Welch's t-test:
$t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{Var[x]}{n_1} + \frac{Var[y]}{n_2}}}$
Unfortunately, we must use special tables if we wish to compute this value by hand and determine the probability that there is a difference between the two means. Fortunately, though, many statistical software packages provide both types of mean comparison tests, along with another test that is designed to show whether we should use the standard t-test or Welchs t-test. An important assumption that we are forced to make if we wish to use these mean-comparison procedures is that the variables follow a normal distribution. The t-test, for example, does not provide accurate results if the variable from either sample does not follow a normal distribution. There are other tests, such as those designed to compare ranks or medians (e.g., Wilcoxon-Mann-Whitney test), which are appropriate for non-normal variables. There are many other types of comparisons that we might wish to make. Clearly, comparing two means does not exhaust our interest. Suppose we wish to compare three means, four means, or even ten means. We might, for instance, have samples of incomes from adults in Salt Lake City, UT, St. George, UT, Reno, NV, Carson City, NV, and Boise, ID. We may use ANOVA procedures to compare means that are drawn from multiple samples. Using multiple comparison procedures, we may also determine if one of the means is significantly different from one of the others. Books that describe ANOVA techniques provide an overview of these procedures. As we shall learn in subsequent chapters, we may also use linear regression analysis to
compute and compare means for different groups that are part of a larger sample.
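As a rough sketch of how these comparisons look in Stata, suppose we had a continuous variable named income and a grouping variable named city (both names are hypothetical here; a worked ttest exercise appears later in this chapter):

    ttest income, by(city)          // compares two group means, assuming equal variances
    ttest income, by(city) welch    // Welch's test, which relaxes the equal-variance assumption
    oneway income city              // ANOVA for comparing three or more group means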
It should be obvious by now that one of the valuable results of statistics is the ability to say something about a population from a sample. But recall the lesson we learned when discussing the standard error of the mean: We usually take only one sample from a population, but we could conceivably draw many. Therefore, any sample statistic we compute or test we run must consider the uncertainty involved in sampling. The solution to this dilemma of uncertainty has been the frequent use of standard errors for test statistics, including the mean, the standard deviation, correlations, and medians. As we will see in Chapter 2, there is also a standard error for slope coefficients in linear regression models. These standard errors may be thought of as quantitative estimates of the uncertainty involved in test statistics. They are normally used in one of two ways. First, recall from elementary statistics that when we use a t-test, we compare the t-value to a table of p-values. All else being equal, a larger t-value equates to a smaller p-value. This approach is known generally as significance testing because we wish to determine if our results are significantly different from some other possible result. It is important to note, though, that the term significant does not mean important. Rather, it was originally used to mean that the results signified or showed something (see David Salsburg (2001), The Lady Tasting Tea, New York: Owl Books). We often confuse or mislead when we claim that a significance test demonstrates that we have found something special. Showing where p-values come from is beyond the scope of this presentation. It is perhaps simpler to provide a general interpretation. Suppose we find through a t-test the following when comparing the mean income levels in Salt Lake City (n = 50) and St. George, UT (n = 50):
$t = \frac{\$35{,}000 - \$32{,}500}{6{,}250\sqrt{\frac{1}{50} + \frac{1}{50}}} = \frac{2{,}500}{1{,}250} = 2.0$
If we look up a table of t-values (available on-line or in most statistics textbooks), we find, using a sample size of 100, that a t-value of 2.0 corresponds to a p-value of approximately 0.975 (the precise number is 0.9759). This leaves approximately 0.025 of the area under the curve that represents the t-distribution. One way to interpret this p-value is with the following long-winded statement: Assuming we took many, many samples from the population of adults in Salt Lake City and St. George, and there was actually no difference in mean income in this population, we would expect to find a difference of $2,500 or something larger only 2.5 times, on average, out of every 100 samples we drew. If you remember using null and alternative hypotheses, such a statement may sound familiar. In fact, we can translate the above inquiry into the following hypotheses (a quick Stata check of these p-value calculations appears below):

H0: Mean Income, Salt Lake City = Mean Income, St. George
Ha: Mean Income, Salt Lake City > Mean Income, St. George
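We can check this arithmetic with Stata's ttail() function, which returns the upper-tail probability of the t distribution for a given number of degrees of freedom (here 50 + 50 − 2 = 98):

    display ttail(98, 2.0)        // one-tailed p-value, about 0.024
    display 1 - ttail(98, 2.0)    // cumulative probability, about 0.976
    display 2*ttail(98, 2.0)      // two-tailed p-value, about 0.048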
Astute readers may notice that we have set up a one-tailed significance test. A two-tailed significance test suggests a difference about five percent of the time, or only five times out of every one-hundred samples. Well return to the interpretation of p-values in Chapter 2. The second way that standard errors are used is to compute confidence intervals (CIs). There are many applied statisticians who prefer confidence intervals because they provide a range of values within which some measure is likely to fall. Here we contrast a point estimate and an interval estimate. Means and correlations are examples of point estimates: They are single numbers computed from the sample that estimate population values. An interval estimate provides a range of values that (presumably) contains the actual population value. A confidence interval offers a range of possible or plausible values.
Those who prefer confidence intervals argue that they provide a better representation of the uncertainty inherent in statistical analyses. The general formula for a confidence interval is

Point estimate ± [(confidence level) × (standard error)]

The confidence level represents the percentage of the time, based on a z-statistic or a t-statistic, you wish to be able to say that the interval includes the point estimate. For example, assume we've collected data on violent crime rates from a representative sample of 100 cities in the United States. We wish to estimate a suitable range of values for the mean violent crime rate in the population. Our sample yields a mean of 104.9 with a standard deviation of 23.1. The 95% confidence interval is computed as
$95\%\ CI = 104.9 \pm 1.96 \times \frac{23.1}{\sqrt{100}} = \{100.4,\ 109.4\}$
The value of 1.96 for the confidence level comes from a table of standard normal values, or z-values. It corresponds to a p-value of 0.05 (two-tailed test). The standard error formula was presented earlier in this chapter. How do we interpret the interval of 100.4 to 109.4? There are two ways that are generally used: (1) Given a sample mean of 104.9, we are 95% confident that the population mean of violent crime rates falls in the interval of 100.4 and 109.4. (2) If we were to draw many samples from the population of cities in the U.S., and we claimed that the population mean fell within the interval of 100.4 to 109.4, we would be accurate about 95% of the time. As we shall see in subsequent chapters, it is also possible to construct confidence intervals for point estimates from a linear regression model.
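The interval is easy to reproduce with the display command; Stata's immediate command cii gives a similar (slightly wider) interval because it uses the t rather than the z multiplier, and its exact syntax depends on your Stata version:

    display 104.9 - 1.96*23.1/sqrt(100)    // lower bound, about 100.4
    display 104.9 + 1.96*23.1/sqrt(100)    // upper bound, about 109.4
    * older versions: cii 100 104.9 23.1
    * newer versions: cii means 100 104.9 23.1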
As an exercise, see if you can compute the standard error of the mean from the mean and the standard deviation (abbreviated Std. Dev) from the variance listed in the table. Then, return to Stata and use the correlate or pwcorr command to estimate the correlation between Public Expenditures and Economic Openness (e.g., correlate expend econopen). Can you figure out how to estimate the covariance between these two variables? You should find that the correlation is 0.748 and the covariance is 37.065. If you are interested in confidence intervals for the mean, use the ci command. For public expenditures, you should obtain 95% CIs of 37.18 and 50.05. You may also wish to practice using t-tests in Stata. For example, open the Stata file gss96.dta. There you will find a variable called gender, which is coded as 0 = male and 1 = female (use the command
to see some information about this variable). We'll use it to compare personal income (labeled pincome) for males and females. The command ttest pincome, by(gender) should produce a table with which to accomplish this task. What does this table show? What is the t-value? What is the p-value? Try using the subcommand welch to request Welch's version of the t-test for unequal variances (ttest pincome, by(gender) welch). What does this test show?
Well begin by considering an elementary situation: The simple linear regression model, or what some people refer to as bivariate regression. The term simple is used not to indicate that our research question is uninteresting or crude, but rather that we have only one explanatory and one outcome variable. But, you might ask, if there are only two variables why not use the Pearsons correlation coefficient to estimate the relationship? This is certainly a reasonable step; we can learn something about the relationship between two variables by considering the correlation (assuming they are continuous and normally distributed). The difference here is primarily conceptual: We think, perhaps because of a particular theory or model, that one variable influences another, or we have control over one variable and therefore wish to see its effect on another (e.g., we control the amount of fertilizer and see how a particular species of tomato plant grows). The outcome variable is usually labeled as Y and the explanatory variable is labeled X (for now, lets assume were interested in population values). To help you understand the regression model, recall your days in pre-algebra or algebra class when you were asked to plot points on two axes. You probably labeled the vertical axis as Y and the horizontal axis as X. You then used an equation such as Y = mX + b to either represent the systematic way the points fell on the graph, or to decide where to put points on the graph. The equation provided a kind of map that told you where to place objects. The m represented the slope and the b represented the intercept, or the point at which the line crossed the Y axis. If you can recall this exercise in graphing, then simple linear regression is easy to understand. Consider the graph on the following page. What is its slope and intercept? Since we dont know the units of measurement, we cannot tell. However, we do know that the slope is a positive number and the intercept is a positive number (notice the line crosses the Y axis at a positive value of Y). We also know that it has a very high positive correlation because the points are very close to the line. Lets say it is represented by the equation following the graph.
Y = 2X + 1.5

This suggests that as the X values increase by one unit the Y values increase by two units. Perhaps you recall a definition of the slope as rise/run, which is a short-hand way of saying that as the points rise a certain number of units along the Y axis they also run a certain number of units along the X axis. Unfortunately, real data in almost any research application never follow a straight-line relationship. Nonetheless, in a simple linear regression analysis we may represent the association between the variables using a straight line. Whether a straight line is an accurate representation is a question we should always ask ourselves. Consider the graph below. This graph may be produced in Stata with the nations.dta data by using the twoway plot command with both the scatter and lfit plot types. Use Public Expenditures (expend) as the outcome variable and Percent Labor Union (perlabor) as the explanatory variable. In the command line this appears as twoway scatter expend perlabor || lfit expend perlabor. The
double bars (located on most keyboards just under the Backspace key) tell Stata to produce two graphs, one on top of the other. Here weve asked for a scatterplot (scatter) and a linear fit line (lfit) in one graph. After entering the commands, a new window should open that displays the following graph.
[Figure: scatterplot of public expenditures (y-axis, roughly 35 to 50) against percent labor union (x-axis, roughly 10 to 50), with the linear fitted-values line overlaid.]
The association between the two variables shown in the scatterplot is positive, but notice that the points do not fall on the line that represents the slope. This shows a statistical relationship, which is probabilistic, whereas the previous graph shows a mathematical relationship, which is deterministic. Therefore, we now must say that, on average, public expenditures increase as percent labor union increases. The next question to ask is: On average, how many units of increase in Labor Union are associated with how many units of increase in Public Expenditures? This is a fundamental question that we try to answer with a linear regression model. In a linear regression model, rather than Y = mX + b, we use the following form of the equation:
$Y_i = \alpha + \beta_1 X_i$
The Greek letter alpha represents the intercept and the Greek letter beta represents the slope. These two terms are sometimes labeled parameters; however, in statistical analysis parameters are usually defined as properties of a population, in contrast to properties of samples, which are labeled statistics. For the real stickler, perhaps we should not even use Greek letters in the equation since, technically speaking, they imply parameters and thus apply to the population. Hence, an alternative representation of the linear regression equation is
$y_i = a + b_1 x_i$
In this equation we have used lower-case Roman letters to represent a linear regression model that uses data from a sample. You will also notice the subscripted i; as before, this represents the fact that we have individual observations from the sample. Since another common approach is to preserve Greek symbols but use lower case Roman letters to represent the variables, here is another way the equation is represented:
$y_i = \beta_0 + \beta_1 x_i$
This equation uses beta to represent both the intercept and the slope, with a zero subscript used for the intercept. However, the preference in this presentation is to use alpha to indicate the intercept (i.e., $y_i = \alpha + \beta_1 x_i$). Another way of thinking about the slope, whether represented by the Greek letter beta or the letter b, is that it represents the change in y over the change in x. You may see this referred to as $\Delta y / \Delta x$, in which the Greek letter delta is used to denote change. This is just another way of saying rise over run. The intercept may also be defined as the predicted value of y when the value of x is zero. Perhaps you've noticed a problem with the equations we've written so far. A hint is that we used the term probabilistic to describe
the statistical relationship. If youre not sure what this all means, revisit the scatterplot between Public Expenditures and Percent Labor Union. Notice that the points clearly do not fall on the straight line. Take almost any two variables used in the social and behavioral sciences and they will fail to line up in a scatterplot. This presents a problem for those who want their models exact. Our equations thus far call for exactness, which misrepresents the actual relationship between the variables. Our straight line is, at best, an approximation of the actual linear relationship between x and y. Given a small data set, it is not difficult to draw a straight line that tries to match this relationship. Nonetheless, we must revise our linear regression equation to consider the uncertainty of the straight line. Statisticians have therefore introduced what is known as the error term into the model:
$y_i = \alpha + \beta_1 x_i + \varepsilon_i$
The Greek letter epsilon represents the uncertainty in predicting the outcome variable with the explanatory variable. Another way of thinking about the error term is that it represents how far away the individual y values are from the true mean value of Y for given values of x. Now, this last sentence is loaded with assumptions. We use, for instance, the term true mean value of Y. What are we trying to say? Well, think about our sample. If we have done a good job, our sample observations should represent some group from the population. For instance, if we've sampled from adults in the U.S., some of our sample members should represent 25- to 30-year-olds. We assume that these sample members are a good representation of other 25- to 30-year-olds. Hence, their values on a characteristic such as education should provide a good estimate of mean education among those in this age group in the population. Suppose we wish to use parental education (x) to predict education (y) among this age group. Using the linear regression model, we assume that the error terms represent the distance from the points on the graph to the actual mean of Y, or the population mean.
Unfortunately, we usually cannot test, at least not fully, the various assumptions we make when using the linear regression model. Since we normally do not have access to the population parameters, we can only roughly test properties of the various assumptions we make when using linear regression. These assumptions include the following:

1. For any value of X, Y is normally distributed and the variance of Y is the same for all possible values of X. (Notice that we use upper-case letters to show that we are referring to the population.)

2. The mean value of Y is a straight-line function of X. In other words, Y and X have a linear relationship.

3. The Y values are statistically independent of one another. Another way of saying this is that the observations are independent. For example, using the earlier example, we assume that our measures of public expenditures across the seven nations are independent. We usually assume that simple random sampling guarantees independence. (However, we should ask ourselves: Are the economic conditions of these nations likely to be independent?)

4. The error term is a random variable that has a mean equal to zero in the population and constant variance (this is implied by (1)). Symbolically, this is shown as $\varepsilon \sim N(0, \sigma^2)$. As mentioned in Chapter 1, the wavy line means distributed as.

In order to test these assumptions, we should have lots of x values and lots of y values. For example, if we collect data on dozens of countries with, say, 20% labor unions, then we expect the public expenditure values to be normally distributed within this value of the variable labor union. And we expect the errors in predicting Y to have a mean of zero.
[Figure: hypothetical distributions of Y (public expenditures) shown at labor union values of 20%, 40%, and 60%.]
This graph provides one way to visualize what is meant by some of these assumptions. We have sets of observations at 20%, 40%, and 60% values of labor union. We assume that the mean public expenditure values for these three subsamples are a good representation of the actual means in the population. We also assume that public expenditures at these three values of the explanatory variable are normally distributed, that the variances of each distribution are identical, and that the errors we make when we estimate the value of public expenditures have a mean of zero (e.g., the underestimation and overestimation that we make cancel out).
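Another way to build intuition for these assumptions is to simulate data that satisfy them. The sketch below is purely illustrative (all names and numbers are invented): y is generated with a straight-line mean in x, normally distributed errors, and constant variance.

    clear
    set obs 300
    set seed 1234
    generate x = 20 + 20*floor(3*runiform())    // x takes only the values 20, 40, and 60
    generate y = 10 + 0.5*x + rnormal(0, 5)     // linear mean in x, normal errors, constant variance
    tabulate x, summarize(y)    // the mean of y rises linearly with x; its spread is similar at each x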
sample. First, use the twoway plot command with the scatter and lfit plot types to estimate a simple scatterplot and linear fit line with the number of robberies per 100,000 people (robbrate) as the outcome variable and per capita income in $1,000s (perinc) as the explanatory variable (twoway scatter robbrate perinc || lfit robbrate perinc). You should see a modestly increasing slope. But notice how the points diverge from the line; only a few are relatively close to it. Next, well try a simple linear regression model using these variables. In Stata, a linear regression model is set up using the regress command. The commands should look like regress robbrate perinc. The output screen should then include the following table.
      Source |       SS          df         MS         Number of obs =     50
  -----------+-----------------------------------      F(1, 48)      =  13.99
       Model |  111134.139        1   111134.139       Prob > F      = 0.0005
    Residual |  381433.089       48   7946.52268       R-squared     = 0.2256
  -----------+-----------------------------------      Adj R-squared = 0.2095
       Total |  492567.228       49   10052.3924       Root MSE      = 89.143

    robbrate |      Coef.       t     P>|t|
  -----------+-------------------------------
      perinc |   14.45886     3.74    0.000
       _cons |  -165.1305    -1.92    0.061
Can we translate the numbers in this table into our linear regression equation? Yes, we can, but first we need to find the slope and the intercept. The term coefficients (abbreviated in the printout as Coef.) is used in a general sense, although, depending on the Stata commands used, there may be both unstandardized and standardized coefficients that we can interpret. Well focus for now on the former since it provides the information we need. Note that Stata uses _cons to label the intercept. Thus, our equation reads:
Robberies $(y_i) = -165.1 + \{14.459 \times \text{Per Capita Income } (x_i)\} + e_i$
The y and the x are included to remind us that number of robberies is the outcome variable and per capita income is the explanatory variable. Revisit the scatterplot between these two variables. The slope is positive. In other words, the line tilts up from
left to right. Our slope in the above equation is also positive, but provides even more information about the presumed association between the variables. The intercept in the equation is negative. The Stata scatterplot does not even show the intercept; it is off the chart. But imagine expanding the graph and you can picture the line crossing the y-axis. Recall that we said the slope shows the average number of units increase in y for each one unit increase in x. This provides a way to interpret the slope in this equation: Each one unit increase in per capita income is associated with a 14.46 unit increase in the number of robberies per 100,000 people. Of course, we need to specify how these variables are measured to determine what this statement means. For instance, since per capita income is measured in thousands of dollars and the robberies are measured in offenses per 100,000 people we may say: Each $1,000 increase in per capita income is associated with 14.46 more robberies per 100,000 people. Similarly, given our understanding of the intercept, we may say: The expected number of robberies per 100,000 people when per capita income is zero is −165.1. One of the problems with this interpretation is that (check the data if you'd like) there is no state with a per capita income of zero. It is not a good idea to estimate values from your sample when no observations have this value. Sometimes, such as the situation here, we come up with ridiculous results if we try to make inferences too far outside our set of data values. Keep in mind that we've used the term associated in our interpretations. Many people are uncomfortable with some of the terms used to describe the relationship between variables, especially when using regression models in the social and behavioral sciences. Associated is a safe term, one that implies that as one variable
changes the other one also tends to change. It should not be seen as a term denoting causation, though; each $1,000 increase in per capita income does not lead to or cause an increase in the number of robberies. I'm sure you can think up several reasons why these two variables might be associated that have nothing to do with one variable causing the other. In general, we say that the intercept represents the expected value of Y when X equals zero. The term expected value should remind you of the mean. The slope represents the expected difference or change in Y that is associated with each one unit change or difference in X. We are inferring from the sample to the population here, although we don't have to. The two terms that are frequently used when considering the slope are difference and change. A problem with using the term change is that we don't necessarily observe changes; we (usually) note only differences in the values of variables. The graph below shows a picture of the shifts that may be inferred from the linear regression model. With a little imagination, it also demonstrates another useful thing about this type of model: We may come up with predictions of the outcome variable. With a graph we simply find that point along the x-axis that interests us, trace vertically to the regression line, and then trace horizontally to the y-axis to find its value. This is known as the predicted value from the model. Graphs can be difficult to use to make precise predictions, however, so we may also use the regression equation to find predicted values. Suppose we wish to predict the value of the robbery rate for states with per capita incomes of $25,000. We may use the equation to do this: Robberies = −165.13 + {14.46 × 25.0} = 196.37 per 100,000
Hence, on average, states with mean per capita income values of $25,000 have approximately 196.37 robberies per 100,000 people. Weve used here the term on average. This is another way of implying that for those states that have per capita incomes of $25,000, we estimate that the mean number of robberies is 196.37. In
other words, predicted values are predicted means of the outcome variable for particular values of the explanatory variable.
[Figure: the estimated linear regression line, showing a 14.46-unit rise in the robbery rate for each $1,000 increase in per capita income.]
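In Stata, predicted values do not have to be computed by hand. Here is a minimal sketch using the robbery example estimated earlier; the stored coefficients _b[perinc] and _b[_cons] are available after regress:

    regress robbrate perinc
    display _b[_cons] + _b[perinc]*25    // about 196, matching the hand calculation above
    predict yhat                         // predicted (mean) robbery rates for every observation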
One intuitive way to fit a line to a scatterplot is to try to draw a line that is as close as possible to all the points. Another way of thinking about this is to try to draw a line that, if we calculated an aggregate measure of distance from the points to the line, will have as small a value as possible. This is precisely what the most common technique for fitting the regression line does. Known as ordinary least squares (OLS) or the principle of least squares, this technique is used so frequently that many people refer to linear regression analysis as ordinary least squares analysis or as an OLS regression model. However, there are many types of estimation routines, such as weighted least squares and maximum likelihood, so there are actually a number of ways we might compute the results of a linear regression analysis. Because it is so widely used, though, we'll focus on OLS in this presentation. The main goal of OLS is to obtain the minimum value for the following equation:
$$SSE = \sum_i \left(y_i - \hat{y}_i\right)^2 = \sum_i \left(y_i - \{\hat{\alpha} + \hat{\beta}_1 x_i\}\right)^2$$
The quantity, SSE, is the Sum of Squared Errors, called by Stata the Residual Sum of Squares (Residual SS). There is also a new symbol that we haven't seen before: $\hat{y}$ (also called y-hat). This is the symbol for the predicted values of y based on the model that uses the sample data. The goal of OLS is to minimize the SSE, or make it as close to zero as possible. In fact, if the SSE equals zero, then the straight line fits the data points perfectly; they fall right on the line. In addition, if the points fall directly on the line, the Pearson's correlation coefficient is one or negative one (depending on whether the association is positive or negative). Another new term is residual: It is a common name for a measure of the distance from the regression line to the actual points, so there are actually many residual values that result from the model. It is typically represented by e to indicate that it is another way of showing the errors in estimation: $\text{residual}_i\,(e_i) = (y_i - \hat{y}_i)$. As we'll learn in later chapters, the residuals are used for many purposes in linear regression models.
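To see the residuals and the SSE from the robbery model directly, a short sketch along the following lines should work (the names yhat, e, and e2 are purely illustrative):

regress robbrate perinc
predict yhat, xb
predict e, residual
generate e2 = e^2
quietly summarize e2
display r(sum)

The displayed sum should match the Residual SS that Stata reports in its ANOVA table for this model (about 381,433).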
As mentioned earlier, the straight regression line almost never goes through each data point. Rather, there are discrepancies from the line for two reasons: (1) There is simply variation of various types whenever we measure something; and (2) a straight line association is not appropriate. The former situation is a normal part of just about any linear regression analysis. Although we wish to minimize errors, there is usually natural or random variation because of the intricacies and vicissitudes of behaviors, attitudes, and demographic phenomena. The second situation is more serious; if a straight line is not appropriate we should look for a relationship referred to as nonlinear that is appropriate. Chapter 10 provides a discussion of what to do when we find a nonlinear relationship among variables. But how do we come up with an equation that will minimize the SSE and thus allow us to both fit a straight line and estimate the slope and intercept? Those of you who have taken a calculus course may suspect that, since we are concerned with changes in variables, derivatives might provide the answer. You are correct, although through Cramers rule and the methods of calculus some simple formulas have been derived. Among the various alternatives, the following least squares equation for the slope has been shown to be the best at minimizing the SSE:
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

This equation works very well if the assumptions discussed earlier are met. We'll have a lot more to say about assumptions in later chapters. The numerator and denominator in this equation should look familiar. The numerator is part of the equation for the covariance (see Chapter 1) and the denominator is the formula for the sum of squares of x. Here we use the common x-bar and y-bar symbols for the means, and $\hat{\beta}_1$ to indicate that we are estimating the slope with the sample equation. Let's think for a moment about what happens as the quantities in the equation change. Suppose that the covariance
increases. What happens to the slope if the sum of squares does not also change? Clearly, it increases. What happens as the sum of squares of x increases? The slope decreases. So, all else being equal, as the variation of the explanatory variable increases, the slope decreases. Typically, we wish to have a large positive or negative slope if we hypothesize that there is an association between two variables. Now that we have the slope, how do we estimate the intercept? The formula for the intercept is the following:
$$\hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x}$$
The estimated intercept is computed using the means of the two variables and the estimated slope. We should note at this point that we rarely have to use these equations since programs such as Stata will conduct the analysis for us. Moreover, programs such as Stata do not use these equations directly; rather they use matrix routines to speed up the process. It is useful, though, to practice using these equations with a small set of data points.
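If you want to try the formulas by hand, here is a minimal Stata sketch with a made-up five-observation data set (the variable names and values are purely illustrative); the final regress command is only there to check that the hand computation matches Stata's answer.

* tiny illustrative data set
clear
input x y
1 2
2 3
3 5
4 4
5 6
end
quietly summarize x
scalar xbar = r(mean)
quietly summarize y
scalar ybar = r(mean)
generate num = (x - xbar) * (y - ybar)
generate den = (x - xbar)^2
quietly summarize num
scalar sxy = r(sum)
quietly summarize den
scalar sxx = r(sum)
scalar b1 = sxy / sxx
scalar b0 = ybar - b1 * xbar
display "slope = " b1 "  intercept = " b0
regress y x

For these made-up values the slope works out to 0.9 and the intercept to 1.3, and the regress output should agree.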
a regression model that considers these variables. Our hypotheses are therefore
Although these are reasonable hypotheses, for some reason most researchers who use linear regression models define the hypotheses more generally as saying either the slope is zero or the slope is not zero. This, of course, means that the alternative hypothesis is that the slope is simply different from zero and that the null hypothesis is that the slope is zero: $\{H_a: (\beta_1 \neq 0)\}$ vs. $\{H_0: (\beta_1 = 0)\}$. It may appear obvious that we can simply look at the slope coefficient and determine whether or not it is zero. But, let's not forget a crucial issue. Remember that we are assuming that we have a sample that represents some target population. Ours is only one sample among many possible samples that might be drawn from the population. Perhaps we were just lucky (or unlucky, depending on how you look at things) and found the one sample among countless others that had a positive slope. How can we be confident that our slope does not fall prey to such an event? Although we can never be absolutely certain, we may use a significance test or confidence intervals to estimate whether the slope in the population is likely to be zero. It is now time to return to standard errors. In Chapter 1 we learned how to compute the standard error of the mean and how this statistic is interpreted. There is also a statistic known as the standard error of the slope that is interpreted in a similar way. That is, the standard error of the slope estimates the variability of the estimated slopes that might be computed if we were to draw many, many samples. Let's say we have a population in which the association between age and alcohol consumption is zero. That is, the correlation is zero and the population-based linear regression slope is zero. Therefore, the null hypothesis of a zero slope is true $(H_0: \beta_1 = 0)$. If we were to draw
many samples, can we infer what percentage of the slopes from these samples would fall a certain distance from the true mean slope of zero? It turns out we can because the linear regression slopes from samples, if many samples are drawn randomly, follow a t-distribution. We therefore know, for instance, that in a sample of, say, 1000, we would expect only about five percent of the slopes to fall more than 1.96 t-values (similar to z-values in the standard normal distribution) from the true mean of zero. We should only occasionally find a slope this far from zero if the null hypothesis is true. If the null hypothesis is not true (or if we may reject it, to use a common phrase), then a slope this many t-values from zero should be relatively common. So, the question becomes, how can we compute these t-values? The t-value in a linear regression model is computed by dividing the estimated slope by the estimated standard error. Weve already seen how to compute a slope in a simple linear regression model. Here is the formula for the standard error:
$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2 \,/\, (n-2)}{\sum_i (x_i - \bar{x})^2}} = \sqrt{\frac{SSE/(n-2)}{SS[x]}}$$
This equation provides the standard error formula for the slope in a simple linear regression model only. As with the slope equation, the standard error equation includes some familiar elements. In the numerator, for instance, we see the sum of squared errors (SSE) and the sample size (n); in the denominator we find the sum of squares of x. As the sum of squares of x gets larger, the standard error becomes relatively smaller. As the SSE becomes larger, the standard error also becomes larger. This should not be surprising if we think about our scatterplot: Larger SSEs indicate more variation about the regression line. Therefore, our uncertainty about whether we have a good prediction of the population slope should also increase. But what happens to the standard error as the sample size increases? Working through the algebra implied by the equation, we can see that the standard error decreases with larger samples. Again, this should make
sense: The larger the sample, the more it is like the population from which it was drawn, and the more confidence we should have about the sample slope. In fact, some observers complain that, with a large enough sample, even one that wasn't drawn randomly, we can make claims of certainty that are not justifiable. As mentioned in the beginning of the last paragraph, a t-value in a linear regression model is computed by dividing the slope by the standard error:
$$t\text{-value} = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$

Occasionally, you might see one beta value minus another in the t-value equation (e.g., $(\hat{\beta}_1 - \beta_0)\,/\,se(\hat{\beta}_1)$). Although it may be used in such a
manner, it is much more common to assume that we wish to compare the slope from the regression model to the slope implied by the null hypothesis, which is typically that the slope is zero in the population. Each t-value is associated with a p-value that depends on the sample size. This is the basis for using significance tests to determine the importance of the slope. Fortunately, we do not need to use t-tables to determine the p-value. Statistical software provides the associated p-values. Recall that we used Stata to estimate a model with robberies per 100,000 as the outcome variable and per capita income as the explanatory variable. Here is the table that Stata produced.
      Source |       SS          df       MS             Number of obs =     50
-------------+-------------------------------            F(  1,    48) =  13.99
       Model |  111134.139       1   111134.139          Prob > F      = 0.0005
    Residual |  381433.089      48   7946.52268          R-squared     = 0.2256
-------------+-------------------------------            Adj R-squared = 0.2095
       Total |  492567.228      49   10052.3924          Root MSE      = 89.143

    robbrate |      Coef.        t     P>|t|
-------------+-------------------------------
      perinc |   14.45886      3.74    0.000
       _cons |  -165.1305     -1.92    0.061
We have already learned to interpret the slope and intercept in this model. But notice that Stata also computes the standard errors (listed as Std. Err. in the full output) of the slope and intercept, the t-values (under the column labeled t), and the p-values (under the column labeled P>|t|). The t-value for the slope is 14.459/3.867 = 3.74. A t-value of this magnitude from a sample of 50 observations (48 degrees of freedom) has a p-value of less than 0.001. Stata rounds p-values to three decimal places. Although it is rarely used, there is also a p-value for the intercept. It compares the intercept to the null hypothesis value of zero. We usually don't need to interpret the standard error of the slope or intercept directly, nor do we interpret the t-value. Rather, we may focus on the p-value. Unfortunately, p-values are frequently misused. You might recall that most statistical analyses use a threshold or rule-of-thumb that states that a p-value of 0.05 or less means that the statistic is somehow important. Actually, this is an arbitrary value that has, for various reasons, become widely accepted. In fact, when one finds a p-value of 0.05 or less, it is common to claim that the slope (or other statistic) is statistically significant. But what precisely does the p-value mean? Can we come up with an interpretation? The answer is yes, but we must again assume that the sample slope says something reasonable about the true population slope. Suppose, for instance, that we find a p-value of 0.03. The following interpretation is reasonable (even if it is not entirely persuasive): If the actual slope in the population is zero, we would expect to find a sample slope this far or farther from zero only three times out of every 100 samples, if we were to draw many samples from the population. You should notice that this interpretation construes the p-value as a probability. Another way of thinking about this p-value is that it suggests (but, of course, does not prove) that, if the null hypothesis is true in the population, we would reject it only about three percent of the time given a slope this far or farther from zero. Once again, we
are assuming that we could draw many samples to reach this conclusion. In the model of robbery and per capita income, the p-value suggests that if the population slope is zero, we would expect to find a slope of 14.459 or some value farther from zero less than one time out of every one-thousand samples drawn. This is clearly a small p-value and offers a large degree of confidence that there is a non-zero statistical association between the number of robberies per 100,000 and per capita income. However, note that we use the phrase farther from zero. This implies that we are using what we labeled in Chapter 1 as a two-tailed significance test. Imagine a bell-shaped curve with the value of zero as its middle point. We may then picture the area under the curve as representing the frequency of slopes from a large number of samples. The mean of zero represents the null hypothesis. Hence, this bell-shaped curve signifies the distribution of slopes for the null hypothesis. A certain percentage of the total area of the curve lies in the tails of this distribution. For instance, if we have a large sample (say, 200 or more), then we know that 2.5% of the total area falls in each tail marked by the values ±1.96. A two-tailed p-value of 0.05 implies a t-value of ±1.96 (for large samples). When we rely on programs such as Stata for p-values, they typically provide two-tailed tests: They assume that the slope can be in either of the tail areas, above or below the mean. In other words, a negative slope and a positive slope are equally valid for rejecting the null hypothesis as long as their t-values are sufficiently large (in absolute value). But suppose our alternative hypothesis indicates direction, such as that the slope is greater than zero. We are then justified in using a one-tailed test: We are concerned only with the conceivable slopes that fall in the upper tail of the t-distribution. The 5% area that most people use falls above the mean of zero; in large samples this has a threshold at 1.64 t-units above the mean. How can we use a one-tailed test when programs such as Stata assume we want two-tailed tests? We may take the p-value provided by Stata and divide it in half in
order to reach a conclusion about the statistical significance of the slope. This can quickly become confusing, especially as we think about taking only one sample and then trying to infer something about a population. Perhaps it is easiest to simply remember that we wish to have slopes far from zero if we want to conclude that there is a non-zero linear association between an explanatory and an outcome variable. As with other statistical tests that compare a value computed from the data to a hypothetical value from a null hypothesis, there are many researchers who argue that significance tests that use only p-values are misleading: They deceive us into thinking that our estimates are precise and fail to account for the uncertainty that is part of any statistical model. Many recommend that we should therefore construct confidence intervals (CIs) for slopes, much like they are constructed for means. Since we have point estimates and standard errors, constructing confidence intervals for linear regression slopes is relatively simple. In fact, the same general formula we used in Chapter 1 is applicable: Point estimate ± [(confidence level) × (standard error)]. As discussed in Chapter 1, the confidence level represents the percentage of the time we wish to be able to say that the interval includes the point estimate. Given the special nature that the 0.05 p-value has taken on, it is not surprising that a 95% confidence interval is the most common choice. If we wish to use a two-tailed test (again, the most common choice), then, assuming a large sample, the value of 1.96 is used for the confidence level. For smaller samples, you should consult a table of t-values or rely on the statistical software to compute CIs for the slopes (note that Stata does this automatically). Here is an example from our linear regression model, which uses the standard error of the slope (3.867): 14.459 ± (2.011 × 3.867) = {6.685, 22.233}. In this example, since we have a sample size of 50, we use a confidence level of 2.011. This is found in a table of t-values (the df is
n − 2) and is associated with 2.5% of the area in each tail of the distribution. In other words, our confidence level is 95%. The interpretation of this confidence interval is similar to the interpretation offered in Chapter 1: Given a sample slope of 14.459, we are 95% confident that the population slope representing the association between per capita income and the number of robberies falls in the interval bounded by 6.685 and 22.233. (The intervals Stata reports may be slightly different from those we just computed because of rounding error.) If we wish to obtain a higher degree of confidence (for example, if we want to be 99% confident about where the population slope falls), then the interval is wider because the t-value used in the equation is larger. For example, a confidence level of 99% with a sample size of 50 is represented by a t-value of approximately 2.68, so the CIs are {3.99, 24.93}. Try to figure out how to ask Stata for more precise 99% confidence intervals. An interesting relationship between p-values and CIs occurs when using the same t-value for determining the threshold and for the confidence level. Suppose we rely on 1.96 to compute a p-value of 0.05 (two-tailed test) and for our confidence level. Then, if the slope divided by the standard error (i.e., the t-value from our regression model) is less than 1.96 in absolute value, the p-value is greater than 0.05. In this situation, the CIs that use 1.96 to represent the confidence level include zero. Thus, we would conclude that the null hypothesis cannot be rejected. In other words, our conclusions about statistical significance are generally the same whether we use p-values or confidence intervals. The main advantage of confidence intervals is that they are a better reflection of the uncertainty that is a fundamental part of statistical modeling.
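In case it helps with the exercise just mentioned, my understanding is that Stata's regress command accepts a level() option that changes the reported confidence intervals; for the robbery model the command would look something like this:

regress robbrate perinc, level(99)

The output shows the same regression, but the interval columns now report 99% rather than 95% bounds.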
U.S. counties. We would likely find a positive association between these two variables. Would we therefore conclude that purchasing lighters leads to lung disease? Probably not, especially when we realize that cigarette smoking is related to both purchasing lighters and lung disease. We say that cigarette smoking is a confounding variable and that the association between lighters purchased and lung disease is spurious (see Chapter 7). Hence, smoking should be included in a regression model that predicts lung disease, especially if the model also includes the frequency of lighter purchases. (Actually it should always be used in models predicting lung disease.)
zero; thus, a p-value listed as 0.0000 should be read as a p-value of less than 0.0001.) However, there is not a statistically significant correlation between the two proposed explanatory variables (r = 0.1780; p = .2162). Thus, if our hypotheses propose that the unemployment rate and the gross state product are explanatory variables that predict the violent crime rate, we have some preliminary evidence that the key variables are associated.
             | violrate unemprat   gsprod
    ---------+---------------------------
    violrate |   1.0000
    unemprat |   0.4131   1.0000
             |   0.0029
      gsprod |   0.4780   0.1780   1.0000
             |   0.0004   0.2162
Let's now see what a linear regression analysis tells us about these associations. We'll begin with a simple linear regression model using only the unemployment rate as an explanatory variable. Stata generates the following table after we enter the following command:
regress violrate unemprat
How do we interpret the slope associated with the unemployment rate? It should be easy by now. We may say "Each one unit increase in the unemployment rate is associated with 88.78 more violent crimes per 100,000 people." The p-value suggests that, if the actual population slope is zero, we'd expect to get a slope of 88.78 or something farther away from zero less than three times out of every one-thousand samples. The intercept indicates that the expected value of violent crimes per 100,000 is 76.69 when the unemployment rate is zero (is this a reasonable number?).
      Source |       SS          df       MS             Number of obs =     50
-------------+-------------------------------            F(  1,    48) =   9.88
       Model |   606247.34       1    606247.34          Prob > F      = 0.0029
    Residual |  2945642.12      48   61367.5441          R-squared     = 0.1707
-------------+-------------------------------            Adj R-squared = 0.1534
       Total |  3551889.46      49   72487.5399          Root MSE      = 247.72

    violrate |      Coef.        t     P>|t|
-------------+-------------------------------
    unemprat |   88.78214      3.14    0.003
       _cons |   76.68569      0.51    0.615
At this point we should discuss standardized coefficients. To calculate standardized coefficients using Stata, use the regress command with the beta option (regress violrate unemprat, beta). Under the column labeled beta, Stata lists the number 0.413. Have you seen this number before? Look at the correlation matrix shown earlier. The Pearson's correlation between the unemployment rate and violent crimes per 100,000 is 0.413. The standardized coefficient measures the association between the variables in terms of standard deviation units or z-scores; when there is only one explanatory variable this statistic is the same as the Pearson's correlation (recall from Chapter 1 that we identified the correlation as the standardized version of the covariance). As we shall soon see, this will change once we move to a multiple linear regression model. Standardized regression coefficients, or beta weights as they are often called, may be interpreted using standard deviations. One way to consider them is to imagine transforming both the explanatory variable and the outcome variable into z-scores (i.e., standardize them), running the regression model, and looking at the slope. This slope represents the association between variables in standard deviation units and is identical to the standardized regression coefficient that Stata produces. Hence, our interpretation of the standardized regression coefficient in the violent crimes model is "Each one standard deviation increase in the unemployment rate is associated with a 0.413 standard deviation increase in the number of violent crimes per 100,000 people." The standardized regression coefficient and the unstandardized regression coefficient are related based on the following formula:
$$\beta^* = \hat{\beta}_1 \times \frac{s_x}{s_y}, \qquad \text{where } \beta^* \text{ is the standardized coefficient}$$
Recall that sx denotes the standard deviation of x and sy denotes the standard deviation of y. In the earlier example, we may use this formula to compute the standardized coefficient for the unemployment rate slope, as sketched below.
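Here is one way to carry out that computation in Stata; it is a sketch that assumes the model of violent crime on the unemployment rate has just been estimated, so that _b[unemprat] holds the unstandardized slope:

quietly regress violrate unemprat
quietly summarize unemprat
scalar sx = r(sd)
quietly summarize violrate
scalar sy = r(sd)
display _b[unemprat] * sx / sy

The displayed value should be close to the 0.413 reported by the beta option.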
There are some researchers who prefer standardized coefficients because they argue that these statistics allow comparisons of explanatory variables within the same linear regression model. An explanatory variable with a standardized coefficient farther away from zero, they argue, is more strongly associated with the outcome variable than are explanatory variables with standardized coefficients closer to zero. However, this assumes that the distributions of the explanatory variables are the same. Many times, though, one variable is much more skewed than another, so a standard deviation shift in the two variables is not comparable. When we learn about dummy variables in Chapter 6, well also see that standardized regression coefficients are of limited utility. Moreover, standardized regression coefficients should not be used to compare coefficients for the same variable across two different linear regression models. Well now estimate a multiple linear regression model by adding the gross state product to the previous model. In Stata we simply add the new variable to the command (regress violrate unemprat gsprod). You should find the following table in the Stata output window.
      Source |       SS          df       MS             Number of obs =     50
-------------+-------------------------------            F(  2,    47) =  12.08
       Model |  1206205.63       2   603102.813          Prob > F      = 0.0001
    Residual |  2345683.83      47   49908.1666          R-squared     = 0.3396
-------------+-------------------------------            Adj R-squared = 0.3115
       Total |  3551889.46      49   72487.5399          Root MSE      = 223.4

    violrate |      Coef.        t     P>|t|
-------------+-------------------------------
    unemprat |   72.80565      2.81    0.007
      gsprod |   67.17297      3.47    0.001
       _cons |   63.51354      0.47    0.644
There are now two slopes, one for each explanatory variable. We'll first offer some interpretations for the numbers in this table and then try to understand how a multiple linear regression model comes up with them.
Try to imagine the association among these three variables in three dimensions. If this is difficult, begin with a two variable scatterplot with x- and y-axes and then picture another axis, labeled z, coming out of the page. The three axes are at 90 degree angles to one another. Now, how is the intercept interpreted in this representation? Perhaps you can guess that it is the expected value of the outcome variable when both of the explanatory variables are zero. If we had a highly unusual state that had no unemployment and no economic productivity (well call it Nirvana), wed expect it to have 63.51 violent crimes per 100,000 (Okay, well call it Club Med). Of course, the intercept in this model is meaningless since it falls so far outside of any reasonable values of the explanatory variables. The slopes are interpreted the same way as before, but we add a phrase to each statement. Heres an example: Controlling for the effects of the unemployment rate, each $100,000 increase in the gross state product is associated with 67.17 more violent crimes per 100,000 people. Notice that weve used the phrase controlling for. Other phrases that are often used include statistically adjusting for or partialling out. Because we are statistically controlling for or partialling out the effects of a third variable, multiple linear regression coefficients are also known as partial regression coefficients or partial slopes. Statistical control or adjustment is a difficult concept to grasp. One way to think about it is to claim that we are estimating the slope of one variable on another while holding constant the other explanatory variable. For instance, when we claim that each $100,000 increase in the gross state product is associated with 67.17 more violent crimes per 100,000 people, we are assuming that this average increase occurs for any value of the unemployment rate, whether 1% or 12%. In other words, the unemployment rate is assumed to be held constant. To further our understanding of this important idea, well discuss four ways of approaching this topic with the hope that perhaps one or two of these will be useful. However, the first way may only be useful if you have superior spatial perception skills. This is to use a three-dimensional plot to visualize the relationship among
the three variables. Stata has a user-written 3-D graphing option (called scat3; type findit scat3 on the command line). By plotting three variables, using the regression option to calculate the slopes, and looking at how the points follow a particular variable's slope, you may be able to visualize the meaning of statistical control. A second way to understand statistical control assumes experience with calculus. A partial slope is synonymous with a partial derivative. Suppose that y is a function of two variables, x and w, that we wish to represent in a regression equation. If w is held constant (e.g., w = w0), then y becomes a function of a single variable x. Its derivative at a particular value of x is known as the partial derivative of y with respect to x and is represented by $\frac{\partial y}{\partial x}$ or $\frac{\partial f(x, w)}{\partial x}$, where $y = f(x, w)$. We won't go any further than this, even though it might be valuable to some readers. We'll assume that if you've had exposure to partial derivatives and you find them useful, then you can explore the relationship between partial derivatives and partial slopes. Many calculus textbooks include graphical representations of partial derivatives that may help you visualize the notion of holding one variable constant while allowing another to vary. The next two ways of approaching the concept of statistical control or adjustment in multiple linear regression models should be helpful to most readers. The first involves computing residuals from two distinct linear regression models and then using these residuals to compute the partial slope. Before showing how to do this, it is useful to consider the nature of the residuals. As mentioned in Chapter 2, residuals measure the distance from the regression line to the actual observations. A basic formula for a residual is $resid_i = (y_i - \hat{y}_i)$. (As mentioned earlier, you'll often find the residuals represented as ei.) This means that we take the observed values and subtract the values we predict based on the linear regression model. As indicated by the subscripts, we do this for each observation. Another way to think about residuals is that they measure the variation that is left over in
the outcome variable after we have accounted for the systematic part that is associated with the explanatory variable. Part of what is left over may be accounted for by another explanatory variable. This part is represented by the partial slope. In order to show how this is done, take the following steps in Stata (these steps are collected in a short sketch at the end of this discussion):
1. Estimate a simple linear regression model with violent crime as the outcome variable and the unemployment rate as the explanatory variable (leave gross state product out of the model). Next use the predict command to calculate the unstandardized residuals (predict resid, residual). This will compute a new variable (called resid in our example) that is made up of the residuals from the model for each observation. These residuals represent the variability in violent crimes that is not accounted for by the unemployment rate. (Browse Stata's Data Editor to examine these residuals.)
2. Estimate a second regression model with gross state product as the outcome variable and the unemployment rate as the explanatory variable. Once again, predict the residuals from this model (use a different variable name, such as resid1). These residuals represent the variability in the gross state product that is not accounted for by the unemployment rate.
3. Estimate a third regression model that uses the residuals from (1) as the outcome variable and the residuals from (2) as the explanatory variable (e.g., regress resid resid1). There is no need to predict the residuals from this model.
Now, look at the unstandardized regression coefficient from the third model. It is 67.17, the same number we found when we estimated the multiple linear regression model earlier. This is because the slopes from these two models represent the same thing: the covariability between the gross state product and violent crime that is not accounted for by the unemployment rate. The fourth way to consider statistical control is similar to using
residuals, except it is visual. The figure below shows three overlapping circles that are labeled A, B, and C. We could have labeled them y, x1, and x2 to show that they are variables we are interested in using in a regression model. Let's assume that A represents the outcome variable and B and C are explanatory variables. The total area of the circles represents their variability (if you find it more comforting, think of the sums of squares or the variances as represented by circles). The overlapping areas signify joint variability, such as a measure of covariance between two variables. Notice the darkened area of overlap between A and B. It does not include any part of C. This area denotes the joint variability of A and B that is not accounted for at all by the explanatory variable C. It represents the partial slope or the regression coefficient from a multiple linear regression model. Notice that there is no need for B and C to be completely independent. Rather, we can assume that they covary. Hence, the term independent variable, which is often used in regression modeling, can be misleading: It does not mean that the predictor variables are independent from one another, only that they might (assuming our conceptual model is a good one) independently predict the outcome variable. Extending multiple linear regression models to include more than two explanatory variables is easy and requires no deeper level of understanding than that which we already possess. The phrase statistically adjusting for the effects of must now either specify each other explanatory variable or state something such as "statistically adjusting for the effects of the other variables in the model, each one unit increase in variable x1 is associated with a [partial slope] unit increase in variable y."
[Figure: three overlapping circles labeled A, B, and C; the darkened area where A and B overlap but C does not represents the partial slope.]
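Here, as promised, is a short .do-file sketch that collects the three residual-regression steps (the commands mirror those given in the text; treat the exact output as approximate):

regress violrate unemprat
predict resid, residual
regress gsprod unemprat
predict resid1, residual
regress resid resid1

The coefficient on resid1 in the final model should be close to the 67.17 obtained from the multiple linear regression model.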
A Brief Overview of the Assumptions of Multiple Linear Regression Models
Now that we have some sense of what multiple linear regression models produce in the way of slopes, intercepts, and p-values, it is a good time to review the assumptions that we make when using the model. We saw some of these assumptions in Chapter 2, but in the interests of completeness, we'll revisit them. Moreover, this is a brief overview because several of the subsequent chapters review these assumptions in painful detail.
1. Independence: The Y values are statistically independent of one another. We saw that this assumption applies to the simple linear regression model. Now that we have seen a couple of regression models, we can think about how this assumption might operate. For example, when analyzing state-level data we assume that our measures are independent across states. But, is this true? It is likely that states that share borders are more
similar in a lot of ways than states that are in different parts of the country.
2. Distribution of the Error Term: The error term is normally distributed with a mean of zero and constant variance. The first part of this assumption (that the mean is zero) is important for estimating the correct intercept. But this assumption taken as a whole is also important for making inferences from the sample to the population. In other words, it is important for significance tests and confidence intervals.
3. Specification Error: The correlation between each explanatory variable and the error term or residuals is zero. If there is a noticeable correlation, we've left something important out of the model. In Chapter 7, we'll learn that this problem involves whether or not we have specified the correct model; hence, it is known as specification error.
4. Measurement Error: The xs and y are measured without error. Notice that we are using the lower case to indicate the explanatory and outcome variables. This is because we are concerned primarily with sample measures. As the name implies, we must consider whether we are measuring the variables accurately. When we ask people to record their family incomes, are they providing us with accurate information? When we ask people whether they are happy, might some interpret this question differently than others? We'll learn in Chapter 8 that measurement errors are a plague on the social and behavioral sciences.
5. Collinearity: There is not perfect collinearity among any combination of the Xs. If you consider the overlapping circles that we saw in the last section, you can visualize what this means. Suppose that circles B and C overlap completely. Is it possible to estimate the covariance between A and B that is independent of C? Of course not. There is no variability left over in B once we consider its association with C. We'll discover in Chapter 9 that even a higher degree of collinearity,
or overlap, among explanatory variables can cause problems in multiple linear regression models.
6. Linearity: The mean value of Y for each specific combination of the Xs is a linear function of the Xs. That is, the regression surface is flat in three dimensions. Note that in a simple linear regression model we only had to assume a straight line relationship, but now that we have two or more explanatory variables we must move to higher dimensions. One way to think about this is to imagine a three-dimensional space and then visualize the difference between a flat surface and a curved surface. We assume that the relationship between the Xs and Y is not curved for any X. In Chapter 10, we'll find out about some of the tools for analyzing relationships that are not linear.
7. Homoscedasticity: The variance of the error term is constant for all combinations of Xs. The term homoscedasticity means same scatter. Its antonym is heteroscedasticity (different scatter). We'll learn much more about this issue in Chapter 11.
8. Autocorrelation: There is not an autocorrelated pattern among the errors. This is a huge problem in time series and spatial regression models. These involve data collected over time, such as months or years, or across spatial units, such as counties, states, or nations. Another name for this problem in over-time data is serial correlation. The key issue that leads to autocorrelation is that errors or residuals that are closer together in time (year 1 and year 2) or space (Los Angeles County, CA and Orange County, CA) share more unexplained variance than errors that are farther apart in time (year 1 and year 10) or space (Los Angeles County, CA and Fairfax County, VA). Chapter 11 includes a discussion of autocorrelation.
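Several of these assumptions can be probed with Stata's postestimation tools. The sketch below lists a few commands I believe are available after regress; the chapters that follow explain what each check means, and the model shown is just the running violent crime example:

regress violrate unemprat gsprod
rvfplot               // plot of residuals against fitted values (linearity, heteroscedasticity)
estat hettest         // Breusch-Pagan test for heteroscedasticity
estat vif             // variance inflation factors for collinearity
predict r, rstudent   // studentized residuals for spotting unusual observations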
Although not normally listed as an assumption, we also assume that there are not any wildly discrepant values that affect the model. This is related to assumptions about measurement error, specification error, and linearity. Suppose, for instance, that you look at a variable such as annual income and find the following values: [$25,000; $35,000; $50,000; $75,000; $33,000,000]. As we'll see in Chapter 12, the last entry is labeled as either a leverage point (if income is an explanatory variable) or an outlier (if income is the outcome variable). In either case, these types of discrepant values, which are known generally as influential observations, can affect, sometimes in untoward ways, linear regression models. The question you should ask yourself when confronted by such values is why they have occurred. Did someone record the wrong value, perhaps by placing a decimal point in the wrong spot? Or is this value accurate; does someone in your sample actually earn that much money per year? If the value is simply a coding error, then fixing it is usually simple. If it is an actual value, then perhaps a linear model predicting income, or using it as a predictor, is not appropriate. In this case, we should search for other regression techniques to model income. Fortunately, multiple linear regression models are flexible enough to accommodate many such problems; you just need to know where to look for solutions. There are two general issues to consider: First, how do we test these assumptions; and, second, what do we do if they are violated? There are tests, most of them indirect, for all of these assumptions. Furthermore, if they are violated (that is, if our model does not meet one or more of these assumptions), there are usually solutions that still make linear regression models viable. Several of the subsequent chapters in this presentation discuss these tests and the solutions. The term regression diagnostics is used as an umbrella term for these tests. Several of these assumptions are stringent, but, one argument goes, our models are saved by statistical theory's famous Central Limit Theorem (CLT) even when one or more of these assumptions is violated. You might recall from introductory statistics that the CLT states that, for relatively large samples, the sampling distribution of
the mean of a variable is approximately normally distributed even if the distribution of the underlying variable is not normally distributed (for a nice review, see Neil A. Weiss (1999), Introductory Statistics, Fifth Edition, Reading, MA: Addison-Wesley, p. 427; a standard proof is provided in Bernard W. Lindgren (1993), Statistical Theory, Fourth Edition, New York: Chapman & Hall, p. 140). A bit more formally, we state: For random variables with finite sample variance, the sampling distribution of the standardized sample means approaches the standard normal distribution as the sample size approaches infinity. This concept concerns the sampling distribution of the mean, which, as we've seen before, assumes that we take many samples from the population. If these samples are drawn randomly, the distribution of means tends to approximate the normal distribution once the samples contain only about 30 observations each. However, if the underlying variable's distribution is highly skewed, it may take larger samples to approximate the normal. Since intercepts and slopes are similar to means, they also may be shown to follow particular normal-like distributions (such as the t-distribution). Therefore, given a large enough sample, we can infer that even if the outcome or explanatory variables are not normally distributed, the results of the regression model will be accurate most of the time. How large a sample needs to be to be large enough is a difficult question to answer, though. Thus, it is important to learn techniques appropriate for when assumptions like linearity or normality are not met. And this discussion should also reinforce the idea that using random samples or having control over explanatory variables is important. Let's assume, though, that the assumptions are met. Then, according to something known as the Gauss-Markov theorem (see Lindgren (1993), op. cit., p. 510), the OLS estimators for the slopes, intercepts, and standard errors offer the best linear unbiased estimators (BLUE) among the class of linear estimators. No other
linear unbiased estimator has a smaller sampling variance, or is more precise on average, than the OLS estimators.
is known as a linear combination of the xs that has the largest possible correlation with the y variable. This is an ideal because we are attempting, with this model, to explain as much of the variability in the outcome variable as possible with the explanatory variables. Well learn in Chapter 4 about some ways to estimate this correlation. We also mentioned in the last chapter that statistical software uses matrix routines to come up with slopes, standard errors, and other particulars of linear regression models. For those of you familiar with vectors and matrices, think of the y-values as a vector of observations and the x-values as a matrix of observations:
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$$
Given this short-hand way of expressing the explanatory and outcome variables (translate the vector and matrix into spreadsheet format, such as Stata's data view, if it's not clear how this works, and it is easy to see the utility of representing data this way), we may express the multiple linear regression equation as $\hat{Y} = X\hat{\beta}$, where the $\hat{\beta}$ represents a vector of the intercept (denoted in the X matrix with 1s)
and slopes for each explanatory variable. The X is listed first in this equation because we say it is postmultiplied by the vector of slopes. Given the X and Y matrices, the following matrix formula may be used to estimate the vector of slopes:
$$\hat{\beta} = \left(X'X\right)^{-1} X'Y$$
The X with the accent next to it is known as the transpose of the matrix and the superscripted −1 indicates that we should take the inverse of the product in parentheses. In matrix terminology we say the vector of slopes is estimated by postmultiplying the inverse of X-transpose times X by X-transpose times Y. Try it out with a very small data set (if you've got sufficient time and patience to compute the inverse) and see if your results match what Stata comes up with. Matrix algebra is also useful for estimating several other features of multiple linear regression models. For example, the standard errors of the coefficients may be estimated by taking the square roots of the diagonal elements of the following matrix:
$$V(\hat{\beta}) = \hat{\sigma}^2 \left(X'X\right)^{-1}, \qquad \text{where } \hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1}$$

The term n refers to the sample size and the term k refers to the number of explanatory variables in the model. We'll learn more about $\hat{\sigma}^2$ in Chapter 4 when we discuss goodness-of-fit statistics. For now, note that it measures the amount of variability of the residuals around the regression line. An alternative way of computing the standard errors that does not use matrix algebra is
$$se(\hat{\beta}_i) = \sqrt{\frac{\sum (y - \hat{y})^2}{\sum (x_i - \bar{x}_i)^2 \,\left(1 - R_i^2\right)(n - k - 1)}}$$
We'll learn more about the R2 value in Chapter 4. For now, we'll say only that it is the R2 from a linear regression model that includes xi as the outcome variable and all the other explanatory variables as predictors in what is known as an
auxiliary regression model. The quantity in parentheses that involves the R2 is called the tolerance. Moreover, notice that the standard error is affected by the sum of squares of x, with larger values tending to decrease the standard error, and the SSE, with larger values tending to increase the standard error. A larger sample size also tends to decrease the standard error. David G. Kleinbaum et al. (1998), Applied Regression Analysis and Other Multivariable Methods, Third Edition, Pacific Grove, CA: Duxbury Press (see Appendix B), provide a relatively painless overview of matrix routines useful in linear regression analysis. A more painful review, but one that is worth the effort for understanding the role of matrix routines in statistics, is found in James R. Schott (1997), Matrix Analysis for Statistics, New York: Wiley. Stata has a number of matrix routines built into its standard programs. Type help matrix in the command line to see its many options.
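For readers who want to see the matrix formulas in action, here is a rough sketch in Mata, Stata's matrix language; the variable names follow the state data used earlier, and the column of ones supplies the intercept (treat it as an illustration, not the book's own code):

mata
// assumes violrate, unemprat, and gsprod are in memory
y = st_data(., "violrate")
X = st_data(., ("unemprat", "gsprod")), J(st_nobs(), 1, 1)  // 1s column for the intercept
b = invsym(X'X) * X'y                  // the (X'X)^(-1) X'Y formula
e = y - X*b                            // residuals
s2 = (e'e) / (rows(X) - cols(X))       // estimate of sigma-squared
V = s2 * invsym(X'X)                   // variance-covariance matrix of the estimates
b, sqrt(diagonal(V))                   // coefficients and their standard errors
end

The two displayed columns should match, up to rounding, the coefficients and standard errors that regress violrate unemprat gsprod reports.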
As in a standard ANOVA, there is a set of numbers known as the sum of squares. Recall from Chapter 1 that we discussed a measure of dispersion known as the sum of squares of X, abbreviated SS[X]. The formula for this measure is $SS[X] = \sum_i (x_i - \bar{x})^2$. But we now have an outcome variable, labeled y, and, as seen in the last chapter, we wish to assess something about the overlap between the explanatory and outcome variables. So, it makes sense to try to come up with a measure of this overlap. Now, if the sum of squares of the outcome variable measures its total area, we might also wish for a measure of the overlapping area. This is what the ANOVA table provides. Under
the table cell called Total Sum of Squares (Total SS or TSS) we have SS[y]. If we were to take all 50 values of the violent crime variable, subtract the mean from each, square these differences, and add them up, we would obtain an overall sum of squares of 3,551,889.46. This may be thought of as the total area of a circle representing the variation in violent crime. What, then, are those other sums of squares? The first one, in the Regression Sum of Squares (RSS) cell, or what Stata calls the Model SS, computes the variation of the values of y predicted from the regression equation around the mean of y. Recall that the predicted value is designated as $\hat{y}$ and is calculated based on the linear regression equation. Each non-missing observation in the data set has a predicted value. We may therefore modify the sum of squares formula to reflect this type of variation:

$$RSS = \sum_i (\hat{y}_i - \bar{y})^2$$
This represents the total area of overlap between the variability of the xs and the variability of y (since the values of y are predicted based on the xs). It may also be thought of as the improvement in prediction over the mean with information from the explanatory variables. Its value in the linear regression model is 606,247.34. The other sum of squares, in the Residual Sum of Squares (Residual SS) cell (better known as the sum of squared errors, or SSE [see Chapter 2]), is the area of y that is left over after accounting for its overlap with the independent variables. It is computed using the following equation: SSE =
$\sum_i (y_i - \hat{y}_i)^2$
Its value in the Stata table is 2,945,642.12. Notice that the RSS and the SSE sum to equal the TSS: 606,247.34 + 2,945,642.12 = 3,551,889.46. This is because we assume we have accounted for all of the variation in y with either the variability in the xs or random variability (this does not, however, rule out that there might be other variables that account for portions of the variability of y). Heres a
picture of what we are examining with the ANOVA portion of our linear regression model of violent crimes per 100,000:
[Figure: circle representing the variability in the violent crime rate; total area of the circle: 3,551,889.46; area of the circle not overlapping with the explanatory variable: 2,945,642.12.]
Another way of representing the equations inherent in the ANOVA table is to combine the sums of squares equations into one general equation. Such an equation should represent the partitioning of the area of the left-hand circle, or the total sum of squares of violent crimes, into its two component parts. The diagram below illustrates this concept. The next issue to address is whether we can use this information on overlapping or joint variability to come up with a measure of the overall fit of the model. In regression analyses of all types there is a concern with what is known as goodness-of-fit statistics. How do we know if we are accounting for a substantial amount of the variability in the outcome variable?
$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$$

SS[y] or TSS = RSS + SSE
3,551,889.5 = 606,247.34 + 2,945,642.1
One idea is to use the sums of squares to estimate the amount of variability. Since we have area measures of a type, we can estimate the proportion or percentage of variation in the outcome variable that is accounted for by the explanatory variables. There is a number labeled R Square next to the table; this is typically labeled simply R2. Known in some circles as the coefficient of determination or the squared multiple correlation coefficient, it indicates the proportion of variability in the outcome variable that is accounted for (some use the phrase explained by) by the explanatory variables. It is simple to transform it into a percentage. In the violent crime model, the R2 is 0.171, or 17.1%. In other words, the linear regression model accounts for 17.1% of the variability in violent crimes. Returning to our sums of squares equation, it should be apparent how to compute this R2 value: Take the RSS and divide it by the TSS. Therefore, from our model that predicted violent crimes we find 606,247.34/3,551,889.46 = 0.171. Sometimes the following formula is used, but it is equivalent to simply dividing the RSS by the TSS since SSE + RSS = TSS:
$$R^2 = 1 - \frac{SSE}{TSS}$$
An important property of the R2 measure is that it falls between zero and one, with zero indicating that the RSS is zero and a one indicating that the RSS = TSS, with the SSE equal to zero. An interesting phenomenon occurs when we add explanatory variables to the model: The R2 increases whether or not the explanatory variables add anything of substance to the model's ability to predict y. Many researchers are understandably bothered by this problem since they do not wish to be misled into thinking that their model provides good predictive power when it does not. Thus, most use a measure known as the adjusted R2. The term adjusted is used to indicate that the R2 is adjusted for the number of explanatory variables in the model. If we simply add useless explanatory variables, the adjusted R2 can actually decrease. The formula for this statistic is

$$\text{adjusted } R^2 = R^2 - \left[\frac{k}{n - k - 1}\right]\left(1 - R^2\right)$$

For the model with two explanatory variables, for example, 0.3396 − (2/47)(1 − 0.3396) = 0.3115, which matches the adjusted R2 in the Stata output.
In this equation, as before, n indicates the sample size and k indicates the number of explanatory variables in the regression model. Another term next to the table is labeled Root MSE or Root Mean Squared Error. Although it is not used very often in the social and behavioral sciences (although it probably should be), this standard error, which is sometimes referred to as the Standard Error of the Estimate or the Standard Error of the Regression (SE or Sy|x), measures the average size of the residuals. It is also the square root of what is commonly referred to as the Mean Squared Error (MSE):
$$MSE = \frac{SSE}{n - k - 1}, \qquad \sqrt{MSE} = S_E$$
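As a quick check, using the numbers from the simple model of violent crime reported earlier:

$$MSE = \frac{2{,}945{,}642.12}{50 - 1 - 1} = 61{,}367.54, \qquad S_E = \sqrt{61{,}367.54} \approx 247.72$$

which agrees with the Root MSE shown in that model's Stata output.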
The mean squared error is found in the ANOVA table (find the cell in the MS column and the Residual row). The root mean squared error labeled Root MSE in the Stata output may be used as a goodness-of-fit measure since smaller values indicate that there is less variation in the residuals of the model. It is also useful for estimating what are known as prediction intervals: Bounds for the predictions we
make with the model for particular observations within the data set. For instance, say we wish to determine the interval of predicted violent crimes for a state in which the unemployment rate is 5.2, 6.4, or 7.8. The following equation may be used:
$$PI = \hat{y} \pm \left(t_{n-2}\right)\left(S_E\right)\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n-1)\,\mathrm{Var}(x)}}$$
A value such as 5.2 is substituted for x0. The t-value is the confidence level we wish to use with degrees of freedom equal to n − 2. For example, with a sample size of 50, we'd use a t-value of 2.011 for a confidence level of 95% (two-tailed). As with the confidence intervals for the slopes and the residuals from the model, Stata will compute prediction intervals (use the predict command). As with other statistics derived from a sample, the R2 has a sampling distribution. If we were to take many samples from the population, then we could compute R2s for each sample. We may therefore use significance tests and confidence intervals for this statistic. Before seeing such a test, though, we have to determine how R2 values are distributed. We've thus far mentioned the normal distribution, the standard normal distribution (z-distribution), and the t-distribution. R2 values follow an F-distribution. Perhaps you remember the F-distribution from the ANOVA model. If so, you'll recall that it has two degrees of freedom, known as the numerator and denominator degrees of freedom. Notice that in the ANOVA table there are columns labeled df (degrees of freedom) and MS (Mean Square). The cells in the degrees of freedom column include the number of explanatory variables (k; in the Regression row), the sample size minus the number of explanatory variables minus one (n − k − 1; in the Residual row), and the sample size minus one (n − 1; in the Total row). The cells in the Mean Square column are the sums of squares in the rows divided by their accompanying degrees of freedom. Hence, the mean square due to regression is 606,247.34/1 = 606,247.34 and the mean square due to residuals is 2,945,642.12/48 = 61,367.54. This latter value is also the aforementioned mean squared
error (MSE). The number labeled F next to the ANOVA table includes the F-value, which is the mean square due to regression divided by the mean square due to residuals (MSR/MSE); in the violent crime model, for example, 606,247.34/61,367.54 = 9.88, which matches the Stata output. Similar to the way we computed the t-values for the slopes, this statistic may be thought of as a measure of how far our regression predictions are from zero, or how well our model is doing in predicting the outcome variable. The degrees of freedom for the F-value are found in the df column, k and (n − k − 1), or next to the F in parentheses. If we have a program that will compute it, such as Stata, we'll rarely have to consult the F-values in a table at the back of a statistics book or on the internet. Rather, we find under the F-value a number labeled Prob > F that contains the p-value we need. This is the significance test for whether our model is predicting the outcome variable better with information from the explanatory variables than with simply the mean of the outcome variable. A better way, perhaps, of looking at the F-test (F-value and its p-value) is that it compares the null hypothesis that R2 equals zero versus the alternative hypothesis that R2 is greater than zero. If we find a significant F-value, we gain confidence that we are predicting a statistically significant amount of the variability in the outcome variable. It should by now be rather unsurprising that we may also compute confidence intervals for the estimated R2. However, this is done so rarely that we won't learn how to do it (but see Kleinbaum et al., op. cit., Chapter 10). Most researchers from the social and behavioral sciences use the R2 and accompanying F-test to determine something about the importance of the model. You will often find statements such as "the model accounts for 42% of the variability in the tendency to twitch, and this proportion is significantly greater than zero." However, you should be cautious about such conclusions. Some statisticians argue that we should rely on either the substantive conclusions recommended by the partial regression slopes or the MSE (or SE) to judge how well the set of explanatory variables predicts the outcome variable (see, in particular, Franklin A. Graybill and Hariharan K. Iyer
(1994), Regression Analysis: Concepts and Applications, Belmont, CA: Duxbury Press, Chapter 3, for a discussion of this issue).
First, let's look at the Model Summary information. The R2 value is 0.85, the adjusted R2 value is 0.822, and the Root MSE is 0.262. Hence, we may claim that, using the three explanatory variables, the model accounts for 85% of the variability in first-year college GPA. This percentage explained is not affected much by the number of explanatory variables in the model. The output also indicates that the R2 value is statistically distinct from zero, with an F-value of 30.31 (df = 3, 16) that is accompanied by a p-value of less than .0001. The table of coefficients shows several results. It is best to ignore the intercept in this model because, first, SAT scores do not have a zero value (the minimum is 200) and, second, it is unlikely that a student who had a grade of zero in high school math would be attending
college! Since SAT scores for math and verbal abilities are measured on a similar scale, we may compare their coefficients. It appears that SAT math scores matter more than SAT verbal scores for predicting GPA. We may say that, statistically adjusting for the effects of SAT verbal scores and high school math grades, each one-unit increase in SAT math scores is associated with a 0.218 increase in first-year college GPA. The Beta coefficients (beta-weights), which measure the associations in standard deviation units, support the idea that SAT math scores are a better predictor than verbal scores or high school math grades. However, it is important to always remember the issue of sampling error: We do not know based only on the point estimates whether the two SAT coefficients are significantly distinct. There are methods for determining this. Relatively simple methods involving F-tests or t-tests are shown in Samprit Chatterjee and Bertram Price (1991), Regression Analysis by Example, Second Edition, New York: Wiley, pp. 76-79. (See also Graybill and Iyer, op. cit., Chapter 3.) A crude way to compare them is to examine the confidence intervals for each coefficient and then see whether there's any overlap (run the regress command again without the beta option). For instance, the confidence intervals for the slopes indicate that the 95% CI for math_sat is {0.122, 0.315} and for verbal_sat is {0.020, 0.243}. Since these overlap by a substantial margin, we should not be confident that math scores are a better predictor than verbal scores. This conclusion should not be surprising, however, because the sample size is small (n = 20). Perhaps the best method in Stata is to use the test postcommand. This allows a comparison of the coefficients to determine if one is statistically distinct from another. For example, after estimating the regression model we may type test sat_math = sat_verbal. The result is the following in the output window:
( 1)  sat_math - sat_verb = 0

       F(  1,    16) =    1.23
            Prob > F =    0.2841
This F-test shows that there is not a statistically significant difference between the two coefficients. Hence, we cannot conclude that SAT math scores are a more substantial predictor of first-year college GPA. The coefficient for high school math grades offers an interesting problem. The p-value for this coefficient is 0.057, just over the usual threshold for making claims of statistical significance. Yet we might still be able to claim that this result is statistically significant. It seems perfectly reasonable to create, a priori, a hypothesis that states that as math grades increase, first-year GPA should also increase. If this is the case (although you might contend it's too late for that!), then my hypothesis is directional and I may use a one-tailed significance test or confidence interval. As mentioned earlier, to switch from a two-tailed test (the Stata default) to a one-tailed test we divide the p-value in half. Thus, the one-tailed p-value is 0.057/2 = 0.0285. Voilà! We now have a significant result. This demonstrates the uncertainty involved in many statistical endeavors. Some might see this statistical sleight-of-hand as entirely justifiable, whereas others may argue that we are being disingenuous. There is no simple answer, though, so we should be careful to set up our hypotheses and describe the types of significance tests we plan to use before we estimate the regression model.
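As a small aside, Stata's display command will do this halving for you; a trivial sketch:

    display 0.057/2

which prints .0285, the one-tailed p-value just described.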
We say that the first model is nested within the second model because it includes a subset of the variables. Often, the model with the additional explanatory variables is called the full model; the nested model is also known as the restricted model. Non-nested models are two (or more) regression models in which neither model's set of explanatory variables is a subset of the other's. For example, if the first model included resting heart rate, exercise frequency, and age as explanatory variables, and the second included triglyceride levels, smoking status, and time spent each day watching television, we cannot say that one model is nested within another. Since comparing models in the non-nested situation is difficult (see Graybill and Iyer, op. cit., Chapter 4), we will not pursue this issue here.
One of the advantages of this test is that, as we shall see, it generalizes well when we wish to add more than one explanatory variable to the model. However, it may be inefficient in the current situation because we need to check a table of F-values to determine the p-value. As an alternative, most researchers use the t-test. The F-value from the partial F-test is simply the square of the t-value from a two-tailed t-test. Moreover, the t-test is printed in the Stata output: It is the t-value and accompanying p-value for the extra explanatory variable that has been added to the model.
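To make this concrete, here is a minimal sketch; the variable names (y, x1, x2, and xnew, the single added variable) are hypothetical:

    regress y x1 x2 xnew
    test xnew

The F statistic that test reports for the single added variable equals the square of xnew's t-value in the regression table, and the two p-values match.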
But suppose we wish to compare two nested models in which the full model contains two or more extra explanatory variables. In this situation we extend the partial F-test by using the multiple partial F-test. As the following formula shows, it is a logical extension of the partial F-test:
F = [(SSR_full − SSR_restricted) / q] / MSE_full

df = q, n − k − 1 (full model)
In this equation we compute an F-value based on the difference between the regression sums of squares from the two models. But note that this difference is divided by q, the difference in the number of explanatory variables between the two models [k(full) − k(restricted)], before dividing by the mean square error of the full model. The degrees of freedom for the F-value are q and the degrees of freedom associated with the error sum of squares for the full model. As an example, let's return to the usdata.dta data set. The models we estimate include the number of robberies per 100,000 as the outcome variable (robbrate). The explanatory variables in the restricted model are the unemployment rate (unemprat) and per capita income in $1,000s (perinc). The full model includes these two variables plus the gross state product in $100,000s (gsprod) and migrations per 100,000 (mig_rate). The result of the restricted (or nested) multiple linear regression model is found in the following table:
      Source |       SS         df       MS            Number of obs =     50
       Model |  158892.761       2   79446.3807        F(  2,    47) =  11.19
    Residual |  333674.466      47   7099.45673        Prob > F      = 0.0001
       Total |  492567.228      49   10052.3924        R-squared     = 0.3226
                                                       Adj R-squared = 0.2938
                                                       Root MSE      = 84.258

    robbrate |      Coef.        t    P>|t|       Beta
    unemprat |   25.01909     2.59    0.013   .3126354
      perinc |   13.60748     3.71    0.001   .4470277
       _cons |   -276.815    -3.01    0.004          .
These results indicate that these explanatory variables account for about 32% of the variability in robberies per 100,000. Moreover, there
is a positive association between robberies and each of the explanatory variables: States with higher rates of unemployment or higher per capita incomes tend to experience more robberies. Now, let's see what happens when we re-run the model, but add the gross state product and migrations per 100,000. The following table contains the results of the full model:
      Source |       SS         df       MS            Number of obs =     50
       Model |  274384.111       4   68596.0278        F(  4,    45) =  14.15
    Residual |  218183.116      45    4848.5137        Prob > F      = 0.0000
       Total |  492567.228      49   10052.3924        R-squared     = 0.5570
                                                       Adj R-squared = 0.5177
                                                       Root MSE      = 69.631

    robbrate |      Coef.        t    P>|t|       Beta
    unemprat |   18.88375     2.34    0.024   .2359689
      perinc |    10.0517     3.01    0.004   .3302147
      gsprod |   31.89566     4.81    0.000   .5325452
    mig_rate |   .0054743     1.94    0.059   .2097873
       _cons |  -219.0959    -2.66    0.011          .
It appears we've added some important information to the model. The gross state product, in particular, has a highly significant association with robberies per 100,000 after controlling for the effects of the other variables in the model. In fact, if the beta weights can be trusted (and we cannot know without checking the distributions of all the explanatory variables whether they are suitable for comparing the associations), the gross state product has the strongest association with robberies among the four variables. But can we also determine whether the addition of the two explanatory variables significantly improves the predictive power of the model? Perhaps it would be useful to compare the R2 values. After all, they measure the proportion of variability in y that is explained by the model. The R2 for the restricted model is 0.323 and for the full model 0.557. Unfortunately, it is not a good idea to compare these statistics. Recall that in Chapter 4 we learned that the R2 goes up simply by adding explanatory variables, whether they are good predictors or not. An alternative is to compare the adjusted R2 values since these are not as affected by the addition of explanatory variables. The values are 0.294 for the restricted model and 0.518 for the full model. It appears that this is a substantial increase: The
adjusted R2 went up by about ([0.518 − 0.294] / 0.294) × 100 ≈ 76%. But is the increase in explained variance statistically significant? We may use the multiple partial F-test to answer this question:
F = [(274,384.111 − 158,892.761) / 2] / 4,848.514 = 11.91   (df = 2, 45)
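You can verify this arithmetic with Stata's display command, which doubles as a calculator:

    display ((274384.111 - 158892.761)/2)/4848.514

The result, about 11.91, matches the F-value above.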
An F-value of 11.91 with df = (2, 45) has a p-value of approximately 0.0001, which is much lower than the threshold recommended by most rules-of-thumb. In this situation, it is safe to conclude that the addition of the gross state product and migrations per 100,000 increases the explained variance in the model. Stata provides a simple way to compute a multiple partial F-test with its test postcommand. After estimating the full model, type test gsprod mig_rate. The Stata output should show
( 1)  gsprod = 0
( 2)  mig_rate = 0

       F(  2,    45) =   11.91
            Prob > F =   0.0001
This is simply the same approach that we calculated using the sums of squares and the mean squared error. Note what we are doing: testing whether or not the two coefficients are equal to zero. This is fundamentally what a nested F-test examines. But is this all we would wish to do with this model? We added two variables, one of which, the gross state product, appears to be strongly associated with the number of robberies. The other variable, though, does not appear to be as strongly associated with robberies as the other explanatory variables. However, assuming our hypothesis says we should use a one-tailed significance test (i.e., migrations per 100,000 are positively associated with the number of robberies because…), we may claim that the p-value is 0.059/2 = 0.0295, below the common rule-of-thumb of p < 0.05. Perhaps we should compute a partial F-test to determine whether the addition of migrations adds anything to the model's explained variability (a brief sketch follows this paragraph). However, at some point (some argue it should be very early in the
analysis process) you as the analyst must decide which variables are important and which are not. This should be guided, preferably, by your theory or conceptual model, rather than by rote estimation of regression models or correlation matrices.
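If you do want that single-variable partial F-test for migrations, the same postcommand handles it after the full model (a sketch):

    test mig_rate

The resulting F statistic is again the square of mig_rate's t-value from the regression table.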
Confounding Variables
There is one other issue to discuss about our multiple linear regression models. This returns us to a point made in Chapter 3 about the need for explanatory variables to check for confounding. As mentioned there, a confounding variable accounts for the association between an explanatory and an outcome variable (recall the lighter purchases and lung disease example). We'll discuss confounding a bit more in Chapter 7, but for now it is useful to consider the two regression models designed to predict the number of robberies per 100,000. Looking back over the coefficients from the two models, notice that in the restricted model the unemployment rate has a partial slope of 25.019, whereas in the full model it has a partial slope of 18.884. In other words, the partial slope decreased by about 6.1 units, or roughly 25%, when we included the gross state product and migrations per 100,000 in the model. We cannot tell at this point whether this decrease is statistically significant (although there are tools for this), but let's assume it is. We may then claim that one or both of the new explanatory variables included in the model confounded the association between the unemployment rate and the number of robberies. It did not confound it completely (the slope for the unemployment rate would be statistically indistinct from zero if that were the case), but it changed enough to draw our attention. We usually look for variables that completely account for the association between two variables, but even those that only partially account for it can be interesting. The key question you should always ask yourself, though, is why. Why does the association change when we add a new variable? It could be a random fluctuation in the data, but it might be something important and worthy of further exploration. In this
example, perhaps the unemployment rate and the gross state product are associated in an interesting way.
The first variable, gender, is straightforward since we place people into one of only two groups. The second set of variables requires an explanation, though. Assume we have a variable in our data set that is designed to measure ethnicity (we'll limit this example to three ethnic groups to simplify things). It has three outcomes (Caucasian, African-American, and Hispanic) that are coded 1, 2, and 3. Since this variable is not continuous, nor can it even be ordered in a reasonable way, we could not include it as is in a linear regression model (what would a one-unit increase or decrease mean?). The solution is to create three dummy variables that represent the three groups. The main group represented by each is coded as 1, with any sample member not in this group coded as 0. We can extend this form of dummy variable coding to a variable with any number of unique categories. In the data set (spreadsheet format), our new variables appear as
Observation    Ethnicity            x1    x2    x3
     1         Caucasian             1     0     0
     2         Caucasian             1     0     0
     3         African-American      0     1     0
     4         African-American      0     1     0
     5         Hispanic              0     0     1
     6         Hispanic              0     0     1
The variables x1, x2, and x3 are dummy variables that represent the three ethnic groups in the data set. There are several other types of coding strategies possible, such as effects coding and contrast coding, but the type shown in the table is flexible enough to accommodate many modeling needs. We could also add additional groups (e.g., East Asian, Native American, Aleut) if our original variable included codes for them.
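If the categorical variable already exists in the data set, Stata can build the dummy variables for you; a minimal sketch, assuming the variable is named ethnicity and coded 1, 2, 3 as above:

    tabulate ethnicity, generate(eth)

This creates eth1, eth2, and eth3, one {0, 1} indicator per category.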
There is a rule you should always remember when using dummy variables to represent mutually exclusive groups (e.g., married people, never married people, widowed people, etc.) in a regression model: When setting up the model, include all the dummy variables in the model except one. The one dummy variable that is omitted from the model is known as the reference category or comparison group. Suppose our variable has five categories (represented by k) and we create five dummy variables to represent these categories. We then use four (or k − 1) dummy variables in the regression model. If you try to use all five, the model will not run correctly because there is perfect collinearity among the dummy variables (see Chapter 9). How should you choose the reference category? This is up to you, but there are some guiding principles that may be helpful. First, many researchers use the most frequent category as the reference category, although it is usually preferable to let your hypotheses guide the selection. Second, it is not a good idea to use a relatively small group as the reference category. Third, many statistical software programs have regression routines that automate the creation of dummy variables. When these are used you should check the software documentation to see which dummy variable is excluded from the model. Some programs exclude the most frequent category; others exclude the highest or lowest numbered category. How are dummy variables used in a regression model? Just like any other explanatory variable. It is often valuable to give the variables names that are easy to recognize. So, for example, if we were to create a set of dummy variables representing marital status, we might wish to name them married, divorced, cohabit, single, and widowed. Let's first consider a regression model with only one dummy variable. This dummy variable is labeled gender. The outcome variable is annual personal income in $10,000s (pincome). If only gender is in the model, it is identical to an ANOVA model and, as we shall see, may be used to determine the mean income levels for males and females. The Stata data set gss96.dta contains these variables as well as a host of others from a representative sample of adults in the United
States. Notably, there are two variables that measure gender: sex and gender. The first is coded as 1 = male and 2 = female and the second is coded as 0 = male and 1 = female. This illustrates a frequent coding situation: There are many data sets that code dummy variables as {1, 2}. Although we may still use these variables in a regression model, the {0, 1} coding scheme is preferable for reasons that will become clear later on. In addition, the data set includes a marital status variable that is nominal (marital) and a set of dummy variables representing its categories (e.g., married, widow); and a race/ethnicity variable (race) along with a set of dummy variables (Caucasian, AfricanAm, othrace) that represent its categories. There are also several other variables in this data set that measure family income, religion, volunteer activities, and various demographic characteristics. (Note: The income variables are not truly continuous, nor are they coded correctly. Rather, they include categories of income that are not in actual $10,000 units, even though they are labeled as such. Nonetheless, in the interests of unfettered learning, we'll treat them as continuous and pretend that the units are accurate.) Setting up a linear regression model with dummy variables in Stata requires no special tools or unique approach. The set-up is the same as with continuous explanatory variables: Simply place the dummy variable(s) in the regression command following the outcome variable. Let's run the model specified in the last paragraph and see what Stata provides.
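The command itself is just the usual regress statement with the dummy variable listed after the outcome variable:

    regress pincome gender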
      Source |       SS         df       MS            Number of obs =   1899
       Model |  861.352751       1   861.352751        F(  1,  1897) = 104.85
    Residual |  15583.5014    1897   8.21481359        Prob > F      = 0.0000
       Total |  16444.8541    1898   8.66430671        R-squared     = 0.0524
                                                       Adj R-squared = 0.0519
                                                       Root MSE      = 2.8661

     pincome |      Coef.        t    P>|t|
      gender |  -1.347002   -10.24    0.000
       _cons |   10.59491   113.52    0.000
The output looks the same as before, with coefficients, standard errors, t-values, and p-values. In fact, most of the interpretations are similar, except the most important: The coefficients. Recall that we
learned earlier that the intercept is sometimes not very useful in linear regression models because explanatory variables are frequently coded so as not to have an interpretable zero value. Think about how a dummy variable is different: It clearly has an interpretable zero value since it is coded as {0, 1}. But, because it no longer makes much sense to refer to a one-unit increase in the explanatory variable (at least not in the same way as with continuous explanatory variables), we need to modify our thinking about the slope. Perhaps writing out the linear regression model will help us move toward a clearer interpretation of the results:

Predicted personal income = 10.595 − 1.347(gender)

Suppose we wish to compute predicted values from this equation. This should be pretty easy by now:

Predicted income for males:   10.595 − 1.347(0) = 10.595
Predicted income for females: 10.595 − 1.347(1) = 9.248
Consider that the mean of personal income is 9.93, which looks remarkably close to the average of these two predicted values. It should be, for the simple reason that, as foreshadowed earlier, these two predicted values are the mean values of personal income among males and females (confirm this using Stata's summarize command twice with the if option: sum pincome if gender==0; sum pincome if gender==1. An alternative is to use the table command and ask Stata for the mean values of pincome: table gender, contents(mean pincome)). Given that we may predict these two means with the linear regression model, what do the intercept and slope represent? It should be apparent that the intercept, when there is only one {0, 1} coded dummy variable, is the mean of the outcome variable for the group represented by the zero category. The slope represents the average difference in the group means of the outcome variable. Hence, we may say that the expected difference in personal income between males and females is 1.347. Without even computing
predicted values, we immediately see that females, on average, report less personal income than males. One experienced with various types of statistical analysis might ask: Isn't this the type of issue that t-tests are designed for? Of course it is. In fact, using Stata's ttest command (ttest pincome, by(gender)), we find the same results as in the linear regression model: an average difference between males and females of 1.347, a t-value of 10.24, and a p-value of less than 0.001.
Here is the output when the race/ethnicity dummy variables (AfricanAm and othrace) are added to the model:

      Source |    df          MS             Number of obs = 1899
       Model |     3    289.312686
    Residual |  1895    8.22000848
       Total |  1898    8.66430671

     pincome |     Coef.        t    P>|t|     [95% Conf. Interval]
   AfricanAm |    -0.029    -0.15    0.881     -.4112149    .3528828
     othrace |    -0.267    -0.89    0.372      -.852413    .3193103
      gender |    -1.349   -10.21    0.000      -1.60795   -1.089712
       _cons |    10.613   108.93    0.000       10.4224    10.80458
Considering the way we've treated dummy variables so far, what do you think these slopes and the intercept measure? As before, let's write out the regression model:

Predicted income = 10.613 − 1.349(gender) − 0.029(African-American) − 0.267(other race/ethnicity)

Since the intercept in a multiple linear regression model is the expected value of the outcome variable when all the explanatory variables are zero, 10.613 represents the expected mean for males who are not African-American and are not in the other racial/ethnic category. So who's left? Caucasians. In other words, the intercept represents the mean for the omitted category from both sets of dummy variables: gender and race/ethnicity. As we found earlier, the slopes represent average differences among the groups. These are also known as deviations from the mean. Here we are interested in computing deviations from the mean of the reference category (Caucasian males). But we must now combine coefficients to estimate the means for these different groups, as shown in the following table.
Group                          Computation
Caucasian males                10.613
Caucasian females              10.613 − 1.349
African-American males         10.613 − 0.029
African-American females       10.613 − 0.029 − 1.349
Other race/ethnic males        10.613 − 0.267
Other race/ethnic females      10.613 − 0.267 − 1.349
We've estimated six means for six mutually exclusive groups defined by the dummy explanatory variables. However, before being too pleased with ourselves, consider the t-values and p-values in the regression output. As in the earlier model, the gender coefficient is statistically significant. Another way of thinking about this is that the gender difference is statistically significant. However, the race/ethnicity coefficients are not statistically significant. This means that we cannot say that the income differences among Caucasians, African-Americans, and members of other racial/ethnic groups are substantial enough that we could conclude that they differ in the population of adults in the U.S. Also notice that the income difference between males and females in each racial/ethnic group is identical (1.35). Whereas this difference is statistically significant, the income difference between males across racial/ethnic groups and females across racial/ethnic groups is not statistically significant (e.g., the difference between Caucasian females (9.26) and other race/ethnic females (8.99) is not statistically significant).
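If you would rather not combine the coefficients by hand, Stata's lincom postcommand will do it and attach a standard error and confidence interval to the result; a minimal sketch, run after the model above:

    lincom _cons + gender + AfricanAm

This returns the estimated mean income for African-American females (about 9.24 here), along with its confidence interval; swap in other coefficient names for the other groups.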
explanatory variables. As mentioned earlier, the well-known ANCOVA model is the analogue to this type of regression model. ANCOVA models serve several useful purposes. The main advantage is that we may compare groups regarding some variable after adjusting for (controlling for) their associations with some set of continuous variables. Recall that in the last chapter we discussed briefly the issue of confounding variables. Here's another example of this phenomenon. Suppose we are studying rates of colon cancer among cities in the U.S. We draw a sample of cities and find that those in Florida, Arizona, and Texas have higher rates of colon cancer than cities in the Northeast, Midwest, and Northwest areas of the country (perhaps we have set up a series of dummy variables that indicate region of the country). Are there unique environmental hazards in warm weather cities that affect the risk of colon cancer? We cannot tell without much more information about the environmental conditions in these cities. Nevertheless, we have not yet considered a key confounding variable that affects analyses of most rates of disease: age. It is likely that many warm weather cities, especially in so-called sunshine states such as Florida and Arizona, have an older age structure than cities in colder climates. Age is thus associated with both colon cancer rates and region of the country. ANCOVA models are designed to correct or adjust for continuous variables, such as age, that act as confounders. Multiple linear regression models may be used in a similar manner. If we are interested in analyzing differences in some outcome among groups defined by dummy variables, we should consider adjusting for the possible association with conceptually relevant continuous variables. Here is an example that is similar to our earlier dummy variable example, but changes the outcome variable. There is a perception in some circles that fundamentalist or conservative Christians tend to come from poorer families than do other people. Hence, we may surmise (and this is admittedly a crude hypothesis) that family income among fundamentalist Christians is lower on average than in other families in the U.S. Let's test this rough
hypothesis using the gss96.dta data. Here are the results of a linear regression model with family income (fincome) as the outcome variable and a dummy variable (fundamental, coded 0 = non-fundamentalist, 1 = fundamentalist Christian). The model also includes gender as an explanatory variable.
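The command for this first model is simply:

    regress fincome fundamental gender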
      Source |       SS         df       MS            Number of obs =   1899
       Model |  941.766876       2   470.883438        F(  2,  1896) =  24.88
    Residual |  35879.6402    1896   18.9238609        Prob > F      = 0.0000
       Total |  36821.4071    1898   19.4001091        R-squared     = 0.0256
                                                       Adj R-squared = 0.0245
                                                       Root MSE      = 4.3502

     fincome |      Coef.        t    P>|t|
 fundamental |  -.6723305    -3.11    0.002
      gender |  -1.238342    -6.20    0.000
       _cons |   16.80156   108.53    0.000
The model provides support for the hypothesis: On average, adults in the fundamentalist Christian category report 0.672 units less on the family income scale than do other adults. This coefficient is statistically significant at the p = 0.002 level, with a 95% confidence interval of {−1.097, −0.248}. Let's think about this association a bit more. Are there other variables that might account for this negative association? Without even knowing much about fundamentalist Christianity in the United States, I'm sure we can come up with some informed ideas. Let's consider income. What accounts for differences in income in the United States? A prime candidate is education. If we were to read a bit about religion in the U.S., we might come across literature that suggests that conservative or fundamentalist Christians tend to have less formal education than others (although this has changed in recent decades). A reasonable step is to include a variable that assesses formal education to determine whether it affects the results found in the previous model. The gss96.dta data set includes such a variable, educate, which measures the highest year of formal education completed by sample members. What happens when we place it in the model? Rerun the regression model in Stata but include the education variable. You should discover the following results:
      Source |    df          MS             Number of obs = 1899
       Model |     3    1348.98884
    Residual |  1895    17.2952193
       Total |  1898    19.4001091

     fincome |     Coef.        t    P>|t|     [95% Conf. Interval]
 fundamental |    -0.217    -1.03    0.301     -.6280296    .1943594
      gender |    -1.302    -6.81    0.000     -1.677125   -.9275435
     educate |     0.486    13.40    0.000      .4147153    .5569327
       _cons |     9.943    18.66    0.000      8.898408    10.98828
Note that the coefficient associated with the fundamentalist Christian variable is no longer statistically significant. Its p-value is now 0.301. We cannot determine with complete confidence that this coefficient (−0.217) is statistically distinct from the earlier model's coefficient (−0.672) (once again, the confidence intervals might help). But, given the literature on formal education among fundamentalist Christians in the U.S., we should strongly suspect that education confounds the association between considering oneself a fundamentalist Christian and family income. To illustrate another important issue, we'll cheat a bit and pretend that the fundamentalist Christian coefficient is statistically significant (although we could, perhaps, just ignore the issue of statistical significance for purposes of this modest exercise). How can we estimate means for different groups using this regression model? Hopefully, by now it is apparent that writing out the implied prediction equation is useful:

Predicted income = 9.94 − 0.22(fundamental) − 1.3(gender) + 0.49(educate)

We've already seen what to do with the dummy variables: Simply plug in a zero or a one and compute the predicted values. But what should we do with education and its coefficient? Obviously we should do something similar with this variable as we did with the dummy variables: Plug in some value, compute the products, and add them
up. Suppose we think that putting a zero or a one for education is the way to go. What would these numbers represent? Looking over the coding of educate, perhaps by asking Stata to calculate summary statistics (sum educate), we find that it has a minimum value of zero and a maximum value of 20. But notice there is only one sample member who reported one year of education; the next value in the sequence is five. Therefore, placing a zero or a one in the equation is unwise since these values are not represented well in the data set. A value of 20 might be reasonable, but, for various reasons, it is better to use an average value of education in the computations (e.g., mean, median, or mode) or some other sensible value. For example, in the U.S. a well-recognized educational milestone is graduation from high school. This is normally denoted as 12 years of formal education. Twelve years is also the modal category in the distribution of educate. Hence, there is justification for using 12 in our computations if we wish to estimate means for the categories of fundamental and gender that are adjusted for years of education. Here's what the computations show us:
Group                          Computation                         Adjusted Mean
Fundamentalist males           9.94 − 0.22 + (0.49 × 12)                15.60
Fundamentalist females         9.94 − 0.22 − 1.3 + (0.49 × 12)          14.30
Non-fundamentalist males       9.94 + (0.49 × 12)                       15.82
Non-fundamentalist females     9.94 − 1.3 + (0.49 × 12)                 14.52
The average estimated difference in family income between fundamentalist Christians and other adults is 0.22, which is substantially smaller than the difference between males and females. Moreover, as we've already mentioned, a difference of 0.22 is not statistically significant. Nonetheless, it is instructive to compare the means after adjustment for education with the unadjusted means.
These latter values, based on the first family income regression model, are shown in the following table.
Group                          Computation               Unadjusted Mean
Fundamentalist males           16.8 − 0.67                    16.13
Fundamentalist females         16.8 − 0.67 − 1.24             14.89
Non-fundamentalist males       16.8                           16.80
Non-fundamentalist females     16.8 − 1.24                    15.56
To estimate the difference between adjusted and unadjusted values, it is helpful to calculate the percentage difference (using the display command allows Stata to act as a calculator). Choosing, for example, males, the means indicate that adjusting for education reduces the predicted difference in family income between fundamentalists and non-fundamentalists by
[(16.80 − 16.13) − (15.82 − 15.60)] / (16.80 − 16.13) × 100 = [(0.67 − 0.22) / 0.67] × 100 = 67.2%
You should notice that the slopes may be used to compute the percentage difference between the unadjusted and adjusted means. Now that we've made all these computations by hand (and calculator), it's helpful to know about a Stata postcommand that may be used to simplify things quite a bit. The adjust command following the regression model computes predicted values from regression models. The key is to tell Stata the values of the explanatory variables. For example, after estimating the full model, the following command will provide the predicted value for non-fundamentalist males with 12 years of education:
adjust fundamental=0 gender=0 educate=12
Dependent variable: fincome     Command: regress
Covariates set to value: fundamental = 0, gender = 0, educate = 12

----------------------
      All |         xb
----------+-----------
          |    15.7732
----------------------
Key:  xb  =  Linear Prediction
Note that this predicted value, 15.77, is not identical to the value we computed (15.82). The difference is due to the rounding that we were forced to do. Thus, Stata's predicted value is more precise. In order to determine the other predicted values, we simply vary the values of the explanatory variables. Let's consider one more dummy variable regression example. This time we'll consider the association between marital status and family income. It is reasonable to hypothesize that married people have higher incomes, on average, than single people, whether they are never married, divorced, or widowed. Families with a married couple have at least the potential to earn two incomes; in fact, more than half of married-couple families in the U.S. have two wage earners. Therefore, let's run a regression model to assess the association between marital status and family income. As mentioned earlier, marital status is represented by a set of four dummy variables: married, widow, divsep (divorced or separated), and nevermarr (never married). We'll use married as the reference category. The resulting model has an R2 of 0.176, an adjusted R2 of 0.175, and an F-value of 134.93 (df = 3, 1895; p < .001). The regression coefficients or slopes support the hypothesis that married people report more family income than others. The average differences are three to four units with, it seems, never married people reporting the least family income.
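Since married is the reference category, the command omits that dummy variable:

    regress fincome widow divsep nevermarr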
Number of obs = 1899;  F(3, 1895) = 134.93;  Prob > F = 0.0000;
R-squared = 0.1760;  Adj R-squared = 0.1747;  Root MSE = 4.0013

     fincome |     Coef.     [95% Conf. Interval]
       widow |    -3.152     -4.111566   -2.192935
      divsep |    -3.176     -3.639032   -2.712103
   nevermarr |    -4.112     -4.554992   -3.668313
       _cons |    17.791      17.53626    18.04601
However, think some more about variables that are associated with marital status and income. We've seen, for instance, that education is strongly associated with income, but it also may be associated with marital status. Another variable mentioned earlier is age. Is age related to income? The answer is yes, since middle-aged people earn more on average than younger people. What about age's association with marital status? It makes sense to argue that there should be an association, especially when we have a never married category. Never married people tend to be younger, on average, than married people. Given these important issues, it is prudent to add education and age to the model. The regression table provided by Stata is shown below. The new model, which we may call the full model to distinguish it from the earlier restricted model, has an R2 of 0.271, an adjusted R2 of 0.269, and an F-value of 140.83 (df = 5, 1893; p < .001). Although we'll skip the computations, a nested F-test indicates that we've significantly improved the predictive power of the model by adding age and education (i.e., the R2 has increased by a statistically significant amount; a brief sketch of the test follows the regression table below). You should note that a nested F-test is not the same as subtracting the F-values from the two regression models. This latter procedure is not appropriate for comparing regression models. Although we may have suspected that the results for marital status change when adding age and education to the model, there is actually little shift in the size of the coefficients. For example, the never married coefficient changes from −4.112 to −3.922, a difference of less than five percent. Hence, we may conclude that the
differences in family income associated with marital status are not accounted for (nor confounded) by age or education.
      Source |       SS         df       MS            Number of obs =   1899
       Model |  9983.13481       5   1996.62696        F(  5,  1893) = 140.83
    Residual |  26838.2722    1893   14.1776399        Prob > F      = 0.0000
       Total |  36821.4071    1898   19.4001091        R-squared     = 0.2711
                                                       Adj R-squared = 0.2692
                                                       Root MSE      = 3.7653

     fincome |      Coef.        t    P>|t|     [95% Conf. Interval]
       widow |  -3.044257    -6.48    0.000      -3.96492   -2.123593
      divsep |  -3.129857   -14.06    0.000     -3.566297   -2.693417
   nevermarr |  -3.921608   -17.06    0.000       -4.3725   -3.470716
         age |   .0274864     3.54    0.000      .0122515    .0427214
     educate |   .4918961    15.12    0.000      .4280977    .5556944
       _cons |   9.794846    17.53    0.000      8.698974    10.89072
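Although the text skips the hand computations, the nested F-test itself is one line with the test postcommand, run after this full model:

    test age educate

Stata reports the F statistic with 2 and 1,893 degrees of freedom and its p-value, which is the formal test that adding age and education increased the R2 by a statistically significant amount.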
There is one other issue that we should consider before closing this chapter. As mentioned in Chapter 3, many researchers use standardized coefficients to compare the associations within a model. For instance, since education and age are measured in different units, trying to compare their associations with an outcome variable such as family income in a multiple linear regression model is difficult. By using the standardized coefficients, so the argument goes, education and age may be compared in standard deviation units. Although there is some merit to this approach, it falls apart when considering dummy variables. A one standard deviation shift in a dummy variable makes no sense (unless, I suppose, it equals exactly one, which doesn't happen with {0, 1} dummy variables). Dummy variables can shift only from zero to one or from one to zero. So interpreting the standardized coefficients associated with dummy variables is not appropriate. For example, the variable gender has a mean of 0.50 and a standard deviation of 0.50. In the regression model shown earlier, the standardized coefficient associated with gender is −0.148. So the mindless interpretation of this coefficient is "Controlling for the effects of the other variables in the model, a one standard deviation increase in gender is associated with a 0.148 standard deviation decrease in family income." A one standard deviation increase in gender shifts, for instance, from 0 to 0.50. This is a nonsensical value; there is no gender value of 0.50, nor
[Figure: scatterplot of y against x showing a clearly curved pattern of data points.]
A linear regression model is going to attempt, using the least squares formulas, to fit a straight line to this set of data. But it is clear that the
association between x and y is not linear; rather, it is curved or curvilinear. Estimating this association with a linear model is one example of specification error. We'll discuss curvilinear associations in much more detail in Chapter 10. For now, we'll learn about three common sources of specification error in multiple linear regression models: (1) including irrelevant variables; (2) leaving out important variables; and (3) misspecifying the causal ordering of variables within the model. The first problem is called overfitting; the second problem is called underfitting; and the third problem involves what is known as endogeneity, or simultaneous equations bias. All three present problems for linear regression models, but to different degrees. Before learning about these problems, though, it is important to discuss variable selection for a moment. Variable selection is a frequently discussed topic in regression modeling. How do we know which variables to put in a regression model? There are a large number of variable selection procedures available. We'll discuss some of these later in the chapter. At this point, it is important to realize that much of the work on variable selection by mathematical statisticians overemphasizes obtaining the best prediction equation at the expense of testing a reasonable explanatory model. Regression models are clearly useful as predictive tools. Most of us would prefer, for instance, that medical researchers use statistical tools to predict with a high degree of certainty that a drug will cure us (or won't injure us!). Perhaps we don't care much about explaining the underlying biological mechanisms that link the drug with the cure. But in the social and behavioral sciences explanations are the driving force at the core of research. (In fact, some physicists argue that explanation is also at the core of their scientific endeavors. See, e.g., David Deutsch (1997), The Fabric of Reality, New York: Penguin Books.) We wish to be able to explain why one variable is associated with another (e.g., why are neighborhood unemployment rates associated with crime rates?) rather than to say only whether one predicts another (where's the fun in that?). Therefore, when we select variables for a regression model,
we should use our explanatory framework (our theory or conceptual model), as well as common sense, to decide which variables to include in the regression equation. This presupposes that we have a solid understanding of the research literature, especially previous studies that have examined our outcome variable, and that we understand fully the variables in our data set, including how they are constructed, what they purport to measure, and so forth.
The standard error of a partial slope coefficient may be written as

SE(b_i) = sqrt[ Σ(y_i − ŷ_i)² / (n − k − 1) ]  /  sqrt[ Σ(x_i − x̄)² (1 − R_i²) ]
The quantity (1 − R_i²) is called the tolerance and is estimated from an auxiliary regression equation that is defined as

x_1i = a + b_2·x_2i + … + b_k·x_ki

Suppose that an extraneous variable (x4)
included in the model is strongly associated with an important variable (x2) but is not associated with the outcome variable (y). Then, all else being equal, the tolerance for x2 will be smaller (closer to zero) than if x4 were not in the model, and x2's standard error in the linear regression model will be relatively larger. The following figure represents this situation.
[Figure: two overlapping circles, one labeled x2 and one labeled x4, with the overlap representing the variation shared by the two explanatory variables.]
The tolerance in the standard error computation for x2 is represented by the overlap between the two explanatory variables. If the overlap is large, then the standard error can shift considerably, becoming large enough in some cases to affect the conclusions regarding the statistical significance of x2's coefficient. (Keep in mind that we want substantial variation (i.e., large circles) in the explanatory and outcome variables.) However, if there is not a statistical association between the explanatory variables (e.g., corr(x2, x4) = 0), then the standard error associated with the important variable (x2) is unaffected. In the figure, this would be represented by non-overlapping circles.
We saw an example of underfit in the last chapter, but let's explore another. The gss96.dta data set includes a variable labeled lifesatis. This is a measure of life satisfaction that is based on some questions that ask about satisfaction in one's marriage, at work, and in general. Higher values on this variable indicate a greater sense of life satisfaction. There are studies that suggest that life satisfaction is associated with education, occupational prestige, involvement with religious organizations, and several other variables. We'll examine some of these associations in a linear regression model. (Note: This example is slightly revised from the following source: William D. Berry and Stanley Feldman (1985), Multiple Regression in Practice, Newbury Park, CA: Sage.) First, though, let's look at a bivariate correlation matrix of these variables.1 The correlations suggest that all three of the proposed explanatory variables (occupational prestige, education, and religious service attendance) are positively associated with life satisfaction. No pair of variables has a noticeably larger correlation than any other when we consider life satisfaction. But notice also that there is a pretty substantial correlation between education and occupational
1 Values for life satisfaction are recorded for only 908 respondents in the gss96.dta data set. In order to analyze models using lifesatis, it may be a good idea to restrict the analytic sample to only those respondents who have nonmissing values on this variable. Stata's keep if and drop if commands allow you to filter out the observations that are missing on lifesatis (e.g., keep if lifesatis~=. or drop if lifesatis==. request that Stata use only those observations that are not missing, where the period is Stata's missing-value code). However, these commands will permanently delete the cases from your data set. In order to prevent this, since you may wish to use them later, use preserve and restore. It is also a simple matter to restrict regression models to only a particular subset of observations using the if qualifier. See the help menu for more information.
prestige (r = 0.553). This is not surprising: Highly educated people tend to be employed in positions that are considered more prestigious (e.g., judges, bank presidents, even college professors!); and occupations that require more formal education tend to be judged as more prestigious. Pay attention to education and occupational prestige as we consider a couple of multiple linear regression models. Unlike what we saw in the last chapter, we'll begin with the full model and work backwards.
             | lifesa~s  occprest   educate    attend
   lifesatis |   1.0000
             |
    occprest |   0.1518    1.0000
             |   0.0000
     educate |   0.1117    0.5530    1.0000
             |   0.0008    0.0000
      attend |   0.1435    0.0358    0.0834    1.0000
             |   0.0000    0.2814    0.0119
The first thing to notice in the following regression table is that, once we account for the association between occupational prestige and life satisfaction, the significant association between education and life satisfaction that appears in the correlation matrix disappears. The association between religious service attendance and life satisfaction remains, however. In this situation, we say that the association between education and life satisfaction is spurious; it is confounded by occupational prestige. One way to represent this is with the figure that appears below the table. It shows that although it seems there is an association between education and life satisfaction (identified by the broken arrow), occupational prestige is associated with both of the other variables in such a way as to account completely for their presumed association. Suppose, though, that for some reason (perhaps we failed to read the literature on life satisfaction carefully) we omit occupational prestige from consideration. Perhaps we think that highly prestigious occupations demand so much of people that they can't possibly report high satisfaction with their lives. Or we simply don't care much
about employment patterns because, we argue, education and religious practices are much more important. The second table provides the regression results from a model that excludes occupational prestige.
      Source |       SS         df       MS            Number of obs =    908
       Model |  11086.4119       3   3695.47065        F(  3,   904) =  13.42
    Residual |  248845.306     904   275.271356        Prob > F      = 0.0000
       Total |  259931.718     907   286.584033        R-squared     = 0.0427
                                                       Adj R-squared = 0.0395
                                                       Root MSE      = 16.591

   lifesatis |      Coef.        t    P>|t|     [95% Conf. Interval]
    occprest |   .1646559     3.37    0.001      .0688316    .2604801
     educate |   .1675345     0.70    0.484     -.3020458    .6371147
      attend |   .8921501     4.18    0.000      .4733146    1.310986
       _cons |   39.83922    13.72    0.000      34.14018    45.53827
[Figure: path diagram in which Occupational Prestige is associated with both Education and Life Satisfaction; the direct arrow from Education to Life Satisfaction is crossed out (X) to indicate that the apparent association is spurious.]

      Source |    df          MS             Number of obs = 908
       Model |     2    3977.91801
    Residual |   905    278.426389
       Total |   907    286.584033

   lifesatis |      t     P>|t|
     educate |   3.06     0.002
      attend |   4.12     0.000
       _cons |  14.21     0.000
In this model we find a highly statistically significant association between education and life satisfaction. However, we know that this
regression model is misspecified. We have a model that is underfit because it does not include a variable that is associated not only with the outcome variable, but also with education in such a way as to account for the presumed association between education and life satisfaction. Not only is the education slope inflated (from 0.168 to 0.613) when we omit occupational prestige from the model, but its standard error is also too small. Another model to consider is one that omits religious service attendance. We have seen that this variable is significantly associated with life satisfaction whether we include occupational prestige or not. In fact, its slope, standard error, t-value, and p-value are similar in both models. But does it affect the other variables? We see that the results aren't much different than when religious service attendance is in the model (see the table below). The adjusted R2 is reduced from 0.0395 to 0.022 (it was relatively small to begin with), but the slopes and standard errors associated with education and occupational prestige are not affected much. If you look back at the correlation matrix, you can see why: Religious service attendance has a relatively small correlation with the other two explanatory variables. Therefore, although this last model is also underfit, the consequences of specification error are less severe than when we leave out occupational prestige. One lesson to learn from this exercise is that specification error is a virtual certainty. After all, we cannot hope to include all variables that are associated with the outcome variable in the model. The goal is to strive for models with a low or manageable amount of specification error and hope that we don't reach the wrong conclusions about the associations that the models do represent.
[Table: linear regression of lifesatis on educate and occprest, omitting religious service attendance.]
The next question to ask is whether the full model, which includes education, religious service attendance, and occupational prestige, is overfit. After all, education is not significantly associated with life satisfaction once we consider occupational prestige. On the one hand, it may not hurt our conclusions to include it in the model. On the other, it may make the other standard errors larger than they should be. At this point the best way to check for overfit is to estimate a linear regression model that omits education and look over the slopes and standard errors. The results of such a model show that the slope for occupational prestige increases slightly when education is omitted. Nonetheless, unless we were extreme in our desire to figure out the precise association between occupational prestige and this measure of life satisfaction, it is a relative judgment call whether or not education needs to be in the model.
But think about a set of variables such as occupational prestige, race/ethnicity, and life satisfaction. Models are often proposed that define these three variables, or many other sets of variables, as producing one another in some way. For example, we could revise our earlier life satisfaction model to include race/ethnicity:
Life satisfaction_i = α + β1(occup. prestige_i) + β2(race/ethnicity_i) + ε_i
This model implies that occupational prestige and race/ethnicity independently combine to affect life satisfaction. But suppose that occupational prestige and race/ethnicity are not independent; rather, because of various historical and social factors that are associated with race/ethnicity in the United States and elsewhere, occupational prestige is, in part, a product of an individual's race/ethnicity. In this situation, we say that occupational prestige is endogenous in the system specified by the equation. The problem for the linear regression model is that the estimated slope for occupational prestige in the above equation is biased. As an exercise, estimate a model with occupational prestige as the outcome variable and the race/ethnicity dummy variables (AfricanAm, othrace) as explanatory variables; include education as a control variable (a sketch of the command appears just below). What do the results tell you about the possible endogeneity of occupational prestige? A second endogeneity issue involves whether we have specified the correct ordering of variables. Does the outcome variable truly depend on the explanatory variables? Or could one or more of the explanatory variables depend on the presumed outcome variable? For instance, suppose we wish to estimate a model with adolescent drug use as the outcome variable and friends' drug use as the explanatory variable. We may assume that one's friends influence one's behavior to a certain degree, thus leading to this model specification. But it may also be true that one's choice of friends depends on one's behavior. Hence, those who use drugs are more likely to choose friends who also use drugs. This issue implies that drug use and friends' drug use are involved in what's known as reciprocal causation.
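For that exercise, one possible command, using the gss96.dta variable names already introduced, is:

    regress occprest AfricanAm othrace educate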
We won't pursue the issue of endogeneity in any more detail at this point because it involves (1) a thorny theoretical issue and (2) a complex statistical issue. It is a theoretical issue because we need to think carefully about the models we wish to estimate and consider whether one or more of the explanatory variables is potentially endogenous. When we estimate models with variables such as education, income, and life satisfaction, or one's own behavior and one's friends' behavior, we should think about the ordering and intra-system relationships among the variables. Are the explanatory variables truly independent? Are they determined outside the model? Or could one or more depend on one or more of the other explanatory variables? Moreover, could variables outside the system affect the explanatory variables? The answer is almost always yes, so correct model specification and a lack of endogeneity require a properly specified model. Endogeneity is a complex statistical issue because it requires complicated systems of equations to specify correctly. We'll learn in Chapter 8 about two statistical models, instrumental variables with two-stage least squares and structural equation models, that are designed to address endogeneity in regression models. Stata provides several other approaches for addressing endogeneity.
on your key explanatory variables also. You may find some unexpected variables that are associated with the explanatory variables you plan to use in your regression model. Unfortunately, we cannot even hope to include all the variables that might be associated with the explanatory and outcome variables. So, as mentioned earlier, specification error is always lurking; we just hope to minimize it. There are, however, some tools that are handy for assessing whether specification error is affecting the results of the regression model. The first such tool may be understood by considering, as an example, the equation shown earlier in the chapter:
y_i = α + β1·x_1i + β2·x_2i + ε_i,  where ε_i = x_3i + x_4i + random error
When we initially saw this equation, we were interested in what happens to the regression model when x2 and x4 are correlated. Although a correlation might affect the conclusions, how might we test whether x4 has an influence if we do not measure it directly? Can we figure out a way to measure the error terms (residuals) so that we may at least indirectly examine the possible association between x2 and x4? As we learned in earlier chapters, we may use the predicted values from the model to compute the residuals:
residual_i (ε̂_i) = (y_i − ŷ_i)
In Chapter 3 we saw that Stata allows us to predict the residuals from a linear regression model by using the predict postcommand. (Using the rstandard or rstudent options, you may also calculate standardized residuals, which are the residuals transformed into z-scores, or studentized residuals, which are the residuals transformed into Student's t-scores.) We may then assess whether the explanatory variables included in our model are correlated with the residuals. Unfortunately, because of the way the OLS estimators are derived, there is rarely, if ever, a linear association between the x's and the residuals. Sometimes, though, we might find a nonlinear association. We'll learn more about examining potential nonlinear associations in Chapter 10.
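A minimal sketch of this residual check for the life satisfaction model; the name uhat for the residual variable is ours:

    regress lifesatis occprest educate attend
    predict uhat, residuals
    pwcorr uhat occprest educate attend

As the text notes, the linear correlations between the residuals and the included x's will be essentially zero by construction; any curvature would have to be spotted graphically (see Chapter 10).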
Stata has several regression postcommands that may be used to test for specification errors. The most straightforward, and probably the most useful, test is Stata's linktest command. This command is based on the notion that if a regression is properly specified, we should not be able to find any additional explanatory variables that are significant except by chance. The test creates two new variables, the variable of prediction, _hat, and the variable of squared prediction, _hatsq. The model is then re-estimated using these two variables as predictors. The first, _hat, should be significant since it is the predicted values. The second, _hatsq, should not be significant, because if our model is specified correctly the squared predictions should have no explanatory power. So we should look at the p-value for _hatsq. Here is what linktest provides after our fully specified life satisfaction regression model with occprest, educate, and attend as explanatory variables.
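The command sequence is simply the regression followed by the postcommand:

    regress lifesatis occprest educate attend
    linktest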
   lifesatis |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        _hat |   1.263113    4.00953     0.32   0.753    -6.605945    9.132171
      _hatsq |   -.002464   .0375193    -0.07   0.948    -.0760989     .071171
       _cons |  -6.993579     106.82    -0.07   0.948    -216.6374    202.6502
The results suggest that there is no omitted variable or other specification problem, since the _hatsq coefficient has a p-value of 0.948. An alternative, but similar, approach is the RESET (regression specification error test), which is implemented in Stata using the ovtest postregression command.
ovtest

Ramsey RESET test using powers of the fitted values of lifesatis
       Ho: model has no omitted variables
                F(3, 901) =     0.43
                Prob > F  =   0.7293
This test has as its null hypothesis that there is no specification error. It is not rejected in this situation, thus reaffirming the finding that
there is no specification error in this model. A third type of test that we won't discuss here, but is available in Stata, is called a Hausman test. It is more generally applicable than the other two tests. Unfortunately, these tests do not tell us whether the misspecification bias is due to omitted variables or to the wrong functional form for one or more of the variables in the model. To test for problems with functional forms we may be better off examining partial residual plots and the distribution of the residuals (see Chapter 10). Theory remains the most important diagnostic tool for misspecification bias, though. Testing for underfit or overfit is much easier when you have access to all (or most) of the important explanatory variables in the data set. We know that a regression model is underfit if we add a variable that is significantly associated with the outcome variable (a t-test or nested F-test may be used to confirm this; see Chapter 5). Moreover, some researchers recommend using the adjusted R2 to compare models. This should be done only when one of the models is nested within the other. If we add, say, a set of explanatory variables and find that the adjusted R2 increases, then we may conclude that the nested model is underfit. If, on the other hand, we take away a set of variables (perhaps because their associated p-values are large) and find that the adjusted R2 does not decrease, then we may conclude that the model is overfit. However, it is not a good idea to rely only on the adjusted R2. It is better to combine an evaluation of this statistic with a nested F-test and a careful review of the slopes and standard errors of the remaining explanatory variables before concluding that a particular regression model is overfit or underfit. But we must also beware of overdoing it, of testing so many models by adding and subtracting various sets of explanatory variables that we cannot decide on the appropriate model.
Unfortunately, many, if not most, statisticians' guides to variable selection are so automated as to violate a point made earlier in this chapter: Let your theory or conceptual model guide the selection of variables. Nonetheless, these more automated procedures are discussed so often that they're worth a brief review. The first type we'll mention is known as forward selection. In this approach, explanatory variables are added to the model based on their correlations with the unexplained component of the outcome variable. Yet this relies on biased estimates at each point in the selection process, so it should be avoided. A type of forward selection that you might encounter is stepwise regression, wherein partial F-statistics are computed as variables are removed and put back in the model. For various reasons that we won't go into, it should also be avoided. The second type is known as backward selection. In this approach, the analyst begins with all the explanatory variables in a model and then selectively removes those with slopes that are not statistically significant. This selection procedure is reasonable if you wish to have the best predictive model, but it is overly automated and takes the important conceptual work out of the hands of the researcher. Nonetheless, we'll consider some steps that help make backward selection somewhat better than more arbitrary or biased approaches. There are several steps to selecting variables if we wish to rely on backward selection (these are described in Kleinbaum et al., op. cit., Chapter 16). First, as mentioned earlier, estimate a regression model with all the explanatory variables. (Kleinbaum et al. suggest that we also consider higher-order terms (e.g., squared or cubed terms) and interactions in this model; we'll save a discussion of these for Chapter 10.) Second, choose a fit measure to compare models. Third, use backward selection to remove variables from the model, either one at a time or in chunks. Fourth, assess the reliability of the preferred model with a split-sample procedure. This last step is often reasonable if we want to verify the reliability of a regression model regardless of how we decided which variables to include. If the
software allows it (and Stata does using the sample command), select a random subsample of observations from the data set (normally 50%) at the variable selection stage and then, after choosing the best model, estimate it using the remaining subsample of observations. If the models are the same (or quite similar), then we may have confidence that we've estimated the best fitting model. These steps are fairly simple using statistical software. But we should say a few more words about the second step: choosing a fit measure to compare models. We've seen a couple of these measures already: the adjusted R2 and nested F-tests. If we compare the full model to its nested models using these approaches, we can guard against including extraneous variables in the model. Moreover, some statisticians recommend comparing the Mean Squared Error (MSE) or its square root (SE) (see Chapter 4):
MSE = SSE / (n − k − 1)

√MSE = SE
Looking for the smallest MSE or SE from a series of nested models is a reasonable step since, assuming we wish to make good predictions and find strong statistical associations, we'd like to have the least amount of variation among the residuals. There is another common measure that is related to these other measures: Mallows' Cp. An advantage of Mallows' Cp is that it is minimized when the model has only statistically significant variables. Mallows' Cp is computed as
Cp = SSE(p) / MSE(k) − [n − 2(p + 1)],  where p = number of predictors in the restricted model
The SSE(p) is from the restricted model and the MSE(k) is from the full model. Using this measure does not require comparing the model with all the explanatory variables to some model nested in it; it may be applied to any two models that use subsets of the explanatory variables, as long as one is nested within the other. Nonetheless, Mallows' Cp is used
most often to compare the complete or full model with various models nested within it. Here is an example of three models that use three of these model selection fit statistics: the adjusted R2, SE, and Mallows' Cp. The models use the usdata.dta data set and are designed to predict violent crimes per 100,000 (violrate). Suppose, based on previous research and our own brilliant deductions, that we think the following explanatory variables are useful predictors of violent crimes across states in the U.S.: the unemployment rate, population density, migrations per 100,000, per capita income in $1,000s, and the gross state product in $100,000s. But someone convinces us that we should use a backward selection procedure to test several combinations of these variables. The table below shows what we find after estimating several linear regression models. It indicates that Model 2 provides the best fit: it has the largest adjusted R2, the smallest SE, and the smallest Mallows' Cp. However, we did not test all the possible models, which would be a formidable task even with only five explanatory variables. (Try to figure out how many models we could fit.) Stata offers a procedure that is similar to the one we just used. If we estimate the model with all five explanatory variables using the stepwise command, we get a best-fitting model:
stepwise, pr(0.2): regress violrate unemprat density mig_rate gsprod perinc
The pr subcommand requests Stata's backward selection procedure. The 0.2 tells Stata to retain a variable only if its coefficient's p-value is smaller than 0.2 (variables with p-values of 0.2 or larger are removed). We may also use the pe subcommand to request forward selection (an example appears below). See Stata's help menu (help stepwise) for additional options. Using backward selection with stepwise returns a model with all the explanatory variables included except for per capita income. This model is identical to Model 2 in the table, which supports the conclusion that Model 2 is the best fitting model. For some reason that you may wish to explore, the stepwise selection procedure in
Stata omits population density from its preferred model. Yet, as mentioned earlier, most statisticians strongly advise against using stepwise selection.
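For comparison, the forward selection variant mentioned above is requested with the pe subcommand rather than pr; a minimal sketch of the syntax, using the same 0.2 threshold, is

stepwise, pe(0.2): regress violrate unemprat density mig_rate gsprod perinc

where the value passed to pe() is the significance level a variable's coefficient must meet to be added to the model.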
Outcome variable: Violent crimes per 100,000

Variables                                 Adjusted R2       SE      Mallows' Cp
Model 1: Unemployment rate                   0.357        215.86       6.00
         Population density
         Migrations per 100,000
         Per capita income
         Gross state product
Model 2: Unemployment rate                   0.371        213.59       4.06
         Population density
         Migrations per 100,000
         Gross state product
Model 3: Unemployment rate                   0.311        223.40       6.34
         Gross state product
The lesson from using these automated procedures is that some of them, such as backward selection, are reasonable tools if your goal is to come up with a model that includes the set of explanatory variables that offer the most predictive power. But these procedures cannot substitute for a good theoretical or conceptual model that not only predicts the outcome variable, but, more importantly, explains why it is associated with the explanatory variables. Moreover, they may provide misleading results because they are designed to fit the sample data, thus overstating how precise the results appear to be when we wish to make inferences to the population from which the sample was drawn. (For more on this point, see John Fox (1997), Applied Regression Analysis, Linear Models, and Related Methods, Thousand Oaks, CA: Sage, Chapter 13.)
questions. This would likely deter most people from participating in surveys, however. The literature on measurement in the social and behavioral sciences is huge, and we cannot even begin to cover all the important topics. Our concern is with a narrow statistical issue, but one of utmost importance, that involves measurement. Known in general terms as measurement error or errors in variables, we are interested in learning about what happens to the regression model when the instruments used to measure the variables are not accurate. (For a much broader overview, see Paul Biemer et al. (Eds.) (2004), Measurement Errors in Surveys, New York: Wiley.) There are various sources of measurement error, including the fact that we simply don't have accurate measuring instruments (such as scales or rulers) (known as method error) or that people sometimes misunderstand the questions that researchers ask or don't answer them accurately for some reason (lack of interest or attention, exhaustion from answering lots of questions; this is known as trait error). Sometimes, people don't wish to tell researchers the truth or provide information about personal topics. One of the most personal topics, it seems, is personal income. If you ask questions about a person's income, be prepared for many refusals or downright deceptions. Another source of measurement error is recording or coding errors, such as when a data entry person forgets to hit the decimal key so an income of $1,900.32 per month becomes an income of $190,032 per month. Normally, this type of measurement error is easy to detect during the data screening phase. If we make it a point to always check the distributions of all the variables in our analysis using means, medians, standard deviations, ranges, box-and-whisker plots, and the other exploratory techniques, and are familiar with the instruments used to measure the variables, we will usually catch coding errors. (Don't forget to always check the missing data codes in your data set! See Appendix A.) But suppose that we have carefully screened our data for coding errors and fixed all the apparent problems. We may still suspect that
there is measurement error in the variables for all the other reasons we've discussed so far. Is there anything we can do about it? We'll discuss two situations that lead to distinct solutions: (a) the outcome variable, y, is measured with error; (b) one or more of the explanatory variables, x, is measured with error. Before discussing these two situations, let's revisit some of the assumptions (see Chapter 3) of linear regression analysis that relate to measurement error. First, there is an assumption that is almost never satisfied in the social sciences: that the x variables are fixed (the technically proficient say nonstochastic). This means that researchers have control over the x variables. They may then try out different values of the explanatory variable (such as by applying different amounts of fertilizer and water to plants and examining their growth) and determine with quite a bit of accuracy the association between x and y. Second, we assume that the x and y variables are measured without error. We'll learn in this chapter what to do when this assumption is violated. Third, x is uncorrelated with the error term. We saw one type of violation of this assumption in Chapter 7 when addressing specification error. The first assumption is not normally met in the social sciences. There is certainly experimental research, especially in psychology, but also in sociology and economics, that manipulates explanatory variables, but most social science research relies on random samples to consider a variety of values of explanatory variables. There are important ethical limits faced by social and behavioral researchers. For example, we should not, and perhaps cannot, manipulate an adolescent's friends in studies of peers and delinquency. Another limitation involves the fact that there are many interesting variables that cannot be manipulated. For instance, we cannot control or manipulate a person's race or gender. Although deliberate manipulation of the x's is preferred, statistical control is often the best substitute. Since the third assumption was the main focus of Chapter 7, much of our discussion in this chapter centers on the second
assumption. We'll learn that measurement error can cause several problems for linear regression models. As with specification error, you should realize that we always have some measurement error, either because our concepts are not well defined or for some systematic reason (e.g., our instruments are not accurate enough). It is thus important that we seek to reduce it to a manageable level.
Using these terms (with ui representing the error in measuring y), what does the linear regression model look like? Here's one way to represent it:
yi = α + β1x1i + β2x2i + εi
yi + ui = α + β1x1i + β2x2i + εi
Now, let's use basic algebra to subtract ui from both sides of the equation. Doing so, we obtain:
yi = α + β1x1i + β2x2i + εi + (−ui)
Therefore, the error in measuring y can be considered just another component of the error term, or simply another source of error in our model. As long as this source of error is not associated systematically with either of the explanatory variables (i.e., corr(x1i, ui) = 0; corr(x2i, ui) = 0), we have not violated a key assumption of the model and our slope estimates tend to be unbiased. But if the x's are correlated with the measurement error in y, then we have the specification error
problems described in Chapter 7. Unfortunately, there is rarely any way to include this error explicitly in the regression model to solve the specification problem. However, even when there is no correlation between the x's and the error in y, the R2 tends to be lower than if measurement error is not present. In the presence of error in y, the SSE is larger because there is more variation in the residuals that is not accounted for by the model. Moreover, the standard errors of the slope estimates are inflated, thus making it harder to find statistically significant slope coefficients. Recall that the standard error formula is

se(β̂i) = √[ Σ(yi − ŷi)² / { Σ(xi − x̄)² (1 − Ri²)(n − k − 1) } ]

Since error in y inflates the sum of squared residuals in the numerator, all other things being equal (or ceteris paribus, to use an urbane Latin term favored by some social scientists), the standard error is also larger. All is not lost, though. If you obtain statistically significant slope coefficients, you then have a measure of confidence that you would find significant results even if there were no measurement error in y.
Now consider the second situation: an explanatory variable, x1, is measured with error, which we label v1. Folding this error into the model gives

yi = α + β1x1i + (β1v1i + εi)
So the error term now has two components (β1v1i and εi), with x correlated with at least one of them. In this situation, the estimated slope is biased, usually toward zero. The degree of bias is typically unknown, though; it depends on how much error there is in the measurement of x. The standard error of the slope is also biased. Usually the t-value is smaller (because of a larger standard error) than it would be if there were no measurement error in x. Combining these two problems (smaller slopes and larger standard errors), you should immediately see that it is difficult to find statistically significant slopes when x is measured with error. This does not mean, however, that there is no true statistically significant association between x and y; only that you may not be able to detect it if there is sufficient error in x to bias the slope and standard error. But let's suppose that we have a multiple linear regression model and several of the explanatory variables are measured with error. It should be apparent that this compounds the problem, sometimes tremendously. Let's make it even worse. What happens when both the outcome variable and several of the explanatory variables are measured with error? Now we have a truly bad situation: the R2 is biased, the slopes are biased, and the standard errors are biased. We can't trust much about such a model. Yet it is not uncommon in the social and behavioral sciences to have measurement error in all of the variables. A major problem is that we often don't know whether, and how much, measurement error we have. Fortunately, there are several techniques designed to address this problem.
of psychology and cognitive studies, but also draws from a variety of fields such as sociology, marketing, political science, and opinion research (see, e.g., Jon A. Krosnick and Leandre R. Fabrigar (2005), Questionnaire Design for Attitude Measurement in Social and Psychological Research, New York: Oxford University Press). We have neither the space nor the ability in this brief chapter to go over the many aspects of good measurement, including various topics such as the validity and reliability of instruments. One piece of advice, though: It is better to use a well-established instrument to measure a phenomenon than to try to make one up for your study. For instance, if we wish to measure symptoms of depression, an instrument such as the Hamilton Depression Inventory is a better choice than trying to develop a brand new scale for a study. The next two approaches are not normally very useful, but you may run into situations where knowing about them comes in handy. The first is known as inverse least squares (ILS) regression. Its logic is pretty straightforward. Suppose you wish to estimate a regression model with only one explanatory variable, but you know that this variable is measured with error. Fortunately, the outcome variable is not measured with error (an admittedly rare situation). In this situation, we may switch the explanatory and outcome variables and estimate a regression model:
xi = α + β1yi
But is this the slope you wish to use? Probably not. Rather, recalling elementary algebra, we may rearrange the equation (after estimating the slope) so that x appears on the right-hand side and y appears on the left-hand side: y = −α/β1 + (1/β1)x. In this manipulation, the slope for x becomes 1/β1, and it is not biased (although the standard error is not accurate). Note also that the intercept is not correct in this type of least squares regression. The second method assumes that we have some external knowledge that is not normally available. First, though, we need to learn about a term in measurement theory known as reliability. In
general usage, reliability refers to the reproducibility of a measuring instrument. The most common type is test-retest reliability, which is a quantitative approach to determining whether an instrument measures the same thing at time one that it measures at time two. Assuming a continuous variable, researchers usually use the Pearson correlation coefficient between the two measurement points to determine the reliability. A common view of reliability is that it measures the ratio of the true score variance to the variance of the measured score. Using the notation introduced earlier, this is written as (the subscripts are omitted):
rxx = s²x(true score) / [ s²x(true score) + s²v(error) ]
In this equation rxx represents the reliability, not the Pearson correlation coefficient (although there are statistical similarities). The s² terms in the numerator and denominator are variance measures. It should be clear, and this agrees with what we have been claiming throughout this chapter, that reliability increases as the error in measurement decreases. Although rare, we may occasionally have information on the reliability from other studies. Suppose, for example, that we are conducting a study that uses Mudd's Scale of Joy to measure happiness among teenage boys and girls (this is not an actual scale). Several very sophisticated studies have found that when Mudd's scale is used among this population, it yields a reliability coefficient (rxx) of 0.8. (We won't bother trying to figure out how these other researchers came up with this reliability coefficient.) Since errors in measuring the x variable tend to bias the slope downward, our observed slope will be too small if x is contaminated by measurement error. But we now know the degree of bias because of the reliability coefficient. Therefore, we may multiply the observed slope by the inverse of the reliability coefficient to obtain the unbiased slope. Assume our model suggests that high levels of joy reported by adolescents predict high GPA. We estimate a linear regression model and obtain an unstandardized slope of 1.45. Multiplying this slope by
1/0.8 = 1.25 yields the unbiased slope: 1.45 × 1.25 = 1.81. Unfortunately, this approach can become quite difficult to use when there is more than one explanatory variable.
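Stated in general terms, this correction (often called disattenuation) simply divides the observed slope by the reliability:

corrected slope = observed slope / rxx

With rxx = 0.8, this scales the observed slope up by a factor of 1/0.8 = 1.25, which is exactly the adjustment used in the example above.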
We'll discuss very briefly a frequently used technique known as factor analysis. Factor analysis is designed to take observed or manifest variables and reduce them to a set of latent variables that underlie the observed variables. There are many books that focus specifically on factor analysis (see, e.g., Paul Kline (1999), An Easy Guide to Factor Analysis, London: Routledge; David J. Bartholomew (1987), Latent Variable Models and Factor Analysis, New York: Oxford University Press; and Bruce Thompson (2004), Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications, Washington, DC: American Psychological Association). Perhaps the simplest way of understanding this statistical technique is to consider that variables that are similar share variance. If there is sufficient shared variance among a set of observed variables, we claim that this shared variance represents the latent variable. Conceptually speaking, we assume that the latent variable predicts the set of observed variables. The diagrams below represent two simplified models of a latent variable. The first shows the overlap of three observed variables; the shaded area of overlap represents the latent variable, or the area of shared variability. The second diagram shows how latent variables are often represented in books and articles. Notice the direction of the arrows. They imply that the latent variable (recall it is not directly observed) predicts the observed variables. Stata provides several factor analysis techniques (type help factor). However, when used to measure latent variables, many researchers rely on specialized software such as AMOS, LISREL, COSAN, EQS, Latent Gold, or Mplus to conduct the analysis. An advantage of these programs is that they will simultaneously estimate factor analyses to estimate the latent variables and regression models to test the associations among a set of latent variables. For example, if we used 10 observed variables to measure aspects of anxiety, assuming that a latent variable or the true measure of anxiety explains their shared variance, and 12 variables to measure self-esteem, then a latent variable analysis program allows us to test the measurement properties of these two latent variables and their
statistical association using a regression model. One of the advantages of this approach, which is often termed structural equation modeling (SEM), is that it may be used to estimate more than one regression equation at a time, or what are called simultaneous equations. This provides one solution to the endogeneity problem discussed in Chapter 7. But, you may be asking, what happened to the measurement error in the latent variable? The idea is that the shared variance represents the true score of a latent variable, whereas the remaining unshared variability represents the error. In the diagram below, the shaded area where all three variables overlap represents the true score for the latent variable and the non-shaded areas represent measurement error. Of course, we must make the assumption that the shared variance truly represents something that is significant and that we have correctly labeled it. We also assume that the measurement errors are independent, although this assumption may be relaxed in SEM.
[Figure: Venn diagram of three overlapping observed variables (Observed Variables 1, 2, and 3); the shaded area where all three overlap represents the latent variable.]

[Figure: Path diagram in which a latent variable points, via arrows, to Observed Variables 1, 2, and 3.]
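Before turning to the second approach, here is a minimal sketch of how the first (factor analytic) approach might look in Stata. The observed items anx1, anx2, and anx3 are hypothetical variables assumed to tap a single anxiety construct, and the saved factor score is then used, for illustration, in a regression with the life satisfaction variable from this chapter:

factor anx1 anx2 anx3
predict anxiety
regress lifesatis anxiety

The predict command after factor saves a factor score, which stands in for the latent variable in the regression.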
The second latent variable approach has been popular in econometrics. (It is still used, but not as often as in the past.) In this approach, the latent variables are based on instrumental variables. An example is the easiest way to describe instrumental variables. Suppose we have an explanatory variable that is measured with error. We know that if we place it in a regression model, the slopes and standard errors will be biased. However, we think we have found another set of variables that have the following properties: (a) they are highly correlated with the error-contaminated explanatory variable; and (b) they are uncorrelated with the error term (which, you'll recall, represents all the other influences on y that are not in the model). If we can find such a set of variables, which are known as instrumental variables, we may use them in an OLS regression model to predict the explanatory variable, use the predicted values to estimate a latent variable as a substitute for the observed explanatory variable, and use this latent variable in an OLS regression model to predict y. One common estimation technique that uses instrumental variables is known as two-stage least squares because it utilizes the two OLS regression stages just described. We assume that the latent variable is a linear combination of the instruments, so we can estimate the association with a linear model.
Here are the steps in a more systematic fashion. We'll use the letter z to represent the instrumental variables and x1 to represent the error-contaminated explanatory variable. The steps in two-stage least squares are as follows (the i subscripts are omitted):

(1) Estimate x1 = p + p1z1 + p2z2 + ... + pqzq.
(2) Save the predicted values (x̂1) from this OLS regression model.
(3) Using OLS, estimate y = r + r1x̂1 + r2x2 + ... + rsxs, where x̂1 is the predicted value saved in step (2).
instrumental variables, education (educate) and personal income (pincome), are measured without error; and (d) education and personal income are strong predictors of occupational prestige but are uncorrelated with the error in predicting life satisfaction. So, in this regression model, we hypothesize that higher occupational prestige is associated with higher life satisfaction, but that we should use the instrumental variables in a two-stage least squares analysis to reduce the effects of measurement error. The first step is to estimate the correlations among the explanatory variable and the instrumental variables. Stata's pwcorr command is used to obtain these correlations and their p-values. The table indicates that the three variables are moderately correlated, with the largest correlation between occupational prestige and education. In Chapter 7 we explored this association from a different angle. Nonetheless, at this point education and income appear to be reasonable instruments for occupational prestige. Next, we'll see the results of the OLS regression model with life satisfaction as the outcome variable and occupational prestige as the explanatory variable.
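The command that produces the correlation table below is presumably along the lines of

pwcorr occprest pincome educate, sig

where the sig option asks Stata to print the p-value beneath each correlation.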
             occprest   pincome   educate
occprest       1.0000
pincome        0.2842    1.0000
               0.0000
educate        0.5530    0.2269    1.0000
               0.0000    0.0000
[Stata output for the OLS regression of lifesatis on occprest (n = 908): F(1, 906) = 21.36, p < .001; Root MSE = 16.742; the coefficient on occprest is about 0.190.]
The model returns a standard error estimate (SE or Root MSE) of 16.742. The explained variance is low, but statistically significant (F = 21.36, p < .001). The coefficient for occupational prestige is also statistically significant, with, as expected, higher occupational prestige associated with higher life satisfaction. However, we assumed earlier that occupational prestige is measured with error. So let's see what a two-stage least squares regression analysis provides. This model in Stata is part of the instrumental variables command, ivregress. This command may also be used to estimate other types of errors-in-measurement models. To request a two-stage least squares model for life satisfaction and occupational prestige, include the following on Stata's command line:
ivregress 2sls lifesatis (occprest = pincome educate)
Note that Stata tells us that occupational prestige is instrumented, with personal income and education as the instruments. Comparing these results to the previous model, note that the coefficient shifts from about 0.190 in the first model to 0.266 in the two-stage least squares model. Thus, the slope increases by about 40% after adjusting the explanatory variable for measurement error. The t-value, however, is somewhat smaller. How do we know whether this is the proper approach? Although we must rely to a large extent on theory (e.g., are the instruments well suited to the model?), there is also a statistical test, known as the Durbin-Wu-Hausman test, that allows us to determine whether the
instrumental variables approach is preferred. To implement this test, we first estimate a linear regression model with the instruments; in the above model this is
Occupational Prestigei = α + β1(Educationi) + β2(Personal Incomei) + εi
Second, save the unstandardized residuals from this model (predict res_occ, residual); in this example, we label them res_occ. Third, estimate a linear regression model with the original outcome variable, the original explanatory variables, and the unstandardized residuals. In our example, this model is
Life Satisfactioni = α + β1(Occupational Prestigei) + β2(res_occi) + εi
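In Stata, the whole procedure amounts to something like the following (the first two commands mirror the steps just described; res_occ is the residual variable named in the text):

regress occprest educate pincome
predict res_occ, residual
regress lifesatis occprest res_occ

The t-statistic on res_occ in the final regression is the quantity examined below.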
If the coefficient involving the residuals is statistically significant, then there is evidence that the instrumental variables/two-stage least squares model is preferred to the single-stage linear regression model. In the example of the model designed to predict life satisfaction, the t-value of res_occ is 1.290, which has a p-value of 0.197. Since it is not statistically significant, there is no evidence that the two-stage least squares model offers any benefit over the linear regression model. Stata automates this for us using one of its estat postcommands. After estimating the two-stage least squares model, type estat endogenous. Stata returns the following:
Tests of endogeneity
Ho: variables are exogenous

Durbin (score) chi2(1)  =  1.70784   (p = 0.1913)
Wu-Hausman F(1,905)     =  1.7054    (p = 0.1919)
Note that the p-values for these tests are similar to the p-value for res_occ. Thus, we reach the same conclusion about the appropriateness of the two-stage least squares model. In general, then, there is little evidence that occupational prestige is endogenous in this model. (For more information about these tests, see Russell Davidson and James
G. MacKinnon (1993), Estimation and Inference in Econometrics, New York: Oxford University Press.) Yet we should regularly ask: are the assumptions we made before estimating the model reasonable? Could there be other potential sources of error in either of the instruments? Suppose we are told that occupations considered prestigious are those that require more education or that pay, on average, higher salaries. Would this change our judgment of education and personal income as instrumental variables? We might decide that these are not good instruments for this model. But finding instrumental variables is typically problematic, and we will always be forced to justify our choices based on reasoning that is divorced from concrete statistical evidence. If we determine that we are not content with the instrumental variables approach, there are other errors-in-variables models that researchers use, but they are beyond the scope of this presentation (type help instrumental in Stata to see a few of these). It should be obvious by now that measurement error is an almost universal problem in social science statistics. It may be thought of as one type of specification error in a linear regression model since it involves extra information (errors in measuring) that we hope to exclude from the model. Whether there will ever be tools to minimize it sufficiently so that we may convince all or most people that our models are valid is, at this time, an unanswerable question. Nonetheless, measurement techniques continue to improve as researchers study the properties of measuring tools and the way that people answer survey questions or report information in general. (Excellent guides to constructing survey questions are Norman Bradburn et al. (2004), Asking Questions: The Definitive Guide to Questionnaire Design, San Francisco: Jossey-Bass; and Krosnick and Fabrigar, op. cit.) At this point, we should simply do our best to measure social and behavioral phenomena accurately and reliably. Moreover, some useful tools for reducing the damaging effects of measurement error exist and are readily available to even the novice
researcher. A review of the literature on factor analysis, multivariate analysis, and structural equation modeling will provide a guide for using these tools.
se(β̂i) = √[ Σ(yi − ŷi)² / { Σ(xi − x̄)² (1 − Ri²)(n − k − 1) } ]
What will the tolerance (1 − Ri²) be if there is such a large overlap between x1 and x2? We cannot be certain from a diagram alone, but rest assured that it will be small. What happens to the standard error of the slope as the tolerance shrinks? It becomes larger and larger. Suppose, for example, that the correlation between x1 and x2 is somewhere on the order of 0.95. Then, assuming a linear regression model with only two explanatory variables, the tolerance is (1 − 0.95²) = 0.0975. Next, take the correlation between another sample's x1 and x2 that is much lower, say 0.25. The tolerance in this situation is (1 − 0.25²) = 0.9375. Try plugging fixed values into the standard error equation for the other quantities with these two tolerances and you'll see the effect on the standard errors.
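For instance, with purely hypothetical values Σ(yi − ŷi)² = 1,000, Σ(xi − x̄)² = 50, and n − k − 1 = 47, the formula above gives

se ≈ √[1,000 / (50 × 0.0975 × 47)] ≈ 2.09 when the tolerance is 0.0975
se ≈ √[1,000 / (50 × 0.9375 × 47)] ≈ 0.67 when the tolerance is 0.9375

so the highly collinear case produces a standard error roughly three times as large.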
[Figure: Venn diagram of two heavily overlapping circles, x1 and x2, representing high collinearity between the two explanatory variables.]
The assumption of OLS regression, though, states that there is no perfect collinearity. Perfect collinearity is represented by completely overlapping circles, or a correlation of 1.0. (It could also be −1.0 and have the same consequences.) What happens to the standard error when there is perfect collinearity? The tolerance in this situation is (1 − 1) = 0, which means that the denominator in the standard error equation is zero. Hence, we cannot compute the standard error. It is
rare to find a perfectly collinear association, unless you create such a variable yourself or accidentally rename a variable and throw it in a regression model with its original twin sibling. Fortunately, statistical software such as Stata recognizes when two variables are perfectly correlated and throws one out of the model. What is not uncommon, though, is to find high collinearity, such as the figure represents. The statistical theory literature emphasizes that the main problem incurred in the presence of high or perfect collinearity is biased or unstable standard errors. Slopes are considered unbiased, at least when the model is estimable. Yet practice strongly indicates that slopes can be highly unstable when two (or more) variables are highly collinear. Let's see an example of this problem. The usdata.dta data set has three variables that we suspect are associated with violent crimes per 100,000: gross state product in $100,000s (gsprod), the number of suicides per 100,000 (suicrate), and the age-adjusted number of suicides per 100,000 (asuicrat). Yet there is something peculiar about these variables: two of them address the states' number of suicides. Demographers and epidemiologists are well aware that several health outcomes, such as disease rates or the prevalence of some health problems, are influenced by the age distribution. Some states, for instance, may have a higher prevalence of particular types of cancer because they have, on average, older people. Since older people are more likely to suffer from certain types of cancer, states with older populations will have more cases of these cancers. Therefore, demographers often compute the age-adjusted prevalence or rate. This, in essence, controls for age before regression models are estimated. If age has a large association with the outcome, then the original frequency of the outcome and the age-adjusted frequency might differ dramatically. A correlation matrix will provide useful evidence to judge the association between the age structure of states and the number of suicides. The correlation between the number of suicides and the age-adjusted number of suicides is 0.990 (p < .001), which is close to the maximum value of one. Hence, we have two variables with much of
their variability overlapping. Moreover, it appears that the age structure is only modestly associated with suicide. But suppose we wish to enter these two variables in a linear regression model designed to predict violent crimes per 100,000. This is clearly an artificial example, since the two suicide variables are measuring virtually the same thing. But let's use it to explore the effects of collinearity in a linear regression model.
             suicrate  asuicrat    gsprod
suicrate       1.0000
asuicrat       0.9900    1.0000
               0.0000
gsprod        -0.3693   -0.3564    1.0000
               0.0083    0.0111
We'll estimate a linear regression model using these three variables as predictors, with violent crimes per 100,000 as the outcome variable.
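The command is presumably just the standard regress call with the three predictors:

regress violrate suicrate asuicrat gsprod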
      Source |       SS        df       MS               Number of obs =     50
-------------+-------------------------------            F(  3,    46) =   5.13
       Model |  890469.575      3  296823.192            Prob > F      = 0.0038
    Residual |  2661419.88     46  57856.9539            R-squared     = 0.2507
-------------+-------------------------------            Adj R-squared = 0.2018
       Total |  3551889.46     49  72487.5399            Root MSE      = 240.53

    violrate |      Coef.        t    P>|t|     [95% Conf. Interval]
-------------+-------------------------------------------------------
    suicrate |  -33.14892     -0.45   0.653     -180.4624    114.1646
    asuicrat |   42.71824      0.61   0.545     -98.16917    183.6057
      gsprod |   84.06014      3.80   0.000      39.49069    128.6296
       _cons |   269.9085      1.72   0.092     -45.70741    585.5244
Look at the unstandardized coefficients for the two suicide variables: One is positive and one is negative. Although statistical theory suggests that these slopes are unbiased, you can see they are untenable. Both variables are measuring the same phenomenon, yet they indicate that as one increases violent crimes tend to decrease, whereas as the other increases violent crimes tend to increase. Clearly, both interpretations of the association between suicides and violent crimes cannot be correct. Here is another linear regression model that omits one state: Utah
    violrate |   [95% Conf. Interval]
-------------+--------------------------
    suicrate |   -181.1484    145.2802
    asuicrat |   -126.9773    183.9886
      gsprod |    38.89058    128.8752
       _cons |   -48.55759    588.5673
The coefficients and standard errors for the two suicide variables have shifted quite a bit from one model to the next. Although this may be because the state of Utah has unusual values on one or more of the variables in the model (see Chapter 12), it might also be the result of collinearity between the two suicide variables. It is not uncommon for collinearity to create highly unstable coefficients that shift substantially with minor changes in the model. Of course, it is unlikely that we would have estimated a model with two variables that are so similar. Even though we used these models to illustrate a point, it is important to always consider whether two (or, as we shall see, more) explanatory variables are highly correlated before estimating the model. Now, let's see what happens when we include only one suicide variable in the OLS regression model that predicts violent crimes. The results of this model (see the table below) show that there is a positive association between the number of suicides (age-adjusted) per 100,000 and the number of violent crimes per 100,000 after adjusting for the association with the gross state product. Although it is not statistically significant, notice how different the coefficient is from those in the earlier models. More importantly, notice the standard error in this model: it is much smaller than the standard errors from the earlier models. This demonstrates how collinearity inflates standard errors substantially.
[Stata output for the OLS regression of violrate on asuicrat and gsprod]
Multicollinearity
Recall that the tolerance from the auxiliary regression equation is based on the proportion of variance in one explanatory variable that is explained by the other explanatory variables. In other words, this proportion is the R2 from a linear regression model. I hope it is clear by this point that we should consider overlapping variability not only between two variables, but among all the explanatory variables. Suppose, for example, that we have four explanatory variables, x1, x2, x3, and x4. The largest bivariate correlation between any two variables is 0.4. If we were to use only the two variables with the largest correlation in a linear regression model, the tolerance in the standard error formula would not be any smaller than (1 − 0.4²) = 0.84. It is unlikely that a tolerance of this magnitude would bias the standard errors to an extreme degree or make the slopes unstable. However, it is possible for one of these explanatory variables to be a linear combination of the others. If, for instance, x4 is predicted perfectly or nearly perfectly by x1, x2, and x3, then the tolerance will approach zero and the same problems result: biased standard errors and unstable slopes. In terms of our overlapping circles (see the figure below), multicollinearity appears as two or more variables encompassing another explanatory variable. It should be clear, though, that looking at bivariate correlations will not always be sufficient for determining whether multicollinearity is present. Any of the bivariate correlations may be below thresholds that warn of collinearity, but they won't show whether some combination of explanatory variables predicts another explanatory variable. We must therefore look beyond the bivariate correlations.
[Figure: Venn diagram in which x1, x2, and x3 jointly overlap most of x4, leaving only a small unique portion of x4.]
Can we use this sort of auxiliary regression information to assess the possibility of multicollinearity? The easy answer is "Of course!" We may simply regress each explanatory
variable on all the others and compute an R2 for each model. We need a rule-of-thumb, though: how high must an R2 be before we suspect multicollinearity? Perhaps we should use the same rule-of-thumb as mentioned above for correlations: 0.7. However, since the R2 is a squared quantity, perhaps the threshold should be 0.49. Although there are rules-of-thumb that specify thresholds for the auxiliary R2 and the tolerance, a more frequently used statistic is known as the Variance Inflation Factor (VIF). For various reasons we don't have time to discuss, the VIF has become the test du jour for assessing multicollinearity. It is defined as the inverse of the tolerance:
VIF = 1 / (1 − Ri²)
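To make this concrete with numbers from the suicide example above: the bivariate correlation between suicrate and asuicrat is 0.99, so the auxiliary R2 for either variable is at least 0.99² ≈ 0.98, which gives a VIF of at least roughly 1/(1 − 0.98) = 50.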
The rule-of-thumb is that a VIF greater than or equal to 10 (some say nine) is indicative of multicollinearity. Others argue that if the square root of the VIF is greater than or equal to three, we should suspect multicollinearity (see John Fox (1991), Regression Diagnostics, Newbury Park, CA: Sage Publications). A relevant issue that is not often discussed, though, is sample size: linear regression models are affected less by collinearity or multicollinearity as the sample size increases. Hence, all else being equal, a model that uses a sample size of 10,000 is less sensitive to multicollinearity (at least using conventional rules-of-thumb) than a model that uses a sample size of 100. Unfortunately, the decision rules that have emerged for these tests do not normally consider the sample size. Another standard test for multicollinearity involves condition indices and variance proportions. Before demonstrating how to use this test, it is useful to take a quick journey into the intersection of linear algebra and principal components analysis for some terminology. An eigenvalue is the variance of a principal component from a correlation (or covariance) matrix. The principal components of a set of variables are a reduced set of components that are linear combinations of the variables. They are one type of latent variable (see Chapter 8), yet the principal components from a set of variables are uncorrelated with
one another. Hence, if we have a set of, say, 10 explanatory variables, we may reduce them to a set of, perhaps, two or three principal components that are uncorrelated with one another. The eigenvalues of these principal components measure their variances. Briefly, then, eigenvalues are a measure of joint variability among a set of explanatory variables. As the eigenvalue gets smaller, there is more shared variability among the explanatory variables, with a value of zero indicating perfect collinearity. An interesting property of the eigenvalues in the following test is that they sum to the number of terms in the regression equation (the number of explanatory variables plus the intercept, or k + 1). But how is this useful for judging multicollinearity? We may use eigenvalues to compute condition indices (CIs; not to be confused with confidence intervals), which are another measure of joint variability among explanatory variables. CIs are used in conjunction with variance proportions (Stata calls these variance-decomposition proportions) to assess multicollinearity in linear regression models (they may also be used in other regression models). Variance proportions assess the amount of the variability in each explanatory variable (as well as the intercept) that is accounted for by each dimension (or principal component). Fortunately, Stata will compute these values for us. Let's see an example of VIFs, eigenvalues, CIs, and variance proportions. Estimate the linear regression model we used earlier that included both suicide variables, then use the vif command to calculate the VIFs. Next, use the coldiag2 command and the collin command followed by the explanatory variables (collin suicrate asuicrat gsprod) to see additional collinearity diagnostics. To use the coldiag2 and collin commands, we may need to download files from a Stata server. Assuming we are connected to the internet, we may type findit coldiag2 and findit collin to locate the files. Entering vif, coldiag2, and collin on the command line after regress returns its output provides the information shown below. The vif command reports the VIFs and the tolerances (1/VIF). You can see that two of
these greatly exceed the rule-of-thumb that any VIF of 10 or more indicates collinearity or multicollinearity problems. But we may also use the CIs and variance proportions, shown in the tables below, to assess collinearity. Here's how it's done. First, look for any CI that is greater than or equal to 30. Then, look across the row and find any variance proportions that exceed 0.50. A CI that is greater than or equal to 30 coupled with a variance proportion that is greater than or equal to 0.50 indicates a collinearity or multicollinearity problem that should be checked further. An advantage of CIs and variance proportions over VIFs is that they identify which variables are highly collinear.
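For reference, the command sequence that generates the output below is roughly:

regress violrate suicrate asuicrat gsprod
vif
coldiag2
collin suicrate asuicrat gsprod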
      Source |       SS        df       MS               Number of obs =     50
-------------+-------------------------------            F(  3,    46) =   5.13
       Model |  890469.575      3  296823.192            Prob > F      = 0.0038
    Residual |  2661419.88     46  57856.9539            R-squared     = 0.2507
-------------+-------------------------------            Adj R-squared = 0.2018
       Total |  3551889.46     49  72487.5399            Root MSE      = 240.53

    violrate |      Coef.        t    P>|t|     [95% Conf. Interval]
-------------+-------------------------------------------------------
    suicrate |  -33.14892     -0.45   0.653     -180.4624    114.1646
    asuicrat |   42.71824      0.61   0.545     -98.16917    183.6057
      gsprod |   84.06014      3.80   0.000      39.49069    128.6296
       _cons |   269.9085      1.72   0.092     -45.70741    585.5244
Condition Indexes and Variance-Decomposition Proportions

       condition
         index      _cons   suicrate  asuicrat   gsprod
  1       1.00       0.00     0.00      0.00      0.02
  2       2.45       0.00     0.00      0.00      0.72
  3      10.00       0.99     0.00      0.00      0.25
  4      78.34       0.00     1.00      1.00      0.01

Condition Number    78.34
               Cond
     Eigenval  Index
---------------------------------
 1     3.3965   1.0000
 2     0.5696   2.4420
 3     0.0333  10.1047
 4     0.0006  72.3735
---------------------------------
Condition Number      72.3735
Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
Det(correlation matrix)    0.0172
In the collinearity diagnostics shown above there is one CI, associated with line 4, that exceeds 30; it is 72.37 or 78.34 (there is slightly different scaling for coldiag2 and collin). Looking across that row in the variance-decomposition table, we see that there are two variance proportions that exceed 0.50; they are associated with the suicide rate and the age-adjusted suicide rate. Thus, we have strong evidence that these two variables are highly collinear. The largest CI is computed by taking the largest eigenvalue (line 1's, which is 3.3965), dividing it by line 4's eigenvalue (it is rounded to 0.0006, but is actually 0.00064845), and taking the square root of the quotient:
CI = √( Eigenvalue of Dimension 1 / Eigenvalue of Dimension 4 ) = √( 3.3965 / 0.00064845 ) ≈ 72.37
The equation shows that this is how CIs are generally computed, with the largest eigenvalue divided by one of the others to yield the squared value of the CI. Of course, this guarantees that the CI for Dimension 1 is always 1.00.
The first solution was mentioned earlier: collinearity and multicollinearity have less influence on linear regression models as the sample size gets larger. Therefore, if a model is plagued by collinearity problems, then, if possible, one should add more observations. Unfortunately, this is rarely feasible outside of a laboratory. Survey data are typically of a fixed sample size because of cost constraints, so adding more observations is not practical. Another solution was discussed in Chapter 8. Suppose that some subset of explanatory variables is involved in a multicollinearity problem. Look over these variables: are they measuring a similar phenomenon? Would it be practical to combine them in some way? Oftentimes, highly collinear variables are indirectly measuring some latent construct, so using them to create a latent variable is a common solution to multicollinearity. Rather than including the collinear variables in the regression model, we may place the latent variable in the model. Then we may assess its association with the outcome variable. The third solution is to use a technique known as ridge regression. Although it is rarely recommended as a general solution, ridge regression may offer some help when nothing else seems to work or in particular situations that we won't go into here. The general idea underlying ridge regression is to create (artificially) more variability among the explanatory variables. This is usually done by considering, rather than the raw data, the variance-covariance matrix of the variables. The diagonals of this matrix are the variances, which are too small in the case of collinearity or multicollinearity to yield unbiased estimates of standard errors. Hence, in ridge regression the analyst chooses a constant to add to the variances, thus increasing variability and decreasing the effects of multicollinearity. The choice of this constant may be arbitrary or based on a specific formula. This may sound very promising, but you should realize that ridge regression results in biased slopes. The bias, however, may be a smaller problem than OLS estimates made unstable by multicollinearity. A key problem with ridge regression, though, is that it seems to change slopes that
149
are not significant in an OLS model more than slopes that are significant. Therefore, the analyst must have solid conceptual hypotheses to determine which slopes are important and which are not. (See Norman R. Draper and Harry Smith (1998), Applied Regression Analysis, Third Edition, New York: Wiley, Chapter 17, for a review of ridge regression.) The fourth solution, which is not recommended very often, is to use a variable selection procedure, such as stepwise or backward elimination, to choose the best set of predictive variables (see Chapter 7). These methods will typically drop one or more of the variables involved in the collinearity or multicollinearity problem. However, they also take the important conceptual work out of the hands of the analyst and present other problems (see Fox (1997), op. cit., Chapter 13). The last solution is recommended the least often, but is probably employed most often by analysts who use regression techniques. This is to simply omit one or more of the variables involved in the collinearity or multicollinearity problem. This is recommended only as a last resort or if one of the collinear variables measures virtually the same phenomenon as other collinear variables (such as was the case with the two suicide variables we saw earlier in the chapter). It is rare, however, for two variables to measure the same thing. It is much more common to find that collinearity is caused by recoding problems in the data cleaning stage. This may occur when one variable is created from another. It should be clear to you that collinearity and multicollinearity present some potentially serious problems for the linear regression model. Biased standard errors are a nuisance, but, when coupled with unstable slope coefficients, the results of the model are not trustworthy. What should you do about this problem? It is best to consider carefully the source of collinearity or multicollinearity and then use the best solution available. Unfortunately, there are no easy answers.
Curvilinear and other associations among variables that are not represented well by a straight line are known generally as nonlinear associations. Although making comparisons across the social and behavioral sciences would be very difficult, it is likely that nonlinear associations are the norm, with linear associations the exception. Yet linear associations are much simpler to conceptualize and to model using statistical procedures such as correlations and OLS regression analysis. Although there is a growing literature on nonlinear
regression techniques, and modern-day statistical software provides powerful tools for estimating nonlinear associations, the way we think about associations among variables still tends to be linear. This may have something to do with general limitations of the human mind, but it is likely that we simply need more experience with nonlinear thinking and modeling. There are actually two approaches to nonlinear analysis that are used in regression modeling. We'll begin with the more complex approach, although we'll end up learning about the simpler approach. The first type of nonlinear regression analysis proposes that the estimated coefficients are nonlinear; in other words, they do not imply a straight-line pattern. An example of a model that is nonlinear in the estimated parameters is
yi = α + β1x1i + β1²x2i + loge(β2x3i) + β3x4i + εi
In this equation, the betas are not simple linear terms; rather, they include higher-order terms, with nonlinearities such as quadratics and logarithms (loge implies the natural or Napierian logarithm). There are nonlinear regression routines that are designed specifically for situations in which the estimated parameters are hypothesized to be nonlinear. However, such hypotheses can be difficult to formulate, so we will not cover this type of nonlinear regression approach in this presentation. Another promising avenue in regression analysis, but one that we also will not discuss, is known as nonparametric regression. One example is the generalized additive model (GAM), which allows the analyst to fit various types of associations, linear and nonlinear, among the outcome and explanatory variables. See John Fox (2000), Multiple and Generalized Nonparametric Regression, Thousand Oaks, CA: Sage, for a good overview of these and other nonparametric models. In this chapter, we'll address a simpler approach that adapts the linear regression model so that it may incorporate nonlinear associations among the explanatory and outcome variables. To begin to understand how we may do this, consider that when we estimate a
linear regression model and interpret the slopes, we assume that the magnitude of change in y that is associated with each unit change in x is constant. In other words, it is the same regardless of where in the y distribution the change is presumed to take place. For example, suppose we find, as in Chapter 2, that the slope of per capita income on the number of robberies per 100,000 is 14.46. By now, we should be able to interpret this coefficient effortlessly: Each $1,000 increase in per capita income is associated with a 14.46 increase in the number of robberies per 100,000. Keeping this in mind, consider the following diagram:
[Diagram: for every $1,000 difference in per capita income, the expected difference in robberies per 100,000 is 14.46, no matter where along the line the comparison is made.]
It should be clear that the difference in the number of robberies associated with the difference in per capita income is expected to be the same regardless of where along a hypothetical continuum of robberies per 100,000 people we find ourselves. Of course, as mentioned in Chapter 2, we should only interpret these associations within the bounds of our data. Suppose, though, that when states have low per capita incomes, say, less than $20,000, each unit increase in per capita income is associated with an increase of 20 robberies per 100,000; whereas in higher income states (e.g., per capita income > $30,000) each unit increase in per capita income is associated with an
increase of only 10 robberies per 100,000. Perhaps higher income states have more resources to hire police or provide job opportunities for youths at risk of becoming delinquent or growing into criminals. Whatever the hypothesized mechanism, these sorts of possibilities are legion and recommend against simple hypotheses that suggest only linear associations. Let's look at an example of a nonlinear association. In the gss96.dta data set there are variables that measure personal income (pincome) and the respondent's age in years (age). Before analyzing these data, think for a moment about people's incomes in the United States. How should the association between age and income behave? (To simplify our thinking, let's deal with adults only.) First, it seems clear that younger adults who are still in school or just entering the workforce have less income than middle-aged or older adults who have been in the workforce longer (we'll also ignore for now the influences of education, unemployment, job training, and other important variables). But what happens as people get older and reach their sixties? Many begin to retire and their personal income tends to decrease. Thus, we have already departed from hypothesizing a linear association between age and personal income. Rather, there is likely a curvilinear association, with income rising with age until it reaches a maximum point, after which it begins to decrease as people retire. But there are also important demographic issues to consider when dealing with age. One such issue involves death. Wealthier people, on average, tend to live longer than poorer people partly because they have access to better health care. As we begin to look at the age-income association at older ages, say 75 or 80, we may begin to see the effects of certain people dying earlier than others. If, on average, those with less income tend to die earlier, then income should appear higher among older adults. There are many complex demographic phenomena at play here that we don't have time to address, but we should consider that the age-income association is not linear and may have more than one bend.
The following three graphs represent the three associations between age and income that we've discussed thus far. The first graph shows a linear association, with income increasing at a steady pace with increasing age. The second graph shows what is known as a quadratic association, with one bend in the association. The third graph shows a cubic association, with two bends and three distinct pieces to represent the last part of our discussion about age and income. Once we begin dealing with nonlinear associations, they can quickly blossom into some very complex shapes with numerous bends and twists. Therefore, to manage the modeling it is important to always consider conceptually the associations in your set of variables. As we saw in earlier chapters, there are routinized ways to estimate regression models (e.g., stepwise selection, reliance on nested F-tests), but these should be avoided if they are not guided by theoretical considerations.
[Three graphs of income (y-axis) by age (x-axis, ages 20 to 80): a linear association, a quadratic association with one bend, and a cubic association with two bends.]
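To get a feel for these shapes before touching any data, Stata's twoway function plots can draw them directly. The following sketch uses made-up coefficients (none of them come from actual output) simply to trace a linear, a quadratic, and a cubic curve over the relevant age range:

twoway function y = 10 + 0.5*x, range(20 80) ytitle("Income") xtitle("Age")                       // linear
twoway function y = -20 + 2*x - 0.02*x^2, range(20 80) ytitle("Income") xtitle("Age")             // quadratic: one bend, peak at x = 50
twoway function y = 0.001*x^3 - 0.1875*x^2 + 10.8*x, range(20 90) ytitle("Income") xtitle("Age")  // cubic: bends near x = 45 and x = 80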
When thinking about associations among any set of variables, it is wise to consider whether there are nonlinear associations. You may be surprised at how often they occur. For instance, the association between age and various outcomes is often curvilinear. The impact of explanatory variables such as education on outcomes often has what are known as floor or ceiling effects: There is a negative or positive association up to a certain point; thereafter there is a flat association.
This graph shows an example of a hypothetical ceiling effect concerning the association between peer delinquency and one's own delinquency. As the number of friends involved in delinquency increases, one's own delinquency also tends to increase up to a certain point. High frequency delinquents, though, are often psychologically different from others (e.g., more impulsive, less empathetic, more callous), so having more delinquent friends may not add much to their own delinquency. If you suspect that there are nonlinear associations among your variables, there are some simple tools for checking them. Perhaps the most useful is the ordinary scatterplot. Using Stata's graphing options, it is easy to construct a scatterplot between two variables. For instance, we'll use the gss96.dta data set and the following command twoway scatter pincome age, jitter(5) to construct a two-variable scatterplot with personal income on the y-axis and age on the x-axis. The jitter option randomly jitters the data points on the graph so that any that may overlap will be easier to see. The result is very busy since there are so many data points. It is impossible to visualize whether a straight line or a curved line fits this association best. Fortunately, Stata provides additional options to fit a variety of lines to its plots: the mean of y, a linear fit line, a quadratic fit line, and a lowess fit line. The mean of y fits a horizontal line (twoway scatter pincome age, jitter(5) yline(10.35)). A linear line places the regression slope in the plot (twoway lfit pincome age),
a quadratic line fits a line with up to one bend (twoway qfit pincome age), and a lowess line fits locally weighted regression lines that allow a series of lines to fit the data points (twoway lowess pincome age). The latter option parcels the data into smaller (local) sections and fits regression lines to each section. It is useful when we're not sure what type of nonlinearity might occur. Try alternating between linear, quadratic, and lowess to determine the estimated association between age and personal income. It's interesting to look at the types of associations that emerge. In particular, although it is difficult to see in any of the plots, there is a quadratic association up until about age 80, and then the fit line flattens out. Hence, there is evidence that the association between age and personal income follows a cubic pattern. If it followed a linear pattern, the linear, quadratic, and lowess fits would yield the same straight line. Of course, we may not always know if the model includes these types of associations. Therefore, let's now see a way to test for nonlinearities in a linear regression model. Then we'll learn what to do when faced with a nonlinear association. We'll stick with the same simple example, but specify personal income as the outcome variable and age as the explanatory variable. Estimate this model, then use the following postregression command to predict the studentized residuals: predict rstudent, rstudent. The regression results indicate that there is a positive association between age and personal income. But this does not examine the nonlinear association between these two variables, which we now know exists. By constructing a scatterplot with the studentized residuals (which scale the residuals using a t-distribution) on the y-axis and age on the x-axis, however, we can learn something (twoway scatter rstudent age). This type of scatterplot is known as a partial residual plot. Many analysts recommend that we view a partial residual plot for each explanatory variable used in a regression model. Keep in mind that, in our example, the residuals gauge all the influences on personal income except for its linear association with age. Based on an assumption of
the regression model, there should be no association between age and these residuals. But is there? Plot a fit line in the partial residual plot you've just created. Try out whichever type of fit line you'd like. But be advised that many experts recommend using the lowess line because it will show many possible associations. If you choose any of the nonlinear fit lines, you'll notice the curvilinear association between age and the residuals. This is indicative of a nonlinear association that we haven't considered in our regression model, but, as we suspect, it is the nonlinear aspects of age and personal income that are showing up. Now that we know about the association between age and personal income, what should we do about it? One particularly helpful approach involves a linear regression model with higher-order terms for age. This means that we take age and square it and then take age and cube it (age × age, or age^2 {age raised to the power of 2}; age × age × age, or age^3 {age raised to the power of 3}). In Stata, we may create these variables using the generate command (generate age2 = age*age and generate age3 = age^3). The names for these new variables are arbitrary, but indicate what they represent. Next, estimate a regression model that includes pincome as the outcome variable and age, age2, and age3 as the explanatory variables. The regression output shows three explanatory variables, the latter two corresponding to the quadratic and cubic versions of age. All three slopes are significant. Moreover, the signs of the slopes tell us something important (although we already knew this) about the nonlinear association between age and personal income. The signs are positive, negative, and positive; corresponding to an initial positive association, subsequent negative association, and finally a slight positive association. This is not the same pattern that we saw in the scatterplots because they were limited mainly to a quadratic association. It is also quite simple to determine at which ages the association turns negative and then positive. We'll see one way to do this later.
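Pulling together the commands just described, a minimal sketch of the graphing and residual checks might look like the following (it assumes the gss96.dta data set is in memory; rstudent is simply the variable name used above):

twoway scatter pincome age, jitter(5)
twoway (scatter pincome age, jitter(5)) (lfit pincome age) (qfit pincome age) (lowess pincome age)   // overlay the fit lines
quietly regress pincome age
predict rstudent, rstudent                               // studentized residuals
twoway (scatter rstudent age) (lowess rstudent age)      // residual plot by age; look for curvature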
[Stata output for the regression of pincome on age, age2, and age3 appears here; 95% confidence intervals: age (.8147754, 1.233586), age2 (−.0225754, −.0131633), age3 (.0000598, .0001266), constant (−10.40823, −4.565825).]
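A compact sketch of the higher-order model just summarized (the names age2 and age3 follow the text; the beta and VIF requests reproduce the diagnostics discussed next):

generate age2 = age^2
generate age3 = age^3
regress pincome age age2 age3           // cubic specification; the signs run +, -, +
regress pincome age age2 age3, beta     // standardized coefficients
estat vif                               // variance inflation factors are enormous here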
A question to ask is whether this model is an improvement over models with just age or age and age2 as explanatory variables. One simple way to determine this is to compare the adjusted R2 values from the three models and combine them with nested F-tests to determine statistical significance (see Chapter 5), but we first need to learn a bit more about how to estimate the model. It is always a good idea when using explanatory variables with higher-order terms (e.g., age and age2) to check for multicollinearity among the variables. Let's do this and see what the results show. The VIFs for all three explanatory variables are very high: 454, 1773, and 477! There is (and this should not surprise you) substantial multicollinearity in this model. Another symptom of multicollinearity is the large standardized regression coefficients; it is rare to find one larger than 1.0, yet, using the beta subcommand, we may see that all three are much larger than one (in absolute value). Recall, though, that multicollinearity inflates standard errors, making it harder to find significant results. Yet all three age variables have significant slopes, thus supporting the hypothesis of a cubic association between age and personal income. Nonetheless, most of us don't care for such extreme multicollinearity in our linear regression models. But is there anything we can do about it? There is a simple statistical trick that allows us to diminish the multicollinearity problem. It is based on the following statistical phenomenon:
\mathrm{Cov}(x_i - \bar{x}, \bar{x}) = 0
You may substitute correlation for covariance and the results are the same. The proof of this statement is provided in many books on probability or statistical theory (e.g., Sheldon Ross (1994), A First Course in Probability, New York: Macmillan). What it means for us is that by centering the explanatory variable before computing the higher-order terms, and then using these new terms in the linear regression model, multicollinearity is reduced substantially. You may recall that perhaps the most common form of centering a variable is to compute its z-scores. In Stata the simplest way to compute z-scores is with the egen command (egen zage = std(age)). Using this command, we find that the variable zage (though we could call it anything we'd like) has been added to the data set: It consists of the z-scores for age. Next, let's use zage to create new quadratic and cubic age variables. In Stata, the following lines produce these variables:
generate zagesq = zage^2
generate zagecube = zage^3
After creating these variables, estimate a linear regression model with personal income as the outcome variable and zage, zagesq, and zagecube as the explanatory variables. Make sure you ask for collinearity diagnostics after the model. The results shown on the following page should be listed in Stata's output window. Keeping in mind that the age variables are now measured in different units, the results are quite similar to the earlier results: A positive association that turns negative and then positive again. Notice also that the slopes are highly significant (look at the relatively large t-values). More importantly, though, the VIFs are well below the thresholds indicative of multicollinearity. It is a good idea to remember this statistical trick for dealing with collinearity and multicollinearity problems; it will come in handy later in the chapter and, perhaps, throughout your career in regression modeling.
[Stata output for the regression of pincome on zage, zagesq, and zagecube appears here; 95% confidence intervals: zage (.3264303, .7029413), zagesq (−1.172915, −.8941394), zagecube (.1154604, .2443516), constant (10.68794, 11.02726).]
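For completeness, here is a sketch of the centered version in one place, combining the egen and generate statements shown above with the regression and collinearity check:

egen zage = std(age)
generate zagesq = zage^2
generate zagecube = zage^3
regress pincome zage zagesq zagecube
estat vif                               // the VIFs should now fall well below conventional thresholds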
A problem with using these centered explanatory variables emerges if we wish to determine at what point in the age or personal income distribution the slope turns negative or positive. One valuable approach uses, once again, a scatterplot. This scatterplot employs the unstandardized predicted values (these are predicted by using the predict command with the xb option: predict pred, xb) on the y-axis and age on the x-axis. The plot shows that the initial downturn in the association occurs close to age 45 (using Stata's graph editor, change the scale markers of the x-axis from 20 to 5 to help you visualize this), with the final upturn around age 80. But consider the following question: Why is the downturn at such a relatively young age? Shouldn't it occur much closer to the typical retirement age? (Hint: It might have something to do with the nature of cross-sectional data and birth cohorts.)
[Scatterplot of the linear predictions of personal income (y-axis, roughly 6 to 11) against age in years (x-axis, 20 to 100).]
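A brief sketch of the commands that produce a plot like the one above (pred is an arbitrary variable name):

predict pred, xb                        // unstandardized predicted values from the model just estimated
twoway scatter pred age                 // look for the downturn near age 45 and the upturn near age 80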
Nonlinear associations abound in the social and behavioral sciences (as well as in other scientific disciplines, such as chemistry, biology, and physics). We've only seen a couple of simple examples, but when we consider variables such as age, education, socioeconomic status, and many other interesting variables, we often find nonlinear shifts in their associations with numerous outcome variables. It is best to first think carefully about potential nonlinear associations among your variables (does the literature suggest any?) and then use visual techniques, such as scatterplots, to examine them. Then, once you have a sense of the likely shape of the associations, consider ways to test for them in a linear regression model. Squaring or cubing explanatory (or, as we shall see, outcome) variables scarcely uncovers the myriad possibilities.
Because linear regression models are not designed for non-continuous outcome variables (but see Chapter 13), we'll set the issue of non-continuous outcome variables aside. (For information about regression models for non-continuous outcome variables, see John P. Hoffmann (2004), Generalized Linear Models: An Applied Approach, Boston: Pearson.) However, because the assumption of a normally distributed error term is so important, yet so rarely satisfied, we need to address it directly. A normal q-q plot provides a straightforward way to assess whether a variable is normally distributed. It also offers an indirect and preliminary way to assess whether the residuals are normally distributed. Here's how it works. First, the statistical program checks the mean and standard deviation of the variable. Second, it generates a simulated variable that follows a normal distribution but has the same mean and standard deviation as the observed variable. Third, it orders the simulated variable and the observed variable from their lowest to highest values. Fourth, it produces a scatterplot with the simulated variable on the y-axis and the observed variable on the x-axis. If the observed variable is normally distributed, the data points line up along a diagonal line drawn by the program. If it is not, the points deviate from a diagonal line. The direction of deviation provides some guidance as to how the observed variable is distributed. Let's first look at some q-q plots and then we'll learn how to use one in Stata with an actual variable.
[Normal q-q plot with the expected normal value on the y-axis and the observed value on the x-axis.]
This plot shows that the observed variable (it is on the x-axis) is skewed slightly and, if we were to investigate further, we'd find it has
a longer right-tail than a normal distribution. One way to normalize such a distribution (this is simply short-hand for saying we are trying to transform it into a normally distributed variable) is to take its square root.
[A second normal q-q plot: expected normal value (y-axis) against observed value (x-axis), showing a variable with a much longer right tail.]
Here we have a variable that has an even longer right-tail than the previous variable. If we observed a histogram of this variable, we'd find a few observations at an extreme upper end of the distribution. This type of variable needs a greater transformation effort to pull in the extreme values than is available with the square root function. In this situation, most analysts take the natural logarithm (log_e) of the variable. However, if you do this make sure you recode any zero values to a positive number (perhaps by adding a one to the variable) since the natural logarithm of zero is undefined. A regression model with a logged outcome variable is so popular in applied statistics that it is referred to as a log-linear model. The distribution of the variable at the top of the next page is virtually the mirror image of the first q-q plot we saw. Its right-tail is shorter than a normal distribution; it has fewer observations in its upper tail than does a normal distribution. In order to normalize it, we might square its values to stretch them out a bit. Finally, let's see an even more extreme example of this short-tailed distribution (see the second figure on the next page).
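In Stata, each of the transformations mentioned so far is a single generate statement. A sketch, with x standing in for whatever variable needs normalizing (the variable and new names are placeholders):

generate sqrt_x = sqrt(x)      // pulls in a modestly long right tail
generate ln_x   = ln(x + 1)    // pulls in a very long right tail; the + 1 guards against ln(0)
generate sq_x   = x^2          // stretches out a short right tail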
[Two normal q-q plots of short-tailed variables: expected normal value (y-axis) against observed value (x-axis).]
In this situation, one might try exponentiating the variable (using the exponential function and raising e to the power of the observed value, e^{x_i}) to stretch out its distribution. Of course, this does not exhaust the ways we might transform variables so they are normally distributed. Moreover, it is important to realize that these transformed variables may be placed in a linear regression model just like any other outcome variable, but the interpretations of the coefficients must change to account for the transformation. For example, if we estimate a model with the square root of income as the outcome variable and age as the explanatory variable, the slope indicates differences in square root units of income that are associated with each one year difference in age. Thus far, we've discussed what we should do to test whether the outcome variable is normally distributed before estimating a linear
regression model. But we should also examine the residuals after estimating the model to make sure they follow a normal distribution. After all, this is the key assumption that needs to be examined. We still use the normal q-q plot. It is constructed using the qnorm command. To see one of these in action, let's estimate a model with the usdata.dta data set. We'll once again use robberies per 100,000 people as the outcome variable. First, create a normal q-q plot using the variable robbrate (qnorm robbrate). The resulting graph is shown below.
[Normal q-q plot of robberies per 100,000 people; both axes run from roughly −100 to 400.]
Given this, what type of transformation should be used to normalize the robberies variable? (Note what each axis represents.) Before deciding, let's run a linear regression model with robberies per 100,000 as the outcome variable and the unemployment rate and per capita income in $1,000s as explanatory variables. Next, predict the standardized values (z-scores) of the residuals (predict rstand, rstandard). Now, let's look at a q-q plot of these standardized residuals (qnorm rstand). Notice that the residuals of robbrate are not too far off the diagonal line. In fact, there is, if anything, a slight trend toward a compressed distribution. Nonetheless, experience suggests that these
residuals are in pretty good shape. But can we make them even closer to a normal distribution? Let's compute the natural logarithm of robberies per 100,000 and place it in the linear regression model (gen ln_robb = ln(robbrate)). Note that the term ln_robb is an arbitrary, yet properly descriptive, name. Once we've created this transformed variable, substitute it for the original outcome variable and re-estimate the model. What does the normal probability plot show? It appears we've done more harm than good. There are now a lot of residuals in the middle of the distribution that are well above the diagonal line. Go back and see if the square root of robberies per 100,000 does a better job. What do you find? It looks better than the logarithmic version, although there are still some odd-looking residuals near the middle of the distribution. Is the normal probability plot from the original model better looking? It appears to be. Therefore, even though the outcome variable is slightly skewed, it is suitable for a linear regression model with these two explanatory variables. Let's now look at a variable that is highly skewed: Gross state product in $100,000s (gsprod). A q-q plot shows the degree of non-normality. Estimate a simple linear regression model with gross state product as the outcome variable and the population density (density) as the explanatory variable. What does the normal probability plot show? It looks rather odd, with some residuals well below and some well above the diagonal. Try out a few transformations of gross state product to see if you can normalize the residuals. Then compare the regression model before transforming the variable to one after transforming the variable. Do you see any differences that would make you prefer one model over the other? The natural logarithm does a pretty good job of normalizing the residuals, but can you do better? A key question is whether your interpretation of the association between the gross state product and population density changes after transforming the variable. To put another spin on it: Check the distribution of population density. Is it normally
distributed? Can you find a transformation that normalizes it? Once you find this transformation, try re-running the linear regression model with normalized versions of both variables. What is your interpretation of the association between gross state product and population density now? It is important to keep in mind that you will rarely find a variable in the social and behavioral sciences that has observations (or residuals, for that matter) that fall directly on the diagonal line of a normal q-q plot. The goal is to find a transformation, assuming one is needed, which allows the observations or residuals to come close to the diagonal line. Scanning the functions in Stata's help menu will give you a good idea about the many transformations that are available. For advice on how to use these various transformations and numerous other ways to think visually about the distribution of variables, an excellent resource is William S. Cleveland (1993), Visualizing Data, Summit, NJ: Hobart Press. To summarize what we've learned so far: It is important to check the distributional properties of the variables in the model. Although we've emphasized the distribution of the outcome variable, it is also important to examine the distributions of the explanatory variables. In fact, some statisticians recommend that analysts should first check to make sure the explanatory variables are normally distributed (with the exception of dummy variables; see Chapter 6). Moreover, if the association between an explanatory and outcome variable is nonlinear, it is better in many situations to transform the explanatory variable in some way so that the association is linear or so that it can accommodate a nonlinear association. We saw one example of this when we analyzed the association between age and personal income. But there are many other nonlinear associations to consider. Perhaps taking the natural log or cube root of the explanatory variable is needed to linearize (a sophisticated way of saying make linear) its association with the outcome variable. An association may be piecewise linear, with different slopes needed in different areas of the explanatory or outcome variable's distribution. Or, we may need
to take the squared values of one variable and the cubed values of the other to linearize the association. It is easy to become overwhelmed quickly with the many possibilities, so it is best to be cautious, yet thorough, and always keep in mind that theory and past research usually provide good guidance. After we have checked the distributions of the variables and assessed whether there are nonlinear associations among the explanatory and outcome variables, it is always a good idea to examine the partial residual plots for each explanatory variable. Sometimes, nonlinear associations emerge in multiple linear regression models that bivariate scatterplots fail to reveal. This has to do with statistical control in a way that is beyond the scope of this presentation. Nevertheless, these nonlinearities may reveal unexpected, yet important, aspects of the model. If any of these nonlinear associations emerge, you should go back and think about why they occur. Including higher-order terms in a model (such as age2 and age3) may be interesting and important for correct model specification (see Chapter 7 and the age-income example in this chapter), but it is an empty exercise in the absence of thoughtful consideration about the theory and previous research that initially motivated your conceptual model. Finally, it is essential that linear regression models be accompanied by normal q-q plots of the residuals. A key assumption is that the error term is normally distributed, so testing this assumption is a vital exercise. If you find that the residuals do not follow a normal distribution, there are myriad transformations available. With additional experience looking at normal q-q plots, you will find that selecting an appropriate transformation will become easy.
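As a recap of the robbery example, a sketch of the residual checks described above might run as follows (it assumes usdata.dta is in memory; unemprat and pcincome are assumed names for the unemployment rate and per capita income variables, and rstand, ln_robb, and sqrt_robb are arbitrary names):

use usdata.dta, clear
qnorm robbrate                           // q-q plot of the outcome variable itself
regress robbrate unemprat pcincome
predict rstand, rstandard                // standardized residuals
qnorm rstand                             // the residuals hug the diagonal reasonably well
generate ln_robb = ln(robbrate)          // a logged version, for comparison
regress ln_robb unemprat pcincome
generate sqrt_robb = sqrt(robbrate)      // and a square-root version
regress sqrt_robb unemprat pcincome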
Interaction terms are also known as non-additive associations because, to include an interaction term in a linear regression model, we must multiply variables. Therefore, our regression equation is no longer limited to plus signs between variables. For instance, we have thus far restricted regression equations to the following form:
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
However, there is no reason we should limit the model so that the predictor variables are completely independent, which is implied by the additive terms in the equation. An interaction term, also known as a multiplicative term, involves multiplying explanatory variables. Here is a sample regression equation which we'll learn to use shortly:
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 (x_{1i} \times x_{2i}) + \varepsilon_i
The last term in this equation is listed as x_{1i} \times x_{2i}, which indicates that the variables are multiplied. We typically say that they interact in some fashion. (For a general overview of interaction terms in linear regression models, see Leona S. Aiken and Stephen G. West (1991), Multiple Regression: Testing and Interpreting Interactions, Newbury Park, CA: Sage.) A useful way of thinking about interaction terms is by considering distinct slopes for the different groups represented in the sample. Suppose, for example, that we hypothesize that self-esteem increases between the teenage years and young adulthood (e.g., from ages 15 to 25). For now, we'll assume there is a linear association between age and self-esteem within this age range. Given the types of regression equations we've used thus far, we might estimate the following linear regression model:
\text{Self-esteem}_i = \alpha + \beta_1(\text{age}_i) + \beta_2(\text{gender}_i) + \varepsilon_i
In this model, we treat age as a continuous variable and gender as a dummy variable. Let's say that gender is coded as 0 = female and 1 = male. Suppose the Stata output indicates positive slopes for both age and gender, so that males, on average, report higher self-esteem than
females. The assumed slopes for age by the two gender groups may be represented using the graph that appears below. If this is confusing, try making up a couple of slopes for age and gender and then estimate a few predicted scores using the linear regression equation. You'll notice that the relative distance between the males and females at each age is the same. This type of model is known as a different intercept-parallel slopes model. The slopes are the same for the two groups; they are just a constant distance apart. This distance is gauged by the intercept.
[Graph of self-esteem by age (15 to 25): two parallel upward-sloping lines, with the line for males above the line for females.]
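If the parallel slopes idea is still abstract, a quick worked example with made-up coefficients (an intercept of 20, an age slope of 0.5, and a gender slope of 2; none of these come from actual output) shows why the gap never changes:

display "Female, age 15: " 20 + 0.5*15 + 2*0     // 27.5
display "Male,   age 15: " 20 + 0.5*15 + 2*1     // 29.5
display "Female, age 25: " 20 + 0.5*25 + 2*0     // 32.5
display "Male,   age 25: " 20 + 0.5*25 + 2*1     // 34.5
* The male-female difference is 2 points at every age: parallel slopes, different intercepts.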
But perhaps you've read the literature on self-esteem and suspect that self-esteem increases more among males than among females from ages 15-25. Thus, we need a way to model different slopes for males and females. One approach we might use is to estimate separate linear regression models for males and females and then compare the slopes from each model. In Stata, the if or by command may be used to run separate models for different groups. As we'll learn, interaction terms are also very handy for modeling different slopes for two or more groups. The graph on the following page shows how this is represented. But how do we characterize this type of association in a linear regression model? With an interaction term that multiplies age and gender. In the regression equation, this appears as

\text{Self-esteem}_i = \alpha + \beta_1(\text{age}_i) + \beta_2(\text{gender}_i) + \beta_3(\text{age}_i \times \text{gender}_i) + \varepsilon_i
As we saw earlier, the values of one explanatory variable are multiplied by the values of the other explanatory variable; in this case age is multiplied by gender. If the coefficient for \beta_3 is statistically significant, then we may infer that males and females have different age-self-esteem slopes. Hence, this type of model is known as a different intercept-different slopes model. Normally, though, we need to consider the other coefficients as well. At this point we're assuming all three are positive and statistically significant. This also raises another important issue when using interaction terms: If interaction terms are used in the model, the constituent variables must also appear in the model for correct specification.
[Graph of self-esteem by age (15 to 25): the line for males is steeper than the line for females, illustrating different intercepts and different slopes.]
Let's see an example of an interaction term using the gss96.dta data set. We'll consider the associations among age, personal income, and gender. But, first, consider the following issue: We know that education and income are positively associated in the U.S., with people who have more years of formal education, on average, earning higher incomes than people with fewer years of formal education (an interesting exception involves air traffic controllers, but we'll ignore this). Many studies also indicate that women, on average, earn less than men. Let's see if these two observations are supported in the gss96.dta data set. We'll include age as a control variable in the model since we know it is associated with both education and income (but,
to simplify things, we'll ignore the nonlinear association between age and income). The results shown below support both propositions: Controlling for the effects of age, males, on average, report more personal income than females; and more years of education is associated with more personal income (note that gender is coded as 0 = male and 1 = female in this data set). If we were to plot slopes for education, it would appear similar to the graph shown earlier: Different intercepts but parallel slopes for males and females. The average difference between males and females is 1.328 units of income.
[Stata output for the regression of pincome on age, educate, and gender (n = 1,899). All three slopes are statistically significant (t = 7.38 for age, 10.87 for educate, −10.53 for gender, and 14.01 for the constant; each p < 0.001); 95% confidence intervals: age (.0274256, .047292), educate (.2102905, .302893), gender (−1.575262, −1.080654), constant (4.750569, 6.296949).]
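A minimal sketch of the commands behind this additive model (variable names follow the text):

use gss96.dta, clear
regress pincome age educate gender      // the gender slope of about -1.33 is the male-female gap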
Now, let's consider another question that involves these variables: What role does gender play, if any, in the association between education and income? Does more education tend to equalize personal income among males and females? Or is there some type of glass ceiling that results in a greater income disparity among males and females at higher levels of education? Perhaps the occupational choices of highly educated men and women lead to an income disparity. Or there may be no association between gender and education when it comes to personal income; they may only be independently associated with this outcome. We cannot even approach the answers to these questions with the linear regression model we just estimated. It provides no evidence about whether income varies by gender and education. Fortunately, an interaction term may be used to gather some of this evidence.
In Stata, we may calculate the interaction term using the following command: generate gen_educ = gender * educate. The name of the new variable, gen_educ, is arbitrary. Let's place this new variable in the linear regression model we just estimated. The output should appear as:
      Source |       SS       df       MS              Number of obs =    1899
-------------+------------------------------           F(  4,  1894) =   75.77
       Model |  2268.39276     4   567.09819            Prob > F      =  0.0000
    Residual |  14176.4614  1894  7.48493209            R-squared     =  0.1379
-------------+------------------------------           Adj R-squared =  0.1361
       Total |  16444.8541  1898  8.66430671            Root MSE      =  2.7359

     pincome |      Coef.       t    P>|t|     [95% Conf. Interval]
-------------+------------------------------------------------------
         age |   .0376037    7.45   0.000      .0276989    .0475085
     educate |   .1767414    5.39   0.000      .1124544    .2410284
      gender |  -3.618572   -5.43   0.000     -4.925471   -2.311674
    gen_educ |   .1648684    3.50   0.000      .0724929    .2572439
       _cons |   6.619434   13.17   0.000      5.633953    7.604916
How should we interpret these results? First, comparing them to the results of the previous model, we see that the education coefficient is smaller and the gender coefficient is larger (in absolute terms). Second, the gender-by-education interaction term has a positive and significant coefficient. It takes some experience to be able to look at the three slopes (education, gender, and gender-by-education) and make sense of the patterns. But here is a hint that will help you interpret them: Choose one of the variables involved in the interaction to focus on and then consider what the interaction term is telling you about this slope. Let's pick education. Its slope is 0.177 and is significantly different from zero. But the interaction term multiplies either a zero (to represent males) or a one (to represent females) by education. Therefore, for males we assume one slope, but when we consider females the slope is pulled in an upward direction (as indicated by the positive gender-by-education slope). Similarly, females begin with a detriment in income (as shown by the negative slope for gender), but then are pulled up by increasing education (as indicated by the positive gender-by-education slope). If this is still difficult to understand, use Stata to compute predicted values for some representative groups and then compare these values to see whether the gap between males and females
changes at different levels of education. The simplest way to do this is with Stata's adjust postcommand. Here is an example using 12 and 16 as the education levels (age=40 is an arbitrary value but should be constant for all four groups):
adjust age=40 gender=0 educate=12 gen_educ=0
adjust age=40 gender=1 educate=12 gen_educ=12
adjust age=40 gender=0 educate=16 gen_educ=0
adjust age=40 gender=1 educate=16 gen_educ=16
Noting that gender is coded as 0 = male and 1 = female, Stata provides the following means for these four groups: 10.24, 8.6, 10.95, and 9.97. The next question is how do we interpret the difference among these predicted values? One possibility is to look at the raw difference among males and females in the two education groups. In the 12-year group, the raw difference is {10.24 − 8.6} = 1.64. In the 16-year group, the raw difference is {10.95 − 9.97} = 0.98. Thus, it appears that the gap between males and females on personal income is greater at lower levels of education. This approach to comparing groups often works, but it may break down in certain situations. It is better to compare relative differences or percentage differences within education groups to understand the effects implied by the interaction term:
Percent difference, 12 years: (10.24 − 8.6) / 10.24 ≈ 16%
Percent difference, 16 years: (10.95 − 9.97) / 10.95 ≈ 9%
It is now clear that there is a larger percentage difference between males and females with 12 years of formal education than at 16 years of formal education. This supports the idea that higher levels of education tend to diminish income differences among men and women in the United States. There is a problem you may have anticipated that could affect our confidence in the regression model. Recall that an interaction term is calculated by multiplying one variable by another. This is bound to
lead to collinearity between the interaction term and its constituent terms. In fact, if we re-run the linear regression model and ask Stata for collinearity diagnostics, we find that gender has a VIF of 28 and gen_educ has a VIF of 29. Earlier in the chapter, when computing higher-order terms for age, we learned about a solution to this problem: Standardize (i.e., take the z-scores of) the constituent variables first before computing the higher-order terms. It makes no sense to standardize gender since it is a dummy variable, so let's standardize educate and then compute the interaction term. Then place age, gender, the standardized value of education, and the interaction term {standardized value of education × gender} in the linear regression model designed to predict income. The results of this new model show a lack of multicollinearity, but the same general patterns that we saw earlier. In fact, we may use the adjust postcommand and go through the means comparison procedure in the same way as we did earlier. The results are identical. We'll let you figure out why.
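A sketch of the standardized-interaction model just described (zeduc and zgen_educ are arbitrary names):

egen zeduc = std(educate)
generate zgen_educ = gender * zeduc
regress pincome age zeduc gender zgen_educ
estat vif                                // the interaction and its constituents are no longer highly collinear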
Let's work through one more example, this time using state-level data with the number of violent crimes per 100,000 people (violrate) as the outcome variable. Suppose our theory predicts that the unemployment rate and the percentage of males in a state are associated with this outcome variable. As a control variable, we'll include the number of migrations per 100,000 people. Let's suppose, furthermore, that our sad little theory also predicts an interaction between the unemployment rate and percent male, such that we think that when there are more males and more unemployment in a state, there is likely more violent crime. We're unsure whether this prediction is valid, so we'll use a two-tailed hypothesis test and simply propose that the unemployment rate and percent male interact in some way. To test this notion, let's estimate a linear regression model. First, though, since we wish to minimize the risk of multicollinearity, we'll use standardized versions of the explanatory variables (unemprat, permale, and mig_rate) and calculate the interaction term between the unemployment rate and percent male using their standardized forms. After calculating the interaction term, we may estimate the regression model. The results are provided in the table below. The unemployment rate is, as expected, positively associated with the number of violent crimes per 100,000. Since the unemployment rate is measured in z-scores, we may interpret its slope as follows: Controlling for the effects of the other variables in the model, each one standard deviation increase in the unemployment rate is associated with 96.7 more violent crimes per 100,000. Somewhat unexpectedly, the percentage of males in a state is negatively associated with the number of violent crimes per 100,000. But what does the coefficient of the interaction term show? One way to think about the interaction term is to consider the slope for percent male: It is negative. However, as the unemployment rate increases this negative slope becomes, for want of a better term, less negative. Another way to view it is to consider the slope of the unemployment rate: It is positive. As percent male increases, this positive slope becomes more positive.
[Stata output for the regression of violrate on zmig_rate, zunemprat, zpermale, and zmaleunemp; 95% confidence intervals: zmig_rate (−4.67468, 142.7696), zunemprat (26.37422, 166.9827), zpermale (−226.5241, −37.87077), zmaleunemp (14.96766, 152.158), constant (462.693, 596.5425).]
Although such interpretations provide general guidance for understanding the interaction term, some analysts find it more intuitive to compute predicted scores. The means comparison procedure mentioned above is not very useful, unless you wish to create categorical variables from the continuous variables, so we'll simply compute the predicted values by hand (and calculator). But we have to decide on the values of the explanatory variables. Since they are all measured using z-scores, convenient values include −1, 0, and 1; or one standard deviation below the mean (low), at the mean (medium), and one standard deviation above the mean (high). Let's see what the predicted violent crime numbers at these values of the explanatory variables indicate (note that we'll ignore the effects of migrations per 100,000; by not including it in our calculations, we are predicting the number of violent crimes at mean migration). We may utilize the following equation (or use Stata's adjust command) and plug in the values for the unemployment rate and percent male.

Violent crimes = 529.6 + 96.7(z-Unemployment) − 132.2(z-Percent male) + 83.6(z-Percent male × z-Unemployment)
Predicted number of violent crimes per 100,000 people

                                        Percent male
Unemployment rate      One SD below mean     At mean    One SD above mean
One SD below mean                  648.7       432.9                217.1
At mean                            661.8       529.6                397.4
One SD above mean                  674.9       626.3                577.7
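The entries in this table can be reproduced by hand. A small sketch that loops over the low, medium, and high values and plugs them into the prediction equation above:

foreach zu of numlist -1 0 1 {
    foreach zm of numlist -1 0 1 {
        display "z-unemployment = `zu', z-percent male = `zm': " 529.6 + 96.7*`zu' - 132.2*`zm' + 83.6*`zm'*`zu'
    }
}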
What do these predicted values indicate about the association among percent male, the unemployment rate, and the number of violent crimes per 100,000? Notice, first, that the lowest predicted number occurs when a state has a low unemployment rate and a high percentage of males. The highest predicted number occurs when a state has a high unemployment rate and a low percentage of males. But how can we determine the effects of, say, the percent male on the association between the unemployment rate and the number of violent crimes? We may compare the predicted values down the columns. For instance, at low values of percent male, the predicted values increase by about 4% (from 648.7 to 674.9), whereas at high values of percent male, the predicted values increase by 166% (from 217.1 to 577.7). In other words, the slope of the unemployment rate on the number of violent crimes is much steeper when percent male is high. One way that these types of findings are often represented is with a graph where we plot two or more slopes based on the results of the interaction term. The graph on the next page shows the two estimated slopes based on the predicted values. It is important to note that even though there is a steeper slope in states with a high percentage of males, the highest predicted values are in those states with a low percentage of males. As a final note: Remember to check the normal q-q plot of the residuals from this model. Sometimes, interaction terms introduce
additional complications in a model, such as non-normally distributed residuals. What does this model show? Are you more or less confident in the results after examining the distribution of the residuals? More importantly, though, what does the literature on violent crime tell us about the model? Do we trust the finding that a lower proportion of males is associated with more violent crime?
[Graph of predicted violent crimes per 100,000 (y-axis, roughly 200 to 800) by the unemployment rate, with separate lines for states with low and high percentages of males; the line for a high percentage of males is steeper, but the line for a low percentage of males lies higher.]
Before concluding this chapter, we should mention that regression models are not limited to interaction terms involving only two explanatory variables, or what are known as two-way interactions. You may also enter three-way interaction terms in a regression model, which implies multiplying three explanatory variables together (e.g., x1 × x2 × x3). However, this regression model must include all of the constituent terms as well as all the possible two-way interactions (x1 × x2; x1 × x3; x2 × x3). Conceivably, four-way, five-way, or even higher-order interactions are possible, but quickly become very difficult to work with and lead to problems with interpretation. A word of warning: There are some statisticians who claim that interaction terms are problematic because they fail to disentangle the actual ordering of the variables. Recall in Chapter 7 that we discussed endogeneity, or whether there are explanatory variables in the
regression model that are predicted by (or, to use generally unrecommended terminology, caused by) other explanatory variables inside or outside the model. If one of the variables used to compute the interaction term is endogenous to the other, or to other variables, then the usual interpretation of the interaction term becomes questionable. Think about the model estimated earlier in the chapter: Perhaps gender in some way influences education, or, as we say, education is endogenous in the model. What about the violent crime model? Percent male may affect the unemployment rate (e.g., males are more likely than females to participate in the labor force, so they are at higher risk of unemployment), so the unemployment rate could be endogenous in the model. A comprehensive discussion of this problem is beyond the scope of this presentation, but you should always think carefully about the variables in your model when considering interaction terms and the likelihood of endogeneity.
Chapter Summary
This chapter has covered a lot of important material. Nonlinear associations and outcome variables (as well as residuals from linear regression models) that are not normally distributed are quite common in the world of regression analysis. Moreover, it is important to consider that many of the associations between two variables that interest you may depend upon some third variable. Interaction terms provide a valuable way to test for non-additivity in your models. As a review and conclusion to what we've learned, here are four suggestions:
1. Always use theory and previous literature to guide your thinking about interaction terms and possible nonlinearities among your variables.
2. Always use a normal q-q plot to check the distribution of the outcome variable before estimating a regression model. Construct bivariate scatterplots of each explanatory variable by the outcome variable. Plot nonlinear lines in these scatterplots. Check the partial residual plots for the regression model using standardized or studentized residuals. Check the normal q-q plot of the residuals. When faced with nonlinear associations, consider the various transformations that are available. It is likely you will find one that will linearize the associations or normalize the residuals.
3. When entering interaction terms, consider using standardized versions of the explanatory variables to diminish the risk of multicollinearity. Use predicted values to figure out what the interaction term implies about the associations in the model.
4. It is easy to get carried away by combining quadratic terms, cubic terms, interactions, and transformed variables in a single regression model. Take care to avoid doing too much since models can easily become unmanageable when there are too many of these types of variables. It is better to have a good idea (guided by theory and previous research) of why these associations exist, rather than searching for them using multiple models. If you cannot explain them, then perhaps they should not appear in your model.
HETEROSCEDASTICITY AND AUTOCORRELATION
The name for this assumption is homoscedasticity. The term homo means same and the term scedastic means scatter. Hence the errors are assumed to be homoscedastic or have the same scatter. The alternative is that the errors do not follow the same pattern; hence, they are heteroscedastic (hetero means different). The figure below shows an example of heteroscedastic errors. Note how the presumed spread of the errors in predicting y increases with hypothetical values of x. The main consequence of heteroscedastic errors is that, although the slopes are unbiased, the standard errors are biased. An alternative way of saying this is that although the slopes are unbiased they are inefficient. The degree of bias is rarely known, although it seems to be especially severe in small samples. Yet it can throw off interpretations of statistical significance even in moderately sized samples.
[Figure: hypothetical data in which the spread of the errors around the regression line increases as x increases from 1 to 3.]
Heteroscedastic errors are a common phenomenon in linear regression models. They seem to be particularly common in cross-sectional data, or data that were collected at one point in time. Here is an example found frequently in textbook illustrations of
heteroscedasticity. The Stata data set, consume.dta, has information on how much income the families in the sample earned in a year and how much they spent on consumable goods in a year. A general hypothesis is that families with more income spend more on goods than families with less income. The results of a simple linear regression analysis support this proposition. Each one-thousand dollar increase in family income is associated with about a nine-hundred dollar increase in spending on consumable goods.
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) = 1262.64
       Model |  2179.71564     1  2179.71564            Prob > F      =  0.0000
    Residual |  31.0737689    18  1.72632049            R-squared     =  0.9859
-------------+------------------------------           Adj R-squared =  0.9852
       Total |  2210.78941    19  116.357337            Root MSE      =  1.3139

     consume |      Coef.       t    P>|t|
-------------+-----------------------------
      income |   .8993247   35.53   0.000
       _cons |   .8470517    1.20   0.244
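A sketch of the commands behind this example and the residual plot discussed next (rstud is an arbitrary name):

use consume.dta, clear
regress consume income
predict rstud, rstudent                               // studentized residuals
twoway (scatter rstud income) (lowess rstud income)   // the spread widens at higher incomes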
One of the graphical tests we should consider after estimating this model is a partial residual plot. In Chapter 10 we learned that this type of plot is useful for checking the appropriateness of the linear model. However, it also provides a useful way to check whether the errors are homoscedastic or heteroscedastic in a simple linear regression model. Ask Stata to construct a simple scatterplot of the studentized residuals (y) by the income variable (x). The results are shown on the following page. Notice that the residuals spread out at higher values of family income. This is a classic example of what heteroscedasticity looks like, with the spread of the errors fanning out at higher values of the explanatory variable. However, an opposite pattern also indicates heteroscedasticity: The spread of the residuals narrowing down at higher values of the explanatory variable. Homoscedastic residuals look like a random scatter, with no recognizable association in the partial residual plot. At this point, it is important to ask yourself the why question: Why does this pattern exist? Is there a reasonable explanation? Often,
by answering this question in a thoughtful way, you can figure out a solution to the problem of heteroscedasticity. For instance, it seems reasonable to suggest that there are (at least) two types of high income families: Those that spend most of their money and those that save a lot of their money. Low income people don't have this luxury; they often must spend what they have on necessary goods such as food, with little left over to place in a savings or retirement account or spend on additional items. How does this offer a solution to heteroscedasticity, though? If we had a variable in the data set that measured saving patterns or leisure activities, then we could include it in the model and determine if it explained the heteroscedasticity among the residuals. This illustrates that, as with many problems that affect the linear regression model, a good theory goes a long way.
[Scatterplot of studentized residuals (y-axis, roughly −2 to 2) against family income (x-axis, roughly 10 to 50); the residuals fan out at higher values of income.]
Another common reason for heteroscedasticity is that one or more of the variables requires transformation. This is often not apparent by looking at a scatterplot between the outcome and one of the explanatory variables. But previous research may provide a clue. As mentioned in an earlier chapter, income measures are often linearized by taking their natural logarithms. And many statisticians point out that taking the natural logarithm of the explanatory or
outcome variable may be all that is needed to eliminate the heteroscedasticity problem. For example, take the natural logarithm of either the explanatory or outcome variable in the linear regression model we just estimated, re-estimate the model using the transformed variable, and construct a partial residual plot. What does the plot show? There does not appear to be a clear heteroscedastic pattern to the residuals, although it does look unusual (either like a U-shaped or inverted U-shaped pattern, which may indicate other problems that we don't have the space to investigate).
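A sketch of the transformation check just described, logging the outcome variable (ln_consume and rstud2 are made-up names):

generate ln_consume = ln(consume)
regress ln_consume income
predict rstud2, rstudent
twoway (scatter rstud2 income) (lowess rstud2 income)   // no clear fan shape, though the pattern looks odd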
calculate both the z-scores of the predicted values (zpred) and the studentized residuals (sresid). Create another scatter plot which graphs zpred on the x-axis and sresid on the y-axis. Notice that completion percentage is positively associated with quarterback rating, but interceptions are not. Rather than the coefficients, let's focus on the scatterplot that Stata creates. The scatterplot Stata produces should look like the one below. Notice that, with the exception of the data point at about {1.2, 2.4}, there is clearly a fanning-out shape to the residuals at higher values of the predicted values. This suggests that there is a heteroscedasticity problem among the residuals. The other line in the graph is the lowess line (locally weighted regression) that we learned a little about in Chapter 10. It is occasionally useful for determining heteroscedasticity and other problems with the residuals. If there was no pattern to the residuals, then the lowess line should appear as a horizontal line (although this may not detect heteroscedasticity well). (Note that Stata has a postregression command called rvfplot that will produce a variation of this graph.)
[Scatterplot of studentized residuals (y-axis, roughly −2 to 4) against the standardized predicted values, with a lowess line overlaid; the residuals fan out at higher predicted values.]
Now that we've seen the heteroscedastic pattern in the residuals, are there other ways to detect it? One of the problems is that
graphical techniques become difficult to use with large data sets (imagine trying to visualize patterns with hundreds of points in a scatterplot), so it would be nice to have some other techniques for diagnosing heteroscedasticity. Fortunately, there are several. We'll discuss three of these. The first diagnostic approach is known as White's test (after the econometrician H. White). This test is conducted after estimating a multiple linear regression model. It is comprised of the following steps:
1. Compute the unstandardized residuals from the regression model.
2. Compute the square of these values (e.g., resid2).
3. Compute the square of each explanatory variable (e.g., x1^2, x2^2). Do not compute squared values of dummy variables, though.
4. Compute two-way interactions using all the explanatory variables (e.g., x1 × x2; x1 × x3; x2 × x3; etc.).
5. Estimate a new regression model with the squared values of the residuals as the outcome variable and the computed variables in (3) and (4) as the explanatory variables (don't forget to include their constituent terms).
6. There are then two ways to conclude whether or not the errors are heteroscedastic: (a) Rely on the R2 from the model in (5); if it is significantly different from zero then there is heteroscedasticity. (b) Use the following test statistic:
$$nR^2 \sim \chi^2_k$$

where n = the sample size, R² is from the model in step (5), and k = the number of explanatory variables in that model (five in our example). Note that this test statistic is distributed as a χ² variable, so we need to compare its value to a χ² value with k degrees of freedom.
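The steps above can also be scripted by hand. A minimal sketch for the quarterback rating model (all new variable names here are arbitrary):

    regress rating compperc interatt
    predict uhat, resid                        // step 1: unstandardized residuals
    generate uhat2 = uhat^2                    // step 2: squared residuals
    generate compperc2 = compperc^2            // step 3: squared explanatory variables
    generate interatt2 = interatt^2
    generate cpXint = compperc*interatt        // step 4: two-way interaction
    regress uhat2 compperc interatt compperc2 interatt2 cpXint    // step 5: auxiliary model
    display "nR2 = " e(N)*e(r2)                // step 6(b): test statistic
    display "p-value = " chi2tail(5, e(N)*e(r2))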
An advantage of White's test is that it not only tests for heteroscedasticity but also provides evidence about potential nonlinearities in the associations. Stata will save us the trouble of having to go through each of these steps. Although its implementation differs slightly from the steps above, the postcommand estat imtest, white will produce several tests, including White's test for heteroscedasticity. For example, after estimating the passing rating model, we may type this command. Stata returns the following output:
White's test for Ho: homoskedasticity
         against Ha: unrestricted heteroskedasticity

         chi2(5)      =    17.09
         Prob > chi2  =   0.0043
Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      17.09      5    0.0043
            Skewness |       4.92      2    0.0855
            Kurtosis |       0.57      1    0.4496
---------------------+-----------------------------
               Total |      22.58      8    0.0039
---------------------------------------------------
The first part of the printout shows that White's test compares the null hypothesis of homoscedasticity against the alternative of heteroscedasticity. We should reject the null hypothesis in this situation (note the small p-value) and conclude that there are heteroscedastic errors in this model. The second diagnostic approach is known as Glejser's test (after the statistician who introduced it, H. Glejser). The easiest way to understand how this test works is to consider the appearance of heteroscedastic errors in the scatterplot. Recall that residuals, whether unstandardized, standardized, or studentized, have a mean of zero, with negative and positive values. Suppose we folded the residuals over along the mean, so that the negatives were pulled up to be positive. This is, in effect, what happens when we take the absolute
value of the residuals, or |e_i|. Assuming a fan-shaped pattern, what would the association between the predicted values and the absolute values of the residuals look like? It would be positive if the residuals fan out to the right and negative if they fan out to the left. The scatterplot below, based on the plot shown earlier, demonstrates this. Notice that the presumed regression line indicates a positive association. This, in brief, is how Glejser's test is conducted: Take the absolute values of the residuals and estimate a linear regression model with these new residuals as the outcome variable and the original explanatory variables. An advantage of this test is that it allows you to isolate the particular explanatory variables that are inducing heteroscedasticity. Let's continue our example of quarterback ratings. After estimating the model and saving the unstandardized residuals (predict resid, resid), we'll test for heteroscedasticity using Glejser's test. The absolute values of the residuals are computed as follows: generate absresid = abs(resid). Absresid is an arbitrary name assigned to the new variable.
[Figure: Effect of Taking the Absolute Values of Residuals (absolute residuals by predicted values, with a positively sloped regression line)]
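Putting these steps together, a minimal sketch of Glejser's test for the quarterback rating model (absresid follows the arbitrary name used in the text):

    regress rating compperc interatt
    predict resid, resid                   // unstandardized residuals
    generate absresid = abs(resid)         // fold the residuals over the mean
    regress absresid compperc interatt     // Glejser's test regression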
Similar to White's test, Glejser's test shows evidence of heteroscedasticity. The positive association between completion percentage and the absolute value of the residuals suggests that there is increasing variability in the residuals as the completion percentage increases. In fact, a simple scatterplot with the residuals on the y-axis and completion percentage on the x-axis shows much the same thing.
      Source |       SS        df       MS             Number of obs =      26
-------------+------------------------------           F(  2,    23) =    8.05
       Model |  46.0259534      2  23.0129767          Prob > F      =  0.0022
    Residual |  65.7710073     23  2.85960901          R-squared     =  0.4117
-------------+------------------------------           Adj R-squared =  0.3605
       Total |  111.796961     25  4.47187843          Root MSE      =   1.691

------------------------------------------------------------------------------
    absresid |      Coef.        t    P>|t|
-------------+----------------------------------------------------------------
    compperc |   .6547374      4.01   0.001
    interatt |   .7898726      1.56   0.131
       _cons |   -38.7939     -3.74   0.001
------------------------------------------------------------------------------
The third test for heteroscedasticity is a simple variation of Glejser's test known as the Breusch-Pagan test. The difference is that the squared residuals are used rather than the absolute values of the residuals. This test is implemented in Stata using the postcommand estat hettest. The null and alternative hypotheses are the same as in White's test. Again, we reject the null hypothesis and conclude that there are heteroscedastic errors.
. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of resid2

         chi2(1)      =     5.01
         Prob > chi2  =   0.0252
What would solve the problem? There might be another type of nonlinear association between completion percentage and quarterback rating. In fact, there appears to be a slight U-shaped relationship between a quarterback's completion percentage and his rating, with a negative association between 55% and 58% and a positive association from about 59% to 64%. (See if you can figure out a way to verify this statement.)

As mentioned earlier in the chapter, transforming a variable by taking its natural logarithm also may reduce heteroscedasticity in the model. This is especially useful when using well-known skewed variables such as income or measures of deviant behavior. Taking the natural logarithm helps when the residuals spread out at higher values of an explanatory variable; if they narrow down, then squaring the variable or exponentiating it may help. Note, though, that these solutions involve the proper specification of the model. Heteroscedasticity is often only a symptom of improper model specification.

However, there are also situations when a variable that you may not be particularly interested in induces heteroscedasticity. This problem emerges quite frequently when we analyze what are known as repeated cross-sectional data. The General Social Survey (GSS), whose data we've used in previous chapters, is a well-known example of this type of data set. Although we've employed only the 1996 data, the GSS has been collected for more than 30 years. A different sample of adults in the U.S. is surveyed every couple of years; hence, it involves cross-sectional data that are repeated over time. One of the problems with such data is that the sample size may vary from one survey to the next. Since sample size affects variability (e.g., larger samples generally lead to smaller standard errors), differing sample sizes across years (assuming one wishes to analyze data from several years) induce heteroscedasticity. Some analysts simply include a variable that gauges year in their models, hoping that this will minimize heteroscedasticity. A preferred approach, however, is to use weighted least squares (WLS) to adjust for differing sample sizes. WLS uses a weight function to adjust the standard errors for heteroscedasticity.
WLS chooses the coefficients that minimize the weighted sum of squared residuals, $\sum_i \frac{1}{s_i^2}(y_i - \hat{y}_i)^2$, which yields the slope estimator

$$\hat{\beta}_{WLS} = \frac{\sum_i \frac{1}{s_i^2}(x_i - \bar{x})(y_i - \bar{y})}{\sum_i \frac{1}{s_i^2}(x_i - \bar{x})^2}$$

where

$$\bar{y} = \frac{\sum_i y_i / s_i^2}{\sum_i 1/s_i^2} \quad \text{and} \quad \bar{x} = \frac{\sum_i x_i / s_i^2}{\sum_i 1/s_i^2}$$
The formula for the slope is for simple linear regression models, but extending it to multiple linear regression models is not difficult. The key for WLS is the weight function, or 1/s_i². In the example of repeated cross-sectional data, the inverse of the sample size often works well to correct standard errors for heteroscedasticity. However, in most situations, especially those that do not involve repeated cross-sectional data, we must come up with the weight function conceptually, or by thinking about the associations implied by the model. For instance, suppose that the heteroscedasticity in our completion percentage and quarterback rating example was not solved by considering nonlinearities. Are there other variables in the data set that might be inducing it? There aren't a lot of options, so let's propose the following hypothesis: When thinking about NFL quarterbacks and the percent of the time they complete their passes, we should also consider what constitutes a safe passer and a risky passer. Safe passers throw short, accurate passes and therefore have high percentages. But there's also the rare passer who is accurate when throwing the ball farther; he appears to be a risky passer, but perhaps he is not. Hence, heteroscedastic errors might be induced by combining shorter and longer distance passers in the evaluation of completion percentage and quarterback rating. If this argument has any merit (and it probably doesn't), then a measure of average distance per pass may be just the weight variable we're looking for. In the passers.dta data set there is a variable that assesses average yards per completion (avercomp) that may prove useful. First, let's see the original linear regression model again:
Now, let's see what the WLS regression model with average yards per completion as the weight variable provides. Although there are several possibilities in Stata, the most straightforward command is known as wls0. This is a user-written command that must be downloaded first (type findit wls0). After making sure it is available on your computer, consider the following command:
wls0 rating compperc interatt, wvar(avercomp) type(abse) graph
It does not appear as though the WLS model offers much improvement over the linear regression model. Moreover, the accompanying graph still shows heteroscedastic errors. You might try other variations of this model (type help wls0) or alternative variables in the data set, although it can quickly turn into a search for the unobtainable. In fact, it is not a good idea to simply hunt willy-nilly for weight variables without strong theoretical justification, since using the wrong weight can lead to misleading results.

The other solutions to heteroscedastic errors do not require us to find the source of the problem. Rather, they are general solutions. One is named White's correction (after the same econometrician who developed White's test) or the Huber-White sandwich estimator (since it was also described by the statisticians P. Huber and F. Eicker; I don't know why Eicker isn't recognized!), and the other is named the Newey-West approach (after the econometricians who developed it). Both involve a lot of matrix manipulation that is beyond the scope of this chapter. However, there are Stata commands available that will do these corrections. The simplest approach is to use the robust subcommand along with the regress command: regress rating compperc interatt, robust. Stata returns the following output:
Linear regression                                      Number of obs =      26
                                                       F(  2,    23) =    3.16
                                                       Prob > F      =  0.0614
                                                       R-squared     =  0.3801
                                                       Root MSE      =  3.3798

-----------------------------
             |     Robust
      rating |    Std. Err.
-------------+---------------
    compperc |     .4617432
    interatt |     .9714897
       _cons |     26.49144
-----------------------------
The main difference is in the standard errors. Note that, compared to the original model, the so-called robust standard errors are larger than the original standard errors. This is usually the case when the errors are heteroscedastic. Some experts suggest that, because heteroscedasticity is such a common problem in linear regression models, we should always use a correction method. If there is no heteroscedasticity, then the results of the corrected and uncorrected models are the same. But if there is heteroscedasticity, then the results of a standard linear regression analysis can be highly misleading. Substantially more information on this issue is provided in the following article: J. Scott Long and Laurie H. Ervin (2000), "Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model," American Statistician 54: 217-224.
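The Newey-West approach is available through Stata's newey command, which requires that the data first be declared as time series with tsset. A minimal sketch with hypothetical variable names (timevar, y, x1, x2):

    tsset timevar
    newey y x1 x2, lag(1)      // Newey-West standard errors allowing heteroscedasticity and first-order autocorrelation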
Autocorrelation
Another problem arises when the error terms follow an autocorrelated pattern. This problem is especially acute in two types of data: those that are collected over time and those that are collected across spatial units. Data that are collected over time come in several types. First, we've already learned about repeated cross-sectional data. Second, data that are collected from the same individuals (these can be people, animals, or other individual units) over time are known as longitudinal or panel data. Third, data that are collected on the same aggregated unit (e.g., cities, counties, states) over time are known as time-series data. Time-series data are often limited to one unit, such as the city of Detroit, the state of Illinois, or the entire United States; data collected on different aggregate units over time are known as cross-sectional time-series data. No matter the type of data collected over time, autocorrelation is virtually a constant problem. One way to think about this problem is to consider the nature of residuals or errors in prediction. When we collect data over time, there is typically a stronger association among errors in time periods that are closer together than in those that are farther apart. For example, if we collect information on crime rates in New York City over a 25-year period and try to predict them based on unemployment rates over the same period, our errors are likely to be more similar in 1980 and 1981 than in 1980 and 2000. Hence, the errors are correlated differentially depending on time. Another name for this type of autocorrelation is serial correlation. A similar problem occurs when collecting data across spatial units. Suppose we collect data on suicide rates from cities across the United States. We then use percent poverty and the amount of air pollution to predict these rates. The errors in prediction are likely to be more strongly related when considering Los Angeles and San
Diego, CA, than when considering San Diego and Providence, RI. Los Angeles and San Diego probably share many more characteristics, some of which we do not measure, than do San Diego and Providence. These unmeasured characteristics add to the error in the model. Another name for this type of problem is spatial autocorrelation. As with heteroscedasticity, the main result of autocorrelation is biased standard errors. The slopes, on average, are still correct, but the standard errors and, ultimately, the t-values and p-values, are not. Hence, autocorrelation makes it much more difficult to determine whether a slope is statistically distinct from zero in the population. The scatterplot below shows an example of autocorrelation, or, more accurately, serial correlation. Notice the snaking pattern of the residuals around the linear regression line. This is the typical appearance of autocorrelation; it is the consequence of the stronger association among errors closer together in time than among those farther apart in time. Of course, such a pattern is rarely so clear; it is more common, especially with larger sample sizes, to find an unrecognizable pattern to the residuals. Hence, we need additional tools to determine if autocorrelation is a problem in linear regression models.
Let's see an example of serial correlation. The Stata data set detroit.dta includes variables that assess the annual number of homicides in Detroit, MI, from 1961 to 1973. It also has additional variables that were collected on an annual basis. This is an example of a classic time-series data set. Looking at homicides over time indicates an increasing trend. But let's look at a scatterplot of the homicide rate (y) by year (x) to see if we can find evidence of autocorrelation. When constructing this scatterplot, use the lfit command to include a linear regression line. It should look similar to the graph on the following page. Suppose, though, that we wish to use one of the variables in the data set to predict the number of homicides. A key issue in criminal justice management is whether the number of police officers affects crime patterns. We'll thus consider whether the number of police per 100,000 people predicts the number of homicides. What does a linear regression model show (we'll ignore year for now)? The results on the next page indicate that the number of police per 100,000 is positively associated with the number of homicides that Detroit experienced over these years. But we also know that autocorrelation is a likely problem. How should we check for it? As with heteroscedasticity, a plot of the residuals by the predicted values is a useful diagnostic tool. As before, ask Stata to plot the studentized residuals by the predicted values from this regression model. The resulting plot indicates what appear to be heteroscedastic errors (note the drawn-out S-shape as we move from left to right). But is there evidence of autocorrelation? It is difficult to tell based on this plot, but try fitting a cubic or a lowess line. The cubic line indicates an autocorrelated structure. Moreover, the lowess line shows a huge bend in the lower left-hand portion of the plot, and then some more curvature in the right-hand side of the plot. With additional experience, you will learn that this is also evidence of autocorrelated residuals.
[Scatterplot: number of homicides in Detroit by year, 1961-1973, with a linear fit line]
[Linear regression output: number of homicides regressed on police per 100,000 (Number of obs = 13)]
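A sketch of these steps, assuming the variables in detroit.dta are named year, homicide, and police (the actual names may differ; use describe to check):

    use detroit, clear
    twoway scatter homicide year || lfit homicide year      // homicides by year with a linear fit line
    regress homicide police                                  // police per 100,000 predicting homicides
    predict yhat, xb
    predict sres, rstudent
    twoway scatter sres yhat || lowess sres yhat             // residuals by predicted values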
As mentioned earlier, looking for patterns among the residuals and predicted values can be difficult, especially with large data sets. Fortunately, there are also statistical tests for autocorrelation. The most common of these is known as the Durbin-Watson test. It is calculated by considering the squared difference of the residuals adjacent in time relative to the sum of the squared residuals:
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
The t subscript indicates time. This test is so common that most statistical programs, including Stata, compute it as part of their linear regression routines. A rule-of-thumb is that the closer the d value is to two, the less likely that the model is affected by autocorrelation. The theoretical limits of d are zero and four, with values closer to zero indicating positive autocorrelation (the most common type) and values closer to four indicating negative autocorrelation (a rare situation). To use the Durbin-Watson test, we first compute the d value and then compare it to the values from a special table of d values available in many regression textbooks and on the internet. Some statistical programs provide values from these tables as part of their output. The values in the table include an upper limit and a lower limit based on the number of coefficients (explanatory variables + 1, or {k + 1}) and the sample size (n). The following rules-of-thumb apply to the use of the d values from a linear regression model:

If d_model < d_lower, then positive autocorrelation is present
If d_model > d_upper, then there is no positive autocorrelation
If d_lower < d_model < d_upper, then the test is inconclusive

Notice that this test does not assess negative autocorrelation, although, as mentioned earlier, if a d value from a model is close to the upper theoretical limit of four then there is negative autocorrelation that affects the regression model. Returning to the Detroit homicide example, we may ask Stata for the d statistic by first designating the time variable (type tsset year). Next, we can use the estat dwatson or dwstat postcommand to tell Stata to calculate the Durbin-Watson statistic. Both of these commands yield the following result:
Durbin-Watson d-statistic (2, 13) = .665654
This value is substantially lower than two, but we should always check a table of Durbin-Watson values to determine whether it falls outside the boundaries for {k + 1} = 2 and n = 13. A table found on the internet indicates the following boundaries for a five percent test:
{1.01, 1.34}. Because the model's d value of 0.67 falls below the lower boundary of 1.01, the regression of homicides on the number of police shows evidence of positive autocorrelation among the residuals.
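One remedy, taken up next, is Prais-Winsten (or Cochrane-Orcutt) regression, which Stata implements with the prais command once the data are tsset. A minimal sketch, again with hypothetical variable names for detroit.dta:

    tsset year
    prais homicide police           // Prais-Winsten; reports the original and transformed Durbin-Watson d
    prais homicide police, corc     // Cochrane-Orcutt variant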
The Prais-Winsten model has eliminated the autocorrelation problem. Notice that the d statistic is now well above the upper bound of 1.34, thus indicating that positive autocorrelation is no longer a problem. Moreover, we continue to find a significant association between the number of police and homicides in Detroit. Yet the slope is smaller and the standard error is larger in the new model. A question remains, however: Why is the association positive?

There are also some useful techniques for other types of data collected over time, in particular longitudinal data. A set of techniques that is growing in popularity is known as generalized estimating equations (GEEs). One of the drawbacks of the Cochrane-Orcutt and Prais-Winsten regression approaches is that they are limited to models with a first-order autoregressive process, or AR(1). This means that the immediately preceding value has a much stronger direct influence on the current value (e.g., 1970 on 1971) than other preceding values. Yet, it is often the case that other preceding values (e.g., 1968, 1969) also
have a strong effect on the current value. GEEs offer substantial flexibility to model various types of correlation structures among the residuals, not just AR(1). Stata has the capability to conduct GEE analyses. Let's look at a couple of examples of GEEs. The data set delinq.dta includes information collected over a four-year period from a sample of adolescents. It includes information on family relationships, self-esteem, stressful life events, and several other issues. Before considering a regression model, look at how the data are set up: Each adolescent provides up to four years of observations (if the id number is the same, then it is the same adolescent). We can consider that there are two sample sizes: the number of people represented (n = 651) and the number of total observations, or n × t (651 × 4 = 2,604). Moreover, a key assumption of the linear regression model is violated in this type of data set: The observations are not independent because the same people contribute more than one observation. Let's set up a linear regression model. We'll predict self-esteem (esteem) among these adolescents using family cohesion (cohes; a measure of family closeness) and stressful life events (stress). A linear regression model provides the results shown in the first table of the following page. Now let's see what a GEE provides. We first have to tell Stata that year is the time dimension and that the people in the data set are identified by the id variable. This involves using Stata's xtset command: xtset id year. Then the GEE model may be invoked:
xtgee esteem cohes stress, corr(ar1)
Note that the initial model is set up to assume that the errors follow an AR(1) process. Therefore, it is similar to a Prais-Winsten regression model. Note that the coefficients are closer to zero in the GEE model than in the OLS model. Hence, it is likely that the OLS model leads to biased coefficients when the observations are not
independent. Autocorrelation does not affect the standard errors much in this model, though.

Linear regression model
      Source |       SS        df       MS             Number of obs =    2604
-------------+------------------------------           F(  2,  2601) =  333.70
       Model |  11705.2797      2  5852.63985          Prob > F      =  0.0000
    Residual |  45618.3621   2601  17.5387782          R-squared     =  0.2042
-------------+------------------------------           Adj R-squared =  0.2036
       Total |  57323.6418   2603  22.0221444          Root MSE      =  4.1879

------------------------------------------------------------------------------
      esteem |      Coef.        t    P>|t|
-------------+----------------------------------------------------------------
       cohes |   .1577462    22.02   0.000
      stress |  -.3757225    -7.88   0.000
       _cons |   25.23998    59.82   0.000
------------------------------------------------------------------------------
The AR(1) model may not be the best choice. Is it reasonable to assume that the value in the previous year affects the current year's value, but other years don't have much of an effect? Perhaps not; other years may also be influential. In a GEE model we may estimate what is known as an unstructured pattern among the residuals. This allows the model to estimate the correlation structure rather than assuming that the residuals are uncorrelated (OLS) or follow an AR(1) pattern. The GEE model with an unstructured pattern (xtgee esteem cohes stress, corr(unstructured)) provides the results shown in the table on the next page.
The first thing to notice is that there are few differences between the AR(1) model and the unstructured model. Therefore, the assumption of an AR(1) pattern among the residuals is reasonable. The xtcorr postcommand may be used to determine the correlation patterns from the models. But keep in mind that if we do not take into consideration the longitudinal nature of the data, we overestimate the association between family cohesion and self-esteem, and between stressful life events and self-esteem. GEE models provide a good alternative to OLS linear regression models. When analyzing longitudinal data (which offer some important advantages over cross-sectional data), GEE models are the preferred approach. There are other models that are also popular, such as fixed-effects models and random-effects models, but these are actually specific cases of GEE models. A review of these various models is beyond the scope of this chapter, but an excellent start is James W. Hardin and Joseph M. Hilbe (2003), Generalized Estimating Equations, Boca Raton, FL: Chapman & Hall/CRC. Books on longitudinal data analysis offer even more choices (e.g., Judith D. Singer and John B. Willett (2003), Applied Longitudinal Data Analysis, New York: Oxford University Press).
states, or nations. It should be evident that errors in prediction are likely to be more similar in adjacent or nearby units than in more distant units. Therefore, whenever spatial data are analyzed it is a good idea to assess the likelihood of spatial autocorrelation because, as with serial correlation, the standard errors are affected. Moreover, as with longitudinal data, the assumption of independence is violated since adjacent areas usually share characteristics much more than distant areas. (Note: The information in this section is derived mainly from Peter A. Rogerson (2006), Statistical Methods for Geography, Second Edition, Thousand Oaks, CA: Sage.) The Durbin-Watson statistic (d) is not useful for assessing spatial autocorrelation. However, a standard test is known as Moran's I (named after its originator, the Australian statistician Patrick A.P. Moran). Assuming we transform the variable into z-scores (this simplifies the formula), Moran's I is calculated by
$$I = \frac{n \sum_{i}\sum_{j} w_{ij}\, z_i z_j}{(n-1)\sum_{i}\sum_{j} w_{ij}}$$
In this equation, there are n spatial units and w_ij is a measure of how close together the specific units, indexed by i and j, are to one another. If two spatial units that are close together exhibit similar scores on the variable (z), there is a positive contribution to Moran's I. Hence, if the nearest units tend to have scores that correlate more than units that are far apart, Moran's I is larger. In fact, its theoretical values range from -1 to 1, much like Pearson's correlation coefficient, with higher values indicating positive spatial autocorrelation. Negative spatial autocorrelation also may occur, with Moran's I closer to -1, although it is rare. Notice that this measure is concerned conceptually with lack of independence across spatial units; hence, we may recognize autocorrelation as a problem of the dependence of observations.
An important issue in calculating Moran's I is to figure out the w_ij's because the I value is highly dependent on them. The simplest approach is to create a dummy variable with a value of one indicating that the spatial units are adjacent (or share a border) and a zero indicating that they are not. This is known as binary connectivity. Another way is to compute distance measures, although we then have to decide between which particular points the distances are measured. A commonly used distance measure is from the center of one region to the center of another (e.g., the city center of Los Angeles to the city center of San Diego). If we use distance, w_ij in Moran's I is the inverse of the distance so that units closer together receive a larger weight. As an example of how to compute Moran's I, consider the following map. It consists of data from five counties in the state of Hypothetical. The numbers represent a measure of the number of crimes committed in each county in the last year, adjusted for their population sizes. We wish to determine the degree of spatial autocorrelation before going any farther in our analysis.
[Map: five counties in the state of Hypothetical with their crime measures (A = 32, B = 18, C = 26, D = 17, E = 19); a lake watershed and a forest are also shown]
To compute Moran's I, we first need to decide on a system of weights for the distances. To ease the computations, let's use the binary connectivity approach, where a one is assigned to adjacent counties (e.g., A and C) and a zero is assigned to non-adjacent counties (e.g., A and B). A simple way to see these weights is with a matrix where the entries are the w_ij's:
        | 0  0  1  0  0 |
        | 0  0  0  1  0 |
    W = | 1  0  0  1  1 |
        | 0  1  1  0  1 |
        | 0  0  1  1  0 |
This matrix is symmetric, with the same pattern above the diagonal as below the diagonal. For example, there is a 1 listed in row 1, column 3, and a 1 listed in row 3, column 1. This is an indicator that counties A and C share a border. The overall mean of the crime variable is 22.4, with a standard deviation of 6.43. To compute Moran's I, we may convert the specific crime values into z-scores to save us some steps. The z-scores (calculations omitted), listed in order from counties A to E, are {1.494, -0.685, 0.56, -0.84, and -0.529}. We then add the products of each pair of z-scores, and multiply this sum by the sample size (5) to obtain the numerator (note that the non-adjacent pairs, since they have zero weights, are omitted from the equation):
5 × [AC + BD + CA + CD + CE + DB + DC + DE + EC + ED] = 5 × [(1.494)(0.56) + (-0.685)(-0.84) + (0.56)(1.494) + (0.56)(-0.84) + (0.56)(-0.529) + (-0.84)(-0.685) + (-0.84)(0.56) + (-0.84)(-0.529) + (-0.529)(0.56) + (-0.529)(-0.84)] = 10.9
The sum of the weights is 10 (count the 1s in the W matrix), so the denominator in the equation is {4 × 10} = 40. The Moran's I value is therefore 10.9/40 = 0.273. This suggests that there is a modest amount of spatial autocorrelation among these crime statistics.

The solutions for spatial autocorrelation are similar in logic to the solutions for serial correlation. First, we may add a variable or set of variables to the model that explains the autocorrelation. It is often difficult to find these types of variables, though. Second, we may use a regression model known as geographically weighted regression that weights the analysis by a distance measure. Large weights apply to units that are close together; small weights apply to those that are farther apart. The regression coefficients are estimated in an iterative fashion after finding the optimal weight. Third, there are spatial regression models designed specifically to address spatial data and autocorrelation. The research area of Geographical Information Systems (GIS) includes a host of spatial regression approaches. Rogerson, op. cit., provides a relatively painless overview of spatial statistics and regression models. Stata has several user-written programs available for spatial analysis (type findit spatial models).
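As a check on the arithmetic, the five-county calculation can be reproduced in Stata's matrix language; a minimal sketch using the weight matrix and z-scores given above:

    matrix W = (0,0,1,0,0 \ 0,0,0,1,0 \ 1,0,0,1,1 \ 0,1,1,0,1 \ 0,0,1,1,0)
    matrix z = (1.494 \ -0.685 \ 0.56 \ -0.84 \ -0.529)
    matrix num = z' * W * z                     // sum of w_ij * z_i * z_j
    matrix sumw = J(1,5,1) * W * J(5,1,1)       // sum of the weights (10)
    display "Moran's I = " (5 * el(num,1,1)) / ((5-1) * el(sumw,1,1))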
Chapter Summary
We now know about the risks of heteroscedasticity and autocorrelation. Both of these problems are often symptoms of specification error in linear regression models, either because we use the wrong functional form or because we've left important variables out of the model. The main consequence of heteroscedasticity and autocorrelation is biased standard errors. Since we normally wish to obtain results that allow inferences to the population, obtaining unbiased standard errors is important. However, since specification error often accompanies heteroscedasticity or autocorrelation, the slopes from linear regression models may also be incorrect. Therefore, it is always essential that you think about the nature of your data and variables to evaluate the likelihood of specification error, lack of independent observations, heteroscedasticity, and
autocorrelation. There are several diagnostic tools and many solutions available for heteroscedasticity and autocorrelation. Here is a summary of several of these that are discussed in this chapter.

1. It is always a good idea to plot the studentized residuals by the standardized predicted values after estimating a linear regression model. A fan-shaped pattern to the residuals, one that either spreads out or narrows down at higher values of the predicted values, is indicative of heteroscedasticity. A cubic or similar snaking pattern to the residuals is indicative of serial correlation.

2. There are also several statistical tests for heteroscedasticity and autocorrelation. White's test and Glejser's test offer simple ways to use a regression model to test for heteroscedasticity. The Durbin-Watson d statistic is the most common numeric test for serial correlation.

3. Solutions to heteroscedasticity and serial correlation are legion. If you know the variable that is inducing heteroscedasticity or serial correlation, then you should consider a WLS regression analysis to adjust the model. If you don't know the source, or if you are genuinely worried about heteroscedasticity or serial correlation in your linear regression models, then use a model specifically designed to correct these problems. The Newey-West or Huber-White corrections for standard errors under heteroscedasticity work well. Prais-Winsten regression provides better estimates than OLS regression in the presence of serially correlated residuals.

4. For longitudinal data, a frequently used approach is to employ one of the GEE models. They provide a flexible way to model longitudinal associations among variables.
5. Data collected over spatial units present the same conceptual problems as data collected over time, although the techniques for diagnosing spatial autocorrelation and for adjusting the models are different. Moran's I is a widely used diagnostic test for spatial autocorrelation. And there are many regression routines designed specifically for spatial data.
GPA, then the score of 2,400 would constitute a leverage point. The observation might also be an outlier if, say, the student who scored 2,400 also obtained a 1.5 GPA in college whilst the other students obtained GPAs close to 3.0. High leverage points that are not outliers mainly affect the standard error of the slopes. Recall that the formula for the standard error of a slope in a multiple linear regression model (OLS) is
$$se(\hat{\beta}_i) = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2 / (n - k - 1)}{\sum (x_i - \bar{x}_i)^2\,(1 - R_i^2)}}$$
The key component of this equation that is affected by leverage points involves the x_i values in the denominator. Suppose that, all else being equal, we place a leverage point in the equation. What effect will it have? Well, it will result in a relatively large squared value in the denominator. Hence, the denominator will increase and the standard error will decrease. Thus, leverage points tend to reduce standard errors of slopes in linear regression models. (What effect do outliers have on standard errors?) The following three graphs show an outlier, a leverage point, and a combination of the two. They also include estimated linear regression slopes.
[Three scatterplots with fitted regression lines, illustrating an outlier, a leverage point, and an observation that is both an outlier and a leverage point]
Note that the outlier pulls what should be a relatively steep, positive regression line upwards near the low end of the x distribution. Therefore, the slope will be smaller than if the outlier were not part of the set of observations. The leverage point falls on the regression line, so the slope is not influenced but, as mentioned earlier, the standard error of the slope is affected. The third graph shows a common situation: The observation is extreme on both the x and y variables. It has a relatively strong influence on the slope and the standard error. In brief, it has a large effect on the regression model. Influential observations result from a number of sources. Perhaps the most common is coding error. When entering numeric data, it is easy to hit the wrong entry key or to forget a numeral or a decimal place. Therefore, it is always a good idea to explore the data to determine if there are any coding errors. Oftentimes, though, influential observations occur as a routine part of data collection exercises: Some people do earn a lot more money than others; some adolescents do drink alcohol or smoke marijuana much more often than other youth. If the nature of a variable often leads to extreme values, then a common solution is to simply pull in these values by taking the square root or natural logarithm of the variable (see Chapter 10). Before estimating a model, it is a good idea to use exploratory data analysis techniques, such as q-q plots, boxplots, or stem-and-leaf plots to visualize the distributions and check for leverage points or outliers.
In matrix terms, leverage values are based on the hat matrix, H = X(X′X)⁻¹X′. Recall that the matrix X′ is the transpose of X, whereas X⁻¹ denotes the inverse of a matrix (see Chapter 3). H is known as the hat matrix because Ŷ = HY; that is, the predicted values of y may be computed by premultiplying the vector of y values by H. The diagonals of the hat matrix (h_ii) lie between zero and one. We may compute the mean of these values to come up with an overall mean for the set of x values used in
the multiple linear regression model. The points off the diagonals represent distance measures from this joint mean. Many statistical software packages, including Stata, provide leverage values (sometimes called hat values) based on the hat matrix. Larger values are more likely to influence the results of the regression model. These values may range from zero to (n - 1)/n, which gets closer and closer to one as the sample size increases. A standard rule-of-thumb states that any leverage value that is equal to or exceeds 2(k + 1)/n should be scrutinized as an influential observation. For example, in a model with three explanatory variables and a sample size of 50, leverage points of 2(3 + 1)/50 = 0.16 or more should be evaluated further. The most common method for detecting outliers is through the use of deleted studentized residuals (also known as jackknife residuals, or simply as studentized residuals by some researchers). In Stata, these are simply the studentized residuals we used in earlier chapters. In particular, they are estimated using the root mean squared error (S_E; see Chapter 4) from the linear regression model after deleting the specific observation. The formula for deleted studentized residuals is
$$t_i = \frac{e_i}{S_{E(-i)}\sqrt{1 - h_i}}$$
The e_i value is the unstandardized residual and the h_i value is from the hat matrix described earlier. This is a standardized measure of the residuals, with a mean of zero, so we would expect, if the residuals are normally distributed, that about five percent of them would fall more than two units away from the mean. To detect outliers, look for deleted studentized residuals that are substantially greater than two or substantially less than negative two. As suggested earlier, it is not uncommon to find observations that are both leverage points and outliers. Hence, statisticians have developed a number of summary measures of influential observations. The two most prominent of these measures are Cook's D, or Distance (named for its developer, the statistician R.D. Cook),
218
and DFFITS. Both of these measures combine information from leverage values and deleted studentized residuals. Cook's D is computed as
$$D_i = \frac{t_i^2}{k+1} \times \frac{h_i}{1 - h_i}$$
Larger values of Cook's D indicate more influence on the regression results. There are various rules-of-thumb for Cook's D. One is to look for values greater than or equal to one. A more common and less stringent rule is to consider Cook's D values greater than 4/(n - k - 1) as indicative of influential observations. For example, in a model with three explanatory variables and a sample size of 75, any Cook's D value greater than 4/(75 - 3 - 1) = 0.056 should be scrutinized. The other general diagnostic measure of influential observations, DFFITS, is quite similar to Cook's D. Stata calls these dfits. It is computed as
$$DFFITS_i = t_i \sqrt{\frac{h_i}{1 - h_i}}$$
A general rule-of-thumb is to consider any DFFITS with an absolute value greater than or equal to two as indicative of an influential observation. (Note that DFFITS, like deleted studentized residuals, can be positive or negative.) However, a rule-of-thumb that considers the sample size is 2√((k + 1)/n); hence, if we have a model with three explanatory variables and a sample size of 75, any DFFITS value greater than 2√((3 + 1)/75) = 0.46 should be evaluated as an influential observation. Once these measures are estimated, it is a simple matter to compute the cut-off points for their respective rules-of-thumb and then identify influential observations. Another useful technique is to construct a scatterplot with the leverage values on the x-axis and the deleted studentized residuals on the y-axis. Then you may determine if
one or more observations are both outliers and high leverage points. A scatterplot of the Cook's D values (or the absolute value of the DFFITS) by the leverage values or deleted studentized residuals may also be useful for determining the type of influential observations that are affecting the model.
[Partial linear regression output]
                  [95% Conf. Interval]
                   19.20629    124.9171
                   25.52359    106.2105
                  -.2404035    .3246401
                  -215.6569    339.7273
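The four diagnostic measures summarized below can be created with predict after estimating the model. A minimal sketch, assuming the state-level model used in this chapter (violrate regressed on unemprat, gsprod, and density):

    regress violrate unemprat gsprod density
    predict rstudent, rstudent       // deleted (jackknife) studentized residuals
    predict leverage, leverage       // hat values
    predict cook, cooksd             // Cook's D
    predict dfits, dfits             // DFFITS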
We may examine summary statistics for the diagnostic measures we predicted using the summarize command (summarize rstudent leverage cook dfits):
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    rstudent |        50   -.0074527    1.025459  -2.038497   2.196996
    leverage |        50         .08    .0875959   .0207301   .4871022
        cook |        50    .0254171     .066837   1.69e-06   .4606597
       dfits |        50   -.0387935    .3239761  -1.371845   .5962182
The results provide the minimum and maximum values, but we are particularly interested in the maximum values of these measures. First,
however, it is useful to compute the cut-off values for these measures. Given our model, the leverage value rule-of-thumb cut-off is 2(3 + 1)/50 = 0.16; the Cook's D rule-of-thumb cut-off is 4/(50 - 3 - 1) = 0.087; and the DFFITS rule-of-thumb cut-off is 2√((3 + 1)/50) = 0.57. According to the Stata printout, the maximum studentized residual value is 2.197, the maximum leverage value is 0.487, the maximum Cook's D value is 0.461, and the maximum DFFITS value is 1.37 (in absolute value terms). The studentized residuals are no cause for concern; it appears that most are within two units of the mean of zero. However, there is at least one observation that exceeds the preferred maximum leverage value and at least one that exceeds the preferred maximum Cook's D and DFFITS values. This is also made evident by creating a scatterplot of Cook's D values by leverage values, while telling Stata where to indicate the cut-off values (twoway scatter cook leverage, yline(0.087) xline(0.16)). It is evident from this graph that one observation is particularly extreme. We should explore its implications further. A convenient way in Stata to consider these values in more detail is by using additional exploratory analysis techniques. For example, by using the list command to list all Cook's D and leverage values greater than the cut-off values (list cook if cook > 0.087 and list leverage if leverage > 0.16), we may identify which cases comprise the maximum values for these diagnostic variables. For example, according to the Stata summary statistics, the largest Cook's D value (0.461) is case number 5; the next largest is case number 48 (Cook's D = 0.12). According to the cut-off value, both of these cases are influential observations, but there are no others. The largest leverage value (0.467) is also case number 5. According to the rule-of-thumb for the leverage values (0.16), there are seven leverage points that exceed the rule-of-thumb. Box plots and stem-and-leaf plots are also useful for exploring data (type findit boxplot or help stem). Since case number 5 appears in all three diagnostic measures as an influential observation, it is important that we take a close look at
it. The first step is to determine which state is represented by case number 5. A simple way to find the name of this state is by asking Stata again for the scatterplot but also requesting that it label the cases using the mlabel subcommand: twoway scatter cook leverage, yline(0.087) xline(0.16) mlabel(state). It should not be surprising to find that this state is California, which has the largest population and is extreme on other characteristics. A boxplot for the leverage values (graph box leverage) or the Cook's D values (graph box cook) visually demonstrates how far this case is from the others. The next step is to determine why California appears as an influential observation. Given that it is not an outlier, perhaps it is extreme on one or more of the explanatory variables. Considering this possibility, a good analyst will explore the data further to understand more about the influence that California's data have on the regression model.
influential observation. But it does reveal a clue that we will consider later. As an alternative, we could simply ignore the influential observations. If our sample is large enough, this may be a reasonable alternative (although the solutions discussed later are preferred). For smaller samples, though, influential observations can have a large impact on the results. We've seen one example of this already.
      Source |       SS        df       MS             Number of obs =      49
-------------+------------------------------           F(  3,    45) =    7.52
       Model |  1123838.64      3  374612.879          Prob > F      =  0.0004
    Residual |  2242356.36     45  49830.1412          R-squared     =  0.3339
-------------+------------------------------           Adj R-squared =  0.2895
       Total |  3366194.99     48  70129.0624          Root MSE      =  223.23

------------------------------------------------------------------------------
    violrate |      Coef.        t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unemprat |   78.46247      2.97   0.005       25.3355    131.5894
      gsprod |   90.99672      3.41   0.001      37.25422    144.7392
     density |   .0007233      0.01   0.996     -.2852006    .2866473
       _cons |   8.477252      0.06   0.953     -276.9425     293.897
------------------------------------------------------------------------------
The best solution is to try to understand influential observations. For example, why do California's data affect the results so dramatically? Let's explore this some more by considering the explanatory variable gross state product that seems most affected by including California in the model. Recall that this variable assesses a state's economic productivity. California is the most populous state in the U.S., one of the major producers of agricultural products, the setting for many large shipping ports, and home to some of the most valuable real estate in the world. Even if you are unfamiliar with its economy, it should not surprise you to know that its gross state product exceeds the gross national product (a measure of a nation's economic productivity) of most nations of the world. In other words, California's economy is huge. Try running exploratory analyses on the gross state product variable. California easily outstrips the other states, with a gross state product in $100,000s of 9.13. The next largest is New York at 5.88. Next, try constructing a q-q plot of gross state product. Notice the extreme value; it is California. It is partially causing a highly skewed variable that appears as if it would benefit
from taking its square root or natural logarithm (you may recall that we checked its distribution in Chapter 10 and found it to be highly skewed). After computing the natural logarithm of the gross state product, re-estimate the multiple linear regression model (with this logged variable) and assess the influential observation measures. You should find that the largest Cook's D is now 0.135, which is much closer to the cut-off point. (Which state now has the maximum value on the Cook's D?) The DFFITS also show improvement. Other popular solutions to the problem of influential observations involve the use of regression techniques that are not so sensitive to extreme values. For instance, you may recall that the median is known as a robust measure of central tendency because it is less sensitive than the mean to extreme values (see Chapter 1). Since our multiple linear regression model using least squares is based on means, perhaps an alternative regression technique based on medians would be influenced less by extreme values. This reasoning is the basis for two very similar regression models: median regression and least-median-squares (LMS) regression. Rather than minimizing the sum of squared errors, these techniques minimize the sum of absolute residuals or the median squared residual. An example of a median regression model estimated with Stata is provided in the first table shown on the next page (qreg violrate unemprat gsprod density). Another alternative is to use the Huber-White sandwich estimators (see Chapter 11) to adjust the standard errors for influential observations. Recall that this estimation technique is useful for heteroscedastic errors. Often, influential observations mimic heteroscedasticity, so adjusting the standard errors with this widely used approach yields more robust results. The second table on the next page shows the results of an OLS model with Huber-White estimators of the standard errors using Stata (regress violrate unemprat gsprod density, robust).
Median regression
  Raw sum of deviations  11227.01 (about 502.78)
  Min sum of deviations  8630.281

------------------------------------------------------------------------------
    violrate |      Coef.   Std. Err.      t       [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unemprat |   89.79099   44.76845     2.01      -.3232274    179.9052
      gsprod |    57.9164   34.52293     1.68      -11.57465    127.4075
     density |  -.0810787   .2217001    -0.37      -.5273379    .3651804
       _cons |  -24.05405   236.8805    -0.10      -500.8698    452.7617
------------------------------------------------------------------------------
Linear regression

------------------------------------------------------------
             |     Robust
    violrate |    Std. Err.      [95% Conf. Interval]
-------------+----------------------------------------------
    unemprat |     23.4665        24.82606    119.2973
      gsprod |    19.37164        26.87398    104.8601
     density |    .1152934       -.1899554     .274192
       _cons |    110.3569       -160.1017    284.1722
------------------------------------------------------------
A related approach is known as robust regression (there are actually several types, including median regression). One form of robust regression begins with an OLS model and then calculates weights based on the absolute value of the residuals. It goes through this process repeatedly (iteratively) until the change in the weights drops below a certain point. In essence, it down-weights observations that have the most influence on the model. Stata includes a robust regression technique as part of its regular suite of modeling routines. Therefore, the table on the following page presents the results of the regression model using Stata's rreg (robust regression) routine. The results of these various models indicate that there are consistent positive associations between the gross state product or the unemployment rate and the number of violent crimes per 100,000 across states in the U.S. Moreover, even though the evidence is clear that there are influential observations, a number of robust techniques suggest that the effects of these influential observations would not force us to change our general conclusions. For example, the table below shows a comparison of the unstandardized coefficients for the
unemployment rate and the gross state product across the various models:
Robust regression                                      Number of obs =      50
                                                       F(  3,    46) =    6.71
                                                       Prob > F      =  0.0008

------------------------------------------------------------------------------
    violrate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unemprat |   73.81075   28.19651     2.62   0.012      17.05413    130.5674
      gsprod |    64.2619   21.52184     2.99   0.005      20.94069    107.5831
     density |   .0233289   .1507155     0.15   0.878     -.2800457    .3267036
       _cons |   49.21484   148.1391     0.33   0.741     -248.9736    347.4033
------------------------------------------------------------------------------
It appears that the worst thing we can do is omit California from the model. This option has the most dramatic effect on the association between the gross state product and the number of violent crimes per 100,000. In fact, we would overestimate by a substantial margin the association between these two variables if we simply removed California from consideration. Yet, the OLS model without adjustment for influential observations offers results consistent with the other, more robust regression models. Note also that the unemployment rate coefficient shifts considerably in the median regression model. You may wish to study this technique further to understand why this shift occurs. And, of course, we should not forget that the first approach was to take the natural logarithm of the gross state product. This may actually be the best
solution in this situation, especially since the three techniques we used are most appropriate when outliers are causing problems. We have only touched the surface of the abundant information about influential observations and the many robust regression techniques that are available. Much more information is provided in Peter J. Rousseeuw and Annick M. Leroy (2003), Robust Regression and Outlier Detection, New York: Wiley-Interscience. John Fox (1997), op. cit., also provides an overview of several robust regression techniques. The main point is that influential observations are a common part of regression models. Although they are often caused by coding errors or reflect highly skewed variables, they are also a routine part of data collection and variable construction. Exploring your data carefully before estimating the model and assessing influence measures after estimating the model should therefore be part of any regression exercise.
combination of the Xs is a linear function of the Xs. Now, imagine a binary outcome variable in a model; can it satisfy either of these assumptions? The quick answer is no. In order to gather evidence to this effect, consider one of the variables that appears in the data set lifesat.dta. The variable, also labeled lifesat, measures respondents' self-reported satisfaction with life using two categories, low (coded 0) and high (coded 1). Of course, it would be preferable to have a continuous measure of life satisfaction, but suppose this is all we have. Can we assess predictors of this variable in a regression context? For instance, does age predict life satisfaction? As an initial step, let's ask Stata to construct a scatterplot with life satisfaction (lifesat) on the y-axis and age in years (age) on the x-axis. Place a linear fit line in the scatterplot (twoway scatter lifesat age || lfit lifesat age). You should find a plot with a fit line that looks like the following graph:
[Scatterplot: life satisfaction (0 or 1) by age in years, with fitted values from a linear regression]
Simply looking at the fit line, we would conclude that there is a negative association between age and life satisfaction. However, notice how peculiar the scatterplot looks. All the life satisfaction scores fall in line with the values of zero or one. This, in itself, is not
peculiar since respondents can take on scores of only zero or one. However, try to imagine lines running from the points to the fit line. Notice that there will be a systematic pattern to these residuals. Is this indicative of any problems we've discussed in previous chapters? Of course it is: There is a systematic pattern that indicates heteroscedasticity. As a next step, we'll use Stata to estimate a linear regression model with life satisfaction as the outcome variable and age as the explanatory variable. When setting up the model, ask for a normal probability plot of the residuals (predict rstudent, rstudent followed by qnorm rstudent). The results (see the table below) provide an unstandardized coefficient of -0.016. Hence, the interpretation is that each one-year increase in age is associated with a 0.016 decrease in life satisfaction. Do these interpretations make sense? Can life satisfaction decrease by 0.016 units? This illustrates a key problem with using a linear regression model to predict a binary outcome variable. Look at the normal probability plot of the residuals. Is it even possible to claim that the residuals approach a normal distribution? This would constitute a leap in interpretation no sincere analyst would be willing to make. Another problem is that, even though these types of outcome variables can take values of only zero or one, it is possible to obtain predicted values from a linear regression equation that exceed one or are less than zero. For instance, we might predict that a person's life satisfaction score is a meaningless -0.3 or 1.75.
      Source |       SS        df       MS             Number of obs =     103
-------------+------------------------------           F(  1,   101) =    2.29
       Model |  .569711538      1  .569711538          Prob > F      =  0.1336
    Residual |  25.1584438    101  .249093503          R-squared     =  0.0221
-------------+------------------------------           Adj R-squared =  0.0125
       Total |  25.7281553    102  .252236817          Root MSE      =  .49909

------------------------------------------------------------------------------
     lifesat |      Coef.        t    P>|t|
-------------+----------------------------------------------------------------
         age |  -.0162055     -1.51   0.134
       _cons |   1.094168      2.70   0.008
------------------------------------------------------------------------------
[Normal probability plot: studentized residuals plotted against the inverse normal]
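The model and plots just shown can be reproduced with commands along these lines (a sketch using the lifesat.dta variable names given above):

    use lifesat, clear
    twoway scatter lifesat age || lfit lifesat age      // scatterplot with a linear fit line
    regress lifesat age                                  // linear regression with a binary outcome
    predict rstudent, rstudent
    qnorm rstudent                                       // normal probability plot of the studentized residuals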
Remember when you were asked in some secondary school mathematics course to flip a coin and estimate elementary probabilities? You were probably told that, given a fair coin, the probability of a heads was 0.50 and the probability of a tails was also 0.50. That is, from a frequentist view of statistics, we expect that if we flip a coin numerous times about half the flips will come up as heads and the other half will come up as tails. This is usually represented as P(H) = 0.50, with P shorthand notation for probability (see Chapter 1). The development of logistic regression as well as other techniques for binary outcome variables was predicated on the notion that a binary variable could be represented as a probability. Returning to our earlier example, we might ask the probability of a person reporting high or low life satisfaction. Using our shorthand method, this is depicted as P(high life satisfaction) = 0.49. In practical terms, this means that about 49% of the sample respondents report high life satisfaction. A fundamental rule of probabilities is that they must add up to one. Hence, if there are only two possibilities, we know that the probability of low life satisfaction {P(low life satisfaction)} is 1 - 0.49 = 0.51. The simplest way to determine these probabilities in Stata is to ask for frequencies for the lifesat variable (tab lifesat). Once we shift our interest from estimating the mean of the outcome variable to estimating the probability that it takes on one of the two possible choices, we may utilize a regression model that is designed to predict probabilities rather than means. This is what a binary logistic regression model is designed to do: It estimates the probability that a binary variable takes on the value of one rather than zero (notice that, like dummy variables, we assume that the outcome variable is coded {0, 1}). In our example, we may use explanatory variables (which may be continuous or dummy variables) to estimate the probability that life satisfaction is high. Binary logistic regression accomplishes this feat by transforming the linear regression equation through the use of the following logistic function:
\[ P(y = 1) = \frac{1}{1 + \exp[-(\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)]} \]
The part of the denominator in parentheses is similar to the linear regression model that we've used in earlier chapters. But note that it is transformed in a very specific way. The advantage of this function is that it guarantees that the predicted values range from zero to one, just as probabilities are supposed to do. For example, the following table shows several negative and positive predicted values (i.e., from the linear regression equation in the denominator) and their values after running them through the function:
Initial value    Transformed value
       -10.0           0.000045
        -5.0           0.006693
        -0.5           0.377541
         0.0           0.500000
         0.5           0.622459
         5.0           0.993307
        10.0           0.999955
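These transformed values are easy to verify with Stata's display command; for example (the remaining values follow the same pattern):

* Verify a few of the transformed values in the table
display 1/(1 + exp(-(-10)))
display 1/(1 + exp(-(-0.5)))
display 1/(1 + exp(-(0.5)))
display 1/(1 + exp(-(10)))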
If we placed more extreme values in the function, it would return numbers closer to zero or one, but they would never fall outside the boundary of [0, 1]. Logistic regression models do not use OLS or some similar estimation routine. Rather, they use a widely used estimation procedure known as Maximum Likelihood (ML). ML is a common statistical procedure, but its particulars require more statistical knowledge than we currently have at our disposal. Suffice it to say that ML estimates the most likely value of a statistic, whether the value of interest is a mean, a regression coefficient, or a standard error, given a particular set of data. Detailed information on ML estimation is available in most books on statistical inference. A precise treatment is provided in Scott Eliason (1993), Maximum Likelihood Estimation: Logic and Practice, Newbury Park, CA: Sage.
Fortunately or unfortunately (you'll be able to choose one of these after a couple more pages), logistic regression models are not often used to come up with predicted probabilities. Although there is no good reason why this occurs, most researchers who use logistic regression employ odds and odds ratios to understand the results of the model. Odds are used frequently in games of chance. For example, slot machines and roulette wheels are usually accompanied by the odds of winning on any particular attempt. Horse races are also described in terms of odds: the odds that Felix the Thoroughbred will win the Apple Downey Half-Miler are four to one. In general, odds are simply probabilities that have been transformed using the following equation:
\[ \text{odds} = \frac{P}{1 - P} \]
Say that our thoroughbred has been in ten half-mile races and won two of them. We may then say that the probability that she wins a half-miler is 0.20. This translates into an odds of winning of 0.20/(1 - 0.20) = 0.25. Another way of saying this is that the odds are four to one against her winning: for each race she wins, she loses four. An example closer to the interests of the research community is the following. Suppose we conduct a survey of adolescents and find that 25% report marijuana use and 75% report no marijuana use in the past year. Hence, the probability of marijuana use {P(marijuana use)} among adolescents in the sample is 0.25. What are the odds of marijuana use?
\[ \frac{P(\text{marijuana use})}{1 - P(\text{marijuana use})} = \frac{0.25}{1 - 0.25} = 0.33, \text{ or } 1/3 \]
This implies that for every three adolescents who do not use marijuana, we expect one adolescent to use marijuana. In other words, three times as many adolescents have not used marijuana as have used marijuana in the past year.
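The conversions from probabilities to odds are likewise simple to check in Stata:

* Odds of winning for the thoroughbred (probability 0.20)
display 0.20/(1 - 0.20)
* Odds of past-year marijuana use (probability 0.25)
display 0.25/(1 - 0.25)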
This is not yet very interesting since we have restricted our attention so far to only one variable. Let's extend our attention to two variables. We'll treat adolescent past-year marijuana use as the outcome variable. To keep it simple, we'll use a dummy variable that measures gender (coded 0=female; 1=male). Now we have up to four outcomes, with males and females who have used or have not used marijuana. If interested in probabilities, we may compare the two: P(male marijuana use) and P(female marijuana use). However, in keeping with our interest in odds, we'll compare the odds of marijuana use among male and female adolescents. Let's assume that our survey data indicate that 40% of male adolescents and 20% of female adolescents report past-year marijuana use. Hence, the odds of marijuana use among males are 0.40/0.60 = 0.667 or 2/3 and the odds of marijuana use among females are 0.20/0.80 = 0.25 or 1/4. Keep in mind that these are not probabilities; odds are different. Another way of saying this is that for every two males who used marijuana, three did not; and for every one female who used marijuana, four did not. An odds ratio is just what it sounds like: the ratio of two odds. Let's continue our adolescent marijuana use example to see the utility of this measure. Assume we wish to compare marijuana use among males and females. What is the odds ratio for these two groups? First, we have to decide which is the focal group and which is the comparison group. Let's use males as the focal group since they are more likely to report use. The odds ratio (OR) of males to females is
\[ OR_{\text{males vs. females}} = \frac{\text{Odds(males)}}{\text{Odds(females)}} = \frac{P(\text{males}) / [1 - P(\text{males})]}{P(\text{females}) / [1 - P(\text{females})]} = \frac{0.667}{0.25} = 2.67 \]
An odds ratio of 2.67 gives rise to the following interpretation: The odds of past-year marijuana use among males are 2.67 times the odds of past-year marijuana use among females. Some analysts use a shorthand phrase that males are 2.67 times as likely as females to use
marijuana. However, this can mislead some observers into thinking that we are dealing with probabilities rather than odds. The ratio of the probabilities is 0.40/0.20 = 2.0; clearly not the same as the odds ratio. Odds ratios are simply two odds that are compared to determine whether one group has higher or lower odds than another group on some binary outcome. A number greater than one implies a positive association between an explanatory and outcome variable (but keep in mind that the coding of the variables drives the interpretation) and a number between zero and one indicates a negative association. If there is no difference in the odds, then the odds ratio is one. For instance, if males and females are equally likely to report past-year marijuana use, then the odds ratio is 1.0. A logistic regression model is one way to compute predicted odds and odds ratios for binary outcome variables. A useful equation that lends itself to these computations is
\[ \text{logit}[P(y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \quad \text{where } \text{logit}[P(y = 1)] = \log_e\!\left[\frac{P(y = 1)}{1 - P(y = 1)}\right] \]
The quantity \(\log_e\left[\frac{P(y=1)}{1 - P(y=1)}\right]\) is also known as the log-odds; the term \(\log_e\) indicates the natural logarithm. Recall the logistic function that was specified earlier. It is another version of this equation. Moreover, the term log-odds should bring to mind a possible solution to computing the odds: take the exponential of this quantity to transform it from log-odds to odds (or, as we shall see, an odds ratio). This can be a confusing exercise, so let's return to a real example of a binary outcome variable. In the data set we used earlier, lifesat, there is a variable gender that is coded as 0=female and 1=male. We'll treat gender as the explanatory variable and life satisfaction as the outcome variable. The results of a Stata cross-tabulation (tabulate lifesat gender, column) for these two variables are shown below.
What are some of the relevant probabilities and odds from this table? First, the probability of high life satisfaction overall is 50/103 = 0.485. The overall odds of high life satisfaction are 50/53 = 0.94 (notice that we do not need to compute probabilities to compute odds). These overall probabilities and odds may be interesting, but they're not very informative. Rather, we may also consider specific measures among males and females. The probabilities of high life satisfaction are 37/85 = 0.435 among females and 13/18 = 0.722 among males. The respective odds are 37/48 = 0.771 among females and 13/5 = 2.6 among males. We can easily see that males are more likely than females to report high life satisfaction.
        life |  gender of respondent
satisfaction |    female       male
-------------+----------------------
         low |        48          5
             |     56.47      27.78
-------------+----------------------
        high |        37         13
             |     43.53      72.22
-------------+----------------------
       Total |        85         18
             |    100.00     100.00
The odds ratio provides a summary measure of the association between gender and life satisfaction. Since we have the two odds, computing the odds ratio is simple: OR(males vs. females) = 2.6/0.771 = 3.37. An interpretation is that the odds of high life satisfaction among males are 3.37 times the odds of high life satisfaction among females. Recall that an odds ratio greater than 1.0 indicates a positive association. Since males comprise the higher-coded category, we claim a positive association between gender and life satisfaction (without, of course, making any qualitative judgments about gender and life satisfaction). It appears that there is little advantage to going beyond this simple exercise to estimate the association between gender and life satisfaction. However, keep in mind that we might (1) wish to determine whether there is a statistically significant association
between gender and life satisfaction (assuming we have a good sample from a population) and (2) add additional explanatory variables to determine whether the association is spurious. This is where binary logistic regression comes in handy: it estimates standard errors and p-values, and it allows additional explanatory variables in the model. First, we'll run the simple logistic model with one explanatory variable. In Stata, this model is requested with the logit command (logit lifesat gender, or). The or option converts the coefficients into odds ratios (as an alternative, the logistic command automatically provides odds ratios). After entering the command, the following table should appear in the output window:
Logistic regression                               Number of obs   =        103
                                                  LR chi2(1)      =       5.02
                                                  Prob > chi2     =     0.0250
Log likelihood = -68.838906                       Pseudo R2       =     0.0352

     lifesat | Odds Ratio   Std. Err.      z    P>|z|
-------------+---------------------------------------
      gender |   3.372972   1.922249     2.13   0.033
The appearance of the results is similar to that obtained using a linear regression model, with a couple of important exceptions. The column labeled Odds Ratio is of particular interest. The value in the gender row should look familiar: it is the same value we obtained for the odds ratio from the cross-tabulation (but keep in mind that this works only because we coded female as zero and male as one). In fact, it is the odds ratio for the association between gender and life satisfaction. However, assuming we wish to infer something about the population from which the sample was drawn, we now may determine whether this odds ratio is significantly different from 1.0 (why 1.0?). The p-value is used in the same manner as in the linear regression model, with, for instance, a value below 0.05 typically recognized as designating a statistically significant result. According to the p-value of 0.033, we may claim that the odds ratio is significantly different from 1.0 at the p < 0.05 level and conclude (tentatively, in the absence of much more information) that males have significantly higher odds of reporting high life satisfaction. Using the information on significance
tests in regression models (see Chapter 2), how should we interpret a p-value of 0.033? The interpretation of the odds ratio is the same as it was earlier: the odds of high life satisfaction among males are expected to be 3.37 times the odds of high life satisfaction among females. Notice that we've added the term "expected to be" rather than something more determinate. This is to remind the reader that we are inferring from a sample to a population. We can never be certain, given a single sample, that the odds among males are 3.37 times as high; only that we expect or infer them to be this much higher given the model results. Suppose we wish to figure out the odds of high life satisfaction among males and females rather than just the summary odds ratio. After re-estimating the model without the or option, we may use the coefficients, which are on the log-odds scale, to compute predicted values. For example, the predicted value for females is [-0.26 + {1.216 × 0}] = -0.26 and the predicted value for males is [-0.26 + {1.216 × 1}] = 0.956. Recall the equation presented earlier in the chapter:
\[ \text{logit}[P(y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \quad \text{where } \text{logit}[P(y = 1)] = \log_e\!\left[\frac{P(y = 1)}{1 - P(y = 1)}\right] \]
Notice that the values we just predicted are known as logit values or, more commonly, the log-odds. Stata predicts these values in a logistic regression model when the or option is not included, as it was in the previous model. To transform the log-odds into odds, we simply take the exponential (the inverse of the natural logarithm) of these values. This is what Stata does when we include the or option at the end of the logit command. The predicted odds for females are thus exp(-0.26) = 0.771 and for males exp(0.956) = 2.60; the exponential of the gender coefficient itself, exp(1.216) = 3.37, returns the odds ratio. Compare these to the odds we computed using the cross-tabulation of gender and life satisfaction. There are some, perhaps many, researchers who are uncomfortable with odds and odds ratios and prefer probabilities.
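Before turning to probabilities, the following display commands make the conversion from log-odds to odds explicit (the coefficients are those reported above, so the results are subject to rounding):

* Predicted odds of high life satisfaction for females: exp(-0.26) = 0.771
display exp(-0.26)
* Predicted odds for males: exp(-0.26 + 1.216) = exp(0.956) = 2.60
display exp(-0.26 + 1.216)
* Odds ratio, males vs. females: exp(1.216) = 3.37
display exp(1.216)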
The coefficients from a logistic regression model may be transformed into probabilities by utilizing the logistic function:
\[ P(y = 1) = \frac{1}{1 + \exp[-(\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)]} \]
We simply replace the part of the denominator in parentheses with the estimated logistic regression coefficients. However, we must choose specific groups to represent. With the previous model, there are only two groups, females (coded 0) and males (coded 1), and we may predict their probabilities by including the coefficients in the equation:
\[ P(\text{females}) = \frac{1}{1 + \exp[-(-0.26 + \{1.216 \times 0\})]} = 0.435 \]
\[ P(\text{males}) = \frac{1}{1 + \exp[-(-0.26 + \{1.216 \times 1\})]} = 0.722 \]
Notice that these predicted probabilities are identical to the probabilities we computed with the cross-tabulation of gender by life satisfaction. In Stata (as well as most other statistical software packages) you may also save the predicted probabilities and then assess their values for particular groups that were included in the model. However, Stata provides even better tools for estimating probabilities. Its adjust postcommand, for example, may be used to predict probabilities for particular values of the explanatory variables. A user-written program, prvalue, is also useful (type findit prvalue). Here is an example of the adjust command that uses the pr option to request probabilities:
adjust gender = 0, pr
adjust gender = 1, pr
Stata returns the probabilities for females and males: 0.435 and 0.722, which, as expected, are the same as computed previously. There are also options for residuals and influence statistics that we will ignore in this presentation but are worth exploring and
understanding if you wish to realize the capabilities and limitations of specific logistic regression models.
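The predicted probabilities themselves may also be computed by hand from the coefficients, using the logistic function (again, the coefficients are taken from the output above):

* Predicted probability of high life satisfaction for females (gender = 0)
display 1/(1 + exp(-(-0.26 + 1.216*0)))
* Predicted probability of high life satisfaction for males (gender = 1)
display 1/(1 + exp(-(-0.26 + 1.216*1)))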
Logistic regression                               Number of obs   =        103
                                                  LR chi2(3)      =       8.88
                                                  Prob > chi2     =     0.0309
Log likelihood = -66.908672                       Pseudo R2       =     0.0623

     lifesat | Odds Ratio   Std. Err.      z    P>|z|
-------------+---------------------------------------
      gender |   3.392536   1.963677     2.11   0.035
         age |   .9097395   .0461263    -1.87   0.062
      intell |    .947597   .0438697    -1.16   0.245
The output above adds age and intelligence scores (intell) to the model. The odds ratio for age, for example, is 0.91, which indicates that each one-year increase in age multiplies the odds of high life satisfaction by 0.91. This type of interpretation is not very satisfactory to most analysts. It is difficult, without substantial experience, to interpret this as a negative association between age and life satisfaction. However, there is a nice property of regression models that transform outcome variables using natural logarithms that is useful in this situation. That is, we may use the following percentage change formula to interpret coefficients from logistic regression models:
\[ \{\exp(\beta) - 1\} \times 100 \]
This formula uses the untransformed coefficient (found by running the model without the or option, as shown below) to determine the percent difference or change in the odds associated with a one unit difference or change in the explanatory variable.
Logistic regression                               Number of obs   =        103
                                                  LR chi2(3)      =       8.88
                                                  Prob > chi2     =     0.0309
Log likelihood = -66.908672                       Pseudo R2       =     0.0623

     lifesat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      gender |   1.221578   .5788226     2.11   0.035     .0871062    2.356049
         age |   -.094597   .0507028    -1.87   0.062    -.1939726    .0047786
      intell |   -.053826   .0462957    -1.16   0.245    -.1445639    .0369119
       _cons |   8.246674   5.390533     1.53   0.126    -2.318578    18.81192
Using the age coefficient and this formula, we find {exp(-0.095) - 1} × 100 = -9.06%. Thus, the interpretation of the coefficient is "Statistically adjusting for the effects of gender and intelligence scores, each one-year increase in age is associated with a 9.06% decrease in the odds of high life satisfaction."
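This calculation is easily reproduced with the display command; the gender coefficient may be interpreted in the same way:

* Percent change in the odds per one-year increase in age
display (exp(-0.095) - 1)*100
* Percent change in the odds for males relative to females
display (exp(1.2216) - 1)*100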
Multiple logistic regression models may also be used to estimate predicted odds or probabilities. This is complicated by the presence of continuous variables, however. As with multiple linear regression models (see, e.g., Chapter 6), we should choose particular values of the explanatory variables to compare. Some researchers prefer to use minimum and maximum values, but any plausible values, as long as they are represented by a sufficient number of cases in the data set, work reasonably well. For example, we may use the logistic function to compare the probabilities for males and females at the modal categories of age (41) and intelligence score (92):
\[ P(\text{females}) = \frac{1}{1 + \exp[-(8.25 + \{1.22 \times 0\} + \{-0.095 \times 41\} + \{-0.054 \times 92\})]} = 0.351 \]
\[ P(\text{males}) = \frac{1}{1 + \exp[-(8.25 + \{1.22 \times 1\} + \{-0.095 \times 41\} + \{-0.054 \times 92\})]} = 0.647 \]
This may be accomplished in Stata after running the model above, using the following postcommands: adjust gender=0 age=41 intell=92, pr and adjust gender=1 age=41 intell=92, pr. The results are slightly different due to rounding error. Another way of thinking about these predicted probabilities is to infer that approximately 35% of females in the modal categories of age and intelligence are expected to report high life satisfaction, whereas about 65% of males in these categories are expected to report high life satisfaction.

Finally, there are tests of fit for logistic regression models that are analogous to fit statistics in linear regression models. Nonetheless, these are not computed in the same way, nor should they be interpreted in the same way as F-values, SE values, or R2 values. They may, however, be used to compare models. For example, Stata provides a log likelihood statistic, which may be used to calculate the Deviance statistic (-2 × log likelihood). The Deviance statistic may be used to compare nested models (but not non-nested models). Values
closer to zero tend to indicate better fitting models, although additional information is required before one may reach this conclusion. As an example, the original model with only gender as an explanatory variable has a Deviance of 137.678, whereas the model with gender, age, and intelligence scores has a Deviance of 133.818. To compare these models, subtract one from the other. The result is distributed χ² with degrees of freedom equal to the difference in the number of explanatory variables across the two models. Therefore, one takes the difference between the two Deviances, in this case 137.678 - 133.818 = 3.860, and compares it to a χ² value with two degrees of freedom (since we added two explanatory variables in the second model). The p-value associated with this χ² value is 0.145 (in Stata type display 1 - chi2(2,3.86)), which suggests that the difference between the two models is not statistically significant. In other words, we do not improve the model's predictive capabilities by including age and intelligence score. In the spirit of parsimony, we should prefer the simpler model.

A simple way in Stata to obtain a comparable test is with its test postcommand. Using this postcommand after the model with all the explanatory variables will implicitly compare the two models (it computes a Wald test, which is asymptotically equivalent to the likelihood ratio comparison, so its value of 3.62 differs slightly from the 3.86 computed above):
test age intell

 ( 1)  [lifesat]age = 0
 ( 2)  [lifesat]intell = 0

           chi2(  2) =    3.62
         Prob > chi2 =    0.1638
Here we have asked Stata to jointly test whether the age and intell coefficients are equal to zero. This is the same as asking whether the second model fits the data better than the first model. We used a nested F-test to compare linear regression models in a similar fashion (see Chapter 5). Another method for comparing logistic regression models is with what are known as information criteria. The two most widely used
are called Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both use the Deviance value from the estimated models. The standard formulas for these statistics (there are others) are
\[ AIC = \text{Deviance} + (2 \times k) \]
\[ BIC = \text{Deviance} + (\log_e(n) \times k) \]
where k is the number of parameters estimated in the model (including the intercept) and n is the sample size.
Both statistics indicate that the logistic regression model with gender only (the nested model) provides a better fit to the data than the model with gender, age, and intelligence score (the full model). More information about these and other fit statistics is provided in Walter Zucchini (2000), An Introduction to Model Selection, Journal of Mathematical Psychology 44: 41-61; or in books on generalized linear models.
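As a sketch of how these comparisons work, the statistics discussed above may be reproduced from the two log likelihoods reported in the Stata output with a few display commands. The AIC and BIC computed here use the deviance-based formulas shown above, with k counting the estimated parameters (including the intercept); other variants will give somewhat different values:

* Deviances of the two models (-2 x log likelihood)
display -2*(-68.838906)
display -2*(-66.908672)
* Difference in deviances and its p-value on 2 degrees of freedom
display 137.678 - 133.818
display chi2tail(2, 3.86)
* AIC: Deviance + 2k, with k = 2 and k = 4 parameters
display 137.678 + 2*2
display 133.818 + 2*4
* BIC: Deviance + ln(n)k, with n = 103
display 137.678 + ln(103)*2
display 133.818 + ln(103)*4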
Binary logistic regression provides a powerful tool for evaluating outcome variables that take on only two values. These types of variables are common in the scientific disciplines that utilize statistical modeling; they are of broad interest and provide important information about many relevant phenomena. We have only presented a brief introduction to this valuable regression technique. Moreover, we have not addressed a closely related model: probit regression. Probit regression provides similar results in general (although the coefficients are scaled differently), but is based on the assumption that underlying the binary variable is a continuous, normally distributed variable. Substantially more information on logistic and probit regression is provided in the numerous books that address categorical outcome variables. An especially lucid guide to logistic models is provided in David G. Kleinbaum and Mitchel Klein (2002), Logistic Regression: A Self-Learning Text, Second Edition, New York: Springer.
14 Conclusions
The linear regression model offers a powerful tool for analyzing the association between one or more explanatory variables and a single outcome variable. Some novice researchers wish to move quickly beyond this model and learn to use more sophisticated models because they get discouraged about its limitations and believe that other regression models are more appropriate for their analysis needs. Although there are many situations where alternative regression models are needed (see, e.g., Chapter 13), the linear model should not be overlooked or underutilized. Even when one or more of the assumptions described in the previous chapters is violated, linear regression will frequently provide accurate results. Although saying this model is robust is ill-advised, because robustness has a very specific statistical meaning, it is not disingenuous to claim that the model works well under a variety of circumstances. There are myriad practical uses in the social, behavioral, and health sciences for linear regression analysis.

The key to using linear regression analysis appropriately is to understand its strengths and weaknesses. It works quite nicely with dummy and continuous explanatory variables. As long as the outcome variable can be transformed to something close to normality, the coefficients and standard errors are within a reasonable distance of the true population parameters (assuming, of course, that we have a good sample). And, with adjustments to the standard errors that minimize the negative effects of heteroscedasticity and influential observations, the results can move even closer to the precision most of us desire. In particular, Huber-White sandwich estimators are especially useful when heteroscedasticity or influential observations show up or are suspected (see Chapters 11 and 12).

However, linear regression models should not be used when the outcome variable is binary or takes on only a few values. How many is "a few" has been a point of contention in the research literature.
Some claim fewer than nine categories, others fewer than seven. The key, though, is whether the variable is distributed normally, not precisely how many categories it has. Nevertheless, given the availability of the many regression models designed for categorical outcome variables, there is little need to rely on linear regression analysis in this situation. The binary logistic regression model, for example, provides a solid alternative when the outcome variable is binary. Other models are described in the many books on categorical data analysis and generalized linear models (see, e.g., Hoffmann, op. cit.).

There are many other topics that we have not had time to address in this presentation. For example, the issue of non-independence was mentioned only briefly when discussing the assumptions of linear regression analysis (Chapter 2) and autocorrelation (Chapter 11). Non-independence occurs when we have longitudinal data or spatial data, but also in other contexts. Recall that standard errors are biased in the presence of non-independent observations.

The general issue of non-independence of observations is particularly germane to the more specific topic of survey sampling. It is rare to find large surveys that do not use some form of sampling that leads to non-independence of observations. These sampling schemes are used for efficiency and cost-effectiveness. It is prohibitively expensive to gather a simple random sample of people in the United States. Hence, most national surveys (and many state surveys) use some form of stratified or cluster sampling. This means that units, such as metropolitan areas, are sampled first, followed by smaller units such as census tracts, block groups, and households. There are usually several stages to the sampling. Telephone surveys typically use groups of exchanges that also result in a certain degree of clustering. When using most national data sets, it is therefore important to consider the effects of stratification and clustering on regression estimates. As mentioned earlier, the main effect is on the standard errors, so they should be adjusted. Many statistical software packages now include techniques for survey data that adjust standard errors for non-independence. (For example, Stata includes a set of
statistical techniques for survey data that include non-independent observations. Type help survey to read about the many options.)

Another issue related to sampling involves the use of weights. Sampling weights are employed to designate the number of people (or other units) in the population each sample member represents. Recall that we want our sample to represent some well-understood population. Therefore, each observation in our sample should represent some group from the population. For example, in a nationally representative sample a 40-year-old Caucasian man may represent several thousand 40-year-old Caucasian men from some part of the U.S. Survey data often include sample weights that may be used in analyses (e.g., our 40-year-old man has a weight of 12,430 since he represents 12,430 40-year-old men). Yet, if these weights are whole numbers designed to indicate the actual number of people each observation represents, the presumed sample size can be inflated tremendously. What effect do larger samples have on standard errors? All else being equal, they make them smaller. So, experience shows that almost any regression coefficient is statistically significant, whether or not it is substantively important, if the sample is weighted to represent a large population. This may be seen as an advantage, but it can also be misleading and can produce dubious results. One solution is to compute normed sampling weights for the analysis. These are computed as
\[ w_i^{*} = \frac{w_i}{\bar{w}} \]
In words, take each observed weight and divide it by the overall mean of the weights. Let's say our national sample has a mean weight of 15,000 (on average, each observation represents 15,000 people). Our 40-year-old man will then have a normed sampling weight of 12,430/15,000 = 0.829. If another observation represents more people, say 17,500, then its normed weight is 17,500/15,000 = 1.167. Taking the normed weights not only makes the weights smaller and more manageable, but also preserves their utility since some observations still have more
influence than others. It also results in more reasonable standard errors. A more efficient approach to this general issue is to use Stata's survey or weight commands to specify the weights and how they should be used (type help weights). Substantially more information on survey data and sampling weights may be found in Paul S. Levy and Stanley Lemeshow (1999), Sampling of Populations: Methods and Applications, New York: Wiley.

There are many other issues we could discuss, but space is limited. The information provided in the previous chapters should provide the groundwork for further coursework on and self-study of regression modeling. Thus, perhaps it is best to ask interested readers to pursue additional topics on their own through the many excellent books and courses on linear regression and related statistical techniques. Topics such as sample selection bias, nonlinear models, bootstrapping methods, multilevel linear models, generalized linear models, simultaneous equations, method of moments (MM) estimators, generalized additive models (GAMs), marginal models, event history analysis, and log-linear models (including transition models) have all been used to complement, supplement, or replace linear regression models in one way or another.
Summary
To summarize what we've learned, here is a list of steps to consider as you pursue linear regression analyses on your own. It is not necessarily a comprehensive list since each analysis and model has its own idiosyncrasies that make it unique. Nonetheless, it prescribes the most common steps that should normally be taken as you estimate linear regression models.
the direction and degree of skewness). If the outcome variable is binary, use logistic regression. Are there outliers or other influential observations that you can see? Consider their source. Do you need to compute dummy variables and include them in your model? Check to make sure you've taken care of missing values; they can throw everything off if they are not adjusted for in the model!

Assess the bivariate associations in the data. Use scatterplots for continuous variables. Plot linear and nonlinear fit lines to examine the bivariate associations. Compute bivariate correlations for continuous variables. Look for outliers and potential collinearity problems.

Estimate the regression model. Avoid automated variable selection procedures unless your goal is simply to find the best prediction model.

Assess the results. Are there unusual coefficients (overly large or small; negative when they should be positive)? Save the collinearity diagnostics; save the influential observation diagnostics (studentized residuals, leverage values, Cook's D, and DFFITS); run a scatterplot of deleted studentized residuals by standardized predicted values; estimate partial residual plots; ask for a normal probability plot of the residuals; ask for the Durbin-Watson statistic, if appropriate; compute Moran's I, if using spatial data and it is available. Assess the goodness-of-fit statistics (adjusted R², the F-value and its accompanying p-value, and the SE). Run nested models if appropriate and use nested F-tests to compare them. These are particularly useful for assessing potential specification errors.

Check the diagnostics. Are there any collinearity problems? If yes, you might need to combine variables, collect more data, or, as a last resort, drop variables. If the collinearity problem involves interaction or quadratic terms, use centered values such as z-scores to recompute them. Are the residuals normally distributed? If not, consider a transformation. Do the partial residual plots provide evidence of nonlinear associations? Is there evidence of heteroscedasticity? (If the plot is inconclusive visually, try White's test or Glejser's test.) If yes, consider transforming a variable, using weighted least squares regression, or using Huber-White sandwich estimators. Is there evidence of autocorrelation? Consider the source and try to correct for it. If you have spatial data, a spatial regression model may be needed. Use Prais-Winsten regression or time-series techniques for data collected over time, if appropriate. Are there influential observations? If yes, consider their source. Are there coding errors? Will a transformation help? If not, use a robust regression technique to adjust for influential observations.

Interpret and present the results. Interpret the unstandardized slopes and p-values. What do the goodness-of-fit statistics tell you about the model? Compare the results to the guiding hypotheses. Given the decision rules, are the hypotheses or the conceptual model supported by the analysis? Consider presenting predicted values, especially from models that include interaction terms. Consider graphical presentations of coefficients, nonlinear associations, and interactions. These can provide intuitive information that is often lost when presenting only numbers.
Appendix A Data Management
least a portion of the data set (some are huge) to determine if there are any unusual values or patterns to the data or variables. In addition, evaluate carefully those variables that you plan to analyze. Are dummy variables or binary outcome variables coded as {0, 1}? Are there categorical variables that need to be transformed into sets of dummy variables? How are continuous variables coded? (e.g., Is income measured in dollars or thousands of dollars? What is the base number for rates?) Are there unusual values that don't appear tenable (see Chapter 12)? Remember to always conduct exploratory analyses of your variables to check for outliers, non-normal distributions, and unusual codes. At the very least, ask the software for frequencies and the distributional properties of the variables (e.g., means, medians, standard deviations, skewness) you plan to analyze and evaluate each variable carefully.
State          robbrate    larcrate    assrate    fips
Alabama          185.75     2844.27     403.69       1
Alaska           155.13     3624.34     526.32       2
Arizona          173.76     4925.63     495.71       4
Arkansas         125.68     2815.42     379.83       5
California       331.16     2856.87     590.26       6
Colorado              .     3634.48     298.75       8
Connecticut      163.21     2668.73     214.41       9
Delaware         198.74     3114.23     442.54      10
Florida          299.91     4322.41     715.11      12
Georgia          205.21     3678.27     407.14      13
Hawaii           130.83     5046.93          .      15
Missing Data
The portion of the usdata.dta data set that appears in the table has been revised by placing dots in two of the cells. These dots are a commonly used representation of missing values. But oftentimes missing values are coded with untenable values such as 99 for age or a 9 for race/ethnicity when the other choices are 1= Caucasian, 2=African-American, 3=Latino, and 4= other race/ethnicity. There
are many other possibilities; check the codebook accompanying the data for information about how missing values are treated. Regardless of the coding scheme for missing values, before using a variable it is important to decide how you are going to handle missing values. As you do this, though, first consider their source. Why do missing values appear in your data set? The reasons will likely vary depending on the nature of the variable. An important issue to consider is whether the data entry person forgot to place a value in a specific cell of the data set. Perhaps, in this example, we need to go back to the data source and find the robbery number for Colorado or the assault number for Hawaii. Yet, these data may not exist.

A common source of missing values in survey data involves skip patterns. Skip patterns in surveys are a convenient way to make responding more efficient. For instance, suppose we ask a sample of adolescents to fill out a survey about their use of cigarettes. Our questions ask details such as how often they smoke, how many cigarettes they smoke in a day or a week, from where they obtain cigarettes, and so forth. A majority of adolescents don't smoke at all, so it is common to find initial questions such as

Q.1. Have you ever smoked cigarettes? (circle one answer only)
     (a) No [Go to question Q.16]
     (b) Yes [Go to question Q.2]
Q.2. When was the last time you smoked a cigarette? (circle one answer only)
     (a) Today
     (b) In the last week
     (c) More than a week ago but less than a month ago
     (d) A month ago or longer

In contemporary surveys these types of questions are often programmed into a computer so that those responding no to the first question are automatically asked question 16 next. Suppose we wish to analyze the second question either as an outcome or an explanatory variable (perhaps by creating dummy variables). Would
we only want to analyze data from those who actually answered the question, or would we want to also include those who never smoked in the statistical analysis? This is an essential issue to address since those who answered no to the first question will be missing on the second question. If we decide to include those who never smoked in the analysis, then it is a simple matter to create a new variable that includes a code for never smoked along with others for those who smoked in the last day, week, etc.

It is also not uncommon to find missing data due to refusals or don't know responses (Question: When was the last time you smoked a cigarette? Response: I don't know). For example, many people don't like to answer questions about personal or family income. A perusal of various survey data shows that missing values on income are frequent, with usually more than 10% of respondents refusing to answer. Yet many researchers are interested in the association between income and a host of potential explanatory or outcome variables. Is it worth it to lose 10-20% of your sample so that you can include income in the analysis? Are people who refuse to answer income questions (or other types of questions) different from those who do answer? If they are systematically different, then the sample used in the analysis may no longer be a representative sample from the population. Making decisions about missing values is a crucial part of most statistical exercises.

There are several widely used solutions to missing data problems. First, some researchers simply ignore them by omitting them from the analysis. They argue that a few missing cases will not bias the results of their model too much. However, be very careful not to leave missing cases with codes (such as 9 or 999) that will be included in the analysis. Imagine, for instance, if you left a missing code of 999 in an analysis of years of education! Now that would be an influential observation. The most commonly used technique for throwing out missing cases is known as listwise deletion (this is the Stata default). Suppose we analyze the association between robbery and assault in the revised
usdata.dta data set shown earlier with a correlation or a linear regression model. In this situation, listwise deletion removes both Colorado and Hawaii from the analysis. Now imagine if we estimate a multiple linear regression model with, say, 5 explanatory variables. Suppose further that 10 different observations were each missing only one value, but from 5 different explanatory variables. Looking over the frequencies, we'd find 48 observations for each explanatory variable, but once we place all of them in the regression model and use listwise deletion, our sample size goes from 50 to 40; we lose one-fifth of the sample. This can be costly.

A problem that is not infrequent involves the use of multiple nested regression models with different patterns of missing data. Say we estimate a model with robberies per 100,000 predicted by per capita income. If there are no missing data on either variable, the sample size is 50. We then add migrations per 100,000, which is missing two observations. A third model adds population density and the unemployment rate, each of which has four missing observations. We therefore have up to 10 missing observations. It is common to claim that the first two models are nested within the third, thus allowing statistical comparisons (see Chapter 5). However, they are not truly nested because they have different numbers of observations (ns): 50 in the first model, 48 in the second model, and 40 to 44 in the third model. Not only do nested F-tests break down when this occurs, but the regression coefficients should not be compared across models. It is therefore imperative that missing data be handled before running any of the regression models so that each has the same number of observations.

There are a variety of techniques for handling missing values in addition to listwise deletion. These include mean substitution, regression substitution, raking techniques, hot deck techniques, creating a code for missing values, and multiple imputation. Mean substitution replaces missing values of a variable with the mean of that variable. For example, assume that the mean number of assaults per 100,000 is
342.4. We could then replace the missing value in Stata using the following command:

recode assrate (. = 342.4)

A problem with mean substitution is that it can lead to biased slope coefficients and standard errors, especially as a higher proportion of observations is missing. Unfortunately, there is no rule-of-thumb about the proportion of missing values that leads to biased regression results when using mean substitution.

Regression substitution replaces missing values with values that are predicted from a regression equation. For instance, suppose that a small number of values are missing from the personal income variable that appears in the gss96.dta data set. Yet, there are complete data for education, occupational prestige, and age. If a regression equation does a good job of predicting the valid income values with these three variables, then the missing values may be predicted by using the coefficients from the following linear regression model:

\[ \widehat{\text{income}} = \hat{\alpha} + \hat{\beta}_1(\text{education}) + \hat{\beta}_2(\text{occupational prestige}) + \hat{\beta}_3(\text{age}) \]
An alternative is to use Stata's impute command: impute income educate polview, gen(newincome). This creates a new variable, named newincome, that has imputed values for personal income based on the explanatory variables listed in the command. Unfortunately, if the variables are not good predictors of the variable with the missing value problem, then biased regression coefficients and standard errors will result.
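A related sketch: rather than typing the mean by hand, the mean may be captured from summarize and substituted in one step (assuming the usdata.dta file is in memory; this does not remove the drawbacks of mean substitution noted above):

* Mean substitution without typing the mean by hand
summarize assrate
replace assrate = r(mean) if missing(assrate)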
Raking is an iterative procedure that is based on using the values of adjacent observations to come up with an estimated value for the observation that is missing. Hot deck imputation involves partitioning the data into groups that share similar characteristics and then randomly assigning a value based on the values of this group. Both require specialized software to implement (type help hotdeck).

Creating a code for a missing value is a common procedure. Suppose that we have a problem with personal income: 10% of the observations are missing. We could create a new dummy variable that is coded as 0=non-missing on income and 1=missing on income, and include this variable in the regression model, along with the other income dummy variables. This is only useful if the variable is transformed into a set of dummy variables, though. So income would have to be categorized into a number of discrete groups (e.g., 0 - $10,000; $10,001 - $20,000; etc.) that would then be the basis for a set of dummy variables.

It is rarely, if ever, a good idea to use these techniques to replace missing values for outcome variables. Some analysts argue that raking and hot-deck techniques are useful if there are only a few missing values in the outcome variable. However, if this is the case, multiple imputation is a better approach. Moreover, multiple imputation has become a widely used technique throughout the regression world when there are missing values for explanatory variables. It involves three steps:

1. Impute, or fill in, the missing values of the data set, not once, but q times (usually 3-5 times). The imputed values are drawn from a particular probability distribution, which depends on the nature of the variable (e.g., continuous, binary). This step results in q complete data sets.

2. Analyze each of the q data sets. This results in q analyses.

3. Combine the results of the q analyses into a final result. There are rules that are followed to combine them, usually by taking some average of the coefficients and standard errors if a regression model is used.
Stata 11 (released in 2009) includes a set of procedures for multiple imputation. These fall under the general command mi (type help mi). An excellent overview of multiple imputation is provided in Joe L. Schafer (1999), Multiple Imputation: A Primer. Statistical Methods in Medical Research 8: 3-15. For a general review of procedures for missing data, see Paul D. Allison (2001), Missing Data, Thousand Oaks, CA: Sage.
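To give a flavor of step 3, here is a sketch that combines q = 5 hypothetical coefficient estimates and their squared standard errors using the usual combining rules (the pooled coefficient is the average of the estimates; the pooled variance adds the average within-imputation variance to the between-imputation variance). The numbers are invented for the illustration; Stata's mi commands handle all of this automatically.

* Combine five hypothetical coefficient estimates and variances
local q = 5
local bbar = (0.52 + 0.47 + 0.55 + 0.50 + 0.49)/`q'
local W = (0.010 + 0.012 + 0.011 + 0.010 + 0.013)/`q'
local B = ((0.52-`bbar')^2 + (0.47-`bbar')^2 + (0.55-`bbar')^2 + (0.50-`bbar')^2 + (0.49-`bbar')^2)/(`q' - 1)
display "pooled coefficient = " `bbar'
display "pooled standard error = " sqrt(`W' + (1 + 1/`q')*`B')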
important that they are coded in the same direction and in the same way. For instance, if we are measuring symptoms of depression, all the items should be coded so that increasing numbers indicate either more or fewer symptoms. You may need to reverse code some items so that all are in the same direction. Furthermore, if one variable is coded 1 to 4 and another is coded 1 to 10, then adding them up or taking the mean of the items will be influenced more by the variable with more response categories. As an alternative, some researchers recommend first standardizing all the variables (taking their z-scores) and then adding them up or taking their mean. The reasoning is that even if the number of response categories differs, standardizing will place the items on the same scale. There is a risk to simply adding up the variables, however. Suppose, for instance, that we use the following Stata command to compute a new variable, depress, from five variables in a depression inventory:
generate depress = dep1 + dep2 + dep3 + dep4 + dep5
An alternative is to use the egen and rsum (row sum) commands to add up the items:
egen depress = rsum(dep1 dep2 dep3 dep4 dep5)
But say there are different patterns of missing values for each of the variables. For some observations, the new variable will depend not just on their responses to the questions, but also on whether they are missing on one or more items. Respondents with response patterns of {2, 2, 2, 2, 2} will have the same depression score as those with response patterns of {4, 6, ., ., .}, even though their levels of depressive symptoms differ substantially. Some researchers prefer to take the mean of the items. In Stata this appears on the command line as
egen depress = rowmean(dep1 dep2 dep3 dep4 dep5)
Hence, respondents with the patterns listed earlier will have depression scores of 2 and (4+6)/2 = 5, which more accurately reflect their levels of depressive symptoms.

Coding and recoding variables is a crucial part of any data management exercise, whether or not the variables are needed for statistical analyses. We've already learned about coding dummy variables (see Chapter 6). It is important to remember to create dummy variables that are mutually exclusive if they measure aspects of a particular variable (e.g., marital status). It is a good idea to run cross-tabulations on dummy variables along with the variable from which they are created to ensure that they are coded correctly. Making sure continuous variables are coded properly is also essential. It is best to use coding strategies that are easily understandable and widely accepted, if possible. Measuring income in dollars, education in years, or age in years is commonplace. Yet, for various reasons, sometimes it is better to code income in $1,000s. Some data sets have education grouped into categories (e.g., 0-11 years, 12 years, 13-15 years, 16 years, etc.) rather than years. Keeping track of coding strategies is therefore of utmost importance. Constructing your own codebook, or keeping track of codes and recodes by keeping copies of Stata .do or log files, is useful and will allow you to come back later and efficiently recall the steps you took before analyzing the data.

There are numerous other coding issues that we will not address. Experience is the best teacher. Moreover, keeping careful records and back-up files of data sets that include before-and-after recodes of key variables will be helpful. Remember that once you save an updated Stata file using the same name, the old file is gone (unless you've made a back-up copy). So make sure you are satisfied with the variables before you overwrite an old data file.

Once you have completed taking care of missing values, creating new variables, recoding old variables, and transforming non-normal variables, it is always a good idea to ask for frequencies (for categorical or dummy variables) and descriptive statistics (for
continuous variables) for all of the variables that you plan to use in the analysis. Stata has a host of commands available for checking the distributions of variables and making sure that the analyst's data manipulation worked as expected. As an example, let's say we use the delinq.dta data set to estimate a linear regression model. First, though, we ask a colleague to transform the variable stress using the natural logarithm because previous research indicates that variables measuring stressful life events are positively skewed. Our colleague uses the following Stata command to transform the variable:
generate newstress = ln(stress)
After asking for descriptive statistics on the newstress variable (summarize newstress), we find that it has more than 400 missing cases. Why did this occur? If we didn't pay attention to the original variable or did not look carefully at the data file, we might simply think that more than 400 adolescents did not respond to this question during survey administration. Perhaps we'd chalk it up to the inexactness of data collection and ignore these cases. Or we might use a substitution or imputation procedure to replace the missing values. However, if we looked at the original stress variable, which we should always do, we'd find no missing cases. Something must have happened during the transformation. Recall that taking the natural logarithm of zero is not possible; the result is undefined. But notice that the original stress variable has legitimate zero values (some adolescents report no events), so the command needs to be revised to take into account the zeros in the variable. A simple solution is to revise the command to read
generate newstress = ln(stress + 1)
Now, those with zero scores on the original variable will have zero (not missing) scores on the new variable. Yet we might not have found this out if we didn't check the descriptive statistics for both the old and the new stress variables. Knowing your data and variables, and always checking distributions and frequencies following variable
creation and recoding will go a long way toward preventing modeling headaches later.
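As a final note on the ln(0) issue just described, the behavior is easy to demonstrate directly:

* The natural log of zero is undefined, so Stata returns a missing value
display ln(0)
* Adding one before taking logs keeps zero counts in the data
display ln(0 + 1)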
Appendix B Formulas
1. Mean of a variable

\[ E[X] = \bar{x} = \frac{\sum x_i}{n} \]

2. Sum of squares of a variable

\[ SS[X] = \sum (x_i - \bar{x})^2 \]

3. Variance of a variable

\[ Var[X] = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

4. Standard deviation of a variable

\[ SD[X] = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

5.

6. Standard error of the mean

\[ se(\bar{x}) = \frac{s}{\sqrt{n}} \]
7. Computing z-scores

\[ z\text{-score} = \frac{x_i - \bar{x}}{s} \]

8. Covariance of two variables

\[ Cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

9. Correlation of two variables

\[ Corr(x, y) = \frac{Cov(x, y)}{s_x s_y} = \frac{\sum (z_x)(z_y)}{n - 1} \]

10. t-test for the difference between two means (pooled variance)

\[ t = \frac{\bar{x} - \bar{y}}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \quad \text{where } s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

11. t-test for the difference between two means (separate variances)

\[ t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{Var[x]}{n_1} + \dfrac{Var[y]}{n_2}}} \]

12. General formula for a Confidence Interval (CI)

Point estimate ± [(confidence level) × (standard error)]
(Note: The point estimate can be a mean or a regression coefficient)
13. Sum of Squared Errors (SSE) for the linear regression model

\[ SSE = \sum (y_i - \hat{y}_i)^2 \]

14. Regression Sum of Squares (RSS) for the linear regression model

\[ RSS = \sum (\hat{y}_i - \bar{y})^2 \]

15. Residual from the linear regression model

\[ resid_i = (y_i - \hat{y}_i) \]

16. Slope coefficient for the simple linear regression model

\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

17. Intercept for the simple linear regression model

\[ \hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x} \]

18. Standard error of the slope coefficient

\[ se(\hat{\beta}_1) = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2 / (n - 2)}{\sum (x_i - \bar{x})^2}} = \sqrt{\frac{SSE / (n - 2)}{SS[x]}} \]

19. t-value for the slope coefficient

\[ t\text{-value} = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \]

20. Standardized (beta) coefficient

\[ \hat{\beta}^{*} = \hat{\beta}_1 \left( \frac{s_x}{s_y} \right) \]

21. Matrix form of the linear regression coefficients

\[ \hat{\beta} = (X'X)^{-1} X'Y \]
22. Variance-covariance matrix of the coefficients

\[ V(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}, \quad \text{where } \hat{\sigma}^2 = \frac{1}{n - k} \sum (y_i - \hat{y}_i)^2 \]

23. Standard error of a coefficient in the multiple linear regression model

\[ se(\hat{\beta}_i) = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{\sum (x_i - \bar{x})^2 (1 - R_i^2)(n - k - 1)}} \]

(\(R_i^2\) is the R-squared from regressing \(x_i\) on the other explanatory variables.)

24. R-squared (coefficient of determination)

\[ R^2 = \frac{RSS}{TSS} = 1 - \frac{SSE}{TSS} \]

25. Adjusted R-squared

\[ \bar{R}^2 = R^2 - \left( \frac{k}{n - k - 1} \right)(1 - R^2) \]

26. Mean squared error and the standard error of the estimate

\[ MSE = \frac{SSE}{n - k - 1}, \quad \sqrt{MSE} = S_E \]
27. Prediction Intervals for the linear regression model

\[ PI = \hat{y} \pm (t_{n-2})(S_E)\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1)Var(x)}} \]

28. F-test for a single added explanatory variable (nested models)

\[ F = \frac{SSE_{reduced} - SSE_{full}}{SSE_{full} / (n - k - 1)}, \quad df = 1,\ n - k - 1 \text{ (full model)} \]

29. F-test for q added explanatory variables (nested models)

\[ F = \frac{(SSE_{reduced} - SSE_{full}) / q}{SSE_{full} / (n - k - 1)}, \quad df = q,\ n - k - 1 \text{ (full model)} \]

30. Mallows' Cp

\[ C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p) \]

(p is the number of parameters, including the intercept, in the subset model.)

31. Variance inflation factor

\[ VIF = \frac{1}{1 - R_i^2} \]
32. Reliability of x

\[ r_{xx} = \frac{s^2_{x^{*}}}{s^2_x} \]

(x* denotes the true score underlying the observed variable x; see Chapter 8.)

33.
34. Durbin-Watson statistic

\[ d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \]
35. Moran's I

\[ I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n - 1) \sum_i \sum_j w_{ij}} \]

36. Hat matrix

\[ H = X(X'X)^{-1}X' = (h_{ij}) \]

37. Studentized residual

\[ t_i = \frac{e_i}{S_{E(-i)}\sqrt{1 - h_i}} \]

38. Cook's D

\[ D_i = \frac{t_i^2}{k + 1} \cdot \frac{h_i}{1 - h_i} \]

39. DFFITS

\[ DFFITS_i = t_i \sqrt{\frac{h_i}{1 - h_i}} \]
40. Logistic function

\[ P(Y = 1) = \frac{1}{1 + \exp[-(\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)]} \]

41. Odds of an event

\[ \frac{P}{1 - P} \]

42. Odds ratio

\[ \frac{P_1 / (1 - P_1)}{P_2 / (1 - P_2)} \]

43.

44.

45. Normed sampling weight

\[ w_i^{*} = \frac{w_i}{\bar{w}} \]