SAGE UNIVERSITY PAPERS
Series: Quantitative Applications in the Social Sciences
Series Editor: Michael S. Lewis-Beck, University of Iowa

Editorial Consultants
Richard A. Berk, Sociology, University of California, Los Angeles
William D. Berry, Political Science, Florida State University
Kenneth A. Bollen, Sociology, University of North Carolina, Chapel Hill
Linda B. Bourque, Public Health, University of California, Los Angeles
Jacques A. Hagenaars, Social Sciences, Tilburg University
Sally Jackson, Communications, University of Arizona
Richard M. Jaeger, Education, University of North Carolina, Greensboro
Gary King, Department of Government, Harvard University
Roger E. Kirk, Psychology, Baylor University
Helena Chmura Kraemer, Psychiatry and Behavioral Sciences, Stanford University
Peter Marsden, Sociology, Harvard University
Helmut Norpoth, Political Science, SUNY, Stony Brook
Frank L. Schmidt, Industrial Psychology, University of Iowa
Herbert Weisberg, Political Science, The Ohio State University

Publisher: Sage Publications, Inc.

Series / Number 07-057

UNDERSTANDING REGRESSION ANALYSIS
An Introductory Guide

LARRY D. SCHROEDER, Syracuse University
DAVID L. SJOQUIST, Georgia State University
PAULA E. STEPHAN, Georgia State University

SAGE PUBLICATIONS
The International Professional Publishers
Newbury Park  London  New Delhi

Copyright © 1986 by Sage Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

For information address:
SAGE Publications, Inc., 2455 Teller Road, Newbury Park, California 91320. E-mail: order@sagepub.com
SAGE Publications Ltd., 6 Bonhill Street, London EC2A 4PU, United Kingdom
SAGE Publications India Pvt. Ltd., M-32 Market, Greater Kailash I, New Delhi 110 048, India

Printed in the United States of America
International Standard Book Number 0-8039-2758-4
Library of Congress Catalog Card No. 85-063790

When citing a university paper, please use the proper form. Remember to cite the current Sage University Paper series title and include the paper number. One of the following formats can be adapted (depending on the style manual used):

(1) SCHROEDER, LARRY D., SJOQUIST, DAVID L., and STEPHAN, PAULA E. (1986) Understanding Regression Analysis: An Introductory Guide. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-057. Newbury Park, CA: Sage.

OR

(2) Schroeder, Larry D., Sjoquist, David L., & Stephan, Paula E. (1986). Understanding regression analysis: An introductory guide (Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-057). Newbury Park, CA: Sage.

CONTENTS

Series Editor's Introduction
Acknowledgments
1. Linear Regression
   Hypothesized Relationships
   A Numerical Example
   Estimating a Linear Relationship
   Least Squares Regression
   Examples
   The Linear Correlation Coefficient
   The Coefficient of Determination
   Regression and Correlation
2. Multiple Linear Regression
   Estimating Regression Coefficients
   Standardized Coefficients
   Associated Statistics
   Examples
3. Hypothesis Testing
   Introduction
   The Testing Procedure
   The Standard Error of the Estimated Coefficient
   The Student's t Distribution
   Forming Test Values
   The Role of Standard Error and Sample Size
   Changing the Level of Significance
   t Ratio
   Left-Tail Tests
   Two-Tail Tests
   Confidence Intervals
   F Statistic
   What Tests of Significance Can and Cannot Do
4. Extensions to the Multiple Regression Model
   Types of Data
   Dummy Variables
   Interaction Variables
   Transformations
   Prediction
   Examples
5. Problems and Issues of Linear Regression
   Specification
   Proxy Variables and Measurement Error
   Selection Bias
   Multicollinearity
   Autocorrelation
   Heteroskedasticity
   Simultaneous Equations
   Limited Dependent Variables
   Conclusions
Appendix A: Derivation of a and b
Appendix B: Critical Values for Student's t Distribution
Appendix C: Regression Output from SAS and SPSS
Appendix D: Suggested Textbooks
Notes
References
About the Authors

To our children: Leanne, Nathan, Jennifer, David

Series Editor's Introduction

Researchers in the social sciences, business, policy studies, and other areas rely heavily on the use of linear regression analysis. The frequency with which the technique is employed is demonstrated by a review of articles in professional journals such as the American Economic Review, Journal of Finance, American Political Science Review, Journal of Policy Analysis and Management, Journal of Marketing, Journal of Educational Research, and American Sociological Review. The use of linear regression is so common because this research tool adds considerably to the understanding of economic, political, and social phenomena.

Frequently, instructors would like to supplement their courses with materials, such as articles from professional journals, that use regression analysis. To students unfamiliar with regression, however, research based on the technique can be incomprehensible. For those who have yet to take a statistics course, this book is intended to provide the background needed to understand much of the empirical work relying on linear regression analysis. The book provides a heuristic explanation of the basic procedures and terms used in regression analysis. Written at the most elementary level and assuming only a minimal mathematics background, the book focuses on the intuitive and verbal interpretation of regression coefficients, associated statistics, and hypothesis tests. Other terminology often encountered in today's literature is also explained, including standardized regression coefficients, dummy variables, interaction terms, and transformations. Brief discussions of some of the major problems encountered in regression analysis are also presented.

The book can be used as a supplementary text in a variety of courses in numerous fields. Examples given in the text encompass the fields of demography, economics, education, finance, marketing, policy analysis, political science, public administration, and sociology. Instructors in any of these areas are likely to find the text useful.

The authors do not intend for this book to serve as a substitute for a course or textbook in statistics. It is not designed to teach the use of regression analysis, but rather to fill the void that exists when the student encounters empirical papers before taking a statistics course.
On the other hand, the level of exposition makes the volume suitable as an introductory supplement in applied statistics courses where students are encountering linear regression for the first time.

This book is an outgrowth of material previously prepared by the authors for students in intermediate economics courses who did not have a background in statistics. An earlier, more limited version of the book was published by General Learning Press under the title, Interpreting Linear Regression Analysis: A Heuristic Approach. This version has been expanded to encompass the many other disciplines that use regression analysis.

—Richard G. Niemi, Series Co-Editor

Acknowledgments

We are especially grateful to Theodore C. Boyden for providing the encouragement to undertake this project. Special thanks go to the following individuals who provided suggestions for examples and clarified various arguments: Kenneth Bernhardt, Michael Binford, Libby Dalton, Benoit Deschamps, Louis Ederington, Kirk Elifson, Charles Jaret, Ralph LaRossa, Taylor Little, Jr., Dileep Mehta, Donald Reitzes, and Frank Whittington. We also want to thank Esther Gray, Bee Hutchins, Marian Mealing, Billie Shook, and Carla Thomas for their expert typing, David Amis for help with the illustrations, and Richard G. Niemi for his support.

UNDERSTANDING REGRESSION ANALYSIS

LARRY D. SCHROEDER, Syracuse University
DAVID L. SJOQUIST, Georgia State University
PAULA E. STEPHAN, Georgia State University

1. LINEAR REGRESSION

Hypothesized Relationships

The two statements, "The more a political candidate spends on advertising, the larger the percentage of the vote he will receive" and "Mary is taller than Jane," express different types of relationships. The first statement implies that the percentage of the vote that a candidate receives is a function of, or is caused by, the amount of advertising, while in the second statement no causality is implied. More precisely, the former expresses a causal or functional relationship while the latter does not.

A functional relationship is thus a statement (often in the form of an equation) of how one variable, called the dependent variable, depends on one or more other variables, called independent variables. In the example, the share of the vote a candidate receives is dependent on (is a function of) the amount of advertising, which is independent of the percentage of the vote received. Another independent variable that might be included is the number of prior years in office, in which case the functional relationship would be stated as, "The candidate's share of the vote depends on the amount of advertising as well as the candidate's prior years in office."

Other examples of functional relationships are: (1) "If he allows his hair to grow longer, he will become stronger," (2) "If she studies more, her grades will improve," and (3) "If the price of oranges increases, individuals will purchase fewer oranges."

One of the activities of researchers is testing the validity or falsity of hypothesized functional relationships, called hypotheses or theories. This volume discusses one tool used in testing hypotheses—linear regression. Linear regression analysis is applicable to a vast array of subject matter.
Consider the following situations in which regression analysis has been employed: a study of the effect of shelf space devoted to a particular product on the sales of that product (Curhan, 1972); a study of the effect of the size of the dividend paid by a corporation on the value of the corporation's stock (Durand, 1959); a study of the effect of school quality on academic achievement (Coleman et al., 1966); a study of the effect of age on the probability that an individual or family will move (Polachek and Horvath, 1977). All of these examples are cases in which the application of regression analysis was useful, although the application was not always as straightforward as the example to which we now turn.

A Numerical Example

To facilitate the discussion of linear regression analysis, the following food consumption example will be referred to throughout the book. Suppose one were asked to investigate by how much a typical family's food expenditure increases as a result of an increase in its income. While most would agree that there is a relationship between the amount spent on food and income, the example is in fact an investigation of an economic theory. The theory suggests that the consumption of food is a function of family income; that is, C = f(I), read "C is a function of I", where C (the dependent variable) refers to the consumption of food and I (the independent variable) refers to income. Throughout the book we will refer to the theory that C increases as I increases as the hypothesis.

The investigation of the relationship between C and I allows for both testing the theory that C increases as a result of increases in I and obtaining an estimate of how much food consumption changes as income changes. One can therefore consider the investigation as an analysis of two related questions: (1) Does spending on food increase when a family's income increases? (2) By how much does spending on food change when income increases or decreases? As will be seen in Chapter 3, these questions cannot be answered with certainty. However, since the material in this section can be more easily understood by assuming that answers to these questions can be provided with certainty, we shall proceed initially under this assumption.

At least two strategies for analyzing these questions are feasible. One can observe various families over time and note how their consumption of food changes as their income changes, or one can observe income and food consumption differences among several families and note how differences in food consumption are related to differences in income. We have adopted the latter approach, employing the hypothetical data given in columns 1 and 2 of Table 1, which represent annual income and food consumption information from a sample of 50 families in the United States for one year. Assume that this sample was chosen randomly from the population of all families in the United States. The associated levels of these two variables have been plotted as the 50 points in Figure 1.

Casual observation of the points in Figure 1 suggests that C increases as I increases. However, the magnitude by which C changes as I changes for the 50 families is not obvious. For this reason the presentation of data in tabular or graphical form is not by itself a particularly useful format from which to draw inferences. These formats are even less desirable as the number of observations and variables increases.
Thus we seek a means of summarizing or organizing the data in a more useful manner.

Any functional relationship can be most conveniently expressed as a mathematical equation. If one can determine the equation for the relationship between C and I, one can use this equation as a means of summarizing the data. Since an equation is defined by its form and the values of its parameters, the investigation of the relationship between C and I entails learning something from the data about the form and parameters of the equation.

The economic theory that suggests that C is a function of I does not indicate the form of the relationship between C and I. That is, it is not known whether the equation is of a linear or some other, more complex, form. In some problems the general form of the equation is suggested by the theory, but since this is not so in the food expenditure problem, it is necessary to specify a particular form. We shall assume that the form of the equation for our problem is that of a straight line, which is the simplest and most commonly used functional form.

TABLE 1
Food Consumption, Family Income, and Family Size Data

 (1) Food       (2) Family   (3) Family   (4) Live
 Consumption    Income       Size         on Farm
 $  723.52      $ 8,246      1            No
    780.70        8,742      4            No
    990.74        9,048      6            No
  1,634.98       10,584      7            No
  1,189.40       10,626      2            No
  1,295.64       10,984      2            No
  1,025.52       11,822      1            No
  1,792.18       12,532      2            No
  1,328.00       12,952      5            No
    780.06       13,220      2            Yes
  1,366.14       13,386      6            No
  2,950.72       13,746      8            No
  1,273.34       13,946      2            No
  1,953.58       14,206      2            No
    866.62       14,388      1            No
  2,125.30       14,622      4            No
  2,372.00       15,032      2            No
  2,477.34       15,172      5            No
  1,148.24       16,284      1            No
  2,108.14       16,664      3            No
  1,810.96       17,124      2            No
  1,776.58       17,302      2            No
  2,295.04       18,254      3            No
    877.52       18,908      1            Yes
  1,284.00       18,922      2            No
  1,502.94       19,330      2            Yes
  1,939.00       20,108      3            No
  2,443.06       20,600      3            No
  2,003.44       21,238      4            No
  1,682.36       22,120      2            No
  2,308.16       22,452      7            No
  1,472.44       23,288      2            No
  2,534.66       23,316      4            No
  2,194.76       23,588      2            No
  1,638.26       23,708      3            No
  2,612.00       23,830      6            No
  2,328.96       23,908      2            No
  1,666.90       24,216      3            No
  2,560.22       25,422      1            No
  3,103.54       25,504      9            No
  2,819.06       26,286      5            No
    975.10       26,590      2            No
  2,122.52       26,852      1            No
  1,068.38       27,146      3            Yes
  2,253.46       27,936      6            No
  2,763.40       28,556      5            No
  1,904.66       28,874      3            No
  2,111.50       29,450      4            No
  3,211.64       29,624      1            No
  2,665.78       29,690      4            No

SOURCE: Hypothetical data.

[Figure 1: Scatter Diagram of Family Income and Food Consumption; income (in thousands) on the horizontal axis, consumption of food (in dollars) on the vertical axis.]

Given this assumption, one can express the functional relationship that exists between C and I for all U.S. families as

  C = α + βI   [1]

where α (the Greek letter alpha) and β (the Greek letter beta) are the unknown parameters assumed to hold for the population of U.S. families and are referred to as the population parameters. (See also Figure 2.)

[Figure 2: Illustration of Different Slopes]

Given the assumption that the form of the equation of the possible relationship between C and I can be represented by a straight line, what remains is to estimate the values of the population parameters of the equation using our sample of 50 families. The two questions posed earlier refer to the value of the slope—that is, the value of β. The first question asks whether β is greater than zero, while the second asks the value of β.
By obtaining an estimate of the value of β, a statement can be made as to the effect of changes in income on the level of food consumption for the 50 families in our sample. Further, from this estimate of β inferences can be drawn about the behavior of all families in the population.

Before proceeding, it is important to note the following. The actual or "true" form of the relationship between I and C is not known. We have simply assumed a particular form for the relationship in order to summarize the data in Figure 1. Further, we do not know the values of the population parameters of the assumed linear relationship between C and I. The task is to obtain estimates of the values of α and β. We will denote these estimates as a and b.

Estimating a Linear Relationship

The question that may come to mind at this point is, how can it be stated that income and food consumption are related by a precise linear equation when the data points in Figure 1 clearly do not lie on a straight line? The answer comprises three parts. First, the equation is only a summary of the data points and does not imply that C and I are related in precisely this manner. Second, the hypothesis is based on the implicit assumption that only income and consumption differ between these families. However, other things, such as family size and tastes, are not likely to be the same and no doubt affect the amount of food consumed. Third, there is randomness in people's behavior; that is, an individual or family, for no apparent reason, may buy more or less food than some other family that appears to be in exactly the same situation with regard to income, taste, and the like. Thus one would not expect the data points to lie consistently on a straight line, even if the line did represent the average response to changes in income.

As noted previously, from the data points in Figure 1 it is not obvious how much C increases as I increases; that is, it is uncertain what the position of the line summarizing the data points should be. To see this, consider the two solid lines that have been arbitrarily drawn through the points in Figure 3. Line 1 has the equation C = 1,000 + 0.01I, and line 2 has the equation C = 200 + 0.10I. Which of these two lines is the better estimate of how food consumption changes as income changes? This is the same as asking which of the two equations is better at summarizing the relationship between C and I found in Table 1.

[Figure 3: Two Possible Summaries of the Income-Consumption Relationship]

More generally, which line among all the straight lines that it is possible to draw in Figure 3 is the "best" in terms of summarizing the relationship between C and I? Regression analysis, in essence, provides a procedure for determining the regression line, which is the best straight line (or linear) approximation of the relationship between C and I. This procedure is equivalent to finding particular values for the slope and intercept.

An intuitive idea of what is meant by the process of finding a linear approximation of the relationship between the independent and dependent variables can be obtained by taking a string or pencil and trying to "fit" the points in Figure 1. Move the string up or down, or rotate it until it takes on the general tendency of the points in the graph. What property should this line possess?
If asked to select which of the two solid lines in Figure 3 is better at summarizing (estimating) the relationship between income and food consumption, one would undoubtedly choose line 1 because it is "closer" to the points than line 2. (This is not to imply that line 1 is the regression line.)

Closeness or distance can be measured in different ways. Two possible measures are the vertical or horizontal distance between the observed points and a line. In the normal case, where the dependent variable is plotted along the vertical axis, distance is measured vertically as the differences between the observed points and the line. This is shown in Figure 3, where the vertical dotted line drawn from the data point to line 1 measures the distance between the observed data point and the line. In this case distance is measured in dollars of consumption, not in feet or inches. The choice of the vertical distance stems from the theory stating that the value of C depends on the value of I. Thus, for a particular value of income, it is desired that the regression line be chosen so as to predict a value of food consumption that is as close as possible to the value of food consumption observed at that income level.

The regression line cannot minimize the distance for all points simultaneously. In Figure 3 it can be seen that some points are closer to line 1 while others are closer to line 2. Thus a means of averaging or summing up all these distances is needed to obtain the best fitting line. Although several methods exist for summing these distances, the most common method in regression analysis is to find the sum of the squared values of the vertical distances. This is expressed as

  Σᵢ (Cᵢ − Ĉᵢ)²,  i = 1, ..., N

where Ĉᵢ is the value of C that would be estimated by the regression line and is read "C hat sub i."

Least Squares Regression

In the most common form of regression analysis, the line that is chosen is the one that minimizes

  Σᵢ (Cᵢ − Ĉᵢ)²,  i = 1, ..., N

which is called the sum of the squared errors, frequently denoted SSE. For each observation, the distance between the observed and the predicted level of consumption can be thought of as an error, since the observed level of consumption is not likely to be predicted exactly but is missed by some amount (Cᵢ − Ĉᵢ). This error may be due, for example, to randomness in behavior or other factors such as differences in family size. Because the squares of the errors are minimized, the term least squares regression analysis is used.

The reason for selecting the sum of the squared errors lies in statistical theory that is beyond the scope of this book. However, an intuitive rationale for its selection can be presented. If the errors were not squared, distances above the line would be canceled by distances below the line. Thus it would be possible to have several lines, all of which minimized the sum of the nonsquared errors. It is implicit that closeness is good, while remoteness is bad. It can also be argued that the undesirability of remoteness increases more than in proportion to the error. Thus, for example, an error of four dollars is considered more than twice as bad as an error of two dollars. One way of taking this into account is to weight larger errors more than smaller errors, so that in the process of minimizing it is more important to reduce larger errors. Squaring errors is one means of weighting them.

Let a and b represent the estimated values of α and β for the still unknown regression line. Thus Ĉᵢ can be expressed as Ĉᵢ = a + bIᵢ.
Substituting a + bIᵢ for Ĉᵢ, the expression for SSE can be rewritten as

  SSE = Σᵢ (Cᵢ − a − bIᵢ)²,  i = 1, ..., N   [2]

Using the calculus, expressions for a and b can be found that minimize the value of expression 2 and hence give the least squares estimates of α and β, which in turn define the regression line (see Appendix A for the derivation of the formulas). For the given set of data, the a and b that minimize

  Σᵢ (Cᵢ − a − bIᵢ)²,  i = 1, ..., 50

are a = 714.58 and b = +0.058 (see Appendix A for the calculation of these values). Therefore, the least squares line, which is drawn in Figure 4, has the equation

  Ĉ = 714.58 + 0.058I   [3]

[Figure 4: "Best Fitting" Regression Line, Ĉ = 714.58 + 0.058I]

These results mean, for example, that the estimate of consumption for a family whose annual income is $10,000 is $1,294.58—that is, $1,294.58 = $714.58 + 0.058($10,000). Remember, this is an estimate of C and not necessarily the amount one would observe for a specific family with an income of $10,000. The value of a, $714.58, is the estimated food consumption for a family with zero income. The value of b, 0.058, implies that for this sample, each dollar change in family income results in a change of $0.058 in food consumption in the same direction (note the positive sign for b). These conclusions, of course, hold only for this particular sample. When the least squares technique is applied to additional samples of consumers, one would obtain additional (generally different) estimates of α and β.

It is important to point out that regression analysis does not prove causation. Our estimate of β is consistent with the theory that an increase in income causes an increase in food consumption. However, it does not prove causation. Note that we could have reversed the equation, making I depend on C, and argued that higher food consumption makes for healthier and more productive workers who thus have higher incomes. Since I and C increase together, this relationship would also be supported. It would take some alternative experiment or test to determine the direction of the causation. Our estimate of β, however, is not consistent with the theory that food consumption decreases with increases in income.
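The computation behind these estimates is compact enough to sketch in code. The following minimal Python sketch (ours, not the book's) applies the closed-form least squares formulas derived in Appendix A; the names `income` and `food` are our own labels for columns 2 and 1 of Table 1, and only the first two of the 50 pairs are reproduced here.

```python
# A minimal least squares sketch (not from the book): estimate a and b
# for the line C-hat = a + b*I using the formulas derived in Appendix A.

def least_squares(x, y):
    """Return (a, b) minimizing SSE = sum((y_i - a - b*x_i)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    # Intercept: a = y_bar - b * x_bar, so the line passes through the means.
    a = y_bar - b * x_bar
    return a, b

# Columns 2 and 1 of Table 1; only the first two of the 50 pairs are shown.
income = [8246, 8742]       # ... remaining 48 income values from Table 1
food = [723.52, 780.70]     # ... remaining 48 consumption values from Table 1

a, b = least_squares(income, food)
# With the full 50-family sample the book reports a = 714.58 and b = 0.058.
print(f"C-hat = {a:.2f} + {b:.3f} I")
print(f"estimated consumption at I = 10,000: {a + b * 10000:,.2f}")
```

Nothing in the sketch is specific to the food expenditure problem; any two equal-length lists of observations can be passed to `least_squares`.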
Examples

Before proceeding, three examples are presented to illustrate how regression analysis is used.

EXAMPLE 1—INFLATION AND STOCK PRICES

Are stocks of major corporations a hedge against inflation—that is, does the return on a portfolio of stocks increase with the rate of inflation? Jaffe and Mandelker (1976) address this question, as part of a broader study, by estimating the following regression equation

  Rₜ = .0168 − 3.014Iₜ

where Rₜ is the rate of return on a market portfolio of stocks in month t and Iₜ is the rate of inflation in month t. The estimate of the regression coefficient on Iₜ is −3.014, which implies that an increase in the inflation rate of one percentage point is associated with a reduction in the rate of return of 3.014 percentage points. Thus, for this portfolio, stocks do not appear to be a hedge against inflation.

EXAMPLE 2—HOME STATE ADVANTAGE

Has the advantage held by a U.S. presidential candidate in his home state diminished over time as elections have become more nationalized? This question was addressed by Lewis-Beck and Rice (1984). The regression equation they obtained is

  H = 2.03 + .18T

where H is the home state advantage, measured in percentage points of the state popular vote, and T is an election year counter (e.g., for 1904 T = 1, for 1908 T = 2, and so on). Notice that the coefficient on T is positive, which suggests that the home state advantage has not declined over time.

EXAMPLE 3—PAY PREMIUM FOR VETERANS

In a recent article, De Tray (1982) argues that veterans receive a pay premium because employers, in evaluating the potential of employees, realize that veterans have had to pass mental and physical exams and survive a period of military service before being honorably discharged. He further argues that the quality of information provided by veteran status depends on the percentage of an age group that served in the military. Men who did not serve during war years, when virtually all able-minded and able-bodied men were drafted, may be less productive on the average than men who did not serve during peacetime, when few were called up. Therefore, De Tray hypothesizes that the veteran premium is positively related to the percentage in an age group that served in the military. To test this hypothesis, De Tray computed the veteran premium, w, for each of several age groups and regressed it on the percentage of each age group that served in the military, V. He found that the regression equation is

  w = −.078 + .165V

indicating that the premium increases as the percentage of the age group that served in the military increases. It should be noted that this is only part of a larger study.

The Linear Correlation Coefficient

In the first part of this chapter, we demonstrated how regression analysis can be used to summarize the relationship between a dependent and independent variable. We turn now to an explanation of descriptive statistics designed to evaluate (1) the degree of association between variables and (2) how well the independent variable has explained the dependent variable.

The correlation coefficient measures the degree of linear association between two variables. To understand what statisticians mean by linear association, consider Figure 5, which has the same 50 points as Figure 1. The average (or mean) level of food consumption is represented by the dotted line, while the solid line represents the mean level of income. The two lines divide the figure into the four quadrants denoted by Roman numerals.

[Figure 5: Linear Correlation Analysis: The Food Expenditure Problem]

Levels of C that are greater than the average of 1842.45 lie above the dashed line in quadrants I and II, while less than average levels lie below, in quadrants III and IV. Similarly, income levels greater than the average lie to the right of 19,399 in quadrants I and IV, while those less than average lie to the left in quadrants II and III.

Figure 5 demonstrates that a majority of the points in the sample lie in quadrants I and III. Because of this pattern, the variables C and I are said to be positively correlated. Put differently, C and I are said to be positively correlated when C's above (below) the mean value of food consumption, denoted C̄, are associated with I's above (below) the mean value of income, denoted Ī. On the other hand, if the C's below C̄ had been associated with the I's above Ī (and vice versa), one would have said that the variables were negatively correlated.
The reader should be able to demonstrate that in this case the data points would have been clustered in quadrants II and IV. Another possibility exists: If the data points had been spread fairly evenly throughout the four quadrants, one would have said that C and I were uncorrelated.

The particular descriptive statistic that measures the degree of linear association between two variables is called the correlation coefficient and is denoted r. Although we offer no proof, r always lies between the values of −1 and +1 (−1.0 ≤ r ≤ +1.0).
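For readers who want to see the statistic computed, here is a small Python sketch (again ours, not the book's) of the usual formula for r. It reuses the truncated `income` and `food` lists introduced in the earlier least squares sketch; with the full Table 1 data the coefficient comes out positive, matching the quadrant pattern in Figure 5.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient r; always satisfies -1 <= r <= +1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Points in quadrants I and III contribute positive products;
    # points in quadrants II and IV contribute negative products.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return cross / (sx * sy)

income = [8246, 8742]     # ... remaining 48 values from Table 1, as before
food = [723.52, 780.70]   # ... remaining 48 values from Table 1, as before
r = correlation(income, food)   # positive for the food expenditure data
```

The numerator is the same cross-product sum that drives the least squares slope, which is why r and b always share the same sign.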
Hypothesis testing is analogous to decisions reached in courts of law. Under the court system, a defendant is brought to trial and he or she is assumed to be not guilty. For the judge or jury to reject the assumption of not guilty in favor of the alternate finding of guilty, sufficient evidence must be produced. In the court system, errors can be made; innocent defendants can be found guilty and guilty individuals can be found not guilty. Under a legal system where the evidence must show "beyond a shadow of doubt" that the assumption of nonguilt is to be rejected, there is a primary concern for the inferential error of the first type—that is, of convicting an innocent person.

Just as the defendant is assumed not guilty until proven guilty, in hypothesis testing the null hypothesis is assumed true until there is sufficient evidence that it is not true. Likewise, just as inferential errors can occur in courts of law, inferential errors can also occur in hypothesis testing. Again, we are particularly concerned with an inferential error of the type that occurs if one rejects the null hypothesis in favor of the alternate when the null hypothesis is actually true. Instead of simply stating that the analyst should reject the assumption that the null is true in favor of the alternate if the evidence suggests it "beyond a shadow of a doubt," the hypothesis-testing procedure allows the investigator to specify an exact probability of making an inferential error—that is, allows the investigator to define how big the "shadow of a doubt" is. Most commonly, 1, 5, and 10 percent probabilities are chosen; however, there is nothing that prevents the analyst from using other probabilities of this type of inferential error.

When the researcher can reject the null hypothesis that β = 0 in favor of the alternate, the regression coefficient is said to be significant, which is short for significantly different from zero at a stated probability. The level of significance depends on the probability the investigator has assigned to rejecting the null when it is indeed true. In Table 2, the double asterisks next to the coefficient on the cohabitation variable imply that this coefficient is significant at the 1 percent level of significance (this is how "p < .01" in that table is to be read). This means that, in rejecting the null hypothesis that cohabitation has no effect on marital satisfaction (β = 0) in favor of the alternate that there is an effect, there is at most a 1 percent chance that we have rejected the null hypothesis that β = 0 when indeed β is zero. Likewise, as will be seen, the t ratios reported beside the regression coefficients in the housework example of Table 3 can be used to determine whether or not a coefficient is significant.

The Testing Procedure

The formal procedure used to test hypotheses concerning the value of the population parameter is comparable to the procedure discussed earlier. First, a hypothesis concerning the value of the population parameter is formulated. This hypothesis is referred to as the null hypothesis, denoted H₀, and is assumed to hold unless sufficient evidence is found to reject it. The null hypothesis in the food consumption problem is that β is equal to zero (this is written as H₀: β = 0). Second, the test value method (to be discussed later) is used to compute a number, tv, such that if H₀ is true, there is a low prespecified probability of obtaining an estimate that overstates β by more than tv.
The chosen probability is referred to as the level of significance; we will use 5 percent for the time being. Thus, on average no more than 5 percent of all samples will produce b's that are greater than the population parameter by more than this test value when the null hypothesized value of β is the actual value of β. Third, the difference between b and the hypothesized value of β is computed. Finally, the following criterion is used to test the null hypothesis:

(1) Reject the null hypothesis if this computed difference is greater than the test value.
(2) Do not reject the null hypothesis if this difference is less than or equal to the test value.

Statement 1 in the criterion says that if the difference between the estimate and the hypothesized value is greater than the test value, the null hypothesis is to be rejected, since there is only a 5 percent chance that, if the null is true, an incorrect inference about the population parameter will be made. If, on the other hand, the difference is less than or equal to the test value (statement 2 of the criterion above), one cannot feel confident in rejecting the null hypothesis, since 95 percent of the samples will produce b's that vary by no more than this amount from β when the null hypothesized value of β is the actual value of β.

Note from the above criterion that only rejection or nonrejection of the null hypothesis is possible. Nonrejection does not imply that one accepts the null hypothesis. This is because the procedure outlined previously only tells us the probability of rejecting the null hypothesis when it is true. This is analogous to the court example where the finding is "not guilty" instead of "innocent." The level of significance does not tell us anything about the probability of accepting the null when it is false. On the other hand, if the null hypothesis is rejected, it is usually stated that the alternate hypothesis, often denoted Hₐ, is accepted. It is for this reason that the relationship that the researcher predicts between the independent and dependent variable is stated as the alternate hypothesis.

We have now formulated the concept of the null hypothesis and the criterion used to test that hypothesis. The hypothesis-testing procedure will be complete once the method for constructing the test value (tv) has been presented. As will be shown, the test value depends on (1) the estimated variability of the estimates of β from sample to sample and (2) a probability distribution.

The Standard Error of the Estimated Coefficient

The standard error of the regression coefficient is a measure of the amount of variability that would be present among different b's estimated from samples drawn from the same population. While it is true that equation 3 in Chapter 1 provides a unique estimate of β, it is also the case that if a different set of data were drawn from the population, a different estimate of β would probably result. Statistical theory allows us to estimate how much variability there would be among all these estimates (that is, allows us to estimate the standard error) just by taking information from one sample. In essence, the standard error measures how sensitive the estimate of the parameter is to changes in a few observations in the sample.

To understand what is meant by sensitive, consider Figure 7. Panel A presents two samples from population A, panel B presents two samples from population B, and panel C presents two samples from population C. In each case the ordinary least squares regression lines are also presented. The figure is constructed so that, with the exception of the circled observations, the data points are the same for any given panel (i.e., within lettered pairs).
[Figure 7: Sensitivity of Regression Line to Changes in Observations]

In the case of the circled observations, within a given panel the values of the X's have remained unchanged while the associated Y values have changed. It is apparent that regression coefficients estimated from either population A or B are extremely sample-dependent. In both situations a change in a few of the observations results in a large change in the slope of the regression line and hence a large change in b. The data drawn from population C, however, are neither scattered nor clustered. In this instance, a change in a few of the observations will not alter b substantially.

What characteristics do the data in panels A and B have which do not appear in panel C? In A the amount of variability of the dependent variable Y (measured on the vertical axis) which cannot be attributed to variations in X is great relative to that in data set C. In panel B the variations in X are considerably less than the comparable variations in the independent variables shown in panel C. Each of these characteristics is positively related to the standard error of a regression coefficient and creates additional uncertainty regarding the true parameter β.

The measure of the standard error allows one to make inferences about how sensitive the estimate of β is to changes in sample composition without taking another sample. Because a large standard error casts doubt on the estimate, the magnitude of the test value depends positively on the size of the standard error. The standard error, generally represented as s_b, is often reported along with the regression coefficients, as in Table 4.

The Student's t Distribution

A probability distribution is also used in the hypothesis-testing procedure. To better understand the role that probability plays in the testing procedure, reconsider what has been said thus far about regression parameters. First, it has been stressed that the population parameter can never be observed. Second, it has been noted that the estimate of the parameter from any sample is but one possible estimate; additional samples from the population yield additional, probably different estimates. Not all estimates are equally "close" to the population parameter. Finally, it is desired to draw inferences about the population parameter from one estimate of the parameter. In the food consumption problem, the b of .058 is to be used to make inferences about the population β. Thus one would like to know if .058 is one of the estimates that is close to β. A question of this nature can never be answered, since the value of the population parameter is unobservable and hence unknown. A statement can, however, be made regarding the probability of obtaining an estimate with a given degree of closeness to the assumed, null hypothesized, value of β. Analogously, probabilistic statements can be made concerning the degree of closeness associated with a given probability.

These statements can be made because statisticians have determined the probability distribution of the fraction (b − β)/s_b. In general, this fraction is distributed according to what is known as the Student's t distribution.
(A discussion of how statisticians are able to determine the probability distribution of (b − β)/s_b is beyond the scope of this book.) The Student's t distribution allows one to make probabilistic statements concerning the size of the fraction (b − β)/s_b. The distribution relates the probability that the fraction will be no larger than what is known as the t statistic, denoted t_c. For a stated probability, the t statistic depends on the degrees of freedom, defined as the number of observations in the problem (the size of the sample) minus the number of coefficients estimated. Values for the Student's t distribution are given in Appendix B. In the consumption problem, there are 48 degrees of freedom, since two coefficients (a and b) were estimated and there are 50 observations. (See also Figure 8.)

For any given problem with 48 degrees of freedom, the t distribution states that for 5 percent of the samples, the fraction (b − β)/s_b will be larger than 1.677. This implies that the probability is 5 percent that the following inequality holds:

  (b − β)/s_b > 1.677   [7]

Multiplying this inequality by s_b yields

  (b − β) > 1.677s_b   [8]

Inequality 8 means that if the null hypothesis is true, only 5 percent of the estimates will exceed the null hypothesized value by more than 1.677s_b. Thus 95 percent will overstate the null hypothesis by less than this value.

[Figure 8: t Distribution, with 5% of the total area under the curve lying beyond the t statistic]

Forming Test Values

The expression 1.677s_b is an example of a test value. More generally, a test value is formed by multiplying the appropriate t statistic by the standard error of the estimator. In the food expenditure problem, s_b = .013. Since t_c s_b = (1.677)(.013) = .022, the test value is .022. The null hypothesis can be rejected if the difference between the estimated coefficient and the hypothesized value is greater than this test value. In the case where the hypothesized value is zero, this difference is always equal to the estimated coefficient, b, in this case .058. Thus, for the food expenditure problem, the null hypothesis can be rejected in favor of the alternate hypothesis that a positive relationship exists between income and food expenditure, since .058 > .022. More generally, it follows that the null hypothesis that β = 0 can be rejected in favor of the alternate hypothesis that it is greater than zero if

  b > t_c s_b   [9]

The testing procedure can also be used to test hypotheses concerning hypothesized values of β other than zero. Suppose, for example, that one wished to test the hypothesis that a one-dollar increase in income is associated with a 4-cent increase in family food expenditure against the hypothesis that it is associated with a larger increase. In this case, the null hypothesis is H₀: β = .04, and the alternate hypothesis is Hₐ: β > .04. The difference between .04 and our estimate of .058 is .018. Given that this is less than the test value of .022, one cannot reject the null hypothesis. On the other hand, the reader should be able to verify that the null hypothesis that β = .03 could be rejected at the 5 percent level of significance in favor of the alternate hypothesis that β > .03. In this instance we say that the coefficient is significantly greater than .03.
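The mechanics of the right-tail test reduce to a few lines of code. The sketch below is ours, not the book's; it takes b = .058 and s_b = .013 as given from the regression output and assumes the scipy library is available to supply critical values in place of the t table in Appendix B.

```python
from scipy import stats  # assumed dependency; the Appendix B table works too

b, s_b, df = 0.058, 0.013, 48   # estimate, standard error, degrees of freedom

def right_tail_test(b, s_b, df, beta_null=0.0, alpha=0.05):
    """Reject H0: beta = beta_null in favor of Ha: beta > beta_null?"""
    t_c = stats.t.ppf(1 - alpha, df)   # 1.677 for alpha = .05 and df = 48
    test_value = t_c * s_b             # .022 in the food expenditure problem
    return (b - beta_null) > test_value

print(right_tail_test(b, s_b, df))                   # True:  .058 > .022
print(right_tail_test(b, s_b, df, beta_null=0.04))   # False: .018 < .022
print(right_tail_test(b, s_b, df, beta_null=0.03))   # True:  .028 > .022
```

The three calls reproduce the three decisions reached in the text: reject β = 0, do not reject β = .04, and reject β = .03.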
The Role of Standard Error and Sample Size

The statistical inference made about the population parameter from its estimate clearly depends on the size of the test value, which in turn depends on the size of the standard error of the estimated coefficient and on the size of the appropriate t statistic. A larger test value means, other things being equal, that it is harder to reject the null hypothesis in favor of the alternate. If the standard error in the food expenditure problem had been larger, the test value would also have been larger and different inferences might have been drawn about the population parameter.

As noted in the discussion of the t distribution, for a given level of significance, the size of the t statistic, and hence the size of the test value, is influenced by the size of the sample. That the number of observations in the sample will influence the size of the interval is reasonable, since a small sample is less likely to be representative of the population than a larger sample. The t statistics given in Appendix B illustrate that as the degrees of freedom decrease, the t statistic increases. Thus, for example, if the food expenditure sample size had been smaller, the appropriate t statistic would have been larger. As a result, the test value would also have been larger and different inferences might have been drawn about the population parameter.

Changing the Level of Significance

Although the 5 percent level of significance is suitable for much empirical research, in some instances it is desirable to have a smaller probability of rejecting the null hypothesis when it is true. As can be seen from Appendix B, for a given number of degrees of freedom the t statistic (and hence the size of the test value) increases as the level of significance decreases. Applying the method discussed earlier, one finds that for the food expenditure problem, at the 2.5 percent level of significance the test value is .026 = t_c s_b = (2.011)(.013). In a similar fashion, at the 1 percent level of significance the test value is .031. Notice that it might be possible to reject a hypothesis at the 5 percent level of significance but not at a lower level of significance. Often researchers will indicate at what level a variable is significant. In the cohabitation example of Table 2 the single asterisk indicates that a coefficient is significant at the 5 percent level; the double asterisk indicates significance at the 1 percent level. The lowest level at which a null hypothesis can be rejected is called by some authors the prob value or p value of a test (for an example of this, see Table 2).

t Ratio

Simple algebraic manipulation allows us to rewrite equation 9 as
For example, Appendix B demonstrates that for the food expenditure problem, the hypothesis that 6 = 0 can be rejected at the 0.5 percent level of significance. (For 48 degrees of freedom, the t statistic at the 0.5 percent level of significance is 2.682, substantially less than the t ratio of 4.462.) For a similar reason, the t ratio of 3.17 reported beside the number-of-rooms variable in the housework time example (Table 3) implies that the null hypothesis that B= 0 can be rejected at the 0.5 percent level. Just as the examples of Chapter 2 do not provide a uniform format for tests of significance, neither do computerized regression programs. For example, as can be seen from Appendix C, SPSS output provides information on standard errors, while SAS output provides informa- tion on t ratios as well as standard errors. Left-Tail Tests The reader will note that all of the alternate hypotheses presented thus far have taken the form, “8 is greater than some number.” In order to test the corresponding null hypothesis and make inferences about the alternate hypothesis, we have computed by how much our estimate overstates the null hypothesized value and then compared this differ- ence to the test value. This type of test is called a right-tail test. It gets its name from the fact that in this instance the alternate hypothesis is positive and lies to the right of the null hypothesized value. There are, of course, instances in which one is interested in alternate hypotheses that concern negative values. In this case a left-tail test is in order. Left-tail tests are appropriate when the alternate hypothesis is of the form that the population parameter is less than some specified number, such as zero. In such a case, we would have: Ho:6 = 0, Ha:B <0.48 A test value for a left-tail test can be computed in the same manner as a test value for a right-tail test. For example, for a left-tail test with 48 degrees of freedom, only 5 percent of the sample will yield b’s that understate the population parameter by more than —1.677sp. Note that once again we are comparing the difference between the estimate and the null hypothesized value to some test value. Here, however, if we use (b- £) asa measure of “understatement,” the difference is negative since the alternate hypothesis lies to the left of the null hypothesis, not to the right. Thus we are saying that in only 5 percent of the cases is this difference more negative than ~1.677sp; that is, in only 5 percent of the cases is (b — 8) < -1.677s».° Just as we computed a t ratio for a right-tail test, we can also compute at ratio for a left-tail test. In this case, however, we reject the null hypothesis that the population parameter is zero if b/s» < ts.” Two-Tail Tests Occasionally theory does not suggest the direction of the relationship between the dependent and independent variables. In this case a two-tail test is appropriate. A good example of where this arises is found in the relationship between cohabitation and marital satisfaction. It could be argued that because cohabitation before marriage allows couples to work through various problems, a positive relationship exists between cohabitation and marital satisfaction. On the other hand, cohabitation prior to marriage may decrease marital satisfaction because couples tire of each other or because the “newness” of the relationship has worn off. Thus we are not sure whether to argue for a positive or a negative relationship between marital satisfaction and cohabitation. 
This is an example of an instance where a two-tail test is appropriate. In such a test, the null hypothesis is H₀: β = 0, and the alternate hypothesis is Hₐ: β ≠ 0.

A two-tail test must consider the possibility that the estimate over- or understates β. From the previous discussion, we know that with 48 degrees of freedom, there is a 5 percent chance that an estimate overstates the population parameter by more than 1.677s_b. Likewise, there is a 5 percent chance that it understates the parameter by more than −1.677s_b. Combining these statements, we can say that there is a 10 percent chance that the estimate differs either positively or negatively from the population parameter by more than 1.677s_b. In absolute value terms, this means that there is a 10 percent chance that |b − β| > 1.677s_b.
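The three tests differ only in which tail (or tails) of the t distribution supplies the rejection region. The short Python sketch below (ours, not the book's) makes the comparison explicit, again assuming scipy and using b = .058, s_b = .013, and 48 degrees of freedom from the food expenditure problem.

```python
from scipy import stats  # assumed dependency, as in the earlier sketch

b, s_b, df, alpha = 0.058, 0.013, 48, 0.05
t_ratio = b / s_b                     # 4.462 in the food expenditure problem

t_c = stats.t.ppf(1 - alpha, df)      # one-tail critical value, 1.677

right_tail = t_ratio > t_c            # Ha: beta > 0; rejected here
left_tail = t_ratio < -t_c            # Ha: beta < 0; not rejected here
# A two-tail test at the same 5 percent level splits alpha across both
# tails, so its critical value is larger: t.ppf(1 - alpha/2, 48) = 2.011.
# Using 1.677 as a two-tail cutoff would give a 10 percent level instead.
two_tail = abs(t_ratio) > stats.t.ppf(1 - alpha / 2, df)

print(right_tail, left_tail, two_tail)   # True False True
```

The comment about splitting alpha restates the point made above: combining a 5 percent right-tail chance and a 5 percent left-tail chance yields a 10 percent two-tail probability, so holding the level at 5 percent requires the larger 2.011 cutoff.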