TNDY - TA Session 2

The document provides a tutorial on statistical analysis using multiple regression in Stata, focusing on the Current Population Survey dataset. It explains the concepts of bivariate and multiple linear regression, including how to interpret regression output and coefficients, particularly in relation to income, age, gender, and race. The tutorial also includes practical commands for running regression models in Stata and visualizing income differences by gender.

Uploaded by

elfmerooh

##TNDY TA Session 2: Statistical analysis using multiple regression

##Stata (version 14.2 for Macs)


##Last update: Javier M. Rodriguez, June 12, 2021
##Main dataset: “TNDY_TASession2_CPS.dta”
##These data are a 10% random sample from the Current Population Survey (CPS), one of the key intercensal surveys in the U.S., sponsored by the U.S. Census Bureau and the Bureau of Labor Statistics. The CPS is the survey officially used in the U.S. to monitor key labor force statistics (e.g., the unemployment rate).

Linear regression

I. Bivariate Linear Model


• In many situations, we want to quantify the relationship between two variables (X and Y), where we have good reasons to think that X affects Y.
• Sometimes we may also want to estimate the predicted value of Y for a given level of X. By “predicted value of Y” we mean what can be known about Y using X as an explanation of Y.
    o Both goals can be addressed with regression analysis. In our case, we will use linear regression. This means that we assume that the association between X and Y, or at least some part of it, can be summarized by a straight line.
    o For example, the predicted income (Y) for people of a given age (X) can be estimated as follows:

reg inctot age
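
Once the model is fitted, the predicted income at a given age can be obtained from the stored coefficients. The lines below are a minimal sketch, assuming the session dataset sits in the current working directory; the ages 30, 40, and 50 and the variable name inctot_hat are illustrative choices, not part of the original session.

```stata
* Load the session dataset (assumes the .dta file is in the working directory)
use "TNDY_TASession2_CPS.dta", clear

* Fit the bivariate model (reg is the abbreviation of regress)
regress inctot age

* Predicted income at an illustrative age of 40: intercept + slope * 40
display _b[_cons] + _b[age]*40

* Same idea via margins, for several illustrative ages
margins, at(age=(30 40 50))

* Store the fitted value for every observation in a new (hypothetical) variable
predict inctot_hat, xb
```

The display line reproduces by hand what margins reports, which may help connect the output table to the equation of the line.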

Interpretation of the regression output table


Keep in mind that the regression output table above is for a linear regression, meaning that it describes a straight line. And the equation (or mathematical representation) of this “predicted” line is as follows:

y = α + βX

or, in this case:

income = α + β(age)

or, according to the regression output table:

income = _cons + Coef(age)

• _cons: The intercept (α) is the expected mean of Y (income) when X (age) = 0.

• Coef: The coefficient β measures by how much the dependent variable (income, Y) changes when the independent variable (age, X) increases by 1 unit (1 year of life).
    o Q: How would you interpret the coefficient of age in this regression? That is, by how much does income change, on average, when age increases by one unit?

• t = Coef/SE.
    o Sign: The sign tells you whether the estimated coefficient lies to the left (negative) or to the right (positive) of zero. Put differently, it tells you whether X and Y are negatively (inversely: “more of X, less of Y”) or positively (“more of X, more of Y”) associated.
    o Value: As a rule of thumb, the absolute value of t must be greater than 1.96 to reject the null hypothesis that income and age are not related (i.e., that β = 0) at the conventional 0.05 level of significance.
        ▪ Interpretation for t = 1.96: It means that β (the Coef) is 1.96 standard errors (SE) away from 0. In statistical terms, this is a good indication that our estimate of β (the quantification of the association between X and Y) is large enough to be distinguishable from 0. If the true β were 0, an estimate 1.96 or more SEs away from 0 would arise in only about 5% of samples; this is not the same as being 95% certain that X and Y are associated.
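
The t statistic and the 1.96 rule of thumb can be checked by hand from the results Stata stores after a regression. This is a small sketch, assuming the session dataset is already loaded; it only uses stored results from regress and Stata's built-in t-distribution functions.

```stata
* Assumes the session dataset is already loaded
regress inctot age

* t statistic by hand: coefficient divided by its standard error
display _b[age] / _se[age]

* Two-sided 5% critical value for the model's residual degrees of freedom
* (close to 1.96 in large samples)
display invttail(e(df_r), 0.025)

* Two-sided p-value for the null hypothesis that the age coefficient is 0
display 2 * ttail(e(df_r), abs(_b[age] / _se[age]))
```

The first and last values should match the t and P>|t| columns of the regression output table for age.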

II. Multiple Linear Regression Model


• You use a multiple regression model (often loosely called “multivariate”) when you need to “hold constant” variables that may be associated with the relationship of interest (the X-and-Y relationship).
    o This means that you incorporate more [independent] variables into the model.
        ▪ In general terms, what is a “model”?
• A multiple regression model explicitly “holds fixed” other variables. By this we mean that the regression “statistically controls” for other variables. Let’s elaborate on the idea of “statistical control”. To do this, let’s imagine that our new linear model is:

y = α + β₁X + β₂Z

or,

income = α + β₁(age) + β₂(female)

What our new linear model is telling us is that the income of individuals is a function of (i.e., depends on) their age and their gender.
    o By “statistical control” we mean that, in this case, the association between income (Y) and age (X) is estimated independently of the gender (Z) of the individual, and that the association between income (Y) and gender (Z) is estimated independently of the age (X) of the individual.
    o By “independent” we mean that the estimated difference in income between males and females (β₂) is not driven by differences in age between males and females or by changes in income as people age. At the same time, the estimated change in income as people age (β₁) is not driven by differences in income and age between males and females.
    o That’s what a multiple regression does: It separates (isolates) the associations between Y and each of the independent variables, and quantifies their independent associations in the form of a coefficient (the βs).

This is how you run the multiple regression model income = α + β₁(age) + β₂(female) in Stata:

reg inctot age female
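
One way to see what “holding gender constant” means is to remove the gender-related part of both income and age by hand and then regress what is left. This is the Frisch-Waugh-Lovell result, brought in here only as an illustration; it is not part of the original session. The variable names insample, inctot_res, and age_res are hypothetical.

```stata
* Assumes the session dataset is already loaded
regress inctot age female

* Mark the observations used above so that every step uses the same sample
generate byte insample = e(sample)

* Step 1: the part of income not explained by gender
regress inctot female if insample
predict inctot_res if insample, residuals

* Step 2: the part of age not explained by gender
regress age female if insample
predict age_res if insample, residuals

* Step 3: the slope on age_res equals the age coefficient from the full model
regress inctot_res age_res
```

Only the coefficient is expected to match exactly; the standard errors in step 3 differ slightly because the degrees of freedom are counted differently.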

• Interpretation of the coefficients

• By how much does the dependent variable (income) change if the independent variable age increases by one unit, holding all other independent variables fixed/constant (in this case, gender (female))?
    o Answer: For every additional year of age, income increases by $72, on average, holding gender constant.
• Q: How do we interpret the coefficient of the dummy variable female?
    o Answer:
        ▪ Because the variable is coded 1=female, 0=male, a 1-unit change is the same as going from males to females.
        ▪ Accordingly, the coef. for female is the average difference in income between females and males, holding age constant.
        ▪ Because the coef. for being female is negative (β₂ = −2202), we can say that females show an income that is $2202 lower than that of males of the same age, on average.
            • In other words: Holding age constant, the difference in income between males and females is $2202, on average.
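
The interpretations above can be reproduced from the fitted model. For instance, with the age coefficient reported above, ten additional years of age correspond to 10 × 72 = $720 more income, on average, at a fixed gender. The commands below are a sketch, assuming the session dataset is loaded; the age of 40 and the 10-year step are illustrative values only.

```stata
* Assumes the session dataset is already loaded
regress inctot age female

* Average female-minus-male income gap at any given age (the coefficient on female)
lincom female

* Predicted change in income over 10 additional years of age, holding gender fixed
lincom 10*age

* Predicted income for males (female=0) and females (female=1) at age 40
margins, at(age=40 female=(0 1))
```

The two predictions from margins should differ by exactly the coefficient on female, since the model assumes the same age slope for both genders.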

Let’s run a quick visualization (not from the model but from the observed data) on the income
differences by gender as individuals age:

twoway (lfit inctot age if female==1) (lfit inctot age if female==0)
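
The same graph can be labelled, and contrasted with the lines implied by the multiple regression model. This is a sketch, assuming the session dataset is loaded; the axis titles, legend labels, and the variable name inctot_fit are illustrative. The /// line continuation works in a do-file, not when typing in the Command window.

```stata
* Separate fitted lines by gender (each group gets its own intercept and slope)
twoway (lfit inctot age if female==1) (lfit inctot age if female==0), ///
    legend(order(1 "Female" 2 "Male")) ///
    xtitle("Age") ytitle("Fitted income (inctot)")

* Lines implied by the multiple regression model (parallel by construction)
regress inctot age female
predict inctot_fit, xb
twoway (line inctot_fit age if female==1, sort) ///
       (line inctot_fit age if female==0, sort), ///
    legend(order(1 "Female" 2 "Male")) ///
    xtitle("Age") ytitle("Model-predicted income")
```

In the first graph the two lines may have different slopes; in the second they are forced to be parallel, separated by the coefficient on female.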

*** If you have time, go over the following material:

Including a categorical variable in the multiple regression model

Imagine that you also wonder whether the differences in income as people age are related to differences in age by race/ethnicity and, within each race/ethnicity, between males and females. What to do? Yup: run a multiple regression model, now including “race”.

reg inctot age female i.race

• Here, the variable race is coded: 1=white, 2=black, 3=Asian, 4=other.

• Also, remember that Stata automatically separated the categories into dummies for us (by using i.race in the command syntax). You could also have generated the 4 dummies for race and included them in the model one-by-one; for example: reg inctot age female race1 race2 race3…, etc. Note that one dummy must be left out to serve as the base category; if all 4 are included, Stata drops one because of collinearity.
    o In the regression output, the first category, race=1 (white), is treated as the base category (or category of reference). This means that all coefficients related to race are relative to the base category.
    o Q: Can you interpret the coefficient of race2?
        ▪ Answer:
            • Because the dummy is coded 1=black, 0=others, and the base category is whites, a 1-unit change is the same as going from whites to blacks.
            • Accordingly, the coef. for race2 is the average difference in income between black and white people in the sample, holding age and gender constant.
            • Because the coef. for being a black person is negative (β_race2 = −520), we can say that black individuals show an income that is $520 lower than that of white people, on average.
    o Q: Can you interpret the coefficient of race3?
        ▪ Answer: Asian individuals show an income that is $478 lower than that of white people, on average.
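
The manual-dummy route and the choice of base category can be sketched as follows, assuming the session dataset is loaded and race takes the four values listed above. The dummy names race1 to race4 are the ones tabulate would generate from the stub race; treating them as the names used in the session is an assumption.

```stata
* Create one 0/1 indicator per race category: race1, race2, race3, race4
tabulate race, generate(race)

* Leave out race1 (white) so that it serves as the base category
regress inctot age female race2 race3 race4

* Equivalent factor-variable syntax; ib1. makes the base category explicit
regress inctot age female ib1.race

* Use a different base category, e.g. black (race==2)
regress inctot age female ib2.race

* Joint test that all race coefficients are zero
testparm i.race
```

Changing the base category changes the individual race coefficients, because each is a comparison with the base, but not the fit of the model or the coefficients on age and female.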
