Panel 101
Oscar Torres-Reyna
otorres@princeton.edu
See Stock and Watson, Introduction to Econometrics, chapter 10, “Regression with Panel Data”.
Usage
Panel data let us address omitted variable bias caused by unobserved
heterogeneity. They do this by controlling for variables that we cannot
observe, that are not available, and/or that cannot be measured, but that
are correlated with the predictors. There are two types:
1. Variables that do not change over time but vary across entities
(cultural factors, difference in business practices across
companies, etc.) → Entity fixed effects.
2. Variables that change over time but not across entities (e.g.
national policies, federal regulations, international
agreements, etc.) → Time fixed effects.
Some drawbacks when working with panel data are data collection
issues (e.g. sampling design, coverage), non-response in the case of
micro panels, and cross-country dependency in the case of macro
panels (e.g. correlation between countries).
For a comprehensive list of advantages and disadvantages of panel data see Baltagi, Econometric
Analysis of Panel Data (chapter 1).
FIXED-EFFECTS MODEL
(Covariance Model, Within
Estimator, Individual Dummy
Variable Model, Least Squares
Dummy Variable Model)
The fixed effects idea
Entities have individual characteristics that may
or may not influence the outcome and/or
predictor variables. For example, the business
practices of a company may influence its stock
price or level of spending; attitudes or policies
towards guns in a particular state may affect its
levels of gun violence. Business practices and
cultural or political variables are, most of the
time, unavailable or hard to measure.
Since individual characteristics are not random
and may impact the predictor or outcome
variables, we need to control for them. In this
way, the effect of the predictors will not be
influenced by those fixed characteristics.*
The entity fixed-effects model assumes that the
entity’s error term may be correlated with the
predictor variables. However, one entity’s fixed
effect must not be correlated with another
entity’s.
* See Stock and Watson, 2003, pp. 289-290.
The model (1)
The entity fixed effects regression model is
$Y_{it} = \alpha_i + \beta X_{it} + u_i + e_{it}$
$i = 1, \dots, n$; $t = 1, \dots, T$
Where:
$Y_{it}$ is the outcome variable (for entity i at time t).
$\alpha_i$ is the unknown intercept for each entity (n entity-specific intercepts).
$X_{it}$ is a vector of predictors (for entity i at time t).
$u_i$ is the within-entity error term; $e_{it}$ is the overall error term.
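To make the entity-specific intercepts concrete, here is a minimal sketch in Stata on simulated data. The variables id, year, x, y and a_i are hypothetical and are not part of the World Bank example used later. Under these assumptions, the within estimator (xtreg, fe) and the least squares dummy variable regression give the same slope on x.

```stata
* Simulated panel: 30 hypothetical entities observed over 8 years
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()                      // unobserved entity effect
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()              // predictor correlated with a_i
gen y = 1 + 2*x + a_i + rnormal(0, 0.5)  // true slope is 2
xtset id year

* Within (fixed effects) estimator
xtreg y x, fe
scalar b_within = _b[x]

* Least squares dummy variable (LSDV) version: one dummy per entity
regress y x i.id
scalar b_lsdv = _b[x]

display "within = " b_within "   LSDV = " b_lsdv
```

The two slopes should agree up to numerical precision, which is why the names "within estimator" and "least squares dummy variable model" refer to the same model.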
… columns.
• Entity and time in rows.
This format is known as long form.

Entity  Time  Variables
  1      3    #  #  #  #  …
  :      :    :  :  :  :
  2      1    #  #  #  #  …
  2      2    #  #  #  #  …
  2      3    #  #  #  #  …
  :      :    :  :  :  :
  3      1    #  #  #  #  …
  3      2    #  #  #  #  …
  3      3    #  #  #  #  …
Wide form data (time in columns)
If your dataset is in wide format (either entities or time
periods are in columns), you need to reshape it to long format
(you can do this in Stata).
Beware that Stata variable names cannot start with a
number. You need to add a letter to the numbers
before importing into Stata. If you have something
like the following:
Add a letter to the numeric column names, for example,
an ‘x’ before the year:
Reshaping from wide to long
gen id = _n
order id
reshape long x, i(id) j(year)
rename x gdp
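As an illustration of this reshape, the sketch below builds a tiny wide dataset with time in columns and converts it to long form. The country codes and values are made up, and the column names x2000 to x2002 assume that the letter 'x' was added before each year.

```stata
* Hypothetical wide data: one row per entity, one column per year
clear
input str1 country x2000 x2001 x2002
"A" 100 110 120
"B" 200 210 220
"C" 300 310 320
end

gen id = _n                      // numeric entity identifier
order id
reshape long x, i(id) j(year)    // the stub 'x' becomes one column, years go to rows
rename x gdp
list, sepby(id)
```

After the reshape there is one row per entity and year (9 rows here), with the year taken from the numeric suffix of each column name.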
Wide form data (entity in columns)
If the wide format data has the entities in columns
and time in rows, like this example:
Import it into Stata:
Reshape wide to long format
Once in Stata, you can reshape it
using the command reshape:
* Add the prefix ‘gdp’ to the column names. The command ‘renvars’ is
* user-written and must be installed first (see note below).
gen id = _n
order id
reshape long gdp , i(id) j(country) str
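The sketch below shows the same idea on a small made-up dataset with entities in columns. It assumes the columns already carry the prefix gdp (typed directly here rather than added with renvars). The string option tells reshape that the part of the name after the stub is text, not a number.

```stata
* Hypothetical wide data: one row per year, one column per entity
clear
input year gdpA gdpB gdpC
2000 100 200 300
2001 110 210 310
2002 120 220 320
end

gen id = _n                                  // row identifier
order id
reshape long gdp, i(id) j(country) string    // suffixes A, B, C become the string variable country
sort country year
list country year gdp, sepby(country)
```

The result has one row per country and year, with country stored as a string variable.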
Visualizing panel data
* All in one, type:
xtline gdp, overlay
Data example
The data used in the following slides was extracted from the World
Development Indicators database:
https://databank.worldbank.org/source/world-development-indicators
Data was further cleaned to remove regions, subregions, and missing values
across years and variables, resulting in 126 countries.
The variable ‘trade’ was created as the sum of imports and exports.
Data example – histograms
hist gdppc
hist labor
hist trade
Data example – transformations
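The variables ln_gdppc, ln_labor and ln_trade used in the rest of the slides are presumably the natural logs of gdppc, labor and trade. A minimal sketch of that transformation follows, using simulated values as a stand-in for the World Bank data, which is not reproduced here.

```stata
* Simulated, right-skewed stand-ins for the three variables (hypothetical values)
clear
set seed 1
set obs 100
gen gdppc = exp(rnormal(9, 1))
gen labor = exp(rnormal(15, 1.5))
gen trade = exp(rnormal(23, 2))

* Natural-log transformation of each variable
foreach v of varlist gdppc labor trade {
    gen ln_`v' = ln(`v')
}
summarize ln_gdppc ln_labor ln_trade
```

Taking logs pulls in the long right tail of variables such as GDP per capita, and it is what allows the regression coefficients to be read as elasticities.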
Data example – histograms
hist ln_gdppc
hist ln_labor
hist ln_trade
Setting data as panel
The panel variable (country) is in string format (shown in red in the data
browser; type browse country to see it). We need to convert it to labeled
numeric format (numbers with value labels, shown in blue):
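A minimal sketch of this step, assuming the new numeric variable is called country1 (the name that appears in the regression output and in the xtset note later on). The four rows of data are made up for illustration.

```stata
* Hypothetical long-form data with a string panel variable
clear
input str12 country year gdppc
"Argentina" 2000 100
"Argentina" 2001 110
"Brazil"    2000 200
"Brazil"    2001 210
end

encode country, gen(country1)   // numeric codes 1, 2, ... with the names as value labels
xtset country1 year             // declare the panel: entity variable first, then time
```

encode assigns the codes in alphabetical order of the string values, so the numbers themselves carry no meaning beyond identifying the entity.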
Descriptive statistics
. sum gdppc trade labor // Pooled data
                                                F(2,125)          =      87.57
corr(u_i, Xb) = 0.1067                          Prob > F          =     0.0000

                           (Std. err. adjusted for 126 clusters in country1)
------------------------------------------------------------------------------
             |               Robust
    ln_gdppc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ln_trade |   .3603947   .0737076     4.89   0.000     .2145182    .5062712
    ln_labor |    .053167   .1608747     0.33   0.742     -.265224     .371558
       _cons |  -.9384681   1.075791    -0.87   0.385    -3.067592    1.190656
-------------+----------------------------------------------------------------
     sigma_u |  1.1155513
     sigma_e |  .10989953
         rho |  .99038791   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Beta coefficients indicate the change in the output (y) when the predictors
change one unit over time. In this example, all the variables are
log-transformed, so the interpretation is: when the predictor increases 1% over
time, the output (y) changes β% (elasticity).

Two-tail p-values test the null hypothesis that each coefficient is equal to 0
(according to its t-value). A p-value lower than 0.05 rejects the null, and we
conclude that the predictor has a significant effect on the outcome (5%
significance level, i.e. 95% confidence).

Callouts on the command and output: Outcome; Predictor(s); Controlling for
heteroskedasticity; Total number of entities (i).
                                                F(23,125)         =      34.28
corr(u_i, Xb) = 0.7525                          Prob > F          =     0.0000

                           (Std. err. adjusted for 126 clusters in country1)
------------------------------------------------------------------------------
             |               Robust
    ln_gdppc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ln_trade |   .2401329   .0695213     3.45   0.001     .1025416    .3777242
    ln_labor |  -.2958837    .081081    -3.65   0.000     -.456353   -.1354145
             |
        year |
       2001  |   .0119809   .0042779     2.80   0.006     .0035144    .0204475
        ...  |        ...        ...      ...     ...          ...         ...
       2021  |   .2878247   .0705454     4.08   0.000     .1482065    .4274428
             |
       _cons |   7.213881   1.961627     3.68   0.000     3.331578    11.09619
-------------+----------------------------------------------------------------
     sigma_u |  1.0561892
     sigma_e |  .09753735
         rho |  .99154389   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Beta coefficients indicate the change in the output (y) when the predictors
change one unit over time. In this example, all the variables are
log-transformed, so the interpretation is: when the predictor increases 1% over
time, the output (y) changes β% (elasticity).

Two-tail p-values test the null hypothesis that each coefficient is equal to 0
(according to its t-value). A p-value lower than 0.05 rejects the null, and we
conclude that the predictor has a significant effect on the outcome (5%
significance level, i.e. 95% confidence).

Intraclass correlation (rho) shows how much of the variance in the output is
explained by the difference across entities. In this example it is 99%.

$\text{rho} = \dfrac{(\text{sigma\_u})^2}{(\text{sigma\_u})^2 + (\text{sigma\_e})^2}$

sigma_u = sd of residuals within groups $u_i$
sigma_e = sd of residuals (overall error term) $e_{it}$
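As a quick check of the rho formula, the reported sigma_u and sigma_e from the output with year effects can be plugged in directly. The scalar names below are arbitrary.

```stata
* Values copied from the output above (fixed effects with year dummies)
scalar s_u = 1.0561892
scalar s_e = .09753735

* rho = sigma_u^2 / (sigma_u^2 + sigma_e^2)
scalar rho_check = s_u^2 / (s_u^2 + s_e^2)
display %9.7f rho_check
```

The result agrees with the reported rho of .99154389 up to rounding of the inputs, that is, about 99% of the variance is attributed to differences across entities.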
Fixed effects regression using xtreg, fe (with lags on predictors)
$Y_{it} = \alpha_i + \beta X_{i,t-1} + u_i + e_{it}$

                                                F(2,125)          =      81.17
corr(u_i, Xb) = 0.1265                          Prob > F          =     0.0000

                           (Std. err. adjusted for 126 clusters in country1)
------------------------------------------------------------------------------
             |               Robust
    ln_gdppc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ln_trade |
         L1. |   .3385586   .0703993     4.81   0.000     .1992297    .4778875
             |
    ln_labor |
         L1. |   .0581167   .1566956     0.37   0.711    -.2520033    .3682367
             |
       _cons |  -.4600892   1.082489    -0.43   0.672     -2.60247    1.682291
-------------+----------------------------------------------------------------
     sigma_u |  1.1260807
     sigma_e |  .10685653
         rho |  .99107579   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Beta coefficients indicate the change in the output (y) when the predictors
change one unit over time (a year before, “L1.”). In this example, all the
variables are log-transformed, so the interpretation is: when the predictor
increases 1% over time (a year before, “L1.”), the output (y) changes β%
(elasticity).

Two-tail p-values test the null hypothesis that each coefficient is equal to 0
(according to its t-value). A p-value lower than 0.05 rejects the null, and we
conclude that the predictor has a significant effect on the outcome (5%
significance level, i.e. 95% confidence).

Intraclass correlation (rho) shows how much of the variance in the output is
explained by the difference across entities. In this example it is about 99%.

$\text{rho} = \dfrac{(\text{sigma\_u})^2}{(\text{sigma\_u})^2 + (\text{sigma\_e})^2}$

sigma_u = sd of residuals within groups $u_i$
sigma_e = sd of residuals (overall error term) $e_{it}$

Callout on the command: Fixed effects option.
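To show what the lag operator does, the following sketch uses simulated data with hypothetical variables id, year, x and y. Once the data are declared as a panel with xtset, L1.x is the value of x one period earlier within the same entity.

```stata
* Simulated panel: 30 hypothetical entities, 8 years each
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()
gen y = 1 + 2*x + a_i + rnormal(0, 0.5)
xtset id year
sort id year

* The lag is taken within each entity; the first year of each entity is missing
gen x_lag = L1.x
list id year x x_lag in 1/10, sepby(id)

* Fixed effects with a lagged predictor and standard errors clustered by entity
xtreg y L1.x, fe vce(cluster id)
```

Each entity loses its first year in the estimation sample, so this regression uses 7 of the 8 simulated periods.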
Entity fixed effects regression using reghdfe
$Y_{it} = \alpha_i + \beta X_{it} + u_i + e_{it}$
Callouts on the command and output: Outcome; Predictor(s); Fixed effects
option; Controlling for correlation within panels; Total number of cases (rows).
Entity fixed effects regression with lags using reghdfe
$Y_{it} = \alpha_i + \beta X_{i,t-1} + u_i + e_{it}$
Callouts on the command and output: Outcome; Predictor(s); Fixed effects
option; Controlling for correlation within panels; Total number of cases (rows).
NOTE: you must type xtset country1 year before using lags in reghdfe.
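The reghdfe commands themselves are not reproduced in the text, so the following is only a sketch of what such calls might look like, on simulated data with hypothetical variables id, year, x and y. It assumes the user-written packages ftools and reghdfe can be installed from SSC; absorb() plays the role of the fixed effects option and vce(cluster ...) controls for correlation within panels.

```stata
* Simulated panel (hypothetical data)
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()
gen y = 1 + 2*x + a_i + rnormal(0, 0.5)
xtset id year                        // needed before using L1. in reghdfe

* Install the user-written commands if they are not already present
capture which reghdfe
if _rc {
    ssc install ftools
    ssc install reghdfe
}

* Entity fixed effects, standard errors clustered by entity
reghdfe y x, absorb(id) vce(cluster id)
scalar b_hdfe = _b[x]

* Same model with a one-period lag of the predictor
reghdfe y L1.x, absorb(id) vce(cluster id)

* For comparison: the xtreg, fe slope on x
xtreg y x, fe vce(cluster id)
scalar b_xt = _b[x]
display "reghdfe = " b_hdfe "   xtreg, fe = " b_xt
```

With a single absorbed entity effect, the slope from reghdfe should match the one from xtreg, fe.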
A note on fixed effects
“...The fixed-effects model controls for all time-invariant
differences between the individuals, so the estimated coefficients
of the fixed-effects models cannot be biased because of omitted
time-invariant characteristics...[like culture, religion, gender, race,
etc].
One side effect of the features of fixed-effects models is that they
cannot be used to investigate time-invariant causes of the
dependent variables. Technically, time-invariant characteristics of
the individuals are perfectly collinear with the person [or entity]
dummies. Substantively, fixed-effects models are designed to
study the causes of changes within a person [or entity]. A time-
invariant characteristic cannot cause such a change, because it is
constant for each person.” [(Underline is mine) Kohler, Ulrich,
Frauke Kreuter, Data Analysis Using Stata, 2nd ed., p.245]
RANDOM-EFFECTS MODEL
(Random Intercept, Partial
Pooling Model)
The random effects idea
The rationale behind the random effects model is that, unlike the
fixed effects model, the variation across entities is assumed
to be random and uncorrelated with the predictor or
independent variables included in the model:
“...the crucial distinction between fixed and random effects is
whether the unobserved individual effect embodies elements that
are correlated with the regressors in the model, not whether these
effects are stochastic or not” [Greene, 2008, p.183]
If you have reason to believe that differences across entities
have some influence on your dependent variable but are
uncorrelated with the predictors, then you should use random
effects. An advantage of random effects is that you can include
time-invariant variables (e.g. gender). In the fixed effects model
these variables are absorbed by the intercept.
Random effects assume that the entity’s error term is not
correlated with the predictors which allows for time-
invariant variables to play a role as explanatory variables.
In random effects you need to specify those individual
characteristics that may or may not influence the
predictor variables. The problem with this is that some
variables may not be available, which leads to
omitted variable bias in the model.
RE allows you to generalize the inferences beyond the sample
used in the model.
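The point about time-invariant variables can be illustrated with a short sketch on simulated data. The variable group is a hypothetical time-invariant trait, and the entity effect is generated to be uncorrelated with x, which is the random effects assumption.

```stata
* Simulated panel: 50 hypothetical entities, 6 years each
clear
set seed 7
set obs 50
gen id = _n
gen group = mod(id, 2)               // time-invariant characteristic (0/1)
gen u_i = rnormal()                  // entity effect, uncorrelated with x
expand 6
bysort id: gen year = 2000 + _n
gen x = rnormal()
gen y = 1 + 2*x + 1.5*group + u_i + rnormal(0, 0.5)
xtset id year

* Fixed effects: group is omitted because it is collinear with the entity effects
xtreg y x group, fe

* Random effects: group gets its own coefficient
xtreg y x group, re
display "RE coefficient on group = " _b[group]
```

In the fixed effects output Stata reports group as omitted, while the random effects model estimates its effect.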
Random effects regression using xtreg, re
$Y_{it} = \alpha + \beta X_{it} + u_{it} + e_{it}$

                           (Std. err. adjusted for 126 clusters in country1)
------------------------------------------------------------------------------
             |               Robust
    ln_gdppc | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    ln_trade |   .4175909   .0760404     5.49   0.000     .2685543    .5666274
    ln_labor |  -.1597685   .1312262    -1.22   0.223    -.4169671    .0974302
       _cons |   .9295612   .6361615     1.46   0.144    -.3172923    2.176415
-------------+----------------------------------------------------------------
     sigma_u |  .41594682
     sigma_e |  .10989953
         rho |  .93474564   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Beta coefficients indicate the change in the output (y) when the predictors
change one unit over time and across entities (average effect). In this
example, all the variables are log-transformed, so the interpretation is: when
the predictor increases, on average, 1%, the output (y) changes β% (elasticity).

Two-tail p-values test the null hypothesis that each coefficient is equal to 0
(according to its z-value). A p-value lower than 0.05 rejects the null, and we
conclude that the predictor has a significant effect on the outcome (5%
significance level, i.e. 95% confidence).

Callout on the command: Random effects option.
Which to choose?
Whenever there is a clear idea that individual characteristics of
each entity or group affect the regressors, use fixed effects. An
example is macroeconomic data collected for most countries
over time: there might be a good reason to believe that
countries’ economic performance may be affected by their
own internal characteristics (type of government, political
environment, cultural characteristics, type of public policies,
etc.).
Random effects is used whenever there is reason to believe
that individual characteristics have no effect on the regressors
(uncorrelated).
The Hausman test checks whether the individual characteristics are correlated with the regressors
(see Greene, 2008, chapter 9). The null hypothesis is that they are not (random effects is preferred).
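A minimal sketch of how the Hausman test is typically run in Stata follows, using simulated data with hypothetical variables id, year, x and y. The stored-estimate names fixed and random are arbitrary. Both models are fit with default (non-robust) standard errors, because hausman does not accept robust or clustered variance estimates; the sigmamore option is one common choice for the variance comparison and is an assumption here, not something the slides specify.

```stata
* Simulated panel in which x is correlated with the entity effect
clear
set seed 2024
set obs 100
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = a_i + rnormal()
gen y = 1 + 2*x + a_i + rnormal(0, 0.5)
xtset id year

xtreg y x, fe
estimates store fixed

xtreg y x, re
estimates store random

* H0: the difference in coefficients is not systematic (random effects is fine)
hausman fixed random, sigmamore
display "chi2 = " r(chi2) "   p-value = " r(p)
```

A small p-value (for example, below 0.05) rejects the null and points to fixed effects. Because x is built to be correlated with the entity effect here, a rejection would be the expected outcome, although the exact statistic depends on the simulated draw.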
TESTS / DIAGNOSTICS
Do we need time fixed effects?
To see if time fixed effects are needed when running a FE model, use
the command testparm. It is a joint F-test of whether the coefficients on
all year dummies are jointly equal to 0 (type help testparm for more details).
 ( 1) 2001.year = 0
 ( 2) 2002.year = 0
 ( 3) 2003.year = 0
 ( 4) 2004.year = 0
 ( 5) 2005.year = 0
 ( 6) 2006.year = 0
 ( 7) 2007.year = 0
 ( 8) 2008.year = 0
 ( 9) 2009.year = 0
 (10) 2010.year = 0
 (11) 2011.year = 0
 (12) 2012.year = 0
 (13) 2013.year = 0
 (14) 2014.year = 0
 (15) 2015.year = 0
 (16) 2016.year = 0
 (17) 2017.year = 0
 (18) 2018.year = 0
 (19) 2019.year = 0
 (20) 2020.year = 0
 (21) 2021.year = 0
The Prob > F is < 0.05, so we reject the null that the coefficients for
the years are jointly equal to zero. In this case, time fixed effects are needed.
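The sequence of commands behind this test can be sketched as follows on simulated data (hypothetical variables id, year, x and y, with a common time trend added so that the year dummies have something to pick up).

```stata
* Simulated panel with a common time trend
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()
gen y = 1 + 2*x + a_i + 0.3*(year - 2000) + rnormal(0, 0.5)
xtset id year

* Fixed effects with year dummies (i.year), standard errors clustered by entity
xtreg y x i.year, fe vce(cluster id)

* Joint test that all year coefficients are zero
testparm i.year
display "F = " r(F) "   Prob > F = " r(p)
```

With 8 years there are 7 year dummies in the test (the first year is the base category), just as the 22 years of the example yield the 21 restrictions listed above.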
Estimated results:
| Var SD = sqrt(Var)
---------+-----------------------------
ln_gdppc | 2.022383 1.422105
e | .0120779 .1098995
u | .1730118 .4159468
xttest2
. xttest2
[OMITTED]
Pr < 0.05, so we reject the null hypothesis and conclude that the panels are
correlated (cross-sectional dependence).
Are the panels correlated? [Pesaran CD test]
As mentioned in the previous slide, cross-sectional dependence is more of an issue in macro panels
with long time series (over 20-30 years) than in micro panels.
The Pesaran CD (cross-sectional dependence) test is used to test whether the residuals are
correlated across entities*. Cross-sectional dependence (also called contemporaneous
correlation) can lead to bias in test results. The null hypothesis is that residuals are not
correlated. The command for the test is xtcsd. It is user-written, so you have to install it by typing:
ssc install xtcsd
If cross-sectional dependence is present, Hoechle suggests using Driscoll and Kraay standard errors
with the command xtscc (install it by typing ssc install xtscc). Type help xtscc for more
details.
*Source: Hoechle, Daniel, “Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence”,
http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf
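The two commands mentioned above could be used roughly as follows. This is a sketch on simulated data (hypothetical variables id, year, x and y) and assumes both user-written packages can be installed from SSC. The option pesaran abs requests the CD test and also reports the average absolute correlation of the residuals.

```stata
* Simulated panel (hypothetical data)
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()
gen y = 1 + 2*x + a_i + rnormal(0, 0.5)
xtset id year

* Install the user-written commands if needed
capture which xtcsd
if _rc ssc install xtcsd
capture which xtscc
if _rc ssc install xtscc

* Pesaran CD test, run after the fixed effects regression
xtreg y x, fe
scalar b_fe = _b[x]
xtcsd, pesaran abs

* Fixed effects with Driscoll and Kraay standard errors
xtscc y x, fe
scalar b_scc = _b[x]
```

Only the standard errors change with xtscc; the fixed effects point estimate should be the same as in xtreg, fe.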
Testing for heteroskedasticity
A test for heteroskedasticity is available for the fixed-effects model using the
command xttest3. The null hypothesis is homoskedasticity (or constant
variance). This is a user-written program; to install it, type:
ssc install xttest3
xtreg ln_gdppc ln_trade ln_labor, fe robust
xttest3
. xttest3
Modified Wald test for groupwise heteroskedasticity
in fixed effect regression model
NOTE: Use the option ‘robust’ to obtain heteroskedasticity-robust standard errors (also known
as Huber/White or sandwich estimators).
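The effect of the robust option can be seen in a small sketch on simulated data in which the error variance differs across entities (hypothetical variables id, year, x and y). Point estimates are unchanged; only the standard errors differ. As the output earlier in these slides indicates, robust standard errors in xtreg, fe are clustered on the panel variable.

```stata
* Simulated panel with groupwise heteroskedasticity
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 8
bysort id: gen year = 2000 + _n
gen x = 0.5*a_i + rnormal()
gen y = 1 + 2*x + a_i + rnormal(0, 0.2 + 0.1*id)   // error sd grows with id
xtset id year

capture which xttest3
if _rc ssc install xttest3

* Default standard errors, followed by the modified Wald test
xtreg y x, fe
scalar b_plain  = _b[x]
scalar se_plain = _se[x]
xttest3

* Heteroskedasticity-robust standard errors
xtreg y x, fe vce(robust)
scalar b_rob  = _b[x]
scalar se_rob = _se[x]
display "se (default) = " se_plain "   se (robust) = " se_rob
```

If xttest3 rejects homoskedasticity, the robust standard errors are the ones to report.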
Testing for serial correlation
Serial correlation tests apply to macro panels with long time series (over 20-30 years).
It is not a problem in micro panels (with very few years). Serial correlation causes the
standard errors of the coefficients to be smaller than they actually are and the
R-squared to be higher. A Lagrange-Multiplier test for serial correlation is available using the command
xtserial. This is a user-written program; to install it, type:
ssc install xtserial
xtreg ln_gdppc ln_trade ln_labor, fe robust
xtserial ln_gdppc ln_trade ln_labor
. xtserial ln_gdppc ln_trade ln_labor
The null is no serial correlation. Above we fail to reject the null and conclude the data does not have first-
order autocorrelation. Type help xtserial for more details.
Source: Hoechle, Daniel, “Robust Standard Errors for Panel Regressions with Cross-Sectional
Dependence”, page 4, http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf
Suggested books / references
• Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson
Addison Wesley, 2007.
• Econometric Analysis of Panel Data, Badi H. Baltagi, Wiley, 2008.
• Econometric Analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008.
• An Introduction to Modern Econometrics Using Stata/ Christopher F. Baum, Stata Press, 2006.
• Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill.
Cambridge ; New York : Cambridge University Press, 2007.
• Data Analysis Using Stata / Ulrich Kohler, Frauke Kreuter, 2nd ed., Stata Press, 2009.
• Statistics with Stata / Lawrence Hamilton, Thomson Brooks/Cole, 2006.
• Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam
Kachigan, New York : Radius Press, c1986
• “Beyond ‘Fixed Versus Random Effects’: A framework for improving substantive and statistical
analysis of panel, time-series cross-sectional, and multilevel data” / Brandon Bartels
http://polmeth.wustl.edu/retrieve.php?id=838
• “Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence” / Daniel
Hoechle, http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf
• Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert
O.Keohane, Sidney Verba, Princeton University Press, 1994.
• Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King,
Cambridge University Press, 1989.