Stata
Felix Bittmann
www.degruyter.com
Contents
1 Introduction 1
1.1 Formatting 1
1.2 Graphic style 2
1.3 Version info 3
1.4 Online resources 3
1.5 Cheat sheet 3
4 Describing data 46
4.1 Summarizing information 47
4.2 Using stored results* 49
4.3 Histograms 51
4.4 Boxplots 53
4.5 Simple bar charts 55
4.6 Scatterplots 56
4.7 Frequency tables 58
4.8 Summarizing information by categories 61
4.9 Editing and exporting graphs 64
4.9.1 Combining graphs 65
4.10 Correlations 67
4.11 Testing for normality 68
4.12 t-test for groups* 69
4.13 Weighting* 71
6 Regression analysis 83
6.1 Research question 83
6.2 What is a regression? 84
6.3 Binary independent variable 85
6.4 Ordinal independent variable 87
6.5 Metric independent variable 91
6.6 Interaction effects* 94
6.6.1 The classic way 96
6.6.2 Marginal effects 96
6.6.3 Predicted values 97
6.6.4 Separate analyses by subgroups 99
6.7 Standardized regression coefficients* 100
9 Matching 134
9.1 Simulating an experiment 134
9.2 Propensity score matching 135
9.3 Matching diagnostics 138
9.3.1 Common support 138
9.3.2 Balancing of covariates 139
References 155
Copyright 157
Index 158
List of Notes
1 Introduction
Congratulations! As you are holding this book in your hands right now, this probably
means you want to (or have to) work with Stata, the best statistical software package
available! It does not matter whether you are completely new to the field of data science
or an experienced veteran: Stata will hopefully enable you to answer your research
questions correctly, efficiently and in an enjoyable fashion. As you bought this book for
a really short and direct introduction, let's skip the formal chit-chat and start right away.
Please…
If you like this book, tell your friends and colleagues, as it might help them
get to know Stata. If you don't like this book, tell me, so I have the chance to make
things better in the next edition. You can always send emails to mail@statabook.com.
Thank you!
This book has benefited enormously from people who supported me. I want to thank Bill
Rising, Dahla Rodewald, Elsje Nieboer, Marie Pauline Burkart, Markus Kreggenwinkel,
Minh Nguyet Pham, Steffen Schindler, Svenja Dilger and Viktoria Sophie Zorn.
Furthermore I want to acknowledge that the general outline of this book is based on
the teachings of Michael Gebel (while I take responsibility for any errors remaining)!
1.1 Formatting
To make reading and finding things easier, this book uses formatting to highlight text.
Stata commands are printed like this
use "testfile.dta"
You can enter these commands directly or write them in do-files. Point-and-click path-
ways are written in bold.
Some additional information (Notes), which can be skipped by readers in a hurry,
is printed in boxes separated by horizontal bars. However, you are strongly advised
to come back later and have a look.
Some headings are marked with an asterisk (*). These chapters contain slightly
more advanced topics that might not be interesting for your very first seminar paper
with Stata (but maybe for your second).
Some Stata commands can be abbreviated and are thus easier to type. Within a
few hours of use you will also come to the conclusion that the shorter form is often
preferable. Throughout the book, we will always use the complete form, yet underline
the part that can be used as the short command, for example
tabulate age
tab age
Please note that I will not always provide the shortest abbreviation possible, but the
command that I encountered most often in my work and the work of others.
This book is printed without colors; therefore, all graphics are in black and white.
While this is fine for most output Stata generates, some graphs use colors to highlight
differences between groups. To visualize these differences without colors, a certain color
style (scheme) is used (Bischof, 2017).1 Therefore, visually, the results presented in
this book might differ slightly from what you see on your computer screen (while the
interpretation is identical). If you want to receive exactly the same output, you
have to set the scheme when you start Stata. You can do this by entering the following
in your command line
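Assuming the scheme in question is plotplain from Bischof (2017), the command would be
set scheme plotplain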
Now Stata will employ this style every time it produces a graphic. To revert to the
standard settings either restart Stata or enter
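Presumably this refers to the factory default scheme of Stata 15, s2color:
set scheme s2color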
This book is written for Stata 15. If your interface, icons, graphs, tables or path descrip-
tions are slightly off, it may be that you are using a different version. This might be
somewhat inconvenient, yet the basic commands are in almost all cases identical.
Throughout the book, we will mostly rely on one dataset, the NLSW88 data. The
dataset can always be downloaded using the command
webuse nlsw88
or you can download it manually from Stata’s website2 for Stata 8 or newer.
You can find complete do-files for every chapter as well as additional material online
(www.statabook.com).
Throughout the book, you will learn many commands that you don't need to memo-
rize. However, it can be helpful to write down important information by hand, which
will also help you remember it better. Finally, I want to recommend some excellent
cheat sheets which were created by Tim Essam and Laura Hughes. Beginners and
experts will find these equally useful.3
2 http://www.stata-press.com/data/r8/nlsw88.dta (2018-02-05).
3 https://geocenter.github.io/StataTraining/ (2018-11-06)
2 The first steps
Let’s get started! I will assume that Stata is installed and running on your computer.
In this chapter you will learn what Stata looks like, how to open and save files and
document your work.
1 If what you see is completely different, someone before you might have customized Stata. To return
to Factory (Standard) settings, click Edit → Preferences → Manage Preferences → Factory Settings.
But please make sure to ask your colleagues before if you work on a shared computer!
One advantage of Stata is that you do not have to use the command line, as almost all methods can
be run with point-and-click as well. Yet with more experience you will naturally
come to the conclusion that it is faster and more efficient to use commands.
Stata is not about memorizing abstract keywords but about getting stuff done.
Throughout the book, I will provide commands and point-and-click pathways for
most methods.
3. is the Command history, where all commands, either entered by you manually
or automatically created by Stata after a point-and-click, will appear. This is of
great value as it makes it easy to check what you have already done with your
data, and save your work history for replication purposes.
4. shows all variables that are included in your dataset. As you can only have
one dataset open at a time in Stata, this list always refers to your currently
opened data file. One great feature of this window is the little search bar at the
top: you can enter any text and Stata will show any matches within variable
names or variable labels, which makes finding the desired variable extremely
convenient.
5. will display detailed information about the variable you clicked on in the variable
list window. Here you can see the full name, labels (more about that later) or
notes. When working with large datasets with many variables, this information
can help you a lot when it comes to identifying the correct variable.
6. is the Toolbar, where you can click to reach all needed commands and tools.
The arrow at the top shows the current version of Stata which is installed
on your computer. Now and then, updates bring cool new features and methods
or fix bugs. Luckily, the basics are the same for most versions and will not
change anytime soon, meaning that you can read this book using version 10, 15 or
any future release of Stata.
The arrow at the bottom shows your current working directory, which is explained
below.
Chances are great that you already have a Stata data file (which you can recog-
nize by the suffix .dta). If this is not the case, we will talk about entering your own
data or importing datasets from other software below. You could open the file right
away, but you probably have several data files in the same folder. It is a good habit
to establish one working folder (with the possibility of sub-folders) to structure
your data. Your results and efficiency will definitely benefit from doing this! To
change the current working folder, click File → Change Working Directory and
select the desired folder or enter the following command in the command line (then
hit Enter):
cd "C:/Users/username/stata-course/"
Please note that the quotation marks are required when your folder name includes
spaces. You probably want to see which files are contained in your current working
directory. Just type dir or ls.
C:\Users\username\stata-course\example.dta
/home/username/stata-course/example.dta
Note the different separators (\ vs. /). Luckily, Stata knows how to deal with this problem and automatically
translates separators if needed. Therefore, it is recommended to use the Linux/Mac-style forward slashes like this:
use "C:/Users/username/stata-course/example.dta"
This command will work on Windows, Linux and Mac as long as the respective path exists. Throughout
the book we will, therefore, use this notation as it is compatible with all systems.
Assuming the file you want to open is in your current working directory, type
use "filename.dta"
or just
use "filename"
as the extension is optional and Stata will recognize the correct file. When the file is
not in the current working directory, you can use an absolute pathway. For example,
type
use "C:/Users/username/stata-course/filename.dta"
2 Of course, you must change my generic example path between the quotation marks to the path
where your own files are located on your computer. Also read the next info box to understand why
Stata uses different path separators than Windows.
Alternatively use File → Open and select the desired file. By the way, this command
also works for opening files directly from the Internet:
use "http://www.stata-press.com/data/r8/nlsw88.dta"
Sometimes your dataset is not directly available as a Stata .dta file. In this case you
can import some other formats, for example from Excel, SAS, XML, Comma Separated
Values or unformatted text data. You can see all options by clicking File → Import.
When you click the desired format, a dialogue will open and help you with importing
the data. Other options can be inspected by typing
help import
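As an illustration (the file name here is hypothetical), an Excel sheet whose first row contains the variable names could be imported with
import excel "mydata.xlsx", firstrow clear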
Unfortunately, there is still no quick option for importing files created by SPSS
(.sav) directly into Stata. Luckily, there exists a user-written command (community-
contributed software or CCS). Please refer to page 9 to learn how you can install this
little helper program into Stata.
Sometimes you want to enter data by hand. This is feasible when there are only
tiny bits of information, or for testing purposes. Real-world datasets
are often very large, containing information about thousands of people (or whatever
you study). Entering these amounts of information by hand is clearly insane and not
advised, as it is time-consuming and very error-prone.
To fill in your data values, enter
edit
or click the Data Editor (Edit) button. A new window will pop up (the Data
Editor, Figure 2.2) that displays raw information. Each line (row) stands for
one case (observation/person). The variables are shown in the columns on top
(grey cells). Double-click one of these and a new window will appear. Enter the
name of the variable (for example age) and choose “Fill with missing data”.
Click OK.
Now you can click the cells below the newly created variable and enter values for
each case. Repeat for each desired variable and case to fill in your data. When you are
done, close the window. When you want to browse your data, which is a common
task, for example to check whether data transformations were successful, it is better to use the
browsing window, as there you cannot accidentally change data values. Just enter
browse
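If you prefer commands over the Data Editor, small amounts of data can also be typed in directly with the input command. A minimal sketch (variable names and values are only examples):
clear
input str10 name age income
"Jeff" 23 2300
"Dave" 55 3400
end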
One of the great advantages of Stata is that it comes with tons of tested and refined
datasets that are preinstalled or available on the net. To see the available datasets,
click File → Example Datasets and then Example datasets installed with Stata. Then
click use to load the desired dataset. Throughout the book we will rely on preinstalled
or online datasets as they are very easy to open and perfect for learning. For example,
when you want to open auto.dta, you can also enter directly
sysuse auto
The dataset is opened into Stata. If there is already unsaved data in memory, you have
to save or close it first in order to open the new file.
Further information on using SPSS files: a CCS for importing is provided by Sergiy Radyakin (Windows
only). You can check out his website3 to learn how to install it.
After you are done with your work you usually want to save your data. Two notes of
caution. Firstly, never overwrite your original dataset as any changes you make are
permanent! If you made an error there is no way to correct it later. Therefore, you
should save the dataset with a new name. Secondly, it is usually a good idea to export
your dataset to an open file format, after your project is finished. This makes it much
easier for your colleagues and other researchers to open and handle your data in the
future. A good option for doing this is saving as a .csv file, which can be opened on
any computer, even without Stata. The biggest downside of these open formats is that
all metadata is lost. Thus, it is suggested to use different formats simultaneously when
archiving your projects. Remember, exporting need not be done in your daily work
routine, but only once you are finished, for archiving purposes.
To save your dataset, click File → Save as and choose the desired folder and name.
The dialogue also makes it possible to save your files for older Stata versions. For example,
version 12 cannot open files saved with Stata 15, so when some of your colleagues still
use an older version you can set the option. If you want to use the command, type
3 http://www.radyakin.org/transfer/usespss/faq/usespss_faq.html (2018-01-16).
save "filename.dta"
Again, when you use this relative pathway, your file will be saved in the current
working directory. To export your dataset, click File → Export → Comma- or tab
separated data or type
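The export command is not reproduced in this extract; given footnote 4, it is presumably export delimited:
export delimited using "filename.csv"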
When exporting to text files, pay attention to how your non-integers are formatted.
The exporting process will save these numbers with the displayed format; this may
cause loss of precision. For more information on this issue type
help format
Now you know how to open and save files, which means you can start managing and
analyzing data. Beforehand, a few more words on the basic Stata workflow, which
will save you many hours of trouble in the future, as good routines and basic safety
measures will make working with Stata so much better (I promise!).
First of all, as mentioned above, never ever overwrite or change your original
dataset. Sometimes this can be a minor problem when the dataset is downloaded
from the Internet and can be restored. In other cases, as with data that you person-
ally put together, this might be impossible. Therefore, whenever you receive a new
dataset: make a copy and archive it. A very convenient method is to use the built-in
compressing function of your computer and save your data file as a “zipped” archive
(for example as a .zip or .rar file) with the current date. By doing this, you always have
a copy you cannot accidentally change as it is compressed.
After this, we can start with do-files, which will radically improve your workflow
and make it easy and convenient to replicate what you did (for you and other people).
Basically, a do-file is a plain text file that contains all commands you use to alter or
analyze your dataset. When you come back later, maybe in some years, the do-file
tells you what you did with your data in the first place, and makes it possible to repli-
cate your previous findings. This is also crucial for science as a whole, as replicability
of data analyses is important to tell other researchers how you came to your results.
When needed, other researchers can reconstruct your analysis. This underlines how
intersubjectivity can be reached. It still might be the case that other people do not
agree with you on all details, as they might have chosen different variables or another
4 If this command does not work as you are using an older version, try outsheet using "filename.csv"
operationalization, yet you can document and justify your proceedings. Finally, when
you come to the conclusion that you have had enough Stata for a day, and want to
save your results, it is more common to save the do-files with all your commands than
to save changes directly into a .dta file. Thus, you are strongly advised to use do-files
every time when working with Stata. Although it might seem like extra work at first,
after a week you will have internalized this good habit.
2.8 Do-files
Usually, the first thing you want to do after starting Stata is to get your do-file
running. When you already have one, from the last session or from your col-
leagues, you can open it by clicking Window → do-file editor → New do-file
editor or type
doedit
A new window will appear (Figure 2.3). In that window, click File → Open and open
the file. You can recognize do-files by the suffix .do. Otherwise, you can just start by
typing your first commands in the empty Do-file Editor.
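The example do-file header itself is not reproduced in this extract; based on the description that follows, it consists of these four lines:
clear all // remove any data and other objects from memory
version 15 // declare the Stata version used
capture log close // close any log file that may still be open
set more off // show all output without pausing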
The first line (clear all) removes any datasets, saved variables, macros or loops from
the memory. This means you get a fresh start and have the system ready to work. Of
course, before running the code make sure that you have saved all your current data-
sets, otherwise the data will be lost!
The second line (version 15) tells Stata which version you are currently running.
This is important for archiving purposes. Now suppose you used Stata 10 in the past,
but then you switched to the newest version and some command syntax was changed
in the meantime. If you ran your old do-file from version 10 in Stata 15, it might
not work and produce an error message instead. Although this is rarely the
case, it is good practice to include this line. You can see your current version directly
on the output screen after Stata has started (or just type version).
The third line (capture log close) checks whether an active log-file is running. If
so, it will close it.6 For more information about logs see page 12.
The last line (set more off) is useful for long do-files that should run without any inter-
ruption, and is especially important when you are not running version 15 or newer. Some
commands produce a lot of output and older versions will halt, so the user can check it.
You can either hit the space bar or click “Show more” to see the next few lines of output. As
this gets tiring after a while, automatically showing all lines is quite convenient.
You may have also noticed the comments after the commands shown above.
When working with do-files, it is highly recommended that you comment on (some)
commands. Later, they will be useful to help you remember what you did. To write
a comment after a command just type two slashes (//). Any character in the same
line after them will be ignored by Stata. When you need comments that stretch over
several lines, use
5 When entering commands from this book, you do not have to type the comments which start with //.
These are only in the book to help you understand what is going on, and always ignored by Stata
when running commands.
6 The command capture can be used in combination with any other command. If the second com-
mand fails and produces an error, Stata then ignores this and continues with the rest of the code.
Therefore, using capture always requires caution.
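The multi-line comment example is not reproduced in this extract; in Stata, such comments use the /* */ syntax, for instance:
/* this comment
   stretches over
   several lines */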
The indentation of the second and third line is not required, but might improve
readability. You will also notice that Stata changes the color of comments so they
are even clearer to the user. It is also a good idea to use headlines and sub-headlines
to structure your do-files. Starting a line with an asterisk (*) will do this for
you. As a general rule, it is a good idea to comment on complex parts of the code that
will help you remember how they work. Simple commands like opening or saving
files do not deserve comments in real files, as this might clutter your do-file and
affect readability.
By the way, when you have a really long command in your do-file, you can use a
triple slash (///). Just continue your command in the next line, for example
drop if age < 40 & sex == 1 & income > 4000 ///
& BMI > 28 & wave == 5
This is useful for readability, but Stata does not care how many characters are in one
line.
Now you can start with actual work, by opening the desired file. You can either
type the command to open the file in the command line directly and hit Enter, or
you can type it into the do-file, or you can use point-and-click. When you typed it
into the do-file, you will notice that nothing has happened yet. You have to execute
the command so Stata will start working. You can do this by selecting the desired
line(s) or all lines in the do-file with the mouse, then enter Ctrl + D. Alternatively,
click Tools → Execute (do). You do not have to select the desired line(s) entirely, but
at least one character in each line has to be selected. If zero characters are selected,
Stata will run the entire do-file from top to bottom. When you look at your output
window, you will see that Stata opened the desired file. When you use point-and-
click Stata will automatically create the command that runs “behind the scenes” and
save it in the Review window. It is always a good idea to inspect these commands to
see how Stata works.
To summarize, you basically have three options for saving your commands using
do-files:
– Type all commands in the do-file directly and run them from there.
– Type all commands in the interactive Command line and after the command was
run successfully, right-click on the command in the history-window and copy-
paste it into your do-file.
– Type all commands in the interactive Command line and let Stata copy them
automatically into the file. The advantage of this method is that you do not have
to do it manually afterwards; the problem is that misspecified commands, say
7 If you want to know more about the double equal sign, see page 31.
with a typo, will be copied as well, so you later have to come back and check what you
did. If you prefer this third option, type
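The command itself is not reproduced in this extract; presumably it is cmdlog:
cmdlog using "filename.do", append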
Stata will create a new file called filename.do or, if it already exists, append all new
commands to the end of it.
Congratulations! By reaching this point, you have mastered the first part of the
book. Now you know how to open, save and manage your files. You are ready to start
some real data manipulation in the next chapter!
When using do-files, Stata expects one command per line. You can create longer commands
by using three slashes (///), yet even this solution can be problematic when
dealing with very long commands. Especially when recoding variables with several
categories or creating sophisticated graphs, many lines will be needed. To avoid a
structure cluttered with slashes, Stata offers the possibility of using a semicolon (;) as an
artificial line break set by the user. Any other line breaks will be ignored until Stata
reaches a semicolon. To do this, type in your do-file8:
#delimit ;
//your very long command goes here;
#delimit cr
Until the #delimit cr part is reached, Stata will only see ends of lines when reaching
a semicolon. Some users prefer to use this option as a default. The problem is that
forgetting a semicolon at the end of a command will mess your code up.
8 This command will actually work only in a do-file and not when typed directly into the command
window.
A Log-File records your results as plain text and can later be opened
in Stata or any text editor. This information will also be available if Stata or your computer crashes.
Finally, this can be really helpful when it comes to replication purposes. I do not advise using Log-Files
for every single small output you produce, but rather for larger chunks of your work, so that you can
put these in an archive.
The name option allows you to name the current Log-File, which is always a good idea and makes
nested logging possible. When you already have an existing Log-File that you want to overwrite, add
the option replace
When you want to extend the existing Log-File without overwriting the older data, use the option
append instead. At the end of your do-file put the line
to terminate the logging. The Log-File will be created in your current working directory. You can also
create and manage Log-Files in the GUI by clicking File → Log → Begin…
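The individual log commands are not reproduced in this extract; they presumably look like the following (the file name and log name are placeholders):
log using "mylog", name(mylog) // start a named Log-File
log using "mylog", name(mylog) replace // overwrite an existing Log-File
log using "mylog", name(mylog) append // extend an existing Log-File
log close mylog // terminate the logging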
For large projects, like a seminar paper or a thesis, you will clearly have an extra folder on your com-
puter. Within this, creating four Stata folders is often useful.
Data contains all your data files, like raw data or .dta files
Do contains all do-files that you created
Log contains all Log-Files Stata created for you
Graph contains all graphics and figures you created with Stata
By doing this, it will be easier for you to find the correct file quickly and to keep a good overview of your
project. Over time you will develop your own system to manage your data.
3 Cleaning and preparing data
Now that you are familiar with the basic steps of working with Stata, we will focus on real
data editing and work through some examples. In this chapter, we will start to prepare
data. Over time you will see that running actual statistical methods is often done very
quickly, yet cleaning and preparing data, so you can use it for your desired methods,
will take the lion’s share of your time.
Suppose you get a new dataset for the very first time, maybe from your colleagues, a
research institute or a government agency. In the best case, there exists information
about how the data was collected, generated and processed. Then it is a great idea
to first study these documents, so you know what you are working with. As we are
realists, we assume there is no documentation, and we have to find out the important
stuff on our own. In this chapter we will use a preinstalled dataset from the National
Longitudinal Survey that contains information about the working careers of women
in their thirties and forties. The data was collected in the USA in 1988. We start with a
fresh Stata window and open the dataset by typing
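sysuse nlsw88, clear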
Sysuse opens a preinstalled (“system”) dataset, nlsw88 is the name of the specific dataset,
and the option clear tells Stata to delete any loaded data in memory, if there is any.1
The best idea is to get a rough overview of the data, so we see which variables are
contained and what they are about. Try
describe
or click Data → Describe data → Describe data in memory or in a file. Another tip:
when you want to see the dialogue box but do not want to click through the entire menu
tree, just type db, followed by the command of interest (for example: db describe).
We get a large amount of information. There are 2,246 obs (observations, cases)
in the file. 17 vars (variables) are contained in the dataset which are listed separately
below. Size tells us how large the dataset is (about 60kb).
Now to the details about the variables. Variable name tells us the name of the
variable (duh!). Note that some of them seem quite obvious (wage), while others
1 If you want to see all available preinstalled datasets on your computer, type sysuse dir.
are rather cryptic (ttl_exp). To find out what this means, we can look at the variable
label, which contains a description of each variable. It is usually a good idea to keep
variable names short and concise, as they are often typed and should not clutter our
code. Still, they should be meaningful and tell us roughly what they are about. Longer
descriptions can go with the variable labels, which are especially helpful for people
who are new to the dataset. For the moment we can safely ignore storage type and
display format. Lastly, we see that some variables have an entry under value label,
others do not. This basically tells us whether the values of the variable are named
(labeled). We will deal with these aspects of data management in the following sec-
tions. The next figure (Figure 3.1) will help you to understand what these are all about.
Listing cases
Sometimes you see strange behavior in your data and want to look at certain cases that might contain
wrong or unusual data. You can do this by using the Data Editor window which might be a good idea
when your dataset is small. When you have thousands of cases and hundreds of variables, this might
not help you. A possible solution is the list command, which helps you inspect interesting cases. For
example, when we want to see the properties of respondents who are younger than 40, from the south
and union members, we type
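list if age < 40 & south == 1 & union == 1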
or click Data → Describe data → List data. You will receive a list that contains all information for all
cases that fit the condition. If you just want to have a general look at your data, type
list in 1/10
– output omitted -
The “in” part means that only the first 10 observations will be shown to you, otherwise the list might
be a little long! Note that you can always stop Stata producing output by pressing Q. You can also
combine in and if in the same command
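For example (the condition is chosen for illustration):
list if union == 1 in 1/10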
Each variable has a unique name and a label (which should be unique as well).
Whenever you use or manipulate a variable, you have to use the variable name.
To change variable names or labels we click Data → Variables Manager. There
we see a list of all existing variables, their names, labels and corresponding value
labels. This is a great tool for managing the most basic properties of our vari-
ables. Suppose we want to rename the variable smsa to metro and change the
variable label:
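The commands are not reproduced in this extract; they presumably look like this (the new label text is only an example):
rename smsa metro
label variable metro "lives in metropolitan area"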
The label variable command allows you to change the label of the variable metro to
any text you put in quotation marks directly after it. In general, it is strongly advised to
label your variables so you know immediately what they are about.
At the moment, all information in the dataset is in numerical form (that means there is
no text data like names or strings). Each variable can take certain values, which are
coded as numbers. Some variables are binary (only two values are possible, such as
0 and 1). Examples are gender (male and female) or marital status (married and not
married). As it might be hard to remember whether 0 or 1 stands for "married", we can
label our values. By doing this, Stata will remember the correct coding for us.
Note that the label per se is only useful for us humans as the computer always uses
numbers to compute results mathematically. Luckily, most values are labeled already.
This really helps us knowing which number stands for which answer given by the par-
ticipants. Only the variable c_city lacks a value label. From the description, we learn
that this variable tells us whether a participant lives in a “central city” (whatever this
means).
To get more information about this variable, we will inspect it by typing
tabulate c_city
We see that 655 persons are coded with 1, which we assume means “yes”. Now we
want to label these values. To do this in Stata, we will have to use two steps.
1. Create a value label and give it a name.
2. Tell Stata to use the generated value label for the desired variable(s).
tabulate c_city
For beginners, this process might seem tedious, as we first have to create a new label
and tell Stata to use it with our variable. The major advantage is that by doing so,
we could label a lot of “yes – no” questions at the same time! Imagine we had infor-
mation on whether people like Pizza Hawaii, drive a Porsche or ever went bungee
jumping. All these questions can be answered in a binary fashion and, when they
have the same numerical values, we can label them all at once.
To do the process described above with commands, we type
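label define yesno 0 "no" 1 "yes"
label values c_city yesno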
Which workflow you prefer is up to you. Most people will switch from the point-and-
click method to commands as soon as they are more familiar with them. We want to
copy these commands to our do-file, which is hopefully already running and has a
nice header (when you have absolutely no idea what I am talking about, have a look
at page 12). To do this, hold Ctrl and click the corresponding commands in the Review
window on the left side of the screen. Then, right-click one of them and click Copy.
Now you can paste them into your do-file, so they are saved. Maybe later we learn that
1 actually means “no” and 0 actually means “yes”. Then we can go back and correct
our mistake easily.
By the way, which variables should have value labels? “Categorical variables
should have labels, unless the variable has an inherent metric” (Long, 2009: 163).
Examples of variables with inherent metric are age, income or time spent watching
TV per day in minutes.
Many datasets contain one case (person) per row, especially when we talk about
cross-sectional data (that is when all data entries are from the same point in time,
e.g. the same year). Every person should have a unique identifier, which is often a
number or a code. If there are identifiers in our list that are not unique, which means
that several persons have the same ID, we have a problem. In Stata it is easy to test
whether an ID is unique. Click Data → Data Utilities → Check for unique identifiers
and select the ID variable (idcode). Or use the command
isid idcode
As we do not receive any error message, the assumption is fulfilled. When no unique
ID exists, but we are absolutely sure that there are no duplicates in our dataset, we
can create it. Try typing
generate ID = _n
label variable ID "new unique identifier"
This will create a new variable ID that gives each person a number, starting with
1. _n is a system variable which tells Stata the position (going from 1 up to the
last observation) of each case in the dataset. Another helpful system variable is
_N which is the total number of observations in your dataset. You can inspect
the results using the data browser. When you are not sure if any duplicates exist
in your data, click Data → Data Utilities → Manage duplicate observations
or type
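The command is not reproduced in this extract; presumably it is duplicates report:
duplicates report idcode-tenure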
The part idcode-tenure tells Stata to use all variables in the dataset except the newly
created ID variable. To see how this works, take a look at the variable window.
The variables are in a certain order and new ones are created at the bottom of the list.
The hyphen tells Stata to use all variables from idcode to tenure, but not ID, which is
below tenure. As ID must differ across all cases, by design, we should not use it in our
duplicates test. Luckily, our dataset seems to be fine.
Note that you can also create IDs separately for subgroups in the data. Imagine
we want to create different IDs within the industry where a person works. Try:
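bysort industry: generate ind_ID = _n // the name ind_ID is only an example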
This command will first sort all cases by industry, and then start enumerating persons
within each industry from 1 until all persons have a number. When it reaches the next
industry, it will start, again, from 1 and label all persons. Suppose you want to mark
the youngest person in each industry. Try:
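bysort industry (age): generate youngest = (_n == 1) // the name youngest is only an example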
What Stata does here is first sort all cases by industry, and, within industry, by age.
Putting age in parentheses will prevent Stata from creating several counters within
each industry, using age as a second hierarchy. To see how this works in detail, play
around with the commands to become more familiar with them. You can also sort the
cases of your dataset in ascending order by the values of a variable, without creating any new
variables, using the sort command:
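sort age // example: sort all cases by age in ascending order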
If you want to reverse the order (sort in descending order) use gsort:
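gsort -age // example: sort all cases by age in descending order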
help generate
A new window will pop up that contains a large amount of information (Figure 3.2). As these help files
always follow the same structure, I want to give a short explanation here.
Before the text even begins, have a look at the three small fields in the upper right corner which
you can click. “Dialog” opens the point-and-click interface of the respective command whose help-
file you just studied (if available). “Also see” links you to other help-files for similar commands which
might be relevant as well, so clicking there is often a great idea. Finally, “Jump to”, lets you navigate
quickly through the current help file, which is a boon for longer ones.
Now to the main part of the help-file. The first section contains a generic example to see how this
command should be used (the syntax). This is often helpful, as it tells you where each part of the com-
mand goes and how the command is structured. Everything in brackets [...] is optional, meaning that
these parts are not needed to run the command. The next thing you see is the pathway to point-and-
click, which is available for the vast majority of all commands. After that, a more detailed explanation
of the command follows, which tells you what the command can do and how it is best used.
Then the options are explained, which can customize the way the command works. More advanced
statistical commands often have dozens of options, so it is usually a good idea to browse this section
when using a command for the first time (also have a look at the point-and-click interface, as this is often
more convenient). After that, several examples with real (online) data are presented, which you can use
to practice right away and see the command in action. This is great when you do not have suitable data
available and just want to get to know the methods. Some commands like regress will also show a section
about saved results which we will explain later (see page 49). The help file usually ends with references,
which tell you which algorithms were used and where they are documented. Using the Stata helpfiles is
a great way to improve your knowledge and explore the wide range of possibilities Stata offers. Always
remember that you do not have to memorize every single bit of a command as the help file will provide
the information when needed.
Missing data is a widespread problem in any empirical research. In the social sciences,
people often do not want to answer certain questions; in political science, some
statistics are not available for all countries. You have to tell Stata that a certain
numerical value should be interpreted as a missing value, otherwise you run into
problems. Stata uses the dot (.) to depict missing values. Type
Yet at first glance, all values seem plausible. This is also true for other variables like
age, where values below zero or above 100 should also make us think!
To tell Stata which value labels stand for missings (in this hypothetical case -999)
we use the following command:
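replace VAR = . if VAR == -999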
where VAR stands for the name of the variable we want to change. Also note that Stata
offers a convenient tool to change these missing codes for several variables at a time.
This comes in handy when you know an institute uses the same numerical values for
all variables to declare missings. Try
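mvdecode _all, mv(-999) // _all applies the change to every variable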
You can also reverse this and turn all missing values (.) back into plain numerical
values. Try
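mvencode _all, mv(-999)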
0 1 2 3 4 5 . .a .b .c .d
As this sort order illustrates, Stata internally uses the
largest possible numerical values to indicate missings. This can be a great source of
problems if one forgets about this. Suppose we wanted to compare values and count
how many people work more than 60 hours a week. We would enter
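count if hours > 60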
The displayed result is 22, which is incorrect, as four of these persons have missing
values, and should thus not be counted at all. To resolve this problem we have to type
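count if hours > 60 & !missing(hours)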
You can also do this in the GUI by clicking Data → Data Utilities → Count observa-
tions satisfying condition. To visualize how Stata handles missing values, have a
look at Figure 3.3.
Note that you can also have distinct missing values that can be used to give more
information about the type of missing value (extended missing values). When collect-
ing data there are several possibilities why data is missing: the respondent was not
home, he refused to answer the question and so on. To code these, try
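The commands are not reproduced in this extract; a sketch using the generic placeholder VAR and made-up codes:
replace VAR = .a if VAR == -998 // for example: respondent not at home
replace VAR = .b if VAR == -999 // for example: respondent refused to answer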
Stata will treat all values that consist of a dot followed by exactly one
letter as missings and will not use them in any statistic it computes.
Finally, we come to the fun part where we create our own new variables. Creating vari-
ables might be at the heart of data preparation, as data is rarely in the form we want
it to be when we receive it. Luckily, the general scheme for doing this is very easy. We
still have our NLSW88 dataset open. As you can see we have one variable age which
tells us how old participants are at the time of the survey in 1988. Let’s suppose we
want a variable that tells us not the age, but the year of birth of the person. The year
of birth is just the year of survey minus age, thus we type
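generate ybirth = 1988 - age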
Alternatively, click Data → Create or change data → Create new variable, enter the
name of the new variable (ybirth) and the expression to generate it (1988-age), then
click OK. When we click in the variable window (top right) and scroll down, we can
see that the variable was actually generated, congratulations! Yet the variable label is
still missing and should be added. To inspect the new variable, we type
tabulate ybirth
Please check whether the result is correct, and label the variable in a useful fashion.
The generate command is powerful and enables you to manipulate data
in many ways, as you can use a lot of mathematical operators and other functions. You
can get a list of the possibilities by using the GUI, as described above, or type
help functions
As a side note: when typing commands directly into Stata, you can use the Tab
key for auto-completion. This can be highly useful when dealing with long or cryptic
variable names: type the first few characters of the variable name and hit Tab. As long
as the name is unambiguous, Stata will show the complete name of the variable.
Stata comes with a long list of functions that make generating new variables easy and
comfortable. I want to present a small outline, as they are useful in daily routines.
Inlist
Inlist can be used to test for certain conditions instead of using a long command with
a lot of "or" conditions (the vertical bar). Compare the following commands, as they do
the same thing:
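The two commands are not reproduced in this extract; a sketch with an illustrative variable and values:
count if occupation == 1 | occupation == 2 | occupation == 3
count if inlist(occupation,1,2,3)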
Another trick is to use the first argument not for the name of a variable, but with a
numerical value to find all cases that have at least one matching condition.
count if inlist(1,union,south,c_city)
which is equal to
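count if union == 1 | south == 1 | c_city == 1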
Inrange
Inrange gives you all cases that fall within a certain range. Suppose that we want all
women that earn between 10 and 15 dollars per hour (including both limits):
count if inrange(wage,10,15)
Of course you can use a generate command instead of the count when needed.
Autocode
When we do not want to explicitly state the size of each category in a recode transformation,
we can use autocode. As long as we tell Stata how many categories we want
in total, it will automatically form them with equal widths (not equal numbers of cases!).
When we want to have a variable with five categories based on tenure, try
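generate tenure_cat = autocode(tenure,5,0,27) // the name tenure_cat is only an example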
Tenure is the name of the variable we want to recode, 5 is the number of categories
we want, 0 is the smallest value that should be considered in the recode and 27 is the
largest value. It is usually a good idea to inspect the Min and Max of the variable we
recode, and then use these values. Note that this command works best with integers.
If your original variable includes decimal numbers, make sure to inspect the result
carefully as the border categories might have an unequal size.
When you want to use point-and-click to use these advanced functions, click
Data → Create or change data → Create new variable, tick Specify a value or an
expression and click Create. Then click Functions and choose Programming. You will
see a large list with short descriptions of each command.
Egen
The generate command has a slightly more advanced brother, the egen command.
Using different suboptions, egen offers several dozens of ways to manipulate data.
When you want to create a new variable that contains, for each case, the maximum
value of some other variables, you can use this. Imagine we want to know, for each
woman, which of these is the largest value: wage, tenure or hours? We type
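egen maxval = rowmax(wage tenure hours) // the name maxval is only an example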
Note that this example is somewhat artificial, as usually the variables you want to
compare in this way should use the same metric.
You can also get the minimum (rowmin), mean (rowmean), median (rowmedian)
or the number of missings (rowmiss). The last option is especially helpful when you
want to select all cases that have complete information on all desired variables.
To use egen with point-and-click, go Data → Create or change data → Create
new variable (extended).
If (!) you have a background in programming you will be familiar with if conditions that
are used to structure program flow and to process data selectively. The basic idea is that
Stata should do action X if and only if condition Y is true. This is one of the most central
aspects of data managing, as it allows you to change properties of variables in many
ways. For example, when we want to know how many people in our dataset are
younger than 30 and union members, we use the count command. When we use this
command without any if qualifier, it will just count how many cases are in our dataset
(2,246). Now we combine count and the if qualifier to count only the cases where the
condition is true:
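count if age < 30 & union == 1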
The “&” means “and” in Stata. To form more complex conditions, you can use these
operators (see Table 3.1):
Parentheses are also important helpers to structure conditions. As you might remem-
ber from math class, parentheses bind stronger than other operators and often make
a difference. Here are some examples:3
Older or exactly 40 years old and married:
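count if (age > 40 | age == 40) & married == 1 // a plausible reconstruction of the example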
Younger than 35 and income larger or equal to 25 (and also not counting missing values):
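count if age < 35 & wage >= 25 & !missing(age) // a plausible reconstruction of the example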
If qualifiers are super important and can be combined with most Stata commands.
To check whether you can use an if qualifier in combination with another command,
refer to the documentation (for more information see page 23). Further, make sure
that the values you compare have no missing values, or at least consider this possibil-
ity (as in the first example shown above), otherwise results might be incorrect (which
is explained on page 26).
3 This character is called Sheffer stroke or just vertical bar. Typing it might involve using the Shift key
or the AltGr key, depending on your keyboard layout.
4 The condition & !missing(age) ensures that people with missing values on the variable age will not
fulfill the condition and are, therefore, not counted. If you explicitly want to count people who have
missings use & missing(age).
A single equal sign, as in generate X = Y, represents "Stata, create this variable X here and set it to the value of Y!". The double
equal sign, as in M == N, means "Stata, can you tell me whether this variable M here is equal to the value of N?". After a week of using
Stata, this difference will be the most normal thing for you. Still, even experienced users sometimes
forget one equal sign, which messes the code up. So always make sure to double-check whether you
really typed what you wanted.
Generating new variables is important, but often you still want to make changes to
them later. As a general rule, never change existing (original) variables in place, as
these changes overwrite original data, which could be a problem. It is a good habit to
first create new variables from existing ones, and then change these newly created
variables. If something goes wrong you still have the original in place and do not have
to load a backup of your dataset.
Have a look at the variable hours, which tells us how long participants work
per week on average. This is a variable with an inherent metric, as time in hours is a
metric measurement. Assume we want to create an indicator that tells us whether a
person works part-time, which is 20 hours or less. All persons working more than that
will be counted as full-time workers. We want to create a binary indicator from this
metric. First, we generate a new variable which we call “parttime”.
generate parttime = .
This creates a variable that has missings for all cases, which does not seem very useful
at first glance. Yet this is usually a good idea: if you make a mistake and your
changes do not work, you will later see that many values are missing, which should
alert you.
Before changing a variable, we have to think about how we want to operational-
ize (code) it. We decide that every person that works part-time will receive the value
1 (think of “yes”) and all other persons will receive the value 0 (“no”). Now we can
make the changes
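The commands are not reproduced in this extract; they presumably run along these lines:
replace parttime = 1 if hours <= 20
replace parttime = 0 if hours > 20 & !missing(hours)
tabulate hours parttime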
You will get a long output with all values. As you can see, all participants with a working
time of 20 hours or less are in the right column under 1, while all others are
in the left column. We can conclude from this that our new variable was created
correctly. Another option for checking errors, one that relies less on visual inspection
of tables, is the assert command. For example, we noticed that the variable
used for generating parttime, hours, has four missing values. Therefore, our new
variable should also include exactly four missing values. To check this we can use
the following command:
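One way to do this (a sketch, not necessarily the command used in the original):
assert missing(parttime) == missing(hours) // parttime must be missing exactly when hours is missing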
When this command runs without any errors, the condition is true. Otherwise Stata
will halt and report that something you asserted is not correct.
Please label the new variable, and when you want, you can also apply the “yesno”
label we created earlier5 (see page 19).
codebook race
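The recode command itself is not reproduced in this extract; given the race coding in nlsw88 (1 white, 2 black, 3 other), it presumably resembles
recode race (1 = 1) (2 3 = 0), generate(is_white)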
or click Data → Create or change data → Other variable-transformation commands → Recode categorical
variable. Again, we should cross-check our results by typing
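tabulate race is_white // a cross-tabulation to verify the recode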
5 As the label is already created, you only have to use the second command.
Please note that the option generate(is_white) is crucial, otherwise your existing variable will be over-
written! As the recode command is very helpful, it is a good idea to explore the many options available
by typing
help recode
Sometimes you want to remove observations from your dataset because they contain
missing or wrong information. Before you apply the following commands, make sure
to save your dataset, as we will not keep these changes after removing observations
and variables.
First, let's have a look at the variable grade, which tells us how many years of
schooling participants received. We notice that there are a few cases with
less than four years of schooling, which is really short and uncommon. We think
that this might be an error in the data, so let's remove the two problematic cases. To
do so we type
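drop if grade < 4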
Stata will report that it deleted two observations. Sometimes it is more efficient not
to indicate which observations should be dropped, but which should be kept. Have
a look at the variable occupation, which tells us in which job respondents work.
Imagine that we want a dataset that only contains information about managers.
We could use a drop command with a condition that takes all other occupations
and removes them. Using the complement and telling Stata which occupations to
keep is much faster in this case, so try
keep if occupation == 2
Stata will tell you that it removed 1,982 observations. This command can be really
helpful when you want to create new data files which only consist of subsets of your
data, like an all male or female dataset.
Finally, we want to delete a variable that we do not need anymore. Note that in this
case, the number of observations stays identical. We want to remove the ID variable
that we created before, as we want to use the original variable (“idcode”) instead. Try
drop ID
describe
You will see that the variable is not listed anymore. Remember that drop combined
with an if condition will remove cases, otherwise it will remove variables. If you actu-
ally executed the commands described here (keep and drop), make sure to reload
your dataset before you proceed, otherwise you will receive different results (sysuse
nlsw88, clear).
Someday you might work with large datasets that contain thousands of cases. Making
sense of such amounts of information is not easy, as individually reviewing and
cleaning cases is no longer feasible. Still, chances are quite high that there are errors
in the data. Even when you cannot correct a mistake, since the original or true
information is missing, you can delete or flag these observations and not use them
in your analyses.
The first thing you can do is check whether the values contained in your
variables make sense. As an example, I want to use the variable age, which clearly has
only a certain range of plausible values. As only adults are interviewed, values below
18 should not be possible. There should also be an upper limit, since a value of 200 is quite
impossible, yet can happen easily due to a typo. We will use the assert command to
perform these "sanity checks" automatically
assert inrange(age,18,100)
Stata will halt if there is a contradiction found in the data, otherwise nothing will
happen, which tells us that the assertion is fulfilled. The same goes for variables with
ordinal scaling: as only a few values are defined (say from 1 to 5 on a rating scale), any
other values are probably incorrect:
assert inrange(occupation,1,13)
This time you will receive an error, as the assertion is not fulfilled for all cases.
Closer inspection shows that the problematic cases are those with missing information.
When you want to allow this, you can adjust your assertion
accordingly
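assert inrange(occupation,1,13) | missing(occupation)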
Another common source of problems arises when several variables are linked in a
logical way. For example, have a look at the two variables ttl_exp and tenure. The first
measures the overall job experience of a woman, the second how long she has worked in
her current job. Logically, the overall job experience must always be larger than or equal
to the time spent in the current job, otherwise the information is contradictory and
probably false. We can check this by typing
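assert ttl_exp >= tenure | missing(tenure) // allowing for the missing values of tenure is an assumption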
Indeed, we see that ttl_exp is always larger than or equal to tenure. A similar error can
arise when questions depend on each other in a survey: when a person states that she
is currently unemployed, but responds that she works 20 hours a week in the next
question, there is something wrong. The same goes for children with a PhD or men
with a hysterectomy. When you are new to surveys, you just would not believe how
often data is just terribly incorrect when you receive it.
Sometimes you do not have one large dataset that contains all the information you need,
but several files with different aspects. For example, one file could contain informa-
tion about the personal characteristics of a respondent, like age or place of residence.
Another file might contain information about the school career and grades. For analyses
it is often necessary to bring these pieces of information together and combine datasets.
When you work with large professionally collected datasets, it is often the case
that the creators of the data provide a rich set of information about how their datasets
work, and how they can be combined. Still, you need to think about which pieces
of data you need for your analyses. This is something you have to do in advance,
which often involves switching between different datasets. I assume you know what
you really want to do, and will show you five different cases you might encounter.
Note that you can also apply all of the following procedures by using the interface.
Just click Data → Combine datasets and choose the desired method.
When you have several datasets that contain basically the same information, but
about different observations, you can combine them all into one large dataset. For
example, you have a questionnaire that you give to two colleagues. One goes to city A
and interviews people, the other goes to city B. In the end, both bring you their data-
sets which contain the same variables, but for different persons (Table 3.2).
(Table 3.2: Dataset A and Dataset B, each containing the variables Name, Age and Income for two respondents)
After appending you have one dataset that basically looks like this (Table 3.3):
Combined Dataset
Name Age Income
Jeff 23 2300
Dave 55 3400
John 18 1200
Carol 66 1900
To see this in action, make sure that you have saved the dataset we used throughout this chapter, as we will open a new one now. First, open two datasets from the online database and save them locally on your computer:6
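A sketch of this step; the web address below is only a placeholder, not the real location of the book's online database:
use "http://www.example.com/append_a.dta", clear
save "append_a.dta", replace
use "http://www.example.com/append_b.dta", clear
save "append_b.dta", replace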
We will append the saved file append_a.dta to the open dataset append_b.dta. In
Stata we call the dataset that is open at the moment (in memory) the “Master” and
the other one, that will be added, the “Using”. Note that since version 11 you can also
append several datasets at once.
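A hedged sketch of the append step, assuming both files were saved in the current working directory:
use "append_b.dta", clear
append using "append_a.dta", generate(check_append)
list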
The generate option creates an indicator variable called check_append that later tells
you which observation was added and which was from the original (in this case from
“append_b”). With the list command, you can check whether everything went fine.
Your new dataset should have six observations. When you want to use point-and-click
go to Data → Combine datasets → Append datasets.
6 The files provided for the example work with Stata 15 or 14. If you are using an older version, you have to replace the file suffix with "_old.dta". For example, change append_a.dta to append_a_old.dta.
7 The option replace will overwrite any existing dataset with the same name, if existent.
The second case, which occurs more often in reality, is that you have several datasets from the same institute, with the variables separated by topic. To combine these datasets, each of them has to include a unique identifier, like an ID, which is used to merge the data correctly. Suppose the following basic design for the two files you want to merge (Table 3.4):
Dataset A and Dataset B (Table 3.4): both contain the identifier country, plus different variables on different topics.
Combined Dataset (Table 3.5): one line per country, containing the variables from both files.
To make this work, both files that should be merged together must contain a unique identi-
fier, in this case the name of the country. In interviews, this is often a unique number that is
given to each participant and cannot occur twice within one dataset. When this variable does
not exist and cannot be created, merging is not possible. Sometimes more than one variable
is needed, for example, when the same people were interviewed over the course of some
years, so the unique ID and the year would be needed. We will now try to do this by typing
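A hedged sketch of a one-to-one merge on country (the file names are placeholders for the example files):
use "merge_a.dta", clear
merge 1:1 country using "merge_b.dta"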
Note that country is the unique identifier which contains the name of the country.
Also take a look at the info Stata gives you after performing the merge. You will see
that Stata created an indicator variable automatically this time (_merge). When you
want to use point-and-click, go to Data → Combine datasets → Merge two datasets.
The third case is used whenever there is asymmetry in the datasets. Maybe your main file contains information about pupils and which school they attend. You also have a second data file which contains information about the schools, such as the district they are located in and the number of pupils attending (Table 3.6). As there are many more pupils than schools, your school dataset has fewer observations than your main file. As long as there is a school ID in both datasets, you can match the data.
Dataset A and Dataset B (Table 3.6): the pupil file and the school file, both containing the variable school_ID.
Combined Dataset (Table 3.7): each pupil matched with the information about his or her school.
This merging is called many-to-one, as your master file (which is in memory at the
time of merging) has more observations (“many”) than your using file (“one”). Let’s
see this in action.
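A sketch of a many-to-one merge; the file names are hypothetical, school_ID is the linking variable from the example:
use "pupils.dta", clear
merge m:1 school_ID using "schools.dta"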
school_ID is the variable that links information about schools and pupils and, therefore, works as the identifier for the merge. Note that the file with the “many” (the pupils) has to be the one in memory at the time of merging.
Basically, this is exactly the same procedure as the Many-to-One Merge, except that the master and using files are swapped. When you can perform a Many-to-One Merge, you can always perform a One-to-Many Merge as well (and the other way round). To
save space, you can just change the code above or look at the do-file provided for this
chapter online.
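As a sketch (again with hypothetical file names), the swapped version would look like this:
use "schools.dta", clear
merge 1:m school_ID using "pupils.dta"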
Imagine that you have two datasets, one about parents and one about their children
(Table 3.8). As every child has two parents and some parents have more than one
child, a direct match is not possible. When you have the child data as master, which
parent would you link with the child, father or mother? The other way round, when
you use the parent data as master and a family has several children, which of them
would you merge to a parent? In these cases, you can form all pairwise combinations.
Dataset A and Dataset B (Table 3.8): the parent file and the child file, both containing a family identifier.
Combined Dataset (Table 3.9): all pairwise parent–child combinations within each family.
The number of all pairwise combinations is the product of the number of cases in
each dataset with the same ID, thus two for family A1, six for family A2 and one for
family A3. When an ID is only available in one of the two datasets, it will not be included in the resulting file (as any number times zero is zero; you can change this default behavior if you want). Now for the example:
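Forming all pairwise combinations is done with the joinby command; a hedged sketch with hypothetical file and variable names:
use "children.dta", clear
joinby family_ID using "parents.dta"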
Merging datasets can be confusing for beginners, yet can be mastered with time and
practice. The key to correct merging is reading the documentation of your datasets
(sorry, no magic trick here!) and learning how they are structured (use describe,
list and browse). If you read this chapter twice, and still think none of the solutions
described here is the correct one for you, it could be the case that you cannot merge your data. When you cannot find a unique identifier (or a combination of identifiers) that links cases across files, a merge will end in chaos. Do not try the Many-to-Many Merge, as
even the Stata documentation itself advises against ever using it!
Someday you might want to work with panel data or event-history data. In these cases the same people are interviewed several times, say, once every year. Or maybe you have information about the development of countries and want to see the trends over
time. In these cases, it is often necessary to reshape data. What sounds complex is
quite simple in Stata. Basically you only have two options, the wide format (Table
3.10) and the long format (Table 3.11). The wide format is used when every case
(observation, person, country, etc…) needs exactly one line in the data browser to
show all of its information. Cross-sectional designs are in almost all cases in the
wide format, and reshaping is not needed here. When you have time-series data (e.g. panel data), it can come in either shape, and we want to show how to transform this kind of data.
Wide Format (Table 3.10): one row per country, with a time-constant variable (currency) and the scores for each year stored in separate variables.
This dataset contains some kind of ID (the name of the country), a variable that does not change over time (the currency), and a score that does change over time; the values for each year are saved as an extra variable. We want to reshape to long format, where one country can take up several lines and we will introduce a new year variable.
The reshaped format would look like this:
Long Format (Table 3.11): one row per country and year, with the variables country, currency, year and score.
A new variable that indicates the point in time (year) was created, and some variables were deleted, as they are no longer needed. Time-constant variables are the same in every year, as they do not change. To reshape data, you need two pieces of information: an ID that identifies every object in your data (in our example, the country) and the point in time (in our example the year, which is an extra variable in the long format
and merged into the name of another variable in the wide format). As always, to per-
form a successful reshape, it is vital to be familiar with the dataset used. We have an
example where we reshape wide to long, and then back to wide:
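A hedged sketch, assuming the score variables in the wide format are named score2000, score2001 and so on (the stub name score is an assumption):
reshape long score, i(country) j(year)
reshape wide score, i(country) j(year)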
As long as you do not create any additional time-varying variables, you can also use the shorter commands to switch between the formats, after having run the explicit commands once before:
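After a full reshape, Stata remembers the specification, so the short forms suffice:
reshape long
reshape wide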
After you have cleaned your data, checked for errors and problems and created new
variables, it is time to describe your data. Every piece of scientific research, no matter
whether it is a seminar paper or a thesis, will include a part where you just describe
what you see. No fancy causal analyses or advanced statistical methods are needed
here, just reporting means, medians, tables or the distribution of variables, so the reader can get an impression of what the data is all about. Furthermore, descriptive parts are also extremely relevant for you, as a good description always helps in understanding the phenomena you want to explore. In this chapter you will learn
how to do this.
https://doi.org/10.1515/9783110617160-004
numlabel, add
tabulate industry
Stata automatically added numerical labels to all categories. This is quite convenient, yet has the drawback that missing cases are still not shown automatically. Furthermore, the labels will only be added to existing variables, so you have to run the command again after you have created new variables. Lastly, the labels added will also show up in the graphics you produce, which can be annoying. To get
rid of the labels again type
numlabel, remove
By now we have used the tabulate command several times to get an impression of our
data, but as variables can contain thousands of different numerical values, this is not
a good way to summarize information. This is usually done by employing statistical
indicators such as the mean, median or standard deviation. Stata can compute these
numbers easily, which of course requires a variable with an inherent metric (remem-
ber that you should not summarize nominal or ordinal variables in this fashion, as
the results can be meaningless). We will use the same dataset as in the last chapter. A
great variable to test this is age, so let’s see this in action
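Presumably the command here is summarize with the detail option:
summarize age, detail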
When some persons have missing values for a variable, you would notice it here.
The left part of the table shows certain percentiles. Remember that the median is the 50th percentile. In our case that means that 50% of all persons are 39 years old or younger, while 50% are older than 39. When you need a percentile that is
not listed here, like 33%, try the centile command:
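A sketch of the corresponding command for the 33rd percentile:
centile age, centile(33)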
or click Statistics → Summaries, tables, and tests → Summary and descriptive statistics → Centiles with CIs. The result is 37, meaning that one third of the sample are 37 years old or younger, while two thirds are older than that age. If you want to learn more about the other statistics presented, refer to your statistics textbook.
In a seminar paper, you should report the most basic statistics for relevant depen-
dent and independent variables, like the mean, standard deviation, median and number
of observations. Doing this is quite easy and can be done using all variables at the same
time, as long as you only include metric, binary or ordinal numbers:
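One way to produce such an overview is tabstat; the exact command and the variable selection here are assumptions on my part:
tabstat wage age ttl_exp tenure, statistics(mean sd p50 n) columns(statistics)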
As long as your binary variables are coded in a 0/1 fashion, you can include them as
well, as the mean is just the percentage with a value of 1. To export such a table into
your text-editor, you can highlight the table in Stata, right-click and copy it directly.
Unfortunately, this will often not yield the desired result. To get nicer output, you can use a community-contributed command (CCS).
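A sketch based on the estout package referenced in the footnote below; the variable list and the statistics inside cells() are assumptions:
ssc install estout
estpost summarize wage age ttl_exp tenure
esttab using "descriptives.rtf", cells("mean sd count") replace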
The last command creates a nicely formatted table, in a new .rtf-document, in your
current working directory (Figure 4.1). The options in parentheses, after cells, specify
which statistics to include. Using this script makes it quite convenient to export tables
and use them directly in your publications, with only little adjustments needed after-
wards. Make sure to read the official documentation for the command, as it is very
powerful and offers many options.1
1 http://repec.sowi.unibe.ch/stata/estout/ (2018-05-16).
A common application of stored results is standardizing a variable, that is, computing z-scores:
Score_z = (x_i − x̄) / SD(x)
To do this by hand, you first run summarize to obtain the mean and standard deviation of the variable.
You see that the mean is 7.77 and the standard deviation is 5.76. Now you generate the
new variable
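A sketch, assuming the variable summarized above is wage (the name of the new variable is also an assumption):
generate wage_z = (wage - 7.77) / 5.76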
Luckily this can be done faster, as many Stata commands not only display the numbers but also store them internally for other uses. To see what Stata stores, run the command (summarize) again (as only the results of the last command are stored) and try
return list
You will see a list of all statistics Stata saves.3 You can use them directly
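A sketch using the stored results instead of typing the numbers (variable names are again assumptions; the quietly prefix is explained in the footnote below):
quietly summarize wage
generate wage_z2 = (wage - r(mean)) / r(sd)
summarize wage_z2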
We see that the overall mean is very close to zero. The difference is only due to rounding.
3 Some commands, like regress, use the command ereturn list instead, as they estimate a model.
4 Whenever you put quietly in front of a command, Stata will not show the output but still keeps the
calculated results in memory, which is a useful trick.
4.3 Histograms
While central statistics, as discussed above, are important, sometimes you want to
show the actual distribution of a variable. Histograms are a good way to do this for
metric variables. Try
histogram age
or click Graphics → Histograms. When you are used to working with histograms, the result might disappoint you, as white areas between the bars are not common for this kind of graphic. The problem here is that, although the variable has an inherent metric, due to the special sample there are only a few distinct values (from 34 to 46). To get a nicer graph type
kdensity age
You will find this graphic under Graphics → Smoothing and densities → Kernel
density estimation.
When you want to combine several plots within one graph, you can use the
command twoway:
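For example, a sketch that overlays the histogram and the kernel density estimate in one graph:
twoway (histogram age) (kdensity age)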
or click Graphics → Twoway graph and create the subgraphs separately. Note that,
although the command is called twoway, you can combine an arbitrary number of
plots in one graph. The code for each subplot has to be enclosed in parentheses.
4.4 Boxplots
Another way to create clear and concise graphics for metric variables is via boxplots.
Try it by typing
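A minimal sketch of the command (graph box is Stata's boxplot command; the book may use the horizontal variant graph hbox instead):
graph box wage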
A related issue is the treatment of outliers. To see the problem, plot the distribution of wages:
histogram wage
You will notice that a large proportion of the values lies between 0 and 15, while there are only very few cases beyond that. What you consider an outlier depends on your research question, your data and your personal opinion, as there is no general rule to classify them. Basically you have three options to continue:
1. Do nothing. Just leave the variable unchanged and work with it. This can work very well, but
sometimes outliers can influence your regression results significantly. You should probably
invest some time, and try to find out why there are outliers, as there might be a coding problem
or error in the data. If this is not true, and the data is correct, you should have a closer look at the
respective cases as there might be an interesting hidden subpopulation for further research.
2. Fix them. You can fix outliers to a numerical limit, say 25, so every value that is larger than 25 will be set to this value. You can do this by typing
replace wage = 25 if wage > 25 & !missing(wage)
Pay attention to the part !missing(wage), as otherwise all people with missing values on this vari-
able will also receive this value, which would be a severe mistake. The general problem with this
technique is that it reduces the variance of your variable, as the values above the limit, which can be
different from each other, will all be set to the same value.
3. Remove them. Sometimes you can exclude cases from your analysis that are outliers. This means you will not use these observations. Say again that our limit is 25, so we type
replace wage = . if wage > 25
All cases with larger values will be set to a missing value, and thus not used. This option can introduce bias,
as in this case, you remove some special groups from your analyses; namely people with high incomes.
You have to think theoretically about what excluding certain subgroups from your study can do to your research question.
Whatever you do, make sure to write about it in your research paper, so the reader knows how
you processed the data. And make sure to reload the original dataset if you executed the commands
described here, otherwise your results will differ from the ones presented here.
Let’s have a look at the variable industry, which tells us the branches respondents
work in. We type
tabulate industry
and receive a table that lists frequencies. Let’s suppose we want to convert this
information into a bar chart, so it can be represented visually, and that we want to
plot the absolute frequencies for each category. Oddly, there is no easy way to do this directly in Stata. One possibility is to use the histogram command with a lot of options, which somehow brings us closer to what we want, but it is complicated and often not exceptionally pretty. Especially when a variable has many categories and you want to label them all, it gets tricky. Luckily, someone before us noticed this problem and wrote a little command (community-contributed software) that helps us out. Type
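Presumably the community-contributed command meant here is catplot; a hedged sketch:
ssc install catplot
catplot industry, blabel(bar)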
I hope this is roughly what you want. Note that you can customize the command by
using options from the graph bar command. When you do not want to show the abso-
lute numbers for each category, just remove the option blabel(bar). In general, Stata
offers a variety of different graphs, and many sophisticated options to customize
them. While this is a boon to the experienced user, beginners are often deterred from
using the many possibilities. It is a good way to start with simple graphs, and use the
point-and-click menu to try out different options to see what they do. Over time, you
will build a personal preference for certain forms of representing data visually, and
you will be able to create them easily. Make sure to always save your commands in a
do-file, so you can look them up quickly later. When it comes to documenting how
graphs are created, Stata is clearly outstanding.
4.6 Scatterplots
Whenever you want a visual representation of the relation of two metric variables,
scatterplots are a good idea. They are used to check whether two variables are
somehow related to each other, and whether there is any hint of a correlation visible.
To see this in action, we want to plot the relation of wages to total work experience.
We assume that people with more experience will earn more on average. We type
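A sketch of the command (wage ends up on the y-axis, ttl_exp on the x-axis):
scatter wage ttl_exp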
or click Graphics → Twoway graph (scatter, line, etc.) and click Create… and enter
the x- and y-variable. The first variable (wage) goes on the y-axis, the second one
(ttl_exp) on the x-axis. We can see that there seems to be some relation, yet as we have
many data points, due to the high number of observations, some kind of thick cloud
is created at the bottom making a visual inspection difficult. To account for this, we
can try the option jitter
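A sketch with the jitter option (the value 10 is the one discussed below):
scatter wage ttl_exp, jitter(10)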
This is also extremely helpful when dealing with variables which do not have many distinct values. Basically, it creates a standard scatterplot and moves every data point slightly, in a random fashion, to the side, so points are not plotted exactly on top of each other. The numerical value (in our case 10) tells Stata how strong
this random movement should be. Another solution is to use the color option:
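A sketch using marker opacity, which requires Stata 15 or newer; the exact option used in the book may differ:
scatter wage ttl_exp, mcolor(%30)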
The number in parentheses can be between 1 and 100, and lets you adjust the saturation of the dots. When you want an even clearer visualization, try the CCS binscatter5 (you need Stata 13 or newer to use this). This command combines data points and, therefore, reduces their number, but still recovers the overall tendency of the distribution, and also fits a linear model.
5 https://michaelstepner.com/binscatter/ (2018-01-29).
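A hedged sketch of installing and running binscatter on the same two variables:
ssc install binscatter
binscatter wage ttl_exp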
Simple tables are at the heart of science, as they can summarize a great load of infor-
mation in a few lines. While fancy graphics might be useful in presentations, scientific
publications should mostly rely on tables, even when they seem plain or boring. Stata
offers a great variety of options for creating customized tables which we will explore in
detail. We already know how to inspect the distribution of values, for one variable, using
the tabulate command, and how to create crosstabs (see page 33). We want to expand
these tables, to gain further insight, and analyze how the variables, union and south, are
related, to check whether there are differences between the regions (south vs not south).
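Presumably the basic crosstab is produced by:
tabulate union south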
The variable union defines the rows of your table, and the variable south the columns. We could summarize by column:
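A sketch using the column option:
tabulate union south, column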
Stata then displays percentages. We can compare the relative interest in unions by region. In the south, only 17% are union members, while in other regions this number is over 30%. We can deduce from this that unions are much more popular in non-south regions than in the south (assuming that our data is representative of women from the USA). Or to formulate it differently: imagine you are in the south and talk to a randomly chosen woman. The chance that she is a union member would be about 17%.
We can also summarize differently by typing
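A sketch using the row option:
tabulate union south, row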
Stata summarizes, this time by row. Here we have to think in a different manner. Let’s
suppose you visit a large convention of union members who are representative of the
USA. When you talk to a random union member, the chance that she is from the south is
29.5%. Please take some time to think about the major differences when you sum-
marize by row or column. What you choose must be theoretically justified for your
individual research question.
A third option is to show the relative frequency of each cell:
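A sketch using the cell option:
tabulate union south, cell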
This enables you to say that in your sample 40.15% of all respondents are not union
members and are not from the south. Theoretically, you can also combine all options
in one large table:
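A sketch combining all three options:
tabulate union south, column row cell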
This is usually not a good idea, as such a table is very hard to read and can easily lead to errors. Often it is better to use several tables than to create one super table with all possible options.
Another less common option I want to discuss here is 3-way-tables. Until now,
we have looked at two variables at once, but we can do better. By introducing a third
variable we can get even further insight. Imagine we want to assess the influence of
the size of the city the respondent lives in. We have a binary variable as we count some
cities as central and others not. To test the influence, we type
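A sketch using the table syntax of Stata 15, with c_city as the superrow variable:
table union south, by(c_city)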
You can create this table by clicking Statistics → Summaries, tables and tests →
Other tables → Flexible table of summary statistics. There you enter c_city under
Superrow variables, union under Row variable, and south under Column variable.
Then click Submit. Another possibility is to use the bysort command to create these
kinds of tables (for example: bysort c_city: tabulate union south).
Note that these 3-way-tables are difficult to read and nowadays there are better
ways to analyze your data. In the past, when computational power was much lower,
these kinds of tables were at the heart of social sciences. You will find them in many
older publications.
When you have a categorical variable which is relevant to your research question, it is often a good idea to compare across categories. For example, we could check whether completing college has any effect on later income. We can use the binary variable collgrad, which tells us whether a respondent finished college. We can summarize the wage by typing
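One way to do this; the exact command is an assumption (tabulate collgrad, summarize(wage) would be an alternative):
tabstat wage, by(collgrad)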
or click Statistics → Summaries, tables, and tests → Other tables → Compact table of summary statistics. Under Variable 1, enter collgrad and under Summarize variable wage, then click Submit. We notice a stark contrast between the means (6.9 vs. 10.5), which tells us that, on average, people who finish college earn more.
When you want you can also combine frequency tables and summary statistics
and produce detailed information for subgroups. Try
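A hedged sketch, assuming a crosstab of collgrad and south with wage summarized in each cell:
tabulate collgrad south, summarize(wage)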
You will receive information about the mean, standard deviation and the absolute
frequency, separately for each cell of the table.
When you generally prefer graphs over simple tables, we can easily create a bar
chart by typing
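A sketch of the bar chart of mean wages by college graduation, with bar labels:
graph bar (mean) wage, over(collgrad) blabel(bar)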
or clicking Graphics → Bar chart, tick the first line, enter Mean in the first field and
wage in the second. Then click Categories, tick Group 1 and enter collgrad. Then click
Bars and tick Label with bar height. You can produce more complex graphs by intro-
ducing more categorical variables, e.g. south:
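A sketch with the second grouping variable added:
graph bar (mean) wage, over(collgrad) over(south) blabel(bar)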
Note that these graphs get more and more complex, so be careful that the
readability is not lost in the process. If you are not happy with the format of the
numbers displayed, you can change this. For example, replace blabel(bar) with
blabel(bar, format(%5.2f)). The 5 is the overall width of the displayed number (including the decimal point) and the 2 specifies how many digits should appear after the decimal point. The f specifies that the fixed format will be used (alternatives include exponential formats, and many more). Formatting is a fairly complex, though rarely discussed, issue, so I refer you to
the help files (type help format). Just remember that the format option displayed
here works with many more graphs and tables and will be enough to create nicely
formatted graphics easily.
Another option for comparing groups are dot charts. For example, when you
want to compare mean wages in different occupations, try
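A sketch of such a dot chart:
graph dot (mean) wage, over(occupation)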
or click Graphics → Dot chart. By comparing both commands, you will notice that
the syntax is almost identical, as only the bar is switched to dot. When you work with
Stata for some time, you will get a feeling for the basic structure of commands, which
are often similar.
Producing graphics using commands, or point-and-click, is quite easy in Stata. This last
part about graphs will show you how to finalize and export pretty graphs that could be used
for publishing. To begin with an example, we will produce a simple histogram as before
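Presumably the same command as before:
histogram age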
– output omitted –
A new window will open which shows the graph (Figure 4.2). This is fine, but we
will add a caption and our sources. In the new window click File → Start Graph Editor.
Another window will open that shows some additional elements. You can now
either click the element you want to edit in the graph directly, or click the elements on
the right-hand side of the screen. This part of the window lists, in detail, all elements
that make up the graph. We could now start to edit these, but we want to record the
changes we make. This can be pretty useful, for example, when you want to export
several graphs and add the same details to all of them. Then you only need to record
your scheme once, and later apply it to all graphs. To do this click Tools → Recorder → Begin. Now we start editing by clicking “note”, entering “Source: NLSW88” and clicking Submit. Then we click Tools → Recorder → End to finish our recording and give a name for the scheme (nlsw88).6 To use recorded schemes later, we open the desired graph, start the Graph Editor, click Tools → Recorder → Play and choose the desired recording. As we want to keep it simple here, we click File → Save as… and save the file
as histogram_age.gph. This is Stata’s own file format and can only be used in Stata.
Then we click File → Stop Graph Editor, which closes the Editor window. We can now
export the graph, so we can use it in a text editor. We click File → Save as… and save
it as “histogram_age.png”. Preferred file formats are .png (which is a good idea when
you want to use the graph in text editors or on websites) or .pdf7 (which makes upscaling for posters convenient). The corresponding commands are
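A sketch of these commands, with the file names from the example above:
graph save "histogram_age.gph", replace
graph export "histogram_age.png", replace
graph export "histogram_age.pdf", replace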
It is usually a good idea to save a graph first in Stata’s own format (.gph). When you notice a mistake later, or want to make a change, this is impossible with an exported file and you have to start from the beginning. Files in the .gph format make it easy to edit already created graphs, and to export them again to any desired format.
Sometimes you want to show several graphs in a paper to give the reader an over-
view of the data. You can either produce one graph for each variable or combine
several graphs. This is often preferred when it comes to descriptive statistics as space
is usually short and compressing information, especially when it is not the most interesting one, is a good idea. The basic idea is to create each graph separately, give it a name, and later combine them.
6 Recordings are not saved in the current working directory. On Windows they are found in “C:\ado\personal\grec”, on Linux in “/home/username/ado/personal/grec” and on Mac in “/home/username/Library/Application Support/Stata/ado/personal/grec”.
7 An alternative format is .svg, which was introduced in version 15.
First we start by creating the graphs:
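A sketch with two example graphs; the graph names and the choice of variables are assumptions:
histogram age, name(hist_age)
histogram wage, name(hist_wage)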
Each command produces a graph and labels it with a name (internally). Even when you
now close the graph window, without saving it to your drive, it is still kept in memory as it
has been named. When you want to change an already existing graph use the replace option
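For example (again a sketch):
histogram age, name(hist_age, replace)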
Now we combine the two graphs into one image and label it as well:
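A sketch of combining and naming the result:
graph combine hist_age hist_wage, name(combined)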
You can start the Graph Editor, and make changes to each graph individually as all
information is saved in the process. Naming graphs is also possible when you use
point-and-click. You can find the respective option under the category “Overall”.
When you do not name your graphs, the one you created last will be named “Graph”
automatically. Note that only the one created most recently will be kept in memory, all
other unnamed ones will be lost.
To conclude this very short introduction to graphs, I encourage you to explore the
vast possibilities Stata offers when editing graphs, as they cannot be explained here.
Actually, there is an entire book about creating and editing graphs in Stata (Mitchell,
2012). Shorter overviews can be found in the do-files provided for this chapter and online.8
4.10 Correlations
As you have seen by now, it is quite easy to create helpful graphics out of your data.
Sometimes it is a good idea to start with a visual aid, in order to get an idea of how the
data is distributed, and then switch to a numerical approach that can be reported easily in the text and helps to pinpoint the strength of the association. A classical approach is to use a correlation, which measures how two variables covary with each other. For example, if the correlation between wage and total job experience is positive, more job experience is associated with a higher wage. Basically, you have a positive correlation when, as one value increases, the other increases as well, while a negative correlation means that as the value of one variable increases, the other one decreases.
We can calculate this formally by typing
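A sketch using pwcorr with significance levels (correlate would work as well):
pwcorr wage ttl_exp, sig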
9 Remember that a correlation is in the range from –1 (perfect negative association) to +1 (perfect
positive association). A value of zero indicates that there is no linear relation at all (while there still
can be a nonlinear one, so make sure to check this using scatterplots).
Sometimes you want to know whether the distribution of a metric variable follows the normal distribution. To test this you have several options. The first is to use histograms to visually inspect how closely the distribution of your variable of interest resembles a normal distribution (see page 51). By adding the option normal, Stata furthermore overlays the normal density to ease comparison.
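For example (the variable here is an assumption; ttl_exp is the one tested below):
histogram ttl_exp, normal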
If this seems a little crude, you can use quantile-quantile plots (Q-Q plots). This
graphic plots expected against empirical data points. Just type
qnorm ttl_exp
or click Statistics → Summaries, tables, and tests → Distributional plots and tests
→ Normal quantile plot. The further the plotted points deviate from the straight line,
the less the distribution of your variable follows a normal distribution.
If you prefer statistical tests, you can use the Shapiro-Wilk test (swilk, up to 2,000
cases), the Shapiro-Francia test (sfrancia, up to 5,000 cases) or the Skewness/Kurtosis
test (sktest, for even more cases). As we have about 2,200 cases in our data, we will
use the Shapiro-Francia test:
sfrancia ttl_exp
or click Statistics → Summaries, tables, and tests → Distributional plots and tests
→ Shapiro-Francia normality test.
The test result is significant (Prob>z is smaller than 0.05). Therefore, we reject the
null-hypothesis (which assumes a normal distribution for your variable) and come to
the conclusion that the variable is not normally distributed. Keep in mind that these
numerical tests are quite strict, and even small deviations will lead to a rejection of
the assumption of normality.
In one of the last sections, we compared the wages of college graduates with those
of other workers, and noticed that graduates earned more money on average. Now
there are two possibilities: either this difference is completely random and due to
our (bad?) sample, or this difference is “real” and we would get the same result if we
interviewed the entire population of women in the USA, instead of just our sample of
about 2,200 people. To test this statistically, we can use a t-test for mean comparison
by groups. The null hypothesis of the test is that both groups have the same mean
values, while the alternative hypothesis is that the mean values actually differ (when
you do not understand this jargon, please refer to your statistics textbook10).
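A sketch of the group t-test, comparing mean wages by college graduation:
ttest wage, by(collgrad)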
10 There you will also learn that this test needs certain assumptions fulfilled to yield valid results. We
just assume that this is the case here. In a research paper you should make sure that these assump-
tions are actually true (Acock, 2014: 164–168).
11 In statistics it is common to refer to p-values below 0.05 as “significant” and to any values below
0.01 as “highly significant”.
4.13 Weighting*
Until now we have assumed that our data comes from a simple random sample, which means every unit of the population has the same probability of getting into the sample. In our case, that implies every working woman in the USA in 1988 had the same chance of being interviewed. Sometimes samples are way more complex, and we want to introduce a slightly more sophisticated yet common example.
Suppose that you interview people in a city and your focus is research on migra-
tion. You want to survey the city as a whole, but also get a lot of information about the
migrants living there. From the official census data you know that 5% of all inhabi-
tants are migrants. As you plan to interview 1000 people you would normally inter-
view 50 migrants, but you want to have more information about migrants, so you
plan to oversample this group by interviewing 100 migrants. That means you only
get to interview 900 non-migrants to keep the interviewing costs the same. When you
calculate averages for the entire city you introduce a bias, due to oversampling (for
example, when migrants are younger on average than non-migrants).
To account for this, you weight your cases. Migrants receive a weight that makes
each case “less important”, all other cases will be counted as “more important”. You
can calculate weights by dividing the probability of being in the sample under a simple random design by the actual sampling probability, which gives in our case (the following example is made up and will not work with the NLSW88 dataset):
W_migrant = 0.05 / 0.10 = 0.5   and   W_non-migrant = 0.95 / 0.90 ≈ 1.056
Notice that the weight for migrants is below 1, while it is greater than 1 for non-migrants. You create a variable (pw1) which has the value 0.5 if a respondent is a migrant, and 1.056 if a respondent is a non-migrant. We can use this variable in combination with other commands, for example
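A sketch of this step for the made-up example (the variable migrant is hypothetical):
generate pw1 = 0.5 if migrant == 1
replace pw1 = 1.056 if migrant == 0
mean age [pweight = pw1]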
The weight is called a design weight as it was created in relation to the design of our
sampling process. Refer to help files to check which commands can be used in com-
bination with weighting. Refer to the Stata manual to learn about the svy-commands
that were introduced to make weighting for highly complex or clustered samples pos-
sible (Hamilton, 2013: 107–122). For a practical guide see Groves et al. (2004): 321–328.
12 Some other commands (like tabulate) require you to use integer-weights. To achieve this we multiply
each value by 100 to account for all decimal places of the non-migrant weight (replace pw1 = pw1*100).
Statistical significance
Researchers are usually quite happy when they find a “significant” result, but what does this mean?13
This term has already popped up a few times and will become even more important in the following
chapters. To understand it, one has to remind oneself that the data being used is in almost all cases
a (random) sample from a much larger population. Whenever we find an effect in an analysis, so that
a coefficient is different from zero, we have to ask: did we get this result because there is a real effect
out there in the population, or just because we were lucky and our sample is somewhat special?
Stated differently: we have to take the sampling error into account.
As we can never test this rigorously (which would require repeating the analysis using the entire
population instead of just the sample), statisticians have developed some tools that help us in de-
ciding whether an effect is real. Remember, there is always a factor of uncertainty left, but a p-value,
which indicates whether a result is significant, helps a lot. Usually, we refer to p-values below a value
of 0.05 as significant, which is just a convention and not written in stone.
A common interpretation of a p-value, say 0.01, is: assuming that the effect is zero in reality (if you tested the entire population, not just a sample), you would obtain the test result you received (or an even more extreme one) in 1% of all repetitions of your study, due to random sampling error.
As this error-rate is quite low, most researchers would accept that your findings are real (but there is
still a slight chance that you are just really unlucky with your sample, so be careful!).
13 To be more concrete: they want to find that the coefficient of a variable of interest is statistically
different from zero (in a regression model).
5 Introduction to causal analysis
By now, you will have mastered the basics of Stata. I would call this part descriptive
statistics: methods that allow you to summarize your data, to get a general over-
view about means, standard deviations and distributions. These statistics should
be the first part of every serious research as a good description is the foundation of
any advanced analysis. The topics in the following chapters will introduce advanced
methods that will enable you to test more interesting hypotheses, and make claims
about causal relationships. Interestingly, many researchers, and especially those with a background in statistics, often avoid the word “causal” in their papers and rather use terms like “related” or “associated” to describe their findings. Though this is admirably humble, as a perfect and thorough explanation of causality is a major challenge, it often does not help researchers and especially policy-makers, as they need advice for creating useful interventions to tackle real-world problems. Luckily, you will soon notice that causal analysis is not defined by newly invented or fancy methods, but is rather about thinking in a causal framework to create theories, test hypotheses and explicate results. Therefore, even advanced Stata users who are new to causal analysis will profit from reading this section.
Nearly every student has heard this already: correlation does not imply causation.
This is absolutely correct, as sometimes there are things that covary (appear together),
and one thing does not cause the other. A very popular example is the correlation
between the number of storks and fertility in women: the higher the number of storks
in a certain region, the higher the fertility rate in that region (Matthews, 2000). We
could draw this association as follows (Figure 5.1):
Storks ↔ Fertility
An arrow with heads on both ends implies correlation, not a causal relationship. This association is symmetrical, meaning that when we know one thing we
can predict the other. This brings us to another important aspect, namely the dif-
ference between prediction and causation. Sometimes it is good enough to predict
certain outcomes, even when we are not interested in causality. For example, a demog-
rapher might be interested in predicting the number of births for a certain region.
He might use variables that are causally linked to the birth rate, like the number of
https://doi.org/10.1515/9783110617160-005
women, average income, family structure and so on. But, he could also use other
variables, like the number of storks in that region, even when he knows that there is
no causal relationship between this variable and fertility. As long as this is a stable
correlation it might be good enough to improve his forecast.
In science, prediction is often not good enough when it comes to developing interventions to manipulate reality in our favor. Whenever we see a correlation, but intervening on one variable does not influence the other (the outcome), we know that the relation cannot be causal. For example, breeding storks and, therefore, increasing their population in a given region will probably not increase human fertility. Researchers are often
interested in finding causal relations, so that policy-makers can use this knowledge to
solve problems. Therefore, whenever we encounter a certain correlation, firstly, we have
to check whether this relation is causal or not. One way to solve this problem is to think
about factors that cause both phenomena at the same time, which would create the
bias. This can be done by thinking theoretically and using previous research results and
common sense. When we find such a common cause, we call it a confounder. One defi-
nition of confounder is: a variable that simultaneously affects two (or more) other vari-
ables. In our case the confounder would be the degree of urbanization, which affects both the number of storks and fertility. In rural areas the number of storks is higher, due to the larger number of natural habitats and food sources. Also, we expect higher fertility rates in rural
areas, probably due to different social structures, tighter knit communities or different
family values. We can depict this relation with the following illustration (Figure 5.2):
Note that we have removed the connection between storks and fertility, as we no longer
believe that there is any causal relation (the correlation is still there, yet usually not drawn
in causal diagrams). We could use this graph as a working hypothesis: there is no relation-
ship between the number of storks and fertility rates after taking the effect of urbanization
into account, which is usually called “controlling”. Stated differently: after controlling for
urbanization, there is no effect of number of storks on fertility (or vice versa). We will come
back to this aspect at the end of this chapter, as it is a little technical. In the next section we
will proceed with something we have already started here: causal graphs.
In my opinion, the largest appeal of modern causal analysis lies in its simplicity. The main tool for posing and solving causal questions are causal graphs, also called directed acyclic graphs (DAGs), which are so intuitive that even a ten-year-old child could grasp them. The procedure is as follows: focus on one relation of interest, like the effect
of X on Y (for example the effect of the number of storks on fertility). Now add all other
variables that might somehow be related to these two variables. Asterisks are added to
variables that are unmeasured (unobserved), and, therefore, not available for direct
analysis. Now draw arrows between the variables. Each arrow is only allowed to point in
one direction (the D in DAG). Feedback loops are not allowed (acyclic). That means, an
arrow cannot point to the variable it originated from (self causation) and when an arrow
points from A to B there cannot be an arrow from B to A.1 Every arrow implies a causal
relationship between two variables, pointing from the cause to the effect. Conversely,
when there is no arrow between two variables this means that these variables are not
related causally. Therefore, drawing or omitting arrows implies strong assumptions that should be considered carefully. You should draw arrows based on common sense, theoretical considerations and previous research results. The arrows can also represent tentative causal relationships which you assume, but want to test explicitly.
The graph from the stork example (Figure 5.2) is such a DAG, yet very simple. In
reality these can be larger, yet should not be overwhelmingly complex. If in doubt, it
is better to deconstruct a research project into smaller analyses, than trying to analyze
the entire framework in one big model. For a more detailed introduction to causal
graphs, see the excellent paper by Elwert (2013).
DAGs have two major functions in modern causal analysis: Firstly they enable you to
directly decide whether your research question can be answered causally. If it turns out
that central variables are missing, it could be the case that no unbiased causal effect can
be estimated from your current data (not even with the fanciest methods). This helps you avoid futile work, so you can focus on changing your research question or, even better, collecting more data to fill the gaps. Secondly, when you come to the conclusion that you
can answer your question with the data available, looking at the DAG will tell you which
variables are important in your analysis, and which you can safely ignore. Let’s leave
the storks behind and take a more elaborate (generic) example with the following DAG:
Figure 5.3: a generic DAG with the cause X, the outcome Y and the additional variables T, K, C, A and N.
1 This must also be the case on a more general level. Therefore the structure
A→B→C→A is not allowed, as it is cyclic.
This one seems a little more complex than the example before, yet you will soon learn that there should be no problem dealing with it. Verify for yourself that this graph is acyclic: there are no feedback loops, and as long as you only follow the arrows in their direction you can never return to the position you started from. Our goal is to estimate the causal effect of X on Y. Before we start talking about the general techniques to achieve this, I want to introduce certain constellations that are famous in the literature and that appear often in real applications.
The confounder
We have talked about this constellation before, in Figure 5.2, as you will encounter it
most often in applied research. The confounders in the example above (Figure 5.3) are T, K, A and N, as all of them have multiple effects.2 We will see that controlling for a confounder is often a good idea to avoid bias, yet there are exceptions, so stay tuned.
The mediator
A mediator is a variable that lies on the causal path between cause and effect and transmits the effect: the cause works through it. Think of a burglar alarm that is triggered by a sensor which reacts when a window is broken. The sensor that reacts when the glass is broken is the mediator, as it is the only way the alarm can be activated. If the burglar can pick the door lock and leaves the windows intact, the alarm will not go off. Mechanisms are often highly interesting phenomena
when studied in detail. To stay with our example, when you finally find out that you
had been burgled, but all the windows are intact, it is time to think about a better
alarm system. Other examples of mechanisms are vitamin C, which mediates the
effect of fruit intake on the outbreak of scurvy (epidemiology), and education, which
mediates the effect of parental socioeconomic status on future income of children
(sociology).
2 Technically, X is also a confounder in Figure 5.3, yet not labeled so, as it is part of the
central cause-effect structure we are trying to analyze.
The collider
Probably the most non-intuitive constellation that has caused statisticians serious
headaches for a long time is the collider, which is depicted by C in Figure 5.3
(K→C←A). A collider is a variable that is caused by several other variables and is better left alone. What this means is that a collider usually does not bias your results, unless you control for it. Why is that the case? A simple real-world example: you visit a friend in a hospital that specializes in heart and bone diseases. So there are only two possibilities for a patient: either the bones or the heart are not healthy (Figure 5.5). You usually cannot tell which sickness it is that leads to hospitalization. Now, you
talk to a patient and she tells you that her heart is totally healthy. So you may conclude
that she must have a disease of the bones, even when the two types of diseases are
uncorrelated in the general population. Therefore, whenever you have a collider with
the structure (K→C←A) and you control for C (for example by using C as a predictor
variable in your regression model) variables K and A will no longer be independent of
each other, which can introduce bias. For a more thorough introduction to colliders,
see Elwert and Winship (2014).
After talking about these general structures it is time to come to the most interest-
ing part, that is, using DAGs to estimate causal effects.
You usually want to estimate the effect of variable X (cause, treatment, expo-
sure) on Y (effect, outcome, result), for which you have drawn a causal graph as
explained above. Now, it is time to see how this model guides you through the process of estimating the effect. First of all, we identify all causal pathways. A
causal pathway is a path starting with an arrow pointing from the cause (X) to the
outcome (Y), either directly (X→Y), or with one or more mediating variables in-
between (X→M→Y). These are also called front-door paths. All other pathways are
non-causal paths (causal and non-causal paths are defined relative to a specific
cause and effect).
That brings us to non-causal pathways or back-door paths. A back-door path is
a sequence of arrows, from cause to outcome, that starts with an arrowhead towards
the cause (X) and finally ends at the outcome. For example, in Figure 5.3, X←T→Y is a
back-door path. The arrow from T to X points towards X and, therefore, is a potential
starting point for a back-door path. When you look along this line, you will reach T and
from there you can reach Y. Another example of a back-door path is X←K→C←A→Y.
To estimate the unbiased causal effect of X on Y, it is vital to close (block) all back-door paths between X and Y. How this can be done depends on the type of back-door path. For example, the path X←K→C←A→Y is already blocked automatically, as any colliders on a path block the flow of information (as long as you do not control for them!).
If there are confounders on the path, they will be blocked when you control for them.
As long as we control for T, the back-door path X←T→Y will be blocked. The third
and only remaining back-door path in the example is X←K←N→A→Y. This path does
not contain a collider, therefore, it is unblocked. Luckily, you now have free choice:
controlling for any variable of the path will block it. So you can control either for K, N
or A. Note that the blocking should always be minimal, which means that you should only control for the smallest number of variables necessary. When a path is already blocked, controlling for more variables than actually needed might be harmful.
Let’s take another generic example. Consider the causal graph depicted below
(Figure 5.6). Again, you want to estimate the causal effect of X on Y. Identify all back- door
paths and decide which variables you have to control to get an unbiased result.
Figure 5.6: a DAG with X, Y and the additional variables P, T, L and B.
Apparently there are two back-door paths: X←P→L→Y and X←P→T←Y. You will have noticed that T in the second path is a collider, so this path is blocked right from the start. Also note that B is a descendant of the collider T; therefore, controlling for B would be as harmful as controlling for T directly! The first path is open, but you have two options: you can control for either P or L to block it completely. This example also highlights that there can be back-door paths that contain an arrow pointing away from the outcome (Y).
One last example. Consider the DAG below (Figure 5.7). Can you find a way of
estimating the unbiased effect of X on Y? Consider that S* and H* are not measured,
therefore, you cannot control for them.
Figure 5.7: a DAG with X, Y, G and the unmeasured variables S* and H*.
When you scratch your head now, you might be right, as there is no way to estimate an unbiased effect with the data available. One back-door path is X←S*→G→Y, which can be blocked by controlling for G. But this is a problem, as controlling for G opens up another back-door path, namely X←S*→G←H*→Y.3 Therefore it is impossible to estimate an unbiased effect, even with a million observations! The only way to solve
this problem is to get back to the field and somehow measure either S or H. This should
teach you two things: first, even with “Big data” it might be impossible to estimate causal
effects when relevant variables are not measured. Second, drawing causal graphs can
spare you a lot of futile work, because you can stop immediately when you encoun-
ter such a constellation in the conceptual phase. There is no need to waste any more
resources on this task as you cannot estimate the unbiased causal effect (as was proven
mathematically by Pearl (2009)). When you find my examples quite difficult, or have a
more complex system of variables, you can also use free software that finds all back-door
paths for you, and tells you which variables you have to control for (www.dagitty.net).
This concludes the introduction to causal graphs. Hopefully, you now have a feeling
for how you can assess real life challenges, and start doing modern causal analysis. For
further information on the topic, I refer you to the book of Pearl and Mackenzie (2018),
which is suitable even for the layperson. This was a very short primer; at least three other important techniques for estimating unbiased effects besides closing back-door paths were omitted, as these involve advanced knowledge. The front-door criterion
and instrumental variables are introduced in a quite nontechnical fashion in the book
of Morgan and Winship (2015). Do-calculus is presented and proven mathematically in
Pearl (2009).
3 This example also shows that a variable can have different functions with respect to
the pathway regarded. In the path X←S*→G→Y, G is a mediator, while it is a collider
in the path X←S*→G←H*→Y.
5.4 What does “controlling” actually mean?
Consider a classic example: among elementary school children, body weight and reading scores are positively correlated, so heavier children tend to have better scores. Does this correlation imply causality? Is it possible that well-fed children have more calories that can be used to
power the brain, therefore, increasing the scores?4 Actually, this correlation is spuri-
ous, as both factors are caused by the same aspect: the age of the child. Older children
have better scores, as they have more reading experience and older children also have
a higher weight, on average, as children still grow and become larger and heavier.
Therefore, we have to include the confounder “age” in our model. We measure this by
controlling for the class that the child is in (in this case, as age might not be available
in the data, we say that the class a child is in is a good proxy for age, as pupils in ele-
mentary school are usually quite homogeneous within a class, with respect to age).
To see how this works in detail we start with a scatterplot that visualizes the relation
between reading scores and weight for all children in school.
We can clearly see that weight and reading are correlated and heavier children have
better scores. What we do now is a perfect stratification on the variable of interest
(class). Each stratum is defined by a class and we create new scatterplots, one for
each class. By doing so we can see whether the general relation still holds.
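A sketch of these two steps; the variable names read, weight and class follow the text, but the simulated dataset itself is not provided here:
scatter read weight
scatter read weight, by(class)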
Interestingly, the pattern vanished. When we look at children separated by class
the association disappears and weight and reading ability are no longer related.
Therefore, we can conclude that the correlation was spurious and age was the con-
founder. We could also calculate this numerically: first we calculate Pearson’s R
separately for each class, then we generate the weighted average (each class has
50 children in the example, therefore, we can just use the arithmetic mean without
weighting). The result tells us that the correlation, after controlling for class (also
called the partial correlation), is very low and not statistically significant.
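A sketch of the class-wise correlations (again with the assumed variable names):
bysort class: pwcorr read weight, sig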
4 We quietly ignore the possibility that children reading many hours every day tend to sit
at their desks, which decreases energy consumption and, therefore, causes weight gain.
In the first example, age was operationalized using the class a child was in, which allowed us to have a perfect stratification with four categories. But how does this work when the variable we want to control for is metric (continuous), for example when age is measured as days since birth? We can assume that there are many “categories” and some of them might be empty (for example when no child is exactly 3019 days old). There are many ways to solve this problem. One quite simple option, which
is used in linear regressions, is to use a linear approximation; therefore, it does not
matter whether some categories are empty or sparsely populated, as long as enough
data is available to estimate a linear function for reading ability and age.
You can see this as follows: start with the bivariate (2D) scatterplot which is
depicted below. Suppose we draw a straight line through all the data points to esti-
mate the linear function of the bivariate association. Now we include the control vari-
able, and plot all three variables together. Each value (read, weight and age) describes
one point in three-dimensional space. As we are now in 3D, we no longer fit a straight
line, but a plane through the data (think of a sheet of paper that can be rotated and
tilted, but not bent, until the optimal fit is reached). By doing this, it turns out that
the first dimension (weight), is no longer relevant and the second (age) is much more
important in devising the optimal fit (also see Pearl and Mackenzie (2018): 220–224).
One possibility to visualize this on printed paper is by using added-variable-plots.
Again, we start with only two variables and use weight to predict reading ability (note
that the variables were rescaled which explains the negative values) (Figure on the
next page).
We see that there is a strong correlation between reading ability and weight, and
Stata fitted the regression line for us. The slope is clearly positive (1.43). Does this
relationship change when we include age (measured in days) as a control variable?
Absolutely. The fitted line is now almost horizontal, meaning that there is no
slope, which is statistically different from zero, when age is included as a control vari-
able. Our conclusion is that weight does not predict reading ability when the age of a
child is also taken into account.
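A minimal sketch of these two steps, again assuming hypothetical variables named read, weight and age:
* bivariate model: weight is the only predictor
regress read c.weight
avplot weight
* age added as a control: the added-variable plot for weight is now almost flat
regress read c.weight c.age
avplot weight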
Hopefully this section helped you in understanding the basics of control vari-
ables. The specific techniques and formulas applied depend on the method used,
and perfect stratification or linear approximations are only two of many possibilities,
yet the general idea is often the same. Sometimes it does not even involve statistical
methods at all, for example when controlling happens on the level of research design:
when you plan to interview only a very specific subgroup of the population, this also
implies controlling. Furthermore, this section should also underline the limits of con-
trolling: when only very few cases have information on the control variable it might
be challenging to fit a good approximation to a linear function. In other cases, the
relationship between the outcome and the control is not linear, which poses great
problems, as you will receive biased results if this is the case. The next chapters will show you how to detect such problems in your own applications and how to solve them.
6 Regression analysis
As we now know how to manage and describe data, we can move on to the fun part and analyze data for our research. As this is an introduction to Stata, we will start with a method that is very popular and also a great foundation for the large family of regression-based applications. In this chapter we will learn how to run and interpret a multiple regression. In the next chapter we will check the most important assumptions that must hold so that our analyses yield valid results.
6.1 Research question
Any good scientific research project must start with a theoretically justified research question that is relevant for the public. Finding a good research question that is adequate for the scope of a term paper (15 pages or 8,000 words can be really short!) is a challenge, yet it is worth spending some time on this task, as your project (and grade) will really benefit from a well-formulated and manageable research question. Usually you start with a certain motivation in mind, and
proceed with a literature review to check which questions are still open and find
out where the interesting research gaps lie. You will then develop a theoretical
framework, based on the previous literature and the general theoretical founda-
tions of your field.
When this is done, you should formulate testable hypotheses that can be
answered with the dataset you can access. This is the second crucial step, as it is
easy to lose focus and end up with hypotheses that are somewhat related to the research question, but vague and unclear. You really want to pinpoint a hypothesis, as your project will benefit from a clear and precise formulation. For a general
overview of this process, refer to King et al. (1995): 3–28. Also try to formulate your
research questions and hypotheses in a causal fashion, even when only observa-
tional data is available for analysis. As pointed out recently, almost all research is ultimately interested in causality and it is, therefore, a good idea to spell this out explicitly (Hernán, 2018).
We imagine all this is done, so we can start with the analyses. Of course, as this
book is short, we will deal with ad-hoc hypotheses to practice the methods. It should
be clear that for any real seminar paper you should invest much more time in the
mentioned aspects.
As we want to continue using our dataset of working women we will test the
effects of certain variables on wage. We start by formulating testable hypotheses:
1. Union members will earn more than non-union members (H1).
2. People with more education will earn more than people with less education (H2).
3. The higher the total work experience, the higher the wage (H3).
https://doi.org/10.1515/9783110617160-006
6.2 What is a regression?
As you might have noticed, our hypotheses use three different kinds of variable scal-
ings: binary (H1), ordinal (H2) and metric (H3). Our dependent variable, that is the
variable we want to explain, is metric (wage). When you want to use a linear (multi-
ple) regression analysis, this must always be the case.1
In its simplest form, the regression equation can be written as
DV = β0 + β1 · IV + ϵ
where β0 is the constant (also called intercept), β1 the regression coefficient of the
IV and ϵ is the error term. The error term “collects” the effect of all omitted variables
that have an independent influence on your dependent variable, but are, as the term
omitted describes, not captured in your model. Let’s take an example. We want to
regress income on motivation to see whether income can be explained by motivation.
Our equation would look like this (with made-up numbers):
income = 400 + 20 · motivation
This would mean that a person with a motivation of zero (whatever this means depends on your coding system) would earn 400, and every point more on the motivation scale would increase the income by 20. When your motivation is 10, you would receive an income of 400 + 20 · 10 = 600.
Normally you would also include the error term in the equation, which is all
the influence that cannot be explained by the model. For example, we assume that
1 If your dependent variable is binary, ordinal or nominal, you can use logistic or multinomial
regressions instead.
2 This chapter introduces linear OLS regressions. For a more detailed introduction refer to Best and
Wolf (2015). Exercises that might be interesting for you after you read this chapter can be found in
Rabe-Hesketh and Skrondal (2012): 60–69.
6.3 Binary independent variable
In our first model we want to inspect the relationship between wage and being a
union member. Remember that our independent variable is coded binary, where the
numerical value 1 is given to people who are members. If not already open, load the
data and run the regression:
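A sketch of these two steps, assuming the working-women data is the nlsw88 example dataset that ships with Stata:
sysuse nlsw88, clear
regress wage i.union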
or click Statistics → Linear models and related → Linear regression. Let’s take a
look at the command first. Regress is the regression command in Stata, followed by the
dependent variable (the one we want to explain). Note that a regression can only have
one dependent variable, but one or more independent variables. These follow directly
after. We use factor variable notation to tell Stata how to deal with a binary variable.
Binary, nominal or ordinal variables receive the prefix i. (optional for binary variables),
which helps Stata to run the correct model. Continuous or metric variables receive the
prefix c. (which is optional, but often helpful for getting a quick overview of the model).
Let’s take a look at the output.
The upper left corner of the table shows the decomposition of explained variance.
These numbers are used to calculate some other statistics, like R-squared, which is
depicted on the right hand side. Usually you do not have to care about this part of the
table, as one uses better indicators to assess the model. The more interesting statistics
are found on the right hand side.
Number of obs is the number of cases used in your model. As listwise deletion
is the standard, only cases will be used which have complete information on every
variable in the model. For example, if you use ten IVs in your model and one person
only has information about nine of these (and the last one has a missing value), then
this person is not used in calculating the model.
F(1, 1876) is used to calculate the Prob > F value. This is an omnibus-test which
checks whether your model, in general, explains the variance of the dependent vari-
able. If this value is not significant (larger than 0.05), your independent variable(s)
might not be related to your dependent variable at all, and you should probably refine
your model. As long as this number is low your model seems fine (as in our case here).
R-squared is the percentage of the overall variance that is explained by your
model. This value is quite low and tells us that when we want to predict wages, the
variable union alone is not sufficient to reach satisfying results. Usually it is not a
good idea to assess the quality of a model using only the explained variance, yet it
gives you a rough impression of the model fit. Keep in mind that you can still test
causal mechanisms, even if R-squared is quite low. You can calculate this statistic by
hand using the information on the left (751/32,613 = 0.023).
Adj R-squared is the adjusted R-squared, which is corrected to account for some
problems that R-squared introduces. R-squared will always become larger the more
controls you include, even if you introduce “nonsensical” independent variables to
the model. To “punish” this, adjusted R-squared corrects for the number of explaining
variables used.
Root MSE is the square root of the Mean Square Error. This value can be inter-
preted as follows: if you were to predict the wage of a person, using only the informa-
tion in the model (that is information about the union status), you would, on average,
make an error of about 4.12 $. But keep in mind that this number depends on your
model and should not be compared across different models.
The more interesting numbers are in the lower part, where the coefficients and
significance levels are shown. We can formulate our regression equation, which
would be
wage = 7.2 + 1.47 · union
7.2 is the constant (or intercept), while 1.47 is the coefficient of our independent vari-
able. As union can only take two values (0 and 1), there are only two possible results.
Non-union members will have an average wage of 7.2 while union-members will have
an average value of 7.2 + 1.47 = 8.67. The p-value (P>|t|) of this coefficient is below 0.05,
so we know the result is significant. Thus we conclude that there is a real effect of
union membership on wages, and the result is positive. Please note that this result
might be spurious, as there are no control variables in the model. Formulated differ-
ently, the effect may disappear when we introduce more explanatory variables to the
model. We will deal with this in model 3.
6.4 Ordinal independent variable
After discussing the simplest case, we move on to a model with an ordinal independent variable. As there is no real ordinal variable in the dataset that could be used directly, we have to create one.3 We will use the metric variable, years of schooling
(grade), and transform it (low education, medium education, high education). After
this is done, we use a crosstab to inspect if we made any mistakes.
3 Note that we create an ordinal variable for the sake of demonstration for this example but we will
use the metric version for the rest of chapter six and seven, so working through the book is more
convenient.
4 Note that this categorization is data driven to assure that all groups end up with a sufficient number
of cases. In real research, creating categories should probably be justified on a theoretical basis.
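One possible way to create such a variable and to check it with a crosstab could look like this (the cut-offs and the names education and edulbl are chosen only for illustration, see footnote 4):
generate education = .
replace education = 1 if grade < 12
replace education = 2 if inrange(grade, 12, 15)
replace education = 3 if grade >= 16 & !missing(grade)
label define edulbl 1 "low education" 2 "medium education" 3 "high education"
label values education edulbl
tabulate grade education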
Stata’s factor variable notation makes our life much easier. Usually any ordinal or
nominal variable has to be recoded into dummy variables to be used in regres-
sions. For example, we would have to recode our variable education into two binary
variables, “medium_education” and “high_education”. One category (in this case,
“low_education”) would be our reference-category. Luckily we can skip this step by
using a Stata shortcut.
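Assuming the new variable is called education as above, the model could be run like this:
regress wage i.education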
As the category low education is our reference, we find this effect in the con-
stant. A person with low education will earn 4.88 on average. A person with medium
education will make 1.90 more than that (4.88 + 1.90), a person with high education
5.21 more (4.88 + 5.21). All variables are highly significant, telling us that in com-
parison to low education, the two other groups make a significant difference. We
would thus conclude that education has a positive effect on wage, and education
pays off financially. Also note that our R-squared is higher than in the first model,
telling us that education can explain more variation of wage than the membership
in a union.
Stata will always use the category within a variable with the lowest numerical
value as a reference category. You can show the category of reference explicitly by
typing
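This is done with the following setting (see also the footnote below on the perm option):
set showbaselevels on, perm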
5 The option perm makes this configuration permanent, so you do not have to enter this command
every time you start Stata.
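To make a different category the reference, you can use the ib. prefix; for example, to declare the third category the base:
regress wage ib3.education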
Here category three (high education) would be the reference. You will notice that
all coefficients will change. This must happen since all your relative comparisons will
change as well. Someone with low education will have a wage of 10.09 – 5.21 = 4.88,
which is exactly the same value, as calculated above. You see that changing the refer-
ence categories of independent variables does not change results. You should choose
them in a fashion that helps you understand the results.
Using the i. prefix tells Stata to treat agegrp as a categorical variable. At the top of the output you see
the F-statistic and Prob > F, which are identical to the ones displayed by the ANOVA (11.23 and 0.0000).
You can also see more detailed results when you check the lower parts of the output. The first category
(30–45) is used as a reference and therefore, not shown in the table. We learn that the third group
(60+) displays a highly significant result (P>|t| is smaller than 0.05 here), therefore, we conclude that
there is a difference from the reference-group.
What if you want to test if age-group 2 (46–59) is statistically different from age-group 3 (60+)?
You have several possibilities: firstly you could change the category of reference in the regression
model (see page 89). Secondly, you can use the test command to test this numerically:
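A sketch of such a test, assuming the three age-groups are coded 1, 2 and 3:
test 2.agegrp = 3.agegrp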
You can find this test under Statistics → Postestimation → Test, contrasts, and comparisons of param-
eter estimates and click Create. As the result (0.0019) is significant (Prob > F is smaller than 0.05) you
know that the two group means are different from each other.
In summary, you should keep in mind that ANOVAs and linear regressions are very similar from
a technical point of view, while regressions are more versatile and powerful for advanced analyses,
hence, the emphasis on these models in the rest of the book.
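6.5 Metric independent variable
We now regress wage on total work experience, a metric independent variable. A sketch of the command:
regress wage c.ttl_exp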
The table shows a highly significant effect, with a numerical value of 0.33, and 3.61 for
the constant. Thus a person with zero years total work experience would earn 3.61$ on
average, and with each year more she would receive 0.33$ more. Therefore, a person
with five years of total work experience would earn 3.61 + 5 · 0.33 = 5.26$. Note that this is the case because
work experience is coded in years. If we used months instead, all the numbers would
be different, but the actual effects would stay the same. We conclude that the effect
of work experience on wage is positive, which makes sense intuitively, as experience
should be beneficial for workers.
In chapter five we learned that controlling for the correct variables is essential
when we want to recover causal effects. We will do this by including some more
variables: union status, place of residence (south vs. other) and years of education
(as a metric variable). Therefore, our final (saturated) model is estimated with the
following command6:
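A sketch of this command, using the variable names from the dataset (ttl_exp, union, south and grade):
regress wage c.ttl_exp i.union i.south c.grade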
Note that the coefficient of work experience became slightly smaller, yet is still highly
significant. R-squared also increased drastically, as our model with four independent
variables is able to explain much more variance in wages. One could interpret our
results as follows: "the effect of work experience on wage is highly significant and each additional year of experience will result in a wage increase of 0.27$ on average, when controlling for union-membership, region and years of education". Usually it is not nec-
essary to explain the effect of every single control variable, as you are interested in
one special effect. Note that we call a model with more than one independent variable
a multiple regression.
It is worth taking some time to understand the correct interpretation of the result.
The positive effect of work experience is independent of all other variables in the
model, which are: union status, region and education. Or to formulate it differently:
every extra year of work experience increases the wage by 0.27$, when holding all other
variables in the model constant (ceteris paribus interpretation). When you build your
model using the framework of causal analysis, and have selected the correct variables
to control for (closing all back-door paths), you could even state that work experience
is a cause of income (which is probably wrong in our example, as we have not done all
the important steps, and have created only an ad-hoc model as a general example).
6 The order of the independent variables is without any meaning and does not influence the results.
In chapter ten we will continue to interpret and visualize effects that we have
estimated using regressions so far. Finally, it is time to come back to our hypotheses
and see whether our results do support or reject them.
H1 claims that union members will earn more than non-union members. As the
coefficient of the variable union is positive, and the p-value highly significant
(page 85), we can state: union members do, on average, earn more money than
non-union members. We therefore accept hypothesis one.7
H2 claims that more educated people will earn more money. We can state: people
with a higher education do earn more money on average, as, in contrast to the
lowest category of education, both other coefficients show a positive and highly
significant result (page 88). We therefore accept hypothesis two.
H3 claims that people with more work experience earn more money. We can state:
as the coefficient for work experience is positive and highly significant, people
with more work experience do earn more money on average after controlling
for union membership, region and education (page 92). We therefore accept
hypothesis three.
Confidence intervals
Confidence intervals are a common type of interval estimation for expressing uncertainty in a statis-
tic. In the regression commands so far you have seen that Stata also reports a confidence interval
for each coefficient. The problem is that we (mostly) work with samples from a much greater popula-
tion, that means all statistics we calculate are probably not identical to the result we would get if we
could use all cases that exist. For example, our sample consists of working women between 34 and
46 years of age. We want to know the average work experience they have, which yields 12.53 years
(summarize ttl_exp). Suppose we not only have a sample, but interview every single woman in the
USA between 34 and 46. We would then probably get a different result and not 12.53. A confidence
interval tries to give us a measurement, to see how much trust we can put in our statistic. To compute
it, just type
ci means ttl_exp
7 Keep in mind that you can never verify a hypothesis, as, strictly speaking, hypotheses can only be rejected. If we cannot reject a hypothesis, we accept it (for the time being, as new data or insights could
change our views in the future). If you are interested in this topic, I refer you to the writings of Karl
Popper.
or click Statistics → Summaries, tables, and tests → Summary and descriptive statistics →
Confidence intervals. The standard is a 95% interval. The standard error of the mean is 0.097,
the calculated confidence interval is [12.34; 12.73]. Be careful with the interpretation, as many
people get it wrong and it is even printed incorrectly in journal articles (Hoekstra, Morey, Rouder,
Wagenmakers, 2014)! A correct interpretation would be: “If we were to redraw the sample over and
over, 95% of the calculated confidence intervals would contain the true mean."8 Of course, our sample must
consist of a random sample of the population for this statement to be valid. When we know that our
sample is biased, say we only interviewed people from New York, then the entire statistic is biased.
To understand the interpretation, remember that there must be a true value for our statistic,
which we would know if we had interviewed not a sample, but every single person. Imagine we went
out and collected a sample, not only once, but 100 times, independently. Then in 95 of these 100
samples the calculated confidence interval would contain the true value. In five of the samples it
would not. Also, remember that a confidence interval gets larger when we increase the level. That is
why a 99% confidence interval for work experience would be [12.28; 12.79] and thus broader than the
one calculated before. To see why this is true, consider the extreme case, a 100% confidence interval.
As this must include the true value it has to be from zero to infinity!
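The 99% interval mentioned above can be requested with the level() option:
ci means ttl_exp, level(99)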
TL;DR9: Never use this interpretation: “the probability, that the true value is contained in the
interval, is 95%.”
6.6 Interaction effects*
Until now we have assumed that all effects have the same strength for all persons. For
example, the effect of work experience on wage is 0.27, no matter whether you are a
union member or not, whether you are from the south, or not, or whatever your edu-
cation is. We call this the average effect of experience and often this is good enough.
But sometimes we think, based on our theoretical reasoning, that there are subgroups
which are affected quite differently by some variables. For example, we could argue
that there is an interaction effect between being union members and having college
education, with respect to wages. Stated differently, we expect that the possession of
a college degree moderates how union-membership affects income. In this example, union-membership carries the main effect, while college education is the moderating (interaction) variable. Finally, it is recommended that you draw a causal graph of the model, which could look like this (Figure 6.1):
8 In fact, researchers discuss whether only this very strict interpretation is correct, especially as this is
a question about how one views statistics (Frequentist VS Bayesian approach), see http://rynesherman.
com/blog/misinterpreting-confidence-intervals/ (2018-01-26).
9 “Too long; didn’t read”
Figure 6.1: Causal graph of the model (Union → Wages, moderated by college education).
Interaction effects might sound complicated at first, but are very common in data
analysis, and it is very useful to take some time and make sure that you really under-
stand what this means. Due to factor variable notation it is very easy to calculate
these effects in Stata. Generally, a three-stage procedure is recommended: the first model only includes the main effect. The second model includes the main effect, all control variables and also the interaction variable, but without the interaction effect itself. Finally, the third model adds the interaction effect. As
we want to keep it simple and only specify total work experience as a control variable,
we would run these three models:
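A sketch of the three models:
* model 1: the main effect only
regress wage i.union
* model 2: main effect, interaction variable and control, but no interaction term
regress wage i.union i.collgrad c.ttl_exp
* model 3: additionally includes the interaction effect
regress wage i.union##i.collgrad c.ttl_exp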
By typing two hash signs (##) between i.union and i.collgrad you tell Stata to calculate
the main effects of union-membership, the main effect of college education and
additionally the interaction effect between both. When you type just a single hash sign, Stata will only calculate the interaction effect, which is usually not what we
want. Again, the order of the independent variables is arbitrary. Also note that this
notation is symmetric. Stata does not know which variable is the “main” and which
is the interaction as, from a mathematical point of view, this cannot be distinguished.
It is up to you to define and interpret the results in a way you desire, just like we did
in Figure 6.1.
In the following I will show different options on how to interpret and visualize the
results. Which option seems most comfortable is up to you.
6.6.1 The classic way
First, we will use the output of the model and interpret results directly, which can be
slightly challenging. We see that the coefficient of union is positive and highly signif-
icant (1.27), telling us that union-members earn more on average than non-members.
Exactly the same is true for college education (3.40). Finally, we see that the interac-
tion effect (−0.936) is negative and also significant (p-value smaller than 0.05). We can
now calculate the overall effect of union-membership for two groups, the people with
college education and those without.
Here we calculate the effect of union-membership for both groups, using the classic
way. Our conclusion is that the effect is much stronger for people without college,
which means that people with less education profit more from union-membership. You will have noticed that this procedure requires you to calculate effects by hand, which
becomes rapidly more complex as soon as more variables or other interactions are
present. Therefore, we would kindly ask Stata to do this for us.
10 Read: “The effect of union-membership given non-college education”. In our example college
education can only have two values, zero or one, therefore, only two equations are necessary. If you
have an interaction with a variable with more categories, you have to calculate the effect for all
values. If your interaction-variable has a metric scale (like age), it is often a good idea to summarize
it into a few ordinal categories (for example young people, medium aged people and older people).
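6.6.2 Marginal effects
One possible way to let Stata compute these effects, based on the third model, is the margins command with the dydx() option:
margins collgrad, dydx(union)
This reports the average marginal effect of union-membership separately for women without and with a college degree.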
6.6.3 Predicted values
Another option I want to introduce here does not emphasize effects of certain vari-
ables, but rather uses all information in the model (therefore, also the data from the
control variables) to predict the overall outcomes for certain groups (which implicitly
also tells us something about effects or differences between groups). Again, we can
use margins here:
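A sketch of the command:
margins union#collgrad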
Stata directly calculates expected wages and also takes the effect of work experience into account. For example, a person who is not in a union and also holds no college degree
would earn 6.49$ on average. When you want to compare results between groups,
make sure you get the correct comparisons. In this case, we would compare group 1 vs. 3 (no college education) and group 2 vs. 4 (college education) to assess the effect of
union-membership on wages. You can also get a visual output by typing
marginsplot
Margins is a very powerful command that we will use in chapter ten to visualize
our results. When you still feel insecure about interactions, don’t be discouraged, as
margins makes it simple to get informative results, even when there are many inter-
actions present in your model. If our model was slightly more complex, even experts
would not calculate these effects by hand, but use the Stata internals to get nice
graphics that can be interpreted visually.
6.6.4 Separate analyses by subgroups
A final technique for dealing with interactions is to run separate regressions for each
subpopulation, which is defined by your interaction variable. In the example above,
our interacting variable is college education, which means we have two groups: people
who have college education and people without college education. We can remove
the interacting variable from our regression model and instead run two models, the
first for people with college education, the second for people without.
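A sketch of the split models, run in one step:
bysort collgrad: regress wage i.union c.ttl_exp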
To understand what bysort does, see how we can get the exact same results with
the if qualifier:
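A sketch with the if qualifier:
regress wage i.union c.ttl_exp if collgrad == 0
regress wage i.union c.ttl_exp if collgrad == 1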
- output omitted -
By comparing the coefficients of union, you can check whether there is any inter-
action. When the coefficients are quite similar, we would conclude that there is no
interaction at all. In our case the results underline that there are differences (1.27 for
people without college, 0.33 for people with college). You will notice that these results
are very close to what we have calculated above as “marginal effects”. This split-tech-
nique is preferred when you have many variables in the model, and you expect many
interaction effects. For example, when you expect interactions, not only between
union and college, but also between total work experience and college, you would
normally have to specify the second effect in another interaction-term. When you run
separate regressions by groups, the model will implicitly account for any possible
interactions between your grouping variable and any other explaining variable used
in the model. Basically, this design can be helpful when you expect your groups to
be highly different from each other in a large number of effects. If necessary, you
could even specify explicit interactions, within this split-design, to account for higher
orders of interaction (if there is any theoretical expectation of this). The main down-
side of this split approach is that your two groups are no longer in the same model,
therefore you cannot compare coefficients easily. For example, it is no longer possible
to tell if the coefficient of experience in model one (0.289) is statistically different
from the same coefficient in model two (0.32).
To summarize this section, interaction effects are often highly interesting and
central to many research questions. Possible examples are: that a drug has differ-
ent effects on men and women, a training program affects young and old workers
differently, or a newly introduced tax influences spending of large and small firms
differently. Whenever you want to test interactions, you should have a clear theoret-
ical framework in mind that predicts different outcomes. Thanks to Stata, actually
calculating and interpreting interactions is as easy as can be.
6.7 Standardized regression coefficients*
Standardized regression coefficients have two main advantages. Firstly, they make it possible to compare the effect of a variable that is measured in different units across studies (for example, when one study measures age in years and the other in months). Secondly, they make it possible to compare effect-sizes for variables with different units within one study (for example, when you want to measure what affects life satisfaction more, the income or the number of close friends). The idea is
as follows: you z-standardize (see the formula in the footnote on page 49) your depen-
dent variable and all (metric) independent variables and use these modified variables
in your regression. Stata makes this easy:
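A sketch, using the bivariate model from above together with the beta option:
regress wage c.ttl_exp, beta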
The output shows the normal coefficient (0.331) and the standardized coefficient
(0.265). The interpretation is as follows: when the work experience of a woman
increases by one standard deviation, the wage increases by 0.265 standard devia-
tions. The result is highly significant.
If you would like to see how this works in detail, you can reproduce the results
on your own:
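A sketch of the manual way (the names of the new variables are arbitrary):
egen z_wage = std(wage)
egen z_ttl_exp = std(ttl_exp)
regress z_wage z_ttl_exp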
You see that the results are identical. When you want to learn more about standard-
ized coefficients in Stata, have a look at the paper by Doug Hemken.11
11 https://www.ssc.wisc.edu/~hemken/Stataworkshops/stdBeta/Getting%20Standardized%20
Coefficients%20Right.pdf (2018-05-11)
7 Regression diagnostics
There are some assumptions that must be fulfilled so that the regression yields correct results (Wooldridge, 2016: 73–83; Meuleman, Loosveldt, Emonds, 2015). Therefore, it is strongly recommended to test these assumptions to see whether you can trust your conclusions. When you find out that there are severe problems in the data, it might be a good idea to redo the analyses or find other data sources. The least you can do is report all violations, so that others can take this into account when reading your paper.
If not otherwise stated, all diagnostics are applied to the following model we used
in chapter six:
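A sketch of this model, with the same variable names as in chapter six:
regress wage c.ttl_exp i.union i.south c.grade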
7.1 Exogeneity
The term exogeneity describes that the expected error, given all your independent variables, must be zero (formally: E(ϵ | IV) = 0).1 Stated differently, any knowledge about your independent
variables does not give you any information about the error term. When we come back
to our example from the last chapter, this means that a union member has the same
intelligence, motivation, etc… as a non-union member, unless these variables are also
used as controls in the model.
This is a very strict assumption that cannot be tested statistically. When you want
to pinpoint causal effects with your analyses, you have to make sure that you account
for all possible further explanations, which is a difficult task. Refer to introductions
to causal analysis, to learn more about that (see chapter five in this book, Morgan
and Winship (2015) and De Vaus (2001)). For the moment, you should keep in mind
that when you think that several other factors could have a causal influence on your
dependent variable (that are in any way connected to one of your explaining (IV) vari-
ables), you have to control for these factors.
To summarize, there is no statistical test that tells you whether your model is "correct" and will calculate the desired causal effect. Nevertheless, there is one test you can try, the Ramsey RESET test, to see whether there are general problems with the model.
estat ovtest
https://doi.org/10.1515/9783110617160-007
7.2 Random sampling
The second basic assumption is that your data is from a (simple) random sample, meaning
that every unit of the population has the same (nonzero) chance of getting picked for the
sample. This assumption is often violated when clustering occurs. For example, when
you want to collect general information about the population of a country, it would be
desirable to sample from a national register that contains every single person. As this
is often not available, a common step is to first sample cities and then sample from a
city-based register. This is convenient and brings the cost down, as interviewers only
have to visit a limited number of places. The problem is that persons within one cluster
are probably somewhat similar, and, therefore, units are not completely independent
of each other. Consequently, the random sampling assumption is violated. Other exam-
ples are regions, nested within countries, or people nested within families. When you
are confronted with this problem, using multilevel models is a solution (Rabe-Hesketh
and Skrondal, 2012: 73–137). Another remedy is to control for the stage-two variable (for
example, the city). Estimated effects are then independent of place of sampling.
7.3 Linearity in parameters
In a linear regression the relationship between your dependent and independent variables has to be linear. For illustration, say the coefficient of the variable total work experience is 2.5, which means that one more year of job experience results in an income that is increased by 2.5$. This effect has to be stable, no matter whether the change in your experience is from 2 to 3 years or from 20 to 21 years. In reality, we find that this assumption is often violated, especially when dealing with the variable age, as saturation effects occur. When we think about the effect of age on wages, we would expect that, for young people, there is a positive relationship, as people gain experience over time, which makes them more valuable. This relationship will get weaker as people reach their peak performance. When people get even older, their energy and speed will decline due to biological effects of aging; therefore, wages will start to decline. It is obvious that the relationship between age and wages is not constant over the life course, but looks like an inverted U. Whenever this occurs, the regression will yield biased results, as it assumes strict
linearity. For illustration, the following graph shows the actual connection between age
and household income, taken from the German ALLBUS dataset (N=2,572):
The first step is to check the relationship between the dependent variable and every
independent variable graphically. Note that this only makes sense when the inde-
pendent variable is metric, as binary (dummy) variables always have a linear rela-
tion to the dependent variable. A simple scatterplot gives you a first impression as to
whether the relationship is linear or not.
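A sketch of such a command for total work experience, combining the scatterplot, a linear fit and a lowess fit:
twoway (scatter wage ttl_exp) (lfit wage ttl_exp) (lowess wage ttl_exp)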
This command creates three plots in one image. The first part creates a simple scat-
terplot which we already know. The second part lets Stata fit a straight line through
the data points, while the last command creates a locally weighted graph which can
be different from a straight line. Luckily we come to the conclusion that the relation
of our only metric variable and the dependent variable is fairly linear, as the linear fit
and the lowess fit are very close to each other (apart from a small region to the right).
This means that our assumption is probably fulfilled. You can also use the binscatter
CCS, which I personally enjoy very much (see page 58).
Another option is to use residuals to check for nonlinearity. When you run a
regression, Stata can save the predicted value for each case. As we want to explain
wages through some other variables, Stata can calculate the wage that is predicted
by the model, for each case. The difference between the observed value of wage,
and the predicted value, is the residual. For example, when a respondent has a
wage of 10 and the regression predicts a wage of 12, then the residual is -2 for this
person. Usually you want the residuals to be as small as possible. When we plot
residuals against our variables and cannot detect any pattern in the data points,
the relation seems linear. For point-and-click, go to Statistics → Postestimation and choose Predictions → Predictions and their SEs, leverage statistics, distance statistics, etc.
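A sketch of these steps after running the regression (the variable name res is arbitrary):
predict res, residuals
scatter res ttl_exp, yline(0)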
7.3.1 Solutions
When you come to the conclusion in your own analysis that a relation is not linear
you can try to save your model by introducing higher ordered terms as additional
variables in your model (polynomial regression). For example, let’s say the variable
age causes the problem. Then you can use Stata’s factor variable notation (include
the term c.age##c.age or even c.age##c.age##c.age, a three-way interaction).2 How
exactly you want to model and transform your independent variable depends on the
functional form of the relationship with the dependent variable. After you have run your command, you can use the calculated R-squared to assess the model fit. When the value goes up, this tells you that your new model is probably better.
If this still does not help you, take a look at the idea of a piecewise regression, or at
the commands gmm, nl or npregress kernel3 (warning: these are for experienced users
and require a lot of understanding to yield valid results).
Nested models
Often you want to compare different models to find the variable transformation that helps you best
to describe the distribution of your data points mathematically. R-squared is a statistic that is very
commonly used to do this, as higher values indicate a better model fit. However, this is not without
problems, especially when you want to compare nested models. A nested model is a subset of another
model. For example, when you want to predict wage and use the variables south and union and after
that, you run a second model which uses the variables south, union and work experience, then the
first model is nested inside the second, as its explaining variables are a subset of those in the second model.
By doing this you want to see which model is better to describe reality. The problem is that you usually
cannot compare R-squared values over nested models. Luckily other statistics can help you out.
Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are suited to do this.
You can get them by running your model and typing
estat ic
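For example, the two nested models mentioned above could be compared like this; the model with the smaller AIC and BIC values is preferred:
regress wage i.south i.union
estat ic
regress wage i.south i.union c.ttl_exp
estat ic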
7.4 Multicollinearity
When you have a large number of independent (control) variables it can be problem-
atic if there are strong correlations among these. For example, you could have one
variable age, measured in years, and you create a second variable which measures
age in months. This variable is the value of the first variable multiplied by 12. This is a
problem for Stata, as both variables basically contain the same information and one
of the two variables must be dropped from the analysis (to be more precise: one of the
two is a linear combination of the other). Sometimes there is no perfect linear combi-
nation that predicts another variable, but only a strong correlation, for example when you use two operationalizations of intelligence that are highly correlated. If this happens,
the standard errors of estimated coefficients can be inflated, which you want to avoid.
You can test for multicollinearity after you run your model with the command
estat vif
7.4.1 Solutions
When you notice that some variables have an unusually large value for VIF, it could help to exclude one of them from the regression and check whether the problem persists. Another variable probably already accounts for a large part of the aspect you try to measure, as the correlation is high. You should also think theoretically about how the variables
you used are related to each other and whether there are any connections. Maybe you
can find another variable that has a lower correlation with the other explaining vari-
ables but can account for a part of your theory or operationalization.
7.5 Heteroscedasticity
As explained above, after running your regression you can easily compute residu-
als, which are the differences between predicted and actual values for each case. One
assumption of regressions is that the variance of these errors is identical for all values
of the independent variables (Kohler and Kreuter, 2012: 296). Stated otherwise, the
variance of the residuals has to be constant.4 If this assumption is violated and the
actual variance is not constant we call this heteroscedasticity.5 We can check this by
plotting the residuals against the predicted values:
rvfplot, yline(0)
We can also test this numerically with the Breusch–Pagan test:
estat hettest
6 Here you see that "significant" does not always mean "good". Keep in mind that p-values can be used for
a wide range of statistical hypothesis tests. The meaning of the result depends on how the hypotheses
are formulated, which is arbitrary. Therefore, always pay attention to what your null- and alternative
hypothesis state. In this case here, Stata is so nice and displays the null hypothesis as a reminder (H0:
Constant variance).
7.5.1 Solutions
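To see what is going on, it helps to inspect the distribution of the dependent variable first, for example with a simple histogram:
histogram wage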
This clearly shows that most values are between 0 and 15 and above that there
are only a few cases left. There are many ways to solve this problem. What you
should do depends on your goals. If you are only interested in interpreting signs
and p-values (to see if there is any significant effect at all) you can transform the
dependent variable and live with it. In contrast to that, when you want to make
predictions and get more information out of the data, you will certainly need a
re-transformation to yield interpretable values, which is a little more complex.
Therefore, I will start with the simple solutions and proceed to the slightly more advanced options. A good starting point is to look for a transformation of the dependent variable that makes its distribution more symmetrical:
gladder wage
or click Statistics → Summaries, tables, and tests → Distributional plots and tests
→ Ladder of powers. You will receive several histograms that depict how the variable
looks after performing a transformation. Pick the one that looks most symmetrical
and normally distributed, in our case the log-transformation.
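A sketch of these steps (the name lwage for the new variable is arbitrary):
generate lwage = log(wage)
regress lwage c.ttl_exp i.union i.south c.grade
rvfplot, yline(0)
estat hettest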
You will notice that the graphical distribution of the data points is much more equal
and signals homogeneity. The numerical test is still significant, but the p-value is larger and thus we have reduced the extent of the problem. If you think this is still not
good enough, you can try the Box-Cox-transformation.
You find this command under Data → Create or change data → Other variable-
creation commands → Box-Cox transform.
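A sketch of the command, assuming the new variable should be called bcwage as in the text below:
bcskew0 bcwage = wage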
We conclude that this distribution is clearly more symmetrical than before. You
can now run another regression with bcwage as the dependent variable and interpret
7 Log() or Ln() is the natural logarithm in Stata, so Euler’s number is used as the base. If you want the
logarithm with base 10, use log10(wage).
the signs and p-values of the variables of interest. The problem of heteroscedasticity
should be greatly reduced.
Predicted values
Often it is more important to actually predict values for certain constellations than to merely conclude that a certain variable has a significant influence or not. If you need this, a little more work is required so that you receive results that are accessible to non-statisticians. The first and super-easy option is to run a normal regression model without
any changes and specify robust standard errors. Stata will apply different algorithms
that can deal better with heteroscedasticity.
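A sketch of the same model with robust standard errors:
regress wage c.ttl_exp i.union i.south c.grade, vce(robust)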
You will notice that the coefficients are identical to the normal regression, but the standard
errors and, therefore, also the confidence intervals changed slightly. You can compare the
results to the same model which uses regular standard errors (see page 92). To summarize it,
you can use robust standard errors when you expect high heteroscedasticity but keep in mind
that they are not magic and when your model is highly misspecified they will not save your
results. Some statisticians recommend to calculate normal and robust standard errors and
compare results: if the difference is large the model is probably poorly specified and should
be revised carefully (King and Roberts, 2015). So you see, this option can be helpful in some
cases, but sometimes it will not yield the best outcomes.
If you come to the conclusion that robust standard errors will not really improve
your model, you can transform your dependent variable, as described above, and
later re-transform the predictions to produce results that can be interpreted easily.
We will choose a log-transformation for the example (make sure you have created
the logged variable as explained before):
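A sketch of how this could look, assuming the logged variable is called lwage and using illustrative values of work experience in the at() option:
regress lwage i.union c.ttl_exp i.south c.grade
margins union, at(ttl_exp=(0(5)25)) expression(exp(predict(xb)))
marginsplot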
The magic happens in the option expression. We specify that the exponential-function
(exp) should be applied to the result, as this is the inverse-function of the log-func-
tion. Finally, Stata produces a nice graph for us which shows the effect of union-
membership for certain values of work experience. This method is not restricted to the
log-function, as long as you specify the correct inverse-function for re-transformation.
Note that this is an advanced topic and you should further research the literature, as
the “best” transformation, or method, also depends on your research question and
your variables. For example, some statisticians even prefer a poisson model over the
regression with a log-transformed variable.8
7.6 Influential observations
Sometimes there are extraordinary cases in your dataset that can significantly influ-
ence results. It can be useful to find and investigate these cases in detail, as it might
be better to delete them. This can be the case if they are extreme outliers, or display
a very rare or special combination of properties (or just plain coding errors). To find
these cases, Stata can compute different statistics.
8 https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ (2018-07-22).
7.6.1 Dfbetas
The first option for checking special cases is Dfbetas. Run your regression model as usual and type
dfbeta
The larger the absolute distance of a value from zero, the greater the influence. You can check the cases which seem the most distant and inspect them individually.
Then repeat for all other created dfbetas. There is a rule of thumb to estimate the value above which cases are problematic, which is calculated by the formula 2/√n, where n is the number of cases (observations) used in the model. In our example this would be 2/√1876 = 0.0462.
We can get a complete list of all cases that violate this limit by typing
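A sketch for the first of the created variables (dfbeta names them _dfbeta_1, _dfbeta_2 and so on):
list idcode _dfbeta_1 if abs(_dfbeta_1) > 0.0462 & !missing(_dfbeta_1)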
- output omitted -
Do not forget to exclude cases with missing values, as otherwise these will be listed
as well!
According to this, 83 cases are problematic (this is the result for the first dfbeta
only). As there is no general rule for dealing with these cases, it is up to you to inspect
and decide. When you do nothing and just ignore influential observations, it is also
fine, as these are rarely discussed in research papers. When you are unsure what to
do, consult your advisor (as he or she is the one who will grade your paper).
When you do not like Dfbetas, since they are tedious to check when you use a great
number of variables in your model, an alternative is Cook’s distance which summa-
rizes all information in just one variable. This measurement detects strange patterns
and unusual variable constellations and can help to identify coding errors or prob-
lematic cases. As usual, run your model first and type
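A sketch: compute Cook's distance, sort the data in descending order and inspect the most influential cases (the variable name cook is arbitrary):
predict cook, cooksd
gsort -cook
list idcode wage grade south union in 1/5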
This person has an extraordinarily high income, although her education is only
medium and she is from the south. In general, this is a strange constellation which
causes the high leverage of this case on the results.
A final possibility is to use leverage-versus-squared-residual plots which combine
information about the leverage and residuals. This makes detection of outliers easy.
Just run your regression and type
lvr2plot, mlabel(idcode)
This plot shows, for each case, the individual leverage (y-axis) and the squared residual (x-axis). If the residual of a case is high, this tells us that our model makes a prediction that is quite far off from the real outcome. Therefore, the residual is related to the dependent variable of the case. A high leverage of a case tells us that the constellation of independent variables of this case is so extreme, or uncommon, that it influences our final result disproportionately. Therefore, the leverage is related to the independent variables of a case. The two dotted lines in the graph show the means for both residuals and leverages. The diagnostics reported here are not exhaustive, as a much longer list of tests and methods exists that can be used to assess the results of a regression. By checking the most important aspects that you should always control, I think that the results should be quite robust and suitable
for publication. It is always a good idea to ask the people who will later receive your
paper whether they have extra suggestions for tests that you should apply. Also,
have a look at the literature and footnotes throughout the chapter, as they provide
valuable information.
7.7 Summary
The following table (Table 7.1) will summarize the most important aspects we have
learnt so far. Also keep in mind that bias is usually worse for your results than inflated
standard errors.
Macros
Sometimes it is necessary to run a command several times, with different options, to compare results,
or to just play around with specifications. When you use a command which includes many variables,
typing it all the time can get tiring. Another very common scenario is that you run a model with some
control variables and later get feedback from your colleagues who suggest that you add some more
variables. Now, you probably have to add these new variables at every place in your do-file, to update
your code, which is error-prone.
A solution to this is to use local macros, which allow you to define a list of variables in one place and then use it en bloc. When you come back later and want to add or delete variables, you only
have to do so in one place. The general syntax is easy:
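A sketch, using the control variables from our model (the macro name controls is arbitrary):
local controls c.ttl_exp i.south c.grade
regress wage i.union `controls'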
The first line defines the macro with the keyword local. After that you put the name ("controls"). Then a list of all variables follows, here with factor variable notation. The second line runs a regression with wage as the dependent variable. There will be four independent variables, union and the three others,
9 Note that the opening and closing symbols are different. The first character is the grave accent, the
second one the apostrophe.
which are included in “controls”. The macro must be typed in a do-file and only “lives” as long as the
do-file is executed. After that you cannot use the local again, say, in the command line.
The counterpart to local macros is global macros. The main difference is that these macros can be
used as often as you want, and they will still be in memory after your do-file has run. You can change
a global macro anytime by redefining it. The usage is similar:
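A sketch with the same variables:
global controls c.ttl_exp i.south c.grade
regress wage i.union $controls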
To refer to the global macro the dollar-sign is used as a prefix. Keep in mind that globals can be prob-
lematic when it comes to debugging, as old or “forgotten” macros might haunt you later on.
8 Logistic regression*
Linear regressions can be used, as long as the dependent variable is metric (exam-
ples of metric variables are wage, working hours per week or blood pressure). We often encounter data that does not fit this requirement, as it is binary. Classical examples
are: whether somebody is pregnant, voted in the last elections or has died, as only
two possibilities (states) exist. This chapter will briefly introduce logistic regressions, which are used when these types of variables are to be explained. As we will
not talk about any statistical foundations of the method, you are advised to have a
look at the literature, for a general introduction (Acock, 2014: 329–345; Long and
Freese, 2014). It is also recommended that beginners read chapter six about linear
regressions first before starting here. Finally I want to underline that I will only
introduce predicted probabilities, as Odds Ratios or Logits seem no longer adequate
in the presence of unobserved heterogeneity, and are especially hard to grasp for
beginners.
8.1 Introduction
When running linear regressions we can explain the influence on a dependent metric
variable in a fashion like “People who are union members earn on average 1.5 $ more
per hour than other people.” Pay attention to the metric of our prediction, which is
measured in dollars, years, points or any other metric unit. When we deal with binary
variables this is no longer possible, and we have to start thinking in probabilities. In
a logistic regression, an independent variable can only influence the probability of
going from one state (0) to the other (1).
In our example we want to research the effect of age on the chance of having
a heart attack. The coding of the dependent variable is binary: people either had a
heart attack or not, therefore, we use a logistic model. The independent variable is
metric. Note that the arrow of causation can only point in one direction, as age can
cause heart attacks, but heart attacks cannot cause (or influence) age. We will also
introduce control variables, to account for spurious correlations.
To see the method in action, we will load another dataset, taken from the second National Health and Nutrition Examination Survey (NHANES II),1 which surveyed people in the USA about their health and diets between 1976 and 1980. We
will use a sample of the data, which is not representative of the general population
of the USA. Furthermore, we will ignore the sampling design, which introduces bias.
Consequently, our results will not be valid in general or meet scientific standards,
1 http://www.stata-press.com/data/r12/nhanes2.dta (2018-02-26).
https://doi.org/10.1515/9783110617160-008
yet will be adequate for the purpose of demonstrating the technique. We open the
dataset by typing
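For example, the file can be loaded directly from the address given in the footnote:
use "http://www.stata-press.com/data/r12/nhanes2.dta", clear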
First we inspect our dependent variable (heartatk) and our central independent variable (age) to get an impression of the general distribution of the data:
tabulate heartatk
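The distribution of age can be inspected, for example, with
summarize age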
We see that about 4.6% of all respondents suffered a heart attack and that age ranges from 20 to 74 years, with an arithmetic mean of 47.6. Also have a look at the coding of the dependent variable: 0 stands for "no attack" and 1 stands for "attack". The dependent variable must have exactly two numerical values, zero (0) and one (1). Stata will treat zero as the base category or "negative outcome". Other coding schemes are not accepted.2 To see whether any association between the two variables exists, we can use Spearman's Rho (see page 67):
2 Type set showbaselevels on, perm so Stata always displays the reference category.
The result underlines that there is a positive association, which is highly significant (Prob > |t| is smaller than 0.05). This result will not surprise us, as we know that, in general, older people have more heart attacks due to the biological consequences of aging on blood vessels.
Why can’t we just use a linear regression to calculate the results? The main
problem is that the model is probably not linear at all. This means that the probability
of suffering a heart attack is not constant over the years, but further increases with
age. The effect of an increase of one year will only slightly increase the probability of
having a heart attack when you are 25, but much higher when you are 75, therefore,
we have nonlinearity in the parameters. We will now test this empirically by running
the logistic regression, with age as the only predictor. We also use the factor vari-
able notation, as in chapter six, where continuous variables receive the prefix c. and
binary and ordinal variables receive the prefix i.
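A sketch of this first model:
logit heartatk c.age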
For now, we only interpret the sign of the coefficient. Although this might seem disappointing, statisticians have shown that any other interpretation can be highly biased (Mood, 2010). The positive sign means that an increased age results in an increased probability of having a heart attack (the higher the age, the higher the probability). As Stata uses 0 as the point of reference ("no event"), the model explains the change to the other category (1, "event"). If the coding were reversed, we would receive a negative coefficient for age.
We also want to see the nonlinear character of the predicted probabilities that was assumed in the theoretical argumentation. To assess this, we can calculate predicted values: based on the model, Stata will estimate the probability of suffering a heart attack for each person.
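A sketch of how this could be done; predi is the variable name for the predicted probabilities that is also used further below:
* predicted probability of a heart attack for each person (default statistic after logit)
predict predi
twoway scatter predi age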
The scatterplot shows the nonlinear relation, as the slope increases with age. To obtain more concrete results, we can calculate predicted probabilities using the margins command:
margins
summarize predi
You see that the result is 0.046 as well. Note that the standard errors are different, as margins is more complex and calculates other statistics (such as confidence intervals) as well. A second option is the marginal outcome at the mean:
margins, atmeans
What happens here is different from the first margins command. Now Stata calculates the empirical arithmetic mean of every independent variable in the model (in this case only age, that is, 47.6 years) and internally changes the age of every person to this value. Then it predicts the individual probabilities and averages the result. You see that the outcome (about 0.02) is not equal to the first one (0.046). Here we learn that a person who is 47.6 years old has, on average, a probability of about 2.2% of suffering a heart attack. You have to decide on theoretical grounds which result you want. This decision is important, as the outcomes can differ considerably!
We can extend this logic and produce more informative graphs by calculating not just one single prediction at the mean, but predictions for a wide range of ages. The next command will calculate the probability for every single age from 20 to 74 and then turn the results into a pretty graph.
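For example, requesting a prediction for each year of age and plotting the results:
margins, at(age=(20(1)74))
marginsplot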
Until now we have only included age as an explanatory variable in the model, which is usually not a good idea, as other variables might help explain the real relationship. Therefore, we want to add control variables, just as we did in the linear regression. On theoretical grounds we decide that gender, the Body Mass Index (BMI) and the region where a person lives might be relevant factors, so we include them. Furthermore, as we already did in the linear regression, we will include a higher-ordered term for age. One might wonder why this is still relevant, as we talk about nonlinear effects anyway, but linearity between the logarithmic odds of the dependent variable and all metric independent variables is still required (Kohler and Kreuter, 2012: 368). Our second model looks like this:
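A sketch, using the NHANES II variable names sex, bmi and region; the exact specification may differ:
logit heartatk c.age##c.age i.sex c.bmi i.region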
Due to the higher-ordered term, it is practically impossible to interpret this model using coefficients, so we first compute Average Marginal Effects (AMEs) and then produce graphs.
margins, dydx(*)
The asterisk means that the effects are computed for all variables in the model. We see that the AME for age was slightly reduced, yet it is still highly significant. Quite interestingly, gender accounts for a lot. The interpretation is as follows: women have on average (all other variables held constant) a 3.7 percentage points lower probability of suffering a heart attack than men. The effect is highly significant.
To understand the AME as a counterfactual, we could ask: what would the probability of the event be if every person in the sample were male? Stata internally sets gender to male for all persons (men and women!), leaves the other covariates untouched, predicts the individual probabilities and averages them. Then it repeats the process, this time setting the gender to female, again computing the average probability, and finally reports the difference between the two averaged predictions, which is the AME.
To continue, we produce more detailed graphs which also show the effect of age:
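For example (the step width is chosen only for illustration):
margins, at(age=(20(5)74))
marginsplot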
The results are quite similar to what we have seen before. We can now try to calculate
separate effects for genders:
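One way to obtain predictions separately by gender over the age range could be:
margins sex, at(age=(20(5)74))
marginsplot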
The results are impressive and highlight the stark differences between men and women. Up to age 40, the probabilities of having a heart attack are quite similar for men and women; after that the gap widens drastically. Note that we did not even have to specify an explicit interaction effect between age and gender, as this happens "automatically" in a logistic regression, since all variables are somehow related to each other.
Another option is to calculate Average Marginal Effects for gender, not only for the general model, but over a wider range of values for age.
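A sketch (2.sex denotes the change from men to women, as explained below):
margins, dydx(sex) at(age=(20(5)74))
marginsplot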
Remember that the AME tells you the marginal effect of changing gender on the probability of having a heart attack. For example, in the graph above we see that women aged 60 have a probability of having a heart attack that is about six percentage points lower than that of men. We come to this conclusion because the AME above is calculated for 2.sex (see the title of the graph). As men are coded with zero and are, therefore, the category of reference, women are the second category, so the depicted effect shows the change from the reference to this category. Additionally, a 95% confidence interval is included in the graph to convey the uncertainty of the estimation.
One last option you can try is the Marginal Effect at the Mean (MEM), which is just a combination of the aspects we have learned so far:
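For example, combining dydx() with atmeans:
margins, dydx(sex) atmeans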
The outcome tells us that if you had two otherwise-average individuals, one male and one female, the probability of having a heart attack would be 2.1 percentage points lower for the female.
8.3 Nested models
What was explained on page 106 is also valid for logistic regressions: when you want to compare nested models, make sure that they all use the same number of cases. You can use the AIC or the BIC to compare model fit (remember, lower values indicate better-fitting models). Another option is the likelihood-ratio test, which checks whether one model performs better than the other. To do this, estimate both models, save the results and run the test:
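A sketch, comparing for illustration the model with and without the squared age term:
logit heartatk c.age i.sex c.bmi i.region
estimates store m1
logit heartatk c.age##c.age i.sex c.bmi i.region
estimates store m2
lrtest m1 m2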
8.4 Diagnostics
A review of the literature shows that there is no clear consensus about which diagnostics are required for logistic regressions to yield valid results. The following section is an overview of some aspects that will contribute to the quality of your model. The good news is that logistic regressions have lower standards than linear ones, so violations might not have overly severe consequences. We will use the following model for all diagnostics shown here:
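Presumably the second model from above, repeated here as a sketch:
logit heartatk c.age##c.age i.sex c.bmi i.region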
You should create a model based upon your theoretical considerations. You can further test, statistically, whether your model is missing variables or includes unnecessary ones. To do so, run your model and type
linktest
Logistic models usually need more observations than linear regressions. Make sure to use at least 100 cases. When you have a lower number of cases, you can run an exact logistic regression (type help exlogistic for more information). Furthermore, it is vital that there are no empty cells in your model, which means that cases must be available for every combination of your variables. In the example above, we include the region (four categories) and the gender (two categories) as predictors, so there are eight cells (considering only these two variables). An empty cell is a combination of levels of two or more factor variables for which no data is available. When this happens, Stata will drop the respective categories automatically and not show them in the output. Having enough cases will, in general, reduce the chances that you have any empty cells.
8.4.3 Multicollinearity
Just like linear regressions, logistic ones are also affected by high correlations among independent variables. Testing this requires another user-written command (collin). Try
search collin
and look for the entry in the window that pops up (Figure 8.1). When the installation is complete, enter the command followed by the variables you used in the model:
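A sketch, simply listing the model variables (collin expects a plain variable list; see its help file after installation for details):
collin age sex bmi region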
As a general rule of thumb, it might be a good idea to readjust your model when a variable shows a VIF above 10. In this case, remove one variable with a high VIF and run your model again. If the VIF is lower for the remaining variables, removing it was probably a sensible choice.
Some observations might have uncommon constellations of values in their variables, which makes them influence the results over-proportionally. It is usually a good idea to inspect these cases, although there is no general rule on how to deal with them. When you can exclude the possibility of plain coding errors, you can either keep or drop these cases. In contrast to the linear regression, we will use a slightly different measure of influential observations in logistic regressions (Pregibon's Beta, which is similar to Cook's Distance). After your model has run, type
We see a few extreme outliers, for example case 27,438. What is wrong here? Let’s have
a closer look. We want to list all relevant values for this case, so we type
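For example (again assuming sampl is the identifier and 27,438 is the value reported in the plot):
list heartatk sex age bmi region if sampl == 27438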
The information here is perplexing. This woman with an age of 28 and a BMI of 27 (which counts as slightly overweight) reported a heart attack. Our common sense tells us that this is exceptional, as it is usually older, obese men who are prone to heart attacks. As this special case clearly deviates from the normal pattern, we could delete it and argue that this abnormal case influences our general results over-proportionally. Whatever you decide to do, report it in your paper and give a theoretical justification, as there is no general rule.
After you have excluded the case, run your model again and assess how the output changed.
- output omitted -
9 Matching
The last technique I want to introduce is propensity score matching (PSM).1 While it sounds fancy and is often not included in basic Stata textbooks, it is actually quite easy to understand and interpret, even for the beginner. The next section will briefly discuss the main ideas behind the method, run an example and check some diagnostics afterwards.
Experiments are regarded as the gold standard in science. Most people will know about experiments from the medical sciences, where new drugs are tested and one group receives the actual drug while the other one receives a placebo to assess the "real" effect of the new active component. What is the core of experiments? The answer is randomization. People who are sampled from a larger population are randomly assigned to two groups, the treatment group and the control group. This guarantees that, on average, both groups are similar with respect to visible (e.g. age, gender) and hidden (e.g. intelligence, motivation, health) properties. When the treatment (drug) is applied to one group and randomization was performed, we can be sure that any effect that appears is solely due to the treatment, as no other factors can influence the result.
Basically, we could do this in the social sciences as well.2 Suppose we want to research the effect of education on success in life. We sample pupils from the population and randomly put some of them in elite schools, while others have to go to not-so-fancy schools. After some years, we see how the pupils are doing. As factors like intelligence, motivation or social and financial background were similar at the start of the experiment (due to the randomization), the differences between the groups later in life can be traced back to the effect of school alone. As you may expect, parents from wealthy families in particular might not be happy when their child is randomly put in a low-quality school, which clearly underlines the ethical and economic problems of social experiments.
The idea of matching is to simulate an experiment even when only observational data is available. Basically, the method tries to find pairs in the data which are similar in visible characteristics and only differ in their treatment status. For example, when we find two girls that have the same age, intelligence and social background, but one of them went to the elite school while the other did not, then we have a matched pair.
1 For a general introduction to the method see the excellent overview of Caliendo and Kopeinig
(2008). Note that some recommendations given there are already outdated. Recent information can
be found here: https://www.stata.com/meeting/germany17/slides/Germany17_Jann.pdf (2018-02-02).
2 For an introduction see Shadish et al. (2002).
The basic problems of the method are the same as with regressions: only measured characteristics can be used as "controls", and matching is not a magic trick that introduces new information. The main advantages over a regression are that the functional form of the relationship between treatment and outcome does not need to be specified (so the problem of "linearity" we discussed before is gone) and that the method is very close to the ideas of the counterfactual framework.
One problem in matching is that it is very hard or even impossible to find good matches when we use a lot of control variables. When we match on age, gender, education, intelligence, parental background, and so on, it will be hard to find "statistical twins" in the data that have exactly the same values in all these variables. This is called the curse of dimensionality, and it can be solved using propensity scores. The basic idea is to run a logistic regression which uses all control variables as independent variables, predict the chance of being in the treatment group, and then match on the score alone (which is a single number). The assumption is that people with a similar score will have similar properties. This has been proven mathematically, but the approach still has some problems that were discussed recently (King and Nielsen, 2016). Therefore, we will rely on kernel matching, which seems quite robust against some of the problems associated with matching. Based upon recent developments, I recommend this approach (possibly in combination with exact matching) over algorithms like nearest-neighbor or caliper matching.3
The problem is that Stata itself does not support kernel matching; therefore, we will use a command developed by Jann (2017) which implements a robust and fast version and comes with a lot of diagnostic options.
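The command can be installed from SSC:
ssc install kmatch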
Using this, we want to test the effect of being in a union on wage. As in chapter six, we will rely on the NLSW88 data. Note that our dependent variable can be metric (continuous) or binary, and our treatment variable must be binary, so that exactly two groups can be defined (treatment and control).4 The independent variables can have any measurement level, as long as you use factor-variable notation. For the example, we choose total work experience, region, age and smsa (metropolitan area) as control variables.
3 Stata has included built-in matching commands since version 13. Popular community-contributed commands are psmatch2 (ssc install psmatch2) and pscore (ssc install st0026_2).
4 If you want to compare several groups, for example more than one treatment drug, it is recommended to form all binary contrasts and run one matching model for each.
Let's see this in action. We open our dataset and run the command:
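A sketch of such a call; the NLSW88 variables ttl_exp, south, age and smsa stand in for the control variables named above, and the exact list may differ:
sysuse nlsw88, clear
kmatch ps union c.ttl_exp i.south c.age i.smsa (wage)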
The explanation is as follows: kmatch is the name of the command we want to use, and ps tells Stata to perform propensity score matching. union is the treatment variable (coded 0/1), followed by four (rather arbitrarily chosen) control variables. Just as with regressions, you enter your variables using factor-variable notation. When desired, you can also include interactions or higher-ordered terms. The outcome variable (wage) is put in parentheses at the end of the command. Note that you can also include multiple outcomes in the same model.
The ATE (0.915), the Average Treatment Effect, is displayed as a result. This is the estimated difference between the means of the treatment and control group for the outcome variable (wage). As the value is positive, we know that the average effect of being in a union is a wage that is higher by about 92 cents per hour.
Now we would like to know whether this result is statistically significant. In contrast to a regression, p-values cannot be calculated in the regular way, but must be estimated using bootstrapping (which is a form of resampling). On modern computers it is reasonable to use about 500 replications for bootstrapping (the larger the number, the less variable the results). To get these we type
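A sketch, requesting bootstrap standard errors via the vce() option with the specification used above:
kmatch ps union c.ttl_exp i.south c.age i.smsa (wage), vce(bootstrap, reps(500))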
This might take a few minutes to run, as a random sample is drawn 500 times and the command is repeated for each. The result shows that the p-level is below 0.05 and, therefore, the effect is statistically significant. Note that your results will probably differ from the ones shown here, as resampling is based on random samples, thus the p-level and standard error will deviate slightly.5
When you are not interested in the ATE, you can also look at the ATT (Average Treatment Effect on the Treated). This statistic tells us how much more people who are actually in a union earn due to their union membership. The counterpart is the ATC (Average Treatment Effect on the Controls), which tells us how much more people who are not in a union would earn if they were union members. To get these effects add the option att or atc.
Finally, it seems like a good idea to combine PSM with exact matching on certain key variables (here we choose college education). In other words, before the matching is run, Stata will match exactly on this variable, which means that people with college education will be compared only to other people with the same level of education. The downside is that when you match exactly on a large number of variables, the number of cases used will be lower, as a perfect match cannot always be found. It is usually a good idea to include only a few binary or ordinal variables for exact matching. Note that you cannot use factor-variable notation or interactions in this option.
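A sketch of how the two options could be combined with the specification used above:
kmatch ps union c.ttl_exp i.south c.age i.smsa (wage), att ematch(collgrad)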
- Output omitted -
The option att tells Stata to report the ATT instead of the ATE, and ematch(collgrad) combines the PSM with exact matching on the variable collgrad. It is usually a good idea to report both the ATT and the ATE later, as these are often the most interesting results.
5 If you want your results to be replicable, use the command set seed 1234 at the top of your do-file,
where you replace 1234 with a “random” number of your choice. Furthermore, make sure to sort your
cases by ID as random resorting could also influence the results (sort idcode).
In contrast to linear regressions, matching has lower demands, which makes our lives easier, yet there are two central aspects that must be checked to see whether we can trust our results (the following diagnostics refer to the model from page 135).
As described above, when you run a PSM, Stata will start with a logit model to calculate, for each person, the propensity score that summarizes the probability of being in the treatment group. If some variables have perfect predictability, which means they flawlessly determine whether a person is in the treatment or the control group, PSM will not work. For example, suppose that every person from the south is in the treatment group, and persons from other regions are in the control group. Then the variable south perfectly predicts the treatment status, which is not allowed. Stated otherwise: when you divide the entire range of calculated propensity scores into an arbitrary number of groups, you need both people from treatment and control within each group. This can be visualized.
The region between the two vertical bars is the region of common support. Although kmatch automatically trims away cases outside that region (for example the people depicted by the continuous black graph at the very left), you still have to inspect this, as there might be other problems. Sometimes there are regions in the middle of the spectrum with very low values for one group, which can be problematic. Checking this will tell you how well the common support condition is fulfilled. In our case it looks fine. You can create this kind of graph automatically by typing
Another important aspect to consider are the cases that could not be matched. Almost always there will be cases in your sample for which an adequate match could not be found; therefore, these cases will not be used for calculating the statistic of interest (for example the ATE). This means that your overall sample and the sample used to calculate the statistic differ, which might lead to bias. To check whether this is a problem type
kmatch cdensity
You see that the graphs of the "Total" sample and the "Matched" sample are very close to each other, meaning that the sample that could be used in the matching is, on average, almost identical to the overall sample. As long as this is the case, we do not expect any bias.
Remember that a PSM tries to make two groups that were quite different before the match similar with respect to all independent variables in the model. It is a good idea to check whether this goal was actually achieved. The idea is simple: you inspect the means of these variables within both groups after the matching was done and see whether they are similar. As long as this is the case, you can be optimistic that your matching was successful. You can do this using either tables or graphs. We start with a simple table.
kmatch summarize
You see that the standardized difference (mean) for total work experience was 0.13 before matching and 0.02 after matching. This seems like a good result, also for the other variables. Keep in mind that after matching the values should approach zero for the means and one for the ratios of the standard deviations, which are listed below. All in all these results are very positive. If you prefer graphs over tables, type
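kmatch offers several graphical balancing diagnostics; a density plot, for example, can presumably be requested with (see help kmatch for related plots):
kmatch density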
You see a plot comparing the distribution of each variable before and after the matching. The closer the two graphs resemble each other after the matching, the better the balancing.
Finally, you can use Rosenbaum bounds to assess how robust your results are with respect to omitted variables. This slightly advanced technique cannot be introduced here (check out the online resources), but it is available through community-contributed software (Rosenbaum, 2002; Becker and Caliendo, 2007; DiPrete and Gangl, 2004).6 To learn how this technique can be integrated in a research paper, refer to Gebel (2009).
6 Available user-written commands are rbounds (metric outcome variable) and mhbounds (binary
outcome variable).
10 Reporting results
After you have finished the difficult part and produced and tested your results, it is time to publish them. Usually this works fairly well, yet some things should be considered in order to achieve a professional appearance. It is never a good idea to copy and paste Stata output directly into a paper, as it will probably lower your grade. The first problem is that Stata produces a great amount of information that is often not needed in publications, as tables are already large enough and readers don't have time to deal with every bit of information. The second aspect is that the Stata format is usually not what editors or professors expect when they think of a nice layout. It is always a good idea to study the most important journals of your field to get an impression of what a table looks like there, so you can adopt this style in your own paper. Otherwise, ask your advisor for examples. We will use a quite generic formatting that could probably be used in any modern journal. We will use the results of the linear regression models from chapter six to produce tables and output.
10.1 Tables
Tables are at the heart of scientific research, as they compress a lot of information into a small area. It might be the case that other methods, like graphs, are better for visualizing your results, yet you are often required to also include a table so that numerical values can be studied when desired. For our example, we will use a nested regression with three models, which is a very common procedure in the sciences. We will, therefore, run three different regression models and save the results internally, so Stata can produce a combined output.
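A sketch of the workflow; the model specifications are only placeholders for the chapter six models:
sysuse nlsw88, clear
regress wage i.union
estimates store m1
regress wage i.union c.ttl_exp
estimates store m2
regress wage i.union c.ttl_exp i.race
estimates store m3
estimates table m1 m2 m3, se stats(r2 N)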
The last line tells Stata to produce a table which includes all three models, shows standard errors and also adds information about R-squared and the number of cases used. This looks quite fine, but still requires some more work. One problem is that the stars, which should indicate the p-levels, are missing. Also, it is a good idea to round the numbers, say, to three decimal places.
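For example (note that estimates table cannot show stars and standard errors at the same time):
estimates table m1 m2 m3, star(0.05 0.01 0.001) b(%9.3f) stats(r2 N)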
This looks better. Getting this data into your text editor, for example Microsoft Word or LibreOffice Writer, can still be tricky. Highlight the table in Stata, right-click it and select "Copy Table". Then go directly to your text editor, right-click, choose "Paste as..." and see which option works best. Sometimes it can be helpful to paste the data into spreadsheet software first, like Excel or Calc, to get better formatting, and then copy this table into the text editor. After you have done this, add the stars manually and adjust options for a pretty look. A finished table might look something like this1 (Table 10.1):
Race
  White     Ref.
  Black     −0.97***
            (0.21)
  Other     0.56
            (0.78)
Const.      3.61***     3.69***     3.78***
            (0.34)      (0.27)      (0.27)
The problem is that when using Stata's built-in functions, you will need to do some extra work to create these tables just as we want them. An alternative is a user-written package that can do most steps automatically.
1 Please note that for nested models, number of cases should usually be identical. This is explained
on page 106.
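The eststo and esttab commands used below are part of the estout package, which can be installed from SSC:
ssc install estout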
Now we run each model again and store the results using the eststo command:
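A sketch (the model specifications are placeholders and models.rtf is an arbitrary file name):
eststo clear
eststo: regress wage i.union
eststo: regress wage i.union c.ttl_exp
eststo: regress wage i.union c.ttl_exp i.race
esttab using models.rtf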
The last command produces the table and saves it in RTF format in the current working directory. It can be opened in any word processor. If you want, you can also export the following file types: txt, csv, html and tex. esttab offers a great variety of options, so you can customize your tables as you like and then save the code in a do-file so you can use it for future work. The following example shows how powerful the command can be:
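For example, writing the table with a set of common options:
esttab using models.rtf, nogaps nomtitles r2 star(* 0.05 ** 0.01 *** 0.001) b(3) se label replace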
The option nogaps makes the table more compact, nomtitles hides the model names, r2 shows R-squared for each model, star modifies which significance levels should be indicated, b(3) rounds the coefficients to three decimal places, se shows standard errors instead of t-values, label shows variable labels and replace overwrites any existing table with the same name in your working directory. If you also want to hide reference categories, include nobase.
10.2 Graphs
Regression tables are important, but can be hard to interpret, especially when there are interactions present in your models. Graphs are an excellent way of visualizing the effects of interest, so even people who are not immersed in the topic can understand them (never forget: when you want to influence policy, you have to be understandable to the public). Stata brings some great tools to create these graphs.
Let's start with the basics. Maybe you just want to show your regression coefficients with standard errors, or even better, confidence intervals. There is a CCS that can do this for you. Try
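For example (coefplot is first installed from SSC):
ssc install coefplot
coefplot, xline(0)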
We use coefplot (Jann, 2014) to visualize the coefficients. The xline option adds a vertical line. When the confidence interval of a variable crosses this line, we know that it is not significant. As we can see clearly, all variables have a significant effect (except "other" in race, which is probably due to the low number of cases in this category).
This way we can visualize coefficients neatly. What if we want to predict wages at specific values, or for certain groups? No problem. But let's see how it works before we get fancy. Just type
margins
margins, atmeans
Stata also presents the empirical means found in the sample. You see that our "average" person has 12.8 years of job experience and a union-membership share of 24.5%. Calculating means for binary or ordinal variables is easy, but a little nonsensical. Therefore, it is a good idea to move on and compute some better statistics instead of using the mean of all variables. Try
margins, at(ttl_exp=(0(5)30))
Stata uses the last regression command to compute the expected wage at several numerical values of job experience, in our case from zero years up to 30 years, in steps of five years. Stata even produces a small table that tells you which values it uses, and then shows the results, including standard errors, in another table. This seems nice, but it gets even better by typing
marginsplot
or click Statistics → Postestimation → Margins plots and profile plots and click
Submit.
This is called a conditional-effects plot. What happens here is the following: Stata starts with the first number we specified in the command above (which is 0) and internally changes the work experience of every single person to that value. Then it predicts the wage separately for each case, using the estimated regression coefficients with the individual information, except for work experience, which was changed to zero. After that, it averages the predicted wages, reports them and repeats the process with the next specified value. This answers the question of what wages would be if every person had a work experience of zero, with all other variables remaining unchanged.
We can do even better when we differentiate not only by work experience but also by union membership status, so try
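For example, adding union to the margins call:
margins union, at(ttl_exp=(0(5)30))
marginsplot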
Here the list of computed values gets large, so the marginsplot command is really a boon. We can clearly see that union members always earn more than non-members.
We can include as many variables as we want in the at() option, for example to differentiate additionally by region. But this brings a lot of information into one graph, so instead we want subgraphs by union membership:
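A sketch, assuming the regression also contains i.south as a region indicator:
margins union#south, at(ttl_exp=(0(5)30))
marginsplot, by(union)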
You can run the last command again (marginsplot) without any options and decide which version you prefer (one vs. two graphs). The general interpretation is easy: union members earn more than others, but there are also differences between the regions, as people from the south seem to earn less than other people.
Until now the graphs presented consist only of straight lines, as there are no interactions present (for an explanation see page 94). Let's change that by typing
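A sketch of such an interaction model; the exact specification may differ:
regress wage i.union##c.ttl_exp i.race
margins union, at(ttl_exp=(0(5)30))
marginsplot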
- output omitted -
The lines are not straight anymore. Yet we can also learn from the regression output that, in general, the interaction does not make our model any better, and we would not report this in a paper (see the p-values for the interaction term).
You have learned that margins is a very powerful command with a great number of options and possibilities that cannot all be introduced here. To learn more, refer to Williams (2012)2 or have a look at the Stata help files for margins and marginsplot. On the website you will find another do-file that contains more information about graphics and interaction effects.
As you reach the end of this book, you are finally familiar with the basics of Stata. You are able to open and save Stata files, transform and create variables, compute descriptive statistics and work with more advanced methods like multiple regressions. This gives you all the tools to write solid research papers. If you use this book as a companion for your first paper ever written with Stata, the following guide should give you a very concise overview of how a seminar paper can be structured. This basic outline should be valid for every field of study, but make sure that you get additional input from your fellow students, colleagues or advisors.
want to use, and how they are suited to your theories. It is relevant to explain how theoretical constructs and actual variables can be matched with each other and where deviations and problems lie. For example, when you need the age of a pupil as a control variable because it is required by your theoretical framework, but only the class a pupil is in is available, it is an important task to justify why it might be adequate to use this variable instead.
Then write about the central variables you want to use, and report some basic information like means, standard deviations or distributions (using histograms, boxplots or kernel-density plots). You can use tables to display a lot of information in compact form. Remember that each causal question requires a good description, as this often enables you to understand the data and certain relationships within it. Furthermore, include some information about how your dependent and independent variables are related (use correlation coefficients or simple crosstabs). Briefly explain the methods you want to use and why they are able to answer your research question.
5. Statistical analysis: here you proceed with your method and report results, as well as any problems that may arise. In the first part, you should just display these numbers without commenting on them. The second part, the discussion, is used to explain how your theoretical framework, your statistical computations and the results of other researchers fit together. When there are large deviations from your hypotheses, it is important to look for problems or reasons that could explain these differences (for example, those that may arise due to different datasets, operationalizations or methods).
6. Summary: briefly summarize your results and try to answer, in a few sentences, the research question you formulated in the introduction. Highlight certain aspects, results or problems that seem especially important. Furthermore, give an outlook on how you, or other researchers, should proceed in the future, considering your results.
7. References: always include all references and citations used. It does not matter
whether they are for theoretical argumentation, empirical research or about your
methods. Remember, missing references can be viewed as plagiarism, so always
double check that you have reported the origin of direct and indirect citations!
Other researchers want to know your sources.
Note that this is a very general framework which can vary depending on your field, the extent of your research and the opinion of your advisor. It is always a very good idea to talk to him or her before you start writing, and to present your ideas and plans, as you will receive feedback that can help you enormously. Also, talk to fellow students, maybe from higher semesters, to learn about certain aspects or idiosyncrasies you should pay attention to. For example, whether you put your graphics and tables inside the chapters or in a separate chapter at the end of the paper is not written in stone. For a more detailed introduction refer to the paper of Bhakar and Nathani (2015).
In chapter two you learned how you can use do-files to save commands and structure your tasks. When you work on larger projects, like a seminar paper, you will see that it is extremely helpful to have your files and data organized. Another trick you can use is splitting the different parts of the analysis over several do-files. For example, you start with data recoding and operationalization to transform your variables so you can use them later. Then you proceed with descriptive statistics, which give you an overview of the data. After that, you switch to statistical methods and run your regressions (or whatever methods you want to use). You can use a different do-file for each part and later run them sequentially with a Master-Do. The first step is to create each file, say operationalization.do, descriptives.do and analyses.do. Then you create a new file, master.do, which you put in the same folder as the other ones. In this do-file you write
***Master do-file***
do operationalization
do descriptives
do analyses
When you run the Master-Do, all the other do-files will be called in the sequence in which you placed them. By doing this, you can organize and automate your workflow even better. When you do not want to see the actual output of a do-file, for example when data is recoded, use
quietly do operationalization
When you later come back and see that you forgot to create an important variable, edit your operationalization.do and run the Master-Do again. You can also combine this design with logs, so you will receive the complete output in a file after the do-files are finished.
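A sketch of how logging could be added to the Master-Do (the log file name is only a placeholder):
***Master do-file***
capture log close
log using seminar_paper.log, replace
quietly do operationalization
do descriptives
do analyses
log close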
12 The next steps
This final chapter summarizes where you can get help with Stata and which sources are great for learning more and gaining further experience. As the program is quite popular and beloved by many people, there are large communities, projects, websites and forums that offer plenty of information and advice for beginners.
– Statalist.org is the official Stata forum and usually offers the best advice when
you have very special questions about Stata. Expert users and Stata employees
will deal with your requests. When you use complex designs or methods and
want to learn more about the details, this is the first place to go.
– Reddit.com/r/Stata is the Stata subforum on Reddit and is also highly popular. This seems like the perfect place for beginners who want to find typos in their commands and need general help with questions that might have been asked before.
– Talkstats.com is mainly focused on general advice when it comes to statistics but
also includes a Stata subforum.
– Stata Manual: every Stata installation comes with a tremendous documentation,
which you can access either directly inside Stata (see page 23) or as PDFs which
are saved on the computer. These manuals are sorted by topic, so you can browse
freely and see which commands might be interesting for you.
– Statabook.com, the website of this book, also provides more material online, like do-files that contain additional information and exemplary seminar papers that show in depth how the process of writing an empirical paper works.
12.2 Books
There are plenty of great books about Stata. I only want to introduce a few which
might be interesting for the beginner.
– Kohler and Kreuter (2012): this is the book for the motivated beginner. The authors give a profound and in-depth presentation of how to use Stata and, furthermore, introduce the basic concepts of statistics. If you read the current book as an absolute beginner and want more information, this is clearly the next book you should pick up.
– Acock (2014): the structure of this book closely resembles that of Kohler and Kreuter, and it offers the interested beginner a more in-depth explanation of all tools and functions, starting with basic data management and exploring more advanced methods. Whether you choose this one or the one listed before is up to you.
– Hamilton (2013): this book also offers an introduction to general themes like data management and visualization, but it mostly works as a grand tour of the long list of methods Stata provides. Many kinds of regressions, survival analysis, event-history designs and multilevel models are covered, as well as many others. While not offering a large theoretical introduction to each method, which would clearly go beyond its scope, the explanations are understandable even for the beginner. This book is ideal in combination with an introductory course in methods, or just for browsing and exploring Stata's vast possibilities.
– Mehmetoglu and Jakobsen (2017): this quite recent work is made for students in a hurry who have to work with advanced methods. While the introduction to the general Stata workflow is quite short, it offers applied knowledge and examples for basic regression models as well as more advanced techniques, like working with panel data. It is similar to the work of Hamilton, but much shorter and with a smaller scope.
– Long (2009): this is not a general introduction to Stata, but mostly deals with data
management and organization. When you are not focused on direct application,
but want to learn how you can structure your entire workflow around Stata,
maybe for larger projects like a Ph.D. or your research career, then this book is
for you.
References
Acock, Alan C. (2014): A Gentle Introduction to Stata. College Station, Texas.
Becker, Sascha O.; Caliendo, Marco (2007): Sensitivity analysis for average treatment effects, in:
The Stata Journal 7(1): 71–83.
Best, Henning; Wolf, Christoph (2015): Linear Regression, in: Best, Henning; Wolf, Christoph (eds.):
Regression Analysis and Causal Inference. London: 57–81.
Bhakar, Sher Singh; Nathani, Navita (2015): A Handbook on writing Research Paper in Social
Sciences. Available online: https://www.researchgate.net/publication/282218102_A_
Handbook_on_writing_Research_Paper_in_Social_Sciences (2018-07-23).
Bischof, Daniel (2017): New Graphic Schemes for Stata: plotplain & plottig, in: The Stata Journal
17(3): 748–59.
Caliendo, Marco; Kopeinig, Sabine (2008): Some Practical Guidance for the Implementation of
Propensity Score Matching, in: Journal of Economic Surveys 22(1): 31–72. Available online:
https://www.econstor.eu/bitstream/10419/18336/1/dp485.pdf (2018-02-02).
De Vaus, David (2001): Research Design in Social Research. London.
DiPrete, Thomas A.; Gangl, Markus (2004): Assessing bias in the estimation of causal effects:
Rosenbaum bounds on matching estimators and instrumental variables estimation with
imperfect instruments, in: Sociological methodology 34(1): 271–310.
Elwert, Felix (2013): Graphical Causal Models, in: Morgan, Stephen (Ed.): Handbook of Causal
Analysis for Social Research. Dordrecht: 245–73. Available online: http://citeseerx.ist.psu.edu/
viewdoc/download?doi=10.1.1.364.7505&rep=rep1&type=pdf (2018-07-07).
Elwert, Felix; Winship, Christopher (2014): Endogenous Selection Bias: The Problem of Conditioning
on a Collider Variable, in: Annual Review of Sociology 40: 31–53. Available online: http://www.
annualreviews.org/doi/abs/10.1146/annurev-soc-071913-043455 (2018-01-25).
Gebel, Michael (2009): Fixed-Term Contracts at Labour Market Entry in West Germany: Implications
for Job Search and First Job Quality, in: European Sociological Review 25(6): 661–75.
Groves, Robert; Fowler, Floyd Jr.; Couper, Mick; Lepkowski, James; Singer, Eleanor; Tourangeau,
Roger (2004): Survey Methodology. Hoboken.
Hamilton, Lawrence C. (2013): Statistics with Stata. Boston.
Hernán, Miguel A. (2018): The C-Word: Scientific Euphemisms Do Not Improve Causal Inference from
Observational Data, in: AJPH 108(5): 616–19.
Hoekstra, Rink; Morey, Richard D.; Rouder, Jeffrey N.; Wagenmakers, Eric-Jan (2014): Robust
misinterpretation of confidence intervals, in: Psychonomic Bulletin & Review 21(5): 1157–64.
Jann, Ben (2007): fre: Stata module to display one-way frequency table. Available from http://ideas.
repec.org/c/boc/bocode/s456835.html (2018-01-18).
Jann, Ben (2014): Plotting regression coefficients and other estimates, in: The Stata Journal 14 (4):
708–37.
Jann, Ben (2017): kmatch: Stata module for multivariate-distance and propensity-score matching.
Available online: https://ideas.repec.org/c/boc/bocode/s458346.html (2018-02-05).
King, Gary; Keohane, Robert O.; Verba, Sidney (1995): Designing Social Inquiry. Scientific Inference in Qualitative Research. Princeton.
King, Gary; Nielsen, Richard (2016): Why Propensity Scores Should Not Be Used for Matching
(Working Paper). Available online: https://gking.harvard.edu/files/gking/files/psnot.pdf
(2018-02-02).
King, Gary; Roberts, Margaret E. (2015): How robust standard errors expose methodological
problems they do not fix, and what to do about it, in: Political Analysis 23(2): 159–79.
Kohler, Ulrich; Kreuter, Frauke (2012): Data Analysis using Stata. College Station, Texas.
Long, Scott (2009): The Workflow of Data Analysis Using Stata. College Station, Texas.
Long, Scott; Freese, Jeremy (2014): Regression Models for Categorical Dependent Variables Using
Stata. College Station, Texas.
Matthews, Robert (2000): Storks Deliver Babies (p=0.008), in: Teaching Statistics 22(2): 36–38.
Mehmetoglu, Mehmet; Jakobsen, Tor Georg (2017): Applied Statistics Using Stata: A Guide for the
Social Sciences. London.
Meuleman, Bart; Loosveldt, Geert; Emonds, Viktor (2015): Regression analysis: Assumptions
and diagnostics, in: Best, Henning; Wolf, Christoph (eds.): Regression Analysis and Causal
Inference. London: 83–110.
Mitchell, Michael N. (2012): A Visual Guide to Stata Graphics. College Station, Texas.
Mood, Carina (2010): Logistic Regression: Why We Cannot Do What We Think We Can Do, and What
We Can Do About It, in: European Sociological Review 26(1): 67–82. Available online: http://
www.urbanlab.org/articles/Mood_2010_LogRegession.pdf (2018-02-21).
Morgan, Stephen L.; Winship, Christopher (2015): Counterfactuals and Causal Inference: Methods
and Principles for Social Research. Cambridge.
Pearl, Judea (2009): Causality: models, reasoning, and inference. Cambridge.
Pearl, Judea; Mackenzie, Dana (2018): The book of why. The new science of cause and effect.
New York.
Rabe-Hesketh, Sophia; Skrondal, Anders (2012): Multilevel and Longitudinal Modeling Using Stata.
Volume I: Continuous Responses. College Station, Texas.
Rosenbaum, Paul R. (2002): Observational Studies. New York.
Shadish, William; Cook, Thomas; Campbell, Donald (2002): Experimental and Quasi-Experimental
Designs for Generalized Causal Inference. Boston.
Williams, Richard (2012): Using the margins command to estimate and interpret adjusted predictions
and marginal effects, in: The Stata Journal 12(2): 308–31.
Wooldridge, Jeffrey M. (2016): Introductory econometrics: a modern approach. Boston.
Copyright
All tables and figures are, if not otherwise stated, created by the author. Figure 3.3
(page 26) was created based on a graphic originally uploaded by “Thirunavukkarasye-
Raveendran” under the Attribution 4.0 International (CC BY 4.0) license, see:
https://commons.wikimedia.org/wiki/File:Zahlenstrahl_v1_02-11-2016_PD.svg
Index
_N 21 Combining datasets 38–43
_n 21–22 Command history 5
Command line 2, 4–5, 13
Added-variable-plot 81 Comment 12–13
Adjusted R-squared 86 Common cause 74
Ados 9 Common support (PSM) 138
AIC 107, 129 compare 37
And (logical) 30–31 Conditional-effects plot 147
ANOVA 89–91 Conditioning 79
append 14, 39 Confidence intervals 93, 94, 112, 124, 128, 144, 145
Arithmetic mean 47, 80, 120, 123, 145 Confounder 74, 76, 78, 80
assert 33, 37 Controlling 74, 76, 78, 79, 80, 82, 92, 93
ATC 137 Cook’s distance 115
ATT 137 correlate 67, 80
autocode 29 Correlation 67 , 73,
Average Marginal Effect (AME) 124, 128, 129 count 26, 28–31, 61
Average Treatment Effect 136, 137 Creating variables 23, 26
Crosstab 33, 58, 87
Back-door path 77–79 CSV 9, 144
Balancing (PSM) 139 Current working directory 5, 6, 10, 49, 144
Bar chart 55, 62
bcskew0 111 Data editor 7–8
Beta coefficient 100 delimit 14–15
BIC 107, 129 Dependent variable 84, 85, 86, 101, 102, 104,
binscatter 57, 58, 105 106, 110–112, 116, 119, 120, 121, 126, 135
Bootstrapping 136 describe 16
Box-Cox-transformation 111 Design weight 71
Boxplot 53–55 dfbeta 114–115
browse 8, 24, 153 dir 6
bysort 22, 61, 100 Directed acyclic graph (DAG) 74, 75, 77, 78
do 152
capture 12 Documentation 16, 23, 31, 43, 49, 153
catplot 55 doedit 11
Causal analysis 73–82, 92, 102 Do-files 2, 3, 10–15, 66, 152, 153
Causal graph 74–75, 77, 78, 79, 94 Dot chart 63–64
CCS 9, 49, 57, 105, 144 drop 13, 35, 36
cd 6 Dummy variable 88
centile 48 Duplicates 21–22
Changing variables 32 duplicates list 21
Clear (option) 16
clear all 12 edit 7
Clustering 103 Education (variable) 88
cmdlog 14 egen 29, 30
codebook 34 Equality 31
coefplot 145 ereturn list 50
Collider 77 Error term 84, 102
collin 131 estat hettest 109, 111
estat ic 107 96, 101, 102, 103, 104, 106, 108, 116,
estat ovtest 102 119–121, 123, 126, 131, 135, 139, 151
estat vif 108 Influential observation 113–116, 132
estimates store 129, 141 inlist 28
estimates table 141, 141 inrange 28–29, 37
Estout 49, 144 Interaction effects 94–100, 128, 149
eststo 144 Intercept 84
esttab 49, 144 Interquartile range 53
Exact matching 135, 137 isid 21
Execute commands 13
Exogeneity 102–103 Jitter 56–57
Exp() 112 joinby 43
Extended missing value 26
kdensity 52
Factor variable notation 85, 88, 95, 97, 106, keep 35, 36
121, 135, 136, 137 Kendall’s Tau 68
Filename 6, 10, 14 Kernel-density plot 52, 151
Format 7, 9, 10, 17, 44, 45, 63, 65, 141 Kernel-matching 135
fre 46 kmatch 135–139
Frequency table 58–61, 62
label define 20
generate 21, 22, 23, 27, 29, 39, 50, 101, 111 label values 20
gladder 110 label variable 19, 21, 27
Global 117 Leverage-versus-squared-residual plot 116
Gph (extension) 65 Likelihood-ratio test 129
graph bar 56, 62, 63 linktest 130
graph box 53 list 18, 39, 40, 41, 43, 45
graph combine 66 local 117
graph dot 64 Log() 111
Graph Editor 65, 66 log close 12, 15
graph export 65 Log-files 14–15
graph save 65 log using 14, 15
gsort 23 Logistic regression 119–133, 135
GUI 4–5 logit 119, 121, 123, 125, 126, 129, 130, 133, 138
lowess 105
help 7, 9, 10, 23, 28 lrtest 129
Helpfile 24, 149 Ls 6
Heteroscedasticity 108–113 lvr2plot 116
Higher ordered term 106, 126, 136, 149
histogram 51, 52, 55, 64, 65, 66, 68, 110, 111 Macros 12, 117
Hypothesis 1, 69, 70, 73, 74, 83, 93, 109 Main effect 94, 95
Marginal effect 96, 97, 100, 125, 128
ID 21, 22, 36, 40, 41, 43, 44, 45 – at the Mean 128
if 30–32, 100 margins 96–99, 122, 123–124, 126, 128, 145,
Import 7 146, 149
in 18 marginsplot 98, 113, 124, 127, 128, 147, 148, 149
Indentation 13 Master do-file 152
Independent variable 48, 84, 85–89, 91–94, Matching 134–140
Mean 30, 46, 47, 49, 50, 73, 90, 139, 151 Regression 83, 84
Mechanism 76, 86, 150 Regression coefficient 84, 100–101, 144, 147
Median 30, 46–48, 53 Remove variables 36, 117
Mediator 76, 79 rename 19
merge 39–45 replace 32
Missing value 24, 25, 26, 31, 33, 48, 54, 107, 115 Research question 83
misstable summarize 24 reshape 44, 45
Multicollinearity 107–108, 131–132 Reshaping data 43–45
Multiline command 13 Residual 105, 108, 116, 132
mvdecode 25 Results window 4, 14
mvencode 25 return list 50
Root MSE 86
Negation (logical) 31 Rosenbaum-Bounds 140
Nested models 106–107, 129 R-squared 85, 86, 88, 92, 106, 107,
Normality test 69 142, 144
numlabel 47 rvfplot 109, 117
Observation 7, 16, 21, 35–39, 41, 48, 53, 54, 56, Sampling error 72
79, 83, 86, 109, 113, 115, 130, 132 save 9
OLS 84, 89 scatter 56
Operationalization 11, 103, 107, 150, 151, 152 Scatterplot 56
Or (logical) 28, 30 Scheme 2, 26, 65
Outlier 53, 54, 113, 116, 132 Set more off 12
sfrancia 68, 69
Pairwise combinations 42–43 sort 22, 23, 30, 43, 137
Partial correlation 80 Spearman’s Rho 67, 68, 120
Pearson’s R 67, 80 Special functions 28
Percentile 48 SPSS 7
Png 65 ssc install 9
Polynomial regression 106 Standard deviation 47, 48–50, 62, 73, 90, 101,
Predict 73, 81, 86, 97, 100, 105, 106, 107, 112, 140, 151
123, 126, 132, 135, 138, 145, 147 Standard error 94, 108, 110, 112, 117, 123, 137,
Predicted value 97, 98, 99, 105, 108, 112–113, 122 142, 144, 147
Prediction 73, 74, 85, 110, 112, 116, 119, 123, 126 Standardized regression coefficient 100–101
Pregibon’s Beta 132 Statistical significance 72
Propensity score matching 9, 134, 135–137 Stored results 49–50
Proxy (variable) 80 Stratifying 79
Pseudo R-squared 121 Subgroup 99
pwcorr 67 summarize 47–50
Svy 71
qnorm 68 sysuse 8
Q-Q plot 68
quietly 50 tabulate 2, 20, 24, 27, 32, 33, 34, 46, 47, 55,
Ramsey’s test 103 58, 59–62, 71, 87, 89, 90, 120
Randomization 134 test 91
recode 29, 34, 35, 87, 88, 152 T-test 69, 89
Recorder 65 twoway 52, 56, 104
Reference-category 88 Unique identifier see ID
regress 85 use 2, 6, 7, 39–41, 43, 45