Baum 2003 - Introduction To Stata
Christopher F Baum
January 2003
baum@bc.edu
http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
C:\Stata\StataData\myfile.dta
or
/u/baum/statadata/myfile.dta
Strengths of Stata: Data Manipulation
Strengths of Stata: Statistics
Specialized Statistical Techniques
Cost, Availability and Support
And last (but certainly not least) Stata is very inexpensive! Their
GradPlan program makes the full version of Stata version 8 software
available to BC faculty and students for $89.00 (one-year license
for students) or $129.00 (perpetual license for faculty), with
various options for purchasing documentation. A quite thorough set
of documentation is available for $129.00 (a 4-volume reference
manual and user’s guide). The “Small Stata” version is available to
students for $39.00 for a one-year license; it will handle a limited
number of observations and variables, but contains all the commands.
GradPlan orders are made direct to Stata, with delivery from
on-campus inventory.
Stata is very well supported by telephone and email technical
support, as well as the more informal support provided by other users
on StataList, the listserv. The manuals are useful (particularly
the User’s Guide), but full details of the command syntax are
available online, and in hypertext form in the GUI environment.
There are tutorials available within the program, and several
small “canned” datasets, to introduce you to a number of common
tasks; use the command tutorial.
But why should I type commands?
Stata may be used in an interactive mode, but even there you are
typing command lines, generally not pulling down menus. You
do use menus extensively to interact with the computer’s file
system, and with elements of the computer in general: to manage
multiple windows, to change screen defaults, print results and
graphs, and the like.
In a computer program where all actions are point and click,
such as a spreadsheet, who can say how you arrived at a certain
set of results? Unless every step of your transformations of the
data can be retraced, how can you find exactly how the sample
you are employing differs from the raw data? A command-driven
program is capable of this level of reproducibility, and one could
argue that we owe it to our students to instill this level of rigor
in their research practices.
Extensibility
The vast majority of Stata commands are written in Stata’s own
programming language, the “ado-file” language. If a command
is not built into the kernel, Stata searches for it along the
“adopath”. Like the PATH in Unix, Linux or DOS, the adopath
indicates the several directories in which an ado-file might be
located. This implies that the “official” Stata commands are
not limited to those coded into the kernel. If Stata’s developers
tomorrow wrote a command named “agglomerate”, they would
make two files available on their web site: agglomerate.ado (the
ado-file code) and agglomerate.hlp (the associated help file).
Both are straight ASCII text.
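The search path and the location of any command can be inspected from within Stata. A minimal sketch; the two commands queried are arbitrary choices made for illustration:

```stata
* Show the directories Stata searches for ado-files, in order
adopath
* Report where a command lives: the path of its ado-file,
* or a note that it is built into the kernel
which summarize
which ttest
```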
Update facility
The importance of this program design goes far beyond the limits
of official Stata. Since the adopath includes both Stata directories
and other directories on your hard disk (or on a server’s
filesystem), you may acquire new Stata commands from a number
of web sites. The Stata Journal (SJ), a quarterly refereed
journal, is the primary method for distributing user contributions.
Between 1991 and 2001, the Stata Technical Bulletin (STB) played this
role, and a complete set of the STB Reprints is available
in O’Neill Library.
The SJ is a subscription publication (and available at O’Neill
Library: see Quest), but the ado- and hlp-files may be freely
downloaded from Stata’s web site. The Stata command “help”
accesses help on all installed commands; the Stata command
“search” will locate commands that have been documented in
the STB, and with one click you may install them in your version
of Stata. Help for these commands will then be available in your
own Stata.
User extensibility: the SSC archive
Any component in the SSC archive may be readily inspected with
a web browser, using IDEAS’ or EconPapers’ search functions,
and if desired you may install it with one command from the
archive from within Stata. For instance, if you know there is
a module in the archive named “omninorm,” you could use ssc
install omninorm to install it. Anything in the archive can be
accessed via Stata 7’s ssc command: thus ssc describe omninorm
will locate this module, and make it possible to install it with one
click.
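The two ssc subcommands named above can be typed as follows. This is a sketch that assumes a working Internet connection and that the omninorm module is still held in the archive:

```stata
* Describe the module before installing it
ssc describe omninorm
* Download and install its ado- and help files on your adopath
ssc install omninorm
* Confirm where the new command was installed
which omninorm
```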
The importance of all this is that Stata is infinitely extensible.
Any ado-file on your adopath is a full-fledged Stata command.
Stata’s capabilities thus extend far beyond the official, supported
features described in the Stata manual to a vast array of additional
tools. Since the current directory is on the adopath, if I
create an ado-file hello.ado there that defines a program named
hello, Stata will now respond to the command hello. It’s that easy.
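The listing for hello.ado did not survive in this copy. A minimal sketch of what such a file might contain follows; the message text is an assumption, not the original listing.

```stata
* hello.ado: save this file in the current directory
* (the message text is illustrative, not the original listing)
program define hello
    version 8
    display "Hello from Stata!"
end
```

Once the file is on the adopath, typing hello at the Stata prompt loads and runs the program like any other command.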
Command syntax
The fundamental syntax of all Stata commands follows a template.
Not all elements of the template are used by all commands,
and some elements are only valid for certain commands.
But where an element appears, it will appear in the same place,
following the same grammar.
The general syntax is:
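The template itself is missing from this copy. As a hedged stand-in, the general form documented in the Stata User's Guide is quoted in the comment below, followed by two commands that exercise several of its elements. The auto dataset and the variable choices are assumptions made for illustration.

```stata
* General form, as given in the Stata User's Guide:
*   [by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
sysuse auto, clear
* prefix, command, varlist and an if clause
bysort foreign: summarize price mpg if mpg > 20
* command, varlist, an in range and an option
summarize price in 1/20, detail
```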
if and in clauses
sort price
list make price in 1/5
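An if clause restricts a command to the observations that satisfy a
condition, for example only expensive cars (in 1978 prices!). Note
the double equal sign (==) in the exp. A single equal sign, as in the
C language, is used for assignment; a double equal sign is used for
comparison.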
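A short sketch of both clauses, assuming the auto dataset that ships with Stata; the 10000 price cutoff is an arbitrary threshold chosen for illustration.

```stata
sysuse auto, clear
sort price
* in: the first five observations, here the five cheapest cars
list make price in 1/5
* if: only cars above an arbitrary price threshold
list make price if price > 10000
* == tests equality inside an expression
list make price if foreign == 1
* = assigns: create an indicator for the same threshold
generate byte pricey = (price > 10000)
```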
the using clause
Stata binary files
To bring the contents of an existing Stata file into memory, use
the use command.
Reading and writing binary (.dta) files is much faster than dealing
with text (ASCII) files, and permits variable labels, value labels,
and other characteristics of the file to be saved along with the
file. To write a Stata binary file, use the save command.
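A minimal round trip through a binary file, assuming write access to the current directory; the filename myauto is hypothetical.

```stata
sysuse auto, clear
* write the data in memory to myauto.dta in the current directory
save myauto, replace
* empty memory, then read the file back in
clear
use myauto
describe
```

The labels and display formats stored with the auto data come back unchanged, which is the advantage over a text file noted above.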
Transportability
Stat/Transfer can also transfer SAS, SPSS and many other file
formats into Stata format, without loss of variable labels, value
labels, and the like. It is a very useful tool.
Accessing data over the Web
use http://fmwww.bc.edu/ec-p/data/Wooldridge/crime1.dta
The type command can display any text file, whether on your
hard disk or over the Web; thus
type http://fmwww.bc.edu/ec-p/data/Wooldridge/crime1.des
copy http://fmwww.bc.edu/ec-p/data/Wooldridge/crime1.des crime.codebook
When you have used a dataset over the Web, you have loaded
it into memory in your desktop Stata. You cannot save it to the
Web, but can save the data to your own hard disk. The advantages
of this feature for instructional and collaborative research
should be clear. Students may be given a URL from which their
assigned data are to be accessed; it matters not whether they
are using Stata for Windows, Macintosh, Linux, or UNIX.
The options clause
The by prefix
The option ,total will add the overall summary. What about a
classification with several levels, or a combination of values?
This is a very handy tool, which often replaces explicit loops that
must be used in other programs to achieve the same end.
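A sketch of the by prefix at work, assuming the auto dataset; the variables are arbitrary choices. The second command handles a combination of values, and the third shows by standing in for an explicit loop over groups.

```stata
sysuse auto, clear
* one summary per value of foreign
bysort foreign: summarize price
* one summary per combination of foreign and rep78
bysort foreign rep78: summarize price
* no loop needed: attach each group's mean price to its observations
by foreign: egen meanprice = mean(price)
```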
The by prefix should not be confused with the by option available
on some commands, which allows for specification of a grouping
variable: for instance, ttest with the option by(foreign)
will run a t-test for the difference of sample means across
domestic and foreign cars.
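As a sketch of the by option, assuming the auto dataset and taking price as the variable to be tested (the original choice of variable is not recoverable here):

```stata
sysuse auto, clear
* two-sample t-test of mean price, grouped by the by() option
ttest price, by(foreign)
```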
Another useful aspect of by is the way in which it modifies the
meanings of the observation number symbols. Usually _n refers
to the current observation number, which varies from 1 to _N,
the maximum defined observation. Under a bylist, _n refers to
the observation within the bylist, and _N to the total number of
observations for that category. This is often useful in creating
new variables.
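A sketch of the observation number symbols under by, assuming the auto dataset; the new variable names are hypothetical.

```stata
sysuse auto, clear
* within each value of foreign, order by price and number the cars
bysort foreign (price): generate rank = _n
* _N is now the number of observations in the group
by foreign: generate groupsize = _N
* the cheapest domestic and the cheapest foreign car
list make price rank groupsize if rank == 1
```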
Each variable may have its own default display format. This
does not alter the contents of the variable, but affects how it is
displayed. For instance, %9.2f would display a two-decimal-place
real number. The format command assigns a display format to one
or more variables.
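For example, assuming the auto dataset:

```stata
sysuse auto, clear
* show price with two decimal places; the stored values are unchanged
format price %9.2f
list make price in 1/3
```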
Variable labels
Each variable may have its own variable label. The variable label
is a character string (maximum 80 characters) which describes
the variable, associated with the variable via the label variable
command.
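A sketch, assuming the auto dataset; the label text is made up for illustration.

```stata
sysuse auto, clear
* attach a descriptive label to price, then display it
label variable price "Price in 1978 US dollars"
describe price
```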
Value labels
A full set of functions is available for use in the generate
command, including the standard mathematical functions, recode
functions, string functions, date and time functions, and
specialized functions (help functions for details). Note that the
sum() function is a running sum.
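A few of these functions in use, as a sketch on the auto dataset; the new variable names are hypothetical.

```stata
sysuse auto, clear
* a standard mathematical function
generate lprice = log(price)
* a string function: the first word of make
generate str18 brand = word(make, 1)
* sum() accumulates down the observations, giving a running total
generate long cumprice = sum(price)
list make brand price cumprice in 1/5
```

The last value of cumprice equals the total of price over the whole dataset, which is what a running sum implies.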
The D., L., and F. operators may be used under a time-series
calendar (including in the context of panel data) to specify first
differences, lags, and leads, respectively. These operators
understand missing data, and numlists: e.g. L(1/4).x is the first
through fourth lags of x.
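A self-contained sketch of the operators on a made-up series; the variables t and x are hypothetical.

```stata
clear
set obs 10
* hypothetical time index and series
generate t = _n
generate x = t^2
* declare the time-series calendar
tsset t
generate dx = D.x
generate lx = L.x
generate fx = F.x
list t x dx lx fx
* a numlist inside the operator: first through fourth lags of x
summarize L(1/4).x
```

The first observation of dx and lx and the last observation of fx are missing, since no neighbouring period exists there.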
Estimation commands
Data manipulation commands:
Useful statistical commands:
Any number of two-way scatterplots can be generated with one
command using the by modifier.
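In the graphics syntax introduced with Stata 8 this might look like the following sketch. The auto dataset and the variables are assumptions, and earlier releases use a different graph syntax.

```stata
sysuse auto, clear
* one price-versus-mpg scatterplot per value of foreign, plus an overall panel
twoway scatter price mpg, by(foreign, total)
```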
Graphs may also be readily combined into a single graphic for
presentation.
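One way to do this in Stata 8 or later is graph combine. A sketch on the auto dataset; the graph names g1 and g2 are hypothetical.

```stata
sysuse auto, clear
* build two named graphs in memory without displaying them
twoway scatter price mpg, name(g1, replace) nodraw
histogram price, name(g2, replace) nodraw
* place them side by side in a single graphic
graph combine g1 g2
```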
Now let us walk through some analysis of nontrivial Stata datasets.
A list of over 100 datasets suitable for instructional use is
available on the economics web pages as
http://fmwww.bc.edu/ec-p/data/ecfindata.html#teach
Let’s consider the data Zvi Griliches used in his 1976 article on
the wages of young men (Journal of Political Economy, 84, S69-S85).
These are cross-sectional data on 758 individuals collected
over several survey years.
* StataIntro: cross-section example
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta
describe
summarize
label define ur 0 rural 1 urban
label values smsa ur
tab smsa
tab mrt smsa, chi2
ttest med, by(smsa)
anova lw mrt smsa
anova lw mrt smsa mrt*smsa
anova, regress
regress lw tenure kww smsa
predict lweps, resid
graph lweps kww
sort smsa
graph lweps kww, by(smsa) total
bysort year: regress lw tenure kww smsa
graph iq kww age s expr lw, matrix
gen medrural = med*(smsa==0)
gen medurban = med*(smsa==1)
regress lw tenure kww medurban medrural
test medurban=medrural
The following example reads some daily Dow-Jones Averages
data, graphs daily returns, then performs Dickey-Fuller tests for
unit roots on the DJIA, its log, and its returns (log price
relatives). AR(3) models are then estimated on the returns series,
and tests are carried out on the fitted model for AR(1) errors
and ARCH effects. A portmanteau test is then performed on the
residual series.
* StataIntro: time-series example
use http://fmwww.bc.edu/ec-p/data/micro/ddjia.dta
desc
summ
tsset
graph ret day, c(l) s(.)
for var djia ldjia ret: dfuller X, lags(22)
regress ret L(1/3).ret
regress ret L(1/3).ret, robust
dwstat
archlm, lags(20)
predict eps, resid
wntestq eps