Handout 2
Working with data from files: The most common ready-to-go data format is a family of tabular
formats called structured values.
As shown in the following figure, this data is arranged in rows and columns where the first
row gives the column names. Each column represents a different fact or measurement; each
row represents an instance or datum.
Reading the UCI car data: Loading data of this type into R is a one-liner: we use the R
command read.table().
For data frames, the command dim() is also important, as it shows you how many rows and
columns are in the data.
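For example, a minimal sketch of this step (the file name car.data.csv is a placeholder for a headered, comma-separated copy of the UCI car data; read.table() also accepts a URL):

uciCar <- read.table('car.data.csv', sep = ',', header = TRUE)   # read a comma-separated file whose first row names the columns
class(uciCar)   # "data.frame"
dim(uciCar)     # rows and columns, e.g. 1728 rows by 7 columns for this dataset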
Using R on less-structured data: Data isn’t always available in a ready-to-go format. The
German bank credit dataset is stored as tabular data without headers; it uses a cryptic
encoding of values that requires the dataset’s accompanying documentation to untangle. We’ll
now show how to reformat the data using R.
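A sketch of the loading step, assuming the file is pulled from its usual UCI repository location (adjust the path or URL if you keep a local copy); header=FALSE because the file has no header row, and stringsAsFactors=FALSE keeps the cryptic codes as plain strings:

d <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data',
                header = FALSE, stringsAsFactors = FALSE)   # whitespace-separated, no header row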
Transforming data in R: Data often needs a bit of transformation before it makes any sense.
In order to decrypt troublesome data, you need what’s called the schema documentation or a
data dictionary. In this case, the included dataset description says the data consists of 20 input
columns followed by one result column. In this example, there’s no header in the data file.
The column definitions and the meaning of the cryptic A-* codes are all in the accompanying
data documentation.
print(d[1:3,])
We can change the column names to something meaningful with the command in the
following listing.
Setting column names: Adding our own column names to the fields.
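A sketch of the renaming step. The real listing supplies all 21 names from the data dictionary; here only the first few descriptive names are shown, with generic placeholders padding the vector to the right length:

colnames(d) <- c('Status.of.existing.checking.account',   # names taken from the data documentation
                 'Duration.in.month',
                 'Credit.history',
                 'Purpose',
                 paste('Column', 5:20, sep = '.'),        # placeholders for the remaining input columns
                 'Good.Loan')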
Lists are R’s map structures. They can map strings to arbitrary objects. The important list
operations [ ] and %in% are vectorized. This means that, when applied to a vector of values,
they return a vector of results by performing one lookup per entry.
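For instance (the codes and descriptions below are a few illustrative entries from the credit data's documentation):

mapping <- list('A40' = 'car (new)', 'A41' = 'car (used)', 'A42' = 'furniture/equipment')
codes <- c('A40', 'A42', 'A40')
as.character(mapping[codes])   # one lookup per entry: "car (new)" "furniture/equipment" "car (new)"
codes %in% names(mapping)      # TRUE TRUE TRUE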
The following for loop converts the values in each column of character type from the original cryptic A-* codes into short level descriptions taken directly from the data documentation. We, of course, skip this transform for columns that contain numeric data.
Transforming the credit data
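A sketch of that loop, assuming mapping is a named list built from the data documentation (as in the small example above):

for (i in 1:dim(d)[2]) {                                    # loop over every column
   if (class(d[, i]) == 'character') {                      # only transform code-valued columns
      d[, i] <- as.factor(as.character(mapping[d[, i]]))    # vectorized lookup, then convert to a factor
   }
}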
Examining our new data: We can now easily examine the purpose of the first three loans
with the command print(d[1:3,'Purpose']).
In many production environments, the data you want lives in a relational or SQL
database, not in files. Public data is often in files (as they are easier to share), but your most
important client data is often in databases. Relational databases scale easily to millions of
records and supply important production features such as parallelism, consistency,
transactions, logging, and audits. When you’re working with transaction data, you’re likely to
find it already stored in a relational database, as relational databases excel at online
transaction processing (OLTP).
Data in a database is often stored in what is called a normalized form, which requires
relational preparations called joins before the data is ready for analysis. Also, you often don’t
want a dump of the entire database, but instead wish to freely specify which columns and
aggregations you need during analysis.
We’ll show how to load data into a database. Knowing how to load data into a
database is useful for problems that need more sophisticated preparation.
A production-size example:
For our production-size example we’ll use the United States Census 2011 national
PUMS (Public Use Microdata Sample) American Community Survey data. This is a
remarkable set of data involving around 3 million individuals and 1.5 million households.
Each row contains over 200 facts about each individual or household (income, employment,
education, number of rooms, and so on).
The data has household cross-reference IDs so individuals can be joined to the
household they’re in. The size of the dataset is interesting: a few gigabytes when zipped up.
So it’s small enough to store on a good network or thumb drive, but larger than is convenient
to work with on a laptop with R alone (which is more comfortable when working in the range
of hundreds of thousands of rows). We’ll work through all of the steps for acquiring this data
and preparing it for analysis in R.
A hard rule of data science is that you must be able to reproduce your results. At the
very least, be able to repeat your own successful work through your recorded steps and
without depending on a stash of intermediate results. Everything must either have directions
on how to produce it or clear documentation on where it came from. We call this the “no
alien artifacts” discipline.
Keep notes
A big part of being a data scientist is being able to defend your results and repeat your
work. We strongly advise keeping a notebook. We also strongly advise keeping all of your
scripts and code under version control. You absolutely need to be able to answer exactly what
code and which data were used to build the results you presented last week.
We’ll use the Java-based tool SQL Screwdriver to load the PUMS data into our database. We
first copy our database credentials into a Java properties XML file. We’ll then use Java at the
command line to load the data. To load the four files containing the two tables, run the
commands in the following listing.
Loading data with SQL Screwdriver
SQL Screwdriver infers data types by scanning the file and creates new tables in your
database. It then populates these tables with the data. SQL Screwdriver also adds four
additional “provenance” columns when loading your data. These columns record details of the load itself, such as which source file each row came from and a random sampling group that makes it easy to draw samples later.
We can now use a database browser like SQuirreL SQL to examine this data. We start up
SQuirreL SQL and copy the connection details from our XML file into a new database alias. We’re
then ready to type SQL commands into the execution window. A couple of commands you
can try are SELECT COUNT(1) FROM hus and SELECT COUNT(1) FROM pus, which
will tell you that the hus table has 1,485,292 rows and the pus table has 3,112,017 rows. Each
of the tables has over 200 columns, and there are over a billion cells of data in these two
tables. In addition to the SQL execution panel, SQuirreL SQL has an Objects panel that
allows graphical exploration of database table definitions.
Now we can view our data as a table (as we would in a spreadsheet) and examine, aggregate, and summarize it using the SQuirreL SQL database browser. The figure below shows a few example rows and columns from the household data table.
Loading data from a database into R:
To load data from a database, we use a database connector. Then we can directly issue
SQL queries from R. SQL is the most common database query language and allows us to
specify arbitrary joins and aggregations. SQL is called a declarative language (as opposed to
a procedural language) because in SQL we specify what relations we would like our data
sample to have, not how to compute them. For our example, we load a sample of the
household data from the hus table and the rows from the person table (pus) that are associated
with those households. Producing composite records that represent matches between two or more tables (in our case hus and pus) is usually done with what is called a join. For this example, we use an even more efficient pattern called a sub-select that uses the keyword IN.
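A minimal sketch of this step. It assumes a DBI-compatible connector (RSQLite here as a stand-in; for a real database server you would use the matching driver, such as RJDBC or RPostgres), that SERIALNO is the household cross-reference ID, and that ORIGRANDGROUP is the random sampling group column added during loading:

library(DBI)
library(RSQLite)                                       # stand-in driver; substitute the connector for your database
conn <- dbConnect(RSQLite::SQLite(), 'pums.sqlite3')   # placeholder connection details

dhus <- dbGetQuery(conn, "SELECT * FROM hus WHERE ORIGRANDGROUP <= 1")   # a sample of households
dpus <- dbGetQuery(conn, "SELECT pus.* FROM pus
                          WHERE pus.SERIALNO IN
                            (SELECT DISTINCT hus.SERIALNO FROM hus
                             WHERE hus.ORIGRANDGROUP <= 1)")             # matching persons via a sub-select with IN
dbDisconnect(conn)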
The data has been unpacked from the Census-supplied .csv files into our database and a
useful sample has been loaded into R for analysis.
Each row of PUMS data represents a single anonymized person or household. Personal data
recorded includes occupation, level of education, personal income, and many other
demographic variables. To load our prepared data frame, download phsample.RData from:
https://github.com/WinVector/zmPDSwR/tree/master/PUMS
and run the following command in R: load('phsample.RData'). Our example problem will be
to predict income (represented in US dollars in the field PINCP) using the following
variables:
Age— An integer found in column AGEP.
Employment class— Examples: for-profit company, nonprofit company, ... found in column
COW.
Education level— Examples: no high school diploma, high school, college, and so on, found
in column SCHL.
Sex of worker— Found in column SEX.
Our data treatment is to select a subset of “typical full-time workers” by restricting the subset
to data that meets all of the following conditions:
Workers self-described as full-time employees
Workers reporting at least 40 hours a week of activity
Workers 20–50 years of age
Workers with an annual income between $1,000 and $250,000
Selecting a subset of the Census data:
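A sketch of the selection, assuming the PUMS columns ESR (employment status recode, with 1 meaning a civilian employed and at work) and WKHP (usual hours worked per week) carry the first two conditions; AGEP and PINCP are as described above, and psub is our name for the selected subset:

psub <- subset(dpus,
               (ESR == 1) &                          # self-described full-time employees (assumed coding)
               (WKHP >= 40) &                        # at least 40 hours a week
               (AGEP >= 20) & (AGEP <= 50) &         # 20-50 years of age
               (PINCP >= 1000) & (PINCP <= 250000))  # income between $1,000 and $250,000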
Before we work with the data, we’ll recode some of the variables for readability. In
particular, we want to recode variables that are enumerated integers into meaningful factor
level names, both for readability and to prevent accidentally treating such variables as mere
numeric values.
Recoding variables:
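A sketch of the recoding pattern, applied to a training subset dtrain (the name dtrain, the use of psub as its source, and the particular code-to-label entries are assumptions; the complete maps come from the PUMS data dictionary):

dtrain <- psub                                              # stand-in for the training split of the selected subset
cowmap <- c('1' = 'Private for-profit employee',            # only a few class-of-worker codes shown;
            '2' = 'Private non-profit employee',            # the full map comes from the data dictionary
            '3' = 'Local government employee')
dtrain$COW <- as.factor(cowmap[as.character(dtrain$COW)])   # vectorized lookup by name, then factor
dtrain$SEX <- as.factor(ifelse(dtrain$SEX == 1, 'M', 'F'))  # PUMS codes sex as 1 = male, 2 = female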
The data preparation makes use of R’s vectorized lookup operator [].
The standard trick to work with variables that take on a small number of string
values is to re-encode them into what’s called a factor, as we’ve done with the as.factor()
command. A factor is a list of all possible values of the variable (possible values are called
levels), and each level works (under the covers) as an indicator variable. An indicator is a
variable with a value of 1 (one) when a condition we’re interested in is true, and 0 (zero)
otherwise. Indicators are a useful encoding trick. The following figure illustrates the process.
SEX and COW underwent similar transformations.
> summary(dtrain$COW)