Handout 2

Loading data into R

Working with data from files: The most common ready-to-go data format is a family of tabular formats called structured values.

Working with well-structured data from files or URLs: The easiest data format to read is table-structured data with headers. As shown in the figure below, this data is arranged in rows and columns, where the first row gives the column names. Each column represents a different fact or measurement; each row represents an instance or datum about which we know the set of facts.

Reading the UCI car data: Loading data of this type into R is a one-liner: we use the R command read.table().

Loading well-structured data from files or URLs:
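A sketch of that one-liner (the URL here is an assumption: it points to a copy of the UCI car data that includes a header row; substitute the location of your own copy):

# Read comma-separated, header-equipped data directly from a URL
uciCar <- read.table(
    'https://www.win-vector.com/dfiles/car.data.csv',
    sep = ',',
    header = TRUE)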


The commands to always try first are these:
class()— Tells you what type of R object you have. In our case, class(uciCar) tells us the
object uciCar is of class data.frame.
help()— Gives you the documentation for a class. In particular, try help(class(uciCar)) or help("data.frame").
summary()— Gives you a summary of almost any R object. summary(uciCar) shows us a lot
about the distribution of the UCI car data.

For data frames, the command dim() is also important, as it shows you how many rows and
columns are in the data.

Exploring the car data:
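A minimal sketch of these exploration commands, assuming the car data was loaded into a data frame named uciCar as above:

class(uciCar)        # confirms uciCar is a data.frame
dim(uciCar)          # number of rows and columns
summary(uciCar)      # per-column summaries of the UCI car data
help(class(uciCar))  # documentation for the data.frame class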


Working with other data formats
.csv is not the only common data file format you'll encounter. Other formats include .tsv (tab-separated values), pipe-separated files, Microsoft Excel workbooks, JSON data, and XML. R's built-in read.table() command can be made to read most separated-value formats. Many of the deeper data formats have corresponding R packages:

1. XLS/XLSX: Excel spreadsheet format
2. JSON: JavaScript Object Notation
3. XML: Extensible Markup Language
4. MongoDB: a source-available, cross-platform, document-oriented NoSQL database
5. SQL: Structured Query Language
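For instance, a sketch of reading two of these separated-value variants with read.table() (the filenames are hypothetical):

# Tab-separated values: fields are delimited by the tab character
dTab <- read.table('data.tsv', sep = '\t', header = TRUE, stringsAsFactors = FALSE)

# Pipe-separated values: fields are delimited by '|'
dPipe <- read.table('data.psv', sep = '|', header = TRUE, stringsAsFactors = FALSE)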

Using R on less-structured data: Data isn’t always available in a ready-to-go format. The
German bank credit dataset is stored as tabular data without headers; it uses a cryptic
encoding of values that requires the dataset's accompanying documentation to untangle. We'll
now show how to reformat the data using R.

Transforming data in R: Data often needs a bit of transformation before it makes any sense.
In order to decrypt troublesome data, you need what’s called the schema documentation or a
data dictionary. In this case, the included dataset description says the data is 20 input
columns followed by one result column. In this example, there’s no header in the data file.
The column definitions and the meaning of the cryptic A-* codes are all in the accompanying
data documentation.

Loading the credit dataset:


d <- read.table(paste('http://archive.ics.uci.edu/ml/',
                      'machine-learning-databases/statlog/german/german.data', sep=''),
                stringsAsFactors=FALSE, header=FALSE)

print(d[1:3,])
We can change the column names to something meaningful with the command in the
following listing.

Setting column names: Adding our own column names to the fields.

colnames(d) <- c('Status.of.existing.checking.account', 'Duration.in.month', 'Credit.history',
    'Purpose', 'Credit.amount', 'Savings account/bonds', 'Present.employment.since',
    'Installment.rate.in.percentage.of.disposable.income', 'Personal.status.and.sex',
    'Other.debtors/guarantors', 'Present.residence.since', 'Property', 'Age.in.years',
    'Other.installment.plans', 'Housing', 'Number.of.existing.credits.at.this.bank', 'Job',
    'Number.of.people.being.liable.to.provide.maintenance.for', 'Telephone', 'foreign.worker',
    'Good.Loan')

The c() command is R's method for constructing a vector.


Building a map to interpret loan use codes

mapping <- list('A40'='car (new)', 'A41'='car (used)', 'A42'='furniture/equipment',
                'A43'='radio/television', 'A44'='domestic appliances', ...)

Lists are R’s map structures. They can map strings to arbitrary objects. The important list
operations [ ] and %in% are vectorized. This means that, when applied to a vector of values,
they return a vector of results by performing one lookup per entry.
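For example, looking up several codes at once returns one description per code (a small illustration using the entries shown above):

# One lookup per entry; as.character() flattens the returned list
as.character(mapping[c('A40', 'A42', 'A43')])
# "car (new)"  "furniture/equipment"  "radio/television"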

The following for loop converts the values in each column of type character from the original cryptic A-* codes into short level descriptions taken directly from the data documentation. We, of course, skip this transform for columns that contain numeric data.

Transforming the credit data:
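A sketch of such a loop, assuming the mapping list covers every A-* code that appears in the character columns:

# Walk over every column; character columns hold the A-* codes, so replace them
# with the readable descriptions from the mapping list and convert to factors.
# Numeric columns are left untouched.
for (i in 1:dim(d)[2]) {
  if (class(d[, i]) == 'character') {
    d[, i] <- as.factor(as.character(mapping[d[, i]]))
  }
}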

Examining our new data: We can now easily examine the purpose of the first three loans
with the command print(d[1:3,'Purpose']).

We can see the distribution of loan purpose with summary(d$Purpose).


Summary of Good.Loan and Purpose:
> table(d$Purpose, d$Good.Loan)
                       BadLoan GoodLoan
  business                  34       63
  car (new)                 89      145
  car (used)                17       86
  domestic appliances        4        8
  education                 22       28
  furniture/equipment       58      123
  others                     5        7
  radio/television          62      218
  repairs                    8       14
  retraining                 1        8

Working with relational databases:

In many production environments, the data you want lives in a relational or SQL
database, not in files. Public data is often in files (as they are easier to share), but your most
important client data is often in databases. Relational databases scale easily to the millions of
records and supply important production features such as parallelism, consistency,
transactions, logging, and audits. When you’re working with transaction data, you’re likely to
find it already stored in a relational database, as relational databases excel at online
transaction processing (OLTP).

Data in a database is often stored in what is called a normalized form, which requires
relational preparations called joins before the data is ready for analysis. Also, you often don’t
want a dump of the entire database, but instead wish to freely specify which columns and
aggregations you need during analysis.

We’ll show how to load data into a database. Knowing how to load data into a
database is useful for problems that need more sophisticated preparation.

A production-size example:

For our production-size example we’ll use the United States Census 2011 national
PUMS (Public Use Microdata Sample) American Community Survey data. This is a
remarkable set of data involving around 3 million individuals and 1.5 million households.
Each row contains over 200 facts about each individual or household (income, employment,
education, number of rooms, and so on).

The data has household cross-reference IDs so individuals can be joined to the
household they’re in. The size of the dataset is interesting: a few gigabytes when zipped up.
So it’s small enough to store on a good network or thumb drive, but larger than is convenient
to work with on a laptop with R alone (which is more comfortable when working in the range
of hundreds of thousands of rows). We’ll work through all of the steps for acquiring this data
and preparing it for analysis in R.

Curating the data

A hard rule of data science is that you must be able to reproduce your results. At the
very least, be able to repeat your own successful work through your recorded steps and
without depending on a stash of intermediate results. Everything must either have directions
on how to produce it or clear documentation on where it came from. We call this the “no
alien artifacts” discipline.
Step1:

Step2:

Step3:
Step4:

Step5:-

Keep notes
A big part of being a data scientist is being able to defend your results and repeat your
work. We strongly advise keeping a notebook. We also strongly advise keeping all of your
scripts and code under version control. You absolutely need to be able to answer exactly what
code and which data were used to build the results you presented last week.

Staging the data into a database


Structured data at a scale of millions of rows is best handled in a database. You can try to
work with text-processing tools, but a database is much better at representing the fact that
your data is arranged in both rows and columns.
We’ll use three database tools in this example: the serverless database engine H2, the
database loading tool SQL Screwdriver, and the database browser SQuirreL SQL. All of
these are Java based, run on many platforms, and are open source.

1. Serverless database engine H2


2. The database loading tool SQL Screwdriver
3. The database browser SQuirreL SQL

If you have a database such as MySQL or PostgreSQL already available, we recommend using one of them instead of H2. To use your own database, you'll need to know enough of your database driver and connection information to build a JDBC connection. If using H2, you'll only need to download H2 and pick a file path to store your results; H2 is a serverless, zero-install relational database that supports queries in SQL.

We’ll use the Java-based tool SQL Screwdriver to load the PUMS data into our database. We
first copy our database credentials into a Java properties XML file. We’ll then use Java at the
command line to load the data. To load the four files containing the two tables, run the
commands in the following listing.

SQL Screwdriver XML configuration file

Step1:-

Step2:-

Step3:-

Step4:-
Loading data with SQL Screwdriver SQL
Step1:-

Step2:-

Step3:-

Step4:-

Step5:-

Step6:-

SQL Screwdriver infers data types by scanning the file and creates new tables in your
database. It then populates these tables with the data. SQL Screwdriver also adds four
additional “provenance” columns when loading your data. These columns are

1. ORIGINSERTTIME: when you ran the data load
2. ORIGFILENAME: which file the row came from
3. ORIGFILEROWNUMBER: which line of that file the row came from
4. ORIGRANDGROUP: a pseudo-random integer distributed uniformly from 0 through 999, designed to make repeatable sampling plans easy to implement

We can now use a database browser like SQuirreL SQL to examine this data. We start up
SQuirreL SQL and copy the connection details from our XML file into a database. We’re
then ready to type SQL commands into the execution window. A couple of commands you
can try are SELECT COUNT(1) FROM hus and SELECT COUNT(1) FROM pus, which
will tell you that the hus table has 1,485,292 rows and the pus table has 3,112,017 rows. Each
of the tables has over 200 columns, and there are over a billion cells of data in these two
tables. In addition to the SQL execution panel, SQuirreL SQL has an Objects panel that
allows graphical exploration of database table definitions.

SQuirreL SQL table explorer:

Step1:-

Step2:-

Step3:-

Step4:-
Now we can view our data as a table (as we would in a spreadsheet). We can now examine,
aggregate, and summarize our data using the SQuirreL SQL database browser. The figure below shows a few example rows and columns from the household data table.

Browsing PUMS data using SQuirreL SQL

Step1:-

Step2:-

Step3:
Loading data from a database into R:

To load data from a database, we use a database connector. Then we can directly issue
SQL queries from R. SQL is the most common database query language and allows us to
specify arbitrary joins and aggregations. SQL is called a declarative language (as opposed to
a procedural language) because in SQL we specify what relations we would like our data
sample to have, not how to compute them. For our example, we load a sample of the
household data from the hus table and the rows from the person table (pus) that are associated
with those households. Producing composite records that represent matches between one or
more tables (in our case hus and pus) is usually done with what is called a join. For this
example, we use an even more efficient pattern called a sub-select that uses the key word in.

Loading data into R from a relational database


Step1:-

Step2:-

Step3:-

Step4:-

Step5:-
Step6:-

Step7:-

Step8:-
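A minimal sketch of this pattern using the RJDBC package (the H2 driver JAR name, database path, credentials, and the ORIGRANDGROUP sampling threshold are assumptions for illustration; SERIALNO is the household cross-reference ID):

library(RJDBC)   # JDBC-based database connector for R

# Connect to the H2 database file (driver JAR and credentials are assumed)
drv  <- JDBC("org.h2.Driver", "h2.jar", identifier.quote = "'")
conn <- dbConnect(drv, "jdbc:h2:./h2db", "u", "u")

# Sample the household table using the ORIGRANDGROUP provenance column
dhus <- dbGetQuery(conn, "SELECT * FROM hus WHERE ORIGRANDGROUP <= 1")

# Sub-select: pull only the person rows belonging to the sampled households
dpus <- dbGetQuery(conn, "SELECT pus.* FROM pus
    WHERE pus.SERIALNO IN
      (SELECT DISTINCT hus.SERIALNO FROM hus WHERE hus.ORIGRANDGROUP <= 1)")

dbDisconnect(conn)
save(dhus, dpus, file = 'phsample.RData')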

The data has been unpacked from the Census-supplied .csv files into our database and a
useful sample has been loaded into R for analysis.

Working with the PUMS data:

Loading and conditioning the PUMS data

Each row of PUMS data represents a single anonymized person or household. Personal data
recorded includes occupation, level of education, personal income, and many other
demographic variables. To load our prepared data frame, download phsample.RData from:

https://github.com/WinVector/zmPDSwR/tree/master/PUMS
and run the following command in R: load('phsample.RData'). Our example problem will be
to predict income (represented in US dollars in the field PINCP) using the following
variables:
Age— An integer found in column AGEP.
Employment class— Examples: for-profit company, nonprofit company, ... found in column
COW.
Education level— Examples: no high school diploma, high school, college, and so on, found
in column SCHL.
Sex of worker— Found in column SEX.
Our data treatment is to select a subset of “typical full-time workers” by restricting the subset
to data that meets all of the following conditions:
Workers self-described as full-time employees
Workers reporting at least 40 hours a week of activity
Workers 20–50 years of age
Workers with an annual income between $1,000 and $250,000
Selecting a subset of the Census data:
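A minimal sketch of such a selection, assuming the person sample is in a data frame dpus (the PUMS column names ESR, WKHP, AGEP, and PINCP and the code value used for full-time employment are assumptions; check the PUMS data dictionary for the exact encodings):

# Restrict to "typical full-time workers"
psub <- subset(dpus,
               (ESR == 1) &                          # employed and at work
               (WKHP >= 40) &                        # at least 40 hours per week
               (AGEP >= 20) & (AGEP <= 50) &         # 20-50 years of age
               (PINCP > 1000) & (PINCP <= 250000))   # income between $1,000 and $250,000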

Recoding the data

Before we work with the data, we'll recode some of the variables for readability. In particular, we want to recode variables that are enumerated integers into meaningful factor-level names, both for readability and to prevent accidentally treating such variables as mere numeric values.

Recoding variables:
Step1:-

Step2:-

Step3:-

Step4:-
The data preparation is making use of R’s vectorized lookup operator [].
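A minimal sketch of this kind of recoding (the level names and the assumption that COW is coded 1 through 7 and SEX as 1/2 are illustrative; the official codes are in the PUMS data dictionary):

# Vectorized lookup: the integer code stored in COW indexes into a vector of
# readable level names; as.factor() then fixes the set of allowed levels
cowmap <- c('Employee of a private for-profit',
            'Private not-for-profit employee',
            'Local government employee',
            'State government employee',
            'Federal government employee',
            'Self-employed not incorporated',
            'Self-employed incorporated')
psub$COW <- as.factor(cowmap[psub$COW])

# SEX is coded 1/2; recode to labels and convert to a factor
psub$SEX <- as.factor(ifelse(psub$SEX == 1, 'Male', 'Female'))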

The standard trick for working with variables that take on a small number of string values is to re-encode them into what's called a factor, as we've done with the as.factor() command. A factor is a list of all possible values of the variable (possible values are called levels), and each level works (under the covers) as an indicator variable. An indicator is a variable with a value of 1 (one) when a condition we're interested in is true, and 0 (zero) otherwise. Indicators are a useful encoding trick. The following figure illustrates the process. SEX and COW underwent similar transformations.

Examining the PUMS data


At this point, we’re ready to do some science, or at least start looking at the data. For
example, we can quickly tabulate the distribution of category of work.

Listing 2.14. Summarizing the classifications of work:

> summary(dtrain$COW)

(Output: a count of training rows for each class of worker; the levels include Employee of a private for-profit, Federal government employee, Local government employee, Private not-for-profit employee, Self-employed incorporated, Self-employed not incorporated, and State government employee.)

Watch out for NAs:


R’s representation for blank or missing data is NA. Unfortunately a lot of R commands
quietly skip NAs without warning. The command table(dpus$COW,useNA='always') will
show NAs much like summary(dpus$COW) does.
