Handout 2
Working with data from files: The most common ready-to-go data format is a family of tabular
formats called structured values.
As shown in the following figure, this data is arranged in rows and columns where the first
row gives the column names. Each column represents a different fact or measurement; each
row represents an instance or datum.
Reading the UCI car data: Loading data of this type into R is a one-liner: we use the R
command read.table().
For data frames, the command dim() is also important, as it shows you how many rows and
columns are in the data.
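For example, a minimal sketch of this step (the file name car.data.csv is a placeholder for a headered, comma-separated copy of the UCI car data; read.table() also accepts a URL):

uciCar <- read.table('car.data.csv', sep = ',', header = TRUE)   # read a comma-separated file whose first row names the columns
class(uciCar)   # "data.frame"
dim(uciCar)     # rows and columns, e.g. 1728 rows by 7 columns for this dataset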
Using R on less-structured data: Data isn’t always available in a ready-to-go format. The
German bank credit dataset is stored as tabular data without headers; it uses a cryptic
encoding of values that requires the dataset’s accompanying documentation to untangle. We’ll
now show how to reformat the data using R.
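A sketch of the loading step, assuming the file is pulled from its usual UCI repository location (adjust the path or URL if you keep a local copy); header=FALSE because the file has no header row, and stringsAsFactors=FALSE keeps the cryptic codes as plain strings:

d <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data',
                header = FALSE, stringsAsFactors = FALSE)   # whitespace-separated, no header row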
Transforming data in R: Data often needs a bit of transformation before it makes any sense.
In order to decrypt troublesome data, you need what’s called the schema documentation or a
data dictionary. In this case, the included dataset description says the data consists of 20 input
columns followed by one result column. In this example, there’s no header in the data file.
The column definitions and the meaning of the cryptic A-* codes are all in the accompanying
data documentation.
print(d[1:3,])
We can change the column names to something meaningful with the command in the
following listing.
Setting column names: Adding our own column names to the fields.
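A sketch of the renaming step. The real listing supplies all 21 names from the data dictionary; here only the first few descriptive names are shown, with generic placeholders padding the vector to the right length:

colnames(d) <- c('Status.of.existing.checking.account',   # names taken from the data documentation
                 'Duration.in.month',
                 'Credit.history',
                 'Purpose',
                 paste('Column', 5:20, sep = '.'),        # placeholders for the remaining input columns
                 'Good.Loan')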
Lists are R’s map structures. They can map strings to arbitrary objects. The important list
operations [ ] and %in% are vectorized. This means that, when applied to a vector of values,
they return a vector of results by performing one lookup per entry.
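For instance (the codes and descriptions below are a few illustrative entries from the credit data's documentation):

mapping <- list('A40' = 'car (new)', 'A41' = 'car (used)', 'A42' = 'furniture/equipment')
codes <- c('A40', 'A42', 'A40')
as.character(mapping[codes])   # one lookup per entry: "car (new)" "furniture/equipment" "car (new)"
codes %in% names(mapping)      # TRUE TRUE TRUE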
The following for loop converts the values in each column of character type from the original cryptic A-* codes into short level descriptions taken directly from the data documentation. We, of course, skip this transform for columns that contain numeric data.
Transforming the credit data
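A sketch of that loop, assuming mapping is a named list built from the data documentation (as in the small example above):

for (i in 1:dim(d)[2]) {                                    # loop over every column
   if (class(d[, i]) == 'character') {                      # only transform code-valued columns
      d[, i] <- as.factor(as.character(mapping[d[, i]]))    # vectorized lookup, then convert to a factor
   }
}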
Examining our new data: We can now easily examine the purpose of the first three loans
with the command print(d[1:3,'Purpose']).
In many production environments, the data you want lives in a relational or SQL
database, not in files. Public data is often in files (as they are easier to share), but your most
important client data is often in databases. Relational databases scale easily to millions of
records and supply important production features such as parallelism, consistency,
transactions, logging, and audits. When you’re working with transaction data, you’re likely to
find it already stored in a relational database, as relational databases excel at online
transaction processing (OLTP).
Data in a database is often stored in what is called a normalized form, which requires
relational preparations called joins before the data is ready for analysis. Also, you often don’t
want a dump of the entire database, but instead wish to freely specify which columns and
aggregations you need during analysis.
We’ll show how to load data into a database. Knowing how to load data into a
database is useful for problems that need more sophisticated preparation.
A production-size example:
For our production-size example we’ll use the United States Census 2011 national
PUMS (Public Use Microdata Sample) American Community Survey data. This is a
remarkable set of data involving around 3 million individuals and 1.5 million households.
Each row contains over 200 facts about each individual or household (income, employment,
education, number of rooms, and so on).
The data has household cross-reference IDs so individuals can be joined to the
household they’re in. The size of the dataset is interesting: a few gigabytes when zipped up.
So it’s small enough to store on a good network or thumb drive, but larger than is convenient
to work with on a laptop with R alone (which is more comfortable when working in the range
of hundreds of thousands of rows). We’ll work through all of the steps for acquiring this data
and preparing it for analysis in R.
A hard rule of data science is that you must be able to reproduce your results. At the
very least, be able to repeat your own successful work through your recorded steps and
without depending on a stash of intermediate results. Everything must either have directions
on how to produce it or clear documentation on where it came from. We call this the “no
alien artifacts” discipline.
Keep notes
A big part of being a data scientist is being able to defend your results and repeat your
work. We strongly advise keeping a notebook. We also strongly advise keeping all of your
scripts and code under version control. You absolutely need to be able to answer exactly what
code and which data were used to build the results you presented last week.
We’ll use the Java-based tool SQL Screwdriver to load the PUMS data into our database. We
first copy our database credentials into a Java properties XML file. We’ll then use Java at the
command line to load the data. To load the four files containing the two tables, run the
commands in the following listing.
Loading data with SQL Screwdriver
SQL Screwdriver infers data types by scanning the file and creates new tables in your
database. It then populates these tables with the data. SQL Screwdriver also adds four
additional “provenance” columns when loading your data. These columns record details of the load itself, such as which source file each row came from and a random sampling group that makes it easy to draw samples later.
We can now use a database browser like SQuirreL SQL to examine this data. We start up
SQuirreL SQL and copy the connection details from our XML file into a new database alias. We’re
then ready to type SQL commands into the execution window. A couple of commands you
can try are SELECT COUNT(1) FROM hus and SELECT COUNT(1) FROM pus, which
will tell you that the hus table has 1,485,292 rows and the pus table has 3,112,017 rows. Each
of the tables has over 200 columns, and there are over a billion cells of data in these two
tables. In addition to the SQL execution panel, SQuirreL SQL has an Objects panel that
allows graphical exploration of database table definitions.
Now we can view our data as a table (as we would in a spreadsheet) and examine, aggregate, and summarize it using the SQuirreL SQL database browser. The figure below shows a few example rows and columns from the household data table.
Loading data from a database into R:
To load data from a database, we use a database connector. Then we can directly issue
SQL queries from R. SQL is the most common database query language and allows us to
specify arbitrary joins and aggregations. SQL is called a declarative language (as opposed to
a procedural language) because in SQL we specify what relations we would like our data
sample to have, not how to compute them. For our example, we load a sample of the
household data from the hus table and the rows from the person table (pus) that are associated
with those households. Producing composite records that represent matches between two or more tables (in our case hus and pus) is usually done with what is called a join. For this example, we use an even more efficient pattern called a sub-select that uses the keyword IN.
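A minimal sketch of this step. It assumes a DBI-compatible connector (RSQLite here as a stand-in; for a real database server you would use the matching driver, such as RJDBC or RPostgres), that SERIALNO is the household cross-reference ID, and that ORIGRANDGROUP is the random sampling group column added during loading:

library(DBI)
library(RSQLite)                                       # stand-in driver; substitute the connector for your database
conn <- dbConnect(RSQLite::SQLite(), 'pums.sqlite3')   # placeholder connection details

dhus <- dbGetQuery(conn, "SELECT * FROM hus WHERE ORIGRANDGROUP <= 1")   # a sample of households
dpus <- dbGetQuery(conn, "SELECT pus.* FROM pus
                          WHERE pus.SERIALNO IN
                            (SELECT DISTINCT hus.SERIALNO FROM hus
                             WHERE hus.ORIGRANDGROUP <= 1)")             # matching persons via a sub-select with IN
dbDisconnect(conn)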
The data has been unpacked from the Census-supplied .csv files into our database and a
useful sample has been loaded into R for analysis.
Each row of PUMS data represents a single anonymized person or household. Personal data
recorded includes occupation, level of education, personal income, and many other
demographic variables. To load our prepared data frame, download phsample.RData from:
https://github.com/WinVector/zmPDSwR/tree/master/PUMS
and run the following command in R: load('phsample.RData'). Our example problem will be
to predict income (represented in US dollars in the field PINCP) using the following
variables:
Age— An integer found in column AGEP.
Employment class— Examples: for-profit company, nonprofit company, ... found in column
COW.
Education level— Examples: no high school diploma, high school, college, and so on, found
in column SCHL.
Sex of worker— Found in column SEX.
Our data treatment is to select a subset of “typical full-time workers” by restricting the subset
to data that meets all of the following conditions:
Workers self-described as full-time employees
Workers reporting at least 40 hours a week of activity
Workers 20–50 years of age
Workers with an annual income between $1,000 and $250,000
Selecting a subset of the Census data:
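A sketch of the selection, assuming the PUMS columns ESR (employment status recode, with 1 meaning a civilian employed and at work) and WKHP (usual hours worked per week) carry the first two conditions; AGEP and PINCP are as described above, and psub is our name for the selected subset:

psub <- subset(dpus,
               (ESR == 1) &                          # self-described full-time employees (assumed coding)
               (WKHP >= 40) &                        # at least 40 hours a week
               (AGEP >= 20) & (AGEP <= 50) &         # 20-50 years of age
               (PINCP >= 1000) & (PINCP <= 250000))  # income between $1,000 and $250,000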
Before we work with the data, we’ll recode some of the variables for readability. In
particular, we want to recode variables that are enumerated integers into meaningful factor
level names, both for readability and to prevent accidentally treating such variables as mere
numeric values.
Recoding variables:
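A sketch of the recoding pattern, applied to a training subset dtrain (the name dtrain, the use of psub as its source, and the particular code-to-label entries are assumptions; the complete maps come from the PUMS data dictionary):

dtrain <- psub                                              # stand-in for the training split of the selected subset
cowmap <- c('1' = 'Private for-profit employee',            # only a few class-of-worker codes shown;
            '2' = 'Private non-profit employee',            # the full map comes from the data dictionary
            '3' = 'Local government employee')
dtrain$COW <- as.factor(cowmap[as.character(dtrain$COW)])   # vectorized lookup by name, then factor
dtrain$SEX <- as.factor(ifelse(dtrain$SEX == 1, 'M', 'F'))  # PUMS codes sex as 1 = male, 2 = female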
The data preparation makes use of R’s vectorized lookup operator [].
The standard trick to work with variables that take on a small number of string
values is to re-encode them into what’s called a factor, as we’ve done with the as.factor()
command. A factor is a list of all possible values of the variable (possible values are called
levels), and each level works (under the covers) as an indicator variable. An indicator is a
variable with a value of 1 (one) when a condition we’re interested in is true, and 0 (zero)
otherwise. Indicators are a useful encoding trick. The following figure illustrates the process.
SEX and COW underwent similar transformations.
> summary(dtrain$COW)