Lecture 4: Let's Get Data!: Prof. Esther Duflo

1) The document discusses various sources to find existing data, including data libraries, government data sources, and data shared by researchers. 2) It also covers extracting data from the internet by using APIs or scraping websites. Examples provided include scraping book price data and using the Google Maps API. 3) The document notes that if existing data sources don't have what you need, you may need to collect your own data through surveys, apps, questionnaires, or organizing your own data collection team.

Lecture 4: Let’s get data!

Prof. Esther Duflo

14.310x

Where can we find data?

1 Existing data libraries
2 Collecting your own data
3 Extracting data from the internet

Existing data libraries
• A great resource for MIT students and others:
http://libguides.mit.edu/ssds
• Popular sources of data
  • Data.gov: datasets generated by the executive branch of the US government http://www.data.gov/
  • IPUMS: censuses from the US and many more! https://www.ipums.org/
  • IPUMS International https://international.ipums.org/international/
  • ICPSR http://www.icpsr.umich.edu/icpsrweb/ICPSR/: a data repository with many data sets on lots of subjects
  • Harvard-MIT Data Center and Harvard Dataverse https://dataverse.harvard.edu/, where many researchers archive their data
  • Amazon (AWS) public data sets http://aws.amazon.com/public-data-sets/

International household survey data

• Demographic and Health Surveys http://www.dhsprogram.com/
• World Bank http://data.worldbank.org/
• LSMS (search for LSMS on the World Bank data page)
• RAND public-use databases http://www.rand.org/labor/data.html

Replication data from researchers

• Randomized controlled trials
  • https://dataverse.harvard.edu/dataverse/jpal
  • https://dataverse.harvard.edu/dataverse/socialsciencercts
• The American Economic Association journals require authors to post any data used in their research: there is lots of data on the AEA website

The internet!

• Many data-intensive websites make their data directly available to the public
  • FiveThirtyEight (538)
  • Yahoo data dump: “a sample of anonymized user interactions on the news feeds of several Yahoo properties”. http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75
  • Uber Movement (to come) https://movement.uber.com/cities
• Some sites specialize in aggregating data
  • Sports data sets http://www.opensourcesports.com/, http://nbasavant.com/
  • Web pages: Wayback Machine https://archive.org/web/
• There is much more... search the library catalog, Google, etc.

But what if this is not what I am looking for?
• Sometimes what you are looking for is available, but not free: your library may be able to purchase it... or you may need to negotiate an agreement.
• Sometimes it is available but access is restricted (for example, for confidentiality reasons).
  • For administrative data, see https://www.povertyactionlab.org/admindata
  • The entity that owns the data may be interested in sharing it with you if this is part of a research project (prospective or retrospective)
  • You will then have to comply with the partner’s requirements for data security, and also go through an Institutional Review Board (IRB) review at your institution.
• And sometimes you will have to harvest it yourself!

Harvesting data

• Scraping data from the internet.
• Collecting your own data.

Scraping data from the internet

• Use an API
• Use a web page

What is Web Scraping?

• Pull data from one page
• Crawl an entire website
• A set of forms running in the background
• Any of the above in an ongoing fashion

Using an API

• An API (application programming interface) is an interface that lets one program communicate with other programs
• Some websites provide and invest in APIs (Twitter, Facebook, Google Maps, etc.), and you will typically use those to harvest data from those sites, sometimes in conjunction with Python

Example: using the Google Maps API to look at traffic in Delhi
Gabriel Kreindler

• Google has set up an API (application programming interface) that allows access to Google Maps.
• Mostly used by smartphone apps, websites, etc.
• Google provides documentation and libraries for access.
• Very simple to access using Python, Java, HTML, etc.
• No need for complicated scraping!
• Designed for commercial applications. Cost structure:
  • First 2,500 queries per day free, then 1 USD per 2,000 queries.
• Researchers can also take advantage!

• One example: travel time (distance matrix)
• Takes into account traffic conditions at different hours.
• Two types of queries, depending on departure time:
  • Departure time = now → prediction based on crowdsourced live data from Google Android users, as well as historical data on that route.
  • Departure time = 6/15/2016 8:34am → prediction based on historical data.
• No direct access to historical data.
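
A minimal Python sketch of such a travel-time query follows. The endpoint and the parameter names (`origins`, `destinations`, `departure_time`, `key`) are taken from Google's public Distance Matrix documentation and should be checked against the current version. The coordinates, the placeholder key, and the sample response are made up for illustration. Nothing is sent over the network here: the sketch only builds the request URL and reads the travel time out of a response shaped like the documented JSON.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Google Distance Matrix web service.
BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def build_query_url(origin, destination, api_key, departure_time="now"):
    """Build a request URL for one origin-destination pair.

    origin and destination are (latitude, longitude) tuples;
    departure_time is "now" or a Unix timestamp in seconds.
    """
    params = {
        "origins": f"{origin[0]},{origin[1]}",
        "destinations": f"{destination[0]},{destination[1]}",
        "departure_time": departure_time,
        "key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"

def travel_time_seconds(response):
    """Extract the travel time from a parsed JSON response.

    Prefers the traffic-aware duration when present and falls back
    to the plain duration; returns None if the route was not found.
    """
    element = response["rows"][0]["elements"][0]
    if element.get("status") != "OK":
        return None
    duration = element.get("duration_in_traffic") or element["duration"]
    return duration["value"]

# Hypothetical response, shaped like the documented JSON output.
sample = json.loads("""
{"status": "OK",
 "rows": [{"elements": [{"status": "OK",
                         "duration": {"value": 1500},
                         "duration_in_traffic": {"value": 2100}}]}]}
""")

# Made-up coordinates in the Delhi area and a placeholder key.
url = build_query_url((28.6139, 77.2090), (28.5355, 77.3910), "YOUR_API_KEY")
print(url)
print(travel_time_seconds(sample))
```

To actually send the request, the URL can be passed to `urllib.request.urlopen` and the body parsed with `json.load`.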

Example: Delhi Odd-Even study

• Delhi, one of the most polluted cities in the world, experimented with a driving restriction policy called Odd-Even (based on license plate numbers)
• Implemented January 1-15, 2016, and again in April
• The following results are based on queries made every 20 minutes since January 1st, 2016, on 93 routes across Delhi.
• The input is an API key and a .csv file that gives, for each route, a departure point (latitude, longitude) and an arrival point
• The output is a set of .csv files recording the time each trip takes.
• A Python script keeps querying the Google API
• Code + readme available on the course web site for those interested
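
The query loop described above can be sketched roughly as follows. This is not the course's actual script: the CSV column names, the function names, and the stand-in query function are all assumptions made for illustration. The real version would pass in a function that calls the Google API and would run with a 20-minute interval.

```python
import csv
import io
import time
from datetime import datetime, timezone

def read_routes(csv_text):
    """Parse routes from CSV text with the (assumed) columns
    route_id, origin_lat, origin_lng, dest_lat, dest_lng."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def poll_once(routes, query_fn, writer):
    """Query every route once and write one output row per route."""
    stamp = datetime.now(timezone.utc).isoformat()
    for r in routes:
        origin = (float(r["origin_lat"]), float(r["origin_lng"]))
        dest = (float(r["dest_lat"]), float(r["dest_lng"]))
        writer.writerow([stamp, r["route_id"], query_fn(origin, dest)])

def run(routes, query_fn, writer, rounds, interval_seconds=20 * 60):
    """Repeat poll_once `rounds` times, sleeping between rounds."""
    for i in range(rounds):
        poll_once(routes, query_fn, writer)
        if i < rounds - 1:
            time.sleep(interval_seconds)

# Demo with a stand-in query function, so nothing is sent to Google.
routes = read_routes(
    "route_id,origin_lat,origin_lng,dest_lat,dest_lng\n"
    "r1,28.6139,77.2090,28.5355,77.3910\n"
    "r2,28.7041,77.1025,28.4595,77.0266\n"
)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["queried_at", "route_id", "travel_time_seconds"])
run(routes, lambda o, d: 1800, writer, rounds=2, interval_seconds=0)
print(out.getvalue())
```

Passing the query function as an argument keeps the loop testable offline and makes it easy to swap in a different data source.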

Figure 2. Origin-Destination Routes for Live Queries in Delhi (the routes used for the analysis).

Ex1 – Delhi Odd-Even [Live queries]
[Figure: results from the live queries, annotated with the Odd-Even pilot period, the 1st day after Odd-Even, and a national holiday.]

Scraping web sites

• Some providers will not have an API
• Then you need to extract the information from the page.
• Example: Ellison and Ellison: Did the internet change the price of used books?
  • Want to compare the price of the same used book in stores and online
  • Need to collect data at regular intervals on the prices of a bunch of used books.
  • From the web site http://www.abebooks.com/

What we start with

What we want to get

A nice table we can import into R:
• Title of the book (for lots of titles)
• Date
• Price

Web scraping with Python

• The most conventional way to do it... the internet is full of tutorials
• You will work with the requests library and the BeautifulSoup library
• With those you will write a simple routine that extracts what you are looking for.
• In the used-book example, we need to pull up the page for each book at the specified dates, and instruct Python to search for the price (which is conveniently identified by a class).
• And export the prices into a table.
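
A rough sketch of the extraction step follows. To stay dependency-free it uses the standard library's `html.parser` instead of the requests and BeautifulSoup libraries named above; with BeautifulSoup the same lookup would be `soup.find_all(class_="item-price")`. The page fragment, the class name `item-price`, the book title, and the date are all hypothetical, and the real page structure would have to be inspected first.

```python
import csv
import io
import re
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every element carrying a given CSS class.

    Sketch only: assumes the price element contains plain text
    with no nested tags.
    """
    def __init__(self, price_class):
        super().__init__()
        self.price_class = price_class
        self._inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price_class in classes:
            self._inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        self._inside = False

def extract_prices(html, price_class):
    """Return the prices found in `html` as floats."""
    parser = PriceParser(price_class)
    parser.feed(html)
    prices = []
    for text in parser.texts:
        match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
        if match:
            prices.append(float(match.group()))
    return prices

# Made-up page fragment; "item-price" is a hypothetical class name.
page = """
<div class="result"><span class="title">Some Used Book</span>
<span class="item-price">US$ 12.50</span></div>
<div class="result"><span class="title">Some Used Book</span>
<span class="item-price">US$ 7.00</span></div>
"""

prices = extract_prices(page, "item-price")

# One row per listing: title, date, price.
table = io.StringIO()
writer = csv.writer(table)
writer.writerow(["title", "date", "price"])
for p in prices:
    writer.writerow(["Some Used Book", "2016-06-15", p])
print(table.getvalue())
```

The resulting CSV has the three columns wanted for the final table (title, date, price) and can be read directly into R.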

Web scraping in R

• R has a web scraping package called rvest, built by Hadley Wickham (the same person who wrote the R for Data Science book, ggplot2, the tidyverse, ...).
• See: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
• Works well in conjunction with the Google Chrome plugin SelectorGadget (selectorgadget.com/)
• It can submit forms, search web pages, etc.
• See demo(package = "rvest") for demonstrations

Harvesting a table

Harvesting specific items

A cleaner code for a cleaner output

Collecting your own data

• It is not as infeasible as it sounds!
• Survey tools on the internet (SurveyMonkey, Amazon Mechanical Turk)
• Install apps on willing participants' phones that will track their movements (or other things), e.g. with https://moves-app.com/
• Sit in the science center and administer questionnaires
• Set up some A/B testing
• And of course, if you have more money, organize a data collection team to collect whatever you would like!

Steps for collecting your own data

• Obtain the funding you may need
• Prepare a data management plan: how will you keep the data safe? Will you share it?
• Obtain Human Subjects Approval
• Design your data collection instrument
• Pilot your data collection instrument
• Implement!

Protecting Human Subjects
Research involving human subjects is regulated to ensure the protection of the participants.
The background: Nazi research, the Tuskegee syphilis study
• An American medical research project conducted by the U.S. Public Health Service from 1932 to 1972 examined the natural course of untreated syphilis in black American men.
• The subjects, all impoverished sharecroppers from Macon County, Alabama, were unknowing participants in the study; they were not told that they had syphilis, nor were they offered effective treatment.
• By the end of the experiment, 28 of the men had died directly of syphilis, 100 were dead of related complications, 40 of their wives had been infected, and 19 of their children had been born with congenital syphilis.
• People were also lured into coming for tests with promises of free treatment.
• In 1972 the whistle was blown, and the men finally won a $10 million settlement in a class-action suit against the PHS. The study also had little scientific merit: apparently not much was ever learned from it about how to treat the disease.
• President Clinton apologized on behalf of the nation in 1997.
What is wrong here?

Protection of Human Subjects

Research involving human subjects is governed by federal regulation: the HHS Regulations for the Protection of Human Subjects at Title 45, Code of Federal Regulations, Part 46.
The HHS regulations are intended to implement the basic ethical principles governing the conduct of human subjects research. These ethical principles are set forth in the report of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, entitled Ethical Principles and Guidelines for the Protection of Human Subjects of Research (the "Belmont Report").

Human subject research

Research - A systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge. Activities that meet this definition constitute research for purposes of the HHS regulations, whether or not they are conducted or supported under a program which is considered research for other purposes. For example, some demonstration and service programs may include research activities.
Human Subject - A living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.

The "generalizable knowledge" criterion in the definition of research means that Facebook or Amazon can experiment on you as much as they want unless they publish; but if you are working with them with the goal of publishing, you need to go through an IRB.

Key principles of the Belmont Report

1 Respect for persons
  • Respect individual autonomy
  • Protect individuals with reduced autonomy
2 Beneficence
  • Maximize benefits and minimize harms
3 Justice
  • Equitable distribution of research burdens and benefits

Related requirements

Application of the general ethical principles to the conduct of human subjects research leads to the following requirements:
• Respect for Persons
  • Informed consent
  • Protecting privacy and maintaining confidentiality
  • Additional safeguards for protection of subjects likely to be vulnerable to coercion or undue influence
• Beneficence
  • Assessment of risks and benefits, including study design
  • Ensure that risks to subjects are minimized
  • Risks justified by the benefits of the research
• Justice
  • Ensure that selection of subjects is equitable.

How this works in practice
• For any research at MIT, you need to be trained in human subjects protection.
• You need to submit to MIT COUHES a form describing your research.
• You need to follow the appropriate deadlines and get the authorization BEFORE you start.
• You need to submit your informed consent forms as well, unless you request a waiver.
• The form is used to assess whether there is risk to the subjects or others.
• A committee assesses the risks and the benefits, and if necessary asks you for changes to protect the subjects better.
• When work is conducted abroad, you typically also need to obtain human subjects permission from that country.
• Some research that has minimal risk and does not involve individually linked data is considered exempt.