Lecture 4: Let's Get Data!: Prof. Esther Duflo
14.310x
Where can we find data?
Existing data libraries
• A great resource for MIT students and others:
http://libguides.mit.edu/ssds
• Popular sources of data
• Data.gov: Datasets generated by the executive branch of the
US government http://www.data.gov/
• IPUMS: censuses from the US and many more!
https://www.ipums.org/
• International IPUMS
https://international.ipums.org/international/
• ICPSR http://www.icpsr.umich.edu/icpsrweb/ICPSR/:
A data repository with many data sets on lots of subjects
• Harvard-MIT Data Center and Harvard Dataverse
https://dataverse.harvard.edu/ where many researchers
archive their data
• Amazon Web Services (AWS) public data sets
http://aws.amazon.com/public-data-sets/
International household survey data
Replication data from researchers
The internet!
• Many websites that are data intensive are making that data
directly available to people
• 538
• Yahoo data dump: “a sample of anonymized user interactions
on the news feeds of several Yahoo properties”.
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75
• Uber movement (to come)
https://movement.uber.com/cities
• Some sites are specializing in aggregating data
• Sports data sets http://www.opensourcesports.com/,
http://nbasavant.com/
• Web pages: Wayback Machine https://archive.org/web/
• There is much more... search in the library catalog, Google, etc.
But what if this is not what I am looking for?
• Sometimes what you are looking for is available, but not free:
your library may be able to purchase it... or you may need to
get an agreement.
• Sometimes it is available but the access is restricted (for
example for confidentiality reasons).
• For administrative data, see
https://www.povertyactionlab.org/admindata
• The entity that owns the data may be interested in sharing it
with you if this is part of a research project (prospective or
retrospective)
• You will then have to comply with the partner’s requirements
for data security, and also go through a human subjects review
by the Institutional Review Board (IRB) at your institution.
• And sometimes you will have to harvest it yourself!
Harvesting data
Scraping data from the internet
• Use an API
• Use a web page
What is Web Scraping?
Using an API
Example: using the Google Maps API to look at traffic in Delhi
Gabriel Kreindler
• One example: travel time (distance matrix)
• Takes into account traffic conditions at different hours.
• Two types of queries, depending on departure time:
• Departure time = now → prediction based on crowdsourced live
data from Google Android users, as well as historical data on
that route.
• Departure time = 6/15/2016 8:34am → prediction based on
historical data.
• No direct access to historical data.
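The two query types above can be sketched in Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the coordinates in the usage note are illustrative, and the endpoint and field names (origins, destinations, departure_time, key, duration_in_traffic) reflect the Google Maps Distance Matrix API as commonly documented and should be checked against the current documentation. The sketch only builds the request URL and parses a response body; sending the request requires a valid API key.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

# Endpoint of the Distance Matrix API (assumed; verify against current docs).
BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_query(origin, destination, api_key, departure_time="now"):
    """Return the request URL for one origin-destination travel-time query.

    departure_time is either the string "now" (live traffic) or a
    timezone-aware datetime (prediction based on historical data).
    """
    if isinstance(departure_time, datetime):
        # The API expects an integer number of seconds since the epoch.
        departure_time = int(departure_time.timestamp())
    params = {
        "origins": origin,
        "destinations": destination,
        "departure_time": departure_time,
        "key": api_key,
    }
    return BASE_URL + "?" + urlencode(params)


def travel_time_seconds(response_text):
    """Extract the travel time in seconds from a JSON response body."""
    element = json.loads(response_text)["rows"][0]["elements"][0]
    # Prefer the traffic-aware estimate; fall back to the plain duration
    # if the response does not include one.
    return element.get("duration_in_traffic", element["duration"])["value"]
```

For example, build_query("28.6139,77.2090", "28.5355,77.3910", "YOUR_KEY") gives a live query, and passing a datetime as departure_time gives a query for a specific departure time. Repeating the live query at regular intervals and storing the results is one way to build up a record of travel times, since historical data is not directly accessible.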
Example: Delhi Odd-Even study
[Figure: Ex1 – Delhi Odd-Even (live queries); annotations mark the national holiday and the odd-even pilot period]
Scraping web sites
What we start with
What we want to get
Web scraping with Python
Web scraping in R
Harvesting a table
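As a rough illustration of harvesting a table, the sketch below uses only Python's standard-library html.parser to turn the rows of an HTML table into a list of lists. It is not the code shown in the lecture: the class name, the helper function, and the sample page (with made-up values) are all hypothetical, and in practice a third-party library such as Beautiful Soup is often used for the same job.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row into a list of lists."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def harvest_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content with made-up values, for illustration only.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
  <tr><td>Beta</td><td>250</td></tr>
</table>
"""
```

Calling harvest_table(SAMPLE_TABLE) returns the header row followed by the two data rows, each as a list of strings, ready to be written to a CSV file or loaded into a data frame.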
Harvesting specific items
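When only certain elements of a page are of interest, one common approach is to select them by tag and CSS class. The sketch below, again limited to the standard-library html.parser, collects the text and target of every link carrying a given class. The class name "dataset", the helper names, and the sample page are hypothetical and chosen for illustration; this is not the code from the lecture.

```python
from html.parser import HTMLParser


class LinkHarvester(HTMLParser):
    """Collect (text, href) for every <a> tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.items = []    # harvested (text, href) pairs
        self._href = None  # href of the link being read
        self._text = None  # text pieces of the link being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._text = None


def harvest_items(html, css_class):
    """Return (text, href) pairs for links with the given class."""
    parser = LinkHarvester(css_class)
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page content, for illustration only.
SAMPLE_PAGE = """
<ul>
  <li><a class="dataset" href="/data/1">Survey A</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="dataset featured" href="/data/2">Survey B</a></li>
</ul>
"""
```

Here harvest_items(SAMPLE_PAGE, "dataset") keeps the two dataset links and skips the navigation link. Finding the right tag and class usually means inspecting the page source in a browser first.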
A cleaner code for a cleaner output
Collecting your own data
Steps for collecting your own data
Protecting Human Subjects
Research involving human subjects is regulated to ensure the
protection of the participants.
The background: Nazi research and the Tuskegee syphilis study
• An American medical research project conducted by the U.S.
Public Health Service from 1932 to 1972, which examined the
natural course of untreated syphilis in Black American men.
• The subjects, all impoverished sharecroppers from Macon
County, Alabama, were unknowing participants in the study;
they were not told that they had syphilis, nor were they
offered effective treatment.
• By the end of the experiment, 28 of the men had died directly
of syphilis, 100 were dead of related complications, 40 of their
wives had been infected, and 19 of their children had been
born with congenital syphilis.
• Participants were also lured in for tests by advertisements
promising free treatment.
• In 1972 the whistle was blown, and the men eventually obtained
a $10 million settlement in a class action lawsuit against the
PHS. The scientific merit of the study was also shoddy:
apparently not much was ever learned from it about how to
treat the disease.
• President Clinton apologized on behalf of the nation in
1997.
What is wrong here?
Protection of Human Subjects
Human subject research
Key principles of the Belmont report
Related requirements
How this works in practice
• For any research by an MIT researcher, you need to be trained
in human subjects research.
• You need to submit to MIT COUHES a form describing your
research.
• You need to follow the appropriate deadlines; and get the
authorization BEFORE you start.
• You need to submit your informed consent forms as well,
unless you request a waiver
• The form is used to assess whether there is risk to the
subjects or others.
• A committee assesses the risks and the benefits, and if
necessary asks you for changes to protect the subjects better.
• When work is conducted abroad, typically you also need to
obtain human subject permission from that country.
• Some research that poses minimal risk and does not involve
individually linked data is considered exempt.