Lecture 4: Let's Get Data!: Prof. Esther Duflo

1) The document discusses various sources to find existing data, including data libraries, government data sources, and data shared by researchers. 2) It also covers extracting data from the internet by using APIs or scraping websites. Examples provided include scraping book price data and using the Google Maps API. 3) The document notes that if existing data sources don't have what you need, you may need to collect your own data through surveys, apps, questionnaires, or organizing your own data collection team.


Lecture 4: Let’s get data!

Prof. Esther Duflo

14.310x

Where can we find data?

1 Existing data libraries
2 Collecting your own data
3 Extracting data from the internet

Existing data libraries
• A great resource for MIT students and others:
http://libguides.mit.edu/ssds
• Popular sources of data
• Data.gov: datasets generated by the executive branch of the
US government http://www.data.gov/
• IPUMS: censuses from the US and many more!
https://www.ipums.org/
• IPUMS International
https://international.ipums.org/international/
• ICPSR: a data repository with many data sets on lots of subjects
http://www.icpsr.umich.edu/icpsrweb/ICPSR/
• Harvard-MIT Data Center and Harvard Dataverse, where many
researchers archive their data https://dataverse.harvard.edu/
• Amazon Web Services public data sets
http://aws.amazon.com/public-data-sets/

International household survey data

• Demographic and Health Surveys (DHS)
http://www.dhsprogram.com/
• World Bank http://data.worldbank.org/
• LSMS, the Living Standards Measurement Study (search for LSMS
on the World Bank data page)
• RAND public-use databases
http://www.rand.org/labor/data.html

Replication data from researchers

• Randomized controlled trials
• https://dataverse.harvard.edu/dataverse/jpal
• https://dataverse.harvard.edu/dataverse/socialsciencercts
• The American Economic Association journals require authors to
post any data used for research: there is a lot of data on the
AEA website

The internet!

• Many data-intensive websites make their data
directly available to people
• 538
• Yahoo data dump: “a sample of anonymized user interactions
on the news feeds of several Yahoo properties”.
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75
• Uber Movement (to come)
https://movement.uber.com/cities
• Some sites specialize in aggregating data
• Sports data sets: http://www.opensourcesports.com/,
http://nbasavant.com/
• Web pages: Wayback Machine https://archive.org/web/
• There is much more... search in your library catalog, Google, etc.

But what if this is not what I am looking for?
• Sometimes what you are looking for is available, but not free:
your library may be able to purchase it... or you may need to
obtain an agreement.
• Sometimes it is available but access is restricted (for
example for confidentiality reasons).
• For administrative data, see
https://www.povertyactionlab.org/admindata
• The entity that owns the data may be interested in sharing it
with you if this is part of a research project (prospective or
retrospective)
• You will then have to comply with the partner’s requirements
for data security, and also go through an Institutional Review
Board (IRB) review at your institution.
• And sometimes you will have to harvest it yourself!

Harvesting data

• Scraping data from the internet.
• Collecting your own data.

Scraping data from the internet

• Use an API
• Use a web page

What is Web Scraping?

• Pull data from one page
• Crawl an entire web site
• A set of forms running in the background
• Any of the above in an ongoing fashion

Using an API

• An API (application programming interface) is an interface that
lets one program communicate with other programs
• Some web sites provide and invest in APIs (Twitter, Facebook,
Google Maps, etc.), and you will typically use those to harvest
the data from those sites, sometimes in conjunction with
Python

Example: using the Google Maps API to look at traffic in Delhi
Gabriel Kreindler

• Google has set up an API (application programming interface) that
allows access to Google Maps.
• Mostly used by smartphone apps, websites, etc.
• Google provides documentation and libraries for access.
• Very simple to access using Python, Java, HTML, etc.
• No need for complicated scraping!
• Designed for commercial applications. Cost structure:
• First 2,500 queries per day free, then 1 USD per 2,000 queries.
• Researchers can also take advantage!

• One example: travel time (Distance Matrix)
• Takes into account traffic conditions at different hours.
• Two types of queries, depending on departure time:
• Departure time = now → prediction based on crowdsourced live
data from Google Android users, as well as historical data on
that route.
• Departure time = 6/15/2016 8:34am → prediction based on
historical data.
• No direct access to historical data.
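
A travel-time query of the kind described above might look like the following minimal Python sketch. It is not the code used in the study. The endpoint URL, the parameter names (origins, destinations, departure_time, key) and the response fields (rows, elements, duration_in_traffic) are written from memory of Google's Distance Matrix web service and should be checked against the current documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Distance Matrix web service; verify against the docs.
BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_query_url(origin, destination, api_key, departure_time="now"):
    """Build the request URL for one origin-destination pair.

    origin and destination are (latitude, longitude) tuples.
    departure_time is "now" or a Unix timestamp in seconds.
    """
    params = {
        "origins": f"{origin[0]},{origin[1]}",
        "destinations": f"{destination[0]},{destination[1]}",
        "departure_time": departure_time,
        "key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def travel_time_seconds(response):
    """Extract the travel time from a parsed JSON response.

    Returns None if the route could not be computed.
    """
    element = response["rows"][0]["elements"][0]
    if element.get("status") != "OK":
        return None
    # Prefer the traffic-aware estimate; fall back to the plain duration.
    duration = element.get("duration_in_traffic") or element.get("duration")
    return duration["value"] if duration else None


def query_travel_time(origin, destination, api_key, departure_time="now"):
    """Send one live query (needs network access and a valid API key)."""
    url = build_query_url(origin, destination, api_key, departure_time)
    with urlopen(url) as reply:
        return travel_time_seconds(json.load(reply))
```

Keeping the URL construction and the response parsing separate from the network call makes both easy to check without spending any of the daily query quota.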

Example: Delhi Odd-Even study

• Delhi, one of the most polluted cities in the world,
experimented with a driving restriction policy called Odd-Even
(based on license plate numbers)
• Implemented from January 1 to 15, 2016, and again in April
• The following results are based on queries made every 20
minutes since January 1, 2016, on 93 routes across Delhi.
• The input is an API key and a .csv file that has a departure point
(latitude, longitude) and an arrival point for each route
• The output is a set of .csv files with the time each trip takes.
• A Python script keeps querying the Google API
• Code and readme are available on the course web site for those
interested
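
The collection loop described on this slide (read the routes from a .csv file, query each route every 20 minutes, write the travel times to a .csv file) could be sketched as follows. This is an illustration under stated assumptions, not the course code: the column names route_id, origin_lat, origin_lng, dest_lat and dest_lng are a hypothetical layout, and fetch stands for any function that takes an origin and a destination and returns a travel time in seconds (for example, a wrapper around the API call with the key filled in).

```python
import csv
import time
from datetime import datetime, timezone


def read_routes(path):
    """Read routes from a CSV file with the (hypothetical) columns
    route_id, origin_lat, origin_lng, dest_lat, dest_lng."""
    with open(path, newline="") as handle:
        return [
            {
                "route_id": row["route_id"],
                "origin": (float(row["origin_lat"]), float(row["origin_lng"])),
                "destination": (float(row["dest_lat"]), float(row["dest_lng"])),
            }
            for row in csv.DictReader(handle)
        ]


def collect_once(routes, fetch, out_path):
    """Query every route once and append one row per route to out_path."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(out_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for route in routes:
            try:
                seconds = fetch(route["origin"], route["destination"])
            except Exception:
                # A failed query should not stop the whole collection run.
                seconds = None
            writer.writerow([stamp, route["route_id"], seconds])


def collect_forever(routes, fetch, out_path, interval_minutes=20, rounds=None):
    """Repeat collect_once every interval_minutes.

    rounds=None keeps going until the process is stopped.
    """
    done = 0
    while rounds is None or done < rounds:
        started = time.monotonic()
        collect_once(routes, fetch, out_path)
        done += 1
        if rounds is not None and done >= rounds:
            break
        # Sleep for whatever is left of the interval.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_minutes * 60 - elapsed))
```

Appending a timestamped row per query means an interrupted run loses nothing already collected. At 93 routes every 20 minutes this is 93 × 72 = 6,696 queries per day, which is above the free tier mentioned earlier.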

[Figure 2. Origin-Destination Routes for Live Queries in Delhi (the routes used for the analysis)]

Ex1 – Delhi Odd-Even [Live queries]

[Figure: plot annotated with "1st day after Odd-Even", "National holiday", and "Odd Even Pilot"]

Scraping web sites

• Some providers will not have an API
• Then you need to extract the information from the page.
• Example: Ellison and Ellison: Did the internet change the
price of used books?
• Want to compare the price of the same used book in stores
and online
• Need to collect data at regular intervals on the prices of a
set of used books.
• From the web site http://www.abebooks.com/

What we start with

What we want to get

A nice table we can import into R:
• Title (for lots of titles)
• Date
• Price

Web scraping with Python

• The most conventional way to do it... the internet is full of
tutorials
• You will work with the requests library and the BeautifulSoup
library
• With those you will write a simple routine that extracts what
you are looking for.
• In the used book example, we need to pull up the page for
each book at each specified date, and instruct Python to search for
the price (which is conveniently identified by a class).
• And export the prices into a table.
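
The extraction step can be illustrated with a small sketch. To keep it free of third-party dependencies it uses Python's standard-library html.parser in place of BeautifulSoup, so it is a stand-in for the approach on the slide rather than the routine itself. The class name "item-price" and the function names are hypothetical; on a real page you would look up the class in the page source.

```python
import csv
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def extract_prices(html, price_class="item-price"):
    """Return the text of every element tagged with price_class."""
    parser = ClassTextExtractor(price_class)
    parser.feed(html)
    return [text.strip() for text in parser.matches]


def append_to_table(path, title, date, prices):
    """Append one (title, date, price) row per price to a CSV file."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for price in prices:
            writer.writerow([title, date, price])
```

In practice the html string would come from downloading the book's page (for example with requests.get(url).text), and the resulting CSV file can be read directly into R.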

Web scraping in R

• R has a web scraping package called rvest, built by Hadley Wickham
(the same person who wrote the R for Data Science book, ggplot2,
and the tidyverse).
• See: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
• Works well in conjunction with the Google Chrome plugin
SelectorGadget (selectorgadget.com)
• It can submit forms, search web pages, etc.
• See demo(package = "rvest") for demonstrations

Harvesting a table

Harvesting specific items

A cleaner code for a cleaner output

Collecting your own data

• It is not as infeasible as it sounds!
• Survey tools on the internet (SurveyMonkey, Amazon MTurk)
• Install apps on willing participants' phones that will track their
movements (or other things), e.g. https://moves-app.com/
• Sit in the science center and administer questionnaires
• Set up some A/B testing
• And of course if you have more money, organize a data
collection team to collect whatever you would like!

Steps for collecting your own data

• Obtain the funding you may need
• Prepare a data management plan: how will you keep the data
safe? Will you share it?
• Obtain Human Subjects Approval
• Design your data collection instrument
• Pilot your data collection instrument
• Implement!

Protecting Human Subjects
Research involving human subjects is regulated, to ensure the
protection of the participants.
The background: Nazi research and the Tuskegee syphilis study
• An American medical research project conducted by the U.S.
Public Health Service (PHS) from 1932 to 1972 examined the
natural course of untreated syphilis in Black American men.
• The subjects, all impoverished sharecroppers from Macon
County, Alabama, were unknowing participants in the study;
they were not told that they had syphilis, nor were they
offered effective treatment.
• By the end of the experiment, 28 of the men had died directly
of syphilis, 100 were dead of related complications, 40 of their
wives had been infected, and 19 of their children had been
born with congenital syphilis.
• Participants were also lured to come for tests with promises of free
treatment.
• In 1972 the whistle was blown, and the men eventually won a $10
million class-action settlement against the PHS. The study also
had little scientific merit: apparently not much was ever
learned from it about how to treat the disease.
• President Clinton apologized on behalf of the nation in
1997.
What is wrong here?

Protection of Human Subjects

Research involving human subjects is governed by federal
regulation: the HHS Regulations for the Protection of Human
Subjects, at Title 45 Code of Federal Regulations Part 46.
The HHS regulations are intended to implement the basic ethical
principles governing the conduct of human subjects research.
These ethical principles are set forth in the report of the National
Commission for the Protection of Human Subjects of Biomedical
and Behavioral Research entitled: Ethical Principles and Guidelines
for the Protection of Human Subjects of Research (the "Belmont
Report").

Human subject research

Research - A systematic investigation, including research
development, testing and evaluation, designed to develop or
contribute to generalizable knowledge. Activities that meet this
definition constitute research for purposes of the HHS regulations,
whether or not they are conducted or supported under a program
that is considered research for other purposes. For example, some
demonstration and service programs may include research activities.
Human Subject - A living individual about whom an investigator
(whether professional or student) conducting research obtains (1)
data through intervention or interaction with the individual, or (2)
identifiable private information.

Human subject research

Research - A systematic investigation, including research
development, testing and evaluation, designed to develop or
contribute to generalizable knowledge. This means that
Facebook or Amazon can experiment on you as much as they want
as long as they do not publish; but if you are working with them
with the goal of publishing, you need to go through an IRB.
Human Subject - A living individual about whom an investigator
(whether professional or student) conducting research obtains (1)
data through intervention or interaction with the individual, or (2)
identifiable private information.

Key principles of the Belmont report

1 Respect for persons
• Respect individual autonomy
• Protect individuals with reduced autonomy
2 Beneficence
• Maximize benefits and minimize harms
3 Justice
• Equitable distribution of research burdens and benefits

Related requirements

Application of the general ethical principles to the conduct of
human subjects research leads to the following requirements:
• Respect for Persons
• Informed consent
• Protecting privacy and maintaining confidentiality
• Additional safeguards for protection of subjects likely to be
vulnerable to coercion or undue influence
• Beneficence
• Assessment of risk/benefit analysis including study design
• Ensure that risks to subjects are minimized
• Risk justified by benefits of the research
• Justice
• Ensure that selection of subjects is equitable.

How this works in practice
• For any research by an MIT researcher, you need to be trained in
human subjects research.
• You need to submit a form describing your research to MIT COUHES
(Committee on the Use of Humans as Experimental Subjects).
• You need to follow the appropriate deadlines, and get the
authorization BEFORE you start.
• You need to submit your informed consent forms as well,
unless you request a waiver.
• The form is used to assess whether there is risk to the subjects or
others.
• A committee assesses the risks and the benefits and, if
necessary, asks you for changes to protect the subjects better.
• When work is conducted abroad, you typically also need to
obtain human subjects approval from that country.
• Some research that has minimal risk and does not involve
individually linked data is considered exempt.