PRINCIPLES OF DATA SCIENCE by - JOHN P DICKERSON
DATA SCIENCE
JOHN P DICKERSON
Lecture #1 – 08/29/2018
CMSC641
Wednesdays
7:00pm – 9:30pm
INTRODUCTION TO
?????????????
Data science is the application of
computational and statistical techniques to
address or gain [managerial or scientific]
insight into some problem in the real world.
Zico Kolter
Machine Learning Prof, CMU
3
Drew Conway
CEO, Alluvium (analytics company)
4
MANY DEFINITIONS
Broad: necessarily larger than a single
discipline
5
THE DATA LIFECYCLE
6
“The ability to take data—to be able to
understand it, to process it, to extract value
from it, to visualize it, to communicate it—that’s
going to be a hugely important skill in the next
decades, not only at the professional level but
even at the educational level for elementary
school kids, for high school kids, for college
kids.”
Hal Varian
Chief Economist at Google
7
MOTIVATION
Explosion of data, in pretty much every domain
• Sensing devices and sensor networks that can monitor
everything 24/7 from temperature to pollution to vital signs
• Increasingly sophisticated smart phones
• Internet, social networks makes it easy to publish data
• Scientific experiments and simulations → astronomical data
volumes
• Internet of Things
• Dataification: taking all aspects of life and turning them into data
(e.g., what you like/enjoy has been turned into a stream of your
"likes")
How to handle that data? How to extract interesting actionable
insights and scientific knowledge?
Data volumes expected to get much worse
8
FOUR V’S OF BIG DATA
Increasing data Volumes
• Scientific data: 1.5GB per genome -- can be sequenced in .5 hrs
• 500M tweets per day (as of 2013)
• As of 2012: 2.5 Exabytes of data created every day
Variety:
• Structured data, spreadsheets, photos, videos, natural text, ...
Velocity
• Sensors everywhere -- can generate high-rate "data streams"
• Real-time analytics requires data to be consumed as fast as it is
generated
Veracity
• How do you decide what to trust? How to remove noise? How to
fill in missing values?
9
THIS CERTIFICATE PROGRAM
You’ll learn to take data:
• Process it
• Visualize it
• Understand it
• Communicate it
• Extract value from it
Hal Varian
Info: https://www.cs.umd.edu/class/fall2018/cmsc641/
Piazza: piazza.com/umd/fall2018/cmsc641
ELMS: (everyone should be registered automatically)
10
THIS CERTIFICATE PROGRAM
CMSC 641: Principles of Data Science
• An introduction to the data science pipeline, i.e., the end-to-end
process of going from unstructured, messy data to knowledge
and actionable insights. Provides a broad overview of what data
science means and systems and tools commonly used for data
science, and illustrates the principles of data science through
several case studies.
11
THIS COURSE
End-to-end data science lifecycle
Data modeling
Information Visualization
Feel free to tell me if there are topics that you think we should
cover…
15
PREREQUISITE
KNOWLEDGE
Aimed at folks with some CS knowledge – but likely
accessible to others with programming experience and
mathematical maturity.
We do not assume:
• Experience with Python, pandas, scikit-learn, matplotlib, etc …
• Deep statistics or any ML knowledge
• Database or distributed systems knowledge
We do assume: You want to be here!
16
WHO AM I?
http://jpdickerson.com
17
WHO ARE YOU?
STAT400?
CMSC422?
CMSC424?
18
(TENTATIVE) COURSE
STRUCTURE
First 2 lectures: intro & primers in the Python data
science stack
Next 3 lectures: data collection & management
• Best practices, data wrangling, exploratory analysis,
ethics, debugging, visualization, etc …
Next 4-5 lectures: statistical modeling & ML
• Statistical learning, regression, classification, cross-
validation, model evaluation, hypothesis testing, etc …
Midterm
Final 4.5 lectures: advanced topics
Ambitious …
• Dimensionality reduction, distributed learning, big data,
distributed computation
• Either group presentations or more lectures
19
GRADE #1: MINI-PROJECTS
Students will complete four mini-project assignments:
• Case studies meant to mimic what you, a future data scientist,
will see in industry. They should be fun ☺.
The rules:
• Allowed: small group discussions
• Required: individual programming & writing
• Never allowed: public posting of solutions
Deliverable:
• Turn in an .ipynb of a Jupyter notebook on ELMS
20
GRADE #2: READING
HOMEWORKS
We will post (bi)weekly reading assignments. Mix of:
• Blog posts
• Academic articles
• News articles
Weekly quiz to be taken on ELMS covering the readings
Individual quiz grades are pass/fail:
• At least 60% correct → Pass
• Less than 60% correct → Fail
Must take at least ten of these quizzes over the semester
21
GRADE #3: MINI-TUTORIAL
In lieu of a final exam, you’ll create a mini-tutorial that:
• Identifies a raw data source
• Processes and stores that data
• Performs exploratory data analysis & visualization
• Derives insight(s) using statistics and ML
• Communicates those insights as actionable text
Individual or group project
23
READY-MADE DATASET
REPOSITORIES
https://www.data.gov/
• US-centric agriculture, climate, education, energy, finance, health,
manufacturing data, …
https://cloud.google.com/bigquery/public-data/
• BigQuery (Google Cloud) public datasets (bikeshare, GitHub,
Hacker News, Form 990 non-profits, NOAA, …)
https://www.kaggle.com/datasets
• Google-owned, various (Billboard Top 100 lyrics, credit card
fraud, crime in Chicago, global terrorism, world happiness, …)
https://aws.amazon.com/public-datasets/
• AWS-hosted, various (NASA, a bunch of genome stuff, Google
Books n-grams, Multimedia Commons, …)
24
NEW DATASET IDEAS
Fraternal Order of Police vs Black Lives Matter
Linking finance data to ${anything_else}
Something having to do with Pokémon statistics?
Look through http://www.alexa.com/topsites and scrape
something interesting!
University of Maryland-related, or College Park-related, stuff
• Check out http://umd.io/ – open source project; maybe your
data collection and cleaning scripts can be added to this!
Honestly, pretty much anything! Just document everything.
Reproducibility!
25
FINAL TUTORIAL
Deliverable: URL of your own GitHub Pages site hosting an
.ipynb/.html export of your final tutorial
• https://pages.github.com/ – make a GitHub account, too!
• https://github.com/blog/1995-github-jupyter-notebooks-3
The project itself:
• ~1500+ words of Markdown prose
• ~150+ lines of Python
• Should be viewable as a static webpage – that is, if I (or
anyone else) opens the link up, everything should render and I
shouldn’t have to run any cells to generate output
26
FINAL TUTORIAL RUBRIC
The TAs and I will grade on a scale of 1-10:
Motivation: Does the tutorial make the reader believe the topic is
important (a) in general and (b) with respect to data science?
Understanding: After reading the tutorial, does the reader
understand the topic?
Further resources: Does the tutorial “call out” to other resources
that would help the reader understand basic concepts, deep dive,
related work, etc?
Prose: Does the prose in the Markdown portion of the .ipynb add to
the reader’s understanding of the tutorial?
Code: Does the code help solidify understanding, is it well
documented, and does it include helpful examples?
Subjective Evaluation: If somebody linked to this tutorial from
Hacker News, would people actually read the whole thing?
27
Thanks to: Zico Kolter
GRADE #☺: CLASS
PARTICIPATION
Please please please please please do the required reading,
if available, before coming to class!
28
GRADE BREAKDOWN
60% mini-projects:
• There are 3 of them
• Equal weighting @ 20% each
29
SOME TECHNOLOGIES
WE WILL USE
30
(Don’t tell CMSC330 …)
31
IMPORTANT WALLS OF TEXT
32
Common
Sense!
ANTI-HARASSMENT
(Adapted from ACM SIGCOMM’s policies)
33
Common
Sense!
ACADEMIC INTEGRITY
(Text unironically stolen from Hal Daumé III)
Any assignment or exam that is handed in must be your own work (unless
otherwise stated). However, talking with one another to understand the
material better is strongly encouraged. Recognizing the distinction between
cheating and cooperation is very important. If you copy someone else's
solution, you are cheating. If you let someone else copy your solution, you
are cheating (this includes posting solutions online in a public place). If
someone dictates a solution to you, you are cheating.
Everything you hand in must be in your own words, and based on your own
understanding of the solution. If someone helps you understand the problem
during a high-level discussion, you are not cheating. We strongly encourage
students to help one another understand the material presented in class, in
the book, and general issues relevant to the assignments. When taking an
exam, you must work independently. Any collaboration during an exam will be
considered cheating. Any student who is caught cheating will be given an F in
the course and referred to the University Office of Student Conduct. Please
don't take that chance – if you're having trouble understanding the material,
please let me know and I will be more than happy to help.
34
(A FEW) DATA SCIENCE SUCCESS
STORIES & CAUTIONARY TALES
35
POLLING: 2008 & 2012
Nate Silver uses a simple idea – taking a principled approach
to aggregating polling instead of relying on punditry – and:
• Predicts 49/50 states in 2008
• Predicts 50/50 states in 2012
36
Democrat (+) or Republican (-) in 2012
POLLING: 2016
37
http://www.huffingtonpost.com/entry/nate-silver-election-forecast_us_581e1c33e4b0d9ce6fbc6f7f
https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
AD TARGETING
Pregnancy is an expensive & habit-forming time
• Thus, valuable to consumer-facing firms
2012:
• Target identifies 25 products and subsets thereof that are
commonly bought in early pregnancy
• Uses purchase history of patrons to predict pregnancy,
targets advertising for post-natal products (cribs, etc)
• Good: increased revenue
• Bad: this can expose pregnancies – as famously
happened in Minneapolis to a high schooler
38
http://www.businessinsider.com/the-incredible-story-of-how-target-exposed-a-teen-girls-pregnancy-2012-2
AUTOMATED DECISIONS OF
CONSEQUENCE
[Sweeney 2013, Miller 2015, Byrnes 2016,
Rudin 2013, Barry-Jester et al. 2015]
Hiring · Lending · Policing/sentencing
Female cookies → less frequently shown professional job opening ads
39
“… a lot remains unknown about how big data-driven
decisions may or may not use factors that are proxies
for race, sex, or other traits that U.S. laws generally
prohibit from being used in a wide range of commercial
decisions … What can be done to make sure these
products and services–and the companies that use
them–treat consumers fairly and ethically?”
40
OLYMPIC MEDALS
41
https://www.nytimes.com/interactive/2016/08/08/sports/olympics/history-olympic-dominance-charts.html
NETFLIX PRIZE I
Recommender systems: predict a user’s rating of an item
Netflix Prize: $1MM to the first team that beats our in-house
engine by 10%
• Happened after about three years
• Model was never used by Netflix for a variety of reasons
• Out of date (DVDs vs streaming)
• Too complicated / not interpretable
42
NETFLIX PRIZE II
[Figure: items plotted against latent factors (axis label: Latent factor 2); region labels: Critically-Acclaimed/Strong Female Lead, Frat/Gross-Out Comedy, Artsy, BB. Image courtesy of Christopher Volinsky.]
43
NETFLIX PRIZE III
Netflix initially planned a follow-up competition
In 2007, UT Austin managed to deanonymize portions of the
original released (anonymized) Netflix dataset:
• ????????????
• Matched rating against those made publicly on IMDb
Why could this be bad?
2009—2010, four Netflix users filed a class-action lawsuit
against Netflix over
44
MONEYBALL
Baseball teams drafted rookie
players primarily based on
human scouts’ opinions of
their talents
Peter Brand, data scientist du
jour, convinces the {bad,
poor} Oakland Athletics to
use a quantitative aka
sabermetric approach to
hiring
45
46
http://www.businessinsider.com/best-jobs-in-america-in-2017-2017-1/
WRAP-UP FOR PART I
Register on Piazza using your UMD address:
piazza.com/umd/fall2018/cmsc641
47
AFTER THE BREAK:
SCRAPING DATA WITH PYTHON
48
THE DATA LIFECYCLE
49
TODAY’S LECTURE
50
BUT FIRST, SNAKES!
Python is an interpreted, dynamically-typed, high-level,
garbage-collected, object-oriented-functional-imperative, and
widely used scripting language.
• Interpreted: instructions executed without being compiled into
(virtual) machine instructions*
• Dynamically-typed: verifies type safety at runtime
• High-level: abstracted away from the raw metal and kernel
• Garbage-collected: memory management is automated
• OOFI: you can do bits of OO, F, and I programming
Not the point of this class!
• Python is fast (developer time), intuitive, and used in industry!
51
*you can compile Python source, but it’s not required
THE ZEN OF PYTHON
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Flat is better than nested.
• Sparse is better than dense.
• Readability counts.
• Special cases aren't special enough to break the rules …
• … although practicality beats purity.
• Errors should never pass silently …
• … unless explicitly silenced.
52
Thanks: SDSMT ACM/LUG
LITERATE
PROGRAMMING
Literate code contains in one document:
• the source code;
• text explanation of the code; and
• the end result of running the code.
Basic idea: present code in the order that logic and flow of
human thoughts demand, not the machine-needed ordering
• Necessary for data science!
• Many choices made need textual explanation, ditto results.
Stuff you’ll be using in Project 0 (and beyond)!
53
10-MINUTE PYTHON
PRIMER
Define a function:
Python is whitespace-delimited
Define a function that returns a tuple:
def my_func(x, y):
    return (x-1, y+2)
(a, b) = my_func(1, 2)
# result: a = 0; b = 4
54
USEFUL BUILT-IN FUNCTIONS:
COUNTING AND ITERATING
len: returns the number of items in a container (list, string, dict, …)
len( ['c', 'm', 's', 'c', 3, 2, 0] )
7
range: returns an iterable object
list( range(10) )
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
enumerate: returns an iterable of (index, element) tuples for a list
enumerate( ["311", "320", "330"] )
https://docs.python.org/3/library/functions.html
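A small sketch of the three built-ins working together. The course-number list is the one from the slide; the variable names are only illustrative.

```python
courses = ["311", "320", "330"]

# len counts the items in a container
n_courses = len(courses)

# range is lazy; wrap it in list() to see the values
indices = list(range(n_courses))

# enumerate yields (index, element) pairs
pairs = list(enumerate(courses))

# The usual place to meet enumerate is a for-loop
for idx, course in enumerate(courses):
    print(idx, course)
```

Wrapping `enumerate(...)` in `list(...)` is only needed to look at the pairs; inside a `for` loop the lazy object is enough.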
55
USEFUL BUILT-IN FUNCTIONS:
MAP AND FILTER
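map: apply a function to each element of a sequence or iterable
(lazy in Python 3; wrap in list() to see the values)
arr = [1, 2, 3, 4, 5]
list( map(lambda x: x**2, arr) )
[1, 4, 9, 16, 25]
filter: keep only the elements for which a function returns True
arr = [1, 2, 3, 4, 5, 6, 7]
list( filter(lambda x: x % 2 == 0, arr) )
[2, 4, 6]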
56
PYTHONIC
PROGRAMMING
Basic iteration over an array in Java:
int[] arr = new int[10];
for(int idx=0; idx<arr.length; ++idx) {
    System.out.println( arr[idx] );
}
The same index-based loop transliterated to Python (works, but is not Pythonic):
idx = 0
while idx < len(arr):
    print( arr[idx] ); idx += 1
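For contrast, a sketch of how the same iteration is usually written in Python. The zero-filled list is an assumption that mirrors Java's `new int[10]`.

```python
arr = [0] * 10  # mirrors Java's new int[10], which is zero-filled

# Pythonic: iterate over the elements directly, no index bookkeeping
for element in arr:
    print(element)

# If the index is also needed, enumerate supplies it
for idx, element in enumerate(arr):
    print(idx, element)
```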
57
LIST COMPREHENSIONS
Construct sets like a mathematician!
• P = { 1, 2, 4, 8, 16, …, 2^16 }
• E = { x | x in ℕ and x is odd and x < 1000 }
Construct lists like a mathematician who codes!
E = [ x for x in range(1000) if x % 2 != 0 ]
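The set P can be built the same way. This is a sketch that reads the last element of P as 2 to the 16th power, so the exponents run from 0 to 16.

```python
# P = {1, 2, 4, 8, 16, ..., 2^16}: powers of two with exponents 0..16
P = [2**k for k in range(17)]

# E, as on the slide: odd natural numbers below 1000
E = [x for x in range(1000) if x % 2 != 0]

# Curly braces give an actual set instead of a list
P_set = {2**k for k in range(17)}
```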
58
EXCEPTIONS
Syntactically correct statement throws an exception:
• tweepy (Python Twitter API) returns “Rate limit exceeded”
• sqlite (a file-based database) returns IntegrityError
from platform import python_version
print('Python', python_version())
try:
    cause_a_NameError
except NameError as err:
    print(err, '-> some extra text')
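The sqlite case mentioned above can be reproduced with the standard-library `sqlite3` module. This is a minimal sketch; the table and names are made up for illustration.

```python
import sqlite3

# In-memory database with a primary key, so duplicate ids are rejected
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students VALUES (1, 'Ada')")

caught = None
try:
    # Syntactically fine, but violates the primary-key constraint
    conn.execute("INSERT INTO students VALUES (1, 'Grace')")
except sqlite3.IntegrityError as err:
    caught = err
    print(err, "-> duplicate id, row skipped")
finally:
    conn.close()
```

Catching the specific exception class, rather than a bare `except`, keeps unrelated errors from passing silently.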
59
PYTHON 2 VS 3
Python 3 is intentionally backwards incompatible
• (But not that incompatible)
Biggest changes that matter for us:
• print “statement” → print(“function”)
• 1/2 = 0 → 1/2 = 0.5 and 1//2 = 0
• ASCII str default → default Unicode
Namespace ambiguity fixed:
i = 1
[i for i in range(5)]
print(i) # ????????
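A runnable version of the snippet, assuming Python 3. The name `squares` is added only so the comprehension's result is kept.

```python
i = 1
squares = [i * i for i in range(5)]

# In Python 3 the comprehension has its own scope, so the outer i
# is untouched. In Python 2 the loop variable leaked and i became 4.
print(i)
```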
60
TO ANY CURMUDGEONS …
If you’re going to use Python 2 anyway, use the __future__
module:
• Python 3 introduces features that will throw errors in older
Python 2 releases (e.g., with statements before Python 2.6)
• __future__ module incrementally brings 3 functionality into 2
• https://docs.python.org/2/library/__future__.html
61
PYTHON VS R (FOR
DATA SCIENTISTS)
There is no right answer here!
• Python is a “full”
programming language –
easier to integrate with
systems in the field
• R has a more mature set of
pure stats libraries …
• … but Python is catching up
quickly …
• … and is already ahead
specifically for ML.
You will see Python more in the
tech industry.
62
EXTRA RESOURCES
Plenty of tutorials on the web:
• https://www.learnpython.org/
63
64
TODAY’S LECTURE
65
Thanks: Zico Kolter’s 15-388
GOTTA CATCH ‘EM ALL
Five ways to get data:
• Direct download and load from local storage
• Generate locally via downloaded code (e.g., simulation)
• Query data from a database (covered in a few lectures)
• Query an API from the intra/internet (covered today)
• Scrape data from a webpage (covered today)
66
WHEREFORE ART
THOU, API?
A web-based Application Programming Interface (API) like
we’ll be using in this class is a contract between a server and
a user stating:
67
“SEND ME A SPECIFIC
REQUEST”
Most web API queries we’ll be doing will use HTTP requests:
• conda install -c anaconda requests=2.12.4
import requests
r = requests.get( 'https://api.github.com/user',
                  auth=('user', 'pass') )
r.status_code
200
r.headers['content-type']
'application/json; charset=utf8'
r.json()
68
http://docs.python-requests.org/en/master/
HTTP REQUESTS
https://www.google.com/?q=cmsc320&tbs=qdr:m
??????????
*requests verifies SSL certificates on https:// calls by default; be careful about switching that off with verify=False
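One way to see the parts of the URL above is to split it with the standard library. This is a sketch that uses `urllib.parse` rather than requests, so no network call is made.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/?q=cmsc320&tbs=qdr:m"
parts = urlsplit(url)

print(parts.scheme)   # protocol
print(parts.netloc)   # host
print(parts.path)     # resource path on the host
print(parts.query)    # everything after the '?'

# The query string is a set of key=value pairs joined by '&'
params = parse_qs(parts.query)
print(params)
```

With requests, the same key/value pairs are normally passed as a `params` dictionary instead of being pasted into the URL by hand.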
69
RESTFUL APIS
This class will just query web APIs, but full web APIs typically
allow more.
Representational State Transfer (RESTful) APIs:
• GET: perform query, return data
• POST: create a new entry or object
• PUT: update an existing entry or object
• DELETE: delete an existing entry or object
Can be more intricate, but verbs (“put”) align with actions
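To make the verb-to-action mapping concrete, here is a toy in-memory stand-in for a server. Everything in it (the `store` dict, the `handle` function) is hypothetical; a real REST API does this over HTTP.

```python
# Toy stand-in for a server: a dict keyed by id (hypothetical, no network)
store = {}

def handle(verb, key, body=None):
    """Map each REST verb onto the matching action on the store."""
    if verb == "GET":
        return store.get(key)          # query, return data
    if verb == "POST":
        if key in store:
            raise KeyError("entry already exists: %r" % key)
        store[key] = body              # create a new entry
        return body
    if verb == "PUT":
        if key not in store:
            raise KeyError("no such entry: %r" % key)
        store[key] = body              # update an existing entry
        return body
    if verb == "DELETE":
        return store.pop(key, None)    # delete an existing entry
    raise ValueError("unsupported verb: %r" % verb)

handle("POST", 1, {"name": "cmsc641"})
handle("PUT", 1, {"name": "CMSC641"})
fetched = handle("GET", 1)
handle("DELETE", 1)
missing = handle("GET", 1)
```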
70
QUERYING A RESTFUL API
Stateless: with every request, you send along a
token/authentication of who you are
token = "super_secret_token"
r = requests.get("https://api.github.com/user",
                 params={"access_token": token})
print( r.content )
{"login":"JohnDickerson","id":472985,"avatar_url":"ht…
71
AUTHENTICATION
AND OAUTH
Old and busted:
r = requests.get("https://api.github.com/user",
                 auth=("JohnDickerson", "ILoveKittens"))
New hotness:
• What if I wanted to grant an app access to, e.g., my Facebook
account without giving that app my password?
• OAuth: grants access tokens that give (possibly incomplete)
access to a user or app without exposing a password
72
“… I WILL RETURN INFORMATION
IN A STRUCTURED FORMAT.”
So we’ve queried a server using a well-formed GET request
via the requests Python module. What comes back?
General structured data:
• Comma-Separated Value (CSV) files & strings
• JavaScript Object Notation (JSON) files & strings
• HTML, XHTML, XML files & strings
Domain-specific structured data:
• Shapefiles: geospatial vector data (OpenStreetMap)
• RVT files: architectural planning (Autodesk Revit)
• You can make up your own! Always document it.
73
CSV FILES IN PYTHON
Any CSV reader worth anything can parse files with any
delimiter, not just a comma (e.g., “TSV” for tab-separated)
1,26-Jan,Introduction,—,"pdf, pptx",Dickerson,
2,31-Jan,Scraping Data with Python,Anaconda's Test Drive.,,Dickerson,
3,2-Feb,"Vectors, Matrices, and Dataframes",Introduction to pandas.,,Dickerson,
4,7-Feb,Jupyter notebook lab,,,"Denis, Anant, & Neil",
5,9-Feb,Best Practices for Data Science Projects,,,Dickerson,
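A sketch of reading rows like these with Python's built-in `csv` module. Two rows are copied from the schedule above, and the TSV line is made up to show the `delimiter` argument.

```python
import csv
import io

# Two rows copied from the schedule above; quoted fields contain commas
text = (
    '3,2-Feb,"Vectors, Matrices, and Dataframes",Introduction to pandas.,,Dickerson,\n'
    '4,7-Feb,Jupyter notebook lab,,,"Denis, Anant, & Neil",\n'
)

rows = list(csv.reader(io.StringIO(text)))
for row in rows:
    print(row[0], row[2], row[5])

# Same reader, different delimiter: tab-separated values
tsv_rows = list(csv.reader(io.StringIO("a\tb, still b\tc\n"), delimiter="\t"))
```

Splitting on commas by hand would break the quoted fields, which is the main reason to use a real CSV reader.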
74
JSON FILES & STRINGS
JSON is a method for serializing objects:
• Convert an object into a string (done in Java in 131/132?)
• Deserialization converts a string back to an object
Easy for humans to read (and sanity check, edit)
Defined by three universal data structures
75
Images from: http://www.json.org/
JSON IN PYTHON
Some built-in types: "Strings", 1.0, True, False, None
Lists: ["Goodbye", "Cruel", "World"]
Dictionaries: {"hello": "bonjour", "goodbye": "au revoir"}
Dictionaries within lists within dictionaries within lists:
[1, 2, {"Help": [
    "I'm", {"trapped": "in"},
    "CMSC641"
]}]
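Serialization and deserialization of the nested structure above can be sketched with the standard `json` module:

```python
import json

obj = [1, 2, {"Help": ["I'm", {"trapped": "in"}, "CMSC641"]}]

# Serialize: Python object -> JSON string
text = json.dumps(obj)

# Deserialize: JSON string -> Python object
back = json.loads(text)

# Python's True/False/None become JSON's true/false/null
literals = json.dumps([True, False, None])
print(text)
print(literals)
```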
76
JSON FROM TWITTER
GET https://api.twitter.com/1.1/friends/list.json?cursor=-1&screen_name=twitterapi&skip_status=true&include_user_entities=false
{
  "previous_cursor": 0,
  "previous_cursor_str": "0",
  "next_cursor": 1333504313713126852,
  "users": [{
    "profile_sidebar_fill_color": "252429",
    "profile_sidebar_border_color": "181A1E",
    "profile_background_tile": false,
    "name": "Sylvain Carle",
    "profile_image_url":
      "http://a0.twimg.com/profile_images/2838630046/4b82e286a659fae310012520f4f756bb_normal.png",
    "created_at": "Thu Jan 18 00:10:45 +0000 2007", …
77
PARSING JSON IN
PYTHON
Repeat: don’t write your own CSV or JSON parser
• https://news.ycombinator.com/item?id=7796268
• rsdy.github.io/posts/dont_write_your_json_parser_plz.html
Python comes with a fine JSON parser
import json
import requests
# auth: your Twitter API credentials, set up beforehand
r = requests.get(
    "https://api.twitter.com/1.1/statuses/user_timeline.json"
    "?screen_name=JohnPDickerson&count=100", auth=auth )
data = json.loads(r.content)
78
XML, XHTML, HTML
FILES AND STRINGS
Still hugely popular online, but JSON has essentially
replaced XML for:
• Asynchronous browser ↔ server calls
• Many (most?) newer web APIs
XML is a hierarchical markup language:
<tag attribute="value1">
    <subtag>
        Some cool words or values go here!
    </subtag>
    <openclosetag attribute="value2" />
</tag>
You probably won’t see much XML, but you will see plenty of
HTML, its substantially less well-behaved cousin …
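For well-formed XML like the example above, the standard library's `xml.etree.ElementTree` is enough. This is a sketch; it is not the parser to use on messy real-world HTML.

```python
import xml.etree.ElementTree as ET

doc = """
<tag attribute="value1">
  <subtag>
    Some cool words or values go here!
  </subtag>
  <openclosetag attribute="value2" />
</tag>
""".strip()

root = ET.fromstring(doc)
print(root.tag, root.attrib)

# Children are reachable by tag name; text keeps its surrounding whitespace
words = root.find("subtag").text.strip()
child_tags = [child.tag for child in root]
value2 = root.find("openclosetag").get("attribute")
```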
79
DOCUMENT OBJECT
MODEL (DOM)
80
SCRAPING HTML IN
PYTHON
HTML – the specification – is fairly pure
HTML – what you find on the web – is horrifying
We’ll use BeautifulSoup:
• conda install -c asmeurer beautiful-soup=4.3.2
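import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://cs.umd.edu/class/spring2017/cmsc320/" )
# parse the page; root is the tree searched by the scraper
root = BeautifulSoup( r.content, "html.parser" )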
81
BUILDING A WEB
SCRAPER IN PYTHON
Totally not hypothetical situation:
• You really want to learn about data science, so you choose to
download all of last semester’s CMSC320 lecture slides to
wallpaper your room …
• … but you now have carpal tunnel syndrome from clicking
refresh on Piazza last night, and can no longer click on the
PDF and PPTX links.
Hopeless? No! Earlier, you built a scraper to do this!
# find all schedule links for CMSC320
lnks = (root.find("div", id="schedule")
            .find("table")
            .find("tbody").findAll("a"))
Sort of. You only want PDF and PPTX files, not links to other
websites or files.
82
REGULAR
EXPRESSIONS
Given a list of URLs (strings), how do I find only those strings
that end in *.pdf or *.pptx?
• Regular expressions!
• (Actually Python strings come with a built-in endswith
function.)
"this_is_a_filename.pdf".endswith((".pdf", ".pptx"))
"tHiS_IS_a_FileNAme.pDF".lower().endswith(
    (".pdf", ".pptx"))
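Applied to a whole list of links, the same check becomes a one-line filter. The link strings below are hypothetical stand-ins for scraped hrefs.

```python
# Hypothetical link targets, standing in for scraped hrefs
links = [
    "lec01.pdf",
    "lec01.PPTX",
    "https://piazza.com/umd/fall2018/cmsc641",
    "notes.html",
]

wanted = (".pdf", ".pptx")

# lower() makes the check case-insensitive; endswith accepts a tuple
slides = [lnk for lnk in links if lnk.lower().endswith(wanted)]
print(slides)
```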
83
84
REGULAR EXPRESSIONS
Used to search for specific elements, or groups of elements,
that match a pattern
import re
85
MATCHING MULTIPLE
CHARACTERS
Can match sets of characters, or multiple and more elaborate
sets and sequences of characters:
• Match the character ‘a’: a
• Match the character ‘a’, ‘b’, or ‘c’: [abc]
• Match any character except ‘a’, ‘b’, or ‘c’: [^abc]
• Match any digit: \d (= [0123456789] or [0-9])
• Match any alphanumeric: \w (= [a-zA-Z0-9_])
• Match any whitespace: \s (= [ \t\n\r\f\v])
• Match any character: .
Special characters must be escaped: .^$*+?{}\[]|()
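A few of these character classes, tried out with `re.findall`. The sample strings are invented for illustration.

```python
import re

text = "CMSC 641, room_3 at 7:00pm"

digits = re.findall(r"\d", text)         # single digits
words = re.findall(r"\w+", text)         # runs of word characters
not_abc = re.findall(r"[^abc]", "abcd")  # anything except a, b, or c
dots = re.findall(r"\.", "a.b.c")        # a literal dot must be escaped
```

Raw strings (`r"..."`) keep Python from interpreting the backslashes before the regex engine sees them.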
86
MATCHING SEQUENCES AND
REPEATED CHARACTERS
A few common modifiers (available in Python and most other
high-level languages; +, {n}, {n,} may not):
• Match character ‘a’ exactly once: a
• Match character ‘a’ zero or one times: a?
• Match character ‘a’ zero or more times: a*
• Match character ‘a’ one or more times: a+
• Match character ‘a’ exactly n times: a{n}
• Match character ‘a’ at least n times: a{n,}
Example: match all instances of “University of <somewhere>” where
<somewhere> is an alphanumeric string with at least 3 characters:
• \s*University\sof\s\w{3,}
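The example pattern can be checked against a short sentence. The sentence is made up; the pattern is the one from the slide.

```python
import re

pattern = r"\s*University\sof\s\w{3,}"

text = "I visited the University of Maryland and the University of X."

# "Maryland" has at least 3 word characters; "X" does not
matches = [m.strip() for m in re.findall(pattern, text)]
print(matches)
```

The leading `\s*` pulls any preceding whitespace into the match, which is why each match is stripped before use.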
87
COMPILED REGEXES
If you’re going to reuse the same regex many times, or if you
aren’t but things are going slowly for some reason, try
compiling the regular expression.
• https://blog.codinghorror.com/to-compile-or-not-to-compile/
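A minimal sketch of compiling a pattern once and reusing it. The filenames and the pattern itself are assumptions chosen to match the PDF/PPTX example.

```python
import re

# Compile once, reuse many times
slide_re = re.compile(r"\.(pdf|pptx)$", re.IGNORECASE)

names = ["lec01.pdf", "lec02.PPTX", "index.html", "pdf_notes.txt"]
slides = [name for name in names if slide_re.search(name)]
print(slides)
```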
88
DOWNLOADING A
BUNCH OF FILES
Import the modules
import re
import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2
89
DOWNLOADING A
BUNCH OF FILES
Parse exactly what you want
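The slide's code did not survive in the text, so this is only a sketch of one plausible step: turning scraped hrefs into absolute URLs and local filenames, keeping just the slide files. The base URL is the course page used earlier; the hrefs are hypothetical.

```python
import os
from urllib.parse import urljoin, urlparse

# Assumed page URL and hrefs; in practice the hrefs come from the scraper
base = "https://cs.umd.edu/class/spring2017/cmsc320/"
hrefs = ["files/lec01.pdf", "files/lec01.pptx", "https://www.python.org/"]

targets = []
for href in hrefs:
    url = urljoin(base, href)                    # resolve relative links
    name = os.path.basename(urlparse(url).path)  # local filename
    if name.lower().endswith((".pdf", ".pptx")):
        targets.append((url, name))

# Each (url, name) pair could then be fetched with requests.get(url)
# and written to disk under name.
```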
90
NEXT LECTURE
91
NEXT CLASS:
NUMPY, SCIPY, AND DATAFRAMES
92