PRINCIPLES OF DATA SCIENCE by - JOHN P DICKERSON

It will help to understand the principles of data science

PRINCIPLES OF

DATA SCIENCE
JOHN P DICKERSON

Lecture #1 – 08/29/2018

CMSC641
Wednesdays
7:00pm – 9:30pm
INTRODUCTION TO
?????????????
Data science is the application of
computational and statistical techniques to
address or gain [managerial or scientific]
insight into some problem in the real world.

Zico Kolter
Machine Learning Prof, CMU

3
Drew Conway
CEO, Alluvium (analytics company)

4
MANY DEFINITIONS
Broad: necessarily larger than a single
discipline

Interdisciplinary: statistics, computer
science, operations research, statistical
and machine learning, data
warehousing, visualization,
mathematics, information science, …

Insight-focused: grounded in the desire
to find insights in data and leverage
them to inform decision making

Tuomas Carsey, UNC

5
THE DATA LIFECYCLE

Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & policy decision

6
“The ability to take data—to be able to
understand it, to process it, to extract value
from it, to visualize it, to communicate it—that’s
going to be a hugely important skill in the next
decades, not only at the professional level but
even at the educational level for elementary
school kids, for high school kids, for college
kids.”

Hal Varian
Chief Economist at Google

7
MOTIVATION
Explosion of data, in pretty much every domain
• Sensing devices and sensor networks that can monitor
everything 24/7 from temperature to pollution to vital signs
• Increasingly sophisticated smart phones
• Internet, social networks makes it easy to publish data
• Scientific experiments and simulations → astronomical data
volumes
• Internet of Things
• Dataification: taking all aspects of life and turning them into data
(e.g., what you like/enjoy has been turned into a stream of your
"likes")
How to handle that data? How to extract interesting actionable
insights and scientific knowledge?
Data volumes expected to get much worse

8
FOUR V’S OF BIG DATA
Increasing data Volumes
• Scientific data: 1.5GB per genome -- can be sequenced in .5 hrs
• 500M tweets per day (as of 2013)
• As of 2012: 2.5 Exabytes of data created every day
Variety:
• Structured data, spreadsheets, photos, videos, natural text, ...
Velocity
• Sensors everywhere -- can generate high-rate "data streams"
• Real-time analytics requires data to be consumed as fast as it is
generated
Veracity
• How do you decide what to trust? How to remove noise? How to
fill in missing values?

9
THIS CERTIFICATE PROGRAM
You’ll learn to take data:
• Process it
• Visualize it
• Understand it
• Communicate it
• Extract value from it
Hal Varian

Info: https://www.cs.umd.edu/class/fall2018/cmsc641/
Piazza: piazza.com/umd/fall2018/cmsc641
ELMS: (everyone should be registered automatically)

10
THIS CERTIFICATE PROGRAM
CMSC 641: Principles of Data Science
• An introduction to the data science pipeline, i.e., the end-to-end
process of going from unstructured, messy data to knowledge
and actionable insights. Provides a broad overview of what data
science means and systems and tools commonly used for data
science, and illustrates the principles of data science through
several case studies.

CMSC 642: Big Data Systems

CMSC 643: Machine Learning and Data Mining

CMSC 644: Algorithms for Data Science

11
THIS CERTIFICATE PROGRAM
CMSC 641: Principles of Data Science

CMSC 642: Big Data Systems

• An overview of data management systems for performing
data science on large volumes of data, including relational
databases, and NoSQL systems. The topics covered include:
different types of data management systems, their pros and
cons, how and when to use those systems, and best practices
for data modeling.
CMSC 643: Machine Learning and Data Mining

CMSC 644: Algorithms for Data Science

12
THIS CERTIFICATE PROGRAM
CMSC 641: Principles of Data Science

CMSC 642: Big Data Systems

CMSC 643: Machine Learning and Data Mining

• Provides a broad overview of key machine learning and data
mining algorithms, and how to apply those to very large datasets.
Topics covered include decision trees, linear models for
classification and regression, support vector machines, neural
networks and deep learning, online learning, recommendation
systems, clustering and dimensionality reduction, and systems
for large-scale machine learning.

CMSC 644: Algorithms for Data Science

13
THIS CERTIFICATE PROGRAM
CMSC 641: Principles of Data Science

CMSC 642: Big Data Systems

CMSC 643: Machine Learning and Data Mining

CMSC 644: Algorithms for Data Science

• Provides an in-depth understanding of some of the key data
structures and algorithms essential for advanced data
science. Topics include random sampling, graph algorithms,
network science, data streams, and optimization.

14
THIS COURSE
End-to-end data science lifecycle

Acquiring, wrangling, cleaning, and integrating data; Setting up

pipelines for ETL

Data modeling

Information Visualization

Ethics, Privacy, and Reproducibility

Feel free to tell me if there are topics that you think we should
cover…

15
PREREQUISITE
KNOWLEDGE
Aimed at folks with some CS knowledge – but likely
accessible to others with programming experience and
mathematical maturity.
We do not assume:
• Experience with Python, pandas, scikit-learn, matplotlib, etc …
• Deep statistics or any ML knowledge
• Database or distributed systems knowledge
We do assume: You want to be here!

16
WHO AM I?

http://jpdickerson.com

17
WHO ARE YOU?
STAT400?
CMSC422?
CMSC424?

Register on Piazza: piazza.com/umd/fall2018/cmsc641

18
(TENTATIVE) COURSE
STRUCTURE
First 2 lectures: intro & primers in the Python data
science stack
Next 3 lectures: data collection & management
• Best practices, data wrangling, exploratory analysis,
ethics, debugging, visualization, etc …
Next 4-5 lectures: statistical modeling & ML
• Statistical learning, regression, classification, cross-
validation, model evaluation, hypothesis testing, etc …
Midterm
Final 4.5 lectures: advanced topics (ambitious …)
• Dimensionality reduction, distributed learning, big data,
distributed computation
• Either group presentations or more lectures

19
GRADE #1: MINI-PROJECTS
Students will complete four mini-project assignments:
• Case studies meant to mimic what you, a future data scientist,
will see in industry. They should be fun ☺.

The rules:
• Allowed: small group discussions
• Required: individual programming & writing
• Never allowed: public posting of solutions

Deliverable:
• Turn in an .ipynb of a Jupyter notebook on ELMS

20
GRADE #2: READING
HOMEWORKS
We will post (bi)weekly reading assignments. Mix of:
• Blog posts
• Academic articles
• News articles
Weekly quiz to be taken on ELMS covering the readings
Individual quiz grades are pass/fail:
• At least 60% correct → Pass
• Less than 60% correct → Fail
Must take at least ten of these quizzes over the semester

21
GRADE #3: MINI-TUTORIAL
In lieu of a final exam, you’ll create a mini-tutorial that:
• Identifies a raw data source
• Processes and stores that data
• Performs exploratory data analysis & visualization
• Derives insight(s) using statistics and ML
• Communicates those insights as actionable text
Individual or group project

Will be hosted publicly online (GitHub Pages) and will

strengthen your portfolio.

23
READY-MADE DATASET
REPOSITORIES
https://www.data.gov/
• US-centric agriculture, climate, education, energy, finance, health,
manufacturing data, …
https://cloud.google.com/bigquery/public-data/
• BigQuery (Google Cloud) public datasets (bikeshare, GitHub,
Hacker News, Form 990 non-profits, NOAA, …)
https://www.kaggle.com/datasets
• Google-owned, various (Billboard Top 100 lyrics, credit card
fraud, crime in Chicago, global terrorism, world happiness, …)
https://aws.amazon.com/public-datasets/
• AWS-hosted, various (NASA, a bunch of genome stuff, Google
Books n-grams, Multimedia Commons, …)

24
NEW DATASET IDEAS
Fraternal Order of Police vs Black Lives Matter
Linking finance data to ${anything_else}
Something having to do with Pokémon statistics?
Look through http://www.alexa.com/topsites and scrape
something interesting!
University of Maryland-related, or College Park-related, stuff
• Check out http://umd.io/ – open source project; maybe your
data collection and cleaning scripts can be added to this!
Honestly, pretty much anything! Just document everything.

Reproducibility!

25
FINAL TUTORIAL
Deliverable: URL of your own GitHub Pages site hosting an
.ipynb/.html export of your final tutorial
• https://pages.github.com/ – make a GitHub account, too!
• https://github.com/blog/1995-github-jupyter-notebooks-3
The project itself:
• ~1500+ words of Markdown prose
• ~150+ lines of Python
• Should be viewable as a static webpage – that is, if I (or
anyone else) opens the link up, everything should render and I
shouldn’t have to run any cells to generate output

26
FINAL TUTORIAL RUBRIC
The TAs and I will grade on a scale of 1-10:
Motivation: Does the tutorial make the reader believe the topic is
important (a) in general and (b) with respect to data science?
Understanding: After reading the tutorial, does the reader
understand the topic?
Further resources: Does the tutorial “call out” to other resources
that would help the reader understand basic concepts, deep dive,
related work, etc?
Prose: Does the prose in the Markdown portion of the .ipynb add to
the reader’s understanding of the tutorial?
Code: Does the code help solidify understanding, is it well
documented, and does it include helpful examples?
Subjective Evaluation: If somebody linked to this tutorial from
Hacker News, would people actually read the whole thing?

27
Thanks to: Zico Kolter
GRADE #☺: CLASS
PARTICIPATION
Please please please please please do the required reading,
if available, before coming to class!

Earn full credit via:

• Lecture participation
• Piazza participation
• Regular attendance at office hours (can be just to chat!)
Please I’m so lonely …
And so are the TAs …

Aim to ask/answer a question at least once every two weeks;

or attend office hours at least once a month

28
GRADE BREAKDOWN
60% mini-projects:
• There are 3 of them
• Equal weighting @ 20% each

10% reading homeworks

30% final tutorial

(+5% course participation)

Mini-Projects + Reading + Tutorial = Your Grade = A+

29
SOME TECHNOLOGIES
WE WILL USE

30
(Don’t tell CMSC330 …)

31
IMPORTANT WALLS OF TEXT

32
Common
Sense!
ANTI-HARASSMENT
(Adapted from ACM SIGCOMM’s policies)

The open exchange of ideas and the freedom of thought and

expression are central to our aims and goals. These require an
environment that recognizes the inherent worth of every person
and group, that fosters dignity, understanding, and mutual
respect, and that embraces diversity. For these reasons, we are
dedicated to providing a harassment-free experience for
participants in (and out) of this class.

Harassment is unwelcome or hostile behavior, including speech

that intimidates, creates discomfort, or interferes with a person's
participation or opportunity for participation, in a conference, event
or program.

33
Common
Sense!
ACADEMIC INTEGRITY
(Text unironically stolen from Hal Daumé III)

Any assignment or exam that is handed in must be your own work (unless
otherwise stated). However, talking with one another to understand the
material better is strongly encouraged. Recognizing the distinction between
cheating and cooperation is very important. If you copy someone else's
solution, you are cheating. If you let someone else copy your solution, you
are cheating (this includes posting solutions online in a public place). If
someone dictates a solution to you, you are cheating.

Everything you hand in must be in your own words, and based on your own
understanding of the solution. If someone helps you understand the problem
during a high-level discussion, you are not cheating. We strongly encourage
students to help one another understand the material presented in class, in
the book, and general issues relevant to the assignments. When taking an
exam, you must work independently. Any collaboration during an exam will be
considered cheating. Any student who is caught cheating will be given an F in
the course and referred to the University Office of Student Conduct. Please
don't take that chance – if you're having trouble understanding the material,
please let me know and I will be more than happy to help.

34
(A FEW) DATA SCIENCE SUCCESS
STORIES & CAUTIONARY TALES

35
POLLING: 2008 & 2012
Nate Silver uses a simple idea – taking a principled approach
to aggregating polling instead of relying on punditry – and:
• Predicts 49/50 states in 2008
• Predicts 50/50 states in 2012

• (He is also a great case study in creating a brand.)
https://hbr.org/2012/11/how-nate-silver-won-the-2012-p

36
Democrat (+) or Republican (-) in 2012
POLLING: 2016

HuffPo: “He may end up being

right, but he’s just guessing. A “trend
line adjustment” is merely political
punditry dressed up as
sophisticated mathematical
modeling.”
538: Offers quantitative reasoning
for re-/under-weighting older polls,
& changing as election approaches

37
http://www.huffingtonpost.com/entry/nate-silver-election-forecast_us_581e1c33e4b0d9ce6fbc6f7f
https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
AD TARGETING
Pregnancy is an expensive & habit-forming time
• Thus, valuable to consumer-facing firms
2012:
• Target identifies 25 products and subsets thereof that are
commonly bought in early pregnancy
• Uses purchase history of patrons to predict pregnancy,
targets advertising for post-natal products (cribs, etc)
• Good: increased revenue
• Bad: this can expose pregnancies – as famously
happened in Minneapolis to a high schooler

38
http://www.businessinsider.com/the-incredible-story-of-how-target-exposed-a-teen-girls-pregnancy-2012-2
AUTOMATED DECISIONS OF
CONSEQUENCE
[Sweeney 2013, Miller 2015, Byrnes 2016,
Rudin 2013, Barry-Jester et al. 2015]

Hiring · Policing/sentencing · Lending

Search for minority names → ads for DUI/arrest records

Female cookies → less freq. shown professional job opening ads

39
“… a lot remains unknown about how big data-driven
decisions may or may not use factors that are proxies
for race, sex, or other traits that U.S. laws generally
prohibit from being used in a wide range of commercial
decisions … What can be done to make sure these
products and services–and the companies that use
them–treat consumers fairly and ethically?”

- FTC Commissioner Julie Brill [2015]

40
OLYMPIC MEDALS

41
https://www.nytimes.com/interactive/2016/08/08/sports/olympics/history-olympic-dominance-charts.html
NETFLIX PRIZE I
Recommender systems: predict a user’s rating of an item

         Twilight   Wall-E   Twilight II   Furious 7
User 1      +1        -1         +1            ?
User 2      +1        -1          ?            ?
User 3      -1        +1         -1           +1

Netflix Prize: $1MM to the first team that beats our in-house
engine by 10%
• Happened after about three years
• Model was never used by Netflix for a variety of reasons
• Out of date (DVDs vs streaming)
• Too complicated / not interpretable

42
NETFLIX PRIZE II
Latent factors model:
Identify factors with max discrimination between movies

[Figure: movies plotted along latent factor 1 and latent factor 2; example factor labels: Critically-Acclaimed/Strong Female Lead, Frat/Gross-Out Comedy, Artsy. Image courtesy of Christopher Volinsky]

43
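The latent factors idea can be tried on the toy ratings table from the previous slide. What follows is only a minimal sketch, assuming a single latent factor fitted by plain stochastic gradient descent in pure Python; the function names, learning rate, and regularization strength are illustrative choices, and this is not the method used by Netflix or by the prize-winning teams.

```python
import random

# Toy ratings from the slide; None marks a rating we want to predict.
movies = ["Twilight", "Wall-E", "Twilight II", "Furious 7"]
ratings = [
    [+1, -1, +1, None],    # User 1
    [+1, -1, None, None],  # User 2
    [-1, +1, -1, +1],      # User 3
]

def factorize(ratings, k=1, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Fit k-dimensional user and movie factors by stochastic gradient descent."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    known = [(u, i, r) for u, row in enumerate(ratings)
             for i, r in enumerate(row) if r is not None]
    for _ in range(steps):
        for u, i, r in known:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def predict(U, V, user, item):
    """Predicted rating = dot product of the user and movie factor vectors."""
    return sum(a * b for a, b in zip(U[user], V[item]))

U, V = factorize(ratings)
for user, row in enumerate(ratings):
    for item, r in enumerate(row):
        if r is None:
            print("User", user + 1, movies[item], round(predict(U, V, user, item), 2))
```

Because Users 1 and 2 agree with each other and disagree with User 3 on every shared movie, one factor is enough to separate the two taste groups in this tiny example; real rating data needs many more factors and far more care.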
NETFLIX PRIZE III
Netflix initially planned a follow-up competition
In 2007, researchers at UT Austin managed to deanonymize portions of the
original released (anonymized) Netflix dataset:
• ????????????
• Matched rating against those made publicly on IMDb
Why could this be bad?
2009—2010, four Netflix users filed a class-action lawsuit
against Netflix over

44
MONEYBALL
Baseball teams drafted rookie
players primarily based on
human scouts’ opinions of
their talents
Peter Brand, data scientist du
jour, convinces the {bad,
poor} Oakland Athletics to
use a quantitative aka
sabermetric approach to
hiring

(Spoiler: Red Sox offer Brand a job,

he says no, they take a sabermetric
approach and win the World Series.)

45
46
http://www.businessinsider.com/best-jobs-in-america-in-2017-2017-1/
WRAP-UP FOR PART I
Register on Piazza using your UMD address:
piazza.com/umd/fall2018/cmsc641

Please chat with me if you’re unsure of whether or not you’re at

the right {programming, math} level for this course:
• My guess is that you are!
• This is a young class, so we’re quite flexible

Read about Docker & Jupyter!

• Works on *nix, OSX, Windows
• https://www.docker.com/
• (We’ll post a small project shortly.)

47
AFTER THE BREAK:
SCRAPING DATA WITH PYTHON

48
THE DATA LIFECYCLE

Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & policy decision

49
TODAY’S LECTURE

Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & policy decision

50
BUT FIRST, SNAKES!
Python is an interpreted, dynamically-typed, high-level,
garbage-collected, object-oriented-functional-imperative, and
widely used scripting language.
• Interpreted: instructions executed without being compiled into
(virtual) machine instructions*
• Dynamically-typed: verifies type safety at runtime
• High-level: abstracted away from the raw metal and kernel
• Garbage-collected: memory management is automated
• OOFI: you can do bits of OO, F, and I programming
Not the point of this class!
• Python is fast (developer time), intuitive, and used in industry!

51
*you can compile Python source, but it’s not required
THE ZEN OF PYTHON
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Flat is better than nested.
• Sparse is better than dense.
• Readability counts.
• Special cases aren't special enough to break the rules …
• … although practicality beats purity.
• Errors should never pass silently …
• … unless explicitly silenced.

52
Thanks: SDSMT ACM/LUG
LITERATE
PROGRAMMING
Literate code contains in one document:
• the source code;
• text explanation of the code; and
• the end result of running the code.
Basic idea: present code in the order that logic and flow of
human thoughts demand, not the machine-needed ordering
• Necessary for data science!
• Many choices made need textual explanation, ditto results.
Stuff you’ll be using in Project 0 (and beyond)!

53
10-MINUTE PYTHON
PRIMER
Define a function:

def my_func(x, y):
    if x > y:
        return x
    else:
        return y

Python is whitespace-delimited
Define a function that returns a tuple:

def my_func(x, y):
    return (x-1, y+2)

(a, b) = my_func(1, 2)   # a = 0; b = 4

54
USEFUL BUILT-IN FUNCTIONS:
COUNTING AND ITERATING
len: returns the number of items in a sequence or collection
len( ['c', 'm', 's', 'c', 3, 2, 0] )

7
range: returns an iterable object
list( range(10) )

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
enumerate: returns an iterator of (index, element) tuples over a list
list( enumerate( ["311", "320", "330"] ) )

[(0, "311"), (1, "320"), (2, "330")]

https://docs.python.org/3/library/functions.html

55
USEFUL BUILT-IN FUNCTIONS:
MAP AND FILTER
map: apply a function to a sequence or iterable

arr = [1, 2, 3, 4, 5]
list( map(lambda x: x**2, arr) )   # map is lazy in Python 3; list() forces it

[1, 4, 9, 16, 25]

filter: returns an iterator over the elements for which a predicate is true

arr = [1, 2, 3, 4, 5, 6, 7]
list( filter(lambda x: x % 2 == 0, arr) )

[2, 4, 6]

We’ll go over in much greater depth with pandas/numpy.

56
PYTHONIC
PROGRAMMING
Basic iteration over an array in Java:
int[] arr = new int[10];
for(int idx=0; idx<arr.length; ++idx) {
    System.out.println( arr[idx] );
}

Direct translation into Python:

idx = 0
while idx < len(arr):
    print( arr[idx] )
    idx += 1

A more “Pythonic” way of iterating:

for element in arr:
    print( element )
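When the loop body also needs the index, or needs to walk two lists together, the built-ins enumerate and zip keep the loop Pythonic. A small sketch follows; the list contents are made-up illustration data.

```python
arr = ["c", "m", "s", "c"]
counts = [3, 2, 0, 1]   # made-up values, one per letter

# enumerate yields (index, element) pairs, so no manual counter is needed.
indexed = []
for idx, element in enumerate(arr):
    indexed.append(str(idx) + ":" + element)

# zip walks two sequences in lockstep and stops at the shorter one.
paired = []
for letter, n in zip(arr, counts):
    paired.append((letter, n))

print(indexed)
print(paired)
```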

57
LIST COMPREHENSIONS
Construct sets like a mathematician!
• P = { 1, 2, 4, 8, 16, …, 2^16 }
• E = { x | x in ℕ and x is odd and x < 1000 }
Construct lists like a mathematician who codes!

P = [ 2**x for x in range(17) ]

E = [ x for x in range(1000) if x % 2 != 0 ]

Very similar to map, but:

• You’ll see these way more than map in the wild
• Many people consider map/filter not “pythonic”
• They can perform differently (map is “lazier”)
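The "lazier" remark can be seen directly in Python 3, where map returns a one-shot iterator while a list comprehension builds its list immediately. A short sketch:

```python
arr = [1, 2, 3, 4, 5]

squares_lazy = map(lambda x: x**2, arr)   # nothing is computed yet
squares_list = [x**2 for x in arr]        # computed right away

first_pass = list(squares_lazy)    # consumes the iterator
second_pass = list(squares_lazy)   # the iterator is now exhausted

# A comprehension can map and filter in a single expression.
even_squares = [x**2 for x in arr if x % 2 == 0]

print(first_pass, second_pass, even_squares)
```

Exhausting the map object on the first pass is a common surprise, and one practical reason comprehensions are often preferred when the result will be reused.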

58
EXCEPTIONS
Syntactically correct statement throws an exception:
• tweepy (Python Twitter API) returns “Rate limit exceeded”
• sqlite (a file-based database) returns IntegrityError

from platform import python_version

print('Python', python_version())

try:
    cause_a_NameError
except NameError as err:
    print(err, '-> some extra text')
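The same try/except pattern extends with else (runs only when nothing was raised) and finally (always runs). A minimal sketch, using a hypothetical helper named safe_divide:

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero (hypothetical helper for illustration)."""
    try:
        result = a / b
    except ZeroDivisionError as err:
        print("caught:", err)
        return None
    else:
        # Only reached when the try block raised nothing.
        return result
    finally:
        # Always runs, even when one of the branches above returns.
        print("finished dividing", a, "by", b)

print(safe_divide(1, 2))
print(safe_divide(1, 0))
```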

59
PYTHON 2 VS 3
Python 3 is intentionally backwards incompatible
• (But not that incompatible)
Biggest changes that matter for us:
• print “statement” → print(“function”)
• 1/2 = 0 → 1/2 = 0.5 and 1//2 = 0
• ASCII str default → default Unicode
Namespace ambiguity fixed:
i = 1
[i for i in range(5)]
print(i) # ????????
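The namespace question can be checked by running the snippet. In Python 3 the comprehension's loop variable lives in its own scope, so the outer i is untouched; in Python 2 it leaked and overwrote the outer variable. A runnable Python 3 version:

```python
i = 1
squares = [i * i for i in range(5)]   # this i is local to the comprehension

# Python 3 prints 1 here; Python 2 would have printed 4,
# because the comprehension variable leaked into the enclosing scope.
print(i)
```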

60
TO ANY CURMUDGEONS …
If you’re going to use Python 2 anyway, use the __future__
module:
• Python 3 introduces features that will throw runtime errors in
Python 2 (e.g., with statements)
• __future__ module incrementally brings 3 functionality into 2
• https://docs.python.org/2/library/__future__.html

from __future__ import division

from __future__ import print_function
from __future__ import please_just_use_python_3   # (a joke, not a real feature)

61
PYTHON VS R (FOR
DATA SCIENTISTS)
There is no right answer here!
• Python is a “full”
programming language –
easier to integrate with
systems in the field
• R has a more mature set of
pure stats libraries …
• … but Python is catching up
quickly …
• … and is already ahead
specifically for ML.
You will see Python more in the
tech industry.

62
EXTRA RESOURCES
Plenty of tutorials on the web:
• https://www.learnpython.org/

Work through Project 0, which will take you through some

baby steps with Python and the Pandas library:
• (We’ll also post some readings soon.)

Come hang out at office hours (or chat with me privately)

• Office hours will be on the website/Piazza very soon.
• Also, email me – I realize your schedules are not like
undergrads’ schedules ☺.

63
64
TODAY’S LECTURE

Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & policy decision

65
Thanks: Zico Kolter’s 15-388
GOTTA CATCH ‘EM ALL
Five ways to get data:
• Direct download and load from local storage
• Generate locally via downloaded code (e.g., simulation)
• Query data from a database (covered in a few lectures)
• Query an API from the intra/internet (covered today)
• Scrape data from a webpage (covered today)

66
WHEREFORE ART
THOU, API?
A web-based Application Programming Interface (API) like
we’ll be using in this class is a contract between a server and
a user stating:

“If you send me a specific request, I will return some

information in a structured and documented format.”

(More generally, APIs can also perform actions, may not be
web-based, and can be a set of protocols for communicating between
processes, between an application and an OS, etc.)

67
“SEND ME A SPECIFIC
REQUEST”
Most web API queries we’ll be doing will use HTTP requests:
• conda install -c anaconda requests=2.12.4
import requests

r = requests.get( 'https://api.github.com/user',
                  auth=('user', 'pass') )

r.status_code

200

r.headers['content-type']

'application/json; charset=utf8'

r.json()

{u'private_gists': 419, u'total_private_repos': 77, ...}

68
http://docs.python-requests.org/en/master/
HTTP REQUESTS
https://www.google.com/?q=cmsc320&tbs=qdr:m

HTTP GET Request:

GET /?q=cmsc320&tbs=qdr:m HTTP/1.1
Host: www.google.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1

params = { "q": "cmsc320", "tbs": "qdr:m" }

r = requests.get( "https://www.google.com",
                  params = params )

*be careful with https:// calls; requests verifies SSL certificates by default, so avoid switching that off with verify=False
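To see how a params dictionary becomes the query string in the GET line, the standard library can build and take apart the URL without sending anything over the network. This sketch uses urllib.parse rather than requests; passing safe=":" is a choice made here so that the colon stays readable, as on the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"q": "cmsc320", "tbs": "qdr:m"}

# Encode the dictionary as key=value pairs joined by "&".
query = urlencode(params, safe=":")
url = "https://www.google.com/?" + query
print(url)

# Going the other way: split a URL and decode its query string.
parts = urlsplit(url)
decoded = parse_qs(parts.query)
print(parts.netloc, parts.path, decoded)
```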

69
RESTFUL APIS
This class will just query web APIs, but full web APIs typically
allow more.
Representational State Transfer (RESTful) APIs:
• GET: perform query, return data
• POST: create a new entry or object
• PUT: update an existing entry or object
• DELETE: delete an existing entry or object
Can be more intricate, but verbs (“put”) align with actions
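The way the verbs line up with actions can be sketched with a toy in-memory store. ToyStore and its handle method are hypothetical names invented for this illustration; no real web framework or server is involved, and the status codes are simply the conventional HTTP ones.

```python
class ToyStore:
    """In-memory stand-in for a REST resource collection (illustration only)."""

    def __init__(self):
        self.items = {}
        self.next_id = 1

    def handle(self, method, item_id=None, body=None):
        """Return a (status_code, payload) pair for one request."""
        if method == "POST":                  # create a new entry
            item_id = self.next_id
            self.next_id += 1
            self.items[item_id] = body
            return 201, {"id": item_id, "value": body}
        if item_id not in self.items:         # every other verb needs an existing entry
            return 404, None
        if method == "GET":                   # perform query, return data
            return 200, {"id": item_id, "value": self.items[item_id]}
        if method == "PUT":                   # update an existing entry
            self.items[item_id] = body
            return 200, {"id": item_id, "value": body}
        if method == "DELETE":                # delete an existing entry
            del self.items[item_id]
            return 204, None
        return 405, None                      # verb not supported

store = ToyStore()
print(store.handle("POST", body="first draft"))
print(store.handle("PUT", item_id=1, body="second draft"))
print(store.handle("GET", item_id=1))
print(store.handle("DELETE", item_id=1))
print(store.handle("GET", item_id=1))
```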

70
QUERYING A RESTFUL API
Stateless: with every request, you send along a
token/authentication of who you are

token = "super_secret_token"
# GitHub no longer accepts tokens in the query string; send it in a header
r = requests.get("https://api.github.com/user",
                 headers={"Authorization": "token " + token})
print( r.content )

{"login":"JohnDickerson","id":472985,"avatar_url":"ht…

GitHub is more than a GETHub:

• PUT/POST/DELETE can edit your repositories, etc.
• Try it out: https://github.com/settings/tokens/new

71
AUTHENTICATION
AND OAUTH
Old and busted:

r = requests.get("https://api.github.com/user",
                 auth=("JohnDickerson", "ILoveKittens"))

New hotness:
• What if I wanted to grant an app access to, e.g., my Facebook
account without giving that app my password?
• OAuth: grants access tokens that give (possibly incomplete)
access to a user or app without exposing a password
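Once an OAuth flow has issued an access token, the client typically sends it in an Authorization header using the Bearer scheme instead of sending a password. The sketch below only builds the request object with the standard library and never touches the network; the URL and token are placeholders, and build_authorized_request is a hypothetical helper name.

```python
from urllib.request import Request

def build_authorized_request(url, access_token):
    """Attach an OAuth-style bearer token to a GET request (nothing is sent)."""
    return Request(url, headers={"Authorization": "Bearer " + access_token})

# Placeholder URL and token, for illustration only.
req = build_authorized_request("https://api.example.com/me", "hypothetical-token")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

The password never appears in the request, and the token can be scoped or revoked without changing the account's credentials.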

72
“… I WILL RETURN INFORMATION
IN A STRUCTURED FORMAT.”
So we’ve queried a server using a well-formed GET request
via the requests Python module. What comes back?
General structured data:
• Comma-Separated Value (CSV) files & strings
• Javascript Object Notation (JSON) files & strings
• HTML, XHTML, XML files & strings
Domain-specific structured data:
• Shapefiles: geospatial vector data (OpenStreetMap)
• RVT files: architectural planning (Autodesk Revit)
• You can make up your own! Always document it.

73
CSV FILES IN PYTHON
Any CSV reader worth anything can parse files with any
delimiter, not just a comma (e.g., “TSV” for tab-separated)
1,26-Jan,Introduction,—,"pdf, pptx",Dickerson,
2,31-Jan,Scraping Data with Python,Anaconda's Test Drive.,,Dickerson,
3,2-Feb,"Vectors, Matrices, and Dataframes",Introduction to pandas.,,Dickerson,
4,7-Feb,Jupyter notebook lab,,,"Denis, Anant, & Neil",
5,9-Feb,Best Practices for Data Science Projects,,,Dickerson,

Don’t write your own CSV or JSON parser

import csv
with open("schedule.csv", newline="") as f:   # text mode; csv handles newlines itself
    reader = csv.reader(f, delimiter=",", quotechar='"')
    for row in reader:
        print(row)

(We’ll use pandas to do this much more easily and efficiently)
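The advice against hand-rolled parsers is easy to demonstrate on two of the schedule rows above, whose quoted fields contain commas. This sketch reads the rows from an in-memory string, so no schedule.csv file is needed.

```python
import csv
import io

raw = (
    '3,2-Feb,"Vectors, Matrices, and Dataframes",Introduction to pandas.,,Dickerson,\n'
    '4,7-Feb,Jupyter notebook lab,,,"Denis, Anant, & Neil",\n'
)

# csv.reader respects the quotes, so embedded commas stay inside one field.
rows = list(csv.reader(io.StringIO(raw)))

# A naive split breaks the quoted title into several pieces.
naive = raw.splitlines()[0].split(",")

print(len(rows[0]), rows[0][2])
print(len(naive), naive[2])
```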

74
JSON FILES & STRINGS
JSON is a method for serializing objects:
• Convert an object into a string (done in Java in 131/132?)
• Deserialization converts a string back to an object
Easy for humans to read (and sanity check, edit)
Defined by three universal data structures:
• Object: Python dictionary, Java Map, hash table, etc …
• Array: Python list, Java array, vector, etc …
• Value: Python string, float, int, boolean, JSON object, JSON array, …

75
Images from: http://www.json.org/
JSON IN PYTHON
Some built-in types: "Strings", 1.0, True, False, None
Lists: ["Goodbye", "Cruel", "World"]
Dictionaries: {"hello": "bonjour", "goodbye": "au revoir"}
Dictionaries within lists within dictionaries within lists:
[1, 2, {"Help": [
    "I'm", {"trapped": "in"},
    "CMSC641"
]}]

76
JSON FROM TWITTER
GET https://api.twitter.com/1.1/friends/list.json?cursor=-1&screen_name=twitterapi&skip_status=true&include_user_entities=false

{
  "previous_cursor": 0,
  "previous_cursor_str": "0",
  "next_cursor": 1333504313713126852,
  "users": [{
    "profile_sidebar_fill_color": "252429",
    "profile_sidebar_border_color": "181A1E",
    "profile_background_tile": false,
    "name": "Sylvain Carle",
    "profile_image_url": "http://a0.twimg.com/profile_images/2838630046/4b82e286a659fae310012520f4f756bb_normal.png",
    "created_at": "Thu Jan 18 00:10:45 +0000 2007", …

77
PARSING JSON IN
PYTHON
Repeat: don’t write your own CSV or JSON parser
• https://news.ycombinator.com/item?id=7796268
• rsdy.github.io/posts/dont_write_your_json_parser_plz.html
Python comes with a fine JSON parser
import json
import requests

r = requests.get(
    "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=JohnPDickerson&count=100",
    auth=auth )   # auth: your API credentials object, defined elsewhere

data = json.loads(r.content)

json.load(some_file)             # loads JSON from a file
json.dump(json_obj, some_file)   # writes JSON to file
json.dumps(json_obj)             # returns JSON string
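The same calls can be tried offline on a string. The sketch below parses a trimmed, hand-shortened version of the Twitter response shown earlier and then walks the resulting dictionaries and lists; no request is made.

```python
import json

# Hand-trimmed sample modeled on the Twitter response above.
text = """
{
  "next_cursor": 1333504313713126852,
  "users": [
    {"name": "Sylvain Carle", "profile_background_tile": false}
  ]
}
"""

data = json.loads(text)             # JSON object -> Python dict
first_user = data["users"][0]       # JSON array  -> Python list
print(first_user["name"], first_user["profile_background_tile"])

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))
print(round_trip == data)
```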

78
XML, XHTML, HTML
FILES AND STRINGS
Still hugely popular online, but JSON has essentially
replaced XML for:
• Asynchronous browser ↔ server calls
• Many (most?) newer web APIs
XML is a hierarchical markup language:
<tag attribute="value1">
    <subtag>
        Some cool words or values go here!
    </subtag>
    <openclosetag attribute="value2" />
</tag>
You probably won’t see much XML, but you will see plenty of
HTML, its substantially less well-behaved cousin …
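Well-formed XML like the snippet above can be read with the standard library's xml.etree.ElementTree. A minimal sketch on the same toy document:

```python
import xml.etree.ElementTree as ET

doc = """<tag attribute="value1">
    <subtag>Some cool words or values go here!</subtag>
    <openclosetag attribute="value2" />
</tag>"""

root = ET.fromstring(doc)

# Each element exposes its tag name, attribute dictionary, text, and children.
print(root.tag, root.attrib)
print(root.find("subtag").text)
child_tags = [child.tag for child in root]
print(child_tags)
```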

79
DOCUMENT OBJECT
MODEL (DOM)

80
SCRAPING HTML IN
PYTHON
HTML – the specification – is fairly pure
HTML – what you find on the web – is horrifying
We’ll use BeautifulSoup:
• conda install -c asmeurer beautiful-soup=4.3.2

import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://cs.umd.edu/class/spring2017/cmsc320/" )

root = BeautifulSoup( r.content, "html.parser" )

# find all schedule links for CMSC320
( root.find("div", id="schedule")
      .find("table")
      .find("tbody").findAll("a") )
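For experimenting without installing anything or hitting the network, the standard library's html.parser can collect links from an HTML string. This is a swapped-in technique rather than BeautifulSoup: LinkCollector is a hypothetical class written for this sketch, the page string is made up, and unlike the chained find calls above it gathers every anchor on the page rather than only those inside the schedule table.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

# Made-up page fragment shaped like the schedule table.
page = (
    '<div id="schedule"><table><tbody><tr>'
    '<td><a href="lec01.pdf">Intro</a></td>'
    '<td><a href="https://www.python.org/">Python</a></td>'
    '</tr></tbody></table></div>'
)

collector = LinkCollector()
collector.feed(page)
print(collector.hrefs)
```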

81
BUILDING A WEB
SCRAPER IN PYTHON
Totally not hypothetical situation:
• You really want to learn about data science, so you choose to
download all of last semester’s CMSC320 lecture slides to
wallpaper your room …
• … but you now have carpal tunnel syndrome from clicking
refresh on Piazza last night, and can no longer click on the
PDF and PPTX links.
Hopeless? No! Earlier, you built a scraper to do this!
# find all schedule links for CMSC320
lnks = ( root.find("div", id="schedule")
             .find("table")
             .find("tbody").findAll("a") )

Sort of. You only want PDF and PPTX files, not links to other
websites or files.

82
REGULAR
EXPRESSIONS
Given a list of URLs (strings), how do I find only those strings
that end in *.pdf or *.pptx?
• Regular expressions!
• (Actually Python strings come with a built-in endswith
function.)
"this_is_a_filename.pdf".endswith((".pdf", ".pptx"))

What about .pDf or .pPTx, still legal extensions for PDF/PPTX?

• Regular expressions!
• (Or cheat the system again: built-in string lower function.)

"tHiS_IS_a_FileNAme.pDF".lower().endswith(
    (".pdf", ".pptx"))
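Both approaches can be compared on a small list of links. The URLs below are made up for illustration; the regular expression uses re.IGNORECASE to cover mixed-case extensions and $ to anchor the match at the end of the string.

```python
import re

# Made-up links for illustration.
urls = [
    "slides/lec01.pdf",
    "slides/LEC02.pPTx",
    "https://www.python.org/",
    "notes.pdf.bak",
]

# String-method version: lowercase first, then test the suffix.
by_endswith = [u for u in urls if u.lower().endswith((".pdf", ".pptx"))]

# Regex version: a literal dot, either extension, then end of string.
pattern = re.compile(r"\.(pdf|pptx)$", re.IGNORECASE)
by_regex = [u for u in urls if pattern.search(u)]

print(by_endswith)
print(by_regex)
```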

83
84
REGULAR EXPRESSIONS
Used to search for specific elements, or groups of elements,
that match a pattern
import re

# text: the string being searched, defined elsewhere

# Find the index of the 1st occurrence of "cmsc641"
match = re.search(r"cmsc641", text)
print( match.start() )   # match is None if nothing was found

# Does start of text match "cmsc641"?
match = re.match(r"cmsc641", text)

# Iterate over all matches for "cmsc641" in text
for match in re.finditer(r"cmsc641", text):
    print( match.start() )

# Return all matches of "cmsc641" in the text
match = re.findall(r"cmsc641", text)
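A runnable version on a concrete string makes the difference between the four functions visible. The sample sentence is made up for illustration.

```python
import re

text = "cmsc641 meets Wednesdays; register for cmsc641 on Piazza"

first = re.search(r"cmsc641", text)        # first match anywhere in the text
at_start = re.match(r"cmsc641", text)      # only matches at the very beginning
not_at_start = re.match(r"Piazza", text)   # None: "Piazza" is not at the start

starts = [m.start() for m in re.finditer(r"cmsc641", text)]
found = re.findall(r"cmsc641", text)       # list of the matched strings

print(first.start(), at_start is not None, not_at_start)
print(starts, found)
```

Note that re.search and re.match return None when nothing matches, so real code should check the result before calling .start().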

85
MATCHING MULTIPLE
CHARACTERS
Can match sets of characters, or multiple and more elaborate
sets and sequences of characters:
• Match the character ‘a’: a
• Match the character ‘a’, ‘b’, or ‘c’: [abc]
• Match any character except ‘a’, ‘b’, or ‘c’: [^abc]
• Match any digit: \d (= [0123456789] or [0-9])
• Match any alphanumeric: \w (= [a-zA-Z0-9_])
• Match any whitespace: \s (= [ \t\n\r\f\v])
• Match any character: .
Special characters must be escaped: .^$*+?{}\[]|()
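A few of these classes applied to a made-up sample string, including the difference between an escaped and an unescaped dot:

```python
import re

sample = "CMSC641 meets 7:00pm-9:30pm, room_3!"

digits = re.findall(r"\d", sample)        # every single digit
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
not_word = re.findall(r"[^\w\s]", sample) # anything that is neither \w nor whitespace

# "." matches any character; "\." matches only a literal dot.
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")

print(digits)
print(words)
print(not_word)
print(any_char, literal_dot)
```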

86
MATCHING SEQUENCES AND
REPEATED CHARACTERS
A few common modifiers (available in Python and most other
high-level languages; +, {n}, {n,} may not):
• Match character ‘a’ exactly once: a
• Match character ‘a’ zero or one time: a?
• Match character ‘a’ zero or more times: a*
• Match character ‘a’ one or more times: a+
• Match character ‘a’ exactly n times: a{n}
• Match character ‘a’ at least n times: a{n,}
Example: match all instances of “University of <somewhere>” where
<somewhere> is an alphanumeric string with at least 3 characters:
• \s*University\sof\s\w{3,}
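
The example pattern above can be tried on a made-up sentence (the text and names below are assumptions for illustration). Because the pattern starts with \s*, each match may include leading whitespace, so the sketch strips it.

```python
import re

# Made-up text; "University of X" should NOT match (fewer than 3 characters)
text = ("She studied at the University of Maryland, then joined the "
        "University of Chicago. The University of X is made up.")

pattern = r"\s*University\sof\s\w{3,}"
found = re.findall(pattern, text)
names = [m.strip() for m in found]   # drop the whitespace matched by \s*

# The basic quantifiers, checked against whole strings with fullmatch
zero_or_one = re.fullmatch(r"a?", "") is not None       # zero 'a's is allowed
zero_or_more = re.fullmatch(r"a*", "") is not None      # zero 'a's is allowed
one_or_more = re.fullmatch(r"a+", "") is not None       # needs at least one 'a'
exactly_two = re.fullmatch(r"a{2}", "aa") is not None
at_least_two = re.fullmatch(r"a{2,}", "aaaa") is not None

print(names)
```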

COMPILED REGEXES
If you’re going to reuse the same regex many times, or if you
aren’t but things are going slowly for some reason, try
compiling the regular expression.
• https://blog.codinghorror.com/to-compile-or-not-to-compile/

# Compile the regular expression "cmsc320"

regex = re.compile(r"cmsc320")

# Use it repeatedly to search for matches in text

regex.match( text )    # does start of text match?
regex.search( text )   # find the first match or None
regex.findall( text )  # find all matches

Interested? CMSC6*, CMSC7*, CMSC8*, talk to me.
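
A minimal sketch of reusing one compiled pattern across many strings. The pattern, the sample lines, and the names are made up for illustration. Python's re module also caches recently used patterns internally, so the speedup from compiling by hand is often modest; the main gain is usually readability.

```python
import re

# Compile once: "cmsc" followed by exactly three digits, ignoring case
course_re = re.compile(r"cmsc\d{3}", re.IGNORECASE)

# Made-up input lines
lines = [
    "CMSC320 is the undergraduate version",
    "no course number here",
    "see cmsc641 and cmsc643",
]

# Reuse the same compiled object for every line
hits = [course_re.findall(line) for line in lines]

starts_with_course = course_re.match(lines[0]) is not None   # at position 0?
first_in_last = course_re.search(lines[2]).group()           # first match text

print(hits, starts_with_course, first_in_last)
```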

DOWNLOADING A BUNCH OF FILES
Import the modules

import re
import requests
from os import path
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

Get some HTML via HTTP

# HTTP GET request sent to the URL url

r = requests.get( url )

# Use BeautifulSoup to parse the GET response

root = BeautifulSoup( r.content )
lnks = root.find("div", id="schedule")\
           .find("table")\
           .find("tbody").findAll("a")

DOWNLOADING A BUNCH OF FILES
Parse exactly what you want

# Cycle through the href for each anchor, checking
# to see if it's a PDF/PPTX link or not

for lnk in lnks:
    href = lnk['href']

    # If it's a PDF/PPTX link, queue a download

    if href.lower().endswith(('.pdf', '.pptx')):

        # Get some more data?!

        urld = urljoin(url, href)
        rd = requests.get(urld, stream=True)

        # Write the downloaded file into outbase, an output
        # directory assumed to be defined elsewhere

        outfile = path.join(outbase, path.basename(href))
        with open(outfile, 'wb') as f:
            f.write(rd.content)
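
The link-filtering logic above can be tried without a network connection. The sketch below swaps in the standard library's HTMLParser for BeautifulSoup and an inline HTML string for the HTTP response, so it is not the code from the slides; the page URL, HTML, and class name are all made up. Unlike the slide code, it collects every anchor on the page rather than only those inside the schedule div.

```python
from html.parser import HTMLParser
from os import path
from urllib.parse import urljoin, urlparse

# Made-up page standing in for what requests.get(url) would return
PAGE_URL = "http://example.com/course/schedule.html"
HTML = """
<div id="schedule"><table><tbody>
  <tr><td><a href="slides/lec01.pdf">Lecture 1</a></td></tr>
  <tr><td><a href="slides/lec02.PPTX">Lecture 2</a></td></tr>
  <tr><td><a href="http://example.org/reading.html">Reading</a></td></tr>
  <tr><td><a>No href here</a></td></tr>
</tbody></table></div>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

collector = LinkCollector()
collector.feed(HTML)

# Keep only PDF/PPTX links, resolve them against the page URL,
# and derive a local filename from the last path component
downloads = []
for href in collector.hrefs:
    if href.lower().endswith((".pdf", ".pptx")):
        full_url = urljoin(PAGE_URL, href)
        filename = path.basename(urlparse(full_url).path)
        downloads.append((full_url, filename))

print(downloads)
```

Each (URL, filename) pair is what the download loop would then fetch and write to disk.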

NEXT LECTURE

[Figure: the data lifecycle. Data collection -> Data processing -> Exploratory analysis & data viz -> Analysis, hypothesis testing, & ML -> Insight & policy decision]

NEXT CLASS:
NUMPY, SCIPY, AND DATAFRAMES

