1 Introduction
• Volume
  • Fast-moving data creates massive historical archives
  • Valuable for mining patterns, trends and relationships
• Variety
  • From structured to unstructured
  • Valuable for data enrichment/fusion
• 4V?
+ Big Data – Storage and Analytics
+ Where does Big Data come from?
Sources: Online gaming, Ad serving, Sensor data, Financial trade, Internet commerce, SaaS, Web 2.0, Mobile platforms
Storage
NewSQL
§ Structured data
§ ACID guarantees
§ Relational/SQL
NoSQL
§ Unstructured data
§ Eventual consistency
§ Schemaless
§ KV, document
Analytics
Analytic Datastore
+ Big Data Management Infrastructure
[Diagram: data sources (online gaming, ad serving, sensor data, financial trade, Internet commerce, SaaS, Web 2.0, mobile platforms); Storage: NewSQL, NoSQL; Analytics: Analytic Datastore]
+ The Data LifeCycle
Data collection → Data processing → Exploratory analysis & Data visualization → Data Analytics → Insights & Decision Making
+ The Two Steps of Working with Data
+ Netflix Prize I
Netflix Prize: $1 million to the first team that beats our in-house engine by 10%
• A team reached the 10% target after about three years
• The winning model was never used by Netflix, for a variety of reasons
  • Out of date (DVDs vs. streaming)
  • Too complicated / not interpretable
+ Flu Prediction
Sports
Overview CS102
+ Data Collection and Preparation
• Data Preparation
  • Filling in missing values
  • Removing suspicious data
  • Making formats and units consistent
• Data Collection
  • Storing big data
  • Collecting (big) continuous stream data
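The three preparation steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the sensor readings, the field names, the plausible temperature range, and the choice of the median as fill value are all assumptions, not part of the course material.

```python
from statistics import median

# Hypothetical sensor readings; names and values are made up for illustration.
raw = [
    {"sensor": "a", "temp": 21.5, "unit": "C"},
    {"sensor": "b", "temp": None, "unit": "C"},   # missing value
    {"sensor": "c", "temp": 71.6, "unit": "F"},   # different unit
    {"sensor": "d", "temp": 999.0, "unit": "C"},  # suspicious reading
    {"sensor": "e", "temp": 23.0, "unit": "C"},
]

def to_celsius(value, unit):
    """Convert a temperature to Celsius so all rows share one unit."""
    return (value - 32.0) * 5.0 / 9.0 if unit == "F" else value

def prepare(rows, low=-50.0, high=60.0):
    # 1. Make units consistent (missing values pass through untouched).
    rows = [
        {"sensor": r["sensor"],
         "temp": None if r["temp"] is None else to_celsius(r["temp"], r["unit"])}
        for r in rows
    ]
    # 2. Remove suspicious data: readings outside an assumed plausible range.
    rows = [r for r in rows if r["temp"] is None or low <= r["temp"] <= high]
    # 3. Fill in missing values with the median of the remaining readings.
    fill = median(r["temp"] for r in rows if r["temp"] is not None)
    return [{**r, "temp": fill if r["temp"] is None else r["temp"]} for r in rows]

cleaned = prepare(raw)
for row in cleaned:
    print(row)
```

Units are converted before the range check so that the same thresholds apply to every row. Dropping outliers before computing the fill value keeps the suspicious reading from distorting the median.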
+ Data Processing
• Database
  • Data
  • DBMS
  • Applications/Users
• What is a DBMS?
  • The relational model
  • SQL/Query processing
  • ACID/Transaction Management
[Figure labels: Application Programs, Interactive Queries, DBMS]
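To make the relational model, SQL queries, and ACID transactions a little more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `account` table, its rows, and the `transfer` helper are hypothetical examples chosen for illustration.

```python
import sqlite3

# An in-memory relational database with one table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account (id, owner, balance) VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 50)],
)
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts atomically: both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit is undone too.
        return False

ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the constraint and is rolled back
balances = conn.execute("SELECT owner, balance FROM account ORDER BY id").fetchall()
print(ok, failed, balances)
```

The credit is deliberately issued before the debit so that the failed transfer shows the rollback undoing an update that had already been applied inside the transaction.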
+ The Database
• Data models
  • The relational model
  • Non-relational: object-oriented, object-relational, network, hierarchical
  • Now semi-structured and unstructured
• Database design
  • Entity-Relationship diagrams, functional dependencies, normalisation
• Physical storage
  • Organisation, hashing, indexing
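As a rough illustration of why hashing and indexing matter for physical storage, the sketch below compares a full scan with a lookup through a hash index. The `student` rows and the function names are hypothetical, and a real DBMS stores its indexes on disk pages rather than in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical rows of a "student" table: (id, name, major).
rows = [
    (1, "Ann", "CS"),
    (2, "Bo", "Math"),
    (3, "Cy", "CS"),
    (4, "Di", "Physics"),
]

def scan(rows, major):
    """Full scan: inspect every row on each lookup."""
    return [r for r in rows if r[2] == major]

def build_hash_index(rows, column):
    """Map each value in `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def lookup(rows, index, key):
    """Indexed lookup: jump straight to the matching row positions."""
    return [rows[pos] for pos in index.get(key, [])]

major_index = build_hash_index(rows, column=2)
print(lookup(rows, major_index, "CS"))
print(lookup(rows, major_index, "Biology"))
```

Building the index costs one pass over the table, after which each equality lookup touches only the matching rows instead of all of them.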
+ Data Visualization
Basic Data Visualization
• “A picture is worth a thousand words”
Don’t underestimate the power of basic visualizations
§ Bar charts
§ Pie charts
§ Scatterplots
§ Maps
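Even a bar chart needs very little machinery. The sketch below draws one as plain text, assuming made-up category counts; the `bar_chart` function is a hypothetical helper, and in practice a plotting library would normally be used instead.

```python
def bar_chart(counts, width=20):
    """Return one text bar per category, scaled so the largest value fills `width`."""
    peak = max(counts.values())
    label_w = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label.ljust(label_w)} | {bar} {value}")
    return "\n".join(lines)

# Made-up counts, for illustration only.
chart = bar_chart({"bar": 8, "pie": 2, "scatter": 4})
print(chart)
```

Scaling every bar against the largest value keeps the chart readable regardless of the magnitude of the counts.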