Unit 1
Unit-I
3. Unstructured data: sources of unstructured data, issues with terminology, dealing with unstructured data
Agenda
Semi-structured data
Sources of semi-structured data
Unstructured data
Sources of unstructured data
Issues with terminology
Dealing with unstructured data
Classification of Digital Data
Digital data is classified into the following categories:
Unstructured data: data that does not conform to a data model or is not in a form that a computer program can use easily. About 80%-90% of an organization's data is in this format, for example memos, chat rooms, PowerPoint presentations, images, videos, letters, etc.
Approximate Distribution of Digital Data
set of columns.
Data stored in databases is an example of structured data.
Sources of Structured Data
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
OLTP systems
Ease with Structured Data
Input / Update / Delete
Security
Scalability
Transaction processing (ACID properties)
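A minimal sketch of transaction processing on structured data, using Python's built-in sqlite3 module. The account table, its CHECK constraint, and the transfer helper are hypothetical and exist only for illustration. The sketch shows one ACID property, atomicity: a transfer that fails halfway leaves no partial update behind.

```python
import sqlite3

# Hypothetical schema and values, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account (id, balance) VALUES (?, ?)",
    [(1, 100), (2, 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception is raised inside it.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative, so the credit is undone too.
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK constraint
balances = dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
print(ok, failed, balances)
```

The second transfer credits account 2 before the debit on account 1 fails, and the rollback removes that credit, so the balances stay as they were after the first transfer.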
Semi-structured Data
Characteristics of Semi-structured Data
Inconsistent structure
Self-describing (label/value pairs)
Schema information is often blended with data values
Data objects may have different attributes that are not known beforehand
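The characteristics above can be sketched with a small JSON example in Python. The records and their field names are hypothetical, invented for illustration. Each object carries its own labels, so the schema travels with the values, and the two objects do not share the same attributes.

```python
import json

# Hypothetical records, invented for this illustration.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "skills": ["SQL", "Python"]}
]
"""
records = json.loads(raw)

# The full set of labels is not declared anywhere up front;
# it is only discovered by reading the data itself.
all_labels = sorted({label for record in records for label in record})

for record in records:
    missing = [label for label in all_labels if label not in record]
    print(record["name"], "has no", missing)
```

A relational table would need every column fixed in advance. Here a program has to inspect each object to learn which attributes it has.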
Unstructured Data
Images
Free-form text
Audios
Videos
Body of email
Text messages
Chats
Social media data
Word documents
Issues with terminology – Unstructured Data
Data Mining
Part-of-speech tagging
Technology
Structured data: based on relational database tables.
Semi-structured data: based on XML/RDF (Resource Description Framework).
Unstructured data: based on character and binary data.
Query performance
Structured data: structured queries allow complex joins.
Semi-structured data: queries over anonymous nodes are possible.
Unstructured data: only textual queries are possible.
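The query comparison above can be sketched in Python using only the standard library. All table names, XML tags, and text are hypothetical. The XML part walks element nodes by tag name as a simple stand-in for querying semi-structured data; it is not an RDF query over anonymous nodes.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: relational tables, queried with a join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'laptop');
""")
joined = db.execute(
    "SELECT c.name, o.item "
    "FROM customer c JOIN orders o ON o.customer_id = c.id"
).fetchall()

# Semi-structured: XML whose elements do not all have the same children,
# queried by walking nodes that carry a given tag.
doc = ET.fromstring(
    "<orders>"
    "<order><customer>Asha</customer><item>laptop</item></order>"
    "<order><item>phone</item></order>"
    "</orders>"
)
items = [node.text for node in doc.iter("item")]

# Unstructured: free text, where only a textual search is available.
memo = "Asha called about the laptop order. She wants the laptop shipped on Friday."
hits = re.findall(r"\blaptop\b", memo)

print(joined, items, len(hits))
```

The join relies on a declared relationship between two tables and the XML query on tag names embedded in the data. The text search has no structure to use and only matches characters.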
References…
Further Readings
http://data-magnum.com/the-big-deal-about-big-data-whats-inside-structured-unstructured-and-semi-structured-data/
http://www.webopedia.com/TERM/S/structured_data.html
http://en.wikipedia.org/wiki/UIMA
Thank you
Chapter 2
Introduction to Big Data
Learning Objectives and Learning Outcomes
d) To understand what is new today.
Agenda
1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data (whether it is static or real-time streaming).
2. Condition: deals with the state of the data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: deals with "Where has this data been generated?", "Why was this data generated?" and so on.
Definition of Big Data
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary
Volume - A Mountain of Data
• Volatility: deals with how long the data is valid and how long it should be stored.
Storage
Curation
Analysis
Transfer
Visualization
Privacy Violations
Why Big Data?
More Data
More Accurate Analysis
[Figure labels: ERP, CRM, Operational Systems, Images and Videos, Social Media (Twitter, Facebook, etc.), Hadoop, MapReduce, Data Warehouse, Data Marts, OLAP, Reporting / Dashboarding]
• Competitive Advantage
• Decision Making
• Value of Data
It's time for an activity…
Teams Games Tournaments
Answer Me
Big Data for Dummies - Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman
http://en.wikipedia.org/wiki/Big_data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
https://www.oracle.com/bigdata/
http://bigdatauniversity.com/
THANK YOU