
Unit 1

The document provides an overview of Big Data and its analytics, focusing on the types of digital data: structured, semi-structured, and unstructured. It discusses the characteristics, sources, and challenges associated with each data type, as well as the significance of Big Data in decision-making and business intelligence. The content also highlights the differences between traditional business intelligence and Big Data environments.

Uploaded by

PRADEEP NAZARETH
Copyright
© © All Rights Reserved

BIG DATA & ANALYTICS

Unit-I

Dr. SELVA KUMAR S


ASSISTANT PROFESSOR
DEPT. OF CSE
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to digital data and its types:
1. Structured data: Sources of structured data, ease with structured data, etc.
2. Semi-structured data: Sources of semi-structured data, characteristics of semi-structured data.
3. Unstructured data: Sources of unstructured data, issues with terminology, dealing with unstructured data.

Learning Outcomes

a) To differentiate between structured, semi-structured and unstructured data.
b) To understand the need to integrate structured, semi-structured and unstructured data.

Agenda

Types of Digital Data

• Structured
  • Sources of structured data
  • Ease with structured data
• Semi-Structured
  • Sources of semi-structured data
• Unstructured
  • Sources of unstructured data
  • Issues with terminology
  • Dealing with unstructured data

Classification of Digital Data

Digital data is classified into the following categories:

• Structured data: This is data in an organized form (e.g., rows and columns) that can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.

• Semi-structured data: This is data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program. Examples include emails, XML, and markup languages like HTML.

• Unstructured data: This is data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80%-90% of an organization's data is in this format, for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, etc.

Approximate Distribution of Digital Data

[Figure: approximate percentage distribution of digital data]

Structured Data

• This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
• In structured data, every row in a table has the same set of columns.
• Data stored in databases is an example of structured data.
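
The "rows and columns" idea can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names and sample rows below are hypothetical, chosen only for illustration; they are not taken from the lecture.

```python
import sqlite3

# Hypothetical table and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "CSE"), (2, "Ravi", "ECE")],
)

# Every row comes back with the same set of columns.
cursor = conn.execute("SELECT * FROM employee ORDER BY emp_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Because the schema is fixed up front, a program can rely on each row having exactly these three fields.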
Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease with Structured Data

• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction processing (ACID properties)
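
As a rough sketch of indexing and transaction processing, the example below uses Python's sqlite3 module to show atomicity, one of the ACID properties: either both updates of a transfer are applied or neither is. The account table, the index name and the transfer helper are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: balances may never go negative.
conn.execute(
    "CREATE TABLE account ("
    "acc_id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
# An index speeds up searches on the indexed column.
conn.execute("CREATE INDEX idx_account_balance ON account (balance)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; undo both updates if either one fails."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, so it is rolled back
balances = dict(conn.execute("SELECT acc_id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits the destination first and then fails on the debit; the rollback removes the credit as well, so no partial update survives.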
Semi-structured Data

• This is data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program.

• Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
Sources of Semi-structured Data

• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data

• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
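
These characteristics can be seen in a small JSON sketch. The two records and their field names are hypothetical; the point is only that each object carries its own labels and that the set of attributes differs from object to object.

```python
import json

# Hypothetical records: label/value pairs, with attributes that differ per object.
records_json = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "skills": ["SQL", "Hadoop"]}
]
"""

records = json.loads(records_json)

# The labels travel with the data, so the "schema" is discovered while reading.
all_attributes = sorted({key for record in records for key in record})
shared_attributes = sorted(set.intersection(*(set(record) for record in records)))

print(all_attributes)
print(shared_attributes)
```

Only one attribute is common to both objects, which is why a program cannot assume a fixed set of columns the way it can with a database table.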
Unstructured Data

• This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.

• About 80–90% of an organization's data is in this format.

• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, body of an email, etc.
Sources of Unstructured Data

• Web pages
• Images
• Free-form text
• Audios
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data

• Structure can be implied despite not being formally defined.

• Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand.

• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
Dealing with Unstructured Data

• Data mining
  • Association rule mining
  • Regression analysis
  • Collaborative filtering
• Text analytics and text mining
• Natural Language Processing (NLP)
• Noisy text analytics
• Manual tagging with metadata
• Part-of-speech tagging
• Unstructured Information Management Architecture (UIMA)
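
To make one of the listed techniques concrete, here is a minimal sketch of the two basic measures used in association rule mining, support and confidence, applied to word sets extracted from short free-text messages. The messages are made up for illustration, and the whitespace tokenization is a deliberate simplification rather than a real text-analytics pipeline.

```python
# Hypothetical free-text messages, for illustration only.
messages = ["bread milk", "bread butter", "bread milk butter", "milk"]

# Turn each message into a set of words (a very naive tokenization).
transactions = [set(message.lower().split()) for message in messages]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in 2 of the 4 messages, and in 2 of the 3 messages that mention bread.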


Answer a few quick questions …

Answer Me

• Which category (structured, semi-structured, or unstructured) will you place a Web page in?

• Which category (structured, semi-structured, or unstructured) will you place a Word document in?

• State a few examples of human-generated and machine-generated data.

Summary please…

Ask a few participants of the learning program to summarize the lecture.

Properties: Structured data vs. Semi-structured data vs. Unstructured data

Technology
• Structured data: It is based on relational database tables.
• Semi-structured data: It is based on XML/RDF (Resource Description Framework).
• Unstructured data: It is based on character and binary data.

Transaction management
• Structured data: Matured transaction and various concurrency techniques.
• Semi-structured data: Transaction is adapted from DBMS, not matured.
• Unstructured data: No transaction management and no concurrency.

Version management
• Structured data: Versioning over tuples, rows, tables.
• Semi-structured data: Versioning over tuples or graph is possible.
• Unstructured data: Versioned as a whole.

Flexibility
• Structured data: It is schema dependent and less flexible.
• Semi-structured data: It is more flexible than structured data but less flexible than unstructured data.
• Unstructured data: It is more flexible and there is absence of schema.

Scalability
• Structured data: It is very difficult to scale the DB schema.
• Semi-structured data: Its scaling is simpler than structured data.
• Unstructured data: It is more scalable.

Robustness
• Structured data: Very robust.
• Semi-structured data: New technology, not very widespread.
• Unstructured data: -

Query performance
• Structured data: Structured queries allow complex joining.
• Semi-structured data: Queries over anonymous nodes are possible.
• Unstructured data: Only textual queries are possible.

References

Further Readings

• http://data-magnum.com/the-big-deal-about-big-data-whats-inside-structured-unstructured-and-semi-structured-data/
• http://www.webopedia.com/TERM/S/structured_data.html
• http://en.wikipedia.org/wiki/UIMA

Thank you

Chapter 2
Introduction to Big Data

Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to big data:
1. Definition of big data.
2. Challenges of big data.
3. Why big data?
4. Traditional Business Intelligence versus big data.

Learning Outcomes

a) To understand the significance of big data.
b) To understand the other characteristics of data that are not definitional characteristics of big data.
c) To understand the challenges of big data and how to deal with the same.
d) To understand what is new today.

Agenda

• Definition of Big Data
  • Volume
  • Velocity
  • Variety
• Challenges of Big Data
• Other Characteristics of Data Which are Not Definitional Traits of Big Data
• Why Big Data?
• Traditional Business Intelligence (BI) versus Big Data
  • A Typical Data Warehouse Environment
  • A Typical Hadoop Environment
  • Coexistence of Big Data and Data Warehouse

Characteristics of Data

Data has three characteristics:

1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data (whether it is static or real-time streaming).

2. Condition: deals with the state of the data, that is, “Can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?”

3. Context: deals with “Where has this data been generated?”, “Why was this data generated?” and so on.

Definition of Big Data

Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Source: Gartner IT Glossary

• High-volume, high-velocity, high-variety
• Cost-effective, innovative forms of information processing
• Enhanced insight & decision making

Volume - A Mountain of Data

1 Kilobyte (KB) = 1000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
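
The unit ladder above steps by a factor of 1000 each time. A small sketch of that arithmetic follows; the function name human_readable is a hypothetical helper written for this illustration.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes):
    """Express a byte count in the largest decimal unit that fits."""
    exponent = 0
    # Climb one unit for every further factor of 1000 the value reaches.
    while exponent < len(UNITS) - 1 and num_bytes >= 1000 ** (exponent + 1):
        exponent += 1
    return f"{num_bytes / 1000 ** exponent:g} {UNITS[exponent]}"


print(human_readable(999))
print(human_readable(1000))
print(human_readable(2_500_000_000_000))
print(human_readable(10 ** 24))
```

The sketch uses decimal (power-of-1000) units to match the slide, not the binary (power-of-1024) units some tools report.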
Volume

Where does this data get generated?

1. Typical internal sources:
• Data storage – File systems, SQL, NoSQL (MongoDB, Cassandra).
• Archives – Archives of scanned documents, paper archives, customer records, patient health records, etc.
2. External data sources:
• Public web – Wikipedia, weather, regulatory, census, etc.
3. Both (internal + external):
• Sensor data – Car sensors, smart electric meters, office buildings, etc.
• Machine log data – Event logs, application logs, business process logs, audit logs, etc.
• Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps – ERP, CRM, HR, Google Docs, and so on.
• Media – Audio, video, image, podcast, etc.
• Docs – CSV, Word documents, PDF, XLS, PPT and so on.
Sources of Big Data
Velocity

Batch → Periodic → Near real time → Real-time processing

Variety

• Structured data: for example, traditional transaction processing systems, RDBMS, etc.

• Semi-structured data: for example, HyperText Markup Language (HTML), eXtensible Markup Language (XML).

• Unstructured data: for example, unstructured text documents, audio, video, email, photos, PDFs, social media, etc.
Other Characteristics of Data – Which are not Definitional Traits of Big Data

• Veracity and Validity: Veracity refers to biases, noise and abnormality in data. Validity refers to the accuracy and correctness of the data.

• Volatility: Deals with how long the data is valid and how long it should be stored.

• Variability: Data flows can be highly inconsistent, with periodic peaks.

Challenges with Big Data

• Capture
• Storage
• Curation
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations
Why Big Data?

• More data
• More accurate analysis
• More confidence in decision making
• Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Traditional Business Intelligence (BI) versus Big Data

1. In a traditional BI environment, all of the enterprise’s data is housed in a central server, whereas in a big data environment the data resides in a distributed file system. The distributed file system scales horizontally (by scaling in or out), whereas a typical database server scales vertically.
2. In traditional BI, data is generally analyzed in offline mode, whereas in big data it is analyzed in real time as well as in offline mode.
3. Traditional BI is about structured data, and the data is taken to the processing functions; big data is about variety, and the processing functions are taken to the data.
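
The phrase "processing functions are taken to the data" can be sketched as a MapReduce-style word count. This is only a single-process simulation under stated assumptions: the three lists stand in for blocks of a distributed file system stored on different nodes, and the sample lines are invented for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions; in a real cluster each would live on a different node.
partitions = [
    ["big data", "data lake"],
    ["big cluster"],
    ["data node", "big data"],
]


def count_words(lines):
    """The processing function that is 'shipped' to each partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Map step: run the function where each partition resides.
partial_counts = [count_words(partition) for partition in partitions]

# Reduce step: only the small partial results are combined centrally.
total = reduce(lambda left, right: left + right, partial_counts, Counter())
print(total)
```

The raw lines never leave their partition; only the per-partition counts are moved, which is the contrast with pulling all data into one central server.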
A Typical Data Warehouse Environment

• Sources: ERP, CRM, Legacy, 3rd party apps
• Data Warehouse
• Outputs: Reporting / Dashboarding, OLAP, Ad hoc querying, Modeling

Co-existence of Big Data and Data Warehouse

• Big data sources: Web logs, Images and videos, Social media (Twitter, Facebook, etc.), Docs & PDFs
• Hadoop: HDFS, MapReduce
• Operational systems, ODS
• Data Warehouse, Data marts

What is changing in the realms of Big Data

• Competitive advantage
• Decision making
• Value of data
It's time for Activity…
Teams Games Tournaments

Answer Me

• Share your understanding of Big Data.

• How is a traditional BI environment different from the Big Data environment?

• Share your experience as a customer on an e-commerce site. Comment on the big data that gets created on a typical e-commerce site.

Summary please…

Ask a few participants of the learning program to summarize the lecture.
References

Further Readings

• Big Data For Dummies - Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman
• http://en.wikipedia.org/wiki/Big_data
• http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
• https://www.oracle.com/bigdata/
• http://bigdatauniversity.com/

THANK YOU
