
Big Data Technologies

Text Books:
1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and
Techniques, Third Edition, Elsevier, Morgan Kaufmann, 2011.
2. Tom White, "Hadoop: The Definitive Guide", 3rd Edition, O'Reilly,
2012.
3. Brett Lantz, Machine Learning with R: Deliver Data Insights with R
and Predictive Analytics, 2nd Revised Edition, 2015.

By
Prakash N
Assistant Professor
Department of CST
UNIT I: DATA MINING & BIG DATA

Introduction to Data Mining, KDD process, Data Mining Techniques: Mining
Frequent Patterns, Association Rules, Cluster Analysis, Classification and
Regression. Introduction to Big Data – What is Big Data? Explosion in
Quantity of Data, Big Data Characteristics, Types of Data, Common Big Data
Customer Scenarios, BIG DATA vs. HADOOP, A Holistic View of a Data
System, Limitations of Existing Data Analytics Architecture.
Why Data Mining
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the least likely
to default on their credit cards?
– Identify likely responders to sales promotions
• Fraud detection
– Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
• Customer relationship management:
– Which of my customers are likely to be the most loyal, and which are
most likely to leave for a competitor?

Data mining helps extract such information.
Data mining

• It is the process of discovering interesting patterns and knowledge
from large amounts of data.
• Data sources include databases, data warehouses, the web, or other
information repositories.
• Also known as Knowledge Discovery from Data (KDD).
Applications
• Banking: loan/credit card approval
– predict good customers based on old customers
• Customer relationship management:
– identify those who are likely to leave for a competitor.
• Targeted marketing:
– identify likely responders to promotions
• Fraud detection: telecommunications, financial
transactions
– from an online stream of events, identify fraudulent events
• Manufacturing and production:
– automatically adjust control knobs when process parameters change
Evolution of Database Technology
• 1960s and earlier: primitive file processing.
• 1970s to early 1980s (Database Management Systems): relational
database systems, ER models, query languages, user interfaces,
forms, query processing, transactions, concurrency and recovery.
• Mid-1980s to present (Advanced Database Systems): advanced data
models, advanced queries, parallel data processing.
• Late 1980s to present (Advanced Data Analysis): data warehousing
and data mining.
KDD Process

1. Data Cleaning – Remove noise and inconsistent data.

2. Data Integration – Multiple data sources may be combined.

3. Data Selection – Data relevant to the analysis task are retrieved from
the database.

4. Data Transformation – Data are transformed and consolidated into forms
appropriate for mining.

5. Data Mining – Intelligent methods are applied to extract data patterns.

6. Pattern Evaluation – Identify the truly interesting patterns that
represent knowledge.

7. Knowledge Presentation – Present the mined knowledge to the user.


Data Mining: A KDD Process

[Figure: the KDD process – Databases → Data Cleaning / Data Integration →
Data Warehouse → Data Selection / Data Transformation → Task-relevant Data →
Data Mining → Pattern Evaluation]
– Data mining is the core of the knowledge discovery process.
What kind of data can be mined?

The most basic forms of data for mining applications are database
data, data warehouse data and transactional data.
• Database Data (DBMS)
- Consists of a collection of interrelated data.
- The data are accessed and managed by a set of software programs.
• Data Warehouse – a repository of information collected from
multiple sources.
• Transactional Data – information recorded from transactions; each
record carries a unique transaction number (trans. ID).
Data Mining Task

• Prediction Tasks
– Perform induction on the current data in order to make predictions

• Description Tasks
– Find human-interpretable patterns that describe the data.
Data Mining Techniques

– Mining of Frequent Patterns

– Associations

– Correlations

– Classification and Regression

– Clustering Analysis
Mining of Frequent Patterns

• Patterns that occur frequently in data.


• Many kinds of Frequent Patterns
- Frequent Item sets
- Frequent Subsequence (Sequential Patterns)
• It leads to the discovery of interesting associations and correlations within
data.
Frequent Item Sets
• Sets of items that often appear together in a transactional data set.
Example – milk and bread are frequently bought together in a grocery
store.

Frequent Subsequence (Sequential Patterns)

• The pattern that customers tend to purchase first a laptop, followed by a
digital camera, and then a memory card.
Association Rule / Analysis

• For example, you want to know which items are frequently purchased together
within the same transaction.
Such a rule is,
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
*X – variable representing a customer.
*buys – attribute (predicate).

*A confidence of 50% means that if a customer buys a computer, there is a
50% chance that the customer will buy software as well.
*A support of 1% means that 1% of all the transactions under analysis show
that computer and software are purchased together.

This rule contains a single predicate and is referred to as a
Single-Dimensional Association Rule.

2nd Rule..

age(X, "20..29") ˄ income(X, "40K..49K") => buys(X, "laptop")
[support = 2%, confidence = 60%]

A rule involving more than one attribute or predicate (i.e., age, income and
buys) is referred to as a Multidimensional Association Rule.
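
As a minimal illustration (not from the slides), the sketch below computes support and confidence for the rule buys(computer) => buys(software) over a made-up list of transactions:

import java.util.List;
import java.util.Set;

public class AssociationRuleDemo {
    // support(A => B)    = count(A and B together) / total transactions
    // confidence(A => B) = count(A and B together) / count(A)
    public static void main(String[] args) {
        // Toy transaction data set; the items are illustrative only.
        List<Set<String>> transactions = List.of(
                Set.of("computer", "software", "mouse"),
                Set.of("computer", "printer"),
                Set.of("milk", "bread"),
                Set.of("computer", "software"),
                Set.of("bread", "butter"));

        Set<String> antecedent = Set.of("computer");
        Set<String> consequent = Set.of("software");

        long both = transactions.stream()
                .filter(t -> t.containsAll(antecedent) && t.containsAll(consequent))
                .count();
        long antecedentCount = transactions.stream()
                .filter(t -> t.containsAll(antecedent))
                .count();

        double support = (double) both / transactions.size();
        double confidence = (double) both / antecedentCount;

        System.out.printf("support = %.0f%%, confidence = %.0f%%%n",
                support * 100, confidence * 100);
    }
}

For these five toy transactions, the rule holds in 2 of 5 transactions (support = 40%) and in 2 of the 3 transactions that contain a computer (confidence ≈ 67%).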
Classification and Regression

Classification
• It is the process of finding a model that describes and distinguishes data
classes and concepts.
• The model is derived based on the analysis of a set of training data
(data objects whose class label is known).
• It is used to predict the class label of objects for which the class label is
unknown.
• A classification model can be represented in various forms: (i) IF-THEN
rules, (ii) a decision tree, and (iii) a neural network.
For example, classify countries based on climate, or classify cars based
on gas mileage.
Classification Rules (IF-THEN rules)
A Decision Tree Algorithm

Node – attribute tested; Branch – outcome of the test;
Tree leaves – classes.
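
A minimal sketch of what IF-THEN classification rules look like in code (the attributes, thresholds and class labels below are invented for illustration; a real model would be learned from training data):

public class RuleClassifier {
    // IF-THEN rules of the kind a decision tree can be converted into.
    static String classify(int age, double income) {
        if (age <= 25 && income < 30_000) return "decline";
        if (age > 25 && income >= 50_000) return "approve";
        return "manual-review";
    }

    public static void main(String[] args) {
        System.out.println(classify(30, 60_000)); // approve
        System.out.println(classify(22, 20_000)); // decline
    }
}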
A Neural Network

[Figure: a feed-forward neural network with input units for attributes such as
age (f1) and income (f2), hidden units (f3, f4, f5), and output units (f6, f7, f8)
for classes A, B and C.]

It is typically a collection of neuron-like processing units with weighted
connections between the units.
Regression

• Regression is a data mining function that predicts a number
(unavailable numerical data values).
• Age, weight, distance, temperature, income, or sales could all
be predicted using regression techniques.
• For example, a regression model could be used to predict
children's height, given their age, weight, and other factors.
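
As an illustrative sketch (the age/height numbers below are made up, not real training data), a least-squares regression line can be fitted and then used to predict a numeric value:

public class SimpleRegression {
    public static void main(String[] args) {
        double[] age    = {2, 4, 6, 8, 10};          // years
        double[] height = {86, 102, 115, 128, 139};  // centimetres

        int n = age.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += age[i];
            sumY += height[i];
            sumXY += age[i] * height[i];
            sumXX += age[i] * age[i];
        }
        // Slope b and intercept a of the best-fit line y = a + b*x.
        double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double a = (sumY - b * sumX) / n;

        double newAge = 7;
        System.out.printf("predicted height at age 7 ≈ %.1f cm%n", a + b * newAge);
    }
}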
Cluster Analysis

• It is the process of partitioning a set of data objects into subsets.
• Each subset is a cluster.
• Objects in a cluster are similar to one another, yet dissimilar to
objects in other clusters.
• The set of clusters resulting from a cluster analysis can be
referred to as a clustering.
• Clustering analyzes data objects without consulting class
labels (no training data).
• Clusters are formed on the principle of maximizing the intra-class
similarity and minimizing the inter-class similarity.
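
A minimal k-means sketch on one-dimensional toy data (the points and k = 2 are assumptions for illustration), showing the assign-then-update loop that maximizes intra-cluster similarity:

import java.util.Arrays;

public class KMeansDemo {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0};
        double[] centroids = {points[0], points[points.length - 1]}; // k = 2

        for (int iter = 0; iter < 10; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its cluster.
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        System.out.println("centroids = " + Arrays.toString(centroids));
    }
}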
Cluster Analysis
What is Big Data?
• "A massive volume of both structured and unstructured data
that is so large that it is difficult to process using traditional,
on-hand database management tools."
• 'Big Data' is similar to 'small data', but bigger in size.
• Because the data is bigger, it requires different approaches,
techniques, tools and architectures,
• with the aim of solving new problems, or old problems in a better way.
Big Data Analytics
Categories of BIG Data
• Structured
• Written in a format that’s easy for machines to
understand.
• Structured data is easily searchable by basic algorithms.
• Examples : Fields/ Tables/ Columns/
RDBMS/Spreadsheet

• Semi-structured
• Markers/Tags to separate elements
• XML/HTML
• Unstructured
• No fields/attributes
• More like Human Language
• Free form text (E-mail body, notes, articles,…)
• Audio, video, and image
Examples Of Structured Data

• An 'Employee' table in a database is an example of Structured Data.

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
Examples Of Un-Structured Data
Examples Of Semi-Structured Data

• Personal data stored in an XML file


Explosion in Quantity of Data
• Every minute
– Facebook users share nearly 2.5 million pieces of
content
– Twitter users tweet nearly 300,000 times
– Instagram users post nearly 220,000 new photos
– YouTube users upload 72 hours of new video content
– Apple users download nearly 50,000 apps
– Email users send over 200 million messages
– Amazon generates over $80,000 in online sales
Explosion in Quantity of Data

• The Data Explosion in 2014 Minute by Minute


– In 2012, Google received over 2 million search
queries per minute
– Today, Google receives over 4 million search
queries per minute from the 2.4 billion strong
global internet population
Explosion in Quantity of Data
• Science
– Data bases from astronomy, genomics, environmental data,
transportation data, …
• Humanities and Social Sciences
– Scanned books, historical documents, social interactions data, new
technology like GPS …
• Business & Commerce
– Corporate sales, stock market transactions, census, airline traffic,

• Entertainment
– Internet images, movies, MP3 files, …
• Medicine
– MRI & CT scans, patient health records, …
Big Data Analytics
Why Big Data?

• Increase of storage capacities


• Increase of processing power
• Availability of data.
– Manage data – extract relevant data
– Perform analytics on data – gain insights and use
algorithms
– Make decisions
Why Big Data?

Big Data can further benefit organisations in the
five areas mentioned below.
Comprehend market conditions:
through big data, organisations can predict what future customer
behaviour will be – purchasing patterns, choices, product preferences.
Know your customer better:
through big data analysis, companies come to know the general
thought process and feedback in advance and can make course
corrections.
Why Big Data?

Control online reputation:
sentiment analysis can be done with Big Data tools.
Cost saving:
there may be an initial cost of adopting big data tools, but in the
long run the benefits will outweigh the cost.
 Availability of data – through Big Data tools, relevant data can be
made available, in an accurate and structured format, in real time.
Big Data Characteristics

4Vs’
• Volume
• Velocity
• Variety
• Veracity
Big Data Characteristics

• Volume  refers to the amount of data
• Terabytes to Zettabytes of records, transactions,
tables and files.
• Volume is the size of the data set.
• We are not talking Terabytes but Zettabytes, and the same
amount of data will soon be generated every minute.
• New big data tools use distributed systems so that we
can store and analyse data across databases that are
dotted around anywhere in the world.
• Velocity  Data in motion
• Velocity refers to the speed
• at which new data is generated and the speed at which data
moves around;
• at which the data is created, stored, analyzed and visualized.
• Machine-to-machine processes exchange data between billions
of devices.
• Infrastructure generates massive log data in real time.
• Variety  Data in many forms
• Different data types such as audio, video and image data
(mostly unstructured data).
• In the past we focused only on structured data that
neatly fitted into tables or relational databases, such as
financial data.
• In fact, 80% of the world's data is unstructured.
• With big data technology we can now analyse and bring
together data of different types (messages, social
media, conversations, photos, ...).
• Veracity  Data in doubt
• Refers to the messiness of the data:
• inconsistent and missing data.
• With many forms of big data, quality and accuracy are
less controllable.
5 Vs of Big Data
Volume, Veracity, Velocity, Variety, and Value

Having access to big data is no good unless we can turn it into value.

Big Data Analytics


Big Data Characteristics – 4V’s
Common Big data Customer Scenarios
• Web and E-Tailing
- Recommendation Engines
- Search Quality
- Abuse and click Fraud Detection
• Telecommunications
- Network Performance Optimization
- Analyzing the network to predict failures
Common Big data Customer Scenarios
• Government
- Fraud Detection and Cyber Security
- Welfare schemes

• Health Care and Life Sciences


- Health Information Exchange
- Drug Safety
- Health care Service quality Improvements
Common Big data Customer Scenarios
• Banks and Financial Services
- Fraud Detection and Cyber Security
- Credit Scoring Analysis

• Retail
- Sale Transaction Analysis
BIG DATA vs. HADOOP
• Understand and navigate federated big data sources – Federated Discovery and Navigation
• Manage & store huge volumes of any data – Hadoop File System, MapReduce
• Structure and control data – Data Warehousing
• Manage streaming data – Stream Computing
• Analyze unstructured data – Text Analytics Engine
• Integrate all data sources – Extract Transform Load, Integration, Data Quality, Security, Data Life Cycle
Big Data Analytics
1. Analyzes multiple data streams from many sources live.
2. Stream computing uses software algorithms that analyze the
data in real time.
3. This increases speed and accuracy when dealing with data
handling and analysis.
ETL (Extract, Transformation and Load)
• ETL did originate in enterprise IT
- data from online databases is Extracted,
- then Transformed to normalize it and
- finally Loaded into enterprise data
warehouses for analysis
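
A minimal ETL sketch (the CSV rows and the in-memory "warehouse" list are made-up stand-ins for a real source system and enterprise data warehouse):

import java.util.ArrayList;
import java.util.List;

public class EtlDemo {
    record Sale(String region, double amountUsd) {}

    public static void main(String[] args) {
        // Extract: rows pulled from an (imaginary) online source system.
        List<String> extracted = List.of("north, 1200.50", "SOUTH , 800", "north,455.25");

        // Transform: normalize case, trim whitespace, parse numbers.
        List<Sale> transformed = new ArrayList<>();
        for (String row : extracted) {
            String[] parts = row.split(",");
            transformed.add(new Sale(parts[0].trim().toLowerCase(),
                                     Double.parseDouble(parts[1].trim())));
        }

        // Load: write the cleaned records into the target store (here just a list).
        List<Sale> warehouse = new ArrayList<>(transformed);
        warehouse.forEach(System.out::println);
    }
}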
Data Life Cycle
HDFS (Hadoop Distributed File System)
• The Hadoop Distributed File System was developed using a distributed
file-system design.

• Highly fault tolerant and designed to run on low-cost hardware.

• Holds very large amounts of data and provides easier access.

• Files are stored across multiple machines.

• Provides file permissions and authentication.

• Supports big data analytics applications.

• High-performance access to data across Hadoop clusters.


HDFS (Hadoop File System)
• Developer – Apache Software Foundation

• Written in Java

• The core of Apache Hadoop consists of a storage part
(HDFS) and MapReduce.

Benefits
• Computing power – a distributed computing model
ideal for big data.

• Flexibility – store any amount of any kind of data.

• Fault tolerance – if a node goes down, jobs are
automatically redirected to other nodes, and the system
automatically stores multiple copies of all data.
HDFS (Hadoop File System)
Benefits
• Low cost – open source framework is free

• Scalability – System can be grown easily by adding


more nodes

HDFS Goals
• Detection of faults and automatic recovery

• High throughput of data access rather than low


latency

• Provide high aggregate bandwidth and scale to hundreds of
nodes in a single cluster
HDFS (Hadoop File System)
• Write-once, read-many access model for files

• Applications move computation closer to where the
data is located

• Easily portable

• Every block is replicated three times (by default)

• Default block size is 64 MB

• DataNodes send a heartbeat ("alive") message every 3 seconds


Storing a file in HDFS
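
A minimal sketch of storing a local file in HDFS using the standard Hadoop FileSystem Java API (the local and HDFS paths are made-up examples; the cluster address is assumed to come from the usual core-site.xml configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/sales.csv");         // file on the local disk
        Path remote = new Path("/user/demo/sales.csv");  // destination in HDFS

        // HDFS splits the file into blocks (64 MB by default) and replicates
        // each block (3 copies by default) across DataNodes.
        fs.copyFromLocalFile(local, remote);

        System.out.println("Stored " + remote + ", exists = " + fs.exists(remote));
        fs.close();
    }
}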
MapReduce

• The MapReduce algorithm contains two important tasks: Map
and Reduce.
• Map takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples
(key/value pairs).
• The Reduce task takes the output from a map as
an input and combines those data tuples into a smaller set of
tuples.
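
The classic word-count example illustrates the two tasks. The sketch below uses the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the job-driver boilerplate is omitted for brevity:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: break each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word into a single (word, total) tuple.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}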
MapReduce

• It is a powerful paradigm for parallel computation.

• Hadoop uses MapReduce to execute jobs on files in HDFS.
• Hadoop intelligently distributes computation over the cluster.
• It takes the computation to the data.
MapReduce
Hadoop Ecosystem
A Holistic View of a Big Data System

[Figure: the main capabilities of a big data system –
• Discover, understand, search and navigate federated sources of big data
• Analyze streaming data and large data bursts for real-time insights
• Analyze petabytes of unstructured and structured data
• Deliver deep insight with advanced in-database analytics and operational analytics
• Govern data quality and manage the information life cycle]
Holistic View of Hadoop Ecosystem

Hadoop System
• It is an open-source distributed processing framework that
manages data processing and
• storage for big data applications running on clustered systems.
• It is used for advanced analytics initiatives, including predictive
analytics and data mining.
• It handles various forms of structured and unstructured data,
giving users more flexibility for collecting, processing and
analyzing data than relational databases and data warehouses
provide.
• Hadoop runs on clusters of commodity servers.
Holistic View of Hadoop Ecosystem

Stream Computing
Holistic View of Hadoop Ecosystem

Stream Computing
• Analyzes multiple data streams from many sources live, i.e.,
pulling in streams of data, processing the data and streaming it
back out as a single flow.

• Uses software algorithms that analyze the data in real time.

• Streaming increases speed and accuracy when dealing with
data handling and analysis.
Holistic View of Hadoop Ecosystem

Stream Computing
• In June 2007, IBM announced its stream computing system,
called System S.
– This system runs on 800 microprocessors and the System S software
enables software applications to split up tasks and then reassemble
the data into an answer.
Data Warehouse
• Data warehouses (DWs) are central repositories of integrated data from one or
more disparate sources.
• They store current and historical data in one single place that
are used for creating analytical reports for workers
throughout the enterprise.
• Used for reporting and data analysis.
Holistic View of Hadoop Ecosystem

Data Warehouse
• The data is processed, transformed, and ingested so that users
can access the processed data in the Data Warehouse through
Business Intelligence tools, SQL clients, and spreadsheets.
• Three main types of Data Warehouses are:
– Enterprise Data Warehouse
1. Classifies the data according to subject and gives
access according to those divisions.
– Operational Data Store
1. A data store required when neither the data warehouse nor the OLTP
systems support the organization's reporting needs.
– Data Mart
1. A subset of the data warehouse, designed for a particular line of business
such as sales or finance.
Holistic View of Hadoop Ecosystem

Information Integration and Governance

• Integrate the data, cleanse data, master data, protect
sensitive data and govern the meaning of data.
• Get the information needed to make important decisions.
• Governance and integration platform solutions let you know your
data is correct and available to every data user, trust your
data to deliver efficiency and protection, and use your data
to drive business transformation and innovation.

Data Discovery
• Data discovery is the process of breaking complex data
collections into information that users can understand and
manage.
Holistic View of Hadoop Ecosystem

Data Visualization
• Representing data in visual form. This can be particularly
useful when data need to be evaluated and decisions made
quickly.
Big Data System Management
• Monitoring and ensuring the availability of all big data
resources through a centralized interface/dashboard.
• Performing database maintenance for better results.
• Ensuring the security of big data repositories and control
access.
• Ensuring that data are captured and stored from all resources
as desired
Testing you?

Q1 – As compared to an RDBMS, Hadoop

A – Has higher data integrity.
B – Does ACID transactions.
C – Is suitable for reading and writing many times.
D – Works better on unstructured and semi-structured data.

Q2. Which of the following is true about metadata?

A – FsImage & EditLogs are metadata files.
B – Metadata contains information like the number of blocks, their
locations, replicas.
C – Metadata shows the structure of HDFS directories/files.
D – All of the above.
Testing you?

Q3. HDFS performs replication, although it results in data


redundancy?
A-True
B-False

Q4. Why do we need Hadoop?


A-Storage,
B-Security
C-Analytics
D-All the above
Testing you?

Q5. The components of an RDBMS are HDFS and MapReduce?

A – True
B – False

Q6. Expand the following abbreviations:

A. OLTP
B. OLAP
C. HDFS
D. BDT
Testing you?

Q7.All of the following accurately describe Hadoop, EXCEPT:


A) Open source
B) Real-time
C) Java-based
D) Distributed computing approach

Q8. Which of the following genres does Hadoop produce ?


A) Distributed file system
B) JAX-RS
C) Java Message Service
D) Relational Database Management System
Testing you?

Q9. What was Hadoop written in ?


A) Java (software platform)
B) Perl
C) Java (programming language)
D) Lua (programming language)

Q10. What is the default block size?

A) 64 MB
B) 64 KB
C) 64 GB
D) 64 PB
Testing you?

Q11. Hadoop is a framework that works with a variety of


related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet

Q12. …………………. is an essential process where intelligent


methods are applied to extract data patterns.
A) Data warehousing
B) Data mining
C) Text mining
D) Data selection
Testing you?

13. Data mining can also be applied to other forms such as …………….
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) All

14. Write full form of KDD-----------------------

15. The output of KDD is ………….


A) Data
B) Information
C) Query
D) Useful information
Testing you?

16. The data is stored, retrieved and updated in ………….


A) OLTP
B) OLAP
C) SMTP
D) FTP

17. What is metadata?

18. The Data Warehouse is --------------


A) Read only
B) Write only
C) Read and Write
D) None
Limitations of Existing Data Analytics

EDW – Enterprise Data Warehouse


MPP – Massively Parallel Processing Database
Schema – blueprint of how the database is constructed
(divided into database tables in the case of relational
databases)
ACID Properties
• A – Atomicity: Each transaction is considered as one unit and
either runs to completion or is not executed at all.
– Abort: If a transaction aborts, changes made to the
database are not visible.
– Commit: If a transaction commits, the changes made are
visible.
• C – Consistency: The database is consistent
before and after the transaction.
Example: T transfers 100 from account X (balance 500) to account Y (balance 200).
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs if T1 (the debit)
completes but T2 (the credit) fails; as a result, T is incomplete.
ACID Properties
• I-Isolation : multiple transactions can occur concurrently.
• Transactions occur independently without interference.

• D – Durability:
This property ensures that once the transaction has completed
execution, the updates and modifications to the database are
stored on and written to disk, and they persist even if a system
failure occurs.
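
A minimal JDBC sketch of atomicity for the transfer above (the JDBC URL, table and column names are made-up examples, and an embedded H2 database is assumed only so the snippet is self-contained): both updates are committed together, or rolled back together.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class AtomicTransfer {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:bank")) {
            // Made-up schema and balances, matching the consistency example.
            try (Statement s = conn.createStatement()) {
                s.execute("CREATE TABLE account(id VARCHAR(10) PRIMARY KEY, balance INT)");
                s.execute("INSERT INTO account VALUES ('X', 500), ('Y', 200)");
            }

            conn.setAutoCommit(false);                 // begin the transaction T
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE account SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE account SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100);  debit.setString(2, "X");  debit.executeUpdate();   // T1
                credit.setInt(1, 100); credit.setString(2, "Y"); credit.executeUpdate();  // T2
                conn.commit();                         // atomic: both changes become visible
            } catch (SQLException e) {
                conn.rollback();                       // atomic: neither change is visible
                throw e;
            }
        }
    }
}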
Limitations of Hadoop for Big Data Analytics
Issue with small files
• Hadoop is not suited for small data.

• HDFS lacks the ability to efficiently support random reads of
small files because of its high-capacity design.

• If we store a huge number of small files, HDFS cannot
handle them efficiently.

• HDFS was designed to work with a small number of
large files for storing large data sets, rather than a large
number of small files.
Limitations of Hadoop for Big Data Analytics
Slow Processing Speed
• In Hadoop, MapReduce processes large data sets with a
parallel, distributed algorithm.

• MapReduce requires a lot of time to perform these tasks,
thereby increasing latency.

• Data is distributed and processed over the cluster in
MapReduce, which increases the time and reduces processing
speed.
Limitations of Hadoop for Big Data Analytics
Support for Batch Processing Only
• Hadoop supports batch processing only.
• the execution of a series of programs each on a set or "batch"
of inputs, rather than a single input.
• Hadoop MapReduce is the best framework for processing
data in batches.
• MapReduce framework of Hadoop does not leverage the
memory of the Hadoop cluster to the maximum.

No Delta Iteration
• Hadoop is not so efficient for iterative processing.
• Hadoop does not support cyclic data flow.
Limitations of Hadoop for Big Data Analytics
Latency
• In Hadoop, MapReduce framework is comparatively slower,
since it is designed to support different format, structure and
huge volume of data.
• MapReduce requires a lot of time to perform these tasks
thereby increasing latency.

Not Easy to Use

• In Hadoop, MapReduce developers need to hand-code each
and every operation, which makes it very difficult to
work with.
Limitations of Hadoop for Big Data Analytics
Security
• Hadoop can be challenging in managing complex
applications. If the user managing the platform does not know
how to enable security, the data could be at huge
risk.
• At the storage and network levels, Hadoop is missing encryption,
which is a major point of concern.
• Hadoop supports Kerberos authentication, which is hard to
manage.

No Abstraction
• Hadoop does not have any type of abstraction,
• so MapReduce developers need to hand-code each and
every operation, which makes it very difficult to work with.
Limitations of Hadoop for Big Data Analytics
Vulnerable by Nature
• Hadoop is written entirely in Java, one of the most widely used
languages, and hence Java has been heavily exploited by cyber criminals.

No Caching
• Hadoop is not efficient for caching.
• In Hadoop, MapReduce cannot cache the intermediate data in
memory for further requirements, which diminishes the
performance of Hadoop.

Lengthy Lines of Code

• Hadoop has about 1,20,000 lines of code; more lines mean
more potential bugs, and it takes more time to
execute the program.
Limitations of Hadoop for Big Data Analytics
Uncertainty
• Hadoop only ensures that data job is complete, but it’s unable
to guarantee when the job will be complete.
Limitations of Existing Data Analytics Architecture

Big Data Analytics


Solution: A Combined Storage Compute Layer

Big Data Analytics
