Unit 1 - Big Data Technologies
Text Books:
1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and
Techniques, Third Edition, Elsevier, Morgan Kaufmann, 2011.
2. Tom White, “Hadoop: The Definitive Guide”, 3rd Edition, O’Reilly,
2012.
3. Brett Lantz, Machine Learning with R, Second Edition: Deliver Data
Insights with R and Predictive Analytics, 2015.
By
Prakash N
Assistant Professor
Department of CST
UNIT I: DATA MINING & BIG DATA
– Data mining: the core of the knowledge discovery process.
[Figure: the knowledge discovery process —
Databases → Data Cleaning / Data Integration → Data Warehouse →
Data Selection / Data Transformation (task-relevant data) →
Data Mining → Pattern Evaluation]
What kind of pattern can be mined?
The most basic forms of data for mining applications are database
data, data warehouse data, and transactional data.
• Database Data (DBMS)
• Prediction Tasks
– Perform induction on the current data in order to make predictions
• Description Tasks
– Find human-interpretable patterns that describe the data.
Data Mining Techniques
– Associations
– Correlations
– Cluster Analysis
Mining of Frequent Patterns
• For example, you may want to know which items are frequently purchased
together within the same transaction.
Such a rule is:
buys(X, ”computer”) => buys(X, ”software”) [support = 1%, confidence = 50%]
*X – a variable representing a customer.
*buys – an attribute (predicate).
This rule involves a single predicate (buys), so it is a single-dimensional
association rule; a rule involving more than one attribute or predicate
(e.g., age, income, and buys) is referred to as a multidimensional
association rule.
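Support is the fraction of transactions containing all items in the rule, and confidence is how often the consequent holds when the antecedent does. A minimal sketch, using a made-up transaction list:

```python
# Hypothetical transactions; the item names are illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"computer", "software"}, transactions))        # 0.5
print(confidence({"computer"}, {"software"}, transactions))   # 2/3
```

With these four transactions, buys(X, ”computer”) => buys(X, ”software”) has support 50% and confidence about 67%.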
Classification and Regression
Classification
• It is the process of finding a model that describes and distinguishes data
classes and concepts.
• The model is derived from the analysis of a set of training data
(data whose class labels are known).
• It is used to predict the class label of objects for which the class label is
unknown.
• A classification model can be represented in various forms: (i) IF-THEN
rules, (ii) a decision tree, and (iii) a neural network.
For example, classify countries based on climate, or classify cars based
on gas mileage.
Classification Rules (IF-THEN rules)
A Decision Tree Algorithm
[Figure: a decision tree with internal nodes testing attributes such as
age (f1) and income (f2), branching to leaves labeled Class A, Class B,
and Class C]
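The IF-THEN form of such a tree can be written directly as code. The attributes (age, income) and class labels follow the tree above, but the thresholds are invented purely for illustration:

```python
# A minimal sketch of an IF-THEN rule classifier; thresholds are made up.
def classify(customer):
    if customer["age"] <= 30:          # first split on age
        return "Class A"
    elif customer["income"] > 50_000:  # second split on income
        return "Class B"
    else:
        return "Class C"

print(classify({"age": 25, "income": 20_000}))  # Class A
print(classify({"age": 40, "income": 60_000}))  # Class B
```

A decision-tree learner (e.g., ID3 or C4.5) would induce these splits from labeled training data rather than hand-coding them.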
• Semi-structured
• Markers/Tags to separate elements
• XML/HTML
• Unstructured
• No fields/attributes
• More like Human Language
• Free form text (E-mail body, notes, articles,…)
• Audio, video, and image
Examples Of Structured Data
The 4 Vs
• Volume
• Velocity
• Variety
• Veracity
Big Data Characteristics
• Retail
- Sale Transaction Analysis
BIG DATA vs. HADOOP
Understand and navigate federated big data sources
(Federated Discovery and Navigation)
• Written in Java
Benefits
• Computing Power – Distributed computing model
ideal for big data
HDFS Goals
• Detection of faults and automatic recovery
• Easily portable.
• Delivers insight: analysis with advanced in-database analytics over
petabytes of unstructured and structured operational data.
• Governs data: data quality and information-lifecycle
management.
Holistic View of Hadoop Ecosystem
Hadoop System
• It is an open-source distributed processing framework that
manages data processing and storage for big data applications
running on clustered systems.
• It is used for advanced analytics initiatives, including predictive
analytics and data mining.
• It handles various forms of structured and unstructured data,
giving users more flexibility for collecting, processing, and
analyzing data than relational databases and data warehouses
provide.
• Hadoop runs on clusters of commodity servers.
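Hadoop's distributed computing model is MapReduce. A toy, single-machine sketch of the flow (map, shuffle, reduce) for word counting; real Hadoop distributes these phases across a cluster of commodity servers:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big hadoop", "hadoop big"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the map and reduce functions run as Java tasks on different nodes, with HDFS providing the shared storage between them.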
Holistic View of Hadoop Ecosystem
Stream Computing
• Analyzes multiple data streams from many sources live, i.e.,
pulling in streams of data, processing them, and streaming the
results back out as a single flow.
• In June 2007, IBM announced its stream computing system,
called System S.
– This system runs on 800 microprocessors and the System S software
enables software applications to split up tasks and then reassemble
the data into an answer.
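The pull-process-stream-back-out idea can be sketched with Python generators; the two feeds and the `process` step here are made up for illustration:

```python
def process(record):
    """Stand-in for per-record stream processing."""
    return record.upper()

def merge_streams(*sources):
    """Interleave records from all sources into a single output flow."""
    iterators = [iter(s) for s in sources]
    while iterators:
        for it in list(iterators):       # iterate over a copy so we can remove
            try:
                yield process(next(it))
            except StopIteration:
                iterators.remove(it)     # this source is exhausted

feed_a = ["sensor:21c", "sensor:22c"]    # stands in for a live sensor feed
feed_b = ["click:/home"]                 # stands in for a clickstream feed
print(list(merge_streams(feed_a, feed_b)))
# ['SENSOR:21C', 'CLICK:/HOME', 'SENSOR:22C']
```

A real stream-computing system such as System S does this continuously over unbounded feeds and splits the processing across many processors.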
Data Warehouse
• DWs are central repositories of integrated data from one or
more disparate sources.
• They store current and historical data in a single place, used
for creating analytical reports for workers throughout the
enterprise.
• Used for reporting and data analysis.
Holistic View of Hadoop Ecosystem
Data Warehouse
• The data is processed, transformed, and ingested so that users
can access the processed data in the Data Warehouse through
Business Intelligence tools, SQL clients, and spreadsheets.
• The three main types of data warehouses are:
– Enterprise Data Warehouse
1. Classifies the data according to subject and gives access
according to those divisions.
– Operational Data Store
1. A data store required when neither the data warehouse nor OLTP systems
support the organization's reporting needs.
– Data Mart
1. A subset of the data warehouse, designed for a particular line of business
such as sales or finance.
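The data-mart idea (a subject-oriented subset of the warehouse) can be sketched as a simple filter; the table rows and the `dept` field are invented for illustration:

```python
# Toy "warehouse" table: each row tagged with its line of business.
warehouse = [
    {"dept": "sales",   "order": 101, "amount": 250},
    {"dept": "finance", "order": 102, "amount": 900},
    {"dept": "sales",   "order": 103, "amount": 120},
]

def build_data_mart(warehouse, dept):
    """Select only the rows that belong to one line of business."""
    return [row for row in warehouse if row["dept"] == dept]

sales_mart = build_data_mart(warehouse, "sales")
print(len(sales_mart))  # 2
```

In practice a data mart is populated by an ETL job that extracts and aggregates the relevant subject area, but the selection principle is the same.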
Holistic View of Hadoop Ecosystem
Data Discovery
• Data discovery is the process of breaking complex data
collections into information that users can understand and
manage.
Holistic View of Hadoop Ecosystem
Data Visualization
• Representing data in visual form. This can be particularly
useful when data need to be evaluated and decisions made
quickly.
Big Data System Management
• Monitoring and ensuring the availability of all big data
resources through a centralized interface/dashboard.
• Performing database maintenance for better results.
• Ensuring the security of big data repositories and controlling
access.
• Ensuring that data are captured and stored from all resources
as desired
Testing you?
13. Data mining can also be applied to other forms of data such as …………….
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) All of the above
• D – Durability:
This property ensures that once a transaction has completed
execution, its updates and modifications to the database are
written to disk, and they persist even if a system
failure occurs.
Limitations of Hadoop for Big Data Analytics
Issue with small files
• Hadoop is not suited to small files: HDFS is designed for large
files, and storing many small files overloads the NameNode.
No Delta Iteration
• Hadoop is not so efficient for iterative processing.
• Hadoop does not support cyclic data flow.
Limitations of Hadoop for Big Data Analytics
Latency
• In Hadoop, the MapReduce framework is comparatively slow,
since it is designed to support different formats, structures, and
huge volumes of data.
• MapReduce requires a lot of time to perform these tasks,
thereby increasing latency.
No Abstraction
• Hadoop's MapReduce provides no high-level abstraction.
• Developers therefore need to hand-code each and every
operation, which makes it very difficult to work with.
Limitations of Hadoop for Big Data Analytics
Vulnerable by Nature
• Hadoop is written entirely in Java, one of the most widely used
languages, and Java has been heavily exploited by cybercriminals.
No Caching
• Hadoop is not efficient for caching.
• In Hadoop, MapReduce cannot cache intermediate data in
memory for later use, which diminishes Hadoop's
performance.
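The cost of recomputing instead of caching can be illustrated by analogy with memoization; this is a plain-Python illustration of the principle, not Hadoop code:

```python
from functools import lru_cache

calls = {"count": 0}  # track how many times the expensive step actually runs

def expensive_step(x):
    """Stand-in for an intermediate computation (e.g., a MapReduce stage)."""
    calls["count"] += 1
    return x * x

@lru_cache(maxsize=None)
def cached_step(x):
    """Cached wrapper: recomputes only on a cache miss."""
    return expensive_step(x)

for _ in range(3):
    cached_step(4)       # computed once, then served from the in-memory cache
print(calls["count"])    # 1
```

Without the cache the step would run three times; frameworks that keep intermediate results in memory between stages avoid exactly this kind of repeated work.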