Bda Unit 1

This document provides an introduction to the concepts of big data and analytics. It outlines the course outcomes, which include understanding big data frameworks like Hadoop and NoSQL, designing algorithms to solve data-intensive problems using MapReduce, and analyzing big data using tools like Hive and Spark. It then covers topics like the characteristics of data, different data types, sources of big data, and working with unstructured data.


UNIT-1

Introduction to Big Data


Course Outcomes
 Understand Big Data and its analytics in the real world.
 Use Big Data frameworks like Hadoop and NoSQL to efficiently store and process Big Data and generate analytics.
 Design algorithms to solve data-intensive problems using the MapReduce paradigm.
 Design and implement Big Data analytics using Pig and Spark to solve data-intensive problems and generate analytics.
 Analyse Big Data using Hive.
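The MapReduce paradigm named in the outcomes can be sketched in plain Python as a word count, the classic introductory example (a minimal, framework-free illustration; real jobs would run on Hadoop):

```python
from itertools import groupby

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all partial counts for a single key
    return (word, sum(counts))

def word_count(documents):
    # Shuffle: sort and group intermediate pairs by key, then reduce
    pairs = sorted(kv for doc in documents for kv in map_phase(doc))
    grouped = groupby(pairs, key=lambda kv: kv[0])
    return dict(reduce_phase(key, (c for _, c in group))
                for key, group in grouped)

print(word_count(["big data", "big analytics"]))
# {'analytics': 1, 'big': 2, 'data': 1}
```

The map and reduce functions contain all the problem-specific logic; the framework's job is the shuffle in between, done here with a sort and group.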
Contents
Data
Characteristics of data
Types of digital data
Sources of data
Working with unstructured data
Evolution and Definition of big data
Characteristics and Need of big data
Challenges of big data
Data
Data are raw facts and figures that on their own have
no meaning
These can be any alphanumeric characters, i.e. text, numbers, or symbols.
Data Examples
Yes, Yes, No, Yes, No, Yes, No, Yes
42, 63, 96, 74, 56, 86
51017
None of the above data sets has any meaning until it is given a CONTEXT and PROCESSED into a usable form.
Data Into Information
To achieve its aims, an organisation needs to process data into information.
Data needs to be turned into meaningful information and presented in its most useful format.
Data must be processed in a context in order to give it meaning.
Information
Data that has been processed within a context to give it
meaning
Examples:
5/10/07 The date of your final exam.
51,007 The average starting salary of an employee.
51007 Zip code of Bronson Iowa.
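The way the same raw characters become different information under different contexts can be shown directly (a toy sketch mirroring the examples above; the context labels are made up):

```python
raw = "51007"  # raw data: meaningless on its own

# The same characters, processed in three different contexts
contexts = {
    "date": f"{raw[0]}/{raw[1:3]}/{raw[3:]}",  # 5/10/07: an exam date
    "salary": f"${int(raw):,}",                # $51,007: a starting salary
    "zip": raw,                                # 51007: a ZIP code
}
for context, information in contexts.items():
    print(context, "->", information)
```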
Knowledge (insights)
Knowledge is the understanding of rules needed to
interpret information
“…the capability of understanding the relationship
between pieces of information and what to actually do
with the information”
-Debbie Jones
Examples:
With that information, the DEO can allot rooms for exams.
Faculty can prepare question papers.
Data Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
How much data?
Every Day
 NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data
 2.5 quintillion bytes of data were created every day (SG Analytics, 2020)
 Facebook stores 2.7 billion comments and Likes
 Google processes about 24 petabytes of data
Every Minute
 Facebook users share nearly 2.5 million pieces of content
 Twitter users tweet nearly 300,000 times
 Instagram users post nearly 220,000 new photos
 YouTube users upload 72 hours of new video content
 Apple users download nearly 50,000 apps
 Email users send over 200 million messages
 Amazon generates over $80,000 in online sales
 Google receives over 4 million search queries
Every Second
 Banking applications process more than 10,000 credit card transactions
Data: the treasure trove
Provides business advantages such as:
Generating product recommendations
Inventing new products
Analyzing the market
Provides early key indicators that can turn the fortunes of a business
Provides room for precise analysis: the more data we have for analysis, the greater the precision of the analysis
Types of Digital Data

Structured data
Semi structured data
Unstructured data
Structured Data
Data conforms to a pre-defined schema/structure.
Structured data has a data model.
A data model describes the type of business data that we intend to store, process, and access.
Cardinality of a relation – the number of rows/records/tuples in the relation.
Degree of a relation – the number of fields/columns.
Ex: Data stored in databases
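Cardinality and degree can be computed directly from a relation represented as rows of tuples (a sketch using made-up employee data):

```python
# A relation with a fixed schema (structured data, invented rows)
columns = ("emp_id", "name", "salary")
rows = [
    (1, "Asha", 50000),
    (2, "Ravi", 62000),
    (3, "Meena", 58000),
]

cardinality = len(rows)    # number of rows/records/tuples
degree = len(columns)      # number of fields/columns
print(cardinality, degree)
# 3 3
```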
Sources of structured data
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL
Spreadsheets
OLTP systems
Ease of working with structured data
Semi-structured data
Self-describing structure.
Does not conform to a formal data model; uses tags to segregate semantic elements.
Sources of semi-structured data
XML: Extensible Markup Language, hugely popularized by web services
Other markup languages
JSON (JavaScript Object Notation)
There is no separation between the data and the schema.
Entities belonging to the same class need not necessarily have the same set of attributes, and the order of the attributes may not be the same.
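The two properties above – schema travelling with the data, and same-class entities with differing attributes – are easy to see in JSON (a sketch using Python's standard json module; the customer records are invented):

```python
import json

# Two entities of the same "customer" class: different attribute sets,
# and the tags (keys) describing the structure live inside the data
doc = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "98765", "city": "Hyderabad"}
]
"""
customers = json.loads(doc)
for customer in customers:
    print(sorted(customer.keys()))
# ['email', 'name']
# ['city', 'name', 'phone']
```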
Unstructured data
Working with unstructured data
o Data Mining:
 Association rule mining
 Regression analysis
 Collaborative filtering
o Text analytics or Text mining: Compared to structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically.
o Natural Language Processing (NLP): Related to the area of human-computer interaction; it is about enabling computers to understand human (natural) language input.
Noisy text analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, and text messages.
Manual tagging with metadata: Tagging manually with adequate metadata to provide the requisite semantics to understand unstructured data.
Part-of-speech tagging: Also called POS, POST, or grammatical tagging. It is the process of reading text and tagging each word in a sentence with a particular part of speech such as noun, verb, etc.
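A toy version of POS tagging can be written as a lookup tagger (a hand-made lexicon, purely illustrative; real taggers use models trained on corpora, e.g. in NLTK or spaCy):

```python
# A tiny hand-made lexicon; real taggers learn this from tagged corpora
LEXICON = {
    "the": "DET", "cat": "NOUN", "sat": "VERB",
    "on": "PREP", "mat": "NOUN",
}

def pos_tag(sentence):
    # Tag each word; naively fall back to NOUN for unknown words
    return [(word, LEXICON.get(word.lower(), "NOUN"))
            for word in sentence.split()]

print(pos_tag("The cat sat on the mat"))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
#  ('on', 'PREP'), ('the', 'DET'), ('mat', 'NOUN')]
```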
Unstructured Information Management Architecture (UIMA): An open-source platform from IBM, used for real-time content analytics. It is about processing text and other unstructured data to find latent meaning and relevant relationships buried therein.
Characteristics of Data
Composition: structure, sources, granularity
Condition: the state of the data – can it be used as is, or does it require cleansing?
Context: where has this data been generated, and why?
BIG DATA
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so large that none of the traditional data management tools can store or process it efficiently. In short, big data is still data, but of enormous size.
Evolution and Definition of big data
Characteristics of big data
Big data is data that is big in:
• Volume
• Velocity
• Variety
• Value
• Veracity
Need of big data
More data  more accurate analysis  greater confidence in decision making  greater operational efficiency, cost reduction, time reduction, new product development, and optimized offerings, etc.
Challenges of big data
UNIT-1

BIG DATA
ANALYTICS
Contents
Overview of business intelligence
Data science and Analytics
Meaning and Characteristics of big data analytics
Need of big data analytics
Classification of analytics
Challenges to big data analytics
Importance of big data analytics
Basic terminologies in big data environment
What is Business Intelligence?
Business Intelligence enables the business to make intelligent, fact-based decisions.

The BI pipeline: Aggregate Data → Present Data → Enrich Data → Inform a Decision
Aggregate Data: database, data mart, data warehouse, ETL tools, integration tools
Present Data: reporting tools, dashboards, static reports, mobile reporting, OLAP cubes
Enrich Data: add context to create information (descriptive statistics, benchmarks, variance to plan)
Inform a Decision: decisions are fact-based and data-driven
Business Intelligence (BI) Tools

Data Sources
Data Warehousing
OLAP (Online Analytical Processing)
Data Mining
Regression
Predicting Customer Behavior
cloud technology
mobile BI
visual analytics.
Market Basket Analytics
Text Analytics
Customer Segmentation/Clustering
Amazon.com and Netflix
Collaborative filtering tries to predict other items a customer may want to purchase, based on what is in their shopping cart and the purchasing behaviors of other customers.
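A minimal item-based collaborative filtering sketch makes the idea concrete (co-occurrence counts over invented carts; production recommenders use far richer similarity models):

```python
from collections import Counter

# Hypothetical purchase histories of other customers
carts = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "pen"},
]

def recommend(current_cart, carts, top_n=2):
    # Count items that co-occur with anything already in the cart
    co_occurrence = Counter()
    for cart in carts:
        if cart & current_cart:               # overlaps with our cart
            for item in cart - current_cart:  # items we don't have yet
                co_occurrence[item] += 1
    return [item for item, _ in co_occurrence.most_common(top_n)]

print(recommend({"book"}, carts))  # 'lamp' ranks first (co-occurs twice)
```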
Unstructured Text Processing
[Slide diagram: unstructured text sources (Facebook pages, Twitter pages, customer satisfaction survey comments, call center notes and voice, competitors' Facebook pages, public web sites, discussion boards, product reviews, blogs, email) are mined along dimensions such as services, quality, cost, and friendliness, driving alerts, ad hoc feedback, and real-time action.]
Data Science
Data science is the science of extracting knowledge
from data
(Or)
It is a science of drawing out hidden patterns amongst
data using statistical techniques and information
technology (machine learning, data engineering,
probability models and pattern recognition)
Data science lies at the intersection of business acumen, technology expertise, and mathematics expertise.
Data Science use cases
Exploring massive datasets:
Weather predictions
Oil drillings
Seismic activities
Financial frauds
Terrorist networks and activities
Global economic impacts
Sensor logs
Social media analytics
Data Science use cases
Manufacturing
Customer churn
Market basket analytics (association rule mining)
Collaborative filtering
Regression analysis
Business Acumen Skills
Understanding of domain
Business strategy
Problem solving
Communication
Presentation
Technology expertise
Good database knowledge, such as RDBMS
Good NoSQL database knowledge, such as MongoDB, Cassandra, HBase, etc.
Programming languages such as Java, Python, C++, etc.
Open source tools such as Hadoop
Data warehousing
Data mining
Visualization, such as Tableau, Flare, Google Visualization APIs, etc.
Mathematics expertise
Mathematics
Statistics
Artificial Intelligence
Algorithms
Machine learning
Pattern recognition
Natural language processing
The Data Science process:
Collecting raw data from multiple disparate data sources
Processing the data
Integrating the data and preparing clean datasets
Engaging in exploratory data analysis using models and algorithms
Preparing presentations using data visualization
Communicating the findings to all stakeholders
Making faster and better decisions
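The steps above can be strung together as a minimal pipeline (a stdlib-only sketch with made-up sensor readings; real pipelines would pull from databases, clean with dedicated tooling, and fit real models):

```python
# 1. Collect: raw records from disparate sources (hard-coded here)
raw = ["23.5", "24.1", "bad", "22.9", ""]

# 2-3. Process and integrate: discard unparseable records, keep a clean dataset
clean = []
for record in raw:
    try:
        clean.append(float(record))
    except ValueError:
        pass  # drop noisy records

# 4. Explore: the simplest possible model, a mean baseline
mean = sum(clean) / len(clean)

# 5-6. Present and communicate the finding
print(f"{len(clean)} valid readings, mean value {mean:.1f}")
# 3 valid readings, mean value 23.5
```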
Responsibilities of a data scientist
Data management: raw data → relationships → preprocessing and further analysis
Analytical techniques: employs a blend of analytical techniques to develop models and algorithms that understand data, interpret relationships, and spot trends
Business analysis: able to apply domain knowledge to interpret results; a good presenter and communicator
Meaning and Characteristics of BDA
Need of Big Data Analytics
Reactive – Business Intelligence
Reactive-Big Data Analytics
Proactive- Analytics
Proactive-Big Data Analytics
Reactive – Business Intelligence
Business intelligence helps make faster and better decisions by providing the right information to the right people at the right time, in the right format.
It analyses historical data and displays the findings as enterprise dashboards, alerts, and notifications.
Reactive – Big Data Analytics
Analysis is done on huge datasets, but the approach is still reactive because it works on static data.
Proactive – Analytics
Futuristic decision making through data mining, predictive modeling, text mining, and statistical analysis.
Traditional database management is used on big data, so there are several limitations on storage capacity and processing.
Proactive – Big Data Analytics
Filter out relevant data (terabytes, petabytes, exabytes of information) for analysis.
High-performance analytics is used to gain rapid insights from big data and to solve complex problems using more data.
Classification of analytics
Challenges of Big Data Analytics

Storage:
RDBMS can store only structured data (rows and columns).
Ex: Employee details in a company (E.ID, E.Name, E.Salary)
Big data (the 3 Vs) is a mix of different structures. (Should we scale vertically or horizontally?)
Ex: Web logins, XML documents, and all the data coming from sensors
Contd..,
Security
RDBMS (normal data):
Authentication – the user connects to the RDBMS
Authorization – permission to perform certain actions
Data encryption
Virus control
Contd..,
NoSQL (Big Data)
Column: Cassandra, HBase
Document: Clusterpoint, Apache CouchDB
Graph: OrientDB, Apache Giraph
NoSQL stores often lack proper authentication and security mechanisms.
Ex: credit card information, personal information
Contd..,
Schema
Static schema (fixed attributes)
Ex: Student database (1000 × 15)
Dynamic schema
Ex: Online application filling
Contd..,
Continuous availability
RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
Contd..,
Partition tolerance:
How to build partition-tolerant systems that handle both hardware and software failures?
Contd..,
Data quality:
How to maintain data quality: accuracy, completeness, timeliness?
Do we have appropriate metadata in place?
Terminologies used in Big Data Environment
In-memory analytics
Data access from non-volatile storage such as a hard disk is a slow process.
The traditional workaround is to pre-process and store data (cubes, aggregate tables), which requires anticipating queries in advance.
In-memory analytics addresses this: all relevant data is stored in RAM.
Advantages: faster access, rapid deployment, better insights, and minimal IT involvement.
Vendors: QlikTech International, SAP, TIBCO Software, Information Builders, IBM, Oracle, Apache Spark.
Contd..,
In-database processing (in-database analytics)
Fuses data warehouses with analytical systems: the database itself runs the computations, eliminating the need to export data.
Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods.
Vendors: Teradata, IBM, EMC Greenplum, Sybase, ParAccel, SAS, and EXASOL.
Contd..,
Symmetric multiprocessor system(SMP)
Massively parallel processing
Difference between Parallel and Distributed
System
Contd..,
Shared Nothing Architecture
Shared memory – a common memory shared by multiple processors.
Shared disk – multiple processors share a common collection of disks while having their own private memory.
Shared nothing – neither memory nor disk is shared among multiple processors.
Advantages of Shared Nothing Architecture
Fault isolation – the failure of one node does not affect processing on the other nodes.
Scalability – reduces the burden on any one disk and increases processing speed.
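Shared-nothing scaling rests on partitioning data so each node exclusively owns a disjoint slice; a hash-partitioning sketch (the node count and keys are made up, and crc32 stands in for a real partitioner):

```python
from zlib import crc32

NUM_NODES = 4  # assumed cluster size, purely illustrative

def owner(key: str) -> int:
    # Deterministic hash so every node agrees on who owns a key;
    # each key lives on exactly one node (no shared memory or disk)
    return crc32(key.encode()) % NUM_NODES

# Route records to the nodes that exclusively own their keys
records = ["user:1", "user:2", "user:3", "order:9"]
placement = {key: owner(key) for key in records}
print(placement)

# Fault isolation: a node failure affects only the keys it owns
failed = 0
surviving = [key for key, node in placement.items() if node != failed]
```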
Contd..,
CAP Theorem (Brewer's theorem)
It states that in a distributed computing environment it is impossible to simultaneously provide all three of the following guarantees:
Consistency (every read fetches the last write)
Availability (each non-failing node returns a response in a reasonable amount of time)
Partition tolerance (the system continues to function when a network partition occurs)
Contd..,
Brewer's CAP: a distributed system can guarantee at most two of Consistency, Availability, and Partition Tolerance.
Contd..,
Examples of databases that follow one of the three possible combinations:
Availability and partition tolerance (AP)
Consistency and partition tolerance (CP)
Consistency and availability (CA)
