Bda Unit 1

This document provides an introduction to the concepts of big data and analytics. It outlines the course outcomes, which include understanding big data frameworks like Hadoop and NoSQL, designing algorithms to solve data-intensive problems using MapReduce, and analyzing big data using tools like Hive and Spark. It then covers topics like the characteristics of data, different data types, sources of big data, and working with unstructured data.


UNIT-1

Introduction to Big Data


Course Outcomes
 Understand Big Data and its analytics in the real world.
 Use Big Data frameworks like Hadoop and NoSQL to efficiently store and process Big Data and generate analytics.
 Design algorithms to solve data-intensive problems using the MapReduce paradigm.
 Design and implement Big Data analytics using Pig and Spark to solve data-intensive problems and generate analytics.
 Analyse Big Data using Hive.
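The MapReduce paradigm named in the outcomes can be sketched in plain Python as a word count, the classic introductory example (a minimal, framework-free illustration; real jobs would run on Hadoop):

```python
from itertools import groupby

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all partial counts for a single key
    return (word, sum(counts))

def word_count(documents):
    # Shuffle: sort and group intermediate pairs by key, then reduce
    pairs = sorted(kv for doc in documents for kv in map_phase(doc))
    grouped = groupby(pairs, key=lambda kv: kv[0])
    return dict(reduce_phase(key, (c for _, c in group))
                for key, group in grouped)

print(word_count(["big data", "big analytics"]))
# {'analytics': 1, 'big': 2, 'data': 1}
```

The map and reduce functions contain all the problem-specific logic; the framework's job is the shuffle in between, done here with a sort and group.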
Contents
Data
Characteristics of data
Types of digital data
Sources of data
Working with unstructured data
Evolution and Definition of big data
Characteristics and Need of big data
Challenges of big data
Data
Data are raw facts and figures that on their own have
no meaning
These can be any alphanumeric characters, i.e. text, numbers, or symbols.
Data Examples
Yes, Yes, No, Yes, No, Yes, No, Yes
42, 63, 96, 74, 56, 86
51017
None of the above data sets has any meaning until it is given a CONTEXT and PROCESSED into a usable form.
Data Into Information
To achieve its aims, an organisation needs to process data into information.
Data needs to be turned into meaningful information and presented in its most useful format.
Data must be processed in a context in order to give it meaning.
Information
Data that has been processed within a context to give it
meaning
Examples:
5/10/07 The date of your final exam.
51,007 The average starting salary of an employee.
51007 Zip code of Bronson Iowa.
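The way the same raw characters become different information under different contexts can be shown directly (a toy sketch mirroring the examples above; the context labels are made up):

```python
raw = "51007"  # raw data: meaningless on its own

# The same characters, processed in three different contexts
contexts = {
    "date": f"{raw[0]}/{raw[1:3]}/{raw[3:]}",  # 5/10/07: an exam date
    "salary": f"${int(raw):,}",                # $51,007: a starting salary
    "zip": raw,                                # 51007: a ZIP code
}
for context, information in contexts.items():
    print(context, "->", information)
```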
Knowledge (insights)
Knowledge is the understanding of rules needed to
interpret information
“…the capability of understanding the relationship
between pieces of information and what to actually do
with the information”
-Debbie Jones
Examples:
With that information, the DEO can allot rooms for exams.
Faculty can prepare question papers.
Data Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
How much data?
Every Day
 NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data
 2.5 quintillion bytes of data were created every day (SG Analytics, 2020)
 Facebook stores 2.7 billion comments and Likes
 Google processes about 24 petabytes of data
Every Minute
 Facebook users share nearly 2.5 million pieces of content
 Twitter users tweet nearly 300,000 times
 Instagram users post nearly 220,000 new photos
 YouTube users upload 72 hours of new video content
 Apple users download nearly 50,000 apps
 Email users send over 200 million messages
 Amazon generates over $80,000 in online sales
 Google receives over 4 million search queries
Every Second
 Banking applications process more than 10,000 credit card transactions
Data: the treasure trove
Provides business advantages such as:
Generating product recommendations
Inventing new products
Analyzing the market
Provides early key indicators that can turn the fortunes of a business
Provides room for precise analysis: the more data we have for analysis, the greater the precision of the analysis
Types of Digital Data

Structured data
Semi structured data
Unstructured data
Structured Data
Data conforms to a pre-defined schema/structure.
Structured data has a data model.
A data model describes the type of business data that we intend to store, process, and access.
Cardinality of a relation – the number of rows/records/tuples in the relation.
Degree of a relation – the number of fields/columns.
Ex: Data stored in databases
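Cardinality and degree can be computed directly from a relation represented as rows of tuples (a sketch using made-up employee data):

```python
# A relation with a fixed schema (structured data, invented rows)
columns = ("emp_id", "name", "salary")
rows = [
    (1, "Asha", 50000),
    (2, "Ravi", 62000),
    (3, "Meena", 58000),
]

cardinality = len(rows)    # number of rows/records/tuples
degree = len(columns)      # number of fields/columns
print(cardinality, degree)
# 3 3
```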
Sources of structured data
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL
Spreadsheets
OLTP systems
Ease of working with structured data
Semi-structured data
Self-describing structure.
Does not conform to a formal data model; uses tags to segregate semantic elements.
Sources of semi-structured data
XML: Extensible Markup Language, hugely popularized by web services
Other markup languages
JSON (JavaScript Object Notation)
There is no separation between the data and the schema.
Entities belonging to the same class need not necessarily have the same set of attributes, and the order of the attributes may not be the same.
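The two properties above – schema travelling with the data, and same-class entities with differing attributes – are easy to see in JSON (a sketch using Python's standard json module; the customer records are invented):

```python
import json

# Two entities of the same "customer" class: different attribute sets,
# and the tags (keys) describing the structure live inside the data
doc = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "98765", "city": "Hyderabad"}
]
"""
customers = json.loads(doc)
for customer in customers:
    print(sorted(customer.keys()))
# ['email', 'name']
# ['city', 'name', 'phone']
```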
Unstructured data
Working with unstructured data
o Data Mining:
 Association rule mining
 Regression analysis
 Collaborative filtering
o Text analytics or Text mining: Compared to structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically.
o Natural Language Processing (NLP): Related to the area of human-computer interaction; it is about enabling computers to understand human (natural) language input.
Noisy text analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, and text messages.
Manual tagging with metadata: Tagging manually with adequate metadata to provide the requisite semantics to understand unstructured data.
Part-of-speech tagging: Also called POS, POST, or grammatical tagging. It is the process of reading text and tagging each word in a sentence with a particular part of speech such as noun, verb, etc.
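A toy version of POS tagging can be written as a lookup tagger (a hand-made lexicon, purely illustrative; real taggers use models trained on corpora, e.g. in NLTK or spaCy):

```python
# A tiny hand-made lexicon; real taggers learn this from tagged corpora
LEXICON = {
    "the": "DET", "cat": "NOUN", "sat": "VERB",
    "on": "PREP", "mat": "NOUN",
}

def pos_tag(sentence):
    # Tag each word; naively fall back to NOUN for unknown words
    return [(word, LEXICON.get(word.lower(), "NOUN"))
            for word in sentence.split()]

print(pos_tag("The cat sat on the mat"))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
#  ('on', 'PREP'), ('the', 'DET'), ('mat', 'NOUN')]
```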
Unstructured Information Management Architecture (UIMA): An open-source platform from IBM, used for real-time content analytics. It is about processing text and other unstructured data to find latent meaning and relevant relationships buried therein.
Characteristics of Data
Composition: structure, sources, granularity
Condition: the state of the data – can it be used as is, or does it require cleansing?
Context: where has this data been generated, and why?
BIG DATA
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so large that none of the traditional data management tools can store or process it efficiently. In short, big data is still data, but of enormous size.
Evolution and Definition of big data
Characteristics of big data
Big data is data that is big in:
• Volume
• Velocity
• Variety
• Value
• Veracity
Need of big data
More data  more accurate analysis  greater confidence in decision making  greater operational efficiency, cost reduction, time reduction, new product development, and optimized offerings, etc.
Challenges of big data
UNIT-1

BIG DATA
ANALYTICS
Contents
Overview of business intelligence
Data science and Analytics
Meaning and Characteristics of big data analytics
Need of big data analytics
Classification of analytics
Challenges to big data analytics
Importance of big data analytics
Basic terminologies in big data environment
What is Business Intelligence?
Business Intelligence enables the business to make intelligent, fact-based decisions.

The BI pipeline: Aggregate Data → Present Data → Enrich Data → Inform a Decision
Aggregate Data: database, data mart, data warehouse, ETL tools, integration tools
Present Data: reporting tools, dashboards, static reports, mobile reporting, OLAP cubes
Enrich Data: add context to create information (descriptive statistics, benchmarks, variance to plan)
Inform a Decision: decisions are fact-based and data-driven
Business Intelligence (BI) Tools

Data Sources
Data Warehousing
OLAP (Online Analytical Processing)
Data Mining
Regression
Predicting Customer Behavior
cloud technology
mobile BI
visual analytics.
Market Basket Analytics
Text Analytics
Customer Segmentation/Clustering
Amazon.com and Netflix
Collaborative filtering tries to predict other items a customer may want to purchase, based on what is in their shopping cart and the purchasing behaviors of other customers.
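A minimal item-based collaborative filtering sketch makes the idea concrete (co-occurrence counts over invented carts; production recommenders use far richer similarity models):

```python
from collections import Counter

# Hypothetical purchase histories of other customers
carts = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "pen"},
]

def recommend(current_cart, carts, top_n=2):
    # Count items that co-occur with anything already in the cart
    co_occurrence = Counter()
    for cart in carts:
        if cart & current_cart:               # overlaps with our cart
            for item in cart - current_cart:  # items we don't have yet
                co_occurrence[item] += 1
    return [item for item, _ in co_occurrence.most_common(top_n)]

print(recommend({"book"}, carts))  # 'lamp' ranks first (co-occurs twice)
```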
Unstructured Text Processing
[Slide diagram: unstructured text sources (Facebook pages, Twitter pages, customer satisfaction survey comments, call center notes and voice, competitors' Facebook pages, public web sites, discussion boards, product reviews, blogs, email) are mined along dimensions such as services, quality, cost, and friendliness, driving alerts, ad hoc feedback, and real-time action.]
Data Science
Data science is the science of extracting knowledge
from data
(Or)
It is a science of drawing out hidden patterns amongst
data using statistical techniques and information
technology (machine learning, data engineering,
probability models and pattern recognition)
Data science lies at the intersection of business acumen, technology expertise, and mathematics expertise.
Data Science use cases
Exploring massive datasets:
Weather predictions
Oil drillings
Seismic activities
Financial frauds
Terrorist networks and activities
Global economic impacts
Sensor logs
Social media analytics
Data Science use cases
Manufacturing
Customer churn
Market basket analytics (association rule mining)
Collaborative filtering
Regression analysis
Business Acumen Skills
Understanding of domain
Business strategy
Problem solving
Communication
Presentation
Technology expertise
Good database knowledge, such as RDBMS
Good NoSQL database knowledge, such as MongoDB, Cassandra, HBase, etc.
Programming languages such as Java, Python, C++, etc.
Open source tools such as Hadoop
Data warehousing
Data mining
Visualization, such as Tableau, Flare, Google Visualization APIs, etc.
Mathematics expertise
Mathematics
Statistics
Artificial Intelligence
Algorithms
Machine learning
Pattern recognition
Natural language processing
The Data Science process:
Collecting raw data from multiple disparate data sources
Processing the data
Integrating the data and preparing clean datasets
Engaging in exploratory data analysis using models and algorithms
Preparing presentations using data visualization
Communicating the findings to all stakeholders
Making faster and better decisions
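The steps above can be strung together as a minimal pipeline (a stdlib-only sketch with made-up sensor readings; real pipelines would pull from databases, clean with dedicated tooling, and fit real models):

```python
# 1. Collect: raw records from disparate sources (hard-coded here)
raw = ["23.5", "24.1", "bad", "22.9", ""]

# 2-3. Process and integrate: discard unparseable records, keep a clean dataset
clean = []
for record in raw:
    try:
        clean.append(float(record))
    except ValueError:
        pass  # drop noisy records

# 4. Explore: the simplest possible model, a mean baseline
mean = sum(clean) / len(clean)

# 5-6. Present and communicate the finding
print(f"{len(clean)} valid readings, mean value {mean:.1f}")
# 3 valid readings, mean value 23.5
```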
Responsibilities of a data scientist
Data management: raw data → relationships → preprocessing and further analysis
Analytical techniques: employs a blend of analytical techniques to develop models and algorithms that understand data, interpret relationships, and spot trends
Business analysis: able to apply domain knowledge to interpret results; a good presenter and communicator
Meaning and Characteristics of BDA
Need of Big Data Analytics
Reactive – Business Intelligence
Reactive-Big Data Analytics
Proactive- Analytics
Proactive-Big Data Analytics
Reactive – Business Intelligence
Business intelligence helps make faster and better decisions by providing the right information to the right people at the right time, in the right format.
It analyses historical data and displays the findings as enterprise dashboards, alerts, and notifications.
Reactive – Big Data Analytics
Analysis is done on huge datasets, but the approach is still reactive because it works on static data.
Proactive – Analytics
Futuristic decision making through data mining, predictive modeling, text mining, and statistical analysis.
Traditional database management is used on big data, so there are several limitations on storage capacity and processing.
Proactive – Big Data Analytics
Filter out relevant data (terabytes, petabytes, exabytes of information) for analysis.
High-performance analytics is used to gain rapid insights from big data and to solve complex problems using more data.
Classification of analytics
Challenges of Big Data Analytics

Storage:
RDBMS can store only structured data (rows and columns).
Ex: Employee details in a company (E.ID, E.Name, E.Salary)
Big data (the 3 Vs) is a mix of different structures. (Should we scale vertically or horizontally?)
Ex: Web logins, XML documents, and all the data coming from sensors
Contd..,
Security
RDBMS (normal data):
Authentication – the user connects to the RDBMS
Authorization – permission to perform certain actions
Data encryption
Virus control
Contd..,
NoSQL (Big Data)
Column: Cassandra, HBase
Document: Clusterpoint, Apache CouchDB
Graph: OrientDB, Apache Giraph
NoSQL stores often lack proper authentication and security mechanisms.
Ex: credit card information, personal information
Contd..,
Schema
Static schema (fixed attributes)
Ex: Student database (1000 × 15)
Dynamic schema
Ex: Online application filling
Contd..,
Continuous availability
RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
Contd..,
Partition tolerance:
How to build partition-tolerant systems that handle both hardware and software failures?
Contd..,
Data quality:
How to maintain data quality: accuracy, completeness, timeliness?
Do we have appropriate metadata in place?
Terminologies used in Big Data Environment
In-memory analytics
Data access from non-volatile storage such as a hard disk is a slow process.
The traditional workaround is to pre-process and store data (cubes, aggregate tables), which requires anticipating queries in advance.
In-memory analytics addresses this: all relevant data is stored in RAM.
Advantages: faster access, rapid deployment, better insights, and minimal IT involvement.
Vendors: QlikTech International, SAP, TIBCO Software, Information Builders, IBM, Oracle, Apache Spark.
Contd..,
In-database processing (in-database analytics)
Fuses data warehouses with analytical systems: the database itself runs the computations, eliminating the need to export data.
Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods.
Vendors: Teradata, IBM, EMC Greenplum, Sybase, ParAccel, SAS, and EXASOL.
Contd..,
Symmetric multiprocessor system(SMP)
Massively parallel processing
Difference between Parallel and Distributed
System
Contd..,
Shared Nothing Architecture
Shared memory – a common memory shared by multiple processors.
Shared disk – multiple processors share a common collection of disks while having their own private memory.
Shared nothing – neither memory nor disk is shared among multiple processors.
Advantages of Shared Nothing Architecture
Fault isolation – the failure of one node does not affect processing on the other nodes.
Scalability – reduces the burden on any one disk and increases processing speed.
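Shared-nothing scaling rests on partitioning data so each node exclusively owns a disjoint slice; a hash-partitioning sketch (the node count and keys are made up, and crc32 stands in for a real partitioner):

```python
from zlib import crc32

NUM_NODES = 4  # assumed cluster size, purely illustrative

def owner(key: str) -> int:
    # Deterministic hash so every node agrees on who owns a key;
    # each key lives on exactly one node (no shared memory or disk)
    return crc32(key.encode()) % NUM_NODES

# Route records to the nodes that exclusively own their keys
records = ["user:1", "user:2", "user:3", "order:9"]
placement = {key: owner(key) for key in records}
print(placement)

# Fault isolation: a node failure affects only the keys it owns
failed = 0
surviving = [key for key, node in placement.items() if node != failed]
```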
Contd..,
CAP Theorem (Brewer's theorem)
It states that in a distributed computing environment it is impossible to simultaneously provide all three of the following guarantees:
Consistency (every read fetches the last write)
Availability (each non-failing node returns a response in a reasonable amount of time)
Partition tolerance (the system continues to function when a network partition occurs)
Contd..,
Brewer's CAP: a distributed system can guarantee at most two of Consistency, Availability, and Partition Tolerance.
Contd..,
Examples of databases that follow one of the three possible combinations:
Availability and partition tolerance (AP)
Consistency and partition tolerance (CP)
Consistency and availability (CA)
