
BIG DATA ANALYTICS

Unit Structure
2.0 Objectives
2.1 Introduction to big data analytics
2.2 Classification of Analytics
2.3 Challenges of Big Data
2.4 Importance of Big Data
2.5 Big Data Technologies
2.6 Data Science
2.7 Responsibilities
2.8 Soft state eventual consistency
2.9 Data Analytics Life Cycle
Summary
Review Questions

2.0 OBJECTIVES

Big Data is creating significant new opportunities for organizations to derive new value and create competitive advantage from their most valuable asset: information. For businesses, Big Data helps drive efficiency, quality, and personalized products and services, producing improved levels of customer satisfaction and profit. For scientific efforts, Big Data analytics enable new avenues of investigation with potentially richer results and deeper insights than previously available. In many cases, Big Data analytics integrate structured and unstructured data with real-time feeds and queries, opening new paths to innovation and insight.

2.1 INTRODUCTION TO BIG DATA ANALYTICS


Big Data Analytics is...
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.

2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.

3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.

4. A tight handshake between three communities: IT, business users, and data scientists. Refer Figure 3.3.

5. Working with datasets whose volume and variety exceed the current
storage and processing capabilities and infrastructure of your
enterprise.

6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today, and likely to be exabytes or zettabytes in the near future).
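To see why shipping code to data makes sense, here is a minimal, hypothetical sketch of a Hadoop-Streaming-style word-count mapper in Python: the whole program is a few hundred bytes, while the data blocks it runs against on each node may add up to terabytes.

    # Illustrative sketch only: a Hadoop-Streaming-style word-count mapper.
    # This tiny script is what gets shipped to the nodes holding the data blocks.
    import sys

    def main():
        for line in sys.stdin:                    # each node reads its local block
            for word in line.split():
                sys.stdout.write(word + "\t1\n")  # emit (word, 1); reducers sum per key

    if __name__ == "__main__":
        main()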

2.2 CLASSIFICATION OF ANALYTICS

There are basically two schools of thought:


1. Those that classify analytics into basic, operationalized, advanced and monetized analytics.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

2.2.1. First School of Thought

It includes Basic analytics, Operationalized analytics, Advanced analytics and Monetized analytics.

Basic analytics: This primarily is slicing and dicing of data to help with
basic business insights. This is about reporting on historical data, basic
visualization, etc.

Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modelling.
Monetized analytics: This is analytics in use to derive direct business revenue.

2.2.2 Second School of Thought:


Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 2.1. Figure 2.1 shows the subtle growth of analytics from Descriptive 🡪 Diagnostic 🡪 Predictive 🡪 Prescriptive analytics.

Analytics 1.0
• Era: mid-1990s to 2009.
• Descriptive statistics (report on events, occurrences, etc. of the past).
• Key questions asked: What happened? Why did it happen?
• Data from legacy systems, ERP, CRM, and 3rd-party applications.
• Small and structured data sources. Data stored in enterprise data warehouses or data marts.
• Data was internally sourced.
• Relational databases.

Analytics 2.0
• Era: 2005 to 2012.
• Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
• Key questions asked: What happened? Why will it happen?
• Big data.
• Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
• Data was often externally sourced.
• Database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0
• Era: 2012 to present.
• Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage).
• Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
• A blend of big data and data from legacy systems, ERP, CRM, and 3rd-party applications.
• A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
• Data is both internally and externally sourced.
• In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Table 2.1 Analytics 1.0, 2.0 and 3.0 (Big Data and Analytics)
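As a toy illustration of the progression in Table 2.1, the hypothetical Python sketch below computes a descriptive summary of past sales, fits a simple linear trend as a stand-in for predictive statistics, and applies a naive rule as a stand-in for a prescriptive recommendation (all figures and thresholds are made up):

    import numpy as np

    sales = np.array([120, 132, 128, 141, 150, 158.0])    # made-up monthly sales

    # Descriptive: report on what happened in the past
    print("mean:", sales.mean(), "max:", sales.max())

    # Predictive: fit a linear trend and forecast the next month
    months = np.arange(len(sales))
    slope, intercept = np.polyfit(months, sales, 1)
    forecast = slope * len(sales) + intercept
    print("forecast for next month:", round(forecast, 1))

    # Prescriptive (naive stand-in): recommend an action based on the forecast
    threshold = 155                                        # hypothetical stocking threshold
    print("action:", "increase stock" if forecast > threshold else "hold stock")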

2.3 CHALLENGES OF BIG DATA

There are mainly seven challenges of big data: scale, security, schema, continuous availability, consistency, partition tolerance and data quality.

Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is storage that can best withstand the onslaught of large volume, velocity and variety of big data. Should you scale vertically or should you scale horizontally?

Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. This cannot be ignored given that big data carries credit card information, personal information and other sensitive data.

Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is a dynamic schema; static (pre-defined) schemas are obsolete.
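As a minimal illustration of what a dynamic schema allows (plain Python dictionaries used as a stand-in for a document store such as MongoDB), two records with different fields can live side by side in the same collection without any table definition changing:

    # Hypothetical "collection": a list standing in for a schema-less document store
    collection = []

    # Two records with different shapes coexist; no predefined schema is required
    collection.append({"user": "alice", "email": "alice@example.com"})
    collection.append({"user": "bob", "clicks": 42, "last_login": "2016-05-01"})

    for doc in collection:
        print(doc.get("user"), "->", sorted(doc.keys()))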

Continuous availability: The big question here is how to provide 24/7 support, because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.

Consistency: Should one opt for consistency or eventual consistency?


Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and software failures?

Data quality: How to maintain data quality - data accuracy, completeness, timeliness, etc.? Do we have appropriate metadata in place?
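A minimal, hypothetical sketch of the kind of data-quality checks these questions imply (the field names, dates and thresholds are all made up):

    from datetime import date

    records = [
        {"id": 1, "amount": 120.0, "updated": date(2016, 5, 1)},
        {"id": 2, "amount": None,  "updated": date(2014, 1, 10)},   # incomplete and stale
    ]

    # Completeness: what fraction of records has all required fields populated?
    complete = sum(1 for r in records if r["amount"] is not None)
    print("completeness:", complete / len(records))

    # Timeliness: flag records not updated within a (made-up) two-year window
    stale = [r["id"] for r in records if (date.today() - r["updated"]).days > 730]
    print("stale record ids:", stale)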

2.4 IMPORTANCE OF BIG DATA

Let us study the various approaches to the analysis of data and what they lead to.
Reactive - Business Intelligence: What does Business Intelligence (BI) help us with? It allows businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying.

Reactive - Big Data Analytics: Here the analysis is done on huge datasets, but the approach is still reactive as it is still based on static data.

Proactive - Analytics: This is to support futuristic decision making by the use of data mining, predictive modelling, text mining, and statistical analysis. This analysis is not on big data; it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.

Proactive - Big Data Analytics: This is sifting through terabytes, petabytes, even exabytes of information to filter out the relevant data to analyze. This also includes high-performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.

2.5 BIG DATA TECHNOLOGIES

The following are the technology requirements for meeting the challenges of big data:
• The first requirement is cheap and ample storage.
• We need faster processors to help with quicker processing of big data.
• Affordable open-source, distributed big data platforms, such as Hadoop.
• Parallel processing, clustering, virtualization, large grid environments (to distribute processing to a number of machines), high connectivity, and high throughput (the rate at which something is processed).
• Cloud computing and other flexible resource allocation arrangements.

2.6 DATA SCIENCE

Data science is the science of extracting knowledge from data. In other words, it is the science of drawing out hidden patterns in data using statistical and mathematical techniques.

It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics and information technology, including machine learning, data engineering, probability models, statistical learning, pattern recognition and learning, etc.

A data scientist works on massive datasets for weather prediction, oil drilling, earthquake prediction, financial fraud, terrorist networks and activities, global economic impacts, sensor logs, social media analytics, customer churn, collaborative filtering (predicting users' interests), regression analysis, etc. Data science is multi-disciplinary. Refer to Figure 2.2.

Figure 2.2 Data Scientist (Big Data and Analytics)
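Collaborative filtering, mentioned above, is easy to sketch: the hypothetical example below predicts one user's rating for an unseen item from the ratings of similar users (the matrix and the similarity measure are illustrative choices, not a prescribed method):

    import numpy as np

    # Made-up user x item ratings matrix (0 = not yet rated)
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
    ], dtype=float)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Predict user 0's rating for item 2 from users who did rate it,
    # weighting each neighbour's rating by its similarity to user 0
    target_user, target_item = 0, 2
    sims, weighted = [], []
    for other in range(ratings.shape[0]):
        if other != target_user and ratings[other, target_item] > 0:
            s = cosine(ratings[target_user], ratings[other])
            sims.append(s)
            weighted.append(s * ratings[other, target_item])

    print("predicted rating:", round(sum(weighted) / sum(sims), 2))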

2.6.1 Business Acumen (Expertise) Skills:

A data scientist should have the following abilities to play the role of a data scientist:
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Keenness

2.6.2 Technology Expertise:


The following skills are required as far as technical expertise is concerned:
• Good database knowledge such as RDBMS.
• Good NoSQL database knowledge such as MongoDB, Cassandra,
HBase, etc.
• Programming languages such as Java, Python, C++, etc.
• Open-source tools such as Hadoop.
• Data warehousing.
• Data mining
• Visualization such as Tableau, Flare, Google visualization APIs, etc.

2.6.3 Mathematics Expertise:

The following are the key skills that a data scientist must have to comprehend, interpret and analyze data.
• Mathematics.
• Statistics.
• Artificial Intelligence (AI).
• Algorithms.
• Machine learning.
• Pattern recognition.
• Natural Language Processing.
To sum it up, the data science process is as follows (a minimal end-to-end sketch appears after the list):
• Collecting raw data from multiple different data sources.
• Processing the data.
• Integrating the data and preparing clean datasets.
• Engaging in explorative data analysis using model and algorithms.
• Preparing presentations using data visualizations.
• Communicating the findings to all stakeholders.
• Making faster and better decisions.
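A minimal, hypothetical Python sketch of that end-to-end process (the file name, column names and the "model" are assumptions made purely for illustration):

    import csv
    from statistics import mean

    # 1. Collect raw data from a (hypothetical) source file with columns month, units
    with open("sales_raw.csv", newline="") as f:
        raw = list(csv.DictReader(f))

    # 2-3. Process and integrate the data, preparing a clean dataset
    clean = [(int(r["month"]), float(r["units"])) for r in raw if r["units"]]

    # 4. Explorative analysis with a very simple model (the mean as a naive forecast)
    forecast = mean(units for _, units in clean)

    # 5-6. Prepare and communicate the findings (a report line standing in for a chart)
    print(f"{len(clean)} clean rows; naive forecast for next period: {forecast:.1f}")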

2.7 RESPONSIBILITIES

Refer to Figure 2.3 to understand the responsibilities of a data scientist.

Data Management: A data scientist employs several approaches to develop the relevant datasets for analysis. Raw data is just "RAW", unsuitable for analysis. The data scientist works on it to prepare it to reflect the relationships and contexts. This data then becomes useful for processing and further analysis.

Analytical Techniques: Depending on the business questions which we are trying to find answers to and the type of data available at hand, the data scientist employs a blend of analytical techniques to develop models and algorithms to understand the data, interpret relationships, spot trends, and reveal patterns.

Business Analysis: A data scientist is a business analyst who distinguishes cool facts from insights and is able to apply his business expertise and domain knowledge to see the results in the business context.

Figure 2.3 Data scientist: your new best friend!!!
(Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data)

Communicator: He is a good presenter and communicator who is able to communicate the results of his findings in a language that is understood by the different business stakeholders.

2.8 SOFT STATE EVENTUAL CONSISTENCY

ACID property in RDBMS:


Atomicity: Either the task (or all tasks) within a transaction is performed or none of them are. This is the all-or-none principle. If one element of a transaction fails, the entire transaction fails.

Consistency: The transaction must meet all protocols or rules defined by the system at all times. The transaction does not violate those protocols, and the database must remain in a consistent state at the beginning and end of a transaction; there are never any half-completed transactions.

Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself. This is required for both performance and consistency of transactions within a database.

Durability: Once the transaction is complete, it will persist as complete and cannot be undone; it will survive system failure, power loss and other types of system breakdowns.
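To make the ACID guarantees concrete, here is a small illustrative sketch using Python's built-in sqlite3 module as a stand-in for any RDBMS (the table and the amounts are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()

    try:
        # Both updates form one transaction: either both persist or neither does
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        conn.commit()       # durability: once committed, the transfer persists
    except sqlite3.Error:
        conn.rollback()     # atomicity: on any failure, undo the partial work

    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())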

BASE (Basically Available, Soft state, Eventual consistency): In a system where BASE is the prime requirement for reliability, the activity/potential (pH) of the data changes; it essentially slows down.

Basically Available: This constraint states that the system does guarantee the availability of the data as regards the CAP theorem; there will be a response to any request. But that response could still be 'failure' to obtain the requested data, or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.

Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves on to the next one. Werner Vogels' article “Eventually Consistent – Revisited” covers this topic in much greater detail.

Soft state: The state of the system could change over time, so even
during times without input there may be changes going on due to
‘eventual consistency,’ thus the state of the system is always ‘soft.’
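A toy simulation of soft state and eventual consistency (the node names and the propagation model are invented purely for illustration): a write is accepted by one replica immediately, a read from the other replica briefly returns stale data, and once input stops and propagation completes the replicas converge.

    replicas = {"node_a": {}, "node_b": {}}    # two hypothetical replicas
    pending = []                               # writes not yet propagated

    def write(key, value):
        replicas["node_a"][key] = value        # accepted immediately (basically available)
        pending.append((key, value))           # propagation to node_b happens later

    def read(node, key):
        return replicas[node].get(key)         # may be stale (soft state)

    def propagate():
        while pending:                         # once input stops, replicas converge
            key, value = pending.pop(0)
            replicas["node_b"][key] = value

    write("cart:42", ["book"])
    print(read("node_b", "cart:42"))    # None - a stale read before propagation
    propagate()
    print(read("node_b", "cart:42"))    # ['book'] - eventually consistent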

2.9 DATA ANALYTICS LIFE CYCLE

Here is a brief overview of the main phases of the Data Analytics Lifecycle:

Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox. ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
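A minimal sketch of the ETL step into an analytic sandbox (the source file name, its columns, and the in-memory SQLite database standing in for the sandbox are all assumptions for illustration):

    import csv
    import sqlite3

    sandbox = sqlite3.connect(":memory:")      # stand-in for the analytic sandbox
    sandbox.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

    # Extract from a hypothetical export file, transform, then load into the sandbox
    with open("orders_export.csv", newline="") as f:
        for row in csv.DictReader(f):          # expected columns: order_id, amount
            if row["amount"]:                  # transform: drop rows with missing amounts
                sandbox.execute("INSERT INTO orders VALUES (?, ?)",
                                (int(row["order_id"]), float(row["amount"])))
    sandbox.commit()
    print(sandbox.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())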

Phase 3 - Model planning: Phase 3 is model planning, where the team determines the methods, techniques and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

Figure 2.4 - Overview of the Data Analytics Lifecycle


(Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data)
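Exploring relationships between variables in Phase 3 can be as simple as checking each candidate variable's correlation with the target before shortlisting key variables. A hypothetical sketch (all observations are made up):

    import numpy as np

    # Made-up observations for three candidate variables and the target
    ad_spend = np.array([10, 12, 15, 18, 20.0])
    visits   = np.array([200, 210, 260, 300, 310.0])
    temp     = np.array([21, 25, 19, 23, 22.0])
    sales    = np.array([100, 110, 140, 160, 170.0])

    # Inspect each variable's correlation with the target to shortlist key variables
    for name, var in [("ad_spend", ad_spend), ("visits", visits), ("temp", temp)]:
        r = np.corrcoef(var, sales)[0, 1]
        print(f"{name}: r = {r:.2f}")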

Phase 4 - Model building: In Phase 4, the team develops data sets for
testing, training, and production purposes. In addition, in this phase the
team builds and executes models based on the work done in the model
planning phase. The team also considers whether its existing tools will
suffice for running the models, or if it will need a more robust
environment for executing models and workflows (for example, fast
hardware and parallel processing, if applicable).
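A minimal sketch of developing training and testing data sets before model building (plain Python; the data and the 80/20 split ratio are arbitrary illustrative choices):

    import random

    random.seed(42)
    data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]   # made-up (x, y) pairs

    random.shuffle(data)
    split = int(0.8 * len(data))           # 80/20 train/test split
    train, test = data[:split], data[split:]

    print(len(train), "training rows,", len(test), "test rows")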

Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.

