
UNIT IV

Big Data Analytics


CHAPTER 4

Syllabus Topics

Data Analytics Architecture and Life Cycle, Types of Big Data analysis, Analytical approaches, Data Analytics with Mathematical manipulations, Data Ingestion from different sources (CSV, JSON, HTML, Excel, MongoDB, MySQL, SQLite), Data cleaning, Handling missing values, Data imputation, Data transformation, Data Standardization, Handling categorical data with 2 and more categories, Statistical and graphical analysis methods, Hive Data Analytics.

4.1 Big Data Analytics Life Cycle is Divided into Nine Phases, Named as
    UQ. Explain different steps in Data Analytics project life cycle. (SPPU - Q. 5(a), Dec. 19, 8 Marks)
4.2 Types of Big Data Analytics
    UQ. Explain different types of analysis in detail with example. (SPPU - Q. 6(a), Dec. 18, 9 Marks)
4.3 Analytical Approaches
4.4 Data Analytics with Mathematical Manipulations
4.5 Data Ingestion from Different Sources (CSV, JSON, HTML, Excel, MongoDB, MySQL, SQLite)
4.6 Data Cleaning
4.7 Handling Missing Values
    UQ. How missing values and categorical variables are preprocessed before building a model? Explain with example. (SPPU - Q. 5(b), Dec. 18, 9 Marks)
4.8 Data Imputation
4.9 Data Transformation
    UQ. Explain the different modes of data transformation in big data. (SPPU - Q. 5(b), May 18, 8 Marks)
4.9.1 Benefits of Data Transformation
4.10 Data Standardization
4.11 Handling Categorical Data with 2 and More Categories
4.12 Statistical and Graphical Analysis Methods
    UQ. Explain pie chart and scatter plot. (SPPU - Q. 7(b), Dec. 19, 8 Marks)
4.13 Hive Data Analytics
    UQ. Explain Architecture of HIVE. (SPPU - Q. 6(a), Dec. 19, 8 Marks)
Chapter Ends
4.1 BIG DATA ANALYTICS LIFE CYCLE IS DIVIDED INTO NINE PHASES, NAMED AS:

UQ. Explain different steps in Data Analytics project life cycle. (SPPU - Q. 5(a), Dec. 19, 8 Marks)

Phases of Big Data Analytics

1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and Filtration
4. Data Extraction
5. Data Munging (Validation and Cleaning)
6. Data Aggregation & Representation (Storage)
7. Exploratory Data Analysis
8. Data Visualization (Preparation for Modeling and Assessment)
9. Utilization of Analysis Results

Let us discuss each phase:

Phase I : Business Problem Definition

• In this stage, the team learns about the business domain, which presents the motivation and goals for carrying out the analysis.
• In this stage, the problem is identified, and assumptions are made about how much potential gain a company will make after carrying out the analysis.
• Important activities in this step include framing the business problem as an analytics challenge that can be addressed in subsequent phases.
• It helps the decision-makers understand the business resources that will be required to be utilized, thereby determining the underlying budget required to carry out the project.
• Moreover, it can be determined whether the problem identified is a Big Data problem or not, based on the business requirements in the business case. To qualify as a big data problem, the business case should be directly related to one (or more) of the characteristics of volume, velocity, or variety.

Phase II : Data Identification

• Once the business case is identified, it is time to find the appropriate datasets to work with. In this stage, analysis is done to see what other companies have done for a similar business case.
• Depending on the business case and the scope of analysis of the project being addressed, the sources of datasets can be either external or internal to the company.
• In the case of internal datasets, the datasets can include data collected from internal sources, such as feedback forms or existing software. On the other hand, for external datasets, the list includes datasets from third-party providers.

Phase III : Data Acquisition and Filtration

• Once the source of data is identified, it is time to gather the data from such sources. This kind of data is mostly unstructured. It is then subjected to filtration, such as removal of corrupt or irrelevant data, which is of no scope to the analysis objective. Here, corrupt data means data that may have missing records, or the ones which include incompatible data types.
• After filtration, a copy of the filtered data is stored and compressed, as it can be of use in the future for some other analysis.

Phase IV : Data Extraction

• Now the data is filtered, but there might be a possibility that some of the entries of the data are incompatible. To rectify this issue, a separate phase is created, known as the data extraction phase.
• In this phase, the data which does not match the underlying scope of the analysis is extracted and transformed into a usable form.

Phase V : Data Munging

• As mentioned in Phase III, the data is collected from various sources, which results in the data being unstructured. There might be a possibility that the data has constraints that are unsuitable, which will lead to false results. Hence there is a need to clean and validate the data.
• It includes removing any invalid data and establishing complex validation rules. There are many ways to validate and clean the data.
• For example, a dataset might contain a few rows with null entries. If a similar dataset is present, then those entries are copied from that dataset; else those rows are dropped.

Phase VI : Data Aggregation & Representation

• The data is cleansed and validated against certain rules set by the enterprise. But the data might be spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the datasets are joined together.
• For example: if there are two datasets, namely that of a Student Academic section and a Student Personal Details section, then both can be joined together via common fields, i.e. roll number.
• This phase calls for intensive operations, since the amount of data can be very large. Automation can be brought into consideration, so that these things are executed without any human intervention.

Phase VII : Exploratory Data Analysis

• Here comes the actual step, the analysis task. Depending on the nature of the big data problem, analysis is carried out.
• Data analysis can be classified as Confirmatory analysis and Exploratory analysis. In confirmatory analysis, the cause of a phenomenon is analyzed beforehand. The assumption is called the hypothesis. The data is analyzed to approve or disapprove the hypothesis. This kind of analysis provides definitive answers to some specific questions and confirms whether an assumption was true or not.
• In an exploratory analysis, the data is explored to obtain information about why a phenomenon occurred. This type of analysis answers "why" a phenomenon occurred. This kind of analysis doesn't provide definitive answers; meanwhile, it provides discovery of patterns.

Phase VIII : Data Visualization

• Now we have the answers to some questions, using the information from the data in the datasets. But these answers are still in a form that can't be presented to business users.
• A sort of representation is required to obtain value or some conclusion from the analysis. Hence, various tools are used to visualize the data in graphic form, which can easily be interpreted by business users.
• Visualization is said to influence the interpretation of the results. Moreover, it allows the users to discover answers to questions that are yet to be formulated.

Phase IX : Utilization of Analysis Results

• The analysis is done, the results are visualized; now it's time for the business users to make decisions to utilize the results.
• The results can be used for optimization, to refine the business process. They can also be used as an input for systems to enhance performance.

The block diagram of the life cycle is given below.

(1D1) Fig. 4.1.1 : Life cycle of Big data analytics (block diagram showing Phase I, Business problem definition, through Phase IX, Utilization of analysis results)

It is evident from the block diagram that Phase VII, Exploratory Data Analysis, is modified successively until it is performed satisfactorily. Emphasis is put on error correction. Moreover, one can move back from Phase VIII to Phase VII if a satisfactory result is not achieved. In this manner, it is ensured that the data is analyzed properly.
4.2 TYPES OF BIG DATA ANALYTICS

UQ. Explain different types of analysis in detail with example. (SPPU - Q. 6(a), Dec. 18, 9 Marks)

• The different types of data require different analytical approaches. This gives rise to the four different types of Big data analytics.
• Big data analytics is categorized into four subcategories that are:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
• We will look at them in detail.

(1D2) Fig. 4.2.1 : Types of Big Data Analytics. Descriptive analytics: simplifies the data and summarizes past data into a readable form. Diagnostic analytics: gives a detailed and in-depth insight into the root cause of a problem. Predictive analytics: makes use of historical and present data to predict future events. Prescriptive analytics: allows a business to determine the best possible solution to a problem.

1. Descriptive Analytics

• Descriptive Analytics is considered a useful technique for uncovering patterns within a certain segment of customers. It simplifies the data and summarizes past data into a readable form.
• Descriptive analytics provides insights into what has occurred in the past, along with the trends to dig into for more detail. This helps in creating reports like a company's revenue, profits, sales, and so on.
• Examples of descriptive analytics include summary statistics, clustering, and association rules used in market basket analysis.
• An example of the use of descriptive analytics is the Dow Chemical Company. The company utilized its past data to increase its facility utilization across its offices and labs.

2. Diagnostic Analytics

• Diagnostic Analytics, as the name suggests, gives a diagnosis of a problem. It gives a detailed and in-depth insight into the root cause of a problem.
• Data scientists turn to this analytics when craving the reason behind a particular happening. Techniques like drill-down, data mining, data recovery, churn reason analysis, and customer health score analysis are all examples of diagnostic analytics.
• In business terms, diagnostic analytics is useful when you are researching the reasons leading to churn indicators and usage trends among your most loyal customers.
• A use case for diagnostic analytics can be an e-commerce company whose sales have gone down even though customers keep adding products to their carts. The possible reasons behind this problem can be: the form didn't load correctly, the shipping charges are too high, or not enough payment methods are available. Taking the help of diagnostic analytics, the company comes out with a specific reason and then works on it to resolve the issue.

3. Predictive Analytics

• Predictive Analytics, as can be discerned from the name itself, is concerned with predicting future incidents. These future incidents can be market trends, consumer trends, and many such market-related events.

• This type of analytics makes use of historical and present data to predict future events. This is the most commonly used form of analytics among businesses.
• Predictive analytics doesn't work only for the service providers but also for the consumers. It keeps track of our past activities and, based on them, predicts what we may do next.
• "The purpose of predictive analytics is NOT to tell you what will happen in the future. It cannot do that. In fact, no analytics can do that. Predictive analytics can only forecast what might happen in the future, because all predictive analytics are probabilistic in nature." (Michael Wu, Chief AI Strategist, PROS)
• Predictive analytics uses models like data mining, AI, and machine learning to analyze current data and forecast what might happen in specific scenarios.
• Examples of predictive analytics include churn risk and renewal risk analysis.

4. Prescriptive Analytics

• Prescriptive analytics is the most valuable yet underused form of analytics. It is the next step in predictive analytics.
• The prescriptive analysis explores several possible actions and suggests actions depending on the results of descriptive and predictive analytics of a given dataset.
• Prescriptive analytics is a combination of data and various business rules. The data of prescriptive analytics can be both internal (organizational inputs) and external (social media insights).
• Prescriptive analytics allows businesses to determine the best possible solution to a problem. When combined with predictive analytics, it adds the benefit of manipulating a future occurrence, like mitigating a future risk.
• Examples of prescriptive analytics for customer retention are next best action and next best offer analysis.
• A use case of prescriptive analytics can be the Aurora Health Care system. It saved $6 million by reducing the readmission rates by 10%.
• Prescriptive analytics has a good use in the healthcare industry. It can be used to enhance the clinical process of drug development, finding the right patients for trials, etc.

4.3 ANALYTICAL APPROACHES

1. Data fusion and data integration : By combining a set of techniques that analyse and integrate data from multiple sources and solutions, the insights are more efficient and potentially more accurate than if developed through a single source of data.

2. Data mining : A common tool used within big data analytics, data mining extracts patterns from large data sets by combining methods from statistics and machine learning, within database management. An example would be when customer data is mined to determine which segments are most likely to react to an offer.

3. Machine learning : Well known within the field of artificial intelligence, machine learning is also used for data analysis. Emerging from computer science, it works with computer algorithms to produce assumptions based on data. It provides predictions that would be impossible for human analysts.

4. Natural language processing (NLP) : Known as a subspecialty of computer science, artificial intelligence, and linguistics, this data analysis tool uses algorithms to analyse human (natural) language.

5. Statistics : This technique works to collect, organise, and interpret data, within surveys and experiments.

Other data analysis techniques include spatial analysis, predictive modelling, association rule learning, network analysis and many, many more. The different technologies that process, manage, and analyse this data form an entirely different and expansive field. Techniques and technologies aside, any form or size of data is valuable. Managed accurately and effectively, it similarly evolves and develops over time. A host of data can reveal business, product, and market insights.

4.4 DATA ANALYTICS WITH MATHEMATICAL MANIPULATIONS

What is Data Manipulation?

• Data Manipulation Meaning: Manipulation of data is the process of manipulating or changing information to make it more organized and readable. We use DML to accomplish this. What is meant by DML? Well, it stands for Data Manipulation Language: a language capable of adding, removing, and altering databases, i.e. changing the information into something that we can read. We can clean and map the data thanks to DML to make it digestible for expression.

Data Manipulation Examples

• Data Manipulation is the modification of information to make it easier to read or more structured. For example, a log of data may be sorted in alphabetical order, making it easier to find individual entries. On web server logs, data manipulation is also used to allow the website owner to monitor their most famous pages and their sources of traffic.
• Accounting users or related fields also manipulate information to assess the expense of a product, pricing patterns, or future tax obligations. To forecast developments in the stock market and how stocks might perform shortly, stock market analysts also use data manipulation.
• Computers can also use data manipulation to present information in a more realistic way to users, based on code in a user-defined software program, web page, or data formatting.

Purpose of Data Manipulation

• For business operations and optimization, data manipulation is a key feature. You have to be able to deal with the data in the way you need it to use data properly and turn it into valuable information, such as analyzing financial data and consumer behavior, and doing trend analysis. As such, data manipulation provides an organization with many advantages, including:
• Consistent data : Data can be structured, read, and understood better by providing it in a consistent format. You may not have a unified view when taking data from various sources, but with data manipulation commands you can make sure that the data is structured and stored consistently.
• Project data : It is paramount for organizations to be able to use historical data to project the future and provide more in-depth analysis, especially when it comes to finances. Manipulation of data makes it possible for this purpose.
• Delete or neglect redundant data : Data that is unusable is always present and can interfere with what matters.
• Overall, being able to convert, update, delete, and incorporate data into a database means you can do more with the data. Creating value from the data becomes pointless when the data remains static. But you will have straightforward insights to make better business decisions when you know how to use data to your advantage.

(A) Contrasted with a programming language

• It looks very stilted when you first look at Data Manipulation Language. For example, explaining to others how to use a built-in feature in Access is relatively straightforward compared to using DML to SELECT * FROM a table. DML, however, is not a programming language that a machine understands and operates as an implicit program; it cannot be compiled, or translated into 0s and 1s. Think of it instead as a rather sophisticated formula, as one might find in a spreadsheet. You probably use some very convoluted formulas when using a spreadsheet; DML is, simply speaking, a formula, but for using a database.

Steps Involved in Data Manipulation

When you want to get started with data manipulation, here are the steps you should take into consideration:

1. Data manipulation is feasible only if you have data to work with. You need a database, therefore, which is generated from data sources.
2. This knowledge requires reorganization and restructuring. Manipulation of data helps you to cleanse your information.
3. Import a database and create it for you to work on.
4. You can combine, erase, or merge information through manipulation.
5. When you manipulate data, data analysis becomes simple.

How Do You Manipulate Data?

• Manipulation of data in Excel, in Python, and in R are critical aspects of data manipulation. Before moving to the more profound principles of Data Manipulation in Python and R, let us now understand how to manipulate data. Most definitely, you are aware of how to use MS Excel. Here are some tips to help you manipulate Excel info:

1. Formulas and functions : Addition, subtraction, multiplication, and division are some of the basic math functions in Excel. You need to know how to use these Excel-critical features.
2. Autofill in Excel : When you want to use the same equation across several cells, this feature is useful. One way of doing it is to retype the formula. Another way is to drag the cursor to the cell's lower right corner and then downwards. It will help you simultaneously apply the same formula to several rows.
3. Sort and Filter : Users can save a lot of time when analyzing data by using the sorting and filtering options in Excel.
4. Removing duplicates : There are often chances of replication of data in the process of collecting and assimilating data. In Excel, the Delete Duplicate feature can help remove duplicate spreadsheet entries.
5. Column splitting, merging, and adding : Columns or rows in Excel may often be added or removed. Data organization often requires integrating, splitting, or combining multiple datasheet data.

4.5 DATA INGESTION FROM DIFFERENT SOURCES (CSV, JSON, HTML, EXCEL, MONGODB, MYSQL, SQLITE)

What is Data Ingestion?

• Data Ingestion is the process of streaming-in massive amounts of data into our system, from several different sources, for running analytics and other operations required by the business. Data is ingested to understand and make sense of such a massive amount of data to grow the business.

Where Does this Massive Amount of Data Come From?

• The data is primarily user-generated, generated from IoT devices and social networks; user events are recorded continually, which helps the systems evolve, resulting in a better user experience.
• There is no limit to the rate of data creation. With passing time, the rate grows exponentially. As more users use our app, our IoT device, or the product which the business offers, the data keeps growing.
• Data ingestion is just one part of a much bigger data processing system, more commonly known as handling Big Data. The data moves through a data pipeline across several different stages.
• Also, there are several different layers involved in the entire big data processing setup, such as the data collection layer, data query layer, data processing layer, data visualization layer, data storage layer, and the data security layer.

Different Ways of Ingesting Data

• Data ingestion can be done either in real-time or in batches at regular intervals. It entirely depends on the requirement of our business.
• Data ingestion in real-time is typically preferred in systems reading medical data, like heartbeat or blood pressure IoT sensors, where time is of critical importance, and in systems handling financial data like stock market events. These are a few instances where time, lives, and money are closely linked.
• On the contrary, in systems which read trends over time, for instance, estimating the popularity of a product over a period of time, we can surely ingest data in batches.

Importance of Data Ingestion

• Businesses today are relying on data. They need data to make future plans and projections. They need to understand the user's needs and behaviours. All these things enable companies to create better products, make smarter decisions, run ad campaigns, give user recommendations, and gain a better insight into the market. In short, creating value from data eventually results in more customer-centric products and increased customer loyalty.
• There are also other uses of data ingestion, such as tracking the service efficiency, or getting an everything-okay signal from the IoT devices used by millions of customers.
• Centralizing records of data streaming in from several different sources, for example for scanning logs, reduces the complexity of tracking the system as a whole. Scanning logs at one place with tools like Kibana cuts down the hassle by notches. I'll talk about the data ingestion tools up ahead in the article.

Real-world Industry & Architectural Use Cases of Data Ingestion

Here are some of the use-cases where data ingestion is required.

(i) Moving a Massive Amount of Big Data into Hadoop

• This is the primary and the most obvious use case. As discussed above, Big Data from all the IoT devices, social apps and everywhere is streamed through data pipelines and moves into the most popular distributed data processing framework, Hadoop, for analysis.

(ii) Moving Data from Databases to an Elastic Search Server

• In the past, with a few of my friends, I wrote a product-search software-as-a-service solution from scratch with Java, Spring Boot, and Elastic Search. Speaking of its design, the massive amount of product data from the legacy storage solutions of the organization was streamed, indexed and stored to the Elastic Search server. The streaming process is more technically called the Rivering of data.
• As in, drawing an analogy from how the water flows through a river, here the data moved through a data pipeline from legacy systems and got ingested into the elastic search server, enabled by a plugin specifically written to execute the task.

(iii) Log Processing, Running Log Analytics Systems

• If your project isn't a hobby project, chances are it's running on a cluster. Monolithic systems are a thing of the past. With so many microservices running concurrently, there is a massive number of logs generated over a period of time. And logs are the only way to move back in time, track errors and study the behaviour of the system.
• Now, when we have to study the behaviour of the system as a whole comprehensively, we have to stream all the logs to a central place. We ingest logs to a central server to run analytics on them with the help of solutions like the ELK stack, etc.

(iv) Stream Processing Engines for Real-Time Events

• Quick, real-time streaming data processing is key in systems handling LIVE information such as sports. It's imperative that the architectural setup in place is efficient enough to ingest data, analyse it, figure out behaviour in real time and quickly push information to the fans. After all, the whole business depends on it.

Let's talk about some of the challenges the development teams have to face while ingesting data.

Challenges Companies Face When Ingesting Data

(i) Slow Process

• Data ingestion is a slow process. How? When data is streamed from several different sources into the system, data coming from each different source has a different format, different syntax, and attached metadata. The data as a whole is heterogeneous. It has to be transformed into a common format like JSON or something similar to be understood by the analytics system.
• The conversion of data is a tedious process. It takes a lot of computing resources and time. Flowing data has to be staged at several stages in the pipeline, processed and then moved ahead. Also, at each and every stage, data has to be authenticated and verified to meet the organization's security standards. With the traditional data cleansing processes, it takes weeks, if not months, to get useful information on hand. Traditional data ingestion systems like ETL aren't that effective anymore.

(ii) Complex & Expensive

• As already stated, the entire data flow process is resource-intensive. A lot of heavy lifting has to be done to prepare the data before it is ingested into the system. Also, it isn't a side process; an entire dedicated team is required to pull off something like that.
• There are always scenarios where the tools and frameworks available in the market fail to serve your custom needs, and you are left with no option other than to write a custom solution from the ground up.
• The semantics of the data coming from external sources changes sometimes, which then requires a change in the backend data processing code too. The external IoT devices are evolving at a quick speed.
• So, these are the factors we have to keep in mind when setting up a data processing and analytics system.

(iii) Data is Vulnerable

• When data is moved around, it opens up the possibility of a breach. Moving data is vulnerable. It goes through several different staging areas, and the development team has to put in additional resources to ensure their system meets the security standards at all times.

Data Ingestion Architecture

• Data ingestion is the initial and the toughest part of the entire data processing architecture.
• The key parameters which are to be considered when designing a data ingestion solution are:
• Data Velocity, size & format : Data streams in through several different sources into the system at different speeds and sizes. Data streams from social networks, IoT devices, machines and what not. And every stream of data streaming in has different semantics. A stream might be structured, unstructured, or semi-structured.
• The frequency of data streaming : Data can be streamed in continually in real-time or in batches at regular intervals. We would need weather data to stream in continually; on the other hand, to study trends, social media data can be streamed in at regular intervals.

4.6 DATA CLEANING

• Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
• This data is usually not necessary or helpful when it comes to analyzing data, because it may hinder the process or provide inaccurate results. There are several methods for cleaning data, depending on how it is stored along with the answers being sought.
• Data cleaning is not simply about erasing information to make space for new data, but rather finding a way to maximize a data set's accuracy without necessarily deleting information.
• For one, data cleaning includes more actions than removing data, such as fixing spelling and syntax errors, standardizing data sets, correcting mistakes such as empty fields and missing codes, and identifying duplicate data points.
• Data cleaning is considered a foundational element of the data science basics, as it plays an important role in the analytical process and in uncovering reliable answers.
• Most importantly, the goal of data cleaning is to create data sets that are standardized and uniform, to allow business intelligence and data analytics tools to easily access and find the right data for each query.

Uses of Data Cleaning

• Regardless of the type of analysis or data visualizations you need, data cleaning is a vital step to ensure that the answers you generate are accurate.
• When collecting data from several different streams and with manual input from users, information can carry mistakes, be incorrectly inputted, or have gaps.
• Data cleaning helps ensure that information always matches the correct fields, while making it easier for business intelligence tools to interact with data sets to find information more efficiently. One of the most common data cleaning examples is its application in data warehouses.
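The cleaning actions described above (fixing spelling and syntax, standardizing values, removing duplicates, and flagging empty fields) can be sketched in a few lines of pandas. This is a minimal illustration, not part of the original text; the file name and column names are invented for the example.

```python
import pandas as pd

# Hypothetical input file; "city" is an assumed column for this sketch.
df = pd.read_csv("customers.csv")

# Standardize syntax: trim whitespace and normalize the case of city names.
df["city"] = df["city"].str.strip().str.title()

# Fix a known spelling variant so the category stays consistent.
df["city"] = df["city"].replace({"Bombay": "Mumbai"})

# Remove duplicate records, keeping the first occurrence.
df = df.drop_duplicates()

# Report empty fields (missing codes, blank entries) per column.
print(df.isna().sum())
```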
4.7 HANDLING MISSING VALUES

UQ. How are missing values and categorical variables preprocessed before building a model? Explain with example. (SPPU - Q. 5(b), Dec. 18, 9 Marks)
Why do missing values occur in data?

• Missing values can occur in data for a number of reasons, such as survey non-responses or errors in data entry.
• While it may seem that a missing value is a missing value, not all missing data is the same. Missing data is grouped into three broad categories:
• Missing completely at random.
• Missing at random.
• Missing not at random.

Missing completely at random (MCAR)

• Data is missing completely at random if all observations have the same likelihood of being missing. Some hypothetical examples of MCAR data include:
• Electronic time observations are missing, independent of what lane a swimmer is in.
• A scale is equally likely to produce missing values when placed on a soft surface or a hard surface (Van Buren, 2018).

Missing at random (MAR)

• When data is missing at random (MAR), the likelihood that a data point is missing is not related to the missing data, but may be related to other observed data. Some hypothetical examples of MAR data include:
• A certain swimming lane is more likely to have missing electronic time observations, but the missing data isn't directly related to the actual time.
• A scale produces more missing values when placed on a soft surface than a hard surface (Van Buren, 2018), independent of the weight.
• Childhood health assessment data is more likely to be missing in lower median income counties.

Missing not at random (MNAR)

• When data is missing not at random (MNAR), the likelihood of a missing observation is related to its values. It can be difficult to identify MNAR data because the values of the missing data are unobserved. This can result in distorted data. Some hypothetical examples of MNAR data include:
• When surveyed, people with more income are less likely to report their incomes.
• On a health survey, illicit drug users are less likely to respond to a question about illicit drug use.
• Individuals surveyed about their age are more likely to leave the age question blank when they are older.
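Before choosing a strategy, it helps to see how much data is missing in each column. A minimal pandas sketch, using a toy frame in place of a real dataset:

```python
import numpy as np
import pandas as pd

# Toy data with gaps; in practice df would come from your data source.
df = pd.DataFrame({"gnp": [np.nan, np.nan, np.nan, 116.8, 120.1],
                   "ip":  [9.8, 10.0, np.nan, 8.5, 10.0],
                   "emp": [33749, 34371, 33246, 35072, 35762]})

print(df.isna().sum())         # missing observations per column
print(df.isna().mean() * 100)  # percentage of rows missing per column
```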
Type of missing value | Description | Examples | Acceptable solutions
Missing completely at random | All observations have the same likelihood of being missing. | Electronic time observations are missing, independent of what lane a swimmer is in. A scale is equally likely to produce missing values when placed on a soft surface or a hard surface (Van Buren, 2018). Geographical location data is equally likely to be missing for all locations. | Deletion, Imputation
Missing at random | The likelihood that a data point is missing is not related to the missing data, but may be related to other observed data. | A certain swimming lane is more likely to have missing electronic time observations. A scale produces more missing values when placed on a soft surface than a hard surface (Van Buren, 2018). Childhood health assessment data is more likely to be missing in lower median income counties. | Deletion, Imputation
Missing not at random | The likelihood of a missing observation is related to its values. | When surveyed, people with more income are less likely to report their income. On a health survey, illicit drug users are less likely to respond to a question about illicit drug use. Individuals surveyed about their age are more likely to leave the age question blank when they are older. | Imputation
Dealing with missing values

• How we should deal with missing data depends both on the cause of the missing values and the characteristics of the data set. For example, we cannot deal with missing categorical data in the same manner that we deal with missing time series data.

Deletion

• When data is MCAR or MAR, deletion may be a suitable method for dealing with missing values. However, when data is MNAR, deletion of missing observations can lead to bias.
• In this section we cover three methods of data deletion for missing values:
(A) Listwise deletion (B) Pairwise deletion (C) Variable deletion
Table 4.7.1 : Methods of deletion for missing values

Method | Description | Advantages | Disadvantages
Listwise deletion | Delete all observations where the missing values occur. | Easy to implement. | Can result in biased estimates if the missing values are not MCAR. Wastes useful information. Can disrupt time series analysis by creating gaps in dates used for analysis.
Pairwise deletion | Uses all available data when computing means and covariances. | Simple to implement. Uses all available information. | Can result in biased estimates if the missing values are not MCAR. Results in different sample sizes being used for different computations. Requires that data follow a normal distribution.
Variable deletion | Eliminate a variable from analysis if it is missing a large percentage of observations. | Easy to implement. | Significant loss in information. May result in missing variable bias.
Example

Consider a small sample of data from the Nelson-Plosser macroeconomic dataset:

year | gnp | ip | emp | cpi
1906 | | 9.8 | 33749 | 28
1907 | | 10 | 34371 | 29
1908 | | | 33246 | 28
1909 | 116.8 | 8.5 | 35072 | 28
1910 | 120.1 | 10 | 35762 | 28

(A) Listwise deletion

• Uses only the observations from 1909 and 1910 for all parts of the analysis.
• It eliminates all the data in 1906-1908 because of the missing values in gnp and ip.

(B) Pairwise deletion

• Uses the observations in 1906-1910 when computing the means and covariances of emp and cpi.
• Uses the observations in 1906-1907 and 1909-1910 when computing the means and covariances of ip.
• Uses the observations in 1909-1910 when computing the means and covariances of gnp.
• Uses the observations in 1909-1910 for model estimation other than means and covariances.

(C) Variable deletion

• Uses only emp and cpi for all parts of the analysis.
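Listwise and variable deletion on the sample above can be sketched with pandas; pairwise deletion happens implicitly when pandas computes pairwise statistics such as correlations. A minimal sketch, typing the five-row sample in by hand:

```python
import numpy as np
import pandas as pd

nelson = pd.DataFrame({
    "year": [1906, 1907, 1908, 1909, 1910],
    "gnp":  [np.nan, np.nan, np.nan, 116.8, 120.1],
    "ip":   [9.8, 10.0, np.nan, 8.5, 10.0],
    "emp":  [33749, 34371, 33246, 35072, 35762],
    "cpi":  [28, 29, 28, 28, 28],
})

# (A) Listwise deletion: keeps only 1909-1910, the fully observed rows.
listwise = nelson.dropna()

# (B) Pairwise deletion: corr() uses all available pairs per column pair,
# so emp-cpi uses 1906-1910 while anything involving gnp uses 1909-1910.
pairwise_corr = nelson[["gnp", "ip", "emp", "cpi"]].corr()

# (C) Variable deletion: drop the badly affected columns entirely.
variable_deleted = nelson.drop(columns=["gnp", "ip"])
```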
4.8 DATA IMPUTATION

• Imputing data replaces missing values with statistically determined values. Methods of imputation can vary from simply replacing missing values with the mean to sophisticated multiple imputation processes.
• Which method of imputation should be used depends on the characteristics of the data.
Table 4.8.1 : Examples of imputation methods

Method | Description | Advantages | Disadvantages
Replacement with mean, median, or mode | All missing values are replaced with the variable mean, median, or mode. | Easy to implement. | Distorts the distribution of the data. Reduces data variance. Results in biased estimates.
Linear regression | Missing values are predicted using a linear model and the other variables in the dataset. | Simple to implement. Uses all available information. | Biased correlations between variables. Underestimated variability. Falsely strengthens relationships between variables.
Last observation carried forward (LOCF) | Used for time series data. Use the last observed value as a replacement for missing data. | Appropriate for time series data. Easy implementation. | Can result in biased estimates even when data is MCAR. Modeling techniques should address that data has been imputed by LOCF. May incorrectly suggest stability across time stretches if used to fill successions of missing data.
Predictive mean matching | Replacements for missing values are drawn randomly from a group of nearby candidate values. | Appropriate for time series data. Easy to use and versatile. Robust to data transformations. Valid for discrete data. Will always produce replacements within the observed data range. | May duplicate values, especially if the sample is small. Not suited for small samples, skewed data, or sparse data. Cannot be used to extrapolate beyond the range of the data.
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation".
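Mean replacement and LOCF from Table 4.8.1 can be sketched with pandas. This is a minimal illustration on the ip series from the Nelson-Plosser sample used earlier; as the table warns, mean replacement reduces the variance of the series.

```python
import numpy as np
import pandas as pd

series = pd.Series([9.8, 10.0, np.nan, 8.5, 10.0],
                   index=[1906, 1907, 1908, 1909, 1910], name="ip")

# Replacement with the mean: every gap gets the same statistically
# determined value.
mean_imputed = series.fillna(series.mean())

# Last observation carried forward (LOCF) for time series data:
# 1908 receives the 1907 value.
locf_imputed = series.ffill()

print(mean_imputed, locf_imputed, sep="\n")
```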
4.9 DATA TRANSFORMATION

UQ. Explain the different modes of data transformation in big data. (SPPU - Q. 5(b), May 18, 8 Marks)

• Data transformation is the process of changing the format, structure, or values of data.
• For data analytics projects, data may be transformed at two stages of the data pipeline. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, load) process, in which data transformation is the middle step.
• Today, most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes. The scalability of the cloud platform lets organizations skip preload transformations and load raw data into the data warehouse, then transform it at query time: a model called ELT (extract, load, transform).
• Processes such as data integration, data migration, data warehousing, and data wrangling all may involve data transformation.
• Data transformation may be constructive (adding, copying, and replicating data), destructive (deleting fields and records), aesthetic (standardizing salutations or street names), or structural (renaming, moving, and combining columns in a database).
• An enterprise can choose among a variety of ETL tools that automate the process of data transformation. Data analysts, data engineers, and data scientists also transform data using scripting languages such as Python or domain-specific languages like SQL.
• Data transformation facilitates compatibility between applications, systems, and types of data. Data used for multiple purposes may need to be transformed in different ways.

4.9.1 Benefits of Data Transformation

• Transforming data yields several benefits:
• Data is transformed to make it better organized. Transformed data may be easier for both humans and computers to use.
• Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.

4.10 DATA STANDARDIZATION

• Data Standardization is a data processing workflow that converts the structure of disparate datasets into a Common Data Format. As part of the Data Preparation field, Data Standardization deals with the transformation of datasets after the data is pulled from source systems and before it's loaded into target systems. Because of that, Data Standardization can also be thought of as the transformation rules engine in Data Exchange operations.
• Data Standardization enables the data consumer to analyze and use data in a consistent manner. Typically, when data is created and stored in the source system, it's structured in a particular way that is often unknown to the data consumer.
• Moreover, datasets that might be semantically related may be stored and represented differently, thereby making it difficult for a data consumer to aggregate or compare the datasets.

Data Standardization Use Cases

There are two main use case categories in Data Standardization: Source-to-Target Mapping, and Complex Data Reconciliation. We typically divide the former into two sub-categories, thereby arriving at three use cases:

• Simple mapping from external sources : This use case handles on-boarding data from systems that are external to the organization, and mapping its keys and values to an output schema.
• Simple mapping from internal sources : This use case involves handling internal datasets that are based on inconsistent definitions and transforming them into a single trustworthy data set for the entire organization.
• Complex reconciliation : This use case involves the creation of complex calculated metrics that provide their own semantics based on defined business logic.
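A source-to-target mapping of the simple kind described above can be sketched in a few lines of pandas: keys (column names) and values (category codes) from a hypothetical external feed are mapped onto an assumed output schema. All names and codes here are invented for the illustration.

```python
import pandas as pd

# Hypothetical external feed with its own keys and value codes.
external = pd.DataFrame({"cust_nm": ["Asha", "Ravi"],
                         "cntry_cd": ["IN", "UK"]})

# Standardization rules: rename keys and translate value codes
# into the organization's Common Data Format.
key_map = {"cust_nm": "customer_name", "cntry_cd": "country"}
value_map = {"IN": "India", "UK": "United Kingdom"}

standardized = external.rename(columns=key_map)
standardized["country"] = standardized["country"].map(value_map)
print(standardized)
```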
4.11 HANDLING CATEGORICAL DATA WITH 2 AND MORE CATEGORIES

• Categorical data is simply information aggregated into groups rather than being in numeric formats, such as Gender, Sex or Education Level. It is present in almost all real-life datasets, yet the current algorithms still struggle to deal with it.
• Take, for instance, XGBoost or most SKlearn models. If you try and train them with categorical data, you'll immediately get an error.
• Currently, many resources advertise a wide variety of solutions that might seem to work at first, but are deeply wrong once thought through. This is especially true for non-ordinal categorical data, meaning that the classes are not ordered (as they might be for Good = 0, Better = 1, Best = 2). A bit of clarity is needed to distinguish the approaches that Data Scientists should use from those that simply make the models run.

What Not To Do

(I) Label Encoding

• One of the simplest and most common solutions advertised to transform categorical variables is Label Encoding. It consists of substituting each group with a corresponding number and keeping such numbering consistent throughout the feature.

Categorical Feature | Label Encoding
United States | 1
United States | 1
France | 2
Germany | 3
United Kingdom | 4
France | 2

Example of Label Encoding

• This solution makes the models run, and it is the one most commonly used by aspiring Data Scientists. However, its simplicity comes with many issues.

(II) Distance and Order

• Numbers hold relationships. For instance, four is twice two, and, when converting categories into numbers directly, these relationships are created despite not existing between the original categories. Looking at the example before, United Kingdom becomes twice France, and France plus United States equals Germany. Well, that's not exactly right...
• This is especially an issue for algorithms, such as K-Means, where a distance measure is calculated when running the model.

Solutions

(III) One-Hot Encoding

• One-Hot Encoding is the most common, correct way to deal with non-ordinal categorical data. It consists of creating an additional feature for each group of the categorical feature and marking each observation as belonging (Value = 1) or not (Value = 0) to that group.

United States | France | Germany | United Kingdom
1 | 0 | 0 | 0
1 | 0 | 0 | 0
0 | 1 | 0 | 0
0 | 0 | 1 | 0
0 | 0 | 0 | 1
0 | 1 | 0 | 0

Example of One-Hot Encoding

• This approach is able to encode categorical features properly, despite some minor drawbacks. Specifically, the presence of a high number of binary values is not ideal for distance-based algorithms, such as Clustering models.
• In addition, the high number of additionally generated features introduces the curse of dimensionality. This means that, due to the now high dimensionality of the dataset, the dataset becomes much more sparse.
• In other words, in Machine Learning problems, you'd need at least a few samples per each feature combination. Increasing the number of features means that we might encounter cases of not having enough observations for each feature combination.
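The one-hot table above can be produced directly with pandas.get_dummies; a minimal sketch of both encodings on the same country column (scikit-learn's LabelEncoder and OneHotEncoder would be the equivalents inside a modeling pipeline):

```python
import pandas as pd

countries = pd.Series(["United States", "United States", "France",
                       "Germany", "United Kingdom", "France"])

# Label Encoding: one integer per category (the problematic approach,
# since it invents an order and distances between countries).
label_encoded = countries.astype("category").cat.codes

# One-Hot Encoding: one binary column per category.
one_hot = pd.get_dummies(countries)
print(one_hot)
```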
(IV) Target Encoding

• A lesser known, but very effective way of handling categorical variables is Target Encoding. It consists of substituting each group in a categorical feature with the average response in the target variable. (The target values below are a consistent reconstruction: the per-country averages of 0.40, 0.50, and 0.67 from the source constrain them.)

Country | Target Variable | Target Encoding
United States | 1 | 0.40
Germany | 1 | 0.50
United States | 0 | 0.40
United States | 1 | 0.40
France | 1 | 0.67
Germany | 0 | 0.50
United States | 0 | 0.40
France | 1 | 0.67
United States | 0 | 0.40
France | 0 | 0.67

Example of Target Encoding

• The process to obtain the Target Encoding is relatively straightforward and it can be summarised as:
1. Group the data by category.
2. Calculate the average of the target variable per each group.
3. Assign the average to each observation belonging to that group.

This can be achieved in a few lines of code:

    # Mean of the target variable per country
    encodings = data.groupby('Country')['Target Variable'].mean().rename('Target Encoding').reset_index()
    # Attach the per-country mean to every observation, then drop the raw category
    data = data.merge(encodings, how='left', on='Country')
    data.drop('Country', axis=1, inplace=True)

• Alternatively, we can also use the category_encoders library and its TargetEncoder functionality.
• Target Encoding is a powerful solution also because it avoids generating a high number of features, as is the case for One-Hot Encoding, keeping the dimensionality of the dataset the same as the original one.

4.12 STATISTICAL AND GRAPHICAL ANALYSIS METHODS

UQ. Explain pie chart and scatter plot. (SPPU - Q. 7(b), Dec. 19, 8 Marks)

Techniques based on Mathematics and Statistics

• Descriptive Analysis : Descriptive Analysis considers the historical data and Key Performance Indicators, and describes the performance based on a chosen benchmark. It takes into account past trends and how they might influence future performance.
• Dispersion Analysis : Dispersion is the area onto which a data set is spread. This technique allows data analysts to determine the variability of the factors under study.
• Regression Analysis : This technique works by modeling the relationship between a dependent variable and one or more independent variables. A regression model can be linear, multiple, logistic, ridge, non-linear, life data, and more.
• Factor Analysis : This technique helps to determine if there exists any relationship between a set of variables. This process reveals other factors or variables that describe the patterns in the relationship among the original variables. Factor Analysis leaps forward into useful clustering and classification procedures.
• Discriminant Analysis : It is a classification technique in data mining. It identifies the different points on different groups based on variable measurements. In simple terms, it identifies what makes two groups different from one another; this helps to identify new items.
• Time Series Analysis : In this kind of analysis, measurements are spanned across time, which gives us a collection of organized data known as a time series.

Techniques based on Visualization and Graphs

• Column Chart, Bar Chart : Both charts are used to present the numerical differences between categories. The column chart takes the height of the columns to reflect the differences. Axes interchange in the case of the bar chart.
• Line Chart : This chart represents the change of data over a continuous interval of time.
• Area Chart : This concept is based on the line chart. It also fills the area between the polyline and the axis with color, representing better trend information.
• Pie Chart : It is used to represent the proportion of different classifications. It is only suitable for one series of data. However, it can be made multi-layered to represent the proportion of data in different categories.
• Funnel Chart : This chart represents the proportion of each stage and reflects the size of each module. It helps in comparing rankings.
• Word Cloud Chart : It is a visual representation of text data. It requires a large amount of data, and the degree of discrimination needs to be high for users to perceive the most prominent items. It is not a very accurate analytical technique.
• Gantt Chart : It shows the actual timing and the progress of the activity compared to the requirements.
• Radar Chart : It is used to compare multiple quantized charts. It represents which variables in the data have higher values and which have lower values. A radar chart is used for comparing classification and series along with proportional representation.
• Scatter Plot : It shows the distribution of variables as points over a rectangular coordinate system. The distribution of the data points can reveal the correlation between the variables.
• Bubble Chart : It is a variation of the scatter plot. Here, in addition to the x and y coordinates, the area of the bubble represents the 3rd value.
• Gauge : It is a kind of materialized chart. Here the scale represents the metric, and the pointer represents the dimension. It is a suitable technique to represent interval comparisons.
• Frame Diagram : It is a visual representation of a hierarchy in an inverted tree structure.
• Rectangular Tree Diagram : This technique is used to represent hierarchical relationships, but at the same level. It makes efficient use of space and represents the proportion represented by each rectangular area.
• Regional Map : It uses color to represent value distribution over a map partition.
• Point Map : It represents the geographical distribution of data as points on a geographical background. When the points are all the same size, it is meaningless for single data; but if the points are drawn as bubbles, it also represents the size of the data in each region.
• Flow Map : It represents the relationship between an inflow area and an outflow area, as a line connecting the geometric centers of gravity of the spatial elements. The use of dynamic flow lines helps reduce visual clutter.
• Heat Map : This represents the weight of each point in a geographic area. The color here represents the density.
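Since the university question above asks specifically about the pie chart and the scatter plot, here is a minimal matplotlib sketch of both; the category shares and the x-y points are made-up sample data.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Pie chart: proportion of different classifications (one data series).
shares = [45, 30, 15, 10]
labels = ["Electronics", "Clothing", "Grocery", "Books"]
ax1.pie(shares, labels=labels, autopct="%1.0f%%")
ax1.set_title("Sales share by category")

# Scatter plot: distribution of two variables; the shape of the point
# cloud hints at the correlation between them.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.0]
ax2.scatter(x, y)
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_title("Positively correlated variables")

plt.tight_layout()
plt.show()
```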
4.13 HIVE DATA ANALYTICS

UQ. Explain Architecture of HIVE. (SPPU - Q. 6(a), Dec. 19, 8 Marks)

What is Hive?

• Hive, originally developed by Facebook and later owned by Apache, is a data storage system that was developed with a purpose to analyze organized data. Working under an open-source data platform called Hadoop, Apache Hive is an application system that was released in the year 2010 (October).
• Introduced to facilitate fault-tolerant analysis of hefty data on a regular basis, Hive has been used in big data analytics and has been popular in the realm for more than a decade now.
• Even though it has many competitors like Impala, Apache Hive stands apart from the rest of the systems due to its fault-tolerant nature in the process of data analysis and interpretation.

Understanding Hive in Big Data

• Apache Hive is a particularly efficient tool when it comes to big data (exponential data that is to be analyzed). As a data warehouse software that supports the analysis process of big data on a regular basis, the concept of Hive big data is quite popular in the technological realm.
• As data is stored in the Apache Hadoop Distributed File System (HDFS), wherein data is organized and structured, Apache Hive helps in processing this data and analyzing it, producing data-driven patterns and trends. Fit to be used by organizations or institutions, Apache Hive is extremely helpful in big data and its ever-changing growth.
• The concept of Structured Query Language or SQL software is involved in the process, which communicates with numerous databases and collects the required data. Understanding Hive big data through the lens of data analytics can help us get more insights into the working of Apache Hive.
• By using a batch processing sequence, Hive generates data analytics in a much easier and organized form that also requires less time as compared to traditional tools. HiveQL is a language similar to SQL that interacts with the Hive database across various organizations and analyses the necessary data in a structured format.

Why do we need it?

• Hive in big data is a milestone innovation that has eventually led to data analysis on a large scale. Big organizations need big data to record the information that is collected over time.
• To produce data-driven analysis, organizations gather data and use such software applications to analyze the data. This can be done, with Apache Hive, for reading, writing, and managing information that has been stored in an organized form. Ever since data analytics came into being, storage of data has been a trending topic.
• Even though small-scale organizations were able to manage medium-sized data and analyze it with traditional data analytics tools, big data could not be managed with such applications, and so there was a dire need for advanced software.
• As data collection became a daily task and organizations expanded in all aspects, data collection became exponential and vast. Furthermore, data began to be dealt with in petabytes, which defines vast storage of data. For this, organizations needed hefty equipment, and perhaps that is the reason why the release of a software like Apache Hive was necessary. Thus, Apache Hive was released with the purpose of analyzing big data and producing data-driven analogies.
Chapter Ends..
UNIT V

Big Data Visualization

CHAPTER 5

Syllabus Topics

Introduction to Data visualization, Challenges to Big data visualization, Conventional data visualization tools, Techniques used in Big Data visualization, Visualizing Big Data, Tools used in data visualization, Types of data visualization, Data representations, Data visualization using Tableau, Propriety data visualization tools, Open-source data visualization tools, Analytical techniques used in Big data visualization, Introduction to: Candela, D3.js, Google Chart API, Case Study: Analysis of visualization problem using Zomato.

UQ. What is the need of data visualization? Also explain the challenges to data visualization and how to overcome these challenges. (Dec. 18, 8 Marks)
