Unit-2 Pda
Data Analytics
V.IndivaruTeja
ASSISTANT PROFESSOR
CMRCET
UNIT-2
Data Analytics:
Introduction to Analytics
Introduction to Tools and Environment
Application of Modeling in Business
Databases & Types of Data and variables
Data Modeling Techniques
Missing Imputations etc.
Need for Business Modeling.
INTRODUCTION
Introduction to Analytics:
➢ Analytics is a journey that involves a combination of skills, advanced technologies, applications, and processes used by firms to gain business insights from data and statistics, in order to support business planning.
➢ Data Analytics refers to the techniques used to
analyze data to enhance productivity and business
gain.
➢ Data is extracted from various sources and is cleaned
and categorized to analyze various behavioral
patterns. The techniques and the tools used vary
according to the organization or individual.
➢Data Analytics has a key role in improving your
business as it is used to gather hidden
insights, generate reports, perform market
analysis, and improve business requirements.
➢Data analytics is the process of inspecting and transforming data to extract meaningful insights for decision making.
➢Data analytics is a scientific process of converting data into useful information for decision makers.
Role of Data Analytics:
1. Gather Hidden Insights: Hidden insights from data are gathered and then analyzed with respect to business requirements.
2. Generate Reports: Reports are generated from the data and passed on to the respective teams and individuals, who take further action to grow the business.
3. Perform Market Analysis: Market analysis can be performed to understand the strengths and weaknesses of competitors.
4. Improve Business Requirements: Analysis of data allows businesses to improve business-to-customer requirements and experience.
Data analytics vs data analysis:
Data analytics and data analysis are frequently used interchangeably, but data analysis is a subset of data analytics concerned with examining, cleansing, transforming, and modeling data to derive conclusions. Data analytics includes the tools and techniques used to perform data analysis.
Data analytics vs. data science:
Data analytics and data science are closely related. Data analytics is a component of data science, used to understand what an organization's data looks like. Generally, the outputs of data analytics are reports and visualizations. Data science takes the output of analytics to study and solve problems.
The difference between data analytics and data science is often
seen as one of timescale. Data analytics describes the current
or historical state of reality, whereas data science uses that data
to predict and/or understand the future.
Applications of Analytics
➢In today's world, data drives most modern companies.
➢To gain insight from this data, it is important to analyze it and draw out specific information that can be used to improve certain aspects of a market or of the business as a whole.
➢ There are several applications of data
analytics, and businesses are actively using
such data analytics applications to keep
themselves in the competition.
Fraud and Risk Detection: This is known as one of the earliest applications of data science, which grew out of the discipline of finance.
➢ Many organizations had bad experiences with debt and were looking for a way out. Since they already had data collected at the time their customers applied for loans, they applied data science, which eventually rescued them from further losses.
➢ This led banks to learn to divide and conquer the data from their customers' profiles, recent expenditures, and other significant information made available to them.
➢ This made it easy for them to analyze and infer the probability of a customer defaulting.
➢Policing/Security: Several cities all over the world have employed predictive analytics to predict areas that are likely to witness a surge in crime, using geographical and historical data.
➢Although it is not possible to make arrests for every crime committed, the availability of data has made it possible to place police officers in such areas at certain times of day, which has led to a drop in crime rates.
Applications of Data Analytics
1.Banking: Banks employ data analytics to manage risks, gather
insights into customer behavior, and personalize financial
services.
➢ Using data analytics, banks and credit card companies can
customize their offerings, identify potential fraud, and assess
potential client creditworthiness by analyzing customer
demographics, transaction data, and credit histories.
➢ Data analytics also helps banks spot money laundering
activities and boost regulatory compliance.
2.Cybersecurity: Data analytics is pivotal in cybersecurity by
detecting and preventing cyber-attacks and similar threats.
➢ Security systems analyze user behavior, network traffic, and
system logs to locate anomalies and possible security
breaches.
➢ By leveraging data analytics, businesses and other
organizations can proactively enhance their security
measures, detect threats, respond to them in real-time, and
protect sensitive data.
3.E-commerce: E-commerce platforms use data analytics to
understand customer behavior, optimize marketing
campaigns, and personalize shopping experiences.
➢ E-commerce companies offer personalized product
recommendations, target specific customer segments,
and improve customer satisfaction and retention by
analyzing customer preferences, purchase histories, and
browsing patterns.
4.Finance: Data analytics plays a vital role in investment
strategies, fraud detection, and risk assessment.
➢ Banks and other financial institutions analyze vast
volumes of data to predict creditworthiness, spot
suspicious transactions, and optimize their investment
portfolios.
➢ Data analytics allows finance companies to offer
personalized financial advice and develop creative
financial products and services.
5.Healthcare: Data analytics positively changes the
healthcare industry by offering better patient care,
disease prevention, and resource optimization.
➢ Hospitals can analyze patient data to spot high-risk
individuals and provide personalized treatment plans.
➢ Data analytics also helps find disease outbreaks,
monitor treatment effectiveness, and improve
healthcare operations.
6.Internet Searches: Data analytics powers Internet
search engines, letting users find relevant information
accurately and quickly.
➢ Search engines analyze colossal amounts of data (e.g.,
web pages, user queries, click-through rates) to deliver
the most relevant search results.
➢ Additionally, data analytics algorithms continuously
learn and adapt to individual user behaviors, providing
increasingly accurate and personalized search results.
7.Logistics: Data analytics is essential in managing fleet
operations, optimizing transportation routes, and
improving overall supply chain efficiency.
➢ Logistics companies can reduce delivery times, minimize costs, and improve demand forecasting, inventory management, and customer satisfaction by analyzing delivery times, routes, and vehicle performance data.
8.Manufacturing: Data analytics revolutionizes
manufacturing by optimizing production processes,
improving predictive maintenance, and enhancing
product quality.
➢ Data analytics allows manufacturers to analyze sensor
data, machine performance, and historical maintenance
records to predict equipment failures, minimize
downtime, and guarantee efficient operations.
➢ Data analytics also allows manufacturers to monitor
production lines in real time, leading to higher
productivity and savings.
9.Retail: Data analytics is changing the retail industry by optimizing
pricing strategies, offering insights into customer preferences, and
improving inventory management. Retailers can use customer
feedback, analyze sales data, and market trends to identify popular
products, personalize offers, and forecast future demand. Data
analytics helps retailers improve their marketing efforts, increase
customer loyalty, and optimize store layouts.
10.Risk Management: Risks are common in the commercial world, and
data analytics is essential in risk management across many industries,
such as finance, insurance, and project management. Organizations
can use data analysis to assess risks, develop sound mitigation
strategies, and make informed decisions by analyzing market trends,
historical data, and external factors.
11.Supply Chain Management: The recent global pandemic has cast
supply chains in a new light. Consequently, data analytics improves
supply chain management by reducing costs, optimizing inventory
levels, and enhancing overall operational efficiency. Using data
analytics, organizations can analyze supply chain data to forecast
demand, identify bottlenecks, and improve logistics and distribution
processes. Data analytics also enhances transparency throughout the
supply chain.
Data Analytics Life Cycle
The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative to reflect real projects. To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery: –
The data science team learns about and investigates the problem.
It develops context and understanding.
It comes to know about the data sources needed and available for the project.
The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation: –
Steps to explore, preprocess, and condition data prior to modeling and analysis.
It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning: –
The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models. In this phase, the data science team also plans the data sets to be used for training, testing, and production purposes.
Phase 4: Model Building: –
The team develops data sets for testing, training, and production purposes, then builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB, STATISTICA.
Phase 5: Communicate Results: –
After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
The team should identify the key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize: –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open-source tools – Octave, WEKA, SQL, MADlib.
Data Analytics Tools
Popular Data Analytics Tools
1.Excel: Microsoft Excel is one of the oldest, most well-known software
applications for data analysis. In addition to providing spreadsheet
functions to manage and organize vast data sets, Excel offers graphing
tools and computing capabilities such as automated summation
(AutoSum).
➢ Excel also includes the Analysis ToolPak, which offers data analysis tools that perform regression, variance, and statistical analysis.
➢ Furthermore, Excel’s versatility and simplicity make it a robust data
analysis tool ideal for sorting, filtering, cleaning, managing, analyzing,
and visualizing data. Every aspiring data analyst should be proficient in
Excel.
2.Jupyter Notebook: Jupyter Notebook is a web-based interactive
environment where data analysts can share computational documents
or “notebooks.”
➢ Data analysts use Jupyter Notebooks to clean data, write and run code, conduct data visualization, perform machine learning and statistical analysis, and carry out many other kinds of data analysis.
➢ Furthermore, Jupyter Notebook lets analysts merge data visualizations,
code, comments, and various programming languages into one place to
document the data analysis process better and share it with other
teams or stakeholders.
3.MySQL: MySQL is a popular open-source relational database management system (also called an RDBMS) that stores application data, primarily for web-based applications.
➢ Popular websites like Facebook, Twitter, and YouTube have used
MySQL.
➢ Structured Query Language (SQL) is most often used to manage relational database management systems, in which data is typically structured into tables.
4.Python: Python is consistently ranked as one of the most popular
programming languages in the world. Unlike other programming
languages today, Python is relatively simple to learn and can be
employed in many different tasks, including software, web
development, and data analysis.
➢ Data analysts use Python to streamline, analyze, model, and visualize
data using built-in analytics tools.
➢ Python also offers data analytics professionals access to libraries like pandas and NumPy, which provide powerful analytics-related tools.
Python is another application that new data analysts should be highly
familiar with.
5.R: R is an open-source programming language typically used for statistical computing and graphics. Like Python, R is regarded as a relatively easy-to-learn programming language.
➢ R is usually used for data visualization, statistical analysis, and data manipulation. Additionally, R's statistical focus makes it well-suited for statistical calculations.
➢ Its built-in visualization tools also make it an ideal language for creating compelling graphics such as graphs and scatter plots. Along with Python, R is one of the most essential programming languages data analysts use.
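For instance, a minimal R sketch of the kind of exploratory work described above, using R's built-in mtcars dataset (chosen here purely for illustration):
data <- mtcars                        # built-in dataset of car specifications
summary(data$mpg)                     # statistical summary of fuel efficiency
plot(data$wt, data$mpg,               # scatter plot: car weight vs. miles per gallon
     xlab = "Weight", ylab = "MPG")
cor(data$wt, data$mpg)                # correlation between the two variables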
6.Tableau: Tableau is a data visualization application for business analytics
and business intelligence.
➢ Tableau is one of the most popular data visualization platforms in
today’s business world, mainly because it boasts an easily understood
user interface and smoothly turns data sets into understandable
graphics.
➢ Business users like Tableau because it’s easy to use, and data analysts
like it because it has powerful tools that perform advanced analytics
functions like cohort analysis, predictive analysis, and segmentation.
Data Analytics Environments
a. Integrated Development Environments (IDEs)
b. Cloud Platforms
c. Collaborative Tools
a. Integrated Development Environments (IDEs)
i) Jupyter Notebook: An open-source web
application that allows you to create and share
documents that contain live code, equations,
visualizations, and narrative text. It is particularly
popular for Python and R.
ii) RStudio: An IDE for R that provides tools for data
analysis, including a console, syntax-highlighting
editor, and tools for plotting and history tracking.
iii) PyCharm: A Python IDE that supports
development for data analytics, with features like
code completion, debugging, and integrated tools
for scientific computing.
b. Cloud Platforms
i) Google Cloud Platform (GCP): Offers tools like
BigQuery for large-scale data analysis, as well as
other data storage and processing solutions.
ii) Amazon Web Services (AWS): Provides a range
of data analytics services such as Amazon
Redshift for data warehousing and AWS Glue for
ETL.
iii) Microsoft Azure: Includes Azure Synapse
Analytics for integrating big data and data
warehousing and Azure Machine Learning for
building and deploying machine learning models.
c. Collaborative Tools
i) GitHub/GitLab: Platforms for version control
and collaboration on code projects. They allow
multiple users to work on the same project
and keep track of changes.
ii) Slack/Microsoft Teams: Communication tools
that facilitate collaboration and sharing of
insights among data analysts and teams.
TYPES OF DATA ANALYTICS
1. Descriptive Analytics
What is happening?
2. Diagnostic Analytics
Why is it happening?
❖ Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question, "Why did it happen?"
❖ It is characterized by techniques such as data discovery, data mining, and correlations.
3. Predictive Analytics
What is likely to happen?
❖ With the help of predictive analysis, we can determine the likely future outcome.
❖ Based on the analysis of historical data, we are able to forecast the future.
❖ With the help of data analytics, technological advancements, and machine learning, we are able to obtain effective predictions about the future.
❖ Predictive analytics is a complex field that requires a large amount of data, skilled implementation of predictive models, and tuning of those models to obtain accurate predictions.
4. Prescriptive Analytics
What do I need to do?
❖ Prescriptive analytics builds on an understanding of what has happened, why it has happened, and a variety of "what-might-happen" analyses.
❖ It helps the user determine the best course of action to take.
❖ Prescriptive analysis is typically concerned not with one individual action but with a host of related actions.
❖ A familiar example is choosing the best route home, considering the distance of each route and the speed at which each can be travelled.
Various steps involved in Analytics:
1.Access
2.Manage
3.Analyze
4.Report
Types of Data and Variables
Data is broadly divided into Categorical and Numerical types.
Data types are an important concept in statistics that needs to be understood in order to correctly apply statistical measurements to your data and therefore to draw correct conclusions about it.
➢There are two types of variables you'll find in your data:
1. Numerical (Quantitative)
2. Categorical (Qualitative)
TYPES OF DATA
1. Quantitative data (Numerical data): It deals
with numbers and things you can measure
objectively: dimensions such as height,
width, and length.
➢ Examples include temperature and humidity, prices, area, and volume.
➢ Numerical data is information that is
measurable, and it is, of course, data
represented as numbers and not words or
text.
Numerical data can be divided into continuous or discrete
values.
a) Continuous Data: Continuous Data represents
measurements and therefore their values can’t be
counted but they can be measured.
➢ Continuous numbers are numbers that don’t have a
logical end to them.
➢ An example would be the height of a person, which you
can describe by using intervals on the real number line.
b) Discrete Data: We speak of discrete data if its values are
distinct and separate. In other words: We speak of
discrete data if the data can only take on certain values.
➢ Discrete numbers are the opposite; they have a logical
end to them.
➢ Some examples include variables for days in the month, or
number of bugs logged.
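A small R sketch of the continuous/discrete distinction (the values are illustrative only):
height_cm <- c(162.5, 170.2, 181.7)   # continuous: measured values on the real number line
bugs_logged <- c(0L, 3L, 7L)          # discrete: countable whole numbers
is.numeric(height_cm)                 # TRUE
typeof(bugs_logged)                   # "integer"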
2. Categorical Data :
➢Categorical data represents characteristics. This is
any data that isn’t a number, which can mean a
string of text or date.
➢Therefore it can represent things like a person’s
gender, language etc.
➢These variables can be broken down into nominal
and ordinal values.
a) Nominal Data :Nominal values represent
discrete units and are used to label variables,
that have no quantitative value. Nominal
value examples include variables such as
“Country” or “Marital Status”.
b) Ordinal data: Ordinal values represent discrete and ordered units. It is therefore nearly the same as nominal data, except that its ordering matters.
➢Examples of ordinal values include having a
priority on a bug such as “Critical” or “Low” or
the ranking of a race as “First” or “Third”.
c) Binary data : In addition to ordinal and
nominal values, there is a special type of
categorical data called binary. Binary data
types only have two values – yes or no.
➢This can be represented in different ways such
as “True” and “False” or 1 and 0.
➢Examples of binary variables can include
whether a person has stopped their
subscription service or not, or if a person
bought a car or not.
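A small R sketch showing the three categorical sub-types (the values are illustrative only):
country <- factor(c("India", "USA", "India"))         # nominal: labels with no order
priority <- factor(c("Low", "Critical", "Low"),
                   levels = c("Low", "Critical"),
                   ordered = TRUE)                     # ordinal: the ordering matters
bought_car <- c(TRUE, FALSE, TRUE)                     # binary: only two possible values
levels(country)                                        # "India" "USA"
priority[2] > priority[1]                              # TRUE, because Critical > Low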
Missing Imputations: In R, missing values are represented by the symbol NA (not available).
➢ Impossible values (e.g., the result of dividing zero by zero) are represented by the symbol NaN (not a number). To remove missing values from our dataset we use the na.omit() function.
For example:
➢ We can create a new dataset without the missing data as below:
newdata <- na.omit(mydata)
Or
➢ we can also pass "na.rm=TRUE" as an argument to the function; in the example below the NA is ignored and we get the desired result:
x <- c(1, 2, NA, 3)
mean(x, na.rm = TRUE)
# returns 2
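A self-contained sketch of the same two approaches on a small data frame (the data frame below is illustrative, not from the slides):
# hypothetical data frame with missing values
mydata <- data.frame(age   = c(21, 25, NA, 30),
                     marks = c(78, NA, 85, 90))
newdata <- na.omit(mydata)          # keeps only the complete rows (rows 1 and 4)
colMeans(mydata, na.rm = TRUE)      # column means computed while ignoring the NAs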
Missing Imputations (MICE Package):
➢MICE (Multivariate Imputation via Chained Equations) is one of the packages most commonly used by R users for imputation.
➢ Creating multiple imputations as compared to
a single imputation takes care of uncertainty in
missing values.
➢The mice package implements a method to
deal with missing data.
➢The MICE algorithm can impute mixes of
continuous, binary, unordered categorical and
ordered categorical data.
For example:
➢ Suppose we have X1, X2….Xk variables. If X1 has
missing values, then it will be regressed on other
variables X2 to Xk.
➢ The missing values in X1 will then be replaced by the predicted values obtained. Similarly, if X2 has missing values, then the variables X1 and X3 to Xk will be used as independent variables in the prediction model.
➢ Later, the missing values will be replaced with the predicted values. The mice package has a function known as md.pattern(). It returns a tabular summary of the missing values present in each variable of a data set.
Syntax:
imputed_Data <- mice(data, m = 5, maxit = 5, method = NULL, seed = NA)
Precisely, the methods used by this package are:
1. pmm (Predictive Mean Matching)
– For numeric variables
2. logreg (Logistic Regression)
– For binary variables (with 2 levels)
3. polyreg (Polytomous Logistic Regression)
– For factor variables (>= 2 levels)
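A minimal usage sketch, assuming the mice package is installed and using R's built-in airquality dataset (whose Ozone and Solar.R columns contain NAs) purely for illustration:
library(mice)
data <- airquality                    # built-in dataset with missing values
md.pattern(data)                      # tabular summary of the missing-value pattern
# m = 5 imputed datasets, 5 iterations of chained equations, pmm for the numeric columns
imputed_Data <- mice(data, m = 5, maxit = 5, method = "pmm", seed = 500)
completeData <- complete(imputed_Data, 1)   # extract the first completed data set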