BA1 Introduction 2025

The document provides an overview of business analytics, emphasizing its role in transforming data into actionable insights for decision-making. It covers various statistical concepts, including types of data, descriptive vs. inferential statistics, and levels of measurement, as well as data preparation techniques and visualization methods. Additionally, it discusses the significance of big data and the importance of understanding relationships between variables through correlation and covariance.

Uploaded by

Shourya

Roshny Unnikrishnan, Ph.D.

Adjunct professor - Analytics

Business
Analytics
INTRODUCTION
Analytics overview
Business Analytics in practice
Measurement and scaling
Calculations
◦ Mean, median, mode, geometric mean
◦ Standard deviation and CV, variance, range
◦ Z score – identifying outliers
◦ Scatter charts – covariance – correlation coefficient
Analytics overview
BUSINESS ANALYTICS IN PRACTICE
Definition
Business analytics:
◦ Scientific process of transforming data into insight for making
better decisions.
◦ Used for data-driven or fact-based decision making, which is often
seen as more objective than other alternatives for decision making.
The Spectrum of Business Analytics

Data
Types of data
◦ Nature of variation: qualitative, quantitative (discrete, continuous)
◦ Timeframe: cross-sectional, longitudinal
◦ Source of data collection: primary, secondary
◦ Presentation: grouped, ungrouped
◦ Variable type: nominal, ordinal, interval, ratio
Data, Data Sets, Elements,
Variables, and Observations

n is the number of observations in the sample (rows in this case)
e.g., n = 10 when I select 10 students from sections A/B
N is the number of values in the population or universal set
e.g., N applies when my population is all students from sections A/B
Frequency – The number of times an observation occurred or was recorded in an experiment.
Descriptive v/s Inferential
Data are compilations of facts, figures, or other contents, both numerical and non-numerical.

Statistics is the science that deals with the collection, preparation, analysis, interpretation, and presentation of data.

There are two branches of statistics: descriptive and inferential statistics.

Descriptive statistics refers to the summary of important aspects of a data set.
◦ Includes collecting, organizing, and presenting the data in the form of charts and tables.
◦ Often involves calculating numerical measures (typical value, variability).
Population and sample
Inferential statistics refers to drawing conclusions about a larger set of
data (population) based on a smaller set of data (sample).
• A population consists of all items/members of interest.
• A sample is a subset of the population.
We rely on sample data to make inferences about various characteristics
of the population.
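The relationship between N, n, and inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 60 marks is randomly generated and hypothetical, not data from the course.

```python
import random
import statistics

# Hypothetical population: marks of all N = 60 students in sections A and B.
random.seed(42)  # fixed seed so the sketch is repeatable
population = [random.randint(30, 100) for _ in range(60)]
N = len(population)

# Draw a simple random sample of n = 10 students, without replacement.
sample = random.sample(population, 10)
n = len(sample)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # used to infer the population mean

print(f"N = {N}, population mean = {population_mean:.2f}")
print(f"n = {n}, sample mean = {sample_mean:.2f}")
```

The sample mean will generally differ from the population mean; inferential statistics is about quantifying how far off it is likely to be.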
Structured/Unstructured

Structured data
◦ Reside in a pre-defined, row-column format.
◦ Spreadsheet or database applications.
◦ Enter, store, query, and analyze.
◦ Numerical information that is objective and not open to interpretation.
Structured/Unstructured

Today, only about 20% of all data used in business decisions is structured.
Unstructured data
◦ Do not conform to a pre-defined, row-column format.
◦ Textual and multimedia content.
◦ Do not conform to database structures.
◦ These data may have some implied structure, but are still considered unstructured because they do not conform to the row-column model required in most database systems.
◦ Example: social media data such as Twitter, YouTube, Facebook, and blogs.
Timeseries data
◦ A sequential organization of data according to their time of occurrence is termed a time series.
◦ Time series data are a set of measurements taken at constant intervals of time; here time acts as the independent variable, and the characteristic whose changes we want to study is the dependent variable.

https://www.analyticssteps.com/blogs/introduction-time-series-analysis-time-series-forecasting-machine-learning-methods-models
Cross sectional data
The key difference between time series and cross-sectional data is that time series data follow the same variable over a period of time, while cross-sectional data cover several variables at the same point in time.
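A small Python sketch may make the distinction concrete. The store names and sales figures are hypothetical, and the cross-section here is read as the sales of several stores in one month.

```python
# Hypothetical monthly sales (in units) for three stores; all values are made up.
sales = {
    "Store A": {"2025-01": 120, "2025-02": 135, "2025-03": 150},
    "Store B": {"2025-01": 90, "2025-02": 95, "2025-03": 110},
    "Store C": {"2025-01": 200, "2025-02": 190, "2025-03": 210},
}

# Time series: one variable (Store A sales) followed over successive months.
time_series = [sales["Store A"][month] for month in sorted(sales["Store A"])]

# Cross-section: several stores observed at the same point in time (2025-02).
cross_section = {store: by_month["2025-02"] for store, by_month in sales.items()}

print(time_series)
print(cross_section)
```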
Big Data

Businesses generate and gather more and more data at an increasing pace: Big Data.
◦ A massive volume of structured and unstructured data
◦ Extremely difficult to manage, process, and analyze using traditional data processing tools
◦ Presents great opportunities to gain knowledge and game-changing intelligence

“High-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation” (www.gartner.com).
Big data
There are three characteristics of big data.
◦ Volume: immense amount of data compiled from a single source or multiple sources
◦ Velocity: generated at a rapid speed; management is a critical issue
◦ Variety: all types, forms, granularity, structured or unstructured
Additional characteristics
◦ Veracity: credibility and quality of the data, reliability
◦ Value: methodological plan for formulating questions, curating the right data and unlocking hidden potential
Having a plethora of data does not guarantee that useful insights or
measurable improvements will be generated.
Type of data by variable

MEASUREMENT AND SCALING: NOMINAL, ORDINAL, INTERVAL, RATIO
https://medium.com/@rndayala/data-levels-of-measurement-4af33d9ab51a
Nominal scale
The lowest measurement level you can use, from a statistical point of view, is the nominal level.
The nominal scale is where numbers are used only as labels to classify objects.
E.g.: Area
◦ Rural
◦ Urban
Gender
◦ Male
◦ Female
◦ Do not want to specify
Select your occupation
◦ Student
◦ Self employed
◦ Entrepreneur
◦ Govt service
◦ Professional
◦ Others_______________
An ordinal scale can be used to rank consumers' preference among car brands on a scale of 1 to 5, where 1 is the highest. Each of the three tables below lists a brand's ranks on three factors, followed by the total.

Brand      Rank 1  Rank 2  Rank 3  Total
Maruti     5       4       5       14
Hyundai    2       1       2       5
Mahindra   4       3       4       11
Toyota     3       2       3       8
Kia        1       5       1       7

Brand      Rank 1  Rank 2  Rank 3  Total
Maruti     5       1       5       11
Hyundai    3       3       2       8
Mahindra   4       4       3       11
Toyota     2       2       4       8
Kia        1       5       1       7

Brand      Rank 1  Rank 2  Rank 3  Total
Maruti     5       1       5       11
Hyundai    4       3       2       14
Mahindra   3       2       4       9
Toyota     1       4       3       8
Kia        2       5       1       8

Brand      Overall  Rank
Maruti     35       5
Hyundai    27       3
Mahindra   31       4
Toyota     24       2
Kia        22       1
Interval Scale
The standard survey rating scale is an interval scale.
E.g.: the factors that were ranked for the ordinal scale can instead be rated on an interval scale, as follows (semantic).
The 5 brands can be rated on a scale of 1 to 5 for the aesthetics factor, where 1 is the least and 5 the highest.
Affordability (semantic): Maruti, Hyundai, Tata, Chevy, Fiat
Affordability (numerical):
Maruti     1 2 3 4 5
Hyundai    1 2 3 4 5
Tata       1 2 3 4 5
Chevy      1 2 3 4 5
Fiat       1 2 3 4 5
Ratio scale

The factor that clearly defines a ratio scale is that it has a true zero point.
Any numerical data on actuals – sales.
In this approach, each respondent is shown the five different types of cars and asked “how much would you be willing to pay for this brand of car?”
At the end of the research the data is collated, and we might find that respondents are willing to pay 10% more for Maruti over Hyundai, and 15% more for Hyundai over Fiat.

Market share, hatchback cars (100)
Maruti          40/100
Hyundai         20/100
Tata            20/100
VW Polo         10/100
Renault Kwid    10/100
Levels of measurement
Nominal
◦ Basic operations: determination of equality
◦ Number system: unique definition (1, 2)
◦ Typical usage: classifications of any kind
◦ Descriptive tools: percentages, mode
◦ Inferential tools: chi-square
Ordinal
◦ Basic operations: determination of greater or less
◦ Number system: order of numerals (0<1<2….<9)
◦ Typical usage: rankings
◦ Descriptive tools: median
◦ Inferential tools: Mann-Whitney test, Friedman two-way ANOVA, rank-order correlation
Interval
◦ Basic operations: determination of equality of intervals
◦ Number system: equality of differences
◦ Typical usage: index numbers, attitude measures, opinions
◦ Descriptive tools: mean, range, standard deviation
◦ Inferential tools: t-test, factor analysis, ANOVA
Ratio
◦ Basic operations: determination of equality of ratios
◦ Number system: equality of ratios
◦ Typical usage: sales, units produced, number of customers, costs
◦ Descriptive tools: all arithmetic operations
◦ Inferential tools: coefficient of variation
Descriptives and
Visualisations
CHAPTER 2 &3
Visualisations
CHAPTER 2
Visualisations

Categorical
Construct frequency distribution
• Bar chart
• Pie chart
Numeric
Construct frequency distribution
• Line chart
• Histogram
• Scatterplot – 2 variables
• Box plot
Data Preparation
Data Preparation
We often spend a considerable amount of time inspecting and
preparing the data for the subsequent analysis.
◦ Counting and sorting
◦ Handling missing values
◦ Subsetting

Counting and Sorting


◦ Among the very first tasks analysts perform
◦ Gain a better understanding and insights into the data
◦ Help to verify that the data set is complete or determine if there are missing
values
◦ Sorting allows us to review the range of values for each variable
◦ Sort based on a single or multiple variables
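Counting and sorting can be sketched with the Python standard library. The records below are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"name": "Asha", "section": "A", "marks": 48},
    {"name": "Bala", "section": "B", "marks": 35},
    {"name": "Chitra", "section": "A", "marks": None},
    {"name": "Dev", "section": "B", "marks": 23},
    {"name": "Esha", "section": "A", "marks": 47},
]

# Counting: observations per category, and how many values are missing.
section_counts = Counter(r["section"] for r in records)
missing_marks = sum(1 for r in records if r["marks"] is None)

# Sorting on multiple variables: section first, then marks (highest first).
complete = [r for r in records if r["marks"] is not None]
ranked = sorted(complete, key=lambda r: (r["section"], -r["marks"]))

print(section_counts)
print(missing_marks)
print([r["name"] for r in ranked])
```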
Data Preparation
There are two common strategies for dealing with
missing values.
The omission strategy recommends that observations
with missing values be excluded from subsequent
analysis.
The imputation strategy recommends that the missing
values be replaced with some reasonable imputed
values.
◦ Numeric variables: replace with the average
◦ Categorical variables: replace with the predominant category
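A minimal sketch of both strategies, assuming hypothetical data in which `None` marks a missing value:

```python
import statistics

# Hypothetical data; None marks a missing value.
marks = [48, 35, None, 23, 47]
sections = ["A", "B", "A", None, "A"]

# Omission strategy: drop any observation that has a missing value.
omitted = [
    (m, s) for m, s in zip(marks, sections)
    if m is not None and s is not None
]

# Imputation, numeric variable: replace missing values with the average.
mean_marks = statistics.mean([m for m in marks if m is not None])
imputed_marks = [mean_marks if m is None else m for m in marks]

# Imputation, categorical variable: replace with the predominant category.
mode_section = statistics.mode([s for s in sections if s is not None])
imputed_sections = [mode_section if s is None else s for s in sections]

print(omitted)
print(imputed_marks)
print(imputed_sections)
```

Omission shrinks the data set, while imputation keeps every observation at the cost of introducing estimated values.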
Data Preparation
Subsetting is the process of extracting a portion of the data
set that is relevant for subsequent statistical analysis.
◦ The objective of the analysis is to compare two subsets of
the data.
◦ Eliminate observations that contain missing values, low-
quality data, or outliers.
◦ Excluding variables that contain redundant information, or
variables with excessive amounts of missing values.
We can also subset data based on data ranges.
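The kinds of subsetting described above can be sketched as list filters. The records, field names, and the 30 to 50 range are hypothetical choices for illustration.

```python
# Hypothetical records: marks of five students from two sections.
records = [
    {"roll_no": 1, "section": "A", "marks": 48},
    {"roll_no": 2, "section": "B", "marks": 35},
    {"roll_no": 3, "section": "A", "marks": 17},
    {"roll_no": 4, "section": "B", "marks": 23},
    {"roll_no": 5, "section": "A", "marks": 47},
]

# Subset by category, e.g. to compare two subsets of the data.
section_a = [r for r in records if r["section"] == "A"]
section_b = [r for r in records if r["section"] == "B"]

# Subset by data range: keep observations with marks between 30 and 50.
in_range = [r for r in records if 30 <= r["marks"] <= 50]

# Subset the variables: keep only the columns needed for the analysis.
marks_only = [{"section": r["section"], "marks": r["marks"]} for r in records]

print(len(section_a), len(section_b))
print([r["roll_no"] for r in in_range])
```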
Ch 3: Numerical Descriptive
measures

◦ Measures of central location (refer next slide)
◦ Measures of dispersion (refer next slide)
◦ Measures of association (bivariate – correlation, covariance, etc.)
◦ Analysis of relative location (Chebyshev’s theorem, z-score)
MEASURES OF CENTRAL LOCATION
◦ Mean
◦ Median
◦ Mode
◦ Skewness
◦ Weighted mean, geometric mean
◦ Quartiles
◦ Percentiles
◦ Box plot
MEASURES OF DISPERSION
◦ Absolute measures: range, standard deviation, mean absolute deviation
Analysis of relative location
◦ Z score
CHAPTER – 3
Analyzing Distributions
• z-score:
◦ Measures the relative location of a value in the data set.
◦ Helps to determine how far a particular value is from the
mean relative to the data set’s standard deviation.
◦ Standardized value

• If $x_1, x_2, \ldots, x_n$ is a sample of n observations:

$$z_i = \frac{x_i - \bar{x}}{s}$$

◦ $z_i$ = z-score for $x_i$
◦ $\bar{x}$ = sample mean
◦ s = sample standard deviation
Calculate z score
Number of students in the class
46
54
42
46
32
z-Scores for the Class Size Data

• For class size data, 𝑥̅ = 44 and s = 8.


◦ For observations with a value > mean, z-score > 0.
◦ For observations with a value < mean, z-score < 0.
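The z-scores for the five class sizes can be checked with a short Python sketch. It uses the sample standard deviation (divisor n − 1), which reproduces the mean of 44 and s = 8 quoted above; the variable names are illustrative.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

x_bar = statistics.mean(class_sizes)   # sample mean
s = statistics.stdev(class_sizes)      # sample standard deviation (n - 1 divisor)

# z-score: distance from the mean in units of standard deviations.
z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"x = {x}, z = {z:+.2f}")
```

Values above the mean (46, 54) get positive z-scores and values below it (42, 32) get negative ones.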
Analyzing Distributions
• Empirical rule:
◦ For data having a bell-shaped distribution:
◦ Within 1 standard deviation – approximately 68% of the data values.
◦ Within 2 standard deviations – approximately 95% of the data values.
◦ Within 3 standard deviations – almost all the data values.

• Identifying outliers:
◦ Outliers: Extreme values in a data set.
◦ They can be identified using standardized values (z-scores).
◦ Any data value with a z-score less than –3 or greater than +3 is treated as an outlier.
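The ±3 rule can be sketched as a filter on z-scores. The data set below is hypothetical: fourteen typical class sizes plus one deliberately extreme value of 120.

```python
import statistics

# Hypothetical class sizes; 120 is planted as an extreme value.
values = [40, 42, 44, 46, 48, 41, 43, 45, 47, 44, 42, 46, 44, 43, 120]

x_bar = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation

# Flag any value whose z-score is below -3 or above +3.
outliers = [x for x in values if abs((x - x_bar) / s) > 3]

print(outliers)
```

With a sample standard deviation, no z-score can exceed (n − 1)/√n in absolute value, so this rule can only flag anything when the sample has at least 11 observations.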

Analysis of Relative Location
Measures of association
Correlation
BIVARIATE – TWO VARIABLES
CHAPTER 3
Scatter Plot
The first step in determining whether there is a relationship between two
variables is to examine the graph of the observed (or known) data. This
graph, or chart, is called a scatter diagram.
Refer to the data and the scatter diagram plotted.
A scatter diagram can give us two types of information. Visually, we can
look for patterns that indicate that the variables are related.
Then, if the variables are related, we can see what kind of line, or
estimating equation, describes this relationship.
Scatter diagram
An instructor is interested in finding out how the number of students absent on a given day is
related to the mean temperature that day. A random sample of 10 days was used for the study.
The following data indicate the number of students absent (ABS) and the mean temperature
(TEMP) for each day.
ABS 8 7 5 4 2 3 5 6 8 9
TEMP 10 20 25 30 40 45 50 55 59 60
(a) State the dependent (Y) variable and the independent (X) variable.
(b) Draw a scatter diagram of these data.
(c) Does the relationship between the variables appear to be linear or curvilinear?
(d) What type of curve could you draw through the data?
(e) What is the logical explanation for the observed relationship?
Measures of Association Between
Two Variables
• Scatter Charts: Useful graph for analyzing the relationship between two
variables.

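• Covariance: Descriptive measure of the linear association between two variables.
◦ Sample covariance for a sample of size n with the observations $(x_1, y_1), (x_2, y_2)$, and so on:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

◦ Population covariance:

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$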
Measures of Association Between
Two Variables
• Correlation coefficient: Measures the strength of the linear relationship between two variables.
◦ Not affected by the units of measurement for x and y.
◦ Sample correlation coefficient denoted by $r_{xy}$.
◦ $r_{xy} = \frac{s_{xy}}{s_x s_y}$
◦ $s_{xy}$ = sample covariance = $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
◦ $s_x$ = sample standard deviation of x = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
◦ $s_y$ = sample standard deviation of y = $\sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$

r value    Relationship between the x and y variables
< 0        Negative linear
Near 0     No linear relationship
> 0        Positive linear
Calculate the correlation using
covariance method
     X (Marks in accounts)    Y (Marks in QT)
1    48                       45
2    35                       20
3    17                       40
4    23                       25
5    47                       45
xbar =        ybar =
Solution
xbar = 34, ybar = 35

X (Marks in accounts)  Y (Marks in QT)  x-xbar  y-ybar  (x-xbar)^2  (y-ybar)^2  (x-xbar)(y-ybar)
48                     45               14      10      196         100         140
35                     20               1       -15     1           225         -15
17                     40               -17     5       289         25          -85
23                     25               -11     -10     121         100         110
47                     45               13      10      169         100         130
Sum                                                     776         550         280

Sxy = 280 / 4 = 70
Sx = sqrt(776 / 4) = 13.92839
Sy = sqrt(550 / 4) = 11.72604

rxy = 70 / (13.92839 × 11.72604) = 0.428594
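The hand calculation can be reproduced with a short Python sketch. The helper name `sample_covariance` is illustrative, and the divisor n − 1 follows the sample formulas used in the solution.

```python
import statistics

accounts = [48, 35, 17, 23, 47]  # X: marks in accounts
qt = [45, 20, 40, 25, 45]        # Y: marks in QT


def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)


s_xy = sample_covariance(accounts, qt)
s_x = statistics.stdev(accounts)  # sample standard deviation of x
s_y = statistics.stdev(qt)        # sample standard deviation of y
r_xy = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.2f}, s_x = {s_x:.5f}, s_y = {s_y:.5f}, r_xy = {r_xy:.6f}")
```

The positive but moderate value of r suggests a weak positive linear relationship between the two sets of marks.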
Data for Bottled Water Sales at Queensland
Amusement Park for a Sample of 14 Summer Days

Chart Showing the Positive Linear Relation Between
Sales and High Temperatures

Sample Covariance Calculations for Daily High
Temperature and Bottled Water Sales at Queensland Amusement Park

Computation of Correlation
Coefficient
Illustration - To determine the sample correlation coefficient for bottled
water sales at Queensland Amusement Park:
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{12.8}{(4.36)(3.15)} = 0.93$$
• There is a very strong linear relationship between high temperature and
sales.

Practice – correlation
X Y
20 1
22 0
25 2
30 5
38 2
40 4
42 6
45 5
47 7
51 8
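One way to check a hand-worked answer to this practice set is the sketch below, which applies the covariance method with the n − 1 divisor. The function name `correlation` is illustrative.

```python
import math

x = [20, 22, 25, 30, 38, 40, 42, 45, 47, 51]
y = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]


def correlation(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


r_xy = correlation(x, y)
print(f"r_xy = {r_xy:.3f}")
```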
