02-03 ASAP Business Analytics-2 Descriptive Statistics

Uploaded by

George Mathew

FOUNDATION TO DATA SCIENCE

Business Analytics

Unit 1: BASIC STATISTICS REFRESHER AND HOW TO EXPLORE DATA - 2

Prof. Dr. George Mathew


B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD
Measures of Location
1. Mean (Arithmetic Mean)
2. Median
3. Mode
4. Geometric Mean
5. Percentiles
6. Quartiles
1. Mean (Arithmetic Mean)
The most commonly used measure of location
is the mean (arithmetic mean), or average
value, for a variable. The mean provides a
measure of central location for the data. If the
data are for a sample (typically the case), the
mean is denoted by (x-bar) x̄. The sample
mean is a point estimate of the (typically
unknown) population mean for the variable of
interest. If the data for the entire population are
available, the population mean is computed in
the same manner, but denoted by the Greek
letter μ.
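
As a small illustration, the sample mean is the sum of the observations divided by the number of observations, x̄ = (x_1 + x_2 + ... + x_n) / n. The sketch below applies this to the twelve home sale prices listed later in this unit; the variable names are illustrative and not part of the original material.

```python
from statistics import mean

# Home sale prices in dollars, as listed in the quartiles discussion of this unit.
home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

# Sample mean: sum of the values divided by the number of values.
sample_mean = mean(home_sales)

print(f"n = {len(home_sales)}, sample mean = {sample_mean:,.2f}")
```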
Home Sale Data

Practical: Excel File:02-03 ASAP Discriptive statistics_Excel Solver.xlsx


1. Mean (Arithmetic Mean)
Median
The median, another measure of central
location, is the value in the middle when the
data are arranged in ascending order (smallest
to largest value). With an odd number of
observations, the median is the middle value.
An even number of observations has no
single middle value. In this case, we follow
convention and define the median as the
average of the values for the middle two
observations.
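
A minimal sketch of both cases, assuming the home sale prices listed later in this unit (an even number of observations) and a shortened copy of the same list for the odd case:

```python
from statistics import median

# Twelve home sale prices in dollars (even number of observations).
home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

# Even n: the median is the average of the two middle values (199,500 and 208,000).
median_even = median(home_sales)

# Odd n: dropping the largest value leaves 11 observations, so the median
# is simply the middle (6th) value of the sorted data.
median_odd = median(home_sales[:-1])

print(f"median of 12 values = {median_even:,.2f}")
print(f"median of 11 values = {median_odd:,.2f}")
```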
Median
Mode
A third measure of location, the mode, is the
value that occurs most frequently in a data
set. To illustrate the identification of the
mode, consider the following sample of eight
class sizes.
32 34 42 46 46 54 56 67
Here 46 occurs twice and every other value
occurs only once; hence, the mode is 46.
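
The same result can be checked with a short sketch. The call to multimode is an addition for illustration: it returns every most-frequent value, which matters when a data set has more than one mode.

```python
from statistics import mode, multimode

# The eight values used in the mode example above.
class_sizes = [32, 34, 42, 46, 46, 54, 56, 67]

most_frequent = mode(class_sizes)    # the single most frequent value
all_modes = multimode(class_sizes)   # list of all values tied for most frequent

print(f"mode = {most_frequent}, all modes = {all_modes}")
```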
Mean, Median, Mode
Geometric Mean
The geometric mean is a measure of location
that is calculated by finding the nth root of the
product of n values. The general formula for
the sample geometric mean, denoted \( \bar{x}_g \),
follows.

\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \]

The geometric mean is often used in analyzing
growth rates in financial data. In these types of
situations, the arithmetic mean or average
value will provide misleading results.
Geometric Mean
To illustrate the use of the geometric mean,
consider Table 2.10 which shows the percentage
annual returns, or growth rates, for a mutual fund
over the past ten years. Suppose we want to
compute how much $100 invested in the fund at
the beginning of year 1 would be worth at the end
of year 10.
Geometric Mean
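$100[(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021)]
= $100(1.3345) = $133.45
The product of the ten growth factors is 1.3345,
so the geometric mean is the tenth root of 1.3345:
\[ \bar{x}_g = \sqrt[10]{1.3345} \approx 1.029 \]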

The geometric mean tells us that annual returns grew
at an average annual rate of (1.029 - 1)100, or 2.9
percent. In other words, with an average annual
growth rate of 2.9 percent, a $100 investment in the
fund at the beginning of year 1 would grow to
$100(1.029)^{10} = $133.09 at the end of ten years.
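
The calculation can be reproduced with a brief sketch that uses the ten growth factors quoted in this example. Note that $133.09 comes from the rounded factor 1.029; carrying the unrounded geometric mean through gives about $133.45, which equals $100 times the product of the growth factors. Variable names are illustrative.

```python
from math import prod
from statistics import geometric_mean

# Annual growth factors (1 + return) for the ten years in the example.
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]

product = prod(growth_factors)                # cumulative growth over ten years
gm = geometric_mean(growth_factors)           # tenth root of the product
avg_rate_pct = (gm - 1) * 100                 # average annual growth rate in percent
final_value = 100 * gm ** len(growth_factors) # value of $100 after ten years

print(f"product of growth factors = {product:.4f}")
print(f"geometric mean            = {gm:.4f}")
print(f"average annual growth     = {avg_rate_pct:.1f}%")
print(f"$100 grows to             = ${final_value:.2f}")
```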
Geometric Mean
We can use Excel to calculate the geometric
mean for the data in Table 3 by using the
function GEOMEAN. In Figure 10, the value
for the geometric mean is found using
the formula =GEOMEAN(C4:C13).
Geometric Mean
Percentiles
A percentile is the value of a variable at
which a specified (approximate) percentage
of observations are below that value. The
pth percentile tells us the point in the data
where approximately p percent of the
observations have values less than the pth
percentile; hence, approximately (100 – p)
percent of the observations have values
greater than the pth percentile.
Percentiles
Percentiles
Percentiles
Therefore, $305,912.50 represents the 85th
percentile of the home sales data. The pth
percentile can also be calculated in Excel
using the function PERCENTILE.EXC.
Figure 12 shows the Excel calculation for the
85th percentile of the home sales data. The
value in cell E13 is calculated using the
formula =PERCENTILE.EXC(B2:B13,0.85);
B2:B13 defines the data set for which we are
calculating a percentile, and 0.85 defines the
percentile of interest.
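
As a hedged sketch of the arithmetic behind this value: one common location rule places the pth percentile at position (p/100)(n + 1) in the sorted data and interpolates between neighbouring values. With n = 12 and p = 85 the position is 11.05, so the result lies 5 percent of the way from the 11th value (298,000) to the 12th (456,250). The helper percentile_exc below is a hypothetical name written for this illustration; it reproduces the figure quoted above and is cross-checked against the exclusive method in Python's statistics module.

```python
from statistics import quantiles

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

def percentile_exc(values, p):
    """Percentile for 0 < p < 1 using the p * (n + 1) location rule (illustrative helper)."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)            # 1-based location in the sorted data
    if position < 1 or position > n:
        raise ValueError("p is outside the range this rule supports for n values")
    lower = int(position)             # rank of the value at or just below the location
    fraction = position - lower       # how far to move toward the next value
    if lower == n:
        return float(data[-1])
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

p85 = percentile_exc(home_sales, 0.85)

# Cross-check: with n=100 the result holds the 1st to 99th percentiles, so index 84 is the 85th.
p85_check = quantiles(home_sales, n=100, method="exclusive")[84]

print(f"85th percentile = {p85:,.2f} (cross-check: {p85_check:,.2f})")
```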
CALCULATING VARIABILITY MEASURES FOR
THE HOME SALES DATA IN EXCEL
Quartiles
It is often desirable to divide data into four
parts, with each part containing
approximately one-fourth, or 25 percent, of
the observations. These division points are
referred to as the quartiles and are defined
as:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile
Quartiles
To demonstrate quartiles, the home sales
data are again arranged in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
We already identified Q2, the second quartile
(median), as 203,750. To find Q1 and Q3, we
must find the 25th and 75th percentiles.
Quartiles
Inter Quartile Range
The difference between the third and first
quartiles is often referred to as the
interquartile range, or IQR. For the home
sales data, IQR = Q3 - Q1 = 256,625 -
139,000 = 117,625. Because it excludes the
smallest and largest 25 percent of values in
the data, the IQR is a useful measure of
variation for data that have extreme values
or are badly skewed.
Quartile Using Excel
A quartile can be computed in Excel using
the function QUARTILE.EXC. Figure 12
shows the calculations for first, second, and
third quartiles for the home sales data. The
formula used in cell E15 is
=QUARTILE.EXC(B2:B13,1). The range
B2:B13 defines the data set, and 1 indicates
that we want to compute the 1st quartile.
Cells E16 and E17 use similar formulas to
compute the second and third quartiles.
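
For readers working outside Excel, the quartiles and the interquartile range for the home sales data can be sketched as follows. The exclusive method of Python's statistics.quantiles is used here because it reproduces the values quoted in this unit; that choice is an assumption of this example.

```python
from statistics import quantiles

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

# n=4 splits the data into four parts and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(home_sales, n=4, method="exclusive")

# Interquartile range: spread of the middle 50 percent of the data.
iqr = q3 - q1

print(f"Q1 = {q1:,.0f}, Q2 = {q2:,.0f}, Q3 = {q3:,.0f}, IQR = {iqr:,.0f}")
```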
Compare the Spread of Two Data Sets
Range
The simplest measure of variability is the
range. The range can be found by
subtracting the smallest value from the
largest value in a data set. Let us return to
the home sales data set to demonstrate the
calculation of range. Refer to the data from
home sales prices in Table 2. The largest
home sales price is $456,250, and the
smallest is $108,000. The range is $456,250
- $108,000 = $348,250.
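
The same subtraction as a short sketch, using the home sale prices listed in this unit:

```python
home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

# Range: largest value minus smallest value.
price_range = max(home_sales) - min(home_sales)

print(f"range = {max(home_sales):,} - {min(home_sales):,} = {price_range:,}")
```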
Variance
The variance is a measure of variability of the
data. The variance is based on the deviation
about the mean, which is the difference
between the value of each observation (x_i)
and the mean. For a sample, a deviation of an
observation about the mean is written (x_i - x̄).
In the computation of the variance, the
deviations about the mean are squared.
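
For reference, the standard sample variance divides the sum of squared deviations by n - 1: s^2 = Σ(x_i - x̄)^2 / (n - 1). The sketch below computes it for the home sales data both step by step and with the library function; the variable names are illustrative.

```python
from statistics import mean, variance

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

x_bar = mean(home_sales)
n = len(home_sales)

# Step by step: square each deviation about the mean, sum, divide by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in home_sales]
manual_variance = sum(squared_deviations) / (n - 1)

# Library function for the sample variance (also uses the n - 1 divisor).
sample_variance = variance(home_sales)

print(f"sample variance = {sample_variance:,.2f}")
```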
Variance
Standard Deviation
The standard deviation is defined to be the
positive square root of the variance. We use
s to denote the sample standard deviation
and σ to denote the population standard
deviation. The sample standard deviation, s,
is a point estimate of the population
standard deviation, σ, and is derived from the
sample variance, s^2, in the following way:
\[ s = \sqrt{s^2} \]
Coefficient of Variation
In some situations we may be interested in a
descriptive statistic that indicates how large
the standard deviation is relative to the
mean. This measure is called the coefficient
of variation and is usually expressed as a
percentage.
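
A minimal sketch for the home sales data, assuming the usual definition of the coefficient of variation as (standard deviation / mean) x 100 percent:

```python
from statistics import mean, stdev

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

x_bar = mean(home_sales)
s = stdev(home_sales)          # sample standard deviation (square root of sample variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = s / x_bar * 100

print(f"mean = {x_bar:,.2f}, s = {s:,.2f}, CV = {cv_percent:.1f}%")
```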
Identifying Outliers
Sometimes a data set will have one or more
observations with unusually large or unusually
small values. These extreme values are called
outliers. They should be reviewed carefully and, where
they reflect data errors, corrected or removed before
analysis.
Standardized values (z-scores) can be used to
identify outliers. For data with a bell-shaped
distribution, almost all the data values will be within
three standard deviations of the mean. Hence, in
using z-scores to identify outliers, we recommend
treating any data value with a z-score less than -3
or greater than +3 as an outlier.
z-scores
A z-score allows us to measure the relative
location of a value in the data set. More
specifically, a z-score helps us determine how
far a particular value is from the mean relative
to the data set’s standard deviation. Suppose
we have a sample of n observations, with the
values denoted by x_1, x_2, ..., x_n. In
addition, assume that the sample mean, x̄, and
the sample standard deviation, s, are already
computed. Associated with each value, x_i, is
another value called its z-score.
Normal Distribution
z-scores

The z-score is often called the standardized
value. The z-score, z_i, can be interpreted as
the number of standard deviations x_i is from
the mean. For example, z_1 = 1.2 indicates
that x_1 is 1.2 standard deviations greater than
the sample mean.
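
The standard definition is z_i = (x_i - x̄) / s. The sketch below computes z-scores for the home sales data and applies the plus or minus 3 rule for outliers described in this unit. For these twelve prices the largest z-score is about 2.49, so the rule flags no outliers; the variable names are illustrative.

```python
from statistics import mean, stdev

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

x_bar = mean(home_sales)
s = stdev(home_sales)

# z-score: number of sample standard deviations each value lies from the mean.
z_scores = [(x - x_bar) / s for x in home_sales]

# Flag values whose z-score is below -3 or above +3.
outliers = [x for x, z in zip(home_sales, z_scores) if abs(z) > 3]

for x, z in zip(home_sales, z_scores):
    print(f"{x:>9,}  z = {z:+.2f}")
print(f"outliers by the z-score rule: {outliers}")
```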
z-scores
Box Plots
A box plot is a graphical summary of the
distribution of data. A box plot is developed
from the quartiles for a data set. Figure 14 is
a box plot for the home sales data. Here are
the steps used to construct the box plot:
Box Plots
What can we learn from these box plots?
The most expensive houses appear to be in Shadyside and
the cheapest houses in Hamilton. The median home selling
price in Groton is about the same as the median home
selling price in Irving. However, home sales prices in Irving
have much greater variability. Homes appear to be selling
in Irving for many different prices, from very low to very
high. Home selling prices have the least variation in Groton
and Hamilton. Unusually expensive home sales (relative to
the respective distribution of home sales values) have
occurred in Fairview, Groton, and Irving, which appear as
outliers. Groton is the only location with a low outlier, but
note that most homes sell for very similar prices in Groton,
so the selling price does not have to be too far from the
median to be considered an outlier.
Statistical Inference
Statistics uses data from a sample to make estimates
and test hypotheses about the characteristics of a
population through a process referred to as statistical
inference.
Example: To evaluate the advantages of a new
filament, Norris Electronics tested a sample of 200
bulbs manufactured with the new filament. Data
collected from this sample showed the number of
hours each lightbulb operated before filament
burnout. The data are given in Table 7.
Figure 16 provides a graphical summary of the
statistical inference process for Norris
Electronics.
Inferential statistics
Inferential statistics is used to make inferences or
to project from a sample to an entire population. For
example, when a firm test-markets a new product in
two cities in the United States, it is concerned not
only with how customers in these two cities feel but
also with making an inference from these
sample markets to predict what will happen
throughout the United States. So, two applications of
statistics exist:
(1) descriptive statistics, which describe
characteristics of the population or sample, and
(2) inferential statistics, which are used to generalize
from a sample to a population.
Population Parameters and Sample
Statistics
Population parameters are measured
characteristics of a specific population. In
other words, they are information about the
entire universe of interest. Sample statistics are
used to make inferences (guesses) about
population parameters based on sample data.
In our notation, we will generally represent
population parameters with Greek lowercase
letters, for example μ or σ, and sample
statistics with English letters, such as X or S.
Correlation
Correlation Using Excel
To find the correlations between each pair of
stocks, click Data Analysis in the Analysis
group on the Data tab and then select
Correlation. You must install the Analysis
ToolPak (see “Summarizing data by using
histograms” and “Summarizing data by
using descriptive statistics”) before you can
use this feature. Click OK and then fill in the
Correlation dialog box as shown in Figure
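
For readers without the Analysis ToolPak, a pairwise correlation can be sketched directly. The example assumes the Pearson correlation coefficient is the quantity wanted, and the two return series are hypothetical numbers invented for illustration, not the stock data referred to in this unit.

```python
from math import sqrt

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences (illustrative helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a series has no variation")
    return sxy / sqrt(sxx * syy)

# Hypothetical monthly returns (percent) for two made-up stocks.
stock_a = [1.2, -0.5, 2.1, 0.3, -1.0, 1.8]
stock_b = [0.9, -0.2, 1.7, 0.1, -0.8, 1.5]

r = pearson_correlation(stock_a, stock_b)
print(f"correlation between stock_a and stock_b = {r:.3f}")
```

A value near +1 indicates that the two series tend to move together, a value near -1 that they move in opposite directions, and a value near 0 that there is little linear relationship.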
Measures of Variability
In descriptive statistics we mainly analyse the
data for its central tendency (average) and its
spread, also called variability or dispersion.
The spread is commonly measured as
variation about the mean.
We measure the spread of data using various
measures, such as the range, percentiles,
mean absolute deviation, variance, standard
deviation, and quartile deviation.
