DSILYTC Session 5 - Descriptive Statistics

This document provides an overview of descriptive statistics, focusing on numerical measures such as mean, median, mode, and measures of variability like range and standard deviation. It also discusses the importance of understanding the relationship between variables through covariance and correlation coefficients, as well as the use of data dashboards for effective data presentation. Key concepts include measures of location, variability, distribution shape, and methods for detecting outliers.

DSILYTC:

Introduction to
Analytics
SESSION 5: DESCRIPTIVE STATISTICS
Objective of the Study

• Descriptive Analytics
• Case Analysis
• Applications of Numerical Measures
DESCRIPTIVE STATISTICS: NUMERICAL MEASURES

• MEASURES OF LOCATION
• MEASURES OF VARIABILITY


NUMERICAL MEASUREMENT

• If the measures are computed for data from a sample, they are called SAMPLE STATISTICS.
• If the measures are computed for data from a population, they are called POPULATION PARAMETERS.
• A sample statistic is referred to as the point estimator of the corresponding population parameter.
MEASURES OF LOCATION

• Mean
• Median
• Mode
• Weighted Mean
• Geometric Mean
• Percentiles
• Quartiles
MEAN

• Perhaps the most important measure of location is the mean.
• The mean provides a measure of central location.
• The mean of a data set is the average of all the data values.
• The sample mean is the point estimator of the population mean.
MEDIAN

• The median of a data set is the value in the middle when the data items are arranged in ascending order.
• Whenever a data set has extreme values, the median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data.
• A few extremely large income or property values can inflate the mean.
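
The effect of an extreme value on the mean and the median can be illustrated with a short sketch. The income figures are hypothetical and chosen only for illustration; the example relies on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands; the last value is extreme.
incomes = [32, 35, 38, 41, 44, 47, 500]

income_mean = mean(incomes)      # pulled upward by the single large value
income_median = median(incomes)  # middle value of the sorted data

# With an even number of items, the median is the average of the two
# middle values.
even_median = median([32, 35, 38, 41])

print(income_mean, income_median, even_median)
```

Here the median stays at 41 while the mean rises above 100, which is why the median is usually the safer summary for income-like data.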
TRIMMED MEAN

• Another measure sometimes used when extreme values are present is the TRIMMED MEAN.
• It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
• For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
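
A minimal sketch of the trimmed mean follows. The helper name `trimmed_mean` and the data are hypothetical, and truncating the number of trimmed values to a whole number is an assumption; other rounding conventions exist.

```python
from statistics import mean

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    data = sorted(values)
    # Number of values cut from each end (truncated; an assumed convention).
    k = int(len(data) * percent / 100)
    return mean(data[k:len(data) - k])

# Hypothetical data: 20 values, so a 5% trim removes one value per end.
data = list(range(1, 20)) + [1000]

print(mean(data))             # affected by the extreme value 1000
print(trimmed_mean(data, 5))  # computed from the middle 18 values
```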
MODE

• The mode of a data set is the value that occurs with greatest frequency.
• The greatest frequency can occur at two or more different values.
• If the data have exactly 2 modes, the data are BIMODAL.
• If the data have more than 2 modes, the data are MULTIMODAL.
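
The classification above can be sketched with `statistics.multimode`, which returns every value that shares the highest frequency. The helper `describe_modes` and the label "unimodal" for a single mode are illustrative choices, not part of the slides.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

print(describe_modes([5, 5, 6]))
print(describe_modes([1, 2, 2, 3, 3, 4]))
print(describe_modes([1, 1, 2, 2, 3, 3]))
```

Note that when every value occurs exactly once, `multimode` returns all of them, so the label is not meaningful in that case.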
WEIGHTED MEAN

• In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
• The choice of weights depends on the application.
• The weight might be the number of credit hours earned for each grade, as in a GPA computation.
• In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
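
The GPA case can be sketched as follows: each grade point is weighted by its credit hours, and the weighted sum is divided by the total weight. The function name and the grade data are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical grade points and the credit hours earned for each course.
grade_points = [4, 3, 2]
credit_hours = [3, 3, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```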
GEOMETRIC MEAN

• The geometric mean is calculated by finding the nth root of the product of n values.
• It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).
• It should be applied anytime you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks, ...).
• Other common applications include: change in population of species, crop yields, pollution levels, and birth and death rates.
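
To see why the arithmetic mean misleads for growth rates, the sketch below compounds a starting value over three periods. The growth factors are hypothetical; `statistics.geometric_mean` is the standard-library function for the nth root of the product.

```python
from statistics import geometric_mean, mean

# Hypothetical growth factors for three successive years:
# +20%, -20%, +10% (a factor of 1.20 means the value grew by 20%).
growth_factors = [1.20, 0.80, 1.10]

start_value = 100.0
end_value = start_value
for factor in growth_factors:
    end_value *= factor

mean_factor = geometric_mean(growth_factors)

# Compounding the geometric mean over three periods reproduces the
# ending value; compounding the arithmetic mean overstates it.
print(end_value)
print(start_value * mean_factor ** 3)
print(start_value * mean(growth_factors) ** 3)
```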
PERCENTILES

• A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
• Admission test scores for colleges and universities are frequently reported in terms of percentiles.
• The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 – p) percent of the items take on this value or more.
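
A small sketch of computing percentiles and quartiles with `statistics.quantiles` follows. The scores are hypothetical. Several interpolation conventions exist; the `"exclusive"` method used here places the pth percentile at position (p/100)(n + 1) in the sorted data, and whether that matches the convention intended by the slides is an assumption.

```python
from statistics import quantiles

# Hypothetical test scores.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# quantiles(..., n=100) returns the 99 cut points between percentiles,
# so the pth percentile sits at index p - 1.
percentile_cuts = quantiles(scores, n=100, method="exclusive")
p80 = percentile_cuts[79]

# Quartiles are the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")

print(p80, q1, q2, q3)
```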
QUARTILES
MEASURES OF VARIABILITY

• It is often desirable to consider measures of variability (dispersion), as well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
MEASURES OF VARIABILITY

• Range
• Interquartile Range
• Variance
• Standard deviation
• Coefficient of variation
RANGE

• The range of a data set is the difference between the largest and smallest data values.

RANGE = LARGEST value - SMALLEST value

• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest data values.
INTERQUARTILE RANGE

• The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.
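
The contrast between the range and the interquartile range can be sketched with two hypothetical data sets that differ only in their largest value. The helper names are illustrative, and the quartile convention (`method="exclusive"`) is an assumption.

```python
from statistics import quantiles

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    return q3 - q1

typical = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
with_extreme = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]

print(value_range(typical), interquartile_range(typical))
print(value_range(with_extreme), interquartile_range(with_extreme))
```

The single extreme value changes the range from 100 to 990, while the interquartile range stays at 60.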
VARIANCE

• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation and the mean.
• The variance is useful in comparing the variability of 2 or more variables.
STANDARD DEVIATION

• The standard deviation of a data set is the positive square root of the variance.
• It is measured in the same units as the data, making it more easily interpreted than the variance.
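
The relationship between variance and standard deviation can be sketched with the standard `statistics` module. The data are hypothetical. Note the two variants: the sample variance divides the sum of squared deviations by n - 1, while the population variance divides by n.

```python
from math import sqrt
from statistics import pvariance, stdev, variance

# Hypothetical data with mean 5; squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = variance(data)       # 32 / (n - 1) = 32 / 7
population_var = pvariance(data)  # 32 / n = 4
sample_sd = stdev(data)           # positive square root of sample_var

print(sample_var, population_var, sample_sd)
```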
COEFFICIENT OF VARIATION

• The coefficient of variation indicates how large the standard deviation is in relation to the mean.
• The coefficient of variation is computed as follows:

COEFFICIENT OF VARIATION = (STANDARD DEVIATION / MEAN × 100)%
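
As a rough illustration, the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The sketch below uses hypothetical delivery times and the sample standard deviation; because it is a ratio, the result does not depend on the unit of measurement.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical delivery times, first in days and then in hours.
days = [8, 9, 10, 11, 12]
hours = [d * 24 for d in days]

cv_days = coefficient_of_variation(days)
cv_hours = coefficient_of_variation(hours)
print(cv_days, cv_hours)
```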
Descriptive Statistics: Numerical Measures (part 2)

• Measures of Distribution Shape, Relative Location, and Detecting Outliers
• Five-number summaries and box plots
• Measures of association between 2 variables
• Data dashboards: adding numerical measures to improve effectiveness
Measures of Distribution Shape, Relative Location, and Detecting Outliers

• Distribution shape
• Z-scores
• Chebyshev’s Theorem
• Empirical Rule
• Detecting Outliers
Distribution shape

• An important measure of the shape of a distribution is called SKEWNESS.
• The formula for the skewness of sample data is

• The skewness can be easily computed using statistical software.
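
The slide's own formula did not survive extraction. As an assumption, the sketch below uses the adjusted sample skewness that common spreadsheet and statistics packages report: n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, with s the sample standard deviation. The function name and data are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness (an assumed, commonly used formula)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    x_bar = mean(values)
    s = stdev(values)
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

print(sample_skewness([1, 2, 3, 4, 5]))       # symmetric data: about 0
print(sample_skewness([1, 2, 3, 4, 20]))      # long right tail: positive
print(sample_skewness([1, 17, 18, 19, 20]))   # long left tail: negative
```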
Distribution shape: SKEWNESS
Z-Scores

• The z-score is often called the standardized value.
• It denotes the number of standard deviations a data value is from the mean.

• Excel’s STANDARDIZE function can be used to compute the z-scores.
Z-Scores

• An observation’s z-score is a measure of the relative location of the observation in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than zero.
• A data value equal to the sample mean will have a z-score of zero.
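
The sign behaviour described above follows directly from the definition: a z-score is the data value minus the sample mean, divided by the sample standard deviation. A minimal sketch with hypothetical data:

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardized value of each observation: (x - mean) / s."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

# Hypothetical data with a sample mean of 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)

for x, score in zip(data, z):
    print(x, round(score, 2))
```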
Chebyshev’s Theorem

• At least (1 - 1/z²) of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.
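
The three percentages come from the bound 1 - 1/z², which holds for any z greater than 1 regardless of the shape of the distribution. A short sketch (the function name is illustrative):

```python
def chebyshev_lower_bound(z):
    """Minimum share of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only informative for z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(z, round(100 * chebyshev_lower_bound(z)))
```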
EMPIRICAL RULE

• When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
• The empirical rule is based on the normal distribution, which is covered in chapter 6.
EMPIRICAL RULE

For data having a bell-shaped distribution:

• Approximately 68% of the data values will be within +/-1 standard deviation of the mean.
• Approximately 95% of the data values will be within +/-2 standard deviations of the mean.
• Almost all of the data values will be within +/-3 standard deviations of the mean.
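
Because the empirical rule is based on the normal distribution, its percentages can be checked against the standard normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```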
Detecting Outliers

• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
• It might be:
• An incorrectly recorded data value
• A data value that was incorrectly included in the data set
• A correctly recorded unusual data value that belongs in the data set.
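
The z-score screen can be sketched as follows. The readings and the function name are hypothetical, and a flagged value still has to be reviewed by hand to decide which of the three cases applies.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# Hypothetical readings; the final value is far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 12, 10, 60]

flagged = z_score_outliers(readings)
print(flagged)
```

With the sample standard deviation, the largest possible z-score in a data set of n values is (n - 1)/√n, so this screen cannot flag anything when there are ten or fewer values.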
OUTLIERS
FIVE NUMBER SUMMARIES & BOX PLOT

• Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
• Two tools that accomplish this are five-number summaries and box plots.
FIVE NUMBER SUMMARY

• Smallest value
• First quartile
• Median
• Third quartile
• Largest value
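
A five-number summary can be assembled from the minimum, the three quartiles, and the maximum. The sketch below uses hypothetical data; the helper name and the quartile convention (`method="exclusive"`) are assumptions.

```python
from statistics import quantiles

def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical data (unsorted on purpose; quantiles sorts internally).
data = [27, 25, 20, 15, 30, 34, 28]

summary = five_number_summary(data)
print(summary)
```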
BOX PLOT

• A box plot is a graphical summary of data that is based on a five-number summary.
• A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
• Box plots provide another way to identify outliers.
BOX PLOT

• Limits are located (not drawn) using the interquartile range (IQR).
• Data outside these limits are considered outliers.
• The location of each outlier is shown with a symbol.
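
The slide does not state how far the limits sit from the box. A common convention, assumed here, places them 1.5 × IQR below Q1 and 1.5 × IQR above Q3. The data, helper name, and quartile convention are likewise illustrative.

```python
from statistics import quantiles

def box_plot_limits(values, factor=1.5):
    """Lower and upper limits; factor=1.5 is an assumed convention."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical data with one unusually large value.
data = [15, 20, 25, 27, 28, 30, 34, 70]

lower, upper = box_plot_limits(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```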
Measures of Association between 2 variables

• Thus far we have examined numerical methods used to summarize the data for one variable at a time.
• Often a manager or decision maker is interested in the relationship between 2 variables.
• Two descriptive measures of the relationship between 2 variables are COVARIANCE and CORRELATION COEFFICIENT.
COVARIANCE

• The covariance is a measure of the linear association between 2 variables.
• Positive values indicate a positive relationship.
• Negative values indicate a negative relationship.
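
A minimal sketch of the sample covariance: the sum of the products of paired deviations from the means, divided by n - 1. The function name and data are hypothetical.

```python
from statistics import mean

def sample_covariance(xs, ys):
    """Sum of (x - x_bar) * (y - y_bar), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series of at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return products / (len(xs) - 1)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases as x increases
falling = [10, 8, 6, 4, 2]   # y decreases as x increases

print(sample_covariance(x, rising))
print(sample_covariance(x, falling))
```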
CORRELATION COEFFICIENT

• Correlation is a measure of linear association and not necessarily causation.
• Just because 2 variables are highly correlated, it does not mean that one variable is the cause of the other.
CORRELATION COEFFICIENT

• The coefficient can take on values between -1 and +1.
• Values near -1 indicate a strong negative linear relationship.
• Values near +1 indicate a strong positive linear relationship.
• The closer the correlation is to zero, the weaker the linear relationship.
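
The correlation coefficient is the covariance divided by the product of the two standard deviations. In the sketch below the (n - 1) divisors cancel, so only sums of deviations are needed. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def correlation_coefficient(xs, ys):
    """Pearson correlation: covariance over the product of std devs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

r = correlation_coefficient(x, y)
# Rescaling x (for example, changing its units) leaves r unchanged.
r_rescaled = correlation_coefficient([100 * v for v in x], y)
r_perfect_negative = correlation_coefficient(x, [10, 8, 6, 4, 2])

print(r, r_rescaled, r_perfect_negative)
```

Unlike the covariance, the coefficient does not depend on the units of measurement, which makes its size easier to interpret.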
COVARIANCE & CORRELATION
COEFFICIENT
DATA DASHBOARDS:
Adding numerical measures to improve effectiveness

• Data dashboards are not limited to graphical displays.
• The addition of numerical measures, such as the mean and standard deviation of KPIs, to a data dashboard is often critical.
• Dashboards are often interactive.
• Drilling down refers to functionality in interactive dashboards that allows users to access information and analyses at increasingly detailed levels.