Chapter 2 (Descriptive)
DESCRIPTIVE
STATISTICS
POINTS TO HIGHLIGHT
Overview of using data: Definitions and Goals
Types of data
Modifying data in Excel
Creating distributions from data
Measures of location
Measures of variability
Analyzing distributions
Measure of association between two variables
Overview of Using
Data: Definitions and
Goals
Data
Variable
Observation
Variation
Types of Data
Population and Sample Data
Quantitative and Categorical Data
Cross-Sectional and Time Series Data
Sources of Data
Population vs. Sample Data
Population: the set of all elements
of interest
In some cases, it is not feasible to collect data
from the entire population
Population vs. Sample data
Diagram: a population of elements labeled a through z, and a sample made up of a subset of those elements
Quantitative vs. Categorical
data
Quantitative data: Data on which numeric and
arithmetic operations, such as addition,
subtraction, multiplication, and division, can be
performed
Example: Share Price, Volume (Data for Dow Jones
Industrial Index Companies)
Cross-sectional vs. Time series
data
Cross-sectional data: Data collected from
several entities at the same, or
approximately the same, point in time
Example: Data in table 2.1 for Dow Jones
Industrial Index Companies
Figure 2.1: Dow Jones Index
Values Since 2005
Sources of Data
◦ Experimental study
A variable of interest is first identified
Then one or more other variables are identified and
controlled or manipulated so that data can be
obtained about how they influence the variable of
interest
Sources of Data
Experimental study
A researcher for a pharmaceutical
company wants to determine whether
aspirin reduces the incidence of heart
attacks. The researcher selects a sample of
men and women. The sample is divided into
two groups: one group takes aspirin
regularly and the other does not. After 2
years, the researcher determines the
proportion of people in each group who
have suffered a heart attack. It is then
possible to draw a conclusion about whether
aspirin is effective in reducing the
likelihood of heart attacks.
Sources of Data
Observational study
A researcher for a pharmaceutical company
wants to determine whether aspirin
reduces the incidence of heart attacks. The
researcher selects a sample of men and women
and asks each whether he or she has taken
aspirin regularly over the past 2 years. Each
person is also asked whether he or she
has suffered a heart attack over the same
period. The proportions reporting heart
attacks are then compared, and a
conclusion can be drawn about whether aspirin is
effective in reducing the likelihood of heart
attacks.
Figure 2.2: Customer Opinion
Questionnaire used by Chops City
Grill Restaurant
Figure 2.3: Data for 20 Top-Selling
Automobiles Entered into Excel with
Percent Change in Sales from 2010
Modifying Data in
Excel
Sorting and Filtering Data in
Excel
Conditional Formatting of
Data in Excel
Sorting and Filtering Data in
Excel
To sort the automobiles by March 2010 sales:
◦ Step 1: Select cells A1:F21
◦ Step 2: Click the Data tab in the Ribbon
◦ Step 3: Click Sort in the Sort & Filter group
◦ Step 4: Select the check box for My data has
headers
◦ Step 5: In the first Sort by dropdown menu,
select Sales (March 2010)
◦ Step 6: In the Order dropdown menu, select
Largest to Smallest
◦ Step 7: Click OK
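For readers working outside Excel, the same Largest to Smallest sort can be sketched in Python. The rows below are made-up placeholders, not the values from Figure 2.3, and the field names (model, manufacturer, sales_march_2010) are assumptions chosen only for illustration.

```python
# Hypothetical rows standing in for the automobile data (NOT the Figure 2.3 values).
autos = [
    {"model": "Model A", "manufacturer": "Toyota", "sales_march_2010": 120},
    {"model": "Model B", "manufacturer": "Maker X", "sales_march_2010": 310},
    {"model": "Model C", "manufacturer": "Maker Y", "sales_march_2010": 205},
]

# Sort on the chosen column, largest to smallest, as in Steps 5 and 6.
by_sales = sorted(autos, key=lambda row: row["sales_march_2010"], reverse=True)

for row in by_sales:
    print(row["model"], row["sales_march_2010"])
```

The call to sorted returns a new list and leaves the original row order untouched, which mirrors keeping a copy of the unsorted worksheet.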
Figure 2.4: Using Excel’s Sort
Function to Sort the Top-Selling
Automobiles Data
Figure 2.5: Top-Selling Automobiles
Data Sorted by Sales in March 2010
Sales
Sorting and Filtering Data in
Excel
Using Excel’s Filter function to see the sales of models made
by Toyota
◦ Step 1: Select cells A1:F21
◦ Step 2: Click the Data tab in the Ribbon
◦ Step 3: Click Filter in the Sort & Filter group
◦ Step 4: Click on the Filter Arrow in column B, next to
Manufacturer
◦ Step 5: If all choices are checked, you can easily deselect all
choices by unchecking (Select All). Then select only the check
box for Toyota.
◦ Step 6: Click OK
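A rough Python equivalent of the Toyota filter is shown below. The rows are made-up placeholders rather than the real automobile data, and the field names are assumptions for illustration only.

```python
# Hypothetical rows standing in for the automobile data (NOT the real values).
autos = [
    {"model": "Model A", "manufacturer": "Toyota", "sales_march_2010": 120},
    {"model": "Model B", "manufacturer": "Maker X", "sales_march_2010": 310},
    {"model": "Model C", "manufacturer": "Toyota", "sales_march_2010": 205},
]

# Keep only the rows whose manufacturer is Toyota, like checking one box in the filter.
toyota_only = [row for row in autos if row["manufacturer"] == "Toyota"]

for row in toyota_only:
    print(row["model"], row["sales_march_2010"])
```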
Figure 2.6: Top Selling Automobiles
Data Filtered to Show Only
Automobiles Manufactured by Toyota
Conditional Formatting of Data in
Excel
Example:
To identify the automobile models in Table 2.2 for which
sales have declined from March 2010:
◦ Step 1: Starting with the original data shown in Figure 2.3, select
cells F1:F21
◦ Step 6: Click OK
Figure 2.7: Using Conditional Formatting
in Excel to Highlight Automobiles with
Declining Sales from March 2010
Figure 2.8: Using Conditional
Formatting in Excel to Generate Data
Bars for the Top-Selling Automobiles
Data
Conditional Formatting of Data in
Excel
The Quick Analysis button appears just outside the
bottom-right corner of a group of selected cells
It provides shortcuts for Conditional Formatting,
adding Data Bars, and similar operations
Creating
Distributions from
Data
Frequency Distributions for
Categorical Data
Relative Frequency and
Percent Frequency
Distributions
Frequency Distributions for
Quantitative Data
Histograms
Cumulative Distributions
Frequency Distributions for
Categorical Data
Table 2.3: Data from a Sample
of 50 Soft Drink Purchases
Table 2.4: Frequency Distribution of
Soft Drink Purchases
Relative Frequency and Percent
Frequency Distributions
Table 2.5: Relative Frequency and
Percent Frequency Distributions of
Soft Drink Purchases
Frequency Distributions for
Quantitative Data
Three steps necessary to define the
classes for a frequency distribution with
quantitative data:
1. Determine the number of non-overlapping
bins.
2. Determine the width of each bin.
3. Determine the bin limits.
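The three steps above can be sketched in Python using the 12 home selling prices listed later in this chapter (Table 2.9). The choice of four bins of width 100,000 starting at 100,000 is an assumption made for illustration, as is the convention that each bin includes its lower limit and excludes its upper limit.

```python
# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]

def frequency_distribution(values, lower, width, n_bins):
    """Count values in n_bins non-overlapping bins [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < n_bins:
            raise ValueError(f"{v} falls outside the bin limits")
        counts[idx] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Assumed bin choices: 4 bins (step 1), width 100,000 (step 2), starting at 100,000 (step 3).
bins = frequency_distribution(home_prices, lower=100000, width=100000, n_bins=4)
for low, high, count in bins:
    print(f"{low:>7,} to under {high:>7,}: {count}")
```

Dividing each count by the number of observations (12 here) would give the relative frequency of each bin.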
Creating Distributions from
Data
Table 2.6: Year-End Audit Times (Days)
Figure 2.11: Using Excel to Generate
a Frequency Distribution for Audit
Times Data
Histogram
Figure 2.12: Histogram for the
Audit Time Data
Figure 2.13: Creating a Histogram for the
Audit Time Data Using Data Analysis Toolpak in
Excel
Figure 2.14: Completed Histogram for the
Audit Time Data Using Data Analysis ToolPak in
Excel
Creating Distributions from
Data
Histograms provide information about the
shape, or form, of a distribution
Skewness: Lack of symmetry
Skewness is an important characteristic of
the shape of a distribution
Figure 2.15: Histograms Showing
Distributions with Different Levels of
Skewness
Cumulative Distributions
Cumulative frequency distribution:
shows the number of data items with values
less than or equal to the upper class limit of
each class
◦ A variation of the frequency distribution that
provides another tabular summary of
quantitative data
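As a small illustration of this definition, the sketch below counts how many of the home selling prices from Table 2.9 are less than or equal to each upper class limit. The upper limits themselves (200,000 through 500,000) are assumed for illustration and are not the classes used for the audit time data.

```python
# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]

# Assumed upper class limits, chosen only for this sketch.
upper_limits = [200000, 300000, 400000, 500000]

n = len(home_prices)
# Cumulative frequency: number of items <= each upper class limit.
cumulative = [sum(1 for p in home_prices if p <= u) for u in upper_limits]
# Cumulative relative and percent frequencies follow directly.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

for u, c, pct in zip(upper_limits, cumulative, cumulative_percent):
    print(f"<= {u:,}: {c} items ({pct:.1f}%)")
```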
Table 2.8: Cumulative Frequency,
Cumulative Relative Frequency, and
Cumulative Percent Frequency
Distributions for the Audit Time Data
Measures of Location
Mean (Arithmetic Mean)
Median
Mode
Geometric Mean
Measures of Location
Mean/Arithmetic Mean
◦ Average value for a variable
◦ The sample mean is denoted by $\bar{x}$:
$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
◦ n = sample size
◦ $x_1$ = value of variable x for the first observation
◦ $x_2$ = value of variable x for the second observation
◦ $x_n$ = value of variable x for the nth observation
Table 2.9: Data on Home Sales in
Cincinnati, Ohio, Suburb
Computation of Sample Mean
Illustration: computation of the mean
home selling price for the sample of 12
home sales
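A minimal Python sketch of this computation follows, using the 12 selling prices as they are listed elsewhere in this chapter for Table 2.9. The variable names are illustrative assumptions.

```python
# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]

n = len(home_prices)              # sample size
total = sum(home_prices)          # sum of all observations
mean_price = total / n            # sample mean: sum divided by n

print(f"n = {n}, sum = {total:,}, mean = {mean_price:,.2f}")
```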
Measures of Location
Median
◦ Value of the item in the middle when the
data are arranged in ascending order
◦ Value of middle item, for an odd number of
observations
◦ Average of values of two middle items, for
an even number of observations
Computation of Sample Median
Illustration: when the number of
observations is odd
Consider the class size data for a sample of five
college classes:
46 54 42 46 32
Arrange the class size data in ascending order
32 42 46 46 54
Middlemost value in the data set = 46
Median is 46
Computation of Sample Median
Illustration: when the number of
observations is even
Consider the data on home sales in Cincinnati,
Ohio, Suburb (Table 2.9)
Arrange the data in ascending order:
108,000 138,000 138,000 142,000 186,000
199,500 208,000 254,000 254,000 257,500
298,000 456,250
Median = average of two middle values
= (199,500 + 208,000) / 2 = 203,750
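Both median illustrations, plus the modes of the home sales data, can be checked with Python's standard statistics module. This is only a sketch; the variable names are assumptions.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]
# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]

# Odd number of observations: the middle value of the sorted data.
median_class = statistics.median(class_sizes)
# Even number of observations: the average of the two middle values.
median_home = statistics.median(home_prices)
# multimode returns every value that occurs most often.
modes_home = statistics.multimode(home_prices)

print(median_class, median_home, modes_home)
```

The home sales data have two modes because two different prices each occur twice.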
Figure 2.16: Calculating the Mean,
Median, and Modes for the Home
Sales Data using Excel
Measures of Location
Geometric Mean
◦ nth root of the product of n values
◦ Used in analyzing growth rates or rate of
change
◦ Sample geometric mean: $\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
Table 2.10: Percentage Annual
Returns and Growth Factors for the
Mutual Fund Data
Illustration: Consider the percentage annual returns and
growth factors for the mutual fund data over the past 10
years
We will determine the mean rate of growth for the fund
over the 10-year period
Computation of Geometric
Mean
Solution:
◦ Product of the growth factors:
(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)
(1.151)(1.021)
= 1.334
◦ Geometric mean of the growth factors:
$\bar{x}_g = \sqrt[10]{1.334} = 1.029$
◦ Conclude that annual returns grew at an
average annual rate of
(1.029 – 1) × 100% or 2.9%
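The arithmetic in this solution can be reproduced with a few lines of Python. The sketch below uses the ten growth factors shown above; the variable names are illustrative assumptions.

```python
import math

# Growth factors for the 10 years, as given in the solution above.
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]

n = len(growth_factors)
product = math.prod(growth_factors)   # product of the n growth factors
geo_mean = product ** (1 / n)         # nth root of the product
avg_rate_pct = (geo_mean - 1) * 100   # average annual growth rate in percent

print(f"product = {product:.3f}, geometric mean = {geo_mean:.3f}, "
      f"rate = {avg_rate_pct:.1f}%")
```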
Figure 2.17: Calculating the
Geometric Mean for the Mutual
Fund Data Using Excel
Measures of
Variability
Range
Variance
Standard Deviation
Coefficient of Variation
Measures of Variability
Table 2.11: Annual Payouts for Two Different Investment Funds
Figure 2.18: Histograms for Payouts of Past 20 Years from Fund A and Fund B
Computation of Range
Range
Found by subtracting the smallest value from the
largest value in the data
Variance
It is based on the deviation about the mean,
which is the difference between the value of
each observation ($x_i$) and the mean
The deviations about the mean are squared
while computing the variance
◦ Sample variance, $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
◦ Population variance, $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Table 2.12: Computation of
Deviations and Squared Deviations
about the Mean for the Class Size
Data
Computation of Sample
Variance:
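A short Python sketch of the range and sample variance for the class size data (46, 54, 42, 46, 32) is given below. It follows the deviation-about-the-mean approach described in this chapter; the variable names are assumptions.

```python
class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n                       # sample mean = 44.0

# Range: largest value minus smallest value.
data_range = max(class_sizes) - min(class_sizes)

# Deviations about the mean, then their squares.
deviations = [x - mean for x in class_sizes]
squared_deviations = [d ** 2 for d in deviations]

# Sample variance divides the sum of squared deviations by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)

print(data_range, deviations, squared_deviations, sample_variance)
```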
Figure 2.19: Calculating Variability
Measures for the Home Sales Data
in Excel
Measures of Variability
Standard Deviation
◦ Positive square root of the variance
◦ Measured in the same units as the original data
◦ For sample, $s = \sqrt{s^2}$
◦ For population, $\sigma = \sqrt{\sigma^2}$
Coefficient of Variation
Computation of Coefficient of
Variation
Illustration:
Consider the class size data:
46 54 42 46 32
Mean, $\bar{x}$ = 44
Standard deviation, s = 8
Coefficient of variation = $\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{8}{44} \times 100\right)\% = 18.2\%$
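The standard deviation and coefficient of variation for the class size data can be reproduced as follows. This is a minimal sketch with assumed variable names.

```python
import math

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n
sample_variance = sum((x - mean) ** 2 for x in class_sizes) / (n - 1)

# Standard deviation: positive square root of the variance.
sample_sd = math.sqrt(sample_variance)
# Coefficient of variation: standard deviation as a percentage of the mean.
cv_pct = sample_sd / mean * 100

print(f"mean = {mean}, s = {sample_sd}, CV = {cv_pct:.1f}%")
```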
Analyzing
Distributions
Percentiles
Quartiles
Z-Scores
Empirical Rule
Identifying
Outliers
Box Plots
Analyzing Distributions
Percentiles
Value of a variable at which a specified
(approximate) percentage of the observations fall
below it; the pth percentile is the point in the
data where:
◦ Approximately p percent of the observations
have values less than the pth percentile
◦ Approximately (100 – p) percent of the
observations have values greater than the pth
percentile
Analyzing Distributions
Steps to calculate the pth percentile:
◦ Arrange the data in ascending order (smallest to largest value)
◦ Compute k = (n + 1) × p, with p written as a decimal (for example, 0.85 for the 85th percentile)
◦ Divide k into its integer component, i, and its decimal
component, d
If d = 0, the value in position k of the sorted data set is the pth
percentile
If d > 0, the percentile is between the values in positions i and i +
1 in the sorted data; to find this percentile, we must interpolate
between these two values:
i. Calculate the difference between the values in positions i and i +
1 in the sorted data set; we define this difference between the two
values as m
ii. Multiply this difference by d: t = m × d
iii. To find the pth percentile, add t to the value in position i of the
sorted data
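The steps above can be sketched as a small Python function. Here p is passed as a whole-number percent (85 rather than 0.85), and positions that fall before the first or after the last observation are clamped to the nearest data value; both choices are assumptions of this sketch rather than part of the procedure described above.

```python
import math

def percentile(values, p):
    """Return the pth percentile (p in percent) using position k = (n + 1) * p / 100."""
    data = sorted(values)                 # arrange in ascending order
    n = len(data)
    k = (n + 1) * p / 100                 # position of the percentile
    i = math.floor(k)                     # integer component
    d = k - i                             # decimal component
    # Assumption: clamp positions that fall outside 1..n.
    if i < 1:
        return data[0]
    if i >= n:
        return data[-1]
    if math.isclose(d, 0.0, abs_tol=1e-9):
        return data[i - 1]                # value in position k (1-based)
    m = data[i] - data[i - 1]             # difference between positions i and i + 1
    t = m * d                             # interpolation amount
    return data[i - 1] + t

# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]
p85 = percentile(home_prices, 85)
print(f"{p85:,.2f}")
```

For the home sales data, k = 11.05, so the function interpolates between the 11th and 12th sorted values.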
Analyzing Distributions
Illustration
To determine the 85th percentile for the home sales
data in Table 2.9.
1. Arrange the data in ascending order
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05
3. Dividing 11.05 into the integer and decimal
components gives us i = 11 and d = 0.05
Because d > 0, interpolate between the values in the 11th and
12th positions in the sorted data
Analyzing Distributions
Illustration (contd.)
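To determine the 85th percentile for the home sales
data in Table 2.9, interpolate between the 11th and 12th values:
i. m = 456,250 – 298,000 = 158,250
ii. t = m × d = 158,250 × 0.05 = 7,912.50
iii. 85th percentile = 298,000 + 7,912.50 = 305,912.50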
Analyzing Distributions
Quartiles
When the data is divided into four equal
parts:
◦ Each part contains approximately 25% of
the observations
◦ Division points are referred to as quartiles
$Q_1$ = first quartile, or 25th percentile
$Q_2$ = second quartile, or 50th percentile (also the
median)
$Q_3$ = third quartile, or 75th percentile
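As an illustration, the quartiles of the home sales data (Table 2.9) can be computed with `statistics.quantiles` from Python's standard library (Python 3.8 or later). Its default "exclusive" method places the cut points at positions based on n + 1, which for this data agrees with the k = (n + 1) × p procedure; treating the two as interchangeable in general is an assumption of this sketch.

```python
import statistics

# Home selling prices as listed in the chapter (Table 2.9).
home_prices = [108000, 138000, 138000, 142000, 186000, 199500,
               208000, 254000, 254000, 257500, 298000, 456250]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(home_prices, n=4, method="exclusive")

print(f"Q1 = {q1:,.0f}, Q2 = {q2:,.0f}, Q3 = {q3:,.0f}")
```

The second quartile equals the median of the same data, as expected.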
Analyzing Distributions
z-score
Measures the relative location of a value in the
data set
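Helps to determine how far a particular value is
from the mean, relative to the standard deviation
If $x_1, x_2, \ldots, x_n$ is a sample of n observations:
$z_i = \frac{x_i - \bar{x}}{s}$
$z_i$ = z-score for $x_i$
$\bar{x}$ = sample mean
s = sample standard deviation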
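A minimal Python sketch of the z-score calculation for the class size data (46, 54, 42, 46, 32) follows. The variable names are assumptions.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

x_bar = statistics.mean(class_sizes)   # sample mean
s = statistics.stdev(class_sizes)      # sample standard deviation (n - 1 divisor)

# z-score: distance from the mean in units of the standard deviation.
z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"x = {x}, z = {z:+.2f}")
```

Values above the mean get positive z-scores and values below the mean get negative ones.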
Table 2.13: z-Scores for the
Class Size Data
For class size data, $\bar{x}$ = 44 and s = 8
◦ For observations with a value > mean, z-score > 0
◦ For observations with a value < mean, z-score < 0
Figure 2.20: Calculating z-
Scores for the Home Sales
Data in Excel
Example: which is the better
offer?
Suppose that two graduating seniors, one a marketing
major and one an accounting major, are comparing job
offers. The accounting major has an offer for $45,000
per year, and the marketing student has an offer for
$42,000 per year. Summary information about the
distribution of offers follows:
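Accounting: mean = 46,000; standard deviation = 1,500
Marketing: mean = 42,500; standard deviation = 1,000
Example: which is the better
offer?
Accounting: z-score = (45,000 – 46,000) / 1,500 = –0.67
Marketing: z-score = (42,000 – 42,500) / 1,000 = –0.50
Both offers are below the mean for their field; relative to its
own distribution, the marketing offer ranks higher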
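The comparison of the two offers can be checked with a short Python sketch that uses the offer amounts, means, and standard deviations stated in this example. The helper function name is an assumption.

```python
def z_score(value, mean, sd):
    """Distance of value from the mean, in standard deviations."""
    return (value - mean) / sd

# Figures taken from the example: offer, mean offer, standard deviation.
z_accounting = z_score(45000, 46000, 1500)
z_marketing = z_score(42000, 42500, 1000)

print(f"accounting z = {z_accounting:.2f}, marketing z = {z_marketing:.2f}")

# The offer with the larger z-score sits higher within its own distribution.
better = "marketing" if z_marketing > z_accounting else "accounting"
print("relatively better offer:", better)
```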
Box Plots
A graphical summary of the distribution of data
Developed from the quartiles for a data set
Figure 2.22: Box
Plot for the
Home Sales Data
Figure 2.23: Box Plots Comparing Home
Sale Prices in Different Communities
Measures of
Association Between
Two Variables
Scatter Charts
Covariance
Correlation Coefficient
Table 2.14: Data for Bottled Water
Sales at Queensland Amusement Park
for a Sample of 14 Summer Days
Figure 2.24: Chart Showing the
Positive Linear Relation Between
Sales and High Temperatures
Measures of Association
Between Two Variables
Scatter Charts:
Useful graph for analyzing the relationship
between two variables
The scatter chart also suggests that a straight
line could be used as an approximation for the
relationship between two variables
Measures of Association
Between Two Variables
Covariance: Descriptive measure of the
linear association between two variables
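◦ Sample covariance for a sample of size n with
the observations
$(x_1, y_1)$, $(x_2, y_2)$, and so on: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
◦ Population covariance, $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$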
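The sample covariance formula can be sketched in Python as shown below. The five (x, y) pairs are made-up numbers chosen so the arithmetic is easy to follow; they are not the bottled water data from Table 2.14.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations about the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data for illustration only (NOT Table 2.14).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
print(s_xy)
```

A positive result indicates that x and y tend to move in the same direction in this made-up sample.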
Table 2.15: Sample Covariance Calculations for
Daily High Temperature and Bottled Water Sales
at Queensland Amusement Park
Figure 2.25: Calculating Covariance and
Correlation Coefficient for Bottled Water
Sales Using Excel
Measures of Association
Between Two Variables
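Correlation coefficient: Measures the strength of the
linear relationship between two variables
◦ Not affected by the units of measurement for x
and y
◦ Sample correlation coefficient denoted by $r_{xy}$
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
$s_{xy}$ = sample covariance = $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$s_x$ = sample standard deviation of x = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$s_y$ = sample standard deviation of y = $\sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$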
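A minimal Python sketch of the sample correlation coefficient is given below. Both data sets are made-up illustrations, not the bottled water data: the first has a roughly linear pattern, and the second follows y = x squared on values of x symmetric about zero, a nonlinear relationship whose correlation coefficient comes out as zero.

```python
import math

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical, roughly linear data (for illustration only).
r_linear = sample_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Hypothetical nonlinear data: y = x**2 with x symmetric about zero.
xs_curve = [-2, -1, 0, 1, 2]
r_curve = sample_correlation(xs_curve, [x ** 2 for x in xs_curve])

print(f"linear-looking data: r = {r_linear:.4f}")
print(f"quadratic data:      r = {r_curve:.4f}")
```

The second result shows why a correlation coefficient near zero rules out only a linear relationship, not every relationship.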
Interpretation of Correlation
Coefficient
–1 ≤ r ≤ +1
r value Relationship between
the x and y variables
Figure 2.26: Scatter Diagrams and
Associated Covariance Values for
Different Variable Relationships
Computation of Correlation
Coefficient
Illustration
To determine the sample correlation
Figure 2.27: Example of Nonlinear
Relationship Producing a Correlation
Coefficient Near Zero