Measures of Central Tendency & Variation
4-1
Descriptive Statistics
Slide
4-2
Histograms
Looking at the Distribution of the Data
Slide
4-3 Histogram
• A Picture of a list of numbers
[Figure: a short list of data values shown next to its histogram; horizontal axis: Data value (0 to 30); vertical axis: Frequency (0 to 4)]
29 44 12 53 21 34 39 25 48 23
17 24 27 32 34 15 42 21 28 37
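A histogram is built by counting how many values fall in each bin. The following is a minimal sketch for the 20 values listed above; the bin width of 10 and the helper name `histogram_counts` are assumptions for illustration, not taken from the slides.

```python
from collections import Counter

# The 20 data values listed on the slide.
data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

BIN_WIDTH = 10  # assumed bin width; the slides do not specify one


def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


counts = histogram_counts(data, BIN_WIDTH)

# Print a simple text histogram, one row per bin.
for start, freq in counts.items():
    print(f"{start:2d}-{start + BIN_WIDTH - 1:2d} | {'*' * freq}")
```

A different bin width would change the picture, so it is worth trying more than one before judging the shape of the distribution.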
Slide
4-11 Distribution Shapes (Ideal)
• Normal
– Symmetric
– Bell-Shaped
• Skewed
– Not symmetric
– Can cause trouble
– Transform? Logarithm?
• Bimodal
– Two clear groups
– Find out why!
– Analyze separately?
Slide
4-12 Idealized Normal Distributions
• Can shift center, width (diversity) of distribution
• In idealized form, without the randomness of data
Slide
4-13 Data from a Normal Distribution
• All are sampled from the same idealized normal
distribution. Note the random differences.
[Figure: four histograms of samples from the same normal distribution; horizontal axes: 60 to 140; vertical axes: Frequency (0 to 30)]
Slide
4-14
Fig 3.2.1
Example: Mortgage Interest Rates
• Values from about 5.7% to 6.6%
• Typical: from about 6.2% to 6.4%
• Diversity among institutions
• Special features: gap just below 6.5%, some low rates
[Figure: histogram of mortgage interest rates; horizontal axis: Interest rate (5.5% to 7.0%); vertical axis: Frequency (lenders)]
Slide
4-15 Idealized Skewed Distributions
• Not symmetric
• Various shapes are possible
• In idealized form, without the randomness of data
Slide
4-16 Example: Commercial Bank Assets
Fig 3.4.2
[Figure: histogram of commercial bank assets; horizontal axis: Bank assets ($ billions), 0 to 500; vertical axis ticks 0 to 30]
Slide
4-17 Bimodal Distribution
Fig 3.5.1
[Figure: bimodal histogram of fund yields; horizontal axis: Yield (2% to 6%); vertical axis: Frequency (funds)]
Slide
4-18 Outlier
• A data value very different from the others
• Difficult to see distribution of most of the data,
even after changing histogram scale
[Figure: list of defect counts in which most values lie between about 8 and 25 and one value is 268, shown with two histograms; horizontal axes: 0 to 300; vertical axes: Frequency]
Slide
4-19 Outlier: What to Do?
• Note the outlier. If error, then fix it
• (Perhaps) analyze with and without outlier(s)
– If similar answers, then no problem
• OK to omit outlier(s) IF not part of situation
under study
– e.g., Lab analysis, dropped test tube
• OK to omit, if studying normal operation, not laboratory
accidents
– e.g., Statistical audit, “special occurrence” error
• Use care. Such an error in a sample may represent other
“explainable” errors in accounts that were not examined
Slide
4-20 Example: TV Advertising
Fig 3.6.5
[Figure: histogram; horizontal axis: Percent Increase in Syndicated TV Spending (0% to 2,000%); vertical axis ticks 0 to 20]
Slide
4-21 Data Mining Promotions Received
Fig 3.6.5
[Figure: histogram of promotions received; horizontal axis: Promotions (0 to 200); vertical axis ticks 0 to 3,000]
Slide
4-22 More Detail in Promotions
Fig 3.6.5
[Figure: histogram of promotions received with narrower bins; horizontal axis: Promotions (0 to 180); vertical axis: Number of people (0 to 600)]
Slide
4-23 Data Mining Donations
Fig 3.6.5
[Figure: histogram of donations; horizontal axis: Donation ($0 to $120); vertical axis: Number of people (0 to 20,000)]
Slide
4-24 More Detail in Donations
Fig 3.6.5
[Figure: histogram of donations in more detail; horizontal axis: Donation ($0 to $120); vertical axis ticks 0 to 300]
Slide
4-25 Even More Detail in Donations
Fig 3.6.5
[Figure: histogram of donations in still more detail; horizontal axis: Donation ($0 to $120); vertical axis ticks 0 to 200]
Slide
4-26
Numerical Descriptive
Measures
Landmark Summaries:
Interpreting Typical Values and
Percentiles
Slide
4-27 Numerical Descriptive Measures
• Large data sets can often be adequately
described by just a few numbers
– Parameters describe populations
– Statistics describe samples
• Types of descriptive measures: measures of
– Central tendency
– Dispersion
– Shape
– Relationships
Slide
4-28
Measures of Central
Tendency
Landmark Summaries:
Interpreting Typical Values and
Percentiles
Slide
4-29 Measures of Central Tendency
• Also referred to as averages
• An average is a single value, which is
considered as the most representative (or
typical) value for a given set of data
• Averages are used to give an impression of the size of all
items in a given set of data
Slide
4-30
• Objectives
– To get a single value that describes the characteristics
of the entire data set
– To condense the mass of data into one value that gives
an idea of the whole and is easy to remember
– To facilitate comparison: reducing the mass of data
to a single figure makes comparison possible, either
at a point in time or over a period of time
Slide
4-31 Average or Mean
• Add the data, divide by n or N (the number of
elementary units)
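$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$ (sample average)
$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N}$ (population average)
[Figure: histogram of defects per lot; horizontal axis: Defects per lot (0 to 20); vertical axis: Frequency (lots). Average is 5.1 defects per lot]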
Slide
4-33 Median
• Also summarizes the data
• The middle one
– Put data in order
– Pick middle one (or average middle two if n is even)
– Median (9, 4, 5) = Median(4, 5, 9) = 5
– Median (9, 4, 5, 7) = Median (4, 5, 7, 9) = (5+7)/2 = 6
• Rank of the median is (1+n)/2
– If n=3, rank is (1+3)/2 = 2
– If n=4, rank is (1+4)/2 = 2.5 (so average 2nd and 3rd)
– If n=262, rank is (1+262)/2 = 131.5
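The rank rule above can be sketched in a few lines. This is an illustrative implementation under the slide's convention (rank = (1+n)/2, averaging the two neighbours when the rank ends in .5); the function name `median` is chosen here, and Python's standard `statistics.median` gives the same result.

```python
def median(values):
    """Median via the rank rule: rank = (1 + n) / 2 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (1 + n) / 2                  # ends in .5 when n is even
    lower = ordered[int(rank) - 1]      # ranks are 1-based, lists are 0-based
    if rank == int(rank):
        return lower                    # odd n: a single middle value
    upper = ordered[int(rank)]
    return (lower + upper) / 2          # even n: average the middle two


print(median([9, 4, 5]))       # uses rank 2
print(median([9, 4, 5, 7]))    # uses rank 2.5
```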
Slide
4-34 Median (continued)
• A representative, central number
– If data set has a center
• Less sensitive to outliers than the average
• For skewed data, represents the “typical case”
better than the average does
– e.g., incomes
• Average income for a country equally divides the total, which
may include some very high incomes
• Median income chooses the middle person (half earn less, half
earn more), giving less influence to high incomes (if any)
Slide
4-35 Example: Spending
• Customers plan to spend ($thousands)
3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
• Rank ordered from smallest to largest
0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
Ranks: 1, 2, 3, 4, 5, 6, 7, 8
• Rank of median = (1+8)/2 = 4.5
• Median is (1.1+1.4)/2 = 1.25
– Smaller than the average, 2.05
• Due to slight skewness?
[Figure: plot of the spending data on an axis from 0 to 5, with the median and the average marked]
Slide
4-36
Fig 4.1.2
Example: The Crash of 1987
• Dow-Jones Industrials, stock-price changes as
each stock began trading that fateful morning
• Fairly normal
• Mean and median are similar
[Figure: histogram; horizontal axis: Percent change at opening (-20% to 0%); vertical axis: Frequency]
Median = -8.6%
Average = -8.2%
Slide
4-37
Fig 4.1.3
Example: Incomes
• Personal income of 100 people
• Average is higher than median due to skewness
[Figure: histogram of incomes; horizontal axis: Income ($0 to $200,000); vertical axis: Frequency (0 to 50)]
Average = $38,710
Median = $27,216
Slide
4-38 Mode
• Also summarizes the data
• Most common data value
– Middle of tallest histogram bar
• Problems:
[Figure: histograms with the positions of the mode, average, and median marked]
Slide
4-41 Which summary to use?
• Average
– Best for normal data
– Preserves totals
• Median
– Good for skewed data or data with outliers, provided
you do not need to preserve or estimate total amounts
• Mode
– Best for categories (nominal data).
– The mode is the only summary computable for nominal
data!
Slide
4-42 Which Summary? (continued)
• Average requires quantitative data (numbers)
• Median works with quantitative or ordinal
• Mode works with quantitative, ordinal, or nominal
• Weighted average is $\bar{X} = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n$
– Example: 0.20(20%) + 0.36(15%) + 0.44(30%) = 22.6%
– The expected return for the portfolio.
– Each stock is represented in proportion to $ invested
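The portfolio calculation can be checked with a short sketch. The function name `weighted_average` and the check that the weights sum to 1 are assumptions added for illustration; the weights and returns are the ones in the example.

```python
def weighted_average(values, weights):
    """Sum of weight * value, for weights that add up to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


returns = [0.20, 0.15, 0.30]   # return of each stock
weights = [0.20, 0.36, 0.44]   # share of dollars invested in each stock

portfolio_return = weighted_average(returns, weights)
print(f"{portfolio_return:.1%}")
```

If the weights are raw dollar amounts rather than proportions, dividing each by the total first gives the same kind of weights.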
Slide
4-47 Percentiles
• Landmark summaries in the same measurement
units as the data
– e.g., dollars, people, miles per gallon, …
• Some familiar percentiles
– Smallest data value is 0th percentile
– Median is 50th percentile
– Largest data value is 100th percentile
– 90th percentile is larger than 90% of elementary units
• Finding percentiles
– Difficult to see from histogram
– Easy using CDF (Cumulative Distribution Function)
Slide
4-48 Cumulative Distribution Function
• Data axis horizontally (as in histogram)
• Cumulative percent vertically
• Equal vertical jump at each data value
0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
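[Figure: cumulative distribution function of the spending data; horizontal axis: Spending ($0 to $6); vertical axis: Cumulative Percent (0% to 100%). Reading across from 80% gives the 80th percentile, $3.80]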
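Reading a percentile off the CDF can be sketched as follows. This is one of several percentile conventions, assumed here for illustration: it returns the smallest data value whose cumulative percent reaches p. At an exact jump it returns the lower value, so p = 50 gives 1.1 here rather than the averaged median 1.25. The function name `percentile_from_cdf` is hypothetical.

```python
def percentile_from_cdf(values, p):
    """Smallest data value whose cumulative percent is at least p (0 to 100)."""
    ordered = sorted(values)
    n = len(ordered)
    # Each data value adds an equal jump of 100/n percent to the CDF.
    for i, x in enumerate(ordered, start=1):
        if 100 * i / n >= p:
            return x
    return ordered[-1]


spending = [0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5]   # $ thousands
print(percentile_from_cdf(spending, 80))
```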
Slide
4-49 Five-Number Summary
• Selected landmarks to represent entire data set
– Median = 50th percentile
– Quartiles
• LQ = Lower Quartile = 25th percentile
– Rank = $\frac{1 + \mathrm{int}\left(\frac{1+n}{2}\right)}{2}$, where $\frac{1+n}{2}$ is the rank of the median
– int discards the decimal, if any: int(10.5)=10, int(35)=35
• UQ = Upper Quartile = 75th percentile
– Rank is n+1–[rank of lower quartile]
– Extremes
• Smallest = 0th percentile
• Largest = 100th percentile
Slide
4-50 Five-Number Summary (continued)
• Provides information about
– Central summary
• Median
– Range of the data
• Largest – smallest
– “Middle half” of the data
• From LQ to UQ
– Skewness
• If median is not approximately half way between quartiles
Slide
4-51 Box Plot
• Displays five-number summary
[Figure: box plot on an axis from 0 to 8, marking Smallest, Lower Quartile, Median, Upper Quartile, and Largest; the box spans the middle half of the data]
• Less detail than histogram
– Easier to compare many groups
Slide
4-52 Example: Spending
• Spending rank ordered from smallest to largest
0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
Ranks: 1, 2, 3, 4, 5, 6, 7, 8
• Rank of median = (1+8)/2 = 4.5
• Rank of LQ = (1+4)/2 = 2.5, where 4 = int(4.5)
• Rank of UQ = 8+1–2.5 = 6.5
• LQ is (0.6+0.9)/2 = 0.75
• UQ is (2.8+3.8)/2 = 3.3
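The rank calculations above can be sketched as a small function. This follows the quartile rule used in these slides (other textbooks and software use slightly different quartile definitions); the helper names `value_at_rank` and `five_number_summary` are chosen here for illustration.

```python
def value_at_rank(ordered, rank):
    """Value at a 1-based rank; a rank ending in .5 averages its two neighbours."""
    lower = ordered[int(rank) - 1]
    if rank == int(rank):
        return lower
    return (lower + ordered[int(rank)]) / 2


def five_number_summary(values):
    """Return (smallest, LQ, median, UQ, largest) using the slide's rank rules."""
    ordered = sorted(values)
    n = len(ordered)
    median_rank = (1 + n) / 2
    lq_rank = (1 + int(median_rank)) / 2   # int() discards the decimal part
    uq_rank = n + 1 - lq_rank
    return (ordered[0],
            value_at_rank(ordered, lq_rank),
            value_at_rank(ordered, median_rank),
            value_at_rank(ordered, uq_rank),
            ordered[-1])


spending = [3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1]   # $ thousands
summary = five_number_summary(spending)
print(summary)
```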
Slide
4-53 Example: Spending (continued)
• Five-number summary
0.3, 0.75, 1.25, 3.3, 5.5
Smallest, LQ, Median, UQ, Largest
• Box plot
[Figure: box plot of spending ($thousands) on an axis from 0 to 5, with LQ and UQ marked]
Slide
4-55
Fig 4.2.3
Example: Technology CEO Pay
• CEO compensation in technology companies
– Detailed box plot identifies outliers
• and identifies the most extreme non-outliers,
• gives more detail than the (ordinary) box plot
[Figure: detailed box plot and ordinary box plot of technology CEO compensation; horizontal axes: $0 to $10,000,000; labeled points include Apple Computer, AMD, IBM, and Sun Microsystems]
Slide
4-56
Fig 4.2.3
Example: CEO Compensation
• Box plots to compare firms within industry groups
– Utilities group generally shows lower compensation
– Highest-paid are in Financial Services group
[Figure: box plots of CEO compensation for four industry groups (Utilities, Technology, Financial, Energy); labeled firms include GPU, Enron, Duke Energy, Apple Computer, AMD, IBM, Sun Microsystems, Berkshire Hathaway, Lehman Brothers, Merrill Lynch, Goldman Sachs, Citigroup, Morgan Stanley Dean Witter, Bear Stearns, Baker Hughes, and Phillips Petroleum]
[Figure fragment: horizontal axis: Size of current donation ($0 to $100); category label: 4+ past 2 years]
Slide
4-59
Fig 4.2.9
Example: Business Failures
• Per million people, by state
• 90th percentile is 432.4
• 50th percentile is 260.2
[Figure: cumulative distribution function of business failures; horizontal axis: Failures (0 to 700); vertical axis: Cumulative Percent (0% to 100%)]
Slide
4-60
Fig 4.2.10
Example: Business Failures
• Compare histogram, box plot, and CDF
[Figure: three stacked panels for the same data (Histogram, Box plot, CDF), each with horizontal axis Failures (0 to 500)]
GEOMETRIC MEAN
The geometric mean (GM) of a set of n numbers
is defined as the nth root of the product of the
n numbers
$GM = (X_1 \cdot X_2 \cdots X_n)^{1/n}$
The geometric mean is used to average
percentages, indexes, and relatives
Example 1: The price of a commodity has risen
by 6%, 13%, 11%, and 15% in each of 4
successive years.
Determine the GM (average) rise
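A hedged sketch of Example 1 follows. It assumes the usual treatment of successive percentage rises, in which each rise is first converted to a growth factor (1.06, 1.13, 1.11, 1.15) and the geometric mean of those factors is taken; the slides do not show the working, and taking the geometric mean of the raw percentages instead would give a different figure. The function name `geometric_mean` is chosen here for illustration.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive numbers")
    return math.prod(values) ** (1 / len(values))


rises = [0.06, 0.13, 0.11, 0.15]            # yearly price rises
growth_factors = [1 + r for r in rises]     # assumption: average the factors

average_rise = geometric_mean(growth_factors) - 1
print(f"{average_rise:.1%}")                # roughly 11.2% per year
```

Under this assumption, four years of growth at the average rate reproduce the same total price increase as the four actual rises.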
Measures of Spread
Variability: Dealing with
Diversity
Slide
4-66 Variability: Introduction
• Also known as dispersion, spread, uncertainty,
diversity, risk
• It is the extent to which the values in a set of
observations differ from each other; i.e., it
describes the degree of spread in a distribution
• If all the values are similar, the dispersion is low; if
there is a wide range of different values, the
dispersion is high
• Measure of Dispersion
– Is a measure, which helps to describe the amount of
dispersion, spread, or variability in a set of observations
• Importance of Measuring Variations
Slide
4-67
– It shows how far an average is representative
of the entire data; i.e., small variation means the average
is representative, and vice versa
– To determine the nature and cause of variation in order
to control the variation itself
– To enable comparison to be made of two or more series
with regard to their variability i.e. a means of
determining uniformity or consistency; low degree of
variation means high uniformity or consistency
– To facilitate the use of other statistical measures. For
example, correlation analysis, testing of hypotheses, the
analysis of fluctuations etc. are all based on measures
of variation
Slide
4-68 Examples
• Stock market, daily change, is uncertain
– Not the same, day after day!
• Risk of a business venture
– There are potential rewards, but possible losses
• Uncertain payoffs and risk aversion
– Which would you rather have
• $1,000,000 for sure
• $0 or $2,000,000, each outcome equally likely
– Both have same average! ($1,000,000)
– Most would prefer the choice with less uncertainty
Slide
4-69 Types
• Variance & standard deviation
• Coefficient of Variation
• Range
• Inter Quartile Range
• Quartile deviation
Slide
4-70 Standard Deviation S
• Measures variability by answering:
– “Approximately how far from average are the data
values?” (same measurement units as the data)
– The square root of the average squared deviation
• (dividing by n-1 instead of n for a sample)
• For a sample
$S = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}}$
• For a population
$\sigma = \sqrt{\frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N}}$
Slide
4-73 Example: Spending
• Customers plan to spend ($thousands)
3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
• Average is 2.05. Sum of squared deviations is
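$(3.8-2.05)^2 + (1.4-2.05)^2 + \cdots + (1.1-2.05)^2 = 23.34$
• Divide by 8–1=7 and take square root:
$S = \sqrt{23.34/7} = \sqrt{3.334286} = 1.83$ = Standard deviation
[Figure: histogram of spending; horizontal axis: spending (0 to 7); vertical axis: Frequency (0 to 3); the average, 2.05, is marked with one standard deviation, 1.83, on either side]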
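The calculation for the spending data can be reproduced with a short sketch. The function names are chosen here for illustration; Python's `statistics.stdev` and `statistics.pstdev` compute the same two quantities.

```python
import math


def sample_std_dev(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - 1))


def population_std_dev(values):
    """Same calculation, dividing by N instead of N - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / n)


spending = [3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1]   # $ thousands
print(round(sample_std_dev(spending), 2))
```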
Slide
4-75
Fig 5.1.3
Normal Distribution and Std. Dev.
• For a normal distribution only
• 2/3 of data within one standard deviation of the average
(either above or below)
• 95% for 2 std. devs.
• 99.7% for 3
[Figure: normal curve with the region within one standard deviation on either side of the average marked as 2/3 of the data]
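The fractions quoted for the normal distribution can be checked numerically. This sketch uses the standard identity that, for a normal distribution, the fraction of values within k standard deviations of the average is erf(k/√2); the function name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{fraction_within(k):.1%}")
```

The value for one standard deviation is a little over 68%, which the slide rounds to 2/3.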
Slide
4-78 Example: The Stock Market
• Daily stock market returns, S&P500 index, first
half of 2001. Standard deviation is 1.43%
– Average daily percent change: -0.03%
– Typical day: about 1.5 percentage points up or down
[Figure: histogram of daily returns; horizontal axis: Stock market return (-5% to 5%); vertical axis: Frequency (days), 0 to 30; the average is marked with one standard deviation on either side]
Slide
4-79
Fig 5.1.11
Mining the Donations Database
• 989 people made donations
– Average donation $15.77, standard deviation $11.68
– Skewed distribution for donation amounts
[Figure: histogram of donations; horizontal axis: Donation amount ($0 to $120); vertical axis: Number of people (0 to 300); the average donation is marked with one standard deviation on either side]
Slide
4-80 The Range
• The difference: Largest – Smallest
• Good features
– Easy and fast to compute
– Describe the data
– Check the data: Is the range too big to be reasonable?
• Problem
– Very sensitive to just two data values
• Compare to standard deviation, which combines all data values
Slide
4-81 Example: Spending
• $Thousands: 3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
• The range is 5.5–0.3 = 5.2
– larger than the standard deviation, 1.83
[Figure: histogram of spending; horizontal axis: spending (0 to 7); vertical axis: Frequency (0 to 3); the range is shown together with the average and one standard deviation]
Slide
4-82 Coefficient of Variation
• A relative measure of variability
• The ratio: Standard deviation divided by average
– For a sample: $S/\bar{X}$
– For a population: $\sigma/\mu$
• No measurement units. A pure number. Answers:
– “Typically, in percentage terms, how far are data values
from average?”
• Useful for comparing situations of different sizes
– To see how variability compares after adjusting for size
Slide
4-83 Example: Portfolio Performance
• You have invested $100 in each of 5 stocks
– Results: $116, 83, 105, 113, 98
– Average is $103, std. dev. is $13.21
• Your friend has invested $1,000 in each stock
– Results: $1,160, 830, 1,050, 1,130, 980
– Average is $1,030, std. dev. is $132.10
• Coefficients of variation are identical
13.21/103 = 132.10/1,030 = 0.128 = 12.8%
• Typically, results for these 5 stocks were
approximately 12.8% from their average value
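The comparison of the two portfolios can be sketched with the standard library. The function name `coefficient_of_variation` is chosen here for illustration; `statistics.stdev` is the sample standard deviation (dividing by n-1).

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the average (a pure number)."""
    return statistics.stdev(values) / statistics.mean(values)


yours = [116, 83, 105, 113, 98]        # results from $100 per stock
friends = [10 * x for x in yours]      # results from $1,000 per stock

print(f"{coefficient_of_variation(yours):.1%}")
print(f"{coefficient_of_variation(friends):.1%}")
```

Both lines print the same percentage, because scaling every value by 10 scales the standard deviation and the average by the same factor.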
Slide
4-84 Adding a Constant to the Data
• If the same number is added to each data value:
– The average changes by this same number
• The center of the distribution shifts by the same amount
– The standard deviation is unchanged
• Each data value stays the same distance from average
• Example: Order amounts: $3, 6, 9, 5, 8
– Average is $6.20, std. dev. is $2.39
– Now add shipping and handling, $1 per order:
$4, 7, 10, 6, 9
– Average rises by $1 to $7.20, but std. dev. is still $2.39
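A quick check of the shipping example, as a minimal sketch using Python's `statistics` module (sample standard deviation); the variable names are illustrative.

```python
import statistics

orders = [3, 6, 9, 5, 8]                 # order amounts in dollars
shipped = [x + 1 for x in orders]        # add $1 shipping and handling

# The average shifts by the constant; the standard deviation does not move.
print(statistics.mean(orders), round(statistics.stdev(orders), 2))
print(statistics.mean(shipped), round(statistics.stdev(shipped), 2))
```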
Slide
4-85 Multiplying the Data by a Constant
• If each data value is multiplied by some number:
– The average is multiplied by this same number
• The center of the distribution shifts by the same multiple
– The standard deviation is also multiplied by this same
number (after ignoring any minus sign)
• The distribution is widened (or narrowed) by this factor
• Example: Order amounts: $3, 6, 9, 5, 8
– Average is $6.20, std. dev. is $2.39
– Add 10% sales tax: $3.30, $6.60, $9.90, $5.50, $8.80
– Average rises by 10% to $6.82
– Std. dev. also rises by 10%, to $2.63
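The sales-tax example can be checked the same way. This sketch also multiplies by a negative number (an illustrative case not in the slides) to show why the minus sign is ignored for the standard deviation.

```python
import statistics

orders = [3, 6, 9, 5, 8]                   # order amounts in dollars
taxed = [x * 1.10 for x in orders]         # add 10% sales tax
flipped = [x * -2 for x in orders]         # hypothetical negative multiplier

# Average and standard deviation both scale by 1.10.
print(round(statistics.mean(taxed), 2), round(statistics.stdev(taxed), 2))

# With a multiplier of -2 the average changes sign, but the standard
# deviation is multiplied by 2 (the multiplier without its minus sign).
print(statistics.mean(flipped), round(statistics.stdev(flipped), 2))
```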
Slide
4-86 Example: International Exchange Rates
• Suppose $1 is worth 1.146 European euros
– Assume for now that this rate is constant
• Your firm is anticipating
– Average profits worth 850,000 euros
– Standard deviation (uncertainty) of 100,000 euros
• In dollars, after conversion, your firm anticipates
– Average profits worth 850,000/1.146 = $741,710
– Standard deviation of 100,000/1.146 = $87,260
• Relative risk is the same in $ and in euros
– Coefficient of variation is 11.8%