Week-3
Modar Shbat
Division of Engineering
modar.shbat@smu.ca
One of the first important numerical measures is a measure of center—a measure
along the horizontal axis that locates the center of the distribution.
For example, the birth weight data presented previously ranged from a low of 5.6 to a
high of 9.4, with the center of the histogram located in the vicinity of 7.5.
The Mean
The arithmetic average of a set of measurements is a very common and useful measure
of center. This measure is often referred to as the arithmetic mean, or simply the mean,
of a set of measurements.
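The computation behind the definition can be sketched in a few lines of Python. The data values below are illustrative only, not the birth weight sample from the slides:

```python
# Minimal sketch: the arithmetic mean is the sum of the measurements
# divided by their number.
def mean(measurements):
    """Arithmetic mean of a list of measurements."""
    return sum(measurements) / len(measurements)

print(mean([6.0, 7.5, 9.0]))  # 7.5
```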
Example:
We should remember that samples are measurements drawn from a larger population
that is usually unknown. An important use of the sample mean is as an estimator of the
unknown population mean µ.
As an example, the birth weight data in the previous discussion are a sample from a larger population of birth weights.
The calculated mean marks the balancing point of the distribution. The mean of the
entire population of newborn birth weights is unknown, but if you had to guess its value,
your best estimate would be 7.57. Although the sample mean changes from sample to
sample, the population mean µ stays the same.
The Median
A second measure of central tendency is the median, which is the value in the middle
position in the set of measurements ordered from smallest to largest.
Definition: The median m of a set of n measurements is the value of x that falls
in the middle position when the measurements are ordered from smallest to
largest.
Example 2:
Note: The value 0.5(n+1) indicates the position of the median in the ordered data set. If the
position of the median is a number that ends in the value 0.5, we need to average the two
adjacent values.
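The position rule just described can be sketched directly in Python. The data here are illustrative, not the slide's Example 2:

```python
# Sketch of the median rule: the median sits at position 0.5*(n + 1)
# in the ordered data; when that position ends in .5, average the two
# adjacent ordered values.
def median(measurements):
    x = sorted(measurements)
    n = len(x)
    pos = 0.5 * (n + 1)          # position of the median in the ordered set
    if pos == int(pos):
        return x[int(pos) - 1]   # odd n: a single middle value
    lo = int(pos) - 1            # even n: average the two middle values
    return (x[lo] + x[lo + 1]) / 2

print(median([3, 1, 2]))      # 2
print(median([4, 1, 3, 2]))   # 2.5
```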
From Example 2:
Although both the mean and the median are good measures of the center of a
distribution, the median is less sensitive to extreme values or outliers.
For example, the value x = 27 in the previous example is much larger than the other five measurements. The median, m = 7.5, is not affected by the outlier, whereas the sample average is pulled toward the large value.
The Mode
The mode is the value of x that occurs most frequently. It is possible for a distribution of measurements to have more than one mode. These modes would appear as “local peaks” in the relative frequency distribution (example: bimodal distribution).
In the figure we can see that both distributions are centered at x = 4, but there is a big
difference in the way the measurements spread out, or vary. The measurements in (a)
vary from 3 to 5; in (b) the measurements vary from 0 to 8.
Variability or dispersion is a very important characteristic of data. Measures of
variability can help you create a mental picture of the spread of the data.
The Range
Definition: The range, R, of a set of n measurements is defined as the
difference between the largest and smallest measurements.
For the birth weight data, the measurements vary
from 5.6 to 9.4. The range is 9.4 - 5.6 = 3.8.
For large data sets, the range is not an adequate measure of variability.
For example, the two relative frequency distributions in the figure have the same range but very different shapes and variability.
The Variance
We overcome the difficulty caused by the signs of the deviations by working with their
sum of squares. From the sum of squared deviations, a single measure called the
variance is calculated.
The variance will be relatively large for highly variable data and relatively small for less variable data.
Taking the square root of the variance, we obtain the standard deviation, which returns
the measure of variability to the original units of measurement.
Definition: The standard deviation of a set of measurements is equal to the
positive square root of the variance.
If you need to calculate the variance by hand, it is much easier to use the alternative
computing formula. This computational form is sometimes called the shortcut method
for calculating the variance.
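The definitional formula and the shortcut formula give the same result, which a short Python sketch can verify. The data are illustrative, not from the slides:

```python
# Sample variance two ways, both with divisor (n - 1).
def variance_definition(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def variance_shortcut(x):
    """Shortcut form: (sum of x_i^2 - (sum of x_i)^2 / n) / (n - 1)."""
    n = len(x)
    return (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

data = [5, 7, 1, 2, 4]
print(variance_definition(data))   # same value from both formulas
print(variance_shortcut(data))
```

The shortcut form avoids computing every deviation from the mean, which is why it is easier when working by hand.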
Note: It turns out that the sample variance with (n - 1) in the denominator provides better
estimates of population variance than would an estimator calculated with n in the
denominator. For this reason, we always divide by (n -1) when computing the sample
variance and the sample standard deviation.
This information allows you to compare several sets of data with respect to their
locations and their variability.
PRACTICAL SIGNIFICANCE OF THE STANDARD DEVIATION
We now introduce a useful theorem developed by the mathematician Tchebysheff.
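The statement of the theorem is not reproduced in the text above; as commonly stated, for any data set and any k ≥ 1, at least 1 − 1/k² of the measurements lie within k standard deviations of the mean. A quick sketch checking this bound on illustrative data:

```python
# Tchebysheff's Theorem (as commonly stated): for any data set and any
# k >= 1, at least 1 - 1/k**2 of the measurements lie within k standard
# deviations of the mean. Data below are illustrative.
def within_k_std(x, k):
    """Fraction of measurements within k sample standard deviations."""
    n = len(x)
    m = sum(x) / n
    s = (sum((xi - m) ** 2 for xi in x) / (n - 1)) ** 0.5
    return sum(abs(xi - m) <= k * s for xi in x) / n

data = [2, 4, 4, 4, 5, 5, 7, 9, 12, 30]
for k in (2, 3):
    print(within_k_std(data, k) >= 1 - 1 / k ** 2)  # True for any data set
```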
Example:
Empirical Rule
Another rule for describing the variability of a data set does not work for all data sets, but
it does work very well for data that “pile up” in the familiar mound shape (see the figure).
Since mound-shaped data distributions occur quite frequently in nature, the rule can
often be used in practical applications. For this reason, we call it the Empirical Rule.
According to the Empirical Rule, you expect approximately 68% of the measurements to
fall into the interval from 11.1 to 14.5, approximately 95% to fall into the interval from 9.4
to 16.2, and approximately 99.7% to fall into the interval from 7.7 to 17.9.
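The intervals quoted above are consistent with a mean of 12.8 and a standard deviation of 1.7 (inferred from the endpoints; the summary statistics are not restated on this slide). Rebuilding them:

```python
# Empirical Rule intervals: mean +/- k standard deviations for k = 1, 2, 3.
# The values 12.8 and 1.7 are inferred from the interval endpoints above.
m, s = 12.8, 1.7
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = m - k * s, m + k * s
    print(f"about {pct} of measurements in ({lo:.1f}, {hi:.1f})")
```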
Is Tchebysheff’s Theorem applicable? Yes, because it can be used for any set of data.
We can see that Tchebysheff’s Theorem is true for these data. In fact, the proportions of
measurements that fall into the specified intervals exceed the lower bound given by this
theorem.
The relative frequencies in the Table closely approximate those specified by the Empirical Rule.
A z-score measures the distance between an observation and the mean, measured in
units of standard deviation.
For example: suppose that the mean and standard deviation of the test scores (based
on a total of 35 points) are 25 and 4, respectively. The z-score for your score of 30 is
calculated as follows:
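Using the usual z-score formula, z = (x − x̄)/s, the calculation can be sketched as:

```python
# z-score from the example above: mean 25, standard deviation 4,
# observed score 30.
def z_score(x, mean, std):
    """Distance of x from the mean, in standard-deviation units."""
    return (x - mean) / std

print(z_score(30, 25, 4))  # 1.25: 1.25 standard deviations above the mean
```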
Example:
Suppose you have been notified that your score of 610 on the Verbal Graduate Record
Examination placed you at the 60th percentile in the distribution of scores. Where does
your score of 610 stand in relation to the scores of others who took the examination?
Solution:
Scoring at the 60th percentile means that 60% of all the examination scores were lower
than your score and 40% were higher.
In general, the 60th percentile for the variable x is a point on the horizontal axis of the
data distribution that is greater than 60% of the measurements and less than the others.
For small data sets, it is often impossible to divide the set into four groups, each of which
contains exactly 25% of the measurements. Even when you can perform this task, there
are many numbers that would satisfy the preceding definition, and could therefore be
considered “quartiles.” To avoid this ambiguity, we use the following rule to locate
sample quartiles.
Quartile, Example:
Since these positions are not integers, the lower quartile is taken to be the value 3/4 of
the distance between the second and third ordered measurements, and the upper
quartile is taken to be the value 1/4 of the distance between the eighth and ninth ordered
measurements. Therefore,
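The slide's numeric results are not reproduced here, but the interpolation rule just described can be sketched on illustrative data (n = 10, so Q1 sits at position 0.25(n + 1) = 2.75 and Q3 at 0.75(n + 1) = 8.25):

```python
# Sample quartiles by linear interpolation between adjacent ordered
# measurements, at 1-based positions 0.25*(n + 1) and 0.75*(n + 1).
# Data below are illustrative, not the slide's example.
def value_at(sorted_x, pos):
    """Value at a (possibly fractional) 1-based position."""
    i = int(pos)                  # lower neighbour (1-based)
    frac = pos - i                # fraction of the distance to the next value
    if frac == 0:
        return sorted_x[i - 1]
    return sorted_x[i - 1] + frac * (sorted_x[i] - sorted_x[i - 1])

x = sorted([18, 15, 22, 9, 12, 20, 16, 25, 14, 17])
n = len(x)
q1 = value_at(x, 0.25 * (n + 1))  # position 2.75
q3 = value_at(x, 0.75 * (n + 1))  # position 8.25
print(q1, q3)                     # 13.5 20.5
```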
We can measure the range of this “middle 50%” of the distribution using a numerical
measure called the interquartile range.
InterQuartile Range (IQR):
We will use the IQR along with the quartiles and the median in the next section to
construct another graph for describing data sets.
THE FIVE-NUMBER SUMMARY AND THE BOX PLOT
The five-number summary can be used to create a simple graph called a box plot to
visually describe the data distribution. From the box plot, you can quickly detect any
skewness in the shape of the distribution and see whether there are any outliers in the
data.
THE BOX PLOT
The box plot uses the IQR to create imaginary “fences” to separate outliers from the rest
of the data set:
Any measurement beyond the upper or lower fence is an outlier; the rest of the
measurements, inside the fences, are not unusual.
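The fences are conventionally placed 1.5 × IQR beyond the quartiles; a sketch with illustrative quartile values (not the slide's data set):

```python
# Box plot fences, as commonly defined: 1.5 * IQR beyond each quartile.
# Any measurement outside the fences is flagged as an outlier.
q1, q3 = 13.5, 20.5              # illustrative quartiles
iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # 3.0 31.0
```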
THE BOX PLOT: Example
Solution:
THE BOX PLOT: Example (Cont.)
The value x = 520 is the only outlier, lying beyond the upper fence.