Week-3

The document discusses measures of central tendency, including the mean, median, and mode, as well as measures of variability such as range, variance, and standard deviation. It explains how to calculate these statistics and their significance in understanding data distributions, including the use of Tchebysheff's Theorem and the Empirical Rule. Additionally, it covers concepts like z-scores and percentiles, which help in assessing the relative standing of data points within a dataset.
Dr. Modar Shbat
Division of Engineering
modar.shbat@smu.ca
One of the first important numerical measures is a measure of center—a measure
along the horizontal axis that locates the center of the distribution.
For example, the birth weight data presented previously ranged from a low of 5.6 to a
high of 9.4, with the center of the histogram located in the vicinity of 7.5.

We will consider some rules for locating the center of a distribution of measurements.

The Mean
The arithmetic average of a set of measurements is a very common and useful measure
of center. This measure is often referred to as the arithmetic mean, or simply the mean,
of a set of measurements.

Prob. & Stat. 3


To distinguish between the mean for the sample and the mean for the population, we will
use the symbol $\bar{x}$ (x-bar) for a sample mean and the symbol µ (Greek lowercase mu)
for the mean of a population.
Definition: The arithmetic mean or average of a set of n measurements is equal
to the sum of the measurements divided by n.

Suppose there are n measurements on the variable x; call them $x_1, x_2, \ldots, x_n$. To
add the n measurements together, we use this shorthand notation:
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
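
As a minimal sketch of the definition, the mean is the sum of the measurements divided by n. The function name `sample_mean` is illustrative, and the data are the five measurements 5, 7, 1, 2, 4 that these notes use in the discussion of variability.

```python
def sample_mean(measurements):
    """Return the arithmetic mean: the sum of the measurements divided by n."""
    n = len(measurements)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(measurements) / n


# Five measurements taken from the variability discussion in these notes.
data = [5, 7, 1, 2, 4]
x_bar = sample_mean(data)
print(x_bar)  # 19 / 5 = 3.8
```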

Example:

Example (Cont.):
The dot-plot in the following Figure seems to be centered between 6 and 8. To find the
sample mean, we calculate:

We should remember that samples are measurements drawn from a larger population
that is usually unknown. An important use of the sample mean is as an estimator of the
unknown population mean µ.
As an example, the birth weight data in the previous discussion are a sample from a larger
population of birth weights.

The mean of the 30 birth weights is:
$\bar{x} = \frac{\sum x_i}{30} = 7.57$

The calculated mean marks the balancing point of the distribution. The mean of the
entire population of newborn birth weights is unknown, but if you had to guess its value,
your best estimate would be 7.57. Although the sample mean changes from sample to
sample, the population mean µ stays the same.
The Median
A second measure of central tendency is the median, which is the value in the middle
position in the set of measurements ordered from smallest to largest.
Definition: The median m of a set of n measurements is the value of x that falls
in the middle position when the measurements are ordered from smallest to
largest.

Example 1:

Example 2:

Note: The value 0.5(n+1) indicates the position of the median in the ordered data set. If the
position of the median is a number that ends in the value 0.5, we need to average the two
adjacent values.
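
The position rule in the note can be sketched as follows. The function name `median_by_position` and both data sets are hypothetical; the slide's own example data did not survive, so the values below were chosen only to be consistent with the median of 7.5 and the large value 27 mentioned in these notes.

```python
def median_by_position(measurements):
    """Locate the median at position 0.5 * (n + 1) in the ordered data."""
    ordered = sorted(measurements)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = 0.5 * (n + 1)          # 1-based position in the ordered list
    lower = int(position)             # whole part of the position
    if position == lower:             # odd n: a single middle value
        return ordered[lower - 1]
    # even n: the position ends in .5, so average the two adjacent values
    return (ordered[lower - 1] + ordered[lower]) / 2


odd_sample = [2, 9, 11, 5, 6]         # hypothetical, n = 5, position 3
even_sample = [2, 9, 11, 5, 6, 27]    # hypothetical, n = 6, position 3.5
print(median_by_position(odd_sample))   # 6
print(median_by_position(even_sample))  # 7.5
```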

From Example 1:

From Example 2:

Although both the mean and the median are good measures of the center of a
distribution, the median is less sensitive to extreme values or outliers.
For example, the value x= 27 in the previous example is much larger than the other five
measurements:

The median, m =7.5, is not affected by the outlier, whereas the sample average:

is affected; its value is not representative of the remaining five observations.


When a data set has extremely small or extremely large observations, the sample mean
is drawn toward the direction of the extreme measurements.
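
A small sketch of this sensitivity, using Python's standard `statistics` module. Both data sets are hypothetical: they differ only in how extreme the largest value is, which changes the mean but leaves the median where it was.

```python
import statistics

# Hypothetical data: identical except for the size of the largest value.
mild = [2, 5, 6, 9, 11, 27]
extreme = [2, 5, 6, 9, 11, 270]

mean_mild = statistics.mean(mild)
mean_extreme = statistics.mean(extreme)
median_mild = statistics.median(mild)
median_extreme = statistics.median(extreme)

# The mean is pulled toward the extreme value; the median does not move.
print("means:  ", mean_mild, mean_extreme)
print("medians:", median_mild, median_extreme)
```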

If a distribution is skewed to the right, the mean shifts to the right; if a distribution is
skewed to the left, the mean shifts to the left. The median is not affected by these
extreme values because the numerical values of the measurements are not used in its
calculation. When a distribution is symmetric, the mean and the median are equal. If a
distribution is strongly skewed by one or more extreme values, you should use the
median rather than the mean as a measure of center.

The Mode
Another way to locate the center of a distribution is to look for the value of x that occurs
with the highest frequency. This measure of the center is called the mode.
Definition: The mode is the category that occurs most frequently, or the most
frequently occurring value of x. When measurements on a continuous variable
have been grouped as a frequency or relative frequency histogram, the class with
the highest peak or frequency is called the modal class, and the midpoint of that
class is taken to be the mode.
The mode is generally used to describe large data sets, whereas the mean and median
are used for both large and small data sets.
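
One way to find the most frequently occurring value is to count occurrences. This is a sketch with a hypothetical helper `modes` and made-up data; it returns a list because a data set can have more than one value tied for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    if not values:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


single_peak = [3, 5, 5, 4, 5, 6, 4, 5]   # hypothetical discrete data
two_peaks = [1, 1, 2, 3, 3]              # hypothetical bimodal data
print(modes(single_peak))  # [5]
print(modes(two_peaks))    # [1, 3]
```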

Discrete variable: the modal class and the value of x occurring with the highest frequency
are the same (mode = 5).

Continuous variable: in the raw data, a birth weight of 7.7 occurs four times, so 7.7 is the
mode. In the relative frequency histogram, the modal class runs from 7.6 to 8.1, and its
midpoint, (7.6 + 8.1)/2 = 7.85, is taken to be the mode.

It is possible for a distribution of measurements to have more than one mode. These
modes would appear as “local peaks” in the relative frequency distribution (example:
bimodal distribution).

Data sets may have the same center but look different because of the way the numbers
spread out from the center.

In the figure we can see that both distributions are centered at x = 4, but there is a big
difference in the way the measurements spread out, or vary. The measurements in (a)
vary from 3 to 5; in (b) the measurements vary from 0 to 8.
Variability or dispersion is a very important characteristic of data. Measures of
variability can help you create a mental picture of the spread of the data.
The Range
Definition: The range, R, of a set of n measurements is defined as the
difference between the largest and smallest measurements.
For the birth weight data, the measurements vary
from 5.6 to 9.4. The range is 9.4 - 5.6 = 3.8.

The range is easy to calculate, easy to interpret, and is an adequate measure of variation
for small sets of data.

For large data sets, the range is not an adequate measure of variability.

For example, the two relative frequency distributions in the figure have the same range but
very different shapes and variability.
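
The same point can be sketched numerically. The two data sets below are hypothetical (the figure's data are not available); they share the same range but differ in how tightly the values cluster around the center, which the sample standard deviation from Python's `statistics` module picks up.

```python
import statistics


def data_range(values):
    """Range R: largest measurement minus smallest measurement."""
    return max(values) - min(values)


# Hypothetical data: same smallest and largest values, different spread.
clustered = [0, 4, 4, 4, 4, 4, 8]
spread_out = [0, 1, 2, 4, 6, 7, 8]

print(data_range(clustered), data_range(spread_out))    # both ranges are 8
print(statistics.stdev(clustered), statistics.stdev(spread_out))
```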

Consider, as an example, the sample measurements 5, 7, 1, 2, 4, displayed as a dot-plot
in the following Figure. The mean of these five measurements is:
$\bar{x} = \frac{\sum x_i}{n} = \frac{19}{5} = 3.8$

The horizontal distances between each dot (measurement) and the mean will help you to
measure the variability. If the distances are large, the data are more spread out or
variable than if the distances are small.
The deviation of a measurement $x_i$ from the mean is $(x_i - \bar{x})$.

Measurements to the right of the mean produce positive deviations, and those to the left
produce negative deviations.

Because the deviations in the second column of the
table contain information on variability, one way to
combine the five deviations into one numerical
measure is to average them. Unfortunately, the
average will not work because the sum is always zero.
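
A quick check of this claim for the five measurements 5, 7, 1, 2, 4. The sketch uses exact fractions from Python's `fractions` module so that rounding error cannot hide or create a nonzero sum; the variable names are illustrative.

```python
from fractions import Fraction

data = [5, 7, 1, 2, 4]
x_bar = Fraction(sum(data), len(data))      # exactly 19/5

# Deviation of each measurement from the mean.
deviations = [x - x_bar for x in data]
total = sum(deviations)

print([str(d) for d in deviations])  # ['6/5', '16/5', '-14/5', '-9/5', '1/5']
print(total)                         # 0
```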

The Variance
We overcome the difficulty caused by the signs of the deviations by working with their
sum of squares. From the sum of squared deviations, a single measure called the
variance is calculated.

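The variance of a population of N measurements is the average of the squares of the
deviations of the measurements about their mean µ. It is denoted by $\sigma^2$ (Greek
lowercase sigma, squared):
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

The variance of a sample of n measurements is the sum of the squared deviations of the
measurements about their mean $\bar{x}$ divided by (n - 1). It is denoted by $s^2$:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

The variance will be relatively large for highly variable data and
relatively small for less variable data.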

Taking the square root of the variance, we obtain the standard deviation, which returns
the measure of variability to the original units of measurement.
Definition: The standard deviation of a set of measurements is equal to the
positive square root of the variance.
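
A minimal sketch of both definitions, assuming the usual sample formula in which the sum of squared deviations about the mean is divided by (n - 1). The function names are illustrative, and the data are the five measurements 5, 7, 1, 2, 4 used earlier in these notes.

```python
import math


def sample_variance(xs):
    """Sum of squared deviations about the mean, divided by (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two measurements are required")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [5, 7, 1, 2, 4]
s_squared = sample_variance(data)   # 22.8 / 4, about 5.7
s = sample_std(data)                # about 2.39, in the original units
print(round(s_squared, 4), round(s, 4))
```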

The Variance and the Standard Deviation

If you need to calculate the variance by hand, it is much easier to use the alternative
computing formula. This computational form is sometimes called the shortcut method
for calculating the variance.

Example:
Calculate the variance and standard deviation for the five measurements:
Use the computing formula.
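
The sketch below assumes the standard form of the computing formula, $s^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}$, and assumes that the five measurements are 5, 7, 1, 2, 4, as in the earlier dot-plot discussion (the slide's own data line did not survive). It also compares the result with `statistics.variance` from the standard library, which uses the same (n - 1) divisor.

```python
import math
import statistics


def variance_shortcut(xs):
    """Computing formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two measurements are required")
    sum_x = sum(xs)                        # 19 for the data below
    sum_x_squared = sum(x * x for x in xs)  # 95 for the data below
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


data = [5, 7, 1, 2, 4]                 # assumed data, see the note above
s_squared = variance_shortcut(data)    # (95 - 361/5) / 4, about 5.7
s = math.sqrt(s_squared)               # about 2.39
print(round(s_squared, 4), round(s, 4))
print(math.isclose(s_squared, statistics.variance(data)))
```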

Note: It turns out that the sample variance with (n - 1) in the denominator provides better
estimates of population variance than would an estimator calculated with n in the
denominator. For this reason, we always divide by (n -1) when computing the sample
variance and the sample standard deviation.
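
One way to see the effect of the divisor is to enumerate every possible sample from a tiny population and average the two candidate estimators. This is only an illustration under stated assumptions: the population [2, 4, 6] is hypothetical, samples of size 2 are drawn with replacement, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6]                       # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma_squared = sum((x - mu) ** 2 for x in population) / N   # 8/3


def variance_with_divisor(sample, divisor):
    """Sum of squared deviations about the sample mean over a chosen divisor."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample) / divisor


n = 2
samples = list(product(population, repeat=n))   # all 9 ordered samples

# Average each estimator over every possible sample.
avg_n_minus_1 = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)
avg_n = sum(variance_with_divisor(s, n) for s in samples) / len(samples)

print(sigma_squared, avg_n_minus_1, avg_n)   # 8/3 8/3 4/3
```

In this toy case the (n - 1) version averages out to the population variance, while the version with n in the denominator comes out too small on average.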

At this point, you have learned how to compute the variance and standard deviation of a
set of measurements. Remember these points:

This information allows you to compare several sets of data with respect to their
locations and their variability.
PRACTICAL SIGNIFICANCE OF THE STANDARD DEVIATION
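We now introduce a useful theorem developed by the mathematician Tchebysheff.

Tchebysheff’s Theorem: Given a number k greater than or equal to 1 and a set of n
measurements, at least $1 - \frac{1}{k^2}$ of the measurements will lie within k standard
deviations of their mean.

Tchebysheff’s Theorem applies to any set of measurements and can be used to describe
either a sample or a population.
The number k can be any number as long as it is greater than or equal to 1.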

Although the first statement is not at all helpful (k=1), the other two values of k provide
valuable information about the proportion of measurements that fall in certain intervals.
The values k=2 and k=3 are not the only values of k you can use; for example, the
proportion of measurements that fall within k=2.5 standard deviations of the mean is at
least $1 - \frac{1}{(2.5)^2} = 0.84$.
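
The lower bound from Tchebysheff’s Theorem, $1 - 1/k^2$, can be tabulated for any k greater than or equal to 1. The function name below is illustrative.

```python
def tchebysheff_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2


for k in (1, 2, 2.5, 3):
    # k = 1 gives 0 (no information); k = 2.5 gives about 0.84
    print(k, round(tchebysheff_lower_bound(k), 4))
```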

Example:

Solution: We have that: the mean and the variance


Thus, the standard deviation is defined as:
The distribution of measurements is centered around 75, and Tchebysheff’s Theorem
states:

Empirical Rule
Another rule for describing the variability of a data set does not work for all data sets, but
it does work very well for data that “pile up” in the familiar mound shape (see the figure).
Since mound-shaped data distributions occur quite frequently in nature, the rule can
often be used in practical applications. For this reason, we call it the Empirical Rule.

The mound-shaped distribution is commonly known as the normal distribution and will
be discussed in more detail later.
Example:
In a time study conducted at a manufacturing plant, the length of time to complete a
specified operation is measured for each of n=40 workers. The mean and standard
deviation are found to be 12.8 and 1.7, respectively. Describe the sample data using the
Empirical Rule.
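Solution: To describe the data, we calculate these intervals:
$(\bar{x} \pm s) = 12.8 \pm 1.7$, or 11.1 to 14.5
$(\bar{x} \pm 2s) = 12.8 \pm 3.4$, or 9.4 to 16.2
$(\bar{x} \pm 3s) = 12.8 \pm 5.1$, or 7.7 to 17.9
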
According to the Empirical Rule, you expect approximately 68% of the measurements to
fall into the interval from 11.1 to 14.5, approximately 95% to fall into the interval from 9.4
to 16.2, and approximately 99.7% to fall into the interval from 7.7 to 17.9.
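
The three intervals in this example can be reproduced with a few lines of code. The helper name `interval` is illustrative; the mean 12.8 and standard deviation 1.7 come from the time-study example, and the percentages are those quoted for the Empirical Rule.

```python
def interval(mean, std, k):
    """Return (low, high) for the interval mean - k*std to mean + k*std."""
    return mean - k * std, mean + k * std


mean, std = 12.8, 1.7   # values from the time-study example
approximate_share = {1: "68%", 2: "95%", 3: "99.7%"}

for k, share in approximate_share.items():
    low, high = interval(mean, std, k)
    # prints 11.1 to 14.5, then 9.4 to 16.2, then 7.7 to 17.9
    print(f"approximately {share}: {low:.1f} to {high:.1f}")
```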

If you doubt that the distribution of measurements is mound-shaped, or if you wish for
some other reason to be conservative, you can apply Tchebysheff’s Theorem and be
absolutely certain of your statements. Tchebysheff’s Theorem tells you that at least 3/4
of the measurements fall into the interval from 9.4 to 16.2 and at least 8/9 into the
interval from 7.7 to 17.9.
Example:
Teachers are trained to develop lesson plans. In a study to assess the relationship
between written lesson plans and their implementation in the classroom, 25 lesson plans
were scored on a scale of 0 to 34 according to a Lesson Plan Assessment Checklist.
Use Tchebysheff’s Theorem and the Empirical Rule (if applicable) to describe the
distribution of these assessment scores.
Solution:
First, we need to calculate the mean and
the standard deviation of the given data:

Example (Cont.):
Second, we define the appropriate intervals
and count the actual number of measurements
that fall into each of these intervals.

Is Tchebysheff’s Theorem applicable? Yes, because it can be used for any set of data.

We can see that Tchebysheff’s Theorem is true for these data. In fact, the proportions of
measurements that fall into the specified intervals exceed the lower bound given by this
theorem.
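
The counting step can be sketched as follows. The scores below are hypothetical (the 25 lesson plan scores from the slide are not available here), and `proportion_within` is an illustrative helper that uses the sample mean and sample standard deviation from Python's `statistics` module.

```python
import statistics


def proportion_within(data, k):
    """Fraction of measurements inside mean - k*s to mean + k*s."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    low, high = x_bar - k * s, x_bar + k * s
    inside = sum(1 for x in data if low <= x <= high)
    return inside / len(data)


# Hypothetical assessment scores, not the data from the study.
scores = [10, 12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 30]

for k in (1, 2, 3):
    actual = proportion_within(scores, k)
    bound = 1 - 1 / k ** 2          # Tchebysheff lower bound
    print(k, round(actual, 2), "at least", round(bound, 2))
```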

Example (Cont.):

Is the Empirical Rule applicable?

The data show that the distribution is relatively mound-shaped, so the Empirical
Rule should work relatively well.
According to the Empirical Rule we have:

The relative frequencies in the Table closely approximate those specified by the
Empirical Rule.

In some cases, we need to know the position of one observation relative to others in a
set of data.
For example, if you took an examination with a total of 35 points, you might want to
know how your score of 30 compared to the scores of the other students in the class.
Z-score
The mean and standard deviation of the scores can be used to calculate a z-score,
which measures the relative standing of a measurement in a data set.

A z-score measures the distance between an observation and the mean, measured in
units of standard deviation:
$z = \frac{x - \bar{x}}{s}$
For example: suppose that the mean and standard deviation of the test scores (based
on a total of 35 points) are 25 and 4, respectively. The z-score for your score of 30 is
calculated as follows:
$z = \frac{x - \bar{x}}{s} = \frac{30 - 25}{4} = 1.25$
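
A minimal sketch of this calculation, assuming the usual definition of the z-score as the observation minus the mean, divided by the standard deviation. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std):
    """Distance between x and the mean, in units of standard deviation."""
    if std <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std


# Test score of 30 with mean 25 and standard deviation 4, as in the example.
z = z_score(30, 25, 4)
print(z)  # 1.25
```

A score of 30 therefore lies 1.25 standard deviations above the mean.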

Percentile:
A percentile is another measure of relative standing and is most often used for large
data sets (percentiles are not very useful for small data sets).

Example:
Suppose you have been notified that your score of 610 on the Verbal Graduate Record
Examination placed you at the 60th percentile in the distribution of scores. Where does
your score of 610 stand in relation to the scores of others who took the examination?
Solution:
Scoring at the 60th percentile means that 60% of all the examination scores were lower
than your score and 40% were higher.
In general, the 60th percentile for the variable x is a point on the horizontal axis of the
data distribution that is greater than 60% of the measurements and less than the others.
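
The idea can be sketched by counting how many measurements fall below a given value. The ten scores are hypothetical and far fewer than a real examination would have; they were chosen so that exactly 60% of them lie below 610. The helper name `percent_below` is illustrative.

```python
def percent_below(scores, value):
    """Percentage of measurements that are strictly less than value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)


# Hypothetical examination scores.
scores = [450, 480, 500, 520, 550, 580, 610, 640, 670, 700]
print(percent_below(scores, 610))  # 60.0
```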

Since the total area under the distribution is 100%, 60% of the area is to the left and 40%
of the area is to the right of the 60th percentile. The median, m, of a set of data is the
middle measurement; that is, 50% of the measurements are smaller and 50% are larger than
the median. Thus, the median is the same as the 50th percentile.

The 25th and 75th percentiles, called the lower and upper quartiles, along with the median
(the 50th percentile), locate points that divide the data into four sets, each containing
an equal number of measurements.

Quartile:

For small data sets, it is often impossible to divide the set into four groups, each of which
contains exactly 25% of the measurements. Even when you can perform this task, there
are many numbers that would satisfy the preceding definition, and could therefore be
considered “quartiles.” To avoid this ambiguity, we use the following rule to locate
sample quartiles.

Quartile, Example:

Since these positions are not integers, the lower quartile is taken to be the value 3/4 of
the distance between the second and third ordered measurements, and the upper
quartile is taken to be the value 1/4 of the distance between the eighth and ninth ordered
measurements. Therefore,

We can measure the range of this “middle 50%” of the distribution using a numerical
measure called the interquartile range.
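
The sketch below assumes the position rule implied by the example: the lower quartile sits at position 0.25(n + 1) and the upper quartile at position 0.75(n + 1) in the ordered data, with interpolation between adjacent values when the position is not an integer. The ten measurements are hypothetical, and the function names are illustrative. Note that library routines may use other quartile conventions and give slightly different values.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    position = min(max(position, 1), n)   # keep the position inside the data
    lower = int(position)                 # whole part of the position
    weight = position - lower             # fractional part of the position
    if weight == 0:
        return ordered[lower - 1]
    gap = ordered[lower] - ordered[lower - 1]
    return ordered[lower - 1] + weight * gap


def quartiles(data):
    """Return (Q1, Q3) using positions 0.25(n + 1) and 0.75(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3


# Hypothetical data with n = 10, so the positions are 2.75 and 8.25.
data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
q1, q3 = quartiles(data)
iqr = q3 - q1                 # spread of the middle 50% of the data
print(q1, q3, iqr)            # 8.75 18.5 9.75
```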

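InterQuartile Range (IQR):
The interquartile range (IQR) for a set of measurements is the difference between the
upper and lower quartiles; that is, $IQR = Q_3 - Q_1$.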

For the data in the previous Example we have that:

We will use the IQR along with the quartiles and the median in the next section to
construct another graph for describing data sets.

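THE FIVE-NUMBER SUMMARY AND THE BOX PLOT

The five-number summary consists of the smallest measurement, the lower quartile, the
median, the upper quartile, and the largest measurement, displayed in order from smallest
to largest: Min, $Q_1$, Median, $Q_3$, Max.
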
The five-number summary can be used to create a simple graph called a box plot to
visually describe the data distribution. From the box plot, you can quickly detect any
skewness in the shape of the distribution and see whether there are any outliers in the
data.

THE BOX PLOT

The box plot uses the IQR to create imaginary “fences” to separate outliers from the rest
of the data set:
Lower fence: $Q_1 - 1.5(IQR)$
Upper fence: $Q_3 + 1.5(IQR)$

Any measurement beyond the upper or lower fence is an outlier; the rest of the
measurements, inside the fences, are not unusual.

THE BOX PLOT: Example

Solution:

THE BOX PLOT: Example (Cont.)

The value x=520, is the only outlier, lying beyond the upper fence.

We find that the smallest and largest measurements that are not outliers are
x = 260 and x = 340. These are the two values that form the whiskers. Since the value
x = 340 is the same as Q3, there is no whisker on the right side of the box.
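
The whole procedure can be sketched in code. The assumptions are stated loudly: the quartile positions are taken as 0.25(n + 1) and 0.75(n + 1), the fences are placed 1.5 IQR beyond the quartiles, and the eight measurements are hypothetical values chosen only to agree with the numbers quoted in this example (outlier 520, whisker ends 260 and 340, Q3 = 340). The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    position = min(max(position, 1), n)   # keep the position inside the data
    lower = int(position)
    weight = position - lower
    if weight == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + weight * (ordered[lower] - ordered[lower - 1])


def box_plot_summary(data):
    """Five-number summary, fences, whisker ends, and outliers."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    median = value_at_position(ordered, 0.50 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr          # assumed fence rule
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),   # extremes that are not outliers
        "outliers": outliers,
    }


# Hypothetical data consistent with the values quoted in the example.
data = [260, 290, 300, 320, 330, 340, 340, 520]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(name, value)
```

With these values the upper whisker ends at 340, which equals Q3, so the right side of the box has no whisker and 520 is flagged as the only outlier.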
