Statistics and Probability
In this chapter:
1.1 Introduction
1.2 Frequency distribution
1.3 Diagrams and Graphs
1.3.1 Histogram
1.3.2 Stem and leaf diagram
1.3.3 Ogives
1.3.4 Frequency polygon
1.4 Mean, median and mode
1.1 Introduction
A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. As such, measures of central
tendency are sometimes called measures of central location. They are also classed as
summary statistics. In other words, a measure of central tendency is a number that
represents the typical value in a collection of numbers. The three most familiar measures
are the mean, the median, and the mode. The mean, often called the average, is probably
the one you know best.
The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to
use than others. In the following sections, we will look at the mean, mode and median,
and learn how to calculate them and under what conditions they are most appropriate to
be used.
• Frequencies can be absolute (when the frequency provided is the actual count
of the occurrences) or relative (when the absolute frequency is divided by the
total number of observations, giving a value in the interval [0, 1]).
• Relative frequencies are particularly useful if you want to compare distributions
drawn from two different sources (e.g. when the number of observations from each
source is different).
Ex :
Frequency 2 4 5 1
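As a small illustration of the difference between absolute and relative frequencies, the sketch below converts the absolute frequencies 2, 4, 5, 1 from the row above into relative frequencies. The variable names are illustrative only.

```python
from fractions import Fraction

# Absolute frequencies taken from the example row above.
absolute = [2, 4, 5, 1]
total = sum(absolute)

# Relative frequency = absolute frequency / total number of observations.
relative = [Fraction(f, total) for f in absolute]

for f, r in zip(absolute, relative):
    print(f"absolute {f:>2}   relative {float(r):.3f}")
```

The relative frequencies always add up to 1, which is what makes two distributions of different sizes comparable.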
Example 1
A traffic inspector has counted the number of automobiles passing a certain point in 100
successive 20-minute time periods. The observations are listed below.
23 20 16 18 30 22 26 15 5 18
14 17 11 37 21 6 10 20 22 25
19 19 19 20 12 23 24 17 18 16
27 16 28 26 15 29 19 35 20 17
12 30 21 22 20 15 18 16 23 24
15 24 28 19 24 22 17 19 8 18
17 18 23 21 25 19 20 22 21 21
16 20 19 11 23 17 23 13 17 26
26 14 15 16 27 18 21 24 33 20
21 27 18 22 17 20 14 21 22 19
A useful method for summarizing a set of data is the construction of a frequency table,
or a frequency distribution. That is, we divide the overall range of values into a number
of classes and count the number of observations that fall into each of these classes or
intervals.
The general rules for constructing a frequency distribution are
i) There should not be too few or too many classes.
ii) In so far as possible, equal class intervals are preferred. But the first and last classes
can be open-ended to cater for extreme values.
iii) Each class should have a class mark to represent it. The class mark is also called
the class midpoint. For the ith class it is found by taking the simple average of the class
boundaries or the class limits of that class.
1. Setting up the classes
Choose a class width of 5 for each class, then we have seven classes going from 5 to 9,
from 10 to 14, …, and from 35 to 39.
2. Counting
Classes    Count
5 – 9        3
10 – 14      9
15 – 19     36
20 – 24     35
25 – 29     12
30 – 34      3
35 – 39      2
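The tallying step can be checked mechanically. The following is a minimal sketch, assuming the 100 observations of Example 1 and the class width of 5 chosen above; the names `observations`, `lower_limits` and `counts` are illustrative.

```python
# The 100 traffic counts from Example 1, row by row.
observations = [
    23, 20, 16, 18, 30, 22, 26, 15, 5, 18,
    14, 17, 11, 37, 21, 6, 10, 20, 22, 25,
    19, 19, 19, 20, 12, 23, 24, 17, 18, 16,
    27, 16, 28, 26, 15, 29, 19, 35, 20, 17,
    12, 30, 21, 22, 20, 15, 18, 16, 23, 24,
    15, 24, 28, 19, 24, 22, 17, 19, 8, 18,
    17, 18, 23, 21, 25, 19, 20, 22, 21, 21,
    16, 20, 19, 11, 23, 17, 23, 13, 17, 26,
    26, 14, 15, 16, 27, 18, 21, 24, 33, 20,
    21, 27, 18, 22, 17, 20, 14, 21, 22, 19,
]

class_width = 5
lower_limits = list(range(5, 40, class_width))  # 5, 10, ..., 35

# Count the observations falling in each class lo .. lo + 4 (inclusive limits).
counts = [
    sum(1 for x in observations if lo <= x <= lo + class_width - 1)
    for lo in lower_limits
]

for lo, c in zip(lower_limits, counts):
    print(f"{lo:>2} - {lo + class_width - 1:<2}  {c:>3}")
```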
1.3.1 Histogram
A histogram is usually used to present frequency distributions graphically. This is
constructed by drawing rectangles over each class. The area of each rectangle
should be proportional to its frequency.
Notes :
1. The vertical lines of a histogram should be the class boundaries.
2. The range of the random variable should constitute the major portion of the
graphs of frequency distributions. If the smallest observation is far away from
zero, then a 'break' sign should be introduced in the horizontal axis.
1.3.2 Stem and leaf diagram
Construction
To construct a stem-and-leaf display, the observations must first be sorted in ascending
order: this can be done most easily if working by hand by constructing a draft of the
stem-and-leaf display with the leaves unsorted, then sorting the leaves to produce the
final stem-and-leaf display. Here is the sorted set of data values that will be used in the
following example:
44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106
Next, it must be determined what the stems will represent and what the leaves will
represent. Typically, the leaf contains the last digit of the number and the stem contains
all of the other digits. In the case of very large numbers, the data values may be
rounded to a particular place value (such as the hundreds place) that will be used for
the leaves. The remaining digits to the left of the rounded place value are used as the
stem.
In this example, the leaf represents the ones place and the stem will represent the rest
of the number (tens place and higher).
The stem-and-leaf display is drawn with two columns separated by a vertical line. The
stems are listed to the left of the vertical line. It is important that each stem is listed only
once and that no numbers are skipped, even if it means that some stems have no
leaves. The leaves are listed in increasing order in a row to the right of each stem.
It is important to note that when there is a repeated number in the data (such as two
72s) then the plot must reflect such (so the plot would look like 7 | 2 2 5 6 7 when it has
the numbers 72 72 75 76 77).
Key:
Leaf unit: 1.0
Stem unit: 10.0
Rounding may be needed to create a stem-and-leaf display. Based on the following set
of data, the stem plot below would be created:
−23.678758, −12.45, −3.4, 4.43, 5.5, 5.678, 16.87, 24.7, 56.8
For negative numbers, a negative sign is placed in front of the stem, which is still the
value X/10. Non-integers are rounded. This allows the stem-and-leaf plot to retain its
shape, even for more complicated data sets.
Stem-and-leaf displays are useful for displaying the relative density and shape of the
data, giving the reader a quick overview of the distribution. They are also useful for
highlighting outliers and finding the mode. However, stem-and-leaf displays are only
useful for moderately sized data sets (around 15–150 data points). With very small data
sets, a stem-and-leaf display can be of little use, as a reasonable number of data points
are required to establish definitive distribution properties. A dot plot may be better suited
for such data. With very large data sets, a stem-and-leaf display will become very
cluttered, since each data point must be represented numerically. A box
plot or histogram may become more appropriate as the data size increases.
Example: Represent the following data by a stem-and-leaf diagram:
12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41
Stem | Leaf
1 | 2 3
2 | 1 7
3 | 3 4 5 7
4 | 0 0 1
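The construction described in this section can be sketched in a few lines. This is an illustrative helper, not a standard library routine; it assumes non-negative integers, a leaf unit of 1 and a stem unit of 10, and the name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted leaves (value % 10)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    lo, hi = min(groups), max(groups)
    # List every stem between the smallest and largest, even if it has no leaves.
    return {stem: groups.get(stem, []) for stem in range(lo, hi + 1)}

data = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Applied to the earlier data set (44, 46, ..., 106), the same helper lists the stems 5 and 9 with no leaves, as the construction rules require.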
1.3.3 Ogive
The word Ogive is a term used in architecture to describe curves or curved shapes.
In statistics, Ogives are graphs that are used to estimate how many observations lie
below or above a particular value in the data. To construct an Ogive, the cumulative
frequency of each class is first calculated from a frequency table. It is found by adding
the frequencies of all the previous classes to the frequency of the current class. The last
number in the cumulative frequency table is always equal to the total frequency. The
most commonly used graphs of a frequency distribution are the histogram, the
frequency polygon, the frequency curve and Ogives (cumulative frequency curves).
Ogive Definition
The Ogive is defined as the cumulative frequency distribution graph of a series. It
shows data values on the horizontal axis and either the cumulative frequencies, the
cumulative relative frequencies or the cumulative percent frequencies on the vertical
axis. Cumulative frequency is defined as the sum of all the previous frequencies up to
the current point. The Ogive helps in finding how many observations, or what
proportion of them, fall within a certain range of values. Create the Ogive by plotting the
point corresponding to the cumulative frequency of each class interval. Statisticians use
the Ogive to present the data pictorially. It helps in estimating the number of
observations which are less than or equal to a particular value.
Ogive Graph
Frequency graphs are used to exhibit the characteristics of discrete and continuous
data. Such figures are more appealing to the eye than tabulated data. They also
facilitate the comparative study of two or more frequency distributions, because we can
compare the shape and pattern of the distributions. The two types of Ogive are the
"less than" Ogive and the "more than" Ogive.
Ogive Example
1) Draw the frequency curve for the following:
CI 10-20 20-30 30-40 40-50
F 10 30 40 20
Question 1:
Construct the "more than" cumulative frequency table and draw the Ogive for the data
given below.
Frequency 3 8 12 14 10 6 5 2
Solution:
"More than" Cumulative Frequency Table:
Marks           Frequency   Cumulative frequency
More than 1         3              60
More than 11        8              57
More than 21       12              49
More than 31       14              37
More than 41       10              23
More than 51        6              13
More than 61        5               7
More than 71        2               2
Plotting an Ogive:
Plot the points with coordinates (70.5, 2), (60.5, 7), (50.5, 13), (40.5, 23), (30.5, 37),
(20.5, 49), (10.5, 57), (0.5, 60), i.e. each lower class boundary against its "more than"
cumulative frequency.
The Ogive is then joined to the point on the x-axis that represents the upper boundary
of the last class, i.e. (80.5, 0).
Scale: x-axis, 1 cm = 10 marks; y-axis, 1 cm = 10 c.f.
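The "more than" cumulative frequencies and the plotting points can be reproduced with a short sketch. It assumes the lower class limits 1, 11, ..., 71 read from the solution table and a boundary correction of 0.5; the variable names are illustrative.

```python
lower_limits = [1, 11, 21, 31, 41, 51, 61, 71]
frequencies = [3, 8, 12, 14, 10, 6, 5, 2]

# "More than" cumulative frequency: observations in this class and all later ones.
more_than_cf = []
remaining = sum(frequencies)
for f in frequencies:
    more_than_cf.append(remaining)
    remaining -= f

# Plot each lower class boundary (limit - 0.5) against its cumulative frequency,
# then close the curve at the upper boundary of the last class.
points = [(lo - 0.5, cf) for lo, cf in zip(lower_limits, more_than_cf)]
points.append((80.5, 0))
print(points)
```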
1.3.4 Frequency polygon
Another method to represent frequency distribution graphically is by a frequency
polygon. As in the histogram, the base line is divided into sections corresponding to
the class-interval, but instead of the rectangles, the points of successive class marks
are being connected. The frequency polygon is particularly useful when two or more
distributions are to be presented for comparison on the same graph.
Step 1- Choose the class intervals and mark the values on the horizontal axis.
Step 2- Mark the mid value of each interval on the horizontal axis.
Step 3- Mark the frequencies on the vertical axis.
Step 4- Corresponding to the frequency of each class interval, mark a point at
that height above the middle of the class interval.
Step 5- Connect these points using line segments.
Step 6- The obtained representation is a frequency polygon.
Let us consider an example to understand this in a better way.
The following steps are followed to construct a histogram from the given data:
The heights are represented on the horizontal axis on a suitable scale.
The number of students is represented on the vertical axis on a suitable scale.
Rectangular bars are then drawn with widths equal to the class size and lengths
corresponding to the frequency of each class interval.
Frequency polygons can also be drawn independently without drawing histograms. For
this, the midpoints of the class intervals known as class marks are used to plot the
points.
frequency curve
1.4 Measures of Central Tendency
When we work with numerical data, it seems apparent that in most sets of data there is
a tendency for the observed values to group themselves about some interior values;
some central values seem to be characteristic of the data. This phenomenon is
referred to as central tendency. For a given set of data, the measure of location we use
depends on what we mean by middle; different definitions give rise to different
measures. We shall consider some of the more commonly used measures, namely the
arithmetic mean, median and mode. The formulas for finding these values depend on
whether the data are ungrouped or grouped.
Arithmetic Mean
The mean or average is the most popular and well known measure of central tendency.
It can be used with both discrete and continuous data, although its use is most often
with continuous data. The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set. So, if we have n values in a data set
and they have values x1, x2, …, xn, the sample mean, usually denoted by x̄ (pronounced
"x bar"), is:
x̄ = (x1 + x2 + ⋯ + xn)/n
This formula is usually written in a slightly different manner using the Greek capital
letter ∑, pronounced "sigma", which means "sum of...":
x̄ = ∑x/n
You may have noticed that the above formula refers to the sample mean. So, why have
we called it a sample mean? This is because, in statistics, samples and populations
have very different meanings and these differences are very important, even if, in the
case of the mean, they are calculated in the same way. To acknowledge that we are
calculating the population mean and not the sample mean, we use the Greek lower
case letter "mu", denoted as μ:
μ=∑x/n
The mean is essentially a model of your data set: a single value that summarises it.
You will notice, however, that the mean is not often one of the actual values that you
have observed in your data set. However, one of its important properties is that it
minimises error in the prediction of any one value in your data set. That is, it is the value
that produces the lowest sum of squared errors from all the values in the data set.
An important property of the mean is that it includes every value in your data set as part
of the calculation. In addition, the mean is the only measure of central tendency where
the sum of the deviations of each value from the mean is always zero.
Ex 1: Find the average marks obtained by a student:
64, 69, 72, 72, 75, 65
Solution: for ungrouped data, A.M. = x̄ = ∑x/n
= 417/6
= 69.5
The average mark is 69.5.
Ex 2: Find the A.M. for the following
No. of days spent (x)    1   2   3   4   5   6   7   8
No. of patients (f)      5   6   5  10   8   4   3   2
Solution: for grouped data, discrete variate case, A.M. = x̄ = ∑fx/∑f
= 173/43
= 4.02
Ex 3: Find the arithmetic mean for the following
Monthly sales   Frequency
100-120 15
120-140 35
140-160 50
160-180 60
180-200 30
200-220 10
Solution: for grouped data, continuous variate case
Monthly sales CI (class interval)   Frequency (f)   X (midpoint of CI)   fx
100-120                                  15               110            1650
120-140                                  35               130            4550
140-160                                  50               150            7500
160-180                                  60               170           10200
180-200                                  30               190            5700
200-220                                  10               210            2100
Total                                  N = 200                      ∑fx = 31700
A.M. = x̄ = ∑fx/N = 31700/200 = 158.5
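The three worked examples can be verified with a short sketch. The helper names `mean` and `weighted_mean` are illustrative; for grouped continuous data the class midpoint stands in for every observation in the class, which is the usual approximation.

```python
def mean(values):
    """Arithmetic mean of ungrouped data: sum of values / number of values."""
    return sum(values) / len(values)

def weighted_mean(xs, fs):
    """Arithmetic mean of grouped data: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

# Ex 1: ungrouped marks.
ex1 = mean([64, 69, 72, 72, 75, 65])

# Ex 2: discrete variate with frequencies.
days = [1, 2, 3, 4, 5, 6, 7, 8]
patients = [5, 6, 5, 10, 8, 4, 3, 2]
ex2 = weighted_mean(days, patients)

# Ex 3: continuous variate, using class midpoints.
classes = [(100, 120), (120, 140), (140, 160), (160, 180), (180, 200), (200, 220)]
freqs = [15, 35, 50, 60, 30, 10]
midpoints = [(lo + hi) / 2 for lo, hi in classes]
ex3 = weighted_mean(midpoints, freqs)

print(ex1, round(ex2, 2), ex3)
```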
Median
We can also use the median to describe the typical value. In order to find the
median we must first list the data points (here, the heights of seven dams in feet) in
numerical order:
756, 726, 710, 568, 564, 440, 440
Now we choose the number in the middle of the list.
756, 726, 710, 568, 564, 440, 440
The median is 568.
Because the median is 568 it is also reasonable to say that on this list the typical dam is
568 feet tall.
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to
calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56. It is the
middle mark because there are 5 scores before it and 5 scores after it. This works fine
when you have an odd number of scores, but what happens when you have an even
number of scores? What if you had only 10 scores? Well, you simply have to take the
middle two scores and average the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange the data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Now we take the 5th and 6th scores in our data set (55 and 56) and average them to get
a median of 55.5.
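Both cases (odd and even number of scores) are handled by `statistics.median` in the Python standard library, which sorts the data and averages the two middle values when the count is even. A minimal check against the examples above:

```python
import statistics

dam_heights = [756, 726, 710, 568, 564, 440, 440]
odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 scores
even_scores = odd_scores[:-1]                                # first 10 scores

print(statistics.median(dam_heights))
print(statistics.median(odd_scores))
print(statistics.median(even_scores))
```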
Ex: Find the median
Age in years (x)       3   4   5   6   7   8   9  10
No. of children (f)   14  20  40  54  40  18   7   7
Solution:
Age in years (x)             3   4   5    6    7    8    9   10
No. of children (f)         14  20  40   54   40   18    7    7
Cumulative frequency (cf)   14  34  74  128  168  186  193  200
Here N = ∑f = 200. Consider N/2 = 200/2 = 100.
The first cf that exceeds 100 is 128, and the corresponding value of x is the median, i.e. 6.
Median = 6
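The cumulative-frequency rule used in this solution can be sketched as follows. The code follows the rule stated above (first cumulative frequency that exceeds N/2); the variable names are illustrative.

```python
from itertools import accumulate

ages = [3, 4, 5, 6, 7, 8, 9, 10]
children = [14, 20, 40, 54, 40, 18, 7, 7]

# Running totals of the frequencies.
cum_freq = list(accumulate(children))
half = cum_freq[-1] / 2          # N / 2

# The median is the value whose cumulative frequency first exceeds N / 2.
median_age = next(x for x, cf in zip(ages, cum_freq) if cf > half)
print(cum_freq, median_age)
```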
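For grouped data in the continuous variate case, Median = l1 + (l2 − l1)(N/2 − cf)/f,
where l1 and l2 are the limits of the median class, cf is the cumulative frequency of the
preceding class and f is the frequency of the median class. For the monthly sales data
of Ex 3, N/2 = 100, so
Median = 140 + (160 − 140)(100 − 50)/50 = 160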
Mode: The mode is defined as the value of a variable which occurs most frequently.
Size of pants (x)   60  65  70  75  80  85  90
No. of pants (f)    11  15  25  40  20  15  10
For grouped data in the discrete variate case, the mode is the value of the variable
having the maximum frequency. Here the maximum frequency is 40, so Mode = 75.
Frequency 3 7 8 2 4 6
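Mode = l1 + (l2 − l1)(f1 − f0)/(2f1 − f0 − f2)
where l1 and l2 are the limits of the modal class, f1 is its frequency, and f0 and f2 are
the frequencies of the preceding and succeeding classes.
= 300 + (400 − 300)(8 − 7)/(2 × 8 − 7 − 2) = 314.29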
Normally, the mode is used for categorical data where we wish to know which is the
most common category, as illustrated below:
We can see above that the most common form of transport, in this particular data set, is
the bus.
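Ex 2) Find the mode
IQ          No. of children
80-90            2
90-100           8
100-110         45
110-120         50
120-130         30
130-140         15
Total        N = 150
For the mode: the modal class is 110-120 (highest frequency, f1 = 50), with f0 = 45 and
f2 = 30.
Mode = 110 + (120 − 110)(50 − 45)/(2 × 50 − 45 − 30) = 110 + 50/25 = 112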
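The grouped-data mode formula used in the two examples above can be written as a small function. This is a sketch with a hypothetical name, `grouped_mode`; the arguments follow the symbols l1, l2, f1, f0, f2 of the formula.

```python
def grouped_mode(l1, l2, f1, f0, f2):
    """Mode = l1 + (l2 - l1)(f1 - f0) / (2*f1 - f0 - f2) for the modal class l1-l2."""
    return l1 + (l2 - l1) * (f1 - f0) / (2 * f1 - f0 - f2)

# IQ example: modal class 110-120 with f1 = 50, f0 = 45, f2 = 30.
iq_mode = grouped_mode(110, 120, 50, 45, 30)

# Earlier example: modal class 300-400 with f1 = 8, f0 = 7, f2 = 2.
other_mode = grouped_mode(300, 400, 8, 7, 2)

print(iq_mode, round(other_mode, 2))
```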
Frequency 20 10 50 10 10
EXAMPLE
For the following list, n = 19. Find the median.
24, 25, 28, 31, 33, 33, 36, 42, 42, 48, 51, 57, 57, 68, 75, 79, 79, 79, 85
SOLUTION
The numbers are already in numerical order. The position of the "middle of the list" is:
(n+1)/2 = (19+1)/2 = 20/2 =10
Thus, the tenth number will be the median. We count until we arrive at the tenth
number.
24, 25, 28, 31, 33, 33, 36, 42, 42, 48, 51, 57, 57, 68, 75, 79, 79, 79, 85
The median is 48.
EXAMPLE
Compute the mean, median, and mode for this distribution of test scores:
92, 68, 80, 68, 84
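One way to check an answer to this example is with the `statistics` module of the Python standard library. This is a sketch for verification; the hand calculation follows the methods described earlier in the chapter.

```python
import statistics

scores = [92, 68, 80, 68, 84]

mean_score = statistics.mean(scores)      # sum of scores / number of scores
median_score = statistics.median(scores)  # middle value of the sorted scores
mode_score = statistics.mode(scores)      # most frequent score

print(mean_score, median_score, mode_score)
```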
PRACTICE EXERCISES
No. of 35 50 15 60 30 10
shops
Frequency distributions can be presented as a frequency table, a histogram, or a bar
chart.
1.Prepare a frequency distribution for the following data giving the height of 30 children:
126, 126, 135, 120, 144, 118, 124, 139, 121,133,
126, 130, 148, 125, 137, 142, 128, 132, 146, 144,
118, 142, 129, 110, 136, 143, 148, 129, 142, 119.
No. of 30 50 30 40
workers
2.1 Introduction
2.2 Range,
2.3 Quartile deviation,
2.4 Mean deviation,
2.5 Box whisker plot,
2.6 Standard deviation
2.7 Coefficient of variation
2.1 Introduction
Measures of Dispersion
Suppose you are given a data series and someone asks you to tell some interesting
facts about it. You can find the mean, the median or the mode of the series and describe
its centre. But is that the only thing you can do? Are the measures of central tendency
the only way to learn about the concentration of the observations? In this section, we
will learn about another kind of measure that tells us more about the data: the measure
of dispersion.
As the name suggests, a measure of dispersion shows the scattering of the data. It tells
how much the data vary from one another and gives a clear idea about the distribution of
the data. A measure of dispersion shows the homogeneity or the heterogeneity of the
distribution of the observations.
Suppose you have four datasets of the same size and the mean is also same, say, m. In
all the cases the sum of the observations will be the same. Here, the measure of central
tendency is not giving a clear and complete idea about the distribution for the four
given sets.
Can we get an idea about the distribution if we get to know about the dispersion of the
observations from one another within and between the datasets? The main idea about the
measure of dispersion is to get to know how the data are spread. It shows how much the
data vary from their average value.
(i) An absolute measure of dispersion:
The measures which express the scattering of observations in terms of distances, i.e.
range and quartile deviation.
The measures which express the variation in terms of the average of deviations of
observations, like mean deviation and standard deviation.
(ii) A relative measure of dispersion:
We use a relative measure of dispersion for comparing the distributions of two or more
data sets and for unit-free comparison. The relative measures are the coefficient of
range, the coefficient of mean deviation, the coefficient of quartile deviation, the
coefficient of variation, and the coefficient of standard deviation.
Example 1
There were two companies, Company A and Company B. Their salary profiles, given in
terms of mean, median and mode, were as follows:
Company A Company B
Mean 30,000 30,000
Median 30,000 30,000
Mode (Nil) (Nil)
However, their detailed salary (Rs) structures could be completely different, as follows:
Company A 5,000 15,000 25,000 35,000 45,000 55,000
Company B 5,000 5,000 5,000 55,000 55,000 55,000
Hence it is necessary to have some measures on how data are scattered. That is, we
want to know what is the dispersion, or variability in a set of data.
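A short sketch makes the point of Example 1 concrete: the two companies agree on mean and median, and even on range, yet their spread about the mean differs. It uses the population standard deviation (`statistics.pstdev`), which is defined later in this chapter; the variable names are illustrative.

```python
import statistics

company_a = [5000, 15000, 25000, 35000, 45000, 55000]
company_b = [5000, 5000, 5000, 55000, 55000, 55000]

for name, salaries in (("A", company_a), ("B", company_b)):
    print(
        name,
        statistics.mean(salaries),            # same for both companies
        statistics.median(salaries),          # same for both companies
        max(salaries) - min(salaries),        # range, also the same
        round(statistics.pstdev(salaries), 2) # differs between the companies
    )
```

Company B's salaries all lie 25,000 away from the mean, so its standard deviation is larger than Company A's even though the range is identical.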
Range
Range is the difference between two extreme values. The range is easy to calculate
but cannot be obtained if open ended grouped data are given.
1) Find the range of the following:
12, 34, 56, 78, 90
Range = Max − Min
Range = 90 − 12 = 78
Quartiles
Quartiles are the most commonly used values of position. They divide a distribution into
four equal parts such that 25% of the data are ≤ Q1, 50% of the data are ≤ Q2 and 75% of
the data are ≤ Q3. The value (Q3 − Q1)/2 is called the Quartile Deviation,
QD, or the semi-interquartile range.
2.2 Range
The range is the most common and easily understandable measure of dispersion. It is the
difference between the two extreme observations of the data set. If Xmax and Xmin are
the two extreme observations then
Range = Xmax − Xmin
• Range
– The difference between the largest and smallest values
• Interquartile range
– The difference between the 25th and 75th percentiles
Merits of Range
It is the simplest of the measure of dispersion
Easy to calculate
Easy to understand
Independent of change of origin
Demerits of Range
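2.3 Quartile deviation
Q.D. = ½ × (Q3 – Q1)
Ex: Find the quartile deviation of
34, 45, 53, 42, 39, 35, 40, 51, 57, 52, 47, 62, 55, 63, 50
Ascending order:
34, 35, 39, 40, 42, 45, 47, 50, 51, 52, 53, 55, 57, 62, 63
No. of observations = n = 15
Q1 = value of the (n + 1)/4 = 4th observation = 40
Q3 = value of the 3(n + 1)/4 = 12th observation = 55
Quartile Deviation = (Q3 − Q1)/2
= (55 − 40)/2 = 7.5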
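The same quartiles can be obtained with `statistics.quantiles` from the Python standard library (Python 3.8 or later). With `method="exclusive"` it places Q1 at position (n + 1)/4 and Q3 at 3(n + 1)/4 in the sorted data, interpolating when the position is not a whole number; other software may use a different quartile convention and give slightly different values.

```python
import statistics

data = [34, 45, 53, 42, 39, 35, 40, 51, 57, 52, 47, 62, 55, 63, 50]

# quantiles() sorts the data itself and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)
```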
Here N = ∑f = 800.
a) For Q1 consider N/4 = 200. As 220 is the first cf greater than 200, the required
class for Q1 is 30-35.
Q1 = l1 + (l2 − l1)(N/4 − cf)/f
= 30 + (35 − 30)(200 − 120)/100
= 30 + (5)(80)/100
Q1 = 34 years
For Q3 consider 3N/4 = 600.
As 670 is the first cf exceeding 600, the required class interval for Q3 is 45-50.
Q3 = l1 + (l2 − l1)(3N/4 − cf)/f
= 45 + (50 − 45)(600 − 550)/120
= 47.08 years
Quartile Deviation = (Q3 − Q1)/2
= (47.08 − 34)/2 = 6.54 years
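The interpolation formula for grouped quartiles can be sketched as one reusable function. The name `interpolate_in_class` is hypothetical; its arguments are the class limits l1 and l2, the target position (N/4 or 3N/4), the cumulative frequency before the class, and the class frequency, all taken from the working above.

```python
def interpolate_in_class(l1, l2, target, cf_before, f):
    """l1 + (l2 - l1)(target - cf_before) / f for the class l1-l2."""
    return l1 + (l2 - l1) * (target - cf_before) / f

N = 800
q1 = interpolate_in_class(30, 35, N / 4, 120, 100)
q3 = interpolate_in_class(45, 50, 3 * N / 4, 550, 120)
quartile_deviation = (q3 - q1) / 2

print(q1, round(q3, 2), round(quartile_deviation, 2))
```

The same function gives the grouped median when the target is N/2.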
Here, xi and fi are respectively the mid value and the frequency of the ith class interval.
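Ex: Find the mean deviation from the mean and from the median for
5, 6, 9, 11, 12, 13, 14
Solution: It is ungrouped data.
Mean x̄ = ∑x/n = (5 + 6 + 9 + 11 + 12 + 13 + 14)/7 = 70/7 = 10
∑|x − x̄| = 5 + 4 + 1 + 1 + 2 + 3 + 4 = 20
Mean deviation from mean = ∑|x − x̄|/n = 20/7 = 2.86
Median = value of the (n + 1)/2 = 8/2 = 4th observation = 11
∑|x − median| = 19
Mean deviation from median = ∑|x − median|/n = 19/7 = 2.71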
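The two mean deviations can be checked with a short sketch. The helper name `mean_deviation` is illustrative; it averages the absolute deviations from whichever centre is supplied.

```python
import statistics

data = [5, 6, 9, 11, 12, 13, 14]

def mean_deviation(values, centre):
    """Average absolute deviation of the values from the given centre."""
    return sum(abs(x - centre) for x in values) / len(values)

md_from_mean = mean_deviation(data, statistics.mean(data))      # centre = 10
md_from_median = mean_deviation(data, statistics.median(data))  # centre = 11

print(round(md_from_mean, 2), round(md_from_median, 2))
```

The deviation from the median is the smaller of the two, in line with the property that the sum of absolute deviations is least when taken about the median.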
2.6 Standard deviation
Mean Absolute Deviation
Mean absolute deviation is the mean of the absolute values of all deviations from the
mean. Therefore it takes every item into account. Mathematically it is given as:
M.A.D. = ∑|x − x̄|/n
A standard deviation is the positive square root of the arithmetic mean of the squares of
the deviations of the given values from their arithmetic mean. It is denoted by the Greek
letter sigma, σ. It is also referred to as root mean square deviation. The standard deviation
is given as
σ = √(∑(x − x̄)²/n)
The square of the standard deviation is the variance. It is also a measure of dispersion.
If instead of the mean we choose any other arbitrary number, say A, the standard deviation
becomes the root mean square deviation about A.
• Variance
– The sum of squares divided by the population size or the sample size
minus one
• Standard deviation
– The square root of the variance
• Another Measure of Dispersion
Mean x̄ = ∑x/n = 106/8 = 13.25
∑x² = 21² + 16² + … + 14² = 1524
Standard deviation = √(∑x²/n − x̄²) = √(1524/8 − (13.25)²) = 3.86
Solution:
Continuous data
CI       Frequency (f)   Class mark (x)    fx      fx²
0-10          11               5            55      275
10-20         15              15           225     3375
20-30         25              25           625    15625
30-40         12              35           420    14700
40-50          7              45           315    14175
Total         70                          1640    48150
N = ∑f = 70
Mean x̄ = ∑fx/N = 1640/70 = 23.43
s.d. = √(∑fx²/N − x̄²) = √(48150/70 − (23.43)²) = 11.79
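The grouped-data calculation can be reproduced step by step. This sketch follows the table columns fx and fx² and uses the class marks as representatives of each class; the variable names are illustrative.

```python
import math

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [11, 15, 25, 12, 7]
class_marks = [(lo + hi) / 2 for lo, hi in classes]

N = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, class_marks))

mean = sum_fx / N
# s.d. = sqrt(sum(f x^2)/N - mean^2), the shortcut form used in the solution.
sd = math.sqrt(sum_fx2 / N - mean ** 2)

print(N, sum_fx, sum_fx2, round(mean, 2), round(sd, 2))
```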
Sample variance: s² = ∑(xi − x̄)²/(n − 1), where the sum runs over i = 1, …, n
Coefficient of variation: CV = (s/x̄) × 100%
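To illustrate the sample variance (divisor n − 1) and the coefficient of variation, the sketch below reuses the data 5, 6, 9, 11, 12, 13, 14 from the mean deviation example. The choice of data set is only for illustration; `statistics.variance` and `statistics.stdev` use the n − 1 divisor, while `statistics.pvariance` divides by n.

```python
import statistics

data = [5, 6, 9, 11, 12, 13, 14]

s2 = statistics.variance(data)        # sum of squared deviations / (n - 1)
s = statistics.stdev(data)            # square root of the sample variance
cv = s / statistics.mean(data) * 100  # coefficient of variation in percent

population_variance = statistics.pvariance(data)  # divisor n instead of n - 1

print(s2, round(s, 3), round(cv, 2), round(population_variance, 3))
```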
1.Calculate the standard deviation for the following.
Marks(x): 100 80 55 65 90 88 47 50
X 2 3 4 5 6 7 8
F 10 8 2 4 6 5 5
Squaring the deviations overcomes the drawback of ignoring signs in mean deviations
Suitable for further mathematical treatment
Least affected by fluctuations of sampling
The standard deviation is zero if all the observations are equal
Independent of change of origin
Coefficient of Dispersion
We need the coefficients of dispersion, along with the measures of dispersion, whenever
we want to compare the variability of two series which differ widely in their averages or
which are measured in different units. The coefficients of dispersion (C.D.) based on
different measures of dispersion are
C.D. based on range = (Xmax − Xmin)/(Xmax + Xmin)
C.D. based on quartile deviation = (Q3 − Q1)/(Q3 + Q1)
C.D. based on mean deviation = mean deviation/average from which it is calculated
C.D. based on standard deviation = S.D./mean
100 times the coefficient of dispersion based on standard deviation is the coefficient of
variation (C.V.).
For Company A
No. of employees = n1 = 900, and average daily wages = ȳ 1 = Rs. 250
or, Total wages = Total employees × average daily wage = 900 × 250 = Rs. 225000 … (i)
For Company B
So, Total wages = Total employees × average daily wage = 1000 × 220 = Rs. 220000 …
(ii)
Comparing (i), and (ii), we see that Company A has a larger wage bill.
For Company A
For Company B
Comparing (i), and (ii), we see that Company B has greater variability.
The average daily wages for both the companies taken together
ȳ = (n1 ȳ 1 + n2 ȳ 2)⁄( n1 + n2) = (900 × 250 + 1000 × 220) ÷ (900 + 1000) = 445000⁄1900 =
Rs. 234.21
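The wage-bill comparison and the combined average can be checked with a short sketch. It uses only the figures that appear in the working above (900 employees at Rs. 250 and 1000 employees at Rs. 220); the helper name `combined_mean` is illustrative.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

wage_bill_a = 900 * 250    # total daily wages, Company A
wage_bill_b = 1000 * 220   # total daily wages, Company B
overall = combined_mean([900, 1000], [250, 220])

print(wage_bill_a, wage_bill_b, round(overall, 2))
```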
Mean x̄ = ∑x/n = 550/10 = 55
1. Calculate mean deviation from mode and Bowley’s measure of skewness for the
following data.
2. Calculate Quartile deviation and Bowley’s measure of skewness for the following
data.
5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26,
31, 31, 36, 42, 44, 47
MCQ No 1
The scatter in a series of values about the average is called:
(a) Central tendency (b) Dispersion (c) Skewness (d) Symmetry
MCQ No 2
The measurements of spread or scatter of the individual values around the central point
is called:
(a) Measures of dispersion (b) Measures of central tendency
(c) Measures of skewness (d) Measures of kurtosis
MCQ No 3
The measures used to calculate the variation present among the observations in the
unit of the variable is
called:
(a) Relative measures of dispersion (b) Coefficient of skewness
(c) Absolute measures of dispersion (d) Coefficient of variation
MCQ No 4
The measures used to calculate the variation present among the observations relative
to their average is
called:
(a) Coefficient of kurtosis (b) Absolute measures of dispersion
(c) Quartile deviation (d) Relative measures of dispersion
MCQ No 5
The degree to which numerical data tend to spread about an average value is called:
(a) Constant (b) Flatness (c) Variation (d) Skewness
MCQ No 6
The measures of dispersion can never be:
(a) Positive (b) Zero (c) Negative (d) Equal to 2
MCQ No 7
If all the scores on examination cluster around the mean, the dispersion is said to be:
(a) Small (b) Large (c) Normal (d) Symmetrical
MCQ No 8
If there are many extreme scores on an examination, the dispersion is:
(a) Large (b) Small (c) Normal (d) Symmetric
MCQ No 9
Given below the four sets of observations. Which set has the minimum variation?
(a) 46, 48, 50, 52, 54 (b) 30, 40, 50, 60, 70 (c) 40, 50, 60, 70, 80 (d) 48, 49, 50, 51, 52
MCQ No 10
Which of the following is an absolute measure of dispersion?
(a) Coefficient of variation (b) Coefficient of dispersion
(c) Standard deviation (d) Coefficient of skewness
MCQ No 11
The measure of dispersion which uses only two observations is called:
(a) Mean (b) Median (c) Range (d) Coefficient of variation
MCQ No 12
The measure of dispersion which uses only two observations is called:
(a) Range (b) Quartile deviation (c) Mean deviation (d) Standard deviation
MCQ No 13
In quality control of manufactured items, the most common measure of dispersion is:
(a) Range (b) Average deviation (c) Standard deviation (d) Quartile deviation
MCQ No 14
The range of the scores 29, 3, 143, 27, 99 is:
(a) 140 (b) 143 (c) 146 (d) 70
MCQ No 15
If the observations of a variable X are, -4, -20, -30, -44 and -36, then the value of the
range will be:
(a) -48 (b) 40 (c) -40 (d) 48
MCQ No 16
The range of the values -5, -8, -10, 0, 6, 10 is:
(a) 0 (b) 10 (c) -10 (d) 20
MCQ No 17
If Y = aX ± b, where a and b are any two numbers and a ≠ 0, then the range of Y values
will be:
(a) Range(X) (b) a range(X) + b (c) a range(X) – b (d) |a| range(X)
MCQ No 18
If the maximum value in a series is 25 and its range is 15, the minimum value of the
series is:
(a) 10 (b) 15 (c) 25 (d) 35
MCQ No 19
Half of the difference between upper and lower quartiles is called:
(a) Interquartile range (b) Quartile deviation (c) Mean deviation (d) Standard deviation
MCQ No 20
If Q3=20 and Q1=10, the coefficient of quartile deviation is:
(a) 3 (b) 1/3 (c) 2/3 (d) 1
MCQ No 21
Which measure of dispersion can be computed in case of open-end classes?
(a) Standard deviation (b) Range (c) Quartile deviation (d) Coefficient of variation
MCQ No 22
If Y = aX ± b, where a and b are any two constants and a ≠ 0, then the quartile deviation
of Y values is
equal to:
(a) a Q.D(X) + b (b) |a| Q.D(X) (c) Q.D(X) – b (d) |b| Q.D(X)
MCQ No 23
The sum of absolute deviations is minimum if these deviations are taken from the:
(a) Mean (b) Mode (c) Median (d) Upper quartile
MCQ No 24
The mean deviation is minimum when deviations are taken from:
(a) Mean (b) Mode (c) Median (d) Zero
MCQ No 26
The mean deviation of the scores 12, 15, 18 is:
(a) 6 (b) 0 (c) 3 (d) 2
MCQ No 27
Mean deviation computed from a set of data is always:
(a) Negative (b) Equal to standard deviation
(c) More than standard deviation (d) Less than standard deviation
MCQ No 28
The average of squared deviations from mean is called:
(a) Mean deviation (b) Variance (c) Standard deviation (d) Coefficient of variation
MCQ No 29
The sum of squares of the deviations is minimum, when deviations are taken from:
(a) Mean (b) Mode (c) Median (d) Zero
MCQ No 30
Which of the following measures of dispersion is expressed in the same units as the
units of observation?
(a) Variance (b) Standard deviation
(c) Coefficient of variation (d) Coefficient of standard deviation
MCQ No 31
Which measure of dispersion has a different unit other than the unit of measurement of
values:
(a) Range (b) Standard deviation (c) Variance (d) Mean deviation
MCQ No 32
Which of the following is a unit free quantity:
(a) Range (b) Standard deviation (c) Coefficient of variation (d) Arithmetic mean
MCQ No 33
If the dispersion is small, the standard deviation is:
(a) Large (b) Zero (c) Small (d) Negative
MCQ No 34
The value of standard deviation changes by a change of:
(a) Origin (b) Scale (c) Algebraic signs (d) None
MCQ No 35
The standard deviation of a distribution divided by the mean of the distribution and
expressed as a percentage is called:
(a) Coefficient of Standard deviation (b) Coefficient of skewness
(c) Coefficient of quartile deviation (d) Coefficient of variation
MCQ No 36
The positive square root of the mean of the squares of the deviations of observations
from their mean is called:
(a) Variance (b) Range (c) Standard deviation (d) Coefficient of variation
MCQ No 37
The variance is zero only if all observations are the:
(a) Different (b) Square (c) Square root (d) Same
MCQ No 38
The standard deviation is independent of:
(a) Change of origin (b) Change of scale of measurement
(c) Change of origin and scale of measurement (d) Difficult to tell
MCQ No 39
If there are ten values each equal to 10, then standard deviation of these values is:
(a) 100 (b) 20 (c) 10 (d) 0
MCQ No 40
If X and Y are independent random variables, then S.D(X ± Y) is equal to:
(a) S.D(X) ± S.D(Y) (b) Var(X) ± Var(Y) (c) (d)
MCQ No 41
S.D(X) = 6 and S.D(Y) = 8. If X and Y are independent random variables, then S.D(X-Y)
is:
(a) 2 (b) 10 (c) 14 (d) 100
MCQ No 42
For two independent variables X and Y if S.D(X) = 1 and S.D(Y) = 3, then Var(3X - Y) is
equal to:
(a) 0 (b) 6 (c) 18 (d) 12
MCQ No 43
If Y = aX ± b, where a and b are any two constants and a ≠ 0, then Var(Y) is equal to:
(a) a Var(X) (b) a Var(X) + b (c) a² Var(X) – b (d) a² Var(X)
MCQ No 44
If Y = aX + b, where a and b are any two numbers but a ≠ 0, then S.D(Y) is equal to:
(a) S.D(X) (b) a S.D(X) (c) |a| S.D(X) (d) a S.D(X) + b
MCQ No 45
The ratio of the standard deviation to the arithmetic mean expressed as a percentage is
called:
(a) Coefficient of standard deviation (b) Coefficient of skewness
(c) Coefficient of kurtosis (d) Coefficient of variation
MCQ No 46
Which of the following statements is correct?
(a) The standard deviation of a constant is equal to unity
(b) The sum of absolute deviations is minimum if these deviations are taken from the
mean.
(c) The second moment about origin equals variance
(d) The variance is positive quantity and is expressed in square of the units of the
observations
MCQ No 47
Which of the following statements is false?
(a) The standard deviation is independent of change of origin
(b) If the moment coefficient of kurtosis β2 = 3, the distribution is mesokurtic or normal.
(c) If the frequency curve has the same shape on both sides of the centre line which
divides the curve into two equal parts, the distribution is called symmetrical.
(d) Variance of the sum or difference of any two variables is equal to the sum of
their respective variances
MCQ No 48
If Var(X) = 25, then is equal to:
(a) 15/2 (b) 50 (c) 25 (d) 5
MCQ No 49
To compare the variation of two or more than two series, we use
(a) Combined standard deviation (b) Corrected standard deviation
(c) Coefficient of variation (d) Coefficient of skewness
MCQ No 50
The standard deviation of -5, -5, -5, -5, 5 is:
(a) -5 (b) +5 (c) 0 (d) -25
MCQ No 51
Standard deviation is always calculated from:
(a) Mean (b) Median (c) Mode (d) Lower quartile
MCQ No 52
The mean of an examination is 69, the median is 68, the mode is 67, and the standard
deviation is 3. The measures of variation for this examination is:
(a) 67 (b) 68 (c) 69 (d) 3
MCQ No 53
The variance of 19, 21, 23, 25 and 27 is 8. The variance of 14, 16, 18, 20 and 22 is:
(a) Greater than 8 (b) 8 (c) Less than 8 (d) 8 - 5 = 3
MCQ No 54
In a set of observations the variance is 50. All the observations are increased by 100%.
The variance of
the increased observations will become:
(a) 50 (b) 200 (c) 100 (d) No change
MCQ No 55
Three factories A, B, C have 100, 200 and 300 workers respectively. The mean of the
wages is the same
in the three factories. Which of the following statements is true?
(a) There is greater variation in factory C.
(b) Standard deviation in factory A is the smallest.
(c) Standard deviation in all the three factories are equal
(d) None of the above
MCQ No 56
An automobile manufacturer obtains data concerning the sales of six of its dealers in the
last week of 1996. The results indicate the standard deviation of their sales equals 6
autos. If this is so, the variance of their sales equals:
(a) (b) 6 (c) (d) 36
MCQ No 57
If standard deviation of the values 2, 4, 6, 8 is 2.236, then standard deviation of the
values 4, 8,12, 16 is:
(a) 0 (b) 4.472 (c) 4.236 (d) 2.236
MCQ No 58
Var(X) = 4 and Var(Y) =9. If X and Y are independent random variable then Var(2X + Y)
is:
(a) 13 (b) 17 (c) 25 (d) -1
MCQ No 59
If x̄ = Rs.20, S = Rs.10, then coefficient of variation is:
(a) 45% (b) 50% (c) 60% (d) 65%
MCQ No 60
Which of the following measures of dispersion is independent of the units employed?
(a) Coefficient of variation (b) Quartile deviation
(c) Standard deviation (d) Range
In This chapter
3.1 Introduction
If the longer tail is on the left, we say that the distribution is skewed to the left and the
coefficient of skewness is negative.
Skewed to the right (positively skewed)
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform
distribution would be the extreme case.
Skewness
Other measures of skewness have been used, including simpler calculations suggested
by Karl Pearson (not to be confused with Pearson's moment coefficient of skewness,
see above). These other measures are:
In probability theory and statistics, skewness is a measure of the asymmetry of
the probability distribution of a real-valued random variable about its mean. The
skewness value can be positive or negative, or undefined.
For a unimodal distribution, negative skew commonly indicates that the tail is on the left
side of the distribution, and positive skew indicates that the tail is on the right. In cases
where one tail is long but the other tail is fat, skewness does not obey a simple rule. For
example, a zero value means that the tails on both sides of the mean balance out
overall; this is the case for a symmetric distribution, but can also be true for an
asymmetric distribution where one tail is long and thin, and the other is short but
fat.
Introduction
Consider the two distributions in the figure just below. Within each graph, the values on
the right side of the distribution taper differently from the values on the left side. These
tapering sides are called tails, and they provide a visual means to determine which of
the two kinds of skewness a distribution has:
1. negative skew: The left tail is longer; the mass of the distribution is concentrated
on the right of the figure. The distribution is said to be left-skewed, left-tailed,
or skewed to the left, despite the fact that the curve itself appears to be skewed
or leaning to the right; left instead refers to the left tail being drawn out and,
often, the mean being skewed to the left of a typical center of the data. A left-
skewed distribution usually appears as a right-leaning curve.
2. positive skew: The right tail is longer; the mass of the distribution is concentrated
on the left of the figure. The distribution is said to be right-skewed, right-tailed,
or skewed to the right, despite the fact that the curve itself appears to be skewed
or leaning to the left; right instead refers to the right tail being drawn out and,
often, the mean being skewed to the right of a typical center of the data. A right-
skewed distribution usually appears as a left-leaning curve.
Skewness in a data series may sometimes be observed not only graphically but by
simple inspection of the values. For instance, consider the numeric sequence (49, 50,
51), whose values are evenly distributed around a central value of 50. We can transform
this sequence into a negatively skewed distribution by adding a value far below the
mean, which is probably a negative outlier, e.g. (40, 49, 50, 51). The mean of
the sequence then becomes 47.5, and the median is 49.5. Because the mean lies below the
median, the nonparametric skew, (mean − median)/S.D., is negative.
a) Absolute Skewness
b) Relative or coefficient of Skewness
Mathematical Measure of skewness can be calculated by
1) Karl-Pearson’s Method
2) Bowley’s Method
a) Absolute Measure
1) Karl Pearson's measure of skewness = Mean − Mode = 3(Mean − Median)
2) Bowley's measure of skewness = (Q3 − Q2) − (Q2 − Q1)
b) Relative Measure:
1) Karl Pearson's coefficient of skewness
$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$$
as Mean − Mode = 3(Mean − Median).
Note:
i) if $SK_p > 0$ the curve is positively skewed
ii) if $SK_p = 0$ the curve is symmetric
iii) if $SK_p < 0$ the curve is negatively skewed
2) Bowley's coefficient of skewness
$$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
43,48,38,46,50,48,47,48,62,48
Solution: Here n = 10, and
$$\bar{x} = \frac{\sum x}{n} = \frac{478}{10} = 47.8$$
$$\mathrm{Var}(x) = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{23178}{10} - (47.8)^2 = 32.96$$
$$\text{S.D.} = \sqrt{\mathrm{Var}(x)} = \sqrt{32.96} = 5.74108$$
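The calculation above stops at the standard deviation. As a minimal sketch, assuming the example asks for Karl Pearson's coefficient of skewness (the problem statement is not shown here) and using the population standard deviation as in the text, the remaining step can be checked in Python. The variable names are illustrative only.

```python
import statistics

# Observations from the worked example
data = [43, 48, 38, 46, 50, 48, 47, 48, 62, 48]

mean = statistics.mean(data)    # arithmetic mean
mode = statistics.mode(data)    # most frequent value
sd = statistics.pstdev(data)    # population S.D. (divides by n), as in the text

# Karl Pearson's coefficient of skewness: (Mean - Mode) / S.D.
sk_p = (mean - mode) / sd

print("mean =", round(mean, 1))
print("mode =", mode)
print("S.D. =", round(sd, 4))
print("SKp  =", round(sk_p, 4))
```

A value of SKp slightly below zero would indicate a very mild negative skew, since the mean lies just below the mode.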
Ex 2) Calculate the karl Pearson’s coefficient of Skewness for the following data
$$\mathrm{Var}(x) = \frac{\sum f x^2}{N} - \bar{x}^2 = \frac{26640000}{64} - (635.937)^2 = 11833.5$$
$$\text{S.D.} = \sqrt{\mathrm{Var}(x)} = \sqrt{11833.5} = 108.7819$$
$$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
With Q1 = 24.4444, Q2 = 34.1667 and Q3 = 43:
$$SK_B = \frac{43 + 24.4444 - 2(34.1667)}{43 - 24.4444} = -0.0479$$
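The substitution above can be reproduced with a small helper. This is only a sketch: the function name `bowley_skewness` is hypothetical, and the quartile values are the ones used in the worked calculation.

```python
def bowley_skewness(q1: float, q2: float, q3: float) -> float:
    """Bowley's coefficient of skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 coincide; the coefficient is undefined")
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Quartiles taken from the worked example
sk_b = bowley_skewness(q1=24.4444, q2=34.1667, q3=43)
print(round(sk_b, 4))
```

When the median lies exactly midway between the quartiles, the numerator vanishes and the coefficient is zero.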
The skewness for a normal distribution is zero, and any symmetric data should have a
skewness near zero. Negative values for the skewness indicate data that are skewed
left and positive values for the skewness indicate data that are skewed right. By skewed
left, we mean that the left tail is long relative to the right tail. Similarly, skewed right
means that the right tail is long relative to the left tail. If the data are multi-modal, then
this may affect the sign of the skewness.
Some measurements have a lower bound and are skewed right. For example, in
reliability studies, failure times cannot be negative.
• Positive skewness
It should be noted that there are alternative definitions of skewness in the literature.
For example, the Galton skewness (also known as Bowley's skewness) is defined as
Galton skewness=(Q1+Q3−2Q2)/(Q3−Q1)
where Q1 is the lower quartile, Q3 is the upper quartile, and Q2 is the median.
Examples
A normal distribution and any other symmetric distribution with finite third moment has a
skewness of 0
A lognormal distribution can have a skewness of any positive value, depending on its
parameters
Applications
Skewness is a descriptive statistic that can be used in conjunction with
the histogram and the normal quantile plot to characterize the data or distribution.
Skewness indicates the direction and relative magnitude of a distribution's deviation
from the normal distribution.
With pronounced skewness, standard statistical inference procedures such as
a confidence interval for a mean will be not only incorrect, in the sense that the true
coverage level will differ from the nominal (e.g., 95%) level, but they will also result in
unequal error probabilities on each side.
Skewness can be used to obtain approximate probabilities and quantiles of distributions
(such as value at risk in finance) via the Cornish-Fisher expansion.
Many models assume normal distribution; i.e., data are symmetric about the mean. The
normal distribution has a skewness of zero. But in reality, data points may not be
perfectly symmetric. So, an understanding of the skewness of the dataset indicates
whether deviations from the mean are going to be positive or negative.
Comparison of mean, median and mode of two log-normal distributions with different
skewnesses.
Which is a simple multiple of the nonparametric skew.
What Is Skewness in Statistics?
Some distributions of data, such as the bell curve or normal distribution, are
symmetric. This means that the right and the left of the distribution are perfect mirror
images of one another. Not every distribution of data is symmetric. Sets of data that are
not symmetric are said to be asymmetric. The measure of how asymmetric a distribution
can be is called skewness.
The mean, median and mode are all measures of the center of a set of data. The
skewness of the data can be determined by how these quantities are related to one
another.
Data that are skewed to the right have a long tail that extends to the right. An alternate
way of talking about a data set skewed to the right is to say that it is positively skewed.
In this situation, the mean and the median are both greater than the mode. As a general
rule, most of the time for data skewed to the right, the mean will be greater than the
median. In summary, for a data set skewed to the right:
Measures of Skewness
It’s one thing to look at two sets of data and determine that one is symmetric while the
other is asymmetric. It’s another to look at two sets of asymmetric data and say that one
is more skewed than the other. It can be very subjective to determine which is more
skewed by simply looking at the graph of the distribution. This is why there are ways to
numerically calculate the measure of skewness.
Skewed data arises quite naturally in various situations. Incomes are skewed to the
right because even just a few individuals who earn millions of dollars can greatly affect
the mean, and there are no negative incomes. Similarly, data involving the lifetime of a
product, such as a brand of light bulb, are skewed to the right. Here the smallest that a
lifetime can be is zero, and long lasting light bulbs will impart a positive skewness to the
data.
Practice Problems
MCQ No 1
The first three moments of a distribution about the mean are 1, 4 and 0. The distribution
is:
(a) Symmetrical (b) Skewed to the left (c) Skewed to the right (d) Normal
MCQ No 2
If the third central moment is negative, the distribution will be:
(a) Symmetrical (b) Positively skewed (c) Negatively skewed (d) Normal
MCQ No 3
If the third moment about mean is zero, then the distribution is:
(a) Positively skewed (b) Negatively skewed (c) Symmetrical (d) Mesokurtic
MCQ No 4
Departure from symmetry is called:
(a) Second moment (b) Kurtosis (c) Skewness (d) Variation
MCQ No 5
In a symmetrical distribution, the coefficient of skewness will be:
(a) 0 (b) Q1 (c) Q3 (d) 1
MCQ No 6
The lack of uniformity or symmetry is called:
(a) Skewness (b) Dispersion (c) Kurtosis (d) Standard deviation
MCQ No 7
For a positively skewed distribution, mean is always:
(a) Less than the median (b) Less than the mode
(c) Greater than the mode (d) Difficult to tell
MCQ No 8
For a symmetrical distribution:
(a) β1 > 0 (b) β1 < 0 (c) β1 = 0 (d) β1 = 3
MCQ No 9
If mean=50, mode=40 and standard deviation=5, the distribution is:
(a) Positively skewed (b) Negatively skewed (c) Symmetrical (d) Difficult to tell
MCQ No 10
If mean=25, median=30 and standard deviation=15, the distribution will be:
(a) Symmetrical (b) Positively skewed (c) Negatively skewed (d) Normal
MCQ No 11
If mean=20, median=16 and standard deviation=2, then coefficient of skewness is:
(a) 1 (b) 2 (c) 4 (d) -2
MCQ No 12
If mean=10, median=8 and standard deviation=6, then coefficient of skewness is:
(a) 1 (b) -1 (c) 2/6 (d) 2
MCQ No 13
If the sum of deviations from median is not zero, then a distribution will be:
(a) Symmetrical (b) Skewed (c) Normal (d) All of the above
MCQ No 14
In case of positively skewed distribution, the extreme values lie in the:
(a) Middle (b) Left tail (c) Right tail (d) Anywhere
MCQ No 15
Bowley's coefficient of skewness lies between:
(a) 0 and 1 (b) -1 and +1 (c) -1 and 0 (d) -2 and +2
MCQ No 16
In a symmetrical distribution, Q3 – Q1 = 20, median = 15. Q3 is equal to:
(a) 5 (b) 15 (c) 20 (d) 25
MCQ No 17
Which of the following is correct in a negatively skewed distribution?
(a) The arithmetic mean is greater than the mode
(b) The arithmetic mean is greater than the median
(c) (Q3 – Median) = (Median – Q1)
(d) (Q3 – Median) < (Median – Q1)
MCQ No 18
The lower and upper quartiles of a distribution are 80 and 120 respectively, while
median is 100. The
shape of the distribution is:
(a) Positively skewed (b) Negatively skewed (c) Symmetrical (d) Normal
MCQ No 19
In a symmetrical distribution Q1 = 20 and median= 30. The value of Q3 is:
(a) 50 (b) 35 (c) 40 (d) 25
MCQ No 20
The degree of peakedness or flatness of a unimodal distribution is called:
(a) Skewness (b) Symmetry (c) Dispersion (d) Kurtosis
MCQ No 21
For a leptokurtic distribution, the relation between second and fourth central moment is:
MCQ No 22
For a platykurtic distribution, the relation between and is:
MCQ No 23
For a mesokurtic distribution, the relation between fourth and second mean moment is:
MCQ No 24
The second and fourth moments about mean are 4 and 48 respectively, then the
distribution is:
(a) Leptokurtic (b) Platykurtic (c) Mesokurtic or normal (d) Positively skewed
MCQ No 25
In a mesokurtic or normal distribution, μ4 = 243. The standard deviation is:
(a) 81 (b) 27 (c) 9 (d) 3
MCQ No 26
The value of β2 can be:
(a) Less than 3 (b) Greater than 3 (c) Equal to 3 (d) All of the above
MCQ No 27
In a normal (mesokurtic) distribution:
(a) β1=0 and β2=3 (b) β1=3 and β2=0 (c) β1=0 and β2>3 (d) β1=0 and β2<3
MCQ No 28
In any frequency distribution, the following empirical relation holds:
(a) Quartile deviation = Standard deviation
(b) Mean deviation = Standard deviation
(c) Standard deviation = Mean deviation = Quartile deviation
(d) All of the above
In this chapter
4.1 Scatter diagram
Based on the different shapes the scatter plot may assume, we can draw different
inferences. We can calculate a coefficient of correlation for the given data. It is a
quantitative measure of the association of the random variables. Its value always lies
between −1 and +1, and it may be positive or negative.
In the case of a positive correlation, the plotted points are distributed from the lower left corner
to the upper right corner (in the general pattern of being evenly spread about a straight line
with a positive slope), and in the case of a negative correlation, the plotted points are
spread out from upper left to lower right (about a straight line with a negative slope).
If the points are randomly distributed in space, or almost equally distributed at every
location without depicting any particular pattern, it is the case of a very small correlation,
tending to 0.
Types of Patterns
Now, look at the different possible scenarios of the patterns formed in the scatter
diagrams, with their corresponding coefficients of correlation values mentioned with them,
below and try to make sense of them.
It is clear that the case of r = 0 may occur in many forms. Some such factors include
the symmetry of the pattern around a particular point, the general randomness of the
points etc. Note that the scatter diagram by itself doesn’t assign quantitative values as
measures of correlation for the plots. It simply gives an idea of what association to expect
between the random variables of interest.
Now go through the solved example below, to understand how to make your own scatter
plots and analyze them.
12 40-50
10 50-60
8 60-70
7 70-80
5 80-90
2 90-100
Solution:
Since the values of M is in the form of bins, we can use the centre point of each class in
the scatter diagram instead. So let us first choose the axes of our diagram.
The data points that we need to plot according to the given dataset are –
How can you see the relationship between the variables? Scatter plots can help us see
the relationship between two quantitative variables.
[Figure: four example scatter plots. (1) Relationship of Fat and Calories in McDonald's Burgers (x-axis: fat). (2) Relationship of Math SAT and Percent Taking Exam (x-axis: Percentage Taking SAT). (3) Value and Total Circulation of U.S. Currency (y-axis: Total Circulation ($)). (4) Year of Twelfth Graders and Percentage Who Have Smoked (y-axis: Percent Used Marijuana).]
But there are many other factors too, like your interest in that movie, your budget etc. Thus
to analyze the situation in detail, you need to note down your similar past experiences and
form a sort of distribution from that data. It is at this point that you require a Correlation
Coefficient, which will now provide you with a value, based on which you can calculate the
possibility of you not going for the movie this time if your friends don’t turn up! Karl
Pearson’s Coefficient of Correlation is one such type of parameter which we’ll be studying
in this section.
This method of correlation attempts to draw a line of best fit through the data of two
variables, and the value of the Pearson correlation coefficient, r, indicates how far all
these data points are from this line of best fit.
The Pearson Product Moment Correlation Coefficient – r – measures the strength of the
linear relationship between the paired x and y values in a sample.
or
Judging the strength of the linear relationship – according to Cohen (1988), the
following can be concluded:
bYX = 13.3/10
bYX = 1.33
bxy = 13.3/17.95
bxy = 0.741
x 14 8 10 11 9 13 5
y 14 9 11 13 11 12 4
Solution:
We observe that n = 7, Σx = 70, Σy = 74, so x̄ = Σx/n = 70/7 = 10 and ȳ = Σy/n = 74/7 = 10.57.
We use the formula
$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}} \cdot \sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}}$$
x       y       x²       y²       xy
14      14      196      196      196
8       9       64       81       72
10      11      100      121      110
11      13      121      169      143
9       11      81       121      99
13      12      169      144      156
5       4       25       16       20
Σx = 70   Σy = 74   Σx² = 756   Σy² = 848   Σxy = 796
Substituting the values in the formula:
$$r = \frac{796 - \dfrac{70 \times 74}{7}}{\sqrt{756 - \dfrac{(70)^2}{7}} \cdot \sqrt{848 - \dfrac{(74)^2}{7}}} = \frac{796 - 740}{\sqrt{756 - 700} \cdot \sqrt{848 - 782.29}} = \frac{56}{\sqrt{56} \cdot \sqrt{65.71}} = 0.9231$$
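As a quick check of the arithmetic, the same sums and the same formula can be evaluated in Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from math import sqrt

# Paired observations from the worked example
x = [14, 8, 10, 11, 9, 13, 5]
y = [14, 9, 11, 13, 11, 12, 4]
n = len(x)

# Column totals of the computation table
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# r = (Σxy - ΣxΣy/n) / (sqrt(Σx² - (Σx)²/n) * sqrt(Σy² - (Σy)²/n))
numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt(sum_x2 - sum_x**2 / n) * sqrt(sum_y2 - sum_y**2 / n)
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 4))
```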
⇒ The value of r always lies between +1 and -1. Depending on its exact value, we see
the following degrees of association between the variables-
r value variation:
A value greater than 0 indicates a positive association i.e. as the value of one variable
increases, so does the value of the other variable. A value less than 0 indicates a negative
association i.e. as the value of one variable increases, the value of the other variable
decreases.
⇒ The Pearson product-moment correlation does not take into consideration whether a
variable has been classified as a dependent or independent variable. It treats all variables
equally.
⇒ A change of origin of the system, or any scaling of the variables, does not affect
the magnitude of r. Only the sign may change, depending on the signs of the scale factors.
Basically, if the bivariate system (x, y) is converted to another bivariate system (u, v) by a
change of origin or scaling or both, in the following way –
$$u = \frac{x - a}{b}, \qquad v = \frac{y - c}{d}$$
Then the correlation coefficient takes on the following value –
$$r(u, v) = \frac{bd}{|b|\,|d|}\; r(x, y)$$
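The effect of a change of origin and scale can be illustrated numerically. In the sketch below, the transformation is u = (x − a)/b and v = (y − c)/d; the constants a, b, c, d are hypothetical values chosen only for illustration, with b and d of opposite sign so that the sign of r flips while its magnitude is unchanged. The helper `pearson_r` is an assumed name, not part of any library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    s_xx = sum((p - mean_x) ** 2 for p in xs)
    s_yy = sum((q - mean_y) ** 2 for q in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [14, 8, 10, 11, 9, 13, 5]
y = [14, 9, 11, 13, 11, 12, 4]

# Hypothetical origin shifts (a, c) and scale factors (b, d); b*d < 0 here
a, b, c, d = 10, 2, 5, -3
u = [(p - a) / b for p in x]
w = [(q - c) / d for q in y]

r_xy = pearson_r(x, y)
r_uw = pearson_r(u, w)
print(round(r_xy, 4), round(r_uw, 4))  # same magnitude, opposite sign
```

With b and d of the same sign, the two printed values would coincide.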
Assumptions
While calculating the Pearson’s Correlation Coefficient, we make the following
assumptions –
There is a linear relationship (or any linear component of the relationship) between the
two variables
We keep Outliers either to a minimum or remove them entirely
An outlier is a data point that does not fit the general trend of your data but would appear
to be an extreme value and not what you would expect compared to the rest of your data
points. You can detect outliers by plotting the two variables against each other on a graph
and visually inspecting the graph for extreme points.
You can then either remove or manipulate that particular point as long as you can justify
why you did so. Outliers can have a very large effect on the line of best fit and the Pearson
correlation coefficient, which can lead to very different conclusions regarding your data.
Both of the above points for a given pair of variables can be analyzed easily by studying
their scatter plots.
Solved Example on Coefficient of Correlation
Question: An experiment conducted on 9 different cigarette smoking subjects resulted in
the following data –
Subject   Cigarettes smoked per week (x)   Years lived (y)
1         25                               63
2         35                               68
3         10                               72
4         40                               62
5         85                               65
6         75                               46
7         60                               51
8         45                               60
9         50                               55
Calculate the correlation of coefficient between the number of cigarettes smoked and the
longevity of a test subject.
Solution
Let us first assign random variables to our data in the following way –
x – number of cigarettes smoked per week
y – years lived
We’ll be using the single formula for discrete data points here –
Let us now construct a table to compute all the values we are going to use in our
correlation formula. Note that N here = 9
X x2 Y y2 xy
r = −0.61
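The computation table and the formula for this example are not reproduced above, so the following is only a sketch of how the result can be checked. It assumes the raw-sum form of the Pearson formula, r = (NΣxy − ΣxΣy) / √([NΣx² − (Σx)²][NΣy² − (Σy)²]), which may differ in layout from the formula the text intended but yields the same coefficient.

```python
from math import sqrt

# x: cigarettes smoked per week, y: years lived (the nine subjects above)
x = [25, 35, 10, 40, 85, 75, 60, 45, 50]
y = [63, 68, 72, 62, 65, 46, 51, 60, 55]
N = len(x)

# Print the computation table row by row: x, x², y, y², xy
for xi, yi in zip(x, y):
    print(xi, xi * xi, yi, yi * yi, xi * yi)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(p * q for p, q in zip(x, y))

# Raw-sum form of the Pearson formula (assumed, see lead-in)
r = (N * sum_xy - sum_x * sum_y) / sqrt(
    (N * sum_x2 - sum_x**2) * (N * sum_y2 - sum_y**2)
)
print(round(r, 2))
```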
This implies a negative correlation between the considered variables, i.e. the higher the
number of cigarettes smoked per week in the last 5 years, the fewer the number of years
lived. Note that it DOES NOT mean that smoking cigarettes decreases the life span,
because many other factors might be responsible for one's death. Still, it is an important
conclusion.
On the other hand if, for example, the relationship appears linear (assessed via
scatterplot) one would run a Pearson’s correlation because this will measure the strength
and direction of any linear relationship. Monotonicity –
Thus, at every level, we need to compare the values of the two variables. The method
of ranking assigns such ‘levels’ to each value in the dataset so that we can easily compare
it.
Assign number 1 to n (the number of data points) corresponding to the variable values
in the order highest to lowest.
In the case of two or more values being identical, assign to them the arithmetic
mean of the ranks that they would have otherwise occupied.
For example, Selling Price values given: 28.2, 32.8, 19.4, 22.5, 20.0, 22.5. The
corresponding ranks are: 2, 1, 6, 3.5, 5, 3.5. The highest value 32.8 is given rank 1, 28.2 is
given rank 2, …. Two values are identical (22.5) and in this case, the arithmetic mean of the
ranks that they would have otherwise occupied, (3+4)/2 = 3.5, has to be taken.
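The ranking rule (rank 1 for the highest value, tied values sharing the mean of the ranks they would otherwise occupy) can be sketched as follows. The function name `average_ranks` is hypothetical and the code is a plain-Python illustration, not a library routine.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean rank."""
    # Indices of the values, ordered from the largest value to the smallest
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend 'end' over the run of values tied with the one at 'pos'
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end correspond to ranks pos+1..end+1; take their mean
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


prices = [28.2, 32.8, 19.4, 22.5, 20.0, 22.5]
print(average_ranks(prices))  # [2.0, 1.0, 6.0, 3.5, 5.0, 3.5]
```

The two values of 22.5 would have occupied ranks 3 and 4, so each receives 3.5; the next value, 20.0, then takes rank 5.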
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
where n is the number of data points of the two variables and d (written d_i for the i-th pair) is the
difference in the ranks of the i-th element of each random variable considered. The Spearman correlation
coefficient, R (also written ρ), can take values from +1 to −1.
Ex 1) The following data give the ranks assigned to eight workers by two different
supervisors. Find the rank correlation coefficient.
Rank by supervisor I    3   5   7   1   2   8   6   4
Rank by supervisor II   2   1   4   5   7   6   3   8
Solution
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
Rank by supervisor I (R1)    3    5    7    1    2    8    6    4
Rank by supervisor II (R2)   2    1    4    5    7    6    3    8
d = R1 − R2                  1    4    3   −4   −5    2    3   −4
d²                           1   16    9   16   25    4    9   16
Σd² = 96, n = 8
Using the formula
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 96}{8(64 - 1)} = 1 - \frac{576}{504} = -0.1429$$
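When the ranks are already given, as in this example, the coefficient follows directly from the rank differences. A minimal sketch, with the hypothetical helper name `spearman_from_ranks`:

```python
def spearman_from_ranks(r1, r2):
    """Spearman's rank correlation R = 1 - 6*Σd² / (n(n² - 1)) for untied ranks."""
    n = len(r1)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


rank_by_supervisor_1 = [3, 5, 7, 1, 2, 8, 6, 4]
rank_by_supervisor_2 = [2, 1, 4, 5, 7, 6, 3, 8]

R = spearman_from_ranks(rank_by_supervisor_1, rank_by_supervisor_2)
print(round(R, 4))
```

A value this close to zero suggests that the two supervisors' rankings are almost unrelated.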
y 50 70 65 72 90 58 53 57 68 74
Solution:
x           15   32   25   30   35   20   19   22   27   31
y           50   70   65   72   90   58   53   57   68   74
R1(x)       10    2    6    4    1    8    9    7    5    3
R2(y)       10    4    6    3    1    7    9    8    5    2
d = R1−R2    0   −2    0    1    0    1    0   −1    0    1
d²           0    4    0    1    0    1    0    1    0    1
Σd² = 8, n = 10
Using the formula
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 8}{10(100 - 1)} = 1 - \frac{48}{990} = 0.9515$$
State University   % of students having free meals   % of students scoring above 8.5 CGPA
Pune               14.4                              54
Chennai            7.2                               64
Delhi              27.5                              44
Kanpur             33.8                              32
Ahmedabad          38.0                              37
Indore             15.9                              68
Guwahati           4.9                               62
Solution: Let us first assign the random variables to the required data –
Before proceeding with the calculation, we’ll need to assign ranks to the data
corresponding to each state university. We construct the table for the rank as below –
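State University   Rank (free meals)   Rank (CGPA)   d     d²
Pune               3                   4             −1    1
Chennai            2                   6             −4    16
Delhi              5                   3             2     4
Kanpur             6                   1             5     25
Ahmedabad          7                   2             5     25
Indore             4                   7             −3    9
Guwahati           1                   5             −4    16
Here the ranks run from the lowest value (rank 1) to the highest (rank 7) in both columns.
Σd² = 96
$$R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 96}{7(49 - 1)} = 1 - \frac{576}{336} = -0.714$$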
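The whole procedure for this example, from raw percentages to the coefficient, can be sketched in a few lines. The sketch assumes there are no tied values (true for this data set) and ranks from the lowest value upward; ranking both variables from the highest value downward instead would give the same coefficient. The helper name `ranks_ascending` is hypothetical.

```python
def ranks_ascending(values):
    """Rank 1 for the smallest value; assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Pune, Chennai, Delhi, Kanpur, Ahmedabad, Indore, Guwahati
free_meals = [14.4, 7.2, 27.5, 33.8, 38.0, 15.9, 4.9]
high_cgpa = [54, 64, 44, 32, 37, 68, 62]

rank_meals = ranks_ascending(free_meals)
rank_cgpa = ranks_ascending(high_cgpa)

n = len(free_meals)
sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_meals, rank_cgpa))
R = 1 - 6 * sum_d2 / (n * (n * n - 1))

print(rank_meals, rank_cgpa)
print(sum_d2, round(R, 3))
```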
Such a strong negative coefficient of correlation gives away an important implication – the
universities with the highest percentage of students consuming free meals tend to have
the least successful results (and vice-versa). Similarly, we can solve all other questions.
In this section we will first discuss correlation analysis, which is used to quantify the
association between two continuous variables e.g., between an independent and a
dependent variable or between two independent variables. Regression analysis is a
related technique to assess the relationship between an outcome variable and one or
more risk factors or confounding variables. The outcome variable is also called
the response or dependent variable and the risk factors and confounders are called
the predictors, or explanatory or independent variables. In regression analysis, the
dependent variable is denoted "y" and the independent variables are denoted by "x".
NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict
even beyond the limits of the data. Also, the term "explanatory variable" might give an
impression of a causal effect in a situation in which inferences should be limited to
identifying associations. The terms "independent" and "dependent" variable are less
subject to these interpretations as they do not strongly imply cause and effect.
Correlation Analysis
The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear
association between the two variables. The correlation between two variables can be
positive (i.e., higher levels of one variable are associated with higher levels of the other)
or negative (i.e., higher levels of one variable are associated with lower levels of the
other).
The sign of the correlation coefficient indicates the direction of the association. The
magnitude of the correlation coefficient indicates the strength of the association.
The figure below shows four hypothetical scenarios in which one continuous variable is
plotted along the X-axis and the other along the Y-axis.
Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for
the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between
age and body mass index (which tends to increase with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent
of media exposure in adolescence and age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r= -0.9) generally observed
between the number of hours of aerobic exercise per week and percent body fat.
We wish to estimate the association between gestational age and infant birth weight. In
this example, birth weight is the dependent variable and gestational age is the
independent variable. Thus y=birth weight and x=gestational age. The data are
displayed in a scatter diagram in the figure below.
Each point represents an (x,y) pair (in this case the gestational age, measured in
weeks, and the birth weight, measured in grams). Note that the independent variable is
on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or
Y-axis). The scatter plot shows a positive or direct association between gestational age
and birth weight. Infants with shorter gestational ages are more likely to be born with
lower weights and infants with longer gestational ages are more likely to be born with
higher weights.
We first summarize the gestational age data. The mean gestational age is:
To compute the variance of gestational age, we need to sum the squared deviations (or
differences) between each observed gestational age and the mean gestational age. The
computations are summarized below.
Next, we summarize the birth weight data. The mean birth weight is:
The variance of birth weight is computed just as we did for gestational age as shown in
the table below.
To compute the covariance of gestational age and birth weight, we need to multiply the
deviation from the mean gestational age by the deviation from the mean birth weight for
each participant (i.e.,
The computations are summarized below. Notice that we simply copy the deviations
from the mean gestational age and birth weight from the two tables above into the table
below and multiply.
1.The following data represents the time in weeks (X) and the output in thousand units
(Y). Find the coefficient of correlation.
x: 7 5 4 11 10 12 14 9
y: 14 8 8 19 16 19 20 16
[ Answer: 0.9635 ]
2. Find the coefficient of correlation for the
following data:
x: 14 8 10 11 9 13 5
y: 14 9 11 13 11 12 4
[ Answer: 0.9231 ]
3. Find the coefficient of correlation for the following data representing cost in Rs. (X)
and sales in Rs. (Y) of a product for a period of eight years.
x: 84 80 92 85 95 90 83 87
y: 115 104 122 116 125 120 112 120
[ Answer: 0.9358 ]
4. Calculate the coefficient of correlation between marks in Economics (X) and
marks in Accountancy (Y) of a group of 10 students.
x: 53 47 42 60 63 52 57 55 61 48
y: 72 61 62 85 80 65 79 75 84 73
[ Answer: 0.8831 ]
5. Calculate the coefficient of rank correlation for the following data giving working
capital in lakhs of Rs. (x) and profit in thousands of Rs. (y) of 10 companies for
the year 2003.
x: 15 32 25 30 35 20 19 22 27 31
y: 50 70 65 72 90 58 53 57 68 74
[ Answer: 0.9515 ]
6. Calculate Spearman’s rank correlation
coefficient for the following data.
x: 105 112 107 115 160 152 148 132
y: 120 127 135 123 140 142 138 110
[ Answer: 0.5394 ]
7. Find the Spearman’s coefficient of
correlation for the following data.
x: 33 37 42 23 21 15 13 30 39
y: 17 27 32 12 13 11 9 25 30
[ Answer: 0.9667 ]
8. Find the rank correlation coefficient for the following data representing marks in
terminal (x) and the marks in Final examination for a group of 10 students.
x: 52 33 47 65 43 33 54 66 75 70
y: 65 59 72 72 82 60 57 58 72 90
[ Answer: 0.2303 ]
9.Find rank correlation coefficient.
x: 84 89 72 75 90 62 62 78
y: 65 75 58 65 75 54 51 57
[ Answer: 0.881 ]
1. Marks of 6 students in class work and annual examination are given below. Find
the coefficient of correlation.
Class work           12   14   23   18   10   19
Annual Examination   68   78   85   75   70   74
2. Marks of 6 students in a unit test (x) and final examination (y) are given below. Find the
coefficient of correlation.
X   12   8    11   9    13   14
Y   45   35   29   32   40   36
3. Calculate the rank coefficient of correlation between the age and blood
pressure of the given people from a colony.
Age in Years     60    65    80    40    45    55    65
Blood Pressure   144   162   162   125   145   145   149
CORRELATION
MULTIPLE CHOICE QUESTIONS
Reference:
1.Statistical Technique by Manan Prakashan
2. Fundamental of mathematical statistics by Gupta and Kapoor
Unit 2
In this Chapter
5.1 Introduction
Linear Regression
5.1 Introduction
Regression Analysis
Regression analysis is a widely used technique that is useful for evaluating multiple
independent variables. As a result, it is particularly useful for assessing and adjusting for
confounding. It can also be used to assess the presence of effect modification.
A regression line is a straight line that describes how a response variable y changes as
an explanatory variable x changes. We often use a regression line to predict the value
of y for a given value of x. Regression, unlike correlation, requires that we have an
explanatory variable and a response variable.
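The regression line of y on x is y = a + bx,
where b = (∑xy − ∑x∑y/n) / (∑x² − (∑x)²/n) and a = ȳ − b x̄.
The regression line of x on y is x = a1 + b1y,
where b1 = (∑xy − ∑x∑y/n) / (∑y² − (∑y)²/n) and a1 = x̄ − b1 ȳ.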
Correlation:
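Ex 1: Find the two regression equations and also estimate y when x=13 and estimate x
when y=10
x 11 7 9 5 8 6 10
y 16 14 12 11 15 14 17
Solution:
To find b, b1, a and a1 we require the summations, so prepare the following table.
x:   11   7    9    5    8    6    10    ∑x = 56
y:   16   14   12   11   15   14   17    ∑y = 99
x²:  121  49   81   25   64   36   100   ∑x² = 476
y²:  256  196  144  121  225  196  289   ∑y² = 1427
xy:  176  98   108  55   120  84   170   ∑xy = 811
Here n = 7, x̄ = 56/7 = 8 and ȳ = 99/7 = 14.1429.
b = (∑xy − ∑x∑y/n) / (∑x² − (∑x)²/n) = (811 − 792) / (476 − 448) = 19/28 = 0.6786
So a = ȳ − b x̄ = 14.1429 − 0.6786*8 = 8.7141
The regression equation of y on x is y = 8.7141 + 0.6786x
b1 = (∑xy − ∑x∑y/n) / (∑y² − (∑y)²/n) = 19 / (1427 − 1400.1429) = 19/26.8571 = 0.7074
So a1 = x̄ − b1 ȳ = 8 − 0.7074*14.1429 = 8 − 10.0047 = −2.0047
The regression equation of x on y is x = −2.0047 + 0.7074y
When x = 13, y = 8.7141 + 0.6786*13 = 17.5359
When y = 10, x = −2.0047 + 0.7074*10 = 5.0693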
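The hand computation in Ex 1 can be cross-checked with a short script. What follows is a minimal Python sketch, not part of the original notes: the function name `regression_lines` is an illustrative assumption, and it simply applies the summation formulas for b, b1, a and a1. Because it does not round intermediate values, its results can differ from the hand-computed figures in the fourth decimal place.

```python
def regression_lines(xs, ys):
    """Return (a, b, a1, b1) for the lines y = a + b*x and x = a1 + b1*y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Common numerator: sum(xy) - sum(x)*sum(y)/n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / (sum_xx - sum_x ** 2 / n)    # slope of y on x
    b1 = s_xy / (sum_yy - sum_y ** 2 / n)   # slope of x on y

    x_bar, y_bar = sum_x / n, sum_y / n
    a = y_bar - b * x_bar                   # intercept of y on x
    a1 = x_bar - b1 * y_bar                 # intercept of x on y
    return a, b, a1, b1


# Data of Ex 1
xs = [11, 7, 9, 5, 8, 6, 10]
ys = [16, 14, 12, 11, 15, 14, 17]

a, b, a1, b1 = regression_lines(xs, ys)
y_at_13 = a + b * 13     # estimate of y when x = 13
x_at_10 = a1 + b1 * 10   # estimate of x when y = 10

print(f"y = {a:.4f} + {b:.4f} x")
print(f"x = {a1:.4f} + {b1:.4f} y")
print(f"y(13) = {y_at_13:.4f}, x(10) = {x_at_10:.4f}")
```

The same function can be reused for the exercises at the end of the chapter by replacing `xs` and `ys`.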
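Ex 2: The two regression lines are 5x-6y+90=0 and 15x-8y-180=0, and the standard deviation
of y is 1. Find the mean values of x and y, the coefficient of correlation r and the standard
deviation of x.
Solution: To find the mean values of x and y, solve the given equations simultaneously as
follows.
5x-6y+90=0 …..1)
15x-8y-180=0 ….2)
Multiplying equation 1) by 3 gives 15x-18y+270=0. Subtracting this from equation 2):
10y-450=0
y=45
Substituting y=45 in equation 1): 5x=6*45-90=180, so x=36
So the mean of x is 36 and the mean of y is 45.
To find r, the correlation coefficient, first let equation 1) be the line of x on y with the
standard form x=a1+b1y
We have 5x-6y+90=0
5x=6y-90
∴ x = 6y/5 − 18
Comparing it with the standard form, b1=6/5
Then equation 2) is the line of y on x with the standard form y=a+bx
15x-8y-180=0
∴ 8y = 15x − 180
y = 15x/8 − 180/8
so b=15/8
Now r = ±√(b*b1) = ±√(15/8 * 6/5) = ±1.5
This is impossible because r must lie between −1 and 1, so the assumption is wrong.
Therefore equation 1) is the line of y on x:
6y = 5x + 90
∴ y = 5x/6 + 15
b=5/6
and equation 2) is the line of x on y:
15x = 8y + 180
∴ x = 8y/15 + 12
so b1=8/15
Now r = ±√(b*b1) = ±√(8/15 * 5/6) = ±0.6667
Since b and b1 are positive, r is also positive, so r=0.6667
To find the standard deviation of x, use b = r*σy/σx with σy = 1:
5/6 = 2/3 * 1/σx
σx = 0.8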
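The reasoning used for the two given regression lines (solve them simultaneously for the means, then pick the assignment of "y on x" and "x on y" for which b*b1 does not exceed 1) can be sketched in Python. This is an illustrative sketch only; the helper names `intersection`, `slope_y_on_x`, `slope_x_on_y` and `correlation` are assumptions introduced here, and each line is stored as a tuple (p, q, c) meaning p*x + q*y + c = 0.

```python
from fractions import Fraction
from math import sqrt

# Each line is (p, q, c), meaning p*x + q*y + c = 0.
line1 = (5, -6, 90)      # 5x - 6y + 90 = 0
line2 = (15, -8, -180)   # 15x - 8y - 180 = 0


def intersection(l1, l2):
    """Solve the two lines simultaneously (Cramer's rule): the point of means."""
    p1, q1, c1 = l1
    p2, q2, c2 = l2
    det = p1 * q2 - p2 * q1
    x = Fraction(q1 * c2 - q2 * c1, det)
    y = Fraction(p2 * c1 - p1 * c2, det)
    return x, y


def slope_y_on_x(line):
    p, q, _ = line
    return Fraction(-p, q)   # y = (-p/q) x - c/q


def slope_x_on_y(line):
    p, q, _ = line
    return Fraction(-q, p)   # x = (-q/p) y - c/p


def correlation(l1, l2):
    """Try both assignments and keep the one with 0 <= b*b1 <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b, b1 = slope_y_on_x(y_on_x), slope_x_on_y(x_on_y)
        if 0 <= b * b1 <= 1:
            sign = 1 if b > 0 else -1   # r takes the sign of b and b1
            return sign * sqrt(b * b1), b, b1
    raise ValueError("no assignment gives a valid correlation coefficient")


x_bar, y_bar = intersection(line1, line2)
r, b, b1 = correlation(line1, line2)

sigma_y = 1.0
sigma_x = r * sigma_y / float(b)   # from b = r * sigma_y / sigma_x

print(x_bar, y_bar, round(r, 4), round(sigma_x, 4))
```

Exact fractions are used for the slopes so that the check on b*b1 is not affected by floating-point rounding.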
Ex 3: Find the regression of y on x.
Respondent   Father’s Education (X)   Respondent’s Education (Y)   XY    X²    Y²
1            10                       10                           100   100   100
2            10                       11                           110   100   121
3            12                       12                           144   144   144
4            14                       13                           182   196   169
5            14                       14                           196   196   196
Total        60                       60                           732   736   730
Mean of X = 12, Mean of Y = 12
Y = a + bX, where b = (∑XY − N X̄ Ȳ) / (∑X² − N X̄²) and a = Ȳ − b X̄
b = (732 − 5*12*12) / (736 − 5*12²) = (732 − 720) / (736 − 720) = 12/16
b=0.75
a=12-0.75*12=3
So the regression of y on x is Y = 3 + 0.75X
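X 3 4 5 3 4
Y 12 7 5 11 8
X      Y      X²     Y²     XY
3      12     9      144    36
4      7      16     49     28
5      5      25     25     25
3      11     9      121    33
4      8      16     64     32
ΣX=19  ΣY=43  ΣX²=75  ΣY²=403  ΣXY=154
byx = (ΣXY − ΣXΣY/n) / (ΣX² − (ΣX)²/n) = (154 − 163.4) / (75 − 72.2)
= -9.4 / 2.8
= -3.36
bxy = (ΣXY − ΣXΣY/n) / (ΣY² − (ΣY)²/n) = -9.4 / (403 − 369.8) = -9.4 / 33.2
= -0.28
ȳ = 43/5 = 8.6
x̄ = 19/5 = 3.8
The regression line of Y on X is Y – ȳ = byx (X – x̄)
Y − 8.6 = -3.36(X − 3.8)
Y = -3.36X + 21.37
To estimate Y when X = 6:
Y + 3.36X = 21.37
Y + 3.36(6) = 21.37
Y = 21.37 – 20.16
Y = 1.21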
INDEX 9 7 8 4 7 5 5 6
SALARY 36 25 33 15 28 19 20 22
Find expected Salary of an employee whose Index is 3.
Solution:
                                                  Total
INDEX (x):   9     7    8     4    7    5    5    6     51
SALARY (y):  36    25   33    15   28   19   20   22    198
x²:          81    49   64    16   49   25   25   36    345
y²:          1296  625  1089  225  784  361  400  484   5264
xy:          324   175  264   60   196  95   100  132   1346
Regression of y on x is y=a+bx
where b = (∑xy − ∑x∑y/n) / (∑x² − (∑x)²/n) = (1346 − 51*198/8) / (345 − (51)²/8) = 83.75/19.875 = 4.213836
x̄ = ∑x/n = 51/8 = 6.375, ȳ = ∑y/n = 198/8 = 24.75
So a = ȳ − b x̄ = 24.75-4.213836*6.375 = -2.11321
Y= -2.11321+4.213836x
When x = 3, Y= -2.11321+4.213836*3= 10.52830
Regression Analysis
Example :. Data on height (in cms) of father (Y) and that of his son (X) are given
below.
1. From the following data find the two regression equations and hence
estimate y when x = 13 and estimate x when y = 10.
x: 14 10 15 11 9 12 6
y: 8 6 4 3 7 5 9
[ Answer: 5.2858 & 8.1428 ]
2. Find the two regression equations and also estimate y when x = 13 and estimate x
when y = 10
x: 11 7 9 5 8 6 10
y: 16 14 12 11 15 14 17
[ Answer: 17.5359 & 5.0693 ]
3. The following data represents the marks in Algebra (x) and Geometry
(y) of a group of 10 students. Find both regression equations and
hence estimate y if x = 78 and x if y = 94.
y: 82 78 86 72 91
80 95 72 89 74
[ Answer: 80.394 ~ 80 and 94.9337 ~ 95 ]
4. Find the regression equations for the following data and hence estimate y when x =
15 and x when y = 18.
x: 10 12 14 19 8 11 17
y: 20 24 25 21 16 22 20
[ Answer: 21.64 & 11.54 ]
5. From the following data, find the regression equations and further estimate y if x = 16
and x if y = 18.
x: 3 4 6 10 12 13
y: 12 11 15 16 19 17
[ Answer: 20.32 & 11.8 ]
6. For a bivariate distribution, the following results are obtained.
Mean value of x = 65, Mean value of y = 53
Standard deviation of x = 4.7, Standard deviation of y = 5.2
Coefficient of correlation = 0.78
Find the two regression equations and hence obtain
i. The most probable value of y when x = 63
ii. The most probable value of x when y = 50
[ Answer: 51.274 & 62.885 ]
7. The averages for rainfall and yield of a crop are 42.7 cms and 850 kgs
respectively. The corresponding standard deviations are 3.2 cms and 14.1 kgs.
The coefficient of correlation is 0.65. Estimate the yield when the rainfall is 39.2
cms. [ Estimated yield is 839.99 kgs. ]
a)Find the two regression lines of equation for the following data.
x 3 5 7 9 11
y 9 12 16 14 15
d) Given the following data, estimate the linear trend equation. Find the trend
values and calculate the trend value for 2018.
Year: 2010 2011 2012 2013 2014
No. of cars (in thousands): 11 30 38 50 56
e) Find (a) σx (b)σ y (c) V(x) (d) V(y) and (e) cov (x, y) for the following data:
X 1 2 3 5 4 3
Y 2 4 5 5 3 1
f)The two regression lines between x and y are given below. Find mean value
of x and y and correlation coefficient (r xy)
100y – 45x – 1400 = 0
4y – 5x + 200 = 0
6. Larger values of r2 (R2) imply that the observations are more closely grouped about
the
a. average value of the independent variables
b. average value of the dependent variable
c. least squares line
d. origin
13. In regression analysis, the variable that is used to explain the change in the
outcome of an experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct
14. In the case of an algebraic model for a straight line, if a value for the x variable is
specified, then
a. the exact value of the response variable can be computed
b. the computed response to the independent value will always give a minimal residual
c. the computed value of y will always be the best estimate of the mean response
d. none of these alternatives is correct.
15. A regression analysis between sales (in 1000) and price (in Rs) resulted in the
following equation:
y = 50,000 - 8X
The above equation implies that an
a. increase of 1 in price is associated with a decrease of 8 in sales
b. increase of 8 in price is associated with an increase of 8,000 in sales
c. increase of 1 in price is associated with a decrease of 42,000 in sales
d. increase of 1 in price is associated with a decrease of 8000 in sales
17. If the coefficient of determination is a positive value, then the regression equation
a. must have a positive slope
b. must have a negative slope
c. could have either a positive or a negative slope
d. must have a positive y intercept
18. If two variables, x and y, have a very strong linear relationship, then
a. there is evidence that x causes a change in y
b. there is evidence that y causes a change in x
c. there might not be any causal relationship between x and y
d. None of these alternatives is correct.
19. If the coefficient of determination is equal to 1, then the correlation coefficient
a. must also be equal to 1
b. can be either -1 or +1
c. can be any value between -1 to +1
d. must be -1
21. The data are the same as for question 4 above. The relationship between number of
beers consumed (x) and blood alcohol content (y) was studied in 16 male college
students by using least squares regression. The following regression equation was
obtained from this study:
y= -0.0127 + 0.0180x
Suppose that the legal limit to drive is a blood alcohol content of 0.08. If Ricky
consumed 5 beers
the model would predict that he would be:
a. 0.09 above the legal limit
b. 0.0027 below the legal limit
c. 0.0027 above the legal limit
d. 0.0733 above the legal limit
23. If the correlation coefficient is 0.8, the percentage of variation in the response
variable explained
by the variation in the explanatory variable is
a. 0.80%
b. 80%
c. 0.64%
d. 64%
24. If the correlation coefficient is a positive value, then the slope of the regression line
a. must also be positive
b. can be either negative or positive
c. can be zero
d. can not be zero
27. Regression analysis was applied between sales (y) and advertising (x) across all
the branches
of a major international corporation. The following regression function was obtained.
y = 5000 + 7.25x
If the advertising budgets of two branches of the corporation differ by 30,000, then what
will be the predicted difference in their sales?
a. 217,500
b. 222,500
c. 5000
d. 7.25
28. Suppose the correlation coefficient between height (as measured in feet) versus
weight (as measured in pounds) is 0.40. What is the correlation coefficient of height
measured in inches versus weight measured in ounces? [12 inches = one foot; 16
ounces = one pound]
a. 0.40
b. 0.30
c. 0.533
d. cannot be determined from information given
e. none of these
29. Assume the same variables as in question 28 above; height is measured in feet and
weight is measured in pounds. Now, suppose that the units of both variables are
converted to metric (meters and kilograms). The impact on the slope is:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. both a and b are correct
d. neither a nor b are correct
30. Suppose that you have carried out a regression analysis where the total variance in
the response is 133452 and the correlation coefficient was 0.85. The residual sums of
squares is:
a. 37032.92
b. 20017.8
c. 113434.2
d. 96419.07
e. 15%
f. 0.15
31. This question is related to questions 4 and 21 above. The relationship between
number of beers consumed (x) and blood alcohol content (y) was studied in 16 male
college students by using least squares regression. The following regression equation
was obtained from this study:
y= -0.0127 + 0.0180x
Another guy, his name Dudley, has the regression equation written on a scrap of paper
in his pocket. Dudley goes out drinking and has 4 beers. He calculates that he is under
the legal limit (0.08) so he decides to drive to another bar. Unfortunately Dudley gets
pulled over and confidently submits to a road-side blood alcohol test. He scores a blood
alcohol of 0.085 and gets himself arrested. Obviously, Dudley skipped the lecture about
residual variation. Dudley’s residual is:
a. +0.005
b. -0.005
c. +0.0257
d. -0.0257
35. When the error terms have a constant variance, a plot of the residuals versus the
independent variable x has a pattern that
a. fans out
b. funnels in
c. fans out, but then funnels in
d. forms a horizontal band pattern
e. forms a linear pattern that can be positive or negative
Reference:
1. Statistical Technique by Manan Prakashan
2. Statistical Technique by Sheth Publication
3. Fundamental of mathematical Statistics by Gupta Kapoor
Unit 4
Testing of Hypothesis
Chapter 7
Unit Structure
7.0 Objectives
7.1 Introduction
7.1.1 Population
7.1.2 Sample
7.1.3 Parameter
7.1.4 Statistic
7.2 Hypothesis Testing
7.2.1 Hypothesis
7.2.2 Steps of Testing Hypothesis
7.3 Solved problems on Type I and II Errors
7.4 Let us sum up
7.5 Exercise
7.6 References
7.0: OBJECTIVES
7.1:INTRODUCTION
Hypothesis testing refers to the process of making inferences about a particular parameter. This
can be done using statistics and sample data.
7.1.1: Population: It is the collection of all possible observations under the study or
investigation. It denotes a large group consisting of elements having at least one common
feature. Examples:
a. Finite Population: When the number of elements of the population is fixed, making it
possible to enumerate it in totality, the population is said to be finite.
b. Infinite Population: When the number of units in a population is uncountable, so that it
is impossible to observe all the items of the universe, the population is considered
infinite.
7.1.2: Sample: It is a part or subset of the population that is selected to represent the entire group.
In other words, the respondents selected out of the population constitute a ‘sample’, and the process
of selecting respondents is known as ‘sampling.’ The units under study are called sampling units,
and the number of units in a sample is called the sample size. In order to use statistics to learn things
about the population, the sample must be random. A random sample is one in which every
member of a population has an equal chance of being selected.
Example: A sample of 10 students is selected from the entire class of 50 students.
7.2 HYPOTHESIS TESTING
1. Set up a Hypothesis:
The first step is to establish the hypothesis to be tested. The statistical hypothesis is an
assumption about the value of some unknown parameter, and the hypothesis provides some
numerical value or range of values for the parameter. Here two hypotheses about the population
are constructed -Null Hypothesis and Alternative Hypothesis.
The Null hypothesis, denoted by H0, states that there is no difference between the assumed and
actual value of the parameter. In other words, a hypothesis based on past experience or one which
is believed to be true is called the Null Hypothesis.
Example: H0: The mean of Normal Distribution is 50
H0: µ = 50
The alternative hypothesis denoted by H1 is the other hypothesis about the population, which
stands true if the null hypothesis is rejected. Thus, if we reject H0 then the alternative hypothesis
H1 gets accepted.
Example: H1: The mean of Normal Distribution is more than 50
H1: µ > 50
Alternative Hypothesis can be of three types. If we want to test the null hypothesis that
H0: µ = 50, then the alternative hypothesis could be
(i) H1:µ > 50, this type of alternative hypothesis is called Right-tailed alternative hypothesis.
(ii) H1: µ < 50, this type of alternative hypothesis is called Left-tailed alternative hypothesis.
(iii) H1: µ ≠ 50, this type of alternative hypothesis is called Two-tailed alternative hypothesis.
Examples: In each of the following cases set up the Null and Alternative Hypothesis.
(ii) We want to test whether the mean GPA of students in American colleges is more than 2.0
(out of 4.0). The null and alternative hypotheses are:
H0: µ = 2.0 against H1:µ >2.0
H0: Cats express no food preference based on colour. against H1:Cats express food
preference based on colour.
(v) A medical researcher is interested in finding out whether a new medication will have any
undesirable side effects. The researcher is particularly concerned with the pulse rate of the
patients who take the medication. What are the hypotheses to test whether the pulse rate will be
different from the mean pulse rate of 82 beats per minute?
H0: µ = 82 against H1: µ ≠ 82, which is a two-tailed test.
(vi) A chemist invents an additive to increase the life of an automobile battery. If the mean
lifetime of the battery is 36 months, then his hypotheses are H0: µ = 36 against H1: µ > 36,
which is a right-tailed test.
Note: A statistical test uses the data obtained from a sample to make a decision about
whether or not the null hypothesis should be rejected.
A hypothesis which completely defines the population distribution is called a Simple hypothesis;
otherwise it is called a Composite hypothesis.
Example: If x1, x2, x3, ……, xn is a random sample of size n from a Normal population, then the
hypothesis H: µ = µ0, σ² = σ0² is a simple hypothesis. The following hypotheses are all composite
hypotheses.
(i) H: µ = µ0 (ii) H: σ² = σ0² (iii) H: µ < µ0, σ² = σ0² (iv) H: µ > µ0, σ² = σ0².
a. Is it true that vitamin C has the ability to cure or prevent the common cold?
b. Ibuprofen is more effective than aspirin in helping a person who has had a heart attack.
d. Young boys are prone to more behavioral problems than young girls.
e. At the time of interview for promotion, the typist in Municipal corporation claims that his
typing speed is 100 words per minute.
a. Researchers select a sample from a population to learn more about the characteristics of a
population.
Since the null and alternative hypotheses are contradictory, you must examine evidence to decide
if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of
sample data.
After you have determined which hypothesis the sample supports, you make a decision. There are
two options for a decision. They are “reject H0” if the sample information favours the alternative
hypothesis, or “do not reject H0” (“decline to reject H0”) if the sample information is insufficient
to reject the null hypothesis.
3. Determining a Suitable Test Statistic:
After the hypotheses are constructed, the next step is to determine a suitable test statistic and its
distribution. A statistic whose value is used to test the validity of a null hypothesis against an
alternative hypothesis is known as a test statistic. Example: Suppose we want to test the average
pocket money of First year students. From past experience it was Rs. 50 per day. So
our null and alternative hypotheses will be H0: µ = 50 against H1: µ ≠ 50. For testing this
hypothesis we collect a sample from the current FY students and calculate the sample mean. This
sample mean is the test statistic for this particular example.
We have been using probability to decide whether a statistical test provides evidence for or
against our predictions. If the probability of obtaining a given test statistic from the population is
very small, we reject the null hypothesis.
But you could be wrong. Even if you choose a probability level of 5 percent, that means there is
a 5 percent chance, or 1 in 20, that you rejected the null hypothesis when it was, in fact, correct.
You can make an error in the opposite way, too; you might fail to reject the null hypothesis when
it is, in fact, incorrect. These two errors are called Type I and Type II, respectively. Table 1
presents the four possible outcomes of any hypothesis test based on (1) whether the null
hypothesis was accepted or rejected and (2) whether the null hypothesis was true in reality.
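The probability of a Type I error is often represented by the Greek letter alpha (α) and the
probability of a Type II error by the Greek letter beta (β).
α = P(Type I error) = P(reject H0 / H0 is true)
β = P(Type II error) = P(accept H0 / H0 is false)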
For a fixed sample size, Type I and Type II errors are inversely related: as one increases, the
other decreases. If we try to make the probability of Type I error 0, the probability of Type II
error becomes maximum. The Type I, or α (alpha), error rate is usually set in advance by the
researcher. The Type II error rate for a given test is harder to know because it requires
estimating the distribution under the alternative hypothesis, which is usually unknown.
A related concept is power: the probability that a test will reject the null hypothesis when it is,
in fact, false. You can see from Figure 1 that power is simply 1 minus the Type II error rate (β).
High power is desirable. Like β, power can be difficult to estimate accurately, but, other things
being equal, increasing the sample size increases power.
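The relation between power, β and sample size can be illustrated numerically. The sketch below is an illustration only and rests on assumptions not found in the notes: a right-tailed z test of H0: µ = µ0 with known σ, a hypothetical true mean µ1, and made-up values (µ0 = 50, µ1 = 52, σ = 5). The function name `power_right_tailed_z` is likewise hypothetical. It uses Python's standard `statistics.NormalDist`.

```python
from statistics import NormalDist


def power_right_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)      # critical value: reject if Z > z_alpha
    shift = (mu1 - mu0) / (sigma / n ** 0.5)     # mean of Z when the true mean is mu1
    beta = std_normal.cdf(z_alpha - shift)       # P(do not reject H0 | mu = mu1)
    return 1 - beta                              # power = 1 - beta


# Hypothetical values chosen only for illustration
mu0, mu1, sigma = 50, 52, 5
powers = {n: power_right_tailed_z(mu0, mu1, sigma, n) for n in (10, 30, 100)}

for n, p in powers.items():
    print(f"n = {n:3d}: beta = {1 - p:.4f}, power = {p:.4f}")
```

With α held fixed, the printed power rises as n grows, which is the behaviour described above. When µ1 equals µ0 the function returns α itself, since rejecting a true H0 is exactly the Type I error.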
Once the hypothesis about the population is constructed, the researcher has to decide the level of
significance, that is, the probability with which the null hypothesis is rejected when it is true.
The significance level is denoted by ‘α’ and is usually fixed before the samples are drawn, so that
the results obtained do not influence the choice. In practice, we take either the 5% or the 1% level
of significance.
If the 5% level of significance is taken, it means that there are five chances out of 100 that we
will reject the null hypothesis when it should have been accepted, i.e. we are about 95%
confident that we have made the right decision. Similarly, if the 1% level of significance is
taken, it means that there is only one chance out of 100 that we reject the hypothesis when it
should have been accepted, and we are about 99% confident that the decision made is correct.
7. Performing Computations:
Once the critical region is identified, we compute the required values from the random sample of
size ‘n.’ Then we apply the formula of the test statistic chosen in step (3) to check whether the
sample result falls in the acceptance region or the rejection region.
8. Decision-making:
Once all the steps are performed, the statistical conclusions can be drawn, and the management
can take decisions. The decision involves either accepting the null hypothesis or rejecting it. The
decision that the null hypothesis is accepted or rejected depends on whether the computed value
falls in the acceptance region or the rejection region.
4. A test statistic is associated with a p value which is less than 0.05. What will be the decision of
the researcher?
Examples:
1. Given the probability distribution $f(x) = \frac{1}{\alpha}$, $0 \le x \le \alpha$.
For testing H0 : α = 1 against H1 : α = 2 by a single observed value x, what would be the
sizes of Type I and II Errors if the critical region is x ≥ 0.5? Also find the power of the test.
Solution:
P(Type I error) = P (Reject H0 / H0 is true)
= P(x ≥ 0.5 / α = 1) = P(0.5 ≤ x ≤ 1 / α = 1)
$= \int_{0.5}^{1} f(x)\,dx = \int_{0.5}^{1} \frac{1}{\alpha}\,dx = \int_{0.5}^{1} 1\,dx = [x]_{0.5}^{1} = 1 - 0.5 = 0.5$
P(Type II error) = P (Accept H0 / H0 is false)
= P (x < 0.5 / α = 2) = P (0 ≤ x < 0.5 / α = 2)
$= \int_{0}^{0.5} f(x)\,dx = \int_{0}^{0.5} \frac{1}{\alpha}\,dx = \int_{0}^{0.5} \frac{1}{2}\,dx = \left[\frac{x}{2}\right]_{0}^{0.5} = 0.25$
Power of the test = 1 − P(Type II error) = 1 − 0.25 = 0.75
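P(Type I error) = P (Reject H0 / H0 is true)
= P (x ≥ 1 / α = 2) = P (1 ≤ x < ∞ / α = 2)
$= \int_{1}^{\infty} f(x)\,dx = \int_{1}^{\infty} \alpha e^{-\alpha x}\,dx = \int_{1}^{\infty} 2e^{-2x}\,dx = 2\left[\frac{e^{-2x}}{-2}\right]_{1}^{\infty} = e^{-2}$
P(Type II error) = P (Accept H0 / H0 is false)
= P (x ≤ 1 / α = 1) = P (0 ≤ x ≤ 1 / α = 1)
$= \int_{0}^{1} f(x)\,dx = \int_{0}^{1} \alpha e^{-\alpha x}\,dx = \int_{0}^{1} e^{-x}\,dx = \left[\frac{e^{-x}}{-1}\right]_{0}^{1} = 1 - e^{-1}$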
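3. Let p be the probability that a coin will fall Head in a single toss. In order to test
H0 : p = 1/2 against H1 : p = 3/4, the coin is tossed 5 times and H0 is rejected if more than 3
heads are obtained. What would be the sizes of Type I and II Errors? Also find the
power of the test.
Solution: Here we want to test H0 : p = 1/2 against H1 : p = 3/4
where $f(x) = \binom{n}{x} p^{x} q^{n-x} = \binom{5}{x} p^{x} q^{5-x}$, x = 0, 1, 2, 3, 4, 5
P(Type I error) = P (Reject H0 / H0 is true)
$= P(x = 4, 5 \,/\, p = \tfrac{1}{2}) = \binom{5}{4} p^{4} q^{5-4} + \binom{5}{5} p^{5} q^{5-5}$
$= \binom{5}{4} \left(\tfrac{1}{2}\right)^{4} \left(\tfrac{1}{2}\right)^{5-4} + \binom{5}{5} \left(\tfrac{1}{2}\right)^{5} \left(\tfrac{1}{2}\right)^{5-5} = 5\left(\tfrac{1}{2}\right)^{5} + \left(\tfrac{1}{2}\right)^{5} = \tfrac{6}{32} = \tfrac{3}{16}$
P(Type II error) = P (Accept H0 / H0 is false) = P(x ≤ 3 / p = 3/4) = 1 − P(x = 4, 5 / p = 3/4)
$= 1 - \left[\binom{5}{4} \left(\tfrac{3}{4}\right)^{4} \left(\tfrac{1}{4}\right)^{5-4} + \binom{5}{5} \left(\tfrac{3}{4}\right)^{5} \left(\tfrac{1}{4}\right)^{5-5}\right]$
$= 1 - \left[5\left(\tfrac{3}{4}\right)^{4}\left(\tfrac{1}{4}\right) + \left(\tfrac{3}{4}\right)^{5}\right] = 1 - \left(\tfrac{3}{4}\right)^{4}\left[\tfrac{5}{4} + \tfrac{3}{4}\right]$
$= 1 - \tfrac{81}{128} = \tfrac{47}{128}$
Power of the test = 1 – β = $1 - \tfrac{47}{128} = \tfrac{81}{128}$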
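The binomial sums in this coin example can be verified exactly. The following is a small Python sketch (the helper name `binom_pmf` is an assumption made here for illustration); it uses `fractions.Fraction` so that the results appear as exact fractions rather than rounded decimals.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n = 5
reject_region = [4, 5]                         # H0 is rejected if more than 3 heads
p0, p1 = Fraction(1, 2), Fraction(3, 4)        # values of p under H0 and H1

type_1 = sum(binom_pmf(x, n, p0) for x in reject_region)       # P(reject | H0 true)
type_2 = 1 - sum(binom_pmf(x, n, p1) for x in reject_region)   # P(accept | H1 true)
power = 1 - type_2

print(type_1, type_2, power)
```

Changing `reject_region` shows how moving the cut-off trades one error against the other for the same five tosses.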
4. In a bag there are 4 marbles of which k are white and the remaining are black. To test
H0 : 𝑘 ≤ 2 against H1 : 𝑘 > 2, one marble is drawn from the bag and H0 is rejected if the marble
drawn is white. Find the two types of errors, level of significance and power of the test.
Solution: Here we want to test H0 : 𝑘 ≤ 2 against H1 : 𝑘 > 2
where k = number of white marbles in the bag
P(Type I error) = P (Reject H0 / H0 is true)
= P (selected marble is white / k = 0, 1, 2)
= P(selected marble is white / k = 0) + P(selected marble is white / k = 1) + P(selected marble is
white / k = 2)
$= 0 + \frac{\binom{1}{1}\binom{3}{0}}{\binom{4}{1}} + \frac{\binom{2}{1}\binom{2}{0}}{\binom{4}{1}} = 0 + 0.25 + 0.5 = 0.75$
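where $f(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$, x = 0, 1, 2, …
P(Type I error) = P (Reject H0 / H0 is true)
= P (x > 4 / λ = 4)
= 1 − P (x ≤ 4 / λ = 4)
= 1 − P(x = 0, 1, 2, 3, 4 / λ = 4)
$= 1 - e^{-4}\left(\frac{4^{0}}{0!} + \frac{4^{1}}{1!} + \frac{4^{2}}{2!} + \frac{4^{3}}{3!} + \frac{4^{4}}{4!}\right)$
$= 1 - \frac{103}{3}\,e^{-4} = 0.3712$
P(Type II error) = P (Accept H0 / H0 is false)
$= P(x \le 4 \,/\, \lambda = 5) = e^{-5}\left(\frac{5^{0}}{0!} + \frac{5^{1}}{1!} + \frac{5^{2}}{2!} + \frac{5^{3}}{3!} + \frac{5^{4}}{4!}\right) = \frac{523}{8}\,e^{-5} = 0.4405$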
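The two Poisson sums above are easy to mistype by hand, so a numerical check is useful. This is a minimal Python sketch using only the standard `math` module; the helper name `poisson_cdf` is an assumption introduced for illustration.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return exp(-lam) * sum(lam ** x / factorial(x) for x in range(k + 1))


# H0: lambda = 4 is rejected when x > 4; the alternative is lambda = 5
type_1 = 1 - poisson_cdf(4, 4)   # P(x > 4 | lambda = 4)
type_2 = poisson_cdf(4, 5)       # P(x <= 4 | lambda = 5)
power = 1 - type_2

print(round(type_1, 4), round(type_2, 4), round(power, 4))
```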
P(Type I error) = P (Reject H0 / H0 is true)
= P (x ≥ 0.6 / α = 2) = P (0.6 ≤ x ≤ 1 / α = 2)
$= \int_{0.6}^{1} f(x)\,dx = \int_{0.6}^{1} 2x\,dx = 2\int_{0.6}^{1} x\,dx = \left[x^{2}\right]_{0.6}^{1} = 1 - 0.36 = 0.64$
P(Type II error) = P (Accept H0 / H0 is false)
= P (x < 0.6 / α = 3) = P (0 ≤ x < 0.6 / α = 3)
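7. A single value x is taken from an N(µ, 16) population. The null hypothesis H0: µ = 40 is accepted if
x < 46, otherwise H1 : µ = 50 is considered to be true. Find the level of significance and power of
the test.
Solution: Here σ = √16 = 4.
Level of significance = P (Reject H0 / H0 is true)
= P (x ≥ 46 / µ = 40)
$= P\left(\frac{x-40}{4} \ge \frac{46-40}{4}\right) = P(z \ge 1.5) = 0.0668$
P(Type II error) = P (Accept H0 / H0 is false)
= P (x < 46 / µ = 50)
$= P\left(\frac{x-50}{4} < \frac{46-50}{4}\right) = P(z < -1) = 0.1587$
Power of the test = 1 − 0.1587 = 0.8413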
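For the normal example, in which H0: µ = 40 is rejected when the single observation is 46 or more, the tail areas can be read from tables or computed directly. The sketch below is an illustration using Python's standard `statistics.NormalDist`; the variable names are assumptions made here.

```python
from statistics import NormalDist

sigma = 4          # population variance is 16
cutoff = 46        # H0 is accepted if x < 46, otherwise rejected

under_h0 = NormalDist(mu=40, sigma=sigma)   # distribution of x when H0 is true
under_h1 = NormalDist(mu=50, sigma=sigma)   # distribution of x when H1 is true

level = 1 - under_h0.cdf(cutoff)   # P(x >= 46 | mu = 40), the level of significance
type_2 = under_h1.cdf(cutoff)      # P(x < 46 | mu = 50)
power = 1 - type_2

print(round(level, 4), round(type_2, 4), round(power, 4))
```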
Population
Sample
Parameter and Statistic
Null and Alternative Hypotheses
Simple and Composite Hypotheses
Critical Region
Two types of Errors
Level of Significance and Power of test
Sums on Testing of Hypothesis
7.5 Exercise
1. An urn contains either 3 red and 6 white balls or 6 red and 3 white balls. Two balls are
selected from the urn. If both balls come out to be red, it will be decided that the urn contains 6
red and 3 white balls. Calculate the two types of errors. Also calculate the power of the test.
2. Let the random variable X follow the Binomial Distribution with n = 10 and p, where p can be either
½ or ¼. We select a random sample of size, and if the observed value is less than or equal to 3, we
reject that p = ½ and accept p = ¼. Calculate the level of significance and power of the test. (Ans.
0.171875, 0.775875)
3. A single value x is taken from an N (µ, 25) population. The null hypothesis H0: µ = 50 is
accepted if x < 70, otherwise H1: µ = 60 is considered to be true. Find the level of significance and
power of the test. (Ans. 0, 0.02275)
4. Given the probability distribution $f(x, \alpha) = \frac{1}{2}$, $\alpha - 1 \le x \le \alpha + 1$.
For testing H0 : α = 4 against H1 : α = 5 by a single observed value x, what would be the
sizes of Type I and II Errors if the critical region is x ≥ 4.5? Also find the power of the test.
(Ans 0.25, 0.25, 0.75)
7.6 REFERENCES:
2. Probability and Statistics for Engineers and Scientists, 3 rd Edition, Sheldon. M. Ross
3. Introduction to probability and statistics-4th Edition J. Susan Milton, Jesse C. Arnold Tata
McGraw Hill
Testing of Hypothesis
Chapter: 8
Objectives
Introduction
Sampling Distribution
Central Limit Theorem
Tests of significance
Large sample test for sample mean
Large sample test for population proportion
Large sample test for difference between two sample means:
Student’s t test
Paired T test
Chi square test
8.5.1. Chi-square goodness of fit test
Let us sum up
Exercise
References
8.0. OBJECTIVES
8.1. INTRODUCTION
Sampling Distribution:
Population is the entire collection of observations under the investigation or study and
sample is part of it. Sampling is a process used in statistical analysis in which a
predetermined number of observations (sample) is collected or taken from population.
The methodology used to sample from a larger population depends on the type of
analysis being performed.
From a population there can be different samples of size n. So the statistic which is
calculated for sample observations is a random variable which has a probability
distribution. The distribution of the statistic is called sampling distribution which
depends upon the distribution of the underlying population.
The study of the sampling distribution of a statistic for large samples is known as large sample
theory. For large samples the sampling distribution of the statistic is approximately normal. If the
sample size n is less than 30 (n < 30), it is known as a small sample. For small samples the sampling
distributions are the t, F and χ2 distributions.
The z test is a statistical test for the mean of a population. It can be used when n ≥30, or when
the population is normally distributed and σ is known.
Let a large sample of size n (≥ 30) be drawn from a population with mean µ and standard
deviation σ. Let x̄ be the sample mean and s be the sample standard deviation.
We want to test (i) H0: µ = µ0 against H1: µ > µ0 (Right Tailed test)
or (ii) H0: µ = µ0 against H2: µ < µ0 (Left Tailed test)
or (iii) H0: µ = µ0 against H3: µ ≠ µ0 (Two Tailed test)
Test statistic is $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
For testing (i) H0: µ = µ0 against H1: µ > µ0, the critical region is C = {Z > Zα},
where P (Z > Zα / µ = µ0) = α.
For testing (ii) H0: µ = µ0 against H2: µ < µ0, the critical region is C = {Z < −Zα},
where P (Z < −Zα / µ = µ0) = α.
For testing (iii) H0: µ = µ0 against H3: µ ≠ µ0, the critical region is C = {Z > Zα/2 or Z < −Zα/2},
where P (Z > Zα/2 / µ = µ0) + P(Z < −Zα/2 / µ = µ0) = α, or P(|Z| > Zα/2) = α.
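The three critical regions can be collected into one small routine. The following Python sketch is illustrative only: the function name `one_sample_z_test` and its `tail` argument are assumptions introduced here, and the critical values come from `statistics.NormalDist().inv_cdf` rather than from printed tables, so they carry more decimals than the 2.33, 2.58, 1.65 and 1.96 used in hand work. The figures in the usage lines are those of Examples 1 and 2 of this section.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(x_bar, mu0, sigma, n, alpha, tail):
    """Large-sample z test for a mean with known sigma.

    tail is "right", "left" or "two".
    Returns (z, critical value, True if H0 is rejected).
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "right":
        crit = std_normal.inv_cdf(1 - alpha)       # Z_alpha
        reject = z > crit
    elif tail == "left":
        crit = std_normal.inv_cdf(1 - alpha)       # Z_alpha, region is Z < -Z_alpha
        reject = z < -crit
    elif tail == "two":
        crit = std_normal.inv_cdf(1 - alpha / 2)   # Z_alpha/2
        reject = abs(z) > crit
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z, crit, reject


# Example 1: H0: mu = 29.4 against H1: mu < 29.4 at the 1% level
z1, crit1, reject1 = one_sample_z_test(27, 29.4, 2, 30, 0.01, "left")

# Example 2: H0: mu = 24672 against H1: mu != 24672 at the 1% level
z2, crit2, reject2 = one_sample_z_test(25226, 24672, 3251, 35, 0.01, "two")

print(round(z1, 2), round(crit1, 2), reject1)
print(round(z2, 2), round(crit2, 2), reject2)
```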
Example 1: A national magazine claims that the average college student watches less television
than the general public. The national average is 29.4 hours per week, with a standard deviation of
2 hours. A sample of 30 college students has a mean of 27 hours. Is there enough evidence to
support the claim at 1% level of significance?
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ = 29.4 against
H1: µ < 29.4
Step 2: Identify the level of significance α. Here α = 0.01. The critical region is C = {Z < −2.33}.
Step 3: Here n = 30, σ = 2, x̄ = 27
Test statistic for testing population mean is $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{27 - 29.4}{2/\sqrt{30}} = -6.57$
Step 4: Find the critical value. Since α = 0.01 and the test is a left-tailed test, the critical value is
−Zα = –2.33.
Step 5: Make the decision. Since the test value, –6.57, falls in the critical region, which is
Z < −Zα, the decision is to reject the null hypothesis.
Step 6: So there is enough evidence to support the claim that college students watch less
television than the general public.
Example 2: The Medical Rehabilitation Education Foundation reports that the average cost of
rehabilitation for stroke victims is Rs. 24,672. To see if the average cost of rehabilitation is
different at a large hospital, a researcher selected a random sample of 35 stroke victims and found
that the average cost of their rehabilitation is Rs. 25,226. The standard deviation of the population
is Rs. 3,251. At α = 0.01, can it be concluded that the average cost at a large hospital is different
from Rs. 24,672?
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ = 24672 against H1:
µ ≠ 24672
Step 2: Identify the level of significance α. Here α = 0.01. The critical region is
|Z| > 2.58.
Step 3: Here n = 35, σ = 3251, x̄ = 25226
Test statistic for testing population mean is $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{25226 - 24672}{3251/\sqrt{35}} = 1.01$
Step 4: Find the critical value. Since α = 0.01 and the test is a two-tailed test, the critical value is
Zα/2 = 2.58.
Step 5: Make the decision. Since the test value, 1.01, is less than 2.58, it does not fall in the critical
region, which is |Z| > Zα/2, so the decision is not to reject the null hypothesis.
Step 6: There is not enough evidence to conclude that the average cost at a large hospital is
different from Rs. 24,672.
Example 3: It is hoped that a newly developed pain reliever will more quickly reduce pain to
patients. The standard pain reliever is known to bring relief in an average of 3.5 minutes with
standard deviation of 1.5 minutes. 50 patients were given the new pain reliever and the sample
mean was calculated as 3.1 minutes. Is there sufficient evidence in the sample to indicate that
new pain reliever relieve pain more quickly? (Test at 5% level of significance).
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ = 3.5 against H1: µ
< 3.5
Step 2: Identify the level of significance α. Here α = 0.05. The critical region is Z <
-1.65
Step 3: Here n = 50, σ = 1.5, x̄ = 3.1
Test statistic for testing population mean is $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{3.1 - 3.5}{1.5/\sqrt{50}} = -1.886$
Step 4: Find the critical value. Since α = 0.05 and the test is a left-tailed test, the critical value is
−Zα = -1.65.
Step 5: Make the decision. Since the test value, -1.886, is less than -1.65, it falls in the critical
region, which is Z < −Zα, so the decision is to reject the null hypothesis.
Step 6: So there is sufficient evidence that the new pain reliever relieves pain more quickly.
Example 4: A sample of 900 members has a mean of 3.4 cms and s.d. 2.61 cms. Does the sample
come from a large population of mean 3.25 cms and s.d. 2.61 cms?
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ = 3.25 against H1: µ ≠ 3.25
Step 2: Identify the level of significance α. Let α = 0.05. The critical region is |Z| > 1.96
Step 3: Here n = 900, σ = 2.61, x̅ = 3.4
Test statistic for testing population mean is Z = (x̅ − µ0) / (σ/√n) = (3.4 − 3.25) / (2.61/√900) = 1.72
Step 4: Find the critical value. Since α = 0.05 and the test is a two-tailed test, the critical value is
Zα/2 = 1.96.
Step 5: Make the decision. Since the test value, 1.72, is less than 1.96, it does not fall in the critical
region, which is |Z| > Zα/2, so the decision is not to reject the null hypothesis.
Step 6: So we conclude that the sample may be regarded as coming from the population with mean 3.25 cms.
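The three worked examples above can be checked with a short Python sketch. The function name z_test_mean and its parameters are illustrative choices, not part of the text. It uses the exact normal quantiles from the standard library (1.645, 1.960, 2.576) rather than the rounded table values 1.65, 1.96 and 2.58, which does not change any of the decisions here.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(x_bar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Large-sample Z test for a mean with known sigma.

    tail is "two", "left" or "right"; returns (z, reject_h0).
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal N(0, 1)
    if tail == "two":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif tail == "left":
        reject = z < std_normal.inv_cdf(alpha)
    else:  # right-tailed test
        reject = z > std_normal.inv_cdf(1 - alpha)
    return z, reject

# Example 2: hospital cost, two-tailed, alpha = 0.01
print(z_test_mean(25226, 24672, 3251, 35, alpha=0.01))
# Example 3: pain reliever, left-tailed, alpha = 0.05
print(z_test_mean(3.1, 3.5, 1.5, 50, alpha=0.05, tail="left"))
# Example 4: sample of 900 members, two-tailed, alpha = 0.05
print(z_test_mean(3.4, 3.25, 2.61, 900, alpha=0.05))
```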
2. A sample of size 400 was drawn and the sample mean is 99. Test at 5% LOS that the sample
comes from a population with mean 100 and variance 64. (Ans. Z = -2.5)
3. A company producing light bulbs finds that the mean life span of the population of bulbs
is 1200 hours with s.d. 125. A sample of 100 bulbs has mean 1150 hours. Test whether
the difference between the population mean and the sample mean is significant. (Ans. Z = -4)
4. Test the Hypothesis H0: µ = 70 against H1: µ ≠ 70 when a random sample of size 100 is
drawn giving mean 72 and a standard deviation 2. Use 5% level of significance.
(Ans. Z= 10)
We can use a hypothesis test to test a statistical claim about a population proportion when the
variable is categorical (for example, gender or support/oppose) and only one population or group
is being studied (for example, all registered voters).
The test looks at the proportion (P) of individuals in the population who have a certain
characteristic — for example, the proportion of people who carry cellphones. The null hypothesis
is H0: P = P0, where P0 is a certain claimed value of the population proportion P. For example, if
the claim is that 70% of people carry cellphones, P0 is 0.70. Let a large sample of size n (≥ 30)
be drawn from the population. Let x be the number of successes in
the sample, thus the sample proportion is p = x/n.
We want to test (i) H0: P = P0 against H1: P > P0 (Right Tailed test)
or (ii) H0: P = P0 against H2: P < P0 (Left Tailed test)
or (iii) H0: P = P0 against H3: P ≠ P0 (Two Tailed test)
Let the level of significance be α.
For large n, p ~ N(P, PQ/n), where Q = 1 − P.
Test statistic is Z = (p − P0) / √(P0Q0/n), where Q0 = 1 − P0.
For testing (i) H0: P = P0 against H1: P > P0, the critical region is C = Z > Zα Where P
(Z > Zα / P = P0 ) = α.
For testing (ii) H0: P = P0 against H1: P < P0, the critical region is C = Z < - Zα Where
P (Z < - Zα / P = P0 ) = α.
For testing (iii) H0: P = P0 against H1: P ≠ P0, the critical region is C = Z > Zα/2 or Z < - Zα/2 where
P (Z > Zα/2 / P = P0) + P(Z < - Zα/2 / P = P0 ) = α or P(| Z | > Zα/2) = α.
Example 1: One researcher believes a coin is “fair”, the other believes the coin is biased toward
heads. The coin is tossed 40 times, yielding 30 heads. Indicate whether or not the first
researcher’s position is supported by the results. Test at 5% level of significance.
Solution: Step 1. State the Hypotheses. Here we are to test H0: the coin is fair i.e. P = 0.5 against
H1: the coin is biased towards heads i.e. P > 0.5.
Step 2: Identify the level of significance α. Here α = 0.05. The critical region is Z > 1.65.
Step 3: Test statistic for testing population proportion is Z = (p − P0) / √(P0Q0/n)
Here P0 = 0.5, Q0 = 1 − P0 = 1 − 0.5 = 0.5, n = sample size = 40, p = sample proportion = 30/40 = 3/4 = 0.75
Z = (p − P0) / √(P0Q0/n) = (0.75 − 0.5) / √(0.5 × 0.5/40) = 3.1623
Step 4: Find the critical value. Since α = 0.05 and the test is a right-tailed test, the critical value is
Zα = 1.65.
Step 5: Make the decision. Since the test value, 3.1623, is greater than 1.65, it falls in the critical
region, which is Z > Zα, so the decision is to reject the null hypothesis.
Step 6: So we conclude that the coin is not fair but biased towards heads; the first researcher's
position is not supported by the results.
Example 2: A survey claims that 9 out of 10 doctors recommend aspirin for their patients with
headaches. To test this claim, a random sample of 100 doctors is obtained. Of these 100 doctors,
82 indicate that they recommend aspirin. Is this claim accurate? Use alpha = 0.05.
Solution: Step 1. State the Hypotheses. Here we are to test H0: P = 0.9 against H1: P ≠ 0.9.
Step 2: Identify the level of significance α. Here α = 0.05. The critical region is |Z| > 1.96.
Step 3: Test statistic for testing population proportion is Z = (p − P0) / √(P0Q0/n)
Here P0 = 0.9, Q0 = 1 − P0 = 1 − 0.9 = 0.1, n = sample size = 100, p = sample proportion = 82/100 = 0.82
Z = (p − P0) / √(P0Q0/n) = (0.82 − 0.9) / √(0.9 × 0.1/100) = -2.667, |Z| = 2.667
Step 4: Find the critical value. Since α = 0.05 and the test is a two-tailed test, the critical value is
Zα/2 = 1.96.
Step 5: Make the decision. Since the test value, 2.667, is greater than 1.96, it falls in the critical
region, which is |Z| > Zα/2, so the decision is to reject the null hypothesis.
Step 6: So we conclude that the claim that 9 out of 10 doctors recommend aspirin for their
patients is not accurate.
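As a quick check of the two proportion examples, the test statistic can be computed with a few lines of Python. This is only a sketch; the function name z_statistic_proportion is an illustrative choice, and the result is compared with the tabulated critical value by hand as in the steps above.

```python
from math import sqrt

def z_statistic_proportion(successes, n, p0):
    """Z = (p - P0) / sqrt(P0 * Q0 / n), with sample proportion p = x / n."""
    p = successes / n
    q0 = 1 - p0
    return (p - p0) / sqrt(p0 * q0 / n)

z_coin = z_statistic_proportion(30, 40, 0.5)      # Example 1: 30 heads in 40 tosses
z_aspirin = z_statistic_proportion(82, 100, 0.9)  # Example 2: 82 of 100 doctors
print(round(z_coin, 4), round(z_aspirin, 3))
```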
Large sample test for difference between two sample means:
Let there be two populations with means μ1 & μ2 and with standard deviations σ1 & σ2
respectively. Let two independent large samples be drawn from the two populations. Let x̅1 and
x̅2 be the means of the two samples, ∆ the hypothesized difference between the population
means (0 if testing for equal means) and n1 and n2 the sizes of the two samples.
We are to test (i) H0: μ1 - μ2 = ∆ against H1: μ1 - μ2 > ∆ (Right Tailed test)
or (ii) H0: μ1 - μ2 = ∆ against H2: μ1 - μ2 < ∆ (Left Tailed test)
or (iii) H0: μ1 - μ2 = ∆ against H3: μ1 - μ2 ≠ ∆ (Two Tailed test)
Let the level of significance be α.
For large samples, x̅1 ~ N(µ1, σ1²/n1) and x̅2 ~ N(µ2, σ2²/n2).
Test statistic is Z = ((x̅1 − x̅2) − ∆) / √(σ1²/n1 + σ2²/n2)
For testing (i) H0: μ1 - μ2 = ∆ against H1: μ1 - μ2 > ∆ the critical region is C = Z > Zα
Where P (Z > Zα / H0) = α.
For testing (ii) H0: μ1 - μ2 = ∆ against H2: μ1 - μ2 < ∆ , the critical region is C = Z < - Zα Where P
(Z < - Zα / H0) = α.
For testing (iii) H0: μ1 - μ2 = ∆ against H1: μ1 - μ2 ≠ ∆, the critical region is C = Z > Zα/2 or Z <
- Zα/2 where P (Z > Zα/2 / H0) + P(Z < - Zα/2 / H0) = α or P(| Z | > Zα/2) = α.
Example 1: The amount of a certain trace element in blood is known to vary with a standard
deviation of 14.1 ppm (parts per million) for male blood donors and 9.5 ppm for female donors.
Random samples of 75 male and 50 female donors yield concentration means of 28 and 33 ppm,
respectively. What is the likelihood that the population means of concentrations of the element
are the same for men and women? (Test at 1% level of significance)
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ1 = µ2 or H0: µ1 - µ2 = 0 against
H1: µ1 ≠ µ2 or H1: µ1 - µ2 ≠ 0.
Step 2: Identify the level of significance α. Let α = 0.01. The critical region is |Z| > 2.58.
Step 3: Here n1 = 75, n2 = 50, x̅1 = 28, x̅2 = 33, σ1 = 14.1, σ2 = 9.5
Z = (x̅1 − x̅2) / √(σ1²/n1 + σ2²/n2) = (28 − 33) / √(14.1²/75 + 9.5²/50) = -2.37
Step 4: Find the critical value. Since α = 0.01 and the test is a two-tailed test, the critical value is
Zα/2 = 2.58.
Step 5: Make the decision. Since the test value, |Z| = 2.37, is less than 2.58, it does not fall
in the critical region, which is |Z| > Zα/2, so the decision is not to reject the null hypothesis.
Step 6: At the 1% level there is not enough evidence to conclude that the population mean
concentrations differ for men and women.
Example 2: The means of two single large samples of 1000 and 2000 members are 67.5 and 68
inches respectively. Can the samples come from the same population of standard deviation
2.5 inches? (Test at 5% level of significance)
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ1 = µ2 or H0: µ1 - µ2 = 0 against
H1: µ1 ≠ µ2 or H1: µ1 - µ2 ≠ 0.
Step 2: Identify the level of significance α. Let α = 0.05. The critical region is |Z| > 1.96.
Step 3: Here n1 = 1000, n2 = 2000, x̅1 = 67.5, x̅2 = 68, σ1 = 2.5, σ2 = 2.5
Z = (x̅1 − x̅2) / √(σ1²/n1 + σ2²/n2) = (67.5 − 68) / √(2.5²/1000 + 2.5²/2000) = -5.16
Step 4: Find the critical value. Since α = 0.05 and the test is a two-tailed test, the critical value is
Zα/2 = 1.96.
Step 5: Make the decision. Since the test value, |Z| = 5.16, is more than 1.96, it falls in the
critical region, which is |Z| > Zα/2, so the decision is to reject the null hypothesis.
Step 6: The samples are not from the same population with standard deviation 2.5.
Example 3: In a survey of buying habits, 400 women buyers are selected from city A. Their
average weekly expenditure was Rs. 250 with standard deviation Rs. 40. For another city B 400
women buyers were selected whose average expenditure was Rs. 220 with standard deviation Rs.
55. Test at 1% level of significance whether the average weekly expenditure of the two
populations of shoppers are equal or not.
Solution: Step 1. State the Hypotheses. Here we are to test H0: µ1 = µ2 or H0: µ1 - µ2 = 0 against
H1: µ1 ≠ µ2 or H1: µ1 - µ2 ≠ 0.
Step 2: Identify the level of significance α. Let α = 0.01. The critical region is |Z| > 2.58.
Step 3: Here n1 = 400, n2 = 400, x̅1 = 250, x̅2 = 220, σ1 = 40, σ2 = 55
Z = (x̅1 − x̅2) / √(σ1²/n1 + σ2²/n2) = (250 − 220) / √(40²/400 + 55²/400) = 8.82
Step 4: Find the critical value. Since α = 0.01 and the test is a two-tailed test, the critical value is
Zα/2 = 2.58.
Step 5: Make the decision. Since the test value, |Z| = 8.82, is more than 2.58, it falls in the
critical region, which is |Z| > Zα/2, so the decision is to reject the null hypothesis.
Step 6: We conclude that the average weekly expenditures of the two populations of shoppers of the two
cities differ significantly.
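The test statistic for the difference of two means can be verified numerically. The sketch below uses an illustrative function name (z_statistic_two_means) and follows the formula Z = ((x̅1 − x̅2) − ∆) / √(σ1²/n1 + σ2²/n2), applied to the data of the three examples of this section.

```python
from math import sqrt

def z_statistic_two_means(x1, x2, sigma1, sigma2, n1, n2, delta=0.0):
    """Z statistic for H0: mu1 - mu2 = delta with two large independent samples."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return ((x1 - x2) - delta) / standard_error

z_blood = z_statistic_two_means(28, 33, 14.1, 9.5, 75, 50)         # Example 1
z_height = z_statistic_two_means(67.5, 68, 2.5, 2.5, 1000, 2000)   # Example 2
z_spend = z_statistic_two_means(250, 220, 40, 55, 400, 400)        # Example 3
print(round(z_blood, 2), round(z_height, 2), round(z_spend, 2))
```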
Check your progress –II
1. In a big city 350 out of 700 males are found to be smokers. Does the information
support the claim that exactly half of the males in the city are smokers? Test at 1% LOS. (Ans. Z = 0)
2. Of two samples, the first one has 50 observations with mean of 7.82 and standard
deviation 0.24, the second one has 100 observations with mean of 6.75 and standard
deviation 0.30. Test at 1% the equality of means. (Ans. Z= 23.62)
3. For better understanding consider an example where it is required to check if the mean
level of pay of one state is greater than that of another state. Two samples of employees
are taken from sizes 1200 and 1000. The mean and standard deviation of the samples (in
thousands of rupees) is given as: (Ans. Z= 24.43)
In all the previous tests we discussed till now we have supposed that the only unknown
parameter of the normal population distribution is its mean. However, the more common
situation is one where the mean µ and variance σ2 are both unknown. Let us suppose this to be
the case and again consider a test of the hypothesis that the mean is equal to some specified
value µ0. That is, consider a test of H0 : µ = µ0 versus the alternative H1 : µ > µ0 or H2 : µ < µ0 or
H3 : µ ≠ µ0. It should be noted that the null hypothesis is not a simple hypothesis since it does
not specify the value of σ². From the population we collect a sample x1, x2, …, xn.
Now when σ² is no longer known, it seems reasonable to estimate it by the sample variance
S² = ∑(xi − x̅)² / (n − 1), where the sum runs over i = 1, 2, …, n.
For testing H0 : µ = µ0, we define the test statistic t = √n (x̅ − µ0) / s, where s = √S² is the
sample standard deviation.
Under H0, t = √n (x̅ − µ0) / s is said to follow Student's t distribution with n−1 degrees of freedom (the
number of independent variates which make up the statistic is known as the degrees of
freedom).
Assumptions of the t test:
1) The sample size is less than 30, i.e. the sample is a small sample, so the test statistic
does not follow the normal distribution.
2) The parent population from which the sample is drawn is normal.
3) The sample observations are random and independent.
4) The population standard deviation is not known.
For testing (i) H0: µ = µ0 against H1: µ > µ0 , the critical region is C = t > t α, n-1
For testing (ii) H0: µ = µ0 against H2: µ < µ0 , the critical region is C = t < - tα, n-1
For testing (iii) H0: µ = µ0 against H3: µ ≠ µ0 , the critical region is C = | t | > t α/2, n-1.
Example 1: The mean weekly sales of soap bars in departmental stores was 146.3 bars per
store. After an advertising campaign the mean weekly sales in 22 stores for a typical week
increased to 153.7 with standard deviation 17.2. Was the advertising campaign successful?
Solution: We are to test H0: µ = 146.3 versus the alternative H1: µ > 146.3.
Let α = 0.05. The critical region is C = t > t α, n-1
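The arithmetic for this example is not shown above. A minimal sketch of it in Python follows, assuming that 17.2 is the sample standard deviation s used in t = √n (x̅ − µ0)/s and taking the one-tailed critical value t(0.05, 21) = 1.721 from a standard t table. The function name t_statistic is illustrative.

```python
from math import sqrt

def t_statistic(x_bar, mu0, s, n):
    """t = sqrt(n) * (x_bar - mu0) / s, with n - 1 degrees of freedom."""
    return sqrt(n) * (x_bar - mu0) / s

t_soap = t_statistic(153.7, 146.3, 17.2, 22)
t_critical = 1.721  # tabulated t for 21 d.f. at 5% (one-tailed), from a t table
print(round(t_soap, 3), t_soap > t_critical)
```

Under these assumptions the calculated t is about 2.02, which exceeds 1.721, so H0 would be rejected and the campaign judged successful.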
Example 2: A public health official claims that the mean home water use is 350 gallons a
day. To verify this claim, a study of 20 randomly selected homes was instigated with the
result that the daily water uses of these 20 homes were as follows:
340 344 362 375 356 386 354 364 332 402 340 355 362 322 372 324 318 360 338 370
Do the data contradict the official’s claim?
Solution: To determine if the data contradict the official's claim, we need to test H0: µ = 350
versus H1: µ ≠ 350
Let α = 0.05. The critical region is C = | t | > t α/2, n-1.
From the data given, we calculate ∑x = 7076 and ∑(x − x̅)² = 9069.2
⇒ x̅ = 7076/20 = 353.8, S² = ∑(x − x̅)²/(n − 1) = 9069.2/19 = 477.3263, s = 21.8478
Thus, the value of the test statistic is t = √n (x̅ − µ0)/s = √20 (353.8 − 350)/21.8478 = 0.7778
The tabulated value of t for 19 (n-1 = 20-1) d.f. at 5% l.o.s. for a two-tailed test is 2.093. Since the
calculated value of |t| is less than 2.093, we do not reject the null hypothesis.
It implies that the data do not contradict the claim of the health official.
Example 3: The manufacturer of a new fiberglass tire claims that its average life will be at least
40,000 miles. To verify this claim a sample of 12 tires is tested, with their lifetimes (in 1,000s
of miles) being as follows:
Tire 1 2 3 4 5 6 7 8 9 10 11 12
Life 36.1 40.2 33.8 38.5 42 35.8 37 41 36.8 37.2 33 36
Test the manufacturer’s claim at the 5% level of significance.
Solution: To determine whether the foregoing data are consistent with the hypothesis that the
mean life is at least 40,000 miles, we will test
H0 : µ ≥ 40 versus H1 : µ < 40
Let α = 0.05. The critical region is C = t < - t α, n-1
From the data given, we calculate ∑x = 447.4 and ∑(x − x̅)² = 82.0967
⇒ x̅ = 447.4/12 = 37.2833, S² = ∑(x − x̅)²/(n − 1) = 82.0967/11 = 7.4633, s = 2.7319
Thus, the value of the test statistic is t = √n (x̅ − µ0)/s = √12 (37.2833 − 40)/2.7319 = -3.4448
The tabulated value of t for 11 (n-1 = 12-1) d.f. at 5% l.o.s. is 1.796. Since the calculated value of t
(-3.4448) is less than -1.796, we reject the null hypothesis.
So the data do not support the manufacturer's claim that the average life is at least 40,000 miles.
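When the raw observations are available, the sample mean, the sample standard deviation (divisor n − 1) and the t statistic can be computed directly. The sketch below uses Python's standard statistics module; the function name one_sample_t is an illustrative choice, and the critical values still come from a t table.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t = sqrt(n) * (x_bar - mu0) / s, where s uses the divisor n - 1."""
    n = len(data)
    return sqrt(n) * (mean(data) - mu0) / stdev(data)

water = [340, 344, 362, 375, 356, 386, 354, 364, 332, 402,
         340, 355, 362, 322, 372, 324, 318, 360, 338, 370]
tires = [36.1, 40.2, 33.8, 38.5, 42, 35.8, 37, 41, 36.8, 37.2, 33, 36]

t_water = one_sample_t(water, 350)  # Example 2, compare |t| with 2.093
t_tires = one_sample_t(tires, 40)   # Example 3, compare t with -1.796
print(round(t_water, 4), round(t_tires, 4))
```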
The paired sample t-test, sometimes called the dependent sample t-test, is a statistical
procedure used to determine whether the mean difference between two sets of
observations is zero. Suppose we are interested in evaluating the effectiveness of a
company training program. One approach we might consider would be to measure the
performance of a sample of employees before and after completing the program, and
analyse the differences using a paired sample t-test. Let us assume two paired sets, such as
Xi and Yi for i = 1, 2, …, n, such that their paired differences are independent and
identically and normally distributed.
Let di = Xi - Yi and let μd be the mean of the differences.
We are to test H0: μd = 0 against H1: μd > 0 (right-tailed) or H2: μd < 0 (left-tailed) or
H3: μd ≠ 0 (two-tailed)
The paired sample t-test has four main assumptions:
• The dependent variable (d) must be continuous.
• The observations are independent of one another.
• The dependent variable (d) should be approximately normally distributed.
• The dependent variable (d) should not contain any outliers.
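The test statistic is t = √n d̅ / s, where d̅ = ∑d/n and s² = ∑(d − d̅)²/(n − 1). Under H0 it follows
the t distribution with n−1 degrees of freedom.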
For testing (i) H0: μd = 0 against H1: μd > 0, the critical region is C = t > t α, n-1
For testing (ii) H0: μd = 0 against H2: μd < 0, the critical region is C = t < - tα, n-1
For testing (iii) H0: μd = 0 against H3: μd ≠ 0 , the critical region is C = | t | > t α/2, n-1.
Example 1: An IQ test was administered to 5 persons before and after they were trained.
Candidate 1 2 3 4 5
Before 110 120 123 132 125
After 120 118 125 136 121
D -10 2 -2 -4 4
d̅ = ∑d/n = -10/5 = -2, s² = ∑(d − d̅)²/(n − 1) = 120/4 = 30, s = 5.477
Test statistic is t = √n d̅/s = √5 × (-2)/5.477 = -0.8165
The tabulated value of t for 4 (n-1 = 5-1) d.f. at 1% l.o.s. is 4.604. Since the calculated value of |t|
(0.8165) is less than 4.604, we do not reject the null hypothesis.
So we conclude that there is no significant evidence that the training changed the IQ scores.
Example 2: A clinic provides a program to help their clients lose weight and asks a consumer
agency to investigate the effectiveness of the program. The agency takes a sample of 15 people,
weighing each person in the sample before the program begins and 3 months later to produce the
table below. Determine whether the program is effective.
Before 210 205 193 182 259 239 164 197 222 211 187 175 186 243 246
After 197 195 191 174 236 226 157 196 201 196 181 164 181 229 231
D 13 10 2 8 23 13 7 1 21 15 6 11 5 14 15
d̅ = ∑d/n = 164/15 = 10.933, s² = ∑(d − d̅)²/(n − 1) = 560.933/14 = 40.0667, s = 6.3298
Test statistic is t = √n d̅/s = √15 × 10.933/6.3298 = 6.6897
The tabulated value of t for 14 (n-1 = 15-1) d.f. at 5% l.o.s. is 2.145. Since the calculated value of t
(6.6897) is more than 2.145, we reject the null hypothesis.
So we conclude that the weight-loss program is effective.
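Both paired examples reduce to a one-sample t statistic on the differences d = before − after. A short Python sketch (the function name paired_t is illustrative) reproduces the calculations; the decision is then made against the tabulated t value as above.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t = sqrt(n) * d_bar / s for the paired differences d = before - after."""
    d = [b - a for b, a in zip(before, after)]
    return sqrt(len(d)) * mean(d) / stdev(d)

iq_before = [110, 120, 123, 132, 125]
iq_after = [120, 118, 125, 136, 121]
wt_before = [210, 205, 193, 182, 259, 239, 164, 197, 222, 211, 187, 175, 186, 243, 246]
wt_after = [197, 195, 191, 174, 236, 226, 157, 196, 201, 196, 181, 164, 181, 229, 231]

t_iq = paired_t(iq_before, iq_after)      # Example 1, 4 d.f.
t_weight = paired_t(wt_before, wt_after)  # Example 2, 14 d.f.
print(round(t_iq, 4), round(t_weight, 4))
```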
Market researchers use the Chi-Square test when they find themselves in one of the following situations:
1. They need to estimate how closely an observed distribution matches an expected
distribution. This is referred to as a “goodness-of-fit” test.
2. They need to estimate whether two random variables are independent.
The chi-square goodness of fit test is a useful method to compare a theoretical model to
observed data. The chi-square goodness of fit test is appropriate when the following
conditions are met:
• The sampling method is simple random sampling.
• The variable under study is categorical.
• The expected value of the number of sample observations in each level of the
variable is at least 5.
The test statistic is χ² = Σ [ (Oi − Ei)² / Ei ],
where Oi is the observed frequency count for the ith level of the categorical variable, and Ei is the
expected frequency count for the ith level of the categorical variable.
Example 1: Acme Toy Company prints baseball cards. The company claims that 30% of the
cards are rookies, 60% veterans and 10% are All-Stars.
Suppose a random sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this
consistent with Acme's claim? Use a 0.05 level of significance.
Solution:
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.
We work through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.
Null hypothesis: H0: The proportion of rookies, veterans, and All-Stars is 30%,
60% and 10%, respectively.
Alternative hypothesis: H1 : At least one of the proportions in the
null hypothesis is false.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square goodness of fit test of the null
hypothesis.
Analyse sample data. Applying the chi-square goodness of fit test to sample data,
we compute the degrees of freedom, the expected frequency counts, and the chi-
square test statistic.
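df = k - 1 = 3 - 1 = 2
Ei = n × pi
E1 = 100 × 0.30 = 30
E2 = 100 × 0.60 = 60
E3 = 100 × 0.10 = 10
χ² = Σ [ (Oi − Ei)² / Ei ]
χ² = [ (50 − 30)² / 30 ] + [ (45 − 60)² / 60 ] + [ (5 − 10)² / 10 ]
= (400 / 30) + (225 / 60) + (25 / 10) = 13.33 + 3.75 + 2.50 = 19.58
where df is the degrees of freedom, k is the number of levels of the categorical
variable, n is the number of observations in the sample and pi is the proportion claimed for the ith level.
Interpret results. The tabulated value of χ² for 2 d.f. at the 0.05 level of significance is 5.991.
Since the calculated χ² = 19.58 is more than 5.991, we reject the null hypothesis: the sample is not
consistent with Acme's claim.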
Example 2: Researchers have conducted a survey of 1600 coffee drinkers asking how
much coffee they drink in order to confirm previous studies. The results of previous
studies (left) and the survey (right) are below. At α = 0.05, is there enough evidence to
conclude that the distributions are the same?
Response Frequency
2 cups per week 206
1 cup per week 193
1 cup per day 462
2+ cups per day 739
Solution: The null hypothesis H0: the population frequencies are equal to the
expected frequencies (to be calculated below).
The alternative hypothesis, H1: The null hypothesis is false.
α = 0.05,
The degrees of freedom: k−1 = 4−1 = 3
The test statistic can be calculated using a table:
Response    % of Coffee Drinkers    E    O    (E − O)²/E
So we conclude that the population frequencies are not equal to the expected
frequencies.
Example 3: A die is tossed 120 times and the following results are obtained.
No. turned up: 1 2 3 4 5 6
Frequency: 30 25 18 10 22 15
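Test the hypothesis that the die is unbiased.
Solution: We are to test H0: the die is unbiased against H1: the die is biased. Under H0 each
face is expected to turn up E = 120/6 = 20 times.
No. turned up    E     O     (E − O)²/E
1                20    30    5
2                20    25    1.25
3                20    18    0.2
4                20    10    5
5                20    22    0.2
6                20    15    1.25
Test statistic = χ² = Σ [ (Oi − Ei)² / Ei ] = 12.9
The tabulated value of χ² for 5 (k − 1 = 6 − 1) d.f. at 5% l.o.s. is 11.07.
Since calculated χ² = 12.9 is more than 11.07, we reject the null hypothesis at 5% l.o.s. and conclude
that the die is biased.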
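The goodness-of-fit statistic in Example 1 (baseball cards) and Example 3 (die) can be checked with a few lines of Python. This is a sketch only; the function name chi_square_gof is illustrative, and the critical values (5.991 for 2 d.f., 11.07 for 5 d.f.) are read from a χ² table.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example 1: 100 cards with claimed proportions 30%, 60%, 10%
expected_cards = [100 * p for p in (0.30, 0.60, 0.10)]
chi2_cards = chi_square_gof([50, 45, 5], expected_cards)

# Example 3: 120 tosses of a die, 20 expected per face under H0
chi2_die = chi_square_gof([30, 25, 18, 10, 22, 15], [20] * 6)

print(round(chi2_cards, 2), round(chi2_die, 2))
```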
Two events are said to be independent if the occurrence of one of the events has no
effect on the occurrence of the other event.
A chi-square independence test is used to test whether or not two variables are
independent.
As in 8.5.1, an experiment is conducted in which the frequencies for two variables are
determined. To use the test, the same assumptions must be satisfied: the observed
frequencies are obtained through a simple random sample, and each expected
frequency is at least 5. The frequencies are written down in a table: the columns
contain outcomes for one variable, and the rows contain outcomes for the other
variable. If there are m rows and n columns in the table, it is called m× n contingency
table.
The procedure for the hypothesis test is essentially the same. The differences are that:
(i) H0 is that the two variables are independent.
(ii) H1 is that the two variables are not independent (they are dependent).
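(iii) The expected frequency Er,c for the entry in row r, column c is calculated using:
Er,c = (Sum of row r) × (Sum of column c) / Sample size
(iv) The test statistic is χ² = Σ [ (Or,c − Er,c)² / Er,c ] with (m − 1)(n − 1) degrees of freedom,
where Or,c is the observed frequency count for the entry in row r, column c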
Example 1: Two sample polls of votes for two candidates A and B are taken. The
results are given below. Examine whether the nature of the area is related to voting preference
or not.
Area     Vote for A    Vote for B    Total
Rural    620           380           1000
Urban    550           450           1000
Total    1170          830           2000
Solution: We are to test H0: Nature of the area is independent of the voting
preference against H1: The two variables are not independent (they are dependent).
α = 0.05,
The degrees of freedom: (number of rows - 1)×(number of columns - 1) = (2-1)(2-1) = 1
Let Er,c = Expected Frequency =( Sum of row r)×( Sum of column c) / Sample size
and Or,c is the observed frequency count for the entry in row r, column c.
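Here the expected frequencies are 1000 × 1170/2000 = 585 (vote for A) and 1000 × 830/2000 = 415
(vote for B) in each area, so χ² = 35²/585 + 35²/415 + 35²/585 + 35²/415 = 10.09.
Since calculated χ² = 10.09 is more than the tabulated value 3.841 for 1 d.f., we reject the null hypothesis at 5% l.o.s.
So we conclude that the nature of the area is not independent of the voting preference.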
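The expected frequencies and the χ² statistic for any m × n contingency table follow mechanically from the row and column totals. A minimal Python sketch is given below, applied to the voting table of this example; the function name chi_square_independence is an illustrative choice.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # E(r, c) = (sum of row r) * (sum of column c) / sample size
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Rural, Urban; columns: vote for A, vote for B
chi2_vote, df_vote = chi_square_independence([[620, 380], [550, 450]])
print(round(chi2_vote, 2), df_vote)
```

The statistic is then compared with the tabulated value for the returned degrees of freedom (3.841 for 1 d.f. at 5% l.o.s.).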
Solution: We are to test H0: Sampling techniques adopted by the two researchers are independent
against H1: Sampling techniques adopted by the two researchers are not independent (they are
dependent).
α = 0.05,
The degrees of freedom: (number of rows - 1)×(number of columns - 1) = (2-1)(4-1) = 3
Let Er,c = Expected Frequency =( Sum of row r)×( Sum of column c) / Sample size
and Or,c is the observed frequency count for the entry in row r, column c.
Since calculated χ² = 2.0971 is less than 7.815, we do not reject the null hypothesis at 5% l.o.s.
So we conclude that the sampling techniques adopted by the two researchers are
independent.
a    b
c    d
In a 2×2 contingency table with cell frequencies a, b, c, d as above and N = a + b + c + d, the
formula of χ² simplifies to
χ² = N (ad − bc)² / [ (a + b)(a + c)(b + d)(c + d) ]
Example: Out of 800 persons, 25% were literates and 300 have travelled beyond the
limits of their district, 40% of the literates were among those who had not travelled.
Test at 5% l.o.s. whether there is any relation between travelling and literacy.
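Solution: Number of literates = 25% of 800 = 200, of whom 40% (80) had not travelled and 120 had
travelled. Since 300 persons travelled in all, 300 − 120 = 180 illiterates travelled and 600 − 180 = 420
did not.
              Travelled    Not travelled    Total
Literate      120          80               200
Illiterate    180          420              600
Total         300          500              800
We are to test H0: there is no relation between travelling and literacy against H1: there is relation
between travelling and literacy (they are dependent).
α = 0.05,
The degrees of freedom: (number of rows - 1) × (number of columns - 1) = (2-1)(2-1) = 1
χ² = 800 × (120 × 420 − 180 × 80)² / (300 × 200 × 600 × 500) = 57.6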
Since calculated χ2 = 57.6 is more than 3.841, we reject null hypothesis at 5% l.o.s.
So we conclude that there is relation between travelling and literacy (they are
dependent).
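The shortcut formula for a 2×2 table is easy to check numerically. The sketch below (the function name chi_square_2x2 is illustrative) applies it to the literacy and travelling data, with a = 120, b = 80, c = 180 and d = 420 as worked out from the statement of the example.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2x2 table [[a, b], [c, d]] using the shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Rows: literate, illiterate; columns: travelled, not travelled
chi2_travel = chi_square_2x2(120, 80, 180, 420)
print(chi2_travel)
```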
Let us sum up
In this unit we have discussed
• Sampling Distribution
• Central Limit Theorem
• Large sample test for sample mean
• Large sample test for population proportion
• Large sample test for difference between two sample means
• Student's t test
• Paired t test
• Chi-square goodness of fit test
• Chi-square test of independence
• Sums on all formulas
Exercise
1. Seven flower stems are selected and their heights (in cm) are found to be
63, 63, 68, 69, 71, 71, 72. Test the hypothesis that the mean height is 66 at 1% LOS.
(Ans. t = 1.507)
2. A company producing light bulbs finds that the mean life span of the population of bulbs is
1200 hours. A sample of 10 bulbs has mean 1150 hours with s.d. 12.5 hours. Test whether the
difference between the population mean and the sample mean is significant. (Ans. t = -12.649)
3. The table below shows the number of students in each of two classes A and B who passed and
failed in an exam. Test the hypothesis that there is no difference between the two classes
at 5% LOS. (Ans. χ² = 0.96269)
           Passed    Failed
Class A    72        17
Class B    64        23
4. The table below shows the relation between the performances of the students in Maths
and Physics. Test the hypothesis that the performances in the two subjects are independent.
(Ans. χ² = 145.78)
Physics \ Maths    High Grade    Medium Grade    Low Grade
High Grade         56            71              12
Medium Grade       47            163             38
Low Grade          14            42              85
5. The number of books borrowed from a public library during a particular week is given
below. Test the hypothesis that the number of books borrowed does not depend on the day
of the week at 5% LOS. (Ans. χ² = 2.143)
Day                      Mon    Tue    Wed    Thurs    Fri    Sat
No. of books borrowed    14     18     12     11       15     14
Test the hypothesis at 0.05 level of significance that the presence or absence of
hypertension is independent of smoking habits. . (Ans. 𝜒2 = 14.464)
12. Eleven school boys were given a test in mathematics. They were given a month's tuition
and a second test was held at the end of it. Do the marks give evidence that the students have
benefited by the coaching? Use LOS 1%.
Marks in test 1: 23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19
Marks in test 2: 24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 17
(Ans t= -1.482)
Unit 5: INTRODUCTION TO PROBABILITY
Unit: 5
Chapter 9
Unit Structure
9.0. Objectives
9.1. Introduction
9.1.1. Factorial
9.5. Exercise
9.6. References
9.0. OBJECTIVES
After studying this unit you will be able to:
9.1. INTRODUCTION
9.1.1: Factorial
The product of the first n natural numbers is called factorial n and is denoted by n!, that is,
n! = n × (n − 1) × … × 2 × 1.
Note : 0! = 1
1! = 1
5! = 5× 4 × 3 × 2 × 1 = 120
If there are three things a, b and c, then the number of permutations of the three things taken
two at a time is denoted by P(3, 2) or 3P2.
It is given by
3P2 = 3!/(3 − 2)! = 3!/1! = 3 × 2 × 1 = 6
In general, nPr = n!/(n − r)!, where r ≤ n.
The notation for combination is C(n, r) or nCr, which is the number of combinations or
selections of r things out of n things.
If there are three things a, b and c then the number of combinations of these three things taken
two at a time is denoted by 3C2 and is given by
3C2 = 3!/(2! × (3 − 2)!) = 3!/(2! × 1!) = 6/2 = 3
In general, nCr = n!/(r! (n − r)!)
Note: Permutation and Combination are related to each other by the formula P(n, r) = r! ⋅ C(n, r).
Example 2. P(8, 5) = 8P5 = 8!/(8 − 5)! = 8!/3! = (8 × 7 × 6 × 5 × 4 × 3!)/3! = 8 × 7 × 6 × 5 × 4 = 6720
Example 3. 6 cards are to be sent to 4 persons. In how many ways can this be done?
Solution :
6P4 = 6!/(6 − 4)! = (6 × 5 × 4 × 3 × 2!)/2! = 6 × 5 × 4 × 3 = 360
5C3 = 5!/(3! × 2!) = 10 ways
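Example 6. In how many ways can 4 cards be chosen from a pack of 52 cards?
Solution: 4 cards can be chosen from 52 cards in
52C4 = 52!/(4! × 48!) = (52 × 51 × 50 × 49)/(4 × 3 × 2 × 1) = 270725 ways
Example 7. From a group of 7 boys and 6 girls, 3 boys and 4 girls are to be selected. In how many
ways can this be done?
Solution: 3 boys can be selected from 7 boys in 7C3 ways
7C3 = 7!/(3! × 4!) = (7 × 6 × 5 × 4!)/(3 × 2 × 4!) = 35
4 girls can be selected from 6 girls in 6C4 ways
6C4 = 6!/(4! × 2!) = (6 × 5 × 4!)/(4! × 2) = 15
Hence the selection can be made in 35 × 15 = 525 ways.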
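The counts in these examples can be verified with Python's standard math module, which provides perm(n, r) for nPr and comb(n, r) for nCr (available from Python 3.8). The variable names below are illustrative.

```python
from math import comb, factorial, perm  # perm and comb need Python 3.8+

print(perm(3, 2))   # 3P2: permutations of a, b, c taken two at a time
print(comb(3, 2))   # 3C2: selections of two things out of a, b, c
print(perm(8, 5))   # Example 2
print(perm(6, 4))   # Example 3
print(comb(52, 4))  # Example 6

# Example 7: choose 3 boys out of 7 and, independently, 4 girls out of 6
boys_and_girls = comb(7, 3) * comb(6, 4)

# The relation P(n, r) = r! * C(n, r), checked for n = 8, r = 5
relation_holds = perm(8, 5) == factorial(5) * comb(8, 5)
print(boys_and_girls, relation_holds)
```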
Set theory is a branch of mathematical logic that studies sets, which informally are
collections of objects or things of similar type. Although any type of object can be
collected into a set, set theory is applied most often to objects that are relevant to
mathematics. Sets are usually denoted by A, B, C. The followings are some examples of
sets.
The objects in the set are called elements or members of the set.
xϵ A ⇒x is an element of the set A
Equality of Sets
Two sets are equal if and only if they have the same elements.
Subsets
A is a subset of B if and only if every element of A is an
element of B. We write it as A ⊂ B; we can also say “B includes
A”.
Union
The union of the set A and the set B is the set that contains
all the elements that belong to A or to B, written A ∪ B.
Intersection
The intersection of the set A and the set B is the set that
contains all the elements that belong to both A and B, written
as A ∩ B.
Complementary set
The elements of the universal set S which do not belong to the subset A form a set which
is called the complement of A and is denoted by Ac or Aʹ or Ā.
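These set operations map directly onto Python's built-in set type. The small sketch below uses illustrative sets (the universal set S and the subsets A and B are chosen here for the example and are not taken from the text above).

```python
S = {1, 2, 3, 4, 5, 6}  # universal set (illustrative)
A = {1, 3, 5}
B = {4, 5, 6}

union = A | B           # elements belonging to A or to B
intersection = A & B    # elements belonging to both A and B
complement_A = S - A    # elements of S that do not belong to A
A_is_subset = A <= S    # True when every element of A is in S
same_set = {1, 3, 5} == {5, 3, 1}  # equal sets have the same elements

print(union, intersection, complement_A, A_is_subset, same_set)
```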
Introduction to Probability
Probability means possibility or chance. We are certain about “rising of the sun every
day”, about “there are 7 days in a week” etc. However there are many things where we
are not sure about the occurrence or the outcome of the incident, in those cases we use
the words probably or likely or possibly.
For example, “Probably it will rain tonight”, “it is quite likely that there will be a good
yield of crop this year” and so on. But the terms probably and quite likely are all relative
terms of uncertainty. Probability is a numerical measure of uncertainty, a number that
conveys the strength of our belief in the occurrence of an uncertain event.
To find a measure for probability it is necessary to have the concept of a few terms,
which we discuss below.
Example :
1. Tossing a coin
2. Throwing a dice
The set of all possible outcomes of a random experiment is called sample space. The
elements of the sample space are called sample points. Sample space is denoted by S.
Example:
1. In an experiment of throwing a coin, S = {H, T}
2. In an experiment of throwing a dice, S = {1, 2, 3, 4, 5, 6}
Event
In an experiment of throwing dice where S = {1, 2, 3, 4, 5, 6}, the event of getting odd
numbers is A = {1, 3, 5}
Clearly A⊂ S
The number of sample points in A is denoted by n (A). For the above experiment, n (A) = 3
Types of Events
1. Certain Event
If sample points in an event are same as sample points in sample space of that
random experiment, then the event is called a certain event.
2. Impossible Events
3. Mutually Exclusive Events
Events are said to be mutually exclusive if the happening of any of them restricts the
happening of the others i.e., if no two or more of them can happen together or
simultaneously in the same trial.
Example : In tossing a coin the events head and tail are mutually exclusive.
Note: If A & B are mutually exclusive events of sample space S, then A∩B = φ.
4. Equally Likely Events
Events are said to be equally likely if they have an equal chance of occurring. In other words,
outcomes of a trial are said to be equally likely if, taking into consideration all relevant
evidence, there is no reason to prefer one with respect to another.
Example: In throwing a dice all the six faces are equally likely to occur.
5. Exhaustive Events
If the sample points of the events taken together constitute the sample space of the
random experiment, the events are called exhaustive events.
Note: If A & B are exhaustive events of sample space S, then AUB =S.
S = {1,2,3,4,5,6}
Here A U B = {1, 2, 3, 4, 5, 6} = S
Example :
S ={1, 2, 3, 4, 5, 6}
A = {1, 2}
B = {3, 4, 5, 6}
iv) A∩ B
v) Ac
Note : 0 ≤ m ≤ n ⇒ 0/n ≤ m/n ≤ n/n ⇒ 0 ≤ P(A) ≤ 1
Let S be a sample space and let B be the set of events. Let P be a real-valued function
defined on B. Then P is a probability set function if P satisfies the following three conditions:
Then, P(∪_{n=1}^{∞} A_n) = ∑_{n=1}^{∞} P(A_n)
Example : 1
Solution:
In a random throw of two dice, the total number of cases is given below :
S = {(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1),
(1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 2),
(1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3),
(1, 4), (2, 4), (3, 4), (4, 4), (5, 4), (6, 4),
(1, 5), (2, 5), (3, 5), (4, 5), (5, 5), (6, 5),
(1, 6), (2, 6), (3, 6), (4, 6), (5, 6), (6, 6)}
Here, n (S) = 36
i) A : Both the dice show the same number
A = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}
n (A) = 6
P (A) = n (A) / n (S) = 6/36 = 1/6
ii) B : The first die shows 6
B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
n (B) = 6
P (B) = n (B) / n (S) = 6/36 = 1/6
iii) C : The total of the numbers on the two dice is 8
C = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}
n (C) = 5
P (C) = n (C) / n (S) = 5/36
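The counting in this example can be checked by listing the sample space programmatically. The following is a minimal Python sketch, not part of the original notes; the helper name `probability` and the event descriptions in the comments are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: all ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Classical probability: favourable cases divided by total cases."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

p_same = probability(lambda o: o[0] == o[1])        # both dice show the same number
p_first_six = probability(lambda o: o[0] == 6)      # first die shows 6
p_total_eight = probability(lambda o: sum(o) == 8)  # total of the two dice is 8

print(p_same, p_first_six, p_total_eight)  # 1/6 1/6 5/36
```

Using `Fraction` keeps the answers exact, so they can be compared directly with the hand-computed values n (A) / n (S).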
Example : 2
Two unbiased coins are tossed simultaneously. Find the probability of getting –
Solution :
n (S) = 4
n (A) = 3
P (A) = n (A) / n (S) = 3/4
= {(H, H)}
n (B) = 1
P (B) = n (B) / n (S) = 1/4
Example : 3
A box contains 20 tickets numbered from 1 to 20. A ticket is drawn randomly from
the box. Find the probability that the number on the ticket is
i) Divisible by 5
iv) Divisible by 3 or 4.
Solution :
n(S) = 20
i) A : Divisible by 5
A = {5, 10, 15, 20}
n (A) = 4
P (A) = n (A) / n (S) = 4/20 = 1/5
n (B) = 10
P (B) = n (B) / n (S) = 10/20 = 1/2
iii) C : Divisible by 3 and 4.
C = {12}
n (C) = 1
P (C) = n (C) / n (S) = 1/20
iv) D : Divisible by 3 or 4.
D = {3, 4, 6, 8, 9, 12, 15, 16, 18, 20}
n (D) = 10
P (D) = n (D) / n (S) = 10/20 = 1/2
Example: 4
A bag contains 10 white and 11 black balls. If two balls are drawn simultaneously from
the bag, find the probability of getting (i) both white balls, (ii) one white and one black
ball, (iii) no white ball.
Solution:
n (S) = Total number of cases = 21C2 = 21!/(2! × 19!) = (21 × 20)/2 = 210
(i) A : Both balls are white
n (A) = Favourable number of cases = 10C2 = 10!/(2! × 8!) = (10 × 9)/2 = 45
P (A) = n (A) / n (S) = 45/210 = 0.2143
(ii) B : One white and one black ball
n (B) = Favourable number of cases = 10C1 × 11C1 = 10 × 11 = 110
P (B) = n (B) / n (S) = 110/210 = 0.5238
iii) C : No white ball (which means all the balls are black)
n (C) = All are black balls = 11C2 = 11!/(2! × 9!) = (11 × 10 × 9!)/(2 × 9!) = 55
P (C) = n (C) / n (S) = 55/210 = 0.2619
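The combination counts in this example can be reproduced with Python's standard `math.comb`. This is only a sketch for checking the arithmetic; the variable names are illustrative and not from the original notes.

```python
from math import comb

white, black = 10, 11
total = comb(white + black, 2)  # 21C2 = 210 ways to draw two balls

both_white = comb(white, 2) / total                 # 10C2 / 21C2
one_each = comb(white, 1) * comb(black, 1) / total  # 10C1 x 11C1 / 21C2
no_white = comb(black, 2) / total                   # 11C2 / 21C2

print(round(both_white, 4), round(one_each, 4), round(no_white, 4))
# 0.2143 0.5238 0.2619
```

The three events cover every possible draw and cannot overlap, so the three probabilities add up to 1, which is a quick sanity check on the counts.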
ii) A multiple of 3 or 4.
3. A ticket is drawn from a box containing 30 tickets and the number on it is observed. Obtain
the probability that the ticket drawn has a number (a) less than 7, (b) lying between 12 and 20,
both inclusive, (c) that is prime, (d) that is a multiple of 4.
4. Two fair dice are rolled. Find the probability that the numbers on the uppermost face of
the first die is (i) greater than 7 (ii)less than 8 (iii) equal to the number on the second die.
4. A committee of 6 students is to be formed from a group of 7 boys and 5 girls. Find the
probability that it consists of (i) all boys, (ii) only 1 boy, (iii) at least 4 girls.
5. A bag contains 12 white and 18 black balls. The balls are drawn at random. Find the
probability if
6. A bag contains 3 black, 4 white and 5 red balls. One ball is drawn at random. Find the
probability that
i) It is black ball
Example 5: A card is selected at random from a pack of cards. What is the probability that it
is a (i) Picture card (ii) Ace card, (iii) Spade card, (iv) Black Queen card?
Solution:
n (S) = Total number of cases = 52C1 = 52!/(1! × 51!) = 52
(i) A = Picture card
n (A) = Favourable number of cases = 12C1 = 12!/(1! × 11!) = 12
P (A) = n (A) / n (S) = 12/52 = 0.2308
(ii) B = Ace card
P (B) = n (B) / n (S) = 4/52 = 0.0769
(iii) C = Spade card
n (C) = Favourable number of cases = 13C1 = 13!/(1! × 12!) = 13
P (C) = n (C) / n (S) = 13/52 = 0.25
(iv) D = Black Queen card
P (D) = n (D) / n (S) = 2/52 = 0.0385
Example 6:Two cards are drawn at random from a pack of well-shuffled cards. Find the
probability that
n (S) = Total number of cases = 52C2 = 52!/(2! × 50!) = 1326
P (A) = n (A) / n (S) = 16/1326 = 0.0121
P (B) = n (B) / n (S) = 6/1326 = 0.0045
n (C) = Favourable number of cases = 26C1 × 26C1 = 26 × 26 = 676
P (C) = n (C) / n (S) = 676/1326 = 0.5098
n (D) = Favourable number of cases = 13C1 × 13C1 = 13 × 13 = 169
P (D) = n (D) / n (S) = 169/1326 = 0.1275
(v) E = Both are heart cards
n (E) = Favourable number of cases = 13C2 = 13!/(2! × 11!) = (13 × 12 × 11!)/(2 × 11!) = 78
P (E) = n (E) / n (S) = 78/1326 = 0.0588
(vi) F = One of them is an ace card = One is an ace and one is a non-ace card.
P (F) = n (F) / n (S) = 192/1326 = 0.1448
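Counts such as 78 and 192 can be verified by enumerating every two-card hand. The sketch below builds a 52-card deck and counts hands directly. The rank and suit labels and the helper `probability` are illustrative assumptions, and "one red and one black card" is shown only as one event whose count is 26 × 26 = 676.

```python
from fractions import Fraction
from itertools import combinations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Every unordered pair of distinct cards: 52C2 = 1326 hands.
hands = list(combinations(deck, 2))

def probability(event):
    """Favourable hands divided by total hands, as an exact fraction."""
    return Fraction(sum(1 for hand in hands if event(hand)), len(hands))

red_suits = {"hearts", "diamonds"}

both_hearts = probability(lambda h: all(suit == "hearts" for _, suit in h))
one_ace = probability(lambda h: sum(rank == "A" for rank, _ in h) == 1)
one_red_one_black = probability(lambda h: sum(suit in red_suits for _, suit in h) == 1)

print(float(both_hearts), float(one_ace), float(one_red_one_black))
```

Brute-force enumeration is slower than the nCr formulas but makes no use of them, so agreement between the two is a useful independent check.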
Example 7: A committee of 3 is to be formed from a group of 5 boys and 6 girls. Find the
probability that the committee consists of at least one girl.
Solution: Let S be the sample space. There are in total 11 boys and girls.
n (S) = Total number of cases = 11C3 = 11!/(3! × 8!) = 165
Let A be the event that the committee consists of at least one girl.
Number of committees with no girl (all boys) = 5C3 = 10
n (A) = 165 – 10 = 155
P(A) = n(A) / n(S) = 155/165 = 0.9394
Example 8: Six magazines are placed at random in a shelf. Find probability that a
particular pair of magazines shall be: (i) Always together, (ii) Never together.
Solution:
(i) If the pair of magazines is always together, we consider it a single magazine. Thus
we now have 6 – 1 = 5 magazines, which can be arranged in 5! = 5 × 4 × 3 × 2 × 1 = 120
ways. The two magazines which are considered as a single magazine can be arranged among
themselves in 2! = 2 ways.
Total number of favourable arrangements = 120 × 2 = 240
Total number of arrangements of the six magazines = 6! = 720
P (the two magazines will always be together) = 240/720 = 0.3333
(ii) Total number of arrangements where the pair of magazines will never be together =
720 – 240 = 480
P (the two magazines will never be together) = 480/720 = 0.6667
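The "treat the pair as one item" argument can be checked by listing all 6! arrangements. In this sketch the six magazines are labelled A to F and the particular pair is taken to be A and B; these labels are assumptions made only for illustration.

```python
from itertools import permutations

magazines = "ABCDEF"  # A and B stand for the particular pair
arrangements = list(permutations(magazines))

# The pair is together when A and B occupy adjacent positions on the shelf.
together = sum(1 for a in arrangements if abs(a.index("A") - a.index("B")) == 1)
never_together = len(arrangements) - together

print(together, never_together, len(arrangements))  # 240 480 720
```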
Example 9: If the letters of the word RANDOM are arranged at random, what is the chance
that the two letters A and O will be at the extremes?
Solution: There are 6 letters in the word RANDOM, which can be arranged taking all of them
at a time in 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720 ways.
If the two letters A and O are at the extremes, the remaining 4 letters can be arranged in
4! = 24 ways, and A and O can be placed at the two extremes in 2! = 2 ways.
So, total number of favourable cases where the two letters A and O will be at the extremes
= 24 × 2 = 48 ways.
P (the two letters A and O will be at the extremes) = 48/720 = 0.0667
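A direct enumeration confirms the count of 48 favourable arrangements and the ratio 48/720. This is a small check written for this example and is not part of the original solution.

```python
from itertools import permutations

arrangements = list(permutations("RANDOM"))

# Favourable: the first and last letters are A and O, in either order.
favourable = sum(1 for a in arrangements if {a[0], a[-1]} == {"A", "O"})

print(favourable, len(arrangements), round(favourable / len(arrangements), 4))
# 48 720 0.0667
```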
Example 10: Using the letters in the word “SQUARE”, in 6 – letter arrangement, what is the
chance that (i) First letter is vowel, (ii) Vowels and consonant are alternate beginning with a
consonant?
Solution: There are 6 letters in the word SQUARE, which can be arranged taking all of them
at a time in 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720 ways.
(i) There are three vowels (U, A, E) in the word SQUARE. If the first letter is a vowel, the
remaining 5 letters can be arranged in 5! = 120 ways.
The vowel in the first place can be selected from the three vowels in 3C1 = 3 ways.
Total number of favourable cases where the first letter is a vowel = 120 × 3 = 360 ways.
P (the first letter is a vowel) = 360/720 = 0.5
(ii) There are three vowels and three consonants in the word SQUARE.
As vowels and consonants are arranged alternately and the arrangement starts with a consonant,
the pattern is
C V C V C V
The three consonants can be arranged in their places in 3! = 6 ways and the three vowels in
their places in 3! = 6 ways.
Total number of favourable cases where vowels and consonants alternate beginning with
a consonant = 6 × 6 = 36 ways.
P (vowels and consonants alternate beginning with a consonant) = 36/720 = 0.05
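Both parts of this example can be confirmed by brute force. The sketch below assumes the vowels of SQUARE are U, A and E and counts arrangements that satisfy each condition; the variable names are illustrative.

```python
from itertools import permutations

VOWELS = set("UAE")
arrangements = list(permutations("SQUARE"))

# (i) The first letter is a vowel.
first_vowel = sum(1 for a in arrangements if a[0] in VOWELS)

# (ii) Pattern C V C V C V: consonants at positions 0, 2, 4 and vowels at 1, 3, 5.
alternating = sum(
    1
    for a in arrangements
    if all((letter in VOWELS) == (i % 2 == 1) for i, letter in enumerate(a))
)

print(first_vowel / len(arrangements), alternating / len(arrangements))  # 0.5 0.05
```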
Factorial
Permutation and Combination
Some points on set theory
Random Experiments
Sample space
Events
Introduction to Mathematical probability
Introduction to Axiomatic Probability
Sums on Probability
9.5: Exercise:
7. Four cards are drawn at random from a pack of 52 cards. Find the probability that –
Ans. (i) 256/52C4, (ii) 4C2 ×4C2/ 52C4, (iii) 13C2 ×13C2/ 52C4
8. A room has three lamps. From a collection of 10 bulbs of which 6 are defective, 3 are selected at
random and put in the sockets. What is the probability that –
9. If two letters are taken at random from the word HOME, what is the probability that none of
10. If the letters of the word “CHEMISTRY” are arranged at random, what is the probability that the
arrangement (i) begins with M, (ii) begins with M and ends with I?
9.6: REFERENCES:
2. Introduction to probability and statistics-4th Edition J. Susan Milton, Jesse C. Arnold Tata McGraw
Hill
3. Statistics for Business and Economics: Dr.Seema Sharma, Wiley
Unit 6: Conditional Probability
Chapter 10
Unit Structure
10.0. Objectives
10.1. Introduction
10.6: Exercise
10.7 References
10.0: OBJECTIVES
10.1: INTRODUCTION
Probability theory is useful in understanding, studying, and analysing complex real-world
systems, and it can be used to model and develop such systems. In the previous unit we studied
the definition and concept of classical and axiomatic probability. In this unit we are going to
study the Addition and Multiplication laws of probability, Conditional probability and
Bayes’ Theorem.
Proof: A∩ 𝐵
and in A ∩ B is m3.
A UB = A+B–A∩B
n (A U B) = n (A) + n (B) – n (A ∩ B)
⇒n ( A U B) = m1 +m2 – m3
Corollary: 1
A ∩ B = φ ⇒P(A∩B)=0
P(A U B) = P(A)+P(B)
Corollary: 2
P(A U B U C) = P(A) + P(B) + P(C) – P(A∩B) – P(B∩C) – P(A∩C) + P(A∩B∩C)
Corollary: 3
Corollary: 4
Corollary: 5
P(B∩Ac)=P(B)−P(B∩A)
Corollary: 6
If A⊂B ⇒P(A)≤P(B)
Corollary: 7
The conditional probability of an event A is the probability that the event will occur given
the knowledge that an event B has already occurred. We say probability of the event A given
the event B has already occurred and denote it by P (A / B).
If the events A and B are such that the occurrence of A doesn’t depend upon occurrence
of event B, (A and B are independent event), the conditional probability of event A given
event B is simply the probability of event A, that is P (A).
Similarly, probability of event B given that event A has already occurred is denoted by P (B /
A).
If A and B are two events of a sample space S associated with an experiment, then the
probability of simultaneous occurrence of events A and B is given by
P (A ∩ B) = P (A) P (B / A) = P (B) P (A / B)
10.2.4 Independent Events
Two events A and B are independent of each other if the occurrence or non-occurrence
of one does not affect the occurrence of the other.
P(B/A) = P (B)
Example : 1
Find the probability that a card drawn from a pack of cards will be a red or a picture card.
Solution :
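Let A be the event that the card is red and B the event that it is a picture card.
P(A) = 26/52 = 1/2, P(B) = 12/52 = 3/13, P(A ∩ B) = 6/52
P(A U B) = P(A) + P(B) – P(A ∩ B) = 1/2 + 3/13 – 6/52 = 32/52 = 8/13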
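The addition theorem can be checked on this example by building the deck and comparing both sides of P(A U B) = P(A) + P(B) – P(A ∩ B). The sketch below is illustrative; the deck encoding and the helper `p` are assumptions, not part of the original notes.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

red = {card for card in deck if card[1] in ("hearts", "diamonds")}  # event A
picture = {card for card in deck if card[0] in ("J", "Q", "K")}     # event B

def p(event):
    """Probability of an event given as a set of equally likely cards."""
    return Fraction(len(event), len(deck))

lhs = p(red | picture)                        # P(A U B) counted directly
rhs = p(red) + p(picture) - p(red & picture)  # addition theorem

print(lhs, rhs)  # 8/13 8/13
```

Subtracting P(A ∩ B) removes the six red picture cards that would otherwise be counted twice.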
Example: 2
An investment consultant predicts that the odds against the price of a certain stock going
up during the next week are 2 : 1 and the odds in favour of the price remaining the same
are 1 : 3. What is the probability that the price of the stock will go down during the next
week?
Solution: Let A denote the event “stock price will go up” and B the event “stock price will
remain the same”.
P(A) = 1/3, P(Ac) = 2/3, P(B) = 1/4, P(Bc) = 3/4
Since A and B are mutually exclusive,
P(A U B) = P(A) + P(B) = 1/3 + 1/4 = 7/12
P (Stock price will go down) = P(Ac ∩ Bc) = 1 - P (A U B) = 1 - 7/12 = 5/12
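Converting odds to probabilities is the step most easily confused in problems like this one. Odds in favour of a : b correspond to a probability of a/(a + b), and odds against of a : b correspond to b/(a + b). The sketch below applies this to the stock example; the two helper function names are hypothetical and chosen only for readability.

```python
from fractions import Fraction

def prob_from_odds_in_favour(a, b):
    """Odds in favour a : b  ->  probability a / (a + b)."""
    return Fraction(a, a + b)

def prob_from_odds_against(a, b):
    """Odds against a : b  ->  probability b / (a + b)."""
    return Fraction(b, a + b)

p_up = prob_from_odds_against(2, 1)      # odds against going up are 2 : 1
p_same = prob_from_odds_in_favour(1, 3)  # odds in favour of staying the same are 1 : 3

# "Up" and "same" are mutually exclusive, so their probabilities simply add.
p_down = 1 - (p_up + p_same)

print(p_up, p_same, p_down)  # 1/3 1/4 5/12
```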
Example: 3
A and B are two events such that, P (A) = 0.2 and P (B) = 0.4. A and B are independent
events. Find the probability that (i) both A and B will occur (ii) only A occurs, (iii) only B will
occur, (iv) atleast one will occur, (v) none will occur.
Solution:
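(i) P (both A and B will occur) = P(A ∩ B) = P(A) P(B) [Since A & B are Independent]
= 0.2 × 0.4 = 0.08
(ii) P (only A occurs) = P(A ∩ Bc) = P(A) P(Bc) [Since A & Bc are Independent]
= 0.2 × 0.6 = 0.12
(iii) P (only B occurs) = P(Ac ∩ B) = P(Ac) P(B) [Since Ac & B are Independent]
= 0.8 × 0.4 = 0.32
(iv) P (at least one will occur) = P(A U B) = P(A) + P(B) – P(A ∩ B) = 0.2 + 0.4 – 0.08 = 0.52
(v) P (none will occur) = P(Ac ∩ Bc) = 1 – P(A U B) = 1 – 0.52 = 0.48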
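For two independent events, all five quantities asked for in this example follow from P(A) and P(B) alone. The sketch below computes them with exact fractions for P(A) = 0.2 and P(B) = 0.4; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(2, 10)  # P(A) = 0.2
p_b = Fraction(4, 10)  # P(B) = 0.4

both = p_a * p_b                    # P(A and B), using independence
only_a = p_a * (1 - p_b)            # A occurs, B does not
only_b = (1 - p_a) * p_b            # B occurs, A does not
at_least_one = p_a + p_b - both     # addition theorem
none = 1 - at_least_one             # complement of "at least one"

print([float(x) for x in (both, only_a, only_b, at_least_one, none)])
```

The four cases "both", "only A", "only B" and "none" are mutually exclusive and exhaustive, so they sum to 1.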
Example: 4
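A commerce graduate can get offers from three companies A, B and C. The chance of
getting an offer from company A is 20%, from B 16%, from C 14%, from both A and B 8%, from
both A and C 5%, from both B and C 4% and from all three 2%. Find the percentage chance that
he gets at least one offer.
Solution:
P(A U B U C) = P(A) + P(B) + P(C) – P(A∩B) – P(A∩C) – P(B∩C) + P(A∩B∩C)
= 0.20 + 0.16 + 0.14 – 0.08 – 0.05 – 0.04 + 0.02 = 0.35
So there is a 35% chance that he gets at least one offer.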
Example: 5
The odds in favour of A hitting a target are 3 : 4 and odds against B hitting a target are
1 : 2. If both of them shoot the target independently, what is the probability of (i)
both hit the target, (ii) only A hits the target (iii) at least one of them hits the target.
(iv) none hits the target.
Solution:
P(A) = 3/7, P(Ac) = 4/7, P(B) = 2/3, P(Bc) = 1/3
(i) P (both A and B hit the target) = P(A ∩ B) = P(A) P(B) [Since A & B are Independent]
= 3/7 × 2/3 = 6/21 = 2/7
(ii) P (only A hits the target) = P(A ∩ Bc) = P(A) P(Bc) [Since A & Bc are Independent]
= 3/7 × 1/3 = 3/21 = 1/7
(iii) P (at least one of them hits the target) = P(A U B) = P(A) + P(B) – P(A ∩ B)
= 3/7 + 2/3 – 6/21 = (9 + 14 – 6)/21 = 17/21
(iv) P (none hits the target) = P(Ac ∩ Bc) = 1 – P(A U B) = 1 – 17/21 = 4/21
Check your Progress I
1. Two independent events A and B are such that P (A) = 0.3 and P (B) = 0.4. Find the
probability that (i) both A and B will occur, (ii) only A occurs, (iii) only B will occur, (iv) at least
one will occur, (v) none will occur. (Ans. 0.12, 0.18, 0.28, 0.58, 0.42)
3. A coin is tossed three times. What is the probability of getting all the three heads?
(Ans. 1/8)
4. The odds in favour of A living another 30 years is 5 : 7 and odds against B living another
v) Atleast one will be alive. [Ans. : (i) 0.185 ; (ii) 0.32 ; (iii) 0.26 ; (iv) 0.49 ; (v) 0.68]
Example: 6
Assume that a certain school has equal number of boys and girls. 5% of boys are football
players. Find the probability that randomly selected student is a boy and football player.
Solution:
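Let B be the event that the selected student is a boy and F the event that the student is a
football player. Then P(B) = 0.5 and P(F/B) = 0.05.
P(F/B) = P(F∩B) / P(B) ⇒ P(F∩B) = P(F/B) P(B)
P (randomly selected student is a boy and football player) = P(F∩B) = P(F/B) P(B)
= 0.05 × 0.5 = 0.025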
Example: 7
Susan took two tests. The probability of her passing both tests is 0.6. The probability of her
passing the first test is 0.8. What is the probability of her passing the second test given that she
has passed the first test?
Solution:
Let A = event that Susan passes first test
B = event that she passes the second test
P (passing the second test given that she has passed the first test)
= P (B/A) = P (A∩B) / P (A) = 0.6 / 0.8 = 0.75
Example: 8
A bag contains red and blue marbles. Two marbles are drawn without replacement. The
probability of selecting a red marble and then a blue marble is 0.28. The probability of selecting
a red marble on the first draw is 0.5. What is the probability of selecting a blue marble on the
second draw, given that the first marble drawn was red?
Solution:
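Let A = event that the first marble drawn is red
B = event that the second marble drawn is blue
P (selecting a blue marble on the second draw, given that the first marble drawn was red)
= P (B/A) = P (A∩B) / P (A) = 0.28 / 0.5 = 0.56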
Example: 9
A problem in Mathematics is given to three students whose chances of solving it are 1/3, 1/4
and 1/5 (i) What is the probability that the problem is solved? (ii) What is the probability that
exactly one of them will solve it?
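Solution:
Let A, B and C be the events that the first, second and third student solves the problem,
so that P(A) = 1/3, P(B) = 1/4 and P(C) = 1/5. The students solve the problem independently.
(i) P (the problem is solved) = 1 – P (none of them solves it) = 1 – P(Ac) P(Bc) P(Cc)
= 1 – (2/3 × 3/4 × 4/5)
= 1 – 2/5 = 3/5
(ii) P (exactly one of them solves it) = P(A)P(Bc)P(Cc) + P(Ac)P(B)P(Cc) + P(Ac)P(Bc)P(C)
= (1/3 × 3/4 × 4/5) + (2/3 × 1/4 × 4/5) + (2/3 × 3/4 × 1/5)
= 12/60 + 8/60 + 6/60 = 26/60 = 13/30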
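Both parts of this example can be computed by listing the eight solved / not-solved patterns for the three students and multiplying the individual probabilities. The sketch assumes the students work independently; `prob_of_pattern` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Chances of each student solving the problem.
p = [Fraction(1, 3), Fraction(1, 4), Fraction(1, 5)]

def prob_of_pattern(pattern):
    """Probability of one solved/not-solved pattern, assuming independence."""
    return prod(pi if solved else 1 - pi for pi, solved in zip(p, pattern))

patterns = list(product([True, False], repeat=3))

p_solved = sum(prob_of_pattern(t) for t in patterns if any(t))
p_exactly_one = sum(prob_of_pattern(t) for t in patterns if sum(t) == 1)

print(p_solved, p_exactly_one)  # 3/5 13/30
```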
Example: 10
The probability that a car being filled with petrol will also need an oil change is 0.30; the
probability that it needs a new oil filter is 0.40; and the probability that both the oil and filter
need changing is 0.15.
(i) If the oil had to be changed, what is the probability that a new oil filter is needed?
(ii) If a new oil filter is needed, what is the probability that the oil has to be changed?
Solution
Let A and B be the events of changing oil and new oil filter respectively.
P(A) = 0.30, P(B) = 0.40, P(A∩B) = 0.15
(i) Here we have to find the probability that a new oil filter is needed, given that the oil
had to be changed. The event B depends on A.
P (B/A) = P(A∩B)/P(A) = 0.15 / 0.30 = 1/2
(ii) Here we have to find the probability that the oil has to be changed, given that a new oil
filter is needed. The event A depends on B.
P(A/B) = P(A∩B)/P(B) = 0.15 / 0.40 = 3/8 = 0.375
Example: 11
What is the probability that the total of two dice will be greater than 9, given that the first die is a
5?
Solution:
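Let A = event that the first die shows 5, so P (A) = 6/36 = 1/6
B = event that the total of the two dice is greater than 9
A∩B = {(5, 5), (5, 6)}, so P (A∩B) = 2/36 = 1/18
P (the total of two dice will be greater than 9, given that the first die is a 5)
= P (B/A) = P (A∩B) / P (A) = (1/18) / (1/6) = 1/3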
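Conditional probability on an equally likely sample space amounts to restricting attention to the outcomes in A and counting how many of them are also in B. The following sketch checks this example that way; the set names follow the solution's A and B but the code itself is illustrative.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # (first die, second die)

A = {o for o in space if o[0] == 5}    # first die is a 5
B = {o for o in space if sum(o) > 9}   # total is greater than 9

# Definition: P(B/A) = P(A and B) / P(A).
p_a = Fraction(len(A), len(space))
p_a_and_b = Fraction(len(A & B), len(space))
p_b_given_a = p_a_and_b / p_a

print(sorted(A & B), p_b_given_a)  # [(5, 5), (5, 6)] 1/3
```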
Example : 12
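In a group of 100 people, 80 like tea, 50 like coffee and 36 like both tea and coffee. Find the
probability that a person selected at random (i) likes at least one of tea and coffee, (ii) likes tea
but not coffee, (iii) likes neither tea nor coffee, (iv) likes both tea and coffee.
Solution:
Let T and C be the events that the person likes tea and coffee respectively.
n (T) = 80, n (C) = 50, n (T∩C) = 36
Number who like only tea = 80 – 36 = 44
Number who like only coffee = 50 – 36 = 14
Number who like neither = 100 – (44 + 36 + 14) = 100 – 94 = 6
i) P (Likes at least one of tea and coffee) = P (TUC) = (44 + 36 + 14)/100 = 94/100 = 0.94
ii) P (Likes tea but not coffee) = 44/100 = 0.44
iii) P (Likes neither tea nor coffee) = 6/100 = 0.06
iv) P (Likes both tea and coffee) = 36/100 = 0.36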
2. If A, B, C are independent events such that P (A) = 0.3, P (B) = 0.1 and P (C) = 0.2, find
the probability of simultaneous occurrence of all the three events. [Ans. : 0.006]
3. One shot is fired from each of three guns. E1, E2, E3 denote the events that the target is
hit by the first, second and third guns respectively. If P (E1) = 0.5, P (E2) = 0.6 and P (E3) = 0.8
and E1, E2 , E3 are independent events, find the probability that –
ii) At least two hits are registered.[Ans. : (i) 0.26 ; (ii) 0.7]
4. A box contains 6 red, 4 white and 5 black balls. A person draws 4 balls from the box at
random. Find the probability that among the balls drawn there is at least one ball of each
colour. [Ans.: 0.5275]
5. A bag contains 10 white and 5 black balls. Two balls are drawn at random one after the other
without replacement. Find the probability that both balls drawn are black. [Ans. 2/21]
P (A/B) = P (A∩B) / P (B).
P (B/A) = P (A∩B) / P (A) ⇒ P (A∩B) = P (B/A) P (A)
⇒ P (Ai/B) = P (B | Ai) P (Ai) / ∑_{j=1}^{n} P (B | Aj) P (Aj)
Example
You might wish to find a person's probability of having rheumatoid arthritis if they have hay
fever. In this example, "having hay fever" is the test for rheumatoid arthritis (the event).
A would be the event "patient has rheumatoid arthritis." Data indicates 10 percent of
patients in a clinic have this type of arthritis. P(A) = 0.10
B is the test "patient has hay fever." Data indicates 5 percent of patients in a clinic have hay
fever. P(B) = 0.05
The clinic's records also show that of the patients with rheumatoid arthritis, 7 percent have
hay fever. In other words, the probability that a patient has hay fever, given they have
rheumatoid arthritis, is 7 percent. P(B ∣ A) =0.07
Substituting these values into the theorem:
P(A ∣ B) = P(B ∣ A) P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14
So, if a patient has hay fever, their chance of having rheumatoid arthritis is 14 percent. It's
unlikely a random patient with hay fever has rheumatoid arthritis.
More generally, for a finite number of mutually exclusive and exhaustive events Ai (i = 1, 2,
……n), i.e., events that satisfy Ai∩Aj = Φ for all i ≠ j and A1∪ A2∪ …. ∪An = S (Sample Space),
Bayes’ Theorem states that
P (Ai / B) = P (B / Ai) P (Ai) / ∑_{j=1}^{n} P (B / Aj) P (Aj)
Example : 1
Suppose there are two bags: the first bag contains 3 white and 2 black balls and the second bag
contains 2 white and 4 black balls. One ball is transferred from the first bag to the second bag and
then a ball is drawn from the latter and it is found to be white. What is the probability that
the transferred ball is white?
Solution:
Let B be the event of drawing a white ball from the second bag. A1 is the event of
transferring a white ball from bag 1 and A2 is the event of transferring a black ball from bag 1.
P (A1) = 3/5, P (A2) = 2/5, P (B/A1) = 3/7, P (B/A2) = 2/7
P (Transferred ball was white given that the ball drawn is white)
= P (A1/B) = P (B/A1) P (A1) / [P (B/A1) P (A1) + P (B/A2) P (A2)]
= (3/7 × 3/5) / (3/7 × 3/5 + 2/7 × 2/5)
= 9/13
Example : 2
Three firms A, B, C supply 25%, 35% and 40% of the chairs needed by a college. Past
experience shows that 5%, 4% and 2% of the chairs produced by these companies are
defective. If a chair is found to be defective, what is the probability that the chair was supplied
by firm A?
Solution:
Let D be the event of selecting a defective chair. Let A, B and C be the events that the chair
was supplied by firms A, B and C respectively.
P (A/D) = P (D/A) P (A) / [P (D/A) P (A) + P (D/B) P (B) + P (D/C) P (C)]
= (0.05 × 0.25) / (0.05 × 0.25 + 0.04 × 0.35 + 0.02 × 0.4)
= 0.0125 / 0.0345
= 0.36
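The Bayes calculation used in these examples follows one pattern: multiply each prior by its likelihood, then divide by the sum of those products. The sketch below wraps that pattern in a small function and applies it to the three-firm chair example. The function name `bayes_posteriors` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(Ai / B) for every Ai in a mutually exclusive, exhaustive set.

    priors[k]      = P(Ak)
    likelihoods[k] = P(B / Ak)
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}  # P(B / Ak) P(Ak)
    total = sum(joint.values())                              # P(B), total probability
    return {k: value / total for k, value in joint.items()}

# Share of chairs supplied by each firm, and each firm's defect rate.
priors = {"A": Fraction(25, 100), "B": Fraction(35, 100), "C": Fraction(40, 100)}
likelihoods = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}

posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors["A"], round(float(posteriors["A"]), 4))  # 25/69 0.3623
```

Because the firms form a mutually exclusive and exhaustive set, the posterior probabilities for A, B and C add up to 1.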
Example 3:
Bag I contains 4 white and 6 black balls while another Bag II contains 4 white and 3 black balls.
One ball is drawn at random from one of the bags and it is found to be black. Find the
probability that it was drawn from Bag I.
Solution:
Let A1 be the event of choosing the bag I, A2 the event of choosing the bag II and B be the
event of drawing a black ball.
Since a bag is chosen at random, P (A1) = P (A2) = 1/2. Also P (B/A1) = 6/10 and P (B/A2) = 3/7.
By using Bayes’ theorem, the probability that the black ball was drawn from bag I is
P (A1|B) = P (B/A1) P (A1) / [P (B/A1) P (A1) + P (B/A2) P (A2)]
= (6/10 × 1/2) / (6/10 × 1/2 + 3/7 × 1/2)
= 7/12 = 0.5833
Example 4:
A man is known to speak truth 2 out of 3 times. He throws a die and reports that number
obtained is a four. Find the probability that the number obtained is actually a four.
Solution:
Let B be the event that the man reports that number four is obtained.
Let A1 be the event that four is obtained and A2 be its complementary event.
P(A1) = Probability that four occurs = 1/6
P(A2) = Probability that four does not occur = 1 – P(A1) = 1 − 1/6 = 5/6
Also, P(B|A1) = Probability that the man reports four when it is actually a four (he speaks the
truth) = 2/3
P(B|A2) = Probability that the man reports four when it is not a four (he does not speak the
truth) = 1/3
P(A1|B) = P (B/A1) P (A1) / [P (B/A1) P (A1) + P (B/A2) P (A2)]
= (2/3 × 1/6) / (2/3 × 1/6 + 1/3 × 5/6)
= 2/7 = 0.2857
10.5 LET US SUM UP
10.6: Exercise
4. How will the statement of Addition theorem be modified, if the two events are (i)
mutually exclusive, (ii) complementary?
5. A speaks the truth in 80% of cases, B in 90% of cases. In what percentage of cases are they
likely to contradict each other in stating the same fact? [Ans. 26%]
6. The odds in favour of A hitting a target are 3 : 4 and odds against B hitting a target are
1 : 2. If both of them shoot at the target independently, find the probability that the target is
hit. [Ans. 17/21]
7. In a group of 120 students 80 passed in Mathematics and 90 passed in Economics and 65 passed
in both the subjects. Find the probability that a student selected at random from this group.
iv) Passed in only one subject [Ans. : (i) 0.875 ; (ii) 0.54 ; (iii) 0.125 ; (iv) 0.33]
8. The odds in favour of A living another 30 years is 5 : 7 and odds against B living another
v) Atleast one will be alive. [Ans. : (i) 0.185 ; (ii) 0.32 ; (iii) 0.26 ; (iv) 0.49 ; (v) 0.68]
9. Three urns are given each containing red and white balls. Urn I contains 6 red and 4 white balls.
Urn II contains 2 red and 6 white balls and urn III contains 1 red and 2 white balls. An
urn is selected at random and a ball is drawn. If the ball is red what is the chance that it is from
10.7: REFERENCES:
2. Introduction to probability and statistics-4th Edition J. Susan Milton, Jesse C. Arnold Tata McGraw
Hill