MATH 231 - Reading Material for Biostatistics
MATH 231
BIOSTATISTICS
by
© 2009
1. COURSE DESCRIPTION
The course is a general overview of hypothesis testing procedures.
2. OBJECTIVES
The aim of this course is to equip students with basic knowledge of statistical procedures.
Special emphasis will be given to the use of statistical packages (SPSS, SAS) in data analysis.
COURSE OUTLINE
1. Probability Distributions
Discrete Probability Distributions (Binomial, Poisson)
Normal Probability Distribution
2. Hypothesis; Definition
Hypothesis Testing Procedures
Hypothesis Testing for Proportions
3. Z - test
4. T - test
One sample test
Independent sample test
Paired test
Chi-Square test
5. Regression and Correlation Analysis
Correlation Analysis
Regression Analysis
6. F- test
Analysis of Variance (ANOVA)
7. Assessment
There will be two forms of assessment:
i. Continuous Assessment Tests (CAT) x 2, on dates to be decided: 30%
ii. End of Semester Examination: 70%
APPENDIX
1.0 INTRODUCTION
Statistics is the study of how to collect, organize, analyze and interpret numerical information.
This is a rather broad definition, and it is useful to consider the subject by discussing the major
divisions within the field:
Descriptive Statistics – refers to the statistical methods of describing and organizing data, e.g.
mean, median, mode, range, variance and standard deviation.
Inferential Statistics – refers to the methods of using a sample to obtain information about a
population or making conclusions about a population from the sample statistics.
It is therefore important to remember that the main role of inferential statistics is to draw
conclusions about a population based on information obtained from a sample.
Population – refers to all measurements of interest. For example, the weights of all pineapples in
a field.
Sample – is simply a representative part of the population. For example, 100 weights of
pineapples.
Data Variables
Qualitative variables – no numerical value can be assigned, e.g. the colour of hair.
Quantitative variables – have numerical values, e.g. the number of people in a Sunday service.
Quantitative variables can be either discrete or continuous.
A discrete variable is one which inherently contains gaps between successive observable values,
e.g. a count of the number of bacterial colonies growing on the surface of an agar plate is a
discrete variable (it is either 3 or 4 but not 3.5). Other examples are the number of countries,
people, chairs, animals, etc.
A continuous variable has the property that between any 2 observable values lies another
observable value.
A continuous variable takes values along a continuum, i.e., along a whole interval of values, e.g.
length, weight, height, temperature, speed, pH of a solution etc. An essential attribute of a
continuous variable is that unlike a discrete variable, it can never be measured exactly.
Exercise
Dr. Kamau has developed a new way to teach SAS. He claims that students who follow this
procedure learn to use SAS in an average time of one week. A randomly selected group of 40
students who used the method learned to use SAS in an average time of three weeks. To examine
Dr. Kamau’s claim, what would you use for the population and the sample?
2.0 COLLECTION OF DATA
Collecting data is not an easy task, as the population may have a number of characteristics while
you may be interested in only one or two of them, e.g. for pineapples in a field, you may only be
required to estimate the mean weight of the pineapples.
Experiment – data are obtained under conditions in which the effects of individual factors are controlled.
The most common method of collecting data is through surveys, which have to be
conducted very carefully; otherwise, the results would be of no value. A survey consists of two parts:
i. Planning – you need to consider
- Nature of problem
- Objective and scope of enquiry
- Sources of information
- Type of enquiry to be conducted
- Accuracy desired
ii. Executing – involves various steps to put the plans into operation
- Setting up an organization
- Selecting and training field staff
- Supervision of field work
- The problem of non-response
- Analysis of data
- Preparation of a report
Surveys can be done by using a variety of methods. The choice of the method depends on a
number of factors. Factors Affecting the Choice of Method of Survey;
1. Nature, objective and scope of enquiry - method selected should be such that it suits the type
of enquiry that is being conducted
2. Availability of finance - when financial resources are scarce leave aside expensive methods.
3. Availability of time - some methods involve a long duration of time while others take a
shorter time. The time allocated to the enquiry affects the selection of the method.
Methods of Surveys
Surveys can be done by using a variety of methods. Three of the most common methods are:
Personal Interview Survey – the investigator collects the information personally from the source
concerned. It has the advantage of obtaining in-depth responses to questions. However, the interviewer
must be trained in asking questions and recording responses, which makes it costly. There is also
the possibility of bias by the interviewer in the selection of respondents.
Telephone Surveys – they are less costly than personal interviews. People may also be more
open since there is no face-to-face contact. The disadvantage is that some people do not have
telephones, or are away from the office or home when calls are made, and hence are not interviewed.
Mailed Questionnaire Surveys – they can be used to cover a wider geographical area. Also, the
respondents can remain anonymous if desired. However, they tend to have a low response rate, inappropriate
answers, and questions that some people find difficult to read or understand.
Sampling Methods
Researchers use samples in order to collect data and information about a particular characteristic from
a large population. Using samples saves time and money, and allows the researcher to get
more information about a particular subject.
To be unbiased, a sample should give each subject in the population an
equal chance of being selected. The basic methods of sampling are:
Random Sampling – a simple random sample is a sample chosen from a population in such a way
as to ensure that every individual in the population has equal chance of being included in the
sample.
Systematic Sampling – samples are obtained by numbering each subject of the population and then
selecting every nth number. However the first subject would be selected at random.
Stratified Sampling – samples are obtained by dividing the population into groups called strata
according to some characteristic important to the study.
Cluster Sampling – we divide the population area into sections (clusters), randomly select a few of
those sections and then choose all the members from the selected sections.
Importance of Statistics
Statistical background is important in understanding research reports within your area of interest.
This is due to the fact that;
i. Statistics permits summarization and presentation of large quantities of information in a
fashion that facilitates its interpretation.
ii. Statistics enables the researcher to extend research beyond the restricted setting in which
most research is actually conducted.
iii. Statistics enables the formulation and testing of hypothesis.
Statistical procedures in research are used in fields as diverse as the behavioral sciences (education,
psychology, and sociology), agriculture, economics, medicine, biology, etc. For example, statistics
enables the educator to draw conclusions concerning the efficiency of various instructional
methods, medical scientists to choose the most effective medicine, etc.
Limitations of Statistics
Despite the universality of its approach, statistics has its own limitations;
i. Does not study qualitative phenomena - phenomena which cannot be expressed in figures
like honesty, beauty.
ii. Does not reveal the entire study of the problem - some problems have other background
factors, e.g. – religion, which cannot be covered by statistics.
iii. Does not study individuals
iv. Liable to be misused - any person can misuse statistics to draw whatever type of conclusion he/she likes,
e.g. opinion polls, incidence of crime, employment, census.
v. Statistical laws are true only on average - statistics deals with phenomena which are affected
by many other factors and it is not possible to study effects of each of these factors
separately. Conclusions arrived at are not entirely accurate and the same conclusion cannot
be arrived at under similar conditions at all times.
3.0 ORGANIZING DATA
This deals with the basic techniques of organization and representation of data in a way that is
simple to understand and analyze.
GRAPHS
(a) Bar Graphs
The bars can be vertical or horizontal, but they should be of uniform width and uniformly spaced.
The length of the bar represents the quantity we wish to compare under various conditions.
Sometimes numbers or pictures are used instead of solid bars.
Introduction
Bar graphs are a very common type of graph best suited for a qualitative independent variable.
Since there is no uniform distance between levels of a qualitative variable, the discrete nature of
the individual bars is well suited for this type of independent variable. Though you can extract
trends between bars (e.g., they are gradually getting longer or shorter), you cannot calculate a slope
from the heights of the bars.
One Independent and One Dependent Variable
Simple Bar Graph
Here the Factory is our independent variable, since there is no unit of measurement for factories
and no 'order' to the factories, the independent variable is nominal. The dependent variable is
scalar, measured in defects/1,000 cars. Since the scalar dependent variable has a natural zero point
(i.e., absolute or ratio), all of the bars are anchored to the horizontal axis, giving a common point
of measurement.
Horizontal Bar Graph
Bar graphs can be shown with the dependent variable on the horizontal scale. This type of bar
graph is typically referred to as a horizontal bar graph. Otherwise the layout is similar to the
vertical bar graph. Note in the example above that when you have a well-defined zero point (ratio
and absolute values) and both positive and negative values, you can place your vertical
(independent variable) axis at the zero value of the dependent variable scale. The negative and
positive bars are clearly differentiated from each other both in terms of the direction they point and
their color.
Range Bar Graph
Range bar graphs represent the dependent variable as interval data. The bars rather than starting at
a common zero point, begin at first dependent variable value for that particular bar. Just as with
simple bar graphs, range bar graphs can be either horizontal or vertical. Notice in the horizontal
example above, a reference line is used to indicate a common key dependent variable value.
Histogram
Histograms are similar to simple bar graphs except that each bar represents a range of independent
variable values rather than just a single value. What makes this different from a regular bar graph
is that each bar represents a summary of data rather than an independent value. For this type of
graph, the dependent variable is almost always a scalar scale representing the count, or number, of
how many of a sample falls within each range of the independent variable. In the example above,
the sample is all the females in Kenya. The independent variable is age, which has been grouped
into ranges of 5 years each. You should try and keep the ranges for each bar uniform (5 years in
this case), with the exception possibly being the first and/or last range.
Two (or more) Independent and One Dependent Variable
Grouped bar graph
Here, we have taken the same graph seen above and added a second independent variable, year.
The initial independent variable, factory, is nominal. The second independent variable, year, can
be treated as being either ordinal or scalar. This is often the case with larger units of time, such
as weeks, months, and years. Since we have a second independent variable, some sort of coding is
needed to indicate which level (year) each bar is. Though we could label each bar with text
indicating the year, it is more efficient to use color. We will need a legend to explain the color
coding scheme. Note that all of the bars for each level of factory are touching each other,
indicating visually that they are grouped together.
Composite bar graph
Another alternative for a bar graph with two independent variables is to have the bars stacked
rather than side-by-side. This arrangement is useful when the summation of all the levels of the
second independent variable is as or more important than the values for each level. In the upper
example, it is very easy to read the summed weight of all of the different materials in each sample.
There are, however, tradeoffs. The stacking of the bars means there is no common baseline for the
individual bar elements, making it hard to make direct comparisons for the subcategories. For
example, it is hard to compare the iron content of the three samples. A particularly powerful use
for the composite bar graph is when the sum of all the dependent variable values for each bar is the
same, such as when the values are a fraction of a whole. In the bottom example, the sum of the
three different types of fats will always equal 100 percent. With this layout it is easier to see the
relative portions, if not the absolute values, of a particular fat type across oils.
Stem-and-leaf plot (Key: 2|7 = 27)
Stem | Leaves
2 | 7
3 | 2
4 | 1334778
5 | 0112333444456689
6 | 888
7 | 388
8 | 5
[Pie chart: Composition of Human Body (by % weight): Bones 10%, Muscles 20%, Water 70%]
[Pie chart (untitled): Air 25%, Solids 50%, Water 25%]
[Pie chart: Composition of Air (by % volume): Nitrogen 78%, Oxygen 21%, Carbon dioxide 1%]
[Graph: vertical axis from 0 to 60; horizontal axis DAP from 0 to 128]
FREQUENCY TABLES
To generate a frequency table, follow the steps below:
i. Range of the data = highest value – lowest value
ii. Class width = (Range of the data) / (Desired number of classes), rounded up to a convenient whole number
iii. Determine the classes to be used
Note:
Class width (size) = upper class boundary – lower class boundary
Midpoint of a class = (lower class limit + upper class limit) / 2
Example
The commuting distances in km for 60 workers of Kenya Methodist University are as follows:
2 13 47 10 3 16 20 17 40 4
6 7 25 8 21 19 15 3 17 14
12 12 45 1 8 4 16 11 18 23
18 6 2 14 13 7 15 46 12 9
9 34 13 41 28 36 17 24 27 29
16 14 26 10 24 37 31 8 16 12
Then we generate a frequency table by the tally method as below:
Classes   Tally              Frequency (f)   Class Midpoint (x_m)
1-5       |||| ||            7               3
6-10      |||| |||| |        11              8
11-15     |||| |||| |||      13              13
16-20     |||| |||| |        11              18
21-25     ||||               5               23
26-30     ||||               4               28
31-35     ||                 2               33
36-40     |||                3               38
41-45     ||                 2               43
46-50     ||                 2               48
                             Σf = 60
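The tally above can be cross-checked with a short script. This is a minimal Python sketch, not part of the original notes; the function name `frequency_table` and its parameters are illustrative choices, and it assumes integer class limits of width 5 starting at 1, as in the example.

```python
# Commuting distances (km) for the 60 workers, copied from the example above.
distances = [
    2, 13, 47, 10, 3, 16, 20, 17, 40, 4,
    6, 7, 25, 8, 21, 19, 15, 3, 17, 14,
    12, 12, 45, 1, 8, 4, 16, 11, 18, 23,
    18, 6, 2, 14, 13, 7, 15, 46, 12, 9,
    9, 34, 13, 41, 28, 36, 17, 24, 27, 29,
    16, 14, 26, 10, 24, 37, 31, 8, 16, 12,
]

def frequency_table(data, start, width, n_classes):
    """Return one (lower limit, upper limit, frequency, midpoint) row per class."""
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1            # integer class limits, e.g. 1-5
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, freq, (lower + upper) / 2))
    return rows

table = frequency_table(distances, start=1, width=5, n_classes=10)
for lower, upper, freq, mid in table:
    print(f"{lower:>2}-{upper:<2}  f = {freq:>2}  midpoint = {mid}")
```

The frequencies it prints should match the tally column, and their sum should equal the number of workers.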
Class Boundaries
These are the true limits of a class, e.g. in the above data, for class 6 – 10:
The lower class boundary is 5.5
The upper class boundary is 10.5
Exercise
One irate DLM student called the institute 40 times during the last two weeks to inquire if the
notes had been sent out. Each time he called, he counted the number of rings before the phone was
answered. The record is shown in the table below:
What are the largest and smallest values in the table above? If we want five classes, what should
the class width be?
Complete the following frequency table.
A histogram gives the impression that frequencies jump suddenly from one class to the next. If
you want to emphasize the continuous rise or fall of frequencies you can use a frequency polygon,
or line graph.
Exercise
Q1. An agricultural experimental station at Meru recorded the following annual rainfall (to the
nearest inch) from the year 1927 to the year 1995.
12 9 14 11 15 15 7 12 18 17 11 18
16 15 11 16 19 21 18 11 13 21 8 10
19 19 11 17 16 12 10 19 15 12 13 18
14 22 13 13 21 10 11 14 10 13 9 12
29 10 14 13 15 13 13 15 9 15 15 22
12 15 8 16 11 12 19 6 13 14 16 12
(a) Make a frequency table and a histogram using only five classes.
(b) Make a frequency polygon from the histogram in part (a).
Q2. The student registration for year 1 Semester 1 in a certain University is as given in the table
below:
Department # of Students
Education Art 30
Computer 10
Theology 46
Education Science 24
Mathematics 20
Business 50
4.0 MEASURES OF CENTRE AND DISPERSION OF DATA
This section introduces the reader to the basic principles of determining the centre and the spread
of a data set. The determination of the mean, median, mode, variance and standard deviation of a
data set will be discussed and illustrated.
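Sample Mean ($\bar{x}$)
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x}{n}$$
E.g. for the data 10, 15, 20, 25, 30 with n = 5:
$$\bar{x} = \frac{10 + 15 + 20 + 25 + 30}{5} = \frac{100}{5} = 20$$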
Other common names: Arithmetic mean, Average
Median
This is the central value of an ordered data set. E.g. the median of the following set of test scores
for MATH 210 is 75:
50 51 60 64 65 70 [75] 80 81 85 90 95 97
(6 scores below the median, 6 scores above)
- there are as many test scores above as below the median.
- for the data set below, which has an even number of scores, there are two middle values:
51 60 64 69 70 [75 78] 80 85 90 91 95
The Median = (sum of the two middle scores) / 2 = (75 + 78) / 2 = 76.5
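The mean and median examples can be checked with Python's standard `statistics` module. This is a small sketch added for illustration; the variable names are arbitrary.

```python
import statistics

# Mean example: 10, 15, 20, 25, 30
mean_data = [10, 15, 20, 25, 30]
mean_value = statistics.mean(mean_data)

# Median with an odd number of scores: the single middle value
odd_scores = [50, 51, 60, 64, 65, 70, 75, 80, 81, 85, 90, 95, 97]
median_odd = statistics.median(odd_scores)

# Median with an even number of scores: average of the two middle values
even_scores = [51, 60, 64, 69, 70, 75, 78, 80, 85, 90, 91, 95]
median_even = statistics.median(even_scores)

print(mean_value, median_odd, median_even)
```

`statistics.median` sorts the data itself, so the scores do not have to be entered in order.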
Mode
This is the value or property which occurs most frequently in the data.
Example
The data below show the number of children in a sample of 20 families:
2 2 3 3 3 2 2 4 5 4 6 1 0 5 2 4 6 7 3 2
From the above data we construct a frequency table:
# of Children   0  1  2  3  4  5  6  7
Frequency       1  1  6  4  3  2  2  1
The Mode is 2 because this is the value with the highest frequency.
Note: - Sometimes a data set will not have a mode.
- A data set can also have several modes, e.g. rainfall data.
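The frequency table and the mode for the family data can be reproduced as follows. This is an illustrative Python sketch; it returns a list of modes so that a data set with several modes is handled too.

```python
from collections import Counter

# Number of children in each of the 20 sampled families
children = [2, 2, 3, 3, 3, 2, 2, 4, 5, 4, 6, 1, 0, 5, 2, 4, 6, 7, 3, 2]

freq = Counter(children)                    # value -> frequency
highest = max(freq.values())                # the largest frequency
modes = sorted(v for v, f in freq.items() if f == highest)

for value in sorted(freq):
    print(value, freq[value])
print("mode(s):", modes)
```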
Quartiles ($Q_i$)
13 9 18 15 14 21 7 10 11 20 5 18 25 16 17
Q1 = 1st Quartile = 10.5, Q2 = 2nd Quartile = 15, and Q3 = 3rd Quartile = 18
IQR = Interquartile Range = Q3 – Q1 = 18 – 10.5 = 7.5
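Several conventions exist for computing quartiles, and they can give slightly different answers. The values quoted above are consistent with taking the median of each half of the ordered data with the overall median included in both halves; the notes do not state which rule they use, so that choice is an assumption in the Python sketch below (the function name `quartiles_inclusive` is illustrative).

```python
import statistics

def quartiles_inclusive(data):
    """Q1, Q2, Q3 as medians of the lower half, the whole set and the upper half.

    Assumption: when n is odd, the middle value is included in both halves.
    """
    s = sorted(data)
    n = len(s)
    half = (n + 1) // 2                     # size of each half
    q1 = statistics.median(s[:half])
    q2 = statistics.median(s)
    q3 = statistics.median(s[n - half:])
    return q1, q2, q3

data = [13, 9, 18, 15, 14, 21, 7, 10, 11, 20, 5, 18, 25, 16, 17]
q1, q2, q3 = quartiles_inclusive(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Excluding the median from both halves would give Q1 = 10 for this data set, so it is worth stating the rule whenever quartiles are reported.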
GRAPHICAL ANALYSIS
Stem-leaf plots
i. Key: 3|2 = 32
Leaves (one row per stem):
7
23
334778
011233345689
888
3689
56
ii. Key: 13|3 = 133
Leaves (one row per stem):
1
3
677
00246
1234
01
1
$$\bar{x}_w = \frac{\sum (x \cdot w)}{\sum w}$$, where w is the weight of each entry x.
Example
Source    Score, x   Weight, w   x·w
CAT 1     86         0.10
CAT 2     68         0.10
H/work    75         0.10
Final     80         0.70
                     Σw = 1      Σx·w =
Exercise
Histograms & stem-leaf plots
35 10 30 25 75 10 30 20
20 10 40 50 40 30 60 70
25 40 10 60 20 80 40 25
20 10 20 25 30 50 80 20
GEOMETRIC MEAN ($\bar{x}_g$)
This quantity as applied to biological problems is used primarily to determine rates and ratios in
systems whose characteristics change with time. For example, assume that one wishes to determine
the mean rate of growth of a colony of bacteria. The arithmetic mean is not applicable because
bacteria grow not in arithmetic but in a geometric fashion:
1 2 4 8 16 etc.
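Geometric mean: $$\bar{x}_g = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdot x_4 \cdots x_n}$$
Time   Count   Symbol
0      10      x0
12     43      x1
24     167     x2
36     620     x3
48     2719    x4
$$\bar{x}_g = \sqrt[4]{43 \times 167 \times 620 \times 2719} = 331.7$$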
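The fourth root above is tedious by hand; a minimal Python sketch of the same calculation follows. It uses the four counts x1 to x4 that appear under the root sign in the notes.

```python
import math

# The four counts x1..x4 used under the root sign in the worked example
counts = [43, 167, 620, 2719]

product = math.prod(counts)                       # x1 * x2 * x3 * x4
geometric_mean = product ** (1 / len(counts))     # n-th root of the product

print(round(geometric_mean, 1))
```

For long lists the product can become very large; averaging the logarithms and exponentiating the result is an equivalent and more stable route.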
15 18 20 35
Range = largest value – smallest value
= 35 – 14
= 21
Sample Standard Deviation (s)
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Where:
x – any entry in the distribution
$\bar{x}$ – the sample mean
n – the number of entries
Example
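Compute the standard deviation (s) and the variance (s²) of the data set below:
2 3 4 5 6 8 10 10, n = 8
Solution:
$\sum x = 48$, so $\bar{x} = \sum x / n = 48 / 8 = 6.0$
$\sum (x - \bar{x})^2 = 66$
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{66}{7} = 9.43$
Standard deviation: $s = \sqrt{s^2} = \sqrt{9.43} = 3.07$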
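The calculation can be verified with a few lines of Python. One assumption is made in this sketch: the data list contains eight values including a 2, which is the only value consistent with n = 8, Σx = 48 and Σ(x − x̄)² = 66 as stated in the solution.

```python
import math

# Assumed full data set (n = 8, sum = 48), see the note above
data = [2, 3, 4, 5, 6, 8, 10, 10]

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)   # sum of squared deviations
variance = sum_sq_dev / (n - 1)                   # sample variance s^2
std_dev = math.sqrt(variance)                     # sample standard deviation s

print(mean, sum_sq_dev, round(variance, 2), round(std_dev, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor.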
$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$
Example
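Compute s² and s for the following data set.
5 5 6 6 6 7 7 8 9 10
x     x²
5     25
5     25
6     36
6     36
6     36
7     49
7     49
8     64
9     81
10    100
Σx = 69   Σx² = 501
n = 10
$\bar{x} = \sum x / n = 69 / 10 = 6.9$
$(\sum x)^2 / n = 69^2 / 10 = 476.1$
$$s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1} = \frac{501 - 476.1}{9} \approx 2.77$$
$$s = \sqrt{2.77} \approx 1.66$$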
Exercise
Dr. Mwangi gave a 10 – point statistics quiz to 100 students. A random sample of 10 papers had
the following scores:
Data; 9 6 4 6 5 8 7 6 7 0
Count the number of entries in each class and record as $f_i$.
Add the number of entries from each class, $f_i$, together to find the total number of entries, n
(the sum of the $f_i$'s), in the entire distribution.
Note: Each entry is then treated as though it falls on the midpoint ($x_m$) of that class.
Sample Mean: $$\bar{x} = \frac{\sum (x_m f)}{\sum f}$$
Sample Variance: $$s^2 = \frac{\sum \left( (x_m - \bar{x})^2 f \right)}{\sum f - 1}$$
Example
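Class     Freq (f)   Midpoint (x_m)   f·x_m
0 – 2     4          1                4
3 – 5     3          4                12
6 – 8     8          7                56
9 – 11    15         10               150
12 – 14   13         13               169
15 – 17   5          16               80
18 – 20   2          19               38
          Σf = 50                     Σf·x_m = 509
Extending the table with the columns x_m – x̄, (x_m – x̄)² and (x_m – x̄)² f gives
$\sum (x_m - \bar{x})^2 f = 961.40$
n = Σf = 50
$$\bar{x} = \frac{\sum x_m f}{n} = \frac{509}{50} = 10.2$$
$$s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}} = \sqrt{\frac{961.40}{50 - 1}} = \sqrt{19.62} = 4.43$$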
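The grouped-data calculation follows the same pattern for any frequency table, so it is a natural candidate for a short script. This is a Python sketch of the example above; the list layout `(lower limit, upper limit, frequency)` is an illustrative choice.

```python
import math

# (lower class limit, upper class limit, frequency) for each class
classes = [
    (0, 2, 4), (3, 5, 3), (6, 8, 8), (9, 11, 15),
    (12, 14, 13), (15, 17, 5), (18, 20, 2),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                              # total entries
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n     # sum(x_m f) / sum(f)
sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
std_dev = math.sqrt(sum_sq / (n - 1))

print(n, round(mean, 2), round(sum_sq, 2), round(std_dev, 2))
```

The script keeps the unrounded mean 10.18, which the notes round to 10.2; the standard deviation agrees with the worked answer to two decimal places.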
Exercise
Q1. 10 weights of the students in the class; _________________________
Determine the mean, median and mode
Age in years   58  59  60  61  62  63  64  65  66  67  68  69
Frequency      2   9   12  14  17  32  46  89  27  18  4   2
Q3. Find the median value of x, where x has the frequency distribution given below;
x 4 5 6 7 8 9 10
f 11 13 21 46 44 32 17
Q4. Find the mode and the median of the following data;
Marks   10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99
f       3      7      12     18     22     17     14     9      5
Q5. Calculate the mean and the Standard Deviation (s) of the following data set.
Class      f     x_m   f·x_m   x_m – x̄   (x_m – x̄)² f
1 – 5      14
6 – 10     8
11 – 15    11
16 – 20    10
21 – 25    6
Q6. A psychology test to measure memory skills was given to a random sample of 43 students.
The results follow, where x is the student score and f is the frequency with which students
obtained this score.
x   0 – 10   11 – 21   22 – 32   33 – 43   44 – 54
f   1        12        18        9         3
Use the above data to find the mean and sample standard deviation of scores.
Other Calculations
Weighted and frequency data
Weighted mean ($\bar{x}_w$)
$$\bar{x}_w = \frac{\sum xw}{\sum w}$$
Example
The table below shows the x values and their corresponding weightings (w).
x   2   7   10
w   2   5   3
Solution
x     w     xw
2     2     4
7     5     35
10    3     30
      Σw = 10   Σxw = 69
Weighted mean, $\bar{x}_w = \frac{\sum xw}{\sum w} = 69/10 = 6.9$
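A weighted mean is a one-line computation once the values and weights are paired up. The Python sketch below repeats the example; the helper name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

x = [2, 7, 10]
w = [2, 5, 3]
result = weighted_mean(x, w)
print(result)
```

The same function applies to the assessment table (CAT 1, CAT 2, homework and final exam), where the weights already sum to 1.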
Example
The table below shows the x values and their corresponding frequencies
x 1 2 3 4 5 6 7 8
f 5 8 12 19 7 4 3 2
Solution
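Sample Mean: $\bar{x} = \frac{\sum xf}{\sum f}$, where f is the frequency of the value x
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f - 1}}$
x   f   xf   x – x̄   (x – x̄)²   (x – x̄)² f
$$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f - 1}} = \sqrt{\frac{168.99}{60 - 1}} = 1.69$$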
Grouped Data
Example
The table below shows the x values and their corresponding frequencies
x 1 2 3 4 5 6 7 8
f 5 8 12 19 7 4 3 2
Determine the sample mean ($\bar{x}$) and the sample standard deviation (s)
Solution
Sample Mean: $\bar{x} = \sum x f / \sum f$, where f is the frequency of the value x
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}}$
Class   f   x_m   x_m f   x_m – x̄   (x_m – x̄)²   (x_m – x̄)² f
$s = \sqrt{19.62} = 4.43$
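Note that the variance = s².
Coefficient of variation: $$CV = \frac{s}{\bar{x}} \times 100$$
Then, with s = 8.54 and $\bar{x}$ = 33.40: $$CV = \frac{8.54}{33.40} \times 100 \approx 25.6\%$$ (a ratio of about 0.26)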
Skewness
Pearson’s index of skewness ( P )
$$P = \frac{3(\bar{x} - \text{median})}{s}$$, where P is between -3 and 3 in most distributions
When P > 0, then the data is skewed right. When P < 0, the data is skewed left. When P = 0, the
data is symmetric.
Exercise
Describe the shape of each data set:
Standard error of the mean: $$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$$
It is common practice among researchers to publish the value of a sample mean, plus or minus the
standard error of the mean ($\bar{x} \pm s_{\bar{x}}$). This gives the researcher an idea of how much variability one
would expect to find in the means of many samples of size n drawn from the same population.
GRAPHICAL ANALYSIS
Stem-leaf plots
i. Stem Leaves
Determine the mean and standard deviation of the two data sets
5.0 CORRELATION AND REGRESSION ANALYSIS
Objectives: at the end of this topic, you should be able to determine the correlation coefficient
(r) of any paired data set, comment on the relationship and, if necessary, determine the prediction
equation for the data set.
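$$r = \frac{SS_{xy}}{\sqrt{SS_x \cdot SS_y}}$$
where
$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n}, \quad SS_y = \sum y^2 - \frac{(\sum y)^2}{n}, \quad SS_{xy} = \sum xy - \frac{\sum x \cdot \sum y}{n}$$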
Example
Maina and Wanjiru are partners in the chemistry lab. Their assignment is to determine how much
copper sulphate (CuSO4) will dissolve in 100g of water at 10, 20, 30, 40, 50, 60, 70 °C. Their lab
results are shown in the table below:
x: Temp of Water (°C)    y: Amount of CuSO4 (g)
10                       17
20                       21
30                       25
40                       28
50                       33
60                       40
70                       49
Solution
Computational Table
x   y   x²   y²   xy
Then,
$$SS_x = 14000 - \frac{280^2}{7} = 2800$$
$$SS_y = 7229 - \frac{213^2}{7} = 747.71$$
$$SS_{xy} = 9940 - \frac{280 \times 213}{7} = 1420$$
$$r = \frac{1420}{\sqrt{2800 \times 747.71}} = 0.98$$
The relationship between the temperature and the amount of copper sulphate dissolved is almost
perfect. As temperature increases, the amount of copper sulphate dissolved increases.
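The computational table and the value of r can be reproduced with a short Python sketch. The function name `correlation` is an illustrative choice; it implements the SS formulas given above rather than calling a statistics library.

```python
import math

def correlation(x, y):
    """Return (SS_x, SS_y, SS_xy, r) using the computational formulas."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_x, ss_y, ss_xy, ss_xy / math.sqrt(ss_x * ss_y)

temps = [10, 20, 30, 40, 50, 60, 70]      # x: temperature of water
grams = [17, 21, 25, 28, 33, 40, 49]      # y: amount of CuSO4 dissolved (g)

ss_x, ss_y, ss_xy, r = correlation(temps, grams)
print(round(ss_x, 2), round(ss_y, 2), round(ss_xy, 2), round(r, 2))
```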
Scatter diagrams:
i. As x increases, y increases
ii. As x increases, y decreases
iii. No reasonable relationship
(B) LINEAR REGRESSION
The first step in answering these questions is to try to express the relationship as a mathematical
equation. There are many possible equations, but the simplest and most widely used is the linear
equation, or the equation of a straight line.
The least squares lines can be used for interpolation of y values for an x value which is between the
measured x values.
Note: Prediction of y values for an x value beyond the range of observed x values is a complex
problem that will not be treated in these lecture notes. Prediction beyond the range of observation
is called extrapolation.
Example
Mwangi and Kamau are partners in the chemistry lab. Their assignment is to determine how much
copper sulphate (CuSO4) will dissolve in water at 10, 20, 30, 40, 50, 60 and 70 °C. Their lab
results are shown in the table below, where y is the weight in grams of copper sulphate which will
dissolve in 100g of water at x °C.
Lab Results
x   10  20  30  40  50  60  70
y   34  46  48  53  55  63  65
Solution
Linear Equation: $y = a + bx$
where y is the dependent variable and x is the independent variable,
a is the constant term (the y-intercept, at the point (0, a)),
b is the slope of the line.
To estimate a and b above, we proceed as follows;
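Determine $\sum x = 280$
$\sum x^2 = 14000$
$\sum y = 364$
$\sum y^2 = 19604$
$\sum xy = 15900$
Then,
$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 2800$$
$$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 676.0$$
$$SS_{xy} = \sum xy - \frac{\sum x \cdot \sum y}{n} = 1340.0$$
$$b = \frac{SS_{xy}}{SS_x} = \frac{1340.0}{2800} = 0.48$$
$$a = \bar{y} - b\bar{x} = 32.86$$
Where
$$\bar{x} = \frac{\sum x}{n} = 40, \quad \bar{y} = \frac{\sum y}{n} = 52$$
The least squares line is therefore
y = 32.86 + 0.48x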
Estimate y when x = 35°C
y est = ŷ = 32.86 + 0.48 * 35 = 49.66 , you could also use the calculator in REG mode.
Note: we are only allowed to predict for values of x which are within the data range.
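The slope, intercept and prediction can be checked with the Python sketch below. The function name `predict` is illustrative, and the range check mirrors the note about not extrapolating. Because the script keeps the unrounded slope, its prediction at 35 °C comes out slightly lower than the hand calculation, which uses b rounded to 0.48.

```python
x = [10, 20, 30, 40, 50, 60, 70]
y = [34, 46, 48, 53, 55, 63, 65]
n = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b = ss_xy / ss_x                       # slope
a = sum(y) / n - b * sum(x) / n        # intercept: y-bar minus b times x-bar

def predict(x_new):
    """Predict y for x_new, refusing to extrapolate beyond the observed x range."""
    if not min(x) <= x_new <= max(x):
        raise ValueError("x_new is outside the range of the observed data")
    return a + b * x_new

y_35 = predict(35)
print(round(b, 2), round(a, 2), round(y_35, 2))
```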
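Standard error of estimate:
$$S_E = \sqrt{\frac{\sum (y - \hat{y})^2}{n}}$$
or, writing $y_o$ for the observed value,
$$S_E = \sqrt{\frac{\sum (y_o - \hat{y})^2}{n}}$$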
Exercise
Q1. The Food and Drug Administration is examining the effect of different doses of a new drug on
the pulse rate of human subjects. The results of the study on six people is given in the table below:
Dose, x   Drop, y
2.50      8
3.00      11
3.50      9
4.50      16
5.50      19
6.00      20
Q2. A civil service efficiency expert developed a test measuring job satisfaction of civil service
clerks. The following information was obtained from a random sample of 10 clerks.
x 48 92 32 56 20 72 16 56 76 80
y 13 2 14 10 14 6 17 8 3 7
Where; x is the job satisfaction index and y is the # of days absent from work in 1 year.
i. Draw a scatter diagram for the data
ii. From the scatter diagram, would you say slope is closest to 1, 0, or -1?
iii. Find the equation of the line.
$$x_1 = b_{1.23} + b_{12.3} x_2 + b_{13.2} x_3$$
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}}$$
keeping x3 constant.
keeping x2 constant.
Exercise
Q1. The table below shows the weights x1 to the nearest pound (lb), the height x2 to the nearest
inch (in) and the ages x3 to the nearest year of 12 boys;
Weight (x1)   64  71  53  67  55  58  77  57  56  51  76  68
Height (x2)   57  59  49  62  51  50  55  48  52  42  61  57
Age (x3)      8   10  6   11  8   7   10  9   10  6   12  9
6.0 ELEMENTARY PROBABILITY
This section deals with the basic techniques in elementary probability.
Probability of an Event
When we use probability in a statement, we are using a number between 0 and 1 to indicate the
likelihood of an event. We'll use the notation P(A), read as 'the probability of event A', to
denote the probability of event A. The closer to 1 the probability assignment is, the more likely
the event is to occur. If the event A is certain to occur, then P(A) = 1.
Probability formula for relative frequency:
Probability of an event A = relative frequency = $f_A / \sum f$
E.g. what is the probability of correctly guessing the answer to a true/false question?
P(correct answer) = (# of correct answers) / (total # of possible answers) = ½
Note: P(event A) = (# of outcomes in favour of event A) / (total # of outcomes)
Additive Law
P(A or B ) = P(A) + P(B) - P(A and B)
Multiplicative Law
If A and B are independent events, then P(A and B) = P(A) × P(B)
Example
If 3 fair coins are tossed together, what is the probability of getting
i. Exactly 3 heads?
ii. Exactly 2 heads?
iii. At least 2 heads?
iv. Fewer than 2 heads?
v. At most 2 heads?
Solution
List of possible outcomes (ignoring order, so no repetitions):
HHH
HHT
HTT
TTT
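The four outcomes listed above ignore order and are therefore not equally likely. Counting the ordered outcomes gives a sample space of 2 × 2 × 2 = 8 equally likely results, from which each probability follows by counting. The Python sketch below does this enumeration; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of tossing 3 fair coins, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

def prob(condition):
    """Probability that the number of heads satisfies `condition`."""
    favourable = sum(1 for o in outcomes if condition(o.count("H")))
    return Fraction(favourable, len(outcomes))

p_exactly_3 = prob(lambda h: h == 3)
p_exactly_2 = prob(lambda h: h == 2)
p_at_least_2 = prob(lambda h: h >= 2)
p_fewer_than_2 = prob(lambda h: h < 2)
p_at_most_2 = prob(lambda h: h <= 2)

print(p_exactly_3, p_exactly_2, p_at_least_2, p_fewer_than_2, p_at_most_2)
```

Note that "at least 2 heads" and "fewer than 2 heads" are complementary events, so their probabilities sum to 1.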
Note: The sum of all the probabilities assigned to outcomes in a sample space must be one. For
example, if you think the probability is 0.65 that you will win a tennis match, then you assume the
probability is 0.35 that your opponent will win.
If the probability that an event occurs is denoted by p and the probability that it does not occur is
denoted by q, then:
p + q = 1
q = 1 – p
CONDITIONAL PROBABILITY
                      Summer vacation this year
                      Yes    No    Total
Own a house   Yes     37     8     45
              No      40     19    59
Total                 77     27    104
i. Find the probability that a randomly selected family is taking a summer vacation this year
ii. Find the probability that a randomly selected family is taking a summer vacation this year,
given that they own a house.
iii. Are the events of owning a house and taking a summer vacation this year independent
events or mutually exclusive?
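The three questions can be approached directly from the counts in the table. The Python sketch below is one way to organise the arithmetic; the dictionary layout and variable names are illustrative, and exact fractions are used so that the independence check is not affected by rounding.

```python
from fractions import Fraction

# (owns a house?, takes a summer vacation?) -> number of families
counts = {
    (True, True): 37, (True, False): 8,
    (False, True): 40, (False, False): 19,
}
total = sum(counts.values())                                  # 104 families

p_vacation = Fraction(counts[(True, True)] + counts[(False, True)], total)
p_own = Fraction(counts[(True, True)] + counts[(True, False)], total)
p_both = Fraction(counts[(True, True)], total)

# Conditional probability: P(vacation | own) = P(vacation and own) / P(own)
p_vacation_given_own = p_both / p_own

independent = (p_vacation_given_own == p_vacation)
mutually_exclusive = (p_both == 0)

print(p_vacation, p_vacation_given_own, independent, mutually_exclusive)
```

Two events are independent when conditioning on one does not change the probability of the other, and mutually exclusive when they can never occur together.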
PERMUTATIONS AND COMBINATIONS
Factorial Notations
n! = n (n-1) (n-2) (n-3) . . . 3 . 2 . 1
E.g. 5! = 5 . 4 . 3 . 2 . 1 = 120
1! = 1 and 0! = 1 by definition
10! = 10 . 9 . 8 . 7 . 6 . 5 . 4 . 3 . 2 . 1 = 3,628,800
Permutations
This is an arrangement of objects in a particular order, e.g. for the letters A, B, C, how many
arrangements are possible taking the three at a time?
ABC ACB
BAC BCA (only six ways)
CAB CBA
If we were to continue listing arrangements like this, a large set of objects would quickly become
unmanageable. Let us look for a way out.
$^3P_3$ denotes a permutation (arrangement) of 3 objects taking 3 at a time (where order is important):
$$^3P_3 = \frac{3!}{(3 - 3)!} = \frac{3!}{0!} = \frac{3 \cdot 2 \cdot 1}{1} = 6$$
Solution
n = 6, p = 2 and q = 2 (two letters are repeated twice each)
$$\text{number of arrangements} = \frac{n!}{p!\, q!} = \frac{6!}{2!\, 2!} = 180$$
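The rule used above (divide n! by the factorial of each repeat count) generalises to any word. The Python sketch below applies it; the function name `arrangements` is illustrative, and the word "LETTER" is a hypothetical stand-in chosen because it has six letters with two letters each appearing twice, since the word used in the worked example is not shown.

```python
from collections import Counter
from math import factorial

def arrangements(word):
    """Number of distinct orderings of the letters in `word`."""
    total = factorial(len(word))
    for repeat in Counter(word).values():
        total //= factorial(repeat)       # divide out orderings of identical letters
    return total

print(arrangements("ABC"))       # three distinct letters
print(arrangements("LETTER"))    # hypothetical example: E and T each appear twice
```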
Exercises
Q1. In how many ways can the letters in the words be arranged?
(i) TROTTING (ii) MATRICES (iii) BESIEGE
(iv) PARALLEL
Combinations
This is an arrangement/selection of objects where order is not important.
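All the three arrangements are similar because they consist of the same items and are therefore
considered as one combination.
$$^nC_r = \frac{n!}{r!\,(n - r)!}$$
$$^{10}C_3 = \frac{10!}{3!\,(10 - 3)!} = \frac{10!}{3!\, 7!} = 120$$
Binomial probability of exactly x successes in n trials:
$$P(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n - x} = {}^nC_x\, p^x q^{n - x}$$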
Example:
Take a case of 4 patients for a particular disease. The probability of being cured is 0.6 (all signs
and symptoms of the disease are alleviated). If we assume that the outcomes (cure or not cured) in
the patients are independent, then we can answer questions like “what is the probability that
exactly 2 patients are cured ? ”
Solution:
In a case of 4 observations (patients), what is the probability of 2 successes (2 patients are cured)?
$$P(x) = {}^nC_x\, p^x q^{n - x}$$
With p = 0.6 and q = 0.4:
$$P(2) = {}^4C_2\, p^2 q^{4 - 2} = 6 (0.6)^2 (0.4)^2 = 0.35$$
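The binomial formula translates directly into code. The Python sketch below uses `math.comb` for the combination term; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# 4 patients, probability of cure 0.6, exactly 2 cured
p_two_cured = binomial_pmf(2, 4, 0.6)
print(round(p_two_cured, 4))

# The probabilities over all possible x sum to 1
total = sum(binomial_pmf(x, 4, 0.6) for x in range(5))
```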
Example
It has been found that the probability that a child in a family is male is 0.4. In a family of 6
children, what is the probability that;
i. Exactly 4 are boys ?
ii. At least 3 are boys ?
Solution
In this case; n = 6, p = 0.4 , x = 4, then,
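The working for this example is not shown in the notes. A minimal Python sketch of how both parts could be computed from the binomial formula is given below; the function name `binomial_pmf` is illustrative, and "at least 3" is obtained by summing P(3) through P(6).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 6, 0.4

p_exactly_4 = binomial_pmf(4, n, p)                                  # part i
p_at_least_3 = sum(binomial_pmf(x, n, p) for x in range(3, n + 1))   # part ii

print(round(p_exactly_4, 4), round(p_at_least_3, 4))
```

Part ii can equally be found as 1 − [P(0) + P(1) + P(2)], which involves fewer terms.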
Exercise
Q1. A botanist has developed a new hybrid cotton plant that can withstand insects better than
other cotton plants. However, there is some concern about the germination of seeds from the new
plants. To estimate the probability that a seed from the new plant will germinate a random sample
of 3,000 seeds were planted in warm, moist soil. Of these seeds, 2,430 germinated.
i. What is the probability that a seed will germinate?
ii. What is the probability that a seed will not germinate?
iii. Are the outcomes in this sample space equally likely?
Standard Deviation: $\sigma = \sqrt{npq}$
Example
In Njoro, Nakuru, about 57% of the days in a year are cloudy. Find the mean, variance and
standard deviation for the number of cloudy days during the month of June. What can you
conclude?
Solution
There are 30 days in June. Using
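n = 30, p = 0.57 and q = 0.43
You can find the mean, variance and standard deviation as illustrated below:
Mean: $\mu = np = 30(0.57) = 17.1$
Variance: $\sigma^2 = npq = 30(0.57)(0.43) = 7.353$
Standard Deviation: $\sigma = \sqrt{npq} = \sqrt{7.353} = 2.71$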
So, you can conclude that, on the average, there are 17.1 cloudy days during the month of June.
The standard deviation is about 2.71 days.
$$P(x) = \frac{n!}{x_1!\, x_2!\, x_3! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$
Example
In a music store, a manager found that the probabilities that a person buys zero, one, or two or more
CDs are 0.3, 0.6, and 0.1, respectively. If six customers enter the store, find the probability that one
won’t buy anything, three will buy one CD each, and two will buy two or more CDs.
Solution
n = 6, x1 = 1, x2 = 3, x3 = 2, p1 = 0.3, p2 = 0.6 and p3 = 0.1. Then
$$P(x) = \frac{6!}{1!\, 3!\, 2!}\,(0.3)^{1}(0.6)^{3}(0.1)^{2} = 60(0.3)(0.216)(0.01) = 0.03888$$
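The music-store calculation can be checked with a short sketch of the multinomial formula. The function name `multinomial_pmf` is an illustrative assumption.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """n! / (x1! x2! ... xk!) * p1^x1 * p2^x2 * ... * pk^xk."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)  # exact integer division
    return coefficient * prod(p**x for p, x in zip(probs, counts))

# one buys nothing, three buy one CD, two buy two or more CDs
p_mix = multinomial_pmf([1, 3, 2], [0.3, 0.6, 0.1])
print(round(p_mix, 5))
```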
Example
From experience you know that the probability that you will make a sale on any given telephone
call is 0.23. Find the probability that your first sale on any given day will occur on your fourth or
fifth sales call.
Solution
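Using p = 0.23, q = 0.77, and x = 4, you have
$P(4) = p\,q^{x-1} = 0.23(0.77)^{3} \approx 0.105$
and, with x = 5, $P(5) = 0.23(0.77)^{4} \approx 0.081$.
So, the probability that your first sale will occur on the fourth or fifth sales call is
P(sale on fourth or fifth sales call) = P(4) + P(5) ≈ 0.105 + 0.081 = 0.186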
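The sales-call result can be checked with a small sketch, assuming the geometric probability P(x) = p·q^(x−1) for the first success occurring on trial x. The function name is hypothetical.

```python
def geometric_pmf(x, p):
    """Probability that the first success occurs on trial number x."""
    return p * (1 - p) ** (x - 1)

p_sale = 0.23
p_fourth_or_fifth = geometric_pmf(4, p_sale) + geometric_pmf(5, p_sale)
print(round(p_fourth_or_fifth, 3))
```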
(D) THE POISSON DISTRIBUTION
The Poisson Distribution is a discrete probability distribution of a random variable x that satisfies
the following conditions;
1. The experiment consists of counting the number of times, x, that an event occurs in a given
interval. The interval can be an interval of time, area, or volume.
2. The probability of the event occurring is the same for each interval.
3. The number of occurrences in one interval is independent of the number of occurrences in
other intervals.
4. The probability of exactly x occurrences in an interval is
$$P(x) = \frac{\mu^{x} e^{-\mu}}{x!}$$
where e is an irrational number approximately equal to 2.71828 and µ is the mean number of
occurrences per interval unit.
Example
The mean number of accidents per month at a certain intersection is 3. What is the probability that
in any given month 4 accidents will occur at this intersection?
Solution
Using x = 4 and µ = 3, the probability that 4 accidents will occur in any given month at the
intersection is
$$P(4) = \frac{3^{4}\,(2.71828)^{-3}}{4!} \approx 0.168$$
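The intersection example can be verified numerically. The sketch below assumes the Poisson formula stated in condition 4 and uses a hypothetical function name.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x! for a mean of mu occurrences per interval."""
    return mu**x * exp(-mu) / factorial(x)

# mean of 3 accidents per month, exactly 4 accidents in a month
p_four_accidents = poisson_pmf(4, 3)
print(round(p_four_accidents, 3))
```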
Exercise
Using a Binomial Distribution find the probability
1. A surgical technique is performed on seven patients. You are told there is a 70% chance of
success. Find the probability that the surgery is successful for;
i. Exactly five patients
ii. At least five patients
iii. Less than five patients
2. 64% of men consider themselves football fans. You randomly select 10 men and ask each if he
is a football fan. Find the probability that the number who consider themselves football fans is;
i. Exactly eight
ii. At least eight
iii. Less than eight
3. 48% of the people in Kenya have O+ blood. You randomly select ten Kenyans and ask them
if their blood type is O+. Find the probability that the number who say they have O+ blood is:
i. Exactly eight
ii. At least eight
iii. Less than eight
5. An auto parts seller finds that one in every 100 parts sold is defective. Find the probability that;
i. The first defective part is the tenth part sold,
ii. The first defective part is the first, second, or third part sold, and
iii. None of the first 10 parts sold are defective.
6. The mean number of business failures per month in Nakuru town in the last one year was about
8. Find the probability that;
i. Exactly 4 businesses will fail in any given year.
ii. At least 4 businesses will fail in any given year.
iii. More than 4 businesses will fail in any given year.
7. A newspaper finds that the mean number of typographical errors per page is four. Find the
probability that;
i. Exactly three typographical errors will be found on a page
ii. At most three typographical errors will be found on a page
iii. More than three typographical errors will be found on a page
7.0 NORMAL DISTRIBUTION
This section introduces the reader to the basic techniques for determining whether a frequency
distribution is normal or not. At the end of this section you should also be able to determine the
area under the standard normal curve for any given interval of z values.
Normal Curve
The normal distribution (Gaussian curve):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x-\mu)^{2}/2\sigma^{2}}$$
Where: π ≈ 3.1416
σ – the standard deviation of the population
μ – the mean of the population
e ≈ 2.718
x – the value of an observed variable
f(x) – f is a function of x
The normal distribution is biologically the most important of all probability distributions since its
‘bell-shaped’, symmetrical curve nicely describes the majority of variables encountered in the
biological sciences.
The curve is ‘bell-shaped’, as shown below:
(a) Area Under the Standard Normal Curve
Example
(Hint: use the z – tables provided)
Find the area under the Standard Normal Curve
i. between z = 0 and z = 1.
ii. between 1 and 2.
iii. z 2.
iv. between -1 and 2.
Area between -2 and 0 = (area between 0 and 2) (mirror image of the area between –2 and 0)
= 0.4772
Probabilities (P)
$$\int_{a}^{b} f(z)\,dz$$
Solution i:
Solution ii:
Solution iii:
iv. P(−1 ≤ z ≤ 2)
Solution iv:
= P(0 ≤ z ≤ 1) + P(0 ≤ z ≤ 2)
= 0.3413 + 0.4772
= 0.8185
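The z-table areas used above can be cross-checked numerically. The sketch below builds the cumulative standard normal area from the error function in Python's math module. The helper name `phi` is an assumption, and small differences in the fourth decimal place come from rounding in the printed tables.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

area_0_to_1 = phi(1) - phi(0)        # between z = 0 and z = 1
area_0_to_2 = phi(2) - phi(0)        # between z = 0 and z = 2
area_minus1_to_2 = phi(2) - phi(-1)  # between z = -1 and z = 2

print(round(area_0_to_1, 4), round(area_0_to_2, 4), round(area_minus1_to_2, 4))
```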
Example
Let x have a normal distribution with μ = 10 and σ = 2. Find the probability that an x value
selected at random from the distribution is between:
i. 9 and 11
ii. 11 and 14
Solution i:
P(9 ≤ x ≤ 11)
P(9 ≤ x ≤ 11) = P(−0.5 ≤ z ≤ 0.5) = P(−0.5 ≤ z ≤ 0) + P(0 ≤ z ≤ 0.5)
= 2 P(0 ≤ z ≤ 0.5)
= 2 x (area between 0 and 0.5)
= 2 x 0.1915
= 0.3830
Solution ii
We first convert x values to z. To do this, we use the formula
$z = \dfrac{x - \mu}{\sigma}$
to convert the given x interval to a z interval.
P(11 ≤ x ≤ 14) = P(0.5 ≤ z ≤ 2) = P(0 ≤ z ≤ 2) − P(0 ≤ z ≤ 0.5)
= (area between 0 and 2) – (area between 0 and 0.5)
= 0.4772 – 0.1915
= 0.2857
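The conversion from x to z and the two probabilities in this example can be sketched as follows. The helper names are hypothetical, and the cumulative area is computed from the error function rather than read from the z-tables, so the last digit may differ slightly.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob_between(low, high, mu, sigma):
    """P(low <= x <= high) for a normal variable with mean mu and sd sigma."""
    z_low = (low - mu) / sigma    # z = (x - mu) / sigma
    z_high = (high - mu) / sigma
    return phi(z_high) - phi(z_low)

p_9_to_11 = normal_prob_between(9, 11, mu=10, sigma=2)
p_11_to_14 = normal_prob_between(11, 14, mu=10, sigma=2)
print(round(p_9_to_11, 4), round(p_11_to_14, 4))
```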
Example
The mean of an x distribution is μ = 2,500 and the standard deviation is σ = 300. Determine the
probability that a random sample of 42 observations taken from this distribution has a mean that
lies between 2,350 and 2,650.
Solution
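P(2350 ≤ x̄ ≤ 2650)
We first convert x̄ values to z. To do this, we use the formula
$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Then,
$$z_{2350} = \frac{2350 - 2500}{300/\sqrt{42}} \approx -3.24 \quad\text{and}\quad z_{2650} = \frac{2650 - 2500}{300/\sqrt{42}} \approx 3.24$$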
Therefore, P(2350 ≤ x̄ ≤ 2650) = P(−3.24 ≤ z ≤ 3.24) = 2 P(0 ≤ z ≤ 3.24) ≈ 2(0.4994) = 0.9988
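The probability for this sample-mean example can be checked numerically. The sketch assumes the standard error s/√n with s = 300 and n = 42 as given in the example. The helper `phi` is an illustrative name.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, s, n = 2500, 300, 42
standard_error = s / sqrt(n)

z_low = (2350 - mu) / standard_error
z_high = (2650 - mu) / standard_error
p_mean_between = phi(z_high) - phi(z_low)

print(round(z_low, 2), round(z_high, 2), round(p_mean_between, 4))
```

Because the interval covers more than three standard errors on either side of the mean, almost all sample means fall inside it.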
Exercise
The average age of a vehicle registered in the US is 8 years, or 96 months. Assume the standard
deviation is 16 months. If a random sample of 36 cars is selected, find the probability that the
mean is between 90 and 100 months.
Correction factor = $\sqrt{\dfrac{N - n}{N - 1}}$
Therefore, the formula for computing z is: $z = \dfrac{\bar{x} - \mu}{\dfrac{s}{\sqrt{n}}\sqrt{\dfrac{N - n}{N - 1}}}$
Exercise
The average weight of a group of male adults is 160 pounds. The standard deviation is 10 pounds.
If 30 males are selected from a population of 300, find the probability that the mean will be less than
156 pounds.
Solution
P(11 ≤ x ≤ 14), $z = \dfrac{x - \mu}{\sigma}$
For x = 14, μ = 10, σ = 2; then, z2 = (14 – 10)/2 = 2.00,
Exercise 1
Q1. Sketch the areas under the standard normal curve over the indicated intervals and find the area:
i. Area between z = 0 and z = 2.92
ii. Area between z = -2.18 and z = 1.34
iii. Area to the right of z = 0.15
Q2. Find the indicated probability and shade the corresponding area under the standard normal
curve:
i. P(0 ≤ z ≤ 1.62)
ii. P(−0.45 ≤ z ≤ 2.73)
iii. P ( z - 2.15)
Q3. Assume that x has a normal distribution, with the specified mean and standard deviation. Find
the indicated probabilities;
i. P(3 ≤ x ≤ 6), μ = 4, σ = 2
ii. P(50 ≤ x ≤ 70), μ = 40, σ = 15
iii. P (x 120), μ = 100, σ = 15
Q4. The ages of workers in the Mafuko bakery are normally distributed, with a mean of 45 years
and a standard deviation of 12 years. A worker is stopped at random and asked to fill out a
questionnaire. What is the probability that this worker is:
i. Less than 30 years old ?
ii. Between 35 and 55 years old ?
iii. More than 60 years old ?
Exercise 2
iv. P(0 ≤ z ≤ 1.64)
v. P(−1.65 ≤ z ≤ 1.65)
vi. P (z -1.05 or z 2.55)
Q2. x is a random variable with a normal distribution. Estimate the probability that x falls in the
indicated interval;
i. μ = 7, σ = 1.75, estimate P(5.25 ≤ x ≤ 8.75)
ii. μ = 20, σ = 5.4, estimate P(9.2 ≤ x ≤ 30.8)
Q3 The life span of a tire is normally distributed with a mean of 30,000 miles and a standard
deviation of 2,000 miles. Estimate the probability that a tire’s life span is between 30,000 and
34,000 miles.
Q4. The time per week a student uses a lab computer is normally distributed with a mean of 6.2
hours and standard deviation of 0.9 hour. You are planning the schedule for the computer lab. Of
2000 students, estimate the number of students who will use a lab computer for the given number
of hours;
i. Less than 5.3 hours
ii. Between 5.3 and 7.1 hours
iii. More than 7.1 hours
Q5. In a population survey of patients in a rehabilitation hospital, the mean length of stay in the
hospital was 12.0 weeks with a standard deviation of 1.0 weeks. The data is normally distributed.
What is the probability that;
Q6. The mean height of college male students is 70 inches with a standard deviation of 3 inches.
If we took a sample of 16 male students, what is the probability that their mean height is ;
Mean, $\mu = np$ and standard deviation, $\sigma = \sqrt{npq}$
Example
Seven percent of the people in Kenya have type O blood. You randomly select 30 people and
ask them if their blood type is O. Find the probability that:
i. Exactly 4 people say they have O blood
ii. At least 4 people say they have O blood
iii. Fewer than 4 people say they have O blood.
Solution:
n = 30
p = 0.07, q = 1 – 0.07 = 0.93
μ = np = 30(0.07) = 2.1 and σ = √(npq) = √1.953 ≈ 1.4
i. Exactly 4 ⇒ probability that x lies between 3.5 and 4.5 (correction for continuity)
$z = \dfrac{x - \mu}{\sigma}$
when x = 3.5, $z = \dfrac{3.5 - 2.1}{1.4} = 1.00$
when x = 4.5, $z = \dfrac{4.5 - 2.1}{1.4} \approx 1.71$
Then, exactly 4 ⇒ probability that z lies between 1.00 and 1.71
ii. At least 4 ⇒ 4 or more ⇒ greater than 3.5 (correction for continuity)
$z = \dfrac{x - \mu}{\sigma}$
For x = 3.5, $z = \dfrac{3.5 - 2.1}{1.4} = 1.00$
iii. Fewer than 4 ⇒ 3 or fewer ⇒ less than 3.5 (correction for continuity)
$z = \dfrac{x - \mu}{\sigma}$
for x = 3.5, $z = \dfrac{3.5 - 2.1}{1.4} = 1.00$
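The three normal-approximation probabilities can be sketched numerically. The code below takes μ = 2.1 and the rounded σ = 1.4 used in the solution, and reads "at least 4" as x greater than 3.5 and "fewer than 4" as x less than 3.5. The helper names are assumptions.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 2.1, 1.4  # np and the rounded sqrt(npq) from the solution

def to_z(x):
    return (x - mu) / sigma

p_exactly_4 = phi(to_z(4.5)) - phi(to_z(3.5))  # 3.5 < x < 4.5
p_at_least_4 = 1 - phi(to_z(3.5))              # x > 3.5
p_fewer_than_4 = phi(to_z(3.5))                # x < 3.5

print(round(p_exactly_4, 4), round(p_at_least_4, 4), round(p_fewer_than_4, 4))
```

Parts ii and iii are complementary, so their probabilities add up to 1.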
Exercise
Q1. Twenty-nine percent of people in the USA say they are confident that passenger trips to the
moon will occur in their lifetime. You randomly select 200 people in the USA and ask each if
he or she thinks passenger trips to the moon will occur in his or her lifetime. What is the
probability that at least 50 will say yes?
Q2. Twenty-four percent of people in Kenya have A (+) blood. You randomly select 32 people and
ask them if their blood type is A(+), find the probability that;
i. exactly 12 say they have A(+) blood
ii. at least 12 say they have A(+) blood
iii. fewer than 12 say they have A(+) blood
iv. at most 12 say they have A(+) blood
8.0 CONFIDENCE INTERVALS
Estimating Population Parameters
The sample mean, $\bar{x}$, is an unbiased point estimator of the population mean, $\mu$.
$$\bar{x} = \frac{\sum x}{n}$$
The maximum error of estimate or the margin of error, E, is the greatest possible distance between
the point estimate and the value of the parameter it is estimating.
$E = z_c\,\sigma_{\bar{x}} = z_c \dfrac{\sigma}{\sqrt{n}}$, for large samples
or
$E = t_c\,s_{\bar{x}} = t_c \dfrac{s}{\sqrt{n}}$, for small samples
Error tolerance
Example
If s = 5.0, n = 54, and x̄ = 12.4
At 95% confidence, zc = 1.96
then, $E = 1.96 \times \dfrac{5.0}{\sqrt{54}} \approx 1.3$
This means that at 95% confidence, the maximum error of estimate for the population mean, μ, is
about 1.3.
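The margin of error in this example can be reproduced with a short sketch. The function name `margin_of_error` is an illustrative choice, and the critical value 1.96 is the one quoted in the example.

```python
from math import sqrt

def margin_of_error(critical_value, s, n):
    """E = critical value * s / sqrt(n)."""
    return critical_value * s / sqrt(n)

E = margin_of_error(1.96, s=5.0, n=54)
print(round(E, 1))
```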
Exercise
Q1. Find the maximum error of estimate for the given values of c, s and n;
i. c = 0.90, s = 2.5, n = 36
ii. c = 0.95, s = 3.0, n = 60
iii. c = 0.99, s = 3.4, n = 100
$\bar{x} - E < \mu < \bar{x} + E$
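As a sketch of how the interval is built, the code below reuses the values from the earlier margin-of-error example (x̄ = 12.4, s = 5.0, n = 54, zc = 1.96). The function name is hypothetical.

```python
from math import sqrt

def confidence_interval(x_bar, critical_value, s, n):
    """Return (x_bar - E, x_bar + E) with E = critical value * s / sqrt(n)."""
    E = critical_value * s / sqrt(n)
    return x_bar - E, x_bar + E

lower, upper = confidence_interval(12.4, 1.96, s=5.0, n=54)
print(round(lower, 2), round(upper, 2))
```

For small samples the same function can be used with a critical value tc read from the t-tables in place of zc.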
Exercise
Q1. Construct the indicated confidence intervals for population mean;
i. c = 0.90, x̄ = 12.5, s = 2.0, n = 6
ii. c = 0.95, x̄ = 13.4, s = 0.85, n = 8
iii. c = 0.99, x̄ = 14.0, s = 2.0, n = 10
Q2. In 36 randomly selected seawater samples, the mean sodium concentration was 23 cc and the
standard deviation was 6.7 cc. Construct a 95% confidence interval for the population mean.
Q3. Determine the minimum required sample size if you want to be 95% confident that the
sample mean is within one unit of the population mean given σ = 4.8. Assume the population is
normally distributed.
Q4. In a random sample of 19 patients at a hospital’s minor emergency department, the mean
waiting time (in min) before seeing a medical professional was 23 min and the standard deviation
was 11 min. Construct a 95% confidence interval for the population mean. Assume the waiting
time is normally distributed.
Q5. You randomly selected 16 hotels and measured the temperature of tea sold at each. The mean
temperature is 162°F with a standard deviation of 10°F. Construct a 95% confidence interval for
the population mean. Assume the temperatures are approximately normally distributed.
$$n = \left(\frac{z_c\,\sigma}{E}\right)^{2}$$
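A minimal sketch of the minimum sample size calculation, assuming the usual formula n = (zc·σ/E)² rounded up to the next whole number. The inputs σ = 5.0 and E = 1 are illustrative values chosen here, not taken from the exercises.

```python
from math import ceil

def min_sample_size(critical_value, sigma, E):
    """Smallest n for which the margin of error is at most E."""
    return ceil((critical_value * sigma / E) ** 2)

# illustrative values: 95% confidence (zc = 1.96), sigma = 5.0, E = 1
n_required = min_sample_size(1.96, sigma=5.0, E=1.0)
print(n_required)
```

The result is always rounded up, because rounding down would give a margin of error slightly larger than E.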
Exercise
Q1. Determine the minimum required sample size if you want to be 95% confident that the
sample mean is within one unit (E = 1) of the population mean given σ = 4.8. Assume the
population is normally distributed. Two units of the population mean.
Q2. Determine the minimum required sample size if you want to be 99% confident that the
sample mean is within two units of the population mean given σ = 1.4. Assume the population is
normally distributed.
Q3. An admissions director wants to estimate the mean age of all students enrolled at a college.
The estimate must be within 1 year of the population mean. Assume the population of ages is
normally distributed. Determine the minimum required sample size to construct a 90% confidence
interval for the population mean given σ = 1.2 years.
8.0 Revision Exercises
Q1. (a) Briefly describe the difference between the following terms as used in statistics;
i. A population and a sample
ii. Discrete and continuous variable
iii. Descriptive and inferential statistics
(b) Which of the following measurements are discrete or continuous;
i. the average number of babies born in a certain clinic each week
ii. weight of 100 goats
iii. the average daily temperatures
iv. distance between the planets in our solar system
v. the number of pineapples in 10x10 m plots
The table below shows the CAT score for 45 students in MATH 100. Use this data to complete
the table below:
27 26 23 21 18 21 16 16 8
24 21 24 22 20 20 17 10 13
24 21 23 24 20 20 17 12 11
24 21 21 20 20 21 17 13 3
26 22 23 18 17 21 17 7 2
1–5
6 – 10
11 – 15
16 – 20
21 – 25
26 – 30
(ii) Draw a frequency histogram for these scores
Q2. The weights of students in the MATH 210 class were determined and recorded as below;
51 56 62 58 57
62 66 64 62 65
(a) Determine:
i. the mean, median and mode of these weights
ii. the range and standard deviation of these weights
(b) determine the mean and standard deviation for the following frequency distribution;
x 3 4 5 8
Freq 6 3 4 2
x 10 15 18 1 4 7 14
y 3 2 0 8 6 4 3
(a) Compute;
i. the correlation coefficient (r)
ii. comment on the relationship between x and y
iii. determine x̄ and ȳ, a and b for the equation y = a + bx
Exercise
Q1. (a) Briefly describe the difference between the following terms as used in statistics;
i. A population and a sample
ii. A population parameter and a sample statistic
iii. A discrete variable and continuous variable
The table below shows the final score for 45 students in MATH 100. Use this data to complete
the table below:
68 84 46 82 83 75 61 76 75
73 52 35 63 78 88 67 62 84
61 44 62 74 39 92 94 52 46
66 78 51 68 72 81 71 47 57
96 36 66 60 52 65 62 32 88
Q2. (a) The weights of students in the MATH 210 class were determined and recorded as below;
51 56 62 58 48
62 53 64 62 65
Determine;
i. The mean, median and mode of these weights
ii. the range and standard deviation of these weights
iii. the coefficient of variation (CV) and the skewness of these weights
Determine the mean and standard deviation for the following frequency distribution;
x 4 6 8 10
Freq 2 3 4 2
x 0 1 2 3 4 5 6
y 2.2 2.4 3.3 5.4 9.4 14.5 19.9
y = a + bx
v. write the prediction equation
vi. estimate y when x = 5 and x = 10.
vii. Estimate r² and comment on it
Q4. (a) Which of the following values cannot be a probability of an event, show the reasoning;
-0.2 1.2 0.006 40% 120% 2/5 0.8
(b) Assuming that the probability of a male birth is 0.3, out of 2000 families with 5 children,
how many would you expect to have
i. at least 3 boys
ii. exactly 4 boys
iii. fewer than 3 boys
(c) x is a random variable with a normal distribution. Estimate the probability that x falls in the
indicated interval;
i. μ = 7, σ = 2, estimate P(5 ≤ x ≤ 8)
ii. μ = 20, σ = 5, estimate P(18 ≤ x ≤ 25)
(b) determine the minimum required sample size if you want to be 95% confident that the sample
mean is within one unit of the population mean given σ = 4.8. Assume the population is
normally distributed.
9.0 BIBLIOGRAPHY
1. Aczel, A.D., 1996, Complete Business Statistics. 3rd edition. Chicago, Irwin.
2. Bluman, A.G.,1998, Elementary Statistics. McGraw-Hill. New York, New York.
3. Brase, C.H., C.P. Brase, 1987, Understanding Statistics. D.C. Heath and Company. Lexington,
Massachusetts. Toronto.
4. Dixon, W.J., and F.J. Massey, 1969, Introduction to Statistical Analysis. McGraw-Hill Book
Company. New York.
5. Hill, A.B., 1966, Principle of Medical Statistics. Oxford University Press. New York.
6. Johnson, R., 1990, Elementary Statistics. 6th edition. Boston, PWS- Kent
7. Pagano, R.R., 1990, Understanding Statistics. 3rd edition. New York, West.
8. Remington, R.D., M.A. Schork, 1970, Statistics with Applications to the Biological and Health
Sciences. Prentice-Hall, INC. Englewood Cliffs, New Jersey
FORMULAS
DESCRIPTIVE STATISTICS
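Arithmetic Mean: $\bar{x} = \dfrac{\sum x}{n}$
The Geometric Mean: $\bar{x}_G = \sqrt[N]{x_1 \cdot x_2 \cdot x_3 \cdots x_N}$
The Harmonic Mean: $H = \dfrac{N}{\sum \frac{1}{x}}$
Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^{2}}{n - 1}}$
Computational formula: $s = \sqrt{\dfrac{\sum x^{2} - (\sum x)^{2}/n}{n - 1}} = \sqrt{\dfrac{SSx}{n - 1}}$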
$\sum (x - \bar{x})^{2} f$
Mean: $\bar{x} = \dfrac{\sum xf}{\sum f}$, $\bar{x}_w = \dfrac{\sum xw}{\sum w}$
Where f is the frequency of x and w is the weighting of x
Mean: $\bar{x} = \dfrac{\sum x_m f}{\sum f}$, where $x_m$ is the midpoint of a class
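Median $= L_1 + \left(\dfrac{\frac{N}{2} - (\sum f)_l}{f_{median}}\right) \times cw$
Mode $= L_1 + \left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times cw$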
ELEMENTARY PROBABILITY
Probability of an event A: $P(A) = \dfrac{f_A}{\sum f}$
Additive law:
PERMUTATIONS and COMBINATIONS
${}^{n}P_{r} = \dfrac{n!}{(n - r)!}$
Combination - an arrangement of n objects taking r at a time. Order is not important in this case.
${}^{n}C_{r} = \dfrac{n!}{r!\,(n - r)!}$
DISCRETE PROBABILITY
Binomial Probability
$P(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$
Where n is the total number of observations (trials)
x is the number of successes
p is the probability of success - P(S)
q is the probability of failure – P(F)
P(x) is the probability of x successes
The probability that the first success will occur on trial number x is
$P(x) = p\,q^{x-1}$
Poisson Probability
$$P(x) = \frac{\mu^{x} e^{-\mu}}{x!}$$
where e is an irrational number approximately equal to 2.71828 and µ is the mean number of
occurrences per interval unit.
Correlation Analysis
Correlation Coefficient: $r = \dfrac{SSxy}{\sqrt{SSx \cdot SSy}}$
Where:
$SSx = \sum x^{2} - (\sum x)^{2}/n$
$SSy = \sum y^{2} - (\sum y)^{2}/n$
$SSxy = \sum xy - (\sum x \cdot \sum y)/n$
Linear regression
$b = \text{slope} = \dfrac{SSxy}{SSx}$
$a = y\text{-intercept} = \bar{y} - b\bar{x}$
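The correlation and regression formulas can be combined in a short sketch. The function names and the small data set below are invented purely for illustration (the points lie exactly on y = 1 + 2x) and are not taken from the course exercises.

```python
from math import sqrt

def sums_of_squares(xs, ys):
    """Return SSx, SSy and SSxy using the computational formulas."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_x, ss_y, ss_xy

def fit_line(xs, ys):
    """Return r, a and b for the line y = a + bx."""
    ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)
    r = ss_xy / sqrt(ss_x * ss_y)
    b = ss_xy / ss_x                              # slope
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)  # y-intercept
    return r, a, b

# hypothetical data lying exactly on y = 1 + 2x
r, a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(r, a, b)
```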