
MATH 231-Reading Material For Biostatics

KENYA METHODIST UNIVERSITY

Distance Learning Material

MATH 231

BIOSTATISTICS

by

COURSE INSTRUCTOR: Prof Peter A. Kamau

Tel: 0721 847 866

© 2009

Published by Kenya Methodist University

P.O. Box 267 – 60200, Meru

Tel: 254 – 064 – 30301, 31146

1. COURSE DESCRIPTION
The course is a general overview of hypothesis testing procedures.

2. OBJECTIVES
The aim of this course is to equip the students with basic knowledge of statistical procedures.
Special emphasis will be given to the use of statistical packages (SPSS, SAS) in data analysis.

COURSE OUTLINE
1. Probability Distributions
   - Discrete Probability Distributions (Binomial, Poisson)
   - Normal Probability Distribution
2. Hypothesis; Definition
   - Hypothesis Testing Procedures
   - Hypothesis Testing for Proportions
3. Z - test
4. T - test
   - One sample test
   - Independent sample test
   - Paired test
   - Chi-Square test
5. Regression and Correlation Analysis
   - Correlation Analysis
   - Regression Analysis
6. F - test
   - Analysis of Variance (ANOVA)
7. Assessment
There will be two forms of assessment:
i. Continuous Assessment Tests (CAT) x 2, on times to be decided: 30%
ii. End of Semester Examination: 70%

NB: Late submission of assignments will result in a deduction of marks.


TABLE OF CONTENTS
1.0 INTRODUCTION
2.0 COLLECTION OF DATA
3.0 ORGANIZING DATA
4.0 MEASURES OF CENTRE & DISPERSION OF A DATA SET
    Measures of Centre of a Data Set
    Measures of Dispersion of a Data Set
    Mean and Standard Deviation of Grouped Data
    Coefficient of Variation (CV)
    Skewness
    Standard Error
5.0 CORRELATION & REGRESSION ANALYSIS
    Correlation Coefficient (r)
    Linear Regression
    Coefficient of Determination
    Standard Error of Estimate (SE)
    Multiple and Partial Correlation
6.0 ELEMENTARY PROBABILITY
    Probability of an Event
    Permutations and Combinations
    Discrete Probability Distributions
    Normal Distributions
    Approximating a Binomial Distribution
8.0 REVISION EXERCISES
9.0 BIBLIOGRAPHY
APPENDIX

1.0 INTRODUCTION
Statistics is the study of how to collect, organize, analyze and interpret numerical information.
This is a rather broad definition, so it is useful to consider the subject by discussing the major
divisions within the field:
Descriptive Statistics – refers to the statistical methods of describing and organizing data, e.g.
mean, median, mode, range, variance and standard deviation.
Inferential Statistics – refers to the methods of using a sample to obtain information about a
population, i.e. making conclusions about a population from the sample statistics.

It is therefore important to remember that the main role of inferential statistics is to draw
conclusions about a population based on information obtained from a sample.

Population – refers to all measurements of interest. For example, the weights of all pineapples in
a field.

Sample – is simply a representative part of the population. For example, the weights of 100
pineapples taken from that field.

Note: Not all samples are useful.

Data Variables
Qualitative variables – no numerical value can be assigned, e.g. the colour of hair.
Quantitative variables – have numerical values, e.g. the number of people in a Sunday service.
Quantitative variables can be either discrete or continuous.

A discrete variable is one which inherently contains gaps between successive observable values,
e.g. a count of the number of bacterial colonies growing on the surface of an agar plate is a
discrete variable (it is either 3 or 4 but not 3.5). Other examples are the number of countries,
number of people, number of chairs, number of animals, etc.

A continuous variable has the property that between any 2 observable values lies another
observable value.
A continuous variable takes values along a continuum, i.e., along a whole interval of values, e.g.
length, weight, height, temperature, speed, pH of a solution etc. An essential attribute of a
continuous variable is that unlike a discrete variable, it can never be measured exactly.

Exercise
Dr. Kamau has developed a new way to teach SAS. He claims that students who follow this
procedure learn to use SAS in an average time of one week. A randomly selected group of 40
students who used the method learned to use SAS in an average time of three weeks. To examine
Dr. Kamau’s claim, what would you use for the population and the sample?

2.0 COLLECTION OF DATA
Collection of data is not an easy task, as the population may have a number of characteristics and
you may be interested in only one or two of them, e.g. for pineapples in a field, you may only be
required to estimate the mean weight of the pineapples.

Data can be obtained through;


Survey – data is obtained in which individual factors affecting the phenomena under study cannot
be controlled. For example, wages of farm workers. We cannot isolate education, training, and sex.

Experiment – data is obtained in which the effects of individual factors are controlled. For example,

The most common method of collecting data is through surveys, which have to be conducted very
carefully; otherwise the results would be of no value. A survey consists of two parts:
i. Planning – you need to consider
- Nature of problem
- Objective and scope of enquiry
- Sources of information
- Type of enquiry to be conducted
- Accuracy desired
ii. Executing – involves various steps to put the plans into operation
- Setting up an organization
- Selecting and training field staff
- Supervision of field work
- The problem of non-response
- Analysis of data
- Preparation of a report

Surveys can be done by using a variety of methods. The choice of the method depends on a
number of factors. Factors Affecting the Choice of Method of Survey;
1. Nature, objective and scope of enquiry - method selected should be such that it suits the type
of enquiry that is being conducted
2. Availability of finance - when financial resources are scarce leave aside expensive methods.
3. Availability of time - some methods involve a long duration of time while others take a
shorter time. The time allocated to the enquiry affects the selection of the method.

Methods of Surveys
Surveys can be done by using a variety of methods. Three of the most common methods are:
Personal Interview Survey – investigator has to collect the information personally from the source
concerned. It has the advantage of obtaining in depth responses to questions. But the interviewer
must be trained in asking questions and recording responses, which makes it costly. There is also
the possibility of bias by the interviewer in the selection of respondents.

Telephone Surveys – they are less costly than personal interview surveys. Also, people may be more
open since there is no face-to-face contact. A disadvantage is that some people do not have
telephones or are away from the office or home when calls are made, hence are not interviewed.

Mailed Questionnaire Surveys – they can be used to cover a wider geographical area. Also, the
respondents can remain anonymous if desired. However, response rates are low, answers may be
inappropriate, and some people have difficulty reading or understanding the questions.

Sampling Methods
Researchers use samples in order to collect data and information about a particular characteristic
from a large population. Using samples saves time and money, and allows the researcher to get
more information about a particular subject.

A sample should be unbiased, i.e. each subject in the population should have an equal chance of
being selected. The basic methods of sampling are:
Random Sampling – a simple random sample is a sample chosen from a population in such a way
as to ensure that every individual in the population has equal chance of being included in the
sample.
Systematic Sampling – samples are obtained by numbering each subject of the population and then
selecting every nth number. However the first subject would be selected at random.
Stratified Sampling – samples are obtained by dividing the population into groups called strata
according to some characteristic important to the study.
Cluster Sampling – we divide the population area into sections (clusters), randomly select a few of
those sections and then choose all the members from the selected sections.
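
The four sampling methods above can be sketched in Python. This is only an illustration and not part of the original notes: the population of 20 numbered subjects, the odd/even strata, the four clusters and all function names are assumptions made for the example.

```python
import random

def simple_random_sample(population, k, rng):
    # Every subject has an equal chance of being included.
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    # Random starting subject, then every `step`-th subject after it.
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    # Divide the population into strata, then sample within each stratum.
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly select whole clusters and keep every member of each one.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]

rng = random.Random(1)           # fixed seed so the run is repeatable
population = list(range(1, 21))  # hypothetical subjects numbered 1..20

srs = simple_random_sample(population, 5, rng)
syst = systematic_sample(population, 4, rng)
strat = stratified_sample(population,
                          lambda s: "odd" if s % 2 else "even", 3, rng)
clusters = {"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10],
            "C": [11, 12, 13, 14, 15], "D": [16, 17, 18, 19, 20]}
clus = cluster_sample(clusters, 2, rng)

print(srs, syst, strat, clus, sep="\n")
```

The stratified version draws the same number of subjects from each stratum; drawing in proportion to stratum size is an equally common choice.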

Importance of Statistics
Statistical background is important in understanding research reports within your area of interest.
This is due to the fact that;
i. Statistics permits summarization and presentation of large quantities of information in a
fashion that facilitates its interpretation.
ii. Statistics enables the researcher to extend research beyond the restricted setting in which
most research is actually conducted.
iii. Statistics enables the formulation and testing of hypotheses.

Statistical procedures in research are used in fields as diverse as the behavioral sciences (education,
psychology, and sociology), agriculture, economics, medicine, biology, etc. For example, statistics
enables the educator to draw conclusions concerning the efficiency of various instructional
methods, medical scientists to choose the most effective medicine, etc.

Limitations of Statistics
Despite the universality of its approach, statistics has its own limitations;
i. Does not study qualitative phenomena - phenomena which cannot be expressed in figures
like honesty, beauty.
ii. Does not reveal the entire study of the problem - some problems have other background
factors, e.g. – religion, which cannot be covered by statistics.
iii. Does not study individuals
iv. Liable to be misused - any person can misuse statistics to draw whatever type of conclusion
he/she likes, e.g. from opinion polls, incidence of crime, employment, census.
v. Statistical laws are true only on average - statistics deals with phenomena which are affected
by many other factors and it is not possible to study effects of each of these factors
separately. Conclusions arrived at are not entirely accurate and the same conclusion cannot
be arrived at under similar conditions at all times.

3.0 ORGANIZING DATA
This deals with the basic techniques of organization and representation of data in a way that is
simple to understand and analyze.

GRAPHS
(a) Bar Graphs
The bars can be vertical or horizontal, but they should be of uniform width and uniformly spaced.
The length of the bar represents the quantity we wish to compare under various conditions.
Sometimes numbers or pictures are used instead of solid bars.

One Independent and One Dependent Variable


- Simple Bar Graph
- Horizontal Bar Graph
- Range Bar Graph
- Histogram

Two (or More) Independent Variables and One Dependent Variable


- Grouped Bar Graph
- Composite Bar Graph

Introduction
Bar graphs are a very common type of graph, best suited for a qualitative independent variable.
Since there is no uniform distance between levels of a qualitative variable, the discrete nature of
the individual bars is well suited to this type of independent variable. Though you can extract
trends between bars (e.g., they are gradually getting longer or shorter), you cannot calculate a slope
from the heights of the bars.

One Independent and One Dependent Variable
Simple Bar Graph

Here the factory is our independent variable. Since there is no unit of measurement for factories
and no 'order' to the factories, the independent variable is nominal. The dependent variable is
scalar, measured in defects/1,000 cars. Since the scalar dependent variable has a natural zero point
(i.e., it is absolute or ratio), all of the bars are anchored to the horizontal axis, giving a common
point of measurement.

Horizontal Bar Graph

Bar graphs can be shown with the dependent variable on the horizontal scale. This type of bar
graph is typically referred to as a horizontal bar graph. Otherwise the layout is similar to the
vertical bar graph. Note in the example above that when you have a well-defined zero point (ratio
and absolute values) and both positive and negative values, you can place your vertical
(independent variable) axis at the zero value of the dependent variable scale. The negative and
positive bars are clearly differentiated from each other both in terms of the direction they point and
their color.

Range Bar Graph

Range bar graphs represent the dependent variable as interval data. The bars, rather than starting at
a common zero point, begin at the first dependent variable value for that particular bar. Just as with
simple bar graphs, range bar graphs can be either horizontal or vertical. Notice in the horizontal
example above, a reference line is used to indicate a common key dependent variable value.

Histogram

Histograms are similar to simple bar graphs except that each bar represents a range of independent
variable values rather than just a single value. What makes this different from a regular bar graph
is that each bar represents a summary of data rather than an independent value. For this type of
graph, the dependent variable is almost always a scalar scale representing the count, or number, of
how many of a sample falls within each range of the independent variable. In the example above,
the sample is all the females in Kenya. The independent variable is age, which has been grouped
into ranges of 5 years each. You should try and keep the ranges for each bar uniform (5 years in
this case), with the exception possibly being the first and/or last range.

Two (or more) Independent and One Dependent Variable
Grouped bar graph

Here, we have taken the same graph seen above and added a second independent variable, year.
The initial independent variable, factory, is nominal. The second independent variable, year, can
be treated as being either ordinal or scalar. This is often the case with larger units of time, such
as weeks, months, and years. Since we have a second independent variable, some sort of coding is
needed to indicate which level (year) each bar is. Though we could label each bar with text
indicating the year, it is more efficient to use color. We will need a legend to explain the color
coding scheme. Note that all of the bars for each level of factory are touching each other,
indicating visually that they are grouped together.

Composite bar graph

Another alternative for a bar graph with two independent variables is to have the bars stacked
rather than side-by-side. This arrangement is useful when the summation of all the levels of the
second independent variable is as or more important than the values for each level. In the upper
example, it is very easy to read the summed weight of all of the different materials in each sample.
There are, however, tradeoffs. The stacking of the bars means there is no common baseline for the
individual bar elements, making it hard to make direct comparisons for the subcategories. For
example, it is hard to compare the iron content of the three samples. A particularly powerful use
for the composite bar graph is when the sum of all the dependent variable values for each bar is the
same, such as when the values are a fraction of a whole. In the bottom example, the sum of the
three different types of fats will always equal 100 percent. With this layout it is easier to see the
relative portions, if not the absolute values, of a particular fat type across oils.

(b) Stem-and-leaf plot

Key: 2|7 = 27

Stem | Leaves
2 | 7
3 | 2
4 | 1334778
5 | 0112333444456689
6 | 888
7 | 388
8 | 5
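
A plot like the one above can be produced with a few lines of Python. This is a minimal sketch, not part of the original notes: the list `values` is simply the plot above read back into numbers, and the function name `stem_and_leaf` is an assumption made for the example.

```python
from collections import defaultdict

# Values read back from the stem-and-leaf plot above (e.g. 2|7 = 27).
values = [27, 32, 41, 43, 43, 44, 47, 47, 48,
          50, 51, 51, 52, 53, 53, 53, 54, 54, 54, 54, 55, 56, 56, 58, 59,
          68, 68, 68, 73, 78, 78, 85]

def stem_and_leaf(data):
    # The stem is the tens digit, the leaf is the units digit.
    rows = defaultdict(list)
    for value in sorted(data):
        rows[value // 10].append(value % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

plot = stem_and_leaf(values)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```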

(c) Circle Graph or Pie Chart


It is relatively safe from misinterpretation and is especially useful for showing the division of total
quantity into its component parts. The components are represented by proportional segments of a
circle.

Composition of Human Body (by % weight)
Water 70%, Muscles 20%, Bones 10%

Composition of a Medium Textured Soil at Field Capacity (by Volume)
Solids 50%, Water 25%, Air 25%

Composition of Air (by % Volume)
Nitrogen 78%, Oxygen 21%, Carbon dioxide 1%

(d) Frequency Histograms and Frequency Polygons


The bars always touch, and the width of a bar represents a quantitative value, such as age. In a bar
graph we could make the bars as wide as we wished, according to the visual impression we wanted
to convey. But in the histogram the width of the bar has a meaning. In the figure below, the width
of each bar represents 10 years.
[Figure: INFILTRATION RATES (cm/hr), vertical axis 0 to 60, against DAP, horizontal axis 0 to 128]

FREQUENCY TABLES
To generate a frequency table, follow the steps below:
i. Range of the data = highest value – lowest value
ii. Class width ≈ Range of the data / Desired number of classes, rounded up to the next whole number
iii. Determine the classes to be used

Note:
Class width (size) = upper class boundary – lower class boundary (equivalently, the difference
between the lower limits of two consecutive classes)
Midpoint of a class = (lower class limit + upper class limit) / 2
Example
The commuting distances in km for 60 workers of Kenya Methodist University are as follows:

2 13 47 10 3 16 20 17 40 4
6 7 25 8 21 19 15 3 17 14
12 12 45 1 8 4 16 11 18 23
18 6 2 14 13 7 15 46 12 9
9 34 13 41 28 36 17 24 27 29
16 14 26 10 24 37 31 8 16 12

Hint: Let us use 10 classes.

The largest distance commuted is 47 km and the smallest is 1 km.

Class width = $\frac{47 - 1}{10} = 4.6 \approx 5$

Then we generate a frequency table by the tally method as below:

Classes   Tally (||||/ = 5)   Frequency (f)   Class Midpoint (x_m)
1-5       ||||/ ||            7               3
6-10      ||||/ ||||/ |       11              8
11-15     ||||/ ||||/ |||     13              13
16-20     ||||/ ||||/ |       11              18
21-25     ||||/               5               23
26-30     ||||                4               28
31-35     ||                  2               33
36-40     |||                 3               38
41-45     ||                  2               43
46-50     ||                  2               48

$\sum f = 60$

Class Boundaries
These are the limits of a class extended halfway towards the neighbouring classes, e.g. in the above
data, for class 6 – 10:
The lower boundary is 5.5
The upper boundary is 10.5
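
The steps for building the frequency table can be checked with a short Python sketch. It uses the 60 commuting distances from the example; the variable names and the tuple layout of `table` are assumptions made for the illustration, and the class boundaries are taken as half a unit either side of the class limits, as in the 6 – 10 example.

```python
distances = [2, 13, 47, 10, 3, 16, 20, 17, 40, 4,
             6, 7, 25, 8, 21, 19, 15, 3, 17, 14,
             12, 12, 45, 1, 8, 4, 16, 11, 18, 23,
             18, 6, 2, 14, 13, 7, 15, 46, 12, 9,
             9, 34, 13, 41, 28, 36, 17, 24, 27, 29,
             16, 14, 26, 10, 24, 37, 31, 8, 16, 12]

low, high = min(distances), max(distances)
n_classes = 10
# Range / number of classes = 4.6, rounded up to 5 (integer ceiling).
width = -(-(high - low) // n_classes)

table = []
for i in range(n_classes):
    lower = low + i * width          # lower class limit
    upper = lower + width - 1        # upper class limit
    freq = sum(lower <= d <= upper for d in distances)
    midpoint = (lower + upper) / 2
    boundaries = (lower - 0.5, upper + 0.5)
    table.append((lower, upper, freq, midpoint, boundaries))

for lower, upper, freq, midpoint, (lb, ub) in table:
    print(f"{lower:>2}-{upper:<2}  f={freq:>2}  midpoint={midpoint:>4}  "
          f"boundaries={lb}-{ub}")

frequencies = [row[2] for row in table]
print("total:", sum(frequencies))
```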

Exercise
One irate DLM student called the institute 40 times during the last two weeks to inquire if the
notes had been sent out. Each time he called, he counted the number of rings before the phone was
answered. The record is shown in the table below:

The number of rings before the phone is answered


20 10 8 7 3 5 15 6 9 5
6 18 13 18 1 19 10 19 2 6
4 17 16 9 3 20 15 8 14 19
20 7 14 6 3 17 2 14 4 11

(a) What are the largest and smallest values in the table above? If we want five classes, what should
the class width be?
(b) Complete the following frequency table.

Classes      Tally     Frequency (f)   Midpoint (x_m)
1 - 4        ______    ______          2.5
5 - ___      ______    ______          ______
___ - 12     ______    ______          ______
13 - ___     ______    ______          ______
___ - ___    ______    ______          ______

(c) Determine the boundaries of each class, e.g.

Class    Class Boundaries
1 – 4    0.5 – 4.5

(d) Draw the histogram for the above data.

A histogram gives the impression that frequencies jump suddenly from one class to the next. If
you want to emphasize the continuous rise or fall of frequencies you can use a frequency polygon,
or line graph.

Exercise
Q1. An agricultural experimental station at Meru recorded the following annual rainfall (to the
nearest inch) from the year 1927 to the year 1995.

12 9 14 11 15 15 7 12 18 17 11 18

16 15 11 16 19 21 18 11 13 21 8 10
19 19 11 17 16 12 10 19 15 12 13 18
14 22 13 13 21 10 11 14 10 13 9 12
29 10 14 13 15 13 13 15 9 15 15 22
12 15 8 16 11 12 19 6 13 14 16 12

(a) Make a frequency table and a histogram using only five classes.
(b) Make a frequency polygon from the histogram in part (a).

Q2. The student registration for year 1 Semester 1 in a certain University is as given in the table
below:

Department # of Students
Education Art 30
Computer 10
Theology 46
Education Science 24
Mathematics 20
Business 50

Make a circle chart or pie chart for this data.

4.0 MEASURES OF CENTRE AND DISPERSION OF DATA
This section introduces the reader to the basic principles of determining the centre and the spread
of a data set. The determination of the mean, median, mode, variance and standard deviation of a
data set will be discussed and illustrated.

(A) MEASURES OF CENTRE OF A DATA

Sample Mean ($\bar{x}$)

$\bar{x} = \frac{\text{sum of all the entries}}{\text{number of entries}} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x}{n}$

E.g. 10 15 20 25 30, n = 5

$\bar{x} = \frac{10 + 15 + 20 + 25 + 30}{5} = \frac{100}{5} = 20$
Other common names: Arithmetic mean, Average

Median
This is the central value of an ordered data set. E.g. the median of the following set of test scores
for MATH 210 is 75

50 51 60 64 65 70 75 80 81 85 90 95 97

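(6 scores below 75, 6 scores above 75)
- There are as many test scores above as below the median.
- For the data set below, which has an even number of entries, there are two middle values:

51 60 64 69 70 75 78 80 85 90 91 95

The middle values are 75 and 78.

The Median = $\frac{\text{sum of two middle scores}}{2} = \frac{75 + 78}{2} = 76.5$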

Mode
This is the value or property which occurs most frequently in the data.

Example
The data below shows the number of children in a sample of 20 families:
2 2 3 3 3 2 2 4 5 4 6 1 0 5 2 4 6 7 3 2
From the above data we construct a frequency table:

# of Children   0   1   2   3   4   5   6   7
Frequency       1   1   6   4   3   2   2   1

The Mode is 2 because this is the value with the highest frequency.
Note: Sometimes a data set will not have a mode.
- A data set can also have several modes, e.g. rainfall data.
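
The three measures of centre can be checked against the examples above using Python's standard `statistics` module. This is a small sketch for verification only; the variable names are assumptions made for the example.

```python
import statistics

entries = [10, 15, 20, 25, 30]
scores_odd = [50, 51, 60, 64, 65, 70, 75, 80, 81, 85, 90, 95, 97]
scores_even = [51, 60, 64, 69, 70, 75, 78, 80, 85, 90, 91, 95]
children = [2, 2, 3, 3, 3, 2, 2, 4, 5, 4, 6, 1, 0, 5, 2, 4, 6, 7, 3, 2]

mean_value = statistics.mean(entries)          # sum of entries / n
median_odd = statistics.median(scores_odd)     # single middle value
median_even = statistics.median(scores_even)   # average of two middle values
modes = statistics.multimode(children)         # list of all most frequent values

print(mean_value, median_odd, median_even, modes)
```

`multimode` returns a list, so a data set with several modes shows all of them rather than only the first.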

Quartiles ( Qi )

13 9 18 15 14 21 7 10 11 20 5 18 25 16 17

Arrange the data in either ascending or descending order:

5 7 9 10 11 13 14 15 16 17 18 18 20 21 25

The median, 15, divides the data into a lower half and an upper half. Taking the median of each
half, with 15 counted in both halves, gives:

Q1 = 1st Quartile = 10.5, Q2 = 2nd Quartile = 15, and Q3 = 3rd Quartile = 18

IQR = Interquartile Range = Q3 – Q1 = 18 – 10.5 = 7.5

Q2 is the same as the Median.
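
The quartiles can be reproduced in Python. Several conventions for computing quartiles exist and they can give slightly different answers on small data sets; the sketch below assumes the `inclusive` method of `statistics.quantiles`, which matches the values quoted in these notes for this data set.

```python
import statistics

data = [13, 9, 18, 15, 14, 21, 7, 10, 11, 20, 5, 18, 25, 16, 17]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```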

GRAPHICAL ANALYSIS

Stem-leaf plots
i.
7 key: 3|2 = 32
23
334778
011233345689
888
3689
56

ii.
1 key: 13|3 = 133
3
677
00246
1234
01
1

Determine the mean of the two data sets above

THE WEIGHTED MEAN ($\bar{x}_w$)

$\bar{x}_w = \frac{\sum (x \cdot w)}{\sum w}$, where w is the weight of each entry x.

Example
Source    Score, x   Weight, w       x.w
CAT 1     86         0.10
CAT 2     68         0.10
H/work    75         0.10
Final     80         0.70
                     $\sum w = 1$    $\sum x.w =$

Exercise
Histograms & stem-leaf plots

ATM withdrawals (in dollars)

35 10 30 25 75 10 30 20
20 10 40 50 40 30 60 70
25 40 10 60 20 80 40 25
20 10 20 25 30 50 80 20

Using five classes, draw a;


o Histogram of the above data
o Stem-leaf plot of the above data
o Box plot of the above data

GEOMETRIC MEAN ($\bar{x}_g$)
This quantity as applied to biological problems is used primarily to determine rates and ratios in
systems whose characteristics change with time. For example, assume that one wishes to determine
the mean rate of growth of a colony of bacteria. The arithmetic mean is not applicable because
bacteria grow not in arithmetic but in a geometric fashion:

1 2 4 8 16 etc.

Geometric mean: $\bar{x}_g = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdot x_4 \cdots x_n}$

- Growth of a bacteria culture begun with 10 cells

Time (hr)   # of cells   x_i
0           10           x_0
12          43           x_1
24          167          x_2
36          620          x_3
48          2719         x_4

In this case the geometric mean of the cell numbers x_1 to x_4 (n = 4) is

$\bar{x}_g = \sqrt[4]{43 \times 167 \times 620 \times 2719} = 331.7$
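
The geometric mean above can be verified in Python in two ways: directly from the n-th root formula, and with the standard library's `statistics.geometric_mean`. The variable names are assumptions made for this sketch.

```python
import math
import statistics

cells = [43, 167, 620, 2719]   # x_1 .. x_4 from the table above

# n-th root of the product of the n values.
gm_root = math.prod(cells) ** (1 / len(cells))

# Library version, which works through logarithms internally.
gm_stats = statistics.geometric_mean(cells)

print(round(gm_root, 1), round(gm_stats, 1))
```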

(B) MEASURES OF DISPERSION OF A DATA SET


This section deals with the basic skills for the determination of the various measures of dispersion
- range, variance ($s^2$) and standard deviation (s) of any data set.

Range – is one such measure of dispersion,

e.g. for the data set below:

15 18 20 35
Range = largest value – smallest value
= 35 – 15
= 20

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Where:
x – any entry in the distribution
$\bar{x}$ – the mean of the distribution
n – the number of entries

Sample Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Example
Compute the standard deviation (s) and the variance ($s^2$) of the data set below:

2 3 4 5 6 8 10 10, n = 8

Solution:
$\bar{x} = \frac{\sum x}{n} = \frac{48}{8} = 6.0$

Column I   Column II        Column III
x          $x - \bar{x}$    $(x - \bar{x})^2$
2          2 – 6 = -4       (-4)^2 = 16
3          3 – 6 = -3       (-3)^2 = 9
4          4 – 6 = -2       (-2)^2 = 4
5          5 – 6 = -1       (-1)^2 = 1
6          6 – 6 = 0        0^2 = 0
8          8 – 6 = 2        (2)^2 = 4
10         10 – 6 = 4       (4)^2 = 16
10         10 – 6 = 4       (4)^2 = 16

$\sum x = 48$, $\sum (x - \bar{x})^2 = 66$

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{66}{7} = 9.43$ = Variance ($s^2$)

$s = \sqrt{s^2} = \sqrt{9.43} = 3.07$
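
The calculation can be checked step by step in Python. The sketch assumes the eight entries 2, 3, 4, 5, 6, 8, 10, 10, which are the values consistent with n = 8, a sum of 48 and a sum of squared deviations of 66. The last lines compare the result with the shortcut (computation) formula introduced next and with the standard library.

```python
import math
import statistics

data = [2, 3, 4, 5, 6, 8, 10, 10]
n = len(data)

mean = sum(data) / n                          # 48 / 8
ss = sum((x - mean) ** 2 for x in data)       # sum of squared deviations
variance = ss / (n - 1)                       # sample variance s^2
sd = math.sqrt(variance)                      # sample standard deviation s

# Same sum of squares from the computation formula: sum(x^2) - (sum x)^2 / n
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

print(mean, ss, round(variance, 2), round(sd, 2))
print(ss_shortcut, round(statistics.stdev(data), 2))
```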

Computation formula for the sample standard deviation

$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$

Example
Compute the $s^2$ and s for the following data set.

5 5 6 6 6 7 7 8 9 10

x     x^2
5     25
5     25
6     36
6     36
6     36
7     49
7     49
8     64
9     81
10    100

$\sum x = 69$, $\sum x^2 = 501$

n = 10
$\bar{x} = \frac{\sum x}{n} = \frac{69}{10} = 6.9$

$\sum x^2 = 501$, $(\sum x)^2 = (69)^2 = 4761$

$(\sum x)^2 / n = 69^2 / 10 = 476.1$

$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}} = \sqrt{\frac{501 - 476.1}{9}} = \sqrt{\frac{24.9}{9}} = \sqrt{2.77} = 1.66$

Therefore $s^2 = 2.77$ = Variance and s = 1.66 = Standard Deviation

Exercise
Dr. Mwangi gave a 10 – point statistics quiz to 100 students. A random sample of 10 papers had
the following scores:

Data; 9 6 4 6 5 8 7 6 7 0

Find the range


Find the mean and the sample standard deviation (s)

Mean and Standard Deviation of Grouped Data Sets


If you have a big data set, it can be quite tedious to compute the mean and the standard deviation
but is quite easy to do this from the frequency distribution of this data. The basic plan is as
follows:
i. Make a frequency table.
ii. Compute the midpoint for each class and call it $x_m$.
iii. Count the number of entries in each class and record as $f_i$.
iv. Add the number of entries from each class, $f_i$, together to find the total number of entries, n
(the sum of the $f_i$'s), in the entire distribution.

Note: Each entry is then treated as though it falls on the midpoint ($x_m$) of that class.

Sample Mean: $\bar{x} = \frac{\sum (x_m f)}{\sum f}$

Where: $x_m$ is the midpoint of a class
f is the number of entries in that class
$\sum f = n$ is the total number of entries in the distribution

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}}$

Sample Variance: $s^2 = \frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}$
Example

Class      Freq (f)   Midpoint ($x_m$)   $f x_m$   $x_m - \bar{x}$   $(x_m - \bar{x})^2$   $(x_m - \bar{x})^2 f$
0 – 2      4          1                  4         -9.2              84.64                 338.56
3 – 5      3          4                  12        -6.2              38.44                 115.32
6 – 8      8          7                  56        -3.2              10.24                 81.92
9 – 11     15         10                 150       -0.2              0.04                  0.60
12 – 14    13         13                 169       2.8               7.84                  101.92
15 – 17    5          16                 80        5.8               33.64                 168.20
18 – 20    2          19                 38        8.8               77.44                 154.88

$\sum f = 50$, $\sum f x_m = 509$, $\sum (x_m - \bar{x})^2 f = 961.40$

n = $\sum f$ = 50

$\bar{x} = \frac{\sum x_m f}{n} = \frac{509}{50} \approx 10.2$

$s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}} = \sqrt{\frac{961.40}{50 - 1}} = \sqrt{19.62} = 4.43$

Exercise
Q1. 10 weights of the students in the class; _________________________
Determine the mean, median and mode

Q2. Age of retirement : male teachers

Age in years   58   59   60   61   62   63   64   65   66   67   68   69
Frequency       2    9   12   14   17   32   46   89   27   18    4    2

Show that the mean retiring age is 64.51 years.

Q3. Find the median value of x, where x has the frequency distribution given below;

x 4 5 6 7 8 9 10
f 11 13 21 46 44 32 17

Q4. Find the mode and the median of the following data;

Marks   10-19   20-29   30-39   40-49   50-59   60-69   70-79   80-89   90-99
f       3       7       12      18      22      17      14      9       5

Q5. Calculate the mean and the Standard Deviation (s) of the following data set.

Class      f     $x_m$   $f x_m$   $x_m - \bar{x}$   $(x_m - \bar{x})^2 f$
1 – 5      14
6 – 10     8
11 – 15    11
16 – 20    10
21 – 25    6

Q6. A psychology test to measure memory skills was given to a random sample of 43 students.
The results follow, where x is the student score and f is the frequency with which students
obtained this score.

x 0 – 10 11 – 21 22 – 32 33 – 43 44 – 54
f 1 12 18 9 3

Use the above data to find the mean and sample standard deviation of scores.

Other Calculations
Weighted and frequency data

Weighted mean: $\bar{x}_w = \frac{\sum xw}{\sum w}$

Where: w is the weight of the data value x
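
As a quick sketch, the weighted mean formula translates directly into a small Python function. The function name `weighted_mean` is an assumption made for the example; the values and weights are those of the worked example in this section.

```python
def weighted_mean(values, weights):
    # sum of (x * w) divided by the sum of the weights
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

x = [2, 7, 10]
w = [2, 5, 3]

result = weighted_mean(x, w)   # (4 + 35 + 30) / 10
print(result)
```

When every weight is equal, the weighted mean reduces to the ordinary arithmetic mean.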

Example
The table below shows the x values and their corresponding weightings ( w ).

x 2 7 10
w 2 5 3

Determine the weighted mean (average)

Solution

$\bar{x}_w = \frac{\sum xw}{\sum w}$

x     w     xw
2     2     4
7     5     35
10    3     30

$\sum w = 10$, $\sum xw = 69$

Weighted mean, $\bar{x}_w = \frac{\sum xw}{\sum w} = \frac{69}{10} = 6.9$

Example
The table below shows the x values and their corresponding frequencies

x 1 2 3 4 5 6 7 8

f 5 8 12 19 7 4 3 2

Determine the mean ( x ) and the standard deviation (s)

Solution

Sample Mean: $\bar{x} = \frac{\sum xf}{\sum f}$, where f is the frequency of the value x

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f - 1}}$

x    f     xf    $x - \bar{x}$   $(x - \bar{x})^2$   $(x - \bar{x})^2 f$
1    5     5     -2.82           7.95                39.76
2    8     16    -1.82           3.31                26.50
3    12    36    -0.82           0.67                8.07
4    19    76    0.18            0.03                0.62
5    7     35    1.18            1.39                9.75
6    4     24    2.18            4.75                19.01
7    3     21    3.18            10.11               30.34
8    2     16    4.18            17.47               34.94

$\sum f = 60$, $\sum xf = 229$, $\sum (x - \bar{x})^2 f = 168.99$

Sample Mean: $\bar{x} = \frac{\sum xf}{\sum f} = \frac{229}{60} = 3.82$

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f - 1}} = \sqrt{\frac{168.99}{60 - 1}} = 1.69$

Grouped Data
Example
The table below shows the classes and their corresponding frequencies

Class   0–2   3–5   6–8   9–11   12–14   15–17   18–20
f       4     3     8     15     13      5       2

Determine the sample mean $\bar{x}$ and the sample standard deviation (s)

Solution
Sample Mean: $\bar{x} = \frac{\sum x_m f}{\sum f}$, where $x_m$ is the centre mark (midpoint) of a class and f is the frequency of that class

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}}$

Computational table

Class      f     $x_m$   $x_m f$   $x_m - \bar{x}$   $(x_m - \bar{x})^2$   $(x_m - \bar{x})^2 f$
0 – 2      4     1       4         -9.2              84.64                 338.56
3 – 5      3     4       12        -6.2              38.44                 115.32
6 – 8      8     7       56        -3.2              10.24                 81.92
9 – 11     15    10      150       -0.2              0.04                  0.60
12 – 14    13    13      169       2.8               7.84                  101.92
15 – 17    5     16      80        5.8               33.64                 168.20
18 – 20    2     19      38        8.8               77.44                 154.88

$\sum f = 50$, $\sum x_m f = 509$, $\sum (x_m - \bar{x})^2 f = 961.40$

n = $\sum f$ = 50

Sample mean: $\bar{x} = \frac{\sum x_m f}{\sum f} = \frac{509}{50} \approx 10.2$

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_m - \bar{x})^2 f}{\sum f - 1}} = \sqrt{\frac{961.40}{50 - 1}} = \sqrt{19.62} = 4.43$

Note that the variance is $s^2 = 19.62$.
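
The grouped-data calculation can be sketched in Python as follows. The class limits and frequencies are those of the computational table; the tuple layout and variable names are assumptions made for the illustration. The code keeps the unrounded mean (10.18), so its sum of squares differs slightly from the 961.40 obtained in the notes with the mean rounded to 10.2, while the standard deviation agrees to two decimal places.

```python
import math

# (lower limit, upper limit, frequency) for each class
classes = [(0, 2, 4), (3, 5, 3), (6, 8, 8), (9, 11, 15),
           (12, 14, 13), (15, 17, 5), (18, 20, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)
# Every entry is treated as if it sat on the midpoint of its class.
mean = sum(xm * f for xm, f in zip(midpoints, freqs)) / n
ss = sum((xm - mean) ** 2 * f for xm, f in zip(midpoints, freqs))
variance = ss / (n - 1)
s = math.sqrt(variance)

print(n, round(mean, 2), round(variance, 2), round(s, 2))
```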

Coefficient of Variation (CV)

$CV = \frac{s}{\bar{x}} \times 100$

CV is a dimensionless measure of variability, expressed as a percentage.

E.g. if s = 8.54 and $\bar{x}$ = 33.40, then $CV = \frac{8.54}{33.40} \times 100 = 25.6\%$

Skewness
Pearson’s index of skewness (P)

$P = \frac{3(\bar{x} - \text{median})}{s}$, where P is between -3 and 3 in most distributions
When P > 0, then the data is skewed right. When P < 0, the data is skewed left. When P = 0, the
data is symmetric.

Exercise
Describe the shape of each data set:

(i) x = 17, s = 2.3, median = 19

(ii) x = 32, s = 5.1, median = 25

Standard Error (Standard Error of the Mean)

$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$

It is common practice among researchers to publish the value of a sample mean, plus or minus the
standard error of the mean ($\bar{x} \pm s_{\bar{x}}$). This gives the researcher an idea as to how much variability one
would expect to find in the means of many samples of size n drawn from the same population.
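
The coefficient of variation, Pearson's index of skewness and the standard error of the mean can all be computed from the same three ingredients: the mean, the median and s. The sketch below is an illustration only; it reuses the ten-value data set from the computation formula example (mean 6.9, s = 1.66), and the variable names are assumptions.

```python
import math
import statistics

data = [5, 5, 6, 6, 6, 7, 7, 8, 9, 10]

mean = statistics.mean(data)
median = statistics.median(data)
s = statistics.stdev(data)                 # sample standard deviation

cv = s / mean * 100                        # coefficient of variation, in %
pearson = 3 * (mean - median) / s          # Pearson's index of skewness
se = s / math.sqrt(len(data))              # standard error of the mean

print(round(cv, 1), round(pearson, 2), round(se, 3))
```

Here P is positive, so this small data set is skewed to the right.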

GRAPHICAL ANALYSIS
Stem-leaf plots
i. Stem Leaves

7 key: 11|3 = 113


23
1334778
011233345689
888
3689
56
ii.
1 key: 19|1 = 191
3
677
00246
1234
01
1

Determine the mean and standard deviation of the two data sets

Determine the CV and Skewness of the two data sets

5.0 CORRELATION AND REGRESSION ANALYSIS

Objectives: at the end of this topic, you should be able to determine the correlation coefficient
(r) of any paired data set, comment on the relationship and, if necessary, determine the prediction
equation for the data set.

(A) CORRELATION COEFFICIENT (r)


r is always a number between -1 and +1 ($-1 \le r \le 1$).

Perfect     Good     Poor     None     Poor     Good     Perfect
  -1             -0.5          0          0.5               +1
Negative Relationships                 Positive Relationships

$r = \frac{SS_{xy}}{\sqrt{SS_x \cdot SS_y}}$, where

$SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$

$SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$

$SS_{xy} = \sum xy - \frac{\sum x \cdot \sum y}{n}$

n is the number of pairs in the data set

Example
Maina and Wanjiru are partners in the chemistry lab. Their assignment is to determine how much
copper sulphate (CuSO4) will dissolve in 100 g of water at 10, 20, 30, 40, 50, 60 and 70 °C. Their lab
results are shown in the table below:

x: Temp of Water (°C)    y: Amount of CuSO4 (g)
10                       17
20                       21
30                       25
40                       28
50                       33
60                       40
70                       49

i. Sketch a scatter plot for this data.
ii. Determine the correlation coefficient (r) for these data pairs.
iii. Comment on the relationship between x and y.

Solution
Computational Table
x     y     x^2     y^2     xy
10    17    100     289     170
20    21    400     441     420
30    25    900     625     750
40    28    1600    784     1120
50    33    2500    1089    1650
60    40    3600    1600    2400
70    49    4900    2401    3430

$\sum x = 280$, $\sum y = 213$, $\sum x^2 = 14000$, $\sum y^2 = 7229$, $\sum xy = 9940$

Then,

$SS_x = 14000 - \frac{280^2}{7} = 2800$

$SS_y = 7229 - \frac{213^2}{7} = 747.71$

$SS_{xy} = 9940 - \frac{280 \times 213}{7} = 1420$

$r = \frac{1420}{\sqrt{2800 \times 747.71}} = 0.98$

The relationship between the temperature and the amount of copper sulphate dissolved is almost
perfect. As temperature increases, the amount of copper sulphate dissolved increases.
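
The same calculation can be written as a short Python function built on the three sums of squares. The function name `correlation` and the variable names are assumptions made for this sketch; the data are the temperature and copper sulphate readings from the example.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / math.sqrt(ss_x * ss_y)

temps = [10, 20, 30, 40, 50, 60, 70]    # x, degrees C
grams = [17, 21, 25, 28, 33, 40, 49]    # y, grams of CuSO4

r = correlation(temps, grams)
print(round(r, 2))
```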

Examples of Scatter Plots

i. [Scatter plot] As x increases, y increases.
ii. [Scatter plot] As x increases, y decreases.
iii. [Scatter plot] No reasonable relationship.

(B) LINEAR REGRESSION

Least Square Method


Looking at the scatter diagram of a data set, we ask two questions: can we find a relationship
between x and y, and if so, how strong is the relationship?

The first step in answering these questions is to try to express the relationship as a mathematical
equation. There are many possible equations, but the simplest and most widely used is the linear
equation, or the equation of a straight line.

The least squares line can be used for interpolation, that is, to predict y for an x value that lies
between the measured x values.

Note: Prediction of y values for an x value beyond the range of the observed x values is a complex
problem that will not be treated in these lecture notes. Prediction beyond the range of observation
is called extrapolation.

Example
Mwangi and Kamau are partners in the chemistry lab. Their assignment is to determine how much
copper sulphate (CuSO4) will dissolve in water at 10, 20, 30, 40, 50, 60 and 70 °C. Their lab
results are shown in the table below, where y is the weight in grams of copper sulphate that will
dissolve in 100 g of water at x °C.

Lab results:
x (°C)   10   20   30   40   50   60   70
y (g)    34   46   48   53   55   63   65

Solution
Linear equation: y = a + bx
where y is the dependent variable and x is the independent variable,
a is the constant term (the y-intercept, at the point (0, a)),
b is the slope of the line.

To estimate a and b above, we proceed as follows. Determine
Σx = 280
Σx² = 14000
Σy = 364
Σy² = 19604
Σxy = 15900

Then,

$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 2800$$

$$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 676.0$$

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1340.0$$

$$b = \frac{SS_{xy}}{SS_x} = \frac{1340.0}{2800} = 0.48$$

$$a = \bar{y} - b\bar{x} = 32.86$$

where

$$\bar{x} = \frac{\sum x}{n} = 40, \qquad \bar{y} = \frac{\sum y}{n} = 52$$

The prediction equation is

y = 32.86 + 0.48x

Estimate y when x = 35 °C:
ŷ = 32.86 + 0.48 × 35 = 49.66. You could also use the calculator in REG mode.

Note: we are only allowed to predict for values of x which are within the data range.
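
The least squares calculation can be reproduced as follows. This is a sketch only; the function name `least_squares` is an illustrative assumption. Note that it keeps a and b unrounded, so its prediction at x = 35 differs slightly from the value obtained by hand with the rounded coefficients 32.86 and 0.48.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = ss_xy / ss_x                 # slope
    a = sum_y / n - b * sum_x / n    # intercept: y-bar minus b times x-bar
    return a, b

# Data from the Mwangi and Kamau example
temps = [10, 20, 30, 40, 50, 60, 70]
grams = [34, 46, 48, 53, 55, 63, 65]

a, b = least_squares(temps, grams)
print(round(a, 2), round(b, 2))  # 32.86 0.48

# Interpolation at x = 35 (inside the observed range 10 to 70)
y_hat = a + b * 35
print(round(y_hat, 2))  # 49.61 with unrounded a and b
```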

The Coefficient of Determination (r²)

$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$

E.g. if r = 0.90, then the coefficient of determination r² = (0.90)² = 0.81.
This means 81% of the variation in y can be explained by the relationship between x and y.

Standard Error of Estimate (SE)

The standard error of estimate of y on x is

$$SE = \sqrt{\frac{\sum (y - \hat{y})^2}{n}}$$

or, equivalently,

$$SE = \sqrt{\frac{\sum (y_o - \hat{y})^2}{n}}$$

where ŷ (or y_est) is the estimated y value and y (or y_o) is the observed value.

Exercise
Q1. The Food and Drug Administration is examining the effect of different doses of a new drug on
the pulse rate of human subjects. The results of the study on six people are given in the table below:

Dose, x    Drop, y
2.50       8
3.00       11
3.50       9
4.50       16
5.50       19
6.00       20

i. Draw a scatter diagram for the data.
ii. Find x̄, ȳ, b, a and the equation of the least squares line.
iii. Graph the least squares line on your scatter diagram above.
iv. If x = 2.75, what is the predicted value of y?

Q2. A civil service efficiency expert developed a test measuring job satisfaction of civil service
clerks. The following information was obtained from a random sample of 10 clerks.

x 48 92 32 56 20 72 16 56 76 80
y 13 2 14 10 14 6 17 8 3 7

Where; x is the job satisfaction index and y is the # of days absent from work in 1 year.
i. Draw a scatter diagram for the data
ii. From the scatter diagram, would you say slope is closest to 1, 0, or -1?
iii. Find the equation of the line.

(C) MULTIPLE AND PARTIAL CORRELATION


Linear regression of x1 on x2 and x3:

$$x_1 = b_{1.23} + b_{12.3} x_2 + b_{13.2} x_3$$

where b1.23, b12.3 and b13.2 are constants:

b12.3 is the partial regression coefficient of x1 on x2, holding x3 constant;
b13.2 is the partial regression coefficient of x1 on x3, holding x2 constant.

The constants are found by solving the following normal equations, where N is the number of observations:

$$\sum x_1 = b_{1.23} N + b_{12.3} \sum x_2 + b_{13.2} \sum x_3$$

$$\sum x_1 x_2 = b_{1.23} \sum x_2 + b_{12.3} \sum x_2^2 + b_{13.2} \sum x_2 x_3$$

$$\sum x_1 x_3 = b_{1.23} \sum x_3 + b_{12.3} \sum x_2 x_3 + b_{13.2} \sum x_3^2$$

(D) CORRELATION COEFFICIENTS


r12 is the correlation coefficient between x1 and x2
r13 is the correlation coefficient between x1 and x3
r23 is the correlation coefficient between x2 and x3

The coefficients r12, r13 and r23 are calculated using the usual formulas. For example,

$$r_{12} = \frac{SS_{x_1 x_2}}{\sqrt{SS_{x_1} \cdot SS_{x_2}}}$$

We can always use the calculator in REG mode to estimate these correlation coefficients.

(E) COEFFICIENT OF MULTIPLE CORRELATION

The coefficient of multiple correlation is given by

$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}}$$

In a similar manner, any other coefficient of multiple correlation can be calculated.

(F) PARTIAL CORRELATION


The coefficient of partial correlation is given by

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$

which is the coefficient of partial correlation between x1 and x2, keeping x3 constant, and

$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}}$$

which is the coefficient of partial correlation between x1 and x3, keeping x2 constant.
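
As an illustration of the multiple and partial correlation formulas, the sketch below evaluates them for three pairwise correlations. The input values 0.8, 0.7 and 0.6 are hypothetical, chosen only to show the arithmetic, and the function names are illustrative.

```python
import math

def multiple_correlation(r12, r13, r23):
    """R_1.23: multiple correlation of x1 on x2 and x3."""
    return math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))

def partial_correlation(r_ab, r_ac, r_bc):
    """r_ab.c: correlation between a and b, keeping c constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# Hypothetical pairwise correlations (not taken from the notes)
r12, r13, r23 = 0.8, 0.7, 0.6

R1_23 = multiple_correlation(r12, r13, r23)
r12_3 = partial_correlation(r12, r13, r23)   # x1 and x2, keeping x3 constant
r13_2 = partial_correlation(r13, r12, r23)   # x1 and x3, keeping x2 constant

print(round(R1_23, 3), round(r12_3, 3), round(r13_2, 3))
```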
Exercise
Q1. The table below shows the weights x1 to the nearest pound (lb), the heights x2 to the nearest
inch (in) and the ages x3 to the nearest year of 12 boys:

Weight (x1)   64  71  53  67  55  58  77  57  56  51  76  68
Height (x2)   57  59  49  62  51  50  55  48  52  42  61  57
Age (x3)       8  10   6  11   8   7  10   9  10   6  12   9

i. Find the least-squares regression equation of x1 on x2 and x3.
ii. Compute the linear correlation coefficients r12, r13 and r23.
iii. Compute the coefficients of linear multiple correlation R1.23 and R2.13.
iv. Compute the coefficients of linear partial correlation r12.3, r13.2 and r23.1.

6.0 ELEMENTARY PROBABILITY
This section deals with the basic techniques in elementary probability.

Probability of an Event
When we use probability in a statement, we are using a number between 0 and 1 to indicate the
likelihood of an event. We'll use the notation P(A), read as 'probability of event A', to
denote the probability of event A. The closer to 1 the probability assignment is, the more likely
the event is to occur. If the event A is certain to occur, then P(A) = 1.

Probability formula for relative frequency:

$$P(A) = \text{relative frequency} = \frac{f_A}{\sum f}$$

where f_A is the frequency of the event A and Σf is the sample size.

Probability formula when outcomes are equally likely:

$$P(A) = \frac{\text{\# of outcomes in favour of event } A}{\text{total \# of outcomes}}$$

E.g. what is the probability of correctly guessing the answer to a true/false question?

$$P(\text{correct answer}) = \frac{\text{\# of correct answers}}{\text{total \# of possible answers}} = \frac{1}{2}$$

Additive Law
P(A or B ) = P(A) + P(B) - P(A and B)

Multiplicative Law
If A and B are independent events, then;

P (A and B) = P(A) * P(B)

Example
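If 3 fair coins are tossed together, what is the probability of getting
i. Exactly 3 heads?
ii. Exactly 2 heads?
iii. At least 2 heads?
iv. Fewer than 2 heads?
v. At most 2 heads?

Solution
The sample space has 2 × 2 × 2 = 8 equally likely outcomes:
HHH, HHT, HTH, THH, HTT, THT, TTH, TTT

i. P(exactly 3 heads) = 1/8 (HHH only)
ii. P(exactly 2 heads) = 3/8 (HHT, HTH, THH)
iii. P(at least 2 heads) = P(2 heads) + P(3 heads) = 3/8 + 1/8 = 1/2
iv. P(fewer than 2 heads) = 1 - P(at least 2 heads) = 1/2
v. P(at most 2 heads) = 1 - P(3 heads) = 7/8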
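
The coin example can be verified by listing all eight equally likely outcomes (each coin is H or T) and counting the favourable ones. The following is a minimal sketch; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 2 x 2 x 2 = 8 equally likely outcomes, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

def prob(condition):
    """P(event) = favourable outcomes / total outcomes.

    `condition` receives the number of heads in an outcome.
    """
    favourable = sum(1 for o in outcomes if condition(o.count("H")))
    return Fraction(favourable, len(outcomes))

print(prob(lambda heads: heads == 3))  # 1/8
print(prob(lambda heads: heads == 2))  # 3/8
print(prob(lambda heads: heads >= 2))  # 1/2
print(prob(lambda heads: heads < 2))   # 1/2
print(prob(lambda heads: heads <= 2))  # 7/8
```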

Note: The sum of all the probabilities assigned to outcomes in a sample space must be one. For
example, if you think the probability is 0.65 that you will win a tennis match, then you assume the
probability is 0.35 that your opponent will win.

If the probability that an event occurs is denoted by p and the probability that it does not occur is
denoted by q, then

p + q = 1

q = 1 - p

CONDITIONAL PROBABILITY

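The table below classifies 104 families according to whether they own a house and whether they are
taking a summer vacation this year.

                        Summer vacation this year
                        Yes     No      Total
Own a house    Yes      37      8       45
               No       40      19      59
               Total    77      27      104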

i. Find the probability that a randomly selected family is taking a summer vacation this year
ii. Find the probability that a randomly selected family is taking a summer vacation this year,
given that they own a house.
iii. Are the events of owning a house and taking a summer vacation this year independent
events or mutually exclusive?

PERMUTATIONS AND COMBINATIONS
Factorial Notation
n! = n (n-1) (n-2) (n-3) . . . 3 . 2 . 1

E.g. 5! = 5 . 4 . 3 . 2 . 1 = 120
1! = 1 and 0! = 1 by definition
10! = 10 . 9 . 8 . 7 . 6 . 5 . 4 . 3 . 2 . 1 = 3,628,800

Permutations
This is an arrangement of objects in a particular order. E.g. for the letters A, B, C, how many
arrangements are possible taking the three at a time?

ABC ACB
BAC BCA     (only six ways)
CAB CBA

How many arrangements are possible taking two at a time?

AB, BA, BC, CB, AC, CA: there are 6 possible arrangements.

Listing every arrangement by hand becomes unmanageable for a large set of objects, so we use a
formula instead.

A permutation of n objects taking r at a time (nPr), where r ≤ n, is calculated using the formula
below:

$${}^{n}P_{r} = \frac{n!}{(n-r)!}$$

3P3 is a permutation (arrangement) of 3 objects taking 3 at a time (where order is important):

$${}^{3}P_{3} = \frac{3!}{(3-3)!} = \frac{3!}{0!} = \frac{3 \cdot 2 \cdot 1}{1} = 6$$

A permutation of 3 objects taking 2 at a time:

$${}^{3}P_{2} = \frac{3!}{(3-2)!} = \frac{3!}{1!} = \frac{3 \cdot 2 \cdot 1}{1} = 6$$

Permutations with identical items:

The number of permutations of n items taken n at a time, when p of the items are identical and the
rest are all different, is equal to

$$\frac{n!}{p!}$$

If, in addition, another q of the items are identical to one another, the number is n!/(p! q!).

Example
In how many ways can the letters in the word LETTER be arranged?

Solution
n = 6, p = 2 and q = 2 (the letters E and T are each repeated twice)

$$\text{number of arrangements} = \frac{n!}{p!\,q!} = \frac{6!}{2!\,2!} = 180$$

Exercises
Q1. In how many ways can the letters in the words be arranged?
(i) TROTTING (ii) MATRICES (iii) BESIEGE
(iv) PARALLEL

Combinations
This is a selection of objects where order is not important.

ABC = ACB = BCA

All three arrangements are equivalent because they consist of the same items, and are therefore
considered as one combination.

Note: AB = BA is the same combination.

The number of combinations of n objects taking r at a time (nCr) is

$${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$

E.g. let n = 10 and r = 3 (a selection from 10 objects taking 3 at a time):

$${}^{10}C_{3} = \frac{10!}{3!\,(10-3)!} = \frac{10!}{3!\,7!} = 120$$
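
The permutation and combination results above can be checked with Python's standard library (`math.perm` and `math.comb`, available from Python 3.8). The helper `arrangements` is an illustrative name for the identical-items formula; it divides n! by the factorial of each letter's count.

```python
import math
from collections import Counter

def arrangements(word):
    """Distinct orderings of the letters of `word`: n! / (p! q! ...)."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)  # divide out each group of identical letters
    return total

print(math.perm(3, 3))         # 6   arrangements of A, B, C taking 3 at a time
print(math.perm(3, 2))         # 6   arrangements of A, B, C taking 2 at a time
print(math.comb(10, 3))        # 120 selections of 3 objects from 10
print(arrangements("LETTER"))  # 180
```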

Discrete Probability Distributions


(A) BINOMIAL DISTRIBUTION
Binomial Experiments
A binomial experiment is a probability experiment that satisfies the following conditions:
i. The experiment is repeated for a fixed number of trials, where each trial is independent of
the other trials.
ii. There are only two possible outcomes of interest for each trial. The outcomes can be classified
as success (S) or failure (F).
iii. The probability of success, P(S), is the same for each trial.
iv. The random variable x counts the number of successful trials.
v. The results from a binomial experiment give a binomial distribution.

The following notation is used for binomial experiments:

n is the total # of observations (trials)
x is the # of successes
p is the probability of success, P(S)
q is the probability of failure, P(F)
P(x) is the probability of x successes
The Binomial Probability Formula: in a binomial experiment, the probability of exactly x
successes in n trials is

$$P(x) = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x} = {}^{n}C_{x}\, p^x q^{n-x}$$

Example:
Consider 4 patients being treated for a particular disease. The probability of a patient being cured
(all signs and symptoms of the disease are alleviated) is 0.6. If we assume that the outcomes (cured
or not cured) in the patients are independent, then we can answer questions like "what is the
probability that exactly 2 patients are cured?"

Solution:
With 4 observations (patients), what is the probability of 2 successes (2 patients are cured)?

$$P(x) = {}^{n}C_{x}\, p^x q^{n-x}$$

Here x = 2, n = 4, p = 0.6 and q = 1 - p = 1 - 0.6 = 0.4, so

$$P(2) = {}^{4}C_{2}\,(0.6)^2 (0.4)^{4-2} = 6 \times 0.36 \times 0.16 = 0.3456 \approx 0.35$$

Use the calculator to get the answer.

Example
It has been found that the probability that a child in a family is male is 0.4. In a family of 6
children, what is the probability that
i. Exactly 4 are boys?
ii. At least 3 are boys?

Solution
i. In this case n = 6, p = 0.4, q = 0.6 and x = 4, so

$$P(4) = {}^{6}C_{4}\,(0.4)^4 (0.6)^{6-4} = 15 \times 0.0256 \times 0.36 = 0.1382$$

ii. In this case n = 6, p = 0.4 and x = 3, 4, 5 or 6, so

P(at least 3 boys) = P(3) + P(4) + P(5) + P(6)
= 0.2765 + 0.1382 + 0.0369 + 0.0041 = 0.4557

Each term is calculated as in (i) above.
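
The binomial probabilities in the two examples can be computed directly from the formula P(x) = nCx p^x q^(n-x). This is a small sketch; `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Patients example: n = 4, p = 0.6, exactly 2 cured
p_two_cured = binomial_pmf(2, 4, 0.6)
print(round(p_two_cured, 4))   # 0.3456

# Family example: n = 6, p = 0.4
p_four_boys = binomial_pmf(4, 6, 0.4)
p_at_least_three = sum(binomial_pmf(x, 6, 0.4) for x in range(3, 7))
print(round(p_four_boys, 4))       # 0.1382
print(round(p_at_least_three, 4))  # 0.4557
```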

Exercise
Q1. A botanist has developed a new hybrid cotton plant that can withstand insects better than
other cotton plants. However, there is some concern about the germination of seeds from the new
plants. To estimate the probability that a seed from the new plant will germinate a random sample
of 3,000 seeds were planted in warm, moist soil. Of these seeds, 2,430 germinated.
i. What is the probability that a seed will germinate?
ii. What is the probability that a seed will not germinate?
iii. Are the outcomes in this sample space equally likely?

Population Parameters of a Binomial Distribution


Mean: µ = np

Variance: σ² = npq

Standard deviation: σ = √(npq)

Example
In Njoro, Nakuru, about 57% of the days in a year are cloudy. Find the mean, variance and
standard deviation for the number of cloudy days during the month of June. What can you
conclude?

Solution
There are 30 days in June. Using
n = 30, p = 0.57 and q = 0.43
you can find the mean, variance and standard deviation as illustrated below:

Mean: µ = np = 30 × 0.57 = 17.1

Variance: σ² = npq = 30 × 0.57 × 0.43 = 7.353

Standard deviation: σ = √(npq) = √7.353 ≈ 2.71

So you can conclude that, on average, there are 17.1 cloudy days during the month of June.
The standard deviation is about 2.71 days.

(B) MULTINOMIAL DISTRIBUTION


If each trial in an experiment has more than two outcomes, then a distribution called the
Multinomial Distribution must be used. For example, a survey might require the responses
'approve', 'disapprove' or 'no opinion'.
Formula for the Multinomial Distribution:

$$P(x) = \frac{n!}{x_1!\, x_2!\, x_3! \cdots x_k!}\; p_1^{x_1}\, p_2^{x_2} \cdots p_k^{x_k}$$

where x1 + x2 + ... + xk = n and p1, p2, ..., pk are the probabilities of the k outcomes.

Example
In a music store, a manager found that the probabilities that a person buys zero, one, or two or more
CDs are 0.3, 0.6 and 0.1, respectively. If six customers enter the store, find the probability that one
won't buy anything, three will buy one CD each, and two will buy two or more CDs.

Solution
n = 6, x1 = 1, x2 = 3, x3 = 2, p1 = 0.3, p2 = 0.6 and p3 = 0.1. Then

$$P(x) = \frac{6!}{1!\,3!\,2!}\,(0.3)^1 (0.6)^3 (0.1)^2 = 60\,(0.3)(0.216)(0.01) = 0.03888$$

(C) THE GEOMETRIC DISTRIBUTION


A geometric distribution is a discrete probability distribution of a random variable x that satisfies
the following conditions:
1. A trial is repeated until a success occurs.
2. The repeated trials are independent of each other.
3. The probability of success p is constant for each trial.
4. The probability that the first success will occur on trial number x is

$$P(x) = p\,q^{x-1}, \quad \text{where } q = 1 - p$$

Example
From experience you know that the probability that you will make a sale on any given telephone
call is 0.23. Find the probability that your first sale on any given day will occur on your fourth or
fifth sales call.

Solution
Using p = 0.23, q = 0.77 and x = 4, you have

P(4) = 0.23 (0.77)³ ≈ 0.105

Using p = 0.23, q = 0.77 and x = 5, you have

P(5) = 0.23 (0.77)⁴ ≈ 0.081

So the probability that your first sale will occur on the fourth or fifth sales call is

P(sale on fourth or fifth sales call) = P(4) + P(5) = 0.105 + 0.081 ≈ 0.186
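
The geometric probabilities in this example follow directly from P(x) = p q^(x-1). A minimal sketch, with `geometric_pmf` as an illustrative name:

```python
def geometric_pmf(x, p):
    """Probability that the first success occurs on trial number x."""
    return p * (1 - p) ** (x - 1)

p4 = geometric_pmf(4, 0.23)
p5 = geometric_pmf(5, 0.23)
print(round(p4, 3), round(p5, 3), round(p4 + p5, 3))  # 0.105 0.081 0.186
```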

(D) THE POISSON DISTRIBUTION
The Poisson Distribution is a discrete probability distribution of a random variable x that satisfies
the following conditions:
1. The experiment consists of counting the number of times, x, an event occurs in a given
interval. The interval can be an interval of time, area or volume.
2. The probability of the event occurring is the same for each interval.
3. The number of occurrences in one interval is independent of the number of occurrences in
other intervals.
4. The probability of exactly x occurrences in an interval is

$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$

where e is an irrational number approximately equal to 2.71828 and µ is the mean number of
occurrences per interval unit.

Example
The mean number of accidents per month at a certain intersection is 3. What is the probability that
in any given month 4 accidents will occur at this intersection?

Solution
Using x = 4 and µ = 3, the probability that 4 accidents will occur in any given month at the
intersection is

$$P(4) = \frac{3^4 (2.71828)^{-3}}{4!} \approx 0.168$$
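
The Poisson calculation can be reproduced as below. This is a sketch with an illustrative function name, `poisson_pmf`; it uses `math.exp` rather than the rounded value 2.71828.

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean per interval is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

p_four_accidents = poisson_pmf(4, 3)
print(round(p_four_accidents, 3))  # 0.168
```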

Exercise
Using a binomial distribution, find the probabilities:
1. A surgical technique is performed on seven patients. You are told there is a 70% chance of
success. Find the probability that the surgery is successful for
i. Exactly five patients
ii. At least five patients
iii. Fewer than five patients
2. 64% of men consider themselves football fans. You randomly select 10 men and ask each if he
is a football fan. Find the probability that the number who consider themselves football fans is
i. Exactly eight
ii. At least eight
iii. Fewer than eight
3. 48% of the people in Kenya have O+ blood. You randomly select ten Kenyans and ask them
if their blood type is O+. Find the probability that the number who have O+ blood is
i. Exactly eight
ii. At least eight
iii. Fewer than eight

Using a geometric distribution find the probabilities;


4. A cereal maker places a game piece in its cereal boxes. The probability of winning a prize in
the game is one in four. Find the probability that you
i. Win your first prize with your fourth purchase
ii. Win your first prize with your first, second or third purchase
iii. Do not win a prize with your first four purchases.

5. An auto parts seller finds that one in every 100 parts sold is defective. Find the probability that;
i. The first defective part is the tenth part sold,
ii. The first defective part is the first, second, or third part sold, and
iii. None of the first 10 parts sold are defective.

Using a Poisson distribution, find the probabilities:

6. The mean number of business failures per month in Nakuru town in the last one year was about
8. Find the probability that
i. Exactly 4 businesses will fail in any given month.
ii. At least 4 businesses will fail in any given month.
iii. More than 4 businesses will fail in any given month.

7. A newspaper finds that the mean number of typographical errors per page is four. Find the
probability that;
i. Exactly three typographical errors will be found on a page
ii. At most three typographical errors will be found on a page
iii. More than three typographical errors will be found on a page

7.0 NORMAL DISTRIBUTION
This section introduces the reader to the basic techniques for determining whether a frequency
distribution is normal or not. At the end of this section you should also be able to determine the
area under the standard normal curve for any given interval of z values.

Normal Curve
The normal distribution (Gaussian curve) is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x-\mu)^2 / (2\sigma^2)}$$

where π ≈ 3.1416,
σ is the standard deviation of the population,
μ is the mean of the population,
e ≈ 2.718,
x is the value of an observed variable, and
f(x) denotes that f is a function of x.

The normal distribution is biologically the most important of all probability distributions since its
'bell-shaped', symmetrical curve nicely describes the majority of variables encountered in the
biological sciences.

Standard Normal Curve

For the standard normal curve, μ (the mean) = 0 and σ (the standard deviation) = 1, written as N(0,1). Then

$$f(z) = \frac{1}{\sqrt{2\pi}}\; e^{-z^2/2}$$

The curve is 'bell-shaped'.

(a) Area Under the Standard Normal Curve

Example
(Hint: use the z-tables provided)
Find the area under the standard normal curve

i. between z = 0 and z = 1.

Area between 0 and 1 = 0.3413 sq. units

ii. between z = 1 and z = 2.

Area between 1 and 2 = (area between 0 and 2) - (area between 0 and 1)
= 0.4772 - 0.3413 = 0.1359

iii. for z ≥ 2.

Area between 2 and ∞ = (area between 0 and ∞) - (area between 0 and 2)
= 0.5000 - (area between 0 and 2)
= 0.5000 - 0.4772 = 0.0228

iv. between z = -1 and z = 2.

Area between -1 and 2 = (area between -1 and 0) + (area between 0 and 2)
= (area between 0 and 1) + (area between 0 and 2) = 0.3413 + 0.4772 = 0.8185

v. between z = -2 and z = 0.

Area between -2 and 0 = (area between 0 and 2), the mirror image of the area between -2 and 0
= 0.4772
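
Where z-tables are not to hand, the same areas can be computed from the standard normal cumulative distribution, which can be written using the error function in Python's `math` module. This is a sketch only; `phi` and `area` are illustrative names. Small differences in the fourth decimal place from the table-based answers come from rounding in the tables.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area(a, b):
    """Area under the standard normal curve between z = a and z = b."""
    return phi(b) - phi(a)

print(round(area(0, 1), 4))    # 0.3413
print(round(area(1, 2), 4))    # 0.1359
print(round(1 - phi(2), 4))    # 0.0228  (z >= 2)
print(area(-1, 2))             # about 0.8186 (tables: 0.3413 + 0.4772 = 0.8185)
print(round(area(-2, 0), 4))   # 0.4772
```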

Probabilities (P)

P(a ≤ z ≤ b) = the probability that a z value is between a and b
= area under the curve between a and b

$$= \int_a^b f(z)\,dz$$

Example (hint: use the z-tables)

Determine the following probabilities:

i. P(0 ≤ z ≤ 1)

Solution i:
P(0 ≤ z ≤ 1) = (area between 0 and 1)
= 0.3413

ii. P(1 ≤ z ≤ 2)

Solution ii:
P(1 ≤ z ≤ 2) = (area between 0 and 2) - (area between 0 and 1)
= P(0 ≤ z ≤ 2) - P(0 ≤ z ≤ 1)
= 0.4772 - 0.3413
= 0.1359

iii. P(z ≥ 2)

Solution iii:
P(z ≥ 2) = (area between 0 and ∞) - (area between 0 and 2)
= P(0 ≤ z ≤ ∞) - P(0 ≤ z ≤ 2)
= 0.5000 - 0.4772
= 0.0228

iv. P(-1 ≤ z ≤ 2)

Solution iv:
P(-1 ≤ z ≤ 2) = (area between 0 and 1) + (area between 0 and 2)
= P(0 ≤ z ≤ 1) + P(0 ≤ z ≤ 2)
= 0.3413 + 0.4772
= 0.8185

For a Normal Distribution of x’s

Example
Let x have a normal distribution with μ = 10 and σ = 2. Find the probability that an x value
selected at random from the distribution is

i. between 9 and 11

Solution i:
P(9 ≤ x ≤ 11)

We first convert the x values to z values. To do this, we use the formula

$$z = \frac{x - \mu}{\sigma}$$

to convert the given x interval to a z interval.

z1 = (9 - 10)/2 = -1/2 = -0.5 (use x = 9, μ = 10 and σ = 2).

z2 = (11 - 10)/2 = 1/2 = 0.5 (use x = 11, μ = 10 and σ = 2).

P(9 ≤ x ≤ 11) = P(-0.5 ≤ z ≤ 0.5) = P(-0.5 ≤ z ≤ 0) + P(0 ≤ z ≤ 0.5)
= 2 P(0 ≤ z ≤ 0.5)
= 2 × (area between 0 and 0.5)
= 2 × 0.1915
= 0.3830

ii. between 11 and 14

Solution ii:
P(11 ≤ x ≤ 14)

Again we convert the given x interval to a z interval using z = (x - μ)/σ.

z1 = (11 - 10)/2 = 1/2 = 0.5 (use x = 11, μ = 10 and σ = 2).

z2 = (14 - 10)/2 = 4/2 = 2 (use x = 14, μ = 10 and σ = 2).

P(11 ≤ x ≤ 14) = P(0.5 ≤ z ≤ 2) = P(0 ≤ z ≤ 2) - P(0 ≤ z ≤ 0.5)
= (area between 0 and 2) - (area between 0 and 0.5)
= 0.4772 - 0.1915
= 0.2857
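
As a cross-check on the conversion from x to z, Python's `statistics.NormalDist` (Python 3.8+) can work with μ = 10 and σ = 2 directly. This is a sketch; the variable names are illustrative, and the results agree with the table-based answers to about three decimal places.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=10, sigma=2)

# P(9 <= x <= 11) and P(11 <= x <= 14)
p_9_to_11 = x_dist.cdf(11) - x_dist.cdf(9)
p_11_to_14 = x_dist.cdf(14) - x_dist.cdf(11)

# The same z values as in the worked solution
z1 = (11 - x_dist.mean) / x_dist.stdev   # 0.5
z2 = (14 - x_dist.mean) / x_dist.stdev   # 2.0

print(round(p_9_to_11, 3), round(p_11_to_14, 3))  # 0.383 0.286
```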

If the probability involves a sample mean (x̄), then the conversion formula is

$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Example

The mean of an x distribution is μ = 2,500 and the standard deviation is σ = 300. Determine the
probability that a random sample of 42 observations taken from this distribution has a mean that
lies between 2,350 and 2,650.

Solution
P(2350 ≤ x̄ ≤ 2650)
We first convert the x̄ values to z values using the formula above, with the standard deviation 300 in place of s.

Then,

$$z_{2350} = \frac{2350 - 2500}{300/\sqrt{42}} = -3.24 \quad \text{and} \quad z_{2650} = \frac{2650 - 2500}{300/\sqrt{42}} = 3.24$$

Therefore,

P(2350 ≤ x̄ ≤ 2650) = P(-3.24 ≤ z ≤ 3.24)
= 2 P(0 ≤ z ≤ 3.24) = 2 (0.4994)
= 0.9988

The probability that x̄ is between 2,350 and 2,650 is 0.9988.

Exercise
The average age of a vehicle registered in the US is 8 years, or 96 months. Assume the standard
deviation is 16 months. If a random sample of 36 cars is selected, find the probability that the
sample mean is between 90 and 100 months.

Finite Population Correction Factor


The formula for the standard error of the mean, s/√n, is accurate when the samples are drawn
with replacement, or are drawn without replacement from a very large or infinite population. Since
sampling with replacement is for the most part unrealistic, a correction factor is necessary for
computing the standard error of the mean of a sample drawn without replacement from a finite
population. Compute the correction factor using the following formula:

$$\text{Correction factor} = \sqrt{\frac{N - n}{N - 1}}$$

where N is the population size and n is the sample size. Therefore, the formula for computing z is

$$z = \frac{\bar{x} - \mu}{\dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}}$$

Exercise
The average weight of a group of male adults is 160 pounds. The standard deviation is 10 pounds.
If 30 males are selected from a population of 300, find the probability that the sample mean will be
less than 156 pounds.

Other Examples of Normal Curves

Areas under Any Normal Curve


In many applied situations the original normal curve is not the standard normal curve. This does
not mean that we cannot find the probability that a measurement x will fall in an interval from a to
b. What we must do is convert the original measurements x, a and b to z values. To do this, we use
the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Example
Let x have a normal distribution with μ = 10 and σ = 2. Find the probability that an x value
selected at random from this distribution is between 11 and 14.

Solution
P(11 ≤ x ≤ 14), with z = (x - μ)/σ

For x = 11, μ = 10, σ = 2: z1 = (11 - 10)/2 = 0.50

For x = 14, μ = 10, σ = 2: z2 = (14 - 10)/2 = 2.00

So P(11 ≤ x ≤ 14) = P(0.50 ≤ z ≤ 2.00) = 0.4772 - 0.1915 = 0.2857

Exercise 1
Q1. Sketch the areas under the standard normal curve over the indicated intervals and find the area:
i. Area between z = 0 and z = 2.92
ii. Area between z = -2.18 and z = 1.34
iii. Area to the right of z = 0.15

Q2. Find the indicated probability and shade the corresponding area under the standard normal
curve:
i. P(0 ≤ z ≤ 1.62)
ii. P(-0.45 ≤ z ≤ 2.73)
iii. P ( z  - 2.15)

Q3. Assume that x has a normal distribution, with the specified mean and standard deviation. Find
the indicated probabilities;
i. P(3 ≤ x ≤ 6), μ = 4, σ = 2
ii. P(50 ≤ x ≤ 70), μ = 40, σ = 15
iii. P (x  120),  = 100,  = 15

Q4. The ages of workers in the Mafuko bakery are normally distributed, with a mean of 45 years
and a standard deviation of 12 years. A worker is stopped at random and asked to fill out a
questionnaire. What is the probability that this worker is:
i. Less than 30 years old ?
ii. Between 35 and 55 years old ?
iii. More than 60 years old ?

Exercise 2

Q1. Find the following probabilities;


i. P (z  1.05) ii. P(z  2.55) iii. P (z  -1.95)

iv. P(0 ≤ z ≤ 1.64) v. P(-1.65 ≤ z ≤ 1.65)
vi. P (z  -1.05 or z  2.55)

Q2. x is a random variable with a normal distribution. Estimate the probability that x falls in the
indicated interval:
i. μ = 7, σ = 1.75, estimate P(5.25 ≤ x ≤ 8.75)
ii. μ = 20, σ = 5.4, estimate P(9.2 ≤ x ≤ 30.8)

Q3 The life span of a tire is normally distributed with a mean of 30,000 miles and a standard
deviation of 2,000 miles. Estimate the probability that a tire’s life span is between 30,000 and
34,000 miles.

Q4. The time per week a student uses a lab computer is normally distributed with a mean of 6.2
hours and a standard deviation of 0.9 hour. You are planning the schedule for the computer lab. Of
2000 students, estimate the number of students who will use a lab computer for the given number
of hours:
i. Less than 5.3 hours
ii. Between 5.3 and 7.1 hours
iii. More than 7.1 hours

Q5. In a population survey of patients in a rehabilitation hospital, the mean length of stay in the
hospital was 12.0 weeks with a standard deviation of 1.0 weeks. The data is normally distributed.
What is the probability that;

i. a patient is likely to be in for less than 10 weeks?


ii. a patient is likely to be in for longer than 12 weeks?
iii. a patient is likely to be in for a period between 10 and 12 weeks?

Q6. The mean height of college male students is 70 inches with a standard deviation of 3 inches.
If we took a sample of 16 male students, what is the probability that their mean height is ;

i. Greater than 72 inches?


ii. Between 70 and 72 inches?
iii. Less than 68 inches?
Approximating a Binomial Distribution with a Normal Distribution
Remember:

Mean, μ = np, and

Standard deviation, σ = √(npq)

Note: the correction for continuity (±0.5) is required.

Example
Seven percent of the people in Kenya have type O- blood. You randomly select 30 people and
ask them if their blood type is O-. Find the probability that
i. Exactly 4 people say they have O- blood
ii. At least 4 people say they have O- blood
iii. Fewer than 4 people say they have O- blood.

Solution:
n = 30
p = 0.07, so q = 1 - 0.07 = 0.93

μ = np = 30 × 0.07 = 2.1

σ = √(npq) = √(30 × 0.07 × 0.93) ≈ 1.4

In each case x is converted to z using

$$z = \frac{x - \mu}{\sigma}$$

i. Exactly 4 → probability that x lies between 3.5 and 4.5 (correction for continuity)

When x = 3.5, z = (3.5 - 2.1)/1.4 = 1.00

When x = 4.5, z = (4.5 - 2.1)/1.4 = 1.71

Then, exactly 4 → probability that z lies between 1.00 and 1.71

ii. At least 4 → 4 or more → greater than 3.5 (correction for continuity)

For x = 3.5, z = (3.5 - 2.1)/1.4 = 1.00

Then, at least 4 → probability that z lies between 1.00 and ∞

iii. Fewer than 4 → 3 or fewer → less than 3.5 (correction for continuity)

For x = 3.5, z = (3.5 - 2.1)/1.4 = 1.00

Then, fewer than 4 → probability that z lies between -∞ and 1.00
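
To see how the continuity correction works in practice, the sketch below computes the three normal approximations for the blood type example and, for comparison, the exact binomial values. It takes "at least 4" as x greater than 3.5 and "fewer than 4" as x less than 3.5; the variable names are illustrative, and σ is kept unrounded, so the z values differ very slightly from those worked by hand.

```python
import math
from statistics import NormalDist

n, p = 30, 0.07
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
approx = NormalDist(mu=mu, sigma=sigma)

def exact_pmf(x):
    """Exact binomial probability of x successes."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Normal approximations with the +/- 0.5 continuity correction
approx_exactly_4 = approx.cdf(4.5) - approx.cdf(3.5)
approx_at_least_4 = 1 - approx.cdf(3.5)
approx_fewer_than_4 = approx.cdf(3.5)

# Exact binomial values for comparison
exact_exactly_4 = exact_pmf(4)
exact_fewer_than_4 = sum(exact_pmf(x) for x in range(4))

print(approx_exactly_4, exact_exactly_4)
print(approx_fewer_than_4, exact_fewer_than_4)
print(approx_at_least_4, 1 - exact_fewer_than_4)
```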

Exercise
Q1. Twenty-nine percent of people in the USA say they are confident that passenger trips to the
moon will occur in their lifetime. You randomly select 200 people in the USA and ask each if
he or she thinks passenger trips to the moon will occur in his or her lifetime. What is the
probability that at least 50 will say yes?

Q2. Twenty-four percent of people in Kenya have A (+) blood. You randomly select 32 people and
ask them if their blood type is A(+), find the probability that;
i. exactly 12 say they have A(+) blood
ii. at least 12 say they have A(+) blood
iii. fewer than 12 say they have A(+) blood
iv. at most 12 say they have A(+) blood

8.0 CONFIDENCE INTERVALS
Estimating Population Parameters

The sample mean, x̄, is an unbiased estimator of the population mean, μ:

$$\bar{x} = \frac{\sum x}{n}$$

The maximum error of estimate, or margin of error (error tolerance), E, is the greatest possible
distance between the point estimate and the value of the parameter it is estimating.

$$E = z_c\, \sigma_{\bar{x}} = z_c \frac{\sigma}{\sqrt{n}}, \quad \text{for large samples}$$

or

$$E = t_c\, \sigma_{\bar{x}} = t_c \frac{s}{\sqrt{n}}, \quad \text{for small samples}$$

When n ≥ 30, the sample standard deviation, s, can be used in place of σ.

Note: the confidence level must be stated.

E.g. at 95% confidence, z_c = 1.96 (from the z-tables, α = 0.05)

Example
If s = 5.0, n = 54 and x̄ = 12.4, then at 95% confidence z_c = 1.96 and

$$E = 1.96 \times \frac{5.0}{\sqrt{54}} \approx 1.3$$

This means that at 95% confidence, the maximum error of estimate for the population mean, μ, is
about 1.3.

Exercise
Q1. Find the maximum error of estimate for the given values of c, s and n;
i. c = 0.90, s = 2.5, n = 36
ii. c = 0.95, s = 3.0, n = 60
iii. c = 0.99, s = 3.4, n = 100

Confidence Interval (CI)


The confidence interval for the population mean, μ, is

x̄ - E < μ < x̄ + E

Then, for the above example,

12.4 - 1.3 < μ < 12.4 + 1.3

11.1 < μ < 13.7

which is the 95% confidence interval.
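
The large-sample interval above can be computed as follows. This is a minimal sketch assuming the z-based margin of error E = z_c s/√n; the function name `confidence_interval` is illustrative, and z_c is obtained from `statistics.NormalDist` instead of the tables.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, s, n, level):
    """Large-sample confidence interval for the population mean.

    Returns (lower limit, upper limit, margin of error E).
    """
    z_c = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 for level = 0.95
    margin = z_c * s / math.sqrt(n)
    return mean - margin, mean + margin, margin

low, high, E = confidence_interval(mean=12.4, s=5.0, n=54, level=0.95)
print(round(E, 1), round(low, 1), round(high, 1))  # 1.3 11.1 13.7
```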

A demonstration for constructing a Confidence Interval (CI)


Suppose a wheat council wanted to be 90% confident of its estimate for the mean yield of the last
season’s wheat crop.

Here is an overview of how to construct an interval estimate

From the interval estimate, 31.2 < µ < 33.8


So the wheat council can be 90% confident that the mean yield for the last season's wheat
crop is between 31.2 and 33.8 bushels per acre.

Exercise
Q1. Construct the indicated confidence intervals for population mean;
i. c = 0.90, x̄ = 12.5, s = 2.0, n = 6
ii. c = 0.95, x̄ = 13.4, s = 0.85, n = 8
iii. c = 0.99, x̄ = 14.0, s = 2.0, n = 10

Q2. In 36 randomly selected seawater samples, the mean sodium concentration was 23 cc and the
standard deviation was 6.7 cc. Construct a 95% confidence interval for the population mean.
Q3. Determine the minimum required sample size if you want to be 95% confident that the
sample mean is within one unit of the population mean, given σ = 4.8. Assume the population is
normally distributed.
Q4. In a random sample of 19 patients at a hospital's minor emergency department, the mean
waiting time before seeing a medical professional was 23 min and the standard deviation
was 11 min. Construct a 95% confidence interval for the population mean. Assume the waiting
times are normally distributed.
Q5. You randomly selected 16 hotels and measured the temperature of tea sold at each. The mean
temperature is 162 °F with a standard deviation of 10 °F. Construct a 95% confidence interval for
the population mean. Assume the temperatures are approximately normally distributed.

Estimation of the Minimum Sample Size (n)


The minimum sample size required for the estimation of the population mean, μ, is

$$n = \left( \frac{z_c\, \sigma}{E} \right)^2$$

where z_c is the value of z for the given confidence level and E is the maximum error of estimate.
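
The sample size formula can be applied as in the sketch below. The inputs (σ = 6.0, E = 2.0, 95% confidence) are hypothetical values chosen for illustration, and rounding the result up to the next whole number is an assumption made here so that the required precision is met.

```python
import math
from statistics import NormalDist

def min_sample_size(sigma, E, level):
    """Smallest n with margin of error at most E: n = (z_c * sigma / E)^2, rounded up."""
    z_c = NormalDist().inv_cdf((1 + level) / 2)
    return math.ceil((z_c * sigma / E) ** 2)

# Hypothetical inputs, not taken from the exercises
n_required = min_sample_size(sigma=6.0, E=2.0, level=0.95)
print(n_required)  # 35
```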

Exercise
Q1. Determine the minimum required sample size if you want to be 95% confident that the
sample mean is within one unit (E = 1) of the population mean, given σ = 4.8. Assume the
population is normally distributed. Repeat the calculation for a sample mean within two units
(E = 2) of the population mean.

Q2. Determine the minimum required sample size if you want to be 99% confident that the
sample mean is within two units of the population mean, given σ = 1.4. Assume the population is
normally distributed.

Q3. An admissions director wants to estimate the mean age of all students enrolled at a college.
The estimate must be within 1 year of the population mean. Assume the population of ages is
normally distributed. Determine the minimum required sample size to construct a 90% confidence
interval for the population mean, given σ = 1.2 years.

8.0 Revision Exercises
Q1. (a) Briefly describe the difference between the following terms as used in statistics;
i. A population and a sample
ii. A discrete and a continuous variable
iii. Descriptive and inferential statistics
(b) State whether each of the following measurements is discrete or continuous;
i. the average number of babies born in a certain clinic each week
ii. the weight of 100 goats
iii. the average daily temperatures
iv. the distance between the planets in our solar system
v. the number of pineapples in 10 x 10 m plots

(c) Which of the following values cannot be a probability of an event;


0.8, -0.7, 1.2, 0.002, 20%, 120% , 1/3

(d) The data below show the CAT scores for 45 students in MATH 100.
(i) Use these data to complete the frequency table that follows:

27 26 23 21 18 21 16 16 8
24 21 24 22 20 20 17 10 13
24 21 23 24 20 20 17 12 11
24 21 21 20 20 21 17 13 3
26 22 23 18 17 21 17 7 2

Test Scores (classes) Tally Frequency (f)

1–5
6 – 10
11 – 15
16 – 20
21 – 25
26 – 30

(ii) Draw a frequency histogram for these scores

Q2. The weights of students in the MATH 210 class were determined and recorded as below;
51 56 62 58 57
62 66 64 62 65

(a) Determine;
i. the mean, median and mode of these weights
ii. the range and standard deviation of these weights
(b) Determine the mean and standard deviation for the following frequency distribution;

x 3 4 5 8
Freq 6 3 4 2

Q3. Given the data set below;

x 10 15 18 1 4 7 14
y 3 2 0 8 6 4 3

i. compute the correlation coefficient (r)
ii. comment on the relationship between x and y
iii. determine x̄ and ȳ, and a and b for the equation y = a + bx
iv. write the prediction equation
v. estimate y when x = 12 and x = 19.

Exercise
Q1. (a) Briefly describe the difference between the following terms as used in statistics;
i. A population and a sample
ii. A population parameter and a sample statistic
iii. A discrete variable and continuous variable

(b) The data below show the final scores for 45 students in MATH 100. Use these data to answer
the questions that follow:

68 84 46 82 83 75 61 76 75
73 52 35 63 78 88 67 62 84
61 44 62 74 39 92 94 52 46
66 78 51 68 72 81 71 47 57
96 36 66 60 52 65 62 32 88

i. Using 7 classes, construct a frequency table for this data


ii. Draw a less-than Ogive for this data
iii. Draw a stem-leaf plot for this data. Key: 6|8 = 68
iv. State two advantages of a stem-leaf plot

Q2. (a) The weights of students in the MATH 210 class were determined and recorded as below;
51 56 62 58 48
62 53 64 62 65

Determine;
i. The mean, median and mode of these weights
ii. the range and standard deviation of these weights
iii. the coefficient of variation (CV) and the skewness of these weights

(b) Determine the mean and standard deviation for the following frequency distribution;

x 4 6 8 10
Freq 2 3 4 2

Q3. (a) Given the data set below;

x 0 1 2 3 4 5 6
y 2.2 2.4 3.3 5.4 9.4 14.5 19.9

i. draw a scatter plot of this data
ii. determine the correlation coefficient (r)
iii. comment on the relationship between x and y
iv. using the least squares method, determine x̄ and ȳ, and a and b for the equation y = a + bx
v. write the prediction equation
vi. estimate y when x = 5 and x = 10
vii. estimate r² and comment on it

(b) State whether the following measurements are discrete or continuous;

i. the weight of a chicken as it grows
ii. the average number of babies born in a certain clinic each week
iii. the ages of the planets in the solar system
iv. the maximum daily temperatures for January, 2006
v. the number of students registering for BSc Computer Science each year.

Q4. (a) Which of the following values cannot be the probability of an event? Show your reasoning;
-0.2, 1.2, 0.006, 40%, 120%, 2/5, 0.8

(b) Assuming that the probability of a male birth is 0.3, out of 2000 families with 5 children,
how many would you expect to have

i. at least 3 boys
ii. exactly 4 boys
iii. fewer than 3 boys

(c) x is a random variable with a normal distribution. Estimate the probability that x falls in the
indicated interval;

i. µ = 7, σ = 2, estimate P(5 ≤ x ≤ 8)
ii. µ = 20, σ = 5, estimate P(18 ≤ x ≤ 25)

Q5. (a) Given: c = 0.95, s = 3.0, n = 60,

i. Find the maximum error of estimate (E) for estimating the population mean µ. Hint:

$$ E = z_c \frac{s}{\sqrt{n}} $$

ii. Construct the confidence interval for the population mean µ.
iii. What is the effect on the confidence interval when the sample size is increased? Explain the
reasoning.

(b) Determine the minimum required sample size if you want to be 95% confident that the sample
mean is within one unit of the population mean, given σ = 4.8. Assume the population is
normally distributed.


FORMULAS

DESCRIPTIVE STATISTICS

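Arithmetic Mean:

$$ \bar{x} = \frac{\sum x}{n} $$

The Geometric Mean:

$$ \bar{x}_G = \sqrt[N]{x_1 \cdot x_2 \cdot x_3 \cdots x_N} $$

The Harmonic Mean:

$$ H = \frac{N}{\sum \frac{1}{x}} $$

Standard Deviation (s):

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

Computational formula:

$$ s = \sqrt{\frac{\sum x^2 - \left(\sum x\right)^2 / n}{n - 1}} = \sqrt{\frac{SS_x}{n - 1}} $$

For grouped data:

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f - 1}} $$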
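
The definitional and computational formulas for s give the same result, which the short sketch below illustrates. The function names and the data values are hypothetical and chosen only for easy arithmetic.

```python
from math import sqrt


def sd_definitional(xs):
    """s from the squared deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


def sd_computational(xs):
    """s from SSx = sum(x^2) - (sum(x))^2 / n."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return sqrt(ss_x / (n - 1))


def sd_frequency(values, freqs):
    """s for a frequency distribution, dividing by (sum of f) - 1."""
    total = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / total
    ss = sum((x - mean) ** 2 * f for x, f in zip(values, freqs))
    return sqrt(ss / (total - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations
print(sd_definitional(data), sd_computational(data))
print(sd_frequency([1, 2, 3], [1, 2, 1]))  # same as the raw data 1, 2, 2, 3
```

The computational form avoids calculating each deviation separately, which is why it is preferred for hand calculation.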

FOR GROUPED DATA SET

Mean for frequency data:

$$ \bar{x} = \frac{\sum xf}{\sum f} $$

Where f is the frequency of x

Mean for weighted data:

$$ \bar{x}_w = \frac{\sum xw}{\sum w} $$

Where w is the weighting of x

Mean for data grouped into classes:

$$ \bar{x} = \frac{\sum x_m f}{\sum f} $$

Where x_m is the midpoint of a class

$$ \text{Median} = L_1 + \left( \frac{\frac{N}{2} - \left(\sum f\right)_l}{f_{\text{median}}} \right) c_w $$

Where: L1 – lower class boundary of the median class
N – number of items in the data (total frequency)
(Σf)l – sum of frequencies of all classes lower than the median class
fmedian – frequency of the median class
cw – size of the median class (class interval)

$$ \text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) c_w $$

Where L1 – lower class boundary of the modal class
Δ1 – excess of modal frequency over frequency of the next-lower class
Δ2 – excess of modal frequency over frequency of the next-higher class
cw – size of the modal class (class interval)
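
The grouped mean, median and mode formulas above can be applied as in the sketch below. It assumes equal class widths and treats a missing neighbouring class as having frequency zero; the function names and the frequency table are hypothetical.

```python
def grouped_mean(midpoints, freqs):
    """Mean from class midpoints x_m and class frequencies f."""
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)


def grouped_median(lower_boundaries, freqs, class_width):
    """Median = L1 + ((N/2 - cumulative f below) / f_median) * cw."""
    half = sum(freqs) / 2
    cumulative = 0  # sum of frequencies of classes below the current one
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= half:  # first class reaching N/2 is the median class
            return lower + (half - cumulative) / f * class_width
        cumulative += f


def grouped_mode(lower_boundaries, freqs, class_width):
    """Mode = L1 + (d1 / (d1 + d2)) * cw for the class with highest frequency."""
    i = freqs.index(max(freqs))
    below = freqs[i - 1] if i > 0 else 0
    above = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - below  # excess over the next-lower class
    d2 = freqs[i] - above  # excess over the next-higher class
    return lower_boundaries[i] + d1 / (d1 + d2) * class_width


# Hypothetical classes 1-5, 6-10, 11-15, 16-20 (boundaries 0.5, 5.5, 10.5, 15.5)
boundaries = [0.5, 5.5, 10.5, 15.5]
midpoints = [3, 8, 13, 18]
freqs = [2, 5, 8, 5]

print(grouped_mean(midpoints, freqs))
print(grouped_median(boundaries, freqs, 5))
print(grouped_mode(boundaries, freqs, 5))
```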

ELEMENTARY PROBABILITY

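Probability of an event A:

$$ P(A) = \frac{f_A}{\sum f} $$

where f_A is the frequency of event A and Σf is the total frequency of all outcomes.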
Additive law:

P(A or B) = P(A) + P(B) – P(A and B)

Multiplicative law: (if A and B are independent events)

P(A and B) = P(A) * P(B)
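
Both laws can be verified by listing outcomes. The sketch below is illustrative only and uses a hypothetical fair die and fair coin: the additive law is checked against a direct count of the outcomes in "A or B", and the multiplicative law against a count over the combined coin-and-die sample space.

```python
from fractions import Fraction
from itertools import product

die = set(range(1, 7))
A = {x for x in die if x % 2 == 0}  # even number: {2, 4, 6}
B = {x for x in die if x > 3}       # greater than 3: {4, 5, 6}


def prob(event, sample_space):
    """Probability as favourable outcomes over total equally likely outcomes."""
    return Fraction(len(event), len(sample_space))


# Additive law: P(A or B) = P(A) + P(B) - P(A and B)
by_law = prob(A, die) + prob(B, die) - prob(A & B, die)
by_count = prob(A | B, die)

# Multiplicative law for independent events: a head on the coin and a six on the die
space = set(product(["H", "T"], die))
head_and_six = {(c, d) for c, d in space if c == "H" and d == 6}
by_product = Fraction(1, 2) * Fraction(1, 6)

print(by_law, by_count, prob(head_and_six, space), by_product)
```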

PERMUTATIONS and COMBINATIONS

Permutation – an arrangement of n objects taking r at a time. Order is important here.

$$ {}^{n}P_{r} = \frac{n!}{(n - r)!} $$

Combination – a selection of n objects taking r at a time. Order is not important in this case.

$$ {}^{n}C_{r} = \frac{n!}{r! \, (n - r)!} $$
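
The two counting formulas can be evaluated directly from factorials. In the sketch below the function names are hypothetical; `math.perm` and `math.comb` from the Python standard library (version 3.8 or later) are used only as a cross-check.

```python
from math import comb, factorial, perm


def permutations_count(n, r):
    """nPr = n! / (n - r)!  (order matters)."""
    return factorial(n) // factorial(n - r)


def combinations_count(n, r):
    """nCr = n! / (r! (n - r)!)  (order does not matter)."""
    return factorial(n) // (factorial(r) * factorial(n - r))


# Choosing 2 objects from 5
print(permutations_count(5, 2), perm(5, 2))
print(combinations_count(5, 2), comb(5, 2))
```

Each combination of r objects can be arranged in r! ways, so nPr is always r! times nCr.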

DISCRETE PROBABILITY
Binomial Probability

The Binomial Probability Formula;

$$ P(x) = {}^{n}C_{x} \, p^{x} q^{n - x} $$

Where n is the total number of observations (trials)
x is the number of successes
p is the probability of success – P(S)
q is the probability of failure – P(F)
P(x) is the probability of x successes

The Geometric Probability

The probability that the first success will occur on trial number x is

$$ P(x) = p \, q^{x - 1}, \quad \text{where } q = 1 - p $$
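
The binomial and geometric formulas can be sketched as follows. The function names and the values used (n = 5, p = 0.5) are hypothetical.

```python
from math import comb


def binomial_probability(n, x, p):
    """P(x) = nCx * p^x * q^(n - x): exactly x successes in n trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


def geometric_probability(x, p):
    """P(x) = p * q^(x - 1): first success occurs on trial number x."""
    q = 1 - p
    return p * q ** (x - 1)


# Hypothetical example: 5 trials with p = 0.5
print(binomial_probability(5, 2, 0.5))   # exactly 2 successes
print(sum(binomial_probability(5, x, 0.5) for x in range(3, 6)))  # at least 3
print(geometric_probability(3, 0.5))     # first success on the third trial
```

"At least" and "fewer than" questions are answered by summing P(x) over the relevant values of x, as in the second line of the example.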

Poisson Probability

The probability of exactly x occurrences in an interval is

$$ P(x) = \frac{\mu^{x} e^{-\mu}}{x!} $$

where e is an irrational number approximately equal to 2.71828 and µ is the mean number of
occurrences per interval unit.
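
A minimal sketch of the Poisson formula follows. The function name and the mean of 2 occurrences per interval are hypothetical.

```python
from math import exp, factorial


def poisson_probability(x, mu):
    """P(x) = mu^x * e^(-mu) / x!: exactly x occurrences when the mean is mu."""
    return mu ** x * exp(-mu) / factorial(x)


# Hypothetical mean of 2 occurrences per interval
for x in range(4):
    print(x, round(poisson_probability(x, 2.0), 4))
```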

CORRELATION and REGRESSION

Correlation Analysis

Correlation Coefficient r  
SSxy
SSx.SSy

Where :
SSx   x 2   x  / n
2

SSy   y 2   y  / n
2

SSxy   xy   x. y  / n

Linear regression

The equation is written as y = a + bx, where

$$ b = \text{slope} = \frac{SS_{xy}}{SS_x} $$

$$ a = y\text{-intercept} = \bar{y} - b\bar{x} $$
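
The correlation and regression formulas share the same sums of squares, so they can be computed together. The sketch below is illustrative only; the function name and the five (x, y) pairs are hypothetical.

```python
from math import sqrt


def correlation_and_regression(xs, ys):
    """Return (r, a, b) for the line y = a + bx using SSx, SSy and SSxy."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    r = ss_xy / sqrt(ss_x * ss_y)          # correlation coefficient
    b = ss_xy / ss_x                       # slope
    a = sum(ys) / n - b * sum(xs) / n      # y-intercept: y-bar minus b times x-bar
    return r, a, b


# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, a, b = correlation_and_regression(xs, ys)
print(f"r = {r:.4f}, prediction equation: y = {a:.1f} + {b:.1f}x")
print(f"estimate at x = 6: {a + b * 6:.1f}")
```

Squaring r gives the proportion of the variation in y that is explained by the fitted line.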
