Statistics For Economists Module - Copy For Students
Module
Chapter one
1. Introduction
There exist a tremendous number of random phenomena in nature, real life and the experimental
sciences. Almost everything in nature is random: occurrences and durations of rain, the number of
double stars in a region of the sky, the lifetimes of plants, humans and animals, the life span of a
radioactive atom, the phenotypes of offspring of plants or of any biological beings, etc. Moreover, most
of the day-to-day activities of human beings are also full of uncertainty. In such cases we use
words such as probability or chance, event or happening, randomness, etc. The general theory
states that each phenomenon has a structural part, which is deterministic, and a random part called
the error or the deviation. Randomness also appears in conceptual experiments: tossing a coin
once or 100 times, throwing three dice, arranging a deck of cards, matching two decks, playing
roulette, etc.
In this chapter the students will be introduced to the nature of probability theory in general and
to some specific concepts related to it.
Objectives:
After successful completion of this chapter, students will be able to:
Understand the concept of probability theory;
Define basic probability concepts and terms;
Understand the axioms of probability theory;
Identify the counting rules/procedures of probability;
Compute different probability problems;
Solve permutation and combination problems;
Deal with conditional probabilities and independence.
Since we are dealing with the concept of probability, there are some basic terms related to probability
that we need to introduce. Here are some of them:
1. Experiment: Any process of observation or measurement, or any process which generates
well-defined outcomes.
Examples:
Rolling a die number of times
Flip a coin
Selecting a card from standard deck of cards etc.
2. Probability Experiment: It is an experiment that can be repeated any number of times
under similar conditions, and for which it is possible to enumerate the total number of outcomes
without predicting an individual outcome. It is also called a random experiment.
Example:
i. If a fair die is rolled once it is possible to list all the possible outcomes i.e.1, 2, 3, 4, 5,
6 but it is not possible to predict which outcome will occur.
ii. If a fair coin is tossed once, clearly the outcome will be either head or tail with equal
probability, but no one can surely know which outcome will occur until the
experiment is done.
3. Outcome: The result of a single trial of a random experiment.
Axioms of Probability
Probability of any event ‘A’, P (A), is assigned in such a way that it satisfies certain conditions.
These conditions for assigning probability are known as Axioms of probability. There are three
such axioms. All conclusions drawn on probability theory are either directly or indirectly related
to these three axioms. Let E be a random experiment and S be a sample space associated with E.
With each event A we associate a real number, called the probability of A, satisfying the following
properties, called the axioms (or postulates) of probability.
Axiom 1: For any event ‘A’ belongs to the sample space, ‘S’, the value of probability of the
event lies between zero and one i.e. mathematically expressed as; 0 ≤P (A)≤ 1. Thus, this axiom
states that probabilities of events for a particular sample space are real numbers on the interval
[0, 1].
Axiom 2: The probability of the sample space S is, P(S) =1, i.e. S is sure event. This axiom
states that the event described by the entire sample space has probability 1. If γ is the outcome
of an experimental trial, then γ is an element of S by the definition of the sample space. Therefore,
the event described by S must occur on every trial. Intuitively, we can say that every experiment
we perform must yield some kind of result, and that result must be in the sample space, so the
“event” described by the sample space is so general that it must always occur.
Axiom 3: If A and B are mutually exclusive events or disjoint events, i.e. A ∩ B is empty, then
the probability that one or the other occur equals the sum of the two probabilities.
i. e. P (A∪B) = P (A) + P (B)
In general p (A∪ B) = p (A) + p (B) – p (A∩ B), for disjoint events p (A∩ B) = 0
Axiom 4: The probability of any event must be greater or equal to zero.
P (A) ≥ 0, i.e. non- negativity property of the probability theory.
1.4 Counting Procedures
If a choice consists of k steps, of which the first can be made in n1 ways, the second
in n2 ways, ..., and the kth in nk ways, then the whole choice can be made in
(n1 × n2 × ... × nk) ways.
Example 1: The first four natural numbers are to be used in 4 digit identification card.
How many different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.
Solutions
a) Each of the four digits can be chosen in 4 ways:
1st digit: 4, 2nd digit: 4, 3rd digit: 4, 4th digit: 4
⇒ 4 × 4 × 4 × 4 = 256 different cards are possible.
b) Without repetition the choices shrink at each step:
4 × 3 × 2 × 1 = 24 different cards are possible.
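If Python is available, the two counts can also be checked by brute-force enumeration; the following is only an illustrative sketch, not part of the module's required material.

```python
from itertools import product, permutations

digits = [1, 2, 3, 4]

# Repetitions permitted: 4 x 4 x 4 x 4 = 256 possible cards
print(len(list(product(digits, repeat=4))))      # 256

# Repetitions not permitted: 4 x 3 x 2 x 1 = 24 possible cards
print(len(list(permutations(digits, 4))))        # 24
```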
The study of permutations and combinations is concerned with determining the number of
different ways of arranging and selecting objects out of a given number of objects, without
actually listing them. There are some basic counting techniques which will be useful in
determining the number of different ways of arranging or selecting objects.
The two basic counting principles are given below:
1. Permutation
An arrangement of n objects in a specified order is called permutation of the objects.
Permutation Rules:
1. The number of permutations of n distinct objects taken all together is n!
Where n! = n*(n −1)*(n − 2)*.....*3*2*1; or equivalently
= 1 × 2 ×3 × 4 × … … … .× n
2. The arrangement of n objects in a specified order using r objects at a time is called the
permutation of n objects taken r objects at a time. It is written as nPr, and the formula is:
nPr = n! / (n − r)!
3. The number of permutations of n objects in which k1 are alike, k2 are alike, ..., kn are alike is:
n! / (k1! × k2! × ... × kn!)
Examples:
1. Suppose we have the letters A, B, C, D. In how many ways can these four letters be arranged?
Using the first rule, there are 4! = 24 possible arrangements.
2. How many different permutations can be made from the letters of a word with n = 10 letters,
of which 2 are C, 2 are O, 2 are R, 1 E, 1 T, 1 I and 1 N?
⇒ k1 = 2, k2 = 2, k3 = 2, k4 = k5 = k6 = k7 = 1
Using the third rule of permutation, there are
10! / (2! × 2! × 2! × 1! × 1! × 1! × 1!) = 453,600 different permutations.
3. Here n = 7, i.e. there are seven different bands/objects, and r = 3, so by the second rule there are 7P3 = 7!/4! = 210 possible arrangements.
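The three permutation rules are easy to check with Python's math module (assuming Python 3.8 or later is available); the sketch below simply recomputes the three examples, entering the repeated letters of Example 2 through their repetition counts.

```python
import math

# Rule 1: all n distinct objects (Example 1, the letters A, B, C, D)
print(math.factorial(4))                                  # 24

# Rule 3: 10 letters of which three appear twice each (Example 2)
print(math.factorial(10) // (math.factorial(2) ** 3))     # 453600

# Rule 2: n objects taken r at a time (Example 3, 3 of 7 bands)
print(math.perm(7, 3))                                    # 210
```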
Exercise:
1. Six different statistics books, seven different physics books, and 3 different Economics
books are arranged on a shelf. How many different arrangements are possible if;
i. The books in each particular subject must all stand together
ii. Only the statistics books must stand together
2. Combinations
On many occasions we are not interested in arranging but only in selecting k objects from given
n objects. A combination is a selection of some or all of a number of different objects where the
order of selection is immaterial. Therefore, combination can be defined as the selection of
objects without regard to order. The number of selections of k objects from the given n objects is:
nCk = n! / (k!(n − k)!)
Remark: the numbers nCk are also called binomial coefficients (they arise in
connection with a fundamental mathematical result called the Binomial Theorem; you
may also recall the related "Pascal's Triangle").
Use permutations if a problem calls for the number of arrangements of objects and
different orders are to be counted.
Use combinations if a problem calls for the number of ways of selecting objects and the
order of selection is not to be counted.
Example: Given the letters A, B, C, and D list the permutation and combination for selecting
two letters.
Solutions:
Permutation: AB, BA, CA, DA, AC, BC, CB, DB, AD, BD, CD, DC (12 arrangements)
Combination: AB, BC, AC, BD, AD, CD (6 selections)
Solutions:
n = 9, r=5
2. Among 15 clocks there are two defectives. In how many ways can an inspector choose three of
the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Both of the defective clocks are included.
Solutions:
n = 15, of which 2 are defectives and 13 are non- defectives
r=3
a. If there is no restriction, select three clocks from 15 clocks, and this can be done in:
n = 15, r = 3
nCr = 15! / (12! × 3!) = 455 ways
b. None of the defective clocks is included.
This is equivalent to zero defectives and three non-defectives, which can be done in:
2C0 × 13C3 = 1 × 286 = 286 ways
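Assuming Python is available, math.comb reproduces all four parts of the clock example; this is only a sketch for checking the arithmetic, with parts c) and d) computed in the same way as b).

```python
from math import comb

total, defective, good, r = 15, 2, 13, 3

print(comb(total, r))                        # a) no restriction: 455
print(comb(defective, 0) * comb(good, 3))    # b) no defective clock: 286
print(comb(defective, 1) * comb(good, 2))    # c) exactly one defective clock: 156
print(comb(defective, 2) * comb(good, 1))    # d) both defective clocks: 13
```

Note that 286 + 156 + 13 = 455, as it must, since cases b) to d) partition the unrestricted count in a).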
Remark:
i. P(A′|B) = 1 − P(A|B)
ii. P(B′|A) = 1 − P(B|A)
Examples
1. For a student enrolling as a freshman at a certain university, the probability is 0.25 that he/she
will get a scholarship and 0.75 that he/she will graduate. If the probability is 0.20 that he/she
will get a scholarship and will also graduate, what is the probability that a student who gets
a scholarship will graduate?
Let A = the student gets a scholarship and B = the student graduates. Required: P(B/A) = ?
P(B/A) = P(A ∩ B) / P(A) = 0.20 / 0.25 = 0.80
Note; for any two events A and B the following relation holds.
P (B) = p (B/A).p (A) + p (B/A').p (A')
Exercises:
1. If the probability that a research project will be well planned is 0.60 and the probability
that it will be well planned and well executed is 0.54, what is the probability that it will
be well executed given that it is well planned?
2. A lot consists of 20 defective and 80 non-defective items from which two items are
chosen without replacement. Events A & B are defined as A = {the first item chosen is
defective}, B = {the second item chosen is defective}
a. What is the probability that both items are defective?
b. What is the probability that the second item is defective?
3. 27 students out of a class of 43 are economists. 20 of the students are female, of whom 7
are economists. Find the probability that a randomly selected student is an economist
given that she is female.
Let’s start with a simple example where we can check all the probabilities directly by counting.
Example1. Draw two cards from a deck. Define the events: S1 = ‘first card is a spade’ and S2 =
‘second card is a spade’. What is the P (S2|S1)?
Solution: We can do this directly by counting: if the first card is a spade then of the 51 cards
remaining, 12 are spades.
P (S2|S1) = 12/51.
Now, let’s re compute this using conditional probability formula. We have to compute P (S1), P
(S2) and P (S1 ∩ S2): We know that P (S1) =1/4 because there are 52 equally likely ways to
draw the first card and 13 of them are spades. The same logic says that there are 52 equally likely
ways the second card can be drawn, so P (S2) =1/4.
The probability P (S2) =1/4 may seem surprising since the value of first card certainly affects the
probabilities for the second card. However, if we look at all possible two card sequences we will
see that every card in the deck has equal probability of being the second card. Since 13 of the 52
cards are spades we get P (S2) = 13/52 = 1/4. Another way to say this is: if we are not given
value of the first card then we have to consider all possibilities for the second card.
P(S1 ∩ S2) = (13/52) × (12/51) = 3/51
This was found by counting the number of ways to draw a spade followed by a second spade and
dividing by the number of ways to draw any card followed by any other card. Now, using (1) we
get;
P(S2|S1) = P(S2 ∩ S1) / P(S1) = (3/51) / (1/4) = 12/51
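As a rough check on this calculation, the ratio can be computed exactly with fractions, and the conditional probability can also be approximated by simulation; the snippet below (assuming Python is available) is only an illustrative sketch.

```python
import random
from fractions import Fraction

# Exact value from the multiplication rule
p_s1 = Fraction(13, 52)
p_both = Fraction(13, 52) * Fraction(12, 51)
print(p_both / p_s1)                          # 4/17, i.e. 12/51

# Monte Carlo check: draw two cards, condition on the first being a spade
deck = ['spade'] * 13 + ['other'] * 39
random.seed(0)
hits = trials = 0
for _ in range(200_000):
    first, second = random.sample(deck, 2)    # two cards without replacement
    if first == 'spade':
        trials += 1
        hits += (second == 'spade')
print(hits / trials, 12 / 51)                 # both close to 0.235
```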
The law of total probability will allow us to use the multiplication rule to find probabilities in
more interesting examples. It involves a lot of notation, but the idea is fairly simple. We state the
law when the sample space is divided into 3 pieces. It is a simple matter to extend the rule when
there are more than 3 pieces.
Suppose the sample space Ω is divided into 3 disjoint events B1, B2, B3. Then for any event A:
P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3)
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3) ………… (3)
The top equation says 'if A is divided into 3 pieces then P(A) is the sum of the probabilities of
the pieces'. The bottom equation (3) is called the law of total probability. It is just a rewriting of
the top equation using the multiplication rule.
The law holds if we divide Ω into any number of events, so long as they are disjoint and cover all
of Ω. Such a division is often called a partition of Ω.
Example: An urn contains 5 red balls and 2 green balls. Two balls are drawn one after the other.
What is the probability that the second ball is red?
Solution: The sample space is Ω = {rr, rg, gr, gg}. Let R1 be the event ‘the first ball is red’, G1
= ‘first ball is green’, R2 = ‘second ball is red’, G2 = ‘second ball is green’. We are asked to find
P (R2).
The fast way to compute this is just like P (S2) in the card example above. Every ball is equally
likely to be the second ball. Since 5 out of 7 balls are red, P (R2) = 5/7.
Let’s compute this same value using the law of total probability (3). First, we’ll find the
conditional probabilities. This is a simple counting exercise.
4 5 5 2
= . + .
6 7 6 7
30 5
= =
42 7
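The same two-branch computation can be done exactly with Python's fractions module; this is just a sketch mirroring the law of total probability above.

```python
from fractions import Fraction

p_r1 = Fraction(5, 7)              # P(first ball red)
p_g1 = Fraction(2, 7)              # P(first ball green)
p_r2_given_r1 = Fraction(4, 6)     # 4 red balls left out of 6
p_r2_given_g1 = Fraction(5, 6)     # 5 red balls left out of 6

p_r2 = p_r2_given_r1 * p_r1 + p_r2_given_g1 * p_g1
print(p_r2)                        # 5/7
```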
Independence
Two events are independent if knowledge that one occurred does not change the probability that
the other occurred. Informally, events are independent if they do not influence one another.
Formal definition of independence: Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
Example1. Toss a fair coin twice. Let H1 = 'heads on first toss' and let H2 = 'heads on second
toss'. Are H1 and H2 independent?
Answer: Yes; P(H1 ∩ H2) = 1/4 = (1/2)(1/2) = P(H1)P(H2).
Example2. Toss a fair coin 3 times. Let H1 = ‘heads on first toss’ and A = ‘two heads total’. Are
H1 and A independent?
Answer: We know that P(A) = 3/8. Since this is not 0, we can check whether P(A|H1) = P(A) holds.
Now, H1 = {HHH, HHT, HTH, HTT} contains exactly two outcomes (HHT, HTH) from A, so
we have P(A|H1) = 2/4. Since P(A|H1) = 1/2 ≠ 3/8 = P(A), these events are not independent.
Example3. Draw one card from a standard deck of playing cards. Let’s examine the
independence of 3 events ‘the card is an ace’, ‘the card is a heart’ and ‘the card is red’.
(a) We know that P (A) = 4/52 = 1/13, P (A|H) = 1/13. Since P (A) = P(A|H) we have that A is
independent of H.
(b) P (A|R) = 2/26 = 1/13. So A is independent of R. That is, whether the card is an ace is
independent of whether it’s red.
(c) Finally, what about H and R? Since P (H) = 1/4 and P (H|R) = 1/2, H and R are not
independent. We could also see this the other way around: P (R) = 1/2 and P (R|H) = 1, so H and
R are not independent.
Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of this
course. For two events A and B Bayes' theorem (also called Bayes' rule and Bayes' formula)
says;
P(B/A) = P(A/B) × P(B) / P(A) …………………………………….. (5)
Bayes’ rule tells us how to invert conditional probabilities, i.e. to find P (B/A) from P( A /B).
In practice, P (A) is often computed using the law of total probability.
The key point is that A ∩ B is symmetric in A and B, so the multiplication rule gives P(A ∩ B) = P(A/B)P(B) = P(B/A)P(A); dividing through by P(A) yields Bayes' rule.
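A small helper function makes the inversion concrete; the numbers below are purely hypothetical (they are not taken from the module or from the exercise that follows) and the snippet only sketches the mechanics of combining Bayes' rule with the law of total probability.

```python
def bayes(prior_b, p_a_given_b, p_a_given_not_b):
    """Return P(B|A), computing P(A) by the law of total probability."""
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)
    return p_a_given_b * prior_b / p_a

# Hypothetical illustration: 2% of parts are faulty (event B); a screening test
# flags (event A) 90% of faulty parts and 5% of good parts.
print(bayes(prior_b=0.02, p_a_given_b=0.90, p_a_given_not_b=0.05))   # ~0.269
# Most flagged parts are still good, because faults are rare.
```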
Exercise:
1. Suppose 99% of people with HIV test positive, 95% of people without HIV test negative,
and 0.1% of people have HIV.
What is the chance that someone testing positive has HIV?
Chapter two
2. Random variables and probability distributions
Introduction
Random variables are important concepts in dealing with probability and probability
distributions. Generally, random variables are of two types; therefore this chapter is dedicated to
an overview of random variables and related terms.
Chapter objectives:
After successfully completing this chapter, students are expected to be able to:
Define the terms probability distribution and random variables.
Distinguish between a discrete and continuous probability distribution.
Calculate the mean, variance, and standard deviation of discrete and continuous
probability distributions.
Define moments and moment generating functions.
For instance, toss a fair coin three times and let X be the number of heads observed. Then
⇒ X(HHH) = 3, X(TTT) = 0
and X takes the values {0, 1, 2, 3}.
Examples:
Toss coin n times and count the number of heads.
Number of children in a family.
Number of car accidents per week.
Number of defective items in a given company.
Number of bacteria per two cubic centimeter of water.
Therefore, the probability distribution for the number of heads occurring in three coin
tosses is:
X=x 0 1 2 3
P (X = x) 1/8 3/8 3/8 1/8
F (x) 1/8 1/2 7/8 1
From the above, any number of other questions can be answered. For example, P (1 ≤ X ≤ 3) (i.e.
the probability that you will get at least one head) = P (1) + P (2) + P (3) = 3/8 + 3/8 + 1/8 = 7/8.
Or, if you prefer, you can use the complements rule and note that P (at least 1 head) = 1 – P (No
heads) = 1 - 1/8 = 7/8.
Function rules: Sometimes it is easier to specify the distribution of a discrete random variable
by its rule, rather than by a simple listing or graph. A rule for our coin-tossing experiment would
be:
P(X = x) = 3Cx / 8 for x = 0, 1, 2, 3
(which gives 1/8 if x = 0 or 3, and 3/8 if x = 1 or 2).
Probability Mass Function (PMF): A probability distribution involving only discrete values of
X and it is denoted by P. Graphically, this is illustrated by a graph in which the x axis has the
different possible values of X, the Y axis has the different possible values of P(X = x).
1. P(x) ≥ 0
2. Σ_x P(X = x) = 1
P(a ≤ X < b) = Σ P(x), summing over x = a, ..., b − 1
P(a < X ≤ b) = Σ P(x), summing over x = a + 1, ..., b
P(a ≤ X ≤ b) = Σ P(x), summing over x = a, ..., b
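For the coin-tossing PMF above, these sums are easy to evaluate in Python with exact fractions; the following is only an illustrative sketch.

```python
from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

print(sum(pmf.values()))                                 # 1 (property 2)
print(sum(p for x, p in pmf.items() if 1 <= x <= 3))     # 7/8 = P(1 <= X <= 3)
print(1 - pmf[0])                                        # 7/8 again, by the complement rule
```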
Exercise:
1. Three men and three women apply for an executive position in a small company. Two of
the applicants are selected for interview. Let X denotes the number of women in the
interview pool.
b. Find the PMF of X, assuming that the selection is done randomly. Plot it.
c. What is the probability that at least one woman is included in the interview pool?
F(a) = P(X ≤ a) = Σ_{x ≤ a} P(x)
F(x) = P(X ≤ x) = Σ_{xi ≤ x} f(xi)
0 ≤ F(x) ≤ 1
If x ≤ y, then F(x) ≤ F(y)
ii. For a continuous random variable, ∫ f(x) dx taken over all x, i.e. ∫ from −∞ to ∞ of f(x) dx, equals 1.
NOTE. A random variable is continuous if and only if its CDF is an everywhere continuous
function.
d. Introduction to expectation
The expected value of a random variable is denoted by E(X). The expected value can be thought
of as the 'average' value attained by the random variable; in fact the expected value of a random
variable is also called its mean, in which case we use the notation μx.
1. Let X be a discrete random variable taking values xi with PMF P(X = xi); then
E(X) = Σ xi P(X = xi)
2. Let X be a continuous random variable assuming the values in the interval (a, b); then
E(X) = ∫_a^b x f(x) dx
We often seek to summarize the essential properties of a random variable in as simple terms as
possible.
Definition: If X is a random variable with mean E(X), then the variance of X, denoted by Var(X), is defined by
Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]²
Where E(X²) = Σ_{i=1}^n xi² P(X = xi), if X is discrete
and E(X²) = ∫ x² f(x) dx, if X is continuous.
b. Var(4 + 3X)
Solution: Exercise
The moment generating function of the random variable X, denoted MX(t), is defined for all real
values of t by:
MX(t) = E(e^(tX)) = Σ_x e^(tx) P(x), if X is discrete with PMF P(x)
= ∫ from −∞ to ∞ of e^(tx) f(x) dx, if X is continuous with PDF f(x)
The reason M X (t) is called a moment generating function is because all the moments of
X can be obtained by successively differentiating M X (t) and evaluating the result at t=0.
The “moment generating function” gives us a nice way of collecting together all the moments of
a random variable X into a single power series (i.e. Maclaurin series) in the variable t. It is
defined to be;
This is a clever way of organizing all the moments into one mathematical object, and that object
is called the moment generating function. It's a function m (t) of a new variable t defined by;
M(t) = E(e^(tX))
Since the exponential function e^t has the power series
e^t = Σ_{k=0}^∞ t^k/k! = 1 + t + t²/2! + ... + t^k/k! + ...,
taking expectations term by term gives
M(t) = E(e^(tX)) = 1 + μ1 t + μ2 t²/2! + ... + μk t^k/k! + ...,
where μk = E(X^k) is the kth moment of X.
First moment: M′X(0) = E(X)
(For any of the distributions we will use, we can move the derivative inside the expectation.)
Second moment:
M″X(t) = d/dt M′X(t) = d/dt E(X e^(tX)) = E(d/dt (X e^(tX))) = E(X² e^(tX))
M″X(t = 0) = E(X²)
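Assuming the sympy library is available, this differentiation of MX(t) can be carried out symbolically; the sketch below uses the three-coin-toss PMF from the previous section as a concrete example.

```python
import sympy as sp

t = sp.symbols('t')
pmf = {0: sp.Rational(1, 8), 1: sp.Rational(3, 8), 2: sp.Rational(3, 8), 3: sp.Rational(1, 8)}

M = sum(sp.exp(t * x) * p for x, p in pmf.items())    # M_X(t) = E(e^(tX))
m1 = sp.diff(M, t).subs(t, 0)                         # first moment  E(X)
m2 = sp.diff(M, t, 2).subs(t, 0)                      # second moment E(X^2)
print(m1, m2, m2 - m1**2)                             # 3/2, 3, and Var(X) = 3/4
```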
Chapter Three
Introduction
Once we have dealt with probability theory and random variables, we have now come to the
heart of the matter. In the previous chapters we took a bird's-eye view of probability theory and
random variables. On top of this, we also defined briefly what discrete and continuous random
variables are, together with their probability distributions, namely the probability mass functions
and probability density functions of the respective random variables. In this chapter we deal with
some important probability distributions, from both discrete and continuous random variables,
in a more detailed manner.
Learning Objectives:
After successfully completing this chapter, students are expected to:
Describe the characteristics and compute probabilities using the Bernoulli and binomial
probability distributions.
A random variable X is said to have a Bernoulli distribution with parameter θ if its probability mass function is
P(x) = θ^x (1 − θ)^(1−x), for x ∈ {0, 1}
= 0, otherwise
The Binomial Distribution represents the number of successes and failures in n independent
Bernoulli trials for some given value of n. A binomial experiment is a probability experiment that
satisfies the following four requirements called assumptions of a binomial distribution.
1. The experiment consists of n identical trials.
2. Each trial has only one of the two possible mutually exclusive outcomes, success or a
failure.
3. The probability of each outcome does not change from trial to trial, and
4. The trials are independent, thus we must sample with replacement.
Examples of binomial experiments
Tossing a coin 20 times to see how many tails occur.
Asking 200 people if they watch BBC news.
Registering a newly produced product as defective or non-defective.
Asking 100 people if they favor the ruling party.
Rolling a die to see if a 5 appears etc…
The random variable X that counts the number of successes, x, in the n trials is said to follow a
binomial distribution with parameters n and θ, written bin(x; n; θ), with probability mass function
P(X = x) = nCx θ^x (1 − θ)^(n−x), where x = 0, 1, 2, ..., n (i.e. 0 ≤ x ≤ n)
= 0, otherwise
Examples:
1. A biased coin is tossed 6 times. The probability of heads on any toss is 0.3. Let X denotes
the number of heads that come up.
Calculate:
i. P (X = 2)
ii. P (X = 3)
iii. P (1 < X ≤ 5).
Solution:
If we call heads a success then this X has a binomial distribution with parameters n = 6 and θ =
0.3
i. P(X = 2) = 6C2 (0.3)² (0.7)⁴ = 0.324135
ii. Do it yourself!
iii. P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5)
= 0.324 + 0.185 + 0.059 + 0.01
= 0.578
2. What is the probability of getting three heads by tossing a fair coin four times?
Solution:
Let X be the number of heads in tossing a fair coin four times
X ~ Bin (n = 4, p = 0.50)
⇒ P(X = x) = 4Cx θ^x (1 − θ)^(4−x), x = 0, 1, 2, 3, 4
= 4Cx (0.5)^x (0.5)^(4−x)
= 4Cx (0.5)⁴
⇒ P(X = 3) = 4C3 (0.5)⁴ = 4/16 = 0.25
3. Let X = number of heads after a coin is flipped three times, so X ~ Bin(3, 0.5).
What is the probability of each of the different values of X?
P(X = 0) = 3C0 θ⁰ (1 − θ)³ = 1/8
P(X = 1) = 3C1 θ¹ (1 − θ)² = 3/8
P(X = 2) = 3C2 θ² (1 − θ)¹ = 3/8
P(X = 3) = 3C3 θ³ (1 − θ)⁰ = 1/8
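Assuming the scipy library is available, the binomial probabilities in Examples 1 and 3 can be verified directly; this is only a checking sketch.

```python
from scipy.stats import binom

# Example 1: n = 6, theta = 0.3
print(binom.pmf(2, 6, 0.3))                            # P(X = 2) ≈ 0.3241
print(binom.cdf(5, 6, 0.3) - binom.cdf(1, 6, 0.3))     # P(1 < X <= 5) ≈ 0.579

# Example 3: n = 3, theta = 0.5
print(binom.pmf([0, 1, 2, 3], 3, 0.5))                 # [0.125 0.375 0.375 0.125]
```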
Remark: If X is a binomial random variable with parameters n and θ, then E(X) = nθ and Var(X) = nθ(1 − θ).
Certain events in nature are said to occur spontaneously, i.e. they occur at random times,
independently of each other, but with a certain intensity λ. The intensity is the average number of
spontaneous events per time interval. The number of spontaneous events X in any given concrete
time interval is then Poisson distributed, and we write X ~ Pois(λ). X is a discrete
random variable and takes values in the set {0, 1, 2, 3, ...}.
Often we are interested in the number of events which occur in a specific period of time or in a
specific area of volume: The Poisson distribution is used as a distribution of rare events, such as:
Number of alpha particles emitted from a radioactive source during a given period of
time.
Number of telephone calls coming into an exchange during one unit of time.
Number of diseased trees per acre of certain woodland.
Number of death claims received per day by an insurance company.
Number of misprints.
Accidents.
Hereditary defects.
Arrivals.
Natural disasters like earthquakes.
The process that gives rise to such events is called Poisson process.
Examples:
1. If 1.6 accidents can be expected in an intersection on any given day, what is the
probability that there will be 3 accidents on any given day?
Solution:
Let X = the number of accidents, λ = 1.6
X ~ Poisson(1.6) ⇒ P(X = x) = (1.6^x e^(−1.6)) / x!
P(X = 3) = (1.6³ e^(−1.6)) / 3! = 0.1380
2. Operators of toll roads and bridges need information for staffing tollbooths so as to
minimize queues (waiting lines) without using too many operators. Assume that in a
specified time period the number of cars per minute approaching a tollbooth has a mean
of 10. Traffic engineers are interested in the probability that exactly 11 cars approach the
tollbooth in the minute from 12 noon to 12:01.
P(11) = (10¹¹ e^(−10)) / 11! = 0.114
Thus, there is about an 11% chance that exactly 11 cars would approach the tollbooth the first
minute after noon.
E(X) = λ , Var(X) = λ
Example:
1. Find the binomial probability P(X=3) by using the Poisson distribution if θ = 0.01 and
n = 200
Solution: Using Poisson, λ = nθ = 0.01 × 200 = 2
⇒ P(X = 3) = (2³ e^(−2)) / 3! = 0.1804
Using binomial, n = 200, θ = 0.01
⇒ P(X = 3) = 200C3 (0.01)³ (0.99)¹⁹⁷ = 0.1814
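Assuming scipy is available, the closeness of the Poisson approximation can be checked in one line per distribution; a sketch:

```python
from scipy.stats import binom, poisson

n, theta = 200, 0.01
lam = n * theta                      # lambda = 2
print(binom.pmf(3, n, theta))        # exact binomial  ≈ 0.1814
print(poisson.pmf(3, lam))           # Poisson approx. ≈ 0.1804
```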
Description
The Hyper geometric Distribution arises when sampling is performed from a finite population
without replacement thus making trials dependent on each other.
The assumptions leading to the hyper geometric distribution are as follows:
1. The population or set to be sampled consists of N individuals, objects, or elements (a
finite population).
2. Each individual can be characterized as a success (S) or a failure (F), and there are m
successes in the population.
3. A sample of n individuals is selected without replacement in such a way that each subset
of size n is equally likely to be chosen.
If X is the number of successes in a completely random sample of size n drawn from a
population consisting of m successes and (N –m) failures, then the probability distribution of
X, called the hyper geometric distribution, is given by;
P(X = x) = [C(m, x) × C(L, n − x)] / C(N, n), where
m = elements called successes
L = N – m = elements called failures
A sample of n elements is selected at random without replacement.
x = number of successes
Therefore, X is said to follow a hypergeometric distribution, denoted by X ~ HG(n, m, L, N).
Example:
1. Draw 6 cards from a deck without replacement.
What is the probability of getting two hearts?
m = 13 number of hearts, L = 39 number of non-hearts
N = 52 total
x=2
n=6
P(2 hearts) = [C(13, 2) × C(39, 4)] / C(52, 6) = 0.31513
2. What is the probability of getting at most 2 diamonds in the 5 selected cards without
replacement from a well shuffled deck?
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
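Assuming scipy is available, scipy.stats.hypergeom reproduces both card examples (its arguments are the population size, the number of success states, and the sample size); a sketch:

```python
from scipy.stats import hypergeom

# Example 1: P(exactly 2 hearts in 6 cards drawn without replacement)
print(hypergeom.pmf(2, 52, 13, 6))    # ≈ 0.31513

# Example 2: P(at most 2 diamonds in 5 cards) = P(0) + P(1) + P(2)
print(hypergeom.cdf(2, 52, 13, 5))    # ≈ 0.907
```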
The mean and variance of the hypergeometric random variable X having PMF h(x; n, m, L, N) are
E(X) = n(m/N) and Var(X) = [(N − n)/(N − 1)] × n(m/N)(1 − m/N).
The ratio m/N is the proportion of successes in the population. If we replace m/N by θ in E(X)
and Var(X), we get:
E(X) = nθ
Var(X) = [(N − n)/(N − 1)] × nθ(1 − θ)
Note: for any values of the parameters, the mean of X is the same whether the sampling is with
or without replacement. On the other hand, when sampling is without replacement the variance of X
is smaller, by the factor (N − n)/(N − 1).
Then,
P(X = x) = [C(Nθ, x) × C(N(1 − θ), n − x)] / C(N, n) → nCx θ^x (1 − θ)^(n−x) as N → ∞,
i.e. the hypergeometric probabilities approach the binomial probabilities when the population is large.
So far we have dealt with random variables with a finite number of possible values. For example;
if X is the number of heads that will appear, when you flip a coin 5 times, X can only take the
values 0, 1, 2, 3, 4, or 5.
Some variables can take a continuous range of values, for example a variable such as the height
of a human being, weight of human beings, the lifetime of an electronic component etc... For a
continuous random variable X, the analogue of a histogram is a continuous curve (the probability
density function) and it is our primary tool in finding probabilities related to the variable. As
with the histogram for a random variable with a finite number of values, the total area under the
curve equals 1. Here are some of the important continuous distributions.
A random variable X is said to be uniformly distributed over the interval [θ1, θ2] if its probability density function is
f(x) = 1 / (θ2 − θ1), for −∞ < θ1 ≤ x ≤ θ2 < ∞ …………………..(1)
= 0, otherwise
X ~ U(θ1, θ2), for short.
It is easy to see that this is a valid PDF (because f(x) ≥ 0 and ∫ f(x) dx = 1).
We can also write this distribution with this alternative notation:
θ|θ1, θ2 ∼ U (θ1,θ2) ………………………….2
The first and second equations are equivalent. The latter simply says: x is distributed uniformly
in the range θ1 andθ2 , and it is impossible that x lies outside of that range.
Intuitively, this distribution states that all values within a given range [θ1,θ2] are equally likely.
Visually, the graph of f(x) is a horizontal line at height 1/(θ2 − θ1) over the interval from θ1 to θ2 on the x-axis.
Where the rectangular region has area (θ2 – θ1) [1/ (θ2 – θ1)] = 1 (width times height).
E(X) = (θ1 + θ2) / 2 and Var(X) = σ² = (θ2 − θ1)² / 12
The normal distribution is the most important distribution in statistics, since it arises naturally in
numerous applications. The key reason is that large sums of (small) random variables often turn
out to be normally distributed.
A random variable X is said to have the normal distribution with parameters μ and σ if its
probability density function is given by:
f(x) = [1 / (σ√(2π))] exp{−(x − μ)² / (2σ²)}, −∞ < x < ∞, −∞ < μ < ∞, σ > 0
X ~ N(μ, σ²)
where μ = E(X) and σ² = Var(X); μ and σ² are the parameters of the distribution.
5. Total area under the curve sums to 1, i.e., the area under the curve to the right of the
mean is 0.5, and the area under the curve to the left of the mean is 0.5.
∫ from −∞ to ∞ of f(x) dx = 1
A normal distribution with expected value μ = 0 and variance σ² = 1 is called a standard normal
distribution. The standard deviation in a standard normal distribution obviously equals one. The
density φ(x) of a standard normal distribution is
φ(x) = (1/√(2π)) exp{−x²/2}.
Note: To facilitate the use of normal distribution to calculate probabilities, the following
distribution known as the standard normal distribution was derived by using the transformation.
Z = (X − μ) / σ
⇒ f(z) = (1/√(2π)) exp{−z²/2}
Areas under the standard normal distribution curve have been tabulated in various ways. The
most common ones are the areas between;
Z = 0 and a positive value of Z.
Given a normally distributed random variable X with mean μ and standard deviation σ,
⇒ P(a < X < b) = P((a − μ)/σ < Z < (b − μ)/σ)
Note:
P (a < X < b) = P (a≤ X <b )
= P (a < X≤ b)
= P (a≤ X ≤ b)
A table of standardized normal values attached at the end of the module as an appendix
can then be used to obtain an answer in terms of the converted problem.
The interpretation of Z values is straightforward. Since σ = 1, if Z = 2, the corresponding X value
is exactly 2 standard deviations above the mean. If Z = -1, the corresponding X value is one
standard deviation below the mean. If Z = 0, X = the mean, i.e. μ.
Rules for using the standardized normal distribution
It is very important to understand how the standardized normal distribution works, so we will
spend some time here going over it. Recall that, for a random variable X,
F(x) = P(X ≤ x)
NOTE: While memorization may be useful, you will be much better off if you gain an intuitive
understanding as to why the rules that follow are correct. Try drawing pictures of the normal
distribution to convince yourself that each rule is valid.
RULES:
1. P(Z ≤ a) = F(a) ⇒ a one-tail probability
When a is positive, F(a) can be read directly from the standard normal table.
When a is negative, use the symmetry relation F(a) = 1 − F(−a), where −a is positive and can be looked up in the table.
Examples:
Find:
1. P (Z ≤ a) for a = 1.96, -1.96, 2.00, -2.00
To solve: for positive values of a, look up and report the value of F(a) given in the standard
normal table attached as an appendix at the end. For negative values of a, look up F(−a) (i.e. F of
the absolute value of a) and report 1 − F(−a).
Solution:
P (Z ≤ 1.96) = F (1.96) = 0 .975
P (Z ≤ -1.96) = F (-1.96) = 1 – F (1.96) = .025
P (Z ≤ 2.00) = F (2.00) = 0.9772
P (Z ≤ -2.00) = F (-2.00) = 1 – F (2.00) = 0.0228
It is also possible to work easily in the other direction and determine what a is, given P(Z ≤ a).
So far, most of the calculations above are of the form: find the probability P(Z ≤ z) for a given
value of z. Often we are also interested in an inverse problem: find the value zA such
that the probability for Z to be greater than zA equals a specified value A.
Formally, our question is: for what value of zA do we have P(Z > zA) = A?
Examples:
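The appendix table values quoted above, and the inverse lookup, can be checked numerically if scipy is available; this is only an illustrative sketch.

```python
from scipy.stats import norm

for a in (1.96, -1.96, 2.00, -2.00):
    print(a, norm.cdf(a))       # 0.9750, 0.0250, 0.9772, 0.0228

# Inverse problem: find z_A with P(Z > z_A) = A, e.g. A = 0.025
print(norm.ppf(1 - 0.025))      # ≈ 1.96
```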
Gamma random variables are used to model a number of physical quantities. Some examples are
The time it takes for something to occur, e.g. a lifetime or the service time in a queue.
The rate at which some physical quantity is accumulating during a certain period of time,
e.g. the excess water flowing into a dam during a certain period of time due to rain or the
amount of grain harvested during a certain season.
A random variable X is said to have a gamma distribution, X ~ Gamma(α, β), with parameters α > 0 and
β > 0, if its probability density function has the form
f(x | α, β) = [1 / (β^α Γ(α))] x^(α−1) e^(−x/β), for x > 0, α, β > 0
= 0, otherwise
Where
Γ(α) = ∫ from 0 to ∞ of x^(α−1) e^(−x) dx is the gamma function.
Γ(1) = 1 (show!)
Γ(α) = (α − 1)Γ(α − 1) for any α > 1
Γ(α) = (α − 1)! for integer α
Γ(1/2) = √π
Γ(α + 1) = αΓ(α) = α! for integer α
E(X) = αβ and Var(X) = αβ²
A random variable X has a beta distribution, X ~ Beta(α, β), with parameters α > 0 and β > 0
if the density of X is
f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β), for 0 ≤ x ≤ 1
= 0, elsewhere
Alternatively,
f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), for 0 ≤ x ≤ 1
= 0, elsewhere
B(α, β) = ∫ from 0 to 1 of x^(α−1) (1 − x)^(β−1) dx, a normalizing constant (equal to Γ(α)Γ(β)/Γ(α + β)).
E(X) = α / (α + β) and σ² = Var(X) = αβ / [(α + β)²(α + β + 1)]
Example:
1. The length of your morning commute (in hours) is a random variable X that has a beta
distribution withα =β=2 .
a. Find the probability that your commute tomorrow will take longer than 30
minutes.
b. Your rage level is R = X2 + 2X +1. Find the expected value of R
Solution:
f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β)
= [Γ(4) / (Γ(2)Γ(2))] x(1 − x)
= 6(x − x²)
a. P(X ≥ 1/2) = ∫ from 1/2 to 1 of 6(x − x²) dx
= 6 [x²/2 − x³/3] evaluated from 1/2 to 1
= 6 (1/2 − 1/3 − 1/8 + 1/24) = 1/2
b. E(X) = α / (α + β) = 1/2
E(X²) = Var(X) + [E(X)]²
= αβ / [(α + β)²(α + β + 1)] + [α / (α + β)]²
= 4 / (16 × 5) + 1/4 = 3/10
E(R) = E(X²) + 2E(X) + 1 = 3/10 + 2(1/2) + 1 = 23/10
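Assuming scipy is available, both parts of the commute example can be verified numerically; the sketch below uses the built-in mean and variance of the Beta(2, 2) distribution.

```python
from scipy.stats import beta

a, b = 2, 2
print(1 - beta.cdf(0.5, a, b))     # a. P(X >= 1/2) = 0.5

ex = beta.mean(a, b)               # E(X)   = 1/2
ex2 = beta.var(a, b) + ex ** 2     # E(X^2) = 3/10
print(ex2 + 2 * ex + 1)            # b. E(R) = 2.3
```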
Chapter Four
4. Joint and conditional probability distributions
Introduction
So far we have considered only distributions with one random variable. There are many
problems that involve two or more random variables. We need to examine how these random
variables are distributed together (‘’jointly’’). There are discrete and continuous joint
distributions based on the type of random variables.
Learning objectives:
After successful completion of this chapter, students are able to:
Differentiate Joint and Marginal Distributions
Formulate marginal distributions from joint discrete and continuous distributions.
Understand the Conditional Distributions and Independence
Compute expectations, covariance and Correlations for joint random variables.
Solve the conditional Expectation of joint random variables.
In general, if X and Y are two random variables, the probability distribution that defines their
simultaneous behavior is called a joint probability distribution.
Shown here as a table for two discrete random variables, which gives P(X = x, Y = y):
              x = 1    x = 2    x = 3
y = 1:       0          1/6       1/6
y = 2:       1/6       0          1/6
y = 3:       1/6       1/6       0
If X and Y are discrete, this distribution can be described with a joint probability mass function,
denoted by Pxy(X = x, Y = y).
If X and Y are continuous, this distribution can be described with a joint probability
density function and denoted by; f xy ( X=x ,Y = y )
If we are given a joint probability distribution for X and Y, we can obtain the
individual probability distribution for X or for Y (and these are called the
Marginal Probability Distributions)...
The rule for finding a marginal is simple.
To obtain a marginal PMF/PDF from a joint PMF/PDF, sum or integrate out the
variable(s) you don't want.
For discrete, this is obvious from the definition of the PMF of a random variable.
pX(x) = P(X = x) = Σ_y pX,Y(x, y)
pY(y) = P(Y = y) = Σ_x pX,Y(x, y)
Example:
1. Measurements for the length and width of a rectangular plastic covers for CDs are
rounded to the nearest mm (so they are discrete).
Let: X denote the length and
Y denotes the width.
The possible values of X are 129, 130, and 131 mm. The possible values of Y are 15 and16 mm
(Thus, both X and Y are discrete).
There are 6 possible pairs (X; Y).
We show the probability for each pair in the following table:
                         x = length
y = width       129       130       131
15                  0.12      0.42      0.06
16                  0.08      0.28      0.04
The sum of all the probabilities is 1.0. The combination with the highest probability is (130, 15).
The combination with the lowest probability is (131, 16).
The joint probability mass function is the function pX,Y(x, y) = P(X = x, Y = y). For example, we
have pX,Y(129, 15) = 0.12.
Questions:
a. Find the probability that a CD cover has length of 129mm (i.e. X = 129).
P (X = 129) = P (X = 129 and Y = 15) + P (X = 129 and Y = 16)
= 0.12 + 0.08 = 0.20
b. What is the probability distribution of X?
x = length     129       130       131
fX(x)              0.20      0.70      0.10
This table represents the values of the random variable X and its
corresponding probabilities, and it is called the marginal distribution of X.
NOTE: We've used a subscript X in the probability mass function of X, or f X ( x ) , for clarification
since we're considering more than one variable at a time now.
y 15 16
f Y ( y) 0.60 0.40
Because the probability mass functions for X and Y appear in the margins of the table
(i.e. column and row totals), they are often referred to as the Marginal Distributions for X
and Y.
When there are two random variables of interest, we also use the term bivariate
probability distribution or bivariate distribution to refer to the joint distribution.
The joint probability mass function of the discrete random variables X and Y, denoted as
P X ,Y (x , y ), satisfies;
1. pX,Y(x, y) ≥ 0
2. Σ_x Σ_y pX,Y(x, y) = 1
3. pX,Y(x, y) = P(X = x, Y = y)
If X and Y are discrete random variables with joint probability mass function PX,Y(x, y), then the
marginal probability mass functions of X and Y are
PX(x) = Σ_y PX,Y(x, y) and PY(y) = Σ_x PX,Y(x, y),
where the sum for PX(x) is over all points in the range of (X, Y) for which X = x, and the sum
for PY(y) is over all points in the range of (X, Y) for which Y = y.
4.1.3 Joint Probability Density Function
A bivariate function with values f (X , Y ) defined over the X Y plane is called joint probability
density function for the continuous random variable X and Y, denoted as f X , Y ( x , y ) and satisfies
the following properties:
1. f X , Y ( x , y ) ≥ 0 , for all values of x , y
2. ∫∫ fX,Y(x, y) dx dy = 1, integrating over −∞ < x < ∞ and −∞ < y < ∞
If X and Y are continuous random variables with joint probability density function f XY (x ; y),
then the marginal density functions for X and Y are:
fX(x) = ∫ from −∞ to ∞ of fXY(x, y) dy, and
fY(y) = ∫ from −∞ to ∞ of fXY(x, y) dx
Where the first integral is over all points in the range of (X ; Y ) for which X =x , and the second
integral is over all points in the range of (X ; Y )for whichY = y .
We have previously shown that the conditional probability of A given B can be obtained by
dividing the probability of the intersection by the probability of B, specifically,
P( A ∩ B)
P ( A|B )=
P( B)
Definition: for discrete random variables X andY , the conditional PMF of X given Y and vice
versa are defined as:
PX|Y(x|y) = P(X = x | Y = y) = PXY(x, y) / PY(y), provided PY(y) > 0
The conditional probability can be stated as the joint probability over the
marginal probability.
A conditional probability distribution PY ∨X ( y|x ) and P X ∨Y ( x| y ) have the following properties:
1. PY ∨X ( y|x ) ≥ 0, P X ∨Y ( x| y ) ≥ 0
Example (continuing the CD cover example):
a. Find the probability that a CD cover has a length of 130mm, given that the width is 15mm.
Solution: P(X = 130 | Y = 15) = P(X = 130, Y = 15) / P(Y = 15) = 0.42 / 0.60 = 0.70
b. Find the conditional distribution of X given Y =15.
P(X = 129 | Y = 15) = 0.12 / 0.60 = 0.20
P(X = 130 | Y = 15) = 0.42 / 0.60 = 0.70
P(X = 131 | Y = 15) = 0.06 / 0.60 = 0.10
Therefore the conditional distribution of X given Y = 15, i.e. pX|Y=15(X = x | Y = 15), is:
x                           129       130       131
pX|Y=15(x | 15)       0.20      0.70      0.10
In the continuous case, the idea of a conditional distribution takes on a slightly different meaning
than in the discrete case. If X and Y are both continuous, P(X = x | Y = y) is not defined because
the probability of any one point is identically zero. It makes sense, however, to define a conditional
distribution function, i.e.
P( X ≤ x∨Y = y)
Because the value of Y is known when we compute the value the probability that X is less than
some specific value.
Definition: Let X and Y be jointly continuous random variables with joint probability density
f XY (x , y) and marginal densities f X ( x ) and f Y ( y), respectively. For any Y such that f Y ( y ) >0 , the
conditional probability density function of X given Y = y, is defined to be;
fX|Y(x|y) = fXY(x, y) / fY(y)
and similarly, fY|X(y|x) = fXY(x, y) / fX(x)
Example: f(x, y) = (1/6)(x + 4y), for 0 < x < 2, 0 < y < 1
= 0, otherwise
The marginal density of X is fX(x) = (1/6)(x + 2), 0 < x < 2, while the marginal density of Y is
fY(y) = (1/6)(2 + 8y), 0 < y < 1. Now,
a. Find the conditional distribution of X given Y = y. This is given by:
fX|Y(x|y) = fXY(x, y) / fY(y) = [(1/6)(x + 4y)] / [(1/6)(2 + 8y)] = (x + 4y) / (8y + 2)
b. For 0 < x < 2 and 0 < y < 1, find the probability that X ≤ 1 given that Y = 1/2.
First determine the conditional density when y = 1/2, as follows:
fX|Y(x | 1/2) = (x + 4y) / (8y + 2) with y = 1/2
= (x + 4(1/2)) / (8(1/2) + 2)
= (x + 2) / 6
Then, P(X ≤ 1 | Y = 1/2) = ∫ from 0 to 1 of (1/6)(x + 2) dx
= (1/6)(1/2 + 2) − 0
= 1/12 + 2/6 = 5/12
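Assuming sympy is available, the marginal, the conditional density and the conditional probability can all be recovered by symbolic integration; this is only a verification sketch.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = sp.Rational(1, 6) * (x + 4 * y)        # joint density on 0 < x < 2, 0 < y < 1

f_y = sp.integrate(f_xy, (x, 0, 2))           # marginal of Y: (2 + 8y)/6
f_x_given_y = sp.simplify(f_xy / f_y)         # (x + 4y)/(8y + 2)
print(f_x_given_y)

prob = sp.integrate(f_x_given_y.subs(y, sp.Rational(1, 2)), (x, 0, 1))
print(prob)                                   # P(X <= 1 | Y = 1/2) = 5/12
```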
We have previously shown that two events A and B are independent if the probability of their
intersection is the product of their individual probabilities, i.e. P(A ∩ B) = P(A)P(B).
Discrete Random Variables: If X and Y are discrete random variables with joint probability
mass function p(x, y) and marginal mass functions PX(x) and PY(y), respectively, then X
and Y are independent if, and only if,
PX,Y(x, y) = PX(x)PY(y)
= p(x)p(y)
for all pairs of real numbers (x, y).
Example: Continuing the plastic covers example once again...
                              X = length
Y = width          129       130       131       Row totals
15                      0.12      0.42      0.06      0.60
16                      0.08      0.28      0.04      0.40
Column totals     0.20      0.70      0.10      1
Here every joint probability equals the product of the corresponding row and column totals
(for example 0.60 × 0.20 = 0.12), so X and Y are independent.
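Assuming numpy is available, the marginals and the independence check for this table take only a few lines; a sketch:

```python
import numpy as np

# Joint PMF of (Y = width, X = length) for the CD-cover example
joint = np.array([[0.12, 0.42, 0.06],    # y = 15
                  [0.08, 0.28, 0.04]])   # y = 16

p_x = joint.sum(axis=0)                  # marginal of X: [0.20, 0.70, 0.10]
p_y = joint.sum(axis=1)                  # marginal of Y: [0.60, 0.40]

# Independent iff every joint cell equals the product of its marginals
print(np.allclose(joint, np.outer(p_y, p_x)))   # True
```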
Continuous Bivariate Random Variables: If X and Y are continuous random variables with
joint probability density function f(x, y) and marginal density functions f X ( x ) and f Y ( y )
respectively then X and Y are independent if and only if;
f X , Y ( x , y)=f X (x) f Y ( y )
¿ f ( x ) f ( y ) ,for all pairs of real numbers (x, y).
4.3 Expectation
Definition: for random variables X and Y, the expectation of their product can be given by
E(XY) = Σ_x Σ_y xy pX,Y(x, y), if X and Y are discrete random variables
E(XY) = ∫∫ xy fX,Y(x, y) dx dy (over −∞ < x, y < ∞), if X and Y are continuous random variables
4.3.1 Variance
Variance of a Single Random Variable: The variance of a random variable X with mean μ is
given by;
Var(X) ≡ σ² ≡ E[(X − μ)²]
≡ ∫ from −∞ to ∞ of (x − μ)² f(x) dx
≡ ∫ from −∞ to ∞ of x² f(x) dx − μ²
≡ E(X²) − (E(X))²
The variance is a measure of the dispersion of the random variable about the mean.
Variance of a sum:
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)
4.3.2 Covariance
Definition: Let X and Y be any two random variables defined on the same probability space. The
covariance of X and Y, denoted Cov[X, Y] or σX,Y, is defined as
Cov(X, Y) = E[(X − μX)(Y − μY)] = E(XY) − E(X)E(Y).
The covariance measures the interaction between two random variables, but its numerical value
is not independent of the units of measurement of X and Y. Positive values of the covariance
imply that X tends to increase when Y increases; negative values indicate that X tends to decrease
as Y increases.
Cov(Σ_{i=1}^k ai Xi, Σ_{j=1}^n bj Yj) = Σ_{i=1}^k Σ_{j=1}^n ai bj Cov(Xi, Yj)
4.3.3 Correlation
The correlation of X and Y is the covariance rescaled to be unit-free: ρ(X, Y) = Cov(X, Y) / (σX σY).
Generally,
1. −1 ≤ ρ(X, Y) ≤ 1
2. if ρ(X, Y) = 1, then Y = aX + b, where a > 0
3. if ρ(X, Y) = −1, then Y = aX + b, where a < 0
4. ρ(aX + b, cY + d) = ρ(X, Y) for a, c > 0
5. if ρ(X, Y) = 0, we say that X and Y are uncorrelated
6. if ρ(X, Y) > 0, we say that X and Y are positively correlated
7. if ρ(X, Y) < 0, we say that X and Y are negatively correlated.
Example (discrete case): for a pair (X, Y) with X taking the values 0, 1, 2 and Y taking the values −1, 0, 1, suppose the marginal PMFs obtained by summing the joint PMF are
PX(x) = Σ_y PX,Y(x, y) = 1/4 for x = 0, 2 and 1/2 for x = 1
PY(y) = Σ_x PX,Y(x, y) = 1/4 for y = −1, 1 and 1/2 for y = 0
Then
E(X²) = Σ x² PX(x) = 0²(1/4) + 1²(1/2) + 2²(1/4) = 3/2
E(Y) = Σ y PY(y) = (−1)(1/4) + 0(1/2) + 1(1/4) = 0
E(Y²) = Σ y² PY(y) = (−1)²(1/4) + 0²(1/2) + 1²(1/4) = 1/2
Example (continuous case): let fX,Y(x, y) = 3x for 0 ≤ y ≤ x ≤ 1 (and 0 otherwise). Then
fX(x) = ∫ from −∞ to ∞ of fX,Y(x, y) dy = ∫ from 0 to x of 3x dy = 3x², 0 ≤ x ≤ 1
E(X) = ∫ from 0 to 1 of x (3x²) dx = 3/4
E(X²) = ∫ from 0 to 1 of x² (3x²) dx = 3/5
fY(y) = ∫ from y to 1 of 3x dx = (3/2)(1 − y²), 0 ≤ y ≤ 1
E(Y) = ∫ from 0 to 1 of y (3/2)(1 − y²) dy = (3/2)(y²/2 − y⁴/4) evaluated at 1 = (3/2)(1/2 − 1/4) = 3/8
E(Y²) = ∫ from 0 to 1 of y² (3/2)(1 − y²) dy = (3/2)(y³/3 − y⁵/5) evaluated at 1 = (3/2)(1/3 − 1/5) = 1/5
E(XY) = ∫ from 0 to 1 ∫ from 0 to x of xy (3x) dy dx
= ∫ from 0 to 1 [∫ from 0 to x of y dy] 3x² dx
= ∫ from 0 to 1 (x²/2) 3x² dx = 3/10
Cov(X, Y) = E(XY) − E(X)E(Y) = 3/10 − (3/4)(3/8) = 3/160
ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y)) = (3/160) / √((3/80)(19/320)) ≈ 0.397
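Assuming sympy is available, every quantity in this example (including the correlation) can be recovered by symbolic double integration over the triangle 0 ≤ y ≤ x ≤ 1; this is only a verification sketch.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = 3 * x                                             # joint density on 0 <= y <= x <= 1

ex  = sp.integrate(x * f, (y, 0, x), (x, 0, 1))       # E(X)   = 3/4
ey  = sp.integrate(y * f, (y, 0, x), (x, 0, 1))       # E(Y)   = 3/8
ex2 = sp.integrate(x**2 * f, (y, 0, x), (x, 0, 1))    # E(X^2) = 3/5
ey2 = sp.integrate(y**2 * f, (y, 0, x), (x, 0, 1))    # E(Y^2) = 1/5
exy = sp.integrate(x * y * f, (y, 0, x), (x, 0, 1))   # E(XY)  = 3/10

rho = (exy - ex * ey) / sp.sqrt((ex2 - ex**2) * (ey2 - ey**2))
print(sp.simplify(rho), float(rho))                   # sqrt(57)/19 ≈ 0.397
```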
4.3.4 Conditional Expectation
Example: roll two fair dice and let S = D1 + D2 be the sum of the faces that show. Given that the
second die shows a 6, the possible sums are 7, 8, 9, 10, 11 and 12, each with conditional probability 1/6, so
E(S | D2 = 6) = (1/6)(7 + 8 + 9 + 10 + 11 + 12) = 57/6 = 9.5
This makes intuitive sense since 6 + E(value of D1) = 6 + 3.5 = 9.5.
The conditional variance of a discrete random variable X given Y = y, denoted Var(X | Y = y) or
σ²X|Y=y, is
Var(X | Y = y) = Σ_x (x − μX|Y=y)² PX|Y=y(x | y)
= Σ_x x² PX|Y=y(x | y) − μ²X|Y=y
Example: Find E[Y | X] if the joint probability density function is fX,Y(x, y) = 1/x, 0 < y ≤ x ≤ 1.
Solution: fX(x) = ∫ from 0 to x of (1/x) dy = 1, 0 ≤ x ≤ 1
fY|X(y|x) = fX,Y(x, y) / fX(x) = 1/x, 0 < y ≤ x
E(Y | X = x) = ∫ from 0 to x of y fY|X(y|x) dy = ∫ from 0 to x of (y/x) dy = x/2
Therefore, the conditional expectation is E(Y | X) = X/2.
(For continuous random variables the analogous conditional variance is Var(X | Y = y) = ∫ x² fX|Y=y(x | y) dx − μ²X|Y=y.)
Chapter five
5. Sampling and Sampling Distribution
Introduction
In statistics, sampling plays a vital role. Generally, the purpose of this chapter is to
introduce and equip students with the concepts of sampling and sampling distributions.
Learning objectives:
At the end of the chapter students are expected to:
Understand what sampling is and why it is used, and explain the different
types of sampling techniques/methods.
Describe the concept of sampling distribution and elaborate the different
types of sampling distribution (i.e., sampling distribution of the sample mean
and sample proportion).
Sampling is very often used in our daily life. For example while purchasing food grains from a
shop we usually examine a handful from the bag to assess the quality of the commodity.
A doctor examines a few drops of blood as sample and draws conclusion about the blood
constitution of the whole body. Thus most of our investigations are based on samples. In this
chapter, let us see the importance of sampling and the various methods of sample selections from
the population.
1. A (statistical) population: is the complete set of possible measurements for which
inferences are to be made. The population represents the target of an investigation, and
the objective of the investigation is to draw conclusions about the population hence we
sometimes call it target population. Sometimes it is possible and practical to examine
every person or item in the population we wish to describe. We call this a complete
enumeration, or census. We use sampling when it is not possible to measure every item
in the population. Statisticians use the word population to refer not only to people but to
all items that have been chosen for study.
Examples
Population of trees under specified climatic conditions
Population of animals fed a certain type of diet
Population of farms having a certain type of natural fertility
Population of households, etc.
Universality
Qualitativeness
Detailedness
Non-representativeness
b. Non-Probability(non-random) sampling:
It is a sampling technique in which the choice of individuals for a sample depends on
convenience, personal choice or interest. It is the one where the researcher's discretion is used to
select the sample units.
Judgment Sampling
In this case, the person taking the sample has direct or indirect control over which items are
selected for the sample.
Convenience Sampling
In this method, the decision maker selects a sample from the population in a manner that is
relatively easy and convenient.
Quota Sampling
In this method, the decision maker requires the sample to contain a certain number of items with
a given characteristic. Many political polls are, in part, quota sampling.
Snowball/networking sampling
In this sampling method samples are generated in a networked manner. Important method if the
researcher has little or no knowledge about the population under study.
c. Mixed Sampling:
Here samples are selected partly according to some probability and partly according to a fixed
sampling rule; they are termed as mixed samples and the technique of selecting such samples is
known as mixed sampling.
Exercises:
1. What are the merits and limitations of simple random sampling stratified random
sampling, systematic sampling and cluster sampling techniques?
2. A population of size 800 is divided into 3 strata of sizes 500, 200, 100 respectively. A
stratified sample size of 160 is to be drawn from the population. Determine the sizes of
the samples from each stratum under proportional allocation.
Note:
Let N = population size, n = sample size.
1. Suppose simple random sampling is used.
We have N^n possible samples if sampling is with replacement, and NCn possible samples if
sampling is without replacement (and order is ignored).
So far we have seen the techniques by which samples can be drawn from the population. Using one of
the sampling techniques already discussed above, if we take different samples from a population,
the statistic that we compute for each sample combination need not be the same and will most
likely vary from sample to sample.
For each sample combination we can compute the mean value (i.e., the sample
statistic). The following table shows the mean value of each sample combination (samples of
size 2 drawn with replacement from the population 2, 4, 6, 8):
Sample    Mean      Sample    Mean
2, 2          2              6, 2          4
2, 4          3              6, 4          5
2, 6          4              6, 6          6
2, 8          5              6, 8          7
4, 2          3              8, 2          5
4, 4          4              8, 4          6
4, 6          5              8, 6          7
4, 8          6              8, 8          8
Therefore, as we see from the table, different sample combinations yield different sample means
(statistics). The population mean can be computed as
μ = Σx / N = 20 / 4 = 5.
Some of the sample combinations have a mean equal to the population mean 5.
But the population mean is generally different from most of the sample means, and the mean of the
sample itself differs from sample to sample. This leads us to the concept of the sampling distribution.
Sampling Distribution: Given a statistic computed from samples (such as the sample mean), if we
arrange its possible values in ascending order and assign a probability to each value, or if we present
the values in the form of a relative frequency distribution, the result is called the sampling
distribution of that statistic. Alternatively, it is the probability distribution of all the values of a
sample statistic.
We do have sampling distribution of the mean, proportion etc.
Generally, the probability distribution of a statistic is called sampling distribution. The
probability distribution of X is called the sampling distribution of the mean. The sampling
distribution of a statistic depends on the size of the population, the size of the samples, and the
method of choosing the samples.
Random Sample
The random variables X 1 , X 2, … Xn are a random sample of size n if...
a. The Xi ' s are independent random variables, and
b. Every Xi has the same probability distribution (i.e. they are drawn from the same
population).
NOTE: the observed data x 1 x 2 , … xn are also referred to as a random sample.
Statistic
A statistic is any function of the observations in a random sample.
Example:
The mean of X is a function of the observations (specifically, a linear combination of the
observations).
n
∑ Xi 1 1 1
X = i =1 = X 1 + X 2+ …+ X n
n n n n
A statistic is a random variable, and it has a probability distribution.
The distribution of a statistic is called the sampling distribution of the statistic because it depends
on the sample chosen.
Example:
Suppose we have a population of size N = 5, consisting of the age of five children: 4, 6, 8, 10,
and 12.
Population mean, μ = 8
Population Variance, σ 2=8
Take samples of size 2 with replacement and construct sampling distribution of the sample mean.
Solution:
N = 5, n = 2
We have N^n = 5² = 25 possible samples since sampling is with replacement.
Step 1: Draw all possible samples:
4 6 8 10 12
4 4, 4 4, 6 4, 8 4, 10 4, 12
6 6, 4 6, 6 6, 8 6, 10 6, 12
8 8, 4 8, 6 8, 8 8, 10 8, 12
10 10, 4 10, 6 10, 8 10, 10 10, 12
12 12, 4 12, 6 12, 8 12, 10 12, 12
Step 2: Compute the mean of each sample and form the frequency distribution of the 25 sample means X̄i (with frequencies fi).
a. Find the mean of X̄, say μX̄:
μX̄ = Σ X̄i fi / Σ fi = 200 / 25 = 8 = μ
b. Find the variance of X̄, say σ²X̄:
σ²X̄ = Σ (X̄i − μX̄)² fi / Σ fi = 100 / 25 = 4 ≠ 8 = σ²
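Assuming Python is available, the whole sampling distribution can be generated by enumerating the 25 equally likely samples, which reproduces both results; a sketch:

```python
from itertools import product
from statistics import mean

population = [4, 6, 8, 10, 12]
samples = list(product(population, repeat=2))       # 25 samples, with replacement
xbars = [mean(s) for s in samples]

mu_xbar = mean(xbars)
var_xbar = sum((v - mu_xbar) ** 2 for v in xbars) / len(xbars)
print(mu_xbar, var_xbar)                            # 8 and 4, i.e. mu and sigma^2 / n
```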
Exercise: construct the sampling distribution taking sample size of 2 without replacement.
Properties of the sampling distribution of the mean
1. The variance of the sample mean σ 2X : Given the population mean (μ), population
variance (σ 2), the sample size (n) and population size (N); the variance of the
sample mean is given as follows:
If sampling is with replacement:
σ²X̄ = σ² / n (show!)
and the standard deviation is given by σX̄ = σ / √n
If sampling is without replacement:
σ²X̄ = (σ² / n) × (N − n)/(N − 1), and the standard deviation is
σX̄ = (σ / √n) × √((N − n)/(N − 1))
The value √((N − n)/(N − 1)) is referred to as the finite population correction factor.
2. In any case the sample mean is an unbiased estimator of the population mean, i.e.
statisticians are in agreement that the expected value of the sample mean is equal to
the population mean.
Algebraically, μX̄ = μ, i.e. E(X̄) = μ (show!).
Sampling may be from a normally distributed population or from a non-normally
distributed population. If the population is normally distributed, then
X̄ ~ N(μ, σ²/n)
⇒ Z = (X̄ − μ) / (σ/√n) ~ N(0, 1)
Example:
If the uric acid values in normal adult males are approximately normally distributed with mean
5.7 mgs and standard deviation 1mg find the probability that a sample of size 9 will yield a
mean.
i. greater than 6
ii. between 5 and 6
iii. less than 5.2
Solution:
Let X̄ be the mean uric acid value in a sample of 9 normal adult males.
μ = 5.7, σ = 1, n = 9
⇒ X̄ ~ N(μ, σ²/n) = N(5.7, 1/9), so σX̄ = 1/3
⇒ Z = (X̄ − μ) / (σ/√n) ~ N(0, 1)
iii. P(X̄ < 5.2) = P(Z < (5.2 − 5.7)/(1/3)) = P(Z < −1.5) = 0.0668
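Assuming scipy is available, all three parts can be checked with the parameters given in the statement of the problem (mean 5.7 mg, standard error 1/√9); this is only a verification sketch.

```python
from scipy.stats import norm

mu, sigma, n = 5.7, 1, 9
se = sigma / n ** 0.5                                # standard error of the mean = 1/3

print(1 - norm.cdf(6, mu, se))                       # i.   P(X-bar > 6)     ≈ 0.184
print(norm.cdf(6, mu, se) - norm.cdf(5, mu, se))     # ii.  P(5 < X-bar < 6) ≈ 0.798
print(norm.cdf(5.2, mu, se))                         # iii. P(X-bar < 5.2)   ≈ 0.067
```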
If we are sampling from a population with unknown distribution, either normal or non-normal,
the sampling distribution of X will still be approximately normal with mean μ X and variance σ 2/n
provided that the sample size is large. This amazing result is an immediate consequence of the
following theorem, called the central limit theorem.
Given a population of any functional form with mean μ and finite variance σ 2, the sampling
distribution of X , computed from samples of size n from the population will be approximately
normally distributed with mean μ and variance σ 2/n, when the sample size is large. i.e.
Z = (X̄ − μ) / (σ/√n)
approaches the standard normal distribution n(z; 0, 1) as n → ∞.
The normal approximation for X will generally be good if n ≥ 30. If n < 30, the approximation is
good only if the population is not too different from a normal distribution and, as stated above, if
the population is known to be normal, the sampling distribution of X will follow a normal
distribution exactly, no matter how small the size of the samples.
The sample size n = 30 is a guideline to use for the central limit theorem. However, as the
statement of the theorem implies, the presumption of normality on the distribution of X becomes
more accurate as n grows larger. The following figure illustrates how the theorem works. It
shows how the distribution of X becomes closer to normal as n grows larger, beginning with the
clearly non-symmetric distribution of that of an individual observation (n = 1). It also illustrates
that the mean of X remains μ X for any sample size and the variance of X gets smaller as n
increases.
As one might expect, the distribution of X will be near normal for sample size n < 30 if the
distribution of an individual observation itself is close to normal.
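The figure referred to above is not reproduced here, but the same behaviour can be illustrated numerically; assuming numpy is available, the sketch below draws sample means from a deliberately skewed (exponential) population and shows the mean of X̄ staying near μ while its variance shrinks roughly like σ²/n, with the histogram of the means becoming increasingly bell-shaped as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # mu = 2, sigma^2 = 4, skewed

for n in (1, 5, 30):
    xbars = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(n, round(xbars.mean(), 3), round(xbars.var(), 3))
# The mean stays near 2; the variance is roughly 4, 4/5 and 4/30 respectively.
```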
Example:
1. An electrical firm manufactures light bulbs that have a length of life that is approximately
normally distributed, with mean equal to 800 hours and a standard deviation of 40 hours.
Find the probability that a random sample of 16 bulbs will have an average life of less
than 775 hours.
Solution:
The sampling distribution of X̄ will be approximately normal, with μX̄ = 800 and σX̄ = 40/√16
= 10. The desired probability is the area under this distribution to the left of x̄ = 775.
Corresponding to x̄ = 775, we find that
z = (775 − 800) / 10 = −2.5
And therefore
P ( X <775 ) =P ( Z <−2.5 ) =0.0062
Chebyshev Inequality
Let X be a random variable with mean μ and standard deviation σ. Then for any positive number
k,
P(|X − μ| ≥ k) ≤ σ² / k².
This is an important result in probability and will be especially useful in our proof of the Weak
Law of Large Numbers.
Example:
1. Let X be any random variable with E(X) = μ and V(X) = σ². Then, if k = cσ, the Chebyshev
inequality states that
P(|X − μ| ≥ cσ) ≤ σ² / (c²σ²) = 1/c².
Thus, for any random variable, the probability of a deviation from the mean of more than c
standard deviations is ≤ 1/c². If, for instance, c = 4, then 1/c² = 0.0625.
2. Let X be a random variable that follows a Poisson distribution with parameter θ = 7. Give a lower bound for P(|X − μ| ≤ 4).
For a Poisson distribution with parameter 7 we have μ = σ² = 7. Then,
P(3 ≤ X ≤ 11) = P(|X − 7| ≤ 4) = P(|X − μ| ≤ 4) ≥ 1 − σ²/4² = 1 − 7/16 = 0.5625.
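For comparison, the Chebyshev bound can be set against the exact Poisson probability with a short Python sketch (illustrative only, assuming scipy is available; λ = 7 as in the example):

```python
# Sketch: Chebyshev lower bound vs. the exact Poisson probability P(3 <= X <= 11).
from scipy.stats import poisson

lam = 7
bound = 1 - lam / 4**2                              # Chebyshev: 1 - sigma^2/k^2 = 0.5625
exact = poisson.cdf(11, lam) - poisson.cdf(2, lam)  # exact P(3 <= X <= 11)
print(round(bound, 4), round(exact, 4))             # the exact probability exceeds the bound
```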
Law of large numbers (LLN)
Actually, this concept comprises the Weak and Strong Law of Large Numbers as well as
Kolmogorov's Law of Large Numbers. But, for the purpose of this module we will see and prove
the Weak Law using two different methods. The first proof uses Chebyshev inequality, and the
second uses what is known as characteristic functions.
Informally, writing Sₙ = X₁ + X₂ + … + Xₙ for the sum of n i.i.d. observations with mean μ, the weak law states that for any ε > 0,
P(|Sₙ/n − μ| ≥ ε) → 0 as n → ∞.
Equivalently,
P(|Sₙ/n − μ| < ε) → 1 as n → ∞.
Definition: Random variables X1, X2, X3…Xn are said to be independent and identically
distributed or i.i.d. if each random variable Xi has the same probability distribution as X1 and the
occurrence of one does not affect the probability of another.
Let X₁, X₂, X₃, …, Xₙ be a sequence of i.i.d. random variables, each with mean E(Xᵢ) = μ and standard deviation σ, and define X̄ₙ = (X₁ + X₂ + … + Xₙ)/n. The weak law of large numbers (WLLN) states that for all k > 0,
lim (n→∞) P(|X̄ₙ − μ| > k) = 0.
Let X₁, X₂, X₃, …, Xₙ be a sequence of i.i.d. random variables with E|X₁| < ∞ and μ = E(X₁). Again define X̄ₙ = (X₁ + X₂ + … + Xₙ)/n, n ≥ 1. Let γ(t) = E(e^(itX₁)) with t ∈ (−∞, ∞). Then,
γ(t/n) = E(e^(i(t/n)X₁))
= E(1 + i(t/n)X₁ + o(1/n))
= 1 + i(t/n)E(X₁) + o(1/n)
= 1 + iμt/n + o(1/n) as n → ∞.
Thus,
γ_X̄ₙ(t) = E(e^(itX̄ₙ))
= E(e^(i(t/n) ∑ⱼ₌₁ⁿ Xⱼ))
= E(∏ⱼ₌₁ⁿ e^(i(t/n)Xⱼ))
= [γ(t/n)]ⁿ
= (1 + iμt/n + o(1/n))ⁿ → e^(iμt) as n → ∞,
since (1 + a/n)ⁿ → eᵃ. So γ_X̄ₙ(t) → γ_D(t) = e^(iμt), where P(D = μ) = 1, and thus X̄ₙ →ᴾ μ.
Example: die rolling. Consider n rolls of a fair die. Let Xᵢ be the outcome of the iᵗʰ roll. Then Sₙ = X₁ + X₂ + … + Xₙ is the sum of the first n rolls. The Xᵢ are independent and identically distributed with E(Xᵢ) = 7/2. Thus, by the LLN, for any k > 0,
P(|Sₙ/n − 7/2| ≥ k) → 0 as n → ∞.
This can be restated as: for any k > 0,
P(|Sₙ/n − 7/2| < k) → 1 as n → ∞.
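The die-rolling example is easy to simulate. The short Python sketch below (an illustration only; the sample sizes are arbitrary choices) shows the sample mean drifting toward E(Xᵢ) = 3.5 as n grows:

```python
# Sketch: simulating the WLLN for repeated rolls of a fair die.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # n outcomes, each uniform on {1, ..., 6}
    print(n, rolls.mean())               # sample mean approaches 3.5
```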
The strong law of large numbers
CHAPTER SIX
6. ESTIMATION
Introduction
We now come to the heart of the matter: statistical inference. So far, given values of the population parameters μ and σ², we have asked what the probability is that the sample mean X̄, from a sample of size n, is greater than some specified value or lies within some range of values. The parameters μ and σ² were assumed to be known, and the objective was to form conclusions about possible values of X̄. In practice, however, it is usually the sample values X̄ and s² that are known, while the population parameters μ and σ² are not. Thus an interesting question to ask is: given the values of X̄ and s², what can be generalized about μ and σ²? This is called statistical
inference. Therefore, inference is the process of or way of making interpretations or conclusions
from sample data for the totality of the population or to the population parameters. Alternatively,
statistical inference is the act of generalizing from the data (“sample”) to a larger phenomenon
(“population”) with calculated degree of certainty. The act of generalizing and deriving statistical
judgments is the process of inference.
Note: There is a distinction between causal inference and statistical inference. Here we consider
only statistical inference.
Sometimes the population variance is known, and inferences have to be made about μ alone.
For instance, if a sample of 100 Ethiopian families find an average weekly expenditure on food (
X ) of 250 birr with a standard deviation (s) of 48 birr, what can be said about the average
expenditure (μ) of all Ethiopian families?
Diagrammatically this type of problem is shown as follows:
Generally, in statistics there are two ways in which inference can be made.
Statistical estimation
Statistical hypothesis testing.
This chapter covers the estimation of population parameters such as μ and σ 2 from the sample
statistic while the coming chapter describes statistical hypotheses testing about these
parameters. The two concepts are very closely related statistical terms.
Learning objectives:
After successful completion of this chapter, students will be able to;
Differentiate between point and interval estimation.
Show how point estimates can be calculated for the population parameters.
Elaborate the desirable properties of the estimators
Describe how confidence intervals are computed for the population parameters.
Determine how large the sample size must be in order for an estimate to have a desired level of accuracy.
This is one way of making inference about the population parameter where the investigator does
not have any prior notion about values or characteristics of the population parameter.
Basically there are two ways of conducting estimation.
i. Point Estimation
ii. Interval estimation
Point Estimation
It is a way of estimating a population parameter using a single value or point. It is a procedure that results in a single value
as an estimate for a parameter. Point estimation is the one which is most prevalent in everyday
activities; for instance, the average number of economics instructors surfing the internet in Raya
University per day for an hour. Although this is presented as a truth, it is actually an estimate,
obtained from a survey of instructor’s use of either personal computers or their mobile phones.
Since it is obtained from a sample there must be some doubt about its accuracy of being the
population average. As a result interval estimates are also used, that gives some information
about the likely accuracy of the estimate.
Example: The following table shows the age of a simple random sample of 18 economics
students in Raya University in years. The figures are obtained from sampling, not from a census.
Use the data to estimate the population mean age, μ, of all economics students.
18 22 19 20 21 18
23 21 22 24 18 20
17 23 18 19 21 19
Solution: We estimate the population mean age, μ, of all economics students by the sample mean age, x̄, of the 18 students sampled. From the table above,
x̄ = ∑xᵢ/n = 363/18 = 20.17
Interpretation: Based on the sample data, we estimate the mean age, μ, of all economics students to be approximately 20.17 years.
An estimate of this kind is called a point estimate for μ because it consists of a single number, or
point.
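The same point estimate is straightforward to compute in a few lines of Python (a minimal sketch; the ages are the ones listed in the example):

```python
# Sketch: point estimate of the population mean age from the sampled ages.
ages = [18, 22, 19, 20, 21, 18,
        23, 21, 22, 24, 18, 20,
        17, 23, 18, 19, 21, 19]
x_bar = sum(ages) / len(ages)   # sample mean as a point estimate of mu
print(round(x_bar, 2))          # 20.17
```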
Interval estimation
It is the procedure that results in an interval of values as an estimate for a parameter, i.e. an interval that contains the likely values of the parameter. It deals with identifying the upper and lower limits of a parameter. The limits themselves are random variables. Interval estimates are in this respect better for the user of statistics, since they not only show the estimate of the parameter but also give an idea of the confidence the researcher has in that estimate.
For instance, suppose we are trying to estimate the mean summer income of students. Then, an
interval estimate might say that the (unknown) mean income is between 1000 and 2000 birr with
probability 0.95.
Generally, there are two principal methods/ approaches of point estimation, the method of
moments and the method of maximum likelihood. Although there are other methods of point estimation, such as the method of least squares and the Bayesian method, for the purpose of this module and this course we will focus on the first two constructive methods of obtaining point estimators. By constructive we mean that the general
definition of each type of estimator suggests explicitly how to obtain the estimator in any
specific problem.
The method of moments is a very intuitive approach to the problem of parameter estimation: the
argument is as follows. On the one hand, we have defined population moments as expressions of
the variability in a population; on the other hand, for random samples drawn from such
populations, we have calculated sample moments to summarize the data. Population moments are functions of the unknown parameters, so equating each population moment to the corresponding sample moment gives an equation in those parameters. Continue this until there are enough equations to solve for the unknown parameters.
Procedures to find MoM estimators
Step 1: Identify how many parameters the distribution has. (Let's say m).
Step 2: Find the first m population moments (using econ 1041 knowledge).
Step 3: Equalize each of the population moments to the corresponding sample moment.
Step 4: Solve the system to find solutions to the parameters.
Step 5: The solutions are the MoM estimators.
Definition: Let X1, X2… Xn be a random sample from a distribution with pmf or pdf f (x; θ1. . .
θm), where θ1, . . . , θm are parameters whose values are unknown. Then the moment estimators
θ^ 1. . . θ^ m are obtained by equating the first m sample moments to the corresponding first m
population moments and solving for θ1, . . . , θm.
If, for example, m = 2, E (X) and E ( X 2 ) will be functions of θ1 andθ2 .
Setting E(X) = (1/n) Σ Xi (= X ) and E ( X 2 ) = (1/n) Σ Xi 2 gives two equations in θ1 andθ2. The
solution then defines the estimators.
Example:
1. Let Xᵢ ~ N(μ, σ²). Find the method of moments estimators of μ and σ².
We have
μ₁ = E(X) = μ, μ₂ = E(X²) = σ² + μ².
Thus, we find
μ̂ = X̄
σ̂² = (1/n) ∑ᵢ₌₁ⁿ Xᵢ² − X̄² = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²
2. Let X₁, X₂, …, Xₙ be a random sample from a Bernoulli distribution with parameter p. Find the method of moments estimator of p.
Solution: Setting E(X) = p equal to the first sample moment (X₁ + X₂ + … + Xₙ)/n gives
p̂ = ∑ Xᵢ / n = X̄
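To make the procedure concrete, here is a small Python sketch of the method of moments for the normal example above (illustrative only; the true parameters and the sample are simulated purely for demonstration):

```python
# Sketch: method-of-moments estimates for a normal sample.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # stand-in for an observed sample

mu_hat = x.mean()                              # first sample moment
sigma2_hat = (x ** 2).mean() - mu_hat ** 2     # second sample moment minus (first moment)^2
print(round(mu_hat, 3), round(sigma2_hat, 3))  # close to 5 and 4
```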
The method of maximum likelihood was first introduced by R. A. Fisher, a geneticist and
statistician, in the 1920s. Most statisticians recommend this method, at least when the sample
size is large, since the resulting estimators have certain desirable efficiency properties.
To estimate model parameters, we maximize the likelihood, which is the joint probability (or density) function of the random sample. The resulting point estimators can be regarded as those produced by the underlying distribution from which the given sample is most likely to have been drawn.
Let X₁, X₂, …, Xₙ be a random sample from a population with pmf or pdf f(x; θ). Then the likelihood function is defined as
L(θ) = f(x₁; θ) f(x₂; θ) ⋯ f(xₙ; θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ).
To find an MLE:
Write down the likelihood function (i.e. the joint distribution)
Take the natural logarithm of the likelihood and simplify
Differentiate with respect to the appropriate parameters, set to 0, and solve
Write resulting functions (RVs) as your estimator.
The likelihood function is the joint pdf or pf of the sample observations. The maximum
likelihood estimate (MLE) of a parameter θ is obtained by maximizing the likelihood function
with respect toθ .
Computational trick: Instead of maximizing L(θ) with respect to θ, we often maximize ln L(θ) = ∑ᵢ₌₁ⁿ log f(Xᵢ | θ) with respect to θ. Since the log function is monotonic, the θ that maximizes ln L(θ) also maximizes L(θ).
Example:
1. Suppose that the population has a Poisson distribution with unknown parameter λ. find
the ML estimator of λ.
P(x; θ) = P(x; λ) = (λˣ / x!) e^(−λ), x = 0, 1, 2, …
for some λ > 0. To estimate λ by MLE, let’s proceed as follows.
STEP 1: Calculate the likelihood function L(λ):
L(λ) = ∏ᵢ₌₁ⁿ (λ^(xᵢ) / xᵢ!) e^(−λ) = λ^(∑xᵢ) e^(−nλ) / ∏ᵢ₌₁ⁿ xᵢ!
STEP 2: Take the natural logarithm and simplify:
log L(λ) = (∑ᵢ₌₁ⁿ xᵢ) log λ − nλ − ∑ᵢ₌₁ⁿ log(xᵢ!)
STEP 3: Differentiate log L(λ) with respect to λ, and equate the derivative to zero to find the MLE.
(d/dλ) log L(λ) = (∑ᵢ₌₁ⁿ xᵢ)/λ − n = 0 ⇒ λ̂ = (1/n) ∑ᵢ₌₁ⁿ xᵢ
λ̂ = X̄
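As a numerical illustration, the Python sketch below maximizes the Poisson log-likelihood directly and confirms that the maximizer coincides with the sample mean (the data are simulated and scipy is assumed available; none of the numbers come from the module itself):

```python
# Sketch: numerical MLE for a Poisson parameter, compared with the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(2)
x = rng.poisson(lam=7, size=200)

def neg_log_lik(lam):
    # negative of: sum(x)*log(lam) - n*lam - sum(log(x_i!))
    return -(x.sum() * np.log(lam) - x.size * lam - gammaln(x + 1).sum())

res = minimize_scalar(neg_log_lik, bounds=(0.01, 50), method="bounded")
print(round(res.x, 4), round(x.mean(), 4))   # the two agree: lambda-hat = x-bar
```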
2. Let X₁, X₂, …, Xₙ be a random sample from a normal population, Xᵢ ~ N(μ, σ²). Find the maximum likelihood estimators of μ and σ².
Solution:
i. f(Xᵢ) = (1/√(2πσ²)) e^(−(Xᵢ − μ)²/(2σ²)), Xᵢ ∈ ℝ, i = 1, 2, …, n
ii. The likelihood function is
L(μ, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) e^(−(Xᵢ − μ)²/(2σ²)) = (2πσ²)^(−n/2) e^(−∑ᵢ₌₁ⁿ (Xᵢ − μ)²/(2σ²))
iii. Log likelihood function:
l = ln L = −(n/2) ln(2π) − (n/2) ln σ² − [∑ᵢ₌₁ⁿ (Xᵢ − μ)²/(2σ²)] ln e
= −(n/2) ln(2π) − (n/2) ln σ² − ∑ᵢ₌₁ⁿ (Xᵢ − μ)²/(2σ²), since ln e = 1
iv. Differentiate and set the derivatives to zero:
dl/dμ = 2 ∑ᵢ₌₁ⁿ (Xᵢ − μ)/(2σ²) = ∑ᵢ₌₁ⁿ (Xᵢ − μ)/σ² = 0
dl/dσ² = −n/(2σ²) + ∑ᵢ₌₁ⁿ (Xᵢ − μ)²/(2σ⁴) = 0
⇒ μ̂ = X̄
σ̂² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²/n
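The closed-form normal MLEs are easy to verify on simulated data with a brief Python sketch (illustrative only; the true parameters below are arbitrary):

```python
# Sketch: maximum likelihood estimates for a normal sample.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=3.0, size=1000)

mu_hat = x.mean()                              # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()        # MLE of sigma^2 (divides by n, not n - 1)
print(round(mu_hat, 3), round(sigma2_hat, 3))  # close to 10 and 9
```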
Why are statistical properties of estimators important? These statistical properties are extremely
important because they provide criteria for choosing among alternative estimators. Knowledge of these properties helps us choose between competing estimators of the same parameter.
Unbiasedness
Definition of Unbiasedness: The estimator θ^ is an unbiased estimator of the population
parameter θ if the mean or expectation of the finite-sample distribution of θ^ is equal to the true θ.
That is, θ^ is an unbiased estimator of θ if,
E (θ^ ) = θ for any given finite sample size n < ∞.
Definition of the Bias of an Estimator: The bias of the estimator θ^ is defined as,
Bias (θ^ ) = E (θ^ ) − θ = the mean of θ^ minus the true value of θ.
The estimator θ^ is an unbiased estimator of the population parameter θ if the bias of θ^ is equal
to zero; i.e., if,
Bias(θ̂) = E(θ̂) − θ = 0 ⇔ E(θ̂) = θ.
Alternatively, the estimator θ̂ is a biased estimator of the population parameter θ if the bias of θ̂ is non-zero; i.e., if
Bias(θ̂) = E(θ̂) − θ ≠ 0 ⇔ E(θ̂) ≠ θ.
Minimum Variance
Definition of Minimum Variance: The estimator θ̂ is a minimum-variance estimator of the population parameter θ if the variance of the finite-sample distribution of θ̂ is less than or equal to the variance of the finite-sample distribution of any other estimator θ̃ of θ, for any finite sample size n < ∞.
Note: Either or both of the estimators θ̂ and θ̃ may be biased. The minimum variance property implies nothing about whether the estimators are biased or unbiased.
Efficiency
A Necessary condition for Efficiency is Unbiasedness
The small-sample property of efficiency is defined only for unbiased estimators.
Therefore, a necessary condition for efficiency of the estimator θ^ is that
E (θ^ ) = θ, i.e., θ^ must be an unbiased estimator of the population parameter θ.
Definition of Efficiency: Efficiency = Unbiasedness + Minimum Variance
Verbal Definition: If θ̂ and θ̃ are two unbiased estimators of the population parameter θ, then the estimator θ̂ is efficient relative to the estimator θ̃ if the variance of θ̂ is smaller than the variance of θ̃ for any finite sample size n < ∞.
Formal Definition: Let θ̂ and θ̃ be two unbiased estimators of the population parameter θ, such that E(θ̂) = θ and E(θ̃) = θ. Then the estimator θ̂ is efficient relative to the estimator θ̃ if Var(θ̂) < Var(θ̃) for any finite sample size n < ∞.
Note: Both estimators θ̂ and θ̃ must be unbiased, since the efficiency property refers only to the variances of unbiased estimators.
Meaning of the Efficiency Property
Efficiency is a desirable statistical property because, of two unbiased estimators of the same population parameter, we prefer the one that has the smaller variance, i.e., the one that is statistically more precise.
In the above definition of efficiency, if θ̂ is efficient relative to any other unbiased estimator θ̃ of the population parameter θ, then the estimator θ̂ is the best unbiased, or minimum-variance unbiased, estimator of θ.
Large-Sample Properties
Nature of Large-Sample Properties
The large-sample properties of an estimator are the properties of the sampling distribution of that
estimator as sample size n becomes very large, as n approaches infinity, as n→∞. Recall that the
sampling distribution of an estimator differs for different sample sizes i.e., for different values of
n.
The sampling distribution of a given estimator for one sample size is different from the sampling
distribution of that same estimator for some other sample size.
Consider the estimator θ^ for two different values of n, n1 andn2 .
In general, the sampling distributions of θ̂ₙ₁ and θ̂ₙ₂ are different; they can have
different means
different variances
different mathematical forms
Desirable Large-Sample Properties
Consistency
Asymptotic Unbiasedness;
Consistency
A Necessary Condition for Consistency
Let θ̂ₙ be an estimator of the population parameter θ based on a sample of size n observations.
Formal Definition of Probability Limit: The point θ0 on the real line is the probability limit of
the estimator θ^ n if the ultimate sampling distribution of θ^ n is degenerate at the pointθ0 , meaning
that the sampling distribution of θ^ n collapses to a column of unit density on the point θ0 as sample
size n →∞.
This definition can be written concisely as
plim θ̂ₙ = θ₀, or plim (n→∞) θ̂ₙ = θ₀,
meaning that
lim (n→∞) Pr(θ₀ − ε ≤ θ̂ₙ ≤ θ₀ + ε) = lim (n→∞) Pr(−ε ≤ θ̂ₙ − θ₀ ≤ +ε) = lim (n→∞) Pr(|θ̂ₙ − θ₀| ≤ ε) = 1
Where ε > 0 is an arbitrarily small positive number and |θ̂ₙ − θ₀| denotes the absolute value of the difference between θ̂ₙ and θ₀.
This condition states that if both the bias and variance of the estimator θ^ n approach zero as sample
size n → ∞, then θ^ n is a consistent estimator of θ.
In point estimation, we recover a single point value for an unknown parameter. But knowing the point value is not enough; we also want to know how close to the truth it is. Using interval estimation, we make statements that the true parameter lies within some region (typically depending on the point estimate) with some prescribed probability. A point estimate by itself provides no
information about the precision and reliability of estimation. Therefore, an alternative to
reporting a single number is to report an entire interval of plausible values that is an interval
estimate.
Confidence interval estimation of the population mean
Although X possesses nearly all the qualities of a good estimator, because of sampling error, we
know that it's not likely that our sample statistic will be equal to the population parameter, but
instead will fall into an interval of values. We will have to be satisfied knowing that the statistic
is "close to" the parameter. That leads to the obvious question, what is "close"?
We can phrase the latter question differently: How confident can we be that the value of the
statistic falls within a certain "distance" of the parameter? Or, what is the probability that the
parameter's value is within a certain range of the statistic's value? This range is the confidence
interval.
A Confidence Interval is an interval of numbers containing the most plausible values for our
Population Parameter. The probability that this procedure produces an interval that contains the
actual true parameter value is known as the Confidence Level and is generally chosen to be 0.9,
0.95 or 0.99.
The confidence level is the probability that the value of the parameter falls within the range
specified by the confidence interval surrounding the statistic.
Case 1:
If sample size is large or if the population is normal with known variance
Recall the Central Limit Theorem, which applies to the sampling distribution of the mean of a
sample. Consider a sample of size n drawn from a population, whose mean, is μ and standard
deviation is σ with replacement and order important. The population can have any frequency
distribution. The sampling distribution of X̄ will have mean μ_X̄ = μ and standard deviation σ_X̄ = σ/√n, and it approaches a normal distribution as n gets large.
This allows us to use the normal distribution curve for computing confidence intervals.
⇒ Z = (X̄ − μ)/(σ/√n) has a normal distribution with mean = 0 and variance = 1
⇒ μ = X̄ ± Z σ/√n
For the interval estimator to be a good estimator, the error should be small i.e. more narrow
CIs are desirable. We can use the following methods so that we can have a small level of
error.
By making n large i.e. a relatively large number of the sample size (good idea if possible)
Reducing the variability, i.e. decreasing σ (not an option, since it is fixed by the original distribution)
Taking Z small
Decreasing the confidence level? (Not a great idea. You reduce the CI width, but you're
less likely to capture μ)
To obtain the value of Z, we have to attach this to a theory of chance. That is, there is an area of size 1 − α such that
P(−Z_(α/2) < Z < Z_(α/2)) = 1 − α
Where α is the probability that the parameter lies outside the interval. The α (“alpha”) level represents the “lack of confidence” and is the chance the researcher is willing to take of not capturing the value of the parameter.
Z_(α/2) stands for the standard normal value to the right of which α/2 of the probability lies, i.e. P(Z > Z_(α/2)) = α/2. The reason we use Z_(α/2) instead of Z_α in this formula is that the random error (imprecision) is split between underestimates (left tail of the SDM) and overestimates (right tail of the SDM). The confidence level 1 − α is the area lying between −Z_(α/2) and Z_(α/2).
⇒ P(−Z_(α/2) < (X̄ − μ)/(σ/√n) < Z_(α/2)) = 1 − α
⇒ P(X̄ − Z_(α/2) σ/√n < μ < X̄ + Z_(α/2) σ/√n) = 1 − α
⇒ (X̄ − Z_(α/2) σ/√n, X̄ + Z_(α/2) σ/√n)
is a 100(1 − α)% confidence interval for μ.
Where,
Z_(α/2) is called the reliability factor,
SE = σ/√n is the standard error,
ME = Z_(α/2) σ/√n is the margin of error,
w = 2ME is the width of the interval,
UCL = X̄ + Z_(α/2) σ/√n is the upper confidence limit, and
LCL = X̄ − Z_(α/2) σ/√n is the lower confidence limit.
But usually σ² is not known; in that case we estimate it by its point estimator s².
⇒ (X̄ − Z_(α/2) s/√n, X̄ + Z_(α/2) s/√n)
is a 100(1 − α)% confidence interval for μ.
Here are the z values corresponding to the most commonly used confidence levels.
100(1 − α)%    α       α/2      Z_(α/2)
90%            0.10    0.05     1.645
95%            0.05    0.025    1.96
99%            0.01    0.005    2.58
Case 2:
If sample size is small and the population variance, σ 2 is not known.
t = (X̄ − μ)/(s/√n) has a t distribution with n − 1 degrees of freedom.
⇒ (X̄ − t_(α/2) s/√n, X̄ + t_(α/2) s/√n)
is a 100(1 − α)% confidence interval for μ. The unit of measurement of the confidence interval is the standard error. This is just the standard deviation of the sampling distribution of the statistic.
Student’s t Distribution
The t is a family of distributions.
Degree of freedom is the number of observations that are free to vary after sample mean has
been calculated.
Degree of freedom, d.f. = n - 1
Let X1 = 5
Let X2 = 6
What is X3?
If the mean of these three values is 6, then X3 must be 7 i.e., X3 is not free to vary.
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2
Note: t →Z as n increases.
Summary
If σ² is known (or the population is normal with known variance): P(X̄ − Z_(α/2) σ/√n < μ < X̄ + Z_(α/2) σ/√n) = 1 − α
If n is large and σ² is unknown: P(X̄ − Z_(α/2) s/√n < μ < X̄ + Z_(α/2) s/√n) = 1 − α
If n is small, σ² is unknown and the population is normal: P(X̄ − t_(α/2) s/√n < μ < X̄ + t_(α/2) s/√n) = 1 − α
Example:
1. From a normal sample of size 25 a mean of 32 was found .Given that the population
standard deviation is 4.2. Find
a. A 95% confidence interval for the population mean.
b. A 99% confidence interval for the population mean.
c. A 90 % confidence interval for the population mean.
Solution:
a. X̄ = 32, σ = 4.2, 1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025, Z_(α/2) = 1.96
The required interval will be X̄ ± Z_(α/2) σ/√n
= 32 ± 1.96 × 4.2/√25
= 32 ± 1.65
= (30.35, 33.65)
b. 1 − α = 0.99 ⇒ α = 0.01, α/2 = 0.005, Z_(α/2) = 2.58
The required interval will be X̄ ± Z_(α/2) σ/√n
= 32 ± 2.58 × 4.2/√25
= 32 ± 2.17
= (29.83, 34.17)
c. Exercise
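These z-based intervals can be reproduced with a short Python sketch (illustrative only, assuming scipy is available; x̄ = 32, σ = 4.2 and n = 25 are taken from the example):

```python
# Sketch: z-based confidence intervals for the population mean.
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 32, 4.2, 25
for conf in (0.95, 0.99, 0.90):
    z = norm.ppf(1 - (1 - conf) / 2)   # reliability factor Z_{alpha/2}
    me = z * sigma / sqrt(n)           # margin of error
    print(conf, (round(xbar - me, 2), round(xbar + me, 2)))
```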
2. A Drug Company is testing a new drug which is supposed to reduce blood pressure. From
the six people who are used as subjects, it is found that the average drop in blood pressure
is 2.28 points, with a standard deviation of .95 points.
Compute:
a. A 90 % confidence interval for the population mean.
b. A 95% confidence interval for the population mean.
c. A 99% confidence interval for the population mean.
Solution:
b. X̄ = 2.28, s = 0.95, n = 6, 1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025, t_(α/2)(5) = 2.571
The required interval will be X̄ ± t_(α/2) s/√n
= 2.28 ± 2.571 × 0.95/√6
= 2.28 ± 0.997
= (1.28, 3.28)
That is, we can be 95% confident that the mean decrease in blood pressure is between 1.28 and
3.28 points.
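The t-based intervals for all three confidence levels can likewise be checked with a brief Python sketch (illustrative only, assuming scipy is available; x̄ = 2.28, s = 0.95 and n = 6 come from the example):

```python
# Sketch: t-based confidence intervals when sigma is unknown and n is small.
from math import sqrt
from scipy.stats import t

xbar, s, n = 2.28, 0.95, 6
for conf in (0.90, 0.95, 0.99):
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t_{alpha/2} with n - 1 degrees of freedom
    me = tcrit * s / sqrt(n)
    print(conf, (round(xbar - me, 2), round(xbar + me, 2)))
```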
Chapter seven
7. HYPOTHESIS TESTING
Introduction
Learning objectives:
After completing this chapter, students will be able to;
Define the null and alternative hypothesis:
Develop and test the hypothesis:
Test the test of association between variables
Definitions:
Statistical hypothesis: is an assertion or statement about the population whose plausibility is to
be evaluated on the basis of the sample data.
Test statistic: is a statistic whose value serves to determine whether to reject or accept the hypothesis to be tested. It is a random variable.
Statistical test: is a test or procedure used to evaluate a statistical hypothesis, and its value depends
on sample data.
Null hypothesis (H₀): is the hypothesis to be tested; it is the claim that is assumed to hold unless the sample data provide sufficient evidence against it.
We test the null hypothesis against an alternative hypothesis, which is given the symbol H_a. The alternative hypothesis is often the hypothesis that you believe yourself! It includes the outcomes not covered by the null hypothesis.
Type I error (α): Rejecting the null hypothesis when it is actually true. Its probability is sometimes called the level of significance.
Type II error (β): Failing to reject the null hypothesis when it is actually false.
Level of Significance
The level of significance of a statistical test is the probability level α for obtaining a critical value
under the distribution specified by the null hypothesis. A calculated test-statistic is compared to
the critical value to make a decision as to rejection or non-rejection of the null hypothesis. The
rejection of the null hypothesis is equivalent to saying that the calculated probability for
obtaining the test statistic is less than α.
Power of a test:
The most powerful test is a test that fixes the level of significance and minimizes type II error
(β). The power of a test is defined as the probability of rejecting the null hypothesis when it is actually false. It is given as: Power = 1 − β.
NOTE:
1. There are errors that are prevalent in any two choice decision making problems.
2. There is always a possibility of committing one or the other errors.
3. Type I error (α) and type II error (β) have inverse relationship and therefore, cannot be
minimized at the same time.
In practice we set α at some value and design a test that minimizes β. This is because a
type I error is often considered to be more serious, and therefore more important to avoid,
than a type II error.
Suppose the assumed or hypothesized value of μ is denoted by μ0, then one can formulate two
sided (1) and one sided (2 and 3) hypothesis as follows:
1. H₀: μ = μ₀ vs H₁: μ ≠ μ₀ → two sided
2. H₀: μ = μ₀ vs H₁: μ > μ₀ → one sided
3. H₀: μ = μ₀ vs H₁: μ < μ₀ → one sided
CASES:
Case 1: When sampling is from a normal distribution with σ² known.
The relevant test statistic is:
Z_cal = (X̄ − μ₀)/(σ/√n)
Case 2: When sampling is from a normal distribution with σ 2 unknown and small sample size
The relevant test statistic is;
t_cal = (X̄ − μ₀)/(S/√n)
Case 3: When sampling is from a non- normally distributed population or a population whose
functional form is unknown.
If a sample size is large one can perform a test hypothesis about the mean by using:
Z_cal = (X̄ − μ₀)/(σ/√n), if σ² is known
Z_cal = (X̄ − μ₀)/(S/√n), if σ² is unknown
The decision rule is the same as case I.
Examples
1. Ten randomly selected containers of a particular lubricant have a mean content of 10.06 liters with a standard deviation of 0.25 liters. Assuming the contents are approximately normally distributed, test at the 1% level of significance whether the mean content of such containers differs from 10 liters.
Solution:
Let μ =Population mean, μ0=10
Step 1: Identify the appropriate hypothesis
H₀: μ = 10 vs. H₁: μ ≠ 10, with μ₀ = 10
Step 2: select the level of significance, α = 0.01 (given)
Step 3: Select an appropriate test statistics
t- Statistic is appropriate because population variance is not known and the sample size is also
small.
Step 4: identify the critical region.
Here we have two critical regions since we have two tailed hypothesis.
The critical region is |t_cal| > t_0.005(9) = 3.2498
Step 5: Compute the test statistic.
t_cal = (X̄ − μ₀)/(S/√n) = (10.06 − 10)/(0.25/√10) = 0.76
Step 6: Decision
Accept H 0, since t cal is in the acceptance region.
Step 7: Conclusion: At the 1% level of significance, we have no evidence to say that the average content of the containers of the given lubricant is different from 10 liters, based on the given sample data.
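The same test can be run from the summary statistics with a short Python sketch (illustrative only, assuming scipy is available; the numbers are those of the example above):

```python
# Sketch: one-sample t test of H0: mu = 10 from summary statistics.
from math import sqrt
from scipy.stats import t

xbar, s, n, mu0, alpha = 10.06, 0.25, 10, 10, 0.01
t_cal = (xbar - mu0) / (s / sqrt(n))        # about 0.76
t_crit = t.ppf(1 - alpha / 2, df=n - 1)     # about 3.25
print(round(t_cal, 3), round(t_crit, 3), abs(t_cal) > t_crit)   # False -> do not reject H0
```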
2. The mean life time of a sample of 16 fluorescent light bulbs produced by a company is
computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the
hypothesized value for the population mean is 1600 hours. Can we conclude that the mean lifetime of the bulbs is different from 1600 hours? (Use a 5% level of significance.)
Exercises:
1. It is known in a pharmacological experiment that rats fed with a particular diet over a
certain period gain an average of 40 gms in weight. A new diet was tried on a sample of 20 rats, yielding a weight gain of 43 gms with variance 7 gms². Test the hypothesis that the new diet is an improvement, assuming normality.
2. A batch of 100 resistors has an average of 102 Ohms. Assuming a population standard
deviation of 8 Ohms, test whether the population mean is 100 Ohms at a significance
level of α = 0.05.
A \ B    B1     B2     ...   Bj     ...   Bc     Total
A1       O11    O12    ...   O1j    ...   O1c    R1
...
Ai       Oi1    Oi2    ...   Oij    ...   Oic    Ri
...
Ar       Or1    Or2    ...   Orj    ...   Orc    Rr
Total    C1     C2     ...   Cj     ...   Cc     n
The chi-square test procedure is used to test the hypothesis of independence of two attributes.
For instance we may be interested;
Whether the presence or absence of hypertension is independent of smoking habit or not.
Whether the size of the family is independent of the level of education attained by the
mothers.
Whether there is association between father and son regarding baldness.
Whether there is association between stability of marriage and period of acquaintanceship prior to marriage.
χ²_cal = ∑ᵢ₌₁ʳ ∑ⱼ₌₁ᶜ [(Oᵢⱼ − eᵢⱼ)²/eᵢⱼ] ~ χ²((r − 1)(c − 1))
Where,
Oᵢⱼ = the number of units that belong to category i of A and j of B,
eᵢⱼ = the expected frequency that belongs to category i of A and j of B.
The eᵢⱼ is given by:
eᵢⱼ = (Rᵢ × Cⱼ)/n
Decision Rule:
Reject H0 for independency at α level of significance if the calculated value of χ 2 exceeds the
tabulated value with degree of freedom equal to (r −1)(c −1) .
Examples:
1. A geneticist took a random sample of 300 men to study whether there is association between father and son regarding baldness. He obtained the following results.
                 Son
Father        Bald     Not bald
Bald           85         59
Not bald       65         91
Using α = 5%, test whether there is association between father and son regarding baldness.
Solution:
H₀: There is no association between father and son regarding baldness (the two attributes are independent).
H₁: not H₀
First calculate the row and column totals
R1 = 144, R2 = 156, C1 = 150, C2 = 150
eᵢⱼ = (Rᵢ × Cⱼ)/n
e₁₁ = (144 × 150)/300 = 72,  e₁₂ = (144 × 150)/300 = 72
e₂₁ = (156 × 150)/300 = 78,  e₂₂ = (156 × 150)/300 = 78
χ²_cal = ∑ᵢ₌₁² ∑ⱼ₌₁² [(Oᵢⱼ − eᵢⱼ)²/eᵢⱼ]
= (85 − 72)²/72 + (59 − 72)²/72 + (65 − 78)²/78 + (91 − 78)²/78 = 9.028
Obtain the tabulated value of chi-square
df = (r - 1)(c -1) = 1* 1 = 1
χ²_0.05(1) = 3.841 (from the chi-square table)
The decision is to reject H₀ since χ²_cal > χ²_0.05(1).
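The same conclusion can be reached with a short Python sketch (illustrative only, assuming scipy is available; chi2_contingency with correction=False reproduces the uncorrected statistic computed above):

```python
# Sketch: chi-square test of independence for the father/son example.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[85, 59],
                     [65, 91]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), dof, round(p, 4))   # about 9.028 with 1 df, p < 0.05 -> reject H0
print(expected)                           # expected counts 72, 72, 78, 78
```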
2. Random samples of 200 men, all retired were classified according to education and
number of children as shown below.
Test the hypothesis that the size of the family is independent of the level of education attained by
fathers. (Use 5% level of significance).
eᵢⱼ = (Rᵢ × Cⱼ)/n
e₁₁ = (R₁ × C₁)/n = (83 × 45)/200 = 18.675,   e₂₁ = (R₂ × C₁)/n = (117 × 45)/200 = 26.325
e₁₂ = (R₁ × C₂)/n = (83 × 96)/200 = 39.84,    e₂₂ = (R₂ × C₂)/n = (117 × 96)/200 = 56.16
e₁₃ = (R₁ × C₃)/n = (83 × 59)/200 = 24.485,   e₂₃ = (R₂ × C₃)/n = (117 × 59)/200 = 34.515
Obtain the calculated value of the chi-square:
χ²_cal = ∑ᵢ₌₁² ∑ⱼ₌₁³ [(Oᵢⱼ − eᵢⱼ)²/eᵢⱼ]
= (14 − 18.675)²/18.675 + (37 − 39.84)²/39.84 + … + (27 − 34.515)²/34.515 = 6.3
Obtain the tabulated value of chi-square at α = 0.05
Degree of freedom = (r - 1)(c -1) = 1* 2 = 2
χ²_0.05(2) = 5.99 (from the chi-square table)
The decision is to reject H₀ since χ²_cal > χ²_0.05(2), and we conclude that there is an association between family size and the level of education attained by fathers.
The end