Chapter 5 Introduction To Statistical Inference
MODULE 12
Congratulations! You have finally reached the final module for this course. This module will
give you a brief overview of topics that you will encounter in the succeeding statistical theory
courses. Enjoy learning!
LEARNING OBJECTIVE: At the end of this module, you must be able to understand random
variables and probability distributions in relation to statistical inference.
DEFINITION 1
Two random variables X and Y are said to be identically distributed if they have the same
distribution function.
DEFINITION 2
Let X be a random variable with distribution function F and let $X_1, X_2, \ldots, X_n$ be independent
and identically distributed (iid) random variables with common distribution F. Then the
collection $X_1, X_2, \ldots, X_n$ is known as a random sample of size n from the distribution F.
The joint distribution of the sample is given by
$$g(x_1, \ldots, x_n) = \prod_{j=1}^{n} f(x_j)$$
if F has density function f; and by $P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{j=1}^{n} P(X_j = x_j)$ when $X_1, \ldots, X_n$ are of the discrete type.
Remarks:
1. Sometimes the term population is used to describe the universe from which the sample
is drawn; the population may be conceptual. Often F is referred to as the probability
distribution function.
2. In sampling from a probability distribution, randomness is inherent in the phenomenon
under study. The sample is obtained by independent replications. In sampling from a
finite population, randomness is a consequence of the sampling design.
3. In sampling from a finite population, the term population is meaningful in that it refers
to some measurable characteristics or observable characteristics of a group of
individuals or units.
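As a small illustration of Definition 2 (a sketch, not part of the module; the exponential distribution and the variable names are my own choices), the following Python snippet draws an iid sample and evaluates the joint density as the product of the marginal densities:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(seed=1)

n = 5
# iid draws with common distribution F = Exponential(mean 2) -- a random sample of size n
sample = expon.rvs(scale=2.0, size=n, random_state=rng)

# Joint density of the sample: g(x1, ..., xn) = prod_j f(xj)
joint_density = np.prod(expon.pdf(sample, scale=2.0))
print("sample:", sample)
print("joint density g(x1, ..., xn):", joint_density)
```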
DEFINITION 3
A statistic is any function $T = T(X_1, \ldots, X_n)$ of the sample observations that does not depend on any unknown parameter.
Remarks:
1. A statistic is an alternative name given to a random variable or random vector when
we have a sample of observations. In practice, $X_1, \ldots, X_n$ could be a random sample,
i.e., $X_1, \ldots, X_n$ could be independent and identically distributed random variables.
2. Sample statistics are simply numerical characteristics of the sample just as parameters
are numerical characteristics of the population. However, sample statistics are
random variables and vary from sample to sample, whereas parameters are fixed
constants.
Illustration: Let $X_1, \ldots, X_n$ be a random sample from a distribution function F. Then the sample
mean and sample variance, defined as
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad \text{and} \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1},$$
respectively, are sample statistics.
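As a quick numerical check (a sketch using made-up data values), the two algebraically equivalent forms of $S^2$ above can be computed as follows:

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9])  # hypothetical sample of size n = 5
n = len(x)

xbar = x.sum() / n                                        # sample mean
s2_deviations = ((x - xbar) ** 2).sum() / (n - 1)         # definition form of S^2
s2_shortcut = ((x ** 2).sum() - n * xbar ** 2) / (n - 1)  # computational form of S^2

print(xbar, s2_deviations, s2_shortcut)  # the two S^2 values agree
```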
DEFINITION 4
A parametric family of density functions is a collection of density functions that is indexed
by a quantity called a parameter. The parameter, $\theta$, specifies the form of the distribution
function. The set of all possible values of the parameter is called a parameter set,
denoted by $\Theta$.
DEFINITION 4
A family of distribution functions which is not a parametric family is called a
nonparametric family, that is, the family of underlying distributions for X cannot be
completely specified nor be indexed by a finite number of numerical parameters.
Illustration: Let $\mathcal{F}$ be the family of all distribution functions on the real line that have finite
mean.
In this section, we will be discussing the two important problems in statistical inference:
estimation, which is concerned with finding a value or a range of values for a parameter of
a distribution F based on sample data, and tests of hypotheses, which deal with determining
whether or not the sample data support the underlying model assumptions.
DEFINITION 5
A point estimator is any function $W = W(X_1, \ldots, X_n)$ of a sample; that is, any statistic is a point
estimator.
Remarks:
1. The statistic W must be observable (that is, computable from the sample). Hence, it
cannot be a function of any unknown parameter.
2. An estimator is the statistic used to estimate the parameter, and a numerical value of
the estimator is called an estimate. For convenience in notation, no distinction will be
made.
3. Estimates that are constants (not based on the observations) should not be admissible
under any reasonable criteria.
4. Even though all statistics that take values in $\Theta$ are possible candidates for estimates of
$\theta$, we have to develop criteria to indicate which estimates are “good” and which may
be rejected.
5. Let $\theta$ be the unknown parameter to be estimated based on the sample $X_1, \ldots, X_n$
of size n. Estimators will be denoted by $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$ (with or without subscripts). We
use a Greek letter with a “hat” to represent an estimator of the parameter which is
represented by the Greek letter without the “hat”.
DEFINITION 6
The bias of a point estimator W of a parameter $\theta$ is the difference between the expected
value of W and $\theta$; that is, $\operatorname{Bias}_\theta W = E_\theta W - \theta$. An estimator whose bias is identically (in
$\theta$) equal to 0 is called unbiased and satisfies $E_\theta W = \theta$ for all $\theta$.
When we subscript E by $\theta$, it means that the expected value is to be computed under the
density or probability function when $\theta$ is the true value of the parameter.
Remarks:
1. Bias is a systematic error (in some direction). Unbiasedness of W says that W is correct
on the average, i.e., the mean of W is $\theta$.
2. If $\hat{\theta}$ is unbiased for $\theta$ and g is a real-valued function on $\Theta$, then $g(\hat{\theta})$ is not unbiased for
$g(\theta)$, in general.
3. To find an unbiased estimate for a parameter, one begins with the computation of
the first few moment(s) to see if the parameter is linearly related to any moment(s). If
so, then an unbiased estimate is easily obtained.
4. The closeness of the estimator to the parameter is referred to as the accuracy and is
measured by bias, while the closeness of the estimates from different samples to each
other is referred to as the precision and is measured by variance.
5. If the estimator W is unbiased for $\theta$, then a good measure of the precision is
$V_\theta(W) = E_\theta(W^2) - [E_\theta(W)]^2$. Otherwise, a good measure is the mean squared error or MSE.
DEFINITION 7
The mean squared error (MSE) of an estimator W of a parameter $\theta$ is the function of $\theta$
defined by $\operatorname{MSE}_\theta(W) = E_\theta(W - \theta)^2 = V_\theta(W) + [\operatorname{Bias}_\theta(W)]^2$.
EXAMPLE 1
Let X be a single observation from the density $f(x; \theta) = \dfrac{2(\theta - x)}{\theta^2}$, $0 < x < \theta$, and let W = X be an estimator of $\theta$. Now,
$$E_\theta W = \int_0^\theta x \cdot \frac{2(\theta - x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\frac{\theta^3}{2} - \frac{\theta^3}{3}\right) = \frac{\theta}{3}; \quad \text{and}$$
$$E_\theta W^2 = \int_0^\theta x^2 \cdot \frac{2(\theta - x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\frac{\theta^4}{3} - \frac{\theta^4}{4}\right) = \frac{\theta^2}{6}.$$
Hence,
$$\operatorname{Bias}_\theta W = E_\theta W - \theta = \frac{\theta}{3} - \theta = -\frac{2\theta}{3};$$
$$V_\theta W = E_\theta W^2 - [E_\theta W]^2 = \frac{\theta^2}{6} - \frac{\theta^2}{9} = \frac{\theta^2}{18}.$$
Computing for the MSE, we will have
$$\operatorname{MSE}_\theta W = V_\theta W + \operatorname{Bias}_\theta^2 W = \frac{\theta^2}{18} + \left(\frac{\theta}{3} - \theta\right)^2 = \frac{\theta^2}{2}.$$
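The algebra in Example 1 can also be verified by simulation. The sketch below (the inverse-CDF sampling step and the value $\theta = 3$ are my own choices for illustration) compares the empirical bias, variance, and MSE of W = X with $-2\theta/3$, $\theta^2/18$, and $\theta^2/2$:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
theta = 3.0
m = 1_000_000  # number of simulated observations

# Inverse-CDF sampling: F(x) = 1 - (1 - x/theta)^2, so X = theta * (1 - sqrt(1 - U))
u = rng.uniform(size=m)
w = theta * (1.0 - np.sqrt(1.0 - u))  # W = X, one observation per replication

print("bias:", w.mean() - theta,          "vs", -2 * theta / 3)
print("var :", w.var(),                   "vs", theta ** 2 / 18)
print("MSE :", np.mean((w - theta) ** 2), "vs", theta ** 2 / 2)
```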
DEFINITION 8
Let $X_1, \ldots, X_n$ be a sample from a distribution F and let $0 < \alpha < 1$. Statistics $L = L(X_1, \ldots, X_n)$ and $U = U(X_1, \ldots, X_n)$ with $L \le U$ form a minimum level $1-\alpha$ confidence interval $[L, U]$ for $\theta$ if $P_F(L \le \theta \le U) \ge 1 - \alpha$; the quantity $1 - \alpha$ is called the confidence coefficient.
Remarks:
1. If $P_F\big(\theta \le U(X_1, \ldots, X_n)\big) \ge 1 - \alpha$, we call U a minimum level $1-\alpha$ upper confidence bound,
and if $P_F\big(L(X_1, \ldots, X_n) \le \theta\big) \ge 1 - \alpha$, we call L a minimum level $1-\alpha$ lower confidence
bound.
2. In many problems an interval [L, U] is preferable to a point estimate for $\theta$ due to the
confidence level attached to an interval estimate.
3. The length of a confidence interval is taken to be a measure of its precision: the
narrower the length for a given confidence coefficient, the more precise is the
estimator. The length is defined as the difference between the upper and the lower
limits, that is, length = U − L.
4. We choose, if possible, a confidence interval that has the least length among all
minimum level $(1-\alpha)$ confidence intervals for $\theta$. However, this is usually difficult to
determine. Instead, we concentrate on all confidence intervals based on some
statistic T that have minimum level $1-\alpha$ and choose one which has the least length. Such
an interval, if it exists, is called tight.
5. If the distribution of T is symmetric about $\theta$, the length of the confidence interval based
on T is minimized by choosing an equal tails confidence interval. That is, we choose
c in $P_\theta(|T - \theta| \le c) = 1 - \alpha$ by taking $P_\theta(T - \theta \ge c) = \alpha/2$ and $P_\theta(T - \theta \le -c) = \alpha/2$.
6. We will often choose equal tails confidence intervals for convenience even though
the distribution of T may not be symmetric.
EXAMPLE 2
Let X be a single observation from the uniform distribution on $(0, \theta)$, with density $f(x; \theta) = 1/\theta$, $0 < x < \theta$. To obtain an equal tails $1-\alpha$ confidence interval for $\theta$, choose constants a and b, $0 < a < b < \theta$, such that
$$\int_0^a \frac{1}{\theta}\,dx = \frac{\alpha}{2} \qquad \text{and} \qquad \int_b^\theta \frac{1}{\theta}\,dx = \frac{\alpha}{2},$$
that is,
$$\frac{a}{\theta} = \frac{\alpha}{2} \qquad \text{and} \qquad \frac{\theta - b}{\theta} = \frac{\alpha}{2},$$
so that $a = \dfrac{\theta\alpha}{2}$ and $b = \theta\left(1 - \dfrac{\alpha}{2}\right)$.
Hence,
$$P(a \le X \le b) = 1 - \alpha$$
$$P\left(\frac{\theta\alpha}{2} \le X \le \theta\left(1 - \frac{\alpha}{2}\right)\right) = 1 - \alpha$$
$$P\left(\frac{\alpha}{2} \le \frac{X}{\theta} \le 1 - \frac{\alpha}{2}\right) = 1 - \alpha$$
$$P\left(\frac{X}{1 - \alpha/2} \le \theta \le \frac{X}{\alpha/2}\right) = 1 - \alpha.$$
So that $\left(\dfrac{x}{1 - \alpha/2},\; \dfrac{x}{\alpha/2}\right)$ is a $1-\alpha$ level confidence interval estimate for $\theta$.
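A short simulation (with illustrative values of $\theta$ and $\alpha$) can confirm that the interval $\left(\frac{x}{1-\alpha/2}, \frac{x}{\alpha/2}\right)$ covers $\theta$ about $100(1-\alpha)\%$ of the time:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
theta, alpha = 5.0, 0.05
m = 200_000  # number of simulated intervals

x = rng.uniform(0.0, theta, size=m)  # one observation X ~ Uniform(0, theta) per replication
lower = x / (1 - alpha / 2)
upper = x / (alpha / 2)

coverage = np.mean((lower <= theta) & (theta <= upper))
print("empirical coverage:", coverage)  # should be close to 1 - alpha = 0.95
```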
Illustration: A coin is tossed six times. It is of interest to know the probability of obtaining a
“head” in a single toss. This parameter is denoted by P, where 0 < P < 1.
The random variable of interest is T defined as the total number of “heads” in 6 tosses of a
coin. The probability mass function of T is given by
$$P(T = t) = \binom{6}{t} P^t (1 - P)^{6 - t}, \qquad t = 0, 1, \ldots, 6.$$
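The pmf of T can be tabulated directly in Python (a sketch assuming, for illustration only, a fair coin with P = 0.5):

```python
from math import comb

P = 0.5  # assumed probability of heads, for illustration only
for t in range(7):
    pmf = comb(6, t) * P ** t * (1 - P) ** (6 - t)
    print(f"P(T = {t}) = {pmf:.4f}")
```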
DEFINITION 9
A statistical hypothesis is an assertion about the distribution F of one or more random variables. The hypothesis to be tested, denoted by H0, is called the null hypothesis, and the hypothesis against which H0 is tested, denoted by H1, is called the alternative hypothesis.
Remarks:
1. The hypothesis under test is called nonparametric if $\mathcal{F}$ is a nonparametric family and
parametric if $\mathcal{F}$ is a parametric family.
2. Usually the null hypothesis is chosen to correspond to the smaller or simpler subset
$\Theta_0$ of $\Theta$ and is a statement of “no difference”. In all cases we consider, the null
hypothesis will be of the form $\theta = \theta_0$, $\theta \le \theta_0$, or $\theta \ge \theta_0$. Note that the equality sign always
appears in H0.
3. If the distribution of $(X_1, \ldots, X_n)$ is completely specified by a hypothesis, we call it a simple
hypothesis; otherwise the hypothesis is called composite. Thus, whenever $\Theta_0$ or $\Theta_1$
consists of exactly one point, the corresponding hypothesis is simple; otherwise it is
composite.
DEFINITION 10
Let $H_0\!: \theta \in \Theta_0$ and $H_1\!: \theta \in \Theta_1$. Let $\mathcal{X}$ be the set of all possible values of $(X_1, \ldots, X_n)$.
A (decision) rule that specifies a subset C of $\mathcal{X}$ such that
if $(x_1, \ldots, x_n) \in C$, reject H0, and
if $(x_1, \ldots, x_n) \notin C$, accept H0,
is called a test of H0 against H1, and C is called the critical region of the test. A test
statistic, T, is a statistic that is used in the specification of C.
There are two types of errors that can be made in using such a procedure. A Type I error is
committed when a true null hypothesis is rejected, while a Type II error is committed when
we fail to reject (i.e., accept) a false null hypothesis.
DEFINITION 11
A test of the null hypothesis $H_0\!: \theta \in \Theta_0$ against $H_1\!: \theta \in \Theta_1$ is said to have size $\alpha$, $0 \le \alpha \le 1$, if $\sup_{\theta \in \Theta_0} P_\theta(C) = \alpha$.
Remarks:
1. The chosen size $\alpha$ is often unattainable, particularly when the distribution is discrete; in
which case, we usually take the largest level less than $\alpha$ that is attainable.
2. If $P_\theta(C) \le \alpha$ for all $\theta \in \Theta_0$, we say that the “critical region is of significance level $\alpha$.”
3. If $\sup_{\theta \in \Theta_0} P_\theta(C) = \alpha$, then the level and size of C are both equal to $\alpha$. On the other hand, if
$\sup_{\theta \in \Theta_0} P_\theta(C) < \alpha$, then the size of C is smaller than its significance level $\alpha$.
4. If H0 is a simple hypothesis, $P_{H_0}(C)$ is the size of the critical region C, which may or may
not be equal to a given significance level $\alpha$.
5. The choice of a specific value for $\alpha$ (0.10, 0.05, 0.01) is affected by several factors like the
cost of the study and the consequences of rejecting a TRUE null hypothesis. The
economic and practical implications of rejecting H0 should influence the choice of $\alpha$.
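To make size and significance level concrete with the coin-toss setup, suppose H0: P = 0.5 is tested against H1: P > 0.5 using critical regions of the form C = {T ≥ t0}; this particular test is an illustrative construction, not part of the module. The sketch below tabulates the size of each such region:

```python
from math import comb

def size_of_region(n, p0, t0):
    """P_H0(T >= t0) when T ~ Binomial(n, p0): the size of C = {T >= t0}."""
    return sum(comb(n, t) * p0 ** t * (1 - p0) ** (n - t) for t in range(t0, n + 1))

p0 = 0.5  # value of P under H0
for t0 in range(7):
    print(f"size of C = {{T >= {t0}}}: {size_of_region(6, p0, t0):.4f}")
# No region of this form has size exactly 0.05; the largest attainable level
# below 0.05 is P(T >= 6) = 0.0156 (compare Remark 1 above).
```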
DEFINITION 12
The probability of observing, under H0, a sample outcome at least as extreme as the one
observed is called the p-value. If $t_0$ is the observed value of the test statistic T and the
critical region is at the right tail, then the p-value is $P_{H_0}[T \ge t_0]$. If the critical region is at the
left tail, then the p-value is $P_{H_0}[T \le t_0]$. The smaller the p-value, the more extreme the
outcome and the stronger the evidence against H0.
Remarks:
1. The p-value is the smallest level $\alpha$ at which the observed sample statistic is significant.
If the level $\alpha$ is pre-assigned and $p_0$ is the p-value associated with $t_0$, then $t_0$ is significant
at level $\alpha$ if $p_0 \le \alpha$.
2. Reporting the p-value instead of fixing $\alpha$ permits one to choose his or her own level of
significance.
3. If the critical region C is two-sided, that is, if C is of the form $(T \le t_1$ or $T \ge t_2)$, then we will
double the one-tailed p-value and report it as the p-value even if the distribution is
not symmetric.
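Continuing the same illustrative coin-toss test of H0: P = 0.5 against H1: P > 0.5, the sketch below computes the right-tail p-value for a hypothetical observed value t0 = 5:

```python
from math import comb

def p_value_right(n, p0, t0):
    """Right-tail p-value P_H0(T >= t0) for T ~ Binomial(n, p0)."""
    return sum(comb(n, t) * p0 ** t * (1 - p0) ** (n - t) for t in range(t0, n + 1))

t0 = 5  # hypothetical observed number of heads in 6 tosses
print("p-value:", p_value_right(6, 0.5, t0))
# 7/64 = 0.1094 > 0.05, so t0 = 5 is not significant at level 0.05.
```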
PRACTICE PROBLEMS
2. Find a 95% confidence interval estimate for $\theta$ using a single observation from
- END OF MODULE 12 -