Section 5.3 and 5.4

Ukzn stat222
Copyright
© © All Rights Reserved

Statistical Inference

Statistical inference (inferential statistics) refers to the methodology used to draw conclusions (expressed in the language of probability) about population parameters on the basis of samples drawn from the population.

Examples

1) The government of a country wants to estimate the proportion of voters ($p$) in the country that approve of its economic policies.
2) A manufacturer of car batteries wishes to estimate the average lifetime ($\mu$) of its batteries.
3) A paint company is interested in estimating the variability (as measured by the variance, $\sigma^2$) in the drying time of its paints.
Statistical Inference

• Hypothesis Tests (covered in section 5.5 of chapter 5)
• Estimation (covered in sections 5.3 & 5.4 of chapter 5)
    • Point Estimate: a single value that estimates the parameter.
    • Interval Estimate: a range of values that estimate the parameter. Associated with some chance that the parameter lies in this interval.

The quantities $p$, $\mu$ and $\sigma^2$ that are to be estimated are called population parameters.

Recall: A sample estimate of a population parameter is called a statistic.

The table below gives examples of some commonly used parameters together with their statistics:
[Table with columns "Parameter" and "Statistic"; the entries were not preserved.]

In this course, we will only be focusing on inferences on a population mean ($\mu$).

Some terminology…
• A point estimate of a parameter is a single value (point) that
estimates a parameter.

• An interval estimate of a parameter is a range of values from L (lower value) to U (upper value) that estimates a parameter. Associated with this range of values is a percentage of confidence that the range of values will contain the parameter that is being estimated.

Example:
Suppose the mean time it takes to serve customers at a supermarket
checkout counter is to be estimated.
1) The mean service time of 100 customers of (say) $\bar{x} = 2.283$ minutes is an example of a point estimate of the parameter $\mu$.
2) If it is stated that we are 95% confident that the mean service time will be from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an interval estimate of the parameter $\mu$.

Since, from chapter 4, we have seen that a point estimate (a statistic) can
differ each time depending on the sample obtained, it is more appropriate
(and useful) to obtain an interval estimate for a parameter rather than just
one single value.
Therefore, in sections 5.3 and 5.4, we are going to focus on methods of obtaining an interval estimate for the parameter $\mu$.

Some more terminology…
A confidence interval is a range of values from L (lower value) to U (upper value) that estimates a population parameter $\theta$ with $(1-\alpha)100\%$ confidence.

• $\theta$ (pronounced “theta”) can be the parameters $p$, $\mu$ or $\sigma^2$.
• L is the lower confidence limit.
• U is the upper confidence limit.
• The interval (L, U) is called the confidence interval.
• $1-\alpha$ is called the confidence coefficient.
• $(1-\alpha)100$ is called the confidence percentage. It is the percentage of confidence that the interval will contain $\theta$, the parameter that is being estimated.

Example:
Consider example 2 of slide 5: If it is stated that we are 95% confident
that the mean service time will be from 1.637 minutes to 4.009 minutes,
the interval of values (1.637, 4.009) is an interval estimate of the
parameter μ.

• $\theta$, the parameter that is being estimated, is the population mean $\mu$.
• $\alpha = 0.05$
• The confidence coefficient is $(1-\alpha) = 0.95$
• The confidence percentage is $(1-\alpha)100 = 95$
• L = 1.637 and U = 4.009
• The confidence interval is the interval (1.637, 4.009).

In sections 5.3 and 5.4 that follow, the determination of L and U when estimating the parameter $\mu$ will be discussed.

Section 5.3: Confidence interval for the population mean, $\mu$ (population variance known)

Consider the example from slide 5 where the mean time ($\mu$) it takes to serve customers at a supermarket checkout counter is to be estimated.

Suppose from one random sample of customers, a mean time of $\bar{x} = 2.283$ was obtained.
Consider this on a number line:

[Number line marking 1.283, $\bar{x} = 2.283$ and 3.283]

$\bar{x} = 2.283$ is a point estimate for $\mu$.

This value of 2.283 is not likely to be exactly equal to $\mu$, but it should be close. $\mu$ may be larger than $\bar{x} = 2.283$ or may be smaller than it.
• Maybe $\mu$ is smaller than 2.283 and is actually 1.283.
• Maybe $\mu$ is larger than 2.283 and is actually 3.283.
Therefore, it is better to get an interval estimate for $\mu$, e.g. (1.283 ; 3.283).

When finding an estimate for $\mu$, there is always some error involved (let’s refer to it by E). This is the distance between the true value of $\mu$ and the estimated value $\bar{x}$.

If $\mu = 1.283$, then the error (distance between the estimated value of $\mu$ and the true value of $\mu$) is:
$E = 2.283 - 1.283 = 1$

If $\mu = 3.283$, then the error (distance between the estimated value of $\mu$ and the true value of $\mu$) is:
$E = 3.283 - 2.283 = 1$

Therefore, for this example, a possible interval estimate for $\mu$ could be:
(1.283 ; 3.283)
= $(2.283 - 1 \;;\; 2.283 + 1)$
= $(\bar{x} - E \;;\; \bar{x} + E)$
since $\bar{x} = 2.283$ and $E = 1$.

In general, an interval estimate/confidence interval for a population mean has the following form:
$(\bar{x} - E \;;\; \bar{x} + E)$ or $\bar{x} \pm E$

i.e. (point estimate $-$ error ; point estimate $+$ error)

Now, what we need to determine is:

• How do we determine the value of the error (E)?


• How sure are we that the interval contains the population parameter?

Answer: We utilize the standard normal distribution and its properties…

(Note: The following slides show the theory behind the formula for an interval estimate/confidence interval for $\mu$; however, you do not need to know this theory.)

In order to answer the questions on the previous slide, we need to determine a way to obtain a range of values (from a lower value to an upper value) such that we know what the chance of a value falling into this range is. Therefore, we can use the standard normal distribution to obtain this:

Suppose we take the middle area of $1-\alpha$, where $\alpha$ is some arbitrary value.
Then using symmetry, if an area of $1-\alpha$ is in the centre of the curve, then an area of $\frac{\alpha}{2}$ will lie on either side.

[Figure: standard normal curve with a central area of $1-\alpha$ and an area of $\frac{\alpha}{2}$ in each tail]

Using this diagram, we can determine the z-values that correspond to the middle area of $1-\alpha$:
• Due to symmetry, these two z-values will be the same in magnitude, where one will be positive ($z_{1-\frac{\alpha}{2}}$) and the other will be negative ($z_{\frac{\alpha}{2}} = -z_{1-\frac{\alpha}{2}}$).
• Since the area to the right of the positive z-value is $\frac{\alpha}{2}$, the area to the left of this value is $1-\frac{\alpha}{2}$.

Therefore, from this diagram, we can obtain the following:

$$P\left(-z_{1-\frac{\alpha}{2}} < Z < z_{1-\frac{\alpha}{2}}\right) = 1-\alpha$$

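As a rough illustration of how the z-values bounding the middle area can be obtained without tables, the sketch below uses the standard library's `NormalDist`. The function name `z_multiplier` is made up for this illustration and is not part of the course material.

```python
from statistics import NormalDist

def z_multiplier(confidence: float) -> float:
    """Return the z-value with area 1 - alpha/2 to its left.

    `confidence` is the confidence coefficient (1 - alpha), e.g. 0.95.
    """
    alpha = 1.0 - confidence
    # inv_cdf returns the z-value that has the requested area to its LEFT
    # under the standard normal curve.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.2f} -> {z_multiplier(conf):.3f}")
```

Rounded to the precision used in standard normal tables, a confidence coefficient of 0.95 gives 1.96 and 0.99 gives 2.576.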
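Now, we can use this expression to develop the formula for the confidence interval for $\mu$.

Recall from chapter 4:

$$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$$

Substituting this for $Z$ gives:

$$P\left(-z_{1-\frac{\alpha}{2}} < \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} < z_{1-\frac{\alpha}{2}}\right) = 1-\alpha$$

Now we can solve for $\mu$ in the centre of this expression…

$$1-\alpha = P\left(-z_{1-\frac{\alpha}{2}} < \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} < z_{1-\frac{\alpha}{2}}\right)$$

Multiply by $\frac{\sigma}{\sqrt{n}}$:

$$= P\left(-z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} < \bar{x}-\mu < z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}}\right)$$

Subtract $\bar{x}$:

$$= P\left(-\bar{x} - z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}}\right)$$

Multiply by $-1$ (which reverses the inequalities):

$$\therefore 1-\alpha = P\left(\bar{x} - z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{1-\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}}\right)$$
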
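Therefore, we can be $(1-\alpha)100\%$ confident that $\mu$ will lie between $\bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ and $\bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$.

i.e. A $(1-\alpha)100\%$ confidence interval for $\mu$, when the value of $\sigma$ is known, is:

$$\left(\bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \;;\; \bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right)$$

The first value is the lower confidence limit (L) and the second value is the upper confidence limit (U).
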
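OR A $(1-\alpha)100\%$ confidence interval for $\mu$ ($\sigma$ known) is:

$$\bar{x} \pm z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$$

(on the formula sheet; in the form $\bar{x} \pm E$)

• $\bar{x}$ is the point estimate for $\mu$.
• $z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ is the error (E).

The value of the error (E) therefore depends on the confidence percentage ($(1-\alpha)100$) for the interval, the population standard deviation $\sigma$, and the size of the sample ($n$) used to obtain the point estimate $\bar{x}$.

• $z_{1-\frac{\alpha}{2}}$ is referred to as the z-multiplier.
• $\frac{\sigma}{\sqrt{n}}$ is the standard error.
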
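To give a feel for what "$(1-\alpha)100\%$ confident" means in practice, the following sketch repeatedly draws samples from a normal population with a known mean and counts how often the interval $\bar{x} \pm E$ contains that mean. The population values (mean 500, standard deviation 5, samples of size 30), the number of trials and the function name `coverage` are all assumptions chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, trials=2000, seed=1):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z-multiplier
    error = z * sigma / sqrt(n)  # E = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n and compute its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # Count the interval if it captures the true population mean.
        if xbar - error < mu < xbar + error:
            hits += 1
    return hits / trials

# Hypothetical population: mean 500, standard deviation 5.
print(coverage(mu=500, sigma=5, n=30, confidence=0.95))
```

The printed proportion should be close to 0.95: roughly 95 out of every 100 intervals built this way capture the true mean, even though any single interval either does or does not.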
Example 1:
The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected at random and their mean content found to be 498.5 milliliters.
Calculate 95% and 99% confidence intervals for the mean content of all the bottles.

• $\sigma = 5$: This is the standard deviation of the amount of cool drink in ALL the bottles; this refers to the population standard deviation.
• $n = 30$: A sample of 30 bottles was selected.
• $\bar{x} = 498.5$: This is the mean amount of cool drink in the sample of 30 bottles; this refers to the sample mean, $\bar{x}$.

We are to calculate a confidence interval for the mean amount of cool drink in ALL the bottles (the population mean).
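For a 95% confidence interval for the mean content:

We need to determine the value of $\alpha$. 95% is the confidence percentage (use this to solve for $\alpha$):

$$(1-\alpha)100 = 95 \quad \therefore \alpha = 0.05$$

Recall: a $(1-\alpha)100\%$ confidence interval for $\mu$ is

$$\bar{x} \pm z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$$

We need to determine $1-\frac{\alpha}{2}$ (the area to the left of the z-multiplier) in order to find this z-value in the tables.

$$\frac{\alpha}{2} = \frac{0.05}{2} = 0.025 \quad \therefore 1-\frac{\alpha}{2} = 0.975 \quad \therefore z_{1-\frac{\alpha}{2}} = z_{0.975}$$

We can now use the standard normal tables to find the z-multiplier:

$$z_{1-\frac{\alpha}{2}} = z_{0.975} = 1.96$$

Therefore, with $\sigma = 5$, $n = 30$ and $\bar{x} = 498.5$, a 95% confidence interval for $\mu$ is:

$$\bar{x} \pm z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} = 498.5 \pm 1.96\left(\frac{5}{\sqrt{30}}\right) = (496.71 \;;\; 500.29)$$

where

$$L = 498.5 - 1.96\left(\frac{5}{\sqrt{30}}\right) = 496.71 \qquad U = 498.5 + 1.96\left(\frac{5}{\sqrt{30}}\right) = 500.29$$
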
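For a 99% confidence interval for the mean content:

99% is the confidence percentage, so $\alpha = 0.01$.

$$\frac{\alpha}{2} = \frac{0.01}{2} = 0.005 \quad \therefore 1-\frac{\alpha}{2} = 0.995 \quad \therefore z_{1-\frac{\alpha}{2}} = z_{0.995} = 2.576$$

Therefore, a 99% confidence interval for $\mu$ is:

$$498.5 \pm 2.576\left(\frac{5}{\sqrt{30}}\right) = (496.15 \;;\; 500.85)$$
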
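The two intervals in this example can be checked numerically. The sketch below is one possible way to do so; `z_interval` is a hypothetical helper name, and the z-multiplier is computed with `NormalDist` rather than read from the tables, so it carries more decimal places than 1.96 and 2.576.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known: xbar ± z*sigma/sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z-multiplier
    error = z * sigma / sqrt(n)                         # error E
    return xbar - error, xbar + error

# Values from the cool drink example: xbar = 498.5, sigma = 5, n = 30.
for conf in (0.95, 0.99):
    lower, upper = z_interval(498.5, 5, 30, conf)
    print(f"{conf:.0%}: ({lower:.2f} ; {upper:.2f})")
```

To two decimal places this reproduces (496.71 ; 500.29) and (496.15 ; 500.85). Note that the 99% interval is wider than the 95% interval, since more confidence requires a larger z-multiplier.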
Determining the sample size when estimating the population mean $\mu$

Consider the example where we want to estimate the average time in minutes ($\mu$) it takes to serve customers at a supermarket checkout counter using a sample.
Suppose we obtain the following two interval estimates for $\mu$:
(1.5 ; 15)
(4.3 ; 5.4)
Question: Which of the two interval estimates above would be the best estimate?
Answer: (4.3 ; 5.4)
This interval is narrower, therefore it gives a more precise estimate. The interval (1.5 ; 15) is wider: how do we know whether the true value of $\mu$ is small and therefore closer to 1.5, or large and therefore closer to 15?
In general, when obtaining an interval estimate/confidence interval for a population parameter, the narrower the interval (for the same confidence percentage), the better the estimate!
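Recall: A $(1-\alpha)100\%$ confidence interval for $\mu$ (when $\sigma$ is known) is:

$$\bar{x} \pm z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$$

where $z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ is the error (E).

• The value of the error (E) is what affects the width of a confidence interval.
• Therefore, the smaller the value of this error, the better the estimate!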

Let’s consider what values can be changed in order to obtain the smallest error possible…

The value of the error $E = z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ depends on:

1) $z_{1-\frac{\alpha}{2}}$ (the z-multiplier):
• This is calculated using the confidence percentage (e.g. 90%, 95%...). When the confidence percentage decreases, this z-multiplier also decreases.
• Therefore, if we wanted to decrease the value of E by decreasing the z-multiplier, we could decrease the confidence percentage.
• However, we would like to be fairly confident in our estimate, therefore we generally do not want to decrease this confidence percentage by too much!

2) $\sigma$ (the population standard deviation):
• This value is fixed in real life, therefore it cannot be adjusted to decrease the value of E.

…the value of the error $E = z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ depends on:

3) $n$ (the sample size):
• As $n$ increases, the value of $\sqrt{n}$ increases, therefore $\frac{\sigma}{\sqrt{n}}$ will decrease (dividing by a larger number), resulting in a decrease in the value of the error E.

Therefore, out of the three values used to determine the error E, $n$ (the size of the sample) is the best choice to change in order to get a certain error E.
Since the larger the sample size, the smaller the value of the error E, we would ideally like to obtain the largest possible sample in order to get the most precise confidence interval for $\mu$.
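A small numerical sketch of how the error shrinks as the sample grows is given below. It reuses $\sigma = 5$ and the 95% z-multiplier 1.96 from the earlier cool drink example; the sample sizes 30, 120 and 480 and the helper name `error_margin` are chosen only for illustration.

```python
from math import sqrt

def error_margin(z, sigma, n):
    """Error E = z * sigma / sqrt(n) of a confidence interval for the mean."""
    return z * sigma / sqrt(n)

# Keep z and sigma fixed and increase only the sample size n.
for n in (30, 120, 480):
    print(n, round(error_margin(1.96, 5, n), 3))
```

Because $n$ appears under a square root, quadrupling the sample size only halves the error, which is one reason very small errors can become expensive to achieve.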

However, in reality, quite often resources are limited and it is not always possible to obtain a very large sample (it can be very costly and time-consuming).
We therefore need to choose a sample size large enough to obtain a certain level of accuracy in our estimate but still have the sample size small enough to be practical.
If we know beforehand (at the start of a study, before we obtain the sample from the population to be studied) what level of accuracy we want/need (i.e. if we know what the maximum size of the error E can be), we can calculate what sample size should be obtained to achieve this accuracy.

Using the fact that the error $E = z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$, we can solve for $n$:

$$\sqrt{n}\,E = z_{1-\frac{\alpha}{2}}\,\sigma$$

$$\sqrt{n} = \frac{z_{1-\frac{\alpha}{2}}\,\sigma}{E}$$

$$\therefore n = \left(\frac{z_{1-\frac{\alpha}{2}}\,\sigma}{E}\right)^2$$

(on the formula sheet)

Note: Always round UP to the nearest integer value no matter what the decimal place is!!

• This is due to the fact that the value of E is the MAXIMUM error we want to obtain.
• Recall: from $E = z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$, as $n$ increases, E decreases; and as $n$ decreases, E increases.
• Therefore, by rounding UP to the nearest integer value (thus increasing $n$), it ensures E will not be more than the stipulated maximum value.
Example 1:
Consider the example of the mean content ($\mu$) of 500 milliliter cool drink bottles. The standard deviation of the amount of cool drink in the bottles is 5 milliliters. Suppose it is desired to estimate the mean content of the bottles with 95% confidence and an error that is not greater than 0.8. What sample size is needed to achieve this accuracy?

$\sigma = 5$, $n = ?$, max value of E = 0.8

$$\frac{\alpha}{2} = 0.025 \quad \therefore 1-\frac{\alpha}{2} = 0.975 \quad \therefore z_{1-\frac{\alpha}{2}} = z_{0.975} = 1.96$$

$$n = \left(\frac{z_{1-\frac{\alpha}{2}}\,\sigma}{E}\right)^2 = \left(\frac{1.96(5)}{0.8}\right)^2 = 150.0625$$

$\therefore n = 151$ (rounded UP!)
Example 2:
A car manufacturer would like to estimate the average fuel consumption of their latest model (in litres per 100 km). All the cars of this particular model were designed as similarly as possible so that the standard deviation in the fuel consumption is only 0.8 litres/100 km.
What sample size should be taken if the manufacturer is to be 99% certain that the average consumption of the cars in the sample will be within 0.3 litres/100 km of the true average consumption of this model of cars?

$n = ?$, $\sigma = 0.8$

• This is specifying that the sample mean ($\bar{x}$) must be within 0.3 litres/100 km of the true value of the mean ($\mu$), i.e. the distance between $\bar{x}$ and $\mu$ must be at most 0.3.
• Recall: from the introduction of confidence intervals, the difference between what is obtained for $\bar{x}$ from the sample and the true mean $\mu$ is the error E.
• Therefore, the max value of E = 0.3.

$$\frac{\alpha}{2} = 0.005 \quad \therefore 1-\frac{\alpha}{2} = 0.995 \quad \therefore z_{1-\frac{\alpha}{2}} = z_{0.995} = 2.576$$

$$n = \left(\frac{z_{1-\frac{\alpha}{2}}\,\sigma}{E}\right)^2 = \left(\frac{2.576(0.8)}{0.3}\right)^2 = 47.1877$$

$\therefore n = 48$ (rounded UP!)
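The sample size calculation, including the "always round UP" rule, can be sketched in a few lines. The helper name `sample_size` is hypothetical, and the z-multipliers are passed in as the table values used in the two examples.

```python
from math import ceil

def sample_size(z, sigma, max_error):
    """Smallest n for which z * sigma / sqrt(n) does not exceed max_error."""
    n_exact = (z * sigma / max_error) ** 2
    # Always round UP: rounding down would let E exceed the stipulated maximum.
    return ceil(n_exact)

print(sample_size(1.96, 5, 0.8))     # cool drink bottles, 95% confidence
print(sample_size(2.576, 0.8, 0.3))  # fuel consumption, 99% confidence
```

These calls give 151 and 48, matching the two worked examples. `ceil` is used instead of `round` because ordinary rounding would turn 150.0625 into 150, which gives an error slightly larger than 0.8.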
Section 5.4: Confidence interval for the population mean, $\mu$ (population variance unknown)

Recall:

$$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

• This is only true when the population standard deviation $\sigma$ (and therefore population variance $\sigma^2$) is known.
• But, if $\sigma$ is unknown, it can be replaced by its sample estimate $S$.
• However, for small sample sizes ($n < 30$), the expression above follows a t-distribution with $n-1$ degrees of freedom instead of a standard normal distribution (Z),

i.e. when $\sigma$ is unknown AND $n < 30$:

$$t = \frac{\bar{x}-\mu}{S/\sqrt{n}} \sim t(n-1)$$

Therefore, a $(1-\alpha)100\%$ confidence interval for $\mu$, when $\sigma$ is unknown AND $n < 30$, is

$$\bar{x} \pm t_{n-1;\,1-\frac{\alpha}{2}}\frac{S}{\sqrt{n}}$$

(on the formula sheet)

• $S$ replaces $\sigma$.
• The z-multiplier ($z_{1-\frac{\alpha}{2}}$) is replaced by a t-multiplier ($t_{n-1;\,1-\frac{\alpha}{2}}$) with the same area to the left ($1-\frac{\alpha}{2}$) and $n-1$ degrees of freedom.

The same procedure as that from Section 5.3 is followed in order to construct
the confidence interval.
Tables for the t-distribution:

• The t-distribution is defined by its degrees of freedom (df), which is $n-1$ (the sample size less one).
• For each value of df, a different t-distribution is defined.
• It is assumed that the sample of size n is obtained from a population that is approximately normally distributed.
• Similar to the standard normal distribution, there are tables for the t-distribution.
• The tables for the t-distribution can be found after that of the standard normal distribution (Tables D1 and D2).
The layout of the t-tables is as follows:

• The values in the first column represent the degrees of freedom (df = $n-1$).
• The values in the top row represent the area under the curve to the left of a t-value that appears in the body of the table at the intersection of the row and column entry.
• Notation: $t_{\text{df};\,\text{area}}$ denotes the t-value that has the stated area to its left, where df is the degrees of freedom of the t-distribution.
• The t-tables differ from the standard normal tables in that the values in the body of the table are the t-values, with the areas (to the left of the t-value) in the top row.
• There are two t-tables: D1 (with areas ranging from 0.900 to 0.995) and D2 (with areas ranging from 0.980 to 0.999).


Example 1:
If df = 12 and the area to the left is 0.995:
Locate df = 12 in the first column and 0.995 in the top row.
$\therefore t_{12;\,0.995} = 3.055$

Example 2:
If n = 30 and the area to the left is 0.975:
Since the sample size is 30, the degrees of freedom will be 29.
Locate df = 29 in the first column and 0.975 in the top row.
$\therefore t_{29;\,0.975} = 2.045$

Notice how for a very large number of degrees of freedom (df = $\infty$), the t-value for each area is equal to the z-value corresponding to the same area, e.g. $t_{\infty;\,0.975} = z_{0.975} = 1.96$.

• There are only t-tables for upper percentiles (high values of the area to the left), therefore only positive t-values can be found using these tables.
• But due to symmetry, the area to the left of a negative t-value is equal to the area to the right of the corresponding positive t-value (similar to the standard normal distribution).
• Therefore, when a t-value with an area less than 0.5 to its left (i.e. area < 0.5) is to be determined, the following property can be used:

$$t_{\text{df};\,\text{area}} = -t_{\text{df};\,1-\text{area}}$$

Example:
df = $n-1$ = 10 and area = 0.10
• Since 0.10 < 0.5, the t-value that has this area of 0.10 to its left will be negative.
• This t-value cannot be directly found from the tables.

$$\therefore t_{10;\,0.10} = -t_{10;\,1-0.10} = -t_{10;\,0.90} = -1.372$$

Back to confidence intervals…
In a question, a standard deviation or variance will always be given. You need to determine (from the context of the question) whether the value refers to that of the sample or the whole population, i.e. whether the value represents $S$ or $\sigma$.

Example 1:
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly selected employees at a certain company. The values are given below.
38.2 43.9 38.4 26.2 41.3 42.3 37.5 37.2 41.2 42.3 31 50.1 37.3 36.7 31.8

Calculate 95% and 99% confidence intervals for the mean time it takes all the employees at this company to complete this task.

• No information was given about the standard deviation or variance of the time taken to complete the task for ALL the employees at this company.
• Therefore, $\sigma$ (and $\sigma^2$) is unknown.
• Since the data obtained from the sample was given, we can use STAT mode on the calculator to determine the sample mean ($\bar{x}$) and the sample standard deviation ($S$):

$n = 15$, $\bar{x} = 38.36$, $S = 5.78$

C.I. for $\mu$

For a 95% confidence interval for the mean time it takes all the employees at this company to complete this task:

$n = 15$, $\bar{x} = 38.36$, $S = 5.78$

First, we need to determine which formula will be used.
Since $\sigma$ is unknown AND $n < 30$, we will use the following confidence interval:

$$\bar{x} \pm t_{n-1;\,1-\frac{\alpha}{2}}\frac{S}{\sqrt{n}}$$

$$\frac{\alpha}{2} = 0.025 \quad \therefore 1-\frac{\alpha}{2} = 0.975 \quad \therefore t_{n-1;\,1-\frac{\alpha}{2}} = t_{14;\,0.975} = 2.145$$

$$38.36 \pm 2.145\left(\frac{5.78}{\sqrt{15}}\right) = (35.16 \;;\; 41.56)$$

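For a 99% confidence interval for the mean time it takes all the employees at this company to complete this task:

$n = 15$, $\bar{x} = 38.36$, $S = 5.78$

$$\bar{x} \pm t_{n-1;\,1-\frac{\alpha}{2}}\frac{S}{\sqrt{n}}$$

$$\frac{\alpha}{2} = 0.005 \quad \therefore 1-\frac{\alpha}{2} = 0.995 \quad \therefore t_{n-1;\,1-\frac{\alpha}{2}} = t_{14;\,0.995} = 2.977$$

$$38.36 \pm 2.977\left(\frac{5.78}{\sqrt{15}}\right) = (33.92 \;;\; 42.80)$$
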
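As a cross-check of this example, the sketch below recomputes the sample mean and sample standard deviation from the 15 recorded times and builds both intervals. The t-multipliers 2.145 and 2.977 are passed in as the table values, because Python's standard library has no t-distribution; `t_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

# Task completion times (in seconds) for the 15 sampled employees.
times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]

def t_interval(data, t_multiplier):
    """Confidence interval xbar ± t * S / sqrt(n) for sigma unknown.

    t_multiplier must be looked up in the t-tables using df = n - 1.
    """
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    error = t_multiplier * s / sqrt(n)
    return xbar - error, xbar + error

print(round(mean(times), 2), round(stdev(times), 2))
for label, t in (("95%", 2.145), ("99%", 2.977)):
    lower, upper = t_interval(times, t)
    print(f"{label}: ({lower:.2f} ; {upper:.2f})")
```

Using the unrounded standard deviation rather than 5.78 still gives (35.16 ; 41.56) and (33.92 ; 42.80) to two decimal places.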
When $\sigma$ is unknown but $n \geq 30$ (large sample), use the C.I. formula from Section 5.3 with a z-multiplier, but replace $\sigma$ by its sample estimate $S$:

$$\bar{x} \pm z_{1-\frac{\alpha}{2}}\frac{S}{\sqrt{n}}$$

(Not directly on the formula sheet.)
