
SEE5211 Chapter5 P2017

This document outlines a lecture on data analysis in environmental applications. It discusses sampling variability and confidence intervals. Specifically, it defines key statistical concepts like statistics, point estimates, and sampling variability. It provides an example to illustrate how sampling variability causes different statistics to be obtained from different samples. The document then introduces confidence intervals as a way to convey more information than a single point estimate by providing a range of plausible values for a population characteristic. It discusses how to construct a 95% confidence interval for a population proportion based on a large random sample.

Uploaded by

kk chan

Data Analysis in Envir Application

(SEE5211/SEE8212)

Dr. Wen Zhou


School of Energy and Environment

Email: wenzhou@cityu.edu.hk ; Office: B5425, AC1


Outline

• The role of statistics and the data analysis process


• Numerical method of describing data
• Summarizing bivariate data
• Population distributions
• Sampling variability and Confidence interval
• Hypothesis Testing Using a Single Sample
• Comparing Two populations
• Regression Analysis
• Analysis of Variance

Sampling variability and Confidence interval

Chapter 5
Statistic

• A number that can be computed from sample data
• Some statistics we will use include
$\bar{x}$: sample mean
$s$: sample standard deviation
$\hat{p}$: sample proportion
• The observed value of the statistic depends on the particular sample selected from the population, and it will vary from sample to sample. This variability is called sampling variability.
An example of sampling variability

A fish pond: Suppose there are 20 fish in the pond. The


lengths of the fish (in inches) are given below:

4.5 5.4 10.3 7.9 8.5 6.6 11.7 8.9 2.2 9.8
6.3 4.3 9.6 8.7 13.3 4.6 10.7 13.4 7.7 5.6
Suppose we randomly catch a sample of 3 fish from this pond and measure their length. What would the mean length of the sample be?

1st sample: 6.3, 2.2, and 13.3 inches. $\bar{x} = 7.27$ inches

Let's catch two more samples and look at the sample means.

2nd sample: 8.5, 4.6, and 5.6 inches. $\bar{x} = 6.23$ inches
3rd sample: 10.3, 8.9, and 13.4 inches. $\bar{x} = 10.87$ inches

The true mean is $\mu = 8$ inches. Notice that some sample means are closer and some farther away; some above and some below the mean.
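The fish pond experiment can be repeated in a few lines of Python. This is only an illustrative sketch: the function name `sample_means` and the seed are assumptions made here, not part of the lecture, and the samples drawn will differ from the three shown on the slide.

```python
import random
from statistics import mean

# Fish lengths (inches) for the whole pond, copied from the slide.
POND = [4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8,
        6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6]


def sample_means(population, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples random samples (drawn without replacement)."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]


print("population mean:", round(mean(POND), 2))
for i, m in enumerate(sample_means(POND, sample_size=3, n_samples=5), start=1):
    print(f"sample {i}: mean = {m:.2f}")
```

Changing the seed gives a different set of sample means, which is exactly the sampling variability described above.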
Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl.
How might we go about estimating this proportion?

We could take a sample of candies and compute the proportion of blue candies in our sample.

We would have a sample proportion, or a statistic: a single value for the estimate.
Point Estimate

• A single number (a statistic) based on sample data that is used to estimate a population characteristic
• But not always equal to the population characteristic, due to sampling variation. Different samples may produce different statistics.
• “Point” refers to the single value on a number line.

[Figure: number line with the population characteristic marked.]
The paper reports the results of 7421 students at 40 colleges and
universities. (The sample was selected in such a way that it is
representative of the population of college students.)
The authors want to estimate the proportion (p) of college
students who spend more than 3 hours a day on the Internet.
2998 out of 7421 students reported using the Internet more than 3
hours a day.

$\hat{p} = 2998/7421 = .404$

This is a point estimate for the population proportion of college students who spend more than 3 hours a day on the Internet.
A research paper, “The Impact of Internet and Television Use on the Reading Habits and Practices of College Students,” investigates the reading habits of college students. The following observations represent the number of hours spent on academic reading in 1 week by 20 college students.

1.7 3.8 4.7 9.6 11.7 12.3 12.3 12.4 12.6 13.4
14.1 14.2 15.8 15.9 18.7 19.4 21.2 21.9 23.3 28.2

If a point estimate of $\mu$, the mean academic reading time per week for all college students, is desired, an obvious choice of a statistic for estimating $\mu$ is the sample mean $\bar{x}$.
However, there are other possibilities: a trimmed mean or the sample median.

The dotplot suggests these data are approximately symmetric.


College Reading Continued . . .

1.7 3.8 4.7 9.6 11.7 12.3 12.3 12.4 12.6 13.4
14.1 14.2 15.8 15.9 18.7 19.4 21.2 21.9 23.3 28.2

sample mean: $\bar{x} = \dfrac{287.2}{20} = 14.36$

sample median $= \dfrac{13.4 + 14.1}{2} = 13.75$

10% trimmed mean $= \dfrac{230.2}{16} = 14.39$
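The three candidate point estimates can be checked with Python's standard library. This is a sketch under one stated assumption: `trimmed_mean` is a helper written here for illustration, and it interprets "10% trimmed" as dropping 10% of the observations from each end, which matches the slide's divisor of 16.

```python
from statistics import mean, median

# Weekly academic reading hours for the 20 students, copied from the slide.
HOURS = [1.7, 3.8, 4.7, 9.6, 11.7, 12.3, 12.3, 12.4, 12.6, 13.4,
         14.1, 14.2, 15.8, 15.9, 18.7, 19.4, 21.2, 21.9, 23.3, 28.2]


def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the values from EACH end of the sorted data."""
    ordered = sorted(data)
    # Number of values trimmed per end; the tiny offset guards against float round-off.
    k = int(len(ordered) * proportion + 1e-9)
    return mean(ordered[k:len(ordered) - k])


print(f"sample mean      = {mean(HOURS):.4f}")
print(f"sample median    = {median(HOURS):.4f}")
print(f"10% trimmed mean = {trimmed_mean(HOURS, 0.10):.4f}")
```

All three estimates are close to one another, as expected when the data are roughly symmetric.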
Computing an Estimate

• Choose a statistic that is unbiased (accurate)

A statistic whose mean value is equal to the value of the population characteristic being estimated is said to be an unbiased statistic.

[Figure: three sampling distributions compared with the true value. Two are unbiased, since the distribution is centered at the true value; one is biased, since the distribution is NOT centered at the true value.]
Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl.

We could take a sample of candies and compute the proportion of blue candies in our sample.

How much confidence do you have in the point estimate?

Would you have more confidence if your answer were an interval?
Confidence intervals

A confidence interval (CI) for a population characteristic is an interval of plausible values for the characteristic.

It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be between the lower and upper endpoints of the interval.

The primary goal of a confidence interval is to estimate an unknown population characteristic.
Rate your confidence: 0 to 100%

What does it mean to be within 10 years?

How confident (%) are you that you can ...

Guess a person’s age within 10 years?

. . . within 5 years?

. . . within 1 year?
What happened to your level of
confidence as the interval
became smaller?
Confidence level

The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.

For example, with a 95% confidence level: if this method were used to generate an interval estimate over and over again from different samples, in the long run 95% of the resulting intervals would include the actual value of the characteristic being estimated.

The most common confidence levels are 90%, 95%, and 99% confidence.
General Properties for sampling distributions

1. $\mu_{\hat{p}} = p$

2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$, as long as the sample size is less than 10% of the population

3. As long as n is large (np > 10 and n(1-p) > 10), the sampling distribution of $\hat{p}$ is approximately normal.

These are the conditions that must be true in order to calculate a large-sample confidence interval for p.
Large-sample confidence interval
To begin, we will use a 95% confidence level. Use the table of standard normal curve areas to determine the value of z* such that a central area of .95 falls between -z* and z*.

[Figure: standard normal curve centered at 0, with central area = .95 between -1.96 and 1.96, lower tail area = .025, and upper tail area = .025.]

For the standard normal distribution, about 95% of the values are within 1.96 of the mean. We can generalize this to normal distributions other than the standard normal distribution: about 95% of the values are within 1.96 standard deviations of the mean.

For large random samples, the sampling distribution of $\hat{p}$ is approximately normal. So about 95% of the possible $\hat{p}$ will fall within
$$1.96\sqrt{\frac{p(1-p)}{n}}$$
of p.
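Instead of reading z* from a printed table, the same value can be obtained from the inverse CDF of the standard normal distribution. A minimal sketch using `statistics.NormalDist` from the Python standard library (3.8+); the helper name `z_critical` is an assumption made here for illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z* such that the central area between -z* and z* equals `confidence`."""
    upper_tail = (1 - confidence) / 2          # e.g. .025 for 95% confidence
    return NormalDist().inv_cdf(1 - upper_tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_critical(level):.3f}")
```

The 99% value comes out slightly below the table value of 2.58 used later in the slides, because the table value is rounded to two decimals.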
Developing a Confidence Interval

If $\hat{p}$ is within $1.96\sqrt{\dfrac{p(1-p)}{n}}$ of p, this means the interval
$$\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} \quad\text{to}\quad \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$$
will capture p.

And this will happen for 95% of all possible samples!


Developing a Confidence Interval

Notice that the length of each half of the interval equals
$$1.96\sqrt{\frac{p(1-p)}{n}}$$

[Figure: approximate sampling distribution of $\hat{p}$, centered at its mean p, with one vertical line 1.96 standard deviations below the mean, at $p - 1.96\sqrt{p(1-p)/n}$, and one 1.96 standard deviations above the mean, at $p + 1.96\sqrt{p(1-p)/n}$.]

A $\hat{p}$ that falls within 1.96 standard deviations of the mean has a confidence interval that “captures” p.

A $\hat{p}$ that doesn’t fall within 1.96 standard deviations of the mean has a confidence interval that does NOT “capture” p.

When n is large, a 95% confidence interval for p is
$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
(in practice p is unknown, so the $p$ under the square root is replaced by $\hat{p}$).
The diagram shows 100 confidence intervals for p computed from 100 different random samples.

Note that the ones with asterisks do not capture p.

If we were to compute 100 more confidence intervals for p from 100 different random samples, would we get the same results?
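The question can be explored by simulation. The sketch below is an illustration only: the true proportion `p = 0.4`, the sample size `n = 500`, and the seeds are arbitrary assumptions, and `coverage` is a hypothetical helper name. It builds many large-sample 95% intervals and reports the fraction that capture the true p.

```python
import math
import random


def coverage(p, n, n_intervals, z=1.96, seed=1):
    """Fraction of simulated large-sample intervals that capture the true proportion p."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(n_intervals):
        # Draw one random sample of size n and count the successes.
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            captured += 1
    return captured / n_intervals


# Two batches of 100 intervals, as in the diagram; only the seed differs.
print(coverage(p=0.4, n=500, n_intervals=100, seed=1))
print(coverage(p=0.4, n=500, n_intervals=100, seed=2))
```

Each batch of 100 intervals generally misses p a different number of times, but over many intervals the capture rate settles near 95%.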
The Large-Sample Confidence Interval for p

The general formula for a confidence interval for a population proportion p when

• $\hat{p}$ is the sample proportion from a random sample,

• the sample size n is large (np > 10 and n(1-p) > 10), and

• if the sample is selected without replacement, the sample size is small relative to the population size (at most 10% of the population)

is
$$\hat{p} \pm (z\text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

The term $(z\text{ critical value})\sqrt{\hat{p}(1-\hat{p})/n}$ is called the bound on the error estimation.

The 95% confidence interval is based on the fact that, for approximately 95% of all random samples, $\hat{p}$ is within the bound on error estimation of p.
A survey of 1031 adult Americans: The survey was carried out by
the National Center for Public Policy and the sample was selected
in a way that makes it reasonable to regard the sample as
representative of adult Americans. Of those surveyed, 567 indicated
that they believe a college education is essential for success.
What is a 95% confidence interval for the population
proportion of adult Americans who believe that a college
education is essential for success?

The point estimate is
$$\hat{p} = \frac{567}{1031} = .55$$

Before computing the confidence interval, we need to verify the conditions.
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adults who believe that a
college education is essential for success?

Conditions:
1) Using $\hat{p}$ in place of p: np = 1031(.55) = 567 and n(1-p) = 1031(.45) = 464. Since both of these are greater than 10, the sample size is large enough to proceed.
2) The sample size of n = 1031 is much smaller than 10% of the population size (adults).
3) The sample was selected in a way designed to produce a representative sample. So we can regard the sample as a random sample from the population.
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adults who believe that a college education is
essential for success?

Calculation:
$$\hat{p} \pm (z\text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$.55 \pm 1.96\sqrt{\frac{.55(.45)}{1031}} = .55 \pm .030 = (.520, .580)$$
Conclusion:
We are 95% confident that the population proportion of adults who believe that a college education is essential for success is between 52.0% and 58.0%.
College Education Revisited . . .

A 95% confidence interval for the population proportion of adults who believe that a college education is essential for success is:
$$.55 \pm 1.96\sqrt{\frac{.55(.45)}{1031}} = (.520, .580)$$

Compute a 90% confidence interval for this proportion.
$$.55 \pm 1.645\sqrt{\frac{.55(.45)}{1031}} = (.524, .575)$$

Compute a 99% confidence interval for this proportion.
$$.55 \pm 2.58\sqrt{\frac{.55(.45)}{1031}} = (.510, .590)$$

[Figure: number line marking the endpoints .510, .520, .524, .575, .580, .590. The 90% interval lies inside the 95% interval, which lies inside the 99% interval.]
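The three intervals can be reproduced with a short function. This is a sketch, not the lecture's own code: `proportion_interval` is a name chosen here, and it uses the unrounded $\hat{p} = 567/1031$ rather than .55. Carrying full precision through the arithmetic gives endpoints of about .520 and .580 at 95% confidence; hand calculations that round intermediate values first can differ in the third decimal place.

```python
import math


def proportion_interval(successes, n, z):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # bound on error of estimation
    return p_hat - margin, p_hat + margin


# z critical values as used on the slides (2.58 is the rounded table value for 99%).
for label, z in (("90%", 1.645), ("95%", 1.96), ("99%", 2.58)):
    low, high = proportion_interval(567, 1031, z)
    print(f"{label}: ({low:.3f}, {high:.3f})")
```

Raising the confidence level increases the z critical value, so the interval widens: more confidence is bought at the price of less precision.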
Choosing a Sample Size

Before collecting any data, an investigator may wish to determine a sample size needed to achieve a certain bound on error estimation.

The bound on error estimation for a 95% confidence interval is
$$B = 1.96\sqrt{\frac{p(1-p)}{n}}$$

If we solve this for n:
$$n = p(1-p)\left(\frac{1.96}{B}\right)^2$$

Sometimes, it is feasible to perform a preliminary study to estimate the value for p.

If there is no prior knowledge and a preliminary study is not feasible, then the conservative estimate for p is 0.5.
Why is the conservative estimate for p = 0.5?

.1(.9) = .09
.2(.8) = .16
.3(.7) = .21
.4(.6) = .24
.5(.5) = .25

By using .5 for p, we are using the largest value for p(1-p) in our calculations.
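The table suggests the pattern; a short derivation, added here as a sketch rather than taken from the slides, confirms that $p(1-p)$ can never exceed .25 by completing the square:

```latex
p(1-p) \;=\; p - p^{2} \;=\; \frac{1}{4} - \left(p - \frac{1}{2}\right)^{2} \;\le\; \frac{1}{4},
\qquad \text{with equality exactly when } p = \frac{1}{2}.
```

Because the required n is proportional to $p(1-p)$, using p = 0.5 gives the largest sample size, so the resulting n is sufficient whatever the true p turns out to be.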
In spite of the potential safety hazards, some people would like to
have an internet connection in their car. Determine the sample size
required to estimate the proportion of adults who would like an
internet connection in their car to within 0.03 with 95% confidence.

What value should be used for p? With no prior estimate, use the conservative value p = .5, so that p(1-p) = .25.

$$n = p(1-p)\left(\frac{1.96}{B}\right)^2 = .25\left(\frac{1.96}{.03}\right)^2 = 1067.111$$

Always round the sample size up to the next whole number.

n = 1068 people
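The sample-size calculation is easy to wrap in a function. A minimal sketch: the name `sample_size_for_proportion` and its default arguments are assumptions made here, with `p=0.5` as the conservative default described above.

```python
import math


def sample_size_for_proportion(bound, p=0.5, z=1.96):
    """Smallest whole n for which z * sqrt(p(1-p)/n) is at most `bound`."""
    n_exact = p * (1 - p) * (z / bound) ** 2
    return math.ceil(n_exact)  # always round UP to the next whole number


print(sample_size_for_proportion(0.03))  # 1068
```

Passing a preliminary estimate for `p` (for example `p=0.1`) gives a smaller required sample, which is why a pilot study can save effort when one is feasible.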
Confidence intervals for $\mu$ when $\sigma$ is known

The general formula for a confidence interval for a population mean $\mu$ when
1) $\bar{x}$ is the sample mean from a random sample,
2) the sample size n is large (n > 30), and
3) $\sigma$, the population standard deviation, is known
is
$$\bar{x} \pm (z\text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right)$$

Here $\bar{x}$ is the point estimate, $\sigma/\sqrt{n}$ is the standard deviation of the statistic, and $(z\text{ critical value})(\sigma/\sqrt{n})$ is the bound on error of estimation.

These are the properties of the sampling distribution of $\bar{x}$.
Cosmic radiation levels rise with increasing altitude, prompting researchers to consider how pilots and flight crews might be affected by increased exposure to cosmic radiation. A study reported a mean annual cosmic radiation dose of 219 mrems for a sample of flight personnel of Xinjiang Airlines. Suppose this mean is based on a random sample of 100 flight crew members.
Let $\sigma$ = 35 mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.
1) Data is from a random sample of crew members
2) Sample size n is large (n > 30)
3) $\sigma$ is known
Cosmic Radiation Continued . . .

Let $\bar{x}$ = 219 mrems
n = 100 flight crew members
$\sigma$ = 35 mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.

$$\bar{x} \pm (z\text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right)$$
$$219 \pm 1.96\left(\frac{35}{\sqrt{100}}\right) = (212.14, 225.86)$$
We are 95% confident that the actual mean annual cosmic radiation exposure
for Xinjiang flight crew members is between 212.14 mrems and 225.86 mrems.
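The same calculation as a small function, offered as a sketch; `z_interval_mean` is a hypothetical name, and the inputs are the values given in the example.

```python
import math


def z_interval_mean(x_bar, sigma, n, z=1.96):
    """Confidence interval for a population mean when sigma is known."""
    margin = z * sigma / math.sqrt(n)  # bound on error of estimation
    return x_bar - margin, x_bar + margin


low, high = z_interval_mean(x_bar=219, sigma=35, n=100)
print(f"({low:.2f}, {high:.2f})")  # (212.14, 225.86)
```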
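Confidence intervals for $\mu$ when $\sigma$ is unknown

When $\sigma$ is unknown, we use the sample standard deviation s to estimate $\sigma$. In place of z-scores, we must use the following to standardize the values:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

The use of the value of s introduces extra variability. Therefore the distribution of t values has more variability than a standard normal curve.

For example, if the 35 mrems in the cosmic radiation example is treated as a sample standard deviation s, the 95% t critical value for 99 df is 1.98 and the interval becomes (212.07, 225.93).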


Important Properties of t Distributions

t distributions are described by degrees of freedom (df).

1) The t distribution corresponding to any particular number of degrees of freedom is bell shaped and centered at zero (just like the standard normal (z) distribution).
2) Each t distribution is more spread out than the standard normal distribution.

[Figure: z curve and t curve for 2 df, both centered at 0.]
Why is the z curve taller than the t curve for 2 df?

3) As the number of degrees of freedom increases, the spread of the corresponding t distribution decreases.

[Figure: t curve for 8 df and t curve for 2 df, both centered at 0.]

4) As the number of degrees of freedom increases, the corresponding sequence of t distributions approaches the standard normal distribution.

[Figure: z curve, t curve for 2 df, and t curve for 5 df, all centered at 0.]
Confidence intervals for $\mu$ when $\sigma$ is unknown

The general formula for a confidence interval for a population mean $\mu$ based on a sample of size n when

1) $\bar{x}$ is the sample mean from a random sample,
2) the population distribution is normal, or the sample size n is large (n > 30), and
3) $\sigma$, the population standard deviation, is unknown

is
$$\bar{x} \pm (t\text{ critical value})\left(\frac{s}{\sqrt{n}}\right)$$
where the t critical value is based on df = n - 1.
In a study, chimpanzees learned to use an apparatus that dispensed food when either of
two ropes was pulled. When one of the ropes was pulled, only the chimp controlling the
apparatus received food. When the other rope was pulled, food was dispensed both to
the chimp controlling the apparatus and also a chimp in the adjoining cage. The
accompanying data represent the number of times out of 36 trials that each of seven
chimps chose the option that would provide food to both chimps (charitable response).

23 22 21 24 19 20 20

Compute a 99% confidence interval for the mean number of charitable responses for the population of all chimps.
Chimps Continued . . .
23 22 21 24 19 20 20
[Figure: normal probability plot with Number of Charitable Responses (about 20 to 24) on the horizontal axis and Normal Scores (about -2 to 2) on the vertical axis.]

The plot is reasonably straight, so it seems plausible that the population distribution of number of charitable responses is approximately normal.
Chimps Continued . . .
23 22 21 24 19 20 20
$\bar{x}$ = 21.29 and s = 1.80, df = 7 - 1 = 6

$$\bar{x} \pm (t\text{ critical value})\left(\frac{s}{\sqrt{n}}\right)$$
$$21.29 \pm 3.71\left(\frac{1.80}{\sqrt{7}}\right) = (18.77, 23.81)$$
We are 99% confident that the mean number of
charitable responses for the population of all
chimps is between 18.77 and 23.81.
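The chimp interval can be recomputed from the raw data. This is a sketch with two stated assumptions: the t critical value 3.71 (99% confidence, 6 df) is hardcoded from the slide because the Python standard library has no t distribution, and `t_interval_mean` is a name chosen here. Since the code keeps the unrounded mean and standard deviation, the lower endpoint can differ from the slide's hand calculation in the second decimal place.

```python
import math
from statistics import mean, stdev

# Charitable responses (out of 36 trials) for the seven chimps, from the slide.
RESPONSES = [23, 22, 21, 24, 19, 20, 20]

# 99% t critical value for df = 7 - 1 = 6, taken from the slide (t table value).
T_CRITICAL_99_DF6 = 3.71


def t_interval_mean(data, t_critical):
    """One-sample t confidence interval for a population mean (sigma unknown)."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    margin = t_critical * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = t_interval_mean(RESPONSES, T_CRITICAL_99_DF6)
print(f"x-bar = {mean(RESPONSES):.2f}, s = {stdev(RESPONSES):.2f}")
print(f"99% CI: ({low:.2f}, {high:.2f})")
```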
Choosing a Sample Size

The bound on error of estimation associated with a 95% confidence interval is
$$B = 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$

Solve this for n:
$$n = \left(\frac{1.96\,\sigma}{B}\right)^2$$

We can use this to find the necessary sample size for a particular bound on error of estimation.

This requires $\sigma$ to be known, which is rarely the case!
When $\sigma$ is unknown, a preliminary study can be performed to estimate $\sigma$
OR
make an educated guess of the value of $\sigma$.
A rough estimate for $\sigma$ (used with distributions that are not too skewed) is the range divided by 4.
The financial aid office wishes to estimate the mean cost of textbooks
per quarter for students at a particular university. For the estimate to
be useful, it should be within $20 of the true population mean. How
large a sample should be used to be 95% confident of achieving this
level of accuracy?
The financial aid office believes that the amount spent on books varies, with most values between $150 and $550.

To estimate $\sigma$:
$$\sigma \approx \frac{550 - 150}{4} = \$100$$
Standard deviation

Empirical Rule:

• Approximately 68% of the observations are within 1 standard deviation of the mean

• Approximately 95.4% of the observations are within 2 standard deviations of the mean

• Approximately 99.7% of the observations are within 3 standard deviations of the mean

Since most values lie within 2 standard deviations of the mean, the range of most values spans about 4 standard deviations:
$$\sigma \approx \frac{550 - 150}{4} = \$100$$
The financial aid office wishes to estimate the mean cost of textbooks per quarter for students at a particular university. For the estimate to be useful, it should be within $20 of the true population mean. How large a sample should be used to be 95% confident of achieving this level of accuracy?

$$n = \left(\frac{1.96(100)}{20}\right)^2 = 96.04$$

Always round sample size up to the next whole number!

n = 97
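Both steps, the range-over-4 estimate of the standard deviation and the sample-size formula, fit in a few lines. A sketch only: the function names `sigma_from_range` and `sample_size_for_mean` are assumptions made here for illustration.

```python
import math


def sigma_from_range(low, high):
    """Rough standard deviation estimate: range divided by 4 (not-too-skewed data)."""
    return (high - low) / 4


def sample_size_for_mean(bound, sigma, z=1.96):
    """Smallest whole n for which z * sigma / sqrt(n) is at most `bound`."""
    return math.ceil((z * sigma / bound) ** 2)  # always round UP


sigma = sigma_from_range(150, 550)
print(sigma, sample_size_for_mean(bound=20, sigma=sigma))  # 100.0 97
```

Halving the bound to $10 would roughly quadruple the required sample, since n grows with the square of 1/B.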
Contour Plot

• Open littlepond.jmp
• Select Graph > Contour plot
• Select the X, Y coordinates and click X
• Select the depth Z and click Y (in a contour plot, the X1, X2 roles are used for the
X and Y axes)
• Red Triangle >Fill Areas
Nominal Logistic Regression

1. Open Penicillin.jmp.
2. Select Analyze > Fit Y by X.
3. Select Response and click Y, Response. (Categorical Variable)
4. Select ln(dose) and click X, Factor. (Continuous Variable)
Notice that JMP automatically fills in Count for Freq. Count was previously
assigned the role of Freq.
5. Click OK.
Right-click and choose Marker Size.

The plot shows the fitted model, which is the predicted probability of being cured, as a function of ln(dose). The p-value is significant, indicating that the dosage amounts have a significant effect on whether the rabbits are cured.
Principal Component Analysis

The purpose of principal component analysis is to derive a small number of independent linear combinations (principal components) of a set of measured variables that capture as much of the variability in the original variables as possible. Principal component analysis is a dimension-reduction technique, as well as an exploratory data analysis tool. Principal component analysis is also useful for constructing predictive models, as in principal components analysis regression (PCA regression).
1. Open Solubility.jmp.
2. Select Analyze > Multivariate Methods > Principal Components.
The Principal Components launch window appears.
3. Select all of the continuous columns and click Y, Columns.
4. Keep the default Estimation Method.
5. Click OK. The Principal Components on Correlations report appears.

[Screenshots: Correlations report; Covariance Matrix]
6. Click the red triangle, then select Scree Plot and Scatterplot 3D.
• The report gives the eigenvalues and a bar chart of the percent of the
variation accounted for by each principal component. There is a Score Plot
and a Loadings Plot as well.
• The eigenvalues indicate the total number of components extracted based on
the amount of variance contributed by each component.
• The Score Plot graphs each component’s calculated values in relation to the
other, adjusting each value for the mean and standard deviation.
• The Loadings Plot graphs the unrotated loading matrix between the variables
and the components. The closer the value is to 1 the greater the effect of the
component on the variable.
