Statistics

Chapter 01 – Overview and Descriptive Statistics

Probability vs. Statistics

• The probability of an event is the likelihood of it occurring, e.g., when a coin is tossed, there is some probability of getting a head and some probability of getting a tail.
• Statistics deals with a set of data, e.g., finding the most frequently used item in a set of data. It is the science of learning from data.
• A toy example helps:
  • Probability is starting with an animal and figuring out what footprints it will make.
  • Statistics is seeing a footprint and guessing the animal.

Probability vs. Statistics Cont’d.

• Suppose we have information about a population, and we want to know about samples we could take from that population. Probability addresses these questions.
• Suppose we have sample data, and we want to know about the population the sample came from. Statistics uses sample data to make inferences about the population the sample came from.

Probability: Given the information in the pail, what is in your hand?
Statistics: Given the information in your hand, what is in the pail?

What is a Population?

• A population is the entire set of items from which you draw data for a statistical study. It can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
• An example of a population would be the entire student body at a school.

What is a Sample?

• A sample is a smaller and more manageable representation of a larger group. It is a subset of a larger population that contains characteristics of that population.

[Figure: a sample shown as a subset of the population]

What is a Sample? Cont’d.

• Samples are used when:
  • The population is too large to collect data from every member.
  • The data collected from the whole population would not be reliable.
  • The population is hypothetical and is unlimited in size.
• Take the example of a study that documents the results of a new medical procedure. It is unknown how the procedure will affect people across the globe, so a test group is used to find out how people react to it.

What is a Sample? Cont’d.

• A sample should generally:
  • Cover all the different variations present in the population and follow a well-defined selection criterion.
  • Be unbiased with respect to the properties of the objects being selected.
  • Be drawn at random, so that the objects of study are chosen fairly.

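The idea of drawing a sample at random can be sketched in a few lines of Python. This is only an illustration: the population of 5,000 student ID numbers, the sample size of 100, and the helper name `simple_random_sample` are all assumptions made for the example.

```python
import random

# Hypothetical population: ID numbers for 5,000 students (assumed for illustration).
population = list(range(1, 5001))

def simple_random_sample(items, n, seed=None):
    """Draw n distinct items so that every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(items, n)

sample = simple_random_sample(population, 100, seed=1)
print(len(sample), "students sampled;", len(set(sample)), "are distinct")
```

Because `random.Random.sample` draws without replacement, no student can appear twice, and fixing the seed makes the draw repeatable.
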
Descriptive vs. Inferential Statistics

• There are two main branches in the field of statistics:
① Descriptive statistics aims to describe a body of raw data using summary statistics, graphs, and tables.
  • For example, suppose we have raw data showing the test scores of 1000 students at a particular school. We might be interested in the average test score along with the distribution of test scores.
② Inferential statistics uses a small sample of data to draw inferences about the larger population that the sample came from.

Descriptive vs. Inferential Statistics Cont’d.

• For example, we might be interested in understanding the political preferences of millions of people in a country. However, it would take too long and be too expensive to survey every individual in the country. Instead, we would survey a smaller group of, say, 1000 individuals and use the results to draw inferences about the population.

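The survey idea can be simulated as a rough sketch. Everything numeric here is an assumption for illustration: a hypothetical population of one million people in which 40% hold a given preference, and a helper named `estimate_proportion`.

```python
import random

rng = random.Random(42)

# Hypothetical population: 1 = holds the preference, 0 = does not (40% assumed).
population = [1] * 400_000 + [0] * 600_000

def estimate_proportion(pop, n, rng):
    """Estimate the population proportion from a simple random sample of size n."""
    sample = rng.sample(pop, n)
    return sum(sample) / n

true_p = sum(population) / len(population)
p_hat = estimate_proportion(population, 1000, rng)
print(f"population proportion = {true_p:.3f}, sample estimate = {p_hat:.3f}")
```

The estimate from 1000 individuals will usually land close to the population value, which is what makes the inference step worthwhile.
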
Descriptive vs. Inferential Statistics Cont’d.

• The relationship between probability and inferential statistics can be summarized by saying that probability reasons from the population to the sample (deductive reasoning), whereas inferential statistics reasons from the sample to the population (inductive reasoning).

Pictorial and Tabular Methods in Descriptive Statistics

Stem-and-Leaf displays

• Consider a numerical dataset $x_1, x_2, \ldots, x_n$ for which each $x_i$ consists of at least two digits. A quick way to obtain an informative visual representation of the dataset is to construct a stem-and-leaf display.

Stem-and-Leaf displays Cont’d.

Example. The average number of hours of sleep per day over a two-week period for a sample of 253 college students*.

[Figure: stem-and-leaf display of the sleep data, annotated "Bell-shaped curve"]

• Numbers in the Low Group end with a second digit of 0, 1, 2, 3, or 4.
• Numbers in the High Group end with a second digit of 5, 6, 7, 8, or 9.

*Individuals in this age group need about 8.4 hours of sleep per day.

Stem-and-Leaf displays Cont’d.

• A stem-and-leaf display discloses the following aspects of the data:
  • Identification of a typical or representative value.
  • Extent of spread about the typical value.
  • Presence of any gaps in the data.
  • Extent of symmetry in the distribution of values.
  • Number and locations of peaks.
  • Presence of outliers, i.e., values far from the rest of the data.
• A display based on between 5 and 20 stems is recommended.

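A stem-and-leaf display can be built with a short routine. The sketch below assumes non-negative integer observations, uses the ones digit as the leaf and the remaining digits as the stem, and runs on a small hypothetical dataset (not the sleep data).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with the ones digit as leaf and the rest as stem."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical two-digit observations, for illustration only.
data = [64, 58, 71, 66, 83, 75, 69, 72, 77, 61, 90, 68]

display = stem_and_leaf(data)
# Print every stem in range, including empty ones, so gaps in the data stay visible.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing empty stems is a deliberate choice: it keeps gaps and the overall shape of the distribution readable from the display.
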
Dotplots

• A dotplot is an attractive summary of numerical data when the dataset is reasonably small or there are relatively few distinct data values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically.

Example. There is a growing concern in the U.S. that not enough students are graduating from college. America used to be number 1 in the world for the percentage of adults with college degrees, but it has recently dropped to 16th. Here is data on the percentage of 25- to 34-year-olds in each state who had some type of post-secondary degree as of 2010 (listed in alphabetical order, with Washington D.C. included):

Dotplots Cont’d.

[Figure: dotplot of the 51 measures]

Note. A dotplot can be quite cumbersome to construct and can look crowded when the number of observations is large. Now, let’s look at other methods.

Histograms

• A numerical variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence (one in which there is a first number, a second number, and so on).
• A discrete variable $x$ almost always results from counting, in which case possible values are 0, 1, 2, 3, … or some subset of these integers.
• A numerical variable is continuous if its possible values consist of an entire interval on the number line.
• Continuous variables arise from making measurements. For example, if $x$ is the pH of a chemical substance, then in theory $x$ could be any number between 0 and 14, e.g., 7.0, 7.03, 7.032, and so on.

Histograms Cont’d.

• Consider data consisting of observations on a discrete variable $x$. The frequency of any $x$ value is the number of times that value occurs in the dataset. The relative frequency of a value is the fraction or proportion of times the value occurs:

$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the dataset}}$$

Example. Suppose that our dataset consists of 200 observations (students) on $x$ = the number of courses a college student is taking this term. If 70 of these $x$ values are 3, then:

Histograms Cont’d.

• Frequency of the $x$ value 3: 70
• Relative frequency of the $x$ value 3: $\frac{70}{200} = .35$
• Multiplying a relative frequency by 100 gives a percentage; in the college-course example, 35% of the students in the sample are taking three courses. The relative frequencies, or percentages, are usually of more interest than the frequencies themselves.
• In theory, the relative frequencies should sum to 1, but in practice the sum may differ slightly from 1 because of rounding.

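The relative-frequency calculation can be sketched as follows. Only the count of 70 threes out of 200 observations comes from the example; the other counts are hypothetical, chosen so that the total is 200, and `relative_frequencies` is a name made up for this sketch.

```python
from collections import Counter

# 200 observations on x = number of courses; 70 of them equal 3, as in the example.
# The remaining counts are hypothetical and only make the total come to 200.
data = [3] * 70 + [4] * 60 + [5] * 40 + [2] * 20 + [6] * 10

def relative_frequencies(values):
    """Map each distinct value to (times it occurs) / (number of observations)."""
    counts = Counter(values)
    n = len(values)
    return {x: counts[x] / n for x in sorted(counts)}

rel = relative_frequencies(data)
for x, r in rel.items():
    print(f"x = {x}: relative frequency {r:.2f} ({100 * r:.0f}%)")
print("sum of relative frequencies:", round(sum(rel.values()), 10))
```
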
Histograms Cont’d.

Example. How unusual is a no-hitter* or a one-hitter in a major league baseball game, and how frequently does a team get more than 10, 15, or even 20 hits? The table below is a frequency distribution for the number of hits per team per game for all nine-inning games that were played between 1989 and 1993.

*In baseball, a no-hitter is a game in which a team was not able to record a single hit through conventional means.

Histograms Cont’d.

[Table: frequency distribution of $x$ = number of hits per team per game]

Histograms Cont’d.

• Proportion of games with at most two hits = relative frequency for $x = 0$ + relative frequency for $x = 1$ + relative frequency for $x = 2$ = .0010 + .0037 + .0108 = .0155
• Similarly, proportion of games with between 5 and 10 hits (inclusive) = .0752 + .1026 + ⋯ + .1015 = .6361
• That is, roughly 64% of all these games resulted in between 5 and 10 (inclusive) hits.

Histogram Shapes

• Histograms come in a variety of shapes. A unimodal histogram is one that rises to a single peak and then declines. A bimodal histogram has two different peaks. Bimodality can occur when the dataset consists of observations on two quite different kinds of individuals or objects.

Example. Consider a large dataset consisting of driving times for cars traveling between San Luis Obispo, California, and Monterey, California (exclusive of stopping time for sightseeing, eating, etc.). This histogram would show two peaks: one for those cars that took the inland route (roughly 2.5 hours) and another for those cars traveling up the coast (3.5 to 4 hours).

• A histogram with more than two peaks is said to be multimodal.

Histogram Shapes Cont’d.

• A histogram is symmetric if the left half is a mirror image of the right half (b). A unimodal histogram is positively skewed if the stretching is to the right (c) and negatively skewed if the stretching is to the left (a).
• For positively skewed data, large positive outliers exist, which tend to “pull” the mean upward.
• For negatively skewed data, large negative outliers exist, which tend to “pull” the mean downward.

[Figure: three histograms: (a) negatively skewed, (b) symmetric, (c) positively skewed]

Skewness is simply a reflection of a dataset in which values are heavily concentrated in one range and less concentrated in another.

Histogram Shapes Cont’d.

• Draw a histogram to represent the following data: 5, 3, 3, 6, 4, 3, 5, 4, 7, 3, 3, 5, 3, 6, 4, 3, 4. Then draw a histogram to represent the following data: 7, 4, 6, 7, 5, 7, 6, 3, 4, 7, 5, 6, 6, 7, 7, 5, 7.

[Figure: two histograms, the first labeled "Right Skewed" and the second labeled "Left Skewed"]

Histogram Shapes Cont’d.

Right Skewed Histogram
• Also known as a positively skewed histogram.
• Mean > Median > Mode.
• The peak of the graph lies on the left side of the center.

Left Skewed Histogram
• Also known as a negatively skewed histogram.
• Mean < Median < Mode.
• The peak of the graph lies on the right side of the center.

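The mean, median, and mode orderings can be checked on the two datasets from the histogram exercise. This is a small sketch using Python's standard `statistics` module; the variable names and the `summarize` helper are chosen for this example.

```python
from statistics import mean, median, mode

# The two datasets from the "draw a histogram" exercise.
right_skewed = [5, 3, 3, 6, 4, 3, 5, 4, 7, 3, 3, 5, 3, 6, 4, 3, 4]
left_skewed = [7, 4, 6, 7, 5, 7, 6, 3, 4, 7, 5, 6, 6, 7, 7, 5, 7]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

for name, values in [("first dataset", right_skewed), ("second dataset", left_skewed)]:
    m, med, mo = summarize(values)
    print(f"{name}: mean = {m:.2f}, median = {med}, mode = {mo}")
```

For the first dataset the mean exceeds the median, which exceeds the mode. For the second the order is reversed, in line with the right-skewed and left-skewed labels.
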
Measures of Location

The Mean

• For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar and useful measure of the center is the mean, or arithmetic average, of the set. We will often refer to the arithmetic average as the sample mean and denote it by $\bar{x}$.

The Mean Cont’d.

The sample mean can be regarded as the balance point of the distribution of observations.

$$\bar{x} = \frac{229.0}{14} = 16.36$$

The median

The sample median is very insensitive to outliers. If the two largest $x_i$ are increased from 75.7 and 79.0 to 85.7 and 89.0, respectively, $\tilde{x}$ would be unaffected. Thus, in the treatment of outlying data values, $\bar{x}$ and $\tilde{x}$ are at opposite ends of a spectrum.

$$\tilde{x} = \frac{66.4 + 67.4}{2} = 66.90$$

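The contrast between the mean and the median can be seen in a small sketch. The six-value dataset below is hypothetical: it borrows the two middle values (66.4, 67.4) and the two largest values (75.7, 79.0) mentioned above, while the two smallest values are invented for illustration.

```python
from statistics import mean, median

# Hypothetical dataset; only 66.4, 67.4, 75.7 and 79.0 echo the example above.
data = [60.3, 62.0, 66.4, 67.4, 75.7, 79.0]
# Same data with the two largest values raised to 85.7 and 89.0.
modified = [60.3, 62.0, 66.4, 67.4, 85.7, 89.0]

print(f"median before = {median(data):.2f}, after = {median(modified):.2f}")
print(f"mean   before = {mean(data):.2f}, after = {mean(modified):.2f}")
```

The median depends only on the middle values, so it does not move, while the mean shifts upward because every observation contributes to it.
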
Measures of Variability

The Variance

The Variance Cont’d.

Try to validate it yourself!

The Variance Cont’d.

• The variance is unchanged when a constant $c$ is added to (or subtracted from) each data value. This is intuitive, since adding or subtracting $c$ shifts the location of the dataset but leaves distances between data values unchanged.
• Multiplication of each $x_i$ by $c$ results in $s^2$ being multiplied by a factor of $c^2$.

These properties can be proved by noting that $\bar{y} = \bar{x} + c$ when $y_i = x_i + c$, and $\bar{y} = c\bar{x}$ when $y_i = c x_i$.

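Both properties are easy to check numerically. The sketch below uses hypothetical data and an arbitrary constant `c = 3.0`, and relies on `statistics.variance`, which computes the sample variance with divisor $n - 1$.

```python
import math
from statistics import variance

# Hypothetical data and constant, for illustration only.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
c = 3.0

shifted = [xi + c for xi in x]   # y_i = x_i + c
scaled = [c * xi for xi in x]    # y_i = c * x_i

s2 = variance(x)
print("s^2 of x:        ", s2)
print("s^2 of x + c:    ", variance(shifted))
print("s^2 of c * x:    ", variance(scaled))
print("c^2 times s^2:   ", c ** 2 * s2)
```
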
The Variance Cont’d.

$$s^2 = \frac{S_{xx}}{n - 1} = 31.41$$

Boxplots

Boxplots Cont’d.

• Find the median, lower quartile, and upper quartile of the following numbers: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25.
• First, arrange the data in ascending order: 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
• Median (middle value) = 22
• Lower Quartile (middle value of the lower half) = 12
• Upper Quartile (middle value of the upper half) = 36
• If there is an even number of data items, the middle value is the average of the two middle numbers.

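The worked example can be reproduced with a short function. One assumption is made explicit here: for an odd number of values, the overall median is left out of both halves, which is the convention the example's answers imply. Other texts and libraries use different quartile conventions. The function name `quartiles` is chosen for this sketch, and it expects at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (lower quartile, median, upper quartile).

    Each quartile is the median of one half of the sorted data. With an odd
    number of values, the overall median is excluded from both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower_half = s[:half]
    upper_half = s[-half:]
    return median(lower_half), median(s), median(upper_half)

data = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
q1, med, q3 = quartiles(data)
print(f"lower quartile = {q1}, median = {med}, upper quartile = {q3}")
```
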
Boxplots Cont’d.
The following data consists of observations on the time until failure
(1000s of hours) for a sample of turbo-chargers from one type of
engine.

Boxplots Cont’d.
Example. The following is a sample of TN (total nitrogen) loads (kg N/day) from a particular location, displayed in increasing order.

Boxplots Cont’d.

• Relevant summary quantities are:
  $\tilde{x} = 92.17$, lower 4th = 45.64, upper 4th = 167.79
  $f_s = 122.15$, $1.5 f_s = 183.225$, $3 f_s = 366.45$
• Subtracting $1.5 f_s$ from the lower 4th gives a negative number, and none of the observations are negative, so there are no outliers on the lower end of the data. However,
  upper 4th + $1.5 f_s$ = 351.015, upper 4th + $3 f_s$ = 534.24
• Thus, the four largest observations (563.92, 690.11, 826.54, and 1529.35) are extreme outliers, and 352.09, 371.47, 444.68, and 460.86 are mild outliers.

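The outlier rule can be written out as a small classifier. The sketch uses the lower and upper fourths quoted above and only the eight observations named in the text, since the full dataset is not reproduced here. The function name `classify` is an assumption of this sketch.

```python
lower_fourth, upper_fourth = 45.64, 167.79
fs = upper_fourth - lower_fourth  # fourth spread, about 122.15

def classify(x):
    """Label x using cutoffs 1.5*fs and 3*fs beyond the nearest fourth."""
    if x < lower_fourth - 3 * fs or x > upper_fourth + 3 * fs:
        return "extreme outlier"
    if x < lower_fourth - 1.5 * fs or x > upper_fourth + 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

# The eight largest observations named in the text.
largest = [352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54, 1529.35]
for x in largest:
    print(f"{x:8.2f}  {classify(x)}")
```
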
Boxplots Cont’d.

• When the median is in the middle of the box, and the whiskers are about the same length on both sides of the box, the distribution is symmetric.
• When the median is closer to the bottom of the box, and the whisker is shorter on the lower end of the box, the distribution is positively skewed* (skewed right).
• When the median is closer to the top of the box, and the whisker is shorter on the upper end of the box, the distribution is negatively skewed** (skewed left).

*If your whisker extends out in the direction of the larger numbers, your data are positively skewed.
**If your whisker extends out to the smaller numbers, your data are negatively skewed.



Brainstorming

Do you think boxplots are a good choice for multimodal data?

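Brainstorming

• To see why boxplots are ill-suited for multimodal data, let’s consider an example. Imagine our dataset consisted of these values: 30, 30, 30, 62, 87, 115, 115, 115, 172, 209, 214. In this example, we have two modes: 30 and 115 both occur three times. A boxplot summarizes these values only through the minimum, quartiles, median, and maximum, so it cannot show that the data have two peaks.
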
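A quick computation shows what a boxplot of this dataset would and would not capture. The sketch uses `statistics.multimode` (Python 3.8+) and a median-of-halves quartile convention that excludes the overall median when the count is odd. That convention is an assumption, and other conventions give slightly different quartiles.

```python
from statistics import median, multimode

data = [30, 30, 30, 62, 87, 115, 115, 115, 172, 209, 214]

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    s = sorted(values)
    half = len(s) // 2
    return s[0], median(s[:half]), median(s), median(s[-half:]), s[-1]

print("modes:", multimode(data))
print("five-number summary:", five_number_summary(data))
```

The five numbers are all a boxplot draws, and nothing in them indicates that the values cluster around two separate peaks.
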
Symmetrical Distribution

• The distribution of the heights of males is roughly symmetric and has no skew. The average height of a male in the United States is roughly 69.1 inches, with some males being shorter and others taller.
• Notice that the vertical line inside the box representing the median is equally close to the first and third quartiles, which means the distribution is symmetrical and has no skew.

Right-Skewed Distribution

• The distribution of annual household incomes in the United States is right-skewed. Most households earn between $40k and $80k per year, but there is a long right tail on the distribution representing households earning much more.
• Notice that the vertical line inside the box representing the median is much closer to the first quartile than the third quartile, meaning the distribution is right-skewed.

Left-Skewed Distribution

• The distribution of the age at death in most populations is left-skewed. Most people live to be between 70 and 80 years old, with fewer and fewer dying at each younger age.
• Notice that the vertical line inside the box representing the median is much closer to the third quartile than the first, meaning the distribution is left-skewed.
