PDF Notes

Chapter Two discusses methods for describing data sets, focusing on frequency distributions, measures of center, and measures of dispersion. It explains different types of data, including quantitative and qualitative variables, and introduces graphical methods for visualizing data such as bar charts and histograms. The chapter also covers descriptive statistics, including measures of central tendency like mean, median, and mode, and emphasizes the impact of extreme values on these measures.

Uploaded by

anwilliams2004
Copyright
© © All Rights Reserved

Chapter Two

Methods for Describing Sets of Data

Purpose
 This chapter covers three complementary aspects of data
description: frequency distributions, measures of center,
and measures of dispersion. Together they help us “paint a
picture” of our data by giving us information about its
shape, center, and spread.



Types of Data
 Quantitative (numeric) variables have measurements that are
recorded on a naturally occurring numerical scale.
 Discrete variables arise from a counting process.
 Continuous variables arise from a measuring process.

 Qualitative (categorical) variables have measurements that cannot
be recorded on a natural numerical scale; they can only be classified
into distinct categories.
 A nominal scale classifies data into distinct categories in which no
order or ranking is implied. Nominal categories cannot be ordered!
 An ordinal scale classifies data into distinct categories in which order
or ranking is implied. Ordinal categories can be ordered!


Example: An insurance company evaluates many variables
before deciding on an appropriate rate for automobile insurance.
1. The number of claims the principal driver has made in the last 3 years is a
A. Categorical, ordinal scale
B. Numerical, discrete
C. Numerical, continuous
2. The odometer reading on the car being insured is a
A. Categorical, ordinal scale
B. Numerical, discrete
C. Numerical, continuous
3. The color of the car being insured is a
A. Categorical, ordinal scale
B. Categorical, nominal scale
C. Numerical, continuous

Graphical Methods

Sections 2.1 & 2.2


Visualizing Categorical Data


 Statistical pictures used to visualize data
 One categorical variable
 Frequency Distribution
 Bar Chart
 Pie Chart

Frequency Distribution
 A frequency distribution is a table that displays the number of
occurrences (frequency) of each category or class in a data set.

 Relative frequency = $\frac{\text{class frequency}}{n}$, where $n$ is the total number of observations

Example: Impairment of language ability.

Subject  Type of Aphasia    Subject  Type of Aphasia
1        Broca's            12       Broca's
2        Anomic             13       Anomic
3        Anomic             14       Broca's
4        Conduction         15       Anomic
5        Broca's            16       Anomic
6        Conduction         17       Anomic
7        Conduction         18       Conduction
8        Anomic             19       Broca's
9        Conduction         20       Anomic
10       Anomic             21       Conduction
11       Conduction         22       Anomic

Summary Table (Impairment):

Class        Frequency  Relative frequency  Percentage
Anomic       10         0.455               45.5
Broca's      5          0.227               22.7
Conduction   7          0.318               31.8
TOTALS       22         1.000               100.0
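
The summary table can be rebuilt from the raw subject list. The following is a minimal Python sketch, not part of the original notes; the function name `frequency_table` and the rounding (three decimals for relative frequency, one for percentage) are assumptions chosen to match the table's layout.

```python
from collections import Counter

# Type of aphasia for subjects 1-22, transcribed from the example data.
aphasia = [
    "Broca's", "Anomic", "Anomic", "Conduction", "Broca's", "Conduction",
    "Conduction", "Anomic", "Conduction", "Anomic", "Conduction",
    "Broca's", "Anomic", "Broca's", "Anomic", "Anomic", "Anomic",
    "Conduction", "Broca's", "Anomic", "Conduction", "Anomic",
]


def frequency_table(values):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(values)
    counts = Counter(values)
    return {
        cls: (freq, round(freq / n, 3), round(100 * freq / n, 1))
        for cls, freq in sorted(counts.items())
    }


table = frequency_table(aphasia)
for cls, (freq, rel, pct) in table.items():
    # One row per class: name, frequency, relative frequency, percentage.
    print(f"{cls:<11}{freq:>3}{rel:>7.3f}{pct:>6.1f}")
```

Each relative frequency is the class frequency divided by the 22 observations, so the relative frequencies sum to 1 up to rounding.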


Bar Chart
Bar chart – a series of bars, with each bar representing the class
frequency, class relative frequency, or class percentage.
• Can be used for two or three variables simultaneously

Example: Impairment of language ability.

[Figure: Bar Chart of Type. x-axis: Type (Anomic, Broca's, Conduction); y-axis: Percentage (0 to 100). Bar heights from the summary table: Anomic 45.5, Broca's 22.7, Conduction 31.8.]

Pie Chart
Pie chart – uses sections of a circle to represent the class
frequency/class relative frequency/class percentage.

Example: Impairment of language ability.

[Figure: Pie Chart of Type. Slices: Anomic 45.5%, Broca's 22.7%, Conduction 31.8%.]


Visualizing Numeric Data


 Statistical pictures used to visualize data
 One numeric variable
 Dotplot
 Stem-and-leaf plot
 Histogram


Dotplot
 A dotplot is a graph that is used to show the distribution of a
numeric variable when the sample size is small.

Example: A group of thirty-six 2-year-old sows of the same breed were
bred to Yorkshire boars. The number of piglets surviving to 21 days of
age was recorded for each sow.


Histogram
 A histogram is a graphical display that results when we
replace the dots of a dotplot with bars.
 In a histogram the bars usually touch. A gap between bars is not
arbitrary, as it is in a bar chart: it means no observations fall in
that interval.
Example: A group of thirty-six 2-year-old sows of the same breed
were bred to Yorkshire boars. The number of piglets surviving to 21
days of age was recorded for each sow.


Example: Serum CK. Creatine phosphokinase (CK) is an enzyme related
to muscle and brain function. As part of a study to determine the
natural variation in CK concentration, blood was drawn from 36 male
volunteers. Their serum concentrations of CK (measured in U/l) are
given in Table 2.2.6.


Example CK Serum:

[Figure: two displays of the serum CK data, one drawn with 25 classes and one with 5 classes]


Describing the Shape of a Histogram


Modality, symmetry, and skew.

Mode: Peak/peaks of the histogram
• Unimodal  One peak
• Bimodal  Two peaks
• Multimodal  Two or more peaks

Tails: The distribution is
• Left-skewed  left tail is longer than the right tail.
• Right-skewed  left tail is shorter than the right tail.
• Symmetric if the left and right tails are approximately equal
(mirror images); if this is NOT the case, it is asymmetric.


How would we describe the shape of the distribution?

[Figures: five example distributions, one per slide]

Boxplots

Section 2.7 & 2.6


Terminology
 PERCENTILE: the pth percentile is a value such that p% of
the observations fall at or below that value and (100-p)% fall
at or above that value.

 QUARTILES (Q) divide the distribution into four parts.
 Q1 (Q(.25)) divides the lower 25% from the upper 75% of the
distribution.
 Q2 divides the lower 50% of the distribution from the upper 50%
of the distribution (the median of the entire data set).
 Q3 (Q(.75)) divides the lower 75% from the upper 25% of the
distribution.


Terminology
 INTERQUARTILE RANGE: describes the middle 50% of the data.
 Robust measure of variability (resistant to extreme values)
 IQR = Q3 - Q1

 FIVE-NUMBER SUMMARY includes:
Minimum, Q1, Median (Q2), Q3, Maximum


Terminology
 Outlier: a data point that differs substantially from the rest of
the data.

 A data point is an outlier if it falls outside the fences:
 Data point < Lower Fence = $Q_1 - 1.5 \times IQR$
 Data point > Upper Fence = $Q_3 + 1.5 \times IQR$

 An asterisk (*) or a dot represents an outlier on a boxplot.

STAT 205 26


 Example: The pulse rates (beats/min) of 12 college students
were measured. Here are the data arranged in order:
62 64 68 70 70 74 74 76 76 78 78 80

Find the five-number summary and IQR.
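
One way to check the hand calculation is the short Python sketch below. It is an illustration, not the notes' own method: it assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data, a convention that agrees with the values Q1 = 69 and Q3 = 77 used for this example in the notes. Statistical software often uses other quartile conventions and may give slightly different quartiles. The function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of
    the sorted data; the middle value is left out of both halves when n is odd.
    Requires at least two observations.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]
minimum, q1, med, q3, maximum = five_number_summary(pulse)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)
```

With 12 observations, each half holds six values, so Q1 is the average of the 3rd and 4th values and Q3 is the average of the 9th and 10th.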


Boxplot for Data with No Outliers


A boxplot is a graph of the 5-number summary.
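[Figure: boxplot with Minimum, Q1, Median, Q3, and Maximum marked. Each of the four segments holds 25% of the data, and the box from Q1 to Q3 spans the IQR.]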


Constructing a Boxplot with Outliers

Upper inner fence = Q3 + 1.5 (IQR)
Lower inner fence = Q1 - 1.5 (IQR)

If there are outliers, the whisker is drawn to the smallest or largest
value that is not an outlier, and a special character is drawn to
denote the outliers. (See page 50 of the text)

[Figure: boxplot with Q1, Q2, Q3, and the IQR labeled]


 Example: The pulses of 12 college students were measured.
Here are the data arranged in order:
62 64 68 70 70 74 74 76 76 78 78 80

Are there any outliers?
Data point < Lower Fence = $Q_1 - 1.5 \times IQR$
Data point > Upper Fence = $Q_3 + 1.5 \times IQR$

$Q_1 = \frac{68 + 70}{2} = 69$    $Q_3 = \frac{76 + 78}{2} = 77$
$IQR = 77 - 69 = 8$
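
The fence check can be sketched in a few lines of Python. This is an illustration only: it takes Q1 = 69 and Q3 = 77 from the example as given, and the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]

# Q1 and Q3 are the values stated in the example, not recomputed here.
lower, upper = outlier_fences(69, 77)

# A data point is flagged only if it lies strictly outside the fences.
outliers = [y for y in pulse if y < lower or y > upper]
print(lower, upper, outliers)
```

The fences work out to 69 - 12 = 57 and 77 + 12 = 89. Every pulse rate lies between them, so no point is flagged as an outlier.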


Boxplot from R

[Figure: R boxplot of the pulse data with the IQR marked]

5-number summary: 62, 69, 74, 77, 80

Box Plot


Distribution Shape and The Boxplot

[Figure: three boxplots labeled Left-Skewed, Symmetric, and Right-Skewed, each with Q1, Q2, and Q3 marked]


DESCRIPTIVE STATISTICS:
MEASURES OF CENTER
Section 2.3


Definitions
 Statistic: a numerical measure that is calculated from
the sample data.

 Parameter: a numerical measure that is calculated from
the population data.


Measures of Central Tendency


 Measures of center are used to describe the center or
location of the data.

 Three commonly used measures


 Mean
 Median
 Mode


Mean
Mean of a variable is computed by determining the sum of all the values of
the variable in the data set divided by the number of observations.

The sample mean ($\bar{y}$) for a sample of size $n$ is

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n}$$

where
$y_i$ is the $i$th value of variable Y
$n$ is the sample size

*Population mean is denoted by 𝜇.


Median
Median: the middle value of the data set. (At most 50% of data is
greater than M and at most 50% of data is less than M)

Steps to calculate M:
o Order the n data values from smallest to largest.
o The observation in position $\frac{n+1}{2}$ in the ordered list is the median M.
o If $\frac{n+1}{2}$ is not a whole number, the median is the average of the
two middle observations.


Mode
 The mode of a variable is the observation that occurs most
frequently in the data set.

 If no observation occurs more frequently than the others,
we say the data have no mode.

 Two modes  bimodal


Weight Gain of Lambs: The following are the 2-week weight
gains (lb) of six young lambs of the same breed that had been
raised on the same diet:
As recorded: 11 13 19 2 10 1
Sorted:      1 2 10 11 13 19

Find the mean, median, and mode of this dataset (by hand).
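
A hand calculation can be checked with the Python sketch below. It is an illustration under one assumption: the helper `modes` (a hypothetical name) reports "no mode" as an empty list when every distinct value occurs equally often, following the definition of the mode given in these notes.

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return the most frequent value(s), or [] when the data have no mode.

    Assumption: if there are several distinct values and they all occur
    equally often, no value is "most frequent", so there is no mode.
    """
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


gains = [11, 13, 19, 2, 10, 1]  # 2-week weight gains (lb)

print(mean(gains))    # sum of the values divided by n = 6
print(median(gains))  # average of the two middle values of the sorted data
print(modes(gains))   # empty list: every value occurs exactly once
```

With n = 6, the position (n + 1)/2 = 3.5 is not a whole number, so the median is the average of the 3rd and 4th sorted values, 10 and 11.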


What if… Extreme Values

Weight Gain of Lambs: The following are the 2-week weight gains (lb) of
six young lambs of the same breed that had been raised on the same diet:

1 2 10 11 13 19

• What if we add an observation of 100 pounds?

1 2 10 11 13 19 100

• ONE extreme value changed the mean by about 12.95 (from 9.33 to 22.29),
while the median moved only from 10.5 to 11.
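
The effect of the added observation can be measured directly. The sketch below is an illustration using Python's standard `statistics` module; the variable names are hypothetical.

```python
from statistics import mean, median

gains = [1, 2, 10, 11, 13, 19]   # original six weight gains (lb)
with_extreme = gains + [100]     # same data plus the 100 lb observation

# How far each measure of center moves when the extreme value is added.
mean_shift = mean(with_extreme) - mean(gains)
median_shift = median(with_extreme) - median(gains)

print(round(mean_shift, 2), median_shift)
```

Comparing the two shifts shows why the median is described as the more resistant measure: it depends only on the middle of the ordered data, while the mean uses every value.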


Extreme Values
 The MEAN is STRONGLY AFFECTED by extreme values.

 The MEDIAN is less sensitive than the mean to extreme values.
 Because the median is not affected by large outlying values
as much as the mean, we say it is robust.
 Extreme values pull the mean toward the longer tail, that is,
in the direction of the skew.


Shapes of Distributions


Which To Use?
The most appropriate measure of central tendency depends
on the data set:

 Approximately symmetric and unimodal 

 Skewed 

 Categorical


MEASURES OF DISPERSION
Sections 2.4 & 2.6


Measures of Variation
 Measures of dispersion give us an idea about the
spread of a distribution: are the observations all
nearly equal, or do they differ substantially from each
other?

 Measures of Dispersion
 Range
 Standard deviation & Variance
 IQR


Range
 Simplest measure of variation.
 RANGE = largest value – smallest value
 Does not consider how the values cluster or distribute between the
extremes.
Example: The data below represents the waiting time at a local urban
outpatient facility. Waiting time is measured from the time when the patient
registered to the time when he or she received the care service. Data was
collected for a sample of 10 patients. Determine the range.

Values 29 31 35 39 39 40 43 44 44 52
Ranks 1 2 3 4 5 6 7 8 9 10


Variance & Standard deviation

 Common measure of the spread of values in a distribution.
 Shows variation about the mean.

The sample variance ($s^2$) is

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$

The sample standard deviation ($s$) is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$

**$s$ is measured in the same units as the data.

where,
$\bar{y}$ is the sample mean
$y_i$ is the $i$th value of variable Y
$n$ is the sample size
deviation $= y_i - \bar{y}$ (difference between the observation and the sample mean)


Example: Standard Deviation

The data below represents the waiting time at a local urban outpatient
facility. Waiting time is measured from the time when the patient registered
to the time when he or she received the care service. Data was collected for
a sample of 10 patients. Compute the sample standard deviation.

$\bar{y} = 39.6$    $n = 10$

$y_i$   $y_i - \bar{y}$          $(y_i - \bar{y})^2$
39      39 − 39.6 = −0.6         (−0.6)^2 = 0.36
29      29 − 39.6 = −10.6        (−10.6)^2 = 112.36
43      43 − 39.6 = 3.4          (3.4)^2 = 11.56
52      52 − 39.6 = 12.4         (12.4)^2 = 153.76
39      39 − 39.6 = −0.6         (−0.6)^2 = 0.36
44      44 − 39.6 = 4.4          (4.4)^2 = 19.36
40      40 − 39.6 = 0.4          (0.4)^2 = 0.16
31      31 − 39.6 = −8.6         (−8.6)^2 = 73.96
44      44 − 39.6 = 4.4          (4.4)^2 = 19.36
35      35 − 39.6 = −4.6         (−4.6)^2 = 21.16
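
The remaining steps (sum the squared deviations, divide by n − 1, take the square root) can be sketched in Python as follows. This is an illustrative check of the table, not part of the original notes; variable names are hypothetical.

```python
from math import sqrt

# Waiting times for the 10 patients, in the order used in the table.
waits = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

n = len(waits)
y_bar = sum(waits) / n                             # sample mean
squared_devs = [(y - y_bar) ** 2 for y in waits]   # (y_i - y_bar)^2 for each row
variance = sum(squared_devs) / (n - 1)             # sample variance s^2
s = sqrt(variance)                                 # sample standard deviation s

print(y_bar, round(sum(squared_devs), 2), round(variance, 2), round(s, 2))
```

Dividing by n − 1 rather than n follows the sample variance formula in these notes. Python's `statistics.variance` and `statistics.stdev` use the same divisor and can serve as a cross-check.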


INTERQUARTILE RANGE:
 Describes the middle 50% of data.
 IQR = Q3 - Q1

Example: Data was collected for a sample of 10 patients.


Compute the IQR.
Values 29 31 35 39 39 40 43 44 44 52
Ranks 1 2 3 4 5 6 7 8 9 10
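
A worked version of this calculation is sketched below. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data (ranks 1 to 5 and ranks 6 to 10), the same convention that gives Q1 = 69 and Q3 = 77 in the pulse example. Other quartile conventions, including some software defaults, can give slightly different values.

```latex
\begin{align*}
Q_1 &= \text{median}(29,\ 31,\ 35,\ 39,\ 39) = 35 \\
Q_3 &= \text{median}(40,\ 43,\ 44,\ 44,\ 52) = 44 \\
IQR &= Q_3 - Q_1 = 44 - 35 = 9
\end{align*}
```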


Extreme Values
 Range and Standard Deviation are AFFECTED by
extreme values

 IQR is less sensitive than the range and standard


deviation to extreme values.
 Because the IQR is not affected by large outlying values,
we say it is robust.


The Empirical Rule

 For unimodal, approximately symmetric distributions (think
bell-shaped), we are able to use the Empirical Rule:

• About 68% of observations are within one standard deviation
of the mean (in either direction): $\bar{y} \pm 1s$
• About 95% of observations are within two standard deviations
of the mean (in either direction): $\bar{y} \pm 2s$
• About 99.7% of observations are within three standard
deviations of the mean (in either direction): $\bar{y} \pm 3s$


Illustration of the Empirical Rule


Example
The Health and Nutrition Examination Study of 1976-1980 (HANES)
studied the heights of adults (aged 18-24). The distribution of heights
is bell-shaped, with
Women: mean ($\bar{y}$) 65.0 inches, standard deviation (s) 2.5 inches
Men: mean ($\bar{y}$) 70.0 inches, standard deviation (s) 2.8 inches

Find the intervals of the Empirical Rule for the men.

Approximately 68%:

Approximately 95%:

Approximately 99.7%:

[Figure: bell curve of men's heights with x-axis ticks at 61.6, 64.4, 67.2, 70, 72.8, 75.6, 78.4]
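
The three intervals are the mean plus or minus 1, 2, and 3 standard deviations. The Python sketch below is an illustration only; the function name `empirical_rule_intervals` is hypothetical, and rounding to one decimal place is an assumption made to match the precision of the given mean and standard deviation.

```python
def empirical_rule_intervals(mean, sd):
    """Return {coverage: (low, high)} for mean ± 1, 2, and 3 standard deviations."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        label: (round(mean - k * sd, 1), round(mean + k * sd, 1))
        for k, label in coverage.items()
    }


# Men's heights from the example: mean 70.0 inches, s = 2.8 inches.
men = empirical_rule_intervals(70.0, 2.8)
for label, (low, high) in men.items():
    print(f"About {label}: {low} to {high} inches")
```

The interval endpoints coincide with the tick marks on the height axis, which are spaced one standard deviation (2.8 inches) apart around the mean of 70.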


Summary
 The End!!
