Lecture 2-Summarizing Data - HSciences Biostats - 010232en

Biostatics Lecture UPNG SMHS


SUMMARIZING DATA-

MEASURES OF CENTRAL
TENDENCY: PART A

Elias Namosha
Division of Public Health, SMHS-UPNG
Introduction to Biostatistics
01st March, 2022
OBJECTIVES
Given a set of data, you should be able to:

• Choose an appropriate measure of central location (mean, median, mode).
• Calculate the MEAN.
• Identify and use the MEDIAN and MODE.

These measures describe the location of data and are mostly used in descriptive statistics.
MEASURE OF CENTRAL
LOCATION
Definition: a single value that represents
an entire frequency distribution.

Also known as:
• “Measure of the center”
• “Measure of central tendency”

When we talk about measures of central tendency, what we are really trying to do is describe the middle, or midpoint, of a data distribution.

The aim is to find a value that conveys information about an entire frequency distribution.
MEAN

$$\text{Mean} = \frac{\text{Sum of values for each member of a given population}}{\text{Number of population members}}$$

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

or simply

$$\mu = \frac{\sum X}{N}$$

This is the “average” (e.g., height) of all the members.

1. MEAN
Method for identification
1. Sum up all of the values
2. Divide the sum by n
Definition: the “average” (center of gravity)
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Sum = 360; n = 30
Mean = 360 / 30 = ?
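
The two steps above (sum the values, then divide by n) can be sketched in a few lines of Python. This is only an illustration; the variable names are assumptions and are not part of the lecture.

```python
# Length-of-stay values (days) copied from the slide above.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

total = sum(length_of_stay)   # Step 1: sum up all of the values
n = len(length_of_stay)       # number of observations
mean = total / n              # Step 2: divide the sum by n

print(f"Sum = {total}; n = {n}; Mean = {mean}")
```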
MEAN – PROPERTIES / USES

• Probably the most common measure of central location
• Uses all of the data
• Affected by extreme values (outliers)
• Best for normally distributed data
• Not usually equal to one of the original values
• Good statistical properties
2. MEDIAN

Definition: the middle value

Method for identification:
1. Arrange observations in order
2. Find middle rank as (n + 1) / 2
3. Identify the value at the middle rank; if the rank falls between two positions, take the average of the two adjacent values
LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,

10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18,

19, 22, 27, 49

What is the median for this data set??


LENGTH OF STAY DATA
n = 30
Median rank = (30 + 1) / 2 = 15.5, i.e., between the 15th
and 16th positions
Value at 15th position = 10
Value at 16th position = 10
So median = (10 + 10) / 2 = 10

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, M 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
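
As a rough sketch, the same procedure (order the data, find the middle rank, read off the middle value) might look like this in Python. The variable names are illustrative assumptions, not part of the lecture.

```python
# Length-of-stay values (days) from the slide above.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

ordered = sorted(length_of_stay)   # Step 1: arrange observations in order
n = len(ordered)
middle_rank = (n + 1) / 2          # Step 2: middle rank (1-based)

# Step 3: identify the value at the middle rank.
if n % 2 == 1:
    # Odd n: the middle rank is a whole number, so the median is an original value.
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: the rank falls between two positions, so average the two neighbours.
    lower = ordered[n // 2 - 1]    # 15th position when n = 30
    upper = ordered[n // 2]        # 16th position when n = 30
    median = (lower + upper) / 2

print(f"n = {n}; middle rank = {middle_rank}; median = {median}")
```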
MEDIAN – PROPERTIES / USES

• Does not use all the data available
• Insensitive to extreme values (outliers)
• Poor statistical properties
• Measure of choice for skewed data
• Equals an original value if n is odd

Medians do not use all the data available and are thus insensitive to extreme values. The median is the preferred measure of central tendency for skewed data.
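
One way to see this insensitivity is to recompute both measures on the length-of-stay data with and without its largest value (49 days). The sketch below uses Python's standard `statistics` module; dropping the 49 is only an illustration and is not something the lecture does.

```python
import statistics

# Length-of-stay values (days) from the lecture's example.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

# Illustration only: the same data with the extreme value (49) left out.
without_outlier = [x for x in length_of_stay if x != 49]

mean_with = statistics.mean(length_of_stay)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(length_of_stay)
median_without = statistics.median(without_outlier)

print(f"Mean:   {mean_with:.2f} with outlier, {mean_without:.2f} without")
print(f"Median: {median_with} with outlier, {median_without} without")
```

The mean shifts when the extreme value is removed, while the median stays where it was.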
3. MODE: METHOD FOR
IDENTIFICATION
Definition: the value that occurs most frequently

1a. Arrange data into a frequency distribution, showing the values of the variable and the frequency with which each value occurs.

1b. Alternatively, arrange raw data in ascending order.

2. Identify the value that occurs most often.

The first measure used to describe central tendency is the simplest – the mode
LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,

10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18,

19, 22, 27, 49

Identify the value that occurs most often in this dataset.

Another way to understand a data distribution is to depict the
values graphically, with the number of observations on
the y axis and the values of the variable on the x axis. In such a
graph, it is immediately apparent that the mode is 10.
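
The frequency-distribution method (step 1a, then step 2) can be sketched with `collections.Counter` from the Python standard library. The variable names are illustrative assumptions.

```python
from collections import Counter

# Length-of-stay values (days) from the slide above.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

# Step 1a: frequency distribution (value -> number of occurrences).
frequency = Counter(length_of_stay)

# Step 2: the value(s) with the highest frequency.
# A list is used because a dataset may have more than one mode.
highest_count = max(frequency.values())
modes = sorted(value for value, count in frequency.items() if count == highest_count)

for value in sorted(frequency):
    print(f"{value:>2} | {'*' * frequency[value]}")   # crude text histogram
print(f"Mode(s): {modes}, occurring {highest_count} times")
```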
MODE – PROPERTIES / USES

• Easiest measure to understand, explain, identify
• Always equals an original value
• Insensitive to extreme values (outliers)
• Poor statistical properties
• May be more than one mode
• Does not use all the data

The mode is the easiest measure of central tendency to identify, explain and understand but, unfortunately, it is also the least valuable.
COMPARISON OF MODE,
MEDIAN AND MEAN
• Mode – most common value
• Median – central value
• Arithmetic mean – average value
• Mean uses all data, so sensitive to outliers
• Mean has best statistical properties
• Mean preferred for normally distributed data
• Median preferred for skewed data
NORMAL CURVE

• Here’s an example of a normal curve or normal distribution.
• In a normal curve, the mean, median and mode are all the same (same value).
THREE CURVES WITH DIFFERENT
SKEWING

• Most curves are not perfectly normal. They exhibit some degree of skewing.
• The mean, median and mode are all different in a skewed curve.

SUMMARIZING DATA-
MEASURES OF
DISPERSION (SPREAD):

PART B
OBJECTIVES
Describe the following measures of
spread/dispersion:
– Range
– interquartile range
– variance
– standard deviation
MEASURES OF
VARIATION
Definition: quantify the variation or dispersion
or spread of a set of data from its central
location
Also known as:
• “Measure of dispersion”
• “Measure of spread”
Common measures
• Range
• Interquartile range
• Variance / standard deviation
• Standard error
• 95% CI
RANGE
Properties / Uses
• 2 values or 1? (It can be reported as the minimum and maximum, or as the single difference between them.)
• Greatly affected by outliers
• Usually used with median

Definition: difference between largest and smallest values
Range
2 4 20
3 49 22
12 10 11
5 0 18
27 10 18
6 5 13
7 9 14
8 10 9
9 10 12
12 16
=MIN(A1:C10) =MAX(A1:C10)
What is the range of this dataset??
Range

Length of hospital stay for pneumonia

MIN: 0
MAX: 49
MODE: 10
MEDIAN: 10
MEAN: 12

Have a look at the dataset above.
What is the range of length of stay?

• This graph gives us a great visual representation of the spread of the data.
• However, statisticians and epidemiologists tend to like numbers, so how do we describe this ‘spread’ of data with numbers?
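
The simplest numerical answer is the range. A minimal sketch in Python, using the same length-of-stay values (the variable names are assumptions):

```python
# Length-of-stay values (days) for the pneumonia example.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

smallest = min(length_of_stay)       # same idea as the spreadsheet MIN formula
largest = max(length_of_stay)        # same idea as the spreadsheet MAX formula
data_range = largest - smallest      # range = largest value minus smallest value

print(f"MIN = {smallest}; MAX = {largest}; Range = {data_range}")
```

Because the range depends only on the two most extreme observations, the single 49-day stay determines it entirely.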
INTERQUARTILE
RANGE
Properties / Uses
• Used with median
• Five-number summary for box-and-whisker diagram:
– Maximum (100%, largest value)
– Third quartile (75%)
– Median (50%)
– First quartile (25%)
– Minimum (0%, smallest value)
Definition: the central 50% of a distribution
THE MIDDLE HALF OF THE OBSERVATIONS
IN A FREQUENCY DISTRIBUTION LIE
WITHIN THE INTERQUARTILE RANGE

The white space under the curve represents the interquartile range in this graphic.
Length of stay data
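
As a hedged sketch, the five-number summary and interquartile range for the length-of-stay data can be computed with `statistics.quantiles` from the Python standard library. Several conventions exist for locating quartiles; this example assumes the library's default "exclusive" method, which places quartile i at rank i(n + 1)/4, so other software or a hand calculation may give slightly different quartiles.

```python
import statistics

# Length-of-stay values (days) from the lecture's example.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
# Assumption: default "exclusive" method; other quartile rules differ slightly.
q1, q2, q3 = statistics.quantiles(length_of_stay, n=4)

five_number_summary = {
    "minimum (0%)": min(length_of_stay),
    "first quartile (25%)": q1,
    "median (50%)": q2,
    "third quartile (75%)": q3,
    "maximum (100%)": max(length_of_stay),
}
iqr = q3 - q1   # width of the central 50% of the distribution

for label, value in five_number_summary.items():
    print(f"{label}: {value}")
print(f"Interquartile range: {iqr}")
```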
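MEASURES OF
VARIABILITY/ SPREAD
• Units of variance are the square of the units of the variable of interest.
• It is more common to present the square root of variance = standard deviation

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} \quad \text{or, for a sample,} \quad \sigma^2 = \frac{\sum (X - \mu)^2}{N - 1}$$

$$SD = \sqrt{\frac{\sum (X - \mu)^2}{N - 1}}$$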
VARIANCE AND STANDARD
DEVIATION
• Variance = average of the squared deviations from the mean: Sum (x – mean)² / n
• Standard deviation is simply the square root of variance
• Standard deviation is a measure of variation that quantifies how closely clustered the observed values are to the mean
• Standard deviation is a measure of how spread out the numbers are; it is usually given the Greek symbol sigma ‘σ’
Variance is the sum of the squared differences between each
observation and the mean, divided by the
number of observations. Standard deviation is the
square root of variance. The smaller the variance or
standard deviation, the more tightly clumped the data are.
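
To make the definition concrete, here is a small Python sketch that follows it step by step for the length-of-stay data. It shows both divisors used in the lecture's formulas (n for a whole population, n − 1 for a sample); the variable names are assumptions.

```python
import math

# Length-of-stay values (days) from the lecture's example.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

n = len(length_of_stay)
mean = sum(length_of_stay) / n

# Squared difference between each observation and the mean, then summed.
sum_of_squares = sum((x - mean) ** 2 for x in length_of_stay)

population_variance = sum_of_squares / n        # divide by n
sample_variance = sum_of_squares / (n - 1)      # divide by n - 1 for a sample

# Standard deviation is the square root of the variance (back in days).
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"Sum of squared deviations = {sum_of_squares}")
print(f"Variance: {population_variance:.2f} (n), {sample_variance:.2f} (n - 1)")
print(f"SD:       {population_sd:.2f} (n), {sample_sd:.2f} (n - 1)")
```

Note that the variance is in squared units (days²), which is one reason the standard deviation is usually the figure reported.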
STANDARD DEVIATION –
PROPERTIES / USES

Standard deviation is usually calculated only when data are more or less normally distributed (bell-shaped curve).

For normally distributed data,
• 68.3% of the data fall within plus/minus 1 SD
• 95.4% of the data fall within plus/minus 2 SD
• 99.7% of the data fall within plus/minus 3 SD
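
These percentages come from the normal distribution itself. For a normal variable, the proportion within k standard deviations of the mean equals erf(k / √2), which can be checked with Python's `math.erf`. The helper name `fraction_within` is an assumption made for this sketch.

```python
import math

def fraction_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within plus/minus {k} SD: {fraction_within(k) * 100:.2f}%")
```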
AREAS UNDER THE NORMAL CURVE THAT
LIE BETWEEN 1, 2, AND 3 STANDARD
DEVIATIONS ON EACH SIDE OF THE MEAN

In a normal distribution, about 95% of data values are contained within the mean plus or minus two SDs.
Don’t worry about the math and the formula here; focus on the concept.
SUMMARY
Mode – simple, not always useful

Median – best for skewed data

Arithmetic mean – best for normally distributed data

Geometric mean – use for lab titers. (The geometric mean is different from the regular (arithmetic) mean; it is not a simple average. Lab titers are tests that measure the presence and amount of antibodies in blood.)
SUMMARY
• Range – use with median
• Standard deviation – use with mean. (Standard deviation shows how much individuals within the same sample differ from the sample mean.)
• Standard error – used to construct confidence intervals. (Standard error shows how close your sample mean is likely to be to the population mean.)

This also means that the standard error should decrease as the sample size increases, because the estimate of the population mean improves. Standard deviation is not systematically affected by sample size.
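
The lecture does not give a formula for the standard error; the usual one for a sample mean is SE = SD / √n, which is what this sketch assumes. It computes the SE for the length-of-stay data and then shows how the SE would shrink for larger, purely hypothetical sample sizes if the SD stayed the same.

```python
import math
import statistics

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean, assuming the usual formula SD / sqrt(n)."""
    return sd / math.sqrt(n)

# Length-of-stay values (days) from the lecture's example.
length_of_stay = [
    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49,
]

sample_sd = statistics.stdev(length_of_stay)   # sample SD (n - 1 divisor)
se_observed = standard_error(sample_sd, len(length_of_stay))
print(f"SD = {sample_sd:.2f}; SE with n = {len(length_of_stay)}: {se_observed:.2f}")

# Hypothetical sample sizes, holding the SD fixed, to show the trend only.
hypothetical_sizes = [30, 120, 480]
se_by_size = [standard_error(sample_sd, n) for n in hypothetical_sizes]
for n, se in zip(hypothetical_sizes, se_by_size):
    print(f"n = {n:>3}: SE = {se:.2f}")
```

Quadrupling the sample size halves the standard error, while the standard deviation describes the spread of individuals and does not shrink in this way.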
END OF
PRESENTATION
THANK YOU!
