Descriptive Statistics Alp2019
10/16/2019 2
What is statistics?
• Collection of data
(1) Explanatory, predictors, covariates, independent variables;
(2) response, outcome, dependent variables; (3) intermediate variables.
• Management of data
Coding, editing, organization and storage.
• Analysis of data
Statistical modelling and applying statistical techniques.
• Interpretation of data
Statistical significance and/or clinical significance.
• Communication of data
Using numbers, tables and graphs to explain the data with reference to
context knowledge.
The purposes of data collection and analyses
• Explanation
To describe and explain the relationship between the explanatory
variables and the outcome. It may be causal or non causal.
• Prediction
To predict the outcome from the predictor variables using a
prediction model.
• Control
To manipulate input variables (such as treatment) and observe the
output variables (such as response).
Objectives
• To learn how to use statistical computing packages to analyze data
• The rationale is that statistical computing packages are now readily
available, and commonly used statistical methods have become more
accessible and have been modularized.
• To understand the core statistical concepts behind every data analysis
• Namely, statistical modelling, sampling distributions of statistics, and
statistical inference (estimation and hypothesis testing).
Characteristics of a defined population
Who
Male/female, white/black, children/adolescent/adult,…
When
Calendar year, year of birth, …
Where
Country, city, school, hospital, …
What
Socioeconomic position, educational level,…
Disorders and diseases of clinical populations
Data type and level of measurement
• Data type
• Categorical data such as gender, blood type, attitude, opinion, etc.
• Metric data such as number of children, number of accidents, weight, height,
temperature, IQ, etc.
• Level of measurement
• 1. Categorical data
• Nominal: naming or labelling, such as sex
• Ordinal: naming plus ordering, but the distance between categories cannot be
measured, such as severity, Likert scale, semantic differential scale, rating scale
• 2. Continuous data
• Interval: a scale that has a unit but no absolute zero, such as temperature
(in Celsius or Fahrenheit) and IQ. For example, you know that 50 degrees and
25 degrees differ by 25 degrees, but you cannot say that 50 degrees is twice as
hot as 25 degrees.
• Ratio: a scale that has a unit and an absolute zero, such as weight and height.
Level of Measurement (Categorical)
1. Nominal Scales: naming or labelling, such as sex, ethnicity,
religion, marital status, region, etc. The categories have no
meaningful order. (Example: Gender = Male: 0, Female: 1)
2. Ordinal Scales: naming plus ordering; the numerical order is
meaningful but the distance between categories cannot be measured.
(Examples: disease severity, Likert scale, semantic differential
scale, rating scale; health status 1 = Excellent, 2 = Good,
3 = Fair, 4 = Poor, 5 = Very Poor)
Level of Measurement (Continuous)
1. Interval Scales: meaningful numerical order and meaningful
intervals between values, but the scale does not have an
absolute zero, such as temperature, IQ, SAT scores, GRE scores, etc.
You know that 50 degrees and 25 degrees differ by 25 degrees, but
you cannot say that 50 degrees is twice as hot as 25 degrees.
2. Ratio Scales: meaningful numerical order, meaningful intervals
between values, and an absolute zero that is not arbitrary but
determined by nature (Examples: weight (starting from 0 kg, not
from -5 kg), height, blood pressure, pulse rate, age, income,
number of children)
General Description
Title:
Predictors of Postpartum Depression among Rural Women in Minia, Egypt: An
Epidemiological Study.
Purpose:
To determine the prevalence of postpartum depression (PPD) in a certain
rural area in Upper Egypt, and to determine the risk factors of PPD.
Key Variable
• Dependent Variable
• Postpartum depression (PPD) is a form of clinical depression that can affect women
and, less frequently, men after childbirth. In this research, PPD refers to the
several mood disorders that follow childbirth in married women. It is assessed with
the EPDS (Edinburgh Postpartum Depression Screening), scored on a four-point scale (0-3).
• The research does not describe the scale range. It would be better to describe it,
for example 0 for disagree and 3 for completely agree.
• The research does not state how the cut point for the postpartum depression
category was chosen.
Key Variable
Independent Variable
Variable group and data collected:
• Demographic: age, woman’s and husband’s education and occupation, total household income.
• Data related to delivery: type of delivery, assistance of delivery, personnel attending
the delivery, place of delivery, complication after delivery, method of contraception.
• Data related to child: rank of birth, age, sex, weight at birth, breast feeding, sleeping habits.
• Data related to pregnancy: parity, pregnancy weight gain, previous diagnosis of depression,
financial problem after delivery, complications after delivery, support of family and
friends after delivery, support of husband after delivery, victim of domestic violence.
• Previous history of depression
Methods
Cross-sectional study with a community-based approach.
The study was conducted in El-Burgaia village, 5 km north of El-Minia
city, over a period of three months, between December 1st 2011 and
February 29th 2012.
The sample was selected using a systematic random sampling technique from
women who had given birth within 14 months; the sample size was 200
women.
Descriptive analysis was used to describe demographic data, data
related to delivery, data related to child, data related to pregnancy,
and previous history of depression, presented as mean and
standard deviation.
Data Type
Categorical Data
Level of Measurement
Pick one!
Ordinal
Nominal
Graphical Displays of Data
Strength
The pie chart has no more
than six sectors.
It shows percentages rather
than absolute frequencies.
Drawbacks
It uses 3D key shading.
It is better to use 2D shading patterns,
so that the patterns of the pie chart do
not detract from the meaning of the
chart itself (Wallgren et al., 1996).
Title and source:
Nurses’ Knowledge about Palliative Care.
Journal of Hospice & Palliative Nursing, 16(1), 23-30.
Purpose:
The aim of the study is to evaluate palliative care
knowledge among nurses in Jordan.
Key variable
Dependent variable
• Palliative care knowledge is a cognitive understanding
toward palliative care including philosophy and
principles of palliative care; managing of pain and other
symptoms; and psychosocial and spiritual care to
individuals and families. It is measured by Palliative Care
Quiz for Nurses (PCQN) (M Ross, McDonald, & McGuinness, 1996).
Key variable
Independent variables
Demographic characteristics
Research methods
A quantitative descriptive cross-sectional survey design with
convenience sampling.
Categorical data
Level of measurement
Any comment?
Gender
Some studies have reported that few men currently choose to become
nurses because nursing is perceived as a feminine profession
(Al-Zein & Al-Khawaldeh, 2015; Ashkenazi, Livshiz-Riven, Romem, &
Grinstein-Cohen, 2016).
Suggestion
• The original PCQN article explains that a higher score indicates a
better level of knowledge (Ross, McDonald, & McGuinness, 1996).
The authors would do better to show the total mean PCQN score in a
table to determine palliative care knowledge among nurses.
• The authors would do better to visualize the number of participants
in each hospital in a bar chart or pie chart, using different colours
so that it is clear and attractive. A chart is easier to read and
more understandable (Plichta & Kelvin, 2013).
General Description
1. Title
Cross-sectional study of patients with type 2
diabetes in OR Tambo district, South Africa
2. Purpose
To examine the sociodemographic and clinical
determinants of uncontrolled type 2 diabetes
mellitus (T2DM) in individuals attending primary
healthcare in OR Tambo district, South Africa
Key Variables
Independent Variable
Sociodemographic characteristics: gender, type of residence, level of
education
Categorical data
Metric continuous data
Level of Measurement
Nominal Scale
Ordinal Scale
Metric Continuous
Data
Level of Measurement
Ratio Scale
The mean is most appropriately used to describe ratio data (Plichta, Kelvin, &
Munro, 2013, p. 40).
Organizing numerical data
Numerical data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
• Ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Frequency distributions
• Cumulative distributions
The stem-and-leaf plot: what and why
• Data in raw form (as collected):
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• Data in ordered array from smallest to largest:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem-and-leaf display (what is it?): separate the sorted data
series into leading digits (stems) and trailing digits (leaves).
2 | 144677
3 | 028
4 | 1
The stem-and-leaf plot: how
• Choose the leading (10’s) digits as the ‘stem’ units.
• The remaining trailing digits are the leaves.
• Complete the stem-and-leaf plot
Stem | Leaves
2 | 144677
3 | 028
4 | 1
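The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `stem_and_leaf` is hypothetical, and it assumes non-negative whole numbers whose tens digit serves as the stem and whose units digit serves as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem); collect units digits (leaves) in order."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 27 -> stem 2, leaf 7
        groups[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}

# Raw data from the slide, in the order collected
raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
plot = stem_and_leaf(raw)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

Because the values are sorted before grouping, the leaves on each stem come out in increasing order, as in the display above.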
The frequency distribution: what and why
• What is a frequency distribution?
• A frequency distribution is a list or a table
containing the values of a variable and the corresponding
frequencies with which each value occurs.
• Why use the frequency distribution?
• It is a way to summarize data.
• The distribution condenses the raw data into a more useful
form and allows a quick visual interpretation of data.
The frequency distribution: how
• Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58.
Frequency distributions, relative Frequency distributions and
percentage distributions
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Frequency   Relative Frequency   Percentage
10 but under 20        3              .15               15
20 but under 30        6              .30               30
30 but under 40        5              .25               25
40 but under 50        4              .20               20
50 but under 60        2              .10               10
Total                 20             1.00              100
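The table can be reproduced with a short Python sketch. The function name `frequency_table` and its parameters (`start`, `width`, `n_classes`) are hypothetical choices for illustration; the classes follow the slide's "lower but under upper" convention, so each class includes its lower boundary and excludes its upper one.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

def frequency_table(values, start=10, width=10, n_classes=5):
    """Return (lower, upper, frequency, relative frequency, percentage) per class."""
    n = len(values)
    rows = []
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        count = sum(1 for v in values if lower <= v < upper)  # "lower but under upper"
        rows.append((lower, upper, count, count / n, 100 * count / n))
    return rows

table = frequency_table(data)
for lower, upper, count, rel, pct in table:
    print(f"{lower} but under {upper}: {count:2d}  {rel:.2f}  {pct:.0f}")
```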
How many class intervals?
• There is more to be said about the widths of the class intervals,
sometimes called bin widths. Your choice of bin width determines the
number of class intervals. This decision, along with the choice of
starting point for the first interval, affects the shape of the histogram.
How many class intervals? Cont’d
• The best advice is to experiment with different choices of width, and
to choose a histogram according to how well it communicates the
shape of the distribution.
How many class intervals? Cont’d
• Sturges' rule is to set the number of intervals as close as possible to
1 + log2(N), where log2(N) is the base 2 log of the number of
observations. The formula can also be written as 1 + 3.3 log10(N),
where log10(N) is the base 10 log of the number of observations.
According to Sturges' rule, 1000 observations would be graphed with
11 class intervals, since 10 is the closest integer to log2(1000) and
1 + 10 = 11. We prefer the Rice rule, which is to set the number of
intervals to twice the cube root of the number of observations
(20 intervals for 1000 observations).
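As a quick check of both rules, here is a small Python sketch. The function names are hypothetical, and rounding to the nearest integer is an assumption (the slide says "as close as possible" for Sturges' rule and does not say how to round the Rice rule).

```python
import math

def sturges_classes(n):
    """Sturges' rule: nearest integer to 1 + log2(n)."""
    return round(1 + math.log2(n))

def rice_classes(n):
    """Rice rule: twice the cube root of n, rounded to the nearest integer (assumed)."""
    return round(2 * n ** (1 / 3))

for n in (20, 1000):
    print(n, sturges_classes(n), rice_classes(n))
```

For the 20 observations used in the frequency distribution slides, both rules suggest about five classes, which matches the five classes of width 10 used there.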
General guideline
Number of data points    Number of classes
Under 50                 5-7
50-100                   6-10
100-250                  7-12
Over 500                 10-20
The histogram: what and why
• What is it?
• A histogram is a bar graph of the class frequencies of numerical
data that creates a picture of the data distribution.
• Why use the histogram?
• It lets us visualize the central location, spread, and
shape of the data.
The histogram: what
• The classes or intervals are shown on the horizontal axis.
• Frequency (or relative frequency) is measured on the vertical axis.
• Bars of appropriate heights can be used to represent the number of
observations within each class.
• Such a graph is called a histogram.
The histogram: create the graph
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of the data, with frequency on the vertical axis and the classes on the horizontal axis, labelled by class midpoints (5, 15, 25, 35, 45, 55) and class boundaries. There are no gaps between bars.]
The frequency polygon: what
• It is a line graph that shows the distribution of numerical data.
• The horizontal axis is the variable.
• The vertical axis is the frequency.
• The line connects points of (midpoints of the class intervals, frequency).
• The line is tied down to the horizontal axis at the midpoints of the
empty classes (zero frequency) on either side of the data.
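A minimal Python sketch of the points a frequency polygon connects, using the 20 observations from the frequency distribution slides. The class width of 10 and the two extra empty classes (0 but under 10, and 60 but under 70) are assumptions made here so that the line starts and ends at zero frequency.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

width = 10
points = []
for lower in range(0, 70, width):           # classes 0-10, 10-20, ..., 60-70
    midpoint = lower + width / 2             # x-coordinate of the point
    freq = sum(1 for v in data if lower <= v < lower + width)  # y-coordinate
    points.append((midpoint, freq))

for midpoint, freq in points:
    print(midpoint, freq)
```

Joining these (midpoint, frequency) pairs with straight lines gives the polygon; the first and last points sit on the horizontal axis.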
The frequency polygon: how
• Determine the frequency distribution.
• Complete the frequency polygon.
The frequency polygon: create the graph
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
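[Figure: frequency polygon of the data, with frequency on the vertical axis and class midpoints (5, 15, 25, 35, 45, 55) on the horizontal axis. The line is tied down to the midpoints of the classes with zero frequency.]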
Cumulative frequency: table
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Cumulative Frequency   Cumulative % Frequency
10 but under 20            3                       15
20 but under 30            9                       45
30 but under 40           14                       70
40 but under 50           18                       90
50 but under 60           20                      100
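The cumulative columns are running totals of the class frequencies. A short Python sketch, assuming the same five classes of width 10 starting at 10 (the variable names are illustrative):

```python
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

lowers = range(10, 60, 10)                                   # 10, 20, 30, 40, 50
freqs = [sum(1 for v in data if lo <= v < lo + 10) for lo in lowers]
cum_freqs = list(accumulate(freqs))                          # running totals
cum_pcts = [100 * c / len(data) for c in cum_freqs]          # as % of n

for lo, c, p in zip(lowers, cum_freqs, cum_pcts):
    print(f"{lo} but under {lo + 10}: {c:2d}  {p:.0f}")
```

Plotting each cumulative percentage against the upper boundary of its class gives the points of an ogive.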
Cumulative frequency: create the ogive
• The ogive is a line graph, where we plot the values of a variable on
the horizontal axis and the cumulative frequency on the vertical axis.
If we plot the cumulative relative frequency on the vertical axis, then
the line graph is called the relative frequency ogive.
The Ogive (Cumulative % Polygon)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
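[Figure: ogive of the data, with cumulative percentage (0 to 100) on the vertical axis and class boundaries (10, 20, 30, 40, 50, 60) on the horizontal axis.]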
Central tendency
• What: It is a typical value that represents the distribution of data.
• Measures: Commonly used measures of central tendency are mode,
arithmetic mean and median.
• Mode is simply the commonest occurrence in the data.
• Arithmetic mean is simply the sum of the numbers divided by the number of
observations (n).
• Median is simply the middle of the dataset, defined as the point below which
half the data points lie, and above which half the data lie.
The mode
• This is simply the commonest occurrence in the data. Many real
datasets of continuous measurements don’t have a meaningful mode, as
all values may be different.
• As such, the mode is easily the least useful technique for data
description.
• It is an appropriate measure of central tendency for variables at all
levels: nominal, ordinal, interval, ratio.
What is the arithmetic mean?
• The arithmetic mean is the most common
measure of central tendency. It is simply the
sum of the numbers divided by the number of
observations.
• The ‘Mean’ is the name given by statisticians to
what everyone else calls the ‘average’.
• Easy to calculate: add up the numbers and
divide by n
Median
• This is the middle of the dataset, defined as the point below which
half the data points lie, and above which half the data lie.
• Arrange the data in increasing order:
• If the number of observations is odd, the median is the
observation exactly in the middle of the ordered list. The
position of the median is calculated as (n+1)/2.
• If the number of observations is even, the median is the mean
(or average) of the two middle values. The positions of these two
values are n/2 and n/2 + 1.
Median, cont’d
• The median is an under-rated tool, often preferable to the more widely used
mean, because it gives a sensible answer whatever the shape of data distribution
• It is a special case of a more general descriptive technique known as centiles.
• The median is the 50th centile of a dataset, meaning that 50% of the data points
lie below it.
Mode, mean and median
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
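The three measures for this data set can be computed with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not the method used on the slide; `multimode` is used because this data set has two values tied for most frequent.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean = statistics.mean(data)        # sum of the values divided by n = 648 / 20
median = statistics.median(data)    # n is even: average of the 10th and 11th values
modes = statistics.multimode(data)  # every value tied for the highest count

print(mean, median, modes)
```

With n = 20 the two middle positions are n/2 = 10 and n/2 + 1 = 11, holding the values 30 and 32, so the median is 31. The values 24 and 27 each occur twice, so the data set is bimodal.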
Mean versus Median
• The mean gives equal weight to every data value when averaging them,
whereas the median puts a weight of one on the middle value and zero on
the other data values.
• The mean is influenced by the presence of extreme values, either too small or
too large, whereas the median is resistant to the presence of extreme
values.
• Mean and median are equal when you have a symmetric distribution.
• Mean and median are unequal when you have an asymmetric distribution.
The mean is larger than the median when the distribution has a long tail to the
right. The mean is smaller than the median when the distribution has a long tail to
the left.
• When you have an asymmetric distribution, the mean may not be an
appropriate representation of central tendency.
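The sensitivity of the mean to extreme values can be seen with a small Python sketch. The extra observation of 500 is a hypothetical value added purely for illustration; the other 20 values are the data set used in the earlier slides.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
with_extreme = data + [500]   # hypothetical extreme value, not from the slides

mean_shift = statistics.mean(with_extreme) - statistics.mean(data)
median_shift = statistics.median(with_extreme) - statistics.median(data)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift}")
```

One extreme value pulls the mean up by more than 20 units and produces a long right tail, while the median moves by only one unit.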
[Figure: a symmetrical distribution, with number of observations plotted against size of value; the mean and median are about the same. A second panel, also plotted against size of value, marks the median.]
Organizing Categorical Data
Tabulating and graphing categorical data
Categorical data
• Tabulating data: the summary table
• Graphing data: pie charts
The spread, scatter and variation
Spread
What: It is the variation of data.
Measures: Commonly used measures of spread are standard
deviation (SD), variance, range, interquartile range (IQR), coefficient
of variation (CV).
Variance is simply the sum of squared deviations from the mean divided by n
or n-1.
Standard deviation is simply the square root of the variance.
Range is simply the maximum data value minus the minimum data value.
Interquartile range is simply the third quartile (75th percentile) minus the first
quartile (25th percentile).
Coefficient of variation is simply the standard deviation divided by the mean,
multiplied by 100%.
[Figure: distributions with number of observations plotted against size of value, including Distribution B.]
Variance
Having got the sum of squares (SS), the sum of squared deviations from the mean:
Variance is the mean value of SS (what)
Variance = SS/n
An alternative formula is also used:
Variance = SS/(n-1)
This estimates the variance of the whole population,
while SS/n gives the variance just for the sample taken.
(How)
Geographers tend to prefer
Variance = SS/n
Biologists tend to prefer
Variance = SS/(n-1)
Standard deviation
Is the square root of variance (what)
Because there are 2 ways to calculate variance, there are 2 standard deviations:
SD = (SS/n)**(1/2). This is labelled σ on many calculators
or
SD = (SS/(n-1))**(1/2)
SD
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Variance
=((12-32.4)**2+(13-32.4)**2+…+(53-32.4)**2+(58-32.4)**2)/(20-1)
=3050.8/19
=160.57.
Standard deviation
=160.57**(1/2)
=12.67.
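The arithmetic can be checked with a few lines of Python. This sketch computes the sum of squares once and then applies both divisors, n and n-1; the variable names are illustrative.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(data)
mean = sum(data) / n                          # 32.4
ss = sum((x - mean) ** 2 for x in data)       # sum of squared deviations (SS)

var_n = ss / n                                # variance with divisor n
var_n_minus_1 = ss / (n - 1)                  # variance with divisor n-1
sd_n = math.sqrt(var_n)                       # square root of SS/n
sd_n_minus_1 = math.sqrt(var_n_minus_1)       # square root of SS/(n-1)

print(round(ss, 1), round(var_n_minus_1, 2), round(sd_n_minus_1, 2))
```

The n-1 versions give 160.57 and 12.67 for this data set. The standard deviation is the square root of the variance, so the variance is computed first.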
What is the coefficient of variation (C.V.)
It is a measure of variation not dependent on units of
measurements and can be used for comparisons of the
variations of measures.
It is a standardized measure of the spread of the distribution.
Definition: it is the standard deviation divided by the mean, multiplied by 100%.
The prices of 5 used cars are Rp 4 million, Rp 4.5 million, Rp 5 million,
Rp 4.75 million and Rp 4.25 million, and the prices of 5 chickens are
Rp 600, Rp 800, Rp 900, Rp 550 and Rp 1,000.
• Compute the standard deviation (SD) of the car prices and the standard
deviation (SD) of the chicken prices. Which varies more (is more
heterogeneous), the car prices or the chicken prices?
1. Find the mean of the car prices and of the chicken prices
Mean (cars) = 1/5 (Rp 4,000,000 + 4,500,000 + … + 4,250,000) =
Rp 4,500,000
Mean (chickens) = 1/5 (Rp 600 + 800 + … + 1,000) = Rp 770
2. Find the SD of the car prices and of the chicken prices (here SS is divided by n)
SD (cars) = Rp 353,550
SD (chickens) = Rp 172.05
3. Find the CV of the car prices and of the chicken prices
CV (cars) = 353,550 / 4,500,000 x 100% = 7.86%
CV (chickens) = 172.05 / 770 x 100% = 22.34%
Conclusion: because CV (chickens) > CV (cars), the chicken prices vary
more (are more heterogeneous) than the car prices.
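The worked example can be reproduced with Python's `statistics` module. The helper name `cv_percent` is hypothetical; `pstdev` (divisor n) is used because it reproduces the SD figures quoted in the example, and using `stdev` (divisor n-1) instead would give somewhat larger values.

```python
import statistics

cars = [4_000_000, 4_500_000, 5_000_000, 4_750_000, 4_250_000]   # Rp
chickens = [600, 800, 900, 550, 1000]                             # Rp

def cv_percent(values):
    """Coefficient of variation: SD (divisor n) / mean x 100%."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

cv_cars = cv_percent(cars)
cv_chickens = cv_percent(chickens)
print(f"CV cars = {cv_cars:.2f}%, CV chickens = {cv_chickens:.2f}%")
```

Although the car prices have a far larger SD in rupiah, the chicken prices are more variable relative to their mean, which is what the CV is designed to show.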
Centiles
Centile
What: A centile is the location of a data value in an ordered array.
Measures: Commonly used centiles are quartiles and deciles.
Quartiles are the 25th, 50th (i.e., median) and 75th percentiles.
Deciles are the 10th, 20th, …, and 90th percentiles.
What is the percentile?
[Figure: the (P*100%)th percentile X=x marked on the density f(x) and on the cumulative distribution function (c.d.f.) F(x).]
What
Percentiles
Percentiles divide the data into 100 equal parts: 1/100, 2/100, …,
100/100.
Quartiles and median
Quartiles divide the data into 4 equal parts.
The 1st quartile divides the bottom 25% from the top 75%.
The 2nd quartile divides the bottom 50% from the top 50%; it is also known as the median.
The 3rd quartile divides the bottom 75% from the top 25%.
What are the deciles?
Example: Income
[Figure: distribution of income (X=x) with the 10th and 90th percentiles marked; an area of 0.1 lies in each tail.]
What are the quartiles ?
Example: Body Mass Index (BMI)
[Figure: distribution of BMI (X=x) with the 25th, 50th (median) and 75th percentiles marked; each of the four parts has area 0.25.]
How to locate the 1st quartile (25th
percentile) and 3rd quartile (75th
percentile)
1st quartile: find the median of the lower half of
the ordered list.
3rd quartile: find the median of the upper half of
the ordered list.
25th, 50th and 75th percentiles
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Five-number summary
Min., 1st quartile, median, 3rd quartile, and max.
Min. = 12
1st quartile = 24
Median = 31
3rd quartile = 42
Max. = 58
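The five numbers can be computed with the "median of each half" method described earlier. In this Python sketch the function name `five_number_summary` is hypothetical, and leaving the middle value out of both halves when n is odd is an assumption (conventions differ, and statistical packages may use other quartile definitions that give slightly different results).

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # lower half
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]  # upper half
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
summary = five_number_summary(data)
print(summary)
```

For this data set the lower half is the first ten values and the upper half is the last ten, so the quartiles are the averages of the 5th and 6th values in each half.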
What is the box-and-whisker plot
A box-and-whisker plot is a graphical display that involves a five-
number summary of a distribution of values, consisting of minimum
value, the lower quartile, the median, the upper quartile and the
maximum value.
Box-and-Whisker plot
[Figure: box-and-whisker plot, with the highest value labelled.]
Interquartile range (IQR)
IQR = 3rd quartile - 1st quartile. This is the range of the middle
50% of the data.
IQR = 42 - 24 = 18.
Why use the box-and-whisker plot
It can be used to visualize the quartiles, min, max, and outliers.
It can be used to compute IQR.
It can be used to visualize whether the distribution is symmetric or
asymmetric.
It can be used to visualize the differences in distribution between
groups.
It can be used to identify outliers.
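The slides do not state how a box-and-whisker plot flags outliers. One widely used convention, assumed here and not taken from the slides, marks any value more than 1.5 x IQR below the 1st quartile or above the 3rd quartile. A small Python sketch with the quartiles 24 and 42 from the five-number summary (the function name `outlier_fences` is hypothetical):

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences at k x IQR beyond the quartiles (assumed convention)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = outlier_fences(24, 42)                    # IQR = 18
outliers = [x for x in data if x < low or x > high]   # values outside the fences
print(low, high, outliers)
```

Under this convention the fences fall at -3 and 69, so none of the 20 observations would be flagged as an outlier.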
Schematic box-and-whisker plot
Thank you for your attention