Session 3 Week 2
Nivedita Nadkarni
Statistics in the Courtroom
November 12th 2024, NLSIU
Today’s Statistical Topics
• Review of last time’s concepts
• Grouped mean
• Grouped variance
• Chebyshev’s Inequality
• Rates and Standardization
Types of variables
Variable type          Values                                             Examples
Continuous             An infinite number of real values in an interval   Height of students, systolic blood pressure
Binary                 Either 0 or 1                                      Presence or absence of diabetes
Categorical / Nominal  Any number of categories                           Age groups like 0-14, 15-25, etc.
Ordinal                Ordered categories                                 Pain scale, Likert scale
Descriptive Statistics
• Descriptive statistics can help in summarizing data in the form of
simple quantitative measures such as percentages or means or in the
form of visual summaries such as histograms and box plots.
• Mean (s.d) / Median (IQR) / Mode are the typically used measures for
continuous data.
• Mean (s.d) denotes mean and the corresponding standard deviation.
• Median (IQR) represents the median value and the inter-quartile range.
• The mode is the value that occurs most frequently in the dataset.
It’s useful for identifying the most common value in a dataset.
Population and sample: parameters and
estimates
• N is population size; n is sample size.
• X, Y or Z are used to denote random variables.
• $X_i$, i = 1, ..., n, denotes the random sample.
• $\bar{x}$ denotes the sample mean.
• s denotes the sample standard deviation.
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• The sample variance is denoted by $s^2$.
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
• The sample standard deviation is the square root of the sample
variance.
• All n independent observations are used to obtain the sample mean $\bar{x}$.
• Hence, for estimating the second parameter $\sigma^2$, only (n-1) independent pieces of information remain. This number is referred to as the degrees of freedom.
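As a small illustration of the mean and variance formulas, the following Python sketch computes the sample mean, the population variance (divide by N) and the sample variance (divide by n-1). The data values and function names are hypothetical, chosen only for illustration; they are not from the textbook.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical data, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

x_bar = mean(data)
sigma2 = population_variance(data)
s2 = sample_variance(data)
s = math.sqrt(s2)  # sample standard deviation

print(f"mean = {x_bar}, population variance = {sigma2}")
print(f"sample variance = {s2:.3f}, sample s.d. = {s:.3f}")
```

Note that the sample variance is always slightly larger than the population-style variance computed on the same numbers, because the divisor is n-1 rather than n.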
Mean for binary data
• If I have only 0’s and 1’s, so just binary data, I can arithmetically
calculate the “mean” of the data.
• However, it gives the proportion of the 1’s in the data.
• Therefore, although the mean formula can be used to calculate the proportion, we can simply count the number of 1’s and divide by the total, which by definition is the proportion of 1’s in the data.
• Personally, the idea of applying the mean function to binary data is not
something I would recommend.
• The example in the book just demonstrates that it can be done arithmetically; it would not be used or recommended in practice.
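A quick sketch of the point above: for 0/1 data, the arithmetic mean and the proportion of 1's are the same number. The responses below are hypothetical.

```python
# Hypothetical binary data: 1 = diabetes present, 0 = absent.
responses = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

# Applying the mean formula to the 0/1 values.
arithmetic_mean = sum(responses) / len(responses)

# Counting the 1's and dividing by the total.
proportion_of_ones = responses.count(1) / len(responses)

print(arithmetic_mean, proportion_of_ones)
```

Both calculations give the same value, which is why reporting the result as a proportion is the clearer choice.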
Dispersion and Distribution
• Dispersion is just the extent to which numerical data is likely to
vary about an average value.
• Distribution refers to the theoretical function that shows the
possible values a variable can take and how frequently they occur.
For example, see the figure on page 43.
• It shows the graphs of two different distributions. Both have the same mean, median and mode but different variances.
• In other words, the dispersion differs between the two distributions: one is spread more widely across the interval than the other.
Problems
• Complete problem 7 from last time, for both variables.
• Let us discuss once everyone has finished solving the problem.
Grouped mean, variance and an inequality
• We know how to calculate the mean, but how about the grouped
mean?
• Consider the example on page 49.
• We can obtain the mean using the standard technique which gives
mean = 8.6 years.
• Now, if you notice, a few values occur multiple times in the same
table.
• Three 5’s, one 6, one 8, three 11’s and two 12’s.
• Therefore, we can compute the sum as in page 50.
• The mean can then be obtained by dividing this sum by 10.
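The calculation described above can be sketched in Python using the value counts listed on the slide (three 5's, one 6, one 8, three 11's and two 12's). The variable names are illustrative only.

```python
# Each distinct value mapped to the number of times it occurs.
value_counts = {5: 3, 6: 1, 8: 1, 11: 3, 12: 2}

# Sum of value * frequency, instead of adding all ten numbers one by one.
weighted_sum = sum(value * count for value, count in value_counts.items())
n = sum(value_counts.values())

mean_years = weighted_sum / n
print(weighted_sum, n, mean_years)

# The same result from the expanded list of raw observations.
raw = [value for value, count in value_counts.items() for _ in range(count)]
assert abs(sum(raw) / len(raw) - mean_years) < 1e-12
```

The weighted sum and the ordinary sum over the raw observations agree, which is the idea the grouped mean builds on.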
• This technique is useful as it can be applied to data that have
been summarized in the form of a frequency distribution.
• Data that are organized in this way are referred to as grouped data.
• Even if the original data is unavailable, and original values are not
known, we are able to determine the number of measurements
that fall into each specified interval.
• Refer to table 3.4 on page 51.
• $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$, where k is the number of intervals, $m_i$ is the midpoint of the ith interval and $f_i$ is the frequency associated with the ith interval.
• Therefore, the grouped mean is actually a weighted average of the
interval midpoints;
• Each midpoint is weighted by the frequency of observations within
the interval.
• What would be the variance or standard deviation of this data?
• $s^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\left(\sum_{i=1}^{k} f_i\right) - 1}$
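A minimal Python sketch of the grouped mean and grouped variance follows. The intervals and frequencies are hypothetical (they are not the values in table 3.4), and the function name is an illustrative choice.

```python
def grouped_mean_and_variance(intervals):
    """intervals: list of ((lower, upper), frequency) pairs.

    Returns the grouped mean and grouped variance, treating every
    observation in an interval as if it sat at the interval midpoint.
    """
    midpoints = [(lower + upper) / 2 for (lower, upper), _ in intervals]
    freqs = [freq for _, freq in intervals]
    n = sum(freqs)

    # Weighted average of midpoints, weights = frequencies.
    x_bar = sum(m * f for m, f in zip(midpoints, freqs)) / n

    # Weighted squared deviations, divided by (total frequency - 1).
    s2 = sum((m - x_bar) ** 2 * f for m, f in zip(midpoints, freqs)) / (n - 1)
    return x_bar, s2

# Hypothetical frequency distribution, for illustration only.
hypothetical_intervals = [((0, 10), 2), ((10, 20), 5), ((20, 30), 3)]

g_mean, g_var = grouped_mean_and_variance(hypothetical_intervals)
print(f"grouped mean = {g_mean}, grouped variance = {g_var:.2f}")
```

Because the midpoints stand in for the unknown original values, these results are approximations to the mean and variance of the ungrouped data.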
• In a bell-shaped distribution with mean μ and standard deviation σ, the empirical rule states that:
• Approximately 68% of the observations fall within one standard deviation (σ) of the mean μ.
• Approximately 95% of the observations fall within two standard deviations (2σ) of the mean μ.
• Approximately 99.7% of the observations fall within three standard deviations (3σ) of the mean μ.
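The three percentages can be checked numerically. The sketch below assumes the bell-shaped distribution is exactly normal (an assumption made here for illustration) and uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is a standard normal.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} s.d.: {share:.4f}")
```

The same shares hold for any normal distribution, whatever its mean and standard deviation, since only the distance in standard deviations matters.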
Chebyshev’s Inequality
• The empirical rule is an approximation that applies only when the
data are symmetric and unimodal.
• If they’re not, Chebyshev’s inequality can be used instead to
summarize the distribution of values.
• This inequality is less specific than the empirical rule, but it is true
for any set of observations, no matter what its shape.
• Therefore, for any number k ≥ 1, at least $1 - (1/k)^2$ of the measurements in the data lie within k standard deviations of their mean.
• For example, for k=2
• At least ¾ or 75% of the values lie within two standard deviations
of the mean.
• Equivalently, we could say that $\bar{x} \pm 2s$ encompasses at least 75% of the observations in the group.
• Similarly, for k=3, $\bar{x} \pm 3s$ contains at least 88.9% of the measurements.
• We can revisit the FEV1 data in table 3.1; see page 53.
• So, though conservative, this inequality allows us to use the mean
and sd of any set of data to describe the entire group!
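To see the inequality at work on data that are clearly not bell-shaped, the following Python sketch compares the Chebyshev lower bound with the share of observations actually lying within k standard deviations. The skewed data set is hypothetical (it is not the FEV1 data), and the function names are illustrative.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of observations within k standard deviations (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k ** 2

def share_within_k_sd(values, k):
    """Observed share of values lying in the interval mean +/- k * s."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    inside = [x for x in values if abs(x - x_bar) <= k * s]
    return len(inside) / len(values)

# Hypothetical, strongly right-skewed data.
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 30]

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    observed = share_within_k_sd(skewed, k)
    print(f"k={k}: bound = {bound:.3f}, observed = {observed:.3f}")
```

The observed shares are well above the bounds, which illustrates why the inequality is described as conservative.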
Rates and Standardization
• Demographic data and vital statistics are numbers that are used
to characterize a population.
• Demographic data includes information such as the size of the
population and its composition by gender, race and age.
• Vital statistics describe the life of a population: dealing with
births, deaths, marriages, divorces and disease occurrence.
• Both types of data are used to describe the health status of a
population, to spot trends and make projections.
• Vital statistics are also used to make comparisons between
groups.
Rates
• Rates are used to make comparisons between groups more
meaningful.
• A rate is defined as the number of cases of a particular outcome of interest that occur over a given time period divided by the size of the population during that time period.
• Rate and proportion, though often used interchangeably, are not synonymous.
• A proportion is a ratio in which all individuals included in the numerator must also be included in the denominator.
• Unlike rates, proportions do not have a unit of measurement.
Types of frequently used rates
• Death rate or mortality rate: Total # deaths in a time period / Total # at
risk during the same period.
• Infant mortality rate: Total # deaths among infants under 1 in a time
period / Total # of live births during the same period.
• These mortality rates we have considered are all crude rates. Why?
• A crude rate is a single number computed as a summary measure for
an entire population; it disregards differences caused by age, gender,
race and other characteristics.
• Mortality rates calculated for individual age groups are called age-
specific mortality rates.
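The difference between a crude rate and age-specific rates can be sketched as follows. All counts are hypothetical and the age bands are an illustrative choice, not data from the textbook.

```python
# Hypothetical deaths and population at risk, by age group.
age_groups = {
    "0-14": {"deaths": 20, "population": 40_000},
    "15-64": {"deaths": 150, "population": 100_000},
    "65+": {"deaths": 630, "population": 20_000},
}

def rate_per_1000(cases, population):
    """Number of cases per 1000 people at risk in the period."""
    return 1000 * cases / population

# Crude rate: one summary number for the whole population.
total_deaths = sum(g["deaths"] for g in age_groups.values())
total_population = sum(g["population"] for g in age_groups.values())
crude_rate = rate_per_1000(total_deaths, total_population)

# Age-specific rates: one number per age group.
age_specific_rates = {
    age: rate_per_1000(g["deaths"], g["population"])
    for age, g in age_groups.items()
}

print(f"crude rate: {crude_rate:.1f} per 1000")
for age, rate in age_specific_rates.items():
    print(f"{age}: {rate:.1f} per 1000")
```

In this made-up population the crude rate hides the fact that mortality in the oldest group is many times higher than in the youngest.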
Standardization of rates
• What is a confounder?
• Consider two populations from different geographical areas: one composed entirely of males and the other only of females.
• How can we be sure whether a difference in their mortality rates is due to location or to some effect of gender?
• In this situation, gender is referred to as a confounder.
• Since it is associated with both geographical area and death rate,
it obscures the true relationship between these factors.
Example to motivate standardization
• Rate of impairment increases with age.
• Age is a confounder between hearing impairment and
employment as it is independently associated with each of these
quantities.
• We therefore cannot tell whether the higher rate of impairment among individuals not in the labour force is the result of some inherent characteristic of the members of that group or simply the effect of age.
• For a more accurate comparison, we need to consider the age-
specific impairment rates rather than the crude ones.
Direct and Indirect standardization
• Although the subgroup-specific rates provide a more accurate comparison among populations than the crude rates, with many more subgroups there would be an overwhelming number of rates to compare.
• It would be therefore convenient to be able to summarize the entire
situation with a single number calculated for each sub-population, a
number that adjusts for difference in composition.
• Two ways: direct method of standardization and the indirect method.
• Both focus on two components: population composition and subgroup-specific rates.
• The aim is to overcome the problem of confounding by holding one of these components constant across populations.
Direct method
• The direct method of adjusting for differences among populations focuses on computing the overall rates that would result if, instead of having different distributions, all populations being compared had the same standard composition.
• Steps:
• Select the standard distribution.
• For the hearing impairment example, use the total population
questioned in the survey.
• Calculate the numbers of impairments that would have occurred in each of the two employment status subgroups (the currently employed and those not in the labour force), assuming that each has this standard population distribution while retaining its own individual age-specific impairment rates.
• Refer to the table on page 73.
• The age-adjusted impairment rates for each group are therefore:
• Currently employed = 5.91 per 1000
• Not in the labour force = 5.54 per 1000
• These age adjusted rates are the impairment rates that would apply if
both the currently employed and those not in the labour force had the
same age distribution as the total surveyed population.
• After we control for the effect of age in this way, the adjusted
impairment rate for those who are employed is higher than the
adjusted rate for those who are not in the labour force.
• This is the opposite of what we observed when we looked at the
crude rates, implying that the crude rates were indeed being
influenced by the age structure of the underlying groups.
• Note that the choice of a different standard age distribution would have led to different adjusted impairment rates.
• This is not critical, since an adjusted rate has no meaning by itself.
• It is merely a construct that is based on a hypothetical standard
distribution;
• Unlike a crude or specific rate, it does not reflect the true impairment
rate of any population.
• Adjusted rates are meaningful only when comparing two or more
groups, and it has been shown that trends among the groups are
generally unaffected by the choice of a standard.
• If another reasonable age distribution were chosen, for instance, the magnitude of the difference between the adjusted impairment rates of the two sub-groups should not change drastically even if the rates themselves do; the currently employed would still have a slightly higher adjusted rate of impairment.
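The direct method can be sketched in Python as below. The counts are hypothetical and use only two age bands; they are not the survey data from page 73, but they are chosen so that the crude and adjusted comparisons point in opposite directions, as in the hearing impairment example. The standard population is taken to be the two groups combined, which is an assumption made for this sketch.

```python
# Hypothetical (cases, population) by age band for each group.
groups = {
    "currently employed": {"young": (30, 10_000), "old": (40, 2_000)},
    "not in labour force": {"young": (6, 3_000), "old": (126, 7_000)},
}

def crude_rate(strata):
    """Total cases divided by total population, ignoring age."""
    cases = sum(c for c, _ in strata.values())
    population = sum(p for _, p in strata.values())
    return cases / population

def directly_standardized_rate(strata, standard_population):
    """Apply the group's own age-specific rates to the standard population."""
    expected_cases = sum(
        (cases / population) * standard_population[age]
        for age, (cases, population) in strata.items()
    )
    return expected_cases / sum(standard_population.values())

# Standard age distribution: here, both groups combined (an assumption).
standard = {
    age: sum(groups[g][age][1] for g in groups) for age in ("young", "old")
}

crude = {g: 1000 * crude_rate(s) for g, s in groups.items()}
adjusted = {
    g: 1000 * directly_standardized_rate(s, standard) for g, s in groups.items()
}

for g in groups:
    print(f"{g}: crude {crude[g]:.2f}, adjusted {adjusted[g]:.2f} per 1000")
```

In this made-up example the group not in the labour force has the higher crude rate only because it is older; once both groups are given the same age distribution, the currently employed have the higher adjusted rate.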
Indirect method
• The indirect method of adjusting for differences in composition
involves the use of a set of standard age-specific impairment rates
along with the actual age composition of each sub-population
being compared.
• Use the total surveyed population as the standard.
• This time, however, we calculate the number of impairments that would have occurred in the two population subgroups if each had taken on the age-specific impairment rates of the surveyed population as a whole while retaining its own individual age distribution.
• Refer to page 74.
• Divide the observed number of hearing impairments in each employment group by the total expected number of impairments.
• The resulting quantity is known as the standardized morbidity ratio.
• If the data pertained to deaths, then the resulting ratio would be
referred to as the standardized mortality ratio.
• Currently employed = 552/536.9 = 1.03 = 103%
• Not in the labour force = 368/372.4 = 0.99 = 99%
• This indicates that the group of currently employed individuals has a 3% higher impairment rate than the surveyed population as a whole.
• The group not in the labour force, by contrast, has an impairment rate that is 1% lower than that of the total population.
• Recall that the total surveyed population also includes the group
of individuals not currently employed.
• Application of the indirect method often concludes with a
comparison of the standardized ratios.
• If desired, we can go one step further and compute the actual age-adjusted impairment rates for each group.
• These are derived by multiplying the crude impairment rate for the
total surveyed population by the appropriate standardized ratios.
• Currently employed: 5.80 per 1000 × 1.03 = 5.97 per 1000
• Not in the labour force: 5.80 per 1000 × 0.99 = 5.74 per 1000
• With the effect of age removed, the group of currently employed
individuals is again seen to have a slightly higher adjusted rate
than those not in the labour force.
• Note that although the rates themselves are different, this is the same conclusion we reached when the direct method of standardization was applied.
• Let us review section 4.2.3 page 75 on the use of standardized
rates.
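The indirect method can be sketched in Python as follows. The observed and expected counts (552, 536.9, 368, 372.4) and the crude rate of 5.80 per 1000 are the figures quoted in the slides; the small `expected_cases` demonstration uses hypothetical rates and population sizes, and all function names are illustrative.

```python
def expected_cases(standard_rates, population_by_age):
    """Cases expected if the group had the standard age-specific rates
    but kept its own age distribution."""
    return sum(
        standard_rates[age] * population_by_age[age]
        for age in population_by_age
    )

def standardized_ratio(observed, expected):
    """Standardized morbidity (or mortality) ratio: observed / expected."""
    return observed / expected

# Hypothetical illustration of how an expected count is built up.
demo_expected = expected_cases(
    {"young": 0.002, "old": 0.020},   # hypothetical standard rates
    {"young": 1_000, "old": 500},     # hypothetical group composition
)

# Figures quoted in the slides.
smr_employed = standardized_ratio(552, 536.9)
smr_not_in_labour_force = standardized_ratio(368, 372.4)

# Adjusted rate = crude rate of the standard population * rounded ratio,
# following the arithmetic shown in the slides.
crude_total_per_1000 = 5.80
adjusted_employed = crude_total_per_1000 * round(smr_employed, 2)
adjusted_not_in_labour_force = (
    crude_total_per_1000 * round(smr_not_in_labour_force, 2)
)

print(f"SMR employed = {smr_employed:.2f}")
print(f"SMR not in labour force = {smr_not_in_labour_force:.2f}")
print(f"adjusted rates: {adjusted_employed:.2f} and "
      f"{adjusted_not_in_labour_force:.2f} per 1000")
```

Unlike the direct method, this approach never needs the age-specific rates of the individual groups, only their observed totals and age compositions.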
Next session
• Events and Probability
• Bayes’ Theorem
• Sensitivity and Specificity – Prosecutor’s fallacy and defense
fallacy
• ROC Curve
• Calculation of prevalence
• Relative Risk and Odds ratio
Prescribed reading
• Please do read:
• From Statistical Science in the Courtroom
• Interpretation of Evidence, and Sample Size Determination (pages
64-68)
• Interpreting DNA evidence: Can Probability Theory Help? (4
Sampling, sections 4.1 and 4.2)