Unit II
In statistics, the concepts of population and sample are fundamental, as they form the basis
for data collection, analysis, and interpretation. Understanding these terms helps in making
generalizations about a larger group based on data collected from a smaller subset. Here's an
explanation of both:
1. Population
A population refers to the entire set of individuals, items, or data points that share a common
characteristic and are of interest in a statistical study. It is the complete group about which
information or conclusions are desired.
Size: A population can be finite or infinite, depending on the scope of the study. For
example, the population of a country is finite, while the population of all possible outcomes
of a random process (like rolling a die) is infinite.
Parameters: The population is typically described by parameters, such as the mean (μ),
variance (σ²), and standard deviation (σ). These parameters are generally unknown and are
what we aim to estimate from a sample.
Scope: The population includes all possible data points or observations that meet the criteria
defined by the study.
Example of a Population:
All residents of a city whose preferences a survey aims to describe.
All students enrolled at a university.
All cars produced by a factory.
2. Sample
A sample is a subset of the population that is selected for analysis. Samples are used when it
is impractical or impossible to collect data from every member of the population, as they
provide a way to draw conclusions about the entire population.
Size: A sample is smaller than the population, and its size can vary depending on the
research design. The goal is for the sample to be representative of the population to ensure
generalizability.
Sampling Methods: The process of selecting a sample from a population is called sampling.
There are various sampling methods, including:
o Simple Random Sampling: Every member of the population has an equal chance of
being selected.
o Stratified Sampling: The population is divided into subgroups (strata), and a sample
is taken from each subgroup.
o Systematic Sampling: Every n-th member of the population is selected.
o Cluster Sampling: The population is divided into clusters, and a sample of clusters is
chosen.
Statistics: The data from the sample are used to calculate statistics such as the sample mean (x̄), sample variance (s²), and sample standard deviation (s), which are used to estimate the population parameters.
Example of a Sample:
A survey of 1,000 people from a city to understand the preferences of the population.
A selection of 50 students from a university to study academic performance.
A sample of 100 cars from a factory to test quality control.
Why Use a Sample?
In many real-world scenarios, it is not feasible to collect data from the entire population. Some common reasons for using a sample include:
Cost: It is often too expensive to collect data from everyone in the population.
Time: Collecting data from the entire population may take too long, especially for large or
infinite populations.
Practicality: It may not be physically possible to observe every member of the population.
Efficiency: By selecting a representative sample, it is possible to draw conclusions about the
entire population while saving time and resources.
5. Sampling Techniques
To ensure that a sample is representative of the population, researchers use various sampling
techniques:
Random Sampling:
Every individual or item in the population has an equal chance of being selected. This is
considered the most unbiased method of sampling.
Example: Drawing names from a hat to select participants for a study.
Stratified Sampling:
The population is divided into distinct subgroups (called strata), and a sample is taken from
each subgroup. This ensures that specific groups are represented proportionally in the
sample.
Example: A survey that ensures representation from various age groups (e.g., under 20, 21-
40, 41-60, 61+).
Systematic Sampling:
Individuals are selected at regular intervals from the population. The first individual is
chosen randomly, and subsequent individuals are selected based on a fixed pattern.
Example: Selecting every 10th person in a list of employees.
Cluster Sampling:
The population is divided into clusters (usually based on geographical location), and a
random selection of clusters is chosen. All members of the selected clusters are included in
the sample.
Example: Surveying all schools in a randomly selected district to study student performance.
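As a quick illustration, here is a minimal sketch of the four techniques using Python's standard library. The population of 100 IDs, the two strata, and the five clusters are hypothetical choices made only for demonstration.

```python
import random

population = list(range(1, 101))  # hypothetical population of IDs 1..100

# Random sampling: every member has an equal chance of selection.
random_sample = random.sample(population, k=10)

# Systematic sampling: random start, then every 10th member.
start = random.randrange(10)
systematic_sample = population[start::10]

# Stratified sampling: divide into strata, sample from each.
strata = {"low": population[:50], "high": population[50:]}
stratified_sample = [x for group in strata.values()
                     for x in random.sample(group, k=5)]

# Cluster sampling: divide into clusters, keep all members of chosen ones.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [x for c in random.sample(clusters, k=2) for x in c]
```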
Statistical Inference
Statistical Inference refers to the process of drawing conclusions about a population based
on a sample of data taken from that population. It involves using sample data to make
estimates, predictions, or decisions about population parameters, such as means, proportions,
variances, etc.
1. Point Estimation: The process of estimating a population parameter using a single value
from the sample (e.g., the sample mean as an estimate of the population mean).
2. Interval Estimation: Estimating a population parameter within a range of values (e.g.,
confidence intervals).
3. Hypothesis Testing: Testing a claim or hypothesis about a population using sample data
(e.g., testing whether the mean of a population equals a specific value).
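The three procedures can be sketched with SciPy; the sample values, the hypothesized mean of 50, and the 95% confidence level below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

sample = np.array([48.2, 51.5, 49.8, 52.1, 50.3, 47.9, 51.0, 50.6])

# 1. Point estimation: the sample mean as an estimate of the population mean.
point_estimate = sample.mean()

# 2. Interval estimation: a 95% confidence interval for the mean,
#    using the t distribution since the population variance is unknown.
low, high = stats.t.interval(0.95, df=len(sample) - 1,
                             loc=sample.mean(), scale=stats.sem(sample))

# 3. Hypothesis testing: is the population mean equal to 50?
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"point estimate: {point_estimate:.2f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```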
Sampling refers to the process of selecting a subset (sample) from a larger group
(population) in order to analyze and make inferences about the population. The key
distinction in sampling methods is whether we allow a data point (individual) to be selected
more than once (sampling with replacement) or not (sampling without replacement).
Sampling without replacement means that once an individual or item is selected for the
sample, it is not placed back into the population. Therefore, each individual or item can only
be selected once.
Key Characteristics:
Once an item is selected, it is set aside and cannot appear in the sample again, so every item occurs at most once.
The probability of selection changes from draw to draw as the remaining population shrinks.
Example: Suppose there are 5 students in a class: A, B, C, D, and E. If you randomly select two students for a survey without replacement, once you select a student (e.g., A), that student cannot be selected again for the survey.
Mathematical Considerations:
The probability of selecting an item changes after each selection. If there are N items in the population and n items are being sampled, after the first selection there are only N − 1 items left to choose from for the next selection, and so on.
Without replacement, the sample selections are not independent.
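In Python, random.sample draws without replacement, matching the example above:

```python
import random

students = ["A", "B", "C", "D", "E"]

# random.sample draws without replacement: no student can repeat,
# so a result like ['A', 'A'] is impossible.
survey = random.sample(students, k=2)
print(survey)  # e.g. ['D', 'A']
```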
Sampling with replacement means that after an individual or item is selected for the sample,
it is placed back into the population and could be selected again. This allows the possibility
that the same item might be chosen multiple times.
Key Characteristics:
After selecting an item, it is returned to the population before the next selection, allowing it
to be chosen again.
The probability of selecting each item remains the same throughout the sampling process.
This method allows for repetition of items in the sample, so some items may appear multiple
times.
Example: Suppose there are 5 students in a class: A, B, C, D, and E. If you randomly select two
students for a survey with replacement, after selecting the first student (e.g., A), you return
them to the population, and the second selection could potentially be the same student (e.g.,
A again).
Mathematical Considerations:
With replacement, the probability of selecting an item remains constant. If there are N items in the population and you are sampling n items, the probability of selecting any one item in each selection is 1/N.
With replacement, the selections are independent.
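In Python, random.choices draws with replacement, matching the example above:

```python
import random

students = ["A", "B", "C", "D", "E"]

# random.choices draws with replacement: each draw is independent and
# every student keeps the same 1/5 probability, so ['A', 'A'] can occur.
survey = random.choices(students, k=2)
print(survey)  # e.g. ['A', 'A']
```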
3. Random Sampling
Random Sampling refers to the process of selecting a sample in such a way that every
individual or item in the population has an equal chance of being selected. It ensures that the
sample is representative of the population, which is crucial for making valid inferences.
There are two key types of random sampling, depending on whether the sampling is done with or without replacement. Sampling without replacement is the more common choice in practice; typical applications include:
Survey Sampling: In most surveys, each individual should only be surveyed once. For
example, a political poll where each voter is selected to answer a series of questions.
Quality Control: When inspecting products from a factory batch, each product can only be
checked once.
Lottery Draws: A random draw of lottery numbers typically involves no replacement of
numbers once they have been selected.
Random Numbers
Random numbers are numbers that are generated in such a way that each number is equally
likely to be selected from a certain range or distribution, without any predictable pattern.
They are used in many statistical and computational applications, particularly in simulation,
Monte Carlo methods, and sampling.
Key Points:
Uniform Distribution: When random numbers are generated with equal probability over a
given range (e.g., between 0 and 1), they follow a uniform distribution.
Generation Methods: Random numbers can be generated using hardware-based random generators or software-based algorithms like Linear Congruential Generators (LCGs); a minimal LCG sketch follows this list.
Applications: Random numbers are essential in various fields such as:
o Simulation: In simulating processes that involve uncertainty or randomness (e.g.,
predicting outcomes of coin tosses, dice rolls).
o Sampling: Random numbers are used to select individuals or items randomly from a
population, ensuring each has an equal chance of being selected.
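The sketch below shows a Linear Congruential Generator iterating the recurrence state = (a·state + c) mod m. The constants a = 1664525, c = 1013904223, m = 2³² are one widely published choice, not the only valid one.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield pseudo-random floats approximately uniform on [0, 1)."""
    state = seed
    while True:
        state = (a * state + c) % m  # the linear congruential recurrence
        yield state / m

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])
```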
Population Parameters
1. Population Mean (μ): The average of all data points in the population.
\mu = \frac{1}{N} \sum_{i=1}^{N} X_i
Where N is the population size, and X_i represents the individual data points.
2. Population Variance (σ²): A measure of the spread of the data in the population.
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
3. Population Standard Deviation (σ): The square root of the population variance,
providing a measure of dispersion in the same units as the data.
\sigma = \sqrt{\sigma^2}
Sample Statistics
A sample statistic is a numerical value calculated from a sample that serves as an estimate of
a corresponding population parameter. Sample statistics are used to make inferences about
population parameters.
1. Sample Mean (x̄): The average of the data points in the sample.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Where n is the sample size, and x_i represents the data points in the sample.
2. Sample Variance (s²): A measure of the spread of the sample data, using the n − 1 divisor so that it is an unbiased estimate of the population variance.
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
3. Sample Standard Deviation (s): The square root of the sample variance.
s = \sqrt{s^2}
4. Sample Proportion (p): The proportion of individuals in the sample with a particular characteristic.
p = \frac{x}{n}
Where x is the number of individuals in the sample with the characteristic.
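A minimal sketch computing these sample statistics with NumPy; the eight scores and the 80-point cutoff for the proportion are made up for illustration.

```python
import numpy as np

sample = np.array([72, 85, 90, 66, 78, 88, 95, 70])

x_bar = sample.mean()        # sample mean
s2 = sample.var(ddof=1)      # sample variance (n - 1 divisor, as above)
s = sample.std(ddof=1)       # sample standard deviation
p = np.mean(sample >= 80)    # sample proportion scoring 80 or higher

print(x_bar, s2, s, p)
```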
Sampling Distributions
1. Central Limit Theorem (CLT): The CLT states that regardless of the population distribution, the sampling distribution of the sample mean (or sum) will be approximately normal if the sample size is large enough (typically n ≥ 30).
o The mean of the sampling distribution is equal to the population mean (μ).
o The standard deviation of the sampling distribution (also called the standard error) is:
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
Where σ is the population standard deviation and n is the sample size.
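A small simulation sketch of the CLT: the exponential population below is strongly skewed, yet the means of samples of size n = 30 cluster normally around the population mean with spread σ/√n. The trial count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 30, 10_000

# Exponential population with mean 1 and standard deviation 1.
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print(sample_means.mean())  # close to the population mean, 1.0
print(sample_means.std())   # close to sigma / sqrt(n) = 1 / sqrt(30) = 0.1826
```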
Frequency Distributions
A frequency distribution is a table or graph that displays the frequency (i.e., count) of
occurrences of different values or ranges of values in a data set. It provides an organized way
to summarize and interpret data.
1. Classes: Categories or intervals into which data is grouped. These can be of equal or unequal width.
2. Frequencies: The number of occurrences of data points within each class.
3. Relative Frequencies: The proportion of the total number of data points that fall within each class.
\text{Relative Frequency} = \frac{\text{Frequency of a class}}{n}
4. Cumulative Frequency: The sum of the frequencies up to and including the current class.
5. Class Boundaries: The values that separate one class interval from another. These are used
to avoid gaps between intervals.
Example: Test scores of 20 students, grouped into classes:

Class     Frequency  Relative Frequency  Cumulative Frequency
50–59     2          0.10                2
60–69     3          0.15                5
70–79     3          0.15                8
80–89     4          0.20                12
90–99     5          0.25                17
100–109   3          0.15                20
This frequency distribution summarizes the scores and gives insight into how many students
scored within each range.
A relative frequency distribution is a statistical tool used to show the proportion (or
percentage) of data points that fall within specific categories or intervals (called classes).
Unlike the frequency distribution, which provides counts of occurrences, the relative
frequency distribution provides the proportion of occurrences in relation to the total number
of data points in the dataset.
The relative frequency distribution is particularly useful when comparing datasets of different
sizes, as it normalizes the frequencies and allows for easier comparison across datasets with
different sample sizes.
The relative frequency of a specific class is calculated using the following formula:
\text{Relative Frequency} = \frac{\text{Frequency of the class}}{\text{Total number of observations } (n)}
Where:
Frequency of the class is the count of data points that fall into the class (or interval).
Total number of observations (n) is the total number of data points in the dataset.
If you want to express the relative frequency as a percentage, multiply the result by 100:
\text{Percentage} = \text{Relative Frequency} \times 100
To construct a relative frequency distribution:
1. Organize the Data: Arrange the data in order or group them into classes (if necessary).
2. Create Classes or Intervals: Divide the data into intervals or categories (classes). The number
of classes can vary depending on the nature of the data and the purpose of analysis.
3. Calculate the Frequency of Each Class: Count how many data points fall within each class.
4. Compute the Relative Frequency: Divide the frequency of each class by the total number of
observations to find the relative frequency.
5. Convert to Percentage (optional): Multiply the relative frequency by 100 to express it as a
percentage.
Suppose you have a data set of scores from a group of 20 students on a test. Let's group the scores into class intervals, e.g., 50–59, 60–69, 70–79, etc.:
Class     Frequency
50–59     2
60–69     3
70–79     3
80–89     4
90–99     5
100–109   3
Now, we calculate the relative frequency for each class by dividing the frequency of each
class by the total number of observations (20):
Relative Frequency for class 50–59 = 2/20 = 0.10
Relative Frequency for class 60–69 = 3/20 = 0.15
Relative Frequency for class 70–79 = 3/20 = 0.15
Relative Frequency for class 80–89 = 4/20 = 0.20
Relative Frequency for class 90–99 = 5/20 = 0.25
Relative Frequency for class 100–109 = 3/20 = 0.15
To express the relative frequencies as percentages, multiply each relative frequency by 100:
Relative Frequency for class 50–59 = 0.10 × 100 = 10%
Relative Frequency for class 60–69 = 0.15 × 100 = 15%
Relative Frequency for class 70–79 = 0.15 × 100 = 15%
Relative Frequency for class 80–89 = 0.20 × 100 = 20%
Relative Frequency for class 90–99 = 0.25 × 100 = 25%
Relative Frequency for class 100–109 = 0.15 × 100 = 15%
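A minimal sketch of steps 3–5 of the procedure above, applied to the same 20-student frequency table:

```python
classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
frequencies = [2, 3, 3, 4, 5, 3]
n = sum(frequencies)  # total number of observations: 20

for cls, f in zip(classes, frequencies):
    rel = f / n  # relative frequency for the class
    print(f"{cls}: {rel:.2f} ({rel * 100:.0f}%)")
```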
A relative frequency histogram or bar chart can be used to graphically represent the
relative frequency distribution. The x-axis represents the classes (intervals), and the y-axis
represents the relative frequencies (or percentages).
For grouped data, we don't have the individual data points, but rather, the data is organized
into class intervals with corresponding frequencies. To compute the mean, variance, and
moments for such data, we use approximate formulas that involve the midpoints of the
classes and the frequencies.
The mean for grouped data is calculated using the class midpoints (x_i) and the corresponding frequencies (f_i):
\bar{x} = \frac{\sum f_i x_i}{\sum f_i}
Where:
x_i = \frac{\text{Lower limit of the class} + \text{Upper limit of the class}}{2}
f_i is the frequency of the class.
Example:
Suppose we have the following grouped data representing the marks of students in a test:

Class   Frequency (f_i)  Midpoint (x_i)  f_i x_i
0–10    5                5               25
10–20   8                15              120
20–30   12               25              300
30–40   10               35              350
40–50   7                45              315

Here Σf_i = 42 and Σf_i x_i = 1110, so the mean is x̄ = 1110/42 ≈ 26.43.
Variance measures the spread of data points from the mean. For grouped data, variance can be calculated using the formula:
\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}
Where f_i is the class frequency, x_i is the class midpoint, and x̄ is the grouped mean. The steps are:
1. Calculate the squared difference between the midpoint x_i and the mean x̄, i.e., (x_i − x̄)².
2. Multiply each squared difference by the frequency f_i to obtain f_i (x_i − x̄)².
3. Sum the values of f_i (x_i − x̄)².
4. Divide the sum by the total frequency Σf_i to get the variance.
Example:
Using the same data and the mean we calculated earlier (x̄ = 26.43):

Class   f_i  x_i  x_i − x̄               (x_i − x̄)²  f_i (x_i − x̄)²
0–10    5    5    5 − 26.43 = −21.43    459.24       2296.22
10–20   8    15   15 − 26.43 = −11.43   130.64       1045.16
20–30   12   25   25 − 26.43 = −1.43    2.04         24.54
30–40   10   35   35 − 26.43 = 8.57     73.44        734.45
40–50   7    45   45 − 26.43 = 18.57    344.84       2413.91

Summing the last column gives Σ f_i (x_i − x̄)² ≈ 6514.28, so the variance is σ² = 6514.28/42 ≈ 155.10.
Standard Deviation: σ = √155.10 ≈ 12.45
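A minimal sketch of the grouped-data formulas applied to the same table; tiny differences from the hand calculation come from rounding the mean to 26.43 there.

```python
midpoints = [5, 15, 25, 35, 45]   # class midpoints x_i
freqs = [5, 8, 12, 10, 7]         # class frequencies f_i

n = sum(freqs)                                            # 42
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n   # 1110 / 42
var = sum(f * (x - mean) ** 2
          for f, x in zip(freqs, midpoints)) / n
std = var ** 0.5

print(f"mean = {mean:.2f}, variance = {var:.2f}, std = {std:.2f}")
# mean = 26.43, variance = 155.10, std = 12.45
```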
Moments are measures of the shape of the data distribution. The nth moment of a grouped data set about the mean is given by:
m_n = \frac{\sum f_i (x_i - \bar{x})^n}{\sum f_i}
Where f_i is the class frequency and x_i is the class midpoint.
The first moment (taken about the origin rather than the mean) is the mean itself, which we have already calculated as x̄.
Second Moment: The second moment about the mean is the variance σ², calculated above.
Higher-Order Moments:
Higher moments (third, fourth, etc.) are used to describe the skewness (asymmetry) and kurtosis (peakedness) of the distribution.
1. Unbiased Estimates
An unbiased estimator is a statistical estimator whose expected value is equal to the true
value of the parameter being estimated. In simpler terms, an estimator is unbiased if, on
average, it correctly estimates the parameter across many samples.
E[\hat{\theta}] = \theta
Where \hat{\theta} is the estimator and θ is the true value of the parameter being estimated.
Example:
Suppose θ is the population mean, and μ̂ is the sample mean. The sample mean is an unbiased estimator of the population mean because E[μ̂] = μ, where μ is the population mean.
2. Efficient Estimates
An efficient estimator is one that has the smallest variance among all unbiased estimators.
While unbiased estimators provide the correct value on average, efficient estimators are more
precise, meaning they are more likely to be closer to the true parameter in repeated sampling.
Definition: Among all unbiased estimators, the one with the smallest variance is called the
most efficient estimator.
The efficiency of an estimator is measured by comparing its variance with that of other
unbiased estimators. The estimator with the lowest variance is preferred because it leads to
more reliable estimates.
Example:
In the case of estimating the population mean μ, the sample mean μ̂ is not only an unbiased estimator but also an efficient estimator under the assumption that the data follows a normal distribution. This is because the sample mean has the smallest variance compared to other unbiased estimators of μ.
Key Takeaways:
Unbiased estimator: An estimator where the expected value equals the true parameter.
Efficient estimator: An unbiased estimator with the smallest variance among all unbiased
estimators.
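A small simulation sketch of these ideas: averaging an estimator over many samples approximates its expected value. The normal population (μ = 10, σ = 2), sample size n = 5, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 10.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))

print(samples.mean(axis=1).mean())         # ~10.0: sample mean is unbiased
print(samples.var(axis=1, ddof=1).mean())  # ~4.0: n-1 divisor is unbiased
print(samples.var(axis=1, ddof=0).mean())  # ~3.2: n divisor underestimates
```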
1. Point Estimates
A point estimate is a single value used to estimate a population parameter. Point estimates
are derived from sample statistics (such as the sample mean, sample proportion, etc.). A point
estimate gives a specific numerical value but does not provide any information about the
variability or uncertainty associated with the estimate.
Definition: A point estimate is a single value that serves as the best guess or approximation
of an unknown population parameter.
Example:
Suppose we want to estimate the population mean μ based on a sample. The point estimate of μ is the sample mean x̄. If x̄ = 50, then 50 is our point estimate of the population mean.
While point estimates provide useful information, they are not foolproof and often suffer
from uncertainty, which brings us to interval estimates.
2. Interval Estimates
An interval estimate provides a range of values within which the population parameter is
expected to lie. This range is based on sample data and includes a measure of uncertainty.
Interval estimates are useful because they offer more information than point estimates,
providing both the estimated parameter and the associated confidence in the estimate.
(\hat{\theta} - E, \hat{\theta} + E)
Where \hat{\theta} is the point estimate and E is the margin of error.
A confidence interval is a common type of interval estimate. It gives a range of values for
the parameter, along with the confidence level (usually 95%, 99%, etc.) that the interval
contains the true parameter value.
For example, a 95% confidence interval for a population mean μ might be expressed as:
(\bar{x} - E, \bar{x} + E)
Where x̄ is the sample mean, and E is the margin of error, calculated based on the sample's standard deviation and sample size.
Example:
Suppose a sample of 100 students has a sample mean score of 80 with a standard deviation of 10. The margin of error for a 95% confidence interval is:
E = z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{10}{\sqrt{100}} = 1.96
So, the 95% confidence interval for the population mean is:
(80 - 1.96, 80 + 1.96) = (78.04, 81.96)
This means that we are 95% confident that the true population mean lies between 78.04 and
81.96.
Aspect        Point Estimate                            Interval Estimate
Definition    A single value estimate of a parameter    A range of values within which the parameter is likely to lie
Uncertainty   Does not indicate uncertainty             Provides a measure of uncertainty (confidence level)
Reliability
In the broader engineering context, reliability can refer to the probability that a system or
component will perform its required functions without failure over a specified period, under
stated conditions.
However, in statistical estimation, reliability is tied closely to the confidence interval and
the precision of the estimate.
A confidence interval (CI) is a range of values used to estimate a population parameter, and
it comes with a specific confidence level. This level (usually expressed as a percentage, such
as 95% or 99%) reflects how confident we are that the true parameter lies within the interval.
For a population with an unknown mean μ and a known standard deviation σ (or a large enough sample size n), a confidence interval for the population mean is calculated as:
\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
Where x̄ is the sample mean, z_{α/2} is the critical value for the chosen confidence level, σ is the population standard deviation, and n is the sample size.
Example:
Suppose we have a sample of 100 students, and their test scores have a sample mean of 80
and a population standard deviation of 10. We want to calculate a 95% confidence interval
for the population mean.
Given:
x̄ = 80
σ = 10
n = 100
z_{α/2} = 1.96 (for a 95% confidence level)
The interval is 80 ± 1.96 × (10/√100) = 80 ± 1.96 = (78.04, 81.96).
Thus, we are 95% confident that the true population mean lies between 78.04 and 81.96.
For estimating the population proportion p, the confidence interval is calculated using the following formula:
\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
Where \hat{p} is the sample proportion, z_{α/2} is the critical value for the chosen confidence level, and n is the sample size.
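A minimal sketch of both intervals: the mean CI reproduces the worked example above, while the counts for the proportion CI (55 successes out of 100) are hypothetical.

```python
import math

z = 1.96  # z_{alpha/2} for a 95% confidence level

# CI for a mean with known sigma: x_bar +/- z * sigma / sqrt(n)
x_bar, sigma, n = 80, 10, 100
e = z * sigma / math.sqrt(n)
print(f"mean CI: ({x_bar - e:.2f}, {x_bar + e:.2f})")  # (78.04, 81.96)

# CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
successes, m = 55, 100  # hypothetical survey counts
p_hat = successes / m
e_p = z * math.sqrt(p_hat * (1 - p_hat) / m)
print(f"proportion CI: ({p_hat - e_p:.3f}, {p_hat + e_p:.3f})")
```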
Maximum Likelihood Estimation (MLE)
Definition:
Let's denote a set of observations as x_1, x_2, ..., x_n. The likelihood function L(θ) for a parameter θ is defined as the joint probability of observing the data given θ, i.e.,
L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)
The maximum likelihood estimate \hat{\theta} is the value of θ that maximizes L(θ).
In practice, the MLE is found by maximizing the log-likelihood ℓ(θ) = log L(θ), which has the same maximizer as L(θ) but is easier to differentiate.
1. Write the likelihood function based on the probability distribution of the data.
2. Take the logarithm of the likelihood function to obtain the log-likelihood function.
3. Differentiate the log-likelihood function with respect to the parameter θ.
4. Set the derivative equal to zero and solve for θ. This gives the maximum likelihood estimate \hat{\theta}.
Example:
Suppose you have a random sample x_1, x_2, ..., x_n from a normal distribution with an unknown mean μ and known variance σ². The probability density function (PDF) for a normal distribution is:
f(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
To find the MLE for μ, we take the derivative of ℓ(μ) with respect to μ:
\frac{d\ell}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)
Setting the derivative equal to zero and solving for μ, we get:
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
Thus, the MLE for the mean μ is simply the sample mean x̄.
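A minimal numerical sketch of the recipe: maximize the normal log-likelihood in μ with SciPy and confirm that the answer matches the closed-form MLE, the sample mean. The true mean, σ, sample size, and seed are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sigma = 2.0
data = rng.normal(loc=7.0, scale=sigma, size=200)

def neg_log_likelihood(mu):
    # Negative log-likelihood, up to an additive constant in mu.
    return np.sum((data - mu) ** 2) / (2 * sigma ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x)     # numerical MLE for mu
print(data.mean())  # the closed-form MLE: the sample mean
```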