4th Unit - Statistics
Correlation:
Correlation measures the strength and direction of the linear relationship between two variables. It is summarized by a coefficient (r) that ranges from -1 to +1 and treats both variables symmetrically.
Regression:
Regression models how a dependent variable changes as an independent variable changes, producing an equation (for example, y = a + bx) that can be used for prediction.
Here's a table summarizing the key differences between correlation and regression:

Aspect | Correlation | Regression
Purpose | Measures the strength and direction of association | Models and predicts the dependent variable
Variables | Treated symmetrically | Distinguishes dependent from independent variables
Output | A coefficient r between -1 and +1 | An equation with a slope and an intercept
Choosing between correlation and regression:
Use correlation when you only want to quantify how strongly two variables move together; use regression when you want to predict or explain one variable from the other.
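As a concrete sketch of both techniques, the snippet below computes the correlation coefficient and the least-squares regression line for a small set of hypothetical (hours studied, exam score) pairs, using only the standard library; in practice one would typically use a library such as scipy.stats.

```python
import math

# Hypothetical data: hours studied (x) vs exam score (y)
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Pearson correlation coefficient r
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# Simple linear regression: y = a + b*x (least squares)
b = sxy / sxx            # slope
a = mean_y - b * mean_x  # intercept

print(f"r = {r:.3f}, slope = {b:.2f}, intercept = {a:.2f}")
```

Here r is close to +1 (a strong positive association), while the regression line additionally says that each extra hour of study is associated with about 4 more points.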
Sample: A sample is a subset of the population that is chosen for study. We use the
data collected from the sample to make inferences about the entire population.
Relationship: The sample is drawn from the population and is used to represent the
characteristics of the population. Ideally, the sample should be representative of the
population so that the conclusions drawn from the sample can be accurately applied
to the population.
Example: Imagine you want to measure the average height of all adults in India. The
population in this case is all adults in India. It would be impossible to measure the
height of every adult, so you would need to draw a sample of adults and measure
their heights. You would then use the data from the sample to estimate the average
height of all adults in India.
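The height example can be simulated. The snippet below builds a hypothetical population of adult heights, draws a simple random sample, and shows that the sample mean is a good estimate of the population mean (all numbers here are invented for illustration):

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 100,000 adult heights in cm (normally distributed)
population = [random.gauss(165, 8) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# Draw a simple random sample of 500 and estimate the population mean
sample = random.sample(population, 500)
estimate = statistics.mean(sample)

print(f"population mean = {pop_mean:.1f} cm")
print(f"sample estimate = {estimate:.1f} cm")
```

The estimate lands close to the true mean even though only 0.5% of the population was measured.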
Sampling Types
Sampling is the process of selecting a subset of individuals from a larger population
for the purpose of making inferences about the entire population. There are different
types of sampling, each with its own advantages and disadvantages.
1. Probability Sampling: Every member of the population has a known, non-zero chance of being selected. Common forms include simple random, systematic, stratified, and cluster sampling.
2. Non-probability Sampling: Selection is based on convenience or judgment rather than random chance (for example, convenience, quota, and snowball sampling), so the results may not generalize well to the population.
A sampling distribution is the distribution of a statistic (such as the mean) over repeated samples; its shape depends on the statistic being calculated and the sample size. Three concepts closely tied to sampling distributions are:
1. Standard Error:
The standard error (SE) is a measure of the variability in a statistic (e.g., mean,
proportion) due to random sampling. It tells us how much we can expect the statistic
to vary across repeated samples drawn from the same population.
● Lower standard error indicates less variability and a more precise estimate of
the population parameter.
● Higher standard error indicates greater variability and less precision in the
estimate.
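A quick illustration with made-up measurements: the standard error of the mean is the sample standard deviation divided by the square root of the sample size, so larger samples give a smaller SE.

```python
import math
import statistics

data = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2]

s = statistics.stdev(data)        # sample standard deviation
se = s / math.sqrt(len(data))     # standard error of the mean

print(f"mean = {statistics.mean(data):.2f}, SE = {se:.3f}")

# Quadrupling the sample size halves the standard error
assert abs(s / math.sqrt(4 * len(data)) - se / 2) < 1e-12
```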
2. Significance Levels:
The significance level (α) is the probability of rejecting a true null hypothesis in a
hypothesis test. It represents the risk of making a Type I error (concluding a
difference exists when it doesn't).
3. Confidence Limits:
Confidence limits are the endpoints of a confidence interval (CI): a range of values within which we are confident that the true population parameter lies. The interval is calculated from the sample statistic, the standard error, and the chosen confidence level (1 - α).
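As a sketch, the snippet below computes an approximate 95% confidence interval for a mean using the large-sample normal (z) critical value; the data are hypothetical, and for small samples a t critical value would be more appropriate.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of 40 repeated measurements
sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2] * 5  # n = 40

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for 95% confidence

m = mean(sample)
se = stdev(sample) / math.sqrt(len(sample))
lower, upper = m - z * se, m + z * se    # confidence limits

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

The two numbers printed are the confidence limits; with 95% confidence, the true population mean lies between them.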
In summary:
The null hypothesis is typically the default assumption, while the alternative
hypothesis represents the specific research question or claim being investigated.
The goal of hypothesis testing is to determine whether to reject the null hypothesis or
fail to reject it. This decision is based on a statistical test, which involves calculating
the probability of obtaining the observed data if the null hypothesis were true (known
as the p-value).
1. Formulate the null and alternative hypotheses: Clearly define what you are
trying to test and what would constitute evidence against your null hypothesis.
2. Collect data: Gather a representative sample from the population of interest.
3. Choose a statistical test: Select an appropriate test based on the type of
data and research question.
4. Calculate the test statistic and p-value: Use the collected data to calculate
the statistic and its corresponding p-value.
5. Make a decision: Compare the p-value to a predetermined significance level
(usually 0.05).
○ If the p-value is less than the significance level, reject the null
hypothesis.
○ If the p-value is greater than or equal to the significance level, fail to
reject the null hypothesis.
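The five steps above can be sketched end to end. The example below tests H0: μ = 50 against a two-sided alternative on a hypothetical sample, using a large-sample z approximation (a t-test would be the textbook choice for small samples); the data and parameters are invented for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(1)

# Step 1: formulate hypotheses. H0: population mean = 50; H1: mean != 50.
mu0 = 50.0

# Step 2: collect data (hypothetical sample of n = 50 measurements).
sample = [random.gauss(55, 5) for _ in range(50)]

# Step 3: choose a test. Here: a large-sample two-sided z-test.
n = len(sample)
se = stdev(sample) / math.sqrt(n)

# Step 4: calculate the test statistic and its two-sided p-value.
z = (mean(sample) - mu0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: compare the p-value to the significance level.
alpha = 0.05
reject_h0 = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Because this sample really was drawn from a population with mean 55, the test correctly rejects H0.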
A type I error occurs when the null hypothesis (H0) is rejected even though it is
actually true. In simpler terms, it means wrongly concluding that there is a significant
effect or difference when there isn't. This can be likened to accusing someone of a
crime they didn't commit.
A type II error occurs when the null hypothesis (H0) is not rejected even though it is
actually false. In simpler terms, it means failing to detect a real effect or difference.
This can be likened to acquitting someone guilty of a crime.
Example: A study is conducted to investigate the link between smoking and lung
cancer. The study fails to find a significant association, but in reality, smoking does
increase the risk of lung cancer.
Minimizing errors
Researchers can take several steps to minimize the risk of making type I and type II
errors:
● Choose the significance level carefully: a smaller α reduces type I errors but makes type II errors more likely.
● Increase the sample size, which raises the test's statistical power and reduces type II errors.
● Use reliable measurements and a sound study design to reduce random variability in the data.
Understanding type I and type II errors is essential for interpreting the results of
statistical tests and drawing accurate conclusions about research findings. By being
aware of these potential errors and taking steps to minimize them, researchers can
ensure that their findings are reliable and valid.
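One way to build intuition for type I errors is to simulate a world where H0 is true and count how often it gets (wrongly) rejected; with α = 0.05, roughly 5% of the tests below commit a type I error. The simulation uses invented data and a simple z-test approximation.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
trials = 2000
for _ in range(trials):
    # H0 is TRUE here: the data really come from a mean-0 population
    sample = [random.gauss(0, 1) for _ in range(40)]
    z = mean(sample) / (stdev(sample) / math.sqrt(len(sample)))
    if abs(z) > z_crit:
        rejections += 1  # every rejection here is a type I error

rate = rejections / trials
print(f"observed Type I error rate ≈ {rate:.3f} (alpha = {alpha})")
```

Lowering α shrinks this rate, but at the cost of making type II errors more likely, which is exactly the trade-off described above.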
These are three commonly used statistical tests in various fields, including research,
medicine, business, and social sciences. Each test has its unique purpose and
applicability depending on the type of data and research question.
1. T-Test:
● Used to compare the means of two independent groups or two related groups
(paired samples).
● It is a parametric test: it assumes the data are approximately normally
distributed, an assumption that matters most when sample sizes are small.
● When normality is not a reasonable assumption, a non-parametric alternative
such as the Mann-Whitney U test can be used instead.
● Example: Comparing the average height of males and females in a
population.
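The height example can be worked by hand. The snippet below computes a pooled two-sample t statistic for two small, hypothetical samples using only the standard library; the exact p-value would normally come from a t-distribution table or a function such as scipy.stats.ttest_ind.

```python
import math
from statistics import mean, variance

# Hypothetical height samples (cm) for two independent groups
males   = [175, 178, 172, 180, 177, 174]
females = [162, 165, 160, 168, 163, 166]

n1, n2 = len(males), len(females)

# Pooled two-sample t-test (assumes roughly equal variances)
sp2 = ((n1 - 1) * variance(males) + (n2 - 1) * variance(females)) / (n1 + n2 - 2)
t = (mean(males) - mean(females)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(f"t = {t:.2f}, df = {df}")
# |t| = 7.17 far exceeds the 5% critical value (about 2.23 for df = 10),
# so the difference in mean heights is statistically significant.
```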
2. Chi-Square Test:
● Used with categorical data to test goodness of fit, or the independence of two
categorical variables in a contingency table.
● Compares the observed frequencies with the frequencies expected under the
null hypothesis.
● Example: Testing whether smoking status is independent of disease status.
3. F-Test:
● Used to compare the variances of two populations, or (as in ANOVA) to
compare the means of several groups at once.
● Assumes the data in each group are approximately normally distributed.
● Example: Testing whether three teaching methods lead to different average
exam scores.
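To make the chi-square test concrete, here is a from-scratch sketch of a test of independence on an invented 2x2 contingency table; the 5% critical value of 3.841 for df = 1 is the standard table value.

```python
# Hypothetical 2x2 contingency table: rows = group, cols = outcome
observed = [[30, 70],
            [20, 80]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square statistic: sum of (O - E)^2 / E over all cells,
# where E is the frequency expected under independence
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.3f}, df = {df}")
# 2.667 < 3.841 (the 5% critical value for df = 1),
# so we fail to reject independence for this made-up table.
```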
In addition to the three main principles of experimental design (randomization,
replication, and local control), there are a number of other important
considerations when designing an experiment. These include:
● The type of experiment you are conducting: There are many different
types of experiments, each with its own set of design considerations. For
example, a factorial experiment will require a different design than a simple
randomized experiment.
● The number of treatments you are comparing: The number of treatments
you are comparing will affect the sample size you need for your experiment.
● The type of data you are collecting: The type of data you are collecting will
affect the statistical analysis you need to use.
ANOVA stands for Analysis of Variance. It is a statistical test used to compare the
means of two or more groups to determine if there is a statistically significant
difference between them.
● One-way ANOVA: This type of ANOVA compares the means of three or more
groups when there is only one independent variable. It is used to answer
questions like:
○ Is there a significant difference in the average height of plants grown
with different types of fertilizer?
○ Do students perform better on exams when they are given more time to
study?
○ Is there a difference in the average income of people with different
levels of education?
● Two-way ANOVA: This type of ANOVA compares the means of three or more
groups when there are two independent variables. It is used to answer
questions like:
○ Is there a significant difference in the average yield of corn plants
grown with different types of fertilizer and under different watering
conditions?
○ Does the effectiveness of a new medication depend on the patient's
age and gender?
○ Do students perform better on exams when they are given longer study
time and when the teacher uses a different teaching method?
Here's a table summarizing the key differences between one-way and two-way
ANOVA:

Aspect | One-way ANOVA | Two-way ANOVA
Independent variables | One | Two
Typical question | Does a single factor affect the group means? | Do two factors, separately or together, affect the group means?
Interaction effects | Cannot be tested | Can be tested
Analogy:
Imagine you have a group of students and you want to compare their exam scores.
You could conduct a one-way ANOVA to see if the average score is different for
students who studied more than 10 hours compared to students who studied less
than 10 hours. However, you could also conduct a two-way ANOVA to see if the
average score is different for students who studied more than 10 hours and who
used a particular study method, compared to students who studied less than 10
hours and who used a different study method.
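A one-way ANOVA like the fertilizer example can be computed from scratch. The data below are invented for illustration; in practice scipy.stats.f_oneway would do the same computation.

```python
from statistics import mean

# Hypothetical plant heights (cm) under three fertilizers
groups = [
    [20, 22, 21, 23],   # fertilizer A
    [25, 27, 26, 28],   # fertilizer B
    [30, 29, 31, 30],   # fertilizer C
]

all_values = [v for g in groups for v in g]
grand_mean = mean(all_values)
k, n = len(groups), len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

# F = (SSB / df_between) / (SSW / df_within)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"F = {f_stat:.2f} with df = ({k - 1}, {n - k})")
# F = 54.75 is far above the 5% critical value (about 4.26 for df = (2, 9)),
# so the fertilizer group means differ significantly in this made-up data.
```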