Inferential Statistics For Data Science
Inferential Statistics
o Sampling Distributions & Estimation
o Hypothesis Testing (One and Two Group Means)
o Hypothesis Testing (Categorical Data)
o Hypothesis Testing (More Than Two Group Means)
o Quantitative Data (Correlation & Regression)
o Significance in Data Science
Inferential Statistics
Inferential statistics allows you to make inferences about the population from the sample data.
Sampling Distributions
Sample means become more and more normally distributed around the true mean (the population
parameter) as we increase our sample size. The variability of the sample means decreases as
sample size increases.
The Central Limit Theorem is used to help us understand the following facts regardless of
whether the population distribution is normal or not:
1. the mean of the sample means is the same as the population mean.
2. the standard deviation of the sample means equals the standard error, σ/√n, which shrinks as the sample size grows.
3. the distribution of sample means becomes increasingly normal as the sample size increases.
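The three facts above are easy to see by simulation. The sketch below draws repeated samples from a deliberately non-normal (exponential) population; the population, sample sizes, and seed are all fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Start from a clearly non-normal population: exponential with mean 2
population = rng.exponential(scale=2.0, size=100_000)

def sample_means(pop, sample_size, n_samples=2000):
    """Draw repeated samples and return the mean of each one."""
    idx = rng.integers(0, len(pop), size=(n_samples, sample_size))
    return pop[idx].mean(axis=1)

means_small = sample_means(population, sample_size=5)
means_large = sample_means(population, sample_size=100)

# 1. The mean of the sample means tracks the population mean
print(population.mean(), means_large.mean())
# 2. Their spread shrinks toward sigma/sqrt(n), the standard error
print(population.std() / np.sqrt(100), means_large.std())
```

Plotting a histogram of `means_large` would show a near-normal bell shape even though the underlying population is heavily skewed.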
Confidence Intervals
The confidence level indicates how often, out of 100 repeated samples, the interval constructed around the sample mean would contain the true population mean.
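As a minimal sketch, a 95% confidence interval for a mean can be computed with `scipy.stats.t.interval` when the population standard deviation is unknown; the sample below is fabricated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=170, scale=10, size=40)  # hypothetical measurements

mean = sample.mean()
sem = stats.sem(sample)                 # standard error of the mean
# 95% t-interval: appropriate when the population sigma is unknown
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```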
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting
data, and then examining what the data tells us about how to proceed. The hypothesis to be tested
is called the null hypothesis and is given the symbol H₀. We test the null hypothesis against an
alternative hypothesis, which is given the symbol Hₐ.
When a hypothesis is tested, we must decide on how much of a difference between means is
necessary in order to reject the null hypothesis. Statisticians first choose a level of significance,
or alpha (α) level, for their hypothesis test.
Critical values are the values that indicate the edge of the critical region. Critical regions
describe the entire area of values that indicate you reject the null hypothesis.
These are the four basic steps we follow for (one & two group means) hypothesis testing:
1. state the null and alternative hypotheses.
2. choose a significance (α) level.
3. compute the test statistic from the sample data.
4. compare the statistic to the critical value (or the p-value to α) and decide whether to reject the null hypothesis.
Hypothesis Test on One Sample Mean When the Population Parameters are Known
We find the z-statistic of our sample mean in the sampling distribution and determine whether that
z-score falls within the critical (rejection) region. This test is only appropriate when you know
the true mean and standard deviation of the population.
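A minimal sketch of the z-test, assuming a hypothetical scale whose population mean and standard deviation are known (all numbers fabricated):

```python
import numpy as np
from scipy import stats

# Hypothetical setup: population mean and sd are known
mu, sigma = 100, 15
sample = np.array([108, 112, 96, 105, 110, 99, 104, 113, 107, 101])

z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
p = 2 * stats.norm.sf(abs(z))           # two-tailed p-value
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # edge of the rejection region

# Reject H0 only if |z| exceeds the critical value
print(z, p, abs(z) > z_crit)
```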
The Student’s t-distribution is similar to the normal distribution, but it is more spread out, with
thicker tails. The differences between the t-distribution and the normal distribution are more
pronounced when there are fewer data points, and therefore fewer degrees of freedom.
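When the population standard deviation is unknown, the one-sample t-test replaces the z-test. A sketch with fabricated measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements tested against a claimed mean of 5.0
sample = np.array([5.1, 4.9, 5.6, 5.2, 4.7, 5.3, 5.0, 5.4])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)   # uses df = n - 1 = 7
```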
Estimation as a follow-up to a Hypothesis Test
When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true
value of the population mean.
Two-Sample T Tests
When we have independent samples, we assume that the scores of one sample do not affect the
other. These are compared with an unpaired t-test.
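A minimal unpaired t-test sketch with two fabricated independent groups; Welch's variant (`equal_var=False`) is the safer default when the group variances may differ:

```python
import numpy as np
from scipy import stats

group_a = np.array([23, 25, 28, 30, 22, 26, 27])
group_b = np.array([31, 29, 33, 35, 30, 32, 34])

# Welch's t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)
```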
In two dependent samples of data, each score in one sample is paired with a specific score in the
other sample. These are compared with a paired t-test.
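A paired t-test sketch on fabricated before/after measurements; it is equivalent to a one-sample t-test on the paired differences:

```python
import numpy as np
from scipy import stats

before = np.array([140, 152, 138, 145, 150, 148])
after = np.array([135, 147, 136, 140, 146, 144])

# Each 'before' score is paired with the corresponding 'after' score
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)
```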
The chi-square test is used for categorical data. It can be used to estimate how closely the
distribution of a categorical variable matches an expected distribution (the goodness-of-fit test),
or to estimate whether two categorical variables are independent of one another (the test of
independence).
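Both chi-square variants are available in scipy; the die rolls and contingency counts below are fabricated for illustration:

```python
import numpy as np
from scipy import stats

# Goodness-of-fit: does a die look fair over 120 rolls?
observed = np.array([18, 22, 16, 25, 19, 20])
chi2_gof, p_gof = stats.chisquare(observed)   # expected defaults to uniform
print(chi2_gof, p_gof)

# Test of independence on a 2x2 contingency table (hypothetical counts)
table = np.array([[30, 10],
                  [20, 40]])
chi2_ind, p_ind, dof, expected = stats.chi2_contingency(table)
print(chi2_ind, p_ind, dof)
```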
Analysis of Variance (ANOVA) allows us to test the hypothesis that the means of several
populations are equal. We could conduct a series of pairwise t-tests instead of ANOVA, but that
would be tedious as the number of groups grows and would inflate the chance of a Type I error.
ANOVA formulas: F = MS_between / MS_within, where MS_between = SS_between / (k − 1) and
MS_within = SS_within / (N − k) for k groups and N total observations.
If the F-value from the ANOVA test is greater than the F-critical value, we reject our null
hypothesis.
One-Way ANOVA
The one-way ANOVA method is the procedure for testing the null hypothesis that the population
means across the levels of a single independent variable (factor) are equal.
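A one-way ANOVA sketch on three fabricated groups, including the F-critical comparison described above:

```python
import numpy as np
from scipy import stats

g1 = np.array([85, 86, 88, 75, 78, 94])
g2 = np.array([91, 92, 93, 85, 87, 84])
g3 = np.array([79, 78, 88, 94, 92, 85])

f_stat, p_value = stats.f_oneway(g1, g2, g3)

# Compare against the critical value at alpha = 0.05
df_between = 2                                   # k - 1 groups
df_within = len(g1) + len(g2) + len(g3) - 3      # N - k observations
f_crit = stats.f.ppf(0.95, df_between, df_within)
print(f_stat, p_value, f_stat > f_crit)
```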
Two-Way ANOVA
The two-way ANOVA method is the procedure for testing null hypotheses about the population
means across the levels of two independent variables. With this method, we can study not only
the main effect of each independent variable but also the interaction between these variables.
We could also run two separate one-way ANOVAs, but a two-way ANOVA gives us efficiency,
control, and the interaction effect.
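scipy has no built-in two-way ANOVA, so here is a minimal numpy sketch for a balanced design (equal cell sizes), with fabricated data; the sums of squares for each factor, their interaction, and the error term are computed directly from the cell means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Balanced 2x3 design with 10 replicates per cell (hypothetical data)
a, b, n = 2, 3, 10
y = rng.normal(50, 5, size=(a, b, n))
y[1] += 3  # factor A shifts the mean in this fabricated data

grand = y.mean()
mean_a = y.mean(axis=(1, 2))        # level means of factor A
mean_b = y.mean(axis=(0, 2))        # level means of factor B
mean_cell = y.mean(axis=2)          # cell means

ss_a = n * b * np.sum((mean_a - grand) ** 2)
ss_b = n * a * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_err = np.sum((y - mean_cell[:, :, None]) ** 2)

df_a, df_b = a - 1, b - 1
df_ab, df_err = df_a * df_b, a * b * (n - 1)
ms_err = ss_err / df_err

f_a = (ss_a / df_a) / ms_err        # main effect of A
f_b = (ss_b / df_b) / ms_err        # main effect of B
f_ab = (ss_ab / df_ab) / ms_err     # interaction
p_a = stats.f.sf(f_a, df_a, df_err)
print(f_a, f_b, f_ab, p_a)
```

In practice a formula interface such as `statsmodels`' `anova_lm` would typically be used instead of hand-computed sums of squares.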
Correlation
Correlation measures the strength and direction of the linear relationship between two
quantitative variables.
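A quick sketch of the Pearson correlation coefficient on fabricated data with a built-in linear relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # fabricated linear relationship

r, p_value = stats.pearsonr(x, y)
print(r, p_value)   # r near +1 indicates a strong positive linear relationship
```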
Regression
Regression analysis is a set of statistical processes for estimating the relationships among
variables.
Simple Regression
This method uses a single independent variable to predict a dependent variable by fitting the line
that best describes their relationship.
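A simple-regression sketch with `scipy.stats.linregress` on fabricated points that roughly follow y = 2x:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9])

result = stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue ** 2)

# Predict the dependent variable for a new x value
y_pred = result.intercept + result.slope * 9
print(y_pred)
```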
Multiple Regression
This method uses more than one independent variable to predict a dependent variable by fitting
the best relationship. It works best when multicollinearity, a phenomenon in which two or more
predictor variables are highly correlated, is absent.
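A multiple-regression sketch via ordinary least squares with numpy only; the two predictors below are generated independently, so multicollinearity is absent by construction (all coefficients fabricated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # independent predictors, no collinearity
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column; solve by least squares
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # estimates of (intercept, slope for x1, slope for x2)
```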
Nonlinear Regression
In this method, observational data are modeled by a function that is a nonlinear combination
of the model parameters and depends on one or more independent variables.
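A nonlinear-regression sketch with `scipy.optimize.curve_fit`, fitting a hypothetical exponential model to fabricated noisy data:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)         # nonlinear in the parameter b

rng = np.random.default_rng(5)
x = np.linspace(0, 2, 50)
y = model(x, 2.5, 1.3) + rng.normal(scale=0.2, size=x.size)

# Initial guess p0 matters for nonlinear fits
params, cov = curve_fit(model, x, y, p0=(1.0, 1.0))
print(params)   # estimates of (a, b)
```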
Significance in Data Science