
Inferential Statistics for Data Science

 Inferential Statistics
 Sampling Distributions & Estimation
o Hypothesis Testing (One and Two Group Means)
o Hypothesis Testing (Categorical Data)
o Hypothesis Testing (More Than Two Group Means)
 Quantitative Data (Correlation & Regression)
 Significance in Data Science

Inferential Statistics

Inferential statistics allows you to make inferences about the population from the sample data.

Population & Sample

A sample is a representative subset of a population. Conducting a census of the entire population is ideal but impractical in most cases. Sampling is much more practical; however, it is prone to sampling error. A sample that is not representative of the population is called biased, and the method that produces such a sample is called sampling bias. Convenience bias, judgement bias, size bias, and response bias are the main types of sampling bias. The best technique for reducing bias in sampling is randomization. Simple random sampling is the simplest of the randomization techniques; cluster sampling and stratified sampling are other systematic sampling techniques.

Sampling Distributions

Sample means become more and more normally distributed around the true mean (the population
parameter) as we increase our sample size. The variability of the sample means decreases as
sample size increases.

Central Limit Theorem

The Central Limit Theorem tells us the following facts, regardless of whether the population distribution is normal or not:
1. the mean of the sample means is the same as the population mean
2. the standard deviation of the sample means is equal to the standard error, σ/√n
3. the distribution of sample means becomes increasingly normal as the sample size increases.
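These facts can be checked with a quick simulation. The sketch below (using numpy; the exponential population and all parameter values are chosen arbitrarily for illustration) draws many samples from a clearly non-normal population and compares the mean and spread of the sample means against the theory:

```python
import numpy as np

rng = np.random.default_rng(42)

# Population: exponential distribution (clearly non-normal), mean = 1.0
pop_mean, n, n_samples = 1.0, 50, 10_000

# Draw many samples of size n and record each sample mean
sample_means = rng.exponential(scale=pop_mean, size=(n_samples, n)).mean(axis=1)

# Fact 1: the mean of the sample means approximates the population mean
print(round(sample_means.mean(), 2))

# Fact 2: the std of the sample means approximates the standard error sigma/sqrt(n)
print(round(sample_means.std(), 2), round(pop_mean / np.sqrt(n), 2))
```

For the exponential distribution the population standard deviation equals the mean, so the standard error here is 1.0/√50 ≈ 0.14, which is what the simulated spread converges to.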

Confidence Intervals

A sample mean can be referred to as a point estimate of a population mean. A confidence interval is always centered on the mean of your sample. To construct the interval, you add and subtract a margin of error. The margin of error is found by multiplying the standard error of the mean by the z-score of the chosen confidence level:

CI = x̄ ± z* · (σ/√n)

The confidence level indicates how many times out of 100 an interval constructed this way would contain the true population mean.
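As a sketch of the computation (the sample below is synthetic, generated purely for illustration; scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 40 measurements (values invented for illustration)
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=40)

x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
z_star = stats.norm.ppf(0.975)                   # z-score for 95% confidence (~1.96)

margin_of_error = z_star * se
ci = (x_bar - margin_of_error, x_bar + margin_of_error)
print(ci)
```

Out of 100 intervals built this way from fresh samples, we would expect roughly 95 to contain the true population mean.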

Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tell us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.

When a hypothesis is tested, we must decide on how much of a difference between means is
necessary in order to reject the null hypothesis. Statisticians first choose a level of significance or
alpha(α) level for their hypothesis test.

Critical values are the values that indicate the edge of the critical region. Critical regions
describe the entire area of values that indicate you reject the null hypothesis.

Left-tailed, right-tailed & two-tailed tests

These are the four basic steps we follow for (one & two group means) hypothesis testing:

1. State the null and alternative hypotheses.
2. Select the appropriate significance level and check the test assumptions.
3. Analyze the data and compute the test statistic.
4. Interpret the result.
Hypothesis Testing (One and Two Group Means)

Hypothesis Test on One Sample Mean When the Population Parameters are Known

We find the z-statistic of our sample mean in the sampling distribution and determine whether that z-score falls within the critical (rejection) region. This test is only appropriate when you know the true mean and standard deviation of the population.
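A minimal sketch of this z-test (the sample scores and the population parameters μ = 100, σ = 15 are invented for illustration; scipy is assumed):

```python
import numpy as np
from scipy import stats

# Known population parameters (assumed for illustration)
mu, sigma = 100.0, 15.0

# Hypothetical sample of n scores
sample = np.array([112, 105, 98, 120, 107, 101, 115, 109, 96, 110], dtype=float)
n = len(sample)

# z-statistic of the sample mean in the sampling distribution
z = (sample.mean() - mu) / (sigma / np.sqrt(n))

# Two-tailed p-value; reject H0 at alpha = 0.05 only if p < alpha
p = 2 * stats.norm.sf(abs(z))
print(z, p)
```

Equivalently, one can compare z against the critical values ±1.96 for a two-tailed test at α = 0.05.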

Hypothesis Tests When You Don’t Know Your Population Parameters

The Student’s t-distribution is similar to the normal distribution, except it is more spread out and
wider in appearance, and has thicker tails. The differences between the t-distribution and the
normal distribution are more exaggerated when there are fewer data points, and therefore fewer
degrees of freedom.
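When the population standard deviation is unknown, we use a one-sample t-test instead. A quick sketch with scipy (the sample values and the hypothesized mean of 5.0 are invented; scipy's `ttest_1samp` uses the t-distribution with n − 1 degrees of freedom here):

```python
from scipy import stats

# Hypothetical sample where the population std dev is unknown
sample = [5.1, 4.9, 6.2, 5.7, 5.5, 4.8, 5.9, 6.0]

# H0: population mean equals 5.0; the t-distribution with
# n - 1 = 7 degrees of freedom is used in place of the normal
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)
```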
Estimation as a follow-up to a Hypothesis Test

When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true
value of the population mean.

Two-Sample T Tests

Independent Vs Dependent Samples

When we have independent samples we assume that the scores of one sample do not affect the
other.

unpaired t-test

In two dependent samples of data, each score in one sample is paired with a specific score in the
other sample.
paired t-test
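Both tests can be sketched with scipy (all scores below are invented for illustration):

```python
from scipy import stats

# Independent samples: scores from two separate groups (hypothetical data)
group_a = [23, 25, 28, 30, 27, 24, 26]
group_b = [31, 33, 29, 35, 32, 30, 34]

# Unpaired (independent) t-test: assumes the groups do not affect each other
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Dependent samples: the same subjects measured before and after
before = [80, 75, 90, 85, 70, 88]
after  = [84, 78, 95, 86, 75, 91]

# Paired t-test: each "before" score is paired with its "after" score
t_rel, p_rel = stats.ttest_rel(before, after)
print(p_ind, p_rel)
```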

Hypothesis Testing (Categorical Data)

The chi-square test is used for categorical data. It can be used to estimate how closely the distribution of a categorical variable matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical variables are independent of one another (the test of independence).

goodness-of-fit

degrees of freedom (df) = number of categories (c) − 1

test of independence

degrees of freedom (df) = (rows − 1)(columns − 1)
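Both variants can be sketched with scipy (all counts below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Goodness-of-fit: do 60 die rolls match a fair die? (hypothetical counts)
observed = np.array([8, 9, 12, 11, 8, 12])
expected = np.full(6, 10.0)          # fair die: 60 rolls / 6 faces
chi2_gof, p_gof = stats.chisquare(observed, expected)   # df = 6 - 1 = 5

# Test of independence on a 2x3 contingency table (hypothetical counts)
table = np.array([[20, 30, 25],
                  [25, 20, 30]])
chi2_ind, p_ind, dof, exp_counts = stats.chi2_contingency(table)
print(p_gof, dof)   # dof = (2 - 1) * (3 - 1) = 2
```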

Hypothesis Testing (More Than Two Group Means)

Analysis of Variance (ANOVA) allows us to test the hypothesis that the means of more than two populations are equal (assuming the groups have equal variances). We could conduct a series of pairwise t-tests instead of ANOVA, but that would be tedious and would inflate the overall Type I error rate.

We follow a series of steps to perform ANOVA:

1. Calculate the total sum of squares (SST).
2. Calculate the sum of squares between groups (SSB).
3. Find the sum of squares within groups (SSW) by subtracting SSB from SST.
4. Solve for the degrees of freedom for the test.
5. Using these values, calculate the Mean Squares Between (MSB) and Mean Squares Within (MSW).
6. Calculate the F statistic as the ratio MSB/MSW.
7. It is easy to fill in the ANOVA table from here; once the SS and df columns are filled in, the remaining MS and F values are simple calculations.
8. Find F-critical.

ANOVA formulas

If the F-value from the ANOVA test is greater than the F-critical value, we reject our null hypothesis.
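The steps above can be sketched end-to-end with scipy, which computes the F statistic directly (all group scores are invented for illustration):

```python
from scipy import stats

# Three hypothetical groups; H0: all population means are equal
g1 = [85, 86, 88, 75, 78, 94]
g2 = [91, 92, 93, 85, 87, 84]
g3 = [79, 78, 88, 94, 92, 85]

# f_oneway computes SSB, SSW, MSB, MSW and returns F = MSB / MSW
f_stat, p_value = stats.f_oneway(g1, g2, g3)

# Comparing F to F-critical is equivalent to comparing p to alpha
alpha = 0.05
df_between, df_within = 3 - 1, 18 - 3     # k - 1 and N - k
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
print(f_stat > f_critical, p_value < alpha)   # the two checks always agree
```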

One-Way ANOVA

The one-way ANOVA method is the procedure for testing the null hypothesis that the population means of the groups defined by a single independent variable are equal.

Two-Way ANOVA

The two-way ANOVA method is the procedure for testing null hypotheses about the population means of the groups defined by two independent variables. With this method, we are not only able to study the effect of each independent variable, but also the interaction between these variables.

We could also run two separate one-way ANOVAs, but two-way ANOVA gives us efficiency, control & interaction.

Quantitative Data (Correlation & Regression)

Correlation

Correlation refers to a mutual relationship or association between quantitative variables. It can help in predicting one quantity from another. Note, however, that correlation by itself does not establish a causal relationship. It is used as a basic quantity and a foundation for many other modeling techniques.
Pearson Correlation
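The Pearson correlation coefficient r can be computed with scipy (the paired measurements below, hours studied vs. exam score, are invented for illustration):

```python
from scipy import stats

# Hypothetical paired measurements: hours studied vs. exam score
hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 70, 75, 78]

# r near +1: strong positive association; near -1: strong negative;
# near 0: little linear association
r, p_value = stats.pearsonr(hours, scores)
print(round(r, 2))
```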

Regression

Regression analysis is a set of statistical processes for estimating the relationships among
variables.


Simple Regression

This method uses a single independent variable to predict a dependent variable by fitting the best
relationship.
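A simple linear regression can be sketched with scipy's `linregress` (the x and y values are invented for illustration):

```python
from scipy import stats

# Hypothetical data: one independent variable (x) predicting y
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]

# Fit the best straight line: y_hat = slope * x + intercept
result = stats.linregress(x, y)
print(round(result.slope, 2), round(result.intercept, 2))

# Use the fitted relationship to predict y for a new x value
y_hat = result.slope * 7 + result.intercept
```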
Multiple Regression

This method uses more than one independent variable to predict a dependent variable by fitting
the best relationship.

Multiple regression works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or more predictor variables are highly correlated.
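One simple way to screen for multicollinearity is to inspect the pairwise correlations between predictors. A sketch with numpy (the predictors are synthetic; x3 is deliberately constructed as a near-copy of x1):

```python
import numpy as np

# Hypothetical predictors: x3 is nearly a copy of x1, so the pair
# is highly correlated (multicollinear); x2 is independent of both
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.01, size=100)

# Pairwise correlation matrix; |r| near 1 off the diagonal flags a problem
corr = np.corrcoef([x1, x2, x3])
print(round(corr[0, 2], 2))   # near 1.0 for the collinear pair
```

In practice, variance inflation factors (VIF) are a more thorough diagnostic, but a correlation matrix is a quick first check.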

Nonlinear Regression

In this method, observational data are modeled by a function which is a nonlinear combination
of the model parameters and depends on one or more independent variables.
Significance in Data Science

In data science, inferential statistics is used in many ways:

 Making inferences about the population from the sample.


 Concluding whether a sample is significantly different from the population.
 Deciding whether adding or removing a feature will really improve a model.
 Deciding whether one model is significantly better than another.
 Hypothesis testing in general.
