
Hypothesis Testing PDF

The document discusses different statistical tests that can be used for analyzing categorical and continuous variables from sample data including the chi-square test, t-tests, ANOVA tests, and correlation. Examples of applying each test using Python are shown.

Uploaded by

mdkashif1299

Chi-Square Test

The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant
association between the two variables.
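As a minimal sketch of what "no association" means in practice: under independence, the expected count in each cell is the row total times the column total divided by the grand total. The counts below are copied from the sex-by-smoker crosstab of the tips dataset used in this section; the variable names are illustrative only.

```python
import numpy as np

# Observed counts (rows: Male, Female; columns: smoker Yes, No),
# copied from the crosstab of the tips dataset.
observed = np.array([[60, 97],
                     [33, 54]])

row_totals = observed.sum(axis=1)   # [157, 87]
col_totals = observed.sum(axis=0)   # [93, 151]
grand_total = observed.sum()        # 244

# Expected counts if sex and smoker status were independent.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

These are the same expected values that chi2_contingency returns as the last element of its result.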

import scipy.stats as stats

import seaborn as sns


import pandas as pd
import numpy as np
dataset=sns.load_dataset('tips')

dataset.head()

total_bill tip sex smoker day time size

0 16.99 1.01 Female No Sun Dinner 2

1 10.34 1.66 Male No Sun Dinner 3

2 21.01 3.50 Male No Sun Dinner 3

3 23.68 3.31 Male No Sun Dinner 2

4 24.59 3.61 Female No Sun Dinner 4

dataset_table=pd.crosstab(dataset['sex'],dataset['smoker'])
print(dataset_table)

smoker Yes No
sex
Male 60 97
Female 33 54

dataset_table.values

array([[60, 97],
[33, 54]], dtype=int64)

#Observed Values
Observed_Values = dataset_table.values
print("Observed Values :-\n",Observed_Values)

Observed Values :-
[[60 97]
[33 54]]

val=stats.chi2_contingency(dataset_table)

val

(0.008763290531773594,
0.925417020494423,
1,
array([[59.84016393, 97.15983607],
[33.15983607, 53.84016393]]))
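The statistic returned here is not the plain Pearson chi-square: for a 2x2 table (one degree of freedom), chi2_contingency applies Yates' continuity correction by default. The sketch below hard-codes the table above and turns the correction off with correction=False. The corrected value may differ between SciPy releases, so only the uncorrected statistic should be treated as stable.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[60, 97],
                  [33, 54]])

# Default: Yates' continuity correction is applied because dof == 1.
stat_corrected, p_corrected, dof, expected = chi2_contingency(table)

# correction=False gives the plain Pearson statistic sum((O - E)**2 / E).
stat_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print("with correction   :", stat_corrected, p_corrected)
print("without correction:", stat_plain, p_plain)
```

The uncorrected statistic is the one that a manual sum of (O - E)**2 / E over all cells reproduces.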

Expected_Values=val[3]

no_of_rows=len(dataset_table.iloc[0:2,0])
no_of_columns=len(dataset_table.iloc[0,0:2])
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)
alpha = 0.05

Degree of Freedom:- 1

from scipy.stats import chi2


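# Pearson chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
# No continuity correction is applied here, so the value differs from val[0] above.
chi_square_statistic=((Observed_Values-Expected_Values)**2/Expected_Values).sum()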

print("chi-square statistic:-",chi_square_statistic)

chi-square statistic:- 0.001934818536627623


critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

critical_value: 3.841458820694124

#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)

p-value: 0.964915107315732
Significance level: 0.05
Degree of Freedom: 1

if chi_square_statistic>=critical_value:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Fail to reject H0: no evidence of a relationship between the 2 categorical variables")

if p_value<=alpha:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Fail to reject H0: no evidence of a relationship between the 2 categorical variables")

Fail to reject H0: no evidence of a relationship between the 2 categorical variables
Fail to reject H0: no evidence of a relationship between the 2 categorical variables
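The critical-value rule and the p-value rule always lead to the same decision, because the p-value is at most alpha exactly when the statistic is at least the critical value. A small sketch, with the statistic and degrees of freedom from this section hard-coded so the snippet runs on its own:

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1
statistic = 0.001934818536627623  # chi-square statistic computed in this section

critical_value = chi2.ppf(1 - alpha, df=dof)  # upper-tail cutoff
p_value = chi2.sf(statistic, df=dof)          # survival function = 1 - cdf

reject_by_critical_value = statistic >= critical_value
reject_by_p_value = p_value <= alpha
print(critical_value, p_value)
print(reject_by_critical_value, reject_by_p_value)
```

chi2.sf is used here in place of 1 - chi2.cdf; the two are mathematically the same, and sf tends to be more accurate for very small p-values.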

T Test
A t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups,
or between a sample mean and a hypothesized value.

Three variants are covered here : 1. one-sample t-test 2. two-sample t-test 3. paired t-test.

One-sample T-test with Python


The test tells us whether the mean of the sample differs significantly from a hypothesized population mean.

ages=[10,20,35,50,28,40,55,18,16,55,30,25,43,18,30,28,14,24,16,17,32,35,26,27,65,18,43,23,21,20,19,70]

len(ages)

32

from statsmodels.stats.weightstats import ztest

import numpy as np
ages_mean=np.mean(ages)
print(ages_mean)

30.34375

## Let's take a random sample of 8 ages (np.random.choice samples with replacement by default)
sample_size=8
age_sample=np.random.choice(ages,sample_size)

age_sample

array([35, 16, 43, 30, 24, 43, 30, 27])

np.mean(age_sample)

31.0

from scipy.stats import ttest_1samp

ttest,p_value=ttest_1samp(age_sample,30)

print(p_value)

0.7681189381229006

if p_value < 0.05: # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis
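To make the p-value above less of a black box, here is a sketch that recomputes the one-sample t statistic by hand and compares it with ttest_1samp. The sample is the age_sample array printed above, hard-coded so the snippet does not depend on the random draw.

```python
import numpy as np
from scipy import stats

age_sample = np.array([35, 16, 43, 30, 24, 43, 30, 27])
popmean = 30

n = len(age_sample)
sample_mean = age_sample.mean()
sample_sd = age_sample.std(ddof=1)        # sample standard deviation
standard_error = sample_sd / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error, with n - 1 degrees of freedom
t_stat = (sample_mean - popmean) / standard_error
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(age_sample, popmean)
print(t_stat, p_manual)
print(t_scipy, p_scipy)
```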

Some More Examples


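Consider the ages of all students in a school and the ages of the students in one class, Class A. The one-sample t-test checks whether the Class A mean differs from the school-wide mean.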

import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)
school_ages=stats.poisson.rvs(loc=18,mu=35,size=1500)
classA_ages=stats.poisson.rvs(loc=18,mu=30,size=25)

np.mean(school_ages)

53.303333333333335

classA_ages.mean()

48.2

_,p_value=stats.ttest_1samp(a=classA_ages,popmean=school_ages.mean())

p_value

3.26936314797003e-05

school_ages.mean()

53.303333333333335

if p_value < 0.05: # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we are rejecting null hypothesis

Two-sample T-test With Python


The independent samples t-test (or two-sample t-test) compares the means of two independent groups in order to determine whether there
is statistical evidence that the associated population means are significantly different. It is a parametric test, also known as the
independent t-test. The call below passes equal_var=False, which runs Welch's t-test and does not assume equal variances in the two groups.

np.random.seed(12)
ClassB_ages=stats.poisson.rvs(loc=18,mu=33,size=60)
ClassB_ages.mean()

50.63333333333333

_,p_value=stats.ttest_ind(a=classA_ages,b=ClassB_ages,equal_var=False)

p_value

0.06021969607248894

if p_value < 0.05: # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis
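As a sketch of what equal_var=False computes, the snippet below rebuilds Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with ttest_ind. The two arrays are hypothetical ages, not the Poisson samples generated above.

```python
import numpy as np
from scipy import stats

# Hypothetical ages for two independent groups (not the data used above).
group_a = np.array([48, 51, 45, 50, 47, 52, 49, 46])
group_b = np.array([50, 55, 53, 49, 58, 52, 54, 51, 56, 50])

mean_a, mean_b = group_a.mean(), group_b.mean()
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
n_a, n_b = len(group_a), len(group_b)

# Welch's t statistic: difference in means over its unpooled standard error.
se_sq_a, se_sq_b = var_a / n_a, var_b / n_b
t_manual = (mean_a - mean_b) / np.sqrt(se_sq_a + se_sq_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (se_sq_a + se_sq_b) ** 2 / (
    se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```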

Paired T-test With Python


When you have two measurements on the same subjects (for example, before and after a treatment) and want to check whether the mean difference between them is zero, you can use a paired t-test.

weight1=[25,30,28,35,28,34,26,29,30,26,28,32,31,30,45]
weight2=weight1+stats.norm.rvs(scale=5,loc=-1.25,size=15)

print(weight1)
print(weight2)

[25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45]
[30.57926457 34.91022437 29.00444617 30.54295091 19.86201983 37.57873174
18.3299827 21.3771395 36.36420881 32.05941216 26.93827982 29.519014
26.42851213 30.50667769 41.32984284]

weight_df=pd.DataFrame({"weight_10":np.array(weight1),
"weight_20":np.array(weight2),
"weight_change":np.array(weight2)-np.array(weight1)})

weight_df

weight_10 weight_20 weight_change

0 25 30.579265 5.579265

1 30 34.910224 4.910224

2 28 29.004446 1.004446

3 35 30.542951 -4.457049

4 28 19.862020 -8.137980

5 34 37.578732 3.578732

6 26 18.329983 -7.670017

7 29 21.377139 -7.622861

8 30 36.364209 6.364209

9 26 32.059412 6.059412

10 28 26.938280 -1.061720

11 32 29.519014 -2.480986

12 31 26.428512 -4.571488

13 30 30.506678 0.506678

14 45 41.329843 -3.670157

_,p_value=stats.ttest_rel(a=weight1,b=weight2)

print(p_value)

0.5732936534411279

if p_value < 0.05: # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis
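A paired t-test is equivalent to a one-sample t-test on the pairwise differences against zero. The sketch below shows this with hypothetical before/after weights (not the randomly generated weights above).

```python
import numpy as np
from scipy import stats

# Hypothetical before/after weights for the same six subjects.
before = np.array([25.0, 30.0, 28.0, 35.0, 28.0, 34.0])
after = np.array([26.5, 29.0, 30.0, 33.5, 29.5, 36.0])

t_paired, p_paired = stats.ttest_rel(before, after)

# Equivalent: one-sample t-test of the pairwise differences against 0.
differences = before - after
t_diff, p_diff = stats.ttest_1samp(differences, 0.0)

print(t_paired, p_paired)
print(t_diff, p_diff)
```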

Correlation
import seaborn as sns
df=sns.load_dataset('iris')

df.shape

(150, 5)

# numeric_only=True skips the non-numeric species column (needed on pandas 2.0+)
df.corr(numeric_only=True)

sepal_length sepal_width petal_length petal_width

sepal_length 1.000000 -0.117570 0.871754 0.817941

sepal_width -0.117570 1.000000 -0.428440 -0.366126

petal_length 0.871754 -0.428440 1.000000 0.962865

petal_width 0.817941 -0.366126 0.962865 1.000000
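The matrix above reports Pearson correlation coefficients only; it does not say whether a coefficient is significantly different from zero. As a hedged sketch, scipy.stats.pearsonr returns both the coefficient and a two-sided p-value for the null hypothesis of no linear correlation. The arrays below are hypothetical measurements, not rows from iris.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements.
x = np.array([5.1, 4.9, 6.2, 5.9, 6.7, 7.0, 5.5, 6.3])
y = np.array([1.4, 1.5, 4.3, 4.2, 5.0, 4.7, 3.8, 4.9])

r, p_value = stats.pearsonr(x, y)

# The coefficient matches the off-diagonal entry of the correlation matrix.
r_matrix = np.corrcoef(x, y)[0, 1]
print(r, p_value, r_matrix)
```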

sns.pairplot(df)

<seaborn.axisgrid.PairGrid at 0x29595f8ea60>

Anova Test(F-Test)
The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time.

For example, if we wanted to test whether petal_width differs based on some categorical variable like species, we have to compare
the means of each level or group of that variable.

One Way F-test(Anova) :-

It tells us whether the means of two or more groups differ significantly, based on the F-score (the ratio of between-group to within-group variance).

Example : there are 3 different categories of iris flowers, and we need to check whether the mean petal width is the same across all 3 groups.
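To show where the F-score comes from, here is a minimal sketch that computes the between-group and within-group sums of squares by hand and compares the result with f_oneway. The three groups are small hypothetical samples, not the iris data.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups.
groups = [
    np.array([0.2, 0.3, 0.2, 0.4, 0.1]),
    np.array([1.3, 1.5, 1.0, 1.4, 1.2]),
    np.array([2.1, 1.8, 2.5, 2.0, 2.3]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

A large F means the group means are spread out much more than the scatter within each group would explain.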

import seaborn as sns


df1=sns.load_dataset('iris')

df1.head()

sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

df_anova = df1[['petal_width','species']]

grps = pd.unique(df_anova.species.values)

grps

array(['setosa', 'versicolor', 'virginica'], dtype=object)

d_data = {grp:df_anova['petal_width'][df_anova.species == grp] for grp in grps}

d_data

{'setosa': 0 0.2
1 0.2
2 0.2
3 0.2
4 0.2
5 0.4
6 0.3
7 0.2
8 0.2
9 0.1
10 0.2
11 0.2
12 0.1
13 0.1
14 0.2
15 0.4
16 0.4
17 0.3
18 0.3
19 0.3
20 0.2
21 0.4
22 0.2
23 0.5
24 0.2
25 0.2
26 0.4
27 0.2
28 0.2
29 0.2
30 0.2
31 0.4
32 0.1
33 0.2
34 0.2
35 0.2
36 0.2
37 0.1
38 0.2
39 0.2
40 0.3
41 0.3
42 0.2
43 0.6
44 0.4
45 0.3
46 0.2
47 0.2
48 0.2
49 0.2
Name: petal_width, dtype: float64,
'versicolor': 50 1.4
51 1.5
52 1.5
53 1.3
54 1.5
55 1.3
56 1.6
57 1.0
58 1.3
59 1.4
60 1.0
61 1.5
62 1.0
63 1.4
64 1.3
65 1.4
66 1.5
67 1.0
68 1.5
69 1.1
70 1.8
71 1.3
72 1.5
73 1.2
74 1.3
75 1.4
76 1.4
77 1.7
78 1.5
79 1.0
80 1.1
81 1.0
82 1.2
83 1.6
84 1.5
85 1.6
86 1.5
87 1.3
88 1.3
89 1.3
90 1.2
91 1.4
92 1.2
93 1.0
94 1.3
95 1.2
96 1.3
97 1.3
98 1.1
99 1.3
Name: petal_width, dtype: float64,
'virginica': 100 2.5
101 1.9
102 2.1
103 1.8
104 2.2
105 2.1
106 1.7
107 1.8
108 1.8
109 2.5
110 2.0
111 1.9
112 2.1
113 2.0
114 2.4
115 2.3
116 1.8
117 2.2
118 2.3
119 1.5
120 2.3
121 2.0
122 2.0
123 1.8
124 2.1
125 1.8
126 1.8
127 1.8
128 2.1
129 1.6
130 1.9
131 2.0
132 2.2
133 1.5
134 1.4
135 2.3
136 2.4
137 1.8
138 1.8
139 2.1
140 2.4
141 2.3
142 1.9
143 2.3
144 2.5
145 2.3
146 1.9
147 2.0
148 2.3
149 1.8
Name: petal_width, dtype: float64}

F, p = stats.f_oneway(d_data['setosa'], d_data['versicolor'], d_data['virginica'])

print(p)

4.169445839443116e-85

if p<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

reject null hypothesis

# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

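# Generate a random array of 50 numbers with mean 110, similar to IQ scores data.
# Note: the spread passed to randn below is 15/sqrt(50) (about 2.12), not 15,
# so the printed stdv is close to 2.12 rather than 15.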
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha =0.05
null_mean =100
data = sd_iq*randn(50)+mean_iq

# print mean and sd


print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

mean=109.61 stdv=2.22

# Now we perform the test. We pass the data, the mean under the null hypothesis
# in the value parameter, and alternative='larger' to test whether the true
# mean is larger than null_mean.

ztest_Score, p_value= ztest(data,value = null_mean, alternative='larger')

# The function returns the z-score and the corresponding p-value. We compare the
# p-value with alpha: if it is greater than alpha we do not reject the null
# hypothesis, else we reject it.

if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Reject Null Hypothesis
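The following sketch shows what a one-sample z-test with the alternative "larger" computes, using only NumPy and SciPy. It assumes the standard error is based on the sample standard deviation with ddof=1, which is assumed here to match the statsmodels default. The data are hypothetical IQ-like scores drawn with standard deviation 15 and a fixed seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
null_mean = 100
data = rng.normal(loc=110, scale=15, size=50)  # hypothetical IQ-like scores

# z = (sample mean - null mean) / standard error of the mean
standard_error = data.std(ddof=1) / np.sqrt(len(data))
z_score = (data.mean() - null_mean) / standard_error

# One-sided p-value for the alternative "mean is larger than null_mean".
p_value = stats.norm.sf(z_score)
print(z_score, p_value)
```

The standard error already divides by the square root of the sample size, so the data themselves are generated with the full standard deviation.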

### EDA Assignment


from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
df=housing['data']

df.head()

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence

0 1.0 60.0 RL 65.0 8450.0 Pave None Reg Lvl AllPub ... 0.0 0.0 None None

1 2.0 20.0 RL 80.0 9600.0 Pave None Reg Lvl AllPub ... 0.0 0.0 None None

2 3.0 60.0 RL 68.0 11250.0 Pave None IR1 Lvl AllPub ... 0.0 0.0 None None

3 4.0 70.0 RL 60.0 9550.0 Pave None IR1 Lvl AllPub ... 0.0 0.0 None None

4 5.0 60.0 RL 84.0 14260.0 Pave None IR1 Lvl AllPub ... 0.0 0.0 None None

5 rows × 80 columns

import seaborn as sns
sns.load_dataset('titanic')

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone

0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False

1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False

2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True

3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False

4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True

887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True

888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False

889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True

890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

#Assignments: EDA Of Algerian Dataset


#https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++

#Assignments: Housing Dataset(California)

