Chi-square
import seaborn as sns
# creating df
df = sns.load_dataset('tips')
df.head(5)
df.dtypes
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
df.isnull().sum()
total_bill 0
tip 0
sex 0
smoker 0
day 0
time 0
size 0
dtype: int64
df.shape
(244, 7)
categorical_columns
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
...
239 5.92
240 2.00
241 2.00
242 1.75
243 3.00
Name: tip, Length: 244, dtype: float64
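#Initialize OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse
encoder = OneHotEncoder(sparse_output=False, dtype=int)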
encoder
# df[['sex','day']]
#df[categorical_columns]
encoded_df
df_encoded
time_Dinner time_Lunch
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
.. ... ...
239 1 0
240 1 0
241 1 0
242 1 0
243 1 0
df_encoded
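from sklearn.feature_selection import chi2

# chi2 needs non-negative features and a discrete target, so tip is truncated to integers
X = df_encoded.drop('tip', axis=1)
y = df_encoded['tip'].astype('int')
chi2_value, p_value = chi2(X, y)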
chi2_value
p_value
import numpy as np
import pandas as pd

p_value = np.around(p_value, 6)
temp = pd.DataFrame({'features': X.columns, 'chi_square': chi2_value, 'p_value': p_value})
temp
temp.sort_values('chi_square',ascending=False)
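For intuition about what these scores measure: for each feature, the score compares the feature total observed within each class of y against the total expected if the feature were spread across classes in proportion to class frequency. A rough NumPy sketch of that idea on made-up data follows. The function name chi2_scores and the toy arrays are hypothetical, and this illustrates the calculation rather than reproducing the library's code.

```python
import numpy as np

def chi2_scores(X, y):
    """Per-feature chi-square score of non-negative features X against class labels y."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    # One-hot class membership: rows are samples, columns are classes.
    Y = (np.asarray(y)[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                      # feature total inside each class
    class_prob = Y.mean(axis=0)             # share of samples in each class
    feature_total = X.sum(axis=0)           # feature total over all samples
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: each feature is active in exactly one class.
X_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_toy = [0, 0, 1, 1]
scores = chi2_scores(X_toy, y_toy)
print(scores)
```

A larger score means the feature's total is distributed across the target classes less like chance would suggest, which is why the table is sorted by chi_square in descending order.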
Let's walk through a worked example of a Chi-square test of independence. Suppose we are analyzing the
relationship between gender and preference for a new product. We want to determine whether
gender (male/female) is associated with liking or disliking the product. This gives us two
categorical variables: Gender and Product Preference.
Data Example:
Gender   Likes Product   Dislikes Product   Total
Male     30              10                 40
Female   20              30                 50
Total    50              40                 90
Hypotheses:
• Null Hypothesis (H₀): Gender and product preference are independent (no association).
• Alternative Hypothesis (H₁): Gender and product preference are dependent (there is an
association).
Step-by-Step Chi-square Test:
1. Observed Frequencies:
The table above shows the observed frequencies (real counts from our data).
2. Expected Frequencies:
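We calculate the expected frequency for each cell assuming that the two variables are
independent, using the formula:
\[ E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{\text{Grand Total}} \]
For example, the expected frequency of males who like the product is:
\[ E_{\text{Male, Likes}} = \frac{40 \times 50}{90} \approx 22.22 \]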
Gender   Likes Product (Expected)   Dislikes Product (Expected)
Male     22.22                      17.78
Female   27.78                      22.22
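The expected counts can be checked with a few lines of plain Python. This is a small sketch using the observed table from this example; the variable names are illustrative.

```python
# Observed counts from the table: rows are Male, Female; columns are Likes, Dislikes.
observed = [[30, 10],
            [20, 30]]

row_totals = [sum(row) for row in observed]        # totals per gender
col_totals = [sum(col) for col in zip(*observed)]  # totals per preference
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(['Male', 'Female'], expected):
    print(label, [round(value, 2) for value in row])
```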
3. Chi-square Statistic:
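We compute the Chi-square statistic using the formula:
\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where \(O_{ij}\) is the observed frequency and \(E_{ij}\) is the expected frequency. For this table,
summing over the four cells (with unrounded expected frequencies) gives \(\chi^2 \approx 11.02\).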
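As a sketch of the arithmetic, the statistic adds up (observed - expected)^2 / expected over all four cells. The snippet recomputes the expected counts from the observed table so that it runs on its own.

```python
observed = [[30, 10],   # Male: likes, dislikes
            [20, 30]]   # Female: likes, dislikes
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count for cell (i, j)
        chi_square += (o - e) ** 2 / e

print(round(chi_square, 3))
```

Working from expected counts already rounded to two decimals gives a slightly different last digit, so it is safer to round only at the end.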
4. Degrees of Freedom:
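The degrees of freedom for a Chi-square test of independence are calculated as:
\[ df = (r - 1)(c - 1) \]
Where:
• \(r\) is the number of rows (2 gender categories)
• \(c\) is the number of columns (2 preference categories)
So \(df = (2 - 1)(2 - 1) = 1\).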
5. p-value:
Using the Chi-square distribution table or a calculator, we find the p-value for \(\chi^2 = 11.02\)
and 1 degree of freedom: \(p \approx 0.0009\), which is well below the 0.05 significance level.
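With exactly 1 degree of freedom, the p-value can be computed without a table, because a Chi-square variable with 1 degree of freedom is the square of a standard normal variable. A minimal sketch using only the standard library (the shortcut does not apply to other degrees of freedom):

```python
import math

chi_square = 11.025   # unrounded statistic for the 2x2 table in this example

# Valid for 1 degree of freedom only:
# P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for a standard normal Z.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"p-value = {p_value:.4f}")
```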
6. Conclusion:
Since p < 0.05, we reject the null hypothesis and conclude that gender and product preference
are dependent: there is a statistically significant association between gender and
liking/disliking the product.
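The whole test can also be run in one call. The sketch below assumes SciPy is available and uses scipy.stats.chi2_contingency, which is not part of the notebook cells above. Note that SciPy applies Yates' continuity correction to 2x2 tables by default, so it is switched off here to match the hand calculation.

```python
from scipy.stats import chi2_contingency

observed = [[30, 10],   # Male: likes, dislikes
            [20, 30]]   # Female: likes, dislikes

# correction=False turns off Yates' continuity correction, which SciPy applies
# to 2x2 tables by default and which would give a smaller statistic.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi-square = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: gender and product preference appear to be associated.")
else:
    print("Fail to reject H0.")
```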