Chi-square

The document outlines a data analysis process using the 'tips' dataset from Seaborn, focusing on feature extraction and one-hot encoding of categorical variables. It includes steps for preparing the data, performing a Chi-square test to assess the relationship between features and the target variable 'tip', and provides an example of how to conduct a Chi-square test with hypothetical data. The analysis results show the Chi-square values and p-values for various features, indicating their significance in relation to the target variable.


# import libraries
import seaborn as sns
import pandas as pd

# create the dataframe from the seaborn 'tips' dataset
df = sns.load_dataset('tips')

df.head(5)

total_bill tip sex smoker day time size


0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

df.dtypes

total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object

df.isnull().sum()

total_bill 0
tip 0
sex 0
smoker 0
day 0
time 0
size 0
dtype: int64

df.shape

(244, 7)

## Extract categorical columns from the dataframe

# Here we extract the columns with 'category' dtype, as they are the categorical columns
categorical_columns = df.select_dtypes(include=['category']).columns.tolist()

categorical_columns

['sex', 'smoker', 'day', 'time']


# creating features (all columns except the target 'tip')
X = df.drop('tip', axis=1)
X

total_bill sex smoker day time size


0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
.. ... ... ... ... ... ...
239 29.03 Male No Sat Dinner 3
240 27.18 Female Yes Sat Dinner 2
241 22.67 Male Yes Sat Dinner 2
242 17.82 Male No Sat Dinner 2
243 18.78 Female No Thur Dinner 2

[244 rows x 6 columns]

# creating target variable
y = df['tip']
y

0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
...
239 5.92
240 2.00
241 2.00
242 1.75
243 3.00
Name: tip, Length: 244, dtype: float64

# one-hot encoding using OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder (dense output, integer dtype)
encoder = OneHotEncoder(sparse=False, dtype=int)

encoder

OneHotEncoder(dtype=<class 'int'>, sparse=False)

# Apply one-hot encoding to the categorical columns (e.g. df[['sex', 'day']] or df[categorical_columns])
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

C:\Users\lucky\anaconda3\Lib\site-packages\sklearn\preprocessing\_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
  warnings.warn(
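The FutureWarning above only concerns the parameter name: from scikit-learn 1.2 onward the dense-output flag is called sparse_output. A minimal sketch of the equivalent initialization on a newer scikit-learn (assuming version >= 1.2) would be:

# On scikit-learn >= 1.2 the `sparse` flag is called `sparse_output`;
# this produces the same dense integer array as the call above.
encoder = OneHotEncoder(sparse_output=False, dtype=int)
one_hot_encoded = encoder.fit_transform(df[categorical_columns])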

# Create a DataFrame with the one-hot encoded columns
# We use get_feature_names_out() to get the column names for the encoded data
encoded_df = pd.DataFrame(one_hot_encoded,
                          columns=encoder.get_feature_names_out(categorical_columns))

encoded_df

     sex_Female  sex_Male  smoker_No  smoker_Yes  day_Fri  day_Sat  day_Sun  day_Thur  time_Dinner  time_Lunch
0             1         0          1           0        0        0        1         0            1           0
1             0         1          1           0        0        0        1         0            1           0
2             0         1          1           0        0        0        1         0            1           0
3             0         1          1           0        0        0        1         0            1           0
4             1         0          1           0        0        0        1         0            1           0
..          ...       ...        ...         ...      ...      ...      ...       ...          ...         ...
239           0         1          1           0        0        1        0         0            1           0
240           1         0          0           1        0        1        0         0            1           0
241           0         1          0           1        0        1        0         0            1           0
242           0         1          1           0        0        1        0         0            1           0
243           1         0          1           0        0        0        0         1            1           0

[244 rows x 10 columns]

# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, encoded_df], axis=1)

df_encoded

     total_bill   tip     sex smoker   day    time  size  sex_Female  sex_Male  smoker_No  smoker_Yes  day_Fri  day_Sat  day_Sun  day_Thur  time_Dinner  time_Lunch
0         16.99  1.01  Female     No   Sun  Dinner     2           1         0          1           0        0        0        1         0            1           0
1         10.34  1.66    Male     No   Sun  Dinner     3           0         1          1           0        0        0        1         0            1           0
2         21.01  3.50    Male     No   Sun  Dinner     3           0         1          1           0        0        0        1         0            1           0
3         23.68  3.31    Male     No   Sun  Dinner     2           0         1          1           0        0        0        1         0            1           0
4         24.59  3.61  Female     No   Sun  Dinner     4           1         0          1           0        0        0        1         0            1           0
..          ...   ...     ...    ...   ...     ...   ...         ...       ...        ...         ...      ...      ...      ...       ...          ...         ...
239       29.03  5.92    Male     No   Sat  Dinner     3           0         1          1           0        0        1        0         0            1           0
240       27.18  2.00  Female    Yes   Sat  Dinner     2           1         0          0           1        0        1        0         0            1           0
241       22.67  2.00    Male    Yes   Sat  Dinner     2           0         1          0           1        0        1        0         0            1           0
242       17.82  1.75    Male     No   Sat  Dinner     2           0         1          1           0        0        1        0         0            1           0
243       18.78  3.00  Female     No  Thur  Dinner     2           1         0          1           0        0        0        0         1            1           0

[244 rows x 17 columns]

# Drop the original categorical columns


df_encoded = df_encoded.drop(categorical_columns, axis=1)

df_encoded

     total_bill   tip  size  sex_Female  sex_Male  smoker_No  smoker_Yes  day_Fri  day_Sat  day_Sun  day_Thur  time_Dinner  time_Lunch
0         16.99  1.01     2           1         0          1           0        0        0        1         0            1           0
1         10.34  1.66     3           0         1          1           0        0        0        1         0            1           0
2         21.01  3.50     3           0         1          1           0        0        0        1         0            1           0
3         23.68  3.31     2           0         1          1           0        0        0        1         0            1           0
4         24.59  3.61     4           1         0          1           0        0        0        1         0            1           0
..          ...   ...   ...         ...       ...        ...         ...      ...      ...      ...       ...          ...         ...
239       29.03  5.92     3           0         1          1           0        0        1        0         0            1           0
240       27.18  2.00     2           1         0          0           1        0        1        0         0            1           0
241       22.67  2.00     2           0         1          0           1        0        1        0         0            1           0
242       17.82  1.75     2           0         1          1           0        0        1        0         0            1           0
243       18.78  3.00     2           1         0          1           0        0        0        0         1            1           0

[244 rows x 13 columns]
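As a side note, the same encode-and-drop result can be produced in one step with pandas.get_dummies. This is an alternative sketch, not part of the original notebook:

# Alternative sketch: pandas.get_dummies one-hot encodes the listed columns
# and drops the originals in a single call, giving the same 13-column frame.
df_encoded_alt = pd.get_dummies(df, columns=categorical_columns, dtype=int)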

# Features: everything except the target
X = df_encoded.drop('tip', axis=1)

# Target: 'tip' cast to int, since sklearn's chi2 expects a discrete (class-like) target
y = df_encoded['tip'].astype('int')

from sklearn.feature_selection import chi2

chi2_value, p_value = chi2(X, y)

chi2_value

array([451.45173551,  23.98377395,   1.99590599,   1.1060116 ,
         2.41410127,   3.9196698 ,   4.64580879,   5.68446664,
         7.50191554,   8.85006698,   3.79699815,   9.82752462])

p_value

array([1.80666902e-92, 2.30619124e-03, 9.81137095e-01, 9.97485792e-01,
       9.65616197e-01, 8.64297389e-01, 7.94674362e-01, 6.82528082e-01,
       4.83569431e-01, 3.55101700e-01, 8.74958733e-01, 2.77340784e-01])

import numpy as np

p_value = np.around(p_value, 6)

temp = pd.DataFrame({'features': X.columns,
                     'chi_square': chi2_value,
                     'p_value': p_value})
temp

features chi_square p_value


0 total_bill 451.451736 0.000000
1 size 23.983774 0.002306
2 sex_Female 1.995906 0.981137
3 sex_Male 1.106012 0.997486
4 smoker_No 2.414101 0.965616
5 smoker_Yes 3.919670 0.864297
6 day_Fri 4.645809 0.794674
7 day_Sat 5.684467 0.682528
8 day_Sun 7.501916 0.483569
9 day_Thur 8.850067 0.355102
10 time_Dinner 3.796998 0.874959
11 time_Lunch 9.827525 0.277341

temp.sort_values('chi_square',ascending=False)

features chi_square p_value


0 total_bill 451.451736 0.000000
1 size 23.983774 0.002306
11 time_Lunch 9.827525 0.277341
9 day_Thur 8.850067 0.355102
8 day_Sun 7.501916 0.483569
7 day_Sat 5.684467 0.682528
6 day_Fri 4.645809 0.794674
5 smoker_Yes 3.919670 0.864297
10 time_Dinner 3.796998 0.874959
4 smoker_No 2.414101 0.965616
2 sex_Female 1.995906 0.981137
3 sex_Male 1.106012 0.997486
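To turn these scores into an actual feature subset, scikit-learn's SelectKBest can wrap the same chi2 scorer. The following is a minimal sketch; the choice of k=5 is illustrative and not part of the original notebook:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the k features with the highest chi-square scores (k=5 is an arbitrary illustration)
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)

# Names of the retained columns
selected_features = X.columns[selector.get_support()]
print(selected_features.tolist())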

Let's walk through an example of a Chi-square test with a small, hypothetical dataset. Imagine we are analyzing the
relationship between gender and preference for a new product. We want to determine whether
gender (male/female) affects the likelihood of liking or disliking the product. This gives two
categorical variables: Gender and Product Preference.

Data Example:
Gender Likes Product Dislikes Product Total
Male 30 10 40
Female 20 30 50
Total 50 40 90

Hypotheses:
• Null Hypothesis (H₀): Gender and product preference are independent (no association).
• Alternative Hypothesis (H₁): Gender and product preference are dependent (there is an
association).
Step-by-Step Chi-square Test:
1. Observed Frequencies:
The table above shows the observed frequencies (real counts from our data).

2. Expected Frequencies:
We calculate the expected frequency for each cell assuming that the two variables are
independent, using the formula:

\[ E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}} \]

For example, the expected frequency of males who like the product is:

\[ E_{\text{Male, Likes}} = \frac{40 \times 50}{90} = 22.22 \]

Similarly, we can calculate the expected frequency for each cell:

Gender   Likes Product (Expected)   Dislikes Product (Expected)
Male     22.22                      17.78
Female   27.78                      22.22
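As a quick sketch (not part of the original text), the expected counts can be computed with NumPy from the observed table via an outer product of the row and column totals:

import numpy as np

# Observed 2x2 table: rows = (Male, Female), columns = (Likes, Dislikes)
observed = np.array([[30, 10],
                     [20, 30]])

row_totals = observed.sum(axis=1)    # [40, 50]
col_totals = observed.sum(axis=0)    # [50, 40]
grand_total = observed.sum()         # 90

# E_ij = (row total_i * column total_j) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))             # [[22.22 17.78] [27.78 22.22]]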

3. Chi-square Statistic:
We compute the Chi-square statistic using the formula:

\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where \(O_{ij}\) is the observed frequency and \(E_{ij}\) is the expected frequency.

For each cell:

• For males who like the product: \(\frac{(30 - 22.22)^2}{22.22} = 2.72\)
• For males who dislike the product: \(\frac{(10 - 17.78)^2}{17.78} = 3.40\)
• For females who like the product: \(\frac{(20 - 27.78)^2}{27.78} = 2.18\)
• For females who dislike the product: \(\frac{(30 - 22.22)^2}{22.22} = 2.72\)

The total Chi-square statistic is:

\[ \chi^2 = 2.72 + 3.40 + 2.18 + 2.72 = 11.02 \]

4. Degrees of Freedom:
The degrees of freedom for a Chi-square test of independence are calculated as:

\[ \text{Degrees of Freedom} = (r - 1) \times (c - 1) \]

where:

• \(r\) is the number of rows (2: male and female).
• \(c\) is the number of columns (2: like and dislike).

\[ \text{Degrees of Freedom} = (2 - 1) \times (2 - 1) = 1 \]

5. p-value:
Using a Chi-square distribution table or a calculator, we find the p-value for \(\chi^2 = 11.02\) with 1 degree of freedom.

The p-value turns out to be p ≈ 0.0009.
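For reference, the same p-value can be obtained programmatically. A minimal sketch using SciPy's chi-square survival function is shown below (SciPy is an assumption here; it is not used elsewhere in this document):

from scipy.stats import chi2

# P(X >= 11.02) for a chi-square distribution with 1 degree of freedom
p = chi2.sf(11.02, df=1)
print(round(p, 4))  # ~0.0009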

6. Conclusion:
Since p < 0.05, we reject the null hypothesis and conclude that gender and product preference
are dependent — there is a significant relationship between gender and liking/disliking the
product.

This is how you would apply a Chi-square test in practice!
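The whole worked example can also be reproduced with scipy.stats.chi2_contingency. The sketch below is for verification only, with Yates' continuity correction disabled so the statistic matches the hand calculation above:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (Male, Female), columns = (Likes, Dislikes)
observed = np.array([[30, 10],
                     [20, 30]])

# correction=False disables Yates' continuity correction so the statistic
# matches the manual computation (chi2 ≈ 11.02, p ≈ 0.0009, dof = 1)
chi2_stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p, dof)
print(expected)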
