Assignment On ANOVA

Uploaded by

mohammed.ansari

Assignment on ANOVA

Name: Ansari Mohammed Shanouf Valijan


Class: B.E. Computer Engineering, Semester - VII
UID: 2021300004
Batch: Monday (30-09-2024)

Problem Statement:
Using the stroke dataset, perform a one-way ANOVA test to determine whether a person’s
smoking status plays a significant role in that person’s body mass index (BMI). Then
include gender as an additional factor and perform a two-way ANOVA test to determine
the significance of both factors, individually and combined, on a person’s BMI.

Implementation:
The following steps show how the task was carried out and how each decision was
reached.

Importing the dataset as a pandas dataframe


import pandas as pd
df = pd.read_csv('/content/healthcare-dataset-stroke-data.csv')
df

Getting rid of those rows in the dataset where smoking status of the patient is unknown
df_clean = df[(df['smoking_status'] != 'Unknown')]
df_clean

Dropping the rows from the dataset where BMI value is missing
df_clean = df_clean.dropna(subset=['bmi'])
df_clean

Dropping all the columns from the dataset that are unrelated to the analysis task
df = df_clean
df = df.drop(['id', 'age', 'hypertension', 'heart_disease', 'ever_married',
'work_type', 'Residence_type', 'avg_glucose_level', 'stroke'], axis=1)
The above processed data was used henceforth for ANOVA testing
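
The three cleaning steps above can be tried on a small frame first. The following is a minimal sketch, assuming a handful of made-up rows (the `toy` frame and its values are hypothetical and only mimic a few of the dataset's column names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic some of the columns used in the assignment.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "age": [67, 61, 80, 49, 79],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown",
                       "smokes", "never smoked"],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
})

# Drop rows with an unknown smoking status, then rows with a missing BMI.
clean = toy[toy["smoking_status"] != "Unknown"].dropna(subset=["bmi"])

# Keep only the columns the ANOVA needs and renumber the rows.
clean = clean[["gender", "smoking_status", "bmi"]].reset_index(drop=True)
print(clean)
```

Selecting the needed columns by name, as here, is an alternative to listing every column to drop; both leave the same three columns.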

Implementation of one-way ANOVA-


Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

Grouping the dataset based on categories present in smoking status (never smoked,
formerly smoked, smokes)
grouped = df.groupby('smoking_status')['bmi'].apply(list)
grouped

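Calculating the between-group and within-group sums of squares of BMI (based on the
categories of smoking status), along with the F-statistic and p-value
f_statistic, p_value = stats.f_oneway(*grouped)
overall_mean = df['bmi'].mean()
# Between-group SS: group size times squared deviation of the group mean from the overall mean
ss_between = sum(len(group) * (np.mean(group) - overall_mean) ** 2 for group in grouped)
# Within-group SS: squared deviation of each value from its own group mean
ss_within = sum(sum((x - np.mean(group)) ** 2 for x in group) for group in grouped)
print(f"SS Between: {ss_between}")
print(f"SS Within: {ss_within}")
print(f"F-statistic: {f_statistic}, p-value: {p_value}")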

SS Between: 397.3861360057199
SS Within: 181918.8244565218
F-statistic: 3.738625586470506, p-value: 0.023883960142755647
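
The F-statistic reported above is the ratio of the two mean squares, that is, each sum of squares divided by its degrees of freedom. A minimal sketch of that relationship, assuming three small made-up samples (the `groups` values are hypothetical, not taken from the dataset):

```python
import numpy as np
from scipy import stats

# Hypothetical BMI samples for three smoking categories (not from the dataset).
groups = [
    [28.1, 31.4, 29.9, 33.0],
    [27.5, 26.8, 30.2, 29.1, 28.4],
    [32.3, 30.7, 34.1],
]

n_total = sum(len(g) for g in groups)
k = len(groups)
grand_mean = np.mean(np.concatenate(groups))

ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
```

The manually computed value should agree with `stats.f_oneway` up to floating-point error.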

Plotting the calculated F-statistic against the critical F-value obtained from scipy
(alpha taken as 0.05)
alpha = 0.05
critical_value = stats.f.ppf(1 - alpha, len(grouped) - 1, df.shape[0] - len(grouped))

plt.figure(figsize=(8, 6))
plt.axvline(f_statistic, color='red', label='Calculated F-statistic')
plt.axvline(critical_value, color='green', label='Critical Value (alpha=0.05)')
plt.title('One-Way ANOVA')
plt.xlabel('F-value')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()

Making a decision based on the above plot
if f_statistic > critical_value:
    print("Reject the null hypothesis: smoking status has a significant effect on BMI.")
else:
    print("Fail to reject the null hypothesis: smoking status does not have a significant effect on BMI.")

Decision as obtained
Reject the null hypothesis: smoking status has a significant effect on BMI.

Thus, we may infer that at least one smoking-status category has a mean BMI that
differs significantly from the others. Through one-way ANOVA at the 5% significance
level, we may conclude that a person’s smoking status is significantly associated with
the person’s body mass index.
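
Comparing the F-statistic with the critical value is equivalent to comparing the p-value with alpha, so the reported p-value (about 0.024, below 0.05) points to the same decision. A small sketch of that equivalence, assuming made-up samples (the `groups` values are hypothetical):

```python
from scipy import stats

alpha = 0.05

# Hypothetical samples for three categories (not from the dataset).
groups = [
    [24.0, 26.5, 25.1, 27.3, 23.8],
    [29.4, 31.0, 30.2, 28.7, 32.1],
    [26.9, 27.8, 25.5, 28.4, 26.0],
]

f_stat, p_value = stats.f_oneway(*groups)
dfn = len(groups) - 1
dfd = sum(len(g) for g in groups) - len(groups)

# Critical value at the chosen alpha, and the p-value recovered from the F-statistic.
crit = stats.f.ppf(1 - alpha, dfn, dfd)
p_from_f = stats.f.sf(f_stat, dfn, dfd)

reject_by_critical = bool(f_stat > crit)
reject_by_p = bool(p_value < alpha)
print(reject_by_critical, reject_by_p)
```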

Implementation of two-way ANOVA-


Importing the required libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

Calculating overall mean, group means and column means


overall_mean = df['bmi'].mean()
group_means = df.groupby(['smoking_status', 'gender'])['bmi'].mean()
means_smoking = df.groupby('smoking_status')['bmi'].mean()
means_gender = df.groupby('gender')['bmi'].mean()

Overall Mean --> 30.290046701692937

Group Means -->
smoking_status   gender
formerly smoked  Female    30.615721
                 Male      30.928571
                 Other     22.400000
never smoked     Female    29.862677
                 Male      30.204777
smokes           Female    30.750353
                 Male      30.261859
Name: bmi, dtype: float64

Means Smoking -->
smoking_status
formerly smoked    30.747192
never smoked       29.982559
smokes             30.543555
Name: bmi, dtype: float64

Means Gender -->
gender
Female    30.208869
Male      30.422405
Other     22.400000
Name: bmi, dtype: float64
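
The group means list the 'Other' gender under 'formerly smoked' only, so some smoking-by-gender cells are empty. Checking cell counts before a two-way analysis makes this visible. A minimal sketch, assuming made-up rows arranged to mirror that pattern (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows: 'Other' appears only in the 'formerly smoked' category.
toy = pd.DataFrame({
    "smoking_status": ["formerly smoked", "formerly smoked", "formerly smoked",
                       "never smoked", "never smoked", "smokes", "smokes"],
    "gender": ["Female", "Male", "Other", "Female", "Male", "Female", "Male"],
})

# Rows are smoking categories, columns are genders, values are cell counts.
counts = pd.crosstab(toy["smoking_status"], toy["gender"])
empty_cells = int((counts == 0).sum().sum())
print(counts)
print("Empty cells:", empty_cells)
```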

Counting the number of records and the number of categories in smoking status and in gender
n = len(df)
n_smoking = len(df['smoking_status'].unique())
n_gender = len(df['gender'].unique())

Calculating the various SS terms

sst = sum((df['bmi'] - overall_mean) ** 2)
ssr = sum(df.groupby('smoking_status').size() * (means_smoking - overall_mean) ** 2)
ssg = sum(df.groupby('gender').size() * (means_gender - overall_mean) ** 2)

ss_interaction = 0
for (smoke, gen), group in df.groupby(['smoking_status', 'gender']):
    ss_interaction += len(group) * (
        group['bmi'].mean() - means_smoking[smoke] - means_gender[gen] + overall_mean
    ) ** 2
ss_within = sst - (ssr + ssg + ss_interaction)
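
Written out, the code above computes the following quantities. The symbols are introduced here for illustration: $y_i$ is an individual BMI value, $\bar{y}$ the overall mean, $\bar{y}_{s\cdot}$ and $\bar{y}_{\cdot g}$ the smoking-status and gender means, $\bar{y}_{sg}$ a cell mean, and $n_s$, $n_g$, $n_{sg}$ the corresponding counts.

```latex
\begin{align*}
SS_T &= \sum_{i} (y_i - \bar{y})^2 \\
SS_R &= \sum_{s} n_s \, (\bar{y}_{s\cdot} - \bar{y})^2 \\
SS_G &= \sum_{g} n_g \, (\bar{y}_{\cdot g} - \bar{y})^2 \\
SS_{\text{int}} &= \sum_{s,g} n_{sg} \, (\bar{y}_{sg} - \bar{y}_{s\cdot} - \bar{y}_{\cdot g} + \bar{y})^2 \\
SS_W &= SS_T - (SS_R + SS_G + SS_{\text{int}})
\end{align*}
```

These terms partition $SS_T$ exactly when every cell has the same number of observations. With unequal cell sizes, as in this dataset, obtaining $SS_W$ by subtraction is best read as an approximation.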

Calculating degrees of freedom and the corresponding F-statistics

df_r = n_smoking - 1
df_g = n_gender - 1
df_interaction = (n_smoking - 1) * (n_gender - 1)
# Note: this is the error df of a model without interaction; with an interaction
# term the usual choice is n minus the number of non-empty cells.
df_w = n - (n_smoking + n_gender - 1)

f_smoking = (ssr / df_r) / (ss_within / df_w)
f_gender = (ssg / df_g) / (ss_within / df_w)
f_interaction = (ss_interaction / df_interaction) / (ss_within / df_w)

SSt (Total): 182316.21059252787
SSr (Smoking Status): 397.3861360057199
SSg (Gender): 99.45680590873165
SSc (Interaction): 98.05329791586556
SS Within: 181721.31435269755
F-statistic for Smoking Status: 3.740502252358344
F-statistic for Gender: 0.9361635266224352
F-statistic for Interaction: 0.46147631796115257
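
In a two-way model that includes an interaction term, the within (error) degrees of freedom are usually taken as the number of observations minus the number of cells. A minimal sketch on a balanced design, where the sums of squares partition exactly; the factors `a` and `b` and all values are hypothetical:

```python
import pandas as pd

# Hypothetical balanced 2 x 2 design with 3 observations per cell.
toy = pd.DataFrame({
    "a": ["a1"] * 6 + ["a2"] * 6,
    "b": (["b1"] * 3 + ["b2"] * 3) * 2,
    "y": [29.0, 31.0, 30.0, 27.0, 28.0, 26.0,
          33.0, 32.0, 34.0, 30.0, 29.0, 31.0],
})

grand = toy["y"].mean()
mean_a = toy.groupby("a")["y"].mean()
mean_b = toy.groupby("b")["y"].mean()
cells = toy.groupby(["a", "b"])

ss_total = ((toy["y"] - grand) ** 2).sum()
ss_a = (toy.groupby("a").size() * (mean_a - grand) ** 2).sum()
ss_b = (toy.groupby("b").size() * (mean_b - grand) ** 2).sum()
ss_ab = sum(
    len(cell) * (cell["y"].mean() - mean_a[a] - mean_b[b] + grand) ** 2
    for (a, b), cell in cells
)
# Within-cell SS computed directly from the cell means, not by subtraction.
ss_within = sum(((cell["y"] - cell["y"].mean()) ** 2).sum() for _, cell in cells)

# Error degrees of freedom: observations minus number of cells (12 - 4).
df_within = len(toy) - cells.ngroups
print(ss_total, ss_a + ss_b + ss_ab + ss_within, df_within)
```

For unbalanced data the terms no longer add up this way, which is why dedicated routines are generally preferred there.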

Getting the critical values from predefined functions


alpha = 0.05
critical_smoking = stats.f.ppf(1 - alpha, df_r, df_w)
critical_gender = stats.f.ppf(1 - alpha, df_g, df_w)
critical_interaction = stats.f.ppf(1 - alpha, df_interaction, df_w)

Plotting the critical and calculated values

plt.figure(figsize=(10, 6))

plt.axvline(f_smoking, color='red', linestyle='--', label='F-statistic for Smoking Status')
plt.axvline(critical_smoking, color='green', linestyle='--', label='Critical Value (Smoking Status)')

plt.axvline(f_gender, color='blue', linestyle='--', label='F-statistic for Gender')
plt.axvline(critical_gender, color='orange', linestyle='--', label='Critical Value (Gender)')

plt.axvline(f_interaction, color='purple', linestyle='--', label='F-statistic for Interaction')
plt.axvline(critical_interaction, color='brown', linestyle='--', label='Critical Value (Interaction)')

plt.title('Two-Way ANOVA F-statistics and Critical Values')
plt.xlabel('F-value')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()

Making respective decisions based on the above comparison

for f_stat, crit_val, factor in zip(
    [f_smoking, f_gender, f_interaction],
    [critical_smoking, critical_gender, critical_interaction],
    ['Smoking Status', 'Gender', 'Interaction']
):
    if f_stat > crit_val:
        print(f"Reject the null hypothesis for {factor}: Significant effect detected.")
    else:
        print(f"Fail to reject the null hypothesis for {factor}: No significant effect detected.")

Reject the null hypothesis for Smoking Status: Significant effect detected.
Fail to reject the null hypothesis for Gender: No significant effect detected.
Fail to reject the null hypothesis for Interaction: No significant effect detected.

Thus, as in the one-way test, smoking status shows a significant effect on the BMI of a
person. Gender on its own does not have a significant effect on the BMI. The interaction
of gender and smoking status is also not significant: its F-statistic (about 0.46) is far
below the 5% critical value, which is roughly 2.37 for 4 numerator degrees of freedom and
a denominator in the thousands. The data therefore show no significant combined effect
on the BMI of an individual.
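
The three decisions can be re-checked from the reported F-statistics alone. The sketch below assumes numerator degrees of freedom of 2, 2 and 4 (implied by the three smoking categories and three gender categories in the group means) and, since the record count is not printed in the report, brackets the denominator degrees of freedom between 1,000 and 10,000 as an assumption:

```python
from scipy import stats

alpha = 0.05

# F-statistics as reported above, paired with assumed numerator degrees of freedom.
reported = {
    "Smoking Status": (3.740502252358344, 2),
    "Gender": (0.9361635266224352, 2),
    "Interaction": (0.46147631796115257, 4),
}

decisions = {}
# Assumed bracket for the denominator degrees of freedom (not stated in the report).
for dfd in (1_000, 10_000):
    for factor, (f_stat, dfn) in reported.items():
        crit = stats.f.ppf(1 - alpha, dfn, dfd)
        decisions[(dfd, factor)] = "reject" if f_stat > crit else "fail to reject"
        print(f"dfd={dfd}: {factor}: F={f_stat:.3f}, critical={crit:.3f} "
              f"-> {decisions[(dfd, factor)]}")
```

Across that whole bracket only the smoking-status statistic exceeds its critical value.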

Conclusion:
By implementing one-way and two-way ANOVA, I was able to develop a better intuition for
how these hypothesis-testing methods work. Using an example healthcare dataset, I
calculated the test statistics associated with each test and compared them with the
critical values for the chosen significance level. By comparing these values, I was able
to decide whether or not to reject the null hypothesis (no difference in group means).
In brief, this assignment helped me understand how ANOVA, as a hypothesis-testing
approach, may be used to make statements about a dataset as part of its analysis.
