Assignment On ANOVA
Problem Statement:
Using the stroke dataset, perform a one-way ANOVA test to determine whether a person's
smoking status plays a significant role in their body mass index (BMI). Then include gender
as an additional factor and perform a two-way ANOVA test to determine the significance of
both factors, individually and combined, on BMI.
Implementation:
The steps below show how the task was carried out and how each decision was reached:
Getting rid of those rows in the dataset where smoking status of the patient is unknown
df_clean = df[(df['smoking_status'] != 'Unknown')]
df_clean
Dropping the rows from the dataset where BMI value is missing
df_clean = df_clean.dropna(subset=['bmi'])
df_clean
Dropping all the columns from the dataset that are unrelated to the analysis task
df = df_clean
df = df.drop(['id', 'age', 'hypertension', 'heart_disease', 'ever_married',
              'work_type', 'Residence_type', 'avg_glucose_level', 'stroke'], axis=1)
The processed data above was used for all subsequent ANOVA testing
Grouping the dataset based on categories present in smoking status (never smoked,
formerly smoked, smokes)
grouped = df.groupby('smoking_status')['bmi'].apply(list)
grouped
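Calculating the between-group and within-group sums of squares of BMI across the smoking
status categories, followed by the F-statistic and p-value
import numpy as np
from scipy import stats
f_statistic, p_value = stats.f_oneway(*grouped)
overall_mean = df['bmi'].mean()
# Between-group SS: squared distance of each group mean from the overall mean, weighted by group size
ss_between = sum(len(group) * (np.mean(group) - overall_mean) ** 2 for group in grouped)
# Within-group SS: squared distance of each observation from its own group mean
ss_within = sum(sum((x - np.mean(group)) ** 2 for x in group) for group in grouped)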
SS Between: 397.3861360057199
SS Within: 181918.8244565218
F-statistic: 3.738625586470506, p-value: 0.023883960142755647
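The link between the two sums of squares and the F-statistic can be made explicit: each sum of squares is divided by its degrees of freedom (k - 1 between groups, N - k within groups, for k groups and N records), and F is the ratio of the two resulting mean squares. The following is a minimal sketch of that arithmetic using only the Python standard library. The BMI values are made up for illustration and are not taken from the stroke dataset.

```python
from statistics import fmean

# Hypothetical BMI values per smoking status (made up for illustration)
groups = {
    "never smoked": [22.0, 24.0, 26.0],
    "formerly smoked": [27.0, 29.0, 31.0],
    "smokes": [25.0, 27.0, 29.0],
}

all_values = [x for values in groups.values() for x in values]
overall_mean = fmean(all_values)

# Between-group SS: group size times squared distance of group mean from overall mean
ss_between = sum(len(v) * (fmean(v) - overall_mean) ** 2 for v in groups.values())
# Within-group SS: squared distance of each value from its own group mean
ss_within = sum((x - fmean(v)) ** 2 for v in groups.values() for x in v)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# Mean squares and their ratio, the F-statistic
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print(f"SS Between: {ss_between:.4f}")
print(f"SS Within: {ss_within:.4f}")
print(f"F-statistic: {f_statistic:.4f}")
```

For these toy values the mean squares are 19 and 4, so F is 4.75. On real data the same ratio should agree with the value returned by stats.f_oneway.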
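Plotting the calculated F-statistic against the critical F-value (obtained from SciPy's
F distribution, with alpha assumed as 0.05)
import matplotlib.pyplot as plt
alpha = 0.05
# Degrees of freedom: (number of groups - 1) between groups, (number of records - number of groups) within groups
critical_value = stats.f.ppf(1 - alpha, len(grouped) - 1, df.shape[0] - len(grouped))
plt.figure(figsize=(8, 6))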
plt.axvline(f_statistic, color='red', label='Calculated F-statistic')
plt.axvline(critical_value, color='green', label='Critical Value (alpha=0.05)')
plt.title('One-Way ANOVA')
plt.xlabel('F-value')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()
Making a decision by comparing the calculated F-statistic with the critical value
if f_statistic > critical_value:
    print("Reject the null hypothesis: smoking status has a significant effect on BMI.")
else:
    print("Fail to reject the null hypothesis: smoking status does not have a significant effect on BMI.")
Decision as obtained
Reject the null hypothesis: smoking status has a significant effect on BMI.
Thus, we may infer that the mean BMI differs significantly between at least one pair of
smoking status categories. Through one-way ANOVA, we may therefore conclude that a person's
smoking status has a statistically significant effect on their body mass index at the 0.05
significance level.
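The critical-value rule used for this decision is equivalent to comparing the p-value with alpha: F exceeds the critical value exactly when the p-value falls below alpha. This matches the reported p-value of about 0.024, which is below 0.05. The sketch below illustrates the equivalence with SciPy's F distribution. The observed F-value and the degrees of freedom are hypothetical placeholders, not the values from the stroke dataset.

```python
from scipy import stats

alpha = 0.05
dfn, dfd = 2, 30      # hypothetical degrees of freedom (between, within)
f_observed = 4.2      # hypothetical F-statistic

# Critical value: the point with probability alpha in the upper tail
critical_value = stats.f.ppf(1 - alpha, dfn, dfd)
# p-value: upper-tail probability at the observed F (survival function)
p_value = stats.f.sf(f_observed, dfn, dfd)

reject_by_critical = bool(f_observed > critical_value)
reject_by_p = bool(p_value < alpha)

print(f"critical value: {critical_value:.4f}, p-value: {p_value:.4f}")
print("Decisions agree:", reject_by_critical == reject_by_p)
```

Either rule can be used on its own. The p-value has the advantage of also showing how far the result is from the significance threshold.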
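Counting the number of records and the number of categories in smoking status and in gender
n = len(df)
n_smoking = len(df['smoking_status'].unique())
n_gender = len(df['gender'].unique())
Calculating the interaction sum of squares and, by subtraction, the within-group sum of squares
# means_smoking, means_gender, sst, ssr and ssg are assumed to have been computed beforehand
ss_interaction = 0
for (smoke, gen), group in df.groupby(['smoking_status', 'gender']):
    # Deviation of the cell mean from what the two main effects alone would predict
    cell_effect = group['bmi'].mean() - means_smoking[smoke] - means_gender[gen] + overall_mean
    ss_interaction += len(group) * cell_effect ** 2
ss_within = sst - (ssr + ssg + ss_interaction)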
Reject the null hypothesis for Smoking Status: Significant effect detected.
Fail to reject the null hypothesis for Gender: No significant effect detected.
Reject the null hypothesis for Interaction: significant effect detected.
Thus, consistent with the one-way result, smoking status significantly affects a person's
BMI. Gender on its own does not have a significant effect on BMI. However, the interaction
between gender and smoking status is significant, meaning that the effect of smoking status
on BMI differs between genders.
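The two-way computation above relies on several quantities (means_smoking, means_gender, sst, ssr, ssg) whose definitions are not shown. The sketch below is one plausible way to obtain them and the three F-statistics, written with the standard library only. It uses a small, balanced, made-up dataset, and the variable names ss_smoking and ss_gender are assumptions standing in for ssr and ssg. Note that this simple decomposition is exactly additive only when every cell has the same number of records. With unequal cell sizes, as is likely in the real data, the within-group value obtained by subtraction is an approximation.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical balanced data: (smoking_status, gender) -> BMI values
cells = {
    ("never smoked", "Female"): [22.0, 24.0],
    ("never smoked", "Male"): [24.0, 26.0],
    ("formerly smoked", "Female"): [28.0, 30.0],
    ("formerly smoked", "Male"): [26.0, 28.0],
    ("smokes", "Female"): [25.0, 27.0],
    ("smokes", "Male"): [27.0, 29.0],
}

all_values = [x for values in cells.values() for x in values]
overall_mean = fmean(all_values)

# Pool the values by each factor to get the marginal means
by_smoking, by_gender = defaultdict(list), defaultdict(list)
for (smoke, gen), values in cells.items():
    by_smoking[smoke].extend(values)
    by_gender[gen].extend(values)
means_smoking = {k: fmean(v) for k, v in by_smoking.items()}
means_gender = {k: fmean(v) for k, v in by_gender.items()}

# Total and main-effect sums of squares
ss_total = sum((x - overall_mean) ** 2 for x in all_values)
ss_smoking = sum(len(v) * (means_smoking[k] - overall_mean) ** 2
                 for k, v in by_smoking.items())
ss_gender = sum(len(v) * (means_gender[k] - overall_mean) ** 2
                for k, v in by_gender.items())

# Interaction: cell mean minus what the two main effects alone predict
ss_interaction = sum(
    len(values) * (fmean(values) - means_smoking[smoke]
                   - means_gender[gen] + overall_mean) ** 2
    for (smoke, gen), values in cells.items()
)
ss_within = ss_total - (ss_smoking + ss_gender + ss_interaction)

# Degrees of freedom for each source of variation
df_smoking = len(by_smoking) - 1
df_gender = len(by_gender) - 1
df_interaction = df_smoking * df_gender
df_within = len(all_values) - len(by_smoking) * len(by_gender)

# Each F-statistic is a mean square divided by the within-group mean square
ms_within = ss_within / df_within
f_smoking = (ss_smoking / df_smoking) / ms_within
f_gender = (ss_gender / df_gender) / ms_within
f_interaction = (ss_interaction / df_interaction) / ms_within

print(f"F smoking: {f_smoking:.4f}, F gender: {f_gender:.4f}, "
      f"F interaction: {f_interaction:.4f}")
```

Each F-statistic would then be compared with its own critical value (or p-value), using the matching pair of degrees of freedom, to reach the three decisions listed above.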
Conclusion:
By implementing one-way and two-way ANOVA, I developed a better intuition for how these
hypothesis tests work. Using an example healthcare dataset, I calculated the test statistics
associated with each test and compared them with the critical values for the chosen
significance level. Based on those comparisons, I decided whether or not to reject the null
hypothesis (no difference, that is, no significant effect). In brief, this assignment helped
me understand how ANOVA can be used as a hypothesis testing method to draw statistically
supported statements about a dataset as part of its analysis.