baseline.ipynb - Colab

The document outlines a Jupyter Notebook for analyzing stroke data using Python libraries such as pandas, numpy, and sklearn. It involves loading training and test datasets, performing exploratory data analysis, and building a baseline logistic regression model to predict strokes. The model evaluation includes metrics like AUC and F-beta scores, with results indicating a baseline AUC of 0.7565 and F-beta of 0.5719.



Importing libraries and loading the training and test datasets

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_recall_curve, roc_auc_score, classification_report, confusion_matrix
import os

from google.colab import files


uploaded = files.upload()

test.csv(text/csv) - 152301 bytes, last modified: 2/21/2025 - 100% done
train.csv(text/csv) - 146010 bytes, last modified: 2/21/2025 - 100% done
Saving test.csv to test.csv
Saving train.csv to train.csv

train_path = "/content/train.csv"
test_path = "/content/test.csv"

import os
print(os.path.exists(train_path))
print(os.path.exists(test_path))

True
True

# Load Data (for manually uploaded files in Colab or Jupyter)


train_path = "/content/train.csv"
test_path = "/content/test.csv"

if not os.path.isfile(train_path):
    raise FileNotFoundError(f"Train file not found at {train_path}")
if not os.path.isfile(test_path):
    raise FileNotFoundError(f"Test file not found at {test_path}")

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Display dataset description
print(train_df.describe(include='all'))

        gender          age  hypertension  heart_disease ever_married  \
count     2555  2555.000000   2555.000000    2555.000000         2555
unique       5          NaN           NaN            NaN            2
top     Female          NaN           NaN            NaN          Yes
freq      1309          NaN           NaN            NaN         1699
mean       NaN    46.373777      0.099804       0.053620          NaN
std        NaN   149.971251      0.299798       0.225311          NaN
min        NaN     0.000100      0.000000       0.000000          NaN
25%        NaN    26.000000      0.000000       0.000000          NaN
50%        NaN    44.000000      0.000000       0.000000          NaN
75%        NaN    60.000000      0.000000       0.000000          NaN
max        NaN  7500.000000      1.000000       1.000000          NaN

       work_type Residence_type  avg_glucose_level          bmi  \
count       2555           2555        2555.000000  2454.000000
unique         5              2                NaN          NaN
top      Private          Urban                NaN          NaN
freq        1461           1288                NaN          NaN
mean         NaN            NaN         105.534755    28.898248
std          NaN            NaN          44.689250     7.958036
min          NaN            NaN          55.220000    10.300000
25%          NaN            NaN          77.000000    23.500000
50%          NaN            NaN          91.450000    28.000000
75%          NaN            NaN         113.160000    33.200000
max          NaN            NaN         271.740000    92.000000

       smoking_status stroke
count            2555   2554
unique              4      4
top      never smoked      0
freq              945   2429
mean              NaN    NaN
std               NaN    NaN
min               NaN    NaN
25%               NaN    NaN
50%               NaN    NaN
75%               NaN    NaN
max               NaN    NaN

# Frequency distributions
print("Frequency of Males vs Females:")
print(train_df['gender'].value_counts())

print("\nFrequency of Males vs Females with Stroke:")
print(pd.crosstab(train_df['gender'], train_df['stroke']))

print("\nFrequency of People with Hypertension vs Without Hypertension:")
print(train_df['hypertension'].value_counts())

print("\nStroke distribution among those with and without Hypertension:")
print(pd.crosstab(train_df['hypertension'], train_df['stroke']))

print("\nStroke distribution among those with and without Heart Disease:")
print(pd.crosstab(train_df['heart_disease'], train_df['stroke']))

Frequency of Males vs Females:
gender
Female    1309
Male       945
female     183
male       117
Other        1
Name: count, dtype: int64

Frequency of Males vs Females with Stroke:
stroke     0   1  Yes  yes
gender
Female  1247  58    3    0
Male     895  47    2    1
Other      1   0    0    0
female   172  11    0    0
male     114   3    0    0

Frequency of People with Hypertension vs Without Hypertension:
hypertension
0    2300
1     255
Name: count, dtype: int64

Stroke distribution among those with and without Hypertension:
stroke           0   1  Yes  yes
hypertension
0             2208  86    4    1
1              221  33    1    0

Stroke distribution among those with and without Heart Disease:
stroke            0   1  Yes  yes
heart_disease
0              2315  96    5    1
1               114  23    0    0
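The raw counts above are hard to compare because the groups differ in size. As an illustrative extra step (not in the original notebook), pd.crosstab with normalize='index' turns each row into proportions, so the stroke rate for each hypertension and heart-disease group can be read off directly.

# Sketch: row-wise proportions instead of raw counts (approximate stroke rate per group)
print(pd.crosstab(train_df['hypertension'], train_df['stroke'], normalize='index'))
print(pd.crosstab(train_df['heart_disease'], train_df['stroke'], normalize='index'))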

# Check distribution of numerical features
numerical_features = ["age", "avg_glucose_level", "bmi"]
for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    sns.histplot(train_df[feature], kde=True, bins=30)
    plt.title(f"Distribution of {feature}")
    plt.show()

[Output: three histograms with KDE overlays, one each for age, avg_glucose_level, and bmi; figures not reproduced here]
# Baseline Model Without Preprocessing
X_baseline = train_df.drop(columns=["stroke"], errors='ignore')
y_baseline = train_df["stroke"].replace({"Yes": 1, "yes": 1, "0": 0, "1": 1})

<ipython-input-8-1aa6d9e1dc84>:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version.
  y_baseline = train_df["stroke"].replace({"Yes": 1, "yes": 1, "0": 0, "1": 1})
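A warning-free alternative (a sketch, not part of the notebook's code) is to normalise the labels as strings and map them to integers in one step, which also makes the mixed 0/1/"Yes"/"yes" values explicit:

# Sketch: map the mixed stroke labels to 0/1 without triggering the replace() FutureWarning
y_alt = (
    train_df["stroke"]
    .astype(str)                      # unify mixed types: 0, "0", "Yes", "yes", NaN -> "nan"
    .str.strip().str.lower()
    .map({"0": 0, "1": 1, "yes": 1})  # unmapped values (including "nan") become NaN
    .fillna(0)
    .astype(int)
)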

# Handle missing values in stroke column
y_baseline = y_baseline.fillna(0).astype(int)

# Handle missing values in features (bmi is the only numeric column with NaNs)
X_baseline = X_baseline.fillna(X_baseline.median(numeric_only=True))

# Handle categorical variables by simple label encoding (no one-hot encoding for baseline)
X_baseline = X_baseline.apply(lambda col: col.astype('category').cat.codes if col.dtypes == 'O' else col)
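Note that cat.codes treats every distinct string as its own category, so 'Female' and 'female' in the gender column receive different integer codes. A tiny standalone illustration of that behaviour:

# Sketch: cat.codes assigns separate codes to case variants of the same label
s = pd.Series(["Female", "female", "Male", "male", "Other"])
print(s.astype("category").cat.codes.tolist())  # [0, 3, 1, 4, 2] - codes follow sorted category order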

# Train-Test Split
X_train_base, X_val_base, y_train_base, y_val_base = train_test_split(X_baseline, y_baseline, test_size=0.2, random_state=42)  # the seed is cut off in the source; 42 assumed

# Train Logistic Regression Model
model_base = LogisticRegression(class_weight='balanced', max_iter=1000)
model_base.fit(X_train_base, y_train_base)
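For reference, class_weight='balanced' reweights each class by n_samples / (n_classes * np.bincount(y)), which is scikit-learn's documented heuristic. A short sketch (not in the original notebook) of the weights this implies for the training split:

# Sketch: the per-class weights implied by class_weight='balanced'
counts = np.bincount(y_train_base)
balanced_weights = len(y_train_base) / (2 * counts)
print(dict(zip([0, 1], balanced_weights)))  # the minority class (stroke=1) gets a much larger weight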

# Predictions
y_probs_base = model_base.predict_proba(X_val_base)[:, 1]
y_pred_base = model_base.predict(X_val_base)

# Evaluation Metrics
auc_score = roc_auc_score(y_val_base, y_probs_base)
f_beta_base = fbeta_score(y_val_base, y_pred_base, beta=10)
class_report = classification_report(y_val_base, y_pred_base)
conf_matrix = confusion_matrix(y_val_base, y_pred_base)
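With beta=10, the F-beta score F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) weights recall about 100 times more than precision, which suits a screening task where missed strokes are costly. As a sanity check (a sketch, not in the original notebook), the same score can be recomputed by hand:

# Sketch: F-beta (beta=10) recomputed from precision and recall
from sklearn.metrics import precision_score, recall_score
p = precision_score(y_val_base, y_pred_base)
r = recall_score(y_val_base, y_pred_base)
beta = 10
print((1 + beta**2) * p * r / (beta**2 * p + r))  # should match fbeta_score(..., beta=10) above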

# Stroke distribution by gender (raw training data, not model predictions)
gender_counts = pd.crosstab(train_df["gender"], train_df["stroke"])

def print_baseline_results():
    print(f"Baseline AUC: {auc_score:.4f}")
    print(f"Baseline F-beta (β=10): {f_beta_base:.4f}")
    print("Classification Report:\n", class_report)
    print("Confusion Matrix:\n", conf_matrix)
    print("Stroke distribution by gender:\n", gender_counts)

print_baseline_results()

Baseline AUC: 0.7565
Baseline F-beta (β=10): 0.5719
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.72      0.83       486
           1       0.10      0.60      0.17        25
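precision_recall_curve is imported above but never used; one possible follow-up (a sketch, not part of the original notebook) is to tune the decision threshold on the validation set so that it maximises the recall-heavy F-beta instead of relying on the default 0.5 cut-off:

# Sketch: pick the probability threshold that maximises F-beta (beta=10) on the validation set
precisions, recalls, thresholds = precision_recall_curve(y_val_base, y_probs_base)
beta2 = 10 ** 2
f_betas = (1 + beta2) * precisions * recalls / np.clip(beta2 * precisions + recalls, 1e-12, None)
best = np.argmax(f_betas[:-1])            # the last precision/recall pair has no threshold
print(thresholds[best], f_betas[best])    # candidate threshold and its F-beta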
