
Step 1

The document outlines a comprehensive Python code for analyzing an insurance claims dataset, covering data preprocessing, exploratory data analysis, hypothesis testing, feature engineering, and model building using linear regression. Key steps include handling missing values and outliers, encoding categorical variables, performing statistical tests, and evaluating model performance. The final submission includes saving the code and creating a presentation summarizing insights, test results, and model performance metrics.

Uploaded by

ilias ahmed

# Step 1: Import Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from scipy.stats import ttest_ind, chi2_contingency, f_oneway

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the Dataset

data = pd.read_csv(r"C:\Users\User\Desktop\assignment4\insurance - insurance (2).csv")

# Display the first few rows

print("First few rows of the dataset:")

print(data.head())

# Step 3: Data Preprocessing and Cleaning

# a. Handle Missing Values

print("\nMissing values in the dataset:")

print(data.isnull().sum())

data = data.dropna() # Drop rows with missing values

# b. Handle Outliers

sns.boxplot(x=data['bmi'])

plt.title("BMI Outliers")

plt.show()

# Keep BMI values between 18.5 and 50 (note: this also drops underweight beneficiaries, not only extreme values)

data = data[(data['bmi'] >= 18.5) & (data['bmi'] <= 50)]

# c. Encode Categorical Variables

data['sex'] = data['sex'].map({'male': 0, 'female': 1})

data['smoker'] = data['smoker'].map({'no': 0, 'yes': 1})

data = pd.get_dummies(data, columns=['region'], drop_first=True, dtype=int)

# d. Normalize Numerical Features

data['age'] = (data['age'] - data['age'].mean()) / data['age'].std()

data['bmi'] = (data['bmi'] - data['bmi'].mean()) / data['bmi'].std()

# Step 4: Exploratory Data Analysis (EDA)

# a. Statistical Analysis

print("\nStatistical summary of the dataset:")

print(data.describe())

# b. Visualizations

# Scatter plot: Age vs Charges

plt.scatter(data['age'], data['charges'])

plt.title("Age vs Charges")

plt.xlabel("Age (standardized)")

plt.ylabel("Charges")

plt.show()

# Box plot: Charges by Smoker Status

sns.boxplot(x=data['smoker'], y=data['charges'])

plt.title("Charges by Smoker Status")

plt.show()

# Distribution of BMI by Sex

sns.histplot(data, x='bmi', hue='sex', kde=True)

plt.title("BMI Distribution by Sex")

plt.show()

# Step 5: Frequentist Hypothesis Testing

# a. Proportion of Male Beneficiaries

male_count = sum(data['sex'] == 0)

total_count = len(data)

prop_male = male_count / total_count

print(f"\nProportion of Male Beneficiaries: {prop_male}")

# b. Medical Claims by Smokers vs Non-Smokers

smoker_charges = data[data['smoker'] == 1]['charges']

non_smoker_charges = data[data['smoker'] == 0]['charges']

t_stat, p_value = ttest_ind(smoker_charges, non_smoker_charges)

print(f"\nT-Test (Smokers vs Non-Smokers): t-stat={t_stat}, p-value={p_value}")

# c. BMI of Females vs Males

female_bmi = data[data['sex'] == 1]['bmi']

male_bmi = data[data['sex'] == 0]['bmi']

t_stat, p_value = ttest_ind(female_bmi, male_bmi)

print(f"\nT-Test (BMI of Females vs Males): t-stat={t_stat}, p-value={p_value}")


# d. Proportion of Smokers Across Regions

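# Cross-tabulate all regions against smoker status (the all-zero row is the dropped baseline region)

region_smoker = pd.crosstab([data['region_northwest'], data['region_southeast'], data['region_southwest']], data['smoker'])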

chi2, p, dof, expected = chi2_contingency(region_smoker)

print(f"\nChi-Square Test (Smokers Across Regions): chi2={chi2}, p-value={p}")

# Step 6: Feature Engineering

# a. Create Age Groups

# 'age' was standardized in Step 3, so cut points in years (30, 50) no longer apply; use tertiles instead

data['age_group'] = pd.qcut(data['age'], q=3, labels=['young', 'middle', 'senior'])

# b. Create Interaction Feature

data['smoker_BMI'] = data['smoker'] * data['bmi']

# Step 7: Region Analysis

# a. Mean Charges by Region

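# Group by all region dummies so every region gets its own mean (all zeros = dropped baseline region)

region_charges = data.groupby(['region_northwest', 'region_southeast', 'region_southwest'])['charges'].mean()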

print("\nMean charges by region:")

print(region_charges)

# b. ANOVA Test

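region_cols = ['region_northwest', 'region_southeast', 'region_southwest']

f_stat, p_value = f_oneway(
    data[data[region_cols].sum(axis=1) == 0]['charges'],  # dropped baseline region
    data[data['region_northwest'] == 1]['charges'],
    data[data['region_southeast'] == 1]['charges'],
    data[data['region_southwest'] == 1]['charges'])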

print(f"\nANOVA Test (Mean Charges Across Regions): f-stat={f_stat}, p-value={p_value}")

# Step 8: Model Building

# a. Build Linear Regression Model

X = data[['age', 'bmi', 'children', 'smoker', 'region_northwest', 'region_southeast', 'region_southwest']]

y = data['charges']

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model

model = LinearRegression()

model.fit(X_train, y_train)

# Evaluate the model

y_pred = model.predict(X_test)

print("\nLinear Regression Model Performance:")

print(f"R-squared: {r2_score(y_test, y_pred)}")

print(f"MSE: {mean_squared_error(y_test, y_pred)}")

Below is the full Python code for the Insurance Claims Dataset Assignment, along with explanations for
each step. This code addresses all the tasks mentioned in the assignment.

Step 1: Import Libraries

Python

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from scipy.stats import ttest_ind, chi2_contingency, f_oneway

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

1. Purpose: Import necessary libraries for data manipulation, visualization, statistical analysis, and
machine learning.

Step 2: Load the Dataset


Python

# Load the dataset

data = pd.read_csv(r"C:\Users\User\Desktop\assignment4\insurance - insurance (2).csv")

# Display the first few rows

print(data.head())

1. Purpose: Load the dataset into a pandas DataFrame and display the first few rows to understand
its structure.

Step 3: Data Preprocessing and Cleaning

a. Handle Missing Values

Python

# Check for missing values

print(data.isnull().sum())

# Drop rows with missing values (if any)

data = data.dropna()

b. Handle Outliers

Python

# Identify outliers in BMI

sns.boxplot(x=data['bmi'])

plt.title("BMI Outliers")

plt.show()

# Keep BMI values between 18.5 and 50 (note: this also drops underweight beneficiaries, not only extreme values)

data = data[(data['bmi'] >= 18.5) & (data['bmi'] <= 50)]
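
The fixed BMI range above is one possible choice. A common data-driven alternative is the 1.5 x IQR rule, which is also what the box plot whiskers are based on. The following is a minimal sketch on made-up values; the helper name `iqr_bounds` and the sample numbers are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up BMI values for illustration only (not taken from the dataset)
bmi = pd.Series([22.0, 24.5, 27.0, 29.5, 31.0, 33.5, 36.0, 60.0])

lower, upper = iqr_bounds(bmi)
inside = bmi[(bmi >= lower) & (bmi <= upper)]

print(f"fences: [{lower:.2f}, {upper:.2f}], kept {len(inside)} of {len(bmi)} values")
```

With this rule the cut-offs adapt to the observed distribution instead of being hard-coded, at the cost of being less interpretable than clinical BMI thresholds.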

c. Encode Categorical Variables

Python

# Encode categorical variables


data['sex'] = data['sex'].map({'male': 0, 'female': 1})

data['smoker'] = data['smoker'].map({'no': 0, 'yes': 1})

data = pd.get_dummies(data, columns=['region'], drop_first=True, dtype=int)

d. Normalize Numerical Features

Python

# Normalize age and BMI

data['age'] = (data['age'] - data['age'].mean()) / data['age'].std()

data['bmi'] = (data['bmi'] - data['bmi'].mean()) / data['bmi'].std()
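
The z-score formula above uses the mean and standard deviation of the whole dataset. When the scaled columns feed a model that is later evaluated on held-out rows, it is common to compute those statistics on the training rows only. The sketch below shows that variant with scikit-learn's `StandardScaler` on synthetic data (the DataFrame `df` is a stand-in, not the real dataset). Note that `StandardScaler` uses the population standard deviation (ddof=0), whereas pandas `.std()` defaults to ddof=1, so the two give slightly different values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the age and bmi columns (not the real dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
})

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Fit the scaler on training rows only, then reuse its statistics on the test rows
scaler = StandardScaler().fit(train[["age", "bmi"]])
train_scaled = scaler.transform(train[["age", "bmi"]])
test_scaled = scaler.transform(test[["age", "bmi"]])

print(train_scaled.mean(axis=0).round(6))  # close to zero by construction
print(train_scaled.std(axis=0).round(6))   # close to one (population std)
```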

Step 4: Exploratory Data Analysis (EDA)

a. Statistical Analysis

Python

print(data.describe())

b. Visualizations

Python

# Scatter plot: Age vs Charges

plt.scatter(data['age'], data['charges'])

plt.title("Age vs Charges")

plt.xlabel("Age")

plt.ylabel("Charges")

plt.show()

# Box plot: Charges by Smoker Status

sns.boxplot(x=data['smoker'], y=data['charges'])

plt.title("Charges by Smoker Status")

plt.show()

# Distribution of BMI by Sex

sns.histplot(data, x='bmi', hue='sex', kde=True)


plt.title("BMI Distribution by Sex")

plt.show()

Step 5: Frequentist Hypothesis Testing

a. Proportion of Male Beneficiaries

Python

male_count = sum(data['sex'] == 0)

total_count = len(data)

prop_male = male_count / total_count

print(f"Proportion of Male Beneficiaries: {prop_male}")
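
The snippet above reports the sample proportion but does not test it. If the task is to check whether the proportion of male beneficiaries differs from a reference value such as 0.5, a one-sample z-test for a proportion is one option. This is a minimal sketch under that assumption; the function name `one_sample_proportion_ztest` and the counts are made up for illustration.

```python
from math import sqrt
from scipy.stats import norm

def one_sample_proportion_ztest(successes, n, p0=0.5):
    """Two-sided z-test of H0: true proportion == p0 (normal approximation)."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
    z = (p_hat - p0) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided tail probability
    return z, p_value

# Made-up counts for illustration only
z, p = one_sample_proportion_ztest(successes=520, n=1000)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

With the real data, `successes` would be `male_count` and `n` would be `total_count`.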

b. Medical Claims by Smokers vs Non-Smokers

Python

smoker_charges = data[data['smoker'] == 1]['charges']

non_smoker_charges = data[data['smoker'] == 0]['charges']

t_stat, p_value = ttest_ind(smoker_charges, non_smoker_charges)

print(f"T-Test: t-stat={t_stat}, p-value={p_value}")
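
By default `ttest_ind` assumes both groups have equal variance. Charges for smokers and non-smokers may have quite different spreads, in which case Welch's t-test (`equal_var=False`) is a safer choice. The sketch below compares both variants on synthetic samples; the means, standard deviations, and group sizes are invented for illustration and are not results from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples with unequal variances and sizes (illustrative only)
rng = np.random.default_rng(0)
smoker_charges = rng.normal(loc=32000, scale=11000, size=100)
non_smoker_charges = rng.normal(loc=8500, scale=6000, size=400)

# Student's t-test (pooled variance) vs Welch's t-test (separate variances)
t_student, p_student = ttest_ind(smoker_charges, non_smoker_charges)
t_welch, p_welch = ttest_ind(smoker_charges, non_smoker_charges, equal_var=False)

print(f"Student: t={t_student:.2f}, p={p_student:.3g}")
print(f"Welch:   t={t_welch:.2f}, p={p_welch:.3g}")
```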

c. BMI of Females vs Males

Python

female_bmi = data[data['sex'] == 1]['bmi']

male_bmi = data[data['sex'] == 0]['bmi']

t_stat, p_value = ttest_ind(female_bmi, male_bmi)

print(f"T-Test: t-stat={t_stat}, p-value={p_value}")

d. Proportion of Smokers Across Regions

Python

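# Cross-tabulate all regions against smoker status (the all-zero row is the dropped baseline region)

region_smoker = pd.crosstab([data['region_northwest'], data['region_southeast'], data['region_southwest']], data['smoker'])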

chi2, p, dof, expected = chi2_contingency(region_smoker)

print(f"Chi-Square Test: chi2={chi2}, p-value={p}")

Step 6: Feature Engineering


a. Create Age Groups

Python

# 'age' was standardized in Step 3, so cut points in years (30, 50) no longer apply; use tertiles instead

data['age_group'] = pd.qcut(data['age'], q=3, labels=['young', 'middle', 'senior'])

b. Create Interaction Feature

Python

data['smoker_BMI'] = data['smoker'] * data['bmi']

Step 7: Region Analysis

a. Mean Charges by Region

Python

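# Group by all region dummies so every region gets its own mean (all zeros = dropped baseline region)

region_charges = data.groupby(['region_northwest', 'region_southeast', 'region_southwest'])['charges'].mean()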

print(region_charges)

b. ANOVA Test

Python

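region_cols = ['region_northwest', 'region_southeast', 'region_southwest']

f_stat, p_value = f_oneway(
    data[data[region_cols].sum(axis=1) == 0]['charges'],  # dropped baseline region
    data[data['region_northwest'] == 1]['charges'],
    data[data['region_southeast'] == 1]['charges'],
    data[data['region_southwest'] == 1]['charges'])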

print(f"ANOVA: f-stat={f_stat}, p-value={p_value}")

Step 8: Model Building

a. Build Linear Regression Model

Python

X = data[['age', 'bmi', 'children', 'smoker', 'region_northwest', 'region_southeast', 'region_southwest']]

y = data['charges']

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model


model = LinearRegression()

model.fit(X_train, y_train)

# Evaluate the model

y_pred = model.predict(X_test)

print(f"R-squared: {r2_score(y_test, y_pred)}")

print(f"MSE: {mean_squared_error(y_test, y_pred)}")
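
A single train-test split gives one estimate of R-squared and MSE, which can vary with the random seed. As a complement, k-fold cross-validation averages the metrics over several splits. The following is a minimal sketch on synthetic data (the feature matrix and coefficients are invented; with the real data, `X` and `y` from the step above would be used instead). It also reports RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (illustrative only, not the insurance data)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# 5-fold cross-validated R-squared and MSE (scikit-learn returns negated MSE)
r2_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-neg_mse)

print(f"R-squared: mean={r2_scores.mean():.3f}, std={r2_scores.std():.3f}")
print(f"RMSE:      mean={rmse_scores.mean():.3f}, std={rmse_scores.std():.3f}")
```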

Step 9: Submission

1. Save the code in .ipynb or .py format.

2. Create a one-page presentation summarizing:

   a. EDA insights (e.g., plots).

   b. Hypothesis test results (e.g., p-values).

   c. Model performance (e.g., R-squared, MSE).
