
Programs Week 10

Consider a CSV/Excel data set (available on the web or previously used in your
DWVL experiments) having at least 3 numerical attributes. Correct the data for
missing values and other bad data, such as invalid characters, if needed.

Expected solution:

1. Recall the definitions of joint, marginal and conditional probabilities of
multiple events. (These definitions must be included in the worksheet under a
separate section, "Background Theory".)

2. Create a spreadsheet-style pivot table for 3 numerical attributes using
pandas' pivot_table() method and fill the table using a few aggregate
statistics (e.g., mean, median, min or max).

3. Create a contingency table of 3 numerical attributes and compute joint,
marginal and conditional probabilities using pandas' crosstab() method.

4. Compute the correlation matrix of the above three attributes using pandas'
corr() method.

5. Draw conclusions from the results. (Conclusions must be based on the
results, not the theory/algorithm/procedure.)
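For step 1, the standard definitions can be sketched as follows for the
"Background Theory" section (here A and B denote events):

```latex
\text{Joint: } P(A \cap B) \equiv P(A, B)
\qquad
\text{Marginal: } P(A) = \sum_{b} P(A, B = b)
\qquad
\text{Conditional: } P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0
```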

1.

import pandas as pd

# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')

# Display the first few rows of the dataset to understand its structure
print("Initial Data:")
print(df.head())

# Check for missing values in the dataset
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)

# Handle missing values:
# 1. For numerical columns, fill missing values with the median
numerical_columns = ['Age', 'Fare']  # Example numerical columns

# Filling missing values for numerical columns with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# 2. For categorical columns (like 'Embarked'), fill missing values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Check for invalid characters:
# 'Fare' should be numeric; coerce any non-numeric entry to NaN
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')

# Fill any NaN values in 'Fare' with the median
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Check for missing values again after cleaning
missing_values_after = df.isnull().sum()
print("\nMissing values after correction:")
print(missing_values_after)

# Display the cleaned data
print("\nCleaned Data:")
print(df.head())

# Optionally: save the cleaned dataset to a new CSV file
# df.to_csv('titanic_cleaned.csv', index=False)

Initial Data:
PassengerId Survived Pclass \

0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

Missing values per column:


PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Missing values after correction:


PassengerId 0
Survived 0

Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64

Cleaned Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
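The effect of `errors='coerce'` above can be seen on a small synthetic series
(hypothetical values, not taken from the Titanic data): any entry containing
invalid characters becomes NaN instead of raising an error, and can then be
filled with the median as in the script.

```python
import pandas as pd

# A toy column containing one entry with invalid characters
s = pd.Series(['7.25', '71.28', 'N/A?', '8.05'])

# Non-numeric entries become NaN instead of raising an error
cleaned = pd.to_numeric(s, errors='coerce')
print(cleaned.isna().sum())  # the single bad entry is now NaN

# Fill the NaN with the median of the valid values, as in the script above
cleaned = cleaned.fillna(cleaned.median())
print(cleaned.tolist())  # → [7.25, 71.28, 8.05, 8.05]
```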

2.

import pandas as pd

# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')

# Display the first few rows of the dataset to understand its structure
print("Initial Data:")
print(df.head())

# Create the pivot table using pivot_table()
# For example, use 'Pclass' (passenger class) as the index and 'Sex' as the columns
pivot_df = pd.pivot_table(df,
                          index='Pclass',
                          columns='Sex',
                          values=['Age', 'Fare', 'SibSp'],
                          aggfunc={'Age': 'mean', 'Fare': 'mean', 'SibSp': 'mean'})

# Display the pivot table
print("\nPivot Table:")
print(pivot_df)

# Optionally, save the pivot table to a new CSV file
# pivot_df.to_csv('pivot_table_titanic.csv')

Initial Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male

Parch Ticket Fare Cabin Embarked

0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

Pivot Table:
Age Fare SibSp
Sex female male female male female male
Pclass
1 34.611765 41.281386 106.125798 67.226127 0
2 28.722973 30.740707 21.970121 19.741782 0
3 21.750000 26.507589 16.118810 12.661633 0
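The aggregation performed by pivot_table() can be checked on a tiny hand-made
frame (hypothetical data, mirroring the index/columns/values layout used above):

```python
import pandas as pd

# Tiny hypothetical frame: two classes, two sexes, one Age each
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2],
    'Sex':    ['male', 'female', 'male', 'female'],
    'Age':    [40.0, 30.0, 20.0, 10.0],
})

# Mean Age per (Pclass, Sex) cell, as in the script above
pt = pd.pivot_table(toy, index='Pclass', columns='Sex',
                    values='Age', aggfunc='mean')
print(pt)
# With one row per cell, each cell is just that row's Age,
# e.g. pt.loc[1, 'male'] == 40.0
```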

3.

import pandas as pd

# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')

# Display the first few rows of the dataset to understand its structure
print("Initial Data:")
print(df.head())

# Select three numerical attributes, for example: Age, Fare and SibSp.
# Bin the data into categories because crosstab() works on discrete values
df['Age_bins'] = pd.cut(df['Age'], bins=[0, 20, 40, 60, 80, 100],
                        labels=['0-20', '21-40', '41-60', '61-80', '81-100'])
df['Fare_bins'] = pd.cut(df['Fare'], bins=[0, 20, 50, 100, 150, float('inf')],
                         labels=['0-20', '21-50', '51-100', '101-150', '151+'])
df['SibSp_bins'] = pd.cut(df['SibSp'], bins=[0, 1, 3, 5, 10],
                          labels=['0-1', '2-3', '4-5', '6-10'])

# Create the contingency table using crosstab()
contingency_table = pd.crosstab([df['Age_bins'], df['Fare_bins']], df['SibSp_bins'])

# Display the contingency table
print("\nContingency Table:")
print(contingency_table)

# Compute the joint probability by dividing each cell by the total count
joint_prob = contingency_table / contingency_table.sum().sum()

# Display the joint probability table
print("\nJoint Probability Table:")
print(joint_prob)

# Compute the marginal probabilities
marginal_age_fare = contingency_table.sum(axis=1) / contingency_table.sum().sum()
marginal_sibsp = contingency_table.sum(axis=0) / contingency_table.sum().sum()

print("\nMarginal Probability (Age & Fare):")
print(marginal_age_fare)

print("\nMarginal Probability (SibSp):")
print(marginal_sibsp)

# Compute the conditional probability: P(A|B) = P(A and B) / P(B)
# Dividing the joint by the SibSp marginal gives P(Age & Fare | SibSp),
# so each column of the result sums to 1
conditional_prob = joint_prob / marginal_sibsp

print("\nConditional Probability Table (Age & Fare given SibSp):")
print(conditional_prob)

Initial Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male

Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

Contingency Table:
SibSp_bins 0-1 2-3 4-5
Age_bins Fare_bins
0-20 0-20 23 5 1
21-50 14 10 22
51-100 3 0 0
101-150 4 0 0
151+ 3 2 0
21-40 0-20 27 6 0
21-50 39 4 0
51-100 23 3 0
101-150 6 0 0
151+ 1 3 0
41-60 0-20 1 1 0
21-50 13 0 0
51-100 20 2 0
101-150 2 1 0
151+ 1 0 0
61-80 51-100 2 0 0
151+ 1 0 0

Joint Probability Table:


SibSp_bins 0-1 2-3 4-5
Age_bins Fare_bins
0-20 0-20 0.094650 0.020576 0.004115
21-50 0.057613 0.041152 0.090535
51-100 0.012346 0.000000 0.000000
101-150 0.016461 0.000000 0.000000
151+ 0.012346 0.008230 0.000000
21-40 0-20 0.111111 0.024691 0.000000

21-50 0.160494 0.016461 0.000000
51-100 0.094650 0.012346 0.000000
101-150 0.024691 0.000000 0.000000
151+ 0.004115 0.012346 0.000000
41-60 0-20 0.004115 0.004115 0.000000
21-50 0.053498 0.000000 0.000000
51-100 0.082305 0.008230 0.000000
101-150 0.008230 0.004115 0.000000
151+ 0.004115 0.000000 0.000000
61-80 51-100 0.008230 0.000000 0.000000
151+ 0.004115 0.000000 0.000000

Marginal Probability (Age & Fare):


Age_bins Fare_bins
0-20 0-20 0.119342
21-50 0.189300
51-100 0.012346
101-150 0.016461
151+ 0.020576
21-40 0-20 0.135802
21-50 0.176955
51-100 0.106996
101-150 0.024691
151+ 0.016461
41-60 0-20 0.008230
21-50 0.053498
51-100 0.090535
101-150 0.012346
151+ 0.004115
61-80 51-100 0.008230
151+ 0.004115
dtype: float64

Marginal Probability (SibSp):


SibSp_bins
0-1 0.753086
2-3 0.152263
4-5 0.094650

dtype: float64

Conditional Probability Table (Age & Fare given SibSp):


SibSp_bins 0-1 2-3 4-5
Age_bins Fare_bins
0-20 0-20 0.125683 0.135135 0.043478
21-50 0.076503 0.270270 0.956522
51-100 0.016393 0.000000 0.000000
101-150 0.021858 0.000000 0.000000
151+ 0.016393 0.054054 0.000000
21-40 0-20 0.147541 0.162162 0.000000
21-50 0.213115 0.108108 0.000000
51-100 0.125683 0.081081 0.000000
101-150 0.032787 0.000000 0.000000
151+ 0.005464 0.081081 0.000000
41-60 0-20 0.005464 0.027027 0.000000
21-50 0.071038 0.000000 0.000000
51-100 0.109290 0.054054 0.000000
101-150 0.010929 0.027027 0.000000
151+ 0.005464 0.000000 0.000000
61-80 51-100 0.010929 0.000000 0.000000
151+ 0.005464 0.000000 0.000000
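Note that crosstab() can also produce these probabilities directly through its
normalize parameter, instead of dividing by hand; a minimal sketch on
hypothetical labels:

```python
import pandas as pd

a = pd.Series(['x', 'x', 'y', 'y'])
b = pd.Series(['u', 'v', 'u', 'u'])

# Joint probabilities: each cell divided by the grand total
joint = pd.crosstab(a, b, normalize='all')

# Conditional P(b | a): each row is divided by its row total, so rows sum to 1
cond = pd.crosstab(a, b, normalize='index')

print(joint)
print(cond)
```

normalize='columns' would analogously give the column-wise conditional, which
is the form computed by joint_prob / marginal_sibsp above.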

4.

import pandas as pd

# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')

# Select the three numerical attributes: Age, Fare, SibSp
numerical_data = df[['Age', 'Fare', 'SibSp']]

# Compute the correlation matrix using pandas' corr() method
correlation_matrix = numerical_data.corr()

# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)

Correlation Matrix:
Age Fare SibSp
Age 1.000000 0.096067 -0.308247
Fare 0.096067 1.000000 0.159651
SibSp -0.308247 0.159651 1.000000
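As a sanity check on corr(), Pearson's r can be computed directly from its
definition, cov(x, y) / (std(x) · std(y)), on synthetic data (the variables and
the correlation structure here are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # positively correlated by construction

# Pearson's r from its definition (population covariance and stds)
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

# pandas corr() computes the same quantity
r_pandas = pd.DataFrame({'x': x, 'y': y}).corr().loc['x', 'y']
print(abs(r_manual - r_pandas) < 1e-9)  # the two agree
```

The sample/population distinction cancels out in the ratio, which is why the
ddof=0 formula above matches pandas exactly.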

