Programs Week 10
Consider a CSV/Excel data set (available on the web or previously used in your DWVL experiments) having at least 3 numerical attributes. Correct the data for missing values and other bad data, such as invalid characters, if needed.
Solution Expected:
4. Compute the correlation matrix of the above three attributes using the pandas corr() method.
5. Draw the conclusions from the results. (Conclusions must be based on the results, not the theory/algorithm/procedure.)
1.
import pandas as pd

# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')

# Check for missing values in the dataset
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)
Initial Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
Cleaned Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male
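The cleaning step asked for in the problem statement can be sketched as follows. This is only an illustration under stated assumptions: the toy rows are made up, and filling gaps with the column median is one reasonable choice, not necessarily the method used to produce the output above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real file; the values are made up.
raw = pd.DataFrame({
    "Age": ["22", "38", "n/a", "35", None],
    "Fare": ["7.25", "71.2833", "7.925", "5x3.1", "8.05"],
    "SibSp": [1, 1, 0, 1, 0],
})

clean = raw.copy()
for col in ["Age", "Fare", "SibSp"]:
    # Entries with invalid characters ("n/a", "5x3.1") become NaN instead of raising.
    clean[col] = pd.to_numeric(clean[col], errors="coerce")
    # Assumption: fill the remaining gaps with the column median.
    clean[col] = clean[col].fillna(clean[col].median())

print(clean.isnull().sum())
print(clean)
```

Coercing first and filling second means invalid entries and genuinely missing entries are handled by the same rule.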
2.
import pandas as pd
# Load the Titanic dataset (adjust the path if needed)
df = pd.read_csv('titanic.csv')
# Optionally, you can save the pivot table to a new CSV file
# pivot_df.to_csv('pivot_table_titanic.csv')
Initial Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Pivot Table:
              Age                   Fare             SibSp
Sex        female       male      female       male female mal
Pclass
1       34.611765  41.281386  106.125798  67.226127      0
2       28.722973  30.740707   21.970121  19.741782      0
3       21.750000  26.507589   16.118810  12.661633      0
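A pivot table with this layout (Pclass as rows, Sex as columns, one block per attribute) could be built with pandas.pivot_table. The sketch below uses hypothetical toy rows and assumes the mean as the aggregation; the name pivot_df follows the comment in the program above.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the Titanic file.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [38.0, 54.0, 14.0, 22.0, 35.0, 26.0],
    "Fare": [71.28, 51.86, 30.07, 7.25, 8.05, 7.92],
    "SibSp": [1, 0, 1, 1, 0, 0],
})

# Assumption: each cell is the mean of the attribute within a (Pclass, Sex) group.
pivot_df = pd.pivot_table(
    df,
    values=["Age", "Fare", "SibSp"],
    index="Pclass",
    columns="Sex",
    aggfunc="mean",
)
print(pivot_df)
```

Groups with no rows (here, second-class males) appear as NaN rather than being dropped, because other cells in the same row and column are populated.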
3.
import pandas as pd
# Compute the joint probability: dividing each cell by the total count
joint_prob = contingency_table / contingency_table.sum().sum()
Initial Data:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Contingency Table:
SibSp_bins           0-1  2-3  4-5
Age_bins Fare_bins
0-20     0-20         23    5    1
         21-50        14   10   22
         51-100        3    0    0
         101-150       4    0    0
         151+          3    2    0
21-40    0-20         27    6    0
         21-50        39    4    0
         51-100       23    3    0
         101-150       6    0    0
         151+          1    3    0
41-60    0-20          1    1    0
         21-50        13    0    0
         51-100       20    2    0
         101-150       2    1    0
         151+          1    0    0
61-80    51-100        2    0    0
         151+          1    0    0
         21-50      0.160494  0.016461  0.000000
         51-100     0.094650  0.012346  0.000000
         101-150    0.024691  0.000000  0.000000
         151+       0.004115  0.012346  0.000000
41-60    0-20       0.004115  0.004115  0.000000
         21-50      0.053498  0.000000  0.000000
         51-100     0.082305  0.008230  0.000000
         101-150    0.008230  0.004115  0.000000
         151+       0.004115  0.000000  0.000000
61-80    51-100     0.008230  0.000000  0.000000
         151+       0.004115  0.000000  0.000000
dtype: float64
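The binning, contingency table and joint probability steps can be sketched end to end as below. The toy rows are hypothetical, and the bin edges are assumptions chosen only to match the labels in the output (0-20, 21-40, and so on); the original edges are not shown.

```python
import pandas as pd

# Hypothetical toy rows; the real program works on the Titanic file.
df = pd.DataFrame({
    "Age": [5, 19, 30, 33, 45, 70],
    "Fare": [10.0, 30.0, 8.0, 60.0, 120.0, 200.0],
    "SibSp": [4, 1, 0, 0, 2, 0],
})

# Assumed bin edges; intervals are right-closed, e.g. (20, 40] is labelled "21-40".
age_bins = pd.cut(df["Age"], bins=[0, 20, 40, 60, 80],
                  labels=["0-20", "21-40", "41-60", "61-80"])
fare_bins = pd.cut(df["Fare"], bins=[0, 20, 50, 100, 150, float("inf")],
                   labels=["0-20", "21-50", "51-100", "101-150", "151+"],
                   include_lowest=True)
sibsp_bins = pd.cut(df["SibSp"], bins=[-1, 1, 3, 5],
                    labels=["0-1", "2-3", "4-5"])

# Count the rows falling into each (Age bin, Fare bin) x SibSp bin cell.
contingency_table = pd.crosstab(
    [age_bins.rename("Age_bins"), fare_bins.rename("Fare_bins")],
    sibsp_bins.rename("SibSp_bins"),
)

# Joint probability: each cell count divided by the grand total.
joint_prob = contingency_table / contingency_table.sum().sum()
print(contingency_table)
print(joint_prob)
```

Because every cell is divided by the same grand total, the joint probabilities sum to 1. In the output above the counts add up to 243, which is why a cell count of 39 appears as 0.160494.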
4.
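import pandas as pd
df = pd.read_csv('titanic.csv')  # same file as in program 2; adjust the path if needed
# Assumed from the output: pairwise correlation of Age, Fare and SibSp
correlation_matrix = df[['Age', 'Fare', 'SibSp']].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)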
Correlation Matrix:
            Age      Fare     SibSp
Age    1.000000  0.096067 -0.308247
Fare   0.096067  1.000000  0.159651
SibSp -0.308247  0.159651  1.000000
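For step 5, the conclusions have to come from these numbers. One way to read them systematically is to rank the attribute pairs by the magnitude of their correlation, as in the sketch below. The matrix values are copied from the output above; the helper strength() and its cut-offs (0.1, 0.3, 0.5) are hypothetical, a common rule of thumb rather than part of the original program.

```python
import pandas as pd

cols = ["Age", "Fare", "SibSp"]
# Values copied from the correlation matrix printed above.
correlation_matrix = pd.DataFrame(
    [[1.000000, 0.096067, -0.308247],
     [0.096067, 1.000000, 0.159651],
     [-0.308247, 0.159651, 1.000000]],
    index=cols, columns=cols,
)

def strength(r):
    """Rough verbal label for |r|; the cut-offs are a convention, not a rule."""
    r = abs(r)
    if r < 0.1:
        return "negligible"
    if r < 0.3:
        return "weak"
    if r < 0.5:
        return "moderate"
    return "strong"

# Each unordered pair once (upper triangle), sorted by |r| in descending order.
pairs = [(a, b, correlation_matrix.loc[a, b])
         for i, a in enumerate(cols) for b in cols[i + 1:]]
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for a, b, r in ranked:
    direction = "positive" if r > 0 else "negative"
    print(f"{a} vs {b}: r = {r:+.3f} ({strength(r)}, {direction})")
```

Read this way, the results suggest that Age and SibSp have the largest correlation in magnitude (about -0.31): older passengers tended to travel with fewer siblings or spouses. Fare rises slightly with SibSp (about 0.16) and is almost unrelated to Age (about 0.10). None of the three pairs is strongly correlated.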