Ds&bda 1-14
Data Wrangling - I
Perform the following operations using Python on any open-source dataset (e.g., data.csv):
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
4. Data Preprocessing: check for missing values in the data using the pandas isnull() and describe() functions to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you perform in the above steps, including everything that you do to import/read/scrape the dataset.
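Before the cells below can run, the dataset has to be read into a pandas dataframe and inspected. A minimal sketch of those inspection steps, using a small hypothetical frame in place of data.csv (the column names here are placeholders):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for data.csv; real column names come from the dataset
df = pd.DataFrame({
    "Name": ["a", "b", None],
    "Age": [19, np.nan, 23],
    "Admission Test Score": [88.0, 75.0, np.nan],
})

print(df.shape)           # dimensions of the data frame
print(df.dtypes)          # variable types
print(df.isnull().sum())  # missing values per column
print(df.describe())      # initial statistics for numeric columns
```

With a real file, the frame would instead come from `pd.read_csv("data.csv")`.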
In [19]: df
Out[19]: [dataframe with columns Name, Age, Gender, Admission Test Score, High School Percentage, City, Admission Status; table output truncated]
In [24]: df.head(10)
Out[24]: [first 10 rows of the same dataframe; table output truncated]
In [26]: df.isnull().sum()
Out[26]: Name 10
Age 10
Gender 10
Admission Test Score 11
High School Percentage 11
City 10
Admission Status 10
dtype: int64
In [31]: df.notnull().sum()
In [38]: df.size
Out[38]: 1099
In [40]: df.shape
Out[40]: (157, 7)
In [42]: df.ndim
Out[42]: 2
In [44]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 7 columns):
# Column Non-Null Count Dtype
df["Age"].replace(np.nan, 10, inplace=True)
In [52]: df
In [54]: df.isnull().sum()
Out[54]: Name 10
Age 0
Gender 10
Admission Test Score 11
High School Percentage 11
City 10
Admission Status 10
dtype: int64
In [64]: Avg_percentage
Out[64]: 75.68472602739726
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\3697381515.py:1: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
In [68]: df.isnull().sum()
Out[68]: Name 10
Age 0
Gender 10
Admission Test Score 11
High School Percentage 0
City 10
Admission Status 10
dtype: int64
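The FutureWarnings above come from calling an inplace replace/fillna on a selected column. A sketch of the assignment-based idiom pandas recommends instead, on a small hypothetical frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [19, np.nan, 23],
    "High School Percentage": [80.0, np.nan, 70.0],
})

# Assigning the result back (instead of inplace=True on a selected
# column) avoids the chained-assignment FutureWarning
df["Age"] = df["Age"].fillna(10)
df["High School Percentage"] = df["High School Percentage"].fillna(
    df["High School Percentage"].mean())

print(df.isnull().sum())
```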
In [70]: df
df["Gender"].replace({'Female':0,'Male':1},inplace = True)
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\3985573958.py:1: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df["Gender"].replace({'Female':0,'Male':1},inplace = True)
In [75]: df
In [79]: df
To count category frequencies:
In [82]: df["Gender"].value_counts()
Out[82]: Gender
0.0 83
1.0 64
Name: count, dtype: int64
In [84]: df["Age"].value_counts()
Out[84]: Age
17 24
23 19
19 18
22 18
24 17
20 17
21 15
18 14
10 10
-1 5
Name: count, dtype: int64
Concept Hierarchy
In [87]: def fun1(value):
if (value < 20):
return "teenager"
elif (value >= 20 and value < 40):
return "young"
elif (value >= 40 and value < 60):
return "middle aged"
elif (value >= 60):
return "senior citizen"
else:
pass
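fun1 above can be applied to the Age column with Series.apply. A condensed, self-contained sketch with the same bins:

```python
import pandas as pd

def fun1(value):
    if value < 20:
        return "teenager"
    elif 20 <= value < 40:
        return "young"
    elif 40 <= value < 60:
        return "middle aged"
    else:
        return "senior citizen"

# Map each age onto its concept-hierarchy level
ages = pd.Series([17, 23, 45, 65])
labels = ages.apply(fun1)
print(labels.tolist())
```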
In [91]: df
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any suitable technique to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any suitable technique to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution. Reason about and document your approach properly.
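Step 1 (the missing-value scan and fill) is carried out in the cells below, where the two missing marks are replaced by the column mean (69.923077). A self-contained sketch of that pattern:

```python
import pandas as pd
import numpy as np

# The marks column from the frame below, with its two missing entries
df = pd.DataFrame({"marks": [40, 23, 50, 78, 48, 89, 90, 67, 90, 96,
                             76, np.nan, 97, np.nan, 65]})

# Fill the missing marks with the column mean
mean_marks = df["marks"].mean()
df["marks"] = df["marks"].fillna(mean_marks)
print(round(mean_marks, 6))
```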
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df
rollno name marks grade
0 1 a 40.0 F
1 2 b 23.0 F
2 3 c 50.0 P
3 4 d 78.0 P
4 5 e 48.0 P
5 6 f 89.0 P
6 7 g 90.0 P
7 8 h 67.0 P
8 9 i 90.0 P
9 10 j 96.0 P
10 11 NaN 76.0 P
11 12 NaN NaN F
12 13 k 97.0 P
13 14 l NaN NaN
14 15 m 65.0 NaN
Dataset Statistics
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
# Column Non-Null Count Dtype
Null values
name 2
marks 2
df
rollno name marks grade
0 1 a 40.000000 F
1 2 b 23.000000 F
2 3 c 50.000000 P
3 4 d 78.000000 P
4 5 e 48.000000 P
5 6 f 89.000000 P
6 7 g 90.000000 P
7 8 h 67.000000 P
8 9 i 90.000000 P
9 10 j 96.000000 P
10 11 NaN 76.000000 P
11 12 NaN 69.923077 F
12 13 k 97.000000 P
13 14 l 69.923077 NaN
14 15 m 65.000000 NaN
df
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 P
3 4 d 78 P
4 5 e 48 P
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
10 11 NaN 76 P
11 12 NaN 69 F
12 13 k 97 P
13 14 l 69 NaN
14 15 m 65 NaN
df
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 P
3 4 d 78 P
4 5 e 48 P
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
12 13 k 97 P
13 14 l 69 NaN
14 15 m 65 NaN
# print(row['marks'], row['grade'])
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 F
3 4 d 78 P
4 5 e 48 F
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
12 13 k 97 P
13 14 l 69 P
14 15 m 65 P
df.boxplot()
plt.show()
Outliers
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 F
3 4 d 78 P
4 5 e 48 F
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
12 13 k 97 P
13 14 l 69 P
14 15 m 65 P
15 16 n 200 P
16 17 o -100 F
df.boxplot()
plt.show()
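One common way to handle the outliers visible in the boxplot (the injected marks of 200 and -100) is the IQR fence rule: drop rows outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A sketch on the same marks:

```python
import pandas as pd

# Marks including the two injected outliers (200 and -100)
df = pd.DataFrame({"marks": [40, 23, 50, 78, 48, 89, 90, 67, 90, 96,
                             97, 69, 65, 200, -100]})

q1, q3 = df["marks"].quantile(0.25), df["marks"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the IQR fences
newdf = df[(df["marks"] >= lower) & (df["marks"] <= upper)]
print(len(newdf), newdf["marks"].min(), newdf["marks"].max())
```

On this data both injected values fall outside the fences and are removed.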
#newdf
sns.boxplot(data=df, x='marks');
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 F
3 4 d 78 P
4 5 e 48 F
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
12 13 k 97 P
13 14 l 69 P
14 15 m 65 P
15 16 n 200 P
16 17 o -100 F
df
rollno name marks grade
0 1 a 40 F
1 2 b 23 F
2 3 c 50 F
3 4 d 78 P
4 5 e 48 F
5 6 f 89 P
6 7 g 90 P
7 8 h 67 P
8 9 i 90 P
9 10 j 96 P
12 13 k 97 P
13 14 l 69 P
14 15 m 65 P
df.boxplot()
plt.show()
rollno name marks grade
0 1 a 0.229730 F
1 2 b 0.000000 F
2 3 c 0.364865 F
3 4 d 0.743243 P
4 5 e 0.337838 F
5 6 f 0.891892 P
6 7 g 0.905405 P
7 8 h 0.594595 P
8 9 i 0.905405 P
9 10 j 0.986486 P
12 13 k 1.000000 P
13 14 l 0.621622 P
14 15 m 0.567568 P
rollno name marks grade
0 1 a 0.229730 F
1 2 b 0.000000 F
2 3 c 0.364865 F
3 4 d 0.743243 P
4 5 e 0.337838 F
5 6 f 0.891892 P
6 7 g 0.905405 P
7 8 h 0.594595 P
8 9 i 0.905405 P
9 10 j 0.986486 P
13 14 l 0.621622 P
14 15 m 0.567568 P
plt.show()
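The 0-to-1 marks values above are consistent with min-max normalization (step 3, a scale-changing transformation). A sketch on a few of the rows:

```python
import pandas as pd

df = pd.DataFrame({"marks": [40, 23, 50, 78, 97]})

# Min-max scaling maps the column onto [0, 1]
df["marks_scaled"] = (df["marks"] - df["marks"].min()) / \
                     (df["marks"].max() - df["marks"].min())
print(df["marks_scaled"].round(6).tolist())
```

For example, marks = 40 maps to (40 - 23) / (97 - 23) ≈ 0.229730, matching the table above.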
Practical No-3
Descriptive Statistics - Measures of Central Tendency and variability Perform the following operations on any open source dataset (e.g.,
data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric
variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of income grouped by the age groups. Create a list that contains a
numeric value for each response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset. Provide the code with outputs and explain everything that you do in this step.
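A sketch of task 1 on a small hypothetical age-group/income frame (the column names are placeholders; the notebook below uses the loan dataset instead):

```python
import pandas as pd

# Hypothetical income-by-age-group data standing in for the open-source dataset
df = pd.DataFrame({
    "age_group": ["young", "young", "middle", "middle", "senior"],
    "income": [30000, 35000, 50000, 55000, 40000],
})

# Summary statistics of income grouped by the categorical variable
summary = df.groupby("age_group")["income"].agg(["mean", "median", "min", "max", "std"])
print(summary)

# A list with one numeric value per response to the categorical variable
income_by_group = [g["income"].mean() for _, g in df.groupby("age_group")]
print(income_by_group)
```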
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df.head()
df.head() columns: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, ...
[wide table truncated; e.g. row 3: LP001006, Male, Yes, 0, Not Graduate, No, 2583.0, 2358.0, 120.0, 360.0, ...]
Basic stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 613 non-null float64
7 CoapplicantIncome 613 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 561 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(5), object(8)
memory usage: 62.5+ KB
[numeric summary for ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History; table truncated]
df.isna().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 1
CoapplicantIncome 1
LoanAmount 22
Loan_Amount_Term 14
Credit_History 53
Property_Area 0
Loan_Status 0
dtype: int64
Let us group the quantitative variables 'ApplicantIncome', 'Coapplicant Income', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History' by
'Loan_Status' categorical variable
df["ApplicantIncome"].plot(kind="hist")
plt.show()
df["CoapplicantIncome"].plot(kind="hist")
plt.show()
grouped_df = df.groupby("Loan_Status")
max = grouped_df.max()
max
[per-Loan_Status maxima; output truncated]
Iris dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
Id 75.500000
SepalLengthCm 5.843333
SepalWidthCm 3.054000
PetalLengthCm 3.758667
PetalWidthCm 1.198667
dtype: float64
0 -> setosa
1 -> versicolor
2 -> virginica
Basic stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
Setosa stats
count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000
[df grouped by Species, describe(): 3 rows × 40 columns; wide table truncated. SepalLengthCm means per species: Iris-setosa 5.006, Iris-versicolor 5.936, Iris-virginica 6.588]
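Per-species statistics can also be produced by slicing one species at a time, as task 2 asks. A sketch on a miniature stand-in for iris.csv (column names assumed to match the notebook):

```python
import pandas as pd

# Miniature stand-in for iris.csv
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# describe() on a single-species slice gives count, mean, std, percentiles, min, max
setosa = df[df["Species"] == "Iris-setosa"]
print(setosa["SepalLengthCm"].describe())

# Or all species at once
print(df.groupby("Species")["SepalLengthCm"].mean())
```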
Id count 150.000000
mean 226.500000
std 43.732139
min 153.000000
25% 189.750000
50% 226.500000
75% 263.250000
max 300.000000
SepalLengthCm count 150.000000
mean 17.530000
std 1.504540
min 14.100000
25% 16.625000
50% 17.400000
75% 18.400000
max 20.700000
SepalWidthCm count 150.000000
mean 9.162000
std 1.017319
min 6.500000
25% 8.450000
50% 9.200000
75% 9.850000
max 11.600000
PetalLengthCm count 150.000000
mean 11.276000
std 1.195317
min 8.500000
25% 10.500000
50% 11.400000
75% 12.050000
max 13.900000
PetalWidthCm count 150.000000
mean 3.596000
std 0.579612
min 2.500000
25% 3.200000
50% 3.500000
75% 4.100000
max 4.900000
dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
import matplotlib.pyplot as plt
%matplotlib inline
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9
501 rows × 14 columns
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
df.describe()
count 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000
mean 3.647414 11.402196 11.160619 0.069860 0.555151 6.284341 68.513373 3.786423 9.596806 409.143713 18.453493
std 8.637688 23.414214 6.857123 0.255166 0.116186 0.705587 28.212221 2.103327 8.735509 169.021216 2.166327
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000
25% 0.081990 0.000000 5.190000 0.000000 0.449000 5.884000 45.000000 2.088200 4.000000 279.000000 17.400000
50% 0.261690 0.000000 9.690000 0.000000 0.538000 6.208000 77.700000 3.182700 5.000000 330.000000 19.000000
75% 3.693110 12.500000 18.100000 0.000000 0.624000 6.625000 94.000000 5.118000 24.000000 666.000000 20.200000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000
crim zn indus chas nox rm age dis rad tax ptratio b lstat
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
medv
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
Basic stats
<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, 0 to 505
Data columns (total 13 columns):
# Column Non-Null Count Dtype
0 crim 501 non-null float64
1 zn 501 non-null float64
2 indus 501 non-null float64
3 chas 501 non-null int64
4 nox 501 non-null float64
5 rm 501 non-null float64
6 age 501 non-null float64
7 dis 501 non-null float64
8 rad 501 non-null int64
9 tax 501 non-null int64
10 ptratio 501 non-null float64
11 b 501 non-null float64
12 lstat 501 non-null float64
dtypes: float64(10), int64(3)
memory usage: 54.8 KB
x.describe()
count 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000
mean 3.647414 11.402196 11.160619 0.069860 0.555151 6.284341 68.513373 3.786423 9.596806 409.143713 18.453493
std 8.637688 23.414214 6.857123 0.255166 0.116186 0.705587 28.212221 2.103327 8.735509 169.021216 2.166327
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000
25% 0.081990 0.000000 5.190000 0.000000 0.449000 5.884000 45.000000 2.088200 4.000000 279.000000 17.400000
50% 0.261690 0.000000 9.690000 0.000000 0.538000 6.208000 77.700000 3.182700 5.000000 330.000000 19.000000
75% 3.693110 12.500000 18.100000 0.000000 0.624000 6.625000 94.000000 5.118000 24.000000 666.000000 20.200000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000
<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, 0 to 505
Data columns (total 1 columns):
# Column Non-Null Count Dtype
count 501.000000
mean 22.561277
std 9.232435
min 5.000000
25% 17.000000
50% 21.200000
75% 25.000000
max 50.000000
crim 0
zn 0
indus 0
chas 0
nox 0
rm 0
age 0
dis 0
rad 0
tax 0
ptratio 0
b 0
lstat 0
medv 0
crim zn indus chas nox rm age dis rad tax ptratio b lstat target
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
Considering only 'RM' and 'LSTAT', chosen after examining correlation and multicollinearity among the other features.
Scale the data
array([[ 8.18092223e-01, -1.30160443e+00],
       [-8.17641984e-01, -1.26300141e-01],
       ...,
       [ 1.88199779e-01,  9.31613474e-01]])
[standardized 'rm' and 'lstat' values; long array output truncated]
Linear Regression:
Linear Regression is one of the most fundamental and widely known machine learning algorithms. A Linear Regression model predicts the dependent variable using a regression line based on the independent variables. The equation of Linear Regression is:
Y = m*X + C + e
where C is the intercept, m is the slope of the line, and e is the error term. This equation is used to predict the value of the target variable based on the given predictor variable(s).
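As a sketch of fitting this equation with scikit-learn (toy data generated from Y = 2X + 1, not the Boston features used above, so the fit should recover m = 2 and C = 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following Y = 2*X + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)

# Slope (m), intercept (C), and a prediction for X = 5
y_pred = model.predict(np.array([[5.0]]))
print(model.coef_[0], model.intercept_, y_pred[0])
```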
Make predictions
y_pred
29.63386601617516
4.00867504953758
3. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall on the given dataset.
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
%matplotlib inline
Load data
df
Basic stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
# Column Non-Null Count Dtype
Age 0
def fun1(value):
    if value == "Male":
        return 1
    elif value == "Female":
        return 0
    else:
        return -1
0 15624510 1 19 19000 0
1 15810944 1 35 20000 0
2 15668575 0 26 43000 0
3 15603246 0 27 57000 0
4 15804002 1 19 76000 0
0 257
1 143
sns.heatmap(df.corr(), annot=True)
plt.show()
Data preparation
Model building
model = LogisticRegression()
model.fit(x_train, y_train)
LogisticRegression()
y_pred = model.predict(x_test)
y_pred
array([0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0])
model.score(x_train,y_train)
0.821875
model.score(x,y)
0.835
Evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[50 2]
[ 7 21]]
disp=ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=model.classes_)
disp.plot()
plt.show()
TN value is 50
FP value is 2
FN value is 7
TP value is 21
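From these four counts the requested metrics follow directly:

```python
# The four counts read off the confusion matrix above:
# [[TN, FP],
#  [FN, TP]]
TN, FP, FN, TP = 50, 2, 7, 21

accuracy = (TP + TN) / (TP + TN + FP + FN)
error_rate = 1 - accuracy
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(accuracy, error_rate, precision, recall)
```

Here accuracy = 71/80 = 0.8875, error rate = 0.1125, precision = 21/23 ≈ 0.913, and recall = 21/28 = 0.75, consistent with the classification report below.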
print(classification_report(y_test, y_pred))
accuracy 0.89 80
macro avg 0.90 0.86 0.87 80
weighted avg 0.89 0.89 0.88 80
Practical No-6
Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
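A compact end-to-end sketch of tasks 1-2, using scikit-learn's built-in iris data in place of iris.csv (the split parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Built-in iris data as a stand-in for iris.csv
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Gaussian Naive Bayes classifier
model = GaussianNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

acc = accuracy_score(y_test, y_pred)
print(confusion_matrix(y_test, y_pred))
print(acc)
```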
Import libraries
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
%matplotlib inline
Load data
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
Basic stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
# Column Non-Null Count Dtype
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 1 columns):
# Column Non-Null Count Dtype
Data preparation
Model building
model = GaussianNB()
model.fit(x_train, y_train)
GaussianNB()
y_pred = model.predict(x_test)
Evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
print(classification_report(y_test, y_pred))
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Practical No-7
Text Analytics
1. Extract Sample document and apply following document preprocessing methods: Tokenization, POS Tagging, stop words removal,
Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document Frequency.
import nltk
nltk.download("all")
Tokenization
['Sachin', 'was', 'the', 'GOAT', 'of', 'the', 'previous', 'generation', '.', 'Virat', 'is', 'the', 'GOAT', 'of'
, 'this', 'generation', '.', 'Shubham', 'will', 'be', 'the', 'GOAT', 'of', 'the', 'next', 'generation']
['Sachin was the GOAT of the previous generation.', 'Virat is the GOAT of this generation.', 'Shubham will be t
he GOAT of the next generation']
POS tagging
[('Sachin', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('GOAT', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('previous', '
JJ'), ('generation', 'NN'), ('.', '.'), ('Virat', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('GOAT', 'NNP'), ('of',
'IN'), ('this', 'DT'), ('generation', 'NN'), ('.', '.'), ('Shubham', 'NNP'), ('will', 'MD'), ('be', 'VB'), ('th
e', 'DT'), ('GOAT', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('generation', 'NN')]
Stop words removal
['Sachin', 'GOAT', 'previous', 'generation', '.', 'Virat', 'GOAT', 'generation', '.', 'Shubham', 'GOAT', 'next', 'generation']
Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = []
for token in filtered_tokens:   # filtered_tokens: stop-word-filtered tokens from the previous step
    stemmed = stemmer.stem(token)
    stemmed_tokens.append(stemmed)
['sachin', 'goat', 'previou', 'gener', '.', 'virat', 'goat', 'gener', '.', 'shubham', 'goat', 'next', 'gener']
Lemmatization
['Sachin', 'GOAT', 'previous', 'generation', '.', 'Virat', 'GOAT', 'generation', '.', 'Shubham', 'GOAT', 'next'
, 'generation']
TF-IDF
Sachin:1
was:1
the:5
GOAT:3
of:3
previous:1
generation:3
.:2
Virat:1
is:1
this:1
Shubham:1
will:1
be:1
next:1
(0, 7) 1.0
(1, 12) 1.0
(2, 9) 1.0
(3, 2) 1.0
(4, 5) 1.0
(5, 9) 1.0
(6, 6) 1.0
(7, 1) 1.0
(9, 11) 1.0
(10, 3) 1.0
(11, 9) 1.0
(12, 2) 1.0
(13, 5) 1.0
(14, 10) 1.0
(15, 1) 1.0
(17, 8) 1.0
(18, 13) 1.0
(19, 0) 1.0
(20, 9) 1.0
(21, 2) 1.0
(22, 5) 1.0
(23, 9) 1.0
(24, 4) 1.0
(25, 1) 1.0
Practical No-8
Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows of information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
(df.describe() output: summary statistics for PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare)
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
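Between the two null counts the 177 missing Age values drop to zero; the imputation cell is not shown, but a typical mean fill (the method is an assumption) looks like:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the Titanic data (values are illustrative)
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})
df["Age"] = df["Age"].fillna(df["Age"].mean())   # replace NaN with the column mean
print(df["Age"].isna().sum())
```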
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Visualization
Ticket
347082 7
CA. 2343 7
1601 7
3101295 6
CA 2144 6
..
9234 1
19988 1
2693 1
PC 17612 1
370376 1
Name: count, Length: 681, dtype: int64
Cabin
B96 B98 4
G6 4
C23 C25 C27 4
C22 C26 3
F33 3
..
E34 1
C7 1
C54 1
E36 1
C148 1
Embarked
S 644
C 168
Q 77
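The two missing Embarked values later read as zero nulls; given the counts above (S: 644 is the most frequent port), the usual fix fills with the mode (the method is an assumption):

```python
import numpy as np
import pandas as pd

# toy column standing in for Embarked (values are illustrative)
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # mode here is "S"
print(df["Embarked"].isna().sum())
```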
# Encoding helpers (sketch; assumed mappings: Sex male→1 / female→0, Embarked S→0 / C→1 / Q→2)
def convert_sex(value):
    if value == "male":
        return 1
    return 0

def convert_embarked(value):
    if value == "S":
        return 0
    if value == "C":
        return 1
    if value == "Q":
        return 2
    return 0
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
# SibSp Distribution
# Parch Distribution
#plt.tight_layout()
#plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null int64
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Embarked 891 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 76.7+ KB
sns.countplot(df,x="Pclass", hue="Survived",palette="Accent")
plt.show()
sns.histplot(df["Fare"])
plt.show()
Practical No-9
Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along
with the information about whether they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
(df.describe() output: summary statistics for PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare)
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Visualization
# Encoding helpers (sketch; assumed mappings: Sex male→1 / female→0, Embarked S→0 / C→1 / Q→2)
def convert_sex(value):
    if value == "male":
        return 1
    return 0

def convert_embarked(value):
    if value == "S":
        return 0
    if value == "C":
        return 1
    if value == "Q":
        return 2
    return 0
(Figure: box plot of Age by Sex, hue = Survived (legend 0/1), age axis 0–80)
plt.figure(figsize=(10,7))
box = sns.boxplot(df,x="Sex", y="Age", hue="Survived")
plt.show()
This code displays a box plot of the age distribution by gender and survival status. From it you can observe whether there are age differences between survivors and non-survivors, and whether gender has a distinct influence on survival outcomes.
Practical No-10
Data Visualization III Download the Iris flower dataset or any other dataset into a DataFrame. (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label
Visualization
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
sns.histplot(df["sepal width (cm)"], kde=True)
plt.show()
sns.histplot(df["petal width (cm)"], kde=True)
plt.show()
sns.boxplot(x=df['label'] ,y=df["sepal width (cm)"])
plt.show()
sns.boxplot(x=df['label'] ,y=df["petal width (cm)"])
plt.show()
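Task 4 asks to identify outliers; the box plots show them visually, and the same 1.5×IQR whisker rule can be applied numerically. A sketch on sepal width (assuming scikit-learn's bundled copy of iris):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sw = iris.data[:, 1]                           # sepal width (cm)

q1, q3 = np.percentile(sw, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box-plot whisker bounds
outliers = sw[(sw < lower) | (sw > upper)]     # points plotted beyond the whiskers
print(lower, upper, outliers)
```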
Practical No:- 11
Title :- Write a code in JAVA for a simple Word Count application that counts the number of occurrences
of each word in a given input set using the Hadoop Map-Reduce framework on local-standalone set-up.
Program / Commands:-
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS C:\Users\Omkar> start-all.cmd
This script is Deprecated. Instead use start-dfs.cmd and start-yarn.cmd
starting yarn daemons
PS C:\Users\Omkar> jps
17152 NodeManager
17936 ResourceManager
7216 NameNode
17496 DataNode
4684 Jps
PS C:\Users\Omkar> hadoop fs -mkdir /input
PS C:\Users\Omkar> hadoop fs -put D:\Hadoop Files\omkar.txt /input
put: `D:/Hadoop': No such file or directory
put: `Files/omkar.txt': No such file or directory
PS C:\Users\Omkar> hadoop fs -put C:\Users\Omkar\Desktop\Hadoop Files /input
OutPut:-
Practical No:- 12
Title :- Design a distributed application using Map-Reduce which processes a log file of
a system.
Program / Commands:-
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS C:\Users\Omkar> start-all.cmd
This script is Deprecated. Instead use start-dfs.cmd and start-yarn.cmd
starting yarn daemons
PS C:\Users\Omkar> jps
20928 NodeManager
7684 Eclipse
14136 NameNode
16296 DataNode
20072 ResourceManager
18236 Jps
PS C:\Users\Omkar> hadoop fs -mkdir /InputLog
PS C:\Users\Omkar> hadoop fs -rm -r /InputLog
Deleted /InputLog
PS C:\Users\Omkar> hadoop fs -mkdir /InputDir
PS C:\Users\Omkar> hadoop fs -put C:\Users\Omkar\Documents\Log\InputLog.txt /InputDir
PS C:\Users\Omkar> hadoop fs -ls /InputDir
Found 1 items
-rw-r--r-- 3 Omkar supergroup 145670 2025-03-31 16:30 /InputDir/InputLog.txt
PS C:\Users\Omkar> hadoop jar C:\Users\Omkar\Documents\JARFILES\Process.jar
com.mapreduce.lf/Process /InputDir/InputLog.txt /OutputDir
2025-03-31 16:37:23,239 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2025-03-31 16:37:23,902 WARN mapreduce.JobResourceUploader: Hadoop command-line option
parsing not performed. Implement the Tool interface and execute your application with ToolRunner to
remedy this.
2025-03-31 16:37:23,918 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /tmp/hadoop-yarn/staging/Omkar/.staging/job_1743418229851_0001
2025-03-31 16:37:24,077 INFO input.FileInputFormat: Total input files to process : 1
2025-03-31 16:37:24,168 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-31 16:37:24,295 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1743418229851_0001
2025-03-31 16:37:24,297 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-31 16:37:24,439 INFO conf.Configuration: resource-types.xml not found
2025-03-31 16:37:24,439 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2025-03-31 16:37:24,699 INFO impl.YarnClientImpl: Submitted application
application_1743418229851_0001
2025-03-31 16:37:24,744 INFO mapreduce.Job: The url to track the job: http://Omkar-
Mete:8088/proxy/application_1743418229851_0001/
2025-03-31 16:37:24,745 INFO mapreduce.Job: Running job: job_1743418229851_0001
2025-03-31 16:37:33,924 INFO mapreduce.Job: Job job_1743418229851_0001 running in uber mode
: false
2025-03-31 16:37:33,928 INFO mapreduce.Job: map 0% reduce 0%
2025-03-31 16:37:39,045 INFO mapreduce.Job: map 100% reduce 0%
2025-03-31 16:37:45,130 INFO mapreduce.Job: map 100% reduce 100%
2025-03-31 16:37:46,152 INFO mapreduce.Job: Job job_1743418229851_0001 completed
successfully
2025-03-31 16:37:46,262 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=4479
FILE: Number of bytes written=486753
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=145778
HDFS: Number of bytes written=3611
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2918
Total time spent by all reduces in occupied slots (ms)=3117
Total time spent by all map tasks (ms)=2918
Total time spent by all reduce tasks (ms)=3117
Total vcore-milliseconds taken by all map tasks=2918
Total vcore-milliseconds taken by all reduce tasks=3117
Total megabyte-milliseconds taken by all map tasks=2988032
Total megabyte-milliseconds taken by all reduce tasks=3191808
Map-Reduce Framework
Map input records=2589
Map output records=1295
Map output bytes=22902
Map output materialized bytes=4479
Input split bytes=108
Combine input records=1295
Combine output records=227
Reduce input groups=227
Reduce shuffle bytes=4479
Reduce input records=227
Reduce output records=227
Spilled Records=454
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=85
CPU time spent (ms)=1091
Physical memory (bytes) snapshot=514068480
Virtual memory (bytes) snapshot=765394944
Total committed heap usage (bytes)=359661568
Peak Map Physical memory (bytes)=308555776
Peak Map Virtual memory (bytes)=437420032
Peak Reduce Physical memory (bytes)=205512704
Peak Reduce Virtual memory (bytes)=328048640
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=145670
File Output Format Counters
Bytes Written=3611
PS C:\Users\Omkar> hadoop dfs -cat /OutputDir/*
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53
10.103.63.29 1
10.104.73.51 1
10.105.160.183 1
10.108.91.151 1
10.109.21.76 1
10.11.131.40 1
10.111.71.20 8
10.112.227.184 6
10.114.74.30 1
10.115.118.78 1
10.117.224.230 1
10.117.76.22 12
10.118.19.97 1
10.118.250.30 7
10.119.117.132 23
10.119.33.245 1
10.119.74.120 1
10.12.113.198 2
10.12.219.30 1
10.120.165.113 1
10.120.207.127 4
10.123.124.47 1
10.123.35.235 1
10.124.148.99 1
10.124.155.234 1
10.126.161.13 7
10.127.162.239 1
10.128.11.75 10
10.13.42.232 1
10.130.195.163 8
10.130.70.80 1
10.131.163.73 1
10.131.209.116 5
10.132.19.125 2
10.133.222.184 12
10.134.110.196 13
10.134.242.87 1
10.136.84.60 5
10.14.2.86 8
10.14.4.151 2
10.140.139.116 1
10.140.141.1 9
10.140.67.116 1
10.141.221.57 5
10.142.203.173 7
10.143.126.177 32
10.144.147.8 1
10.15.208.56 1
10.15.23.44 13
10.150.212.239 14
10.150.227.16 1
10.150.24.40 13
10.152.195.138 8
10.153.23.63 2
10.153.239.5 25
10.155.95.124 9
10.156.152.9 1
10.157.176.158 1
10.164.130.155 1
10.164.49.105 8
10.164.95.122 10
10.165.106.173 14
10.167.1.145 19
10.169.158.88 1
10.170.178.53 1
10.171.104.4 1
10.172.169.53 18
10.174.246.84 3
10.175.149.65 1
10.175.204.125 15
10.177.216.164 6
10.179.107.170 2
10.181.38.207 13
10.181.87.221 1
10.185.152.140 1
10.186.56.126 16
10.186.56.183 1
10.187.129.140 6
10.187.177.220 1
10.187.212.83 1
10.187.28.68 1
10.19.226.186 2
10.190.174.142 10
10.190.41.42 5
10.191.172.11 1
10.193.116.91 1
10.194.174.4 7
10.198.138.192 1
10.199.103.248 2
10.199.189.15 1
10.2.202.135 1
10.200.184.212 1
10.200.237.222 1
10.200.9.128 2
10.203.194.139 10
10.205.72.238 2
10.206.108.96 2
10.206.175.236 1
10.206.73.206 7
10.207.190.45 17
10.208.38.46 1
10.208.49.216 4
10.209.18.39 9
10.209.54.187 3
10.211.47.159 10
10.212.122.173 1
10.213.181.38 7
10.214.35.48 1
10.215.222.114 1
10.216.113.172 48
10.216.134.214 1
10.216.227.195 16
10.217.151.145 10
10.217.32.16 1
10.218.16.176 8
10.22.108.103 4
10.220.112.1 34
10.221.40.89 5
10.221.62.23 13
10.222.246.34 1
10.223.157.186 11
10.225.137.152 1
10.225.234.46 1
10.226.130.133 1
10.229.60.23 1
10.230.191.135 6
10.231.55.231 1
10.234.15.156 1
10.236.231.63 1
10.238.230.235 1
10.239.100.52 1
10.239.52.68 4
10.24.150.4 5
10.24.67.131 13
10.240.144.183 15
10.240.170.50 1
10.241.107.75 1
10.241.9.187 1
10.243.51.109 5
10.244.166.195 5
10.245.208.15 20
10.246.151.162 3
10.247.111.104 9
10.247.175.65 1
10.247.229.13 1
10.248.24.219 1
10.248.36.117 3
10.249.130.132 3
10.25.132.238 2
10.25.44.247 6
10.250.166.232 1
10.27.134.23 1
10.30.164.32 1
10.30.47.170 8
10.31.225.14 7
10.32.138.48 11
10.32.247.175 4
10.32.55.216 12
10.33.181.9 8
10.34.233.107 1
10.36.200.176 1
10.39.45.70 2
10.39.94.109 4
10.4.59.153 1
10.4.79.47 15
10.41.170.233 9
10.41.40.17 1
10.42.208.60 1
10.43.81.13 1
10.46.190.95 10
10.48.81.158 5
10.5.132.217 1
10.5.148.29 1
10.50.226.223 9
10.50.41.216 3
10.52.161.126 1
10.53.58.58 1
10.54.242.54 10
10.54.49.229 1
10.56.48.40 16
10.59.42.194 11
10.6.238.124 6
10.61.147.24 1
10.61.161.218 1
10.61.23.77 8
10.61.232.147 3
10.62.78.165 2
10.63.233.249 7
10.64.224.191 13
10.66.208.82 2
10.69.20.85 26
10.70.105.238 1
10.70.238.46 6
10.72.137.86 6
10.72.208.27 1
10.73.134.9 4
10.73.238.200 1
10.73.60.200 1
10.73.64.91 1
10.74.218.123 1
10.75.116.199 1
10.76.143.30 1
10.76.68.178 16
10.78.95.24 8
10.80.10.131 10
10.80.215.116 17
10.81.134.180 1
10.82.30.199 63
10.82.64.235 1
10.84.236.242 1
10.87.209.46 1
10.87.88.214 1
10.88.204.177 1
10.89.178.62 1
10.89.244.42 1
10.94.196.42 1
10.95.136.211 4
10.95.232.88 1
10.98.156.141 1
10.99.228.224 1
PS C:\Users\Omkar> hdfs dfs -get /OutputDir/part-r-00000
C:\Users\Omkar\Documents\OutputDir\OutputLog.txt
PS C:\Users\Omkar>
OutPut:-
Practical No:-14
Title:- Write a simple program in SCALA using Apache Spark framework
Program :-
// Scala program to print Hello Students
// Creating object
object Students {
// Main method
def main(args: Array[String]): Unit = {
// prints Hello, Students!
println("Hello, Students!")
}
}