Ds&bda 1-14

The document outlines a practical exercise in data wrangling using Python and pandas on an open-source dataset. It includes steps for importing libraries, loading data, preprocessing for missing values, and converting data types. The exercise emphasizes the importance of data normalization and provides code snippets for various operations performed on the dataset.

Practical No-1

Data Wrangling - I
Perform the following operations using Python on any open source dataset (e.g.,
data.csv)

1. Import all the required Python Libraries.

2. Locate an open source data from the web (e.g. https://www.kaggle.com). Provide a
clear description of the data and its source (i.e., URL of the web site).

3. Load the Dataset into pandas data frame.

4. Data Preprocessing: check for missing values in the data using pandas isnull(),
describe() function to get some initial statistics. Provide variable descriptions. Types of
variables etc. Check the dimensions of the data frame.

5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.

6. Turn categorical variables into quantitative variables in Python. In addition to the codes
and outputs, explain every operation that you do in the above steps and explain
everything that you do to import/read/scrape the data set.

In [9]: import pandas as pd

In [11]: import numpy as np

In [13]: #!unzip archive.zip

In [17]: df = pd.read_csv("C:\\Users\\Omkar\\Desktop\\TE SEM-2\\Practical's\\DSBDA\\Pract

In [19]: df
Out[19]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24.0 Female 50.0 68.90 Quetta Rejected

1 Waqar 21.0 Female 99.0 60.73 Karachi NaN

2 Bushra 17.0 Male 89.0 NaN Islamabad Accepted

3 Aliya 17.0 Male 55.0 85.29 Karachi Rejected

4 Bilal 20.0 Male 65.0 61.13 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19.0 Female 85.0 78.09 Quetta Accepted

153 Bilal 17.0 Female 81.0 84.40 Islamabad Rejected

154 Fatima 21.0 Female 98.0 50.86 Multan Accepted

155 Shoaib -1.0 Male 91.0 80.12 Quetta Accepted

156 Maaz 17.0 Male 88.0 86.85 Lahore Accepted

157 rows × 7 columns

To find Null Values


In [22]: df.isnull()

Out[22]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 False False False False False False False

1 False False False False False False True

2 False False False False True False False

3 False False False False False False False

4 False False False False False False True

... ... ... ... ... ... ... ...

152 False False False False False False False

153 False False False False False False False

154 False False False False False False False

155 False False False False False False False

156 False False False False False False False

157 rows × 7 columns

In [24]: df.head(10)
Out[24]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24.0 Female 50.0 68.90 Quetta Rejected

1 Waqar 21.0 Female 99.0 60.73 Karachi NaN

2 Bushra 17.0 Male 89.0 NaN Islamabad Accepted

3 Aliya 17.0 Male 55.0 85.29 Karachi Rejected

4 Bilal 20.0 Male 65.0 61.13 Lahore NaN

5 Murtaza 23.0 Female NaN NaN Islamabad Accepted

6 Asad 18.0 Male NaN 97.31 Multan Accepted

7 Rabia 20.0 Female 82.0 55.67 Lahore Accepted

8 Rohail 17.0 Male 64.0 NaN Karachi Accepted

9 Kamran 18.0 Male 53.0 98.98 Multan Rejected

In [26]: df.isnull().sum()

Out[26]: Name 10
Age 10
Gender 10
Admission Test Score 11
High School Percentage 11
City 10
Admission Status 10
dtype: int64
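The per-column counts above come from summing the Boolean mask that isnull() returns. A minimal self-contained sketch of the same idea, using a small hypothetical frame rather than the admission dataset itself:

```python
import pandas as pd
import numpy as np

# Small stand-in frame (hypothetical values, not the admission dataset)
demo = pd.DataFrame({
    "Age": [24.0, np.nan, 17.0],
    "City": ["Quetta", "Karachi", None],
})

# isnull() marks missing cells as True; summing the mask counts them per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

notnull().sum() gives the complement: the count of non-missing cells per column.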

To find Non-Null Values


In [29]: df.notnull()
Out[29]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 True True True True True True True

1 True True True True True True False

2 True True True True False True True

3 True True True True True True True

4 True True True True True True False

... ... ... ... ... ... ... ...

152 True True True True True True True

153 True True True True True True True

154 True True True True True True True

155 True True True True True True True

156 True True True True True True True

157 rows × 7 columns

In [31]: df.notnull().sum()

Out[31]: Name                      147
         Age                       147
         Gender                    147
         Admission Test Score      146
         High School Percentage    146
         City                      147
         Admission Status          147
         dtype: int64

To get Dataset Information


In [34]: df.describe()

Out[34]: Age Admission Test Score High School Percentage

count 147.000000 146.000000 146.000000

mean 19.680272 77.657534 75.684726

std 4.540512 16.855343 17.368014

min -1.000000 -5.000000 -10.000000

25% 18.000000 68.250000 65.052500

50% 20.000000 79.000000 77.545000

75% 22.000000 89.000000 88.312500

max 24.000000 150.000000 110.500000


In [36]: df.columns

Out[36]: Index(['Name', 'Age', 'Gender', 'Admission Test Score',


'High School Percentage', 'City', 'Admission Status'],
dtype='object')

In [38]: df.size

Out[38]: 1099

In [40]: df.shape

Out[40]: (157, 7)

In [42]: df.ndim

Out[42]: 2

In [44]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Name                    147 non-null    object
 1   Age                     147 non-null    float64
 2   Gender                  147 non-null    object
 3   Admission Test Score    146 non-null    float64
 4   High School Percentage  146 non-null    float64
 5   City                    147 non-null    object
 6   Admission Status        147 non-null    object
dtypes: float64(3), object(4)
memory usage: 8.7+ KB

To Drop Null Values


In [47]: #df=df.dropna()

To deal with Null Values (replace with some other value, like the mean of a column)
In [50]: df["Age"].replace(np.NaN,10,inplace=True)
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\311224403.py:1: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["Age"].replace(np.NaN,10,inplace=True)
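The FutureWarning above is pandas asking for plain assignment instead of a chained inplace replacement. A sketch of the recommended pattern on hypothetical values (fillna is used here in place of replace, since the goal is filling NaN; note also that the lowercase np.nan is the spelling that survives NumPy 2.0, where np.NaN was removed):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Age": [24.0, np.nan, 17.0]})

# Assign the result back to the column instead of using inplace=True on a view
df["Age"] = df["Age"].fillna(10)
print(df["Age"].tolist())
```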

In [52]: df

Out[52]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24.0 Female 50.0 68.90 Quetta Rejected

1 Waqar 21.0 Female 99.0 60.73 Karachi NaN

2 Bushra 17.0 Male 89.0 NaN Islamabad Accepted

3 Aliya 17.0 Male 55.0 85.29 Karachi Rejected

4 Bilal 20.0 Male 65.0 61.13 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19.0 Female 85.0 78.09 Quetta Accepted

153 Bilal 17.0 Female 81.0 84.40 Islamabad Rejected

154 Fatima 21.0 Female 98.0 50.86 Multan Accepted

155 Shoaib -1.0 Male 91.0 80.12 Quetta Accepted

156 Maaz 17.0 Male 88.0 86.85 Lahore Accepted

157 rows × 7 columns

In [54]: df.isnull().sum()

Out[54]: Name 10
Age 0
Gender 10
Admission Test Score 11
High School Percentage 11
City 10
Admission Status 10
dtype: int64

To Change Data Type


In [57]: df["Age"] = df["Age"].astype(int)
In [59]: df

Out[59]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24 Female 50.0 68.90 Quetta Rejected

1 Waqar 21 Female 99.0 60.73 Karachi NaN

2 Bushra 17 Male 89.0 NaN Islamabad Accepted

3 Aliya 17 Male 55.0 85.29 Karachi Rejected

4 Bilal 20 Male 65.0 61.13 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19 Female 85.0 78.09 Quetta Accepted

153 Bilal 17 Female 81.0 84.40 Islamabad Rejected

154 Fatima 21 Female 98.0 50.86 Multan Accepted

155 Shoaib -1 Male 91.0 80.12 Quetta Accepted

156 Maaz 17 Male 88.0 86.85 Lahore Accepted

157 rows × 7 columns

To replace Null Values with the Mean


In [62]: Avg_percentage=df["High School Percentage"].astype('float').mean()

In [64]: Avg_percentage

Out[64]: 75.68472602739726

In [66]: df["High School Percentage"].replace(np.NaN,Avg_percentage,inplace=True)

C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\3697381515.py:1: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["High School Percentage"].replace(np.NaN,Avg_percentage,inplace=True)
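The same assignment-based pattern applies to mean imputation. A sketch on hypothetical values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"High School Percentage": [68.90, np.nan, 85.29]})

# mean() skips NaN by default, so the average uses only observed values
avg = df["High School Percentage"].mean()

# Fill the gaps and assign back, avoiding the chained-inplace warning
df["High School Percentage"] = df["High School Percentage"].fillna(avg)
print(round(avg, 3))
```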

In [68]: df.isnull().sum()
Out[68]: Name 10
Age 0
Gender 10
Admission Test Score 11
High School Percentage 0
City 10
Admission Status 10
dtype: int64

In [70]: df

Out[70]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24 Female 50.0 68.900000 Quetta Rejected

1 Waqar 21 Female 99.0 60.730000 Karachi NaN

2 Bushra 17 Male 89.0 75.684726 Islamabad Accepted

3 Aliya 17 Male 55.0 85.290000 Karachi Rejected

4 Bilal 20 Male 65.0 61.130000 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19 Female 85.0 78.090000 Quetta Accepted

153 Bilal 17 Female 81.0 84.400000 Islamabad Rejected

154 Fatima 21 Female 98.0 50.860000 Multan Accepted

155 Shoaib -1 Male 91.0 80.120000 Quetta Accepted

156 Maaz 17 Male 88.0 86.850000 Lahore Accepted

157 rows × 7 columns

To convert Categorical into Numerical Variables
In [73]: df["Gender"].replace({'Female':0,'Male':1},inplace = True)
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\3985573958.py:1: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["Gender"].replace({'Female':0,'Male':1},inplace = True)
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\3985573958.py:1: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df["Gender"].replace({'Female':0,'Male':1},inplace = True)
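An alternative that avoids both warnings is Series.map, which applies the dictionary and returns a new column to assign back. A sketch on hypothetical values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Gender": ["Female", "Male", np.nan]})

# map() applies the dict; values not in the dict (including NaN) become NaN
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})
print(df["Gender"].tolist())
```

Unlike replace, map drops any category you forgot to list, so it also doubles as a check that the mapping is complete.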

In [75]: df

Out[75]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24 0.0 50.0 68.900000 Quetta Rejected

1 Waqar 21 0.0 99.0 60.730000 Karachi NaN

2 Bushra 17 1.0 89.0 75.684726 Islamabad Accepted

3 Aliya 17 1.0 55.0 85.290000 Karachi Rejected

4 Bilal 20 1.0 65.0 61.130000 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19 0.0 85.0 78.090000 Quetta Accepted

153 Bilal 17 0.0 81.0 84.400000 Islamabad Rejected

154 Fatima 21 0.0 98.0 50.860000 Multan Accepted

155 Shoaib -1 1.0 91.0 80.120000 Quetta Accepted

156 Maaz 17 1.0 88.0 86.850000 Lahore Accepted

157 rows × 7 columns

In [77]: df["Admission Status"].replace({'Rejected':0,'Accepted':1},inplace = True)


C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\999870716.py:1: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["Admission Status"].replace({'Rejected':0,'Accepted':1},inplace = True)
C:\Users\Omkar\AppData\Local\Temp\ipykernel_16356\999870716.py:1: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df["Admission Status"].replace({'Rejected':0,'Accepted':1},inplace = True)
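For nominal variables with more than two levels, such as City, one-hot encoding via pd.get_dummies is a common alternative to a hand-written code mapping, since it does not impose an artificial ordering. A sketch with made-up city values:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Quetta", "Karachi", "Quetta"]})

# One indicator column per distinct city; columns are named City_<value>
dummies = pd.get_dummies(df["City"], prefix="City")
print(dummies.columns.tolist())
```

The indicator frame can then be joined back with pd.concat([df, dummies], axis=1).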

In [79]: df

Out[79]:      Name   Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz 24 0.0 50.0 68.900000 Quetta 0.0

1 Waqar 21 0.0 99.0 60.730000 Karachi NaN

2 Bushra 17 1.0 89.0 75.684726 Islamabad 1.0

3 Aliya 17 1.0 55.0 85.290000 Karachi 0.0

4 Bilal 20 1.0 65.0 61.130000 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali 19 0.0 85.0 78.090000 Quetta 1.0

153 Bilal 17 0.0 81.0 84.400000 Islamabad 0.0

154 Fatima 21 0.0 98.0 50.860000 Multan 1.0

155 Shoaib -1 1.0 91.0 80.120000 Quetta 1.0

156 Maaz 17 1.0 88.0 86.850000 Lahore 1.0

157 rows × 7 columns

To count
In [82]: df["Gender"].value_counts()

Out[82]: Gender
0.0 83
1.0 64
Name: count, dtype: int64

In [84]: df["Age"].value_counts()
Out[84]: Age
17 24
23 19
19 18
22 18
24 17
20 17
21 15
18 14
10 10
-1 5
Name: count, dtype: int64

Concept Hierarchy
In [87]: def fun1(value):
             if (value < 20):
                 return "teenager"
             elif (value >= 20 and value < 40):
                 return "young"
             elif (value >= 40 and value < 60):
                 return "middle aged"
             elif (value >= 60):
                 return "senior citizen"
             else:
                 pass

In [89]: df["Age"] = df["Age"].apply(fun1)
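The same concept hierarchy can be built without a hand-written function using pd.cut, which bins values into labelled intervals. A sketch (the bin edges mirror fun1's thresholds; right=False makes each bin right-open, so 20 falls into "young"):

```python
import pandas as pd

ages = pd.Series([17, 24, 45, 63])

# Bin ages into the same four labelled groups as fun1
groups = pd.cut(
    ages,
    bins=[0, 20, 40, 60, 120],
    labels=["teenager", "young", "middle aged", "senior citizen"],
    right=False,
)
print(groups.tolist())
```

One caveat: pd.cut leaves values outside the bin range (like the -1 ages in this dataset) as NaN, whereas fun1 silently labels them "teenager", so invalid ages surface rather than hide.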

In [91]: df

Out[91]:      Name       Age  Gender  Admission Test Score  High School Percentage       City  Admission Status

0 Shehroz young 0.0 50.0 68.900000 Quetta 0.0

1 Waqar young 0.0 99.0 60.730000 Karachi NaN

2 Bushra teenager 1.0 89.0 75.684726 Islamabad 1.0

3 Aliya teenager 1.0 55.0 85.290000 Karachi 0.0

4 Bilal young 1.0 65.0 61.130000 Lahore NaN

... ... ... ... ... ... ... ...

152 Ali teenager 0.0 85.0 78.090000 Quetta 1.0

153 Bilal teenager 0.0 81.0 84.400000 Islamabad 0.0

154 Fatima young 0.0 98.0 50.860000 Multan 1.0

155 Shoaib teenager 1.0 91.0 80.120000 Quetta 1.0

156 Maaz teenager 1.0 88.0 86.850000 Lahore 1.0

157 rows × 7 columns


Practical No-2
Create an “Academic performance” dataset of students and perform the following operations using Python.

1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable
techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following
reasons: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease
the skewness and convert the distribution into a normal distribution. Reason and document your approach properly

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Creating the dataset
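The notebook cell that builds the frame is not shown; the following sketch reconstructs an equivalent DataFrame from the printed output below (names, marks, and grades copied from that table):

```python
import pandas as pd
import numpy as np

# Reconstruction of the printed academic-performance dataset
df = pd.DataFrame({
    "rollno": range(1, 16),
    "name":  ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
              np.nan, np.nan, "k", "l", "m"],
    "marks": [40, 23, 50, 78, 48, 89, 90, 67, 90, 96,
              76, np.nan, 97, np.nan, 65],
    "grade": ["F", "F", "P", "P", "P", "P", "P", "P", "P", "P",
              "P", "F", "P", np.nan, np.nan],
})
print(df.shape)
```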

df

rollno name marks grade

0 1 a 40.0 F

1 2 b 23.0 F

2 3 c 50.0 P

3 4 d 78.0 P

4 5 e 48.0 P

5 6 f 89.0 P

6 7 g 90.0 P

7 8 h 67.0 P

8 9 i 90.0 P

9 10 j 96.0 P

10 11 NaN 76.0 P

11 12 NaN NaN F

12 13 k 97.0 P

13 14 l NaN NaN
14 15 m 65.0 NaN

Dataset Statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   rollno  15 non-null     int64
 1   name    13 non-null     object
 2   marks   13 non-null     float64
 3   grade   13 non-null     object
dtypes: float64(1), int64(1), object(2)
memory usage: 612.0+ bytes
rollno marks

count 15.000000 13.000000

mean 8.000000 69.923077

std 4.472136 23.616596

min 1.000000 23.000000

25% 4.500000 50.000000

50% 8.000000 76.000000

75% 11.500000 90.000000

max 15.000000 97.000000

Null values

name     2
marks    2

df

rollno name marks grade

0 1 a 40.000000 F

1 2 b 23.000000 F

2 3 c 50.000000 P

3 4 d 78.000000 P

4 5 e 48.000000 P

5 6 f 89.000000 P

6 7 g 90.000000 P

7 8 h 67.000000 P

8 9 i 90.000000 P

9 10 j 96.000000 P

10 11 NaN 76.000000 P

11 12 NaN 69.923077 F

12 13 k 97.000000 P

13 14 l 69.923077 NaN
14 15 m 65.000000 NaN

df

rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 P

3 4 d 78 P

4 5 e 48 P

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

10 11 NaN 76 P

11 12 NaN 69 F

12 13 k 97 P

13 14 l 69 NaN

14 15 m 65 NaN

df

rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 P

3 4 d 78 P

4 5 e 48 P

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

12 13 k 97 P

13 14 l 69 NaN
14 15 m 65 NaN

# print(row['marks'], row['grade'])
rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 F

3 4 d 78 P

4 5 e 48 F

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

12 13 k 97 P

13 14 l 69 P
14 15 m 65 P

df.boxplot()
plt.show()

Outliers
rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 F

3 4 d 78 P

4 5 e 48 F

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

12 13 k 97 P

13 14 l 69 P

14 15 m 65 P

15 16 n 200 P

16 17 o -100 F

df.boxplot()
plt.show()

#newdf = df[df["marks"] >100]

#newdf
sns.boxplot(data=df, x='marks');

rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 F

3 4 d 78 P

4 5 e 48 F

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

12 13 k 97 P

13 14 l 69 P

14 15 m 65 P

15 16 n 200 P

16 17 o -100 F
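The two injected outliers (200 and -100) can be detected with the 1.5×IQR rule, the same rule the boxplot whiskers visualize. A sketch on a stand-in marks series (values approximate the table above):

```python
import pandas as pd

marks = pd.Series([40, 23, 50, 78, 48, 89, 90, 67, 90, 96, 97, 69, 65, 200, -100])

# Interquartile range and whisker bounds
q1, q3 = marks.quantile(0.25), marks.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the whiskers
clean = marks[(marks >= lower) & (marks <= upper)]
print(sorted(clean.tolist()))
```

Dropping rows is only one option; capping at the bounds (winsorizing) or replacing outliers with the median are common alternatives when rows are scarce.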

df

rollno name marks grade

0 1 a 40 F

1 2 b 23 F

2 3 c 50 F

3 4 d 78 P

4 5 e 48 F

5 6 f 89 P

6 7 g 90 P

7 8 h 67 P

8 9 i 90 P

9 10 j 96 P

12 13 k 97 P

13 14 l 69 P
14 15 m 65 P

df.boxplot()
plt.show()

Scaling the marks column


df

rollno name marks grade

0 1 a 0.229730 F

1 2 b 0.000000 F

2 3 c 0.364865 F

3 4 d 0.743243 P

4 5 e 0.337838 F

5 6 f 0.891892 P

6 7 g 0.905405 P

7 8 h 0.594595 P

8 9 i 0.905405 P

9 10 j 0.986486 P

12 13 k 1.000000 P

13 14 l 0.621622 P
14 15 m 0.567568 P

rollno name marks grade

0 1 a 0.229730 F

1 2 b 0.000000 F

2 3 c 0.364865 F

3 4 d 0.743243 P

4 5 e 0.337838 F

5 6 f 0.891892 P

6 7 g 0.905405 P

7 8 h 0.594595 P

8 9 i 0.905405 P

9 10 j 0.986486 P

13 14 l 0.621622 P

14 15 m 0.567568 P
plt.show()
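The scaled values above match min-max normalization: 23 maps to 0, 97 maps to 1, and 40 maps to (40 - 23) / (97 - 23) ≈ 0.2297. A sketch on a stand-in marks series:

```python
import pandas as pd

marks = pd.Series([40, 23, 50, 78, 48, 89, 90, 67, 90, 96, 97, 69, 65])

# Min-max scaling: smallest mark becomes 0, largest becomes 1
scaled = (marks - marks.min()) / (marks.max() - marks.min())
print(round(scaled.iloc[0], 6))
```

This transformation changes only the scale, not the shape of the distribution; to reduce skewness, a log or square-root transform would be used instead.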
Practical No-3
Descriptive Statistics - Measures of Central Tendency and variability Perform the following operations on any open source dataset (e.g.,
data.csv)

1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric
variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of income grouped by the age groups. Create a list that contains a
numeric value for each response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation etc. of the species of ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’ of iris.csv dataset. Provide the codes with outputs and explain everything that you do in this step.

Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

df.head()

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Ter

0 LP001002 Male No 0 Graduate No NaN NaN NaN 360.

1 LP001003 Male Yes 1 Graduate No 4583.0 1508.0 128.0 360.

2 LP001005 Male Yes 0 Graduate Yes 3000.0 0.0 66.0 360.

3 LP001006 Male Yes 0 Not Graduate No 2583.0 2358.0 120.0 360.

4 LP001008 Male No 0 Graduate No 6000.0 0.0 141.0 360.

Basic stats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 613 non-null float64
7 CoapplicantIncome 613 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 561 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(5), object(8)
memory usage: 62.5+ KB
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History

count 613.000000 613.000000 592.000000 600.00000 561.000000

mean 5402.732463 1623.890571 146.412162 342.00000 0.841355

std 6114.004114 2927.903583 85.587325 65.12041 0.365671

min 150.000000 0.000000 9.000000 12.00000 0.000000

25% 2876.000000 0.000000 100.000000 360.00000 1.000000

50% 3812.000000 1210.000000 128.000000 360.00000 1.000000

75% 5780.000000 2302.000000 168.000000 360.00000 1.000000

max 81000.000000 41667.000000 700.000000 480.00000 1.000000

df.isna().sum()

Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 1
CoapplicantIncome 1
LoanAmount 22
Loan_Amount_Term 14
Credit_History 53
Property_Area 0
Loan_Status 0
dtype: int64

Let us group the quantitative variables 'ApplicantIncome', 'Coapplicant Income', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History' by
'Loan_Status' categorical variable

df["ApplicantIncome"].plot(kind="hist")
plt.show()
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_T

0 LP001002 Male No 0 Graduate No 5402.732463 NaN NaN 3

1 LP001003 Male Yes 1 Graduate No 4583.000000 1508.0 128.0 3


2 LP001005 Male Yes 0 Graduate Yes 3000.000000 0.0 66.0 3

3 LP001006 Male Yes 0 Not Graduate No 2583.000000 2358.0 120.0 3

4 LP001008 Male No 0 Graduate No 6000.000000 0.0 141.0 3

... ... ... ... ... ... ... ... ... ...

609 LP002978 Female No 0 Graduate No 2900.000000 0.0 71.0 3

610 LP002979 Male Yes 3+ Graduate No 4106.000000 0.0 40.0 1

611 LP002983 Male Yes 1 Graduate No 8072.000000 240.0 253.0 3

612 LP002984 Male Yes 2 Graduate No 7583.000000 0.0 187.0 3


613 LP002990 Female No 0 Graduate Yes 4583.000000 0.0 133.0 3

614 rows × 13 columns

df["CoapplicantIncome"].plot(kind="hist")
plt.show()

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_T

0 LP001002 Male No 0 Graduate No 5402.732463 1623.890571 NaN 3

1 LP001003 Male Yes 1 Graduate No 4583.000000 1508.000000 128.0 3

2 LP001005 Male Yes 0 Graduate Yes 3000.000000 0.000000 66.0 3

3 LP001006 Male Yes 0 Not Graduate No 2583.000000 2358.000000 120.0 3

4 LP001008 Male No 0 Graduate No 6000.000000 0.000000 141.0 3

... ... ... ... ... ... ... ... ... ...

609 LP002978 Female No 0 Graduate No 2900.000000 0.000000 71.0 3

610 LP002979 Male Yes 3+ Graduate No 4106.000000 0.000000 40.0 1

611 LP002983 Male Yes 1 Graduate No 8072.000000 240.000000 253.0 3

612 LP002984 Male Yes 2 Graduate No 7583.000000 0.000000 187.0 3


613 LP002990 Female No 0 Graduate Yes 4583.000000 0.000000 133.0 3

614 rows × 13 columns


df["Credit_History"].plot(kind="hist")
plt.show()

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_T

0 LP001002 Male No 0 Graduate No 5402.732463 1623.890571 146.412162 3

1 LP001003 Male Yes 1 Graduate No 4583.000000 1508.000000 128.000000 3


2 LP001005 Male Yes 0 Graduate Yes 3000.000000 0.000000 66.000000 3

3 LP001006 Male Yes 0 Not Graduate No 2583.000000 2358.000000 120.000000 3

4 LP001008 Male No 0 Graduate No 6000.000000 0.000000 141.000000 3

... ... ... ... ... ... ... ... ... ...

609 LP002978 Female No 0 Graduate No 2900.000000 0.000000 71.000000 3

610 LP002979 Male Yes 3+ Graduate No 4106.000000 0.000000 40.000000 1

611 LP002983 Male Yes 1 Graduate No 8072.000000 240.000000 253.000000 3

612 LP002984 Male Yes 2 Graduate No 7583.000000 0.000000 187.000000 3

613 LP002990 Female No 0 Graduate Yes 4583.000000 0.000000 133.000000 3

614 rows × 13 columns


Stats of the grouped data

ApplicantIncome CoapplicantIncome LoanAmount Credit_History

Loan_Status

N 5446.078125 1877.807292 150.945488 0.505208

Y 5383.011214 1508.364480 144.349606 0.888626

ApplicantIncome CoapplicantIncome LoanAmount Credit_History

Loan_Status

N 3833.5 268.0 133.5 1.0

Y 3812.5 1255.0 128.0 1.0

ApplicantIncome CoapplicantIncome LoanAmount Credit_History

Loan_Status

N 150.0 0.0 9.0 0.0

Y 210.0 0.0 17.0 0.0

max = grouped_df.max()
max

ApplicantIncome CoapplicantIncome LoanAmount Credit_History

Loan_Status

N 81000.0 41667.0 570.0 1.0

Y 63337.0 20000.0 700.0 1.0

ApplicantIncome CoapplicantIncome LoanAmount Credit_History

Loan_Status

N 6819.558528 4384.060103 83.361163 0.501280

Y 5765.397061 1923.362579 84.361109 0.314969
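The grouped mean, median, min, max, and std tables above can all be produced in one call with groupby(...).agg. A sketch on a hypothetical five-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Loan_Status": ["Y", "N", "Y", "N", "Y"],
    "ApplicantIncome": [4583, 3000, 6000, 2583, 5000],
})

# Group the quantitative column by the categorical one and compute the stats
grouped = df.groupby("Loan_Status")["ApplicantIncome"]
stats = grouped.agg(["mean", "median", "min", "max", "std"])
print(stats)
```

The index of `stats` is the category (Loan_Status), and each requested statistic becomes a column, matching the layout of the tables above.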

Iris dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

Id 75.500000
SepalLengthCm 5.843333
SepalWidthCm 3.054000
PetalLengthCm 3.758667
PetalWidthCm 1.198667
dtype: float64

0 -> setosa
1 -> versicolor
2 -> virginica

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.0 3.0 1.5 0.2 Iris-setosa

1 2 NaN NaN NaN NaN Iris-versicolor

2 3 NaN NaN NaN NaN Iris-virginica

3 4 NaN NaN NaN NaN NaN

4 5 NaN NaN NaN NaN NaN

... ... ... ... ... ... ...

145 146 NaN NaN NaN NaN NaN

146 147 NaN NaN NaN NaN NaN

147 148 NaN NaN NaN NaN NaN

148 149 NaN NaN NaN NaN NaN


149 150 NaN NaN NaN NaN NaN

150 rows × 6 columns

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa


4 5 5.0 3.6 1.4 0.2 Iris-setosa

Basic stats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 75.500000 5.843333 3.054000 3.758667 1.198667

std 43.445368 0.828066 0.433594 1.764420 0.763161

min 1.000000 4.300000 2.000000 1.000000 0.100000

25% 38.250000 5.100000 2.800000 1.600000 0.300000

50% 75.500000 5.800000 3.000000 4.350000 1.300000

75% 112.750000 6.400000 3.300000 5.100000 1.800000

max 150.000000 7.900000 4.400000 6.900000 2.500000

Setosa stats

count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000

Id SepalLengthCm ... PetalLengthCm

count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25

Species
Iris-setosa      50.0   25.5  14.57738    1.0   13.25   25.5   37.75   50.0  50.0  5.006 ...  1.575  1.9  50.0  0.244  0.107210  0.1  0.
Iris-versicolor  50.0   75.5  14.57738   51.0   63.25   75.5   87.75  100.0  50.0  5.936 ...  4.600  5.1  50.0  1.326  0.197753  1.0  1.
Iris-virginica   50.0  125.5  14.57738  101.0  113.25  125.5  137.75  150.0  50.0  6.588 ...  5.875  6.9  50.0  2.026  0.274650  1.4  1.

3 rows × 40 columns
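Per-species statistics like the table above come from filtering (or grouping) on Species and calling describe(). A sketch on a six-row hypothetical sample of the iris columns:

```python
import pandas as pd

iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor",
                "Iris-virginica", "Iris-virginica"],
})

# Filter one species, then describe(): count, mean, std, min, quartiles, max
setosa_stats = iris[iris["Species"] == "Iris-setosa"]["SepalLengthCm"].describe()
print(setosa_stats["mean"])
```

The grouped table above is the same computation for all species at once: iris.groupby("Species").describe().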
Id count 150.000000
mean 226.500000
std 43.732139
min 153.000000
25% 189.750000
50% 226.500000
75% 263.250000
max 300.000000
SepalLengthCm count 150.000000
mean 17.530000
std 1.504540
min 14.100000
25% 16.625000
50% 17.400000
75% 18.400000
max 20.700000
SepalWidthCm count 150.000000
mean 9.162000
std 1.017319
min 6.500000
25% 8.450000
50% 9.200000
75% 9.850000
max 11.600000
PetalLengthCm count 150.000000
mean 11.276000
std 1.195317
min 8.500000
25% 10.500000
50% 11.400000
75% 12.050000
max 13.900000
PetalWidthCm count 150.000000
mean 3.596000
std 0.579612
min 2.500000
25% 3.200000
50% 3.500000
75% 4.100000
max 4.900000
dtype: float64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
# Filter the data for the species 'iris-setosa' and 'iris-versicolor'
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

5 6 5.4 3.9 1.7 0.4 Iris-setosa

6 7 4.6 3.4 1.4 0.3 Iris-setosa

7 8 5.0 3.4 1.5 0.2 Iris-setosa

8 9 4.4 2.9 1.4 0.2 Iris-setosa

9 10 4.9 3.1 1.5 0.1 Iris-setosa

10 11 5.4 3.7 1.5 0.2 Iris-setosa

11 12 4.8 3.4 1.6 0.2 Iris-setosa

12 13 4.8 3.0 1.4 0.1 Iris-setosa

13 14 4.3 3.0 1.1 0.1 Iris-setosa

14 15 5.8 4.0 1.2 0.2 Iris-setosa

15 16 5.7 4.4 1.5 0.4 Iris-setosa

16 17 5.4 3.9 1.3 0.4 Iris-setosa

17 18 5.1 3.5 1.4 0.3 Iris-setosa

18 19 5.7 3.8 1.7 0.3 Iris-setosa

19 20 5.1 3.8 1.5 0.3 Iris-setosa

20 21 5.4 3.4 1.7 0.2 Iris-setosa

21 22 5.1 3.7 1.5 0.4 Iris-setosa

22 23 4.6 3.6 1.0 0.2 Iris-setosa

23 24 5.1 3.3 1.7 0.5 Iris-setosa

24 25 4.8 3.4 1.9 0.2 Iris-setosa

25 26 5.0 3.0 1.6 0.2 Iris-setosa

26 27 5.0 3.4 1.6 0.4 Iris-setosa

27 28 5.2 3.5 1.5 0.2 Iris-setosa

28 29 5.2 3.4 1.4 0.2 Iris-setosa

29 30 4.7 3.2 1.6 0.2 Iris-setosa

30 31 4.8 3.1 1.6 0.2 Iris-setosa

31 32 5.4 3.4 1.5 0.4 Iris-setosa

32 33 5.2 4.1 1.5 0.1 Iris-setosa

33 34 5.5 4.2 1.4 0.2 Iris-setosa

34 35 4.9 3.1 1.5 0.1 Iris-setosa

35 36 5.0 3.2 1.2 0.2 Iris-setosa

36 37 5.5 3.5 1.3 0.2 Iris-setosa

37 38 4.9 3.1 1.5 0.1 Iris-setosa

38 39 4.4 3.0 1.3 0.2 Iris-setosa

39 40 5.1 3.4 1.5 0.2 Iris-setosa

40 41 5.0 3.5 1.3 0.3 Iris-setosa

41 42 4.5 2.3 1.3 0.3 Iris-setosa

42 43 4.4 3.2 1.3 0.2 Iris-setosa

43 44 5.0 3.5 1.6 0.6 Iris-setosa

44 45 5.1 3.8 1.9 0.4 Iris-setosa

45 46 4.8 3.0 1.4 0.3 Iris-setosa

46 47 5.1 3.8 1.6 0.2 Iris-setosa

47 48 4.6 3.2 1.4 0.2 Iris-setosa

48 49 5.3 3.7 1.5 0.2 Iris-setosa

49 50 5.0 3.3 1.4 0.2 Iris-setosa


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

50 51 7.0 3.2 4.7 1.4 Iris-versicolor

51 52 6.4 3.2 4.5 1.5 Iris-versicolor

52 53 6.9 3.1 4.9 1.5 Iris-versicolor

53 54 5.5 2.3 4.0 1.3 Iris-versicolor

54 55 6.5 2.8 4.6 1.5 Iris-versicolor

55 56 5.7 2.8 4.5 1.3 Iris-versicolor

56 57 6.3 3.3 4.7 1.6 Iris-versicolor

57 58 4.9 2.4 3.3 1.0 Iris-versicolor

58 59 6.6 2.9 4.6 1.3 Iris-versicolor

59 60 5.2 2.7 3.9 1.4 Iris-versicolor

60 61 5.0 2.0 3.5 1.0 Iris-versicolor

61 62 5.9 3.0 4.2 1.5 Iris-versicolor

62 63 6.0 2.2 4.0 1.0 Iris-versicolor

63 64 6.1 2.9 4.7 1.4 Iris-versicolor

64 65 5.6 2.9 3.6 1.3 Iris-versicolor

65 66 6.7 3.1 4.4 1.4 Iris-versicolor

66 67 5.6 3.0 4.5 1.5 Iris-versicolor

67 68 5.8 2.7 4.1 1.0 Iris-versicolor

68 69 6.2 2.2 4.5 1.5 Iris-versicolor

69 70 5.6 2.5 3.9 1.1 Iris-versicolor

70 71 5.9 3.2 4.8 1.8 Iris-versicolor

71 72 6.1 2.8 4.0 1.3 Iris-versicolor

72 73 6.3 2.5 4.9 1.5 Iris-versicolor

73 74 6.1 2.8 4.7 1.2 Iris-versicolor

74 75 6.4 2.9 4.3 1.3 Iris-versicolor

75 76 6.6 3.0 4.4 1.4 Iris-versicolor

76 77 6.8 2.8 4.8 1.4 Iris-versicolor

77 78 6.7 3.0 5.0 1.7 Iris-versicolor

78 79 6.0 2.9 4.5 1.5 Iris-versicolor

79 80 5.7 2.6 3.5 1.0 Iris-versicolor

80 81 5.5 2.4 3.8 1.1 Iris-versicolor

81 82 5.5 2.4 3.7 1.0 Iris-versicolor

82 83 5.8 2.7 3.9 1.2 Iris-versicolor

83 84 6.0 2.7 5.1 1.6 Iris-versicolor

84 85 5.4 3.0 4.5 1.5 Iris-versicolor

85 86 6.0 3.4 4.5 1.6 Iris-versicolor

86 87 6.7 3.1 4.7 1.5 Iris-versicolor

87 88 6.3 2.3 4.4 1.3 Iris-versicolor

88 89 5.6 3.0 4.1 1.3 Iris-versicolor

89 90 5.5 2.5 4.0 1.3 Iris-versicolor

90 91 5.5 2.6 4.4 1.2 Iris-versicolor

91 92 6.1 3.0 4.6 1.4 Iris-versicolor

92 93 5.8 2.6 4.0 1.2 Iris-versicolor

93 94 5.0 2.3 3.3 1.0 Iris-versicolor

94 95 5.6 2.7 4.2 1.3 Iris-versicolor

95 96 5.7 3.0 4.2 1.2 Iris-versicolor

96 97 5.7 2.9 4.2 1.3 Iris-versicolor

97 98 6.2 2.9 4.3 1.3 Iris-versicolor

98 99 5.1 2.5 3.0 1.1 Iris-versicolor

99 100 5.7 2.8 4.1 1.3 Iris-versicolor


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

100 101 6.3 3.3 6.0 2.5 Iris-virginica

101 102 5.8 2.7 5.1 1.9 Iris-virginica

102 103 7.1 3.0 5.9 2.1 Iris-virginica

103 104 6.3 2.9 5.6 1.8 Iris-virginica

104 105 6.5 3.0 5.8 2.2 Iris-virginica

105 106 7.6 3.0 6.6 2.1 Iris-virginica

106 107 4.9 2.5 4.5 1.7 Iris-virginica

107 108 7.3 2.9 6.3 1.8 Iris-virginica

108 109 6.7 2.5 5.8 1.8 Iris-virginica

109 110 7.2 3.6 6.1 2.5 Iris-virginica

110 111 6.5 3.2 5.1 2.0 Iris-virginica

111 112 6.4 2.7 5.3 1.9 Iris-virginica

112 113 6.8 3.0 5.5 2.1 Iris-virginica

113 114 5.7 2.5 5.0 2.0 Iris-virginica

114 115 5.8 2.8 5.1 2.4 Iris-virginica

115 116 6.4 3.2 5.3 2.3 Iris-virginica

116 117 6.5 3.0 5.5 1.8 Iris-virginica

117 118 7.7 3.8 6.7 2.2 Iris-virginica

118 119 7.7 2.6 6.9 2.3 Iris-virginica

119 120 6.0 2.2 5.0 1.5 Iris-virginica

120 121 6.9 3.2 5.7 2.3 Iris-virginica

121 122 5.6 2.8 4.9 2.0 Iris-virginica

122 123 7.7 2.8 6.7 2.0 Iris-virginica

123 124 6.3 2.7 4.9 1.8 Iris-virginica

124 125 6.7 3.3 5.7 2.1 Iris-virginica

125 126 7.2 3.2 6.0 1.8 Iris-virginica

126 127 6.2 2.8 4.8 1.8 Iris-virginica

127 128 6.1 3.0 4.9 1.8 Iris-virginica

128 129 6.4 2.8 5.6 2.1 Iris-virginica

129 130 7.2 3.0 5.8 1.6 Iris-virginica

130 131 7.4 2.8 6.1 1.9 Iris-virginica

131 132 7.9 3.8 6.4 2.0 Iris-virginica

132 133 6.4 2.8 5.6 2.2 Iris-virginica

133 134 6.3 2.8 5.1 1.5 Iris-virginica

134 135 6.1 2.6 5.6 1.4 Iris-virginica

135 136 7.7 3.0 6.1 2.3 Iris-virginica

136 137 6.3 3.4 5.6 2.4 Iris-virginica

137 138 6.4 3.1 5.5 1.8 Iris-virginica

138 139 6.0 3.0 4.8 1.8 Iris-virginica

139 140 6.9 3.1 5.4 2.1 Iris-virginica

140 141 6.7 3.1 5.6 2.4 Iris-virginica

141 142 6.9 3.1 5.1 2.3 Iris-virginica

142 143 5.8 2.7 5.1 1.9 Iris-virginica

143 144 6.8 3.2 5.9 2.3 Iris-virginica

144 145 6.7 3.3 5.7 2.5 Iris-virginica

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

# Calculate some basic statistical details for each species



Statistics for Iris Setosa:


{'percentile': 5.0, 'mean': 5.006, 'std_dev': 0.348946987377739}

Statistics for Iris Versicolor:


{'percentile': 5.9, 'mean': 5.936, 'std_dev': 0.5109833656783752}

Statistics for Iris Virginica:


{'percentile': 6.5, 'mean': 6.587999999999998, 'std_dev': 0.6294886813914925}
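A sketch of how the per-species figures above can be computed. It uses scikit-learn's bundled copy of the Iris data (an assumption; the practical itself loads Iris.csv); 'percentile' is taken as the 50th percentile (median) and std_dev uses ddof=0 (population standard deviation), both of which match the printed values.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data as a DataFrame and attach species names
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Per-species sepal-length statistics: median, mean, population std
grouped = df.groupby("species")["sepal length (cm)"]
stats = pd.DataFrame({"percentile": grouped.quantile(0.5),
                      "mean": grouped.mean(),
                      "std_dev": grouped.std(ddof=0)})
print(stats)
```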
Practical No-4
Data Analytics I
1. Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various houses in Boston through different
parameters. There are 506 samples and 14 feature variables in this dataset. The objective is to predict the prices of the houses using the given
features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

crim zn indus chas nox rm age dis rad tax ptratio b lstat medv

0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4

4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4

502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6

503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9

504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9

501 rows × 14 columns

crim zn indus chas nox rm age dis rad tax ptratio b lstat medv

0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

df.describe()

crim zn indus chas nox rm age dis rad tax ptratio

count 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000

mean 3.647414 11.402196 11.160619 0.069860 0.555151 6.284341 68.513373 3.786423 9.596806 409.143713 18.453493

std 8.637688 23.414214 6.857123 0.255166 0.116186 0.705587 28.212221 2.103327 8.735509 169.021216 2.166327

min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000

25% 0.081990 0.000000 5.190000 0.000000 0.449000 5.884000 45.000000 2.088200 4.000000 279.000000 17.400000

50% 0.261690 0.000000 9.690000 0.000000 0.538000 6.208000 77.700000 3.182700 5.000000 330.000000 19.000000

75% 3.693110 12.500000 18.100000 0.000000 0.624000 6.625000 94.000000 5.118000 24.000000 666.000000 20.200000

max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000
crim zn indus chas nox rm age dis rad tax ptratio b lstat

0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98

1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14

2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03

3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33

medv

0 24.0

1 21.6

2 34.7

3 33.4

4 36.2

** Basic stats **

<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, 0 to 505
Data columns (total 13 columns):
# Column Non-Null Count Dtype
0 crim 501 non-null float64
1 zn 501 non-null float64
2 indus 501 non-null float64
3 chas 501 non-null int64
4 nox 501 non-null float64
5 rm 501 non-null float64
6 age 501 non-null float64
7 dis 501 non-null float64
8 rad 501 non-null int64
9 tax 501 non-null int64
10 ptratio 501 non-null float64
11 b 501 non-null float64
12 lstat 501 non-null float64
dtypes: float64(10), int64(3)
memory usage: 54.8 KB

x.describe()

crim zn indus chas nox rm age dis rad tax ptratio

count 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000 501.000000

mean 3.647414 11.402196 11.160619 0.069860 0.555151 6.284341 68.513373 3.786423 9.596806 409.143713 18.453493

std 8.637688 23.414214 6.857123 0.255166 0.116186 0.705587 28.212221 2.103327 8.735509 169.021216 2.166327

min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000

25% 0.081990 0.000000 5.190000 0.000000 0.449000 5.884000 45.000000 2.088200 4.000000 279.000000 17.400000

50% 0.261690 0.000000 9.690000 0.000000 0.538000 6.208000 77.700000 3.182700 5.000000 330.000000 19.000000

75% 3.693110 12.500000 18.100000 0.000000 0.624000 6.625000 94.000000 5.118000 24.000000 666.000000 20.200000

max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000

<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, 0 to 505
Data columns (total 1 columns):
# Column Non-Null Count Dtype

0 medv 501 non-null float64


dtypes: float64(1)
memory usage: 7.8 KB
medv

count 501.000000

mean 22.561277

std 9.232435

min 5.000000

25% 17.000000

50% 21.200000

75% 25.000000

max 50.000000

crim 0
zn 0
indus 0
chas 0
nox 0
rm 0
age 0
dis 0
rad 0
tax 0
ptratio 0
b 0
lstat 0

medv 0

crim zn indus chas nox rm age dis rad tax ptratio b lstat target

0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
** Considering only 'rm' and 'lstat', based on their correlation with the target and the multi-collinearity of the other features **

** Scale the data **

** Split the data **

array([[ 8.18092223e-01, -1.30160443e+00],
       [-8.17641984e-01, -1.26300141e-01],
       [-9.69542329e-02, -4.35149426e-01],
       ...,
       [-6.29059927e-02, -8.46016801e-01],
       [ 1.88199779e-01,  9.31613474e-01]])
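The scale and split steps, together with the model fit that follows, can be sketched end-to-end. Stand-in data is generated here so the snippet runs on its own (an assumption); in the practical, df is the Boston Housing data with features 'rm', 'lstat' and target 'medv', and the scaled array printed above is the StandardScaler output.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the Boston 'rm'/'lstat'/'medv' columns
rng = np.random.default_rng(0)
df = pd.DataFrame({"rm": rng.normal(6.3, 0.7, 300),
                   "lstat": rng.normal(12.7, 7.0, 300)})
df["medv"] = 5.0 * df["rm"] - 0.6 * df["lstat"] + rng.normal(0.0, 2.0, 300)

x = df[["rm", "lstat"]]
y = df["medv"]

# Scale features to zero mean and unit variance, then split 70/30
x_scaled = StandardScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.3, random_state=42)

# Fit the regression and evaluate on the held-out set
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)
print("MSE:", mean_squared_error(y_test, y_pred))
```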

** Linear Regression Modelling **

Linear Regression:

Linear Regression is one of the most fundamental and widely known Machine Learning algorithms. A Linear Regression model predicts
the dependent variable using a regression line based on the independent variables. The equation of Linear Regression is:

Y = m*X + C + e

where m is the slope of the line, C is the intercept, and e is the error term. The equation above is used to predict the value of the target
variable based on the given predictor variable(s).
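To make the equation concrete, a toy fit on synthetic points (hypothetical values: a true line y = 3x + 2 plus noise) recovers the slope and intercept from data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate points along y = 3x + 2 with Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.5, 100)

# Fit a line and read back the estimated slope and intercept
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # close to 3 and 2
```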
** Make predictions **

y_pred

array([28.93570378, 13.15518066, 31.92680855, 5.40039443, 22.82828123,


21.04226226, 28.0770197 , 18.56914709, 21.22299382, 26.78902617,
25.71261587, 20.50032367, 28.55381473, 12.79189085, 18.38006882,
20.75795032, 22.55852415, 26.14474823, 19.58759673, 36.88024325,
19.78018038, 14.35620251, 19.97797033, 12.41503219, 18.59211758,
9.46726055, 22.576265 , 21.03726914, 17.47806368, 22.50272351,
21.20324285, 20.33606028, 13.65630735, 23.95476537, 31.25783399,
17.90368997, 22.17379262, 18.08330894, 23.7801768 , 36.31209883,
12.9689076 , 20.36508901, 28.08603306, 30.67029981, 17.07461465,
20.64045839, 26.34014916, 23.14769973, 4.91629824, 28.57004639,
21.83494252, 35.42308813, 16.79889613, 20.70958117, 20.59977682,
27.05848533, 28.56392111, 34.24470721, 11.26386912, 23.22662076,
23.8958616 , 15.96615347, 22.50592533, -3.72208564, 16.16588462,
35.15479278, 25.33172115, 39.23835608, 18.92262283, 39.07276316,
22.13668345, 22.06758669, 19.58311473, 25.53444468, 25.37336541,
8.3900077 , 29.08205872, 28.40885261, 17.6722375 , 20.30681461,
19.08388365, 16.46995692, 18.58406879, 16.47048198, 10.65653463,
30.22502685, 25.2033519 , 22.71490447, 23.51612694, 32.51835921,
24.88360571, 29.09537254, 14.71160176, 26.01072297, 24.02993432,
13.74208172, 20.32226887, 31.01129598, 13.89145881, 25.95815767,
26.24755122, 27.96402373, 24.22376089, 18.6767933 , 4.16786821,
28.15459538, 35.29550753, 35.20607977, 25.58787024, 29.35303083,
25.01379791, 26.24520966, 23.65094917, 19.71682436, 26.26227368,
18.44788539, 27.65217456, 24.49819922, 26.76613854, 29.43644969,
21.89422709, 32.9507895 , 20.959011 , 16.39889751, 19.3803633 ,
24.88100349, 23.58602067, 21.0350933 , 24.67568794, 20.38623468,
17.42043446, 19.49936255, 29.00740464, 15.42939329, 25.68753845,
17.96674431, 30.77861641, 16.4965733 , 10.08535036, 20.03199361,
22.65918857, 18.63646462, 32.1898622 , 29.92069578, 23.29531716,
29.94180793, 17.15282157, 19.55696012, 33.12905937, 25.80830111,
29.87465237])

29.63386601617516

4.00867504953758

sns.regplot(x=y_test,y= y_pred, color='red')


plt.show()


Actual Predicted Variance

259 30.1 28.935704 1.164296

389 11.5 13.155181 -1.655181

306 33.4 31.926809 1.473191

147 14.6 5.400394 9.199606

235 24.0 22.828281 1.171719
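The Actual/Predicted/Variance table above can be rebuilt from its printed values; note that the "Variance" column is simply the residual (actual minus predicted), not a statistical variance.

```python
import pandas as pd

# Values and index labels copied from the table above
idx = [259, 389, 306, 147, 235]
actual = pd.Series([30.1, 11.5, 33.4, 14.6, 24.0], index=idx)
predicted = pd.Series([28.935704, 13.155181, 31.926809, 5.400394, 22.828281],
                      index=idx)

# 'Variance' is the residual: actual minus predicted
comparison = pd.DataFrame({"Actual": actual,
                           "Predicted": predicted,
                           "Variance": actual - predicted})
print(comparison)
```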

rm lstat target

0 6.575 4.98 24.0

1 6.421 9.14 21.6

2 7.185 4.03 34.7

3 6.998 2.94 33.4

4 7.147 5.33 36.2

5 6.430 5.21 28.7

6 6.012 12.43 22.9

7 6.172 19.15 27.1

8 5.631 29.93 16.5

9 6.004 17.10 18.9

11 6.009 13.27 18.9

12 5.889 15.71 21.7

13 5.949 8.26 20.4

14 6.096 10.26 18.2

15 5.834 8.47 19.9


Practical No-5
Data Analytics II
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.

2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.

Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Load data

df = pd.read_csv('Social_Network_Ads.csv')
df

User ID Gender Age EstimatedSalary Purchased

0 15624510 Male 19 19000 0

1 15810944 Male 35 20000 0

2 15668575 Female 26 43000 0

3 15603246 Female 27 57000 0

4 15804002 Male 19 76000 0

... ... ... ... ... ...

395 15691863 Female 46 41000 1

396 15706071 Male 51 23000 1

397 15654296 Female 50 20000 1

398 15755018 Male 36 33000 0

399 15594041 Female 49 36000 1

400 rows × 5 columns

Basic stats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
# Column Non-Null Count Dtype

0 User ID 400 non-null int64


1 Gender 400 non-null object
2 Age 400 non-null int64
3 EstimatedSalary 400 non-null int64
4 Purchased 400 non-null int64
dtypes: int64(4), object(1)
memory usage: 15.8+ KB
User ID Age EstimatedSalary Purchased

count 4.000000e+02 400.000000 400.000000 400.000000

mean 1.569154e+07 37.655000 69742.500000 0.357500

std 7.165832e+04 10.482877 34096.960282 0.479864

min 1.556669e+07 18.000000 15000.000000 0.000000

25% 1.562676e+07 29.750000 43000.000000 0.000000

50% 1.569434e+07 37.000000 70000.000000 0.000000

75% 1.575036e+07 46.000000 88000.000000 1.000000

max 1.581524e+07 60.000000 150000.000000 1.000000

Age 0

histplot = sns.histplot(df['Age'], kde=True, bins=10, color='red', alpha=0.3)


for i in histplot.containers:
histplot.bar_label(i,)
plt.show()

# Encode Gender as a number (mapping inferred from the table below: Male=1, Female=0)
def encode_gender(gender):
    if gender == 'Male':
        return 1
    if gender == 'Female':
        return 0
    return -1

df['Gender'] = df['Gender'].apply(encode_gender)

User ID Gender Age EstimatedSalary Purchased

0 15624510 1 19 19000 0

1 15810944 1 35 20000 0

2 15668575 0 26 43000 0

3 15603246 0 27 57000 0

4 15804002 1 19 76000 0

... ... ... ... ... ...

395 15691863 0 46 41000 1

396 15706071 1 51 23000 1

397 15654296 0 50 20000 1

398 15755018 1 36 33000 0


399 15594041 0 49 36000 1

400 rows × 5 columns

df['Purchased'].value_counts()

0 257
1 143
sns.heatmap(df.corr(), annot=True)
plt.show()
Data preparation
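A likely shape of this step (an assumption: the 80-row test set used later implies test_size=0.2 on 400 rows). Stand-in columns substitute for the Social_Network_Ads data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in mirroring the Age/EstimatedSalary/Purchased columns
rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(18, 61, 400),
                   "EstimatedSalary": rng.integers(15000, 150001, 400),
                   "Purchased": rng.integers(0, 2, 400)})

# Features and target, then an 80/20 train/test split
x = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
print(len(x_train), len(x_test))
```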

Model building
model = LogisticRegression()

model.fit(x_train, y_train)

LogisticRegression()

y_pred = model.predict(x_test)
y_pred

array([0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

model.score(x_train,y_train)

0.821875

model.score(x,y)

0.835

Evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[50 2]
[ 7 21]]

disp=ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=model.classes_)
disp.plot()
plt.show()
TN value is 50
FP value is 2
FN value is 7
TP value is 21

Accuracy score is 0.8875

Error rate is 0.11250000000000004

Precision score is 0.9130434782608695

Recall score is 0.75
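The metrics above follow directly from the four confusion-matrix cells. The arrays here are stand-ins constructed to mirror the counts shown (52 true negatives-class samples, 28 positives).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in labels reproducing the counts: TN=50, FP=2, FN=7, TP=21
y_true = np.array([0] * 52 + [1] * 28)
y_pred = np.array([0] * 50 + [1] * 2 + [0] * 7 + [1] * 21)

# sklearn's 2x2 matrix ravels in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, accuracy, error_rate, precision, recall)
```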

print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.88 0.96 0.92 52


1 0.91 0.75 0.82 28

accuracy 0.89 80
macro avg 0.90 0.86 0.87 80
weighted avg 0.89 0.89 0.88 80
Practical No-6
Data Analytics III

1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.

Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Load data


sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

Basic stats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
# Column Non-Null Count Dtype

0 sepal length (cm) 150 non-null float64


1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 1 columns):
# Column Non-Null Count Dtype

0 target 150 non-null int64


dtypes: int64(1)
memory usage: 1.3 KB
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.057333 3.758000 1.199333

std 0.828066 0.435866 1.765298 0.762238

min 4.300000 2.000000 1.000000 0.100000

25% 5.100000 2.800000 1.600000 0.300000

50% 5.800000 3.000000 4.350000 1.300000

75% 6.400000 3.300000 5.100000 1.800000

max 7.900000 4.400000 6.900000 2.500000

Data preparation
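A likely shape of this step (an assumption): load the iris features x and target y from scikit-learn, then hold out 30 of the 150 samples, matching the test-set support seen in the evaluation below.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features as a DataFrame, target as a Series
x, y = load_iris(return_X_y=True, as_frame=True)

# 120 training samples, 30 test samples
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)
```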

Model building
model = GaussianNB()

model.fit(x_train, y_train)

GaussianNB()

y_pred = model.predict(x_test)

Evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

plot_confusion_matrix(conf_mat=cm, figsize=(5,5), show_normed=True)


plt.show()
TP value is 10
TN value is 20
FP value is 0
FN value is 0

Accuracy score is 1.0

Error rate is 0.0

Precision score is 1.0

Recall score is 1.0
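The TP/TN/FP/FN values above come from reading the 3x3 matrix one-vs-rest. Shown here for class 0; the same formulas give the counts for classes 1 and 2.

```python
import numpy as np

# The 3x3 confusion matrix printed above (rows: true class, cols: predicted)
cm = np.array([[10, 0, 0],
               [0, 9, 0],
               [0, 0, 11]])

k = 0                            # class of interest
tp = cm[k, k]                    # predicted k, actually k
fp = cm[:, k].sum() - tp         # predicted k, actually another class
fn = cm[k, :].sum() - tp         # actually k, predicted another class
tn = cm.sum() - tp - fp - fn     # everything else
print(tp, fp, fn, tn)
```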

print(classification_report(y_test, y_pred))

precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Practical No-7
Text Analytics
1. Extract Sample document and apply following document preprocessing methods: Tokenization, POS Tagging, stop words removal,
Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document Frequency.

import nltk
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to
[nltk_data] | C:\Users\PL\AppData\Roaming\nltk_data...
[nltk_data] | Package abc is already up-to-date!
[nltk_data] | ...
[nltk_data] | (every remaining package in the collection is likewise
[nltk_data] | reported as already up-to-date)
[nltk_data] |
[nltk_data] Done downloading collection all
True

Tokenization

['Sachin', 'was', 'the', 'GOAT', 'of', 'the', 'previous', 'generation', '.', 'Virat', 'is', 'the', 'GOAT', 'of', 'this', 'generation', '.', 'Shubham', 'will', 'be', 'the', 'GOAT', 'of', 'the', 'next', 'generation']
['Sachin was the GOAT of the previous generation.', 'Virat is the GOAT of this generation.', 'Shubham will be the GOAT of the next generation']

POS tagging

[('Sachin', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('GOAT', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('previous', 'JJ'), ('generation', 'NN'), ('.', '.'), ('Virat', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('GOAT', 'NNP'), ('of', 'IN'), ('this', 'DT'), ('generation', 'NN'), ('.', '.'), ('Shubham', 'NNP'), ('will', 'MD'), ('be', 'VB'), ('the', 'DT'), ('GOAT', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('generation', 'NN')]

Stop word removal


'as',
'at',
'be',
 ...
'yourself',
'yourselves'}

['Sachin', 'GOAT', 'previous', 'generation', '.', 'Virat', 'GOAT', 'generation', '.', 'Shubham', 'GOAT', 'next', 'generation']

Stemming

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_tokens = []
for token in filtered_tokens:  # tokens remaining after stop word removal
    stemmed = stemmer.stem(token)
    stemmed_tokens.append(stemmed)
print(stemmed_tokens)

['sachin', 'goat', 'previou', 'gener', '.', 'virat', 'goat', 'gener', '.', 'shubham', 'goat', 'next', 'gener']

Lemmatization

['Sachin', 'GOAT', 'previous', 'generation', '.', 'Virat', 'GOAT', 'generation', '.', 'Shubham', 'GOAT', 'next', 'generation']

TF-IDF

Sachin:1
was:1
the:5
GOAT:3
of:3
previous:1
generation:3
.:2
Virat:1
is:1
this:1
Shubham:1
will:1
be:1
next:1
(0, 7) 1.0
(1, 12) 1.0
(2, 9) 1.0
(3, 2) 1.0
(4, 5) 1.0
(5, 9) 1.0
(6, 6) 1.0
(7, 1) 1.0
(9, 11) 1.0
(10, 3) 1.0
(11, 9) 1.0
(12, 2) 1.0
(13, 5) 1.0
(14, 10) 1.0
(15, 1) 1.0
(17, 8) 1.0
(18, 13) 1.0
(19, 0) 1.0
(20, 9) 1.0
(21, 2) 1.0
(22, 5) 1.0
(23, 9) 1.0
(24, 4) 1.0
(25, 1) 1.0

['be' 'generation' 'goat' 'is' 'next' 'of' 'previous' 'sachin' 'shubham'
 'the' 'this' 'virat' 'was' 'will']
Practical No-8
Data Visualization I

1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the
unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load data and basic stats

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Age SibSp Parch Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000

mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208

std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429

min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000

25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400

50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200

75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000

max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
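Between the two isnull checks the notebook evidently fills the 177 missing Age values, since the second check reports Age 0. A minimal sketch of that step, assuming median imputation (the notebook may equally use the mean):

```python
import numpy as np
import pandas as pd

def fill_age(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace missing Age values with the column median (assumed strategy)."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    return out

# Demo on a tiny frame with one missing age
demo = fill_age(pd.DataFrame({"Age": [22.0, np.nan, 35.0]}))
print(demo["Age"].isna().sum())
```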

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Visualization
Ticket
347082 7
CA. 2343 7
1601 7
3101295 6
CA 2144 6
..
9234 1
19988 1
2693 1
PC 17612 1
370376 1
Name: count, Length: 681, dtype: int64

Cabin
B96 B98 4
G6 4
C23 C25 C27 4
C22 C26 3
F33 3
..
E34 1
C7 1
C54 1
E36 1
C148 1

Embarked
S 644
C 168
Q 77

# Encode categorical columns as integers, reconstructed from the bare
# return statements in the source. Mappings assumed: male -> 1 / female -> 0
# and S -> 0 / C -> 1 / Q -> 2 (consistent with the int64 dtypes shown below).
def encode_sex(sex):
    if sex == "male":
        return 1
    return 0

def encode_embarked(port):
    if port == "S":
        return 0
    if port == "C":
        return 1
    if port == "Q":
        return 2
    return 0

df["Sex"] = df["Sex"].apply(encode_sex)
df["Embarked"] = df["Embarked"].apply(encode_embarked)

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64

# Set up the figure and axes
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Age Distribution (histogram choice assumed from the comments)
sns.histplot(df["Age"], ax=axes[0])

# SibSp Distribution
sns.histplot(df["SibSp"], ax=axes[1])

# Parch Distribution
sns.histplot(df["Parch"], ax=axes[2])

plt.tight_layout()
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null int64
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Embarked 891 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 76.7+ KB

"Survived" is the label


sns.countplot(df, x="Survived")
plt.show()

sns.countplot(df,x="Pclass", hue="Survived",palette="Accent")
plt.show()
sns.histplot(df["Fare"])
plt.show()
Practical No-9
Data Visualization II

1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along
with the information about whether they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load data and basic stats

df

     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN        S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

891 rows × 12 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Age SibSp Parch Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000

mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208

std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429

min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000

25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400

50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200

75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000

max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

df.isna().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

df.isna().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Visualization

# Encode categorical columns as integers, reconstructed from the bare
# return statements in the source. Mappings assumed: male -> 1 / female -> 0
# and S -> 0 / C -> 1 / Q -> 2.
def encode_sex(sex):
    if sex == "male":
        return 1
    return 0

def encode_embarked(port):
    if port == "S":
        return 0
    if port == "C":
        return 1
    if port == "Q":
        return 2
    return 0

df["Sex"] = df["Sex"].apply(encode_sex)
df["Embarked"] = df["Embarked"].apply(encode_embarked)

plt.figure(figsize=(10,7))
box = sns.boxplot(df,x="Sex", y="Age", hue="Survived")
plt.show()

This code displays a box plot showing the distribution of ages with respect to gender and survival status. From it you can observe trends such as whether there are age differences between survivors and non-survivors, or whether gender has a distinct influence on survival outcomes.
Practical No-10
Data Visualization III Download the Iris flower dataset or any other dataset into a DataFrame. (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:

1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.

Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

Load and preprocess data

Features and their types:


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
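A sketch of the loading cell that yields the keys above and the DataFrame below (the column name label for the target is taken from the output):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.keys())

# Four numeric (float64) measurement features plus a nominal class label
# (0 = setosa, 1 = versicolor, 2 = virginica)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["label"] = iris.target
print(df)
```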

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2


149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0


4 5.0 3.6 1.4 0.2 0
(150, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype

0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 label 150 non-null int32
dtypes: float64(4), int32(1)
memory usage: 5.4 KB

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.057333 3.758000 1.199333 1.000000

std 0.828066 0.435866 1.765298 0.762238 0.819232

min 4.300000 2.000000 1.000000 0.100000 0.000000

25% 5.100000 2.800000 1.600000 0.300000 0.000000

50% 5.800000 3.000000 4.350000 1.300000 1.000000

75% 6.400000 3.300000 5.100000 1.800000 2.000000

max 7.900000 4.400000 6.900000 2.500000 2.000000

Visualization
sns.heatmap(df.corr(), annot=True)
plt.show()
sns.histplot(df["sepal width (cm)"], kde=True)
plt.show()
sns.histplot(df["petal width (cm)"], kde=True)
plt.show()
sns.boxplot(x=df['label'] ,y=df["sepal width (cm)"])
plt.show()
sns.boxplot(x=df['label'] ,y=df["petal width (cm)"])
plt.show()

Practical No:- 11
Title :- Write a code in JAVA for a simple Word Count application that counts the number of occurrences
of each word in a given input set using the Hadoop Map-Reduce framework on local-standalone set-up.

Program / Commands:-
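The Java source packed into WordCountMapReduce.jar is not reproduced here. Its core map/reduce logic — tokenize each line, emit (word, 1), then sum the counts per word — can be sketched as a self-contained class (names are illustrative; the real job wraps this logic in Hadoop's Mapper and Reducer subclasses from hadoop-mapreduce-client-core):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountLocal {

    // Map + Reduce in one pass: the map phase would emit (token, 1) pairs,
    // and the reduce phase sums the 1s for each distinct token.
    public static Map<String, Integer> wordCount(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints each word with its count, tab-separated like the HDFS output file
        wordCount("hello hadoop hello mapreduce")
                .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```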

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows

PS C:\Users\Omkar> start-all.cmd
This script is Deprecated. Instead use start-dfs.cmd and start-yarn.cmd
starting yarn daemons
PS C:\Users\Omkar> jps
17152 NodeManager
17936 ResourceManager
7216 NameNode
17496 DataNode
4684 Jps
PS C:\Users\Omkar> hadoop fs -mkdir /input
PS C:\Users\Omkar> hadoop fs -put D:\Hadoop Files\omkar.txt /input
put: `D:/Hadoop': No such file or directory
put: `Files/omkar.txt': No such file or directory
PS C:\Users\Omkar> hadoop fs -put C:\Users\Omkar\Desktop\Hadoop Files /input
put: `C:/Users/Omkar/Desktop/Hadoop': No such file or directory
put: `Files': No such file or directory
PS C:\Users\Omkar> hadoop fs -put C:\Users\Omkar\Desktop\HadoopFiles\omkar.txt /input
PS C:\Users\Omkar> hadoop fs -ls /input
Found 1 items
-rw-r--r-- 3 Omkar supergroup 48 2025-03-28 10:00 /input/omkar.txt
PS C:\Users\Omkar> hadoop jar C:\Users\Omkar\Desktop\JARFILES\WordCountMapReduce.jar com.mapreduce.wc.WordCount /input/omkar.txt /output
2025-03-28 10:29:24,839 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2025-03-28 10:29:25,558 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /tmp/hadoop-yarn/staging/Omkar/.staging/job_1743135021734_0001
2025-03-28 10:29:25,777 INFO input.FileInputFormat: Total input files to process : 1
2025-03-28 10:29:25,866 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-28 10:29:26,043 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1743135021734_0001
2025-03-28 10:29:26,045 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-28 10:29:26,188 INFO conf.Configuration: resource-types.xml not found
2025-03-28 10:29:26,189 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2025-03-28 10:29:26,687 INFO impl.YarnClientImpl: Submitted application
application_1743135021734_0001
2025-03-28 10:29:26,727 INFO mapreduce.Job: The url to track the job: http://Omkar-
Mete:8088/proxy/application_1743135021734_0001/
2025-03-28 10:29:26,728 INFO mapreduce.Job: Running job: job_1743135021734_0001
2025-03-28 10:29:34,906 INFO mapreduce.Job: Job job_1743135021734_0001 running in uber mode
: false
2025-03-28 10:29:34,909 INFO mapreduce.Job: map 0% reduce 0%
2025-03-28 10:29:38,987 INFO mapreduce.Job: map 100% reduce 0%
2025-03-28 10:29:44,069 INFO mapreduce.Job: map 100% reduce 100%
2025-03-28 10:29:44,084 INFO mapreduce.Job: Job job_1743135021734_0001 completed
successfully
2025-03-28 10:29:44,151 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=81
FILE: Number of bytes written=477919
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=150
HDFS: Number of bytes written=55
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2529
Total time spent by all reduces in occupied slots (ms)=2591
Total time spent by all map tasks (ms)=2529
Total time spent by all reduce tasks (ms)=2591
Total vcore-milliseconds taken by all map tasks=2529
Total vcore-milliseconds taken by all reduce tasks=2591
Total megabyte-milliseconds taken by all map tasks=2589696
Total megabyte-milliseconds taken by all reduce tasks=2653184
Map-Reduce Framework
Map input records=4
Map output records=5
Map output bytes=65
Map output materialized bytes=81
Input split bytes=102
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=81
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=61
CPU time spent (ms)=904
Physical memory (bytes) snapshot=522989568
Virtual memory (bytes) snapshot=798494720
Total committed heap usage (bytes)=395313152
Peak Map Physical memory (bytes)=315162624
Peak Map Virtual memory (bytes)=463032320
Peak Reduce Physical memory (bytes)=207826944
Peak Reduce Virtual memory (bytes)=335462400
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=48
File Output Format Counters
Bytes Written=55
PS C:\Users\Omkar> hadoop dfs -cat /output/*
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
HII 1
METE 1
OMKAR 1
OMKAR@06 1
OMKARNMETE@GMAIL.COM 1
PS C:\Users\Omkar> hadoop dfs -get /output/part-r-00000
C:\Users\Omkar\Desktop\HadoopFiles\textfile.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
PS C:\Users\Omkar> stop-all.cmd
This script is Deprecated. Instead use stop-dfs.cmd and stop-yarn.cmd
SUCCESS: Sent termination signal to the process with PID 11832.
SUCCESS: Sent termination signal to the process with PID 10140.
stopping yarn daemons
SUCCESS: Sent termination signal to the process with PID 19476.
SUCCESS: Sent termination signal to the process with PID 17384.

INFO: No tasks running with the specified criteria.


PS C:\Users\Omkar>

Output:-
Practical No:- 12
Title :- Design a distributed application using Map-Reduce which processes a log file of
a system.
Program / Commands:-
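The Java source packaged into Process.jar is not reproduced in this record. Judging from the output listing (one count per IP address), the mapper extracts the client IP from each log line and emits (ip, 1), and the reducer sums per IP. A self-contained plain-Java sketch of that logic (the log layout assumed here, IP as the first whitespace-delimited field, is an assumption, not taken from the actual input file):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java sketch of the log-processing map/reduce logic:
// map emits (ip, 1) per matching log line, reduce sums counts per IP.
public class LogProcessSketch {

    // Assumed layout: the client IPv4 address is the first field of a line.
    private static final Pattern IPV4 =
            Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    static Map<String, Integer> countIps(String[] logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            // Map phase: emit only when the first field looks like an IPv4 address.
            if (fields.length > 0 && IPV4.matcher(fields[0]).matches()) {
                counts.merge(fields[0], 1, Integer::sum); // reduce: sum per IP
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "10.1.1.236 - - [28/Mar/2025] \"GET /index.html\" 200",
            "10.1.1.236 - - [28/Mar/2025] \"GET /about.html\" 200",
            "not-a-log-line"
        };
        System.out.println(countIps(sample)); // prints {10.1.1.236=2}
    }
}
```

Note that the job counters below show Map input records=2589 but Map output records=1295, i.e. the actual mapper emits for only some lines, consistent with filtering out lines that do not match the expected log pattern.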
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows

PS C:\Users\Omkar> start-all.cmd
This script is Deprecated. Instead use start-dfs.cmd and start-yarn.cmd
starting yarn daemons
PS C:\Users\Omkar> jps
20928 NodeManager
7684 Eclipse
14136 NameNode
16296 DataNode
20072 ResourceManager
18236 Jps
PS C:\Users\Omkar> hadoop fs -mkdir /InputLog
PS C:\Users\Omkar> hadoop fs -rm -r /InputLog
Deleted /InputLog
PS C:\Users\Omkar> hadoop fs -mkdir /InputDir
PS C:\Users\Omkar> hadoop fs -put C:\Users\Omkar\Documents\Log\InputLog.txt /InputDir
PS C:\Users\Omkar> hadoop fs -ls /InputDir
Found 1 items
-rw-r--r-- 3 Omkar supergroup 145670 2025-03-31 16:30 /InputDir/InputLog.txt
PS C:\Users\Omkar> hadoop jar C:\Users\Omkar\Documents\JARFILES\Process.jar
com.mapreduce.lf/Process /InputDir/InputLog.txt /OutputDir
2025-03-31 16:37:23,239 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2025-03-31 16:37:23,902 WARN mapreduce.JobResourceUploader: Hadoop command-line option
parsing not performed. Implement the Tool interface and execute your application with ToolRunner to
remedy this.
2025-03-31 16:37:23,918 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /tmp/hadoop-yarn/staging/Omkar/.staging/job_1743418229851_0001
2025-03-31 16:37:24,077 INFO input.FileInputFormat: Total input files to process : 1
2025-03-31 16:37:24,168 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-31 16:37:24,295 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1743418229851_0001
2025-03-31 16:37:24,297 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-31 16:37:24,439 INFO conf.Configuration: resource-types.xml not found
2025-03-31 16:37:24,439 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2025-03-31 16:37:24,699 INFO impl.YarnClientImpl: Submitted application
application_1743418229851_0001
2025-03-31 16:37:24,744 INFO mapreduce.Job: The url to track the job: http://Omkar-
Mete:8088/proxy/application_1743418229851_0001/
2025-03-31 16:37:24,745 INFO mapreduce.Job: Running job: job_1743418229851_0001
2025-03-31 16:37:33,924 INFO mapreduce.Job: Job job_1743418229851_0001 running in uber mode
: false
2025-03-31 16:37:33,928 INFO mapreduce.Job: map 0% reduce 0%
2025-03-31 16:37:39,045 INFO mapreduce.Job: map 100% reduce 0%
2025-03-31 16:37:45,130 INFO mapreduce.Job: map 100% reduce 100%
2025-03-31 16:37:46,152 INFO mapreduce.Job: Job job_1743418229851_0001 completed
successfully
2025-03-31 16:37:46,262 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=4479
FILE: Number of bytes written=486753
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=145778
HDFS: Number of bytes written=3611
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2918
Total time spent by all reduces in occupied slots (ms)=3117
Total time spent by all map tasks (ms)=2918
Total time spent by all reduce tasks (ms)=3117
Total vcore-milliseconds taken by all map tasks=2918
Total vcore-milliseconds taken by all reduce tasks=3117
Total megabyte-milliseconds taken by all map tasks=2988032
Total megabyte-milliseconds taken by all reduce tasks=3191808
Map-Reduce Framework
Map input records=2589
Map output records=1295
Map output bytes=22902
Map output materialized bytes=4479
Input split bytes=108
Combine input records=1295
Combine output records=227
Reduce input groups=227
Reduce shuffle bytes=4479
Reduce input records=227
Reduce output records=227
Spilled Records=454
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=85
CPU time spent (ms)=1091
Physical memory (bytes) snapshot=514068480
Virtual memory (bytes) snapshot=765394944
Total committed heap usage (bytes)=359661568
Peak Map Physical memory (bytes)=308555776
Peak Map Virtual memory (bytes)=437420032
Peak Reduce Physical memory (bytes)=205512704
Peak Reduce Virtual memory (bytes)=328048640
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=145670
File Output Format Counters
Bytes Written=3611
PS C:\Users\Omkar> hadoop dfs -cat /OutputDir/*
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53
10.103.63.29 1
10.104.73.51 1
10.105.160.183 1
10.108.91.151 1
10.109.21.76 1
10.11.131.40 1
10.111.71.20 8
10.112.227.184 6
10.114.74.30 1
10.115.118.78 1
10.117.224.230 1
10.117.76.22 12
10.118.19.97 1
10.118.250.30 7
10.119.117.132 23
10.119.33.245 1
10.119.74.120 1
10.12.113.198 2
10.12.219.30 1
10.120.165.113 1
10.120.207.127 4
10.123.124.47 1
10.123.35.235 1
10.124.148.99 1
10.124.155.234 1
10.126.161.13 7
10.127.162.239 1
10.128.11.75 10
10.13.42.232 1
10.130.195.163 8
10.130.70.80 1
10.131.163.73 1
10.131.209.116 5
10.132.19.125 2
10.133.222.184 12
10.134.110.196 13
10.134.242.87 1
10.136.84.60 5
10.14.2.86 8
10.14.4.151 2
10.140.139.116 1
10.140.141.1 9
10.140.67.116 1
10.141.221.57 5
10.142.203.173 7
10.143.126.177 32
10.144.147.8 1
10.15.208.56 1
10.15.23.44 13
10.150.212.239 14
10.150.227.16 1
10.150.24.40 13
10.152.195.138 8
10.153.23.63 2
10.153.239.5 25
10.155.95.124 9
10.156.152.9 1
10.157.176.158 1
10.164.130.155 1
10.164.49.105 8
10.164.95.122 10
10.165.106.173 14
10.167.1.145 19
10.169.158.88 1
10.170.178.53 1
10.171.104.4 1
10.172.169.53 18
10.174.246.84 3
10.175.149.65 1
10.175.204.125 15
10.177.216.164 6
10.179.107.170 2
10.181.38.207 13
10.181.87.221 1
10.185.152.140 1
10.186.56.126 16
10.186.56.183 1
10.187.129.140 6
10.187.177.220 1
10.187.212.83 1
10.187.28.68 1
10.19.226.186 2
10.190.174.142 10
10.190.41.42 5
10.191.172.11 1
10.193.116.91 1
10.194.174.4 7
10.198.138.192 1
10.199.103.248 2
10.199.189.15 1
10.2.202.135 1
10.200.184.212 1
10.200.237.222 1
10.200.9.128 2
10.203.194.139 10
10.205.72.238 2
10.206.108.96 2
10.206.175.236 1
10.206.73.206 7
10.207.190.45 17
10.208.38.46 1
10.208.49.216 4
10.209.18.39 9
10.209.54.187 3
10.211.47.159 10
10.212.122.173 1
10.213.181.38 7
10.214.35.48 1
10.215.222.114 1
10.216.113.172 48
10.216.134.214 1
10.216.227.195 16
10.217.151.145 10
10.217.32.16 1
10.218.16.176 8
10.22.108.103 4
10.220.112.1 34
10.221.40.89 5
10.221.62.23 13
10.222.246.34 1
10.223.157.186 11
10.225.137.152 1
10.225.234.46 1
10.226.130.133 1
10.229.60.23 1
10.230.191.135 6
10.231.55.231 1
10.234.15.156 1
10.236.231.63 1
10.238.230.235 1
10.239.100.52 1
10.239.52.68 4
10.24.150.4 5
10.24.67.131 13
10.240.144.183 15
10.240.170.50 1
10.241.107.75 1
10.241.9.187 1
10.243.51.109 5
10.244.166.195 5
10.245.208.15 20
10.246.151.162 3
10.247.111.104 9
10.247.175.65 1
10.247.229.13 1
10.248.24.219 1
10.248.36.117 3
10.249.130.132 3
10.25.132.238 2
10.25.44.247 6
10.250.166.232 1
10.27.134.23 1
10.30.164.32 1
10.30.47.170 8
10.31.225.14 7
10.32.138.48 11
10.32.247.175 4
10.32.55.216 12
10.33.181.9 8
10.34.233.107 1
10.36.200.176 1
10.39.45.70 2
10.39.94.109 4
10.4.59.153 1
10.4.79.47 15
10.41.170.233 9
10.41.40.17 1
10.42.208.60 1
10.43.81.13 1
10.46.190.95 10
10.48.81.158 5
10.5.132.217 1
10.5.148.29 1
10.50.226.223 9
10.50.41.216 3
10.52.161.126 1
10.53.58.58 1
10.54.242.54 10
10.54.49.229 1
10.56.48.40 16
10.59.42.194 11
10.6.238.124 6
10.61.147.24 1
10.61.161.218 1
10.61.23.77 8
10.61.232.147 3
10.62.78.165 2
10.63.233.249 7
10.64.224.191 13
10.66.208.82 2
10.69.20.85 26
10.70.105.238 1
10.70.238.46 6
10.72.137.86 6
10.72.208.27 1
10.73.134.9 4
10.73.238.200 1
10.73.60.200 1
10.73.64.91 1
10.74.218.123 1
10.75.116.199 1
10.76.143.30 1
10.76.68.178 16
10.78.95.24 8
10.80.10.131 10
10.80.215.116 17
10.81.134.180 1
10.82.30.199 63
10.82.64.235 1
10.84.236.242 1
10.87.209.46 1
10.87.88.214 1
10.88.204.177 1
10.89.178.62 1
10.89.244.42 1
10.94.196.42 1
10.95.136.211 4
10.95.232.88 1
10.98.156.141 1
10.99.228.224 1
PS C:\Users\Omkar> hdfs dfs -get /OutputDir/part-r-00000
C:\Users\Omkar\Documents\OutputDir\OutputLog.txt
PS C:\Users\Omkar>

Output:-
Practical No:- 14
Title :- Write a simple program in SCALA using the Apache Spark framework
Program :-
// Scala program to print Hello, Students!

// Creating object
object Students {

  // Main method
  def main(args: Array[String]): Unit = {
    // prints Hello, Students!
    println("Hello, Students!")
  }
}
