
UNIVERSITY COLLEGE OF ENGINEERING - BIT CAMPUS, TIRUCHIRAPPALLI-620024

DEPARTMENT OF INFORMATION TECHNOLOGY

CS3362 FOUNDATIONS OF DATA SCIENCE LAB MANUAL


(2021 Regulation)


EX.NO.4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:

Aim:
To read data from text files, Excel and the web using the pandas package.

ALGORITHM:
STEP 1: Start the program
STEP 2: To read data from a CSV file using the pandas package.
STEP 3: To read data from an Excel file using the pandas package.
STEP 4: To read data from an HTML page using the pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT

This notebook is the reference code for getting input and output. Pandas can read a variety of file
types using its pd.read_ methods. Let's take a look at the most common data types:

import numpy as np
import pandas as pd

CSV

CSV INPUT:
df = pd.read_csv('example')
df

a b c d

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

3 12 13 14 15


CSV OUTPUT:
df.to_csv('example',index=False)

EXCEL

Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or
images; a workbook containing images or macros may cause the read_excel method to crash.

EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')

a b c d

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

3 12 13 14 15

EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt
run:

pip install lxml


pip install html5lib==1.1
pip install BeautifulSoup4

Then restart Jupyter Notebook. (or use conda install)

Pandas can read HTML tables directly off a web page.

For example:

HTML INPUT

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list'

df = pd.read_html(url)

df[0]

match = "Metcalf Bank"

df_list = pd.read_html(url, match=match)

df_list[0]

HTML OUTPUT:
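The manual leaves this part blank; as a minimal sketch (not from the original listing), a DataFrame can be written back out as an HTML table with pandas' to_html method, where the file name example.html is only illustrative:

df.to_html('example.html', index=False)   # writes the DataFrame as an HTML <table>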

RESULT:
Exploring commands for reading data from CSV files, Excel files and HTML was successfully
executed.


EX.NO:4(a). EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET.
DATE:

AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To understand the idea behind descriptive statistics.
STEP 3: Load the packages we will need and also the `iris` dataset.
STEP 4: load_iris() loads an object containing the iris dataset, which is stored in `iris_obj`.
STEP 5: Basic statistics: count, mean, median, min, max
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd

from pandas import DataFrame

from sklearn.datasets import load_iris

# sklearn.datasets includes common example datasets

# A function to load in the iris dataset

iris_obj = load_iris()

# Dataset preview

iris_obj.data

iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names,
                 index=pd.Index([i for i in range(iris_obj.data.shape[0])])).join(
       DataFrame(iris_obj.target, columns=pd.Index(["species"]),
                 index=pd.Index([i for i in range(iris_obj.target.shape[0])])))

iris # prints iris data

Commands

iris_obj.feature_names

iris.count()

iris.mean()

iris.median()


iris.var()

iris.std()

iris.max()

iris.min()

iris.describe()
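As an optional extension of the command list above (not part of the original manual), the same summaries can be computed per species by grouping on the species column that was joined in earlier:

# group-wise descriptive statistics for each species label (0, 1, 2)
iris.groupby('species').describe()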

OUTPUT:

RESULT:
Exploring various commands for doing descriptive analytics on the Iris data set was
successfully executed.


EX.NO:5. USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING:
DATE:

A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS.
AIM:
To explore various commands for doing Univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To find the mean, median, mode, variance, standard deviation, skewness and
kurtosis for the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T

# Frequency: finding the unique count


df1 = df['Outcome'].value_counts()

# displaying df1
print(df1)
#mean
df.mean()
#median
df.median()


#mode
df.mode()
#Variance
df.var()
#standard deviation
df.std()
#kurtosis
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
#skewness
# skewness along the index axis, skipping NA values
df.skew(axis=0, skipna=True)

# find skewness in each row
df.skew(axis=1, skipna=True)

#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion), 3)*100, dtype=int)

preg = pd.DataFrame({'month': preg_month,
                     'count_of_preg_prop': preg_proportion,
                     'percentage_proportion': preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)

sns.countplot(x='Outcome', data=df)

sns.distplot(df['Pregnancies'])

sns.boxplot(data=df['Pregnancies'])


OUTPUT:

RESULT:
Exploring various commands for doing univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set was successfully executed.


EX.NO:5.B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
DATE:
AIM:
To explore the Linear and Logistic Regression model on the USA HOUSING AND UCI
AND PIMA INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download a data set, such as the housing data set, using Kaggle.
STEP 3: To read data from the downloaded data set.
STEP 4: To fit linear and logistic regression models using the given data sets.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')

fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.countplot('Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})


plt.tight_layout()

plot01=sns.countplot('Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()

plot20 = sns.boxplot(df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})


axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()

OUTPUT:

## Blood Pressure variable

fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))


plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',
label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')


axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()

OUTPUT:


fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

OUTPUT:


LINEAR REGRESSION MODELLING ON HOUSING DATASET

# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)

sns.distplot(USAhousing['Price'])

sns.heatmap(USAhousing.corr())

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

sns.distplot((y_test-predictions),bins=50);

from sklearn import metrics


print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
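As a small, optional addition to the listing above (using the same sklearn metrics module), the coefficient of determination can be reported alongside the error metrics:

# R-squared: proportion of variance in Price explained by the model
print('R2:', metrics.r2_score(y_test, predictions))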


OUTPUT:


LOGISTIC REGRESSION MODELLING ON PIMA DIABETES

# Data manipulation libraries


import numpy as np
import pandas as pd

###scikit Learn Modules needed for Logistic Regression


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv('C:/Users/diabetes.csv')

df.head()

df.tail()

df.isnull().sum()

df.describe(include='all')

df.corr()

sns.heatmap(df.corr(),annot=True)
plt.show()

df.hist()
plt.show()

sns.countplot(x=df['Outcome'])

scaler=StandardScaler()
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]=scaler.fit_transform(df[['Pregnancies',
'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']])

df_new = df


# Train & Test split


x_train, x_test, y_train, y_test = train_test_split( df_new[['Pregnancies', 'Glucose',
'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']],
df_new['Outcome'],test_size=0.20,
random_state=21)

print('Shape of Training Xs:{}'.format(x_train.shape))


print('Shape of Test Xs:{}'.format(x_test.shape))
print('Shape of Training y:{}'.format(y_train.shape))
print('Shape of Test y:{}'.format(y_test.shape))

Shape of Training Xs:(614, 8)


Shape of Test Xs:(154, 8)
Shape of Training y:(614,)
Shape of Test y:(154,)

# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)

score=model.score(x_test,y_test);
print(score)

0.7337662337662337

#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
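As an optional follow-up that is not part of the original listing, the 2x2 confusion matrix can be unpacked into the usual classification metrics, or scikit-learn's classification_report can be printed:

# derive accuracy, precision and recall from the confusion matrix (rows = actual, columns = predicted)
tn, fp, fn, tp = cnf_matrix.ravel()
print('Accuracy :', (tp + tn) / (tp + tn + fp + fn))
print('Precision:', tp / (tp + fp))
print('Recall   :', tp / (tp + fn))

# or use the built-in summary report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))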


OUTPUT:


RESULT:
Exploring various commands for doing bivariate analytics on the USA HOUSING and PIMA
INDIANS DIABETES data sets was successfully executed.


EX.NO:5.C) MULTIPLE REGRESSION ANALYSIS
DATE:
AIM:
To explore various commands for doing multiple regression analysis on the USA HOUSING
data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the USA HOUSING data set using Kaggle.
STEP 3: To read data from the USA HOUSING data set.
STEP 4: To perform multiple regression analysis on the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)
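The listing above stops at exploratory plots. As a minimal sketch of the multiple regression step itself (mirroring the earlier linear regression exercise and reusing the same USA_Housing columns), the model could be fitted as follows:

# multiple regression: several predictors, one response (Price)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

mlr = LinearRegression()
mlr.fit(X_train, y_train)
print(pd.DataFrame(mlr.coef_, X.columns, columns=['Coefficient']))   # one coefficient per predictor
print('R2 on test data:', mlr.score(X_test, y_test))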


OUTPUT:


RESULT:

Thus the multiple regression analysis using the housing data set was executed successfully.


EX.NO:5.D) ALSO COMPARE THE RESULTS OF THE ABOVE ANALYSIS FOR THE TWO DATA SETS.
DATE:

AIM:
To explore various commands for comparing the results of the above analyses on the
two data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To compare the two different data sets using various commands.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()

#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10=sns.boxplot(df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()

plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])

axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})


axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()

plt.show()

fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
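The listing above examines only the Glucose variable of the diabetes data. As a minimal sketch for placing the two data sets side by side (assuming diabetes.csv and USA_Housing.csv are available as in the earlier exercises), their summary statistics, skewness and kurtosis can be compared directly:

# load both data sets and compare their descriptive statistics
diab = pd.read_csv('diabetes.csv')
housing = pd.read_csv('USA_Housing.csv')

print(diab.describe().T[['mean', 'std', 'min', 'max']])
print(housing.describe().T[['mean', 'std', 'min', 'max']])

# skewness and kurtosis of the numeric columns in each data set
print(diab.skew(numeric_only=True))
print(housing.skew(numeric_only=True))
print(diab.kurtosis(numeric_only=True))
print(housing.kurtosis(numeric_only=True))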


OUTPUT:

RESULT:

Thus the comparison of the above analyses for the two data sets was executed successfully.


EX.NO:6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS.
DATE:

AIM:
To apply and explore various plotting functions on UCI datasets.

ALGORITHM:

STEP 1: Install seaborn package and import the package.


STEP 2: Normal curves, density or contour plots, correlation and scatter plots, and histogram plots are visualized.
STEP 3: 3D plotting is done using the plotly package.
STEP 4: Stop the program.
PROGRAM:

A. NORMAL CURVES

#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")

OUTPUT:
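The flights example above demonstrates sns.lineplot. For a literal normal (Gaussian) curve, a minimal sketch that is not part of the original listing, using a synthetic NumPy sample, could be:

# normal curve: histogram of a Gaussian sample with its density estimate overlaid
import numpy as np
data = np.random.normal(loc=0, scale=1, size=1000)
sns.histplot(data, kde=True)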


B. DENSITY AND CONTOUR PLOTS

iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)

OUTPUT:
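sns.kdeplot(data=iris) above draws one-dimensional density curves. For an actual contour plot, a minimal two-variable sketch (an addition, using two of the iris measurements) could be:

# 2-D density contours of two iris measurements
sns.kdeplot(data=iris, x="sepal_length", y="sepal_width", fill=True)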

C. CORRELATION AND SCATTER PLOTS

#correlation visualized using heatmap function


df = sns.load_dataset("titanic")
ax = sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")

#scatter plots of categorical variable


df = sns.load_dataset("titanic")
sns.catplot(data=df, x="age", y="class")

OUTPUT:


D. HISTOGRAMS

# histogram of dataframe

df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")

OUTPUT:

E. THREE DIMENSIONAL PLOTTING

# 3D plotting using the plotly package
import plotly.express as px
df = sns.load_dataset("iris")

px.scatter_3d(df, x="PetalLengthCm", y="PetalWidthCm", z="SepalWidthCm",


size="SepalLengthCm",
color="Species", color_discrete_map = {"Joly": "blue", "Bergeron": "violet",
"Coderre":"pink"})

OUTPUT:


RESULT:

Thus the various visualization plots were successfully executed.


EX.NO:7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP


DATE:

AIM:

To visualize geographic data with Basemap using Google Colab.

ALGORITHM:

STEP 1: Install the basemap package

Install the below package:


Use Google Colab (in the Anaconda prompt the conda version may need to be changed, which can
affect the compatibility of other packages)
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap

STEP 2: Explore various projection options, for example: ortho, lcc.


STEP 3: Mark the location using longitude and latitude

PROGRAM:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

OUTPUT:


fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

OUTPUT:

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

fig = plt.figure(figsize=(8, 6), edgecolor='w')



m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

OUTPUT:

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)

OUTPUT:

RESULT:

Thus visualizing geographic data with Basemap was successfully executed.

