
Fundamentals of Data Science Lab

Dept. of Computer Science and Business System Lab Manual 2024 – 25

Department of Computer Science and Business System

Laboratory Manual

Academic Year : 2024-25

Name of the Course : FUNDAMENTALS OF DATA SCIENCE LAB

Course Code : 22CS2151

Regulation : R21

Year & Semester : IV-B.Tech & I-Semester

Module Coordinator : K.KEERTHANA

Course Coordinator : K.KEERTHANA

Faculty Coordinator : K.KEERTHANA

Note: The programs are written with the aim of imparting basic knowledge to students.
Writing a program is similar to cooking a dish: you must know what you want to cook,
what ingredients are needed, and how to use them to create a delicious dish.
We hope you enjoy programming.
Department Vision
To become a centre of excellence with focused research and innovation, and to stand as an
exemplary institute for Computer Science and Business System by enabling students to
develop enthralling industrial and management skills.

Department Mission

DM-1: Provide a rigorous theoretical and practical framework across state of the art
infrastructure with an emphasis on software development.
DM-2: Impart the skills necessary to amplify the pedagogy to grow technically and to meet
interdisciplinary needs with collaborations and innovative research abilities and societal
needs.
DM-3: To develop globally competent engineers with excellent managerial skills to become
leaders and entrepreneurs through quality pedagogy.
DM-4: To evolve as a centre of excellence in the field of interdisciplinary engineering
research and practice.
Program Outcomes:
Engineering graduates will be able to:

PO1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2. Problem Analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3. Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, and environmental
considerations.
PO4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the knowledge
of, and need for sustainable development.
PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10. Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
PO11. Project management and finance: Demonstrate knowledge and understanding of
the engineering and management principles and apply these to one’s own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
PO12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

Program Specific Outcomes (PSOs)

PSO-1: Analyze various networking concepts and be aware of how security policies,
standards and practices are used for troubleshooting.
PSO-2: Incorporate ethical values and professional standards with soft skills in computer
science and business disciplines, enabling students to emerge as entrepreneurs in
society.
PSO-3: Inculcate proactive thinking to ensure effective performance in the dynamic
socioeconomic and business ecosystem.

Course Objectives: The course should enable the students to:

1. Understand the R programming language.
2. Recollect concepts of statistics.
3. Read and write different types of datasets.
4. Gain exposure to solving data science problems.
5. Understand classification and regression models.
Course Outcomes:
1. Illustrate the use of various data structures.
2. Analyze and manipulate data using Pandas.
3. Create static, animated, and interactive visualizations using Matplotlib.
4. Understand the implementation procedures for the machine learning algorithms.
5. Apply appropriate data sets to the machine learning algorithms and identify
appropriate algorithms to solve real-world problems.

LIST OF EXPERIMENTS

1. R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script to create R objects for a calculator application and
save it in a specified location on disk

2. DESCRIPTIVE STATISTICS IN R
a. Write an R script to find basic descriptive statistics using summary()
b. Write an R script to find a subset of a dataset by using subset()

3. READING AND WRITING DIFFERENT TYPES OF DATASETS
a. Reading different types of data sets (.txt, .csv) from web and disk
and writing to a file in a specific disk location.
b. Reading Excel data sheet in R.
c. Reading XML dataset in R.
VISUALIZATIONS
a. Find the data distributions using a box and scatter plot.
b. Find the outliers using a plot.
c. Plot the histogram, bar chart and pie chart on sample data

4. CORRELATION AND COVARIANCE
a. Find the correlation matrix.
b. Plot the correlation plot on dataset and visualize giving an
overview of relationships among data on iris data.
c. Analysis of covariance: variance (ANOVA), if data have
categorical variables on iris data

5. REGRESSION MODEL
Import data from web storage. Name the dataset and then do
Logistic Regression to find out the relation between variables that are
affecting the admission of a student in an institute based on his or her
GRE score, GPA obtained and rank of the student. Also check whether the
model is fit or not. require (foreign), require (MASS).

6. MULTIPLE REGRESSION MODEL
Apply multiple regressions, if data have a continuous independent
variable. Apply on above dataset.

7. REGRESSION MODEL FOR PREDICTION
Apply regression model techniques to predict the data on above
dataset

8. CLASSIFICATION MODEL
a. Install relevant packages for classification.
b. Choose a classifier for classification problems.
c. Evaluate the performance of the classifier.

9. CLUSTERING MODEL
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

COs-POs Mapping (Program Outcomes 1 to 12, Program Specific Outcomes 1 to 3)

COs   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1
CO2
CO3
CO4
CO5

LAB INSTRUCTIONS

• Students should report to the concerned lab as per the timetable.
• Students who turn up late to the lab will in no case be permitted to do the program
scheduled for the day.
• After completion of the program, certification by the concerned staff in-charge in the
observation book is necessary.
• Students should bring a notebook of 150 pages and should enter the output and
observations into the notebook while performing the experiment.
• The record of observations, along with the detailed experimental procedure, should be
submitted in the immediate next session and certified by the staff member in-charge.
• Students should be present in the lab for the total scheduled duration.
• Students are required to thoroughly prepare the algorithm for the experiment
before coming to the laboratory.

System Requirements
Intel-based desktop PC with a 2.6 GHz or faster processor, at least 1 GB
RAM, 40 GB free disk space, and a LAN connection.
Operating system : Any flavor of Windows or UNIX
Software : RStudio IDE and R

1. R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script to create R objects for a calculator application and save it in a
specified location on disk.

# Program to make a simple calculator that can add, subtract, multiply and divide using functions
add <- function(x, y) {
  return(x + y)
}
subtract <- function(x, y) {
  return(x - y)
}
multiply <- function(x, y) {
  return(x * y)
}
divide <- function(x, y) {
  return(x / y)
}
# Take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice <- as.integer(readline(prompt = "Enter choice[1/2/3/4]: "))
# as.numeric() keeps decimal inputs such as 2.5 (as.integer() would truncate them)
num1 <- as.numeric(readline(prompt = "Enter first number: "))
num2 <- as.numeric(readline(prompt = "Enter second number: "))
# switch() with an integer picks the matching alternative by position
operator <- switch(choice, "+", "-", "*", "/")
result <- switch(choice, add(num1, num2), subtract(num1, num2),
                 multiply(num1, num2), divide(num1, num2))
print(paste(num1, operator, num2, "=", result))

Output

[1] "Select operation."


[1] "1.Add"
[1] "2.Subtract"
[1] "3.Multiply"
[1] "4.Divide"
Enter choice[1/2/3/4]: 4
Enter first number: 20
Enter second number: 4
[1] "20 / 4 = 5"
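
Parts (a) and (b) of this program are listed above without code. The following is a minimal sketch of how they might look, together with one way of saving the objects for part (c). The object names (a, b, quotient) and the file name calculator_objects.RData are illustrative assumptions, and tempdir() stands in for whatever disk location is specified in the lab.

```r
# a. Without R objects: type expressions directly on the console
print(7 + 3)
print(7 %/% 3)   # integer division
print(7 %% 3)    # remainder
print(2 ^ 5)     # exponent

# a. With R objects: store values in named objects and reuse them
a <- 20
b <- 4
quotient <- a / b
print(quotient)

# b. Mathematical functions
print(sqrt(16))
print(abs(-7.5))
print(round(3.14159, 2))
print(factorial(5))
print(log10(1000))

# c. Save the objects to disk, remove them, and load them back
#    (tempdir() is a placeholder for the required location)
calc_file <- file.path(tempdir(), "calculator_objects.RData")
save(a, b, quotient, file = calc_file)
rm(a, b, quotient)
load(calc_file)
print(quotient)
```

save() writes the named objects to one file and load() restores them under the same names, so the calculator objects can be reused in a later session.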

2. DESCRIPTIVE STATISTICS IN R
a. Write an R script to find basic descriptive statistics using summary

Example 1 illustrates how to apply the summary function to a numeric vector.

First, we have to create a numeric vector in R:

vec <- 1:10 # Create example vector


vec # Print example vector

# 1 2 3 4 5 6 7 8 9 10

Now, we can use the summary command to calculate summary statistics of our
vector:

summary(vec) # Apply summary function to vector


# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 3.25 5.50 5.50 7.75 10.00
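
The values reported by summary() can also be computed one at a time, which helps to see what each column means. The sketch below uses only base R functions on the same vector. Applying summary() to two mtcars columns is an extra illustration that is not part of the original example.

```r
vec <- 1:10

# Individual descriptive statistics
vec_mean   <- mean(vec)
vec_median <- median(vec)
vec_sd     <- sd(vec)       # sample standard deviation
vec_var    <- var(vec)      # sample variance
vec_range  <- range(vec)    # minimum and maximum
vec_quart  <- quantile(vec, c(0.25, 0.75))  # 1st and 3rd quartiles

print(vec_mean)
print(vec_median)
print(vec_sd)
print(vec_var)
print(vec_range)
print(vec_quart)

# summary() also works column by column on a data frame
print(summary(mtcars[, c("mpg", "hp")]))
```

The quartiles returned by quantile() are the same 1st Qu. and 3rd Qu. values that summary() prints.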

b. Write an R script to find a subset of a dataset by using subset()

#create data frame


df <- data.frame(team=c('A', 'A', 'B', 'B', 'C', 'C', 'C'),
points=c(77, 81, 89, 83, 99, 92, 97),
assists=c(19, 22, 29, 15, 32, 39, 14))

#view data frame
df

#keep only the rows where team is A
subset(df, team == "A")

  team points assists
1    A     77      19
2    A     81      22
3    B     89      29
4    B     83      15
5    C     99      32
6    C     92      39
7    C     97      14

  team points assists
1    A     77      19
2    A     81      22
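
subset() also accepts combined conditions and a select argument for choosing columns. The following sketch reuses the same data frame. The result names (high_scorers, team_points, team_a_brackets) are illustrative.

```r
df <- data.frame(team = c('A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 points = c(77, 81, 89, 83, 99, 92, 97),
                 assists = c(19, 22, 29, 15, 32, 39, 14))

# Rows that satisfy two conditions at once
high_scorers <- subset(df, points > 85 & assists > 20)
print(high_scorers)

# Rows for every team except A, keeping only two columns
team_points <- subset(df, team != "A", select = c(team, points))
print(team_points)

# The same row filter written with bracket indexing
team_a_brackets <- df[df$team == "A", ]
print(team_a_brackets)
```

The bracket form gives the same rows as subset(df, team == "A"). subset() is usually easier to read at the console because column names can be written without the df$ prefix.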

3. READING AND WRITING DIFFERENT TYPES OF DATASETS


a. Reading different types of data sets (.txt, .csv) from web and disk and writing to a
file in a specific disk location.
b. Reading Excel data sheet in R.
c. Reading XML dataset in R

# Read tabular data into R
read.table(file, header = FALSE, sep = "", dec = ".")
# Read "comma separated value" files (".csv")
read.csv(file, header = TRUE, sep = ",", dec = ".", ...)
# Or use read.csv2: variant used in countries that
# use a comma as decimal point and a semicolon as field separator.
read.csv2(file, header = TRUE, sep = ";", dec = ",", ...)
# Read TAB delimited files
read.delim(file, header = TRUE, sep = "\t", dec = ".", ...)
read.delim2(file, header = TRUE, sep = "\t", dec = ",", ...)
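
The signatures above can be tried end to end with a small round trip: write a data frame to disk, then read it back. This is a sketch under stated assumptions. The data frame scores and the file names are made up for illustration, and tempdir() stands in for the required disk location. For parts (b) and (c), the readxl and XML packages are assumed. They are not part of base R, so those steps run only if the packages are installed.

```r
# Illustrative data and output location
out_dir <- tempdir()
scores <- data.frame(id = 1:3,
                     name = c("a", "b", "c"),
                     marks = c(71.5, 88, 64.25))

# a. Write and read a .csv file
csv_path <- file.path(out_dir, "scores.csv")
write.csv(scores, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, header = TRUE)
print(from_csv)

# a. Write and read a tab-delimited .txt file
txt_path <- file.path(out_dir, "scores.txt")
write.table(scores, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
from_txt <- read.delim(txt_path, header = TRUE)
print(from_txt)

# b. Excel: read the sample workbook that ships with readxl, if installed
if (requireNamespace("readxl", quietly = TRUE)) {
  xlsx_path <- readxl::readxl_example("datasets.xlsx")
  print(head(readxl::read_excel(xlsx_path)))
}

# c. XML: write a tiny XML file and convert it to a data frame, if XML is installed
if (requireNamespace("XML", quietly = TRUE)) {
  xml_path <- file.path(out_dir, "scores.xml")
  writeLines(c("<records>",
               "<record><id>1</id><marks>71.5</marks></record>",
               "<record><id>2</id><marks>88</marks></record>",
               "</records>"), xml_path)
  print(XML::xmlToDataFrame(xml_path))
}
```

read.csv() and read.delim() also accept a URL in place of a file path, which covers the case of reading a dataset from the web.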

VISUALIZATIONS

Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width of the box proportionate
to the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a basic
boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.

myvar <- mtcars[,c('mpg','cyl')]


print(head(myvar))

When we execute above code, it produces following result −


                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6

Creating the Boxplot


The below script will create a boxplot graph for the relation between mpg (miles per
gallon) and cyl (number of cylinders).
# Give the chart file a name.
png(file = "boxplot.png")

# Plot the chart.


boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.


dev.off()
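
The varwidth and names parameters described in the syntax section are not used in the script above. The sketch below shows one possible use of both. The group labels are illustrative. boxplot() also returns the numbers behind the drawing, which can be inspected directly. notch = TRUE can be added in the same way, although with groups this small R may warn that the notches extend outside the boxes.

```r
# Box widths reflect group sizes; custom labels are printed under each box
box_stats <- boxplot(mpg ~ cyl, data = mtcars,
                     varwidth = TRUE,
                     names = c("4 cyl", "6 cyl", "8 cyl"),
                     xlab = "Number of Cylinders",
                     ylab = "Miles Per Gallon",
                     main = "Mileage Data")

# Number of cars in each group
print(box_stats$n)

# One column per group: lower whisker, lower hinge, median,
# upper hinge, upper whisker
print(box_stats$stats)
```

The third row of box_stats$stats holds the group medians, which are the thick lines inside the boxes.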

Scatterplots

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values
of two variables. One variable is plotted on the horizontal axis and the other on the vertical axis.
The simple scatterplot is created using the plot() function.

Syntax
The basic syntax for creating scatterplot in R is
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −


• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.

Example
We use the data set "mtcars" available in the R environment to create a basic
scatterplot. Let's use the columns "wt" and "mpg" in mtcars.

Creating the Scatterplot


The script below creates a scatterplot of the relation between
wt (weight) and mpg (miles per gallon).

# Get the input values.
input <- mtcars[, c('wt', 'mpg')]

# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(2.5, 5),
     ylim = c(15, 30),
     main = "Weight vs Mileage"
)

OUTPUT

Scatterplot Matrices
When we have more than two variables and want to see how each variable relates to
the remaining ones, we use a scatterplot matrix. We use the pairs() function to
create matrices of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
Following is the description of the parameters used −
• formula represents the series of variables used in pairs.
• data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is
plotted for each pair.

# Plot the matrices between 4 variables giving 12 plots.
# One variable with 3 others and total 4 variables.
pairs(~wt + mpg + disp + cyl, data = mtcars,
      main = "Scatterplot Matrix")


OUTPUT:

b) Find the outliers using a plot.

What are outliers in data?


Outliers, as the name suggests, are data points that lie far away from the other
points of the dataset. Because they sit apart from the remaining values, they
distort the overall distribution of the dataset.

Such values are usually treated as abnormal observations.

Effect of outliers on the model -

1. The data becomes skewed.
2. The overall statistical properties of the data, such as the mean and variance, change.
3. The accuracy of the model becomes biased.

Outlier Analysis -
First, it is important to detect the presence of outliers in the dataset.

The script below uses the built-in mtcars dataset and draws a boxplot of the hp
(horsepower) column. Points drawn beyond the whiskers are the outliers.


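# Copy the built-in mtcars dataset into a data frame
x <- data.frame(mtcars)

# Print the data
print(x)

# Boxplot of horsepower; outliers appear as separate points beyond the whiskers
boxplot(x$hp, ylab = "hp")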

1. Loading the Dataset

When the data comes from a file rather than a built-in dataset, first load it into the R
environment using the read.csv() function.

Prior to outlier detection, perform a missing value analysis to check for the presence of
any NA (missing) values, for example with sum(is.na(data)).
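
The outliers that the boxplot draws as separate points can also be listed numerically. The sketch below applies the common 1.5 × IQR rule to the hp column of mtcars. This rule is an assumption here, since the manual does not name a specific one. The result is compared with boxplot.stats(), which is what boxplot() uses internally.

```r
hp <- mtcars$hp

# Missing value check before outlier detection
missing_count <- sum(is.na(mtcars))
print(missing_count)

# 1.5 * IQR rule: values beyond these fences are flagged as outliers
q1 <- quantile(hp, 0.25)
q3 <- quantile(hp, 0.75)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- hp[hp < lower_fence | hp > upper_fence]
print(outliers)

# Which cars do the outlying values belong to?
outlier_cars <- rownames(mtcars)[hp %in% outliers]
print(outlier_cars)

# Values that boxplot() itself would draw as outlier points
print(boxplot.stats(hp)$out)
```

For this column both approaches flag the same single value, 335 hp, which belongs to the Maserati Bora.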

Histograms

Simple Histogram
CODE: Draw a histogram on the mtcars dataset

hist(mtcars$mpg)

OUTPUT:

Add a Normal Curve: CODE

x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "red", xlab = "Miles Per Gallon",
          main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Scale the density to counts: bin width times number of observations
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)

Output

Bar chart: count


The first graph shows the frequency of each cylinder count using geom_bar(). The code
below is the most basic syntax.

Install the ggplot2 package using the command below in the R console

> install.packages("ggplot2")

Write the code below in an R script file

library(ggplot2)

# Most basic bar chart


ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar()
OUTPUT:

CODE Explanation
• You pass the dataset mtcars to ggplot().
• Inside the aes() argument, you map the x-axis to cyl converted to a factor variable.
• The + sign adds the next layer to the plot. Ending a line with + tells R that the
expression continues on the next line, so the code can be broken up for readability.
• Use geom_bar() for the geometric object.

Example 2: Code to draw a bar graph


counts <- table(mtcars$gear)
barplot(counts, main = "Car Distribution", xlab = "Number of Gears")

PIE CHART Code

slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main = "Pie Chart of Countries")

Output

CODE Explanation
• The slices vector gives the weight of each country.
• The lbls vector holds the country names used as labels.
• The pie() function draws the pie chart.
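
A pie chart is easier to read when each label also shows its share. The sketch below extends the same slices and lbls vectors with percentage labels. The names pct and lbls_pct are illustrative.

```r
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")

# Convert the weights to percentages of the total
pct <- round(slices / sum(slices) * 100)

# Append the percentage to each country label
lbls_pct <- paste0(lbls, " ", pct, "%")
print(lbls_pct)

pie(slices, labels = lbls_pct, col = rainbow(length(slices)),
    main = "Pie Chart of Countries")
```

The weights add up to 50, so each percentage is simply twice the weight, for example 20% for the US slice.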

#create data frame
df <- data.frame(assists=c(4, 5, 5, 6, 7, 8, 8, 10),
                 rebounds=c(12, 14, 13, 7, 8, 8, 9, 13),
                 points=c(22, 24, 26, 26, 29, 32, 20, 14))

#view data frame
df

  assists rebounds points
1       4       12     22
2       5       14     24
3       5       13     26
4       6        7     26
5       7        8     29
6       8        8     32
7       8        9     20
8      10       13     14

4. CORRELATION AND COVARIANCE

PROBLEM DEFINITION:

a) How to find a correlation matrix and plot the correlation on iris data set

R SOURCE CODE:

d <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
cor(d)
m <- cor(d) # get correlations
library('corrplot')
corrplot(m, method = 'square')
# Correlations between the columns of two matrices (5 x 4 and 5 x 3);
# rnorm(20) supplies one random value per cell of the 5 x 4 matrix
x <- matrix(rnorm(20), nrow = 5, ncol = 4)
y <- matrix(rnorm(15), nrow = 5, ncol = 3)
COR <- cor(x, y)
COR

OUTPUT:

4.b) Plot the correlation plot on dataset and visualize giving an overview of
relationships among data on iris data.
R-Code
library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)

Create a correlation matrix of the iris dataset using the DataExplorer correlation
function. Include only continuous variables in the correlation plot, because
factor variables do not make sense in a correlation plot.

library(DataExplorer)
library(corrplot)
# Loading corrplot prints a message such as: corrplot 0.92 loaded

# Correlation plot of the continuous columns only
plot_correlation(iris, type = "continuous", title = "matrix_iris")

Output:

The correlation coefficient between Petal Length and Petal Width is 0.96, which indicates
a strong positive relationship. The correlation coefficient between Sepal Length and
Sepal Width is -0.12, which indicates a weak negative relationship. Petal Length and
Petal Width are therefore much more strongly correlated than Sepal Length and
Sepal Width.
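
The coefficients quoted above can be reproduced with base R alone, without corrplot or DataExplorer. The sketch below also computes the covariance matrix and, for part (c), runs a one-way ANOVA of Sepal.Length across the categorical Species column. The choice of Sepal.Length as the response is an assumption made for illustration.

```r
# Keep only the four continuous columns of iris
num_iris <- iris[, 1:4]

# Correlation matrix
cor_m <- cor(num_iris)
print(round(cor_m, 2))

# Covariance matrix (same pairs, but in the original units)
cov_m <- cov(num_iris)
print(round(cov_m, 2))

# Correlation is covariance rescaled by the standard deviations
print(isTRUE(all.equal(cov2cor(cov_m), cor_m)))

# c. One-way ANOVA: does mean Sepal.Length differ between species?
fit <- aov(Sepal.Length ~ Species, data = iris)
print(summary(fit))
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)
```

A p-value below 0.05 in the ANOVA table suggests that at least one species has a different mean sepal length from the others.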

5. REGRESSION MODEL

# Generate random IQ values with mean = 30 and sd = 2
IQ <- rnorm(40, 30, 2)

# Sort IQ values in ascending order
IQ <- sort(IQ)

# Generate vector with pass (1) and fail (0) values of 40 students
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
            1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
            0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
            1, 1, 1, 0, 1, 1, 1, 1, 0, 1)

# Data frame
df <- as.data.frame(cbind(IQ, result))

# Print data frame
print(df)

# Output to be saved as a PNG file
png(file = "LogisticRegressionGFG.png")

# Plot IQ on the x-axis and result on the y-axis
plot(IQ, result, xlab = "IQ Level",
     ylab = "Probability of Passing")

# Create a logistic model
g <- glm(result ~ IQ, family = binomial, data = df)

# Draw the fitted probability curve from the model's predictions
curve(predict(g, data.frame(IQ = x), type = "response"), add = TRUE)

# Draw the fitted probabilities as points
# (pch = 20 is a small filled circle; pch values 26 to 31 are unused in R)
points(IQ, fitted(g), pch = 20)

# Summary of the regression model
summary(g)

# Save the file
dev.off()

Output:
IQ result
1 25.46872 0
2 26.72004 0
3 27.16163 0
4 27.55291 1
5 27.72577 0
6 28.00731 0
7 28.18095 0

8 28.28053 0
9 28.29086 0
10 28.34474 1
11 28.35581 1
12 28.40969 0
13 28.72583 0
14 28.81105 0
15 28.87337 1
16 29.00383 1
17 29.01762 0
18 29.03629 0
19 29.18109 1
20 29.39251 0
21 29.40852 0
22 29.78844 0
23 29.80456 1
24 29.81815 0
25 29.86478 0
26 29.91535 1
27 30.04204 1
28 30.09565 0
29 30.28495 1
30 30.39359 1
31 30.78886 1
32 30.79307 1
33 30.98601 1
34 31.14602 0
35 31.48225 1
36 31.74983 1
37 31.94705 1
38 31.94772 1
39 33.63058 0

40 35.35096 1

Call:
glm(formula = result ~ IQ, family = binomial, data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1451  -0.9742  -0.4950   1.0326   1.7283

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.8093     7.3368  -2.291   0.0220 *
IQ            0.5651     0.2482   2.276   0.0228 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.352  on 39  degrees of freedom
Residual deviance: 48.157  on 38  degrees of freedom
AIC: 52.157

Number of Fisher Scoring iterations: 4
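
The summary above can be read as follows. This sketch takes the coefficients and deviances printed in that output and works out what they imply. Because the IQ values are random, a fresh run will give different numbers. With the fitted object g available, predict(g, newdata, type = "response") and anova(g, test = "Chisq") would give the corresponding results directly. The IQ values 26, 30 and 34 are arbitrary examples.

```r
# Coefficients copied from the printed summary
intercept <- -16.8093
slope <- 0.5651

# Odds ratio: factor by which the odds of passing change per extra IQ point
odds_ratio <- exp(slope)
print(odds_ratio)

# Predicted probability of passing at a few example IQ values
iq_new <- c(26, 30, 34)
prob <- plogis(intercept + slope * iq_new)
print(data.frame(IQ = iq_new, probability = round(prob, 3)))

# Model fit check: likelihood-ratio (deviance) test against the null model
null_dev <- 55.352    # on 39 degrees of freedom
resid_dev <- 48.157   # on 38 degrees of freedom
lr_stat <- null_dev - resid_dev
lr_p <- pchisq(lr_stat, df = 39 - 38, lower.tail = FALSE)
print(lr_stat)
print(lr_p)
```

A p-value below 0.05 for the deviance test indicates that adding IQ improves on the intercept-only model, which agrees with the significance star on the IQ coefficient.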

