
Practical 1 EDA

The document outlines practical exercises for Exploratory Data Analysis (EDA) and data visualization in R. It includes instructions for reading CSV files, extracting data, checking data types, summarizing datasets, subsetting data based on conditions, sorting, handling missing values, and creating various plots. Additionally, it provides a series of tasks to be performed in R, including creating a new CSV file and performing specific data manipulations.

Uploaded by

yashhmehtaa1807

Practical No.: 01
Aim: EDA & Data Visualization
1. To read data from a CSV file into R
Diabetes <- read.csv(file.choose(), header = TRUE, sep = ",")
• Diabetes: the variable we create to store the contents of the CSV file as a data frame
• read.csv: used because we are reading a CSV file
• file.choose(): opens a file browser to select the desired CSV file
• header = TRUE: treats the first row as the header (column names)
• sep: because a CSV is a comma-separated file, we use ","
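As a minimal sketch of the same call without the interactive file browser, the example below writes a small made-up CSV to a temporary file and reads it back. The file contents and the names csv_path and people are assumptions for illustration only.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Gender",
             "Asha,51,F",
             "Ravi,34,M",
             "Meena,29,F"), csv_path)

# header = TRUE: the first row holds the column names.
# sep = ",": fields are separated by commas.
people <- read.csv(csv_path, header = TRUE, sep = ",")

class(people)  # the result is a data frame
dim(people)    # number of rows, then number of columns
```

Passing a path instead of file.choose() makes the script repeatable, since no one has to pick the file by hand on each run.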
2. To extract the first few lines of the data set
head(data_set_name)
head(eda_data) # dataset name is eda_data
• This returns the first few rows of the dataset (6 by default)
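The default of 6 rows can be changed with a second argument, and tail() does the same from the bottom. A small sketch on a made-up data frame (the name scores and its values are assumptions):

```r
# Ten made-up rows to look at.
scores <- data.frame(id = 1:10, marks = seq(50, 95, by = 5))

first_six   <- head(scores)      # default: first 6 rows
first_three <- head(scores, 3)   # first 3 rows
last_two    <- tail(scores, 2)   # last 2 rows

first_three
last_two
```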

3. To check the data type of every variable in the dataset
str(data_set_name)
• str() compactly displays the internal structure of an R object.
• The terms 'category' and 'enumerated type' are also used for factors.
4. To check the summary of the entire data frame object
summary(data_set_name)
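To illustrate what str() and summary() report for numeric columns versus factors, here is a short sketch on a made-up data frame (homes and its values are assumptions, not the practical's dataset):

```r
# One numeric column and one factor column.
homes <- data.frame(
  Rooms = c(3, 4, 2, 5),
  Education = factor(c("Grad", "PG", "Grad", "HSC"))
)

str(homes)      # one line per column: type and first few values
summary(homes)  # quartiles for numeric columns, counts per level for factors

col_classes <- sapply(homes, class)  # class of each column as a named vector
col_classes
```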

5. To check the first 10 rows of the dataset
data_set_name[first_row:last_row, ]
• Rows 1 to 10, all columns: data_set_name[1:10, ]

6. To check only 2 columns of the dataset
data_set_name[, first_col:last_col]
• All rows, columns 1 and 2: data_set_name[, 1:2]
7. To display the first 10 rows and only 2 columns of the dataset
data_set_name[rows, columns]
• Rows 1 to 10, columns 1 and 2: data_set_name[1:10, 1:2]
NOTE:
1. When we want to fetch rows, we write datasetname[rows, ]
2. When we want to fetch columns, we write datasetname[, cols]
3. When we want to fetch both, we write datasetname[rows, cols]
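The three indexing forms in the note can be sketched on a small made-up data frame (df and its columns are assumptions for illustration):

```r
# Twelve rows, three columns.
df <- data.frame(a = 1:12, b = letters[1:12], c = (1:12)^2)

rows_only <- df[1:10, ]       # rows 1 to 10, all columns
cols_only <- df[, 1:2]        # all rows, columns 1 and 2
both      <- df[1:10, 1:2]    # rows 1 to 10, columns 1 and 2
picked    <- df[1:3, c(1, 3)] # non-adjacent columns via c()

dim(rows_only)
dim(cols_only)
dim(both)
picked
```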
8. To display the observations for students who have completed graduation
Syntax:
newdata1 <- subset(datasetname, datasetname$column_name == "value")
newdata1
Example:
newdata1 <- subset(EDA_data, EDA_data$Education == "Grad")
newdata1
• Here we create a new variable called newdata1 and store the subset of EDA_data in it. Because the result of subsetting is assigned to a variable, nothing is printed; type newdata1 again to view the output on the console.
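A small sketch of the same idea on made-up data (staff and its values are assumptions). Inside subset() the column can also be named directly, and logical indexing with square brackets gives the same rows:

```r
staff <- data.frame(
  Name = c("Asha", "Ravi", "Meena", "John"),
  Education = c("Grad", "PG", "Grad", "HSC")
)

# subset() evaluates the condition inside the data frame,
# so Education can be written without the staff$ prefix.
grads <- subset(staff, Education == "Grad")

# Equivalent form using logical indexing.
grads_idx <- staff[staff$Education == "Grad", ]

grads
```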

9. To subset on multiple conditions
newdata2 <- subset(EDA_data, EDA_data$Age == 51 & EDA_data$Gender == "M")
newdata2
• Here we extract the details of students whose age is 51 and whose gender is male.
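Conditions are combined with & (both must hold) or | (at least one must hold). A minimal sketch on made-up data (staff and its values are assumptions):

```r
staff <- data.frame(
  Name   = c("A", "B", "C", "D", "E"),
  Age    = c(51, 51, 34, 51, 29),
  Gender = c("M", "F", "M", "M", "F")
)

# Both conditions must be TRUE.
both <- subset(staff, Age == 51 & Gender == "M")

# At least one condition must be TRUE.
either <- subset(staff, Age == 51 | Gender == "M")

both$Name
either$Name
```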

10. To sort the data by a column in ascending order
• Sorting data
To sort a data frame in R, use the order() function. By default, sorting is ASCENDING. Prepend a minus sign to a numeric sorting variable to indicate DESCENDING order. Here are some examples.

Syntax:
newdata4 <- datasetname[order(datasetname$column_name), ]
newdata4
Example:
i.) newdata4 <- EDA_data[order(EDA_data$Name), ]
newdata4
ii.) newdata5 <- EDA_data[order(EDA_data$Education, EDA_data$Salary), ]
newdata5
• We want every column of the sorted rows, hence we write nothing after the comma.
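order() returns the row positions in sorted order, and those positions are then used as the row index. A sketch with one and two sort keys on made-up data (staff and its values are assumptions):

```r
staff <- data.frame(
  Name      = c("Ravi", "Asha", "Meena", "John"),
  Education = c("PG", "Grad", "Grad", "PG"),
  Salary    = c(700, 900, 500, 600)
)

# One key: alphabetical by Name.
by_name <- staff[order(staff$Name), ]

# Two keys: by Education first, ties broken by Salary.
by_edu_salary <- staff[order(staff$Education, staff$Salary), ]

by_name
by_edu_salary
```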
11. To sort the data by a column in descending order
newdata5 <- EDA_data[order(-EDA_data$Age), ]
newdata5
OR
newdata5 <- EDA_data[order(EDA_data$Age, decreasing = TRUE), ]
• The minus sign works only for numeric columns. For a character column such as Name, use decreasing = TRUE.
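The two descending forms can be compared on made-up data (staff and its values are assumptions). Negating a character vector raises an error, which is why the second form is needed for text columns:

```r
staff <- data.frame(
  Name = c("Ravi", "Asha", "Meena"),
  Age  = c(34, 51, 29)
)

# Numeric column: a minus sign reverses the order.
by_age_desc <- staff[order(-staff$Age), ]

# Character column: use decreasing = TRUE instead of a minus sign.
by_name_desc <- staff[order(staff$Name, decreasing = TRUE), ]

by_age_desc
by_name_desc
```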
12. To check whether any column contains missing observations
colSums(is.na(datasetname)) OR summary(datasetname)
• NA is a logical constant of length 1 which contains a missing value indicator.
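is.na() marks each cell as TRUE or FALSE, and colSums() then counts the TRUE values per column. A sketch on made-up data (df and its values are assumptions); complete.cases() is shown as one way to list the affected rows:

```r
df <- data.frame(
  Rooms  = c(3, NA, 4, 5),
  Garage = c(NA, 1, NA, 2)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(df))

# Rows that contain at least one NA.
rows_with_na <- df[!complete.cases(df), ]

na_counts
rows_with_na
```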

Histogram, boxplot, scatterplot, barplot

13. To plot a histogram of a particular column in the dataset
hist(datasetname$column_name)
14. To plot a boxplot of a particular column in the dataset
boxplot(datasetname$column_name)
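The numbers behind both plots can be inspected without drawing anything. The sketch below uses a made-up vector (ages is an assumption): hist(plot = FALSE) returns the bin counts, and boxplot.stats() returns the five values the box is drawn from plus any outliers.

```r
ages <- c(22, 25, 27, 30, 31, 35, 38, 41, 45, 90)

# Bin breaks and counts that hist() would draw.
h <- hist(ages, plot = FALSE)
h$breaks
h$counts

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
b <- boxplot.stats(ages)
b$stats

# Points beyond 1.5 times the box length from the box.
b$out
```

Here 90 lies well above the rest of the values, so it is the single point reported in b$out.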
15. To view properties of a particular column of the data
mean(datasetname$column_name)
median(datasetname$column_name)
max(datasetname$column_name)
min(datasetname$column_name)
mode (base R has no function for the statistical mode; mode() returns the storage mode, so compute it from a frequency table):
y <- table(eda_data$Baths)
names(y)[which(y == max(y))]

my_mode <- function(x) { # Create mode function
  unique_x <- unique(x)
  tabulate_x <- tabulate(match(x, unique_x))
  unique_x[tabulate_x == max(tabulate_x)]
}
my_mode(eda_data$Baths)
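A worked example of my_mode() on a small vector where two values tie, using the same vector that appears later in the practical script. The function is repeated so the sketch runs on its own:

```r
my_mode <- function(x) {
  unique_x <- unique(x)                       # distinct values, in order of appearance
  tabulate_x <- tabulate(match(x, unique_x))  # how often each distinct value occurs
  unique_x[tabulate_x == max(tabulate_x)]     # every value with the highest count
}

x <- c(0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 4)

modes <- my_mode(x)  # 1 and 2 both occur four times
modes

# Base R's mode() answers a different question: how the vector is stored.
mode(x)
```

Because the comparison uses ==, ties are all returned; using which.max() instead would keep only the first of them.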
---------------------------------------------------------------
EDA R PRACTICAL

eda_data<-read.csv(file.choose(), header=TRUE, sep=",")
eda_data
head(eda_data)
summary(eda_data)
str(eda_data)
eda_data[1:8,]
head(eda_data,3)
head(eda_data,8)
tail(eda_data,8)
eda_data[1:8, c(1,5)]
eda_data[,1:5]
newdata1<-subset(eda_data,eda_data$Education == "Grad")
newdata1
newdata2<-subset(eda_data, eda_data$Age=="51" & eda_data$Gender=="M")
newdata2
a<-eda_data[order(eda_data$Name),]
a
a<-eda_data[order(eda_data$Education),]
a
a<-eda_data[order(eda_data$Education, decreasing = TRUE),]
a
a<-colSums(is.na(eda_data))
a
hist(eda_data$Age)
boxplot(eda_data$Age)
mean(eda_data$Age)
min(eda_data$Age)
max(eda_data$Age)
median(eda_data$Age)
mode(eda_data$Garage) # returns the storage mode (e.g. "numeric"), not the most frequent value

y<-table(eda_data$Garage)
y

names(y)[which(y==max(y))]

ma<-max(y)
ma
whch<-which(y==ma)
whch
names(y)[whch]

x<-eda_data$Garage
x
y<-unique(x)
y
mat<-match(x, y)
mat
tab<-tabulate(mat)
tab
m<-max(tab)
m
y[tab==m]
x<-eda_data$Age
x
my_mode <- function(x) { # Create mode function
  unique_x <- unique(x)
  tabulate_x <- tabulate(match(x, unique_x))
  unique_x[tabulate_x == max(tabulate_x)]
}

my_mode(x)

x<-c(0,0,0,1,1,1,1,2,2,2,2,4)
x
y<-table(x)
y
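y[max(y)] # pitfall: this indexes by the highest count (4), so it returns the 4th table entry, not the mode
names(y)[y==max(y)] # the most frequent values: "1" and "2"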

hist(eda_data$Rooms)
hist(eda_data$Salary)
#b<-skewness(eda_data$Rooms)
#hist(b)
#Two-way table
#barplot
counts = table(eda_data$Education,eda_data$Gender)
counts
barplot(counts, main = "Data distribution by Education Vs Gender",col = c("blue","red"))

# plot() needs factors here; character columns are converted explicitly
plot(factor(eda_data$Education), factor(eda_data$Gender), col = c("blue","red"))

#scatterplot
plot(eda_data$Age, eda_data$Salary)

library(PerformanceAnalytics)
a<-skewness(eda_data$Rooms)
a
hist(eda_data$Rooms)

#imputing missing values


library(e1071)
b<-skewness(eda_data$Garage)
b
hist(eda_data$Garage)

#library(ggplot2)
#ggplot(eda_data, aes(x = Rooms)) + stat_density(geom = "line")

eda_data$Garage[is.na(eda_data$Garage)]<-mean(eda_data$Garage, na.rm=TRUE)
View(eda_data)

skewness(eda_data$Rooms)
a
hist(a)

hist(eda_data$Rooms)
eda_data$Rooms[is.na(eda_data$Rooms)]<-median(eda_data$Rooms, na.rm=TRUE)
b<-eda_data$Rooms # the column after imputing missing values with the median
b
hist(b)
View(eda_data)

#mode
getmode <- function(v){
v=v[nchar(as.character(v))>0]
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

#Identifying duplicate data


data<-eda_data[1:5, 3:4]
data
duplicated(data)
#removing duplicate data
a<-data[!duplicated(data),]
a
#removing an outlier
boxplot(eda_data$AppraisedValue)

plot(eda_data$AppraisedValue)
x<-eda_data$AppraisedValue
x
out <- boxplot.stats(x)$out #identifying the outlier
out ## `boxplot.stats` has picked them out 1200 value
x<-x[!(x %in% out)]
x ## this removes 1200 from x
boxplot(x)
#imputing
q<-quantile(eda_data$AppraisedValue, .95) #95th percentile
q #850
summary(eda_data$AppraisedValue)
#ifelse(2==2, "equal", "not equal") #example of ifelse
app_val<- ifelse(eda_data$AppraisedValue >= 1000,850,eda_data$AppraisedValue)
app_val
boxplot(app_val)

#conversion : character to numeric values


str<-eda_data$Gender
str
str(eda_data$Gender)
str(eda_data$Education)
num<-as.numeric(str)
num
str(num)
typeof(num)
class(num)

num<-as.factor(num)
num
class(num)

num<-as.character(num)
num
class(num)
typeof(num)

#numeric to logical values


v<-c(0, 0, 1, 1)
v
logi<-as.logical(v)
logi

#logical to numeric
int<-as.integer(logi)
int
typeof(int)

fact<-as.factor(int)
fact
str(eda_data$Name)
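The conversions at the end of the script have a few behaviours worth seeing on known values. This is a minimal sketch on made-up vectors (gender and digits are assumptions): non-numeric text becomes NA, and converting a factor directly gives its level codes rather than its labels.

```r
gender <- c("M", "F", "M")

# Text that is not a number cannot be converted: every element becomes NA (with a warning).
as_num <- suppressWarnings(as.numeric(gender))
as_num

# Going through a factor gives integer codes; levels are sorted, so F = 1 and M = 2.
codes <- as.integer(factor(gender))
codes

# A factor whose labels look like numbers: convert via character to get the labels back.
digits <- factor(c("10", "20", "10"))
level_codes <- as.numeric(digits)                # 1 2 1 (level codes)
label_values <- as.numeric(as.character(digits)) # 10 20 10 (original values)
level_codes
label_values
```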
Perform the following operations in R

1. Create a Student.csv file with the fields (rollno, name, gender, class, Tmarks) (note: total marks are out of 1000) and read the file in R
2. Extract the first few lines from the dataset
3. Check the data type of the dataset's fields
4. Get the summary of the data set
5. Check the dimensions of the dataset and list the column names
6. List the rows where total marks are more than 750
7. List only the first 2 columns where total marks are more than 750 and class is SYCS
8. Sort the data in ascending order of total marks
9. List the records where total marks are not entered
10. Plot a scatter plot which shows the relation between average marks and class
11. Draw the box plot for total marks
