Practical 1 EDA
Aim: EDA & Data Visualization
1. To read data from a CSV file into R
Diabetes <- read.csv(file.choose(), header = TRUE, sep = ",")
Diabetes: the variable created to store the contents of the CSV file as a data frame
read.csv(): used because we are reading a CSV file
file.choose(): opens a file browser to select the desired CSV file
header = TRUE: treats the first row as the column headers
sep = ",": the field separator; a CSV file is comma-separated, so we use ","
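Because file.choose() needs an interactive session, the same call can be tried with an explicit path instead. The sketch below is only illustrative: the temporary file, the sample rows, and the name demo_data are assumptions and are not part of the practical's data set.

```r
# Write a tiny illustrative CSV to a temporary file (sample rows are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Gender",
             "Asha,51,F",
             "Ravi,34,M"), csv_path)

# Same arguments as in the practical, with an explicit path instead of file.choose().
demo_data <- read.csv(csv_path, header = TRUE, sep = ",")

class(demo_data)   # the result is a data frame
names(demo_data)   # column names are taken from the header row
nrow(demo_data)    # the header row is not counted as data
```

With header = FALSE the first line would be read as an ordinary data row and the columns would get default names such as V1, V2, V3.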
2. To extract the first few lines of the data set
head(data_set_name)
head(eda_data) #dataset name is eda_data
This returns the first few rows of the data set (6 by default).
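The row count can also be given explicitly, and tail() works the same way from the other end. A small sketch on a made-up data frame (demo_data and its values are assumptions for illustration only):

```r
# Illustrative data frame with 10 rows (values are made up).
demo_data <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

head(demo_data)       # first 6 rows by default
head(demo_data, 3)    # first 3 rows
tail(demo_data, 2)    # last 2 rows
demo_data[1:3, ]      # the same rows as head(demo_data, 3), selected by index
```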
To sort the data set by one or more columns:
Syntax:
newdata4 <- datasetname[order(datasetname$column_name), ]
newdata4
Example:
i.) newdata4 <- EDA_data[order(EDA_data$Name), ]
newdata4
ii.) newdata5 <- EDA_data[order(EDA_data$Education, EDA_data$Salary), ]
newdata5
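The sorting pattern can be tried on a small data frame before applying it to the real data. This is a sketch under stated assumptions: demo_data and its rows are made up, and only the column names Name, Education and Salary are taken from the examples here.

```r
# Illustrative rows (made up); column names follow the examples in this practical.
demo_data <- data.frame(
  Name      = c("Meera", "Arun", "Zoya", "Kiran"),
  Education = c("Grad", "PG", "Grad", "PG"),
  Salary    = c(52000, 61000, 48000, 58000)
)

# Sort by one column (ascending).
by_name <- demo_data[order(demo_data$Name), ]

# Sort by Education first, then by Salary within each Education level.
by_edu_salary <- demo_data[order(demo_data$Education, demo_data$Salary), ]

# Descending order: decreasing = TRUE applies to every sort key ...
by_salary_desc <- demo_data[order(demo_data$Salary, decreasing = TRUE), ]

# ... while a minus sign reverses a single numeric key only.
by_edu_salary_desc <- demo_data[order(demo_data$Education, -demo_data$Salary), ]
```

order() returns row positions rather than sorted values, which is why it is used inside the row index of the data frame.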
To count the missing values in each column:
colSums(is.na(datasetname)) OR summary(datasetname)
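To see what these two calls report, here is a minimal sketch on a made-up data frame with a few missing values (demo_data and its contents are assumptions for illustration):

```r
# Illustrative data frame with missing values (made up).
demo_data <- data.frame(
  Age    = c(25, NA, 40, 31),
  Garage = c(1, 2, NA, NA)
)

is.na(demo_data)                          # logical matrix: TRUE where a value is missing
na_counts <- colSums(is.na(demo_data))    # number of missing values in each column
na_counts
summary(demo_data)                        # numeric columns also show an NA's count
demo_data[!complete.cases(demo_data), ]   # rows with at least one missing value
```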
eda_data<-read.csv(file.choose(), header=TRUE, sep=",")
eda_data
head(eda_data)
summary(eda_data)
str(eda_data)
eda_data[1:8,]
head(eda_data,3)
head(eda_data,8)
tail(eda_data,8)
eda_data[1:8, c(1,5)]
eda_data[,1:5]
newdata1<-subset(eda_data, eda_data$Education == "Grad")
newdata1
newdata2<-subset(eda_data, eda_data$Age == 51 & eda_data$Gender == "M")
newdata2
a<-eda_data[order(eda_data$Name),]
a
a<-eda_data[order(eda_data$Education),]
a
a<-eda_data[order(eda_data$Education, decreasing = TRUE),]
a
a<-colSums(is.na(eda_data))
a
hist(eda_data$Age)
boxplot(eda_data$Age)
mean(eda_data$Age)
min(eda_data$Age)
max(eda_data$Age)
median(eda_data$Age)
mode(eda_data$Garage) #note: mode() returns the storage mode (e.g. "numeric"), not the most frequent value
y<-table(eda_data$Garage)
y
names(y)[which(y==max(y))]
ma<-max(y)
ma
whch<-which(y==ma)
whch
names(y)[whch]
x<-eda_data$Garage
x
y<-unique(x)
y
mat<-match(x, y)
mat
tab<-tabulate(mat)
tab
m<-max(tab)
m
y[tab==m]
x<-eda_data$Age
x
my_mode <- function(x) { # Create mode function
unique_x <- unique(x)
tabulate_x <- tabulate(match(x, unique_x))
unique_x[tabulate_x == max(tabulate_x)]
}
my_mode(x)
x<-c(0,0,0,1,1,1,1,2,2,2,2,4)
x
y<-table(x)
y
y[y == max(y)] #entries with the highest count; y[max(y)] would index by position instead
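Base R's mode() reports how a value is stored, not the most frequent value, which is why the script builds the statistical mode by hand. A minimal sketch on the same vector, assuming that all tied values should be returned; the helper name stat_mode is made up for illustration and it only handles numeric input.

```r
x <- c(0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 4)

mode(x)    # "numeric": the storage mode, not a statistic

# Hypothetical helper: return every value that shares the highest count.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(x)   # 1 and 2 both occur four times
```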
hist(eda_data$Rooms)
hist(eda_data$Salary)
#b<-skewness(eda_data$Rooms)
#hist(b)
#Two-way table
#barplot
counts = table(eda_data$Education,eda_data$Gender)
counts
barplot(counts, main = "Data distribution by Education Vs Gender", col = c("blue","red"))
plot(as.factor(eda_data$Education), as.factor(eda_data$Gender), col = c("blue","red")) #plot() needs factors, not character columns
#scatterplot
plot(eda_data$Age, eda_data$Salary)
library(PerformanceAnalytics)
a<-skewness(eda_data$Rooms)
a
hist(eda_data$Rooms)
#library(ggplot2)
#ggplot(eda_data, aes(x = Rooms)) + stat_density(geom = "line")
eda_data$Garage[is.na(eda_data$Garage)]<-mean(eda_data$Garage, na.rm=TRUE)
View(eda_data)
skewness(eda_data$Rooms)
a
hist(a)
hist(eda_data$Rooms)
eda_data$Rooms[is.na(eda_data$Rooms)]<-median(eda_data$Rooms, na.rm=TRUE) #impute missing Rooms with the median
b<-eda_data$Rooms
b
hist(b)
View(eda_data)
#mode
getmode <- function(v){
v=v[nchar(as.character(v))>0]
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
plot(eda_data$AppraisedValue)
x<-eda_data$AppraisedValue
x
out <- boxplot.stats(x)$out #identifying the outlier
out #boxplot.stats() has picked out the value 1200
x<-x[!(x %in% out)]
x #1200 has been removed from x
boxplot(x)
#imputing
q<-quantile(eda_data$AppraisedValue, .95) #95th percentile
q #850
summary(eda_data$AppraisedValue)
#ifelse(2==2, "equal", "not equal") #example of ifelse
app_val<-ifelse(eda_data$AppraisedValue >= 1000, 850, eda_data$AppraisedValue) #replace values of 1000 or more with 850
app_val
boxplot(app_val)
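The two outlier treatments used here, dropping the flagged points and capping them at the 95th percentile, can be compared on a small vector. This is a sketch with made-up numbers; values, trimmed, cap and capped are illustrative names, and the cap is computed rather than typed in by hand.

```r
# Illustrative values with one extreme observation (made up for this sketch).
values <- c(420, 450, 470, 480, 500, 510, 530, 560, 600, 1200)

# boxplot.stats() flags points lying more than 1.5 times the hinge spread beyond the hinges.
outliers <- boxplot.stats(values)$out

# Option 1: drop the flagged points.
trimmed <- values[!(values %in% outliers)]

# Option 2: keep every row but replace flagged points with the 95th percentile.
cap <- unname(quantile(values, 0.95))
capped <- ifelse(values %in% outliers, cap, values)

boxplot(capped)   # the extreme point no longer stretches the plot
```

Capping keeps the length of the vector unchanged, which matters when the values are a column of a data frame.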
#numeric to factor to character
num<-c(10,20,30) #example numeric vector
num<-as.factor(num)
num
class(num)
num<-as.character(num)
num
class(num)
typeof(num)
#logical to numeric
logi<-c(TRUE,FALSE,TRUE) #example logical vector
int<-as.integer(logi)
int
typeof(int)
fact<-as.factor(int)
fact
str(eda_data$Name)
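One point worth checking when converting types as in the script: converting a factor straight to a number returns the internal level codes, not the original values. A minimal sketch, with made-up vectors num, fac and logi used purely for illustration:

```r
num <- c(10, 20, 30)            # illustrative numeric vector
fac <- as.factor(num)           # factor with levels "10", "20", "30"

as.integer(fac)                 # 1 2 3: the internal level codes, not the values
as.numeric(as.character(fac))   # 10 20 30: convert via character to recover the values

logi <- c(TRUE, FALSE, TRUE)    # illustrative logical vector
as.integer(logi)                # TRUE becomes 1, FALSE becomes 0
```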
Perform the following operations in R:
1. Create a Student.csv file with the fields (rollno, name, gender, class, Tmarks) (note: total marks are out of 1000). Read the file into R.
2. Extract the first few lines from the dataset
3. Check the data type of dataset's fields
4. Get the summary of data set
5. Check the dimensions of the dataset and list the column names
6. List the row sets where total marks are more than 750
7. List only the first 2 columns where total marks are more than 750 and class is SYCS
8. Sort the data in ascending order of total marks
9. List the records where total marks are not entered.
10. Plot a scatter plot which shows the relation between average marks and class.
11. Draw the box plot for total marks
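As a starting point for the exercise, the sketch below builds a small Student.csv in a temporary folder and shows the filtering pattern for a few of the tasks. The sample records, the file location and the variable names are all hypothetical; replace them with your own data.

```r
# Hypothetical sample records; replace with your own Student.csv contents.
student <- data.frame(
  rollno = 1:5,
  name   = c("Amit", "Bina", "Chirag", "Divya", "Esha"),
  gender = c("M", "F", "M", "F", "F"),
  class  = c("SYCS", "FYCS", "SYCS", "TYCS", "SYCS"),
  Tmarks = c(810, 640, 720, NA, 905)
)
csv_path <- file.path(tempdir(), "Student.csv")
write.csv(student, csv_path, row.names = FALSE)

# Read the file back, as in step 1 of this practical.
student <- read.csv(csv_path, header = TRUE, sep = ",")
dim(student)       # number of rows and columns
names(student)     # column names

# which() drops rows where the comparison is NA (marks not entered).
high <- student[which(student$Tmarks > 750), ]
high_sycs <- student[which(student$Tmarks > 750 & student$class == "SYCS"), 1:2]
not_entered <- student[is.na(student$Tmarks), ]
```

Without which(), a row whose Tmarks is NA would come back as a row of NA values, because the comparison NA > 750 is itself NA.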