Da (22C01156)
Case Studies in R
Presidency College, Bangalore-24
Report
Department of Computer
Applications
Presidency
Group
Case Study Title: A Data Analytics Approach to Understanding
Student Performance in Studies
Course: BCA
Subject: DATA ANALYTICS
Class & Section: V BCA ‘C’
Certificate
Table of Contents
1 Abstract
2 Introduction
3 Data Collection
4 Data Exploration
5 Data Preprocessing
6 Data Analysis
7 Conclusion
Abstract
Introduction
Problem statement:
Consider the dataset indicating the number of hours of study put in by the students
(NoOfHours) and their score (Score).
This report presents a data analytics case study developed using R programming as part of the
curriculum, focusing on the relationship between the number of study hours (Hours) and
scores (Score) of students. The dataset, originally formatted in Excel, is converted to CSV for
analysis in R. Key analytical techniques employed include descriptive statistics, data
exploration, and data preprocessing, which facilitate a deeper understanding of the data's
structure and patterns. A simple predictive model is built to assess the impact of study hours
on academic performance, supported by various visualizations and charts. This work aims to
illustrate the practical application of data analytics methodologies in R and provide insights
into factors influencing student success.
Techniques used:
A basic data science project consists of the following six steps:
5. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
6. Use the best model to make your final predictions.
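As a rough sketch of how these steps map onto the R functions used later in this report (assuming the data file studentscore.csv and the object names introduced in the following sections):

# Steps 2-3: collect and explore the data
score <- read.csv("studentscore.csv")
summary(score); str(score); head(score)

# Step 4: preprocess (handle missing values)
score1 <- na.omit(score)

# Step 5: build a candidate model and check its fit
model <- lm(SCORE ~ HOURS, data = score1)
summary(model)

# Step 6: use the model for the final predictions
round(predict(model), 0)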
Model details:
Data Collection
DATASET
HOURS   SCORE
2.5     21
5.1     47
3.2     27
8.5     75
3.5     30
1.5     20
9.2     88
5.5
8.3     80
2       25
7.7     85
5.9     62
4.5     41
3.3     42
1.1     17
8.9     95
        56
1.9     24
6.1     67
        69
2.7     30
4.8     54
3.8
6.9     76
7.8     86
6       45
8       89
        57
10      90
14      98
1.5     24
12
4.6     79
7       90
3.6     66
6.9     78
1       23
2.2     79
6.8     88
3       50
Data Exploration
Reading CSV files: A CSV file uses the .csv extension and stores data in a tabular format as plain text. The following function reads data from a CSV file: read.csv("filename"), where filename is the name of the CSV file to be imported.
setwd("C:/Users/shivp/OneDrive/Desktop/SHEEETAL STUDY-PC")
score<-read.csv("studentscore.csv")
View(score)
Exploring a dataset means viewing its contents from different perspectives. Datasets are the central object of analytical data processing, and exploration examines different forms or parts of them. With the help of R commands, analysts can easily explore a dataset in several ways.
summary(score)
     HOURS           SCORE
 Min.   : 1.00   Min.   :17.00
 1st Qu.: 3.00   1st Qu.:38.25
 Median : 5.00   Median :59.00
 Mean   : 5.45   Mean   :58.75
 3rd Qu.: 7.25   3rd Qu.:79.25
 Max.   :14.00   Max.   :98.00
str(score)
> head(score)
HOURS SCORE
1 2 21
2 5 47
3 3 27
4 8 75
5 4 30
6 2 20
> tail(score)
HOURS SCORE
35 4 66
36 7 78
37 1 23
38 2 79
39 7 88
40 3 50
> dim(score)
[1] 40 2
> is.na(score)
HOURS SCORE
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE TRUE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] FALSE FALSE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
[16,] FALSE FALSE
[17,] TRUE FALSE
[18,] FALSE FALSE
[19,] FALSE FALSE
[20,] TRUE FALSE
[21,] FALSE FALSE
[22,] FALSE FALSE
[23,] FALSE TRUE
[24,] FALSE FALSE
[25,] FALSE FALSE
[26,] FALSE FALSE
[27,] FALSE FALSE
[28,] TRUE FALSE
[29,] FALSE FALSE
[30,] FALSE FALSE
[31,] FALSE FALSE
[32,] FALSE TRUE
[33,] FALSE FALSE
[34,] FALSE FALSE
[35,] FALSE FALSE
[36,] FALSE FALSE
[37,] FALSE FALSE
[38,] FALSE FALSE
[39,] FALSE FALSE
[40,] FALSE FALSE
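A more compact way to see the same information is to count the missing values per column; a minimal sketch (assuming the same score data frame, with the counts below read off the is.na() matrix above):

# Count the missing values in each column instead of scanning the full logical matrix
colSums(is.na(score))
# HOURS SCORE
#     3     3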
Data Preprocessing
Method 1: Editing using edit()
edit(score)
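edit() opens an interactive, spreadsheet-style editor for correcting values by hand. A non-interactive alternative is sketched below, assuming the intent is to fill the missing scores with the median score (which matches the value 59 that appears for those rows later in the report):

# Replace missing SCORE values with the median score instead of editing by hand
score$SCORE[is.na(score$SCORE)] <- median(score$SCORE, na.rm = TRUE)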
Method 2: Removing missing rows using na.omit()
score1<-na.omit(score)
> print(score1)
HOURS SCORE
1 2 21
2 5 47
3 3 27
4 8 75
5 4 30
6 2 20
7 9 88
8 6 59
9 8 80
10 2 25
11 8 85
12 6 62
13 4 41
14 3 42
15 1 17
16 9 95
18 2 24
19 6 67
21 3 30
22 5 54
23 4 59
24 7 76
25 8 86
26 6 45
27 8 89
29 10 90
30 14 98
31 2 24
32 12 59
33 5 79
34 7 90
35 4 66
36 7 78
37 1 23
38 2 79
39 7 88
40 3 50
We can round off the values:
> print(score1$HOURS <- as.numeric(format(round(score1$HOURS, 0))))
 [1]  2  5  3  8  4  2  9  8  2  8  6  4  3  1  9  2  6  3  5  7  8  6  8 10 14  2  5  7  4  7  1  2  7  3
 [1] 21 47 27 75 30 20 88 59 80 25 85 62 41 42 17 95 56 24 67 69 30 54 59 76 86 45 89 57 90 98 24 59 79 90 66 78 23 79 88 50
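The same rounding can be applied to every column in one step; a minimal sketch, assuming all columns of score1 are numeric:

# Round every column of the cleaned data frame to whole numbers
score1[] <- lapply(score1, function(x) round(as.numeric(x), 0))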
Preprocessed data
Data Analysis
plot(score$HOURS,score$SCORE)
When we use the mean to predict the score, in some instances we observe a significant difference between the actual (observed) value and the predicted value. So we try correlation instead.
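To see why the mean is a weak predictor here, the sketch below (assuming the cleaned data frame score1) overlays the mean score on the scatter plot and summarises its errors:

# Use the overall mean score as a naive predictor and inspect its errors
mean_pred <- mean(score1$SCORE)           # one constant prediction for every student
errors    <- score1$SCORE - mean_pred     # residuals of the "mean model"
summary(errors)                           # wide spread => the mean predicts poorly
plot(score1$HOURS, score1$SCORE, xlab = "Hours studied", ylab = "Score")
abline(h = mean_pred, lty = 2)            # dashed horizontal line at the mean score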
Correlation Coefficient
The degree and direction of a linear association can be determined using correlation. The Pearson correlation coefficient of the association between the number of hours studied and the score is computed as follows:
> cor(score$HOURS,score$SCORE)
[1] 0.7920693
The correlation value here suggests that there is a strong association between the number of hours
studied and the freshmen score.
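The strength of this association can also be tested formally; a short sketch using cor.test(), which reports a p-value and a confidence interval for the correlation (incomplete pairs are dropped automatically):

# Test whether the correlation between hours and score differs from zero
cor.test(score$HOURS, score$SCORE)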
A correlation value close to 0 indicates that the variables are not linearly associated. However, these
variables may still be related. Thus, it is advised to plot the data.
Since correlation analysis cannot establish causation or describe the form of the relationship, we use regression techniques to quantify the nature of the relationship between the variables.
When a regression model is of a linear form, such a regression is called a linear regression. Similarly,
when a regression model is of non-linear form, then such a regression is called a non-linear
regression.
A linear equation is an equation of degree at most one, while a nonlinear equation has degree two or higher. A linear equation forms a straight line on a graph, whereas a nonlinear equation forms a curve.
1. Simple linear form: there is one predictor and one dependent variable: f(X) = b0 + b1x1 + e
2. Multiple linear form: there are multiple predictor variables and one dependent variable: f(X) = b0 + b1x1 + b2x2 + ... + bnxn + e
Since the scatter plot between the number of hours of study put in by students and the freshmen
scores suggested a linear association, let us build a linear regression model to quantify the nature
of this relationship.
> model <- lm(score$SCORE ~ score$HOURS)
> print(model)
Call:
lm(formula = score$SCORE ~ score$HOURS)
Coefficients:
(Intercept) score$HOURS
22.980 6.575
> summary(model)
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The first item shown in the output is the formula, lm(formula = score$SCORE ~ score$HOURS), that R uses to fit the data. lm() is the linear model function in R used to create a simple regression model. score$HOURS is the predictor variable and score$SCORE is the target/response variable. The next item in the model output describes the residuals. What are "residuals"? The difference between the actual observed response values and the response values that the model predicted is called the residuals. The residuals section of the model output breaks them down into five summary points: Minimum, 1Q (first quartile), Median, 3Q (third quartile), and Maximum. When assessing how well the model fits the data, one should look for a symmetrical distribution of these points around a mean value of zero.
Coefficient: Estimate
The Estimate column of the Coefficients table contains two rows. The first is the intercept, which is the mean of the response Y when all predictors X equal 0. Note that the intercept is only meaningful if the predictors in the model can actually take the value zero. The second row in the Coefficients table is the slope, or in our example, the effect HOURS has on SCORE. The slope term in our model indicates that for every one-hour increase in HOURS, the predicted SCORE goes up by about 6.575 points.
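These estimates can also be extracted directly from the fitted object; a small sketch:

# Intercept and slope reported in the Coefficients table
coef(model)       # (Intercept) 22.980, slope 6.575
# 95% confidence intervals for both coefficients
confint(model)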
> res <- predict(model)
> print(res)
        1         2         3         4         5         6         7         8         9        10        11        12
 39.41696  56.51106  44.01922  78.86489  45.99161  32.84230  83.46715  59.14093  77.54996  36.12963  73.60517  61.77079
       13        14        15        16        17        18        19        20        21        22        23        24
 52.56627  44.67668  30.21244  81.49476  58.75000  35.47216  63.08572  58.75000  40.73189  54.53867  47.96401  68.34544
       25        26        27        28        29        30        31        32        33        34        35        36
 74.26263  62.42825  75.57757  58.75000  88.72688 115.02550  32.84230 101.87619  53.22374  69.00291  46.64908  68.34544
       37        38        39        40
 29.55497  37.44456  67.68798  42.70429
Rounding of results:
> res <- round(res, 0)
> print(res)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 39  57  44  79  46  33  83  59  78  36  74  62  53  45  30  81  59  35  63  59
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 41  55  48  68  74  62  76  59  89 115  33 102  53  69  47  68  30  37  68  43
   HOURS SCORE PREDICTED RESIDUAL
18 1.900000 24 35 -11.4721637
19 6.100000 67 63 3.9142809
20 5.440541 69 59 10.2500000
21 2.700000 30 41 -10.7318885
22 4.800000 54 55 -0.5386663
23 3.800000 59 48 11.0359898
24 6.900000 76 68 7.6545560
25 7.800000 86 74 11.7373656
26 6.000000 45 62 -17.4282535
27 8.000000 89 76 13.4224344
28 5.440541 57 59 -1.7500000
29 10.000000 90 89 1.2731223
30 14.000000 98 115 -17.0255019
31 1.500000 24 33 -8.8423013
32 12.000000 59 102 -42.8761898
33 4.600000 79 53 25.7762650
34 7.000000 90 69 20.9970904
35 3.600000 66 47 19.3509210
36 6.900000 78 68 9.6545560
37 1.000000 23 30 -6.5549733
38 2.200000 79 37 41.5554395
39 6.800000 88 68 20.3120216
40 3.000000 50 43 7.2957146
write.csv(data,"C:/Users/welcome/Desktop/PREDICTED.csv")
print ('CSV file written Successfully :)')
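The data object written above is not constructed in the code shown; a minimal sketch of how such a data frame might be assembled, assuming score holds the preprocessed values (no remaining missing entries) and res the rounded predictions:

# Hypothetical construction of the 'data' object combining actual and predicted scores
data <- data.frame(HOURS     = score$HOURS,
                   SCORE     = score$SCORE,
                   PREDICTED = res,
                   RESIDUAL  = score$SCORE - predict(model))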
   HOURS SCORE
23 3.800000 59
24 6.900000 76
25 7.800000 86
26 6.000000 45
27 8.000000 89
28 5.440541 57
29 10.000000 90
30 14.000000 98
31 1.500000 24
32 12.000000 59
33 4.600000 79
34 7.000000 90
35 3.600000 66
36 6.900000 78
37 1.000000 23
38 2.200000 79
39 6.800000 88
40 3.000000 50
Call:
lm(formula = new$SCORE ~ new$HOURS)
Coefficients:
(Intercept) new$HOURS
22.980 6.575
> summary(model)
Call:
lm(formula = new$SCORE ~ new$HOURS)
Residuals:
Min 1Q Median 3Q Max
-42.88 -11.21 -0.34 10.45 41.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.980 5.099 4.506 6.12e-05 ***
new$HOURS 6.575 0.822 7.999 1.14e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship between the predictor and response variables. The F statistic is computed as F = (explained variation / (k - 1)) / (unexplained variation / (n - k)), where k is the number of variables in the dataset and n is the number of observations.
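As an illustration, a small sketch (assuming the fitted object model is available) that computes the F statistic by hand and can be compared with the value reported by summary(model):

# Manual F statistic for a simple linear regression (k = 2 parameters: intercept + slope)
y     <- model$model[[1]]                  # response values used to fit the model
y_hat <- fitted(model)                     # fitted (predicted) values
n     <- length(y)                         # number of observations
k     <- 2                                 # intercept + one predictor

ss_explained   <- sum((y_hat - mean(y))^2) # explained (regression) variation
ss_unexplained <- sum((y - y_hat)^2)       # unexplained (residual) variation

F_stat <- (ss_explained / (k - 1)) / (ss_unexplained / (n - k))
print(F_stat)                              # compare with summary(model)$fstatistic[1]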
write.csv(data,"C:/Users/shivp/OneDrive/Desktop/SHEEETAL STUDY-PC")
> print(data)
Conclusion
The model is validated by checking the assumptions of linear regression. This case study demonstrates the critical relationship between study habits, quantified by the number of
study demonstrates the critical relationship between study habits, quantified by the number of
hours spent studying, and academic performance, as reflected in freshmen scores. Through the
application of various data analytics techniques using R, we were able to explore and analyze the
dataset effectively.
The findings indicate a positive correlation between the hours of study and freshmen scores,
suggesting that increased study time is associated with better academic outcomes.
Descriptive statistics and visualizations highlighted the distribution of study hours and
scores, while data pre-processing ensured the integrity and suitability of the dataset for
analysis.
The predictive model developed during this study provides a foundational understanding of
how study habits can influence academic performance. This insight not only underscores the
importance of effective study practices among students but also offers valuable guidance for
educators and policymakers aiming to enhance student success.
Overall, this analysis serves as a practical example of how data analytics can inform
educational strategies and encourage students to adopt more effective study routines for
improved academic achievement. Future research could expand on this analysis by
incorporating additional variables, such as study methods and student engagement, to
provide a more comprehensive understanding of factors influencing academic success.