
Data Analytics

Case Studies in R
Presidency College, Bangalore-24
Report
Department of Computer Applications

Faculty In charge: Dr. Sheetal Nitin

Title: A Data Analytics Approach to Understanding Student Performance in Studies.

CLASS & SECTION: _V BCA C_


Sl. No   Register Number   Student Name
1        22C01156          SHEETAL KEMPANNAVAR
2        22C01146          SAHANA CR
3        22C01182          VAGEESHA YADAV
Reaccredited by NAAC with A+
Presidency Group
Case Study Title: A Data Analytics Approach to Understanding
Student Performance in Studies
Course: BCA
Subject: DATA ANALYTICS
Class & Section: V BCA ‘C’

Certificate

This is to certify that the team has satisfactorily completed the course of
seminar/project/case studies prescribed by Presidency College (Autonomous) for
the semester V BCA 'C' degree course in the year 2024-2024.
Sl. No   Register Number   Student Name            Marks (to be filled by Faculty)
1        22C01156          SHEETAL KEMPANNAVAR
2        22C01146          SAHANA CR
3        22C01182          VAGEESHA YADAV
Signature of the Staff Member

Table of Contents

Sl. No   Contents
1        Abstract
2        Introduction
3        Data Collection
4        Data Exploration
5        Data Reformatting and Cleaning
6        Data Preprocessing
7        Data Analysis
8        Conclusion
Abstract

Title: A Data Analytics Approach to Understanding Student Performance in Studies.

This report presents a data analytics case study developed


using R programming as part of the curriculum. The study
utilizes a simple dataset originally formatted in Excel, which is
converted to CSV for analysis in R. It encompasses
fundamental techniques including descriptive statistics, data
exploration, and data preprocessing, culminating in the
construction of a basic predictive model. Various
visualizations and charts are employed to enhance
understanding of the data and address the specified problem
statement. This work aims to provide practical insights into
data analytics methodologies and their application using R.

Introduction
Problem statement:
Consider the dataset indicating the number of hours of study put in by the students
(NoOfHours) and their score (Score).

This report presents a data analytics case study developed using R programming as part of the
curriculum, focusing on the relationship between the number of study hours (Hours) and
scores (Score) of students. The dataset, originally formatted in Excel, is converted to CSV for
analysis in R. Key analytical techniques employed include descriptive statistics, data
exploration, and data preprocessing, which facilitate a deeper understanding of the data's
structure and patterns. A simple predictive model is built to assess the impact of study hours
on academic performance, supported by various visualizations and charts. This work aims to
illustrate the practical application of data analytics methodologies in R and provide insights
into factors influencing student success.

Techniques used:
A basic data science project consists of the following six steps:

1. State the problem you are trying to solve.


It has to be an unambiguous question that can be answered with data and a statistical
or machine learning model. At a minimum, specify: what is being observed, and what has
to be predicted?
2. Collect the data, then clean and prepare it.
This is commonly the most time-consuming task, but it has to be done in order to fit a
prediction model to the data.
3. Explore the data.
Get to know its properties and quirks. Check numerical summaries of your metric
variables, tables of the categorical data, and plot univariate and multivariate
representations of your variables. This also gives you an overview of the quality of
the data and helps you spot outliers.
4. Check if any variables may need to be transformed.
Most commonly, this is a logarithmic transformation of skewed measurements such as
concentrations or times. Also, some variables might have to be split up into two or
more variables.
5. Choose a model and train it on the data.

Page 4
If you have more than one candidate model, fit each one and evaluate its goodness-of-fit
on independent data that was not used for training.
6. Use the best model to make your final predictions.
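
As a quick preview of how these six steps map onto R, here is a minimal sketch of the workflow followed in the rest of this report (it assumes the studentscore.csv file with HOURS and SCORE columns introduced in the Data Collection section):

# 1. Problem: predict SCORE from HOURS of study
score <- read.csv("studentscore.csv")             # 2. collect the data
summary(score); str(score)                        # 3. explore the data
score <- na.omit(score)                           # 4. drop (or impute) missing values
model <- lm(SCORE ~ HOURS, data = score)          # 5. train a simple linear model
predict(model, newdata = data.frame(HOURS = 6))   # 6. predict the score for 6 hours of study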

Model details: a simple linear regression of student score (SCORE) on hours of study (HOURS), built and evaluated in the Data Analysis section below.

Data Collection
DATASET

HOURS   SCORE
2.5     21
5.1     47
3.2     27
8.5     75
3.5     30
1.5     20
9.2     88
5.5     NA
8.3     80
2       25
7.7     85
5.9     62
4.5     41
3.3     42
1.1     17
8.9     95
NA      56
1.9     24
6.1     67
NA      69
2.7     30
4.8     54
3.8     NA
6.9     76
7.8     86
6       45
8       89
NA      57
10      90
14      98
1.5     24
12      NA
4.6     79
7       90
3.6     66
6.9     78
1       23
2.2     79
6.8     88
3       50

(Missing entries are shown as NA; they are handled in the Data Preprocessing section.)

Data Exploration
Reading CSV Files: A CSV file uses the .csv extension and stores tabular data as plain text.
The following function reads data from a CSV file: read.csv('filename'),
where filename is the name of the CSV file to be imported.

setwd("C:/Users/shivp/OneDrive/Desktop/SHEEETAL STUDY-PC")
score<-read.csv("studentscore.csv")
View(score)

Exploring a dataset means looking at its contents from different angles. Datasets are the
central part of analytical data processing, and R provides commands that let analysts
easily examine a dataset in different ways.

summary(score)

HOURS SCORE
Min.: 1.00 Min. :17.00
1st Qu.: 3.00 1st Qu.:38.25
Median: 5.00 Median :59.00
Mean: 5.45 Mean :58.75
3rd Qu.: 7.25 3rd Qu.:79.25
Max. :14.00 Max. :98.00

str(score)

'data.frame': 40 obs. of 2 variables:


$ HOURS: num 2 5 3 8 4 2 9 6 8 2 ...
$ SCORE: num 21 47 27 75 30 20 88 59 80 25 ...

> head(score)
HOURS SCORE
1 2 21
2 5 47
3 3 27
4 8 75
5 4 30
6 2 20

> tail(score)
HOURS SCORE
35 4 66
36 7 78
37 1 23
38 2 79
39 7 88
40 3 50

> dim(score)
[1] 40 2

Data Reformatting and Cleaning


Missing Values Treatment in R: During analytical data processing, users come across problems
caused by missing and infinite values. To get accurate output, users should remove or
clean the missing values. In R, NA (Not Available) represents a missing value and Inf
(Infinite) represents an infinite value. R provides several functions that identify missing
values during processing.
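
Before inspecting the full logical matrix below, a compact per-column count of missing values can be obtained; a minimal sketch, assuming the score data frame loaded earlier:

colSums(is.na(score))         # number of NA values in each column
sum(!complete.cases(score))   # number of rows containing at least one NA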

> is.na(score)
HOURS SCORE
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE TRUE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] FALSE FALSE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
[16,] FALSE FALSE
[17,] TRUE FALSE
[18,] FALSE FALSE
[19,] FALSE FALSE
[20,] TRUE FALSE
[21,] FALSE FALSE
[22,] FALSE FALSE
[23,] FALSE TRUE
[24,] FALSE FALSE
[25,] FALSE FALSE
[26,] FALSE FALSE
[27,] FALSE FALSE
[28,] TRUE FALSE
[29,] FALSE FALSE
[30,] FALSE FALSE
[31,] FALSE FALSE
[32,] FALSE TRUE
[33,] FALSE FALSE
[34,] FALSE FALSE
[35,] FALSE FALSE
[36,] FALSE FALSE
[37,] FALSE FALSE
[38,] FALSE FALSE

[39,] FALSE FALSE
[40,] FALSE FALSE

Data Preprocessing
Method 1: Editing the data interactively using
edit(score)

Method 2: Removing rows with missing values using na.omit(score)
score1<-na.omit(score)
> print(score1)
HOURS SCORE
1 2 21
2 5 47
3 3 27
4 8 75
5 4 30
6 2 20
7 9 88
8 6 59
9 8 80
10 2 25
11 8 85
12 6 62
13 4 41
14 3 42
15 1 17
16 9 95
18 2 24

19 6 67
21 3 30
22 5 54
23 4 59
24 7 76
25 8 86
26 6 45
27 8 89
29 10 90
30 14 98
31 2 24
32 12 59
33 5 79
34 7 90
35 4 66
36 7 78
37 1 23
38 2 79
39 7 88
40 3 50

Method 3: Auto-adjusting (imputing) the missing values

Preprocessing for HOURS

> print(score$HOURS <- ifelse(is.na(score$HOURS), ave(score$HOURS, FUN = function(x) mean(x, na.rm = TRUE)), score$HOURS))

 [1]  2.500000  5.100000  3.200000  8.500000  3.500000  1.500000  9.200000  5.500000  8.300000  2.000000  7.700000
[12]  5.900000  4.500000  3.300000  1.100000  8.900000  5.440541  1.900000  6.100000  5.440541  2.700000  4.800000
[23]  3.800000  6.900000  7.800000  6.000000  8.000000  5.440541 10.000000 14.000000  1.500000 12.000000  4.600000
[34]  7.000000  3.600000  6.900000  1.000000  2.200000  6.800000  3.000000

We can round off the values
> print(score1$HOURS <- as.numeric(format(round(score1$HOURS, 0))))

[1]  2  5  3  8  4  2  9  8  2  8  6  4  3  1  9  2  6  3  5  7  8  6  8 10 14  2  5  7  4  7  1  2  7  3

Preprocessing for SCORE


> print(score$SCORE <- ifelse(is.na(score$SCORE), ave(score$SCORE, FUN = function(x) mean(x, na.rm = TRUE)), score$SCORE))

 [1] 21.00000 47.00000 27.00000 75.00000 30.00000 20.00000 88.00000 58.72973 80.00000 25.00000 85.00000 62.00000 41.00000
[14] 42.00000 17.00000 95.00000 56.00000 24.00000 67.00000 69.00000 30.00000 54.00000 58.72973 76.00000 86.00000 45.00000
[27] 89.00000 57.00000 90.00000 98.00000 24.00000 58.72973 79.00000 90.00000 66.00000 78.00000 23.00000 79.00000 88.00000
[40] 50.00000

print(score$SCORE <- as.numeric(format(round(score$SCORE,0))))

 [1] 21 47 27 75 30 20 88 59 80 25 85 62 41 42 17 95 56 24 67 69 30 54 59 76 86 45 89 57 90 98 24 59 79 90 66 78 23 79 88
[40] 50

Preprocessed data

Data Analysis
plot(score$HOURS,score$SCORE)

abline(h = mean(score$SCORE))

When we use the mean to predict the score, in some instances we can observe a significant
difference between the actual (observed) value and the predicted value. So we next examine
the correlation between hours and score.
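
To see how much error the mean leaves behind as a predictor, its total squared error can be computed directly; a small illustrative sketch (not part of the original analysis):

sse_mean <- sum((score$SCORE - mean(score$SCORE))^2)   # total squared error when every score is predicted by the mean
print(sse_mean)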

Correlation Coefficient

The degree and direction of a linear association can be determined using correlation. The Pearson
correlation coefficient of the association between the number of hours studied and the student
score is computed as follows:

> cor(score$HOURS,score$SCORE)

[1] 0.7920693

The correlation value here suggests that there is a strong association between the number of hours
studied and the student score.

A correlation value close to 0 indicates that the variables are not linearly associated. However, these
variables may still be related. Thus, it is advised to plot the data.

Since correlation analysis may be inappropriate in determining the causation, we use regression
techniques to quantify the nature of the relationship between the variables.

When a regression model is of a linear form, such a regression is called a linear regression. Similarly,
when a regression model is of non-linear form, then such a regression is called a non-linear
regression.

A linear equation is an equation of degree at most one, while a nonlinear equation has degree two
or higher. A linear equation forms a straight line on a graph; a nonlinear equation forms a curve.

1. Simple linear form: there is one predictor and one dependent variable: f(X) = b0 + b1x1 + e
2. Multiple linear form: there are multiple predictor variables and one dependent variable: f(X) = b0 + b1x1 + b2x2 + ... + bnxn + e
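
Both forms can be fitted in R with lm(); the quadratic fit below is only an illustration of a nonlinear (degree 2) model and is not used elsewhere in this case study:

linear    <- lm(SCORE ~ HOURS, data = score)                       # simple linear form, degree 1
quadratic <- lm(SCORE ~ poly(HOURS, 2, raw = TRUE), data = score)  # degree 2, i.e. a curved fit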

Since the scatter plot between the number of hours of study put in by students and their scores
suggested a linear association, let us build a linear regression model to quantify the nature
of this relationship.

Building linear model for prediction

Create the Linear Model Using lm()

Let us compute the coefficients: (a) the intercept and (b) the slope for score$HOURS.

>model<-lm(score$SCORE~score$HOURS)

>print(model)

Call:

lm(formula = score$SCORE ~ score$HOURS)

Coefficients:

(Intercept) score$HOURS

22.980 6.575

> summary(model)

Call:

lm(formula = score$SCORE ~ score$HOURS)

Residuals:

Min 1Q Median 3Q Max

-42.88 -11.21 -0.34 10.45 41.55

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 22.980 5.099 4.506 6.12e-05 ***

score$HOURS 6.575 0.822 7.999 1.14e-09 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.5 on 38 degrees of freedom

Multiple R-squared: 0.6274, Adjusted R-squared: 0.6176

F-statistic: 63.98 on 1 and 38 DF, p-value: 1.144e-09

The first item shown in the output is the formula, lm(formula = score$SCORE ~
score$HOURS), that R uses to fit the data. lm() is the linear model function in R
used to create a simple regression model; score$HOURS is the predictor variable
and score$SCORE is the target/response variable. The next item in the model
output describes the residuals. What are "residuals"?
The difference between the actual observed response values and the response
values predicted by the model is called the residual. The residuals section of the
model output summarizes them in five points, viz. Minimum, 1Q (first quartile),
Median, 3Q (third quartile), and Maximum. When assessing how well the model
fits the data, one should look for a roughly symmetrical distribution of these
points around a mean of zero.
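
These five summary points can be reproduced directly from the model's residuals; a minimal sketch:

quantile(resid(model))   # 0%, 25%, 50%, 75%, 100% correspond to Min, 1Q, Median, 3Q and Max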

Coefficients: Estimate
The Estimate column contains two rows. The first is the intercept, which is the
expected value of the response Y when all predictors X equal 0; note that this is
only meaningful if X = 0 actually occurs in the data. The second row is the slope,
or in our example, the effect HOURS has on SCORE. The slope term in our model
indicates that for every additional hour of study, the predicted score goes up by
about 6.575 points.
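
Using these two estimates, a predicted score for any number of study hours follows directly from the fitted equation; a small sketch based on the coefficients reported above:

coef(model)          # (Intercept) 22.980, slope 6.575
22.980 + 6.575 * 8   # predicted score for 8 hours of study, roughly 75.6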

> res<-predict(model)
> print(res)
        1         2         3         4         5         6         7         8         9        10        11        12
 39.41696  56.51106  44.01922  78.86489  45.99161  32.84230  83.46715  59.14093  77.54996  36.12963  73.60517  61.77079
       13        14        15        16        17        18        19        20        21        22        23        24
 52.56627  44.67668  30.21244  81.49476  58.75000  35.47216  63.08572  58.75000  40.73189  54.53867  47.96401  68.34544
       25        26        27        28        29        30        31        32        33        34        35        36
 74.26263  62.42825  75.57757  58.75000  88.72688 115.02550  32.84230 101.87619  53.22374  69.00291  46.64908  68.34544
       37        38        39        40
 29.55497  37.44456  67.68798  42.70429

Rounding of results:
> res<-round(res,0)
> print(res)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 39  57  44  79  46  33  83  59  78  36  74  62  53  45  30  81  59  35  63  59
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 41  55  48  68  74  62  76  59  89 115  33 102  53  69  47  68  30  37  68  43

> ResHS <- resid(model)


> print(ResHS)
          1           2           3           4           5           6           7           8           9          10
-18.4169573  -9.5110631 -17.0192166  -3.8648937 -15.9916134 -12.8423013   4.5328471  -0.1409255   2.4500376 -11.1296293
         11          12          13          14          15          16          17          18          19          20
 11.3948312   0.2292121 -11.5662694  -2.6766822 -13.2124389  13.5052439  -2.7500000 -11.4721637   3.9142809  10.2500000
         21          22          23          24          25          26          27          28          29          30
-10.7318885  -0.5386663  11.0359898   7.6545560  11.7373656 -17.4282535  13.4224344  -1.7500000   1.2731223 -17.0255019
         31          32          33          34          35          36          37          38          39          40
 -8.8423013 -42.8761898  25.7762650  20.9970904  19.3509210   9.6545560  -6.5549733  41.5554395  20.3120216   7.2957146

> data <- data.frame(Hours = score$HOURS, Actualscore = score$SCORE, Predictedscore = res, Residuals <- ResHS)
> print(data)

Hours Actualscore Predictedscore Residuals....ResHS


1 2.500000 21 39 -18.4169573
2 5.100000 47 57 -9.5110631
3 3.200000 27 44 -17.0192166
4 8.500000 75 79 -3.8648937
5 3.500000 30 46 -15.9916134
6 1.500000 20 33 -12.8423013
7 9.200000 88 83 4.5328471
8 5.500000 59 59 -0.1409255
9 8.300000 80 78 2.4500376
10 2.000000 25 36 -11.1296293
11 7.700000 85 74 11.3948312
12 5.900000 62 62 0.2292121
13 4.500000 41 53 -11.5662694
14 3.300000 42 45 -2.6766822
15 1.100000 17 30 -13.2124389
16 8.900000 95 81 13.5052439
17 5.440541 56 59 -2.7500000

18 1.900000 24 35 -11.4721637
19 6.100000 67 63 3.9142809
20 5.440541 69 59 10.2500000
21 2.700000 30 41 -10.7318885
22 4.800000 54 55 -0.5386663
23 3.800000 59 48 11.0359898
24 6.900000 76 68 7.6545560
25 7.800000 86 74 11.7373656
26 6.000000 45 62 -17.4282535
27 8.000000 89 76 13.4224344
28 5.440541 57 59 -1.7500000
29 10.000000 90 89 1.2731223
30 14.000000 98 115 -17.0255019
31 1.500000 24 33 -8.8423013
32 12.000000 59 102 -42.8761898
33 4.600000 79 53 25.7762650
34 7.000000 90 69 20.9970904
35 3.600000 66 47 19.3509210
36 6.900000 78 68 9.6545560
37 1.000000 23 30 -6.5549733
38 2.200000 79 37 41.5554395
39 6.800000 88 68 20.3120216
40 3.000000 50 43 7.2957146

write.csv(data,"C:/Users/welcome/Desktop/PREDICTED.csv")
print ('CSV file written Successfully :)')

> new <- na.omit(score)


> print(new)
HOURS SCORE
1 2.500000 21
2 5.100000 47
3 3.200000 27
4 8.500000 75
5 3.500000 30
6 1.500000 20
7 9.200000 88
8 5.500000 59
9 8.300000 80
10 2.000000 25
11 7.700000 85
12 5.900000 62
13 4.500000 41
14 3.300000 42
15 1.100000 17
16 8.900000 95
17 5.440541 56
18 1.900000 24
19 6.100000 67
20 5.440541 69
21 2.700000 30
22 4.800000 54

23 3.800000 59
24 6.900000 76
25 7.800000 86
26 6.000000 45
27 8.000000 89
28 5.440541 57
29 10.000000 90
30 14.000000 98
31 1.500000 24
32 12.000000 59
33 4.600000 79
34 7.000000 90
35 3.600000 66
36 6.900000 78
37 1.000000 23
38 2.200000 79
39 6.800000 88
40 3.000000 50

> model <- lm(new$SCORE ~ new$HOURS)


> print(model)

Call:
lm(formula = new$SCORE ~ new$HOURS)

Coefficients:
(Intercept) new$HOURS
22.980 6.575

> summary(model)

Call:
lm(formula = new$SCORE ~ new$HOURS)

Residuals:
Min 1Q Median 3Q Max
-42.88 -11.21 -0.34 10.45 41.55

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.980 5.099 4.506 6.12e-05 ***
new$HOURS 6.575 0.822 7.999 1.14e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.5 on 38 degrees of freedom


Multiple R-squared: 0.6274, Adjusted R-squared: 0.6176
F-statistic: 63.98 on 1 and 38 DF, p-value: 1.144e-09

If the number of data points is small, a large F-statistic is required to ascertain
that there may be a relationship between the predictor and response variables.
The F-statistic is computed as
F = (explained variation/(k - 1)) / (unexplained variation/(n - k)),
where k is the number of variables in the dataset and n is the number of observations.
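
The same F value can be recovered in R from the fitted model, since anova() reports the explained and unexplained sums of squares used in the formula above; a minimal sketch:

a <- anova(model)
f <- (a$"Sum Sq"[1] / 1) / (a$"Sum Sq"[2] / 38)   # (explained/(k - 1)) / (unexplained/(n - k)) with k = 2, n = 40
print(f)                                           # approximately 63.98, matching summary(model)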

plot(new$HOURS, new$SCORE, col = 'blue', main = 'Linear Regression',
abline(lm(new$SCORE ~ new$HOURS)), cex = 1.3, pch = 16, xlab = 'No of hours of
study', ylab = 'Student Score')

write.csv(data,"C:/Users/shivp/OneDrive/Desktop/SHEEETAL STUDY-PC")

print ('CSV file written Successfully :)')

data <- data.frame(Hours = new$HOURS, Actualscore = new$SCORE, Predictedscore = round(res, 0), Residuals <- ResHS)

> print(data)

Hours Actualscore Predictedscore Residuals....ResHS


1 2.500000 21 39 -18.4169573
2 5.100000 47 57 -9.5110631
3 3.200000 27 44 -17.0192166
4 8.500000 75 79 -3.8648937
5 3.500000 30 46 -15.9916134
6 1.500000 20 33 -12.8423013
7 9.200000 88 83 4.5328471
8 5.500000 59 59 -0.1409255
9 8.300000 80 78 2.4500376
10 2.000000 25 36 -11.1296293
11 7.700000 85 74 11.3948312
12 5.900000 62 62 0.2292121
13 4.500000 41 53 -11.5662694
14 3.300000 42 45 -2.6766822
15 1.100000 17 30 -13.2124389
16 8.900000 95 81 13.5052439
17 5.440541 56 59 -2.7500000
18 1.900000 24 35 -11.4721637
19 6.100000 67 63 3.9142809
20 5.440541 69 59 10.2500000
21 2.700000 30 41 -10.7318885
22 4.800000 54 55 -0.5386663
23 3.800000 59 48 11.0359898
24 6.900000 76 68 7.6545560
25 7.800000 86 74 11.7373656
26 6.000000 45 62 -17.4282535
27 8.000000 89 76 13.4224344
28 5.440541 57 59 -1.7500000
29 10.000000 90 89 1.2731223
30 14.000000 98 115 -17.0255019
31 1.500000 24 33 -8.8423013
32 12.000000 59 102 -42.8761898
33 4.600000 79 53 25.7762650
34 7.000000 90 69 20.9970904
35 3.600000 66 47 19.3509210
36 6.900000 78 68 9.6545560
37 1.000000 23 30 -6.5549733
38 2.200000 79 37 41.5554395
39 6.800000 88 68 20.3120216
40 3.000000 50 43 7.2957146

Conclusion
The model is validated by checking the assumptions of linear regression. This case
study demonstrates the critical relationship between study habits, quantified by the number of
hours spent studying, and academic performance, as reflected in student scores. Through the
application of various data analytics techniques using R, we were able to explore and analyze the
dataset effectively.
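
A standard way to check these assumptions in R is to inspect the residual diagnostics of the fitted model; a minimal sketch (shown here as an illustration, these plots were not part of the analysis above):

par(mfrow = c(2, 2))
plot(model)                   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
shapiro.test(resid(model))    # formal test of normality of the residuals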

The findings indicate a positive correlation between the hours of study and student scores,
suggesting that increased study time is associated with better academic outcomes.
Descriptive statistics and visualizations highlighted the distribution of study hours and
scores, while data pre-processing ensured the integrity and suitability of the dataset for
analysis.

The predictive model developed during this study provides a foundational understanding of
how study habits can influence academic performance. This insight not only underscores the
importance of effective study practices among students but also offers valuable guidance for
educators and policymakers aiming to enhance student success.

Overall, this analysis serves as a practical example of how data analytics can inform
educational strategies and encourage students to adopt more effective study routines for
improved academic achievement. Future research could expand on this analysis by
incorporating additional variables, such as study methods and student engagement, to
provide a more comprehensive understanding of factors influencing academic success.
