
Mid-Semester Test With Solution 2019


CS40003

Data Analytics

Mid‐Autumn Semester Test


(Session 2019‐2020)

Full Marks: 50 Time: 120 minutes

Instructions
- This is a question-cum-answer booklet. No separate sheet is required for solving or answering any problem.
- The question paper has two parts. Answer both parts.
- Write your answers in the space provided at the end of each question. Do not write answers anywhere else.

Part A
All questions in this part are short-answer questions.
For each question, one or more options may be correct.
For questions with more than one correct option, credit will be given on a pro-rata basis.
No negative marking.
Give your answers ONLY in the space provided at the end of each question.

1.
i. In Table QI, there are attributes of different types. Out of mean, median, and mode,
which is (are) the descriptive measure(s) that can be calculated for the “Income
Group”?

Table QI
Patient ID Gender Age Income Group

Answer to (i)
The descriptive measure(s) which can be obtained for Income Group are median and mode.

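Explanation of the answer:

“Income Group” is of ordinal type. The mode requires only category counts and the median requires only an ordering of the categories, so both can be calculated. The mean requires numeric (interval or ratio) values and therefore cannot be calculated.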
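
As a minimal sketch of why the median and mode are available for an ordinal attribute, the Python snippet below computes both for a small sample. The category names and the sample values are hypothetical (the rows of Table QI are not reproduced here), and the lower median is used when the sample size is even.

```python
from collections import Counter

# Hypothetical ordinal scale and sample; these are NOT the rows of Table QI.
ORDER = ["Low", "Medium", "High"]
incomes = ["Low", "High", "Medium", "Medium", "Low", "Medium", "High"]


def ordinal_mode(values):
    """Most frequent category; needs only counts, not an ordering."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, order):
    """Middle category after sorting by the scale order (lower median if n is even)."""
    ranked = sorted(values, key=order.index)
    return ranked[(len(ranked) - 1) // 2]


mode_value = ordinal_mode(incomes)
median_value = ordinal_median(incomes, ORDER)
print(mode_value, median_value)
```

No arithmetic is performed on the categories themselves, which is why a mean cannot be produced the same way.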

ii. A frequency distribution of a set of 10 data is given below (see Table QII). Calculate
the coefficient of variation of the data.

Table QII
x 1 2 3 4 5 6 7 8 9 10
f(x) 1 3 5 7 9 2 4 6 1 0
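Answer to (ii)
The expression for the coefficient of variation is

$CV = \frac{\sigma}{\mu} \times 100$

Here, $\mu = 5.18$
and $\sigma^2$, computed from the squared deviations $\sum (x - \bar{x})^2$, is 17.60, that is, $\sigma = 4.19$

Hence, for the given data, $CV = \frac{4.19}{5.18} \times 100 = 80.88\%$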

iii. Which of the following can be considered to remove outliers in data? Mark a circle on
the correct answer(s).
(a) Box plot
(b) Mid range
(c) IQR (Inter Quartile Range)
(d) 0-1 normalization

iv. From which of the following measurements can the “coefficient of determination” be
calculated? Mark a circle on the correct answer(s).
(a) Degree of correlation
(b) Geometric mean
(c) Harmonic mean
(d) Number of “Type I errors”

v. If X is a random variable and P(X = x) denotes the probability that X = x over a discrete
domain of values of x, then which of the following is NOT true?
(a) $\sum_{\forall x} P(X = x) = 1$
(b) $\sum_{\forall x} P(X = x) = \infty$
(c) $0 \le \sum_{\forall x} P(X = x) \le \infty$
(d) Cannot be determined until the set of all values of x is given

vi. Which of the following is true about the sampling distribution from a normally
distributed population? (All symbols in this question bear their usual meanings.)
(a) $\bar{X} = \mu$ (The distribution of the samples' mean is approximately normal
with mean µ.)
(b) $S = \sigma/\sqrt{n}$ (This does not hold for the sample's STD; rather, the distribution of
the samples' mean is approximately normal with STD $\sigma/\sqrt{n}$.)
(c) Variance of the samples' mean is $\sigma^2/n$
(d) $S = \sigma/\sqrt{n}$ is true for a large value of n

vii. If the value of α, the significance level, is increased, then


(a) Type I error increases while Type-II error decreases
(b) Type I error decreases while Type-II error increases
(c) Both Type I and Type II errors increase
(d) Both Type I and Type II errors decrease

viii. Which of the following statement(s) is(are) NOT true?

(a) Pearson’s correlation analysis is applicable to only numeric data.
(b) Spearman’s correlation analysis is applicable to only ordinal data.
(c) 𝜒2 correlation analysis is applicable to only categorical data.
(d) Any non-parametric statistical learning approach is applicable
when the entire population is known.

ix. Let $Y_T = \{Y_1, Y_2, \ldots, Y_n\}$ denote time series data, where $Y_i$ (i = 1, 2, …, n) denotes
the data for the i-th period. Mark the following statements as True or False.
(a) $Y_r = \beta_0 + \sum_j \beta_j \rho_j$ denotes an auto-regression model to predict the
data in the r-th period, where r > n and $\rho_j$ denotes the j-th auto-correlation
coefficient. [False]

(b) Auto-regression analysis is possible if each $Y_i$ (i = 1, 2, …, n) satisfies
the stationary property. [True]

(c) The i-th auto-correlation coefficient $\rho_i$ can be computed only if all periodic data are
available in a uniform and continuous manner. [True]

(d) If $Y_j$ is predicted accurately from $Y_T$, then $Y_k$ will be predicted for
k > j from $Y_T \cup \{Y_j\}$. [True]

x. Which of the following statement(s) is(are) NOT true? Mark the correct option with a
circle.
(a) If confidence level is high, then probability that the null hypothesis will be
rejected is high.

(b) The null hypothesis in the Chi-Square test is that there is no association
between the attributes under test.

(c) Ogive polygon can be used to calculate the mean of a sample.

(d) From the box plot for a sample, median value can be obtained.

Part B
This part includes 5 concept level questions.
You should solve each question and give your answer in the space provided in the booklet.
Don’t use any extra sheet for problem solving.

Do not give answer elsewhere.

2. The marks for 15 students on mid-term and end-term examinations in Data Analytics
course are given in Table Q2.

Table Q2
Mid-term 82 73 95 66 84 89 51 82 75 90 60 81 34 49 87
End-term 76 83 89 76 79 73 62 89 77 85 48 69 51 25 74

a) Obtain the simple linear regression analysis to predict the score on the end-term
examination from the mid-term examination score.
[1+2+3]
b) It is suggested that if the regression is significant, then there is no need to have a final
examination. How would you test the significance of your regression analysis?
[2+2]
Answer to Q2:
(a) The simple linear regression model to predict the end-term score takes the following
form.
Assume End-term = Y and Mid-term = X.
The simple LR model to predict the end-term score is then

$Y = \beta X + \alpha$

The expressions for the model parameters are:

$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$\alpha = \bar{y} - \beta \bar{x}$

Calculated values of the model parameters:

$\bar{x} = 73.2$, $\bar{y} = 70.4$

$\beta = 0.7651$
$\alpha = 70.4 - (0.7651 \times 73.2) = 14.39$

(b) The validity of the model can be checked as follows.

SSE = Residual sum of squared errors
$= \sum_i (\text{actual output} - \text{predicted output})^2$
$= \sum_i (y_i - \hat{y}_i)^2$
= 1714.62
SST = Total corrected sum of squares
$= \sum_i (\text{actual output} - \text{average of the output})^2$
$= \sum_i (y_i - \bar{y})^2$
= 4275.6
$R^2 = 1 - \frac{SSE}{SST}$
= 0.599
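
The least-squares computation for Q2 can be sketched in a few lines of Python. This is only an illustrative check using the marks from Table Q2; the function names `fit_simple_lr` and `goodness_of_fit` are hypothetical helpers introduced here, not part of the original solution. The results should agree with the values quoted in the answer up to rounding.

```python
# Marks from Table Q2.
mid_term = [82, 73, 95, 66, 84, 89, 51, 82, 75, 90, 60, 81, 34, 49, 87]
end_term = [76, 83, 89, 76, 79, 73, 62, 89, 77, 85, 48, 69, 51, 25, 74]


def fit_simple_lr(x, y):
    """Return (alpha, beta) of the least-squares line y = beta * x + alpha."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


def goodness_of_fit(x, y, alpha, beta):
    """Return (SSE, SST, R^2) for the fitted line."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return sse, sst, 1 - sse / sst


alpha, beta = fit_simple_lr(mid_term, end_term)
sse, sst, r2 = goodness_of_fit(mid_term, end_term, alpha, beta)
print(f"beta={beta:.4f} alpha={alpha:.2f} SSE={sse:.2f} SST={sst:.1f} R2={r2:.3f}")
```

An R² of about 0.6 means the mid-term score explains roughly 60% of the variation in the end-term score.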
3. Happiness Index (HI) is measured as low (L), medium (M), high (H) and very high (VH).
A survey is conducted among a population of varied age groups and data observed are
recorded in Table Q3.
Table Q3
Age-group  80-90  90-100  10-20  20-30  30-40  40-50  50-60  60-70  70-80
HI         H      VH      VH     VH     M      L      L      M      H

a) Apply a suitable correlation analysis to check whether any correlation exists between
age-group and happiness index.
[1 + 1]
b) Calculate the coefficient of determination and interpret your result.
[3+2+2+1]

Answer to Q3:

(a) For the given data, Spearman correlation analysis is applicable.

Justification of the proposed correlation analysis: The sample data are of ordinal type, and for
ordinal data the Spearman correlation analysis is applicable.

(b) Calculation of the coefficient of determination:

The rank table from the given data:

Sample#  Rankx  Ranky  Diff=d  d2
1        2      4.5    -2.5    6.25
2        1      2      -1      1
3        9      2      7       49
4        8      2      6       36
5        7      6.5    0.5     0.25
6        6      8.5    -2.5    6.25
7        5      8.5    -3.5    12.25
8        4      6.5    -2.5    6.25
9        3      4.5    -1.5    2.25

Calculation of the coefficient of correlation:

$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 119.5}{9 \times 80} = 0.00416$

Calculation of the coefficient of determination:

The coefficient of determination is $(r_s)^2 = 0.000017$

Test statistic: t = 0.0117

Interpretation of the result obtained:

Almost 0% of the variation in one ranking is explained by the other, so there is practically no correlation between age group and happiness index.
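
The ranking and the rank-difference formula can be checked with a short Python sketch. The numeric codes for the happiness levels and the use of each age group's lower bound are assumptions made here purely to obtain an ordering; `average_ranks` is a hypothetical helper. Rank 1 goes to the largest value and tied values share the mean of their positions, as in the rank table of the answer. Note that the formula $1 - 6\sum d^2 / (n(n^2-1))$ is applied as in the answer even though ties are present, so a library routine that corrects for ties may return a different value.

```python
# Lower bound of each age group, in the column order of Table Q3.
age_lower = [80, 90, 10, 20, 30, 40, 50, 60, 70]
# Assumed numeric coding of the ordinal scale: L < M < H < VH.
HI_CODE = {"L": 1, "M": 2, "H": 3, "VH": 4}
hi = [HI_CODE[v] for v in ["H", "VH", "VH", "VH", "M", "L", "L", "M", "H"]]


def average_ranks(scores):
    """Rank 1 = largest score; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


rank_x = average_ranks(age_lower)
rank_y = average_ranks(hi)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(rank_x)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(r_s, 5), r_s ** 2)
```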

4. A survey was conducted among 500 students who are studying either in “government
funded colleges” (GVT) or “privately funded colleges” (PVT). The objective of the survey is
to see the choice of “classroom based learning” (C) over “Internet based learning” (I).
The survey results are summarized in Table Q4.
Table Q4
                    Learning
                    C      I      Total
Colleges   GVT      75     125    200
           PVT      60     240    300
           Total    135    365    500

It is proposed to apply the $\chi^2$-test to verify whether there is any association between
“colleges” and “learning”.

a) Decide the null and alternate hypotheses in this case. Justify your answer.
[2]
b) Calculate the $\chi^2$-value from the sample data.
[2+2+2]
c) Test the hypothesis at the 5% significance level.
[2]

Answer to Q4:

(a) The hypotheses of the $\chi^2$-test are given below.

H0 : There is no association between the attributes College and Learning
H1 : There is an association between the attributes College and Learning

Justification of the mentioned hypotheses:

The null hypothesis assumes that no association exists between the attributes
under test.

(b) Calculation of the $\chi^2$ value from the given data.

The observed frequencies, with the expected frequencies in parentheses, are shown in the
form of a contingency table.

                    Learning
                    C          I           Total
Colleges   GVT      75 (54)    125 (146)   200
           PVT      60 (81)    240 (219)   300
           Total    135        365         500

The formula for the $\chi^2$-value is:

$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$, where $o_i$ = observed frequency and $e_i$ = expected frequency

The calculated $\chi^2$-value in this case is:

$\chi^2 = \frac{(75 - 54)^2}{54} + \frac{(125 - 146)^2}{146} + \frac{(60 - 81)^2}{81} + \frac{(240 - 219)^2}{219}$

= 8.16 + 3.02 + 5.44 + 2.01 = 18.63

(c) Testing the null hypothesis at the 5% significance level

The number of degrees of freedom for the given sample is:

v = (r-1)×(c-1) = 1

The critical value of $\chi^2$ from the $\chi^2$ statistical table in this case is 3.841.

Inference about the null hypothesis:

As $\chi^2 = 18.63 > 3.841$, reject the null hypothesis. This means that there is an association
between the type of college and the preferred mode of learning.
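
As a rough cross-check of Q4, the expected frequencies and the test statistic can be computed from the observed counts alone. The sketch below is plain Python; `chi_square` is a hypothetical helper name, and the critical value 3.841 is simply taken from the answer above rather than computed.

```python
# Observed counts from Table Q4: rows = GVT, PVT; columns = C, I.
observed = [[75, 125], [60, 240]]
CRITICAL_VALUE = 3.841  # chi-square table value used in the answer (df = 1, 5% level)


def chi_square(table):
    """Return (statistic, expected frequencies, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, expected, dof


stat, expected, dof = chi_square(observed)
reject_h0 = stat > CRITICAL_VALUE
print(round(stat, 2), expected, dof, reject_h0)
```

Without intermediate truncation the statistic comes out slightly above the 18.63 obtained by summing the two-decimal terms, which does not change the decision.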

5. It is claimed that an automobile is driven on average more than 20,000 kilometres per
year. To test the claim, a random sample of 100 automobile owners is asked to keep a record
of the kilometres they travel. The random sample showed an average of 23,500 kilometres
and a standard deviation of 3,900 kilometres. It is planned to test the claim with parametric
hypothesis testing. Assume a 1% significance level.

a) Mention the hypotheses that you should consider. Justify the hypothesis you have
proposed.
[2]
b) Calculate the test statistic and decide the critical region for rejecting the hypothesis.
[2+2]
c) Decide the sample statistics.
[2]
d) Check whether the null hypothesis should be rejected.
[2]
Answer to Q5:

(a) Null hypothesis H0 : µ = 20,000


Justification: The problem is to infer the population mean as 20,000

Alternate hypothesis H1: µ >20,000


Justification: Alternate hypothesis is that population mean is more than 20,000
(b) This hypothesis comes under the case of the t-test.

Reason: The population standard deviation is unknown.

(c) The critical value from the statistical table is:

t = 2.365

The test value from the sample statistics is given below:

$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{23500 - 20000}{3900/\sqrt{100}} = 8.974$

(d) The decision from the hypothesis testing is given below with justification.
Since 8.974 > 2.365, H0 is rejected, which means that an automobile is driven on average more than 20,000
kilometres per year.
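
The test statistic for Q5 is a one-line computation, sketched below in Python for illustration. The helper name `one_sample_t` is hypothetical, and the critical value 2.365 is taken from the answer above (one-tailed, 1% level) rather than derived in code.

```python
import math

CRITICAL_T = 2.365  # table value quoted in the answer, not computed here


def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t statistic for H0: mu = mu0 when the population SD is unknown."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))


t_value = one_sample_t(sample_mean=23500, mu0=20000, sample_sd=3900, n=100)
reject_h0 = t_value > CRITICAL_T  # right-tailed test, since H1 is mu > 20,000
print(round(t_value, 3), reject_h0)
```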

6. In a test paper, there are two parts with 10 and 40 marks in them. Marks in the two parts are
denoted by the random variable X. Another random variable Y denotes the number of
students who have attended the test. Table Q6 shows the joint probability mass
function f(x,y) = P(X=x, Y=y).
Table Q6
                  Y
f(x,y)      10     20     30     fX(x)
X    10     0.25   0.25   0      0.5
     40     0      0.25   0.25   0.5
fY(y)       0.25   0.5    0.25

(a) Calculate the covariance. Interpret the result signifying what the covariance
implies.
[2+1]
(b) Calculate the coefficient of correlation. Express the meaning of the result.
[3+1]
(c) Calculate the coefficient of determination. How do you interpret the result?
[2+1]
Answer to Q6:

(a) The formula for the covariance calculation:

$\text{Covariance} = \sum_{x,y} (x - \mu_X)(y - \mu_Y) \cdot f(x, y)$

The value of the covariance for the given data:

$\mu_X = \sum_x x \cdot f_X(x) = 25$
$\mu_Y = \sum_y y \cdot f_Y(y) = 20$
Covariance = 56.25

Interpretation of the result:
The covariance between the attributes X and Y is 56.25.

(b) The formula for the coefficient of correlation relevant to the problem is:

$\rho = \frac{Cov(X, Y)}{S_X \cdot S_Y}$

The value of the coefficient of correlation for the given data:

$S_X = \sqrt{\sum (x - \mu_X)^2 / (2 - 1)} = \sqrt{(10 - 25)^2 + (40 - 25)^2} = 21.21$

$S_Y = \sqrt{\sum (y - \mu_Y)^2 / (3 - 1)} = \sqrt{[(10 - 15)^2 + (20 - 15)^2 + (30 - 15)^2] / 2} = 6.12$

$\rho = \frac{56.25}{21.21 \times 6.12} = 0.43$

Interpretation of the result:

Since $\rho$ is positive, we can conclude that there is a positive correlation. Further, the value
of $\rho$ is closer to 0 than to 1, which implies that the correlation is weak.

(c) The formula for the coefficient of determination is:

Coefficient of determination = $\rho^2$

The value of the coefficient of determination for the given data:

$\rho^2 = 0.186$

Interpretation of the result:

About 18.6% of the variation in one variable is explained by its linear relationship with the other.
