Mid-Semester Test With Solution 2019
Data Analytics
Instructions
This is a question‐cum‐answer booklet. No separate sheet is required for solving or
answering any problem.
There are two parts in the question paper. Answer both parts.
Write your answers in the space provided at the end of each question. Do not write
answers anywhere else.
Part A
All questions in this part are short-answer questions.
A question may have one or more correct options.
For a question with more than one correct option, credit will be given on a pro‐rata basis.
No negative marking.
Give your answers ONLY in the space provided at the end of each question.
1.
i. In Table QI, there are attributes of different types. Out of mean, median, and mode,
which is (are) the descriptive measure(s) that can be calculated for the “Income
Group”?
Table QI
Patient ID Gender Age Income Group
Answer to (i)
The descriptive measure(s) which can be obtained for Income Group are median and mode.
ii. A frequency distribution over 10 data values is given below (see Table QII). Calculate
the coefficient of variation of the data.
Table QII
x 1 2 3 4 5 6 7 8 9 10
f(x) 1 3 5 7 9 2 4 6 1 0
Answer to (ii)
The expression for the coefficient of variation is $CV = \frac{\sigma}{\mu} \times 100$
Here, $\mu = 5.18$
and $\sigma^2 = 17.60$, that is, $\sigma = 4.19$
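The calculation for a frequency distribution can be sketched as follows. This is a minimal illustration, assuming the population form of the variance (division by the total frequency) and Table QII exactly as transcribed; the function name `coefficient_of_variation` is hypothetical. Its output depends on both assumptions, so it need not agree with the figures printed in the answer.

```python
import math

# Table QII as transcribed: data values x and their frequencies f(x).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fs = [1, 3, 5, 7, 9, 2, 4, 6, 1, 0]


def coefficient_of_variation(values, freqs):
    """Return (mean, std, cv_percent) for a frequency distribution.

    Assumption: population variance, i.e. division by the total frequency N.
    """
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    std = math.sqrt(variance)
    return mean, std, 100.0 * std / mean


mean, std, cv = coefficient_of_variation(xs, fs)
print(f"mean = {mean:.2f}, std = {std:.2f}, CV = {cv:.1f}%")
```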
iii. Which of the following can be considered to remove outliers in data? Mark a circle on
the correct answer(s).
(a) Box plot
(b) Mid range
(c) IQR (Inter Quartile Range)
(d) 0-1 normalization
iv. From which of the following measurements can the “coefficient of determination” be
calculated? Mark a circle on the correct answer(s).
(a) Degree of correlation
(b) Geometric mean
(c) Harmonic mean
(d) Number of “Type I errors”
v. If X is a random variable and P(X = x) denotes the probability that X=x over a discrete
domain of values of x, then which of the following is NOT true?
(a) $\sum_{\forall x} P(X = x) = 1$
(b) $\sum_{\forall x} P(X = x) = \infty$
(c) $0 \le \sum_{\forall x} P(X = x) \le \infty$
(d) Cannot be determined until the set of all values of x is given
vi. Which of the following is true about the sampling distribution from a normally
distributed population? (All symbols in this question bear their usual meanings.)
(a) $\bar{X} = \mu$ (The distribution of the sample means is approximately normal
with mean µ.)
(b) $S = \sigma/\sqrt{n}$ (This is not true of a sample's STD; rather, the distribution of the
sample means is approximately normal with STD $\sigma/\sqrt{n}$.)
(c) The variance of the distribution of the sample means is $\sigma^2/n$
(d) $S = \sigma/\sqrt{n}$ is true for a large value of n
(a) Pearson’s correlation analysis is applicable to only numeric data.
(b) Spearman’s correlation analysis is applicable to only ordinal data.
(c) $\chi^2$ correlation analysis is applicable to only categorical data.
(d) Any non-parametric statistical learning approach is applicable
when the entire population is known.
ix. Let $Y_T = \{Y_1, Y_2, \ldots, Y_n\}$ denote time series data, where $Y_i$ (i = 1, 2, …, n) denotes
the data for the i-th period. Mark the following statements as True or False.
(a) $Y_r = \beta_0 + \sum_j \beta_j \rho_j$ denotes an auto-regression model to predict the
data in the r-th period, where r > n and $\rho_j$ denotes the j-th autocorrelation
coefficient [False]
x. Which of the following statement(s) is(are) NOT true? Mark the correct option with a
circle.
(a) If confidence level is high, then probability that the null hypothesis will be
rejected is high.
(b) The null hypothesis in the Chi-Square test is that there is no association
between the attributes under test.
(d) From the box plot for a sample, median value can be obtained.
Part B
This part includes 5 concept level questions.
You should solve each question and give your answer in the space provided in the booklet.
Don’t use any extra sheet for problem solving.
2. The marks for 15 students on mid-term and end-term examinations in Data Analytics
course are given in Table Q2.
Table Q2
Mid-term 82 73 95 66 84 89 51 82 75 90 60 81 34 49 87
End-term 76 83 89 76 79 73 62 89 77 85 48 69 51 25 74
a) Obtain the simple linear regression analysis to predict the score on the end-term
examination from the mid-term examination score.
[1+2+3]
b) It is suggested that if the regression is significant, then there is no need to have a final
examination. How would you test the significance of your regression analysis?
[2+2]
Answer to Q2:
(a) The simple linear regression model to predict the end‐term score takes the following
form.
Let End‐term = Y and Mid‐term = X.
The simple LR model to predict the end‐term score is
$Y = \beta X + \alpha$
The expressions for the model parameters are:
$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
$\alpha = \bar{y} - \beta \bar{x}$
Calculated values of the model parameters:
$\bar{x} = 73.2$, $\bar{y} = 70.4$
$\beta = 0.7651$
$\alpha = 70.4 - (0.7651 \times 73.2) = 14.39$
Coefficient of determination: $R^2 = 0.599$
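The parameter values can be checked with a short least-squares computation on the Table Q2 data. This is a minimal sketch using the closed-form expressions for the slope and intercept; the function name `fit_simple_lr` is hypothetical, and the squared-correlation form of the coefficient of determination is assumed.

```python
# Table Q2: mid-term (x) and end-term (y) marks for 15 students.
mid_term = [82, 73, 95, 66, 84, 89, 51, 82, 75, 90, 60, 81, 34, 49, 87]
end_term = [76, 83, 89, 76, 79, 73, 62, 89, 77, 85, 48, 69, 51, 25, 74]


def fit_simple_lr(x, y):
    """Least-squares fit of y = beta * x + alpha; returns (beta, alpha, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    # For simple linear regression, R^2 equals the squared correlation coefficient.
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return beta, alpha, r_squared


beta, alpha, r_squared = fit_simple_lr(mid_term, end_term)
print(f"beta = {beta:.4f}, alpha = {alpha:.2f}, R^2 = {r_squared:.3f}")
```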
3. Happiness Index (HI) is measured as low (L), medium (M), high (H) and very high (VH).
A survey is conducted among a population of varied age groups and data observed are
recorded in Table Q3.
Table Q3
Age-group 80-90 90-100 10-20 20-30 30-40 40-50 50-60 60-70 70-80
HI H VH VH VH M L L M H
a) Apply a suitable correlation analysis to check whether any correlation exists between
age-group and happiness index.
[1 + 1]
b) Calculate the coefficient of determination and interpret your result.
[3+2+2+1]
Answer to Q3:
Justification of the proposed correlation analysis: The sample data are of ordinal type, and for
ordinal data the Spearman correlation analysis is applicable.
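A Spearman analysis on ordinal data with ties can be sketched as follows: both attributes are converted to average ranks, and the Pearson coefficient is computed on the ranks. This is an illustration only. It assumes that each age group in Table Q3 pairs with the HI value printed in the same column, and that the HI levels are ordered L < M < H < VH; the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Table Q3 as transcribed; each age group is represented by its lower bound.
# Assumption: column-wise pairing of age group and HI.
age_lower = [80, 90, 10, 20, 30, 40, 50, 60, 70]
hi_labels = ["H", "VH", "VH", "VH", "M", "L", "L", "M", "H"]
hi_order = {"L": 1, "M": 2, "H": 3, "VH": 4}  # assumed ordering of HI levels
hi_codes = [hi_order[h] for h in hi_labels]

rho = spearman(age_lower, hi_codes)
print(f"rho = {rho:.3f}, rho^2 = {rho ** 2:.3f}")
```

The numeric codes for HI are used only to order the levels; the ranks, not the codes, enter the coefficient.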
4. A survey was conducted among 500 students who are studying either in “government
funded colleges” (GVT) or “privately funded colleges” (PVT). The objective of the survey
was to see the choice of “classroom based learning” (C) over “Internet based learning” (I).
The survey results are summarized in Table Q4.
Table Q4
Learning
C I
GVT
Colleges 75 125 200
It is proposed to apply the $\chi^2$-test to verify whether any association exists between
“colleges” and “learning”.
a) Decide the null and alternate hypotheses in this case. Justify your answer.
[2]
b) Calculate the $\chi^2$-value from the sample data.
[2+2+2]
c) Test the hypothesis at the 5% significance level.
[2]
[2]
Answer to Q4:
The observed and expected frequencies are shown in the form of a contingency table.
Learning
C I
GVT
PVT
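The formula for the $\chi^2$-value is:
$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$
where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in row i and column j.
The degrees of freedom are v = (r-1)×(c-1) = 1
The critical value of $\chi^2$ from the $\chi^2$ statistical table in this case is:
Critical value = 3.841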
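The test procedure for a 2×2 contingency table can be sketched as follows. This is an illustration under stated assumptions: the GVT row (75, 125) is taken as it appears to read in Table Q4, while the PVT row (150, 150) is a hypothetical placeholder chosen only so that the total is 500, since that row is not legible in the source. The function name `chi_square` is also hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under the null hypothesis of no association.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


observed = [
    [75, 125],   # GVT: C, I (as read from Table Q4)
    [150, 150],  # PVT: C, I (HYPOTHETICAL placeholder, not from the source)
]
CRITICAL_VALUE = 3.841  # chi-square critical value for v = 1 quoted in the answer

chi2, dof = chi_square(observed)
reject_h0 = chi2 > CRITICAL_VALUE
print(f"chi2 = {chi2:.3f}, dof = {dof}, reject H0: {reject_h0}")
```

With the real PVT counts substituted for the placeholder row, the same comparison against the critical value gives the decision for part (c).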
5. It is claimed that an automobile is driven on average more than 20,000 kilometres per
year. To test the claim, a random sample of 100 automobile owners is asked to keep a record
of the kilometres they travel. The random sample showed an average of 23,500 kilometres
and a standard deviation of 3,900 kilometres. The claim is to be tested with parametric
hypothesis testing. Assume a 1% significance level.
a) Mention the hypotheses that you should consider. Justify the hypothesis you have
proposed.
[2]
b) Calculate the test statistic and decide the critical region for rejecting the hypothesis.
[2+2]
c) Decide the sample statistics.
[2]
d) Check whether the null hypothesis should be rejected.
[2]
Answer to Q5:
(d) The decision from the hypothesis test, with justification, is given below.
H0 is rejected, which means that an automobile is driven on average more than 20,000
kilometres per year.
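The decision can be reproduced with a one-sample z-test. This is a sketch under assumptions not spelled out in the surviving answer: a right-tailed test with H0: µ = 20,000 against H1: µ > 20,000, the sample standard deviation used in place of σ because n = 100 is large, and a 1% significance level.

```python
import math
from statistics import NormalDist

mu_0 = 20_000   # claimed mean under H0 (km per year)
x_bar = 23_500  # sample mean
s = 3_900       # sample standard deviation
n = 100         # sample size
alpha = 0.01    # significance level

# Test statistic: distance of the sample mean from mu_0 in standard errors.
z = (x_bar - mu_0) / (s / math.sqrt(n))

# Right-tailed critical value of the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z > z_crit
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```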
6. In a test paper, there are two parts carrying 10 and 40 marks. The marks in the two parts are
denoted by the random variable X. Another random variable Y denotes the number of
students who have attended the test. Table Q6 shows the joint probability mass
function f(x,y) = P(X=x, Y=y).
Table Q6
Y
f(x,y) 10 20 30 fX(x)
0
10
0.25
40
0 0.25 0.5
fY(y) 0.25 0.5 0.25
(a) Calculate the covariance. Interpret the result signifying what the covariance
implies.
[2+1]
(b) Calculate the coefficient of correlation. Express the meaning of the result.
[3+1]
(c) Calculate the coefficient of determination. How do you interpret the result?
[2+1]
Answer to Q6:
$\mathrm{Cov}(X, Y) = \sum_{x,y} (x - \mu_X)(y - \mu_Y) \cdot f(x, y)$
(b) Interpretation of the result:
The covariance between the attributes X and Y is 56.25
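The formula for the coefficient of correlation relevant to the problem is:
$\rho = \frac{\mathrm{Cov}(X, Y)}{S_X \cdot S_Y}$
(c) The value of the coefficient of correlation for the given data:
$S_X = \sqrt{\sum (x - \mu_X)^2 / (2 - 1)} = \sqrt{(10 - 25)^2 + (40 - 25)^2} = 21.21$
$S_Y = \sqrt{\sum (y - \mu_Y)^2 / (3 - 1)} = \sqrt{[(10 - 15)^2 + (20 - 15)^2 + (30 - 15)^2] / 2} = 6.12$
$\rho = \frac{56.25}{21.21 \times 6.12} = 0.43$
$\rho^2 = 0.186$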
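The three quantities asked for in Q6 can be computed directly from a joint probability mass function, as in the sketch below. The cell probabilities are HYPOTHETICAL placeholders chosen only to reproduce the marginals printed in Table Q6, because the individual cells are not legible in the source. The sketch also uses the standard probability-weighted variances rather than the unweighted expressions for S_X and S_Y in the answer, so its output is not expected to match the values printed there.

```python
import math

# HYPOTHETICAL joint pmf f(x, y): placeholder cells that match the printed
# marginals fX = (0.5, 0.5) and fY = (0.25, 0.5, 0.25); not taken from the source.
joint_pmf = {
    (10, 10): 0.25, (10, 20): 0.25, (10, 30): 0.00,
    (40, 10): 0.00, (40, 20): 0.25, (40, 30): 0.25,
}


def joint_summary(pmf):
    """Return (covariance, correlation, determination) for a joint pmf."""
    mu_x = sum(x * p for (x, _), p in pmf.items())
    mu_y = sum(y * p for (_, y), p in pmf.items())
    cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
    var_x = sum((x - mu_x) ** 2 * p for (x, _), p in pmf.items())
    var_y = sum((y - mu_y) ** 2 * p for (_, y), p in pmf.items())
    rho = cov / math.sqrt(var_x * var_y)
    return cov, rho, rho ** 2


cov, rho, rho_sq = joint_summary(joint_pmf)
print(f"Cov = {cov:.2f}, rho = {rho:.3f}, rho^2 = {rho_sq:.3f}")
```

A positive covariance indicates that X and Y tend to increase together, and rho^2 gives the fraction of the variance in one variable associated with a linear relationship to the other.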