Esa - QP - Ue19-20cs203 - SDS


SRN

PES University, Bengaluru UE19/20CS203


(Established under Karnataka Act No. 16 of 2013)

MAY 2022: END SEMESTER ASSESSMENT (ESA) B TECH III SEMESTER


UE19/20CS203 – STATISTICS FOR DATA SCIENCE
Time: 3 Hrs Answer All Questions Max Marks: 100
• Answer all questions in the same order as given and to the point.
• Do not directly write the answer, write out all the steps taken to solve the problem
• Only the required data tables to solve the given problems are in the last page
1 a) What is sampling? Mention the different probability sampling techniques. Explain any three 8
with examples.
Solution:
Sampling is the process of selecting observations (a sample) from a population in order to make
inferences that can be generalized to that population.

Simple random sampling, as the name suggests, is an entirely random method of selecting the
sample.
● Here, each subject or unit in the population has an equal chance of being selected.
● The sampling frame should include the whole population.
● A table of random number or lottery system is used to determine which units are to be
selected.
Simple random sampling is always an EPS (equal probability of selection) design, but not all EPS
designs are simple random sampling.
Systematic sampling
When to Use: When the project budget is tight and there is little time to complete the study.
Key Thing: Find the sampling interval k so that every kth member is selected, where k = N / n
(N is the population size and n is the sample size).
How: Assign a number to each population member.
Selection: Randomly select the first person, then select every kth person after that.
Advantages: Easy to select, sample spread evenly over the entire reference population, cost effective.
Disadvantages: The sample may be biased, each element does not have an equal chance of selection,
and all elements between two consecutive kth elements are ignored.

Stratified sampling is the type of sampling in which the population is divided into 2 or more
groups called strata based on a shared characteristic or trait.
Then simple random samples are selected from each group.
The selected 2 or more samples are combined into one.
The strata or groups don’t overlap. But, they represent the entire population.
The shared characteristics based on which the population is divided could be gender,
educational attainment, income, age etc.
Cluster Sampling
When to Use: When the population is already broken up into groups (clusters).
Key Thing: Heterogeneous members in each group.
How: The population is divided into non-overlapping areas (clusters).
Each cluster is a miniature, or microcosm, of the population.
Selection: Clusters are selected randomly, and then either all elements of the selected clusters are
included or elements are chosen from them by simple random sampling.
Advantages: More convenient for geographically dispersed populations, lower travel cost,
simplified administration of the survey.
Disadvantages: Statistically less efficient; sampling error and other problems are higher than with
simple random sampling.
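The four techniques above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered units, the odd/even strata, and the ten-unit clusters are made-up assumptions, not part of the question.

```python
import random

rng = random.Random(0)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical unit IDs 1..100
n = 10                             # desired sample size

# Simple random sampling: every unit has the same chance of selection.
srs = rng.sample(population, n)

# Systematic sampling: k = N / n, random start, then every kth unit.
k = len(population) // n
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into non-overlapping strata (hypothetical
# odd/even split), then take a simple random sample within each stratum.
strata = {
    "odd": [u for u in population if u % 2 == 1],
    "even": [u for u in population if u % 2 == 0],
}
stratified = [u for members in strata.values()
              for u in rng.sample(members, n // 2)]

# Cluster sampling: split into clusters of 10, pick 2 clusters at random,
# and include every unit of the chosen clusters.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen = rng.sample(clusters, 2)
cluster_sample = [u for cluster in chosen for u in cluster]
```

Note that the stratified sample draws from every stratum, while the cluster sample takes whole groups and leaves the other groups out entirely.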

b) What is web scraping? With a neat diagram explain the components of a web scraper. 1+5
Solution:

Web scraping is like any other Extract-Transform-Load (ETL) process. Web scrapers crawl
websites, extract data from them, transform the data into a usable structured format, and load it
into a file or database for subsequent use.
A typical web scraper has the following components.
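As a rough illustration of the extract, transform, and load stages described above, the sketch below parses an inline HTML snippet using only the Python standard library. The snippet stands in for a fetched page; its markup, the `item` class name, and the `name`/`price` fields are all made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded page.
PAGE = """
<ul>
  <li class="item">Widget, 4.50</li>
  <li class="item">Gadget, 12.00</li>
</ul>
"""

class ItemExtractor(HTMLParser):
    """Extract step: collect the text of every <li class="item">."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

def transform(raw_items):
    """Transform step: turn 'name, price' strings into structured rows."""
    rows = []
    for text in raw_items:
        name, price = [part.strip() for part in text.split(",")]
        rows.append({"name": name, "price": float(price)})
    return rows

def load(rows):
    """Load step: write the rows as CSV (to a string here, not a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

extractor = ItemExtractor()
extractor.feed(PAGE)
rows = transform(extractor.items)
csv_text = load(rows)
```

A real scraper would add a crawling step that downloads pages and follows links before the extract step.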
c) For the following data 1x6
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Compute the mean, median, mode and the 5%, 10%, and 20% trimmed means

Solution:
The mean is found by averaging together all 24 numbers, which produces a value of
195.42.
The median is the average of the 12th and 13th numbers, which is (191 + 223)/2 =
207.00.
The data are trimodal: 80, 179, and 232 each occur twice.
To compute the 5% trimmed mean, we must drop 5%
of the data from each end. This comes to (0.05)(24) = 1.2 observations.
We round 1.2 to 1 and trim one observation off each end.
The 5% trimmed mean is the average of the remaining 22 numbers:
(75 + 79 + ··· + 274 + 384)/22 = 190.45
To compute the 10% trimmed mean, round off (0.1)(24) = 2.4 to 2.
Drop 2 observations from each end, and then average the remaining 20:
(79 + 80 + ··· + 254 + 274)/20 = 186.55
To compute the 20% trimmed mean, round off (0.2)(24) = 4.8 to 5. Drop 5 observations
from each end, and then average the remaining 14:
(105 + 126 + ··· + 242 + 245)/14 = 194.07
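The figures above can be checked with a short Python sketch. The helper name `trimmed_mean` is an assumption made for this example; it rounds the number of trimmed observations to the nearest integer, as the solution does.

```python
from statistics import mean, median, multimode

data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

def trimmed_mean(values, proportion):
    """Drop round(proportion * n) observations from each end, then average."""
    ordered = sorted(values)
    cut = round(proportion * len(ordered))
    kept = ordered[cut:len(ordered) - cut]
    return mean(kept)

data_mean = mean(data)        # about 195.42
data_median = median(data)    # average of the 12th and 13th values
data_modes = multimode(data)  # every value that ties for most frequent
trimmed = {p: trimmed_mean(data, p) for p in (0.05, 0.10, 0.20)}
```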

2 a) The four sides of a rectangular frame consist of two pieces selected from a population 1+2+2
whose mean length is 30 cm with standard deviation 0.1 cm, and two pieces selected from
a population whose mean length is 45 cm with standard deviation 0.3 cm.
i. Find the mean perimeter of the rectangular frame.
ii. Assuming the four pieces are chosen independently, find the standard deviation of
the perimeter.
Solution:
Let X1 and X2 denote the lengths of the pieces chosen from the population with mean 30
and standard deviation 0.1, and let Y1 and Y2 denote the lengths of the pieces chosen from
the population with mean 45 and standard deviation 0.3.

i. μX1+X2+Y1+Y2 = μX1 +μX2 +μY1 +μY2 = 30+30+45+45= 150

ii.
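A minimal sketch of part ii, assuming (as the question states) that the four pieces are independent, so the variances of the four lengths add:

```python
import math

sd_x = 0.1  # standard deviation of each 30 cm piece (X1, X2)
sd_y = 0.3  # standard deviation of each 45 cm piece (Y1, Y2)

# Var(X1 + X2 + Y1 + Y2) = Var(X1) + Var(X2) + Var(Y1) + Var(Y2)
var_perimeter = 2 * sd_x**2 + 2 * sd_y**2   # 0.01 + 0.01 + 0.09 + 0.09 = 0.20
sd_perimeter = math.sqrt(var_perimeter)     # about 0.447 cm
```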
b) IC chips often contain surface imperfections. For a certain type of IC chip, 9% contain no 1+2+2
imperfections, 22% contain 1 imperfection, 26% contain 2 imperfections,20% contain 3
imperfections, 12% contain 4 imperfections, and the remaining 11% contain 5
imperfections. Let Y represent the number of imperfections in a randomly chosen chip.
What are the possible values for Y? Is Y discrete or continuous? Find P(Y = y) for each
possible value y.

Solution
The possible values for Y are the integers 0, 1, 2, 3, 4, and 5. The random variable Y is discrete,
because it takes on only integer values. Nine percent of the outcomes in the sample space are
assigned the value 0. Therefore P(Y = 0) = 0.09. Similarly P(Y = 1) = 0.22, P(Y = 2) = 0.26, P(Y =
3) = 0.20, P(Y = 4) = 0.12, and P(Y = 5) = 0.11.

c) X is a continuous Random Variable with the probability density function as given below. 2+3

It is verified that µx=50 and σx=0.45. Compute the probability that the X is outside the
interval 49.1 - 50.9. How close is this probability to the Chebyshev’s Inequality bound?

Solution:
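The exact probability depends on the density function, which is not shown here, so the sketch below covers only the Chebyshev side of the comparison. It uses the stated µx = 50 and σx = 0.45; the variable names are chosen for this example.

```python
mu, sigma = 50, 0.45
lower, upper = 49.1, 50.9

# The interval is mu ± k*sigma; solve for k.
k = (upper - mu) / sigma          # 0.9 / 0.45 = 2

# Chebyshev's inequality: P(|X - mu| >= k*sigma) <= 1 / k^2
chebyshev_bound = 1 / k**2        # 0.25
```

Whatever the density is, the probability of falling outside 49.1 to 50.9 cannot exceed this bound of 0.25.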
d) A company produces “20 ounce” jars of a chilli sauce. The true amounts of sauce in the 2+3
jars of this brand follow a normal distribution. Suppose the company’s “20 ounce”
jars follow a normal distribution with mean µ = 20.2 ounces and standard deviation
σ = 0.125 ounces. What proportion of the sauce jars contain between 20 and 20.3 ounces of
sauce?

Solution:
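A short sketch of the calculation, standardizing both endpoints and taking the difference of the normal CDF values. It uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

mu, sigma = 20.2, 0.125
jar = NormalDist(mu, sigma)

z_low = (20.0 - mu) / sigma    # -1.6
z_high = (20.3 - mu) / sigma   #  0.8

# P(20 < X < 20.3) = Phi(0.8) - Phi(-1.6), roughly 0.7881 - 0.0548
proportion = jar.cdf(20.3) - jar.cdf(20.0)
```

So roughly 73% of the jars contain between 20 and 20.3 ounces.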

3 a) Let X1, . . . , Xn be a random sample from a population with the Poisson(λ) distribution. 5
Find the MLE of λ.

Solution:
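One standard way to derive the MLE is sketched below: write the likelihood, take logs, differentiate with respect to λ, and set the derivative to zero. The notation x_1, …, x_n for the observed values is an assumption of this sketch.

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \\
\ln L(\lambda) &= -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big)\ln\lambda - \sum_{i=1}^{n}\ln(x_i!) \\
\frac{d}{d\lambda}\ln L(\lambda) &= -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\end{align*}
```

The second derivative, $-\sum x_i/\lambda^2$, is negative whenever at least one observation is positive, so the stationary point is a maximum and the MLE of λ is the sample mean.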
b) Let X1 and X2 be independent, each with unknown mean μ and known variance σ² = 1. 5
Let . Find the bias, variance, and mean squared error of .

Solution:

c) A random sample of n = 50 boys showed a mean average daily intake of protein products 2+2
equal to 756 grams with a standard deviation of 35 grams.
i. Find a 95% confidence interval for the population average µ.
ii. Find a 99% confidence interval for µ, the population average daily intake
of protein products for boys.

Solution:
i. 95% confidence interval:
x̄ ± 1.96 · s/√n = 756 ± 1.96 · 35/√50 = 756 ± 9.70
or 746.30 < μ < 765.70 grams.

ii. 99% confidence interval:
x̄ ± 2.58 · s/√n = 756 ± 2.58 · 35/√50 = 756 ± 12.77
or 743.23 < μ < 768.77 grams.
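Both intervals follow the same large-sample pattern, x̄ ± z · s/√n. A small sketch with the z values used in the solution (1.96 and 2.58); the helper name `z_interval` is an assumption made for this example.

```python
import math

def z_interval(xbar, s, n, z):
    """Large-sample confidence interval xbar ± z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin

ci_95 = z_interval(756, 35, 50, 1.96)   # about (746.30, 765.70)
ci_99 = z_interval(756, 35, 50, 2.58)   # about (743.23, 768.77)
```

The 99% interval is wider because a higher confidence level requires a larger z multiplier.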


d) Estimate the confidence intervals for the following: 3+3
i. A group of 78 people enrolled in a weight-loss program that involved adhering to a
special diet and to a daily exercise program. After six months, their mean weight
loss was 25 pounds, with a sample standard deviation of 9 pounds. A second group
of 43 people went on the diet but didn’t exercise. After six months, their mean
weight loss was 14 pounds, with a sample standard deviation of 7 pounds. Find a
95% confidence interval for the mean difference between the weight losses.

ii. In a random sample of 150 customers of a high-speed internet provider, 63 said
that their service had been interrupted one or more times in the past month. Find a
95% confidence interval for the proportion of customers whose service was
interrupted one or more times in the past month.

Solution:
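A sketch of both calculations under large-sample normal approximations. For part ii it uses the traditional interval p̂ ± z·√(p̂(1 − p̂)/n); this choice is an assumption, and a course that uses an adjusted interval (for example, adding two successes and two failures) would get slightly different limits. The function names are made up for this example.

```python
import math

Z_95 = 1.96

def diff_means_interval(m1, s1, n1, m2, s2, n2, z=Z_95):
    """CI for mu1 - mu2 from two independent large samples."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = m1 - m2
    return diff - z * se, diff + z * se

def proportion_interval(successes, n, z=Z_95):
    """Traditional large-sample CI for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# i. diet plus exercise (n = 78) versus diet only (n = 43)
weight_ci = diff_means_interval(25, 9, 78, 14, 7, 43)   # about (8.11, 13.89)

# ii. 63 of 150 customers reported an interruption
service_ci = proportion_interval(63, 150)               # about (0.341, 0.499)
```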

4 a) A marketing company claims that it receives 8% responses from its mailing. To test this 1+1+2+1
claim, a random sample of 500 recipients was surveyed, yielding 30 responses. Test at the α = .05
significance level.

Solution:

First, check that the sample is large enough for the normal approximation, using the claimed
proportion p0 = .08:
n p0 = (500)(.08) = 40
n(1 − p0) = (500)(.92) = 460

H0: p = .08    H1: p ≠ .08

α = .05
n = 500, p̂ = 30/500 = .06
Critical values: ± 1.96 (reject H0 if |Z| > 1.96)

Z = (p̂ − p0) / √(p0(1 − p0)/n) = (.06 − .08) / √(.08(1 − .08)/500) = −1.648

Since −1.96 < −1.648 < 1.96, do not reject H0 at α = .05.

There isn’t sufficient evidence to reject the company’s claim of 8% response rate.
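The test statistic can be reproduced with a short sketch. The two-sided P-value at the end is an addition for illustration; the solution itself uses the critical-value approach.

```python
import math
from statistics import NormalDist

p0 = 0.08          # claimed response rate under H0
n = 500
responses = 30
p_hat = responses / n                      # 0.06

# Standard error uses p0 because the test assumes H0 is true.
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se                      # about -1.648

critical = 1.96                            # two-sided, alpha = .05
reject_h0 = abs(z) > critical

# Two-sided P-value: probability of a |Z| at least this large under H0.
p_value = 2 * NormalDist().cdf(-abs(z))
```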
4 b) Recently many of the IT companies have been experimenting with work from 2+2+2
home(WFH), allowing employees to work at home on their computers. Among other
things, WFH is supposed to reduce the number of sick days taken. Suppose that at one
firm, it is known that over the past few years employees have taken a mean of 5.4 sick
days. This year, the firm introduces WFH. Management chooses a simple random sample
of 80 employees to follow in detail, and, at the end of the year, these employees average
4.5 sick days with a standard deviation of 2.7 days. Let μ represent the mean number of
sick days for all employees of the firm.
i. Find the P-value for testing H0 :μ ≥ 5.4 versus H1 :μ < 5.4.
ii. Do you believe it is plausible that the mean number of sick days is at least 5.4, or
are you convinced that it is less than 5.4? Explain your reasoning.
iii. Is the result statistically significant at the 5% level?

Solution:

Yes, the result is statistically significant at the 5% level.
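A sketch of the calculation behind that conclusion, assuming the large-sample z test with the sample standard deviation 2.7 standing in for σ (n = 80):

```python
import math
from statistics import NormalDist

mu0 = 5.4                  # boundary value of H0: mu >= 5.4
xbar, s, n = 4.5, 2.7, 80

se = s / math.sqrt(n)      # about 0.302
z = (xbar - mu0) / se      # about -2.98

# H1 is mu < 5.4, so the P-value is the area to the left of z.
p_value = NormalDist().cdf(z)
significant_at_5_percent = p_value < 0.05
```

The P-value is well below 0.05, so it is not plausible that the mean number of sick days is still at least 5.4.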

c) For the given table of observed values, 2+2


i. Construct the corresponding table of expected values.
ii. If appropriate, perform the chi-square test for the null hypothesis that the row and
column outcomes are independent. If not appropriate, explain why.

Solution:
d) A test is made of the hypotheses H0 :μ ≤ 25 versus H1 :μ > 25. For each of the following 5x1
situations, determine whether the decision was correct, a type I error
occurred, or a type II error occurred.
i. μ = 23, H0 is rejected. ii. μ = 25, H0 is not rejected.
iii. μ = 29, H0 is not rejected. iv. μ = 27, H0 is rejected.
v. μ = 20, H0 is not rejected

Solution:

Correct decision: H0 is true and is not rejected.
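The reasoning for all five situations follows one rule: H0 (μ ≤ 25) is true exactly when the actual μ is at most 25; rejecting a true H0 is a type I error, and failing to reject a false H0 is a type II error. A small sketch of that rule (the function name `classify` is made up for this example):

```python
def classify(true_mu, rejected, null_upper=25):
    """Classify a test decision for H0: mu <= null_upper vs H1: mu > null_upper."""
    h0_true = true_mu <= null_upper
    if h0_true and rejected:
        return "type I error"
    if not h0_true and not rejected:
        return "type II error"
    return "correct decision"

situations = {
    "i": (23, True),
    "ii": (25, False),
    "iii": (29, False),
    "iv": (27, True),
    "v": (20, False),
}
results = {label: classify(mu, rejected)
           for label, (mu, rejected) in situations.items()}
```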

5 a) Find the power of the 5% level test of H0: μ ≤ 80 versus H1: μ > 80 for the mean yield of 1+2+2
the new process under the alternative μ = 82, assuming n = 50 and σ = 5.

Solution:
The population standard deviation for the new process is σ = 5, so the standard deviation of
the sample mean X̄ is σ/√n = 5/√50 = 0.707. Under H0 (taking μ = 80), the upper 5% of the null
distribution is the rejection region. The critical point has a z-score of 1.645, so its value
is 80 + (1.645)(0.707) = 81.16. We will reject H0 if X̄ ≥ 81.16. This is the rejection
region.

Under the alternate hypothesis μ = 82, the z-score for the critical point of 81.16
is z = (81.16 − 82)/0.707 = −1.19. The area to the right
of z = −1.19 is 0.8830. This is the power.
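The same two steps (find the critical point under H0, then find the probability of exceeding it under the alternative) can be sketched as follows. Because the code does not round intermediate values to two decimals as a z-table lookup does, its result differs slightly from 0.8830.

```python
import math
from statistics import NormalDist

mu0, mu_alt = 80, 82
sigma, n = 5, 50

se = sigma / math.sqrt(n)              # about 0.707
z_crit = 1.645                         # upper 5% point of the standard normal

# Step 1: critical point for the sample mean under H0.
critical = mu0 + z_crit * se           # about 81.16

# Step 2: probability that the sample mean exceeds it when mu = 82.
power = 1 - NormalDist(mu_alt, se).cdf(critical)   # about 0.88
```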

b) State the assumptions for Errors in Linear Models. 2


Solution:
Assumptions for Errors in Linear Models:
In the simplest situation, the following assumptions are satisfied:
1. The errors ε1,…,εn are random and independent. In
particular, the magnitude of any error εi does not
influence the value of the next error εi+1.
2. The errors ε1,…,εn all have mean 0.
3. The errors ε1,…,εn all have the same variance, which
we denote by σ².
4. The errors ε1,…,εn are normally distributed.
c) What is a confounding variable? How can we reduce the risk of confounding? 3

Solution:

❖ A confounding variable is a variable that influences both the independent variable
and the dependent variable, causing a spurious correlation.
❖ This may interfere with the analysis and ruin the experiment by giving useless
results.
❖ Confounding variables can cause two major problems:
▪ Increase variance
▪ Introduce bias.
❖ Confounding variables are like extra independent variables that have a
hidden effect on the dependent variables.
❖ A confounding variable may be the actual cause of a correlation, hence any
study must take confounding variables into account and find ways of dealing with them.

❖ One way to avoid confounding in controlled experiments is to choose values for
certain factors in such a way that there is no correlation
between those factors.
d) The number of hours spent by students preparing for the SDS final 10
exam and the marks scored (on a scale of 0 to 100) are provided in the following table.
Using these values, estimate the marks scored by a student who has spent 2.35
hours.

Solution:
❖ We first need to obtain the least-squares line ŷ = β̂0 + β̂1x and then evaluate it at x = 2.35 hours.
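A minimal sketch of the least-squares procedure follows. The `hours` and `marks` values are purely hypothetical stand-ins chosen for illustration, so the numeric prediction is not the answer to this question. The slope and intercept use the standard formulas β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1x̄.

```python
from statistics import mean

# HYPOTHETICAL data, for illustration only (not the exam's table).
hours = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
marks = [40, 48, 55, 61, 70, 76]

x_bar, y_bar = mean(hours), mean(marks)

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
slope = s_xy / s_xx

# Intercept: the fitted line passes through (x_bar, y_bar).
intercept = y_bar - slope * x_bar

def predict(x):
    """Estimated marks for a student who studied x hours."""
    return intercept + slope * x

estimate = predict(2.35)
```

Substituting the actual table values for `hours` and `marks` gives the fitted line for the question.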

