2021 Quiz2 Problems
Name:
C-17-80752
Instructions
• You have 24 hours to complete this exam. Responses are due by Thursday, March 18 at
11:59pm Pacific Time.
• Upon completion of the quiz, you should submit your answers using this Google form:
https://forms.gle/zYUcXYHXeG4nKiW87
• There are a total of 20 questions, 5 true/false questions and 15 multiple choice questions.
• Apart from the teaching staff, you may not talk to or consult with anyone about this
exam. If any questions arise, please create a private post on Piazza and the teaching
staff will respond as soon as possible.
• You may use all available resources (books, notes, homework solutions, and the Internet
in general) for this exam. However, we recommend against using materials outside of this
course, such as searching the Internet, since doing so is more likely to result in over-complication,
confusion, and wasted time.
• Unless otherwise noted, each problem is self-contained.
• You will get zero points for any incorrectly answered question (no negative points).
• Each problem is worth 5 points, for a maximum total of 100, so don’t spend
too much time working on any single question.
• Good luck!
True/false questions
Problem 1.
When comparing two models M1 and M2 , if M1 has lower bias than M2 , it must have higher variance than
M2 .
Problem 2.
For a simple linear regression model, the regression line and the standard deviation line always intersect at
(X̄, Ȳ ), where X̄ is the sample mean of X and Ȳ is the sample mean of Y .
Problem 3.
Prediction error on the test set will generally be higher than that of the training set.
Problem 4.
The coefficient β1 in a logistic regression model Pr(Y = 1) = logit⁻¹(β0 + β1 X) can be interpreted as the
average increase in odds of the event Y = 1 taking place as X increases by one unit.
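As a reminder of the notation, logit⁻¹ denotes the inverse logit (logistic) function, logit⁻¹(z) = 1 / (1 + e^(−z)), which maps any real number to a probability. The short Python sketch below only illustrates that definition; the function names inv_logit and logit are illustrative choices, not part of the course materials.

```python
import math


def inv_logit(z: float) -> float:
    """Inverse logit: maps a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Logit (log-odds) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# The two functions undo each other, and z = 0 corresponds to probability 0.5.
print(inv_logit(0.0))
print(round(logit(inv_logit(1.3)), 6))
```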
Problem 5.
Your friend Alice generates a dataset composed of covariates Xi, where i = 1, ..., m, and one target variable Y.
Alice generated the data using the following formula: Y = β0∗ + β1∗ X1 + ... + βm∗ Xm + ε, where β∗ is a vector
of predefined coefficients, and ε ∼ N(0, σ²). Alice gives you β∗ and the entire dataset, but she leaves out the
target variable Y. Using the data and β∗, you can perfectly reconstruct Y.
Multiple choice questions
Problem 6.
Consider a randomized controlled experiment. Let pa be the proportion of always-treats, pc the proportion of
compliers, and pn the proportion of never-treats in the population. In terms of pa , pn and pc , approximately
what proportion of the control group does not receive treatment?
(a) pa + pn
(b) pa + pc
(c) pn + pc
(d) pa + pc + pn
Problem 7.
Suppose we obtain the following results by running a (unregularized) linear regression:
Which of the following models is most likely to be the result of fitting a regression using the same data and
model formula, but with an L2 (ridge) penalty?
(a) Ŷ = 9.176 + 0.021X1 + 0.623X2 + 0.138X3
(b) Ŷ = 3 + 7X1 + 27X2 + 55X3
(c) Ŷ = 8.456 + 0.000X1 + 0.000X2 + 0.918X3
(d) Ŷ = 0.305 + 0.710X1 + 2.725X2 + 5.523X3
Problem 8.
We fit a regression using the formula Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + ε and estimated the following
coefficients:
Coefficient Estimate
β0 1.8
β1 0
β2 1.5
β3 0
β4 0
Which of the following regression models is most likely to have been used?
(a) Linear regression
(b) L1 regularized linear regression (lasso)
(c) L2 regularized linear regression (ridge)
(d) Logistic regression
Problem 9.
People who participate in job training programs are more likely to be unemployed after completing the
program compared to those who do not participate. Based on this information, which of the following is
most reasonable?
(a) Job training programs decrease likelihood of employment
(b) People who are more likely to be employed are more likely to attend job training programs
(c) People who are less likely to be employed are more likely to attend job training programs
(d) People who are less likely to be employed are less likely to attend job training programs
Problem 10.
We are given a data frame named df with four columns:
• salary is a continuous variable.
• age is a factor variable with 5 levels.
• edu is a factor variable with 4 levels.
• experience is a continuous variable.
We then fit a linear regression in R using the following code:
my_model <- lm(salary ~ 1 + age + edu + experience, data = df)
How many coefficients, including the intercept, will we have in my_model? (Assume there is no missing data
in the data frame.)
(a) 4
(b) 9
(c) 10
(d) 11
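For background on how a factor enters a regression formula: under R's default treatment contrasts, a factor with k levels contributes k − 1 indicator columns when the model includes an intercept, because one level serves as the baseline. The Python sketch below mimics that coding on a made-up data set. The variable names region, size, and distance and the helper design_columns are hypothetical, and the level counts deliberately differ from those in the problem above.

```python
def design_columns(factors: dict[str, list[str]], continuous: list[str]) -> list[str]:
    """Name the columns of a treatment-coded design matrix with an intercept.

    `factors` maps each factor name to its ordered list of levels; the first
    level of each factor is the baseline and therefore gets no column.
    """
    columns = ["(Intercept)"]
    for name, levels in factors.items():
        # One indicator column per non-baseline level.
        columns += [f"{name}{level}" for level in levels[1:]]
    # Each continuous covariate contributes exactly one column.
    columns += continuous
    return columns


# Hypothetical example: a 3-level factor, a 2-level factor, one continuous covariate.
cols = design_columns(
    {"region": ["north", "south", "west"], "size": ["small", "large"]},
    ["distance"],
)
print(cols)
print(len(cols))
```

Counting the entries of cols gives the number of coefficients such a model would estimate for this toy specification.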
Problem 11.
Suppose you are given observations Y1 , . . . , Yn and covariates X1 , . . . , Xn . Suppose you run linear regression
with only an intercept term. What would be the resulting value of R2 ?
(a) 1
(b) 0.5
(c) 0
(d) 0.8
Problem 12.
A car manufacturer is trying to understand the correlation between the fuel efficiency and price of one of
their vehicles. Using a dataset consisting of prior year sales volume (sales) and fuel efficiency (mpg) as
well as horsepower (hp) for a range of different vehicles, they run the following regression in R and find the
subsequent output:
lm(formula = price ~ 1 + mpg + I(mpg^2) + mpg:hp, data = df)
...
Coefficients:
Estimate ...
(Intercept) 26690.0 ...
mpg 271.0 ...
I(mpg^2) 5.0 ...
mpg:hp 0.2 ...
...
Suppose the current fuel efficiency of the vehicle under consideration is 20 mpg, and its horsepower is 200.
Which of the following is correct?
(a) A one unit increase in mpg is associated with a 271 increase in price.
(b) A one unit increase in mpg is associated with a 276 increase in price.
(c) A one unit increase in mpg is associated with a 316 increase in price.
(d) A one unit increase in mpg is associated with a 516 increase in price.
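When a formula contains a squared term and an interaction, the change in the fitted value for a one-unit increase in a predictor depends on the starting value and on the other covariate. The Python sketch below works this out for a made-up model ŷ = b0 + b1·x + b2·x² + b3·x·z. The coefficients and the function names predict, unit_change, and slope_at are hypothetical and are not taken from the output above.

```python
def predict(x: float, z: float,
            b0: float = 10.0, b1: float = 2.0,
            b2: float = 0.5, b3: float = 0.25) -> float:
    """Fitted value of the hypothetical model b0 + b1*x + b2*x^2 + b3*x*z."""
    return b0 + b1 * x + b2 * x ** 2 + b3 * x * z


def unit_change(x: float, z: float) -> float:
    """Exact change in the fitted value when x goes from x to x + 1, z fixed."""
    return predict(x + 1.0, z) - predict(x, z)


def slope_at(x: float, z: float,
             b1: float = 2.0, b2: float = 0.5, b3: float = 0.25) -> float:
    """Instantaneous slope (derivative with respect to x) at the point (x, z)."""
    return b1 + 2.0 * b2 * x + b3 * z


# The one-unit change and the derivative differ because of the squared term.
print(unit_change(4.0, 30.0))
print(slope_at(4.0, 30.0))
```

The gap between the two printed values equals b2, the contribution of the squared term over a full unit step.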
Suppose we are interested in predicting loan defaults, which happens when a borrower fails to repay a loan.
In collaboration with a local bank, we collect the following information regarding loans in the past 10 years:
• default: a binary indicator of whether or not a loan defaulted, where 1 indicates default and 0 indicates
repayment
• age: age of the borrower, a categorical variable with the following levels:
– 20-29
– 30-44
– 45-64
– 65+
• sex: categorical variable, either female or male
• income: a continuous variable indicating the borrower’s income
After fitting a logistic regression model using all available covariates to predict default, we obtain the following
output:
Problem 13.
All else equal, which age group is most likely to default on a loan?
(a) 20-29
(b) 30-44
(c) 45-64
(d) 65+
Problem 14.
Suppose we have a male and a female borrower who are both 50 years old and make $60k per year. The odds
of the male borrower defaulting are s × the odds of the female borrower defaulting. What is the correct value
of s?
(a) −0.143
(b) 0.143
(c) exp(0.143)
(d) logit⁻¹(0.304 + 0.143)
Problem 15.
Which of the following statements about prediction intervals is false?
(a) Prediction intervals quantify the uncertainty around a specific response.
(b) For a given X = x, the prediction interval is the same size or larger than the mean confidence interval
of the same significance level.
(c) For a given X = x, the prediction interval is generally smaller when x is closer to the mean of the Xi’s.
(d) For a given X = x, a 95% prediction interval is theoretically guaranteed to cover 95% or more of the Ys
observed in the data with the same X.
Problem 16.
After fitting the linear regression Y = β0 + β1 X1 + β2 X2 + ε on a dataset, we find β̂2 = 2 (i.e., the coefficient
for X2 is 2).
What is the correct interpretation of this result?
(a) Having X1 fixed, for every unit increase in X2 , Y increases by 2 on average.
(b) Having X1 fixed, for every unit increase in Y , X2 increases by 2 on average.
(c) Having X1 fixed, for every unit increase in X2 , Y increases by 1/2 on average.
(d) Having X1 fixed, for every unit increase in X2 , Y increases by 1 on average.
Suppose we are conducting a randomized experiment to test the effect of a new drug on an individual’s
survival rate. Only people in the treatment group are given the chance to take the new drug. We record our
observations in the following table. The “–” in the table represents unobserved values.
Problem 17.
What is the estimated number of never-treats in the control group? (Note that the total numbers of people in
the treatment and control groups are different.)
(a) 10
(b) 20
(c) 30
(d) 60
Problem 18.
What is the estimated survival rate among never-treats in the control group?
(a) 50%
(b) 53.3%
(c) 75%
(d) Not enough information to compute
Problem 19.
Which of the following relationships is NOT a linear regression model?
(a) yi = β0 + β1 xi + εi
(b) yi = β0 + β1 · xi/(1 + xi) + εi
(c) yi = β0 + β1 log(xi ) + εi
(d) yi = β0 + (β1 xi)/(β2 + xi) + εi
Problem 20.
Suppose you build a logistic regression model to infer whether a message is spam. The outcome is a binary
variable Yi ∈ {0, 1} where 0 denotes that the ith message is not spam and 1 denotes the ith message is spam.
You collect a dataset of 1600 messages and use your model to make predictions, which yields the following
results:
(b) 200/(200 + 650)
(c) 200/(200 + 400)
(d) (350 + 200)/1600
Answer sheet
Name:
True/false questions
Fill in the circle of the correct answer. (T = true, F = false)
1 T F
2 T F
3 T F
4 T F
5 T F
6 a b c d 16 a b c d
7 a b c d 17 a b c d
8 a b c d 18 a b c d
9 a b c d 19 a b c d
10 a b c d 20 a b c d
11 a b c d
12 a b c d
13 a b c d
14 a b c d
15 a b c d