
MASTER IN BIG DATA ANALYTICS. UC3M.

BIG DATA INTELLIGENCE EXAM – JANUARY 2023

12 questions, all carrying the same weight. Duration: 1 hour and 50 minutes.

1. With regard to the K-nearest neighbor method, for which values of K is overfitting more likely? Large values or small values? Why?

Answer: small values of K. In the extreme case of K=1, each instance is classified according to its single closest neighbor. If that neighbor is noisy, this is equivalent to memorizing noisy instances, which is one of the signs of overfitting. Another possible answer is that more complex models are more likely to overfit, and for small values of K the classification boundaries in feature space are more complex than for large values of K.
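A minimal sketch of this effect (assuming scikit-learn's KNeighborsClassifier on a noisy synthetic dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)  # 20% label noise
for k in (1, 25):
    model = KNeighborsClassifier(n_neighbors=k)
    train_acc = model.fit(X, y).score(X, y)             # K=1 memorizes the noise (train accuracy 1.0)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # but generalizes worse than larger K
    print(k, round(train_acc, 2), round(cv_acc, 2))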

2. Gradient Boosting produces ensembles of models. What is the first model in the ensemble if the loss to be optimized is Mean Absolute Error? Why?

Answer: GB constructs ensembles by iteratively adding new base models in order to improve results. The first model in the ensemble is the simplest of models: a constant. If the loss is the MAE, that first model is just the median of the target values, because the median is the constant c that minimizes sum_i |y_i - c|.
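A quick numerical check (a NumPy sketch) that the median is the constant minimizing the MAE:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 5.0, 10.0])
candidates = np.linspace(0, 10, 1001)
mae = [np.mean(np.abs(y - c)) for c in candidates]
print(candidates[int(np.argmin(mae))], np.median(y))  # both print 3.0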

3. Is successive halving sensitive to the number of resources given to the candidate hyper-parameters in the first iteration? Why? (sensitive = results may depend a lot on that factor).

Answer: it is very sensitive, because if the number of resources (e.g. number of training instances, training time, etc.) is not enough for a hyper-parameter configuration to show its worth, it will be (wrongly) discarded. For example, in the extreme case where the initial budget is just one training instance, selection of the best hyper-parameters in the first iteration will happen basically by chance.
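A hedged sketch with scikit-learn's HalvingRandomSearchCV, whose min_resources parameter controls exactly this initial budget (the halving search is still experimental, hence the explicit enable import):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=2000, random_state=0)
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    {'max_depth': [2, 4, 8, 16, None]},
    resource='n_samples',
    min_resources=500,  # too small a budget here risks discarding good candidates early
    random_state=0,
).fit(X, y)
print(search.best_params_)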

4. Explain briefly how target mean categorical encoding with leave-one-out regularization works.

Answer: target mean categorical encoding (TMCE) is a method for encoding categorical variables as numbers. It is appropriate for high-cardinality categorical variables, for which one-hot encoding may produce bad results. For two-class classification problems, TMCE encodes each categorical value as the probability of the “1” class among the instances with that categorical value. For regression problems, the encoding is the average of the response variable.
If used as just described, the problem with TMCE is that data leakage occurs from the response value of an instance to its predictors. To avoid this leakage, leave-one-out regularization is used: the average of the response value is computed using all the instances except the one being encoded.
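A minimal sketch of the leave-one-out computation with pandas (subtracting each row's own target before averaging):

import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'A', 'B', 'B'],
                   'y':    [1,   0,   1,   0,   0]})
grp = df.groupby('city')['y']
sums, counts = grp.transform('sum'), grp.transform('count')
df['city_loo'] = (sums - df['y']) / (counts - 1)  # mean of the other instances with the same value
# (categories seen only once yield NaN and need a fallback, e.g. the global mean)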

5. What is an impostor in the Large Margin KNN Supervised Metric Learning method?

Answer: in LMNN, each training instance has a set of target neighbors: nearby instances of the same class that should stay close under the learned metric. An impostor is an instance of a different class that lies closer to the training instance than a target neighbor plus a safety margin, i.e. it invades the margin. LMNN learns a transformation of the feature space that pulls target neighbors closer while pushing impostors out of the margin.
6. Explain one disadvantage of permutation-based feature selection.

Answer: one disadvantage is its behavior with correlated features: when a feature is permuted, the model can still recover the same information from its correlated partners, so the measured drop in performance is small and the feature (wrongly) appears irrelevant. Another disadvantage is the computational cost, since the model has to be re-evaluated once per feature (and per repetition).
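A hedged sketch of the correlated-features pitfall using scikit-learn's permutation_importance (columns 0 and 1 are near-duplicates, so each one looks less important than the information they jointly carry):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x = rng.randn(500)
X = np.column_stack([x, x + 0.01 * rng.randn(500), rng.randn(500)])  # columns 0 and 1 are correlated
y = (x > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # the informative signal's importance is split across columns 0 and 1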

7. Let us suppose that a machine learning method has two hyper-parameters: a and b. The maximum possible value of b depends on the value of a. What hyper-parameter optimization technique would you use for this problem? Random Search or Optuna? Why?

Answer: Optuna. Of course, Optuna usually works better because it focuses on the more promising regions of the hyper-parameter space (although it has other disadvantages, such as larger computation requirements). However, that is true in general. For the particular case of this question, the main reason for selecting Optuna is that Random Search only allows fixed search spaces, while Optuna's spaces can be redefined at each iteration. For instance, let us suppose that when a=1, the range of values of b is [0, 10]; when a=2, [0, 20]; when a=3, [0, 30]; etc. For Random Search, we could use the following space:

{'a': [1, 10], 'b': [0, 100]}

Many of the values explored by Random Search would be infeasible, hence wasting time. For Optuna, we could define the following function to be optimized, which dynamically changes the search space during the search:

def objective(trial):
    a = trial.suggest_float('a', 1, 10)
    # the search space of b is defined in terms of the value just sampled for a
    b = trial.suggest_float('b', 0, 10 * a)
    return (a - 5) ** 2 + b  # placeholder loss; in practice, train and evaluate the model here
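A sketch of how the search is then launched (standard Optuna usage):

import optuna
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(study.best_params)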

8. What is the main idea behind the F-value (f_classif) method for feature
selection in classification problems? (Drawings can be used in the explanation).

Answer: the main idea behind the F-value for classification is that a feature is relevant if it separates well the average values of the classes and the spread within each class is small. More in detail, the F-value is the ratio of the between-class variance (how far apart the class means are) to the within-class variance (how spread out each class is around its own mean); the larger the ratio, the more relevant the feature.
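A minimal sketch with scikit-learn's f_classif (the target shifts the class means of feature 0, which should therefore get the largest F-value):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = rng.randint(0, 2, 200)
X[:, 0] += 3 * y  # separate the class means of feature 0

F, pval = f_classif(X, y)                             # one F-value (and p-value) per feature
X_best = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(np.argmax(F))  # 0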
9. Explain in detail how the KNN-based imputation method works.

Answer: two key ideas: (1) use KNN to predict values for the attribute with the missing value, based on the average (regression) or majority value (classification) of that attribute among the K neighbors; (2) use a distance measure that can cope with the possibility that other attributes of the neighbors also have missing values (e.g. computing the distance only over the attributes present in both instances).
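A minimal sketch with scikit-learn's KNNImputer, which implements both ideas (it averages the attribute over the K nearest neighbors and uses a NaN-aware Euclidean distance):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)  # missing value replaced by the mean over the 2 nearest neighbors
X_imputed = imputer.fit_transform(X)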

10. Let’s suppose that D is our available data and that we intend to train and
evaluate a model using the most relevant attributes only. We follow this
workflow: 1) a feature selection method is used to select the 5 best attributes;
2) then data is divided into a training partition and a testing partition; 3) Finally,
a model is trained with the training partition and evaluated with the test
partition. Is this workflow correct? Why?
Answer: it is wrong, because if feature selection is done before splitting into train and test, the most relevant features are selected using information from instances that will later belong to the test partition. Therefore, the model trained with those selected features will contain some information about the testing set, and the evaluation of the model on that testing partition will be optimistically biased (data/information leakage).
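A minimal sketch of the correct order (split first; feature selection fitted on the training partition only, here via a scikit-learn pipeline):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
pipe.fit(X_train, y_train)           # feature selection sees only the training data
print(pipe.score(X_test, y_test))    # unbiased test estimate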

11. For what kind of problems is training ‘Extremely Randomized Trees’ going to be much faster than ‘Random Forests’ (assuming similar conditions, such as having the same number of trees in the ensemble, same computer, etc.)? Why?

Answer: for problems with lots of numerical features. The most time-consuming step when training the trees of a Random Forest ensemble is finding the best split threshold (because almost all possible thresholds of each numerical attribute are considered and evaluated). Given that ERT's thresholds are random values, much less computation is required. Empirically, it has been shown that ERTs perform similarly to RFs while using much less compute.

It is also the case that RFs use bootstrapping to obtain the training sample for each tree, while ERTs always use the complete dataset (hence some time is saved by not generating bootstrap samples). However, this is less important than the threshold issue discussed above.
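A minimal sketch comparing the two (assuming scikit-learn's implementations and a dataset with many numerical features):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    start = time.time()
    Model(n_estimators=100, random_state=0).fit(X, y)
    print(Model.__name__, round(time.time() - start, 2), 'seconds')  # ERT should train noticeably faster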

12. For what kinds of machine learning problems and algorithms is the ‘thresholding’ technique useful? How does it work?

Answer: thresholding means selecting the optimal decision threshold for some particular metric, by using a systematic process that varies the threshold of an already trained model and evaluates the results on an independent validation set. It applies to algorithms that output class probabilities or scores, since a threshold is needed to turn those scores into class labels. It is appropriate, for instance, for imbalanced classification problems, because the default threshold (typically 0.5) might not be the most appropriate for the metric of interest.
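A minimal sketch of threshold selection on a validation set (assuming an imbalanced problem and F1 as the metric of interest):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)  # imbalanced classes
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, probs >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]  # threshold that maximizes F1 on the validation set
print(best)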
