
MASTER IN BIG DATA ANALYTICS. UC3M.

BIG DATA INTELLIGENCE EXAM – JANUARY 2023

12 questions, all carrying the same weight. Duration: 1 hour and 50 minutes.

1. With regard to the K-nearest neighbor method, for which values of K is overfitting more likely? Large values or small values? Why?

Answer: small values of K. In the extreme case of K=1, each instance is classified according to its single closest neighbor. If that neighbor is noisy, this is equivalent to memorizing noisy instances, which is one of the signs of overfitting. Another possible answer is that more complex models are more likely to overfit, and for small values of K the classification boundaries in feature space are more complex than for large values of K.
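A minimal sketch of this effect (assuming scikit-learn's KNeighborsClassifier on a noisy synthetic dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)  # 20% label noise
for k in (1, 25):
    model = KNeighborsClassifier(n_neighbors=k)
    train_acc = model.fit(X, y).score(X, y)             # K=1 memorizes the noise (train accuracy 1.0)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # but generalizes worse than larger K
    print(k, round(train_acc, 2), round(cv_acc, 2))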

2. Gradient Boosting produces ensembles of models. What is the first model in the ensemble if the loss to be optimized is Mean Absolute Error? Why?

Answer: GB constructs ensembles by iteratively adding new base models in order to improve results. The first model in the ensemble is the simplest of models: a constant. If the loss is the MAE, that first model is just the median of the target values, because the median is the constant c that minimizes sum_i |y_i - c|.
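A quick numerical check (a NumPy sketch) that the median is the constant minimizing the MAE:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 5.0, 10.0])
candidates = np.linspace(0, 10, 1001)
mae = [np.mean(np.abs(y - c)) for c in candidates]
print(candidates[int(np.argmin(mae))], np.median(y))  # both print 3.0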

3. Is successive halving sensitive to the number of resources given to the candidate hyper-parameters in the first iteration? Why? (sensitive = results may depend a lot on that factor).

Answer: it is very sensitive, because if the number of resources (e.g. number of training instances, training time, etc.) is not enough for a hyper-parameter configuration to show its worth, it will be (wrongly) discarded. For example, in the extreme case where the initial budget is just one training instance, selection of the best hyper-parameters in the first iteration will happen basically by chance.
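A hedged sketch with scikit-learn's HalvingRandomSearchCV, whose min_resources parameter controls exactly this initial budget (the halving search is still experimental, hence the explicit enable import):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=2000, random_state=0)
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    {'max_depth': [2, 4, 8, 16, None]},
    resource='n_samples',
    min_resources=500,  # too small a budget here risks discarding good candidates early
    random_state=0,
).fit(X, y)
print(search.best_params_)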

4. Explain briefly how target mean categorical encoding with leave-one-out regularization works.

Answer: target mean categorical encoding (TMCE) is a method for encoding categorical variables as numbers. It is appropriate for high-cardinality categorical variables, for which one-hot encoding may produce bad results. For two-class classification problems, TMCE encodes each categorical value as the probability of the “1” class among the instances with that categorical value. For regression problems, the encoding is the average of the response variable.
If used as just described, the problem with TMCE is that data leakage occurs from the response value of an instance to its predictors. To avoid this leakage, leave-one-out regularization is used: the average of the response value is computed using all the instances except the one being encoded.
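A minimal sketch of the leave-one-out computation with pandas (subtracting each row's own target before averaging):

import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'A', 'B', 'B'],
                   'y':    [1,   0,   1,   0,   0]})
grp = df.groupby('city')['y']
sums, counts = grp.transform('sum'), grp.transform('count')
df['city_loo'] = (sums - df['y']) / (counts - 1)  # mean of the other instances with the same value
# (categories seen only once yield NaN and need a fallback, e.g. the global mean)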

5. What is an impostor in the Large Margin KNN Supervised Metric Learning method?

Answer: in LMNN, each training instance has a set of target neighbors: nearby instances of the same class that should stay close under the learned metric. An impostor is an instance of a different class that lies closer to the training instance than a target neighbor plus a safety margin, i.e. it invades the margin. LMNN learns a transformation of the feature space that pulls target neighbors closer while pushing impostors out of the margin.
6. Explain one disadvantage of permutation-based feature selection.

Answer: one disadvantage is its behavior with correlated features: when a feature is permuted, the model can still recover the same information from its correlated partners, so the measured drop in performance is small and the feature (wrongly) appears irrelevant. Another disadvantage is the computational cost, since the model has to be re-evaluated once per feature (and per repetition).
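A hedged sketch of the correlated-features pitfall using scikit-learn's permutation_importance (columns 0 and 1 are near-duplicates, so each one looks less important than the information they jointly carry):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x = rng.randn(500)
X = np.column_stack([x, x + 0.01 * rng.randn(500), rng.randn(500)])  # columns 0 and 1 are correlated
y = (x > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # the informative signal's importance is split across columns 0 and 1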

7. Let us suppose that a machine learning method has two hyper-parameters: a and b. The maximum possible value of b depends on the value of a. What hyper-parameter optimization technique would you use for this problem? Random Search or Optuna? Why?

Answer: Optuna. Of course, Optuna usually works better because it focuses on the more promising regions of the hyper-parameter space (although it has other disadvantages, such as larger computation requirements). However, that is true in general. For the particular case of this question, the main reason for selecting Optuna is that Random Search only allows fixed search spaces, while Optuna's spaces can be redefined at each iteration. For instance, let us suppose that when a=1, the range of values of b is [0, 10]; when a=2, [0, 20]; when a=3, [0, 30]; etc. For Random Search, we could use the following space:

{'a': [1, 10], 'b': [0, 100]}

Many of the values explored by Random Search would be infeasible, hence wasting time. For Optuna, we could define the following function to be optimized, which dynamically changes the search space during the search:

def objective(trial):
    a = trial.suggest_float('a', 1, 10)
    # the search space of b is defined in terms of the value just sampled for a
    b = trial.suggest_float('b', 0, 10 * a)
    return (a - 5) ** 2 + b  # placeholder loss; in practice, train and evaluate the model here
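A sketch of how the search is then launched (standard Optuna usage):

import optuna
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(study.best_params)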

8. What is the main idea behind the F-value (f_classif) method for feature
selection in classification problems? (Drawings can be used in the explanation).

Answer: the main idea behind the F-value for classification is that a feature is relevant if it separates well the average values of the classes and the spread within each class is small. More in detail, the F-value is the ratio of the between-class variance (how far apart the class means are) to the within-class variance (how spread out each class is around its own mean); the larger the ratio, the more relevant the feature.
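A minimal sketch with scikit-learn's f_classif (the target shifts the class means of feature 0, which should therefore get the largest F-value):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = rng.randint(0, 2, 200)
X[:, 0] += 3 * y  # separate the class means of feature 0

F, pval = f_classif(X, y)                             # one F-value (and p-value) per feature
X_best = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(np.argmax(F))  # 0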
9. Explain in detail how the KNN-based imputation method works.

Answer: two key ideas: (1) use KNN to predict values for the attribute with the missing value, based on the average (regression) or majority value (classification) of that attribute among the K neighbors; (2) use a distance measure that can cope with the possibility that other attributes of the neighbors also have missing values (e.g. computing the distance only over the attributes present in both instances).
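A minimal sketch with scikit-learn's KNNImputer, which implements both ideas (it averages the attribute over the K nearest neighbors and uses a NaN-aware Euclidean distance):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)  # missing value replaced by the mean over the 2 nearest neighbors
X_imputed = imputer.fit_transform(X)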

10. Let’s suppose that D is our available data and that we intend to train and
evaluate a model using the most relevant attributes only. We follow this
workflow: 1) a feature selection method is used to select the 5 best attributes;
2) then data is divided into a training partition and a testing partition; 3) Finally,
a model is trained with the training partition and evaluated with the test
partition. Is this workflow correct? Why?
Answer: it is wrong, because if feature selection is done before splitting into train and test, the most relevant features are selected using information from instances that will later belong to the test partition. Therefore, the model trained with those selected features will contain some information about the testing set, and the evaluation of the model on that testing partition will be optimistically biased (data/information leakage).
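A minimal sketch of the correct order (split first; feature selection fitted on the training partition only, here via a scikit-learn pipeline):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
pipe.fit(X_train, y_train)           # feature selection sees only the training data
print(pipe.score(X_test, y_test))    # unbiased test estimate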

11. For what kind of problems is training ‘Extremely Randomized Trees’ going to be much faster than ‘Random Forests’ (assuming similar conditions, such as having the same number of trees in the ensemble, same computer, etc.)? Why?

Answer: for problems with lots of numerical features. The most time-consuming step when training the trees of a Random Forest ensemble is finding the best split threshold (because almost all possible thresholds of each numerical attribute are considered and evaluated). Given that ERT's thresholds are random values, much less computation is required. Empirically, it has been shown that ERTs perform similarly to RFs while using much less compute.

It is also the case that RFs use bootstrapping to obtain the training sample for each tree, while ERTs always use the complete dataset (hence some time is saved by not generating bootstrap samples). However, this is less important than the threshold issue discussed above.
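A minimal sketch comparing the two (assuming scikit-learn's implementations and a dataset with many numerical features):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    start = time.time()
    Model(n_estimators=100, random_state=0).fit(X, y)
    print(Model.__name__, round(time.time() - start, 2), 'seconds')  # ERT should train noticeably faster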

12. For what kinds of machine learning problems and algorithms is the ‘thresholding’ technique useful? How does it work?

Answer: thresholding means selecting the optimal decision threshold for some particular metric, by using a systematic process that varies the threshold of an already trained model and evaluates the results on an independent validation set. It applies to algorithms that output class probabilities or scores, since a threshold is needed to turn those scores into class labels. It is appropriate, for instance, for imbalanced classification problems, because the default threshold (typically 0.5) might not be the most appropriate for the metric of interest.
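A minimal sketch of threshold selection on a validation set (assuming an imbalanced problem and F1 as the metric of interest):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)  # imbalanced classes
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, probs >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]  # threshold that maximizes F1 on the validation set
print(best)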
