MLGS 2021 Retake
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Working instructions
• DO NOT SUBMIT THIS SHEET! ONLY SUBMIT YOUR PERSONALIZED ANSWER SHEET THAT IS DISTRIBUTED THROUGH TUMEXAM!
• Make sure that you solve the version of the problem stated on your personalized answer sheet (e.g., Problem 1 (Version B), Problem 2 (Version A), etc.)
Problem 1: Normalizing Flows (Version A) (6 credits)
Problem 1: Normalizing Flows (Version B) (6 credits)
We consider two transformations f_1, f_2 : R^3 → R^3 for use in normalizing flows. Let

f_1(z) = \begin{pmatrix} (z_2)^3 \\ (1 + \max(0, z_2)) \cdot z_1 - \min(0, z_2) \cdot z_3 \\ z_1 - z_3 \end{pmatrix} \quad \text{and} \quad f_2(z) = \begin{pmatrix} (z_1)^3 \\ \ln(1 + |z_3|) \cdot \exp(z_2) \\ z_1 \cdot z_3 \end{pmatrix}.

Prove or disprove whether f_1 and/or f_2 are invertible.
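As a quick numerical sanity check (not a substitute for a proof), the sketch below implements the two maps above, an explicit candidate inverse for f_1, and a candidate collision pair for f_2; the random samples and concrete values are illustrative assumptions.

```python
import numpy as np

def f1(z):
    z1, z2, z3 = z
    return np.array([z2**3,
                     (1 + max(0.0, z2)) * z1 - min(0.0, z2) * z3,
                     z1 - z3])

def f1_inv(x):
    # Recover z2 from the first output, then solve the remaining 2x2 linear system.
    x1, x2, x3 = x
    z2 = np.cbrt(x1)
    a, b = 1 + max(0.0, z2), -min(0.0, z2)   # a + b = 1 + |z2| > 0
    z1 = (x2 + b * x3) / (a + b)
    return np.array([z1, z2, z1 - x3])

def f2(z):
    z1, z2, z3 = z
    return np.array([z1**3,
                     np.log1p(abs(z3)) * np.exp(z2),
                     z1 * z3])

rng = np.random.default_rng(0)
for _ in range(1000):
    z = rng.normal(size=3)
    assert np.allclose(f1_inv(f1(z)), z)     # candidate inverse round-trips on random samples

# Candidate collision for f2: when z1 = 0, the sign of z3 is lost.
print(f2(np.array([0.0, 0.0, 1.0])), f2(np.array([0.0, 0.0, -1.0])))
```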
Problem 1: Normalizing Flows (Version C) (6 credits)
Problem 1: Normalizing Flows (Version D) (6 credits)
We consider two transformations f_1, f_2 : R^3 → R^3 for use in normalizing flows. Let

f_1(z) = \begin{pmatrix} (z_2)^3 \\ \min(0, z_2) \cdot z_3 + (1 + \max(0, z_2)) \cdot z_1 \\ z_1 + 2 \cdot z_3 \end{pmatrix} \quad \text{and} \quad f_2(z) = \begin{pmatrix} (z_2)^3 \\ z_2 \cdot |z_1| \\ (z_1)^3 \cdot \exp(z_3) \end{pmatrix}.

Prove or disprove whether f_1 and/or f_2 are invertible.
Problem 2: Variational Inference (Version A) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 5, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 5)^2}{2\theta^2}\right),

q_\phi(z) = \mathcal{N}(z; \phi, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \phi)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

\mathcal{L}(\theta, q_\phi) = \mathbb{E}_{z \sim q_\phi}\left[ \log p_\theta(x, z) - \log q_\phi(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.
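The ELBO in the reminder can also be estimated by Monte Carlo sampling, which is a handy way to sanity-check a derivation for a) and b). A minimal sketch, assuming a single made-up observation x and a fixed example value of θ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, x = 2.0, 9.0                     # assumed example values: fixed theta, one observation x

def elbo(phi, num_samples=200_000):
    z = rng.normal(phi, 1.0, size=num_samples)                      # z ~ q_phi = N(phi, 1)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z + 5.0, theta)
    log_q = norm.logpdf(z, phi, 1.0)
    return np.mean(log_joint - log_q)

phis = np.linspace(-2.0, 4.0, 25)
values = [elbo(p) for p in phis]
print(phis[int(np.argmax(values))])     # grid maximizer; compare with your analytic answer for b)
```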
Problem 2: Variational Inference (Version B) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; 2z + 4, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - 2z - 4)^2}{2\theta^2}\right),

q_\mu(z) = \mathcal{N}(z; \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \mu)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

\mathcal{L}(\theta, q_\mu) = \mathbb{E}_{z \sim q_\mu}\left[ \log p_\theta(x, z) - \log q_\mu(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.
Problem 2: Variational Inference (Version C) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 3, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 3)^2}{2\theta^2}\right),

q_\phi(z) = \mathcal{N}(z; \phi, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \phi)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

\mathcal{L}(\theta, q_\phi) = \mathbb{E}_{z \sim q_\phi}\left[ \log p_\theta(x, z) - \log q_\phi(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.
Problem 2: Variational Inference (Version D) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 7, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 7)^2}{2\theta^2}\right),

q_\mu(z) = \mathcal{N}(z; \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \mu)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

\mathcal{L}(\theta, q_\mu) = \mathbb{E}_{z \sim q_\mu}\left[ \log p_\theta(x, z) - \log q_\mu(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.
Problem 3: Variational Autoencoder (Version A) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
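One way to see the mismatch numerically (a sketch with made-up parameter values): the exponential pdf is a density on [0, ∞), so evaluating it at the two admissible pixel values does not yield a valid distribution over {0, 1}, whereas a Bernoulli parametrization does.

```python
import numpy as np

lam = 1.5                                     # hypothetical rate, standing in for exp(f_theta(z))_i

exp_pdf = lambda x: lam * np.exp(-lam * x) * (x >= 0)
print(exp_pdf(0.0) + exp_pdf(1.0))            # density values at x in {0, 1}; generally do not sum to 1

# Alternative: a Bernoulli likelihood, e.g. p_i = sigmoid(f_theta(z)_i) (hypothetical logit 0.4).
p = 1.0 / (1.0 + np.exp(-0.4))
print(p + (1 - p))                            # a proper distribution over {0, 1}: sums to 1
```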
Problem 3: Variational Autoencoder (Version B) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 3: Variational Autoencoder (Version C) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 3: Variational Autoencoder (Version D) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 4: Robustness - Convex Relaxation (Version A) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
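For a concrete feasibility check one can sample the graph of LeakyReLU on [l, u] and let a convex-hull routine list the facets. The sketch below, with assumed example values for α, l, u, is only a numerical aid for verifying a hand-derived constraint set, not the answer itself.

```python
import numpy as np
from scipy.spatial import ConvexHull

alpha, l, u = 0.1, -2.0, 3.0                      # assumed example values with l < 0 < u

def leaky_relu(x):
    return np.where(x >= 0, x, alpha * x)

xs = np.linspace(l, u, 50)
pts = np.column_stack([xs, leaky_relu(xs)])       # sample the graph of the activation
hull = ConvexHull(pts)

# Each row (a, b, c) of hull.equations encodes a facet a*x + b*y + c <= 0 of the
# convex hull, i.e. one linear constraint of the relaxation for these l, u, alpha.
print(np.round(hull.equations, 3))
```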
Problem 4: Robustness - Convex Relaxation (Version B) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 4: Robustness - Convex Relaxation (Version C) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 4: Robustness - Convex Relaxation (Version D) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 5: Markov Chain Language Model (Version A) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, like, eat. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters (a counting-based sketch follows the sentence list below).
• I like orange
• I eat orange
• I like I
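A counting-based sketch of the maximum-likelihood fit: initial probabilities and transition probabilities are just normalized counts over the three sentences above.

```python
import numpy as np

vocab = ["I", "orange", "like", "eat"]
idx = {w: i for i, w in enumerate(vocab)}
sentences = [["I", "like", "orange"], ["I", "eat", "orange"], ["I", "like", "I"]]

init_counts = np.zeros(len(vocab))
trans_counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    init_counts[idx[s[0]]] += 1
    for a, b in zip(s, s[1:]):
        trans_counts[idx[a], idx[b]] += 1

pi_hat = init_counts / init_counts.sum()          # most likely initial distribution
row_sums = trans_counts.sum(axis=1, keepdims=True)
# Rows of words never observed as a predecessor are left at zero (their MLE is undetermined).
A_hat = np.divide(trans_counts, row_sums, out=np.zeros_like(trans_counts), where=row_sums > 0)
print(pi_hat, A_hat, sep="\n")
```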
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) I like orange
2) orange eat I
c) Given that the 3rd word X_3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
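A sketch of the computation with purely hypothetical values for the initial distribution π and transition matrix A (the fitted parameters referenced above are not restated here); it only illustrates that P(X_2 = w, X_3 = orange) = P(X_2 = w) · P(X_3 = orange | X_2 = w) gives the unnormalized posterior.

```python
import numpy as np

vocab = ["I", "orange", "like", "eat"]
pi = np.array([0.7, 0.1, 0.1, 0.1])               # hypothetical initial distribution over X1
A = np.array([[0.1, 0.2, 0.4, 0.3],               # hypothetical transition matrix A[w, w']
              [0.4, 0.2, 0.2, 0.2],
              [0.3, 0.5, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1]])

o = vocab.index("orange")
p_x2 = pi @ A                                      # marginal P(X2 = w)
unnorm = p_x2 * A[:, o]                            # P(X2 = w, X3 = orange), unnormalized over w
print(dict(zip(vocab, np.round(unnorm, 4))))
```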
Problem 5: Markov Chain Language Model (Version B) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, see, like. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• I see I
• I like orange
• I see orange
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) I see orange
2) orange like I
c) Given that the 3rd word X_3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 5: Markov Chain Language Model (Version C) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words you, apple, see, like.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• you see apple
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) you see apple
c) Given that the 3rd word X_3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 5: Markov Chain Language Model (Version D) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words they, apple, like, eat.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• they like they
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) they like apple
c) Given that the 3rd word X_3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 6: Neural Sequence Models (Version A) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. RNN with positional encoding
2. Transformer with positional encoding
3. Transformer without positional encoding
4. Dilated causal convolution with 2 hidden layers. We set dilation size to 2.
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
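As an illustration of why positional information matters here, the sketch below (random weights, a single attention layer, no claim about any particular lecture architecture) shows that self-attention without positional encodings is permutation-equivariant, so the last hidden state cannot tell which input was x_1.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))   # row-wise softmax over keys
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

n, d = 12, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs only permutes the outputs: without positional encodings the
# encoder has no notion of which element of the sequence came first.
print(np.allclose(out[perm], out_perm))
```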
Problem 6: Neural Sequence Models (Version B) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 6: Neural Sequence Models (Version C) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Transformer with positional encoding
2. Transformer without positional encoding
3. Multilayer neural network that takes a vector in R^n as input (all numbers concatenated) and outputs R^{n×h}
4. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 6: Neural Sequence Models (Version D) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 7: Temporal Point Process (Version A) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, T ]. We have observed a single sequence that contains n points {t1 , t2 , ... , tn }, ti ∈ [0, T ].
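For reference, the log-likelihood of a homogeneous Poisson process on [0, T] depends on the observations only through their count, log L(µ) = n log µ − µT, so the maximum-likelihood estimate is µ̂ = n/T. A minimal sketch, using the observation window and sequence from Version B of this problem as example data:

```python
import numpy as np

T = 5.0                                           # observation window [0, T] (example value)
events = np.array([0.7, 0.8, 1.5, 2.3, 4.7])      # example sequence (taken from Version B)
n = len(events)

def log_likelihood(mu):
    # Homogeneous Poisson process on [0, T]: the event times enter only through their count n.
    return n * np.log(mu) - mu * T

mu_hat = n / T                                    # maximum-likelihood estimate of the intensity
print(mu_hat, log_likelihood(mu_hat))
```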
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. We estimate µ using data we collected in one year. Our task is to find the least busy 2-hour interval in each day to close down the road for maintenance.
Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model?
Justify your answer.
Problem 7: Temporal Point Process (Version B) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 5]. We have observed a single sequence {0.7, 0.8, 1.5, 2.3, 4.7}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. For each day of the week we estimate the parameter µ using data we collected in one year. That means we have µ_Mon, µ_Tue, ..., µ_Sun, each µ
corresponding to one day of the week. Our task is to find the least busy day of the week to close down the road
for maintenance. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an
alternative model? Justify your answer.
Problem 7: Temporal Point Process (Version C) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 2]. We have observed a single sequence {0.1, 0.8, 1.3, 1.5, 1.7, 1.9}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to estimate the probability that fewer than 100 cars will pass our sensor in a day. Can we use the homogeneous Poisson process
to achieve this? If not, can you suggest an alternative model? Justify your answer.
Problem 7: Temporal Point Process (Version D) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[3, 13]. We have observed a single sequence {3.5, 4.3, 4.5, 7.1, 8.3}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to answer whether fast vehicles get stuck behind slower vehicles. That is, we want to see if observing one vehicle leads to
a few more following behind it. Can we use the homogeneous Poisson process to achieve this? If not, can you
suggest an alternative model? Justify your answer.
Problem 8: Clustering (Version A) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
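A small numerical check of the claimed identity on a toy graph (this does not replace the proof; the graph and the cluster split below are arbitrary choices):

```python
import numpy as np

# Toy undirected graph: cluster C = {0, 1, 2}, complement C_bar = {3, 4}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
C = np.array([True, True, True, False, False])
Cb = ~C

# Graph side: Ncut(C, C_bar) = cut / vol(C) + cut / vol(C_bar).
cut = A[np.ix_(C, Cb)].sum()
ncut = cut / d[C].sum() + cut / d[Cb].sum()

# Random-walk side: Pr(X0 = i) = d_i / vol(V), Pr(X1 = j | X0 = i) = A_ij / d_i.
p0 = d / d.sum()
P = A / d[:, None]
pr_Cb_given_C = (p0[C, None] * P[np.ix_(C, Cb)]).sum() / p0[C].sum()
pr_C_given_Cb = (p0[Cb, None] * P[np.ix_(Cb, C)]).sum() / p0[Cb].sum()

print(np.isclose(ncut, pr_Cb_given_C + pr_C_given_Cb))   # True for this example
```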
Problem 8: Clustering (Version B) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 8: Clustering (Version C) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 8: Clustering (Version D) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 9: Embeddings & Ranking (Version A) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
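For concreteness, the two modified inputs from a) and b) can be constructed as in the sketch below (placeholder data with made-up sizes; which embeddings actually change is exactly what the question asks you to reason about):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3                                   # placeholder graph size and feature dimension
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T                                   # original adjacency matrix (placeholder)
X = rng.random((n, D))                        # original node features (placeholder)

X_prime = np.tile(X[0], (n, 1))               # a): A' = A, identical features for all nodes
A_clique = np.ones((n, n)) - np.eye(n)        # b): clique adjacency A' = 1 - I, X' = X
```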
Problem 9: Embeddings & Ranking (Version B) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 9: Embeddings & Ranking (Version C) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 9: Embeddings & Ranking (Version D) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 10: Semi-Supervised Learning (Version A) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.2 & 0.9 \\ 0.9 & 0.2 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
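To make the setting concrete, such a graph can be sampled directly from the SBM definition; a small sketch (the graph size n is an arbitrary choice) that also exposes the strongly heterophilic structure implied by ν:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                        # arbitrary number of nodes
pi = np.array([0.5, 0.5])
nu = np.array([[0.2, 0.9],
               [0.9, 0.2]])

c = rng.choice(2, size=n, p=pi)                # ground-truth community of each node
P = nu[c[:, None], c[None, :]]                 # edge probability for every node pair
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T                 # undirected graph, no self-loops

# Under this nu, cross-community edges dominate within-community edges.
same = c[:, None] == c[None, :]
print(A[same].sum() // 2, A[~same].sum() // 2)
```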
Problem 10: Semi-Supervised Learning (Version B) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.2 & 0.9 \\ 0.9 & 0.2 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
Problem 10: Semi-Supervised Learning (Version C) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.1 & 0.8 \\ 0.8 & 0.1 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
Problem 10: Semi-Supervised Learning (Version D) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.1 & 0.8 \\ 0.8 & 0.1 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.