
Data Analytics and Machine Learning Group

Department of Informatics
Technical University of Munich

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

(Place student sticker here)

Machine Learning for Graphs and Sequential Data


(Problem sheet)
Graded Exercise: IN2323 / Retake
Date: Thursday 14th October, 2021
Examiner: Prof. Dr. Stephan Günnemann
Time: 14:15 – 15:30

Working instructions
• DO NOT SUBMIT THIS SHEET! ONLY SUBMIT YOUR PERSONALIZED ANSWER SHEET THAT IS
DISTRIBUTED THROUGH TUMEXAM!
• Make sure that you solve the version of the problem stated on your personalized answer sheet (e.g., Problem
1 (Version B), Problem 2 (Version A), etc.)

Problem 1: Normalizing Flows (Version A) (6 credits)

We consider two transformations f1, f2 : R^3 → R^3 for use in normalizing flows. Let

    f1(z) = ( (1 + max(0, z2)) · z1 − min(0, z2) · z3,  (z2)^3,  z1 − z3 )^T   and
    f2(z) = ( (z1)^3,  (z3)^3 · exp(z2),  z1 · |z3| )^T.

Prove or disprove whether f1 and/or f2 are invertible.
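Before writing the proof, a transformation can be sanity-checked numerically. The sketch below (not part of the exam, and no substitute for a proof) uses PyTorch autograd to evaluate the Jacobian determinant of the two Version A transformations at a few random points; a determinant that vanishes somewhere flags points where a map cannot be locally invertible.

    import torch
    from torch.autograd.functional import jacobian

    def f1(z):
        # f1 from Version A: ((1 + max(0, z2))·z1 − min(0, z2)·z3, (z2)^3, z1 − z3)
        z1, z2, z3 = z
        return torch.stack([(1 + torch.clamp(z2, min=0)) * z1 - torch.clamp(z2, max=0) * z3,
                            z2 ** 3,
                            z1 - z3])

    def f2(z):
        # f2 from Version A: ((z1)^3, (z3)^3·exp(z2), z1·|z3|)
        z1, z2, z3 = z
        return torch.stack([z1 ** 3, z3 ** 3 * torch.exp(z2), z1 * torch.abs(z3)])

    torch.manual_seed(0)
    for f in (f1, f2):
        dets = [torch.det(jacobian(f, torch.randn(3))).item() for _ in range(5)]
        print(f.__name__, [round(d, 3) for d in dets])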

Problem 1: Normalizing Flows (Version B) (6 credits)
We consider two transformations f1, f2 : R^3 → R^3 for use in normalizing flows. Let

    f1(z) = ( (z2)^3,  (1 + max(0, z2)) · z1 − min(0, z2) · z3,  z1 − z3 )^T   and
    f2(z) = ( (z1)^3,  ln(1 + |z3|) · exp(z2),  z1 · z3 )^T.

Prove or disprove whether f1 and/or f2 are invertible.

Problem 1: Normalizing Flows (Version C) (6 credits)

We consider two transformations f1, f2 : R^3 → R^3 for use in normalizing flows. Let

    f1(z) = ( min(0, z2) · z3 + (1 + max(0, z2)) · z1,  (z2)^3,  z1 + 2 · z3 )^T   and
    f2(z) = ( ln(1 + |z1|),  (z3)^3 · exp(z2),  z1 · z3 )^T.

Prove or disprove whether f1 and/or f2 are invertible.

Problem 1: Normalizing Flows (Version D) (6 credits)
We consider two transformations f1, f2 : R^3 → R^3 for use in normalizing flows. Let

    f1(z) = ( (z2)^3,  min(0, z2) · z3 + (1 + max(0, z2)) · z1,  z1 + 2 · z3 )^T   and
    f2(z) = ( (z2)^3,  z2 · |z1|,  (z1)^3 · exp(z3) )^T.

Prove or disprove whether f1 and/or f2 are invertible.

Problem 2: Variational Inference (Version A) (7 credits)
Suppose we are given a latent variable model

    p(z) = N(z; 0, 1) = 1/√(2π) · exp(−z^2 / 2)
    p_θ(x | z) = N(x; z + 5, θ^2) = 1/(θ√(2π)) · exp(−(x − z − 5)^2 / (2θ^2))

where x, z ∈ R. We parametrize the variational distribution q_ϕ(z) as:

    q_ϕ(z) = N(z; ϕ, 1) = 1/√(2π) · exp(−(z − ϕ)^2 / 2)

a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

    L(θ, q_ϕ) = E_{z∼q_ϕ}[ log p_θ(x, z) − log q_ϕ(z) ].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X^2] − E[X]^2 can be rewritten as E[X^2] = Var(X) + E[X]^2.

b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.
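As a numerical cross-check of parts a) and b), the ELBO of the Version A model can be estimated by Monte Carlo and its maximum located over a grid of ϕ values. The observation x and the fixed θ below are hypothetical values chosen only for illustration; the maximiser derived by hand should land close to the printed grid optimum.

    import numpy as np

    rng = np.random.default_rng(0)

    def elbo_mc(phi, x, theta, n_samples=100_000):
        # Monte Carlo estimate of E_{z~q_phi}[log p(z) + log p_theta(x|z) - log q_phi(z)]
        z = rng.normal(phi, 1.0, n_samples)
        log_p_z = -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2
        log_p_x_z = -np.log(theta) - 0.5 * np.log(2 * np.pi) - (x - z - 5) ** 2 / (2 * theta ** 2)
        log_q = -0.5 * np.log(2 * np.pi) - 0.5 * (z - phi) ** 2
        return np.mean(log_p_z + log_p_x_z - log_q)

    x, theta = 7.0, 1.5                      # hypothetical observation and fixed theta
    phis = np.linspace(-3.0, 5.0, 81)
    print(phis[np.argmax([elbo_mc(p, x, theta) for p in phis])])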

Problem 2: Variational Inference (Version B) (7 credits)
Suppose we are given a latent variable model

    p(z) = N(z; 0, 1) = 1/√(2π) · exp(−z^2 / 2)
    p_θ(x | z) = N(x; 2z + 4, θ^2) = 1/(θ√(2π)) · exp(−(x − 2z − 4)^2 / (2θ^2))

where x, z ∈ R. We parametrize the variational distribution q_µ(z) as:

    q_µ(z) = N(z; µ, 1) = 1/√(2π) · exp(−(z − µ)^2 / 2)

a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

    L(θ, q_µ) = E_{z∼q_µ}[ log p_θ(x, z) − log q_µ(z) ].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X^2] − E[X]^2 can be rewritten as E[X^2] = Var(X) + E[X]^2.

b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.

Problem 2: Variational Inference (Version C) (7 credits)
Suppose we are given a latent variable model

    p(z) = N(z; 0, 1) = 1/√(2π) · exp(−z^2 / 2)
    p_θ(x | z) = N(x; z + 3, θ^2) = 1/(θ√(2π)) · exp(−(x − z − 3)^2 / (2θ^2))

where x, z ∈ R. We parametrize the variational distribution q_ϕ(z) as:

    q_ϕ(z) = N(z; ϕ, 1) = 1/√(2π) · exp(−(z − ϕ)^2 / 2)

a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

    L(θ, q_ϕ) = E_{z∼q_ϕ}[ log p_θ(x, z) − log q_ϕ(z) ].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X^2] − E[X]^2 can be rewritten as E[X^2] = Var(X) + E[X]^2.

b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.

Problem 2: Variational Inference (Version D) (7 credits)
Suppose we are given a latent variable model

    p(z) = N(z; 0, 1) = 1/√(2π) · exp(−z^2 / 2)
    p_θ(x | z) = N(x; z + 7, θ^2) = 1/(θ√(2π)) · exp(−(x − z − 7)^2 / (2θ^2))

where x, z ∈ R. We parametrize the variational distribution q_µ(z) as:

    q_µ(z) = N(z; µ, 1) = 1/√(2π) · exp(−(z − µ)^2 / 2)

a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

    L(θ, q_µ) = E_{z∼q_µ}[ log p_θ(x, z) − log q_µ(z) ].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X^2] − E[X]^2 can be rewritten as E[X^2] = Var(X) + E[X]^2.

b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.

Problem 3: Variational Autoencoder (Version A) (2 credits)

We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as

    λ = exp(f_θ(z)),

where z ∈ R^L is the latent variable and f_θ : R^L → R^N is the decoder neural network.

2. We obtain the conditional distribution as

    p_θ(x | z) = ∏_{i=1}^N Exponential(x_i | λ_i),

where Exponential(x | λ) is the exponential distribution with probability density function

    Exponential(x_i | λ_i) = λ_i · exp(−λ_i x_i)  if x_i ≥ 0,   and 0 otherwise.

What is the main problem with the above definition of p_θ(x | z)? Explain how we can modify the above definition to fix this problem. Justify your answer.
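For orientation only (a hedged sketch of one possible direction, not the official solution): if one decides that a discrete likelihood is more appropriate for binary pixels, the decoder output can parametrize one Bernoulli distribution per pixel. A minimal PyTorch sketch of such a likelihood, with a placeholder linear layer standing in for the decoder f_θ, could look as follows.

    import torch
    import torch.nn.functional as F

    L, N = 16, 784                          # hypothetical latent and pixel dimensions
    f_theta = torch.nn.Linear(L, N)         # placeholder decoder network

    def log_p_x_given_z(x, z):
        # p_theta(x|z) = prod_i Bernoulli(x_i | sigmoid(f_theta(z)_i))
        logits = f_theta(z)
        return -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)

    x = torch.randint(0, 2, (4, N)).float() # a batch of 4 binary "images"
    z = torch.randn(4, L)
    print(log_p_x_given_z(x, z).shape)      # per-example log-likelihood, shape (4,)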

Problem 3: Variational Autoencoder (Version B) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as

    λ = exp(f_θ(z)),

where z ∈ R^L is the latent variable and f_θ : R^L → R^N is the decoder neural network.

2. We obtain the conditional distribution as

    p_θ(x | z) = ∏_{i=1}^N Exponential(x_i | λ_i),

where Exponential(x | λ) is the exponential distribution with probability density function

    Exponential(x_i | λ_i) = λ_i · exp(−λ_i x_i)  if x_i ≥ 0,   and 0 otherwise.

What is the main problem with the above definition of p_θ(x | z)? Explain how we can modify the above definition to fix this problem. Justify your answer.

Problem 3: Variational Autoencoder (Version C) (2 credits)

We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as

    λ = exp(f_θ(z)),

where z ∈ R^L is the latent variable and f_θ : R^L → R^N is the decoder neural network.

2. We obtain the conditional distribution as

    p_θ(x | z) = ∏_{i=1}^N Exponential(x_i | λ_i),

where Exponential(x | λ) is the exponential distribution with probability density function

    Exponential(x_i | λ_i) = λ_i · exp(−λ_i x_i)  if x_i ≥ 0,   and 0 otherwise.

What is the main problem with the above definition of p_θ(x | z)? Explain how we can modify the above definition to fix this problem. Justify your answer.

Problem 3: Variational Autoencoder (Version D) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as

    λ = exp(f_θ(z)),

where z ∈ R^L is the latent variable and f_θ : R^L → R^N is the decoder neural network.

2. We obtain the conditional distribution as

    p_θ(x | z) = ∏_{i=1}^N Exponential(x_i | λ_i),

where Exponential(x | λ) is the exponential distribution with probability density function

    Exponential(x_i | λ_i) = λ_i · exp(−λ_i x_i)  if x_i ≥ 0,   and 0 otherwise.

What is the main problem with the above definition of p_θ(x | z)? Explain how we can modify the above definition to fix this problem. Justify your answer.

Problem 4: Robustness - Convex Relaxation (Version A) (7 credits)

In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

    LeakyReLU(x) = x    for x ≥ 0,
    LeakyReLU(x) = αx   for x < 0,

with α ∈ (0, 1).

Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^T that model the convex hull of (x, LeakyReLU(x))^T, i.e. whose feasible region is

    { λ · (x1, LeakyReLU(x1))^T + (1 − λ) · (x2, LeakyReLU(x2))^T  |  x1, x2 ∈ [l, u] ∧ λ ∈ [0, 1] }.

Reminder: A linear constraint is an inequality or equality relation between terms that are linear in x and y.
Hint: You will have to make a case distinction to account for different ranges of l and u.
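A candidate constraint set can be checked numerically: sampling the graph of LeakyReLU on [l, u] and computing its convex hull yields facet inequalities that a correct hand-derived set has to reproduce. The values of α, l and u below are hypothetical, and the sketch is meant purely as a verification aid, not as the requested derivation.

    import numpy as np
    from scipy.spatial import ConvexHull

    alpha, l, u = 0.3, -1.0, 2.0           # hypothetical slope and input bounds (the l < 0 < u case)

    def leaky_relu(x):
        return np.where(x >= 0, x, alpha * x)

    xs = np.linspace(l, u, 50)
    pts = np.column_stack([xs, leaky_relu(xs)])
    hull = ConvexHull(pts)
    # Each row of hull.equations is (a1, a2, c) with a1*x + a2*y + c <= 0 on the hull.
    print(np.round(hull.equations, 3))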

Problem 4: Robustness - Convex Relaxation (Version B) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

    LeakyReLU(x) = x    for x ≥ 0,
    LeakyReLU(x) = αx   for x < 0,

with α ∈ (0, 1).

Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^T that model the convex hull of (x, LeakyReLU(x))^T, i.e. whose feasible region is

    { λ · (x1, LeakyReLU(x1))^T + (1 − λ) · (x2, LeakyReLU(x2))^T  |  x1, x2 ∈ [l, u] ∧ λ ∈ [0, 1] }.

Reminder: A linear constraint is an inequality or equality relation between terms that are linear in x and y.
Hint: You will have to make a case distinction to account for different ranges of l and u.

Problem 4: Robustness - Convex Relaxation (Version C) (7 credits)

In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

    LeakyReLU(x) = x    for x ≥ 0,
    LeakyReLU(x) = αx   for x < 0,

with α ∈ (0, 1).

Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^T that model the convex hull of (x, LeakyReLU(x))^T, i.e. whose feasible region is

    { λ · (x1, LeakyReLU(x1))^T + (1 − λ) · (x2, LeakyReLU(x2))^T  |  x1, x2 ∈ [l, u] ∧ λ ∈ [0, 1] }.

Reminder: A linear constraint is an inequality or equality relation between terms that are linear in x and y.
Hint: You will have to make a case distinction to account for different ranges of l and u.

Problem 4: Robustness - Convex Relaxation (Version D) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

    LeakyReLU(x) = x    for x ≥ 0,
    LeakyReLU(x) = αx   for x < 0,

with α ∈ (0, 1).

Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^T that model the convex hull of (x, LeakyReLU(x))^T, i.e. whose feasible region is

    { λ · (x1, LeakyReLU(x1))^T + (1 − λ) · (x2, LeakyReLU(x2))^T  |  x1, x2 ∈ [l, u] ∧ λ ∈ [0, 1] }.

Reminder: A linear constraint is an inequality or equality relation between terms that are linear in x and y.
Hint: You will have to make a case distinction to account for different ranges of l and u.

Problem 5: Markov Chain Language Model (Version A) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, like, eat. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.

Both the initial distribution π = (π1, π2, π3, π4)^T and the rows and columns of the transition matrix A (with entries A11, ..., A44) are indexed in the order I, orange, like, eat.

A_ij specifies the probability of transitioning from state i to state j.

a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• I like orange

• I eat orange

• orange eat orange

• I like I
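A small script like the following (not part of the exam, which expects the counts and resulting fractions by hand) makes the maximum likelihood estimation in part a) explicit: π is estimated from the first words and each row of A from the observed transitions.

    import numpy as np

    words = ["I", "orange", "like", "eat"]
    idx = {w: i for i, w in enumerate(words)}
    sentences = ["I like orange", "I eat orange", "orange eat orange", "I like I"]

    pi_counts = np.zeros(4)
    A_counts = np.zeros((4, 4))
    for s in sentences:
        w = [idx[t] for t in s.split()]
        pi_counts[w[0]] += 1
        for a, b in zip(w[:-1], w[1:]):
            A_counts[a, b] += 1

    pi_hat = pi_counts / pi_counts.sum()
    A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)  # rows without outgoing counts would need extra care
    print(pi_hat)
    print(A_hat)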

For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.

    π = (4/6, 2/6, 0, 0)^T   (order: I, orange, like, eat)

               I      orange   like   eat
    A = I      0      1/6      3/6    2/6
        orange 0      0        2/6    4/6
        like   1/6    3/6      1/6    1/6
        eat    1/6    5/6      0      0

b) Which of the following two sentences is more likely according to the model? Justify your answer.

1) I like orange

2) orange eat I
c) Given that the 3rd word X3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X2. Justify your answer.
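For double-checking the hand computations in parts b) and c), the snippet below evaluates both quantities under the parameters given above: the probability of a three-word sentence is π[w1] · A[w1, w2] · A[w2, w3], and the unnormalized distribution over X2 given X3 multiplies the marginal of X2 (namely π A) with the corresponding column of A. This is only a numerical aid; the exam asks for the justified derivation.

    import numpy as np

    words = ["I", "orange", "like", "eat"]
    idx = {w: i for i, w in enumerate(words)}
    pi = np.array([4/6, 2/6, 0, 0])
    A = np.array([[0,   1/6, 3/6, 2/6],
                  [0,   0,   2/6, 4/6],
                  [1/6, 3/6, 1/6, 1/6],
                  [1/6, 5/6, 0,   0  ]])

    def sentence_prob(sentence):
        w = [idx[t] for t in sentence.split()]
        p = pi[w[0]]
        for a, b in zip(w[:-1], w[1:]):
            p *= A[a, b]
        return p

    print(sentence_prob("I like orange"), sentence_prob("orange eat I"))  # part b)

    p_x2 = pi @ A                              # marginal distribution of the second word
    print(p_x2 * A[:, idx["orange"]])          # part c): unnormalized distribution over X2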

Problem 5: Markov Chain Language Model (Version B) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, see, like. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.

Both the initial distribution π = (π1, π2, π3, π4)^T and the rows and columns of the transition matrix A (with entries A11, ..., A44) are indexed in the order I, orange, see, like.

A_ij specifies the probability of transitioning from state i to state j.

a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• I see I

• I like orange

• orange like orange

• I see orange

For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.

4  I orange see like


I /6 1 3 2
I
0 /6 /6 /6 
orange 2/6 2 4
π= 0 A= orange 0 0 /6 /6 
see 1 3 1 1 
see /6 /6 /6 /6
like 0 1 5
like /6 /6 0 0

0 b) Which of the following two sentences is more likely according to the model? Justify your answer.
1
2
1) I see orange

2) orange like I

c) Given that the 3rd word X3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X2. Justify your answer.

Problem 5: Markov Chain Language Model (Version C) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words you, apple, see, like.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.

Both the initial distribution π = (π1, π2, π3, π4)^T and the rows and columns of the transition matrix A (with entries A11, ..., A44) are indexed in the order you, apple, see, like.

A_ij specifies the probability of transitioning from state i to state j.

a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• you see apple

• you like apple

• apple like apple

• you see you

For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.

4  you apple see like


you /6 1 3 2
you
0 /6 /6 /6 
apple 2/6 2 4
π= 0 A = apple  0 0 /6 /6 
see  1/6 3 1 1 
see /6 /6 /6
like 0 1 5
like /6 /6 0 0

0 b) Which of the following two sentences is more likely according to the model? Justify your answer.
1
2
1) you see apple

2) apple like you

c) Given that the 3rd word X3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X2. Justify your answer.

Problem 5: Markov Chain Language Model (Version D) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words they, apple, like, eat.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.

Both the initial distribution π = (π1, π2, π3, π4)^T and the rows and columns of the transition matrix A (with entries A11, ..., A44) are indexed in the order they, apple, like, eat.

A_ij specifies the probability of transitioning from state i to state j.

a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• they like they

• they eat apple

• apple eat apple

• they like apple

For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.

4  they apple like eat


they /6 1 3 2
they
 0 /6 /6 /6 
apple 2/6 2 4
π= 0 A = apple  0 0 /6 /6 
like  1/6 3 1 1 
like /6 /6 /6
eat 0 1 5
eat /6 /6 0 0

0 b) Which of the following two sentences is more likely according to the model? Justify your answer.
1
2
1) they like apple

2) apple eat they

c) Given that the 3rd word X3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X2. Justify your answer.

Problem 6: Neural Sequence Models (Version A) (4 credits)

We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x1, x2, ..., xn], xi ∈ R, and the corresponding target for each sequence is y = x1 + xn. We use four different encoders:

1. RNN with positional encoding
2. Transformer with positional encoding
3. Transformer without positional encoding
4. Dilated causal convolution with 2 hidden layers. We set the dilation size to 2.

After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th place in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
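One of the four cases can be probed empirically: without positional encodings, self-attention is permutation-equivariant, so the last hidden state cannot tell which of the earlier inputs was x1. The sketch below (an untrained torch.nn.TransformerEncoder in eval mode; all dimensions are arbitrary) illustrates that permuting x1, ..., x_{n−1} leaves h_n unchanged.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2).eval()   # note: no positional encoding is added
    proj = nn.Linear(1, 8)

    x = torch.randn(1, 12, 1)                                     # one sequence with n = 12
    perm = torch.randperm(11)
    x_perm = torch.cat([x[:, perm, :], x[:, -1:, :]], dim=1)      # shuffle x1..x_{n-1}, keep x_n in place

    with torch.no_grad():
        h_n = encoder(proj(x))[:, -1, :]
        h_n_perm = encoder(proj(x_perm))[:, -1, :]
    print(torch.allclose(h_n, h_n_perm, atol=1e-5))               # True: h_n ignores the ordering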

Problem 6: Neural Sequence Models (Version B) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x1, x2, ..., xn], xi ∈ R, and the corresponding target for each sequence is y = x1 + xn. We use four different encoders:

1. Recurrent neural network
2. Transformer without positional encoding
3. Transformer with positional encoding
4. Sliding window neural network that takes [x_{i−k}, ..., x_{i−1}, x_i] and outputs h_i ∈ R^h, for each i. We set k = 5.

After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th place in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?

Problem 6: Neural Sequence Models (Version C) (4 credits)

We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x1, x2, ..., xn], xi ∈ R, and the corresponding target for each sequence is y = x1 + xn. We use four different encoders:

1. Transformer with positional encoding
2. Transformer without positional encoding
3. Multilayer neural network that takes a vector in R^n as input (all numbers concatenated) and outputs R^{n×h}
4. Recurrent neural network

After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th place in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?

Problem 6: Neural Sequence Models (Version D) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x1, x2, ..., xn], xi ∈ R, and the corresponding target for each sequence is y = x1 + xn. We use four different encoders:

1. Recurrent neural network
2. Dilated causal convolution with 2 hidden layers. We set the dilation size to 2.
3. Transformer with positional encoding
4. Transformer without positional encoding

After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th place in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?

Problem 7: Temporal Point Process (Version A) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, T ]. We have observed a single sequence that contains n points {t1 , t2 , ... , tn }, ti ∈ [0, T ].

a) Derive the maximum likelihood estimate of the parameter µ.
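A quick numerical check (not the requested derivation): maximize the homogeneous Poisson process log-likelihood n·log µ − µT directly and compare the optimum with the closed-form estimate you derive by hand. The sequence and interval length below are hypothetical example values.

    import numpy as np
    from scipy.optimize import minimize_scalar

    t, T = np.array([0.7, 0.8, 1.5, 2.3, 4.7]), 5.0     # hypothetical observed sequence on [0, T]
    nll = lambda mu: -(len(t) * np.log(mu) - mu * T)    # negative log-likelihood of a homogeneous Poisson process
    res = minimize_scalar(nll, bounds=(1e-6, 10.0), method="bounded")
    print(res.x)                                        # compare with the estimate derived by hand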

b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. We estimate µ using data we collected in one year. Our task is to find the least busy 2-hour interval in each day to close down the road for maintenance. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model? Justify your answer.

Problem 7: Temporal Point Process (Version B) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 5]. We have observed a single sequence {0.7, 0.8, 1.5, 2.3, 4.7}.

a) Derive the maximum likelihood estimate of the parameter µ.

b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. For each day of the week we estimate the parameter µ using data we collected in one year. That means we have µ_Mon, µ_Tue, ..., µ_Sun, each µ corresponding to one day of the week. Our task is to find the least busy day of the week to close down the road for maintenance. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model? Justify your answer.

Problem 7: Temporal Point Process (Version C) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 2]. We have observed a single sequence {0.1, 0.8, 1.3, 1.5, 1.7, 1.9}.

a) Derive the maximum likelihood estimate of the parameter µ.

b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to estimate the probability that less than 100 cars will pass our sensor in a day. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model? Justify your answer.

Problem 7: Temporal Point Process (Version D) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[3, 13]. We have observed a single sequence {3.5, 4.3, 4.5, 7.1, 8.3}.

a) Derive the maximum likelihood estimate of the parameter µ.

b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to answer whether fast vehicles get stuck behind slower vehicles. That is, we want to see if observing one vehicle leads to a few more following behind it. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model? Justify your answer.

Problem 8: Clustering (Version A) (6 credits)

We consider the graph G = (E, V) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk as Pr(X_{t+1} = j | X_t = i) = A_ij / d_i, where d_i = Σ_j A_ij is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = Σ_{i∈V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C), and vice versa. Show that the normalized cut satisfies the equation

    Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).

Reminder: The normalized cut of an undirected graph is defined as

    Ncut(C, C̄) = cut(C, C̄) / vol(C) + cut(C, C̄) / vol(C̄).
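The identity can be sanity-checked numerically on a small example graph before proving it: the sketch below compares Ncut computed from its definition with the sum of the two transition probabilities induced by the stated starting distribution. This is only a plausibility check on an arbitrary random graph, not the proof.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.triu((rng.random((8, 8)) < 0.5).astype(int), 1)
    A[np.arange(7), np.arange(1, 8)] = 1                # ensure no node is isolated
    A = A + A.T                                         # random undirected example graph
    C, Cbar = np.arange(4), np.arange(4, 8)

    d = A.sum(axis=1)
    vol = lambda S: d[S].sum()
    cut = A[np.ix_(C, Cbar)].sum()
    ncut = cut / vol(C) + cut / vol(Cbar)

    P = A / d[:, None]                                  # random-walk transition matrix A_ij / d_i
    pr = lambda S1, S2: (d[S1] / vol(S1)) @ P[np.ix_(S1, S2)].sum(axis=1)   # Pr(X1 in S2 | X0 in S1)
    print(np.isclose(ncut, pr(C, Cbar) + pr(Cbar, C)))  # True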

Problem 8: Clustering (Version B) (6 credits)
We consider the graph G = (E, V) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk as Pr(X_{t+1} = j | X_t = i) = A_ij / d_i, where d_i = Σ_j A_ij is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = Σ_{i∈V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C), and vice versa. Show that the normalized cut satisfies the equation

    Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).

Reminder: The normalized cut of an undirected graph is defined as

    Ncut(C, C̄) = cut(C, C̄) / vol(C) + cut(C, C̄) / vol(C̄).

Problem 8: Clustering (Version C) (6 credits)

0 We consider the graph G = (E, V) with adjacency matrix A where the nodes are separated into two clusters, C
1 A P
and C̄ . We define the associated random walk Pr(Xt+1 = j |Xt = i) = diji where di = j Aij is the degree of node i and
2 di
P
3 Pr(X0 = i) = vol(V) is the starting distribution where vol(V) = i ∈V di is the volume of the set of nodes V . We define the
4 probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X1 ∈ C̄ | X0 ∈ C)
5 and vice versa. Show that the normalized cut satisfies the equation
6
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).

Reminder : The normalized cut of an undirected graph is defined as

cut(C, C̄) cut(C, C̄)


Ncut(C, C̄) = + .
vol(C) vol(C̄)

Problem 8: Clustering (Version D) (6 credits)
We consider the graph G = (E, V) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk as Pr(X_{t+1} = j | X_t = i) = A_ij / d_i, where d_i = Σ_j A_ij is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = Σ_{i∈V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C), and vice versa. Show that the normalized cut satisfies the equation

    Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).

Reminder: The normalized cut of an undirected graph is defined as

    Ncut(C, C̄) = cut(C, C̄) / vol(C) + cut(C, C̄) / vol(C̄).

Problem 9: Embeddings & Ranking (Version A) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_ij = 1_{(i,j)∈E} indicates whether an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the following three models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D}. The vector E_k[i, :] ∈ R^D denotes the embedding of node i for model M_k:

• M1: Node2Vec.

• M2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).

• M3: Spectral embedding with the k smallest eigenvectors.
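To make M3 concrete, here is a brief sketch of one common variant of spectral embedding (assuming the unnormalized graph Laplacian; the lecture may use a normalized variant). Note that it is computed from the adjacency matrix alone.

    import numpy as np

    def spectral_embedding(A, k):
        # Embedding from the k eigenvectors of L = D - A with the smallest eigenvalues.
        d = A.sum(axis=1)
        L = np.diag(d) - A
        eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
        return eigvecs[:, :k]                  # shape (n, k)

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    print(spectral_embedding(A, 2))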

a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

Problem 9: Embeddings & Ranking (Version B) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_ij = 1_{(i,j)∈E} indicates whether an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the following three models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D}. The vector E_k[i, :] ∈ R^D denotes the embedding of node i for model M_k:

• M1: Node2Vec.

• M2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).

• M3: Spectral embedding with the k smallest eigenvectors.

a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

Problem 9: Embeddings & Ranking (Version C) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_ij = 1_{(i,j)∈E} indicates whether an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the following three models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D}. The vector E_k[i, :] ∈ R^D denotes the embedding of node i for model M_k:

• M1: Node2Vec.

• M2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).

• M3: Spectral embedding with the k largest eigenvectors.

a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

Problem 9: Embeddings & Ranking (Version D) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_ij = 1_{(i,j)∈E} indicates whether an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the following three models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D}. The vector E_k[i, :] ∈ R^D denotes the embedding of node i for model M_k:

• M1: Node2Vec.

• M2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).

• M3: Spectral embedding with the k largest eigenvectors.

a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.

Problem 10: Semi-Supervised Learning (Version A) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C1 and C2. The SBM has community proportions π and edge probability ν given as

    π = (1/2, 1/2)   and   ν = ( 0.2  0.9
                                 0.9  0.2 ).

We consider a graph G with n nodes sampled from the SBM defined above, where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph, where only a fraction of the node labels is available for training.

a) Do you expect label propagation with the optimization problem

    min Σ_{i,j} w_ij (y_i − y_j)^T (y_i − y_j)

to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.

b) The nodes are now assigned node features sampled as

    h_v^(0) ∼ N( (1, 1)^T, I_2 )  for v ∈ C1   and   h_v^(0) ∼ N( (−1, −1)^T, I_2 )  for v ∈ C2,

where I_2 denotes the 2×2 identity matrix. We define N(v) as the 1-hop neighborhood of node v. Do you expect a one-layer GNN with the message passing step

    m_v^(1)(h_1^(0), ..., h_n^(0)) = (1/|N(v)|) · Σ_{u∈N(v)} W h_u^(0) + b

and the update step

    h_v^(1) = ReLU(Q h_v^(0) + p + m_v^(1))

to work well for this task? If not, propose a modification to the message passing and/or update step that would solve the problem. Justify your answer.
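To make the message passing and update equations of part b) concrete, here is a small NumPy sketch of the described one-layer GNN with mean aggregation (all shapes are hypothetical; degrees are clipped to avoid division by zero for isolated nodes, a detail the problem statement leaves open).

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)

    def gnn_layer(A, H, W, b, Q, p):
        # message:  m_v = (1/|N(v)|) * sum_{u in N(v)} W h_u + b
        deg = np.clip(A.sum(axis=1, keepdims=True), 1, None)
        M = (A @ H / deg) @ W.T + b
        # update:   h'_v = ReLU(Q h_v + p + m_v)
        return relu(H @ Q.T + p + M)

    rng = np.random.default_rng(0)
    n, d_in, d_out = 6, 2, 4
    A = np.triu(rng.integers(0, 2, (n, n)), 1); A = A + A.T
    H = rng.normal(size=(n, d_in))
    W, Q = rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in))
    b, p = rng.normal(size=d_out), rng.normal(size=d_out)
    print(gnn_layer(A, H, W, b, Q, p).shape)   # (6, 4)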

Problem 10: Semi-Supervised Learning (Version B) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C1 and C2. The SBM has community proportions π and edge probability ν given as

    π = (1/2, 1/2)   and   ν = ( 0.2  0.9
                                 0.9  0.2 ).

We consider a graph G with n nodes sampled from the SBM defined above, where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph, where only a fraction of the node labels is available for training.

a) Do you expect label propagation with the optimization problem

    min Σ_{i,j} w_ij (y_i − y_j)^T (y_i − y_j)

to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.

b) The nodes are now assigned node features sampled as

    h_v^(0) ∼ N( (0, 1)^T, I_2 )  for v ∈ C1   and   h_v^(0) ∼ N( (1, 0)^T, I_2 )  for v ∈ C2,

where I_2 denotes the 2×2 identity matrix. We define N(v) as the 1-hop neighborhood of node v. Do you expect a one-layer GNN with the message passing step

    m_v^(1)(h_1^(0), ..., h_n^(0)) = (1/|N(v)|) · Σ_{u∈N(v)} W h_u^(0) + b

and the update step

    h_v^(1) = ReLU(Q h_v^(0) + p + m_v^(1))

to work well for this task? If not, propose a modification to the message passing and/or update step that would solve the problem. Justify your answer.

Problem 10: Semi-Supervised Learning (Version C) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C1 and C2. The SBM has community proportions π and edge probability ν given as

    π = (1/2, 1/2)   and   ν = ( 0.1  0.8
                                 0.8  0.1 ).

We consider a graph G with n nodes sampled from the SBM defined above, where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph, where only a fraction of the node labels is available for training.

a) Do you expect label propagation with the optimization problem

    min Σ_{i,j} w_ij (y_i − y_j)^T (y_i − y_j)

to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.

b) The nodes are now assigned node features sampled as

    h_v^(0) ∼ N( (1, 1)^T, I_2 )  for v ∈ C1   and   h_v^(0) ∼ N( (−1, −1)^T, I_2 )  for v ∈ C2,

where I_2 denotes the 2×2 identity matrix. We define N(v) as the 1-hop neighborhood of node v. Do you expect a one-layer GNN with the message passing step

    m_v^(1)(h_1^(0), ..., h_n^(0)) = (1/|N(v)|) · Σ_{u∈N(v)} W h_u^(0) + b

and the update step

    h_v^(1) = ReLU(Q h_v^(0) + p + m_v^(1))

to work well for this task? If not, propose a modification to the message passing and/or update step that would solve the problem. Justify your answer.

Problem 10: Semi-Supervised Learning (Version D) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C1 and C2. The SBM has community proportions π and edge probability ν given as

    π = (1/2, 1/2)   and   ν = ( 0.1  0.8
                                 0.8  0.1 ).

We consider a graph G with n nodes sampled from the SBM defined above, where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph, where only a fraction of the node labels is available for training.

a) Do you expect label propagation with the optimization problem

    min Σ_{i,j} w_ij (y_i − y_j)^T (y_i − y_j)

to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.

b) The nodes are now assigned node features sampled as

    h_v^(0) ∼ N( (0, 1)^T, I_2 )  for v ∈ C1   and   h_v^(0) ∼ N( (1, 0)^T, I_2 )  for v ∈ C2,

where I_2 denotes the 2×2 identity matrix. We define N(v) as the 1-hop neighborhood of node v. Do you expect a one-layer GNN with the message passing step

    m_v^(1)(h_1^(0), ..., h_n^(0)) = (1/|N(v)|) · Σ_{u∈N(v)} W h_u^(0) + b

and the update step

    h_v^(1) = ReLU(Q h_v^(0) + p + m_v^(1))

to work well for this task? If not, propose a modification to the message passing and/or update step that would solve the problem. Justify your answer.

