MLGS 2021 Retake
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Working instructions
• DO NOT SUBMIT THIS SHEET! ONLY SUBMIT YOUR PERSONALIZED ANSWER SHEET THAT IS DISTRIBUTED THROUGH TUMEXAM!
• Make sure that you solve the version of the problem stated on your personalized answer sheet (e.g., Problem 1 (Version B), Problem 2 (Version A), etc.)
Problem 1: Normalizing Flows (Version A) (6 credits)
Problem 1: Normalizing Flows (Version B) (6 credits)
We consider two transformations f_1, f_2 : R^3 → R^3 for use in normalizing flows. Let

f_1(z) = \begin{pmatrix} (z_2)^3 \\ (1 + \max(0, z_2)) \cdot z_1 - \min(0, z_2) \cdot z_3 \\ z_1 - z_3 \end{pmatrix} \quad \text{and} \quad f_2(z) = \begin{pmatrix} (z_1)^3 \\ \ln(1 + |z_3|) \cdot \exp(z_2) \\ z_1 \cdot z_3 \end{pmatrix}.

Prove or disprove whether f_1 and/or f_2 are invertible.
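As a quick numerical sanity check (not a substitute for a proof), the sketch below implements the two maps above, an explicit candidate inverse for f_1, and a candidate collision pair for f_2; the random samples and concrete values are illustrative assumptions.

```python
import numpy as np

def f1(z):
    z1, z2, z3 = z
    return np.array([z2**3,
                     (1 + max(0.0, z2)) * z1 - min(0.0, z2) * z3,
                     z1 - z3])

def f1_inv(x):
    # Recover z2 from the first output, then solve the remaining 2x2 linear system.
    x1, x2, x3 = x
    z2 = np.cbrt(x1)
    a, b = 1 + max(0.0, z2), -min(0.0, z2)   # a + b = 1 + |z2| > 0
    z1 = (x2 + b * x3) / (a + b)
    return np.array([z1, z2, z1 - x3])

def f2(z):
    z1, z2, z3 = z
    return np.array([z1**3,
                     np.log1p(abs(z3)) * np.exp(z2),
                     z1 * z3])

rng = np.random.default_rng(0)
for _ in range(1000):
    z = rng.normal(size=3)
    assert np.allclose(f1_inv(f1(z)), z)     # candidate inverse round-trips on random samples

# Candidate collision for f2: when z1 = 0, the sign of z3 is lost.
print(f2(np.array([0.0, 0.0, 1.0])), f2(np.array([0.0, 0.0, -1.0])))
```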
Problem 1: Normalizing Flows (Version C) (6 credits)
Problem 1: Normalizing Flows (Version D) (6 credits)
We consider two transformations f_1, f_2 : R^3 → R^3 for use in normalizing flows. Let

f_1(z) = \begin{pmatrix} (z_2)^3 \\ \min(0, z_2) \cdot z_3 + (1 + \max(0, z_2)) \cdot z_1 \\ z_1 + 2 \cdot z_3 \end{pmatrix} \quad \text{and} \quad f_2(z) = \begin{pmatrix} (z_2)^3 \\ z_2 \cdot |z_1| \\ (z_1)^3 \cdot \exp(z_3) \end{pmatrix}.

Prove or disprove whether f_1 and/or f_2 are invertible.
Problem 2: Variational Inference (Version A) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 5, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 5)^2}{2\theta^2}\right),

q_\phi(z) = \mathcal{N}(z; \phi, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \phi)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

\mathcal{L}(\theta, q_\phi) = \mathbb{E}_{z \sim q_\phi}\left[ \log p_\theta(x, z) - \log q_\phi(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.
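The ELBO in the reminder can also be estimated by Monte Carlo sampling, which is a handy way to sanity-check a derivation for a) and b). A minimal sketch, assuming a single made-up observation x and a fixed example value of θ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, x = 2.0, 9.0                     # assumed example values: fixed theta, one observation x

def elbo(phi, num_samples=200_000):
    z = rng.normal(phi, 1.0, size=num_samples)                      # z ~ q_phi = N(phi, 1)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z + 5.0, theta)
    log_q = norm.logpdf(z, phi, 1.0)
    return np.mean(log_joint - log_q)

phis = np.linspace(-2.0, 4.0, 25)
values = [elbo(p) for p in phis]
print(phis[int(np.argmax(values))])     # grid maximizer; compare with your analytic answer for b)
```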
Problem 2: Variational Inference (Version B) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; 2z + 4, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - 2z - 4)^2}{2\theta^2}\right),

q_\mu(z) = \mathcal{N}(z; \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \mu)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

\mathcal{L}(\theta, q_\mu) = \mathbb{E}_{z \sim q_\mu}\left[ \log p_\theta(x, z) - \log q_\mu(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.
Problem 2: Variational Inference (Version C) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 3, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 3)^2}{2\theta^2}\right),

q_\phi(z) = \mathcal{N}(z; \phi, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \phi)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on ϕ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_ϕ is defined as

\mathcal{L}(\theta, q_\phi) = \mathbb{E}_{z \sim q_\phi}\left[ \log p_\theta(x, z) - \log q_\phi(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of ϕ that maximizes the ELBO.
Problem 2: Variational Inference (Version D) (7 credits)
Suppose we are given a latent variable model

p(z) = \mathcal{N}(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right),

p_\theta(x \mid z) = \mathcal{N}(x; z + 7, \theta^2) = \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{(x - z - 7)^2}{2\theta^2}\right),

q_\mu(z) = \mathcal{N}(z; \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z - \mu)^2}{2}\right).
a) Derive the evidence lower bound (ELBO) for this particular parametrization. Simplify the parts depending on µ as far as possible.

Reminder: The ELBO for parameters θ and variational distribution q_µ is defined as

\mathcal{L}(\theta, q_\mu) = \mathbb{E}_{z \sim q_\mu}\left[ \log p_\theta(x, z) - \log q_\mu(z) \right].

Hint: Given a random variable X, the variance decomposition Var(X) = E[X²] − E[X]² can be rewritten as E[X²] = Var(X) + E[X]².
b) Suppose θ is fixed. Derive the value of µ that maximizes the ELBO.
Problem 3: Variational Autoencoder (Version A) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
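One way to see the mismatch numerically (a sketch with made-up parameter values): the exponential pdf is a density on [0, ∞), so evaluating it at the two admissible pixel values does not yield a valid distribution over {0, 1}, whereas a Bernoulli parametrization does.

```python
import numpy as np

lam = 1.5                                     # hypothetical rate, standing in for exp(f_theta(z))_i

exp_pdf = lambda x: lam * np.exp(-lam * x) * (x >= 0)
print(exp_pdf(0.0) + exp_pdf(1.0))            # density values at x in {0, 1}; generally do not sum to 1

# Alternative: a Bernoulli likelihood, e.g. p_i = sigmoid(f_theta(z)_i) (hypothetical logit 0.4).
p = 1.0 / (1.0 + np.exp(-0.4))
print(p + (1 - p))                            # a proper distribution over {0, 1}: sums to 1
```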
Problem 3: Variational Autoencoder (Version B) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 3: Variational Autoencoder (Version C) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 3: Variational Autoencoder (Version D) (2 credits)
We would like to define a variational autoencoder model for black-and-white images. Each image is represented as a binary vector x ∈ {0, 1}^N. We define the conditional distribution p_θ(x | z) as follows.

1. We obtain the distribution parameters as λ = exp(f_θ(z)).
2. We model each pixel as p_θ(x_i | z) = Exponential(x_i | λ_i), where Exponential(x | λ) is the exponential distribution with probability density function

\mathrm{Exponential}(x_i \mid \lambda_i) = \begin{cases} \lambda_i e^{-\lambda_i x_i} & \text{if } x_i \ge 0, \\ 0 & \text{else.} \end{cases}
What is the main problem with the above definition of pθ (x |z)? Explain how we can modify the above definition to
fix this problem. Justify your answer.
Problem 4: Robustness - Convex Relaxation (Version A) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
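For a concrete feasibility check one can sample the graph of LeakyReLU on [l, u] and let a convex-hull routine list the facets. The sketch below, with assumed example values for α, l, u, is only a numerical aid for verifying a hand-derived constraint set, not the answer itself.

```python
import numpy as np
from scipy.spatial import ConvexHull

alpha, l, u = 0.1, -2.0, 3.0                      # assumed example values with l < 0 < u

def leaky_relu(x):
    return np.where(x >= 0, x, alpha * x)

xs = np.linspace(l, u, 50)
pts = np.column_stack([xs, leaky_relu(xs)])       # sample the graph of the activation
hull = ConvexHull(pts)

# Each row (a, b, c) of hull.equations encodes a facet a*x + b*y + c <= 0 of the
# convex hull, i.e. one linear constraint of the relaxation for these l, u, alpha.
print(np.round(hull.equations, 3))
```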
Problem 4: Robustness - Convex Relaxation (Version B) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 4: Robustness - Convex Relaxation (Version C) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 4: Robustness - Convex Relaxation (Version D) (7 credits)
In the lecture, we have derived a tight convex relaxation for the ReLU activation function. Now we want to generalize this result to the LeakyReLU activation function

\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}

with α ∈ (0, 1).
Let x, y ∈ R be the variables we use to model the function's input and output, respectively. Assume we know that l ≤ x ≤ u with l, u ∈ R. Specify a set of linear constraints on (x, y)^⊤ that model the convex hull of (x, LeakyReLU(x))^⊤, i.e. whose feasible region is

\left\{ \lambda \begin{pmatrix} x_1 \\ \mathrm{LeakyReLU}(x_1) \end{pmatrix} + (1 - \lambda) \begin{pmatrix} x_2 \\ \mathrm{LeakyReLU}(x_2) \end{pmatrix} \;\middle|\; x_1, x_2 \in [l, u] \wedge \lambda \in [0, 1] \right\}.
Reminder : A linear constraint is an inequality or equality relation between terms that are linear in x and y .
Hint: You will have to make a case distinction to account for different ranges of l and u.
Problem 5: Markov Chain Language Model (Version A) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, like, eat. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters (a counting-based sketch follows the sentence list below).
• I like orange
• I eat orange
• I like I
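A counting-based sketch of the maximum-likelihood fit: initial probabilities and transition probabilities are just normalized counts over the three sentences above.

```python
import numpy as np

vocab = ["I", "orange", "like", "eat"]
idx = {w: i for i, w in enumerate(vocab)}
sentences = [["I", "like", "orange"], ["I", "eat", "orange"], ["I", "like", "I"]]

init_counts = np.zeros(len(vocab))
trans_counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    init_counts[idx[s[0]]] += 1
    for a, b in zip(s, s[1:]):
        trans_counts[idx[a], idx[b]] += 1

pi_hat = init_counts / init_counts.sum()          # most likely initial distribution
row_sums = trans_counts.sum(axis=1, keepdims=True)
# Rows of words never observed as a predecessor are left at zero (their MLE is undetermined).
A_hat = np.divide(trans_counts, row_sums, out=np.zeros_like(trans_counts), where=row_sums > 0)
print(pi_hat, A_hat, sep="\n")
```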
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) I like orange
2) orange eat I
c) Given that the 3rd word X_3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
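A sketch of the computation with purely hypothetical values for the initial distribution π and transition matrix A (the fitted parameters referenced above are not restated here); it only illustrates that P(X_2 = w, X_3 = orange) = P(X_2 = w) · P(X_3 = orange | X_2 = w) gives the unnormalized posterior.

```python
import numpy as np

vocab = ["I", "orange", "like", "eat"]
pi = np.array([0.7, 0.1, 0.1, 0.1])               # hypothetical initial distribution over X1
A = np.array([[0.1, 0.2, 0.4, 0.3],               # hypothetical transition matrix A[w, w']
              [0.4, 0.2, 0.2, 0.2],
              [0.3, 0.5, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1]])

o = vocab.index("orange")
p_x2 = pi @ A                                      # marginal P(X2 = w)
unnorm = p_x2 * A[:, o]                            # P(X2 = w, X3 = orange), unnormalized over w
print(dict(zip(vocab, np.round(unnorm, 4))))
```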
Problem 5: Markov Chain Language Model (Version B) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words I, orange, see, like. While
the words are borrowed from the English language, our simple language is not bound to its grammatical rules. The
words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• I see I
• I like orange
• I see orange
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) I see orange
2) orange like I
c) Given that the 3rd word X_3 of a sentence is orange, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 5: Markov Chain Language Model (Version C) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words you, apple, see, like.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• you see apple
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) you see apple
c) Given that the 3rd word X_3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 5: Markov Chain Language Model (Version D) (7 credits)
We want to use a Markov chain to model a very simple language consisting of the 4 words they, apple, like, eat.
While the words are borrowed from the English language, our simple language is not bound to its grammatical
rules. The words map to the Markov chain parameters as follows.
a) Fit the Markov chain to the following dataset of example sentences by computing the most likely parameters.
• they like they
For the remaining problems assume that you are given the following Markov chain parameters that were fit to a
larger dataset.
b) Which of the following two sentences is more likely according to the model? Justify your answer.
1) they like apple
c) Given that the 3rd word X_3 of a sentence is apple, compute the (unnormalized) probability distribution over the previous word X_2. Justify your answer.
Problem 6: Neural Sequence Models (Version A) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. RNN with positional encoding
2. Transformer with positional encoding
3. Transformer without positional encoding
4. Dilated causal convolution with 2 hidden layers. We set dilation size to 2.
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
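As an illustration of why positional information matters here, the sketch below (random weights, a single attention layer, no claim about any particular lecture architecture) shows that self-attention without positional encodings is permutation-equivariant, so the last hidden state cannot tell which input was x_1.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))   # row-wise softmax over keys
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

n, d = 12, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs only permutes the outputs: without positional encodings the
# encoder has no notion of which element of the sequence came first.
print(np.allclose(out[perm], out_perm))
```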
Problem 6: Neural Sequence Models (Version B) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 6: Neural Sequence Models (Version C) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Transformer with positional encoding
2. Transformer without positional encoding
3. Multilayer neural network that takes a vector in R^n as input (all numbers concatenated) and outputs R^{n×h}
4. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 6: Neural Sequence Models (Version D) (4 credits)
We want to find out the limitations of our neural models for sequential data. To do that, we construct a dataset where the inputs are multiple sequences of n > 10 numbers [x_1, x_2, ..., x_n], x_i ∈ R, and the corresponding target for each sequence is y = x_1 + x_n. We use four different encoders:
1. Recurrent neural network
After processing the sequence with the encoders described above, we have access to hidden states h_i ∈ R^h corresponding to the i-th position in the sequence. We use the last hidden state h_n to make the prediction. For each of the four encoders, write down whether it can learn the given task in theory. Justify your answer. For those encoders that can learn it, what issues might you encounter in practice?
Problem 7: Temporal Point Process (Version A) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, T ]. We have observed a single sequence that contains n points {t1 , t2 , ... , tn }, ti ∈ [0, T ].
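For reference, the log-likelihood of a homogeneous Poisson process on [0, T] depends on the observations only through their count, log L(µ) = n log µ − µT, so the maximum-likelihood estimate is µ̂ = n/T. A minimal sketch, using the observation window and sequence from Version B of this problem as example data:

```python
import numpy as np

T = 5.0                                           # observation window [0, T] (example value)
events = np.array([0.7, 0.8, 1.5, 2.3, 4.7])      # example sequence (taken from Version B)
n = len(events)

def log_likelihood(mu):
    # Homogeneous Poisson process on [0, T]: the event times enter only through their count n.
    return n * np.log(mu) - mu * T

mu_hat = n / T                                    # maximum-likelihood estimate of the intensity
print(mu_hat, log_likelihood(mu_hat))
```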
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. We estimate µ using data we collected in one year. Our task is to find the least busy 2-hour interval in each day to close down the road for maintenance.
Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an alternative model?
Justify your answer.
Problem 7: Temporal Point Process (Version B) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 5]. We have observed a single sequence {0.7, 0.8, 1.5, 2.3, 4.7}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. For each day of the week we estimate the parameter µ using data we collected in one year. That means we have µ_Mon, µ_Tue, ..., µ_Sun, each µ
corresponding to one day of the week. Our task is to find the least busy day of the week to close down the road
for maintenance. Can we use the homogeneous Poisson process to achieve this? If not, can you suggest an
alternative model? Justify your answer.
Problem 7: Temporal Point Process (Version C) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[0, 2]. We have observed a single sequence {0.1, 0.8, 1.3, 1.5, 1.7, 1.9}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to estimate the probability that fewer than 100 cars will pass our sensor in a day. Can we use the homogeneous Poisson process
to achieve this? If not, can you suggest an alternative model? Justify your answer.
Problem 7: Temporal Point Process (Version D) (6 credits)
We fit a homogeneous Poisson process with intensity parameter µ to model event occurrences in a time interval
[3, 13]. We have observed a single sequence {3.5, 4.3, 4.5, 7.1, 8.3}.
b) Suppose we install a sensor next to a busy road that records the times when cars drive by. We model the times as described above, using the events from the whole day as one sequence. Using our model, we want to answer whether fast vehicles get stuck behind slower vehicles. That is, we want to see if observing one vehicle leads to
a few more following behind it. Can we use the homogeneous Poisson process to achieve this? If not, can you
suggest an alternative model? Justify your answer.
Problem 8: Clustering (Version A) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
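A small numerical check of the claimed identity on a toy graph (this does not replace the proof; the graph and the cluster split below are arbitrary choices):

```python
import numpy as np

# Toy undirected graph: cluster C = {0, 1, 2}, complement C_bar = {3, 4}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
C = np.array([True, True, True, False, False])
Cb = ~C

# Graph side: Ncut(C, C_bar) = cut / vol(C) + cut / vol(C_bar).
cut = A[np.ix_(C, Cb)].sum()
ncut = cut / d[C].sum() + cut / d[Cb].sum()

# Random-walk side: Pr(X0 = i) = d_i / vol(V), Pr(X1 = j | X0 = i) = A_ij / d_i.
p0 = d / d.sum()
P = A / d[:, None]
pr_Cb_given_C = (p0[C, None] * P[np.ix_(C, Cb)]).sum() / p0[C].sum()
pr_C_given_Cb = (p0[Cb, None] * P[np.ix_(Cb, C)]).sum() / p0[Cb].sum()

print(np.isclose(ncut, pr_Cb_given_C + pr_C_given_Cb))   # True for this example
```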
Problem 8: Clustering (Version B) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 8: Clustering (Version C) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 8: Clustering (Version D) (6 credits)
We consider the graph G = (V, E) with adjacency matrix A where the nodes are separated into two clusters, C and C̄. We define the associated random walk Pr(X_{t+1} = j | X_t = i) = A_{ij} / d_i, where d_i = \sum_j A_{ij} is the degree of node i, and Pr(X_0 = i) = d_i / vol(V) is the starting distribution, where vol(V) = \sum_{i ∈ V} d_i is the volume of the set of nodes V. We define the probability to transition from cluster C to cluster C̄ in the first random walk step as Pr(C̄ | C) = Pr(X_1 ∈ C̄ | X_0 ∈ C) and vice versa. Show that the normalized cut satisfies the equation
Ncut(C, C̄) = Pr(C̄ | C) + Pr(C | C̄).
Problem 9: Embeddings & Ranking (Version A) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
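For concreteness, the two modified inputs from a) and b) can be constructed as in the sketch below (placeholder data with made-up sizes; which embeddings actually change is exactly what the question asks you to reason about):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3                                   # placeholder graph size and feature dimension
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T                                   # original adjacency matrix (placeholder)
X = rng.random((n, D))                        # original node features (placeholder)

X_prime = np.tile(X[0], (n, 1))               # a): A' = A, identical features for all nodes
A_clique = np.ones((n, n)) - np.eye(n)        # b): clique adjacency A' = 1 - I, X' = X
```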
Problem 9: Embeddings & Ranking (Version B) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 9: Embeddings & Ranking (Version C) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 9: Embeddings & Ranking (Version D) (6 credits)
We consider a graph G = (V, E) with adjacency matrix A ∈ R^{n×n}, where A_{ij} = 1_{(i,j)∈E} indicates if an edge exists between node i and node j in the graph G. The node features are represented by the matrix X ∈ R^{n×D}. We consider the three following models M_k, k ∈ {1, 2, 3}, which produce node embeddings E_k = M_k(G, X) ∈ R^{n×D′}. The vector E_k[i, :] ∈ R^{D′} denotes the embedding of node i for model M_k:
• M_1: Node2Vec.
• M_2: Mean of Graph2Gauss, i.e. E_2[i, :] = µ_i, where the Graph2Gauss mapping transforms node i into the Gaussian distribution N(µ_i, diag(σ_i)).
a) We modify the attributed graph such that all nodes have the same features, i.e. the adjacency matrix is A′ = A and the new node attributes are X′ such that X′[i, :] = X′[j, :] for all (i, j). For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
b) We modify the attributed graph such that the graph is a clique, i.e. the new adjacency matrix is A′ = 1 − I where 1 is the all-ones matrix, and the node attributes are X′ = X. For which model will the new node embeddings E′_k = M_k(G′, X′) be different from the embeddings obtained with the original attributed graph E_k = M_k(G, X)? Justify your answer.
Problem 10: Semi-Supervised Learning (Version A) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.2 & 0.9 \\ 0.9 & 0.2 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
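To make the setting concrete, such a graph can be sampled directly from the SBM definition; a small sketch (the graph size n is an arbitrary choice) that also exposes the strongly heterophilic structure implied by ν:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                        # arbitrary number of nodes
pi = np.array([0.5, 0.5])
nu = np.array([[0.2, 0.9],
               [0.9, 0.2]])

c = rng.choice(2, size=n, p=pi)                # ground-truth community of each node
P = nu[c[:, None], c[None, :]]                 # edge probability for every node pair
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T                 # undirected graph, no self-loops

# Under this nu, cross-community edges dominate within-community edges.
same = c[:, None] == c[None, :]
print(A[same].sum() // 2, A[~same].sum() // 2)
```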
Problem 10: Semi-Supervised Learning (Version B) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.2 & 0.9 \\ 0.9 & 0.2 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
Problem 10: Semi-Supervised Learning (Version C) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.1 & 0.8 \\ 0.8 & 0.1 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.
Problem 10: Semi-Supervised Learning (Version D) (6 credits)
In this problem, we consider a Stochastic Block Model with two ground-truth communities C_1 and C_2. The SBM has community proportions π and edge probabilities ν given as

\pi = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix} \quad \text{and} \quad \nu = \begin{pmatrix} 0.1 & 0.8 \\ 0.8 & 0.1 \end{pmatrix}.
We consider a sampled graph G with n nodes from the SBM defined as above where the node labels are defined as the ground-truth communities of the SBM. The task is now to predict the labels of all nodes of the graph where only a fraction of the node labels is available for training. Do we expect the semi-supervised learning approach from the lecture to work well for this task? If not, propose a modification of the optimization problem which would solve the problem. Justify your answer.