
Bi-Level Optimization using Validation Set #137

Answered by mmargalo
mmargalo asked this question in Q&A

I see, so the model is not updated even if it's referenced on the workers. Thanks for the advice; I ended up merging the inner and val loops instead, with the val loop going over a small batch rather than the entire val set.

SUCCEEDS:
train() on rank 0 -> inner_val_loop_combined in parallel -> train() on rank 0, backprop

FAILS:
train() on rank 0 -> inner_loop in parallel -> train() on rank 0, call val_loop -> val_loop in parallel -> train() on rank 0, backprop
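For concreteness, the succeeding pattern could look roughly like the sketch below. It reuses the parallelize decorator, net, and meta_opt names from this thread; the batch arguments, the model computing its own loss, and the return convention are assumptions, not the thread's verbatim code.

@parallelize
def inner_val_loop_combined(net, x, y, x_val, y_val):
    # Inner step: differentiable update of the worker-side parameters.
    inner_loss = net(x, y)  # training loss (assumed built-in criterion)
    meta_opt.step(inner_loss)

    # Validation forward on a small batch, not the entire val set.
    val_loss = net(x_val, y_val)
    return val_loss  # rank 0 backprops through this outer loss

Because the parameter update and the validation forward run inside the same parallelized call, both see the same worker-side copy of net, which is why this ordering succeeds while the split version fails.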

If I understand correctly, you are updating the model parameters in the inner_loop. But in your initial code:

@parallelize
def inner_loop(net, x, y):
    # ... clone net by reference
    loss = net(x)  # built-in criterion computes the loss
    meta_opt.step(loss)  # differentiable in-place update of the cloned params
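To illustrate why this fails: if parallelize ships a serialized copy of net to each worker (an assumption about the decorator, consistent with the behavior described above), meta_opt.step(loss) only mutates that copy, so the net that rank 0 later hands to val_loop still carries the old parameters. A minimal sketch of one possible workaround, assuming TorchOpt's extract_state_dict / recover_state_dict helpers, is to return the updated state explicitly:

import torchopt

@parallelize
def inner_loop(net, x, y):
    loss = net(x)  # built-in criterion
    meta_opt.step(loss)  # updates only the worker-side copy
    # Hypothetical: hand the updated parameters back to rank 0.
    return torchopt.extract_state_dict(net)

# On rank 0, re-attach the returned state before calling val_loop:
#   state = inner_loop(net, x, y)
#   torchopt.recover_state_dict(net, state)

Merging the loops, as done above, avoids this extra round trip entirely.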

Category: Q&A
Labels: distributed (Something related to distributed training)
2 participants