Differentiable and Non-differentiable Views #12502


Closed
ssnl wants to merge 10 commits from the nondiff_view branch

Conversation

@ssnl (Collaborator) commented Oct 9, 2018

Fixes #11390.

This PR introduces the idea of non-differentiable views. A non-differentiable view is a view that shares storage with the base variable, but through which gradients should never flow. This includes:

  1. .detach()
  2. Views created when GradMode is disabled
  3. Views that are non-differentiable by nature, e.g., sparse_tensor.indices() (being added in #11253, "[sparse] Autograd get_indices/values and sparse_coo ctor"; I base that PR on this one and updated its note accordingly)

See the note in this PR for details.
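For concreteness, a minimal sketch of the three cases (for the sparse example, exact accessor availability depends on the PyTorch version; #11253 is what concerns those accessors):

import torch

base = torch.randn(3, 4, requires_grad=True)

# 1. .detach() shares storage with `base`, but gradients never flow through it.
v1 = base.detach()

# 2. A view created while grad mode is disabled.
with torch.no_grad():
    v2 = base[0]

# 3. A view that is non-differentiable by nature: the integer indices of a
#    sparse tensor.
s = torch.sparse_coo_tensor(torch.tensor([[0, 1]]), torch.tensor([1.0, 2.0])).coalesce()
v3 = s.indices()

print(v1.requires_grad, v2.requires_grad, v3.requires_grad)  # False False False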

cc @colesbury @apaszke @gchanan

@ssnl force-pushed the nondiff_view branch 3 times, most recently from a633257 to 7102b9b on October 10, 2018 at 23:43
@ssnl changed the title from "View op outputs are not registered as views when !GradMode::enabled()" to "Differentiable and Non-differentiable Views" on Oct 10, 2018
@facebook-github-bot (Contributor) left a comment

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

/// NOTE [ Autograd View Variables ]
///
/// Many operations return a Variable that shares storage with an input Variable.
/// The returned Variable is called a **view** Variable on the input **base**

This comment was marked as off-topic.

@facebook-github-bot (Contributor) left a comment

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

/// In certain cases, although function outputs share storage with inputs, they
/// will **never** require gradient history tracking. Instead of registering the
/// view relation via DifferentiableViewImpl in autograd, the views will
/// use the usual Variable::Impl and just share the version counters with the base

This comment was marked as off-topic.

@ssnl (Collaborator, Author) commented Oct 12, 2018

Thanks Ed :)

@ssnl (Collaborator, Author) commented Oct 12, 2018

I'm landing this because the last commit is only a text fix and CI passed on the commit before it.

@facebook-github-bot (Contributor) left a comment

SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apaszke (Contributor) left a comment

Mostly LGTM, but I'm not sure this is the right strategy. I don't think that decisions we make in no_grad() regions should affect how the system behaves outside of them. I.e., if you create a view in a no_grad() block, then obviously it should act as a variable that doesn't require grad in that place, but I can't see why it shouldn't behave later as if you had executed the code without that block. In general, "turning gradients off" should be local to that part of the code, whereas now you have this fact escaping into the data of the program.

@@ -172,7 +166,7 @@ void Variable::Impl::release_resources() {
hooks_.clear();
}

- Variable::ViewImpl::ViewImpl(Variable base, at::Tensor data, Edge gradient_edge)
+ Variable::DifferentiableViewImpl::DifferentiableViewImpl(Variable base, at::Tensor data, Edge gradient_edge)

This comment was marked as off-topic.

This comment was marked as off-topic.

@apaszke (Contributor) commented Oct 12, 2018

IIRC the last conclusion we reached on this topic was simply to make views return False when asked for requires_grad() in no-grad regions, instead of doing a whole refactor of the semantics like this one.

@ssnl (Collaborator, Author) commented Oct 12, 2018

but I can't see why it shouldn't behave later as if you had executed the code without that block.

I don't think this is correct. What you are saying implies that it should be possible to do

with torch.no_grad():
  y = net(x)

torch.autograd.grad(y, x)

But operations done in a no_grad() block should not keep any buffers or be tracked by autograd history. We should make view ops consistent with other ops: their outputs should just share the version counter with the base, but not track any other autograd state. I am quite certain that this is the behavior users expect most of the time. Anecdotally, I use PyTorch in my research and I would certainly expect so.
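To make that concrete, a minimal sketch (illustrative only; the exact error text may differ across versions):

import torch

x = torch.randn(2, requires_grad=True)
with torch.no_grad():
    y = x * 2          # no autograd history is recorded inside no_grad()

print(y.requires_grad)  # False: there is no graph connecting y back to x
# torch.autograd.grad(y, x)  # would raise, because y has no grad_fn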

It was never made clear which view operations should be tracked and which should not. Previously we basically only had .detach() as a non-differentiable view. Now we realize that there is also no_grad(), and, in the future, view ops whose outputs are non-differentiable by nature, e.g., sparse_tensor.indices(). So I believe it is good to have a mechanism like this. IMO, it makes everything easier to understand, with a very clear rule.

Moreover, I much prefer fixing things in a general way and making the rules about how each thing should behave clear, rather than applying local patches that make the code base difficult to navigate and make it hard to understand why something happens (both for newcomers and for core devs). For one, I really don't want to apply a manual patch to sparse_tensor.indices() just to make an op share storage with its input, especially when all the information is already there in the yaml/codegen scripts (it is a view op; its output is not differentiable).

The conclusion we arrived at last time, I think, was to treat the output as if there were a detach() after the view operation. Maybe I remember wrong. But given the above thoughts, I still think this patch is the correct thing to do.

@apaszke (Contributor) commented Oct 12, 2018

So while your argument is very convincing, I still think there are some edge cases that might need extra care. The reason views are special is exactly because they are views: in-place modifications made to them can have non-trivial, globally visible effects. For example, consider this operation, and assume that y.requires_grad is True:

x[i] = y

Now, if x is a base variable, then it doesn't matter too much that we'll set its requires_grad to True, because all views based on it will generally get updated (unless they are created in a no-grad block, which you convinced me is ok).

On the other hand, if x is a view onto a different tensor (let's call it z) that was created outside of a no-grad block, then the data of y will be present in z as well, and so usages of z should count towards y's gradient too! Your patch, however, will happily drop the aliasing information. This is exactly the situation I'm talking about:

z = torch.ones(...)
with torch.no_grad():
  x = z[0]
x[i] = y

Note that all use sites of the data of z (not its metadata) are outside of no_grad blocks.

@colesbury (Member) commented

Simon's behavior seems correct to me. If you create a view inside a no_grad block, it should not track gradient updates, even if the data is modified later outside of no_grad. This behaves the same as:

z = torch.ones(...)
x = z[0].detach()
x[i] = y

I don't think we should complicate the behavior by trying to "re-connect" x after it is disconnected.

@ssnl (Collaborator, Author) commented Oct 12, 2018

@apaszke I agree with Sam here. In your example, I don't think it makes much sense to connect x back to z after the block. It feels natural to me that there is no gradient relation between z and y, because they only interact through x, which was created in no_grad(). It is still possible to backprop from x to y, which I think is the correct behavior here.

@apaszke (Contributor) commented Oct 14, 2018

I agree that the view should not start tracking gradients, but the problem is that, because it is a view and aliases other values that live entirely in grad-enabled regions, we might miss some differentiable connections. I don't think it's very intuitive.

@colesbury when you argued for introducing gradient context managers, your main argument was that "whether you want to differentiate is a property of the code region and not data that's flowing inside it". That's very reasonable and I agreed. In this case, however, we're pushing this "should you differentiate" decision back into the data (into how Variables are wired), which outlives the context managers. I'm still unconvinced that this is how it should work.

@ssnl (Collaborator, Author) commented Oct 14, 2018

we're pushing this "should you differentiate" decision back into the data (into how Variables are wired)

@apaszke I don't think so. I believe we are exactly making this code block non-differentiable, i.e., view relations constructed inside the no_grad block do not track history. This is exactly what we do for other relations via compute_requires_grad: if compute_requires_grad is false, we do not construct the backward graph or save variables. Similarly, here, if we don't need grad, we do not construct the backward relation between the view and its base.

edit: If you want, I can update this patch to use the output of compute_requires_grad instead, for clarity. Ah, no, that doesn't work: it would not track the view relation when the base doesn't require grad, but views are special-cased in that scenario.
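A rough user-level sketch of the symmetry I mean (the mechanism itself lives in the C++ codegen, so this only shows the visible effect under the proposed semantics):

import torch

a = torch.randn(3, requires_grad=True)

with torch.no_grad():
    b = a * 2   # ordinary op: no backward edge is recorded
    c = a[0]    # view op: under this PR, no view relation is recorded either

print(b.grad_fn, c.grad_fn)  # None None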

@apaszke (Contributor) commented Oct 15, 2018

@ssnl I think the question boils down to how you define what it means for a block of code to be differentiable. Note that all the operations I'm concerned about happen outside the no_grad block. While I agree that no gradients should ever flow through the part of the function encapsulated in that context manager, the surrounding program should work as if that hadn't happened. In this case z is directly influenced by the data of y, and this influence happens in the differentiable region. That seems like an argument for allowing gradient propagation through it.

@ssnl (Collaborator, Author) commented Oct 15, 2018

@apaszke I disagree. Let's recall your example:

z = torch.ones(...)
with torch.no_grad():
  x = z[0]
x[i] = y

Here y only interacts with z via x. So it doesn't make much sense to me to act as if the no_grad() block didn't happen, because x is defined in that block.

@ezyang (Contributor) commented Oct 16, 2018

@apaszke, I don't understand what principle you're trying to argue. It seems to me that you're saying that if a variable has ANY influence on another variable outside of no-grad, we should record a gradient. OK... but, by this reasoning principle, isn't detach() also a bad operation? Because a detached variable is sure as heck influenced by the original variable. What makes a detached variable different from a view operation performed inside a no-grad region?
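To spell out that influence: detach() shares storage with the original, so data changes propagate between the two even though gradients never do. A minimal sketch:

import torch

a = torch.ones(3, requires_grad=True)
d = a.detach()   # shares storage with a; gradients never flow through d
d.zero_()        # modifying d also modifies a's data
print(a)         # tensor([0., 0., 0.], requires_grad=True)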

@apaszke (Contributor) commented Oct 17, 2018

@ezyang I do think that we should change the semantics of detach() to match what I said more closely, and then simply say "every tensor allocated in a no_grad block is as if detach() was called on it". This would give you the semantics I described.

This is what we've converged on in today's meeting with @ssnl and @soumith.
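A minimal sketch of the user-visible consequence of that rule, treating a view made under no_grad the same as a view followed by detach():

import torch

base = torch.randn(4, requires_grad=True)

a = base[0].detach()      # view followed by detach()
with torch.no_grad():
    b = base[0]           # view created with grad mode off

print(a.requires_grad, b.requires_grad)  # False False
print(a.grad_fn, b.grad_fn)              # None None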

@albanD (Collaborator) commented Sep 24, 2019

@ssnl do we actually need to keep this PR open, given that the code here has been merged in #13001?

@ssnl (Collaborator, Author) commented Sep 26, 2019

@albanD I'll close this. But for documentation purposes: the code was not merged in #13001. I updated #13001 to use a different approach.

@ssnl ssnl closed this Sep 26, 2019
@albanD (Collaborator) commented Sep 26, 2019

@ssnl Really? The commits don't have the same hash, but they have the same messages, and the commit contents look very similar. What is the difference?

@ssnl (Collaborator, Author) commented Sep 26, 2019

@albanD Good point! It took me a while to remember the difference. The main discrepancy is that in this PR, views created in no_grad are treated as disconnected from the base tensor, while in that PR they aren't. You are right that most of the code is inherited! I replied to you in the other PR where the exact difference happens.


Successfully merging this pull request may close these issues.

Views created in no_grad block still have requires_grad=True
6 participants