Differentiable and Non-differentiable Views #12502


Closed
ssnl wants to merge 10 commits from the nondiff_view branch

Conversation

@ssnl (Collaborator) commented Oct 9, 2018

Fixes #11390.

This PR introduces the idea of non-differentiable views. A non-differentiable view is a view that shares storage with the base variable, but through which gradients should never flow. This includes:

  1. .detach()
  2. Views created when GradMode is disabled
  3. Views that are non-differentiable by nature, e.g., sparse_tensor.indices() (being added in #11253, "[sparse] Autograd get_indices/values and sparse_coo ctor"; I base that PR on this one and updated its note accordingly)

See the note in this PR for details.
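For concreteness, a minimal sketch of the three cases (for the sparse example, exact accessor availability depends on the PyTorch version; #11253 is what concerns those accessors):

import torch

base = torch.randn(3, 4, requires_grad=True)

# 1. .detach() shares storage with `base`, but gradients never flow through it.
v1 = base.detach()

# 2. A view created while grad mode is disabled.
with torch.no_grad():
    v2 = base[0]

# 3. A view that is non-differentiable by nature: the integer indices of a
#    sparse tensor.
s = torch.sparse_coo_tensor(torch.tensor([[0, 1]]), torch.tensor([1.0, 2.0])).coalesce()
v3 = s.indices()

print(v1.requires_grad, v2.requires_grad, v3.requires_grad)  # False False False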

cc @colesbury @apaszke @gchanan

@ssnl force-pushed the nondiff_view branch 3 times, most recently from a633257 to 7102b9b on October 10, 2018 at 23:43
@ssnl changed the title from "View op outputs are not registered as views when !GradMode::enabled()" to "Differentiable and Non-differentiable Views" on Oct 10, 2018
@facebook-github-bot (Contributor) left a comment

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

/// NOTE [ Autograd View Variables ]
///
/// Many operations return a Variable that shares storage with an input Variable.
/// The returned Variable is called a **view** Variable on the input **base**

This comment was marked as off-topic.

@facebook-github-bot (Contributor) left a comment

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

/// In certain cases, although function outputs share storage with inputs, they
/// will **never** require gradient history tracking. Instead of registering the
/// view relation via DifferentiableViewImpl in autograd, the views will
/// use the usual Variable::Impl and just share the version counters with the base

This comment was marked as off-topic.

@ssnl (Collaborator, Author) commented Oct 12, 2018

Thanks Ed :)

@ssnl (Collaborator, Author) commented Oct 12, 2018

I'm landing this because the last commit is only a text fix and CI passed on the commit before it.

@facebook-github-bot (Contributor) left a comment

SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apaszke (Contributor) left a comment

Mostly LGTM, but I'm not sure this is the right strategy. I don't think that decisions we make in no_grad() regions should affect how the system behaves outside of them. I.e., if you create a view in a no_grad() block, then obviously it should act as a variable that doesn't require grad in that place, but I can't see why it shouldn't behave later as if you had executed the code without that block. In general, "turning gradients off" should be local to that part of the code, whereas now you have this fact escaping into the data of the program.

@@ -172,7 +166,7 @@ void Variable::Impl::release_resources() {
hooks_.clear();
}

- Variable::ViewImpl::ViewImpl(Variable base, at::Tensor data, Edge gradient_edge)
+ Variable::DifferentiableViewImpl::DifferentiableViewImpl(Variable base, at::Tensor data, Edge gradient_edge)

This comment was marked as off-topic.

This comment was marked as off-topic.

@apaszke (Contributor) commented Oct 12, 2018

IIRC the last conclusion we reached on this topic was simply to make views return False when asked for requires_grad() in no-grad regions, instead of doing a whole refactor of the semantics like this one.

@ssnl (Collaborator, Author) commented Oct 12, 2018

but I can't see why it shouldn't behave later as if you had executed the code without that block.

I don't think this is correct. What you are saying implies that it should be possible to do

with torch.no_grad():
  y = net(x)

torch.autograd.grad(y, x)

But operations done in a no_grad() block should not keep any buffers or be tracked by autograd history. We should make view ops consistent with other ops: their outputs should just share the version counter with the base, but not track any other autograd state. I am quite certain that this is the behavior users expect most of the time. Anecdotally, I use PyTorch in my research and I would certainly expect so.
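To make that concrete, a minimal sketch (illustrative only; the exact error text may differ across versions):

import torch

x = torch.randn(2, requires_grad=True)
with torch.no_grad():
    y = x * 2          # no autograd history is recorded inside no_grad()

print(y.requires_grad)  # False: there is no graph connecting y back to x
# torch.autograd.grad(y, x)  # would raise, because y has no grad_fn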

It was never made clear which view operations should be tracked and which should not. Previously we basically only had .detach() as a non-differentiable view. Now we realize that there is also no_grad(), and, in the future, view ops whose outputs are non-differentiable by nature, e.g., sparse_tensor.indices(). So I believe it is good to have a mechanism like this. IMO, it makes everything easier to understand, with a very clear rule.

Moreover, I much prefer fixing things in a general way and making the rules about how each thing should behave clear, rather than applying local patches that make the code base difficult to navigate and make it hard to understand why something happens (both for newcomers and for core devs). For one, I really don't want to apply a manual patch to sparse_tensor.indices() just to make an op share storage with its input, especially when all the information is already there in the yaml/codegen scripts (it is a view op; its output is not differentiable).

The conclusion we arrived at last time, I think, was to treat the output as if there were a detach() after the view operation. Maybe I remember wrong. But given the above thoughts, I still think this patch is the correct thing to do.

@apaszke (Contributor) commented Oct 12, 2018

So while your argument is very convincing, I still think there are some edge cases that might need extra care. The reason views are special is exactly because they are views: in-place modifications made to them can have non-trivial, globally visible effects. For example, consider this operation, and assume that y.requires_grad is True:

x[i] = y

Now, if x is a base variable, then it doesn't matter too much that we'll set its requires_grad to True, because all views based on it will generally get updated (unless they are created in a no-grad block, which you convinced me is ok).

On the other hand, if x is a view onto a different tensor (let's call it z) that was created outside of a no-grad block, then the data of y will be present in z as well, and so usages of z should count towards y's gradient too! Your patch, however, will happily drop the aliasing information. This is exactly the situation I'm talking about:

z = torch.ones(...)
with torch.no_grad():
  x = z[0]
x[i] = y

Note that all use sites of the data of z (not its metadata) are outside of no_grad blocks.

@colesbury (Member) commented

Simon's behavior seems correct to me. If you create a view inside a no_grad block, it should not track gradient updates, even if the data is modified later outside of no_grad. This behaves the same as:

z = torch.ones(...)
x = z[0].detach()
x[i] = y

I don't think we should complicate the behavior by trying to "re-connect" x after it is disconnected.

@ssnl (Collaborator, Author) commented Oct 12, 2018

@apaszke I agree with Sam here. In your example, I don't think it makes much sense to connect x back to z after the block. It feels natural to me that there is no gradient relation between z and y, because they only interact through x, which was created in no_grad(). It is still possible to backprop from x to y, which I think is the correct behavior here.

@apaszke (Contributor) commented Oct 14, 2018

I agree that the view should not start tracking gradients, but the problem is that, because it is a view and aliases other values that live entirely in grad-enabled regions, we might miss some differentiable connections. I don't think it's very intuitive.

@colesbury when you argued for introducing gradient context managers, your main argument was that "whether you want to differentiate is a property of the code region and not data that's flowing inside it". That's very reasonable and I agreed. In this case, however, we're pushing this "should you differentiate" decision back into the data (into how Variables are wired), which outlives the context managers. I'm still unconvinced that this is how it should work.

@ssnl (Collaborator, Author) commented Oct 14, 2018

we're pushing this "should you differentiate" decision back into the data (into how Variables are wired)

@apaszke I don't think so. I believe we are exactly making this code block non-differentiable, i.e., view relations constructed inside the no_grad block do not track history. This is exactly what we do for other relations via compute_requires_grad: if compute_requires_grad is false, we do not construct the backward graph or save variables. Similarly, here, if we don't need grad, we do not construct the backward relation between the view and its base.

edit: If you want, I can update this patch to use the output of compute_requires_grad instead, for clarity. Ah, no, that doesn't work: it would not track the view relation when the base doesn't require grad, but views are special-cased in that scenario.
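A rough user-level sketch of the symmetry I mean (the mechanism itself lives in the C++ codegen, so this only shows the visible effect under the proposed semantics):

import torch

a = torch.randn(3, requires_grad=True)

with torch.no_grad():
    b = a * 2   # ordinary op: no backward edge is recorded
    c = a[0]    # view op: under this PR, no view relation is recorded either

print(b.grad_fn, c.grad_fn)  # None None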

@apaszke (Contributor) commented Oct 15, 2018

@ssnl I think the question boils down to how you define what it means for a block of code to be differentiable. Note that all the operations I'm concerned about happen outside the no_grad block. While I agree that no gradients should ever flow through the part of the function encapsulated in that context manager, the surrounding program should work as if that hadn't happened. In this case z is directly influenced by the data of y, and this influence happens in the differentiable region. That seems like an argument for allowing gradient propagation through it.

@ssnl (Collaborator, Author) commented Oct 15, 2018

@apaszke I disagree. Let's recall your example:

z = torch.ones(...)
with torch.no_grad():
  x = z[0]
x[i] = y

Here y only interacts with z via x. So it doesn't make much sense to me to act as if the no_grad() block didn't happen, because x is defined in that block.

@ezyang (Contributor) commented Oct 16, 2018

@apaszke, I don't understand what principle you're trying to argue. It seems to me that you're saying that if a variable has ANY influence on another variable outside of no-grad, we should record a gradient. OK... but, by this reasoning principle, isn't detach() also a bad operation? Because a detached variable is sure as heck influenced by the original variable. What makes a detached variable different from a view operation performed inside a no-grad region?
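To spell out that influence: detach() shares storage with the original, so data changes propagate between the two even though gradients never do. A minimal sketch:

import torch

a = torch.ones(3, requires_grad=True)
d = a.detach()   # shares storage with a; gradients never flow through d
d.zero_()        # modifying d also modifies a's data
print(a)         # tensor([0., 0., 0.], requires_grad=True)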

@apaszke (Contributor) commented Oct 17, 2018

@ezyang I do think that we should change the semantics of detach() to match what I said more closely, and then simply say "every tensor allocated in a no_grad block is as if detach() was called on it". This would give you the semantics I described.

This is what we've converged on in today's meeting with @ssnl and @soumith.
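A minimal sketch of the user-visible consequence of that rule, treating a view made under no_grad the same as a view followed by detach():

import torch

base = torch.randn(4, requires_grad=True)

a = base[0].detach()      # view followed by detach()
with torch.no_grad():
    b = base[0]           # view created with grad mode off

print(a.requires_grad, b.requires_grad)  # False False
print(a.grad_fn, b.grad_fn)              # None None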

@albanD (Collaborator) commented Sep 24, 2019

@ssnl do we actually need to keep this PR open, given that the code here has been merged in #13001?

@ssnl (Collaborator, Author) commented Sep 26, 2019

@albanD I'll close this. But for documentation purposes: the code was not merged in #13001. I updated #13001 to use a different approach.

@ssnl ssnl closed this Sep 26, 2019
@albanD (Collaborator) commented Sep 26, 2019

@ssnl Really? The commits don't have the same hash, but they have the same messages, and the commit contents look very similar. What is the difference?

@ssnl (Collaborator, Author) commented Sep 26, 2019

@albanD Good point! It took me a while to remember the difference. The main discrepancy is that in this PR, views created in no_grad are treated as disconnected from the base tensor, while in that PR they aren't. You are right that most of the code is inherited! I replied to you in the other PR where the exact difference happens.


Successfully merging this pull request may close these issues.

Views created in no_grad block still have requires_grad=True
6 participants