Differentiable and Non-differentiable Views #12502
Conversation
torch/csrc/autograd/variable.h (outdated)
/// NOTE [ Autograd View Variables ]
///
/// Many operations return a Variable that shares storage with an input Variable.
/// The returned Variable is called a **view** Variable on the input **base**
torch/csrc/autograd/variable.h (outdated)
/// In certain cases, although function outputs share storage with inputs, they
/// will **never** require gradient history tracking. Instead of registering the
/// view relation via DifferentiableViewImpl in autograd, the views will use
/// the usual Variable::Impl and just share the version counters with the base
Thanks Ed :)
I'm landing because the last commit is only a text fix and CI passed on the commit before it.
Mostly LGTM, but I'm not sure this is the right strategy. I don't think the decisions we make in no_grad() regions should affect how the system behaves outside of them. That is, if you create a view inside a no_grad() block, it should obviously act as a variable that doesn't require grad in that place, but I can't see why it shouldn't later behave as if you had executed the code without that block. In general, "turning gradients off" should be local to a region of code, whereas now that fact escapes into the data of the program.
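A minimal sketch of the scenario being debated, with illustrative names and shapes (not taken from the PR's tests); the open question is how the write on the last line should be treated:

import torch

z = torch.ones(4)                      # base tensor with no grad history
y = torch.ones(1, requires_grad=True)  # value that does require grad

with torch.no_grad():
    x = z[0:1]   # view of z created inside the no_grad block

# This write happens outside no_grad. Under this PR the view relation is
# non-differentiable, so the write is never recorded against z; under the
# behavior suggested in the comment above, it would be tracked as if the
# view had been created outside the block.
x[0] = y[0]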
@@ -172,7 +166,7 @@ void Variable::Impl::release_resources() {
   hooks_.clear();
 }

-Variable::ViewImpl::ViewImpl(Variable base, at::Tensor data, Edge gradient_edge)
+Variable::DifferentiableViewImpl::DifferentiableViewImpl(Variable base, at::Tensor data, Edge gradient_edge)
IIRC the last conclusion we reached on this topic was simply to make views return
I don't think this is correct. What you are saying implies that it should be possible to create a view inside no_grad and still have later operations on it tracked against the base, but operations done in no_grad should not be tracked.

It was never made clear which view operations should be tracked and which should not; previously we basically only had differentiable views. Moreover, I much prefer fixing things in a general way and making the rules clear about how each thing should behave, rather than applying local patches that make the code base difficult to navigate and to understand why something happens (both for newcomers and for core devs). For one, I really don't want to apply a manual patch for each individual case.

The conclusion we arrived at last time, I think, was to treat the output as if there is a detach() after the view operation. Maybe I remembered it wrong, but given the above, I still think this patch is the correct thing to do.
So while your argument is very convincing, I still think there are some edge cases that might need extra care. The reason why views are special is exactly that they are views: an in-place modification made to them can have non-trivial, globally visible effects. For example, consider the operation x[i] = y, and assume that y requires grad. Now, if x is an ordinary view of some tensor z created in grad mode, the write is recorded and later uses of z can backpropagate into y. On the other hand, if the view was created inside no_grad:

z = torch.ones(...)
with torch.no_grad():
    x = z[0]
x[i] = y

Note that all later use sites of the data of z still depend on y, yet no gradient will ever flow back to it.
Simon's behavior seems correct to me. If you create a view inside a no_grad block, it should not track gradient updates, even if the data is modified later outside of no_grad. This behaves the same as:

z = torch.ones(...)
x = z[0].detach()
x[i] = y

I don't think we should complicate the behavior by trying to "re-connect" x after it is disconnected.
@apaszke I agree with Sam here. In your example, I don't think it makes much sense to connect z to y there.
I agree that the view should not start tracking gradients, but the problem is that the fact that it's a view, and aliases other values that live entirely in grad-enabled code, escapes the no_grad block. @colesbury, when you argued for introducing gradient context managers, your main argument was that "whether you want to differentiate is a property of the code region and not data that's flowing inside it". That's very reasonable and I agreed. In this case, however, we're pushing this "should you differentiate" decision back into the data (into how Variables are wired), which outlives the context managers. I'm still unconvinced that this is how this should work.
@apaszke I don't think so. I believe we are exactly making this code block non-differentiable, i.e., we are making the view relations constructed in this block non-differentiable.
@ssnl I think the question boils down to how you define what it means for a block of code to be differentiable. Note that all the operations that concern me happen outside of the no_grad block.
@apaszke I disagree. Let's recall your example:

z = torch.ones(...)
with torch.no_grad():
    x = z[0]
x[i] = y

Here the operation that creates the view relation, x = z[0], happens inside the no_grad block; only the later write happens outside, and that view relation is exactly what this PR makes non-differentiable.
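For what it's worth, under the semantics Simon describes, the tracked behavior remains available by creating the view in grad mode; a rough sketch with illustrative names (not code from the PR):

import torch

z = torch.ones(4)
y = torch.ones(1, requires_grad=True)

with torch.no_grad():
    pass          # whatever untracked work happens here

x = z[0:1]        # view created outside no_grad: a differentiable view
x[0] = y[0]       # recorded, so later uses of z can backpropagate into y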
@apaszke, I don't understand what principle you're trying to argue. It seems to me that you're saying that if a variable has ANY influence on another variable outside of no-grad, we should record a gradient. OK... but, by this reasoning, isn't detach() subject to the same objection?
@ezyang I do think that we should change the semantics of detach() as well. This is what we've converged on in today's meeting with @ssnl and @soumith.
@ssnl Really? The commits don't have the same hash, but they have the same messages, and the commit contents look very similar. What is the difference?
@albanD Good point! It took me a while to remember the difference. The main discrepancy is that in this PR, views created in no_grad become non-differentiable views.
Fixes #11390.
This PR introduces the idea of non-differentiable views. A non-differentiable view is a view that shares storage with the base variable, but through whose view relation gradient should never flow. This includes:

sparse_tensor.indices()
(This is being added in [sparse] Autograd get_indices/values and sparse_coo ctor #11253; I base #11253 on this PR and update the note in that PR.)

See the note in this PR for details; a rough illustrative sketch follows below.
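A rough illustration of the sparse case, assuming the torch.sparse_coo_tensor, Tensor.indices(), and Tensor.values() APIs (a sketch, not code from this PR):

import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([3.0, 4.0], requires_grad=True)
s = torch.sparse_coo_tensor(i, v, (2, 2)).coalesce()

idx = s.indices()  # shares storage with s, but gradient should never flow
                   # through this view relation (the indices are integral)
val = s.values()   # also a view of s; per #11253 gradient can flow through it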
cc @colesbury @apaszke @gchanan