refactor profiling optional #47667
Conversation
💊 Dr. CI failures summary: as of commit 9ceca35, 2 extra GitHub checks failed.
eellison left a comment:
This looks good. I'm a little wary of making changes just because of how easy it is to break something on the backward pass without knowing that you did it. Do you mind running a set of benchmarks to validate that this doesn't regress anything?
Ideally that infrastructure would be automated, but not yet...
```cpp
}

void InsertProfileNodesForSpecializeAutogradZero(ProfilingRecord* pr) {
  insertProfileNodesForSpecializeAutogradZero_(
```
nit: remove the `_` suffix, I don't think that follows the naming convention
```cpp
auto copy = graph->copy();
runProfilingInsensitiveOptimizations(copy);
pr_ = ProfilingRecord::instrumentGraph(copy);
// TODO: order for now is important; we should make
```
What was the context of this?
My bad on this one during rebase.
The comment should be added when we add a second pass that inserts profile_ivalue nodes for nvfuser: https://github.com/pytorch/pytorch/pull/47668/files#diff-4b6a31b50172ed0fce36027a913fab9e59e496954264ad5e021324c74355a9c2R498-R500
cc'ing @Krovatkin
@eellison
If I remember correctly, if we run InsertProfileNodesForCUDAFuser first, it will insert a profile node on a list input (which is also a graph input) to aten::grad_sum_to_size. That in turn breaks the logic in InsertProfileNodesForSpecializeAutogradZero, which checks that the graph input is used directly by aten::grad_sum_to_size: https://github.com/pytorch/pytorch/pull/47667/files#diff-d606514ce9e63ab23356691f3f6df0616222cd50b8a41239add4601f14a53392R16
Ideally, different passes could insert profile nodes in whatever order, but I wasn't quite convinced of the merit of that approach, so I just decided to document the required order.
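To make the ordering issue concrete, here is a minimal sketch of the kind of direct-use check being described; the helper name and the symbol lookup are illustrative assumptions, not the actual code in this PR:

```cpp
#include <torch/csrc/jit/ir/ir.h>

using namespace torch::jit;

// Hypothetical helper: does a graph input directly feed _grad_sum_to_size?
// If InsertProfileNodesForCUDAFuser runs first, the input's user becomes a
// prim::profile_ivalue node instead, and this check no longer fires.
static bool hasDirectGradSumToSizeUse(Value* graph_input) {
  const auto grad_sum_to_size = c10::Symbol::aten("_grad_sum_to_size");
  for (const Use& use : graph_input->uses()) {
    if (use.user->kind() == grad_sum_to_size) {
      return true; // direct use: no profile node in between
    }
  }
  return false;
}
```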
```cpp
  for (auto it = b->nodes().begin(); it != b->nodes().end(); it++) {
-   if (it->kind() == prim::profile || it->kind() == prim::profile_optional) {
+   if (it->kind() == prim::profile || it->kind() == prim::profile_optional ||
+       it->kind() == prim::profile_ivalue) {
```
Could we remove all references to profile_optional?
I quickly tried this with a very simple graph: inputs x/y with and without broadcast, and I can verify the behavior and performance. The refactor in this PR only changes the symbol from prim::profile_optional to prim::profile_ivalue.
eellison left a comment:
Looks good, thanks for doing this!
Mostly small comments/nits, but I would like to retain the original flow of specialize_autogradzero before landing. For context, there have been seemingly-benign changes that completely destroyed perf for hard-to-detect reasons, e.g. #43900.
The backwards graph doesn't have great visibility or testing, and I would prefer not to make changes unrelated to this PR.
```cpp
- // we only check prim::profile and not prim::profile_optional bc profile
- // is inserted on each use, while profile_optional is inserted on the def
+ // we only check prim::profile and not prim::profile_ivalue bc profile
```
The reason profile_optional was inserted on the def and not per use is that we were just checking whether or not a value is None, which stays constant across uses.
General ivalue profiling doesn't have the same property, since you could profile mutable values like List, which can change at each use. So this comment will be out of date once we have more uses of profile_ivalue.
I also wonder whether there's some value in sharing the implementation but having a different node name; I guess it depends on how many different kinds of ivalue profiling we end up with.
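To illustrate the def-site vs. use-site distinction, here is a hedged sketch of the two insertion strategies (helper names are made up; the real passes differ), assuming `profile` was created with one input — the profiled value — and one output:

```cpp
#include <torch/csrc/jit/ir/ir.h>

using namespace torch::jit;

// Def-site profiling: one profile node right after the definition.
// Sound for properties that cannot change between uses, e.g. whether an
// Optional is None. (For graph inputs, the node would be prepended to the
// block instead.)
static void profileOnDef(Value* v, Node* profile) {
  profile->insertAfter(v->node());
  // Route all downstream uses through the profile node's output; the
  // profile node itself keeps v as its input.
  v->replaceAllUsesAfterNodeWith(profile, profile->output());
}

// Use-site profiling: a separate profile node in front of each consumer.
// Needed for mutable values (e.g. List) whose contents may differ per use.
static void profileOnUse(const Use& use, Node* profile) {
  profile->insertBefore(use.user);
  use.user->replaceInput(use.offset, profile->output());
}
```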
I agree that we need to update the comment (and the checks as well?).
> sharing implementation, but having a different node name
I think this is suggesting to have `profile_ivalue` as an extension to `profile_optional`, which is similar to what I had in the original PR prior to Nick's refactor.
IIUC, the refactor is to have a unique profile_ivalue node, with which we would hopefully work towards a coherent profiling strategy where the different passes that deploy/exploit profiled ivalues can work with each other. Granted, that is not what we have today; at the least, the order of the passes has to be arranged in a certain way to work, and it's tricky to cover all the corner cases.
I think the suggested middle ground of having different node names could make it easier, for now, to prevent profiling nodes inserted by different passes from breaching into each other's territory. We can merge them at a later point, once we figure out a universal profiling strategy.
> i also wonder whether there's some value in sharing implementation, but having a different node name. depends i guess on the number of different ivalue profiling we have
At the very least, it would give better readability and a slightly easier check, since you only have to look for a specific symbol (rather than a symbol + attribute).
On the other hand, given the way we are adding profiling nodes, switching to different node names would result in a small explosion of copy-pasted classes in ir.h.
Also, if we decide in the future to merge profiling-node chains into a single profiling node carrying all the attributes, I think the benefit of having just one node per use outweighs differently-named nodes.
I would hold off on this approach.
> Granted that this is not what we have today, at least the order of passes has to be arranged in certain way to work
I would like to emphasize that this is in no way a limitation of the current design! We could update how SpecializeAutogradZero inserts profiling nodes to account for the possibility that some other pass has already inserted profiling nodes in between. Given that we only have 3 passes that insert profiling nodes, I decided to just document the order.
If we can figure out a way to have different node names without a ton of copy-pasta, that's the best. I don't really know how to do that; maybe someone really good at C++ can.
> Also, if we decide in future to merge profiling node chains into a single profiling node with all the attributes, I think, the benefits of having just one single node on a use outweighs differently named nodes.
I don't think this is coming anytime soon (if at all). I don't think we should over-consider it.
> Granted that this is not what we have today, at least the order of passes has to be arranged in certain way to work
I think most things should be profiled per-use, which composes pretty well; specialize_autograd_zero is pretty unique.
```cpp
  }
}

static Node* getUseWithAttribute(Value* inp, Symbol kind) {
```
nit: getUseWithAttribute doesn't fully reflect the implementation, since we're also filtering out non-profile_ivalue uses
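For context, a rough sketch of what a helper like this plausibly does (an assumption based on the review comment, not the PR's exact code); note the two conditions, which is what the naming nit is about:

```cpp
#include <torch/csrc/jit/ir/ir.h>

using namespace torch::jit;

// Sketch only: return a use of `inp` that is a prim::profile_ivalue node
// carrying the given attribute, or nullptr if there is none.
static Node* getUseWithAttribute(Value* inp, Symbol attr) {
  for (const Use& use : inp->uses()) {
    Node* user = use.user;
    // Two filters: the node kind AND the attribute -- so the name
    // "getUseWithAttribute" understates what the function checks.
    if (user->kind() == prim::profile_ivalue && user->hasAttribute(attr)) {
      return user;
    }
  }
  return nullptr;
}
```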
```cpp
    checks.push_back(check->output());
    profiled_none_.insert(inp);
  }
  removeProfiledOptionalUses(inp);
```
I'm not convinced that moving removeProfiledOptionalUses(inp) to after guardSpecializations doesn't have some edge-case effect somewhere. Could we keep the code here as it was, to minimize changes?
I think there were multiple reasons why I moved it after guardSpecializations. One of them was that there could be arbitrarily long chains of prim::profile nodes on some graph inputs. Even for a very simple case I considered, where both TE and SpecializeAutogradZero want to profile the optional list argument to aten::grad_sum_to_size, rewriting removeProfiledOptionalUses looked very ugly, and I hit some edge cases there.
In contrast, walking the graph and removing the nodes after guardSpecializations is done looked much simpler, and it's quite similar to what we do elsewhere.
I tested this with fastrnns and our CI tests and didn't hit any regressions or crashes.
If you feel strongly that removeProfiledOptionalUses(inp) should stay in this specific spot, we could try figuring it out.
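For reference, a minimal sketch of the kind of post-guardSpecializations cleanup walk being described (illustrative; not necessarily the exact code in the PR):

```cpp
#include <torch/csrc/jit/ir/ir.h>

using namespace torch::jit;

// Recursively strip prim::profile_optional nodes from a block once
// guardSpecializations has consumed their information: each node's output
// is replaced by its input, then the node is destroyed.
static void removeProfileOptionalNodes(Block* block) {
  for (auto it = block->nodes().begin(); it != block->nodes().end();) {
    Node* n = *it;
    ++it; // advance first so destroying n does not invalidate the iterator
    for (Block* inner : n->blocks()) {
      removeProfileOptionalNodes(inner);
    }
    if (n->kind() == prim::profile_optional) {
      n->output()->replaceAllUsesWith(n->input());
      n->destroy();
    }
  }
}
```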
If "rewriting removeProfiledOptionalUses looked very ugly", that is a good sign that pass writing is becoming more difficult as a result of the overloaded profile_ivalue. cc @jjsjann123
I don't feel that strongly, but I do prefer it as it was. IMO, when there isn't a large overhead to doing so, we should clean the graph as we manipulate it in the pass. It's the same reason I added …: it was extremely confusing to look at the pass midway through and see three copies of the graph. Why was it there? Every time I debugged the pass I had to re-remind myself and mentally edit out the dead copy of the graph. Similarly, after we've made the decision whether or not to specialize on None values, why not remove the nodes right there? They are just visual noise for the duration of guardSpecializations.
The logic right now isn't quite right, either:
```
def foo(x):
    x = profile_ivalue[None=1, Present=1]
    return x
```
Here we won't insert any specializations, so the profile_ivalue nodes will remain in the graph. Granted, this isn't likely to happen, but it's a fallout of delaying the logic past when it needs to run.
I am maybe more opinionated about this than I should be, because in the past I've had to deal with passes that had incorrect IR (like when we never updated the requires_grad property) and with other passes that similarly left things in confusing states throughout the pass.
This is a cherry-pick from upstream PR pytorch#47668 (profile ivalue for nvfuser). We tried to go around PR pytorch#47667 (refactor profiling optional), since upstream is still working on it at this time.
- createConditionalConstant supports profiled ivalues, including bool, int_list, and size
- New guard to check conditional constants at runtime
- size_eq_guard op to facilitate comparison of dynamic sizes
- sum_to_size & _grad_sum_to_size added in integration
Krovatkin left a comment:
Looks like we addressed @eellison's feedback (e.g., moving removeProfiledOptionalUses back to its original spot and double-checking we aren't seeing any obvious regressions on fastrnns).
@jjsjann123 could you please fix the clang-format failures and rebase this PR, and I'll try to merge it?
clang-format seems to be broken; I'll try rebasing again tomorrow to see if someone has fixed it.
Hmm, even though my local clang-format is still failing, CI magically works :)
The clang-tidy error is not on code modified by this PR; should I try to tidy it up here? The ROCm failure doesn't seem relevant. @Krovatkin, feel free to yell at me to fix clang-tidy; otherwise, this one is good to go.
@jjsjann123 that should be fine. @eellison wanted to take one last look; he should be getting to it shortly.
LGTM!
@Krovatkin merged this pull request in dd1c2a0.
Summary: Pull Request resolved: pytorch#47667
Test Plan: Imported from OSS
Reviewed By: anjali411, ngimel
Differential Revision: D25255572
Pulled By: Krovatkin
fbshipit-source-id: d0152c9ef5b1994e27be9888bcb123dca3ecd88f