Immediately compile backwards graph in AOTAutograd if dynamic shapes#104971
Immediately compile backwards graph in AOTAutograd if dynamic shapes#104971ezyang wants to merge 8 commits intogh/ezyang/2220/basefrom
Conversation
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104971
Note: Links to docs will display an error until the docs builds have been completed. ✅ 3 Unrelated FailuresAs of commit ada9501: BROKEN TRUNK - The following job failed but were present on the merge base 8c479d3:👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> ghstack-source-id: 74ba604 Pull Request resolved: #104971
torch/_functorch/aot_autograd.py
Outdated
| with track_graph_compiling(aot_config, "backward"): | ||
| placeholder_list = fx_placeholder_vals(bw_module) | ||
|
|
||
| compiled_bw_func = aot_config.bw_compiler( |
There was a problem hiding this comment.
fwiw - one thing that subclass support will (eventually, not immediately) need is backwards guards: when we generate the joint, we might incorrectly assume that the grad_outputs are/are_not subclasses, which would require us to re-trace and recompile the backward later (I'm writing a doc on subclass requirements, more details will be in the doc).
Compiling the backward eagerly is probably not optimal if we end up having to recompile the backward later, although maybe we're okay with this (since invalidating bw guards is hopefully rare).
There was a problem hiding this comment.
@bdhirsh Your thing is going to need true two level cache. But IMO you should just force your users to use compiled backwards in that case, which no longer has this problem. (BTW, @jansel's compiled autograd is what convinced me to do this "simpler" fix; basically if you have any complicated situation where we don't know ahead of time what the gradients will be, you instead use compiled autograd to be able to compile given full info.)
There was a problem hiding this comment.
ok, telling users to use compiled autograd when this happens sounds fair! (still need to eventually figure out how to teach compiled autograd how to add extra bw guards)
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> [ghstack-poisoned]
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> [ghstack-poisoned]
…mic shapes" Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately if we capture any SymInts for backwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> [ghstack-poisoned]
…mic shapes" Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately if we capture any SymInts for backwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> [ghstack-poisoned]
…mic shapes" Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately if we capture any SymInts for backwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> [ghstack-poisoned]
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> ghstack-source-id: 13a23ce Pull Request resolved: #104971
|
If there are no dynamic shapes, I restore the old behavior of lazy compilation to shut up some TorchScript failures. If eager backwards compilation fails, I suppress it, which gets us through some failures in our test suite where our backwards dynamic codegen doesn't actually work. |
| # NB: It's important to compile backwards ahead of time, as this may | ||
| # add extra guards which we need to apply to the Dynamo cache at | ||
| # forwards |
|
|
||
| def run_and_get_triton_code(fn, *args, **kwargs): | ||
| _, source_codes = run_and_get_code(fn, *args, **kwargs) | ||
| # Can have two outputs if backwards was eagerly compiled |
There was a problem hiding this comment.
Can we store a flag to drive if this should be exactly 1, or of (1, 2)?
There was a problem hiding this comment.
It's very awkward, because the current implementation will attempt to compile backwards, and if backwards failed to compile, suppress the error and return anyway. So there isn't really a clear delineation.
| ) | ||
| m.eval() | ||
| self.common(m, (torch.randn([16, 32]),), check_lowp=False) | ||
| with torch.no_grad(): |
There was a problem hiding this comment.
Layer norm's backward compilation doesn't work, so the no grad forces us not to attempt compile it
| # saved activations can have different stride to eager if | ||
| # the compiler does layout optimization. We should restride the | ||
| # tensor passed in for compiling the backward graph using the | ||
| # saved tensor's stride. |
There was a problem hiding this comment.
I would stamp but I don't know the nuances of strides well enough.
…mic shapes" Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately if we capture any SymInts for backwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 [ghstack-poisoned]
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> ghstack-source-id: 47eb2c4 Pull Request resolved: #104971
|
This is ready to go, just waiting for review. |
| try: | ||
| compiled_bw_func = aot_config.bw_compiler( | ||
| bw_module, placeholder_list | ||
| ) | ||
| except Exception: | ||
| log.warning( | ||
| "failed to eagerly compile backwards for dynamic, suppressing in case backwards not needed", | ||
| exc_info=True | ||
| ) |
There was a problem hiding this comment.
What necessitates this ? It would be nice to not land with a try-catch.
There was a problem hiding this comment.
Per operator backwards dynamic codegen is still buggy af. Sample run: https://hud.pytorch.org/pytorch/pytorch/pull/104971?sha=b0bb7782a9eb8b85acbcb9ffe1835c07c6dc9f80
I'm not actually suppressing anything. If you actually try to compile backwards, we will try to compile again and THEN fail. If this suppression works out, it just means you didn't actually need the backwards graph at all.
Chillee
left a comment
There was a problem hiding this comment.
iiuc, this is mostly just code movement?
| # the compiler does layout optimization. We should restride the | ||
| # tensor passed in for compiling the backward graph using the | ||
| # saved tensor's stride. | ||
| for i in range(len(placeholder_list)): |
There was a problem hiding this comment.
I'm assuming this is all code movement?
|
@pytorchbot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
…mic shapes" Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately if we capture any SymInts for backwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 [ghstack-poisoned]
|
Successfully rebased |
Previously, we made backwards graph compilation lazy to avoid paying for compilation if the user didn't actually end up using the backwards graph. This was useful in the old days when a lot of things in Inductor didn't work and we could bypass errors this way. However, this has a bad implication for dynamic shapes: the backwards graph compilation can trigger extra guards, which are too late to install in the Dynamo context if we wait until backwards is being run. So in this PR I move us back to compiling backwards graph immediately. This should also make it easier to predict when compilation occurs, since compilation now all happens up front during forwards. Signed-off-by: Edward Z. Yang <ezyangmeta.com> ghstack-source-id: 6101c9f Pull Request resolved: #104971
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Stack from ghstack (oldest at bottom):
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph. This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.
However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling backwards graph immediately
if we capture any SymInts for backwards.
Signed-off-by: Edward Z. Yang ezyang@meta.com
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov