Fix for ppc64le jit graph difference in sigmoid backward, see #10726 (#11579)
avmgithub wants to merge 8 commits into pytorch:master from avmgithub:master
Conversation
@pytorchbot retest this please
ezyang
left a comment
This is not really sustainable; most devs will not be able to keep the ppc files up to date. It would be better to figure out why results are different on ppc.
So, the first thing I would check is that the canonicalizer is working on subgraphs. There don't appear to be structural differences, and if that's the case, the graphs SHOULD number identically. It really looks like we're not canonicalizing subgraphs.
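To make the numbering point concrete, here is a toy sketch (illustrative only, not PyTorch's actual canonicalizer) of what canonicalization buys you: renumbering a graph's values in topological order makes two structurally identical graphs print the same text, regardless of the value numbers they started with.

```python
# Illustrative sketch of canonical value renumbering for a toy IR.
# A node is (op, [input value ids], output value id), in topological order.
def canonicalize(nodes):
    remap = {}    # old value id -> canonical id
    next_id = 0
    out = []
    for op, inputs, output in nodes:
        ins = []
        for i in inputs:
            if i not in remap:   # graph inputs get numbered on first use
                remap[i] = next_id
                next_id += 1
            ins.append(remap[i])
        remap[output] = next_id
        next_id += 1
        out.append((op, ins, remap[output]))
    return out

# Two numberings of the same graph: %a = sigmoid(%x); %b = mul(%a, %x)
g1 = [("sigmoid", [10], 11), ("mul", [11, 10], 12)]
g2 = [("sigmoid", [70], 73), ("mul", [73, 70], 99)]
assert canonicalize(g1) == canonicalize(g2)
```

If canonicalization were skipped for subgraphs (such as the bodies of fusion groups or `prim::GradOf` blocks), the top-level graphs could still match while the subgraph text diverged between architectures, which is consistent with the diffs reported in this thread.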
@ezyang are you referring to the Canonicalize function here? And when you say "not canonicalizing subgraphs", are you referring to this? It looks to me like they are fine. If they were not, would I not also see the problem in the forward graph? Just to add to the problem description: the problem only happens for the backward graph; the forward graph is fine. For example, part of the backward-graph output of test_milstm_fusion_cuda is below: Output of ppc64le Output of x86 So it seems the instructions are somewhat re-arranged. Do you think this is due to canonicalization?
@avmgithub not knowing very much about what is wrong, I would start from pytorch/torch/csrc/jit/graph_executor.cpp Line 372 in e00fb69: print the graph there (std::cout << *graph << std::endl;) on a ppc64le arch and an x86 arch, and see where it differs.
@zou3519, I've attached 2 files, one for x86 and one for ppc64le, from a test run like: python -m unittest -q test_jit.TestScript.test_milstm_fusion_cuda. There are differences in the sigmoid function. I'm assuming the graph is for the backward operation. Here is an example: x86: ppc: There are 3 such instances, since there are three sigmoid functions in the MiLSTMCell function in test_jit.py.
@zou3519 When you get the chance, do you know where (which source file) the sigmoid backward graph is formed? This seems to be where the discrepancy is coming from. Example: %70 : Dynamic = prim::GradOf[name="aten::sigmoid"]
It depends, but it is most likely coming from autodiff here: pytorch/torch/csrc/jit/autodiff.cpp Line 101 in 76ab26c
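For readers following along: the symbolic rule autodiff emits for sigmoid is just the textbook derivative. A minimal sketch of the formula (not PyTorch's actual code) is below, checked against a finite-difference estimate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# With y = sigmoid(x), the backward rule is dL/dx = dL/dy * y * (1 - y).
def sigmoid_backward(grad_out, y):
    return grad_out * y * (1.0 - y)

# Sanity-check against a central-difference estimate of the derivative.
x, eps = 0.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert abs(sigmoid_backward(1.0, sigmoid(x)) - numeric) < 1e-6
```

Because `y * (1 - y)` can be emitted in more than one algebraically equivalent node order, two builds that construct the expression differently can produce isomorphic but textually different backward graphs, which matches what this PR is working around.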
@zou3519 Thanks for the tip. It looks like it works if I re-arrange the line: and then re-run: python test/test_jit.py TestScript.test_milstm_fusion_cuda --accept. Can you please let me know if this is OK to do? I have no idea why I have to re-arrange it to get the same backward expect output.
I tested the proposed solution on a ppc64le system and it works for me.
zou3519
left a comment
It's not a complete fix, but it works for now and is simple enough. Please add a TODO/comment to the code to prevent regressions, so that someone in the future will take a deeper look at it.
@zou3519 Thanks for approving. Was there something else that needs to be done before this can be merged? It still says merging is blocked.
facebook-github-bot
left a comment
soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
As reported in issue #10726, the JIT compiler, when running on ppc64le, may produce isomorphic output yet fail a diff test against the expected output file, because the expected output file was created from a test run on x86_64. This change ensures that when the ppc64le test output differs, it is instead compared against an expected output file created when the test is run on a ppc64le system.
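The selection mechanism the description implies can be sketched as follows. Names here are illustrative, not PyTorch's actual test helpers: pick an architecture-specific expect file when running on ppc64le, and fall back to the shared x86_64-generated file otherwise.

```python
import platform

# Hypothetical helper: choose the expect file for a JIT test based on the
# machine architecture, so ppc64le output is compared against expectations
# generated on a ppc64le system.
def expect_file_name(test_name, arch=None):
    arch = arch or platform.machine()
    if arch == "ppc64le":
        return f"expect/{test_name}-{arch}.expect"
    return f"expect/{test_name}.expect"

assert expect_file_name("test_milstm_fusion_cuda", arch="ppc64le") == \
    "expect/test_milstm_fusion_cuda-ppc64le.expect"
assert expect_file_name("test_milstm_fusion_cuda", arch="x86_64") == \
    "expect/test_milstm_fusion_cuda.expect"
```

As zou3519 notes above, this works but duplicates expectations per architecture; the deeper fix would be making the backward-graph text deterministic across platforms so one expect file suffices.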