Avoid making node a successor/predecessor of itself#161205
Avoid making node a successor/predecessor of itself#161205eellison wants to merge 2 commits intogh/eellison/821/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161205
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 7299c21 with merge base 9668210 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
| for index, node in enumerate(nodes): | ||
| size_alloc = sum(buffer.mpi_buffer.size_alloc for buffer in node.get_outputs()) | ||
| succ_nodes = node_to_succ_nodes[node] | ||
| pred_nodes = node_to_pred_nodes[node] |
There was a problem hiding this comment.
Hmm, even though this fixes the issue, I wonder why there are cyclic dependency in the first place.
I wonder if there are some errors in how the pred_nodes and succ_nodes are constructed, or if there are some issues in the dependency tracking.
There was a problem hiding this comment.
i will send you the repro. the input GraphModule is not cyclic, so it must be something as a result of dependencies. this was blocking some of the training runs so i want to land this but agreed more investigation after is good.
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at. Pull Request resolved: pytorch#161205 Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at. Pull Request resolved: pytorch#161205 Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
Stack from ghstack (oldest at bottom):
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben