Move function deletion from the stack to the heap. #11534
resistor wants to merge 1 commit into pytorch:master from
Conversation
We should benchmark how much additional time this takes; I'm not sure how expensive it is to do a full traversal of the computation graph.

It's doing the same amount of traversal, just using the heap instead of the program stack.
f1aeaa0 to ffdd7df
Ah, I see, my bad.
zou3519 left a comment
This looks correct to me (and is much more elegant than what we had before). Some before/after numbers would still be nice, just in case, because deleting computation graphs happens all the time.
@zou3519 Can you recommend a test case?
facebook-github-bot left a comment
resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
apaszke left a comment
I'd rather vote for making the recursion limit much smaller (think ~500), while still preferring stack allocation to heap allocation. It's extremely unlikely that someone will get an overflow with that many frames.
I remember that making the recursion limit smaller added overhead in the 1 to 1000 millisecond range, but I haven't run the numbers in a while. I'm fine with making the recursion limit much smaller, though, if you think stack allocation is better than heap allocation, @apaszke.
@apaszke Why? This is a very common pattern in compiler-like systems, where recursion over program representations ends up breaking some day on an unexpectedly large program. Programmers aren't in the habit of listening to compiler authors about what a "reasonable" size for a program is.
Which is reasonable, because those run either ahead of time or in a background thread (in the case of JITs). Here, the destruction blocks the main program's execution, so it should be as fast as possible.
@apaszke Speed is of no value if it crashes. And plenty of JITs block program execution on compilation. I'm also not sure why you would assume the prior version is faster. The memory locality of using vectors will generally be better than using deques like the current version does, and that is borne out in measurement. Decreasing the recursion threshold will only make the difference more pronounced by moving more work from the stack onto the deque.
Lack of (deterministically stack-overflow triggered) crashes in 0.01% of use cases is of no value if your library is 2x slower than other libraries in most other cases 😉

Fine, but that only happens once (or a few times), no matter how long your program runs. This destruction happens at every training step, over the whole duration of the program. I didn't notice that you changed the deque for a vector. That sounds like a good idea and would be an improvement even for the older code, so it might be worth re-benchmarking against master + vector. Also, can you please post the same benchmark with a graph of low depth (to check how the stack strategy performs compared to always using the heap)?
torch/csrc/autograd/function.cpp
Outdated
ezyang left a comment

I like killing heuristics. I would like it if some of the nit comments were addressed, though.
@ezyang Nits should be fixed in the latest version. Let me know if you have further feedback.
@pytorchbot retest this please
facebook-github-bot left a comment
resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Here are timings using the same script with a 100x smaller workload:
facebook-github-bot left a comment
resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Well, it is a 6% regression, which is not great, tbh. Any impact on actual training scripts? Can you try the word language model without cuDNN, plus a ResNet50?
This eliminates the need for any heuristics regarding stack size limits.
I implemented some improvements that reduce the small-workload overhead, such that it is no longer a regression in this case.

For clarity, the changes that sped it up were eliminating an unnecessary move in the primary deleter loop and making the gather function more aggressive about pruning out functions that don't need to be gathered. I think there may be room to make it even faster by having the gather look through the graph iteratively, but that will be more complex to engineer.
apaszke left a comment
Was the optimization to check use_count in gatherFunctions?
That, plus eliminating a std::move in the deleter loop.
facebook-github-bot left a comment
resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
auto& curr_func = stack.back();
...
delete function;
...
if (curr_func.use_count() == 1) {
```
This eliminates the need for any heuristics regarding stack size limits. This is a re-do of pytorch#11534 with a fix to properly handle cases where multiple edges exist between a pair of functions.

Summary: This eliminates the need for any heuristics regarding stack size limits. This is a re-do of #11534 with a fix to properly handle cases where multiple edges exist between a pair of functions. Pull Request resolved: #11611 Differential Revision: D9991198 Pulled By: resistor fbshipit-source-id: fecd2c5cac7e78f82a0f20cf33268bb1617bb4a0