Move function deletion from the stack to the heap.#11534

Closed
resistor wants to merge 1 commit into pytorch:master from resistor:deleter

Conversation

@resistor
Contributor

This eliminates the need for any heuristics regarding stack size limits.

@resistor
Contributor Author

@goldsborough

@zou3519
Contributor

zou3519 commented Sep 11, 2018

We should benchmark how much additional time this takes; I'm not sure how expensive it is to do a full traversal of the computation graph

@resistor
Contributor Author

It’s doing the same amount of traversal, just using the heap instead of the program stack.
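
For readers without the diff in front of them, the pattern under discussion can be sketched as follows. This is an illustrative stand-in only (a hypothetical `Node` type with a destruction counter, not the actual autograd `Function` code): the same depth-first traversal, but driven by an explicitly heap-allocated worklist instead of nested `shared_ptr` destructor calls, so graph depth never translates into program-stack depth.

```cpp
#include <memory>
#include <utility>
#include <vector>

int g_destroyed = 0;  // instrumentation for the sketch

// Hypothetical stand-in for an autograd function node.
struct Node {
  std::vector<std::shared_ptr<Node>> next_edges;
  ~Node() { ++g_destroyed; }
};

// Release a graph via an explicit worklist on the heap. No recursion is
// used, so no stack-size heuristic is needed regardless of graph depth.
void release_graph(std::shared_ptr<Node> root) {
  std::vector<std::shared_ptr<Node>> stack;
  stack.push_back(std::move(root));
  while (!stack.empty()) {
    std::shared_ptr<Node> fn = std::move(stack.back());
    stack.pop_back();
    if (fn && fn.use_count() == 1) {
      // We hold the last reference: move the children onto the worklist
      // so their deletion also happens iteratively.
      for (auto& edge : fn->next_edges)
        stack.push_back(std::move(edge));
    }
    // fn is destroyed here with its next_edges already moved out, so its
    // destructor does no nested work.
  }
}
```

Because the children are stolen before the parent dies, each destructor runs over an empty edge list, and a chain deep enough to overflow the program stack under naive recursive destruction is released without issue.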

@resistor resistor force-pushed the deleter branch 2 times, most recently from f1aeaa0 to ffdd7df Compare September 11, 2018 20:24
@zou3519
Contributor

zou3519 commented Sep 11, 2018

ah I see, my bad

Contributor

@zou3519 zou3519 left a comment

This looks correct to me (and is much more elegant than what we had before). Some before/after numbers would still be nice just in case, because deleting computation graphs happens all the time.

@resistor
Contributor Author

@zou3519 Can you recommend a test case?

Contributor

@facebook-github-bot facebook-github-bot left a comment

resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zou3519
Contributor

zou3519 commented Sep 11, 2018

Contributor

@apaszke apaszke left a comment

I'd rather vote for making the recursion limit much smaller (think ~500) while still preferring the stack to heap allocation. It's extremely unlikely that someone will get an overflow with that many frames.
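
The depth-capped alternative can be sketched like this (illustrative only, with a hypothetical `Node` type and the ~500-frame cap mentioned above; the real limit and types live in PyTorch's autograd internals): recurse on the program stack up to the cap, then spill the remainder onto a heap worklist.

```cpp
#include <memory>
#include <utility>
#include <vector>

int g_destroyed = 0;  // instrumentation for the sketch

struct Node {
  std::vector<std::shared_ptr<Node>> next_edges;
  ~Node() { ++g_destroyed; }
};

constexpr int kMaxFrames = 500;  // the "~500" suggested above

// Recurse up to kMaxFrames deep, then defer the rest to `spill`. The
// common shallow case stays allocation-free while worst-case program-stack
// usage stays bounded.
void release(std::shared_ptr<Node> fn, int depth,
             std::vector<std::shared_ptr<Node>>& spill) {
  if (!fn || fn.use_count() != 1) return;  // someone else still owns it
  if (depth >= kMaxFrames) {
    spill.push_back(std::move(fn));        // hand off to the heap
    return;
  }
  for (auto& edge : fn->next_edges)
    release(std::move(edge), depth + 1, spill);
  // fn is destroyed here; its children were already released or spilled.
}

void release_graph(std::shared_ptr<Node> root) {
  std::vector<std::shared_ptr<Node>> spill;
  release(std::move(root), 0, spill);
  while (!spill.empty()) {
    auto fn = std::move(spill.back());
    spill.pop_back();
    release(std::move(fn), 0, spill);  // recursion depth resets to 0
  }
}
```

Each drain of the spill list restarts recursion at depth zero, so an arbitrarily deep chain is released in bounded-depth chunks.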

@zou3519
Contributor

zou3519 commented Sep 11, 2018

I remember that making the recursion limit smaller added overhead in the 1–1000 millisecond range, but I haven't run the numbers in a while. I am down for making the recursion limit much smaller, though, if you think stack allocation is better than heap allocation @apaszke

@resistor
Contributor Author

@apaszke Why?

This is a very common pattern in compiler-like systems, where recursion over program representations ends up breaking some day on an unexpectedly large program. Programmers aren't in the habit of listening to compiler authors about what a "reasonable" size for a program is.

@apaszke
Contributor

apaszke commented Sep 11, 2018

This is a very common pattern in compiler-like systems

Which is reasonable, because those run either ahead of time or in a background thread (in the case of JITs). In this case, the destruction is blocking the main program execution, so it should be as fast as possible.

@resistor
Contributor Author

@apaszke Speed is of no value if it crashes. And plenty of JITs block program execution on compilation.

I'm also not sure why you would assume the prior version is faster. The memory locality of using vectors will generally be better than using deques like the current version does, and that is borne out in measurement. Decreasing the recursion threshold will only make the difference more pronounced by moving more work from the stack onto the deque.

WITH PR

Time to build graph (s):
0.09391583776474
Time to free graph (s):
0.019444111347198485

Time to build graph (s):
0.0975230975151062
Time to free graph (s):
0.02031881332397461

ON MASTER

Time to build graph (s):
0.09743225431442261
Time to free graph (s):
0.024351720809936524

Time to build graph (s):
0.09882893228530884
Time to free graph (s):
0.02464564323425293
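
The deque-versus-vector locality claim can also be spot-checked in isolation. A rough micro-benchmark sketch (illustrative only; this is not the script that produced the numbers above, and results vary by machine) times the same push/pop worklist pattern through both containers:

```cpp
#include <chrono>
#include <deque>
#include <vector>

// Time repeated push_back/pop_back worklist cycles through a container,
// returning seconds. std::vector's contiguous storage tends to be
// friendlier to the cache for this access pattern than std::deque's
// segmented blocks.
template <typename Container>
double time_worklist(int n) {
  auto t0 = std::chrono::steady_clock::now();
  Container work;
  for (int round = 0; round < 100; ++round) {
    for (int i = 0; i < n; ++i) work.push_back(i);
    while (!work.empty()) work.pop_back();
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

On typical hardware the vector variant comes out ahead, though the exact gap depends on the allocator and workload size.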

@apaszke
Contributor

apaszke commented Sep 12, 2018

Speed is of no value if it crashes.

Lack of (deterministically stack-overflow triggered) crashes in 0.01% of use cases is of no value if your library is 2x slower than other libraries in most other cases 😉

And plenty of JITs block program execution on compilation.

Fine, but that also happens only once (or a few times), no matter how long your program runs. This destruction happens at every training step, for the whole duration of the program.


I didn't notice you changed the deque for a vector. That sounds like a good idea, and it would be an improvement even for the older code, so it might be worth re-benchmarking against master + vector. Also, can you please post the same benchmark, but with a graph of low depth (to check how the stack strategy performs compared to always using the heap)?


Contributor

@ezyang ezyang left a comment

I like killing heuristics. I would like it if some of the nit comments were addressed though.

@resistor
Contributor Author

@ezyang Nits should be fixed in latest version. Let me know if you have further feedback.

@ezyang
Contributor

ezyang commented Sep 12, 2018

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment

resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@resistor
Contributor Author

Here are timings using the same script with 100x smaller workload:

ON PR

Time to build graph (s):
0.0010059714317321778
Time to free graph (s):
0.00016424322128295898

Time to build graph (s):
0.0009362959861755371
Time to free graph (s):
0.0001583690643310547

ON MASTER

Time to build graph (s):
0.0009480509757995606
Time to free graph (s):
0.00014049625396728515

Time to build graph (s):
0.0010006918907165528
Time to free graph (s):
0.00014899682998657226

Contributor

@facebook-github-bot facebook-github-bot left a comment

resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apaszke
Contributor

apaszke commented Sep 12, 2018

Well, it is a 6% regression, which is not great tbh. Any impact on actual training scripts? Can you try the word language model without cuDNN + a ResNet50?

This eliminates the need for any heuristics regarding stack size limits.
@resistor
Contributor Author

I implemented some improvements that reduce the small workload overhead, such that it is no longer a regression in this case.

Time to build graph (s):
0.000943394660949707
Time to free graph (s):
0.00013764524459838868

Time to build graph (s):
0.0009673638343811035
Time to free graph (s):
0.00013594341278076172

@resistor
Contributor Author

For clarity, the changes that sped it up were eliminating an unnecessary move in the primary deleter loop, and making the gather function more aggressive in pruning out functions that don't need to be gathered.

I think there may be room to make it even faster by making the gather look through the graph iteratively, but that will be more complex to engineer.
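
The pruning idea can be sketched like so (hypothetical `Node` type and function name; the real change is in PyTorch's gather code, not shown here): a node whose `use_count` is still above one is owned from elsewhere and will not actually die, so the gather can skip it and its subtree entirely instead of enqueueing it.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Node {
  std::vector<std::shared_ptr<Node>> next_edges;
};

// Collect every node that will actually be destroyed when `root` is
// released, pruning any subtree still referenced from elsewhere --
// those nodes survive, so traversing them would be wasted work.
std::vector<std::shared_ptr<Node>> gather_doomed(std::shared_ptr<Node> root) {
  std::vector<std::shared_ptr<Node>> doomed, stack;
  stack.push_back(std::move(root));
  while (!stack.empty()) {
    auto fn = std::move(stack.back());
    stack.pop_back();
    if (!fn || fn.use_count() > 1) continue;  // prune: still alive elsewhere
    for (auto& edge : fn->next_edges)
      stack.push_back(std::move(edge));
    doomed.push_back(std::move(fn));
  }
  return doomed;
}
```

The `use_count` check is why the optimization helps on small workloads: shared subgraphs drop out of the traversal immediately rather than being walked and then discarded.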

Contributor

@apaszke apaszke left a comment

Was the optimization to check use_count in gatherFunctions?

@resistor
Contributor Author

That plus eliminating a std::move in the deleter loop.

Contributor

@facebook-github-bot facebook-github-bot left a comment

resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

```cpp
auto& curr_func = stack.back();

delete function;
if (curr_func.use_count() == 1) {
```


resistor pushed a commit to resistor/pytorch that referenced this pull request Sep 21, 2018
This eliminates the need for any heuristics regarding stack size limits.

This is a re-do pytorch#11534 with a fix to properly handle cases where
multiple edges exist between a pair of functions.
facebook-github-bot pushed a commit that referenced this pull request Sep 21, 2018
Summary:
This eliminates the need for any heuristics regarding stack size limits.

This is a re-do #11534 with a fix to properly handle cases where
multiple edges exist between a pair of functions.
Pull Request resolved: #11611

Differential Revision: D9991198

Pulled By: resistor

fbshipit-source-id: fecd2c5cac7e78f82a0f20cf33268bb1617bb4a0
@ezyang ezyang added the merged label Jun 26, 2019