Merged: commits 4020058 to b5bac74
Approved by davidmfrey and tmehlinger on Apr 22, 2021.
Problem
We were loading multiple MindMeld Python applications on the same server to serve multiple client bots. After a while, we found that our servers were running out of memory. MindMeld applications run on CPU (as opposed to the GPUs used by most ML platforms) due to our model types.

Since we weren't loading any large objects apart from the MindMeld applications into memory, we suspected that the MindMeld applications had a memory leak. Additionally, we knew that a service hosting a single MindMeld application did not run out of memory, so the act of loading multiple MindMeld applications must have caused the leak.
Validating the memory leak using memory_profiler
We first crafted a concise code block that could reproduce the memory leak. We focused on loading multiple MindMeld bots in sequence, since that seemed to cause the spikes in memory. Some of the bots were large while others were small. We used `memory_profiler` to profile the memory for the `my_func` function. We then plotted the memory profile of the script over time using these commands:
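The original script and plotting commands aren't reproduced here. As a minimal, self-contained sketch of the same check, the snippet below simulates the leak with a module-level cache (the `my_func` name mirrors the post; the payload and `_loaded_apps` cache are stand-ins for loading MindMeld applications) and measures growth with the stdlib `tracemalloc`. The post itself used `memory_profiler`, whose typical plotting workflow is `mprof run script.py` followed by `mprof plot`.

```python
import tracemalloc

# Hypothetical stand-in for the leak: a module-level map that (like a
# static cache) keeps every "application" alive after my_func returns.
_loaded_apps = {}

def my_func(app_id):
    payload = bytearray(1_000_000)   # stand-in for a loaded MindMeld app
    _loaded_apps[app_id] = payload   # the hidden strong reference
    return len(payload)

tracemalloc.start()
baseline = tracemalloc.get_traced_memory()[0]
samples = []
for i in range(5):
    my_func(i)
    samples.append(tracemalloc.get_traced_memory()[0] - baseline)
tracemalloc.stop()

# With the leak in place, traced memory grows with every call
# instead of staying flat.
print(all(b > a for a, b in zip(samples, samples[1:])))  # → True
```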
We saw that memory was increasing linearly as `my_func` was called repeatedly. If there were no leaks, the memory profile would be flat over time, since the Python objects created in `my_func` should be released and garbage collected after the function goes out of scope.

Finding mem-leak hotspots in the codebase using Objgraph & sys.getrefcount
Since we knew the MindMeld codebase well, we suspected that the resource loader, which holds a lot of objects in memory, was not being released. To test that this object was not being released, we crafted the following code block:
As we can see, even after deleting the reference to the object that contains the `resource_loader` and commanding the `gc` to clear memory, the reference count of the `resource_loader` has not decremented. Ideally, we should see it drop to 2: one for the current `ref_to_resource_loader` and the other for `sys.getrefcount`'s argument variable.

Even though we knew that the `resource_loader` object was not being released, it still wasn't clear what action we could take to fix this, since any of the 27 to 62 references to the `resource_loader` could be the one keeping it alive. How do we find this "needle" reference in the haystack?

Objgraph provides beautiful visualizations of the chain of references to any Python object. Since we know one object that is not being released, we can visualize all the references to that object and see which of them could be the culprit.
Note: it is important to set `too_many` and `max_depth` appropriately large to visualize the entire chain of references; we found that our problem reference was hidden when the default settings were used.

Using Objgraph's visualization, we found a section of the graph where a `dict` object had many references from `__globals__` pointing to it. This `dict` was in turn chained to the `resource_loader`. Such references from `__globals__` indicate that there are global references to an object in the MindMeld codebase, even though we explicitly de-scoped the variables referring to MindMeld objects. This is definitely a memory leak.

Upon inspecting `subproc_call_instance_function`, which references the dictionary that has global references, we found the problem object: `Processor.instance_map`. `Processor.instance_map` is a static map whose values are references to `Processor` instances. This static map remains in memory even after `Processor` objects are de-referenced, and since it holds references to those `Processor` objects, they cannot be garbage collected either. And since `Processor` objects contain `resource_loader` objects, no wonder `resource_loader` objects were persisting across multiple MindMeld applications, leading to out-of-memory issues.

Fixing mem leaks using Weak Referencing
We found that the memory leak was caused by a static map holding strong references to its `Processor` object values. To solve this problem, we used Python's `WeakValueDictionary`, which references its values weakly, so they are discarded when no strong references to them exist any more. If the map is the only reference to the `Processor` objects, the objects will be discarded.

Note: read up on weak vs. strong references in Python for a detailed explanation.
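A minimal sketch of the fix (toy classes again; MindMeld's real `Processor` carries much more state, but the weak-valued map behaves the same way):

```python
import gc
import weakref

class ResourceLoader:
    """Stand-in for MindMeld's resource loader."""

class Processor:
    # Weak-valued static map: an entry disappears as soon as the last
    # strong reference to its Processor value goes away.
    instance_map = weakref.WeakValueDictionary()

    def __init__(self, name):
        self.resource_loader = ResourceLoader()
        Processor.instance_map[name] = self

nlp = Processor("app1")
print(len(Processor.instance_map))  # → 1

del nlp       # drop the only strong reference to the Processor
gc.collect()
print(len(Processor.instance_map))  # → 0: the map no longer pins it
```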
Now, when running the `getrefcount` code block again, we see that the number of references is decremented to 2 after deleting the main MindMeld reference and garbage collecting:

The memory profile plot shows that memory is being released after each function call, as expected.
Conclusion
Using `memory_profiler` to chart the memory profile of a program over time and Objgraph to visualize object references, we can identify the specific data structures that are not being released. We can then use `WeakValueDictionary` or `WeakKeyDictionary` to create weak-reference maps that release memory appropriately.