Merged: commits 4020058 to b5bac74
Approved by davidmfrey and tmehlinger on Apr 22, 2021.
Problem
We were loading multiple MindMeld Python applications on the same server to serve multiple client bots. After a while, we found that our servers were running out of memory. MindMeld applications run on CPU (as opposed to the GPUs used by most ML platforms) due to our model types.

Since we weren't loading any large objects apart from the MindMeld applications into memory, we suspected that the MindMeld applications had a memory leak. Additionally, we knew that a service hosting a single MindMeld application did not run out of memory, so the act of loading multiple MindMeld applications must have caused the leak.
Validating the memory leak using memory_profiler
We first crafted a concise code block that could reproduce the memory leak. We focused on loading multiple MindMeld bots in sequence, since that seemed to cause the spikes in memory. Some of the bots were large while others were small. We used `memory_profiler` to profile the memory for the `my_func` function. We then plotted the memory profile of the script over time using these commands:
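The original script and plotting commands aren't reproduced here. As a minimal, self-contained sketch of the same check, the snippet below simulates the leak with a module-level cache (the `my_func` name mirrors the post; the payload and `_loaded_apps` cache are stand-ins for loading MindMeld applications) and measures growth with the stdlib `tracemalloc`. The post itself used `memory_profiler`, whose typical plotting workflow is `mprof run script.py` followed by `mprof plot`.

```python
import tracemalloc

# Hypothetical stand-in for the leak: a module-level map that (like a
# static cache) keeps every "application" alive after my_func returns.
_loaded_apps = {}

def my_func(app_id):
    payload = bytearray(1_000_000)   # stand-in for a loaded MindMeld app
    _loaded_apps[app_id] = payload   # the hidden strong reference
    return len(payload)

tracemalloc.start()
baseline = tracemalloc.get_traced_memory()[0]
samples = []
for i in range(5):
    my_func(i)
    samples.append(tracemalloc.get_traced_memory()[0] - baseline)
tracemalloc.stop()

# With the leak in place, traced memory grows with every call
# instead of staying flat.
print(all(b > a for a, b in zip(samples, samples[1:])))  # → True
```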
We saw that memory was increasing linearly as `my_func` was called repeatedly. If there were no leaks, the memory profile would be flat over time, since the Python objects created in `my_func` should be released and garbage collected after the function goes out of scope.

Finding mem-leak hotspots in the codebase using Objgraph & sys.getrefcount
Since we knew the MindMeld codebase well, we suspected that the resource loader, which holds a lot of objects in memory, was not being released. To test that this object was not being released, we crafted the following code block:
As we can see, even after deleting the reference to the object that contains the `resource_loader` and commanding the `gc` to clear memory, the reference count of the `resource_loader` has not decremented. Ideally, we should see it drop to 2: one for the current `ref_to_resource_loader` and the other for `sys.getrefcount`'s argument variable.

Even though we knew that the `resource_loader` object was not being released, it still wasn't clear what action we could take to fix this, since any of the 27 to 62 references to the `resource_loader` could be the one keeping it alive. How do we find this "needle" reference in the haystack?

Objgraph provides beautiful visualizations of the chain of references to any Python object. Since we know one object that is not being released, we can visualize all the references to that object and see which of them could be the culprit.
Note: it is important to set `too_many` and `max_depth` appropriately large to visualize the entire chain of references; we found that our problem reference was hidden when the default settings were used.

Using Objgraph's visualization, we found a section of the graph where a `dict` object had many references from `__globals__` pointing to it. This `dict` was in turn chained to the `resource_loader`. Such references from `__globals__` indicate that there are global references to an object in the MindMeld codebase, even though we explicitly de-scoped the variables referring to MindMeld objects. This is definitely a memory leak.

Upon inspecting `subproc_call_instance_function`, which references the dictionary that has global references, we found the problem object: `Processor.instance_map`. `Processor.instance_map` is a static map whose values are references to `Processor` instances. This static map remains in memory even after `Processor` objects are de-referenced, and since it holds references to those `Processor` objects, they cannot be garbage collected either. And since `Processor` objects contain `resource_loader` objects, no wonder `resource_loader` objects were persisting across multiple MindMeld applications, leading to out-of-memory issues.

Fixing mem leaks using Weak Referencing
We found that the memory leak was caused by a static map holding strong references to its `Processor` object values. To solve this problem, we used Python's `WeakValueDictionary`, which references its values weakly, so they are discarded when no strong references to them exist any more. If the map is the only reference to the `Processor` objects, the objects will be discarded.

Note: read up on weak vs. strong references in Python for a detailed explanation.
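A minimal sketch of the fix (toy classes again; MindMeld's real `Processor` carries much more state, but the weak-valued map behaves the same way):

```python
import gc
import weakref

class ResourceLoader:
    """Stand-in for MindMeld's resource loader."""

class Processor:
    # Weak-valued static map: an entry disappears as soon as the last
    # strong reference to its Processor value goes away.
    instance_map = weakref.WeakValueDictionary()

    def __init__(self, name):
        self.resource_loader = ResourceLoader()
        Processor.instance_map[name] = self

nlp = Processor("app1")
print(len(Processor.instance_map))  # → 1

del nlp       # drop the only strong reference to the Processor
gc.collect()
print(len(Processor.instance_map))  # → 0: the map no longer pins it
```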
Now, when running the `getrefcount` code block again, we see that the number of references is decremented to 2 after deleting the main MindMeld reference and garbage collecting:

The memory profile plot shows that memory is being released after each function call, as expected.
Conclusion
Using `memory_profiler` to chart the memory profile of a program over time and Objgraph to visualize object references, we can identify the specific data structures that are not being released. We can then use `WeakValueDictionary` or `WeakKeyDictionary` to create weak-reference maps that release memory appropriately.