Stop memory leak when Python channel is deallocated without invoking "close" by lidizheng · Pull Request #22855 · grpc/grpc

lidizheng · 2020-05-04T19:12:51Z

This PR adds logic to the _ChannelCallState class so that when its reference count drops to zero, it closes the underlying Cython channel. The _ChannelCallState object is referenced by multicallable objects and "rendezvous" objects. It usually outlives the Python Channel object and better reflects the life span of the channel.
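The idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: FakeCythonChannel, CallState, and MultiCallable are hypothetical stand-ins for the real Cython channel, _ChannelCallState, and the multicallable objects.

```python
import gc


class FakeCythonChannel:
    """Hypothetical stand-in for the underlying Cython channel."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class CallState:
    """Hypothetical shared state, referenced by every multicallable and call."""

    def __init__(self, channel):
        self.channel = channel

    def __del__(self):
        # Runs once the last reference to this state object is dropped.
        self.channel.close()


class MultiCallable:
    """Hypothetical multicallable that keeps the shared state alive."""

    def __init__(self, state):
        self._state = state


core = FakeCythonChannel()
state = CallState(core)
call = MultiCallable(state)

del state                       # the multicallable still references the state
still_open = not core.closed

del call                        # last reference gone, so __del__ closes the channel
gc.collect()                    # only needed on interpreters without refcounting
closed_after = core.closed
```

Because the close hook sits on the object that in-flight calls reference, the channel stays open for as long as anything can still use it.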

Close logic injection alternatives tried:

__del__ of the Python Channel class. It turned out that the Python Channel object is short-lived and might be deallocated right away in cases like stub(grpc.insecure_channel(...));
__del__ of the Cython Channel class. It somehow triggers a segfault on Windows when there is a short-lived channel object. I failed to find the root cause, but it might be related to the slower channel bootstrap time on Windows;
Passing a reference to either the Python or the Cython Channel class to its subordinate classes. This effectively creates a cyclic reference, which is hard to garbage collect. In particular, Cython does not seem to work well with cyclic references.
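The cyclic-reference concern in the last alternative can be shown with a small pure-Python sketch. Channel and Stub here are hypothetical classes, not the gRPC ones, and the Cython-specific behavior mentioned above is not reproduced. The sketch only shows that reference counting alone never frees a cycle, so cleanup is deferred until the cycle collector runs.

```python
import gc
import weakref


class Stub:
    """Hypothetical subordinate object holding a back-reference."""

    def __init__(self, channel):
        self.channel = channel      # stub -> channel


class Channel:
    """Hypothetical channel that owns a subordinate object."""

    def __init__(self):
        self.stub = Stub(self)      # channel -> stub, which closes the cycle


gc.disable()                        # rule out an automatic collection mid-demo
try:
    channel = Channel()
    probe = weakref.ref(channel)

    del channel
    # The cycle keeps the refcount above zero, so the object is still alive.
    alive_after_del = probe() is not None

    gc.collect()
    # Only the cycle collector can reclaim it.
    alive_after_gc = probe() is not None
finally:
    gc.enable()
```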

This PR also adds a smoke test for memory leaks.
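The test added by this PR is not shown in this conversation. As a rough sketch of what a leak smoke test can look like, the following uses the standard tracemalloc module, which is a substitute technique and not necessarily what _leak_test.py does. The names measure_growth, leaky, and clean are hypothetical.

```python
import tracemalloc


def measure_growth(action, iterations=1000):
    """Return the growth in traced memory, in bytes, across repeated calls."""
    tracemalloc.start()
    try:
        action()                                  # warm up one-time allocations
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            action()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


_retained = []


def leaky():
    # Keeps every buffer alive, simulating a channel that is never closed.
    _retained.append(bytearray(1024))


def clean():
    # The buffer is freed as soon as the call returns.
    bytearray(1024)


leaky_growth = measure_growth(leaky)
clean_growth = measure_growth(clean)
```

A smoke test would then assert that the growth stays under a generous threshold, so it flags unbounded growth without being sensitive to small allocator noise.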

gnossen · 2020-05-08T22:04:00Z

@lidizheng

__del__ of the Python Channel class. Turned out, the Python Channel object is short-lived and might be deallocated right away in cases like stub(grpc.insecure_channel(...));

That doesn't make sense to me. Stubs need to retain a reference to their channel, otherwise they won't be able to actually send an RPC.

gnossen

Thanks for looking into this!

src/python/grpcio_tests/tests_py3_only/unit/_leak_test.py

lidizheng · 2020-05-08T22:30:57Z

That doesn't make sense to me. Stubs need to retain a reference to their channel, otherwise they won't be able to actually send an RPC.

It occurs when an RPC lives longer than both the stub and the channel. Before the call finishes, both the stub and the Python channel objects are deallocated, but the Cython channel object and a couple of "state" objects stay. The references between the states are a bit messy, but they keep the essential objects alive.

def fire_and_forget():
    channel = grpc.insecure_channel(...)
    stub = TestStub(channel)
    call = stub.Call(...)
    call.add_done_callback(logic)

for i in range(1000):
    fire_and_forget()

@gnossen

lidizheng · 2020-05-08T22:41:27Z

While writing the example, I wondered what should happen if users use a with clause but fire the RPC using future, or invoke add_done_callback?

with grpc.insecure_channel(...) as channel:
    stub = TestStub(channel)
    call = stub.Call(...)
    call.add_done_callback(logic)

Should the call fail immediately? Or should the channel shut down gracefully (no new RPCs, but ongoing ones are allowed to continue)? WDYT?

gnossen · 2020-05-08T22:47:30Z

I don't think we should encourage any usage that doesn't use the context manager form or an explicit close. The "fire and forget" example is a bit contrived. If someone actually came to us with this question, I would suggest that the example be rewritten to:

def fire(stub):
    call = stub.Call(...)
    call.add_done_callback(logic)

with grpc.insecure_channel(...) as channel:
    stub = TestStub(channel)
    for i in range(1000):
        fire(stub)
    await_all_callbacks_done()

But, viewing this as a corner case, it would be surprising to me if my callback failed to execute. So I would say I'd prefer a "graceful shutdown".

Edit: Added an all-important synchronization step before closing the channel
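The await_all_callbacks_done() step above is left undefined in the example. One possible shape for it is sketched below, under stated assumptions: CallbackTracker is a hypothetical helper, and concurrent.futures.Future stands in for the future returned by an asynchronous RPC, since only add_done_callback is used.

```python
import threading
from concurrent.futures import Future


class CallbackTracker:
    """Hypothetical helper that counts registered callbacks and waits for them."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0

    def track(self, future, callback):
        with self._cond:
            self._pending += 1

        def wrapper(done_future):
            try:
                callback(done_future)
            finally:
                # Count the callback as finished even if it raised.
                with self._cond:
                    self._pending -= 1
                    self._cond.notify_all()

        future.add_done_callback(wrapper)

    def await_all_callbacks_done(self, timeout=None):
        """Block until every tracked callback has run; False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)


results = []
tracker = CallbackTracker()
futures = [Future() for _ in range(3)]
for future in futures:
    tracker.track(future, lambda f: results.append(f.result()))


def finish_calls():
    # Simulates RPCs completing on another thread.
    for index, future in enumerate(futures):
        future.set_result(index)


worker = threading.Thread(target=finish_calls)
worker.start()
all_done = tracker.await_all_callbacks_done(timeout=5)
worker.join()
```

With a helper like this, the with block would exit, and the channel would close, only after every registered callback has run.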

lidizheng · 2020-05-08T22:53:42Z

In the current implementation, closing the channel immediately cancels all ongoing RPCs...

gnossen · 2020-05-08T23:01:53Z

That's true, which isn't ideal. But again, I view this as a corner case. Not many people see this rough edge because:

Most people use the synchronous API.
We recommend sharing a single channel across many RPCs.

lidizheng

Thanks for reviewing. PTALA.

src/python/grpcio_tests/tests_py3_only/unit/_leak_test.py

sophia-hanley · 2020-05-27T04:42:13Z

@lidizheng which release is this code going to be in? I see that it is merged, but it seems that master does not correspond to the latest release, as far as I can tell from the release notes.

lidizheng · 2020-05-27T16:49:05Z

@szikanova We have daily releases for our master branch: https://packages.grpc.io/. The interval between releases is around 6 weeks.

sophia-hanley · 2020-05-27T17:07:19Z

@lidizheng thanks! I noticed that the last daily release seems to be May 15, is that expected?

lidizheng · 2020-05-27T17:33:01Z

@szikanova You reminded me that there is an ongoing failure blocking our build automation. See the tracker issue #23047. This PR should be included in the May 15 build.

honnix · 2020-06-29T21:39:40Z

After upgrading to 1.30.0, we have started to see an exception upon VM exit, with a stack trace like:

  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 1126, in __del__
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 515, in grpc._cython.cygrpc.Channel.close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 399, in grpc._cython.cygrpc._close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 429, in grpc._cython.cygrpc._close
  File "/usr/lib/python3.6/threading.py", line 364, in notify_all
  File "/usr/lib/python3.6/threading.py", line 347, in notify

It seems either _deque or _islice has been garbage collected. Found a pretty old issue here.

This doesn't do any harm; it is just a noisy exception.

lidizheng · 2020-06-29T21:59:49Z

@honnix I guess we will be better off suppressing the exception log. Thanks for the info. Drafting a fix in #23351.
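A minimal sketch of what suppressing the exception could look like follows. It is not the change in #23351, which is not shown here, and BrokenCore, NoisyChannel, and QuietChannel are hypothetical. An exception escaping __del__ is never propagated to the caller. Python reports it through sys.unraisablehook (available since Python 3.8), which the sketch replaces temporarily to observe the difference.

```python
import gc
import sys


class BrokenCore:
    """Hypothetical core object whose close() fails, as at interpreter exit."""

    def close(self):
        raise RuntimeError("state already torn down")


class NoisyChannel:
    def __init__(self, core):
        self._core = core

    def __del__(self):
        self._core.close()          # failure is reported as an unraisable exception


class QuietChannel:
    def __init__(self, core):
        self._core = core

    def __del__(self):
        try:
            self._core.close()
        except Exception:
            # Nothing useful can be done at this point, so stay silent.
            pass


reported = []
previous_hook = sys.unraisablehook
sys.unraisablehook = lambda unraisable: reported.append(unraisable.exc_type)
try:
    noisy = NoisyChannel(BrokenCore())
    del noisy
    gc.collect()
    noisy_reports = list(reported)

    reported.clear()
    quiet = QuietChannel(BrokenCore())
    del quiet
    gc.collect()
    quiet_reports = list(reported)
finally:
    sys.unraisablehook = previous_hook
```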

honnix · 2020-06-29T22:02:36Z

@lidizheng Thanks for looking into this and a quick fix.

lidizheng added the kind/bug, lang/Python, and release notes: yes labels May 4, 2020

lidizheng mentioned this pull request May 4, 2020

Python: possible memory leak? #22123

Closed

lidizheng added a commit to lidizheng/grpc that referenced this pull request May 4, 2020

Close Cython channel when it is garbage collected grpc#22855

08166a8

lidizheng force-pushed the stop-leak branch 2 times, most recently from 08166a8 to 96ff7a8 Compare May 4, 2020 22:11

lidizheng added the kokoro:force-run label May 5, 2020

grpc-kokoro removed the kokoro:force-run label May 5, 2020

lidizheng force-pushed the stop-leak branch 2 times, most recently from 487ef6e to 893202b Compare May 6, 2020 20:34

Close Core's channel when there is no reference to the channel

e844f30

lidizheng force-pushed the stop-leak branch from 07ee17f to e844f30 Compare May 8, 2020 20:43

lidizheng mentioned this pull request May 8, 2020

Memory Leak in Python #22851

Closed

lidizheng added the kokoro:force-run label May 8, 2020

grpc-kokoro removed the kokoro:force-run label May 8, 2020

lidizheng marked this pull request as ready for review May 8, 2020 21:53

lidizheng requested a review from gnossen May 8, 2020 21:53

lidizheng assigned gnossen May 8, 2020

lidizheng changed the title ~~Close Cython channel when Python channel is garbage collected~~ Stop memory leak when Python channel is deallocated without invoking "close" May 8, 2020

gnossen reviewed May 8, 2020

lidizheng commented May 8, 2020

gnossen approved these changes May 8, 2020

lidizheng merged commit f7591a3 into grpc:master May 8, 2020

lidizheng mentioned this pull request May 11, 2020

Add module docstring for the leak test #22920

Merged

lidizheng mentioned this pull request Jun 29, 2020

Suppress exceptions from the __del__ of channel object #23351

Merged

busunkim96 mentioned this pull request Mar 22, 2022

fix: Add Python samples GoogleCloudPlatform/samples-style-guide#31

Merged
