Skip to content

Stop memory leak when Python channel is deallocated without invoking "close"#22855

Merged
lidizheng merged 1 commit intogrpc:masterfrom
lidizheng:stop-leak
May 8, 2020
Merged

Stop memory leak when Python channel is deallocated without invoking "close"#22855
lidizheng merged 1 commit intogrpc:masterfrom
lidizheng:stop-leak

Conversation

@lidizheng
Copy link
Copy Markdown
Contributor

@lidizheng lidizheng commented May 4, 2020

Fixes #22123

This PR adds logic to the _ChannelCallState class, so when its reference dropped to zero it will close underlying Cython channel. _ChannelCallState object was referenced by multicallable objects and "rendezvous" objects. It usually lives longer than the Python Channel object, and better reflects the life span of the Channel.

Close logic injection alternatives tried:

  1. __del__ of the Python Channel class. Turned out, the Python Channel object is short-lived and might be deallocated right away in cases like stub(grpc.insecure_channel(...));
  2. __del__ of the Cython Channel class. It somehow triggers a Windows segfault when there is a short-lived channel object. I failed to find the root cause, but it might be related to the slower channel bootstrap time on Windows;
  3. Pass the reference of either Python or Cython Channel class to its subordinate classes. It effectively creates a cyclic reference, which is hard to be garbage collected. Especially, Cython seems not work well with cyclic references.

Also, this PR added a smoke test for memory leak.

@lidizheng lidizheng added kind/bug lang/Python release notes: yes Indicates if PR needs to be in release notes labels May 4, 2020
lidizheng added a commit to lidizheng/grpc that referenced this pull request May 4, 2020
@lidizheng lidizheng force-pushed the stop-leak branch 2 times, most recently from 08166a8 to 96ff7a8 Compare May 4, 2020 22:11
@lidizheng lidizheng force-pushed the stop-leak branch 2 times, most recently from 487ef6e to 893202b Compare May 6, 2020 20:34
@lidizheng lidizheng mentioned this pull request May 8, 2020
@lidizheng lidizheng marked this pull request as ready for review May 8, 2020 21:53
@lidizheng lidizheng requested a review from gnossen May 8, 2020 21:53
@lidizheng lidizheng changed the title Close Cython channel when Python channel is garbage collected Stop memory leak when Python channel is deallocated without invoking "close" May 8, 2020
@gnossen
Copy link
Copy Markdown
Contributor

gnossen commented May 8, 2020

@lidizheng

del of the Python Channel class. Turned out, the Python Channel object is short-lived and might be deallocated right away in cases like stub(grpc.insecure_channel(...));

That doesn't make sense to me. Stubs need to retain a reference to their channel, otherwise they won't be able to actually send an RPC.

Copy link
Copy Markdown
Contributor

@gnossen gnossen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this!

@lidizheng
Copy link
Copy Markdown
Contributor Author

That doesn't make sense to me. Stubs need to retain a reference to their channel, otherwise they won't be able to actually send an RPC.

It occurs when we have a RPC lives longer than both the stub and the channel. Before the call is finished, both the stub and Python channel objects are deallocated, but the Cython channel object and couple "state" objects stays. The reference between states is a bit messy, but they are keeps essential objects alive.

def fire_and_forget():
    channel = grpc.insecure_channel(...)
    stub = TestStub(channel)
    call = stub.Call(...)
    call.add_done_callback(logic)

for i in range(1000):
    fire_and_forget()

@gnossen

@lidizheng
Copy link
Copy Markdown
Contributor Author

When I was writing the example, I wonder what should happen if users use with clause but fires RPC using future or invoked add_done_callback?

with grpc.insecure_channel(...) as channel:
    stub = TestStub(channel)
    call = stub.Call(...)
    call.add_done_callback(logic)

Should the call falls immediately? Or should the channel shutdown gracefully (no new RPCs but allow ongoing ones to continue)? WDYT?

@gnossen
Copy link
Copy Markdown
Contributor

gnossen commented May 8, 2020

I don't think we should encourage any usage that doesn't use the context manager form or an explicit close. The "fire and forget" example is a bit contrived. If someone actually came to us with this question, I would suggest that the example be rewritten to:

def fire(stub):
    call = stub.Call(...)
    call.add_done_callback(logic)

with grpc.insecure_channel(...) as channel:
  stub = TestStub(channel)
  for i in range(1000):
      fire(stub)
  await_all_callbacks_done()

But, viewing this as a corner case, it would be surprising to me if my callback failed to execute. So I would say I'd prefer a "graceful shutdown".

Edit: Added an all-important synchronization step before closing the channel

@lidizheng
Copy link
Copy Markdown
Contributor Author

In current implementation, the close of channel immediately cancels all ongoing RPC...

@gnossen
Copy link
Copy Markdown
Contributor

gnossen commented May 8, 2020

That's true, which isn't ideal. But again, I view this as a corner case. Not many people see this rough edge because:

  1. Most people use the synchronous API.
  2. We recommend sharing a single channel across many RPCs.

Copy link
Copy Markdown
Contributor Author

@lidizheng lidizheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. PTALA.

@lidizheng lidizheng merged commit f7591a3 into grpc:master May 8, 2020
@sophia-hanley
Copy link
Copy Markdown

@lidizheng which release is this code going to be in? I see that it is merged, but it seems that master does not correspond to the latest release as far as i can tell from the release notes

@lidizheng
Copy link
Copy Markdown
Contributor Author

@szikanova We have daily release for our master branch: https://packages.grpc.io/. The interval of releases is around 6 weeks.

@sophia-hanley
Copy link
Copy Markdown

@lidizheng thanks! I noticed that the last daily release seems to be May 15, is that expected?

@lidizheng
Copy link
Copy Markdown
Contributor Author

@szikanova You remind me that there is a ongoing failure blocking our build automation. See the tracker issue #23047. This PR should be included in May 15's build.

@honnix
Copy link
Copy Markdown

honnix commented Jun 29, 2020

After upgrade to 1.30.0, we have started to experience exception upon VM exit with stack trace like:

  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 1126, in __del__
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 515, in grpc._cython.cygrpc.Channel.close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 399, in grpc._cython.cygrpc._close
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 429, in grpc._cython.cygrpc._close
  File "/usr/lib/python3.6/threading.py", line 364, in notify_all
  File "/usr/lib/python3.6/threading.py", line 347, in notify

It seems either _deque or _islice has been garbaged collected. Found a pretty old issue here.

This doesn't do any harm but just noisy exception.

@lidizheng
Copy link
Copy Markdown
Contributor Author

@honnix I guess we will be better off suppressing the exception log. Thanks for the info. Drafting a fix in #23351.

@honnix
Copy link
Copy Markdown

honnix commented Jun 29, 2020

@lidizheng Thanks for looking into this and a quick fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug lang/Python release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: possible memory leak?

5 participants