Stop memory leak when Python channel is deallocated without invoking "close"#22855
Stop memory leak when Python channel is deallocated without invoking "close"#22855lidizheng merged 1 commit intogrpc:masterfrom
Conversation
08166a8 to
96ff7a8
Compare
487ef6e to
893202b
Compare
That doesn't make sense to me. Stubs need to retain a reference to their channel, otherwise they won't be able to actually send an RPC. |
gnossen
left a comment
There was a problem hiding this comment.
Thanks for looking into this!
It occurs when we have a RPC lives longer than both the stub and the channel. Before the call is finished, both the stub and Python channel objects are deallocated, but the Cython channel object and couple "state" objects stays. The reference between states is a bit messy, but they are keeps essential objects alive. def fire_and_forget():
channel = grpc.insecure_channel(...)
stub = TestStub(channel)
call = stub.Call(...)
call.add_done_callback(logic)
for i in range(1000):
fire_and_forget() |
|
When I was writing the example, I wonder what should happen if users use with grpc.insecure_channel(...) as channel:
stub = TestStub(channel)
call = stub.Call(...)
call.add_done_callback(logic)Should the call falls immediately? Or should the channel shutdown gracefully (no new RPCs but allow ongoing ones to continue)? WDYT? |
|
I don't think we should encourage any usage that doesn't use the context manager form or an explicit close. The "fire and forget" example is a bit contrived. If someone actually came to us with this question, I would suggest that the example be rewritten to: def fire(stub):
call = stub.Call(...)
call.add_done_callback(logic)
with grpc.insecure_channel(...) as channel:
stub = TestStub(channel)
for i in range(1000):
fire(stub)
await_all_callbacks_done()But, viewing this as a corner case, it would be surprising to me if my callback failed to execute. So I would say I'd prefer a "graceful shutdown". Edit: Added an all-important synchronization step before closing the channel |
|
In current implementation, the close of channel immediately cancels all ongoing RPC... |
|
That's true, which isn't ideal. But again, I view this as a corner case. Not many people see this rough edge because:
|
lidizheng
left a comment
There was a problem hiding this comment.
Thanks for reviewing. PTALA.
|
@lidizheng which release is this code going to be in? I see that it is merged, but it seems that master does not correspond to the latest release as far as i can tell from the release notes |
|
@szikanova We have daily release for our master branch: https://packages.grpc.io/. The interval of releases is around 6 weeks. |
|
@lidizheng thanks! I noticed that the last daily release seems to be May 15, is that expected? |
|
@szikanova You remind me that there is a ongoing failure blocking our build automation. See the tracker issue #23047. This PR should be included in May 15's build. |
|
After upgrade to 1.30.0, we have started to experience exception upon VM exit with stack trace like: It seems either This doesn't do any harm but just noisy exception. |
|
@lidizheng Thanks for looking into this and a quick fix. |
Fixes #22123
This PR adds logic to the
_ChannelCallStateclass, so when its reference dropped to zero it will close underlying Cython channel._ChannelCallStateobject was referenced by multicallable objects and "rendezvous" objects. It usually lives longer than the Python Channel object, and better reflects the life span of the Channel.Close logic injection alternatives tried:
__del__of the Python Channel class. Turned out, the Python Channel object is short-lived and might be deallocated right away in cases likestub(grpc.insecure_channel(...));__del__of the Cython Channel class. It somehow triggers a Windows segfault when there is a short-lived channel object. I failed to find the root cause, but it might be related to the slower channel bootstrap time on Windows;Also, this PR added a smoke test for memory leak.