etcdv3-ruby

Modify lease keep alive call to be compatible with gRPC proxy

Open · jcalvert opened this issue 7 years ago · 11 comments

While attempting to utilize the lease_keep_alive_once method, we discovered an incompatibility when connecting to an etcd gRPC proxy. All requests would fail with a Cancelled status like:

        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:26:in `check_status'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:209:in `block in read_loop'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `loop'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `read_loop'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `each'     
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `lease_keep_alive_once'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection.rb:23:in `call'       
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection_wrapper.rb:21:in `handle'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3.rb:127:in `lease_keep_alive_once'

After some extensive debugging, the culprit seems to be the EOF check the proxy makes here: when the "request" object for the bidirectional stream has completed writing, this code point is reached and the receiving loop on the server side returns, which then fires off a message to the stop channel, so the context is marked as cancelled before the request even runs. For reference, we tried using the Go client against the etcd proxy and did not encounter this issue.
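
For illustration, the failing shape looks roughly like this (a hedged sketch, not the library's exact code; `stub`, `lease_id`, and the assumption that the generated `Etcdserverpb` proto classes are already loaded are all illustrative):

    # The request enumerable holds a single element, so as soon as gRPC's
    # background write thread consumes it, writes-done (EOF) is signaled on
    # the stream. The proxy reacts to that EOF by stopping the stream, and
    # the call comes back Cancelled before the keep-alive response arrives.
    def lease_keep_alive_once(stub, lease_id)
      request = Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
      stub.lease_keep_alive([request].each).each do |reply|
        return reply # through the proxy this raises GRPC::Cancelled instead
      end
    end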

Looking at the comments here and here, the issue in this case appears to be one of needing to defer signaling that writing is complete until the code has actually received a response. Doing this allows requests that are proxied via gRPC to succeed as expected.
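
A minimal sketch of that deferral, under the same assumptions as above: the request enumerator blocks on a queue and only finishes (which is what signals writes-done) after the read side has actually seen a response.

    def lease_keep_alive_once(stub, lease_id)
      response_seen = Queue.new

      # Yield the single keep-alive request, then block. StopIteration,
      # and therefore the writes-done signal, is only reached after the
      # read loop below pushes to the queue.
      requests = Enumerator.new do |yielder|
        yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
        response_seen.pop
      end

      stub.lease_keep_alive(requests).each do |reply|
        response_seen << true # let the write side finish; we have a reply
        return reply
      end
    end

This works because grpc-ruby consumes the request enumerable on a background write thread while the response enumerator runs on the calling thread, so blocking inside the enumerator does not stall the read loop.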

jcalvert · May 21 '18 21:05

Codecov Report

Merging #117 into master will increase coverage by 0.02%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
+ Coverage   98.35%   98.38%   +0.02%     
==========================================
  Files          24       24              
  Lines        1521     1545      +24     
==========================================
+ Hits         1496     1520      +24     
  Misses         25       25
Impacted Files              Coverage Δ
lib/etcdv3/lease.rb         100% <100%> (ø) :arrow_up:
spec/etcdv3/lease_spec.rb   100% <100%> (ø) :arrow_up:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b4373d...8335bc7.

codecov[bot] · May 21 '18 21:05

What version of etcd are you using, out of curiosity?

davissp14 · May 21 '18 22:05

https://github.com/coreos/etcd/issues/9751

davissp14 · May 22 '18 02:05

We've tested against 3.2 and 3.3.3 as well as master, also with the latest gRPC gem.

jcalvert · May 22 '18 14:05

Good deal, I was able to reproduce the issue and am looking to work with the CoreOS guys to get this resolved upstream. If this issue is blocking you guys, I'm happy to get this merged and work to rip it out once it's fixed.

That said, I don't feel great about adding gRPC Proxy logic/specs into this project. I'd rather remain agnostic with regard to users' topology/proxy choices. That being the case, I'd ask that you remove any specs/code that relate to the gRPC Proxy and just add a comment to your workaround.

Does that make sense?

davissp14 · May 22 '18 15:05

We can just run our fork, so no big rush. We can dump the proxy-specific bits you mentioned for this. It's a particularly odd and unexpected interaction.

jcalvert · May 23 '18 15:05

Right on, I appreciate the help!

davissp14 · May 23 '18 15:05

We'll need to update this shortly because of a bug in the way it is structured.

When code makes a bidi call, the write operation is run inside a thread. That means that if the call gets a DeadlineExceeded, the enumerator doesn't complete and the write thread never exits. So when you have a poorly responding etcd server, you accumulate an unbounded number of blocked threads and bad things happen. We need to ensure the enumerator completes so the gRPC write thread exits.
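
Concretely, a hedged sketch (same assumed names as the earlier sketches) of making the enumerator's completion unconditional, so the write thread exits even when the read loop raises:

    def lease_keep_alive_once(stub, lease_id)
      response_seen = Queue.new

      requests = Enumerator.new do |yielder|
        yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
        response_seen.pop
      end

      begin
        stub.lease_keep_alive(requests).each { |reply| return reply }
      ensure
        # Runs on success and on errors such as GRPC::DeadlineExceeded:
        # unblock the enumerator so gRPC's write thread hits StopIteration
        # and exits instead of leaking.
        response_seen << true
      end
    end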

jcalvert · Jun 05 '18 16:06

@jcalvert Hey man, mind updating the parent etcd issue with your findings?

davissp14 · Jun 05 '18 19:06

I think the etcd issue is still the same; this is an additional issue with error handling of bidi calls.

mgates · Jun 05 '18 19:06

Updated again...looks like some of our forked stuff leaked over, so I cleaned that up.

jcalvert · Jun 06 '18 16:06