Modify lease keep-alive call to be compatible with gRPC proxy
While attempting to utilize the lease_keep_alive_once method, we discovered an incompatibility when connecting to an etcd gRPC proxy. All requests would fail with a Cancelled status like:
from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:26:in `check_status'
from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:209:in `block in read_loop'
from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `loop'
from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `read_loop'
from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `each'
from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `lease_keep_alive_once'
from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection.rb:23:in `call'
from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection_wrapper.rb:21:in `handle'
from /home/jcalvert/etcdv3-ruby/lib/etcdv3.rb:127:in `lease_keep_alive_once'
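For context, a minimal reproduction looks something like the sketch below. The proxy endpoint and lease TTL are illustrative assumptions, not values from the original report; the gem's public API calls are taken from the backtrace above and the etcdv3-ruby README.

```ruby
require 'etcdv3'

# Hypothetical setup: an etcd gRPC proxy assumed to be listening on
# 127.0.0.1:23790 in front of a normal etcd cluster.
conn  = Etcdv3.new(endpoints: 'http://127.0.0.1:23790')
lease = conn.lease_grant(60)

# Against etcd directly this succeeds; through the proxy it raised
# GRPC::Cancelled with the backtrace above.
conn.lease_keep_alive_once(lease.ID)
```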
After some extensive debugging, the culprit seems to be the EOF check the proxy makes here: when the "request" object for the bidirectional stream has completed writing, this code point is reached and the server-side receive loop returns, which fires off a message to the stop channel, so the context is marked as cancelled before the request even runs. For reference, we tried the Go client against the etcd proxy and did not encounter this issue.
Looking at the comments here and here, it appears the fix is to defer signaling that writing is complete until the client has actually received a response. Doing this allows requests proxied via gRPC to succeed as expected.
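To illustrate the approach, here is a minimal sketch of the workaround. It is not necessarily the committed diff: the Queue-based handshake and exact method shape are our illustration, and `@stub` is assumed to be an `Etcdserverpb::Lease::Stub`.

```ruby
# Sketch: keep the request enumerator open until a response has been
# read, so completing the write side can't trip the proxy's EOF check
# before the keep-alive actually runs.
def lease_keep_alive_once(lease_id)
  response_read = Queue.new
  requests = Enumerator.new do |yielder|
    yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
    # Block instead of finishing immediately; finishing here signals
    # "writes complete" before the server has responded.
    response_read.pop
  end
  @stub.lease_keep_alive(requests).each do |response|
    response_read << true # safe to let the enumerator finish now
    return response
  end
end
```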
Codecov Report
Merging #117 into master will increase coverage by 0.02%. The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #117     +/-   ##
==========================================
+ Coverage   98.35%   98.38%   +0.02%
==========================================
  Files          24       24
  Lines        1521     1545     +24
==========================================
+ Hits         1496     1520     +24
  Misses         25       25
| Impacted Files | Coverage Δ | |
|---|---|---|
| lib/etcdv3/lease.rb | 100% <100%> (ø) | :arrow_up: |
| spec/etcdv3/lease_spec.rb | 100% <100%> (ø) | :arrow_up: |
Out of curiosity, what version of etcd are you using?
https://github.com/coreos/etcd/issues/9751
We've tested against 3.2 and 3.3.3 as well as master, also with the latest gRPC gem.
Good deal, I was able to reproduce the issue and am looking to work with the CoreOS guys to get this resolved upstream. If this issue is blocking you guys, I'm happy to get this merged and work to rip it out once it's fixed.
That said, I don't feel great about adding gRPC proxy logic/specs to this project; I'd rather remain agnostic with regard to users' topology/proxy choices. That being the case, I'd ask that you remove any specs/code that relate to the gRPC proxy and just add a comment to your workaround.
Does that make sense?
We can just run our fork, so there's no big rush. We can dump the proxy-specific bits you mentioned for this. It's a particularly odd and unexpected interaction.
Right on, I appreciate the help!
We'll need to update this shortly because of a bug in the way it is structured.
When code makes a bidi call, the write operation runs inside a thread. That means that if the call gets a DeadlineExceeded, the enumerator never completes and the write thread never exits. So with a poorly responding etcd server, you accumulate an unbounded number of blocked threads, and bad things happen. We need to ensure the enumerator completes so the gRPC write thread exits.
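Something along these lines, as a sketch building on the workaround above; the `ensure` placement is an assumption about the shape of the fix, not the committed code.

```ruby
# Sketch: guarantee the enumerator completes even when the read side
# raises (e.g. GRPC::DeadlineExceeded), so gRPC's write thread can exit.
def lease_keep_alive_once(lease_id)
  response_read = Queue.new
  requests = Enumerator.new do |yielder|
    yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
    response_read.pop # released on success *and* on failure below
  end
  begin
    @stub.lease_keep_alive(requests).each { |response| return response }
  ensure
    response_read << true # always unblock the writer thread
  end
end
```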
@jcalvert Hey man, mind updating the parent etcd issue with your findings?
I think the etcd issue is still the same - this is an additional issue with error handling of bidi calls.
Updated again...looks like some of our forked stuff leaked over, so I cleaned that up.