[5.0] always set OC's sandbox socket to blocking mode fixing random OC errors seen in CI & maybe elsewhere#2163
Merged
spoonincode merged 1 commit intorelease/5.0from Jan 31, 2024
Conversation
greg7mdp
approved these changes
Jan 30, 2024
heifner
approved these changes
Jan 30, 2024
Contributor
|
Note:start |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
It has been relatively rare, but sometimes we've seen an unexplained
failed to compile wasmerror in EOS VM OC unittests. We've also seen at least one report from a user with a similar error. Up until now we've chalked this up to OC's subjective limits being inappropriately enabled during testing, as subjective limits could certainly explain this particular failure case where OC's sandbox fails to produce a response. These subjective limits are now disabled onrelease/5.0andmainso we can no longer attribute failures to that.After upgrading the CI image to the latest Linux kernel -- 6.7.1 at this time -- we've seen an explosion in the number of
failed to compile wasmerrors. I can also reproduce this issue locally with 6.7.x as well, as long as tests are run with a high-jvalue; i.e. high system load.1The problem is occurring in OC's compile monitor component in its sandbox. When a compile is started an asio local datagram socket is created from a socketpair,
leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Lines 93 to 98 in e5d45ce
the compile monitor will then
async_wait()on it,leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Lines 114 to 116 in e5d45ce
and then once the socket is indicated as ready, call
read_message_with_fds()leap/libraries/chain/webassembly/runtimes/eos-vm-oc/compile_monitor.cpp
Line 125 in e5d45ce
which ultimately calls
recvmsg,leap/libraries/chain/webassembly/runtimes/eos-vm-oc/ipc_helpers.cpp
Lines 36 to 41 in e5d45ce
read_message_with_fds()is expectingrecvmsg()to be a blocking call: no where does OC anywhere set up a socket to be non-blocking. However, asio'sasync_wait()will always set the underlying file descriptor to be non-blocking. It doesn't seem like that would be a problem though: we only callrecvmsg()after the kernel (through the asio abstraction) has indicatedEPOLLINso the "naive" (as defined shortly below) expectation is that any call torecvmsg()will succeed with a proper message or indication of a connection failure.But on 6.7.1 sometimes the
async_wait()indicates readiness but therecvmsg()will return with anEAGAIN. The current code considers that a fatal communication error with the compilation process and bails, thus giving us thefailed to compile wasmerror. And this is where some spec lawyering comes in to play. It's not clear to me what exactly is guaranteed via this interface: whenselectreturns indicating a socket is readable, is it a guarantee that a non blocking read must return a message on the next call? afaict no: it's only a guarantee that a blocking read will not block. It seems reasonable to expect a non blocking read would always return the message but I'm not convinced from the specification I've read that's guaranteed.Of course, asio isn't using
select, it's using the linux specificepoll... in edge triggered mode too. This further obscures what the rules ought to be, to me.It's also possible we've stumbled on a defect introduced in Linux. Unfortunately without either a small test case or bisecting where this issue started to become rampant, neither of which is a small effort, it's hard to approach the developers with it.
This PR now asks the asio socket to have its underlying file descriptor set to blocking mode after the
async_wait()and before therecvmsg(). This effectively removes the possibility ofEAGAINcreeping up on us here. I also entertainedread_message_with_fds()looping onEAGAIN, as it already does forEINTR. That also does work in testing. It's not immediately clear which way is better.One negative aspect of this change is that a
read_message_with_fds(asiosocket.native_handle())will subvert the code that sets blocking mode. At the moment the only usages ofread_message_with_fds(int)pass file descriptors that were never "asio-ified"Footnotes
I don't mean to make 6.7.x appear to be culprit even though I explicitly mention it by value multiple times in this PR. The CI image hasn't been touched in nearly a year (so a handful of kernel versions ago), and I don't run the tests with a high
-jvalue locally regularly enough to know if this failure has only recently started occurring with such frequency. But it does seem very clear that within the last year some kernel behavior has shifted. ↩