Add timeouts to cache lookups and handle server eofs#79
Add timeouts to cache lookups and handle server eofs#79alexcrichton merged 3 commits intomozilla:masterfrom
Conversation
|
Could you write something to stderr/somewhere else visible on buildbot when we get unexpected EOF errors, so we can have some monitoring over them? |
|
@arielb1 yeah unfortunately architecturally sccache is slightly difficult to debug since it primarily runs as a daemon (without stderr/stdout). Note though that it supports redirection of stderr in the daemon which we're leveraging and printing. (those logs are how I discovered this issue) |
d990325 to
baf9610
Compare
I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps rust-lang#40240!
travis: Update sccache binary I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps #40240!
travis: Update sccache binary I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps #40240!
|
@luser FWIW I've deployed this to rust-lang/rust and it seems to have at least erased that particular failure mode! |
|
@alexcrichton That's great to hear! I have been trying to get my Rust PR landable, I will finish that off so we can get everything else rebased and landed. |
src/commands.rs
Outdated
| // get a response then we just forge ahead locally and run the | ||
| // compilation ourselves. | ||
| Err(ProtobufError::IoError(ref e)) | ||
| if e.kind() == io::ErrorKind::UnexpectedEof => {} |
There was a problem hiding this comment.
Can you put a log message here so it's easy to figure out when this happens? Actually I think even straight printing something to stderr would be OK in this case, it feels like a bug if we hit this in practice.
luser
left a comment
There was a problem hiding this comment.
This looks fine, just a few little things that would be good. Feel free to rebase and land this.
src/compiler/compiler.rs
Outdated
|
|
||
| // Wait at most a minute for the cache to respond before we forge | ||
| // ahead ourselves with a compilation. | ||
| let timeout = Duration::new(60, 0); |
There was a problem hiding this comment.
As a followup, it might be nice to have a way to configure this so we can tweak it in practice.
| stats.forced_recaches += 1; | ||
| } | ||
| MissType::TimedOut => { | ||
| stats.cache_misses += 1; |
There was a problem hiding this comment.
Might be good to add a stat for this as well so we can track it?
I believe this is possible to hit if the compilation takes quite a long time and times out the server itself. The server may shut down due to being idle while we're still connected and cause us to receive an `UnexpectedEof`.
This should help ensure that we don't wait *too* long for the cache to respond (for example on an excessively slow network) and time out the server unnecessarily.
baf9610 to
b91b81a
Compare
* chore(readme): Remove readme examples This swaps out the readme example to use instead a link to the routeguide tutorial. * Add helloworld tutorial
We've unfortunately been having a lot of trouble over at rust-lang/rust#40240 with sccache causing spurious failures. A number of logs are linked from that issue and to try to diagnose these issues I've enabled sccache logging which we cat to the Travis logs on failures.
Today we got a new failure whose failing build pointed to:
Upon inspection of the sccache server debug logs (at the end of the build log) I noticed the failing object was
ELFDumper.cpp.o, we attempted to fetch this object from the cache (I think), and no other messages aboutELFDumper.cpp.o(success or not) happened on the server. My theory for what happened looks like:To handle this failure mode this PR makes a few modifications
UnexpectedEofthe same way it handles "unhandled compilation", it just compiles the object locally. This should protect against this in the future and other spurious races like this which are generally benign.