cephadm: increase is_available timeout 30s -> 60s #35510
sebastian-philipp merged 3 commits into ceph:master
bootstrap fails because `ceph -s` might take longer than 30 sec to return on resource limited hardware

Fixes: https://tracker.ceph.com/issues/45961
Signed-off-by: Michael Fritch <mfritch@suse.com>
jenkins test make check
This is fine as far as it goes. What I don't understand is this:
So, I'm thinking something more clever is needed, but I don't know what, exactly. In the reported case, the first attempt did not complete until the 5th attempt was in progress, which means it took more than 30 * 4 = 120 seconds to complete. So it's doubtful that this PR would have had any effect in that case.
add debug log line to profile the Popen wrapper read loop exec runtime, exitcode, stop, fds to read, etc.
Signed-off-by: Michael Fritch <mfritch@suse.com>

no need to continually SIGKILL a process that is already exiting
Signed-off-by: Michael Fritch <mfritch@suse.com>
It's a sequential loop (not async). The data from stdout is exactly the data read during the attempt in which it occurred. Despite having some data on stdout, for every single attempt, the pid did not exit before the timeout duration was reached and a SIGKILL was issued causing the pid to return with exitcode -9. I've added a debug log line to profile this loop, hopefully it will make the behavior more apparent.
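
To make the behavior described above concrete, here is a minimal sketch of a sequential read loop with a timeout. It is not cephadm's actual wrapper: `run_with_timeout`, its return value, and the log format are hypothetical, and it assumes a POSIX system where `select()` works on pipes. It kills the child only if `poll()` reports it is still running, then keeps draining the pipes until both reach EOF.

```python
import logging
import os
import select
import subprocess
import sys
import time

logger = logging.getLogger(__name__)


def run_with_timeout(cmd, timeout):
    """Hypothetical helper: run cmd and drain its pipes sequentially.

    Returns (stdout_bytes, stderr_bytes, returncode).
    """
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_fd = process.stdout.fileno()
    err_fd = process.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    open_fds = {out_fd, err_fd}
    start = time.time()
    stop = False

    while open_fds:
        if not stop and timeout and time.time() - start > timeout:
            logger.info('timeout after %s seconds', timeout)
            stop = True
            # Only signal a child that is still running. If it has already
            # exited, the loop just needs to finish reading buffered output.
            if process.poll() is None:
                process.kill()

        readable, _, _ = select.select(list(open_fds), [], [], 0.1)
        for fd in readable:
            data = os.read(fd, 4096)
            if data:
                chunks[fd].append(data)
            else:
                open_fds.discard(fd)  # EOF on this pipe

    returncode = process.wait()
    process.stdout.close()
    process.stderr.close()
    # Profiling line in the spirit of the debug commit (format is made up).
    logger.debug('runtime=%.2fs exitcode=%s stop=%s',
                 time.time() - start, returncode, stop)
    return b''.join(chunks[out_fd]), b''.join(chunks[err_fd]), returncode


if __name__ == '__main__':
    # A child that outlives the timeout is killed and reports a negative code.
    print(run_with_timeout([sys.executable, '-c', 'import time; time.sleep(30)'], 0.5))
```

In this sketch the timeout check fires once (`stop` guards it), so a child that is slow to die is not signalled repeatedly.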
    logger.info(desc + ':timeout after %s seconds' % timeout)
    stop = True
    process.kill()
    if process.poll() is None:
Sure about this? I suspect we also have to kill the process if poll() is not None. I mean, calling poll() is useless here, as it would race with the process.
It's racy in either form, but I don't see how issuing another SIGKILL will help when the process has already exited and the loop is simply busy reading the remaining buffers from stdout/stderr?
Sigh, this loop is so irritating and complex.
ok, what about.
- in line 768, we set `stop = True`
- in line 769 and `process.poll() is not None`, i.e. the process is still alive.
- in line 772, `process.poll() is None`
never mind.
> Check if child process has terminated. Set and return returncode attribute. Otherwise, returns None.
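
The `poll()` semantics quoted above can be checked with a short sketch. It assumes a POSIX system, where a child terminated by a signal reports the negated signal number as its return code; the variable names are illustrative only.

```python
import signal
import subprocess
import sys

# Start a child that would run for a long time if left alone.
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])

# While the child is running, poll() returns None.
still_running = child.poll()

child.kill()   # sends SIGKILL on POSIX
child.wait()   # reap the child so that returncode is set

# After termination, poll() returns the return code instead of None.
after_kill = child.poll()
```

So `process.poll() is None` means the process is still alive, and a non-None result means it has already exited, which is the case where a further SIGKILL achieves nothing.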
jenkins test dashboard backend