cephadm: increase is_available timeout 30s -> 60s #35510

Merged: sebastian-philipp merged 3 commits into ceph:master from mgfritch:cephadm-bootstrap-timeout on Jun 16, 2020
Conversation

@mgfritch (Contributor) commented Jun 9, 2020:

bootstrap fails because `ceph -s` might take longer
than 30 sec to return on resource-limited hardware

Fixes: https://tracker.ceph.com/issues/45961
Signed-off-by: Michael Fritch <mfritch@suse.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
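The pattern at issue is a bootstrap availability check that retries `ceph -s` with a per-attempt timeout. The sketch below is illustrative only: the function name, retry count, and defaults are assumptions, not the actual cephadm source.

```python
import subprocess
import sys
import time

def is_available(desc, cmd, retries=5, timeout=60, interval=1):
    """Retry `cmd` until one attempt succeeds within `timeout` seconds.

    Hypothetical sketch of a bootstrap availability check; names and
    defaults are illustrative, not taken from cephadm itself.
    """
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, timeout=timeout, check=True,
                           capture_output=True)
            return True
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError):
            time.sleep(interval)  # back off before the next attempt
    return False
```

With a 30 s per-attempt timeout, a `ceph -s` that genuinely needs longer than 30 s can never succeed no matter how many retries run, which is why bumping the per-attempt timeout (rather than the retry count) is what addresses the tracker issue.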


@mgfritch mgfritch requested a review from a team as a code owner June 9, 2020 20:38
@mgfritch (Contributor Author) commented:

jenkins test make check

@smithfarm (Contributor) commented Jun 10, 2020:

This is fine as far as it goes. What I don't understand is this:

  1. `ceph -s` is run in a try-wait loop, i.e. the `ceph -s` command is issued and the process waits some number of seconds for it to produce output
  2. in the log excerpt on the tracker ticket, we see that at least 4 attempts go by without getting any output
  3. while cephadm is waiting for output from the 5th attempt, suddenly we see `ceph -s` output in the log!
  4. alas, this `ceph -s` output we see is not from the 5th attempt, but (presumably) from the 1st, so it is ignored. The try-wait loop continues.
  5. we continue to get output from successive (2nd, 3rd, 4th) tries, but while this is happening, the try-wait loop is issuing new commands (6th, 7th, 8th, 9th, finally 10th try)
  6. by the time we get to the end of the try-wait loop, the first 4 tries have successfully completed, but each time they are ignored because the loop has cycled in the meantime

So I'm thinking something more clever is needed, but I don't know what, exactly. In the reported case, the first attempt did not complete until the 5th attempt was in progress, which means it took more than 30 * 4 = 120 seconds to complete. So it's doubtful that this PR would have had any effect in that case.


mgfritch added 2 commits June 12, 2020 07:59:
  • add debug log line to profile the Popen wrapper read loop (exec runtime, exitcode, stop, fds to read, etc.)
  • no need to continually SIGKILL a process that is already exiting
@mgfritch (Contributor Author) commented:

> This is fine as far as it goes. What I don't understand is this:
>
>   1. `ceph -s` is run in a try-wait loop, i.e. the `ceph -s` command is issued and the process waits some number of seconds for it to produce output
>   2. in the log excerpt on the tracker ticket, we see that at least 4 attempts go by without getting any output
>   3. while cephadm is waiting for output from the 5th attempt, suddenly we see `ceph -s` output in the log!
>   4. alas, this `ceph -s` output we see is not from the 5th attempt, but (presumably) from the 1st, so it is ignored. The try-wait loop continues.
>   5. we continue to get output from successive (2nd, 3rd, 4th) tries, but while this is happening, the try-wait loop is issuing new commands (6th, 7th, 8th, 9th, finally 10th try)
>   6. by the time we get to the end of the try-wait loop, the first 4 tries have successfully completed, but each time they are ignored because the loop has cycled in the meantime
>
> So I'm thinking something more clever is needed, but I don't know what, exactly. In the reported case, the first attempt did not complete until the 5th attempt was in progress, which means it took more than 30 * 4 = 120 seconds to complete. So it's doubtful that this PR would have had any effect in that case.

It's a sequential loop (not async). The data from stdout is exactly the data read during the attempt in which it occurred.

Despite having some data on stdout, for every single attempt, the pid did not exit before the timeout duration was reached and a SIGKILL was issued causing the pid to return with exitcode -9.

I've added a debug log line to profile this loop, hopefully it will make the behavior more apparent.

DEBUG:cephadm:/usr/bin/podman:profile rt=0.1542048454284668, stop=False, exit=None, reads=[3]
DEBUG:cephadm:/usr/bin/podman:profile rt=0.15837812423706055, stop=False, exit=0, reads=[3, 5]
DEBUG:cephadm:/usr/bin/podman:profile rt=0.1585376262664795, stop=True, exit=0, reads=[3, 5]
...
...
DEBUG:cephadm:/usr/bin/ceph:profile rt=60.063154220581055, stop=False, exit=None, reads=[]
INFO:cephadm:/usr/bin/ceph:timeout after 60 seconds
DEBUG:cephadm:/usr/bin/ceph:profile rt=65.00448799133301, stop=True, exit=-9, reads=[3, 5]
...

logger.info(desc + ':timeout after %s seconds' % timeout)
stop = True
process.kill()
if process.poll() is None:
@sebastian-philipp (Contributor) commented on this diff:

Sure about this? I suspect we also have to kill the process if poll() is not None. I mean, calling poll() is useless here, as this would be racy with the process.

@mgfritch (Contributor Author) replied:

It's racy in either form, but I don't see how issuing another SIGKILL will help when the process has already exited and the loop is simply busy reading the remaining buffers from stdout/stderr?
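The race being discussed is visible with the standard library alone: right after `kill()`, `poll()` may still return `None` because the child has not been reaped yet, even though the signal has already been delivered. A small self-contained demonstration (plain `subprocess`, not cephadm code):

```python
import signal
import subprocess
import sys

# Start a child that would run for a while, then SIGKILL it.
proc = subprocess.Popen(
    [sys.executable, '-c', 'import time; time.sleep(30)'])
proc.kill()                # deliver SIGKILL

# Racy: poll() may return None here (child not reaped yet) or -9,
# depending on timing -- checking it right after kill() proves little.
racy = proc.poll()
assert racy in (None, -signal.SIGKILL)

rc = proc.wait()           # blocks until the child is actually reaped
assert rc == -signal.SIGKILL   # i.e. exitcode -9
```

This is why neither side of the `if process.poll() is None:` check can be relied on immediately after `kill()`: only `wait()` gives a definitive answer.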

@sebastian-philipp (Contributor) commented Jun 16, 2020:

Sigh, this loop is so irritating and complex.

OK, what about:

  1. in line 768, we set stop = True
  2. in line 769 and process.poll() is not None, i.e. the process is still alive.
  3. in line 772, process.poll() is None

never mind.

Quoting the Python `Popen.poll()` docs:

> Check if child process has terminated. Set and return returncode attribute. Otherwise, returns None.


@sebastian-philipp (Contributor) commented:

jenkins test dashboard backend

@sebastian-philipp sebastian-philipp merged commit d77715c into ceph:master Jun 16, 2020
@mgfritch mgfritch deleted the cephadm-bootstrap-timeout branch June 16, 2020 16:22