cephadm: increase is_available timeout 30s -> 60s #35510

Merged: sebastian-philipp merged 3 commits into ceph:master from mgfritch:cephadm-bootstrap-timeout on Jun 16, 2020
Conversation

@mgfritch (Contributor) commented Jun 9, 2020:

bootstrap fails because `ceph -s` might take longer
than 30 sec to return on resource-limited hardware

Fixes: https://tracker.ceph.com/issues/45961
Signed-off-by: Michael Fritch <mfritch@suse.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
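The pattern at issue is a bootstrap availability check that retries `ceph -s` with a per-attempt timeout. The sketch below is illustrative only: the function name, retry count, and defaults are assumptions, not the actual cephadm source.

```python
import subprocess
import sys
import time

def is_available(desc, cmd, retries=5, timeout=60, interval=1):
    """Retry `cmd` until one attempt succeeds within `timeout` seconds.

    Hypothetical sketch of a bootstrap availability check; names and
    defaults are illustrative, not taken from cephadm itself.
    """
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, timeout=timeout, check=True,
                           capture_output=True)
            return True
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError):
            time.sleep(interval)  # back off before the next attempt
    return False
```

With a 30 s per-attempt timeout, a `ceph -s` that genuinely needs longer than 30 s can never succeed no matter how many retries run, which is why bumping the per-attempt timeout (rather than the retry count) is what addresses the tracker issue.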


@mgfritch mgfritch requested a review from a team as a code owner June 9, 2020 20:38
@mgfritch (Contributor Author) commented:

jenkins test make check

@smithfarm (Contributor) commented Jun 10, 2020:

This is fine as far as it goes. What I don't understand is this:

  1. `ceph -s` is run in a try-wait loop, i.e. the `ceph -s` command is issued and the process waits some number of seconds for it to produce output
  2. in the log excerpt on the tracker ticket, we see that at least 4 attempts go by without getting any output
  3. while cephadm is waiting for output from the 5th attempt, suddenly we see `ceph -s` output in the log!
  4. alas, this `ceph -s` output we see is not from the 5th attempt, but (presumably) from the 1st, so it is ignored. The try-wait loop continues.
  5. we continue to get output from successive (2nd, 3rd, 4th) tries, but while this is happening, the try-wait loop is issuing new commands (6th, 7th, 8th, 9th, finally 10th try)
  6. by the time we get to the end of the try-wait loop, the first 4 tries have successfully completed, but each time they are ignored because the loop has cycled in the meantime

So I'm thinking something more clever is needed, but I don't know what, exactly. In the reported case, the first attempt did not complete until the 5th attempt was in progress, which means it took more than 30 * 4 = 120 seconds to complete. So it's doubtful that this PR would have had any effect in that case.


mgfritch added 2 commits June 12, 2020 07:59:
  • add debug log line to profile the Popen wrapper read loop (exec runtime, exitcode, stop, fds to read, etc.)
  • no need to continually SIGKILL a process that is already exiting
@mgfritch (Contributor Author) commented:

> This is fine as far as it goes. What I don't understand is this:
>
>   1. `ceph -s` is run in a try-wait loop, i.e. the `ceph -s` command is issued and the process waits some number of seconds for it to produce output
>   2. in the log excerpt on the tracker ticket, we see that at least 4 attempts go by without getting any output
>   3. while cephadm is waiting for output from the 5th attempt, suddenly we see `ceph -s` output in the log!
>   4. alas, this `ceph -s` output we see is not from the 5th attempt, but (presumably) from the 1st, so it is ignored. The try-wait loop continues.
>   5. we continue to get output from successive (2nd, 3rd, 4th) tries, but while this is happening, the try-wait loop is issuing new commands (6th, 7th, 8th, 9th, finally 10th try)
>   6. by the time we get to the end of the try-wait loop, the first 4 tries have successfully completed, but each time they are ignored because the loop has cycled in the meantime
>
> So I'm thinking something more clever is needed, but I don't know what, exactly. In the reported case, the first attempt did not complete until the 5th attempt was in progress, which means it took more than 30 * 4 = 120 seconds to complete. So it's doubtful that this PR would have had any effect in that case.

It's a sequential loop (not async). The data from stdout is exactly the data read during the attempt in which it occurred.

Despite having some data on stdout, for every single attempt, the pid did not exit before the timeout duration was reached and a SIGKILL was issued causing the pid to return with exitcode -9.

I've added a debug log line to profile this loop, hopefully it will make the behavior more apparent.

DEBUG:cephadm:/usr/bin/podman:profile rt=0.1542048454284668, stop=False, exit=None, reads=[3]
DEBUG:cephadm:/usr/bin/podman:profile rt=0.15837812423706055, stop=False, exit=0, reads=[3, 5]
DEBUG:cephadm:/usr/bin/podman:profile rt=0.1585376262664795, stop=True, exit=0, reads=[3, 5]
...
...
DEBUG:cephadm:/usr/bin/ceph:profile rt=60.063154220581055, stop=False, exit=None, reads=[]
INFO:cephadm:/usr/bin/ceph:timeout after 60 seconds
DEBUG:cephadm:/usr/bin/ceph:profile rt=65.00448799133301, stop=True, exit=-9, reads=[3, 5]
...

logger.info(desc + ':timeout after %s seconds' % timeout)
stop = True
process.kill()
if process.poll() is None:
@sebastian-philipp (Contributor) commented on this diff:

Sure about this? I suspect we also have to kill the process if poll() is not None. I mean, calling poll() is useless here, as this would be racy with the process.

@mgfritch (Contributor Author) replied:

It's racy in either form, but I don't see how issuing another SIGKILL will help when the process has already exited and the loop is simply busy reading the remaining buffers from stdout/stderr?
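The race being discussed is visible with the standard library alone: right after `kill()`, `poll()` may still return `None` because the child has not been reaped yet, even though the signal has already been delivered. A small self-contained demonstration (plain `subprocess`, not cephadm code):

```python
import signal
import subprocess
import sys

# Start a child that would run for a while, then SIGKILL it.
proc = subprocess.Popen(
    [sys.executable, '-c', 'import time; time.sleep(30)'])
proc.kill()                # deliver SIGKILL

# Racy: poll() may return None here (child not reaped yet) or -9,
# depending on timing -- checking it right after kill() proves little.
racy = proc.poll()
assert racy in (None, -signal.SIGKILL)

rc = proc.wait()           # blocks until the child is actually reaped
assert rc == -signal.SIGKILL   # i.e. exitcode -9
```

This is why neither side of the `if process.poll() is None:` check can be relied on immediately after `kill()`: only `wait()` gives a definitive answer.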

@sebastian-philipp (Contributor) commented Jun 16, 2020:

Sigh, this loop is so irritating and complex.

OK, what about:

  1. in line 768, we set stop = True
  2. in line 769 and process.poll() is not None, i.e. the process is still alive.
  3. in line 772, process.poll() is None

never mind.

Quoting the Python `Popen.poll()` docs:

> Check if child process has terminated. Set and return returncode attribute. Otherwise, returns None.


@sebastian-philipp (Contributor) commented:

jenkins test dashboard backend

@sebastian-philipp sebastian-philipp merged commit d77715c into ceph:master Jun 16, 2020
@mgfritch mgfritch deleted the cephadm-bootstrap-timeout branch June 16, 2020 16:22