Skip to content

DNM: test fixes for PR 38443#38471

Closed
rishabh-d-dave wants to merge 3 commits intoceph:masterfrom
rishabh-d-dave:vr-set-shell-to-True-EXTRA
Closed

DNM: test fixes for PR 38443#38471
rishabh-d-dave wants to merge 3 commits intoceph:masterfrom
rishabh-d-dave:vr-set-shell-to-True-EXTRA

Conversation

@rishabh-d-dave
Copy link
Contributor

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

self.watcher_process.stdin.close()
traceback.print_exc()
log.info('=============will close self.watcher_process.stdin')
rv = self.watcher_process.stdin.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason your shell=False works is because closing stdin will then send SIGTERM to the ceph -w process rather than the shell. Killing the shell does not kill its children. ceph -w does not read from stdin so closing this channel does nothing to cause the ceph -w process to exit.

What would probably help here is to also close self.watcher_process.stdout. Then ceph -w would receive SIGPIPE when it next tries to write to stdout. You need to first read whatever data is on stdout before closing it however. IOW, move the self.match() check here, then close, then wait.

Process management is fun, huh? :)

Copy link
Contributor Author

@rishabh-d-dave rishabh-d-dave Dec 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that explanation was really helpful. Thanks! :D I was going through subprocess's docs but the exact sequence that led to crash wasn't clear to me.

IOW, move the self.match() check here, then close, then wait.

Which one? The one at the top or the one at the bottom? I think the one at the top is not related in our context and one at the bottom is perfect where it is since self.match() just might find the pattern when stdout is written just before closing it.

Process management is fun, huh? :)

Yep, didn't expect it while refactoring some code that I am pretty used to. :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified it by moving the self.match at the bottom to just before the stdout and stderr is closed -

020-12-08 23:12:36,637.637 INFO:__main__:======================================================================
2020-12-08 23:12:36,637.637 INFO:__main__:FAIL: test_selftest_cluster_log (tasks.mgr.test_module_selftest.TestModuleSelftest)
2020-12-08 23:12:36,638.638 INFO:__main__:Use the selftest module to test the cluster/audit log interface.
2020-12-08 23:12:36,638.638 INFO:__main__:
2020-12-08 23:12:36,638.638 INFO:__main__:----------------------------------------------------------------------
2020-12-08 23:12:36,638.638 INFO:__main__:Traceback (most recent call last):
2020-12-08 23:12:36,638.638 INFO:__main__:  File "/home/rishabh/repos/ceph/minor/qa/tasks/mgr/test_module_selftest.py", line 242, in test_selftest_cluster_log
2020-12-08 23:12:36,638.638 INFO:__main__:    log.info('============rv = ' + str(rv))
2020-12-08 23:12:36,638.638 INFO:__main__:  File "/home/rishabh/repos/ceph/minor/qa/tasks/ceph_test_case.py", line 127, in __exit__
2020-12-08 23:12:36,638.638 INFO:__main__:    raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
2020-12-08 23:12:36,638.638 INFO:__main__:AssertionError: Expected log message not found: '[INF] foo bar info'
2020-12-08 23:12:36,638.638 INFO:__main__:
$ git diff
diff --git a/qa/tasks/ceph_test_case.py b/qa/tasks/ceph_test_case.py
index 7d40532854..9e64aa88f7 100644
--- a/qa/tasks/ceph_test_case.py
+++ b/qa/tasks/ceph_test_case.py
@@ -121,6 +121,10 @@ class CephTestCase(unittest.TestCase):
                         # then some grace
                         time.sleep(5 + timeout)
 
+                if not self.match():
+                    log.error("Log output: \n{0}\n".format(self.watcher_process.stdout.getvalue()))
+                    raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
+
                 traceback.print_exc()
                 log.info('=============will close self.watcher_process.stdout and stderr')
                 self.watcher_process.close_stdout_and_stderr()
@@ -137,10 +141,6 @@ class CephTestCase(unittest.TestCase):
                     log.info('===========e.args' + str(e.args))
                     pass
 
-                if not self.match():
-                    log.error("Log output: \n{0}\n".format(self.watcher_process.stdout.getvalue()))
-                    raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
-
                 log.info('==============finished running assert_cluster_log.ContextManager.__exit__()')
 
         return ContextManager()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating the branch right now without moving self.match().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason your shell=False works is because closing stdin will then send SIGTERM to the ceph -w process rather than the shell. Killing the shell does not kill its children. ceph -w does not read from stdin so closing this channel does nothing to cause the ceph -w process to exit.

On second thought, why would the crash happen only on CI? Locally, the test always ran successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@batrick, forgot to tag you here - #38471 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps because logging is much faster locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I've seen it myself but how does that affect the crashing? How is it related to launching and killing of shell and it's child process?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a "match" line before closing stdin. That probably gets hit every time in local testing.

@rishabh-d-dave rishabh-d-dave force-pushed the vr-set-shell-to-True-EXTRA branch from 32b103f to 9031f11 Compare December 9, 2020 04:24
@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

The failure on last run isn't related - https://jenkins.ceph.com/job/ceph-api/8420/

@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check

@rishabh-d-dave
Copy link
Contributor Author

Same for make check run - https://jenkins.ceph.com/job/ceph-pull-requests/65389/

@rishabh-d-dave
Copy link
Contributor Author

rishabh-d-dave commented Dec 9, 2020

subproc.communicate() was called even though stdout and stderr was closed - https://jenkins.ceph.com/job/ceph-api/8471/consoleFull#-85362248744e9240e-b50a-4693-bac0-8a991bac86ac

2020-12-09 13:50:25,652.652 INFO:tasks.ceph_test_case:=============will close self.watcher_process.stdout and stderr
2020-12-09 13:50:25,652.652 INFO:__main__:=============runinng LocalRemoteProcess.close_stdout_and_stderr. self.args = ['./bin/ceph', '-w']
2020-12-09 13:50:25,653.653 INFO:__main__:=============finished returning close_stdout_and_stderr
2020-12-09 13:50:25,653.653 INFO:tasks.ceph_test_case:=============finished closing self.watcher_process.stdout and stderr
2020-12-09 13:50:25,653.653 INFO:tasks.ceph_test_case:=============will close self.watcher_process.stdin
2020-12-09 13:50:25,653.653 INFO:__main__:kill 
2020-12-09 13:50:25,653.653 INFO:__main__:==================running LocalRemoteProcess.finished()
2020-12-09 13:50:25,653.653 INFO:__main__:self.exitstatus is None, running self.subproc.poll() next
2020-12-09 13:50:25,653.653 INFO:__main__:self.subproc.poll is None
2020-12-09 13:50:25,654.654 INFO:__main__:finished running LocalRemote.finsished(), returning False
2020-12-09 13:50:25,654.654 INFO:__main__:kill: killing pid 6223 (['./bin/ceph', '-w'])
2020-12-09 13:50:25,654.654 INFO:tasks.ceph_test_case:=============self.watcher_process.stdin closed. rv = None
2020-12-09 13:50:25,654.654 INFO:tasks.ceph_test_case:=============will wait for self.watcher_process.stdin to be closed
2020-12-09 13:50:25,654.654 INFO:__main__:===========running LocalRemoteProcess.wait(). self.args = ['./bin/ceph', '-w']
2020-12-09 13:50:25,654.654 INFO:__main__:==================running LocalRemoteProcess.finished()
2020-12-09 13:50:25,654.654 INFO:__main__:self.exitstatus is None, running self.subproc.poll() next
2020-12-09 13:50:25,654.654 INFO:__main__:finished running self.subproc.poll(), its not None
2020-12-09 13:50:25,654.654 INFO:__main__:running self.subproc.communicate()
Build step 'Execute shell' marked build as failure

Let's see if waiting/sleeping beforehand will help.

@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@rishabh-d-dave
Copy link
Contributor Author

@batrick Did you get a chance to take a look at the log? Can you please take a quick look, if possible, before holidays begin? Thanks.

if self.subproc.poll() is not None:
log.info('finished running self.subproc.poll(), its not None')
log.info('running self.subproc.communicate()')
out, err = self.subproc.communicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it hangs here? what I noticed in the control flow is that you enver actually closed stdout/stderr:

2020-12-18 06:35:09,162.162 INFO:__main__:=============runinng LocalRemoteProcess.close_stdout_and_stderr. self.args = ['./bin/ceph', '-w']
2020-12-18 06:35:09,163.163 INFO:__main__:=============finished returning close_stdout_and_stderr
2020-12-18 06:35:09,163.163 INFO:tasks.ceph_test_case:=============finished closing self.watcher_process.stdout and stderr

it never hit any of the debug points where stdout/stderr are closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I spoke about this during our meeting, self.subproc.poll() returns when None called in LocalRemoteProcess.close_stdout_and_stderr() but not in LocalRemoteProcess.finished(), which is odd because call to LocalRemoteProcess.close_stdout_and_stderr() precedes LocalRemotProcess.finished().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@batrick Did you get a chance to take a look into this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss this sometime after Pacific is released.

@rishabh-d-dave rishabh-d-dave force-pushed the vr-set-shell-to-True-EXTRA branch 6 times, most recently from 394f282 to a54b6fa Compare January 4, 2021 06:43
@rishabh-d-dave rishabh-d-dave force-pushed the vr-set-shell-to-True-EXTRA branch from a54b6fa to 9c1586b Compare January 11, 2021 04:51
@rishabh-d-dave rishabh-d-dave force-pushed the vr-set-shell-to-True-EXTRA branch 3 times, most recently from adcd3bc to abbde78 Compare January 25, 2021 04:06
@rishabh-d-dave
Copy link
Contributor Author

jenkins test make check

And set its default value to True since methods like
LocalRemote.write_file() pass bash's built-in commands.

Also, update the part where LocalRemote._do_run() sets shell to True
based on whether or not list contains an instance of class Raw.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Close process's stdout and stderr before stdin so that children process
terminate as well.

Also, Move the code that writes to stdout and stderr to a new method and
call that method instead of duplicating this code block.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
@rishabh-d-dave rishabh-d-dave force-pushed the vr-set-shell-to-True-EXTRA branch from abbde78 to 135a402 Compare February 8, 2021 11:37
@rishabh-d-dave
Copy link
Contributor Author

jenkins test api

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rishabh-d-dave
Copy link
Contributor Author

Closing this PR since it was only meant to help troubleshoot the crash on ceph API tests when ceph -w is launched with shell=True. Further discussion can take place at https://tracker.ceph.com/issues/49644.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants