Conversation
ea9765d to
71b057e
Compare
| self.watcher_process.stdin.close() | ||
| traceback.print_exc() | ||
| log.info('=============will close self.watcher_process.stdin') | ||
| rv = self.watcher_process.stdin.close() |
There was a problem hiding this comment.
So the reason your shell=False works is because closing stdin will then send SIGTERM to the ceph -w process rather than the shell. Killing the shell does not kill its children. ceph -w does not read from stdin so closing this channel does nothing to cause the ceph -w process to exit.
What would probably help here is to also close self.watcher_process.stdout. Then ceph -w would receive SIGPIPE when it next tries to write to stdout. You need to first read whatever data is on stdout before closing it however. IOW, move the self.match() check here, then close, then wait.
Process management is fun, huh? :)
There was a problem hiding this comment.
Ah, that explanation was really helpful. Thanks! :D I was going through subprocess's docs but the exact sequence that led to crash wasn't clear to me.
IOW, move the self.match() check here, then close, then wait.
Which one? The one at the top or the one at the bottom? I think the one at the top is not related in our context and one at the bottom is perfect where it is since self.match() just might find the pattern when stdout is written just before closing it.
Process management is fun, huh? :)
Yep, didn't expect it while refactoring some code that I am pretty used to. :D
There was a problem hiding this comment.
I verified it by moving the self.match at the bottom to just before the stdout and stderr is closed -
020-12-08 23:12:36,637.637 INFO:__main__:======================================================================
2020-12-08 23:12:36,637.637 INFO:__main__:FAIL: test_selftest_cluster_log (tasks.mgr.test_module_selftest.TestModuleSelftest)
2020-12-08 23:12:36,638.638 INFO:__main__:Use the selftest module to test the cluster/audit log interface.
2020-12-08 23:12:36,638.638 INFO:__main__:
2020-12-08 23:12:36,638.638 INFO:__main__:----------------------------------------------------------------------
2020-12-08 23:12:36,638.638 INFO:__main__:Traceback (most recent call last):
2020-12-08 23:12:36,638.638 INFO:__main__: File "/home/rishabh/repos/ceph/minor/qa/tasks/mgr/test_module_selftest.py", line 242, in test_selftest_cluster_log
2020-12-08 23:12:36,638.638 INFO:__main__: log.info('============rv = ' + str(rv))
2020-12-08 23:12:36,638.638 INFO:__main__: File "/home/rishabh/repos/ceph/minor/qa/tasks/ceph_test_case.py", line 127, in __exit__
2020-12-08 23:12:36,638.638 INFO:__main__: raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
2020-12-08 23:12:36,638.638 INFO:__main__:AssertionError: Expected log message not found: '[INF] foo bar info'
2020-12-08 23:12:36,638.638 INFO:__main__:
$ git diff
diff --git a/qa/tasks/ceph_test_case.py b/qa/tasks/ceph_test_case.py
index 7d40532854..9e64aa88f7 100644
--- a/qa/tasks/ceph_test_case.py
+++ b/qa/tasks/ceph_test_case.py
@@ -121,6 +121,10 @@ class CephTestCase(unittest.TestCase):
# then some grace
time.sleep(5 + timeout)
+ if not self.match():
+ log.error("Log output: \n{0}\n".format(self.watcher_process.stdout.getvalue()))
+ raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
+
traceback.print_exc()
log.info('=============will close self.watcher_process.stdout and stderr')
self.watcher_process.close_stdout_and_stderr()
@@ -137,10 +141,6 @@ class CephTestCase(unittest.TestCase):
log.info('===========e.args' + str(e.args))
pass
- if not self.match():
- log.error("Log output: \n{0}\n".format(self.watcher_process.stdout.getvalue()))
- raise AssertionError("Expected log message not found: '{0}'".format(expected_pattern))
-
log.info('==============finished running assert_cluster_log.ContextManager.__exit__()')
return ContextManager()
There was a problem hiding this comment.
Updating the branch right now without moving self.match().
There was a problem hiding this comment.
So the reason your shell=False works is because closing stdin will then send SIGTERM to the ceph -w process rather than the shell. Killing the shell does not kill its children. ceph -w does not read from stdin so closing this channel does nothing to cause the ceph -w process to exit.
On second thought, why would the crash happen only on CI? Locally, the test always ran successfully.
There was a problem hiding this comment.
@batrick, forgot to tag you here - #38471 (comment)
There was a problem hiding this comment.
Perhaps because logging is much faster locally.
There was a problem hiding this comment.
That's true, I've seen it myself but how does that affect the crashing? How is it related to launching and killing of shell and it's child process?
There was a problem hiding this comment.
There is a "match" line before closing stdin. That probably gets hit every time in local testing.
32b103f to
9031f11
Compare
|
jenkins test api |
|
The failure on last run isn't related - https://jenkins.ceph.com/job/ceph-api/8420/ |
|
jenkins test make check |
|
Same for make check run - https://jenkins.ceph.com/job/ceph-pull-requests/65389/ |
|
subproc.communicate() was called even though stdout and stderr was closed - https://jenkins.ceph.com/job/ceph-api/8471/consoleFull#-85362248744e9240e-b50a-4693-bac0-8a991bac86ac Let's see if waiting/sleeping beforehand will help. |
|
jenkins test api |
|
@batrick Did you get a chance to take a look at the log? Can you please take a quick look, if possible, before holidays begin? Thanks. |
| if self.subproc.poll() is not None: | ||
| log.info('finished running self.subproc.poll(), its not None') | ||
| log.info('running self.subproc.communicate()') | ||
| out, err = self.subproc.communicate() |
There was a problem hiding this comment.
so it hangs here? what I noticed in the control flow is that you enver actually closed stdout/stderr:
2020-12-18 06:35:09,162.162 INFO:__main__:=============runinng LocalRemoteProcess.close_stdout_and_stderr. self.args = ['./bin/ceph', '-w']
2020-12-18 06:35:09,163.163 INFO:__main__:=============finished returning close_stdout_and_stderr
2020-12-18 06:35:09,163.163 INFO:tasks.ceph_test_case:=============finished closing self.watcher_process.stdout and stderr
it never hit any of the debug points where stdout/stderr are closed
There was a problem hiding this comment.
Yes. I spoke about this during our meeting, self.subproc.poll() returns when None called in LocalRemoteProcess.close_stdout_and_stderr() but not in LocalRemoteProcess.finished(), which is odd because call to LocalRemoteProcess.close_stdout_and_stderr() precedes LocalRemotProcess.finished().
There was a problem hiding this comment.
@batrick Did you get a chance to take a look into this PR?
There was a problem hiding this comment.
We can discuss this sometime after Pacific is released.
394f282 to
a54b6fa
Compare
a54b6fa to
9c1586b
Compare
adcd3bc to
abbde78
Compare
|
jenkins test make check |
And set its default value to True since methods like LocalRemote.write_file() pass bash's built-in commands. Also, update the part where LocalRemote._do_run() sets shell to True based on whether or not list contains an instance of class Raw. Signed-off-by: Rishabh Dave <ridave@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Close process's stdout and stderr before stdin so that children process terminate as well. Also, Move the code that writes to stdout and stderr to a new method and call that method instead of duplicating this code block. Signed-off-by: Rishabh Dave <ridave@redhat.com>
abbde78 to
135a402
Compare
|
jenkins test api |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
Closing this PR since it was only meant to help troubleshoot the crash on ceph API tests when |
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox