Skip to content

Handle sigterm#830

Merged
ivanpauno merged 22 commits intomasterfrom
ivanpauno/rclpy-handle-sigterm
Oct 21, 2021
Merged

Handle sigterm#830
ivanpauno merged 22 commits intomasterfrom
ivanpauno/rclpy-handle-sigterm

Conversation

@ivanpauno
Copy link
Copy Markdown
Member

@ivanpauno ivanpauno commented Sep 30, 2021

Depends on #814.
Equivalent to ros2/rclcpp#1771.

Copy link
Copy Markdown
Contributor

@sloretz sloretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change to install signal handlers in rclpy.init() and uninstall them in rclpy.shutdown().

Something odd is this makes rclpy nodes not exit on receiving SIGTERM. Before, sending SIGTERM will cause the process to exit like this:

$ kill -SIGTERM 431136
$ ros2 run demo_nodes_py talker
[INFO] [1633734719.062607437] [talker]: Publishing: "Hello World: 0"
[INFO] [1633734720.050625060] [talker]: Publishing: "Hello World: 1"
...
[INFO] [1633734731.050637979] [talker]: Publishing: "Hello World: 12"
[ros2run]: Terminated

With this PR, sending SIGTERM does nothing. The talker executor keeps running.
There's no Python signal handler for SIGTERM, so no exception gets raised. Triggering the sigint/sigterm guard condition appears to only wake the executor; so rclpy.spin(...) keeps running. Maybe SIGTERM could additionally trigger a shutdown call on all of the Context instances?

"Stop triggering a guard condition when SIGINT occurs."
},
{NULL, NULL, 0, NULL} /* sentinel */
/// This should be updated to match signals.SignalHandlerOptions in python.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, to avoid duplicating these values I would recommend defining the enum in Pybind11 like here:

py::enum_<rcl_clock_type_t>(m, "ClockType")
.value("UNINITIALIZED", RCL_CLOCK_UNINITIALIZED)
.value("ROS_TIME", RCL_ROS_TIME)
.value("SYSTEM_TIME", RCL_SYSTEM_TIME)
.value("STEADY_TIME", RCL_STEADY_TIME);

And exposing that to Python like here:

ClockType = _rclpy.ClockType

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, pybind11 wrapping register_sigint_signal_handler and register_sigterm_signal_handler and doing the logic of void install_signal_handlers(int options) in the python method def install_signal_handlers(options) sounds like another good option as it would remove the need for a C version of this enum entirely

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, I will apply this

@ivanpauno
Copy link
Copy Markdown
Member Author

Something odd is this makes rclpy nodes not exit on receiving SIGTERM. Before, sending SIGTERM will cause the process to exit like this:

Thanks for noticing that, I didn't think about that issue.

Maybe SIGTERM could additionally trigger a shutdown call on all of the Context instances?

Yes, that's an option.
The other one I'm thinking of is to raise an exception from the signal handler, that's possible using the python signal module: https://docs.python.org/3/library/signal.html#example.

We could maybe have ShutdownError exception in rclpy, if that doesn't exist already.

@ivanpauno ivanpauno changed the base branch from cpp_signal_handling_rclpy to master October 12, 2021 15:58
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno force-pushed the ivanpauno/rclpy-handle-sigterm branch from 9be1a1d to d69d0f2 Compare October 14, 2021 15:24
@ivanpauno
Copy link
Copy Markdown
Member Author

Now when running the demo_nodes_py listener demo and sending sigterm the following exception is thrown:

  File "/home/ivanpauno/ros2_ws/install/demo_nodes_py/lib/demo_nodes_py/listener", line 11, in <module>
    load_entry_point('demo-nodes-py', 'console_scripts', 'listener')()
  File "/home/ivanpauno/ros2_ws/build/demo_nodes_py/demo_nodes_py/topics/listener.py", line 36, in main
    rclpy.spin(node)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/__init__.py", line 213, in spin
    executor.spin_once()
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 704, in spin_once
    handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 690, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 592, in _wait_for_ready_callbacks
    raise ExternalShutdownException()

which seems like a reasonable error.


I tried to use the python signal module for all signal handlers, but it didn't work as I expected.
The issue is that waitset is still waiting for an event, with the GIL locked, and the python signal handler is not going to be run until that returns.

Maybe, we should be using an approach similar to rclcpp: having an extra thread waiting on a semaphore, trigger it in a signal safe fashion, and from that thread trigger guard conditions and shutdown contexts.

@ivanpauno ivanpauno requested a review from sloretz October 14, 2021 15:32
@sloretz
Copy link
Copy Markdown
Contributor

sloretz commented Oct 14, 2021

The issue is that waitset is still waiting for an event, with the GIL locked, and the python signal handler is not going to be run until that returns.

That's interesting - the intended behavior is to release the GIL while the waitset is waiting. Is there another spot blocking?

py::gil_scoped_release gil_release;

Another downside is the Pure python signal handler can only be unregistered from the same thread that registered it (#728 (comment)). Maybe that's less of a problem with this PR registering and unregistering it in rclpy.init/shutdown.

@ivanpauno
Copy link
Copy Markdown
Member Author

That's interesting - the intended behavior is to release the GIL while the waitset is waiting. Is there another spot blocking?

Sorry, the GIL is not blocked.

But the waitset is blocked waiting on a condition variable.
The python signal module is not "completely asynchronous", a flag will be set from the C signal handler and the python signal handler will be executed the next time the interpreter can check that flag, which is not possible while a cpython extension function is being executed.

In this case, it seems we really need a C signal handler to at least trigger a condition variable.
The issue is that triggering a condition variable is not signal safe, what we are doing in rclcpp is to do that from a thread only used for signal handling, and use a signal safe synchronization mechanism to wake that thread.
That's a preexistent issue though.

I've used a python sigterm handler only for shutting down contexts, which is safer than doing that from C, though it adds more complexity.
In rclcpp, we're doing that from the same thread that trigger the guard conditions, so maybe we could do that here as well.

@ivanpauno ivanpauno requested a review from sloretz October 15, 2021 21:53
@ivanpauno
Copy link
Copy Markdown
Member Author

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Copy Markdown
Contributor

@sloretz sloretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

demo_nodes_py talker segfaults when receiving SIGTERM. Is that a sign of an issue here?

$ /home/osrf/ws/ros2/install/demo_nodes_py/lib/demo_nodes_py/talker
[INFO] [1634342485.847740003] [talker]: Publishing: "Hello World: 0"
[INFO] [1634342486.836984592] [talker]: Publishing: "Hello World: 1"
[INFO] [1634342487.837171725] [talker]: Publishing: "Hello World: 2"
[INFO] [1634342488.836918431] [talker]: Publishing: "Hello World: 3"
[INFO] [1634342489.837115967] [talker]: Publishing: "Hello World: 4"
[INFO] [1634342490.837159863] [talker]: Publishing: "Hello World: 5"
[INFO] [1634342491.836732510] [talker]: Publishing: "Hello World: 6"
[INFO] [1634342492.836944020] [talker]: Publishing: "Hello World: 7"
Segmentation fault (core dumped)
$ kill -SIGTERM 856298

@ivanpauno
Copy link
Copy Markdown
Member Author

demo_nodes_py talker segfaults when receiving SIGTERM. Is that a sign of an issue here?

Yes, I will test that.

@ivanpauno
Copy link
Copy Markdown
Member Author

demo_nodes_py talker segfaults when receiving SIGTERM. Is that a sign of an issue here?

It seems that when I tested on Friday I was sending the signal to the ros2cli process launching the node instead that to the node itself.
c032473 should've fixed the issue.
A failure after sigterm is send now looks like:

Traceback (most recent call last):
  File "install/demo_nodes_py/lib/demo_nodes_py/listener", line 11, in <module>
    load_entry_point('demo-nodes-py', 'console_scripts', 'listener')()
  File "/home/ivanpauno/ros2_ws/build/demo_nodes_py/demo_nodes_py/topics/listener.py", line 36, in main
    rclpy.spin(node)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/__init__.py", line 224, in spin
    executor.spin_once()
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 704, in spin_once
    handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 690, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "/home/ivanpauno/ros2_ws/install/rclpy/lib/python3.8/site-packages/rclpy/executors.py", line 592, in _wait_for_ready_callbacks
    raise ExternalShutdownException()
rclpy.executors.ExternalShutdownException

This reverts commit 515f0a3.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno
Copy link
Copy Markdown
Member Author

Force-pushed to fix DCO

@ivanpauno ivanpauno requested a review from sloretz October 18, 2021 17:07
@ivanpauno
Copy link
Copy Markdown
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno ivanpauno requested a review from hidmic October 20, 2021 16:39
if (g_signal_handler_installed.load()) {
trigger_guard_conditions();
if (g_is_sigterm.exchange(false)) {
rclpy::shutdown_contexts();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanpauno why the difference in behavior between SIGINT and SIGTERM? AFAICS rclcpp treats them equally.

Copy link
Copy Markdown
Member Author

@ivanpauno ivanpauno Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGINT raises keyboard interrupt, so you don't need to asynchronously shutdown.
For SIGTERM, I need to do something else from the signal handler, the options are shutting down or raising an exception from there (which can safely be done from the python signal handler).

edit: I also didn't want to modify what we currently do when SIGINT is send.
Using only the python signal module is also not possible, see #830 (comment) and #830 (comment).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGINT raises keyboard interrupt, so you don't need to asynchronously shutdown.

Does that ensure contexts are shutdown? That's what itches me. SIGTERM for sure does, SIGINT, I don't know.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that ensure contexts are shutdown? That's what itches me. SIGTERM for sure does, SIGINT, I don't know.

No, it does not.
Context are not shutdown when SIGINT is signaled in rclpy, that was also the behavior before this PR.

We could maybe change that, shutting down contexts in both cases will probably not change behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context are not shutdown when SIGINT is signaled in rclpy, that was also the behavior before this PR.

I see. I don't mind if we change it here or in a follow-up PR, but the difference in signal handling behavior between rclpy and rclcpp seems confusing to me for no obvious reason. WDYT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't mind if we change it here or in a follow-up PR, but the difference in signal handling behavior between rclpy and rclcpp seems confusing to me for no obvious reason. WDYT?

Sounds good, I will try that in a follow-up PR, for the moment I will move forward with this one.

rcl_ret_t ret = rcl_shutdown(c);
(void)ret;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanpauno nit: any reason not to move both function and variables into the Context class (via static specifiers)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, it doesn't change anything though, so I will ignore the comment for the moment

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno requested a review from hidmic October 21, 2021 13:11
Copy link
Copy Markdown
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending green CI

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno
Copy link
Copy Markdown
Member Author

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno ivanpauno requested a review from hidmic October 21, 2021 14:18
@ivanpauno ivanpauno merged commit 5297241 into master Oct 21, 2021
@delete-merged-branch delete-merged-branch bot deleted the ivanpauno/rclpy-handle-sigterm branch October 21, 2021 20:37
@emersonknapp
Copy link
Copy Markdown
Collaborator

As you hit on in the conversation above, I have one concern with the behavior of this PR, and it's that you can't handle SIGINT and SIGTERM the same within your Python application. If I make a test as follows:

import rclpy
from demo_nodes_py.topics.talker import Talker

def main():
    rclpy.init()
    node = Talker()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        print('Got SIGINT')
    except rclpy.executors.ExternalShutdownException:
        print('Got SIGTERM')
    print('Destroying')
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()

If I run this program, and send a SIGINT to the process, then the following happens:

[INFO] [1635816254.444592338] [talker]: Publishing: "Hello World: 0"
Got SIGINT
Destroying

However if you send a SIGTERM, then the following unclean shutdown happens:

[INFO] [1635815908.502049088] [talker]: Publishing: "Hello World: 0"
Got SIGTERM
Destroying
Traceback (most recent call last):
  File "./src/emersonknapp/ektest/src/ekpy", line 42, in <module>
    main()
  File "./src/emersonknapp/ektest/src/ekpy", line 38, in main
    rclpy.shutdown()
  File "/ws/install/gcc/rclpy/lib/python3.8/site-packages/rclpy/__init__.py", line 123, in shutdown
    _shutdown(context=context)
  File "/ws/install/gcc/rclpy/lib/python3.8/site-packages/rclpy/utilities.py", line 58, in shutdown
    return context.shutdown()
  File "/ws/install/gcc/rclpy/lib/python3.8/site-packages/rclpy/context.py", line 101, in shutdown
    self.__context.shutdown()
rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /ws/src/ros2/rcl/rcl/src/rcl/init.c:241

I don't think it makes sense to require different shutdown logic between these signals - did you come to a resolution for which one of the behaviors it should be?

@ivanpauno
Copy link
Copy Markdown
Member Author

I don't think it makes sense to require different shutdown logic between these signals - did you come to a resolution for which one of the behaviors it should be?

I'm honestly not sure.

Python raises a KeyboardInterrupt from the sigint handler, the signal handler added by ros2 tried to respect that logic.
That seems reasonable, as it's familiar to python users.

We could also raise a KeyboardInterrupt from sigterm. But IMO that would confuse python users, who expect KeyboardInterrupt to only be raised when sigint happens.

Maybe we could raise a different exception when sigterm happens, which would be pretty similar to the sigint case (e.g. rclpy.executors.ExternalShutdownException, without actually shutting down contexts).
It's a bit problematic though, as for that we need the signal module, and signal.signal() can only be called from the main thread, which adds a limitation to rclpy.init() that currently doesn't exist.


The following code should work cleanly in both cases:

import rclpy
from demo_nodes_py.topics.talker import Talker

def main():
    rclpy.init()
    node = Talker()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        print('Got SIGINT')
        rclpy.shutdown()
    except rclpy.executors.ExternalShutdownException:
        print('Got SIGTERM')
    print('Destroying')
    node.destroy_node()


if __name__ == '__main__':
    main()

@ivanpauno
Copy link
Copy Markdown
Member Author

I think that the other alternative is to also shutdown asynchronously when sigint happens, so you don't really need to call rclpy.shutdown() in either case.
That would be a breaking change, though I think it's probably a good idea.

IIRC @hidmic preferred this alternative.

@hidmic
Copy link
Copy Markdown
Contributor

hidmic commented Nov 2, 2021

I think it makes sense for the signals that were configured as a shutdown event upon rclpy initialization to effectively shutdown. That's not mutually exclusive (or shouldn't be IMO) with the default signal handler behavior e.g. raise KeyboardInterrupt on SIGINT.

@emersonknapp
Copy link
Copy Markdown
Collaborator

I think that the other alternative is to also shutdown asynchronously when sigint happens, so you don't really need to call rclpy.shutdown() in either case.
I think it makes sense for the signals that were configured as a shutdown event upon rclpy initialization to effectively shutdown.

Yeah, this also makes sense to me. Then, in practice would we say a ROS 2 Python __main__ should look like:

try:
    rclpy.init()
    my_program()
except (KeyboardInterrupt, rclpy.executors.ExternalShutdownException):
    print('Got clean shutdown signal exception.')
else:
    rclpy.shutdown()
finally:
    my_specific_cleanup()

@ivanpauno
Copy link
Copy Markdown
Member Author

Yeah, this also makes sense to me. Then, in practice would we say a ROS 2 Python main should look like:

Exactly.
I will open a PR trying that, I guess I will have to open some PRs in other repos fixing regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants