Mark test_multi_threaded_executor as xfail #1624
Conversation
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
hidmic
left a comment
What I think is happening is that the executor is being slow to execute the timer callback sometimes, and then the measured period is sometimes smaller than expected.
That's quite plausible. I'm not sure we can do anything to prevent that, though perhaps we can reduce that latency. A test that's not this sensitive to time would be best e.g. using larger periods or asserts on timer period statistics.
I've been reluctant to add more xfail tests, because right now is the period where we are trying to get rid of them. What I'll say here is that if we don't solve this issue during the Galactic stabilization period, then I'm OK with marking it as xfail. But I do think this is one of the issues we should take a deeper look at in the next couple of weeks. So I'd say hold off on merging this PR until then.
Sounds good.
I will try to confirm if my theory is actually right, and in that case I will just adjust the period to make the failure less likely.
Shouldn't this also be detected as an error by the upper bound check? (rclcpp/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp, lines 74 to 76 in 41fedb7.) I do not really see that it is, but that upper bound is meant to relax the test failure, based on #904 and #907.
👍 it would be nice to see whether the problem is with the executor or with clock accuracy.
The problem seems to be that sometimes the timer call happens "way after" it should. I applied a terrible patch to be able to print:
terrible patch here:
diff --git a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
index b6fd2c3f..4992a4ab 100644
--- a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
+++ b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
@@ -25,6 +25,36 @@
using namespace std::chrono_literals;
+extern "C"
+{
+
+typedef struct rcl_timer_impl_t
+{
+ // The clock providing time.
+ rcl_clock_t * clock;
+ // The associated context.
+ rcl_context_t * context;
+ // A guard condition used to wake the associated wait set, either when
+ // ROSTime causes the timer to expire or when the timer is reset.
+ rcl_guard_condition_t guard_condition;
+ // The user supplied callback.
+ uintptr_t callback;
+ // This is a duration in nanoseconds.
+ uint_least64_t period;
+ // This is a time in nanoseconds since an unspecified time.
+ int_least64_t last_call_time;
+ // This is a time in nanoseconds since an unspecified time.
+ int_least64_t next_call_time;
+ // Credit for time elapsed before ROS time is activated or deactivated.
+ int_least64_t time_credit;
+ // A flag which indicates if the timer is canceled.
+ bool canceled;
+ // The user supplied allocator.
+ rcl_allocator_t allocator;
+} rcl_timer_impl_t;
+
+}
+
class TestMultiThreadedExecutor : public ::testing::Test
{
protected:
@@ -70,13 +100,21 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
std::atomic_int timer_count {0};
- auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last]() {
+ rclcpp::TimerBase * timer_ptr = nullptr;
+ auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last, &timer_ptr]() {
// While this tolerance is a little wide, if the bug occurs, the next step will
// happen almost instantly. The purpose of this test is not to measure the jitter
// in timers, just assert that a reasonable amount of time has passed.
rclcpp::Time now = system_clock.now();
timer_count++;
+ auto rcl_timer = timer_ptr->get_timer_handle().get();
+ auto impl = static_cast<rcl_timer_impl_t *>(rcl_timer->impl);
+ fprintf(stderr, "time last call %zu\n", static_cast<size_t>(impl->last_call_time));
+ fprintf(stderr, "time next call %zu\n", static_cast<size_t>(impl->next_call_time));
+ fprintf(stderr, "now %zu\n", static_cast<size_t>(now.nanoseconds()));
+ fprintf(stderr, "-------------------------\n");
+
if (timer_count > 5) {
executor.cancel();
}
@@ -94,6 +132,7 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
};
auto timer = node->create_wall_timer(PERIOD_MS, timer_callback, cbg);
+ timer_ptr = timer.get();
executor.add_node(node);
executor.spin();
}
Here is an example output when it fails (output elided). In the second callback, you can see that the "time last call" was way late. I have no idea why that's the case.
(edit) More info: the "timer call" is done by the executor; for some reason the executor is "blocked" for a time and executes the timer way off the mark.
(edit 2) This confirms my original theory.
@ivanpauno Excellent, this is some great investigation. Thanks for digging in here.
So I think the real question to answer here is why the executor is blocked. There are 2 separate reasons I can think of:
Unfortunately, I'm not quite sure how to tell whether it is case 1 or case 2.
A third option: the timers don't really notify the wait set; we calculate the sleep duration based on the next call time of the timers in
ros2/ros2#1080; as far as I know, Cyclone DDS and Fast DDS use the "system clock" only.
Closing in favor of #1628. |
I'm assuming you actually meant to close this based on this comment. If not, feel free to reopen :). |
Related to #1008.
This test has been failing for a while now, and doesn't seem to be easy to solve.
Marking it as xfail until a solution is found, to avoid confusion.
What I think is happening is that the executor is being slow to execute the timer callback sometimes, and then the measured period is sometimes smaller than expected.
e.g.: period = 1000ms
timer ready at t1
callback executed at t1 + 500ms (executor delayed to schedule the callback)
timer ready again at t1 + 1000ms
callback executed at t1 + 1100ms
measured period of 1100ms - 500ms = 600ms
I remember experiencing something like that a while ago, though I should try again to confirm.