
Mark test_multi_threaded_executor as xfail #1624

Closed
ivanpauno wants to merge 1 commit into master from ivanpauno/mark-test-multi-threaded-executor-as-xfail

Conversation

@ivanpauno
Member

Related to #1008.

This test has been failing for a while and doesn't seem to be easy to solve.
Marking it as xfail until a solution is found, to avoid confusion.


What I think is happening is that the executor is sometimes slow to execute the timer callback, and then the measured period is sometimes smaller than expected.

e.g., with period = 1000 ms:
timer ready at t1
callback executed at t1 + 500 ms (executor delayed scheduling the callback)
timer ready again at t1 + 1000 ms
callback executed at t1 + 1100 ms
measured period: 1100 ms - 500 ms = 600 ms

I remember experiencing something like that a while ago, though I should try again to confirm.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno added the bug Something isn't working label Apr 6, 2021
@ivanpauno ivanpauno requested review from hidmic and wjwwood April 6, 2021 20:54
@ivanpauno ivanpauno self-assigned this Apr 6, 2021
Contributor

@hidmic hidmic left a comment


What I think is happening is that the executor is being slow to execute the timer callback sometimes, and then the measured period is sometimes smaller than expected.

That's quite plausible. I'm not sure we can do anything to prevent that, though perhaps we can reduce that latency. A test that's not this sensitive to time would be best e.g. using larger periods or asserts on timer period statistics.

@clalancette
Contributor

I've been reluctant to add more xfail tests, because right now we are in the period where we are trying to get rid of them.

What I'll say here is that if we don't solve this issue during the Galactic stabilization period, then I'm OK with marking it as xfail. But I do think this is one of the issues we should take a deeper look at in the next couple of weeks. So I'd say hold off on merging this PR until then.

@ivanpauno
Member Author

What I'll say here is that if we don't solve this issue during the Galactic stabilization period, then I'm OK with marking it as xfail. But I do think this is one of the issues we should take a deeper look at in the next couple of weeks. So I'd say hold off on merging this PR until then.

Sounds good.

That's quite plausible. I'm not sure we can do anything to prevent that, though perhaps we can reduce that latency. A test that's not this sensitive to time would be best e.g. using larger periods or asserts on timer period statistics.

I will try to confirm if my theory is actually right, and in that case I will just adjust the period to make the failure less likely.

@wjwwood wjwwood removed their request for review April 7, 2021 22:02
@fujitatomoya
Collaborator

callback executed at t1 + 500ms (executor delayed to schedule the callback)

shouldn't this also be detected as an error with an upper bound?

// While this tolerance is a little wide, if the bug occurs, the next step will
// happen almost instantly. The purpose of this test is not to measure the jitter
// in timers, just assert that a reasonable amount of time has passed.

i do not really see that as reasonable... i think it would be nice to detect the problem in the first place.

but this tolerance is intentional, based on #904 and #907, to relax the test failure.

I remember experiencing something like that a lot ago, though I should try again to confirm.

👍 it would be nice to see if the problem is with executor or cyclic accuracy.

@ivanpauno
Member Author

ivanpauno commented Apr 8, 2021

The problem seems to be that sometimes the timer call happens "way after" it should.

I applied a terrible patch to be able to print:

  • last timer call
  • next timer call
  • time when the callback is executed (printed as "now"), the one we're using to measure the period.
terrible patch here
diff --git a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
index b6fd2c3f..4992a4ab 100644
--- a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
+++ b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
@@ -25,6 +25,36 @@
 
 using namespace std::chrono_literals;
 
+extern "C"
+{
+
+typedef struct rcl_timer_impl_t
+{
+  // The clock providing time.
+  rcl_clock_t * clock;
+  // The associated context.
+  rcl_context_t * context;
+  // A guard condition used to wake the associated wait set, either when
+  // ROSTime causes the timer to expire or when the timer is reset.
+  rcl_guard_condition_t guard_condition;
+  // The user supplied callback.
+  uintptr_t callback;
+  // This is a duration in nanoseconds.
+  uint_least64_t period;
+  // This is a time in nanoseconds since an unspecified time.
+  int_least64_t last_call_time;
+  // This is a time in nanoseconds since an unspecified time.
+  int_least64_t next_call_time;
+  // Credit for time elapsed before ROS time is activated or deactivated.
+  int_least64_t time_credit;
+  // A flag which indicates if the timer is canceled.
+  bool canceled;
+  // The user supplied allocator.
+  rcl_allocator_t allocator;
+} rcl_timer_impl_t;
+
+}
+
 class TestMultiThreadedExecutor : public ::testing::Test
 {
 protected:
@@ -70,13 +100,21 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
 
   std::atomic_int timer_count {0};
 
-  auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last]() {
+  rclcpp::TimerBase * timer_ptr = nullptr;
+  auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last, &timer_ptr]() {
       // While this tolerance is a little wide, if the bug occurs, the next step will
       // happen almost instantly. The purpose of this test is not to measure the jitter
       // in timers, just assert that a reasonable amount of time has passed.
       rclcpp::Time now = system_clock.now();
       timer_count++;
 
+      auto rcl_timer = timer_ptr->get_timer_handle().get();
+      auto impl = static_cast<rcl_timer_impl_t *>(rcl_timer->impl);
+      fprintf(stderr, "time last call %zu\n", static_cast<size_t>(impl->last_call_time));
+      fprintf(stderr, "time next call %zu\n", static_cast<size_t>(impl->next_call_time));
+      fprintf(stderr, "now %zu\n", static_cast<size_t>(now.nanoseconds()));
+      fprintf(stderr, "-------------------------\n");
+
       if (timer_count > 5) {
         executor.cancel();
       }
@@ -94,6 +132,7 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
     };
 
   auto timer = node->create_wall_timer(PERIOD_MS, timer_callback, cbg);
+  timer_ptr = timer.get();
   executor.add_node(node);
   executor.spin();
 }

Here's an example output when it fails:

output
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestMultiThreadedExecutor
[ RUN      ] TestMultiThreadedExecutor.timer_over_take
time last call 6507694055463
time next call 6508689117649
now 6507694056607
-------------------------
time last call 6509558003682
time next call 6509689117649
now 6509558004797
-------------------------
time last call 6509690126356
time next call 6510689117649
now 6509690127686
-------------------------
/home/ivanpauno/ros2_ws/src/ros2/rclcpp/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp:129: Failure
Expected: (diff) > (PERIOD - TOLERANCE), actual: 0.132123 vs 0.75
[  FAILED  ] TestMultiThreadedExecutor.timer_over_take (3083 ms)
[----------] 1 test from TestMultiThreadedExecutor (3083 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3120 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestMultiThreadedExecutor.timer_over_take

 1 FAILED TEST

In the second callback, you can see that the "time last call" was way late.
The expected call time was 6508689117649; the real call time was 6509558003682, i.e. about 0.87 s (868 ms) late.

I have no idea why that's the case.

(edit)

More info: the "timer call" is done by the executor; for some reason the executor is "blocked" for a while and executes the timer way off the mark.
Because the timer uses the "last call time" to adjust the "next call time", you get two callbacks called almost consecutively.

(edit 2) This confirms my original theory.

@clalancette
Contributor

@ivanpauno Excellent, this is some great investigation. Thanks for digging in here.

More info: The "timer call" is done by the executor, for some reason the executor is like "blocked" for a time and executes the timer way off the mark.

So I think the real question to answer here is why the executor is blocked. There are 2 separate reasons I can think of:

  1. The executor itself is blocked internally. If this is the case, it seems like a bug in the executor that we should probably investigate.
  2. The OS is just really slow to reschedule the process back in sometimes. If this is the case, then it isn't the fault of any of our code and we can fix the test up to be more robust to this kind of problem.

Unfortunately, I'm not quite sure how to tell if it is case 1 or case 2.

@ivanpauno
Member Author

So I think the real question to answer here is why the executor is blocked. There are 2 separate reasons I can think of:

A third option: the timers don't actually notify the wait set; we calculate the sleep duration based on the next call time of the timers in rcl_wait() and then call rmw_wait() with that. Maybe that calculation is wrong, or a clock type mismatch is causing it to take longer (I haven't debugged that yet).
The test is using the "steady clock", and I'm almost sure rmw_wait() uses the "system clock"; I remember having seen an issue about system clock adjustments ...

@fujitatomoya
Collaborator

The test is using the "steady clock" and I'm almost sure rmw_wait() uses the "system clock", I remember to have seen an issue about system clocks adjustments ...

ros2/ros2#1080; as far as i know, cyclone and fastdds use the "system clock" only.

@ivanpauno
Member Author

Closing in favor of #1628.

@clalancette
Contributor

Closing in favor of #1628.

I'm assuming you actually meant to close this based on this comment. If not, feel free to reopen :).

@clalancette clalancette deleted the ivanpauno/mark-test-multi-threaded-executor-as-xfail branch October 12, 2022 19:35