
Mark test_multi_threaded_executor as xfail #1624

Closed
ivanpauno wants to merge 1 commit into master from ivanpauno/mark-test-multi-threaded-executor-as-xfail

Conversation

@ivanpauno
Member

Related to #1008.

This test has been failing for a while and doesn't seem to be easy to solve.
Marking it as xfail until a solution is found, to avoid confusion.


What I think is happening is that the executor is sometimes slow to execute the timer callback, and then the measured period is sometimes smaller than expected.

e.g., with period = 1000 ms:
timer ready at t1
callback executed at t1 + 500 ms (executor delayed scheduling the callback)
timer ready again at t1 + 1000 ms
callback executed at t1 + 1100 ms
measured period: 1100 ms - 500 ms = 600 ms

I remember experiencing something like that a while ago, though I should try again to confirm.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno added the bug Something isn't working label Apr 6, 2021
@ivanpauno ivanpauno requested review from hidmic and wjwwood April 6, 2021 20:54
@ivanpauno ivanpauno self-assigned this Apr 6, 2021
Contributor

@hidmic hidmic left a comment


What I think is happening is that the executor is being slow to execute the timer callback sometimes, and then the measured period is sometimes smaller than expected.

That's quite plausible. I'm not sure we can do anything to prevent that, though perhaps we can reduce that latency. A test that's not this sensitive to time would be best e.g. using larger periods or asserts on timer period statistics.

@clalancette
Contributor

I've been reluctant to add more xfail tests, because right now we are in the period where we are trying to get rid of them.

What I'll say here is that if we don't solve this issue during the Galactic stabilization period, then I'm OK with marking it as xfail. But I do think this is one of the issues we should take a deeper look at in the next couple of weeks. So I'd say hold off on merging this PR until then.

@ivanpauno
Member Author

What I'll say here is that if we don't solve this issue during the Galactic stabilization period, then I'm OK with marking it as xfail. But I do think this is one of the issues we should take a deeper look at in the next couple of weeks. So I'd say hold off on merging this PR until then.

Sounds good.

That's quite plausible. I'm not sure we can do anything to prevent that, though perhaps we can reduce that latency. A test that's not this sensitive to time would be best e.g. using larger periods or asserts on timer period statistics.

I will try to confirm if my theory is actually right, and in that case I will just adjust the period to make the failure less likely.

@wjwwood wjwwood removed their request for review April 7, 2021 22:02
@fujitatomoya
Collaborator

callback executed at t1 + 500ms (executor delayed to schedule the callback)

shouldn't this also be detected as an error with an upper bound?

// While this tolerance is a little wide, if the bug occurs, the next step will
// happen almost instantly. The purpose of this test is not to measure the jitter
// in timers, just assert that a reasonable amount of time has passed.

i do not really see that as reasonable... i think it would be nice to detect the problem in the first place.

but this tolerance is intentional, based on #904 and #907, to relax the test failure.

I remember experiencing something like that a lot ago, though I should try again to confirm.

👍 it would be nice to see if the problem is with executor or cyclic accuracy.

@ivanpauno
Member Author

ivanpauno commented Apr 8, 2021

The problem seems to be that sometimes the timer call happens "way after" it should.

I applied a terrible patch to be able to print:

  • last timer call
  • next timer call
  • time when the callback is executed (printed as "now"), the one we're using to measure the period.
terrible patch here
diff --git a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
index b6fd2c3f..4992a4ab 100644
--- a/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
+++ b/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp
@@ -25,6 +25,36 @@
 
 using namespace std::chrono_literals;
 
+extern "C"
+{
+
+typedef struct rcl_timer_impl_t
+{
+  // The clock providing time.
+  rcl_clock_t * clock;
+  // The associated context.
+  rcl_context_t * context;
+  // A guard condition used to wake the associated wait set, either when
+  // ROSTime causes the timer to expire or when the timer is reset.
+  rcl_guard_condition_t guard_condition;
+  // The user supplied callback.
+  uintptr_t callback;
+  // This is a duration in nanoseconds.
+  uint_least64_t period;
+  // This is a time in nanoseconds since an unspecified time.
+  int_least64_t last_call_time;
+  // This is a time in nanoseconds since an unspecified time.
+  int_least64_t next_call_time;
+  // Credit for time elapsed before ROS time is activated or deactivated.
+  int_least64_t time_credit;
+  // A flag which indicates if the timer is canceled.
+  bool canceled;
+  // The user supplied allocator.
+  rcl_allocator_t allocator;
+} rcl_timer_impl_t;
+
+}
+
 class TestMultiThreadedExecutor : public ::testing::Test
 {
 protected:
@@ -70,13 +100,21 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
 
   std::atomic_int timer_count {0};
 
-  auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last]() {
+  rclcpp::TimerBase * timer_ptr = nullptr;
+  auto timer_callback = [&timer_count, &executor, &system_clock, &last_mutex, &last, &timer_ptr]() {
       // While this tolerance is a little wide, if the bug occurs, the next step will
       // happen almost instantly. The purpose of this test is not to measure the jitter
       // in timers, just assert that a reasonable amount of time has passed.
       rclcpp::Time now = system_clock.now();
       timer_count++;
 
+      auto rcl_timer = timer_ptr->get_timer_handle().get();
+      auto impl = static_cast<rcl_timer_impl_t *>(rcl_timer->impl);
+      fprintf(stderr, "time last call %zu\n", static_cast<size_t>(impl->last_call_time));
+      fprintf(stderr, "time next call %zu\n", static_cast<size_t>(impl->next_call_time));
+      fprintf(stderr, "now %zu\n", static_cast<size_t>(now.nanoseconds()));
+      fprintf(stderr, "-------------------------\n");
+
       if (timer_count > 5) {
         executor.cancel();
       }
@@ -94,6 +132,7 @@ TEST_F(TestMultiThreadedExecutor, timer_over_take) {
     };
 
   auto timer = node->create_wall_timer(PERIOD_MS, timer_callback, cbg);
+  timer_ptr = timer.get();
   executor.add_node(node);
   executor.spin();
 }

Here's an example output when it fails:

output
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestMultiThreadedExecutor
[ RUN      ] TestMultiThreadedExecutor.timer_over_take
time last call 6507694055463
time next call 6508689117649
now 6507694056607
-------------------------
time last call 6509558003682
time next call 6509689117649
now 6509558004797
-------------------------
time last call 6509690126356
time next call 6510689117649
now 6509690127686
-------------------------
/home/ivanpauno/ros2_ws/src/ros2/rclcpp/rclcpp/test/rclcpp/executors/test_multi_threaded_executor.cpp:129: Failure
Expected: (diff) > (PERIOD - TOLERANCE), actual: 0.132123 vs 0.75
[  FAILED  ] TestMultiThreadedExecutor.timer_over_take (3083 ms)
[----------] 1 test from TestMultiThreadedExecutor (3083 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3120 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestMultiThreadedExecutor.timer_over_take

 1 FAILED TEST

In the second callback, you can see that the "time last call" was way late.
The expected call time was 6508689117649; the real call time was 6509558003682, i.e. about 0.87 s (868 ms) late.

I have no idea why that's the case.

(edit)

More info: the "timer call" is done by the executor; for some reason the executor is "blocked" for a while and executes the timer way off the mark.
Because the timer uses the "last call time" to adjust the "next call time", you get two callbacks called almost consecutively.

(edit 2) This confirms my original theory.

@clalancette
Contributor

@ivanpauno Excellent, this is some great investigation. Thanks for digging in here.

More info: The "timer call" is done by the executor, for some reason the executor is like "blocked" for a time and executes the timer way off the mark.

So I think the real question to answer here is why the executor is blocked. There are 2 separate reasons I can think of:

  1. The executor itself is blocked internally. If this is the case, it seems like a bug in the executor that we should probably investigate.
  2. The OS is just really slow to reschedule the process back in sometimes. If this is the case, then it isn't the fault of any of our code and we can fix the test up to be more robust to this kind of problem.

Unfortunately, I'm not quite sure how to tell if it is case 1 or case 2.

@ivanpauno
Member Author

So I think the real question to answer here is why the executor is blocked. There are 2 separate reasons I can think of:

A third option: the timers don't actually notify the wait set; we calculate the sleep duration based on the next call time of the timers in rcl_wait() and then call rmw_wait() with that. Maybe that calculation is wrong, or a clock type mismatch is causing it to take longer (I haven't debugged that yet).
The test is using the "steady clock", and I'm almost sure rmw_wait() uses the "system clock"; I remember having seen an issue about system clock adjustments ...

@fujitatomoya
Collaborator

The test is using the "steady clock" and I'm almost sure rmw_wait() uses the "system clock", I remember to have seen an issue about system clocks adjustments ...

ros2/ros2#1080; as far as i know, cyclone and fastdds use the "system clock" only.

@ivanpauno
Member Author

Closing in favor of #1628.

@clalancette
Contributor

Closing in favor of #1628.

I'm assuming you actually meant to close this based on this comment. If not, feel free to reopen :).

@clalancette clalancette deleted the ivanpauno/mark-test-multi-threaded-executor-as-xfail branch October 12, 2022 19:35