Executer regression fix #2509

Closed
jmachowinski wants to merge 6 commits into ros2:rolling from cellumation:executer_regression_fix
Conversation

@jmachowinski
Collaborator

Fixes #2508

Janosch Machowinski added 3 commits April 17, 2024 14:53
Checks if executors are busy waiting while they should block
in spin_some or spin_all.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
This test was strange: it seemed to assume that spin_all did
not return instantly. It was also racy, as the thread could
terminate instantly.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
…available

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Collaborator

@fujitatomoya left a comment

lgtm!

@fujitatomoya
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

jmachowinski and others added 3 commits April 19, 2024 11:29
Co-authored-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
Signed-off-by: jmachowinski <jmachowinski@users.noreply.github.com>
Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
Before, the method would not recollect available work in the case of
spin_some and spin_all. This led to the method behaving differently
from what the documentation states.

Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com>
@jmachowinski force-pushed the executer_regression_fix branch from 6ff50bf to 9495fa7 on April 19, 2024 09:33
@jmachowinski
Collaborator Author

I looked further into this, as all the tests lit up.
I pushed an updated version; can someone rerun the CI with this?

@wjwwood @mjcarroll I noticed that the spin_some_impl method did not act according to the documentation.
It was not recollecting ready executables on every call. Is this intentional, or a bug?

@fujitatomoya
Collaborator

@jmachowinski sorry I missed this, here is a new CI.

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Member

@wjwwood left a comment

I couldn't fix up this PR directly because of push permissions, but I made a new PR that builds on this one: #2517

I think we should take my pr, but I'm open to discussing even more changes.

RCPPUTILS_SCOPE_EXIT(this->spinning.store(false); );

// clear result, to force recollect of ready executables
wait_result_.reset();
Member

I think this change is correct, but it doesn't address what was originally reported in #2508, which is about avoiding a busy wait in spin_all. Instead this change prevents old data from a previous spin being used at the beginning of this spin_some/all.
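
The effect can be modeled without rclcpp. Below is a minimal standalone sketch (MiniExecutor, its members, and the int "executables" are hypothetical stand-ins, not the real rclcpp types): if the cached wait result survives across spin calls, a later spin skips recollection and misses newly available work, while resetting it first forces a fresh collect.

```cpp
#include <cassert>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for an executor that caches collected work in
// wait_result_ (ints model ready executables; none of this is rclcpp API).
struct MiniExecutor {
  std::optional<std::queue<int>> wait_result_;
  std::vector<int> executed;

  // Models wait_for_work(): collect whatever is currently ready.
  void wait_for_work(std::queue<int> ready) { wait_result_ = std::move(ready); }

  // Models the spin_some entry; reset_first models the wait_result_.reset() fix.
  void spin_some(std::queue<int> ready_now, bool reset_first) {
    if (reset_first) {
      wait_result_.reset();  // clear result, to force recollect of ready executables
    }
    if (!wait_result_.has_value()) {
      wait_for_work(std::move(ready_now));  // only collects when no cached result
    }
    while (!wait_result_->empty()) {  // execute everything that was collected
      executed.push_back(wait_result_->front());
      wait_result_->pop();
    }
  }
};
```

Without the reset, a second spin_some call finds a stale (empty but still engaged) cached result, skips collection, and executes nothing even though new work is ready; with the reset it recollects on every call.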

// In the case of spin some, then we can exit
// In the case of spin all, then we will allow ourselves to wait again.
break;
if (!work_available || !exhaustive) {
Member

I don't think this is the right logic actually. If this is called with exhaustive = false and the first wait_for_work() collects two things to do, this function would stop after executing only one of them, when it should execute both. (!true || !false) -> true -> break.

The && ensured that you didn't consider the exhaustive option until you finished what you collected, which is the behavior of spin_some:

/// Collect work once and execute all available work, optionally within a max duration.

If exhaustive = true then it's spin_all which not only wants to finish all collected work after waiting, but also wants to wait again and again until there's no work available just after waiting:

/// Collect and execute work repeatedly within a duration or until no more work is available.

With all of that in mind, I will rewrite the logic of this function to make it clearer what is happening and why.
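
This reading of the condition can be checked with a standalone sketch (plain C++, no rclcpp; run_spin and make_queue are hypothetical, and the loop is a simplification of spin_some_impl as described above, with the exit check placed after each executed item):

```cpp
#include <cassert>
#include <initializer_list>
#include <queue>
#include <vector>

// Helper to build a queue of "collected executables" for the sketch.
std::queue<int> make_queue(std::initializer_list<int> xs) {
  std::queue<int> q;
  for (int x : xs) { q.push(x); }
  return q;
}

// Simplified spin loop: `collected` is what the first wait_for_work() gathered,
// `work_available` its initial state, and `use_or` selects the disputed `||`
// condition instead of the original `&&`.
std::vector<int> run_spin(std::queue<int> collected, bool exhaustive, bool use_or) {
  std::vector<int> executed;
  const bool work_available = !collected.empty();
  while (!collected.empty()) {
    executed.push_back(collected.front());
    collected.pop();
    const bool stop = use_or ? (!work_available || !exhaustive)
                             : (!work_available && !exhaustive);
    if (stop) {
      break;  // spin_some exits here; spin_all would wait for more work instead
    }
  }
  return executed;
}
```

With exhaustive = false and two collected items, the `||` variant returns after one item while the `&&` variant drains both, matching the (!true || !false) -> true -> break walkthrough.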

Collaborator Author

You would not end up in this code path as long as work is available, as get_next_ready_executable(any_exec) would return true.

Comment on lines -443 to -473
// Long timeout, but should not block test if spin_all works as expected as we cancel the
// executor.
bool spin_exited = false;
std::thread spinner([&spin_exited, &executor, this]() {
executor.spin_all(1s);
executor.remove_node(this->node, true);
spin_exited = true;
});

// Do some work until sufficient calls to the waitable occur
auto start = std::chrono::steady_clock::now();
while (
my_waitable->get_count() <= 1 &&
!spin_exited &&
(std::chrono::steady_clock::now() - start < 1s))
{
my_waitable->trigger();
this->publisher->publish(test_msgs::msg::Empty());
std::this_thread::sleep_for(1ms);
}
// trigger multiple times, so that some work is available
my_waitable->trigger();
my_waitable->trigger();
my_waitable->trigger();

executor.cancel();
start = std::chrono::steady_clock::now();
while (!spin_exited && (std::chrono::steady_clock::now() - start) < 1s) {
std::this_thread::sleep_for(1ms);
}
// Long timeout, but should not block as almost no work is available
executor.spin_all(1s);
executor.remove_node(this->node, true);

EXPECT_LT(1u, my_waitable->get_count());
waitable_interfaces->remove_waitable(my_waitable, nullptr);
ASSERT_TRUE(spin_exited);
spinner.join();
Member

When making a change to the behavior of a function in a way that shouldn't be breaking existing users, it's best to not change the test too. Looking at these changes, I don't see why they are necessary. It might be cleaner or simpler, but it's better to do that in a separate pull request.

Collaborator Author

This test triggers as a false positive on my 7800X3D. The problem is that the spinner thread starts and terminates before the waitable is even triggered.

Comment on lines +878 to +881
executor->add_node(node);

// spin a bit to get all events from the addition of the node processed
executor->spin_all(std::chrono::milliseconds(100));
Member

Rather than do this, I think it's better to create an isolated callback group and use that for testing, so that transient node-specific stuff does not interfere. This kind of "spin a bit and hope the resulting state is what I'm expecting" is pretty fragile. And I know we do that in other places, but I'd rather not add more cases of it.

Comment on lines +898 to +900
// we allow a big number here, as in case of real active waiting
// it will be off by a few thousands anyway.
ASSERT_LT(waitable->get_is_ready_call_count() - start_is_ready_count, 30);
Member

This is a pretty fuzzy assertion. It would be better to use a number based on the number of things we expect need handling, rather than arbitrarily using 30. I think avoiding adding the entire node and instead using a single callback group might help with that, but we also want to avoid coupling the test too tightly to the current implementation details, as well as leaving some flexibility for the executor implementations.

In general, I think testing for busy waits is going to be really challenging, given that this is not a guarantee made by the interface; e.g. someone could create a busy-wait/polling executor because that met their needs, and that would break this test. It would be like making a test for std::condition_variable::wait_for that stated that the predicate could not be called more than X times. For some value of X that might be a reasonable check, but if it's too high you get too many false positives for the test, and if it's too low you might get false negatives, given that you don't know the implementation details.

That all being said, I do appreciate trying to make a test for this, and I think we should keep it until some executor breaks the mold.
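
The condition-variable analogy above can be made concrete with a standalone sketch (count_predicate_calls is hypothetical and the timings are illustrative): a blocking wait_for evaluates its predicate only a handful of times, whereas a busy-wait/polling loop would evaluate it thousands of times over the same window, which is exactly why any fixed bound like 30 is implementation-sensitive.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Count how many times the predicate runs while blocking until `done` is set.
int count_predicate_calls() {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  int calls = 0;

  std::thread setter([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    {
      std::lock_guard<std::mutex> lock(m);
      done = true;
    }
    cv.notify_one();
  });

  std::unique_lock<std::mutex> lock(m);
  // A blocking implementation checks the predicate once up front and once per
  // wakeup; a busy-wait implementation would check it far more often.
  cv.wait_for(lock, std::chrono::seconds(5), [&] { ++calls; return done; });
  setter.join();
  return calls;
}
```

On a blocking implementation the count is small (at least two: the initial check plus the check after wakeup, modulo spurious wakeups), so any hard upper bound chosen for a test is a judgment call about implementation behavior, not a guarantee of the interface.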

@wjwwood
Member

wjwwood commented May 1, 2024

Closing in favor of #2517, but we can reopen this one if needed.


Development

Successfully merging this pull request may close these issues.

Regression: Executor::spin_some_impl is active waiting

4 participants