-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
AMCL crashes with i < set_a->sample_count assertion while "initialpose" message being received. This appears from time to time in such system tests as test_speed_filter_global during the Circle-CI integration. The failure logs are as follows:
[amcl-8] [INFO] [1637842857.912058601] [amcl]: initialPoseReceived
[amcl-8] [WARN] [1637842857.912216894] [amcl]: Failed to transform initial pose in time (Lookup would require extrapolation into the future. Requested time 14.438000 but the latest data is at time 14.410000, when looking up transform from frame [base_footprint] to frame [odom])
[amcl-8] [INFO] [1637842857.912258118] [amcl]: Setting pose (14.438000): -2.000 -0.500 0.000
[amcl-8] amcl: /opt/overlay_ws/src/navigation2/nav2_amcl/src/pf/pf.c:381: pf_update_resample: Assertion `i < set_a->sample_count' failed.
[ERROR] [amcl-8]: process has died [pid 25975, exit code -6, cmd '/opt/overlay_ws/install/nav2_amcl/lib/nav2_amcl/amcl --ros-args -r __node:=amcl --params-file /tmp/tmpln7ui15p -r /tf:=tf -r /tf_static:=tf_static'].
This also could be reproduced locally in CI Docker container running the selected test manually.
Bug report
Required Info:
- Operating System:
- Ubuntu 20.04 focal
- ROS2 Version:
- ROS2 Rolling (20211105 version)
- Version or commit hash:
- DDS implementation:
- Cyclone DDS (CI default), Fast DDS
Steps to reproduce issue
Describing steps to reproduce the problem on local Docker CI container:
- Download docker CI Docker image and make a container from it:
$ docker pull ghcr.io/ros-planning/navigation2:main
$ docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/ros-planning/navigation2 main 6ade87d6aaba 2 days ago 3.49GB
$ docker run -t -d --name nav2 6ade87d6aaba
- Start Docker container:
$ docker start nav2
$ docker exec -it nav2 /bin/bash
- Build nav2 stack at Docker side:
# source /opt/ros/rolling/setup.bash
# cd /opt/overlay_ws/src
# rm -rf navigation2
# git clone https://github.com/ros-planning/navigation2.git
# cd /opt/overlay_ws/
# rm -rf build install log
# colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Debug
- After build being complete, run
test_speed_filter_globalsystem test:
# source /opt/overlay_ws/install/setup.bash
# cd /opt/overlay_ws
# while [[ "$?" == "0" ]]; do /usr/bin/python3.8 "-u" "/opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py" "/opt/overlay_ws/build/nav2_system_tests/test_results/nav2_system_tests/test_speed_filter_global.xml" "--package-name" "nav2_system_tests" "--generate-result-on-success" "--env" "TEST_DIR=/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters" "TEST_MAP=/opt/overlay_ws/src/navigation2/nav2_system_tests/maps/map_circular.yaml" "TEST_MASK=/opt/overlay_ws/src/navigation2/nav2_system_tests/maps/speed_mask.yaml" "TEST_WORLD=/opt/overlay_ws/src/navigation2/nav2_system_tests/worlds/turtlebot3_ros2_demo.world" "PARAMS_FILE=/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters/speed_global_params.yaml" "GAZEBO_MODEL_PATH=/opt/overlay_ws/src/navigation2/nav2_system_tests/models" "BT_NAVIGATOR_XML=navigate_to_pose_w_replanning_and_recovery.xml" "ASTAR=False" "--command" "/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters/test_speed_launch.py" > test.log 2>&1; echo "PASSED"; done
And wait until test will freeze (takes ~5-10 minutes of run on my PC until fail will appear).
Check the log for Assertion:
root@f888a9d4ac18:/opt/overlay_ws# grep Assert test.log
[amcl-8] amcl: /opt/overlay_ws/src/navigation2/nav2_amcl/src/pf/pf.c:387: pf_update_resample: Assertion `i < set_a->sample_count' failed.
Expected behavior
Test passes
Actual behavior
Test fails (AMCL dies with assertion fail)
Problem analysis and proposed bugfix
The assertion fail appears because of inconsistent (and not normalized) weights of sample array in laserscan points set. These weights are being inconsistent because of parallel write to pf_->sets[]->samples[] array by two different threads: "initialpose" handler callback and "laserscan" handler callback. Therefore, it is proposed to lock an access to shared pf_ structure by parallel processes for all callbacks accessing it:
- AmclNode::globalLocalizationCallback()
- AmclNode::initialPoseReceived()
- AmclNode::laserReceived()
Proposed bugfix:
diff --git a/nav2_amcl/include/nav2_amcl/amcl_node.hpp b/nav2_amcl/include/nav2_amcl/amcl_node.hpp
index 9deb5504..e2446e0c 100644
--- a/nav2_amcl/include/nav2_amcl/amcl_node.hpp
+++ b/nav2_amcl/include/nav2_amcl/amcl_node.hpp
@@ -238,6 +238,7 @@ protected:
*/
static pf_vector_t uniformPoseGenerator(void * arg);
pf_t * pf_{nullptr};
+ std::mutex pf_mutex_;
bool pf_init_;
pf_vector_t pf_odom_pose_;
int resample_count_{0};
diff --git a/nav2_amcl/src/amcl_node.cpp b/nav2_amcl/src/amcl_node.cpp
index 24333091..293bbdb4 100644
--- a/nav2_amcl/src/amcl_node.cpp
+++ b/nav2_amcl/src/amcl_node.cpp
@@ -461,6 +461,8 @@ AmclNode::globalLocalizationCallback(
const std::shared_ptr<std_srvs::srv::Empty::Request>/*req*/,
std::shared_ptr<std_srvs::srv::Empty::Response>/*res*/)
{
+ std::lock_guard<std::mutex> lock(pf_mutex_);
+
RCLCPP_INFO(get_logger(), "Initializing with uniform distribution");
pf_init_model(
@@ -485,6 +487,8 @@ AmclNode::nomotionUpdateCallback(
void
AmclNode::initialPoseReceived(geometry_msgs::msg::PoseWithCovarianceStamped::SharedPtr msg)
{
+ std::lock_guard<std::mutex> lock(pf_mutex_);
+
RCLCPP_INFO(get_logger(), "initialPoseReceived");
if (msg->header.frame_id == "") {
@@ -582,6 +586,8 @@ AmclNode::handleInitialPose(geometry_msgs::msg::PoseWithCovarianceStamped & msg)
void
AmclNode::laserReceived(sensor_msgs::msg::LaserScan::ConstSharedPtr laser_scan)
{
+ std::lock_guard<std::mutex> lock(pf_mutex_);
+
// Since the sensor data is continually being published by the simulator or robot,
// we don't want our callbacks to fire until we're in the active state
if (!active_) {return;}
This works OK about ~40min of local testing without an issue after patch was applied. I am not good at AMCL internals, maybe there could be more careful locking model. If so, it is welcoming for any guidance.
In any case, this fix seems to work fine.