
AMCL crashes with assertion after initialpose received #2711

@AlexeyMerzlyakov

Description


AMCL crashes with an `i < set_a->sample_count` assertion while an "initialpose" message is being received. This appears from time to time in system tests such as test_speed_filter_global during Circle-CI integration. The failure logs are as follows:

[amcl-8] [INFO] [1637842857.912058601] [amcl]: initialPoseReceived
[amcl-8] [WARN] [1637842857.912216894] [amcl]: Failed to transform initial pose in time (Lookup would require extrapolation into the future.  Requested time 14.438000 but the latest data is at time 14.410000, when looking up transform from frame [base_footprint] to frame [odom])
[amcl-8] [INFO] [1637842857.912258118] [amcl]: Setting pose (14.438000): -2.000 -0.500 0.000
[amcl-8] amcl: /opt/overlay_ws/src/navigation2/nav2_amcl/src/pf/pf.c:381: pf_update_resample: Assertion `i < set_a->sample_count' failed.
[ERROR] [amcl-8]: process has died [pid 25975, exit code -6, cmd '/opt/overlay_ws/install/nav2_amcl/lib/nav2_amcl/amcl --ros-args -r __node:=amcl --params-file /tmp/tmpln7ui15p -r /tf:=tf -r /tf_static:=tf_static'].

The issue can also be reproduced locally by running the selected test manually in the CI Docker container.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04 focal
  • ROS2 Version:
    • ROS2 Rolling (20211105 version)
  • Version or commit hash:
  • DDS implementation:
    • Cyclone DDS (CI default), Fast DDS

Steps to reproduce issue

Steps to reproduce the problem in a local CI Docker container:

  1. Pull the CI Docker image and create a container from it:
$ docker pull ghcr.io/ros-planning/navigation2:main
$ docker image list
REPOSITORY                         TAG                IMAGE ID       CREATED        SIZE
ghcr.io/ros-planning/navigation2   main               6ade87d6aaba   2 days ago     3.49GB
$ docker run -t -d --name nav2 6ade87d6aaba
  2. Start the Docker container:
$ docker start nav2
$ docker exec -it nav2 /bin/bash
  3. Build the nav2 stack inside the container:
# source /opt/ros/rolling/setup.bash
# cd /opt/overlay_ws/src
# rm -rf navigation2
# git clone https://github.com/ros-planning/navigation2.git
# cd /opt/overlay_ws/
# rm -rf build install log
# colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Debug
  4. After the build completes, run the test_speed_filter_global system test in a loop:
# source /opt/overlay_ws/install/setup.bash
# cd /opt/overlay_ws
# while [[ "$?" == "0" ]]; do /usr/bin/python3.8 "-u" "/opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py" "/opt/overlay_ws/build/nav2_system_tests/test_results/nav2_system_tests/test_speed_filter_global.xml" "--package-name" "nav2_system_tests" "--generate-result-on-success" "--env" "TEST_DIR=/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters" "TEST_MAP=/opt/overlay_ws/src/navigation2/nav2_system_tests/maps/map_circular.yaml" "TEST_MASK=/opt/overlay_ws/src/navigation2/nav2_system_tests/maps/speed_mask.yaml" "TEST_WORLD=/opt/overlay_ws/src/navigation2/nav2_system_tests/worlds/turtlebot3_ros2_demo.world" "PARAMS_FILE=/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters/speed_global_params.yaml" "GAZEBO_MODEL_PATH=/opt/overlay_ws/src/navigation2/nav2_system_tests/models" "BT_NAVIGATOR_XML=navigate_to_pose_w_replanning_and_recovery.xml" "ASTAR=False" "--command" "/opt/overlay_ws/src/navigation2/nav2_system_tests/src/costmap_filters/test_speed_launch.py" > test.log 2>&1;  echo "PASSED"; done

Wait until the test freezes (it takes ~5-10 minutes of running on my PC before the failure appears).

Check the log for the assertion:

root@f888a9d4ac18:/opt/overlay_ws# grep Assert test.log 
[amcl-8] amcl: /opt/overlay_ws/src/navigation2/nav2_amcl/src/pf/pf.c:387: pf_update_resample: Assertion `i < set_a->sample_count' failed.

Expected behavior

Test passes

Actual behavior

Test fails (AMCL dies with assertion fail)

Problem analysis and proposed bugfix

The assertion failure occurs because the sample weights in the laser-scan point set become inconsistent (and not normalized). The weights become inconsistent due to parallel writes to the pf_->sets[]->samples[] array by two different threads: the "initialpose" handler callback and the "laserscan" handler callback. Therefore, it is proposed to lock access to the shared pf_ structure in all callbacks that access it:

  • AmclNode::globalLocalizationCallback()
  • AmclNode::initialPoseReceived()
  • AmclNode::laserReceived()

Proposed bugfix:

diff --git a/nav2_amcl/include/nav2_amcl/amcl_node.hpp b/nav2_amcl/include/nav2_amcl/amcl_node.hpp
index 9deb5504..e2446e0c 100644
--- a/nav2_amcl/include/nav2_amcl/amcl_node.hpp
+++ b/nav2_amcl/include/nav2_amcl/amcl_node.hpp
@@ -238,6 +238,7 @@ protected:
    */
   static pf_vector_t uniformPoseGenerator(void * arg);
   pf_t * pf_{nullptr};
+  std::mutex pf_mutex_;
   bool pf_init_;
   pf_vector_t pf_odom_pose_;
   int resample_count_{0};
diff --git a/nav2_amcl/src/amcl_node.cpp b/nav2_amcl/src/amcl_node.cpp
index 24333091..293bbdb4 100644
--- a/nav2_amcl/src/amcl_node.cpp
+++ b/nav2_amcl/src/amcl_node.cpp
@@ -461,6 +461,8 @@ AmclNode::globalLocalizationCallback(
   const std::shared_ptr<std_srvs::srv::Empty::Request>/*req*/,
   std::shared_ptr<std_srvs::srv::Empty::Response>/*res*/)
 {
+  std::lock_guard<std::mutex> lock(pf_mutex_);
+
   RCLCPP_INFO(get_logger(), "Initializing with uniform distribution");
 
   pf_init_model(
@@ -485,6 +487,8 @@ AmclNode::nomotionUpdateCallback(
 void
 AmclNode::initialPoseReceived(geometry_msgs::msg::PoseWithCovarianceStamped::SharedPtr msg)
 {
+  std::lock_guard<std::mutex> lock(pf_mutex_);
+
   RCLCPP_INFO(get_logger(), "initialPoseReceived");
 
   if (msg->header.frame_id == "") {
@@ -582,6 +586,8 @@ AmclNode::handleInitialPose(geometry_msgs::msg::PoseWithCovarianceStamped & msg)
 void
 AmclNode::laserReceived(sensor_msgs::msg::LaserScan::ConstSharedPtr laser_scan)
 {
+  std::lock_guard<std::mutex> lock(pf_mutex_);
+
   // Since the sensor data is continually being published by the simulator or robot,
   // we don't want our callbacks to fire until we're in the active state
   if (!active_) {return;}

With the patch applied, this ran for ~40 minutes of local testing without an issue. I am not well versed in AMCL internals, so a more careful locking model may be possible; any guidance is welcome. In any case, this fix seems to work.
