Fix memory leak in Transform Listener #630

Closed
zygfrydw wants to merge 1 commit into ros2:rolling from zygfrydw:fix-tf-listener

Conversation

@zygfrydw commented Sep 26, 2023

There is a memory/message leak when using a TransformListener constructed with an internal node.

When a TransformListener is constructed in the following way:

auto tf_buffer = tf2_ros::Buffer(node->get_clock());
auto tf_listener = tf2_ros::TransformListener(tf_buffer);

a memory leak occurs, because messages published to /parameter_events are never serviced by the internal node.
We observed this when using CycloneDDS with shared memory (iceoryx), and we have created a repo to reproduce this error.
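For reference, a complete minimal program showing this construction might look like the following (a sketch assembled from the snippet above; the repo mentioned is the authoritative reproduction):

#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <tf2_ros/buffer.h>
#include <tf2_ros/transform_listener.h>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("repro_node");

  // The listener creates its own internal node plus a dedicated spin thread;
  // the leak accumulates in that internal node's /parameter_events subscription.
  auto tf_buffer = tf2_ros::Buffer(node->get_clock());
  auto tf_listener = tf2_ros::TransformListener(tf_buffer);

  rclcpp::spin(node);  // idle; watch the process's memory grow
  rclcpp::shutdown();
  return 0;
}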

The reason for this issue: when the TransformListener creates its dedicated thread, the subscriptions for the /tf and /tf_static topics are added to a dedicated callback group. When the internal node is created, it also creates a subscription on the /parameter_events topic. However, this subscription is created in the default callback group, which is never spun by the executor (the executor is explicitly told to run only the callback group with /tf and /tf_static).
The /parameter_events subscription is created with a QoS history of 1000 messages, so unserviced messages live forever in /parameter_events for each TransformListener.
In an application with default CycloneDDS this results in a small, constant memory leak per TransformListener instance; however, when messages are taken from a pool, as when using shared memory, the memory pool is exhausted quickly.
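To make the mechanics concrete, here is a self-contained sketch of the pattern (illustrative names, not the verbatim tf2_ros source):

#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <tf2_msgs/msg/tf_message.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("listener_internal_node");

  // Callback group that is NOT automatically added to executors with the node.
  auto tf_group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive,
    /* automatically_add_to_executor_with_node = */ false);

  rclcpp::SubscriptionOptions options;
  options.callback_group = tf_group;
  auto tf_sub = node->create_subscription<tf2_msgs::msg::TFMessage>(
    "/tf", rclcpp::QoS(100),
    [](tf2_msgs::msg::TFMessage::ConstSharedPtr) { /* feed the buffer */ },
    options);

  // Only the tf group is handed to the executor. The node's default group,
  // which holds the internal /parameter_events subscription, is never spun,
  // so those messages are delivered but never taken.
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_callback_group(tf_group, node->get_node_base_interface());
  executor.spin();

  rclcpp::shutdown();
  return 0;
}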

The /parameter_events subscription is created in rclcpp::TimeSource::NodeState, at rclcpp/src/rclcpp/time_source.cpp:270.

In my opinion this subscription on parameter events is not necessary in this internal node, but I do not see any option to disable it. Additionally, the problem will come back if someone adds other subscriptions to Node in the future; hence I prefer the current solution of adding the default callback group to the executor.

Another possible solution is:

executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
if (optional_default_node_) {
  // If the optional default node was created, add the whole node so that
  // internal subscriptions such as /parameter_events are also processed.
  executor_->add_node(node_base_interface_);
} else {
  executor_->add_callback_group(callback_group_, node_base_interface_);
}
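One caveat with the add_node variant, worth double-checking: an executor's add_node only picks up callback groups created with automatically_add_to_executor_with_node = true, so if the tf callback group was created with that flag set to false, it would still need the explicit add_callback_group call in both branches.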

@clalancette changed the base branch from ros2 to rolling on September 27, 2023 at 15:37
@clalancette (Contributor)

> In my opinion this subscription on parameter events is not necessary in this internal node, but I do not see any option to disable it.

So we do actually need to react to parameter events in TimeSource. The reason is so that we can do the right thing when the user enables or disables simulation time via the use_sim_time parameter.

However, we don't exactly need to subscribe to it anymore. Since Iron, we have the ability to get a callback after a parameter has been updated, using the post_set_parameters_callback (like in https://github.com/ros2/demos/blob/36a4c0bd6ae0531ad3096806c41efd7062a1cb31/demo_nodes_cpp/src/parameters/set_parameters_callback.cpp#L86-L106). Additionally, using the post_set_parameters_callback will be more performant than the subscription.

So my suggestion is that instead we open a PR to rclcpp, replacing https://github.com/ros2/rclcpp/blob/9284d7cefa9e37ce3b6926a44e6177d6e4a4b62a/rclcpp/src/rclcpp/time_source.cpp#L279-L288 with an add_post_set_parameters_callback. That will solve this leak.
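For illustration, the mechanism looks like this in a standalone node (a sketch of the Iron-and-newer API, not the actual TimeSource patch; the node and parameter names are made up):

#include <memory>
#include <vector>
#include <rclcpp/rclcpp.hpp>

class PostSetDemo : public rclcpp::Node
{
public:
  PostSetDemo()
  : Node("post_set_demo")
  {
    // Illustrative parameter; TimeSource would watch use_sim_time instead.
    this->declare_parameter("demo_param", "hello");
    handle_ = this->add_post_set_parameters_callback(
      [this](const std::vector<rclcpp::Parameter> & parameters) {
        // Runs in-process after a successful parameter set: there is no
        // /parameter_events subscription, so nothing queues up unserviced.
        for (const auto & p : parameters) {
          RCLCPP_INFO(this->get_logger(), "parameter '%s' was set", p.get_name().c_str());
        }
      });
  }

private:
  rclcpp::node_interfaces::PostSetParametersCallbackHandle::SharedPtr handle_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<PostSetDemo>());
  rclcpp::shutdown();
  return 0;
}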

We can also consider taking this PR, for the reasons you describe, but I think that is a separate issue. @zygfrydw are you willing to open that PR to rclcpp?

@zygfrydw (Author) commented Oct 2, 2023

I am not exactly sure what replacing the parameter_subscription_ with a post_set_parameters_callback changes in terms of this pull request.

In TimeSource::NodeState::attachNode there is already an on_set_parameters_callback registered:

on_set_parameters_callback_ = node_parameters_->add_on_set_parameters_callback(
  std::bind(&TimeSource::NodeState::on_set_parameters, this, std::placeholders::_1));

alongside the subscription to parameter events:

parameter_subscription_ = rclcpp::AsyncParametersClient::on_parameter_event(
  node_topics_,
  [this](std::shared_ptr<const rcl_interfaces::msg::ParameterEvent> event) {
    if (node_base_ != nullptr) {
      this->on_parameter_event(event);
    }
  });

I do not understand why both the on_set_parameters_callback and the parameter_subscription_ are necessary, nor why changing the parameter_subscription_ to a post_set_parameters_callback would change anything.

In my opinion, it is not correct to spin only the /tf and /tf_static subscriptions of the internal Transform Listener node.
I am not aware of other node routines that require spinning, so replacing the parameter subscription with a post_set_parameters_callback may be sufficient for now. However, in my opinion it is a fragile fix, because if someone adds another subscription to Node in the future, the issue will come back.

We are currently locked to ROS Humble, so we hope you merge this pull request into ROS Humble as well.

@jayyoung commented Nov 8, 2023

Hello,
We ran into what looks like this issue and discovered this PR. During our investigation we independently discovered that all of our memory-leaking nodes were ones that instantiated a TransformListener in the way you described in your example. We're testing the fix here now, but I wanted to drop by to show the effects of the memory leak, as we did quite some testing on multiple robots in our fleet, and in simulation, to get a picture of its severity, and what triggers it.

Here is a graph of time vs. memory usage of every node in our system, gathered over 548 minutes (9.1 hours) of robot idle time. The lines with an incline are nodes with TransformListener instantiations that leaked memory. The flat lines are the rest of our system.

[results_gh: time vs. memory usage for every node in the system; the rising lines are the nodes with TransformListener instantiations]

From eyeballing these results, it looks like every TransformListener leaks roughly 20 MB of memory over that ~9 hour period, and nodes with multiple TransformListener instantiations leak harder than those with just one, or none. This fits your explanation of a never-decreasing, immortal message queue per instance of that object. In our case, this accumulation added up to an extra ~1.2 GB of memory usage by our system. That is pretty severe for those of us doing long-term robotics tasks.

Just wanted to add some experimental support for your diagnosis. We're not sure yet whether the fix here helps us, but we strongly suspect this object is the leaky one, and we will report back.

@nachovizzo (Contributor)

I also created a small repository to run experiments. I spawn a dummy node that does nothing except advertise a parameter, which I later use to send a big string payload. After putting the payload into the parameter, I started seeing memory leaks. The confusing part is that I observe leaks both with and without a TransformListener, and with two different DDS implementations.

Indeed, if a tf2_ros::TransformListener object is part of the node, the memory leak is notable.

LeakyNode - No tf stuff

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        this->declare_parameter("my_parameter", "world");
    }
};

[leaky_node: memory usage plot for LeakyNode without a TransformListener]

LeakyNode - with dummy TransformListener

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        this->declare_parameter("my_parameter", "world");
        tf_buffer = std::make_unique<tf2_ros::Buffer>(this->get_clock());
        tf_listener = std::make_unique<tf2_ros::TransformListener>(*tf_buffer);
    }
    std::unique_ptr<tf2_ros::TransformListener> tf_listener;
    std::unique_ptr<tf2_ros::Buffer> tf_buffer;
};

[image: memory usage plot for LeakyNode with the dummy TransformListener; the growth is noticeably steeper]
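For completeness, a minimal driver for these experiments could look like this (a sketch; the actual experiment repository may differ):

#include <memory>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  // Spin the node while idle and record its RSS externally (e.g. top, psrecord).
  rclcpp::spin(std::make_shared<LeakyNode>("leaky_node"));
  rclcpp::shutdown();
  return 0;
}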

Conclusions

I don't have any yet 😢 So far, all the experiments I have done are in some way contradictory (common in memory-leak hunting). Any leads on how to nail down this problem would be highly appreciated.

@ahcorde (Contributor) commented May 15, 2024

@zygfrydw Was this issue fixed by @nachovizzo's PR #636?

Feel free to reopen it if the problem is still there.

@ahcorde closed this on May 15, 2024
@ralwing commented Aug 4, 2025

I've just tested the current ros2:humble (ros-humble-geometry2/jammy,now 0.25.15-1jammy.20250719.035645 amd64 [installed,automatic]) by running the test in Docker, and the problem remains:

Mempool [m_chunkSize = 1064, numberOfChunks = 100, used_chunks = 100 ] has no more space left
2025-08-04 13:28:23.888 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 152The following mempools are available:  MemPool [ ChunkSize = 1064, ChunkPayloadSize = 1024, ChunkCount = 100 ]
2025-08-04 13:28:23.889 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
Mempool [m_chunkSize = 1064, numberOfChunks = 100, used_chunks = 100 ] has no more space left
2025-08-04 13:28:23.890 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 152The following mempools are available:  MemPool [ ChunkSize = 1064, ChunkPayloadSize = 1024, ChunkCount = 100 ]
2025-08-04 13:28:23.890 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
Mempool [m_chunkSize = 1064, numberOfChunks = 100, used_chunks = 100 ] has no more space left
2025-08-04 13:28:23.891 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 152The following mempools are available:  MemPool [ ChunkSize = 1064, ChunkPayloadSize = 1024, ChunkCount = 100 ]

I also tested other ROS 2 distributions:

  • jazzy - memory leak
  • kilted - fixed
  • rolling - fixed

@clalancette @zygfrydw @ahcorde I'm kindly asking to reopen this issue.

@clalancette (Contributor)

> @clalancette @zygfrydw @ahcorde I'm kindly asking to reopen this issue.

This is an old issue that we think we've solved a long time ago. Please open a new issue with your findings and how to reproduce.

@ralwing commented Aug 6, 2025

With all respect, it doesn't make sense to open a new issue: the bug wasn't fixed, and a new report would contain the same content as this one. I've only reproduced the steps that @zygfrydw described, additionally testing them on different ROS 2 distros.
