Christophe Bédard

ROS 2 Over Email: rmw_email, an Actual Working RMW Implementation

2021-09-19T00:00:00+00:00

Service request and response using rmw_email. Messages are exchanged as strings using emails.

tl;dr ROS 2's architecture allows for using almost any middleware, as long as it's done through RMW, the middleware interface. rmw_email contains a middleware that sends & receives strings over email and an RMW implementation that allows ROS 2 to use this middleware to exchange messages. While it's certainly not a production-level middleware, it provides interesting insight into the pros and cons of ROS 2's architecture. Abstractions definitely have a cost, but they're also quite powerful: they allow ROS 2 to run over email without needing to modify it at all!

Introduction
Motivation
Overview
The components of a working(ish) RMW implementation
Demo
Performance
Tracing
Limitations and future work
Conclusion
Links
Update (2022-05-07)
Update (2025-10-26)
References

Introduction

ROS 2’s architecture and underlying middleware are vastly different from ROS 1’s, in part because ROS 2 is targeted at real-time distributed applications. The middleware interface, rmw, is an abstraction that allows ROS 2 to support multiple different middleware implementations. On top of that, there’s rcl, which provides a common implementation in C to support client libraries for any language. Finally, there’s rclcpp and rclpy, the C++ and Python client libraries, respectively. While there definitely are downsides to these abstractions and interfaces, this architecture is very powerful.

In this post, I’ll present rmw_email, which allows ROS 2 to exchange messages using emails. I’ll start by explaining the motivation behind rmw_email. Then I’ll provide an overview and explain how each component works. After that, I’ll show a couple of demos and present the results of performance experiments. Finally, I’ll briefly discuss limitations and possible future work, and then I’ll conclude.

Motivation

The main motivation for rmw_email was my master’s. I got the high-level idea for this in June of 2020. At that point, I had been working on and around ROS 2 for about a year. I had interacted a lot more with the higher levels of the ROS 2 architecture (rclcpp, rcl) while working on ros2_tracing, so I wanted to dig deeper into the middleware level and below (rmw, DDS/other middleware).

I was also seeing some interesting discussions involving middlewares and the middleware interface. Developers of real-time applications (e.g., automotive) wanted to expose middleware features that were rather DDS-specific through rmw. This could possibly make the abstraction “leak,” thereby breaking it at least slightly. Of course, it’s a tradeoff between maintaining the abstraction itself to keep the benefits vs. making ROS 2 more powerful by allowing users to leverage advanced middleware features and possibly reducing the costs of the abstraction and the general overhead of ROS 2. [1, 2]

Also, at the time, there weren’t many non-DDS rmw implementations; even if they do exist, there are actually currently no non-DDS implementations listed under the latest ROS 2 distro in REP 2000. Perhaps adding another one – as absurd as it may be – could help diversify the ROS 2 middleware implementations and illustrate how useful the current abstraction can be. This was even on the ROS 2 roadmap at some point.

And, also, why not? We just can! Sure, DDS is a proven standard to exchange messages, but so is email! Besides, I don’t own a fax machine.

Overview

rmw_email, the repository/project, consists mainly of two packages: email, the middleware, and rmw_email_cpp, the RMW implementation.

email is a simple middleware with the publisher/subscriber pattern to send and receive string messages on topics. It also natively supports the service client/server (RPC) pattern. As its name suggests, emails are used to send messages: the topic or service name is the email subject and the message content is the email body.

rmw_email_cpp is an implementation of the ROS 2 middleware interface, rmw, using email. It uses an external package that does the hard work of converting messages to a YAML representation. It then serializes that representation to YAML strings and passes them on to email. Indeed, email knows nothing about all the different ROS 2 messages; it simply handles strings.

The components of a working(ish) RMW implementation

In the beginning, I had a rough goal: get ROS 2 working over email. Fortunately, in a way, it’s a rather straightforward process, since it can be split into a few components.

Middleware

The middleware should really be usable on its own, so I started by only focusing on that.

At its core, email simply sends and receives emails using libcurl’s C API. Emails are sent using the SMTP protocol and received by polling using the IMAP protocol. Polling is done by first getting the unique identifier (UID [3]) of the next expected email using the EXAMINE IMAP command [4]. Then it polls until there’s a new email, increments the UID value, and repeats the process. Polling is done on a dedicated thread, while emails are sent synchronously (currently, at least).
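The polling logic described above can be sketched as follows. This is only an illustration, not the actual email code: Mailbox is a hypothetical in-memory stand-in for the IMAP server (the real implementation goes through libcurl), and the Poller class and all method names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory mailbox standing in for an IMAP server.
class Mailbox
{
public:
  // UID that the next incoming email will get (what EXAMINE reports as UIDNEXT).
  std::uint32_t next_uid() const {return next_uid_;}

  // Returns the email with the given UID, or nothing if it has not arrived yet.
  std::optional<std::string> fetch(std::uint32_t uid) const
  {
    assert(uid != 0);  // IMAP UIDs are non-zero
    auto it = emails_.find(uid);
    if (it == emails_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

  // Simulates the arrival of a new email.
  void deliver(const std::string & body) {emails_[next_uid_++] = body;}

private:
  std::uint32_t next_uid_{1};
  std::map<std::uint32_t, std::string> emails_;
};

class Poller
{
public:
  // Only emails that arrive after the poller is created are of interest,
  // so start from the UID of the next expected email.
  explicit Poller(const Mailbox & mailbox)
  : mailbox_(mailbox), expected_uid_(mailbox.next_uid()) {}

  // One polling iteration: fetch every email that arrived since the last call,
  // incrementing the expected UID after each one.
  std::vector<std::string> poll_once()
  {
    std::vector<std::string> received;
    while (auto email = mailbox_.fetch(expected_uid_)) {
      received.push_back(*email);
      ++expected_uid_;
    }
    return received;
  }

private:
  const Mailbox & mailbox_;
  std::uint32_t expected_uid_;
};
```

In a real setup, poll_once() would be called in a loop on the dedicated polling thread, with a short sleep between iterations.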

Each email message contains metadata that is included as both custom and standard email headers. All emails include a source timestamp and the GID (i.e., a unique ID) of the source object (i.e., publisher, service client, or service server). Additionally, service requests and responses contain a sequence number, and service responses also contain the GID of the service client that made the original request. This is required so that the service response is matched with the original request and delivered to the corresponding service client.

Also, service responses are email replies to the corresponding service request email! This is achieved using standard headers: the values of the In-Reply-To and References headers of the response email are set to the value of the Message-ID header of the request email [5]. All of this is possible without polluting the email as shown in a normal email client.

Let’s illustrate this with the simple server/client example below.

Message-ID: 
Client-GID: 4074879933
Request-Sequence-Number: 42
Source-Timestamp: 1631797734037229979
In-Reply-To: 
References: 
From: client@rmw-email.com
To: server@rmw-email.com
Cc: 
Bcc: 
Subject: /my_service

this is my request!

Message-ID: 
Client-GID: 4074879933
Request-Sequence-Number: 42
Source-Timestamp: 1631797743507593177
In-Reply-To: 
References: 
From: server@rmw-email.com
To: client@rmw-email.com
Cc: 
Bcc: 
Subject: /my_service

this is a response!

The server will receive the first email above for the request and the client will receive the second email, a reply, for the response.
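To make the relationship between the two emails explicit, here is a minimal sketch of how the response headers could be derived from the request headers. The header names come from the example above; the Headers alias, the make_response_headers function, and the Message-ID value used in any test are hypothetical and not part of the email API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Header name -> header value.
using Headers = std::map<std::string, std::string>;

// Builds the headers of a service response from the headers of the request.
// The response's own Message-ID is not set here, since it is normally added
// by the email server.
Headers make_response_headers(const Headers & request, const std::string & source_timestamp)
{
  assert(request.count("Message-ID") == 1);
  Headers response;
  // Reply threading: both headers point back at the request's Message-ID.
  const std::string & request_id = request.at("Message-ID");
  response["In-Reply-To"] = request_id;
  response["References"] = request_id;
  // Echo the metadata needed to match the response with the original request.
  response["Client-GID"] = request.at("Client-GID");
  response["Request-Sequence-Number"] = request.at("Request-Sequence-Number");
  // Same service name, new timestamp.
  response["Subject"] = request.at("Subject");
  response["Source-Timestamp"] = source_timestamp;
  return response;
}
```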

When a new email is received by the polling thread, it is passed on to email handlers. All subscriptions, service clients, and service servers register with those handlers. Handlers use the email’s headers and topic/service name to figure out what kind of message it is and which object it belongs to.
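As a rough illustration of that dispatching step, the sketch below classifies an email based only on its headers. The exact rules used by email are not spelled out in this post, so the logic here is an assumption: an email with a sequence number is treated as a service request or response, and the presence of a non-empty In-Reply-To value distinguishes the two. All names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

using Headers = std::map<std::string, std::string>;

enum class EmailKind {TopicMessage, ServiceRequest, ServiceResponse};

// True if the header exists and has a non-empty value.
bool has_value(const Headers & headers, const std::string & name)
{
  auto it = headers.find(name);
  return it != headers.end() && !it->second.empty();
}

// Assumed classification rules (not necessarily those of the real handlers):
// - no sequence number: plain pub/sub message
// - sequence number, no In-Reply-To: service request
// - sequence number and In-Reply-To: service response
EmailKind classify(const Headers & headers)
{
  if (!has_value(headers, "Request-Sequence-Number")) {
    return EmailKind::TopicMessage;
  }
  return has_value(headers, "In-Reply-To") ?
         EmailKind::ServiceResponse : EmailKind::ServiceRequest;
}
```

Once the kind is known, the subject (topic or service name) and the Client-GID value would be enough to pick the subscription, service server, or service client that should receive the message.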

Since sending and receiving emails requires an email account, the path to a configuration file with email login credentials and recipients (to/cc/bcc) must be provided through an environment variable. There is also an intraprocess mode. If enabled, email acts as if it was sending emails to itself, bypassing the very last step of sending/receiving emails. This means that it still relies on email headers, so it has to fake the Message-ID header value, since it is normally added by the email server.

The email design document contains a lot more information and even contains fancy UML diagrams! The API documentation can also provide more insight. Along with that, the email_examples package contains many examples.

While certainly time-consuming, this part was rather fun to create from scratch.

Common message representation

Since we have a middleware that strictly deals with strings, we need to be able to convert ROS 2 messages to strings and convert those strings back to messages.

I knew about type support introspection from reading ROS 2 source code and documentation. It provides metadata about a given message type that allows you to parse the fields of a message given only a type-erased pointer to the message (i.e., void *). Note that it would have been possible to generate code that does this for each message type, similar to how a to_yaml() function is generated for each message type. Also, rosidl_runtime_py can convert YAML strings to messages (e.g., for ros2 topic pub), but it’s in Python.

My first idea was to convert the bytes of the messages to Base64 and send that string over email, but that would have been a bit boring. A while later, after I had a basic working middleware, I saw a post on ROS Discourse about a package to convert messages to a YAML representation. It only supported C messages, though, which was a problem since I needed to support both C and C++ messages. I looked over the code to see how it worked and then I looked at the ROS 2 IDL documentation and this document to understand what I needed to change to adapt it to C++. C structures for message arrays make this task simple – at the expense of being more complex to use – since they keep track of size and capacity. C++ containers make it complicated!

For example, how can you figure out the size of a std::vector<T> if you know the size of the contained type, sizeof(T), but only have a void * to it? This is the case for unbounded dynamic arrays of a non-built-in type, like an array of PointField in a PointCloud2 message. The answer is: by ~~Googling it~~ knowing implementation details! A std::vector object simply contains three pointers: begin, end, and end of capacity. Since the elements are stored contiguously, the size is simply (end - begin) / sizeof(T), with the pointer difference taken in bytes! Fun fact: that trick doesn’t work with std::vector<bool>, because its implementation is different, but that’s not a problem here.
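Here is a minimal sketch of that trick. It relies on the assumption that std::vector<T> is laid out as exactly three pointers (begin, end, end of capacity), which holds for common standard library implementations such as libstdc++ and libc++ but is not guaranteed by the C++ standard. The Point struct and the function name are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a non-built-in message type such as PointField.
struct Point
{
  float x;
  float y;
  float z;
};

// Computes the size of a std::vector<T> given only a type-erased pointer to the
// vector object and sizeof(T).
// ASSUMPTION: the vector object consists of three pointers, in this order:
// begin, end, end of capacity. This does not hold for std::vector<bool>.
std::size_t type_erased_vector_size(const void * vector, std::size_t element_size)
{
  assert(element_size != 0);
  const unsigned char * pointers[3];
  static_assert(
    sizeof(pointers) == sizeof(std::vector<int>),
    "std::vector is not laid out as three pointers on this platform");
  // Copy the three pointers out of the vector object.
  std::memcpy(pointers, vector, sizeof(pointers));
  // Elements are contiguous: size = (end - begin) in bytes / sizeof(T).
  return static_cast<std::size_t>(pointers[1] - pointers[0]) / element_size;
}
```

For a std::vector<Point> named points, calling type_erased_vector_size(&points, sizeof(Point)) should give the same value as points.size() on those implementations.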

Update: this workaround is not actually needed. See below.

I forked the package, added support for C++ messages, made the message<–>YAML conversion symmetrical, and refactored the repository/packages a bit. Below is a simple example of a C++ std_msgs/Header message and the corresponding YAML representation.

builtin_interfaces::msg::Time stamp;
stamp.sec = 4;
stamp.nanosec = 20U;
std_msgs::msg::Header msg;
msg.stamp = stamp;
msg.frame_id = "my_frame";

stamp:
  sec: 4
  nanosec: 20
frame_id: my_frame

Even starting from the implementation for C messages, writing the introspection code for C++ messages was a nice challenge. I’m sure some things could be improved and I might have done some things wrong (although it does work!). It could nonetheless serve as another example of how to do type support introspection.

RMW implementation

To tie everything together, we need to implement the rmw interface for email.

Writing the implementation, rmw_email_cpp, was fairly straightforward. I primarily read the rmw API documentation and looked at other implementations, like rmw_cyclonedds_cpp and rmw_fastrtps_cpp. This summary was also pretty useful to get started!

I knew I would have to modify email in order to support the requirements of the rmw interface. However, it wasn’t until I started working on the implementation that I figured out what I needed to add. The main missing feature was wait sets. With early versions of email, users ~~had~~ would have had to manually poll a subscription for new messages. This is of course not how ROS 2 works; it uses wait sets which allow waiting on different events at the same time in a standard way. For example, you can add all subscriptions, service clients, and service servers to the wait set and ask it to wait. Once that’s done, you can check the wait set to get a list of objects that have a new message, response, or request, and deal with them appropriately.
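To give an idea of what a wait set does, here is a minimal sketch built on a mutex and a condition variable. It is not the email or rmw API: the WaitSet class, its methods, and the use of plain indices to identify entities (subscriptions, service clients, service servers) are all simplifications assumed for this example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class WaitSet
{
public:
  // Each entity added to the wait set is identified by its index.
  explicit WaitSet(std::size_t num_entities)
  : ready_(num_entities, false) {}

  // Called by the receiving side (e.g., the polling thread) when the entity
  // at `index` has a new message, request, or response.
  void notify(std::size_t index)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ready_.at(index) = true;
    }
    cv_.notify_all();
  }

  // Blocks until at least one entity is ready or the timeout expires.
  // Returns the indices of the ready entities and clears their ready flags.
  std::vector<std::size_t> wait(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait_for(lock, timeout, [this] {return any_ready();});
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < ready_.size(); ++i) {
      if (ready_[i]) {
        result.push_back(i);
        ready_[i] = false;
      }
    }
    return result;
  }

private:
  bool any_ready() const
  {
    for (bool ready : ready_) {
      if (ready) {
        return true;
      }
    }
    return false;
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<bool> ready_;
};
```

The caller then takes the new data from each returned entity, instead of polling each entity individually.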

Some of those features weren’t necessary for a simple email-based string pub/sub/service “middleware,” but adding them definitely improved it and turned it into a sort-of middleware. Obviously, the rmw interface has many other features, like quality of service (for real applications) and introspection (e.g., to support ros2 topic list). Those are not currently supported by rmw_email_cpp, but PRs are welcome!

This layer is where I saw the downsides of the ROS 2 abstractions. Many things are duplicated or very similar: APIs, data structures, arguments validation, etc. Calls often go from rclcpp to rcl to rmw and, finally, to the middleware. While each layer does have its own responsibilities – otherwise we wouldn’t have all those layers – a lot of the actual work is done by the middleware. Furthermore, since DDS has always been the main – and pretty much only – middleware standard, parts of the interface, like the writer GUID field in the request ID struct, are rather DDS-specific.

However, I also saw the clear benefits of these abstractions and interfaces. I didn’t have to write too much code to plug email into ROS 2; I only had to implement an interface. A few bugs aside, after implementing the main rmw functions, running ROS 2 over email just… worked!

Demo

After all of that, it’s time for a demo! First, let’s see what our email inbox looks like when running the classic talker/listener demo.

$ EMAIL_CONFIG_FILE=talker.email.yml \
  RMW_IMPLEMENTATION=rmw_email_cpp \
  ros2 run demo_nodes_cpp talker

Command to run the talker node with rmw_email_cpp and resulting emails on the /chatter topic.

Since messages only go in one direction in the above example, let’s see a client/server example using the add_two_ints service demo.

$ EMAIL_CONFIG_FILE=client.email.yml \
  RMW_IMPLEMENTATION=rmw_email_cpp \
  ros2 run demo_nodes_cpp add_two_ints_client
Result of add_two_ints: 5

$ EMAIL_CONFIG_FILE=server.email.yml \
  RMW_IMPLEMENTATION=rmw_email_cpp \
  ros2 run demo_nodes_cpp add_two_ints_server
Incoming request
a: 2 b: 3

Commands and emails for the /add_two_ints service request and response.

As mentioned previously, we can see the reply to the request email in this example.

Performance

We can use performance_test to measure pub/sub latencies and compare them to another RMW implementation. The current default implementation is rmw_cyclonedds_cpp, so let’s compare to that.

Latency comparison between rmw_email_cpp and rmw_cyclonedds_cpp.

With a mean latency of around 6 seconds over the one-minute experiment, rmw_email_cpp is clearly worse than rmw_cyclonedds_cpp. Approximately 15 332 times worse. Not that we expected anything else, obviously!

The results are different if we run the experiments on a real-time system: PREEMPT_RT-patched Ubuntu Server 20.04.2 (5.4.3-rt1), Intel i7-3770 4-core CPU @ 3.40 GHz (SMT disabled), 8 GB RAM, and SCHED_FIFO policy with the highest priority (99).

Latency comparison between rmw_email_cpp and rmw_cyclonedds_cpp on a real-time system.

As expected, the latencies for rmw_cyclonedds_cpp are much lower on a real-time system. However, the latencies for rmw_email_cpp get worse! The data also stops halfway through the experiment because performance_test throws an exception if messages are not received in the order that they are sent. This assertion could be removed from the performance_test code to compare two complete experiments, but, surely, it’s not a good sign when a middleware shuffles messages!

We could posit that most of the time is spent between the two libcurl calls to send and receive emails, i.e., server-side. To explore this hypothesis, we can enable intraprocess mode for email and run the experiments again.

Latency comparison between rmw_email_cpp (intraprocess) and rmw_cyclonedds_cpp.

The latencies are then much more comparable. They’re now only about 12 times higher, although it would be worse if we did a fair comparison using Cyclone DDS with iceoryx for shared memory inter-process communications. rmw_email_cpp’s message<–>YAML conversion is without a doubt no match for rmw_cyclonedds_cpp’s (de)serialization, and the liberal use of std::string objects internally most likely doesn’t help. It would definitely be interesting to investigate this.

Tracing

Just in case low-overhead instrumentation is needed to investigate real-time performance issues with rmw_email, I added LTTng tracepoints. rmw_email_cpp uses the ros2_tracing instrumentation & tracepoints for the rmw layer. email has its own LTTng tracepoints to collect lower-level information in order to correlate it with the ROS 2 trace data.

Limitations and future work

Unsurprisingly, there are many limitations, with the main ones being high message latency and low pub/sub rate. Also, as mentioned previously, messages might be received in the wrong order. This could be due to Gmail’s infrastructure, since having sub-millisecond latencies and guaranteeing that emails are always in the right order are probably not high priorities. Tackling those limitations might be possible, but it could be argued that it’s not worth the effort.

Nonetheless, many other paths could be explored. Type support introspection could be replaced with static type support (i.e., generated code) to try to lower latencies. Also, although we could compare this to DDS domain IDs, configuration files currently impose a direction on a whole process’ communications unless all emails are sent to & from the same address. Config files could be improved to allow users to specify per-topic or per-namespace email recipients. Furthermore, the email standards and infrastructure could be leveraged even more to get interesting features. For example, mailing lists could be used as a form of configurable “multicast.” Finally, message filtering could be achieved by setting up rules in an email client to forward emails based on the messages’ content.

Conclusion

In conclusion, I presented rmw_email, which contains a standalone middleware as well as a ROS 2 middleware implementation to exchange messages using emails.

ROS 2’s abstractions lead to additional complexity and overhead, but they’re also quite powerful – now you can worry about getting ~~SLAM~~ SPAM into your Nav2 stack! Ultimately, it’s an interesting debate between users who benefit from and need that abstraction, and those who prefer to break it a little bit to have direct access to the underlying middleware. There might be a better middle ground to be found or even an alternative that allows both to coexist. Or perhaps the status quo is good for most users, and those who need a more specialized version of ROS 2 can just fork it, as some have done. Even if rmw_email will never be used in production (or at all), I hope that it can provide some insight and stimulate discussions on this subject.

Aside from that, I think this project had real benefits for the ROS 2 community, both directly and indirectly. There’s of course the separate project to do symmetrical conversions between ROS 2 messages and a YAML representation. Therefore, email stuff aside, that part is probably a cool contribution to the ROS 2 community! I also ~~got distracted~~ embraced the open source philosophy along the way. I started using the ROS 2 tooling working group’s GitHub actions, setup-ros and action-ros-ci, for rmw_email. I contributed some improvements and new features that I needed. Additionally, I made a number of contributions to ROS 2 core packages and ROS 2 dependencies here and there.

Time will tell whether or not this was more useful than my previous blog post. However, even if I received multiple facepalm emojis 🤦‍♂️ from a friend after sharing my “ROS 2… over email” idea, I would say that, after putting 300+ hours into this project over more than a year, I’m extremely satisfied with the outcome!

Time tracking result for rmw_email.

Update (2022-05-07)

It turns out that there is no need to use hacky ways to get the size of an array of a non-built-in type: the type support introspection tools include functions that provide this information for a given type. See: github.com/osrf/dynamic_message_introspection/pull/16.

Update (2025-10-26)

I wrote an rmw implementation guide for the ROS 2 documentation, which features rmw_email.

References

[1] Z. Jiang, Y. Gong, J. Zhai, Y.-P. Wang, W. Liu, H. Wu, and J. Jin, “Message passing optimization in robot operating system,” International Journal of Parallel Programming, vol. 48, no. 1, pp. 119–136, 2020.
[2] T. Kronauer, J. Pohlmann, M. Matthe, T. Smejkal, and G. Fettweis, “Latency overhead of ros2 for modular time-critical systems,” arXiv preprint arXiv:2101.02074, 2021.
[3] RFC 3501, section 2.3.1.1
[4] RFC 3501, section 6.3.2
[5] RFC 5322, page 26

Message Flow Analysis for ROS Through Tracing

2019-06-06T00:00:00+00:00

Outcome of this project: message flow analysis for ROS using Trace Compass.

tl;dr Tracing is a very powerful tool for software development, especially in robotics. Using Trace Compass and existing ROS instrumentation, I built an analysis that can show the path of a message through a ROS stack. This work can serve as a proof-of-concept for future endeavours.

Introduction
Message flow analysis
1. Motive and goal
2. Approach
Results/example
Conclusion
Future work
Links
Acknowledgements
References

Introduction

Tracing can be invaluable when diagnosing complex systems, especially when problems are hard to reproduce using traditional tools. Robotics software development can benefit from tracing and the low-overhead analyses it can provide.

The overall goal of this project was to first look into where ROS could benefit from such analyses, and then work towards that.

This first section will introduce both ROS and Trace Compass for people familiar with only one (or neither) of them. I will also talk about robotics and tracing in general. The second and third sections will present my work along with an example.

Finally, I will conclude and briefly talk about possible future work related to this project.

Context

ROS

Robot Operating System (ROS) is an open-source framework and a set of libraries and tools for robotics software development. Although it has “Operating System” in its name, it’s not really an OS!

Its main feature is probably the implementation of the publish-subscribe pattern. Nodes, which are modular “processes” designed to accomplish a specific task, can publish on, or subscribe to, one or more topics to send/receive messages. By launching multiple nodes (either from your own package or from a package someone else made), you can accomplish complex tasks!

Trace Compass

Eclipse Trace Compass is an open source trace viewer and analysis framework designed to solve performance issues. It supports many trace formats, and provides numerous useful analyses & views out of the box, such as the kernel resources and control flow views. Users can also use its API to implement their own analyses, which is what I did!

Topic

My initial objective was to look into where ROS development could benefit from tracing and subsequent analyses, and try to help with that.

Early on in this project, I considered targeting ROS 2. However, as it was still relatively new and less mature than ROS 1, I went with the latter.

Literature review & existing solutions

A presentation at ROSCon 2017, titled “Determinism in ROS – or when things break /sometimes/ and how to fix it…” exposed how ROS’ design does not guarantee determinism in execution. This is actually what piqued my curiosity at first, since I was a ROS user and had started to learn about tracing, and it eventually led to this project.

In this case, lack of determinism can be seen as merely a symptom. This led me to search for possible causes, one of which might be network/communications [1, 2, 3, 4, 5]. For latencies, which might lead to lack of determinism, critical path analyses can help identify the actual root cause [6, 7, 8, 9].

As for tools, many are distributed along with ROS to help users and developers. rqt_graph can create a graph of publisher/subscriber relations between nodes. It can also show publishing rates. Similarly, the ROS CLI tools (e.g. rostopic) can help debug basic pub/sub issues.

Other tools are available. The diagnostics package can collect diagnostics data for analysis. The performance_test package for ROS 2 can test the performance of a communications middleware.

However, none of the tools or solutions mentioned above can provide a view of the actual execution. Besides, the performance overhead of using higher-level log aggregators (e.g. as a ROS node) is non-negligible.

The tracetools package uses LTTng to instrument ROS for tracing. However, it does not offer analysis tools.

Trace Compass offers a control flow view, showing the state of threads over time. By selecting one particular thread, a user can launch a critical path analysis.

Message flow analysis

Motive and goal

As mentioned previously, time is one of the main concerns for robotics applications. Critical path analyses can make timing anomalies, such as unexpected latencies, stand out and help developers find the root cause.

My goal was therefore to make a ROS-specific analysis along these lines. I chose to build what I call a “message flow analysis.” Using tracetools and the ROS instrumentation, we can figure out which queues a message went through, how much time it spent in each one, and if it ended up being dropped. Also, by linking a message received by a subscriber to the next corresponding message that gets published by the same node, we can build a model of the message processing pipeline.

Approach

Prerequisites

To build this analysis, some information is needed on:

connections between publishers and subscribers
subscriber/publisher queue states
network packet exchanges

We first need to know about connections between nodes. The ROS instrumentation includes a tracepoint for new connections. It includes the address and port of the host and the destination, with an address:port pair corresponding to a specific publisher or subscription.

We also need to build a model of the publisher and subscriber queues. To achieve this, we can leverage the relevant tracepoints. These include a tracepoint for when a message is added to the queue, when it’s dropped from the queue, and when it leaves the queue (either sent over the network to the subscriber, or handed over to a callback). We can therefore visualize the state of a queue over time!

Finally, we need information on network packet exchanges. Although this isn’t really necessary for this kind of analysis, it allows us to reliably link a message that gets published to a message that gets received by the subscriber. This is good when building a robust analysis, and it paves the way for a future critical path analysis based on this message flow analysis.

This requires us to trace both userspace (ROS) and kernel. Fortunately, we only have to enable 2 kernel events for this. It saves us a lot of disk space, since enabling many events can generate multiple gigabytes of trace data, even when tracing for only a few seconds! Also, as the rate of generated events increases, the overhead also increases. More resources have to be allocated to the buffers to properly process those events, otherwise they can get discarded or overwritten.

Method

In this sub-section, I’ll quickly go over some implementation details and explain how the analysis works!

Let’s start with some background on Trace Compass. It allows you to build analyses that depend on trace events, the output of other analyses, or both. Some analyses are used to create views to directly display processed data. However, we can use them as models that can be queried by other models or analyses. This abstraction was very useful when designing my final analysis and its dependencies.

The first analysis is the connections model. Using the new_connection events from tracetools, it creates a list of connections between two nodes on a certain topic and includes information about the endpoints.

Some new_connection events. Highlighted are two events belonging to the same connection, on opposite endpoints.
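The idea behind the connections model can be sketched in a few lines: two new_connection events belong to the same connection if the local address:port pair of one is the remote pair of the other, and vice versa. The actual analysis is a Trace Compass (Java) plugin; the C++ sketch below is only an illustration, and the struct and field names are assumptions rather than the actual tracepoint field names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Endpoint
{
  std::string address;
  std::uint16_t port;
};

bool operator==(const Endpoint & lhs, const Endpoint & rhs)
{
  return lhs.address == rhs.address && lhs.port == rhs.port;
}

// Simplified view of a new_connection event (hypothetical field names).
struct NewConnectionEvent
{
  std::string node;
  std::string topic;
  Endpoint local;
  Endpoint remote;
};

// Returns pairs of indices of events that are the two endpoints of the same
// connection: same topic, and mirrored local/remote address:port pairs.
std::vector<std::pair<std::size_t, std::size_t>> match_connections(
  const std::vector<NewConnectionEvent> & events)
{
  std::vector<std::pair<std::size_t, std::size_t>> connections;
  for (std::size_t i = 0; i < events.size(); ++i) {
    for (std::size_t j = i + 1; j < events.size(); ++j) {
      const bool mirrored =
        events[i].local == events[j].remote && events[i].remote == events[j].local;
      if (mirrored && events[i].topic == events[j].topic) {
        connections.emplace_back(i, j);
      }
    }
  }
  return connections;
}
```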

Another analysis is created to model queues over time. This uses three tracepoints, also from tracetools:

publisher_message_queued or subscription_message_queued, when a message is added to the queue
subscriber_link_message_write or subscriber_callback_start, when a message is successfully removed from the queue (i.e. sent over the network or processed by a callback)
subscriber_link_message_dropped or subscription_message_dropped, when a message is dropped from the queue

These events always include a reference to the associated message, so it can help validate the model.
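As an illustration of how these three kinds of events can be turned into a queue model, the sketch below replays a list of events and records the contents of the queue after each one. It is a simplified C++ stand-in for the actual Trace Compass analysis; the types and names are assumptions, and the events are assumed to be sorted by timestamp and to belong to a single queue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Queued: publisher_message_queued / subscription_message_queued
// Dequeued: subscriber_link_message_write / subscriber_callback_start
// Dropped: subscriber_link_message_dropped / subscription_message_dropped
enum class QueueEventType {Queued, Dequeued, Dropped};

struct QueueEvent
{
  std::int64_t timestamp;
  QueueEventType type;
  std::uint64_t message_ref;  // reference to the associated message
};

// Contents of the queue right after the event at `timestamp`.
struct QueueSnapshot
{
  std::int64_t timestamp;
  std::vector<std::uint64_t> messages;
};

std::vector<QueueSnapshot> replay_queue(const std::vector<QueueEvent> & events)
{
  std::deque<std::uint64_t> queue;
  std::vector<QueueSnapshot> history;
  for (const auto & event : events) {
    if (event.type == QueueEventType::Queued) {
      queue.push_back(event.message_ref);
    } else {
      // The message reference lets us validate the model: a message that is
      // dequeued or dropped must currently be in the queue.
      auto it = std::find(queue.begin(), queue.end(), event.message_ref);
      assert(it != queue.end());
      queue.erase(it);
    }
    history.push_back(
      QueueSnapshot{event.timestamp, std::vector<std::uint64_t>(queue.begin(), queue.end())});
  }
  return history;
}
```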

View showing the state of a publisher queue. At this timestamp (thin blue vertical line), the first message is removed from the queue and sent to the subscriber.

The third analysis is for network packet exchange. This is the only analysis that needs kernel events: net_dev_queue for packet queuing and netif_receive_skb for packet reception. Fortunately, Trace Compass already does this! It matches sent and received packets. I only had to filter out SYN/FIN/ACK packets and those which were not associated with a known ROS connection. Then, from a node name, a topic name, and a timestamp at which a message was published, we can figure out when it went through the network, and link it to a message received by the subscription.
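The linking step can be sketched as follows, under a simplifying assumption: take the first matched packet queued at or after the time the message was written by the publisher, then take the first subscription queue event at or after that packet's reception. This C++ sketch is not the actual Trace Compass code; the types and names are hypothetical, and both inputs are assumed to be sorted and already filtered down to a single ROS connection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Timestamp = std::int64_t;

// A sent/received packet pair, as matched by the network analysis.
struct PacketMatch
{
  Timestamp sent;      // net_dev_queue on the publisher side
  Timestamp received;  // netif_receive_skb on the subscriber side
};

// Given the timestamp at which the publisher wrote a message to the network,
// returns the timestamp at which the message was queued by the subscription,
// or nothing if no link could be made.
std::optional<Timestamp> link_message(
  Timestamp write_time,
  const std::vector<PacketMatch> & packets,                 // sorted by `sent`
  const std::vector<Timestamp> & subscription_queue_times)  // sorted
{
  assert(std::is_sorted(subscription_queue_times.begin(), subscription_queue_times.end()));
  for (const auto & packet : packets) {
    if (packet.sent >= write_time) {
      // First subscription queue event at or after the packet's reception.
      auto it = std::lower_bound(
        subscription_queue_times.begin(), subscription_queue_times.end(), packet.received);
      if (it == subscription_queue_times.end()) {
        return std::nullopt;
      }
      return *it;
    }
  }
  return std::nullopt;
}
```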

Finally, we can put everything together! The analysis uses the above analyses to reconstruct and display a message’s path across queues, callbacks, and the network!

Results/example

To illustrate this, I wrote a simple “pipeline” test case. A first node periodically publishes messages on a topic. A second node does some processing and re-publishes them on another topic. A third node does the same, and a fourth and last node prints the contents of a message.

Graph generated using rqt_graph.

From the view showing queues over time, the user can select an individual message by clicking on it, then hitting the Follow the selected message button.

Message selection. Note that this is the first node in the pipeline, and that, at this moment, the other nodes are not active. Therefore, since latching is enabled, the publisher's queue only keeps the most recent message.

The message flow analysis – and all its dependencies – are run. The output can then be viewed in the corresponding view.

Analysis result.

There it is! Some sections are hard to make out, so we can zoom in.

Zoomed in.

We can see three main states: publisher queue, subscriber queue, and subscriber callback. Of course, the transition represented by the darker arrows between the publisher queue and subscriber queue states includes the network transmission.

However, going back to the original perspective, two states clearly stand out. The first (green) state represents the time spent in the first node’s publisher queue, waiting for other nodes to be ready in order to transmit the message. The biggest state, in orange, represents the time spent in a callback inside the third node. We can hover over the state to get more info.

Hovering to get more information.

We can see that the message spent around 100 milliseconds in the callback before the next related message was sent to the following publisher queue. In this case, it can be explained by looking at the node’s source code!

// Subscription callback: processes the received payload and re-publishes it
void callbackFunction(const std_msgs::String::ConstPtr& msg) {
    std_msgs::String next_msg;
    int payload = get_payload(msg->data);
    int new_payload = payload + pow(payload, 2);
    if (node_i == 2) {
        // Sleep for 0.1 s, i.e., the ~100 ms seen in the callback state
        ros::Duration(0.1).sleep(); // <---------
    }
    next_msg.data = MSG_CONTENT_PREFIX + std::to_string(new_payload);
    pub.publish(next_msg);
}

Conclusion

In conclusion, tracing is a very powerful tool for robotics software development. Lack of determinism was identified as a symptom, and timing was chosen as an analysis topic.

Using existing ROS instrumentation, I worked towards providing insight into the timewise execution of a ROS software stack. The result, a Trace Compass analysis for ROS, can serve as a proof-of-concept for future endeavours.

Future work

Many elements could be improved, and many new paths could be explored.

First and foremost, besides the lack of UDP support and of explicit namespace support, the current implementation has many limitations and simplifications, since most of the traces I used were taken from executions of (very synthetic) test cases.

To link a message between two endpoints, the analysis selects the first corresponding TCP packet that is queued (net_dev_queue) after the subscriber_link_message_write event, and then selects the next subscription_message_queued event after the matching netif_receive_skb event. This assumption about the sequence of events might not always be valid. Also, it has not been tested with messages bigger than the maximum payload size of a TCP packet.
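
The linking heuristic described above can be sketched roughly as follows. This is a simplified illustration, not the actual Trace Compass analysis code: the TraceEvent and MessageLink structs, the packet_id field (a stand-in for whatever TCP connection and sequence information is used to match packets), and the function names are all assumptions. Events are assumed to be sorted by timestamp.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified trace event; real events carry many more fields.
struct TraceEvent {
    std::string name;
    std::int64_t timestamp_ns;
    int packet_id;  // assumed packet identifier; -1 for non-network events
};

// Indices of the four events that link a published message to its reception.
struct MessageLink {
    std::size_t write_idx;
    std::size_t net_queue_idx;
    std::size_t net_receive_idx;
    std::size_t sub_queued_idx;
};

// Finds the first event at or after `start` with the given name and,
// if provided, the given packet identifier.
std::optional<std::size_t> find_next(
    const std::vector<TraceEvent>& events, std::size_t start,
    const std::string& name, std::optional<int> packet_id = std::nullopt) {
    for (std::size_t i = start; i < events.size(); ++i) {
        if (events[i].name == name &&
            (!packet_id || events[i].packet_id == *packet_id)) {
            return i;
        }
    }
    return std::nullopt;
}

// Follows write -> net_dev_queue -> netif_receive_skb -> message queued.
std::optional<MessageLink> link_message(
    const std::vector<TraceEvent>& events, std::size_t write_idx) {
    if (write_idx >= events.size() ||
        events[write_idx].name != "subscriber_link_message_write") {
        return std::nullopt;
    }
    // First packet queued after the write
    auto queued = find_next(events, write_idx + 1, "net_dev_queue");
    if (!queued) return std::nullopt;
    // Reception of that same packet
    auto received = find_next(events, *queued + 1, "netif_receive_skb",
                              events[*queued].packet_id);
    if (!received) return std::nullopt;
    // Next message queued on the subscriber side
    auto sub = find_next(events, *received + 1, "subscription_message_queued");
    if (!sub) return std::nullopt;
    return MessageLink{write_idx, *queued, *received, *sub};
}
```

The sketch makes the fragility visible: each step simply takes the next matching event, so interleaved messages or a message split across several packets could be linked incorrectly.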

Furthermore, callbacks were considered as the only possible link between two messages (received & published). Nodes might deal with callbacks and message publishing separately, e.g. when publishing at a fixed rate independently of the received messages. In the same sense, message flows do not have to be linear! In other words, one incoming message can turn into multiple outgoing messages.
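
One way to picture non-linear flows is to store them as a graph instead of a chain. The sketch below is only an illustration of that idea, not something the current analysis does: MessageId, FlowGraph, and collect_flow are hypothetical names, and the graph is assumed to be acyclic, since a message can only lead to messages published after it.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical message identifier; the analysis does not define one.
using MessageId = int;

// Each message maps to the messages published as a result of it (zero or more).
using FlowGraph = std::unordered_map<MessageId, std::vector<MessageId>>;

// Collects every message reachable from `root`, in depth-first order.
// Assumes the graph has no cycles.
std::vector<MessageId> collect_flow(const FlowGraph& graph, MessageId root) {
    std::vector<MessageId> visited;
    std::vector<MessageId> stack{root};
    while (!stack.empty()) {
        const MessageId current = stack.back();
        stack.pop_back();
        visited.push_back(current);
        const auto it = graph.find(current);
        if (it == graph.end()) {
            continue;  // end of this branch: no outgoing messages
        }
        // Push children in reverse so they are visited in insertion order
        for (auto child = it->second.rbegin(); child != it->second.rend(); ++child) {
            stack.push_back(*child);
        }
    }
    return visited;
}
```

With a linear flow, every message has at most one child and this degenerates into the chain the analysis currently follows.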

Also, Trace Compass can easily aggregate traces from multiple hosts. This is very relevant for robotics systems, and thus would be a great avenue to explore.

Finally, as mentioned previously, the message flow analysis could be extended to provide a critical path analysis. This would provide more information about what actually happened while a message was waiting in a queue.

Acknowledgements

This project was done as part of the UPIR program for undergraduate research at Polytechnique Montréal, and was supervised by Michel Dagenais. I thank him for his great input!

I would also like to thank Matthew Khouzam and Geneviève Bastien for their help and advice, and Ingo Lütkebohle for his commentary on the need for tracing in robotics.

References

[1] C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Real-time Linux communications: an evaluation of the Linux communication stack for real-time robotic applications,” arXiv:1808.10821 [cs], Aug. 2018.
[2] C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, I. M. Goenaga, L. A. Kirschgens, and V. M. Vilches, “Time Synchronization in modular collaborative robots,” arXiv:1809.07295 [cs], Sep. 2018.
[3] C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Time-Sensitive Networking for robotics,” arXiv:1804.07643 [cs], Apr. 2018.
[4] C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Towards a distributed and real-time framework for robots: Evaluation of ROS 2.0 communications for real-time robotic applications,” arXiv:1809.02595 [cs], Sep. 2018.
[5] Y.-P. Wang, W. Tan, X.-Q. Hu, D. Manocha, and S.-M. Hu, “TZC: Efficient Inter-Process Communication for Robotics Middleware with Partial Serialization,” arXiv:1810.00556 [cs], Oct. 2018.
[6] F. Giraldeau and M. Dagenais, “Wait Analysis of Distributed Systems Using Kernel Tracing,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 8, pp. 2450–2461, Aug. 2016.
[7] F. Doray and M. Dagenais, “Diagnosing Performance Variations by Comparing Multi-Level Execution Traces,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–1, 2016.
[8] P.-M. Fournier and M. R. Dagenais, “Analyzing blocking to debug performance problems on multi-core systems,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, p. 77, Apr. 2010.
[9] C.-Q. Yang and B. P. Miller, “Critical path analysis for the execution of parallel and distributed programs,” in [1988] Proceedings. The 8th International Conference on Distributed Computing Systems, San Jose, CA, USA, 1988, pp. 366–373.

ROSCon 2018

2019-05-12T00:00:00+00:00

It’s been a while since ROSCon 2018, but I thought I’d (finally) write some thoughts down. Hopefully without too much rambling.

Context

I’ve been around ROS for a little while now. At first it was from a distance. I only knew the basic concepts, but I still had to work around it. However, a couple years ago I slowly got more and more interested and actually started using it. I thought it was a very powerful concept, and its community and open-sourceness really made me want to contribute.

Why

Why not?

Objectives

I had never been to a tech conference (by myself or for myself), and I didn’t really know what to expect – I didn’t know exactly what I wanted from it.

However, my one objective was to talk to other attendees and learn something from them. Sure, I’m still a student and don’t have as much experience as them, but I still have something to offer.

One thing I was sure of is that I’d learn a lot from the presentations. I was really looking forward to the presentations on ROS 2, since I hadn’t really tried or read a lot about it. ROS 1 is still way more mature, but as companies are slowly starting to move to ROS 2, especially for new products or applications, this is the best moment to get started!

The conference

The presentations were very interesting! For the topics I was familiar with, I was eager to see what other people were doing. For the others, I was simply curious. Here are a few highlights:

ROS 2-specific presentations, including a how-to on getting involved in ROS 2 development and a demo of the main features.
“Lessons learned building a self-driving car on ROS” and “ROS 2 on Autonomous Driving Vehicles,” which were interesting applications of ROS to self-driving. They also mentioned real problems, like determinism and real time.
“Integrating ROS and ROS2 on mixed-critical robotic systems based on embedded heterogeneous platforms” and “Towards ROS 2 microcontroller meta cross-compilation,” which showed that plans are for ROS 2 to be used for much more than what ROS 1 was generally used for, which is of course very exciting.
The performance_test package for testing middleware performance for ROS 2, from Apex.AI. Since I’m interested in the tooling side of robotics software development (e.g. tracing & analysis), this was quite a nice surprise!
and other cool presentations, such as “Deterministic reversible debugging of ROS nodes with Mozilla rr,” “Hermetic Robot Deployment Using Multi-Stage Dockers,” and “Deterministic, asynchronous message driven task execution with ROS.”

Between presentations and during the lunches, I talked with other attendees, including both people from academia and the industry!

The biggest event of the conference was obviously the evening reception on the first day ~~because of the beer~~. I motivated myself to go talk to people, and that led me to have some nice conversations and meet some very interesting people. I learned about a couple projects that I then started to follow, like the Open Vision Computer.

What I learned

Looking back, it was a great experience and I learned a lot, and not only from a technical point of view.

I learned that you always have something to offer, whether it’s a different point of view, a different background, or simply the cool stuff you’ve done. No matter how little experience you think you (might) have. For example, people I talked to had never heard of the aerial robotics competition I spent 3-4 years competing in, which involved real challenges that I was able to discuss! Even if it wasn’t in a strictly professional setting, those challenges are still similar to the ones people in the industry can face. In this case, I mentioned that I worked on autonomous obstacle avoidance in a sterile environment, without external position sensors.

Conclusion

Overall, it was a very nice experience. It made me look forward to the future.

It made me want to get involved in ROS development. I have made (very) small contributions to ROS 1 since then, and I look forward to doing more, especially for ROS 2! Now I’m really hoping I can attend ROSCon 2019 in Macau!

All in all, you can’t learn anything if you never try, so jump in!
