<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://christophebedard.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://christophebedard.com/" rel="alternate" type="text/html" /><updated>2026-03-14T21:53:37+00:00</updated><id>https://christophebedard.com/feed.xml</id><title type="html">Christophe Bédard</title><subtitle>Christophe Bédard&apos;s blog.</subtitle><entry><title type="html">ROS 2 Over Email: rmw_email, an Actual Working RMW Implementation</title><link href="https://christophebedard.com/ros-2-over-email/" rel="alternate" type="text/html" title="ROS 2 Over Email: rmw_email, an Actual Working RMW Implementation" /><published>2021-09-19T00:00:00+00:00</published><updated>2021-09-19T00:00:00+00:00</updated><id>https://christophebedard.com/ros-2-over-email</id><content type="html" xml:base="https://christophebedard.com/ros-2-over-email/"><![CDATA[<figure>
    <a href="/assets/img/rmw-email/demo_service.png"><img src="/assets/img/rmw-email/demo_service.png" alt="service request email from client@rmw-email.com and response reply email from server@rmw-email.com" style="border: 2px solid #383838;" /></a>
    <figcaption style="text-align: center;">Service request and response using rmw_email. Messages are exchanged as strings using emails.</figcaption>
</figure>

<p style="text-align: center;"><b>tl;dr</b> ROS 2's architecture allows for using almost any middleware, as long as it's done through RMW, the middleware interface. <a href="https://github.com/christophebedard/rmw_email">rmw_email</a> contains a middleware that sends &amp; receives strings over email and an RMW implementation that allows ROS 2 to use this middleware to exchange messages. While it's certainly not a production-level middleware, it provides interesting insight into the pros and cons of ROS 2's architecture. Abstractions definitely have a cost, but they're also quite powerful: they allow ROS 2 to run over email without needing to modify it at all!</p>

<ol id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
  <li><a href="#overview" id="markdown-toc-overview">Overview</a></li>
  <li><a href="#the-components-of-a-workingish-rmw-implementation" id="markdown-toc-the-components-of-a-workingish-rmw-implementation">The components of a working(ish) RMW implementation</a>    <ol>
      <li><a href="#middleware" id="markdown-toc-middleware">Middleware</a></li>
      <li><a href="#common-message-representation" id="markdown-toc-common-message-representation">Common message representation</a></li>
      <li><a href="#rmw-implementation" id="markdown-toc-rmw-implementation">RMW implementation</a></li>
    </ol>
  </li>
  <li><a href="#demo" id="markdown-toc-demo">Demo</a></li>
  <li><a href="#performance" id="markdown-toc-performance">Performance</a></li>
  <li><a href="#tracing" id="markdown-toc-tracing">Tracing</a></li>
  <li><a href="#limitations-and-future-work" id="markdown-toc-limitations-and-future-work">Limitations and future work</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
  <li><a href="#links" id="markdown-toc-links">Links</a></li>
  <li><a href="#update-2022-05-07" id="markdown-toc-update-2022-05-07">Update (2022-05-07)</a></li>
  <li><a href="#update-2025-10-26" id="markdown-toc-update-2025-10-26">Update (2025-10-26)</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ol>

<h2 id="introduction">Introduction</h2>

<p>ROS 2’s architecture and underlying middleware are vastly different from ROS 1’s, in part because ROS 2 is targeted at real-time distributed applications.
The middleware interface, <code class="language-plaintext highlighter-rouge">rmw</code>, is an abstraction that allows ROS 2 to support multiple different middleware implementations.
On top of that, there’s <code class="language-plaintext highlighter-rouge">rcl</code>, which provides a common implementation in C to support client libraries for any language.
Finally, there’s <code class="language-plaintext highlighter-rouge">rclcpp</code> and <code class="language-plaintext highlighter-rouge">rclpy</code>, the C++ and Python client libraries, respectively.
While there definitely are downsides to these abstractions and interfaces, this architecture is very powerful.</p>

<p>In this post, I’ll present <a href="https://github.com/christophebedard/rmw_email">rmw_email</a>, which allows ROS 2 to exchange messages using emails.
I’ll start by explaining the motivation behind rmw_email.
Then I’ll provide an overview and explain how each component works.
After that, I’ll show a couple of demos and present the results of performance experiments.
Finally, I’ll briefly discuss limitations and possible future work, and then I’ll conclude.</p>

<h2 id="motivation">Motivation</h2>

<p>The main motivation for rmw_email was my <a href="/about/#education">master’s</a>.
I got the high-level idea for this in June of 2020.
At that point, I had been working on and around ROS 2 for about a year.
I had interacted a lot more with the higher levels of the ROS 2 architecture (<code class="language-plaintext highlighter-rouge">rclcpp</code>, <code class="language-plaintext highlighter-rouge">rcl</code>) while working on <a href="https://gitlab.com/ros-tracing/ros2_tracing"><code class="language-plaintext highlighter-rouge">ros2_tracing</code></a>,
so I wanted to dig deeper into the middleware level and below (<code class="language-plaintext highlighter-rouge">rmw</code>, DDS/other middleware).</p>

<p>I was also seeing some interesting discussions involving middlewares and the middleware interface.
Developers of real-time applications (e.g., automotive) wanted to expose middleware features that were rather DDS-specific through <code class="language-plaintext highlighter-rouge">rmw</code>.
This could possibly make the abstraction “leak,” thereby breaking it at least slightly.
Of course, it’s a tradeoff: preserving the abstraction keeps its benefits, while exposing advanced middleware features makes ROS 2 more powerful and could reduce both the cost of the abstraction and the general overhead of ROS 2.
[<a name="1t" href="#1">1</a>, <a name="2t" href="#2">2</a>]</p>

<p>Also, at the time, there weren’t many non-DDS <code class="language-plaintext highlighter-rouge">rmw</code> implementations;
some do exist, but none are currently listed under the <a href="https://www.ros.org/reps/rep-2000.html#galactic-geochelone-may-2021-november-2022">latest ROS 2 distro in REP 2000</a>.
Perhaps adding another one – as absurd as it may be – could help diversify the ROS 2 middleware implementations and illustrate how useful the current abstraction can be.
This was even <a href="https://github.com/ros2/ros2_documentation/pull/964/files#diff-c29e698395f4491092719353da00819ebb5f2c311e3b74e540eb1c6a5af0bcaaR154">on the ROS 2 roadmap</a> at some point.</p>

<p>And, also, why not?
We just <em>can</em>!
Sure, DDS is a proven standard to exchange messages, but so is email!
Besides, I don’t own a fax machine.</p>

<h2 id="overview">Overview</h2>

<p>rmw_email, the repository/project, consists mainly of two packages: <code class="language-plaintext highlighter-rouge">email</code>, the middleware, and <code class="language-plaintext highlighter-rouge">rmw_email_cpp</code>, the RMW implementation.</p>

<p><code class="language-plaintext highlighter-rouge">email</code> is a simple middleware with the publisher/subscriber pattern to send and receive string messages on topics.
It also natively supports the service client/server (RPC) pattern.
As its name suggests, emails are used to send messages: the topic or service name is the email subject and the message content is the email body.</p>

<p><code class="language-plaintext highlighter-rouge">rmw_email_cpp</code> is an implementation of the ROS 2 middleware interface, <code class="language-plaintext highlighter-rouge">rmw</code>, using <code class="language-plaintext highlighter-rouge">email</code>.
It relies on an external package to do the hard work of converting messages to a YAML representation.
It then serializes that representation to a YAML string and passes it on to <code class="language-plaintext highlighter-rouge">email</code>.
Indeed, <code class="language-plaintext highlighter-rouge">email</code> knows nothing about all the different ROS 2 messages; it simply handles strings.</p>

<h2 id="the-components-of-a-workingish-rmw-implementation">The components of a working(ish) RMW implementation</h2>

<p>In the beginning, I had a rough goal: get ROS 2 working <em>over email</em>.
Fortunately, in a way, it’s a rather straightforward process, since it can be split into a few components.</p>

<h3 id="middleware">Middleware</h3>

<p>The middleware should really be usable on its own, so I started by only focusing on that.</p>

<p>At its core, <code class="language-plaintext highlighter-rouge">email</code> simply sends and receives emails using <a href="https://curl.se/libcurl/"><code class="language-plaintext highlighter-rouge">libcurl</code></a>’s C API.
Emails are sent using the SMTP protocol and received by polling using the IMAP protocol.
Polling is done by first getting the unique identifier (UID 
[<a name="3t" href="#3">3</a>]) of the next expected email using the <code class="language-plaintext highlighter-rouge">EXAMINE</code> IMAP command 
[<a name="4t" href="#4">4</a>].
Then it polls until there’s a new email, increments the UID value, and repeats the process.
Polling is done on a dedicated thread, while emails are sent synchronously (<a href="https://github.com/christophebedard/rmw_email/issues/237">currently</a>, at least).</p>
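
<p>To give a rough idea of the first polling step, here is a minimal sketch of how the next UID could be extracted from the text of an <code class="language-plaintext highlighter-rouge">EXAMINE</code> response.
The <code class="language-plaintext highlighter-rouge">parse_uidnext()</code> helper is hypothetical and is not taken from <code class="language-plaintext highlighter-rouge">email</code>; it only assumes the <code class="language-plaintext highlighter-rouge">[UIDNEXT n]</code> response code defined by the IMAP standard, and it leaves out the <code class="language-plaintext highlighter-rouge">libcurl</code> calls entirely.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;cstdint&gt;
#include &lt;string&gt;

// Hypothetical helper (not part of the email package): extract the UIDNEXT
// value from the raw text of an IMAP EXAMINE response.
// Returns true and sets uid_next on success, false if no valid value is found.
bool parse_uidnext(const std::string &amp; response, std::uint32_t &amp; uid_next)
{
  const std::string key = "[UIDNEXT ";
  const auto start = response.find(key);
  if (std::string::npos == start) {
    return false;
  }
  const auto first = start + key.size();
  const auto last = response.find(']', first);
  if (std::string::npos == last || last == first) {
    return false;
  }
  // Parse the digits by hand to reject anything that is not a 32-bit number
  std::uint64_t value = 0;
  for (auto i = first; i &lt; last; ++i) {
    const char c = response[i];
    if (c &lt; '0' || c &gt; '9') {
      return false;
    }
    value = value * 10 + static_cast&lt;std::uint64_t&gt;(c - '0');
    if (value &gt; 0xFFFFFFFFull) {
      return false;
    }
  }
  uid_next = static_cast&lt;std::uint32_t&gt;(value);
  return true;
}</code></pre></figure>

<p>The polling loop would then repeatedly try to fetch the email with that UID and increment the value once an email is found.</p>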

<p>Each email message contains metadata that is included as both custom and standard email headers.
All emails include a source timestamp and the GID (i.e., a unique ID) of the source object (i.e., publisher, service client, or service server).
Additionally, service requests and responses contain a sequence number, and service responses also contain the GID of the service client that made the original request.
This is required so that the service response is matched with the original request and delivered to the corresponding service client.</p>

<p>Also, service responses are email replies to the corresponding service request email!
This is achieved using standard headers: the values of the <code class="language-plaintext highlighter-rouge">In-Reply-To</code> and <code class="language-plaintext highlighter-rouge">References</code> headers of the response email are set to the value of the <code class="language-plaintext highlighter-rouge">Message-ID</code> header of the request email 
[<a name="5t" href="#5">5</a>].
None of this pollutes the email as displayed in a normal email client, since the metadata lives entirely in headers.</p>
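
<p>As a small sketch of the reply mechanism, the two headers can be built from the request’s <code class="language-plaintext highlighter-rouge">Message-ID</code> value like this.
The <code class="language-plaintext highlighter-rouge">make_reply_headers()</code> function is a made-up name for illustration and is not the actual <code class="language-plaintext highlighter-rouge">email</code> code.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;string&gt;

// Hypothetical helper: build the two standard headers that turn a response
// email into a reply to a request email. request_message_id is the value of
// the request's Message-ID header, angle brackets included.
std::string make_reply_headers(const std::string &amp; request_message_id)
{
  return
    "In-Reply-To: " + request_message_id + "\r\n" +
    "References: " + request_message_id + "\r\n";
}</code></pre></figure>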

<p>Let’s illustrate this with the simple server/client example below.</p>

<!-- using Bootstrap might be better, but it interferes with the rest of the style, so just use flex -->
<div style="display: flex; flex-wrap: wrap">
<div style="flex: 50%; padding: 2px">
<!-- receive.email.log -->

<figure class="highlight"><pre><code class="language-text" data-lang="text">Message-ID: &lt;a1.b2@mx.rmw-email.com&gt;
Client-GID: 4074879933
Request-Sequence-Number: 42
Source-Timestamp: 1631797734037229979
In-Reply-To: 
References: 
From: client@rmw-email.com
To: server@rmw-email.com
Cc: 
Bcc: 
Subject: /my_service

this is my request!</code></pre></figure>

</div>
<div style="flex: 50%; padding: 2px">
<!-- send.email.log -->

<figure class="highlight"><pre><code class="language-text" data-lang="text">Message-ID: &lt;d4.f5@mx.rmw-email.com&gt;
Client-GID: 4074879933
Request-Sequence-Number: 42
Source-Timestamp: 1631797743507593177
In-Reply-To: &lt;a1.b2@mx.rmw-email.com&gt;
References: &lt;a1.b2@mx.rmw-email.com&gt;
From: server@rmw-email.com
To: client@rmw-email.com
Cc: 
Bcc: 
Subject: /my_service

this is a response!</code></pre></figure>

</div>
</div>

<p>The server will receive the email on the left for the request and the client will receive the email reply on the right for the response.</p>

<p>When a new email is received by the polling thread, it is passed on to email handlers.
All subscriptions, service clients, and service servers register with those handlers.
Handlers use the email’s headers and topic/service name to figure out what kind of message it is and which object it belongs to.</p>
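
<p>To make the matching of service responses more concrete, here is a simplified sketch of a handler that routes responses to service clients using the client GID and the sequence number.
The <code class="language-plaintext highlighter-rouge">ServiceResponse</code> and <code class="language-plaintext highlighter-rouge">ResponseRouter</code> types are assumptions made up for this example, not the actual handler classes in <code class="language-plaintext highlighter-rouge">email</code>, and thread safety is left out to keep it short.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;cstdint&gt;
#include &lt;map&gt;
#include &lt;string&gt;

// Hypothetical representation of a received service response email
struct ServiceResponse
{
  std::uint32_t client_gid;      // value of the Client-GID header
  std::int64_t sequence_number;  // value of the Request-Sequence-Number header
  std::string body;              // email body
};

// Hypothetical handler: stores responses until the matching client takes them
class ResponseRouter
{
public:
  // Called when a service client is created
  void register_client(std::uint32_t client_gid)
  {
    pending_[client_gid];
  }

  // Called for each received response; returns false (and drops the
  // response) if no registered client has this GID
  bool route(const ServiceResponse &amp; response)
  {
    const auto it = pending_.find(response.client_gid);
    if (pending_.end() == it) {
      return false;
    }
    it-&gt;second[response.sequence_number] = response.body;
    return true;
  }

  // Called by a client; returns true and sets body if the response to the
  // request with the given sequence number has arrived
  bool take(std::uint32_t client_gid, std::int64_t sequence_number, std::string &amp; body)
  {
    const auto it = pending_.find(client_gid);
    if (pending_.end() == it) {
      return false;
    }
    const auto response = it-&gt;second.find(sequence_number);
    if (it-&gt;second.end() == response) {
      return false;
    }
    body = response-&gt;second;
    it-&gt;second.erase(response);
    return true;
  }

private:
  // client GID -&gt; (sequence number -&gt; response body)
  std::map&lt;std::uint32_t, std::map&lt;std::int64_t, std::string&gt;&gt; pending_;
};</code></pre></figure>

<p>With the example emails above, the response would be stored under GID 4074879933 and sequence number 42 until that client asks for it.</p>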

<p>Since sending and receiving emails requires an email account, the path to a <a href="https://github.com/christophebedard/rmw_email#configuration">configuration file</a> with email login credentials and recipients (to/cc/bcc) must be provided through an environment variable.
There is also an intraprocess mode.
If enabled, <code class="language-plaintext highlighter-rouge">email</code> acts as if it were sending emails to itself, bypassing the very last step of sending/receiving emails.
This means that it still relies on email headers, so it has to fake the <code class="language-plaintext highlighter-rouge">Message-ID</code> header value, since it is normally added by the email server.</p>

<p>The <a href="https://christophebedard.com/rmw_email/design/email/"><code class="language-plaintext highlighter-rouge">email</code> design document</a> contains a lot more information and even contains fancy UML diagrams!
The <a href="https://christophebedard.com/rmw_email/api/email/">API documentation</a> can also provide more insight.
Along with that, the <a href="https://github.com/christophebedard/rmw_email/tree/master/email_examples"><code class="language-plaintext highlighter-rouge">email_examples</code> package</a> contains <a href="https://github.com/christophebedard/rmw_email#email-examples">many examples</a>.</p>

<p>While certainly time-consuming, this part was rather fun to create from scratch.</p>

<h3 id="common-message-representation">Common message representation</h3>

<p>Since we have a middleware that strictly deals with strings, we need to be able to convert ROS 2 messages to strings and convert those strings back to messages.</p>

<p>I knew about type support introspection from reading ROS 2 source code and documentation.
It provides metadata about a given message type that allows you to parse the fields of a message given only a type-erased pointer to the message (i.e., <code class="language-plaintext highlighter-rouge">void *</code>).
Note that it would have been possible to <em>generate</em> code that does this for each message type, similar to how a <a href="https://github.com/ros2/rosidl/blob/36ed120f43daeaab31fd9ba2bf8dfb58db05091d/rosidl_generator_cpp/resource/msg__traits.hpp.em#L131"><code class="language-plaintext highlighter-rouge">to_yaml()</code> function is generated</a> for each message type.
Also, <code class="language-plaintext highlighter-rouge">rosidl_runtime_py</code> can <a href="https://github.com/ros2/rosidl_runtime_py/blob/63a9c99ad735ef08b9cfda69ba35322b5f8b75f3/rosidl_runtime_py/set_message.py#L28">convert YAML strings to messages</a> (e.g., for <code class="language-plaintext highlighter-rouge">ros2 topic pub</code>), but it’s in Python.</p>

<p>My first idea was to convert the bytes of the messages to <a href="https://en.wikipedia.org/wiki/Base64">Base64</a> and send that string over email, but that would have been a bit boring.
A while later, after I had a basic working middleware, I saw a <a href="https://discourse.ros.org/t/ros2-c-based-dynamic-typesupport-example/19079/3">post on ROS Discourse</a> about a <a href="https://github.com/osrf/dynamic_message_introspection">package to convert messages to a YAML representation</a>.
It only supported C messages, though, which was a problem since I needed to support both C and C++ messages.
I looked over the code to see how it worked and then I looked at the <a href="https://design.ros2.org/articles/idl_interface_definition.html">ROS 2 IDL documentation</a> and <a href="https://docs.ros.org/en/rolling/Concepts/About-ROS-Interfaces.html">this document</a> to understand what I needed to change to adapt it to C++.
C structures for message arrays make this task simple – at the expense of being more complex to use – since they keep track of size and capacity.
C++ containers make it complicated!</p>

<p>For example, how can you <a href="https://github.com/christophebedard/dynamic_message_introspection/blob/4afd27793d20731a758eb868459a8b1db6186e41/dynmsg/src/message_reading_cpp.cpp#L519-L520">figure out the size of a <code class="language-plaintext highlighter-rouge">std::vector&lt;T&gt;</code></a> if you know the size of the contained type, <code class="language-plaintext highlighter-rouge">sizeof(T)</code>, but <em>only</em> have a <code class="language-plaintext highlighter-rouge">void *</code> to it?
This is the case for unbounded dynamic arrays of a non-built-in type, like an <a href="https://github.com/ros2/common_interfaces/blob/a3a0dde2ba184b01cdc59a3003728906de3240a9/sensor_msgs/msg/PointCloud2.msg#L19">array of <code class="language-plaintext highlighter-rouge">PointField</code> in a <code class="language-plaintext highlighter-rouge">PointCloud2</code> message</a>.
The answer is: by <del>Googling it</del> knowing implementation details!
A <code class="language-plaintext highlighter-rouge">std::vector</code> object simply contains three pointers: begin, end, and end capacity.
Since the elements are stored <a href="https://en.cppreference.com/w/cpp/named_req/ContiguousContainer">contiguously</a>, size is simply <a href="https://github.com/christophebedard/dynamic_message_introspection/blob/4afd27793d20731a758eb868459a8b1db6186e41/dynmsg/src/vector_utils.cpp#L49-L59"><code class="language-plaintext highlighter-rouge">(end - begin) / sizeof(T)</code></a>!
Fun fact: that trick doesn’t work with <code class="language-plaintext highlighter-rouge">std::vector&lt;bool&gt;</code>, because <a href="https://en.cppreference.com/w/cpp/container/vector_bool">its implementation is different</a>, but that’s not a problem here.</p>
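
<p>Here is a minimal sketch of that trick, separate from the linked code.
It assumes that the vector object starts with the begin and end pointers, in that order; as far as I know, this holds for common standard library implementations such as libstdc++ and libc++, but it is not guaranteed by the C++ standard, and other implementations or debug modes may differ.
The <code class="language-plaintext highlighter-rouge">vector_size_from_void()</code> name is made up for this example.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;cstddef&gt;
#include &lt;cstring&gt;
#include &lt;vector&gt;

// Hypothetical helper: number of elements in a std::vector&lt;T&gt; (T != bool)
// known only through a type-erased pointer and sizeof(T).
// ASSUMPTION: the vector object begins with two pointers, begin then end.
// This is an implementation detail, not something the standard guarantees.
std::size_t vector_size_from_void(const void * vec, std::size_t element_size)
{
  const unsigned char * pointers[2];
  // Copy the first two pointers (begin, end) out of the vector object
  std::memcpy(pointers, vec, sizeof(pointers));
  // Elements are contiguous, so the byte distance divided by the element
  // size gives the number of elements
  return static_cast&lt;std::size_t&gt;(pointers[1] - pointers[0]) / element_size;
}</code></pre></figure>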

<blockquote>
  <p><strong>Update</strong>: this workaround is not actually needed. <a href="#update-2022-05-07">See below</a>.</p>
</blockquote>

<p>I forked the package, <a href="https://github.com/osrf/dynamic_message_introspection/pull/15">added support for C++ messages, made the message&lt;–&gt;YAML conversion symmetrical, and refactored the repository/packages a bit</a>.
Below is a simple example of a C++ <code class="language-plaintext highlighter-rouge">std_msgs/Header</code> message and the corresponding YAML representation.</p>

<div style="display: flex; flex-wrap: wrap">
<div style="flex: 50%; padding: 2px">

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">builtin_interfaces</span><span class="o">::</span><span class="n">msg</span><span class="o">::</span><span class="n">Time</span> <span class="n">stamp</span><span class="p">;</span>
<span class="n">stamp</span><span class="p">.</span><span class="n">sec</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="n">stamp</span><span class="p">.</span><span class="n">nanosec</span> <span class="o">=</span> <span class="mi">20U</span><span class="p">;</span>
<span class="n">std_msgs</span><span class="o">::</span><span class="n">msg</span><span class="o">::</span><span class="n">Header</span> <span class="n">msg</span><span class="p">;</span>
<span class="n">msg</span><span class="p">.</span><span class="n">stamp</span> <span class="o">=</span> <span class="n">stamp</span><span class="p">;</span>
<span class="n">msg</span><span class="p">.</span><span class="n">frame_id</span> <span class="o">=</span> <span class="s">"my_frame"</span><span class="p">;</span></code></pre></figure>

</div>
<div style="flex: 50%; padding: 2px">
<br />

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">stamp</span><span class="pi">:</span>
  <span class="na">sec</span><span class="pi">:</span> <span class="m">4</span>
  <span class="na">nanosec</span><span class="pi">:</span> <span class="m">20</span>
<span class="na">frame_id</span><span class="pi">:</span> <span class="s">my_frame</span></code></pre></figure>

</div>
</div>

<p>Even starting from the implementation for C messages, writing the introspection code for C++ messages was a nice challenge.
I’m sure some things could be improved and I might have done some things wrong (although it does work!).
It could nonetheless serve as another example of how to do type support introspection.</p>

<h3 id="rmw-implementation">RMW implementation</h3>

<p>To tie everything together, we need to implement the <code class="language-plaintext highlighter-rouge">rmw</code> interface for <code class="language-plaintext highlighter-rouge">email</code>.</p>

<p>Writing the implementation, <code class="language-plaintext highlighter-rouge">rmw_email_cpp</code>, was fairly straightforward.
I primarily read the <a href="https://github.com/ros2/rmw/blob/master/rmw/include/rmw/rmw.h"><code class="language-plaintext highlighter-rouge">rmw</code> API documentation</a> and looked at other implementations, like <a href="https://github.com/ros2/rmw_cyclonedds/tree/master/rmw_cyclonedds_cpp/src"><code class="language-plaintext highlighter-rouge">rmw_cyclonedds_cpp</code></a> and <a href="https://github.com/ros2/rmw_fastrtps"><code class="language-plaintext highlighter-rouge">rmw_fastrtps_cpp</code></a>.
This <a href="https://docs.google.com/presentation/d/1KiRtiMgLCTMV1BeAV_HerHUBfKdC8wjXjRN6-M0LV_U/edit">summary</a> was also pretty useful to get started!</p>

<p>I knew I would have to modify <code class="language-plaintext highlighter-rouge">email</code> in order to support the requirements of the <code class="language-plaintext highlighter-rouge">rmw</code> interface.
However, it wasn’t until I started working on the implementation that I figured out what I needed to add.
The main missing feature was wait sets.
With early versions of <code class="language-plaintext highlighter-rouge">email</code>, users <del>had</del> would have had to manually poll a subscription for new messages.
This is of course not how ROS 2 works; it uses wait sets, which allow waiting on different events at the same time in a standard way.
For example, you can add all subscriptions, service clients, and service servers to the wait set and ask it to <a href="https://github.com/ros2/rclcpp/blob/2801553d61c5a30a0327d5cbc8d28bcd74e9703d/rclcpp/include/rclcpp/wait_set_template.hpp#L610-L654">wait</a>.
Once that’s done, you can check the wait set to get a list of objects that have a new message, response, or request, and deal with them appropriately.</p>
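
<p>As an illustration of the concept, a very small wait set could look like the sketch below.
This is not the wait set from <code class="language-plaintext highlighter-rouge">email</code>, <code class="language-plaintext highlighter-rouge">rmw</code>, or <code class="language-plaintext highlighter-rouge">rclcpp</code>; the <code class="language-plaintext highlighter-rouge">WaitSet</code> class and its methods are assumptions made up for this example, and entities are reduced to simple indices.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;chrono&gt;
#include &lt;condition_variable&gt;
#include &lt;cstddef&gt;
#include &lt;mutex&gt;
#include &lt;vector&gt;

// Hypothetical minimal wait set (illustrative names only)
class WaitSet
{
public:
  // Add an entity (e.g., a subscription); returns its index in the wait set
  std::size_t add()
  {
    std::lock_guard&lt;std::mutex&gt; lock(mutex_);
    ready_.push_back(false);
    return ready_.size() - 1;
  }

  // Called by the receiving side (e.g., a polling thread) when the entity
  // at the given index has a new message, request, or response
  void notify(std::size_t index)
  {
    {
      std::lock_guard&lt;std::mutex&gt; lock(mutex_);
      ready_.at(index) = true;
    }
    cv_.notify_all();
  }

  // Block until at least one entity is ready or the timeout expires.
  // Returns the indices of the ready entities and resets their state.
  std::vector&lt;std::size_t&gt; wait(std::chrono::milliseconds timeout)
  {
    std::unique_lock&lt;std::mutex&gt; lock(mutex_);
    cv_.wait_for(lock, timeout, [this]() {return any_ready();});
    std::vector&lt;std::size_t&gt; ready;
    for (std::size_t i = 0; i &lt; ready_.size(); ++i) {
      if (ready_[i]) {
        ready.push_back(i);
        ready_[i] = false;
      }
    }
    return ready;
  }

private:
  // Must be called with mutex_ held
  bool any_ready() const
  {
    for (const bool is_ready : ready_) {
      if (is_ready) {
        return true;
      }
    }
    return false;
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector&lt;bool&gt; ready_;
};</code></pre></figure>

<p>The caller waits once on the whole set instead of polling each object, then only handles the objects whose indices are returned.</p>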

<p>Some of those features weren’t <em>necessary</em> for a simple email-based string pub/sub/service “middleware,” but adding them definitely improved it and turned it into a sort-of middleware.
Obviously, the <code class="language-plaintext highlighter-rouge">rmw</code> interface has many other features, like quality of service (for <em>real</em> applications) and introspection (e.g., to support <code class="language-plaintext highlighter-rouge">ros2 topic list</code>).
Those are not currently supported by <code class="language-plaintext highlighter-rouge">rmw_email_cpp</code>, but PRs are welcome!</p>

<p>This layer is where I saw the downsides of the ROS 2 abstractions.
Many things are duplicated or very similar: APIs, data structures, argument validation, etc.
Calls often go from <code class="language-plaintext highlighter-rouge">rclcpp</code> to <code class="language-plaintext highlighter-rouge">rcl</code> to <code class="language-plaintext highlighter-rouge">rmw</code> and, finally, to the middleware.
While each layer does have its own responsibilities – otherwise we wouldn’t have all those layers – a lot of the actual work is done by the middleware.
Furthermore, since DDS has always been the main – and pretty much only – middleware standard, parts of the interface, like the <a href="https://github.com/ros2/rmw/blob/35fc6ab8fad4db90eb55db9d1ecf50dc1aa3638d/rmw/include/rmw/types.h#L351-L352">writer GUID field in the request ID struct</a>, are rather DDS-specific.</p>

<p>However, I also saw the clear benefits of these abstractions and interfaces.
I didn’t have to write too much code to plug <code class="language-plaintext highlighter-rouge">email</code> into ROS 2; I only had to implement an interface.
A few bugs aside, after implementing the main <code class="language-plaintext highlighter-rouge">rmw</code> functions, running ROS 2 over email just… worked!</p>

<h2 id="demo">Demo</h2>

<p>After all of that, it’s time for a demo!
First, let’s see what our email inbox looks like when running the classic <a href="https://github.com/ros2/demos/tree/master/demo_nodes_cpp/src/topics">talker/listener demo</a>.</p>

<!-- EMAIL_CONFIG_FILE=send.email.yml RMW_IMPLEMENTATION=rmw_email_cpp ros2 run demo_nodes_cpp talker -->
<figure>
<div style="display: flex; flex-wrap: wrap">
<div style="flex: 50%; padding: 2px">
<!-- make the code align vertically with the image -->
<br />
<br />

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"><span class="nv">$ EMAIL_CONFIG_FILE</span><span class="o">=</span>talker.email.yml <span class="se">\</span>
  <span class="nv">RMW_IMPLEMENTATION</span><span class="o">=</span>rmw_email_cpp <span class="se">\</span>
  ros2 run demo_nodes_cpp talker</code></pre></figure>

</div>
<div style="flex: 50%; padding: 2px">
<a href="/assets/img/rmw-email/demo_talker.png"><img src="/assets/img/rmw-email/demo_talker.png" alt="'hello world' talker emails from talker@rmw-email.com" style="border: 1px solid #383838;" /></a>
</div>
</div>
<figcaption style="text-align: center;">Command to run the <code class="highlighter-rouge">talker</code> node with <code class="highlighter-rouge">rmw_email_cpp</code> and resulting emails on the <code class="highlighter-rouge">/chatter</code> topic.</figcaption>
</figure>

<p>Since messages only go in one direction in the above example, let’s see a client/server example using the <a href="https://github.com/ros2/demos/tree/master/demo_nodes_cpp/src/services">add_two_ints service demo</a>.</p>

<!-- EMAIL_CONFIG_FILE=send.email.yml RMW_IMPLEMENTATION=rmw_email_cpp ros2 run demo_nodes_cpp add_two_ints_client -->
<!-- EMAIL_CONFIG_FILE=receive.email.yml RMW_IMPLEMENTATION=rmw_email_cpp ros2 run demo_nodes_cpp add_two_ints_server -->
<figure>
<div style="display: flex; flex-wrap: wrap">
<div style="flex: 50%; padding: 2px">

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"><span class="nv">$ EMAIL_CONFIG_FILE</span><span class="o">=</span>client.email.yml <span class="se">\</span>
  <span class="nv">RMW_IMPLEMENTATION</span><span class="o">=</span>rmw_email_cpp <span class="se">\</span>
  ros2 run demo_nodes_cpp add_two_ints_client
Result of add_two_ints: 5

<span class="nv">$ EMAIL_CONFIG_FILE</span><span class="o">=</span>server.email.yml <span class="se">\</span>
  <span class="nv">RMW_IMPLEMENTATION</span><span class="o">=</span>rmw_email_cpp <span class="se">\</span>
  ros2 run demo_nodes_cpp add_two_ints_server
Incoming request
a: 2 b: 3</code></pre></figure>

</div>
<div style="flex: 50%; padding: 2px">
<br />
<a href="/assets/img/rmw-email/demo_service.png"><img src="/assets/img/rmw-email/demo_service.png" alt="service request email from client@rmw-email.com and response reply email from server@rmw-email.com" style="border: 1px solid #383838;" /></a>
</div>
</div>
<figcaption style="text-align: center;">Commands and emails for the <code class="highlighter-rouge">/add_two_ints</code> service request and response.</figcaption>
</figure>

<p>As mentioned previously, we can see the reply to the request email in this example.</p>

<h2 id="performance">Performance</h2>

<p>We can use <a href="https://gitlab.com/ApexAI/performance_test">performance_test</a> to measure pub/sub latencies and compare them to another RMW implementation.
The current default implementation is <code class="language-plaintext highlighter-rouge">rmw_cyclonedds_cpp</code>, so let’s compare to that.</p>

<figure>
    <a href="/assets/img/rmw-email/perf_comparison_nort.png"><img src="/assets/img/rmw-email/perf_comparison_nort.png" alt="rmw_email_cpp's mean latency is way higher and more jittery compared to rmw_cyclonedds_cpp's" style="" /></a>
    <figcaption style="text-align: center;">Latency comparison between <code class="highlighter-rouge">rmw_email_cpp</code> and <code class="highlighter-rouge">rmw_cyclonedds_cpp</code>.</figcaption>
</figure>

<p>With a mean latency of around 6 seconds over the one-minute experiment, <code class="language-plaintext highlighter-rouge">rmw_email_cpp</code> is clearly worse than <code class="language-plaintext highlighter-rouge">rmw_cyclonedds_cpp</code>.
Approximately 15 332 times worse.
Not that we expected anything else, obviously!</p>

<p>The results are different if we run the experiments on a real-time system: PREEMPT_RT-patched Ubuntu Server 20.04.2 (5.4.3-rt1), Intel i7-3770 4-core CPU @ 3.40 GHz (SMT disabled), 8 GB RAM, and SCHED_FIFO policy with the highest priority (99).</p>

<figure>
    <a href="/assets/img/rmw-email/perf_comparison_rt.png"><img src="/assets/img/rmw-email/perf_comparison_rt.png" alt="rmw_cyclonedds_cpp's mean latency is cut in half, while rmw_email_cpp's mean latency more than doubles" style="" /></a>
    <figcaption style="text-align: center;">Latency comparison between <code class="highlighter-rouge">rmw_email_cpp</code> and <code class="highlighter-rouge">rmw_cyclonedds_cpp</code> on a real-time system.</figcaption>
</figure>

<p>As expected, the latencies for <code class="language-plaintext highlighter-rouge">rmw_cyclonedds_cpp</code> are much lower on a real-time system.
However, the latencies for <code class="language-plaintext highlighter-rouge">rmw_email_cpp</code> get worse!
The data also stops halfway through the experiment because performance_test throws an exception if messages are not received in the order that they are sent.
This assertion could be removed from the performance_test code to compare two complete experiments, but, surely, it’s not a good sign when a middleware shuffles messages!</p>

<p>We could hypothesize that most of the time is spent between the two <code class="language-plaintext highlighter-rouge">libcurl</code> calls to send and receive emails, i.e., server-side.
To explore this hypothesis, we can enable intraprocess mode for <code class="language-plaintext highlighter-rouge">email</code> and run the experiments again.</p>

<figure>
    <a href="/assets/img/rmw-email/perf_comparison_intra.png"><img src="/assets/img/rmw-email/perf_comparison_intra.png" alt="rmw_email_cpp's mean latency in intraprocess mode goes down to 4.41 ms" style="" /></a>
    <figcaption style="text-align: center;">Latency comparison between <code class="highlighter-rouge">rmw_email_cpp</code> (intraprocess) and <code class="highlighter-rouge">rmw_cyclonedds_cpp</code>.</figcaption>
</figure>

<p>The latencies are then much more comparable.
They’re now only about 12 times higher, although it would be worse if we did a fair comparison using Cyclone DDS with <a href="https://github.com/eclipse-iceoryx/iceoryx">iceoryx</a> for shared memory inter-process communications.
<code class="language-plaintext highlighter-rouge">rmw_email_cpp</code>’s message&lt;–&gt;YAML conversion is without a doubt no match for <code class="language-plaintext highlighter-rouge">rmw_cyclonedds_cpp</code>’s (de)serialization, and the liberal use of <code class="language-plaintext highlighter-rouge">std::string</code> objects internally most likely doesn’t help.
It would definitely be interesting to investigate this.</p>

<h2 id="tracing">Tracing</h2>

<p>Just in case low-overhead instrumentation is needed to investigate real-time performance issues with rmw_email, I added <a href="https://lttng.org/docs">LTTng</a> tracepoints.
<code class="language-plaintext highlighter-rouge">rmw_email_cpp</code> uses the <code class="language-plaintext highlighter-rouge">ros2_tracing</code> instrumentation &amp; tracepoints for the <code class="language-plaintext highlighter-rouge">rmw</code> layer.
<code class="language-plaintext highlighter-rouge">email</code> has its own <a href="https://github.com/christophebedard/rmw_email/blob/master/email/include/email/lttng.hpp">LTTng tracepoints</a> to collect lower-level information in order to correlate it with the ROS 2 trace data.</p>

<h2 id="limitations-and-future-work">Limitations and future work</h2>

<p>Unsurprisingly, there are many limitations, with the main ones being high message latency and low pub/sub rate.
Also, as mentioned previously, messages might be received in the wrong order.
This could be due to Gmail’s infrastructure, since having sub-millisecond latencies and guaranteeing that emails are always in the right order are probably not high priorities.
Tackling those limitations might be possible, but it <em>could</em> be argued that it’s not worth the effort.</p>

<p>Nonetheless, many other paths could be explored.
Type support introspection could be replaced with static type support (i.e., generated code) to try to lower latencies.
Also, although we could compare this to DDS domain IDs, configuration files currently impose a direction on a whole process’ communications unless all emails are sent to &amp; from the same address.
Config files could be improved to allow users to specify per-topic or per-namespace email recipients.
Furthermore, the email standards and infrastructure could be leveraged even more to get interesting features.
For example, mailing lists could be used as a form of configurable “multicast.”
Finally, message filtering could be achieved by setting up rules in an email client to forward emails based on the messages’ content.</p>

<h2 id="conclusion">Conclusion</h2>

<p>In conclusion, I presented <a href="https://github.com/christophebedard/rmw_email">rmw_email</a>, which contains a standalone middleware as well as a ROS 2 middleware implementation to exchange messages using emails.</p>

<p>ROS 2’s abstractions lead to additional complexity and overhead, but they’re also quite powerful – now you can worry about getting <del>SLAM</del> SPAM into your <a href="https://docs.nav2.org/">Nav2</a> stack!
Ultimately, it’s an interesting debate between users who benefit from and need that abstraction, and those who prefer to break it a little bit to have direct access to the underlying middleware.
There might be a better middle ground to be found or even an alternative that allows both to coexist.
Or perhaps the status quo is good for most users, and those who need a more specialized version of ROS 2 can just fork it, as some have done.
Even if rmw_email will never be used in production (or at all), I hope that it can provide some insight and stimulate discussions on this subject.</p>

<p>Aside from that, I think this project had real benefits for the ROS 2 community, both directly and indirectly.
There’s of course the separate project to do symmetrical conversions between ROS 2 messages and a YAML representation.
Therefore, email stuff aside, that part is probably a cool contribution to the ROS 2 community!
I also <del>got distracted</del> embraced the open source philosophy along the way.
I started using the ROS 2 tooling working group’s GitHub actions, <a href="https://github.com/ros-tooling/setup-ros"><code class="language-plaintext highlighter-rouge">setup-ros</code></a> and <a href="https://github.com/ros-tooling/action-ros-ci"><code class="language-plaintext highlighter-rouge">action-ros-ci</code></a>, for rmw_email.
I contributed some improvements and new features that I needed.
Additionally, I made a number of contributions to ROS 2 core packages and ROS 2 dependencies here and there.</p>

<p>Time will tell whether or not this was more useful than my <a href="/ros-tracing-message-flow/">previous blog post</a>.
However, even if I received multiple facepalm emojis 🤦‍♂️ from a friend after sharing my “ROS 2… over email” idea, I would say that, after putting 300+ hours into this project over more than a year, I’m extremely satisfied with the outcome!</p>

<figure>
    <a href="/assets/img/rmw-email/overall_rmw_email_time_investment.png"><img src="/assets/img/rmw-email/overall_rmw_email_time_investment.png" alt="over 300 hours spent over 14 months on the code itself and around 25 hours spent on this blog post over a few weeks" style="" /></a>
    <figcaption style="text-align: center;">Time tracking result for rmw_email.</figcaption>
</figure>

<h2 id="links">Links</h2>

<ul>
  <li>rmw_email: <a href="https://github.com/christophebedard/rmw_email">github.com/christophebedard/rmw_email</a>
    <ul>
      <li><code class="language-plaintext highlighter-rouge">email</code> design document: <a href="https://christophebedard.com/rmw_email/design/email/">christophebedard.com/rmw_email/design/email/</a></li>
      <li><code class="language-plaintext highlighter-rouge">email</code> API documentation: <a href="https://christophebedard.com/rmw_email/api/email/">christophebedard.com/rmw_email/api/email/</a></li>
    </ul>
  </li>
  <li>dynamic_message_introspection: <a href="https://github.com/osrf/dynamic_message_introspection">github.com/osrf/dynamic_message_introspection</a>
    <ul>
      <li>the PR with changes mentioned in this post has now been merged: <a href="https://github.com/osrf/dynamic_message_introspection/pull/15">github.com/osrf/dynamic_message_introspection/pull/15</a></li>
    </ul>
  </li>
</ul>

<h2 id="update-2022-05-07">Update (2022-05-07)</h2>

<p>It turns out that there is no need to use hacky ways to get the size of an array of a non-built-in type: the type support introspection tools include functions that provide this information for a given type.
See: <a href="https://github.com/osrf/dynamic_message_introspection/pull/16">github.com/osrf/dynamic_message_introspection/pull/16</a>.</p>

<h2 id="update-2025-10-26">Update (2025-10-26)</h2>

<p>I wrote <a href="https://docs.ros.org/en/rolling/Tutorials/Advanced/Creating-An-RMW-Implementation.html">an <code class="language-plaintext highlighter-rouge">rmw</code> implementation guide for the ROS 2 documentation</a>, which features rmw_email.</p>

<h2 id="references">References</h2>

<p>[<a name="1" href="#1t">1</a>]
Z. Jiang, Y. Gong, J. Zhai, Y.-P. Wang, W. Liu, H. Wu, and J. Jin, “Message passing optimization in robot operating system,” <em>International Journal of Parallel Programming</em>, vol. 48, no. 1, pp. 119–136, 2020.<br />
[<a name="2" href="#2t">2</a>]
T. Kronauer, J. Pohlmann, M. Matthe, T. Smejkal, and G. Fettweis, “Latency overhead of ros2 for modular time-critical systems,” <em>arXiv preprint arXiv:2101.02074</em>, 2021.<br />
[<a name="3" href="#3t">3</a>]
<a href="https://datatracker.ietf.org/doc/html/rfc3501#section-2.3.1.1">RFC 3501, section 2.3.1.1</a><br />
[<a name="4" href="#4t">4</a>]
<a href="https://datatracker.ietf.org/doc/html/rfc3501#section-6.3.2">RFC 3501, section 6.3.2</a><br />
[<a name="5" href="#5t">5</a>]
<a href="https://datatracker.ietf.org/doc/html/rfc5322#page-26">RFC 5322, page 26</a></p>]]></content><author><name></name></author><category term="ROS" /><category term="ROS 2" /><category term="middleware" /><category term="rmw" /><category term="email" /><category term="DDS" /><summary type="html"><![CDATA[Ever wanted to use emails to exchange ROS 2 messages? Yes? Well, now you can!]]></summary></entry><entry><title type="html">Message Flow Analysis for ROS Through Tracing</title><link href="https://christophebedard.com/ros-tracing-message-flow/" rel="alternate" type="text/html" title="Message Flow Analysis for ROS Through Tracing" /><published>2019-06-06T00:00:00+00:00</published><updated>2019-06-06T00:00:00+00:00</updated><id>https://christophebedard.com/ros-tracing-message-flow</id><content type="html" xml:base="https://christophebedard.com/ros-tracing-message-flow/"><![CDATA[<figure>
    <a href="/assets/img/tc4ros/result.png"><img src="/assets/img/tc4ros/result.png" alt="project overview" style="" /></a>
    <figcaption style="text-align: center;">Outcome of this project: message flow analysis for ROS using Trace Compass.</figcaption>
</figure>

<p style="text-align: center;"><b>tl;dr</b> Tracing is a very powerful tool for software development, especially in robotics. Using Trace Compass and existing ROS instrumentation, I built an analysis that can show the path of a message through a ROS stack. This work can serve as a proof-of-concept for future endeavours.</p>

<ol id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a>    <ol>
      <li><a href="#context" id="markdown-toc-context">Context</a></li>
      <li><a href="#topic" id="markdown-toc-topic">Topic</a></li>
      <li><a href="#literature-review--existing-solutions" id="markdown-toc-literature-review--existing-solutions">Literature review &amp; existing solutions</a></li>
    </ol>
  </li>
  <li><a href="#message-flow-analysis" id="markdown-toc-message-flow-analysis">Message flow analysis</a>    <ol>
      <li><a href="#motive-and-goal" id="markdown-toc-motive-and-goal">Motive and goal</a></li>
      <li><a href="#approach" id="markdown-toc-approach">Approach</a></li>
    </ol>
  </li>
  <li><a href="#resultsexample" id="markdown-toc-resultsexample">Results/example</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
  <li><a href="#future-work" id="markdown-toc-future-work">Future work</a></li>
  <li><a href="#links" id="markdown-toc-links">Links</a></li>
  <li><a href="#acknowledgements" id="markdown-toc-acknowledgements">Acknowledgements</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ol>

<h2 id="introduction">Introduction</h2>

<p>Tracing can be invaluable when diagnosing complex systems, especially when problems are hard to reproduce using traditional tools. Robotics software development can benefit from tracing and the low-overhead analyses it can provide.</p>

<p>The overall goal of this project was to first look into where ROS could benefit from such analyses, and then work towards that.</p>

<p>This first section will introduce both ROS and Trace Compass for people familiar with only one (or neither) of them. I will also talk about robotics and tracing in general. The second and third sections will present my work along with an example.</p>

<p>Finally, I will conclude and briefly talk about possible future work related to this project.</p>

<h3 id="context">Context</h3>

<h4 class="no_toc" id="ros">ROS</h4>

<p><a href="https://ros.org/">Robot Operating System</a> (ROS) is an open-source framework and a set of libraries and tools for robotics software development. Although it has “Operating System” in its name, it’s not really an OS!</p>

<p>Its main feature is probably the implementation of the publish-subscribe pattern. <a href="https://wiki.ros.org/ROS/Concepts">Nodes</a>, which are modular “processes” designed to accomplish a specific task, can publish on, or subscribe to, one or more topics to send/receive messages. By launching multiple nodes (either from your own package or from a package someone else made), you can accomplish complex tasks!</p>

<h4 class="no_toc" id="trace-compass">Trace Compass</h4>

<p><a href="https://eclipse.dev/tracecompass/">Eclipse Trace Compass</a> is an open source <a href="https://github.com/tuxology/tracevizlab/tree/master/labs/001-what-is-tracing">trace</a> viewer and analysis framework designed to solve performance issues. It supports many trace formats, and provides numerous useful analyses &amp; views out of the box, such as the kernel resources and control flow views. Users can also use its API to implement their own analyses, which is what I did!</p>

<h3 id="topic">Topic</h3>

<p>My initial objective was to look into where ROS development could benefit from tracing and subsequent analyses, and try to help with that.</p>

<p>Early on in this project, I considered targeting ROS 2. However, as it was still relatively new and less mature than ROS 1, I went with the latter.</p>

<h3 id="literature-review--existing-solutions">Literature review &amp; existing solutions</h3>

<p>A presentation at ROSCon 2017, titled <a href="https://vimeo.com/236186712">“Determinism in ROS – or when things break /sometimes/ and how to fix it…”</a> exposed how ROS’ design does not guarantee determinism in execution. This is actually what piqued my curiosity at first, since I was a ROS user and had started to learn about tracing, and it eventually led to this project.</p>

<p>In this case, lack of determinism can be seen as merely a symptom. This led me to search for possible causes, one of which might be network/communications 
[<a name="1t" href="#1">1</a>, <a name="2t" href="#2">2</a>, <a name="3t" href="#3">3</a>, <a name="4t" href="#4">4</a>, <a name="5t" href="#5">5</a>]. For latencies, which might lead to lack of determinism, critical path analyses can help identify the actual root cause 
[<a name="6t" href="#6">6</a>, <a name="7t" href="#7">7</a>, <a name="8t" href="#8">8</a>, <a name="9t" href="#9">9</a>].</p>

<p>As for tools, many are distributed along with ROS to help users and developers. <a href="https://wiki.ros.org/rqt_graph"><code class="language-plaintext highlighter-rouge">rqt_graph</code></a> can create a graph of publisher/subscriber relations between nodes. It can also show publishing rates. Similarly, the ROS CLI tools (e.g. <code class="language-plaintext highlighter-rouge">rostopic</code>) can help debug basic pub/sub issues.</p>

<p>Other tools are available. The <code class="language-plaintext highlighter-rouge">diagnostics</code> <a href="https://wiki.ros.org/diagnostics">package</a> can collect diagnostics data for analysis. The <code class="language-plaintext highlighter-rouge">performance_test</code> <a href="https://github.com/apexai/performance_test">package</a> for ROS 2 can test the performance of a communications middleware.</p>

<p>However, all of the tools or solutions mentioned above cannot provide a view of the actual execution. Besides, the performance overhead of using higher-level log aggregators (e.g. as a ROS node) is non-negligible.</p>

<p>The <code class="language-plaintext highlighter-rouge">tracetools</code> <a href="https://github.com/bosch-robotics-cr/tracetools">package</a> uses <a href="https://lttng.org/">LTTng</a> to instrument ROS for tracing. However, it does not offer analysis tools.</p>

<p>Trace Compass offers a <a href="https://github.com/tuxology/tracevizlab/tree/master/labs/101-analyze-system-trace-in-tracecompass#task-2-navigate-in-time-graph-views">control flow view</a>, showing the state of threads over time. By selecting one particular thread, a user can launch a <a href="https://github.com/tuxology/tracevizlab/tree/master/labs/102-tracing-wget-critical-path">critical path analysis</a>.</p>

<h2 id="message-flow-analysis">Message flow analysis</h2>

<h3 id="motive-and-goal">Motive and goal</h3>

<p>As mentioned previously, time is one of the main concerns for robotics applications. Critical path analyses can make timing anomalies stand out and help developers find their root cause.</p>

<p>My goal was therefore to make a ROS-specific analysis along these lines. I chose to build what I call a “message flow analysis.” Using <code class="language-plaintext highlighter-rouge">tracetools</code> and the ROS instrumentation, we can figure out which queues a message went through, how much time it spent in each one, and if it ended up being dropped. Also, by linking a message received by a subscriber to the next corresponding message that gets published by the same node, we can build a model of the message processing pipeline.</p>

<h3 id="approach">Approach</h3>

<h4 class="no_toc" id="prerequisites">Prerequisites</h4>

<p>To build this analysis, some information is needed on:</p>

<ul>
  <li>connections between publishers and subscribers</li>
  <li>subscriber/publisher queue states</li>
  <li>network packet exchanges</li>
</ul>

<p>We first need to know about connections between nodes. The ROS instrumentation includes a tracepoint for new connections. It includes the address and port of the host and the destination, with an <code class="language-plaintext highlighter-rouge">address:port</code> pair corresponding to a specific publisher or subscription.</p>

<p>We also need to build a model of the publisher and subscriber queues. To achieve this, we can leverage the relevant tracepoints. These include a tracepoint for when a message is added to the queue, when it’s dropped from the queue, and when it leaves the queue (either sent over the network to the subscriber, or handed over to a callback). We can therefore visualize the state of a queue over time!</p>

<p>Finally, we need information on network packet exchanges. Although this isn’t really necessary for this kind of analysis, it allows us to reliably link a message that gets published to a message that gets received by the subscriber. This is good when building a robust analysis, and it paves the way for a future critical path analysis based on this message flow analysis.</p>

<p>This requires us to trace both userspace (ROS) and kernel. Fortunately, we only have to enable 2 kernel events for this. It saves us a lot of disk space, since enabling many events can generate multiple gigabytes of trace data, even when tracing for only a few seconds! Also, as the rate of generated events increases, the overhead also increases. More resources have to be allocated to the buffers to properly process those events, otherwise they can get <a href="https://lttng.org/docs/v2.13/#doc-channel">discarded or overwritten</a>.</p>

<h4 class="no_toc" id="method">Method</h4>

<p>In this sub-section, I’ll quickly go over some implementation details and explain how the analysis works!</p>

<p>Let’s start with some background on Trace Compass. It allows you to build analyses that depend on trace events, the output of other analyses, or both. Some analyses are used to create views to directly display processed data. However, we can use them as models that can be queried by other models or analyses. This abstraction was very useful when designing my final analysis and its dependencies.</p>

<p>The first analysis is the connections model. Using the <code class="language-plaintext highlighter-rouge">new_connection</code> events from <code class="language-plaintext highlighter-rouge">tracetools</code>, it creates a list of connections between two nodes on a certain topic and includes information about the endpoints.</p>

<figure>
    <a href="/assets/img/tc4ros/new_connection_events.png"><img src="/assets/img/tc4ros/new_connection_events.png" alt="examples of new_connection events" style="" /></a>
    <figcaption style="text-align: center;">Some <code class="highlighter-rouge">new_connection</code> events. Highlighted are two events belonging to the same connection, on opposite endpoints.</figcaption>
</figure>

<p>Another analysis is created to model queues over time. This uses three tracepoints, also from <code class="language-plaintext highlighter-rouge">tracetools</code>:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">publisher_message_queued</code> or <code class="language-plaintext highlighter-rouge">subscription_message_queued</code>, when a message is added to the queue</li>
  <li><code class="language-plaintext highlighter-rouge">subscriber_link_message_write</code> or <code class="language-plaintext highlighter-rouge">subscriber_callback_start</code>, when a message is successfully removed from the queue (i.e. sent over the network or processed by a callback)</li>
  <li><code class="language-plaintext highlighter-rouge">subscriber_link_message_dropped</code> or <code class="language-plaintext highlighter-rouge">subscription_message_dropped</code>, when a message is dropped from the queue</li>
</ol>

<p>These events always include a reference to the associated message, which helps validate the model.</p>
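
<p>To make the idea more concrete, here is a minimal sketch of how a queue’s contents could be reconstructed by replaying these three kinds of events. This is not the actual Trace Compass analysis (which is written in Java): the <code class="language-plaintext highlighter-rouge">QueueEvent</code> struct, the <code class="language-plaintext highlighter-rouge">QueueEventKind</code> enum, and the <code class="language-plaintext highlighter-rouge">replay_queue</code> function are hypothetical names used for illustration only, and the events are assumed to be sorted by timestamp.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;cstdint&gt;
#include &lt;deque&gt;
#include &lt;vector&gt;

// Hypothetical event kinds, one per group of tracepoints listed above.
enum class QueueEventKind { Queued, Dequeued, Dropped };

// Hypothetical, simplified trace event: a timestamp, a kind, and the
// reference to the associated message.
struct QueueEvent {
    std::int64_t timestamp_ns;
    QueueEventKind kind;
    std::uint64_t msg_ref;
};

// Replays events (assumed to be sorted by timestamp) and returns the
// references of the messages still in the queue afterwards, oldest first.
std::vector&lt;std::uint64_t&gt; replay_queue(const std::vector&lt;QueueEvent&gt;&amp; events) {
    std::deque&lt;std::uint64_t&gt; queue;
    for (const QueueEvent&amp; event : events) {
        if (event.kind == QueueEventKind::Queued) {
            queue.push_back(event.msg_ref);
            continue;
        }
        // Dequeued or dropped: remove the referenced message if it is present.
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            if (*it == event.msg_ref) {
                queue.erase(it);
                break;
            }
        }
    }
    return std::vector&lt;std::uint64_t&gt;(queue.begin(), queue.end());
}</code></pre></figure>

<p>Recording the queue’s contents after each event, instead of only at the end, would give the state over time that the view displays. Looking up each message by its reference, rather than always popping the front, is what lets the model tolerate drops and flag events that refer to a message it doesn’t know about.</p>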

<figure>
    <a href="/assets/img/tc4ros/queues_view.png"><img src="/assets/img/tc4ros/queues_view.png" alt="view showing the state of a publisher queue" style="" /></a>
    <figcaption style="text-align: center;">View showing the state of a publisher queue. At this timestamp (thin blue vertical line), the first message is removed from the queue and sent to the subscriber.</figcaption>
</figure>

<p>The third analysis is for network packet exchange. This is the only analysis that needs kernel events: <code class="language-plaintext highlighter-rouge">net_dev_queue</code> for packet queuing and <code class="language-plaintext highlighter-rouge">netif_receive_skb</code> for packet reception. Fortunately, Trace Compass already does this! It matches sent and received packets. I only had to filter out <code class="language-plaintext highlighter-rouge">SYN</code>/<code class="language-plaintext highlighter-rouge">FIN</code>/<code class="language-plaintext highlighter-rouge">ACK</code> packets and those which were not associated with a known ROS connection. Then, from a node name, a topic name, and a timestamp at which a message was published, we can figure out when it went through the network, and link it to a message received by the subscription.</p>

<p>Finally, we can put everything together! The analysis uses the above analyses to reconstruct and display a message’s path across queues, callbacks, and the network!</p>

<h2 id="resultsexample">Results/example</h2>

<p>To illustrate this, I wrote a simple “pipeline” test case. A first node periodically publishes messages on a topic. A second node does some processing and re-publishes them on another topic. A third node does the same, and a fourth and last node prints the contents of a message.</p>

<figure>
    <a href="/assets/img/tc4ros/testcase_graph.png"><img src="/assets/img/tc4ros/testcase_graph.png" alt="rqt_graph example" style="" /></a>
    <figcaption style="text-align: center;">Graph generated using <code class="highlighter-rouge">rqt_graph</code>.</figcaption>
</figure>

<p>From the view showing queues over time, the user can select an individual message by clicking on it, then hitting the <em>Follow the selected message</em> button.</p>

<figure>
    <a href="/assets/img/tc4ros/result_select_message.png"><img src="/assets/img/tc4ros/result_select_message.png" alt="message selection result" style="" /></a>
    <figcaption style="text-align: center;">Message selection. Note that this is the first node in the pipeline, and that, at this moment, the other nodes are not active. Therefore, since latching is enabled, the publisher's queue only keeps the most recent message.</figcaption>
</figure>

<p>The message flow analysis and all of its dependencies are then run. The output can be viewed in the corresponding view.</p>

<figure>
    <a href="/assets/img/tc4ros/result_analysis_initial.png"><img src="/assets/img/tc4ros/result_analysis_initial.png" alt="analysis result" style="" /></a>
    <figcaption style="text-align: center;">Analysis result.</figcaption>
</figure>

<p>There it is! Some sections are hard to make out, so we can zoom in.</p>

<figure>
    <a href="/assets/img/tc4ros/result_analysis_initial_zoom.png"><img src="/assets/img/tc4ros/result_analysis_initial_zoom.png" alt="zoomed in" style="" /></a>
    <figcaption style="text-align: center;">Zoomed in.</figcaption>
</figure>

<p>We can see three main states: publisher queue, subscriber queue, and subscriber callback. Of course, the transition represented by the darker arrows between the publisher queue and subscriber queue states includes the network transmission.</p>

<p>However, going back to the original perspective, two states clearly stand out. The first (green) state represents the time spent in the first node’s publisher queue, waiting for other nodes to be ready in order to transmit the message. The biggest state, in orange, represents the time spent in a callback inside the third node. We can hover over the state to get more info.</p>

<figure>
    <a href="/assets/img/tc4ros/result_analysis_hover.png"><img src="/assets/img/tc4ros/result_analysis_hover.png" alt="hover feature example" style="" /></a>
    <figcaption style="text-align: center;">Hovering to get more information.</figcaption>
</figure>

<p>We can see that the message spent around 100 milliseconds in the callback before the next related message was sent to the following publisher queue. In this case, it can be explained by looking at <a href="https://github.com/christophebedard/tracecompass_ros_testcases/blob/melodic-devel/tracecompass_ros_testcases/src/nodes_pipeline/node_m.cpp">the node’s source code</a>!</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">void</span> <span class="nf">callbackFunction</span><span class="p">(</span><span class="k">const</span> <span class="n">std_msgs</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">ConstPtr</span><span class="o">&amp;</span> <span class="n">msg</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std_msgs</span><span class="o">::</span><span class="n">String</span> <span class="n">next_msg</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">payload</span> <span class="o">=</span> <span class="n">get_payload</span><span class="p">(</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">new_payload</span> <span class="o">=</span> <span class="n">payload</span> <span class="o">+</span> <span class="n">pow</span><span class="p">(</span><span class="n">payload</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node_i</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">ros</span><span class="o">::</span><span class="n">Duration</span><span class="p">(</span><span class="mf">0.1</span><span class="p">).</span><span class="n">sleep</span><span class="p">();</span> <span class="c1">// &lt;---------</span>
    <span class="p">}</span>
    <span class="n">next_msg</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">MSG_CONTENT_PREFIX</span> <span class="o">+</span> <span class="n">std</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">new_payload</span><span class="p">);</span>
    <span class="n">pub</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="n">next_msg</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<h2 id="conclusion">Conclusion</h2>

<p>In conclusion, tracing is a very powerful tool for robotics software development. Lack of determinism was identified as a symptom, and timing was chosen as an analysis topic.</p>

<p>Using existing ROS instrumentation, I worked towards providing insight into the timewise execution of a ROS software stack. The result, a Trace Compass analysis for ROS, can serve as a proof-of-concept for future endeavours.</p>

<h2 id="future-work">Future work</h2>

<p>Many elements could be improved, and many new paths could be explored.</p>

<p>First and foremost, other than not supporting UDP and not explicitly supporting namespaces, there are many limitations and simplifications with the current implementation, as most of the traces I used were taken from executions of (very synthetic) test cases.</p>

<p>To link a message between two endpoints, the analysis selects the first corresponding TCP packet that is queued (<code class="language-plaintext highlighter-rouge">net_dev_queue</code>) after the <code class="language-plaintext highlighter-rouge">subscriber_link_message_write</code> event, and then selects the next <code class="language-plaintext highlighter-rouge">subscription_message_queued</code> event after the matching <code class="language-plaintext highlighter-rouge">netif_receive_skb</code> event. This assumption about the sequence of events might not always be valid. Also, it has not been tested with messages bigger than the maximum payload size of a TCP packet.</p>
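
<p>The linking heuristic described above can be sketched as follows. Again, this is only an illustration and not the real implementation: <code class="language-plaintext highlighter-rouge">PacketMatch</code> and <code class="language-plaintext highlighter-rouge">link_message</code> are hypothetical names, the packets are assumed to have already been matched and filtered for a single connection, both inputs are assumed to be sorted by time, and “after” is taken to mean strictly after. It uses <code class="language-plaintext highlighter-rouge">std::optional</code>, so it needs C++17.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">#include &lt;algorithm&gt;
#include &lt;cstdint&gt;
#include &lt;optional&gt;
#include &lt;vector&gt;

// Hypothetical pair of matched packet events for one connection:
// net_dev_queue on the sending side, netif_receive_skb on the receiving side.
struct PacketMatch {
    std::int64_t sent_ns;
    std::int64_t received_ns;
};

// Given the timestamp of a subscriber_link_message_write event, returns the
// timestamp of the subscription_message_queued event it is linked to, if any.
// `packets` must be sorted by sent_ns and `queued_ns` in ascending order.
std::optional&lt;std::int64_t&gt; link_message(
    std::int64_t write_ns,
    const std::vector&lt;PacketMatch&gt;&amp; packets,
    const std::vector&lt;std::int64_t&gt;&amp; queued_ns) {
    // First packet queued after the message was written to the link.
    auto packet = std::find_if(
        packets.begin(), packets.end(),
        [write_ns](const PacketMatch&amp; p) { return p.sent_ns &gt; write_ns; });
    if (packet == packets.end()) {
        return std::nullopt;
    }
    // Next message queued on the subscription side after the packet arrived.
    auto queued = std::upper_bound(
        queued_ns.begin(), queued_ns.end(), packet-&gt;received_ns);
    if (queued == queued_ns.end()) {
        return std::nullopt;
    }
    return *queued;
}</code></pre></figure>

<p>The sketch makes the fragility of the assumption easier to see: any unrelated packet on the same connection, or a message split across several packets, would make it pick the wrong link.</p>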

<p>Furthermore, callbacks were considered as the only possible link between two messages (received &amp; published). Nodes might deal with callbacks and message publishing separately, e.g. when publishing at a fixed rate independently of the received messages. In the same sense, message flows do not have to be linear! In other words, one incoming message can turn into multiple outgoing messages.</p>

<p>Also, Trace Compass can easily aggregate traces from multiple hosts. This is very relevant for robotics systems, and thus would be a great avenue to explore.</p>

<p>Finally, as mentioned previously, the message flow analysis could be extended to provide a critical path analysis. This would provide more information about what actually happened while a message was waiting in a queue.</p>

<h2 id="links">Links</h2>

<ul>
  <li>My <a href="https://github.com/christophebedard/ros_comm/tree/tc4ros">fork</a> of the <a href="https://github.com/boschresearch/ros_comm/tree/melodic-devel">original</a> instrumentation fork. I improved and fixed some small things, including adding information about latched messages.</li>
  <li>My <a href="https://github.com/christophebedard/tracetools/tree/tc4ros">fork</a> of the <a href="https://github.com/bosch-robotics-cr/tracetools">original</a> <code class="language-plaintext highlighter-rouge">tracetools</code> package.</li>
  <li><a href="https://github.com/christophebedard/tracecompass_ros_testcases">Repo</a> with a few test traces and a <code class="language-plaintext highlighter-rouge">.repos</code> file to easily set up a workspace to trace ROS.</li>
  <li><a href="https://archive.eclipse.org/tracecompass/doc/stable/org.eclipse.tracecompass.doc.user/Trace-Compass-Incubator.html#Trace_Compass_Incubator">Documentation</a> on how to install features from the Trace Compass Incubator, which includes support for ROS traces, the analyses mentioned in this post, and more.</li>
</ul>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>This project was done as part of the <a href="https://www.polymtl.ca/aide-financiere/bourses/bourses-upir-unite-de-participation-et-dinitiation-la-recherche">UPIR</a> program for undergraduate research at Polytechnique Montréal, and was supervised by Michel Dagenais. I thank him for his great input!</p>

<p>I would also like to thank Matthew Khouzam and Geneviève Bastien for their help and advice, and Ingo Lütkebohle for his commentary on the need for tracing in robotics.</p>

<h2 id="references">References</h2>

<p>[<a name="1" href="#1t">1</a>]
C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Real-time Linux communications: an evaluation of the Linux communication stack for real-time robotic applications,” arXiv:1808.10821 [cs], Aug. 2018.<br />
[<a name="2" href="#2t">2</a>]
C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, I. M. Goenaga, L. A. Kirschgens, and V. M. Vilches, “Time Synchronization in modular collaborative robots,” arXiv:1809.07295 [cs], Sep. 2018.<br />
[<a name="3" href="#3t">3</a>]
C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Time-Sensitive Networking for robotics,” arXiv:1804.07643 [cs], Apr. 2018.<br />
[<a name="4" href="#4t">4</a>]
C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Towards a distributed and real-time framework for robots: Evaluation of ROS 2.0 communications for real-time robotic applications,” arXiv:1809.02595 [cs], Sep. 2018.<br />
[<a name="5" href="#5t">5</a>]
Y.-P. Wang, W. Tan, X.-Q. Hu, D. Manocha, and S.-M. Hu, “TZC: Efficient Inter-Process Communication for Robotics Middleware with Partial Serialization,” arXiv:1810.00556 [cs], Oct. 2018.<br />
[<a name="6" href="#6t">6</a>]
F. Giraldeau and M. Dagenais, “Wait Analysis of Distributed Systems Using Kernel Tracing,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 8, pp. 2450–2461, Aug. 2016.<br />
[<a name="7" href="#7t">7</a>]
F. Doray and M. Dagenais, “Diagnosing Performance Variations by Comparing Multi-Level Execution Traces,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–1, 2016.<br />
[<a name="8" href="#8t">8</a>]
P.-M. Fournier and M. R. Dagenais, “Analyzing blocking to debug performance problems on multi-core systems,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, p. 77, Apr. 2010.<br />
[<a name="9" href="#9t">9</a>]
C.-Q. Yang and B. P. Miller, “Critical path analysis for the execution of parallel and distributed programs,” in Proceedings of the 8th International Conference on Distributed Computing Systems, San Jose, CA, USA, 1988, pp. 366–373.</p>]]></content><author><name></name></author><category term="ROS" /><category term="tracing" /><category term="Trace Compass" /><category term="analysis" /><category term="UPIR" /><summary type="html"><![CDATA[An overview of my project on Trace Compass & ROS]]></summary></entry><entry><title type="html">ROSCon 2018</title><link href="https://christophebedard.com/roscon2018/" rel="alternate" type="text/html" title="ROSCon 2018" /><published>2019-05-12T00:00:00+00:00</published><updated>2019-05-12T00:00:00+00:00</updated><id>https://christophebedard.com/roscon2018</id><content type="html" xml:base="https://christophebedard.com/roscon2018/"><![CDATA[<p>It’s been a while since <a href="https://roscon.ros.org/2018/">ROSCon 2018</a>, but I thought I’d (<em>finally</em>) write some thoughts down. Hopefully without too much rambling.</p>

<h2 id="context">Context</h2>

<p>I’ve been around ROS for a little while now. At first it was from a distance. I only knew the basic concepts, but I still had to work around it. However, a couple years ago I slowly got more and more interested and actually started using it. I thought it was a very powerful concept, and its community and open-sourceness really made me want to contribute.</p>

<h2 id="why">Why</h2>

<p>Why not?</p>

<h2 id="objectives">Objectives</h2>

<p>I had never been to a tech conference (by myself or for myself), and I didn’t really know what to expect – I didn’t know <em>exactly</em> what I wanted from it.</p>

<p>However, my one objective was to talk to other attendees and learn <em>something</em> from them. Sure, I’m still a student and don’t have as much experience as them, but I still have something to offer.</p>

<p>One thing I was sure of is that I’d learn a lot from the presentations. I was really looking forward to the presentations on ROS 2, since I hadn’t really tried or read a lot about it. ROS 1 is still way more mature, but as companies are slowly starting to move to ROS 2, especially for new products or applications, this is the best moment to get started!</p>

<h2 id="the-conference">The conference</h2>

<p>The presentations were very interesting! For the topics I was familiar with, I was eager to see what other people were doing. For the others, I was simply curious. Here are a few highlights:</p>

<ul>
  <li>
    <p>ROS 2-specific presentations, including a <a href="https://vimeo.com/292699328">how-to on getting involved in ROS 2 development</a> and a <a href="https://vimeo.com/292693129">demo of the main features</a>.</p>
  </li>
  <li>
    <p><a href="https://vimeo.com/292693011">“Lessons learned building a self-driving car on ROS”</a> and <a href="https://vimeo.com/292695688">“ROS 2 on Autonomous Driving Vehicles,”</a> which were interesting applications of ROS to self-driving. They also mentioned real problems, like determinism and real time.</p>
  </li>
  <li>
    <p><a href="https://vimeo.com/293304372">“Integrating ROS and ROS2 on mixed-critical robotic systems based on embedded heterogeneous platforms”</a> and <a href="https://vimeo.com/293305909">“Towards ROS 2 microcontroller meta cross-compilation,”</a> which showed that the plan is for ROS 2 to be used for much more than what ROS 1 was generally used for. That is of course very exciting.</p>
  </li>
  <li>
    <p>The <a href="https://vimeo.com/293257342"><code class="language-plaintext highlighter-rouge">performance_test</code> package</a> for testing middleware performance for ROS 2, from Apex.AI. Since I’m interested in the tooling side of robotics software development (e.g. tracing &amp; analysis), this was quite a nice surprise!</p>
  </li>
  <li>
    <p>and other cool presentations, such as <a href="https://vimeo.com/293623186">“Deterministic reversible debugging of ROS nodes with Mozilla rr,”</a> <a href="https://vimeo.com/293626218">“Hermetic Robot Deployment Using Multi-Stage Dockers,”</a> and <a href="https://vimeo.com/293540767">“Deterministic, asynchronous message driven task execution with ROS.”</a></p>
  </li>
</ul>

<p>Between presentations and during the lunches, I talked with other attendees, including people from both academia and industry!</p>

<p>The biggest event of the conference was obviously the evening reception on the first day <del>because of the beer</del>. I motivated myself to go talk to people, and that led me to have some nice conversations and meet some very interesting people. I learned about a couple projects that I then started to follow, like the <a href="https://github.com/osrf/ovc">Open Vision Computer</a>.</p>

<h2 id="what-i-learned">What I learned</h2>

<p>Looking back, it was a great experience and I learned a lot, and not only from a technical point of view.</p>

<p>I learned that you always have something to offer, whether it’s a different point of view, a different background, or simply the cool stuff you’ve done. No matter how little experience you think you (might) have. For example, people I talked to had never heard of the <a href="http://aerialroboticscompetition.org/">aerial robotics competition I spent 3-4 years competing in</a>, which involved real challenges that I was able to discuss! Even if it wasn’t in a strictly professional setting, those challenges are still similar to the ones people in the industry can face. In this case, I mentioned that I worked on autonomous obstacle avoidance in a sterile environment, without external position sensors.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Overall, it was a very nice experience. It made me look forward to the future.</p>

<p>It made me want to get involved in ROS development. I have made (very) small contributions to ROS 1 since then, and I look forward to doing more, especially for ROS 2! Now I’m really hoping I can attend ROSCon 2019 in Macau!</p>

<p>All in all, you can’t learn anything if you never try, so jump in!</p>]]></content><author><name></name></author><category term="ROS" /><category term="ROSCon" /><category term="conference" /><summary type="html"><![CDATA[..or what I learned during my first conference.]]></summary></entry></feed>