Conversation
added support for unique ptr and messageT queues in design doc
wjwwood
left a comment
I'm not quite done reviewing this, but I wanted to post what I have now. I'll keep going and hopefully finish up tomorrow some time.
> RMW implementation.
>
> At the moment, all the ROS2 supported middlewares (Fast-RTPS, OpenSplice, Connext, DPS) do not make
> distinctions whether the `Publisher` and the `Subscription` are in the same process or not. In consequence,
> At the moment, all the ROS2 supported middlewares (Fast-RTPS, OpenSplice, Connext, DPS) do not make
> distinctions whether the `Publisher` and the `Subscription` are in the same process or not.
I'm not sure that's correct either. Needs citation at the very least.
For now, I removed the sentence.
Looking at Fast-RTPS and DPS (and their RMW) I'm pretty sure that's correct.
It's also supported by experimental evaluations, where having a publisher and a subscription in the same process or in two processes on the same machine does not affect the performance.
For what concerns OpenSplice, that statement is not correct, as this middleware uses the same ports for all the entities in the same process. This results in an almost instantaneous discovery, but I haven't seen any changes in latency for example.
No idea about Connext.
Either way, they could do that if they wanted. I think it's worth separating whether or not they could make optimizations versus whether or not they do. Not being able to make these optimizations is a strong case for improving our intra-process system, but in the current situation, a valid question is "why not improve the rmw implementations using the information they have first"? I'm not saying we should do that, it's just a weak justification for adding more stuff into our intra-process pipeline instead, whereas if it were not possible for them to do those optimizations, that would be a strong reason to do it in rclcpp.
For example, a strong reason to do it in rclcpp (in my opinion) is that when we add type masquerading (using OpenCV Image or PCL point cloud instead of the ROS message type) it will be easier to support intra-process with those types if it is in rclcpp, though that doesn't imply that we use an envelope message or have our own queueing system.
"Optimization" is not a binary operation though. There's no doubt that DDS vendors could do some optimizations and improve the efficiency of intra-process communication but there's only so much that can be done at the rmw layer. Unless some APIs are re-designed, there would always be the overhead of some additional copies. This overhead can be avoided entirely by doing this optimization in rclcpp directly without depending on the DDS vendor. IMO that is a strong enough reason since it guarantees a minimum level of performance irrespective of the DDS library.
I agree, but my point is that it's not a strong supporting reason to put more complexity into rclcpp rather than try to improve the existing design, since we haven't even tried to improve performance within the rmw layer. If performance (latency and cpu/memory usage) were improved for small messages within the same process at the rmw layer, then the drawbacks of using a "meta-message" decrease significantly, even if they will perhaps never be as good as a queue in rclcpp itself.
It's a bit of a moot point since the work is essentially already done, but adding things to the rclcpp layer adds complexity to a part of the system that is not optional (or at least not without removing code manually -- they don't have to use intra process after all).
I'm thinking towards certification efforts where they might have a much simpler or already certified rmw/middleware layer below.
To be clear, I think the approach described here is the best option, I'm just trying to make sure it's properly motivated as it does cost something now and in the future (in terms of maintenance). For example, each time we expose a QoS in ROS 2 we'll have to update the intra process system, and I think the proposed system will be more fragile/complex to update than the meta-message approach we have now.
I agree that the proposed system will require an effort every time a new QoS is added.
However, I don't think that the current one handles this in a better way.
The RMW delivers the meta-messages enforcing the QoS; however, if the IntraProcessManager "thinks" that a message has to be delivered to N subscriptions, it will wait for all of them before releasing the memory in the ring buffer. This could be a potential source of problems.
Moreover, even if this proposal adds complexity to the rclcpp layer, I think that it's important that the intra-process communication is implemented in a way that is agnostic to the RMW.
Implementing it in the RMW will potentially lead to have different versions of intra-process communication for each RMW implementation...
> RMW which does not have a notion of inter and intra-process communication. Consequently, the real message will
> be delivered to both `Subscription`s present in the system. The `Subscription` that is in the same process as
> the `Publisher` will actually discard the message, but it will be able to do that only after receiving and
> deserializing it.
There is a potential way to fix this, e.g.:
- Search for `ignore_participant`, `ignore_publication`, or `ignore_subscription` in the DDS standard: https://www.omg.org/spec/DDS/1.4/
- Note that we didn't use this option because at the time, OpenSplice didn't support it (years ago), and I don't know if Fast-RTPS supports it or not.
But this might be affected by our decision to change the relationship between nodes and participants (again trying to save resources).
This is the missing piece that prevents me from recommending we enable intra-process by default, by the way.
wjwwood
left a comment
The approach sounds reasonable to me, though I would add to the open issue that we're not reusing resources as much as we could because we create a ring buffer or FIFO queue for each subscription, but in reality we could have them share a single buffer in many cases. The overhead isn't extreme, as these ring buffers/queues only store pointers, but still in cases where there are many subscriptions in the same process (e.g. a process with hundreds of nodes that all subscribe to /tf) it could be a really good optimization.
I'll have another look later, but that's my first pass review. Sorry it took so long.
> However, it does not know whether the entities have intra-process communication enabled or not.
> Consequently, the real message will be delivered to both `Subscription`s present in the system. The
> `Subscription` that is in the same process as the `Publisher` will actually discard the message, but it will
> be able to do that only after receiving and deserializing it.
I still think it's worth pointing out that it should be possible to avoid this problem in the future, as I discussed here:
Otherwise it seems like an insurmountable technical issue (which I don't think is accurate).
I updated this section explaining that these problems are only present in the current implementation and pointing to possible solutions.
> `shared_ptr<const MessageT>` and `unique_ptr<MessageT>` that better fits with its callback type. This is
> deduced looking at the output of `AnySubscriptionCallback::use_take_shared_method()`.
>
> If the history QoS is set to `keep all`, the buffers are dynamically allocated. On the other hand, if the
This isn't how KEEP_ALL works in DDS, btw. There's another QoS that we're not currently exposing (but we should) which is max_samples, max_instances, and max_samples_per_instance, which is the actual limit for a KEEP_ALL since it doesn't have a depth. See:
Mmm could you clarify the difference between sample and instance?
See my last comment #1 (comment), I decoupled all the different components and in this way it will be much simpler to support different options for the length of the queue
Here's what the documentation for Connext [1] says:
> Keep all the samples.
> On the publishing side, RTI Connext will attempt to keep all samples (representing each value written) of each instance of data (identified by its key) managed by the DDSDataWriter until they can be delivered to all subscribers.
> On the subscribing side, RTI Connext will attempt to keep all samples of each instance of data (identified by its key) managed by the DDSDataReader. These samples are kept until the application takes them from RTI Connext via the take() operation.
@wjwwood seems to be suggesting that there's another limit above which the size of the history does not grow. That's reasonable I think.
In this intra-process proposal, the messages are delivered directly to the subscriber's buffer, so the publishing side does not need to do anything different. On the sub side, we would have to allow dynamic resizing of the buffer (i.e. a standard queue instead of a ring buffer) if this QoS is enabled.
Do you agree @wjwwood?
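To make the distinction concrete, here is a hedged sketch (plain C++ with invented names, not actual rclcpp or DDS code) of a subscription-side buffer: under `keep last` it overwrites the oldest sample once the history depth is reached, while under `keep all` it grows dynamically and is bounded only by a separate resource limit analogous to DDS `max_samples`:

```cpp
#include <cstddef>
#include <deque>
#include <utility>

enum class HistoryPolicy { KeepLast, KeepAll };

// Hypothetical sketch: one buffer honoring both history policies.
template <typename T>
class HistoryBuffer {
public:
  HistoryBuffer(HistoryPolicy policy, std::size_t depth, std::size_t max_samples)
  : policy_(policy), depth_(depth), max_samples_(max_samples) {}

  // Returns false if the sample was rejected (KeepAll at its resource limit).
  bool push(T value) {
    if (policy_ == HistoryPolicy::KeepLast) {
      if (data_.size() == depth_) {
        data_.pop_front();  // overwrite the oldest sample
      }
      data_.push_back(std::move(value));
      return true;
    }
    // KeepAll: no history depth, only a resource limit (akin to max_samples).
    if (data_.size() == max_samples_) {
      return false;
    }
    data_.push_back(std::move(value));
    return true;
  }

  T pop() {
    T v = std::move(data_.front());
    data_.pop_front();
    return v;
  }

  std::size_t size() const { return data_.size(); }

private:
  HistoryPolicy policy_;
  std::size_t depth_;        // used by KeepLast only
  std::size_t max_samples_;  // used by KeepAll only
  std::deque<T> data_;
};
```

The point of the sketch is that `depth` and `max_samples` are independent knobs, matching the observation that KEEP_ALL has no depth of its own.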
> intra-process `Subscription`s when it is created, but, eventually, a `transient local` `Subscription` in a
> different process joins. Initially, published messages are not passed to the middleware, since all the
> `Subscription`s are in the same process. This means that the middleware is not able to store old messages
> for eventual late-joiners.
Yeah, this one is a kind of a deal-breaker. We just have to publish to inter-process anytime this QoS is used. There's no choice.
In addition to the intra-process buffers, we can unconditionally forward the messages to the rmw layer if the transient_local QoS is used. That will address this issue, since the rmw layer should take care of sending these messages to late joiners from a different process.
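This rule can be captured in a tiny decision helper (a sketch with hypothetical names, not the actual rclcpp logic): a publisher may skip the rmw layer only when the durability is volatile and every matched subscription is intra-process.

```cpp
#include <cstddef>

// Hypothetical sketch of the publish-path decision. Names are illustrative,
// not actual rclcpp API.
struct PublisherState {
  std::size_t inter_process_subscription_count;
  bool transient_local;  // durability QoS
};

bool must_publish_through_rmw(const PublisherState & s) {
  // A transient_local publisher always hands the message to the middleware,
  // so it can store samples for late-joining subscriptions in other
  // processes, even if no inter-process subscription currently exists.
  return s.transient_local || s.inter_process_subscription_count > 0;
}
```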
> - The proposal does not take into account the problem of having a queue with twice the size when both inter
>   and intra-process communication are used. A node with a history depth of 10 will be able to store up to
>   20 messages without processing them (10 intra-process and 10 inter-process). This issue
>   is also present in the current implementation, since each `Subscription` is doubled.
I don't think this can be addressed though right? Also what does "A node with a history depth of 10" mean? Shouldn't you say publisher or subscription?
Yes, I meant subscription or publisher.
I don't have any nice idea for addressing this without entangling RMW and Intra-Process Manager
…ignore publications

I made a big update of the intra-process implementation in order to fix the type-erasing and to better prepare for the future. Briefly, this is how I implemented it now. Intra-process communication involves potentially different data-types, so by default I defined a hierarchy of classes for the buffers. This potentially allows using different buffer implementations (i.e. simple queues, ring buffers, wait-free implementations or whatever).
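The buffer hierarchy described above might look like the following sketch (plain C++ with invented names, only meant to show the shape of the abstraction): an abstract base class behind which different storage policies can sit, plus one concrete ring-buffer implementation that overwrites the oldest element when full.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the buffer abstraction; not actual rclcpp classes.
template <typename T>
class BufferBase {
public:
  virtual ~BufferBase() = default;
  virtual void enqueue(T value) = 0;
  virtual T dequeue() = 0;
  virtual bool has_data() const = 0;
};

// One possible implementation: a fixed-capacity ring buffer that
// overwrites the oldest element when full.
template <typename T>
class RingBuffer : public BufferBase<T> {
public:
  explicit RingBuffer(std::size_t capacity)
  : data_(capacity), capacity_(capacity) {}

  void enqueue(T value) override {
    data_[(head_ + size_) % capacity_] = std::move(value);
    if (size_ == capacity_) {
      head_ = (head_ + 1) % capacity_;  // full: drop the oldest element
    } else {
      ++size_;
    }
  }

  T dequeue() override {
    T v = std::move(data_[head_]);
    head_ = (head_ + 1) % capacity_;
    --size_;
    return v;
  }

  bool has_data() const override { return size_ > 0; }

private:
  std::vector<T> data_;
  std::size_t head_ = 0;
  std::size_t size_ = 0;
  std::size_t capacity_;
};
```

A simple queue or a wait-free buffer would plug in behind the same `BufferBase` interface.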
> show that creating a `Publisher` or a `Subscription` has a non-negligible memory cost.
>
> Moreover, when using the default RMW implementation, Fast-RTPS, the memory consumed by a ROS2 application
> increases almost exponentially with the number of `Publisher`s and `Subscription`s.
I think this argument alone is not a good rationale for avoiding the use of publishers and subscribers. ROS is a publish / subscribe middleware - instead of avoiding the use of publishers and subscribers, the memory consumption should be reduced where possible.
No disagreement there. It's more of a practical consideration (than philosophical). I hope you agree that exponential increase in memory with the number of pubs (or subs) is definitely something that can't be ignored. And every core design decision that increases this number makes ROS2 less and less friendly for a resource constrained system (e.g. 32-bit arm linux with 100s of MB of RAM).
This statement (and others about rmw) are simply mentioned as our motivation for creating a new intra-process design.
> In order to avoid the issue previously described, i.e., that a `Subscription` would receive both inter and
> intra-process messages, it is necessary to inform the underlying middleware that some "connections" will be
> handled at the `rclcpp` level, enabling the RMW to neglect them.
This functionality could be implemented independently of the separate subscriber-specific buffers to address the problem described in scenario 1. I think the proposal should separate this into two independent changes.
> | unique_ptr\<MessageT\> @1 | <ul><li>shared_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li></ul> | <ul><li>@1</li><li>@1</li></ul> |
> | unique_ptr\<MessageT\> @1 | <ul><li>unique_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li></ul> | <ul><li>@1</li><li>@2</li></ul> |
> | unique_ptr\<MessageT\> @1 | <ul><li>unique_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li></ul> | <ul><li>@1</li><li>@2</li><li>@2</li></ul> |
> | unique_ptr\<MessageT\> @1 | <ul><li>unique_ptr\<MessageT\></li><li>unique_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li><li>shared_ptr\<MessageT\></li></ul> | <ul><li>@1</li><li>@2</li><li>@3</li><li>@3</li></ul> |
The proposed approach with subscriber-specific buffers does come with a drawback, though, which isn't described in the current document. In the existing implementation the potential duplication of the message happens on demand when a subscriber actually takes the message. In the proposal the duplication happens directly at publishing time.
Consider a scenario (3) with N subscribers all taking a unique pointer. If the subscribers don't actually take the message (e.g. a subscriber is busy and the message is being overwritten due to QoS settings), the existing implementation doesn't require a copy, whereas the proposal does result in the copy taking place anyway. So in a high-load scenario, a message being skipped because the subscriber is e.g. too slow still adds to the system load.
In another scenario (4), the separate buffers with the already duplicated data can't optimize the number of copies in all cases. Consider two subscribers - one taking a shared pointer and one taking a unique pointer. In the current proposal the message is always copied once. In the existing code, with a single buffer serving both subscribers, the logic can avoid this if e.g. the first subscriber takes the shared pointer and already gives away its ownership of the shared pointer before the second subscriber even tries to take the message. In that example there is no requirement to actually copy the data.
One way to delay creating additional copies would be to always store shared pointers in the subscribers' buffers. Then, any additional copies (say, to convert to a unique pointer) are only done when the subscriber removes the message from the buffer. This directly addresses (3).
Regarding (4), I don't see how you can avoid the copy if two subscribers want the same message but one wants a shared pointer and the other a unique pointer. Could you perhaps elaborate?
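The deferred-copy idea can be sketched as follows (self-contained C++ with hypothetical names, not actual rclcpp code): the buffer always stores `shared_ptr<const MessageT>`, so a shared take never copies, and the copy for a unique take happens only at the moment the subscriber actually removes the message from the buffer.

```cpp
#include <deque>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch: a per-subscription buffer that stores shared
// pointers and defers any copy until a unique take actually happens.
template <typename MessageT>
class IntraProcessBuffer {
public:
  void push(std::shared_ptr<const MessageT> msg) {
    queue_.push_back(std::move(msg));
  }

  bool empty() const { return queue_.empty(); }

  // No copy: hand out shared ownership of the stored message.
  std::shared_ptr<const MessageT> take_shared() {
    auto msg = std::move(queue_.front());
    queue_.pop_front();
    return msg;
  }

  // Copy deferred to take time: only a subscriber that actually takes a
  // unique_ptr ever pays for a copy; skipped/overwritten messages never do.
  std::unique_ptr<MessageT> take_unique() {
    auto msg = take_shared();
    return std::make_unique<MessageT>(*msg);  // the copy happens here
  }

private:
  std::deque<std::shared_ptr<const MessageT>> queue_;
};
```

Note this sketch still always copies on a unique take; the move-instead-of-copy optimization for a single remaining subscriber discussed below is not attempted here.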
Could you perhaps elaborate?
As mentioned in the scenario: the first subscriber can get a const shared pointer of the existing message. After reading some data from it the subscriber releases the shared pointer. So the only reference of the message is the one in the buffer. If now the second subscriber wants to get a unique pointer the ownership can be passed from the buffer to the second subscriber. There was no need to copy the data because the usage of the message between the subscribers didn't overlap.
@dirk-thomas sorry, but I still don't understand exactly.
The first subscriber will call the method get, while the second will call pop.
Regardless whether the ring buffer contains a unique_ptr or a shared_ptr, for the first subscriber no copies are needed and the internal message is promoted to shared_ptr.
When the second subscriber asks for the message, there is now a shared_ptr inside the buffer, and I don't see the mechanism that allows avoiding the copy.
From what I can understand, @dirk-thomas is trying to describe the following scenario (note: the temporal aspect of the scenario is important):
time_0: subscriber_a wants a const shared pointer to the message (hence, no copy)
time_1: subscriber_a is done processing the message and relinquishes ownership of the message.
time_2: subscriber_b wants a unique pointer to the message (shared pointer is converted to unique pointer, thus avoiding the copy).
Requirement: time_2 comes after time_1
Is this correct, @dirk-thomas?
I think it is possible to do this optimization (essentially, move instead of copy for a single remaining sub), though it may not be trivial. The shared pointer counter alone cannot be used as a proxy for the number of subs, since the message may not have been copied to all the subscribers' buffers when this check is done, so this counter may underestimate the number of remaining subs.
Would you be okay with this being addressed as a future optimization, considering it may not be a very common case (large message + multiple subs with at least one sub, say A, requiring a unique ptr + A tries to get the message only after the other subs are done processing the message)?
I am not suggesting that this must be addressed. But I think this case should be described in the document and that the separate buffers make that kind of optimization more difficult.
> intra-process `Subscription`s when it is created, but, eventually, a `transient local` `Subscription` in a
> different process joins. Initially, published messages are not passed to the middleware, since all the
> `Subscription`s are in the same process. This means that the middleware is not able to store old messages
> for eventual late-joiners.
That is a good example for why a transient local publisher must publish to the RMW impl even if at the time of publishing no inter process subscribers exist.
Thanks @wjwwood for taking the time to review! The design to have independent buffers for each of the subscribers was deliberate:
@wjwwood @dirk-thomas Thanks for taking time to review the design proposal. We want to figure out what should be the next step(s). Ideally, we would want to close this PR and open an official design proposal and continue conversations there. We can address some of the topics that you have brought up before that:
Sound good?
dgoel
left a comment
I am closing this review in favor of the official ros2 design PR.
New IPC proposal
CC: @clalancette @raghaprasad