[ALICE] Expose some shm API to allow shm message metadata to be passed through a side channel #551

Closed
aalkin wants to merge 3 commits into FairRootGroup:dev from aalkin:v1.10.1-alice

Conversation

aalkin commented Jun 9, 2026

We have a specific use case: a cache of preallocated shared memory messages that other devices access through a time-based table. To avoid complicated synchronization, we send a table containing only the messages' metadata. To access the objects in shared memory that this metadata refers to, the target device needs to calculate the device-local pointer to the corresponding memory. This is achieved by exposing the shared memory manager API from the transport of the specific channel that is used to send the metadata table. This approach lets us centralize the cache management and re-use the FairMQ-based shared memory management, so the client code remains largely unchanged.

To summarize: the shmem message now exposes its metadata through a public method, and that metadata is transferred to a target device. The shmem transport exposes its shmem manager, which in turn exposes an API to get the local pointer from the message handle and managed segment id on the target device. This is, of course, a draft; any suggestions on how to handle this better (specifically, in better alignment with the FairMQ architecture) are welcome.
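The pointer-resolution step described above can be illustrated with a minimal, self-contained sketch. It does not use FairMQ or Boost.Interprocess: the two "mappings" are simulated by two local buffers with identical contents, and `MetaSketch`, `ToHandle`, `FromHandle`, and `ResolveInSecondMapping` are hypothetical names invented for illustration. The only point is that a handle is an offset relative to the segment base, so it stays meaningful when the same segment is mapped at a different address in another process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the metadata sent through the side channel.
// The real shmem metadata carries more fields; these two are assumptions.
struct MetaSketch {
    std::ptrdiff_t handle; // offset of the payload from the segment base
    std::size_t size;      // payload size in bytes
};

// Producer side: turn a device-local pointer into a position-independent handle.
MetaSketch ToHandle(const char* segmentBase, const char* payload, std::size_t size) {
    return MetaSketch{payload - segmentBase, size};
}

// Consumer side: turn the handle back into a pointer valid in this mapping.
const char* FromHandle(const char* segmentBase, const MetaSketch& meta) {
    return segmentBase + meta.handle;
}

// Walks through both sides inside a single process.
std::string ResolveInSecondMapping() {
    // "Producer mapping": write a payload at offset 16.
    std::vector<char> producerView(64, 0);
    const char text[] = "calib-object";
    std::memcpy(producerView.data() + 16, text, sizeof(text));
    const MetaSketch meta =
        ToHandle(producerView.data(), producerView.data() + 16, sizeof(text));

    // "Consumer mapping": the same bytes at a different base address
    // (a copy here; with real shared memory it would be the same pages).
    std::vector<char> consumerView(producerView);
    const char* local = FromHandle(consumerView.data(), meta);

    assert(local != producerView.data() + 16); // different address, same content
    return std::string(local, meta.size - 1);  // drop the trailing '\0'
}
```

The producer-local pointer itself would be useless to the consumer; only the offset survives the trip through the side channel.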

Since "shmem/TransportFactory.h" needs to be included in the client code, the class is out-of-lined so that we do not need to expose other internal headers or link to ZeroMQ directly.

ktf commented Jun 9, 2026

Contributor

@rbx does this make sense to you? Do you have any better suggestions on how to deal with this?

dennisklein commented Jun 9, 2026

Member

Can you elaborate (e.g. on a simplified pseudo topology) which device owns the msg and which will need which (read/write) access to it (also with regard to timing/parallel access)? I am trying to understand which features of the fmq memory mgmt you still need while you want to opt-out of the msg api. This is not clear to me yet.

aalkin commented Jun 9, 2026

Author

The objects that we want to cache have validity intervals expressed as timestamps in the data. Those intervals can span many timeframes or less than one timeframe; in both cases, the border between two intervals, where we need to change the object, is in general not a timeframe border. To complicate things further, there are several consuming devices with different processing rates.

Our solution is a centralized cache that retrieves and stores the corresponding objects based on the timestamp tables of the timeframes currently in use. The objects are then transparently delivered to the consuming devices through a table isomorphic to the timestamps table. Delivering the objects through messages would be overly complicated, since the same message may need to be sent for several consecutive timeframes or, conversely, several messages may be needed for a single timeframe, and this is for just one such object.

Instead, the objects are allocated in shared memory as messages in the source device, using the transport of the channels pointing to the consuming devices, but are not sent. What is sent are Arrow tables, isomorphic to the timestamps tables, with each row containing the metadata of the corresponding unsent message that holds the object for that particular timestamp. This way, control is not passed to the consuming devices and stays with the central cache, but the devices can still access the content of the unsent messages, provided the pointer can be inferred from the metadata.

Specifically, we use the preconfigured channels, with their transport, to allocate the messages and then send their metadata in an unrelated message. The consumers, having access to the same channel, are able to use the contents of those unsent messages, while the cache is still managed by a single device. The consumers need read-only access and are not concerned with the validity intervals or lifetime of the objects. The cache drops everything that belongs to a timeframe already reported as consumed and does not send a new metadata table until all of the objects are ready.
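The delivery scheme described above can be sketched roughly as follows. This is a toy model, not ALICE or FairMQ code: `CachedObject`, `MetaRow`, and `BuildMetaTable` are hypothetical names, a plain `std::vector` stands in for the Arrow table, and validity intervals are assumed to be half-open, `[validFrom, validUntil)`. For each timestamp of a timeframe, the cache emits one row with the metadata of the object valid at that time, so the result has as many rows as the timestamps table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical metadata row; stands in for the message metadata from the PR.
struct MetaRow {
    std::uint64_t handle; // where the consumer finds the object
    std::size_t size;     // object size in bytes
};

// One cached object together with its (assumed half-open) validity interval.
struct CachedObject {
    std::uint64_t validFrom;
    std::uint64_t validUntil;
    MetaRow meta;
};

// Fills `out` with one row per timestamp and returns true.
// Returns false and leaves `out` untouched if any timestamp has no valid
// object yet; the cache would then hold the table back until all are ready.
bool BuildMetaTable(const std::vector<CachedObject>& cache,
                    const std::vector<std::uint64_t>& timestamps,
                    std::vector<MetaRow>& out) {
    std::vector<MetaRow> table;
    table.reserve(timestamps.size());
    for (std::uint64_t ts : timestamps) {
        const CachedObject* match = nullptr;
        for (const CachedObject& obj : cache) {
            if (obj.validFrom <= ts && ts < obj.validUntil) {
                match = &obj;
                break;
            }
        }
        if (match == nullptr) {
            return false; // not all objects are ready
        }
        table.push_back(match->meta);
    }
    out.swap(table);
    return true;
}
```

Note that consecutive rows may repeat the same metadata when one object covers several timestamps, which is exactly the case that would require re-sending the same message in a message-based design.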

I hope this clarifies the intent.

@dennisklein

Member

Thx for the explanation, I think I got the constraints now. I don't object to your proposal in this PR.

One alternative that still comes to mind is that you create a boost::interprocess::managed_shared_memory directly (skipping fmq entirely for this cache), which would mean your side-channel metadata table additionally needs to carry the segment handle. I may be overlooking something, but I don't yet see the big advantage of using the fmq memory abstraction here.

ktf commented Jun 9, 2026

Contributor

I'd rather keep the fairmq abstraction, actually. I do not want to have a parallel transport which needs to be configured and so on.

aalkin commented Jun 9, 2026

Author

Indeed, this could also be achieved with manually managed shared memory. However, going through the FairMQ API is easier simply because everything is already preconfigured in the workflow deployment; we can just re-use the existing transport with minimal changes.

aalkin marked this pull request as ready for review June 10, 2026 08:18
Comment thread fairmq/shmem/Manager.h Outdated
rbx commented Jun 10, 2026

Member

I think the use case is well-motivated and keeping the cache centralised while letting consumers resolve pointers from metadata is fine.

But there are some issues with this implementation:

  • inline on out-of-line methods. The header declares all factory methods inline but their definitions now live exclusively in TransportFactory.cxx. The inline keywords should be dropped.

  • The implementation unconditionally uses UserPtr(GetAddressFromHandle(...)), which is only valid for managed-segment messages. MetaHeader carries fManaged and fRegionId, so an unmanaged-region message would fail.

  • GetManager() exposes too much surface. Returning Manager* drags the entire Manager API - including all Boost.Interprocess internals - into public headers, which partially defeats the purpose of out-lining TransportFactory.h in the first place.
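The managed-versus-unmanaged point above can be illustrated with a small mock. Nothing here is the real FairMQ `Manager`: `MetaSketch`, `ManagerSketch`, and their members are hypothetical, and only the `managed` flag and `regionId` mirror the `fManaged` and `fRegionId` fields named in the comment. The sketch shows why a resolver has to branch on the flag: the same handle value is an offset into a managed segment in one case and an offset into a separately mapped region in the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical metadata; field names are assumptions for this sketch.
struct MetaSketch {
    std::size_t handle;      // offset inside the segment or region
    std::uint16_t segmentId; // used when managed == true
    std::uint16_t regionId;  // used when managed == false
    bool managed;            // mirrors the fManaged flag from the comment
};

// Mock manager: plain byte vectors stand in for mapped shared memory.
class ManagerSketch {
  public:
    void AddSegment(std::uint16_t id, std::size_t bytes) { fSegments[id].assign(bytes, 0); }
    void AddRegion(std::uint16_t id, std::size_t bytes) { fRegions[id].assign(bytes, 0); }

    // Resolves a handle for both kinds of message instead of assuming
    // that every handle refers to a managed segment.
    char* GetDataAddressFromHandle(const MetaSketch& meta) {
        std::map<std::uint16_t, std::vector<char>>& pool =
            meta.managed ? fSegments : fRegions;
        const std::uint16_t id = meta.managed ? meta.segmentId : meta.regionId;
        auto it = pool.find(id);
        if (it == pool.end() || meta.handle >= it->second.size()) {
            throw std::out_of_range("unknown segment/region or handle out of bounds");
        }
        return it->second.data() + meta.handle;
    }

  private:
    std::map<std::uint16_t, std::vector<char>> fSegments;
    std::map<std::uint16_t, std::vector<char>> fRegions;
};
```

A resolver that ignored the flag would return an address inside the managed segment for a region message, which is a valid-looking pointer to the wrong data.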

I propose an alternative that covers the same use case without GetManager() or the refactor:

  • shmem::Message::GetMeta() - same as yours; returns a copy of the MetaHeader.

  • Manager::GetDataAddressFromHandle(const MetaHeader&) - handles both managed segments and unmanaged regions.

  • shmem::GetDataAddressFromHandle(fair::mq::TransportFactory&, const MetaHeader&) - a free function declared in shmem/Common.h. Callers only need <fairmq/shmem/Common.h>, which is already transitively available via <fairmq/shmem/Message.h>, so there is no exposure of zmq.h or Manager.h internals in client code. TransportFactory also gains a same-named thin forwarding member for callers who already have the concrete type.
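The shape of the proposed free function can be sketched as below. This is an assumption-laden mock, not the eventual FairMQ code: the class names echo the ones in the comment but are redefined locally in a `sketch` namespace, the `dynamic_cast` dispatch and the exception for a non-shmem transport are guesses at one possible design, and `MetaHeader` is reduced to a single offset. What it shows is that client code can hold only the abstract factory type and still obtain a local address without ever seeing a `Manager`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

namespace sketch {

// Reduced, hypothetical metadata: just an offset into the segment.
struct MetaHeader {
    std::size_t handle;
};

// Stand-in for the abstract transport factory that client code already holds.
class TransportFactory {
  public:
    virtual ~TransportFactory() = default;
};

namespace shmem {

// Stand-in for the concrete shmem factory; a byte vector replaces the
// mapped shared memory and the manager that would normally own it.
class TransportFactory final : public sketch::TransportFactory {
  public:
    explicit TransportFactory(std::size_t bytes) : fSegment(bytes, 0) {}

    // Thin member for callers that already have the concrete type.
    void* GetDataAddressFromHandle(const MetaHeader& meta) {
        if (meta.handle >= fSegment.size()) {
            throw std::out_of_range("handle outside segment");
        }
        return fSegment.data() + meta.handle;
    }

  private:
    std::vector<char> fSegment;
};

// Free function: accepts the abstract type, so no manager or transport
// internals appear in the caller's code. The dispatch shown is an assumption.
void* GetDataAddressFromHandle(sketch::TransportFactory& factory, const MetaHeader& meta) {
    auto* shm = dynamic_cast<TransportFactory*>(&factory);
    if (shm == nullptr) {
        throw std::invalid_argument("transport is not shmem");
    }
    return shm->GetDataAddressFromHandle(meta);
}

} // namespace shmem

// A different transport, used only to exercise the error path.
class OtherTransportFactory final : public TransportFactory {};

} // namespace sketch
```

In a real header only the declaration of the free function would need to be visible to clients; the cast and the forwarding call could live in a source file.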
