Skip to content

feat(shmem): expose side-channel metadata API for unsent messages#555

Merged
dennisklein merged 2 commits into
FairRootGroup:devfrom
rbx:shm-expose
Jun 10, 2026
Merged

feat(shmem): expose side-channel metadata API for unsent messages#555
dennisklein merged 2 commits into
FairRootGroup:devfrom
rbx:shm-expose

Conversation

@rbx

@rbx rbx commented Jun 10, 2026

Copy link
Copy Markdown
Member

Add two public entry points needed by the ALICE use case where shmem messages are allocated via a transport but never sent - their metadata is instead serialised into Arrow tables and delivered over a separate channel, allowing consumer devices to resolve the payload pointer without taking ownership.

shmem::Message::GetMeta() returns the MetaHeader of the message.

shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&) is a free function declared in Common.h and defined in Manager.cxx. Keeping it out of the TransportFactory class body means callers only need to include Common.h (available transitively via Message.h) and do not drag in Socket.h or zmq.h.

The implementation handles both managed segments and unmanaged regions, and throws SharedMemoryError with a typed message on a bad segment or region id.

Lifetime of the returned pointer is the caller's responsibility; the cache device is expected to hold the messages alive.

@dennisklein dennisklein left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This review was generated with Claude.

Overview

Adds two public entry points so ALICE-style consumers can resolve shmem payload pointers from metadata delivered out-of-band (e.g. serialized into Arrow tables) instead of via FairMQ channels:

  • shmem::Message::GetMeta() — returns the message's MetaHeader by value.
  • shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&) — free function declared in Common.h, defined in Manager.cxx, dispatching through shmem::TransportFactoryManager.

+158/−0 across 6 files, three new tests. No API breaks, purely additive.

Correctness — verified against the codebase, looks right

  • GetMeta() aggregate init order matches MetaHeader exactly (fSize, fHint, fHandle, fShared, fRegionId, fSegmentId, fManaged). ✔
  • Manager::GetDataAddressFromHandle() faithfully mirrors Message::GetData() for all three paths: zero-size managed → nullptr, managed → ShmHeader::UserPtr(GetAddressFromHandle(...)), unmanaged → region->GetData() + fHandle. ✔
  • It's actually safer than GetData() on failure: GetSegment() swallows open failures with just a LOG(error), after which GetData() would die on an uncaught std::out_of_range from fSegments.at(). The new code re-checks the map and throws a typed SharedMemoryError instead. Good.
  • The static_cast downcast in the free function is guarded by the GetType() != Transport::SHM check, and shmem::TransportFactory is final. ✔
  • Returning a raw pointer rather than constructing a Message from the header is the right call — a real shmem::Message would deallocate/deref-count the buffer on destruction, which is exactly what this no-ownership path must avoid.
  • The #include "TransportFactory.h" in Manager.cxx doesn't break the boost/process compile-firewall — that comment is about keeping boost/process out of headers; this include only flows zmq.h into a .cxx that's already heavy.

Issues

1. CI is red because of an unused parameter (must fix). SideChannelUnmanaged(const string& address) never uses address — unlike the neighboring tests, it never binds/connects a channel. The static-analysis job fails on exactly this (_message.cxx:497: -Wunused-parameter / misc-unused-parameters). Drop the parameter and the call-site argument; all other CI jobs (gcc 12–15, asan/ubsan, tsan, boost 1.87) are green.

2. Negative caching makes the unmanaged error path permanently sticky — worth documenting. GetRegionFromCache() caches nullptr results in the thread-local cache, and the generation counter fRegionsGen only bumps on local region create/remove. So if a consumer resolves a MetaHeader before the producer's region is discoverable, the SharedMemoryError will keep being thrown on that thread forever — retrying never heals. This is pre-existing behavior shared with GetData(), but this PR is the first API explicitly designed for cross-process resolution where arrival order isn't guaranteed, so the doc comment should warn that the call is not retryable on region-not-found (or the cache should skip caching nullptr).

3. Tests never exercise the actual cross-factory use case. Both round-trip tests resolve metadata through the same factory that allocated the message, so the segment/region is already in the local maps — the open_only attach paths (GetSegment's lazy open, GetRegion's remote lookup) are never hit through the new API, yet that's the entire production scenario. Suggest creating a second factory with the same session id and resolving there.

Minor / nits

  • Missing error test for the unmanaged branch: a fManaged = false, bogus fRegionId case (EXPECT_THROW(..., SharedMemoryError)) would complete the error matrix in SideChannelErrors().
  • Name: GetDataAddressFromHandle takes a MetaHeader, not a handle (unlike Manager::GetAddressFromHandle, which does). GetDataAddressFromMeta would be more accurate — worth considering before the API is set in stone.
  • GetMeta()'s positional aggregate init silently misassigns if MetaHeader fields are ever reordered (the size_t/handle_t/uint16_t pairs would still compile). Low risk since MetaHeader is the wire format, but a brief comment on the dependency wouldn't hurt.
  • The bad-segment-id test will emit a LOG(error) from GetSegment before the throw — cosmetic noise in test output, fine to leave.

Verdict

The implementation is correct and the design rationale (free function in Common.h to avoid dragging zmq.h into consumers; raw pointer to avoid ownership) is sound. Needs work before merge: fix the unused address parameter (it's the only thing failing CI), and ideally add a second-factory test plus a doc note on the non-retryable region-not-found behavior.

rbx added 2 commits June 10, 2026 15:19
Add two public entry points needed by the ALICE use case where shmem
messages are allocated via a transport but never sent — their metadata
is instead serialised into Arrow tables and delivered over a separate
channel, allowing consumer devices to resolve the payload pointer
without taking ownership.

shmem::Message::GetMeta() returns the MetaHeader of the message,
mirroring the existing positional-init pattern already used in Socket.h.

shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&)
is a free function declared in Common.h and defined in Manager.cxx.
Keeping it out of the TransportFactory class body means callers only
need to include Common.h (available transitively via Message.h) and do
not drag in Socket.h or zmq.h. The implementation handles both managed
segments and unmanaged regions, and throws SharedMemoryError with a
typed message on a bad segment or region id. TransportFactory also
gains a same-named member for callers that already have the concrete
type. Lifetime of the returned pointer is the caller's responsibility;
the cache device is expected to hold the messages alive.

A SideChannel test covers the GetMeta/GetDataAddressFromHandle
round-trip for both standard and expanded-metadata configurations.
A failed region lookup was inserted into the thread-local cache as
nullptr, making the failure permanent for the lifetime of the cache
generation - retrying never healed because the fast path would return
nullptr without calling GetRegion again. Skip the cache insert on
failure so subsequent calls retry the slow path.
@FairRootGroup FairRootGroup deleted a comment from coderabbitai Bot Jun 10, 2026
@aalkin

aalkin commented Jun 10, 2026

Copy link
Copy Markdown

Thank you for preparing this implementation, indeed, this is cleaner than my attempt in #551. For my own curiosity, what is the reason for the implementations of TransportFactory and others to be header-only?

@dennisklein

dennisklein commented Jun 10, 2026

Copy link
Copy Markdown
Member

For my own curiosity, what is the reason for the implementations of TransportFactory and others to be header-only?

Originally, transport implementations used to be private. But over time more and more of the shmem details have been exposed instead of changing/redesigning the abstract interfaces. The changes to header-only (ccbf0be, df574c6) were performed when they were still private, unfortunately, the commit msgs do not explain it. Perhaps @rbx remembers. I guess just organizationally, fewer files.

@rbx

rbx commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Generally, having everything available in the headers should give the compiler more room and information for optimization. That was the primary motivation back then, although it may be becoming less relevant today with various LTO techniques. Naturally, at some point the compile time will climb so it becomes a trade-off between optimization potential and compile time. In principle I am not against changing this if there is a good reason to do so. But that should go in a separate PR to keep it focused.

@dennisklein dennisklein merged commit 1597999 into FairRootGroup:dev Jun 10, 2026
9 checks passed
@rbx

rbx commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

@aalkin do you need a tag with this feature?

@aalkin

aalkin commented Jun 10, 2026

Copy link
Copy Markdown

@rbx yes please, we would like to integrate it as soon as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants