feat(shmem): expose side-channel metadata API for unsent messages#555
Conversation
dennisklein
left a comment
There was a problem hiding this comment.
Note
This review was generated with Claude.
Overview
Adds two public entry points so ALICE-style consumers can resolve shmem payload pointers from metadata delivered out-of-band (e.g. serialized into Arrow tables) instead of via FairMQ channels:
shmem::Message::GetMeta()— returns the message'sMetaHeaderby value.shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&)— free function declared inCommon.h, defined inManager.cxx, dispatching throughshmem::TransportFactory→Manager.
+158/−0 across 6 files, three new tests. No API breaks, purely additive.
Correctness — verified against the codebase, looks right
GetMeta()aggregate init order matchesMetaHeaderexactly (fSize, fHint, fHandle, fShared, fRegionId, fSegmentId, fManaged). ✔Manager::GetDataAddressFromHandle()faithfully mirrorsMessage::GetData()for all three paths: zero-size managed →nullptr, managed →ShmHeader::UserPtr(GetAddressFromHandle(...)), unmanaged →region->GetData() + fHandle. ✔- It's actually safer than
GetData()on failure:GetSegment()swallows open failures with just aLOG(error), after whichGetData()would die on an uncaughtstd::out_of_rangefromfSegments.at(). The new code re-checks the map and throws a typedSharedMemoryErrorinstead. Good. - The
static_castdowncast in the free function is guarded by theGetType() != Transport::SHMcheck, andshmem::TransportFactoryisfinal. ✔ - Returning a raw pointer rather than constructing a
Messagefrom the header is the right call — a realshmem::Messagewould deallocate/deref-count the buffer on destruction, which is exactly what this no-ownership path must avoid. - The
#include "TransportFactory.h"in Manager.cxx doesn't break the boost/process compile-firewall — that comment is about keeping boost/process out of headers; this include only flows zmq.h into a .cxx that's already heavy.
Issues
1. CI is red because of an unused parameter (must fix). SideChannelUnmanaged(const string& address) never uses address — unlike the neighboring tests, it never binds/connects a channel. The static-analysis job fails on exactly this (_message.cxx:497: -Wunused-parameter / misc-unused-parameters). Drop the parameter and the call-site argument; all other CI jobs (gcc 12–15, asan/ubsan, tsan, boost 1.87) are green.
2. Negative caching makes the unmanaged error path permanently sticky — worth documenting. GetRegionFromCache() caches nullptr results in the thread-local cache, and the generation counter fRegionsGen only bumps on local region create/remove. So if a consumer resolves a MetaHeader before the producer's region is discoverable, the SharedMemoryError will keep being thrown on that thread forever — retrying never heals. This is pre-existing behavior shared with GetData(), but this PR is the first API explicitly designed for cross-process resolution where arrival order isn't guaranteed, so the doc comment should warn that the call is not retryable on region-not-found (or the cache should skip caching nullptr).
3. Tests never exercise the actual cross-factory use case. Both round-trip tests resolve metadata through the same factory that allocated the message, so the segment/region is already in the local maps — the open_only attach paths (GetSegment's lazy open, GetRegion's remote lookup) are never hit through the new API, yet that's the entire production scenario. Suggest creating a second factory with the same session id and resolving there.
Minor / nits
- Missing error test for the unmanaged branch: a
fManaged = false, bogusfRegionIdcase (EXPECT_THROW(..., SharedMemoryError)) would complete the error matrix inSideChannelErrors(). - Name:
GetDataAddressFromHandletakes aMetaHeader, not a handle (unlikeManager::GetAddressFromHandle, which does).GetDataAddressFromMetawould be more accurate — worth considering before the API is set in stone. GetMeta()'s positional aggregate init silently misassigns ifMetaHeaderfields are ever reordered (the size_t/handle_t/uint16_t pairs would still compile). Low risk sinceMetaHeaderis the wire format, but a brief comment on the dependency wouldn't hurt.- The bad-segment-id test will emit a
LOG(error)fromGetSegmentbefore the throw — cosmetic noise in test output, fine to leave.
Verdict
The implementation is correct and the design rationale (free function in Common.h to avoid dragging zmq.h into consumers; raw pointer to avoid ownership) is sound. Needs work before merge: fix the unused address parameter (it's the only thing failing CI), and ideally add a second-factory test plus a doc note on the non-retryable region-not-found behavior.
Add two public entry points needed by the ALICE use case where shmem messages are allocated via a transport but never sent — their metadata is instead serialised into Arrow tables and delivered over a separate channel, allowing consumer devices to resolve the payload pointer without taking ownership. shmem::Message::GetMeta() returns the MetaHeader of the message, mirroring the existing positional-init pattern already used in Socket.h. shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&) is a free function declared in Common.h and defined in Manager.cxx. Keeping it out of the TransportFactory class body means callers only need to include Common.h (available transitively via Message.h) and do not drag in Socket.h or zmq.h. The implementation handles both managed segments and unmanaged regions, and throws SharedMemoryError with a typed message on a bad segment or region id. TransportFactory also gains a same-named member for callers that already have the concrete type. Lifetime of the returned pointer is the caller's responsibility; the cache device is expected to hold the messages alive. A SideChannel test covers the GetMeta/GetDataAddressFromHandle round-trip for both standard and expanded-metadata configurations.
A failed region lookup was inserted into the thread-local cache as nullptr, making the failure permanent for the lifetime of the cache generation - retrying never healed because the fast path would return nullptr without calling GetRegion again. Skip the cache insert on failure so subsequent calls retry the slow path.
|
Thank you for preparing this implementation, indeed, this is cleaner than my attempt in #551. For my own curiosity, what is the reason for the implementations of TransportFactory and others to be header-only? |
Originally, transport implementations used to be private. But over time more and more of the shmem details have been exposed instead of changing/redesigning the abstract interfaces. The changes to header-only (ccbf0be, df574c6) were performed when they were still private, unfortunately, the commit msgs do not explain it. Perhaps @rbx remembers. I guess just organizationally, fewer files. |
|
Generally, having everything available in the headers should give the compiler more room and information for optimization. That was the primary motivation back then, although it may be becoming less relevant today with various LTO techniques. Naturally, at some point the compile time will climb so it becomes a trade-off between optimization potential and compile time. In principle I am not against changing this if there is a good reason to do so. But that should go in a separate PR to keep it focused. |
|
@aalkin do you need a tag with this feature? |
|
@rbx yes please, we would like to integrate it as soon as possible. |
Add two public entry points needed by the ALICE use case where shmem messages are allocated via a transport but never sent - their metadata is instead serialised into Arrow tables and delivered over a separate channel, allowing consumer devices to resolve the payload pointer without taking ownership.
shmem::Message::GetMeta()returns the MetaHeader of the message.shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&)is a free function declared in Common.h and defined in Manager.cxx. Keeping it out of the TransportFactory class body means callers only need to include Common.h (available transitively via Message.h) and do not drag in Socket.h or zmq.h.The implementation handles both managed segments and unmanaged regions, and throws
SharedMemoryErrorwith a typed message on a bad segment or region id.Lifetime of the returned pointer is the caller's responsibility; the cache device is expected to hold the messages alive.