Fix macOS and Windows shared mem wait for detach#7321
Fix macOS and Windows shared mem wait for detach#7321DennisOSRM merged 5 commits intoProject-OSRM:masterfrom
Conversation
TheMarex
left a comment
There was a problem hiding this comment.
Looks good, some smaller questions. Thanks for breaking it out!
| throw util::exception("shmctl encountered an error: " + errorToMessage(error_code) + | ||
| SOURCE_REF); | ||
| ::shmid_ds xsi_ds; | ||
| if (::shmctl(xsi.get_shmid(), IPC_STAT, &xsi_ds) < 0) |
There was a problem hiding this comment.
If I read this correctly, in the case of an error we swallow that and just return? Previously it would have thrown an exception.
There was a problem hiding this comment.
But we are expecting an error here. The error being that the region is gone.
There was a problem hiding this comment.
But can we check for a precise error code in that case? There may be other reasons this call errors.
There was a problem hiding this comment.
If we got there, the only plausible error that still can occur is: "EIDRM shmid points to a removed identifier". And that is exactly what we are waiting for to happen. I can test for this error and throw if any other error happens.
There was a problem hiding this comment.
IMHO it makes it a bit more clear that we are waiting for something specific to happen.
| #else | ||
| // Windows - specific code | ||
|
|
||
| // POSIX shared mem |
There was a problem hiding this comment.
I don't remember exactly why we used XSI over POSIX. Maybe issues with OSX at some point? To be honest we were originally planning to drop the whole hot-swapping thing in favor of mmap. The hotswap/shmem was conceived almost 15 years ago when the typical deployment process was "run it on a big server", not rolling deployments behind a load balancer.
Would have been nice for the 6.0 to clean this up but alas.
d738492 to
8bc6e77
Compare
There was a problem hiding this comment.
Pull request overview
This pull request aims to fix an issue where osrm-datastore on macOS and Windows did not properly wait for other processes to detach from shared memory before exiting, causing test failures and performance issues. The PR refactors the shared memory implementation by:
Changes:
- Extracting shared memory logic from headers into a new
src/storage/shared_memory.cppimplementation file - Implementing a
WaitForDetachfunction for all platforms (Linux/macOS/Windows) - Renaming
shm_keytoproj_idthroughout the codebase for clarity - Adding proper wait logic when removing old shared memory regions in
storage.cpp
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/storage/shared_memory.cpp |
New implementation file containing SharedMemory class and platform-specific helper functions |
include/storage/shared_memory.hpp |
Refactored to be a cleaner header with forward declarations instead of full implementation |
src/storage/storage.cpp |
Updated to call WaitForDetach when removing old memory regions; renamed shm_key to proj_id |
include/storage/shared_datatype.hpp |
Introduced ProjID type alias and renamed shm_key to proj_id throughout |
src/tools/store.cpp |
Updated to use proj_id and new function signatures |
src/engine/datafacade/shared_memory_allocator.cpp |
Updated to use proj_id and new API |
include/engine/datafacade/shared_memory_allocator.hpp |
Updated function signature to use ProjID |
include/engine/data_watchdog.hpp |
Updated to use proj_id and reordered log statement |
include/storage/shared_monitor.hpp |
Added missing includes and minor parentheses fix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * @param key A ProjID | ||
| * @return bool Returns true if the region exists | ||
| */ | ||
| bool RegionExists(const ProjID proj_id); | ||
| /** | ||
| * @brief Destroys the shared memory region | ||
| * | ||
| * @param key A valid ProjID | ||
| * @return bool returns false on error. | ||
| */ | ||
| bool Remove(const ProjID proj_id); | ||
| /** | ||
| * @brief Waits for all processes to detach from the shared memory region | ||
| * | ||
| * @param key A ProjID |
There was a problem hiding this comment.
The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A valid ProjID". It should say "@param proj_id A valid ProjID" to match the actual parameter name.
| * @param key A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param key A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param key A ProjID | |
| * @param proj_id A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param proj_id A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param proj_id A ProjID |
|
|
||
| // The second parameter passed to ftok(). See: man 3 ftok | ||
| // It should actually be an int, but for compatibility with earlier versions of OSRM it | ||
| // is an uint16. It should't matter since, according to the man page, only the lowest 8 |
There was a problem hiding this comment.
Typo in comment: "should't" should be "shouldn't".
| // is an uint16. It should't matter since, according to the man page, only the lowest 8 | |
| // is an uint16. It shouldn't matter since, according to the man page, only the lowest 8 |
| #include "storage/shared_datatype.hpp" | ||
| #include <boost/interprocess/shared_memory_object.hpp> | ||
| #include <cstring> | ||
| #include <string> | ||
| #include <thread> |
There was a problem hiding this comment.
Missing necessary includes. The file uses util::Log, util::exception, and SOURCE_REF but does not include "util/log.hpp", "util/exception.hpp", or "util/exception_utils.hpp". It also uses std::ofstream (line 203) without including fstream, and std::filesystem without including filesystem (though these might come from shared_datatype.hpp).
| #include "storage/shared_datatype.hpp" | |
| #include <boost/interprocess/shared_memory_object.hpp> | |
| #include <cstring> | |
| #include <string> | |
| #include <thread> | |
| #include "storage/shared_datatype.hpp" | |
| #include "util/log.hpp" | |
| #include "util/exception.hpp" | |
| #include "util/exception_utils.hpp" | |
| #include <boost/interprocess/shared_memory_object.hpp> | |
| #include <cstring> | |
| #include <string> | |
| #include <thread> | |
| #include <fstream> | |
| #include <filesystem> |
| * Returns the contents of the environment variable SHM_LOCK_DIR if set else the system | ||
| * temp directory. | ||
| * | ||
| * @return std::filesystem::path The directory to usew for lock files |
There was a problem hiding this comment.
The documentation has a typo: "usew" should be "use".
| * @return std::filesystem::path The directory to usew for lock files | |
| * @return std::filesystem::path The directory to use for lock files |
| * @param key A ProjID | ||
| * @return bool Returns true if the region exists | ||
| */ | ||
| bool RegionExists(const ProjID proj_id); | ||
| /** | ||
| * @brief Destroys the shared memory region | ||
| * | ||
| * @param key A valid ProjID | ||
| * @return bool returns false on error. | ||
| */ | ||
| bool Remove(const ProjID proj_id); | ||
| /** | ||
| * @brief Waits for all processes to detach from the shared memory region | ||
| * | ||
| * @param key A ProjID |
There was a problem hiding this comment.
The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A ProjID". It should say "@param proj_id A ProjID" to match the actual parameter name.
| * @param key A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param key A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param key A ProjID | |
| * @param proj_id A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param proj_id A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param proj_id A ProjID |
| * @param key A ProjID | ||
| * @return bool Returns true if the region exists | ||
| */ | ||
| bool RegionExists(const ProjID proj_id); | ||
| /** | ||
| * @brief Destroys the shared memory region | ||
| * | ||
| * @param key A valid ProjID | ||
| * @return bool returns false on error. | ||
| */ | ||
| bool Remove(const ProjID proj_id); | ||
| /** | ||
| * @brief Waits for all processes to detach from the shared memory region | ||
| * | ||
| * @param key A ProjID |
There was a problem hiding this comment.
The @param documentation is inconsistent. The parameter name is "proj_id" but the documentation says "key A ProjID". It should say "@param proj_id A ProjID" to match the actual parameter name.
| * @param key A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param key A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param key A ProjID | |
| * @param proj_id A ProjID | |
| * @return bool Returns true if the region exists | |
| */ | |
| bool RegionExists(const ProjID proj_id); | |
| /** | |
| * @brief Destroys the shared memory region | |
| * | |
| * @param proj_id A valid ProjID | |
| * @return bool returns false on error. | |
| */ | |
| bool Remove(const ProjID proj_id); | |
| /** | |
| * @brief Waits for all processes to detach from the shared memory region | |
| * | |
| * @param proj_id A ProjID |
| { | ||
| return blocks.find(name) != blocks.end(); | ||
| } | ||
| inline bool HasBlock(const std::string &name) const { return blocks.contains(name); } |
There was a problem hiding this comment.
The change from blocks.find(name) != blocks.end() to blocks.contains(name) requires C++20. Ensure that the project's minimum C++ standard is set to C++20 or this will cause compilation errors on older standards.
| inline bool HasBlock(const std::string &name) const { return blocks.contains(name); } | |
| inline bool HasBlock(const std::string &name) const | |
| { | |
| return blocks.find(name) != blocks.end(); | |
| } |
| try | ||
| { | ||
| while (true) | ||
| { | ||
| shared_memory_object shm_obj(open_only, name, read_write); | ||
| if (--timeout < 0) | ||
| break; | ||
| std::this_thread::sleep_for(std::chrono::milliseconds(1)); | ||
| } | ||
| } | ||
| catch (interprocess_exception &) | ||
| { | ||
| // already detached | ||
| } |
There was a problem hiding this comment.
The WaitForDetach implementation for Windows/macOS has a fundamental issue. After calling shared_memory_object::remove(), the shared memory name is immediately unlinked from the filesystem. Any subsequent attempt to open it with open_only will fail, even if processes are still attached to the underlying memory. This means the while loop at line 174-180 will immediately throw an exception and exit, not actually waiting for processes to detach. This defeats the purpose of the PR, which is to wait for clients to detach before proceeding. A proper solution would need to either: (1) not call Remove before WaitForDetach, or (2) use a different mechanism to detect when processes have detached (though POSIX shared memory doesn't provide this capability).
| try | |
| { | |
| while (true) | |
| { | |
| shared_memory_object shm_obj(open_only, name, read_write); | |
| if (--timeout < 0) | |
| break; | |
| std::this_thread::sleep_for(std::chrono::milliseconds(1)); | |
| } | |
| } | |
| catch (interprocess_exception &) | |
| { | |
| // already detached | |
| } | |
| // Best-effort wait: as long as we can open the shared memory object by name, | |
| // assume that clients may still be attached and wait in 1ms increments. | |
| // If we can no longer open it (e.g. because it has been removed), we still | |
| // wait out the remaining timeout before returning. | |
| while (timeout-- >= 0) | |
| { | |
| try | |
| { | |
| shared_memory_object shm_obj(open_only, name, read_write); | |
| } | |
| catch (interprocess_exception &) | |
| { | |
| // Shared memory object cannot be opened anymore (e.g. already removed). | |
| // We cannot reliably detect attached processes, but to honour the | |
| // requested timeout we wait for the remaining duration before returning. | |
| if (timeout > 0) | |
| { | |
| std::this_thread::sleep_for(std::chrono::milliseconds(timeout)); | |
| } | |
| return; | |
| } | |
| std::this_thread::sleep_for(std::chrono::milliseconds(1)); | |
| } |
Issue
Fixes #7320
On macOS and Windows
osrm-datastorenow correctly waits for other processes to detach from shared memory before exiting. Previous behaviour:osrm-datastoreslept for 50ms and then boldly announced: 'All clients switched.'This caused
Tasklist
Requirements / Relations
Related to PR #7309