-
Notifications
You must be signed in to change notification settings - Fork 14
ZeroMQ->NNG change for thread-safe sockets #286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nng is mostly thread-safe so can avoid the need to create as many sockets/connections as zeromq needed.
…ant test endpoints in mock mode
Uses absl::flat_hash_map which is significantly faster than std::unordered_map, as it doesn't allocate linked lists of nodes and instead uses a big array.
Create send&recv endpoints at the same time to prevent deadlocks.
…background thread
nng is slower in debug mode, so I had to reduce the number of iterations on it
csegarragonz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks great! Thanks for taking the time to do this refactor. The thread-unsafety issue in ZeroMQ had caused many headaches in the past, so hopefully this fixes them.
I only have one minor comment: are all methods in the new (and very cool) ConcurrentMap class used? I feel some of them are not, but I can not be entirely sure. I'm reluctant to add methods that we don't use.
Other than that, just left some minor comments around.
|
|
||
| int ec = nng_close(socket); | ||
| if (ec != 0) { | ||
| SPDLOG_WARN("Error when closing socket: {}", nng_strerror(ec)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this error non-fatal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only error that nng_close currently returns is ECLOSED which means the socket was already closed, so while it would probably indicate a programming bug somewhere it is harmless.
|
Also, @eigenraven does this close #202? |
I added test methods to cover everything else, they're not all used in faabric right now but I added them to mirror the
It makes good progress on it, but I've definitely not covered every place where spans could be used. |
|
Great, thanks for your answers. LGTM, let's wait for Simon to take a look too. |
Shillaker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! I've had a quick look and it all looks sensible to me! I haven't had time to look at everything in detail, but I trust you two, so LGTM!
We discussed this a long time ago: ZeroMQ sockets are not thread-safe, so a lot of threads end up creating their own client objects that create connections to other nodes if communication is required. This causes a lot of unnecessary TCP connections to get created, leading to running over default socket limits, it can lead to worse slow start behaviour on those sockets and other similar scaling issues. NNG is a library made by the original authors of 0mq after learning from its mistakes and nanomsg's mistakes.
Crucially, nng sockets are thread-safe and can be used from multiple threads. Even patterns such as req-rep can be done on the same socket from multiple threads by using
nng_contextobjects (instead of the req-router-dealer-rep 0mq pattern).One big difference between nng and 0mq sockets is that send operations will block until a receiving endpoint is connected, rather than going blindly into a queue that might never connect to an endpoint. This broke a few tests which were connecting to non-existent endpoints, so they needed fixing. Some tests also created the sending endpoints and sent messages before creating the corresponding receivers which led to deadlocks. Passing the
NNG_FLAG_NONBLOCKflag wouldn't solve this issue as in that case the send function returns an error EAGAIN, so some parts of the code (mostly tests) need to be changed to make sure receivers become available at some point during send calls.To-do:
Closes #266