Skip to content

Conversation

@eigenraven
Copy link
Collaborator

@eigenraven eigenraven commented Nov 25, 2022

We discussed this a long time ago: ZeroMQ sockets are not thread-safe, so a lot of threads end up creating their own client objects that create connections to other nodes if communication is required. This causes a lot of unnecessary TCP connections to get created, leading to running over default socket limits, it can lead to worse slow start behaviour on those sockets and other similar scaling issues. NNG is a library made by the original authors of 0mq after learning from its mistakes and nanomsg's mistakes.

Crucially, nng sockets are thread-safe and can be used from multiple threads. Even patterns such as req-rep can be done on the same socket from multiple threads by using nng_context objects (instead of the req-router-dealer-rep 0mq pattern).

One big difference between nng and 0mq sockets is that send operations will block until a receiving endpoint is connected, rather than going blindly into a queue that might never connect to an endpoint. This broke a few tests which were connecting to non-existent endpoints, so they needed fixing. Some tests also created the sending endpoints and sent messages before creating the corresponding receivers which led to deadlocks. Passing the NNG_FLAG_NONBLOCK flag wouldn't solve this issue as in that case the send function returns an error EAGAIN, so some parts of the code (mostly tests) need to be changed to make sure receivers become available at some point during send calls.

To-do:

  • Replace 0MQ with NNG in CMake
  • Switch the message/endpoint classes to wrap NNG sockets&contexts
  • Find instances of thread-local sockets and replace them with global sockets
  • Add a thread-safe concurrent map wrapper to replace manual handling of mutexes+maps, because quite a few of the thread-local maps and also some global maps use the same patterns over and over making it prone to mistakes with forgotten locks.
  • Replace map+mutex usage with the concurrent map type where appropriate

Closes #266

Nng is mostly thread-safe so can avoid the need to create as many sockets/connections as zeromq needed.
Uses absl::flat_hash_map which is significantly faster than std::unordered_map, as it doesn't allocate linked lists of nodes and instead uses a big array.
Create send&recv endpoints at the same time to prevent deadlocks.
@eigenraven eigenraven marked this pull request as ready for review November 28, 2022 14:12
Copy link
Collaborator

@csegarragonz csegarragonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks great! Thanks for taking the time to do this refactor. The thread-unsafety issue in ZeroMQ had caused many headaches in the past, so hopefully this fixes them.

I only have one minor comment: are all methods in the new (and very cool) ConcurrentMap class used? I feel some of them are not, but I can not be entirely sure. I'm reluctant to add methods that we don't use.

Other than that, just left some minor comments around.


int ec = nng_close(socket);
if (ec != 0) {
SPDLOG_WARN("Error when closing socket: {}", nng_strerror(ec));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error non-fatal?

Copy link
Collaborator Author

@eigenraven eigenraven Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only error that nng_close currently returns is ECLOSED which means the socket was already closed, so while it would probably indicate a programming bug somewhere it is harmless.

@csegarragonz
Copy link
Collaborator

Also, @eigenraven does this close #202?

@csegarragonz csegarragonz mentioned this pull request Nov 30, 2022
8 tasks
@eigenraven
Copy link
Collaborator Author

I only have one minor comment: are all methods in the new (and very cool) ConcurrentMap class used? I feel some of them are not, but I can not be entirely sure. I'm reluctant to add methods that we don't use.

I added test methods to cover everything else, they're not all used in faabric right now but I added them to mirror the std::unordered_map/absl::flat_hash_map interface where it was possible to build a thread-safe wrapper (so e.g. no operator[]). There still are other places in faabric&faasm that use maps with mutexes that use the various methods, so this should make porting those over easier.

does this close #202?

It makes good progress on it, but I've definitely not covered every place where spans could be used.

@csegarragonz
Copy link
Collaborator

Great, thanks for your answers. LGTM, let's wait for Simon to take a look too.

Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! I've had a quick look and it all looks sensible to me! I haven't had time to look at everything in detail, but I trust you two, so LGTM!

@csegarragonz csegarragonz merged commit b99db83 into main Dec 14, 2022
@csegarragonz csegarragonz deleted the nng branch December 14, 2022 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Too many files open error in ZeroMQ

4 participants