core: fix various data races (connection_pool/heartbeat_thread)#1685
Merged
BareosBot merged 11 commits into bareos:master on Feb 13, 2024
pstorz
approved these changes
Feb 2, 2024
This was referenced Feb 12, 2024
Some operations were improperly synchronized. Take cleanup(), for
example:
```
for (i = connections_->size() - 1; i >= 0; i--) {
  connection = connections_->get(i);  // (1)
  Dmsg2(800, "checking connection %s (%d)\n", connection->name(), i);
  if (!connection->check()) {  // (2)
    Dmsg2(120, "connection %s (%d) is terminated => removed\n",
          connection->name(), i);
    connections_->remove(i);
    delete (connection);  // (4)
  }
}
```
We don't lock connections_ or connection in any way here. This means
that not only could we get a NULL returned at (1), we also have to
account for the fact that at any moment connection could get deleted
from under us by a different thread, even if we are currently
holding its lock. This will happen if two threads call cleanup() at
the same time and one is at (2) while the other one is at (4).
Similarly, the check() function just calls WaitDataIntr() on the socket
without ensuring exclusive access (for example by locking the
connection!). WaitDataIntr() is not a const function, so it's not safe
to call without ensuring exclusive access. It might look like this
should be safe since the function just waits, but it can in fact
write to some internal data (e.g. b_errno in case of an error),
which can definitely cause problems.
Connection::in_use is also very misleading. While it does not suffer
from the data race problem (as it's an atomic value), its
interpretation does: if you read false from it, you do not actually know
whether some thread is using the connection (and has yet to update the
bool) or if the connection is actually unused.
All these problems and some more led to the decision to rewrite this
code completely.
The basic idea is that the connection pool is now simply a vector of
connections protected by one lock. The connections themselves do not
have a lock.
The locks are owned by the vector. The only way to interact with the
connections inside the pool is by locking the whole vector. This
eliminates all the problems above.
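The "vector that owns its lock" idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual Bareos implementation: `Conn`, `ConnPool`, `Locked` and this `cleanup` are hypothetical names. The point is that the vector is private, so the only way to reach it is through a handle that holds the mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a pooled connection; not the Bareos type.
struct Conn {
  std::string name;
  bool alive{true};
};

class ConnPool {
 public:
  // Handle that keeps the mutex locked for as long as it exists.
  class Locked {
   public:
    Locked(std::mutex& m, std::vector<Conn>& v) : guard_{m}, vec_{v} {}
    std::vector<Conn>* operator->() { return &vec_; }
    std::vector<Conn>& operator*() { return vec_; }

   private:
    std::unique_lock<std::mutex> guard_;
    std::vector<Conn>& vec_;
  };

  // The only access path to the connections.
  Locked lock() { return Locked{mut_, conns_}; }

 private:
  std::mutex mut_;
  std::vector<Conn> conns_;
};

// Drops dead connections while holding the lock for the whole scan, so no
// other thread can observe or delete an element in between.
// Returns the number of removed connections.
std::size_t cleanup(ConnPool& pool) {
  auto locked = pool.lock();
  const std::size_t before = locked->size();
  locked->erase(std::remove_if(locked->begin(), locked->end(),
                               [](const Conn& c) { return !c.alive; }),
                locked->end());
  assert(locked->size() <= before);
  return before - locked->size();
}
```

With this shape, two threads calling `cleanup` at the same time simply run one after the other, which is the property the old loop lacked.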
The connections themselves are now also an RAII type. They own the
socket they hold. That means that they will take care of
closing/destroying the socket once they leave the scope (similarly to a
unique pointer).
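A minimal sketch of such an RAII connection is shown below. It assumes a hypothetical `FakeSocket` that merely counts how often it is destroyed so the effect is observable; the real socket type and the real `Connection` class are different and not reproduced here.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a socket; its destructor plays the role of
// "close and destroy" and bumps a counter supplied by the caller.
struct FakeSocket {
  explicit FakeSocket(int* closed) : closed_{closed} {}
  ~FakeSocket() { ++*closed_; }
  FakeSocket(const FakeSocket&) = delete;
  FakeSocket& operator=(const FakeSocket&) = delete;
  int* closed_;
};

// The connection owns its socket through a unique_ptr, so it is move-only
// and the socket is released exactly once, by whichever object holds it last.
class Connection {
 public:
  Connection(std::string name, std::unique_ptr<FakeSocket> sock)
      : name_{std::move(name)}, sock_{std::move(sock)} {
    assert(sock_ != nullptr);
  }
  const std::string& name() const { return name_; }

 private:
  std::string name_;
  std::unique_ptr<FakeSocket> sock_;
};
```

Moving a `Connection` (for example into or out of a vector) transfers the socket without closing it; the socket is closed only when the last owner goes out of scope.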
This was also done this way before the rewrite.
This is needed in case you want to use a timed mutex, for example.
Since all connections in the pool were always authenticated, we can remove that member and simply assume that the connection is authenticated.
This is done with a plugin that just spams job messages for a while.
Since both Jmsg and the heartbeat thread write to the director socket, we need to enable locking!
Since our binary is not started correctly, we should not depend on the state of global objects and instead create them manually when needed.
Thank you for contributing to the Bareos Project!
This PR fixes data races related to the connection_pool on the director, the heartbeat threads, and the fd.
Please check
If you have any questions or problems, please give a comment in the PR.
Helpful documentation and best practices
Checklist for the reviewer of the PR (will be processed by the Bareos team)
Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.
General
Source code quality
Tests