
core: fix various data races (connection_pool/heartbeat_thread)#1685

Merged
BareosBot merged 11 commits into bareos:master from
sebsura:dev/ssura/master/fix-connection-pool-crash
Feb 13, 2024


@sebsura
Contributor

@sebsura sebsura commented Jan 26, 2024

Thank you for contributing to the Bareos Project!

This PR fixes data races related to the connection_pool on the director and to the heartbeat threads on the fd.

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please give a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
  • Check backport line
  • Required backport PRs have been created
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)
  • Required documentation changes are present and part of the PR
Tests
  • Decision taken that a test is required (if not, then remove this paragraph)
  • The choice of the type of test (unit test or systemtest) is reasonable
  • Testname matches exactly what is being tested
  • On a fail, output of the test leads quickly to the origin of the fault

@sebsura sebsura force-pushed the dev/ssura/master/fix-connection-pool-crash branch from 95edd17 to d8a9939 Compare January 26, 2024 11:42
@sebsura sebsura changed the title fix various data races fix various data races (connection_pool/heartbeat_thread) Jan 30, 2024
@pstorz pstorz requested review from pstorz and removed request for arogge February 2, 2024 11:50
@pstorz pstorz changed the title fix various data races (connection_pool/heartbeat_thread) core: fix various data races (connection_pool/heartbeat_thread) Feb 2, 2024
@sebsura sebsura force-pushed the dev/ssura/master/fix-connection-pool-crash branch 3 times, most recently from 05e34d7 to 008bf1f Compare February 12, 2024 06:26
@sebsura sebsura force-pushed the dev/ssura/master/fix-connection-pool-crash branch 2 times, most recently from baaf031 to 0face27 Compare February 13, 2024 08:02
Some operations were improperly synchronized.  Take cleanup(), for
example:

```
 |for (i = connections_->size() - 1; i >= 0; i--) {
1|  connection = connections_->get(i);
 |  Dmsg2(800, "checking connection %s (%d)\n", connection->name(), i);
2|  if (!connection->check()) {
 |    Dmsg2(120, "connection %s (%d) is terminated => removed\n",
 |          connection->name(), i);
 |    connections_->remove(i);
4|    delete (connection);
 |  }
 |}
```
We don't lock connections_ or connection in any way here.  This means
that not only could we get a NULL returned at (1), we also have to
account for the fact that at any moment connection could get deleted
from under us by a different thread -- even if we are currently
holding its lock.  This can happen if two threads call cleanup() at
the same time and one is at (2) while the other one is at (4).

Similarly, the check() function just calls WaitDataIntr() on the socket
without ensuring exclusive access (for example by locking the
connection!).  WaitDataIntr() is not a const function, so it's not safe
to call without ensuring exclusive access.  It might look like this
should be safe since the function just waits, but it can in fact write
to some internal data (e.g. b_errno in case of an error), which can
definitely cause problems.

Connection::in_use is also very misleading.  While it does not suffer
from the data race problem (as it's an atomic value), its
interpretation does: if you read false from it, you do not actually know
whether some thread is using the connection (and has yet to update the
bool) or whether the connection is actually unused.
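
One way the interpretation of such a flag can go wrong is sketched below. The names (`FakeConnection`, `RacyClaim`, `TryClaim`) are hypothetical and not taken from the Bareos source. The sketch only shows that reading the flag and updating it in two separate steps leaves a window for another thread, while an atomic compare-and-exchange tests and claims in one step.

```cpp
#include <atomic>

// Hypothetical stand-in for a pooled connection; not the Bareos class.
struct FakeConnection {
  std::atomic<bool> in_use{false};
};

// Racy: each access is atomic on its own, but another thread may claim
// the connection between the load() and the store().
inline bool RacyClaim(FakeConnection& c) {
  if (c.in_use.load()) { return false; }
  c.in_use.store(true);  // two threads can both reach this line
  return true;
}

// Test and claim as a single atomic step: only one caller can get true
// until the flag is reset.
inline bool TryClaim(FakeConnection& c) {
  bool expected = false;
  return c.in_use.compare_exchange_strong(expected, true);
}
```

Even `TryClaim` only protects the flag itself, not the object it describes, which is one reason a single lock around the whole pool is easier to reason about.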

All these problems, and some more, led to the decision to rewrite this
code completely.

The basic idea is that the connection pool is now simply a vector of
connections protected by one lock.  The connections themselves do not
have a lock.

The locks are owned by the vector.  The only way to interact with the
connections inside the pool is by locking the whole vector.  This
eliminates all the problems above.
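
A minimal sketch of this design, assuming a hand-written wrapper: `LockedPool`, `Handle`, `PooledConnection` and `RemoveDead` are hypothetical names for illustration, not the types used in the PR. The vector can only be reached through a handle that holds the mutex for its whole lifetime, so a cleanup pass cannot interleave with another thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical connection record; the real class holds a socket.
struct PooledConnection {
  std::string name;
  bool alive;
};

// Bundles a vector with its mutex; the data is reachable only via Lock().
class LockedPool {
 public:
  // Holds the mutex for as long as the handle exists.
  class Handle {
   public:
    Handle(std::mutex& m, std::vector<PooledConnection>& v)
        : guard_(m), v_(v) {}
    std::vector<PooledConnection>& operator*() { return v_; }
    std::vector<PooledConnection>* operator->() { return &v_; }

   private:
    std::unique_lock<std::mutex> guard_;
    std::vector<PooledConnection>& v_;
  };

  Handle Lock() { return Handle(mutex_, connections_); }

 private:
  std::mutex mutex_;
  std::vector<PooledConnection> connections_;
};

// Cleanup under this scheme: the whole scan runs under one lock.
// Returns the number of removed connections.
inline std::size_t RemoveDead(LockedPool& pool) {
  auto conns = pool.Lock();
  const std::size_t before = conns->size();
  conns->erase(
      std::remove_if(conns->begin(), conns->end(),
                     [](const PooledConnection& c) { return !c.alive; }),
      conns->end());
  return before - conns->size();
}
```

Because the mutex is not recursive, a thread must let one handle go out of scope before asking for another.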

The connections themselves are now also an RAII type.  They own the
socket they hold.  That means that they will take care of
closing/destroying the socket once they leave the scope (similar to a
unique pointer).
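
The ownership idea can be sketched as follows. `FakeSocket` and `OwnedConnection` are hypothetical names, and the socket's destructor merely counts how often it runs in place of a real close. This illustrates the RAII pattern and is not the class from the PR.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical socket type; stands in for the real socket class.
struct FakeSocket {
  explicit FakeSocket(int* close_counter) : closed(close_counter) {}
  ~FakeSocket() { ++*closed; }  // "closing" happens in the destructor
  int* closed;
};

// Move-only owner: the socket is destroyed together with the connection
// that currently holds it.
class OwnedConnection {
 public:
  OwnedConnection(std::string name, std::unique_ptr<FakeSocket> sock)
      : name_(std::move(name)), socket_(std::move(sock)) {}
  OwnedConnection(OwnedConnection&&) = default;
  OwnedConnection& operator=(OwnedConnection&&) = default;

  const std::string& name() const { return name_; }
  bool has_socket() const { return socket_ != nullptr; }

 private:
  std::string name_;
  std::unique_ptr<FakeSocket> socket_;
};
```

Moving a connection out of the pool transfers the socket with it, so exactly one owner is responsible for closing it at any time.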
This was also done this way before the rewrite.
This is needed in case you want to use a timed mutex, for example.
sebsura and others added 6 commits February 13, 2024 10:28
Since all connections in the pool were always authenticated, we can
remove that member and simply assume that the connection is
authenticated.
This is done with a plugin that just spams job messages for a while.
Since both Jmsg and the heartbeat thread write to the director socket,
we need to enable locking!
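
Why two writers need a lock can be sketched with a toy socket. `LockedSocket` is a hypothetical class, and the two-step header-plus-body write is an assumption made for illustration, not a description of the actual socket code. The mutex keeps both steps of one message together when several threads send.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical socket wrapper; records what would go over the wire.
class LockedSocket {
 public:
  // Writes one message as header + body. The lock keeps the two writes
  // of one message adjacent even if another thread sends concurrently.
  void send(const std::string& msg) {
    std::lock_guard<std::mutex> guard(mutex_);
    wire_.push_back("len=" + std::to_string(msg.size()));
    wire_.push_back(msg);
  }

  // Returns a copy of everything written so far.
  std::vector<std::string> wire() {
    std::lock_guard<std::mutex> guard(mutex_);
    return wire_;
  }

 private:
  std::mutex mutex_;
  std::vector<std::string> wire_;
};
```

With one thread sending job messages and another sending heartbeats through the same `LockedSocket`, every header on the wire is still followed by its own body.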
Since our binary is not started correctly, we should not depend on the
state of global objects and instead create them manually when needed.
@BareosBot BareosBot force-pushed the dev/ssura/master/fix-connection-pool-crash branch from 1bbc00e to b6b7a05 Compare February 13, 2024 10:28
@BareosBot BareosBot merged commit 73b7a97 into bareos:master Feb 13, 2024

4 participants