stored: fix authentication race condition / deadlock #1732

Merged
BareosBot merged 13 commits into bareos:master from sebsura:dev/ssura/master/fix-authentication on Mar 18, 2024
@sebsura (Contributor) commented Mar 12, 2024

Thank you for contributing to the Bareos Project!

Sometimes the fd and the sd disagree on the authentication status of the connection, so each ends up waiting for the other.
This is caused in part by incorrect use of condition variables: the sd can miss the authenticated condition changing from false to true.

This PR also adds a timeout check to our systemtests. If a single run_bconsole invocation takes more than 100 seconds, the testrunner creates backtraces of the currently running daemons and exits the test with exit code 124.

This should make it easier to debug hangs (like the one above) in our CI pipeline.

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please give a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
  • Check backport line
  • Required backport PRs have been created
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)
  • Required documentation changes are present and part of the PR
Tests
  • Decision taken that a test is required (if not, then remove this paragraph)
  • The choice of the type of test (unit test or systemtest) is reasonable
  • Testname matches exactly what is being tested
  • On a fail, output of the test leads quickly to the origin of the fault

@sebsura force-pushed the dev/ssura/master/fix-authentication branch from fc22222 to 074ad77 on March 12, 2024 12:42
@pstorz self-requested a review on March 14, 2024 07:41

@pstorz (Member) left a comment

Please see comments

sebsura and others added 13 commits March 18, 2024 13:54
The condition variable is not used correctly:

// reader
1|  while (!unprotected) {
2|    wait(cond_var);
 |  }

// writer
3|  unprotected = true;
4|  signal(cond_var);

With the execution order 1->3->4->2, the writer signals before the reader
starts waiting, so the signal is lost and the reader waits forever. This is
why the wait command takes a mutex: everything that might change the
condition to true needs to hold that mutex. This way the reader either sees
the updated value or is already waiting when the signal arrives.
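
The locking rule described above can be sketched with the C++ standard library. This is only an illustration of the general pattern, not the code from this PR: the names `mtx`, `cond_var`, `ready`, `WaitUntilReady`, `SetReady` and `RunOnce` are made up for the example.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical names for illustration; not taken from the Bareos sources.
std::mutex mtx;
std::condition_variable cond_var;
bool ready = false;  // only read or written while holding mtx

// reader: the condition is checked with the mutex held, and wait()
// releases the mutex atomically while putting the thread to sleep
void WaitUntilReady()
{
  std::unique_lock<std::mutex> lock(mtx);
  cond_var.wait(lock, [] { return ready; });
}

// writer: takes the same mutex before changing the condition
void SetReady()
{
  {
    std::lock_guard<std::mutex> lock(mtx);
    ready = true;
  }
  cond_var.notify_all();
}

// Runs one reader against one writer and returns once both are done.
bool RunOnce()
{
  std::thread reader(WaitUntilReady);
  SetReady();
  reader.join();
  std::lock_guard<std::mutex> lock(mtx);
  return ready;
}
```

Because the writer cannot change `ready` between the reader's check and the reader going to sleep, the 1->3->4->2 interleaving is no longer possible.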

Since jcr->authenticate is used all over the place in many different
situations, this problem could not easily be fixed by just protecting that
variable (we do not want to introduce new deadlocks, after all).

Instead, we no longer rely on jcr->authenticate when waiting for the job to
start. We now have a single, properly protected bool `client_available` that
we can wait on. This bool needs to be set by whoever authenticates the FD/SD
connection, otherwise the job will deadlock. But at least that is easy to
fix.
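
A dedicated, protected flag of this kind could look roughly like the following sketch. The class name `ClientAvailable` and its methods are hypothetical and the actual Bareos code may be structured differently. The timed wait is an assumption added for the example, so that a forgotten `Set()` shows up as a timeout rather than a hang.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical wrapper; the real member in Bareos may look different.
class ClientAvailable {
 public:
  // Called by whoever authenticates the FD/SD connection.
  void Set()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      available_ = true;
    }
    changed_.notify_all();
  }

  // Returns true if the flag was set before the timeout expired.
  bool WaitFor(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return changed_.wait_for(lock, timeout, [this] { return available_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable changed_;
  bool available_{false};
};
```

Keeping the bool, the mutex and the condition variable in one small type means no caller can touch the flag without taking the lock.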

Once the run_bconsole timeout is reached, we kill the daemons and create a trace.
@BareosBot force-pushed the dev/ssura/master/fix-authentication branch from 603b237 to f229099 on March 18, 2024 13:54
@BareosBot merged commit 61febc7 into bareos:master on Mar 18, 2024