vadp-dumper: fix multithreaded backup/restore issues #1623

Merged
BareosBot merged 28 commits into bareos:bareos-23 from
sebsura:dev/ssura/bareos-23/fix-vadp-dumper
Dec 13, 2023
Conversation

@sebsura
Contributor

@sebsura sebsura commented Dec 11, 2023

Thank you for contributing to the Bareos Project!

Backport of PR #1593 to bareos-23

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please leave a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
  • Check backport line
  • Required backport PRs have been created
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)
  • Required documentation changes are present and part of the PR

Member

@pstorz pstorz left a comment


Please see comments and let's talk.

@pstorz pstorz force-pushed the dev/ssura/bareos-23/fix-vadp-dumper branch from 7a8ba6d to 0b9c714 Compare December 12, 2023 19:38
@sebsura
Contributor Author

sebsura commented Dec 13, 2023

I think it's best if we continue the conversation on the main pull request instead of the backport (#1593)

@sebsura sebsura force-pushed the dev/ssura/bareos-23/fix-vadp-dumper branch from 0b9c714 to 1302afe Compare December 13, 2023 11:43
Neither write nor read guarantee that they will finish everything in
one call.  One has to repeatedly call them in a loop to ensure that
everything gets written/read.
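The retry loop described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual patch; the helper name `WriteAll` is hypothetical, and it assumes POSIX `write(2)` semantics (partial writes and `EINTR` interruptions are both possible).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: keep calling write(2) until the whole buffer has
// been written or an unrecoverable error occurs.  Returns true on success.
bool WriteAll(int fd, const char* data, std::size_t size) {
  std::size_t written = 0;
  while (written < size) {
    ssize_t n = write(fd, data + written, size - written);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal: retry
      return false;                  // real I/O error
    }
    written += static_cast<std::size_t>(n);
  }
  return true;
}
```

The read side follows the same shape, additionally treating a return of 0 (end of file) as a stop condition.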
This code path is only taken if multithreaded restore is selected.
Since the input size is variable, we need to ensure that the buffer
has enough capacity to actually hold the data we read into it.
Otherwise, if we get unlucky, we run into the situation where the
first few CBTs are very small and the rest are really big, causing a
buffer overflow and possibly a crash.
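The fix amounts to sizing the buffer against the actual extent size at runtime instead of trusting the size of the first extents seen. A minimal sketch of the idea (the function name and signature are hypothetical, not the patch's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: grow the buffer before copying an extent of arbitrary size
// into it, instead of assuming a capacity derived from earlier extents.
void ReadIntoBuffer(std::vector<char>& buffer, const char* src,
                    std::size_t size) {
  if (buffer.size() < size) {
    buffer.resize(size);  // ensure enough room for *this* extent
  }
  std::copy(src, src + size, buffer.begin());
}
```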
We need to ensure that the writer does not write into the block we are
currently reading.  Currently we only distinguish between data that is
inside the queue (which can neither be written to nor read from) and
data that is outside the queue (which can _both_ be written to and read
from).
To fix this data race we split the data into three categories instead:
- The head of the queue (which can only be read from)
- The rest of the queue (which can neither be read from nor written to)
- The data outside the queue (which can only be written to)
To accomplish this we split the normal dequeue() function into two
parts.  The normal dequeue() becomes peek-like, only returning the
head element, and the new function FinishDeq() does the actual
dequeuing.
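A simplified sketch of the split-dequeue idea, with the method names modelled on the commit message but the surrounding class entirely hypothetical: `dequeue()` only peeks at the head, so the slot stays owned by the reader until `FinishDeq()` removes it and makes the slot writable again.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Sketch of a queue with the three ownership states described above:
// head (reader-owned), rest of queue (nobody), outside queue (writer).
template <typename T>
class PeekQueue {
 public:
  void enqueue(T value) {
    std::lock_guard<std::mutex> l(m_);
    q_.push_back(std::move(value));
    cv_.notify_one();
  }
  // Wait for data and return a copy of the head WITHOUT removing it:
  // the slot remains off-limits to the writer while the reader uses it.
  T dequeue() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return !q_.empty(); });
    return q_.front();
  }
  // Called once the reader is done with the head; only now does the
  // slot become writable again.
  void FinishDeq() {
    std::lock_guard<std::mutex> l(m_);
    q_.pop_front();
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<T> q_;
};
```

In a ring-buffer variant (closer to the real code), `FinishDeq()` is what advances the read index, so the writer's wrap-around check naturally excludes the head element.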
The general gist before was this:

* copy thread:
- Wait for start
- Wait for dequeue
  - if data: handle data; loop
  - if no data: break inner (this only happens on flush)
- `signal` that the queue was flushed and that the thread is
  waiting for work
- goto begin
* main thread
- do some setup
- start enqueueing data; signal start to thread if not started
  (this was done with an unsynchronized read!)
- flush when done
- During cleanup, cancel the thread while it is waiting on a
  start signal.
with some locks & unlocks sprinkled in.  One should note that the
copy thread's cleanup callback always unlocks the mutex, regardless of
whether the copy thread actually locked it or not!

This was slightly changed:
* Clearly there is no need to wait for "start" at all, since we wait on
  the dequeue anyways -- so this was removed.
* Instead of just canceling the thread, we set a flag that tells the
  copy thread to exit, which the thread checks after every flush.  As
  such we also make sure to flush the queue on cleanup.
* We also now properly initialize all thread_context members!
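The cooperative-shutdown change can be sketched like this. The struct and function names are hypothetical (the real `thread_context` holds the queue, buffers, and synchronization state); the point is that the main thread requests exit via a flag rather than `pthread_cancel`, and every member gets an initializer.

```cpp
#include <atomic>
#include <cassert>

// Sketch of the revised protocol: all members are initialized at the
// declaration, so no field is ever read uninitialized.
struct ThreadContext {
  std::atomic<bool> quit_requested{false};
  // ... queue, buffers, etc. live here in the real code ...
};

void CopyThread(ThreadContext& ctx) {
  for (;;) {
    // ... dequeue and handle data until the queue is flushed ...
    if (ctx.quit_requested.load()) break;  // checked after every flush
  }
}
```

Compared with cancellation, this guarantees the thread only ever stops at a well-defined point (a flush boundary), so no queue element is abandoned half-processed and no mutex is left in an ambiguous state.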
Since sectors_per_call is not actually bounded by SECTORS_PER_CALL, we
need to ensure that the buffer has enough space at runtime.
This function returns -- up to a certain amount of precision -- a list
of intervals of all sectors containing useful (i.e. allocated) data.

For full backups it is recommended to back up only the allocated
blocks; for other kinds of backups (incremental/differential) it is
only necessary to back up those sectors that are both changed &
allocated.

Since we currently cannot take advantage of information regarding
unallocated, changed blocks, we just ignore them.  In the future these
could be used for faster restores & consolidation.

The algorithm used for finding out the intersection is trivial as both
lists are sorted (and the intervals themselves are disjoint in each
set).  As such it is enough to just go through both arrays linearly at
the same time, sending the pairwise intersection if any, and finally
advancing the pointer to the "smaller" interval.
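The linear merge described above can be sketched as follows (names and the half-open `[start, end)` representation are assumptions, not the patch's types). Because both lists are sorted and internally disjoint, one simultaneous pass over them is enough.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Interval = std::pair<long, long>;  // [start, end) in sectors

// Intersect two sorted lists of disjoint intervals in one linear pass.
std::vector<Interval> Intersect(const std::vector<Interval>& a,
                                const std::vector<Interval>& b) {
  std::vector<Interval> out;
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    long lo = std::max(a[i].first, b[j].first);
    long hi = std::min(a[i].second, b[j].second);
    if (lo < hi) out.emplace_back(lo, hi);  // pairwise intersection, if any
    // Advance past the interval that ends first; it cannot intersect
    // anything further in the other list.
    if (a[i].second < b[j].second) ++i; else ++j;
  }
  return out;
}
```

With `changed = {[0,10), [20,30)}` and `allocated = {[5,25)}`, this yields `{[5,10), [20,25)}`: exactly the changed-and-allocated sectors.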
Since we cannot meaningfully recover from a mutex error, we should
just assert that they do not happen.
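A minimal sketch of the assert-on-mutex-error idea (the wrapper name is hypothetical): `pthread_mutex_lock` returning nonzero indicates a programming error such as a deadlock or an invalid mutex, so there is nothing sensible to do but abort.

```cpp
#include <cassert>
#include <pthread.h>

// Sketch: since we cannot meaningfully recover from a mutex error,
// assert success instead of propagating an error code nobody can handle.
void LockMutex(pthread_mutex_t* m) {
  int rc = pthread_mutex_lock(m);
  assert(rc == 0);
  (void)rc;  // silence unused-variable warnings in NDEBUG builds
}
```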
@sebsura sebsura force-pushed the dev/ssura/bareos-23/fix-vadp-dumper branch from 1302afe to 7c6ba5b Compare December 13, 2023 13:19
@pstorz pstorz self-requested a review December 13, 2023 17:25
@BareosBot BareosBot merged commit 0f6e6ee into bareos:bareos-23 Dec 13, 2023