stored: dird: add backup checkpoints that save backup metadata to the Catalog during the execution of the backup#1074

Merged
pstorz merged 46 commits into bareos:master from alaaeddineelamri:dev/alaaeddineelamri/master/backup-checkpoints
Sep 19, 2022

@alaaeddineelamri alaaeddineelamri commented Feb 17, 2022

Description

During a backup job, backup metadata (number of files written to the volume, number of bytes written, etc.) is saved to the catalog only if the job succeeds. When a job fails or is canceled, users end up with volumes full of backed-up data that the catalog knows nothing about, so data written before the failure cannot be restored with the restore command.
This PR addresses the issue by introducing checkpoints during a backup's execution. Each checkpoint saves the necessary metadata (updating the File, Job, and JobMedia tables) so that users can restore data even from failed or canceled jobs. Checkpoints happen on volume switches and after a time interval that the user sets in the SD configuration.
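
As a rough illustration of the trigger rule described above, here is a minimal Python sketch. It is not the Bareos implementation: `CheckpointPolicy`, `should_checkpoint`, `mark_done`, and `checkpoint_times` are hypothetical names, and the rule that checkpoints are active only when an interval is configured is an assumption taken from the commit notes in this PR.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class CheckpointPolicy:
    """Hypothetical model of when a backup checkpoint fires."""

    # Assumption: no interval (None or 0) means checkpoints are disabled.
    interval_seconds: Optional[float] = None
    last_checkpoint: float = 0.0

    def enabled(self) -> bool:
        return self.interval_seconds is not None and self.interval_seconds > 0

    def should_checkpoint(self, now: float, volume_changed: bool) -> bool:
        if not self.enabled():
            return False
        if volume_changed:
            # A volume switch always triggers a checkpoint.
            return True
        # Otherwise fire once the configured interval has elapsed.
        return now - self.last_checkpoint >= self.interval_seconds

    def mark_done(self, now: float) -> None:
        self.last_checkpoint = now


def checkpoint_times(
    policy: CheckpointPolicy, events: Iterable[Tuple[float, bool]]
) -> List[float]:
    """Return the timestamps at which a checkpoint would be taken.

    Each event is (timestamp_in_seconds, volume_changed).
    """
    fired = []
    for now, volume_changed in events:
        if policy.should_checkpoint(now, volume_changed):
            policy.mark_done(now)
            fired.append(now)
    return fired


if __name__ == "__main__":
    events = [(10, False), (30, True), (70, False), (95, False), (150, False)]
    print(checkpoint_times(CheckpointPolicy(interval_seconds=60), events))
```

In this sketch a volume switch resets the interval timer, which is a design choice of the example rather than something the PR description states.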

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please give a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

General
  • PR name is meaningful
  • Purpose of the PR is understood
  • Separate commit for this PR in the CHANGELOG.md, PR number referenced is same
  • Commit descriptions are understandable and well formatted
  • If backport: add original PR number and target branch at top of this file: Backport of PR#000 to bareos-2x
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)
  • Required documentation changes are present and part of the PR
  • bareos-check-sources --since-merge does not report any problems
  • git status should not report modifications in the source tree after building and testing
Tests
  • Decision taken that a system- or unittest is required (if not, then remove this paragraph)
  • The decision towards a systemtest is reasonable compared to a unittest
  • Testname matches exactly what is being tested
  • Output of the test leads quickly to the origin of the fault

@alaaeddineelamri alaaeddineelamri force-pushed the dev/alaaeddineelamri/master/backup-checkpoints branch 11 times, most recently from 0b09d14 to ca5487a Compare February 24, 2022 17:02
@alaaeddineelamri alaaeddineelamri force-pushed the dev/alaaeddineelamri/master/backup-checkpoints branch 9 times, most recently from b004c04 to 3309146 Compare March 3, 2022 13:06

@pstorz pstorz left a comment

Very good work.

However, I see the following things that need to be tested in a system test:
1: Kill the file daemon during the backup and check that the last successful checkpoint can be recovered.
2: The same with killing the storage daemon.
3: The same with the director.

4: We also need to measure the impact on backup speed with a large amount of data, both to disk and to tape. Depending on the results, we can also define a suggested value for the checkpoint interval.

5: The documentation of the feature needs to be written based on the results of the testing.

@alaaeddineelamri alaaeddineelamri force-pushed the dev/alaaeddineelamri/master/backup-checkpoints branch 2 times, most recently from 7b9988d to 073843c Compare March 14, 2022 16:24
@alaaeddineelamri alaaeddineelamri force-pushed the dev/alaaeddineelamri/master/backup-checkpoints branch 5 times, most recently from a1150c6 to 3d34d74 Compare March 17, 2022 13:05
Alaa Eddine Elamri added 29 commits September 14, 2022 09:46
Moved attribute check to SendAttrsToDir callers
This test is disabled by default, and needs to be
manually adjusted to feed it custom-made data
For the moment, we only enable any type of checkpoints
(on cancel, on volume change, or on time interval) when the 
interval value is set.
In previous commits, we had to modify the `UpdateJobEndRecord` 
function in cats in order to mitigate an update issue that would
overwrite the last jobbytes and jobfiles values with zeros.

With this commit, we can keep the original `UpdateJobEndRecord` 
as it is while keeping the data intact
use expect_grep
fix for solaris
cancel with jobid
small refactor of checkpoint function headers
s
wait 60
s
wesh
s

@pstorz pstorz left a comment

Good Work!

3 participants