
cephfs-journal-tool: Add preventive measures to avoid fs corruption#55758

Merged
vshankar merged 3 commits into ceph:main from joscollin:wip-F62925-cephfs-journal-tool-warning-message
May 29, 2024

Conversation

@joscollin
Member

@joscollin joscollin commented Feb 26, 2024

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added the cephfs Ceph File System label Feb 26, 2024
@joscollin joscollin requested a review from vshankar February 26, 2024 13:46
@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 04141d1 to 5d8d636 Compare February 27, 2024 05:07
@joscollin
Member Author

jenkins test make check

Contributor

@vshankar vshankar left a comment


This change would require mentions in documentation and PendingReleaseNotes.

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 5d8d636 to d0e5b77 Compare February 27, 2024 06:24
@joscollin
Member Author

@vshankar Will update the documentation.

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from d0e5b77 to 70bd491 Compare February 27, 2024 09:49
@joscollin joscollin requested a review from a team as a code owner February 27, 2024 09:49
@joscollin
Member Author

@vshankar Updated the documentation. Please check.

Contributor

@vshankar vshankar left a comment


We might need to add or change qa tests that need fixing. Please check those.

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 70bd491 to bb8b899 Compare February 28, 2024 06:26
@github-actions github-actions bot added the tests label Feb 28, 2024
@joscollin
Member Author

We might need to add or change qa tests that need fixing. Please check those.

@vshankar Updated the qa changes.

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch 3 times, most recently from f2bc443 to 2389f00 Compare February 28, 2024 14:17
Contributor

@chrisphoffman chrisphoffman left a comment


You missed adding --yes-i-really-really-mean-it to one instance of cephfs-journal-tool journal reset

see:

# cephfs-journal-tool --rank=<fs_name>:0 journal reset

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 2389f00 to 13a847d Compare February 29, 2024 02:26
@joscollin
Member Author

You missed adding --yes-i-really-really-mean-it to one instance of cephfs-journal-tool journal reset

see:

# cephfs-journal-tool --rank=<fs_name>:0 journal reset

Fixed

@vshankar vshankar requested a review from a team February 29, 2024 10:59
@vshankar vshankar self-assigned this Mar 14, 2024
@joscollin
Member Author

@vshankar @chrisphoffman ping

Contributor

@vshankar vshankar left a comment


@joscollin This would require a rebase since more invocations of the journal tool were recently merged.

qa/tasks/cephfs/test_journal_migration.py:        # Verify that cephfs-journal-tool can now read the rewritten journal
qa/tasks/vstart_runner.py:    require_binaries = ["ceph-dencoder", "cephfs-journal-tool", "cephfs-data-scan",
qa/workunits/fs/damage/test-first-damage-lost-found.sh:  cephfs-journal-tool --rank="$FS":0 event recover_dentries summary
qa/workunits/fs/damage/test-first-damage-lost-found.sh:  cephfs-journal-tool --rank="$FS":0 journal reset
qa/workunits/fs/damage/test-first-damage-lost-found.sh:  cephfs-journal-tool --rank="$FS":0 journal reset
qa/workunits/fs/damage/test-first-damage-lost-found.sh:  cephfs-journal-tool --rank="$FS":0 event recover_dentries summary
qa/workunits/fs/damage/test-first-damage-lost-found.sh:  cephfs-journal-tool --rank="$FS":0 journal reset
qa/workunits/fs/damage/test-first-damage.sh:  cephfs-journal-tool --rank="$FS":0 event recover_dentries summary
qa/workunits/fs/damage/test-first-damage.sh:  cephfs-journal-tool --rank="$FS":0 journal reset
qa/workunits/suites/cephfs_journal_tool_smoke.sh:export BIN="${BIN:-cephfs-journal-tool --rank=cephfs:0}"
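
A listing like the one above could also be checked mechanically. The following is a minimal sketch, not part of this PR: the helper name missing_confirmation and the sample lines are hypothetical, and it only recognizes shell-style invocations on a single line (Python call sites such as journal_tool(["journal", "reset"], 0) would need a separate check).

```python
import re

# Flag that this PR requires for 'journal reset' (taken from the review comments).
CONFIRM_FLAG = "--yes-i-really-really-mean-it"

# Matches a shell-style 'cephfs-journal-tool ... journal reset' invocation.
RESET_RE = re.compile(r"cephfs-journal-tool\b.*\bjournal\s+reset\b")


def missing_confirmation(lines):
    """Return (line_number, text) pairs for 'journal reset' calls lacking the flag."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if RESET_RE.search(text) and CONFIRM_FLAG not in text:
            hits.append((number, text.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical sample input modelled on the grep output above.
    sample = [
        'cephfs-journal-tool --rank="$FS":0 event recover_dentries summary',
        'cephfs-journal-tool --rank="$FS":0 journal reset',
        'cephfs-journal-tool --rank="$FS":0 journal reset ' + CONFIRM_FLAG,
    ]
    for number, text in missing_confirmation(sample):
        print(f"line {number}: {text}")
```

Only the second sample line is reported, because the first is not a reset and the third already carries the flag.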

vshankar added a commit to vshankar/ceph that referenced this pull request Apr 30, 2024
* refs/pull/55758/head:
	doc: update 'journal reset' command with --yes-i-really-really-mean-it
	qa: fix cephfs-journal-tool commands
	cephfs-journal-tool: Fix the error code
	cephfs-journal-tool: stop execution if the fs is active
	cephfs-journal-tool: Add warning messages during 'journal reset'

Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
Contributor

@vshankar vshankar left a comment


@joscollin I think there are a few more instances in tests to fix. See: https://pulpito.ceph.com/vshankar-2024-05-01_17:34:00-fs-wip-vshankar-testing-20240430.111407-debug-testing-default-smithi/7683778/

2024-05-01T22:10:13.292 INFO:tasks.cephfs_test_runner:======================================================================
2024-05-01T22:10:13.292 INFO:tasks.cephfs_test_runner:ERROR: test_journal_migration (tasks.cephfs.test_journal_migration.TestJournalMigration)
2024-05-01T22:10:13.292 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_5ef2baf9bfba0061cb137abe5d8fc74e1acd4bd8/qa/tasks/cephfs/test_journal_migration.py", line 70, in test_journal_migration
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    inspect_out = self.fs.journal_tool(["journal", "inspect"], 0)
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_5ef2baf9bfba0061cb137abe5d8fc74e1acd4bd8/qa/tasks/cephfs/filesystem.py", line 1715, in journal_tool
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    return self._run_tool("cephfs-journal-tool", args, fs_rank, quiet)
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_5ef2baf9bfba0061cb137abe5d8fc74e1acd4bd8/qa/tasks/cephfs/filesystem.py", line 1694, in _run_tool
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    r = self.tool_remote.sh(script=base_args + args, stdout=StringIO()).strip()
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/remote.py", line 97, in sh
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    proc = self.run(**kwargs)
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/remote.py", line 523, in run
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 455, in run
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    r.wait()
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 161, in wait
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2024-05-01T22:10:13.293 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 181, in _raise_for_status
2024-05-01T22:10:13.294 INFO:tasks.cephfs_test_runner:    raise CommandFailedError(
2024-05-01T22:10:13.294 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi016 with status 255: 'cephfs-journal-tool --debug-mds=20 --debug-ms=1 --debug-objecter=1 --rank cephfs:0 journal inspect'
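
When triaging several such runs, the final CommandFailedError line carries the useful facts: the host, the exit status, and the exact command. A minimal sketch for pulling those out of a log line follows; the function name parse_command_failure is hypothetical, and the pattern is inferred only from the message format shown in the traceback above.

```python
import re

# Pattern inferred from the CommandFailedError line in the log excerpt above.
FAILURE_RE = re.compile(
    r"Command failed on (?P<host>\S+) with status (?P<status>\d+): '(?P<command>.*)'"
)


def parse_command_failure(line):
    """Return host, status and command from a failure line, or None if absent."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "status": int(match.group("status")),
        "command": match.group("command"),
    }


if __name__ == "__main__":
    line = (
        "teuthology.exceptions.CommandFailedError: Command failed on smithi016 "
        "with status 255: 'cephfs-journal-tool --debug-mds=20 --debug-ms=1 "
        "--debug-objecter=1 --rank cephfs:0 journal inspect'"
    )
    print(parse_command_failure(line))
```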

@github-actions

github-actions bot commented May 6, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@joscollin
Member Author

@joscollin I think there are a few more instances in tests to fix. See: https://pulpito.ceph.com/vshankar-2024-05-01_17:34:00-fs-wip-vshankar-testing-20240430.111407-debug-testing-default-smithi/7683778/

@vshankar
The error is: Cannot run cephfs-journal-tool on an active file system! Those tests were running cephfs-journal-tool on an active file system. If the file system is active, a test should run fs fail <fsname> before running cephfs-journal-tool.
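
The ordering described here can be written down as a short command plan. This is a sketch under stated assumptions rather than the change made in this PR: the function name journal_reset_plan is hypothetical, the final step (ceph fs set <fsname> joinable true, to let the file system come back up) is not mentioned in this thread and is an assumption, and the code only builds the argument lists without executing anything.

```python
import shlex

# Confirmation flag required by 'journal reset' after this PR (per the review).
CONFIRM_FLAG = "--yes-i-really-really-mean-it"


def journal_reset_plan(fs_name, rank=0):
    """Return the ordered commands (argv lists) for an offline journal reset."""
    return [
        # 1. Take the file system offline so the tool is allowed to run.
        ["ceph", "fs", "fail", fs_name],
        # 2. Reset the journal of the given rank, with explicit confirmation.
        ["cephfs-journal-tool", f"--rank={fs_name}:{rank}",
         "journal", "reset", CONFIRM_FLAG],
        # 3. Assumption: allow MDS daemons to join again afterwards.
        ["ceph", "fs", "set", fs_name, "joinable", "true"],
    ]


if __name__ == "__main__":
    # Print the plan for review; nothing is executed.
    for argv in journal_reset_plan("cephfs"):
        print(shlex.join(argv))
```

Building the plan as data keeps the destructive step reviewable before anything touches a cluster.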

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 92b3251 to 5c74d4a Compare May 9, 2024 03:09
@joscollin
Member Author

@joscollin I think there are a few more instances in tests to fix. See: https://pulpito.ceph.com/vshankar-2024-05-01_17:34:00-fs-wip-vshankar-testing-20240430.111407-debug-testing-default-smithi/7683778/

This is fixed and the changes are pushed. There's a similar failure with test_flush.py, which is WIP.

@Svelar
Member

Svelar commented May 9, 2024

jenkins test make check arm64

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from 5c74d4a to cab86fd Compare May 10, 2024 07:42
@joscollin
Member Author

@vshankar The qa failures are fixed. This PR is ready for another review and qa.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@joscollin joscollin force-pushed the wip-F62925-cephfs-journal-tool-warning-message branch from cab86fd to 42953ec Compare May 13, 2024 07:12
@vshankar
Contributor

@joscollin Could you update here how the assert in the tests was fixed?

@joscollin
Member Author

@joscollin Could you update here how the assert in the tests was fixed?

The AssertionError [1] was not caused by a string mismatch. After we call fs.fail(), the code at https://github.com/ceph/ceph/blob/main/qa/tasks/cephfs/test_flush.py#L73 still needs an active fs. That was the issue. I've fixed it by making the fs active or inactive at the right places: 0820b31

[1] https://qa-proxy.ceph.com/teuthology/jcollin-2024-05-08_05:08:43-fs:functional-wip-jcollin-testing-07052024-distro-default-smithi/7697306/teuthology.log
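
One way to keep the active and inactive phases of a test clearly separated is a small context manager. The sketch below is illustrative only and is not what commit 0820b31 does: FakeFs is a hypothetical stand-in that merely records calls, its method names mirror fail() and journal_tool() as they appear in this thread, set_joinable() is an assumed way to bring the file system back, and fs_offline is a made-up helper name.

```python
from contextlib import contextmanager


class FakeFs:
    """Hypothetical stand-in for the qa file system helper; records calls only."""

    def __init__(self):
        self.active = True
        self.calls = []

    def fail(self):
        self.active = False
        self.calls.append("fail")

    def set_joinable(self):
        # Assumed method name for making the file system active again.
        self.active = True
        self.calls.append("set_joinable")

    def journal_tool(self, args, rank):
        # Mimic the new guard: refuse to run while the file system is active.
        if self.active:
            raise RuntimeError(
                "Cannot run cephfs-journal-tool on an active file system!")
        self.calls.append(("journal_tool", tuple(args), rank))


@contextmanager
def fs_offline(fs):
    """Fail the file system for the duration of the block, then restore it."""
    fs.fail()
    try:
        yield fs
    finally:
        fs.set_joinable()


if __name__ == "__main__":
    fs = FakeFs()
    with fs_offline(fs):
        fs.journal_tool(["journal", "inspect"], 0)
    # Steps that need an active file system can follow here.
    print(fs.active, fs.calls)
```

The try/finally ensures the file system is made joinable again even if the tool invocation raises.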

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/66035.

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/66063.

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/66090.

Contributor

@vshankar vshankar left a comment


@vshankar vshankar merged commit 75bcfd1 into ceph:main May 29, 2024
@joscollin joscollin deleted the wip-F62925-cephfs-journal-tool-warning-message branch May 29, 2024 09:54
6 participants