cephfs_mirror: update peer status for invalid metadata in remote snapshot by joscollin · Pull Request #56698 · ceph/ceph

joscollin · 2024-04-04T11:30:16Z

Fixes: https://tracker.ceph.com/issues/65317

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

leonid-s-usov

It concerns me that we have failure_reason and failed as separate fields set at different sites and conditions because it creates opportunities for inconsistent state.

Would it be possible to design an interface for updating these fields in unison? Or even, have an std::optional<string> failure_reason where we could then say bool failed() { return failure_reason.has_value(); }?

Maybe we could deduce the info from the error code - another reason to make sure that we return meaningful codes. If that were the case, we could save space because the string could be constructed from the error code at the recipient side.

If we have to share an error code for different internal reasons, we could still do something about it, like passing around a pair of the rc and the error string, or the rc and the detail code representing some error string. In either case, we could then treat the pair as an atomic result and status that would be passed into the _inc_failed_count function to unambiguously encode the state into the PeerReplayer object

UPD:
Looking into this further, I realize that failed could be false while some failures could have been seen, as long as they are fewer than max_failures. That explains how we could have the fields out of sync, so failure_reason.has_value() is not a valid suggestion.

However, there is still a point about having the int error code stored instead of a formatted string, and a new point about naming: the reason field should probably be prefixed with last_ just like the boost::optional<monotime> last_failed. That would not allow for the misinterpretation of how the fields were related I had above.

joscollin · 2024-04-08T11:02:41Z

It concerns me that we have failure_reason and failed as separate fields set at different sites and conditions because it creates opportunities for inconsistent state.

Would it be possible to design an interface for updating these fields in unison? Or even, have an std::optional<string> failure_reason where we could then say bool failed() { return failure_reason.has_value(); }?

Maybe we could deduce the info from the error code - another reason to make sure that we return meaningful codes. If that were the case, we could save space because the string could be constructed from the error code at the recipient side.

If we have to share an error code for different internal reasons, we could still do something about it, like passing around a pair of the rc and the error string, or the rc and the detail code representing some error string. In either case, we could then treat the pair as an atomic result and status that would be passed into the _inc_failed_count function to unambiguously encode the state into the PeerReplayer object

UPD: Looking into this further, I realize that failed could be false while some failures could have been seen, as long as they are fewer than max_failures. That explains how we could have the fields out of sync, so failure_reason.has_value() is not a valid suggestion.

However, there is still a point about having the int error code stored instead of a formatted string, and a new point about naming: the reason field should probably be prefixed with last_ just like the boost::optional<monotime> last_failed. That would not allow for the misinterpretation of how the fields were related I had above.

I could make it a struct (union?) or make it optional, as you've suggested.

Regarding the error code: there's no error here, maybe I'll rename the variable error_str if it causes confusion. This is a special case when ceph_get_snap_info succeeds and then found an invalid metadata on the remote fs. This patch provides additional information in the peer status command output, which we've already logged.

leonid-s-usov · 2024-04-08T12:22:52Z

Regarding the error code: there's no error here

There's EINVAL we return for the condition.

maybe I'll rename the variable error_str if it causes confusion.

It should be last_error_str then, to denote that it's just the last error we've encountered, which may not have been fatal. Or, it could be last_error_code: int and then you could set it in _inc_failed_count by passing the non-zero rc that you test for before calling the function.
You could also generate the error string there but only pass the code in, so you won't need a struct (a pair).

Note: union won't work.

…shot Fixes: https://tracker.ceph.com/issues/65317 Signed-off-by: Jos Collin <jcollin@redhat.com>

joscollin · 2024-04-09T04:09:03Z

Regarding the error code: there's no error here

There's EINVAL we return for the condition.

How does that help? The peer_status() is not implemented by checking the return value sent by build_snap_map. Also, EINVAL is not a unique error code for 'invalid metadata', it is returned at other places too. You need to check the code and understand how it works.

maybe I'll rename the variable error_str if it causes confusion.

It should be last_error_str then, to denote that it's just the last error we've encountered, which may not have been fatal.

There's no error, just a sync failure (please check the code, understand the difference). I've updated them to invalid_metadata_str and last_failed_reason, which should be more appropriate.

Or, it could be last_error_code: int and then you could set it in _inc_failed_count by passing the non-zero rc that you test for before calling the function. You could also generate the error string there but only pass the code in, so you won't need a struct (a pair).

Note: union won't work.

_inc_failed_count is not specific to 'invalid metadata' errors. The failures are incremented for what ever error comes out of do_sync_snaps(). Also EINVAL is not a unique error code for 'invalid metadata', it's returned for other errors too. The suggested change should make do_sync_snaps() return unique error codes for every error it encounters, so that _inc_failed_count() could set the values. Otherwise, it would display/log invalid metadata for all kind of failures.
It's a broader change, but it's out of scope of this PR. I'd stick with the existing implementation instead.

leonid-s-usov · 2024-04-09T11:22:03Z

The last edit looks better with the last_failed_reason, thank you!

The whole point of my comment was to keep related fields coherent: we update the reason in one place, and the timestamp in another, and we use the error code to control the flow. With the new naming, it's easier to see related fields, that's an improvement. A function to set both the timestamp and the reason together could be another step forward, as would be checking for has_value on either of the fields to couple the flows that set failures and those that act upon them (i.e. _inc_failed_count)

Having that said, I'm happy to close the thread about the failure reason and how we store it, thanks for discussing it with me. More on the functional side, shouldn't we now update that failure reason in other error flows? Like the failure to fetch the snapinfo, or the failure to find PRIMARY_SNAP_ID_KEY?

joscollin · 2024-04-10T11:29:56Z

Closing in favour of #56816

github-actions bot added the cephfs Ceph File System label Apr 4, 2024

joscollin added the bug-fix label Apr 4, 2024

joscollin force-pushed the wip-invalid-metadata-in-remote-snapshot branch from e8ef142 to 8de5970 Compare April 4, 2024 11:51

joscollin requested a review from a team April 5, 2024 10:15

leonid-s-usov reviewed Apr 5, 2024

View reviewed changes

joscollin marked this pull request as draft April 8, 2024 12:01

cephfs_mirror: update peer status for invalid metadata in remote snap…

f87d84b

…shot Fixes: https://tracker.ceph.com/issues/65317 Signed-off-by: Jos Collin <jcollin@redhat.com>

joscollin force-pushed the wip-invalid-metadata-in-remote-snapshot branch from 8de5970 to f87d84b Compare April 9, 2024 03:49

joscollin marked this pull request as ready for review April 9, 2024 03:51

joscollin closed this Apr 10, 2024

vshankar mentioned this pull request Apr 10, 2024

cephfs_mirror: update peer status for invalid metadata in remote snapshot #56816

Merged

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cephfs_mirror: update peer status for invalid metadata in remote snapshot#56698

cephfs_mirror: update peer status for invalid metadata in remote snapshot#56698
joscollin wants to merge 1 commit intoceph:mainfrom
joscollin:wip-invalid-metadata-in-remote-snapshot

joscollin commented Apr 4, 2024 •

edited

Loading

Uh oh!

leonid-s-usov left a comment •

edited

Loading

Uh oh!

joscollin commented Apr 8, 2024

Uh oh!

leonid-s-usov commented Apr 8, 2024

Uh oh!

joscollin commented Apr 9, 2024

Uh oh!

leonid-s-usov commented Apr 9, 2024

Uh oh!

joscollin commented Apr 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joscollin commented Apr 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Checklist

Uh oh!

leonid-s-usov left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joscollin commented Apr 8, 2024

Uh oh!

leonid-s-usov commented Apr 8, 2024

Uh oh!

joscollin commented Apr 9, 2024

Uh oh!

leonid-s-usov commented Apr 9, 2024

Uh oh!

joscollin commented Apr 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joscollin commented Apr 4, 2024 •

edited

Loading

leonid-s-usov left a comment •

edited

Loading