Skip to content

cephfs_mirror: update peer status for invalid metadata in remote snapshot#56698

Closed
joscollin wants to merge 1 commit intoceph:mainfrom
joscollin:wip-invalid-metadata-in-remote-snapshot
Closed

cephfs_mirror: update peer status for invalid metadata in remote snapshot#56698
joscollin wants to merge 1 commit intoceph:mainfrom
joscollin:wip-invalid-metadata-in-remote-snapshot

Conversation

@joscollin
Copy link
Member

@joscollin joscollin commented Apr 4, 2024

Fixes: https://tracker.ceph.com/issues/65317

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Apr 4, 2024
@joscollin joscollin force-pushed the wip-invalid-metadata-in-remote-snapshot branch from e8ef142 to 8de5970 Compare April 4, 2024 11:51
@joscollin joscollin requested a review from a team April 5, 2024 10:15
Copy link
Contributor

@leonid-s-usov leonid-s-usov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It concerns me that we have failure_reason and failed as separate fields set at different sites and conditions because it creates opportunities for inconsistent state.

Would it be possible to design an interface for updating these fields in unison? Or even, have an std::optional<string> failure_reason where we could then say bool failed() { return failure_reason.has_value(); }?

Maybe we could deduce the info from the error code - another reason to make sure that we return meaningful codes. If that were the case, we could save space because the string could be constructed from the error code at the recipient side.

If we have to share an error code for different internal reasons, we could still do something about it, like passing around a pair of the rc and the error string, or the rc and the detail code representing some error string. In either case, we could then treat the pair as an atomic result and status that would be passed into the _inc_failed_count function to unambiguously encode the state into the PeerReplayer object

UPD:
Looking into this further, I realize that failed could be false while some failures could have been seen, as long as they are fewer than max_failures. That explains how we could have the fields out of sync, so failure_reason.has_value() is not a valid suggestion.

However, there is still a point about having the int error code stored instead of a formatted string, and a new point about naming: the reason field should probably be prefixed with last_ just like the boost::optional<monotime> last_failed. That would not allow for the misinterpretation of how the fields were related I had above.

@joscollin
Copy link
Member Author

It concerns me that we have failure_reason and failed as separate fields set at different sites and conditions because it creates opportunities for inconsistent state.

Would it be possible to design an interface for updating these fields in unison? Or even, have an std::optional<string> failure_reason where we could then say bool failed() { return failure_reason.has_value(); }?

Maybe we could deduce the info from the error code - another reason to make sure that we return meaningful codes. If that were the case, we could save space because the string could be constructed from the error code at the recipient side.

If we have to share an error code for different internal reasons, we could still do something about it, like passing around a pair of the rc and the error string, or the rc and the detail code representing some error string. In either case, we could then treat the pair as an atomic result and status that would be passed into the _inc_failed_count function to unambiguously encode the state into the PeerReplayer object

UPD: Looking into this further, I realize that failed could be false while some failures could have been seen, as long as they are fewer than max_failures. That explains how we could have the fields out of sync, so failure_reason.has_value() is not a valid suggestion.

However, there is still a point about having the int error code stored instead of a formatted string, and a new point about naming: the reason field should probably be prefixed with last_ just like the boost::optional<monotime> last_failed. That would not allow for the misinterpretation of how the fields were related I had above.

I could make it a struct (union?) or make it optional, as you've suggested.

Regarding the error code: there's no error here, maybe I'll rename the variable error_str if it causes confusion. This is a special case when ceph_get_snap_info succeeds and then found an invalid metadata on the remote fs. This patch provides additional information in the peer status command output, which we've already logged.

@joscollin joscollin marked this pull request as draft April 8, 2024 12:01
@leonid-s-usov
Copy link
Contributor

Regarding the error code: there's no error here

There's EINVAL we return for the condition.

maybe I'll rename the variable error_str if it causes confusion.

It should be last_error_str then, to denote that it's just the last error we've encountered, which may not have been fatal. Or, it could be last_error_code: int and then you could set it in _inc_failed_count by passing the non-zero rc that you test for before calling the function.
You could also generate the error string there but only pass the code in, so you won't need a struct (a pair).

Note: union won't work.

@joscollin joscollin force-pushed the wip-invalid-metadata-in-remote-snapshot branch from 8de5970 to f87d84b Compare April 9, 2024 03:49
@joscollin joscollin marked this pull request as ready for review April 9, 2024 03:51
@joscollin
Copy link
Member Author

Regarding the error code: there's no error here

There's EINVAL we return for the condition.

How does that help? The peer_status() is not implemented by checking the return value sent by build_snap_map. Also, EINVAL is not a unique error code for 'invalid metadata', it is returned at other places too. You need to check the code and understand how it works.

maybe I'll rename the variable error_str if it causes confusion.

It should be last_error_str then, to denote that it's just the last error we've encountered, which may not have been fatal.

There's no error, just a sync failure (please check the code, understand the difference). I've updated them to invalid_metadata_str and last_failed_reason, which should be more appropriate.

Or, it could be last_error_code: int and then you could set it in _inc_failed_count by passing the non-zero rc that you test for before calling the function. You could also generate the error string there but only pass the code in, so you won't need a struct (a pair).

Note: union won't work.

_inc_failed_count is not specific to 'invalid metadata' errors. The failures are incremented for what ever error comes out of do_sync_snaps(). Also EINVAL is not a unique error code for 'invalid metadata', it's returned for other errors too. The suggested change should make do_sync_snaps() return unique error codes for every error it encounters, so that _inc_failed_count() could set the values. Otherwise, it would display/log invalid metadata for all kind of failures.
It's a broader change, but it's out of scope of this PR. I'd stick with the existing implementation instead.

@leonid-s-usov
Copy link
Contributor

The last edit looks better with the last_failed_reason, thank you!

The whole point of my comment was to keep related fields coherent: we update the reason in one place, and the timestamp in another, and we use the error code to control the flow. With the new naming, it's easier to see related fields, that's an improvement. A function to set both the timestamp and the reason together could be another step forward, as would be checking for has_value on either of the fields to couple the flows that set failures and those that act upon them (i.e. _inc_failed_count)

Having that said, I'm happy to close the thread about the failure reason and how we store it, thanks for discussing it with me. More on the functional side, shouldn't we now update that failure reason in other error flows? Like the failure to fetch the snapinfo, or the failure to find PRIMARY_SNAP_ID_KEY?

@joscollin
Copy link
Member Author

Closing in favour of #56816

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix cephfs Ceph File System

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants