Skip to content

crimson/osd/replicated_recovery_backend: Fix recovery obc usage#56610

Merged
Matan-B merged 8 commits intoceph:mainfrom
Matan-B:wip-matanb-crimson-replicated-head-obc
Apr 30, 2024
Merged

crimson/osd/replicated_recovery_backend: Fix recovery obc usage#56610
Matan-B merged 8 commits intoceph:mainfrom
Matan-B:wip-matanb-crimson-replicated-head-obc

Conversation

@Matan-B
Copy link
Copy Markdown
Contributor

@Matan-B Matan-B commented Apr 1, 2024

Previously, only the head obc had ssc reference. Let clone obc also reference it's head ssc.

Fixes: https://tracker.ceph.com/issues/65203
Fixes: https://tracker.ceph.com/issues/65201

Follow-up PR: #56752

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@Matan-B Matan-B marked this pull request as ready for review April 1, 2024 08:44
@Matan-B Matan-B requested a review from a team as a code owner April 1, 2024 08:44
@Matan-B Matan-B added the DNM label Apr 1, 2024
@Matan-B

This comment was marked as resolved.

@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 758458e to 0946054 Compare April 2, 2024 13:55
@Matan-B Matan-B changed the title crimson/osd/replicated_recovery_backend: use head obc crimson/osd/replicated_recovery_backend: Fix recovery obc usage Apr 2, 2024
@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 0946054 to fcabaf6 Compare April 3, 2024 09:21
@Matan-B Matan-B removed the DNM label Apr 3, 2024
@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch 2 times, most recently from 1fa6818 to 6ac8bd7 Compare April 8, 2024 08:11
@Matan-B

This comment was marked as outdated.

@xxhdx1985126
Copy link
Copy Markdown
Member

Seems that I reviewed the other PR #56752.

@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 6ac8bd7 to 99d6713 Compare April 10, 2024 09:12
Copy link
Copy Markdown
Member

@xxhdx1985126 xxhdx1985126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 99d6713 to c606691 Compare April 10, 2024 11:01
@rzarzynski
Copy link
Copy Markdown
Contributor

jenkins test windows

if (resolved_oid->is_head()) {
// See resolve_oid
return std::move(func)(head, head);
}
Copy link
Copy Markdown
Member

@xxhdx1985126 xxhdx1985126 Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During our teuthology test, resolve_oid here will lead to the failure of recovering clone objects to replica osds, when the SnapSet::clone_snaps is like f:[c, b]. In this case, snapshot f has been removed, if we need to recover clone object oid:f to replica osds, the resolve_oid will fail.

Copy link
Copy Markdown
Member

@xxhdx1985126 xxhdx1985126 Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should allow resolved_oid to be nullopt

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps with_clone_obc_direct can be used here. It's similar to with_clone_obc but without actually resolving the clone. See WIP 93b0f0a from #56752.
What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During our teuthology test, resolve_oid here will lead to the failure of recovering clone objects to replica osds, when the SnapSet::clone_snaps is like f:[c, b]. In this case, snapshot f has been removed, if we need to recover clone object oid:f to replica osds, the resolve_oid will fail.

Why are we trying to recover oid:f if the snapshot was removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please share the osd logs?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we modified the code as follows, and it seemed that the issue was fixed:

diff --git a/src/crimson/osd/object_context_loader.cc b/src/crimson/osd/object_context_loader.cc
index 32a448cfce0..521f81e391a 100644
--- a/src/crimson/osd/object_context_loader.cc
+++ b/src/crimson/osd/object_context_loader.cc
@@ -48,18 +48,14 @@ using crimson::common::local_conf;
         };
       }
       auto resolved_oid = resolve_oid(head->get_head_ss(), oid);
-      if (!resolved_oid) {
-        ERRORDPP("clone {} not found", dpp, oid);
-        return load_obc_iertr::future<>{
-          crimson::ct_error::enoent::make()
-        };
-      }
-      if (resolved_oid->is_head()) {
+      if (resolved_oid && resolved_oid->is_head()) {
         // See resolve_oid
         return std::move(func)(head, head);
       }
       return this->with_clone_obc_only<State>(std::move(head),
-                                              *resolved_oid,
+                                              ((bool)resolved_oid)
+                                               ? *resolved_oid
+                                               : oid,
                                               std::move(func));
     });
   }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we modified the code as follows, and it seemed that the issue was fixed:

There are correct cases where we should expect resolve to fail and return an error, we can't remove this.
However, there are users (snaptrim) which are expected to not resolve_oid. Which is essentially the code you have added. I have implemented this in 1dffd26.

Copy link
Copy Markdown
Member

@xxhdx1985126 xxhdx1985126 Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we modified the code as follows, and it seemed that the issue was fixed:

There are correct cases where we should expect resolve to fail and return an error, we can't remove this. However, there are users (snaptrim) which are expected to not resolve_oid. Which is essentially the code you have added. I have implemented this in 1dffd26.

I think the issue mentioned above (#56610 (comment)) was caused by the primary osd trying to recover a clone object (the latest snap of which had been removed, which further led to the failure of "resolve_oid"). This doesn't seem to be related to snap trimming. On the other hand, could you elaborate a little bit about the correct cases in which the failure of "resolve_oid" is expected, please? Thanks:-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue mentioned above (#56610 (comment)) was caused by the primary osd trying to recover a clone object (the latest snap of which had been removed, which further led to the failure of "resolve_oid"). This doesn't seem to be related to snap trimming. On the other hand,

I'll update recovery users to also skip resolve_oid same as snap trim does.

could you elaborate a little bit about the correct cases in which the failure of "resolve_oid" is expected, please? Thanks:-)

There are cases we resolve_oid returning null and causing and error is correct. When clone doesn't exist or when the snapset's clone_snap does not contain the snap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at the last commit

Previously, only the head obc had ssc reference. Let clone obc
also reference it's head ssc.

Fixes: https://tracker.ceph.com/issues/65203
Fixes: https://tracker.ceph.com/issues/65201

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from c606691 to 1dffd26 Compare April 21, 2024 10:09
Matan-B added 5 commits April 22, 2024 08:33
…case

Resolve_oid on a clone object may actually return the head:
```
    // Because oid.snap > ss.seq, we are trying to read from a snapshot
    // taken after the most recent write to this object. Read from head.
```

In this case, with_clone_obc should apply `func` same as with_head_obc would have.

Note: previously, with_clone_obc_only was called on the resolved head object.
While it didn't cause any errors, using the head_obc as clone is wrong.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
ObjectContextLoader interface provides two variants:

* with_obc:
  // Use this variant by default
  // If oid is a clone object, the clone obc *and* it's
  // matching head obc will be locked and can be used in func.

* with_clone_obc_only:
  // Use this variant in the case where the head object
  // obc is already locked and only the clone obc is needed.
  // Avoid nesting with_head_obc() calls by using with_clone_obc()
  // with an already locked head.

with_clone_obc_direct variant is equal to with_obc on a clone obc
since both the head and the clone obcs will be locked and can be used.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
No change in behavior

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
No change in behavior, improved readability

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 1dffd26 to 07336fc Compare April 22, 2024 08:36
Matan-B added 2 commits April 22, 2024 08:39
SnapTrimObjSubEvent::remove_or_update partially resolves the to be
trimmed clone taking into account in_removed_snaps_queue.
The general resolve_oid is not suitable for this scenario.
Specifically the following check:
```
    if (std::find(
      citer->second.begin(),
      citer->second.end(),
      oid.snap) == citer->second.end()) {
       logger().debug("{} {} does not contain {} -- DNE",
                      __func__, ss.clone_snaps, oid.snap);
       return std::nullopt;
    }
```
because of earlier snap_map_modify call.

Example:
```
INFO  2024-04-07 13:44:01,118 [shard 0:main] osd - SnapTrimObjSubEvent(coid=2:e8855410:::folio011816418-576:8 snapid=8): 2:e8855410:::folio011816418-576:8 snaps [8, 7] -> {7}
DEBUG 2024-04-07 13:44:01,118 [shard 0:main] osd - snap_map_modify: soid 2:e8855410:::folio011816418-576:8, snaps {7}
...
This case will fail:
INFO  2024-04-07 13:44:04,139 [shard 0:main] osd - SnapTrimObjSubEvent(coid=2:e8855410:::folio011816418-576:8 snapid=7): 2:e8855410:::folio011816418-576:8 snaps [7] -> {} ... deleting
DEBUG 2024-04-07 13:44:04,139 [shard 0:main] osd - snap_map_remove: soid 2:e8855410:::folio011816418-576:8
```

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-crimson-replicated-head-obc branch from 07336fc to 48be64b Compare April 22, 2024 08:39
@Matan-B
Copy link
Copy Markdown
Contributor Author

Matan-B commented Apr 25, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants