Skip to content

librados: use CEPH_OSD_FLAG_FULL_FORCE for IoCtxImpl::remove#55348

Merged
yuriw merged 1 commit intoceph:mainfrom
chenyuanrun:fix-io-ctx-impl-flags
Mar 21, 2024
Merged

librados: use CEPH_OSD_FLAG_FULL_FORCE for IoCtxImpl::remove#55348
yuriw merged 1 commit intoceph:mainfrom
chenyuanrun:fix-io-ctx-impl-flags

Conversation

@chenyuanrun
Copy link
Contributor

librados::OPERATION_FULL_FORCE should be translated to CEPH_OSD_FLAG_FULL_FORCE before calling IoCtxImpl::remove().

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@chenyuanrun chenyuanrun requested a review from a team as a code owner January 29, 2024 10:10
@github-actions github-actions bot added the core label Jan 29, 2024
@chenyuanrun
Copy link
Contributor Author

@ajarr @gregsfortytwo I think this is a BUG for librados, could you please review this fix? Thanks!

@chenyuanrun
Copy link
Contributor Author

@tchaikov fix commit 435954b

@interestingyong
Copy link

It seems like only at the delete blocking ABIs, against aio ABIs is ok.
Could help to review this pr?
@xiexingguo @idryomov

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems correct, but doesn't this issue also affect

int librados::IoCtx::remove(const std::string& oid, int flags)
{
  object_t obj(oid);
  return io_ctx_impl->remove(obj, flags); 
}

Here flags are passed from a public API to a private implementation

int librados::IoCtxImpl::remove(const object_t& oid, int flags)
{
  ::ObjectOperation op;
  prepare_assert_ops(&op);
  op.remove();
  return operate(oid, &op, NULL, flags);
}

without any translation.

It would be nice to:

  • file a ticket on https://tracker.ceph.com ticket so the fix can be backported
  • add an integration test to ensure that this doesn't break in the future

@chenyuanrun
Copy link
Contributor Author

@idryomov Thanks! I will file a ticket on https://tracker.ceph.com/ later.

@chenyuanrun
Copy link
Contributor Author

@idryomov librados::IoCtxImpl::remove can not be fixed because it is used by src/tools/rados/rados.cc with the private flags:

ceph/src/tools/rados/rados.cc

Lines 3077 to 3083 in ed736fa

for (const auto& oid : oids) {
if (forcefull) {
ret = detail::remove(io_ctx, oid, (CEPH_OSD_FLAG_FULL_FORCE |
CEPH_OSD_FLAG_FULL_TRY), use_striper);
} else {
ret = detail::remove(io_ctx, oid, use_striper);
}

If we change it , we break the public API, and the break the apps which rely on the improper behavior of this API.

@idryomov
Copy link
Contributor

If we change it , we break the public API, and the break the apps which rely on the improper behavior of this API.

We need to at least document this -- I don't see anything (not even a one-liner comment!) in src/include/rados/librados.hpp or src/librados/librados_cxx.cc. It's very easy to get confused as it is.

Then this is up to @ceph/core, but I'd also suggest deprecating librados::IoCtx::remove(const std::string& oid, int flags). We could add librados::IoCtx::remove2() taking the right flags as a replacement or perhaps just change librados::IoCtx::remove(const std::string& oid) to pass both FULL_TRY and FULL_FORCE as done in src/tools/rados/rados.cc in the snippet that you highlighted in the previous comment. That would effectively make --force-full behavior the default which seems like what @jdurgin had in mind in https://tracker.ceph.com/issues/22413#note-2.

@idryomov
Copy link
Contributor

or perhaps just change librados::IoCtx::remove(const std::string& oid) to pass both FULL_TRY and FULL_FORCE as done in src/tools/rados/rados.cc in the snippet that you highlighted in the previous comment. That would effectively make --force-full behavior the default which seems like what @jdurgin had in mind in https://tracker.ceph.com/issues/22413#note-2.

Note from git history spelunking: at the time when #20534 landed, only FULL_FORCE was passed in src/tools/rados/rados.cc. FULL_TRY was added later in #24264.

@rzarzynski rzarzynski self-requested a review February 26, 2024 18:49
prepare_assert_ops(&op);
op.remove();
return operate(oid, &op, nullptr, librados::OPERATION_FULL_FORCE);
return operate(oid, &op, nullptr, CEPH_OSD_FLAG_FULL_FORCE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take a look on the flow:

int librados::IoCtxImpl::operate(const object_t& oid, ::ObjectOperation *o,
                                 ceph::real_time *pmtime, int flags)
{
  // ...
  Objecter::Op *objecter_op = objecter->prepare_mutate_op(
    oid, oloc,
    *o, snapc, ut,
    flags | extra_op_flags,
    oncommit, &ver);
  // mid-level helpers
  Op *prepare_mutate_op(
    const object_t& oid, const object_locator_t& oloc,
    ObjectOperation& op, const SnapContext& snapc,
    ceph::real_time mtime, int flags,
    Context *oncommit, version_t *objver = NULL,
    osd_reqid_t reqid = osd_reqid_t(),
    ZTracer::Trace *parent_trace = nullptr) {
    Op *o = new Op(oid, oloc, std::move(op.ops), flags | global_op_flags |
                   CEPH_OSD_FLAG_WRITE, oncommit, objver,
                   nullptr, parent_trace);

The union with CEPH_OSD_FLAG_WRITE suggests the conceptual type.

All constructors of Op do:

  struct Op : public RefCountedObject {
    // ...
    op_target_t target;
    // ...
    Op(const object_t& o, const object_locator_t& ol,  osdc_opvec&& _ops,
       int f, /* ... */) :
      target(o, ol, f),
  struct op_target_t {
    int flags = 0;
    // ...
    op_target_t(const object_t& oid, const object_locator_t& oloc, int flags)
      : flags(flags),
    // ...
    bool respects_full() const {
      return
        (flags & (CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_RWORDERED)) &&
        !(flags & (CEPH_OSD_FLAG_FULL_TRY | CEPH_OSD_FLAG_FULL_FORCE));
    }

respects_full() clearly assumes CEPH_OSD_FLAG_FULL_FORCE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The respects_full() governs the objecter's behavior on traffic pausing:

bool Objecter::target_should_be_paused(op_target_t *t)
{
  const pg_pool_t *pi = osdmap->get_pg_pool(t->base_oloc.pool);
  bool pauserd = osdmap->test_flag(CEPH_OSDMAP_PAUSERD);
  bool pausewr = osdmap->test_flag(CEPH_OSDMAP_PAUSEWR) ||
    (t->respects_full() && (_osdmap_full_flag() || _osdmap_pool_full(*pi)));

  return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
    (t->flags & CEPH_OSD_FLAG_WRITE && pausewr) ||
    (osdmap->get_epoch() < epoch_barrier);
}

However, the main responsibility of op_target_t::flags is being passed to an OSD via MOSDOp:

Objecter::MOSDOp *Objecter::_prepare_osd_op(Op *op)
{
  // rwlock is locked

  int flags = op->target.flags;
  flags |= CEPH_OSD_FLAG_KNOWN_REDIR;
  flags |= CEPH_OSD_FLAG_SUPPORTSPOOLEIO;

  // Nothing checks this any longer, but needed for compatibility with
  // pre-luminous osds
  flags |= CEPH_OSD_FLAG_ONDISK;

  if (!honor_pool_full)
    flags |= CEPH_OSD_FLAG_FULL_FORCE;

  // ...
  auto m = new MOSDOp(client_inc, op->tid,
                      hobj, op->target.actual_pgid,
                      osdmap->get_epoch(),
                      flags, op->features);

@chenyuanrun
Copy link
Contributor Author

Note from git history spelunking: at the time when #20534 landed, only FULL_FORCE was passed in src/tools/rados/rados.cc. FULL_TRY was added later in #24264.

@idryomov Should I change CEPH_OSD_FLAG_FULL_FORCE to CEPH_OSD_FLAG_FULL_FORCE|CEPH_OSD_FLAG_FULL_TRY so that this fix could be merged?

@idryomov
Copy link
Contributor

idryomov commented Feb 27, 2024

That would be my preference, as @jdurgin clearly wanted remove() to behave like that CLI command:

You can get around this by using rados_write_op_operate with the 'LIBRADOS_OPERATION_FULL_FORCE' flag (128), like the rados cli, but I agree this should be the default for plain old remove().

But the current fix makes sense on its own, and one could argue that FULL_FORCE -> FULL_FORCE | FULL_TRY should be a separate change. @rzarzynski would have the final say here.

@rzarzynski
Copy link
Contributor

These flags have rich history. FULL_TRY seems to be even stronger than FULL_FORCE:

  // discard due to cluster full transition?  (we discard any op that
  // originates before the cluster or pool is marked full; the client
  // will resend after the full flag is removed or if they expect the
  // op to succeed despite being full).  The except is FULL_FORCE and
  // FULL_TRY ops, which there is no reason to discard because they
  // bypass all full checks anyway.  If this op isn't write or
  // read-ordered, we skip.
  // FIXME: we exclude mds writes for now.
  if (write_ordered && !(m->get_source().is_mds() ||
                         m->has_flag(CEPH_OSD_FLAG_FULL_TRY) ||
                         m->has_flag(CEPH_OSD_FLAG_FULL_FORCE)) &&
      info.history.last_epoch_marked_full > m->get_map_epoch()) {
    dout(10) << __func__ << " discarding op sent before full " << m << " "
             << *m << dendl;
    return;
  }
  // mds should have stopped writing before this point.
  // We can't allow OSD to become non-startable even if mds
  // could be writing as part of file removals.
  if (write_ordered && osd->check_failsafe_full(get_dpp()) &&
      !m->has_flag(CEPH_OSD_FLAG_FULL_TRY)) {
    dout(10) << __func__ << " fail-safe full check failed, dropping request." << dendl;
    return;
  }

The failsafe check has been relaxed in #17177. The idea was to accept operations that are net reducers of usage. However, I'm not sure it can be said with certainty about a deletion of an RADOS object. To exemplify, let's imagine removing 1 byte long obj. Is it a net reducer? Well, on one hand it going to squeeze at least 1 bytes from an ObjectStore (how much exactly is an implementation's detail); on the other it will result in generation of a few dozens of bytes in PGLog and, maybe, trimming other entries.

Pretty complex thing. I would prefer to be cautious when it comes to the failsafe check.

@rzarzynski
Copy link
Contributor

If we change it , we break the public API, and the break the apps which rely on the improper behavior of this API.

I agree with @idryomov on the doc and deprecating the private-flags-taking variant of remove(). We should get remove2() doing the proper translation (see translate_flags() in librados/librados_util.cc).

@idryomov
Copy link
Contributor

@rzarzynski So I take it that you prefer to see a new remove2() and have plain old remove() fixed to pass the proper flag only for FULL_FORCE and not FULL_TRY? I'm not sure I see much use in that...

It might be better to have plain old remove() not pass any flags then -- this would avoid confusion as passing FULL_FORCE is only a half measure and also wouldn't be a change in behavior compared to today since the public FULL_FORCE that is passed today is clearly bogus.

@idryomov
Copy link
Contributor

idryomov commented Feb 27, 2024

To exemplify, let's imagine removing 1 byte long obj. Is it a net reducer? Well, on one hand it going to squeeze at least 1 bytes from an ObjectStore (how much exactly is an implementation's detail); on the other it will result in generation of a few dozens of bytes in PGLog and, maybe, trimming other entries.

I thought the way FULL_TRY was supposed to work was that it's always safe to pass and is effectively free, because if the op isn't a net reducer (let's define that as "makes the situation worse for the OSD"), it's failed with ENOSPC error. This is in contrast to the default behavior where the OSD would drop the op and it would hang in the Objecter and therefore also from a user perspective. Am I mistaken?

@rzarzynski
Copy link
Contributor

rzarzynski commented Feb 27, 2024

@rzarzynski So I take it that you prefer to see a new remove2() and have plain old remove() fixed to pass the proper flag only for FULL_FORCE and not FULL_TRY? I'm not sure I see much use in that...

Yes, just FULL_FORCE, without FULL_TRY (I know the naming became confusing :-/).

The rationale would be to let removes() squeeze some data when it's still quite safe – under the full condition but stay really cautious (paranoid?) when things had gone so badly that OSD got into failsafe full.

I thought the way FULL_TRY was supposed to work was that it's always safe to pass and is effectively free, because if the op isn't a net reducer (let's define that as "makes the situation worse for the OSD"), it's failed with ENOSPC error.

I think you means the second check which happens later, in prepare_transaction(), and bases on the delta_stats:

  // check for full
  if ((ctx->delta_stats.num_bytes > 0 ||
       ctx->delta_stats.num_objects > 0) &&  // FIXME: keys?
      pool.info.has_flag(pg_pool_t::FLAG_FULL)) {
    auto m = ctx->op->get_req<MOSDOp>();
    if (ctx->reqid.name.is_mds() ||   // FIXME: ignore MDS for now
        m->has_flag(CEPH_OSD_FLAG_FULL_FORCE)) {
      dout(20) << __func__ << " full, but proceeding due to FULL_FORCE or MDS"
               << dendl;
    } else if (m->has_flag(CEPH_OSD_FLAG_FULL_TRY)) {
      // they tried, they failed.
      dout(20) << __func__ << " full, replying to FULL_TRY op" << dendl;
      return pool.info.has_flag(pg_pool_t::FLAG_FULL_QUOTA) ? -EDQUOT : -ENOSPC;
    } else {
      // drop request
      dout(20) << __func__ << " full, dropping request (bad client)" << dendl;
      return -EAGAIN;
    }
  }

I doubt these counters cover size of PGLog entries, internals of an ObjectStore' and so on (UPDATE: actually, I doubt they even cover OMAP). I perceive the failsafe full as a measure to (imperfectly) cover such circumstances and prevent an OSD from becoming unable to boot. I imagine it can be very hard for ObjectStore to be always exact about ENOSPC – I'm thinking here about things like a RocksDB's compaction-on-startup in BlueStore.

@chenyuanrun
Copy link
Contributor Author

@idryomov @rzarzynski So what should I do if I want this fix could be merged?

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections from me to this going in as is.

@idryomov
Copy link
Contributor

@idryomov @rzarzynski So what should I do if I want this fix could be merged?

Since this is a bug fix, I think we need a tracker ticket filed for backporting purposes.

@chenyuanrun
Copy link
Contributor Author

@idryomov I have created a issue: https://tracker.ceph.com/issues/64558

@idryomov
Copy link
Contributor

@idryomov I have created a issue: https://tracker.ceph.com/issues/64558

Now need to add Fixes: https://tracker.ceph.com/issues/64558 on the line before Signed-off-by in the commit message. See https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst#fixes-lines for details.

librados::OPERATION_FULL_FORCE should be translated to
CEPH_OSD_FLAG_FULL_FORCE before calling IoCtxImpl::remove().

Fixes: https://tracker.ceph.com/issues/64558
Signed-off-by: Chen Yuanrun <chen-yuanrun@foxmail.com>
@chenyuanrun chenyuanrun force-pushed the fix-io-ctx-impl-flags branch from 1ee6764 to a4e2a59 Compare March 14, 2024 02:10
@chenyuanrun
Copy link
Contributor Author

@idryomov I have modified my commit message, could you please review it again? Thanks!

@idryomov
Copy link
Contributor

@idryomov I have modified my commit message, could you please review it again? Thanks!

LGTM!

@amathuria
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants