Skip to content

osd: implement per-pg leases to avoid stale reads#29236

Merged
tchaikov merged 32 commits intoceph:masterfrom
liewegas:wip-read-hole-bypg
Sep 29, 2019
Merged

osd: implement per-pg leases to avoid stale reads#29236
tchaikov merged 32 commits intoceph:masterfrom
liewegas:wip-read-hole-bypg

Conversation

@liewegas
Copy link
Member

No description provided.

@liewegas liewegas requested a review from athanatos July 23, 2019 19:35
@liewegas liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 6c1c9d2 to c0605fc Compare July 23, 2019 19:39
utime_t ping_stamp; ///< when the PING was sent
ceph::signedspan mono_ping_stamp; ///< relative to sender's clock
ceph::signedspan mono_send_stamp; ///< replier's send stamp
boost::optional<ceph::time_detail::signedspan> delta_ub; ///< ping sender
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::optional?

@athanatos athanatos self-requested a review July 25, 2019 02:11
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liewegas liewegas force-pushed the wip-read-hole-bypg branch from c0605fc to af7d7ea Compare July 25, 2019 17:58
@liewegas liewegas force-pushed the wip-read-hole-bypg branch from af7d7ea to c7179fe Compare August 5, 2019 18:31
@liewegas liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from eef5dd8 to d697e11 Compare August 6, 2019 18:41
@liewegas liewegas force-pushed the wip-read-hole-bypg branch from d697e11 to 2f7e969 Compare August 8, 2019 15:52
@tchaikov
Copy link
Contributor

retest this please.

@liewegas liewegas force-pushed the wip-read-hole-bypg branch from 2f7e969 to c6b39a5 Compare August 15, 2019 17:29
@liewegas liewegas changed the title osd: implement per-pg leases to avoid stale reads WIP osd: implement per-pg leases to avoid stale reads Aug 15, 2019
@liewegas liewegas force-pushed the wip-read-hole-bypg branch from c6b39a5 to 7e908f4 Compare August 15, 2019 21:17
@liewegas liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 578c7fa to 83ca87c Compare September 9, 2019 17:01
@liewegas liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 2211f62 to f5fc93a Compare September 20, 2019 21:05
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
PG is laggy (unreadable) because ping(s) are delayed.

Signed-off-by: Sage Weil <sage@redhat.com>
PG is waiting for previous intervals' readable intervals to expire.

Signed-off-by: Sage Weil <sage@redhat.com>
This is the simplest strategy--much simpler than queueing them and
waking them up again later.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas changed the title WIP osd: implement per-pg leases to avoid stale reads osd: implement per-pg leases to avoid stale reads Sep 26, 2019
@athanatos
Copy link
Contributor

LGTM

@tchaikov
Copy link
Contributor

/build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:457:50: error: invalid new-expression of abstract class type 'ceph::osd::PG'
  457 |       return seastar::make_ready_future<Ref<PG>>(new PG{pgid,
      |                                                  ^~~~~~~~~~~~
  458 |      pg_shard_t{whoami, pgid.shard},
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~              
  459 |      coll_ref,
      |      ~~~~~~~~~                                    
  460 |      std::move(pool),
      |      ~~~~~~~~~~~~~~~~                             
  461 |      std::move(name),
      |      ~~~~~~~~~~~~~~~~                             
  462 |      create_map,
      |      ~~~~~~~~~~~                                  
  463 |      shard_services,
      |      ~~~~~~~~~~~~~~~                              
  464 |      ec_profile});
      |      ~~~~~~~~~~~                                  
In file included from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/pg_map.h:14,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.h:27,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:4:
/build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/pg.h:48:7: note:   because the following virtual functions are pure within 'ceph::osd::PG':
   48 | class PG : public boost::intrusive_ref_counter<
      |       ^~
In file included from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/shard_services.h:12,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.h:25,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:4:
/build/ceph-15.0.0-5643-ga29a4f4/src/osd/PeeringState.h:281:18: note: 	'virtual void PeeringState::PeeringListener::queue_check_readable(epoch_t, ceph::time_detail::timespan)'
  281 |     virtual void queue_check_readable(epoch_t lpr, ceph::timespan delay) = 0;
      |                  ^~~~~~~~~~~~~~~~~~~~
/build/ceph-15.0.0-5643-ga29a4f4/src/osd/PeeringState.h:282:18: note: 	'virtual void PeeringState::PeeringListener::recheck_readable()'
  282 |     virtual void recheck_readable() = 0;
      |                  ^~~~~~~~~~~~~~~~

could you please help add dummy methods for crimson?

liewegas and others added 12 commits September 28, 2019 11:51
Signed-off-by: Sage Weil <sage@redhat.com>
Keep track of which OSDs from the prior set we care about that affect
the prior_readable_until_ub.  Note that it is only the *down* OSDs that
we have to track here, since everything in the *probe* set we will already
contact during peering (they are still up), guaranteeing that those PGs
are aware of the interval change and are no longer readable in the prior
interval.

Signed-off-by: Sage Weil <sage@redhat.com>
If we see that a prior_readable_down_osd is known to be dead, we can
remove it from the set.  And if the set is empty, we can skip the rest of
our waiting period and leave the WAIT state.

Signed-off-by: Sage Weil <sage@redhat.com>
We want to renew before we prepeare or send activate messages so that we
have the opportunity to include leases in them (coming soon!).

And we do not want to send explicit lease messages until we know that the
peers have activate.  In particular, we want to avoid queueing a notify
(via pending_activators) and then sending a lease that will arrive before
it.

Signed-off-by: Sage Weil <sage@redhat.com>
The lease goes out with the MOSDPGLog or info, and the ack comes back with
the info.

We no longer need to renew the lease explicitly in
all_activated_and_committed() because we *just* piggybacked on activation.
We can just wait for the normal renew event to fire.

Signed-off-by: Sage Weil <sage@redhat.com>
We only do this for primary -> replica, so we only need to proc_lease()
from the replica states.

Signed-off-by: Sage Weil <sage@redhat.com>
The 'replica' term does not map well onto EC pools.  More importantly,
the implementation is often wrong for EC pools, where role may be 0 or 1
for EC pools independent of whether the OSD is the primary or not.

Introduce 'nonprimary' to mean an acting osd that is not the primary.

Signed-off-by: Sage Weil <sage@redhat.com>
If there are no down OSDs from prior intervals, then the normal peering
process will end up contacting all of the prior OSDs and ensuring that
their prior interval is terminated during peering.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
These are stubs; the reschule one (at minimum) probably needs a meaningful
implementation in order for the PG to peer in some cases.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Member Author

@tchaikov added stubs to that it builds, but they implementations probably need to be fleshed out in order for the pg to successfully peer in many cases.

@tchaikov
Copy link
Contributor

thanks @liewegas ! i will come up with a follow-up PR.

@tchaikov
Copy link
Contributor

The following tests FAILED:
	176 - unittest_rgw_amqp (Failed)

https://tracker.ceph.com/issues/42042

@tchaikov tchaikov merged commit e659e86 into ceph:master Sep 29, 2019
- \(POOL_APP_NOT_ENABLED\)
- \(SLOW_OPS\)
- \(PG_AVAILABILITY\)
- \(PG_DEGRADED\)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://pulpito.ceph.com/kchai-2019-09-29_02:16:32-rados-wip-kefu-testing-2019-09-28-1615-distro-basic-mira/4343377/

for posterity, the PG_DEGRADED failure was addressed by the change after this PR was picked up by my batch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants