osd: implement per-pg leases to avoid stale reads by liewegas · Pull Request #29236 · ceph/ceph

liewegas · 2019-07-23T19:35:14Z

No description provided.

athanatos · 2019-07-25T01:28:46Z

src/messages/MOSDPing.h

+  utime_t ping_stamp;               ///< when the PING was sent
+  ceph::signedspan mono_ping_stamp; ///< relative to sender's clock
+  ceph::signedspan mono_send_stamp; ///< replier's send stamp
+  boost::optional<ceph::time_detail::signedspan> delta_ub;  ///< ping sender


std::optional?
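
The suggestion amounts to swapping the Boost wrapper for the standard one. A minimal sketch of what that could look like, assuming `signedspan` is a signed nanosecond `std::chrono` duration; the alias, the `PingStamps` struct, and the `delta_ub_or` helper are stand-ins for illustration, not the actual `MOSDPing` definition:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Assumption: stand-in for ceph::signedspan (signed nanosecond duration).
using signedspan = std::chrono::duration<std::int64_t, std::nano>;

// Hypothetical holder for the stamps shown in the diff above.
struct PingStamps {
  signedspan mono_ping_stamp{0};       // relative to sender's clock
  signedspan mono_send_stamp{0};       // replier's send stamp
  std::optional<signedspan> delta_ub;  // empty until an upper bound is known
};

// Returns the upper bound if one is set, otherwise the supplied fallback.
inline signedspan delta_ub_or(const PingStamps& p, signedspan fallback) {
  return p.delta_ub.value_or(fallback);
}
```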

athanatos

LGTM

tchaikov · 2019-08-14T02:00:27Z

retest this please.

athanatos · 2019-09-27T18:05:45Z

LGTM

tchaikov · 2019-09-28T11:36:41Z

/build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:457:50: error: invalid new-expression of abstract class type 'ceph::osd::PG'
  457 |       return seastar::make_ready_future<Ref<PG>>(new PG{pgid,
      |                                                  ^~~~~~~~~~~~
  458 |      pg_shard_t{whoami, pgid.shard},
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~              
  459 |      coll_ref,
      |      ~~~~~~~~~                                    
  460 |      std::move(pool),
      |      ~~~~~~~~~~~~~~~~                             
  461 |      std::move(name),
      |      ~~~~~~~~~~~~~~~~                             
  462 |      create_map,
      |      ~~~~~~~~~~~                                  
  463 |      shard_services,
      |      ~~~~~~~~~~~~~~~                              
  464 |      ec_profile});
      |      ~~~~~~~~~~~                                  
In file included from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/pg_map.h:14,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.h:27,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:4:
/build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/pg.h:48:7: note:   because the following virtual functions are pure within 'ceph::osd::PG':
   48 | class PG : public boost::intrusive_ref_counter<
      |       ^~
In file included from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/shard_services.h:12,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.h:25,
                 from /build/ceph-15.0.0-5643-ga29a4f4/src/crimson/osd/osd.cc:4:
/build/ceph-15.0.0-5643-ga29a4f4/src/osd/PeeringState.h:281:18: note: 	'virtual void PeeringState::PeeringListener::queue_check_readable(epoch_t, ceph::time_detail::timespan)'
  281 |     virtual void queue_check_readable(epoch_t lpr, ceph::timespan delay) = 0;
      |                  ^~~~~~~~~~~~~~~~~~~~
/build/ceph-15.0.0-5643-ga29a4f4/src/osd/PeeringState.h:282:18: note: 	'virtual void PeeringState::PeeringListener::recheck_readable()'
  282 |     virtual void recheck_readable() = 0;
      |                  ^~~~~~~~~~~~~~~~

could you please help add dummy methods for crimson?
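
For context, the error above arises because `PG` inherits two pure virtual functions that it does not override. A reduced sketch of the kind of stub that resolves it follows; the types are simplified stand-ins, and the `stub_calls` counter exists only for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using epoch_t = std::uint32_t;              // assumption: stand-in typedef
using timespan = std::chrono::nanoseconds;  // assumption: stand-in for ceph::timespan

// Reduced stand-in for PeeringState::PeeringListener: only the two
// pure virtuals named in the compiler notes above.
struct PeeringListener {
  virtual void queue_check_readable(epoch_t lpr, timespan delay) = 0;
  virtual void recheck_readable() = 0;
  virtual ~PeeringListener() = default;
};

// Hypothetical crimson-side PG. Overriding both methods, even with
// near-empty bodies, makes the class concrete so it can be instantiated.
struct PG : PeeringListener {
  int stub_calls = 0;  // illustration only: counts how often a stub ran

  void queue_check_readable(epoch_t, timespan) override {
    ++stub_calls;  // placeholder: a real version would schedule a check
  }
  void recheck_readable() override {
    ++stub_calls;  // placeholder: a real version would re-evaluate readability
  }
};
```

Stubs like these only satisfy the compiler; they do not provide the behavior the peering code expects.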

Signed-off-by: Sage Weil <sage@redhat.com>

Keep track of which OSDs from the prior set we care about that affect the prior_readable_until_ub. Note that it is only the *down* OSDs that we have to track here, since everything in the *probe* set we will already contact during peering (they are still up), guaranteeing that those PGs are aware of the interval change and are no longer readable in the prior interval. Signed-off-by: Sage Weil <sage@redhat.com>

If we see that a prior_readable_down_osd is known to be dead, we can remove it from the set. And if the set is empty, we can skip the rest of our waiting period and leave the WAIT state. Signed-off-by: Sage Weil <sage@redhat.com>
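
The pruning step described in the two commit messages above can be sketched roughly as follows. The helper name and the `is_dead` predicate are hypothetical; the real code consults the OSD map rather than a callback:

```cpp
#include <cassert>
#include <functional>
#include <set>

// Hypothetical helper: drop every OSD the predicate reports as dead
// from the tracked set, then report whether the WAIT period can end
// early because nothing is left to wait for.
inline bool prune_dead_and_check_done(
    std::set<int>& prior_readable_down_osds,
    const std::function<bool(int)>& is_dead) {
  for (auto it = prior_readable_down_osds.begin();
       it != prior_readable_down_osds.end();) {
    if (is_dead(*it)) {
      it = prior_readable_down_osds.erase(it);  // erase returns the next iterator
    } else {
      ++it;
    }
  }
  return prior_readable_down_osds.empty();
}
```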

We want to renew before we prepare or send activate messages so that we have the opportunity to include leases in them (coming soon!). And we do not want to send explicit lease messages until we know that the peers have activated. In particular, we want to avoid queueing a notify (via pending_activators) and then sending a lease that will arrive before it. Signed-off-by: Sage Weil <sage@redhat.com>

The lease goes out with the MOSDPGLog or info, and the ack comes back with the info. We no longer need to renew the lease explicitly in all_activated_and_committed() because we *just* piggybacked on activation. We can just wait for the normal renew event to fire. Signed-off-by: Sage Weil <sage@redhat.com>
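
A rough sketch of the piggybacking idea: the activation message optionally carries a lease, and the receiver applies it when present. All type and field names are hypothetical, and the rule that `readable_until` never moves backwards is an assumption of this sketch rather than something stated in the commit message:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using timespan = std::chrono::nanoseconds;  // assumption: stand-in for ceph::timespan

// Hypothetical lease payload; field names are illustrative.
struct Lease {
  timespan readable_until;  // point up to which the receiver may serve reads
  timespan interval;        // lease length
};

// Hypothetical activation message that may carry a lease.
struct ActivateMsg {
  std::optional<Lease> lease;
};

// Hypothetical receiver-side state.
struct ReplicaState {
  timespan readable_until{0};

  // Apply a piggybacked lease if one is attached. Assumption: an older
  // lease never shortens readable_until. Returns true if a lease was seen.
  bool handle_activate(const ActivateMsg& m) {
    if (!m.lease) return false;
    if (m.lease->readable_until > readable_until)
      readable_until = m.lease->readable_until;
    return true;
  }
};
```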

We only do this for primary -> replica, so we only need to proc_lease() from the replica states. Signed-off-by: Sage Weil <sage@redhat.com>

The 'replica' term does not map well onto EC pools. More importantly, the implementation is often wrong for EC pools, where the role may be 0 or 1 independent of whether the OSD is the primary. Introduce 'nonprimary' to mean an acting OSD that is not the primary. Signed-off-by: Sage Weil <sage@redhat.com>
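
To illustrate the distinction, here is a small sketch of a 'nonprimary' check that looks only at acting-set membership and primary identity, not at the role. The `PgMapping` struct and its members are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical view of one PG's mapping as seen by a single OSD.
struct PgMapping {
  int whoami;               // this OSD's id
  int primary;              // id of the primary OSD
  std::vector<int> acting;  // acting set (shard order for EC pools)

  bool is_acting() const {
    return std::find(acting.begin(), acting.end(), whoami) != acting.end();
  }
  bool is_primary() const { return whoami == primary; }
  // "nonprimary": in the acting set and not the primary. The check
  // deliberately ignores the OSD's position (role) in the acting set.
  bool is_nonprimary() const { return is_acting() && !is_primary(); }
};
```

With this definition, an OSD that sits first in the acting set but is not the primary still counts as nonprimary, which is the EC case a role-based test can get wrong.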

Signed-off-by: Sage Weil <sage@redhat.com>

If there are no down OSDs from prior intervals, then the normal peering process will end up contacting all of the prior OSDs and ensuring that their prior interval is terminated during peering. Signed-off-by: Sage Weil <sage@redhat.com>

Signed-off-by: Sage Weil <sage@redhat.com>

These are stubs; the reschedule one (at minimum) probably needs a meaningful implementation in order for the PG to peer in some cases. Signed-off-by: Sage Weil <sage@redhat.com>

liewegas · 2019-09-28T16:52:46Z

@tchaikov added stubs so that it builds, but the implementations probably need to be fleshed out in order for the PG to peer successfully in many cases.

tchaikov · 2019-09-28T17:30:15Z

thanks @liewegas ! i will come up with a follow-up PR.

tchaikov · 2019-09-29T06:15:08Z

The following tests FAILED:
	176 - unittest_rgw_amqp (Failed)

https://tracker.ceph.com/issues/42042

tchaikov · 2019-09-29T06:15:21Z

tchaikov · 2019-09-29T06:44:04Z

qa/suites/rados/singleton-nomsgr/all/osd_stale_reads.yaml

+      - \(POOL_APP_NOT_ENABLED\)
+      - \(SLOW_OPS\)
+      - \(PG_AVAILABILITY\)
+      - \(PG_DEGRADED\)


http://pulpito.ceph.com/kchai-2019-09-29_02:16:32-rados-wip-kefu-testing-2019-09-28-1615-distro-basic-mira/4343377/

for posterity, the PG_DEGRADED failure was addressed by the change after this PR was picked up by my batch.

liewegas added core feature labels Jul 23, 2019

liewegas requested a review from athanatos July 23, 2019 19:35

liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 6c1c9d2 to c0605fc Compare July 23, 2019 19:39

athanatos reviewed Jul 25, 2019

athanatos self-requested a review July 25, 2019 02:11

athanatos approved these changes Jul 25, 2019

liewegas force-pushed the wip-read-hole-bypg branch from c0605fc to af7d7ea Compare July 25, 2019 17:58

liewegas force-pushed the wip-read-hole-bypg branch from af7d7ea to c7179fe Compare August 5, 2019 18:31

liewegas added the wip-sage-testing label Aug 5, 2019

liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from eef5dd8 to d697e11 Compare August 6, 2019 18:41

liewegas removed the wip-sage-testing label Aug 8, 2019

liewegas force-pushed the wip-read-hole-bypg branch from d697e11 to 2f7e969 Compare August 8, 2019 15:52

liewegas force-pushed the wip-read-hole-bypg branch from 2f7e969 to c6b39a5 Compare August 15, 2019 17:29

liewegas changed the title ~~osd: implement per-pg leases to avoid stale reads~~ WIP osd: implement per-pg leases to avoid stale reads Aug 15, 2019

liewegas force-pushed the wip-read-hole-bypg branch from c6b39a5 to 7e908f4 Compare August 15, 2019 21:17

liewegas added wip-sage3-testing and removed wip-sage3-testing labels Aug 15, 2019

liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 578c7fa to 83ca87c Compare September 9, 2019 17:01

liewegas force-pushed the wip-read-hole-bypg branch 2 times, most recently from 2211f62 to f5fc93a Compare September 20, 2019 21:05

liewegas force-pushed the wip-read-hole-bypg branch from f5fc93a to f6a1b5c Compare September 23, 2019 02:47

liewegas added 4 commits September 26, 2019 10:55

osd/PeeringState: PG state 64 bits wide

3c9a285

Signed-off-by: Sage Weil <sage@redhat.com>

osd/osd_types: pg_history_t: add prior_readable_until_ub

caaf0f8

Signed-off-by: Sage Weil <sage@redhat.com>

osd/osd_types: add PG_STATE_LAGGY

7df85a6

PG is laggy (unreadable) because ping(s) are delayed. Signed-off-by: Sage Weil <sage@redhat.com>

osd/osd_types: add PG_STATE_WAIT

a0b453a

PG is waiting for previous intervals' readable intervals to expire. Signed-off-by: Sage Weil <sage@redhat.com>
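
One way to see why the state word was widened first: flags placed above bit 31 are lost in a 32-bit value. The bit positions below are chosen for illustration and should be treated as assumptions, as should the `blocked_by_lease` helper:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: bit positions are illustrative, chosen only to show
// flags that do not fit in a 32-bit state word.
constexpr std::uint64_t PG_STATE_LAGGY = 1ULL << 33;
constexpr std::uint64_t PG_STATE_WAIT  = 1ULL << 34;

// Hypothetical helper: true while either lease-related flag is set.
constexpr bool blocked_by_lease(std::uint64_t state) {
  return (state & (PG_STATE_LAGGY | PG_STATE_WAIT)) != 0;
}
```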

osd: send ops back to primary if replica is not readable

ca80c3f

This is the simplest strategy--much simpler than queueing them and waking them up again later. Signed-off-by: Sage Weil <sage@redhat.com>
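
The bounce-to-primary strategy can be sketched as a single decision on the replica. The function and enum names are hypothetical, and comparing a monotonic "now" against a lease-derived `readable_until` is an assumption about how readability is tested:

```cpp
#include <cassert>
#include <chrono>

using timespan = std::chrono::nanoseconds;  // assumption: stand-in for ceph::timespan

enum class ReadAction { ServeLocally, SendToPrimary };

// Hypothetical decision for a read arriving at a replica: serve it only
// while readable_until lies in the future; otherwise send it back to the
// primary instead of queueing it and waking it up later.
inline ReadAction route_replica_read(timespan mono_now, timespan readable_until) {
  return mono_now < readable_until ? ReadAction::ServeLocally
                                   : ReadAction::SendToPrimary;
}
```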

liewegas changed the title ~~WIP osd: implement per-pg leases to avoid stale reads~~ osd: implement per-pg leases to avoid stale reads Sep 26, 2019

liewegas force-pushed the wip-read-hole-bypg branch from e47e8f5 to ef29f4d Compare September 26, 2019 20:20

tchaikov added the needs-qa label Sep 27, 2019

liewegas force-pushed the wip-read-hole-bypg branch from ef29f4d to 565f723 Compare September 27, 2019 21:54

tchaikov added the wip-kefu-testing label Sep 28, 2019

liewegas and others added 12 commits September 28, 2019 11:51

qa/suites/rados/singleton-nomsg/osd_stale_reads.yaml

379bf4b

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PeeringState: notice 'dead' prior_readable OSDs

e2d95c4

If we see that a prior_readable_down_osd is known to be dead, we can remove it from the set. And if the set is empty, we can skip the rest of our waiting period and leave the WAIT state. Signed-off-by: Sage Weil <sage@redhat.com>

osd/PeeringState: piggyback pg_lease on MOSDPGLog

70a037f

We only do this for primary -> replica, so we only need to proc_lease() from the replica states. Signed-off-by: Sage Weil <sage@redhat.com>

osd/PeeringState: make proc_lease, recalc_readable_until more verbose

e234d67

Signed-off-by: Sage Weil <sage@redhat.com>

doc: stale reads notes

8be0106

Signed-off-by: Sage Weil <sage@redhat.com>

doc: document new 'laggy' and 'wait' pg states

9d23250

Signed-off-by: Sage Weil <sage@redhat.com>

crimson/osd: fix osdpg build

da2dc1c

These are stubs; the reschedule one (at minimum) probably needs a meaningful implementation in order for the PG to peer in some cases. Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-read-hole-bypg branch from 910d232 to da2dc1c Compare September 28, 2019 16:51

tchaikov merged commit e659e86 into ceph:master Sep 29, 2019

tchaikov reviewed Sep 29, 2019

liewegas deleted the wip-read-hole-bypg branch September 29, 2019 15:35

liewegas mentioned this pull request Sep 29, 2019

osd/PrimaryLogPG: include op_returns in dup replies #30640

Merged

tchaikov mentioned this pull request Jun 29, 2022

osd/PeeringState: fix missed recheck_readable from laggy #44499

Merged

14 tasks

tchaikov merged 32 commits into ceph:master from liewegas:wip-read-hole-bypg
