os/bluestore: fix deep-scrub operation against disk silent errors #23629

Merged
tchaikov merged 2 commits into ceph:master from wangxiaoguang:fix_deep_scrub
Aug 31, 2018
Conversation

@wangxiaoguang
Contributor

@wangxiaoguang wangxiaoguang commented Aug 17, 2018

Say an object has cached data, but some time later the cache's underlying
physical device suffers silent disk errors, so the cache and the physical
data no longer match. In that case, deep-scrub still reads from the cache
first and skips the CRC checksum, so it will not detect such data
corruption in a timely manner.

Here we introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CACHE' which tells
deep-scrub to bypass object caches. Note that we only bypass caches in the
STATE_CLEAN state. STATE_WRITING caches have not yet been written to the
physical device, so deep-scrub cannot read them from the device and can
safely read these dirty caches instead. Once they reach STATE_CLEAN (or are
never added to the BlueStore cache), the next round of deep-scrub can check
them correctly.

Following the discussion above, I refactored BlueStore::BufferSpace::read
slightly, adding a new 'flags' argument, whose value will be:

     enum {
       READ_CLEAN_CACHE = 0x1,     // read clean cache
       READ_DIRTY_CACHE = 0x2,     // read dirty cache
       READ_ALL_CACHE   = 0x3,     // read clean & dirty cache
    };

flags READ_ALL_CACHE: normal read
flags READ_DIRTY_CACHE: bypass clean cache, currently only for the
deep-scrub operation

Test:
I deliberately corrupted an object that had cached data; with this patch,
deep-scrub finds the data error promptly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>

@wangxiaoguang
Contributor Author

@liewegas please take a look at this patch, thanks.

@tchaikov tchaikov added the core label Aug 17, 2018
@liewegas
Member

This is great! I would modify it slightly, though, so that there is a single BYPASS_CLEAN_CACHE flag, because it doesn't make sense to use clean but not dirty buffers (READ_CLEAN_CACHE).

One question: did you check what the behavior is when you have a clean buffer, bypass it, and the device read returns an error? Is the clean buffer still there so that the next (normal) read will succeed? In principle there is an opportunity to recover some of the data (write it somewhere else, etc.), although in practice I wouldn't worry about this narrow corner case too much, at least not for a first pass.

@wangxiaoguang
Contributor Author

wangxiaoguang commented Aug 20, 2018

This is great! I would modify it slightly, though, so that there is a single BYPASS_CLEAN_CACHE flag, because it doesn't make sense to use clean but not dirty buffers (READ_CLEAN_CACHE).

ok, thanks for review:)

did you check what the behavior is when you have a clean buffer, bypass it, and the device read returns an error?

This object will be repaired.
==>> do_osd_ops
====>> do_read
======>> objects_read_sync // get EIO error.
======>> rep_repair_primary_object
then the read operation is issued again. I used "rados get" to test this case; the 'rados' tool did not even perceive the object's corruption.

is the clean buffer still there so that the next (normal) read will succeed?

Yes, the clean buffer is still there and the next read will succeed (indeed the current read also succeeds, since the corrupted object is fixed immediately), but I'm not sure whether the caches were discarded and regenerated while the object was being repaired.

@liewegas liewegas requested a review from dzafman August 26, 2018 15:50
Contributor

@dzafman dzafman left a comment


It would be hard to test this because ceph-objectstore-tool takes the osd down to corrupt an object, so it won't be in the cache on osd restart. All current scrub tests use run_osd() instead of run_osd_bluestore() too.

CEPH_OSD_OP_FLAG_FADVISE_DONTNEED = 0x20,/* data will not be accessed in the near future */
CEPH_OSD_OP_FLAG_FADVISE_NOCACHE = 0x40, /* data will be accessed only once by this client */
CEPH_OSD_OP_FLAG_WITH_REFERENCE = 0x80, /* need reference couting */
CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE = 0x100, /* bypass ObjectStore cache, mainly for deep-scrub */
Member


need to update ceph_osd_op_flag_name too:

const char * ceph_osd_op_flag_name(unsigned flag)
{
  const char *name;

  switch(flag) {
    case CEPH_OSD_OP_FLAG_EXCL:
      name = "excl";
      break;
    case CEPH_OSD_OP_FLAG_FAILOK:
      name = "failok";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_RANDOM:
      name = "fadvise_random";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL:
      name = "fadvise_sequential";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_WILLNEED:
      name = "favise_willneed";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_DONTNEED:
      name = "fadvise_dontneed";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
      name = "fadvise_nocache";
      break;
    default:
      name = "???";
  };

  return name;
}
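Putting both requested changes together, the updated switch could look roughly like the following self-contained sketch. The flag values are copied from the diff quoted earlier in this review; treat the names and values here as illustrative rather than the authoritative Ceph definitions.

```cpp
#include <cassert>
#include <cstring>

// Illustrative flag values, mirroring the diff quoted in this review
// (the real definitions live in the Ceph source tree).
enum {
  CEPH_OSD_OP_FLAG_FADVISE_NOCACHE    = 0x40,
  CEPH_OSD_OP_FLAG_WITH_REFERENCE     = 0x80,
  CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE = 0x100,
};

// Map a single op flag to a human-readable name; unknown flags fall
// through to "???" just like the original function.
const char *ceph_osd_op_flag_name(unsigned flag)
{
  switch (flag) {
    case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
      return "fadvise_nocache";
    case CEPH_OSD_OP_FLAG_WITH_REFERENCE:
      return "with_reference";
    case CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE:
      return "bypass_clean_cache";
    default:
      return "???";
  }
}
```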

Contributor Author


Thanks, I'll update this patch soon.

@wangxiaoguang
Contributor Author

@xiexingguo I've pushed a new version, could you please check it again? Thanks.

case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
name = "fadvise_nocache";
break;
case CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE:
Member


Do you mind adding the new CEPH_OSD_OP_FLAG_WITH_REFERENCE flag too?

Contributor Author


ok, will add it.

name = "fadvise_nocache";
break;
case CEPH_OSD_OP_FLAG_WITH_REFERENCE:
name = "reference_couting";
Member


Don't want to be nit-picking, but I'd prefer with_reference to stay consistent with the others...

Contributor Author


Not at all, your suggestions are good and reasonable; indeed I appreciate your patience! That one's on me, thanks 💯

Contributor

@dzafman dzafman left a comment


1 nit

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
@wangxiaoguang
Contributor Author

@dzafman @xiexingguo @tchaikov I submitted a new version which just splits the previous patch into two, thanks.

@dzafman
Contributor

dzafman commented Aug 30, 2018

retest this please


ready_regions_t ready_regions;

// for deep-scrub, wo only read dirty cache and bypass clean cache in
Contributor


s/wo/we/

ready_regions_t ready_regions;

// for deep-scrub, wo only read dirty cache and bypass clean cache in
// order to read undering block device in case there are silent disk errors.
Contributor


underlying.

Contributor Author


Sorry, will send new version.

Member


s/wo/we/ ?

Member


Since you are there... :-)

@tchaikov
Contributor

tchaikov commented Aug 30, 2018

@tchaikov tchaikov removed the needs-qa label Aug 30, 2018
@tchaikov
Contributor

@liewegas @dzafman @xiexingguo shall we backport this change?

@liewegas
Member

I think so!

Say an object has cached data, but some time later the cache's underlying
physical device suffers silent disk errors, so the cache and the physical
data no longer match. In that case, deep-scrub still reads from the cache
first and skips the CRC checksum, so it will not detect such data
corruption in a timely manner.

Here we introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE' which
tells deep-scrub to bypass object caches. Note that we only bypass caches
in the STATE_CLEAN state. STATE_WRITING caches have not yet been written to
the physical device, so deep-scrub cannot read them from the device and can
safely read these dirty caches instead. Once they reach STATE_CLEAN (or are
never added to the BlueStore cache), the next round of deep-scrub can check
them correctly.

Following the discussion above, I refactored BlueStore::BufferSpace::read
slightly, adding a new 'flags' argument, whose value will be 0 or:
     enum {
       BYPASS_CLEAN_CACHE = 0x1,     // bypass clean cache
     };

flags 0: normal read, do not bypass clean or dirty cache
flags BYPASS_CLEAN_CACHE: bypass clean cache, currently only for the
                        deep-scrub operation

Test:
   I deliberately corrupted an object that had cached data; with this
   patch, deep-scrub finds the data error promptly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
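The bypass rule in the commit message above (always serve dirty buffers from cache, skip clean ones when the flag is set so the device read and its checksum verification run) can be sketched roughly as follows. The types and names here are simplified stand-ins, not the actual BlueStore code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative stand-ins for the BlueStore buffer states and the new
// read flag; the real types live in os/bluestore/BlueStore.h.
enum { BYPASS_CLEAN_CACHE = 0x1 };
enum BufState { STATE_CLEAN, STATE_WRITING };

struct Buffer {
  BufState state;
  std::string data;
};

// Return cached data for `off` when the flags allow it; an empty
// string means "miss", i.e. fall through to the block device so that
// silent disk errors are caught by the checksum verification there.
std::string cache_read(const std::map<uint64_t, Buffer>& cache,
                       uint64_t off, unsigned flags)
{
  auto it = cache.find(off);
  if (it == cache.end())
    return "";
  // STATE_WRITING buffers are not on disk yet, so they must always be
  // served from cache; clean buffers are skipped when deep-scrub asks
  // to bypass them, forcing a device read plus checksum check.
  if ((flags & BYPASS_CLEAN_CACHE) && it->second.state == STATE_CLEAN)
    return "";
  return it->second.data;
}
```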
@tchaikov tchaikov merged commit f8985aa into ceph:master Aug 31, 2018