os/bluestore: fix deep-scrub operation against disk silent errors #23629

Merged
tchaikov merged 2 commits into ceph:master from wangxiaoguang:fix_deep_scrub
Aug 31, 2018
Conversation

@wangxiaoguang
Contributor

@wangxiaoguang wangxiaoguang commented Aug 17, 2018

Say an object has cached data, but some time later the cache's underlying
physical device suffers silent disk errors, so the cache and the physical
data no longer match. In that case, deep-scrub still reads from the cache
first and skips the CRC checksum, so it will not detect such data
corruption in a timely manner.

Here we introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CACHE' which tells
deep-scrub to bypass object caches. Note that we only bypass caches in the
STATE_CLEAN state. STATE_WRITING caches have not yet been written to the
physical device, so deep-scrub cannot read them from the device and can
safely read these dirty caches instead. Once they reach STATE_CLEAN (or are
never added to the BlueStore cache), the next round of deep-scrub can check
them correctly.

Following the discussion above, I refactored BlueStore::BufferSpace::read
slightly, adding a new 'flags' argument, whose value will be:

     enum {
       READ_CLEAN_CACHE = 0x1,     // read clean cache
       READ_DIRTY_CACHE = 0x2,     // read dirty cache
       READ_ALL_CACHE   = 0x3,     // read clean & dirty cache
    };

flags READ_ALL_CACHE: normal read
flags READ_DIRTY_CACHE: bypass clean cache, currently only for the
deep-scrub operation

Test:
I deliberately corrupted an object that had cached data; with this patch,
deep-scrub finds the data error promptly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>

@wangxiaoguang
Contributor Author

@liewegas please take a look at this patch, thanks.

@tchaikov tchaikov added the core label Aug 17, 2018
@liewegas
Member

This is great! I would modify it slightly, though, so that there is a single BYPASS_CLEAN_CACHE flag, because it doesn't make sense to use clean but not dirty buffers (READ_CLEAN_CACHE).

One question: did you check what the behavior is when you have a clean buffer, bypass it, and the device read returns an error? Is the clean buffer still there so that the next (normal) read will succeed? In principle there is an opportunity to recover some of the data (write it somewhere else, etc.), although in practice I wouldn't worry about this narrow corner case too much, at least not for a first pass.

@wangxiaoguang
Contributor Author

wangxiaoguang commented Aug 20, 2018

This is great! I would modify it slightly, though, so that there is a single BYPASS_CLEAN_CACHE flag, because it doesn't make sense to use clean but not dirty buffers (READ_CLEAN_CACHE).

ok, thanks for review:)

did you check what the behavior is when you have a clean buffer, bypass it, and the device read returns an error?

This object will be repaired.
==>> do_osd_ops
====>> do_read
======>> objects_read_sync // get EIO error.
======>> rep_repair_primary_object
then the read operation is issued again. I used "rados get" to test this case; the 'rados' tool did not even perceive the object's corruption.

is the clean buffer still there so that the next (normal) read will succeed?

Yes, the clean buffer is still there and the next read will succeed (indeed the current read also succeeds, since the corrupted object is fixed immediately), but I'm not sure whether the caches were discarded and regenerated while the object was being repaired.

@liewegas liewegas requested a review from dzafman August 26, 2018 15:50
Contributor

@dzafman dzafman left a comment


It would be hard to test this because ceph-objectstore-tool takes the osd down to corrupt an object, so it won't be in the cache on osd restart. All current scrub tests use run_osd() instead of run_osd_bluestore() too.

CEPH_OSD_OP_FLAG_FADVISE_DONTNEED = 0x20,/* data will not be accessed in the near future */
CEPH_OSD_OP_FLAG_FADVISE_NOCACHE = 0x40, /* data will be accessed only once by this client */
CEPH_OSD_OP_FLAG_WITH_REFERENCE = 0x80, /* need reference couting */
CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE = 0x100, /* bypass ObjectStore cache, mainly for deep-scrub */
Member


need to update ceph_osd_op_flag_name too:

const char * ceph_osd_op_flag_name(unsigned flag)
{
  const char *name;

  switch(flag) {
    case CEPH_OSD_OP_FLAG_EXCL:
      name = "excl";
      break;
    case CEPH_OSD_OP_FLAG_FAILOK:
      name = "failok";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_RANDOM:
      name = "fadvise_random";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL:
      name = "fadvise_sequential";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_WILLNEED:
      name = "favise_willneed";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_DONTNEED:
      name = "fadvise_dontneed";
      break;
    case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
      name = "fadvise_nocache";
      break;
    default:
      name = "???";
  };

  return name;
}
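Putting both requested changes together, the updated switch could look roughly like the following self-contained sketch. The flag values are copied from the diff quoted earlier in this review; treat the names and values here as illustrative rather than the authoritative Ceph definitions.

```cpp
#include <cassert>
#include <cstring>

// Illustrative flag values, mirroring the diff quoted in this review
// (the real definitions live in the Ceph source tree).
enum {
  CEPH_OSD_OP_FLAG_FADVISE_NOCACHE    = 0x40,
  CEPH_OSD_OP_FLAG_WITH_REFERENCE     = 0x80,
  CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE = 0x100,
};

// Map a single op flag to a human-readable name; unknown flags fall
// through to "???" just like the original function.
const char *ceph_osd_op_flag_name(unsigned flag)
{
  switch (flag) {
    case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
      return "fadvise_nocache";
    case CEPH_OSD_OP_FLAG_WITH_REFERENCE:
      return "with_reference";
    case CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE:
      return "bypass_clean_cache";
    default:
      return "???";
  }
}
```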

Contributor Author


Thanks, I'll update this patch soon.

@wangxiaoguang
Contributor Author

@xiexingguo I've pushed a new version, could you please check it again? Thanks.

case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
name = "fadvise_nocache";
break;
case CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE:
Member


Do you mind adding the new CEPH_OSD_OP_FLAG_WITH_REFERENCE flag too?

Contributor Author


ok, will add it.

name = "fadvise_nocache";
break;
case CEPH_OSD_OP_FLAG_WITH_REFERENCE:
name = "reference_couting";
Member


Don't want to be nit-picking, but I'd prefer with_reference to stay consistent with the others...

Contributor Author


Not at all, your suggestions are good and reasonable; indeed I appreciate your patience! That one's on me, thanks 💯

Contributor

@dzafman dzafman left a comment


1 nit

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
@wangxiaoguang
Contributor Author

@dzafman @xiexingguo @tchaikov I submitted a new version which just splits the previous patch into two, thanks.

@dzafman
Contributor

dzafman commented Aug 30, 2018

retest this please


ready_regions_t ready_regions;

// for deep-scrub, wo only read dirty cache and bypass clean cache in
Contributor


s/wo/we/

ready_regions_t ready_regions;

// for deep-scrub, wo only read dirty cache and bypass clean cache in
// order to read undering block device in case there are silent disk errors.
Contributor


underlying.

Contributor Author


Sorry, will send new version.

Member


s/wo/we/ ?

Member


Since you are there... :-)

@tchaikov
Contributor

tchaikov commented Aug 30, 2018

@tchaikov tchaikov removed the needs-qa label Aug 30, 2018
@tchaikov
Contributor

@liewegas @dzafman @xiexingguo shall we backport this change?

@liewegas
Member

I think so!

Say an object has cached data, but some time later the cache's underlying
physical device suffers silent disk errors, so the cache and the physical
data no longer match. In that case, deep-scrub still reads from the cache
first and skips the CRC checksum, so it will not detect such data
corruption in a timely manner.

Here we introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE' which
tells deep-scrub to bypass object caches. Note that we only bypass caches
in the STATE_CLEAN state. STATE_WRITING caches have not yet been written to
the physical device, so deep-scrub cannot read them from the device and can
safely read these dirty caches instead. Once they reach STATE_CLEAN (or are
never added to the BlueStore cache), the next round of deep-scrub can check
them correctly.

Following the discussion above, I refactored BlueStore::BufferSpace::read
slightly, adding a new 'flags' argument, whose value will be 0 or:
     enum {
       BYPASS_CLEAN_CACHE = 0x1,     // bypass clean cache
     };

flags 0: normal read, do not bypass clean or dirty cache
flags BYPASS_CLEAN_CACHE: bypass clean cache, currently only for the
                        deep-scrub operation

Test:
   I deliberately corrupted an object that had cached data; with this
   patch, deep-scrub finds the data error promptly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
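The bypass rule in the commit message above (always serve dirty buffers from cache, skip clean ones when the flag is set so the device read and its checksum verification run) can be sketched roughly as follows. The types and names here are simplified stand-ins, not the actual BlueStore code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative stand-ins for the BlueStore buffer states and the new
// read flag; the real types live in os/bluestore/BlueStore.h.
enum { BYPASS_CLEAN_CACHE = 0x1 };
enum BufState { STATE_CLEAN, STATE_WRITING };

struct Buffer {
  BufState state;
  std::string data;
};

// Return cached data for `off` when the flags allow it; an empty
// string means "miss", i.e. fall through to the block device so that
// silent disk errors are caught by the checksum verification there.
std::string cache_read(const std::map<uint64_t, Buffer>& cache,
                       uint64_t off, unsigned flags)
{
  auto it = cache.find(off);
  if (it == cache.end())
    return "";
  // STATE_WRITING buffers are not on disk yet, so they must always be
  // served from cache; clean buffers are skipped when deep-scrub asks
  // to bypass them, forcing a device read plus checksum check.
  if ((flags & BYPASS_CLEAN_CACHE) && it->second.state == STATE_CLEAN)
    return "";
  return it->second.data;
}
```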
@tchaikov tchaikov merged commit f8985aa into ceph:master Aug 31, 2018