os/bluestore: fix deep-scrub operation againest disk silent errors#23629
os/bluestore: fix deep-scrub operation againest disk silent errors#23629tchaikov merged 2 commits intoceph:masterfrom
Conversation
|
@liewegas please take a look at this patch, thanks. |
|
This is great! I would modify it slightly, though, so that there is a single BYPASS_CLEAN_CACHE flag, because it doesn't make sense to use clean and not dirty buffers (READ_CLEAN_CACHE). One question: did check what the behavior was when you have a clean buffer, bypass it and read the device but get an error? is the clean buffer still there so that the next (normal) read will succeed? In principle there is an opportunity to recover some fo the data (write it somewhere else etc) although in practice I wouldn't worry about this narrow corner case too much, at least not for a first pass. |
ok, thanks for review:)
This object will be repaired.
Yes, clean buffer is there and next read will succeed(indeed current read will also succeed, corrupted object will be fix immediately), but I'm not sure whether caches were discarded and re-generated in the procedure of fixing object. |
48f8dae to
ee6873d
Compare
dzafman
left a comment
There was a problem hiding this comment.
It would be hard to test this because ceph-objectstore-tool takes the osd down to corrupt an object, so it won't be in the cache on osd restart. All current scrub tests use run_osd() instead of run_osd_bluestore() too.
| CEPH_OSD_OP_FLAG_FADVISE_DONTNEED = 0x20,/* data will not be accessed in the near future */ | ||
| CEPH_OSD_OP_FLAG_FADVISE_NOCACHE = 0x40, /* data will be accessed only once by this client */ | ||
| CEPH_OSD_OP_FLAG_WITH_REFERENCE = 0x80, /* need reference couting */ | ||
| CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE = 0x100, /* bypass ObjectStore cache, mainly for deep-scrub */ |
There was a problem hiding this comment.
need to update ceph_osd_op_flag_name too:
const char * ceph_osd_op_flag_name(unsigned flag)
{
const char *name;
switch(flag) {
case CEPH_OSD_OP_FLAG_EXCL:
name = "excl";
break;
case CEPH_OSD_OP_FLAG_FAILOK:
name = "failok";
break;
case CEPH_OSD_OP_FLAG_FADVISE_RANDOM:
name = "fadvise_random";
break;
case CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL:
name = "fadvise_sequential";
break;
case CEPH_OSD_OP_FLAG_FADVISE_WILLNEED:
name = "favise_willneed";
break;
case CEPH_OSD_OP_FLAG_FADVISE_DONTNEED:
name = "fadvise_dontneed";
break;
case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE:
name = "fadvise_nocache";
break;
default:
name = "???";
};
return name;
}There was a problem hiding this comment.
Thanks, I'll update this patch soon.
ee6873d to
8ca2d66
Compare
|
@xiexingguo I've updated a new version, could you please check it again? thanks. |
| case CEPH_OSD_OP_FLAG_FADVISE_NOCACHE: | ||
| name = "fadvise_nocache"; | ||
| break; | ||
| case CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE: |
There was a problem hiding this comment.
Do you mind adding the new CEPH_OSD_OP_FLAG_WITH_REFERENCE flag too?
There was a problem hiding this comment.
ok, will add it.
8ca2d66 to
a175d51
Compare
src/osd/osd_types.cc
Outdated
| name = "fadvise_nocache"; | ||
| break; | ||
| case CEPH_OSD_OP_FLAG_WITH_REFERENCE: | ||
| name = "reference_couting"; |
There was a problem hiding this comment.
Don't want to be nit-picking, but I'd prefer with_reference to keep pace with others...
There was a problem hiding this comment.
Really never mind, your suggestions are good and reasonable, indeed I should appreciate your patience! It's me not doing well, thanks 💯
a175d51 to
56572ff
Compare
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
56572ff to
67295cd
Compare
|
@dzafman @xiexingguo @tchaikov I submitted new version which just divided previous patch into twos, thanks. |
|
retest this please |
src/os/bluestore/BlueStore.cc
Outdated
|
|
||
| ready_regions_t ready_regions; | ||
|
|
||
| // for deep-scrub, wo only read dirty cache and bypass clean cache in |
src/os/bluestore/BlueStore.cc
Outdated
| ready_regions_t ready_regions; | ||
|
|
||
| // for deep-scrub, wo only read dirty cache and bypass clean cache in | ||
| // order to read undering block device in case there are silent disk errors. |
There was a problem hiding this comment.
Sorry, will send new version.
|
http://pulpito.ceph.com/kchai-2018-08-30_02:23:40-rados-wip-kefu-testing-2018-08-29-2346-distro-basic-smithi/
hence i don't think they are relevant to this PR. |
|
@liewegas @dzafman @xiexingguo shall we backport this change? |
|
I think so! |
Say a object who has data caches, but in a while later, caches' underlying
physical device has silent disk erros accidentally, then caches and physical
data are not same. In such case, deep-scrub operation still tries to read
caches firstly and won't do crc checksum, then deep-scrub won't find such
data corruptions timely.
Here introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE' which tells
deep-scrub to bypass object caches. Note that we only bypass cache who is in
STATE_CLEAN state. For STATE_WRITING caches, currently they are not written
to physical device, so deep-scrub operation can not read physical device and
can read these dirty caches safely. Once they are in STATE_CLEAN state(or not
added to bluestore cache), next round deep-scurb can check them correctly.
As to above discussions, I refactor BlueStore::BufferSpace::read sightly,
adding a new 'flags' argument, whose value will be 0 or:
enum {
BYPASS_CLEAN_CACHE = 0x1, // bypass clean cache
};
flags 0: normal read, do not bypass clean or dirty cache
flags BYPASS_CLEAN_CACHE: bypass clean cache, currently only for deep-scrube
operation
Test:
I deliberately corrupt a object with cache, with this patch, deep-scrub
can find data error very timely.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@easystack.cn>
67295cd to
a7f1af2
Compare
Say a object who has data caches, but in a while later, caches' underlying
physical device has silent disk erros accidentally, then caches and physical
data are not same. In such case, deep-scrub operation still tries to read
caches firstly and won't do crc checksum, then deep-scrub won't find such
data corruptions timely.
Here introduce a new flag 'CEPH_OSD_OP_FLAG_BYPASS_CACHE' which tells
deep-scrub to bypass object caches. Note that we only bypass cache who is in
STATE_CLEAN state. For STATE_WRITING caches, currently they are not written
to physical device, so deep-scrub operation can not read physical device and
can read these dirty caches safely. Once they are in STATE_CLEAN state(or not
added to bluestore cache), next round deep-scurb can check them correctly.
As to above discussions, I refactor BlueStore::BufferSpace::read sightly,
adding a new 'flags' argument, whose value will be:
flags READ_ALL_CACHE: normal read
flags READ_DIRTY_CACHE: bypass clean cache, currently only for deep-scrube
operation
Test:
I deliberately corrupt a object with cache, with this patch, deep-scrub
can find data error very timely.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@easystack.cn