osd/PrimaryLogPG: do not use approx_size() for log trimming by xiexingguo · Pull Request #18338 · ceph/ceph

xiexingguo · 2017-10-17T02:41:52Z

There might be holes in the log versions, so approx_size()
will (almost) always overestimate the actual number of log entries.
As a result, we risk an access violation (though it is rare)
when later searching the log list for the oldest log entry to keep.

Fix this problem by counting the precise number of current
log entries instead.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
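
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not Ceph's actual PG log code: `ToyLog`, `num_to_trim()` and `oldest_kept()` are hypothetical names, and the sketch assumes the approximate size is derived from the span of version numbers rather than from the number of entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>

// Hypothetical stand-in for a PG log: entries carry increasing version
// numbers, but some versions may be missing ("holes").
struct ToyLog {
    std::list<std::uint64_t> versions;  // entry versions, oldest first
    std::uint64_t tail = 0;             // version just before the oldest entry

    // Assumed shape of the approximation: the version span, not the entry count.
    std::uint64_t approx_size() const {
        return versions.empty() ? 0 : versions.back() - tail;
    }

    // Exact entry count; std::list::size() is required to be O(1) since C++11.
    std::uint64_t exact_size() const { return versions.size(); }
};

// Number of oldest entries to drop so that at most `keep` entries remain.
std::uint64_t num_to_trim(std::uint64_t size, std::uint64_t keep) {
    return size > keep ? size - keep : 0;
}

// Version of the oldest entry that survives after trimming `n` entries.
// Stepping an iterator past end() is undefined behaviour, which is the
// access-violation risk when `n` comes from an overestimated size.
std::uint64_t oldest_kept(const ToyLog& log, std::uint64_t n) {
    assert(n < log.versions.size());
    auto it = std::next(log.versions.begin(), static_cast<std::ptrdiff_t>(n));
    return *it;
}
```

With entries at versions 1, 2, 5, 6 and 10 and a tail of 0, `approx_size()` yields 10 while `exact_size()` yields 5. Asking to keep 3 entries then requests trimming 7 of the 5 existing entries when the approximation is used, whereas the exact count requests 2 and leaves version 5 as the oldest kept entry.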

liewegas

I think we used approx_size() because std::list::size() used to be O(n) instead of O(1).

xiexingguo · 2017-10-17T12:23:52Z

> I think we used approx_size() because std::list::size() used to be O(n) instead of O(1).

Yeah. It's constant now! (C++11)

tchaikov · 2017-10-18T02:48:16Z

In ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk an access violation when later searching the log list for the oldest log entry to keep.

ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

In ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk overtrimming log entries.

ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

In ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk overtrimming log entries.

ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 3654d56)

Conflicts:
	src/osd/PrimaryLogPG.cc: trivial resolution

In ceph/ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk overtrimming log entries.

ceph/ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 3654d56)

Conflicts:
	src/osd/PrimaryLogPG.cc

In ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk overtrimming log entries.

ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 3654d56)

Conflicts:
	src/osd/PrimaryLogPG.cc: trivial resolution

(cherry picked from commit 85a029a)

Resolves: rhbz#1608060

In ceph#21580 I set a trap to catch some weird and random segmentation faults, and in a recent QA run I observed that it was successfully triggered by one of the test cases, see:

```
http://qa-proxy.ceph.com/teuthology/xxg-2018-07-30_05:25:06-rados-wip-hb-peers-distro-basic-smithi/2837916/teuthology.log
```

The root cause is that there might be holes in the log versions, so the approx_size() method will (almost) always overestimate the actual number of log entries. As a result, we risk overtrimming log entries.

ceph#18338 shows a probably easier way to fix this problem, but unfortunately it can also cause a big performance regression, hence this PR.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 3654d56)

Conflicts:
	src/osd/PrimaryLogPG.cc: trivial resolution

(cherry picked from commit 54b04ba)

Resolves: rhbz#1608060

This function is only used by RocksDB WAL writing, so it must sync data.

This fixes ceph#18338 and thus allows `bluefs_preextend_wal_files` to actually be set to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue ceph#18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-Off-By: Vitaliy Filippov <vitalif@yourcmc.ru>

This function is only used by RocksDB WAL writing, so it must sync data.

This fixes ceph#18338 and thus allows `bluefs_preextend_wal_files` to actually be set to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue ceph#18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-off-by: Vitaliy Filippov <vitalif@yourcmc.ru>

This function is only used by RocksDB WAL writing, so it must sync data.

This fixes ceph#18338 and thus allows `bluefs_preextend_wal_files` to actually be set to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue ceph#18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-off-by: Vitaliy Filippov <vitalif@yourcmc.ru>
(cherry picked from commit c703cf9)

This function is only used by RocksDB WAL writing, so it must sync data.

This fixes ceph#18338 and thus allows `bluefs_preextend_wal_files` to actually be set to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue ceph#18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-off-by: Vitaliy Filippov <vitalif@yourcmc.ru>
(cherry picked from commit c703cf9)

Conflicts:
	src/os/bluestore/KernelDevice.cc
- mimic has a single variable "fd_buffered" where master has an array "fd_buffereds"

This function is only used by RocksDB WAL writing, so it must sync data.

This fixes ceph#18338 and thus allows `bluefs_preextend_wal_files` to actually be set to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue ceph#18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-off-by: Vitaliy Filippov <vitalif@yourcmc.ru>
(cherry picked from commit c703cf9)

Conflicts:
- path: src/os/bluestore/KernelDevice.cc
  comment: luminous has a single variable "fd_buffered" where master has an array "fd_buffereds"

xiexingguo force-pushed the wip-pg branch from 0a4e09e to 024b5bc Compare October 17, 2017 07:32

xiexingguo requested a review from liewegas October 17, 2017 11:07

liewegas approved these changes Oct 17, 2017

liewegas added bug-fix core needs-qa labels Oct 17, 2017

tchaikov added the wip-kefu-testing label Oct 17, 2017

tchaikov merged commit 9adab8b into ceph:master Oct 18, 2017

xiexingguo deleted the wip-pg branch October 18, 2017 03:00

xiexingguo mentioned this pull request Jul 30, 2018

osd/PrimaryLogPG: fix potential pg-log overtrimming #23317

Merged

This was referenced Mar 8, 2019

osd/bluestore: Actually wait until completion in write_sync #26868

Closed

osd/bluestore: Actually wait until completion in write_sync #26870

Closed

vitalif mentioned this pull request Mar 12, 2019

os/bluestore: Actually wait until completion in write_sync #26909

Merged

3 tasks

k0ste mentioned this pull request Aug 9, 2019

luminous: osd/bluestore: Actually wait until completion in write_sync #29564

Merged

tchaikov merged 1 commit into ceph:master from xiexingguo:wip-pg
