Skip to content

BlueStore:NCB:Bug-Fix for recovery code with shared blobs#44563

Merged
neha-ojha merged 1 commit intoceph:masterfrom
benhanokh:NCB_new_alloc_map
Feb 4, 2022
Merged

BlueStore:NCB:Bug-Fix for recovery code with shared blobs#44563
neha-ojha merged 1 commit intoceph:masterfrom
benhanokh:NCB_new_alloc_map

Conversation

@benhanokh
Copy link
Contributor

Fixes: https://tracker.ceph.com/issues/53678
Signed-off-by: gbenhano@redhat.com

Replaced the BitmapAllocator used by NCB Recovery code with a Simple bitmap allowing for bits to be set multiple times without any adverse effect(since shared-blobs will report the same allocation multiple times)

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@benhanokh benhanokh requested a review from a team as a code owner January 17, 2022 10:45
{
//dout(10) << "offset=" << offset << ", length=" <<length<< ", min_alloc_size=" <<min_alloc_size
// << ", min_alloc_size_mask=" << min_alloc_size_mask << dendl;
#if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better not leave the unused code there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed code which is used elsewhere because I don't the case could ever happen (if if it will the old code will break), but I still want to keep a reference to the previous code

Copy link
Contributor

@aclamk aclamk Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think this function need cleanup.
I changed semantics here and want people to be able to see it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't a good idea to leave dead code around - either '#if 0' or commented-out.
If you want to explain a change from a previous version - you can add relevant text in a comment, explaining
what was changed and why.

derr << "****failed create_bitmap_allocator()" << dendl;
utime_t start = ceph_clock_now();
SimpleBitmap *sbmap = create_simple_bitmap_allocator(cct, path, bdev->get_size(), min_alloc_size);
if (sbmap == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen. A ctor never fails in this way

}

//---------------------------------------------------------
int BlueStore::read_allocation_from_drive_on_startup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the various failure values?

if (allocator) {
dout(5) << "bitmap-allocator=" << allocator << dendl;
} else {
SimpleBitmap *sbmap = create_simple_bitmap_allocator(cct, path, bdev->get_size(), min_alloc_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@github-actions github-actions bot added the tests label Jan 19, 2022
@benhanokh benhanokh force-pushed the NCB_new_alloc_map branch 3 times, most recently from cf4627f to c916905 Compare January 22, 2022 07:22
@benhanokh
Copy link
Contributor Author

@neha-ojha
Copy link
Member

@benhanokh the make check failures in https://jenkins.ceph.com/job/ceph-pull-requests/88970/ seem related, PTAL

@benhanokh benhanokh force-pushed the NCB_new_alloc_map branch 2 times, most recently from eff20b1 to 7ae7e5b Compare January 26, 2022 13:11
@benhanokh benhanokh requested a review from aclamk January 26, 2022 13:15
@benhanokh
Copy link
Contributor Author

@neha-ojha can you please check the failures ?
I saw nothing I could attribute to my code
thx
http://pulpito.front.sepia.ceph.com/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/

@ronen-fr
Copy link
Contributor

@neha-ojha can you please check the failures ?
I saw nothing I could attribute to my code
thx
http://pulpito.front.sepia.ceph.com/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/

benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642395
is a data corruption. It might be a bug in the scrubber code, but it might well be a bug introduced here.

@ronen-fr
Copy link
Contributor

benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642395
is a data corruption. It might be a bug in the scrubber code, but it might well be a bug introduced here.

Appears in master, too. So - scrub testing issue. Not related to this PR.

Copy link
Contributor

@aclamk aclamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good!

@neha-ojha
Copy link
Member

@benhanokh The reason the api tests are failing is https://jenkins.ceph.com/job/ceph-api/31284/consoleFull#-4369474192a811ea2-3e7b-466b-84b4-d13df7e35809

../src/os/bluestore/bluestore_tool.cc: In function ‘int main(int, char**)’:
../src/os/bluestore/bluestore_tool.cc:575:73: error: no matching function for call to ‘BlueStore::read_allocation_from_drive_for_bluestore_tool(bool)’
  575 |     int r = bluestore.read_allocation_from_drive_for_bluestore_tool(true);
      |                                                                         ^
In file included from ../src/os/bluestore/bluestore_tool.cc:22:
../src/os/bluestore/BlueStore.h:3655:8: note: candidate: ‘int BlueStore::read_allocation_from_drive_for_bluestore_tool()’
 3655 |   int  read_allocation_from_drive_for_bluestore_tool();

@neha-ojha
Copy link
Member

@benhanokh The reason the api tests are failing is https://jenkins.ceph.com/job/ceph-api/31284/consoleFull#-4369474192a811ea2-3e7b-466b-84b4-d13df7e35809

../src/os/bluestore/bluestore_tool.cc: In function ‘int main(int, char**)’:
../src/os/bluestore/bluestore_tool.cc:575:73: error: no matching function for call to ‘BlueStore::read_allocation_from_drive_for_bluestore_tool(bool)’
  575 |     int r = bluestore.read_allocation_from_drive_for_bluestore_tool(true);
      |                                                                         ^
In file included from ../src/os/bluestore/bluestore_tool.cc:22:
../src/os/bluestore/BlueStore.h:3655:8: note: candidate: ‘int BlueStore::read_allocation_from_drive_for_bluestore_tool()’
 3655 |   int  read_allocation_from_drive_for_bluestore_tool();

@benhanokh your last commit fixes the api test failure! could you please either squash it into previous commits or add a Signed-off-by to this commit and update the commit title to something like os/bluestore/bluestore_tool.cc: update read_allocation_from_drive_for_bluestore_tool call - I think we should be good to merge after that

Replaces the BitmapAllocator used by NCB Recovery code with a dedicated SimpleBitmap.
The SimpleBitmap allows for bits to be set multiple times without any adverse effect.
This is needed beacuse shared-blobs will report the same allocation multiple times.

Fixes: https://tracker.ceph.com/issues/53678

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
@neha-ojha neha-ojha merged commit e849923 into ceph:master Feb 4, 2022
@neha-ojha
Copy link
Member

@benhanokh please prepare a quincy backport for this, thanks!

benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 10, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 14, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 14, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) add a step to scrub allocation file after each teuthology test

Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 15, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) add a step to scrub allocation file after each teuthology test
18) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 15, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) add a step to scrub allocation file after each teuthology test
18) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 15, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 17, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 17, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 23, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Feb 23, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Mar 7, 2022
…ies and destage allocations to disk before killing the OSD

1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown

15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR ceph#44563)

17) make shutdown-timeout configurable

18) Fix BlueFS handling of SYNC compaction
19) Fix SimpleBitmap init to ignore incomplete block at the end of the bdev (when bdev-size in unaligned on block-size)
20) Force consisnt view of BlueFS by calling bluefs->compact_log() and then bluefs->sync_metadata()

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Mar 7, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
benhanokh added a commit to benhanokh/ceph that referenced this pull request Mar 10, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 9b2a64a)
nSedrickm pushed a commit to nSedrickm/ceph that referenced this pull request Mar 21, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
markhpc pushed a commit to markhpc/ceph that referenced this pull request Mar 24, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 9b2a64a)
markhpc pushed a commit to markhpc/ceph that referenced this pull request Mar 26, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 9b2a64a)
joscollin pushed a commit to joscollin/ceph that referenced this pull request Apr 5, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
zalsader pushed a commit to zalsader/ceph that referenced this pull request Apr 11, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
dpaganel pushed a commit to dpaganel/ceph that referenced this pull request May 17, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
dependabot bot added a commit to joscollin/ceph that referenced this pull request Jun 22, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
dependabot bot added a commit to joscollin/ceph that referenced this pull request Aug 16, 2022
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop() which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
    5) clear op_shardedwq queues, this is safe since we didn't started processing them
    6) stop timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase debug level on fast-shutdown
    11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase debug level at fast-shutdown umount()
    14) set a time limit to fast-shutdown

    15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
    16) Fix error message for qfsck (error was caused by PR ceph#44563)

    17) make shutdown-timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants