
mon/AuthMonitor: fix potential repeated global id#55006

Closed
shminjs wants to merge 1 commit into ceph:main from shminjs:fix-potential-repeated-global-id

Conversation

@shminjs shminjs commented Dec 26, 2023

When we expand or shrink the set of monitors, global ids may be allocated repeatedly, because the calculation of last_allocated_id depends on mon_num and mon_rank, and expanding or shrinking the monitor set changes both. A simple fix is to reset last_allocated_id when the election completes, so that every monitor's last_allocated_id restarts at max_global_id.

Fixes: https://tracker.ceph.com/issues/63891
Signed-off-by: shimin <shimin@kuaishou.com>

@shminjs shminjs requested a review from a team as a code owner December 26, 2023 11:34
@shminjs shminjs force-pushed the fix-potential-repeated-global-id branch from a0c21df to 6d7f96a on December 26, 2023 11:38
shminjs commented Dec 26, 2023

Use the following script to check correctness:

```python
def assign_global_id(last_allocated_id, mon_num, mon_rank, max_global_id, total_allocations):
    # Mirror AuthMonitor::_assign_global_id: round each candidate id up to
    # the next multiple of mon_num, then offset it by mon_rank, so each
    # monitor allocates from its own residue class.
    allocated_ids = []
    for _ in range(total_allocations):
        id = last_allocated_id + 1
        remainder = id % mon_num
        if remainder:
            remainder = mon_num - remainder
        id += remainder + mon_rank

        if id >= max_global_id:
            break

        last_allocated_id = id
        allocated_ids.append(id)

    return allocated_ids

last_allocated_id = 10000
max_global_id = 100000000
total_allocations = 100

# ids handed out while the cluster has 3 monitors
mon_a_list_3 = assign_global_id(last_allocated_id, 3, 0, max_global_id, total_allocations)
mon_b_list_3 = assign_global_id(last_allocated_id, 3, 1, max_global_id, total_allocations)
mon_c_list_3 = assign_global_id(last_allocated_id, 3, 2, max_global_id, total_allocations)

# ids handed out after expanding to 5 monitors, with last_allocated_id
# carried over unchanged
mon_a_list_5 = assign_global_id(last_allocated_id, 5, 0, max_global_id, total_allocations)
mon_b_list_5 = assign_global_id(last_allocated_id, 5, 1, max_global_id, total_allocations)
mon_c_list_5 = assign_global_id(last_allocated_id, 5, 2, max_global_id, total_allocations)
mon_d_list_5 = assign_global_id(last_allocated_id, 5, 3, max_global_id, total_allocations)
mon_e_list_5 = assign_global_id(last_allocated_id, 5, 4, max_global_id, total_allocations)

# a non-empty intersection means the same global id was issued twice
intersection_a = set(mon_a_list_5).intersection(set(mon_b_list_3 + mon_c_list_3))
intersection_b = set(mon_b_list_5).intersection(set(mon_a_list_3 + mon_c_list_3))
intersection_c = set(mon_c_list_5).intersection(set(mon_a_list_3 + mon_b_list_3))

print(intersection_a)
print(intersection_b)
print(intersection_c)
```

shminjs commented Dec 26, 2023

jenkins test make check

When we expand or shrink the set of monitors, global ids may be
allocated repeatedly, because the calculation of last_allocated_id
depends on mon_num and mon_rank, and expanding or shrinking the
monitor set changes both. A simple fix is to reset last_allocated_id
when the election completes.

Fixes: https://tracker.ceph.com/issues/63891
Signed-off-by: shimin <shimin@kuaishou.com>
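The effect of the fix can be sketched with a standalone model (hypothetical helper names, not the actual ceph code): if every monitor restarts its shard at max_global_id after an election, new ids begin above everything already issued, so the old and new shards cannot overlap.

```python
# Sketch of the fix's effect (standalone model, not the ceph code):
# reseeding every shard at max_global_id after an election means the
# new 5-monitor shards start above every id the 3-monitor shards issued.

def shard(last_allocated_id, mon_num, mon_rank, count):
    # same rounding rule as AuthMonitor::_assign_global_id
    ids = []
    for _ in range(count):
        i = last_allocated_id + 1
        r = i % mon_num
        if r:
            r = mon_num - r
        last_allocated_id = i + r + mon_rank
        ids.append(last_allocated_id)
    return ids

# before the election: 3 monitors, all shards seeded at 10000
old = set()
for rank in range(3):
    old |= set(shard(10000, 3, rank, 100))

# after expanding to 5 monitors, the fix reseeds every shard at
# max_global_id, which sits at or above every id issued so far
max_global_id = max(old)
new = set()
for rank in range(5):
    new |= set(shard(max_global_id, 5, rank, 100))

print(old & new)    # → set()
```

Without the reseed, carrying last_allocated_id over unchanged produces the overlaps demonstrated by the script earlier in this thread.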
@shminjs shminjs force-pushed the fix-potential-repeated-global-id branch from 6d7f96a to eccd106 on December 26, 2023 12:12
shminjs commented Dec 26, 2023

jenkins test make check

shminjs commented Jan 30, 2024

jenkins test make check arm64

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 30, 2024
shminjs commented Mar 30, 2024

Is anyone handling this issue?

shminjs commented Mar 30, 2024

CephFS clients depend on global_id, and a repeated global_id would cause inconsistency problems.

@github-actions github-actions bot removed the stale label Mar 30, 2024
@tchaikov tchaikov self-requested a review March 30, 2024 14:39
github-actions bot commented Jun 7, 2024


@github-actions github-actions bot added the stale label Jun 7, 2024
shminjs commented Jun 11, 2024

This is a bug and needs to be fixed.

@github-actions github-actions bot removed the stale label Jun 11, 2024

@github-actions github-actions bot added the stale label Aug 10, 2024
{
std::lock_guard l(auth_lock);
authmon()->_set_mon_num_rank(monmap->size(), rank);
authmon()->_reset_last_allocated_id();
@rzarzynski rzarzynski commented Aug 11, 2024

In my understanding this should lead to:

1. temporarily ceasing issuance of new IDs:

```cpp
uint64_t AuthMonitor::_assign_global_id()
{
  ceph_assert(ceph_mutex_is_locked(mon->auth_lock));
  if (mon_num < 1 || mon_rank < 0) {
    dout(10) << __func__ << " inactive (num_mon " << mon_num
             << " rank " << mon_rank << ")" << dendl;
    return 0;
  }
  if (!last_allocated_id) {
    dout(10) << __func__ << " last_allocated_id == 0" << dendl;
    return 0;
  }
  // ...
```

2. letting the leader maybe bump max_global_id up:

```cpp
int64_t AuthMonitor::assign_global_id(bool should_increase_max)
{
  uint64_t id;
  {
    std::lock_guard l(mon->auth_lock);
    id = _assign_global_id();
    if (should_increase_max) {
      should_increase_max = _should_increase_max_global_id();
    }
  }
  if (mon->is_leader() &&
      should_increase_max) {
    increase_max_global_id();
  }
  return id;
}

void AuthMonitor::increase_max_global_id()
{
  ceph_assert(mon->is_leader());

  Incremental inc;
  inc.inc_type = GLOBAL_ID;
  inc.max_global_id = max_global_id + g_conf()->mon_globalid_prealloc;
  dout(10) << "increasing max_global_id to " << inc.max_global_id << dendl;
  pending_auth.push_back(inc);
}
```

3. maybe letting all monitors synchronize on the new max_global_id:

```cpp
void AuthMonitor::update_from_paxos(bool *need_bootstrap)
{
      switch (inc.inc_type) {
      case GLOBAL_ID:
        max_global_id = inc.max_global_id;
        break;
```

4. finally, starting to accept clients again:

```cpp
void AuthMonitor::update_from_paxos(bool *need_bootstrap)
{
        max_global_id = inc.max_global_id;
  //  ...
  {
    std::lock_guard l(mon->auth_lock);
    if (last_allocated_id == 0) {
      last_allocated_id = max_global_id;
      dout(10) << __func__ << " last_allocated_id initialized to "
               << max_global_id << dendl;
    }
  }
```

Steps 2 and 3 are rather improbable for the AuthMonitor::_reset_last_allocated_id() case.

I agree the patch should help with the sharding of the ID space if mon_num or mon_rank change. (I will take a second look this week.)
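The sequence above can be condensed into a toy model (Python, with hypothetical names mirroring the C++ excerpts; not the real implementation) showing the stall-then-reseed behaviour after a reset:

```python
# Toy model (hypothetical, standalone) of the four steps above: after
# _reset_last_allocated_id() the monitor stalls (returns 0) until a new
# max_global_id arrives and last_allocated_id is reseeded from it.

class ToyAuthMonitor:
    def __init__(self, mon_num, mon_rank, max_global_id, prealloc=100):
        self.mon_num = mon_num
        self.mon_rank = mon_rank
        self.max_global_id = max_global_id
        self.prealloc = prealloc          # stands in for mon_globalid_prealloc
        self.last_allocated_id = max_global_id

    def reset_last_allocated_id(self):
        # the fix under review, run when an election completes
        self.last_allocated_id = 0

    def assign_global_id(self):
        # step 1: no ids are issued while last_allocated_id == 0
        if self.last_allocated_id == 0:
            return 0
        i = self.last_allocated_id + 1
        r = i % self.mon_num
        if r:
            r = self.mon_num - r
        i += r + self.mon_rank
        if i >= self.max_global_id:       # out of preallocated headroom
            return 0
        self.last_allocated_id = i
        return i

    def increase_max_global_id(self):
        # step 2: the leader bumps the shared ceiling
        self.max_global_id += self.prealloc

    def update_from_paxos(self):
        # steps 3-4: each monitor learns the new max and, if it was
        # reset, restarts its shard from it
        if self.last_allocated_id == 0:
            self.last_allocated_id = self.max_global_id

mon = ToyAuthMonitor(mon_num=5, mon_rank=0, max_global_id=10000)
mon.reset_last_allocated_id()
assert mon.assign_global_id() == 0    # stalled until paxos catches up
mon.increase_max_global_id()
mon.update_from_paxos()               # reseed at the new max (10100)
mon.increase_max_global_id()          # a second bump provides headroom
print(mon.assign_global_id())         # → 10105
```

The toy conflates leader and peon into one object; the point is only that a reset monitor issues nothing until steps 2 and 3 have happened, which is the dependency questioned above.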

@github-actions github-actions bot removed the stale label Aug 12, 2024
@rzarzynski rzarzynski requested a review from batrick August 16, 2024 07:20
batrick commented Sep 16, 2024

This PR is under test in https://tracker.ceph.com/issues/68089.

@batrick batrick left a comment

shminjs commented Sep 18, 2024

This failed QA, see for example:

https://pulpito.ceph.com/pdonnell-2024-09-17_02:09:51-fs-wip-pdonnell-testing-20240916.200549-debug-distro-default-smithi/7908340/

and the leader mon log:

ceph-mon.a.log.gz

OK, I will check the log and reply as soon as possible.


@github-actions github-actions bot added the stale label Nov 23, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 23, 2024
