
mon/AuthMonitor: fix potential repeated global id#55006

Closed
shminjs wants to merge 1 commit into ceph:main from shminjs:fix-potential-repeated-global-id

Conversation

@shminjs shminjs commented Dec 26, 2023

When we expand or shrink the set of monitors, global ids may be allocated repeatedly, because the calculation of last_allocated_id depends on mon_num and mon_rank, and expanding or shrinking the monitor set changes both. A simple fix is to reset last_allocated_id when the election completes, so that every monitor's last_allocated_id restarts at max_global_id.

Fixes: https://tracker.ceph.com/issues/63891
Signed-off-by: shimin <shimin@kuaishou.com>

@shminjs shminjs requested a review from a team as a code owner December 26, 2023 11:34
@shminjs shminjs force-pushed the fix-potential-repeated-global-id branch from a0c21df to 6d7f96a on December 26, 2023 11:38
shminjs commented Dec 26, 2023

Use the following script to check correctness:

```python
def assign_global_id(last_allocated_id, mon_num, mon_rank, max_global_id, total_allocations):
    # Mirror AuthMonitor::_assign_global_id: round each candidate id up to
    # the next multiple of mon_num, then offset it by mon_rank, so each
    # monitor allocates from its own residue class.
    allocated_ids = []
    for _ in range(total_allocations):
        id = last_allocated_id + 1
        remainder = id % mon_num
        if remainder:
            remainder = mon_num - remainder
        id += remainder + mon_rank

        if id >= max_global_id:
            break

        last_allocated_id = id
        allocated_ids.append(id)

    return allocated_ids

last_allocated_id = 10000
max_global_id = 100000000
total_allocations = 100

# ids handed out while the cluster has 3 monitors
mon_a_list_3 = assign_global_id(last_allocated_id, 3, 0, max_global_id, total_allocations)
mon_b_list_3 = assign_global_id(last_allocated_id, 3, 1, max_global_id, total_allocations)
mon_c_list_3 = assign_global_id(last_allocated_id, 3, 2, max_global_id, total_allocations)

# ids handed out after expanding to 5 monitors, with last_allocated_id
# carried over unchanged
mon_a_list_5 = assign_global_id(last_allocated_id, 5, 0, max_global_id, total_allocations)
mon_b_list_5 = assign_global_id(last_allocated_id, 5, 1, max_global_id, total_allocations)
mon_c_list_5 = assign_global_id(last_allocated_id, 5, 2, max_global_id, total_allocations)
mon_d_list_5 = assign_global_id(last_allocated_id, 5, 3, max_global_id, total_allocations)
mon_e_list_5 = assign_global_id(last_allocated_id, 5, 4, max_global_id, total_allocations)

# a non-empty intersection means the same global id was issued twice
intersection_a = set(mon_a_list_5).intersection(set(mon_b_list_3 + mon_c_list_3))
intersection_b = set(mon_b_list_5).intersection(set(mon_a_list_3 + mon_c_list_3))
intersection_c = set(mon_c_list_5).intersection(set(mon_a_list_3 + mon_b_list_3))

print(intersection_a)
print(intersection_b)
print(intersection_c)
```

shminjs commented Dec 26, 2023

jenkins test make check

When we expand or shrink the set of monitors, global ids may be
allocated repeatedly, because the calculation of last_allocated_id
depends on mon_num and mon_rank, and expanding or shrinking the
monitor set changes both. A simple fix is to reset last_allocated_id
when the election completes.

Fixes: https://tracker.ceph.com/issues/63891
Signed-off-by: shimin <shimin@kuaishou.com>
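The effect of the fix can be sketched with a standalone model (hypothetical helper names, not the actual ceph code): if every monitor restarts its shard at max_global_id after an election, new ids begin above everything already issued, so the old and new shards cannot overlap.

```python
# Sketch of the fix's effect (standalone model, not the ceph code):
# reseeding every shard at max_global_id after an election means the
# new 5-monitor shards start above every id the 3-monitor shards issued.

def shard(last_allocated_id, mon_num, mon_rank, count):
    # same rounding rule as AuthMonitor::_assign_global_id
    ids = []
    for _ in range(count):
        i = last_allocated_id + 1
        r = i % mon_num
        if r:
            r = mon_num - r
        last_allocated_id = i + r + mon_rank
        ids.append(last_allocated_id)
    return ids

# before the election: 3 monitors, all shards seeded at 10000
old = set()
for rank in range(3):
    old |= set(shard(10000, 3, rank, 100))

# after expanding to 5 monitors, the fix reseeds every shard at
# max_global_id, which sits at or above every id issued so far
max_global_id = max(old)
new = set()
for rank in range(5):
    new |= set(shard(max_global_id, 5, rank, 100))

print(old & new)    # → set()
```

Without the reseed, carrying last_allocated_id over unchanged produces the overlaps demonstrated by the script earlier in this thread.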
@shminjs shminjs force-pushed the fix-potential-repeated-global-id branch from 6d7f96a to eccd106 on December 26, 2023 12:12
shminjs commented Dec 26, 2023

jenkins test make check

shminjs commented Jan 30, 2024

jenkins test make check arm64

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 30, 2024
shminjs commented Mar 30, 2024

Is anyone handling this issue?

shminjs commented Mar 30, 2024

CephFS clients depend on global_id, and a repeated global_id would cause inconsistency problems.

@github-actions github-actions bot removed the stale label Mar 30, 2024
@tchaikov tchaikov self-requested a review March 30, 2024 14:39
github-actions bot commented Jun 7, 2024


@github-actions github-actions bot added the stale label Jun 7, 2024
shminjs commented Jun 11, 2024

This is a bug and needs to be fixed.

@github-actions github-actions bot removed the stale label Jun 11, 2024

@github-actions github-actions bot added the stale label Aug 10, 2024
{
std::lock_guard l(auth_lock);
authmon()->_set_mon_num_rank(monmap->size(), rank);
authmon()->_reset_last_allocated_id();
@rzarzynski rzarzynski commented Aug 11, 2024

In my understanding this should lead to:

1. temporarily ceasing issuance of new IDs:

```cpp
uint64_t AuthMonitor::_assign_global_id()
{
  ceph_assert(ceph_mutex_is_locked(mon->auth_lock));
  if (mon_num < 1 || mon_rank < 0) {
    dout(10) << __func__ << " inactive (num_mon " << mon_num
             << " rank " << mon_rank << ")" << dendl;
    return 0;
  }
  if (!last_allocated_id) {
    dout(10) << __func__ << " last_allocated_id == 0" << dendl;
    return 0;
  }
  // ...
```

2. letting the leader maybe bump max_global_id up:

```cpp
int64_t AuthMonitor::assign_global_id(bool should_increase_max)
{
  uint64_t id;
  {
    std::lock_guard l(mon->auth_lock);
    id = _assign_global_id();
    if (should_increase_max) {
      should_increase_max = _should_increase_max_global_id();
    }
  }
  if (mon->is_leader() &&
      should_increase_max) {
    increase_max_global_id();
  }
  return id;
}

void AuthMonitor::increase_max_global_id()
{
  ceph_assert(mon->is_leader());

  Incremental inc;
  inc.inc_type = GLOBAL_ID;
  inc.max_global_id = max_global_id + g_conf()->mon_globalid_prealloc;
  dout(10) << "increasing max_global_id to " << inc.max_global_id << dendl;
  pending_auth.push_back(inc);
}
```

3. maybe letting all monitors synchronize on the new max_global_id:

```cpp
void AuthMonitor::update_from_paxos(bool *need_bootstrap)
{
      switch (inc.inc_type) {
      case GLOBAL_ID:
        max_global_id = inc.max_global_id;
        break;
```

4. finally, starting to accept clients again:

```cpp
void AuthMonitor::update_from_paxos(bool *need_bootstrap)
{
        max_global_id = inc.max_global_id;
  //  ...
  {
    std::lock_guard l(mon->auth_lock);
    if (last_allocated_id == 0) {
      last_allocated_id = max_global_id;
      dout(10) << __func__ << " last_allocated_id initialized to "
               << max_global_id << dendl;
    }
  }
```

Steps 2 and 3 are rather improbable for the AuthMonitor::_reset_last_allocated_id() case.

I agree the patch should help with the sharding of the ID space if mon_num or mon_rank change. (I will take a second look this week.)
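The sequence above can be condensed into a toy model (Python, with hypothetical names mirroring the C++ excerpts; not the real implementation) showing the stall-then-reseed behaviour after a reset:

```python
# Toy model (hypothetical, standalone) of the four steps above: after
# _reset_last_allocated_id() the monitor stalls (returns 0) until a new
# max_global_id arrives and last_allocated_id is reseeded from it.

class ToyAuthMonitor:
    def __init__(self, mon_num, mon_rank, max_global_id, prealloc=100):
        self.mon_num = mon_num
        self.mon_rank = mon_rank
        self.max_global_id = max_global_id
        self.prealloc = prealloc          # stands in for mon_globalid_prealloc
        self.last_allocated_id = max_global_id

    def reset_last_allocated_id(self):
        # the fix under review, run when an election completes
        self.last_allocated_id = 0

    def assign_global_id(self):
        # step 1: no ids are issued while last_allocated_id == 0
        if self.last_allocated_id == 0:
            return 0
        i = self.last_allocated_id + 1
        r = i % self.mon_num
        if r:
            r = self.mon_num - r
        i += r + self.mon_rank
        if i >= self.max_global_id:       # out of preallocated headroom
            return 0
        self.last_allocated_id = i
        return i

    def increase_max_global_id(self):
        # step 2: the leader bumps the shared ceiling
        self.max_global_id += self.prealloc

    def update_from_paxos(self):
        # steps 3-4: each monitor learns the new max and, if it was
        # reset, restarts its shard from it
        if self.last_allocated_id == 0:
            self.last_allocated_id = self.max_global_id

mon = ToyAuthMonitor(mon_num=5, mon_rank=0, max_global_id=10000)
mon.reset_last_allocated_id()
assert mon.assign_global_id() == 0    # stalled until paxos catches up
mon.increase_max_global_id()
mon.update_from_paxos()               # reseed at the new max (10100)
mon.increase_max_global_id()          # a second bump provides headroom
print(mon.assign_global_id())         # → 10105
```

The toy conflates leader and peon into one object; the point is only that a reset monitor issues nothing until steps 2 and 3 have happened, which is the dependency questioned above.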

@github-actions github-actions bot removed the stale label Aug 12, 2024
@rzarzynski rzarzynski requested a review from batrick August 16, 2024 07:20
batrick commented Sep 16, 2024

This PR is under test in https://tracker.ceph.com/issues/68089.

@batrick batrick left a comment

shminjs commented Sep 18, 2024

This failed QA, see for example:

https://pulpito.ceph.com/pdonnell-2024-09-17_02:09:51-fs-wip-pdonnell-testing-20240916.200549-debug-distro-default-smithi/7908340/

and the leader mon log:

ceph-mon.a.log.gz

OK, I will check the log and reply as soon as possible.


@github-actions github-actions bot added the stale label Nov 23, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 23, 2024
