nautilus: ceph-monstore-tool: use a large enough paxos/{first,last}_committed #41874

Merged
yuriw merged 3 commits into ceph:nautilus from tchaikov:nautilus-pr-27465
Jun 17, 2021

Conversation

@tchaikov tchaikov commented Jun 16, 2021

backport of #27465. the cleanup and doc changes are dropped, as they are not necessary for the bug fix.
backport ticket: https://tracker.ceph.com/issues/51237

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

tchaikov added 3 commits June 16, 2021 10:12
so the rebuild paxos transaction won't be overwritten by the ones
created before recovery completes.

when the quorum is recovering, the leader collects the paxos
transactions from the peons. if the quorum accepts the proposal for setting
the fingerprint, the peon updates the monitor with a paxos
transaction that has a newer "last_committed" than the one created by
update_paxos() in ceph_monstore_tool.cc. the latter "last_committed" is
always 0.

so, to avoid this extra paxos proposal obsoleting the "rebuilding" paxos
transaction, we use a large enough number for {first,last}_committed.

Fixes: http://tracker.ceph.com/issues/38219
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 5475ef7)
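The reasoning in this commit message can be illustrated with a toy model. This is only a sketch under a simplifying assumption: whichever transaction carries the higher "last_committed" ends up authoritative. The function name and the large version number are hypothetical and are not taken from ceph_monstore_tool.cc or from Ceph's Paxos code.

```python
# Toy model only, not Ceph's Paxos implementation.
# Assumption: the transaction with the higher last_committed wins.

def winning_transaction(rebuilt_last_committed, recovery_last_committed):
    """Return "rebuilt" if the rebuilt transaction survives recovery,
    otherwise "recovery"."""
    if recovery_last_committed > rebuilt_last_committed:
        return "recovery"
    return "rebuilt"


# Illustrative value; the number used by the actual fix is not stated here.
HYPOTHETICAL_LARGE_VERSION = 1000

# With last_committed == 0, any proposal accepted during recovery is newer.
print(winning_transaction(0, 1))
# With a large enough version, the rebuilt transaction is not obsoleted.
print(winning_transaction(HYPOTHETICAL_LARGE_VERSION, 1))
```

In this model the first call reports that the recovery proposal wins and the second reports that the rebuilt transaction survives, which is the effect the commit aims for.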

for better readability

Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 3908c1f)

mon_tick_interval is 5 seconds by default. monitors update their
rotating keys every mon_tick_interval. before the monitors form a
quorum, the auth requests from clients are put into the wait list.
these requests are re-enqueued once the monitors form a quorum. but
there is a small window of mon_tick_interval in which the monitors
cannot yet serve auth requests, even though they claim to be able to
serve requests. if the re-enqueued requests happen to be served
in this window, and if cephx is enabled, they will be greeted with
errors like

handle_auth_bad_method server allowed_methods [2] but i only support [2]

in the case of ceph cli, the error would look like:

[errno 13] RADOS permission denied (error connecting to the cluster)

so, to address this issue, the EACCES error is ignored when waiting
for a quorum.

Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 7afd38f)
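The approach described in this commit message, ignoring EACCES while waiting for a quorum, can be sketched as a retry loop. This is a minimal sketch, not the code from the commit: `wait_for_quorum`, `probe`, and the `timeout`/`interval` parameters are hypothetical names, and `probe` stands in for whatever call checks the quorum and may fail with EACCES during the window described above.

```python
import errno
import time


def wait_for_quorum(probe, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or the timeout expires.

    probe is a hypothetical callable supplied by the caller. An OSError
    with errno EACCES is treated as "not ready yet" and retried; any
    other error is re-raised.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError as e:
            # Only permission-denied errors are ignored while waiting.
            if e.errno != errno.EACCES:
                raise
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only EACCES is swallowed here, so unrelated failures still surface immediately instead of being hidden by the retry loop.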
@github-actions github-actions bot added this to the nautilus milestone Jun 16, 2021
@tchaikov tchaikov added nautilus-batch-1 nautilus point releases needs-qa labels Jun 16, 2021
@tchaikov

hi @yuriw, not sure if you are still planning another nautilus qa batch. if yes, would you kindly include this change as well? if not, i'd keep it around in case it could help the community.

yuriw commented Jun 16, 2021

@tchaikov I can't build it: https://shaman.ceph.com/builds/ceph/wip-yuri3-testing-2021-06-16-0702-nautilus/. See if you can resolve it and I will retest.

@tchaikov

@yuriw thank you for testing! that's a known issue, and it has been fixed. i am building the branch at https://shaman.ceph.com/builds/ceph/wip-yuri3-testing-2021-06-16-0702-nautilus-1/8aebecf8580d16fe26fb7ac2c2317d240257e596/.

@ideepika

jenkins test make check

@ideepika ideepika left a comment

thank you @tchaikov! :-)

@yuriw yuriw merged commit 406a6d1 into ceph:nautilus Jun 17, 2021