mon/MonClient: resurrect original client_mount_timeout handling by idryomov · Pull Request #52124 · ceph/ceph

idryomov · 2023-06-20T03:26:10Z

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

While reducing a "waiting for config" timeout from 30 seconds to 3 (mon_client_hunt_interval default) and instead introducing 10 retries, commit 3c2b30e ("mon/MonClient: apply timeout while fetching config") also subjected authenticate() to these retries. However, authenticate() is going by client_mount_timeout which defaults to 5 minutes. As a result, when the monitors are unreachable or there are other connectivity issues, we end up taking 50 minutes to return ETIMEDOUT from rados_connect(). Fixes: https://tracker.ceph.com/issues/61733 Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Over the years, with commits 8bc1974 ("ceph_test_librados_api_misc: fix stupid LibRadosMiscConnectFailure.ConnectFailure test") and f357459 ("test/librados: modify LibRadosMiscConnectFailure.ConnectFailure to comply with new seconds unit"), this test has lost its original meaning. Any ability to time out is definitely gone since a 1 second timeout is too high to kick in a normally functioning setup, not to mention that the timeout was being ballooned to 10 seconds until the previous commit. The first connection attempt would normally succeed while the second one would immediately fail with "cannot identify monitors to contact" error due do the cluster handle getting recreated. The 16 iteration loop is dead code. This commit just codifies the above to avoid the appearance that this test has anything to do with timeouts. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

idryomov · 2023-06-20T07:04:04Z

jenkins test api

idryomov · 2023-06-20T07:04:09Z

jenkins test windows

rzarzynski · 2023-06-21T13:43:58Z

src/test/librados/misc.cc

+
+  utime_t dur = end - start;
+  ASSERT_GE(dur, utime_t(2, 0));
+  ASSERT_LT(dur, utime_t(4, 0));


ljflores · 2023-07-12T18:30:26Z

@idryomov I'm seeing a failure that looks related to this PR:

RADOS timed out (error connecting to the cluster), failed to fetch mon config
/a/yuriw-2023-07-03_15:28:45-rados-wip-yuri-testing-2023-06-23-0831-distro-default-smithi/7324852

2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:07:34.731+0000 7fe36e1c9640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:53: TEST_stretched_cluster_failover_add_three_osds:  run_mgr td/mon-stretch-fail-recovery x
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:553: run_mgr:  local dir=td/mon-stretch-fail-recovery
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:554: run_mgr:  shift
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:555: run_mgr:  local id=x
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:556: run_mgr:  shift
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:557: run_mgr:  local data=td/mon-stretch-fail-recovery/x
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:559: run_mgr:  ceph config set mgr mgr_pool false --force
2023-07-03T21:12:34.866 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:12:34.862+0000 7fa4cb991640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:12:34.867 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  get_asok_path
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path:  local name=
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path:  '[' -n '' ']'
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  get_asok_dir
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir:  '[' -n '' ']'
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir:  echo /tmp/ceph-asok.30235
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  echo '/tmp/ceph-asok.30235/$cluster-$name.asok'
2023-07-03T21:12:34.873 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  realpath /home/ubuntu/cephtest/clone.client.0/src/pybind/mgr
2023-07-03T21:12:34.886 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  ceph-mgr --id x --osd-failsafe-full-ratio=.99 --debug-mgr 20 --debug-objecter 20 --debug-ms 20 --debug-paxos 20 --chdir= --mgr-data=td/mon-stretch-fail-recovery/x '--log-file=td/mon-stretch-fail-recovery/$name.log' '--admin-socket=/tmp/ceph-asok.30235/$cluster-$name.asok' --run-dir=td/mon-stretch-fail-recovery '--pid-file=td/mon-stretch-fail-recovery/$name.pid' --mgr-module-path=/home/ubuntu/cephtest/clone.client.0/src/pybind/mgr
2023-07-03T21:17:34.946 INFO:tasks.workunit.client.0.smithi159.stderr:failed to fetch mon config (--no-mon-config to skip)
2023-07-03T21:17:34.949 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:575: run_mgr:  return 1
2023-07-03T21:17:34.949 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:53: TEST_stretched_cluster_failover_add_three_osds:  return 1

The failure reproduced several other times in this test link:
http://pulpito.front.sepia.ceph.com/?branch=wip-yuri-testing-2023-06-23-0831

ljflores · 2023-07-12T18:31:06Z

Please re-add "needs-qa" when this needs testing again.

idryomov · 2023-07-12T20:20:09Z

@batrick and I agree that this appears to be caused by #50503 that was present in the same batch.

What I see happening is:

ceph mon set election_strategy connectivity command hangs for almost 3 hours and gets terminated:

2023-07-03T18:03:08.444 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:50: TEST_stretched_cluster_failover_add_three_osds:  ceph mon set election_strategy connectivity
2023-07-03T21:02:34.488 INFO:tasks.workunit.client.0.smithi159.stderr:Terminated

teardown code is invoked and ostensibly kills daemons:

2023-07-03T21:02:34.521 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:1: TEST_stretched_cluster_failover_add_three_osds:  teardown td/mon-stretch-fail-recovery 1
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:164: teardown:  local dir=td/mon-stretch-fail-recovery
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:165: teardown:  local dumplogs=1
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:166: teardown:  kill_daemons td/mon-stretch-fail-recovery KILL

however the test proceeds and attempts to run the next command which times out (likely because there is nothing to connect to at that point):

2023-07-03T21:02:34.627 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:51: TEST_stretched_cluster_failover_add_three_osds:  ceph mon add disallowed_leader e
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:07:34.731+0000 7fe36e1c9640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)

Please retest without #50503 in the mix.

ljflores · 2023-07-12T20:37:04Z

Sure, thanks for looking into it @idryomov

ljflores · 2023-07-18T20:55:29Z

Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomcVbn2qwgo1793-wip-yuri-testing-2023-07-14-1641-old-wip-yuri-testing-2023-07-12-1332-old-wip-yuri-testing-2023-07-12-1140-old-wip-yuri-testing

idryomov added 2 commits June 20, 2023 05:21

idryomov added the bug-fix label Jun 20, 2023

idryomov requested a review from ljflores June 20, 2023 03:26

idryomov requested a review from a team as a code owner June 20, 2023 03:26

github-actions bot added core mon tests labels Jun 20, 2023

rzarzynski approved these changes Jun 21, 2023

View reviewed changes

src/test/librados/misc.cc

utime_t dur = end - start;

ASSERT_GE(dur, utime_t(2, 0));

ASSERT_LT(dur, utime_t(4, 0));

Copy link

Contributor

rzarzynski Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

rzarzynski added the needs-qa label Jun 21, 2023

ljflores added the wip-yuri-testing label Jun 23, 2023

ljflores removed needs-qa wip-yuri-testing labels Jul 12, 2023

idryomov added the needs-qa label Jul 12, 2023

yuriw added the wip-yuri-testing label Jul 12, 2023

idryomov merged commit a49357a into ceph:main Jul 19, 2023

idryomov deleted the wip-61733 branch July 19, 2023 09:41

This was referenced Jul 19, 2023

pacific: mon/MonClient: resurrect original client_mount_timeout handling #52533

Merged

quincy: mon/MonClient: resurrect original client_mount_timeout handling #52534

Merged

reef: mon/MonClient: resurrect original client_mount_timeout handling #52535

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mon/MonClient: resurrect original client_mount_timeout handling#52124

mon/MonClient: resurrect original client_mount_timeout handling#52124
idryomov merged 2 commits intoceph:mainfrom
idryomov:wip-61733

idryomov commented Jun 20, 2023

Uh oh!

idryomov commented Jun 20, 2023

Uh oh!

idryomov commented Jun 20, 2023

Uh oh!

rzarzynski Jun 21, 2023

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

idryomov commented Jul 12, 2023 •

edited

Loading

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

ljflores commented Jul 18, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

idryomov commented Jun 20, 2023

Contribution Guidelines

Checklist

Uh oh!

idryomov commented Jun 20, 2023

Uh oh!

idryomov commented Jun 20, 2023

Uh oh!

rzarzynski Jun 21, 2023

Choose a reason for hiding this comment

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

idryomov commented Jul 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ljflores commented Jul 12, 2023

Uh oh!

ljflores commented Jul 18, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

idryomov commented Jul 12, 2023 •

edited

Loading