Skip to content

mon/MonClient: resurrect original client_mount_timeout handling#52124

Merged
idryomov merged 2 commits intoceph:mainfrom
idryomov:wip-61733
Jul 19, 2023
Merged

mon/MonClient: resurrect original client_mount_timeout handling#52124
idryomov merged 2 commits intoceph:mainfrom
idryomov:wip-61733

Conversation

@idryomov
Copy link
Contributor

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

idryomov added 2 commits June 20, 2023 05:21
While reducing a "waiting for config" timeout from 30 seconds to 3
(mon_client_hunt_interval default) and instead introducing 10 retries,
commit 3c2b30e ("mon/MonClient: apply timeout while fetching
config") also subjected authenticate() to these retries.  However,
authenticate() is going by client_mount_timeout which defaults to
5 minutes.  As a result, when the monitors are unreachable or there
are other connectivity issues, we end up taking 50 minutes to return
ETIMEDOUT from rados_connect().

Fixes: https://tracker.ceph.com/issues/61733
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Over the years, with commits 8bc1974 ("ceph_test_librados_api_misc:
fix stupid LibRadosMiscConnectFailure.ConnectFailure test") and
f357459 ("test/librados: modify LibRadosMiscConnectFailure.ConnectFailure
to comply with new seconds unit"), this test has lost its original
meaning.  Any ability to time out is definitely gone since a 1 second
timeout is too high to kick in a normally functioning setup, not to
mention that the timeout was being ballooned to 10 seconds until the
previous commit.

The first connection attempt would normally succeed while the second
one would immediately fail with "cannot identify monitors to contact"
error due do the cluster handle getting recreated.  The 16 iteration
loop is dead code.

This commit just codifies the above to avoid the appearance that this
test has anything to do with timeouts.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
@idryomov idryomov requested a review from ljflores June 20, 2023 03:26
@idryomov idryomov requested a review from a team as a code owner June 20, 2023 03:26
@idryomov
Copy link
Contributor Author

jenkins test api

@idryomov
Copy link
Contributor Author

jenkins test windows


utime_t dur = end - start;
ASSERT_GE(dur, utime_t(2, 0));
ASSERT_LT(dur, utime_t(4, 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

@ljflores
Copy link
Member

@idryomov I'm seeing a failure that looks related to this PR:

RADOS timed out (error connecting to the cluster), failed to fetch mon config
/a/yuriw-2023-07-03_15:28:45-rados-wip-yuri-testing-2023-06-23-0831-distro-default-smithi/7324852

2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:07:34.731+0000 7fe36e1c9640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:53: TEST_stretched_cluster_failover_add_three_osds:  run_mgr td/mon-stretch-fail-recovery x
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:553: run_mgr:  local dir=td/mon-stretch-fail-recovery
2023-07-03T21:07:34.740 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:554: run_mgr:  shift
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:555: run_mgr:  local id=x
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:556: run_mgr:  shift
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:557: run_mgr:  local data=td/mon-stretch-fail-recovery/x
2023-07-03T21:07:34.741 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:559: run_mgr:  ceph config set mgr mgr_pool false --force
2023-07-03T21:12:34.866 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:12:34.862+0000 7fa4cb991640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:12:34.867 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  get_asok_path
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path:  local name=
2023-07-03T21:12:34.871 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path:  '[' -n '' ']'
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  get_asok_dir
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir:  '[' -n '' ']'
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir:  echo /tmp/ceph-asok.30235
2023-07-03T21:12:34.872 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  echo '/tmp/ceph-asok.30235/$cluster-$name.asok'
2023-07-03T21:12:34.873 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  realpath /home/ubuntu/cephtest/clone.client.0/src/pybind/mgr
2023-07-03T21:12:34.886 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:561: run_mgr:  ceph-mgr --id x --osd-failsafe-full-ratio=.99 --debug-mgr 20 --debug-objecter 20 --debug-ms 20 --debug-paxos 20 --chdir= --mgr-data=td/mon-stretch-fail-recovery/x '--log-file=td/mon-stretch-fail-recovery/$name.log' '--admin-socket=/tmp/ceph-asok.30235/$cluster-$name.asok' --run-dir=td/mon-stretch-fail-recovery '--pid-file=td/mon-stretch-fail-recovery/$name.pid' --mgr-module-path=/home/ubuntu/cephtest/clone.client.0/src/pybind/mgr
2023-07-03T21:17:34.946 INFO:tasks.workunit.client.0.smithi159.stderr:failed to fetch mon config (--no-mon-config to skip)
2023-07-03T21:17:34.949 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:575: run_mgr:  return 1
2023-07-03T21:17:34.949 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:53: TEST_stretched_cluster_failover_add_three_osds:  return 1

The failure reproduced several other times in this test link:
http://pulpito.front.sepia.ceph.com/?branch=wip-yuri-testing-2023-06-23-0831

@ljflores
Copy link
Member

Please re-add "needs-qa" when this needs testing again.

@idryomov
Copy link
Contributor Author

idryomov commented Jul 12, 2023

@batrick and I agree that this appears to be caused by #50503 that was present in the same batch.

What I see happening is:

  • ceph mon set election_strategy connectivity command hangs for almost 3 hours and gets terminated:
2023-07-03T18:03:08.444 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:50: TEST_stretched_cluster_failover_add_three_osds:  ceph mon set election_strategy connectivity
2023-07-03T21:02:34.488 INFO:tasks.workunit.client.0.smithi159.stderr:Terminated
  • teardown code is invoked and ostensibly kills daemons:
2023-07-03T21:02:34.521 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:1: TEST_stretched_cluster_failover_add_three_osds:  teardown td/mon-stretch-fail-recovery 1
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:164: teardown:  local dir=td/mon-stretch-fail-recovery
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:165: teardown:  local dumplogs=1
2023-07-03T21:02:34.522 INFO:tasks.workunit.client.0.smithi159.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:166: teardown:  kill_daemons td/mon-stretch-fail-recovery KILL
  • however the test proceeds and attempts to run the next command which times out (likely because there is nothing to connect to at that point):
2023-07-03T21:02:34.627 INFO:tasks.workunit.client.0.smithi159.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon-stretch/mon-stretch-fail-recovery.sh:51: TEST_stretched_cluster_failover_add_three_osds:  ceph mon add disallowed_leader e
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:2023-07-03T21:07:34.731+0000 7fe36e1c9640  0 monclient(hunting): authenticate timed out after 300
2023-07-03T21:07:34.736 INFO:tasks.workunit.client.0.smithi159.stderr:[errno 110] RADOS timed out (error connecting to the cluster)

Please retest without #50503 in the mix.

@ljflores
Copy link
Member

Sure, thanks for looking into it @idryomov

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants