Bug #56574


rados/valgrind-leaks: "cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)" in cluster log

Added by Laura Flores over 3 years ago. Updated 9 months ago.

Status:
Duplicate
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

Description: rados/valgrind-leaks/{1-start 2-inject-leak/osd centos_latest}

/a/nojha-2022-07-14_20:32:09-rados-snapshot_key_conversion-distro-default-smithi/6931015

2022-07-14T23:01:41.552 INFO:teuthology.misc.health.smithi145.stdout:{"epoch":13,"fsid":"24de2949-251d-4169-919f-ec93cdf5f733","created":"2022-07-14T22:51:15.685917+0000","modified":"2022-07-14T22:52:15.383144+0000","last_up_change":"2022-07-14T22:52:11.619286+0000","last_in_change":"2022-07-14T22:51:19.303295+0000","flags":"sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit","flags_num":5799936,"flags_set":["pglog_hardlimit","purged_snapdirs","recovery_deletes","sortbitwise"],"crush_version":5,"full_ratio":0.94999998807907104,"backfillfull_ratio":0.89999997615814209,"nearfull_ratio":0.85000002384185791,"cluster_snapshot":"","pool_max":1,"max_osd":3,"require_min_compat_client":"luminous","min_compat_client":"jewel","require_osd_release":"quincy","pools":[{"pool":1,"pool_name":".mgr","create_time":"2022-07-14T22:52:12.379399+0000","flags":1,"flags_names":"hashpspool","type":1,"size":2,"min_size":1,"crush_rule":0,"peering_crush_bucket_count":0,"peering_crush_bucket_target":0,"peering_crush_bucket_barrier":0,"peering_crush_bucket_mandatory_member":2147483647,"object_hash":2,"pg_autoscale_mode":"off","pg_num":1,"pg_placement_num":1,"pg_placement_num_target":1,"pg_num_target":1,"pg_num_pending":1,"last_pg_merge_meta":{"source_pgid":"0.0","ready_epoch":0,"last_epoch_started":0,"last_epoch_clean":0,"source_version":"0'0","target_version":"0'0"},"last_change":"13","last_force_op_resend":"0","last_force_op_resend_prenautilus":"0","last_force_op_resend_preluminous":"0","auid":0,"snap_mode":"selfmanaged","snap_seq":0,"snap_epoch":0,"pool_snaps":[],"removed_snaps":"[]","quota_max_bytes":0,"quota_max_objects":0,"tiers":[],"tier_of":-1,"read_tier":-1,"write_tier":-1,"cache_mode":"none","target_max_bytes":0,"target_max_objects":0,"cache_target_dirty_ratio_micro":400000,"cache_target_dirty_high_ratio_micro":600000,"cache_target_full_ratio_micro":800000,"cache_min_flush_age":0,"cache_min_evict_age":0,"erasure_code_profile":"","hit_set_params":{"type":"none"},"hit_set_perio
d":0,"hit_set_count":0,"use_gmt_hitset":true,"min_read_recency_for_promote":0,"min_write_recency_for_promote":0,"hit_set_grade_decay_rate":0,"hit_set_search_last_n":0,"grade_table":[],"stripe_width":0,"expected_num_objects":0,"fast_read":false,"options":{"pg_num_max":32,"pg_num_min":1},"application_metadata":{"mgr":{}}}],"osds":[{"osd":0,"uuid":"da4088a2-ef48-4e6e-bafe-29b8472b672c","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":9,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6810","nonce":103817},{"type":"v1","addr":"172.21.15.145:6811","nonce":103817}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6812","nonce":103817},{"type":"v1","addr":"172.21.15.145:6813","nonce":103817}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6816","nonce":103817},{"type":"v1","addr":"172.21.15.145:6817","nonce":103817}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6814","nonce":103817},{"type":"v1","addr":"172.21.15.145:6815","nonce":103817}]},"public_addr":"172.21.15.145:6811/103817","cluster_addr":"172.21.15.145:6813/103817","heartbeat_back_addr":"172.21.15.145:6817/103817","heartbeat_front_addr":"172.21.15.145:6815/103817","state":["exists","up"]},{"osd":1,"uuid":"5420bad2-daeb-4064-bb0f-a58aed6bab71","up":1,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":10,"up_thru":11,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6818","nonce":103818},{"type":"v1","addr":"172.21.15.145:6819","nonce":103818}]},"cluster_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6820","nonce":103818},{"type":"v1","addr":"172.21.15.145:6821","nonce":103818}]},"heartbeat_back_addrs":{"addrvec":[{"type":"v2","addr":"172.21.15.145:6824","nonce":103818},{"type":"v1","addr":"172.21.15.145:6825","nonce":103818}]},"heartbeat_front_addrs":{"addrvec":[{"type":"v2","
addr":"172.21.15.145:6822","nonce":103818},{"type":"v1","addr":"172.21.15.145:6823","nonce":103818}]},"public_addr":"172.21.15.145:6819/103818","cluster_addr":"172.21.15.145:6821/103818","heartbeat_back_addr":"172.21.15.145:6825/103818","heartbeat_front_addr":"172.21.15.145:6823/103818","state":["exists","up"]},{"osd":2,"uuid":"3e6b48d7-1062-486a-8504-b043154edeb2","up":0,"in":1,"weight":1,"primary_affinity":1,"last_clean_begin":0,"last_clean_end":0,"up_from":0,"up_thru":0,"down_at":0,"lost_at":0,"public_addrs":{"addrvec":[]},"cluster_addrs":{"addrvec":[]},"heartbeat_back_addrs":{"addrvec":[]},"heartbeat_front_addrs":{"addrvec":[]},"public_addr":"(unrecognized address family 0)/0","cluster_addr":"(unrecognized address family 0)/0","heartbeat_back_addr":"(unrecognized address family 0)/0","heartbeat_front_addr":"(unrecognized address family 0)/0","state":["exists","new"]}],"osd_xinfo":[{"osd":0,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:57.552229+0000","dead_epoch":0},{"osd":1,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":4540138320759226367,"old_weight":0,"last_purged_snaps_scrub":"2022-07-14T22:51:58.124476+0000","dead_epoch":0},{"osd":2,"down_stamp":"0.000000","laggy_probability":0,"laggy_interval":0,"features":0,"old_weight":0,"last_purged_snaps_scrub":"0.000000","dead_epoch":0}],"pg_upmap":[],"pg_upmap_items":[],"pg_temp":[],"primary_temp":[],"blocklist":{},"range_blocklist":{},"erasure_code_profiles":{"default":{"crush-failure-domain":"osd","k":"2","m":"1","plugin":"jerasure","technique":"reed_sol_van"}},"removed_snaps_queue":[],"new_removed_snaps":[],"new_purged_snaps":[],"crush_node_flags":{},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
2022-07-14T23:01:41.566 DEBUG:teuthology.misc:2 of 3 OSDs are up


Related issues: 1 (0 open, 1 closed)

Is duplicate of RADOS - Enhancement #57040: osd: Update osd's IOPS capacity using async Context completion instead of cond wait. (Resolved; assigned to Sridhar Seshasayee)

Actions #1

Updated by Laura Flores over 3 years ago

/a/nojha-2022-07-15_14:45:04-rados-snapshot_key_conversion-distro-default-smithi/6932156

Actions #2

Updated by Nitzan Mordechai over 3 years ago

osd.0 is still down.

The valgrind output for osd.0 shows:

<fatal_signal>
  <tid>1</tid>
  <signo>15</signo>
  <signame>SIGTERM</signame>

That SIGTERM arrives while we are waiting in mon_cmd_set_config.

The code in OSD.cc for mon_cmd_set_config:

int OSD::mon_cmd_set_config(const std::string &key, const std::string &val)
{
  std::string cmd =
    "{" 
      "\"prefix\": \"config set\", " 
      "\"who\": \"osd." + std::to_string(whoami) + "\", " 
      "\"name\": \"" + key + "\", " 
      "\"value\": \"" + val + "\"" 
    "}";

  vector<std::string> vcmd{cmd};
  bufferlist inbl;
  std::string outs;
  C_SaferCond cond;
  monc->start_mon_command(vcmd, inbl, nullptr, &outs, &cond);
  int r = cond.wait();
  if (r < 0) {

It looks like we fail during that wait, even though mon.c accepted the config set and also acked it.

mon.c shows it acked the set of osd_mclock_max_capacity_iops_ssd:

2022-07-15T15:17:00.355+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] <== osd.0 v2:172.21.15.36:6802/103698 3 ==== mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 ==== 149+0+0 (secure 0 0 0) 0x1c2a97b0 con 0x1c6ccbf0
2022-07-15T15:17:00.359+0000 1721f700 20 is_capable service=config command=config set read write addr v2:172.21.15.36:6802/103698 on cap allow profile osd
2022-07-15T15:17:00.373+0000 1721f700 10 mon.c@2(peon).paxosservice(config 1..1) dispatch 0x1c2a97b0 mon_command([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}] v 0) v1 from osd.0 v2:172.21.15.36:6802/103698 con 0x1c6ccbf0
2022-07-15T15:17:00.635+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- config(1 keys) v1 -- 0x1c79f000 con 0x1c6ccbf0
2022-07-15T15:17:00.649+0000 1721f700  1 -- [v2:172.21.15.36:3302/0,v1:172.21.15.36:6791/0] --> [v2:172.21.15.36:6802/103698,v1:172.21.15.36:6803/103698] -- mon_command_ack([{prefix=config set, name=osd_mclock_max_capacity_iops_ssd}]=0  v2) v1 -- 0x1246f350 con 0x1c6ccbf0

Actions #3

Updated by Radoslaw Zarzynski over 3 years ago

  • Status changed from New to Need More Info

Watching for more recurrences.

Actions #4

Updated by Laura Flores over 3 years ago

  • Assignee set to Sridhar Seshasayee

Found another occurrence here: /a/yuriw-2022-07-18_18:20:02-rados-wip-yuri8-testing-2022-07-18-0918-distro-default-smithi/6936397
Description: rados/valgrind-leaks/{1-start 2-inject-leak/mon centos_latest}

Based on the valgrind (same from what Nitzan shared), it seems like it could be something to do with mClock. (Or a separate issue that is interacting badly with mClock). This was the first PR that this failure occurred on: https://github.com/ceph/ceph/pull/46908. During testing, we reran valgrind-leaks tests 10 times, and they all passed after failing initially; thus, we deemed the PR unrelated. See the PR for more testing info.

@Sridhar, do you have an idea of what this could be?

Actions #5

Updated by Sridhar Seshasayee over 3 years ago

This looks similar to https://tracker.ceph.com/issues/52948. See comment https://tracker.ceph.com/issues/52948#note-5 for more details. The osd in question doesn't get the ack for the command even though the mon sends it. In the original tracker the issue looks to be msgr related as there were socket failure injections in that test. I will look more into the logs of this tracker soon, but it could help if this can be looked into more from the msgr perspective.

Actions #6

Updated by Laura Flores over 3 years ago

Ah, thanks Sridhar. I will compare the two Trackers and mark this one as a duplicate if needed.

Actions #7

Updated by Nitzan Mordechai over 3 years ago

Just a note: I was able to recreate it with vstart, without error injection but with valgrind.
As soon as we step into the code I posted above, we hang in that wait (I think more than one mon is needed to reproduce).

Actions #8

Updated by Radoslaw Zarzynski over 3 years ago

Nitzan, could it be a different issue?

Actions #9

Updated by Laura Flores over 3 years ago

/a/yuriw-2022-08-22_16:21:19-rados-wip-yuri8-testing-2022-08-22-0646-distro-default-smithi/6985175

Actions #10

Updated by Laura Flores over 3 years ago

  • Tags set to test-failure
Actions #11

Updated by Sridhar Seshasayee 10 months ago · Edited

@Laura Flores I think this is no longer happening. Additionally, the code related to this
as mentioned in #note-2 has been modified to use an asynchronous wait (using context
completion), which I think resolves the issue.

Perhaps this and https://tracker.ceph.com/issues/52948 can be marked as duplicate
of https://tracker.ceph.com/issues/57040 which implemented the context completion
mechanism.

Please let me know and I can associate the trackers accordingly and close them.

Actions #12

Updated by Laura Flores 9 months ago

Sridhar Seshasayee wrote in #note-11:

@Laura Flores I think this is no longer happening. Additionally, the code related to this
as mentioned in #note-2 has been modified to use an asynchronous wait (using context
completion), which I think resolves the issue.

Perhaps this and https://tracker.ceph.com/issues/52948 can be marked as duplicate
of https://tracker.ceph.com/issues/57040 which implemented the context completion
mechanism.

Please let me know and I can associate the trackers accordingly and close them.

Sure @Sridhar Seshasayee, feel free to add the associations and close them.

Actions #13

Updated by Sridhar Seshasayee 9 months ago

  • Is duplicate of Enhancement #57040: osd: Update osd's IOPS capacity using async Context completion instead of cond wait. added
Actions #14

Updated by Sridhar Seshasayee 9 months ago

  • Status changed from Need More Info to Duplicate