Skip to content

fixes for CRC mismatch failures on OSD when upgrade#11258

Closed
tchaikov wants to merge 2 commits intoceph:masterfrom
tchaikov:wip-17386
Closed

fixes for CRC mismatch failures on OSD when upgrade#11258
tchaikov wants to merge 2 commits intoceph:masterfrom
tchaikov:wip-17386

Conversation

@tchaikov
Copy link
Contributor

No description provided.

add an option named "osd_ignore_bad_map_crc" for OSD.

the OSD will request for full maps from monitor if the CRC of the
resulting full map after applying the incremental map does not match
with the one stored in inc map being applied. in most cases, this
behaviour is just annoying but harmless. it just adds up the burden
of monitor add cause extra traffic in the network. but in a large
deployment, after upgrading the monitor with change(s) in the encoding
of OSDMap, any changes related to osd map, the monitors is obliged to
reply the full map request from old clients which are not able to
re-encode the full map with the expected CRC, this incurs unnecessary
overhead of the network and the load on both peers. so we can choose
to disable this functionality on the client side while  upgrading
the cluster.

Fixes: http://tracker.ceph.com/issues/17386
Signed-off-by: Kefu Chai <kchai@redhat.com>
if OSD does not request for full map at seeing mismatched CRC of
re-encoded osd map, the osd maps on the OSD and monitor will diverge
over time. this could cause mysterious problem because of the
divergence in the cluster. so we need to enable this option and reboot
the osd if `osd_ignore_bad_map_crc` is enabled when upgrading this OSD
in question.

Fixes: http://tracker.ceph.com/issues/17386
Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

diff --git a/src/messages/MOSDMap.h b/src/messages/MOSDMap.h
index 06e79de..05ff0b6 100644
--- a/src/messages/MOSDMap.h
+++ b/src/messages/MOSDMap.h
@@ -85,7 +85,8 @@ public:
     if ((features & CEPH_FEATURE_PGID64) == 0 ||
        (features & CEPH_FEATURE_PGPOOL3) == 0 ||
        (features & CEPH_FEATURE_OSDENC) == 0 ||
-        (features & CEPH_FEATURE_OSDMAP_ENC) == 0) {
+        (features & CEPH_FEATURE_OSDMAP_ENC) == 0 ||
+       (features & CEPH_FEATURE_OSD_HITSET_GMT) == 0) {
       if ((features & CEPH_FEATURE_PGID64) == 0 ||
          (features & CEPH_FEATURE_PGPOOL3) == 0)
        header.version = 1;  // old old_client version
diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 7961d2f..ce3596d 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -404,7 +404,8 @@ void OSDMap::Incremental::encode(bufferlist& bl, uint64_t features) const
   // OSDMaps.  others should be passing around the canonical encoded
   // buffers from on high.  select out those callers by passing in an
   // "impossible" feature bit.
-  assert(features & CEPH_FEATURE_RESERVED);
+  // temporarily disabled for http://tracker.ceph.com/issues/17386
+  // assert(features & CEPH_FEATURE_RESERVED);
   features &= ~CEPH_FEATURE_RESERVED;

   size_t start_offset = bl.length();
@@ -1889,7 +1890,8 @@ void OSDMap::encode(bufferlist& bl, uint64_t features) const
   // OSDMaps.  others should be passing around the canonical encoded
   // buffers from on high.  select out those callers by passing in an
   // "impossible" feature bit.
-  assert(features & CEPH_FEATURE_RESERVED);
+  // assert(features & CEPH_FEATURE_RESERVED);
+  // temporarily disabled for http://tracker.ceph.com/issues/17386
   features &= ~CEPH_FEATURE_RESERVED;

   size_t start_offset = bl.length();

this patch is not included in this patchset.

@tchaikov
Copy link
Contributor Author

@athanatos shall we document in the doc/install/upgrading-ceph.rst to suggest users plan to upgrade from 0.94.6 to upgrade OSD first?

@athanatos
Copy link
Contributor

@tchaikov Wouldn't there still be crc mismatch?

@athanatos
Copy link
Contributor

@liewegas Does this look right to you?

@liewegas
Copy link
Member

That third patch is tricky, but I think it will work. We just need to be really careful to remove the ability once we add something to the OSDMap encoding that matters or else someone who does an upgrade will leave this option turned on and be unhappy later. Or rather we should only apply it to hammer... because if the OSDs have the first patch, which we'll backport to jewel, they wouldn't need the third one... it's only there so that current deployments can upgrade mons only and avoid pain. Right?

@athanatos
Copy link
Contributor

Yeah, I'm not entirely convinced by that last patch (except as an emergency hotfix which then gets rolled back).

@tchaikov
Copy link
Contributor Author

@athanatos yes, i was wrong, thought that the pg_pool_t::encode() took the GMT_HITSET feature bit into consideration.

it's only there so that current deployments can upgrade mons only and avoid pain. Right?

right, without the patch of #11258 (comment), this change does not make sense. i am dropping the "do not encode crc bits if crc_full is not valid anymore" commit from this changeset.

@tchaikov
Copy link
Contributor Author

@athanatos shall we put this down in doc/install/upgrading-ceph.rst

for user upgrade from 0.94.6 to jewel, it is suggested to upgrade OSD first, and then monitor to workaround a known issue, where a large cluster could experience network saturation and slow requests caused CRC mismatch.

@bstillwell
Copy link

We're stuck in the middle of an upgrade from 0.94.6 to 0.94.9 because of this problem. The OSD nodes are about 10% on 0.94.9 and 90% on 0.94.6. All the mon nodes are on 0.94.9. Would it make sense to downgrade the mon nodes to 0.94.6, upgrade all the OSD nodes to 0.94.9, and then upgrade the mon nodes? I ask this because even with the hotfix we're still seeing some of the "cluster [WRN] failed to encode map eNNNN with expected crc" messages appear on our test clusters.

@tchaikov
Copy link
Contributor Author

tchaikov commented Oct 4, 2016

note to myself, we need to note down this in the upgrade doc:

for user planning to upgrade from 0.94.6 to 0.94.9 or up, a large cluster could experience network saturation and slow requests caused CRC mismatch.

  1. a workaround of this known issue is to upgrade all OSD to 0.94.9 first, and then the monitors to jewel.
  2. another workaround is to upgrade all monitors to a version with following fix first and then OSD, as the general practice of upgrade puts.

@tchaikov
Copy link
Contributor Author

@Sage, with your patches to encode the osdmap with the feature set allowed by the connection, i think we can dispense with this changeset?

@liewegas
Copy link
Member

Yes and no. If we backport the changes ot jewel and/or hammer, and add a condition to encode the full osdmap without the hitset feature, then yeah. The current PR only pulls out a handful of features, though, and they have to be explicitly called out.

We could adjust that to mask against osdmap::get_up_osd_features too, I suppose, but I think it might be better to explicitly call out any changes that affect osdmap encoding.

@liewegas
Copy link
Member

In any case, though, I think we can drop this. For now the simple workaround of upgrading OSDs before mons will always work, too.

@liewegas liewegas closed this Oct 24, 2016
@tchaikov tchaikov deleted the wip-17386 branch October 24, 2016 10:30
smithfarm pushed a commit to SUSE/ceph that referenced this pull request Nov 12, 2016
When MDS is no longer laggy, it should process deferred messages
first, then process newly received messages.

Fix: ceph#11258
Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit ccdeaf8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants