
tracing: fix OSD message breakage with backward compatibility#51288

Closed
yuvalif wants to merge 1 commit into ceph:main from yuvalif:wip-yuval-fix-tracing-msg

Conversation

@yuvalif
Contributor

@yuvalif yuvalif commented Apr 29, 2023

this is fixing a backward compatibility issue that was introduced in commit 5d4c5e6: clients will encode the open-telemetry traces only if the server is REEF


this is fixing backward compatibility issue that was introduced
in commit: 5d4c5e6
clients will encode the open-telemetry traces only if server is REEF

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
@yuvalif yuvalif requested a review from a team as a code owner April 29, 2023 14:34
@github-actions github-actions bot added the core label Apr 29, 2023
@yuvalif
Contributor Author

yuvalif commented Apr 29, 2023

could you please let me know which test suites to run, so that cephfs and other client failures could be detected?

Member

@gregsfortytwo gregsfortytwo left a comment


This isn't complete for several reasons; please do the revert first so we aren't in a hurry trying to fix it and can do things properly.

(Reason 1: this doesn't fix the interplay of encoding versions, compat_version, etc.
Reason 2: Is this going into reef? Because it's not on that branch and it's getting late for breaking changes.
Reason 3: needs more discussion about performance implications and why this is going into the front of the payload.)

@athanatos
Contributor

The patch above didn't go into reef, right? Using the reef flag doesn't seem right. I'm with @gregsfortytwo, please revert #47457 .

@yuvalif
Contributor Author

yuvalif commented Apr 29, 2023

This isn't complete for several reasons; please do the revert first so we aren't in a hurry trying to fix it and can do things properly.

(Reason 1: this doesn't fix the interplay of encoding versions, compat_version, etc. Reason 2: Is this going into reef? Because it's not on that branch and it's getting late for breaking changes. Reason 3: needs more discussion about performance implications and why this is going into the front of the payload.)

  1. sure, will revert. but please explain what exactly is missing on that point. FYI, the original PR was up for review by the core team for several months. I don't see how adding more delay makes the code better
  2. why not port to reef? i thought it wasn't released yet
  3. OSD performance was tested and results were reviewed by the performance team (all documented in the PR). please provide details on anything that is missing

@gregsfortytwo
Member

This isn't complete for several reasons; please do the revert first so we aren't in a hurry trying to fix it and can do things properly.

(Reason 1: this doesn't fix the interplay of encoding versions, compat_version, etc. Reason 2: Is this going into reef? Because it's not on that branch and it's getting late for breaking changes. Reason 3: needs more discussion about performance implications and why this is going into the front of the payload.)

  1. sure. will revert. but please explain, what is missing exactly on that point. FYI, the original PR was up for review by the core team for several months. I don't see how adding more delay makes the code better

Yep, sorry about that -- unfortunately it seems like we lost some institutional memory and now it got more (needed) attention.

@gregsfortytwo
Member

And where is the performance analysis? I didn't see it when scrolling through comments.

@athanatos
Contributor

@yuvalif The reef dev freeze was some time ago. The release is supposed to be quite soon. Changes to the client<->osd messaging layer are (to me, at least) very clearly too risky to casually backport this late in the testing process. The encoding issue uncovered here is a really good example of why.

From the history on the PR, it looks like core should probably have taken a look much earlier and given advice on which bits needed to be fixed and how to test it. We'll have to work on that going forward.

@yuvalif
Contributor Author

yuvalif commented Apr 30, 2023

And where is the performance analysis? I didn't see it when scrolling through comments.

sorry. it is in a separate PR (where tracing was compiled in but disabled in the OSD code):
#44684

comment with final 70K benchmark results on OSD only (not end2end): #44684 (comment)

when tracing is disabled (the default) the difference between PR #44684 and PR #47457 (as replied to a similar question from @rzarzynski) is in boolean deserialization.
see: https://github.com/ceph/ceph/pull/47457/files#diff-b46d02199b4bb6f489ab7279d33d54db05f46245022bb937c48a94b36ba5d0cfR69
if no trace is sent, it just decodes a boolean.
if a trace is sent, it decodes the span. the actual cost depends on how much info is sent there.
however, if the user turns tracing on, they expect a cost in performance.

please let me know if other perf tests are needed for the end2end case.
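The guard-byte pattern described above can be sketched as follows. This is illustrative only, not Ceph's actual bufferlist code: `encode_trace`/`decode_trace` and the byte-vector `buffer` are hypothetical stand-ins. The point is that when no trace is attached, the decoder pays for exactly one byte on the hot path.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Ceph's bufferlist.
using buffer = std::vector<uint8_t>;

void encode_trace(buffer &bl, const std::string *span) {
  // Always write one guard byte saying whether a span context follows.
  bl.push_back(span ? 1 : 0);
  if (span) {
    bl.push_back(static_cast<uint8_t>(span->size()));  // short spans only
    bl.insert(bl.end(), span->begin(), span->end());
  }
}

// Returns the decoded span, or an empty string if none was sent.
std::string decode_trace(const buffer &bl, size_t &off) {
  bool has_span = bl[off++] != 0;  // the only cost when tracing is off
  if (!has_span)
    return {};
  size_t len = bl[off++];
  std::string span(bl.begin() + off, bl.begin() + off + len);
  off += len;
  return span;
}
```

With tracing disabled, the decode consumes a single byte, which matches the "just decodes a boolean" claim above.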

@yuvalif yuvalif added the DNM label Apr 30, 2023
@yuvalif
Contributor Author

yuvalif commented Apr 30, 2023

removing the commits would have created a huge change to the log, so i removed them using git revert instead.
revert PR: #51293

will keep this PR open (with DNM flag) for discussion on the correct way to fix the client-osd compatibility issue

@rzarzynski
Contributor

rzarzynski commented Apr 30, 2023

ok. what is the expected way to revert? remove the commits or create a "revert" commit?

Please always create a PR with a revert inside; just removing a commit from the history (and force push...) is a no-no.

I don't see how adding more delay makes the code better

Yep, sorry about that -- unfortunately it seems like we lost some institutional memory and now it got more (needed) attention.

Yes, the PR was in the queue for a very long time and now we're going to add even more. I truly apologize for that. I believe there is a good reason. As Greg said, we lost the memory on making (and reviewing!) encoding changes. We need to reestablish it. I want to deeply understand the weeds to the last tiny detail and document them / create guidelines / checklists-for-review to prevent such problems in the future.

why not port to reef? i thought it wasn't released yet

Because it tends to be risky. I recall that even the dups.length() printing in pg query (a diagnostic tool for a critical bug) wasn't backported to pacific because of requiring encoding changes.
Perhaps if we systemize the process, they may become less scary; hopefully to an extent it would affect the stance on backporting.

@rzarzynski
Contributor

Reason 3: needs more discussion about (...) why this is going into the front of the payload.)

@gregsfortytwo: I think I know the reason. As this is all about op tracing, the new data are needed early, at least at OSD::ms_fast_dispatch() or maybe even earlier.
However, MOSDOp in its newer encoding versions requires a two-stage decode (decode_payload() and finish_decode()). The second stage happens at PrimaryLogPG::do_op().

If so, this requirement knocks out the new-extending-only-version trick.

@yuvalif: please correct me if I'm wrong.

@yuvalif
Contributor Author

yuvalif commented Apr 30, 2023

Reason 3: needs more discussion about (...) why this is going into the front of the payload.)

@gregsfortytwo: I think I know the reason. As this is all about op tracing, the new data are needed early, at least at OSD::ms_fast_dispatch() or maybe even earlier. However, MOSDOp in its newer encoding versions requires a two-stage decode (decode_payload() and finish_decode()). The second stage happens at PrimaryLogPG::do_op().

If so, this requirement knocks out the new-extending-only-version trick.

@yuvalif: please correct me if I'm wrong.

i just copied whatever was done for the blkin tracing - which was added mid-message.
the comment below in the code suggests what @rzarzynski wrote (about the 2 stages)

@yuvalif
Contributor Author

yuvalif commented Apr 30, 2023

we lost the memory on making (and reviewing!) encoding changes.

I looked at the history, and the blkin tracing was added in "luminous", and was controlled by a feature flag when added.
however, in "quincy", the flag check was removed (since "quincy" does not need compatibility with "luminous").
so, when looking at the latest code, it does not seem like the trace needs to be protected...

@rzarzynski
Contributor

The comment is:

      encode_trace(payload, features);

      // -- above decoded up front; below decoded post-dispatch thread --

      encode(client_inc, payload);

@gregsfortytwo
Member

we lost the memory on making (and reviewing!) encoding changes.

I looked at the history, and the blkin tracing was added in "luminous", and was controlled by a feature flag when added. however, in "quincy", the flag check was removed (since "quincy" does not need compatibility with "luminous"). so, when looking at the latest code, it does not seem like the trace needs to be protected...

Hmm this probably happened when cleaning up the inter-server compatibility checks and the luminous feature flag, but we may want to bring it back for client compatibility reasons. IIRC there’s a mechanism for that (involving the MASK bits of encoding.h).

@rzarzynski
Contributor

rzarzynski commented Apr 30, 2023

we may want to bring it back for client compatibility reasons. IIRC there’s a mechanism for that (involving the MASK bits of encoding.h).

Looks we'll need to dig deeper.

  1. The new tracing has been modeled after the blkin tracing.
  2. The MOSDOp decoding for blkin tracing has been altered by 6159027 in August 2015:
 class MOSDOp : public MOSDFastDispatchOp {

   static const int HEAD_VERSION = 8;
...
   void encode_payload(uint64_t features) override {
...
     } else {
       // latest v8 encoding with hobject_t hash separate from pgid, no
       // reassert version
       header.version = HEAD_VERSION;
       ::encode(pgid, payload);
       ::encode(hobj.get_hash(), payload);
       ::encode(osdmap_epoch, payload);
       ::encode(flags, payload);
       ::encode(reqid, payload);
+      encode_trace(payload, features);
+
+      // -- above decoded up front; below decoded post-dispatch thread --
+
  3. The real work is delegated to Message::encode_trace(), introduced several commits earlier, ...
  4. by 612d15b (which added the BLKIN_TRACING bit):
+void Message::encode_trace(bufferlist &bl, uint64_t features) const
+{
+#ifdef WITH_BLKIN
+  if (features & CEPH_FEATURE_BLKIN_TRACING)
+    ::encode(*trace.get_info(), bl);
+#endif
+}
  5. In April 2017 the flag check was removed by 25eb142:
 void Message::encode_trace(bufferlist &bl, uint64_t features) const
 {
-#ifdef WITH_BLKIN
-  if (features & CEPH_FEATURE_BLKIN_TRACING)
-    ::encode(*trace.get_info(), bl);
-#endif
+  auto p = trace.get_info();
+  static const blkin_trace_info empty = { 0, 0, 0 };
+  if (!p) {
+    p = &empty;
+  }
+  ::encode(*p, bl);
 }
  6. The bit was kept untouched until June 2017, when it got renamed to OSD_RECOVERY_DELETES and reused.

@yuvalif
Contributor Author

yuvalif commented May 1, 2023

  6. The bit was kept untouched until June 2017, when it got renamed to OSD_RECOVERY_DELETES and reused.

i think that at some point we ran out of feature bits, so we changed to use "release bits" like "SERVER_LUMINOUS", and repurposed some of the bits taken earlier.

@github-actions

github-actions bot commented May 1, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rzarzynski
Contributor

when tracing is disabled (the default) the difference between PR #44684 and PR #47457 (as replied to a similar question from @rzarzynski) is in boolean deserialization.
see: https://github.com/ceph/ceph/pull/47457/files#diff-b46d02199b4bb6f489ab7279d33d54db05f46245022bb937c48a94b36ba5d0cfR69
if no trace is sent, it just decodes a boolean.

After looking into the weeds this seems a bit more complex:

  • from the tracing point of view – one byte only on the hot path;
  • from the MOSDOp point of view – there is the extra encoding preamble crunching (performed by the ENCODE_START / ENCODE_FINISH macros) to produce the struct_v, struct_compat and struct_len bytes which, on the hot path, are only needed to handle the is_valid boolean.

I haven't profiled the code, so I don't know whether this is a real problem. My gut feeling: I doubt it. However, if we want to be paranoid, we could move the "is tracing valid?" decision to MOSDOp at the price of establishing a thin interface between it and the tracing component, just to abstract span_ctx.IsValid().
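The "thin interface" idea above could look roughly like this. It is a sketch under stated assumptions: `TraceHandle` and `Encoder` are hypothetical names standing in for the tracing component and MOSDOp; the only point is that MOSDOp checks validity itself and skips the trace encoding (and its ENCODE_START/ENCODE_FINISH preamble) entirely on the hot path.

```cpp
#include <cassert>

// Hypothetical thin interface abstracting span_ctx.IsValid().
struct TraceHandle {
  bool valid = false;
  bool is_valid() const { return valid; }
};

// Hypothetical stand-in for the message encoder.
struct Encoder {
  int bytes = 0;
  void encode_trace(const TraceHandle &t) {
    if (!t.is_valid())
      return;      // nothing at all is emitted on the hot path
    bytes += 16;   // pretend an encoded span context costs 16 bytes
  }
};
```

The price, as noted above, is coupling MOSDOp to a small tracing-side interface instead of keeping the check inside the tracing code.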

@gregsfortytwo
Member

gregsfortytwo commented May 2, 2023

We really can't use the previous blkin tracing as a model for how feature bits should be used — it was always broken.

As @jdurgin noted in June 2017, commit 2525609:

include/ceph_features.h: add feature bit for handling deletes during recovery

The BLKIN feature bit was actually unused - it was a remnant from
earlier versions of the blkin work, but all the encoding is handled by
struct-level versioning in the version that merged.

Use bit 60 (unused in any prior version) so that recovery deletes
could potentially be backported.

Because CEPH_FEATURE_BLKIN_TRACING was never added to CEPH_FEATURES_ALL — it got a CEPH_FEATURES_BLKIN which was #define'd as 0. So this was always a problem but it apparently got shielded by other feature bits and the struct encoding at the same time. :/

In terms of feature bit re-use, I never really grokked it properly but there's stuff around CEPH_FEATUREMASK and the HAVE_FEATURE macro that is supposed to handle the full check and make compatibility work across those bit deprecations etc.
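The idea behind the mask-based check can be sketched as follows. This is not Ceph's actual macro, and the bit positions are invented for illustration: a retired-and-reused feature bit is paired with an "incarnation" marker, so a peer that only has the old meaning of the bit does not accidentally pass the check for the new feature.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bits, chosen only for illustration.
constexpr uint64_t INCARNATION_BIT = 1ull << 63;
constexpr uint64_t FEATURE_X_BIT   = 1ull << 60;  // a reused bit

// The mask pairs the feature bit with the incarnation marker.
constexpr uint64_t FEATUREMASK_X = FEATURE_X_BIT | INCARNATION_BIT;

constexpr bool have_feature(uint64_t peer_features) {
  // Both the bit and the incarnation marker must be present.
  return (peer_features & FEATUREMASK_X) == FEATUREMASK_X;
}
```

An old peer advertising only the reused bit (its old meaning) fails the check; only peers with the full mask pass.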

@yuvalif
Contributor Author

yuvalif commented May 5, 2023

We really can't use the previous blkin tracing as a model for how feature bits should be used — it was always broken.

As @jdurgin noted in June 2017, commit 2525609:

include/ceph_features.h: add feature bit for handling deletes during recovery

The BLKIN feature bit was actually unused - it was a remnant from
earlier versions of the blkin work, but all the encoding is handled by
struct-level versioning in the version that merged.

Use bit 60 (unused in any prior version) so that recovery deletes
could potentially be backported.

Because CEPH_FEATURE_BLKIN_TRACING was never added to CEPH_FEATURES_ALL — it got a CEPH_FEATURES_BLKIN which was #define'd as 0. So this was always a problem but it apparently got shielded by other feature bits and the struct encoding at the same time. :/

In terms of feature bit re-use, I never really grokked it properly but there's stuff around CEPH_FEATUREMASK and the HAVE_FEATURE MACRO that is supposed to handle the full check and make compatibility work across those bit deprecations etc.

so, please let me know what you think would be best:
(1) protect the new field and the message version under the SERVER_SXXX (whatever comes after reef)
(2) add a dedicated feature flag (if there is a bit that could be reused for that) - though I'm not sure what would be the benefit over option (1)
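Option (1) can be sketched as follows. This is a hypothetical stand-in, not the real MOSDOp code: `SERVER_NEXT` is a placeholder for whatever release bit follows reef, and the version numbers are invented. The new field and the advertised message version are both gated on the peer's feature set, so pre-reef peers see the unchanged encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder for the post-reef server release feature bit.
constexpr uint64_t SERVER_NEXT = 1ull << 10;
constexpr uint8_t OLD_VERSION = 8;
constexpr uint8_t NEW_VERSION = 9;

struct Msg {
  uint8_t header_version = 0;
  std::vector<uint32_t> payload;
  uint32_t trace_id = 42;

  void encode_payload(uint64_t peer_features) {
    if (peer_features & SERVER_NEXT) {
      header_version = NEW_VERSION;
      payload.push_back(trace_id);   // new field, new encoding only
    } else {
      header_version = OLD_VERSION;  // old peer: unchanged encoding
    }
  }
};
```

Old clients and servers never see the new field or the bumped version, which is what makes the change backward compatible.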

@yuvalif
Contributor Author

yuvalif commented May 5, 2023

when tracing is disabled (the default) the difference between PR #44684 and PR #47457 (as replied to a similar question from @rzarzynski) is in boolean deserialization.
see: https://github.com/ceph/ceph/pull/47457/files#diff-b46d02199b4bb6f489ab7279d33d54db05f46245022bb937c48a94b36ba5d0cfR69
if no trace is sent, it just decodes a boolean.

After looking into the weeds this seems a bit more complex:

  • from the tracing point of view – one byte only on the hot path;
  • from the MOSDOp point of view – there is the extra encoding preamble crunching (performed by the ENCODE_START / ENCODE_FINISH macros) to produce the struct_v, struct_compat and struct_len bytes which, on the hot path, are only needed to handle the is_valid boolean.

I haven't profiled the code, so I don't know whether this is a real problem. My gut feeling: I doubt it. However, if we want to be paranoid, we could move the "is tracing valid?" decision to MOSDOp at the price of establishing a thin interface between it and the tracing component, just to abstract span_ctx.IsValid().

will give it a try

@yuvalif
Contributor Author

yuvalif commented May 14, 2023

waiting for this commit: 1049d3e
to be merged.

@yuvalif
Contributor Author

yuvalif commented May 28, 2023

@athanatos, @gregsfortytwo, @vshankar, @cbodley
changes are done based on @rzarzynski's "squid" branch: https://github.com/rzarzynski/ceph/tree/wip-all-kickoff-s
please review here: rzarzynski#2

@yuvalif
Contributor Author

yuvalif commented Jun 19, 2023

closing in favor of: #52114

@yuvalif yuvalif closed this Jun 19, 2023