rgw/tracing: end to end trace for put obj op #47457

Merged
yuvalif merged 1 commit into ceph:main from zenomri:wip-omri-full-putobj-trace
Apr 7, 2023
Conversation

@zenomri

@zenomri zenomri commented Aug 4, 2022

Following #47370
The goal of this PR is to unify the trace created for the put obj operation in RGW with the subsequent spans created for the same operation in the OSDs.

Librados already had blkin_trace_info in some places, but that tracing system isn't used in either RGW or the OSD, so we can replace it with OpenTelemetry trace information.
For now, I just added an extra argument to the relevant methods. The cleanup of the blkin stuff will be covered in another PR, since it involves changes to many different source files and we want to keep this PR cleaner.

Screenshot of what the putobj trace looks like after this change:

[screenshot omitted]

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions

github-actions bot commented Aug 4, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@zenomri zenomri force-pushed the wip-omri-full-putobj-trace branch from e0fcbc0 to b665f19 Compare August 4, 2022 12:37
@zenomri
Author

zenomri commented Aug 7, 2022

jenkins retest this please

@zenomri zenomri changed the title [WIP] rgw/tracing: end to end trace for put obj op rgw/tracing: end to end trace for put obj op Aug 8, 2022
@zenomri zenomri force-pushed the wip-omri-full-putobj-trace branch 2 times, most recently from fb40ddc to 166a146 Compare August 9, 2022 07:57
}
op->osd_parent_span = tracing::osd::tracer.start_trace("op-request-created");

if (m->otel_trace.IsValid()) {
Contributor

don't you need to check that m->otel_trace is not null?

Author

don't you need to check that m->otel_trace is not null?

no. the otel_trace in the msg is not a pointer; it's the trace info, initialized to a no-op span context. if we receive otel trace info in the msg protocol, we decode it into that span context.

virtual bool have_instance(void) override { return state.obj.key.have_instance(); }
virtual void clear_instance() override { state.obj.key.instance.clear(); }
virtual jspan_context& get_trace() { return trace_ctx; }
virtual void set_trace (jspan_context&& _trace_ctx) { trace_ctx = _trace_ctx; }
Contributor

why do you do "copy" here and "move" in: src/rgw/rgw_sal_filter.h

Contributor

This is a move call (&&), so it shouldn't need std::move. I don't know about the filter version, though, because it's calling through.

Contributor

the filter also accepts jspan_context&& as a parameter

@zenomri
Author

zenomri commented Aug 11, 2022

jenkins retest this please

@cbodley cbodley requested a review from alimaredia August 11, 2022 12:09
@zenomri
Author

zenomri commented Aug 12, 2022

jenkins test make check

@zenomri zenomri requested a review from a team as a code owner August 12, 2022 10:23
@zenomri zenomri force-pushed the wip-omri-full-putobj-trace branch from 5a07bc5 to 359b25f Compare August 12, 2022 10:25
@zenomri
Author

zenomri commented Aug 12, 2022

jenkins test make check

@zenomri zenomri force-pushed the wip-omri-full-putobj-trace branch 3 times, most recently from 07f3bc0 to 2ffd073 Compare August 12, 2022 20:00
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yuvalif yuvalif force-pushed the wip-omri-full-putobj-trace branch from ebc6065 to e7506af Compare March 16, 2023 11:32
@yuvalif
Contributor

yuvalif commented Mar 16, 2023

@ljflores @yuriw rebased the code and squashed commits. code is ready for testing

…telemetry

* build: add opentelemetry to cmake system
crimson targets that use Message.cc/h are built before opentelemetry (o-tel), so we need to build o-tel earlier and add the library to the include path earlier
this should work for the WITH_JAEGER flag in both the ON/OFF cases, and for librados, where the compilation flag is ignored

* msg/tracer: add o-tel trace to Messages with decode/encode functions in tracer.h
some files that use Message.cc/h need only the encode/decode functions, not all the other functions.
some crimson targets do not link with ceph_context (common), which is required for tracer.cc, so we just need to include those functions

* librados: add an opentelemetry trace param for the aio_operate and operate methods
in order to propagate the trace info, I added the otel trace as an extra param.
in some places there was already blkin trace info, and since it is not used elsewhere we can safely change it to o-tel trace info.
this will be done in another commit, so the cleanup of the blkin trace will be in a dedicated commit

* osd: use the o-tel trace of the msg as the parent span of the osd trace
if there is a valid span in the msg, we add this op to the request trace; otherwise we start a new trace for the OSD op

* rgw: pass put obj trace info to librados
to make this possible, I saved the trace info inside the sal::Object, so we can use it later when writing the object to rados
it could also be used later for read ops.
note that the trace field of req_state is initialized only in rgw_process, so it's also required in the librgw request flow

Signed-off-by: Omri Zeneva <ozeneva@redhat.com>
@yuvalif yuvalif force-pushed the wip-omri-full-putobj-trace branch from e7506af to 5d4c5e6 Compare March 19, 2023 15:06
@yuvalif
Contributor

yuvalif commented Mar 27, 2023

@ljflores @yuriw rebased the code and squashed commits. code is ready for testing

any update on testing?
please let me know if i can assist in any way.

@ljflores
Member

Hey @yuvalif, it's in the yuri11 batch!

@ljflores
Member

ljflores commented Apr 5, 2023

Rados suite review:
https://pulpito.ceph.com/?branch=wip-yuri11-testing-2023-03-28-0950
https://pulpito.ceph.com/?branch=wip-yuri11-testing-2023-03-31-1108

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585
2. https://tracker.ceph.com/issues/58946
3. https://tracker.ceph.com/issues/58265
4. https://tracker.ceph.com/issues/59271
5. https://tracker.ceph.com/issues/59057
6. https://tracker.ceph.com/issues/59333
7. https://tracker.ceph.com/issues/59334
8. https://tracker.ceph.com/issues/59335

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. cephadm: KeyError: 'osdspec_affinity' - Ceph - Orchestrator
3. TestClsRbd.group_snap_list_max_read failure during upgrade/parallel tests - Ceph - RBD
4. mon: FAILED ceph_assert(osdmon()->is_writeable()) - Ceph - RADOS
5. rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8 - Ceph - RADOS
6. PgScrubber: timeout on reserving replicas - Ceph - RADOS
7. test_pool_create_with_quotas: Timed out after 60s and 0 retries - Ceph - Mgr - Dashboard
8. Found coredumps on smithi related to sqlite3 - Ceph - Cephsqlite

@ljflores
Member

ljflores commented Apr 5, 2023

@yuvalif all looks good on the rados side!

@yuvalif
Contributor

yuvalif commented Apr 6, 2023

jenkins test windows

@yuvalif
Contributor

yuvalif commented Apr 6, 2023

jenkins test make check arm64

@lxbsz
Member

lxbsz commented Apr 27, 2023

BTW, this PR caused the cephfs upgrade client test failures https://tracker.ceph.com/issues/59534.

@lxbsz
Member

lxbsz commented Apr 27, 2023

This PR will also make the latest clients incompatible with reef (and older) Ceph clusters.

@yuvalif
Contributor

yuvalif commented Apr 27, 2023

This PR will also make the latest clients incompatible with reef (and older) Ceph clusters.

what are the compatibility requirements between the cephfs clients and the OSDs?

@lxbsz
Member

lxbsz commented Apr 27, 2023

This PR will also make the latest clients incompatible with reef (and older) Ceph clusters.

what are the compatibility requirements between the cephfs clients and the OSDs?

The OSDs in old Ceph will always drop all the OSD requests.

With this PR, HEAD_VERSION is changed to 9 for MOSDOp, so clients with this PR will encode all MOSDOp messages at version 9. But in old Ceph the version is still 8, so those OSDs will skip decoding when receiving the osd reqs.

That means that on the OSD side the requests will always look like osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0), where the epoch, pgid, etc. are always 0.0. The OSD will then drop these requests and the client will time out:

2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0) pg 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 op_wq(0) _process 0.0 no pg, shouldn't exist e17, dropping OpQueueItem(0.0 PGOpItem(op=osd_op(client.14477.0:3 0.0 0.0 (undecoded) - e0) v9) prio 63 cost 0 e0)
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 should_share_map client.14477 v1:172.21.15.179:0/1855002755 0
2023-04-22 23:58:31.671 7efdcebe9700 20 osd.3 17 client session last_sent_epoch: 0 versus osdmap epoch 17
2023-04-22 23:58:31.671 7efdcebe9700 10 osd.3 17 client.14477 has old map 0 < 17

For more detail, please see the tracker: https://tracker.ceph.com/issues/59534

@yuvalif
Contributor

yuvalif commented Apr 27, 2023

@lxbsz i think we can roll back this change with this PR: #48971
if we remove the blkin traces completely, we do not need to add a new field to the message, and can keep the head version at 8
we will reuse the blkin trace field to hold the otel trace info

@lxbsz
Member

lxbsz commented Apr 27, 2023

@lxbsz i think we can roll back this change with this PR: #48971 if we remove the blkin traces completely, we do not need to add a new field to the message, and can keep the head version at 8 we will reuse the blkin trace field to hold the otel trace info

In cephfs we have already made this case work smoothly: on the new client side we check the supported feature bits (or version) first, and then encode the requests accordingly.

Maybe it can be done here like in cephfs?

  class MOSDOp final : public MOSDFastDispatchOp {
  private:
-   static constexpr int HEAD_VERSION = 8;
+   static constexpr int HEAD_VERSION = 9;
Contributor

@yuvalif @rzarzynski - This change seems to be causing failures in some fs suite runs where a new(-er) cephfs client talks to old cluster daemons. The client (ceph-fuse) is upgraded to reef (main branch) while the other ceph daemons are nautilus. Details are captured in this tracker: https://tracker.ceph.com/issues/59534

The upgraded client sends an osd_op with (version) v9. The OSD knows to decode up to v8 and would silently drop the request. This seems to be the case with pacific too: https://github.com/ceph/ceph/blob/pacific/src/messages/MOSDOp.h#L402

Contributor

i don't see this change in reef. this is currently only on main. see HEAD_VERSION = 8 there https://github.com/ceph/ceph/blob/reef/src/messages/MOSDOp.h#L39

so, my suggestion is to wait for PR: #48971, and revert the version back to 8 - by repurposing the blkin trace entry to send otel traces

Contributor

so, my suggestion is to wait for PR: #48971, and revert the version back to 8 - by repurposing the blkin trace entry to send otel traces

How soon would #48971 be merged? cephfs test runs are kind of blocked by this...

Contributor

by repurposing the blkin trace entry to send otel traces

doesn't that still change the format of the encoding? i don't think we can reuse version 8 if the contents are different

Contributor

there could be a workaround: use some magic number to distinguish the 2 types of payloads, and decode and throw it away if it's not there.

on a different note, @vshankar, what happened in the past when we had to change the message format?

Contributor

This is on ceph/main and not on ceph/reef, correct? This is a RADOS change, it's your call whether to revert first. Either way, let's find a coherent and testable change that can replace this. This feature has been in progress for a couple of years, if I'm not mistaken.

Contributor

reef isn't affected.

Let's revert to not hot-patch-in-a-hurry an encoding problem, then get into the weeds and arrange a meeting / workshop on the encoding stuff to finally merge the redo of this PR.

Contributor

+1

Member

No, that doesn't work. You need to guarantee compatibility of the messages at the protocol level, not via a config option that can be done wrong or that can have different values in different places. The RADOS team does this regularly and has a lot of models they should be able to share.

Given how broken the protocol changes here are, I really think we need to revert the whole PR until they can be hashed out properly — doing a quick fix is only going to come back and bite us. :( Not sure how you missed this @rzarzynski the first time around, but can you quickly provide some pointers?

I think I might have misunderstood the earlier conversation -- a feature bit would be a way to resolve this and might be the right one (I'm hoping we don't need to burn a bit that way, but if we need to we will). A config option would be very bad. :)

Contributor

Right - I was suggesting using a feature bit and then choosing the encoding version accordingly.

Anyway, +1 for reverting, which cannot be done via the github UI for this change. I guess @yuvalif is on it and the revert will be merged soon...

  encode(flags, payload);
  encode(reqid, payload);
  encode_trace(payload, features);
+ encode_otel_trace(payload, features);
Member

There's the conversation about the encoding breaking things, and this here is the fundamental issue — you just shoved a new data member into the middle of an encoding, which is an absolutely incompatible change, and didn't do anything to make sure all the various endpoints involved were able to keep communicating.

I'm guessing this was modeled after 6159027, but that looks to have been riding along on an existing check for the SERVER_LUMINOUS flag to keep things okay (or else it was broken and fixed up elsewhere), which this PR doesn't get to take advantage of.

Contributor

@yuvalif yuvalif Apr 29, 2023

yes. will post a similar fix that checks the SERVER_REEF flag.
however, in commit 6159027 the feature bit is used in several files, but not in MOSDOp.h. what is the reason for that? where should the fix be added?

Contributor

created a fix here: #51288
if the fix is acceptable, I will squash into one commit

@athanatos
Contributor

I'd like to revert this until there's a testable fix to merge.

@yuvalif
Contributor

yuvalif commented Apr 30, 2023

I'd like to revert this until there's a testable fix to merge.

done here: #51293

