Skip to content

tracer: set tracing compiled in by default#44684

Merged
yuvalif merged 14 commits intoceph:mainfrom
zenomri:wip-omri-tracing-compiled
Jun 2, 2022
Merged

tracer: set tracing compiled in by default#44684
yuvalif merged 14 commits intoceph:mainfrom
zenomri:wip-omri-tracing-compiled

Conversation

@zenomri
Copy link

@zenomri zenomri commented Jan 20, 2022

Following performance testing on tracing code, this PR sets WITH_JAEGER=ON by default, and performance optimization was done in order to minimize tracing overhead when tracing is compiled in and disabled, which is the base state of tracing.

  1. create and return static noop span when tracing is disabled - instead of calling NoopTracer->StartSpan(...).
  2. remove unused spans - the creation of spans uses shared_ptr. and the overhead on the creation of spans that were used to measure latency was found redundant.

in addition, changes were done in order to optimize performance when tracing is enabled:

  1. use batch span processor. sends a batch of spans, instead of each span alone.
  2. since batch span uses a single thread to send span, thread_local tracer is no longer needed.

here are the latest hsbench results:
UPDATED 15 Feb

  • with WITH_JAEGER=OFF - link
  • with WITH_JAEGER=ON, tracing disabled - link
  • with WITH_JAEGER=ON, tracing enabled - link
    there is no effect on performance when tracing is disabled.
    when tracing is enabled, we are having degradation of ~18% on PUT operations, and ~10% on GET operations.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@yuvalif yuvalif requested a review from cbodley January 20, 2022 13:24
@zenomri zenomri marked this pull request as ready for review January 23, 2022 10:04
@zenomri
Copy link
Author

zenomri commented Jan 23, 2022

@ideepika could you please take a look at the make check failure?
it doesn't find opentelemetry/trace/provider.h. is it related to your previous PR about opentelemetry build?

@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch from 40015fd to d22d2a6 Compare January 23, 2022 23:03
@ideepika
Copy link
Member

@ideepika could you please take a look at the make check failure? it doesn't find opentelemetry/trace/provider.h. is it related to your previous PR about opentelemetry build?

Yeah, there are some updates that we'd need to set Jaeger as always on, I will update them

@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch from c10bcfe to 04f4ba2 Compare February 1, 2022 09:17
@zenomri zenomri changed the title [WIP] tracer: set tracing compiled in by default tracer: set tracing compiled in by default Feb 1, 2022
@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch from fcaf6ca to 2973230 Compare February 7, 2022 17:29
@zenomri zenomri requested a review from a team as a code owner February 7, 2022 17:29
@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch from 2973230 to b6fe24f Compare February 7, 2022 18:42
@zenomri
Copy link
Author

zenomri commented Feb 7, 2022

jenkins test api

@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch 3 times, most recently from 4a1799c to 573d0cf Compare February 15, 2022 11:00
@zenomri
Copy link
Author

zenomri commented Feb 15, 2022

PR description was updated. added commits relevant for enabled tracing.

Deepika Upadhyay and others added 11 commits May 17, 2022 04:38
specify --without-jaeger during rpm build process if jaeger package
install is not desired.

Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
jaeger opentelemetry deps that will be installed by default now:
libyaml-dev  > 0.6
libthrift-dev (thift deps: libevent-dev, bison, flex, boost(we use ceph compiled boost))
nlohmann-json3-dev

removes:
pkg.ceph.jaeger build package optional option

Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
flag

Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
the batch span processor holds batch of spans until timeout reached or
it fills up, instead of sending each span alone which is a huge overhead
Signed-off-by: Omri Zeneva <ozeneva@redhat.com>
since we are using batch span processor, the tracer does not send the
spans by himself. so a single tracer is good enough.
Signed-off-by: Omri Zeneva <ozeneva@redhat.com>
Signed-off-by: Omri Zeneva <ozeneva@redhat.com>
Signed-off-by: Omri Zeneva <ozeneva@redhat.com>
@zenomri zenomri force-pushed the wip-omri-tracing-compiled branch from f651103 to dd57ad6 Compare May 17, 2022 08:43
@markhpc
Copy link
Member

markhpc commented May 19, 2022

@ideepika looks pretty good, though I will note that 15K iops isn't pushing the OSDs very hard. We may want to test this on Officinalis with a single OSD setup just to see how much impact it has on an OSD that can normally do 70-80K IOPS.
@benhanokh might be able to help since he's done quite a bit of testing recently with high speed OSDs for his work on making the OSD write path faster.

@markhpc is this test a blocker for the PR or can we do that test effort separately?

It probably wouldn't be a bad idea so we don't end up having to do a backtrace later in case this ends up being a bigger loss than we expected. Was @benhanokh able to help run the tests on a faster setup?

@yuvalif
Copy link
Contributor

yuvalif commented May 31, 2022

https://gist.github.com/ideepika/4f37641e8b666c29b9205e8c80983db3

can you please give some context here? what are these numbers?

Ah yeah sure, so these metrics are collected with the rados perf suite run in for of collectl raw data, using collectl utility we can extract all collected metrics back to perf stats, which can then also be used to plot graphs with and compared with the master run perf stats. But this is incomplete as I am not aware of which metrics would be our object of interest, I was going to check with RADOS folks on that and then add comment.

@ideepika can you please re-review? (PR is currently blocked on this request)

@zenomri
Copy link
Author

zenomri commented Jun 2, 2022

@ideepika looks pretty good, though I will note that 15K iops isn't pushing the OSDs very hard. We may want to test this on Officinalis with a single OSD setup just to see how much impact it has on an OSD that can normally do 70-80K IOPS.

@benhanokh might be able to help since he's done quite a bit of testing recently with high speed OSDs for his work on making the OSD write path faster.

@markhpc @yuvalif @mattbenjamin @ideepika
Got help from Gabi and I eventually ran the same tests, on his machine. got the following results:

randread
4096 randread single_delete1_fio_read_without_jaeger 118576
4096 randread single_delete1_fio_read_jaeger_disabled 117616
4096 randread single_delete1_fio_read_jaeger_enabled 116590

randwrite
4096 randwrite single_delete1_fio_write_without_jaeger 67821
4096 randwrite single_delete1_fio_write_jaeger_disabled 67358
4096 randwrite single_delete1_fio_write_jaeger_enabled 65959

those are the IOPS of the tests.

@zenomri
Copy link
Author

zenomri commented Jun 2, 2022

jenkins retest this please

@ideepika
Copy link
Member

ideepika commented Jun 2, 2022

hey @zenomri tests make sense, perf degradation also looks <1% with jaeger disabled, lgtm.
@markhpc tagging if you have any other comments.

@zenomri
Copy link
Author

zenomri commented Jun 2, 2022

jenkins test make check

Copy link
Member

@markhpc markhpc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ideepika Thanks a ton for testing! Those are great results. LGTM!

@yuvalif
Copy link
Contributor

yuvalif commented Jun 2, 2022

jenkins test make check

@cbodley
Copy link
Contributor

cbodley commented Jun 2, 2022

nice work @zenomri! and thanks to everyone else that helped!

@ideepika
Copy link
Member

ideepika commented Jun 3, 2022

@markhpc actually Omri took up work on perf, @zenomri thanks for working on this!

@markhpc
Copy link
Member

markhpc commented Jun 9, 2022

@zenomri Great job and thank you much for the performance work!

@tchaikov
Copy link
Contributor

tchaikov commented Jul 9, 2022

/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(thrift_sender.cc.o): in function `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetLogLevel()':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:118: undefined reference to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(thrift_sender.cc.o): in function `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetLogHandler()':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:100: undefined reference to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(recordable.cc.o): in function `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetLogLevel()':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:118: undefined reference to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(recordable.cc.o): in function `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetLogHandler()':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:100: undefined reference to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(recordable.cc.o): in function `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetLogLevel()':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:118: undefined reference to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(recordable.cc.o):/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/sdk/include/opentelemetry/sdk/common/global_log_handler.h:100: more undefined references to `opentelemetry::v1::sdk::common::internal_log::GlobalLogHandler::GetHandlerAndLevel()' follow
/opt/rh/gcc-toolset-11/root/usr/bin/ld: ../../opentelemetry-cpp/exporters/jaeger/libopentelemetry_exporter_jaeger_trace.a(recordable.cc.o): in function `opentelemetry::v1::exporter::jaeger::JaegerRecordable::SetResource(opentelemetry::v1::sdk::resource::Resource const&)':
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13468-g4af4b624/rpm/el8/BUILD/ceph-17.0.0-13468-g4af4b624/src/jaegertracing/opentelemetry-cpp/exporters/jaeger/src/recordable.cc:252: undefined reference to `opentelemetry::v1::sdk::resource::Resource::GetAttributes() const'
collect2: error: ld returned 1 exit status

see https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/63812//consoleFull

@ideepika @zenomri how was the change in src/crimson/CMakeLists.txt tested?

@ideepika
Copy link
Member

@tchaikov sorry, I guess it got masked in other failures that were present during time of testing, will work on fixing it, thanks for 40039e7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants