Skip to content

[Log] new spdlog tool for ray#10967

Merged
rkooo567 merged 13 commits intoray-project:masterfrom
antgroup:support-spdlog
Oct 29, 2020
Merged

[Log] new spdlog tool for ray#10967
rkooo567 merged 13 commits intoray-project:masterfrom
antgroup:support-spdlog

Conversation

@ashione
Copy link
Copy Markdown
Member

@ashione ashione commented Sep 23, 2020

Motiviation:

  • For long running job, it'd better to support log rotation
  • For easily debug, we may change loglevel in runtime
  • Support multiple sinks. Ray core worker can only focus on itself log message if user want to reuse ray log tool in cpp worker, so application logs will not mixed with ray core informations.
  • More advanced functions: user-defined format pattern, async flush for high performace and trace session

Why are these changes needed?

A new log tool for users.

Related issue number

#10493

spdlog V.S. glog : https://docs.google.com/document/d/1pEXTLOhTFu5-va14Wj-tGoJD9tt_M_-JRMtxqd1JCyI

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ashione ashione requested a review from jovany-wang September 23, 2020 03:47
@ashione ashione changed the title new spdlog tool for ray [WIP] new spdlog tool for ray Sep 23, 2020
@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 23, 2020

@rkooo567 @ericl

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 23, 2020

I test both glog and spdlog:
glog performance of Pertest in logging_test.cc
log file name pattern : LogPerfTest.20200923-142334.49592.log

[       OK ] LogPerfTest.PerfTest (10494 ms)
[----------] 1 test from LogPerfTest (10494 ms total)

spdlog : log file name pattern LogPerfTest_45421.log ( It can be modifed as same as glog)

Testing DEBUG log for 100000 rounds takes 0 ms.
Testing RAY_ERROR log for 100000 rounds takes 1210 ms.
Testing RAY_CHECK(true) for 100000 rounds takes 0 ms.
[       OK ] LogPerfTest.PerfTest (1239 ms)
[----------] 1 test from LogPerfTest (1239 ms total)

Only one difference is glog will produce log messages in both console(stdout) and log, but spdlog we define currently just flush them in log file.

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Sep 23, 2020

Can you add some details about the motivation in the description?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 23, 2020

Can you add some details about the motivation in the description?

More details you may find in this issue #10493 and its related event PR #10419

@ashione ashione linked an issue Sep 23, 2020 that may be closed by this pull request
@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Sep 23, 2020

So, does it mean spdlog is 10 times faster than glog in your workload?

Also, why don't we do this Only one difference is glog will produce log messages in both console(stdout) and log, in this PR?

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Sep 23, 2020

Can you also push the benchmarking code to this PR?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 24, 2020

Can you also push the benchmarking code to this PR?

It is LogPerfTest in src/ray/util/logging_test.cc.

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 24, 2020

So, does it mean spdlog is 10 times faster than glog in your workload?

Also, why don't we do this Only one difference is glog will produce log messages in both console(stdout) and log, in this PR?

~ 5200ms when I remove stderr verbose of glog. (Just change google::SetStderrLogging(GetMappedSeverity(RayLogLevel::ERROR)); to google::SetStderrLogging(GetMappedSeverity(RayLogLevel::FATAL));

@ashione ashione requested review from ericl and rkooo567 September 24, 2020 03:04
@ashione
Copy link
Copy Markdown
Member Author

ashione commented Sep 24, 2020

@rkooo567 I have pushed benchmark target commit. You may execute
bazel test logging_spdlog_test and bazel test logging_test to show their different performance.
In my MacBook Pro with 2.2 GHz Intel Core i7, 6 GB 1600 MHz DDR3

[==========] Running 3 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from PrintLogTest
[ RUN      ] PrintLogTest.LogTestWithoutInit
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0924 13:57:48.694461 39597 2729898880 logging_test.cc:40] This is the INFO message
W0924 13:57:48.696022 39597 2729898880 logging_test.cc:42] This is the WARNING message
E0924 13:57:48.696040 39597 2729898880 logging_test.cc:44] This is the ERROR message
[       OK ] PrintLogTest.LogTestWithoutInit (2 ms)
[ RUN      ] PrintLogTest.LogTestWithInit
[       OK ] PrintLogTest.LogTestWithInit (1 ms)
[----------] 2 tests from PrintLogTest (3 ms total)

[----------] 1 test from LogPerfTest
[ RUN      ] LogPerfTest.PerfTest
Testing DEBUG log for 100000 rounds takes 0 ms.
Testing RAY_ERROR log for 100000 rounds takes 5701 ms.
Testing RAY_CHECK(true) for 100000 rounds takes 0 ms.
[       OK ] LogPerfTest.PerfTest (5721 ms)
[----------] 1 test from LogPerfTest (5721 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 2 test cases ran. (5724 ms total)
[  PASSED  ] 3 tests.

And

[==========] Running 3 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from PrintLogTest
[ RUN      ] PrintLogTest.LogTestWithoutInit
[2020-09-24 13:58:07.811] [info] logging_test.cc:40: This is the INFO message
[2020-09-24 13:58:07.812] [warning] logging_test.cc:42: This is the WARNING message
[2020-09-24 13:58:07.812] [error] logging_test.cc:44: This is the ERROR message
[       OK ] PrintLogTest.LogTestWithoutInit (1 ms)
[ RUN      ] PrintLogTest.LogTestWithInit
[2020-09-24 13:58:07,812 D 40336 10382541] logging_test.cc:37: This is the DEBUG message
[2020-09-24 13:58:07,812 I 40336 10382541] logging_test.cc:40: This is the INFO message
[2020-09-24 13:58:07,812 W 40336 10382541] logging_test.cc:42: This is the WARNING message
[2020-09-24 13:58:07,812 E 40336 10382541] logging_test.cc:44: This is the ERROR message
[       OK ] PrintLogTest.LogTestWithInit (0 ms)
[----------] 2 tests from PrintLogTest (1 ms total)

[----------] 1 test from LogPerfTest
[ RUN      ] LogPerfTest.PerfTest
Testing DEBUG log for 100000 rounds takes 0 ms.
Testing RAY_ERROR log for 100000 rounds takes 1392 ms.
Testing RAY_CHECK(true) for 100000 rounds takes 0 ms.
[       OK ] LogPerfTest.PerfTest (1432 ms)
[----------] 1 test from LogPerfTest (1432 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 2 test cases ran. (1433 ms total)
[  PASSED  ] 3 tests.

@rkooo567
Copy link
Copy Markdown
Contributor

@ashione Can you remove WIP when it is ready?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 25, 2020
@ashione ashione changed the title [WIP] new spdlog tool for ray [Log] new spdlog tool for ray Oct 15, 2020
@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 15, 2020

@ashione Can you remove WIP when it is ready?

I remove WIP and create all same functions for spdlog.

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 16, 2020

@ashione Please avoid rebasing/modifying commit history in a pull request. It's hard to find out what's changed since the last review.

Copy link
Copy Markdown
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you paste an example output of RAY_CHECK?

inline void Flush() {
auto logger = get_logger();
if (loglevel_ == static_cast<int>(spdlog::level::critical)) {
stream() << "\n" << ray::GetCallTrace();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this line? I think GLOG will fetch the stack trace for us once std::abort() is invoked.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you paste an example output of RAY_CHECK?

Stacktraces will be exactly produced twice with InstallFailureSignalHandler, but nothing will be catched we do not call GetCallTrace manually without InstallFailureSignalHandler for glog.

@rkooo567
Copy link
Copy Markdown
Contributor

I always use the merge master command and never had any trouble (and it is okay to do this because the commit history doesn't matter anyway. All of commits are merged into one by Github when merged).

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 19, 2020

Could you paste an example output of RAY_CHECK?

Something like that

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from PrintLogTest
[ RUN      ] PrintLogTest.RayCheckAbortTest
[2020-10-19 13:55:37,743 C 37433 2749808] logging_test.cc:140:  Check failed: 0 Check for aborting
    @        0x107aae9f8  ray::SpdLogMessage::Flush()
    @        0x107aadf99  ray::RayLog::~RayLog()
    @        0x107a62393  ray::PrintLogTest_RayCheckAbortTest_Test::TestBody()
    @        0x108404ff8  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x108404f4e  testing::Test::Run()
    @        0x108405f10  testing::TestInfo::Run()
    @        0x1084067c7  testing::TestCase::Run()
    @        0x108412e47  testing::internal::UnitTestImpl::RunAllTests()
    @        0x108412868  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x1084127ed  testing::UnitTest::Run()
    @        0x107a625a1  main
    @     0x7fff7488c015  start

I can't find SIGABRT in the log. Is it intended? What does the log look like if it's a SIGSEGV?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 19, 2020

Could you paste an example output of RAY_CHECK?

Something like that

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from PrintLogTest
[ RUN      ] PrintLogTest.RayCheckAbortTest
[2020-10-19 13:55:37,743 C 37433 2749808] logging_test.cc:140:  Check failed: 0 Check for aborting
    @        0x107aae9f8  ray::SpdLogMessage::Flush()
    @        0x107aadf99  ray::RayLog::~RayLog()
    @        0x107a62393  ray::PrintLogTest_RayCheckAbortTest_Test::TestBody()
    @        0x108404ff8  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x108404f4e  testing::Test::Run()
    @        0x108405f10  testing::TestInfo::Run()
    @        0x1084067c7  testing::TestCase::Run()
    @        0x108412e47  testing::internal::UnitTestImpl::RunAllTests()
    @        0x108412868  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x1084127ed  testing::UnitTest::Run()
    @        0x107a625a1  main
    @     0x7fff7488c015  start

I can't find SIGABRT in the log. Is it intended? What does the log look like if it's a SIGSEGV?

It's shown by return value of command line.

[1]    40850 abort      sh -c "bazel-bin/logging_test --gtest_filter=*AbortTest"

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 20, 2020

FYI, I tested with an std::abort() and this is the log file:

tail -n 20 python-core-driver-01000000ffffffffffffffffffffffffffffffff_854.log
[2020-10-20 16:02:22,220 I 854 1015] actor_manager.cc:127: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: df5a1a8201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0
[2020-10-20 16:02:22,220 I 854 854] direct_task_transport.cc:47: Creating actor via GCS actor id = : df5a1a8201000000
[2020-10-20 16:02:22,220 E 854 1015] logging.cc:397: *** Aborted at 1603180942 (unix time) try "date -d @1603180942" if you are using GNU date ***
[2020-10-20 16:02:22,222 E 854 1015] logging.cc:397: PC: @                0x0 (unknown)
[2020-10-20 16:02:22,222 E 854 1015] logging.cc:397: *** SIGABRT (@0x70900000356) received by PID 854 (TID 0x7f6506ffd700) from PID 854; stack trace: ***
[2020-10-20 16:02:22,223 E 854 1015] logging.cc:397:     @     0x7f6526755620 (unknown)
[2020-10-20 16:02:22,224 E 854 1015] logging.cc:397:     @     0x7f6525ca5277 __GI_raise
[2020-10-20 16:02:22,225 E 854 1015] logging.cc:397:     @     0x7f6525ca6968 __GI_abort
[2020-10-20 16:02:22,226 E 854 1015] logging.cc:397:     @     0x7f65187a2c09 ray::CoreWorkerDirectActorTaskSubmitter::SendPendingTasks()
[2020-10-20 16:02:22,227 E 854 1015] logging.cc:397:     @     0x7f65187a84a5 _ZNSt17_Function_handlerIFvvEZN3ray34CoreWorkerDirectActorTaskSubmitter10SubmitTaskENS1_17TaskSpecificationEEUlvE_E9_M_invokeERKSt9_Any_data
[2020-10-20 16:02:22,228 E 854 1015] logging.cc:397:     @     0x7f65187afcb2 ray::LocalDependencyResolver::ResolveDependencies()
[2020-10-20 16:02:22,230 E 854 1015] logging.cc:397:     @     0x7f65187ae805 ray::CoreWorkerDirectActorTaskSubmitter::SubmitTask()
[2020-10-20 16:02:22,230 E 854 1015] logging.cc:397:     @     0x7f651873893c _ZN5boost4asio6detail18completion_handlerIZN3ray10CoreWorker15SubmitActorTaskERKNS3_7ActorIDERKNS3_11RayFunctionERKSt6vectorISt10unique_ptrINS3_7TaskArgESt14default_deleteISD_EESaISG_EERKNS3_11TaskOptionsEPSB_INS3_8ObjectIDESaISO_EEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
[2020-10-20 16:02:22,232 E 854 1015] logging.cc:397:     @     0x7f6518baa6fd boost::asio::detail::scheduler::do_run_one()
[2020-10-20 16:02:22,233 E 854 1015] logging.cc:397:     @     0x7f6518bab1c1 boost::asio::detail::scheduler::run()
[2020-10-20 16:02:22,234 E 854 1015] logging.cc:397:     @     0x7f6518bacc92 boost::asio::io_context::run()
[2020-10-20 16:02:22,235 E 854 1015] logging.cc:397:     @     0x7f6518722c80 ray::CoreWorker::RunIOService()
[2020-10-20 16:02:22,236 E 854 1015] logging.cc:397:     @     0x7f65180abc60 (unknown)
[2020-10-20 16:02:22,237 E 854 1015] logging.cc:397:     @     0x7f652674de25 start_thread
[2020-10-20 16:02:22,238 E 854 1015] logging.cc:397:     @     0x7f6525d6cf1d __clone

The only difference from the glog solution is that there are many duplicated [2020-10-20 16:02:22,235 E 854 1015] logging.cc:397:.

Copy link
Copy Markdown
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@rkooo567
Copy link
Copy Markdown
Contributor

@kfstorm please don’t merge it yet. I will review it tmrw

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 20, 2020

@rkooo567 Don't worry. I'm not going to merge this PR without your approval.

public:
explicit SpdLogMessage(const char *file, int line, int loglevel) : loglevel_(loglevel) {
stream() << ConstBasename(file) << ":" << line << ": ";
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto-gen by clang-format

if (!logger) {
logger = spdlog::get("stderr");
}
// If no default logger we just emit all log informations to stderr.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When's the time we don't have a default logger? Can you comment it?

loglevel_ == static_cast<int>(spdlog::level::critical)) {
stream() << "\n*** StackTrace Information ***\n" << ray::GetCallTrace();
}
logger->log(static_cast<spdlog::level::level_enum>(loglevel_), "{}", str_.str());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger->log(static_cast<spdlog::level::level_enum>(loglevel_), "{}", str_.str());
logger->log(static_cast<spdlog::level::level_enum>(loglevel_), /*fmt*/ "{}", str_.str());
```?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it is the correct argument. Can you specify what this argument means?

#include "ray/util/logging.h"

#include <chrono>
#include <csignal>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it added when we commented out all tests here?

exit(0);
}

TEST(PrintLogTest, RayCheckAbortTest) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did it work before?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add it only for testing manually because process will abort and exit if signal catched.

#endif
}

void WriteFailureMessage(const char *data, int size) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

spdlog::rotating_logger_mt("ray_log_sink",
dir_ends_with_slash + app_name_without_path + "_" +
std::to_string(pid) + ".log",
1 << 29, 10);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this part configurable?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll extract theses parameters from env vars.

Can we make this part configurable?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 23, 2020

@rkooo567 Could you go over it again? I found failures are related to python travis tests.

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 28, 2020

@kfstorm @rkooo567 Any other comments about this PR? It's ready to be merged if nothing can be blocked.

@rkooo567
Copy link
Copy Markdown
Contributor

@ashione Can you merge the latest master? It seems like all serve tests are failing, but I think it was same for the master. But I'd like to make sure if that's the case.

@rkooo567
Copy link
Copy Markdown
Contributor

Remove the tag (@author-action-required) once you merge the latest master (and passes all builds)

@ashione ashione removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 28, 2020
@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 28, 2020

@rkooo567 I have no idea about why //python/ray/tune:test_experiment_analysis failed in TUNE_TESTING? Is it known-issue in master.

@rkooo567
Copy link
Copy Markdown
Contributor

@ashione I think there were some failures on master. This must've been resolved now. Can you try merging the master one more time?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Oct 29, 2020

@ashione I think there were some failures on master. This must've been resolved now. Can you try merging the master one more time?

It's done and looks all green.

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 29, 2020

Windows CI failure seems unrelated.

@rkooo567
Copy link
Copy Markdown
Contributor

Windows tests seem to have weird test failures that I've never seen. Rerunning the job now...

@rkooo567
Copy link
Copy Markdown
Contributor

@kfstorm The test failure indeed seems unrelated, but I've never seen this error in any CI builds. So let me re-run it just in case!

@kfstorm
Copy link
Copy Markdown
Member

kfstorm commented Oct 29, 2020

@rkooo567 Sure. Thank you.

@rkooo567 rkooo567 merged commit 0b7a3d9 into ray-project:master Oct 29, 2020
@kfstorm kfstorm deleted the support-spdlog branch October 30, 2020 05:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray uses spdlog instead of glog

4 participants