Skip to content

[EVENT] add log reporter#10419

Merged
raulchen merged 11 commits intoray-project:masterfrom
antgroup:master_ray_event_Basasuya_logreporter
Sep 16, 2020
Merged

[EVENT] add log reporter#10419
raulchen merged 11 commits intoray-project:masterfrom
antgroup:master_ray_event_Basasuya_logreporter

Conversation

@Basasuya
Copy link
Copy Markdown
Contributor

This PR is based on the previous PR: #9657
Ray Event Design Doc: https://docs.google.com/document/d/1QSJ1UUEv6zBh6UjVY24w3kkpNdZq4hOGdU7TXgs8YrE/edit#heading=h.ej1vw0pzfkg0

I add the log reporter to support write event to the files

every process will write the events to the file (event_{SOURCE_TYPE}.txt) just like below:

{"time_stamp":"2020-08-29 14:18:15.998084","severity":"INFO","label":"label 1","event_id":"de150792ceb151c815d359d4b675fcc6266a","source_type":"CORE_WORKER","host_name":"Macbool.local","pid":"20830","message":"send message 1","custom_fields":{"task_id":"task 1","job_id":"job 1","node_id":"node 1"}}

Why are these changes needed?

Ray needs an event framework to report critical information, this information includes application-level data like job-condition, node adds/remove, alert, error, and so on. event messages can be sent to different destinations such as the dashboard server, the file. Users can see this information in real-time, or analyze the historical information after the program ends.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@Basasuya Basasuya force-pushed the master_ray_event_Basasuya_logreporter branch from d29ed33 to 7b21c1d Compare August 31, 2020 08:21
@rkooo567
Copy link
Copy Markdown
Contributor

Why not just using glog for writing framework? What's the reason why we'd like to develop our own file writing framework?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 31, 2020
@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 1, 2020

Why not just using glog for writing framework? What's the reason why we'd like to develop our own file writing framework?

First, I aslo disagree with repeated development of file systems in ray. But glog has low customizability, so we have to face these:

  • No rotating (we need automatically clean log files genreated by long running jobs),
  • Can be initialized only once,
  • No custom format pattern.

I'd like to replace glog with spdlog https://github.com/gabime/spdlog/tree/v1.x/include/spdlog (popular and good performance).
Or making spdlog as an option.

@Basasuya Basasuya force-pushed the master_ray_event_Basasuya_logreporter branch from 7b21c1d to 6828efc Compare September 1, 2020 12:03
@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 2, 2020

@Basasuya LGTM. Could you check the rest of comments and reslove them.

@Basasuya Basasuya closed this Sep 3, 2020
@Basasuya Basasuya reopened this Sep 3, 2020
@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 4, 2020

@rkooo567 @raulchen @yuyiming Could you merge it if travis is green.

@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 4, 2020

@Basasuya More comments should be better for more readable and advance development before all public functions.

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Sep 4, 2020

I will review this very soon. Have been too busy by Ray 1.0 related stuff.

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 4, 2020

log_sink_->info(result);
if (force_flush_) {
Flush();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we need this attribute? Is it for testing? Can you clearly specify this attribute is for testing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force_flush_ is used to clear the file buffer. When fwrite function is called, sometimes the information is not actually written to the file, but stored in the OS file buffer. The flush will be triggered when it exceeds certain buffer size.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand when we need flush! But isn’t this triggered automatically by spdlog if buffers are filled up? I just don’t know when we need to set this attribute true except test scenario.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the system receives SIGTERM, the log reporter's destructor may not be triggered. Some of the last event information may not be sent successfully. But if it is force_flush mode, because it will be flushed when RAY_EVENT is called, so we don’t worry about this problem.

RAY_CHECK(Event_SourceType_IsValid(event.source_type()));
RAY_CHECK(Event_Severity_IsValid(event.severity()));
std::string result = EventToString(event);
// Pop the last character from the result string because it is breakline '\n'.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will logs still have newline without this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because boost::property_tree::json_parser::write_json would generate'\n' at the end of string.
Then. every string which the spdlog write would also generate a newline.
So I should remove the '\n' at the end of the string.

}

TEST(EVENT_TEST, LOG_ONE_THREAD) {
std::string log_dir = GenerateLogDir();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to have actual filesystem operation in uni tests. is there any way we can mock this out?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uni test is designed to test the logic of the log reporter. Concretely speaking, I will test the correctness of the interface in single thread and multi-thread and the correctness of rotating file.At least for now, I think it’s hard to mock this function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkooo567 �Could you tell us what's your main concern in testing log files rotation? If we have to check whether tiny event log files can be rotated in our test cases.

@Basasuya
Copy link
Copy Markdown
Contributor Author

Basasuya commented Sep 10, 2020

I have added some code comments in this PR

Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey guys, do we really need spdlog here? I would much rather keep glog, even if it has slightly less features, rather than add a new dependency on some random library.

@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 12, 2020

Hey guys, do we really need spdlog here? I would much rather keep glog, even if it has slightly less features, rather than add a new dependency on some random library.

Hi eric, my answer is yes. We do need spdlog becasue of its features and performance.

In our production business and best practices, we not only need to support a large number of long-running jobs, some are even online jobs. These long-running tasks are usually run in a docker or serverless environment, and their specifications are destined to not carry large-capacity disks. Therefore, we must clean up the log files regularly to prevent our long-running tasks from being affected.
Besides, it is not that we have to replace glog, because glog is currently unable to meet business needs.
Let me briefly summarize again:

  1. Cannot automatically clean up and rotating the log.
  2. There are two duplicate outputs, one is standard output and the other is log file.
  3. For each process, glog can only be initialized once. If not, multiple errors are reported and the process exits abnormally. This means that if the user uses cpp worker, the log generated by the user's actor logic will be printed in the same log file as the core worker. Event sender can not create a new file for genrating event stuff as well.
  4. The log level cannot be dynamically modified at runtime, which is very helpful for debugging long-running tasks.
  5. Unable to customize the log input format, which is fatal for events.

We have also discussed for a long time whether to introduce this dependency. The conclusion is that we firstly introduce spdlog into the event system so that its needs can be met, instead of modifying glog or repeating filesystem operations.
The next step is to add a switch to select the logging tool. So we can freely choose whether to use glog or spdlog. The design of the RAY_LOG macro allows us to realize these two kinds of compatibility easily.

In addition, it is worth mentioning that the maintenance of glog is indeed inferior to spdlog. My understanding is that adapting to the needs to make some changes will make ray a greater open source software. Instead of avoiding it, endure it temporarily.

cc @Basasuya @raulchen @jovany-wang

@ashione ashione closed this Sep 12, 2020
@ashione ashione reopened this Sep 12, 2020
@Basasuya
Copy link
Copy Markdown
Contributor Author

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Sep 12, 2020

https://stackoverflow.com/questions/16439857/google-glog-rolling-files seems to indicate glog supports log rotation, have you looked into this?

@rkooo567
Copy link
Copy Markdown
Contributor

@ericl I think the hardest blocker for using glog is 3 (imo, other features are not "absolutely needed" for OSS). For events, we'd like to write them in a separate file from regular logs.

It seems like there's a way to implement additional log sink for glog, but it looked a little sketchy (https://ray-distributed.slack.com/archives/C016MPZE1L1/p1598943141006800?thread_ts=1598937145.003600&cid=C016MPZE1L1).

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Sep 12, 2020

Hmm I can understand the desire to write to a separate file, but why not instead just write at DEBUG or INFO level into the normal logs?. It is useful to correlate these events with other log message, and easy enough to filter with grep if needed. Is the concern the volume of logs is way too high?

@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 12, 2020

Hmm I can understand the desire to write to a separate file, but why not instead just write at DEBUG or INFO level into the normal logs?. It is useful to correlate these events with other log message, and easy enough to filter with grep if needed. Is the concern the volume of logs is way too high?

It is better not to mix the information of event with the normal log, otherwise this is not the so-called event, but a special log item. The purpose of putting the event information into a separate file is to facilitate the distributed retrieval or index building of dashboards or other processes. And you said that using grep to find is our way of positioning artificially.

@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 12, 2020

https://stackoverflow.com/questions/16439857/google-glog-rolling-files seems to indicate glog supports log rotation, have you looked into this?

The current version of GLOG does not support rotation because it produces a random file with the current timestamp as the suffix. On GLOG's GitHub homepage, I observed that there was a PR about rotation a few years ago, and it has not been merged. Internally, we modified GLOG to achieve it.

BTW, the parameter of GLOG_max_log_size is only for generating a new file, but old files can not be deleted.

@ashione
Copy link
Copy Markdown
Member

ashione commented Sep 12, 2020

@ericl I think the hardest blocker for using glog is 3 (imo, other features are not "absolutely needed" for OSS). For events, we'd like to write them in a separate file from regular logs.

It seems like there's a way to implement additional log sink for glog, but it looked a little sketchy (https://ray-distributed.slack.com/archives/C016MPZE1L1/p1598943141006800?thread_ts=1598937145.003600&cid=C016MPZE1L1).

This way of adding custom sink also requires us to manage files and decide when and whether rotation is needed.

@Basasuya
Copy link
Copy Markdown
Contributor Author

@ericl I don't think we should use glog at DEBUG or INFO level to send event messages. The first reason is that glog does not support multiple logger object of one thread. In this case, RAY_EVENT and RAY_LOG will share glog object, then glog can not write to different files. Second, If we redirected DEBUG, FATAL mode to write events files, these two severities are not enough in our practical experience. At the same time, it will affect the function of RAY_LOG.
In summary, glog doesn't support multiple logger object of one thread natively. And glog can not rotate file just like other log frameworks such as log4j, spdlog. That's the reason why we want to try another framework.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Sep 12, 2020 via email

@Basasuya
Copy link
Copy Markdown
Contributor Author

@rkooo567 Could you please review my reply to your previous comments?
#10419 (comment)
#10419 (comment)
#10419 (comment)

@rkooo567 rkooo567 requested a review from ericl September 15, 2020 18:23
@raulchen raulchen merged commit 5e030db into ray-project:master Sep 16, 2020
@raulchen raulchen deleted the master_ray_event_Basasuya_logreporter branch September 16, 2020 03:54
@ashione ashione mentioned this pull request Sep 23, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants