[EVENT] add log reporter#10419
Conversation
d29ed33 to
7b21c1d
Compare
|
Why not just using glog for writing framework? What's the reason why we'd like to develop our own file writing framework? |
First, I aslo disagree with repeated development of file systems in ray. But glog has low customizability, so we have to face these:
I'd like to replace glog with spdlog https://github.com/gabime/spdlog/tree/v1.x/include/spdlog (popular and good performance). |
7b21c1d to
6828efc
Compare
|
@Basasuya LGTM. Could you check the rest of comments and reslove them. |
|
@Basasuya More comments should be better for more readable and advance development before all public functions. |
|
I will review this very soon. Have been too busy by Ray 1.0 related stuff. |
|
|
||
| log_sink_->info(result); | ||
| if (force_flush_) { | ||
| Flush(); |
There was a problem hiding this comment.
When do we need this attribute? Is it for testing? Can you clearly specify this attribute is for testing?
There was a problem hiding this comment.
force_flush_ is used to clear the file buffer. When fwrite function is called, sometimes the information is not actually written to the file, but stored in the OS file buffer. The flush will be triggered when it exceeds certain buffer size.
There was a problem hiding this comment.
I understand when we need flush! But isn’t this triggered automatically by spdlog if buffers are filled up? I just don’t know when we need to set this attribute true except test scenario.
There was a problem hiding this comment.
When the system receives SIGTERM, the log reporter's destructor may not be triggered. Some of the last event information may not be sent successfully. But if it is force_flush mode, because it will be flushed when RAY_EVENT is called, so we don’t worry about this problem.
| RAY_CHECK(Event_SourceType_IsValid(event.source_type())); | ||
| RAY_CHECK(Event_Severity_IsValid(event.severity())); | ||
| std::string result = EventToString(event); | ||
| // Pop the last character from the result string because it is breakline '\n'. |
There was a problem hiding this comment.
Will logs still have newline without this?
There was a problem hiding this comment.
Because boost::property_tree::json_parser::write_json would generate'\n' at the end of string.
Then. every string which the spdlog write would also generate a newline.
So I should remove the '\n' at the end of the string.
| } | ||
|
|
||
| TEST(EVENT_TEST, LOG_ONE_THREAD) { | ||
| std::string log_dir = GenerateLogDir(); |
There was a problem hiding this comment.
I prefer not to have actual filesystem operation in uni tests. is there any way we can mock this out?
There was a problem hiding this comment.
This uni test is designed to test the logic of the log reporter. Concretely speaking, I will test the correctness of the interface in single thread and multi-thread and the correctness of rotating file.At least for now, I think it’s hard to mock this function.
There was a problem hiding this comment.
@rkooo567 �Could you tell us what's your main concern in testing log files rotation? If we have to check whether tiny event log files can be rotated in our test cases.
…t_Basasuya_logreporter
|
I have added some code comments in this PR |
ericl
left a comment
There was a problem hiding this comment.
Hey guys, do we really need spdlog here? I would much rather keep glog, even if it has slightly less features, rather than add a new dependency on some random library.
Hi eric, my answer is yes. We do need spdlog becasue of its features and performance. In our production business and best practices, we not only need to support a large number of long-running jobs, some are even online jobs. These long-running tasks are usually run in a docker or serverless environment, and their specifications are destined to not carry large-capacity disks. Therefore, we must clean up the log files regularly to prevent our long-running tasks from being affected.
We have also discussed for a long time whether to introduce this dependency. The conclusion is that we firstly introduce spdlog into the event system so that its needs can be met, instead of modifying glog or repeating filesystem operations. In addition, it is worth mentioning that the maintenance of glog is indeed inferior to spdlog. My understanding is that adapting to the needs to make some changes will make ray a greater open source software. Instead of avoiding it, endure it temporarily. |
|
@ericl @rkooo567 @ashione We have discussed this in slack |
|
https://stackoverflow.com/questions/16439857/google-glog-rolling-files seems to indicate glog supports log rotation, have you looked into this? |
|
@ericl I think the hardest blocker for using glog is 3 (imo, other features are not "absolutely needed" for OSS). For events, we'd like to write them in a separate file from regular logs. It seems like there's a way to implement additional log sink for glog, but it looked a little sketchy (https://ray-distributed.slack.com/archives/C016MPZE1L1/p1598943141006800?thread_ts=1598937145.003600&cid=C016MPZE1L1). |
|
Hmm I can understand the desire to write to a separate file, but why not instead just write at DEBUG or INFO level into the normal logs?. It is useful to correlate these events with other log message, and easy enough to filter with grep if needed. Is the concern the volume of logs is way too high? |
It is better not to mix the information of event with the normal log, otherwise this is not the so-called event, but a special log item. The purpose of putting the event information into a separate file is to facilitate the distributed retrieval or index building of dashboards or other processes. And you said that using grep to find is our way of positioning artificially. |
The current version of GLOG does not support rotation because it produces a random file with the current timestamp as the suffix. On GLOG's GitHub homepage, I observed that there was a PR about rotation a few years ago, and it has not been merged. Internally, we modified GLOG to achieve it. BTW, the parameter of GLOG_max_log_size is only for generating a new file, but old files can not be deleted. |
This way of adding custom sink also requires us to manage files and decide when and whether rotation is needed. |
|
@ericl I don't think we should use glog at DEBUG or INFO level to send event messages. The first reason is that glog does not support multiple logger object of one thread. In this case, RAY_EVENT and RAY_LOG will share glog object, then glog can not write to different files. Second, If we redirected DEBUG, FATAL mode to write events files, these two severities are not enough in our practical experience. At the same time, it will affect the function of RAY_LOG. |
|
Ok, the lack of log rotation does seem like a deal breaker. Frankly, saving
events at debug level does seem fine though. There is no principled reason
why it can't be done that way.
I'm fine with adding this dependency due to the lack of rotation, but in
the future we should really try as hard as we can to avoid depending on
external libraries.
…On Sat, Sep 12, 2020, 2:39 AM Basasuya ***@***.***> wrote:
@ericl <https://github.com/ericl> I don't think we should use glog at
DEBUG or INFO level to send event messages. The first reason is that glog
does not support multiple logger object of one thread. In this case,
RAY_EVENT and RAY_LOG will share glog object, then glog can not write to
different files. Second, If we redirected DEBUG, FATAL mode to write events
files, these two severities are not enough in our practical experience. At
the same time, it will affect the function of RAY_LOG.
In summary, glog doesn't support multiple logger object of one thread
natively. And glog can not rotate file just like other log frameworks such
as log4j, spdlog. That's the reason why we want to try another framework.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10419 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADUSSDOC44PHVIQCI5K33SFM6UZANCNFSM4QO2HNDA>
.
|
…t_Basasuya_logreporter
|
@rkooo567 Could you please review my reply to your previous comments? |
This PR is based on the previous PR: #9657
Ray Event Design Doc: https://docs.google.com/document/d/1QSJ1UUEv6zBh6UjVY24w3kkpNdZq4hOGdU7TXgs8YrE/edit#heading=h.ej1vw0pzfkg0
I add the log reporter to support write event to the files
every process will write the events to the file (event_{SOURCE_TYPE}.txt) just like below:
{"time_stamp":"2020-08-29 14:18:15.998084","severity":"INFO","label":"label 1","event_id":"de150792ceb151c815d359d4b675fcc6266a","source_type":"CORE_WORKER","host_name":"Macbool.local","pid":"20830","message":"send message 1","custom_fields":{"task_id":"task 1","job_id":"job 1","node_id":"node 1"}}Why are these changes needed?
Ray needs an event framework to report critical information, this information includes application-level data like job-condition, node adds/remove, alert, error, and so on. event messages can be sent to different destinations such as the dashboard server, the file. Users can see this information in real-time, or analyze the historical information after the program ends.
Checks
scripts/format.shto lint the changes in this PR.