[EVENT] Basic Function and Definition by Basasuya · Pull Request #9657 · ray-project/ray

Basasuya · 2020-07-23T08:48:54Z

This PR is moved from the https://github.com/ray-project/ray/pull/9612/commits
The previous PR is closed

Why are these changes needed?

Ray needs an event framework to report critical information, this information includes application-level data like job-condition, node adds/remove, alert, error, and so on. event messages can be sent to different destinations such as the dashboard server, the file. Users can see this information in real-time, or analyze the historical information after the program ends.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/latest/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested (please justify below)

AmplabJenkins · 2020-07-23T08:53:52Z

Can one of the admins verify this patch?

AmplabJenkins · 2020-07-24T09:58:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28841/
Test FAILed.

rkooo567 · 2020-07-25T07:16:09Z

Some questions.

Is it still WIP? It seems like there are bunch of test failures now, so I assume it is the case?
Can you at least address all comments from the previous PR? We can re-review after that.

Basasuya · 2020-07-25T16:20:21Z

Some questions.

Is it still WIP? It seems like there are bunch of test failures now, so I assume it is the case?

Can you at least address all comments from the previous PR? We can re-review after that.

it is not WIP, it finished. Maybe there are some Lint errors. I will fix it and rebase the master.
Here are the comments in the previous PRs and my solutions

#9612 (comment)
It's OK to use the JSON format to write, I prepare to move this function to the subclass(next PR). And there is some question about how to deal with multiple lines.

#9612 (comment)
I have changed the event protobuf to support custom labels

#9612 (comment)
I am not very clear about this review. It is just our original design intention in Ant. Or do you mean there will be some automatic binding to report event?

AmplabJenkins · 2020-07-25T22:31:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28933/
Test FAILed.

AmplabJenkins · 2020-07-25T22:45:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28934/
Test FAILed.

rkooo567 · 2020-07-26T04:43:29Z

@Basasuya I meant address all comments in the previous PR! (so that Edward and Lingxuan doesn't need to have the same review). I just would like to have your confirmation, so that we can resume review after that (or reply here if you have questions). Also, thanks for your responses to my msgs!

rkooo567 · 2020-07-26T05:44:36Z

I am not very clear about this review. It is just our original design intention in Ant. Or do you mean there will be some automatic binding to report event?

It was more like a question (not really a review). I am not super familiar with events, so I am not sure if my concern is valid. Also, I generally agree with the whole structures. It is just about how we record events.

Here was my thought process.

When I see description of what events are about, it gives me impression that events should

not have messages that are easily mutable.
not be abused. Each events should have a clear intention. Devs should not use events everywhere.
have hierarchy for searchability.

And it makes me feel like events should be pre-defined with description/schema. For example, when actors are created, we only need these events. Adding more is probably unnecessary.

ACTOR_CREATION(description="Actor creation process")

ACTOR REGISTRATION INITIATED (parent=ACTOR CREATION, sev=INFO, description="events that indicate actor registration process is initiated", actor_id="123")
DEPENDECY_RESOLVED (parent=ACTOR CREATION, sev=INFO, actor_id)
SCHEDULING REQUSET TO RAYLET (parent=ACTOR CREATION, sev=INFO, actor_id)
SCHEDULING CONFIRMED (parent=ACTOR CREATION, sev=INFO, actor_id)
CREATING ACTOR TO NODE (parent=ACTOR CREATION, sev=INFO, actor_id)
ACTOR CREATION FAILED (parent=ACTOR CREATION, sev=ERROR, actor_id)
ACTOR REGISTRATION DONE (parent=ACTOR CREATION, sev=INFO, actor_id)

// Sorry the syntax is weird, but it is how I imagine to use this. 
EVENT(DEPENDECY_RESOLVED(actor_id="")) << "msg"

Here are some problems I can imagine with the current APIs.

If someone changes messages, it can screw up the whole systems. For example, someone has an alert to "Actor restart" event, and if someone changes the string to "actor restart", alarm can be broken. It is also hard to make it backward compatible.
It is very hard to know the purpose of each events. / It is hard to search events.
It is super easy to abuse events. Someone can easily add events at unnecessary places.

Basasuya · 2020-07-26T16:06:48Z

@Basasuya I meant address all comments in the previous PR! (so that Edward and Lingxuan doesn't need to have the same review). I just would like to have your confirmation, so that we can resume review after that (or reply here if you have questions). Also, thanks for your responses to my msgs!

I don't know how to release all the reviews in the previous PR = =. Do you mean just copy these links?

LingXuan's
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)

Edward's
#9612 (comment)
#9612 (comment)
#9612 (comment)

Sangbin's
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)

Basasuya · 2020-07-26T16:17:26Z

#9657 (comment)

I think this question is about whether we need to make the event more structured. I think, first of all, the event call should be buried in the ray core carefully, not just like LOG. If you need to create a high-level API like ACTOR_CREATION, DEPENDECY_RESOLVED, and so on. it should be defined by the core_worker itself, This can be achieved by designing another layer above this event framework. Second, users may abuse this API, I think some tags can be set to distinguish these events (such as label). We can also use a method like the LOG severity level.

AmplabJenkins · 2020-07-26T22:25:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28967/
Test FAILed.

rkooo567 · 2020-07-28T05:47:03Z

@Basasuya I agree we can build that abstraction in the higher layer. Also, I didn't mean to port those comments to here. I just meant "please make sure to address those comments in this PR!"

rkooo567

I will leave more comments later tomorrow! Also, please add newlines at the end of the file!

src/ray/protobuf/event.proto

src/ray/util/event.cc

src/ray/util/event.h

src/ray/util/event.cc

src/ray/util/event.h

src/ray/util/event_test.cc

AmplabJenkins · 2020-07-29T02:41:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29076/
Test FAILed.

edoakes · 2020-07-31T20:49:59Z

@Basasuya IIRC the previous PR was just to show the concept/API for this. Should I review this one assuming we're actually going to merge it?

Basasuya · 2020-08-01T05:43:52Z

@edoakes @rkooo567 Sorry for not replying to you these days. I have a deadline for paper revision in my Lab this week. Now it is finished. I will work on this PR from now on.

Basasuya · 2020-08-01T05:50:18Z

@Basasuya IIRC the previous PR was just to show the concept/API for this. Should I review this one assuming we're actually going to merge it?

Yes, we should discuss on this PR. This PR involves the basic level design of the event framework.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

rkooo567

LGTM! Just a few last comments. I'd still be waiting for @edoakes or @ashione to sign off the PR. Thanks for the contribution :)

src/ray/util/event.cc

src/ray/util/event.h

src/ray/util/event_test.cc

src/ray/util/event.cc

src/ray/util/event.h

src/ray/util/util.h

Basasuya · 2020-08-05T03:54:03Z

@raulchen @yuyiming welcome to review this PR

AmplabJenkins · 2020-08-10T03:56:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29646/
Test FAILed.

raulchen

LGTM

src/ray/util/event.h

src/ray/protobuf/event.proto

AmplabJenkins · 2020-08-10T04:09:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29647/
Test FAILed.

AmplabJenkins · 2020-08-10T04:56:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29649/
Test FAILed.

AmplabJenkins · 2020-08-10T08:01:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29652/
Test FAILed.

AmplabJenkins · 2020-08-10T11:23:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29662/
Test FAILed.

rkooo567 · 2020-08-10T16:25:35Z

There's a lint failure.

AmplabJenkins · 2020-08-10T17:25:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29675/
Test FAILed.

AmplabJenkins · 2020-08-11T02:44:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29705/
Test FAILed.

Basasuya · 2020-08-11T05:56:43Z

There's a lint failure.

fixed

rkooo567 self-assigned this Jul 23, 2020

Basasuya changed the title ~~[WIP] [EVENT] basic function and defination~~ [EVENT] basic function and defination Jul 25, 2020

Basasuya force-pushed the master_Basasuya_event branch 2 times, most recently from fa08d95 to 63e145f Compare July 25, 2020 16:41

rkooo567 changed the title ~~[EVENT] basic function and defination~~ [EVENT] Basic Function and Definition Jul 26, 2020

[EVENT] basic function and defination

db7e2b5

Basasuya force-pushed the master_Basasuya_event branch from 63e145f to db7e2b5 Compare July 26, 2020 16:18

rkooo567 reviewed Jul 28, 2020

View reviewed changes

fix some bug

0de368c

rkooo567 reviewed Jul 28, 2020

View reviewed changes

rkooo567 requested review from ashione and edoakes July 30, 2020 23:14

Basasuya and others added 2 commits August 1, 2020 14:00

Update src/ray/util/event.cc

ea1372e

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

fix some name and comment

81d1fd5

rkooo567 reviewed Aug 4, 2020

View reviewed changes

src/ray/util/event.cc Show resolved Hide resolved

src/ray/util/event.cc Show resolved Hide resolved

src/ray/util/event.h Outdated Show resolved Hide resolved

src/ray/util/event_test.cc Show resolved Hide resolved

rkooo567 approved these changes Aug 4, 2020

View reviewed changes