Skip to content

[EVENT] Basic Function and Definition#9657

Merged
raulchen merged 14 commits intoray-project:masterfrom
antgroup:master_Basasuya_event
Aug 11, 2020
Merged

[EVENT] Basic Function and Definition#9657
raulchen merged 14 commits intoray-project:masterfrom
antgroup:master_Basasuya_event

Conversation

@Basasuya
Copy link
Copy Markdown
Contributor

@Basasuya Basasuya commented Jul 23, 2020

This PR is moved from the https://github.com/ray-project/ray/pull/9612/commits
The previous PR is closed

Why are these changes needed?

Ray needs an event framework to report critical information, this information includes application-level data like job-condition, node adds/remove, alert, error, and so on. event messages can be sent to different destinations such as the dashboard server, the file. Users can see this information in real-time, or analyze the historical information after the program ends.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins
Copy link
Copy Markdown

Can one of the admins verify this patch?

@rkooo567 rkooo567 self-assigned this Jul 23, 2020
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28841/
Test FAILed.

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Jul 25, 2020

Some questions.

  1. Is it still WIP? It seems like there are bunch of test failures now, so I assume it is the case?
  2. Can you at least address all comments from the previous PR? We can re-review after that.

@Basasuya
Copy link
Copy Markdown
Contributor Author

Some questions.

  1. Is it still WIP? It seems like there are bunch of test failures now, so I assume it is the case?
  2. Can you at least address all comments from the previous PR? We can re-review after that.
  1. it is not WIP, it finished. Maybe there are some Lint errors. I will fix it and rebase the master.
  2. Here are the comments in the previous PRs and my solutions

#9612 (comment)
It's OK to use the JSON format to write, I prepare to move this function to the subclass(next PR). And there is some question about how to deal with multiple lines.

#9612 (comment)
I have changed the event protobuf to support custom labels

#9612 (comment)
I am not very clear about this review. It is just our original design intention in Ant. Or do you mean there will be some automatic binding to report event?

@Basasuya Basasuya changed the title [WIP] [EVENT] basic function and defination [EVENT] basic function and defination Jul 25, 2020
@Basasuya Basasuya force-pushed the master_Basasuya_event branch 2 times, most recently from fa08d95 to 63e145f Compare July 25, 2020 16:41
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28933/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28934/
Test FAILed.

@rkooo567 rkooo567 changed the title [EVENT] basic function and defination [EVENT] Basic Function and Definition Jul 26, 2020
@rkooo567
Copy link
Copy Markdown
Contributor

@Basasuya I meant address all comments in the previous PR! (so that Edward and Lingxuan doesn't need to have the same review). I just would like to have your confirmation, so that we can resume review after that (or reply here if you have questions). Also, thanks for your responses to my msgs!

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Jul 26, 2020

I am not very clear about this review. It is just our original design intention in Ant. Or do you mean there will be some automatic binding to report event?

It was more like a question (not really a review). I am not super familiar with events, so I am not sure if my concern is valid. Also, I generally agree with the whole structures. It is just about how we record events.

Here was my thought process.

When I see description of what events are about, it gives me impression that events should

  1. not have messages that are easily mutable.
  2. not be abused. Each events should have a clear intention. Devs should not use events everywhere.
  3. have hierarchy for searchability.

And it makes me feel like events should be pre-defined with description/schema. For example, when actors are created, we only need these events. Adding more is probably unnecessary.

ACTOR_CREATION(description="Actor creation process")

ACTOR REGISTRATION INITIATED (parent=ACTOR CREATION, sev=INFO, description="events that indicate actor registration process is initiated", actor_id="123")
DEPENDECY_RESOLVED (parent=ACTOR CREATION, sev=INFO, actor_id)
SCHEDULING REQUSET TO RAYLET (parent=ACTOR CREATION, sev=INFO, actor_id)
SCHEDULING CONFIRMED (parent=ACTOR CREATION, sev=INFO, actor_id)
CREATING ACTOR TO NODE (parent=ACTOR CREATION, sev=INFO, actor_id)
ACTOR CREATION FAILED (parent=ACTOR CREATION, sev=ERROR, actor_id)
ACTOR REGISTRATION DONE (parent=ACTOR CREATION, sev=INFO, actor_id)

// Sorry the syntax is weird, but it is how I imagine to use this. 
EVENT(DEPENDECY_RESOLVED(actor_id="")) << "msg"

Here are some problems I can imagine with the current APIs.

  1. If someone changes messages, it can screw up the whole systems. For example, someone has an alert to "Actor restart" event, and if someone changes the string to "actor restart", alarm can be broken. It is also hard to make it backward compatible.
  2. It is very hard to know the purpose of each events. / It is hard to search events.
  3. It is super easy to abuse events. Someone can easily add events at unnecessary places.

@Basasuya
Copy link
Copy Markdown
Contributor Author

@Basasuya I meant address all comments in the previous PR! (so that Edward and Lingxuan doesn't need to have the same review). I just would like to have your confirmation, so that we can resume review after that (or reply here if you have questions). Also, thanks for your responses to my msgs!

I don't know how to release all the reviews in the previous PR = =. Do you mean just copy these links?

LingXuan's
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)

Edward's
#9612 (comment)
#9612 (comment)
#9612 (comment)

Sangbin's
#9612 (comment)
#9612 (comment)
#9612 (comment)
#9612 (comment)

@Basasuya
Copy link
Copy Markdown
Contributor Author

#9657 (comment)

I think this question is about whether we need to make the event more structured. I think, first of all, the event call should be buried in the ray core carefully, not just like LOG. If you need to create a high-level API like ACTOR_CREATION, DEPENDECY_RESOLVED, and so on. it should be defined by the core_worker itself, This can be achieved by designing another layer above this event framework. Second, users may abuse this API, I think some tags can be set to distinguish these events (such as label). We can also use a method like the LOG severity level.

@Basasuya Basasuya force-pushed the master_Basasuya_event branch from 63e145f to db7e2b5 Compare July 26, 2020 16:18
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28967/
Test FAILed.

@rkooo567
Copy link
Copy Markdown
Contributor

@Basasuya I agree we can build that abstraction in the higher layer. Also, I didn't mean to port those comments to here. I just meant "please make sure to address those comments in this PR!"

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave more comments later tomorrow! Also, please add newlines at the end of the file!

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29076/
Test FAILed.

@rkooo567 rkooo567 requested review from ashione and edoakes July 30, 2020 23:14
@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jul 31, 2020

@Basasuya IIRC the previous PR was just to show the concept/API for this. Should I review this one assuming we're actually going to merge it?

@Basasuya
Copy link
Copy Markdown
Contributor Author

Basasuya commented Aug 1, 2020

@edoakes @rkooo567 Sorry for not replying to you these days. I have a deadline for paper revision in my Lab this week. Now it is finished. I will work on this PR from now on.

@Basasuya
Copy link
Copy Markdown
Contributor Author

Basasuya commented Aug 1, 2020

@Basasuya IIRC the previous PR was just to show the concept/API for this. Should I review this one assuming we're actually going to merge it?

Yes, we should discuss on this PR. This PR involves the basic level design of the event framework.

Basasuya and others added 2 commits August 1, 2020 14:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few last comments. I'd still be waiting for @edoakes or @ashione to sign off the PR. Thanks for the contribution :)

@Basasuya
Copy link
Copy Markdown
Contributor Author

Basasuya commented Aug 5, 2020

@raulchen @yuyiming welcome to review this PR

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29646/
Test FAILed.

Copy link
Copy Markdown
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29647/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29649/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29652/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29662/
Test FAILed.

@rkooo567
Copy link
Copy Markdown
Contributor

There's a lint failure.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 10, 2020
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29675/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29705/
Test FAILed.

@Basasuya
Copy link
Copy Markdown
Contributor Author

There's a lint failure.

fixed

@raulchen raulchen merged commit 0400a88 into ray-project:master Aug 11, 2020
@raulchen raulchen deleted the master_Basasuya_event branch August 11, 2020 09:36
@Basasuya Basasuya mentioned this pull request Aug 29, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants