Skip to content

[WIP]Placement group dev#8914

Closed
Alisahhh wants to merge 34 commits intoray-project:masterfrom
antgroup:PlacementGroup-dev
Closed

[WIP]Placement group dev#8914
Alisahhh wants to merge 34 commits intoray-project:masterfrom
antgroup:PlacementGroup-dev

Conversation

@Alisahhh
Copy link
Copy Markdown
Contributor

Why are these changes needed?

To alloc resources in node before creating actors and store the resources map in gcs server.

Now I almost finish the cpp part.

TODO:

  • Finish cpp part and add python API and JAVA api.
  • Add unit test.

NOTES:

  • This part is WIP and I will submit this change(cpp part) in about three small pr.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27079/
Test FAILed.

@rkooo567 rkooo567 self-requested a review June 12, 2020 15:51
@ericl ericl self-assigned this Jun 12, 2020
@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 12, 2020

@Alisahhh when it's ready can you add a python example script to the PR description so others can easily know how to try it out?

@Alisahhh
Copy link
Copy Markdown
Contributor Author

Alisahhh commented Jun 15, 2020 via email

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 15, 2020 via email

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27141/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27151/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27216/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27272/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27329/
Test FAILed.

Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Alisahhh one big question I have from skimming the header files is: why not represent the bundle id as std::pair<PlacementGroupId, int>? That seems simpler and ensures you can always find the placement group given the bundle identifier.

RAY_LOG(DEBUG) << "bundle lease request " << bundle_spec.BundleId();
auto resource_ids = ScheduleBundle(cluster_resource_map_, bundle_spec);
if (resource_ids.AvailableResources().size() == 0) {
send_reply_callback(Status::OutOfMemory("reserve resource failed"), nullptr, nullptr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be returned as a boolean field in the proto message indicating failure, not a system-level OOM exception.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change in the #9039

message Bundle {
bytes bundle_id = 1;
map<string, double> unit_resources = 2;
int64 unit_count = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need unit count in the first iteration? Consider leaving it out and implementing it later if needed.

// State of a placement group.
enum PlacementGroupState {
// Placement Group is pending or scheduling
PENDING = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it means it goes back to PENDING if reconstructing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, if a placement group creating failed, it will be added into the tail of the queue and the status will keep PENDING

// Service for placement group info access.
service PlacementGroupInfoGcsService{
// Create placement group via gcs service
rpc CreatePlacementGroup(CreatePlacementGroupRequest) returns (CreatePlacementGroupReply);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we check the status of a placement group? Or is that for a later PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add the check in next pr. It looks like an actor creation in the core worker callback.

@Alisahhh
Copy link
Copy Markdown
Contributor Author

Hey @Alisahhh one big question I have from skimming the header files is: why not represent the bundle id as std::pair<PlacementGroupId, int>? That seems simpler and ensures you can always find the placement group given the bundle identifier.

We have discussed this question last week and given multi plan. Now the bundle id will be given when the bundle creates which is earlier than the placement group creates.
BTW, I think a placement group is a virtual concept, so the bundle should not bound to the placement group?

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 22, 2020 via email

@Alisahhh
Copy link
Copy Markdown
Contributor Author

What if someone wants to place into any bundle in the group or some range of bundles? That could be represented as a placement group id with no index restriction. I think passing around raw bundle IDs is problematic for that reason, and also because it's lower level than placement group.

On Sun, Jun 21, 2020, 7:55 PM Alisa @.***> wrote: Hey @Alisahhh https://github.com/Alisahhh one big question I have from skimming the header files is: why not represent the bundle id as std::pair<PlacementGroupId, int>? That seems simpler and ensures you can always find the placement group given the bundle identifier. We have discussed this question last week and given multi plan. Now the bundle id will be given when the bundle creates which is earlier than the placement group creates. BTW, I think a placement group is a virtual concept, so the bundle should not bound to the placement group? — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#8914 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADUSQQIDMJQATS2OJF6BTRX3B2LANCNFSM4N4KU6XA .

Sure,

What if someone wants to place into any bundle in the group or some range of bundles? That could be represented as a placement group id with no index restriction. I think passing around raw bundle IDs is problematic for that reason, and also because it's lower level than placement group.

On Sun, Jun 21, 2020, 7:55 PM Alisa @.***> wrote: Hey @Alisahhh https://github.com/Alisahhh one big question I have from skimming the header files is: why not represent the bundle id as std::pair<PlacementGroupId, int>? That seems simpler and ensures you can always find the placement group given the bundle identifier. We have discussed this question last week and given multi plan. Now the bundle id will be given when the bundle creates which is earlier than the placement group creates. BTW, I think a placement group is a virtual concept, so the bundle should not bound to the placement group? — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#8914 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADUSQQIDMJQATS2OJF6BTRX3B2LANCNFSM4N4KU6XA .

I want to check if you want I assign the BundleID like PlacementID+index, which is like the actor id has the information about task id and worker id.
BTW, If I want to include the placement group id information in bundle id, I need to copy the whole array in the core worker, I don't know if it a good implementation?

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 22, 2020

I want to check if you want I assign the BundleID like PlacementID+index, which is like the actor id has the information about task id and worker id.
BTW, If I want to include the placement group id information in bundle id, I need to copy the whole array in the core worker, I don't know if it a good implementation?

I don't think we need a "BundleID" at all. The way I think of it there are two cases here:

  1. The representation of a bundle identifier (INTERNAL, this data structure should never appear in any public APIs), which can be
message BundleIdentifier {
   PlacementGroupId group_id;
   int32 bundle_index; 
}
  1. The representation of a placement specification targeting bundle(s), which is exposed to the user. This can be something like
message PlacementGroupRequirements {
    PlacementGroupId group_id;
    int32 target_bundle [default = -1];   # -1 means any bundle is ok to place in
}

@Alisahhh let me know if you see any problems.

@raulchen
Copy link
Copy Markdown
Contributor

@ericl are you mainly concerned about exposing BundleId to the user API?
I agree that it's not a good idea to expose BundleId. Ideally, we shouldn't expose PlacementGroupId either (due to the same reason why we change ObjectId to ObjectRef).

What about just using the placement group and bundle object in the API?

placement_group = ray.create_placement_group([{"CPU": 1}, {"CPU": 2}])  # Type of `placement_group` is `PlacementGroup`. 
bundle0 = placement_group.get_bundle(0)  # Type of `bundle` is `Bundle`.
# You can pass either a `PlacementGroup` or a `Bundle` object when creating an actor.
actor1 = MyActor.options(placement=placement_group).remote()
actor2 = MyActor.options(placement=bundle0).remote()

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 23, 2020 via email

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jun 23, 2020

In the future, when we have referencing counting for PlacementGroup handles, having bundles be implemented as a placement group + index also simplifies the situation. The entire placement group is referenced if any bundle is.

@ericl ericl removed their assignment Jun 29, 2020
@Alisahhh Alisahhh closed this Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants