Add internal_api.global_gc() method, which triggers gc.collect() on all workers #7327

Merged
ericl merged 10 commits into ray-project:master from ericl:fix-gc-co on Feb 26, 2020

Contributor

@ericl ericl commented Feb 26, 2020

Why are these changes needed?

This adds a global_gc() method that triggers gc.collect() on all workers to collect cyclic object references. It can be called under object store memory pressure to trigger the release of distributed object references.
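
To illustrate why an explicit gc.collect() matters here, the following is a minimal standalone sketch (not code from this PR) of a reference cycle that CPython's reference counting alone cannot free. The `Node` class and `make_cycle` helper are hypothetical names used only for illustration, and automatic collection is disabled purely to keep the demonstration deterministic.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can hold a reference to another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Build two nodes that reference each other; return a weak reference to one."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


gc.disable()  # assumption: turn off automatic GC so the result is deterministic
try:
    ref = make_cycle()
    # The cycle keeps both reference counts above zero, so the nodes are still alive.
    alive_before = ref() is not None
    gc.collect()  # the cycle detector finds and frees the unreachable pair
    alive_after = ref() is not None
finally:
    gc.enable()

print(alive_before, alive_after)
```

On CPython this should print `True False`. If an object caught in such a cycle holds a distributed object reference, that reference is presumably released only once the cycle is collected, which is the case global_gc() targets.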

The mechanism works like this:

  • The GlobalGC() method of the core worker sends an RPC to the raylet to request global GC.
  • The raylet sets a should_global_gc flag in its heartbeat, which is broadcast to all other raylets. When a raylet sees a heartbeat with should_global_gc set, it sets a local should_local_gc flag.
  • When a raylet sees should_local_gc set, it sends an RPC to all workers at its next heartbeat.

This effectively throttles worker GC to at most once per raylet heartbeat (100 ms), no matter how often global_gc() is called across the cluster.
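
The flag-based flow described above can be sketched as a small pure-Python simulation. This is a toy model under stated assumptions, not the actual raylet implementation (which is C++): the `Raylet` class, its method names, and the tick loop are all hypothetical, and a counter stands in for the RPC that asks local workers to run gc.collect().

```python
class Raylet:
    """Toy stand-in for a raylet; not the real C++ node manager."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.should_global_gc = False
        self.should_local_gc = False
        self.local_gc_runs = 0

    def handle_global_gc_request(self):
        """A local worker asked for a cluster-wide GC."""
        self.should_global_gc = True
        # A raylet does not see its own broadcast, so it flags itself too.
        self.should_local_gc = True

    def on_heartbeat(self, should_global_gc):
        """Called when a heartbeat from another raylet arrives."""
        if should_global_gc:
            self.should_local_gc = True

    def heartbeat(self):
        """One heartbeat tick: run pending local GC once, then broadcast."""
        if self.should_local_gc:
            self.local_gc_runs += 1  # stands in for the RPC to local workers
            self.should_local_gc = False
        flag, self.should_global_gc = self.should_global_gc, False
        for peer in self.peers:
            peer.on_heartbeat(flag)


# Three raylets that broadcast heartbeats to one another.
raylets = [Raylet(f"raylet-{i}") for i in range(3)]
for r in raylets:
    r.peers = [p for p in raylets if p is not r]

# Fifty global GC requests arrive at one raylet within a single interval.
for _ in range(50):
    raylets[0].handle_global_gc_request()

# Two heartbeat ticks across the cluster.
for _ in range(2):
    for r in raylets:
        r.heartbeat()

gc_runs = [r.local_gc_runs for r in raylets]
print(gc_runs)
```

In this model the fifty requests collapse into a single local GC per raylet, so the script prints `[1, 1, 1]`. Because the flags are booleans rather than counters, repeated requests within one interval are idempotent.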

@AmplabJenkins

Can one of the admins verify this patch?


// Trigger local GC at the next heartbeat interval.
if (heartbeat_data.should_global_gc()) {
  should_local_gc_ = true;
Contributor

Could we just DoLocalGC here and not have the should_local_gc_ flag?

Contributor

Oh never mind, I guess we might receive more than one HeartbeatAdded per interval?

Contributor Author

Yeah. Actually I need to change the other site to set the flag too.

RAY_LOG(WARNING) << "Broadcasting global GC request to all raylets.";
should_global_gc_ = true;
// We won't see our own request, so trigger local GC immediately too.
DoLocalGC();
Contributor

Could move this to Heartbeat(), or just set should_local_gc_, so that we also throttle if there are a bunch of global GC requests from all local workers.

Contributor Author

Fixed.
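
The reasoning in the two review threads above can be made concrete with a small sketch comparing the two options: running local GC immediately on every trigger versus setting a flag that the next heartbeat consumes. Both classes and the `run_interval` helper are hypothetical illustrations, not the raylet code, and a "trigger" stands for either a HeartbeatAdded carrying should_global_gc or a local worker's request.

```python
class ImmediateRaylet:
    """Runs local GC as soon as each trigger arrives."""

    def __init__(self):
        self.local_gc_runs = 0

    def on_trigger(self):
        self.local_gc_runs += 1

    def heartbeat(self):
        pass  # nothing deferred, so nothing to do at the heartbeat


class FlaggedRaylet:
    """Records the trigger in a flag and runs local GC at the next heartbeat."""

    def __init__(self):
        self.should_local_gc = False
        self.local_gc_runs = 0

    def on_trigger(self):
        self.should_local_gc = True

    def heartbeat(self):
        if self.should_local_gc:
            self.local_gc_runs += 1
            self.should_local_gc = False


def run_interval(raylet, triggers):
    """Deliver several triggers within one heartbeat interval, then tick once."""
    for _ in range(triggers):
        raylet.on_trigger()
    raylet.heartbeat()
    return raylet.local_gc_runs


immediate_runs = run_interval(ImmediateRaylet(), 5)
flagged_runs = run_interval(FlaggedRaylet(), 5)
print(immediate_runs, flagged_runs)
```

With five triggers in one interval, the immediate variant runs GC five times and the flagged variant once, so the script prints `5 1`.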

repeated ObjectReferenceCount borrowed_refs = 1;
}

message LocalGCRequest {
Collaborator

Suggested change
- message LocalGCRequest {
+ message TriggerLocalGCRequest {

Contributor Author

I probably won't change this; it doesn't seem that much clearer.

Collaborator

That's fine, it just bothers me because the other RPCs start with a verb.

Collaborator

@edoakes edoakes left a comment

LGTM

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22414/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22412/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22422/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22436/

@ericl ericl merged commit b310661 into ray-project:master Feb 26, 2020