Skip to content

[Metric] Fix crashed when register metric view in multithread#13485

Merged
jovany-wang merged 3 commits intoray-project:masterfrom
antgroup:fix-metric-register-view
Jan 25, 2021
Merged

[Metric] Fix crashed when register metric view in multithread#13485
jovany-wang merged 3 commits intoray-project:masterfrom
antgroup:fix-metric-register-view

Conversation

@ashione
Copy link
Copy Markdown
Member

@ashione ashione commented Jan 15, 2021

We fond there will be race while registering metric in multithread and process crash finally.
According to opencensus source code, it's known issue but not fixed yet.

void StatsManager::ViewInformation::AddConsumer() {
  mu_->AssertHeld();
  ++num_consumers_;
}

int StatsManager::ViewInformation::RemoveConsumer() {
  mu_->AssertHeld();
  return --num_consumers_;
}
  // TODO: PERF: Global synchronization is only needed for adding or
  // removing measures--we can reduce recording contention by claiming a reader
  // lock on mu_ and a writer lock on a measure-specific mutex.
  mutable absl::Mutex mu_;

So we'd better add lock for registering metric when first recording.

Why are these changes needed?

Related issue number

#13484

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Jan 15, 2021

@zhongchun

Copy link
Copy Markdown
Contributor

@zhongchun zhongchun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment.

std::unique_ptr<opencensus::stats::Measure<double>> measure_;

// For making sure thread-safe to all of metric registrations.
static absl::Mutex mutex_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename it to registration_mutex_?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename it to registration_mutex_?

Nice catch.

Copy link
Copy Markdown
Contributor

@jovany-wang jovany-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@jovany-wang
Copy link
Copy Markdown
Contributor

I guess this can be merged if CI passes.

@jovany-wang
Copy link
Copy Markdown
Contributor

jovany-wang commented Jan 18, 2021

Hi @ashione @zhongchun , should we hold this PR until we fix it totally? If so, I will mark the state to RequestChange.

Copy link
Copy Markdown
Contributor

@jovany-wang jovany-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashione Please ping me if this get fixed totally~

Copy link
Copy Markdown
Contributor

@jovany-wang jovany-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no failure any more, this can be merged now I think.

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Jan 25, 2021

Since there is no failure any more, this can be merged now I think.

No related failures to this PR.

@jovany-wang jovany-wang merged commit f9f2bfa into ray-project:master Jan 25, 2021
@jovany-wang jovany-wang deleted the fix-metric-register-view branch January 25, 2021 12:32
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…oject#13485)

* Fix crashed when register metric view in multithread

* fix comments

* fix
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants