Skip to content

[WIP][Metric] add multiple setup-shutdown tests for metric exporters#13708

Closed
ashione wants to merge 6 commits intoray-project:masterfrom
antgroup:fix-opencensus-delta-producer
Closed

[WIP][Metric] add multiple setup-shutdown tests for metric exporters#13708
ashione wants to merge 6 commits intoray-project:masterfrom
antgroup:fix-opencensus-delta-producer

Conversation

@ashione
Copy link
Copy Markdown
Member

@ashione ashione commented Jan 26, 2021

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

which makes variable can be updated no matter cpu ordering
@ashione
Copy link
Copy Markdown
Member Author

ashione commented Jan 26, 2021

DeltaProducer is a singleton, its harvest thread has never been initialized once producer shutdowns. To slove this issue, we'd better to make harvest thread reopenable. @rkooo567 @simon-mo @zhongchun

@rkooo567
Copy link
Copy Markdown
Contributor

This sounds good. What needs to be done to make it reopneable?

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Jan 26, 2021

This sounds good. What needs to be done to make it reopneable?

Coreworker shares whole lifecycle with stats, but it does not work in driver process since driver will reconnect to ray cluster by invoking ray.init again.

In short, we have two solutions to address this issue.

  1. Let all users know that it's unsupported to record after driver invokes shutdown.
  2. Delete global delta producer in shutdown function and recreate a new producer when next DeltaProducer::Get. We might modify singleton of DeltaProducer and make it thread-safe without any performance degradation.

@zhongchun zhongchun force-pushed the fix-opencensus-delta-producer branch from 6925a44 to bac5de6 Compare February 1, 2021 02:39
@jovany-wang
Copy link
Copy Markdown
Contributor

[2021-02-01 20:32:17,872 I 58448 731242] io_service_pool.cc:36: IOServicePool is running with 1 io_service.
[2021-02-01 20:32:27,932 W 58448 731250] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: *** Aborted at 1612182748 (unix time) try "date -d @1612182748" if you are using GNU date ***
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: PC: @                0x0 (unknown)
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: *** SIGTERM (@0x7fff6afa5882) received by PID 58448 (TID 0x10d9eddc0) stack trace: ***
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435:     @     0x7fff6b05a5fd _sigtramp
[2021-02-01 20:32:28,023 E 58448 731242] logging.cc:435:     @        0x10d944117 (unknown)
[2021-02-01 20:32:28,028 E 58448 731242] logging.cc:435:     @        0x10252bff6 absl::lts_2019_08_08::synchronization_internal::Waiter::Wait()
[2021-02-01 20:32:28,028 E 58448 731242] logging.cc:435:     @        0x10252be4f AbslInternalPerThreadSemWait
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435:     @        0x10252cad2 absl::lts_2019_08_08::Mutex::Block()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435:     @        0x10252d521 absl::lts_2019_08_08::Mutex::LockSlowLoop()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435:     @        0x10252ceac absl::lts_2019_08_08::Mutex::LockSlowWithDeadline()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435:     @        0x102581748 absl::lts_2019_08_08::Mutex::LockSlow()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435:     @        0x10252cc77 absl::lts_2019_08_08::Mutex::Lock()
[2021-02-01 20:32:28,030 E 58448 731242] logging.cc:435:     @        0x1025077fe opencensus::stats::DeltaProducer::DeltaProducer()
[2021-02-01 20:32:28,030 E 58448 731242] logging.cc:435:     @        0x102506c94 opencensus::stats::DeltaProducer::Get()
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435:     @        0x101cc6bb2 main
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435:     @     0x7fff6ae61cc9 start
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435:     @                0x8 (unknown
)

@zhongchun The patch causes the process delay 10s. Can you take a look?

[2021-02-01 20:32:17,872 I 58448 731242] io_service_pool.cc:36: IOServicePool is running with 1 io_service.
[2021-02-01 20:32:27,932 W 58448 731250] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

@zhongchun
Copy link
Copy Markdown
Contributor

@jovany-wang Thanks for your reminder, I'll fix it.

@zhongchun
Copy link
Copy Markdown
Contributor

@jovany-wang @ashione @rkooo567 please have a look

@ashione
Copy link
Copy Markdown
Member Author

ashione commented Feb 3, 2021

@rkooo567
We tried a simple transformation to make init after shutdown and metric can be successfully reported. But this has caused other problems. To solve these problems, we need to modify the overall singleton mode of opencensus. This work may be relatively large and can cause other side effects.

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Feb 4, 2021

What's other problems?

@zhongchun
Copy link
Copy Markdown
Contributor

zhongchun commented Feb 4, 2021

What's other problems?

There are some other singletons, like TagKeyRegistry, MeasureRegistryImpl and StatsManager. We should reconstruct these classes or clear all registered tags, measures.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 24, 2021
@ashione ashione closed this Nov 8, 2021
@jovany-wang jovany-wang deleted the fix-opencensus-delta-producer branch November 8, 2021 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants