[WIP][Metric] add multiple setup-shutdown tests for metric exporters#13708
ashione wants to merge 6 commits into ray-project:master
which lets the variable be updated regardless of CPU ordering
|
DeltaProducer is a singleton, and its harvest thread is never initialized again once the producer shuts down. To solve this issue, we'd better make the harvest thread reopenable. @rkooo567 @simon-mo @zhongchun |
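To make the "reopenable harvest thread" idea concrete, here is a minimal sketch. `ReopenableHarvester` and all of its members are hypothetical names chosen for illustration; this is not the opencensus `DeltaProducer` API, only one way a singleton could stop and later restart its background thread.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a singleton that owns a background harvest thread.
// It is NOT the opencensus DeltaProducer; it only illustrates restartability.
class ReopenableHarvester {
 public:
  static ReopenableHarvester &Get() {
    static ReopenableHarvester instance;
    return instance;
  }

  // Starts the harvest thread unless it is already running.
  void Start(std::chrono::milliseconds interval) {
    std::lock_guard<std::mutex> lifecycle(lifecycle_mu_);
    if (running_) return;
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = false;  // Reset the flag so a previous Shutdown() does not stick.
    }
    thread_ = std::thread([this, interval] { Run(interval); });
    running_ = true;
  }

  // Stops and joins the harvest thread. Start() may be called again afterwards.
  void Shutdown() {
    std::lock_guard<std::mutex> lifecycle(lifecycle_mu_);
    if (!running_) return;
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    thread_.join();
    running_ = false;
  }

  bool IsRunning() {
    std::lock_guard<std::mutex> lifecycle(lifecycle_mu_);
    return running_;
  }

  int harvest_count() const { return harvest_count_.load(); }

  ~ReopenableHarvester() { Shutdown(); }

 private:
  ReopenableHarvester() = default;

  // Thread body: wakes up every `interval` until stop_ is set.
  void Run(std::chrono::milliseconds interval) {
    std::unique_lock<std::mutex> lock(mu_);
    while (!stop_) {
      // wait_for returns true when stop_ became true before the timeout.
      if (cv_.wait_for(lock, interval, [this] { return stop_; })) break;
      ++harvest_count_;  // Placeholder for the real harvest work.
    }
  }

  std::mutex lifecycle_mu_;  // Serializes Start() and Shutdown().
  std::mutex mu_;            // Guards stop_ and pairs with cv_.
  std::condition_variable cv_;
  bool stop_ = false;
  bool running_ = false;
  std::thread thread_;
  std::atomic<int> harvest_count_{0};
};
```

In this sketch a second `Start()` after `Shutdown()` spawns a fresh thread because the stop flag is reset under the lock. The separate lifecycle mutex keeps a concurrent `Start()` from clearing the flag before the old thread has been joined.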
|
This sounds good. What needs to be done to make it reopenable? |
CoreWorker shares its whole lifecycle with stats, but this does not work in the driver process, since the driver reconnects to the Ray cluster by invoking ray.init again. In short, we have two solutions to address this issue.
|
6925a44 to
bac5de6
[2021-02-01 20:32:17,872 I 58448 731242] io_service_pool.cc:36: IOServicePool is running with 1 io_service.
[2021-02-01 20:32:27,932 W 58448 731250] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: *** Aborted at 1612182748 (unix time) try "date -d @1612182748" if you are using GNU date ***
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: PC: @ 0x0 (unknown)
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: *** SIGTERM (@0x7fff6afa5882) received by PID 58448 (TID 0x10d9eddc0) stack trace: ***
[2021-02-01 20:32:28,022 E 58448 731242] logging.cc:435: @ 0x7fff6b05a5fd _sigtramp
[2021-02-01 20:32:28,023 E 58448 731242] logging.cc:435: @ 0x10d944117 (unknown)
[2021-02-01 20:32:28,028 E 58448 731242] logging.cc:435: @ 0x10252bff6 absl::lts_2019_08_08::synchronization_internal::Waiter::Wait()
[2021-02-01 20:32:28,028 E 58448 731242] logging.cc:435: @ 0x10252be4f AbslInternalPerThreadSemWait
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435: @ 0x10252cad2 absl::lts_2019_08_08::Mutex::Block()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435: @ 0x10252d521 absl::lts_2019_08_08::Mutex::LockSlowLoop()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435: @ 0x10252ceac absl::lts_2019_08_08::Mutex::LockSlowWithDeadline()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435: @ 0x102581748 absl::lts_2019_08_08::Mutex::LockSlow()
[2021-02-01 20:32:28,029 E 58448 731242] logging.cc:435: @ 0x10252cc77 absl::lts_2019_08_08::Mutex::Lock()
[2021-02-01 20:32:28,030 E 58448 731242] logging.cc:435: @ 0x1025077fe opencensus::stats::DeltaProducer::DeltaProducer()
[2021-02-01 20:32:28,030 E 58448 731242] logging.cc:435: @ 0x102506c94 opencensus::stats::DeltaProducer::Get()
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435: @ 0x101cc6bb2 main
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435: @ 0x7fff6ae61cc9 start
[2021-02-01 20:32:28,031 E 58448 731242] logging.cc:435: @ 0x8 (unknown)
@zhongchun This patch causes the process to be delayed by 10s. Can you take a look? |
|
@jovany-wang Thanks for your reminder, I'll fix it. |
|
@jovany-wang @ashione @rkooo567 please have a look |
|
@rkooo567 |
|
What are the other problems? |
There are some other singletons, like TagKeyRegistry, MeasureRegistryImpl and StatsManager. We should either reconstruct these classes or clear all registered tags and measures. |
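As a rough illustration of the "clear all registered tags and measures" option, the sketch below shows a singleton registry with an explicit `Clear()` method. `NameRegistry` and its methods are hypothetical; they do not mirror the actual TagKeyRegistry, MeasureRegistryImpl or StatsManager interfaces.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical registry singleton, used only to illustrate the "clear" option.
class NameRegistry {
 public:
  static NameRegistry &Get() {
    static NameRegistry instance;
    return instance;
  }

  // Returns the id for `name`, registering it on first use.
  std::size_t Register(const std::string &name) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = ids_.find(name);
    if (it != ids_.end()) return it->second;
    const std::size_t id = ids_.size();
    ids_.emplace(name, id);
    return id;
  }

  std::size_t Size() {
    std::lock_guard<std::mutex> lock(mu_);
    return ids_.size();
  }

  // Drops every registration so the next setup starts from a clean state.
  void Clear() {
    std::lock_guard<std::mutex> lock(mu_);
    ids_.clear();
  }

 private:
  NameRegistry() = default;

  std::mutex mu_;
  std::unordered_map<std::string, std::size_t> ids_;
};
```

One caveat with this approach: ids handed out before `Clear()` become stale, so any caller that caches them would need to re-register after a shutdown. Reconstructing the classes sidesteps that, at the cost of a larger change.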
Why are these changes needed?
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.