Skip to content

panic: failed to send close signal: () #709

@davepacheco

Description

@davepacheco

@leftwo reported that in some automated tests, when the test would panic, he'd see a second panic from Dropshot. I think this problem doesn't report a lot of details because Rust avoids trying to be too fancy when it panics while panicking. This sounds like the problems we resolved in #103 but I've confirmed that it remains on "main" today.

I've reproduced this by adding an example called boom.rs

// Copyright 2020 Oxide Computer Company
//! Example use of Dropshot.

use dropshot::ApiDescription;
use dropshot::ConfigDropshot;
use dropshot::ConfigLogging;
use dropshot::ConfigLoggingLevel;
use dropshot::HttpServerStarter;

#[tokio::main]
async fn main() -> Result<(), String> {
    let config_dropshot: ConfigDropshot = Default::default();
    let config_logging =
        ConfigLogging::StderrTerminal { level: ConfigLoggingLevel::Info };
    let log = config_logging
        .to_logger("example-basic")
        .map_err(|error| format!("failed to create logger: {}", error))?;
    let api = ApiDescription::new();
    let api_context = ();
    let server =
        HttpServerStarter::new(&config_dropshot, api, api_context, &log)
            .map_err(|error| format!("failed to create server: {}", error))?
            .start();

    tokio::spawn(server);
    Ok(())
}

Sometimes it runs fine:

$ RUST_BACKTRACE=1 cargo run --example=boom
   Compiling dropshot v0.9.1-dev (/home/dap/dropshot-tls-sni/dropshot)
    Finished dev [unoptimized + debuginfo] target(s) in 5.02s
     Running `target/debug/examples/boom`
Jun 23 20:40:04.534 INFO listening, local_addr: 127.0.0.1:41938
$

but sometimes it panics like this:

$ RUST_BACKTRACE=1 cargo run --example=boom
    Finished dev [unoptimized + debuginfo] target(s) in 0.16s
     Running `target/debug/examples/boom`
Jun 23 20:40:05.640 INFO listening, local_addr: 127.0.0.1:46483
thread 'tokio-runtime-worker' panicked at 'failed to send close signal: ()', dropshot/src/server.rs:736:24
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   1: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1687:5
   3: core::result::Result<T,E>::expect
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1046:23
   4: <dropshot::server::CloseHandle as core::ops::drop::Drop>::drop
             at ./dropshot/src/server.rs:736:13
   5: core::ptr::drop_in_place<dropshot::server::CloseHandle>
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ptr/mod.rs:490:1
   6: core::ptr::drop_in_place<dropshot::server::HttpServer<()>>
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ptr/mod.rs:490:1
   7: core::ptr::drop_in_place<tokio::runtime::task::core::Stage<dropshot::server::HttpServer<()>>>
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ptr/mod.rs:490:1
   8: tokio::runtime::task::core::Core<T,S>::set_stage::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/core.rs:277:41
   9: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/loom/std/unsafe_cell.rs:14:9
  10: tokio::runtime::task::core::Core<T,S>::set_stage
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/core.rs:277:9
  11: tokio::runtime::task::core::Core<T,S>::drop_future_or_output
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/core.rs:242:13
  12: tokio::runtime::task::harness::cancel_task::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:447:9
  13: core::ops::function::FnOnce::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
  14: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panic/unwind_safe.rs:271:9
  15: std::panicking::try::do_call
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:485:40
  16: __rust_try
  17: std::panicking::try
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:449:19
  18: std::panic::catch_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panic.rs:140:14
  19: tokio::runtime::task::harness::cancel_task
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:446:15
  20: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:212:25
  21: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:152:15
  22: tokio::runtime::task::raw::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/raw.rs:255:5
  23: tokio::runtime::task::raw::RawTask::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/raw.rs:200:18
  24: tokio::runtime::task::LocalNotified<S>::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/mod.rs:394:9
  25: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:464:13
  26: tokio::runtime::coop::with_budget
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/coop.rs:107:5
  27: tokio::runtime::coop::budget
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/coop.rs:73:5
  28: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:463:9
  29: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:426:24
  30: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:406:17
  31: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/macros/scoped_tls.rs:61:9
  32: tokio::runtime::scheduler::multi_thread::worker::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:403:5
  33: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/scheduler/multi_thread/worker.rs:365:45
  34: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/blocking/task.rs:42:21
  35: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/core.rs:223:17
  36: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/loom/std/unsafe_cell.rs:14:9
  37: tokio::runtime::task::core::Core<T,S>::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/core.rs:212:13
  38: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:476:19
  39: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panic/unwind_safe.rs:271:9
  40: std::panicking::try::do_call
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:485:40
  41: __rust_try
  42: std::panicking::try
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:449:19
  43: std::panic::catch_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panic.rs:140:14
  44: tokio::runtime::task::harness::poll_future
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:464:18
  45: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:198:27
  46: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/harness.rs:152:15
  47: tokio::runtime::task::raw::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/raw.rs:255:5
  48: tokio::runtime::task::raw::RawTask::poll
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/raw.rs:200:18
  49: tokio::runtime::task::UnownedTask<S>::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/task/mod.rs:431:9
  50: tokio::runtime::blocking::pool::Task::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/blocking/pool.rs:159:9
  51: tokio::runtime::blocking::pool::Inner::run
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/blocking/pool.rs:513:17
  52: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.2/src/runtime/blocking/pool.rs:471:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

We're crashing here:

c.send(()).expect("failed to send close signal")

The code is dying trying to send a message on a channel because it believes the other side of the channel should always be around while this code is around. Intuitively, it thinks that because the other side of the channel is in its own tokio task that we own and never cancel. When a Dropshot server goes out of scope, we hit this code first (the code that's panicking). How can the other task have gone away?

More precisely, the rx side of that channel is passed to InnerHttpServerStarter::start() and then moved into a Future (an async block). That Future is passed to hyper::server::Server::with_graceful_shutdown, which consumes the server and returns a Future (we'll call this one graceful) that you can wait on to wait for the server to shut down. I gather at that point, the underlying hyper server is basically owned by this "graceful" Future. Then we spawn a new tokio task to run graceful. We return the JoinHandle from InnerHttpServerStarter::start() and then store that in the dropshot::HttpServer. It's used if you call close() or if the server gets dropped.

The panic suggests that the graceful Future that we spawned in a tokio task has been dropped. I don't know how that could happen unless the tokio runtime itself were being dropped...which I think is what's happening. It's hard to tell from the stack trace above. But it's clearer in the stack trace Alan reports from his core file:

mdb  /net/catacomb/data/staff/alan//core.crucible*
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::status
debugging core file of crucible_integr (64-bit) from ip-10-150-1-252.us-west-2.compute.internal
initial argv: /work/oxidecomputer/crucible/target/debug/deps/crucible_integration_tests-95750
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=4791 uid=12345 code=-1
> $C !demangle
fffffc7fecdfc520 libc.so.1`_lwp_kill+0xa()
fffffc7fecdfc550 libc.so.1`raise+0x1e(6)
fffffc7fecdfc5a0 libc.so.1`abort+0x58()
fffffc7fecdfc5b0 std::sys::unix::abort_internal::h8bec27660b2c85fa+9()
fffffc7fecdfc660 std::panicking::rust_panic_with_hook::hb09aca5ad1ad2161+0x317()
fffffc7fecdfc6b0 std::panicking::begin_panic_handler::{{closure}}::h189c984e260f383e+0xc5()
fffffc7fecdfc6e0 std::sys_common::backtrace::__rust_end_short_backtrace::hcab55adb02d49332+0x20()
fffffc7fecdfc720 rust_begin_unwind+0x71()
fffffc7fecdfc760 core::panicking::panic_fmt::hc653fbe903263e77+0x33()
fffffc7fecdfc7e0 core::result::unwrap_failed::hf3fb563cfc573601+0x75()
fffffc7fecdfc840 core::result::Result<T,E>::expect::heccdd714828625f3+0x54()
fffffc7fecdfc880 <dropshot::server::CloseHandle as core::ops::drop::Drop>::drop::he0e0ea946e6ec2ff+0x68()
fffffc7fecdfc8b0 core::ptr::drop_in_place<dropshot::server::CloseHandle>::hf1866f047a401703+0x15()
fffffc7fecdfc8e0 core::ptr::drop_in_place<dropshot::server::HttpServer<alloc::sync::Arc<crucible_downstairs::repair::FileServerContext>>>::h96c5dd77d7bddfb0+0x74()
fffffc7fecdfcaa0 core::ptr::drop_in_place<crucible_downstairs::repair::repair_main::{{closure}}::{{closure}}>::hf294431f71dca310+0x53()
fffffc7fecdfcac0 core::ptr::drop_in_place<core::future::from_generator::GenFuture<crucible_downstairs::repair::repair_main::{{closure}}::{{closure}}>>::h5dade1d89f3ea9f9+0x11()
fffffc7fecdfcaf0 core::ptr::drop_in_place<tokio::runtime::task::core::Stage<core::future::from_generator::GenFuture<crucible_downstairs::repair::repair_main::{{closure}}::{{closure}}>>>::he902185279d644df+0x4e()
fffffc7fecdfcc00 tokio::runtime::task::core::Core<T,S>::set_stage::{{closure}}::h1bcfccfd25f32ce9+0x39()
fffffc7fecdfcd20 tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::h7244e1f4d7375d7b+0x71()
fffffc7fecdfce70 tokio::runtime::task::core::Core<T,S>::set_stage::h980de41291f48e63+0xbf()
fffffc7fecdfcf60 tokio::runtime::task::core::Core<T,S>::drop_future_or_output::hdd463fe4cd570ca3+0x26()
fffffc7fecdfcf80 tokio::runtime::task::harness::cancel_task::{{closure}}::h812e471622711b68+0x14()
fffffc7fecdfcfb0 core::ops::function::FnOnce::call_once::hcc154a5753d70e32+0x15()
fffffc7fecdfcfd0 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h1bd2a7e6e6fafcb6+0x11()
fffffc7fecdfd050 std::panicking::try::do_call::h4c56d65d4735cf1d+0x48()
fffffc7fecdfd080 __rust_try+0x1d()
fffffc7fecdfd100 std::panicking::try::h02264b806da252c3+0x3b()
fffffc7fecdfd130 std::panic::catch_unwind::h4daeae4b491bd79a+0x11()
fffffc7fecdfd1f0 tokio::runtime::task::harness::cancel_task::hd2ae80ee78af6cae+0x37()
fffffc7fecdfd220 tokio::runtime::task::harness::Harness<T,S>::shutdown::h9ae1aa00b4be18be+0x55()
fffffc7fecdfd250 tokio::runtime::task::raw::shutdown::h2d889584989dd116+0x22()
fffffc7fecdfd280 tokio::runtime::task::raw::RawTask::shutdown::h3b2c62a07be9dfee+0x2f()
fffffc7fecdfd2b0 tokio::runtime::task::Task<S>::shutdown::ha677372269c30d3c+0x22()
fffffc7fecdfd350 tokio::runtime::task::list::OwnedTasks<S>::close_and_shutdown_all::h582ba12c5d902875+0x171()
fffffc7fecdfd4f0 tokio::runtime::scheduler::current_thread::CurrentThread::shutdown::{{closure}}::h3a047a000dca6e7a+0x6b()
fffffc7fecdfd520 tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}::h5da3766b5d0b2b27+0x2b()
fffffc7fecdfd5c0 tokio::macros::scoped_tls::ScopedKey<T>::set::hde2f889ea8de65c9+0x8d()
fffffc7fecdfd6a0 tokio::runtime::scheduler::current_thread::CoreGuard::enter::h577fbb9b31b437f7+0x15a()
fffffc7fecdfd7c0 tokio::runtime::scheduler::current_thread::CurrentThread::shutdown::hd72dd49754691b39+0xef()
fffffc7fecdfd820 <tokio::runtime::runtime::Runtime as core::ops::drop::Drop>::drop::h4d0652b5bd54ba0a+0x6f()
fffffc7fecdfd850 core::ptr::drop_in_place<tokio::runtime::runtime::Runtime>::h2f300571f8489940+0x15()
fffffc7fecdff890 crucible_integration_tests::test::test_pantry_bulk_read_max_chunk_size::hddf01a252cb10ed2+0x144()
fffffc7fecdff8c0 crucible_integration_tests::test::test_pantry_bulk_read_max_chunk_size::{{closure}}::h833e3fcfbf769c7c+0x19()
fffffc7fecdff900 core::ops::function::FnOnce::call_once::h30c98bee8b54db39+0x18()
fffffc7fecdff930 test::__rust_begin_short_backtrace::h63286c65bb34492a+0x12()
fffffc7fecdff950 core::ops::function::FnOnce::call_once{{vtable.shim}}::h18739cfb67a65b08+0x11()
fffffc7fecdffdd0 test::run_test::run_test_inner::{{closure}}::h390104348f8e0af2+0x71f()
fffffc7fecdfff00 std::sys_common::backtrace::__rust_begin_short_backtrace::h38964d8d454d0588+0xe4()
fffffc7fecdfff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::hedfece91c9c16870+0x7c()
fffffc7fecdfffb0 std::sys::unix::thread::Thread::new::thread_start::hc3b10b204fa14a5c+0x27()
fffffc7fecdfffe0 libc.so.1`_thrp_setup+0x6c(fffffc7fee7d2240)
fffffc7fecdffff0 libc.so.1`_lwp_start()

Specifically, we see it at:

fffffc7fecdfd820 <tokio::runtime::runtime::Runtime as core::ops::drop::Drop>::drop::h4d0652b5bd54ba0a+0x6f()
fffffc7fecdfd850 core::ptr::drop_in_place<tokio::runtime::runtime::Runtime>::h2f300571f8489940+0x15()

Why don't we see this all the time? The key is that in our example, the Dropshot server itself was moved to a different tokio task when we spawned a new task to wait on it. In most of the tests I've seen (mostly in Omicron), the Dropshot server is owned by the test function, so if the test panics, I expect the Dropshot server is guaranteed to be dropped before the runtime gets dropped and then we don't hit this problem. By moving the server to another tokio task, it gets cleaned up only when the runtime cancels all of its tasks, and the runtime can clean those up in any order. At that point it's a race against the other task. If the hyper one gets cleaned up first, you hit this. If not, you don't.

In summary, I think it's wrong that sending the close signal assumes that the hyper server is still alive given that the hyper server could be dropped while the dropshot server is still running. (I think the only case this can happen is if the runtime is being dropped, admittedly.) Instead of panicking, ideally we'd log a warning message. If we can't do that for whatever reason, that's probably okay because I don't think there's actually a reason to think anything's wrong here. The only reason to log is that it isn't the shutdown sequence we'd normally expect and that might be a useful data point for someone debugging something strange.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions