feat(common): introduce async stack trace#4383
Conversation
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Codecov Report
@@ Coverage Diff @@
## main #4383 +/- ##
==========================================
- Coverage 74.34% 74.33% -0.01%
==========================================
Files 847 853 +6
Lines 125199 126753 +1554
==========================================
+ Hits 93084 94228 +1144
- Misses 32115 32525 +410
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
| // periodically. | ||
| // | ||
| // Comment out the following line to enable. | ||
| #[cfg(any())] |
There was a problem hiding this comment.
Better to make it a feature lol
There was a problem hiding this comment.
Agree. I've left a TODO here for now. We may introduce a more complete and unified configuration for all kinds of tracings.
|
|
||
| let mut all_traces = BTreeMap::new(); | ||
|
|
||
| // FIXME: the compute node may not be accessible from the our network with their listen |
There was a problem hiding this comment.
We have client address and listen address separately for compute node. The address you get here should be a client address, and should be accessible any machine within the cluster.
There was a problem hiding this comment.
Here we assume the streaming service of compute nodes is also accessible with risectl. However, it's possible that the cluster is running in an internal network and only the meta service is exposed, so we may fail to connect to the compute nodes. Left a more detailed explanation here.🥵
| Ok((block, len)) | ||
| } | ||
| }) | ||
| .stack_trace("block_cache_lookup") |
There was a problem hiding this comment.
I think this is not a "stack_trace". It's like a tracing span instead. I can do the renaming later when I introduce minitrace to the system. At that time, we might have something like .rw_span which creates both minitrace span and async stack trace info,
This reverts commit b8927ad.
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
| #[derive(Debug, Clone)] | ||
| pub enum SpanValue { | ||
| Slice(&'static str), | ||
| Shared(Arc<String>), | ||
| } |
There was a problem hiding this comment.
Seems equivalent to Cow?
| #[derive(Debug, Clone)] | |
| pub enum SpanValue { | |
| Slice(&'static str), | |
| Shared(Arc<String>), | |
| } | |
| pub type SpanValue = Cow<'static, str>; |
There was a problem hiding this comment.
I wanna make this struct cheap to clone, so Arc<String> is used. For example, we need to trace the Future actor.next() multiple times with the same span Actor 233, and they'd better share a span instance.
BTW, I've just realized that constructing an Arc<str> from &'static str or String both leads to allocation. 🥵
There was a problem hiding this comment.
Sorry... I used to think that cloning Cow::Owned only borrows a reference, actually it clones the data! 😰
There was a problem hiding this comment.
So is it Cow ?
Cow only accepts a single borrowed type, the owned type must be <B as ToOwned>::Owned. 🥵
wangrunji0408
left a comment
There was a problem hiding this comment.
LGTM!
Next step we can add some unit tests to make sure the reported traces are equal to our expectations. I'm also looking forward to seeing it on crates.io! 🥰
* move context to debug module Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add trace context Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add trace context for streaming actor Signed-off-by: Bugen Zhao <i@bugenzhao.com> * handle cancel Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add span for storage Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add todos Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use rc refcell Signed-off-by: Bugen Zhao <i@bugenzhao.com> * simplify interfaces Signed-off-by: Bugen Zhao <i@bugenzhao.com> * print in stream manager Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix cancel stream next Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add more traces Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use set and add time Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add stream test Signed-off-by: Bugen Zhao <i@bugenzhao.com> * record actor Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add rpc to risectl Signed-off-by: Bugen Zhao <i@bugenzhao.com> * assertions and refactor Signed-off-by: Bugen Zhao <i@bugenzhao.com> * support stack trace for release Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add more traces to hummock Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add trace report type Signed-off-by: Bugen Zhao <i@bugenzhao.com> * yield if ns = 0 Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add traces for outputs and source Signed-off-by: Bugen Zhao <i@bugenzhao.com> * print !!! on long span Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use current thread runtime for ctl Signed-off-by: Bugen Zhao <i@bugenzhao.com> * do not check when drop Signed-off-by: Bugen Zhao <i@bugenzhao.com> * move to separated crate Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add more docs Signed-off-by: Bugen Zhao <i@bugenzhao.com> * refine docs Signed-off-by: Bugen Zhao <i@bugenzhao.com> * bump tokio Signed-off-by: Bugen Zhao <i@bugenzhao.com> * minor refine Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix licenses Signed-off-by: Bugen Zhao <i@bugenzhao.com> * refactor with arena Signed-off-by: Bugen Zhao <i@bugenzhao.com> * print detached Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use arc for span Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use coarsetime Signed-off-by: Bugen Zhao <i@bugenzhao.com> * rename to async_stack_trace Signed-off-by: Bugen Zhao <i@bugenzhao.com> * split to multiple modules Signed-off-by: Bugen Zhao <i@bugenzhao.com> * allow printing detached Signed-off-by: Bugen Zhao <i@bugenzhao.com> * turn off by default, add release ci step Signed-off-by: Bugen Zhao <i@bugenzhao.com> * add hints to ctl Signed-off-by: Bugen Zhao <i@bugenzhao.com> * do not collect test analytics on e2e Signed-off-by: Bugen Zhao <i@bugenzhao.com> * use Arc<String> Signed-off-by: Bugen Zhao <i@bugenzhao.com> * try with context Signed-off-by: Bugen Zhao <i@bugenzhao.com> * try enable pr ci Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fmt Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Revert "try enable pr ci" This reverts commit b8927ad. * minor fixes Signed-off-by: Bugen Zhao <i@bugenzhao.com> * do not use offset arc Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR introduces a brand new tracing of async stack trace, which allows one to check where all tasks are pending and why it's pending. Note that multiple futures can be polled in a single task with
selectandjoin, so the pending status snapshot is actually a tree instead of a stack.Samples
To make it intuitive, here're some samples.
try_join_allon them, they're in the same level in the tree.2777905903042560156ms ago, while there are no new chunks received for 8s.The epoch number in this actor
2667is also smaller than the actor 1222669, showing that the concurrent checkpoint is stuck. (A real case from here 😄)Usages
To get the traces printed, one may launch the cluster with env-var
RW_ASYNC_STACK_TRACEset and userisedev ctl stream trace [--actor-id 233]to find a recent captured report. You may also print it periodically in the logs by commenting out this line.https://github.com/singularity-data/risingwave/blob/bc44aa526fced1ddfab523f6d133b05c4ff48a7c/src/compute/src/server.rs#L199-L205
The tags must be attached manually (unless we hack the future compiling in
rustc). Luckily, it's easy enough by adding a chained call of.stack_trace(..).&'static strandStringare both acceptable here, while the former one may perform better.https://github.com/singularity-data/risingwave/blob/bc44aa526fced1ddfab523f6d133b05c4ff48a7c/src/storage/src/monitor/monitored_store.rs#L232-L236
Implementations
Writing a double-linked tree in Rust is hard and ugly, not to mention that we must take care of future pending, ready, aborting, and sending to other tasks. I finally decided to implement this with the arena tree, which can be both cleaner and more efficient. For the tracing part, if you're interested, check
src/utils/stack_trace.Limitations
Since we need to manipulate the tree every time we polled a future, this seems significantly hurt the performance with dev profile. That's why we disable it by default. May further investigate it later.
Checklist
./risedev check(or alias,./risedev c)Refer to a related PR or issue link (optional)