Skip to content

USDT tools for higher level languages#774

Merged
4ast merged 29 commits intoiovisor:masterfrom
goldshtn:usdt-tools
Dec 19, 2016
Merged

USDT tools for higher level languages#774
4ast merged 29 commits intoiovisor:masterfrom
goldshtn:usdt-tools

Conversation

@goldshtn
Copy link
Copy Markdown
Contributor

This is a work-in-progress: please do not merge yet. I'm putting it out there for review and feedback at this time. Some of the tools are not done yet, and none have example and man files at this time.

Inspired by tools from the DTrace toolkit, here are a couple of tools for high-level languages (currently Java, Python, and Ruby) using their USDT probes when available.

The tools I have planned are:

  • ucalls (in this PR) -- summarize method call numbers and total latency
  • ugc (in this PR) -- monitor and trace GC events
  • uflow (planned) -- print method call flow
  • ustat (planned) -- print top-like statistics on method calls, class loads, GC events, and so on

On top of these, profile should already work with languages that can emit /tmp/perf-PID.map files by resolving stack traces, so there's probably no need for a custom uprofile.

I think it should be possible to avoid forking each tool for each language, but rather keep a single set of tools with a language parameter. Examples (some of this should already work at this patchset):

# top 10 slowest Python methods, times in milliseconds
ucalls python 148 -mLT 10

# Java GC events
ugc java 8004

# print flow graph for java.io methods
uflow java 8004 java.io.*

# print stats on method calls, object creation, GC events, etc. for Ruby process 1103
ustat ruby 1103

Any thoughts or feedback on tool naming, options, and so on would be appreciated. I am mostly trying to mimic the per-language tools from the DTrace toolkit, while recognizing that some things are not currently/yet possible (e.g. syscall tracing in addition to method tracing).

/cc @brendangregg

@brendangregg
Copy link
Copy Markdown
Member

Some general comments before I get into the tools...

Not a lot was done with USDT in runtimes. Here's the three problems:

  • GC tracing: via USDT works great, but most runtimes already have a GC log, and those logs can have a ton more detail. There may be times when we want a bcc tool to do it, but just remember that it won't be the primary tool many people use.
  • Function tracing: The overhead of int 3 trapping made these fairly prohibitive (@4ast might fix it one day :). In the DTraceToolkit, I did some awesome looking tools with them, and they were useful for debugging, but high overhead for production. I should have warned about overhead more, and we can here in bcc, so people know they are a function-equivalent of running strace!
  • Sun was on the way out: There's more we can do with USDT probes in runtimes, like looking into memory allocation in more detail, but around the time we should have been doing those kinds of things, people were switching to Linux. So I think there's untapped opportunity here, which we'll discover now that USDT can be a thing on Linux, and we can go back to adding more runtime USDT probes.

Java had some interesting ones: like class allocations, thread creations. I think those interest me the most. Sure, I'll take GC scripts and function tracing scripts like existed in the DTraceToolkit, but the latter are more debug-only until we lower overheads.

Another interesting use is mixing USDT probes with other user and kernel events, which is something other analyzers can't do. You can imagine doing this for MySQL USDT probes: casting kernel events in the context of a query. Need to think about opportunities in runtimes too.

As for naming: I'd go with javagc, javacalls, javaclassstat, nodegc, nodecalls, pythoncalls, etc. So we can list java* for Java developers. Findability. Now if the tools are so close, then symlinks or light 2-line shell wrappers to a common multi-tool?

You know what I'd do: I'd start with separate files for each, and consider refactoring later. Think about this: if a Java developer wants to enhance separate Java tools, they can go to town. But if they are all multi-tools, then we'll be telling them: please test your changes with Node.js, Python, etc, etc. I can imagine how that'd go!

@goldshtn
Copy link
Copy Markdown
Contributor Author

Thanks for the feedback! I realize the method tracing tools are going to be for debug use, mostly. Re GC stats, Java sure has a slew of logs, but for other runtimes it can be quite useful. For example, a quick search reveals that Node has a --trace-gc and no one understands its output. In Ruby, you need to add a gem that collects GC statistics. And so on.

Re kernel events in the context of a query: this can be done today with trace. The tool is harder to use, sure, but it's general-purpose. I don't know if we can anticipate exactly which kernel events people would want along with the USDT traces. Or maybe we will be able to, in the future, when BPF + USDT get more traction. What I really would like is syscall support, and as you recall (#748) we don't have that yet. If there's anything we can do to move that forward, I'd love to help.

@brendangregg
Copy link
Copy Markdown
Member

Yes, good points about Node.js and Ruby.

There's a lot of languages out there. I've been thinking about it, and maybe it does make sense to begin with the multi-tools (and multiple symlinks/shell scripts to them). Because: A) accommodating Java developers with individual scripts is a nice idea, but what's the chance none contribute and we end up doing the work anyway?, and B) bcc tools are long. With DTrace, it was no big deal to have 4 tools that had 95% common code, when the tools were 20 lines each. With bcc, we have 100-300 line tools, so duplicating that is 10x more work to maintain.

@goldshtn
Copy link
Copy Markdown
Contributor Author

@brendangregg So I've added a couple of tools and changed some of the tools I pushed originally. What we have right now (not 100% done, and no documentation yet, so not mergeable):

  • ugc for tracing GC events (Java, Node, Python, Ruby)
  • ucalls for tracing method calls and/or syscall stats (Java, Python, Ruby, syscalls-only)
  • uthreads for tracing thread start/stop events (Java, pthreads)
  • uobjnew for summarizing object allocations by type (Java, Ruby, malloc)

The last remaining tool I haven't touched yet is uflow, which is going to be super-crazy-high overhead but probably still useful as a light debugger replacement for tracing method flow.

I was also thinking of building systrace -- an strace equivalent that uses BPF. Don't know if that'd be interesting or useful.

@brendangregg
Copy link
Copy Markdown
Member

May as well do a quick systrace. I did dtruss once, just to show DTrace could do syscalls. Its only feature over Solaris truss was being able to match multiple process names with -n, instead of one PID only. Next thing dtruss is the only syscall tracer for OS X.

I say quick, as I suspect strace has a lot of translations in that would be redundant to copy and hard to maintain. So I wouldn't invest too much time in the experiment.

What would be also useful (in addition) would be a syscount, like my https://github.com/brendangregg/perf-tools/blob/master/examples/syscount_example.txt . Yes, that's really a more featured version of strace -c. I'd be inclined not to add it to systrace, since that would be mixing a per-event output tool with a summary tool (I know strace has a precedent for that, but I'm not sure it's a good precedent). I like the argdist / trace split.

Your choice of USDT tools makes sense. I still don't know whether these should all become symlinks (java_, node_) or just keep them as u*, or both.

I took a quick look at the code -- maybe because they are new you just haven't done this yet, but I'd have put big comments above each language block. Like:

    probes.append(Probe("gc__start", "gc__done",
                        begin_save, end_save, formatter))
#
# Ruby
#
elif args.language == "ruby":
    # Ruby GC probes do not have any additional information available.
    probes.append(Probe("gc__mark__begin", "gc__mark__end",
                        "", "", lambda _: "GC mark stage"))
    probes.append(Probe("gc__sweep__begin", "gc__sweep__end",
                        "", "", lambda _: "GC sweep stage"))
#
# Node.js
#
elif args.language == "node":
    end_save = """

@goldshtn
Copy link
Copy Markdown
Contributor Author

syscount is pretty much there with ucalls -S, except I built the tool to require a pid (this can be changed specifically for -S); or with funccount -r '^[Ss]y[Ss]_*'. I guess it won't hurt to have a separate tool with some better grouping/filtering/sorting options like your syscount from perf-tools.

Re systrace, I didn't mean to provide the same level of detail (with the fd parsing, string extraction and so on). Just a basic trace of each syscall with 6 arguments. More advanced parsing would be possible with trace.

@goldshtn goldshtn force-pushed the usdt-tools branch 2 times, most recently from c681cdc to 1207f28 Compare October 27, 2016 14:21
@goldshtn
Copy link
Copy Markdown
Contributor Author

Just forced-push after rebasing. Added a new tool, ustat (which I forgot about) that prints high-level stats on GCs, method calls, object allocations, and so on -- in all high-level processes it can put its hands on.

Example output:

# ustat -C 5 3
Tracing... Output every 5 secs. Hit Ctrl-C to end                               

 06:44:43 loadavg: 0.07 0.23 0.15 5/463 19945                                    

 PID    CMDLINE          METHOD/s   GC/s   OBJNEW/s   CLOAD/s  EXCP/s   THREAD/s 
 16109  /home/vagrant/no 0          5      0          0        0        0        
 17595  irb              0          0      0          0        0        0
 17815  ./python         17995      0      0          0        0        0       
 19928  java -XX:+Extend 0          0      0          0        0        0       

 06:44:49 loadavg: 0.15 0.24 0.15 1/464 19947                                    

 PID    CMDLINE          METHOD/s   GC/s   OBJNEW/s   CLOAD/s  EXCP/s   THREAD/s 
 16109  /home/vagrant/no 0          0      0          0        0        0        
 17595  irb              130        1      109        0        0        0
 17815  ./python         0          0      0          0        0        0       
 19928  java -XX:+Extend 0          0      0          0        0        0       

 06:44:54 loadavg: 0.28 0.27 0.16 3/465 19948                                    

 PID    CMDLINE          METHOD/s   GC/s   OBJNEW/s   CLOAD/s  EXCP/s   THREAD/s 
 16109  /home/vagrant/no 0          0      0          0        0        0        
 17595  irb              0          0      0          0        0        0
 17815  ./python         0          0      0          0        0        0       
 19928  java -XX:+Extend 762207     0      0          0        0        0       

Also, added uflow for flow tracing.

@goldshtn
Copy link
Copy Markdown
Contributor Author

goldshtn commented Nov 7, 2016

Added man pages and example files. This is ready for review and merging. /cc @brendangregg

@goldshtn goldshtn changed the title WIP [do not merge]: USDT tools for higher level languages USDT tools for higher level languages Nov 7, 2016
@dark
Copy link
Copy Markdown
Member

dark commented Nov 8, 2016

@goldshtn I tried ustat because it looks very interesting. Thanks!
I have noticed one thing I'd like to report, though: it looks like ustat leaks file descriptors for bpf-map, and eventually errors out with error: could not open bpf map: Too many open files.
By following along with lsof I noticed a leak of around 20-30 bpf-map FDs per second.

Maybe it's a problem of my setups? My two machines have oldish kernels (4.4.26 and 4.7.4) and no packaged configured with --enable-dtrace, so there is no activity to report actually. I will try with dtrace-enabled software sometime in the future.

@goldshtn
Copy link
Copy Markdown
Contributor Author

goldshtn commented Nov 9, 2016

@dark Thanks, it's leaking indeed on my 4.9-rc as well. I'll track these fds down. I'm explicitly cleaning up the BPF object, but something must be sticking around. I believe this is the only tool we have that reloads the BPF program with new maps instead of just initializing everything upfront, which is probably why we have only now run into it.

@goldshtn
Copy link
Copy Markdown
Contributor Author

goldshtn commented Nov 9, 2016

@dark Should be fixed by #808. When it gets merged, I'll rebase.

@goldshtn
Copy link
Copy Markdown
Contributor Author

goldshtn commented Nov 9, 2016

Rebased. We should get this merged at some point :)

@brendangregg
Copy link
Copy Markdown
Member

I'm back... I like the tools. They seem to work. :)

uobjnew.py is missing a PID filter, at least for the "c" match (the following python program is idle):

# ./uobjnew.py c `pgrep -n python` -v
#include <uapi/linux/ptrace.h>


#include <linux/ptrace.h>

struct key_t {
#if 1
    u64 size;
#else
    char name[50];
#endif
};

struct val_t {
    u64 total_size;
    u64 num_allocs;
};

BPF_HASH(allocs, struct key_t, struct val_t);

int alloc_entry(struct pt_regs *ctx, size_t size) {
    struct key_t key = {};
    struct val_t *valp, zero = {};
    key.size = size;
    valp = allocs.lookup_or_init(&key, &zero);
    valp->total_size += size;
    valp->num_allocs += 1;
    return 0;
}

Tracing allocations in process 26230 (language: c)... Ctrl-C to quit.
^C
TYPE                           # ALLOCS      # BYTES
block size 0                         32            0
block size 50                         8          400
block size 6                         76          456
block size 57                         8          456
[...]
block size 224                    15411      3452064
block size 552                     7162      3953424
block size 32816                    242      7941472
block size 4097                    4992     20452224

uobjnew examples have a "-l", but that should be omitted (will need to update the tool and example file) (update: read ahead as I have a different suggestion about -l):

examples:
    ./uobjnew -l java 145         # summarize Java allocations in process 145
    ./uobjnew -l c 2020 1         # grab malloc() sizes and print every second
    ./uobjnew -l ruby 6712 -C 10  # top 10 Ruby types by number of allocations
    ./uobjnew -l ruby 6712 -S 10  # top 10 Ruby types by total size

In the future, it would be nice if these tools could pick a default language if one wasn't specified. Eg, if the process name is python or java, then go with that. There will be times you'll want to override it: examining C calls for a Java program, for example. But most of the time I'm using these tools like this:

# ./ucalls.py -l python `pgrep -n python`

ustat.py, ucalls.py, and uthreads.py want the -l. Should that be dropped for consistency? ... It might be a bit tidier, if these tools could pick the java/python/etc default, to then have the -l as the optional override. In which case, all tools try to pick a default, and all tools have an optional -l.

uflow.py:

  -C CLAZZ, --class CLAZZ

BTW, uflow.py works well, at least for something simple:

# ./uflow.py python `pgrep -n python`
Tracing method calls in python process 26230... Ctrl-C to quit.
CPU PID    TID    TIME(us) METHOD
6   26230  26230  1.846    -> <stdin>.<module>                     
6   26230  26230  1.846      -> <stdin>.func_a                     
6   26230  26230  2.847        -> <stdin>.func_b                   
6   26230  26230  3.848          -> <stdin>.func_c                 
6   26230  26230  4.849          <- <stdin>.func_c                 
6   26230  26230  4.849        <- <stdin>.func_b                   
6   26230  26230  4.849      <- <stdin>.func_a                     
6   26230  26230  4.849    <- <stdin>.<module> 

Those PID/TID columns should be handy for filtering multi-threaded apps.

@4ast
Copy link
Copy Markdown
Member

4ast commented Nov 29, 2016

@goldshtn are you planning one more update before it can be merged?

@goldshtn
Copy link
Copy Markdown
Contributor Author

Yes - will fix the immediate issues and rebase, and leave non-pressing stuff for a separate PR. Hope to wrap this up this week.

`ucalls` is a new tool that uses the USDT probes in high-
level languages (Java, Python, Ruby) to summarize the
number of method calls and optionally their latency. This
is similar to `funccount` and `funclatency` but for methods
in Java, Python, Ruby, and other languages that might have
USDT probes for method entry and exit.

`ucalls` uses the `method__entry` and `method__return`
probes in Java and Ruby, and the `function__entry` and
`function__return` probes in Python. The Python probes are
not as accurate because they do not report the class name,
but are still useful.

The optional `-L` switch indicates whether latency numbers
are required. By default, only the number of calls to each
method is printed. The `-T` switch filters out the top
methods (by number of calls or total latency).
`ugc` monitors GC start and end events in Java, Python,
and Ruby, printing out a message when a GC cycle completes
along with the duration of the GC and an optional
additional description that depends on the target language.

NOTE: This WIP does not have any interesting additional
descriptions yet.

Usage example:

```
ugc java 1448	# monitor Java GCs in process 1448
```
For runtimes that make it available, we now extract GC
event information such as heap sizes and print this
information in addition to the GC duration and timestamp.
Skeletons for method flow, object allocations, general
event statistics, and thread statistics.
Add Node support for gc__start and gc__done probes, which
includes GC type flag parsing.
Add `-S` switch for tracing syscalls with or without high-
level language method calls. When `-S` is provided, attach
to all `sys_` and `SyS_` entry points in the kernel and
trace the number of calls and execution latency (if `-L`
was also specified).

This is now also a useful general-purpose tool for syscall
count/latency summarizing when a language is not specified.
`uthreads` attaches to thread start and stop USDT probes
and traces out thread creation and destruction events.
Currently, only Java is supported.
Trace `pthread_start` calls and get the symbol for the
newly-started thread for printing. Also allow no-Java runs
with no language specified for pthread-only tracing.
Use various USDT probes in Java and Ruby, and `malloc()`
uprobe in C, to summarize object allocations by type or
by size. Print statistics at the end or at intervals, with
optional trimming of top types by number of instances
allocated or number of bytes.
Allow callers to obtain the list of active (enabled) USDT
probes from the USDT object, if they need to enable the
uprobes by hand without using the bcc attach support. This
is required when the same uprobe needs to be enabled in
more than one process.
This tool gathers statistics (by pid) on method calls,
class loads, garbage collections, thread creations, object
allocations, and exceptions raised and prints them in a
top-like style. Supported languages currently include
Java, Python, Ruby, and Node. Not all languages provide
the necessary probes for all event types.

Usage examples:

```
ustat -C
ustat -l java 1
ustat -l ruby -S objnew
```
This tool traces method executions and prints out messages
with a nesting depth indicator that allows relatively easy
visualization of program flow. To reduce overhead, `-C` and
`-M` switches take class and method prefixes to filter the
output by. The filtering takes place in the BPF program, so
the unnecessary data never ends up in user space.

Usage examples:

```
uflow java 185
uflow -M index python 180
uflow -C 'IRB::Irb' ruby 10041
```
The /proc/PID/cmdline file has \0 for spaces in the command
line, and there may be trailing \0 characters as well.
Replace them all with spaces.
If the process in question doesn't have the USDT probes we
expect, handle the failure gracefully. After all, it could
be a binary compiled without USDT support, the pid could be
recycled to some other application, or -- the process name
could match, but by accident, and really belong to some
other runtime.
When C allocations are traced, attach the `malloc`
uprobe only to the requested process -- otherwise, we
get allocation information from the entire system.
@goldshtn
Copy link
Copy Markdown
Contributor Author

@4ast @brendangregg

Rebased and fixed the issues found above -- I think this is ready for merging. I will open a separate issue for automatically detecting the language, I don't know if we always want it. In any case I think it belongs in a separate PR.

Re the -l inconsistency in some tools: the reason the -l is required in ustat, ucalls, and uthreads is that it is an optional argument. If it had been positional and optional, it couldn't have been first in the list, and I think that would be kind of ugly, e.g.: ustat 4482 1 java reads a bit weird.

Let's get this merged (I also have a blog post I need to write about these :)) and then iterate?

@ceeaspb
Copy link
Copy Markdown
Contributor

ceeaspb commented Dec 19, 2016

great stuff
@goldshtn likely a separate ticket . I tested with a jvm where usdts are not built in (eg. Oracle). An error message is provided about not specifying a pid. I did specify a pid so this could be confusing. In this set of scripts it would be useful to indicate to the user that the usdts (uprobes) were not found in the process likely because the binary was not built with it enabled (as it's mentioned in the man pages etc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants