Optimize `strcmp` calls for VM stackwalker by fandreuz · Pull Request #1662 · async-profiler/async-profiler

fandreuz · 2026-01-20T10:29:12Z

Description

I noticed StackWalker::walkVM calls strcmp several times. We could save the last char* which matched the strcmp as an optimization.

The following table shows the value of stackwalk_ns_avg on master for several benchmarks in Renaissance, and how much it is reduced by this PR:

Bench	`master`	`save-ptr`	% reduction
akka-uct	19645	17634	10.24
chi-square	19530	18501	5.27
future-genetic	22795	22750	0.20
mnemonics	38996	34033	12.73
page-rank	20354	19890	2.28
scala-kmeans	22907	19887	13.18

Related issues

n/a

Motivation and context

It's likely that the char* refers to a string literal, and thus won't change throughout the whole lifetime of the program.

Thread 2 "java" hit Breakpoint 1.2, CodeBlob::CodeBlob (this=this@entry=0x7fee1b49a408, name=name@entry=0x7fee336461ab "Interpreter", kind=kind@entry=CodeBlobKind::Buffer, size=size@entry=1109016, 
    header_size=header_size@entry=80) at /jdk/src/hotspot/share/code/codeBlob.cpp:169
169     CodeBlob::CodeBlob(const char* name, CodeBlobKind kind, int size, uint16_t header_size) :
(gdb) bt
#0  CodeBlob::CodeBlob (this=this@entry=0x7fee1b49a408, name=name@entry=0x7fee336461ab "Interpreter", kind=kind@entry=CodeBlobKind::Buffer, size=size@entry=1109016, header_size=header_size@entry=80)
    at /jdk/src/hotspot/share/code/codeBlob.cpp:169
#1  0x00007fee320c6a42 in RuntimeBlob::RuntimeBlob (header_size=80, size=1109016, kind=CodeBlobKind::Buffer, name=0x7fee336461ab "Interpreter", this=0x7fee1b49a408)
    at /jdk/src/hotspot/share/code/codeBlob.hpp:332
#2  BufferBlob::BufferBlob (header_size=80, size=1109016, kind=CodeBlobKind::Buffer, name=0x7fee336461ab "Interpreter", this=0x7fee1b49a408) at /jdk/src/hotspot/share/code/codeBlob.cpp:392
#3  BufferBlob::create (name=name@entry=0x7fee336461ab "Interpreter", buffer_size=buffer_size@entry=1108928) at /jdk/src/hotspot/share/code/codeBlob.cpp:406
#4  0x00007fee331fab0c in StubQueue::StubQueue (this=this@entry=0x7fee2c1bd830, stub_interface=0x7fee2c1bd710, buffer_size=buffer_size@entry=1108928, lock=lock@entry=0x0, 
    name=name@entry=0x7fee336461ab "Interpreter") at /jdk/src/hotspot/share/code/stubs.cpp:70
#5  0x00007fee33269bc6 in TemplateInterpreter::initialize_stub () at /jdk/src/hotspot/share/interpreter/templateInterpreter.cpp:55

[...]

  _code = new StubQueue(new InterpreterCodeletInterface, code_size + max_aligned_bytes, nullptr,
                        "Interpreter");

#0  CodeBlob::CodeBlob (this=this@entry=0x7f97a3cf3008, name=name@entry=0x7f97bd77b3d2 "native nmethod", kind=kind@entry=CodeBlobKind::Nmethod, cb=cb@entry=0x7f97bb512ad0, size=1392, 
    header_size=header_size@entry=240, frame_complete_offset=47, frame_size=10, oop_maps=0x7f97b402d108, caller_must_gc_arguments=false, mutable_data_size=48) at /jdk/src/hotspot/share/code/codeBlob.cpp:125
#1  0x00007f97bcd6da4a in nmethod::nmethod (this=0x7f97a3cf3008, method=0xe235d88, type=<optimized out>, nmethod_size=<optimized out>, compile_id=<optimized out>, offsets=0x7f97bb511bf0, 
    code_buffer=<optimized out>, frame_size=<optimized out>, basic_lock_owner_sp_offset=<optimized out>, basic_lock_sp_offset=<optimized out>, oop_maps=<optimized out>, mutable_data_size=<optimized out>)
    at /jdk/src/hotspot/share/code/nmethod.cpp:1280
#2  0x00007f97bcd6e40f in nmethod::new_native_nmethod (method=..., compile_id=compile_id@entry=5, code_buffer=0x7f97bb512ad0, vep_offset=<optimized out>, frame_complete=frame_complete@entry=47, 
    frame_size=<optimized out>, basic_lock_owner_sp_offset=<optimized out>, basic_lock_sp_offset=<optimized out>, oop_maps=<optimized out>, exception_handler=<optimized out>)
    at /jdk/src/hotspot/share/code/nmethod.cpp:1102

[...]

// For native wrappers
nmethod::nmethod(
  Method* method,
  CompilerType type,
  int nmethod_size,
  int compile_id,
  CodeOffsets* offsets,
  CodeBuffer* code_buffer,
  int frame_size,
  ByteSize basic_lock_owner_sp_offset,
  ByteSize basic_lock_sp_offset,
  OopMapSet* oop_maps,
  int mutable_data_size)
  : CodeBlob("native nmethod", CodeBlobKind::Nmethod, code_buffer, nmethod_size, sizeof(nmethod),
             offsets->value(CodeOffsets::Frame_Complete), frame_size, oop_maps, false, mutable_data_size),

VtableBlob* blob = VtableBlob::create("vtable chunks", bytes);

#define DEFINE_BLOB_INIT_METHOD(blob_name)                              \
  void StubRoutines::initialize_ ## blob_name ## _stubs() {             \
    if (STUBGEN_BLOB_FIELD_NAME(blob_name) == nullptr) {                \
      BlobId blob_id = BlobId:: JOIN3(stubgen, blob_name, id);          \
      int size = _ ## blob_name ## _code_size;                          \
      int max_aligned_size = 10;                                        \
      const char* timer_msg = "StubRoutines generation " # blob_name " stubs"; \
->    const char* name = "StubRoutines (" # blob_name " stubs)";        \
      const char* assert_msg = "_" # blob_name "_code_size";            \
      STUBGEN_BLOB_FIELD_NAME(blob_name) =                              \
        initialize_stubs(blob_id, size, max_aligned_size, timer_msg,    \
                         name, assert_msg);                             \
    }                                                                   \
  }

How has this been tested?

There might be some noise as I ran all the benchmarks at the same time in Docker Compose. Setup here.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

fandreuz · 2026-01-20T10:31:32Z

+        if (ptr2 == n) return n != nullptr;
+
+        if (strcmp(n, "nmethod") == 0) {
+            ptr1 = n;


Initially I thought about using something like

__atomic_store_n(&ptr, n, __ATOMIC_RELAXED);

but I realized the assembly is the same for both clang and gcc, on x86 and ARM.

apangin

TBH, 10% improvement looks unrealistic to me. I believe stack walking performance should be dominated by NMethod lookup, ScopeDesc parsing and jmethodID retrieval. It is worth validating measurement methodology.

In any case, I think this could be implemented better (if necessary). In particular, proposed PR only caches positive lookups, but once the canonical name is obtained it can be used to return both positives and negatives.

apangin · 2026-01-20T17:06:11Z

+        if (ptr == n) return true;
+        if (strncmp(n, "StubRoutines", 12) == 0) {
+            // There might be multiple pointers in use for this name,
+            // we just keep the last one seen as an optimization


This is an anti-optimization.
If two or more threads have frames from different StubRoutines sections, they will start overwriting this variable over and over again, causing lots of CPU cache invalidations. Furthermore, this may slow down access to unrelated neighbor variables due to false sharing. FYI, there was a notorious JDK issue of the same kind - this bug is a great source of wisdom :)

I see, my thinking here was that since we don't impose any strong synchronization mechanism the scenario you describe wouldn't be so bad, but that would require some measurements of course. It's probably best to leave this out.

apangin · 2026-03-18T16:42:25Z

I propose an alternative implementation of this optimization: #1701
@fandreuz wdyt?

fandreuz · 2026-03-18T17:34:19Z

I propose an alternative implementation of this optimization: #1701 @fandreuz wdyt?

Looks good!

apangin · 2026-03-18T18:09:06Z

Thank you for the review and for finding this this optimization opportunity.

stuff

ef82b1d

fandreuz commented Jan 20, 2026

View reviewed changes

Baraa-Hasheesh reviewed Jan 20, 2026

View reviewed changes

Comment thread src/vmStructs.h Outdated

Comment thread src/vmStructs.h Outdated

Comment thread src/vmStructs.h Outdated

Comment thread src/vmStructs.h Outdated

fix

a33572b

apangin reviewed Jan 20, 2026

View reviewed changes

use for ne

4d11106

apangin mentioned this pull request Mar 18, 2026

Optimize NMethod type check #1701

Merged

apangin closed this Mar 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `strcmp` calls for VM stackwalker#1662

Optimize `strcmp` calls for VM stackwalker#1662
fandreuz wants to merge 3 commits into
async-profiler:masterfrom
fandreuz:save-ptr

fandreuz commented Jan 20, 2026 •

edited

Loading

Uh oh!

fandreuz Jan 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

apangin left a comment

Uh oh!

apangin Jan 20, 2026

Uh oh!

fandreuz Jan 20, 2026

Uh oh!

fandreuz Jan 21, 2026

Uh oh!

apangin commented Mar 18, 2026

Uh oh!

fandreuz commented Mar 18, 2026

Uh oh!

apangin commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fandreuz commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Motivation and context

How has this been tested?

Uh oh!

fandreuz Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

apangin left a comment

Choose a reason for hiding this comment

Uh oh!

apangin Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

fandreuz Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

fandreuz Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

apangin commented Mar 18, 2026

Uh oh!

fandreuz commented Mar 18, 2026

Uh oh!

apangin commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fandreuz commented Jan 20, 2026 •

edited

Loading

fandreuz Jan 20, 2026 •

edited

Loading